TMCnet - World's Largest Communications and Technology Community





 

January 1998


Minimizing VoIP Transmission Delays To Optimize Performance

BY JEFF HILL

VoIP (voice over Internet protocol) technology has made great strides. Pundits are hailing the arrival of this revolutionary new technology, all the while predicting the demise of traditional voice communications. VoIP allows users to bypass the existing phone companies, thus offering huge savings off the cost of traditional phone calls. However, issues remain that stand in the way of this technology becoming the end-all solution that some purport it to be — chief among these issues is quality of service (QoS), in particular, how the delays inherent in Internet transmissions adversely impact that quality.

With the ability to place a call from Butte, Montana, to London (or anywhere else), talk for hours, and pay no more than the cost of a local phone call for the privilege, most people would agree that this sounds like a pretty good deal, especially when such a phone call from Montana to London could cost hundreds of dollars if made via more traditional means. All that is needed is a multimedia PC, an inexpensive Internet phone software package, and a recipient in London with their PC turned on. Tens of millions of people own PCs, and millions more are buying them every year. So why, then, does anyone use the telephone anymore? The answer is simple: sound quality. The world is accustomed to a certain degree of clarity and naturalness when conversing via the telephone. Unfortunately, systems designed to enable two-way voice communications over data networks (e.g., VoIP systems) have yet to approach that quality plateau, at least in the PC-to-PC domain.

CIRCUITS, PACKETS, AND BIRTHDAY GREETINGS
To clarify the VoIP process, let us assume an example of calling your mother on her birthday. When you pick up the telephone to place such a call, you expect a reasonable delay between the time you finish dialing the number and the time the phone on the other end begins to ring. During this time, the telephone company is finding an electronic roadway between your phone and your mother’s. Once that path is established, it remains open for the length of the call and allows you to experience almost no perceptible delay. When you say “Happy Birthday, Mom,” an electronic version of your speech travels over that dedicated path to your mother’s ear. When she says “thank you,” her speech travels back uninterrupted over that same highway. This circuit-switched architecture has been the foundation of the world’s phone system since the telephone was first invented, and it has served us well.

The sound quality is very good, but you pay for that dedicated line in long-distance phone charges. If you used test instruments to measure the delay of a land-based coast-to-coast phone call in the United States, you would probably measure approximately 40 ms of delay, of which about 30 ms is attributable to the speed of light: the time the electronic signal needs to travel the 3,000 miles across North America.
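The propagation share of that figure can be sanity-checked with a few lines of Python. This is only a rough sketch: it assumes a straight 3,000-mile route and a signal speed of about two-thirds the speed of light in cable, and neither assumption comes from the article. Real routes are longer than the straight-line distance, which pushes the result toward the roughly 30 ms cited above.

```python
# Rough propagation-delay check for a 3,000-mile path.
# Assumptions (not from the article): straight-line route, and a signal
# speed of about two-thirds of c, a common rule of thumb for cable and fiber.
SPEED_OF_LIGHT_KM_S = 299_792.458
KM_PER_MILE = 1.609344


def propagation_delay_ms(distance_miles, velocity_factor=1.0):
    """One-way propagation time in milliseconds."""
    distance_km = distance_miles * KM_PER_MILE
    speed_km_s = SPEED_OF_LIGHT_KM_S * velocity_factor
    return distance_km / speed_km_s * 1000.0


vacuum_ms = propagation_delay_ms(3000)
cable_ms = propagation_delay_ms(3000, velocity_factor=2 / 3)
print(f"vacuum: {vacuum_ms:.1f} ms, cable (assumed 2/3 c): {cable_ms:.1f} ms")
```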

Recently, however, people have discovered that you can send speech information over different kinds of networks: the Internet, LANs, WANs, and Virtual Private Networks (VPNs). For years, the Internet has been used to send text, but only recently has its function expanded to include real-time speech delivery. The manner in which speech is sent over the Internet is very different from that used in the standard public telephone network. For example, if you decided to place that phone call to your mother using your home PC, you would not be afforded the luxury of a dedicated electronic highway for your transmission. You would, instead, be sharing a massive electronic highway with the millions of other Internet users logged on at the time of your phone call.

When you said “Happy Birthday, Mom,” that speech stream would be divided and sent to your mother via potentially several different paths, depending on traffic conditions on the network at each moment. The “H-a” in “Happy” may travel through California before reaching your mother’s house. However, if the California link became congested a moment later, the “p-p-y” might make its way through Texas. All or most of these little snippets of speech would eventually arrive at your mother’s house to be reconstructed and played to her. But it might take some time to accomplish this. In fact, it may take one or two seconds from the time you begin saying “Happy Birthday, Mom” to the time she actually begins to hear it. This delay, or latency, makes Internet voice communications cumbersome and annoying, and is likely one of the key reasons people have not abandoned their telephones just yet.

SOURCES OF DELAY
So what causes this delay? Actually, it is the collective result of the attributes of ten steps in the process by which voice information is collected and transmitted over the Internet. These steps are listed in Table 1 in order of their chronological occurrence.

Transmitter Record Delay
The first system delay is incurred when the first speaker begins speaking. Unlike the telephone network, in which speech data is sent almost immediately and without extraordinary formatting, speech data must be carefully processed before transmission over the Internet. As a result, the system must record a certain amount of data to be processed before it does anything else. Picture yourself watering flowers on the weekend. You turn the hose on and fill the bucket with water, then you pour the water on the flowers. Before you can water the first flower, however, you must wait for the bucket to fill. Here you incur a delay, or latency, in which you’re not watering any flowers and basically not accomplishing anything. You might start the project at 9:00 A.M. (by turning on the hose), but you don’t actually start watering the flowers until perhaps 9:02 A.M. (when the bucket has filled). A similar delay occurs when transmitting speech over the Internet.

Let’s say a speaker begins talking at exactly 1:00:00 P.M. The VoIP system might collect data for one second, then begin processing that one second of data for transmission. (Transmitter record time slices are typically a fraction of that, but we’ll use the 1-second timeframe to illustrate the concept.) It’s now 1:00:01 P.M., and although speech has begun on the transmitter’s end, on the other end of the connection the listener has heard nothing. This initial data collection delay is known as transmitter record delay, and can be reduced by minimizing the record time slice interval from 1 second in this simple example to something much less than that, say 20 ms. Note, however, that reducing this recording period too much can adversely affect system quality in other ways that we’ll discuss later. At this point, it should suffice to say that there is a critical and delicate trade-off between recording time length and system latency.
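The bucket analogy can be made concrete with a small sketch. The sampling parameters below (8 kHz, 16-bit samples) are assumptions typical of telephone-quality speech rather than figures from the article, and the function names are purely illustrative.

```python
# Transmitter record delay: the time spent filling the first "bucket".
# The 8 kHz / 16-bit sampling parameters are assumptions typical of
# telephone-quality speech, not values given in the article.
SAMPLE_RATE_HZ = 8000
BYTES_PER_SAMPLE = 2


def bucket_size_bytes(time_slice_ms):
    """Bytes of raw speech collected before processing can start."""
    samples = SAMPLE_RATE_HZ * time_slice_ms // 1000
    return samples * BYTES_PER_SAMPLE


def first_slice_ready_at_ms(speech_start_ms, time_slice_ms):
    """Earliest moment the first time slice is available for encoding."""
    return speech_start_ms + time_slice_ms


for slice_ms in (1000, 20):
    print(slice_ms, "ms slice:", bucket_size_bytes(slice_ms), "bytes,",
          "ready at", first_slice_ready_at_ms(0, slice_ms), "ms")
```

Under these assumptions, shrinking the time slice from 1 second to 20 ms cuts both the bucket (16,000 bytes down to 320 bytes) and the wait before anything can be sent.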

One additional note is appropriate here regarding transmitter record delay. Assume for a moment a phone-to-phone call. That is, someone is using their standard household telephone to call someone else with a standard household telephone. The call in this case, however, is connected not over a standard circuit-switched network as defined previously, but rather over a packet-switched network. The industry term for this type of VoIP application is “toll bypass.” In essence, the caller is placing a long-distance call which uses standard phone lines only to a certain point, then is handed off to the data network (e.g., Internet, VPN). This results in a long-distance connection for the price of a local phone call.

Since the telephone, as we’ve seen, is a low-latency device, the transmitter record delay can be engineered to the minimum “bucket” of time acceptable to the next step in the VoIP system, encoding. This is typically in the range of 15 to 45 ms. However, if the call were placed using a personal computer, transmitter record delays would be much longer. This is because the current generation of personal computers (and PC operating systems) was not originally designed for low-latency record and playback. On today’s PCs, the minimum speech data bucket size that can be processed (between 150 and 300 ms) is much larger than the bucket size of a codec. In other words, when using a PC to make a phone call, the “bucket” used to hold the water is large, and it thus takes a long time to begin watering the flowers once you’ve started the project. In the future we can expect low-latency PC sound drivers to be developed, which will significantly improve the latency of PC-to-PC communications.

Encoding Delay
The second source of delay or latency is attributable to software that actually compresses the data before transmitting it. Speech data takes up a great deal of space electronically; this is why voice mail systems allow you to leave only a certain sized message before they cut off. Sophisticated software exists today that can compress speech before it is transmitted and decompress it when it arrives at its destination. To do this, the software, also known as a codec (short for coder/decoder), must hold up the data briefly so it can evaluate longer segments of it. For example, codecs work much better if they see the entire word “Hello” compared to just the “H-e” part of the word. Instead of compressing “H-e,” the codec might wait for the entire word “Hello” before compressing it. Having seen what follows “H-e,” the codec is much better able to code the “H-e.” Thus, some small delay is incurred as the codec “looks ahead” during its mathematical computations. Typical “low delay” codecs look ahead 15 to 45 ms for this purpose. However, it should be noted that if you engineer the VoIP system correctly, you size the recording delay to exactly meet the requirements of the codec. In this case, no extra delay is introduced into the system by the lookahead requirement of the codec. Finally, various sources use different terminology for the description of delay contributions. The reader should therefore be aware that the combination of transmitter record delay and codec delay is often called algorithmic delay.

Compression Delay
The codec does, however, introduce some additional delay while it conducts the actual computations that compress the speech for transmission. Those calculations are conducted on the computer processor on which the codec is running, e.g., a Pentium chip or a digital signal processor (DSP), and consume actual time. The process does not happen instantaneously. The faster the processor, the lower the delay. The time required to conduct these calculations, and the system delay incurred as a result of it, are known as compression delay, the third of the 10 steps. In addition to performing the compression calculations during this step, the speech data is also formatted for transmission over the Internet. Although that process introduces minimal system delay, it is a notable activity. In essence, the speech data is encapsulated in “packets” that the Internet can recognize and distribute appropriately. For example, the Internet needs to understand the final destination of the speech data so it can route it properly. This data is included in the packet built during this step. The composition of a typical packet will be discussed later.

Once the speech data has been compressed, it is ready to be shipped over the Internet. If the terminal isn’t directly connected to a network (which in turn is connected to the Internet), a connection must be set up, typically over a standard phone line. This is how most consumers access the Internet from their homes, and how many business travelers access the Internet from the road. Unfortunately, the data that a computer understands is quite different than the data understood by the public telephone network. Computers process digitized data, while the telephone network transports sound (analog signals). Thus, a device is needed to translate the computer’s data format to sounds that can be carried over the public telephone network.

This device is commonly known as a modem (short for modulator/demodulator). Like the codec, the modem must conduct calculations to enable this conversion, and those calculations take time to accomplish. For example, a 28.8 Kbps modem can convert 28,800 bits of data to the telephone network’s format in 1 second. Although that sounds very fast, when you’re attempting to enable perceptibly instantaneous communications, every small delay counts.
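To get a feel for the numbers, the time a modem needs to clock one packet onto the line can be estimated as packet size divided by line rate. The sketch below is a simplification that ignores the modem's own internal processing and buffering, and the packet size used is hypothetical.

```python
def modem_delay_ms(packet_bits, modem_bps=28_800):
    """Time for a modem to clock one packet onto the phone line."""
    return packet_bits / modem_bps * 1000.0


# Hypothetical packet: 320 bits of header plus 480 bits of compressed speech.
packet_bits = 320 + 480
print(f"{modem_delay_ms(packet_bits):.1f} ms per packet at 28.8 Kbps")
```

Even a small 800-bit packet takes nearly 28 ms at this rate, which is already comparable to the 15 to 45 ms recording window mentioned earlier.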

Transmitter Modem Delay
In fact, this modem delay, the fourth step, can account for a non-trivial percentage of the end-to-end VoIP latency. Moreover, this modem delay occurs not only once during a VoIP PC-to-PC call, but four times, as detailed in Table 2. It should be noted that Internet Service Provider (ISP) or phone company modems are substantially faster (able to process more data in a given period of time), and thus introduce considerably less delay than typical PC modems. Therefore, it would be somewhat inaccurate to calculate the modem delay associated with a 28.8 Kbps modem and simply multiply by four to obtain a figure for total end-to-end modem delay. Note that, as with the transmitter record delay, the latency implications differ considerably depending on whether the VoIP call is made PC to PC or standard telephone to telephone. In a phone-to-phone or toll-bypass call, there are no PC modems, so modem delay is non-existent (as indicated previously, ISP or phone company modem delay is considered to be insignificant). Our example here assumes a PC-to-PC access connection.

Internet Delay
One of the most uncertain sources of VoIP packet delay is encountered when the data packets actually begin their journey over the network. As discussed previously, the Internet’s packet-switched architecture moves packets from point to point in an unpredictable manner, and considerable delay is typically incurred for a significant percentage of packets sent via the Internet. Indeed, under very poor network conditions, up to 15 percent of the packets might not arrive at all, while it’s not uncommon for a well-engineered network to lose 5 percent of the packets sent (on average). Given the nature of the transmission medium, it is nearly impossible to calculate this Internet delay, and even more difficult to control it.

All data networks will introduce a minimum delay that cannot be reduced by the VoIP system. It simply takes a finite period of time for a packet to make its way from point A to point B. In a well-engineered network, this fixed packet delay is unlikely to fall below 75 ms, and delays in the range of 90 to 120 ms are typical for well-engineered networks. (Some of this inherent delay is due literally to the speed of light. It simply takes a finite period of time for an electronic impulse to traverse a long distance.) Such a fixed delay is tolerable, and VoIP system designers are more than capable of accounting for it as they design their systems. What is substantially more troublesome is the variation in this delay known as jitter. A hypothetical data network that can guarantee its packets will arrive exactly 100 ms after transmission, and never more than that, is considered to have zero jitter. Thus it is important to distinguish between Internet delay, a rather fixed quantity for a given network, and the variation in that delay (jitter).
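The distinction between the fixed delay and its variation can be illustrated with a short sketch. Here jitter is taken simply as the spread between the slowest and fastest packet, which is one of several possible definitions, and the delay measurements are hypothetical.

```python
def fixed_delay_and_jitter(delays_ms):
    """Split observed one-way delays into a fixed floor and its variation.

    'Jitter' here is simply max minus min; other definitions exist.
    """
    fixed = min(delays_ms)
    jitter = max(delays_ms) - fixed
    return fixed, jitter


# Hypothetical measurements in milliseconds.
steady_network = [100, 100, 100, 100]
bursty_network = [95, 140, 100, 230, 110]

print(fixed_delay_and_jitter(steady_network))  # (100, 0)
print(fixed_delay_and_jitter(bursty_network))  # (95, 135)
```

The steady network matches the zero-jitter case described above; the bursty one has a slightly lower floor but a far larger spread, and it is that spread the receiver has to absorb.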

Receiver Modem Delay
The sixth source of VoIP packet delay, receiver modem delay, is the inverse of the fourth type of delay (transmitter modem delay) mentioned previously.

Jitter Buffer Delay
This brings us to our seventh source of delay, jitter buffer delay. Jitter buffer delay is a response to the Internet’s unreliability and volatility. As mentioned above, when packet delivery is delayed by the Internet, the VoIP application developer is typically forced to wait for those late packets. The jitter buffer is the mechanism by which this waiting occurs. Let’s say that someone in Texas is talking to someone in Boston using their PC and a VoIP system, and the person in Texas says “Hello.” The VoIP system takes the “Hello” data, processes it and places it in packets for shipment over the Internet. For the purpose of this argument, let’s assume it can pack that data into three discrete packages or packets. In Boston, packets 1 and 3 arrive on time, but packet 2 is delayed. When packet 1 arrives, it is stored in the jitter buffer while it waits for packets 2 and 3 to catch up. If the system doesn’t wait for the slowed packets to arrive, it will be forced to play the “Hello” to the Boston conversation member with gaps in the speech: It will sound more like “He…..o.” Thus, most VoIP systems incorporate a jitter buffer that fills and empties like a bucket of water. However, the larger the jitter buffer (the longer the system waits for delayed packets), the greater the delay introduced into the system.
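The three-packet “Hello” example can be simulated in a few lines. This is a minimal sketch of a fixed-depth jitter buffer, not a description of any particular product; the arrival times, the 60 ms frame length, and the `play_out` helper are all hypothetical, and the first packet is assumed to arrive.

```python
def play_out(arrivals_ms, frame_ms, buffer_ms):
    """Return, per packet, whether it arrived in time to be played.

    arrivals_ms[i] is the arrival time of packet i (None if lost);
    packet 0 is assumed to arrive. Playback of packet 0 starts buffer_ms
    after it arrives; each later packet is due frame_ms after the previous one.
    """
    start = arrivals_ms[0] + buffer_ms
    played = []
    for seq, arrival in enumerate(arrivals_ms):
        due = start + seq * frame_ms
        played.append(arrival is not None and arrival <= due)
    return played


# Hypothetical "Hello" in three 60 ms packets; packet 2 (index 1) is late.
arrivals = [100, 250, 220]

print(play_out(arrivals, frame_ms=60, buffer_ms=0))    # [True, False, True]
print(play_out(arrivals, frame_ms=60, buffer_ms=100))  # [True, True, True]
```

With no buffer, the late middle packet misses its slot and the listener hears a gap. Waiting 100 ms before starting playback rescues it, at the price of 100 ms of added latency on every packet.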

Decompression And Decoding Delay
After waiting their turn in the jitter buffer, the speech packets must now be decoded. Delay sources 8 and 9, decompression delay and decoding delay, are the inverse of delay sources 3 and 2 (compression delay and encoding delay), respectively.

Playback Delay
Finally, the digitized packets must be converted one last time from digital format (the output of the decoder) to analog format so the sound can be played through the PC’s speakers. Since our ears only understand analog signals, playing the digitized version wouldn’t do us much good. Playing digitized speech through conventional speakers without first converting it to analog format would create the same sound you hear when you fax a document, that is, the high-pitched, screeching and annoying tones that indicate your fax machine is “talking” to the receiving fax machine. This final digital-to-analog conversion is accomplished by the PC’s sound card and its core operating system. Unfortunately, as discussed previously, the management of the data by the PC’s operating system through this device also introduces delay. This final source of delay, playback delay, can also be significant, as today’s sound cards and the software that controls them (drivers) were not necessarily designed with VoIP applications in mind. This playback delay can be as large as 150 ms.
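With all ten sources described, a one-way delay budget for a PC-to-PC call can be sketched by adding them up. Only the entries commented as coming from the article fall inside ranges quoted in the text; every other value is a placeholder chosen just to make the sum concrete, so the total is illustrative rather than a measurement.

```python
# One-way delay budget for a PC-to-PC call, in milliseconds.
# Values marked "article" fall inside ranges the text quotes; the rest are
# placeholders chosen only to make the sum concrete.
delay_budget_ms = {
    "transmitter record": 150,   # article: 150 to 300 ms on a PC
    "encoding (look-ahead)": 0,  # article: absorbed if record size matches codec
    "compression": 10,           # placeholder
    "transmitter modem": 30,     # placeholder
    "internet": 100,             # article: 90 to 120 ms typical
    "receiver modem": 30,        # placeholder
    "jitter buffer": 100,        # placeholder
    "decompression": 10,         # placeholder
    "decoding": 0,               # placeholder
    "playback": 150,             # article: can be as large as 150 ms
}

total_ms = sum(delay_budget_ms.values())
for step, ms in delay_budget_ms.items():
    print(f"{step:<24}{ms:>5} ms")
print(f"{'total':<24}{total_ms:>5} ms")
```

Even with these fairly optimistic placeholders the total is more than half a second, well over ten times the roughly 40 ms quoted earlier for a circuit-switched coast-to-coast call.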

THE VOIP CONUNDRUM
As discussed above, when a segment of speech data is sent over the Internet, it is first chopped into small time slices and compressed (or coded). The compressed output is organized into frames, each of which contains the data to represent the individual time slice of speech. For example, if you say “Happy Birthday,” the “H-a” may go in one frame, the “p-p-y” in the next, and so on. However, the system can’t simply send these naked speech frames alone. One or more frames (typically 2 to 4) must be placed in small bundles (called packets) for shipment. The packets add some instructions about their destination, their origin, a sequence number, and the like. Remember, each of these packets may take a different path to the final destination (your mother’s house in this case), and some may never arrive at all, so each one must have all this information. VoIP system lexicon refers to this information as the header. The header assures that each packet has the necessary information required for it to reach its destination and be reconstructed effectively. Unfortunately, data space is required to transmit this information. That is, to send this information the system must ship 320 bits of data just for the header, not including speech data. This overhead becomes significant when a system is sending thousands of packets. Given this, you might suggest sending as many frames as possible in each packet, so the effect of the header overhead is minimized. Unfortunately, it’s not that easy: don’t forget about the transmitter record delay discussed earlier.

What does it mean to send more frames of data in each packet to minimize header overhead? Well, it means you must record more speech before you send it. Thus, instead of recording the “H-a” in “Happy Birthday” and sending it immediately, the system would record all of “Happy,” stuff it into a single packet and send it. If you recorded just “H-a” and sent it, then “p-p-y” and sent it, you would have to send two headers (one for the “H-a” frame and one for the “p-p-y” frame). However, your mother would have to wait less time to hear you speak because she would be listening to the “H-a” playing back while you were recording and sending the “p-p-y.” With a little luck, she might not be able to discern that two different packets were sent. On the other hand, if we send all of “Happy” in one packet, we would save 320 bits (the size of one header), but a substantial latency or waiting period would be introduced.
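This trade-off between header overhead and recording delay can be tabulated with a short sketch. The 320-bit header is the figure quoted above; the 20 ms frame length and the 8 Kbps codec rate are illustrative assumptions, as is the `packet_tradeoff` helper itself.

```python
HEADER_BITS = 320  # per-packet header size quoted in the article


def packet_tradeoff(frames_per_packet, frame_ms=20, codec_bps=8000):
    """Return (header overhead fraction, packetization delay in ms).

    frame_ms and codec_bps are illustrative assumptions.
    """
    payload_bits = codec_bps * frame_ms * frames_per_packet // 1000
    overhead = HEADER_BITS / (HEADER_BITS + payload_bits)
    delay_ms = frame_ms * frames_per_packet
    return overhead, delay_ms


for n in (1, 2, 4, 8):
    overhead, delay_ms = packet_tradeoff(n)
    print(f"{n} frame(s)/packet: {overhead:.0%} header, {delay_ms} ms recorded")
```

Under these assumptions, going from one frame to eight frames per packet shrinks the header's share of each packet from about two-thirds to one-fifth, but the speech that must be recorded before anything is sent grows from 20 ms to 160 ms.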

One of the challenges faced by today’s VoIP system designers is to minimize the delays described above, as they seek to optimize the quality of voice transmission over the Internet, regardless of network conditions. Their success may very well determine the acceptance and future of IP telephony.

Jeff Hill is director of product management at Voxware, Inc. Voxware develops, markets, licenses, and supports a suite of digital processing technologies which provide the ability to compress, model, and transform speech and audio. The company licenses technologies and developers’ kits, and also employs them to build platforms and packaged solutions. Voxware’s core technologies are designed to reproduce high-quality speech and audio while requiring very low communications bandwidth and processing power. These technologies enable users to create audio- and speech-enhanced communications and interactive products for the Internet and other bandwidth-constrained environments. For more information, call Voxware at 609-514-4100 or visit the company’s Web site at www.voxware.com.







Technology Marketing Corporation

2 Trap Falls Road Suite 106, Shelton, CT 06484 USA
Ph: +1-203-852-6800, 800-243-6002


© 2024 Technology Marketing Corporation. All rights reserved | Privacy Policy