Minimizing VoIP Transmission Delays To Optimize Performance
BY JEFF HILL
VoIP (voice over Internet protocol) technology has made great strides. Pundits are
hailing the arrival of this revolutionary new technology, all the while predicting the
demise of traditional voice communications. VoIP allows users to bypass the existing phone
companies, thus offering huge savings off the cost of traditional phone calls. However,
issues remain that stand in the way of this technology becoming the end-all solution that
some purport it to be. Chief among these issues is quality of service (QoS), in
particular, how the delays inherent in Internet transmissions adversely impact that
quality.
With the ability to place a call from Butte, Montana to London (or anywhere else), talk
for hours, and pay no more than the cost of a local phone call for the privilege, most
people would agree that this sounds like a pretty good deal, especially when such a phone
call from Montana to London could cost hundreds of dollars if made via more traditional
means. All that is needed is a multimedia PC, some inexpensive Internet phone software
package, and a recipient in London with their PC turned on. Tens of millions of people own
PCs, and millions more are buying them every year. So why then, does anyone use the
telephone anymore? The answer is simple: sound quality. The world is accustomed to a
certain degree of clarity and naturalness when conversing via the telephone.
Unfortunately, systems designed to enable two-way voice communications over data networks
(e.g., VoIP systems) have yet to approach that quality plateau, at least in the PC-to-PC
domain.
CIRCUITS, PACKETS, AND BIRTHDAY GREETINGS
To clarify the VoIP process, let us assume an example of calling your mother on her
birthday. When you pick up the telephone to place such a call, you expect a reasonable
delay between the time you finish dialing the number and the time the phone on the other
end begins to ring. During this time, the telephone company is finding an electronic
roadway between your phone and your mother's. Once that path is established, it
remains open for the length of the call and allows you to experience almost no perceptible
delay. When you say "Happy Birthday, Mom," an electronic version of your speech
travels over that dedicated path to your mother's ear. When she says "thank
you," her speech travels back uninterrupted over that same highway. This
circuit-switched architecture has been the foundation of the world's phone system
since the telephone was first invented, and it has served us well.
The sound quality is very good, but you pay for that dedicated line in long-distance
phone charges. If you use test instruments to actually measure the delay of a land-based
coast-to-coast (in the United States) phone call, you would probably measure approximately
40 ms of delay, of which 30 ms is attributable to the speed of light, that is, the time for the electronic
signal to travel the 3,000 miles across North America.
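
As a rough check on the propagation component, one-way travel time is simply distance divided by signal speed. The sketch below uses the 3,000-mile distance from the text; the velocity factor (the fraction of the vacuum speed of light at which a signal moves through real cable) is an illustrative assumption, not a figure from this article.

```python
SPEED_OF_LIGHT_KM_S = 299_792.458  # speed of light in a vacuum
KM_PER_MILE = 1.609344

def propagation_delay_ms(distance_miles, velocity_factor=1.0):
    """One-way propagation time in milliseconds.

    velocity_factor is an assumed fraction of the vacuum speed of light.
    """
    distance_km = distance_miles * KM_PER_MILE
    return distance_km / (SPEED_OF_LIGHT_KM_S * velocity_factor) * 1000.0

vacuum_ms = propagation_delay_ms(3000)       # straight line, in a vacuum
cable_ms = propagation_delay_ms(3000, 0.67)  # assumed: about two-thirds of c in cable
```

A straight line in a vacuum gives roughly 16 ms. Slower propagation in real cable, plus routes that are longer than a straight line, push the figure upward toward the order of magnitude quoted above.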
Recently, however, people have discovered that you can send speech information over
different kinds of networks: the Internet, LANs, WANs, and Virtual Private Networks
(VPNs). For years, the Internet has been used to send text, but only recently has its
function expanded to include real-time speech delivery. The manner in which speech is sent
over the Internet is very different from that used in the standard public telephone
network. For example, if you decided to place that phone call to your mother using your
home PC, you would not be afforded the luxury of a dedicated electronic highway for your
transmission. You would, instead, be sharing a massive electronic highway with the
millions of other Internet users logged on at the time of your phone call.
When you said "Happy Birthday, Mom," that speech stream would be divided and
sent to your mother via potentially several different paths, depending on traffic
conditions on the network at each moment. The "H-a" in "Happy" may
travel through California before reaching your mother's house. However, if the
California link became congested a moment later, the "p-p-y" might make its way
through Texas. All or most of these little snippets of speech would eventually arrive at
your mother's house to be reconstructed and played to her. But it might take some
time to accomplish this. In fact, it may take one or two seconds from the time you begin
saying "Happy Birthday, Mom" to the time she actually begins to hear it. This
delay, or latency, makes Internet voice communications cumbersome and annoying, and is
likely one of the key reasons people have not abandoned their telephones just yet.
SOURCES OF DELAY
So what causes this delay? Actually, it is the collective result of the attributes of ten
steps in the process by which voice information is collected and transmitted over the
Internet. These steps are listed in Table 1 in order of their chronological occurrence.
Transmission Recording Delay
The first system delay is incurred when the first speaker begins speaking. Unlike
the telephone network, in which speech data is sent almost immediately and without
extraordinary formatting, speech data must be carefully processed before transmission over
the Internet. As a result, the system must record a certain amount of data to be processed
before it does anything else. Picture yourself watering flowers on the weekend. You turn
the hose on and fill the bucket with water, then you pour the water on the flowers. Before
you can water the first flower, however, you must wait for the bucket to fill. Here you
incur a delay, or latency, in which you're not watering any flowers and basically not
accomplishing anything. You might start the project at 9:00 A.M. (by turning on the hose),
but you don't actually start watering the flowers until perhaps 9:02 A.M. (when the
bucket has filled). A similar delay occurs when transmitting speech over the Internet.
Let's say a speaker begins talking at exactly 1:00:00 P.M. The VoIP system might
collect data for one second, then begin processing that one second of data for
transmission. (Transmitter record time slices are typically a fraction of that, but
we'll use the 1-second timeframe to illustrate the concept.) It's now 1:00:01
P.M., and although speech has begun on the transmitter's end, on the other end of the
connection the listener has heard nothing. This initial data collection delay is known as
transmitter record delay, and can be reduced by minimizing the record time slice interval
from 1 second in this simple example to something much less than that, say 20 ms.
Note, however, that reducing this recording period too much can adversely affect system
quality in other ways that we'll discuss later. At this point, it should suffice to
say that there is a critical and delicate trade-off between recording time length and
system latency.
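
The bucket analogy can be put in numbers with a minimal sketch. It assumes an 8 kHz sampling rate (a common telephone-quality figure, not one stated in this article); the function names are hypothetical.

```python
SAMPLE_RATE_HZ = 8000  # assumed telephone-quality sampling rate

def samples_per_slice(slice_ms, sample_rate_hz=SAMPLE_RATE_HZ):
    """Number of audio samples the recorder must collect per time slice."""
    return sample_rate_hz * slice_ms // 1000

def transmitter_record_delay_ms(slice_ms):
    """The first slice cannot be processed until it has been fully recorded,
    so the record delay equals the slice length."""
    return slice_ms

# Compare the 1-second teaching example with a 20 ms slice.
for slice_ms in (1000, 20):
    print(slice_ms, "ms slice:", samples_per_slice(slice_ms), "samples,",
          transmitter_record_delay_ms(slice_ms), "ms before processing can start")
```

Shrinking the slice from 1 second to 20 ms cuts the wait before any processing can begin by the same factor, which is the reduction the paragraph above describes.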
One additional note is appropriate here regarding transmitter record delay. Assume for
a moment a phone-to-phone call. That is, someone is using their standard household
telephone to call someone else with a standard household telephone. The call in this case,
however, is connected not over a standard circuit-switched network as defined previously,
but rather over a packet-switched network. The industry term for this type of VoIP
application is "toll bypass." In essence, the caller is placing a long-distance
call which uses standard phone lines only to a certain point, then is handed off to the
data network (e.g., Internet, VPN). This results in a long-distance connection for the
price of a local phone call.
Since the telephone, as we've seen, is a low-latency device, the transmitter
record delay can be engineered to the minimum "bucket" of time acceptable to the
next step in the VoIP system, encoding. This is typically in the range of 15 to 45 ms.
However, if the call were placed using a personal computer, transmitter record delays
would be much longer. This is because the current generation of personal computers (and PC
operating systems) was not originally designed for low-latency record and playback. On
today's PCs, the minimum speech data bucket size that can be processed (between 150 and
300 ms) is much larger than the bucket size of a codec. In other words, when using a PC
to make a phone call, the bucket used to hold the water is large, and it thus
takes a long time to begin watering the flowers once you've started the project. In
the future we can expect low-latency PC sound drivers to be developed, which will
significantly improve the latency of PC-to-PC communications.
Encoding Delay
The second source of delay or latency is attributable to software that actually
compresses the data before transmitting it. Speech data takes up a great deal of space
electronically; this is why voice mail systems allow you to leave only a certain sized
message before they cut off. Sophisticated software exists today that can compress speech
before it is transmitted and decompress it when it arrives at its destination. To do this,
the software, also known as a codec (short for coder/decoder), must hold up the data
briefly so it can evaluate longer segments of it. For example, codecs work much better if
they see the entire word "Hello" compared to just the "H-e" part of
the word. Instead of compressing "H-e," the codec might wait for the entire word
"Hello" before compressing it. Having seen what follows "H-e," the
codec is much better able to code the "H-e." Thus, some small delay is incurred
as the codec "looks ahead" during its mathematical computations. Typical
low-delay codecs look ahead 15 to 45 ms for this purpose. However, it should
be noted that if you engineer the VoIP system correctly, you size the recording delay to
exactly meet the requirements of the codec. In this case, no extra delay is introduced
into the system by the lookahead requirement of the codec. Finally, various sources use
different terminology for the description of delay contributions. The reader should
therefore be aware that the combination of transmitter record delay and codec delay is
often called algorithmic delay.
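
A small sketch can show why matching the record slice to the codec matters. The model is an assumption made for illustration: the recorder hands over audio only in whole slices, and the codec needs a fixed window of audio before it can begin. The 30 ms window falls within the 15 to 45 ms range given above, and 150 ms is the low end of the PC bucket size quoted earlier; the function names are hypothetical.

```python
import math

def time_until_codec_can_start_ms(codec_window_ms, record_slice_ms):
    """The recorder delivers audio in whole slices, so the codec waits for
    enough whole slices to cover the window it needs."""
    slices_needed = math.ceil(codec_window_ms / record_slice_ms)
    return slices_needed * record_slice_ms

def extra_delay_ms(codec_window_ms, record_slice_ms):
    """Delay beyond what the codec itself requires."""
    return time_until_codec_can_start_ms(codec_window_ms, record_slice_ms) - codec_window_ms

matched = extra_delay_ms(30, 30)     # slice sized to the codec: no extra delay
pc_bucket = extra_delay_ms(30, 150)  # large PC bucket: 120 ms of extra waiting
```

When the slice equals the codec's window, the codec's requirement is already covered by the record delay and adds nothing. With an oversized bucket, the codec sits idle while audio it does not yet need accumulates.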
Compression Delay
The codec does, however, introduce some additional delay while it conducts the
actual computations that compress the speech for transmission. Those calculations are
conducted on the computer processor on which the codec is running, e.g., a Pentium chip or
a digital signal processor (DSP), and consume actual time. The process does not happen
instantaneously. The faster the processor, the lower the delay. The time required to
conduct these calculations, and the system delay incurred as a result of it, are known as
compression delay, the third of the 10 steps. In addition to performing the compression
calculations during this step, the speech data is also formatted for transmission over the
Internet. Although that process introduces minimal system delay, it is a notable activity.
In essence, the speech data is encapsulated in packets that the Internet can
recognize and distribute appropriately. For example, the Internet needs to understand the
final destination of the speech data so it can route it properly. This data is included in
the packet built during this step. The composition of a typical packet will be discussed
later.
Once the speech data has been compressed, it is ready to be shipped over the Internet.
If the terminal isn't directly connected to a network (which in turn is connected to
the Internet), a connection must be set up, typically over a standard phone line. This is
how most consumers access the Internet from their homes, and how many business travelers
access the Internet from the road. Unfortunately, the data that a computer understands is
quite different than the data understood by the public telephone network. Computers
process digitized data, while the telephone network transports sound (analog signals).
Thus, a device is needed to translate the computer's data format to sounds that can
be carried over the public telephone network.
This device is commonly known as a modem (short for modulator/demodulator). Like the
codec, the modem must conduct calculations to enable this conversion, and those
calculations take time to accomplish. For example, a 28.8 Kbps modem can convert 28,800
bits of data to the telephone network's format in 1 second. Although that sounds very
fast, when you're attempting to enable perceptibly instantaneous communications,
every small delay counts.
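
The line-rate part of modem delay is plain arithmetic: bits divided by bits per second. The sketch below covers only that part, not the modem's internal processing time, and the 600-bit packet size is a hypothetical value chosen for illustration.

```python
def modem_serialization_delay_ms(num_bits, modem_bps):
    """Time for a modem running at modem_bps to push num_bits onto the line."""
    return num_bits / modem_bps * 1000.0

# 28,800 bits at 28.8 Kbps takes one second, as stated in the text.
one_second = modem_serialization_delay_ms(28_800, 28_800)

# Hypothetical 600-bit packet: about 20.8 ms on the same modem.
small_packet = modem_serialization_delay_ms(600, 28_800)
```

Even a small packet costs tens of milliseconds at this line rate, which is why the modem cannot be ignored in a latency budget.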
Transmitter Modem Delay
In fact, this modem delay, the fourth step, can account for a non-trivial
percentage of the end-to-end VoIP latency. Moreover, this modem delay occurs not only once
during a VoIP PC-to-PC call, but four times, as detailed in Table 2. It should be noted
that Internet Service Provider (ISP) or phone company modems are substantially faster
(able to process more data in a given period of time), and thus introduce considerably
less delay than typical PC modems. Therefore, it would be somewhat inaccurate to calculate
the modem delay associated with a 28.8 Kbps modem and simply multiply by four to obtain a
figure for total end-to-end modem delay. Note that, as with transmitter record delay,
the latency implications differ considerably depending on whether the VoIP call is made
PC-to-PC or telephone-to-telephone. In a phone-to-phone or toll-bypass call,
there are no PC modems, so modem delay is non-existent (as indicated previously, ISP or
phone company modem delay is considered to be insignificant). Our example here assumes a
PC-to-PC access connection.
Internet Delay
One of the most uncertain sources of VoIP packet delay is encountered when the
data packets actually begin their journey over the network. As discussed previously, the
Internet's packet-switched architecture moves packets from point to point in an
unpredictable manner, and considerable delay is typically incurred for a significant
percentage of packets sent via the Internet. Indeed, under very poor network conditions,
up to 15 percent of the packets might not arrive at all, while it's not uncommon for
a well-engineered network to lose 5 percent of the packets sent (on average). Given the
nature of the transmission medium, it is nearly impossible to calculate this Internet
delay, and even more difficult to control it.
All data networks will introduce a minimum delay that cannot be reduced by the VoIP
system. It simply takes a finite period of time for a packet to make its way from point A
to point B. In a well-engineered network, this fixed packet delay is unlikely to fall
below 75 ms, and delays in the range of 90 to 120 ms are typical for well-engineered
networks. (Some of this inherent delay is due literally to the speed of light. It simply
takes a finite period of time for an electronic impulse to traverse a long distance.) Such
a fixed delay is tolerable, and VoIP system designers are more than capable of accounting
for it as they design their systems. What is substantially more troublesome is the
variation in this delay, known as jitter. A hypothetical data network that can guarantee
its packets will arrive exactly 100 ms after transmission, and never more than that, is
considered to have zero jitter. Thus it is important to distinguish between Internet
delay, a rather fixed quantity for a given network, and the variation in that delay
(jitter).
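
The distinction between fixed delay and jitter can be sketched from a list of measured one-way packet delays. Jitter is taken here as the simple spread between the slowest and fastest packet; that definition is an assumption for illustration (other definitions, such as a running average of delay differences, are also in use), and the sample values are hypothetical.

```python
def fixed_delay_and_jitter_ms(packet_delays_ms):
    """Split observed one-way delays into a fixed floor and a variation.

    Jitter is taken here as the spread (max - min) of the observed delays.
    """
    fixed = min(packet_delays_ms)
    jitter = max(packet_delays_ms) - fixed
    return fixed, jitter

steady = fixed_delay_and_jitter_ms([100, 100, 100, 100])  # the zero-jitter network
bursty = fixed_delay_and_jitter_ms([95, 140, 100, 210])   # hypothetical congested path
```

The steady network has a 100 ms fixed delay and zero jitter, matching the hypothetical network described above. The bursty one has a slightly lower floor but a spread of more than 100 ms, and it is that spread that the receiver has to absorb.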
Receiver Modem Delay
The sixth source of VoIP packet delay, receiver modem delay, is the inverse of
the fourth type of delay (transmitter modem delay) mentioned previously.
Jitter Buffer Delay
This brings us to our seventh source of delay, jitter buffer delay. Jitter buffer
delay is a response to the Internet's unreliability and volatility. As mentioned
above, when packet delivery is delayed by the Internet, the VoIP application developer is
typically forced to wait for those late packets. The jitter buffer is the mechanism by
which this waiting occurs. Let's say that someone in Texas is talking to someone in
Boston using their PC and a VoIP system, and the person in Texas says "Hello."
The VoIP system takes the "Hello" data, processes it, and places it in packets
for shipment over the Internet. For the purpose of this argument, let's assume it can
pack that data into three discrete packages or packets. In Boston, packets 1 and 3 arrive
on time, but packet 2 is delayed. When packet 1 arrives, it is stored in the jitter buffer
while it waits for packets 2 and 3 to catch up. If the system doesn't wait for the
slowed packets to arrive, it will be forced to play the "Hello" to the Boston
conversation member with gaps in the speech: it will sound more like
"He...o." Thus, most VoIP systems incorporate a jitter buffer that fills
and empties like a bucket of water. However, the larger the jitter buffer (the longer the
system waits for delayed packets), the greater the delay introduced into the system.
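
The three-packet "Hello" example can be simulated in a few lines. This is a minimal sketch under stated assumptions: packets are sent 20 ms apart, playback begins a fixed buffer time after the first packet arrives, and a packet is usable only if it arrives by its scheduled playout time. The delay values and function name are hypothetical.

```python
FRAME_MS = 20  # assumed spacing between packets at the sender

def playable_packets(network_delays_ms, buffer_ms, frame_ms=FRAME_MS):
    """Return, per packet, whether it arrived by its playout deadline.

    Packet i is sent at i * frame_ms and arrives network_delays_ms[i] later.
    Playback starts buffer_ms after the first packet arrives, and packet i
    is due i * frame_ms after playback starts.
    """
    arrivals = [i * frame_ms + d for i, d in enumerate(network_delays_ms)]
    playout_start = arrivals[0] + buffer_ms
    return [arrival <= playout_start + i * frame_ms
            for i, arrival in enumerate(arrivals)]

delays = [100, 160, 100]  # hypothetical: packet 2 is held up by 60 ms

no_buffer = playable_packets(delays, buffer_ms=0)     # [True, False, True]
with_buffer = playable_packets(delays, buffer_ms=60)  # [True, True, True]
```

Without a buffer, packet 2 misses its slot and the listener hears a gap. A 60 ms buffer rescues it, at the price of 60 ms added to every packet's latency.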
Decompression And Decoding Delay
After waiting their turn in the jitter buffer, the speech packets must now be
decoded. Delay sources 8 and 9, decompression delay and decoding delay, are the inverse of
delay sources 2 and 3 (encoding and compression delay).
Playback Delay
Finally, the digitized packets must be converted one last time from digital
format (the output of the decoder) to analog format so the sound can be played through the
PC's speakers. Since our ears only understand analog signals, playing the digitized
version wouldn't do us much good. Playing digitized speech through conventional
speakers without first converting it to analog format would create the same sound you hear
when you fax a document, that is, the high-pitched, screeching, and annoying tones that
indicate your fax machine is talking to the receiving fax machine. This final
digital-to-analog conversion is accomplished by the PC's sound card and its core
operating system. Unfortunately, as discussed previously, the management of the data by
the PC's operating system through this device also introduces delay. This final
source of delay, playback delay, can also be significant, as today's sound cards and
the software that controls them (drivers) were not necessarily designed with VoIP
applications in mind. This playback delay can be as large as 150 ms.
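
To see how the pieces add up, the sketch below sums only the stages for which this article quotes a figure, for a PC-to-PC call. Stages without a quoted number (compression, modem, jitter buffer, decoding) are left out rather than guessed, so the result is a partial budget; the dictionary and function names are hypothetical.

```python
# (low_ms, high_ms) ranges quoted in the article for a PC-to-PC call.
# Stages without a quoted figure are omitted rather than guessed.
KNOWN_STAGE_RANGES_MS = {
    "transmitter record (PC)": (150, 300),
    "Internet (well-engineered network)": (90, 120),
    "playback (PC sound card)": (0, 150),  # only an upper bound is given
}

def budget_range_ms(stage_ranges):
    """Sum the low and high ends of each stage's delay range."""
    low = sum(lo for lo, _ in stage_ranges.values())
    high = sum(hi for _, hi in stage_ranges.values())
    return low, high

pc_low, pc_high = budget_range_ms(KNOWN_STAGE_RANGES_MS)
```

These three stages alone account for somewhere between 240 and 570 ms, before any modem, codec computation, or jitter buffer delay is counted.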
THE VOIP CONUNDRUM
As discussed above, when a segment of speech data is sent over the Internet, it is first
chopped into small time slices and compressed (or coded). The compressed output is
organized into frames, each of which contains the data to represent the individual time
slice of speech. For example, if you say "Happy Birthday," the "H-a"
may go in one frame, the "p-p-y" in the next, and so on. However, the system
can't simply send these naked speech frames alone. One or more frames (typically 2 to
4) must be placed in small bundles (called packets) for shipment. The packets add some
instructions about their destination, their origin, a sequence number, and the like.
Remember, each of these packets may take a different path to the final destination (your
mother's house in this case), and some may never arrive at all, so each one must have
all this information. VoIP system lexicon refers to this information as the header. The
header assures that each packet has the necessary information required for it to reach its
destination and be reconstructed effectively. Unfortunately, data space is required to
transmit this information. That is, to send this information the system must ship 320 bits
of data just for the header, not including speech data. This overhead becomes significant
when a system is sending thousands of packets. Given this, you might suggest sending as
many frames as possible in each packet, so the effect of the header overhead is minimized.
Unfortunately, it's not that easy: don't forget about the transmitter
record delay discussed earlier.
What does it mean to send more frames of data in each packet to minimize header
overhead? Well, it means you must record more speech before you send it. Thus, instead of
recording the "H-a" in "Happy Birthday" and sending it immediately,
the system would record all of "Happy," stuff it into a single packet, and send
it. If you recorded just "H-a" and sent it, then "p-p-y" and sent it,
you would have to send two headers (one for the "H-a" frame and one for the
"p-p-y" frame). However, your mother would have to wait less time to hear you
speak because she would be listening to the "H-a" playing back while you were
recording and sending the "p-p-y." With a little luck, she might not be able to
discern that two different packets were sent. On the other hand, if we send all of
"Happy" in one packet, we would save 320 bits (the size of one header), but a
substantial latency or waiting period would be introduced.
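
The trade-off between header overhead and latency can be tabulated with a short sketch. The 320-bit header comes from the text; the 20 ms frame duration and the 160 bits of speech per frame (equivalent to an 8 kbps codec) are assumptions chosen for illustration, and the function name is hypothetical.

```python
HEADER_BITS = 320         # header size given in the article
FRAME_MS = 20             # assumed frame duration
FRAME_PAYLOAD_BITS = 160  # assumed: an 8 kbps codec yields 160 bits per 20 ms frame

def packet_tradeoff(frames_per_packet):
    """Return (header share of the packet, ms of speech recorded before sending)."""
    payload_bits = frames_per_packet * FRAME_PAYLOAD_BITS
    total_bits = HEADER_BITS + payload_bits
    overhead_fraction = HEADER_BITS / total_bits
    record_delay_ms = frames_per_packet * FRAME_MS
    return overhead_fraction, record_delay_ms

for n in (1, 2, 4):
    overhead, delay = packet_tradeoff(n)
    print(f"{n} frame(s) per packet: {overhead:.0%} header overhead, "
          f"{delay} ms recorded before sending")
```

Under these assumptions, going from one frame per packet to four halves the header's share of the bandwidth (from about 67 percent to about 33 percent) but quadruples the amount of speech that must be recorded before anything is sent.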
One of the challenges faced by today's VoIP system designers is to minimize the
delays described above, as they seek to optimize the quality of voice transmission over
the Internet, regardless of network conditions. Their success may very well determine the
acceptance and future of IP telephony.
Jeff Hill is director of product management at Voxware, Inc. Voxware develops,
markets, licenses, and supports a suite of digital processing technologies which provide
the ability to compress, model, and transform speech and audio. The company licenses
technologies and developers' kits, and also employs them to build platforms and
packaged solutions. Voxware's core technologies are designed to reproduce
high-quality speech and audio while requiring very low communications bandwidth and
processing power. These technologies enable users to create audio- and speech-enhanced
communications and interactive products for the Internet and other bandwidth-constrained
environments. For more information, call Voxware at 609-514-4100 or visit the
company's Web site at www.voxware.com.