
Feature Article
October 2001

Testing VoIP For Maximum Performance


[ Go Right To: Methods For Measuring Speech Quality ]

Although data traffic volumes worldwide have exceeded voice traffic volumes, voice services continue to bring in several times more revenue than data. Clearly the market value of voice, combined with the low cost of transporting data, makes a compelling business case for converged networks.

The convergence of voice and data on the same packet network can enable network operators to realize significant operational and capital cost savings. But enabling voice and data to co-exist on the same network, and to realize the related cost savings, requires a careful balance between network bandwidth-efficiency and voice service quality. Today, over $600 billion a year is paid for the predictable quality of the PSTN and circuit-switched networks. VoIP service providers that cannot deliver PSTN performance will have a difficult time winning market share.

Delivering PSTN performance on a VoIP network is not a simple task, as many network operators have come to discover. Testing the performance of VoIP networks is a continuous activity, from initial lab trials to an in-service production network. If designed and implemented properly, a testing program can be used to drive network designs and configurations, and can result in a network that is optimized for both voice service quality and network bandwidth efficiency.

Testing network performance begins with identifying parameters to be measured and performance objectives. One approach is to identify which parameters impact a customer, and then identify the underlying factors that contribute to those parameters.

Performance testing on the PSTN has traditionally focused on call completion metrics (such as answer/seizure ratios) and call duration. This is understandable, since these metrics represent how much billable network usage was generated, and call quality on the PSTN was rarely an issue. Call quality is, however, an important issue for IP telephony, and can directly impact call duration. This drives the need for testing call quality.

The ITU, in recommendations P.800 and P.830, defines subjective testing for "listening quality" and "conversational quality." Listening quality is a one-way phenomenon and is affected by the clearness, or clarity, and the loudness of the speaker's voice as it is perceived by a listener.

Conversational quality is a two-way phenomenon and is affected by voice delay and echo, in addition to clarity and loudness. It thus becomes apparent that one must first test these customer-impacting parameters.

To understand how a network's performance contributes to these parameters, one must also test the underlying factors. Underlying factors that impact voice clarity include encoding and compression, time clipping from a voice activity detector, concealed and unconcealed packet loss, and excessive packet jitter (resulting in dropped packets). PSTN impairments, such as PCM quantization distortion, are already inherent in the de-facto standard of "PSTN quality."

Factors that contribute to delay include all processing delay inherent with packet capture, routing and queuing; transmission delay; voice encoding/decoding; and jitter buffering. Factors that contribute to the effect of echo are an echo signal's loudness and delay.

While it is important to determine the value of each of these underlying factors, it is difficult to determine the customer impact with just a measurement of an underlying factor. This underscores the need to test customer performance parameters, such as clarity, in addition to just packet loss. For example, a two percent packet loss may or may not be perceived by a customer.

A basic testing program can begin with five parts: benchmarking, baselining, detecting impairments, troubleshooting, and optimizing.

According to Webster, a benchmark is "a standard or point of reference in judging or measuring quality." Benchmarking in our industry involves the determination of performance objectives, which is needed to avoid costly overprovisioning or underprovisioning of a network. A network should be designed to meet objectives.

Benchmarks may be set by the limits of customer acceptance as determined by subjective testing, by comparison with the PSTN or other standard-bearer network, or by established industry standards. Subjective testing is very difficult and expensive, so one should turn to previous work in setting benchmarks.

For one-way delay in networks with echo adequately controlled, ITU G.114 recommends a maximum one-way transmission time of 150 milliseconds for most voice applications. A maximum of 50 of these 150 milliseconds is allocated for processing time; the remainder allows for the transmission time that may be needed on international connections.

G.114 provides a good objective for a VoIP network. However, an alternative may be to measure delay on the PSTN for a call with the same endpoints as a VoIP call, and to meet that objective with an additional time allocated for VoIP processing (e.g., PSTN delay plus 50 milliseconds).
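The delay budget described above can be sketched as a simple check. This is an illustrative sketch only: the function name and the split of processing versus transmission delay are assumptions layered on top of G.114's 150-millisecond guideline (with 50 milliseconds reserved for processing), not a prescribed procedure.

```python
# Hypothetical one-way delay budget check based on ITU-T G.114's
# 150 ms guideline, 50 ms of which is allocated to processing.
# Function and constant names are illustrative assumptions.

G114_ONE_WAY_BUDGET_MS = 150
G114_PROCESSING_BUDGET_MS = 50

def check_delay_budget(processing_ms, transmission_ms):
    """Return (within_total_budget, within_processing_budget) for one direction."""
    total_ms = processing_ms + transmission_ms
    return (total_ms <= G114_ONE_WAY_BUDGET_MS,
            processing_ms <= G114_PROCESSING_BUDGET_MS)

# Example: 20 ms codec + 30 ms jitter buffer = 50 ms processing,
# plus 60 ms transmission = 110 ms total: within both budgets.
ok_total, ok_processing = check_delay_budget(20 + 30, 60)
```

The same structure accommodates the PSTN-plus-50-milliseconds alternative: simply substitute the measured PSTN delay plus the processing allowance for the fixed budget.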

For echo, ITU G.131 provides valuable information on the effects of echo's loudness and delay on user acceptance. Included in G.131 is a graph showing the range of user acceptability as a function of echo loudness and one-way transmission time. This can provide benchmarks for echo loudness and delay.

While there are standards for how to measure voice clarity (or "speech quality"), there is no established standard for what values need to be met. The most widely recognized scale for speech quality is the Mean Opinion Score (MOS) listening quality scale (1 = bad to 5 = excellent). PSQM+ scores are also used, and are on a scale of 0 = perfect to 6 and higher = bad. One method for setting clarity benchmarks is to measure the speech quality of the PSTN. PSTN calls will typically fall above 3.5 on a MOS scale and below 2.5 on a PSQM+ scale. (For a more in-depth look, make sure to check out the sidebar article Methods For Measuring Speech Quality.)
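A PSTN-derived clarity benchmark check might look like the following sketch. The thresholds come from the typical PSTN figures cited above (MOS above 3.5, PSQM+ below 2.5); the function itself is hypothetical.

```python
# Sketch of a clarity benchmark check using the PSTN-derived values
# cited in the text. Names are illustrative, not from any real tool.

MOS_BENCHMARK = 3.5        # MOS scale: 1 = bad ... 5 = excellent
PSQM_PLUS_BENCHMARK = 2.5  # PSQM+ scale: 0 = perfect, 6+ = bad

def meets_clarity_benchmark(mos=None, psqm_plus=None):
    """A call passes if every supplied score meets its benchmark."""
    checks = []
    if mos is not None:
        checks.append(mos >= MOS_BENCHMARK)              # higher is better
    if psqm_plus is not None:
        checks.append(psqm_plus <= PSQM_PLUS_BENCHMARK)  # lower is better
    return all(checks)
```

Note the opposite polarity of the two scales: a MOS score must exceed its benchmark, while a PSQM+ score must fall below its benchmark.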

Baselining is the process of determining the nominal performance of a network. This is different than benchmarking. Benchmarking determines the performance objectives of a network; baselining determines how a network actually performs under operating conditions. For example, a benchmark for delay may be 150 milliseconds, but a network's baseline performance may consistently deliver 100 milliseconds of delay. This is important for detecting impairments. Delay measured at 145 milliseconds may fall within the benchmark bounds, but because it is more than 100 milliseconds, it may indicate an impairment (e.g., network congestion or a jitter buffer that is set too high).

Baselining is useful for setting thresholds to be used to detect impairments, for optimizing a network, for declaring performance standards for customers, and for establishing service level agreements (SLAs). Baselining is also useful in performing network VoIP readiness assessments to determine if an IP network is properly configured and provisioned to carry voice traffic.
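The relationship between baseline and benchmark described above can be expressed as a small sketch. The specific statistic chosen (baseline mean plus a few standard deviations, capped at the benchmark) is an illustrative assumption, not a method prescribed in the text.

```python
# Minimal baselining sketch: derive a nominal delay figure and an
# impairment-detection threshold from observed samples. Mean-plus-
# N-sigma is an assumed heuristic for illustration.
from statistics import mean, stdev

def baseline_threshold(delay_samples_ms, benchmark_ms=150, sigmas=3):
    """Threshold = baseline mean + N standard deviations, capped at the benchmark."""
    base = mean(delay_samples_ms)
    threshold = base + sigmas * stdev(delay_samples_ms)
    return base, min(threshold, benchmark_ms)

# A network that consistently delivers ~100 ms of delay will trigger
# an alert well before the 150 ms benchmark is reached.
base, threshold = baseline_threshold([98, 101, 100, 99, 102])
```

With this approach, a 145-millisecond measurement on the network above would be flagged as anomalous even though it falls within the benchmark, exactly the situation described in the delay example.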

Detecting Impairments
Once a network is operational, then it must be monitored to detect impairments. This requires thresholds, which can be determined from baselining and/or benchmarking. Thresholds may be dependent on network segments. For example, delay thresholds for impairment detection may be different depending on the call endpoints. However, it is important to remember that to a customer, a long call path does not provide justification for unacceptable delay.

Detecting impairments can be done with active testing and with passive testing, and using both is recommended. Passive testing is an inexpensive way to monitor certain underlying factors of network performance, such as packet loss and jitter. However, some parameters, like voice delay, can only be measured with active testing. Clarity can only be determined with active testing, but it can be predicted or estimated with passive testing using sophisticated software. Techniques for predicting MOS or other clarity scores from passively obtained metrics should be the next area of focus for industry measurement standards.

Detecting impairments should begin with measuring customer-impacting parameters, including clarity, delay, and echo. If a measurement of an underlying factor, such as packet loss or jitter, exceeds a threshold, then customer-impacting parameters should be measured to determine the severity of the impairment.

Troubleshooting is perhaps the most important part of network testing. It requires the most sophisticated tools and expertise, and efficient troubleshooting is essential to resolve customer-impacting problems quickly.

Troubleshooting VoIP network performance requires isolating an impairment and determining its cause. Voice quality analyzers are available that can expose many impairments that impact clarity, delay, and echo. For example, a clarity test may indicate packet loss as a suspect. One can then utilize a protocol analyzer with RTP analysis to isolate the network segment on which loss is occurring. Knowledge of the network is also critical. For example, packet loss may be a result of late packets dropped from a jitter buffer; measuring packet loss on network segments will not expose this, but measuring packet jitter and knowing the configuration of the jitter buffers will help.

By testing a network in segments, one can isolate the source of an impairment to one or a few systems. Then equipped with the right testing tools and knowledge of the network, one can quickly determine which underlying factors are contributing to the impairment.
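The packet jitter that an RTP analyzer reports in this kind of troubleshooting is typically the interarrival jitter estimator defined in RFC 3550 (the RTP specification). The sketch below implements that running estimate; the sample timestamps are invented for illustration, and a real analyzer would work from RTP timestamps and arrival times rather than explicit send/receive lists.

```python
# Sketch of the interarrival jitter estimator from RFC 3550 (RTP),
# the kind of per-stream metric a protocol analyzer computes.
# All timestamps are assumed to be in the same units (e.g., ms).

def rtp_jitter(send_times, recv_times):
    """Running jitter estimate: J += (|D| - J) / 16, per RFC 3550."""
    jitter = 0.0
    prev_transit = None
    for sent, received in zip(send_times, recv_times):
        transit = received - sent
        if prev_transit is not None:
            d = abs(transit - prev_transit)
            jitter += (d - jitter) / 16.0
        prev_transit = transit
    return jitter
```

A perfectly paced stream yields zero jitter; any variation in packet transit time pushes the estimate upward, which is what a jitter buffer must absorb.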

Optimizing a network for performance is an ongoing endeavor. There are many configurations and designs in a VoIP network that can be changed or tweaked to affect performance. Optimizing requires a carefully controlled test environment and process. It should not be done on an in-service network, but it can be done on a production network that is either partitioned for a test environment, or that has call resources taken out of service for this purpose.

The subject of network optimization is vast and complex. But in short, it involves careful testing of the impact of individual resources and configuration changes. For example, the impact of using a VAD (e.g., G.729b) in conjunction with a codec (e.g., G.729a) on voice clarity should be determined with all other network conditions kept the same.
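The one-change-at-a-time discipline described above can be sketched as a small test harness. Everything here is hypothetical: the `measure_clarity` callable stands in for a real voice-quality analyzer, and the configuration keys are invented for illustration.

```python
# Hypothetical harness illustrating controlled comparison: vary exactly
# one configuration item (e.g., VAD on/off) while the measurement
# function and all other conditions stay fixed. measure_clarity is a
# stand-in for a real voice-quality analyzer, not a real API.

def compare_one_change(measure_clarity, base_config, key, new_value, trials=10):
    """Return mean clarity scores for base_config vs. the same config with one change."""
    changed = dict(base_config, **{key: new_value})

    def mean_score(cfg):
        return sum(measure_clarity(cfg) for _ in range(trials)) / trials

    return mean_score(base_config), mean_score(changed)
```

Comparing the two returned scores isolates the clarity impact of the single changed item, such as enabling G.729 Annex B VAD alongside a G.729a codec, with all other network conditions held constant.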

Optimization also includes determining how much performance degradation can be accepted. For example, designing and provisioning a network for 0 percent VoIP packet loss may prove too costly, when in fact a 0.5 percent random loss can be accepted. In this respect, optimization means striking the right balance between quality performance and network utilization.

John Anderson is the IP telephony manager at Agilent Technologies, Network Systems Test Division. Agilent is a global technology leader in communications, electronics, life sciences, and healthcare. Agilent's Network Systems Test Division provides telecom equipment manufacturers, service providers, and enterprises with a suite of products for testing WAN, ATM, IP, and 3G networks. For more information about Agilent's IP telephony testing solutions, visit www.agilent.com/comms/voicequality/.


Methods For Measuring Speech Quality

There are several methods for measuring speech quality. The most obvious method is to have a large group of people subjectively rate the quality of a telephone call. The mean of a group's ratings under a certain set of test conditions is then used as a quality score. This is known as Mean Opinion Score (MOS) testing.

MOS testing has become the "benchmark" for determining the accuracy of speech quality measurement methods. However, it is very impractical to use in most situations. A repeatable mathematical method that can be automated with software is needed. Because of the non-linear and time-variant nature of VoIP networks, traditional metrics such as signal-to-noise ratios do not prove to be accurate when compared with subjective test results. However, three very effective and accurate methods have emerged: PSQM+, PAMS, and PESQ.

These methods are mathematical algorithms that are implemented in software. They are based on perceptual models and are accurate in predicting scores from subjective testing. Each method performs a comparison between two voice signals: an original input signal, and an output signal that represents the input signal after it has been processed by a system or transmitted across a network. The comparison is performed in a perceptually-modeled domain, and determines audible errors in the output signal.

The Perceptual Speech Quality Measurement (PSQM) was developed by researchers at KPN in the Netherlands. It was approved by the ITU in 1996 as recommendation P.861, and was the first perceptual speech measurement technique to be accepted as an international standard. Although PSQM was designed to objectively measure speech quality across low bit-rate codecs, it became popular as a way to measure speech quality across entire VoIP networks. One PSQM drawback, however, is that it does not accurately report the impact of distortion when that distortion is caused by packet loss or other types of transmission impairments. Therefore, an improved version known as PSQM+ was developed to accurately account for these network impairments.

PSQM+ scores represent a perceptual distance measure; that is, they reflect how far a distorted signal diverges from a clean reference signal. These scores range from 0 (indicating perfect quality) to values of six and higher (indicating poor quality). PSQM+ requires time-synchronization between the input and output signals, but the recommendation does not dictate a time-synchronization method.

The Perceptual Analysis Measurement System (PAMS) was developed by researchers within the Psytechnics Group at British Telecom (Psytechnics was later spun off into an independent company). PAMS was developed independent of PSQM and was first released in 1998. PAMS also uses perceptual modeling to objectively measure speech quality and to predict subjective test scores, but uses different signal processing and modeling techniques than PSQM. PAMS includes a precise time-synchronization technique that increases its robustness when used for networks with variable delays. PAMS produces scores that correlate directly with MOS Listening Quality and Listening Effort scores; that is, on a scale from one to five.

Beginning in 1998, the ITU solicited and reviewed drafts for a new standard. Upon reviewing submissions for the latest versions of PAMS and PSQM, it recommended a collaboration to combine these two leading methods. This collaboration between Psytechnics and KPN Research occurred, and the result was the Perceptual Evaluation of Speech Quality (PESQ). PESQ was approved by the ITU in March of 2001 as recommendation P.862, which replaces P.861.

PESQ combines the best merits of PAMS and PSQM+. It is very accurate in predicting subjective test scores, and is robust under severe network conditions. PESQ produces MOS-like scores to indicate overall quality. It is expected by many that PESQ will become the leading standard for speech quality measurement on telephony networks. 
