Carrier-Class IP Telephony: What Will It Take?
BY SCOTT McNUTT & HENRIK SORENSEN
IP telephony has changed dramatically over the last two to three years. It has migrated
from the realm of hobbyists making "free" phone calls via the Internet, to the
corporate boardrooms of all major telecommunications service providers. Although the
fundamental principles have stayed the same, this migration has imposed a paradigm shift
on IP telephony equipment vendors.
A visible sign of this paradigm shift is the "carrier-class" label attached
to most IP telephony products and services. It creates images of big, rock-hard networks
operating flawlessly day in and day out. But beyond these images, what does
"carrier-class" really mean? This article examines the three quality
characteristics that define a carrier-class system: Scalability, Interoperability, and
Reliability.
HIGH SCALABILITY
Scalability is a figure of merit that describes how cost-effective a system is over a
specified capacity range. Phrases like "It doesn't scale very well," or "It
scales well on the low end," are common. In general, "high scalability"
means a system supports a wide dynamic capacity range in a cost-effective manner.
Cost-effectiveness, in this sense, does not simply mean low equipment cost. It also includes
the cost of items such as installation, maintenance, operating expenses, and capacity
changes.
Carrier-class IP-telephony systems must be highly scalable and are expected to cover
capacities from a few thousand calls to potentially several hundred thousand calls. The
dynamic property of scalability addresses the system's ability to cost-effectively change
capacity after the initial installation. This is particularly important for IP telephony
systems, since it is likely they will initially be installed with limited capacity,
followed by rapid deployment.
During the introduction phase, the system's operation is verified, personnel are
trained, and maintenance processes are implemented. The system's capacity is then rapidly
increased during the deployment phase as older systems are removed from service. The
system eventually enters the general availability phase where capacity changes are small
and infrequent. Finally, when the system reaches the end of its useful life, its capacity
is reduced as newer systems are introduced.
The rapid changes in system capacity during the deployment and end-of-service phases
represent the greatest risks. This is when a system's scalability is most critical, since
thousands of customers can be affected as their service is switched from one system to
another. Carrier-class scalability is much more than just the cost of adding equipment --
it considers all of the costs associated with safely and efficiently changing capacity
with minimal or no disruption of service. Doing that effectively requires a powerful
network management system.
INTEROPERABILITY
Interoperability is sometimes defined as the ability of competing vendors' equipment to
establish a call from a signaling perspective, and as such, is clearly a must for
carrier-class solutions. However, we prefer a broader definition that encompasses all
features needed for competing vendors' equipment not just to coexist in the same network,
but to successfully share the responsibility of providing services for a particular call.
Thus, common network models must be developed that specify the network elements needed,
and well-defined protocols must exist for interconnecting these elements. Even more
importantly, common network management models must exist, such that calls handled by one
part of the network can seamlessly be transferred to another in case of network or
equipment failure.
Network management is also crucial in handling the inevitable congestion problems,
which are a particularly thorny issue for large carriers using IP trunks in the core of
their networks. IP was designed to maximize the average utilization of the
network, but it puts no bounds on worst-case performance, which is the governing factor
for voice communication. The new IPv6 protocol incorporates a first attempt at providing
Quality of Service (QoS) features within the protocol by adding a priority and a
flow-label field to the packet header -- but effective use of these fields will require
stringent agreement between vendors and thus presents new interoperability issues.
Security is yet another important component of interoperability. Unauthorized access,
or worse, an outright malicious attack, costs service companies untold amounts in lost
revenue. To effectively combat these problems, each equipment element, or network segment,
must be able to authenticate other elements or segments to spot unwanted intruders. Within
the ITU H.323 standards umbrella, security issues are being addressed within the H.235
specification. This work is still in its infancy, but is supplying the framework for
implementing specific security profiles.
HIGH RELIABILITY
Reliability is perhaps the most notable characteristic of carrier-class systems. It's the
probability that a system does not fail during a given period of time. The scope of
reliability can be extremely broad. It includes everything from component reliability,
manufacturing materials, and processes to infant mortality rates and software quality
management systems. But taken as a whole, these seemingly endless component-level
requirements contribute to one primary system-level objective: Minimize failures and their
effects on call handling ability. System-level reliability requirements are specified as
availability objectives, which define limits on the amount of time call handling ability
can be affected.
Availability
When a system is originating, terminating, or carrying calls, it's available. Availability
is the probability that a system is available at any given time, and is normally expressed
as a percent. Its complement, downtime, is the amount of time the system is not available,
and is normally expressed in minutes per year. Any event or activity that
prevents the system from operating at its specified capacity reduces its availability.
This includes hardware and software failures as well as maintenance activities.
Downtimes are weighted sums that include contributions from all failures (or potential
failures) that affect service. For example, consider a 100-port system where a single
analog line fails to operate for 9 minutes/year. The downtime objective for that line has
been met (<18 min/year) and the contribution to the partial system outage time is 1
percent (1 line of 100 total) of 9 minutes, or 5.4 seconds. If routine testing were
performed on the same line, during which time calls can be neither originated nor received,
the test time would also contribute to its downtime.
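
The weighting in the 100-port example can be checked with a few lines of arithmetic. The
sketch below is only an illustration: the function and variable names are hypothetical, and
it assumes every line carries equal weight, as the example implies.

```python
# Illustrative sketch only: names are hypothetical, and every line is
# assumed to carry equal weight, as in the 100-port example.

LINE_OBJECTIVE_MIN_PER_YEAR = 18.0  # per-line downtime objective quoted in the text


def weighted_outage_seconds(line_downtime_min, lines_affected, total_lines):
    """Return the contribution of a partial outage to system downtime, in seconds."""
    if total_lines <= 0 or not 0 <= lines_affected <= total_lines:
        raise ValueError("lines_affected must be between 0 and total_lines")
    weight = lines_affected / total_lines  # fraction of capacity affected
    return line_downtime_min * weight * 60.0


line_downtime = 9.0  # minutes per year for the single failed line
contribution = weighted_outage_seconds(line_downtime, 1, 100)
objective_met = line_downtime < LINE_OBJECTIVE_MIN_PER_YEAR

print(f"line objective met: {objective_met}")
print(f"system contribution: {contribution:.1f} seconds/year")
```

Test time on the line could be added to `line_downtime` before weighting, since it
also counts toward that line's downtime.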
Minimizing system downtime is clearly a key goal within a carrier-class system. If the
system is down, customers are not being served and revenues are not being generated. In
order to meet the availability objectives, carrier-class systems must be both fault
tolerant and maintainable. Fault tolerance strategies focus on preventing failures from
affecting service. Maintenance strategies assume service will ultimately be affected, and
network management models therefore focus on minimizing the time it takes to restore
service and repair failures.
Fault Tolerance
Even some of the highest-quality products fail, often during times of stress when we need
them the most. Fault tolerant systems seek to minimize the effects component failures have
on service capacity. Carrier-class systems, not surprisingly, employ many traditional
redundancy techniques to achieve high levels of fault tolerance, but the most
distinctive are load sharing, load balancing, and diversity. Each of these techniques is,
in some sense, a form of redundancy, and each is designed to minimize the number of
ineffective call attempts and/or the call cutoff rate.
Load sharing is primarily focused on reducing ineffective call attempts. It does this
by sharing service-related resources such that a failure in any one or more of the
resources will not necessarily prevent a call from being completed. An example of load
sharing is an IP telephony system that includes a pool of codecs that are shared among all
of its analog lines. If there were just enough codecs to handle the system's peak load, a
single codec failure would affect service only during peak load periods. During average
load periods, service would remain unaffected, since any one of the remaining functional
codecs may be used. Clearly then, if there were more codecs than necessary, the system
could suffer the loss of several codecs before service was affected. However, if the
system is designed such that a particular codec serves a single line, a failure in that
codec would prevent all calls on that line from being completed.
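
The difference between a shared codec pool and dedicated codecs can be sketched with a
small model. Everything here is an assumption made for illustration: the function names,
the call counts, and the simplification that each call needs exactly one working codec.

```python
# Illustrative model only: one codec per call, hypothetical names and loads.

def blocked_calls_pooled(offered_calls, pool_size, failed_codecs):
    """Calls that cannot be completed when any working codec can serve any line."""
    working = max(pool_size - failed_codecs, 0)
    return max(offered_calls - working, 0)


def blocked_calls_dedicated(calling_lines, lines_with_failed_codec):
    """Calls that cannot be completed when each line depends on its own codec."""
    return len(set(calling_lines) & set(lines_with_failed_codec))


PEAK_LOAD = 40     # simultaneous calls at peak (assumed)
AVERAGE_LOAD = 25  # simultaneous calls on average (assumed)

# Pool sized exactly for peak load, one codec failed.
peak_blocked = blocked_calls_pooled(PEAK_LOAD, pool_size=40, failed_codecs=1)
avg_blocked = blocked_calls_pooled(AVERAGE_LOAD, pool_size=40, failed_codecs=1)

# Pool with five spare codecs, five codecs failed.
spare_blocked = blocked_calls_pooled(PEAK_LOAD, pool_size=45, failed_codecs=5)

# Dedicated codecs: line 7 has lost its codec and tries to place a call.
dedicated_blocked = blocked_calls_dedicated({3, 7, 12}, {7})

print(peak_blocked, avg_blocked, spare_blocked, dedicated_blocked)
```

In this model the pooled system blocks a call only when the offered load exceeds the number
of working codecs, while the dedicated design blocks every call on the affected line
regardless of load.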
Load balancing attempts to minimize the call cutoff rate by balancing the number of
active calls across independent platforms. If a particular platform were to fail, only a
portion of the active calls would be affected. An IP telephony system, for example, could
balance calls among multiple "media gateways" that contain the codecs of the
previous example. One of the side effects of load balancing is "graceful
degradation," which simply means that the system gracefully loses capability as
failures occur rather than being "dragged to its knees."
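
A rough sketch of the load-balancing idea follows. It assumes a simple even spread of
active calls across identical gateways; the function names and the call and gateway counts
are hypothetical, and a real system would balance on live load rather than a fixed split.

```python
# Illustrative sketch only: even distribution across identical gateways.

def balance_calls(num_calls, num_gateways):
    """Spread calls as evenly as possible; returns the call count per gateway."""
    if num_gateways <= 0:
        raise ValueError("num_gateways must be positive")
    base, extra = divmod(num_calls, num_gateways)
    return [base + (1 if i < extra else 0) for i in range(num_gateways)]


def cutoff_fraction(loads, failed_gateway):
    """Fraction of active calls cut off when one gateway fails."""
    total = sum(loads)
    return loads[failed_gateway] / total if total else 0.0


balanced = balance_calls(1000, 4)   # four media gateways share the calls
single = balance_calls(1000, 1)     # everything on one platform

balanced_loss = cutoff_fraction(balanced, failed_gateway=0)
single_loss = cutoff_fraction(single, failed_gateway=0)

print(balanced, balanced_loss)
print(single, single_loss)
```

With the calls spread over four gateways, a single platform failure cuts off a quarter of
the active calls in this model; with one platform, it cuts off all of them.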
Maintainability
With typical availability targets of 99.999 percent or better (less than six minutes of
downtime per year), a carrier-class system must include robust support for maintenance
activities such as problem detection, notification, isolation, and repair, as well as
service recovery. Maintainability is a reliability attribute that describes how well a
system supports these maintenance activities.
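
The relationship between an availability percentage and annual downtime is simple
arithmetic, sketched below. The function name is hypothetical, and the calculation assumes
a 365-day year.

```python
# Illustrative sketch only: converts an availability target to yearly downtime.

MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600; leap years ignored


def downtime_minutes_per_year(availability_percent):
    """Return the downtime, in minutes per year, implied by an availability target."""
    if not 0.0 <= availability_percent <= 100.0:
        raise ValueError("availability_percent must be between 0 and 100")
    return (100.0 - availability_percent) / 100.0 * MINUTES_PER_YEAR


five_nines = downtime_minutes_per_year(99.999)

for target in (99.9, 99.99, 99.999):
    print(f"{target}% -> {downtime_minutes_per_year(target):.2f} minutes/year")
```

For 99.999 percent this gives roughly 5.26 minutes per year, consistent with the "less
than six minutes" figure above.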
Given that failures are inevitable, carrier-class systems must be able to detect
problems quickly -- ideally before service is affected. Therefore, any equipment failures
that can potentially cause 1 percent or more of a system's capacity to be affected must be
continuously monitored. This includes operational as well as standby equipment. A
10,000-port IP telephony gateway, for example, would have to continuously monitor a line
interface unit if it supported more than four T1 facilities -- a scenario that is highly
likely.
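
The 1 percent rule in the gateway example can be checked as follows. This is a sketch under
stated assumptions: each T1 carries 24 voice channels, a line interface unit's capacity is
counted in whole T1s, and the function name is hypothetical.

```python
# Illustrative sketch only: applies the 1 percent monitoring rule to a unit.

CHANNELS_PER_T1 = 24  # voice channels carried by one T1 facility


def must_monitor(unit_ports, system_ports, threshold_percent=1):
    """True if a failure of this unit could affect threshold_percent or more of capacity."""
    if system_ports <= 0:
        raise ValueError("system_ports must be positive")
    # Integer comparison avoids floating-point rounding at the boundary.
    return unit_ports * 100 >= threshold_percent * system_ports


SYSTEM_PORTS = 10_000

four_t1 = must_monitor(4 * CHANNELS_PER_T1, SYSTEM_PORTS)  # 96 ports
five_t1 = must_monitor(5 * CHANNELS_PER_T1, SYSTEM_PORTS)  # 120 ports

print(f"4 x T1 unit must be monitored: {four_t1}")
print(f"5 x T1 unit must be monitored: {five_t1}")
```

Four T1s amount to 96 ports, just under 1 percent of 10,000, which is why the threshold is
crossed only when a unit supports more than four.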
Once a problem is detected, maintenance personnel must be informed in order to effect
repair. The network management system does this by communicating with an operations
system, or "OS," that in turn notifies them of the problem. In addition to notification,
the network management system must also automatically isolate the problem and provide a
prioritized list of replaceable components where the problem is most likely located. The
problem must then be repaired and service restored within a specified mean time to repair.
CONCLUSION
The IP telephony industry has come a long way in a short period of time, but there are
still many unresolved issues. Customers have come to expect that their phone works with no
"ifs, ands, or buts" and are going to expect the same service in the future. To
provide such a solution, the equipment vendors must solve the problems addressed in this
article, and a solid network management foundation is clearly the nucleus of the overall
solution.
Scott McNutt, systems engineer, and Dr. Henrik Sorensen, VP of Advanced Technology,
are part of the advanced technology products team at elemedia, a wholly-owned software
venture of Lucent Technologies. elemedia is a leading provider of H.323-based software
toolkits that enable high-quality solutions for Internet telephony and multimedia
communications. Developed by engineers with years of experience in the technologies
required for sophisticated telephony networks, elemedia's products link today's networks
with tomorrow's while promoting standards and interoperability. For more information on
elemedia, visit the company's Web site at www.elemedia.com.