
August 1999
Make It Run Forever
BY JEFF LAWRENCE
Products and services undergo three phases of development: they evolve from the
"make it run" phase to the "make it run fast" phase, and finally reach
the "make it run forever" phase. As telephony, wireless, and Internet services
grow in complexity and economic value, the need for products and services to "run
forever" in the network will continue to gain importance.
Today's network users expect a service to be available upon request. This expectation
places a special burden on the service providers to ensure that all of the network
elements needed to provide a service are functioning when a user requests that particular
service. In other words, the availability of a service is truly ensured only when even the
weakest link in the chain of equipment and transmission facilities is available to provide
the service. Service availability depends on software, hardware, and network design as
well as on environmental and operational factors.
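The "weakest link" point can be made concrete with a little arithmetic. The sketch below is illustrative only: it assumes that a service needs every element in the chain to be up and that elements fail independently of one another, and the element names and availability figures are hypothetical.

```python
from math import prod

def series_availability(element_availabilities):
    """Availability of a service that needs every element in the chain.

    Assumes the elements fail independently, so the probability that
    all of them are up is the product of their availabilities.
    """
    return prod(element_availabilities)

# Hypothetical chain: access line, switch, transmission facility, service platform.
chain = [0.9999, 0.99999, 0.9995, 0.9999]
service = series_availability(chain)

print(f"service availability: {service:.6f}")
print(f"weakest element:      {min(chain):.6f}")
```

Under these assumptions the service as a whole is always somewhat less available than its weakest element, which is why every link in the chain matters.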
Public telephony network service providers understand these factors clearly and have
developed several approaches to ensure the availability of services from the public
network. In contrast, some enterprise networks operate with lower levels of service
availability.
Businesses using these enterprise networks accept these lower levels of service
availability because they can justify avoiding the perceived and actual expense of
ensuring high availability. However, as the industry matures, service providers and
carriers will increasingly demand telecommunications products that are designed to
minimize downtime and the associated revenue losses. For equipment manufacturers,
providing high availability will become an economic necessity.
RELIABILITY AND AVAILABILITY
To understand availability, we first need to understand reliability. The
reliability of an element is the conditional probability that the element will operate
without failure throughout a specified period of time, given that it was operating at the
start of that period. The availability of an element is the probability that
the element is in service and available to a user at any given instant in time.
Systems may go out of service for any number of reasons, such as the occurrence of a
fault, repair activities, software loading, hardware upgrading, or periodic maintenance.
For a system to achieve high availability, the duration of these interruptions must be as
short as possible. A system may be considered highly reliable (that is, it may fail very
infrequently), but, if it is out of service for a significant period of time during a
failure, it will not be considered highly available.
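One common way to quantify this distinction, offered here as an assumption rather than as the author's own formula, is the steady-state ratio of mean time to failure to the sum of mean time to failure and mean time to repair. The figures below are hypothetical and chosen only to show that an element that fails rarely but is slow to repair can be less available than one that fails more often but recovers quickly.

```python
def steady_state_availability(mean_time_to_failure_h, mean_time_to_repair_h):
    """Long-run fraction of time an element is in service.

    Uses the textbook ratio MTTF / (MTTF + MTTR); both arguments are in hours.
    """
    return mean_time_to_failure_h / (mean_time_to_failure_h + mean_time_to_repair_h)

# Hypothetical element A: fails very rarely, but each outage lasts two days.
rarely_fails_slow_repair = steady_state_availability(50_000, 48)

# Hypothetical element B: fails ten times as often, but recovers in three minutes.
fails_often_fast_repair = steady_state_availability(5_000, 0.05)

print(f"reliable, slow repair:      {rarely_fails_slow_repair:.6f}")
print(f"less reliable, fast repair: {fails_often_fast_repair:.6f}")
```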
HIGH AVAILABILITY
The network can be described as a collection of elements, including equipment
(such as switches, routers, gateways, and service platforms) and transmission
facilities (such as copper, cable, fiber-optic, and wireless technologies). Any of these
elements may fail because of incorrect design, environmental factors, physical defects, or
incorrect usage (that is, operator error). Incorrect usage and component mortality are
typically the most common causes of failure.
Network elements, such as telephone switching systems, typically operate with a target
availability of 99.999 percent, often referred to as "five nines" availability.
This level of availability translates to the equivalent of a telephone switch being
allowed to be out of service for only about five minutes per year. These few minutes per year
include all of the time needed to repair faults, load software, upgrade hardware, and
perform periodic maintenance and any other necessary activities.
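The downtime budget implied by an availability target is simple arithmetic. The sketch below assumes an average year of 365.25 days; the function and constant names are illustrative.

```python
MINUTES_PER_YEAR = 365.25 * 24 * 60  # 525,960 minutes in an average year

def downtime_minutes_per_year(availability):
    """Total out-of-service time per year allowed by an availability target."""
    return (1.0 - availability) * MINUTES_PER_YEAR

targets = [
    ("three nines", 0.999),
    ("four nines", 0.9999),
    ("five nines", 0.99999),
]
for label, availability in targets:
    minutes = downtime_minutes_per_year(availability)
    print(f"{label:12s} {availability:.5f}  {minutes:8.2f} minutes per year")
```

At five nines the budget is roughly 5.26 minutes per year, and that budget has to cover planned work such as software loads and hardware upgrades as well as unplanned faults.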
Telephone switching systems today are designed so that active calls are not lost and
only an insignificant number of calls in progress are mishandled during switching system
failure. Five nines availability is the standard toward which the Internet infrastructure
and services will need to strive if they are to fulfill the requirements of providing
services seamlessly across both the Internet and the PSTN.
Service providers and communications equipment manufacturers face the challenge of
deciding what level of availability is sufficient for each network element when it is
operating alone and when it is operating in conjunction with other network elements to
meet service requirements. Higher availability usually costs more, and it is often
difficult to determine whether the potential economic benefit of high availability is
worth the cost.
Some network elements may not be designed to support high availability because it is
assumed that their failure will cause minimal service disruption and have little economic impact
on the user or the service provider (for example, a single mobile handset). On the other
hand, some network elements will have to be specially designed to support very high
availability. A switch in the core of the network, through which tens of thousands of
connections are flowing, cannot be allowed to fail. If such a switch were to fail, the
real and potential lost revenue could be very significant.
High availability can be achieved using various design approaches that attempt to
strike a balance between meeting availability objectives and minimizing complexity and
cost.
HIGH AVAILABILITY NETWORK
In the context of a network, these design approaches apply not only to
the design of specific network elements but also to the arrangement,
interconnection, and communication among those network elements (which rely on
various organizational principles, routing protocols, and communications protocols).
The transport of voice over a circuit-switched network is resilient to bit errors. If
the conversation becomes too noisy, the network effectively relies on the
listening party to ask the speaking party to repeat themselves.
However, in the case of data transmission, the error detection and correction
mechanisms are more stringent and are designed to operate more reliably over transmission
facilities. The reliable and error-free transport of data and signaling information over
these protocols is critical to ensuring proper network operation. SS7, IP, ATM, and Frame
Relay are all being used for these purposes. The transport portion of the SS7 protocols,
for example, is connectionless and has specific features designed to ensure very low
latency, very quick error detection, and low message loss. In comparison, the IP and UDP
portions of the Internet protocols are also connectionless (TCP adds connection-oriented
delivery on top of IP), but they offer levels of performance different from those offered
by the SS7 protocols.
Routing protocols may come into play if a transmission facility has very high bit error
rates, or if a transmission facility is not available for some reason, or if a connected
node has failed. Within the public telephony network, SS7 protocols have been specifically
designed to allow the routing and rerouting of signaling messages across multiple links to
the same destination without message loss. Routing between SS7 network elements, such as
Service Switching Points, Signaling Transfer Points, and Service Control Points, is based
on element identifiers known as point codes. Point codes are hierarchical in
structure, and the SS7 network elements are frequently deployed as mated
pairs. Mated pairs are redundant nodes that are fully interconnected with each other
and the network. If one fails, then the other can continue performing the functions
of the failed node until the failed node is repaired.
Routing in the Internet follows a different approach. Internet network elements
typically consist of routers and servers that are identified by IP addresses. The concept
of mated pairs is not generally used within the Internet, and, in fact, routers are
typically not organized in any hierarchy. If a transmission facility or network fails,
messages may be lost. It is then up to the end user application to ensure that the routing
and rerouting of messages occurs. The probability of ensuring the successful rerouting of
messages increases as the diversity of the paths connecting the same two endpoints
increases.
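The effect of path diversity can be estimated with a short calculation. This is a sketch under a strong assumption: the paths share no facilities and fail independently of one another, which real networks only approximate. The per-path availability of 0.99 is hypothetical.

```python
def at_least_one_path_up(path_availabilities):
    """Probability that at least one of several diverse paths is usable.

    Assumes the paths fail independently, so the chance that all of them
    are down at once is the product of their individual unavailabilities.
    """
    all_down = 1.0
    for availability in path_availabilities:
        all_down *= 1.0 - availability
    return 1.0 - all_down

# Hypothetical endpoints joined by one, two, or three diverse paths,
# each of which is available 99 percent of the time.
for path_count in (1, 2, 3):
    probability = at_least_one_path_up([0.99] * path_count)
    print(f"{path_count} path(s): {probability:.6f}")
```

Each additional independent path multiplies the chance of total loss by that path's unavailability. If the paths share a conduit or a power feed, the independence assumption breaks down and the real gain is smaller.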
The future integration of the SS7 signaling protocols with the Internet protocols to
provide unified signaling will generate a number of technical challenges, since ways must
be found for the Internet protocols to provide the same level of service as the transport
portion of the SS7 protocols.
HIGH AVAILABILITY EQUIPMENT
There are several design approaches to ensure high availability for individual
network elements. The simplest type of network element is non-redundant and must be
repaired off-line if it fails. This type of element will have relatively low design
complexity and low cost. (Depending on its reliability, it may also have low
availability.) In contrast, high availability elements require both the ability to support
on-line repair (usually through the hot swap of components while the element
is in service) and additional redundancy.
Elements with additional redundancy typically use retry and
masking for recovery. Retry-based elements attempt to ensure that there is a
second attempt at the operation if an initial operation fails. If the second attempt
succeeds, the fault was probably transient. If the second attempt fails, the fault is
probably permanent. Masking-based elements attempt to ensure that only the results from
the correctly operating portion of the element are used if a component fails. In either
case, if a component has failed, an attempt to diagnose, confine, and compensate for the
fault is undertaken.
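The two recovery styles can be sketched in a few lines. Everything here is illustrative: the function names are hypothetical, the faults are simulated, and majority voting across redundant components is only one way to implement masking, chosen here as an assumption because the text does not name a specific mechanism.

```python
from collections import Counter

def retry_once(operation):
    """Run an operation, retry once on failure, and classify the fault."""
    try:
        return operation(), "none"
    except Exception:
        pass  # first attempt failed; fall through to the second attempt
    try:
        return operation(), "transient"   # second attempt succeeded
    except Exception:
        return None, "permanent"          # second attempt also failed

def mask_by_majority(results):
    """Return the value that a strict majority of redundant components report."""
    value, votes = Counter(results).most_common(1)[0]
    if votes * 2 <= len(results):
        raise RuntimeError("no majority; the fault cannot be masked")
    return value

def make_flaky(failures_before_success):
    """Build a simulated operation that fails a fixed number of times first."""
    state = {"remaining": failures_before_success}
    def operation():
        if state["remaining"] > 0:
            state["remaining"] -= 1
            raise IOError("simulated fault")
        return "ok"
    return operation

print(retry_once(make_flaky(1)))    # one failure, then success
print(retry_once(make_flaky(2)))    # fails on both attempts
print(mask_by_majority([7, 7, 9]))  # one of three components disagrees
```

In either style, a real element would go on to diagnose and confine the faulty component; the sketch stops at classifying or outvoting it.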
Hardware fault tolerance typically relies on redundant processors, memory, buses, power
supplies, and disk storage. Software fault tolerance uses a combination of software
redundancy and simple hardware redundancy to provide the necessary availability in the
case of failure. Depending on the approach that is chosen, one or more of the redundant
components may be operating simultaneously. Hardware fault-tolerant approaches can
typically support higher levels of performance than software fault-tolerant approaches.
Software fault tolerance, on the other hand, eliminates the need for complex
circuits, significantly decreasing design complexity and cost.
CONCLUSION
High-availability products and services will play an increasingly important role
in the network of the future. The availability of services is a complicated function of
the equipment design and also of the network design. Different design choices will result
in different levels of complexity, cost, performance, potential information loss, and (of
course) availability. The reliability and availability of the public telephone network has
set the standard against which future services from an integrated Internet and PSTN
infrastructure will be measured.
Jeff Lawrence is president and CEO of Trillium Digital Systems, a leading provider
of communications software solutions for computer and communications equipment
manufacturers. Trillium develops, licenses, and supports standards-based communications
software solutions for SS7, ATM, ISDN, frame relay, V5, IP, and X.25/X.75 technologies.
For more information, visit the company's Web site at www.trillium.com.