
June 1999
High Availability - Open Systems Rise To The Challenge
BY BROUGH TURNER
Highly available, even fault tolerant, computer systems have become critical in many
industries. But requirements differ markedly between industries. In the world of finance,
transactions must reach predictable results once they are started - even if the power,
communication line, or central computer fails. You want to be sure that if you request
money from an Automatic Teller Machine (ATM), the money comes out of the machine before
your bank account is debited. It is not desirable, but it is acceptable, for the
transaction to take 1 minute instead of 20 seconds, or for an ATM at a particular location
to be temporarily unavailable. But debiting your account when no money is received is
unacceptable.
In telecommunications services, the first priority is availability of the service, not
maintaining the consistency of transactions in progress. When you pick up the phone, you
expect to hear a dial tone within a second, and to be able to place a call. If you are in
the middle of a call, it is not desirable for the call to be dropped, but as long as you
can pick up the phone and get dial tone again, it's an unfortunate glitch, but not a
failure to deliver telecommunications service. Unlike financial transactions where the
overriding consideration is to get the transaction correct, in the telecom world the
overriding consideration is to provide the service.
These realities lead to some subtle differences in system architecture. But the
underlying hardware and software components are substantially the same. In each market
there is a history of special purpose hardware, whether the hardware involves computer
systems from Tandem (now Compaq Computer) or Stratus, or computers embedded within central
office (CO) switches from Lucent or Nortel.
The heart of today's CO switch is a special purpose, highly available computing
system, designed to localize failures, bring standby components on line quickly, and
support 99.999 percent service availability. But now, just as the CO switch is being
threatened by IP telephony, the basic approach to the computers that control
telecommunications equipment is also under fire. Today, open telecommunications technology
is providing a way to create highly available services using off-the-shelf components.
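To put the 99.999 percent figure in perspective, here is a short sketch that converts an availability target into allowable downtime per year. The function name and the 365.25-day average year are assumptions made for illustration.

```python
# Sketch: yearly downtime implied by an availability target.
# Assumption: an average year of 365.25 days.
MINUTES_PER_YEAR = 365.25 * 24 * 60


def downtime_minutes_per_year(availability_percent):
    """Return how many minutes per year the service may be unavailable."""
    unavailable_fraction = 1.0 - availability_percent / 100.0
    return unavailable_fraction * MINUTES_PER_YEAR


five_nines = downtime_minutes_per_year(99.999)
print(f"99.999 percent allows about {five_nines:.2f} minutes of downtime per year")
```

Under these assumptions, "five nines" leaves a little over five minutes of total outage per year.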
SHARING THE LOAD
With the advent of PC-based computer telephony, it became possible to create highly
reliable services - in telecommunications terms - using distributed solutions. For
example, if the telecommunications traffic for a specific service is being carried by 50
separate PCs and one of them fails, 2 percent of the calls in progress at that instant
will be dropped. But users will be able to immediately re-establish their calls because
the remaining 98 percent of the system is still functioning. This scenario can be viewed
as providing a "highly available" service as defined in the telecom industry.
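The arithmetic behind the 50-PC example can be sketched in a few lines. The sketch assumes calls are spread evenly across the machines; the figure of 10,000 calls in progress is hypothetical and is used only to make the numbers concrete.

```python
def impact_of_single_failure(machine_count, calls_in_progress):
    """Return (dropped, surviving) calls when one machine fails.

    Assumption: calls are spread evenly across all machines.
    """
    if machine_count < 1:
        raise ValueError("machine_count must be at least 1")
    dropped = calls_in_progress / machine_count
    return dropped, calls_in_progress - dropped


# Hypothetical load: 10,000 calls in progress across 50 PCs.
dropped, surviving = impact_of_single_failure(50, 10_000)
percent_dropped = 100.0 * dropped / 10_000
print(f"{percent_dropped:.0f} percent of calls dropped, {surviving:.0f} calls unaffected")
```

The same function shows why a single large machine is the worst case: with one machine, every call in progress is lost.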
Indeed, even before the advent of CompactPCI, industrial PCs had made some inroads
providing enhanced services in the public network.
This architecture can also be seen in action at many points in the Internet, for
example, in the equipment used by a company like Yahoo. Yahoo uses many, many mass-market
computing devices (PCs) to provide a very high capacity service that is constantly
available. Telecom services are more challenging than Web hosting in part because telecom
has to combine new-generation technology with legacy equipment, but the extra challenges
of telecom are being addressed.
KEEPING USERS ON LINE
Let's begin by looking at the individual subscriber connections. In a traditional CO
switch, there are wires from each subscriber that terminate at a line card. It is not
economically feasible, or even rational, to provide duplicate (that is, redundant) line
cards since they fail very infrequently. In traditional CO switches, service outages are
minimized by limiting the number of users per line card, being able to detect failures and
impending failures, and being able to hot-swap failed cards. This localizes and minimizes
any service outage and ensures that users connected to the failed card are back on line
very quickly.
With CompactPCI we can provide the same functionality in an open telecommunications
platform. Subscriber lines can be terminated on line cards with built-in test facilities,
and these line cards can be hot-swapped in the event of a failure. In a CompactPCI
chassis, the line cards are connected together for telephony purposes by the CT Bus
(H.110). The CT Bus has 32 separate serial paths and totally redundant clocking, providing
highly available interconnection at the chassis level. In traditional telecommunications
systems, the line card chassis (also known as a peripherals shelf) may or may not come
with redundant "shelf controllers." This is a commercial trade-off between
possible outages of a larger group of subscribers versus added cost. Similarly, with
CompactPCI we can choose chassis with single or redundant CPUs.
DISTRIBUTING THE SWITCHING FABRIC
To create a next-generation CO capable of supporting hundreds of thousands of subscribers,
you would need many CompactPCI chassis. Obviously it is necessary to interconnect these
chassis in a way that is either redundant or highly distributed. One option is to use
Asynchronous Transfer Mode (ATM) inter-chassis links, also known as MC4, a technology
available today from vendors such as InnoMediaLogic (IML). IP-based approaches, running on
gigabit Ethernet links, are also possible; however, for this application, IP is still an
emerging technology. Whether ATM or IP, we can easily craft both redundant and distributed
inter-chassis interconnections.
THE SOFTWARE TO CONTROL IT ALL
So far we've concentrated on the hardware. The biggest issue is to make the software in
this distributed system as robust as the hardware. One approach would be to use open
computing versions of traditional methods - for example, use a redundant UNIX server
running an Oracle NonStop database to keep track of all the configuration, billing, and
resource management issues. This may be the most direct approach, but lower cost,
distributed solutions can achieve the same result.
Managing The Configuration Information
A major issue for large telecommunications systems is keeping track of configuration
information - the line cards, the subscribers, the services each subscriber is entitled
to, where the trunks are connected, etc. This is a "read-mostly" database; it is
written to only when performing what is called "operations, administration, and
maintenance" (OA&M) - functions such as adding new subscribers, changing trunks,
and reconfiguring the system.
You don't really need a fault-tolerant transaction processing database system to track
this information, as long as there are several replicas and relevant subsets are cached in
each of the distributed processors in the system, with cache updates at user-defined
intervals. Database replication is appearing in a variety of commercial software. And as
long as each processor in the distributed switching system has a local copy of the
necessary portion of the configuration information, the system will be able to function,
even if the central database goes down.
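The caching idea described above might look something like the following sketch. All names here (LocalConfigCache, CentralConfigUnavailable, fetch_subset) are hypothetical; a real system would use whatever replication facility its database provides. The point is only that a processor keeps serving its last good copy when a refresh fails.

```python
import time


class CentralConfigUnavailable(Exception):
    """Raised by the (hypothetical) central store when it cannot be reached."""


class LocalConfigCache:
    """Holds one processor's subset of the configuration data."""

    def __init__(self, fetch_subset, refresh_interval_s, clock=time.monotonic):
        self._fetch_subset = fetch_subset          # callable returning a dict
        self._refresh_interval_s = refresh_interval_s
        self._clock = clock
        self._data = {}
        self._last_refresh = None

    def _refresh_if_due(self):
        now = self._clock()
        due = (self._last_refresh is None
               or now - self._last_refresh >= self._refresh_interval_s)
        if not due:
            return
        try:
            self._data = dict(self._fetch_subset())
            self._last_refresh = now
        except CentralConfigUnavailable:
            pass  # central database is down: keep serving the last good copy

    def lookup(self, key):
        self._refresh_if_due()
        return self._data.get(key)


# Hypothetical central store with one subscriber record.
central = {"up": True, "records": {"555-0100": {"call_waiting": True}}}


def fetch_subset():
    if not central["up"]:
        raise CentralConfigUnavailable()
    return central["records"]


cache = LocalConfigCache(fetch_subset, refresh_interval_s=0)
before = cache.lookup("555-0100")
central["up"] = False            # simulate a central database outage
after = cache.lookup("555-0100")
print(before == after)
```

A refresh interval of zero is used here only to force a refresh attempt on every lookup; in practice the interval would be the user-defined value mentioned above.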
Managing Billing Information
Another centralized service maintained by the CO is billing. Typically, a centralized
database accumulates the actual records of all calls or other transactions that are
billable. In a distributed system, this information is generated on the individual
processors setting up the calls.
It is true that each processor is a single point of failure, but as long as it is
functioning, it can be generating Call Detail Records (CDRs). And if individual processors
broadcast their CDRs over an IP network - a redundant LAN if desired - then two or more
machines can be configured to accumulate billing information (redundantly). IP
broadcast is well understood, as are redundant Ethernets to interconnect with the machines
recording the call detail information.
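As an illustration, the redundant collection of CDRs can be simulated in memory. This is a sketch, not a network implementation: the broadcast is a simple loop rather than real IP broadcast, and the record fields, the per-processor sequence number, and the collector names are all assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CallDetailRecord:
    processor_id: int   # which call processor generated the record
    sequence: int       # hypothetical per-processor counter, used to spot duplicates
    caller: str
    callee: str
    duration_s: int


class Collector:
    """A machine that accumulates raw CDRs for later batch billing."""

    def __init__(self, name):
        self.name = name
        self.online = True
        self.records = {}

    def receive(self, cdr):
        if self.online:  # a rebooting collector simply misses the record
            self.records[(cdr.processor_id, cdr.sequence)] = cdr


def broadcast(cdr, collectors):
    """Stand-in for IP broadcast: every collector is offered every record."""
    for collector in collectors:
        collector.receive(cdr)


def merge_for_billing(collectors):
    """Batch step: combine all collectors' records, dropping duplicates."""
    merged = {}
    for collector in collectors:
        merged.update(collector.records)
    return [merged[key] for key in sorted(merged)]


a, b = Collector("billing-a"), Collector("billing-b")
broadcast(CallDetailRecord(2, 1, "555-0100", "555-0101", 60), [a, b])
a.online = False    # billing-a is being rebooted
broadcast(CallDetailRecord(2, 2, "555-0102", "555-0103", 30), [a, b])
a.online = True
broadcast(CallDetailRecord(7, 1, "555-0104", "555-0105", 90), [a, b])

billable = merge_for_billing([a, b])
print(len(a.records), len(b.records), len(billable))
```

Even though one collector missed a record while it was down, the merged set is complete, which is all the batch billing run needs.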
The actual processing of the billing information then becomes a batch process. You may
want to run this batch process frequently to provide hour-by-hour billing. But, if there
is a delay of 15 minutes because one of the systems needs to be rebooted, raw information
may be collected on another machine in real time, and subscribers may continue receiving
service.
The worst that can happen is that one of the call processing boxes fails. In this case,
the calls in progress would be dropped and the users would not be billed - a good thing
under the circumstances. The important thing is the total system still provides dial tone.
That is the key issue for high availability in telecommunications.
Managing Shared Resources
The final hurdle in creating a highly available CO using distributed, mass-market
computing devices is figuring out how to manage resources where control must be shared.
It's one thing to have read access to a replica of a configuration database. It is another
thing to share control of the inter-chassis switch fabric.
In a traditional CO, there is a centralized database that describes the
second-by-second utilization of the switch fabric and determines routes for each new call
through the system. One option in a distributed system is to use a highly reliable, and
expensive, central database system to allocate virtual circuits - over the MC4 links -
between the chassis. A more distributed approach would be to configure and partition these
virtual circuits in advance and then allow each chassis to manage the traffic it is
sending over its portion of the system.
The success of this distributed approach has been demonstrated. The MC1 multi-chassis
MVIP systems (and the equivalent SCXbus) provide a switch fabric for interconnecting
multiple PCs. Its capacity is limited to 1500 timeslots, and its copper cable is limited
to 15 meters interconnecting a dozen or so chassis. However, its operation is instructive.
The transmit rights for specific paths within the cable are divided up and assigned to
individual processors. This means that the resource allocation for conversation paths in
the cable is distributed at configuration time. There is no central database involved in
a call between chassis on the cable. For example, to connect a call between chassis 2 and
chassis 7, you send messages (over redundant Ethernet, for example) to both chassis
telling them to allocate transmit timeslots and report back. Each chassis allocates the
timeslot it will transmit on using a free path from those it was pre-assigned. When you
get back the two transmit assignments, you tell chassis 2 to listen on the timeslot that
chassis 7 is transmitting on and vice versa. Each chassis has complete control of a subset
of the switch fabric so the configuration database is completely distributed. This makes
less efficient use of the switch fabric, but the database is dramatically simplified. With
the cost of inter-chassis capacity dropping, using a completely distributed architecture
can be economically justified.
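The call setup between chassis 2 and chassis 7 described above can be sketched as follows. The class and function names are hypothetical, the messages over Ethernet are replaced by direct method calls, and the equal split of 1500 timeslots among 12 chassis is an assumption chosen to match the rough figures in the text.

```python
class Chassis:
    """One chassis that owns a pre-assigned set of transmit timeslots."""

    def __init__(self, chassis_id, transmit_timeslots):
        self.chassis_id = chassis_id
        self._free = sorted(transmit_timeslots)  # slots this chassis may transmit on
        self.listening_on = set()                # remote slots this chassis listens to

    def allocate_transmit_timeslot(self):
        """Pick a free slot from the local pool; no central database is consulted."""
        if not self._free:
            raise RuntimeError(f"chassis {self.chassis_id} has no free transmit timeslots")
        return self._free.pop(0)

    def release_transmit_timeslot(self, timeslot):
        self._free.append(timeslot)
        self._free.sort()

    def listen_on(self, timeslot):
        self.listening_on.add(timeslot)


def partition_timeslots(total_timeslots, chassis_ids):
    """Configuration-time step: hand each chassis an equal, disjoint block of slots."""
    per_chassis = total_timeslots // len(chassis_ids)
    return {
        cid: range(index * per_chassis, (index + 1) * per_chassis)
        for index, cid in enumerate(chassis_ids)
    }


def connect_call(first, second):
    """Controller logic: collect both transmit assignments, then cross-connect."""
    tx_first = first.allocate_transmit_timeslot()
    try:
        tx_second = second.allocate_transmit_timeslot()
    except RuntimeError:
        first.release_transmit_timeslot(tx_first)  # do not leak the first slot
        raise
    first.listen_on(tx_second)
    second.listen_on(tx_first)
    return tx_first, tx_second


chassis_ids = list(range(1, 13))                 # a dozen chassis (assumed)
pools = partition_timeslots(1500, chassis_ids)   # 125 transmit slots each
chassis = {cid: Chassis(cid, pools[cid]) for cid in chassis_ids}
tx_2, tx_7 = connect_call(chassis[2], chassis[7])
print(f"chassis 2 transmits on {tx_2}, chassis 7 transmits on {tx_7}")
```

The trade-off noted above shows up directly: a chassis that exhausts its own 125 slots cannot borrow idle slots from a neighbor, but no shared allocation database is needed to set up a call.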
BARRIERS ARE DROPPING
Software development is the gating factor for bringing new equipment and services to
market. The richest software development environment in the world is on PCs connected to
the Internet. And now, the high availability requirements of telecommunications systems
can be handled by CompactPCI and distributed computing architectures. As a result,
mass-market computing technology has become the most practical and cost-effective way of
building new telecom services. New designs based on proprietary technology make little
sense. Virtually any telecommunication service can be more economically built using
mass-market computing technology on open platforms - while addressing all of the high
availability issues.
Brough Turner is senior vice president of technology at Natural MicroSystems, a leading
provider of hardware and software technologies for developers of high-value
telecommunications solutions. For more information, call Natural MicroSystems at
508-620-9300 or visit the company's Web site at
www.nmss.com. E-mail to the author ([email protected]) is
also welcome.