Deploying Reliable Power For Critical IT Network Infrastructure
BY Jay Adelson
With no particular warning, August 14, 2003 turned out to be a rude awakening for IT managers across North America. The largest power grid failure in the history of the United States cascaded throughout the Northeast, causing devastating downtime for many IT systems that did not have a solid redundant power infrastructure in place. The U.S.-Canada Power System Outage Task Force released a report on April 5, 2004 describing the causes of the blackout as long-term in nature and pervasive throughout the industry. Based on its recommendations, as well as those of the North American Electric Reliability Council (NERC), many actions remain to be taken before the rest of us can breathe a sigh of relief. The most important lesson learned by IT managers during the days of the failure was clear: a high-performance power infrastructure has now become one of the most critical elements in IT network operations.
During that outage, some networks remained online through the correct implementation of power backup systems. However, the continuous operation of any application using Internet or networking technology, such as VoIP, business systems and storage, requires the uptime of a chain of components, many of which are outside of an IT manager’s immediate control. To properly protect against a failure, IT managers must first understand the chain of components interlinked to deliver the service. Secondly, they must understand the maximum power reliability options of each device. Finally, through location decisions or strategic power delivery planning, they need to deliver reliable power — to the extent possible — to the devices.
The service delivery network infrastructure chain can always be broken down into its simplest elements: Client devices (including desktops, servers, IP phones, or other user devices), LANs (including switches, routers, gateways, softswitches, etc.), WANs (including network service provider transport devices, routers and switches), remote LANs and finally the destination client devices. Regardless of whether or not each element of the chain is within the IT manager’s control, each element should be mapped out and evaluated for its inherent power failover capabilities.
When it comes to power redundancy built into client devices, the options vary. Desktops rarely have multiple power supplies, but servers often have two or more. Devices such as these can fail over from one supply to the other seamlessly. A static switch can be used to connect a single power supply to two diverse power sources, also providing seamless failover. For VoIP phones and smaller devices, built-in battery backup is far more common. A device with its own batteries can bridge the gap while waiting for power delivery to return.
Almost every LAN device available on the market today comes with multiple power supply options. Manufacturers such as Cisco Systems, Juniper, 3Com, Foundry, and Extreme offer systems with multiple power supplies at reasonable prices. Choosing where to place this equipment will depend on a building’s power delivery capabilities or the importance of the application in question.
WAN devices tend to have greater power redundancy capabilities. Often WAN devices have both AC and DC power supply options to ensure compatibility with whatever redundant power delivery services are available at the deployment location. When purchasing network services, such as transport for a private WAN (e.g., frame relay or ATM) or transit for an Internet-based WAN (either VPN or public access), IT managers should demand a list of the power configurations for all equipment in their network provider’s service chain.
In some architectures, where the technology and protocols support it, multiple devices can be used as a less expensive alternative to higher-end devices with multiple power supplies. For example, in Domain Name System (DNS) or more advanced ENUM deployments, which can be essential to VoIP call routing, multiple servers can be deployed to accommodate the load should one server fail. However, not all protocols and elements of certain architectures support multiple devices.
When developing a continuity analysis for an enterprise, IT managers should list all the devices in their chain and their likelihood of failure. A detailed study of the impact of a power failure to each device should be conducted to measure the effect on overall operations. After a priority level has been assigned to each device, an IT manager has the information to make decisions on where to invest in better power delivery or the purchase of higher-end devices.
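The ranking step described above can be sketched in a few lines of Python. This is only an illustration: the 1-to-5 scoring scales, the likelihood-times-impact formula, and the device names are assumptions made for the example, not a method prescribed by this article.

```python
from dataclasses import dataclass


@dataclass
class Device:
    name: str
    segment: str             # where it sits in the chain: "client", "LAN", "WAN"
    failure_likelihood: int  # assumed scale: 1 (unlikely) to 5 (likely)
    outage_impact: int       # assumed scale: 1 (minor) to 5 (service down)

    @property
    def priority(self) -> int:
        # Assumed scoring rule: likelihood multiplied by impact.
        return self.failure_likelihood * self.outage_impact


def rank_devices(devices):
    """Return devices sorted from highest to lowest priority."""
    return sorted(devices, key=lambda d: d.priority, reverse=True)


# Hypothetical inventory, for illustration only.
inventory = [
    Device("IP phone", "client", 4, 2),
    Device("access switch", "LAN", 3, 4),
    Device("edge router", "WAN", 2, 5),
    Device("DNS server", "LAN", 2, 3),
]

ranked = rank_devices(inventory)
for device in ranked:
    print(f"{device.priority:>3}  {device.segment:<7} {device.name}")
```

The devices at the top of the list are the first candidates for better power delivery or a higher-end replacement.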
In terms of power delivery, economies of scale can provide strong benefits. Using large, multi-tenant data centers will likely provide the best power redundancy options, whereas a local building’s “telecom closet” is the least likely to support backup power of significant quality. As it is impossible to put the entire chain in a data center, it is important to focus first on improving power redundancy at the customer premises, and then at the data center.
At the customer premises, devices should be centrally located so that consistent, well-understood power can be provided to all of them. In employee work spaces, dedicate circuits to groups of users so that each group can be monitored independently. A typical telecom room cabinet should be fed with, at a minimum, two separate circuits, so that devices with multiple power supplies can connect to different circuits in the event of one failing. For single power supply devices, static switches can be placed at the top or bottom of the rack, in turn feeding into multiple circuit breakers.
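A quick audit of a cabinet's wiring plan can catch devices that would lose all power if one circuit failed. The sketch below assumes a hypothetical inventory format, a mapping from device name to the circuit each of its power supplies plugs into. The device and circuit names are made up.

```python
def devices_on_one_circuit(cabinet):
    """Return names of devices whose power supplies do not span two circuits.

    `cabinet` maps a device name to the list of circuits its power supplies
    are plugged into (one entry per power supply).
    """
    return sorted(
        name for name, circuits in cabinet.items() if len(set(circuits)) < 2
    )


# Hypothetical cabinet fed by circuits "A" and "B".
cabinet = {
    "core-switch": ["A", "B"],  # dual supplies on separate circuits
    "router": ["A", "A"],       # dual supplies, but both on circuit A
    "dsl-modem": ["B"],         # single supply
}

at_risk = devices_on_one_circuit(cabinet)
print(at_risk)
```

Dual-supply devices that appear in the result should be re-cabled; single-supply devices are the candidates for a static switch.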
Once power is properly distributed at the customer premises, Uninterruptible Power Supply (UPS) devices can be used to bridge short power failures from the power utility or backup generators. A UPS system designed to handle full load for five minutes should be more than adequate if diesel generators are part of the permanent infrastructure. If the building has generators, the building management should provide the IT staff with a detailed maintenance plan, including at least monthly generator tests and refueling, and information about where fuel will be obtained during a major power crisis. In August 2003, many generators worked fine for a day until the fuel ran out.
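To put rough numbers on the bridging time and the fuel supply, the following back-of-the-envelope sketch may help. The inverter efficiency, the usable battery fraction, and the example load and fuel figures are all assumptions. Real batteries discharge nonlinearly, so a vendor's runtime tables should take precedence over this estimate.

```python
def required_battery_wh(load_watts, bridge_minutes,
                        inverter_efficiency=0.9, usable_fraction=0.8):
    """Estimate the rated battery energy (Wh) needed to carry a load.

    inverter_efficiency and usable_fraction are assumed defaults,
    not figures from any specific UPS product.
    """
    if load_watts < 0 or bridge_minutes < 0:
        raise ValueError("load and bridge time must be non-negative")
    if not 0 < inverter_efficiency <= 1 or not 0 < usable_fraction <= 1:
        raise ValueError("efficiency and usable fraction must be in (0, 1]")
    delivered_wh = load_watts * bridge_minutes / 60
    return delivered_wh / (inverter_efficiency * usable_fraction)


def generator_runtime_hours(usable_fuel_litres, burn_litres_per_hour):
    """Hours of generator operation available from the fuel on hand."""
    if burn_litres_per_hour <= 0:
        raise ValueError("burn rate must be positive")
    return usable_fuel_litres / burn_litres_per_hour


# Hypothetical example: a 6 kW telecom room bridged for five minutes,
# and a 2,000 litre tank feeding a generator burning 80 litres per hour.
print(round(required_battery_wh(6000, 5), 1), "Wh of rated battery capacity")
print(generator_runtime_hours(2000, 80), "hours of fuel")
```

A runtime figure of roughly a day, as in this made-up example, is exactly the situation in which a refueling plan matters.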
When choosing a UPS system, an IT manager should look for systems that offer “smart shutdown” controls. If a generator fails to start, or utility power fails to return, within a designated time, the UPS can then send messages to servers and other fragile equipment to shut down gracefully.
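The decision behind a "smart shutdown" can be expressed as a small function. This is a sketch of the logic only, not any vendor's API: the function name, the state names, and the 60-second deadline are hypothetical, and a real deployment would rely on the monitoring software supplied with the UPS.

```python
from enum import Enum


class Action(Enum):
    NORMAL = "normal"        # utility or generator power is available
    WAIT = "wait"            # on battery, still within the grace period
    SHUT_DOWN = "shut_down"  # grace period expired, shut down gracefully


def shutdown_decision(utility_ok, generator_ok, seconds_on_battery,
                      generator_deadline_s=60):
    """Decide what protected servers should do (deadline is an assumed value)."""
    if utility_ok or generator_ok:
        return Action.NORMAL
    if seconds_on_battery < generator_deadline_s:
        return Action.WAIT
    return Action.SHUT_DOWN


# Utility has failed and the generator has not started after 90 seconds.
print(shutdown_decision(utility_ok=False, generator_ok=False,
                        seconds_on_battery=90))
```

The deadline should be shorter than the UPS runtime by at least the time the slowest server needs to shut down cleanly.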
In a larger data center, one should start with the cabinet equipment and work back to utility power or generators to develop a power delivery map. Typically the map goes from client device to circuit breakers to Power Distribution Units (PDUs) to UPS systems to either generators or utility power.
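A power delivery map of this kind can be recorded as a simple graph and checked for single points of failure. The sketch below assumes a hypothetical map in which each element lists the upstream elements that feed it. All names are invented, and the model treats each element as either working or failed, which is a simplification.

```python
def is_powered(feeds, node, sources, failed=frozenset(), _seen=None):
    """True if `node` can reach a live power source through live elements."""
    if node in failed:
        return False
    if node in sources:
        return True
    seen = set() if _seen is None else _seen
    if node in seen:  # already explored; avoids loops in the map
        return False
    seen.add(node)
    return any(
        is_powered(feeds, upstream, sources, failed, seen)
        for upstream in feeds.get(node, [])
    )


def single_points_of_failure(feeds, device, sources):
    """Elements whose individual failure leaves `device` without power."""
    elements = set(feeds) | {u for ups in feeds.values() for u in ups}
    elements.discard(device)
    return sorted(
        e for e in elements
        if not is_powered(feeds, device, sources, failed={e})
    )


# Hypothetical map: a dual-supply server on two breakers that share one PDU.
feeds = {
    "server": ["breaker-A", "breaker-B"],
    "breaker-A": ["pdu-1"],
    "breaker-B": ["pdu-1"],
    "pdu-1": ["ups-1"],
    "ups-1": ["utility", "generator"],
}
sources = {"utility", "generator"}

print(single_points_of_failure(feeds, "server", sources))
```

In this made-up example the two breakers protect the server against a breaker failure, but the shared PDU and UPS remain single points of failure.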
Each power circuit is generally delivered on a “power bus” with various levels of redundancy, and often multiple buses are available. In some data centers, two circuits dedicated to two power supplies may be on totally separate power buses, while in other data centers they may be on a single bus.
Power is then distributed into either local (installed on a bus bar) or centralized breaker panels, and then onto Power Distribution Units or PDUs. These PDUs aggregate load onto a bus for delivery into the main UPS infrastructure of the data center. PDUs are a common source of failure in the data center power infrastructure; understanding the quality of these devices and how they are maintained is important.
UPS systems can be installed in parallel redundant or isolated redundant configurations, so that if a single UPS fails another can back it up. This “hot spare” concept is often referred to as N+1, or 2N if there is literally double the number of UPS systems available for the entire load. A UPS depends on batteries, and these can be of various levels of redundancy themselves; smaller UPS systems use a single, internal chain of batteries, while larger UPS systems can often use two redundant sets of batteries. A data center operator should provide a good maintenance plan for the UPS systems, including battery monitoring, to any enterprise interested in using them.
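The difference between N+1 and 2N comes down to simple arithmetic: N is the number of UPS units needed to carry the full load. The sketch below assumes identical units and ignores derating; the function name and the example figures are hypothetical.

```python
import math


def redundancy_level(load_kw, unit_capacity_kw, units_installed):
    """Classify a bank of identical UPS units relative to the load."""
    if load_kw <= 0 or unit_capacity_kw <= 0:
        raise ValueError("load and unit capacity must be positive")
    n = math.ceil(load_kw / unit_capacity_kw)  # units needed for full load
    if units_installed >= 2 * n:
        return "2N"
    if units_installed >= n + 1:
        return "N+1"
    if units_installed >= n:
        return "N (no redundancy)"
    return "insufficient"


# Hypothetical 300 kW load served by 100 kW units, so N = 3.
for count in (2, 3, 4, 6):
    print(count, "units:", redundancy_level(300, 100, count))
```

Note that when a single unit can carry the whole load (N = 1), two units satisfy both definitions; the function reports the stronger label, 2N.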
A data center’s fuel plant should be able to provide full load power for an indefinite period. To do this, the expectation of the plant operator should be that one or more generators, or fuel pumps to the generators, could fail at any time. The fuel should also come from redundant tanks and sources. Data center operators can sign multiple fuel supply contracts to ensure continued operation.
A data center can also get its utility power from single or multiple sources, or feeds. While a building may have two independent feeds, they may come from the same or different substations or grids. Data center operators who tell you they are on two different grids are often misrepresenting what really are two separate substations. In reality, a data center would need to be in an out-of-the-way or rural location near a major regional border to straddle two grids.
Finally, a data center operator should provide you a maintenance plan that demonstrates an expectation of failure, both from the utility and from components within the data center power delivery. No system is foolproof, and a good data center operator will test it on a weekly or monthly basis at full load.
Overall, when deploying IT infrastructure to support services such as VoIP, breaking down the elements of power delivery is critical to understanding the reliability of a system. Often critical parts of the service delivery chain are ignored and are identified only during a crisis. Power-related network issues can be mitigated without becoming an electrical engineer; all it takes is some simple due diligence.
Jay Adelson is founder and chief technology officer at Equinix, Inc. For more information, please visit the company online at www.equinix.com.