Social Media Giant’s Networking Head Discusses Automation, Disaggregation & Scaling
Omar Baldonado, Facebook’s director of engineering in charge of the network side of the social media giant, provided an update on the company’s Open Compute Project at the Open Networking Summit 2016 this spring. During his keynote presentation he reiterated the goal of the OCP, explained how the effort has advanced over time, and talked about his group’s focus on efficient and scalable operation over features.
The company started the networking side of the Open Compute Project in 2013. OCP was initiated by Facebook, he explained, because the company couldn’t find in the marketplace the solutions it needed to work at hyperscale, so it started to build the hardware and software solutions it needed itself and in partnership with other companies.
There were 1.04 billion daily active users on Facebook on average as of December 2015, and more than 80 percent of those users are outside the U.S., Baldonado said. To support that level of activity, he noted, Facebook has built out data centers; is investing heavily in fiber to connect its data centers and to connect to points of presence; and has recently put more effort into expanding access to the rest of the world via drones, lasers, and satellites.
“So for all of this, we have to write a lot of software,” he said.
Publicly, Facebook has talked a lot about FBOSS, the Linux-based software it developed that runs on its top-of-rack Wedge data center switches, he said. The company introduced FBOSS and Wedge in June of 2014. These two solutions were among the first to separate the hardware and software components while automating and providing better visibility of the network. Thousands of FBOSS/Wedge switches are now in production, Baldonado said.
But FBOSS is just one example of the software Facebook has written, he added. It also writes software related to 100G, IPv6, backbone and edge traffic engineering, circuit automation and testing, configuration automation and management, hybrid controllers, network modeling, network analytics and simulation, passive and active monitoring, and traffic shaping, he said.
Facebook does software upgrades weekly, he added, instead of twice a year. That, he explained, reduces the number of changes that are introduced at any one time. Such frequent updates, he noted, create the need for automation, because you can’t manually upgrade thousands of switches every week.
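To make the automation point concrete, here is a minimal Python sketch of a batched rollout that stops early when too many switches fail to upgrade. It is an illustration only, not Facebook's tooling: the `upgrade` callback, the switch names, the batch size, and the failure threshold are all hypothetical.

```python
from typing import Callable, Iterator, List, Tuple


def batches(items: List[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `items`, each at most `size` long."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def rolling_upgrade(
    switches: List[str],
    upgrade: Callable[[str], bool],  # hypothetical: returns True on success
    batch_size: int = 2,
    max_failures: int = 1,
) -> Tuple[List[str], List[str]]:
    """Upgrade switches batch by batch; halt once failures exceed the limit."""
    upgraded: List[str] = []
    failed: List[str] = []
    for batch in batches(switches, batch_size):
        for name in batch:
            if upgrade(name):
                upgraded.append(name)
            else:
                failed.append(name)
        # Check after each batch so one bad release cannot spread fleet-wide.
        if len(failed) > max_failures:
            break
    return upgraded, failed


# Example run with a stand-in upgrade function (assumed for illustration).
fleet = ["s1", "s2", "s3", "s4", "s5", "s6"]
done, broken = rolling_upgrade(fleet, lambda name: name not in {"s3", "s4"})
```

In this run the rollout halts after the second batch, so the last two switches are never touched. Small batches with an early stop fit the reasoning in the talk: frequent upgrades keep each change set small, and a halt keeps a bad one contained.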
The company also subscribes to the concept of fail faster over fail-proof, he continued.
“Making a 100 percent fail-proof network is really a tall order,” he said.
So, instead, Facebook assumes there will be some failure, but makes sure it can address and fix problems quickly. For example, his group gets millions of syslog events from its network every day, every week, every month, he said. That doesn’t mean the network is failing, he said; it means the group needed to write software to filter these events, identify patterns, and encode best practices to quickly fix problem areas. He added that Facebook’s NetNORAD system isolates network faults and automatically mitigates them within seconds, and that the company has open sourced some of the NetNORAD code.
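The filter-and-classify approach to syslog events described here could look roughly like the Python sketch below. It is not Facebook's software: the message formats, rule labels, and remediation strings are assumptions made up for illustration.

```python
import re
from collections import Counter
from typing import Dict, List, Optional, Tuple

# Hypothetical rules: (pattern, label, remediation encoded as a best practice).
RULES = [
    (re.compile(r"link \S+ down", re.IGNORECASE), "link_down",
     "drain the port and open a repair ticket"),
    (re.compile(r"fan \d+ failure", re.IGNORECASE), "fan_failure",
     "schedule a hardware swap"),
]


def classify(line: str) -> Optional[Tuple[str, str]]:
    """Return (label, remediation) for the first matching rule, else None."""
    for pattern, label, action in RULES:
        if pattern.search(line):
            return label, action
    return None


def summarize(lines: List[str]) -> Tuple[Counter, Dict[str, str]]:
    """Count events per known pattern and collect the suggested remediations."""
    counts: Counter = Counter()
    actions: Dict[str, str] = {}
    for line in lines:
        match = classify(line)
        if match is None:
            continue  # routine noise: filtered out
        label, action = match
        counts[label] += 1
        actions[label] = action
    return counts, actions


# Made-up sample events, not real switch output.
events = [
    "sw1 Link eth0 down",
    "sw2 link eth3 down",
    "sw1 fan 2 failure detected",
    "sw3 user login ok",
]
counts, actions = summarize(events)
```

Events that match no rule are dropped as noise, while recurring patterns are counted and paired with a fix, so a person or another program can act on a short summary rather than the raw stream.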
On the hardware side, in addition to Wedge, the OCP ecosystem has come out with a wide array of solutions, including a battery cabinet, Freedom servers, Windmill (Intel), Spitfire Server (AMD), power supply, and many more.
High-tech entrepreneur and investor Marc Andreessen in 2011 famously wrote that software is eating the world, and that’s clearly still true today. But Baldonado added that “software needs to run on something” and the OCP hardware shows there are lots of options on that front.
Providing an update on how OCP has advanced in the past year, Baldonado said that while only one OCP switch had been accepted in 2015, today there are 11 OCP data center switches, noting contributions on this front from Alpha, Broadcom, Edge-Core, Facebook, Inventec, and Mellanox. And there are more switches teed up and ready to go, he added. Also, full design packages that have been reviewed by the OCP community are now available; a testing program is in place; software from one vendor can run on hardware from another; and new OCP specs are available for new silicon, chassis/modular solutions, and access and edge solutions, he said.
Edited by Stefania Viscusi