A Reliable Network Management Workflow System at Scale
Investigators: Yiting Xia in cooperation with Kuo-Feng Hsu, Jiarong Xing, Ang Chen (Rice University), Yan Cai, Yanping Li, and Ying Zhang (Meta)
Managing complex networks is challenging, particularly at scale. A planet-scale cloud/content provider’s network can contain devices from dozens of vendors, many tens of network element roles, and tens of thousands of circuits in operation, which together contribute to thousands of configuration changes every day. In addition, network management involves a diverse set of tasks, ranging from service upgrades, device updates, and network expansion to feature deployment, each requiring a disparate workflow of operations. The combined effect of network scale and task diversity has made network management a risk-ridden procedure: because changes often come with risk, it is up to the network management system to monitor network health, detect runtime failures, and assist with recovery when possible. Reliability is hence the primary goal of network management, given its error-prone nature. We present the network management system at Meta and how it evolved through two generations to enhance reliability.
Our first generation, Netgram, is a network-specific workflow system. In Netgram, a workflow is a Python program written as a pipeline of stages, where each stage performs several execution steps and passes intermediate results to the next. A stage can invoke pre-defined sub-procedures for common network operations, which can be implemented as any executable, such as Ansible, Python, or CLI scripts. Netgram is a significant leap from the former practice of ‘method of procedures’ (MOPs), which document management steps and rules of thumb in loose text, and thus can easily be misinterpreted by individual operators and often require manual translation into scripts or configlets. Netgram effectively eliminates manual effort and streamlines management tasks in an automated and trackable manner.
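As an illustration only, the pipeline-of-stages structure described above might look like the following Python sketch. The names used here (run_workflow, select_devices, drain_devices, upgrade_devices) and the dictionary passed between stages are hypothetical and are not Netgram's actual API; the sub-procedures are stand-ins for what could be Ansible, Python, or CLI scripts.

```python
from typing import Any, Callable, Dict, List

# Hypothetical stage type: takes the previous stage's results, returns its own.
Stage = Callable[[Dict[str, Any]], Dict[str, Any]]


def run_workflow(stages: List[Stage], initial: Dict[str, Any]) -> Dict[str, Any]:
    """Run the stages in order, passing intermediate results to the next stage."""
    results = dict(initial)
    for stage in stages:
        results = stage(results)
    return results


def select_devices(ctx: Dict[str, Any]) -> Dict[str, Any]:
    # Stand-in for looking up the devices in scope (assumed naming scheme).
    devices = [f"{ctx['site']}-sw{i}" for i in range(1, 4)]
    return {**ctx, "devices": devices}


def drain_devices(ctx: Dict[str, Any]) -> Dict[str, Any]:
    # Stand-in for a pre-defined sub-procedure that drains traffic.
    return {**ctx, "drained": list(ctx["devices"])}


def upgrade_devices(ctx: Dict[str, Any]) -> Dict[str, Any]:
    # Only devices that were drained in the previous stage are upgraded.
    return {**ctx, "upgraded": list(ctx["drained"])}


result = run_workflow(
    [select_devices, drain_devices, upgrade_devices],
    {"site": "dc1"},
)
```

Nothing in this sketch restricts what a stage may do to the network, which is the weakness the next paragraph discusses.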
After four years of operating Netgram in production, however, we learned that workflow programs are subject to few restrictions or validations, so they can still cause severe damage to the network, especially when applied automatically at scale. Just as a rogue shell script may wipe an entire disk, a problematic workflow program can exert arbitrary influence on the network. We analyzed failure tickets caused by Netgram and interviewed network engineers about operational pain points. From the results, we broke down the reliability goal into four tangible requirements for the workflow system.
Safety. Unconstrained workflows could result in unexpected network outages, e.g., draining too much capacity and leaving the network severely congested. A workflow system for network management should provide network operators with a confined set of allowable operations and simple interfaces to carry out validations.
Consistency. A network management task may require changes to multiple configurations on a device or changes on multiple devices. Sometimes it requires changes to be applied to a set of devices in a pre-defined order. A workflow system should guarantee that all necessary changes are made successfully so that the network is in a consistent and correct state.
Efficiency. Many changes need to be deployed as quickly as possible, e.g., to mitigate failures, to balance traffic, or to deploy a security patch. Thus, scheduling and executing workflows efficiently to minimize the time the network remains vulnerable is another requirement.
Resilience. Workflows may fail for various reasons, e.g., wrong device states, conflicts with other workflows, or failure of manual steps (such as maintenance in the field). Partially executed workflows should be undone according to the specific order of each counter-operation, which is not as simple as reverse-order rollback.
Our second-generation system, Netgram++, provides systematic reliability guardrails based on these considerations. Our core idea is to leverage the fact that most industry-grade network management systems already separate logical network representations from the physical networks. The logical network data is stored in a source-of-truth database, such as FBNet at Meta and Malt at Google. All network management tasks require reads and writes to the network data in some form. Thus, we can abstract network management workflows as changes to the network data and apply various database techniques to implement customized features that tackle each reliability requirement. For safety, we propose a programming model with the network object abstraction and a set of APIs. A Netgram++ program constructs network objects by scoping a number of devices, and all operations on them go through the APIs. From the programming model, the runtime system can automatically generate queries to FBNet, enforce constraints on the network, monitor task progress, and handle failures. For consistency, we build an object tree from network objects based on their dependencies and apply multi-granularity locking on the object tree to enable workflow-level transactions. For efficiency, we perform hierarchical lock scheduling on the object tree to maximize task parallelization and minimize execution time. For resilience, we identify limitations of reverse-order rollbacks and devise rollback plans with pattern matching on the semantics of management operations.
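To make the consistency mechanism more concrete, the following Python sketch shows textbook multi-granularity locking (intention locks IS/IX plus S/X) on a small object tree. It is a simplified illustration under stated assumptions, not the Netgram++ implementation: the Node class, the try_lock and unlock helpers, the non-blocking behavior, and the example tree of a data center, racks, and switches are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

# Standard compatibility table: COMPATIBLE[a] lists modes that may coexist with a.
COMPATIBLE = {
    "IS": {"IS", "IX", "S"},
    "IX": {"IS", "IX"},
    "S": {"IS", "S"},
    "X": set(),
}


@dataclass
class Node:
    name: str
    parent: Optional["Node"] = None
    held: List[Tuple[str, str]] = field(default_factory=list)  # (workflow, mode)

    def can_grant(self, workflow: str, mode: str) -> bool:
        # Locks held by the same workflow never conflict with its own requests.
        return all(m in COMPATIBLE[mode] for wf, m in self.held if wf != workflow)


def _plan(node: Node, mode: str) -> List[Tuple[Node, str]]:
    """Intention locks on every ancestor (root first), then the real lock."""
    intent = "IS" if mode == "S" else "IX"
    ancestors = []
    cur = node.parent
    while cur is not None:
        ancestors.append(cur)
        cur = cur.parent
    return [(a, intent) for a in reversed(ancestors)] + [(node, mode)]


def try_lock(node: Node, workflow: str, mode: str) -> bool:
    """Non-blocking: grant the whole root-to-node plan or nothing at all."""
    if mode not in ("S", "X"):
        raise ValueError("mode must be 'S' or 'X'")
    plan = _plan(node, mode)
    if not all(n.can_grant(workflow, m) for n, m in plan):
        return False
    for n, m in plan:
        n.held.append((workflow, m))
    return True


def unlock(node: Node, workflow: str, mode: str) -> None:
    """Release one lock previously granted by try_lock(node, workflow, mode)."""
    for n, m in _plan(node, mode):
        n.held.remove((workflow, m))


# Hypothetical object tree: one data center, two racks, two switches in rack1.
dc = Node("dc1")
rack1 = Node("rack1", dc)
rack2 = Node("rack2", dc)
sw1 = Node("sw1", rack1)
sw2 = Node("sw2", rack1)

a_sw1 = try_lock(sw1, "wf-a", "X")        # exclusive change on sw1
b_sw2 = try_lock(sw2, "wf-b", "X")        # sibling device: runs in parallel
c_rack1 = try_lock(rack1, "wf-c", "X")    # whole rack: conflicts with wf-a, wf-b
c_rack2 = try_lock(rack2, "wf-c", "X")    # untouched rack: no conflict

unlock(sw1, "wf-a", "X")
unlock(sw2, "wf-b", "X")
c_rack1_retry = try_lock(rack1, "wf-c", "X")  # succeeds once the devices are free
a_sw1_retry = try_lock(sw1, "wf-a", "X")      # now blocked by the rack-level lock
```

In this sketch, workflows that touch disjoint subtrees proceed in parallel, while a workflow that needs an entire subtree is refused until finer-grained locks beneath it are released. A production scheduler would presumably queue the refused request rather than simply return False.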
Our contributions in this project are as follows. (1) We are the first to share experience with a production-grade network workflow management system and to present comprehensive workflow measurements that can be used for future research in this area. (2) We demonstrate how database techniques can be customized for workflow systems to enhance reliability, and give a detailed explanation of an example design with a programming model and a runtime system for locking, scheduling, and failure handling. (3) We conduct an extensive evaluation of the improved system with both simulation and production case studies: the programming model reduces the LoC of workflow programs by 95.4% to 96.4%, effective locking and scheduling reduce task execution time by 90.1%, and conflicts between multiple workflows are resolved successfully. (4) We offer the first open-sourced task traces and simulator of network workflow management systems, which give academic researchers access to first-hand production data and an evaluation tool to study an industrial problem.