A Reliable Network Management Workflow System at Scale

Investigators: Yiting Xia in cooperation with Kuo-Feng Hsu, Jiarong Xing, Ang Chen (Rice University), Yan Cai, Yanping Li, and Ying Zhang (Meta)

Managing complex networks is challenging, particularly at scale. A planet-scale cloud/content provider’s network can contain devices from dozens of vendors, many tens of network element roles, and tens of thousands of circuits in operation, which together account for thousands of configuration changes every day. Network management also involves a diverse set of tasks, ranging from service upgrades, device updates, and network expansion to feature deployment, each requiring a different workflow of operations. The combined effect of network scale and task diversity makes network management a risk-ridden procedure: because changes often carry risk, it is up to the network management system to monitor network health, detect runtime failures, and assist with recovery when possible. Reliability is hence the primary goal of network management, given its error-prone nature. We present the network management system at Meta and how it evolved through two generations to enhance reliability.

Our first generation, Netgram, is a network-specific workflow system. In Netgram, a workflow is a Python program written as a pipeline of stages, where each stage performs several execution steps and passes intermediate results to the next. A stage can invoke pre-defined sub-procedures for common network operations, which can be implemented as any executable, such as Ansible, Python, or CLI scripts. Netgram is a significant leap from the former practice of ‘methods of procedure’ (MOPs), which use loose text to document management steps and rules of thumb, and thus can easily be misinterpreted by individual operators and often require manual translation into scripts or configlets. Netgram effectively eliminates this manual effort and streamlines management tasks in an automated and trackable manner.
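The pipeline-of-stages structure described above can be illustrated with a minimal Python sketch. This is not Netgram code: the names run_workflow, select_devices, drain_devices, and push_config, the context dictionary, and the device names are all hypothetical, and each stage only records what it would do instead of touching a network.

```python
from typing import Any, Callable, Dict, List

# Hypothetical types: a stage receives the intermediate results produced so
# far and returns an updated copy for the next stage.
Context = Dict[str, Any]
Stage = Callable[[Context], Context]


def run_workflow(stages: List[Stage], initial: Context) -> Context:
    """Run the stages in order, passing intermediate results along."""
    ctx = dict(initial)
    for stage in stages:
        ctx = stage(ctx)
    return ctx


def select_devices(ctx: Context) -> Context:
    # Assumption: devices are picked by a name prefix from a static inventory.
    devices = [d for d in ctx["inventory"] if d.startswith(ctx["role_prefix"])]
    return {**ctx, "devices": devices}


def drain_devices(ctx: Context) -> Context:
    # A real stage would invoke a sub-procedure (a script or playbook);
    # this one only records which devices would be drained.
    return {**ctx, "drained": list(ctx["devices"])}


def push_config(ctx: Context) -> Context:
    # Record the configuration version each drained device would receive.
    configured = {d: ctx["config_version"] for d in ctx["drained"]}
    return {**ctx, "configured": configured}


result = run_workflow(
    [select_devices, drain_devices, push_config],
    {
        "inventory": ["rsw001", "rsw002", "fsw001"],
        "role_prefix": "rsw",
        "config_version": "v2",
    },
)
print(result["configured"])  # {'rsw001': 'v2', 'rsw002': 'v2'}
```

Because every stage is an ordinary function over a shared context, nothing in this sketch restricts what a stage may do, which is the property the following paragraphs identify as a reliability risk.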

After four years of operating Netgram in production, however, we learned that workflow programs have few restrictions or validations, so they can still cause severe damage to the network, especially when applied automatically at scale. Just as a rogue shell script may wipe an entire disk, a problematic workflow program can exert arbitrary influence on the network. We analyzed failure tickets caused by Netgram and interviewed network engineers about operational pain points. From the results, we broke down the reliability goal into four tangible requirements for the workflow system.

Safety. Unconstrained workflows could result in unexpected network outages, e.g., draining too much capacity and leaving the network severely congested. A workflow system for network management should provide network operators with a confined set of allowable operations and simple interfaces to carry out validations.
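As an illustration of the kind of validation meant here, the following Python sketch rejects a drain request that would leave too little capacity in service. It is only a sketch under stated assumptions: the Circuit type, the check_drain function, the SafetyViolation exception, and the 50% default threshold are all hypothetical and are not taken from the system described in this document.

```python
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass(frozen=True)
class Circuit:
    name: str
    capacity_gbps: float


class SafetyViolation(Exception):
    """Raised when a requested drain would break the capacity constraint."""


def check_drain(
    circuits: Dict[str, Circuit],
    already_drained: Iterable[str],
    to_drain: Iterable[str],
    min_remaining_fraction: float = 0.5,  # hypothetical threshold
) -> float:
    """Return the fraction of capacity left in service after the drain.

    Raises SafetyViolation if that fraction falls below the threshold.
    """
    drained = set(already_drained) | set(to_drain)
    unknown = drained - circuits.keys()
    if unknown:
        raise KeyError(f"unknown circuits: {sorted(unknown)}")

    total = sum(c.capacity_gbps for c in circuits.values())
    if total <= 0:
        raise ValueError("total capacity must be positive")

    remaining = sum(
        c.capacity_gbps for name, c in circuits.items() if name not in drained
    )
    fraction = remaining / total
    if fraction < min_remaining_fraction:
        raise SafetyViolation(
            f"drain would leave {fraction:.0%} of capacity, "
            f"below the required {min_remaining_fraction:.0%}"
        )
    return fraction


circuits = {n: Circuit(n, 100.0) for n in ("c1", "c2", "c3", "c4")}
print(check_drain(circuits, already_drained=[], to_drain=["c1"]))  # 0.75
```

With the same four circuits, a request to drain c1, c2, and c3 together would leave 25% of capacity and raise SafetyViolation before any change is applied.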

Consistency. A network management task may require changes to multiple configurations on a device or changes on multiple devices. Sometimes the changes must be applied to a set of devices in a pre-defined order. A workflow system should guarantee that all necessary changes are made successfully so that the network is left in a consistent and correct state.

Efficiency. Many changes need to be deployed as quickly as possible, e.g., to mitigate failures, to balance traffic, or to deploy a security patch. Thus, scheduling and executing workflows efficiently, to minimize the time during which the network is vulnerable, is another requirement.

Resilience. Workflows may fail for various reasons, e.g., wrong device states, conflicts with other workflows, or failure of manual steps (such as maintenance in the field). Partially executed workflows should be undone in the specific order required by each counter-operation, which is not as simple as reverse-order rollback.

Our second-generation system, Netgram++, provides systematic reliability guardrails that address these requirements. Our core idea is to leverage the fact that most industry-grade network management systems already separate logical network representations from the physical networks. The logical network data is stored in a source-of-truth database, such as FBNet at Meta and MALT at Google. All network management tasks read from and write to the network data in some form. Thus, we can abstract network management workflows as changes to the network data and apply various database techniques to implement customized features that tackle each reliability requirement. For safety, we propose a programming model built on a network object abstraction and a set of APIs. A Netgram++ program constructs network objects by scoping a number of devices, and all operations on them go through the APIs. From the programming model, the runtime system can automatically generate queries to FBNet, enforce constraints on the network, monitor task progress, and handle failures. For consistency, we build an object tree from network objects based on their dependencies and apply multi-granularity locking on the object tree to enable workflow-level transactions. For efficiency, we perform hierarchical lock scheduling on the object tree to maximize task parallelization and minimize execution time. For resilience, we identify the limitations of reverse-order rollbacks and devise rollback plans by pattern matching on the semantics of management operations.
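To make the locking idea concrete, here is a minimal Python sketch of textbook multi-granularity locking on a tree, using the standard intention modes (IS, IX) together with shared (S) and exclusive (X) locks. It is an illustration of the general database technique, not the Netgram++ runtime: the ObjectTree class, its try_lock and release methods, the workflow names, and the datacenter/pod/switch hierarchy are all assumptions, and the sketch is non-blocking and single-threaded for brevity.

```python
from collections import defaultdict
from typing import Dict, List, Optional, Tuple

# Standard compatibility matrix: COMPATIBLE[a] lists the modes that may be
# held by other owners on the same node while mode a is granted.
COMPATIBLE = {
    "IS": {"IS", "IX", "S"},
    "IX": {"IS", "IX"},
    "S": {"IS", "S"},
    "X": set(),
}
# Intention mode taken on every ancestor of the node being locked.
INTENTION = {"S": "IS", "X": "IX"}


class ObjectTree:
    """Hypothetical tree of network objects with per-node lock tables."""

    def __init__(self) -> None:
        self.parent: Dict[str, Optional[str]] = {}
        self.held: Dict[str, List[Tuple[str, str]]] = defaultdict(list)

    def add(self, node: str, parent: Optional[str] = None) -> None:
        self.parent[node] = parent

    def _path_from_root(self, node: str) -> List[str]:
        path: List[str] = []
        cur: Optional[str] = node
        while cur is not None:
            path.append(cur)
            cur = self.parent[cur]
        return path[::-1]

    def _grantable(self, node: str, owner: str, mode: str) -> bool:
        # Locks already held by the same owner never block that owner.
        return all(o == owner or m in COMPATIBLE[mode] for o, m in self.held[node])

    def try_lock(self, owner: str, node: str, mode: str) -> bool:
        """Acquire S or X on node, with intention locks on all ancestors.

        All-or-nothing: returns False and takes no locks on any conflict.
        """
        path = self._path_from_root(node)
        wanted = [(n, INTENTION[mode]) for n in path[:-1]] + [(node, mode)]
        if not all(self._grantable(n, owner, m) for n, m in wanted):
            return False
        for n, m in wanted:
            self.held[n].append((owner, m))
        return True

    def release(self, owner: str) -> None:
        for n in self.held:
            self.held[n] = [(o, m) for o, m in self.held[n] if o != owner]


tree = ObjectTree()
tree.add("dc")
tree.add("pod1", "dc")
tree.add("pod2", "dc")
tree.add("sw1", "pod1")
tree.add("sw2", "pod1")

print(tree.try_lock("wf_a", "sw1", "X"))   # True
print(tree.try_lock("wf_b", "sw2", "X"))   # True: sibling switches do not conflict
print(tree.try_lock("wf_c", "pod1", "X"))  # False: pod1 holds IX from wf_a and wf_b
```

In this sketch, workflows touching disjoint subtrees proceed in parallel, while a workflow that needs a whole pod is refused as long as any switch inside it is being changed; once wf_a and wf_b call release, the same request from wf_c succeeds.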

Our contributions in this project are as follows. (1) We are the first to share experience with a production-grade network workflow management system and present comprehensive workflow measurements that can be used for future research in this area. (2) We demonstrate how database techniques can be customized for workflow systems to enhance reliability, and give a detailed explanation of an example design with a programming model and a runtime system for locking, scheduling, and failure handling. (3) We conduct an extensive evaluation of the improved system with both simulation and production case studies: the programming model reduces the lines of code (LoC) of workflow programs by 95.4% to 96.4%, effective locking and scheduling reduce task execution time by 90.1%, and conflicts between multiple workflows are resolved successfully. (4) We offer the first open-sourced task traces and simulator for network workflow management systems, which give academic researchers access to first-hand production data and an evaluation tool for studying an industrial problem.