Now that we have:
- A method to generate Desired State Configurations, by declaratively defining what a device's config should be and combining that with what every device's config should have
- A method to apply configurations automatically, without PuTTY Copy-Pasting
We now can achieve Infrastructure As Code, where we can take a few artifacts from source control and turn them into a live, viable network device.
In simplest terms, CI/CD tools provide an automated way to "do a thing," making it pretty easy to perform repetitive tasks. For this example, I'll be using Jenkins CI, but the steps we'll be performing are simple enough to reproduce in most CI tools.
Pipelines aren't the only thing a CI tool can do, and there are some pretty big differences between a traditional pipeline and managing a network; for example, there's no code to compile. Instead, it's best to map out the steps that we want a CI tool to perform. Jenkins has a project type, "Freestyle," that lends itself well to applications like this, but it can also get fairly messy/disorganized.
A more comprehensive definition of a pipeline (from Red Hat) is here: https://redhat.com/en/topics/devops/what-cicd-pipeline
In this case, I am leveraging a purpose-built CentOS host with Ansible, Jenkins, Jinja2, and Python 3 installed. Since this prerequisite list is fairly short, it should lend itself rather well to containerization.
Network infrastructure tends to have inbound access restrictions that most container platforms cannot meet in an auditable, secure manner. This capability can be provided with VMware NSX-T or Project Calico, but those are fairly advanced tools. I'd consider containerization an option for those willing to take it on, and am keeping this guide as agnostic as possible.
Perhaps later I'll build on this and provide a dockerfile. Starring the repository will probably be the best way to keep track!
Let's start with the specifications for what we want to do. This doesn't need to be excessively convoluted.
- The CI tool should simply execute code, and as little else as possible. If we resort to a ton of shell scripting inside the CI tool, that logic won't be managed by source control and cannot easily be updated.
- The CI tool is responsible for:
  - Execution of written code
  - Testing of written code
  - Scoring of results to assess code viability / production readiness
- Fetch code from GitHub, checking every five minutes for new commits.
- Lint (syntax-validate) all code.
- Compile network configurations, and apply them to network infrastructure.
- Notify on build success.
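The spec above can be sketched as a thin Python driver that the CI job invokes, so the control logic itself stays in source control rather than in Jenkins shell steps. This is a minimal sketch, not the repository's actual tooling; the stage names and commands are placeholders:

```python
import subprocess

# Placeholder stages; a real job would point these at the repo's
# yamllint config and playbooks.
STAGES = [
    ("lint", ["yamllint", "."]),
    ("build", ["ansible-playbook", "build.yml"]),
    ("deploy", ["ansible-playbook", "deploy.yml"]),
]

def run_pipeline(stages):
    """Run each stage in order, stopping at the first failure.

    Returns (names_of_stages_that_passed, failed_stage_or_None),
    so the CI job can report exactly where the build died.
    """
    passed = []
    for name, cmd in stages:
        if subprocess.run(cmd).returncode != 0:
            return passed, name
        passed.append(name)
    return passed, None
```

Because the stage list is just data, adding a test or validation step later is a one-line change committed to Git, not a Jenkins UI edit.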
I have added an example CI project file to this repository: https://github.com/ngschmidt/vyos-vclos. It does not contain testing or validation steps yet, as those are considerably more complex; writing a parsable logger will take quite a bit more time than I feel an individual post is worth.
We're not asking much of Jenkins CI in this case, so you can easily replicate this configuration by:
- Setting a Git repository to clone from (under Source Code Management)
- Setting the Build Trigger to Poll SCM (`H/5 * * * *`)
- Executing the playbooks (provided in the GitHub repository). Instead of executing each playbook as an individual pipeline step, I elected to make a `main.yml` playbook that contains all steps, so that the control aspects remain centralized in the Git repository.
- Automated evaluation: I provided a yamllint example; eventually this should tally the results of each automated test and score them.
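That eventual "tally and score" step could look something like the sketch below: a pure function that weighs per-check outcomes into a single readiness score. The check names and outcome labels are invented for illustration, not parsed from any real yamllint output:

```python
from collections import Counter

def score_results(results):
    """Tally per-check outcomes into a readiness score from 0.0 to 1.0.

    `results` is a list of (check_name, outcome) tuples where outcome
    is "pass", "warn", or "fail". Warnings count half; failures zero.
    """
    if not results:
        return 0.0
    tally = Counter(outcome for _, outcome in results)
    weighted = tally["pass"] + 0.5 * tally["warn"]
    return weighted / len(results)
```

A CI job could then gate the deploy stage on something like `score_results(...) == 1.0`, making "production readiness" an explicit, versioned policy.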
Now that we have an easy way of keeping all of our networking gear (2-N nodes) managed and at a common baseline with the same level of effort, it's pretty straightforward to automatically roll out Features.
Features in this case don't need to be large, earth-shaking new capabilities in traditional software development parlance. Instead, let's consider a Feature something smaller:
- A Feature should be something a consumer wants
- A Feature should be a notable change to an information system
- A Feature should be maintainable, or improve maintainability. A system's infrastructure administrator/engineer/architect is a consumer as well, and that person's needs have value, too!
Some Examples of Network Features:
- Wireless AP-to-AP Roaming: Users like having connectivity stay as they move about. This can vary from 802.11i in Personal Mode, to 802.11i in Enterprise mode with 802.11k/r/v implemented to be truly seamless.
- If this were a CI Project:
- Minimum Viable Product would be defined. If the security teams are okay with WPA2-PSK, then that would be it. If not, the roaming capability would be at ~6 seconds, with lots of room for improvement.
- Roll out 802.11k reports for better AP association decisions
- Roll out 802.11v for better notifications around Power Saving
- Roll out 802.11r or OKC for secure hand-off
- No Rest for the Wicked: Do it all again with WPA3!
- VPN Capability
- If this were a CI Project:
- MVP: IPSec-based VPN with RADIUS authentication
- TLS Fallback for low-MTU networks or PMTUD
- Improved authentication mechanisms, like PKI or SAML
- Client Posture Assessment
In the world of continuous delivery, these can be done out of order, or to a roadmap. When you're done with a capability, deliver it instead of waiting for the next major code drop.
Honestly, infrastructure teams never really followed traditional software development approaches; Continuous Delivery is a better fit because of our key problems:
- Loops caused by changes
- There's no true hand-off from development to operations, just the people who run the network and those who don't.
We are afflicted by an industry of either change fear or "CAB purgatory," where once something is built, it can no longer be improved. This builds up a lot of technical debt that is rarely fixed by anything short of a forklift upgrade. Ideally, we can leverage CI tools in this way:
1. Clean slate: delete all workspace files
2. Write Feature code
3. Build configurations
4. Apply configurations to test nodes
5. Validate (manually or automatically, or both) that the change did what it was supposed to, and that it worked. If it fails, go back to step 1.
6. Stage the Feature release, do paperwork, etc.
7. Release the Feature to all applicable managed nodes
8. Work on the next Feature
I have attached a Jenkins Project that performs most of these tasks here. There are some caveats to this method that I'll cover below.
This should result in much higher quality work being released, and in the networking world, reliability is king. This is the key to becoming free of CAB purgatory in large organizations.
Since the majority of the muscle work with Jenkins has already been programmed, we simply need to focus on the source code (device configuration), and work from there:
- Create a new git branch. This can be achieved with `git checkout -b` or via your SCM GUI.
- Write code on the git branch. Ideally, you'd create a new CI project for this step against that specific branch, but there is no "production environment" to speak of in my home lab.
- Commit code. Again, small steps are still the best approach. The biggest change here is to periodically check in on your pipeline to see if anything breaks. This gives you the "fix or backpedal" opportunity at all times, and makes it easy to spot any breakage.
- Submit a git pull request. This is an opportunity for the team to review your results, so be sure to include some form of link to your CI testing/execution data to better make your case.
- Merge code. This will automatically roll to production at the next available window, and is your release lever.
For this change, we ran into a particularly odd behavior difference: VyOS was somewhat recently rebased from Quagga to FRR, which picked up the following behavior: http://docs.frrouting.org/en/latest/bgp.html#require-policy-on-ebgp
> **Require policy on EBGP**
> `[no] bgp ebgp-requires-policy`
> This command requires incoming and outgoing filters to be applied for eBGP sessions. Without the incoming filter, no routes will be accepted. Without the outgoing filter, no routes will be announced. This is enabled by default.
> When the incoming or outgoing filter is missing you will see "(Policy)" under `show bgp summary`:

```
exit1# show bgp summary

IPv4 Unicast Summary:
BGP router identifier 10.10.10.1, local AS number 65001 vrf-id 0
BGP table version 4
RIB entries 7, using 1344 bytes of memory
Peers 2, using 43 KiB of memory

Neighbor      V    AS MsgRcvd MsgSent TblVer InQ OutQ Up/Down  State/PfxRcd PfxSnt
192.168.0.2   4 65002       8      10      0   0    0 00:03:09            5 (Policy)
fe80:1::2222  4 65002       9      11      0   0    0 00:03:09     (Policy) (Policy)
```
This was preventing BGP route propagation, and was a result of an upstream change. In Software Development, this is called a "breaking change" because it implements major functional changes that will have potentially negative effects unless action is taken.
To mitigate this, we can develop a solution iteratively, using our lab environment: test, re-test, and then test again until we get the desired result. 24 commits later, I'm satisfied with the result. Once a solution is sound (passes automated testing), it is best practice to submit it for peer review. Git calls this action a pull request. Here's the one for this change:
Like the previous pull request, this particular implementation isn't huge. By code volume it was about 200 lines, but the real difference here is the multiplier. Of those lines:
- 100 are DRY (Don't Repeat Yourself) handling of highly repetitive code (templates)
- 57 are documentation
- 38 are DRY handling of highly repetitive code (variables)
The history of this pull request is publicly available. I made a few mistakes, and then caught them with automated testing, as everyone can see.
About two-thirds of the way through this, I realized I was rolling out IPv6 with a pull request. Neat.
This generates quite a bit of code, repeatably and reliably.
All in all, we're generating 1,020 lines of configuration from 833 lines of code. The ratio becomes more favorable, in terms of sheer work saved, the more homogeneous your environment and configurations are. If you're only evaluating saved time:
- 2 Devices may feel dubious
- 3 Devices will show real value in saved time
- 4+ the benefits become insane
The real value here is consistent configurations. Using traditional methods, I'd normally have a ton of frustration trying to configure things consistently: un-doing and re-doing copy-paste errors, and re-testing. If you configure both sides with Jinja2, they'll match exactly and peer up, every time.
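To illustrate why templated peerings can't drift: both sides of a link render from a single definition. This is a minimal sketch using plain `str.format` to stay dependency-free; in the repository the same idea is expressed as Jinja2 templates fed by YAML variables. The hostnames, addresses, and ASNs are invented, and the VyOS-style command line is illustrative:

```python
# One link record describes BOTH routers, so the two generated sides
# can never disagree on peer address or ASN.
LINK = {
    "a": {"hostname": "spine1", "ip": "192.168.0.1", "asn": 65001},
    "b": {"hostname": "leaf1", "ip": "192.168.0.2", "asn": 65002},
}

# Stand-in for a Jinja2 template; {placeholders} play the role of
# {{ variables }}.
TEMPLATE = "set protocols bgp {local_asn} neighbor {peer_ip} remote-as {peer_asn}\n"

def render_side(local, peer):
    """Render one router's half of the peering from the shared record."""
    return TEMPLATE.format(
        local_asn=local["asn"], peer_ip=peer["ip"], peer_asn=peer["asn"]
    )

# The same function, called with the roles swapped, yields two
# configurations that are symmetric by construction.
side_a = render_side(LINK["a"], LINK["b"])
side_b = render_side(LINK["b"], LINK["a"])
```

Since a copy-paste typo would have to appear in the shared record itself, both ends either peer up or fail identically, which is far easier to troubleshoot.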
This is the part where I truly value this approach. If an engineer or architect designs variable definitions well, the end result summarily defines the device. This can be attached in-line or as meta-data to a diagram, or easily verified against a diagram to ensure things are consistent. The few issues I had were quickly resolvable by comparing YAML to a diagram. I'm probably going to use this method to generate diagrams as well.
I trivialized the network driver aspect of this work. The one I chose, `vyos.vyos.vyos_config`, is not idempotent and was causing serious issues as a result (BGP neighbors dropping constantly as I re-applied the configuration). Off-the-shelf network drivers are perfectly well suited for prototyping, but substantial development is required to use them in production. That would take a full team to become reality, but a middle ground is readily achievable:
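One way to soften the idempotency problem without rewriting the driver is to diff the desired configuration against the running one and push only the delta. A minimal sketch, under the simplifying assumption that a config can be modeled as a flat set of `set ...` command lines (real VyOS semantics involve ordering and hierarchy this ignores):

```python
def plan_changes(desired, running):
    """Compare desired vs. running configuration, modeled as flat
    collections of 'set ...' command lines, and return only the delta.

    When the device is already compliant the plan is a no-op, so
    re-running the pipeline doesn't bounce BGP sessions by re-applying
    configuration that is already present.
    """
    desired, running = set(desired), set(running)
    return {
        "add": sorted(desired - running),     # lines to push
        "remove": sorted(running - desired),  # lines to delete
        "noop": desired == running,           # safe to skip apply entirely
    }
```

A driver wrapper that checks `plan["noop"]` before touching the device gets most of the benefit of idempotency without a full production-grade driver.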
- We call it Continuous Delivery for a reason.
- Automate when it helps you.
We can use this guidance to come up with a plan, for example:
- Milestone 1: Jinja2-fy your golden configurations, and stop manually generating them
- Milestone 2: Now that you have the config a device should have, `gather_facts` the current configuration, and generate a report to see if it's compliant.
- Milestone 3: Topical automation replaces manual remediation
- Milestone 4: Fully mature NETCONF springs forth and saves the day!
People who are at #4 aren't better than people who have finished #1. Use what's useful.
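Milestone 2 amounts to a read-only compliance report. A minimal sketch, assuming the intended config and the gathered running config can be compared as flat sets of lines (presence-only; a fuller diff would also weigh extra lines and hierarchy):

```python
def compliance_score(intended, actual):
    """Fraction of intended config lines actually present on the device."""
    intended = set(intended)
    if not intended:
        return 1.0
    return len(intended & set(actual)) / len(intended)

def report(hostname, intended, actual):
    """One human-readable drift line per device; no remediation here,
    which is exactly the Milestone 2 scope."""
    score = compliance_score(intended, actual)
    status = "COMPLIANT" if score == 1.0 else "DRIFTED"
    return f"{hostname}: {status} ({score:.0%} of intended lines present)"
```

Running this per device after a `gather_facts`-style collection gives an auditable drift report long before any automated remediation (Milestone 3) exists.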