Mustafa ERBAY

Posted on May 21 • Originally published at mustafaerbay.com.tr

Why is BGP Route Flap Management Only Easy in Theory?

#career #network #bgp #routeflap

While BGP route flap might seem like a simple network problem in theory, it can turn into a headache-inducing mess in production environments. When a route flap occurs, routing tables in the network are constantly updated, leading to packet loss, delays, and significant drops in overall network performance. In a recent client project, the route flap problem I experienced with BGP connections over two different ISPs once again showed me how deep and nuanced this topic truly is.

In this post, I will explain the fundamental causes of BGP route flap, its real impact on the network, and the practical approaches I use to resolve such issues. I will also share my own experiences on why some theoretically great solutions don't always work in practice.

What is BGP Route Flap and Why Does It Occur?

BGP route flap is a situation where a BGP router repeatedly changes the best path it has learned for a specific destination network within short periods. This can happen even when the BGP session itself stays up: prefixes in the routing table are continuously added, deleted, or have their attributes changed. It's not uncommon to see dozens, or even hundreds, of route updates within a few seconds. This means significant CPU load and memory consumption for backbone routers, and it can be completely paralyzing for smaller edge routers.

So, why do these flaps occur? They usually arise from a combination of multiple reasons. Sometimes, it can be triggered by fluctuations in a physical connection (e.g., a momentary fiber cut), a software bug in a peer router, or incorrect configurations in routing policies. On one occasion, in the infrastructure of my own side product, I detected this situation in my logs when momentary disconnections occurred between BGP peers during a VPS provider's maintenance work. These disconnections, instead of completely dropping the BGP session, only led to specific prefixes being momentarily withdrawn and re-advertised.

ℹ️ RFC 2439 and BGP Route Stabilization

BGP implementations can limit such fluctuations with a route flap dampening mechanism. Dampening is specified in RFC 2439 rather than in the base BGP-4 specification (RFC 4271), and it is not on by default; on Cisco IOS, for example, it has to be enabled explicitly. When a route changes more frequently than a configured threshold allows, the router stops using and advertising that route for a period. While this increases overall network stability, it involves trade-offs, especially for critical prefixes: a legitimate route change can be propagated with a delay.

The Real-World Impact of Route Flap on the Network

Route flap doesn't just strain router CPUs; it also directly impacts the end-user experience. When a route constantly changes, routers engage in intensive calculations to find the new best path. During this time, packets sent to the old route might be dropped or misrouted.

I encountered a similar situation in a manufacturing company's ERP system while pulling supply chain data via iSCSI integration. VPN tunnels between the company's head office and production site would occasionally experience outages due to instantaneous flapping of underlying BGP routes. This situation became more pronounced, especially at the beginning and end of the workday, when network traffic was at its peak. Outages starting around 08:30 AM and lasting 15-20 minutes prevented operators from accessing production planning screens. In my analysis, I noticed hundreds of prefix updates within seconds between the VPN gateways' BGP peers.

These continuous routing table changes lead to concrete problems such as:

Packet Loss: As routers calculate new routes or when old routes become invalid, packets may not reach their correct destination. Even 5% packet loss can severely affect VoIP calls or real-time application performance.
Increased Latency and Jitter: While routing tables change, packets may be forced to travel longer or less optimal paths. This leads to an increase in latency, measured in milliseconds. In a bank's internal platform, I measured that some financial transaction times were twice as long as normal due to this situation.
High Resource Utilization on Network Devices: Routers recalculate the routing table and send updates to their neighbors with every route change. This increases CPU and memory usage. In very intense flap situations, the router itself might freeze or be unable to perform other critical tasks.
BGP Session Instability: Excessive flapping can prevent Keepalive messages between BGP peers from being processed in a timely manner, leading to unexpected BGP session drops and re-establishments. This further complicates the problem.

Root Cause Analysis: Finding the Source of the Problem

The first step in resolving BGP route flap issues is to correctly identify the root cause. This is often a detective-like process that requires combining multiple log sources and network monitoring tools. In such situations, I proceed step-by-step and generally follow these steps:

Examine BGP Router Logs: Router logs provide the most valuable information about BGP session status, route updates, and error messages. I check the general status with commands like show ip bgp summary or show ip bgp neighbors, then look at the show ip bgp output to see whether specific prefixes keep changing. On one occasion, in a client's network, repeated runs of show ip bgp neighbors x.x.x.x received-routes (which requires soft-reconfiguration inbound on that neighbor) showed a specific 192.168.1.0/24 prefix being advertised and withdrawn 17 times within minutes. This was a clear indication of a flap.
```
Router# show ip bgp neighbors x.x.x.x advertised-routes
BGP table version is 1234, local router ID is 10.0.0.1
Status codes: s suppressed, d damped, h history, * valid, > best, i internal,
              r RIB-failure, S Stale, m multipath, b backup-path, x best-external, a additional-path, c RIB-compressed
Origin codes: i - IGP, e - EGP, ? - incomplete

   Network          Next Hop            Metric LocPrf Weight Path
*> 192.168.1.0/24   0.0.0.0                  0    100  32768 i
*> 10.10.10.0/24    0.0.0.0                  0    100  32768 i
```
This output shows which prefixes the router itself is advertising. To understand flapping, however, it is more important to check the routes received from the peer with the show ip bgp neighbors x.x.x.x routes command and to see whether any of them carry the d (damped) or h (history) status code. The show logging command also provides critical information about BGP session drops and error messages.
Physical Layer Check: As a rule of thumb from my own experience, a large share of network problems originate at the physical layer. Cables, optical modules (SFP/SFP+), patch panels, and power supplies should be checked. A few years ago, I discovered that a BGP flap issue in a data center was caused by a power supply fluctuation in a server, which momentarily affected the switch port.
ISP Side Investigation: If the BGP flap occurs in a peering with an ISP, the problem can often be on the ISP's side. Instabilities in their networks, maintenance work, or faulty configurations can cause flapping on our end. In such cases, I contact the ISP and ask them to check their logs and network status. Sometimes, an incorrect route-map or prefix-list configuration on their end can cause flapping in my network.
End-to-End Monitoring: By identifying the IP addresses and target systems of users experiencing latency or outages at the application level, and tracing the paths of this traffic with tools like traceroute or mtr, we can get clues about where the problem started. In my own network monitoring tool for my site, I observed that latency values to specific destinations would spike from 20ms to 500ms in 10-second intervals. This indicated that the problem was at the network layer.
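
The log review in the steps above can be partly automated. Here is a minimal Python sketch, assuming the update events have already been exported into a hypothetical text format (`<epoch-seconds> <ANNOUNCE|WITHDRAW> <prefix>`). Real router logs look different and need their own parser, and the function name, window, and threshold are illustrative choices, not part of any vendor tool.

```python
import re

# Hypothetical export format, one event per line:
#   "<epoch-seconds> <ANNOUNCE|WITHDRAW> <prefix>"
EVENT = re.compile(r"^(\d+)\s+(ANNOUNCE|WITHDRAW)\s+(\S+)$")


def flapping_prefixes(lines, window=300, threshold=5):
    """Return {prefix: max events seen inside any `window`-second span}
    for prefixes whose event count reaches `threshold`."""
    stamps_by_prefix = {}
    for line in lines:
        match = EVENT.match(line.strip())
        if not match:
            continue  # skip anything that is not an update event
        ts, _kind, prefix = match.groups()
        stamps_by_prefix.setdefault(prefix, []).append(int(ts))

    result = {}
    for prefix, stamps in stamps_by_prefix.items():
        stamps.sort()
        busiest = 0
        start = 0
        # Sliding window over the sorted timestamps.
        for end in range(len(stamps)):
            while stamps[end] - stamps[start] > window:
                start += 1
            busiest = max(busiest, end - start + 1)
        if busiest >= threshold:
            result[prefix] = busiest
    return result


sample = [
    "1000 WITHDRAW 192.168.1.0/24",
    "1010 ANNOUNCE 192.168.1.0/24",
    "1020 WITHDRAW 192.168.1.0/24",
    "1030 ANNOUNCE 192.168.1.0/24",
    "1040 WITHDRAW 192.168.1.0/24",
    "1050 ANNOUNCE 192.168.1.0/24",
    "1000 ANNOUNCE 10.10.10.0/24",
    "9000 WITHDRAW 10.10.10.0/24",
]
print(flapping_prefixes(sample))
```

A sliding window is used instead of a plain total so that a prefix that changed a handful of times over a whole day is not reported as flapping.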

Route Flap Dampening: Benefits and Pitfalls

BGP route flap dampening is a mechanism that "damps" (suppresses) a route for a period if a router changes that route more frequently than a certain threshold. This protects the router's routing table from excessive updates and increases overall network stability. However, dampening also has its own pitfalls.

Damping a route essentially means making that route unusable for a while. If the flapping route is the only exit path for a critical service, applying dampening can cause that service to be completely cut off. Therefore, dampening parameters must be set very carefully.

It is typically configured with the bgp dampening command. The example below spells out the Cisco IOS default values explicitly:

router bgp 65000
 bgp dampening 15 750 2000 60

The parameters here are:

15: Half-life time (in minutes). The time it takes for the penalty score to halve once the route is stable.
750: Reuse threshold (penalty score). When the score decays below this level, a suppressed route becomes usable again.
2000: Suppress threshold (penalty score). When the score exceeds this level, the route is suppressed. On IOS a withdrawal typically adds 1000 to the score, so a few flaps in quick succession are enough.
60: Max suppress time (in minutes). The maximum time a route can be suppressed; IOS accepts 1 to 255 here.
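
To get a feel for how these parameters interact, here is a small Python sketch of the penalty arithmetic. It assumes the commonly documented Cisco behaviour of adding 1000 per flap and uses half-life 15 minutes, reuse 750, and suppress 2000 as sample values. It deliberately ignores the max suppress time and any vendor-specific rounding, so treat it as an illustration of the decay curve, not as a model of a particular router.

```python
import math


def decayed_penalty(penalty, minutes, half_life):
    """Penalty left after `minutes` without flaps (halves every half_life)."""
    return penalty * 0.5 ** (minutes / half_life)


def minutes_until_reuse(penalty, reuse, half_life):
    """Stable minutes needed before `penalty` decays down to `reuse`."""
    if penalty <= reuse:
        return 0.0
    return half_life * math.log2(penalty / reuse)


def simulate(flap_minutes, half_life=15.0, reuse=750.0,
             suppress=2000.0, per_flap=1000.0):
    """Return (minute, penalty, suppressed) right after each flap."""
    penalty, last, suppressed, history = 0.0, None, False, []
    for minute in flap_minutes:
        if last is not None:
            penalty = decayed_penalty(penalty, minute - last, half_life)
            if suppressed and penalty < reuse:
                suppressed = False  # decayed enough to be reusable
        penalty += per_flap  # assumed fixed penalty per flap
        if penalty > suppress:
            suppressed = True
        last = minute
        history.append((minute, round(penalty, 1), suppressed))
    return history


for row in simulate([0, 1, 2]):
    print(row)
print(minutes_until_reuse(3000, 750, 15))
```

With these sample values, three flaps one minute apart are enough to cross the suppress threshold, and the route then needs roughly half an hour of stability before it is used again.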

In my experience, leaving these parameters at their default values often causes problems. This is because default values are usually too aggressive and can suppress critical routes for longer than necessary, even during short-term fluctuations. In a production ERP system, we observed that access to a specific external IP was cut off for 10 minutes. The reason was that this IP experienced a momentary flap and was suppressed due to dampening. In such situations, I might prefer to completely disable dampening with the no bgp dampening command or make the parameters less aggressive. [Related: Stable BGP Configurations in the Network]

⚠️ Dampening and Critical Services

Dampening should be used carefully for BGP routes used by critical services. A route entering dampening can cause that service to become unreachable. Therefore, it may be safer to apply dampening only in special cases or for specific prefixes.

Practical Solutions and Mitigation Strategies

Route flap management is not a problem that can be solved with a single magic wand. It usually requires applying multiple strategies simultaneously. Here are some practical approaches I use in production environments:

1. BGP Keepalive and Hold Time Settings

Keepalive and Hold Time values are important for the stability of BGP sessions. Default values are typically 60 seconds for Keepalive and 180 seconds for Hold Time. On unstable or high-latency links, however, these values may need tuning.

In a client project, we established BGP peering over VPN tunnels between two different data centers. The underlying physical link occasionally experienced micro-outages. These did not take the link down, but they delayed Keepalive messages long enough for the hold timer to expire, so the session was reset and routes flapped. The timers cut both ways: a longer Hold Time lets the session ride out short outages but delays the detection of a genuine fault, while a shorter one detects faults sooner but resets more readily. The example below sets Keepalive to 30 seconds and Hold Time to 90 seconds. Note that this is shorter than the 60/180 defaults, so choose values relative to the outages your link actually sees. The Hold Time that takes effect is the lower of the two peers' values.

router bgp 65000
 neighbor x.x.x.x timers 30 90
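
Because the Hold Time is negotiated, the value configured on one side is not necessarily the one that takes effect. The following Python sketch shows the rule as I understand it from RFC 4271: the session uses the smaller of the two proposed Hold Times, and a Keepalive interval of one third of it is the suggested value. The function name is hypothetical, and individual vendors may apply extra rules on top.

```python
def negotiate_timers(local_hold, remote_hold):
    """Return (hold_time, keepalive) in seconds for a BGP session.

    RFC 4271 requires a Hold Time of 0 or at least 3 seconds and uses
    the smaller of the two proposed values. A Hold Time of 0 disables
    keepalives altogether.
    """
    for value in (local_hold, remote_hold):
        if value != 0 and value < 3:
            raise ValueError("Hold Time must be 0 or at least 3 seconds")
    hold = min(local_hold, remote_hold)
    keepalive = hold // 3  # suggested ratio, not mandated
    return hold, keepalive


# Local side configured with 90, peer still on the 180 default.
print(negotiate_timers(90, 180))
```

In practice this means that raising the Hold Time on only one side changes nothing if the peer proposes a lower value.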

2. Route Filtering with Prefix-Lists and Route-Maps

Not receiving or advertising unnecessary prefixes reduces the size of the BGP table and minimizes the risk of flapping. I use prefix-list and route-map to keep strict limits on which prefixes are accepted or advertised.

For example, we might only want to advertise our network's 10.0.0.0/8 block and only accept specific ISP prefixes:

ip prefix-list OUR_PREFIXES seq 5 permit 10.0.0.0/8 le 32
ip prefix-list ALLOWED_ISP_PREFIXES seq 10 permit 192.0.2.0/24

route-map ADVERTISE_OUT permit 10
 match ip address prefix-list OUR_PREFIXES
!
route-map ACCEPT_IN permit 10
 match ip address prefix-list ALLOWED_ISP_PREFIXES
!
router bgp 65000
 neighbor x.x.x.x send-community both
 neighbor x.x.x.x route-map ADVERTISE_OUT out
 neighbor x.x.x.x route-map ACCEPT_IN in

This way, I can prevent flap issues that might arise from accidentally advertised or received prefixes. Especially restricting prefix length with le and ge parameters allows me to perform more specific filtering. On one occasion, my routers experienced significant resource consumption because an ISP accidentally advertised a /8 prefix. Such filtering prevents these kinds of situations. [Related: Secure BGP Configurations]
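
Before pushing a prefix-list to a router, it can help to check offline which prefixes it would let through. Below is a minimal Python sketch of the ge/le matching rule using the standard ipaddress module. The function name is my own, and it models a single permit entry only, not sequence numbers, deny entries, or the implicit deny at the end of a list.

```python
import ipaddress


def prefix_list_permits(candidate, entry, ge=None, le=None):
    """True if `candidate` matches the entry `entry [ge N] [le M]`."""
    cand = ipaddress.ip_network(candidate)
    base = ipaddress.ip_network(entry)
    # The candidate must fall inside the entry's address range.
    if cand.version != base.version or not cand.subnet_of(base):
        return False
    if ge is None and le is None:
        # Without ge/le only an exact prefix-length match is accepted.
        return cand.prefixlen == base.prefixlen
    low = ge if ge is not None else base.prefixlen
    high = le if le is not None else cand.max_prefixlen
    return low <= cand.prefixlen <= high


# OUR_PREFIXES: permit 10.0.0.0/8 le 32
print(prefix_list_permits("10.1.2.0/24", "10.0.0.0/8", le=32))
# ALLOWED_ISP_PREFIXES: permit 192.0.2.0/24 (exact match only)
print(prefix_list_permits("192.0.2.128/25", "192.0.2.0/24"))
```

The second call illustrates why an entry without ge or le is strict: a more specific /25 inside the permitted /24 is still rejected.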

3. Limiting the Maximum Number of Prefixes

Receiving far more prefixes than expected from a BGP peer can indicate that the peer is misconfigured or is the target of an attack. I can limit the number of prefixes accepted from a peer using the maximum-prefix command.

router bgp 65000
 neighbor x.x.x.x maximum-prefix 5000 75 restart 5

In this example, I accept a maximum of 5000 prefixes from neighbor x.x.x.x. When the count passes 75% of the limit (3750 prefixes), a warning is logged. When it exceeds 5000 prefixes, the BGP session is torn down, and restart 5 re-establishes it automatically after 5 minutes. This prevented a disaster at a large Turkish e-commerce site when an ISP's misconfiguration caused hundreds of thousands of prefixes to be advertised.
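
The thresholds in that command are simple arithmetic, sketched here in Python. The function and state names are hypothetical, and the handling of the exact boundary (whether a count equal to the threshold already triggers) is an assumption; check the behaviour on your platform.

```python
def max_prefix_state(received, limit=5000, warn_percent=75):
    """Classify a received-prefix count against a maximum-prefix limit."""
    warn_at = limit * warn_percent // 100  # 3750 for 5000 at 75%
    if received > limit:
        return "teardown"  # session is dropped
    if received > warn_at:
        return "warning"   # log message only
    return "ok"


for count in (3000, 4200, 250000):
    print(count, max_prefix_state(count))
```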

4. Utilizing BFD (Bidirectional Forwarding Detection)

BFD is a lightweight protocol that lets a router detect the loss of forwarding toward a neighbor much faster than BGP's own timers can. While detection based on Keepalive and Hold Time takes seconds or even minutes, BFD can detect a failure within milliseconds. This lets a failing connection bring the BGP session down much sooner, so traffic moves to an alternate path quickly and the flap is shorter.

router bgp 65000
 neighbor x.x.x.x fall-over bfd

BFD has been a lifesaver for me, especially in environments requiring fast recovery and running sensitive applications (e.g., VoIP or financial transaction platforms). However, for BFD to work, both routers must support and be configured for BFD.
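
To put the difference in numbers, here is a small Python sketch of the BFD detection time in asynchronous mode as described in RFC 5880: the agreed receive interval multiplied by the remote detect multiplier. The 50 ms interval and multiplier of 3 are sample values I picked for illustration, not a recommendation.

```python
def bfd_detection_time_ms(local_min_rx_ms, remote_min_tx_ms, remote_multiplier):
    """Detection time for the local system in BFD asynchronous mode."""
    # The effective interval is the slower of what we can receive
    # and what the peer wants to send.
    interval = max(local_min_rx_ms, remote_min_tx_ms)
    return interval * remote_multiplier


bfd_ms = bfd_detection_time_ms(50, 50, 3)
hold_time_ms = 180 * 1000  # default BGP Hold Time for comparison
print(bfd_ms, hold_time_ms)
```

The gap between the two figures is why BFD matters for sensitive applications, but it also means very aggressive intervals on a lossy link can drop the session more often, so the values need the same care as the BGP timers.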

5. High Availability and Redundancy

Finally, relying on a single exit point or a single ISP increases the risk of route flap. Establishing connections with multiple ISPs and using them in active/passive or active/active mode prevents a problem with one ISP from affecting the entire network. In my experience, by balancing the service I receive from two different ISPs with BGP, I ensured that momentary instabilities in one did not burden the other, keeping the overall network stable. However, in this architecture, BGP path selection processes, local preferences (Local Pref), AS-Path Prepend, and other factors come into play, and correct configuration is critically important. [Related: High Availability Network Design with Dual ISPs]
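
How Local Pref and AS-Path Prepend steer traffic between two ISPs can be sketched in a few lines of Python. This is a deliberately simplified model that compares only Local Pref and AS path length. The real BGP decision process has many more steps (weight, origin, MED, eBGP over iBGP, and so on), and the AS numbers below come from the documentation range.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Path:
    via: str
    local_pref: int = 100
    as_path: Tuple[int, ...] = ()


def best_path(paths):
    """Prefer the highest Local Pref, then the shortest AS path."""
    return max(paths, key=lambda p: (p.local_pref, -len(p.as_path)))


# Outbound: Local Pref makes ISP-A primary despite its longer AS path.
outbound = [
    Path("ISP-A", local_pref=200, as_path=(64500, 64510)),
    Path("ISP-B", local_pref=100, as_path=(64501,)),
]
print(best_path(outbound).via)

# Inbound, as seen by a remote network: our prepends on the ISP-B side
# make that path longer, so the remote side prefers ISP-A.
inbound = [
    Path("ISP-A", as_path=(64500, 64496)),
    Path("ISP-B", as_path=(64501, 64496, 64496, 64496)),
]
print(best_path(inbound).via)
```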

💡 Measurement and Verification

Always take measurements before and after making any changes. Monitor the flap count and the router's CPU/memory usage with commands like show ip bgp summary and show ip bgp neighbors x.x.x.x flap-statistics. In the backend of my own side product, I continuously monitor changes in BGP prefix count and router CPU usage using Prometheus and Grafana. This is critical for me to understand when a problem started and whether my intervention was effective.
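
A before/after comparison does not need a full monitoring stack to get started. As a minimal sketch, assuming prefix counts sampled at a fixed interval (the numbers below are made up), the total change between consecutive samples gives a rough churn figure:

```python
def churn(samples):
    """Sum of absolute changes between consecutive prefix-count samples."""
    return sum(abs(b - a) for a, b in zip(samples, samples[1:]))


before_fix = [100, 90, 100, 80, 100]   # hypothetical samples
after_fix = [100, 100, 99, 100]

print(churn(before_fix), churn(after_fix))
```

A count-based figure like this misses flaps where a prefix is withdrawn and re-advertised between two samples, so it complements the router's own flap statistics and does not replace them.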

Conclusion

BGP route flap management is not a task that can be accomplished solely based on theoretical protocol knowledge. The myriad variables encountered in the field span a wide range, from the physical layer to software bugs. In my 20 years of system and network management experience, solving these problems has always been a detective story. It requires first correctly interpreting the symptoms, then finding the root cause with the right tools, and finally implementing the most appropriate mitigation strategy.

Remember, every network environment has its unique dynamics, and there is no single "one-size-fits-all" solution. Flexibility, continuous monitoring, and problem-solving ability are the keys to success in managing complex protocols like BGP. The next time you encounter a route flap, I believe you can solve the problem more systematically by following these steps.
