Jeff Kronlage's CCIE Study Blog: BGP PIC and Add-Path

The meat of this article will be Add-Path, and why it's needed in certain PIC scenarios. However, understanding where and why we need these technologies, what was done before the Add-Path implementation was widely in place, etc, is nearly as challenging to learn as the Add-Path implementation itself.

This is not intended to completely document Add-Path, nor is it just a primer. My original intent was to document the entire use of Add-Path; however, I realized halfway through that this would have easily produced a 50+ page document. There are many one-off cases for Add-Path that have their own features, and showing a use case for each one would've required several different topologies and drawings. My hope is that, at the depth I took it to, it will be more than sufficient to educate to the level required for the CCIE R&S v5 lab.

So - what is PIC?

PIC stands for Prefix Independent Convergence.

PIC is a method for speeding up convergence of the FIB under failover conditions.

Unless you have a really serious lab or a Spirent to play with, forget trying to lab the performance gain. The gains we're talking about here are only seen when you have tens of thousands, hundreds of thousands, or even 1M routes in your FIB.

The use case is actually pretty easy to understand - when the next-hop to a set of prefixes changes, the router (presumably talking about a 7600 or ASR) has to walk each prefix in the FIB and update the next-hop. If you have 100 routes, this time is negligible. If you're carrying 1M routes in an MPLS environment, this is not a small problem. I've been told first-hand (from someone who does have a Spirent to play with) that this takes about two minutes.

This would be Prefix Dependent Convergence, or a problem that grows dependent upon how many prefixes are in your FIB. The solution we want is something that updates in the same amount of time (presumably small amount of time!) no matter how many FIB entries we have.

The concept of the FIB dates back decades now, and when it was originally written it was made in the most efficient manner possible, for CPU and RAM conservation:

Prefix = Interface/Next-Hop

For example,

10.10.10.10/32 = FastEthernet0/0 192.168.0.1

This was great 20 years ago when a "large" routing table was 40,000 routes. To converge quickly, a new method is required. Introducing the Hierarchical FIB.

When using PIC, the FIB actually restructures to a 3-tier system:

Prefix = Pointer = Interface/Next-Hop

Understanding why this is better takes understanding that while a router may be carrying 1M routes, it's probably only directly connected (layer 3) to a dozen or less. So you've got 1M routes, and 12 possible exits.

Let's say half those routes go out to two primary edge routers. Those routers are at 192.168.1.1 and 192.168.2.1.

So, roughly half your routes look like:

10.10.10.10/32 = Pointer A = Gigabit0/0 192.168.1.1
11.11.11.11/32 = Pointer A = Gigabit0/0 192.168.1.1
.... 499,998 routes later ...
197.197.197.197/32 = Pointer A = Gigabit0/0 192.168.1.1

192.168.1.1 fails. However, all these same prefixes are reachable via 192.168.2.1.
With an appropriately designed network, PIC can simply reassign Pointer A. This takes less than 50ms as opposed to 60+ seconds.

10.10.10.10/32 = Pointer A = Gigabit0/1 192.168.2.1
11.11.11.11/32 = Pointer A = Gigabit0/1 192.168.2.1

.... 499,998 routes later ...
197.197.197.197/32 = Pointer A = Gigabit0/1 192.168.2.1

The CEF process updated one value, that of Pointer A. Previously this took 500,000 updates, now it takes one. The time required for this process is independent of how many routes use the next-hop, hence Prefix Independent Convergence.
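To make the pointer idea concrete, here is a minimal sketch in Python. It is a toy model, not how CEF is actually implemented; the class and function names are made up for illustration, and I've used 5,000 prefixes instead of 500,000 so it runs instantly.

```python
# Toy model of a flat FIB versus a hierarchical (PIC-style) FIB.
# All names are hypothetical; real CEF data structures are far richer.

class Pointer:
    """Shared indirection object: many prefixes, one next-hop record."""
    def __init__(self, interface, next_hop):
        self.interface = interface
        self.next_hop = next_hop

def build_flat(prefixes, interface, next_hop):
    # Flat FIB: every prefix stores its own copy of the exit.
    return {p: (interface, next_hop) for p in prefixes}

def build_hierarchical(prefixes, pointer):
    # Hierarchical FIB: every prefix references the same pointer object.
    return {p: pointer for p in prefixes}

def fail_over_flat(fib, dead_next_hop, interface, next_hop):
    """Rewrite each affected prefix; returns the number of writes."""
    writes = 0
    for prefix, (_, nh) in fib.items():
        if nh == dead_next_hop:
            fib[prefix] = (interface, next_hop)
            writes += 1
    return writes

def fail_over_hierarchical(pointer, interface, next_hop):
    """Rewrite the shared pointer once; returns the number of writes."""
    pointer.interface = interface
    pointer.next_hop = next_hop
    return 1

prefixes = [f"10.{i // 256}.{i % 256}.0/24" for i in range(5000)]
flat = build_flat(prefixes, "Gigabit0/0", "192.168.1.1")
pointer_a = Pointer("Gigabit0/0", "192.168.1.1")
hier = build_hierarchical(prefixes, pointer_a)

flat_writes = fail_over_flat(flat, "192.168.1.1", "Gigabit0/1", "192.168.2.1")
hier_writes = fail_over_hierarchical(pointer_a, "Gigabit0/1", "192.168.2.1")
print(flat_writes, hier_writes)
```

The flat table needs one write per prefix, while the hierarchical one needs a single write no matter how many prefixes share Pointer A.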

Now if you're following along, you probably see the enormous catch here: unless you're multipathing, how is CEF even going to know about the second path? PIC is a data-plane/CEF/FIB feature, it doesn't touch the control-plane. Normally we'd have to wait on BGP convergence (topology dependent), which takes a heck of a lot longer than 50ms. As we're all aware, and this is key to understanding this topic, BGP only sends its single best-path per-prefix to its neighbors. What if we needed two or more? Even worse, what if we're crossing a route-reflector, that aggregates everyone's paths and picks only one?

I am going to cover four different ways to solve this, add-path being the newest of them.

Here are the options at a high-level:
1) Multipath. This is by far the easiest option if your topology fits.
2) BGP Advertise-Best-External. For advertising from PE->PE, or PE->RR; this tells the edge PE to send its external route (presumably from a CE via eBGP) as best. More below.
3) Diverse-Path (Shadow Router). This tells a route reflector, a secondary one in a topology, to deliberately calculate a "second-best" path that has a different next-hop. Instead of forwarding its best-path, it forwards this "second-best" path. Only the route-reflector needs to be updated to support this feature.
4) Add-Path. In short, Add-Path modifies the BGP behavior to send two or more paths instead of just one best-path. This requires that every device in the topology that needs to send or receive multiple paths supports Add-Path.

I've chosen to demonstrate these solutions in a VPNv4 environment, as it's where PIC makes the most sense. Note that on IOS, add-path is purely an iBGP technology; the parser gets upset if you try it on eBGP:

R3(config-router)#neighbor 192.168.30.2 advertise additional-paths all
% BGP: Add-Path *not* supported on EBGP peering

I have a hobby (perhaps more of an interest?) of the language used in IOS parser messages. Half the time, unless you know the technology already, you can't even tell what the programmer was trying to convey when you make a mistake. If it's a new feature sometimes you don't even get an error, it just doesn't apply the config. Then other times you get blunt messages with *stars*!

I'm running a common VPNv4 design: BGP on the PEs, VRFs between CE and PE, and a "BGP free core" (all one P router that isn't a route reflector :) ).

On the PE->CE links, I'm using 10.0.X.Y/24, where X is a combination of the two routers the link connects (i.e. R1->R2 is "12"), and Y is the router number. This is also the same number on the subinterface on the diagram.

On the PE->P or PE->RR links, the IPs are 192.168.X.Y, same explanation of X and Y as above.

Every router has a loopback0 of Y.Y.Y.Y/32, where Y is the router number.
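If it helps to see the addressing convention spelled out, here is a small Python sketch. The helper names are hypothetical, and I'm assuming the lower router number always comes first in X, which matches the link numbers that show up in the outputs later on.

```python
# Hypothetical helpers that encode the lab's addressing convention.

def link_id(a, b):
    # X is the two router numbers written side by side.
    # Assumption: the lower number comes first (e.g. R2-R5 is "25").
    lo, hi = sorted((a, b))
    return f"{lo}{hi}"

def pe_ce_ip(router, peer):
    # PE->CE links: 10.0.X.Y
    return f"10.0.{link_id(router, peer)}.{router}"

def core_ip(router, peer):
    # PE->P and PE->RR links: 192.168.X.Y
    return f"192.168.{link_id(router, peer)}.{router}"

def loopback(router):
    # loopback0: Y.Y.Y.Y/32
    return ".".join([str(router)] * 4) + "/32"

print(pe_ce_ip(9, 7))   # R9's address on the R7-R9 link
print(core_ip(5, 2))    # R5's address on the R2-R5 link
print(loopback(6))      # R6's loopback0
```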

Note that R4 is a route-reflector, and R6, R7 and R8 are all PEs.

Let's talk about the two flavors of PIC. There's PIC Core and PIC Edge. They're both applied to a PE.

PIC Core is far simpler than PIC Edge, so we'll start there. We've enabled PIC Core on R2.

PIC Core is enabled with one command:

R2-PE(config)#cef table output-chain build favor convergence-speed

Of note, to disable it, you replace "convergence-speed" with "memory-utilization".

Unlike PIC Edge, which, depending on the implementation, may require widespread support on the network, PIC Core can literally be enabled on just one device if you wanted.

As mentioned above, in a typical VPNv4 scenario, the core is BGP-free, and only the PEs (and any route reflectors) maintain the BGP table. Next hops to the PEs are carried in the IGP. Let's look at how that plays out:

- Let's assume R1's bestpath to R9 is via R2. R1 is BGP peered to R2.

- R2 takes R1's routes into a VRF. It imports the VRF routes into VPNv4.

- R4, the route reflector, learns via iBGP that the PEs R6 and R7 can both reach R9. It chooses R6 as the bestpath.

- R2, only peered with R4 for iBGP, learns that R6 is the bestpath.

- Since this is VPNv4, R2 needs to choose an LDP-enabled next hop that has a label for 6.6.6.6. Remember, in VPNv4, the next hop inside the iBGP network is always the iBGP next-hop. The IGP indicates that R5 is the bestpath for R2 to reach R6 (via MPLS).

The key element here is the recursion between R2 and R6:

BGP tells R2 how to reach R9 via R6: 2.2.2.2 -> 6.6.6.6

R2 needs to find out how to reach 6.6.6.6 via the IGP: 2.2.2.2 -> R5

R2 needs to know how to reach R5: 2.2.2.2 -> 192.168.25.5 (R5's interface IP)

R2 needs to pick an interface to reach 192.168.25.5: gig1.25

So one more time!

iBGP: 2.2.2.2 -> 6.6.6.6

iBGP Next-Hop via OSPF: Find 6.6.6.6 via R5 at 192.168.25.5

CEF: Exit interface gig1.25 towards 192.168.25.5

I'm going to harp on the high-level of this again because it's dead critical to understanding the hierarchy of the process:
BGP recurses to IGP
IGP recurses to one or more Next Hops
FIB populates one or more next hops from the IGP
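The same three-step recursion can be sketched as a toy lookup in Python. This is illustrative only: the dictionaries stand in for the BGP table, the IGP routes and the connected interfaces, and the values are taken from this lab. The last few lines show what an IGP change looks like from this point of view, with the BGP entry never being touched.

```python
# Toy three-step lookup mirroring the recursion above. The dictionaries are
# stand-ins for illustration, not real IOS data structures.

bgp_table = {"9.9.9.9/32": "6.6.6.6"}         # prefix -> BGP next hop
igp_table = {"6.6.6.6": ["192.168.25.5"]}     # BGP next hop -> IGP next hops
connected = {"192.168.25.5": "gig1.25",       # IGP next hop -> exit interface
             "192.168.24.4": "gig1.24"}

def resolve(prefix):
    """Return (BGP next hop, IGP next hop, interface) for each usable exit."""
    bgp_next_hop = bgp_table[prefix]
    return [(bgp_next_hop, hop, connected[hop])
            for hop in igp_table[bgp_next_hop]]

before = resolve("9.9.9.9/32")

# The IGP moves 6.6.6.6 behind R4. The BGP table is not modified at all.
igp_table["6.6.6.6"] = ["192.168.24.4"]
after = resolve("9.9.9.9/32")

print(before)
print(after)
```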

When you're using PIC Core, this is what we care about:
BGP recurses to IGP
IGP recurses to one or more Next Hops <-- PIC CORE INFLUENCES
FIB populates one or more next hops from the IGP <-- PIC CORE INFLUENCES

I will demonstrate below.

So given that R1 -> R2 -> R5 -> R6 -> R9 above, let's say R5 goes completely offline - dead.

This does not impact the BGP session between R2 and R4, or between R4 and R6. However, the next hop that the BGP next-hop resolves to (192.168.25.5), which R2 learned from the IGP, must change. The IGP can reconverge very quickly, but let's say the BGP process was carrying 1M routes from R9. How long will it take R2 to update the next-hops of the BGP table and CEF?

So to be clear, BGP is not reconverging. PIC Core cannot handle a BGP reconverge, you need PIC Edge for that. But if the IGP reconverges and requires the BGP Table and FIB to update, and you have a large quantity of routes, this can create a major impact on a PE - possibly several minutes of dropping traffic.

With a traditional FIB, we'd have to make 1 million updates in both the BGP table and the FIB in order to be fully forwarding again. With a hierarchical FIB - what PIC Core provides us - the following process would happen:

The FIB, before:
Prefix 1 -> Pointer A (192.168.25.5) -> gig1.25

The IGP reconverges the path via R4.
Now we update Pointer A - one value instead of 1M values - and we end up with:

Prefix 1 -> Pointer A (192.168.24.4) -> gig1.24

So to reiterate, PIC Core is for failure of non-BGP speakers. It doesn't help if BGP itself needs to reconverge, but it does dramatically speed up CEF's failover if the IGP fails.

Now moving in to the more complex PIC Edge.

If PIC Core was about dealing with IGP failure, PIC Edge is about dealing with BGP failure.

For the moment, we'll continue using our VPNv4 topology, except we're temporarily removing the route reflector and instead installing a full-mesh iBGP.

Please note that using PIC Edge should involve running BFD between the BGP speakers for fast detection of a failure. For simplicity, I've omitted this step. To learn more about BFD, please see my BFD blog: http://brbccie.blogspot.com/2014/06/everything-bfd.html

That's quite a few iBGP peerings. The red lines indicate all of them:

In this scenario, we're going to deal with R2's convergence process again, except we're going to assume R6 - the BGP-adjacent PE - dies, instead of a P router.

Let's look at our routing protocols from R2's perspective.

R2-PE#sh bgp vpnv4 un all 9.9.9.9
BGP routing table entry for 1:1:9.9.9.9/32, version 11
Paths: (3 available, best #2, table VPN)
Advertised to update-groups:
1
Refresh Epoch 3
300
8.8.8.8 (metric 3) (via default) from 8.8.8.8 (8.8.8.8)
Origin IGP, metric 0, localpref 100, valid, internal
Extended Community: RT:1:1
mpls labels in/out nolabel/16
rx pathid: 0, tx pathid: 0
Refresh Epoch 3
300
6.6.6.6 (metric 3) (via default) from 6.6.6.6 (6.6.6.6)
Origin IGP, metric 0, localpref 100, valid, internal, best
Extended Community: RT:1:1
mpls labels in/out nolabel/22
rx pathid: 0, tx pathid: 0x0
Refresh Epoch 3
300
7.7.7.7 (metric 3) (via default) from 7.7.7.7 (7.7.7.7)
Origin IGP, metric 0, localpref 100, valid, internal
Extended Community: RT:1:1
mpls labels in/out nolabel/26
rx pathid: 0, tx pathid: 0

As expected, R2 has three BGP paths to 9.9.9.9. 6.6.6.6 is the best.

How do we reach 6.6.6.6?

R2-PE#sh ip ospf route | s 6.6.6.6
*> 6.6.6.6/32, Intra, cost 3, area 0
via 192.168.24.4, GigabitEthernet1.24
via 192.168.25.5, GigabitEthernet1.25

The BGP table has one selected bestpath, the IGP has two multipath bestpaths to BGP's next hop:

R2-PE#sh ip route 6.6.6.6
Routing entry for 6.6.6.6/32
Known via "ospf 1", distance 110, metric 3, type intra area
Last update from 192.168.24.4 on GigabitEthernet1.24, 00:22:04 ago
Routing Descriptor Blocks:
* 192.168.25.5, from 6.6.6.6, 01:31:56 ago, via GigabitEthernet1.25
Route metric is 3, traffic share count is 1
192.168.24.4, from 6.6.6.6, 00:22:04 ago, via GigabitEthernet1.24
Route metric is 3, traffic share count is 1

R2-PE#sh ip cef 6.6.6.6
6.6.6.6/32
nexthop 192.168.24.4 GigabitEthernet1.24 label 17
nexthop 192.168.25.5 GigabitEthernet1.25 label 18

Now let's refer back to my process from earlier:

BGP recurses to IGP
IGP recurses to one or more Next Hops
FIB populates one or more next hops from the IGP

BGP says use 6.6.6.6
IGP says to get to 6.6.6.6 use either 192.168.25.5 or 192.168.24.4
FIB points to 192.168.24.4 / tag 17 and 192.168.25.5 / tag 18 multipath

Now what happens if 6.6.6.6 fails?

R6-PE(config)#int gig1.46
R6-PE(config-subif)#shut
R6-PE(config-subif)#int gig1.56
R6-PE(config-subif)#shut
R6-PE(config-subif)#int gig1.69
R6-PE(config-subif)#shut

Debugging BGP updates on R2 (significantly edited for brevity):

*Aug 20 23:30:29.037: RT(VPN): updating bgp 9.9.9.9/32 (0x1) : via 7.7.7.7 0 26
*Aug 20 23:30:29.037: RT(VPN): closer admin distance for 9.9.9.9, flushing 1 routes
*Aug 20 23:30:29.037: RT(VPN): add 9.9.9.9/32 via 7.7.7.7, bgp metric [200/0]

BGP figures out that 6.6.6.6 is down, and picks 7.7.7.7 for the next hop. Now we have the same problem we had with PIC Core, only it's more significant:

BGP recurses to IGP <-- PIC EDGE INFLUENCES
IGP recurses to one or more Next Hops <-- PIC EDGE & CORE INFLUENCE
FIB populates one or more next hops from the IGP <-- PIC EDGE & CORE INFLUENCE

Just pointing out the process there - we don't have PIC edge enabled, so our theoretical 1M routes just took minutes to reconverge.

So how do we enable PIC Edge? Quite simply, we can't wait for the IGP and BGP to converge. We need two paths in BGP. This can be easy or difficult, depending on our topology. Let's look at the easiest methods and progress towards harder.

Note we still have cef table output-chain build favor convergence-speed configured on R2, which is still necessary.

I've re-enabled R6 to show how this could play out with PIC Edge. On R2:
router bgp 200
address-family ipv4 vrf VPN
maximum-paths ibgp 3

Now we've told R2 to install multiple BGP paths, not just multiple IGP paths. This way if R6's advertisement gets pulled again, there's already a pre-made alternative path.

Now we have three "hot", installed BGP paths to 9.9.9.9, instead of just one. This means with the IGP in consideration, we have six paths:

R2-PE#sh bgp vpnv4 un all | b 9.9.9.9
*mi 9.9.9.9/32 7.7.7.7 0 100 0 300 i
*>i 6.6.6.6 0 100 0 300 i
*mi 8.8.8.8 0 100 0 300 i

R2-PE#sh ip cef vrf VPN 9.9.9.9 detail
9.9.9.9/32, epoch 1, flags rib defined all labels, per-destination sharing
recursive via 6.6.6.6 label 22
nexthop 192.168.24.4 GigabitEthernet1.24 label 23
nexthop 192.168.25.5 GigabitEthernet1.25 label 18
recursive via 7.7.7.7 label 26
nexthop 192.168.24.4 GigabitEthernet1.24 label 16
nexthop 192.168.25.5 GigabitEthernet1.25 label 20
recursive via 8.8.8.8 label 16
nexthop 192.168.24.4 GigabitEthernet1.24 label 28

If we lose the path via 6.6.6.6, one of the other paths would simply pick up the load, and because of the hierarchical FIB we already implemented, there's no need to rewrite all 1M prefixes in the FIB one at a time.

This was the first PIC solution I described above: multipathing.

I'm going to temporarily cut to a much simpler scenario to show BGP Advertise-Best-External. While I could mix this in to the topology we've been using, it's getting too complex to clearly illustrate the topic.

Let's say multipathing isn't an option - what if one of the paths is clearly better than the others? What else can we do?

I've deliberately made R6 the bestpath by setting the local preference on all routes leaving it to 150. Now what we see from R2 looks like:

R2-PE#show bgp vpnv4 un all 9.9.9.9
BGP routing table entry for 1:1:9.9.9.9/32, version 63
Paths: (1 available, best #1, table VPN)
Advertised to update-groups:
Refresh Epoch 1
300
6.6.6.6 (metric 3) (via default) from 6.6.6.6 (6.6.6.6)
Origin IGP, metric 0, localpref 150, valid, internal, best
Extended Community: RT:1:1
mpls labels in/out nolabel/22
rx pathid: 0, tx pathid: 0x0

Only one path ... via 3 upstreams? Yep. Depending on timing, R2 may end up with three paths for just a moment. But since all routers are peered with one another, R7 will learn via its iBGP session to R6 that R6 has the bestpath, as will R8. Both R7 and R8 will then withdraw their own routes from their iBGP peers, R2 included. Now R2 is stuck with one path - we need at least two for PIC Edge.

The dead easiest solution to this design is to use Advertise-Best-External:

R7 & R8:
router bgp 200
address-family ipv4 vrf VPN
bgp advertise-best-external

What's this do?

R7-PE#sh bgp vpnv4 un all 9.9.9.9
BGP routing table entry for 1:1:9.9.9.9/32, version 18
Paths: (3 available, best #2, table VPN)
Advertised to update-groups:
1 6
Refresh Epoch 5
300
8.8.8.8 (metric 3) (via default) from 8.8.8.8 (8.8.8.8)
Origin IGP, metric 0, localpref 100, valid, internal
Extended Community: RT:1:1
mpls labels in/out 26/16
rx pathid: 0, tx pathid: 0
Refresh Epoch 3
300
6.6.6.6 (metric 3) (via default) from 6.6.6.6 (6.6.6.6)
Origin IGP, metric 0, localpref 150, valid, internal, best
Extended Community: RT:1:1
mpls labels in/out 26/22
rx pathid: 0, tx pathid: 0x0
Refresh Epoch 2
300
10.0.79.9 (via vrf VPN) from 10.0.79.9 (9.9.9.9)
Origin IGP, metric 0, localpref 100, valid, external
Extended Community: RT:1:1
mpls labels in/out 26/nolabel
rx pathid: 0, tx pathid: 0

R7 still sees the path through R6 as best. However, what's it sending to R2? It's sending its eBGP path to the CE as opposed to the path to R6.

Since R8 is doing the same thing, R2 now has three paths again:

R2-PE#sh bgp vpnv4 un all 9.9.9.9
BGP routing table entry for 1:1:9.9.9.9/32, version 70
Paths: (3 available, best #3, table VPN)
Advertised to update-groups:
Refresh Epoch 5
300
8.8.8.8 (metric 3) (via default) from 8.8.8.8 (8.8.8.8)
Origin IGP, metric 0, localpref 100, valid, internal
Extended Community: RT:1:1
mpls labels in/out nolabel/16
rx pathid: 0, tx pathid: 0
Refresh Epoch 3
300
7.7.7.7 (metric 3) (via default) from 7.7.7.7 (7.7.7.7)
Origin IGP, metric 0, localpref 100, valid, internal
Extended Community: RT:1:1
mpls labels in/out nolabel/26
rx pathid: 0, tx pathid: 0
Refresh Epoch 1
300
6.6.6.6 (metric 3) (via default) from 6.6.6.6 (6.6.6.6)
Origin IGP, metric 0, localpref 150, valid, internal, best
Extended Community: RT:1:1
mpls labels in/out nolabel/22
rx pathid: 0, tx pathid: 0x0

So Advertise-Best-External sends your eBGP route as bestpath to your neighbors, but local routing (on R7 or R8) still goes through R6 due to the local-preference.
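Here is a rough Python sketch of that advertisement decision from R7's point of view. It is heavily simplified: best-path is reduced to local preference with an eBGP tie-break, the function names are made up, and a real PE would rewrite the next hop to itself (7.7.7.7) before sending the path to an iBGP peer.

```python
# Simplified model of what a PE offers to its iBGP peers, with and without
# best-external. Names and the reduced best-path logic are assumptions.
from dataclasses import dataclass

@dataclass(frozen=True)
class Path:
    next_hop: str
    local_pref: int
    learned_via: str  # "ebgp" or "ibgp"

def best(paths):
    # Highest local preference wins; on a tie, prefer the eBGP-learned path.
    return max(paths, key=lambda p: (p.local_pref, p.learned_via == "ebgp"))

def advertise_to_ibgp(paths, best_external=False):
    winner = best(paths)
    if winner.learned_via == "ebgp":
        return winner
    if best_external:
        # Overall best is internal, so offer the best external path instead.
        external = [p for p in paths if p.learned_via == "ebgp"]
        return best(external) if external else None
    # An iBGP-learned best path is not re-advertised to iBGP peers.
    return None

r7_paths = [
    Path("6.6.6.6", 150, "ibgp"),     # via R6, local preference 150
    Path("10.0.79.9", 100, "ebgp"),   # R7's own path to the CE
]

print(best(r7_paths).next_hop)
print(advertise_to_ibgp(r7_paths))
print(advertise_to_ibgp(r7_paths, best_external=True))
```

Local forwarding on R7 still follows `best()`, which is the path via R6; only what gets advertised changes.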

We're not done yet however:

R2-PE#sh ip cef vrf VPN 9.9.9.9 detail
9.9.9.9/32, epoch 1, flags rib defined all labels
recursive via 6.6.6.6 label 22
nexthop 192.168.24.4 GigabitEthernet1.24 label 23
nexthop 192.168.25.5 GigabitEthernet1.25 label 18

R2 still only sees one possible path.

We need to implement some single-router Add-Path to make this work. The key item of importance is that only the routers that need the non-multipath redundant paths have to support Add-Path in this design. If we're not worried about R6, R7, or R8 having an additional path back to R1, then we might just have R2 and R3 require the Add-Path support (Add-Path is a reasonably new feature at the time of this writing, so having your entire topology support it could be challenging).

router bgp 200
address-family ipv4 vrf VPN
bgp additional-paths select backup
bgp additional-paths install

Don't worry about the specific mechanisms of "select backup" and "install" yet, I'm going to cover them thoroughly later. In short, we need to tell this router to pick a backup path and pre-install it in the FIB so that PIC can use it in failover, which this config accomplishes:

R2-PE#sh ip cef vrf VPN 9.9.9.9 det
9.9.9.9/32, epoch 1, flags rib defined all labels
recursive via 6.6.6.6 label 22
nexthop 192.168.24.4 GigabitEthernet1.24 label 23
nexthop 192.168.25.5 GigabitEthernet1.25 label 18
recursive via 7.7.7.7 label 26, repair
nexthop 192.168.24.4 GigabitEthernet1.24 label 16
nexthop 192.168.25.5 GigabitEthernet1.25 label 20

Note the "repair" keyword on the second path; that's the key.
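As a way of picturing what the pre-installed repair path buys you, here is a minimal Python sketch. The class is hypothetical and only models the decision, not how CEF stores it: with a repair path already in place, losing the primary is a local switch, and without one there is nothing to forward to until BGP reconverges.

```python
# Minimal sketch of a FIB entry with an optional pre-installed repair path.
# The class and method names are invented for illustration.

class PathList:
    def __init__(self, primary, repair=None):
        self.primary = primary
        self.repair = repair

    def forward_via(self, alive):
        """Pick the BGP next hop to use, given the set of reachable ones."""
        if self.primary in alive:
            return self.primary
        if self.repair is not None and self.repair in alive:
            return self.repair
        return None  # nothing usable until the control plane catches up

with_repair = PathList("6.6.6.6", repair="7.7.7.7")
without_repair = PathList("6.6.6.6")

alive = {"7.7.7.7", "8.8.8.8"}   # R6 has just died
print(with_repair.forward_via(alive))
print(without_repair.forward_via(alive))
```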

I'm removing the R2 Add-Path config and bgp advertise-best-external on the PEs.

This is all fantastic with full-mesh iBGP - what if you have a huge topology and a route-reflector (or several) is more realistic? There's a big problem here, because like any BGP router, the route reflector will only choose its one best path to send to the other PEs. This makes multipathing impossible.

I've re-made R4 a route reflector, and removed all the redundant iBGP paths between the other PEs. Every PE is getting their routes via R4 now.

Clearly down to just one path now:

R2-PE#sh ip cef vrf VPN 9.9.9.9 det
9.9.9.9/32, epoch 1, flags rib defined all labels
recursive via 6.6.6.6 label 22
nexthop 192.168.24.4 GigabitEthernet1.24 label 23
nexthop 192.168.25.5 GigabitEthernet1.25 label 18

As usual, we have multiple IGP paths, but those will both get pulled if we lose the BGP path.

Without going to full-on Add-Path across the network, our simplest answer is another route-reflector running diverse-path. I'm temporarily making R5 an additional route-reflector.

For brevity I'm not going to include all the config necessary to make R5 a route-reflector. However, the outcome on R2 looks like this:

R2-PE#sh bgp vpnv4 uni all 9.9.9.9
BGP routing table entry for 1:1:9.9.9.9/32, version 79
Paths: (2 available, best #2, table VPN)
Advertised to update-groups:
Refresh Epoch 1
300
6.6.6.6 (metric 3) (via default) from 5.5.5.5 (5.5.5.5)
Origin IGP, metric 0, localpref 100, valid, internal
Extended Community: RT:1:1
Originator: 6.6.6.6, Cluster list: 5.5.5.5
mpls labels in/out nolabel/22
rx pathid: 0, tx pathid: 0
Refresh Epoch 1
300
6.6.6.6 (metric 3) (via default) from 4.4.4.4 (4.4.4.4)
Origin IGP, metric 0, localpref 150, valid, internal, best
Extended Community: RT:1:1
Originator: 6.6.6.6, Cluster list: 4.4.4.4
mpls labels in/out nolabel/22
rx pathid: 0, tx pathid: 0x0

Hey, great, we've got two paths, we can just enable Add-Path on R2 and we're done, right?

Not so fast.

The next-hop is 6.6.6.6 on both routes - in order for Add-Path to be viable, the backup path's next-hop must be different than the primary path's.

The solution, as I'd mentioned above, is to use Diverse-Path. Diverse-Path tells a BGP router to deliberately calculate the 2nd-best path that has a different next hop than the first-best path. Diverse-Path was a workaround before Add-Path was supported (or widely supported) in IOS. Only the route reflector running Diverse-Path needs to know about it; all the other routers are just following standard IOS rules.

R5-RR(config)#router bgp 200
R5-RR(config-router)#address-family vpnv4
R5-RR(config-router-af)#bgp additional-paths select backup
R5-RR(config-router-af)#bgp additional-paths install
R5-RR(config-router-af)#neighbor 2.2.2.2 advertise diverse-path backup

Here, we tell R5 to calculate a backup path, and then we tell it to advertise it to R2 as if it were R5's bestpath (in production, you'd presumably want to send this to all route-reflector clients, not just one).
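The selection rule itself can be sketched in a few lines of Python. This is a simplification I'm assuming for illustration: paths are ranked by local preference only (the real best-path algorithm has many more steps), and the function name is made up.

```python
# Sketch of "best path plus a backup with a different next hop".
# Ranking by local preference alone is an assumption for brevity.

def best_and_diverse(paths):
    ranked = sorted(paths, key=lambda p: p["local_pref"], reverse=True)
    best = ranked[0]
    # The backup must not share the best path's next hop.
    backup = next(
        (p for p in ranked[1:] if p["next_hop"] != best["next_hop"]), None
    )
    return best, backup

three_exits = [
    {"next_hop": "6.6.6.6", "local_pref": 150},
    {"next_hop": "7.7.7.7", "local_pref": 100},
    {"next_hop": "8.8.8.8", "local_pref": 100},
]
best, backup = best_and_diverse(three_exits)
print(best["next_hop"], backup["next_hop"])

# If every path the reflector holds points at the same exit, there is
# nothing diverse to pick.
only_r6 = [{"next_hop": "6.6.6.6", "local_pref": 150}]
print(best_and_diverse(only_r6)[1])
```

The second case is the one to watch for: a reflector that has only been given paths via one exit has no diverse path to offer.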

One more step is also required on R7 and R8 (I've done R6 as well to keep the config consistent) - right now, this topology suffers from the same problem we saw in the first advertise-best-external scenario. Consider:

1) R6 sends its bestpath (its external path) to R4 and R5. This prefix has a local pref of 150.
2) R7 sends its bestpath (its external path) to R4 and R5.
3) R8 sends its bestpath (its external path) to R4 and R5.
4) R5 starts calculating 2nd-best-path for R2
5) R7 learns about R6's bestpath from R4
6) R8 learns about R6's bestpath from R4
7) R7 withdraws its bestpath from R4 and R5 after learning R6's path is better
8) R8 withdraws its bestpath from R4 and R5 after learning R6's path is better
9) R5 calculates that its only path to 9.9.9.9 is via R6

Now we could put bgp advertise-best-external back in, but that would advertise the best external to both R4 and R5 and we'd have the same exact problem as above.

Per-neighbor best-external is the solution:
R6, R7 & R8:
router bgp 200
address-family vpnv4
neighbor 5.5.5.5 advertise best-external

This will advertise the "internal bestpath" (via R6, because of local preference) to R4, and the external bestpath to R5.

Now back to R2:

R2-PE#sh bgp vpnv4 un all 9.9.9.9
BGP routing table entry for 1:1:9.9.9.9/32, version 83
Paths: (2 available, best #2, table VPN)
Advertised to update-groups:
Refresh Epoch 2
300
6.6.6.6 (metric 3) (via default) from 5.5.5.5 (5.5.5.5)
Origin IGP, metric 0, localpref 100, valid, internal
Extended Community: RT:1:1
Originator: 6.6.6.6, Cluster list: 5.5.5.5
mpls labels in/out nolabel/22
rx pathid: 0, tx pathid: 0
Refresh Epoch 1
300
6.6.6.6 (metric 3) (via default) from 4.4.4.4 (4.4.4.4)
Origin IGP, metric 0, localpref 150, valid, internal, best
Extended Community: RT:1:1
Originator: 6.6.6.6, Cluster list: 4.4.4.4
mpls labels in/out nolabel/22
rx pathid: 0, tx pathid: 0x0

Now we've got two routes with two next-hops.

R2-PE#sh ip cef vrf VPN 9.9.9.9 detail
9.9.9.9/32, epoch 1, flags rib defined all labels
recursive via 6.6.6.6 label 22
nexthop 192.168.24.4 GigabitEthernet1.24 label 23
nexthop 192.168.25.5 GigabitEthernet1.25 label 18

But we still need to enable the calculation of a backup route, otherwise PIC Edge won't work.

R2-PE(config)#router bgp 200
R2-PE(config-router)#address-family ipv4 vrf VPN
R2-PE(config-router-af)#bgp additional-paths select backup
R2-PE(config-router-af)#bgp additional-paths install

R2-PE#sh ip cef vrf VPN 9.9.9.9 det
9.9.9.9/32, epoch 1, flags rib defined all labels
recursive via 6.6.6.6 label 22
nexthop 192.168.24.4 GigabitEthernet1.24 label 23
nexthop 192.168.25.5 GigabitEthernet1.25 label 18
recursive via 7.7.7.7 label 26, repair
nexthop 192.168.24.4 GigabitEthernet1.24 label 16
nexthop 192.168.25.5 GigabitEthernet1.25 label 20

Now we've got a working solution!

And last but certainly not least, the gold standard of receiving two paths: Simply rework how BGP handles multiple paths by using Add-Path.

Sadly, as much as this technology seems like it's custom-built for VPNv4, if you can believe it, Add-Path isn't supported in VPNv4 on my OS:

R4#sh ver
Cisco IOS XE Software, Version 03.11.01.S - Standard Support Release
Cisco IOS Software, CSR1000V Software (X86_64_LINUX_IOSD-UNIVERSALK9-M), Version 15.4(1)S1, RELEASE SOFTWARE (fc2)

In the IPv4 (default) Family:

R4(config)#router bgp 200
R4(config-router)#bgp additional-paths select ?
all Select all available paths
backup Select backup path
best Select best N paths
best-external Select best-external path
group-best Select group-best path

R4(config-router)#neighbor 2.2.2.2 advertise ?
additional-paths Advertise additional paths
best-external Advertise best-external (at RRs best-internal) path
diverse-path Advertise diverse path

Note "all", "best" and "group-best" under select, and "additional-paths" under advertise. Those are what we're looking for in VPNv4:

R4(config-router)#address-family vpnv4
R4(config-router-af)#neighbor 2.2.2.2 advertise ?
best-external Advertise best-external (at RRs best-internal) path
diverse-path Advertise diverse path

R4(config-router-af)#bgp additional-paths select ?
backup Select backup path
best-external Select best-external path

Completely lacking.

On that note, we'll be reverting this design back to a non-MPLS scenario for the remainder of the blog.

I've also reverted R5 from being a route-reflector, it's now simply a client of R4. This was necessary to carry the IPv4 BGP table through R5.

Note R6 deliberately still has the best path via local-preference.

Here is a diagram of roughly what we're trying to achieve.

We'd like R6, R7 and R8 to all send (initially) one route to the RR. We'd like R4 to reflect back two paths for reaching 9.9.9.9 to everyone (technically speaking we'll also be reflecting two paths for 1.1.1.1 on CE1, but I chose not to focus on that).

This design suffers from the same problem the last several have. Everything will start out looking good until the route-reflector reflects the superior path from R6 to R7 and R8, and those two routers both pick R6 as their bestpath. After that they'll withdraw their routes from R4, and R4 will only have a single route to send to R2, R3, etc, etc, because every path will point to R6.

We can solve this with one of three methods:
- BGP Advertise-Best-External on R7 and R8 (optionally on R6)
- Per-neighbor advertise best-external
- Running two-path Add-Path on R7 and R8 in addition to R4.

The top two options I imagine are self-explanatory at this point as I covered them above, however, the final option is hopefully interesting to the reader, and therefore it's the method I will choose for this lab. What will happen if R4, R7 and R8 run add-path is as follows:

1) R6, R7 and R8 all advertise their own (connected/external) bestpath to R4
2) (Let's assume R7 had the 2nd-best path for this example) R4 reflects BOTH R6 and R7's bestpath to R2, R3, R5, R6, R7 and R8.
3) R2 and R3 install both paths in BGP and in the FIB.
4) R6 installs R7's path in the FIB as a repair route.
5) R7 and R8 both change their bestpath to R6 instead of their external route.
6) R7 and R8 both advertise back to the route reflector that R6 is their bestpath and R7 is their backup path.
... No change on R2, R3, or R4: nothing here shifts the route reflector's selection, so its clients aren't modified either.

The key here is that while we still have the same problem of R7 and R8 preferring R6's external path, we're still advertising two paths to the route reflector: R6's (as best), and R7's as a backup.
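What makes it possible to hold two paths from the same neighbor at all is the path identifier that Add-Path attaches to each advertised prefix. Here is a small Python sketch of the receiving side. The function name and the path ID values are invented for illustration; the point is only the difference in the key used to store a received route.

```python
# Receiver-side sketch: with Add-Path the key is (prefix, path ID); without
# it the key is the prefix alone. Path IDs and names here are hypothetical.

def receive(rib_in, update, add_path):
    if add_path:
        key = (update["prefix"], update["path_id"])
    else:
        # A second announcement for the same prefix from the same peer
        # implicitly replaces the first one.
        key = update["prefix"]
    rib_in[key] = update["next_hop"]

updates_from_rr = [
    {"prefix": "9.9.9.9/32", "path_id": 1, "next_hop": "6.6.6.6"},
    {"prefix": "9.9.9.9/32", "path_id": 2, "next_hop": "7.7.7.7"},
]

classic, with_add_path = {}, {}
for update in updates_from_rr:
    receive(classic, update, add_path=False)
    receive(with_add_path, update, add_path=True)

print(len(classic), len(with_add_path))
```

Without the path ID, the second update simply overwrites the first, which is why a plain BGP session can never carry more than one path per prefix.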

Here is the relevant config:
R4:
router bgp 200
bgp additional-paths select best 2
bgp additional-paths send receive
bgp additional-paths install
neighbor 2.2.2.2 advertise additional-paths best 2
neighbor 3.3.3.3 advertise additional-paths best 2
neighbor 5.5.5.5 advertise additional-paths best 2
neighbor 6.6.6.6 advertise additional-paths best 2
neighbor 7.7.7.7 advertise additional-paths best 2
neighbor 8.8.8.8 advertise additional-paths best 2

R2, R3, & R5:
router bgp 200
bgp additional-paths select best 2
bgp additional-paths receive
bgp additional-paths install

R6 - R8:
router bgp 200
bgp additional-paths select best 2
bgp additional-paths send receive
bgp additional-paths install
neighbor 4.4.4.4 advertise additional-paths best 2

Remember not to use bgp additional-paths select backup - that command is for diverse-path or for local (non-advertised) selection of a backup route. You're trying to create a backup path, but that's still the wrong command.

So we used a few new commands here:

bgp additional-paths select best 2 - This calculates the best path and 2nd best path and flags them in BGP. This is a non-transitive flag, the neighbors aren't aware of what your flags are.

R4#sh ip bgp 9.9.9.9
BGP routing table entry for 9.9.9.9/32, version 5
Paths: (3 available, best #3, table default)
Additional-path-install
Path advertised to update-groups:
19 20
Refresh Epoch 1
300, (Received from a RR-client)
7.7.7.7 (metric 2) from 7.7.7.7 (7.7.7.7)
Origin IGP, metric 0, localpref 100, valid, internal, backup/repair, best2
rx pathid: 0x1, tx pathid: 0x2
Path not advertised to any peer
Refresh Epoch 1
300, (Received from a RR-client)
8.8.8.8 (metric 2) from 8.8.8.8 (8.8.8.8)
Origin IGP, metric 0, localpref 100, valid, internal
rx pathid: 0x1, tx pathid: 0
Path advertised to update-groups:
19 20
Refresh Epoch 1
300, (Received from a RR-client)
6.6.6.6 (metric 2) from 6.6.6.6 (6.6.6.6)
Origin IGP, metric 0, localpref 150, valid, internal, best
rx pathid: 0x0, tx pathid: 0x0

You see we've flagged "best" and "best2".

bgp additional-paths send receive

Unlike all the fixes we've seen up until now, Add-Path is a negotiated feature. This is why there are so many workarounds for it - to get to full Add-Path you basically have to forklift upgrade your network. On that note, you need to tell your neighbors whether you have send, receive, or both send & receive capability. This can be done globally, as we've done here, or per neighbor with:

R2(config-router)#neighbor 4.4.4.4 additional-paths ?
disable Disable additional paths for this neighbor
receive Receive additional paths from neighbors
send Send additional paths to this neighbor

Note per-neighbor settings override the global settings.
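
To make the negotiation logic concrete, here is a minimal Python sketch. It is not IOS code, and the function names are made up for illustration. It assumes the RFC 7911 behavior that a speaker only sends additional paths to a peer if it advertised "send" and the peer advertised "receive", and it models the per-neighbor override described above as simply replacing the global setting.

```python
from typing import Optional, Set

def effective_caps(global_caps: Set[str], neighbor_caps: Optional[Set[str]]) -> Set[str]:
    """Per-neighbor settings, when present, replace the global ones.
    An empty per-neighbor set models 'additional-paths disable'."""
    return global_caps if neighbor_caps is None else neighbor_caps

def can_send_add_path(local: Set[str], peer: Set[str]) -> bool:
    """Additional paths flow local -> peer only if local offered 'send'
    and the peer offered 'receive'."""
    return "send" in local and "receive" in peer

# R4 (route reflector) is globally 'send receive'; R2 is globally 'receive'.
r4 = effective_caps({"send", "receive"}, None)
r2 = effective_caps({"receive"}, None)
print(can_send_add_path(r4, r2))   # R4 -> R2
print(can_send_add_path(r2, r4))   # R2 -> R4

# Hypothetical: R2 disables Add-Path for its session to R4.
r2_disabled = effective_caps({"receive"}, set())
print(can_send_add_path(r4, r2_disabled))
```

With the lab's settings, R4 can send additional paths to R2 but not the other way around, and disabling the capability on R2's side stops both directions.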

bgp additional-paths install

You can select additional-paths and pass them to neighbors without installing them in your RIB or FIB. This command should be on any device requiring PIC Edge, but if your route reflector isn't in the forwarding path, you may be able to omit it.

neighbor X.X.X.X advertise additional-paths best 2

Even if you've negotiated the Add-Path capability with your neighbor, you still need to tell the BGP process to advertise all of, or a subset of, your calculated best paths. The way it does this is via the tag system I described above. An important element of this is that the tagging system is not mutually exclusive. Let's say there are 4 paths with different next-hops. You could select "all" and "best 3", and the best 3 would be flagged with "best" and "all", and the 4th path would only be flagged with "all". We'll show an example of this below.
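
The four-path example can be sketched in a few lines of Python. This is only a model of the tagging idea as described here, not of how IOS stores or displays the flags (IOS may show the bestpath itself with just "best"); the next-hop addresses and the tag_paths function are hypothetical.

```python
def tag_paths(ranked_next_hops, best_n=0, select_all=False):
    """Return {next_hop: [tags]} for paths already ranked best-first.
    Tags are additive: one path can carry several at once."""
    tags = {nh: [] for nh in ranked_next_hops}
    for rank, nh in enumerate(ranked_next_hops[:best_n], start=1):
        tags[nh].append("best" if rank == 1 else f"best{rank}")
    if select_all:
        for nh in ranked_next_hops:
            tags[nh].append("all")
    return tags

# Four hypothetical next hops, already ordered by the bestpath algorithm.
paths = ["10.0.0.1", "10.0.0.2", "10.0.0.3", "10.0.0.4"]
tags = tag_paths(paths, best_n=3, select_all=True)
for nh in paths:
    print(nh, tags[nh])
```

The first three paths end up with a "best"-family tag plus "all", while the fourth carries only "all".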

Let's see the output from this.

R4#sh ip bgp | b Network
     Network          Next Hop            Metric LocPrf Weight Path
 *>i 1.1.1.1/32       2.2.2.2                  0    100      0 100 i
 *bia                 3.3.3.3                  0    100      0 100 i
 *bia9.9.9.9/32       7.7.7.7                  0    100      0 300 i
 * i                  8.8.8.8                  0    100      0 300 i
 *>i                  6.6.6.6                  0    150      0 300 i

We see two paths for 1.1.1.1, and three paths for 9.9.9.9.

Two are flagged with "b" for backup - this is a side-effect of using the bgp additional-paths install.

"a" is the flag for additional-paths.

You'd need to do a sh ip bgp 9.9.9.9 to see the "best", "best2", etc flags, which I am omitting for brevity - there's already a sample further above.

R4#sh ip cef 9.9.9.9 det
9.9.9.9/32, epoch 2, flags rib only nolabel, rib defined all labels
  recursive via 6.6.6.6
    nexthop 192.168.46.6 GigabitEthernet1.46
  recursive via 7.7.7.7, repair
    nexthop 192.168.47.6 GigabitEthernet1.47

We can see the repair path in the FIB.

On R2:

R2(config-router)#do sh ip bgp 9.9.9.9
BGP routing table entry for 9.9.9.9/32, version 3
Paths: (2 available, best #2, table default)
Additional-path-install
Path not advertised to any peer
Refresh Epoch 1
300
7.7.7.7 (metric 3) from 4.4.4.4 (4.4.4.4)
Origin IGP, metric 0, localpref 100, valid, internal, backup/repair, best2
Originator: 7.7.7.7, Cluster list: 4.4.4.4
rx pathid: 0x2, tx pathid: 0x1
Path advertised to update-groups:
Refresh Epoch 1
300
6.6.6.6 (metric 3) from 4.4.4.4 (4.4.4.4)
Origin IGP, metric 0, localpref 150, valid, internal, best
Originator: 6.6.6.6, Cluster list: 4.4.4.4
rx pathid: 0x0, tx pathid: 0x0

We see a best and best2 flag. It's important to note again that this is not learned from the route reflector, it's locally decided and set by the local bgp additional-paths select best 2 on R2. As mentioned above, I decided to use add-path from the edge BGP devices back towards the route-reflector to avoid the problem of the single-best-path replacing all the secondaries during convergence.

Another important note is the pathid. Add-Path's trickery to make this work doesn't make a real integral change to BGP - it still only passes one best, unique path - it just makes each additional path unique by adding a unique pathid. Note the rx pathids of 0x2 and 0x0 on R2 above; they match the tx pathids R4 assigned to the same two paths in its earlier output. Think of these as similar to Route Distinguishers in VPNv4, making the same two routes unique.
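
If the Route Distinguisher analogy doesn't click, a small Python sketch may help. It is a simplification that treats a peer's received-routes table as a dictionary; the variable names are illustrative only, and the path IDs are the rx pathids seen on R2 above.

```python
# Without Add-Path, the table is keyed by prefix alone, so a second
# announcement for the same prefix replaces the first one.
plain = {}
plain["9.9.9.9/32"] = "6.6.6.6"
plain["9.9.9.9/32"] = "7.7.7.7"   # overwrites the 6.6.6.6 path

# With Add-Path, the key becomes (prefix, path id), so both paths survive.
add_path = {}
add_path[("9.9.9.9/32", 0x0)] = "6.6.6.6"
add_path[("9.9.9.9/32", 0x2)] = "7.7.7.7"

print(len(plain), len(add_path))
```

The plain table holds one path for 9.9.9.9/32 and the Add-Path table holds two, which is the whole point of the path ID.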

R2#sh ip cef 9.9.9.9 det
9.9.9.9/32, epoch 2, flags rib only nolabel, rib defined all labels
  recursive via 6.6.6.6
    nexthop 192.168.24.4 GigabitEthernet1.24 label 23
    nexthop 192.168.25.5 GigabitEthernet1.25 label 18
  recursive via 7.7.7.7, repair
    nexthop 192.168.24.4 GigabitEthernet1.24 label 16
    nexthop 192.168.25.5 GigabitEthernet1.25 label 20

And there's PIC Edge and Add-Path in action on R2.
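
As a rough illustration of why a pre-installed repair path makes convergence prefix-independent, here is a toy Python model. It assumes many prefixes share one forwarding object that holds both the primary and the repair next hop; the PathList class and the prefixes are hypothetical, and this is not how CEF is actually implemented.

```python
class PathList:
    """Shared forwarding entry: one primary and one precomputed repair next hop."""
    def __init__(self, primary, repair):
        self.primary = primary
        self.repair = repair
        self.primary_up = True

    def next_hop(self):
        return self.primary if self.primary_up else self.repair

# Many prefixes point at the same PathList object (hypothetical prefixes).
shared = PathList(primary="6.6.6.6", repair="7.7.7.7")
fib = {f"10.{i // 256}.{i % 256}.0/24": shared for i in range(10000)}

# Failure of 6.6.6.6: one update, regardless of how many prefixes exist.
shared.primary_up = False
print(fib["10.0.0.0/24"].next_hop())
```

Flipping a single flag moves all 10,000 prefixes to the repair next hop, instead of walking every prefix and rewriting it.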

I'm going to quickly cover the rest of the simpler Add-Path options.
Just to recap, the route-reflector has chosen two best paths so far:

R4#sh ip bgp 9.9.9.9 | s from
300, (Received from a RR-client)
7.7.7.7 (metric 2) from 7.7.7.7 (7.7.7.7)
Origin IGP, metric 0, localpref 100, valid, internal, backup/repair, best2
rx pathid: 0x1, tx pathid: 0x2
300, (Received from a RR-client)
8.8.8.8 (metric 2) from 8.8.8.8 (8.8.8.8)
Origin IGP, metric 0, localpref 100, valid, internal
rx pathid: 0x1, tx pathid: 0
300, (Received from a RR-client)
6.6.6.6 (metric 2) from 6.6.6.6 (6.6.6.6)
Origin IGP, metric 0, localpref 150, valid, internal, best
rx pathid: 0x0, tx pathid: 0x0

Now let's have R4 select three best paths instead of two:

router bgp 200
bgp additional-paths select best 3

R4#sh ip bgp 9.9.9.9 | s from
300, (Received from a RR-client)
7.7.7.7 (metric 2) from 7.7.7.7 (7.7.7.7)
Origin IGP, metric 0, localpref 100, valid, internal, backup/repair, best2
rx pathid: 0x1, tx pathid: 0x2
300, (Received from a RR-client)
8.8.8.8 (metric 2) from 8.8.8.8 (8.8.8.8)
Origin IGP, metric 0, localpref 100, valid, internal, best3
rx pathid: 0x1, tx pathid: 0x1
300, (Received from a RR-client)
6.6.6.6 (metric 2) from 6.6.6.6 (6.6.6.6)
Origin IGP, metric 0, localpref 150, valid, internal, best
rx pathid: 0x0, tx pathid: 0x0

Note we've added a pathid and "best3" to the remaining path. We'd be able to send those to neighbors if we wanted. With this config we're choosing 3 but sending 2.

I found this option confusing initially:

R4(config-router)#no neighbor 2.2.2.2 advertise additional-paths best 2
R4(config-router)#neighbor 2.2.2.2 advertise additional-paths all
% BGP: AF level 'bgp additional-paths select' more restrictive than advertising policy. This is a reminder that AF level additional-path select commands are needed.

The way I originally read this was, I've selected 3 best paths, and I want to send all 3 of them to my neighbor -- this is incorrect. Remember this is a flag system. All is a flag. None of our BGP prefixes are flagged with All, so we just broke Add-Path:

R4(config-router-af)#do sh ip bgp neigh 2.2.2.2 adv | b Network
     Network          Next Hop            Metric LocPrf Weight Path
 *>i 9.9.9.9/32       6.6.6.6                  0    150      0 300 i
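
Here is a small Python sketch of why that happened. It is a mental model only, not IOS internals: it assumes the bestpath is always advertised and that any other path goes out only if it carries a tag the neighbor's advertise policy asks for. The advertised function is made up for illustration.

```python
def advertised(paths, advertise_tags):
    """paths: list of (next_hop, tags), bestpath first.
    The bestpath is always sent; other paths need a matching tag."""
    out = [paths[0][0]]
    for next_hop, tags in paths[1:]:
        if advertise_tags & set(tags):
            out.append(next_hop)
    return out

# R4's view with only 'select best 3' configured: nothing carries "all".
paths = [
    ("6.6.6.6", ["best"]),
    ("7.7.7.7", ["best2"]),
    ("8.8.8.8", ["best3"]),
]
print(advertised(paths, {"all"}))            # neighbor policy: advertise all
print(advertised(paths, {"best", "best2"}))  # neighbor policy: advertise best 2
```

Asking for "all" when nothing is tagged "all" leaves only the bestpath via 6.6.6.6, which matches what R2 received.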

Let's fix it.
All is meant to simulate full-mesh iBGP with a route-reflector - if all routers use it, you'll get a similar outcome to all the routers being peered together.

R4(config-router)#bgp additional-paths select all

R4(config-router)#do sh ip bgp 9.9.9.9 | s from
300, (Received from a RR-client)
7.7.7.7 (metric 2) from 7.7.7.7 (7.7.7.7)
Origin IGP, metric 0, localpref 100, valid, internal, backup/repair, best2, all
rx pathid: 0x1, tx pathid: 0x1
300, (Received from a RR-client)
8.8.8.8 (metric 2) from 8.8.8.8 (8.8.8.8)
Origin IGP, metric 0, localpref 100, valid, internal, best3, all
rx pathid: 0x1, tx pathid: 0x2
300, (Received from a RR-client)
6.6.6.6 (metric 2) from 6.6.6.6 (6.6.6.6)
Origin IGP, metric 0, localpref 150, valid, internal, best
rx pathid: 0x0, tx pathid: 0x0

OK, now we're flagged with both All and Best simultaneously. As mentioned above, the select system is not mutually exclusive:

R4#sh run | i select
bgp additional-paths select all best 3

R2#sh ip bgp | b 9.9.9.9
 *bia9.9.9.9/32       7.7.7.7                  0    100      0 300 i
 * i                  8.8.8.8                  0    100      0 300 i
 *>i                  6.6.6.6                  0    150      0 300 i

There are a few options you can pick under "select":

R4(config-router)#bgp additional-paths select ?
all Select all available paths
backup Select backup path
best Select best N paths
best-external Select best-external path
group-best Select group-best path

All, we just covered.

Backup is for diverse-path.

Best, we've covered.

Best-External is a feature that permits best-external selection on a route reflector. The use case for this is complicated and is out of scope for this document.

Group-Best is also very complicated.

Let's discuss group-best at a very high level.

BGP, under normal circumstances, can potentially end up in a scenario where it never converges - it never stabilizes. This is called BGP MED Oscillation. Explaining this is beyond the scope of this document; however, this blog covers it well: http://ccieblog.co.uk/bgp/bgp-deterministic-med

BGP Deterministic MED can solve this problem.

However, this problem gets additionally complex with Add-Path. Group-Best solves these problems.

This document covers this feature: http://inl.info.ucl.ac.be/system/files/add-paths-jsac.pdf

Route-Maps can additionally be used with Add-Path.

R3(config-route-map)#match additional-paths advertise-set ?
all BGP Add-Path advertise all paths
best BGP Add-Path advertise best n paths
best-range BGP Add-Path advertise best paths (range m to n)
group-best BGP Add-Path advertise group-best path

The two use cases I've seen for the route maps are:

- Setting the egress MED

- Selecting specific routes with the "best" flag to advertise

For example, if you wanted to only advertise the 1st best and 3rd best routes:

R4:
route-map block2ndbest deny 10
match additional-paths advertise-set best-range 2 2 ! matches the "range" of 2 through 2
route-map block2ndbest permit 20

Before:

R2#sh ip bgp | b 9.9.9.9
 *bia9.9.9.9/32       7.7.7.7                  0    100      0 300 i
 * i                  8.8.8.8                  0    100      0 300 i
 *>i                  6.6.6.6                  0    150      0 300 i

R4(config)#router bgp 200
R4(config-router)#neighbor 2.2.2.2 route-map block2ndbest out
R4(config-router)#do clear ip bgp * soft out

After:

R2#sh ip bgp | b 9.9.9.9
 *bia9.9.9.9/32       8.8.8.8                  0    100      0 300 i
 *>i                  6.6.6.6                  0    150      0 300 i
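
The route-map logic above can be sketched in Python as follows. This is only an illustration of the deny/permit sequence, assuming each path has a rank of 1 for best, 2 for best2, and so on; the function name is hypothetical.

```python
def route_map_block2ndbest(rank, lo=2, hi=2):
    """Mimics the two route-map sequences:
    deny 10 matches best-range lo..hi, permit 20 matches everything else."""
    if lo <= rank <= hi:    # sequence 10: deny
        return False
    return True             # sequence 20: permit

ranked = ["6.6.6.6", "7.7.7.7", "8.8.8.8"]   # best, best2, best3 on R4
sent = [nh for rank, nh in enumerate(ranked, start=1)
        if route_map_block2ndbest(rank)]
print(sent)
```

Only the 2nd best path (via 7.7.7.7) is filtered, so R2 is left with the paths via 6.6.6.6 and 8.8.8.8, as in the "After" output.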

As I mentioned, the MED can be modified on a per-bestpath basis as well, but only from edge BGP device -> RR or edge BGP device -> edge BGP device. Route reflectors are not permitted to set MED.

Hope you enjoyed,

Jeff

12 comments:

Unknown, April 20, 2015 at 11:27 PM
Hi Jeff,

Seems like a nice lab. Do you have the configuration of all the devices? I can set this lab up and want to get my hands dirty :)
Unknown, September 4, 2015 at 1:17 AM
Hi Jeff!

Thank you for this great article. I was searching the net for a write-up on the Add-Path feature of BGP when I found your website. I really appreciate your effort in creating this step-by-step article that helps readers understand every aspect of the topic.
Besides, I need to ask you some questions. I've read your article up to the end of the "diverse-path" part. In that part, you said that configuring RR R5 with the diverse-path feature makes R5 calculate the 2nd best path, and that this selected path will be advertised to the RR clients. You also mentioned that we must use per-neighbor best-external. Why do we need to do that? I think that without this configuration,
RR R4 receives a route from R6 with local-pref 150, plus BGP routes from R7 and R8 that are selected as best routes on R7 and R8 because of the best-external command on those routers. RR R4 will then choose the route with local-pref 150 and advertise only that route as the best route to the RR clients (given that we have not configured the diverse-path feature on RR R4). Did I miss something?
I have not read the parts beyond the "diverse-path" part, and I might have other questions later too ;)
Unknown, February 18, 2016 at 8:19 AM
Hi Jeff,

That's a good article. Actually you have clearly explained some features that are quite badly described at cisco.com. Thank you for that!

BR,
Anton
Unknown, April 14, 2016 at 6:49 AM
Hi Jeff,

Could you share the GNS3 topology file?

Br,
Michal
Relativitydrive, June 8, 2017 at 1:17 AM
Wow Jeff another good post.

I haven't checked but I suspect you got your first IE in 2015?

On this post I think you have not shown two different next-hop paths where you used two RRs to get both R6 and R7 to add-paths for R2. Is that correct?

"This will advertise the "internal bestpath" (via R6, because of local preference) to R4, and the external bestpath to R5." - this shows two BGP paths but both with the next hop of 6.6.6.6 with ip cef showing a single route (via 6.6.6.6) to two next hops derived from OSPF ECMP.

Shouldn't you be after two BGP routes with different NHs i.e. 6.6.6.6 and 7.7.7.7 as you show later?
Unknown, April 27, 2020 at 1:56 PM
Such a detailed and easy to follow post. Thank you so much for all the effort to put this together, Jeff.

Saturday, August 16, 2014
