Sunday, August 24, 2014

EIGRP Enhancements

Cisco did a major overhaul of EIGRP in recent IOS releases. The changes can loosely be viewed as new features in "EIGRP Named Mode". In reality, I suspect that the EIGRP teams were working on a series of new features, and they opted to renovate the interface at the same time, hence creating named mode.

We'll start with the new interface and then delve into all the new features one at a time.

Named EIGRP mode replaces the traditional EIGRP interface we're familiar with, and puts all the various commands into one configuration section.

The major distinguishing factor is the router process has a name instead of a number.

Old method:
router eigrp 100
 network 192.168.0.0 0.0.255.255

New equivalent method:
router eigrp SOMENAME
 !
 address-family ipv4 unicast autonomous-system 100
  !
  topology base
  exit-af-topology
  network 192.168.0.0 0.0.255.255
 exit-address-family

The name is completely arbitrary and is a local value. 

Interface settings that were previously configured on the interface, such as hello interval, authentication, etc, are now configured as part of the EIGRP named process:

router eigrp SOMENAME
 !
 address-family ipv4 unicast autonomous-system 100
  !
  af-interface GigabitEthernet1
   authentication mode md5
   authentication key-chain FOO
   hello-interval 10
   no split-horizon
  exit-af-interface
  !
  topology base
  exit-af-topology
  network 192.168.0.0 0.0.255.255
 exit-address-family

A traditional EIGRP process can be upgraded to named mode on newer IOS with this command:

Router(config)#router eigrp 101
Router(config-router)#eigrp upgrade-cli SOMENAME

The process also doesn't interrupt traffic flow.

That's the guts of the configuration reformatting. Let's move on to features.

Wide Metrics
First and foremost, the metric has been reworked.

EIGRP named mode automatically uses wide metrics when speaking to another EIGRP named mode process. No additional configuration is necessary. If it's speaking to a traditional EIGRP process, it uses the old calculations.

The new metric is designed to be able to differentiate paths faster than 10 Gbps. It essentially changes four things:
- Delay is now measured in picoseconds instead of tens of microseconds. 10 microseconds was the minimum previously.
- Bandwidth's scaling factor is made much larger. The calculation is now 10^7 * 65536 / Interface Bandwidth, as opposed to the original 10^7 * 256 / Interface Bandwidth.
- The overall metric is now 64-bit.
- The K6 value has been added "for future use", but Cisco has indicated this will be used for accumulated energy or accumulated jitter.  Jitter is reasonably obvious.  Energy is the actual electric power it takes to use an interface, so that you could literally do "least cost" routing based on how inexpensively the packet can be sent from the various interface types in a path.
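
As a rough illustration of the list above, here is a minimal Python sketch of the wide-metric composite with default K values (K1 = K3 = 1, everything else 0). The bandwidth term uses the 10^7 * 65536 scaling described above. The delay scaling (picoseconds * 65536 / 10^6) is my assumption based on RFC 7868, not something shown in this post, and the function and constant names are made up for the example.

```python
# Sketch only: wide-metric composite with default K values (K1 = K3 = 1).
# The latency scaling below is an assumption taken from RFC 7868.

WIDE_SCALE = 65536        # wide-metric scaling factor
REF_BW_KBPS = 10**7       # reference bandwidth, in kbit/s
PS_PER_US = 10**6         # picoseconds per microsecond


def wide_metric(min_bw_kbps: int, total_delay_ps: int) -> int:
    """Return throughput + latency for a path (default K values assumed)."""
    throughput = REF_BW_KBPS * WIDE_SCALE // min_bw_kbps
    latency = total_delay_ps * WIDE_SCALE // PS_PER_US
    return throughput + latency


# A 1 Gbps path with 21.25 microseconds of total delay.
print(wide_metric(1_000_000, 21_250_000))     # 2048000

# 10 Gbps and 40 Gbps links with the same 10 microsecond delay
# still produce different metrics.
print(wide_metric(10_000_000, 10_000_000))    # 720896
print(wide_metric(40_000_000, 10_000_000))    # 671744
```

Integer (floor) division is a guess at how the router rounds. The numbers used here divide evenly, so the rounding choice doesn't affect them.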

One important note here is that with wide metrics, the EIGRP calculated metric no longer fits into the RIB. For example:

Router#sh ip eigrp top 10.10.10.10/32 | i Composite metric
      Composite metric is (330301440/329646080), route is Internal

Router#sh ip route 10.10.10.10 | i Route metric
      Route metric is 2580480, traffic share count is 1

The EIGRP topology table indicates 330301440, the RIB says 2580480.  
The RIB's metric can't exceed 32 bits, and there are circumstances where the new, more granular metrics won't fit into the RIB. So all metrics, regardless of whether the value would fit into 32 bits, are divided by the rib-scale value. The rib-scale is 128 by default:

330301440/128 = 2580480

You can reassign it to any value 1 to 255:

router eigrp SOMENAME
 address-family ipv4 unicast autonomous-system 100
  metric rib-scale [1-255]

Here's a catch - I've gotten in the habit of using this command for redistributing into EIGRP when labbing:

redistribute <some other protocol> metric 1 1 1 1 1

Why? It's quick and easy to type if you're not trying to do traffic engineering.

Router#sh ip eigrp top 13.13.13.13/32 | i Composite metric
      Composite metric is (655361310720/655360655360), route is External

655361310720/128 = 5120010240

The largest number that can be represented in a 32-bit unsigned integer is 4,294,967,295.

5120010240 > 4294967295, therefore it cannot be represented in the RIB:

Router#sh ip route 13.13.13.13
% Network not in table

You read that right: This is a valid, routable prefix that simply can't make it into the RIB because of an incompatibility between the EIGRP topology table and the RIB. You need to adjust the rib-scale to make this work:

Router(config-router-af)#metric rib-scale 153
Router(config-router-af)#do sh ip route 13.13.13.13 | i Route metric
      Route metric is 4283407259, traffic share count is 1
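
Why 153? A minimal Python sketch of the scaling arithmetic, assuming the router simply floor-divides the composite metric by rib-scale and that any result up to 2^32 - 1 is installable. Both are assumptions inferred from the outputs above, and the function names are made up for the example.

```python
# Sketch only: models rib-scale as integer division, as inferred from the
# "Route metric" values shown in this post.

RIB_MAX = 2**32 - 1   # largest value a 32-bit unsigned metric can hold


def rib_metric(composite: int, rib_scale: int = 128) -> int:
    """Scale a 64-bit EIGRP composite metric down for the RIB."""
    return composite // rib_scale


def min_rib_scale(composite: int):
    """Smallest rib-scale in 1..255 whose result fits in 32 bits, else None."""
    for scale in range(1, 256):
        if rib_metric(composite, scale) <= RIB_MAX:
            return scale
    return None


print(rib_metric(330301440))              # 2580480
print(rib_metric(655361310720) > RIB_MAX)  # True: too big at the default 128
print(min_rib_scale(655361310720))         # 153
print(rib_metric(655361310720, 153))       # 4283407259
```

So 153 is the smallest rib-scale that squeezes this particular route into the RIB. Anything from 153 to 255 would work.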

I imagine that would make for a really good troubleshooting problem. "A route is being redistributed on R1 with a specific metric, but is not being installed in the RIB on R3. Do not change the metric on R1, or adjust with a route-map".

There are a few concerns with interoperability between the traditional EIGRP metric and the wide metrics, but not many. As I mentioned above, routers unable to understand wide metrics are auto-detected and sent the old metric. However, there are circumstances where a route might get depreffed after having passed through an older EIGRP process. For example, if two paths exist to a destination, one running entirely wide metrics and the other passing through one router with traditional metrics, the traditional metric may make that entire path look worse, which may impact load sharing or the ability to ECMP.

SHA Authentication
Now supporting more than just MD5:

R1(config-subif)#router eigrp TEST1
R1(config-router)#address-family ipv4 unicast autonomous-system 100
R1(config-router-af)#af-interface gig1.123
R1(config-router-af-interface)#authentication mode hmac-sha-256 CCIE

I think authentication would also make a great TS question - the authentication could still be placed on the interface, which named mode silently ignores. You'd need to know to look at the EIGRP named process to fix it:

interface GigabitEthernet1.123
 ip authentication key-chain eigrp 100 BOB  ! this does nothing when named mode is enabled.

Route Tag Enhancements
To be fair, the route tag enhancements aren't limited to EIGRP named mode - they work with OSPF, BGP, RIP, etc. They even work in the traditional (non-named) EIGRP syntax. However, I didn't think I needed to write a separate blog post just to show it in every context, as they all basically work the same.

In short, the route tag enhancements allow the route tag to be formatted as a dotted-decimal tag (looks like an IPv4 address) that can be matched either directly (in the traditional route tag method in a route-map) or via a route-tag list. The route-tag list is where things get interesting.

R1:
interface Loopback1
 ip address 1.1.1.1 255.255.255.255
interface Loopback2
 ip address 2.2.2.2 255.255.255.255
interface Loopback3
 ip address 3.3.3.3 255.255.255.255
interface Loopback4
 ip address 4.4.4.4 255.255.255.255
interface Loopback5
 ip address 5.5.5.5 255.255.255.255
interface Loopback6
 ip address 6.6.6.6 255.255.255.255
interface Loopback7
 ip address 7.7.7.7 255.255.255.255

route-tag notation dotted-decimal

router eigrp TEST1
 !
 address-family ipv4 unicast autonomous-system 100
  !
  topology base
   redistribute connected route-map tag-routes

route-map tag-routes permit 10
 match interface Loopback1 Loopback2 Loopback3
 set tag 100.100.100.1
route-map tag-routes permit 20
 match interface Loopback4 Loopback5
 set tag 100.100.200.1
route-map tag-routes permit 30
 match interface Loopback6 Loopback7
 set tag 100.100.101.1

So we've set some dotted-decimal tags on R1, now let's filter on R2.

R2:
route-tag notation dotted-decimal
route-tag list binary-match seq 5 permit 100.100.0.0 0.0.254.255

route-map filter permit 10
 match tag list binary-match
 set metric 100 100 255 1 1500
route-map filter permit 20

router eigrp TEST2
 !
 address-family ipv4 unicast autonomous-system 100
  !
  topology base
   distribute-list route-map filter in GigabitEthernet1.123

Anyone who's done any amount of CCIE-level route filtering should catch what I just did. The route-tag list is looking for any tags that begin with 100.100 and have an even 3rd octet - if you need an explanation of filtering with wildcard masks there are many available on the Internet.

So now tags can be matched based on what bits are set in them -- very cool.
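
For anyone who wants to sanity-check a wildcard mask before typing it into a router, here is a small Python sketch of the match logic. It assumes route-tag lists use the same "0 bit = must match, 1 bit = don't care" wildcard semantics as ACLs. The function name is made up for the example.

```python
import ipaddress


def tag_matches(tag: str, base: str, wildcard: str) -> bool:
    """True if the dotted-decimal tag equals base on every bit the wildcard cares about."""
    t, b, w = (int(ipaddress.IPv4Address(x)) for x in (tag, base, wildcard))
    care = ~w & 0xFFFFFFFF          # bits that must match (wildcard bit = 0)
    return (t & care) == (b & care)


# The route-tag list entry from R2: 100.100.0.0 0.0.254.255
for tag in ("100.100.100.1", "100.100.200.1", "100.100.101.1"):
    print(tag, tag_matches(tag, "100.100.0.0", "0.0.254.255"))
# 100.100.100.1 True
# 100.100.200.1 True
# 100.100.101.1 False
```

The only bit in the third octet that the wildcard 254 leaves "cared about" is the lowest one, which is why only even third octets match.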

R2(config)#do sh ip eigrp top 1.1.1.1/32 | i Composite metric
      Composite metric is (6619136000/163840), route is External
R2(config)#do sh ip eigrp top 4.4.4.4/32 | i Composite metric
      Composite metric is (6619136000/163840), route is External
R2(config)#do sh ip eigrp top 6.6.6.6/32 | i Composite metric
      Composite metric is (1392640/163840), route is External

1.1.1.1 and 4.4.4.4 were tagged with 100.100.100.1 and 100.100.200.1 respectively, both even 3rd octets, and had their metric successfully rewritten. 6.6.6.6, tagged with 100.100.101.1, was not matched, and retained its original metric.

I immediately tried this in IPv6... however...

R2(config-router)#address-family ipv6 unicast autonomous-system 200
R2(config-router-af)#topology base
R2(config-router-af-topology)#distribute-list ?
  prefix-list  Filter connections based on an IPv6 prefix-list
R2(config-router-af-topology)#distribute-list route-map ?
% Unrecognized command

IPv6 can't be filtered ingress with route-maps yet. I didn't expect that. For anyone curious I'm on:

R2(config-router-af-topology)#do sh ver  | i IOS Software
Cisco IOS Software, CSR1000V Software (X86_64_LINUX_IOSD-UNIVERSALK9-M), Version 15.4(1)S1, RELEASE SOFTWARE (fc2)

There's one more option for setting tags:

router eigrp TEST1
  address-family ipv4 unicast autonomous-system 100
   eigrp default-route-tag 9.9.9.9

default-route-tag is fairly picky what it will tag. From some tinkering, it will tag all routes except:
- Locally redistributed routes
- Routes that already had a tag set in some other fashion
- Routes it learned from another router

So in short, unless you learned the routes with the "network" statement, this tag won't take effect.

IPv6 VRF Lite

The traditional EIGRP process doesn't support IPv6 in a VRF.  

You also must use the new format - multiprotocol VRF -  for creating VRFs. 
Old format:
R2(config)#ip vrf FOO
R2(config-vrf)#rd 1:1
R2(config-vrf)#exit
R2(config)#int gig1.10
R2(config-subif)#ip vrf forwarding FOO

Multiprotocol VRF:
R2(config)#vrf definition FOO
R2(config-vrf)#rd 1:1
R2(config-vrf)#address-family ipv6 unicast
R2(config-vrf-af)#address-family ipv4 unicast
R2(config-vrf-af)#exit
R2(config-vrf)#int gig1.10
R2(config-subif)#vrf forwarding FOO

router eigrp SAMPLE
 !
 address-family ipv6 unicast vrf FOO autonomous-system 200
  !
  topology base
  exit-af-topology
  eigrp router-id 2.2.2.2
 exit-address-family

Note the line eigrp router-id 2.2.2.2. Unless you have an IPv4 address in the routing table of the same VRF, you must specify the router ID manually. There is no parser error, it just doesn't work. Once again, this would make a great TS problem.

With IPv6, things work differently than IPv4 in named EIGRP mode. This process is already up:

*Sep  2 23:39:52.815: %DUAL-5-NBRCHANGE: EIGRP-IPv6 200: Neighbor FE80::20C:29FF:FEF7:FE11 (GigabitEthernet1.10) is up: new adjacency

However, note I haven't told it what interfaces to use. In our case, it automatically includes any interface that's in the appropriate VRF and has an IPv6 address on it. If you don't want to run EIGRP on an interface, you have to manually specify:

R2(config)#router eigrp SAMPLE
R2(config-router)#address-family ipv6 unicast vrf FOO autonomous-system 200
R2(config-router-af)#af-interface gig1.10
R2(config-router-af-interface)#shut

*Sep  2 23:47:10.304: %DUAL-5-NBRCHANGE: EIGRP-IPv6 200: Neighbor FE80::20C:29FF:FED7:2458 (GigabitEthernet1.10) is down: interface down

3rd party Next-Hop
While also not a feature specific to named mode, EIGRP has recently started supporting 3rd party next hop. The concept of 3rd party next-hop is fairly simple. The easiest way I can explain it is if you have three routers on a single segment, R1, R2, and R3.  They all share the 192.168.123.0/24 space between them. However, R1 and R2 speak EIGRP, and R2 and R3 speak OSPF.  R1 doesn't speak OSPF, and R3 doesn't speak EIGRP. Assume there are extra routers behind R1 and R3 on different segments that are advertised in their respective routing protocols.

R2 is mutually redistributing between EIGRP and OSPF.

Without 3rd party next-hop, R1 would have to send traffic destined for the OSPF segments to R2, then R2 would have to forward it to R3. Inefficient and messy.

With 3rd party next-hop, R2 is permitted to use R3's address, even though it doesn't exist in the EIGRP process, when advertising routes to R1.

This is an automatic feature and requires only that R2 doesn't re-write the next-hop to itself (rewriting the next hop is default EIGRP behavior):

router eigrp TEST2
 !
 address-family ipv4 unicast autonomous-system 100
  !
  af-interface GigabitEthernet1.123
   no next-hop-self

EIGRP Fast ReRoute (FRR)
The point of FRR is to generate Loop Free Alternates, or LFAs. What's an LFA?
An LFA is a back-up route that can be pre-programmed into the FIB as a repair route. If you're familiar with EIGRP, you might think "but EIGRP already has feasible successors". True, but it doesn't program those into the forwarding linecards. 

I wrote a rather lengthy article regarding BGP PIC and Add-Path recently, and I covered the problem that PIC was trying to solve, which is not necessarily easy to comprehend unless you've spent a great deal of time in a large service provider environment. PIC and FRR are trying to solve the same issue with different protocols. Rather than pasting the multi-page explanation I've already typed into this document as well, please reference that one (BGP PIC and Add-Path) to understand the issue.


The good news is that EIGRP doesn't require as complex an environment to explain FRR as it took to explain BGP PIC.

We already know EIGRP makes feasible successors, and can rely on those during reconvergence. But if we want the FIB to be able to swap over to a feasible successor as soon as the successor route is lost, we need to pre-program it.

In a nutshell, FRR simply picks the "best" feasible successor and sticks it in the FIB as a backup route.

There are two types of FRR, per-link and per-prefix. Per-link is only supported on IOS-XR at the time of this writing, so we'll be looking only at per-prefix.

First and foremost, we must ensure we have a feasible successor. If we have multiple successors (no feasibles), then we have ECMP - equal cost multi-path - and there's no need for FRR.

R1 has two paths to prefix 4.4.4.4 on R4, one via R2 and another via R3. I've deliberately de-prefed the route through R3. Note, if you're attempting to lab along with this, you'll want to create the depref on R1. If you're ECMP up until you create the depref on R1, you're guaranteed to have a feasible successor!

R1(config-subif)#int gig1.13
R1(config-subif)#delay 5000

R1#sh ip eigrp topo 4.4.4.4/32
EIGRP-IPv4 VR(TEST) Topology Entry for AS(100)/ID(192.168.12.1) for 4.4.4.4/32
  State is Passive, Query origin flag is 1, 1 Successor(s), FD is 2048000, RIB is 16000
  Descriptor Blocks:
  192.168.12.2 (GigabitEthernet1.12), from 192.168.12.2, Send flag is 0x0
      Composite metric is (2048000/1392640), route is Internal
      Vector metric:
        Minimum bandwidth is 1000000 Kbit
        Total delay is 21250000 picoseconds
        Reliability is 255/255
        Load is 1/255
        Minimum MTU is 1500
        Hop count is 2
        Originating router is 192.168.24.4
  192.168.13.3 (GigabitEthernet1.13), from 192.168.13.3, Send flag is 0x0
      Composite metric is (3278192640/1392640), route is Internal
      Vector metric:
        Minimum bandwidth is 1000000 Kbit
        Total delay is 50011250000 picoseconds
        Reliability is 255/255
        Load is 1/255
        Minimum MTU is 1500
        Hop count is 2
        Originating router is 192.168.24.4

Since the route via 192.168.13.3 (from R3) has a reported distance (1392640) less than the feasible distance of the successor route via 192.168.12.2 (2048000), we now have a feasible successor.
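
The feasibility check itself is just a comparison. Here's a minimal Python sketch using the two descriptor blocks above. It treats the feasible distance as the current best composite metric, which is a simplification (on a real router the FD is the lowest metric seen since the route last went passive), and the function name is made up for the example.

```python
def classify_paths(paths):
    """paths maps next-hop -> (composite metric, reported distance).

    Returns (feasible distance, successors, feasible successors).
    Simplification: FD is taken as the current lowest composite metric.
    """
    fd = min(total for total, _ in paths.values())
    successors = [nh for nh, (total, _) in paths.items() if total == fd]
    feasible = [
        nh for nh, (total, rd) in paths.items()
        if total != fd and rd < fd      # the feasibility condition: RD < FD
    ]
    return fd, successors, feasible


paths = {
    "192.168.12.2": (2048000, 1392640),      # via R2
    "192.168.13.3": (3278192640, 1392640),   # via R3, after delay 5000
}
print(classify_paths(paths))
# (2048000, ['192.168.12.2'], ['192.168.13.3'])
```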

R1(config)#router eigrp TEST
R1(config-router)# address-family ipv4 unicast autonomous-system 100
R1(config-router-af)#  topology base
R1(config-router-af-topology)#fast-reroute per-prefix all

R1#sh ip route 4.4.4.4 | i Repair
      Repair Path: 192.168.13.3, via GigabitEthernet1.13

R1#sh ip cef 4.4.4.4
4.4.4.4/32
  nexthop 192.168.12.2 GigabitEthernet1.12
    repair: attached-nexthop 192.168.13.3 GigabitEthernet1.13

It's very simple if we only have two paths, but what if there are 3 or more? Cisco uses what it calls "tie breakers", but I really dislike the name. We're not necessarily tie-breaking, because the selection criteria aren't comparing apples to apples. It's a bit more like "2nd bestpath decision maker".

Before I list off the tie-breakers, let's look at what the problems might be if we had numerous paths to choose from.

Let's say we have multiple neighbors on a shared segment, with varying metrics to the destination we're trying to protect. Your bestpath is on that segment, as is your "second best" feasible successor, all hanging off the same interface on your router. If you're choosing the LFA purely based on metric, the same interface will get chosen for the backup path as is the primary route. That doesn't help us if that WAN link fails, or if the interface goes down, etc. 

Take that one step further and say your best-path and best feasible successor are both on the same linecard. That might also be a poor decision.

What I'm getting at is there's more to consider than just the metric in this scenario.

The four tie-breakers are:
- srlg-disjoint, priority 10
- interface-disjoint, priority 20
- lowest-backup-path-metric, priority 30
- linecard-disjoint, priority 40

Lower priority is better.

srlg-disjoint favors a backup-path/interface that isn't in the same Shared Risk Link Group (more below).

interface-disjoint favors a backup route that doesn't share the same interface for its next-hop. BEWARE, sub-interfaces are considered disjointed interfaces by the FRR process on my version of IOS-XE!

lowest-backup-path-metric favors a backup route with the lowest metric.

linecard-disjoint favors a backup route that doesn't share the same linecard.

So to clarify, by default, SRLG gets priority unless not set, then interface-disjoint gets priority unless the two paths are already on different interfaces (or subinterfaces), then the lowest metric is picked. If the metric is the same, it looks for a port on a different linecard.

So to start, what the heck is SRLG?

There's very little information on this feature that I can find, but the idea, as best I can tell, is that if you happen to know two physical links share some dependency (perhaps passing through the same L2 switch upstream, for example), you can tell IOS which ones have dependencies.

For example, if Gig1 and Gig2 on my router both passed through a single point of failure upstream, my config might look something like this:

R1(config)#int gig1
R1(config-if)#srlg gid 1
R1(config-if)#int gig2
R1(config-if)#srlg gid 1
R1(config-if)#int gig3
R1(config-if)#srlg gid 2

Note gig3 didn't necessarily need to get assigned to an srlg, but I included it for clarity.

I'm going to introduce a new path from R1 to R4 via R5.  R1, R2 and R5 are all going to share a common link, meaning R1 routes to R2 and R5 on the same interface. I'm increasing delay slightly more on the path to R5. Furthermore, I'm going to prevent R2 and R5 from peering with one another, otherwise R5 would end up only advertising its bestpath from R2, and my topology breaks.

R1#sh ip eigrp top 4.4.4.4/32 | i Composite metric
      Composite metric is (78725120/13189120), route is Internal
      Composite metric is (3289989120/13189120), route is Internal
      Composite metric is (79380480/13844480), route is Internal

We see we've got three paths, let's look at those again with my comments:

R1#sh ip eigrp top 4.4.4.4/32 | i Composite metric
      Composite metric is (78725120/13189120), route is Internal = Path through gig1.12 via R2
      Composite metric is (3289989120/13189120), route is Internal  = Path through gig1.13 via R3
      Composite metric is (79380480/13844480), route is Internal = Path through gig1.12 via R5

R1#sh ip cef 4.4.4.4
4.4.4.4/32
  nexthop 192.168.12.2 GigabitEthernet1.12
    repair: attached-nexthop 192.168.13.3 GigabitEthernet1.13

We can see IOS made a very smart move here, and it's in line with the priorities we discussed above. The backup path is not the best feasible successor from a metric standpoint. Instead, it's the less risky separate "interface" (again, IOS considers a subinterface a separate interface).

If we instead wanted it to choose based on metric:

R1(config)#router eigrp TEST
R1(config-router)# address-family ipv4 unicast autonomous-system 100
R1(config-router-af)#  topology base
R1(config-router-af-topology)#fast-reroute tie-break lowest-backup-path-metric 5

<<note I normally clear the eigrp neighbors here, these commands don't always seem to react quickly after the change>>

R1(config-router-af-topology)#do sh ip cef 4.4.4.4
4.4.4.4/32
  nexthop 192.168.12.2 GigabitEthernet1.12
    repair: attached-nexthop 192.168.12.5 GigabitEthernet1.12

Now we're preferring the backup path through the same interface, which has the better metric.
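
To make the priority ordering concrete, here is a Python sketch of how I picture the selection working: each tie-breaker, in ascending priority order, narrows the candidate list if it can, and is skipped if it would eliminate everyone. This is my own simplified model of the behavior observed above, not Cisco's documented algorithm. The class, function, and metric values are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Path:
    name: str
    interface: str
    metric: int
    srlg: Optional[int] = None   # None = no SRLG configured
    linecard: int = 0


DEFAULT_PRIORITIES = {
    "srlg-disjoint": 10,
    "interface-disjoint": 20,
    "lowest-backup-path-metric": 30,
    "linecard-disjoint": 40,
}


def pick_repair(primary, candidates, priorities):
    """Apply tie-breakers from lowest priority number to highest."""
    rules = {
        "srlg-disjoint": lambda c: primary.srlg is None or c.srlg != primary.srlg,
        "interface-disjoint": lambda c: c.interface != primary.interface,
        "linecard-disjoint": lambda c: c.linecard != primary.linecard,
    }
    remaining = list(candidates)
    for rule in sorted(priorities, key=priorities.get):
        if rule == "lowest-backup-path-metric":
            best = min(c.metric for c in remaining)
            kept = [c for c in remaining if c.metric == best]
        else:
            kept = [c for c in remaining if rules[rule](c)]
        if kept:                     # a rule nobody satisfies is skipped
            remaining = kept
    return remaining[0]


# Illustrative metrics only: R5 is the better metric but shares gig1.12.
r2 = Path("R2", "GigabitEthernet1.12", 100)
r5 = Path("R5", "GigabitEthernet1.12", 200)
r3 = Path("R3", "GigabitEthernet1.13", 300)

print(pick_repair(r2, [r5, r3], DEFAULT_PRIORITIES).name)   # R3

by_metric = dict(DEFAULT_PRIORITIES, **{"lowest-backup-path-metric": 5})
print(pick_repair(r2, [r5, r3], by_metric).name)            # R5
```

With the default priorities the interface-disjoint rule wins and R3 is chosen. Moving lowest-backup-path-metric to priority 5 flips the choice to R5, matching the two CEF outputs above.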

I'm not going to show the output from srlg-disjoint here, but I have labbed it previously and it does work - just set the srlg gid on the appropriate interfaces. Also, I have no way of labbing linecard-disjoint because I'm on a virtual router.

EIGRP Over The Top (OTP)
Does anyone besides me use the OTP abbreviation to mean "on the phone"? I wish they could've gone with OTT instead.

It is a really neat feature though - I know a lot of people will bash EIGRP as obsolete, proprietary, distance vector ... say what you will, amongst Cisco-powered enterprise networks, it's the most popular IGP by a landslide. As a consultant, I would say 80% of the networks I come across run it.

Furthermore, finding enterprise network support personnel that are BGP experts is somewhat rare.

So what is one to do when MPLS separates all your sites, and your carrier (wisely) uses BGP as a PE->CE protocol? You hire a consultant to come in and make changes to the redistribution strategy periodically.

Or... you run EIGRP OTP, and toss the BGP work out the window.

OTP allows remote EIGRP peerings over any underlying IP network. All you need is reachability to the other EIGRP host. That means all your carrier needs to do is advertise the PE->CE link itself (probably a /30 between you and the carrier) in their MP-BGP, and the CE doesn't even need to run BGP (topology dependent). All the CE needs is a static default pointing at the PE router.

If you have more than a few CEs, you'll probably want an EIGRP Route Reflector, which isn't nearly as complicated as it sounds. An EIGRP RR listens for dynamic connections (optionally), and then disables split horizon and next-hop-self.

LISP provides the tunneling mechanism for the neighbors to reach one another. Fortunately, no LISP knowledge is required, as the config is automatic.



Here, R2 - R5 represent the provider network, R1 and R7 represent isolated customer sites, and R6 and R8 represent a dual-homed customer site.

R7 will be our EIGRP route reflector.

Assume the provider is advertising the links between the CE and PE.
Here are the rest of the relevant configs:

R1:
router eigrp OTP-TEST
 !
 address-family ipv4 unicast autonomous-system 100
  !
  topology base
  exit-af-topology
  neighbor 192.168.37.7 GigabitEthernet1.12 remote 10 lisp-encap
  network 1.1.1.1 0.0.0.0
  network 192.168.12.0
 exit-address-family

ip route 0.0.0.0 0.0.0.0 192.168.12.2

Just to prove there's no BGP involved here:

R1#sh ip protocol sum
Index Process Name
0     connected
1     static
2     application
4     eigrp 100
*** IP Routing is NSF aware ***

R6:
router eigrp OTP-TEST
 !
 address-family ipv4 unicast autonomous-system 100
  !
  topology base
  exit-af-topology
  neighbor 192.168.37.7 GigabitEthernet1.46 remote 10 lisp-encap
  network 6.6.6.6 0.0.0.0
  network 192.168.46.0
  network 192.168.68.0
 exit-address-family

R8:
router eigrp OTP-TEST
 !
 address-family ipv4 unicast autonomous-system 100
  !
  topology base
  exit-af-topology
  neighbor 192.168.37.7 GigabitEthernet1.58 remote 10 lisp-encap
  network 8.8.8.8 0.0.0.0
  network 192.168.58.0
  network 192.168.68.0
 exit-address-family

R7 (route reflector):
router eigrp OTP-TEST
 !
 address-family ipv4 unicast autonomous-system 100
  !
  af-interface GigabitEthernet1.37
   no next-hop-self
   no split-horizon
  exit-af-interface
  !
  topology base
  exit-af-topology
  remote-neighbors source GigabitEthernet1.37 unicast-listen lisp-encap
  network 7.7.7.7 0.0.0.0
  network 192.168.37.0
 exit-address-family

The route reflector is also running BGP. Route reflectors can have a topology problem requiring this if you have backdoor links. In my case, if I only ran a default on the route reflector, I'd learn the link to R8 via EIGRP from R6, as opposed to using my default route. And vice-versa, R8 would advertise connectivity to R6, and my routes would do a continual up/down because they'd learn next-hops via the LISP interface. It's a typical tunnel recursion loop issue. Running BGP puts the prefixes to reach R6 and R8 in R7's table at a lower AD and solves the problem.

Also note that the link between PE and CE must be advertised into EIGRP in order for LISP to come up.

Now we have full reachability to the EIGRP prefixes without the majority of the CEs running BGP, and none of the CEs advertising their EIGRP routes into it.

R1#sh ip route eigrp | b Gateway
Gateway of last resort is 192.168.12.2 to network 0.0.0.0

      6.0.0.0/32 is subnetted, 1 subnets
D        6.6.6.6 [90/93994331] via 192.168.46.6, 00:06:34, LISP0
      7.0.0.0/32 is subnetted, 1 subnets
D        7.7.7.7 [90/93994331] via 192.168.37.7, 00:06:35, LISP0
      8.0.0.0/32 is subnetted, 1 subnets
D        8.8.8.8 [90/93994331] via 192.168.58.8, 00:06:34, LISP0
D     192.168.68.0/24 [90/93998811] via 192.168.46.6, 00:06:34, LISP0

R1#sh ip cef 6.6.6.6
6.6.6.6/32
  nexthop 192.168.46.6 LISP0

R1#ping 6.6.6.6
Type escape sequence to abort.
Sending 5, 100-byte ICMP Echos to 6.6.6.6, timeout is 2 seconds:
!!!!!
Success rate is 100 percent (5/5), round-trip min/avg/max = 1/1/2 ms

Add-Path

Add-Path is the capability to advertise more than one bestpath to a neighbor. I've done a large write-up on the BGP implementation of it in my BGP PIC and Add-Path article.


The Cisco documentation indicates a use case of DMVPN for EIGRP Add-Path, but that seems a pretty narrow use to me, as summarization with DMVPN phase 3 would make it useless. However, our scenario for OTP above is perfect! 

R1#sh ip route eigrp | b Gateway
Gateway of last resort is 192.168.12.2 to network 0.0.0.0

      6.0.0.0/32 is subnetted, 1 subnets
D        6.6.6.6 [90/93994331] via 192.168.46.6, 00:06:34, LISP0
      7.0.0.0/32 is subnetted, 1 subnets
D        7.7.7.7 [90/93994331] via 192.168.37.7, 00:06:35, LISP0
      8.0.0.0/32 is subnetted, 1 subnets
D        8.8.8.8 [90/93994331] via 192.168.58.8, 00:06:34, LISP0
D     192.168.68.0/24 [90/93998811] via 192.168.46.6, 00:06:34, LISP0

R1 only learns one path to 192.168.68.0/24. Two are available, so why can't we install both? It's the same problem as with BGP: the EIGRP route reflector only sends its one best-path.

R7(config)#router eigrp OTP-TEST
R7(config-router)# address-family ipv4 unicast autonomous-system 100
R7(config-router-af)#  af-interface GigabitEthernet1.37
R7(config-router-af-interface)#add-paths 2

R1#sh ip route eigrp | b Gateway
Gateway of last resort is 192.168.12.2 to network 0.0.0.0

      6.0.0.0/32 is subnetted, 1 subnets
D        6.6.6.6 [90/93994331] via 192.168.46.6, 00:12:51, LISP0
      7.0.0.0/32 is subnetted, 1 subnets
D        7.7.7.7 [90/93994331] via 192.168.37.7, 00:12:52, LISP0
      8.0.0.0/32 is subnetted, 1 subnets
D        8.8.8.8 [90/93994331] via 192.168.58.8, 00:12:51, LISP0
D     192.168.68.0/24 [90/93998811] via 192.168.58.8, 00:00:26, LISP0
                      [90/93998811] via 192.168.46.6, 00:00:26, LISP0

And we've got multiple redundant paths to 192.168.68.0/24 now!

Note, EIGRP add-path is incompatible with variance.

Hope you enjoyed,

Jeff

Saturday, August 16, 2014

BGP PIC and Add-Path

The meat of this article will be Add-Path, and why it's needed in certain PIC scenarios. However, understanding where and why we need these technologies, what was done before the Add-Path implementation was widely in place, etc, is nearly as challenging to learn as the Add-Path implementation itself.

This is not intended to completely document Add-Path, nor is it just a primer. My original intent was to document the entire use of Add-Path, however, I realized halfway through that this would have easily produced a 50+ page document: There are many one-off cases for Add-Path that have their own features, and to show a use case for each one would've required several different topologies and drawings. My hope is that at the depth I took it to, it will be more than sufficient to educate to the level required for the CCIE R&S v5 lab.

So - what is PIC?

PIC stands for Prefix Independent Convergence.

PIC is a method for speeding up convergence of the FIB under failover conditions.

Unless you have a really serious lab or a Spirent to play with, forget trying to lab the performance gain. The gains we're talking about here are only seen when you have tens of thousands, hundreds of thousands, or even 1M routes in your FIB.

The use case is actually pretty easy to understand - when the next-hop to a set of prefixes changes, the router (presumably talking about a 7600 or ASR) has to walk each prefix in the FIB and update the next-hop. If you have 100 routes, this time is negligible. If you're carrying 1M routes in an MPLS environment, this is not a small problem. I've been told first-hand (from someone who does have a Spirent to play with) that this takes about two minutes.

This would be Prefix Dependent Convergence, or a problem that grows dependent upon how many prefixes are in your FIB. The solution we want is something that updates in the same amount of time (presumably small amount of time!) no matter how many FIB entries we have.

The concept of the FIB dates back decades now, and when it was originally written it was made in the most efficient manner possible, for CPU and RAM conservation:

Prefix = Interface/Next-Hop

For example,

10.10.10.10/32 = FastEthernet0/0 192.168.0.1

This was great 20 years ago when a "large" routing table was 40,000 routes. To converge quickly, a new method is required. Introducing the Hierarchical FIB.

When using PIC, the FIB actually restructures to a 3-tier system:

Prefix = Pointer = Interface/Next-Hop

Understanding why this is better takes understanding that while a router may be carrying 1M routes, it's probably only directly connected (layer 3) to a dozen or less. So you've got 1M routes, and 12 possible exits.

Let's say half those routes go out to two primary edge routers. Those routers are at 192.168.1.1 and 192.168.2.1.

So, roughly half your routes look like:

10.10.10.10/32 = Pointer A = Gigabit0/0 192.168.1.1
11.11.11.11/32 = Pointer A = Gigabit0/0 192.168.1.1
.... 499,998 routes later ...
197.197.197.197/32 = Pointer A = Gigabit0/0 192.168.1.1

192.168.1.1 fails. However, all these same prefixes are reachable via 192.168.2.1.
With an appropriately designed network, PIC can simply reassign Pointer A. This takes less than 50ms as opposed to 60+ seconds.

10.10.10.10/32 = Pointer A = Gigabit0/1 192.168.2.1
11.11.11.11/32 = Pointer A = Gigabit0/1 192.168.2.1

.... 499,998 routes later ...
197.197.197.197/32 = Pointer A = Gigabit0/1 192.168.2.1

The CEF process updated one value, that of Pointer A. Previously this took 500,000 updates, now it takes one. The time required for this process is independent of how many routes use the next-hop, hence Prefix Independent Convergence.
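
To make the write-count difference concrete, here's a minimal Python sketch of the two layouts. It's a toy model under my own assumptions, not how CEF is actually implemented: the names `PathPointer`, `fail_over_flat` and `fail_over_hierarchical` are made up for illustration, and 5,000 invented prefixes stand in for the 500,000 above.

```python
# Toy model only: real CEF data structures are far more involved.

class PathPointer:
    """The shared middle tier ("Pointer A") that many prefixes reference."""

    def __init__(self, interface, next_hop):
        self.interface = interface
        self.next_hop = next_hop


def fail_over_flat(fib, old_next_hop, interface, next_hop):
    """Flat FIB: rewrite every affected prefix; returns the number of writes."""
    writes = 0
    for prefix in list(fib):
        if fib[prefix][1] == old_next_hop:
            fib[prefix] = (interface, next_hop)
            writes += 1
    return writes


def fail_over_hierarchical(pointer, interface, next_hop):
    """Hierarchical FIB: repoint the shared object; always one write."""
    pointer.interface = interface
    pointer.next_hop = next_hop
    return 1


# 5,000 made-up prefixes stand in for the 500,000 in the text.
prefixes = [f"10.{i // 256}.{i % 256}.0/24" for i in range(5000)]

flat_fib = {p: ("Gigabit0/0", "192.168.1.1") for p in prefixes}
pointer_a = PathPointer("Gigabit0/0", "192.168.1.1")
hier_fib = {p: pointer_a for p in prefixes}

flat_writes = fail_over_flat(flat_fib, "192.168.1.1", "Gigabit0/1", "192.168.2.1")
hier_writes = fail_over_hierarchical(pointer_a, "Gigabit0/1", "192.168.2.1")

print(flat_writes, hier_writes)
```

The flat table needs one write per prefix, while the hierarchical table needs a single write no matter how many prefixes share the pointer.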

Now if you're following along, you probably see the enormous catch here: unless you're multipathing, how is CEF even going to know about the second path? PIC is a data-plane/CEF/FIB feature, it doesn't touch the control-plane. Normally we'd have to wait on BGP convergence (topology dependent), which takes a heck of a lot longer than 50ms. As we're all aware, and this is key to understanding this topic, BGP only sends its single best-path per-prefix to its neighbors. What if we needed two or more? Even worse, what if we're crossing a route-reflector, that aggregates everyone's paths and picks only one?

I am going to cover four different ways to solve this, add-path being the newest of them.

Here are the options at a high-level:
1) Multipath. This is by far the easiest option if your topology fits.
2) BGP Advertise-Best-External. For advertising from PE->PE, or PE->RR; this tells the edge PE to send its external route (presumably from a CE via eBGP) as best. More below.
3) Diverse-Path (Shadow Router). This tells a route reflector, a secondary one in a topology, to deliberately calculate a "second-best" path that has a different next-hop. Instead of forwarding its best-path, it forwards this "second-best" path. Only the route-reflector needs to be updated to support this feature.
4) Add-Path. In short, Add-Path modifies the BGP behavior to send two or more paths instead of just one best-path. This requires that every device in the topology that needs to send or receive multiple paths supports Add-Path.

I've chosen to demonstrate these solutions in a VPNv4 environment, as it's where PIC makes the most sense. Note that add-path is purely an iBGP technology, the parser gets upset if you try it on eBGP:

R3(config-router)#neighbor 192.168.30.2 advertise additional-paths all
% BGP: Add-Path *not* supported on EBGP peering

I have a hobby (perhaps more of an interest?) of the language used in IOS parser messages.  Half the time, unless you know the technology already, you can't even tell what the programmer was trying to convey when you make a mistake. If it's a new feature sometimes you don't even get an error, it just doesn't apply the config. Then other times you get blunt messages with *stars*!




I'm running a common VPNv4 design: BGP on the PEs, VRFs between CE and PE, and a "BGP free core" (all one P router that isn't a route reflector :) ).

On the PE->CE links, I'm using 10.0.X.Y/24, where X is a combination of the two routers the link connects (i.e. R1->R2 is "12"), and Y is the router number. This is also the same number on the subinterface on the diagram.

On the PE->P or PE->RR links, the IPs are 192.168.X.Y, same explanation of X and Y as above.

Every router has a loopback0 of Y.Y.Y.Y/32, where Y is the router number.

Note that R4 is a route-reflector, and R6, R7 and R8 are all PEs.

Let's talk about the two flavors of PIC. There's PIC Core and PIC Edge. They're both applied to a PE.

PIC Core is far simpler than PIC Edge, so we'll start there. We've enabled PIC Core on R2.

PIC Core is enabled with one command:


R2-PE(config)#cef table output-chain build favor convergence-speed

Of note, to disable it, you replace "convergence-speed" with "memory-utilization".

Unlike PIC Edge, which, depending on the implementation, may require widespread support on the network, PIC Core can literally be enabled on just one device if you wanted.



As mentioned above, in a typical VPNv4 scenario, the core is BGP-free, and only the PEs (and any route reflectors) maintain the BGP table. Next hops to the PEs are carried in the IGP. Let's look at how that plays out:

- Let's assume R1's bestpath to R9 is via R2. R1 is BGP peered to R2.
- R2 takes R1's traffic in to a VRF. It imports the VRF traffic into VPNv4.
- R4, the route reflector, learns via iBGP that the PEs R6 and R7 can both reach R9. It chooses R6 as the bestpath.
- R2, only peered with R4 for iBGP, learns that R6 is the bestpath.
- Since this is VPNv4, R2 needs to choose an LDP-enabled next hop that has a label for 6.6.6.6. Remember, in VPNv4, the next hop inside the iBGP network is always the iBGP next-hop. The IGP indicates that R5 is the bestpath for R2 to reach R6 (via MPLS).

The key element here is the recursion between R2 and R6:
BGP tells R2 how to reach R9 via R6: 2.2.2.2 -> 6.6.6.6
R2 needs to find out how to reach 6.6.6.6 via the IGP: 2.2.2.2 -> R5
R2 needs to know how to reach R5: 2.2.2.2 -> 192.168.25.5 (R5's interface IP)
R2 needs to pick an interface to reach 192.168.25.5: gig1.25

So one more time!
iBGP: 2.2.2.2 -> 6.6.6.6
  iBGP Next-Hop via OSPF: Find 6.6.6.6 via R5 at 192.168.25.5
     CEF: Exit interface gig1.25 towards 192.168.25.5

I'm going to harp on the high-level of this again because it's dead critical to understanding the hierarchy of the process:
BGP recurses to IGP
   IGP recurses to one or more Next Hops
      FIB populates one or more next hops from the IGP
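
If it helps to see that hierarchy as lookups, here's a rough Python sketch. The three dictionaries are hypothetical stand-ins I've made up for the BGP table, the IGP's view of the BGP next hop, and the adjacency table; they simply reuse the addresses from the walk-through.

```python
# Hypothetical tables mirroring the walk-through above; not real IOS structures.
bgp_table = {"9.9.9.9/32": "6.6.6.6"}        # prefix -> iBGP next hop
igp_table = {"6.6.6.6": ["192.168.25.5"]}    # iBGP next hop -> IGP next hop(s)
adjacency = {                                # IGP next hop -> exit interface
    "192.168.25.5": "gig1.25",
    "192.168.24.4": "gig1.24",
}


def resolve(prefix):
    """Follow BGP -> IGP -> interface and return (next hop, interface) pairs."""
    bgp_next_hop = bgp_table[prefix]
    return [(nh, adjacency[nh]) for nh in igp_table[bgp_next_hop]]


before = resolve("9.9.9.9/32")

# If the IGP later prefers a different next hop, only the middle table changes.
# BGP's answer (6.6.6.6) is untouched.
igp_table["6.6.6.6"] = ["192.168.24.4"]
after = resolve("9.9.9.9/32")
```

The point of the sketch is that the BGP entry never has to be rewritten when only the IGP's answer changes.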

When you're using PIC Core, this is what we care about:
BGP recurses to IGP
   IGP recurses to one or more Next Hops <-- PIC CORE INFLUENCES
      FIB populates one or more next hops from the IGP <-- PIC CORE INFLUENCES

I will demonstrate below.

So given that R1 -> R2 -> R5 -> R6 -> R9 above, let's say R5 goes completely offline - dead.

This does not impact the BGP session between R2 and R4, or between R4 and R6. However, the next hop that the BGP next-hop resolves to (192.168.25.5), which R2 learned from the IGP, must change. The IGP can reconverge very quickly, but let's say the BGP process was carrying 1M routes from R9. How long will it take R2 to update the next-hops of the BGP table and CEF?

So to be clear, BGP is not reconverging. PIC Core cannot handle a BGP reconverge, you need PIC Edge for that. But if the IGP reconverges and requires the BGP Table and FIB to update, and you have a large quantity of routes, this can create a major impact on a PE - possibly several minutes of dropping traffic.

With a traditional FIB, we'd have to make 1 million updates in both the BGP table and the FIB in order to be fully forwarding again.  With a hierarchical FIB - what PIC Core provides us - the following process would happen:

The FIB, before:
Prefix 1 -> Pointer A (192.168.25.5) -> gig1.25

The IGP reconverges the path via R4.
Now we update Pointer A - one value instead of 1M values - and we end up with:

Prefix 1 -> Pointer A (192.168.24.4) -> gig1.24

So to reiterate, PIC Core is for failure of non-BGP speakers. It doesn't help if BGP itself needs to reconverge, but it does dramatically speed up CEF's failover when an IGP path fails.

Now moving in to the more complex PIC Edge.

If PIC Core was about dealing with IGP failure, PIC Edge is about dealing with BGP failure.

For the moment, we'll continue using our VPNv4 topology, except we're temporarily removing the route reflector and instead installing a full-mesh iBGP.

Please note that using PIC Edge should involve running BFD between the BGP speakers for fast detection of a failure. For simplicity, I've omitted this step. To learn more about BFD, please see my BFD blog: http://brbccie.blogspot.com/2014/06/everything-bfd.html

That's quite a few iBGP peerings. The red lines indicate all the iBGP peerings:


In this scenario, we're going to deal with R2's convergence process again, except we're going to assume R6 - the BGP-adjacent PE - dies, instead of a P router.

Let's look at our routing protocols from R2's perspective.

R2-PE#sh bgp vpnv4 un all 9.9.9.9
BGP routing table entry for 1:1:9.9.9.9/32, version 11
Paths: (3 available, best #2, table VPN)
  Advertised to update-groups:
     1
  Refresh Epoch 3
  300
    8.8.8.8 (metric 3) (via default) from 8.8.8.8 (8.8.8.8)
      Origin IGP, metric 0, localpref 100, valid, internal
      Extended Community: RT:1:1
      mpls labels in/out nolabel/16
      rx pathid: 0, tx pathid: 0
  Refresh Epoch 3
  300
    6.6.6.6 (metric 3) (via default) from 6.6.6.6 (6.6.6.6)
      Origin IGP, metric 0, localpref 100, valid, internal, best
      Extended Community: RT:1:1
      mpls labels in/out nolabel/22
      rx pathid: 0, tx pathid: 0x0
  Refresh Epoch 3
  300
    7.7.7.7 (metric 3) (via default) from 7.7.7.7 (7.7.7.7)
      Origin IGP, metric 0, localpref 100, valid, internal
      Extended Community: RT:1:1
      mpls labels in/out nolabel/26
      rx pathid: 0, tx pathid: 0

As expected, R2 has three BGP paths to 9.9.9.9. 6.6.6.6 is the best.

How do we reach 6.6.6.6?

R2-PE#sh ip ospf route | s 6.6.6.6
*>  6.6.6.6/32, Intra, cost 3, area 0
      via 192.168.24.4, GigabitEthernet1.24
      via 192.168.25.5, GigabitEthernet1.25

The BGP table has one selected bestpath, the IGP has two multipath bestpaths to BGP's next hop:

R2-PE#sh ip route 6.6.6.6
Routing entry for 6.6.6.6/32
  Known via "ospf 1", distance 110, metric 3, type intra area
  Last update from 192.168.24.4 on GigabitEthernet1.24, 00:22:04 ago
  Routing Descriptor Blocks:
  * 192.168.25.5, from 6.6.6.6, 01:31:56 ago, via GigabitEthernet1.25
      Route metric is 3, traffic share count is 1
    192.168.24.4, from 6.6.6.6, 00:22:04 ago, via GigabitEthernet1.24
      Route metric is 3, traffic share count is 1

R2-PE#sh ip cef 6.6.6.6
6.6.6.6/32
  nexthop 192.168.24.4 GigabitEthernet1.24 label 17
  nexthop 192.168.25.5 GigabitEthernet1.25 label 18

Now let's refer back to my process from earlier:

BGP recurses to IGP
   IGP recurses to one or more Next Hops
      FIB populates one or more next hops from the IGP

or

BGP says use 6.6.6.6
   IGP says to get to 6.6.6.6 use either 192.168.25.5 or 192.168.24.4
      FIB points to 192.168.25.5 / tag 18 and 192.168.24.4 / tag 17 multipath

Now what happens if 6.6.6.6 fails?

R6-PE(config)#int gig1.46
R6-PE(config-subif)#shut
R6-PE(config-subif)#int gig1.56
R6-PE(config-subif)#shut
R6-PE(config-subif)#int gig1.69
R6-PE(config-subif)#shut

Debugging BGP updates on R2 (significantly edited for brevity):

*Aug 20 23:30:29.037: RT(VPN): updating bgp 9.9.9.9/32 (0x1)  :  via 7.7.7.7   0 26
*Aug 20 23:30:29.037: RT(VPN): closer admin distance for 9.9.9.9, flushing 1 routes
*Aug 20 23:30:29.037: RT(VPN): add 9.9.9.9/32 via 7.7.7.7, bgp metric [200/0]

BGP figures out that 6.6.6.6 is down, and picks 7.7.7.7 for the next hop. Now we have the same problem we had with PIC Core, only it's more significant:

BGP recurses to IGP  <-- PIC EDGE INFLUENCES
   IGP recurses to one or more Next Hops <-- PIC EDGE & CORE INFLUENCE
      FIB populates one or more next hops from the IGP <-- PIC EDGE & CORE INFLUENCE

Just pointing out the process there - we don't have PIC edge enabled, so our theoretical 1M routes just took minutes to reconverge.

So how do we enable PIC Edge?  Quite simply, we can't wait for the IGP and BGP to converge.  We need two paths in BGP.  This can be easy or difficult, depending on our topology.  Let's look at the easiest methods and progress towards harder.

Note we still have cef table output-chain build favor convergence-speed configured on R2, which is still necessary.

I'm re-enabling R6 to show how this could play out with PIC Edge. On R2:
router bgp 200
 address-family ipv4 vrf VPN
  maximum-paths ibgp 3

Now we've told R2 to install multiple BGP paths, not just multiple IGP paths. This way if R6's advertisement gets pulled again, there's already a pre-made alternative path.

Now we have three "hot", installed BGP paths to 9.9.9.9, instead of just one. This means with the IGP in consideration, we have up to six forwarding paths (three BGP next hops, each resolved over as many as two IGP next hops):

R2-PE#sh bgp vpnv4 un all | b 9.9.9.9
 *mi 9.9.9.9/32       7.7.7.7                  0    100      0 300 i
 *>i                  6.6.6.6                  0    100      0 300 i
 *mi                  8.8.8.8                  0    100      0 300 i

R2-PE#sh ip cef vrf VPN 9.9.9.9 detail
9.9.9.9/32, epoch 1, flags rib defined all labels, per-destination sharing
  recursive via 6.6.6.6 label 22
    nexthop 192.168.24.4 GigabitEthernet1.24 label 23
    nexthop 192.168.25.5 GigabitEthernet1.25 label 18
  recursive via 7.7.7.7 label 26
    nexthop 192.168.24.4 GigabitEthernet1.24 label 16
    nexthop 192.168.25.5 GigabitEthernet1.25 label 20
  recursive via 8.8.8.8 label 16
    nexthop 192.168.24.4 GigabitEthernet1.24 label 28

If we lose the path via 6.6.6.6, one of the other paths would simply pick up the load, and because of the hierarchical FIB we already implemented, there's no need to rewrite all 1M prefixes in the FIB one at a time. 
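
Here's a small Python sketch of that idea, assuming a made-up `LoadSharePathList` object that every prefix references. The CRC-based hash is just a deterministic stand-in I picked for per-destination sharing; it isn't CEF's real hashing algorithm, and the prefixes are invented.

```python
import zlib


class LoadSharePathList:
    """Toy shared object listing the installed BGP next hops for many prefixes."""

    def __init__(self, next_hops):
        self.next_hops = list(next_hops)

    def pick(self, src, dst):
        # Deterministic stand-in for per-destination load sharing.
        key = f"{src}->{dst}".encode()
        return self.next_hops[zlib.crc32(key) % len(self.next_hops)]

    def remove(self, next_hop):
        # One update here affects every prefix that references this object.
        self.next_hops.remove(next_hop)


shared = LoadSharePathList(["6.6.6.6", "7.7.7.7", "8.8.8.8"])
fib = {f"9.9.{i}.0/24": shared for i in range(1000)}

# R6 goes away: a single removal, and no prefix points at 6.6.6.6 any more.
shared.remove("6.6.6.6")
survivors = {fib[p].pick("1.1.1.1", p) for p in fib}
```

After the single `remove`, every one of the 1,000 prefixes hashes onto 7.7.7.7 or 8.8.8.8 without being touched individually.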

This represented our first PIC solution I described above: Multipathing. 

I'm going to temporarily cut to a much simpler scenario to show BGP Advertise-Best-External. While I could mix this in to the topology we've been using, it's getting too complex to clearly illustrate the topic.

Let's say multipathing isn't an option - what if one of the paths is clearly better than the others? What else can we do?

I've deliberately made R6 the bestpath by setting the local preference on all routes leaving it to 150.  Now what we see from R2 looks like:

R2-PE#show bgp vpnv4 un all 9.9.9.9
BGP routing table entry for 1:1:9.9.9.9/32, version 63
Paths: (1 available, best #1, table VPN)
  Advertised to update-groups:
     1
  Refresh Epoch 1
  300
    6.6.6.6 (metric 3) (via default) from 6.6.6.6 (6.6.6.6)
      Origin IGP, metric 0, localpref 150, valid, internal, best
      Extended Community: RT:1:1
      mpls labels in/out nolabel/22
      rx pathid: 0, tx pathid: 0x0

Only one path ... via 3 upstreams?  Yep.  The problem here is that, depending on timing, R2 may end up with three paths, but only for a moment - since all routers are peered with one another, R7 will learn via its iBGP session to R6 that R6 has the bestpath, as will R8.  Both R7 and R8 will then withdraw their own routes from their iBGP peers, including R2. Now R2 is stuck with one path - we need at least two for PIC Edge.

The dead easiest solution to this design is to use Advertise-Best-External:

R7 & R8:
router bgp 200
 address-family ipv4 vrf VPN
  bgp advertise-best-external

What's this do?

R7-PE#sh bgp vpnv4 un all 9.9.9.9
BGP routing table entry for 1:1:9.9.9.9/32, version 18
Paths: (3 available, best #2, table VPN)
  Advertised to update-groups:
     1          6
  Refresh Epoch 5
  300
    8.8.8.8 (metric 3) (via default) from 8.8.8.8 (8.8.8.8)
      Origin IGP, metric 0, localpref 100, valid, internal
      Extended Community: RT:1:1
      mpls labels in/out 26/16
      rx pathid: 0, tx pathid: 0
  Refresh Epoch 3
  300
    6.6.6.6 (metric 3) (via default) from 6.6.6.6 (6.6.6.6)
      Origin IGP, metric 0, localpref 150, valid, internal, best
      Extended Community: RT:1:1
      mpls labels in/out 26/22
      rx pathid: 0, tx pathid: 0x0
  Refresh Epoch 2
  300
    10.0.79.9 (via vrf VPN) from 10.0.79.9 (9.9.9.9)
      Origin IGP, metric 0, localpref 100, valid, external
      Extended Community: RT:1:1
      mpls labels in/out 26/nolabel
      rx pathid: 0, tx pathid: 0

R7 still sees the path through R6 as best. However, what's it sending to R2? It's sending its eBGP path to the CE as opposed to the path to R6.

Since R8 is doing the same thing, R2 now has three paths again:

R2-PE#sh bgp vpnv4 un all 9.9.9.9
BGP routing table entry for 1:1:9.9.9.9/32, version 70
Paths: (3 available, best #3, table VPN)
  Advertised to update-groups:
     1
  Refresh Epoch 5
  300
    8.8.8.8 (metric 3) (via default) from 8.8.8.8 (8.8.8.8)
      Origin IGP, metric 0, localpref 100, valid, internal
      Extended Community: RT:1:1
      mpls labels in/out nolabel/16
      rx pathid: 0, tx pathid: 0
  Refresh Epoch 3
  300
    7.7.7.7 (metric 3) (via default) from 7.7.7.7 (7.7.7.7)
      Origin IGP, metric 0, localpref 100, valid, internal
      Extended Community: RT:1:1
      mpls labels in/out nolabel/26
      rx pathid: 0, tx pathid: 0
  Refresh Epoch 1
  300
    6.6.6.6 (metric 3) (via default) from 6.6.6.6 (6.6.6.6)
      Origin IGP, metric 0, localpref 150, valid, internal, best
      Extended Community: RT:1:1
      mpls labels in/out nolabel/22
      rx pathid: 0, tx pathid: 0x0

So Advertise-Best-External sends your eBGP route as bestpath to your neighbors, but local routing (on R7 or R8) still goes through R6 due to the local-preference.
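
As a rough model of that behavior, here's a Python sketch of what a PE chooses to advertise to its iBGP peers. The best-path logic is deliberately reduced to local preference plus an eBGP tie-break, which is my simplification and nowhere near the full BGP decision process, and the `Path`, `best_path` and `advertise_to_ibgp` names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Path:
    next_hop: str
    local_pref: int
    external: bool  # True if learned over eBGP (e.g. from the CE)


def best_path(paths):
    # Greatly simplified: highest local preference wins,
    # ties broken in favor of eBGP-learned paths.
    return max(paths, key=lambda p: (p.local_pref, p.external))


def advertise_to_ibgp(paths, best_external=False):
    """Return the path a non-reflecting PE would send to iBGP peers, or None."""
    best = best_path(paths)
    if best.external:
        return best
    if best_external:
        externals = [p for p in paths if p.external]
        if externals:
            return best_path(externals)
    # An iBGP-learned bestpath is not passed on to other iBGP peers.
    return None


# R7's view from the output above. A real PE would typically rewrite the
# next hop to itself before advertising; the sketch keeps the CE address.
r7_paths = [
    Path("8.8.8.8", 100, False),
    Path("6.6.6.6", 150, False),
    Path("10.0.79.9", 100, True),
]

without_feature = advertise_to_ibgp(r7_paths)
with_feature = advertise_to_ibgp(r7_paths, best_external=True)
```

Without the feature R7 has nothing to send (which is the withdraw we saw), and with it R7 keeps advertising its CE-learned path while still forwarding via 6.6.6.6 locally.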

We're not done yet however:

R2-PE#sh ip cef vrf VPN 9.9.9.9 detail
9.9.9.9/32, epoch 1, flags rib defined all labels
  recursive via 6.6.6.6 label 22
    nexthop 192.168.24.4 GigabitEthernet1.24 label 23
    nexthop 192.168.25.5 GigabitEthernet1.25 label 18

R2 still only sees one possible path.

We need to implement some single-router Add-Path to make this work. The key item of importance is that only the routers that need the non-multipath redundant paths have to support Add-Path in this design.  If we're not worried about R6, R7, or R8 having an additional path back to R1, then we might just have R2 and R3 require the Add-Path support (Add-Path is a reasonably new feature at the time of this writing, so having your entire topology support it could be challenging).

router bgp 200
 address-family ipv4 vrf VPN
  bgp additional-paths select backup
  bgp additional-paths install

Don't worry about the specific mechanisms of "select backup" and "install" yet, I'm going to cover them thoroughly later. In short, we need to tell this router to pick a backup path and pre-install it in the FIB so that PIC can use it in failover, which this config accomplishes:

R2-PE#sh ip cef vrf VPN 9.9.9.9 det
9.9.9.9/32, epoch 1, flags rib defined all labels
  recursive via 6.6.6.6 label 22
    nexthop 192.168.24.4 GigabitEthernet1.24 label 23
    nexthop 192.168.25.5 GigabitEthernet1.25 label 18
  recursive via 7.7.7.7 label 26, repair
    nexthop 192.168.24.4 GigabitEthernet1.24 label 16
    nexthop 192.168.25.5 GigabitEthernet1.25 label 20

Note the "repair" keyword on the second path, that's the key.
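
Conceptually, the repair entry behaves something like the Python sketch below. `PathList` and `on_primary_failure` are names I've invented for illustration; this is only a mental model of a pre-installed backup, not the actual CEF logic.

```python
class PathList:
    """Toy stand-in for a CEF entry holding a primary and a repair path."""

    def __init__(self, primary, repair=None):
        self.primary = primary
        self.repair = repair

    def on_primary_failure(self):
        """Promote the repair path; returns True if forwarding can continue."""
        if self.repair is None:
            self.primary = None  # nothing pre-installed: wait for BGP to converge
            return False
        self.primary, self.repair = self.repair, None
        return True


with_repair = PathList(primary="6.6.6.6", repair="7.7.7.7")
without_repair = PathList(primary="6.6.6.6")

kept_forwarding = with_repair.on_primary_failure()
blackholed = not without_repair.on_primary_failure()
```

With a repair path present the switch is a local swap; without one, the entry has nothing to forward on until the control plane catches up.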

I'm removing the R2 Add-Path config and bgp advertise-best-external on the PEs.

This is all fantastic with full-mesh iBGP - what if you have a huge topology and a route-reflector (or several) is more realistic? There's a big problem here, because like any BGP router, the route reflector will only choose its one best path to send to the other PEs. This makes multipathing impossible.

I've re-made R4 a route reflector, and removed all the redundant iBGP paths between the other PEs. Every PE is getting their routes via R4 now.

Clearly down to just one path now:

R2-PE#sh ip cef vrf VPN 9.9.9.9 det
9.9.9.9/32, epoch 1, flags rib defined all labels
  recursive via 6.6.6.6 label 22
    nexthop 192.168.24.4 GigabitEthernet1.24 label 23
    nexthop 192.168.25.5 GigabitEthernet1.25 label 18

As usual, we have multiple IGP paths, but those will both get pulled if we lose the BGP path.

Without going to full-on Add-Path across the network, our simplest answer is another route-reflector running diverse-path. I'm temporarily making R5 an additional route-reflector.

For brevity I'm not going to include all the config necessary to make R5 a route-reflector. However, the outcome on R2 looks like this:

R2-PE#sh bgp vpnv4 uni all 9.9.9.9
BGP routing table entry for 1:1:9.9.9.9/32, version 79
Paths: (2 available, best #2, table VPN)
  Advertised to update-groups:
     1
  Refresh Epoch 1
  300
    6.6.6.6 (metric 3) (via default) from 5.5.5.5 (5.5.5.5)
      Origin IGP, metric 0, localpref 100, valid, internal
      Extended Community: RT:1:1
      Originator: 6.6.6.6, Cluster list: 5.5.5.5
      mpls labels in/out nolabel/22
      rx pathid: 0, tx pathid: 0
  Refresh Epoch 1
  300
    6.6.6.6 (metric 3) (via default) from 4.4.4.4 (4.4.4.4)
      Origin IGP, metric 0, localpref 150, valid, internal, best
      Extended Community: RT:1:1
      Originator: 6.6.6.6, Cluster list: 4.4.4.4
      mpls labels in/out nolabel/22
      rx pathid: 0, tx pathid: 0x0

Hey, great, we've got two paths, we can just enable Add-Path on R2 and we're done, right?  

Not so fast.

The next-hop is 6.6.6.6 on both routes - in order for Add-Path to be viable, the backup path's next-hop must be different than the primary path's.

The solution, as I'd mentioned above, is to use Diverse-Path. Diverse-Path tells a BGP router to deliberately calculate the 2nd-best path that has a different next hop than the first-best path. Diverse-Path was a workaround before Add-Path was supported (or widely supported) in IOS. Only the route reflector running Diverse-Path needs to know about it, all the other routers are just following standard BGP rules.
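
The selection rule itself is easy to sketch in Python. This is a simplified illustration that ranks paths on local preference only (my assumption, not the full best-path algorithm), and `select_backup` is a hypothetical helper name.

```python
def select_backup(paths):
    """paths: list of (next_hop, local_pref) tuples.

    Returns (best, backup), where backup is the highest-ranked path whose
    next hop differs from the best path's, or None if there isn't one.
    """
    ranked = sorted(paths, key=lambda p: p[1], reverse=True)
    best = ranked[0]
    backup = next((p for p in ranked[1:] if p[0] != best[0]), None)
    return best, backup


# The route reflector hears about three different exits:
best, backup = select_backup([("7.7.7.7", 100), ("6.6.6.6", 150), ("8.8.8.8", 100)])

# If every path it knows points at the same next hop, there is no diverse path:
_, no_backup = select_backup([("6.6.6.6", 150), ("6.6.6.6", 100)])
```

In the first case the backup is 7.7.7.7; in the second there's nothing useful to advertise as a second path, which is exactly the trap described in the next few paragraphs.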

R5-RR(config)#router bgp 200
R5-RR(config-router)#address-family vpnv4
R5-RR(config-router-af)#bgp additional-paths select backup
R5-RR(config-router-af)#bgp additional-paths install
R5-RR(config-router-af)#neighbor 2.2.2.2 advertise diverse-path backup

Here, we tell R5 to calculate a backup path, and then we tell it to advertise it to R2 as if it were R5's bestpath (in production, you'd presumably want to send this to all route-reflector clients, not just one).

One more step is also required on R7 and R8 (I've done R6 as well to keep the config consistent) - right now, this topology suffers from the same problem we saw in the first advertise-best-external scenario. Consider:

1) R6 sends its bestpath (its external path) to R4 and R5. This prefix has a local pref of 150.
2) R7 sends its bestpath (its external path) to R4 and R5.
3) R8 sends its bestpath (its external path) to R4 and R5.
4) R5 starts calculating 2nd-best-path for R2
5) R7 learns about R6's bestpath from R4
6) R8 learns about R6's bestpath from R4
7) R7 withdraws its bestpath from R4 and R5 after learning R6's path is better
8) R8 withdraws its bestpath from R4 and R5 after learning R6's path is better
9) R5 calculates its only path to 9.9.9.9 via R6

Now we could put bgp advertise-best-external back in, but that would advertise the best external to both R4 and R5 and we'd have the same exact problem as above.

Per-neighbor best-external is the solution:
R6, R7 & R8:
router bgp 200
 address-family vpnv4
  neighbor 5.5.5.5 advertise best-external

This will advertise the "internal bestpath" (via R6, because of local preference) to R4, and the external bestpath to R5.

Now back to R2:
R2-PE#sh bgp vpnv4 un all 9.9.9.9
BGP routing table entry for 1:1:9.9.9.9/32, version 83
Paths: (2 available, best #2, table VPN)
  Advertised to update-groups:
     1
  Refresh Epoch 2
  300
    6.6.6.6 (metric 3) (via default) from 5.5.5.5 (5.5.5.5)
      Origin IGP, metric 0, localpref 100, valid, internal
      Extended Community: RT:1:1
      Originator: 6.6.6.6, Cluster list: 5.5.5.5
      mpls labels in/out nolabel/22
      rx pathid: 0, tx pathid: 0
  Refresh Epoch 1
  300
    6.6.6.6 (metric 3) (via default) from 4.4.4.4 (4.4.4.4)
      Origin IGP, metric 0, localpref 150, valid, internal, best
      Extended Community: RT:1:1
      Originator: 6.6.6.6, Cluster list: 4.4.4.4
      mpls labels in/out nolabel/22
      rx pathid: 0, tx pathid: 0x0

Now we've got two routes with two next-hops.

R2-PE#sh ip cef vrf VPN 9.9.9.9 detail
9.9.9.9/32, epoch 1, flags rib defined all labels
  recursive via 6.6.6.6 label 22
    nexthop 192.168.24.4 GigabitEthernet1.24 label 23
    nexthop 192.168.25.5 GigabitEthernet1.25 label 18

But we still need to enable the calculation of a backup route, otherwise PIC Edge won't work.

R2-PE(config)#router bgp 200
R2-PE(config-router)#address-family ipv4 vrf VPN
R2-PE(config-router-af)#bgp additional-paths select backup
R2-PE(config-router-af)#bgp additional-paths install

R2-PE#sh ip cef vrf VPN 9.9.9.9 det
9.9.9.9/32, epoch 1, flags rib defined all labels
  recursive via 6.6.6.6 label 22
    nexthop 192.168.24.4 GigabitEthernet1.24 label 23
    nexthop 192.168.25.5 GigabitEthernet1.25 label 18
  recursive via 7.7.7.7 label 26, repair
    nexthop 192.168.24.4 GigabitEthernet1.24 label 16
    nexthop 192.168.25.5 GigabitEthernet1.25 label 20

Now we've got a working solution!

And last but certainly not least, the gold standard of receiving two paths: Simply rework how BGP handles multiple paths by using Add-Path.

Sadly, as much as this technology seems like it's custom-built for VPNv4, if you can believe it, Add-Path isn't supported in VPNv4 on my OS:

R4#sh ver
Cisco IOS XE Software, Version 03.11.01.S - Standard Support Release
Cisco IOS Software, CSR1000V Software (X86_64_LINUX_IOSD-UNIVERSALK9-M), Version 15.4(1)S1, RELEASE SOFTWARE (fc2)
<output omitted>

In the IPv4 (default) Family:

R4(config)#router bgp 200
R4(config-router)#bgp additional-paths select ?
  all            Select all available paths
  backup         Select backup path
  best           Select best N paths
  best-external  Select best-external path
  group-best     Select group-best path

R4(config-router)#neighbor 2.2.2.2 advertise ?
  additional-paths  Advertise additional paths
  best-external     Advertise best-external (at RRs best-internal) path
  diverse-path      Advertise diverse path

Note the "all", "best" and "group-best" options under select, and the "additional-paths" option under advertise; that's what we're looking for in VPNv4:

R4(config-router)#address-family vpnv4
R4(config-router-af)#neighbor 2.2.2.2 advertise ?
  best-external  Advertise best-external (at RRs best-internal) path
  diverse-path   Advertise diverse path

R4(config-router-af)#bgp additional-paths select ?
  backup         Select backup path
  best-external  Select best-external path

Completely lacking.

On that note, we'll be reverting this design back to a non-MPLS scenario for the remainder of the blog.
I've also reverted R5 from being a route-reflector, it's now simply a client of R4. This was necessary to carry the IPv4 BGP table through R5.

Note R6 deliberately still has the best path via local-preference.

Here is a diagram of roughly what we're trying to achieve.



We'd like R6, R7 and R8 to all send (initially) one route to the RR. We'd like R4 to reflect back two paths for reaching 9.9.9.9 to everyone (technically speaking we'll also be reflecting two paths for 1.1.1.1 on CE1, but I chose not to focus on that).

This design suffers from the same problem the last several have. Everything will start out looking good until the route-reflector reflects the superior path from R6 to R7 and R8, and those two routers both pick R6 as their bestpath. After that they'll withdraw their routes from R4, and R4 will only have a single route to send to R2, R3, etc, etc, because every path will point to R6.

We can solve this with one of three methods:
- BGP Advertise-Best-External on R7 and R8 (optionally on R6)
- Per-neighbor advertise best-external
- Running two-path Add-Path on R7 and R8 in addition to R4.

The top two options I imagine are self-explanatory at this point as I covered them above, however, the final option is hopefully interesting to the reader, and therefore it's the method I will choose for this lab. What will happen if R4, R7 and R8 run add-path is as follows:

1) R6, R7 and R8 all advertise their own (connected/external) bestpath to R4
2) (Let's assume R7 had the 2nd-best path for this example) R4 reflects BOTH R6 and R7's bestpath to R2, R3, R5, R6, R7 and R8.
3) R2 and R3 install both paths in BGP and in the FIB.
4) R6 installs R7's path in the FIB as a repair route.
5) R7 and R8 both change their bestpath to R6 instead of their external route.
6) R7 and R8 both advertise back to the route reflector that R6 is their bestpath and R7 is their backup path.
... None of this changes the paths R4 has selected, so nothing shifts on the route reflector, and its clients (R2, R3, etc.) aren't modified either.

The key here is that while we still have the same problem of R7 and R8 preferring R6's external path, we're still advertising two paths to the route reflector: R6's (as best), and R7's as a backup.

Here is the relevant config:
R4:
router bgp 200
 bgp additional-paths select best 2
 bgp additional-paths send receive
 bgp additional-paths install
 neighbor 2.2.2.2 advertise additional-paths best 2
 neighbor 3.3.3.3 advertise additional-paths best 2
 neighbor 5.5.5.5 advertise additional-paths best 2
 neighbor 6.6.6.6 advertise additional-paths best 2
 neighbor 7.7.7.7 advertise additional-paths best 2
 neighbor 8.8.8.8 advertise additional-paths best 2

R2, R3, & R5:
router bgp 200
 bgp additional-paths select best 2
 bgp additional-paths receive
 bgp additional-paths install

R6 - R8:
router bgp 200
 bgp additional-paths select best 2
 bgp additional-paths send receive
 bgp additional-paths install
 neighbor 4.4.4.4 advertise additional-paths best 2

Remember not to use bgp additional-paths select backup - that command is for diverse-path or for local (non-advertised) selection of a backup route. You're trying to create a backup path, but that's still the wrong command.

So we used a few new commands here:
bgp additional-paths select best 2 - This calculates the best path and 2nd best path and flags them in BGP. This is a non-transitive flag, the neighbors aren't aware of what your flags are.

R4#sh ip bgp 9.9.9.9
BGP routing table entry for 9.9.9.9/32, version 5
Paths: (3 available, best #3, table default)
  Additional-path-install
  Path advertised to update-groups:
     19         20
  Refresh Epoch 1
  300, (Received from a RR-client)
    7.7.7.7 (metric 2) from 7.7.7.7 (7.7.7.7)
      Origin IGP, metric 0, localpref 100, valid, internal, backup/repair, best2
      rx pathid: 0x1, tx pathid: 0x2
  Path not advertised to any peer
  Refresh Epoch 1
  300, (Received from a RR-client)
    8.8.8.8 (metric 2) from 8.8.8.8 (8.8.8.8)
      Origin IGP, metric 0, localpref 100, valid, internal
      rx pathid: 0x1, tx pathid: 0
  Path advertised to update-groups:
     19         20
  Refresh Epoch 1
  300, (Received from a RR-client)
    6.6.6.6 (metric 2) from 6.6.6.6 (6.6.6.6)
      Origin IGP, metric 0, localpref 150, valid, internal, best
      rx pathid: 0x0, tx pathid: 0x0

You see we've flagged "best" and "best2".

bgp additional-paths send receive

Unlike all the fixes we've seen up until now, Add-Path is a negotiated feature. This is why there are so many workarounds for it - to get to full Add-Path you basically have to forklift upgrade your network. On that note, you need to tell your neighbors if you have send, receive, or both send & receive capability. This can be done globally, as we've done here, or per neighbor with:

R2(config-router)#neighbor 4.4.4.4 additional-paths ?
  disable  Disable additional paths for this neighbor
  receive  Receive additional paths from neighbors
  send     Send additional paths to this neighbor

Note per-neighbor settings override the global settings.

bgp additional-paths install

You can select additional-paths and pass them to neighbors without installing them in your RIB or FIB. This command should be on any device requiring PIC Edge, but if your route reflector isn't in the forwarding path, you may be able to omit it.

neighbor X.X.X.X advertise additional-paths best 2

Even if you've negotiated the Add-Path capability with your neighbor, you still need to tell the BGP process to advertise all of, or a subset of, your calculated best paths. The way it does this is via the tag system I described above. An important element of this is that the tagging system is not mutually exclusive. Let's say there are 4 paths with different next-hops. You could select "all" and "best 3", and the best 3 would be flagged with "best" and "all", and the 4th path would only be flagged with "all". We'll show an example of this below.
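
Here's a quick Python sketch of that overlapping tag idea. The tag names ("best", "best2", "best3", "all") follow the flags seen in the show output, but the `tag_paths` and `advertised` functions and the NH1-NH4 next hops are made up for illustration; this isn't a model of the actual IOS internals.

```python
def tag_paths(ranked_next_hops, select_all=False, best_n=0):
    """ranked_next_hops: next hops ordered best first. Returns {next_hop: tags}."""
    tags = {nh: set() for nh in ranked_next_hops}
    for index, nh in enumerate(ranked_next_hops):
        if select_all:
            tags[nh].add("all")
        if index < best_n:
            tags[nh].add("best" if index == 0 else f"best{index + 1}")
    return tags


def advertised(tags, wanted):
    """Return the paths carrying at least one of the wanted tags, best first."""
    return [path for path, path_tags in tags.items() if path_tags & wanted]


# Four paths, with both "all" and "best 3" selected.
tags = tag_paths(["NH1", "NH2", "NH3", "NH4"], select_all=True, best_n=3)

to_best2_neighbor = advertised(tags, {"best", "best2"})
to_all_neighbor = advertised(tags, {"all"})
```

The first three paths end up carrying two tags each, the fourth carries only "all", and each neighbor's advertise statement just filters on the tags it asks for.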

Let's see the output from this.

R4#sh ip bgp | b Network
     Network          Next Hop            Metric LocPrf Weight Path
 *>i 1.1.1.1/32       2.2.2.2                  0    100      0 100 i
 *bia                 3.3.3.3                  0    100      0 100 i
 *bia9.9.9.9/32       7.7.7.7                  0    100      0 300 i
 * i                  8.8.8.8                  0    100      0 300 i
 *>i                  6.6.6.6                  0    150      0 300 i

We see two paths for 1.1.1.1, and three paths for 9.9.9.9.
Two are flagged with "b" for backup - this is a side-effect of using the bgp additional-paths install command.
"a" is the flag for additional-paths.
You'd need to do a sh ip bgp 9.9.9.9 to see the "best", "best2", etc. flags, which I am omitting for brevity - there's already a sample further above.

R4#sh ip cef 9.9.9.9 det
9.9.9.9/32, epoch 2, flags rib only nolabel, rib defined all labels
  recursive via 6.6.6.6
    nexthop 192.168.46.6 GigabitEthernet1.46
  recursive via 7.7.7.7, repair
    nexthop 192.168.47.6 GigabitEthernet1.47

We can see the repair path in the FIB.

On R2:

R2(config-router)#do sh ip bgp 9.9.9.9
BGP routing table entry for 9.9.9.9/32, version 3
Paths: (2 available, best #2, table default)
  Additional-path-install
  Path not advertised to any peer
  Refresh Epoch 1
  300
    7.7.7.7 (metric 3) from 4.4.4.4 (4.4.4.4)
      Origin IGP, metric 0, localpref 100, valid, internal, backup/repair, best2
      Originator: 7.7.7.7, Cluster list: 4.4.4.4
      rx pathid: 0x2, tx pathid: 0x1
  Path advertised to update-groups:
     29
  Refresh Epoch 1
  300
    6.6.6.6 (metric 3) from 4.4.4.4 (4.4.4.4)
      Origin IGP, metric 0, localpref 150, valid, internal, best
      Originator: 6.6.6.6, Cluster list: 4.4.4.4
      rx pathid: 0x0, tx pathid: 0x0

We see a best and best2 flag. It's important to note again that these flags are not learned from the route reflector; they're decided locally, set by bgp additional-paths select best 2 on R2 itself. As mentioned above, I decided to use Add-Path from the edge BGP devices back towards the route reflector to avoid the problem of the single best path replacing all the secondaries during convergence.

Another important note is the pathid. Add-Path's trickery doesn't fundamentally change BGP - it still only passes one best path per unique route - it just makes each additional path a unique route by attaching a pathid to it. Note the pathids of 0x0 and 0x1 above. Think of these as similar to Route Distinguishers in VPNv4, which make two otherwise identical routes unique.
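
If the Route Distinguisher comparison doesn't click, here's the same idea as a tiny Python sketch. It's a deliberately simplified model, not a BGP implementation: the dictionaries stand in for a received-routes table, and the pathid values are loosely based on the rx pathids in R2's output above.

```python
prefix = "9.9.9.9/32"
updates = [
    # (path_id, next_hop)
    (0x0, "6.6.6.6"),
    (0x2, "7.7.7.7"),
]

# Classic BGP: the prefix alone is the key, so the later update
# implicitly replaces the earlier one.
classic = {}
for _path_id, next_hop in updates:
    classic[prefix] = next_hop

# Add-Path: (prefix, path_id) is the key, so both paths coexist.
add_path = {}
for path_id, next_hop in updates:
    add_path[(prefix, path_id)] = next_hop

print(len(classic))   # 1
print(len(add_path))  # 2
```

Same prefix, same updates; the only difference is what counts as the key.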

R2#sh ip cef 9.9.9.9 det
9.9.9.9/32, epoch 2, flags rib only nolabel, rib defined all labels
  recursive via 6.6.6.6
    nexthop 192.168.24.4 GigabitEthernet1.24 label 23
    nexthop 192.168.25.5 GigabitEthernet1.25 label 18
  recursive via 7.7.7.7, repair
    nexthop 192.168.24.4 GigabitEthernet1.24 label 16
    nexthop 192.168.25.5 GigabitEthernet1.25 label 20

And there's PIC Edge and Add-Path in action on R2.

I'm going to quickly cover the rest of the simpler Add-Path options.
Just to recap, the route-reflector has chosen two best paths so far:

R4#sh ip bgp 9.9.9.9 | s from
  300, (Received from a RR-client)
    7.7.7.7 (metric 2) from 7.7.7.7 (7.7.7.7)
      Origin IGP, metric 0, localpref 100, valid, internal, backup/repair, best2
      rx pathid: 0x1, tx pathid: 0x2
  300, (Received from a RR-client)
    8.8.8.8 (metric 2) from 8.8.8.8 (8.8.8.8)
      Origin IGP, metric 0, localpref 100, valid, internal
      rx pathid: 0x1, tx pathid: 0
  300, (Received from a RR-client)
    6.6.6.6 (metric 2) from 6.6.6.6 (6.6.6.6)
      Origin IGP, metric 0, localpref 150, valid, internal, best
      rx pathid: 0x0, tx pathid: 0x0

router bgp 200
 bgp additional-paths select best 3

R4#sh ip bgp 9.9.9.9 | s from
  300, (Received from a RR-client)
    7.7.7.7 (metric 2) from 7.7.7.7 (7.7.7.7)
      Origin IGP, metric 0, localpref 100, valid, internal, backup/repair, best2
      rx pathid: 0x1, tx pathid: 0x2
  300, (Received from a RR-client)
    8.8.8.8 (metric 2) from 8.8.8.8 (8.8.8.8)
      Origin IGP, metric 0, localpref 100, valid, internal, best3
      rx pathid: 0x1, tx pathid: 0x1
  300, (Received from a RR-client)
    6.6.6.6 (metric 2) from 6.6.6.6 (6.6.6.6)
      Origin IGP, metric 0, localpref 150, valid, internal, best
      rx pathid: 0x0, tx pathid: 0x0

Note we've added a pathid and "best3" to the remaining path.  We'd be able to send those to neighbors if we wanted. With this config we're choosing 3 but sending 2.

I found this option confusing initially:

R4(config-router)#no neighbor 2.2.2.2 advertise additional-paths best 2
R4(config-router)#neighbor 2.2.2.2 advertise additional-paths all
 % BGP: AF level 'bgp additional-paths select' more restrictive than advertising policy. This is a reminder that AF level additional-path select commands are needed.

The way I originally read this was: I've selected 3 best paths, and I want to send all 3 of them to my neighbor. This is incorrect. Remember this is a flag system, and "all" is a flag. None of our BGP paths are flagged with "all", so we just broke Add-Path:

R4(config-router-af)#do sh ip bgp neigh 2.2.2.2 adv | b Network
     Network          Next Hop            Metric LocPrf Weight Path
 *>i 9.9.9.9/32       6.6.6.6                  0    150      0 300 i
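
Here's a rough Python model of why that happened. It's a sketch under my own assumptions, not IOS internals: the function names are made up, the tags are copied from R4's "select best 3" output above, and I'm assuming the primary best path is always advertised regardless of policy, which is what the output shows.

```python
def wanted_tags(policy):
    """Translate an advertise policy into the tags it matches.

    policy is "all" or ("best", n); "best 2" matches the tags
    best and best2, and so on.
    """
    if policy == "all":
        return {"all"}
    _, n = policy
    return {"best"} | {f"best{i}" for i in range(2, n + 1)}


def advertised(tags_by_next_hop, policy):
    """The primary best path always goes out; others need a matching tag."""
    wanted = wanted_tags(policy)
    return [nh for nh, tags in tags_by_next_hop.items()
            if "best" in tags or tags & wanted]


# Tags as shown on R4 after 'select best 3' (no 'select all' yet).
r4_tags = {"7.7.7.7": {"best2"}, "8.8.8.8": {"best3"}, "6.6.6.6": {"best"}}

print(advertised(r4_tags, ("best", 2)))  # ['7.7.7.7', '6.6.6.6']
print(advertised(r4_tags, "all"))        # ['6.6.6.6']
```

With "advertise all" and nothing tagged "all", only the ordinary best path survives, which matches what R2 is now receiving.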

Let's fix it.
All is meant to simulate full-mesh iBGP with a route-reflector - if all routers use it, you'll get a similar outcome to all the routers being peered together.

R4(config-router)#bgp additional-paths select all

R4(config-router)#do sh ip bgp 9.9.9.9 | s from
  300, (Received from a RR-client)
    7.7.7.7 (metric 2) from 7.7.7.7 (7.7.7.7)
      Origin IGP, metric 0, localpref 100, valid, internal, backup/repair, best2, all
      rx pathid: 0x1, tx pathid: 0x1
  300, (Received from a RR-client)
    8.8.8.8 (metric 2) from 8.8.8.8 (8.8.8.8)
      Origin IGP, metric 0, localpref 100, valid, internal, best3, all
      rx pathid: 0x1, tx pathid: 0x2
  300, (Received from a RR-client)
    6.6.6.6 (metric 2) from 6.6.6.6 (6.6.6.6)
      Origin IGP, metric 0, localpref 150, valid, internal, best
      rx pathid: 0x0, tx pathid: 0x0

OK, now we're flagged with both All and Best simultaneously. As mentioned above, the select system is not mutually exclusive:

R4#sh run | i select
 bgp additional-paths select all best 3

R2#sh ip bgp | b 9.9.9.9
 *bia9.9.9.9/32       7.7.7.7                  0    100      0 300 i
 * i                  8.8.8.8                  0    100      0 300 i
 *>i                  6.6.6.6                  0    150      0 300 i

There are a few options you can potentially pick under "select":

R4(config-router)#bgp additional-paths select ?
  all            Select all available paths 
  backup         Select backup path 
  best           Select best N paths 
  best-external  Select best-external path
  group-best     Select group-best path 

All: we just covered this.
Backup: this is for diverse-path.
Best: we've covered this as well.
Best-External: a feature that permits best-external selection on a route reflector. The use case for this is complicated and is out of scope for this document.
Group-Best: also very complicated.

Let's discuss group-best at a very high level.

BGP, under normal circumstances, can end up in a scenario where it never converges - it never stabilizes. This is called BGP MED oscillation. Explaining this is beyond the scope of this document; however, this blog covers it well: http://ccieblog.co.uk/bgp/bgp-deterministic-med

BGP deterministic MED can solve this problem.

However, this problem gets additionally complex with Add-Path. Group-Best solves this by selecting the best path from each neighboring AS.

Route-Maps can additionally be used with Add-Path.

R3(config-route-map)#match additional-paths advertise-set ?
  all         BGP Add-Path advertise all paths
  best        BGP Add-Path advertise best n paths
  best-range  BGP Add-Path advertise best paths (range m to n)
  group-best  BGP Add-Path advertise group-best path

The two use cases I've seen for the route maps are:
- Setting the egress MED
- Selecting specific routes with the "best" flag to advertise

For example, if you wanted to only advertise the 1st best and 3rd best routes:

R4:
route-map block2ndbest deny 10
 ! matches the "range" of best 2 through best 2, i.e. only the 2nd best path
 match additional-paths advertise-set best-range 2 2
route-map block2ndbest permit 20

Before:

R2#sh ip bgp | b 9.9.9.9
 *bia9.9.9.9/32       7.7.7.7                  0    100      0 300 i
 * i                  8.8.8.8                  0    100      0 300 i
 *>i                  6.6.6.6                  0    150      0 300 i

R4(config)#router bgp 200
R4(config-router)#neighbor 2.2.2.2 route-map block2ndbest out
R4(config-router)#do clear ip bgp * soft out

After:

R2#sh ip bgp | b 9.9.9.9
 *bia9.9.9.9/32       8.8.8.8                  0    100      0 300 i
 *>i                  6.6.6.6                  0    150      0 300 i
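
The effect of that route-map can be sketched in Python as well. Again, this is just a model: the function name is mine, and the ranking (6.6.6.6 as best, 7.7.7.7 as best2, 8.8.8.8 as best3) is taken from R4's earlier output.

```python
def route_map_block2ndbest(rank):
    """Mimic the two route-map entries: deny rank 2, permit the rest."""
    low, high = 2, 2                # best-range 2 2
    if low <= rank <= high:
        return False                # deny 10
    return True                     # permit 20


# R4's ranking for 9.9.9.9/32: best, best2, best3.
ranked = ["6.6.6.6", "7.7.7.7", "8.8.8.8"]

sent_to_r2 = [nh for rank, nh in enumerate(ranked, start=1)
              if route_map_block2ndbest(rank)]
print(sent_to_r2)  # ['6.6.6.6', '8.8.8.8']
```

Widening the range to 2 3 would block both secondaries, leaving R2 with only the single best path again.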

As I mentioned, the MED can be modified on a per-bestpath basis as well, but only from edge BGP device -> RR or edge BGP device -> edge BGP device. Route reflectors are not permitted to set MED.

Hope you enjoyed,

Jeff