Saturday, October 4, 2014

[mini] Fail-Over Policy Based Routing

Playing with PBR recently I came across what I thought was an odd usage - two set commands in the same statement.

i.e.

route-map PBR permit 10
  match ip address to-be-matched
  set ip next-hop 192.168.0.1
  set ip default next-hop 192.168.1.1

This is a bit odd to look at until you break it down.

Turns out there's an order of operations to PBR set statements.

From the Cisco documentation:

1. set ip next-hop
2. set interface
3. set ip default next-hop
4. set default interface

This means set ip next-hop will be attempted prior to, say, set interface. If it fails, then the next statement will be evaluated.
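
To make the ordering concrete, here's a minimal Python sketch (a toy model, not anything IOS runs; the function name evaluation_order is made up for illustration). It takes the set clauses in whatever order they were typed and returns them in the order listed above:

```python
# Evaluation order of PBR set clauses as listed above (1 is tried first).
SET_PRECEDENCE = {
    "set ip next-hop": 1,
    "set interface": 2,
    "set ip default next-hop": 3,
    "set default interface": 4,
}


def evaluation_order(configured_clauses):
    """Sort set clauses into the order the list above says they are tried."""
    def rank(clause):
        for keyword, position in SET_PRECEDENCE.items():
            if clause.startswith(keyword + " "):
                return position
        raise ValueError(f"unknown set clause: {clause}")

    return sorted(configured_clauses, key=rank)


if __name__ == "__main__":
    typed = [
        "set ip default next-hop 192.168.1.1",
        "set ip next-hop 192.168.0.1",
    ]
    for clause in evaluation_order(typed):
        print(clause)
```

The point of the sketch is only that the order you type the clauses in doesn't matter; the type of set clause decides when it is tried.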

When I saw that, the first place my brain went to was, why not create two route-map elements to fix this?

(please note it's hard to air your dirty laundry on the Internet. Yes, this seemed dumb after I tested it)

route-map PBR permit 10
  match ip address to-be-matched
  set ip next-hop 192.168.0.1
route-map PBR permit 20
  match ip address to-be-matched
  set ip default next-hop 192.168.1.1

My thought process here was that if statement 10 failed to apply the set statement, then it would move on to statement 20. This is, of course, not true. Just like an ACL, a route-map stops evaluating future statements as soon as it has a match. So in the above config, using the same ACL (or even two ACLs that both matched the same traffic in different ways), statement 10 is always matched, and if it fails, traffic is just normally routed.
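
Here's a rough Python model of that first-match behaviour. It's a sketch under simplifying assumptions: each entry has exactly one ACL and one set clause, and the names route_map_decision, packet_matches and set_usable are hypothetical.

```python
def route_map_decision(entries, packet_matches, set_usable):
    """Model first-match route-map evaluation for PBR.

    entries: list of (sequence_number, acl_name, set_clause)
    packet_matches: set of ACL names the packet matches
    set_usable: set of set clauses whose next hop or interface is usable
    """
    for seq, acl, set_clause in sorted(entries):
        if acl in packet_matches:
            # First match wins: later entries are never consulted,
            # even if this entry's set clause cannot be applied.
            if set_clause in set_usable:
                return f"entry {seq}: {set_clause}"
            return f"entry {seq}: normal routing"
    return "no match: normal routing"


if __name__ == "__main__":
    entries = [
        (10, "to-be-matched", "set ip next-hop 192.168.0.1"),
        (20, "to-be-matched", "set ip default next-hop 192.168.1.1"),
    ]
    # The next hop in entry 10 is down; entry 20 would have been usable.
    usable = {"set ip default next-hop 192.168.1.1"}
    print(route_map_decision(entries, {"to-be-matched"}, usable))
```

Even though entry 20's set clause is usable, the model never reaches it, which is exactly the trap I fell into.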

So there is a reason, albeit in niche cases, to put "fail-over" statements into the route-map. The CCIE lab is basically all about niche cases (less lovingly called "stupid router tricks" by most of us), so this seemed worth exploring.

Here's our topology:


It's a little complex but I wanted to show a lot of different possibilities in one route-map statement.

R1's loopback0 (1.1.1.1/32) will be our source, travelling towards R9's loopback0 (9.9.9.9). Segments are IPed as 192.168.XY.Z/24, where XY is the lower and higher router number on the segment, and Z is the local router number. Example: The serial segment between R2 and R4 is 192.168.24.0/24, with R2's interface being 192.168.24.2 and R4 being 192.168.24.4.
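
If you want to sanity-check an address against that scheme, a tiny Python helper does it. This is just an illustration: the name segment_address is made up, and it assumes single-digit router numbers as in this topology.

```python
def segment_address(router_a, router_b, local_router):
    """Return the interface address for local_router on the a-b segment."""
    if local_router not in (router_a, router_b):
        raise ValueError("local router must be one end of the segment")
    # XY is the lower router number followed by the higher one.
    low, high = sorted((router_a, router_b))
    return f"192.168.{low}{high}.{local_router}"


if __name__ == "__main__":
    print(segment_address(2, 4, 2))  # R2's side of the R2-R4 link
    print(segment_address(2, 4, 4))  # R4's side of the R2-R4 link
```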

EIGRP is advertising every IP in the topology. However, R5, R6 and R7 are summarizing all routes behind them to a default route towards R2.
R2 has an offset list towards R4 to make paths through it less desirable:

R2:
router eigrp 100
 network 0.0.0.0
 offset-list 0 in 50 Serial4/1

The net result of this is that traffic will be sent from R2 through R3 towards R9 unless PBR is involved.

Here's R2's PBR config:

ip access-list extended match
 permit ip host 1.1.1.1 host 9.9.9.9

route-map PBR permit 10
 match ip address match
 set ip next-hop 192.168.24.4
 set interface Serial4/2
 set ip default next-hop 192.168.26.6
 set default interface Serial4/4

interface FastEthernet1/0
 ip policy route-map PBR

This will match traffic from 1.1.1.1 towards 9.9.9.9. The first step is to attempt to send the traffic towards R4.

So to be clear, non-PBR traffic will go through R3:

R2#sh ip cef 9.9.9.9
9.9.9.9/32
  nexthop 192.168.23.3 Serial4/0

R1#trace 9.9.9.9 source fa1/0  ! Not from 1.1.1.1
Type escape sequence to abort.
Tracing the route to 9.9.9.9
VRF info: (vrf in name/id, vrf out name/id)
  1 192.168.12.2 44 msec 88 msec 48 msec
  2 192.168.23.3 112 msec 40 msec 68 msec
  3 192.168.39.9 116 msec 116 msec 160 msec

Now let's try our PBR match:

R1#trace 9.9.9.9 source Loopback0
Type escape sequence to abort.
Tracing the route to 9.9.9.9
VRF info: (vrf in name/id, vrf out name/id)
  1 192.168.12.2 108 msec 48 msec 72 msec
  2 192.168.24.4 96 msec 44 msec 76 msec
  3 192.168.49.9 136 msec 108 msec 108 msec

As expected, it went R2 -> R4 -> R9.

What if the link between R2 and R4 went down?

R2(config)#int s4/1
R2(config-if)#shut

Referencing back to our route-map...

route-map PBR permit 10
 match ip address match
 set ip next-hop 192.168.24.4  (now unavailable)
 set interface Serial4/2 
 set ip default next-hop 192.168.26.6
 set default interface Serial4/4

We'd expect PBR to send the traffic through R5 next, via Serial4/2. One very important note: these are point-to-point serial interfaces. Ethernet is not a good choice for set interface. The problem should be obvious to any CCIE candidate: we'd be relying on the far side to proxy ARP for 9.9.9.9, which it will do in our design, but also because of our design, IOS will typically reject the ARP change as "wrong interface".

In short, the safe answer is to use serial (P2P) interfaces.

Also to point out again that according to the routing table, R2 should send traffic towards 9.9.9.9 via R3:

R2(config-if)#do sh ip cef 9.9.9.9
9.9.9.9/32
  nexthop 192.168.23.3 Serial4/0

From R1's traceroute, we see that traffic does go through R5:

R1#trace 9.9.9.9 source Loopback0
Type escape sequence to abort.
Tracing the route to 9.9.9.9
VRF info: (vrf in name/id, vrf out name/id)
  1 192.168.12.2 24 msec 116 msec 44 msec
  2 192.168.25.5 92 msec 48 msec 112 msec
  3 192.168.58.8 96 msec 128 msec 96 msec
  4 192.168.89.9 152 msec 108 msec 96 msec

We've successfully failed the first set statement and moved on to the 2nd one. Now let's take down the link to R5 as well:

R2(config-if)#int s4/2
R2(config-if)#shut

route-map PBR permit 10
 match ip address match
 set ip next-hop 192.168.24.4  (now unavailable)
 set interface Serial4/2  (now unavailable)
 set ip default next-hop 192.168.26.6
 set default interface Serial4/4

The set [ip] default commands only trigger when the routing table has no specific route to the destination, i.e. when the destination would otherwise be reached via a default route (or not at all).
These won't work for us yet because....

R2#sh ip route 9.9.9.9
Routing entry for 9.9.9.9/32
  Known via "eigrp 100", distance 90, metric 2300416, type internal
[output omitted]

We have a specific route.
We see our PBR is doing nothing now:

R1#trace 9.9.9.9 source Loopback0
Type escape sequence to abort.
Tracing the route to 9.9.9.9
VRF info: (vrf in name/id, vrf out name/id)
  1 192.168.12.2 76 msec 36 msec 68 msec
  2 192.168.23.3 116 msec 52 msec 60 msec (through R3)
  3 192.168.39.9 84 msec 104 msec 100 msec

I haven't got a smooth answer for this, so let's just make R3 send a default as well.  Note I've increased the delay between R5, R6, R7 and R8, so that R3 will still be preferred even with just a default being sent.

R3(config)#int s2/0
R3(config-if)#ip summary-address eigrp 100 0.0.0.0 0.0.0.0

R2#sh ip route 9.9.9.9 long
[output omitted]
Gateway of last resort is 192.168.23.3 to network 0.0.0.0

So now R2 only reaches 9.9.9.9 via a default route (towards R3).  Let's see our PBR kick in again.

R1#trace 9.9.9.9 source Loopback0
Type escape sequence to abort.
Tracing the route to 9.9.9.9
VRF info: (vrf in name/id, vrf out name/id)
  1 192.168.12.2 40 msec 88 msec 40 msec
  2 192.168.26.6 88 msec 108 msec 48 msec
  3 192.168.68.8 128 msec 140 msec 124 msec
  4 192.168.89.9 92 msec 116 msec 112 msec

Through R6!

And if the link to R6 is down?

R2(config)#int s4/3
R2(config-if)#shut

route-map PBR permit 10
 match ip address match
 set ip next-hop 192.168.24.4  (now unavailable)
 set interface Serial4/2  (now unavailable)
 set ip default next-hop 192.168.26.6 (now unavailable)
 set default interface Serial4/4

R1#trace 9.9.9.9 source Loopback0
Type escape sequence to abort.
Tracing the route to 9.9.9.9
VRF info: (vrf in name/id, vrf out name/id)
  1 192.168.12.2 80 msec 60 msec 88 msec
  2 192.168.27.7 112 msec 44 msec 88 msec
  3 192.168.78.8 96 msec 92 msec 100 msec
  4 192.168.89.9 88 msec 128 msec 132 msec

Through R7.
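
To summarize the whole walk-through, here's a small Python model of how R2's route-map entry behaved in each test. It's a sketch of what I observed above, not IOS's actual implementation, and the function and parameter names are made up.

```python
def pbr_forwarding(next_hop_ok, set_intf_up, explicit_route,
                   default_next_hop_ok, default_intf_up):
    """Return which clause of R2's route-map forwards the matched packet."""
    if next_hop_ok:
        return "set ip next-hop 192.168.24.4"
    if set_intf_up:
        return "set interface Serial4/2"
    if explicit_route:
        # The "default" clauses are skipped while the routing table holds
        # a specific (non-default) route for the destination.
        return "normal routing"
    if default_next_hop_ok:
        return "set ip default next-hop 192.168.26.6"
    if default_intf_up:
        return "set default interface Serial4/4"
    return "normal routing"


if __name__ == "__main__":
    # Links to R4 and R5 down, only a default route known, R6 still up.
    print(pbr_forwarding(False, False, False, True, True))
```

Each test in this post maps to one combination of arguments: flip one flag at a time and you walk down the list exactly as the traceroutes did.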

That wraps up my main point, but while we've got this setup, let's look at recursive PBR too.

I'm no-shutting all the interfaces we turned down earlier, and re-advertising the specific route from R3.

I've also added a leak-map on R5, R6 and R7 to allow R8's Lo0 (8.8.8.8) through in addition to the default route. Additionally, I made 8.8.8.8 less preferred through R3 and R4.

So to be clear, 9.9.9.9 is now reachable via R3, and 8.8.8.8 is reachable via equal-cost load sharing on R5, R6 and R7:

R2#sh ip cef 9.9.9.9
9.9.9.9/32
  nexthop 192.168.23.3 Serial4/0

R2#sh ip cef 8.8.8.8
8.8.8.8/32
  nexthop 192.168.25.5 Serial4/2
  nexthop 192.168.26.6 Serial4/3
  nexthop 192.168.27.7 Serial4/4

Recursive PBR allows for ECMP (equal cost multipathing) and PBR to mix. In short, pre-PBR, the path to 9.9.9.9 is via R3. Post PBR, we'll target having "8.8.8.8" as the next hop - which will ECMP through R5, R6 and R7.

In our environment, however, this is a bit hard to see, because per-destination CEF ECMP won't show up on our traceroute. Let's change to per-packet:

R2(config-route-map)#int s4/2
R2(config-if)#ip load-sharing per-packet
R2(config-if)#int s4/3
R2(config-if)#ip load-sharing per-packet
R2(config-if)#int s4/4
R2(config-if)#ip load-sharing per-packet
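
If it's not obvious why the load-sharing mode changes what traceroute shows, here's a toy Python comparison. Big assumption: the CRC32 hash below is only a stand-in I picked for illustration, it is not CEF's real hashing algorithm, and the function names are made up.

```python
import itertools
import zlib

PATHS = ["192.168.25.5", "192.168.26.6", "192.168.27.7"]


def per_destination_path(src, dst, paths=PATHS):
    """Pick one path per source/destination pair (stand-in hash, not CEF's)."""
    key = f"{src}->{dst}".encode()
    return paths[zlib.crc32(key) % len(paths)]


def per_packet_paths(paths=PATHS):
    """Return an iterator that hands out paths round-robin, one per packet."""
    return itertools.cycle(paths)


if __name__ == "__main__":
    # Three traceroute probes for the same flow, per-destination mode.
    print([per_destination_path("1.1.1.1", "9.9.9.9") for _ in range(3)])
    # Three probes in per-packet mode.
    rr = per_packet_paths()
    print([next(rr) for _ in range(3)])
```

In per-destination mode every probe for the same source/destination pair lands on the same next hop, so traceroute only ever shows one path. In per-packet mode consecutive probes rotate through all three.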

And let's re-write our route-map for recursion:

R2(config)#no route-map PBR permit 10
R2(config)#route-map PBR permit 10
R2(config-route-map)#match ip address match
R2(config-route-map)#set ip next-hop recursive 8.8.8.8

And test:

R1#trace 9.9.9.9 source Loopback0
Type escape sequence to abort.
Tracing the route to 9.9.9.9
VRF info: (vrf in name/id, vrf out name/id)
  1 192.168.12.2 28 msec *  44 msec
  2 192.168.25.5 72 msec <-- Indicates ECMP
    192.168.26.6 76 msec <-- Indicates ECMP
    192.168.27.7 60 msec <-- Indicates ECMP
  3 192.168.58.8 132 msec
    192.168.68.8 80 msec
    192.168.78.8 64 msec
  4 192.168.89.9 76 msec 76 msec 64 msec

Hope you enjoyed!

Jeff

Sunday, September 7, 2014

CCIE v4 to v5: BGP NHT, SAT, FSD, Dynamic Neighbors, Multisession Transport Per AF

BGP Next Hop Tracking (NHT) is an on-by-default feature that notifies BGP of a change in routing for BGP prefix next-hops. This is important because previously this only happened as part of the BGP Scanner process, which runs every 60 seconds by default. Waiting up to 60 seconds to determine that your BGP route is effectively no longer valid (because of an invalid next-hop) significantly hampers reconvergence. Instead of being timer-based, NHT makes the process of dealing with next-hop changes event-driven.



EIGRP is peered on all routers on the 192.168.124.0/24 link.

Here's the relevant base BGP config:

R1:
router bgp 1
 bgp log-neighbor-changes
 neighbor 3.3.3.3 remote-as 3
 neighbor 3.3.3.3 ebgp-multihop 255
 neighbor 3.3.3.3 update-source Loopback0
 neighbor 4.4.4.4 remote-as 4
 neighbor 4.4.4.4 ebgp-multihop 255
 neighbor 4.4.4.4 update-source Loopback0

R3:
router bgp 3
 bgp log-neighbor-changes
 neighbor 1.1.1.1 remote-as 1
 neighbor 1.1.1.1 ebgp-multihop 255
 neighbor 1.1.1.1 update-source Loopback0
 neighbor 192.168.34.4 remote-as 4

R4:
interface Loopback1
 ip address 44.44.44.44 255.255.255.255

router bgp 4
 bgp log-neighbor-changes
 network 44.44.44.44 mask 255.255.255.255
 neighbor 1.1.1.1 remote-as 1
 neighbor 1.1.1.1 ebgp-multihop 255
 neighbor 1.1.1.1 update-source Loopback0
 neighbor 192.168.34.3 remote-as 3

In short, we're using ebgp multihop in order to keep my mock-up smaller. We have two paths from R1 to R4's 44.44.44.44:

R1 -> R4's 4.4.4.4 (and consequently to 44.44.44.44 in the same hop)
R1 -> R3's 3.3.3.3, then R3 to R4's 192.168.34.4 

The first route has one AS in its AS-PATH; the 2nd route has two ASes and is less preferred.

R1#sh ip bgp 44.44.44.44 bestpath
BGP routing table entry for 44.44.44.44/32, version 11
Paths: (2 available, best #1, table default)
  Advertised to update-groups:
     2
  Refresh Epoch 2
  4
    4.4.4.4 (metric 10880) from 4.4.4.4 (44.44.44.44)
      Origin IGP, metric 0, localpref 100, valid, external, best
      rx pathid: 0, tx pathid: 0x0

Let's try this experiment without NHT enabled first:

R1(config)#router bgp 1
R1(config-router)# no bgp nexthop trigger enable

R1#debug ip routing
IP routing debugging is on

R4(config-if)#int lo0  ! this is the 4.4.4.4 interface (the next-hop for 44.44.44.44 from R1)
R4(config-if)#shut

Debug from R1 below
===============
*Sep 17 22:59:03.552: RT: delete route to 4.4.4.4 via 192.168.124.4, eigrp metric [90/10880]
*Sep 17 22:59:03.552: RT: no routes to 4.4.4.4, delayed flush
*Sep 17 22:59:03.552: RT: delete subnet route to 4.4.4.4/32
*Sep 17 22:59:03.552: RT: updating eigrp 4.4.4.4/32 (0x0)  :
    via 192.168.124.4 Gi1.124  0 1048578

*Sep 17 22:59:03.552: RT: rib update return code: 5
================

This happened as fast as EIGRP converged - very quickly.  So we know 4.4.4.4 isn't a valid route any longer, but what about 44.44.44.44?

R1#sh ip bgp 44.44.44.44 bestpath
BGP routing table entry for 44.44.44.44/32, version 11
Paths: (2 available, best #1, table default)
  Advertised to update-groups:
     2
  Refresh Epoch 2
  4
    4.4.4.4 (metric 10880) from 4.4.4.4 (44.44.44.44)
      Origin IGP, metric 0, localpref 100, valid, external, best
      rx pathid: 0, tx pathid: 0x0

R1#ping 44.44.44.44
Type escape sequence to abort.
Sending 5, 100-byte ICMP Echos to 44.44.44.44, timeout is 2 seconds:
.....
Success rate is 0 percent (0/5)

Still thinking the next-hop is 4.4.4.4, and it's Very Down.

I didn't time it this way specifically, but remember the scan timer runs every 60 seconds. So 51 seconds after we yanked the 4.4.4.4 next-hop, BGP finally figured out something was up and reconverged to the alternate path for 44.44.44.44 via R3.

*Sep 17 22:59:54.031: RT: updating bgp 44.44.44.44/32 (0x0)  :
    via 3.3.3.3   0 1048577

*Sep 17 22:59:54.031: RT: closer admin distance for 44.44.44.44, flushing 1 routes
*Sep 17 22:59:54.031: RT: add 44.44.44.44/32 via 3.3.3.3, bgp metric [20/0]

R1#ping 44.44.44.44
Type escape sequence to abort.
Sending 5, 100-byte ICMP Echos to 44.44.44.44, timeout is 2 seconds:
!!!!!
Success rate is 100 percent (5/5), round-trip min/avg/max = 1/1/3 ms

R1#trace 44.44.44.44
Type escape sequence to abort.
Tracing the route to 44.44.44.44
VRF info: (vrf in name/id, vrf out name/id)
  1 192.168.124.3 4 msec 1 msec 0 msec
  2 192.168.34.4 2 msec *  2 msec

A 51 second reconverge in a modern network is pretty awful.

R4(config-if)#int lo0
R4(config-if)#no shut

Let's re-add the next-hop trigger and try again.

R1(config-router)#router bgp 1
R1(config-router)#bgp nexthop trigger enable

R4(config-if)#int lo0
R4(config-if)#shut

Debug from R1 below
===============
*Sep 17 23:11:53.582: RT: delete route to 4.4.4.4 via 192.168.124.4, eigrp metric [90/10880]
*Sep 17 23:11:53.582: RT: no routes to 4.4.4.4, delayed flush
*Sep 17 23:11:53.582: RT: delete subnet route to 4.4.4.4/32
*Sep 17 23:11:53.582: RT: updating eigrp 4.4.4.4/32 (0x0)  :
    via 192.168.124.4 Gi1.124  0 1048578

*Sep 17 23:11:53.582: RT: rib update return code: 5
*Sep 17 23:11:58.582: RT: updating bgp 44.44.44.44/32 (0x0)  :
    via 3.3.3.3   0 1048577

*Sep 17 23:11:58.582: RT: closer admin distance for 44.44.44.44, flushing 1 routes
*Sep 17 23:11:58.582: RT: add 44.44.44.44/32 via 3.3.3.3, bgp metric [20/0]
===============

Note the bottom two lines of output: we see the reconverge this time, in 5 seconds. Why 5 seconds?

The bgp nexthop trigger delay defines how long the NHT process waits before updating BGP. This timer is here to prevent BGP from being beaten up by a flapping IGP route. At the default of 5 seconds, the BGP process won't get bogged down by unnecessary updates.

Let's set it to 2 and try again.

R1(config-router)#bgp nexthop trigger delay 2

Debug from R1 below
===============
*Sep 17 23:18:40.167: RT: delete route to 4.4.4.4 via 192.168.124.4, eigrp metric [90/10880]
*Sep 17 23:18:40.167: RT: no routes to 4.4.4.4, delayed flush
*Sep 17 23:18:40.167: RT: delete subnet route to 4.4.4.4/32
*Sep 17 23:18:40.167: RT: updating eigrp 4.4.4.4/32 (0x0)  :
    via 192.168.124.4 Gi1.124  0 1048578

*Sep 17 23:18:40.167: RT: rib update return code: 5
*Sep 17 23:18:42.168: RT: updating bgp 44.44.44.44/32 (0x0)  :
    via 3.3.3.3   0 1048577

*Sep 17 23:18:42.168: RT: closer admin distance for 44.44.44.44, flushing 1 routes
*Sep 17 23:18:42.168: RT: add 44.44.44.44/32 via 3.3.3.3, bgp metric [20/0]
===============

Now converging at 2 seconds.
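
If you'd rather not do timestamp math in your head, here's a small Python helper for measuring the gap between the IGP losing 4.4.4.4 and BGP installing the alternate path. It's just a convenience sketch; the function name seconds_between is made up, and it assumes both timestamps fall on the same day.

```python
from datetime import datetime


def seconds_between(start, end):
    """Seconds between two HH:MM:SS.mmm debug timestamps on the same day."""
    fmt = "%H:%M:%S.%f"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return delta.total_seconds()


if __name__ == "__main__":
    # Timestamps copied from the three debug captures above.
    runs = {
        "NHT disabled (scanner)": ("22:59:03.552", "22:59:54.031"),
        "NHT, default delay": ("23:11:53.582", "23:11:58.582"),
        "NHT, delay 2": ("23:18:40.167", "23:18:42.168"),
    }
    for label, (igp_loss, bgp_update) in runs.items():
        print(f"{label}: {seconds_between(igp_loss, bgp_update):.3f} s")
```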

Applying a route-map to the NHT process is provided by a feature called Selective Address Tracking, or SAT.

The route-map determines what prefixes can be seen as valid prefixes for next-hops.

For example, if 4.4.4.4 is your desired next hop, but you have a default on your router, if you lose 4.4.4.4/32 do you want the router to consider 4.4.4.4 reachable via the default? Potentially not.

R1(config)#ip route 0.0.0.0 0.0.0.0 192.168.124.10  ! Deliberately non-existent next-hop

Without the route map....

R4(config-if)#int lo0
R4(config-if)#shut

This is hard to demonstrate, because the prefix might never recover. In our over-simplified mock-up, the BGP process would fail at timeout (because 4.4.4.4 is actually our peer) before the prefix vanished; in a more realistic design this could be a permanent black-hole.

We still have the bogus static default route in place:
ip route 0.0.0.0 0.0.0.0 192.168.124.10

R1(config-router)#ip prefix-list onlyloops seq 5 permit 0.0.0.0/0 ge 32
R1(config)#route-map SAT permit 10
R1(config-route-map)# match ip address prefix-list onlyloops
R1(config-route-map)#router bgp 1
R1(config-router)# bgp nexthop route-map SAT

This config only allows for /32s as viable next-hops.
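
Here's the idea as a Python sketch, using the standard ipaddress module. It's a simplified model of the check, not how IOS implements it: the function names are made up, and min_length=32 stands in for the "permit 0.0.0.0/0 ge 32" prefix-list.

```python
import ipaddress


def covering_route(rib, address):
    """Longest-prefix match for address among the RIB prefixes, or None."""
    addr = ipaddress.ip_address(address)
    candidates = [ipaddress.ip_network(prefix) for prefix in rib]
    matches = [net for net in candidates if addr in net]
    return max(matches, key=lambda net: net.prefixlen, default=None)


def next_hop_valid(rib, next_hop, min_length=0):
    """True if the covering route exists and is at least min_length long.

    min_length=32 mimics 'permit 0.0.0.0/0 ge 32'; 0 mimics no route-map.
    """
    route = covering_route(rib, next_hop)
    return route is not None and route.prefixlen >= min_length


if __name__ == "__main__":
    after_failure = ["0.0.0.0/0"]  # 4.4.4.4/32 is gone, default remains
    print(next_hop_valid(after_failure, "4.4.4.4"))      # no route-map
    print(next_hop_valid(after_failure, "4.4.4.4", 32))  # with SAT
```

Without the filter, the default route still "covers" 4.4.4.4, so the next hop looks reachable. With the filter, only a /32 counts.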

R4(config-if)#int lo0
R4(config-if)#shut

Debug from R1 below
===============
*Sep 17 23:47:09.497: RT: delete route to 4.4.4.4 via 192.168.124.4, eigrp metric [90/10880]
*Sep 17 23:47:09.497: RT: no routes to 4.4.4.4, delayed flush
*Sep 17 23:47:09.497: RT: delete subnet route to 4.4.4.4/32
*Sep 17 23:47:09.497: RT: updating eigrp 4.4.4.4/32 (0x0)  :
    via 192.168.124.4 Gi1.124  0 1048578

*Sep 17 23:47:09.497: RT: rib update return code: 5
*Sep 17 23:47:11.498: RT: updating bgp 44.44.44.44/32 (0x0)  :
    via 3.3.3.3   0 1048577

*Sep 17 23:47:11.498: RT: closer admin distance for 44.44.44.44, flushing 1 routes
*Sep 17 23:47:11.499: RT: add 44.44.44.44/32 via 3.3.3.3, bgp metric [20/0]
===============

Now reconverging in 2 seconds again!

This is great for the downstream prefix, but what about the neighbor session itself?

This could work...
R1(config-router)#neighbor 4.4.4.4 fall-over

Except that pesky default is keeping 4.4.4.4 supposedly reachable....
For brevity, I'll tell you that as expected, when I shut the Lo0 interface on R4, 4.4.4.4 was pulled from R1's IGP and 44.44.44.44 was pulled from R1's BGP table.  However, the session is still up!

The same concept (even the same route-map) can be applied to the neighbor fall-over statement. This feature is called Fast Session Deactivation (FSD). 

R1(config-router)#neighbor 4.4.4.4 fall-over route-map SAT ! re-using SAT's route-map

Debug from R1 below
===============
*Sep 18 00:11:08.107: %BGP-5-NBR_RESET: Neighbor 4.4.4.4 reset (Route to peer lost)
*Sep 18 00:11:08.107: %BGP-5-ADJCHANGE: neighbor 4.4.4.4 Down Route to peer lost
*Sep 18 00:11:08.107: %BGP_SESSION-5-ADJCHANGE: neighbor 4.4.4.4 IPv4 Unicast topology base removed from session  Route to peer lost
===============

And the BGP session gets torn down immediately.

This next feature I'm not sure of the use case on, but it was recommended as a topic, so I looked at it. Multisession Transport per AF appears to be related to Multi-Topology Routing (MTR), but MTR should be solidly out-of-scope for CCIE R&S v5.

What multisession transport does is open a separate TCP session for each address family.

I've erased all the BGP config from the previous task.

R1:
ipv6 unicast-routing

router bgp 100
 bgp log-neighbor-changes
 neighbor 4.4.4.4 remote-as 100
 neighbor 4.4.4.4 update-source Loopback0
 !
 address-family ipv4
  neighbor 4.4.4.4 activate
 exit-address-family
 !
 address-family vpnv4
  neighbor 4.4.4.4 activate
  neighbor 4.4.4.4 send-community extended
 exit-address-family
 !
 address-family ipv6
  neighbor 4.4.4.4 activate
 exit-address-family

R4:
ipv6 unicast-routing

router bgp 100
 bgp log-neighbor-changes
 neighbor 1.1.1.1 remote-as 100
 neighbor 1.1.1.1 update-source Loopback0
 !
 address-family ipv4
  neighbor 1.1.1.1 activate
 exit-address-family
 !
 address-family vpnv4
  neighbor 1.1.1.1 activate
  neighbor 1.1.1.1 send-community extended
 exit-address-family
 !
 address-family ipv6
  neighbor 1.1.1.1 activate
 exit-address-family

R1(config-router-af)#do show tcp brief
TCB       Local Address               Foreign Address             (state)
7F612C7742A0  1.1.1.1.40234              4.4.4.4.179                 ESTAB

Three families, one TCP session.

R1(config-router)#neighbor 4.4.4.4 transport multi-session

R4(config-router)#neighbor 1.1.1.1 transport multi-session

The two sides of the session do need to agree on the setting.

R1:
*Sep 18 00:31:19.102: %BGP-5-ADJCHANGE: neighbor 4.4.4.4 Up
*Sep 18 00:31:25.940: %BGP-5-ADJCHANGE: neighbor 4.4.4.4 session 2 Up
*Sep 18 00:31:28.322: %BGP-5-ADJCHANGE: neighbor 4.4.4.4 session 3 Up

R1(config-router)#do show tcp brief
TCB       Local Address               Foreign Address             (state)
7F612C76F0F0  1.1.1.1.179                4.4.4.4.30092               ESTAB
7F612C76DE20  1.1.1.1.179                4.4.4.4.42417               ESTAB
7F612C76E788  1.1.1.1.48539              4.4.4.4.179                 ESTAB
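
If you want to count the sessions without eyeballing the output, a few lines of Python will do it. This is a throwaway sketch: the function name is made up, and it assumes IPv4 addresses printed in the address.port format shown above.

```python
def count_bgp_sessions(show_tcp_brief):
    """Count ESTAB TCP connections where either end uses port 179 (BGP)."""
    count = 0
    for line in show_tcp_brief.splitlines():
        fields = line.split()
        # Data lines have exactly four columns; the header line does not.
        if len(fields) != 4 or fields[3] != "ESTAB":
            continue
        local, foreign = fields[1], fields[2]
        # IOS prints address.port, so the port is the last dotted field.
        ports = {local.rsplit(".", 1)[1], foreign.rsplit(".", 1)[1]}
        if "179" in ports:
            count += 1
    return count


if __name__ == "__main__":
    output = """TCB       Local Address               Foreign Address             (state)
7F612C76F0F0  1.1.1.1.179                4.4.4.4.30092               ESTAB
7F612C76DE20  1.1.1.1.179                4.4.4.4.42417               ESTAB
7F612C76E788  1.1.1.1.48539              4.4.4.4.179                 ESTAB"""
    print(count_bgp_sessions(output))
```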

Our last topic is BGP Dynamic Neighbors. Yes, automagic BGP peerings!

Erasing all the pre-existing BGP config again...

R1:
router bgp 100
 bgp log-neighbor-changes
 bgp listen range 192.168.124.0/24 peer-group PEERS
 neighbor PEERS peer-group
 neighbor PEERS remote-as 100
 neighbor PEERS password CISCO
 neighbor PEERS update-source Loopback0
 neighbor PEERS route-reflector-client
 bgp listen limit 3

R2-R4:
router bgp 100
 bgp log-neighbor-changes
 neighbor 192.168.124.1 remote-as 100
 neighbor 192.168.124.1 password CISCO

R1:
*Sep 18 00:38:24.696: %BGP-5-ADJCHANGE: neighbor *192.168.124.2 Up
*Sep 18 00:39:04.980: %BGP-5-ADJCHANGE: neighbor *192.168.124.4 Up
*Sep 18 00:39:05.932: %BGP-5-ADJCHANGE: neighbor *192.168.124.3 Up

iBGP doesn't get any faster to set up than that!

I've used the most obvious settings here - the dynamic "host" would normally be a route-reflector, and would normally require authentication. 

However, you can:
- Run multiple dynamic groups
- Listen to multiple ranges
- Use multiple address families (this works great for VPNv4!)
- Listen for more neighbors (I limited it to 3 above)
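
As a mental model of the listen range and listen limit, here's a toy Python class. It only captures the two checks discussed above (is the source inside the range, and is there room under the limit); the class and method names are made up, and real BGP obviously does far more (authentication, AS check, capabilities).

```python
import ipaddress


class DynamicListener:
    """Toy model of 'bgp listen range' plus 'bgp listen limit'."""

    def __init__(self, listen_range, limit):
        self.listen_range = ipaddress.ip_network(listen_range)
        self.limit = limit
        self.peers = set()

    def accept(self, source_ip):
        """Return True if an inbound peer from source_ip would be accepted."""
        addr = ipaddress.ip_address(source_ip)
        if addr not in self.listen_range:
            return False  # outside the listen range
        if source_ip in self.peers:
            return True  # already an established dynamic neighbor
        if len(self.peers) >= self.limit:
            return False  # dynamic neighbor limit reached
        self.peers.add(source_ip)
        return True


if __name__ == "__main__":
    r1 = DynamicListener("192.168.124.0/24", limit=3)
    for ip in ["192.168.124.2", "192.168.124.4", "192.168.124.3",
               "192.168.124.5", "10.0.0.1"]:
        print(ip, r1.accept(ip))
```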

Cheers,

Jeff

CCIE v4 to v5 Updates: NTPv4 and Netflow

I didn't find these updates on any Cisco or 3rd party list. When I wrote my original NTP and Netflow blogs in mid-2013, I flagged several topics as out-of-scope because they weren't supported on IOS v12.4(15)T. Now that v5 is out, all those topics are back in-scope, so I decided to blog them.

Here are the original articles this one builds off of:

http://brbccie.blogspot.com/2013/05/ntp.html
http://brbccie.blogspot.com/2013/06/netflow.html

The topics we'll be covering specifically are:
- Netflow w/ NBAR
- IPFIX (Netflow v10)
- NTPv4 (IPv6 support)
- NTPv4 Multicast NTP
- NTP Panic
- NTP Maxdistance
- NTP Orphan

Netflow
First, I wanted to mention an omission from my original blog. At that time I didn't have a collector that would support Flexible Netflow, so I evaluated FNF via Wireshark. That was fairly effective except I was missing a major element of netflow: the bytes transferred! I'm now using a collector that supports FNF, and I immediately noticed I wasn't graphing any traffic.

flow record JIMBO
 match ipv4 source address
 match ipv4 destination address
 collect counter bytes
 collect counter packets

This is a simple, working FNF config. Matching or collecting counter bytes and counter packets should be done to make Netflow do what you're used to it doing -- measuring traffic.
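
To picture what match and collect do in that record, here's a small Python sketch of a flow cache. It's a conceptual model only: the function name build_flow_cache and the (src, dst, size) packet tuples are made up for illustration.

```python
from collections import defaultdict


def build_flow_cache(packets):
    """Aggregate (src, dst, size) packets the way the JIMBO record would.

    'match' fields form the flow key; 'collect' fields are per-flow counters.
    """
    cache = defaultdict(lambda: {"bytes": 0, "packets": 0})
    for src, dst, size in packets:
        flow = cache[(src, dst)]   # match ipv4 source/destination address
        flow["bytes"] += size      # collect counter bytes
        flow["packets"] += 1       # collect counter packets
    return dict(cache)


if __name__ == "__main__":
    sample = [
        ("10.0.0.1", "10.0.0.2", 100),
        ("10.0.0.1", "10.0.0.2", 1400),
        ("10.0.0.3", "10.0.0.2", 60),
    ]
    for key, counters in build_flow_cache(sample).items():
        print(key, counters)
```

Leave out the two counters and you still get flows, just with nothing to graph, which is the mistake I made originally.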

What's the advantage of integrating NBAR with Netflow?
By default, Netflow only exports very high-level protocol information. Integrating NBAR gives very specific/granular protocol output to the collector. Note that your collector needs to specifically support this; it is not a small change at the protocol level.

If you're familiar with how the template is sent out for FNF every so often, the NBAR table is very similar. IOS will send out a rather large (many packets) template defining the NBAR Application to ID at specified intervals, then those IDs are sent with the Netflow packet to define what the protocol is.

There are several other blogs out there that give big, complex templates for integrating NBAR with Netflow. I took a few of these as a base and worked backwards to the real requirements. This is not a hard thing to enable. Your flow record must contain collect application name (or match application name), and optionally you can tune the frequency of the NBAR FNF template being sent out with option application-table timeout in the exporter.

Here's a working config:

flow record FNF-RECORD
 match ipv4 source address
 match ipv4 destination address
 collect counter bytes
 collect counter packets
 collect application name 

flow exporter FNF-EXPORTER
 destination 192.168.0.5
 source GigabitEthernet1
 transport udp 9996
 template data timeout 60
 option application-table timeout 30

flow monitor FNF-MONITOR
 exporter FNF-EXPORTER
 cache timeout inactive 60
 cache timeout active 60
 record FNF-RECORD

interface gig1
 ip flow monitor FNF-MONITOR input

Netflow was recently made an open standard with v10. The open version is called IPFIX. To enable IPFIX output instead of FNF v9, you would:

flow exporter FNF-EXPORTER
 export-protocol ipfix

Note I haven't tested this beyond checking it in Wireshark, because I still don't have a collector that speaks IPFIX.

NTP



The big difference in NTP v4 is IPv6 support. There's really not much to cover on the basics... clearly broadcast NTP is gone in IPv6, but Multicast NTP still works the same general way it did in IPv4.

R1(config)#ntp master 4

R2(config)#ntp server 1::1

R2#show ntp association detail
1::1 configured, ipv6, our_master, sane, valid, stratum 4
ref ID 127.127.1.1    , time D7C45F20.4AC083E0 (19:27:28.292 UTC Wed Sep 17 2014)
<output omitted>

Really quite simple.

15.x implementations of NTP now leave domain names in the config.
Pre 15.x:
foo.com(config)#ip host foo.com 4.4.4.4
foo.com(config)#ntp server foo.com
foo.com(config)#do sh run | i ntp
ntp server 4.4.4.4

It would translate the hostname to an IP address and the IP address would be saved in the config, not a good thing if the server changes IPs.

Post 15.x:
R2(config)#ip host test.com 4.1.1.1
R2(config)#ntp server test.com
R2(config)#do sh run | i ntp
ntp server test.com

Let's take a look at the multicast option. As IPv6 multicast has blessedly been removed from the v5 blueprint, I'm going to cheap out and perform non-routed/same-link multicast.

R2(config)#no ntp server 1::1

R1(config)#ntp authentication-key 1 md5 CISCO
R1(config)#ntp trusted-key 1
R1(config)#int gig1.123
R1(config-subif)#ntp multicast FF02::123 key 1

R2(config)#ntp authentication-key 1 md5 CISCO
R2(config)#ntp trusted-key 1
R2(config)#ntp authenticate
R2(config)#int gig1.123
R2(config-subif)#ntp multicast client FF02::123

R2(config-subif)#do show ntp ass det
FE80::20C:29FF:FEB6:3557 dynamic, ipv6, authenticated, our_master, sane, valid, stratum 4
ref ID 127.127.1.1    , time D7C460E0.4AC083E0 (19:34:56.292 UTC Wed Sep 17 2014)

Maxdistance, for me, is very confusing. It appears to be a trust value. It's normally modified in NTPv4 in order to speed up convergence. As I understand it, the higher the value, the faster the synchronization will happen, because the upstream time will be trusted sooner. The algorithm appears to add half the root delay to the dispersion, and if that sum is lower than Maxdistance, then it's OK to consider yourself in-sync. My labbing did not produce exactly that outcome, but it was extremely hard to say for sure because my NTPv4 converges very quickly. Because you basically have to be a time expert to understand what this does, I would hope the CCIE lab would be limited to two types of questions on it:
1) Set it to some value they provide
2) Set it to "slowest" convergence (1) or "fastest" convergence (16)
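
For what it's worth, here's my understanding of the check written out in Python. Treat it as a sketch of the simplified description above and nothing more: the function name is made up, I'm assuming all values are in seconds, and real NTP implementations add further terms to the distance.

```python
def within_maxdistance(root_delay, root_dispersion, maxdistance):
    """Apply the check described above: delay/2 + dispersion < maxdistance.

    All values are assumed to be in seconds. Real NTP implementations add
    further terms to the distance; this follows only the simplified
    description in this post.
    """
    distance = root_delay / 2 + root_dispersion
    return distance < maxdistance


if __name__ == "__main__":
    # Hypothetical sample values, not measurements from my lab.
    for limit in (1, 16):
        print(limit, within_maxdistance(4.0, 0.5, limit))
```

With the same delay and dispersion, a larger Maxdistance passes the check sooner, which matches the "higher is faster" behaviour described above.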

R1(config)#ntp maxdistance ?
  <1-16>  Maximum distance for synchronization

NTP Panic is simple:

R2(config)#ntp panic ?
  update  Reject time updates > panic threshold (default 1000Sec)

It does just what it says - if my peer or configured master's clock is more than 1,000 seconds off of my clock, reject the update and syslog:

.Sep  8 00:51:00.155: NTP Core (ERROR): Time correction of nan seconds exceeds sanity limit of 0. seconds. Set clock manually to the correct UTC time.
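
The logic is about as simple as it sounds. As a one-function Python sketch (the function name is made up; the 1000 second default comes from the help text above):

```python
def accept_time_update(offset_seconds, panic_threshold=1000):
    """Reject corrections whose magnitude exceeds the panic threshold."""
    return abs(offset_seconds) <= panic_threshold


if __name__ == "__main__":
    print(accept_time_update(12.5))    # small correction
    print(accept_time_update(-3600))   # an hour off, in either direction
```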

NTP Orphan is really cool. It seems like an obvious feature now that I've seen it, but I can imagine this is a huge help for smaller organizations that rely heavily on NTP.

Let's say, from our diagram, R1 is an Internet time server that our fictional organization uses as its sole NTP master. R2 and R3 are edge routers inside the company, and R4 and R5 will represent servers querying R2 and R3.

So to be clear, R2 and R3 get their time from R1, and also peer towards one another (so if R3 can't reach R1 but R2 can, R3 can learn its time via R2).  R4 and R5 query R2 and R3 for time, respectively.

Relevant config:
R1(config)#ntp master 4

R2(config)#int gig1.123
R2(config-subif)#no  ntp multicast client FF02::123
R2(config-subif)#no ntp authenticate
R2(config)#ntp server 1::1
R2(config)#ntp peer 3::3
R2(config)#ntp source lo0

R3(config)#ntp server 1::1
R3(config)#ntp peer 2::2
R3(config)#ntp source lo0

R4(config)#ntp server 2::2

R5(config)#ntp server 3::3

At this point every device has the up-to-date time.

Now let's say R1 goes offline.
R1(config)#int lo0
R1(config-if)#shut

<<wait a while>>

R2(config)#do show ntp status
Clock is unsynchronized, stratum 16, no reference clock
<output omitted>

R3(config)#do show ntp status
Clock is unsynchronized, stratum 16, no reference clock
<output omitted>

and obviously R4 and R5 share the same fate.

What if we could program R2 and R3 to take their best stab at what the time should still be - mind you we're talking about being only a couple minutes since last sync, so the time is probably still very close to accurate - and then temporarily and seamlessly take over the NTP Master role if they lose valid clock from R1?

This is exactly what NTP Orphan does.

The config is extremely complicated:

R2(config)#ntp orphan 6

R3(config)#ntp orphan 6

(I was joking about the complicated part)

Really, that's it.  Let's understand what's happening here now.  Orphan kicks in when we lose sync with our server. The number 6 here is a stratum number, and it must be numerically higher (that is, a worse stratum) than your real upstream NTP server's stratum, which is 4 here - otherwise the failover/fail-back mechanism won't work right. 

Best practices indicate configuring the same Orphan stratum on all devices you're running Orphan on, then peering all the Orphans to one another so that only one is "elected" to be the temporary Orphan master.
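
Here's a rough Python model of the stratum each Orphan router ends up at. Loud assumptions: the function name strata is made up, and the "election" is faked by picking the first router name in sorted order, since I haven't described (or verified) the real selection rule. The orphan stratum is expected to be worse than the real upstream (6 versus 4 in this lab).

```python
def strata(upstream_stratum, upstream_reachable, orphan_stratum, routers):
    """Return {router: stratum} for a group of mutually peered orphans."""
    if orphan_stratum <= upstream_stratum:
        raise ValueError("orphan stratum must be worse than the real upstream")
    if upstream_reachable:
        # Everyone syncs to the real server, one stratum below it.
        return {r: upstream_stratum + 1 for r in routers}
    # ASSUMPTION: the election winner is simply the first name in sorted
    # order; the real selection rule is not described in this post.
    parent = sorted(routers)[0]
    return {r: orphan_stratum if r == parent else orphan_stratum + 1
            for r in routers}


if __name__ == "__main__":
    print(strata(4, True, 6, ["R2", "R3"]))   # R1 reachable
    print(strata(4, False, 6, ["R2", "R3"]))  # R1 offline, orphan mode
```

Because the real server always offers a better stratum than the orphan value, the routers drop back to it as soon as it returns.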

R2(config)#do show ntp status
Clock is synchronized, stratum 6, reference is 127.0.0.1
<output omitted>

We see R2 is now stratum 6, synchronized with its own virtual Orphan server.

R3(config)#do show ntp status
Clock is synchronized, stratum 7, reference is 26.33.33.239
<output omitted>

R3 is synchronized with R2 as its Master. 

R4#show ntp status
Clock is synchronized, stratum 7, reference is 26.33.33.239

R4 is synchronized with R2 as its master.

R5#show ntp status
Clock is synchronized, stratum 9, reference is 24.235.166.45

R5 is synchronized with R3 as its master.

Now the most important feature of this is fail-back, let's re-activate R1.

R1(config)#int lo0
R1(config-if)#no shut

R3 was first to recover:
R3(config)#do show ntp association detail
1::1 configured, ipv6, our_master, sane, valid, stratum 4

It automatically shut down its Orphan process when it synced to the superior stratum 4.

R5 then received the now-correct time from R3:
R5#show ntp association detail
3::3 configured, ipv6, our_master, sane, valid, stratum 5

Cheers,

Jeff Kronlage


Saturday, September 6, 2014

OSPF LFA & Remote LFA

Continuing on the same track as my recent posts regarding EIGRP FRR and BGP PIC/Add-path, today I'm writing about OSPF LFA. OSPF FRR/LFA accomplishes the same concept as EIGRP FRR, but in a much more elegant and thorough fashion.

As I did in my EIGRP article, I'm going to reference back to the BGP PIC article, as that has a lengthy explanation of why fast reroute is important. If you don't understand the use case, please read that article first:

http://brbccie.blogspot.com/2014/08/bgp-pic-and-add-path.html

Again building off former articles, the EIGRP method of LFA is dead simple: take the feasible successor and pre-install it in the FIB for faster convergence.

http://brbccie.blogspot.com/2014/08/eigrp-enhancements.html

I genuinely like this approach, because it's very easy to understand. If you're savvy enough to engineer for feasible successors, you can literally just turn on this feature and it works.

OSPF takes this idea to a whole new level. Obviously, OSPF does not have a concept of feasible successors, but it does have a huge advantage: because, in the same area, the OSPF database is identical among all routers, OSPF can run the SPF algorithm with a neighboring router as root. The advantage of this is being able to find a loop-free alternate path in complex topologies that would have failed the feasible successor check in EIGRP. When we look at Remote LFA, we can even tunnel to distant routers to form loop-free paths, all calculated via the router running FRR.

Note - much like EIGRP, OSPF on IOS does not support per-link LFA, so we will only be examining per-prefix LFA.  IOS-XR supports both per-prefix and per-link.



All links have an IP address of 192.168.YY.X, where YY is the lower router number followed by the higher router number, and X is the router number (i.e. on the link facing R4, R1's IP address is 192.168.14.1). Each router has a loopback0 address of X.X.X.X, where X is the router number.

Consider this diagram, with R1 attempting to reach R5 (5.5.5.5).

R1(config)#router ospf 1
R1(config-router)#fast-reroute per-prefix enable area 0 prefix-priority low

The primary path is obvious: R1 -> R2 -> R5
The backup path requires some thought...

If this were EIGRP, neither path would be valid for LFA. They'd both fail the feasibility condition:
R1->R3->R5 has an "advertised distance" of 10, which is greater than the "feasible distance" of 2. Likewise, R1->R4->R5 has an "advertised distance" of 10.

However, OSPF, being link state, can actually calculate the SPF from each neighbor's perspective. Cisco calls this process "reverse SPF" -- RSPF. I'm not going to make this a large lesson on link state protocols, but let's quickly look at what R1 would discover about its neighbors:

R2:
  This is already the primary path, so eliminate R2.
R3:
  When attempting to reach R5, R3 will route back through R1. This will loop. Eliminate R3.
R4:
  R4 reaches R5 via the link between R4 and R5.  Valid Backup Route.

I deliberately built the scenario this way to show how a higher-metric route could beat a lower metric for the backup route - of course, in our case, the lower metric would've looped.
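
For anyone who wants to see the neighbor check in code, here's a minimal Python sketch. It uses the basic loop-free condition from RFC 5286 (a neighbor N is loop-free if its distance to the destination is less than its distance back to us plus our own distance to the destination). The link costs are assumptions: most are inferred from the path costs quoted above, and R1-R3 = 1 is a guess on my part. I'm not claiming this is exactly how IOS implements it.

```python
import heapq

# Assumed link costs for the first topology (loopback cost left out).
LINKS = {
    ("R1", "R2"): 1, ("R2", "R5"): 1,
    ("R1", "R3"): 1, ("R3", "R5"): 10,
    ("R1", "R4"): 15, ("R4", "R5"): 10,
}


def build_graph(links):
    """Turn the link table into a bidirectional adjacency map."""
    graph = {}
    for (a, b), cost in links.items():
        graph.setdefault(a, {})[b] = cost
        graph.setdefault(b, {})[a] = cost
    return graph


def spf(graph, root):
    """Dijkstra: shortest distance from root to every reachable node."""
    dist = {root: 0}
    queue = [(0, root)]
    while queue:
        d, node = heapq.heappop(queue)
        if d > dist.get(node, float("inf")):
            continue  # stale queue entry
        for neigh, cost in graph[node].items():
            nd = d + cost
            if nd < dist.get(neigh, float("inf")):
                dist[neigh] = nd
                heapq.heappush(queue, (nd, neigh))
    return dist


def loop_free_alternates(graph, source, dest, primary_next_hop):
    """Run SPF rooted at each neighbor and keep the ones that will not loop."""
    from_source = spf(graph, source)
    result = []
    for neigh in sorted(graph[source]):
        if neigh == primary_next_hop:
            continue  # already the primary path
        from_neigh = spf(graph, neigh)
        # RFC 5286 basic condition: D(N, dest) < D(N, S) + D(S, dest)
        if from_neigh[dest] < from_neigh[source] + from_source[dest]:
            result.append(neigh)
    return result


graph = build_graph(LINKS)
print(loop_free_alternates(graph, "R1", "R5", "R2"))
```

With these assumed costs, R3 fails the check (its best path to R5 runs back through R1) and only R4 survives.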

R1#sh ip route repair 5.5.5.5
Routing entry for 5.5.5.5/32
  Known via "ospf 1", distance 110, metric 3, type intra area
  Last update from 192.168.12.2 on GigabitEthernet1.12, 02:29:11 ago
  Routing Descriptor Blocks:
  * 192.168.12.2, from 5.5.5.5, 02:29:11 ago, via GigabitEthernet1.12
      Route metric is 3, traffic share count is 1
      Repair Path: 192.168.14.4, via GigabitEthernet1.14
    [RPR]192.168.14.4, from 5.5.5.5, 02:29:11 ago, via GigabitEthernet1.14
      Route metric is 26, traffic share count is 1

R1#sh ip cef 5.5.5.5
5.5.5.5/32
  nexthop 192.168.12.2 GigabitEthernet1.12
    repair: attached-nexthop 192.168.14.4 GigabitEthernet1.14

As with EIGRP, there are "tie-breakers" if you have multiple options for backup path. With OSPF, you can get a lot more granular than EIGRP. I still hate the term "tie-breakers". As I explained in my EIGRP blog, I think "2nd bestpath decision maker" explains it better.

The tie-breakers are as follows, with their respective default priorities:

- SRLG 10 
- Primary Path 20
- Interface Disjoint 30
- Lowest-Metric 40
- Linecard-disjoint 50
- Node protecting 60
- Broadcast interface disjoint 70 
- Load Sharing 256 

These tie-breakers are off by default:
- Downstream 
- Secondary-Path

The syntax to change the priorities - or turn on downstream or secondary-path - is as follows:

router ospf 1
  fast-reroute per-prefix tie-break interface-disjoint required index 5

If you use the fast-reroute per-prefix tie-break command at all, it disables all the other tie-breakers. So for example, if you wanted SRLG to be the 2nd tie breaker, you would have to turn it back on after the interface-disjoint command:

router ospf 1
  fast-reroute per-prefix tie-break interface-disjoint required index 5
  fast-reroute per-prefix tie-break srlg index 10

You may have also noticed the required keyword. This means that if that tie-breaker doesn't match/pass, then disallow that path completely.

My original plan was to show a scenario for every tie-breaker, but after it took me two days to build a topology that showed each possible technique, I decided to just go with a written explanation of each tie-breaker and then give one semi-complex tie-breaker topology with a few examples.

- SRLG
SRLG - Shared Risk Link Group - is a manual setting, optionally assigned per-interface, with the intent of identifying "shared risk" elements that the router can't detect on its own. For example, if two of your Ethernet links shared a downstream switch, you might put those two in the same SRLG.

Usage:
R1(config)#int gig1
R1(config-if)#srlg gid 1
R1(config-if)#int gig2
R1(config-if)#srlg gid 1
R1(config-if)#int gig3
R1(config-if)#srlg gid 2

- Primary Path
Primary Path prefers a backup path that's part of equal-cost multipath (ECMP). This is the antithesis of Secondary Path, which we'll cover below.

- Interface Disjoint
This is fairly obvious: prefer a backup next-hop that exits through a different interface. Note that Ethernet sub-interfaces are considered different interfaces.

- Lowest-Metric
Prefer the path with the lowest metric (note, this command doesn't offer a "required" keyword)

- Linecard-disjoint
Prefer a path that exits through a different linecard than the primary path (I have no way of labbing this as I'm using a CSR1K)

- Node protecting
Prefer a path that doesn't pass through the same next-hop router as the primary path. Note this means any interface on the same next-hop router. So if R2 is the next-hop of your primary path via 192.168.12.2, and your backup path goes through (either directly or indirectly, later in the path) 192.168.25.2 on R2, node protecting will depref that path - or with the required keyword, would prevent it from being used completely.

- Broadcast interface disjoint
Broadcast interface disjoint deprefs backup routes that pass through the same broadcast area as the primary path. The thought here is if the layer 2 device (presumably a switch) connecting the interfaces together fails, we might lose the backup path too.

- Load Sharing 
I haven't labbed this, but my understanding is this is basically a worst-case scenario. If you have two or more paths that can't be differentiated by all of the above tie-breakers, share the backup paths amongst any applicable prefixes.

- Downstream (off by default)
This is very similar to the EIGRP feasibility condition - ensure that the metric, from the neighbor's RSPF perspective, is smaller than the total metric of our primary path from the calculating router's perspective. Using the original example above, the backup path we picked would not meet the criteria for this tie-breaker. It's important to reinforce that this is not a default option, and OSPF does not require this EIGRP-feasibility-like check: as a link state protocol it has the entire topology at hand and can calculate non-looping paths without concern for metric.

- Secondary-Path (off by default)
This is the antithesis of the Primary-Path tie-breaker above. This instructs the process to prefer a backup path that is not part of multipathing (ECMP). The idea here is that all your multipaths may be required for your traffic flows - for example, if you are equal-cost multipathing across two 1-gig links but consistently have 1.2 Gbps of data crossing them, it would not be desirable to just run over the opposing link in the ECMP if one failed. Secondary-Path prefers a path not in the ECMP for the backup.

I'm going to run a couple of examples of tie-breaking, but in order to do that, I needed more paths in the topology. Pay close attention, I have shifted the OSPF costs from the prior topology:



*Please note: for clarity, the costs listed below do not include the on-router cost to the loopback.*
If you look at metric alone, the paths from R1->R5 look most desirable in this order:
R1 -> R3 -> R5 (Cost 2)
R1 -> R6 -> R3 -> R5 (Cost 4)
R1 -> R2 -> R5 (Cost 11)
R1 -> R4 -> R5 (Cost 25)

Clearly R3 is the winning primary path.

Let's go down the decision-making process for the backup path:

- SRLG 10 - Not applicable, we're not using SRLG (yet)
- Primary Path 20 - Not applicable, we have no ECMP.
- Interface Disjoint 30 - Applicable, but all are on separate interfaces already.
- Lowest-Metric 40 - Applicable, choose R6 as backup. Do not proceed further, as all paths have different costs.

So without any modification, our primary next-hop router will be R3, and backup next-hop router will be R6:

R1#sh ip route repair 5.5.5.5
Routing entry for 5.5.5.5/32
  Known via "ospf 1", distance 110, metric 3, type intra area
  Last update from 192.168.13.3 on GigabitEthernet1.13, 00:14:19 ago
  Routing Descriptor Blocks:
  * 192.168.13.3, from 5.5.5.5, 00:14:19 ago, via GigabitEthernet1.13
      Route metric is 3, traffic share count is 1
      Repair Path: 192.168.16.6, via GigabitEthernet1.16
    [RPR]192.168.16.6, from 5.5.5.5, 00:14:19 ago, via GigabitEthernet1.16
      Route metric is 6, traffic share count is 1

There's an obvious flaw in that plan, however: they both rely on R3 being online.

R1(config)#router ospf 1
R1(config-router)#fast-reroute per-prefix tie-break lowest-metric index 10
R1(config-router)#fast-reroute per-prefix tie-break node-protecting required index 20

R1(config-router)#do sh ip route repair 5.5.5.5
Routing entry for 5.5.5.5/32
  Known via "ospf 1", distance 110, metric 3, type intra area
  Last update from 192.168.13.3 on GigabitEthernet1.13, 00:01:09 ago
  Routing Descriptor Blocks:
  * 192.168.13.3, from 5.5.5.5, 00:01:09 ago, via GigabitEthernet1.13
      Route metric is 3, traffic share count is 1
      Repair Path: 192.168.14.4, via GigabitEthernet1.14
    [RPR]192.168.14.4, from 5.5.5.5, 00:01:09 ago, via GigabitEthernet1.14
      Route metric is 26, traffic share count is 1

Now the process has chosen the backup through R4, which eliminates R3 as a single point of failure.

Let's pretend that gig1.13, gig 1.14, and gig1.16 all cross the same L2 switch somewhere in their path. We want to protect against that too:

R1(config)#router ospf 1
R1(config-router)#fast-reroute per-prefix tie-break lowest-metric index 10
R1(config-router)#fast-reroute per-prefix tie-break node-protecting required index 20
R1(config-router)#fast-reroute per-prefix tie-break srlg required index 30

R1(config-router)#int gig1.13
R1(config-subif)#srlg gid 1
R1(config-subif)#int gig1.14
R1(config-subif)#srlg gid 1
R1(config-subif)#int gig1.16
R1(config-subif)#srlg gid 1
R1(config-subif)#int gig1.12
R1(config-subif)#srlg gid 2

R1(config-subif)#do sh ip route repair 5.5.5.5
Routing entry for 5.5.5.5/32
  Known via "ospf 1", distance 110, metric 3, type intra area
  Last update from 192.168.13.3 on GigabitEthernet1.13, 00:18:34 ago
  Routing Descriptor Blocks:
  * 192.168.13.3, from 5.5.5.5, 00:18:34 ago, via GigabitEthernet1.13
      Route metric is 3, traffic share count is 1

Uh-oh, no backup route. We were hoping for R1->R2->R5...

R2#sh ip cef 5.5.5.5
5.5.5.5/32
  nexthop 192.168.12.1 GigabitEthernet1.12

That's because R2 routes back through R1 - R1 would've run the RSPF with R2 as the root and disregarded the route.

We have two options at this point:
- Remove the required keyword from the SRLG and fall back to the prior answer
- Tinker with the metrics to make R2 a viable path.

R1(config)#int gig1.12
R1(config-subif)#ip ospf cost 10

R2(config)#int gig1.12
R2(config-subif)#ip ospf cost 10

R1(config-subif)#do sh ip route repair 5.5.5.5
Routing entry for 5.5.5.5/32
  Known via "ospf 1", distance 110, metric 3, type intra area
  Last update from 192.168.13.3 on GigabitEthernet1.13, 00:00:52 ago
  Routing Descriptor Blocks:
  * 192.168.13.3, from 5.5.5.5, 00:00:52 ago, via GigabitEthernet1.13
      Route metric is 3, traffic share count is 1
      Repair Path: 192.168.12.2, via GigabitEthernet1.12
    [RPR]192.168.12.2, from 5.5.5.5, 00:00:52 ago, via GigabitEthernet1.12
      Route metric is 21, traffic share count is 1

Now we have a backup via R2.
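
To recap the four results above, here's a small Python sketch that reproduces them. Fair warning: this is my own simplified model inferred from the lab outputs, not the real IOS algorithm. It treats every "required" tie-breaker as a hard filter and then picks the lowest metric among whatever survives, and the attribute names are made up for illustration. Metrics are the repair path metrics from the outputs above.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    next_hop: str
    metric: int
    node_protecting: bool  # avoids the primary next-hop router (R3)
    srlg_disjoint: bool    # not in the same SRLG as the primary interface


def pick_backup(candidates, required=()):
    """Drop candidates failing any 'required' attribute, then take lowest metric."""
    eligible = [c for c in candidates if all(getattr(c, attr) for attr in required)]
    if not eligible:
        return None  # no repair path gets installed
    return min(eligible, key=lambda c: c.metric).next_hop


r6 = Candidate("R6", 6, node_protecting=False, srlg_disjoint=False)
r4 = Candidate("R4", 26, node_protecting=True, srlg_disjoint=False)
# R2 only becomes a loop-free candidate after the cost change on gig1.12.
r2 = Candidate("R2", 21, node_protecting=True, srlg_disjoint=True)

print(pick_backup([r6, r4]))
print(pick_backup([r6, r4], required=("node_protecting",)))
print(pick_backup([r6, r4], required=("node_protecting", "srlg_disjoint")))
print(pick_backup([r6, r4, r2], required=("node_protecting", "srlg_disjoint")))
```

In order, that gives R6, R4, no backup at all, and finally R2, which lines up with the four show outputs.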

Before we move on to remote LFA, let's cover some smaller topics.

There were two pieces to the initial command that I did not explain:
fast-reroute per-prefix enable area 0 prefix-priority low

enable area 0 may seem obvious - we want backup paths for area 0. Note, you can only specify areas the router is directly connected to, so if, for example, you wanted backup paths in areas 0, 1, and 2, your router would have to be an ABR for areas 1 and 2. This is true of both direct LFA and remote LFA.

But there's another issue with specifying areas:

R5(config)#int lo1
R5(config-if)#ip address 55.55.55.55 255.255.255.255
R5(config-if)#exit
R5(config)#route-map lo1-extern
R5(config-route-map)#match interface lo1
R5(config-route-map)#exit
R5(config)#router ospf 1
R5(config-router)#redistribute connected route-map lo1-extern

R1(config)#do sh ip route repair 55.55.55.55
Routing entry for 55.55.55.55/32
  Known via "ospf 1", distance 110, metric 20, type extern 2, forward metric 2
  Last update from 192.168.13.3 on GigabitEthernet1.13, 00:01:27 ago
  Routing Descriptor Blocks:
  * 192.168.13.3, from 5.5.5.5, 00:01:27 ago, via GigabitEthernet1.13
      Route metric is 20, traffic share count is 1

No repair route for 55.55.55.55 - and we won't get one, because an external route belongs to no area. We have to change our initial configuration to fix this:

R1(config-router)#no fast-reroute per-prefix enable area 0 prefix-priority low
R1(config-router)#fast-reroute per-prefix enable prefix-priority low

A lack of an area implies all areas this router is connected to - including external routes.

R1(config-router)#do sh ip route repair 55.55.55.55
Routing entry for 55.55.55.55/32
  Known via "ospf 1", distance 110, metric 20, type extern 2, forward metric 2
  Last update from 192.168.13.3 on GigabitEthernet1.13, 00:01:42 ago
  Routing Descriptor Blocks:
  * 192.168.13.3, from 5.5.5.5, 00:01:42 ago, via GigabitEthernet1.13
      Route metric is 20, traffic share count is 1
      Repair Path: 192.168.12.2, via GigabitEthernet1.12
    [RPR]192.168.12.2, from 5.5.5.5, 00:01:42 ago, via GigabitEthernet1.12
      Route metric is 20, traffic share count is 1

What's the story on prefix-priority low?

IOS prioritizes convergence events by default by prefix length. If SPF has to be calculated for thousands of routes, it's assumed by default that /32s (typical for iBGP next-hops) are "high priority". You can define what routes are priority to OSPF with:

R1(config-router)#prefix-priority high route-map <your route map>

There are only two tiers, high and low. High means (by default, unless the route map is used) only calculate backup routes for /32s. Low means calculate backup routes for all routes.

So you're debugging and trying to figure out why one path was chosen over another. IOS has a fantastic output system for this:

R1(config-router)#fast-reroute keep-all-paths

This is basically a debugging command, and tells OSPF to keep the output from all the RSPFs it ran to calculate the backup path - including the ones it didn't choose as best.

show ip ospf rib is our 2nd magic command:

R1(config-router)#do sh ip ospf rib 5.5.5.5

            OSPF Router with ID (1.1.1.1) (Process ID 1)


                Base Topology (MTID 0)

OSPF local RIB
Codes: * - Best, > - Installed in global RIB
LSA: type/LSID/originator

*>  5.5.5.5/32, Intra, cost 3, area 0
     SPF Instance 62, age 00:13:50
     Flags: RIB, HiPrio
      via 192.168.13.3, GigabitEthernet1.13
       Flags: RIB
       LSA: 1/5.5.5.5/5.5.5.5
      repair path via 192.168.12.2, GigabitEthernet1.12, cost 21
       Flags: RIB, Repair, IntfDj, BcastDj, NodeProt
       LSA: 1/5.5.5.5/5.5.5.5
      repair path via 192.168.16.6, GigabitEthernet1.16, cost 6
       Flags: Ignore, Repair, IntfDj, BcastDj, SRLG
       LSA: 1/5.5.5.5/5.5.5.5
      repair path via 192.168.14.4, GigabitEthernet1.14, cost 26
       Flags: Ignore, Repair, IntfDj, BcastDj, SRLG, NodeProt
       LSA: 1/5.5.5.5/5.5.5.5

Look at all that fantastic output - it lists the parameters per route so you can determine why the repair path was chosen. Let's break one of these down:

      repair path via 192.168.12.2, GigabitEthernet1.12, cost 21
       Flags: RIB, Repair, IntfDj, BcastDj, NodeProt
       LSA: 1/5.5.5.5/5.5.5.5

This is our current best backup path - "RIB" means it's installed, "Repair" means it's a backup path - so "RIB" + "Repair" means it's the installed backup path. IntfDj means it's on a separate interface from the primary path, BcastDj means it's not sharing a broadcast interface with the primary path, and NodeProt means the path does not include shared hops with the primary path.

Microloops can add complexity with fast-reroute. A microloop is what happens when one router converges significantly faster than a neighbor. Let's say two adjacent routers both receive new LSAs simultaneously. One router is high-performance, another is older. The high-performance router calculates the change and updates the FIB several seconds before the older router. Now we could end up with a scenario where the newer router starts forwarding traffic through the older router, but the older router's FIB hasn't updated yet, and it's forwarding through the faster router for that same prefix. For a couple of seconds, the two routers loop.

I'm not going to go into detail on this as it's a fringe topic, but here's the starting point for microloop avoidance:
R1(config-router)#microloop avoidance ?
  disable           Microloop avoidance auto-enable prohibited
  protected         Microloop avoidance for protected prefixes only
  rib-update-delay  Delay before updating the RIB

In short, it allows you to deliberately slow down updating the FIB on the faster router for prefixes that are high-risk for this type of reconvergence.

If you don't want an interface being considered for fast-reroute:

R1(config-router)#int gig1.12
R1(config-subif)#ip ospf fast-reroute per-prefix candidate disable

And if you need a quick summary of what percentage of routes are and aren't protected:

R1#sh ip ospf fast-reroute prefix-summary

            OSPF Router with ID (1.1.1.1) (Process ID 1)
                    Base Topology (MTID 0)

Area 0:

Interface        Protected    Primary paths    Protected paths Percent protected
                             All  High   Low   All  High   Low    All High  Low
Lo0                    Yes     0     0     0     0     0     0     0%   0%   0%
Gi1.16                 Yes     1     1     0     0     0     0     0%   0%   0%
Gi1.14                 Yes     0     0     0     0     0     0     0%   0%   0%
Gi1.13                 Yes     7     3     4     4     2     2    57%  66%  50%
Gi1.12                 Yes     1     1     0     0     0     0     0%   0%   0%

Area total:                    9     5     4     4     2     2    44%  40%  50%

Process total:                 9     5     4     4     2     2    44%  40%  50%
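
If you want to sanity-check the percentage columns, the arithmetic is just protected paths over primary paths. Here's a quick Python sketch; the truncation (rather than rounding) is my assumption based on 2 out of 3 being displayed as 66% above.

```python
def percent_protected(primary_paths, protected_paths):
    """Integer percentage of primary paths that have a repair path."""
    if primary_paths == 0:
        return 0  # nothing to protect, shown as 0% in the output above
    return protected_paths * 100 // primary_paths


# Gi1.13 row: All / High / Low
print(percent_protected(7, 4), percent_protected(3, 2), percent_protected(4, 2))
# Area total row: All / High / Low
print(percent_protected(9, 4), percent_protected(5, 2), percent_protected(4, 2))
```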

That's a wrap for direct LFA. Now we'll look at remote LFA.



This is a simplistic topology but it has a huge problem for direct LFA.
Let's protect the path from R1 to R4.

We have two paths:
R1 -> R4 (cost 1)
R1 -> R2 -> R3 -> R4 (cost 12)

Obviously R1 -> R4 is the primary path.
What does R2 see as its possible paths to R4?
R2 -> R1 -> R4 (Cost 2)
R2 -> R3 -> R4 (Cost 11)

R2 will always send traffic back to R1 when heading towards R4.

What about R3?
R3 -> R4 (Cost 6)
R3 -> R2 -> R1 -> R4 (Cost 7)

R3 would work for a backup path... if only we could get to R3 without R2 knowing what we're up to.

Enter Remote LFA.

R1(config-router)#int gig1.14
R1(config-subif)#mpls ip
R1(config-subif)#int gig1.12
R1(config-subif)#mpls ip
R1(config-subif)#mpls ldp discovery targeted-hello accept

R2(config-subif)#int gig1.12
R2(config-subif)#mpls ip
R2(config-subif)#int gig1.23
R2(config-subif)#mpls ip
R2(config-subif)#mpls ldp discovery targeted-hello accept

R3(config-subif)#int gig1.23
R3(config-subif)#mpls ip
R3(config-subif)#int gig1.34
R3(config-subif)#mpls ip
R3(config-subif)#mpls ldp discovery targeted-hello accept

R4(config-subif)#int gig1.14
R4(config-subif)#mpls ip
R4(config-subif)#int gig1.34
R4(config-subif)#mpls ip
R4(config-subif)#mpls ldp discovery targeted-hello accept

R1(config-router)#router ospf 1
R1(config-router)#fast-reroute per-prefix remote-lfa tunnel mpls-ldp

There's a complex algorithm that makes this work, but it's somewhat irrelevant from a CCIE v5 perspective. 

Here's what you really need to know:
- Direct LFA had to have failed to turn up a path already (direct is always tried first)
- A tunnel is built over targeted LDP.
- The destination tunnel router is picked on the following criteria:
   -  It must be in the same area as the router running LFA
   - The tunnel endpoint is picked from among the group of routers that can be reached through a next-hop other than the one you're trying to protect.
   - Of that group of routers, it's narrowed down to the subset that can reach your repair prefix without passing through the protecting router.
   - Those that qualify are called the PQ space (refer to the RFC for a lot more detail, but it may be overkill for a CCIE candidate) 
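
The criteria above can be sketched in a few lines of Python. This is a simplified reading of the P-space / Q-space idea, not the full algorithm from the RFC and certainly not Cisco's code. The per-link costs are inferred from the path costs listed earlier for this topology (R1-R4 = 1, R1-R2 = 1, R2-R3 = 5, R3-R4 = 6), so treat them as assumptions.

```python
import heapq

# Assumed link costs for the remote LFA topology.
LINKS = {("R1", "R4"): 1, ("R1", "R2"): 1, ("R2", "R3"): 5, ("R3", "R4"): 6}


def build_graph(links, skip=None):
    """Bidirectional adjacency map, optionally leaving out one link."""
    graph = {}
    for (a, b), cost in links.items():
        graph.setdefault(a, {})
        graph.setdefault(b, {})
        if skip is not None and {a, b} == set(skip):
            continue  # pretend the protected link is gone
        graph[a][b] = cost
        graph[b][a] = cost
    return graph


def spf(graph, root):
    """Dijkstra: shortest distance from root to every reachable node."""
    dist = {root: 0}
    queue = [(0, root)]
    while queue:
        d, node = heapq.heappop(queue)
        if d > dist.get(node, float("inf")):
            continue
        for neigh, cost in graph[node].items():
            nd = d + cost
            if nd < dist.get(neigh, float("inf")):
                dist[neigh] = nd
                heapq.heappush(queue, (nd, neigh))
    return dist


def pq_space(links, source, dest, protected_link):
    full = build_graph(links)
    pruned = build_graph(links, skip=protected_link)
    src_full, src_pruned = spf(full, source), spf(pruned, source)
    others = [n for n in full if n not in (source, dest)]
    # P-space: we reach the node just as cheaply without the protected link.
    p_space = {n for n in others if src_full[n] == src_pruned.get(n)}
    # Q-space: the node reaches the destination just as cheaply without it.
    q_space = {n for n in others if spf(full, n)[dest] == spf(pruned, n).get(dest)}
    return p_space, q_space, p_space & q_space


p, q, pq = pq_space(LINKS, "R1", "R4", ("R1", "R4"))
print(sorted(p), sorted(q), sorted(pq))
```

With these costs, R2 and R3 are both in the P space, only R3 is in the Q space, so R3 is the one PQ node, which matches the 3.3.3.3 tunnel endpoint in the output below.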

R1#sh ip route repair 4.4.4.4
Routing entry for 4.4.4.4/32
  Known via "ospf 1", distance 110, metric 2, type intra area
  Last update from 192.168.14.4 on GigabitEthernet1.14, 00:29:36 ago
  Routing Descriptor Blocks:
  * 192.168.14.4, from 4.4.4.4, 00:29:36 ago, via GigabitEthernet1.14
      Route metric is 2, traffic share count is 1
      Repair Path: 3.3.3.3, via MPLS-Remote-Lfa1
    [RPR]3.3.3.3, from 4.4.4.4, 00:29:36 ago, via MPLS-Remote-Lfa1
      Route metric is 12, traffic share count is 1

R1#sh ip int br | i MPLS
MPLS-Remote-Lfa1       192.168.12.1    YES unset  up                    up

This whole process is reasonably automatic. Just make sure your LDP is in good shape and targeted LDP is enabled, and you're good to go.

You can optionally specify areas and maximum costs:

R1(config-router)#fast-reroute per-prefix remote-lfa area 0 maximum-cost 10

The areas work the same way they did with direct LFA - we're just saying we only want to protect area 0, 1, 2, 3, etc. For remote LFA, the router you're running LFA on has to be in the area you're trying to protect - you can't protect area 5 if you're only an ABR for areas 0 and 1.

The maximum cost option restricts which prefixes you should be building tunnels for. In other words, it has nothing to do with the metric to reach the tunnel endpoint - it has to do with the prefix you're trying to protect.

Hope you enjoyed!

Jeff

Sunday, August 24, 2014

EIGRP Enhancements

Cisco did a major overhaul of EIGRP in recent IOS. These can be loosely looked at as new features in "EIGRP Named Mode". In reality, I suspect that the EIGRP teams were working on a series of new features, and they opted to renovate the interface at the same time, hence creating named mode.

We'll start with the new interface and then delve into all the new features one at a time.

Named EIGRP mode replaces the traditional EIGRP interface we're familiar with, and puts all the various commands into one configuration section.

The major distinguishing factor is the router process has a name instead of a number.

Old method:
router eigrp 100
 network 192.168.0.0 0.0.255.255

New equivalent method:
router eigrp SOMENAME
 !
 address-family ipv4 unicast autonomous-system 100
  !
  topology base
  exit-af-topology
  network 192.168.0.0 0.0.255.255
 exit-address-family

The name is completely arbitrary and is a local value. 

Interface settings that were previously configured on the interface, such as hello interval, authentication, etc, are now configured as part of the EIGRP named process:

router eigrp SOMENAME
 !
 address-family ipv4 unicast autonomous-system 100
  !
  af-interface GigabitEthernet1
   authentication mode md5
   authentication key-chain FOO
   hello-interval 10
   no split-horizon
  exit-af-interface
  !
  topology base
  exit-af-topology
  network 192.168.0.0 0.0.255.255
 exit-address-family

A traditional EIGRP process can be upgraded to named mode on newer IOS with this command:

Router(config)#router eigrp 101
Router(config-router)#eigrp upgrade-cli SOMENAME

The process also doesn't interrupt traffic flow.

That's the guts of the configuration reformatting, let's move on to features.

Wide Metrics
First and foremost, the metric has been reworked.

EIGRP named mode automatically uses wide metrics when speaking to another EIGRP named mode process; no additional configuration is necessary. If it's speaking to a traditional EIGRP process, it uses the old calculations.

The new metric is designed to be able to differentiate paths above 10 Gbps. The new metric essentially changes four things:
- Delay is now measured in picoseconds instead of tens of microseconds. 10 microseconds was the minimum previously.
- Bandwidth's scaling factor is made much larger, the calculation is now 10^7 * 65536 / Interface Bandwidth, as opposed to the original 10^7 * 256 / Interface Bandwidth.
- The overall metric is now 64 bit.
- The K6 value has been added "for future use", but Cisco has indicated this will be used for accumulated energy or accumulated jitter. Jitter is reasonably obvious. Energy is the actual electric power it takes to use an interface, so that you could literally do "least cost" routing based on how inexpensively the packet can be sent from the various interface types in a path.
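
To put numbers on the bandwidth scaling change, here's a quick Python sketch using the two formulas exactly as written above. The assumptions are mine: bandwidth is in kbps (the usual IOS interface bandwidth unit), and I'm using plain integer division, which may not match IOS's exact rounding. This is only the bandwidth term, not the full composite metric.

```python
def classic_bandwidth_term(bw_kbps):
    """Original scaling: 10^7 * 256 / interface bandwidth."""
    return 10**7 * 256 // bw_kbps


def wide_bandwidth_term(bw_kbps):
    """Wide-metric scaling: 10^7 * 65536 / interface bandwidth."""
    return 10**7 * 65536 // bw_kbps


for name, kbps in [("1G", 10**6), ("10G", 10**7), ("40G", 4 * 10**7), ("100G", 10**8)]:
    print(name, classic_bandwidth_term(kbps), wide_bandwidth_term(kbps))
```

The larger scaling factor leaves far more headroom between fast interfaces before integer truncation starts eating the difference.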

One important note here is that with wide metrics, the EIGRP calculated metric no longer fits into the RIB. For example:

Router#sh ip eigrp top 10.10.10.10/32 | i Composite metric
      Composite metric is (330301440/329646080), route is Internal

Router#sh ip route 10.10.10.10 | i Route metric
      Route metric is 2580480, traffic share count is 1

The EIGRP topology table indicates 330301440, the RIB says 2580480.  
The RIB's metric can't exceed 32 bits, and there are circumstances where the new, more granular metrics won't fit into the RIB. So all metrics, regardless of whether the value would fit into 32 bits, are divided by the rib-scale value. The rib-scale is 128 by default:

330301440/128 = 2580480

You can reassign it to any value 1 to 255:

router eigrp SOMENAME
 address-family ipv4 unicast autonomous-system 100
  metric rib-scale [1-255]

Here's a catch - I've gotten in the habit of using this command for redistributing into EIGRP when labbing:

redistribute <some other protocol> metric 1 1 1 1 1

Why? It's quick and easy to type if you're not trying to do traffic engineering.

Router#sh ip eigrp top 13.13.13.13/32 | i Composite metric
      Composite metric is (655361310720/655360655360), route is External

655361310720/128 = 5120010240

The largest number that can be represented in a 32-bit unsigned integer is 4,294,967,295.

5120010240 > 4294967295, therefore it cannot be represented in the RIB:

Router#sh ip route 13.13.13.13
% Network not in table

You read that right: This is a valid, routable prefix that simply can't make it into the RIB because of compatibility between the EIGRP topology table and the RIB. You need to adjust the rib-scale to make this work:

Router(config-router-af)#metric rib-scale 153
Router(config-router-af)#do sh ip route 13.13.13.13 | i Route metric
      Route metric is 4283407259, traffic share count is 1
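
If you'd rather not find a working rib-scale by trial and error, the arithmetic is easy to script. Here's a minimal Python sketch; the function names are mine, and the truncating division is an assumption that happens to match both show outputs above.

```python
UINT32_MAX = 2**32 - 1  # largest value an unsigned 32-bit RIB metric can hold


def rib_metric(composite_metric, rib_scale=128):
    """Metric handed to the RIB: composite metric divided by rib-scale."""
    return composite_metric // rib_scale


def min_rib_scale(composite_metric):
    """Smallest rib-scale that squeezes the metric into 32 bits."""
    return -(-composite_metric // UINT32_MAX)  # ceiling division


print(rib_metric(330301440))                # default scale, fits
print(rib_metric(655361310720))             # default scale, too large for the RIB
print(min_rib_scale(655361310720))          # smallest scale that works
print(rib_metric(655361310720, 153))        # matches the show ip route output
```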

I imagine that would make for a really good troubleshooting problem. "A route is being redistributed on R1 with a specific metric, but is not being installed in the RIB on R3. Do not change the metric on R1, or adjust with a route-map".

There are a few concerns with interoperability between the traditional EIGRP metric and the wide metrics, but not many. As I mentioned above, routers unable to understand wide metrics are auto-detected and sent the old metric, however, there are circumstances where a route might get depreffed after having passed through an older EIGRP process. For example, if two paths exist to a destination, one of them running entirely wide metrics and a different one running one router with traditional metrics, the traditional metric may make the entire path look worse and it may impact load share, or the ability to ECMP.

SHA Authentication
Now supporting more than just MD5:

R1(config-subif)#router eigrp TEST1
R1(config-router)#address-family ipv4 unicast autonomous-system 100
R1(config-router-af)#af-interface gig1.123
R1(config-router-af-interface)#authentication mode hmac-sha-256 CCIE

I think authentication would also make a great TS question - the authentication could be placed on the interface still, which named mode silently ignores. You'd need to know to look at the EIGRP named process to fix it:

interface GigabitEthernet1.123
 ip authentication key-chain eigrp 100 BOB  ! this does nothing when named mode is enabled.

Route Tag Enhancements
To be fair, the route tag enhancements aren't limited to EIGRP named mode - they work with OSPF, BGP, RIP, etc. They even work in the traditional (non-named) EIGRP syntax. However, I didn't think I needed to write a separate blog just to show it in every context, as they all basically work the same.

In short, the route tag enhancements allow the route tag to be formatted as a dotted decimal tag (looks like an IPv4 address) that can be matched either directly (in the traditional route tag method in route-map) or via a route-tag list. The route-tag list is where things get interesting.

R1:
interface Loopback1
 ip address 1.1.1.1 255.255.255.255
interface Loopback2
 ip address 2.2.2.2 255.255.255.255
interface Loopback3
 ip address 3.3.3.3 255.255.255.255
interface Loopback4
 ip address 4.4.4.4 255.255.255.255
interface Loopback5
 ip address 5.5.5.5 255.255.255.255
interface Loopback6
 ip address 6.6.6.6 255.255.255.255
interface Loopback7
 ip address 7.7.7.7 255.255.255.255

route-tag notation dotted-decimal

router eigrp TEST1
 !
 address-family ipv4 unicast autonomous-system 100
  !
  topology base
   redistribute connected route-map tag-routes

route-map tag-routes permit 10
 match interface Loopback1 Loopback2 Loopback3
 set tag 100.100.100.1
route-map tag-routes permit 20
 match interface Loopback4 Loopback5
 set tag 100.100.200.1
route-map tag-routes permit 30
 match interface Loopback6 Loopback7
 set tag 100.100.101.1

So we've set some dotted-decimal tags on R1, now let's filter on R2.

R2:
route-tag notation dotted-decimal
route-tag list binary-match seq 5 permit 100.100.0.0 0.0.254.255

route-map filter permit 10
 match tag list binary-match
 set metric 100 100 255 1 1500
route-map filter permit 20

router eigrp TEST2
 !
 address-family ipv4 unicast autonomous-system 100
  !
  topology base
   distribute-list route-map filter in GigabitEthernet1.123

Anyone who's done any amount of CCIE-level route filtering should catch what I just did. The route-tag list is looking for any tags that begin with 100.100 and have an even 3rd octet - if you need an explanation of filtering with wildcard masks there are many available on the Internet.

So now tags can be matched based on what bits are set in them -- very cool.

R2(config)#do sh ip eigrp top 1.1.1.1/32 | i Composite metric
      Composite metric is (6619136000/163840), route is External
R2(config)#do sh ip eigrp top 4.4.4.4/32 | i Composite metric
      Composite metric is (6619136000/163840), route is External
R2(config)#do sh ip eigrp top 6.6.6.6/32 | i Composite metric
      Composite metric is (1392640/163840), route is External

1.1.1.1 and 4.4.4.4 were tagged with 100.100.100.1 and 100.100.200.1 respectively, both even 3rd octets, and had their metric successfully rewritten by the route-map. 6.6.6.6, tagged with 100.100.101.1, was not matched, and retained its original metric.
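
For anyone who wants to sanity-check the wildcard logic off-box, here is a minimal Python sketch of the comparison. It assumes the route-tag list treats the tag like an ACL treats an address (a 0 bit in the wildcard means "must match", a 1 bit means "don't care"). The function name tag_matches is made up for illustration and is not anything IOS exposes.

```python
import ipaddress

def tag_matches(tag: str, base: str, wildcard: str) -> bool:
    """Return True if `tag` equals `base` on every bit where `wildcard` is 0."""
    tag_i = int(ipaddress.IPv4Address(tag))
    base_i = int(ipaddress.IPv4Address(base))
    # Invert the wildcard to get the bits we actually care about.
    care = ~int(ipaddress.IPv4Address(wildcard)) & 0xFFFFFFFF
    return (tag_i & care) == (base_i & care)

# The three tags set on R1, checked against R2's route-tag list entry.
for tag in ("100.100.100.1", "100.100.200.1", "100.100.101.1"):
    print(tag, tag_matches(tag, "100.100.0.0", "0.0.254.255"))
```

Wildcard 0.0.254.255 leaves only the lowest bit of the 3rd octet as a "care" bit, and the base value has that bit set to 0, which is why only even 3rd octets match.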

I immediately tried this in IPv6... however...

R2(config-router)#address-family ipv6 unicast autonomous-system 200
R2(config-router-af)#topology base
R2(config-router-af-topology)#distribute-list ?
  prefix-list  Filter connections based on an IPv6 prefix-list
R2(config-router-af-topology)#distribute-list route-map ?
% Unrecognized command

IPv6 can't be filtered ingress with route-maps yet. I didn't expect that. For anyone curious I'm on:

R2(config-router-af-topology)#do sh ver  | i IOS Software
Cisco IOS Software, CSR1000V Software (X86_64_LINUX_IOSD-UNIVERSALK9-M), Version 15.4(1)S1, RELEASE SOFTWARE (fc2)

There's one more option for setting tags:

router eigrp TEST1
  address-family ipv4 unicast autonomous-system 100
   eigrp default-route-tag 9.9.9.9

default-route-tag is fairly picky about what it will tag. From some tinkering, it will tag all routes except:
- Locally redistributed routes
- Routes that already had a tag set in some other fashion
- Routes it learned from another router

So in short, unless you learned the routes with the "network" statement, this tag won't take effect.

IPv6 VRF Lite

The traditional EIGRP process doesn't support IPv6 in a VRF.  

You also must use the new format - multiprotocol VRF -  for creating VRFs. 
Old format:
R2(config)#ip vrf FOO
R2(config-vrf)#rd 1:1
R2(config-vrf)#exit
R2(config)#int gig1.10
R2(config-subif)#ip vrf forwarding FOO

Multiprotocol VRF:
R2(config-vrf)#vrf definition FOO
R2(config-vrf)#rd 1:1
R2(config-vrf)#address-family ipv6 unicast
R2(config-vrf-af)#address-family ipv4 unicast
R2(config-vrf-af)#exit
R2(config-vrf)#int gig1.10
R2(config-subif)#vrf forwarding FOO

router eigrp SAMPLE
 !
 address-family ipv6 unicast vrf FOO autonomous-system 200
  !
  topology base
  exit-af-topology
  eigrp router-id 2.2.2.2
 exit-address-family

Note the eigrp router-id 2.2.2.2 line. Unless you have an IPv4 address in the routing table of the same VRF, you must specify the router ID manually. There is no parser error; it just doesn't work. Once again, this would make a great TS problem.

With IPv6, things work differently than IPv4 in named EIGRP mode. This process is already up:

*Sep  2 23:39:52.815: %DUAL-5-NBRCHANGE: EIGRP-IPv6 200: Neighbor FE80::20C:29FF:FEF7:FE11 (GigabitEthernet1.10) is up: new adjacency

However, note I haven't told it what interfaces to use. In our case, it automatically includes any interface that's in the appropriate VRF and has an IPv6 address on it. If you don't want to run EIGRP on an interface, you have to manually specify:

R2(config)#router eigrp SAMPLE
R2(config-router)#address-family ipv6 unicast vrf FOO autonomous-system 200
R2(config-router-af)#af-interface gig1.10
R2(config-router-af-interface)#shut

*Sep  2 23:47:10.304: %DUAL-5-NBRCHANGE: EIGRP-IPv6 200: Neighbor FE80::20C:29FF:FED7:2458 (GigabitEthernet1.10) is down: interface down

3rd party Next-Hop
While also not a feature specific to named mode, EIGRP has recently started supporting 3rd party next-hop. The concept is fairly simple. The easiest way I can explain it is with three routers on a single segment: R1, R2, and R3. They all share the 192.168.123.0/24 space between them. However, R1 and R2 speak EIGRP, and R2 and R3 speak OSPF. R1 doesn't speak OSPF, and R3 doesn't speak EIGRP. Assume there are extra routers behind R1 and R3 on different segments that are advertised in their respective routing protocols.

R2 is mutually redistributing between EIGRP and OSPF.

Without 3rd party next-hop, R1 would have to send traffic destined for the OSPF segments to R2, then R2 would have to forward it to R3. Inefficient and messy.

With 3rd party next-hop, R2 is permitted to use R3's address, even though it doesn't exist in the EIGRP process, when advertising routes to R1.

This is an automatic feature and requires only that R2 doesn't re-write the next-hop to itself (rewriting the next hop is default EIGRP behavior):

router eigrp TEST2
 !
 address-family ipv4 unicast autonomous-system 100
  !
  af-interface GigabitEthernet1.123
   no next-hop-self
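
As a rough model of the decision R2 is making (not Cisco's actual implementation), the Python sketch below picks the next hop R2 would advertise for a redistributed route. The host addresses 192.168.123.2 and 192.168.123.3 are assumptions following this blog's usual "last octet = router number" convention. The function name and the same-subnet check are also assumptions about how the feature behaves.

```python
import ipaddress

def advertised_next_hop(route_next_hop: str, own_ip: str,
                        segment: str, next_hop_self: bool) -> str:
    """Simplified: which next hop does R2 advertise out the shared segment?"""
    on_segment = ipaddress.ip_address(route_next_hop) in ipaddress.ip_network(segment)
    # Assumption: the original next hop is only kept when it lives on the
    # segment the update is sent out of AND next-hop-self is disabled.
    if on_segment and not next_hop_self:
        return route_next_hop
    return own_ip  # default behavior: "send the traffic to me"

# Hypothetical addressing: R2 = .2, R3 = .3 on 192.168.123.0/24
print(advertised_next_hop("192.168.123.3", "192.168.123.2", "192.168.123.0/24", True))
print(advertised_next_hop("192.168.123.3", "192.168.123.2", "192.168.123.0/24", False))
```

With next-hop-self left at its default, R1 is told to use R2; with "no next-hop-self", R1 is handed R3's address directly and the extra hop disappears.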

EIGRP Fast ReRoute (FRR)
The point of FRR is to generate Loop Free Alternates, or LFAs. What's an LFA?
An LFA is a back-up route that can be pre-programmed into the FIB as a repair route. If you're familiar with EIGRP, you might think "but EIGRP already has feasible successors". True, but it doesn't program those into the forwarding linecards. 

I wrote a rather lengthy article regarding BGP PIC and Add-Path two weeks ago, and I covered the problem that PIC was trying to solve, which is not necessarily easy to comprehend unless you've spent a great deal of time in a large service provider environment. PIC and FRR are trying to solve the same issue with different protocols. Rather than pasting the multi-page explanation I've already typed into this document as well, please reference that article to understand the issue.

The good news is that EIGRP doesn't require as complex an environment to explain FRR as it took to explain BGP PIC.

We already know EIGRP makes feasible successors, and can rely on those during reconvergence. But if we want the FIB to be able to swap over to a feasible successor as soon as the successor route is lost, we need to pre-program it.

In a nutshell, FRR simply picks the "best" feasible successor and sticks it in the FIB as a backup route.

There are two types of FRR, per-link and per-prefix. Per-link is only supported on IOS-XR at the time of this writing, so we'll be looking only at per-prefix.

First and foremost, we must ensure we have a feasible successor. If we have multiple successors (no feasibles), then we have ECMP - equal cost multi-path - and there's no need for FRR.

R1 has two paths to prefix 4.4.4.4 on R4, one via R2 and another via R3. I've deliberately de-prefed the route through R3. Note, if you're attempting to lab along with this, you'll want to create the depref on R1. If you're ECMP up until you create the depref on R1, you're guaranteed to have a feasible successor!

R1(config-subif)#int gig1.13
R1(config-subif)#delay 5000

R1#sh ip eigrp topo 4.4.4.4/32
EIGRP-IPv4 VR(TEST) Topology Entry for AS(100)/ID(192.168.12.1) for 4.4.4.4/32
  State is Passive, Query origin flag is 1, 1 Successor(s), FD is 2048000, RIB is 16000
  Descriptor Blocks:
  192.168.12.2 (GigabitEthernet1.12), from 192.168.12.2, Send flag is 0x0
      Composite metric is (2048000/1392640), route is Internal
      Vector metric:
        Minimum bandwidth is 1000000 Kbit
        Total delay is 21250000 picoseconds
        Reliability is 255/255
        Load is 1/255
        Minimum MTU is 1500
        Hop count is 2
        Originating router is 192.168.24.4
  192.168.13.3 (GigabitEthernet1.13), from 192.168.13.3, Send flag is 0x0
      Composite metric is (3278192640/1392640), route is Internal
      Vector metric:
        Minimum bandwidth is 1000000 Kbit
        Total delay is 50011250000 picoseconds
        Reliability is 255/255
        Load is 1/255
        Minimum MTU is 1500
        Hop count is 2
        Originating router is 192.168.24.4

Since the route via 192.168.13.3 (from R3) has an advertised distance (1392640) lower than the feasible distance of the successor route via 192.168.12.2 (2048000), it passes the feasibility condition and we now have a feasible successor.
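
The feasibility check itself is simple enough to sketch in a few lines of Python. This is only an illustration using the two composite metrics from the output above. The Path type and function name are invented, and the feasible distance is approximated as the current best metric (on a real router FD is the lowest metric seen since the route last went Passive).

```python
from typing import NamedTuple

class Path(NamedTuple):
    next_hop: str
    computed_distance: int   # our total metric via this neighbor
    reported_distance: int   # the neighbor's own metric (advertised distance)

def feasible_successors(paths):
    """Return non-successor paths whose reported distance is below FD."""
    fd = min(p.computed_distance for p in paths)  # simplification, see above
    return [p for p in paths
            if p.computed_distance > fd and p.reported_distance < fd]

# Values copied from "sh ip eigrp topo 4.4.4.4/32" on R1
paths = [
    Path("192.168.12.2", 2048000, 1392640),      # successor via R2
    Path("192.168.13.3", 3278192640, 1392640),   # de-prefed path via R3
]
print([p.next_hop for p in feasible_successors(paths)])
```

R3's reported distance of 1392640 is below the FD of 2048000, so it qualifies, no matter how bad R1's own metric through R3 is.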

R1(config)#router eigrp TEST
R1(config-router)# address-family ipv4 unicast autonomous-system 100
R1(config-router-af)#  topology base
R1(config-router-af-topology)#fast-reroute per-prefix all

R1#sh ip route 4.4.4.4 | i Repair
      Repair Path: 192.168.13.3, via GigabitEthernet1.13

R1#sh ip cef 4.4.4.4
4.4.4.4/32
  nexthop 192.168.12.2 GigabitEthernet1.12
    repair: attached-nexthop 192.168.13.3 GigabitEthernet1.13

It's very simple if we only have two paths, but what if there are 3 or more? Cisco uses what it calls "tie breakers", but I really dislike the name. We're not necessarily breaking a tie, because the selection criteria aren't comparing apples to apples. It's a bit more like a "2nd bestpath decision maker".

Before I list off the tie-breakers, let's look at what the problems might be if we had numerous paths to choose from.

Let's say we have multiple neighbors on a shared segment, with varying metrics to the destination we're trying to protect. Your bestpath is on that segment, as is your "second best" feasible successor, all hanging off the same interface on your router. If you're choosing the LFA purely based on metric, the same interface will get chosen for the backup path as is the primary route. That doesn't help us if that WAN link fails, or if the interface goes down, etc. 

Take that one step further and say your best-path and best feasible successor are both on the same linecard. That might also be a poor decision.

What I'm getting at is there's more to consider than just the metric in this scenario.

The four tie-breakers are:
- srlg-disjoint, priority 10
- interface-disjoint, priority 20
- lowest-backup-path-metric, priority 30
- linecard-disjoint, priority 40

Lower priority is better.

srlg-disjoint favors a backup-path/interface that isn't in the same Shared Risk Link Group (more below).

interface-disjoint favors a backup route that doesn't share the same interface for its next-hop. BEWARE, sub-interfaces are considered disjoint interfaces by the FRR process on my version of IOS-XE!

lowest-backup-path-metric favors a backup route with the lowest metric.

linecard-disjoint favors a backup route that doesn't share the same linecard.

So to clarify, by default, srlg-disjoint is evaluated first. If no SRLGs are set, or it can't separate the candidates, interface-disjoint is evaluated next. If the candidates are all already on different interfaces (or subinterfaces) from the primary, the lowest metric is picked. If the metric is the same, it looks for a port on a different linecard.
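
Here is one way to model that ordering in Python. This is a sketch of my reading of the priorities above, not Cisco's algorithm: each tie-breaker narrows the candidate list if it can, and is skipped if it would eliminate everything. The Path fields, the metrics, and the interface names in the example are all hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Path:
    name: str
    interface: str
    metric: int
    srlg: frozenset = frozenset()   # SRLG ids configured on the interface
    linecard: int = 0

def keep_preferred(candidates, preferred):
    """Keep candidates satisfying the rule; if none do, leave the list alone."""
    chosen = [c for c in candidates if preferred(c)]
    return chosen or candidates

def pick_repair_path(primary, candidates, priorities):
    rules = {
        "srlg-disjoint": lambda cs: keep_preferred(
            cs, lambda c: not (c.srlg & primary.srlg)),
        "interface-disjoint": lambda cs: keep_preferred(
            cs, lambda c: c.interface != primary.interface),
        "lowest-backup-path-metric": lambda cs: keep_preferred(
            cs, lambda c: c.metric == min(x.metric for x in cs)),
        "linecard-disjoint": lambda cs: keep_preferred(
            cs, lambda c: c.linecard != primary.linecard),
    }
    remaining = list(candidates)
    # Lower priority value is evaluated first.
    for name in sorted(priorities, key=priorities.get):
        remaining = rules[name](remaining)
        if len(remaining) == 1:
            break
    return remaining[0]

defaults = {"srlg-disjoint": 10, "interface-disjoint": 20,
            "lowest-backup-path-metric": 30, "linecard-disjoint": 40}

primary = Path("primary", "Gi1.12", 100)
backups = [Path("same-interface", "Gi1.12", 110),
           Path("other-interface", "Gi1.13", 500)]

print(pick_repair_path(primary, backups, defaults).name)
print(pick_repair_path(primary, backups,
                       {**defaults, "lowest-backup-path-metric": 5}).name)
```

With the default priorities the model picks the backup on the other interface despite its worse metric. Dropping lowest-backup-path-metric to priority 5 makes it pick the better-metric backup on the shared interface instead.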

So to start, what the heck is SRLG?

There's very little information on this feature that I can find, but the idea, as best I can tell, is that if you happen to know two physical links share some dependency (perhaps passing through the same L2 switch upstream, for example), you can tell IOS which ones have dependencies.

For example, if Gig1 and Gig2 on my router both passed through a single point of failure upstream, my config might look something like this:

R1(config)#int gig1
R1(config-if)#srlg gid 1
R1(config-if)#int gig2
R1(config-if)#srlg gid 1
R1(config-if)#int gig3
R1(config-if)#srlg gid 2

Note gig3 didn't necessarily need to get assigned to an srlg, but I included it for clarity.

I'm going to introduce a new path from R1 to R4 via R5. R1, R2 and R5 are all going to share a common link, meaning R1 routes to R2 and R5 on the same interface. I'm increasing delay slightly more on the path to R5. Furthermore, I'm going to prevent R2 and R5 from peering with one another, otherwise R5 would end up only advertising its bestpath from R2, and my topology breaks.

R1#sh ip eigrp top 4.4.4.4/32 | i Composite metric
      Composite metric is (78725120/13189120), route is Internal
      Composite metric is (3289989120/13189120), route is Internal
      Composite metric is (79380480/13844480), route is Internal

We see we've got three paths, let's look at those again with my comments:

R1#sh ip eigrp top 4.4.4.4/32 | i Composite metric
      Composite metric is (78725120/13189120), route is Internal = Path through gig1.12 via R2
      Composite metric is (3289989120/13189120), route is Internal = Path through gig1.13 via R3
      Composite metric is (79380480/13844480), route is Internal = Path through gig1.12 via R5

R1#sh ip cef 4.4.4.4
4.4.4.4/32
  nexthop 192.168.12.2 GigabitEthernet1.12
    repair: attached-nexthop 192.168.13.3 GigabitEthernet1.13

We can see IOS made a very smart move here, and it's in line with the priorities we discussed above. The backup path is not the best feasible successor from a metric standpoint, it's the less risky separate "interface" (again, IOS considers a subinterface a separate interface).

If we instead wanted it to choose based on metric:

R1(config)#router eigrp TEST
R1(config-router)# address-family ipv4 unicast autonomous-system 100
R1(config-router-af)#  topology base
R1(config-router-af-topology)#fast-reroute tie-break lowest-backup-path-metric 5

<<note I normally clear the eigrp neighbors here, these commands don't always seem to react quickly after the change>>

R1(config-router-af-topology)#do sh ip cef 4.4.4.4
4.4.4.4/32
  nexthop 192.168.12.2 GigabitEthernet1.12
    repair: attached-nexthop 192.168.12.5 GigabitEthernet1.12

Now we're preferring the backup path through the same interface, which has the better metric.

I'm not going to show the output from srlg-disjoint here, but I have labbed it previously and it does work - just set the srlg gid on the appropriate interfaces. Also, I have no way of labbing linecard-disjoint because I'm on a virtual router.

EIGRP Over The Top (OTP)
Does anyone besides me use the OTP abbreviation to mean "on the phone"? I wish they could've gone with OTT instead.

It is a really neat feature though - I know a lot of people will bash EIGRP as obsolete, proprietary, distance vector ... say what you will, among Cisco-powered enterprise networks it's the most popular IGP by a landslide. As a consultant, I would say 80% of the networks I come across run it.

Furthermore, finding enterprise network support personnel that are BGP experts is somewhat rare.

So what is one to do when MPLS separates all your sites, and your carrier (wisely) uses BGP as a PE->CE protocol? You hire a consultant to come in and make changes to the redistribution strategy periodically.

Or... you run EIGRP OTP, and toss the BGP work out the window.

OTP allows remote EIGRP peerings over any underlying IP network. All you need is reachability to the other EIGRP host. That means all your carrier needs to do is advertise the PE->CE link itself (probably a /30 between you and the carrier) in their MP-BGP, and the CE doesn't even need to run BGP (topology dependent). All the CE needs is a static default pointing at the PE router.

If you have more than a few CEs, you'll probably want an EIGRP Route Reflector, which isn't nearly as complicated as it sounds. An EIGRP RR listens for dynamic connections (optionally), and then disables split horizon and next-hop-self.

LISP provides the tunneling mechanism for the neighbors to reach one another. Fortunately, no LISP knowledge is required, the config is automatic.



Here, R2 - R5 represent the provider network, R1 and R7 represent isolated customer sites, and R6 and R8 represent a dual-homed customer site.

R7 will be our EIGRP route reflector.

Assume the provider is advertising the links between the CE and PE.
Here are the rest of the relevant configs:

R1:
router eigrp OTP-TEST
 !
 address-family ipv4 unicast autonomous-system 100
  !
  topology base
  exit-af-topology
  neighbor 192.168.37.7 GigabitEthernet1.12 remote 10 lisp-encap
  network 1.1.1.1 0.0.0.0
  network 192.168.12.0
 exit-address-family

ip route 0.0.0.0 0.0.0.0 192.168.12.2

Just to prove there's no BGP involved here:

R1#sh ip protocol sum
Index Process Name
0     connected
1     static
2     application
4     eigrp 100
*** IP Routing is NSF aware ***

R6:
router eigrp OTP-TEST
 !
 address-family ipv4 unicast autonomous-system 100
  !
  topology base
  exit-af-topology
  neighbor 192.168.37.7 GigabitEthernet1.46 remote 10 lisp-encap
  network 6.6.6.6 0.0.0.0
  network 192.168.46.0
  network 192.168.68.0
 exit-address-family

R8:
router eigrp OTP-TEST
 !
 address-family ipv4 unicast autonomous-system 100
  !
  topology base
  exit-af-topology
  neighbor 192.168.37.7 GigabitEthernet1.58 remote 10 lisp-encap
  network 8.8.8.8 0.0.0.0
  network 192.168.58.0
  network 192.168.68.0
 exit-address-family

R7 (route reflector):
router eigrp OTP-TEST
 !
 address-family ipv4 unicast autonomous-system 100
  !
  af-interface GigabitEthernet1.37
   no next-hop-self
   no split-horizon
  exit-af-interface
  !
  topology base
  exit-af-topology
  remote-neighbors source GigabitEthernet1.37 unicast-listen lisp-encap
  network 7.7.7.7 0.0.0.0
  network 192.168.37.0
 exit-address-family

The route reflector is also running BGP. Route reflectors can have a topology problem that requires this if you have backdoor links. In my case, if I only ran a default on the route reflector, I'd learn the link to R8 via EIGRP from R6, as opposed to using my default route. And vice-versa, R8 would advertise connectivity to R6, and my routes would continually flap because they'd learn next-hops via the LISP interface. It's a typical tunnel recursion loop issue. Running BGP puts the prefixes to reach R6 and R8 in R7's table at a lower AD and solves the problem.

Also note that the link between PE and CE must be advertised into EIGRP in order for LISP to come up.

Now we have full reachability to the EIGRP prefixes without the majority of the CEs running BGP, and none of the CEs advertising their EIGRP routes into it.

R1#sh ip route eigrp | b Gateway
Gateway of last resort is 192.168.12.2 to network 0.0.0.0

      6.0.0.0/32 is subnetted, 1 subnets
D        6.6.6.6 [90/93994331] via 192.168.46.6, 00:06:34, LISP0
      7.0.0.0/32 is subnetted, 1 subnets
D        7.7.7.7 [90/93994331] via 192.168.37.7, 00:06:35, LISP0
      8.0.0.0/32 is subnetted, 1 subnets
D        8.8.8.8 [90/93994331] via 192.168.58.8, 00:06:34, LISP0
D     192.168.68.0/24 [90/93998811] via 192.168.46.6, 00:06:34, LISP0

R1#sh ip cef 6.6.6.6
6.6.6.6/32
  nexthop 192.168.46.6 LISP0

R1#ping 6.6.6.6
Type escape sequence to abort.
Sending 5, 100-byte ICMP Echos to 6.6.6.6, timeout is 2 seconds:
!!!!!
Success rate is 100 percent (5/5), round-trip min/avg/max = 1/1/2 ms

Add-Path

Add-Path is the capability to advertise more than one bestpath to a neighbor. I've done a large write-up on the BGP implementation of it.

The Cisco documentation indicates a use case of DMVPN for EIGRP Add-Path, but that seems a pretty narrow use to me, as summarization with DMVPN phase 3 would make it useless. However, our scenario for OTP above is perfect! 

R1#sh ip route eigrp | b Gateway
Gateway of last resort is 192.168.12.2 to network 0.0.0.0

      6.0.0.0/32 is subnetted, 1 subnets
D        6.6.6.6 [90/93994331] via 192.168.46.6, 00:06:34, LISP0
      7.0.0.0/32 is subnetted, 1 subnets
D        7.7.7.7 [90/93994331] via 192.168.37.7, 00:06:35, LISP0
      8.0.0.0/32 is subnetted, 1 subnets
D        8.8.8.8 [90/93994331] via 192.168.58.8, 00:06:34, LISP0
D     192.168.68.0/24 [90/93998811] via 192.168.46.6, 00:06:34, LISP0

R1 only learns one path to 192.168.68.0/24. Two are available, so why can't we install both? It's the same problem as with BGP: the EIGRP route reflector only sends its one best path.

R7(config)#router eigrp OTP-TEST
R7(config-router)# address-family ipv4 unicast autonomous-system 100
R7(config-router-af)#  af-interface GigabitEthernet1.37
R7(config-router-af-interface)#add-paths 2

R1#sh ip route eigrp | b Gateway
Gateway of last resort is 192.168.12.2 to network 0.0.0.0

      6.0.0.0/32 is subnetted, 1 subnets
D        6.6.6.6 [90/93994331] via 192.168.46.6, 00:12:51, LISP0
      7.0.0.0/32 is subnetted, 1 subnets
D        7.7.7.7 [90/93994331] via 192.168.37.7, 00:12:52, LISP0
      8.0.0.0/32 is subnetted, 1 subnets
D        8.8.8.8 [90/93994331] via 192.168.58.8, 00:12:51, LISP0
D     192.168.68.0/24 [90/93998811] via 192.168.58.8, 00:00:26, LISP0
                      [90/93998811] via 192.168.46.6, 00:00:26, LISP0

And we've got multiple redundant paths to 192.168.68.0/24 now!

Note, EIGRP add-path is incompatible with variance.

Hope you enjoyed,

Jeff