Sunday, September 7, 2014

CCIE v4 to v5: BGP NHT, SAT, FSD, Dynamic Neighbors, Multisession Transport Per AF

BGP Next Hop Tracking (NHT) is an on-by-default feature that notifies BGP of a change in routing for BGP prefix next-hops. This is important because previously this only happened as part of the BGP Scanner process, which runs every 60 seconds by default. Waiting up to 60 seconds to determine that a BGP route is effectively no longer valid (because of an invalid next-hop) significantly hampers reconvergence. Instead of being timer-based, NHT makes the process of dealing with next-hop changes event-driven.
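As a quick reference before we lab it, these are the NHT knobs we'll be touching throughout this post, collected in one place (IOS/IOS-XE syntax):

router bgp 1
 ! NHT is on by default; this turns it off
 no bgp nexthop trigger enable
 ! turn it back on, and lower the dampening delay from the 5-second default
 bgp nexthop trigger enable
 bgp nexthop trigger delay 2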



EIGRP is peered on all routers on the 192.168.124.0/24 link.

Here's the relevant base BGP config:

R1:
router bgp 1
 bgp log-neighbor-changes
 neighbor 3.3.3.3 remote-as 3
 neighbor 3.3.3.3 ebgp-multihop 255
 neighbor 3.3.3.3 update-source Loopback0
 neighbor 4.4.4.4 remote-as 4
 neighbor 4.4.4.4 ebgp-multihop 255
 neighbor 4.4.4.4 update-source Loopback0

R3:
router bgp 3
 bgp log-neighbor-changes
 neighbor 1.1.1.1 remote-as 1
 neighbor 1.1.1.1 ebgp-multihop 255
 neighbor 1.1.1.1 update-source Loopback0
 neighbor 192.168.34.4 remote-as 4

R4:
interface Loopback1
 ip address 44.44.44.44 255.255.255.255

router bgp 4
 bgp log-neighbor-changes
 network 44.44.44.44 mask 255.255.255.255
 neighbor 1.1.1.1 remote-as 1
 neighbor 1.1.1.1 ebgp-multihop 255
 neighbor 1.1.1.1 update-source Loopback0
 neighbor 192.168.34.3 remote-as 3

In short, we're using ebgp-multihop to keep the mock-up smaller. We have two paths from R1 to R4's 44.44.44.44:

R1 -> R4's 4.4.4.4 (and consequently to 44.44.44.44 in the same hop)
R1 -> R3's 3.3.3.3, then R3 to R4's 192.168.34.4 

The first route has one AS in its AS-path; the second route has two ASes and is less preferred.

R1#sh ip bgp 44.44.44.44 bestpath
BGP routing table entry for 44.44.44.44/32, version 11
Paths: (2 available, best #1, table default)
  Advertised to update-groups:
     2
  Refresh Epoch 2
  4
    4.4.4.4 (metric 10880) from 4.4.4.4 (44.44.44.44)
      Origin IGP, metric 0, localpref 100, valid, external, best
      rx pathid: 0, tx pathid: 0x0

Let's try this experiment without NHT enabled first:

R1(config)#router bgp 1
R1(config-router)# no bgp nexthop trigger enable

R1#debug ip routing
IP routing debugging is on

R4(config-if)#int lo0  ! this is the 4.4.4.4 interface (the next-hop for 44.44.44.44 from R1)
R4(config-if)#shut

Debug from R1 below
===============
*Sep 17 22:59:03.552: RT: delete route to 4.4.4.4 via 192.168.124.4, eigrp metric [90/10880]
*Sep 17 22:59:03.552: RT: no routes to 4.4.4.4, delayed flush
*Sep 17 22:59:03.552: RT: delete subnet route to 4.4.4.4/32
*Sep 17 22:59:03.552: RT: updating eigrp 4.4.4.4/32 (0x0)  :
    via 192.168.124.4 Gi1.124  0 1048578

*Sep 17 22:59:03.552: RT: rib update return code: 5
================

This happened as fast as EIGRP converged - very quickly.  So we know 4.4.4.4 isn't a valid route any longer, but what about 44.44.44.44?

R1#sh ip bgp 44.44.44.44 bestpath
BGP routing table entry for 44.44.44.44/32, version 11
Paths: (2 available, best #1, table default)
  Advertised to update-groups:
     2
  Refresh Epoch 2
  4
    4.4.4.4 (metric 10880) from 4.4.4.4 (44.44.44.44)
      Origin IGP, metric 0, localpref 100, valid, external, best
      rx pathid: 0, tx pathid: 0x0

R1#ping 44.44.44.44
Type escape sequence to abort.
Sending 5, 100-byte ICMP Echos to 44.44.44.44, timeout is 2 seconds:
.....
Success rate is 0 percent (0/5)

Still thinking the next-hop is 4.4.4.4, and it's Very Down.

I didn't time it this way specifically, but remember the scan timer runs every 60 seconds. So 51 seconds after we yanked the 4.4.4.4 next-hop, BGP finally figured out something was up and reconverged to the alternate path for 44.44.44.44 via R3.

*Sep 17 22:59:54.031: RT: updating bgp 44.44.44.44/32 (0x0)  :
    via 3.3.3.3   0 1048577

*Sep 17 22:59:54.031: RT: closer admin distance for 44.44.44.44, flushing 1 routes
*Sep 17 22:59:54.031: RT: add 44.44.44.44/32 via 3.3.3.3, bgp metric [20/0]

R1#ping 44.44.44.44
Type escape sequence to abort.
Sending 5, 100-byte ICMP Echos to 44.44.44.44, timeout is 2 seconds:
!!!!!
Success rate is 100 percent (5/5), round-trip min/avg/max = 1/1/3 ms

R1#trace 44.44.44.44
Type escape sequence to abort.
Tracing the route to 44.44.44.44
VRF info: (vrf in name/id, vrf out name/id)
  1 192.168.124.3 4 msec 1 msec 0 msec
  2 192.168.34.4 2 msec *  2 msec

A 51 second reconverge in a modern network is pretty awful.

R4(config-if)#int lo0
R4(config-if)#no shut

Let's re-add the next-hop trigger and try again.

R1(config-router)#router bgp 1
R1(config-router)#bgp nexthop trigger enable

R4(config-if)#int lo0
R4(config-if)#shut

Debug from R1 below
===============
*Sep 17 23:11:53.582: RT: delete route to 4.4.4.4 via 192.168.124.4, eigrp metric [90/10880]
*Sep 17 23:11:53.582: RT: no routes to 4.4.4.4, delayed flush
*Sep 17 23:11:53.582: RT: delete subnet route to 4.4.4.4/32
*Sep 17 23:11:53.582: RT: updating eigrp 4.4.4.4/32 (0x0)  :
    via 192.168.124.4 Gi1.124  0 1048578

*Sep 17 23:11:53.582: RT: rib update return code: 5
*Sep 17 23:11:58.582: RT: updating bgp 44.44.44.44/32 (0x0)  :
    via 3.3.3.3   0 1048577

*Sep 17 23:11:58.582: RT: closer admin distance for 44.44.44.44, flushing 1 routes
*Sep 17 23:11:58.582: RT: add 44.44.44.44/32 via 3.3.3.3, bgp metric [20/0]
===============

Note the bottom two lines of output: this time we see the reconvergence - in 5 seconds. Why 5 seconds?

The bgp nexthop trigger delay timer defines how long the NHT process waits before updating BGP. This timer exists to prevent BGP from being beaten up by a flapping IGP route: at the 5-second default, the BGP process can't get bogged down by unnecessary updates.

Let's set it to 2 and try again.

R1(config-router)#bgp nexthop trigger delay 2

Debug from R1 below
===============
*Sep 17 23:18:40.167: RT: delete route to 4.4.4.4 via 192.168.124.4, eigrp metric [90/10880]
*Sep 17 23:18:40.167: RT: no routes to 4.4.4.4, delayed flush
*Sep 17 23:18:40.167: RT: delete subnet route to 4.4.4.4/32
*Sep 17 23:18:40.167: RT: updating eigrp 4.4.4.4/32 (0x0)  :
    via 192.168.124.4 Gi1.124  0 1048578

*Sep 17 23:18:40.167: RT: rib update return code: 5
*Sep 17 23:18:42.168: RT: updating bgp 44.44.44.44/32 (0x0)  :
    via 3.3.3.3   0 1048577

*Sep 17 23:18:42.168: RT: closer admin distance for 44.44.44.44, flushing 1 routes
*Sep 17 23:18:42.168: RT: add 44.44.44.44/32 via 3.3.3.3, bgp metric [20/0]
===============

Now converging at 2 seconds.

The ability to apply a route-map to the NHT process is provided by a feature called Selective Address Tracking, or SAT.

The route-map determines which prefixes are considered valid routes for resolving next-hops.

For example, say 4.4.4.4 is your desired next hop, but you also have a default route on your router. If you lose 4.4.4.4/32, do you want the router to consider 4.4.4.4 reachable via the default? Potentially not.

R1(config)#ip route 0.0.0.0 0.0.0.0 192.168.124.10  ! Deliberately non-existent next-hop

Without the route map....

R4(config-if)#int lo0
R4(config-if)#shut

This is hard to demonstrate, because the prefix might never recover. In our over-simplified mock-up, the BGP session would fail at hold-time expiry (because 4.4.4.4 is also our peer address) before the prefix vanished for good; in a more realistic design this could be a permanent black hole.

We still have the bogus static default route in place:
ip route 0.0.0.0 0.0.0.0 192.168.124.10

R1(config-router)#ip prefix-list onlyloops seq 5 permit 0.0.0.0/0 ge 32
R1(config)#route-map SAT permit 10
R1(config-route-map)# match ip address prefix-list onlyloops
R1(config-route-map)#router bgp 1
R1(config-router)# bgp nexthop route-map SAT

This config only allows for /32s as viable next-hops.

R4(config-if)#int lo0
R4(config-if)#shut

Debug from R1 below
===============
*Sep 17 23:47:09.497: RT: delete route to 4.4.4.4 via 192.168.124.4, eigrp metric [90/10880]
*Sep 17 23:47:09.497: RT: no routes to 4.4.4.4, delayed flush
*Sep 17 23:47:09.497: RT: delete subnet route to 4.4.4.4/32
*Sep 17 23:47:09.497: RT: updating eigrp 4.4.4.4/32 (0x0)  :
    via 192.168.124.4 Gi1.124  0 1048578

*Sep 17 23:47:09.497: RT: rib update return code: 5
*Sep 17 23:47:11.498: RT: updating bgp 44.44.44.44/32 (0x0)  :
    via 3.3.3.3   0 1048577

*Sep 17 23:47:11.498: RT: closer admin distance for 44.44.44.44, flushing 1 routes
*Sep 17 23:47:11.499: RT: add 44.44.44.44/32 via 3.3.3.3, bgp metric [20/0]
===============

Now reconverging in 2 seconds again!

This is great for the downstream prefix, but what about the neighbor session itself?

This could work...
R1(config-router)#neighbor 4.4.4.4 fall-over

Except that pesky default is keeping 4.4.4.4 supposedly reachable....
For brevity, I'll tell you that, as expected, when I shut the Lo0 interface on R4, 4.4.4.4 was pulled from R1's IGP and 44.44.44.44 was pulled from R1's BGP table. However, the session stayed up!

The same concept (even the same route-map) can be applied to the neighbor fall-over statement. This feature is called Fast Session Deactivation (FSD). 

R1(config-router)#neighbor 4.4.4.4 fall-over route-map SAT ! re-using SAT's route-map

Debug from R1 below
===============
*Sep 18 00:11:08.107: %BGP-5-NBR_RESET: Neighbor 4.4.4.4 reset (Route to peer lost)
*Sep 18 00:11:08.107: %BGP-5-ADJCHANGE: neighbor 4.4.4.4 Down Route to peer lost
*Sep 18 00:11:08.107: %BGP_SESSION-5-ADJCHANGE: neighbor 4.4.4.4 IPv4 Unicast topology base removed from session  Route to peer lost
===============

And the BGP session gets torn down immediately.

I'm not sure of the use case for this next feature, but it was recommended as a topic, so I looked at it. Multisession Transport per AF appears to be related to Multi-Topology Routing (MTR), but MTR should be solidly out-of-scope for CCIE R&S v5.

What multisession transport does is open a separate TCP session for each address family.

I've erased all the BGP config from the previous task.

R1:
ipv6 unicast-routing

router bgp 100
 bgp log-neighbor-changes
 neighbor 4.4.4.4 remote-as 100
 neighbor 4.4.4.4 update-source Loopback0
 !
 address-family ipv4
  neighbor 4.4.4.4 activate
 exit-address-family
 !
 address-family vpnv4
  neighbor 4.4.4.4 activate
  neighbor 4.4.4.4 send-community extended
 exit-address-family
 !
 address-family ipv6
  neighbor 4.4.4.4 activate
 exit-address-family

R4:
ipv6 unicast-routing

router bgp 100
 bgp log-neighbor-changes
 neighbor 1.1.1.1 remote-as 100
 neighbor 1.1.1.1 update-source Loopback0
 !
 address-family ipv4
  neighbor 1.1.1.1 activate
 exit-address-family
 !
 address-family vpnv4
  neighbor 1.1.1.1 activate
  neighbor 1.1.1.1 send-community extended
 exit-address-family
 !
 address-family ipv6
  neighbor 1.1.1.1 activate
 exit-address-family

R1(config-router-af)#do show tcp brief
TCB       Local Address               Foreign Address             (state)
7F612C7742A0  1.1.1.1.40234              4.4.4.4.179                 ESTAB

Three families, one TCP session.

R1(config-router)#neighbor 4.4.4.4 transport multi-session

R4(config-router)#neighbor 1.1.1.1 transport multi-session

The two sides of the session do need to agree on the setting.

R1:
*Sep 18 00:31:19.102: %BGP-5-ADJCHANGE: neighbor 4.4.4.4 Up
*Sep 18 00:31:25.940: %BGP-5-ADJCHANGE: neighbor 4.4.4.4 session 2 Up
*Sep 18 00:31:28.322: %BGP-5-ADJCHANGE: neighbor 4.4.4.4 session 3 Up

R1(config-router)#do show tcp brief
TCB       Local Address               Foreign Address             (state)
7F612C76F0F0  1.1.1.1.179                4.4.4.4.30092               ESTAB
7F612C76DE20  1.1.1.1.179                4.4.4.4.42417               ESTAB
7F612C76E788  1.1.1.1.48539              4.4.4.4.179                 ESTAB

Our last topic is BGP Dynamic Neighbors. Yes, automagic BGP peerings!

Erasing all the pre-existing BGP config again...

R1:
router bgp 100
 bgp log-neighbor-changes
 bgp listen range 192.168.124.0/24 peer-group PEERS
 neighbor PEERS peer-group
 neighbor PEERS remote-as 100
 neighbor PEERS password CISCO
 neighbor PEERS update-source Loopback0
 neighbor PEERS route-reflector-client
 bgp listen limit 3

R2-R4:
router bgp 100
 bgp log-neighbor-changes
 neighbor 192.168.124.1 remote-as 100
 neighbor 192.168.124.1 password CISCO

R1:
*Sep 18 00:38:24.696: %BGP-5-ADJCHANGE: neighbor *192.168.124.2 Up
*Sep 18 00:39:04.980: %BGP-5-ADJCHANGE: neighbor *192.168.124.4 Up
*Sep 18 00:39:05.932: %BGP-5-ADJCHANGE: neighbor *192.168.124.3 Up

iBGP doesn't get any faster to set up than that!

I've used the most obvious settings here - the dynamic "host" would normally be a route-reflector, and would normally require authentication. 

However, you can:
- Run multiple dynamic groups
- Listen to multiple ranges
- Use multiple address families (this works great for VPNv4!)
- Listen for more neighbors (I limited it to 3 above)
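As a sketch of a couple of those options (the 10.50.0.0/16 range and the DC-PEERS group name are made up for illustration):

router bgp 100
 bgp listen limit 20
 bgp listen range 192.168.124.0/24 peer-group PEERS
 bgp listen range 10.50.0.0/16 peer-group DC-PEERS
 neighbor DC-PEERS peer-group
 neighbor DC-PEERS remote-as 100
 !
 address-family vpnv4
  neighbor PEERS activate
  neighbor PEERS send-community extended
 exit-address-family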

Cheers,

Jeff

CCIE v4 to v5 Updates: NTPv4 and Netflow

I didn't find these updates on any Cisco or third-party list, but when I wrote my original NTP and Netflow blogs in mid-2013, I flagged several topics as out-of-scope because they weren't supported on IOS 12.4(15)T. Now that v5 is out, all those topics are back in scope, so I decided to blog them.

Here are the original articles this one builds off of:

http://brbccie.blogspot.com/2013/05/ntp.html
http://brbccie.blogspot.com/2013/06/netflow.html

The topics we'll be covering specifically are:
- Netflow w/ NBAR
- IPFIX (Netflow v10)
- NTPv4 (IPv6 support)
- NTPv4 Multicast NTP
- NTP Panic
- NTP Maxdistance
- NTP Orphan

Netflow
First, I wanted to mention an omission from my original blog. At the time, I didn't have a collector that supported Flexible Netflow, so I evaluated FNF via Wireshark. That was fairly effective, except I was missing a major element of Netflow: the bytes transferred! I'm now using a collector that supports FNF, and I immediately noticed I wasn't graphing any traffic.

flow record JIMBO
 match ipv4 source address
 match ipv4 destination address
 collect counter bytes
 collect counter packets

This is a simple, working FNF config. Matching or collecting counter bytes and counter packets is what makes Netflow do what you're used to it doing - measuring traffic.

What's the advantage of integrating NBAR with Netflow?
By default, Netflow only exports very high-level protocol information. Integrating NBAR sends very specific, granular protocol detail to the collector. Note that your collector needs to specifically support this; it's not a small change at the protocol level.

If you're familiar with how the FNF template is sent out every so often, the NBAR table works very similarly. At specified intervals, IOS sends out a rather large (many-packet) template mapping each NBAR application name to an ID; those IDs are then sent in the Netflow records to identify the protocol.

There are several other blogs out there that give big, complex templates for integrating NBAR with Netflow. I took a few of these as a base and worked backwards to the real requirements. This is not a hard thing to enable. Your flow record must contain collect application name (or match application name), and optionally you can tune the frequency of the NBAR FNF template being sent out with option application-table timeout in the exporter.

Here's a working config:

flow record FNF-RECORD
 match ipv4 source address
 match ipv4 destination address
 collect counter bytes
 collect counter packets
 collect application name 

flow exporter FNF-EXPORTER
 destination 192.168.0.5
 source GigabitEthernet1
 transport udp 9996
 template data timeout 60
 option application-table timeout 30

flow monitor FNF-MONITOR
 exporter FNF-EXPORTER
 cache timeout inactive 60
 cache timeout active 60
 record FNF-RECORD

interface gig1
 ip flow monitor FNF-MONITOR input

Netflow was recently made an open standard with v10. The open version is called IPFIX. To enable IPFIX output instead of FNF v9, you would:

flow exporter FNF-EXPORTER
 export-protocol ipfix

Note I haven't tested this beyond checking it in Wireshark, because I still don't have a collector that speaks IPFIX.

NTP



The big difference in NTPv4 is IPv6 support. There's really not much to cover on the basics... broadcast NTP is gone for IPv6 (IPv6 has no broadcast), but Multicast NTP still works the same general way it did over IPv4.

R1(config)#ntp master 4

R2(config)#ntp server 1::1

R2#show ntp association detail
1::1 configured, ipv6, our_master, sane, valid, stratum 4
ref ID 127.127.1.1    , time D7C45F20.4AC083E0 (19:27:28.292 UTC Wed Sep 17 2014)
<output omitted>

Really quite simple.

15.x implementations of NTP now leave domain names in the config.
Pre 15.x:
foo.com(config)#ip host foo.com 4.4.4.4
foo.com(config)#ntp server foo.com
foo.com(config)#do sh run | i ntp
ntp server 4.4.4.4

It would translate the hostname to an IP address and save the IP address in the config - not a good thing if the server changes IPs.

Post 15.x:
R2(config)#ip host test.com 4.1.1.1
R2(config)#ntp server test.com
R2(config)#do sh run | i ntp
ntp server test.com

Let's take a look at the multicast option. As IPv6 multicast has blessedly been removed from the v5 blueprint, I'm going to cheap out and perform non-routed/same-link multicast.

R2(config)#no ntp server 1::1

R1(config)#ntp authentication-key 1 md5 CISCO
R1(config)#ntp trusted-key 1
R1(config)#int gig1.123
R1(config-subif)#ntp multicast FF02::123 key 1

R2(config)#ntp authentication-key 1 md5 CISCO
R2(config)#ntp trusted-key 1
R2(config)#ntp authenticate
R2(config)#int gig1.123
R2(config-subif)#ntp multicast client FF02::123

R2(config-subif)#do show ntp ass det
FE80::20C:29FF:FEB6:3557 dynamic, ipv6, authenticated, our_master, sane, valid, stratum 4
ref ID 127.127.1.1    , time D7C460E0.4AC083E0 (19:34:56.292 UTC Wed Sep 17 2014)

Maxdistance, for me, is very confusing. It appears to be a trust value. It's normally modified in NTPv4 in order to speed up convergence. As I understand it, the higher the value, the faster synchronization will happen, because the upstream time will be trusted sooner. The algorithm appears to combine half the root delay with the dispersion, and if that value is lower than maxdistance, it's OK to consider yourself in sync. My labbing did not produce exactly that outcome, but it was extremely hard to say for sure because NTPv4 converges very quickly. Because you basically have to be a time expert to understand what this does, I would hope the CCIE lab would be limited to two types of questions on it:
1) Set it to some value they provide
2) Set it to "slowest" convergence (1) or "fastest" convergence (16)

R1(config)#ntp maxdistance ?
  <1-16>  Maximum distance for synchronization

NTP Panic is simple:

R2(config)#ntp panic ?
  update  Reject time updates > panic threshold (default 1000Sec)

It does just what it says - if my peer or configured master's clock is more than 1,000 seconds off of my clock, reject the update and syslog:

.Sep  8 00:51:00.155: NTP Core (ERROR): Time correction of nan seconds exceeds sanity limit of 0. seconds. Set clock manually to the correct UTC time.
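For completeness, enabling it is just the one command shown in the parser help above (a sketch on R2):

R2(config)#ntp panic update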

NTP Orphan is really cool. It seems like an obvious feature now that I've seen it, but I can imagine this is a huge help for smaller organizations that rely heavily on NTP.

Let's say, from our diagram, R1 is an Internet time server that our fictional organization uses as its sole NTP master. R2 and R3 are edge routers inside the company, and R4 and R5 will represent servers querying R2 and R3.

So to be clear, R2 and R3 get their time from R1, and also peer with one another (so if R3 can't reach R1 but R2 can, R3 can learn its time via R2).  R4 and R5 query R2 and R3 for time, respectively.

Relevant config:
R1(config)#ntp master 4

R2(config)#int gig1.123
R2(config-subif)#no  ntp multicast client FF02::123
R2(config-subif)#no ntp authenticate
R2(config)#ntp server 1::1
R2(config)#ntp peer 3::3
R2(config)#ntp source lo0

R3(config)#ntp server 1::1
R3(config)#ntp peer 2::2
R3(config)#ntp source lo0

R4(config)#ntp server 2::2

R5(config)#ntp server 3::3

At this point every device has the up-to-date time.

Now let's say R1 goes offline.
R1(config)#int lo0
R1(config-if)#shut

<<wait a while>>

R2(config)#do show ntp status
Clock is unsynchronized, stratum 16, no reference clock
<output omitted>

R3(config)#do show ntp status
Clock is unsynchronized, stratum 16, no reference clock
<output omitted>

and obviously R4 and R5 share the same fate.

What if we could program R2 and R3 to take their best stab at what the time should be - mind you, we're talking about only a couple of minutes since the last sync, so the time is probably still very close to accurate - and then temporarily and seamlessly take over the NTP master role if they lose a valid clock from R1?

This is exactly what NTP Orphan does.

The config is extremely complicated:

R2(config)#ntp orphan 6

R3(config)#ntp orphan 6

(I was joking about the complicated part)

Really, that's it.  Now let's understand what's happening. Orphan kicks in when we lose sync with our server. The 6 here is a stratum number, and it must be numerically higher (i.e. worse) than your real upstream NTP server's stratum - otherwise the failover/fail-back mechanism won't work right.

Best practice is to configure the same Orphan stratum on all devices running Orphan, and to peer all the Orphans with one another so that only one is "elected" as the temporary Orphan master.
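A sketch of that best practice for our R2/R3 pair (this simply restates the peer and orphan statements already applied above):

R2(config)#ntp orphan 6
R2(config)#ntp peer 3::3

R3(config)#ntp orphan 6
R3(config)#ntp peer 2::2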

R2(config)#do show ntp status
Clock is synchronized, stratum 6, reference is 127.0.0.1
<output omitted>

We see R2 is now stratum 6, synchronized with its own virtual Orphan server.

R3(config)#do show ntp status
Clock is synchronized, stratum 7, reference is 26.33.33.239
<output omitted>

R3 is synchronized with R2 as its Master. 

R4#show ntp status
Clock is synchronized, stratum 7, reference is 26.33.33.239

R4 is synchronized with R2 as its master.

R5#show ntp status
Clock is synchronized, stratum 9, reference is 24.235.166.45

R5 is synchronized with R3 as its master.

Now for the most important part of this feature: fail-back. Let's re-activate R1.

R1(config)#int lo0
R1(config-if)#no shut

R3 was first to recover:
R3(config)#do show ntp association detail
1::1 configured, ipv6, our_master, sane, valid, stratum 4

It automatically shut down its Orphan process when it synced to the superior stratum 4.

R5 then received the now-correct time from R3:
R5#show ntp association detail
3::3 configured, ipv6, our_master, sane, valid, stratum 5

Cheers,

Jeff Kronlage


Saturday, September 6, 2014

OSPF LFA & Remote LFA

Continuing on the same track as my recent posts regarding EIGRP FRR and BGP PIC/Add-path, today I'm writing about OSPF LFA. OSPF FRR/LFA accomplishes the same concept as EIGRP FRR, but in a much more elegant and thorough fashion.

As I did in my EIGRP article, I'm going to reference back to the BGP PIC article, as that has a lengthy explanation of why fast reroute is important. If you don't understand the use case, please read this article first:

http://brbccie.blogspot.com/2014/08/bgp-pic-and-add-path.html

Again building off former articles, the EIGRP method of LFA is dead simple: take the feasible successor and pre-install it in the FIB for faster convergence.

http://brbccie.blogspot.com/2014/08/eigrp-enhancements.html

I genuinely like this approach, because it's very easy to understand. If you're savvy enough to engineer for feasible successors, you can literally just turn on this feature and it works.

OSPF takes this idea to a whole new level. Obviously, OSPF does not have a concept of feasible successors, but it does have a huge advantage: because, in the same area, the OSPF database is identical among all routers, OSPF can run the SPF algorithm with a neighboring router as root. The advantage of this is being able to find a loop-free alternate path in complex topologies that would have failed the feasible successor check in EIGRP. When we look at Remote LFA, we can even tunnel to distant routers to form loop-free paths, all calculated via the router running FRR.

Note - much like EIGRP, OSPF on IOS does not support per-link LFA, so we will only be examining per-prefix LFA.  IOS-XR supports both per-prefix and per-link.



All links have an IP address of 192.168.YY.X, where YY is the lower router number followed by the higher router number, and X is the router number (e.g. on the link facing R4, R1's IP address is 192.168.14.1). Each router has a loopback0 address of X.X.X.X, where X is the router number.

Consider this diagram, with R1 attempting to reach R5 (5.5.5.5).

R1(config)#router ospf 1
R1(config-router)#fast-reroute per-prefix enable area 0 prefix-priority low

The primary path is obvious: R1 -> R2 -> R5
The backup path requires some thought...

If this were EIGRP, neither path would be valid for LFA. They'd both fail the feasibility condition:
R1->R3->R5 has an "advertised distance" of 10, which is greater than the "feasible distance" of 2. Likewise, R1->R4->R5 has an "advertised distance" of 10.

However, OSPF being link state can actually calculate the SPF from R2 and R4's perspective. Cisco calls this process "reverse SPF" -- RSPF. I'm not going to make this a large lesson on link state protocols, but let's quickly look at what R1 would discover about its neighbors:

R2:
  This is already the primary path, so eliminate R2.
R3:
  When attempting to reach R5, R3 will route back through R1. This will loop. Eliminate R3.
R4:
  R4 reaches R5 via the link between R4 and R5.  Valid Backup Route.

I deliberately built the scenario this way to show how a higher-metric route could beat a lower metric for the backup route - of course, in our case, the lower metric would've looped.

R1#sh ip route repair 5.5.5.5
Routing entry for 5.5.5.5/32
  Known via "ospf 1", distance 110, metric 3, type intra area
  Last update from 192.168.12.2 on GigabitEthernet1.12, 02:29:11 ago
  Routing Descriptor Blocks:
  * 192.168.12.2, from 5.5.5.5, 02:29:11 ago, via GigabitEthernet1.12
      Route metric is 3, traffic share count is 1
      Repair Path: 192.168.14.4, via GigabitEthernet1.14
    [RPR]192.168.14.4, from 5.5.5.5, 02:29:11 ago, via GigabitEthernet1.14
      Route metric is 26, traffic share count is 1

R1#sh ip cef 5.5.5.5
5.5.5.5/32
  nexthop 192.168.12.2 GigabitEthernet1.12
    repair: attached-nexthop 192.168.14.4 GigabitEthernet1.14

As with EIGRP, there are "tie-breakers" if you have multiple options for the backup path. With OSPF, you can get a lot more granular than with EIGRP. I still hate the term "tie-breakers"; as I explained in my EIGRP blog, I think "2nd best-path decision maker" describes it better.

The tie-breakers are as follows, with their respective default priorities:

- SRLG 10 
- Primary Path 20
- Interface Disjoint 30
- Lowest-Metric 40
- Linecard-disjoint 50
- Node protecting 60
- Broadcast interface disjoint 70 
- Load Sharing 256 

These tie-breakers are off by default:
- Downstream 
- Secondary-Path

The syntax to change the priorities - or turn on downstream or secondary-path - is as follows:

router ospf 1
  fast-reroute per-prefix tie-break interface-disjoint required index 5

If you use the fast-reroute per-prefix tie-break command at all, it disables all the other tie-breakers. So for example, if you wanted SRLG to be the 2nd tie breaker, you would have to turn it back on after the interface-disjoint command:

router ospf 1
  fast-reroute per-prefix tie-break interface-disjoint required index 5
  fast-reroute per-prefix tie-break srlg index 10

You may have also noticed the required keyword. This means that if that tie-breaker doesn't match/pass, then disallow that path completely.

My original plan was to show a scenario for every tie-breaker, but after it took me two days to build a topology demonstrating each possible technique, I decided to just go with a written explanation of each tie-breaker, plus one semi-complex tie-breaker topology with a few examples.

- SRLG
SRLG - Shared Risk Link Group - is a manual setting, optionally assigned per-interface, with the intent of identifying "shared risk" elements that the router can't detect on its own. For example, if two of your Ethernet links shared a downstream switch, you might put those two in the same SRLG.

Usage:
R1(config)#int gig1
R1(config-if)#srlg gid 1
R1(config-if)#int gig2
R1(config-if)#srlg gid 1
R1(config-if)#int gig3
R1(config-if)#srlg gid 2

- Primary Path
Primary Path prefers a backup path that's part of equal-cost multipath (ECMP). This is the antithesis of Secondary Path, which we'll cover below.

- Interface Disjoint
This one is fairly obvious: prefer a backup next-hop that exits through a different interface than the primary path. Note that Ethernet sub-interfaces are considered different interfaces.

- Lowest-Metric
Prefer the path with the lowest metric (note, this command doesn't offer a "required" keyword)

- Linecard-disjoint
Prefer a path that exits through a different linecard than the primary path (I have no way of labbing this as I'm using a CSR1K)

- Node protecting
Prefer a path that doesn't pass through the same next-hop router as the primary path. Note this means any interface on the same next-hop router. So if R2 is the next-hop of your primary path via 192.168.12.2, and your backup path goes through (either directly or indirectly, later in the path) 192.168.25.2 on R2, node protecting will depref that path - or with the required keyword, would prevent it from being used completely.

- Broadcast interface disjoint
Broadcast interface disjoint deprefs backup routes that pass through the same broadcast area as the primary path. The thought here is if the layer 2 device (presumably a switch) connecting the interfaces together fails, we might lose the backup path too.

- Load Sharing 
I haven't labbed this, but my understanding is this is basically a worst-case scenario. If you have two or more paths that can't be differentiated by all of the above tie-breakers, share the backup paths amongst any applicable prefixes.

- Downstream (off by default)
This is very similar to the EIGRP feasibility condition - ensure that the metric, from the neighbor's RSPF perspective, is smaller than the total metric of our primary path from the calculating router's perspective. Using the original example above, the backup path we picked would not meet the criteria for this tie-breaker. It's important to reinforce that this is not a default option: OSPF doesn't require this EIGRP-feasibility-like check, because as a link state protocol it has the entire topology at hand and can calculate non-looping paths without concern for metric.

- Secondary-Path (off by default)
This is the antithesis of the Primary-Path tie-breaker above. It instructs the process to prefer a backup path that is not part of multipathing (ECMP). The idea here is that all your multipaths may be required for your traffic flows - for example, if you are equal-cost multipathing across two 1-gig links but consistently have 1.2Gb of data crossing them, it would not be desirable to dump all that traffic onto the single remaining link if one failed. Secondary-Path prefers a path outside the ECMP for the backup.
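Both off-by-default tie-breakers are enabled the same way as the others, via the tie-break command. A sketch (the index values are arbitrary, and the keyword spellings are my assumption from the IOS parser):

router ospf 1
 fast-reroute per-prefix tie-break downstream index 10
 fast-reroute per-prefix tie-break secondary-path index 20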

I'm going to run a couple of examples of tie-breaking, but in order to do that, I needed more paths in the topology. Pay close attention, I have shifted the OSPF costs from the prior topology:



* Note: for clarity, the costs listed below do not include the on-router cost to the loopback. *
If you look at metric alone, the paths from R1->R5 look most desirable in this order:
R1 -> R3 -> R5 (Cost 2)
R1 -> R6 -> R3 -> R5 (Cost 4)
R1 -> R2 -> R5 (Cost 11)
R1 -> R4 -> R5 (Cost 25)

Clearly R3 is the winning primary path.

Let's go down the decision-making process for the backup path:

- SRLG 10 - Not applicable, we're not using SRLG (yet)
- Primary Path 20 - Not applicable, we have no ECMP.
- Interface Disjoint 30 - Applicable, but all are on separate interfaces already.
- Lowest-Metric 40 - Applicable, choose R6 as backup. Do not proceed further, as all paths have different costs.
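The decision process above can be sketched as a filter chain: each tie-breaker runs in index order, narrows the candidate set, and evaluation stops as soon as a single path remains. The dictionaries and function names below are my own simplification, not IOS internals:

```python
# Candidate repair paths toward 5.5.5.5/32, mirroring the topology above.
# Only the fields needed for this sketch are modeled.
candidates = [
    {"via": "R6", "cost": 4},
    {"via": "R2", "cost": 11},
    {"via": "R4", "cost": 25},
]

def lowest_metric(paths):
    """Keep only the candidate(s) with the lowest total repair cost."""
    best = min(p["cost"] for p in paths)
    return [p for p in paths if p["cost"] == best]

# Tie-breakers run in index order; SRLG, primary-path, and
# interface-disjoint are omitted here because, as noted above, none of
# them narrows this particular candidate set.
tie_breakers = [lowest_metric]

def pick_backup(paths):
    for tb in tie_breakers:
        paths = tb(paths)
        if len(paths) == 1:  # a single winner ends the evaluation
            break
    return paths[0]

print(pick_backup(candidates)["via"])  # R6
```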

So without any modification, our primary next-hop router will be R3, and backup next-hop router will be R6:

R1#sh ip route repair 5.5.5.5
Routing entry for 5.5.5.5/32
  Known via "ospf 1", distance 110, metric 3, type intra area
  Last update from 192.168.13.3 on GigabitEthernet1.13, 00:14:19 ago
  Routing Descriptor Blocks:
  * 192.168.13.3, from 5.5.5.5, 00:14:19 ago, via GigabitEthernet1.13
      Route metric is 3, traffic share count is 1
      Repair Path: 192.168.16.6, via GigabitEthernet1.16
    [RPR]192.168.16.6, from 5.5.5.5, 00:14:19 ago, via GigabitEthernet1.16
      Route metric is 6, traffic share count is 1

There's an obvious flaw in that plan, however: both paths rely on R3 being online.

R1(config)#router ospf 1
R1(config-router)#fast-reroute per-prefix tie-break lowest-metric index 10
R1(config-router)#fast-reroute per-prefix tie-break node-protecting required index 20

R1(config-router)#do sh ip route repair 5.5.5.5
Routing entry for 5.5.5.5/32
  Known via "ospf 1", distance 110, metric 3, type intra area
  Last update from 192.168.13.3 on GigabitEthernet1.13, 00:01:09 ago
  Routing Descriptor Blocks:
  * 192.168.13.3, from 5.5.5.5, 00:01:09 ago, via GigabitEthernet1.13
      Route metric is 3, traffic share count is 1
      Repair Path: 192.168.14.4, via GigabitEthernet1.14
    [RPR]192.168.14.4, from 5.5.5.5, 00:01:09 ago, via GigabitEthernet1.14
      Route metric is 26, traffic share count is 1

Now the process has chosen the backup through R4, which eliminates R3 as a single point of failure.

Let's pretend that gig1.13, gig1.14, and gig1.16 all cross the same L2 switch somewhere in their path. We want to protect against that too:

R1(config)#router ospf 1
R1(config-router)#fast-reroute per-prefix tie-break lowest-metric index 10
R1(config-router)#fast-reroute per-prefix tie-break node-protecting required index 20
R1(config-router)#fast-reroute per-prefix tie-break srlg required index 30

R1(config-router)#int gig1.13
R1(config-subif)#srlg gid 1
R1(config-subif)#int gig1.14
R1(config-subif)#srlg gid 1
R1(config-subif)#int gig1.16
R1(config-subif)#srlg gid 1
R1(config-subif)#int gig1.12
R1(config-subif)#srlg gid 2

R1(config-subif)#do sh ip route repair 5.5.5.5
Routing entry for 5.5.5.5/32
  Known via "ospf 1", distance 110, metric 3, type intra area
  Last update from 192.168.13.3 on GigabitEthernet1.13, 00:18:34 ago
  Routing Descriptor Blocks:
  * 192.168.13.3, from 5.5.5.5, 00:18:34 ago, via GigabitEthernet1.13
      Route metric is 3, traffic share count is 1

Uh-oh, no backup route. We were hoping for R1->R2->R5...

R2#sh ip cef 5.5.5.5
5.5.5.5/32
  nexthop 192.168.12.1 GigabitEthernet1.12

That's because R2 routes back through R1 - R1 would've run the RSPF with R2 as the root and disregarded the route.

We have two options at this point:
- Remove the required keyword from the SRLG and fall back to the prior answer
- Tinker with the metrics to make R2 a viable path.

R1(config)#int gig1.12
R1(config-subif)#ip ospf cost 10

R2(config)#int gig1.12
R2(config-subif)#ip ospf cost 10

R1(config-subif)#do sh ip route repair 5.5.5.5
Routing entry for 5.5.5.5/32
  Known via "ospf 1", distance 110, metric 3, type intra area
  Last update from 192.168.13.3 on GigabitEthernet1.13, 00:00:52 ago
  Routing Descriptor Blocks:
  * 192.168.13.3, from 5.5.5.5, 00:00:52 ago, via GigabitEthernet1.13
      Route metric is 3, traffic share count is 1
      Repair Path: 192.168.12.2, via GigabitEthernet1.12
    [RPR]192.168.12.2, from 5.5.5.5, 00:00:52 ago, via GigabitEthernet1.12
      Route metric is 21, traffic share count is 1

Now we have a backup via R2.

Before we move on to remote LFA, let's cover some smaller topics.

There were two pieces to the initial command that I did not explain:
fast-reroute per-prefix enable area 0 prefix-priority low

enable area 0 may seem obvious - we want backup paths for area 0. Note that you can only specify areas the router is directly connected to; if, for example, you wanted backup paths in areas 0, 1, and 2, your router would have to be an ABR for areas 1 and 2. This is true of both direct LFA and remote LFA.

But there's another issue with specifying areas:

R5(config)#int lo1
R5(config-if)#ip address 55.55.55.55 255.255.255.255
R5(config-if)#exit
R5(config)#route-map lo1-extern
R5(config-route-map)#match interface lo1
R5(config-route-map)#exit
R5(config)#router ospf 1
R5(config-router)#redistribute connected route-map lo1-extern

R1(config)#do sh ip route repair 55.55.55.55
Routing entry for 55.55.55.55/32
  Known via "ospf 1", distance 110, metric 20, type extern 2, forward metric 2
  Last update from 192.168.13.3 on GigabitEthernet1.13, 00:01:27 ago
  Routing Descriptor Blocks:
  * 192.168.13.3, from 5.5.5.5, 00:01:27 ago, via GigabitEthernet1.13
      Route metric is 20, traffic share count is 1

No repair route for 55.55.55.55 - and there never will be, because an external route is in no area. We have to change our initial configuration to fix this:

R1(config-router)#no ip fast-reroute per-prefix enable area 0 prefix-priority low
R1(config-router)#fast-reroute per-prefix enable prefix-priority low

A lack of an area implies all areas this router is connected to - including external routes.

R1(config-router)#do sh ip route repair 55.55.55.55
Routing entry for 55.55.55.55/32
  Known via "ospf 1", distance 110, metric 20, type extern 2, forward metric 2
  Last update from 192.168.13.3 on GigabitEthernet1.13, 00:01:42 ago
  Routing Descriptor Blocks:
  * 192.168.13.3, from 5.5.5.5, 00:01:42 ago, via GigabitEthernet1.13
      Route metric is 20, traffic share count is 1
      Repair Path: 192.168.12.2, via GigabitEthernet1.12
    [RPR]192.168.12.2, from 5.5.5.5, 00:01:42 ago, via GigabitEthernet1.12
      Route metric is 20, traffic share count is 1

What's the story on prefix-priority low?

By default, IOS prioritizes convergence events by prefix length. If SPF has to be calculated for thousands of routes, /32s (typical for iBGP next-hops) are assumed to be "high priority". You can define which routes OSPF treats as priority with:

R1(config-router)#prefix-priority high route-map <your route map>

There are only two tiers, high and low. High means (by default, unless the route map is used) that backup routes are calculated only for /32s; low means backup routes are calculated for all routes.

So you're debugging and trying to figure out why one path was chosen over another. IOS has a fantastic output system for this:

R1(config-router)#fast-reroute keep-all-paths

This is basically a debugging command; it tells OSPF to keep the output from all the RSPFs it ran to calculate the backup path - including the ones it didn't choose as best.

show ip ospf rib is our second magic command:

R1(config-router)#do sh ip ospf rib 5.5.5.5

            OSPF Router with ID (1.1.1.1) (Process ID 1)


                Base Topology (MTID 0)

OSPF local RIB
Codes: * - Best, > - Installed in global RIB
LSA: type/LSID/originator

*>  5.5.5.5/32, Intra, cost 3, area 0
     SPF Instance 62, age 00:13:50
     Flags: RIB, HiPrio
      via 192.168.13.3, GigabitEthernet1.13
       Flags: RIB
       LSA: 1/5.5.5.5/5.5.5.5
      repair path via 192.168.12.2, GigabitEthernet1.12, cost 21
       Flags: RIB, Repair, IntfDj, BcastDj, NodeProt
       LSA: 1/5.5.5.5/5.5.5.5
      repair path via 192.168.16.6, GigabitEthernet1.16, cost 6
       Flags: Ignore, Repair, IntfDj, BcastDj, SRLG
       LSA: 1/5.5.5.5/5.5.5.5
      repair path via 192.168.14.4, GigabitEthernet1.14, cost 26
       Flags: Ignore, Repair, IntfDj, BcastDj, SRLG, NodeProt
       LSA: 1/5.5.5.5/5.5.5.5

Look at all that fantastic output - it lists the parameters per route so you can determine why each repair path was or wasn't chosen. Let's break one of these down:

      repair path via 192.168.12.2, GigabitEthernet1.12, cost 21
       Flags: RIB, Repair, IntfDj, BcastDj, NodeProt
       LSA: 1/5.5.5.5/5.5.5.5

This is our current best backup path - "RIB" means it's installed, "Repair" means it's a backup path - so "RIB" + "Repair" means it's the installed backup path. IntfDj means it's on a separate interface from the primary path, BcastDj means it's not sharing a broadcast interface with the primary path, and NodeProt means the path does not include shared hops with the primary path.

Microloops can add complexity with fast-reroute. A microloop is what happens when one router converges significantly faster than a neighbor. Let's say two adjacent routers both receive new LSAs simultaneously. One router is high-performance, another is older. The high-performance router calculates the change and updates the FIB several seconds before the older router. Now we could end up with a scenario where the newer router starts forwarding traffic through the older router, but the older router's FIB hasn't updated yet, and it's forwarding through the faster router for that same prefix. For a couple of seconds, the two routers loop.
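That timing window can be modeled with a toy FIB. Assume both routers learn of the failure at t=0, router A updates its FIB at t=1, and router B not until t=4 - the routers, next-hops, and times here are all hypothetical:

```python
# Toy model of a microloop. Before the failure, A forwarded via a link
# that just went down and B forwarded via A. After convergence, the
# correct paths are A -> B and B -> C. A converges at t=1, B at t=4.
FIB_UPDATE_TIME = {"A": 1, "B": 4}   # A is the high-performance router
OLD_NEXT_HOP = {"A": "failed-link", "B": "A"}
NEW_NEXT_HOP = {"A": "B", "B": "C"}

def next_hop(router, t):
    """Return the router's forwarding decision at time t."""
    if t >= FIB_UPDATE_TIME[router]:
        return NEW_NEXT_HOP[router]
    return OLD_NEXT_HOP[router]

def microloop(t):
    """True while A points at B but B still points back at A."""
    return next_hop("A", t) == "B" and next_hop("B", t) == "A"

for t in range(6):
    print(t, microloop(t))
# The loop exists only in the window t=1..3 - after the fast router
# has converged but before the slow one has.
```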

I'm not going to go into detail on this as it's a fringe topic, but here's the starting point for using this:
R1(config-router)#microloop avoidance ?
  disable           Microloop avoidance auto-enable prohibited
  protected         Microloop avoidance for protected prefixes only
  rib-update-delay  Delay before updating the RIB

In short, it allows you to deliberately slow down updating the FIB on the faster router for prefixes that are high-risk for this type of reconvergence.

If you don't want an interface being considered for fast-reroute:

R1(config-router)#int gig1.12
R1(config-subif)#ip ospf fast-reroute per-prefix candidate disable

And if you need a quick summary of what percentage of routes are and aren't protected:

R1#sh ip ospf fast-reroute prefix-summary

            OSPF Router with ID (1.1.1.1) (Process ID 1)
                    Base Topology (MTID 0)

Area 0:

Interface        Protected    Primary paths    Protected paths Percent protected
                             All  High   Low   All  High   Low    All High  Low
Lo0                    Yes     0     0     0     0     0     0     0%   0%   0%
Gi1.16                 Yes     1     1     0     0     0     0     0%   0%   0%
Gi1.14                 Yes     0     0     0     0     0     0     0%   0%   0%
Gi1.13                 Yes     7     3     4     4     2     2    57%  66%  50%
Gi1.12                 Yes     1     1     0     0     0     0     0%   0%   0%

Area total:                    9     5     4     4     2     2    44%  40%  50%

Process total:                 9     5     4     4     2     2    44%  40%  50%

That's a wrap for direct LFA. Now we'll look at remote LFA.



This is a simplistic topology but it has a huge problem for direct LFA.
Let's protect the path from R1 to R4.

We have two paths:
R1 -> R4 (cost 1)
R1 -> R2 -> R3 -> R4 (cost 12)

Obviously R1 -> R4 is the primary path.
What does R2 see as its possible paths to R4?
R2 -> R1 -> R4 (Cost 2)
R2 -> R3 -> R4 (Cost 11)

R2 will always send traffic back to R1 when heading towards R4.

What about R3?
R3 -> R4 (Cost 6)
R3 -> R2 -> R1 (Cost 7)

R3 would work for a backup path... if only we could get to R3 without R2 knowing what we're up to.

Enter Remote LFA.

R1(config-router)#int gig1.14
R1(config-subif)#mpls ip
R1(config-subif)#int gig1.12
R1(config-subif)#mpls ip
R1(config-subif)#mpls ldp discovery targeted-hello accept

R2(config-subif)#int gig1.12
R2(config-subif)#mpls ip
R2(config-subif)#int gig1.23
R2(config-subif)#mpls ip
R2(config-subif)#mpls ldp discovery targeted-hello accept

R3(config-subif)#int gig1.23
R3(config-subif)#mpls ip
R3(config-subif)#int gig1.34
R3(config-subif)#mpls ip
R3(config-subif)#mpls ldp discovery targeted-hello accept

R4(config-subif)#int gig1.14
R4(config-subif)#mpls ip
R4(config-subif)#int gig1.34
R4(config-subif)#mpls ip
R4(config-subif)#mpls ldp discovery targeted-hello accept

R1(config-router)#router ospf 1
R1(config-router)#fast-reroute per-prefix remote-lfa tunnel mpls-ldp

There's a complex algorithm that makes this work, but it's somewhat irrelevant from a CCIE v5 perspective. 

Here's what you really need to know:
- Direct LFA must have already failed to turn up a path (direct is always tried first)
- A tunnel is built over targeted LDP.
- The destination tunnel router is picked on the following criteria:
   - It must be in the same area as the router running LFA.
   - The tunnel endpoint is picked from among the group of routers that can be reached through a next-hop other than the one you're trying to protect.
   - Of that group, it's narrowed down to the subset that can reach your repair prefix without passing back through the protected link.
   - Those that qualify are called the PQ space (refer to the RFC for a lot more detail, but it may be overkill for a CCIE candidate)
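The PQ-space selection can be sketched with simple set intersection. The sets below are hand-picked to match the four-router topology in this section (R1 protecting its direct link to R4), not computed by a real SPF:

```python
# Hand-picked sets for the ring R1-R2-R3-R4, protecting the R1-R4 link.
# In real IOS these come from SPF/RSPF runs, not static sets.

# (Extended) P space: routers that repair traffic from R1 can reach
# without crossing the protected R1-R4 link.
p_space = {"R2", "R3"}

# Q space: routers whose own shortest path to the protected destination
# (R4) does not cross the R1-R4 link.
q_space = {"R3"}

# The tunnel endpoint must sit in both sets.
pq_space = p_space & q_space
print(pq_space)  # {'R3'} - R3 becomes the targeted-LDP tunnel endpoint
```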

R1#sh ip route repair 4.4.4.4
Routing entry for 4.4.4.4/32
  Known via "ospf 1", distance 110, metric 2, type intra area
  Last update from 192.168.14.4 on GigabitEthernet1.14, 00:29:36 ago
  Routing Descriptor Blocks:
  * 192.168.14.4, from 4.4.4.4, 00:29:36 ago, via GigabitEthernet1.14
      Route metric is 2, traffic share count is 1
      Repair Path: 3.3.3.3, via MPLS-Remote-Lfa1
    [RPR]3.3.3.3, from 4.4.4.4, 00:29:36 ago, via MPLS-Remote-Lfa1
      Route metric is 12, traffic share count is 1

R1#sh ip int br | i MPLS
MPLS-Remote-Lfa1       192.168.12.1    YES unset  up                    up

This whole process is reasonably automatic - just make sure your LDP is in good shape and targeted LDP is enabled, and you're good to go.

You can optionally specify areas and maximum costs:

R1(config-router)#fast-reroute per-prefix remote-lfa area 0 maximum-cost 10

The areas work the same way they did with direct LFA - we're just saying we only want to protect area 0, 1, 2, 3, etc. For remote LFA, the router you're running LFA on has to be in the area you're trying to protect - you can't protect area 5 if you're only an ABR for areas 0 and 1.

The maximum cost option restricts which prefixes you should be building tunnels for. In other words, it has nothing to do with the metric to reach the tunnel endpoint - it has to do with the prefix you're trying to protect.

Hope you enjoyed!

Jeff