Jeff Kronlage's CCIE Study Blog

GETVPN

2016-01-02T14:23:00.001-08:00

GETVPN, or Group Encrypted Transport VPN, is Cisco's implementation of the GDOI standard. GDOI, or Group Domain of Interpretation, is defined in RFC 6407, which obsoleted the original RFC, 3547.

GDOI was originally established to allow for a way of encrypting multicast traffic, which was rather cumbersome to do with, say, GRE-over-IPSEC tunnels previously.

https://tools.ietf.org/html/rfc3547
"GDOI Applications. Secure multicast applications include video broadcast and multicast file transfer."

However, GETVPN is now commonly used for encrypting any type of traffic over any private network. Most commonly, it is used for encryption over MPLS VPNs, as MPLS VPNs are not truly secure, and without encryption you're putting a lot of faith that your service provider won't sniff your data. However, GETVPN is L2/L3 agnostic, so arguably it could be used for any application where NAT is not involved. GETVPN does not replace DMVPN for Internet applications. More on that further down the document.

At a high-level, GETVPN establishes a set of rotating encryption keys that a group shares. In this fashion, any group member can encrypt data to any other group member without setting up a tunnel to the other group member. In fact, the entire system is "tunnel-less". Additionally, as GETVPN re-uses the original IP header, the underlying routing is preserved. So if you're using BGP to peer to an MPLS VPN, that same routing just keeps working even with the encrypted packets.

How the encryption process occurs can be most easily shown over a series of slides.

There are two router types involved with GETVPN: Key Servers (KS) and Group Members (GM). GMs, in this usage, are customer CEs that will be encrypting traffic to one another. KSs are control-plane-only routers: they are not in the forwarding path, nor do they encrypt data.

The first step is for the GMs to register to a KS. In order to do this, ISAKMP is established between the GM and the KS. This is a one-off ISAKMP session for this initial communication only.

During this initial step, a "pull" is initiated from GM to KS. The GM receives the initial Key Encryption Key (KEK) and Traffic Encryption Key (TEK). As I mentioned above, the initial ISAKMP session, while it may stay up for a good while after registration completes, isn't used after this process - only the KEK is. It's important to note that all KSs, of which there can be up to eight, can encrypt using the KEK, which makes it sort of a distributed/shared phase 1, as opposed to the initial point-to-point ISAKMP session.

The TEK, which is generated by the primary key server (and distributed to the other key servers), is then used by each GM to encrypt traffic to any other GM, while the KEK protects the rekey messages coming from the KS. I struggled with how to draw this to avoid the appearance of tunnels, which are inherently point-to-point.

There are some comparisons and contrasts to be drawn with both traditional IPSEC point-to-point tunnels and with DMVPN.

An obvious difference with point-to-point IPSEC is that, with some exceptions we will cover throughout the document, all traffic egressing the CE -> PE interface is encrypted, regardless of where it is destined. This makes for a tunnel-less, or group-"tunnel" style interface. Moreover, unlike point-to-point IPSEC, the original source and destinations in the IP header are retained, whereas with IPSEC, they are rewritten with the tunnel endpoints. As such, traditional routing - and multicast - both work.

While GETVPN and DMVPN may accomplish similar tasks, there are some significant differences there, as well. Without making a messy static hack to the configuration, DMVPN only supports multicast from the DMVPN head-end to spokes. As pointed out above, native multicast works fine on GETVPN, without utilizing pseudo-multicast as is common at tunnel head-ends. Additionally, DMVPN builds dynamic tunnels from spoke-to-spoke on an as-needed basis - but that still leaves the spoke building a tunnel every time it needs to speak to another spoke. This creates overhead, both in tunnel setup - there's a small, but measurable delay in each tunnel being created - and in scalability; if a spoke needs to speak to hundreds of other spokes, it must build and maintain hundreds of point-to-point IPSEC tunnels.
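To put rough numbers on that scalability point, here is a back-of-the-envelope sketch in plain Python. Nothing in it is Cisco-specific and the function names are my own; it just counts how many point-to-point tunnels an any-to-any design implies as the site count grows.

```python
def tunnels_per_spoke(sites: int) -> int:
    """Tunnels one site must hold to reach every other site directly."""
    return sites - 1


def full_mesh_tunnels(sites: int) -> int:
    """Total point-to-point tunnels in a full mesh of `sites` routers."""
    return sites * (sites - 1) // 2


for n in (4, 100, 500):
    print(f"{n} sites: {tunnels_per_spoke(n)} per spoke, "
          f"{full_mesh_tunnels(n)} in total")
```

With a shared group key, by contrast, the amount of keying state a GM holds does not depend on how many other GMs are in the group.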

A disadvantage of GETVPN is that it isn't supported with NAT, or the NAT must be engineered in such a way that it's invisible to the encryption devices. This has to do with the original IP addressing and header being preserved by GETVPN. One could arguably run GET on the Internet if the GMs and KSs used only public IP addressing, and, if needed, hide the NAT behind extra routers behind the GMs. I've seen documents on the Internet claiming even more can be done with GETVPN and NAT, but these are not supported use cases by Cisco, and I didn't try to verify them. Cisco's approach is evident: if an Internet-facing tunnel with NAT is required, it's best to use DMVPN, which works well with NAT.

There's a fair amount going on behind-the-scenes in a GETVPN, and I'm going to pause explaining that at this point to look at some of the config. A key aspect of a CCIE is to know both the configuration steps and the steps happening behind-the-scenes, and I always find it best to introduce both in conjunction.

Here's the topology we will be working from:

I'll review IP address usage on-the-fly; there are too many links here to describe them all initially.

Also, we won't be reviewing any of the P or PE devices, as they're just a basic MPLS VPN configuration.

We'll start by looking at the configuration of Key Server 1 (KS1). For now, we'll pretend KS2 doesn't exist, as I'll cover that as part of the COOP (pronounced "co-op") configuration later in the document.

But first, a quick review of scope. As with all my previous documentation, my articles are targeted at the CCIE R&S. This means we'll only be inspecting the ISAKMP and IPSEC configuration enough for an R&S understanding, and we'll be skipping any advanced topics that are irrelevant to R&S (e.g. Trustsec integration).

KS1:

crypto isakmp policy 1
encr aes
authentication pre-share
group 2

crypto isakmp key MYGDOIPSK address 0.0.0.0

crypto ipsec transform-set aes128 esp-aes esp-sha-hmac
mode tunnel

crypto ipsec profile profile1
set transform-set aes128

crypto gdoi group GDOI-GROUP1
identity number 1234
server local
rekey algorithm aes 128
rekey authentication mypubkey rsa MYRSAKEY
rekey transport unicast
sa ipsec 1
   profile profile1
   match address ipv4 getvpn-acl
   replay time window-size 5
address ipv4 192.168.111.111

ip access-list extended getvpn-acl
deny   udp any eq 848 any
deny   udp any any eq 848
deny   tcp any eq bgp any
deny   tcp any any eq bgp
permit ip any any

Not shown here is the BGP configuration. I have KS1 peered with PE1, advertising its loopback, 192.168.111.111. KS1 is set up similarly to how any CE router would be in an MPLS VPN.

With the understanding that I'm going to high-level the crypto explanations, here's what the various relevant pieces of the config do:

crypto isakmp key MYGDOIPSK address 0.0.0.0

As mentioned above, all GMs stand up a temporary ISAKMP session to the KS during registration. In order to do so, they need to share a PSK (or have a PKI, which is out of scope for this article). You can create a key per-GM, or just one that matches all GMs. Here we've defined the key as MYGDOIPSK for all GMs ("0.0.0.0").

You can view the ISAKMP sessions before they die off, if desired:

KS1#show crypto isakmp sa
IPv4 Crypto ISAKMP SA
dst             src             state          conn-id status
192.168.111.111 192.168.11.3    GDOI_IDLE         1067 ACTIVE

If all devices had come up after a fresh reboot, you'd see four connections here, but as I've only recently bounced one, the other three have expired already.

crypto gdoi group GDOI-GROUP1
identity number 1234

The crypto gdoi group command is where all the magic happens on the KSs, and we'll be reviewing the rest of the configuration below. It's important to note it's not assigned to an interface on the KS. The KS doesn't encrypt anything but the control-plane traffic, so this config, when used with 'server local', simply enables the GDOI KS process and opens UDP port 848 for communications with the GMs (and eventually other KSs for COOP). The identity number defines which encryption group this config belongs to - a KS can run multiple groups for different GMs, and keep the keying (and consequently the communication) isolated between groups.

server local
rekey algorithm aes 128
rekey authentication mypubkey rsa MYRSAKEY

Here we define the KEK key and rekey process. The initial keys, shown here, are defined as 128-bit AES, authenticated with the RSA key "MYRSAKEY". The RSA keys need to be pre-created, which is accomplished with:

crypto key generate rsa label MYRSAKEY modulus 1024 exportable

With a single KS you don't technically need to make the key exportable, but if you ever want to add a second KS, this is mandatory, so it's a good idea to do it to begin with.

rekey transport unicast

As with any IPSEC tunnel, the keys are rotated periodically so that in case they are compromised, they can't be used to decrypt messages in the future - in other words, the theory is that it takes longer to crack the keys than the actual lifetime of the key, therefore making it impossible for a hacker to decrypt data in real-time. GETVPN has to rekey both the KEK and the TEK periodically, at intervals defined at the IOS CLI. New keys are sent out prior to the expiration of the old key, so that there's a clean roll-over to new key when the appropriate time has been reached.

There are two methods for rekeying with GETVPN. If you look back to why GDOI was originally developed, it was to encrypt multicast traffic. So, logically, rekeying via multicast is an option. I didn't lab this as it would've required me to either move away from using MPLS as my core, or enable service-provider multicast over MPLS, which seemed excessive for the scope I was attempting to cover. Regardless, Cisco recommends using unicast rekey now, namely because there's an acknowledgement system in unicast that's not available in multicast. Multicast rekey does a "fire and forget" mechanism and simply hopes the new keys reach the destination; unicast rekey double-checks to ensure the keys are received by expecting an ACK back from the GM. Eventually, if a key is coming up on expiring and the GM hasn't received a replacement, it will attempt a re-register with the KS in order to resolve the issue.
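The ACK-based behavior of unicast rekey can be pictured with a toy model. This is only a sketch of the general idea (send, wait for an ACK, resend to whoever stayed silent), not Cisco's actual retry logic; the function names, the attempt count of 3, and every GM address other than 192.168.11.3 are hypothetical.

```python
from typing import Callable, Dict, Iterable, Set


def unicast_rekey(members: Iterable[str],
                  send: Callable[[str], bool],
                  max_attempts: int = 3) -> Set[str]:
    """Send a rekey to each GM and resend to any GM that has not ACKed.

    `send(gm)` returns True when an ACK came back from that GM.
    Returns the set of GMs that never acknowledged.
    """
    pending = set(members)
    for _ in range(max_attempts):
        if not pending:
            break
        # Keep only the GMs whose ACK did not arrive on this attempt.
        pending = {gm for gm in pending if not send(gm)}
    return pending


attempts: Dict[str, int] = {}


def lossy_send(gm: str) -> bool:
    """Hypothetical transport: one GM ACKs late, one never ACKs."""
    attempts[gm] = attempts.get(gm, 0) + 1
    if gm == "192.168.44.3":
        return False                 # never answers
    if gm == "192.168.33.3":
        return attempts[gm] >= 2     # ACKs on the second try
    return True


gms = ["192.168.11.3", "192.168.22.3", "192.168.33.3", "192.168.44.3"]
unacked = unicast_rekey(gms, lossy_send)
print("never ACKed:", sorted(unacked))
```

In this model the GM left in `unacked` is the one that would eventually have to fall back to re-registering with the KS, as described above. A multicast "fire and forget" rekey has no equivalent of the `pending` set at all.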

The actual rekeying/retry logic is incredibly deep, and for more information on it, I recommend reading the Cisco documentation, which is actually quite good (to my surprise, as most of my CCIE-level articles got written in the first place because the Cisco documentation is generally awful):

http://www.cisco.com/c/en/us/td/docs/ios/12_4t/12_4t11/htgetvpn.html

sa ipsec 1
   profile profile1
   match address ipv4 getvpn-acl
   replay time window-size 5

Here the TEK key attributes are defined, inherited from profile1:

crypto ipsec profile profile1
set transform-set aes128

The GETVPN ACL is defined as getvpn-acl.

It's probably not desirable to encrypt all traffic over the MPLS circuit. For example, control-plane protocols (probably BGP) as well as the initial control plane session (UDP port 848) from GM to KS need to be exempt from this process. It might also be desirable for ICMP, SSH, and perhaps SNMP - your management protocols - to be exempt.

At its most basic, your ACL should look something like this:
ip access-list extended getvpn-acl
deny   udp any eq 848 any
deny   udp any any eq 848
deny   tcp any eq bgp any
deny   tcp any any eq bgp
permit ip any any

deny indicates to not encrypt traffic. permit indicates to encrypt traffic. Normally this ACL will end in "permit ip any any".
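The deny/permit semantics can be sketched as a first-match lookup. This is a simplified model in Python, assuming only protocol and port matter (every address in the ACL above is "any"); the `Ace` class and `should_encrypt` function are illustrative names of my own, not anything IOS exposes.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Ace:
    action: str                      # "deny" = leave in clear text, "permit" = encrypt
    proto: str                       # "ip" matches any protocol
    src_port: Optional[int] = None   # None means "any"
    dst_port: Optional[int] = None   # None means "any"


# Mirrors getvpn-acl above (bgp = TCP port 179).
GETVPN_ACL = [
    Ace("deny", "udp", src_port=848),
    Ace("deny", "udp", dst_port=848),
    Ace("deny", "tcp", src_port=179),
    Ace("deny", "tcp", dst_port=179),
    Ace("permit", "ip"),
]


def should_encrypt(proto: str,
                   src_port: Optional[int] = None,
                   dst_port: Optional[int] = None,
                   acl: Sequence[Ace] = GETVPN_ACL) -> bool:
    """Walk the ACL top-down; the first matching entry decides."""
    for ace in acl:
        if ace.proto not in ("ip", proto):
            continue
        if ace.src_port is not None and ace.src_port != src_port:
            continue
        if ace.dst_port is not None and ace.dst_port != dst_port:
            continue
        return ace.action == "permit"
    return False  # nothing matched: treated as "don't encrypt" in this sketch


print(should_encrypt("udp", 848, 848))     # GDOI registration/rekey traffic
print(should_encrypt("tcp", 40000, 179))   # BGP session to the PE
print(should_encrypt("icmp"))              # a ping between hosts
```

Because the first match wins, the exemptions have to sit above the final "permit ip any any", which is why that line normally comes last.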

The replay-time command has a big topic to discuss behind-the-scenes. The traditional IPSEC method for anti-replay doesn't work with GETVPN. If you're not familiar with replay attacks, "A replay attack is a form of network attack in which a valid data transmission is maliciously or fraudulently repeated or delayed. It is an attempt to subvert security by someone who records legitimate communications and repeats them in order to impersonate a valid user, and to disrupt or cause negative impact for legitimate connections."

http://www.cisco.com/c/en/us/support/docs/ip/internet-key-exchange-ike/116858-problem-replay-00.html

Also from the same document, anti-replay is described: "IPSec provides anti-replay protection against an attacker who duplicates encrypted packets with the assignment of a monotonically increasing sequence number to each encrypted packet".

In a nutshell, traditional anti-replay has a counter embedded in each packet, with the far side of a point-to-point tunnel anticipating the number to continuously count up, one packet at a time. This clearly can't work with GETVPN, as any neighbor can forward traffic, so there's no way to maintain a two-router counting system. Introducing Time-Based Anti-Replay, or TBAR.

TBAR has the KS maintain a pseudo-time clock ('pseudo' as it's not based on NTP) with the GMs. This gives every GM a coordinated reference point for time. Every GM then embeds its pseudo-timestamp in every packet it sends, and if that timestamp differs from the receiving GM's pseudo-time by more than X seconds, the packet is considered a replay attack and is dropped. 'X' seconds is defined by the replay time window-size 5, where 5 is the number of seconds a packet is considered valid.
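The time-window check can be sketched in a few lines of Python. This is a simplified model that assumes a symmetric window around the receiver's pseudo-time; the function and constant names are mine, and the real implementation has more moving parts (how pseudo-time is distributed and resynchronized, for example).

```python
WINDOW_SECONDS = 5  # mirrors "replay time window-size 5"


def accept_packet(packet_pseudotime: float,
                  receiver_pseudotime: float,
                  window: float = WINDOW_SECONDS) -> bool:
    """Accept only if the sender's timestamp is within `window` seconds
    of the receiver's own pseudo-time; otherwise treat it as a replay."""
    return abs(receiver_pseudotime - packet_pseudotime) <= window


# A packet stamped at pseudo-time 1000 arriving 3 seconds later is fine...
print(accept_packet(1000, 1003))
# ...but the same packet captured and replayed 10 seconds later is dropped.
print(accept_packet(1000, 1010))
```

Note that the check needs no per-peer state, which is exactly what makes it workable when any GM in the group may be the sender.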

address ipv4 192.168.111.111

This defines the local IP address used to send and receive GETVPN messages. It's normally set to a loopback. Our loopback on KS1 is 192.168.111.111.

Now let's move on to our first spoke configuration, on CE1/GM1:

crypto isakmp policy 1
encr aes
authentication pre-share
group 2
crypto isakmp key MYGDOIPSK address 192.168.111.111
crypto gdoi group GDOI-GROUP1
identity number 1234
server address ipv4 192.168.111.111

crypto map gdoimap 1 gdoi
set group GDOI-GROUP1

int e0/0
crypto map gdoimap

This configuration is notably smaller than that of the KS. Moreover, with rare exceptions, it can be pasted in identically to each GM, so config deployment is very easy and fast.

crypto isakmp policy 1
encr aes
authentication pre-share
group 2
crypto isakmp key MYGDOIPSK address 192.168.111.111

This is an identical match to the ISAKMP GM -> KS policy shown on the KS. The GM will use this to establish the initial temporary ISAKMP session back to the KS to register and download the KEK & TEK. The only really important item here is that this config matches that of the KS.

crypto gdoi group GDOI-GROUP1
identity number 1234
server address ipv4 192.168.111.111

Here we define our group number, which will control which key set we receive, as well as which members we can speak to. Our initial deployment will all be on group 1234 for simplicity. server address determines which KS we register to. There can be more than one KS, and we'll cover the GM config for that when we cover COOP on the KSes.

crypto map gdoimap 1 gdoi
set group GDOI-GROUP1

int e0/0
crypto map gdoimap

On the GMs, we activate both the control plane and forwarding plane of GETVPN on-the-interface, unlike on the KS, which has no interface-level config.

My lab has this all running already, so I'm going to manually bounce CE1 to watch the registration process.

CE1#sh run int e0/0
Building configuration...

Current configuration : 152 bytes
!
interface Ethernet0/0
ip address 192.168.11.3 255.255.255.0
crypto map gdoimap
end

CE1(config)#int e0/0
CE1(config-if)#no crypto map gdoimap
CE1(config-if)#crypto map gdoimap

I'll break the log down:
*Jan 2 18:10:37.203: %CRYPTO-5-GM_REGSTER: Start registration to KS 192.168.111.111 for group GDOI-GROUP1 using address 192.168.11.3 fvrf default ivrf default

We started attempting registration

*Jan 2 18:10:37.236: %GDOI-5-SA_TEK_UPDATED: SA TEK was updated
*Jan 2 18:10:37.237: %GDOI-5-SA_KEK_UPDATED: SA KEK was updated

We received TEK and KEK

*Jan 2 18:10:37.237: %GDOI-5-GM_REGS_COMPL: Registration to KS 192.168.111.111 complete for group GDOI-GROUP1 using address 192.168.11.3 fvrf default ivrf default

We successfully registered to KS 192.168.111.111.

*Jan 2 18:10:37.238: %GDOI-5-GM_INSTALL_POLICIES_SUCCESS: SUCCESS: Installation of Reg/Rekey policies from KS 192.168.111.111 for group GDOI-GROUP1 & gm identity 192.168.11.3 fvrf default ivrf default

Policies pushed from KS1 were activated successfully.

Remember that ACL we put on the key-server? It's downloaded to the GM as part of the registration process to the KS:

CE1#show crypto gdoi gm acl
Group Name: GDOI-GROUP1
ACL Downloaded From KS 192.168.111.111:
   access-list   deny udp any port = 848 any
   access-list   deny udp any any port = 848
   access-list   deny tcp any port = 179 any
   access-list   deny tcp any any port = 179
   access-list   permit ip any any
ACL Configured Locally:

Moreover, if you update the ACL on the KS, it will get re-pushed with the next scheduled rekey, or you can force a rekey at any time with:

crypto gdoi ks rekey ! refreshes the ACL and sends out the next set of keys
crypto gdoi ks rekey replace-now ! steps above, plus force swapping to a new key (traffic impacting)

There are some other useful show commands we'll take a moment to look at.
One thing that threw me initially is that the traditional ipsec "show" commands don't work all that well here. KEK and TEK are different enough that the commands developed for point-to-point throw some odd output, for example:

CE1#show crypto ipsec sa

interface: Ethernet0/0
    Crypto map tag: gdoimap, local addr 192.168.11.3

   protected vrf: (none)
   local ident (addr/mask/prot/port): (0.0.0.0/0.0.0.0/0/0)
   remote ident (addr/mask/prot/port): (0.0.0.0/0.0.0.0/0/0)
   <output omitted>

The local and remote ident would normally describe the local and remote subnet listed in the ACE of the interesting traffic list described by this IPSEC SA. However, in the case of GDOI a single SA is shown for the whole GDOI group but no ACE information from the GDOI ACL is given.

You can, however, use traditional ISAKMP commands to see the temporary tunnel to the KS:

CE1#show crypto isakmp sa
IPv4 Crypto ISAKMP SA
dst             src             state          conn-id status
192.168.111.111 192.168.11.3    GDOI_IDLE         1067 ACTIVE

That said, let's look at the GDOI-specific commands.

On the GM, to see if you're registered:

CE1#show crypto gdoi gm
Group Member Information For Group GDOI-GROUP1:
    IPSec SA Direction       : Both
    ACL Received From KS     : gdoi_group_GDOI-GROUP1_temp_acl

    Group member             : 192.168.11.3    vrf: None
       Local addr/port       : 192.168.11.3/848
       Remote addr/port      : 192.168.111.111/848
       fvrf/ivrf             : None/None
       Version               : 1.0.8
       Registration status   : Registered
       Registered with       : 192.168.111.111
       Re-registers in       : 897 sec
       Succeeded registration: 1
       Attempted registration: 1
      <output omitted for brevity>

On the KS, to see who's registered to it:

KS1#show crypto gdoi ks members summary | s 11.3
Group Member ID    : 192.168.11.3        GM Version: 1.0.8
Group ID          : 1234
Group Name        : GDOI-GROUP1
GM State          : Registered
Key Server ID     : 192.168.111.111

This command produces a lot of output, even when using "summary", when you have many GMs registered. As all of my lab GMs are up at this moment, note I filtered the output to just the one GM (CE1) that we've been working with.

To view the details on KEK and TEK on the GM (you may want to check the remaining lifetimes):

CE1#show crypto gdoi | s KEK
KEK POLICY:
    Rekey Transport Type     : Unicast
    Lifetime (secs)          : 84190
    Encrypt Algorithm        : AES
    Key Size                 : 128
    Sig Hash Algorithm       : HMAC_AUTH_SHA
    Sig Key Length (bits)    : 1296

CE1#show crypto gdoi | s TEK
TEK POLICY for the current KS-Policy ACEs Downloaded:
Ethernet0/0:
    IPsec SA:
        spi: 0x6BAFD3AB(1806685099)
        transform: esp-aes esp-sha-hmac
        sa timing:remaining key lifetime (sec): (1386)
        Anti-Replay(Time Based) : 5 sec interval
        tag method : disabled
        alg key size: 16 (bytes)
        sig key size: 20 (bytes)
        encaps: ENCAPS_TUNNEL

Now, our GM is set to encrypt any traffic (minus UDP 848 and BGP) that leaves its e0/0 interface.

If you've ever watched an American cooking show, there's always a moment when the celebrity chef shows the basics of how to put a yet-to-be-cooked dish together, then instantly pops out the final product that's been in the oven for two hours prior, compliments of the magic of television. This is my moment! Not shown here, I've applied the GM config to the other three CE devices, and we have encrypted communication across all CE and Host devices on the MPLS VPN.

I'm going to send pings from Host1 (10.0.111.2, with a default route to CE1) to Host3 (10.0.33.2, with a default route to CE3).

HOST1#ping 10.0.33.2 repeat 10
Type escape sequence to abort.
Sending 10, 100-byte ICMP Echos to 10.0.33.2, timeout is 2 seconds:
!!!!!!!!!!
Success rate is 100 percent (10/10), round-trip min/avg/max = 5/6/7 ms

Great, how do we verify that the packets were encrypted? We go ask CE1:

CE1#show crypto gdoi gm dataplane counters

Data-plane statistics for group GDOI-GROUP1:
    #pkts encrypt            : 10       #pkts decrypt            : 10
    #pkts tagged (send)      : 0        #pkts untagged (rcv)     : 0
    #pkts no sa (send)       : 0        #pkts invalid sa (rcv)   : 0
    #pkts encaps fail (send) : 0        #pkts decap fail (rcv)   : 0
    #pkts invalid prot (rcv) : 0        #pkts verify fail (rcv) : 0
    #pkts not tagged (send) : 0        #pkts not untagged (rcv) : 0
    #pkts internal err (send): 0        #pkts internal err (rcv) : 0

As we can see, we sent 10 encrypted packets and received 10 encrypted packets, so it's a fair bet that encryption happened, using our current TEK. We could go check CE3 as well, but we already know we'd get the same results, because of the number of decrypted packets on CE1.

Let's test this with an ACL change on the KS:

KS1(config)#ip access-list extended getvpn-acl
KS1(config-ext-nacl)#1 deny tcp any eq telnet any
KS1(config-ext-nacl)#2 deny tcp any any eq telnet
KS1(config-ext-nacl)#end
KS1#
*Jan 2 19:16:51.992: %SYS-5-CONFIG_I: Configured from console by console
*Jan 2 19:16:51.992: %GDOI-5-POLICY_CHANGE: GDOI group GDOI-GROUP1 policy has changed. Use 'crypto gdoi ks rekey' to send a rekey, or the changes will be send in the next scheduled rekey

The KS is smart enough to know we just changed global policy, and throws a reminder that the GMs won't be aware of this until the next scheduled rekey unless we force it:

KS1#crypto gdoi ks rekey
KS1#
*Jan 2 19:17:36.784: %GDOI-5-KS_SEND_UNICAST_REKEY: Sending Unicast Rekey with policy-replace for group GDOI-GROUP1 from address 192.168.111.111 with seq # 23

Meanwhile, back on CE1:

CE1#show crypto gdoi gm acl
Group Name: GDOI-GROUP1
ACL Downloaded From KS 192.168.111.111:
   access-list   deny tcp any port = 23 any
   access-list   deny tcp any any port = 23
   access-list   deny udp any port = 848 any
   access-list   deny udp any any port = 848
   access-list   deny tcp any port = 179 any
   access-list   deny tcp any any port = 179
   access-list   permit ip any any
ACL Configured Locally:

And then test from Host1:

HOST1#telnet 10.0.33.2
Trying 10.0.33.2 ... Open
Password required, but none set
[Connection to 10.0.33.2 closed by foreign host]

I don't actually have Host3 set up to accept telnet logins, but it's irrelevant - we just generated bidirectional traffic.

And for verification on CE1:

CE1#show crypto gdoi gm dataplane counters

Data-plane statistics for group GDOI-GROUP1:
    #pkts encrypt            : 10       #pkts decrypt            : 10
    <output omitted for brevity>

Note, the counters didn't go up this time - because this was sent in plain text. Let's pull those new telnet exemptions back off KS1 and try again:

KS1(config)#ip access-list ext getvpn-acl
KS1(config-ext-nacl)#no 1
KS1(config-ext-nacl)#no 2

KS1#crypto gdoi ks rekey

Try again on Host1:

HOST1#telnet 10.0.33.2
Trying 10.0.33.2 ... Open
Password required, but none set
[Connection to 10.0.33.2 closed by foreign host]

And back to CE1 for verification:

CE1#show crypto gdoi gm dataplane counters

Data-plane statistics for group GDOI-GROUP1:
    #pkts encrypt            : 31       #pkts decrypt            : 29
    <output omitted for brevity>

Perfect!

Another benefit of GETVPN is seamless QoS support. All Cisco tunneling solutions copy TOS markings from the original packet to the encrypted packet when creating the encrypted packet. As such, unless you're trying to perform egress marking (which would require a qos pre-classify configuration), no change is required in QoS to migrate to GETVPN.

We'll test from Host1 to Host2.

First, we need to setup QoS on CE1:

CE1(config)#class-map match-all EF
CE1(config-cmap)# match dscp ef
CE1(config-cmap)#policy-map QOS
CE1(config-pmap)# class EF
CE1(config-pmap-c)# priority 50000
CE1(config)#int e0/0
CE1(config-if)#service-policy output QOS

HOST1#ping   ! Extended Ping is required to set ToS to 184 (DSCP EF)
Protocol [ip]:
Target IP address: 10.0.222.2   ! Host2
Repeat count [5]: 100
Datagram size [100]:
Timeout in seconds [2]:
Extended commands [n]: y
Source address or interface:
Type of service [0]: 184
<output omitted for brevity>
Type escape sequence to abort.
Sending 100, 100-byte ICMP Echos to 10.0.222.2, timeout is 2 seconds:
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
Success rate is 100 percent (100/100), round-trip min/avg/max = 1/8/46 ms

Back to CE1 for verification:

CE1#show policy-map int | s EF
    Class-map: EF (match-all)
      100 packets, 19800 bytes
      30 second offered rate 0000 bps, drop rate 0000 bps
      Match: dscp ef (46)
      Priority: 50000 kbps, burst bytes 1250000, b/w exceed drops: 0
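The 184 typed into the extended ping and the "dscp ef (46)" in the output above are the same marking seen two ways: DSCP occupies the upper six bits of the ToS byte. A quick sketch of the arithmetic (the function names are my own):

```python
def tos_to_dscp(tos: int) -> int:
    """DSCP is the top six bits of the 8-bit ToS byte."""
    return tos >> 2


def dscp_to_tos(dscp: int) -> int:
    """Shift DSCP back into ToS-byte position (low two bits left at 0)."""
    return dscp << 2


print(tos_to_dscp(184))  # 46, i.e. EF
print(dscp_to_tos(46))   # 184, the value entered at the ping prompt
```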

Now let's take a look at COOP, the key server redundancy protocol for GETVPN.

COOP works by establishing permanent ISAKMP sessions between redundant key servers. It uses these sessions to keep GM registration status in sync, and uses dead peer detection (DPD) to ensure the other key servers are up.

As I mentioned previously, all KSs must have the same RSA public & private keys installed. This is so if the primary KS fails, the KEK session between the secondary KS(s) and the GMs can be maintained. In this fashion, re-registration from GM to the backup KS is not necessary if a KS fails. Also, traffic isn't impacted - a KS failure is a 'hitless' outage for the GMs.

An additional benefit of having the KSs in sync with one another as well as having the same key on all servers is that a GM can register to any KS - even if it's not the primary. This can become important if a network gets segmented, where some GMs can reach, say, KS1, and others can only reach KS2 (again, note there can be up to eight KSes). When the KSs can reach one another again, they sync their registration database back together!

One more important item of note, if you were paying attention to the diagram: I have KS2 behind a GM. Key servers can be directly connected as CEs, they can be behind a GM, they can basically be anywhere that's globally routable from the rest of the network - it doesn't matter. To show this, I put KS1 in a CE-style configuration, directly attached to PE1, and KS2 behind CE2, in more of a "host-like" setup.

A quick reminder of our config above used to generate the RSA keys:
crypto key generate rsa label MYRSAKEY modulus 1024 exportable

Now we need to go back to KS1 and retrieve that key for KS2:
KS1(config)#crypto key export rsa MYRSAKEY pem terminal 3des MYSECRETPASS

% Key name: MYRSAKEY
   Usage: General Purpose Key
   Key data:
-----BEGIN PUBLIC KEY-----
<Public Key Omitted for Brevity>
-----END PUBLIC KEY-----
-----BEGIN RSA PRIVATE KEY-----
Proc-Type: 4,ENCRYPTED
DEK-Info: DES-EDE3-CBC,0B2283C620CB3CCA

<Private Key Omitted for Brevity>
-----END RSA PRIVATE KEY-----

Now that we have the keys, we can import them into KS2:

KS2(config)#crypto key import rsa MYRSAKEY terminal MYSECRETPASS
% Enter PEM-formatted public General Purpose key or certificate.
% End with a blank line or "quit" on a line by itself.
-----BEGIN PUBLIC KEY-----
<Public Key Omitted for Brevity>
-----END PUBLIC KEY-----
quit
% Enter PEM-formatted encrypted private General Purpose key.
% End with "quit" on a line by itself.
-----BEGIN RSA PRIVATE KEY-----
Proc-Type: 4,ENCRYPTED
DEK-Info: DES-EDE3-CBC,0B2283C620CB3CCA

<Private Key Omitted for Brevity>
-----END RSA PRIVATE KEY-----
quit
% Key pair import succeeded.

Now back to configure COOP on KS1:

KS1(config)#crypto isakmp keepalive 10 periodic

KS1(config)#crypto isakmp key COOPKEY address 192.168.222.222
KS1(config)#crypto gdoi group GDOI-GROUP1
KS1(config-gdoi-group)#server local
KS1(gdoi-local-server)#redundancy
KS1(gdoi-coop-ks-config)#local priority 100
KS1(gdoi-coop-ks-config)#peer address ipv4 192.168.222.222

There's not really much new config here, but I'll run over the key elements:

crypto isakmp keepalive 10 periodic
COOP uses Dead Peer Detection (DPD) to keep track of its neighbors' up/down status, and DPD needs to be enabled with this command.

crypto isakmp key COOPKEY address 192.168.222.222
As the KSs maintain an ISAKMP session between them, we need a key to set the session up. I believe one could hypothetically re-use the same key from the GMs, but that seems like a bad idea, so I've made a practice of using a different key. Note, if you had more than two KSs, this config would need to be replicated for each KS.

redundancy
local priority 100
peer address ipv4 192.168.222.222

Enter the redundancy config, set the local priority - higher is better and more likely to become primary - and enter the address of each of the other COOP key servers. Note, each key server needs to be configured with the IPs of all the other key servers, so if you were running three COOP key servers, each KS would have two peer entries, one for each of the other two servers.
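The role of the priority value can be pictured with a small sketch. This is a deliberately simplified model, assuming the reachable KS with the highest priority takes the primary role; it does not model ties or what happens when a former primary comes back, and the class and function names are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class KeyServer:
    address: str
    priority: int
    alive: bool = True


def elect_primary(servers: List[KeyServer]) -> Optional[KeyServer]:
    """Pick the reachable KS with the highest priority (ties not modeled)."""
    candidates = [ks for ks in servers if ks.alive]
    if not candidates:
        return None
    return max(candidates, key=lambda ks: ks.priority)


ks1 = KeyServer("192.168.111.111", priority=100)
ks2 = KeyServer("192.168.222.222", priority=40)

print(elect_primary([ks1, ks2]).address)   # KS1 wins on priority

ks1.alive = False                          # e.g. KS1's uplink is shut
print(elect_primary([ks1, ks2]).address)   # KS2 is the only candidate left
```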

With mild adaptation of KS1's config to KS2, KS2's config looks like this:

ip access-list extended getvpn-acl
deny   udp any eq 848 any
deny   udp any any eq 848
deny   tcp any eq bgp any
deny   tcp any any eq bgp
permit ip any any

crypto isakmp policy 1
encr aes
authentication pre-share
group 2
crypto isakmp key COOPKEY address 192.168.111.111
crypto isakmp key MYGDOIPSK address 0.0.0.0
crypto isakmp keepalive 10 periodic
crypto ipsec transform-set aes128 esp-aes esp-sha-hmac
mode tunnel
crypto ipsec profile profile1
set transform-set aes128
crypto gdoi group GDOI-GROUP1
identity number 1234
server local
rekey algorithm aes 128
rekey authentication mypubkey rsa MYRSAKEY
rekey transport unicast
sa ipsec 1
   profile profile1
   match address ipv4 getvpn-acl
   replay time window-size 5
address ipv4 192.168.222.222
redundancy
   local priority 40
   peer address ipv4 192.168.111.111

And with that, COOP is up!

KS1#show crypto gdoi ks coop
Crypto Gdoi Group Name :GDOI-GROUP1
        Group handle: 2147483650, Local Key Server handle: 2147483650

        Local Address: 192.168.111.111
        Local Priority: 100
        Local KS Role: Primary   , Local KS Status: Alive
        Local KS version: 1.0.8
        Primary Timers:
                Primary Refresh Policy Time: 20
                Remaining Time: 10
                Antireplay Sequence Number: 247

        Peer Sessions:
        Session 1:
                Server handle: 2147483651
                Peer Address: 192.168.222.222
                Peer Version: 1.0.8
                Peer Priority: 40
                Peer KS Role: Secondary , Peer KS Status: Alive
                Antireplay Sequence Number: 31

                IKE status: Established
                Counters:
                    Ann msgs sent: 220
                    Ann msgs sent with reply request: 0
                    Ann msgs recv: 2
                    Ann msgs recv with reply request: 1
                    Packet sent drops: 27
                    Packet Recv drops: 0
                    Total bytes sent: 156002
                    Total bytes recv: 2247

As I mentioned, there's a permanent ISAKMP session established between COOP KSs, and you can see that with standard ISAKMP show commands:

KS1#show crypto isakmp sa
IPv4 Crypto ISAKMP SA
dst             src             state          conn-id status
192.168.222.222 192.168.111.111 GDOI_IDLE         1001 ACTIVE

We then need to tell our GMs about the additional server(s):

CE1, CE2, CE3, & CE4:
crypto gdoi group GDOI-GROUP1
server address ipv4 192.168.222.222

So - let's take KS1 out of commission and try a few things.

KS1(config)#int e0/0
KS1(config-if)#shut

Eventually KS2 will realize that KS1 is out of commission. It's important to note that unlike a redundancy protocol that's directly in the dataplane (like HSRP), a brief key server outage, in an appropriately-built GETVPN, shouldn't be a big deal. The key servers are there to distribute policy and keys, and the keys are sent out well in advance, so it's unlikely the GMs would even notice the KS outage until a reregistration happened sometime in the distant future.

KS2 (a while later):
*Jan 2 20:26:55.540: %GDOI-5-COOP_KS_TRANS_TO_PRI: KS 192.168.222.222 in group GDOI-GROUP1 transitioned to Primary (Previous Primary = 192.168.111.111)

*Jan 2 20:27:15.543: %GDOI-3-COOP_KS_UNREACH: Cooperative KS 192.168.111.111 Unreachable in group GDOI-GROUP1. IKE SA Status = Failed to establish.

KS2 realizes the primary is down and assumes the primary role itself.

Let's check in on CE1, and when it expects a rekey:

CE1#show crypto gdoi | i life
        sa timing:remaining key lifetime (sec): (3063)

It's got a bit. Let's see if it will accept new keys from KS2:

KS2#crypto gdoi ks rekey replace-now

CE1#show crypto gdoi | i life
        sa timing:remaining key lifetime (sec): (3598)

CE1#show crypto gdoi gm
Group Member Information For Group GDOI-GROUP1:
    IPSec SA Direction       : Both
    ACL Received From KS     : gdoi_group_GDOI-GROUP1_temp_acl

    Group member             : 192.168.11.3    vrf: None
       Local addr/port       : 192.168.11.3/848
       Remote addr/port      : 192.168.111.111/848
       fvrf/ivrf             : None/None
       Version               : 1.0.8
       Registration status   : Registered
       Registered with       : 192.168.111.111
       Re-registers in       : 3326 sec
       Succeeded registration: 1
       Attempted registration: 1
       Last rekey from       : 192.168.222.222
       Last rekey seq num    : 0
       Unicast rekey received: 7
       Rekey ACKs sent       : 7
       Rekey Rcvd(hh:mm:ss) : 00:01:31
       DP Error Monitoring   : OFF

As KS2 has KS1's RSA key, CE1 (GM) accepts KS2's authentication. Note CE1 still thinks it's registered with KS1, which is basically irrelevant, as KS2 has taken over all ongoing tasks of KS1.

Now let's force CE2 to re-register to KS2.

CE2(config)#int e0/0
CE2(config-if)#no crypto map gdoimap
*Jan 2 20:35:32.253: %CRYPTO-6-GDOI_ON_OFF: GDOI is OFF

GDOI disabled...

CE2(config-if)#crypto map gdoimap
*Jan 2 20:35:34.330: %CRYPTO-5-GM_REGSTER: Start registration to KS 192.168.111.111 for group GDOI-GROUP1 using address 192.168.12.2
*Jan 2 20:35:34.331: %CRYPTO-6-GDOI_ON_OFF: GDOI is ON

GDOI re-enabled, and now attempting registration to KS1.

*Jan 2 20:36:14.344: %CRYPTO-5-GM_REGSTER: Start registration to KS 192.168.222.222 for group GDOI-GROUP1 using address 192.168.12.2

CE2 gives up on KS1 and moves down its list to KS2.

<output omitted for brevity>
*Jan 2 20:36:14.366: %GDOI-5-GM_REGS_COMPL: Registration to KS 192.168.222.222 complete for group GDOI-GROUP1 using address 192.168.12.2
<output omitted for brevity>

And successful registration!
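The behaviour in those logs (try KS1, give up, move down the list to KS2) can be sketched in a few lines of Python. This is only a model of the ordering, not of real IOS timers; the function name register and the reachable set are assumptions for illustration.

```python
def register(configured_servers, reachable):
    """Return the first configured KS that answers, or None if none do."""
    for ks in configured_servers:
        if ks in reachable:
            return ks
    return None


# Order matches the GM config: KS1 first, then KS2.
configured = ["192.168.111.111", "192.168.222.222"]

# KS1 is shut down, so only KS2 answers.
print(register(configured, reachable={"192.168.222.222"}))
```

If both servers were reachable, the GM in this model would simply pick the first one in its list.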

Looking at KS2's registrations:
KS2#show crypto gdoi ks mem summary | i Member ID
Group Member ID    : 192.168.12.2        GM Version: 1.0.6
Group Member ID    : 192.168.13.2        GM Version: 1.0.6
Group Member ID    : 192.168.14.2        GM Version: 1.0.8
Group Member ID    : 192.168.11.3        GM Version: 1.0.8

The other GMs didn't reregister to KS2 - KS2 learned about them from KS1 before KS1 went offline.

Let's turn KS1 back online:
KS1(config-if)#no shut

It's all over the Cisco documentation that COOP doesn't support preemption:
"The recovering KS receives an announcement message reply from an existing primary, which has lower priority. In this case, there is no preemption, and the recovering KS remains a secondary KS. This eliminates unnecessary changes in the system."
http://www.cisco.com/c/dam/en/us/products/collateral/security/group-encrypted-transport-vpn/GETVPN_DIG_version_1_0_External.pdf

Well, you could have fooled me!:

KS1(config-if)#
*Jan 2 20:39:32.559: %BGP-5-ADJCHANGE: neighbor 192.168.11.1 Up
*Jan 2 20:39:45.739: %GDOI-5-COOP_KS_REACH: Reachability restored with Cooperative KS 192.168.222.222 in group GDOI-GROUP1.

KS1#show crypto gdoi ks coop | i Role
        Local KS Role: Primary   , Local KS Status: Alive
                Peer KS Role: Secondary , Peer KS Status: Alive

KS2#show crypto gdoi ks coop | i Role
        Local KS Role: Secondary , Local KS Status: Alive
                Peer KS Role: Primary   , Peer KS Status: Alive

I'm running IOS 15.4, and it occurs to me that this could've changed since the documentation was written, but that seems somewhat unlikely seeing how adamant Cisco was about this in all previous documentation. Does anyone have any idea here? I literally cannot get it to not preempt, so I find that confusing.

For a final topic related to COOP, a pure failure of a KS is one thing, but what happens if you have a network segmentation that has some GMs speaking to one KS and some GMs speaking to another KS, in a 'split-brain' scenario?

In that case, the KSs perform what's called a Key Server Merge, which doesn't have much relevance as a study topic (other than knowing it exists), but it does have some design implications. If you're reading this to build a large production GETVPN rather than for study purposes, I recommend reading

http://www.cisco.com/c/dam/en/us/products/collateral/security/group-encrypted-transport-vpn/GETVPN_DIG_version_1_0_External.pdf
and checking out section 3.7.4.2, "Network Split and Merge".

Now, on to some final topics:

Fail Open vs Fail Closed

If a crypto policy isn't in place or isn't matched, the default reaction of the router is to simply send the traffic unencrypted. This is normal, default behavior and isn't a feature. However, if security demands that traffic be stopped rather than being sent in the clear, the Fail Closed feature may be enabled on a per-GM basis:

First, create an ACL of what to still transmit even during fail-closed. For example, your routing and management traffic should probably still be permitted:

ip access-list extended fail-close
deny   tcp any eq bgp any
deny   tcp any any eq bgp

Much like the standard GETVPN ACL, "deny" means "send unencrypted".

Then you basically enable an extension to the crypto map. For example, my GDOI map is called "gdoimap", and looks like this:

crypto map gdoimap 1 gdoi
set group GDOI-GROUP1

In that case, you create this additional config:
crypto map gdoimap gdoi fail-close
match address fail-close
activate

Then, if a valid KEK isn't present, the only traffic allowed to transmit is BGP, given the ACL above.
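Here's a rough Python model of that decision for a GM that has no keys yet. It is a sketch of the logic described above, not of IOS internals; the function names and the packet dictionary layout are my own assumptions.

```python
BGP_PORT = 179


def is_bgp(pkt):
    """True for TCP packets to or from port 179, like the fail-close ACL."""
    return pkt["proto"] == "tcp" and BGP_PORT in (pkt.get("sport"), pkt.get("dport"))


def unregistered_gm_action(pkt, fail_close):
    """What a GM without keys does with an outbound packet."""
    if not fail_close:
        return "send-clear"   # default fail-open behaviour
    if is_bgp(pkt):
        return "send-clear"   # matched a 'deny' in the fail-close ACL
    return "drop"             # everything else is blocked until keys arrive


bgp_pkt = {"proto": "tcp", "sport": 40000, "dport": 179}
web_pkt = {"proto": "tcp", "sport": 40001, "dport": 80}

print(unregistered_gm_action(bgp_pkt, fail_close=True))
print(unregistered_gm_action(web_pkt, fail_close=True))
print(unregistered_gm_action(web_pkt, fail_close=False))
```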

Local Exception ACL

It's possible that one global ACL doesn't meet the needs of every GM. If you have a GM that needs to transmit some data in clear text even though it's indicated by the KS ACL that it should be encrypted, you can create a per-GM one-off ACL for this scenario.

From the Cisco documentation:
"The crypto ACL applied at the GM represents a concatenation of the downloaded ACL and local ACL. The order of operations is such that the locally defined ACL is checked first, followed by the one downloaded from the KS."
"Note:    Only deny statements can be added locally at the GM. Permit statements are not supported in the locally configured policies. In case of a conflict, local policy overrides the policy downloaded from the KS."
http://www.cisco.com/c/en/us/products/collateral/security/group-encrypted-transport-vpn/deployment_guide_c07_554713.html
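The order of operations in that quote is easy to model: local entries first, then the KS entries, first match wins, and "deny" means "send in the clear". The Python below is only a sketch under those assumptions; the packet layout and helper names are invented for illustration.

```python
def udp_848(p):
    return p["proto"] == "udp" and 848 in (p.get("sport"), p.get("dport"))


def bgp(p):
    return p["proto"] == "tcp" and 179 in (p.get("sport"), p.get("dport"))


# Same policy as getvpn-acl on the KS: deny = clear text, permit = encrypt.
KS_ACL = [("deny", udp_848), ("deny", bgp), ("permit", lambda p: True)]

# A local exception: ICMP goes out in the clear on this GM only.
LOCAL_ACL = [("deny", lambda p: p["proto"] == "icmp")]


def classify(pkt, local_acl, ks_acl):
    """First match wins; local entries are checked before KS entries."""
    for action, matches in local_acl + ks_acl:
        if matches(pkt):
            return "clear" if action == "deny" else "encrypt"
    return "clear"  # unmatched traffic is not encrypted


print(classify({"proto": "icmp"}, LOCAL_ACL, KS_ACL))  # local deny wins
print(classify({"proto": "icmp"}, [], KS_ACL))         # no exception configured
```

Note the exception only changes what this one GM sends; the other GMs still expect whatever the KS policy says.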

An ACL is created:
ip access-list extended exception-acl
deny   icmp any any

And then applied to the crypto map on the GM:
crypto map gdoimap 1 gdoi
match address exception-acl

I've gone ahead and applied this on CE1:

CE1#sh crypto gdoi gm acl
Group Name: GDOI-GROUP1
ACL Downloaded From KS 192.168.111.111:
   access-list   deny udp any port = 848 any
   access-list   deny udp any any port = 848
   access-list   deny tcp any port = 179 any
   access-list   deny tcp any any port = 179
   access-list   permit ip any any
ACL Configured Locally:
Map Name: gdoimap
   access-list exception-acl deny icmp any any

Let's test it...

CE1#ping 192.168.33.33   ! CE3's loopback
Type escape sequence to abort.
Sending 5, 100-byte ICMP Echos to 192.168.33.33, timeout is 2 seconds:
.....
Success rate is 0 percent (0/5)

I scratched my head for a split second until I remembered that CE3 doesn't agree with the exception policy and therefore won't take unencrypted ICMP traffic:

CE3#
*Jan 2 21:08:30.239: %CRYPTO-4-RECVD_PKT_NOT_IPSEC: Rec'd packet not an IPSEC packet. (ip) vrf/dest_addr= /192.168.33.33, src_addr= 192.168.11.3, prot= 1

However, the Key Servers, which don't encrypt or decrypt data traffic themselves, can only reply to unencrypted ICMP traffic:

CE1#ping 192.168.111.111
Type escape sequence to abort.
Sending 5, 100-byte ICMP Echos to 192.168.111.111, timeout is 2 seconds:
!!!!!
Success rate is 100 percent (5/5), round-trip min/avg/max = 1/2/6 ms

Receive-only SA

When deploying GETVPN on an existing network, it is almost certainly desirable to ensure all GMs can decrypt traffic before beginning to encrypt traffic - otherwise, some GMs would be sending encrypted traffic to other GMs that hadn't had the config pasted in yet.

Receive-only SA is a policy pushed from the KS that tells all GMs to decrypt traffic but not encrypt it. Implementation is very simple:

crypto gdoi group GDOI-GROUP1
server local
sa receive-only

Passive SA

While deploying Receive-only SA, it may also be a good idea to do small-scale encryption testing without globally rolling encryption and hoping for the best. Passive SA is a per-GM setting that basically overrides the sa receive-only command pushed from the KS. It indicates that the GM should encrypt and decrypt traffic, rather than just decrypting it. This allows for a single-GM (or however many you'd like to apply the config to) rollout of encryption, without applying it globally with the KS.

crypto gdoi group GDOI-GROUP1
passive

Interestingly, there's also a privileged EXEC command for the GMs that can control whether to encrypt, decrypt, or both - basically a macro for the functions described above:

CE3#crypto gdoi gm group 1234 ipsec direction ?
both     IPsec SA will only accept cipher text and will encrypt the packet
              before forwarding it out
inbound Specify IPsec SA inbound options

CE3#crypto gdoi gm group 1234 ipsec direction inbound ?
only      IPsec SA will accept both cipher/plain text and will forward the
           packet in clear.
optional IPsec SA will accept both cipher/plain text and will encrypt the
           packet before forwarding it out

Using this command, you can require encryption in both directions ('both'), accept encrypted or clear-text traffic inbound while still sending in the clear ('inbound only'), or accept encrypted or clear-text traffic inbound while encrypting everything outbound ('inbound optional').
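If it helps to see the three options side by side, the CLI help text above can be tabulated like this. It's just a Python restatement of that help output; the dictionary keys accept_clear_inbound and encrypt_outbound are names I made up.

```python
# Derived from the "ipsec direction ?" help text shown above.
DIRECTION = {
    "both":             {"accept_clear_inbound": False, "encrypt_outbound": True},
    "inbound only":     {"accept_clear_inbound": True,  "encrypt_outbound": False},
    "inbound optional": {"accept_clear_inbound": True,  "encrypt_outbound": True},
}


def describe(mode):
    """Return a one-line summary of what a GM does in the given mode."""
    flags = DIRECTION[mode]
    rx = "cipher or plain text" if flags["accept_clear_inbound"] else "cipher text only"
    tx = "encrypted" if flags["encrypt_outbound"] else "in the clear"
    return f"{mode}: accepts {rx}, forwards {tx}"


for mode in DIRECTION:
    print(describe(mode))
```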

Multiple Group Support and Authorization Lists

We've only been using one group up until now. Imagine if you had separate divisions inside a company that shared a single VRF in a L3 MPLS VPN but have no business speaking to one another, at least not without going through a firewall at HQ first (I've also heard there are service provider applications for this deployment).

Let's say Division 1 is CE1 and CE2, and Division 2 is CE3 and CE4.

Let's configure the new group on KS1. For brevity, I am deliberately not configuring KS2; the second group will not participate in the COOP config from above.

KS1(config)#crypto gdoi group GDOI-GROUP2
KS1(config-gdoi-group)#identity number 6789
KS1(config-gdoi-group)#server local
KS1(gdoi-local-server)#rekey algorithm aes 128
KS1(gdoi-local-server)#rekey authentication mypubkey rsa MYRSAKEY
KS1(gdoi-local-server)#rekey transport unicast
KS1(gdoi-local-server)#sa ipsec 1
KS1(gdoi-sa-ipsec)#profile profile1
KS1(gdoi-sa-ipsec)#match address ipv4 getvpn-acl
KS1(gdoi-sa-ipsec)#replay time window-size 5
KS1(gdoi-sa-ipsec)#address ipv4 192.168.111.111

And on CE3 and CE4:
CE3(config)#crypto gdoi group GDOI-GROUP1 ! name is only locally significant
CE3(config-gdoi-group)#no server address ipv4 192.168.222.222 ! remove KS2
CE3(config-gdoi-group)#identity number 6789 ! change the group number

CE4(config)#crypto gdoi group GDOI-GROUP1 ! name is only locally significant
CE4(config-gdoi-group)#no server address ipv4 192.168.222.222 ! remove KS2
CE4(config-gdoi-group)#identity number 6789 ! change the group number

CE4#ping 192.168.33.33 so lo0    ! 192.168.33.33 is CE3's loopback
Type escape sequence to abort.
Sending 5, 100-byte ICMP Echos to 192.168.33.33, timeout is 2 seconds:
Packet sent with a source address of 192.168.44.44
!!!!!
Success rate is 100 percent (5/5), round-trip min/avg/max = 5/5/6 ms

CE4#sh crypto gdoi gm dataplane counters

Data-plane statistics for group GDOI-GROUP1:
    #pkts encrypt            : 5       #pkts decrypt            : 5
    #pkts tagged (send)      : 0        #pkts untagged (rcv)     : 0
    #pkts no sa (send)       : 0        #pkts invalid sa (rcv)   : 0
    #pkts encaps fail (send) : 0        #pkts decap fail (rcv)   : 0
    #pkts invalid prot (rcv) : 0        #pkts verify fail (rcv) : 0
    #pkts not tagged (send) : 0        #pkts not untagged (rcv) : 0
    #pkts internal err (send): 0        #pkts internal err (rcv) : 0

OK, CE3 and CE4 can still talk to one another.

CE4#ping 10.0.111.1 ! An interface on CE1
Type escape sequence to abort.
Sending 5, 100-byte ICMP Echos to 10.0.111.1, timeout is 2 seconds:
.....
Success rate is 0 percent (0/5)

CE1#
*Jan 2 21:38:41.803: %CRYPTO-4-RECVD_PKT_INV_SPI: decaps: rec'd IPSEC packet has invalid spi for destaddr=10.0.111.11, prot=50, spi=0xA0B897B5(2696452021), srcaddr=192.168.14.2, input interface=Ethernet0/0

No talking from CE4 to CE1.

You'll note I used the same RSA keys on both groups:
KS1(gdoi-local-server)#rekey authentication mypubkey rsa MYRSAKEY

It doesn't matter - the TEKs aren't built off the RSA key. The RSA key is just for the KS to authenticate to the GM, to prove it's still the original group of KSs the GM registered to.

But there's still an easy, easy way to work around this on the GM:

CE4(config)#crypto gdoi group GDOI-GROUP1
CE4(config-gdoi-group)#identity number 1234

CE4#ping 10.0.111.1 so lo0
Type escape sequence to abort.
Sending 5, 100-byte ICMP Echos to 10.0.111.1, timeout is 2 seconds:
Packet sent with a source address of 192.168.44.44
!!!!!
Success rate is 100 percent (5/5), round-trip min/avg/max = 6/6/7 ms

Authorization lists to the rescue!

KS1(config)#access-list 10 permit 192.168.11.3 ! CE1's registration address
KS1(config)#access-list 10 permit 192.168.12.2 ! CE2's registration address

KS1(config)#access-list 20 permit 192.168.13.2 ! CE3's registration address
KS1(config)#access-list 20 permit 192.168.14.2 ! CE4's registration address

KS1(config)#crypto gdoi group GDOI-GROUP1
KS1(config-gdoi-group)# server local
KS1(gdoi-local-server)# authorization address ipv4 10
KS1(gdoi-local-server)#crypto gdoi group GDOI-GROUP2
KS1(config-gdoi-group)# server local
KS1(gdoi-local-server)# authorization address ipv4 20

Forcing re-registration on CE4:
CE4(config-if)#no crypto map gdoimap
CE4(config-if)#crypto map gdoimap

KS1(config)#
*Jan 2 21:48:15.202: %GDOI-1-UNAUTHORIZED_IPADDR: Group GDOI-GROUP1 received registration from unauthorized ip address: 192.168.14.2

And CE4 begrudgingly goes back to his own group:
CE4(config-if)#crypto gdoi group GDOI-GROUP1
CE4(config-gdoi-group)#identity number 6789

*Jan 2 21:49:25.208: %GDOI-5-GM_REGS_COMPL: Registration to KS 192.168.111.111 complete for group GDOI-GROUP1 using address 192.168.14.2 fvrf default ivrf default

...and succeeds.
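Conceptually, the authorization list is just a per-group membership check on the registering address. Here's a minimal Python sketch of that idea, assuming exact host matches only (real ACLs can use wildcards) and using CE2's address as it appears in its registration logs; the names AUTHORIZED and authorize are hypothetical.

```python
# Group identity number -> addresses allowed to register (host matches only).
AUTHORIZED = {
    1234: {"192.168.11.3", "192.168.12.2"},  # CE1, CE2
    6789: {"192.168.13.2", "192.168.14.2"},  # CE3, CE4
}


def authorize(group_id, gm_address):
    """True if the GM's registration address is permitted for the group."""
    return gm_address in AUTHORIZED.get(group_id, set())


print(authorize(1234, "192.168.14.2"))  # CE4 trying to join Division 1's group
print(authorize(6789, "192.168.14.2"))  # CE4 back in its own group
```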

VRF Lite Support

One CE can join multiple, different GDOI groups on the KS by using VRF-lite on the CEs. This isn't that complex conceptually, but I saw little reason to lab it. If you want to know more, here's the link to the Cisco documentation: http://www.cisco.com/c/en/us/products/collateral/ios-nx-os-software/enterprise-class-teleworker-ect-solution/prod_white_paper0900aecd80617171.html

Note, the KS does not support VRFs.

Cheers,

Jeff

New Material Coming Soon... honest!

2015-12-27T07:24:00.002-08:00

Just shy of a year ago, I posted:

"...the blog will continue!
My next step is CCNP Voice, and I plan on writing up my findings here, as well as any interesting R&S topics I come across."

Well, that didn't end up panning out as anticipated. The major problem was that I spent three full years of my life working on getting my CCIE number, and after that I had a long list of things - not related to IT - that needed completed, such as:

- Lots of home repair projects
- Cleaning my garage (Took nearly two months; it was pretty bad)
- Replacing a Jeep engine
- etc etc

Those tasks took a bit longer than I had anticipated. I've been non-stop busy since last January, and a year later I still have about 5% of the list left, but it's fairly manageable and something I can do in minor spare time now.

I did in fact pick up the CCNP Voice material, only to have it immediately replaced with CCNP Collab, and by that point I was so embroiled in 'the list' mentioned above that I didn't bother buying it.

However, with my 1-year anniversary of my lab pass looming, I figure I better start studying for the written again.

I nearly decided to ditch R&S and go re-up on the Collab written, as I spend more of my time in that space now than I do in R&S. However, to be fair to my family, I decided that (hopefully) the R&S written would take me less time to study for than the Collab, and I can just plan on re-upping on Collab next time when I've got more "spare time".

There are a good number of topics on the v5 written that weren't on the v5 lab, but the big ones that I'm not already an expert at are GETVPN and IS-IS.

I'm working on the GETVPN blog right now, which will be posted when it's done/when I have time, but I expect in the coming weeks.

Cheers,

Jeff

46110

2015-01-05T21:38:00.001-08:00

Well folks, I am finally done. Two years, 11 months. Today, January 5th, 2015, I passed, on my 4th attempt - #46110.

However, the blog will continue!

My next step is CCNP Voice, and I plan on writing up my findings here, as well as any interesting R&S topics I come across.

Best of luck to everyone else on this track, the hard work does eventually pay off.

Cheers,

Jeff

[mini] Fail-Over Policy Based Routing

2014-10-04T11:21:00.004-07:00

Playing with PBR recently I came across what I thought was an odd usage - two set commands in the same statement.

i.e.

route-map PBR permit 10
match ip address to-be-matched
set ip next-hop 192.168.0.1
set ip default next-hop 192.168.1.1

This is a bit odd to look at until you break it down.

Turns out there's an order of operations to PBR set statements.

From the Cisco documentation:

1. set ip next-hop
2. set interface
3. set ip default next-hop
4. set default interface

This means set ip next-hop will be attempted prior to, say, set interface. If it fails, then the next statement will be evaluated.
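As a rough mental model of that ordering, here's a Python sketch. It is not how IOS implements PBR; "usable" stands in for "next-hop reachable / interface up", and it folds in the rule (covered further down) that the default variants only apply when the destination is reached via a default route. All names are my own.

```python
ORDER = [
    "set ip next-hop",
    "set interface",
    "set ip default next-hop",
    "set default interface",
]


def pick_action(configured, usable, via_default_only):
    """Return the first set command that can be applied, or None."""
    for command in ORDER:
        target = configured.get(command)
        if target is None or target not in usable:
            continue  # not configured, or next-hop/interface is down
        if "default" in command and not via_default_only:
            continue  # a specific route exists, so the default variants are skipped
        return command, target
    return None  # nothing applied: the packet is routed normally


# The two-set example from the top of this post.
route_map = {
    "set ip next-hop": "192.168.0.1",
    "set ip default next-hop": "192.168.1.1",
}

print(pick_action(route_map, {"192.168.0.1", "192.168.1.1"}, via_default_only=False))
print(pick_action(route_map, {"192.168.1.1"}, via_default_only=True))
print(pick_action(route_map, {"192.168.1.1"}, via_default_only=False))
```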

When I saw that, the first place my brain went to was, why not create two route-map elements to fix this?

(please note it's hard to air your dirty laundry on the Internet. Yes, this seemed dumb after I tested it)

route-map PBR permit 10
match ip address to-be-matched
set ip next-hop 192.168.0.1
route-map PBR permit 20
match ip address to-be-matched
set ip default next-hop 192.168.1.1

My thought process here was that if statement 10 failed to apply the set statement, then it would move on to statement 20. This is, of course, not true. Just like an ACL, a route-map stops evaluating future statements as soon as it has a match. So in the above config, using the same ACL (or even two ACLs that both matched the same traffic in different ways), statement 10 is always matched, and if it fails, traffic is just normally routed.
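The "first match wins" behaviour that bit me can be sketched like this in Python. It's a toy model with invented names (evaluate, set_usable), assuming each statement carries a single set command.

```python
def evaluate(route_map, packet_matches, set_usable):
    """Stop at the first statement whose ACL matches, even if its set fails."""
    for seq, acl, set_cmd in route_map:
        if packet_matches(acl):
            if set_usable(set_cmd):
                return f"statement {seq}: {set_cmd}"
            return f"statement {seq}: set failed, routed normally"
    return "no match, routed normally"


two_statements = [
    (10, "to-be-matched", "set ip next-hop 192.168.0.1"),
    (20, "to-be-matched", "set ip default next-hop 192.168.1.1"),
]

# The packet matches the ACL, but 192.168.0.1 is unreachable.
result = evaluate(
    two_statements,
    packet_matches=lambda acl: acl == "to-be-matched",
    set_usable=lambda cmd: "192.168.0.1" not in cmd,
)
print(result)
```

Statement 20 is never consulted, which is exactly why the fail-over has to live inside a single statement.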

So there is some reason (albeit niche cases) to put "fail-over" statements into the route-map. The CCIE lab is basically all about niche cases (less lovingly called a "stupid router trick" by most of us), so this seemed worth exploring.

Here's our topology:

It's a little complex but I wanted to show a lot of different possibilities in one route-map statement.

R1's loopback0 (1.1.1.1/32) will be our source, travelling towards R9's loopback0 (9.9.9.9). Segments are IPed as 192.168.XY.Z/24, where XY is the lower and higher router number on the segment, and Z is the local router number. Example: The serial segment between R2 and R4 is 192.168.24.0/24, with R2's interface being 192.168.24.2 and R4 being 192.168.24.4.
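If you want to double-check an address while reading the traceroutes, the numbering scheme is simple enough to compute. A small Python sketch, assuming single-digit router numbers as in this topology (the function name segment is mine):

```python
def segment(a, b):
    """Addresses on the link between routers a and b (192.168.XY.Z/24)."""
    low, high = sorted((a, b))
    prefix = f"192.168.{low}{high}"
    return {
        "network": f"{prefix}.0/24",
        f"R{a}": f"{prefix}.{a}",
        f"R{b}": f"{prefix}.{b}",
    }


print(segment(2, 4))
```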

EIGRP is advertising every IP in the topology. However, R5, R6 and R7 are summarizing all routes behind them to a default route towards R2.
R2 has an offset list towards R4 to make paths through it less desirable:

R2:
router eigrp 100
network 0.0.0.0

offset-list 0 in 50 Serial4/1

The net result of this is that traffic will be sent from R2 through R3 towards R9 unless PBR is involved.

Here's R2's PBR config:

ip access-list extended match

permit ip host 1.1.1.1 host 9.9.9.9

route-map PBR permit 10

match ip address match

set ip next-hop 192.168.24.4

set interface Serial4/2

set ip default next-hop 192.168.26.6

set default interface Serial4/4

interface FastEthernet1/0

ip policy route-map PBR

This will match traffic from 1.1.1.1 towards 9.9.9.9. The first step is to attempt to send the traffic towards R4.

So to be clear, non-PBR traffic will go through R3:

R2#sh ip cef 9.9.9.9

9.9.9.9/32

nexthop 192.168.23.3 Serial4/0

R1#trace 9.9.9.9 source fa1/0 ! Not from 1.1.1.1

Type escape sequence to abort.

Tracing the route to 9.9.9.9

VRF info: (vrf in name/id, vrf out name/id)

1 192.168.12.2 44 msec 88 msec 48 msec

2 192.168.23.3 112 msec 40 msec 68 msec

3 192.168.39.9 116 msec 116 msec 160 msec

Now let's try our PBR match:

R1#trace 9.9.9.9 source Loopback0

Type escape sequence to abort.

Tracing the route to 9.9.9.9

VRF info: (vrf in name/id, vrf out name/id)

1 192.168.12.2 108 msec 48 msec 72 msec

2 192.168.24.4 96 msec 44 msec 76 msec

3 192.168.49.9 136 msec 108 msec 108 msec

As expected, it went R2 -> R4 -> R9.

What if the link between R2 and R4 went down?

R2(config)#int s4/1

R2(config-if)#shut

Referencing back to our route-map...

route-map PBR permit 10

match ip address match

~~set ip next-hop 192.168.24.4~~ (now unavailable)

set interface Serial4/2

set ip default next-hop 192.168.26.6

set default interface Serial4/4

We'd expect PBR to send the traffic through R5 next, via Serial4/2. A very important note: these are point-to-point serial interfaces. Ethernet is not such a good choice for set interface. The problem should be obvious to any CCIE candidate: we'd be relying on the far side to proxy ARP for 9.9.9.9, which it will do in our design, but also because of our design, IOS will typically reject the ARP change as "wrong interface".

In short, the safe answer is to use serial (P2P) interfaces.

Also, note again that according to the routing table, R2 should send traffic towards 9.9.9.9 via R3:

R2(config-if)#do sh ip cef 9.9.9.9

9.9.9.9/32

nexthop 192.168.23.3 Serial4/0

From R1's traceroute, we see that traffic does go through R5:

R1#trace 9.9.9.9 source Loopback0

Type escape sequence to abort.

Tracing the route to 9.9.9.9

VRF info: (vrf in name/id, vrf out name/id)

1 192.168.12.2 24 msec 116 msec 44 msec

2 192.168.25.5 92 msec 48 msec 112 msec

3 192.168.58.8 96 msec 128 msec 96 msec

4 192.168.89.9 152 msec 108 msec 96 msec

We've successfully failed the first statement and moved to the 2nd one.

R2(config-if)#int s4/2

R2(config-if)#shut

route-map PBR permit 10

match ip address match

~~set ip next-hop 192.168.24.4~~ (now unavailable)

~~set interface Serial4/2~~ (now unavailable)

set ip default next-hop 192.168.26.6

set default interface Serial4/4

The set [ip] default commands will only trigger if the route towards the destination is via a default.

These won't work for us yet because....

R2#sh ip route 9.9.9.9

Routing entry for 9.9.9.9/32

Known via "eigrp 100", distance 90, metric 2300416, type internal

[output omitted]

We have a specific route.

We see our PBR is doing nothing now:

R1#trace 9.9.9.9 source Loopback0

Type escape sequence to abort.

Tracing the route to 9.9.9.9

VRF info: (vrf in name/id, vrf out name/id)

1 192.168.12.2 76 msec 36 msec 68 msec

2 192.168.23.3 116 msec 52 msec 60 msec (through R3)

3 192.168.39.9 84 msec 104 msec 100 msec

I haven't got a smooth answer for this, so let's just make R3 send a default as well. Note I've increased the delay between R5, R6, R7 and R8, so that R3 will still be preferred even with just a default being sent.

R3(config)#int s2/0

R3(config-if)#ip summary-address eigrp 100 0.0.0.0 0.0.0.0

R2#sh ip route 9.9.9.9 long

[output omitted]

Gateway of last resort is 192.168.23.3 to network 0.0.0.0

So now the best match for 9.9.9.9 on R2 is a default route. Let's see our PBR kick in again.

R1#trace 9.9.9.9 source Loopback0

Type escape sequence to abort.

Tracing the route to 9.9.9.9

VRF info: (vrf in name/id, vrf out name/id)

1 192.168.12.2 40 msec 88 msec 40 msec

2 192.168.26.6 88 msec 108 msec 48 msec

3 192.168.68.8 128 msec 140 msec 124 msec

4 192.168.89.9 92 msec 116 msec 112 msec

Through R6!

And if the link to R6 is down?

R2(config)#int s4/3

R2(config-if)#shut

route-map PBR permit 10

match ip address match

~~set ip next-hop 192.168.24.4~~ (now unavailable)

~~set interface Serial4/2~~ (now unavailable)

~~set ip default next-hop 192.168.26.6~~ (now unavailable)

set default interface Serial4/4

R1#trace 9.9.9.9 source Loopback0

Type escape sequence to abort.

Tracing the route to 9.9.9.9

VRF info: (vrf in name/id, vrf out name/id)

1 192.168.12.2 80 msec 60 msec 88 msec

2 192.168.27.7 112 msec 44 msec 88 msec

3 192.168.78.8 96 msec 92 msec 100 msec

4 192.168.89.9 88 msec 128 msec 132 msec

Through R7.

That wraps up my main point, but while we've got this setup, let's look at recursive PBR too.

I'm no-shutting all the interfaces we turned down earlier, and re-advertising the specific route from R3.

I've also added a leak-map on R5, R6 and R7 to allow R8's Lo0 (8.8.8.8) through in addition to the default route. Additionally, I de-prefed 8.8.8.8 through R3 and R4.

So to be clear, 9.9.9.9 is now reachable via R3, and 8.8.8.8 is reachable via equal-cost load sharing on R5, R6 and R7:

R2#sh ip cef 9.9.9.9

9.9.9.9/32

nexthop 192.168.23.3 Serial4/0

R2#sh ip cef 8.8.8.8

8.8.8.8/32

nexthop 192.168.25.5 Serial4/2

nexthop 192.168.26.6 Serial4/3

nexthop 192.168.27.7 Serial4/4

Recursive PBR allows for ECMP (equal cost multipathing) and PBR to mix. In short, pre-PBR, the path to 9.9.9.9 is via R3. Post PBR, we'll target having "8.8.8.8" as the next hop - which will ECMP through R5, R6 and R7.

In our environment, however, this is a bit hard to see, because per-destination CEF ECMP won't show up on our traceroute. Let's change to per-packet:

R2(config-route-map)#int s4/2

R2(config-if)#ip load-sharing per-packet

R2(config-if)#int s4/3

R2(config-if)#ip load-sharing per-packet

R2(config-if)#int s4/4

R2(config-if)#ip load-sharing per-packet

And let's re-write our route-map for recursion:

R2(config)#no route-map PBR permit 10

R2(config)#route-map PBR permit 10

R2(config-route-map)#match ip address match

R2(config-route-map)#set ip next-hop recursive 8.8.8.8

And test:

R1#trace 9.9.9.9 source Loopback0

Type escape sequence to abort.

Tracing the route to 9.9.9.9

VRF info: (vrf in name/id, vrf out name/id)

1 192.168.12.2 28 msec * 44 msec

2 192.168.25.5 72 msec <-- Indicates ECMP

192.168.26.6 76 msec <-- Indicates ECMP

192.168.27.7 60 msec <-- Indicates ECMP

3 192.168.58.8 132 msec

192.168.68.8 80 msec

192.168.78.8 64 msec

4 192.168.89.9 76 msec 76 msec 64 msec
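What the traceroute shows can be modelled as two steps: resolve the recursive next hop (8.8.8.8) to its equal-cost paths, then rotate through them per packet. The Python below is a simplified sketch that assumes plain round-robin for per-packet load sharing; the FIB dictionary and function name are illustrative only.

```python
from itertools import cycle, islice

# Equal-cost paths as shown by "sh ip cef" earlier.
FIB = {
    "8.8.8.8": ["192.168.25.5", "192.168.26.6", "192.168.27.7"],
    "9.9.9.9": ["192.168.23.3"],
}


def per_packet_hops(recursive_next_hop, packets):
    """Next hop used for each packet when PBR recurses to the given address."""
    paths = FIB[recursive_next_hop]
    return list(islice(cycle(paths), packets))


# Three traceroute probes policy-routed via 8.8.8.8 hit R5, R6 and R7 in turn.
print(per_packet_hops("8.8.8.8", 3))
```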

Hope you enjoyed!

Jeff

CCIE v4 to v5: BGP NHT, SAT, FSD, Dynamic Neighbors, Multisession Transport Per AF

2014-09-07T17:59:00.002-07:00

BGP Next Hop Tracking (NHT) is an on-by-default feature that notifies BGP of a change in routing for BGP prefix next-hops. This is important because previously this only happened as part of the BGP Scanner process, which runs every 60 seconds by default. Waiting 60 seconds to determine your BGP route is effectively no longer valid (because of invalid next-hop) significantly hampers reconvergence. Instead of being timer-based, NHT makes the process of dealing with next-hop changes event-driven.

EIGRP is peered on all routers on the 192.168.124.0/24 link.

Here's the relevant base BGP config:

R1:
router bgp 1
bgp log-neighbor-changes
neighbor 3.3.3.3 remote-as 3
neighbor 3.3.3.3 ebgp-multihop 255
neighbor 3.3.3.3 update-source Loopback0
neighbor 4.4.4.4 remote-as 4
neighbor 4.4.4.4 ebgp-multihop 255
neighbor 4.4.4.4 update-source Loopback0

R3:

router bgp 3

bgp log-neighbor-changes

neighbor 1.1.1.1 remote-as 1

neighbor 1.1.1.1 ebgp-multihop 255

neighbor 1.1.1.1 update-source Loopback0

neighbor 192.168.34.4 remote-as 4

R4:

interface Loopback1

ip address 44.44.44.44 255.255.255.255

router bgp 4

bgp log-neighbor-changes

network 44.44.44.44 mask 255.255.255.255

neighbor 1.1.1.1 remote-as 1

neighbor 1.1.1.1 ebgp-multihop 255

neighbor 1.1.1.1 update-source Loopback0

neighbor 192.168.34.3 remote-as 3

In short, we're using ebgp multihop in order to keep my mock-up smaller. We have two paths from R1 to R4's 44.44.44.44:

R1 -> R4's 4.4.4.4 (and consequently to 44.44.44.44 in the same hop)

R1 -> R3's 3.3.3.3, then R3 to R4's 192.168.34.4

The first route has one AS in its AS-PATH, the 2nd route has two ASes, and is less preferred.
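Reduced to just the AS-PATH comparison, the choice looks like this in Python. This sketch deliberately ignores every other best-path step (weight, local preference and so on), which I'm assuming are equal here; the labels are my own.

```python
# AS-PATH as seen from R1 for 44.44.44.44/32.
paths = {
    "via 4.4.4.4 (direct to R4)": [4],
    "via 3.3.3.3 (through R3)": [3, 4],
}

# Shorter AS-PATH wins when everything evaluated before it is equal.
best = min(paths, key=lambda name: len(paths[name]))
print(best)
```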

R1#sh ip bgp 44.44.44.44 bestpath

BGP routing table entry for 44.44.44.44/32, version 11

Paths: (2 available, best #1, table default)

Advertised to update-groups:

Refresh Epoch 2

4.4.4.4 (metric 10880) from 4.4.4.4 (44.44.44.44)

Origin IGP, metric 0, localpref 100, valid, external, best

rx pathid: 0, tx pathid: 0x0

Let's try this experiment without NHT enabled first:

R1(config)#router bgp 1

R1(config-router)# no bgp nexthop trigger enable

R1#debug ip routing

IP routing debugging is on

R4(config-if)#int lo0 ! this is the 4.4.4.4 interface (the next-hop for 44.44.44.44 from R1)

R4(config-if)#shut

Debug from R1 below

===============

*Sep 17 22:59:03.552: RT: delete route to 4.4.4.4 via 192.168.124.4, eigrp metric [90/10880]

*Sep 17 22:59:03.552: RT: no routes to 4.4.4.4, delayed flush

*Sep 17 22:59:03.552: RT: delete subnet route to 4.4.4.4/32

*Sep 17 22:59:03.552: RT: updating eigrp 4.4.4.4/32 (0x0) :

via 192.168.124.4 Gi1.124 0 1048578

*Sep 17 22:59:03.552: RT: rib update return code: 5

================

This happened as fast as EIGRP converged - very quickly. So we know 4.4.4.4 isn't a valid route any longer, but what about 44.44.44.44?

R1#sh ip bgp 44.44.44.44 bestpath

BGP routing table entry for 44.44.44.44/32, version 11

Paths: (2 available, best #1, table default)

Advertised to update-groups:

Refresh Epoch 2

4.4.4.4 (metric 10880) from 4.4.4.4 (44.44.44.44)

Origin IGP, metric 0, localpref 100, valid, external, best

rx pathid: 0, tx pathid: 0x0

R1#ping 44.44.44.44

Type escape sequence to abort.

Sending 5, 100-byte ICMP Echos to 44.44.44.44, timeout is 2 seconds:

.....

Success rate is 0 percent (0/5)

Still thinking the next-hop is 4.4.4.4, and it's Very Down.

I didn't time it this way deliberately, but remember the BGP scanner runs every 60 seconds. So, 51 seconds after we yanked the 4.4.4.4 next-hop, BGP finally figured out something was up and reconverged to the alternate path for 44.44.44.44 via R3.

*Sep 17 22:59:54.031: RT: updating bgp 44.44.44.44/32 (0x0) :

via 3.3.3.3 0 1048577

*Sep 17 22:59:54.031: RT: closer admin distance for 44.44.44.44, flushing 1 routes

*Sep 17 22:59:54.031: RT: add 44.44.44.44/32 via 3.3.3.3, bgp metric [20/0]

R1#ping 44.44.44.44

Type escape sequence to abort.

Sending 5, 100-byte ICMP Echos to 44.44.44.44, timeout is 2 seconds:

!!!!!

Success rate is 100 percent (5/5), round-trip min/avg/max = 1/1/3 ms

R1#trace 44.44.44.44

Type escape sequence to abort.

Tracing the route to 44.44.44.44

VRF info: (vrf in name/id, vrf out name/id)

1 192.168.124.3 4 msec 1 msec 0 msec

2 192.168.34.4 2 msec * 2 msec

A 51 second reconverge in a modern network is pretty awful.

R4(config-if)#int lo0

R4(config-if)#no shut

Let's re-add the next-hop trigger and try again.

R1(config-router)#router bgp 1

R1(config-router)#bgp nexthop trigger enable

R4(config-if)#int lo0

R4(config-if)#shut

Debug from R1 below

===============

*Sep 17 23:11:53.582: RT: delete route to 4.4.4.4 via 192.168.124.4, eigrp metric [90/10880]

*Sep 17 23:11:53.582: RT: no routes to 4.4.4.4, delayed flush

*Sep 17 23:11:53.582: RT: delete subnet route to 4.4.4.4/32

*Sep 17 23:11:53.582: RT: updating eigrp 4.4.4.4/32 (0x0) :

via 192.168.124.4 Gi1.124 0 1048578

*Sep 17 23:11:53.582: RT: rib update return code: 5

*Sep 17 23:11:58.582: RT: updating bgp 44.44.44.44/32 (0x0) :

via 3.3.3.3 0 1048577

*Sep 17 23:11:58.582: RT: closer admin distance for 44.44.44.44, flushing 1 routes

*Sep 17 23:11:58.582: RT: add 44.44.44.44/32 via 3.3.3.3, bgp metric [20/0]

===============

Note the bgp lines at the bottom of the output: we see the reconverge this time, in 5 seconds. Why 5 seconds?

The bgp nexthop trigger delay command defines how long the NHT process waits after a next-hop change before it tells BGP to update. This timer is here to prevent BGP from being beaten up by a flapping IGP route. With a 5 second hold-off, the BGP process can't get bogged down by unnecessary updates.

Let's set it to 2 and try again.

R1(config-router)#bgp nexthop trigger delay 2

Debug from R1 below

===============

*Sep 17 23:18:40.167: RT: delete route to 4.4.4.4 via 192.168.124.4, eigrp metric [90/10880]

*Sep 17 23:18:40.167: RT: no routes to 4.4.4.4, delayed flush

*Sep 17 23:18:40.167: RT: delete subnet route to 4.4.4.4/32

*Sep 17 23:18:40.167: RT: updating eigrp 4.4.4.4/32 (0x0) :

via 192.168.124.4 Gi1.124 0 1048578

*Sep 17 23:18:40.167: RT: rib update return code: 5

*Sep 17 23:18:42.168: RT: updating bgp 44.44.44.44/32 (0x0) :

via 3.3.3.3 0 1048577

*Sep 17 23:18:42.168: RT: closer admin distance for 44.44.44.44, flushing 1 routes

*Sep 17 23:18:42.168: RT: add 44.44.44.44/32 via 3.3.3.3, bgp metric [20/0]

===============

Now converging at 2 seconds.
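To put numbers on the three runs, here's a quick Python sketch that subtracts the timestamp of the EIGRP delete from the timestamp of the BGP update in each debug capture above. The helper name gap_seconds and the run labels are my own; the timestamps are copied straight from the debugs.

```python
from datetime import datetime

# IOS debug timestamp layout, e.g. "Sep 17 22:59:03.552" (the output carries no year)
FMT = "%b %d %H:%M:%S.%f"

def gap_seconds(igp_event: str, bgp_event: str) -> float:
    """Seconds between the IGP route removal and the BGP route update."""
    delta = datetime.strptime(bgp_event, FMT) - datetime.strptime(igp_event, FMT)
    return delta.total_seconds()

# Labels are mine; timestamps come from the three debug captures.
runs = {
    "NHT disabled (scanner only)": gap_seconds("Sep 17 22:59:03.552", "Sep 17 22:59:54.031"),
    "NHT enabled, default delay": gap_seconds("Sep 17 23:11:53.582", "Sep 17 23:11:58.582"),
    "NHT enabled, delay 2": gap_seconds("Sep 17 23:18:40.167", "Sep 17 23:18:42.168"),
}

for label, seconds in runs.items():
    print(f"{label}: {seconds:.3f} s")
```

The gaps work out to roughly 50.5, 5, and 2 seconds, which lines up with the scanner interval and the two trigger delay values.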

Applying a route-map to the NHT process is provided by a feature called Selective Address Tracking, or SAT.

The route-map determines what prefixes can be seen as valid prefixes for next-hops.

For example, say 4.4.4.4 is your desired next hop and you also have a default route on your router. If you lose 4.4.4.4/32, do you want the router to consider 4.4.4.4 still reachable via the default? Potentially not.

R1(config)#ip route 0.0.0.0 0.0.0.0 192.168.124.10 ! Deliberately non-existent next-hop

Without the route map....

R4(config-if)#int lo0

R4(config-if)#shut

This is hard to demonstrate, because the prefix might never recover. In our over-simplified mock-up, the BGP process would fail at timeout (because 4.4.4.4 is actually our peer) before the prefix vanished; in a more realistic design this could be a permanent black-hole.

We still have the bogus static default route in place:

ip route 0.0.0.0 0.0.0.0 192.168.124.10

R1(config-router)#ip prefix-list onlyloops seq 5 permit 0.0.0.0/0 ge 32

R1(config)#route-map SAT permit 10

R1(config-route-map)# match ip address prefix-list onlyloops

R1(config-route-map)#router bgp 1

R1(config-router)# bgp nexthop route-map SAT

This config only allows for /32s as viable next-hops.
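Here's a rough Python model of what that filter is doing, as I understand it: find the longest-match route covering the next-hop, then only accept it if the prefix length satisfies the prefix-list (ge 32). The function names and the simplified RIB contents are my own assumptions for illustration; this is not how IOS implements it internally.

```python
import ipaddress

def covering_route(next_hop, rib):
    """Longest-match lookup: the most specific RIB prefix containing next_hop, or None."""
    addr = ipaddress.ip_address(next_hop)
    matches = [net for net in map(ipaddress.ip_network, rib) if addr in net]
    return max(matches, key=lambda net: net.prefixlen, default=None)

def next_hop_usable(next_hop, rib, min_len=0):
    """True if the covering route exists and passes a 'ge min_len' style filter."""
    route = covering_route(next_hop, rib)
    return route is not None and route.prefixlen >= min_len

# Simplified, hypothetical view of R1's RIB before and after R4's Lo0 is shut.
rib_before = ["0.0.0.0/0", "3.3.3.3/32", "4.4.4.4/32"]
rib_after = ["0.0.0.0/0", "3.3.3.3/32"]

print(next_hop_usable("4.4.4.4", rib_before, min_len=32))  # /32 present
print(next_hop_usable("4.4.4.4", rib_after))               # no filter: default route covers it
print(next_hop_usable("4.4.4.4", rib_after, min_len=32))   # filter applied: default is rejected
```

Without the filter, the bogus default keeps 4.4.4.4 looking reachable; with the filter, losing the /32 is enough to mark the next-hop as gone.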

R4(config-if)#int lo0

R4(config-if)#shut

Debug from R1 below

===============

*Sep 17 23:47:09.497: RT: delete route to 4.4.4.4 via 192.168.124.4, eigrp metric [90/10880]

*Sep 17 23:47:09.497: RT: no routes to 4.4.4.4, delayed flush

*Sep 17 23:47:09.497: RT: delete subnet route to 4.4.4.4/32

*Sep 17 23:47:09.497: RT: updating eigrp 4.4.4.4/32 (0x0) :

via 192.168.124.4 Gi1.124 0 1048578

*Sep 17 23:47:09.497: RT: rib update return code: 5

*Sep 17 23:47:11.498: RT: updating bgp 44.44.44.44/32 (0x0) :

via 3.3.3.3 0 1048577

*Sep 17 23:47:11.498: RT: closer admin distance for 44.44.44.44, flushing 1 routes

*Sep 17 23:47:11.499: RT: add 44.44.44.44/32 via 3.3.3.3, bgp metric [20/0]

===============

Now reconverging in 2 seconds again!

This is great for the downstream prefix, but what about the neighbor session itself?

This could work...

R1(config-router)#neighbor 4.4.4.4 fall-over

Except that pesky default is keeping 4.4.4.4 supposedly reachable....

For brevity, I'll tell you that as expected, when I shut the Lo0 interface on R4, 4.4.4.4 was pulled from R1's IGP and 44.44.44.44 was pulled from R1's BGP table. However, the session is still up!

The same concept (even the same route-map) can be applied to the neighbor fall-over statement. This feature is called Fast Session Deactivation (FSD).

R1(config-router)#neighbor 4.4.4.4 fall-over route-map SAT ! re-using SAT's route-map

Debug from R1 below

===============

*Sep 18 00:11:08.107: %BGP-5-NBR_RESET: Neighbor 4.4.4.4 reset (Route to peer lost)

*Sep 18 00:11:08.107: %BGP-5-ADJCHANGE: neighbor 4.4.4.4 Down Route to peer lost

*Sep 18 00:11:08.107: %BGP_SESSION-5-ADJCHANGE: neighbor 4.4.4.4 IPv4 Unicast topology base removed from session Route to peer lost

===============

And the BGP session gets torn down immediately.

This next feature I'm not sure of the use case on, but it was recommended as a topic, so I looked at it. Multisession Transport per AF appears to be related to Multi-Topology Routing (MTR), but MTR should be solidly out-of-scope for CCIE R&S v5.

What multisession transport does is open a separate TCP session for each address family.

I've erased all the BGP config from the previous task.

R1:

ipv6 unicast-routing

router bgp 100

bgp log-neighbor-changes

neighbor 4.4.4.4 remote-as 100

neighbor 4.4.4.4 update-source Loopback0

address-family ipv4

neighbor 4.4.4.4 activate

exit-address-family

address-family vpnv4

neighbor 4.4.4.4 activate

neighbor 4.4.4.4 send-community extended

exit-address-family

address-family ipv6

neighbor 4.4.4.4 activate

exit-address-family

R4:

ipv6 unicast-routing

router bgp 100

bgp log-neighbor-changes

neighbor 1.1.1.1 remote-as 100

neighbor 1.1.1.1 update-source Loopback0

address-family ipv4

neighbor 1.1.1.1 activate

exit-address-family

address-family vpnv4

neighbor 1.1.1.1 activate

neighbor 1.1.1.1 send-community extended

exit-address-family

address-family ipv6

neighbor 1.1.1.1 activate

exit-address-family

R1(config-router-af)#do show tcp brief

TCB Local Address Foreign Address (state)

7F612C7742A0 1.1.1.1.40234 4.4.4.4.179 ESTAB

Three families, one TCP session.

R1(config-router)#neighbor 4.4.4.4 transport multi-session

R4(config-router)#neighbor 1.1.1.1 transport multi-session

The two sides of the session do need to agree on the setting.

R1:

*Sep 18 00:31:19.102: %BGP-5-ADJCHANGE: neighbor 4.4.4.4 Up

*Sep 18 00:31:25.940: %BGP-5-ADJCHANGE: neighbor 4.4.4.4 session 2 Up

*Sep 18 00:31:28.322: %BGP-5-ADJCHANGE: neighbor 4.4.4.4 session 3 Up

R1(config-router)#do show tcp brief

TCB Local Address Foreign Address (state)

7F612C76F0F0 1.1.1.1.179 4.4.4.4.30092 ESTAB

7F612C76DE20 1.1.1.1.179 4.4.4.4.42417 ESTAB

7F612C76E788 1.1.1.1.48539 4.4.4.4.179 ESTAB

Our last topic is BGP Dynamic Neighbors. Yes, automagic BGP peerings!

Erasing all the pre-existing BGP config again...

R1:

router bgp 100

bgp log-neighbor-changes

bgp listen range 192.168.124.0/24 peer-group PEERS

neighbor PEERS peer-group

neighbor PEERS remote-as 100

neighbor PEERS password CISCO

neighbor PEERS update-source Loopback0

neighbor PEERS route-reflector-client

bgp listen limit 3

R2-R4:

router bgp 100

bgp log-neighbor-changes

neighbor 192.168.124.1 remote-as 100

neighbor 192.168.124.1 password CISCO

R1:

*Sep 18 00:38:24.696: %BGP-5-ADJCHANGE: neighbor *192.168.124.2 Up

*Sep 18 00:39:04.980: %BGP-5-ADJCHANGE: neighbor *192.168.124.4 Up

*Sep 18 00:39:05.932: %BGP-5-ADJCHANGE: neighbor *192.168.124.3 Up

iBGP doesn't get any faster to set up than that!
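Conceptually, the listener on R1 is doing two checks per incoming session: is the source inside the listen range, and is there room under the listen limit? Below is a toy Python model of that admission logic. The class and method names are hypothetical, and I'm assuming a simple "first come, first served until the limit" behavior rather than describing the real IOS implementation.

```python
import ipaddress

class DynamicListener:
    """Toy model of 'bgp listen range ... ' plus 'bgp listen limit N'."""

    def __init__(self, listen_range, limit):
        self.listen_range = ipaddress.ip_network(listen_range)
        self.limit = limit
        self.peers = set()

    def accept(self, source_ip):
        addr = ipaddress.ip_address(source_ip)
        if addr not in self.listen_range:
            return False          # outside the configured range
        if addr in self.peers:
            return True           # already an established dynamic peer
        if len(self.peers) >= self.limit:
            return False          # listen limit reached
        self.peers.add(addr)
        return True

listener = DynamicListener("192.168.124.0/24", limit=3)
for ip in ("192.168.124.2", "192.168.124.4", "192.168.124.3", "192.168.124.5", "10.0.0.1"):
    print(ip, listener.accept(ip))
```

In this model the first three sources in the range are admitted, the fourth is refused because of the limit, and the 10.0.0.1 source is refused because it is outside the range.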

I've used the most obvious settings here - the dynamic "host" would normally be a route-reflector, and would normally require authentication.

However, you can:

- Run multiple dynamic groups

- Listen to multiple ranges

- Use multiple address families (this works great for VPNv4!)

- Listen for more neighbors (I limited it to 3 above)

Cheers,

Jeff

CCIE v4 to v5 Updates: NTPv4 and Netflow

2014-09-07T13:55:00.002-07:00

I didn't find these updates on any Cisco or 3rd party list, but when writing my original NTP and Netflow blogs in mid-2013, I flagged several topics as out-of-scope because they weren't supported on IOS v12.4(15)T. Now that v5 is out, all those topics are back in-scope, so I decided to blog them.

Here are the original articles this one builds off of:

http://brbccie.blogspot.com/2013/05/ntp.html
http://brbccie.blogspot.com/2013/06/netflow.html

The topics we'll be covering specifically are:
- Netflow w/ NBAR
- IPFIX (Netflow v10)
- NTPv4 (IPv6 support)
- NTPv4 Multicast NTP
- NTP Panic
- NTP Maxdistance
- NTP Orphan

Netflow
First, I wanted to mention an omission from my original blog. At that time I didn't have a collector that would support Flexible Netflow, so I evaluated FNF via Wireshark. That was fairly effective except I was missing a major element of netflow: the bytes transferred! I'm now using a collector that supports FNF, and I immediately noticed I wasn't graphing any traffic.

flow record JIMBO
match ipv4 source address
match ipv4 destination address
collect counter bytes
collect counter packets

This is a simple, working FNF config. Matching or collecting counter bytes and counter packets should be done to make Netflow do what you're used to it doing -- measuring traffic.

What's the advantage of integrating NBAR with Netflow?
By default, Netflow only exports very high-level protocol information. Integrating NBAR gives very specific/granular protocol output to the collector. Note that your collector needs to specifically support this; it is not a small change at the protocol level.

If you're familiar with how the template is sent out for FNF every so often, the NBAR table is very similar. IOS will send out a rather large (many packets) template defining the NBAR Application to ID at specified intervals, then those IDs are sent with the Netflow packet to define what the protocol is.

There are several other blogs out there that give big, complex templates for integrating NBAR with Netflow. I took a few of these as a base and worked backwards to the real requirements. This is not a hard thing to enable. Your flow record must contain collect application name (or match application name), and optionally you can tune the frequency of the NBAR FNF template being sent out with option application-table timeout in the exporter.

Here's a working config:

flow record FNF-RECORD
match ipv4 source address
match ipv4 destination address
collect counter bytes
collect counter packets
collect application name

flow exporter FNF-EXPORTER
destination 192.168.0.5
source GigabitEthernet1
transport udp 9996
template data timeout 60
option application-table timeout 30

flow monitor FNF-MONITOR
exporter FNF-EXPORTER
cache timeout inactive 60
cache timeout active 60
record FNF-RECORD

interface gig1
ip flow monitor FNF-MONITOR input

Netflow was recently made an open standard with v10. The open version is called IPFIX. To enable IPFIX output instead of FNF v9, you would:

flow exporter FNF-EXPORTER
export-protocol ipfix

Note I haven't tested this beyond checking it in Wireshark, because I still don't have a collector that speaks IPFIX.

NTP

The big difference in NTPv4 is IPv6 support. There's really not much to cover on the basics... clearly broadcast NTP is gone (IPv6 has no broadcast), but Multicast NTP still works the same general way it did in v3.

R1(config)#ntp master 4

R2(config)#ntp server 1::1

R2#show ntp association detail

1::1 configured, ipv6, our_master, sane, valid, stratum 4

ref ID 127.127.1.1 , time D7C45F20.4AC083E0 (19:27:28.292 UTC Wed Sep 17 2014)

Really quite simple.

15.x implementations of NTP now leave domain names in the config.
Pre 15.x:
foo.com(config)#ip host foo.com 4.4.4.4
foo.com(config)#ntp server foo.com
foo.com(config)#do sh run | i ntp
ntp server 4.4.4.4

It would translate the hostname to an IP address and save the IP address in the config, which is not a good thing if the server changes IPs.

Post 15.x:
R2(config)#ip host test.com 4.1.1.1
R2(config)#ntp server test.com
R2(config)#do sh run | i ntp
ntp server test.com

Let's take a look at the multicast option. As IPv6 multicast has blessedly been removed from the v5 blueprint, I'm going to cheap out and perform non-routed/same-link multicast.

R2(config)#no ntp server 1::1

R1(config)#ntp authentication-key 1 md5 CISCO

R1(config)#ntp trusted-key 1

R1(config)#int gig1.123

R1(config-subif)#ntp multicast FF02::123 key 1

R2(config)#ntp authentication-key 1 md5 CISCO

R2(config)#ntp trusted-key 1

R2(config)#ntp authenticate

R2(config)#int gig1.123

R2(config-subif)#ntp multicast client FF02::123

R2(config-subif)#do show ntp ass det

FE80::20C:29FF:FEB6:3557 dynamic, ipv6, authenticated, our_master, sane, valid, stratum 4

ref ID 127.127.1.1 , time D7C460E0.4AC083E0 (19:34:56.292 UTC Wed Sep 17 2014)

Maxdistance, for me, is very confusing. It appears to be a trust value. It's normally modified in NTPv4 in order to speed up convergence. As I understand it, the higher the value the faster the synchronization will happen, because the upstream time will be trusted sooner. The algorithm appears to add half the root delay to the root dispersion, and if that sum is lower than Maxdistance, then it's OK to consider yourself in-sync. My labbing did not produce exactly that outcome, but it was extremely hard to say for sure because my NTPv4 converges very quickly. Because you basically have to be a time expert to understand what this does, I would hope the CCIE lab would be limited to two types of questions on it:

1) Set it to some value they provide

2) Set it to "slowest" convergence (1) or "fastest" convergence (16)

R1(config)#ntp maxdistance ?
<1-16> Maximum distance for synchronization
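For what it's worth, here is the check as I described it above, written out in Python. This is only a sketch of my understanding (half the root delay plus the root dispersion, compared against maxdistance); the real NTP implementation factors in additional terms, and the sample values below are made up.

```python
def root_distance(root_delay: float, root_dispersion: float) -> float:
    """Simplified distance: half the round-trip delay plus the dispersion (seconds)."""
    return root_delay / 2 + root_dispersion

def can_sync(root_delay: float, root_dispersion: float, maxdistance: float) -> bool:
    """True when the simplified distance is below the configured maxdistance."""
    return root_distance(root_delay, root_dispersion) < maxdistance

# Hypothetical values for a freshly started association (dispersion still high).
sample = {"root_delay": 0.5, "root_dispersion": 2.75}

for maxdistance in (1, 16):
    print(f"ntp maxdistance {maxdistance}: sync allowed = {can_sync(**sample, maxdistance=maxdistance)}")
```

With these made-up numbers the distance is 3 seconds, so a maxdistance of 1 keeps the router waiting while 16 lets it sync right away, which matches the "slowest" versus "fastest" intuition.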

NTP Panic is simple:

R2(config)#ntp panic ?
update Reject time updates > panic threshold (default 1000Sec)

It does just what it says - if my peer or configured master's clock is more than 1,000 seconds off of my clock, reject the update and syslog:

.Sep 8 00:51:00.155: NTP Core (ERROR): Time correction of nan seconds exceeds sanity limit of 0. seconds. Set clock manually to the correct UTC time.

NTP Orphan is really cool. It seems like an obvious feature now that I've seen it, but I can imagine this is a huge help for smaller organizations that rely heavily on NTP.

Let's say, from our diagram, R1 is an Internet time server that our fictional organization uses as its sole NTP master. R2 and R3 are edge routers inside the company, and R4 and R5 will represent servers querying R2 and R3.

So to be clear, R2 and R3 get their time from R1, and also peer towards one another (so if R3 can't reach R1 but R2 can, R3 can learn its time via R2). R4 and R5 query R2 and R3 for time, respectively.

Relevant config:
R1(config)#ntp master 4

R2(config)#int gig1.123

R2(config-subif)#no ntp multicast client FF02::123

R2(config-subif)#no ntp authenticate

R2(config)#ntp server 1::1

R2(config)#ntp peer 3::3

R2(config)#ntp source lo0

R3(config)#ntp server 1::1

R3(config)#ntp peer 2::2

R3(config)#ntp source lo0

R4(config)#ntp server 2::2

R5(config)#ntp server 3::3

At this point every device has the up-to-date time.

Now let's say R1 goes offline.

R1(config)#int lo0

R1(config-if)#shut

<<wait a while>>

R2(config)#do show ntp status

Clock is unsynchronized, stratum 16, no reference clock

R3(config)#do show ntp status

Clock is unsynchronized, stratum 16, no reference clock

and obviously R4 and R5 share the same fate.

What if we could program R2 and R3 to take their best stab at what the time should still be - mind you we're talking about being only a couple minutes since last sync, so the time is probably still very close to accurate - and then temporarily and seamlessly take over the NTP Master role if they lose valid clock from R1?

This is exactly what NTP Orphan does.

The config is extremely complicated:

R2(config)#ntp orphan 6

R3(config)#ntp orphan 6

(I was joking about the complicated part)

Really, that's it. Let's understand what's happening here now. Orphan kicks in when we lose sync with our server. The number 6 here is a stratum number, and it must be numerically higher (that is, a less-preferred stratum) than your real upstream NTP server, which is stratum 4 here - otherwise the failover/fail-back mechanism won't work right.

Best practices indicate configuring the same Orphan stratum on all devices you're running Orphan on, then peering all the Orphans to one another so that only one is "elected" to be the temporary Orphan master.

R2(config)#do show ntp status

Clock is synchronized, stratum 6, reference is 127.0.0.1

We see R2 is now stratum 6, synchronized with its own virtual Orphan server.

R3(config)#do show ntp status

Clock is synchronized, stratum 7, reference is 26.33.33.239

R3 is synchronized with R2 as its Master.

R4#show ntp status

Clock is synchronized, stratum 7, reference is 26.33.33.239

R4 is synchronized with R2 as its master.

R5#show ntp status

Clock is synchronized, stratum 9, reference is 24.235.166.45

R5 is synchronized with R3 as its master.

Now the most important feature of this is fail-back, let's re-activate R1.

R1(config)#int lo0

R1(config-if)#no shut

R3 was first to recover:

R3(config)#do show ntp association detail

1::1 configured, ipv6, our_master, sane, valid, stratum 4

It automatically shut down its Orphan process when it synced to the superior stratum 4.

R5 then received the now-correct time from R3:

R5#show ntp association detail

3::3 configured, ipv6, our_master, sane, valid, stratum 5

Cheers,

Jeff Kronlage

OSPF LFA & Remote LFA

2014-09-06T22:03:00.003-07:00

Continuing on the same track as my recent posts regarding EIGRP FRR and BGP PIC/Add-path, today I'm writing about OSPF LFA. OSPF FRR/LFA accomplishes the same concept as EIGRP FRR, but in a much more elegant and thorough fashion.

As I did in my EIGRP article, I'm going to reference back to the BGP PIC article, as that has a lengthy explanation of why fast reroute is important. If you don't understand the use case, please read this first article:

http://brbccie.blogspot.com/2014/08/bgp-pic-and-add-path.html

Again building off former articles, the EIGRP method of LFA is dead simple: take the feasible successor and pre-install it in the FIB for faster convergence.

http://brbccie.blogspot.com/2014/08/eigrp-enhancements.html

I genuinely like this approach, because it's very easy to understand. If you're savvy enough to engineer for feasible successors, you can literally just turn on this feature and it works.

OSPF takes this idea to a whole new level. Obviously, OSPF does not have a concept of feasible successors, but it does have a huge advantage: because, in the same area, the OSPF database is identical among all routers, OSPF can run the SPF algorithm with a neighboring router as root. The advantage of this is being able to find a loop-free alternate path in complex topologies that would have failed the feasible successor check in EIGRP. When we look at Remote LFA, we can even tunnel to distant routers to form loop-free paths, all calculated via the router running FRR.

Note - much like EIGRP, OSPF on IOS does not support per-link LFA, so we will only be examining per-prefix LFA. IOS-XR supports both per-prefix and per-link.

All links have an IP address of 192.168.YY.X, where YY is the lower router number followed by the higher router number, and X is the router number (e.g. on the link facing R4, R1's IP address is 192.168.14.1). Each router has a loopback0 address of X.X.X.X, where X is the router number.

Consider this diagram, with R1 attempting to reach R5 (5.5.5.5).

R1(config)#router ospf 1
R1(config-router)#fast-reroute per-prefix enable area 0 prefix-priority low

The primary path is obvious: R1 -> R2 -> R5
The backup path requires some thought...

If this were EIGRP, neither path would be valid for LFA. They'd both fail the feasibility condition:
R1->R3->R5 has an "advertised distance" of 10, which is greater than the "feasible distance" of 2. Likewise, R1->R4->R5 has an "advertised distance" of 10.
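As a quick refresher, the EIGRP feasibility condition is just a strict less-than comparison. Here it is in Python with the numbers quoted above; the function name is my own.

```python
def passes_feasibility(advertised_distance: int, feasible_distance: int) -> bool:
    """EIGRP feasibility condition: the neighbor's reported distance must be
    strictly lower than our own feasible distance."""
    return advertised_distance < feasible_distance

FEASIBLE_DISTANCE = 2  # R1's best metric via R2, per the text above

# Advertised distances quoted in the text above.
candidates = {"R3": 10, "R4": 10}

results = {nbr: passes_feasibility(ad, FEASIBLE_DISTANCE) for nbr, ad in candidates.items()}
print(results)
```

Both neighbors fail the check, so EIGRP would have no feasible successor to pre-install here.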

However, OSPF, being link state, can actually calculate the SPF from each neighbor's perspective. Cisco calls this process "reverse SPF" -- RSPF. I'm not going to make this a large lesson on link state protocols, but let's quickly look at what R1 would discover about its neighbors:

R2:
This is already the primary path, so eliminate R2.
R3:
When attempting to reach R5, R3 will route back through R1. This will loop. Eliminate R3.
R4:
R4 reaches R5 via the link between R4 and R5. Valid Backup Route.
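The elimination above can be sketched in Python using the basic loop-free inequality from RFC 5286: a neighbor N is a loop-free alternate for source S and destination D if dist(N, D) < dist(N, S) + dist(S, D). Since the diagram isn't reproduced here, the link costs below are invented for illustration (chosen so that R3 routes back through R1 and R4 does not); the function names are also my own.

```python
import heapq

# Hypothetical link costs, NOT the real lab values.
LINKS = {
    ("R1", "R2"): 1, ("R2", "R5"): 1,
    ("R1", "R3"): 1, ("R3", "R5"): 10,
    ("R1", "R4"): 15, ("R4", "R5"): 10,
}

def build_graph(links):
    graph = {}
    for (a, b), cost in links.items():
        graph.setdefault(a, {})[b] = cost
        graph.setdefault(b, {})[a] = cost
    return graph

def spf(graph, root):
    """Plain Dijkstra: shortest distance from root to every node."""
    dist = {root: 0}
    heap = [(0, root)]
    while heap:
        d, node = heapq.heappop(heap)
        if d > dist.get(node, float("inf")):
            continue  # stale heap entry
        for nbr, cost in graph[node].items():
            nd = d + cost
            if nd < dist.get(nbr, float("inf")):
                dist[nbr] = nd
                heapq.heappush(heap, (nd, nbr))
    return dist

def loop_free_alternates(graph, source, dest):
    """Neighbors of source that satisfy the basic loop-free inequality."""
    primary_cost = spf(graph, source)[dest]
    alternates = []
    for nbr, link_cost in graph[source].items():
        from_nbr = spf(graph, nbr)  # SPF with the neighbor as root
        if link_cost + from_nbr[dest] == primary_cost:
            continue  # this neighbor is the primary next hop
        if from_nbr[dest] < from_nbr[source] + primary_cost:
            alternates.append(nbr)
    return sorted(alternates)

GRAPH = build_graph(LINKS)
print(loop_free_alternates(GRAPH, "R1", "R5"))
```

With these made-up costs, R2 is skipped as the primary, R3 fails the inequality because its best path to R5 runs back through R1, and R4 is the only neighbor left standing.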

I deliberately built the scenario this way to show how a higher-metric route could beat a lower metric for the backup route - of course, in our case, the lower metric would've looped.

R1#sh ip route repair 5.5.5.5
Routing entry for 5.5.5.5/32
Known via "ospf 1", distance 110, metric 3, type intra area
Last update from 192.168.12.2 on GigabitEthernet1.12, 02:29:11 ago
Routing Descriptor Blocks:
* 192.168.12.2, from 5.5.5.5, 02:29:11 ago, via GigabitEthernet1.12
Route metric is 3, traffic share count is 1
Repair Path: 192.168.14.4, via GigabitEthernet1.14
[RPR]192.168.14.4, from 5.5.5.5, 02:29:11 ago, via GigabitEthernet1.14
Route metric is 26, traffic share count is 1

R1#sh ip cef 5.5.5.5

5.5.5.5/32

nexthop 192.168.12.2 GigabitEthernet1.12

repair: attached-nexthop 192.168.14.4 GigabitEthernet1.14

As with EIGRP, there are "tie-breakers" if you have multiple options for backup path. With OSPF, you can get a lot more granular than EIGRP. I still hate the term "tie-breakers"; as I explained in my EIGRP blog, I think "2nd bestpath decision maker" explains it better.

The tie-breakers are as follows, with their respective default priorities:

- SRLG 10

- Primary Path 20

- Interface Disjoint 30

- Lowest-Metric 40

- Linecard-disjoint 50

- Node protecting 60

- Broadcast interface disjoint 70

- Load Sharing 256

These tie-breakers are off by default:

- Downstream

- Secondary-Path

The syntax to change the priorities - or turn on downstream or secondary-path - is as follows:

router ospf 1

fast-reroute per-prefix tie-break interface-disjoint required index 5

If you use the fast-reroute per-prefix tie-break command at all, it disables all the other tie-breakers. So for example, if you wanted SRLG to be the 2nd tie breaker, you would have to turn it back on after the interface-disjoint command:

router ospf 1

fast-reroute per-prefix tie-break interface-disjoint required index 5

fast-reroute per-prefix tie-break srlg index 10

You may have also noticed the required keyword. This means that if that tie-breaker doesn't match/pass, then disallow that path completely.

My original plan was to show a scenario for every tie-breaker, but after spending two days trying to build a topology that showed each possible technique, I decided to just go with a written explanation of each tie-breaker and then give one semi-complex tie-breaker topology with a few examples.

- SRLG

SRLG - Shared Risk Link Group - is a manual setting, optionally assigned per-interface, with the intent of identifying "shared risk" elements that the router can't detect on its own. For example, if two of your Ethernet links shared a downstream switch, you might put those two in the same SRLG.

Usage:

R1(config)#int gig1

R1(config-if)#srlg gid 1

R1(config-if)#int gig2

R1(config-if)#srlg gid 1

R1(config-if)#int gig3

R1(config-if)#srlg gid 2

- Primary Path

Primary Path prefers a backup path that's part of equal-cost multipath (ECMP). This is the antithesis of Secondary Path, which we'll cover below.

- Interface Disjoint

This is fairly obvious, prefer a backup next-hop that exits through a different interface. Note, Ethernet sub-interfaces are considered different interfaces.

- Lowest-Metric

Prefer the path with the lowest metric (note, this command doesn't offer a "required" keyword)

- Linecard-disjoint

Prefer a path that exits through a different linecard than the primary path (I have no way of labbing this as I'm using a CSR1K)

- Node protecting

Prefer a path that doesn't pass through the same next-hop router as the primary path. Note this means any interface on the same next-hop router. So if R2 is the next-hop of your primary path via 192.168.12.2, and your backup path goes through (either directly or indirectly, later in the path) 192.168.25.2 on R2, node protecting will depref that path - or with the required keyword, would prevent it from being used completely.

- Broadcast interface disjoint

Broadcast interface disjoint deprefs backup routes that pass through the same broadcast area as the primary path. The thought here is if the layer 2 device (presumably a switch) connecting the interfaces together fails, we might lose the backup path too.

- Load Sharing

I haven't labbed this, but my understanding is this is basically a worst-case scenario. If you have two or more paths that can't be differentiated by all of the above tie-breakers, share the backup paths amongst any applicable prefixes.

- Downstream (off by default)

This is very similar to the EIGRP feasibility condition - ensure that the metric, from the neighbor's RSPF perspective, is smaller than the total metric of our primary path from the calculating router's perspective. Using the original example above, the backup path we picked would not meet the criteria for this tie-breaker. It's important to reinforce that this is not a default option, and OSPF does not require this EIGRP-feasibility-like check: as a link state protocol, it has the entire topology at hand and can calculate non-looping paths without concern for metric.

- Secondary-Path (off by default)

This is the antithesis of the Primary-Path tie-breaker above. This instructs the process to prefer a backup path that is not part of multipathing (ECMP). The idea here is that all your multipaths may be required for your traffic flows. For example, if you are equal-cost multipathing across two 1-gig links, but consistently have 1.2gb of data crossing them, it would not be desirable to just run everything over the remaining link in the ECMP if one failed. Secondary-Path prefers a path not in the ECMP for the backup.

I'm going to run a couple of examples of tie-breaking, but in order to do that, I needed more paths in the topology. Pay close attention, I have shifted the OSPF costs from the prior topology:

*Please note: for clarity, the costs listed below do not include the on-router cost to the loopback.*
If you look at metric alone, the paths from R1->R5 look most desirable in this order:
R1 -> R3 -> R5 (Cost 2)

R1 -> R6 -> R3 -> R5 (Cost 4)
R1 -> R2 -> R5 (Cost 11)
R1 -> R4 -> R5 (Cost 25)

Clearly R3 is the winning primary path.

Let's go down the decision-making process for the backup path:

- SRLG 10 - Not applicable, we're not using SRLG (yet)

- Primary Path 20 - Not applicable, we have no ECMP.

- Interface Disjoint 30 - Applicable, but all are on separate interfaces already.
- Lowest-Metric 40 - Applicable, choose R6 as backup. Do not proceed further, as all paths have different costs.

So without any modification, our primary next-hop router will be R3, and backup next-hop router will be R6:

R1#sh ip route repair 5.5.5.5
Routing entry for 5.5.5.5/32
Known via "ospf 1", distance 110, metric 3, type intra area
Last update from 192.168.13.3 on GigabitEthernet1.13, 00:14:19 ago
Routing Descriptor Blocks:
* 192.168.13.3, from 5.5.5.5, 00:14:19 ago, via GigabitEthernet1.13
Route metric is 3, traffic share count is 1
Repair Path: 192.168.16.6, via GigabitEthernet1.16
[RPR]192.168.16.6, from 5.5.5.5, 00:14:19 ago, via GigabitEthernet1.16
Route metric is 6, traffic share count is 1

There's an obvious flaw in that plan however, they both rely on R3 being online.

R1(config)#router ospf 1

R1(config-router)#fast-reroute per-prefix tie-break lowest-metric index 10

R1(config-router)#fast-reroute per-prefix tie-break node-protecting required index 20

R1(config-router)#do sh ip route repair 5.5.5.5

Routing entry for 5.5.5.5/32

Known via "ospf 1", distance 110, metric 3, type intra area

Last update from 192.168.13.3 on GigabitEthernet1.13, 00:01:09 ago

Routing Descriptor Blocks:

* 192.168.13.3, from 5.5.5.5, 00:01:09 ago, via GigabitEthernet1.13

Route metric is 3, traffic share count is 1

Repair Path: 192.168.14.4, via GigabitEthernet1.14

[RPR]192.168.14.4, from 5.5.5.5, 00:01:09 ago, via GigabitEthernet1.14

Route metric is 26, traffic share count is 1

Now the process has chosen the backup through R4, which eliminates R3 as a single point of failure.
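Here's a toy Python model that is consistent with the two outputs above: a "required" tie-breaker acts as a hard filter on the candidate list, and lowest-metric then picks among whatever is left. This is purely my own mental model, not a description of the IOS internals; the class, the function, and the node_protecting flags are assumptions, while the metrics are the repair-path metrics from the outputs.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Candidate:
    next_hop: str
    metric: int            # repair-path route metric from 'sh ip route repair'
    node_protecting: bool  # True if the path avoids the primary next-hop router (R3)

def pick_backup(candidates, require_node_protecting=False):
    """Drop candidates that fail a required tie-breaker, then prefer lowest metric."""
    pool = list(candidates)
    if require_node_protecting:
        pool = [c for c in pool if c.node_protecting]
    return min(pool, key=lambda c: c.metric, default=None)

# Loop-free candidates from R1 towards 5.5.5.5 in the second topology.
CANDIDATES = [
    Candidate("R6", metric=6, node_protecting=False),   # still transits R3
    Candidate("R4", metric=26, node_protecting=True),
]

print(pick_backup(CANDIDATES).next_hop)
print(pick_backup(CANDIDATES, require_node_protecting=True).next_hop)
```

In this model the defaults land on R6, and making node-protecting required lands on R4. If a required tie-breaker filters out every candidate, the function returns None, meaning no repair path at all.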

Let's pretend that gig1.13, gig1.14, and gig1.16 all cross the same L2 switch somewhere in their path. We want to protect against that too:

R1(config)#router ospf 1

R1(config-router)#fast-reroute per-prefix tie-break lowest-metric index 10

R1(config-router)#fast-reroute per-prefix tie-break node-protecting required index 20

R1(config-router)#fast-reroute per-prefix tie-break srlg required index 30

R1(config-router)#int gig1.13

R1(config-subif)#srlg gid 1

R1(config-subif)#int gig1.14

R1(config-subif)#srlg gid 1

R1(config-subif)#int gig1.16

R1(config-subif)#srlg gid 1

R1(config-subif)#int gig1.12

R1(config-subif)#srlg gid 2

R1(config-subif)#do sh ip route repair 5.5.5.5

Routing entry for 5.5.5.5/32

Known via "ospf 1", distance 110, metric 3, type intra area

Last update from 192.168.13.3 on GigabitEthernet1.13, 00:18:34 ago

Routing Descriptor Blocks:

* 192.168.13.3, from 5.5.5.5, 00:18:34 ago, via GigabitEthernet1.13

Route metric is 3, traffic share count is 1

Uh-oh, no backup route. We were hoping for R1->R2->R5...

R2#sh ip cef 5.5.5.5

5.5.5.5/32

nexthop 192.168.12.1 GigabitEthernet1.12

That's because R2 routes back through R1 - R1 would've run the RSPF with R2 as the root and disregarded the route.

We have two options at this point:

- Remove the required keyword from the SRLG and fall back to the prior answer

- Tinker with the metrics to make R2 a viable path.

R1(config)#int gig1.12

R1(config-subif)#ip ospf cost 10

R2(config)#int gig1.12

R2(config-subif)#ip ospf cost 10

R1(config-subif)#do sh ip route repair 5.5.5.5

Routing entry for 5.5.5.5/32

Known via "ospf 1", distance 110, metric 3, type intra area

Last update from 192.168.13.3 on GigabitEthernet1.13, 00:00:52 ago

Routing Descriptor Blocks:

* 192.168.13.3, from 5.5.5.5, 00:00:52 ago, via GigabitEthernet1.13

Route metric is 3, traffic share count is 1

Repair Path: 192.168.12.2, via GigabitEthernet1.12

[RPR]192.168.12.2, from 5.5.5.5, 00:00:52 ago, via GigabitEthernet1.12

Route metric is 21, traffic share count is 1

Now we have a backup via R2.
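Why the cost change fixed it can be shown with the basic loop-free inequality from RFC 5286: a neighbor N is only usable as an alternate if its own distance to the destination is strictly less than its distance back to us plus our distance to the destination. Here's a minimal sketch in Python using numbers from the outputs above. R2's own distance of 11 is inferred from the repair metric (21 minus the new link cost of 10), and the original gig1.12 cost of 1 is an assumption, since it isn't shown here.

```python
def is_loop_free(dist_n_to_d, dist_n_to_s, dist_s_to_d):
    """RFC 5286 basic loop-free condition: N must not route back through S."""
    return dist_n_to_d < dist_n_to_s + dist_s_to_d

R1_TO_DEST = 3        # R1's primary metric for 5.5.5.5 (from the output)
R2_ALT_TO_DEST = 11   # inferred: repair metric 21 minus link cost 10

def r2_distance(link_cost):
    # R2 uses whichever is cheaper: back through R1, or its own path
    return min(link_cost + R1_TO_DEST, R2_ALT_TO_DEST)

# Assumed original cost of 1 on the R1-R2 link: R2 points back at R1
before = is_loop_free(r2_distance(1), 1, R1_TO_DEST)
# After setting cost 10 on both ends: R2 prefers its own path
after = is_loop_free(r2_distance(10), 10, R1_TO_DEST)

print("R2 usable before cost change:", before)
print("R2 usable after cost change:", after)
```

The point is just that raising the R1-R2 cost makes R2's own path cheaper than bouncing back through R1, which is what the inequality needs.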

Before we move on to remote LFA, let's cover some smaller topics.

There were two pieces to the initial command that I did not explain:

fast-reroute per-prefix enable area 0 prefix-priority low

enable area 0 may seem obvious - we want backup paths for area 0. Note, you can only specify areas the router is directly connected to, so if, for example, you wanted backup paths in areas 0, 1, and 2, your router would have to be an ABR for areas 1 and 2. This is true of both direct LFA and remote LFA.

But there's another issue with specifying areas:

R5(config)#int lo1

R5(config-if)#ip address 55.55.55.55 255.255.255.255

R5(config-if)#exit

R5(config)#route-map lo1-extern

R5(config-route-map)#match interface lo1

R5(config-route-map)#exit

R5(config)#router ospf 1

R5(config-router)#redistribute connected route-map lo1-extern

R1(config)#do sh ip route repair 55.55.55.55

Routing entry for 55.55.55.55/32

Known via "ospf 1", distance 110, metric 20, type extern 2, forward metric 2

Last update from 192.168.13.3 on GigabitEthernet1.13, 00:01:27 ago

Routing Descriptor Blocks:

* 192.168.13.3, from 5.5.5.5, 00:01:27 ago, via GigabitEthernet1.13

Route metric is 20, traffic share count is 1

No repair route for 55.55.55.55 - and we won't get one, because an external route doesn't belong to any area. We have to change our initial configuration to fix this:

R1(config-router)#no fast-reroute per-prefix enable area 0 prefix-priority low

R1(config-router)#fast-reroute per-prefix enable prefix-priority low

A lack of an area implies all areas this router is connected to - including external routes.

R1(config-router)#do sh ip route repair 55.55.55.55

Routing entry for 55.55.55.55/32

Known via "ospf 1", distance 110, metric 20, type extern 2, forward metric 2

Last update from 192.168.13.3 on GigabitEthernet1.13, 00:01:42 ago

Routing Descriptor Blocks:

* 192.168.13.3, from 5.5.5.5, 00:01:42 ago, via GigabitEthernet1.13

Route metric is 20, traffic share count is 1

Repair Path: 192.168.12.2, via GigabitEthernet1.12

[RPR]192.168.12.2, from 5.5.5.5, 00:01:42 ago, via GigabitEthernet1.12

Route metric is 20, traffic share count is 1

What's the story on prefix-priority low?

IOS prioritizes convergence events by default by prefix length. If SPF has to be calculated for thousands of routes, it's assumed by default that /32s (typical for iBGP next-hops) are "high priority". You can define what routes are priority to OSPF with:

R1(config-router)#prefix-priority high route-map <your route map>

There are only two tiers, high and low. With prefix-priority high, backup routes are calculated only for high-priority prefixes (by default the /32s, unless a route map redefines them). With prefix-priority low, backup routes are calculated for all routes.

So you're debugging and trying to figure out why one path was chosen over another. IOS has a fantastic output system for this:

R1(config-router)#fast-reroute keep-all-paths

This is basically a debugging command, and tells OSPF to keep the output from all the RSPFs it ran to calculate the backup path - including the ones it didn't choose as best.

show ip ospf rib is our 2nd magic command:

R1(config-router)#do sh ip ospf rib 5.5.5.5

OSPF Router with ID (1.1.1.1) (Process ID 1)

Base Topology (MTID 0)

OSPF local RIB

Codes: * - Best, > - Installed in global RIB

LSA: type/LSID/originator

*> 5.5.5.5/32, Intra, cost 3, area 0

SPF Instance 62, age 00:13:50

Flags: RIB, HiPrio

via 192.168.13.3, GigabitEthernet1.13

Flags: RIB

LSA: 1/5.5.5.5/5.5.5.5

repair path via 192.168.12.2, GigabitEthernet1.12, cost 21

Flags: RIB, Repair, IntfDj, BcastDj, NodeProt

LSA: 1/5.5.5.5/5.5.5.5

repair path via 192.168.16.6, GigabitEthernet1.16, cost 6

Flags: Ignore, Repair, IntfDj, BcastDj, SRLG

LSA: 1/5.5.5.5/5.5.5.5

repair path via 192.168.14.4, GigabitEthernet1.14, cost 26

Flags: Ignore, Repair, IntfDj, BcastDj, SRLG, NodeProt

LSA: 1/5.5.5.5/5.5.5.5

Look at all that fantastic output - it lists the parameters per route so you can determine why the repair path was chosen. Let's break one of these down:

repair path via 192.168.12.2, GigabitEthernet1.12, cost 21

Flags: RIB, Repair, IntfDj, BcastDj, NodeProt

LSA: 1/5.5.5.5/5.5.5.5

This is our current best backup path - "RIB" means it's installed, "Repair" means it's a backup path - so "RIB" + "Repair" means it's the installed backup path. IntfDj means it's on a separate interface from the primary path, BcastDj means it's not sharing a broadcast interface with the primary path, and NodeProt means the path does not include shared hops with the primary path.

Microloops can add complexity with fast-reroute. A microloop is what happens when one router converges significantly faster than a neighbor. Let's say two adjacent routers both receive new LSAs simultaneously. One router is high-performance, another is older. The high-performance router calculates the change and updates the FIB several seconds before the older router. Now we could end up with a scenario where the newer router starts forwarding traffic through the older router, but the older router's FIB hasn't updated yet, and it's forwarding through the faster router for that same prefix. For a couple of seconds, the two routers loop.

I'm not going to go into detail on this as it's a fringe topic, but here's the starting point for using this:
R1(config-router)#microloop avoidance ?
disable Microloop avoidance auto-enable prohibited
protected Microloop avoidance for protected prefixes only
rib-update-delay Delay before updating the RIB

In short, it allows you to deliberately slow down updating the FIB on the faster router for prefixes that are high-risk for this type of reconvergence.

If you don't want an interface being considered for fast-reroute:

R1(config-router)#int gig1.12
R1(config-subif)#ip ospf fast-reroute per-prefix candidate disable

And if you need a quick summary of what percentage of routes are and aren't protected:

R1#sh ip ospf fast-reroute prefix-summary

OSPF Router with ID (1.1.1.1) (Process ID 1)

Base Topology (MTID 0)

Area 0:

Interface       Protected    Primary paths  Protected paths  Percent protected
                             All High  Low    All High  Low    All High  Low
Lo0             Yes            0    0    0      0    0    0     0%   0%   0%
Gi1.16          Yes            1    1    0      0    0    0     0%   0%   0%
Gi1.14          Yes            0    0    0      0    0    0     0%   0%   0%
Gi1.13          Yes            7    3    4      4    2    2    57%  66%  50%
Gi1.12          Yes            1    1    0      0    0    0     0%   0%   0%
Area total:                    9    5    4      4    2    2    44%  40%  50%
Process total:                 9    5    4      4    2    2    44%  40%  50%
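If the columns are hard to read, the percentages are just protected paths divided by primary paths, per priority tier. A quick sketch in Python that reproduces the Gi1.13 and process-total rows. The truncating integer division is my inference from the 66% figure (2 of 3), not something IOS documents in this output.

```python
# (primary paths, protected paths) per column, copied from the output above
gi1_13 = {"All": (7, 4), "High": (3, 2), "Low": (4, 2)}
process_total = {"All": (9, 4), "High": (5, 2), "Low": (4, 2)}

def percent_protected(primary, protected):
    # Integer division matches the truncated figures in the output
    return 0 if primary == 0 else protected * 100 // primary

for label, row in (("Gi1.13", gi1_13), ("Process total", process_total)):
    summary = {tier: percent_protected(*counts) for tier, counts in row.items()}
    print(label, summary)
```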

That's a wrap for direct LFA. Now we'll look at remote LFA.

This is a simplistic topology but it has a huge problem for direct LFA.
Let's protect the path from R1 to R4.

We have two paths:
R1 -> R4 (cost 1)
R1 -> R2 -> R3 -> R4 (cost 12)

Obviously R1 -> R4 is the primary path.
What does R2 see as its possible paths to R4?
R2 -> R1 -> R4 (Cost 2)
R2 -> R3 -> R4 (Cost 11)

R2 will always send traffic back to R1 when heading towards R4.

What about R3?
R3 -> R4 (Cost 6)
R3 -> R2 -> R1 -> R4 (Cost 7)

R3 would work for a backup path... if only we could get to R3 without R2 knowing what we're up to.

Enter Remote LFA.

R1(config-router)#int gig1.14
R1(config-subif)#mpls ip
R1(config-subif)#int gig1.12
R1(config-subif)#mpls ip

R1(config-subif)#mpls ldp discovery targeted-hello accept

R2(config-subif)#int gig1.12

R2(config-subif)#mpls ip

R2(config-subif)#int gig1.23

R2(config-subif)#mpls ip

R2(config-subif)#mpls ldp discovery targeted-hello accept

R3(config-subif)#int gig1.23
R3(config-subif)#mpls ip
R3(config-subif)#int gig1.34
R3(config-subif)#mpls ip
R3(config-subif)#mpls ldp discovery targeted-hello accept

R4(config-subif)#int gig1.14

R4(config-subif)#mpls ip

R4(config-subif)#int gig1.34

R4(config-subif)#mpls ip

R4(config-subif)#mpls ldp discovery targeted-hello accept

R1(config-router)#router ospf 1
R1(config-router)#fast-reroute per-prefix remote-lfa tunnel mpls-ldp

There's a complex algorithm that makes this work, but it's somewhat irrelevant from a CCIE v5 perspective.

Here's what you really need to know:

- Direct LFA had to have failed to turn up a path already (direct is always tried first)

- A tunnel is built over targeted LDP.

- The destination tunnel router is picked on the following criteria:

- It must be in the same area as the router running LFA

- The tunnel endpoint is picked from among the group of routers that can be reached through a next-hop other than the one you're trying to protect.

- Of that group of routers, it's narrowed down to the subset that can reach your repair prefix without passing through the protecting router.

- Those that qualify are called the PQ space (refer to the RFC for a lot more detail, but it may be overkill for a CCIE candidate)
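To make the PQ idea concrete, here's a rough sketch in Python that works it out for the four-router topology above. The link costs (R1-R4 = 1, R1-R2 = 1, R2-R3 = 5, R3-R4 = 6) are inferred from the path costs listed earlier, costs are assumed symmetric, and the P and Q tests are the link-protection inequalities as I understand them from the RFC. This is not how IOS implements it, just the logic.

```python
import heapq

# Link costs inferred from the path costs listed earlier (assumed symmetric)
LINKS = {("R1", "R4"): 1, ("R1", "R2"): 1, ("R2", "R3"): 5, ("R3", "R4"): 6}

def build_graph(links):
    graph = {}
    for (a, b), cost in links.items():
        graph.setdefault(a, {})[b] = cost
        graph.setdefault(b, {})[a] = cost
    return graph

def dijkstra(graph, source):
    """Plain Dijkstra: shortest distance from source to every node."""
    dist = {source: 0}
    heap = [(0, source)]
    while heap:
        d, node = heapq.heappop(heap)
        if d > dist.get(node, float("inf")):
            continue
        for nbr, cost in graph[node].items():
            nd = d + cost
            if nd < dist.get(nbr, float("inf")):
                dist[nbr] = nd
                heapq.heappush(heap, (nd, nbr))
    return dist

def pq_space(links, s, e):
    """P, Q and PQ sets for protecting the link between s and e."""
    graph = build_graph(links)
    link_cost = graph[s][e]
    from_s, from_e = dijkstra(graph, s), dijkstra(graph, e)
    candidates = [n for n in graph if n not in (s, e)]
    # P: s reaches x on shortest paths that never cross the s-e link
    p = {x for x in candidates if from_s[x] < link_cost + from_e[x]}
    # Q: x reaches e on shortest paths that never cross the s-e link
    q = {x for x in candidates if from_e[x] < from_s[x] + link_cost}
    return p, q, p & q

p, q, pq = pq_space(LINKS, "R1", "R4")
print("P space:", sorted(p))
print("Q space:", sorted(q))
print("PQ space:", sorted(pq))
```

R2 lands in the P space but not the Q space (it would send the traffic straight back to R1), while R3 is in both, which lines up with the tunnel to 3.3.3.3 in the output that follows.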

R1#sh ip route repair 4.4.4.4

Routing entry for 4.4.4.4/32

Known via "ospf 1", distance 110, metric 2, type intra area

Last update from 192.168.14.4 on GigabitEthernet1.14, 00:29:36 ago

Routing Descriptor Blocks:

* 192.168.14.4, from 4.4.4.4, 00:29:36 ago, via GigabitEthernet1.14

Route metric is 2, traffic share count is 1

Repair Path: 3.3.3.3, via MPLS-Remote-Lfa1

[RPR]3.3.3.3, from 4.4.4.4, 00:29:36 ago, via MPLS-Remote-Lfa1

Route metric is 12, traffic share count is 1

R1#sh ip int br | i MPLS

MPLS-Remote-Lfa1 192.168.12.1 YES unset up up

This whole process is reasonably automatic. Just make sure your LDP is in good shape and targeted LDP is enabled, and you're good to go.

You can optionally specify areas and maximum costs:

R1(config-router)#fast-reroute per-prefix remote-lfa area 0 maximum-cost 10

The areas work the same way they did with direct LFA - we're just saying we only want to protect area 0, 1, 2, 3, etc. For remote LFA, the router you're running LFA on has to be in the area you're trying to protect - you can't protect area 5 if you're only an ABR for areas 0 and 1.

The maximum cost option restricts which prefixes you should be building tunnels for. In other words, it has nothing to do with the metric to reach the tunnel endpoint - it has to do with the prefix you're trying to protect.

Hope you enjoyed!

Jeff

EIGRP Enhancements

2014-08-24T12:42:00.001-07:00

Cisco did a major overhaul of EIGRP in recent IOS. These can be loosely looked at as new features in "EIGRP Named Mode". In reality, I suspect that the EIGRP teams were working on a series of new features, and they opted to renovate the interface at the same time, hence creating named mode.

We'll start with the new interface and then delve into all the new features one at a time.

Named EIGRP mode replaces the traditional EIGRP interface we're familiar with, and puts all the various commands into one configuration section.

The major distinguishing factor is the router process has a name instead of a number.

Old method:
router eigrp 100
network 192.168.0.0 0.0.255.255

New equivalent method:
router eigrp SOMENAME
!
address-family ipv4 unicast autonomous-system 100
!
topology base
exit-af-topology
network 192.168.0.0 0.0.255.255
exit-address-family

The name is completely arbitrary and is a local value.

Interface settings that were previously configured on the interface, such as hello interval, authentication, etc, are now configured as part of the EIGRP named process:

router eigrp SOMENAME

address-family ipv4 unicast autonomous-system 100

af-interface GigabitEthernet1

authentication mode md5

authentication key-chain FOO

hello-interval 10

no split-horizon

exit-af-interface

topology base

exit-af-topology

network 192.168.0.0 0.0.255.255

exit-address-family

A traditional EIGRP process can be upgraded to named mode on newer IOS with this command:

Router(config)#router eigrp 101

Router(config-router)#eigrp upgrade-cli SOMENAME

The process also doesn't interrupt traffic flow.

That's the guts of the configuration reformatting. Let's move on to features.

Wide Metrics

First and foremost, the metric has been reworked.

EIGRP named mode automatically uses wide metrics when speaking to another EIGRP named mode process; no additional configuration is necessary. If it's speaking to a traditional EIGRP process, it uses the old calculations.

The new metric is designed to be able to differentiate paths faster than 10 Gbps. The new metric essentially changes four things:

- Delay is now measured in picoseconds instead of tens of microseconds. 10 microseconds was the minimum previously.

- Bandwidth's scaling factor is made much larger, the calculation is now 10^7 * 65536 / Interface Bandwidth, as opposed to the original 10^7 * 256 / Interface Bandwidth.

- The overall metric is now 64 bit.

- The K6 value has been added "for future use", but Cisco has indicated this will be used for accumulated energy or accumulated jitter. Jitter is reasonably obvious. Energy is the actual electric power it takes to use an interface, so that you could literally do "least cost" routing based on how inexpensively the packet can be sent from the various interface types in a path.

One important note here is that with wide metrics, the EIGRP calculated metric no longer fits into the RIB. For example:

Router#sh ip eigrp top 10.10.10.10/32 | i Composite metric

Composite metric is (330301440/329646080), route is Internal

Router#sh ip route 10.10.10.10 | i Route metric

Route metric is 2580480, traffic share count is 1

The EIGRP topology table indicates 330301440, the RIB says 2580480.

The RIB's metric can't exceed 32 bits, and there are circumstances where the new, more granular metrics won't fit into the RIB. So all metrics, regardless of whether the value would fit into 32 bits, are divided by the rib-scale value. The rib-scale is 128 by default:

330301440/128 = 2580480

You can reassign it to any value 1 to 255:

router eigrp SOMENAME

address-family ipv4 unicast autonomous-system 100

metric rib-scale [1-255]

Here's a catch - I've gotten in the habit of using this command for redistributing into EIGRP when labbing:

redistribute <some other protocol> metric 1 1 1 1 1

Why? It's quick and easy to type if you're not trying to do traffic engineering.

Router#sh ip eigrp top 13.13.13.13/32 | i Composite metric

Composite metric is (655361310720/655360655360), route is External

655361310720/128 = 5120010240

The largest number that can be represented in a 32-bit unsigned integer is 4,294,967,295.

5120010240 > 4294967295, therefore it cannot be represented in the RIB:

Router#sh ip route 13.13.13.13

% Network not in table

You read that right: This is a valid, routable prefix that simply can't make it into the RIB because of compatibility between the EIGRP topology table and the RIB. You need to adjust the rib-scale to make this work:

Router(config-router-af)#metric rib-scale 153

Router(config-router-af)#do sh ip route 13.13.13.13 | i Route metric

Route metric is 4283407259, traffic share count is 1
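If you want to check the arithmetic yourself, or find the smallest rib-scale that works for a given composite metric, here's a small sketch in Python. It assumes the RIB metric is simply the composite metric integer-divided by rib-scale and capped at an unsigned 32-bit value, which is what the outputs above suggest. The function names are mine.

```python
UINT32_MAX = 2**32 - 1   # 4,294,967,295

def rib_metric(composite, rib_scale=128):
    # Assumed behavior: composite metric integer-divided by rib-scale
    return composite // rib_scale

def fits_in_rib(composite, rib_scale=128):
    return rib_metric(composite, rib_scale) <= UINT32_MAX

def smallest_working_scale(composite):
    # rib-scale accepts 1 to 255; return the first value that fits
    for scale in range(1, 256):
        if fits_in_rib(composite, scale):
            return scale
    return None

internal = 330301440        # 10.10.10.10/32 from the earlier output
external = 655361310720     # 13.13.13.13/32, redistributed with metric 1 1 1 1 1

print(rib_metric(internal))
print(fits_in_rib(external))
print(smallest_working_scale(external))
print(rib_metric(external, 153))
```

That's where the 153 came from: it's the lowest rib-scale that squeezes this particular metric under the 32-bit ceiling.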

I imagine that would make for a really good troubleshooting problem. "A route is being redistributed on R1 with a specific metric, but is not being installed in the RIB on R3. Do not change the metric on R1, or adjust with a route-map".

There are a few concerns with interoperability between the traditional EIGRP metric and the wide metrics, but not many. As I mentioned above, routers unable to understand wide metrics are auto-detected and sent the old metric, however, there are circumstances where a route might get depreffed after having passed through an older EIGRP process. For example, if two paths exist to a destination, one of them running entirely wide metrics and a different one running one router with traditional metrics, the traditional metric may make the entire path look worse and it may impact load share, or the ability to ECMP.

SHA Authentication

Now supporting more than just MD5:

R1(config-subif)#router eigrp TEST1

R1(config-router)#address-family ipv4 unicast autonomous-system 100

R1(config-router-af)#af-interface gig1.123

R1(config-router-af-interface)#authentication mode hmac-sha-256 CCIE

I think authentication would also make a great TS question - the authentication could be placed on the interface still, which named mode silently ignores. You'd need to know to look at the EIGRP named process to fix it:

interface GigabitEthernet1.123

ip authentication key-chain eigrp 100 BOB ! this does nothing when named mode is enabled.

Route Tag Enhancements

To be fair, the route tag enhancements aren't limited to EIGRP named mode - they work with OSPF, BGP, RIP, etc. They even work in the traditional (non-named) EIGRP syntax. However, I didn't think I needed to write a separate blog post just to show it in every context; they all basically work the same.

In short, the route tag enhancements allow the route tag to be formatted as a dotted decimal tag (looks like an IPv4 address) that can be matched either directly (in the traditional route tag method in a route-map) or via a route-tag list. The route-tag list is where things get interesting.

R1:

interface Loopback1

ip address 1.1.1.1 255.255.255.255

interface Loopback2

ip address 2.2.2.2 255.255.255.255

interface Loopback3

ip address 3.3.3.3 255.255.255.255

interface Loopback4

ip address 4.4.4.4 255.255.255.255

interface Loopback5

ip address 5.5.5.5 255.255.255.255

interface Loopback6

ip address 6.6.6.6 255.255.255.255

interface Loopback7

ip address 7.7.7.7 255.255.255.255

route-tag notation dotted-decimal

router eigrp TEST1

address-family ipv4 unicast autonomous-system 100

topology base

redistribute connected route-map tag-routes

route-map tag-routes permit 10

match interface Loopback1 Loopback2 Loopback3

set tag 100.100.100.1

route-map tag-routes permit 20

match interface Loopback4 Loopback5

set tag 100.100.200.1

route-map tag-routes permit 30

match interface Loopback6 Loopback7

set tag 100.100.101.1

So we've set some dotted-decimal tags on R1, now let's filter on R2.

R2:

route-tag notation dotted-decimal

route-tag list binary-match seq 5 permit 100.100.0.0 0.0.254.255

route-map filter permit 10

match tag list binary-match

set metric 100 100 255 1 1500

route-map filter permit 20

router eigrp TEST2

address-family ipv4 unicast autonomous-system 100

topology base

distribute-list route-map filter in GigabitEthernet1.123

Anyone who's done any amount of CCIE-level route filtering should catch what I just did. The route-tag list is looking for any routes that begin with 100.100 and have an even 3rd octet - if you need an explanation of filtering with wildcard masks there are many available on the Internet.

So now tags can be matched based on what bits are set in them -- very cool.

R2(config)#do sh ip eigrp top 1.1.1.1/32 | i Composite metric

Composite metric is (6619136000/163840), route is External

R2(config)#do sh ip eigrp top 4.4.4.4/32 | i Composite metric

Composite metric is (6619136000/163840), route is External

R2(config)#do sh ip eigrp top 6.6.6.6/32 | i Composite metric

Composite metric is (1392640/163840), route is External

1.1.1.1 and 4.4.4.4 were tagged with 100.100.100.1 and 100.100.200.1 respectively, both even 3rd octets, and had their metric successfully recreated. 6.6.6.6, tagged with 100.100.101.1, was not matched, and retained its original metric.
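For anyone who wants to see the bit math behind that route-tag list, here's a short sketch in Python. It treats the tag like an IPv4 address and applies the wildcard the same way an ACL would: bits set in the wildcard are "don't care", everything else must equal the base value. The helper names are mine, not anything from IOS.

```python
import ipaddress

def to_int(dotted):
    return int(ipaddress.IPv4Address(dotted))

def tag_matches(tag, base, wildcard):
    # Bits that are 0 in the wildcard must match the base exactly
    care_bits = ~to_int(wildcard) & 0xFFFFFFFF
    return (to_int(tag) & care_bits) == (to_int(base) & care_bits)

BASE, WILDCARD = "100.100.0.0", "0.0.254.255"

for tag in ("100.100.100.1", "100.100.200.1", "100.100.101.1"):
    print(tag, tag_matches(tag, BASE, WILDCARD))
```

The 254 in the third octet leaves only the lowest bit checked, and that bit has to be 0, which is why only even third octets match.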

I immediately tried this in IPv6... however...

R2(config-router)#address-family ipv6 unicast autonomous-system 200

R2(config-router-af)#topology base

R2(config-router-af-topology)#distribute-list ?

prefix-list Filter connections based on an IPv6 prefix-list

R2(config-router-af-topology)#distribute-list route-map ?

% Unrecognized command

IPv6 can't be filtered ingress with route-maps yet. I didn't expect that. For anyone curious I'm on:

R2(config-router-af-topology)#do sh ver | i IOS Software

Cisco IOS Software, CSR1000V Software (X86_64_LINUX_IOSD-UNIVERSALK9-M), Version 15.4(1)S1, RELEASE SOFTWARE (fc2)

There's one more option for setting tags:

router eigrp TEST1

address-family ipv4 unicast autonomous-system 100

eigrp default-route-tag 9.9.9.9

default-route-tag is fairly picky what it will tag. From some tinkering, it will tag all routes except:

- Locally redistributed routes

- Routes that were already set a tag in some other fashion

- Routes it learned from another router

So in short, unless you learned the routes with the "network" statement, this tag won't take effect.

IPv6 VRF Lite

The traditional EIGRP process doesn't support IPv6 in a VRF.

You also must use the new format - multiprotocol VRF - for creating VRFs.

Old format:

R2(config)#ip vrf FOO

R2(config-vrf)#rd 1:1

R2(config-vrf)#exit

R2(config)#int gig1.10

R2(config-subif)#ip vrf forwarding FOO

Multiprotocol VRF:

R2(config-vrf)#vrf definition FOO

R2(config-vrf)#rd 1:1

R2(config-vrf)#address-family ipv6 unicast

R2(config-vrf-af)#address-family ipv4 unicast

R2(config-vrf-af)#exit

R2(config-vrf)#int gig1.10

R2(config-subif)#vrf forwarding FOO

router eigrp SAMPLE

address-family ipv6 unicast vrf FOO autonomous-system 200

topology base

exit-af-topology

eigrp router-id 2.2.2.2

exit-address-family

Note the eigrp router-id 2.2.2.2 line. Unless you have an IPv4 address in the routing table of the same VRF, you must specify the router ID manually. If you leave it out, there is no parser error; the process just doesn't work. Once again, this would make a great TS problem.

With IPv6, things work differently than IPv4 in named EIGRP mode. This process is already up:

*Sep 2 23:39:52.815: %DUAL-5-NBRCHANGE: EIGRP-IPv6 200: Neighbor FE80::20C:29FF:FEF7:FE11 (GigabitEthernet1.10) is up: new adjacency

However, note I haven't told it what interfaces to use. In our case, it automatically includes any interface that's in the appropriate VRF and has an IPv6 address on it. If you don't want to run EIGRP on an interface, you have to manually specify:

R2(config)#router eigrp SAMPLE

R2(config-router)#address-family ipv6 unicast vrf FOO autonomous-system 200

R2(config-router-af)#af-interface gig1.10

R2(config-router-af-interface)#shut

*Sep 2 23:47:10.304: %DUAL-5-NBRCHANGE: EIGRP-IPv6 200: Neighbor FE80::20C:29FF:FED7:2458 (GigabitEthernet1.10) is down: interface down

3rd party Next-Hop

While also not a feature specific to named mode, EIGRP has recently started supporting 3rd party next hop. The concept of 3rd party next-hop is fairly simple. The easiest way I can explain it is if you have three routers on a single segment, R1, R2, and R3. They all share the 192.168.123.0/24 space between them. However, R1 and R2 speak EIGRP, and R2 and R3 speak OSPF. R1 doesn't speak OSPF, and R3 doesn't speak EIGRP. Assume there are extra routers behind R1 and R3 on different segments that are advertised in their respective routing protocols.

R2 is mutually redistributing between EIGRP and OSPF.

Without 3rd party next-hop, R1 would have to send traffic destined for the OSPF segments to R2, then R2 would have to forward it to R3. Inefficient and messy.

With 3rd party next-hop, R2 is permitted to use R3's address, even though it doesn't exist in the EIGRP process, when advertising routes to R1.

This is an automatic feature and requires only that R2 doesn't re-write the next-hop to itself (rewriting the next hop is default EIGRP behavior):

router eigrp TEST2

address-family ipv4 unicast autonomous-system 100

af-interface GigabitEthernet1.123

no next-hop-self

EIGRP Fast ReRoute (FRR)

The point of FRR is to generate Loop Free Alternates, or LFAs. What's an LFA?

An LFA is a back-up route that can be pre-programmed into the FIB as a repair route. If you're familiar with EIGRP, you might think "but EIGRP already has feasible successors". True, but it doesn't program those into the forwarding linecards.

I wrote a rather lengthy article regarding BGP PIC and Add-Path two weeks ago, and I covered the problem that PIC was trying to solve, which is not necessarily easy to comprehend unless you've spent a great deal of time in a large service provider environment. PIC and FRR are trying to solve the same issue with different protocols. Rather than pasting the multi-page explanation I've already typed into this document as well, please reference that one to understand the issue:

http://brbccie.blogspot.com/2014/08/bgp-pic-and-add-path.html

The good news is that EIGRP doesn't require as complex an environment to explain FRR as it took to explain BGP PIC.

We already know EIGRP makes feasible successors, and can rely on those during reconvergence. But if we want the FIB to be able to swap over to a feasible successor as soon as the successor route is lost, we need to pre-program it.

In a nutshell, FRR simply picks the "best" feasible successor and sticks it in the FIB as a backup route.

There are two types of FRR, per-link and per-prefix. Per-link is only supported on IOS-XR at the time of this writing, so we'll be looking only at per-prefix.

First and foremost, we must ensure we have a feasible successor. If we have multiple successors (no feasibles), then we have ECMP - equal cost multi-path - and there's no need for FRR.

R1 has two paths to prefix 4.4.4.4 on R4, one via R2 and another via R3. I've deliberately de-prefed the route through R3. Note, if you're attempting to lab along with this, you'll want to create the depref on R1. If you're ECMP up until you create the depref on R1, you're guaranteed to have a feasible successor!

R1(config-subif)#int gig1.13

R1(config-subif)#delay 5000

R1#sh ip eigrp topo 4.4.4.4/32

EIGRP-IPv4 VR(TEST) Topology Entry for AS(100)/ID(192.168.12.1) for 4.4.4.4/32

State is Passive, Query origin flag is 1, 1 Successor(s), FD is 2048000, RIB is 16000

Descriptor Blocks:

192.168.12.2 (GigabitEthernet1.12), from 192.168.12.2, Send flag is 0x0

Composite metric is (2048000/1392640), route is Internal

Vector metric:

Minimum bandwidth is 1000000 Kbit

Total delay is 21250000 picoseconds

Reliability is 255/255

Load is 1/255

Minimum MTU is 1500

Hop count is 2

Originating router is 192.168.24.4

192.168.13.3 (GigabitEthernet1.13), from 192.168.13.3, Send flag is 0x0

Composite metric is (3278192640/1392640), route is Internal

Vector metric:

Minimum bandwidth is 1000000 Kbit

Total delay is 50011250000 picoseconds

Reliability is 255/255

Load is 1/255

Minimum MTU is 1500

Hop count is 2

Originating router is 192.168.24.4

Since the route via 192.168.13.3 (from R3) has an advertised distance less than the feasible distance to 192.168.12.2 (from R2), we now have a feasible successor.
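Both composite metrics in that output can be rebuilt from the vector metrics, which is a nice sanity check on the wide metric math. Below is a sketch in Python, assuming default K values (only bandwidth and delay count). The throughput term uses the 10^7 * 65536 / bandwidth scaling described earlier. The latency term (delay in picoseconds * 65536 / 10^6) comes from my reading of RFC 7868, not from this post. The last line applies the feasibility condition.

```python
def wide_composite(min_bw_kbps, total_delay_ps):
    # Default K values assumed: metric = throughput + latency
    throughput = 10**7 * 65536 // min_bw_kbps
    latency = total_delay_ps * 65536 // 10**6
    return throughput + latency

def is_feasible(reported_distance, feasible_distance):
    # Feasibility condition: neighbor's reported distance must be below our FD
    return reported_distance < feasible_distance

via_r2 = wide_composite(1_000_000, 21_250_000)        # successor path
via_r3 = wide_composite(1_000_000, 50_011_250_000)    # path with delay 5000

print(via_r2, via_r3)
print(is_feasible(1392640, via_r2))
```

The values line up with the 2048000 and 3278192640 shown in the topology table, and R3's reported distance of 1392640 passes the check against the FD of 2048000.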

R1(config)#router eigrp TEST

R1(config-router)# address-family ipv4 unicast autonomous-system 100

R1(config-router-af)# topology base

R1(config-router-af-topology)#fast-reroute per-prefix all

R1#sh ip route 4.4.4.4 | i Repair

Repair Path: 192.168.13.3, via GigabitEthernet1.13

R1#sh ip cef 4.4.4.4

4.4.4.4/32

nexthop 192.168.12.2 GigabitEthernet1.12

repair: attached-nexthop 192.168.13.3 GigabitEthernet1.13

It's very simple if we only have two paths, but what if there are 3 or more? Cisco uses what it calls "tie breakers", but I really dislike the name. We're not necessarily breaking a tie, because the selection criteria aren't comparing apples to apples. It's a bit more like a "2nd bestpath decision maker".

Before I list off the tie-breakers, let's look at what the problems might be if we had numerous paths to choose from.

Let's say we have multiple neighbors on a shared segment, with varying metrics to the destination we're trying to protect. Your bestpath is on that segment, as is your "second best" feasible successor, all hanging off the same interface on your router. If you're choosing the LFA purely based on metric, the same interface will get chosen for the backup path as is the primary route. That doesn't help us if that WAN link fails, or if the interface goes down, etc.

Take that one step further and say your best-path and best feasible successor are both on the same linecard. That might also be a poor decision.

What I'm getting at is there's more to consider than just the metric in this scenario.

The four tie-breakers are:

- srlg-disjoint, priority 10

- interface-disjoint, priority 20

- lowest-backup-path-metric, priority 30

- linecard-disjoint, priority 40

Lower priority is better.

srlg-disjoint favors a backup-path/interface that isn't in the same Shared Risk Link Group (more below).

interface-disjoint favors a backup route that doesn't share the same interface for its next-hop. BEWARE, sub-interfaces are considered disjointed interfaces by the FRR process on my version of IOS-XE!

lowest-backup-path-metric favors a backup route with the lowest metric.

linecard-disjoint favors a backup route that doesn't share the same linecard.

So to clarify, by default, SRLG gets priority unless not set, then interface-disjoint gets priority unless the two paths are already on different interfaces (or subinterfaces), then the lowest metric is picked. If the metric is the same, it looks for a port on a different linecard.

So to start, what the heck is SRLG?

There's very little information on this feature that I can find, but the idea, as best I can tell, is that if you happen to know two physical links share some dependency (perhaps passing through the same L2 switch upstream, for example), you can tell IOS which ones have dependencies.

For example, if Gig1 and Gig2 on my router both passed through a single point of failure upstream, my config might look something like this:

R1(config)#int gig1

R1(config-if)#srlg gid 1

R1(config-if)#int gig2

R1(config-if)#srlg gid 1

R1(config-if)#int gig3

R1(config-if)#srlg gid 2

Note gig3 didn't necessarily need to get assigned to an srlg, but I included it for clarity.

I'm going to introduce a new path from R1 to R4 via R5. R1, R2 and R5 are all going to share a common link, meaning R1 routes to R2 and R5 on the same interface. I'm increasing delay slightly more on the path to R5. Furthermore, I'm going to prevent R2 and R5 from peering with one another, otherwise R5 would end up only advertising its bestpath from R2, and my topology breaks.

R1#sh ip eigrp top 4.4.4.4/32 | i Composite metric

Composite metric is (78725120/13189120), route is Internal
Composite metric is (3289989120/13189120), route is Internal
Composite metric is (79380480/13844480), route is Internal

We see we've got three paths, let's look at those again with my comments:

R1#sh ip eigrp top 4.4.4.4/32 | i Composite metric

Composite metric is (78725120/13189120), route is Internal = Path through gig1.12 via R2

Composite metric is (3289989120/13189120), route is Internal = Path through gig1.13 via R3
Composite metric is (79380480/13844480), route is Internal = Path through gig1.12 via R5

R1#sh ip cef 4.4.4.4

4.4.4.4/32

nexthop 192.168.12.2 GigabitEthernet1.12

repair: attached-nexthop 192.168.13.3 GigabitEthernet1.13

We can see IOS made a very smart move here, and it's in line with the priorities we discussed above. The backup path is not the best feasible successor from a metric standpoint, it's the less risky separate "interface" (again, IOS considers a subinterface a separate interface).

If we instead wanted it to choose based on metric:

R1(config)#router eigrp TEST

R1(config-router)# address-family ipv4 unicast autonomous-system 100

R1(config-router-af)# topology base

R1(config-router-af-topology)#fast-reroute tie-break lowest-backup-path-metric 5

<<note I normally clear the eigrp neighbors here, these commands don't always seem to react quickly after the change>>

R1(config-router-af-topology)#do sh ip cef 4.4.4.4
4.4.4.4/32
nexthop 192.168.12.2 GigabitEthernet1.12
repair: attached-nexthop 192.168.12.5 GigabitEthernet1.12

Now we're preferring the backup path through the same interface, that has the better metric.
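Here's how I picture the tie-breakers working, as a rough model in Python rather than anything Cisco documents: walk the rules in priority order, and let each rule narrow the candidate list if at least one candidate satisfies it. I'm assuming the 79380480 metric belongs to the R5 path on gig1.12 and the 3289989120 metric to the R3 path on gig1.13, which is the assignment the two CEF outputs imply. SRLG and linecard rules are left as no-ops since neither applies in this lab.

```python
DEFAULT_PRIORITIES = {
    "srlg-disjoint": 10,
    "interface-disjoint": 20,
    "lowest-backup-path-metric": 30,
    "linecard-disjoint": 40,
}

PRIMARY_INTERFACE = "GigabitEthernet1.12"

# Feasible successors only; metric-to-path assignment is an assumption
CANDIDATES = [
    {"via": "192.168.12.5", "interface": "GigabitEthernet1.12", "metric": 79380480},
    {"via": "192.168.13.3", "interface": "GigabitEthernet1.13", "metric": 3289989120},
]

def narrow(rule, candidates):
    if rule == "interface-disjoint":
        return [c for c in candidates if c["interface"] != PRIMARY_INTERFACE]
    if rule == "lowest-backup-path-metric":
        best = min(c["metric"] for c in candidates)
        return [c for c in candidates if c["metric"] == best]
    # srlg-disjoint and linecard-disjoint are not modeled here
    return candidates

def pick_backup(priorities, candidates=CANDIDATES):
    remaining = list(candidates)
    for rule in sorted(priorities, key=priorities.get):   # lower value first
        narrowed = narrow(rule, remaining)
        if narrowed:              # a rule nobody satisfies is skipped
            remaining = narrowed
        if len(remaining) == 1:
            break
    return remaining[0]["via"]

default_choice = pick_backup(DEFAULT_PRIORITIES)
metric_first = pick_backup({**DEFAULT_PRIORITIES, "lowest-backup-path-metric": 5})

print("default priorities:", default_choice)
print("metric at priority 5:", metric_first)
```

With the defaults, interface-disjoint gets its say before the metric does, so R3 wins. Move the metric rule to priority 5 and R5 wins, same as the two CEF outputs.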

I'm not going to show the output from srlg disjoint here, but I have labbed it previously and it does work - just set the srlg gid on the appropriate interfaces. Also, I have no way of labbing linecard disjoint because I'm on a virtual router.

EIGRP Over The Top (OTP)
Does anyone besides me use the OTP abbreviation to mean "on the phone"? I wish they could've gone with OTT instead.

It is a really neat feature though - I know a lot of people will bash EIGRP as obsolete, proprietary, distance vector ... say what you will, among enterprise Cisco networks it's the most popular IGP by a landslide. As a consultant, I would say 80% of the networks I come across run it.

Furthermore, finding enterprise network support personnel that are BGP experts is somewhat rare.

So what is one to do when MPLS separates all your sites, and your carrier (wisely) uses BGP as a PE->CE protocol? You hire a consultant to come in and make changes to the redistribution strategy periodically.

Or... you run EIGRP OTP, and toss the BGP work out the window.

OTP allows remote EIGRP peerings over any underlying IP network. All you need is reachability to the other EIGRP host. That means all your carrier needs to do is advertise the PE->CE link itself (probably a /30 between you and the carrier) in their MP-BGP, and the CE doesn't even need to run BGP (topology dependent). All the CE needs is a static default pointing at the PE router.

If you have more than a few CEs, you'll probably want an EIGRP Route Reflector, which isn't nearly as complicated as it sounds. An EIGRP RR listens for dynamic connections (optionally), and then disables split horizon and next-hop-self.

LISP provides the tunneling mechanism for the neighbors to reach one another. Fortunately, no LISP knowledge is required; the config is automatic.

Here, R2 - R5 represent the provider network, R1 and R7 represent isolated customer sites, and R6 and R8 represent a dual-homed customer site.

R7 will be our EIGRP route reflector.

Assume the provider is advertising the links between the CE and PE.
Here are the rest of the relevant configs:

R1:
router eigrp OTP-TEST
!
address-family ipv4 unicast autonomous-system 100
!
topology base
exit-af-topology
neighbor 192.168.37.7 GigabitEthernet1.12 remote 10 lisp-encap
network 1.1.1.1 0.0.0.0
network 192.168.12.0
exit-address-family

ip route 0.0.0.0 0.0.0.0 192.168.12.2

Just to prove there's no BGP involved here:

R1#sh ip protocol sum

Index Process Name

0 connected

1 static

2 application

4 eigrp 100

*** IP Routing is NSF aware ***

R6:

router eigrp OTP-TEST

address-family ipv4 unicast autonomous-system 100

topology base

exit-af-topology

neighbor 192.168.37.7 GigabitEthernet1.46 remote 10 lisp-encap

network 6.6.6.6 0.0.0.0

network 192.168.46.0

network 192.168.68.0

exit-address-family

R8:

router eigrp OTP-TEST

address-family ipv4 unicast autonomous-system 100

topology base

exit-af-topology

neighbor 192.168.37.7 GigabitEthernet1.58 remote 10 lisp-encap

network 8.8.8.8 0.0.0.0

network 192.168.58.0

network 192.168.68.0

exit-address-family

R7 (route reflector):

router eigrp OTP-TEST

address-family ipv4 unicast autonomous-system 100

af-interface GigabitEthernet1.37

no next-hop-self

no split-horizon

exit-af-interface

topology base

exit-af-topology

remote-neighbors source GigabitEthernet1.37 unicast-listen lisp-encap

network 7.7.7.7 0.0.0.0

network 192.168.37.0

exit-address-family

The route reflector is also running BGP. Route reflectors can have a topology problem requiring this if you have backdoor links. In my case, if I only ran a default on the route reflector, I'd learn the link to R8 via EIGRP from R6, as opposed to using my default route. And vice-versa, R8 would advertise connectivity to R6, and my routes would do a continual up/down because they'd learn next-hops via the LISP interface. It's a typical tunnel recursion loop issue. Running BGP puts the prefixes to reach R6 and R8 in R7's table at a lower AD and solves the problem.

Also note that the link between PE and CE must be advertised into EIGRP in order for LISP to come up.

Now we have full reachability to the EIGRP prefixes without the majority of the CEs running BGP, and none of the CEs advertising their EIGRP routes into it.

R1#sh ip route eigrp | b Gateway

Gateway of last resort is 192.168.12.2 to network 0.0.0.0

6.0.0.0/32 is subnetted, 1 subnets

D 6.6.6.6 [90/93994331] via 192.168.46.6, 00:06:34, LISP0

7.0.0.0/32 is subnetted, 1 subnets

D 7.7.7.7 [90/93994331] via 192.168.37.7, 00:06:35, LISP0

8.0.0.0/32 is subnetted, 1 subnets

D 8.8.8.8 [90/93994331] via 192.168.58.8, 00:06:34, LISP0

D 192.168.68.0/24 [90/93998811] via 192.168.46.6, 00:06:34, LISP0

R1#sh ip cef 6.6.6.6

6.6.6.6/32

nexthop 192.168.46.6 LISP0

R1#ping 6.6.6.6

Type escape sequence to abort.

Sending 5, 100-byte ICMP Echos to 6.6.6.6, timeout is 2 seconds:

!!!!!

Success rate is 100 percent (5/5), round-trip min/avg/max = 1/1/2 ms

Add-Path

Add-Path is the capability to advertise more than one bestpath to a neighbor. I've done a large write-up on the BGP implementation of it:

http://brbccie.blogspot.com/2014/08/bgp-pic-and-add-path.html

The Cisco documentation indicates a use case of DMVPN for EIGRP Add-Path, but that seems a pretty narrow use to me, as summarization with DMVPN phase 3 would make it useless. However, our scenario for OTP above is perfect!

R1#sh ip route eigrp | b Gateway

Gateway of last resort is 192.168.12.2 to network 0.0.0.0

6.0.0.0/32 is subnetted, 1 subnets

D 6.6.6.6 [90/93994331] via 192.168.46.6, 00:06:34, LISP0

7.0.0.0/32 is subnetted, 1 subnets

D 7.7.7.7 [90/93994331] via 192.168.37.7, 00:06:35, LISP0

8.0.0.0/32 is subnetted, 1 subnets

D 8.8.8.8 [90/93994331] via 192.168.58.8, 00:06:34, LISP0

D 192.168.68.0/24 [90/93998811] via 192.168.46.6, 00:06:34, LISP0

R1 only learns one path to 192.168.68.0/24. Two are available, so why can't we install both? It's the same problem as with BGP: the EIGRP route reflector only sends its one best-path.

R7(config)#router eigrp OTP-TEST

R7(config-router)# address-family ipv4 unicast autonomous-system 100

R7(config-router-af)# af-interface GigabitEthernet1.37

R7(config-router-af-interface)#add-paths 2

R1#sh ip route eigrp | b Gateway

Gateway of last resort is 192.168.12.2 to network 0.0.0.0

6.0.0.0/32 is subnetted, 1 subnets

D 6.6.6.6 [90/93994331] via 192.168.46.6, 00:12:51, LISP0

7.0.0.0/32 is subnetted, 1 subnets

D 7.7.7.7 [90/93994331] via 192.168.37.7, 00:12:52, LISP0

8.0.0.0/32 is subnetted, 1 subnets

D 8.8.8.8 [90/93994331] via 192.168.58.8, 00:12:51, LISP0

D 192.168.68.0/24 [90/93998811] via 192.168.58.8, 00:00:26, LISP0

[90/93998811] via 192.168.46.6, 00:00:26, LISP0

And we've got multiple redundant paths to 192.168.68.0/24 now!

Note, EIGRP add-path is incompatible with variance.
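As a rough illustration of what changed on the route reflector, here is a small Python sketch of "send one best path" versus "send several equal-cost paths". The function name `advertise` and the `max_paths` parameter are mine, the metric value is a placeholder, and I'm not trying to reproduce exactly how the number given to add-paths is counted by IOS.

```python
def advertise(paths, max_paths=1):
    """Return the paths a reflector would send for one prefix.

    paths: list of (next_hop, metric) tuples.
    Only paths tied with the best metric are eligible in this sketch.
    """
    best_metric = min(metric for _, metric in paths)
    equal_cost = [p for p in paths if p[1] == best_metric]
    return equal_cost[:max_paths]


# Two equal-cost paths to 192.168.68.0/24 as the reflector might see them.
# The metric of 10 is a placeholder, not a lab value.
paths_68 = [("192.168.46.6", 10), ("192.168.58.8", 10)]

print(advertise(paths_68))               # [('192.168.46.6', 10)]
print(advertise(paths_68, max_paths=2))  # both next hops
```

With the default of one path, the spoke never hears about the second next hop, which is exactly what R1's first routing table showed.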

Hope you enjoyed,

Jeff

BGP PIC and Add-Path

2014-08-16T08:18:00.002-07:00

The meat of this article will be Add-Path, and why it's needed in certain PIC scenarios. However, understanding where and why we need these technologies, what was done before the Add-Path implementation was widely in place, etc, is nearly as challenging to learn as the Add-Path implementation itself.

This is not intended to completely document Add-Path, nor is it just a primer. My original intent was to document the entire use of Add-Path; however, I realized halfway through that this would have easily produced a 50+ page document. There are many one-off cases for Add-Path that have their own features, and to show a use case for each one would've required several different topologies and drawings. My hope is that at the depth I took it to, it will be more than sufficient to educate to the level required for the CCIE R&S v5 lab.

So - what is PIC?

PIC stands for Prefix Independent Convergence.

PIC is a method for speeding up convergence of the FIB under failover conditions.

Unless you have a really serious lab or a Spirent to play with, forget trying to lab the performance gain. The gains we're talking about here are only seen when you have tens of thousands, hundreds of thousands, or even 1M routes in your FIB.

The use case is actually pretty easy to understand - when the next-hop to a set of prefixes changes, the router (presumably talking about a 7600 or ASR) has to walk each prefix in the FIB and update the next-hop. If you have 100 routes, this time is negligible. If you're carrying 1M routes in an MPLS environment, this is not a small problem. I've been told first-hand (from someone who does have a Spirent to play with) that this takes about two minutes.

This would be Prefix Dependent Convergence, or a problem that grows dependent upon how many prefixes are in your FIB. The solution we want is something that updates in the same amount of time (presumably small amount of time!) no matter how many FIB entries we have.

The concept of the FIB dates back decades now, and when it was originally written it was made in the most efficient manner possible, for CPU and RAM conservation:

Prefix = Interface/Next-Hop

For example,

10.10.10.10/32 = FastEthernet0/0 192.168.0.1

This was great 20 years ago when a "large" routing table was 40,000 routes. To converge quickly, a new method is required. Introducing the Hierarchical FIB.

When using PIC, the FIB actually restructures to a 3-tier system:

Prefix = Pointer = Interface/Next-Hop

Understanding why this is better takes understanding that while a router may be carrying 1M routes, it's probably only directly connected (layer 3) to a dozen or less. So you've got 1M routes, and 12 possible exits.

Let's say half those routes go out to two primary edge routers. Those routers are at 192.168.1.1 and 192.168.2.1.

So, roughly half your routes look like:

10.10.10.10/32 = Pointer A = Gigabit0/0 192.168.1.1
11.11.11.11/32 = Pointer A = Gigabit0/0 192.168.1.1
.... 499,998 routes later ...
197.197.197.197/32 = Pointer A = Gigabit0/0 192.168.1.1

192.168.1.1 fails. However, all these same prefixes are reachable via 192.168.2.1.
With an appropriately designed network, PIC can simply reassign Pointer A. This takes less than 50ms as opposed to 60+ seconds.

10.10.10.10/32 = Pointer A = Gigabit0/1 192.168.2.1
11.11.11.11/32 = Pointer A = Gigabit0/1 192.168.2.1

.... 499,998 routes later ...
197.197.197.197/32 = Pointer A = Gigabit0/1 192.168.2.1

The CEF process updated one value, that of Pointer A. Previously this took 500,000 updates, now it takes one. The time required for this process is independent of how many routes use the next-hop, hence Prefix Independent Convergence.
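The difference between the two FIB layouts can be sketched in a few lines of Python. This is only a toy model of the indirection, not how CEF is implemented; the class name `NextHop`, the prefix list and the count of 5000 prefixes are all made up for illustration.

```python
class NextHop:
    """Shared, mutable forwarding target (the 'pointer' in the text)."""

    def __init__(self, interface, address):
        self.interface = interface
        self.address = address


# 5000 made-up prefixes standing in for the 500,000 in the example.
prefixes = [f"10.{i // 256}.{i % 256}.0/24" for i in range(5000)]

# Flat FIB: every prefix stores its own copy of the next hop,
# so a failover has to rewrite every entry.
flat = {p: ("Gigabit0/0", "192.168.1.1") for p in prefixes}
flat_writes = 0
for p in flat:
    flat[p] = ("Gigabit0/1", "192.168.2.1")
    flat_writes += 1

# Hierarchical FIB: every prefix references the same NextHop object,
# so a failover is a single update to that object.
pointer_a = NextHop("Gigabit0/0", "192.168.1.1")
hier = {p: pointer_a for p in prefixes}
pointer_a.interface, pointer_a.address = "Gigabit0/1", "192.168.2.1"

print(flat_writes)                  # 5000
print(hier["10.0.0.0/24"].address)  # 192.168.2.1
```

The flat table's work grows with the number of prefixes; the hierarchical table's work stays at one update no matter how long the prefix list is.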

Now if you're following along, you probably see the enormous catch here: unless you're multipathing, how is CEF even going to know about the second path? PIC is a data-plane/CEF/FIB feature, it doesn't touch the control-plane. Normally we'd have to wait on BGP convergence (topology dependent), which takes a heck of a lot longer than 50ms. As we're all aware, and this is key to understanding this topic, BGP only sends its single best-path per-prefix to its neighbors. What if we needed two or more? Even worse, what if we're crossing a route-reflector, that aggregates everyone's paths and picks only one?

I am going to cover four different ways to solve this, add-path being the newest of them.

Here are the options at a high-level:
1) Multipath. This is by far the easiest option if your topology fits.
2) BGP Advertise-Best-External. For advertising from PE->PE, or PE->RR; this tells the edge PE to send its external route (presumably from a CE via eBGP) as best. More below.
3) Diverse-Path (Shadow Router). This tells a route reflector, a secondary one in a topology, to deliberately calculate a "second-best" path that has a different next-hop. Instead of forwarding its best-path, it forwards this "second-best" path. Only the route-reflector needs to be updated to support this feature.
4) Add-Path. In short, Add-Path modifies the BGP behavior to send two or more paths instead of just one best-path. This requires that every device in the topology that needs to send or receive multiple paths supports Add-Path.

I've chosen to demonstrate these solutions in a VPNv4 environment, as it's where PIC makes the most sense. Note that add-path is purely an iBGP technology, the parser gets upset if you try it on eBGP:

R3(config-router)#neighbor 192.168.30.2 advertise additional-paths all
% BGP: Add-Path *not* supported on EBGP peering

I have a hobby (perhaps more of an interest?) of the language used in IOS parser messages. Half the time, unless you know the technology already, you can't even tell what the programmer was trying to convey when you make a mistake. If it's a new feature sometimes you don't even get an error, it just doesn't apply the config. Then other times you get blunt messages with *stars*!

I'm running a common VPNv4 design: BGP on the PEs, VRFs between CE and PE, and a "BGP free core" (all one P router that isn't a route reflector :) ).

On the PE->CE links, I'm using 10.0.X.Y/24, where X is a combination of the two routers the link connects (i.e. R1->R2 is "12"), and Y is the router number. This is also the same number on the subinterface on the diagram.

On the PE->P or PE->RR links, the IPs are 192.168.X.Y, same explanation of X and Y as above.

Every router has a loopback0 of Y.Y.Y.Y/32, where Y is the router number.

Note that R4 is a route-reflector, and R6, R7 and R8 are all PEs.

Let's talk about the two flavors of PIC. There's PIC Core and PIC Edge. They're both applied to a PE.

PIC Core is far simpler than PIC Edge, so we'll start there. We've enabled PIC Core on R2.

PIC Core is enabled with one command:

R2-PE(config)#cef table output-chain build favor convergence-speed

Of note, to disable it, you replace "convergence-speed" with "memory-utilization".

Unlike PIC Edge, which, depending on the implementation, may require widespread support on the network, PIC Core can literally be enabled on just one device if you wanted.

As mentioned above, in a typical VPNv4 scenario, the core is BGP-free, and only the PEs (and any route reflectors) maintain the BGP table. Next hops to the PEs are carried in the IGP. Let's look at how that plays out:

- Let's assume R1's bestpath to R9 is via R2. R1 is BGP peered to R2.

- R2 takes R1's routes in to a VRF. It exports the VRF routes into VPNv4.

- R4, the route reflector, learns via iBGP that the PEs R6 and R7 can both reach R9. It chooses R6 as the bestpath.

- R2, only peered with R4 for iBGP, learns that R6 is the bestpath.

- Since this is VPNv4, R2 needs to choose an LDP-enabled next hop that has a label for 6.6.6.6. Remember, in VPNv4, the next hop inside the iBGP network is always the iBGP next-hop. The IGP indicates that R5 is the bestpath for R2 to reach R6 (via MPLS).

The key element here is the recursion between R2 and R6:

BGP tells R2 how to reach R9 via R6: 2.2.2.2 -> 6.6.6.6

R2 needs to find out how to reach 6.6.6.6 via the IGP: 2.2.2.2 -> R5

R2 needs to know how to reach R5: 2.2.2.2 -> 192.168.25.5 (R5's interface IP)

R2 needs to pick an interface to reach 192.168.25.5: gig1.25

So one more time!

iBGP: 2.2.2.2 -> 6.6.6.6

iBGP Next-Hop via OSPF: Find 6.6.6.6 via R5 at 192.168.25.5

CEF: Exit interface gig1.25 towards 192.168.25.5

I'm going to harp on the high-level of this again because it's dead critical to understanding the hierarchy of the process:
BGP recurses to IGP
IGP recurses to one or more Next Hops
FIB populates one or more next hops from the IGP
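The three-layer recursion above can be modeled with two lookup tables in Python. This is a minimal sketch using the addresses from this topology; the table names `BGP` and `IGP` and the function `resolve` are mine, and a real router obviously does longest-match lookups rather than exact dictionary hits.

```python
# Layer 1: BGP maps a prefix to a BGP next hop (a PE loopback).
BGP = {"9.9.9.9/32": "6.6.6.6"}

# Layer 2: the IGP maps that loopback to one or more adjacent next hops.
IGP = {"6.6.6.6": [("192.168.25.5", "GigabitEthernet1.25")]}


def resolve(prefix):
    """Layer 3: what the FIB ends up forwarding to for this prefix."""
    bgp_next_hop = BGP[prefix]
    return [(bgp_next_hop, hop, intf) for hop, intf in IGP[bgp_next_hop]]


before = resolve("9.9.9.9/32")

# The IGP reconverges onto the other link. The BGP layer is untouched.
IGP["6.6.6.6"] = [("192.168.24.4", "GigabitEthernet1.24")]
after = resolve("9.9.9.9/32")

print(before)  # [('6.6.6.6', '192.168.25.5', 'GigabitEthernet1.25')]
print(after)   # [('6.6.6.6', '192.168.24.4', 'GigabitEthernet1.24')]
```

Note that only the IGP table changed between the two lookups, yet the forwarding result for the BGP prefix changed with it.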

When you're using PIC Core, this is what we care about:
BGP recurses to IGP
IGP recurses to one or more Next Hops <-- PIC CORE INFLUENCES
FIB populates one or more next hops from the IGP <-- PIC CORE INFLUENCES

I will demonstrate below.

So given that R1 -> R2 -> R5 -> R6 -> R9 above, let's say R5 goes completely offline - dead.

This does not impact the BGP session between R2 and R4, or between R4 and R6. However, the next hop that the BGP next-hop resolves to (192.168.25.5), which it learned from the IGP, must change. The IGP can reconverge very quickly, but let's say the BGP process was carrying 1M routes from R9. How long will it take R2 to update the next-hops of the BGP table and CEF?

So to be clear, BGP is not reconverging. PIC Core cannot handle a BGP reconverge, you need PIC Edge for that. But if the IGP reconverges and requires the BGP Table and FIB to update, and you have a large quantity of routes, this can create a major impact on a PE - possibly several minutes of dropping traffic.

With a traditional FIB, we'd have to make 1 million updates in both the BGP table and the FIB in order to be fully forwarding again. With a hierarchical FIB - what PIC Core provides us - the following process would happen:

The FIB, before:
Prefix 1 -> Pointer A (192.168.25.5) -> gig1.25

The IGP reconverges the path via R4.
Now we update Pointer A - one value instead of 1M values - and we end up with:

Prefix 1 -> Pointer A (192.168.24.4) -> gig1.24

So to reiterate, PIC Core is for failure of non-BGP speakers. It doesn't help if BGP itself needs to reconverge, but it does dramatically speed up CEF's failover if the IGP fails.

Now moving in to the more complex PIC Edge.

If PIC Core was about dealing with IGP failure, PIC Edge is about dealing with BGP failure.

For the moment, we'll continue using our VPNv4 topology, except we're temporarily removing the route reflector and instead installing a full-mesh iBGP.

Please note that using PIC Edge should involve running BFD between the BGP speakers for fast detection of a failure. For simplicity, I've omitted this step. To learn more about BFD, please see my BFD blog: http://brbccie.blogspot.com/2014/06/everything-bfd.html

That's quite a few iBGP peerings. The red lines indicate all of them:

In this scenario, we're going to deal with R2's convergence process again, except we're going to assume R6 - the BGP-adjacent PE - dies, instead of a P router.

Let's look at our routing protocols from R2's perspective.

R2-PE#sh bgp vpnv4 un all 9.9.9.9
BGP routing table entry for 1:1:9.9.9.9/32, version 11
Paths: (3 available, best #2, table VPN)
Advertised to update-groups:
1
Refresh Epoch 3
300
8.8.8.8 (metric 3) (via default) from 8.8.8.8 (8.8.8.8)
Origin IGP, metric 0, localpref 100, valid, internal
Extended Community: RT:1:1
mpls labels in/out nolabel/16
rx pathid: 0, tx pathid: 0
Refresh Epoch 3
300
6.6.6.6 (metric 3) (via default) from 6.6.6.6 (6.6.6.6)
Origin IGP, metric 0, localpref 100, valid, internal, best
Extended Community: RT:1:1
mpls labels in/out nolabel/22
rx pathid: 0, tx pathid: 0x0
Refresh Epoch 3
300
7.7.7.7 (metric 3) (via default) from 7.7.7.7 (7.7.7.7)
Origin IGP, metric 0, localpref 100, valid, internal
Extended Community: RT:1:1
mpls labels in/out nolabel/26
rx pathid: 0, tx pathid: 0

As expected, R2 has three BGP paths to 9.9.9.9. 6.6.6.6 is the best.

How do we reach 6.6.6.6?

R2-PE#sh ip ospf route | s 6.6.6.6
*> 6.6.6.6/32, Intra, cost 3, area 0
via 192.168.24.4, GigabitEthernet1.24
via 192.168.25.5, GigabitEthernet1.25

The BGP table has one selected bestpath, the IGP has two multipath bestpaths to BGP's next hop:

R2-PE#sh ip route 6.6.6.6
Routing entry for 6.6.6.6/32
Known via "ospf 1", distance 110, metric 3, type intra area
Last update from 192.168.24.4 on GigabitEthernet1.24, 00:22:04 ago
Routing Descriptor Blocks:
* 192.168.25.5, from 6.6.6.6, 01:31:56 ago, via GigabitEthernet1.25
Route metric is 3, traffic share count is 1
192.168.24.4, from 6.6.6.6, 00:22:04 ago, via GigabitEthernet1.24
Route metric is 3, traffic share count is 1

R2-PE#sh ip cef 6.6.6.6
6.6.6.6/32
nexthop 192.168.24.4 GigabitEthernet1.24 label 17
nexthop 192.168.25.5 GigabitEthernet1.25 label 18

Now let's refer back to my process from earlier:

BGP recurses to IGP
IGP recurses to one or more Next Hops
FIB populates one or more next hops from the IGP

BGP says use 6.6.6.6
IGP says to get to 6.6.6.6 use either 192.168.25.5 or 192.168.24.4
FIB points to 192.168.24.4 / label 17 and 192.168.25.5 / label 18 multipath

Now what happens if 6.6.6.6 fails?

R6-PE(config)#int gig1.46
R6-PE(config-subif)#shut
R6-PE(config-subif)#int gig1.56

R6-PE(config-subif)#shut

R6-PE(config-subif)#int gig1.69

R6-PE(config-subif)#shut

Debugging BGP updates on R2 (significantly edited for brevity):

*Aug 20 23:30:29.037: RT(VPN): updating bgp 9.9.9.9/32 (0x1) : via 7.7.7.7 0 26

*Aug 20 23:30:29.037: RT(VPN): closer admin distance for 9.9.9.9, flushing 1 routes

*Aug 20 23:30:29.037: RT(VPN): add 9.9.9.9/32 via 7.7.7.7, bgp metric [200/0]

BGP figures out that 6.6.6.6 is down, and picks 7.7.7.7 for the next hop. Now we have the same problem we had with PIC Core, only it's more significant:

BGP recurses to IGP <-- PIC EDGE INFLUENCES
IGP recurses to one or more Next Hops <-- PIC EDGE & CORE INFLUENCE
FIB populates one or more next hops from the IGP <-- PIC EDGE & CORE INFLUENCE

Just pointing out the process there - we don't have PIC edge enabled, so our theoretical 1M routes just took minutes to reconverge.

So how do we enable PIC Edge? Quite simply, we can't wait for the IGP and BGP to converge. We need two paths in BGP. This can be easy or difficult, depending on our topology. Let's look at the easiest methods and progress towards harder.

Note we still have cef table output-chain build favor convergence-speed configured on R2, which is still necessary.

Re-enabling R6 to show how this could play out with PIC Edge.
router bgp 200
address-family ipv4 vrf VPN
maximum-paths ibgp 3

Now we've told R2 to install multiple BGP paths, not just multiple IGP paths. This way if R6's advertisement gets pulled again, there's already a pre-made alternative path.

Now we have three "hot", installed BGP paths to 9.9.9.9, instead of just one. This means with the IGP in consideration, we have six paths:

R2-PE#sh bgp vpnv4 un all | b 9.9.9.9
*mi 9.9.9.9/32 7.7.7.7 0 100 0 300 i
*>i 6.6.6.6 0 100 0 300 i
*mi 8.8.8.8 0 100 0 300 i

R2-PE#sh ip cef vrf VPN 9.9.9.9 detail

9.9.9.9/32, epoch 1, flags rib defined all labels, per-destination sharing

recursive via 6.6.6.6 label 22

nexthop 192.168.24.4 GigabitEthernet1.24 label 23

nexthop 192.168.25.5 GigabitEthernet1.25 label 18

recursive via 7.7.7.7 label 26

nexthop 192.168.24.4 GigabitEthernet1.24 label 16

nexthop 192.168.25.5 GigabitEthernet1.25 label 20

recursive via 8.8.8.8 label 16

nexthop 192.168.24.4 GigabitEthernet1.24 label 28

If we lose the path via 6.6.6.6, one of the other paths would simply pick up the load, and because of the hierarchical FIB we already implemented, there's no need to rewrite all 1M prefixes in the FIB one at a time.

This represented our first PIC solution I described above: Multipathing.

I'm going to temporarily cut to a much simpler scenario to show BGP Advertise-Best-External. While I could mix this in to the topology we've been using, it's getting too complex to clearly illustrate the topic.

Let's say multipathing isn't an option - what if one of the paths is clearly better than the others? What else can we do?

I've deliberately made R6 the bestpath by setting the local preference on all routes leaving it to 150. Now what we see from R2 looks like:

R2-PE#show bgp vpnv4 un all 9.9.9.9

BGP routing table entry for 1:1:9.9.9.9/32, version 63

Paths: (1 available, best #1, table VPN)

Advertised to update-groups:

Refresh Epoch 1

300

6.6.6.6 (metric 3) (via default) from 6.6.6.6 (6.6.6.6)

Origin IGP, metric 0, localpref 150, valid, internal, best

Extended Community: RT:1:1

mpls labels in/out nolabel/22

rx pathid: 0, tx pathid: 0x0

Only one path ... via 3 upstreams? Yep. The problem here is that, depending on timing, R2 may end up with three paths, but only for a moment. Since all routers are peered with one another, R7 will learn that R6 is the bestpath via its iBGP session to R6, as will R8. Both R7 and R8 will then withdraw their own routes, because an iBGP-learned bestpath isn't re-advertised to other iBGP peers. Now R2 is stuck with one path - we need at least two for PIC Edge.

The dead easiest solution to this design is to use Advertise-Best-External:

R7 & R8:

router bgp 200

address-family ipv4 vrf VPN

bgp advertise-best-external

What's this do?

R7-PE#sh bgp vpnv4 un all 9.9.9.9

BGP routing table entry for 1:1:9.9.9.9/32, version 18

Paths: (3 available, best #2, table VPN)

Advertised to update-groups:

1 6

Refresh Epoch 5

300

8.8.8.8 (metric 3) (via default) from 8.8.8.8 (8.8.8.8)

Origin IGP, metric 0, localpref 100, valid, internal

Extended Community: RT:1:1

mpls labels in/out 26/16

rx pathid: 0, tx pathid: 0

Refresh Epoch 3

300

6.6.6.6 (metric 3) (via default) from 6.6.6.6 (6.6.6.6)

Origin IGP, metric 0, localpref 150, valid, internal, best

Extended Community: RT:1:1

mpls labels in/out 26/22

rx pathid: 0, tx pathid: 0x0

Refresh Epoch 2

300

10.0.79.9 (via vrf VPN) from 10.0.79.9 (9.9.9.9)

Origin IGP, metric 0, localpref 100, valid, external

Extended Community: RT:1:1

mpls labels in/out 26/nolabel

rx pathid: 0, tx pathid: 0

R7 still sees the path through R6 as best. However, what's it sending to R2? It's sending its eBGP path to the CE as opposed to the path to R6.

Since R8 is doing the same thing, R2 now has three paths again:

R2-PE#sh bgp vpnv4 un all 9.9.9.9

BGP routing table entry for 1:1:9.9.9.9/32, version 70

Paths: (3 available, best #3, table VPN)

Advertised to update-groups:

Refresh Epoch 5

300

8.8.8.8 (metric 3) (via default) from 8.8.8.8 (8.8.8.8)

Origin IGP, metric 0, localpref 100, valid, internal

Extended Community: RT:1:1

mpls labels in/out nolabel/16

rx pathid: 0, tx pathid: 0

Refresh Epoch 3

300

7.7.7.7 (metric 3) (via default) from 7.7.7.7 (7.7.7.7)

Origin IGP, metric 0, localpref 100, valid, internal

Extended Community: RT:1:1

mpls labels in/out nolabel/26

rx pathid: 0, tx pathid: 0

Refresh Epoch 1

300

6.6.6.6 (metric 3) (via default) from 6.6.6.6 (6.6.6.6)

Origin IGP, metric 0, localpref 150, valid, internal, best

Extended Community: RT:1:1

mpls labels in/out nolabel/22

rx pathid: 0, tx pathid: 0x0

So Advertise-Best-External sends your eBGP route as bestpath to your neighbors, but local routing (on R7 or R8) still goes through R6 due to the local-preference.
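Here is a small Python sketch of the advertisement decision on R7, with and without best-external. It is a simplification under loud assumptions: best-path selection is reduced to local preference only, the function name `advertise_to_ibgp` and the dictionary keys are mine, and the path list just mirrors the three paths in R7's output above.

```python
def advertise_to_ibgp(paths, best_external=False):
    """Decide what a PE sends to its iBGP peers for one prefix.

    paths: list of dicts with 'next_hop', 'local_pref' and 'external' keys.
    Best-path selection is reduced to highest local preference here.
    """
    best = max(paths, key=lambda p: p["local_pref"])
    if best["external"]:
        return best  # normal case: the eBGP-learned best path is advertised
    if best_external:
        externals = [p for p in paths if p["external"]]
        if externals:
            # Still forward via 'best' locally, but advertise the best external.
            return max(externals, key=lambda p: p["local_pref"])
    # An iBGP-learned best path is not re-advertised to iBGP peers.
    return None


r7_paths = [
    {"next_hop": "8.8.8.8", "local_pref": 100, "external": False},
    {"next_hop": "6.6.6.6", "local_pref": 150, "external": False},
    {"next_hop": "10.0.79.9", "local_pref": 100, "external": True},
]

print(advertise_to_ibgp(r7_paths))                                 # None
print(advertise_to_ibgp(r7_paths, best_external=True)["next_hop"])  # 10.0.79.9
```

The `None` result is the withdraw that left R2 with a single path; with best-external on, R7 keeps its CE-facing path alive in the iBGP mesh.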

We're not done yet however:

R2-PE#sh ip cef vrf VPN 9.9.9.9 detail

9.9.9.9/32, epoch 1, flags rib defined all labels

recursive via 6.6.6.6 label 22

nexthop 192.168.24.4 GigabitEthernet1.24 label 23

nexthop 192.168.25.5 GigabitEthernet1.25 label 18

R2 still only sees one possible path.

We need to implement some single-router Add-Path to make this work. The key item of importance is that only the routers that need the non-multipath redundant paths have to support Add-Path in this design. If we're not worried about R6, R7, or R8 having an additional path back to R1, then we might just have R2 and R3 require the Add-Path support (Add-Path is a reasonably new feature at the time of this writing, so having your entire topology support it could be challenging).

router bgp 200

address-family ipv4 vrf VPN

bgp additional-paths select backup

bgp additional-paths install

Don't worry about the specific mechanisms of "select backup" and "install" yet, I'm going to cover them thoroughly later. In short, we need to tell this router to pick a backup path and pre-install it in the FIB so that PIC can use it in failover, which this config accomplishes:

R2-PE#sh ip cef vrf VPN 9.9.9.9 det

9.9.9.9/32, epoch 1, flags rib defined all labels

recursive via 6.6.6.6 label 22

nexthop 192.168.24.4 GigabitEthernet1.24 label 23

nexthop 192.168.25.5 GigabitEthernet1.25 label 18

recursive via 7.7.7.7 label 26, repair

nexthop 192.168.24.4 GigabitEthernet1.24 label 16

nexthop 192.168.25.5 GigabitEthernet1.25 label 20

Note the "repair" keyword on the second path: that's the key.
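The idea of picking a backup and holding it in reserve can be sketched like this in Python. Again this is a toy model under assumptions: ranking uses local preference first and then the lowest next-hop address as a stand-in for the rest of the BGP decision process, and the function name `select_backup` is mine.

```python
import ipaddress


def select_backup(paths):
    """paths: list of (next_hop, local_pref). Returns (best, backup).

    The backup is the highest-ranked path whose next hop differs from
    the best path's next hop, or None if no such path exists.
    """
    ranked = sorted(
        paths,
        # Higher local_pref first, then lower next-hop address (assumption).
        key=lambda p: (-p[1], ipaddress.ip_address(p[0])),
    )
    best = ranked[0]
    backup = next((p for p in ranked[1:] if p[0] != best[0]), None)
    return best, backup


# R2's three paths to 9.9.9.9 from the output above.
r2_paths = [("8.8.8.8", 100), ("7.7.7.7", 100), ("6.6.6.6", 150)]
print(select_backup(r2_paths))  # (('6.6.6.6', 150), ('7.7.7.7', 100))

# Two copies of the same route with the same next hop: nothing usable.
same_next_hop = [("6.6.6.6", 100), ("6.6.6.6", 150)]
print(select_backup(same_next_hop))  # (('6.6.6.6', 150), None)
```

If every alternative shares the best path's next hop, there is nothing to install as a repair path, because losing that next hop takes out all of them at once.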

I'm removing the R2 Add-Path config and bgp advertise-best-external on the PEs.

This is all fantastic with full-mesh iBGP - what if you have a huge topology and a route-reflector (or several) is more realistic? There's a big problem here, because like any BGP router, the route reflector will only choose its one best path to send to the other PEs. This makes multipathing impossible.

I've re-made R4 a route reflector, and removed all the redundant iBGP paths between the other PEs. Every PE is getting their routes via R4 now.

Clearly down to just one path now:

R2-PE#sh ip cef vrf VPN 9.9.9.9 det

9.9.9.9/32, epoch 1, flags rib defined all labels

recursive via 6.6.6.6 label 22

nexthop 192.168.24.4 GigabitEthernet1.24 label 23

nexthop 192.168.25.5 GigabitEthernet1.25 label 18

As usual, we have multiple IGP paths, but those will both get pulled if we lose the BGP path.

Without going to full-on Add-Path across the network, our simplest answer is another route-reflector running diverse-path. I'm temporarily making R5 an additional route-reflector.

For brevity I'm not going to include all the config necessary to make R5 a route-reflector. However, the outcome on R2 looks like this:

R2-PE#sh bgp vpnv4 uni all 9.9.9.9

BGP routing table entry for 1:1:9.9.9.9/32, version 79

Paths: (2 available, best #2, table VPN)

Advertised to update-groups:

Refresh Epoch 1

300

6.6.6.6 (metric 3) (via default) from 5.5.5.5 (5.5.5.5)

Origin IGP, metric 0, localpref 100, valid, internal

Extended Community: RT:1:1

Originator: 6.6.6.6, Cluster list: 5.5.5.5

mpls labels in/out nolabel/22

rx pathid: 0, tx pathid: 0

Refresh Epoch 1

300

6.6.6.6 (metric 3) (via default) from 4.4.4.4 (4.4.4.4)

Origin IGP, metric 0, localpref 150, valid, internal, best

Extended Community: RT:1:1

Originator: 6.6.6.6, Cluster list: 4.4.4.4

mpls labels in/out nolabel/22

rx pathid: 0, tx pathid: 0x0

Hey, great, we've got two paths, we can just enable Add-Path on R2 and we're done, right?

Not so fast.

The next-hop is 6.6.6.6 on both routes - in order for Add-Path to be viable, the backup path's next-hop must be different from the primary path's.

The solution, as I'd mentioned above, is to use Diverse-Path. Diverse-Path tells a BGP router to deliberately calculate the 2nd-best path that has a different next hop than the first-best path. Diverse-Path was a workaround before Add-Path was supported (or widely supported) in IOS. Only the route reflector running Diverse-Path needs to know about it; all the other routers are just following standard IOS rules.

R5-RR(config)#router bgp 200

R5-RR(config-router)#address-family vpnv4

R5-RR(config-router-af)#bgp additional-paths select backup

R5-RR(config-router-af)#bgp additional-paths install

R5-RR(config-router-af)#neighbor 2.2.2.2 advertise diverse-path backup

Here, we tell R5 to calculate a backup path, and then we tell it to advertise it to R2 as if it were R5's bestpath (in production, you'd presumably want to send this to all route-reflector clients, not just one).

One more step is also required on R7 and R8 (I've done R6 as well to keep the config consistent) - right now, this topology suffers from the same problem we saw in the first advertise-best-external scenario. Consider:

1) R6 sends its bestpath (its external path) to R4 and R5. This prefix has a local pref of 150.
2) R7 sends its bestpath (its external path) to R4 and R5.
3) R8 sends its bestpath (its external path) to R4 and R5.
4) R5 starts calculating 2nd-best-path for R2
5) R7 learns about R6's bestpath from R4
6) R8 learns about R6's bestpath from R4
7) R7 withdraws its bestpath from R4 and R5 after learning R6's path is better
8) R8 withdraws its bestpath from R4 and R5 after learning R6's path is better
9) R5 calculates its only path to 9.9.9.9 via R6

Now we could put bgp advertise-best-external back in, but that would advertise the best external to both R4 and R5 and we'd have the same exact problem as above.

Per-neighbor best-external is the solution:
R6, R7 & R8:
router bgp 200
address-family vpnv4
neighbor 5.5.5.5 advertise best-external

This will advertise the "internal bestpath" (via R6, because of local preference) to R4, and the external bestpath to R5.

Now back to R2:

R2-PE#sh bgp vpnv4 un all 9.9.9.9

BGP routing table entry for 1:1:9.9.9.9/32, version 83

Paths: (2 available, best #2, table VPN)

Advertised to update-groups:

Refresh Epoch 2

300

6.6.6.6 (metric 3) (via default) from 5.5.5.5 (5.5.5.5)

Origin IGP, metric 0, localpref 100, valid, internal

Extended Community: RT:1:1

Originator: 6.6.6.6, Cluster list: 5.5.5.5

mpls labels in/out nolabel/22

rx pathid: 0, tx pathid: 0

Refresh Epoch 1

300

6.6.6.6 (metric 3) (via default) from 4.4.4.4 (4.4.4.4)

Origin IGP, metric 0, localpref 150, valid, internal, best

Extended Community: RT:1:1

Originator: 6.6.6.6, Cluster list: 4.4.4.4

mpls labels in/out nolabel/22

rx pathid: 0, tx pathid: 0x0

Now we've got two routes with two next-hops.

R2-PE#sh ip cef vrf VPN 9.9.9.9 detail

9.9.9.9/32, epoch 1, flags rib defined all labels

recursive via 6.6.6.6 label 22

nexthop 192.168.24.4 GigabitEthernet1.24 label 23

nexthop 192.168.25.5 GigabitEthernet1.25 label 18

But we still need to enable the calculation of a backup route, otherwise PIC Edge won't work.

R2-PE(config)#router bgp 200

R2-PE(config-router)#address-family ipv4 vrf VPN

R2-PE(config-router-af)#bgp additional-paths select backup

R2-PE(config-router-af)#bgp additional-paths install

R2-PE#sh ip cef vrf VPN 9.9.9.9 det
9.9.9.9/32, epoch 1, flags rib defined all labels
recursive via 6.6.6.6 label 22
nexthop 192.168.24.4 GigabitEthernet1.24 label 23
nexthop 192.168.25.5 GigabitEthernet1.25 label 18
recursive via 7.7.7.7 label 26, repair
nexthop 192.168.24.4 GigabitEthernet1.24 label 16
nexthop 192.168.25.5 GigabitEthernet1.25 label 20

Now we've got a working solution!
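
The repair entry in the CEF output above can be thought of as a pre-computed fallback. A minimal Python sketch, assuming a FIB entry is just a primary and a repair next hop (the `fib` layout and `forwarding_next_hop` name are illustrative only):

```python
# One FIB entry with a primary next hop and a pre-installed repair next hop.
fib = {"9.9.9.9/32": {"primary": "6.6.6.6", "repair": "7.7.7.7"}}

def forwarding_next_hop(prefix: str, failed: set[str]) -> str:
    """Use the primary next hop unless it is known to be down."""
    entry = fib[prefix]
    if entry["primary"] in failed:
        # The repair path is already installed, so forwarding can switch
        # without waiting for BGP to reconverge.
        return entry["repair"]
    return entry["primary"]

print(forwarding_next_hop("9.9.9.9/32", failed=set()))
print(forwarding_next_hop("9.9.9.9/32", failed={"6.6.6.6"}))
```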

And last but certainly not least, the gold standard of receiving two paths: Simply rework how BGP handles multiple paths by using Add-Path.

Sadly, as much as this technology seems like it's custom-built for VPNv4, if you can believe it, Add-Path isn't supported in VPNv4 on my OS:

R4#sh ver
Cisco IOS XE Software, Version 03.11.01.S - Standard Support Release
Cisco IOS Software, CSR1000V Software (X86_64_LINUX_IOSD-UNIVERSALK9-M), Version 15.4(1)S1, RELEASE SOFTWARE (fc2)

In the IPv4 (default) Family:

R4(config)#router bgp 200
R4(config-router)#bgp additional-paths select ?
all Select all available paths
backup Select backup path
best Select best N paths
best-external Select best-external path
group-best Select group-best path

R4(config-router)#neighbor 2.2.2.2 advertise ?
additional-paths Advertise additional paths

best-external Advertise best-external (at RRs best-internal) path

diverse-path Advertise diverse path

Note the all, best, and group-best options under select, and the additional-paths option under advertise; those are what we're looking for in VPNv4:

R4(config-router)#address-family vpnv4

R4(config-router-af)#neighbor 2.2.2.2 advertise ?

best-external Advertise best-external (at RRs best-internal) path
diverse-path Advertise diverse path

R4(config-router-af)#bgp additional-paths select ?
backup Select backup path
best-external Select best-external path

Completely lacking.

On that note, we'll be reverting this design back to a non-MPLS scenario for the remainder of the blog.

I've also reverted R5 from being a route-reflector; it's now simply a client of R4. This was necessary to carry the IPv4 BGP table through R5.

Note R6 deliberately still has the best path via local-preference.

Here is a diagram of roughly what we're trying to achieve.

We'd like R6, R7 and R8 to all send (initially) one route to the RR. We'd like R4 to reflect back two paths for reaching 9.9.9.9 to everyone (technically speaking we'll also be reflecting two paths for 1.1.1.1 on CE1, but I chose not to focus on that).

This design suffers from the same problem the last several have. Everything will start out looking good until the route-reflector reflects the superior path from R6 to R7 and R8, and those two routers both pick R6 as their bestpath. After that they'll withdraw their routes from R4, and R4 will only have a single route to send to R2, R3, etc, etc, because every path will point to R6.

We can solve this with one of three methods:
- BGP Advertise-Best-External on R7 and R8 (optionally on R6)
- Per-neighbor advertise best-external
- Running two-path Add-Path on R7 and R8 in addition to R4.

The top two options I imagine are self-explanatory at this point as I covered them above, however, the final option is hopefully interesting to the reader, and therefore it's the method I will choose for this lab. What will happen if R4, R7 and R8 run add-path is as follows:

1) R6, R7 and R8 all advertise their own (connected/external) bestpath to R4
2) (Let's assume R7 had the 2nd-best path for this example) R4 reflects BOTH R6 and R7's bestpath to R2, R3, R5, R6, R7 and R8.
3) R2 and R3 install both paths in BGP and in the FIB.
4) R6 installs R7's path in the FIB as a repair route.
5) R7 and R8 both change their bestpath to R6 instead of their external route.
6) R7 and R8 both advertise back to the route reflector that R6 is their bestpath and R7 is their backup path.
... None of this changes anything on R2, R3, or R4 that would shift the route reflector's selection, so its clients aren't modified either.

The key here is that while we still have the same problem of R7 and R8 preferring R6's external path, we're still advertising two paths to the route reflector: R6's (as best), and R7's as a backup.

Here is the relevant config:
R4:
router bgp 200
bgp additional-paths select best 2
bgp additional-paths send receive
bgp additional-paths install
neighbor 2.2.2.2 advertise additional-paths best 2
neighbor 3.3.3.3 advertise additional-paths best 2
neighbor 5.5.5.5 advertise additional-paths best 2
neighbor 6.6.6.6 advertise additional-paths best 2
neighbor 7.7.7.7 advertise additional-paths best 2
neighbor 8.8.8.8 advertise additional-paths best 2

R2, R3, & R5:

router bgp 200

bgp additional-paths select best 2

bgp additional-paths receive

bgp additional-paths install

R6 - R8:

router bgp 200

bgp additional-paths select best 2
bgp additional-paths send receive
bgp additional-paths install

neighbor 4.4.4.4 advertise additional-paths best 2

Remember not to use bgp additional-paths select backup - that command is for diverse-path or for local (non-advertised) selection of a backup route. You're trying to create a backup path, but that's still the wrong command.

So we used a few new commands here:

bgp additional-paths select best 2 - This calculates the best path and 2nd best path and flags them in BGP. This is a non-transitive flag; the neighbors aren't aware of what your flags are.

R4#sh ip bgp 9.9.9.9

BGP routing table entry for 9.9.9.9/32, version 5

Paths: (3 available, best #3, table default)

Additional-path-install

Path advertised to update-groups:

19 20

Refresh Epoch 1

300, (Received from a RR-client)

7.7.7.7 (metric 2) from 7.7.7.7 (7.7.7.7)

Origin IGP, metric 0, localpref 100, valid, internal, backup/repair, best2

rx pathid: 0x1, tx pathid: 0x2

Path not advertised to any peer

Refresh Epoch 1

300, (Received from a RR-client)

8.8.8.8 (metric 2) from 8.8.8.8 (8.8.8.8)

Origin IGP, metric 0, localpref 100, valid, internal

rx pathid: 0x1, tx pathid: 0

Path advertised to update-groups:

19 20

Refresh Epoch 1

300, (Received from a RR-client)

6.6.6.6 (metric 2) from 6.6.6.6 (6.6.6.6)

Origin IGP, metric 0, localpref 150, valid, internal, best

rx pathid: 0x0, tx pathid: 0x0

You see we've flagged "best" and "best2".

bgp additional-paths send receive

Unlike all the fixes we've seen up until now, Add-Path is a negotiated feature. This is why there are so many workarounds for it - to get to full Add-Path you basically have to forklift upgrade your network. On that note, you need to tell your neighbors if you have send, receive, or both send & receive capability. This can be done globally, as we've done here, or per neighbor with:

R2(config-router)#neighbor 4.4.4.4 additional-paths ?
disable Disable additional paths for this neighbor
receive Receive additional paths from neighbors
send Send additional paths to this neighbor

Note per-neighbor settings override the global settings.
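
As a rough mental model of the negotiation, consider this Python sketch. It assumes a capability is simply a set containing "send" and/or "receive", and that an unset per-neighbor value is represented as `None`; the function names are hypothetical.

```python
from typing import Optional

def effective(global_caps: set[str], neighbor_caps: Optional[set[str]]) -> set[str]:
    """Per-neighbor settings, when present, override the global ones."""
    return global_caps if neighbor_caps is None else neighbor_caps

def can_send_extra_paths(sender_caps: set[str], receiver_caps: set[str]) -> bool:
    """Extra paths flow one way only if the sender sends and the receiver receives."""
    return "send" in sender_caps and "receive" in receiver_caps

r4 = effective({"send", "receive"}, None)  # R4: global send receive
r2 = effective({"receive"}, None)          # R2: global receive only
print(can_send_extra_paths(r4, r2))        # direction R4 -> R2
print(can_send_extra_paths(r2, r4))        # direction R2 -> R4
```

An empty set for `neighbor_caps` would model the per-neighbor disable option.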

bgp additional-paths install

You can select additional-paths and pass them to neighbors without installing them in your RIB or FIB. This command should be on any device requiring PIC Edge, but if your route reflector isn't in the forwarding path, you may be able to omit it.

neighbor X.X.X.X advertise additional-paths best 2

Even if you've negotiated the Add-Path capability with your neighbor, you still need to tell the BGP process to advertise all of, or a subset of, your calculated best paths. The way it does this is via the tag system I described above. An important element of this is that the tagging system is not mutually exclusive. Let's say there are 4 paths with different next-hops. You could select "all" and "best 3", and the best 3 would be flagged with "best" and "all", and the 4th path would only be flagged with "all". We'll show an example of this below.
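
A small Python sketch of the tag idea, under simplifying assumptions: paths are already ranked best-first, tags are plain strings, the bestpath is always sent, and the fourth next hop (10.10.10.10) is invented for the example. Real IOS output may tag the bestpath slightly differently.

```python
def flag_paths(ranked: list[str], best_n: int = 0,
               select_all: bool = False) -> dict[str, set[str]]:
    """Tag each path according to the configured 'select' policies."""
    flags = {}
    for rank, next_hop in enumerate(ranked, start=1):
        tags = set()
        if rank <= best_n:
            tags.add("best" if rank == 1 else f"best{rank}")
        if select_all:
            tags.add("all")
        flags[next_hop] = tags
    return flags

def advertise_all(ranked: list[str], flags: dict[str, set[str]]) -> list[str]:
    """'advertise ... all' sends the bestpath plus any path tagged 'all'."""
    return [nh for i, nh in enumerate(ranked) if i == 0 or "all" in flags[nh]]

ranked = ["6.6.6.6", "7.7.7.7", "8.8.8.8", "10.10.10.10"]
both = flag_paths(ranked, best_n=3, select_all=True)
print(both["7.7.7.7"], both["10.10.10.10"])
# Selecting only "best 3" leaves nothing tagged "all" to advertise.
print(advertise_all(ranked, flag_paths(ranked, best_n=3)))
```

The point is that the advertise policy matches on tags, so a tag nobody selected matches nothing beyond the bestpath.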

Let's see the output from this.

R4#sh ip bgp | b Network
Network Next Hop Metric LocPrf Weight Path
*>i 1.1.1.1/32 2.2.2.2 0 100 0 100 i
*bia 3.3.3.3 0 100 0 100 i
*bia9.9.9.9/32 7.7.7.7 0 100 0 300 i
* i 8.8.8.8 0 100 0 300 i
*>i 6.6.6.6 0 150 0 300 i

We see two paths for 1.1.1.1, and three paths for 9.9.9.9.

Two are flagged with "b" for backup - this is a side-effect of using the bgp additional-paths install.

"a" is the flag for additional-paths.

You'd need to do a sh ip bgp 9.9.9.9 to see the "best", "best2", etc flags, which I am omitting for brevity - there's already a sample further above.

R4#sh ip cef 9.9.9.9 det

9.9.9.9/32, epoch 2, flags rib only nolabel, rib defined all labels

recursive via 6.6.6.6

nexthop 192.168.46.6 GigabitEthernet1.46

recursive via 7.7.7.7, repair

nexthop 192.168.47.6 GigabitEthernet1.47

We can see the repair path in the FIB.

On R2:

R2(config-router)#do sh ip bgp 9.9.9.9

BGP routing table entry for 9.9.9.9/32, version 3

Paths: (2 available, best #2, table default)

Additional-path-install

Path not advertised to any peer

Refresh Epoch 1

300

7.7.7.7 (metric 3) from 4.4.4.4 (4.4.4.4)

Origin IGP, metric 0, localpref 100, valid, internal, backup/repair, best2

Originator: 7.7.7.7, Cluster list: 4.4.4.4

rx pathid: 0x2, tx pathid: 0x1

Path advertised to update-groups:

Refresh Epoch 1

300

6.6.6.6 (metric 3) from 4.4.4.4 (4.4.4.4)

Origin IGP, metric 0, localpref 150, valid, internal, best

Originator: 6.6.6.6, Cluster list: 4.4.4.4

rx pathid: 0x0, tx pathid: 0x0

We see a best and best2 flag. It's important to note again that this is not learned from the route reflector, it's locally decided and set by the local bgp additional-paths select best 2 on R2. As mentioned above, I decided to use add-path from the edge BGP devices back towards the route-reflector to avoid the problem of the single-best-path replacing all the secondaries during convergence.

Another important note is the pathid. Add-Path's trickery doesn't make a fundamental change to BGP - it still only passes one best, unique path - it just makes each additional path unique by adding a unique pathid. Note the rx pathids of 0x0 and 0x2 above, which match the tx pathids R4 assigned. Think of these as similar to Route Distinguishers in VPNv4, making two otherwise identical routes unique.
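
One way to picture the pathid is as part of the lookup key. This Python sketch uses plain dictionaries as a stand-in for a BGP table, which is an illustration and not how IOS stores routes:

```python
# Without Add-Path the key is the prefix alone, so a second
# advertisement for the same prefix replaces the first one.
classic = {}
classic["9.9.9.9/32"] = "6.6.6.6"
classic["9.9.9.9/32"] = "7.7.7.7"  # implicit withdraw of the earlier path

# With Add-Path the key is (prefix, pathid), so both paths survive.
add_path = {}
add_path[("9.9.9.9/32", 0x0)] = "6.6.6.6"
add_path[("9.9.9.9/32", 0x2)] = "7.7.7.7"

print(len(classic), len(add_path))
```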

R2#sh ip cef 9.9.9.9 det

9.9.9.9/32, epoch 2, flags rib only nolabel, rib defined all labels

recursive via 6.6.6.6

nexthop 192.168.24.4 GigabitEthernet1.24 label 23

nexthop 192.168.25.5 GigabitEthernet1.25 label 18

recursive via 7.7.7.7, repair

nexthop 192.168.24.4 GigabitEthernet1.24 label 16

nexthop 192.168.25.5 GigabitEthernet1.25 label 20

And there's PIC Edge and Add-Path in action on R2.

I'm going to quickly cover the rest of the simpler Add-Path options.
Just to recap, the route-reflector has chosen two best paths so far:

R4#sh ip bgp 9.9.9.9 | s from
300, (Received from a RR-client)
7.7.7.7 (metric 2) from 7.7.7.7 (7.7.7.7)
Origin IGP, metric 0, localpref 100, valid, internal, backup/repair, best2
rx pathid: 0x1, tx pathid: 0x2
300, (Received from a RR-client)
8.8.8.8 (metric 2) from 8.8.8.8 (8.8.8.8)
Origin IGP, metric 0, localpref 100, valid, internal
rx pathid: 0x1, tx pathid: 0
300, (Received from a RR-client)
6.6.6.6 (metric 2) from 6.6.6.6 (6.6.6.6)
Origin IGP, metric 0, localpref 150, valid, internal, best
rx pathid: 0x0, tx pathid: 0x0

router bgp 200

bgp additional-paths select best 3

R4#sh ip bgp 9.9.9.9 | s from

300, (Received from a RR-client)

7.7.7.7 (metric 2) from 7.7.7.7 (7.7.7.7)

Origin IGP, metric 0, localpref 100, valid, internal, backup/repair, best2

rx pathid: 0x1, tx pathid: 0x2

300, (Received from a RR-client)

8.8.8.8 (metric 2) from 8.8.8.8 (8.8.8.8)

Origin IGP, metric 0, localpref 100, valid, internal, best3

rx pathid: 0x1, tx pathid: 0x1

300, (Received from a RR-client)

6.6.6.6 (metric 2) from 6.6.6.6 (6.6.6.6)

Origin IGP, metric 0, localpref 150, valid, internal, best

rx pathid: 0x0, tx pathid: 0x0

Note we've added a pathid and "best3" to the remaining path. We'd be able to send those to neighbors if we wanted. With this config we're choosing 3 but sending 2.

I found this option confusing initially:

R4(config-router)#no neighbor 2.2.2.2 advertise additional-paths best 2
R4(config-router)#neighbor 2.2.2.2 advertise additional-paths all
% BGP: AF level 'bgp additional-paths select' more restrictive than advertising policy. This is a reminder that AF level additional-path select commands are needed.

The way I originally read this was, I've selected 3 best paths, and I want to send all 3 of them to my neighbor -- this is incorrect. Remember this is a flag system. All is a flag. None of our BGP prefixes are flagged with All, so we just broke Add-Path:

R4(config-router-af)#do sh ip bgp neigh 2.2.2.2 adv | b Network
Network Next Hop Metric LocPrf Weight Path
*>i 9.9.9.9/32 6.6.6.6 0 150 0 300 i

Let's fix it.
All is meant to simulate full-mesh iBGP with a route-reflector - if all routers use it, you'll get a similar outcome to all the routers being peered together.

R4(config-router)#bgp additional-paths select all

R4(config-router)#do sh ip bgp 9.9.9.9 | s from
300, (Received from a RR-client)
7.7.7.7 (metric 2) from 7.7.7.7 (7.7.7.7)
Origin IGP, metric 0, localpref 100, valid, internal, backup/repair, best2, all
rx pathid: 0x1, tx pathid: 0x1
300, (Received from a RR-client)
8.8.8.8 (metric 2) from 8.8.8.8 (8.8.8.8)
Origin IGP, metric 0, localpref 100, valid, internal, best3, all
rx pathid: 0x1, tx pathid: 0x2
300, (Received from a RR-client)
6.6.6.6 (metric 2) from 6.6.6.6 (6.6.6.6)
Origin IGP, metric 0, localpref 150, valid, internal, best
rx pathid: 0x0, tx pathid: 0x0

OK, now we're flagged with both All and Best simultaneously. As mentioned above, the select system is not mutually exclusive:

R4#sh run | i select

bgp additional-paths select all best 3

R2#sh ip bgp | b 9.9.9.9

*bia9.9.9.9/32 7.7.7.7 0 100 0 300 i

* i 8.8.8.8 0 100 0 300 i

*>i 6.6.6.6 0 150 0 300 i

There are a few options you can potentially pick under "select":

R4(config-router)#bgp additional-paths select ?

all Select all available paths

backup Select backup path

best Select best N paths

best-external Select best-external path

group-best Select group-best path

All, we just covered.

Backup is for diverse-path

Best, we've covered

Best-External is a feature that permits best-external selection on a route reflector. The use case for this is complicated and is out of scope for this document.

Group-Best is also very complicated.

Let's discuss group-best at a very high level.

BGP, under normal circumstances, can potentially end up in a scenario where it never converges - it never stabilizes. This is called BGP MED oscillation. Explaining this is beyond the scope of this document, however, this blog covers it well: http://ccieblog.co.uk/bgp/bgp-deterministic-med

BGP Deterministic MED can solve this problem.

However, this problem gets additionally complex with Add-Path. Group-Best solves these problems.

This document covers this feature: http://inl.info.ucl.ac.be/system/files/add-paths-jsac.pdf

Route-Maps can additionally be used with Add-Path.

R3(config-route-map)#match additional-paths advertise-set ?

all BGP Add-Path advertise all paths

best BGP Add-Path advertise best n paths

best-range BGP Add-Path advertise best paths (range m to n)

group-best BGP Add-Path advertise group-best path

The two use cases I've seen for the route maps are:

- Setting the egress MED

- Selecting specific routes with the "best" flag to advertise

For example, if you wanted to only advertise the 1st best and 3rd best routes:

R4:

route-map block2ndbest deny 10

match additional-paths advertise-set best-range 2 2 ! matches the "range" of 2 through 2

route-map block2ndbest permit 20

Before:

R2#sh ip bgp | b 9.9.9.9

*bia9.9.9.9/32 7.7.7.7 0 100 0 300 i

* i 8.8.8.8 0 100 0 300 i

*>i 6.6.6.6 0 150 0 300 i

R4(config)#router bgp 200

R4(config-router)#neighbor 2.2.2.2 route-map block2ndbest out

R4(config-router)#do clear ip bgp * soft out

After:

R2#sh ip bgp | b 9.9.9.9

*bia9.9.9.9/32 8.8.8.8 0 100 0 300 i

*>i 6.6.6.6 0 150 0 300 i
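
The effect of that route-map can be sketched in Python as a filter on path rank. This assumes ranks start at 1 for the bestpath and that best-range is inclusive at both ends; `apply_route_map` is a hypothetical helper.

```python
def apply_route_map(ranked_paths: list[str], deny_range: tuple[int, int]) -> list[str]:
    """Drop paths whose rank falls inside deny_range; permit everything else."""
    low, high = deny_range
    return [next_hop for rank, next_hop in enumerate(ranked_paths, start=1)
            if not (low <= rank <= high)]

ranked = ["6.6.6.6", "7.7.7.7", "8.8.8.8"]  # best, best2, best3
print(apply_route_map(ranked, (2, 2)))
```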

As I mentioned the MED can be modified on a per-bestpath basis as well, but only from edge BGP device -> RR or edge BGP device -> edge BGP device. Route reflectors are not permitted to set MED.

Hope you enjoyed,

Jeff

VTP v3

2014-07-08T21:39:00.003-07:00

VTP v3 isn't technically a "new addition" to the CCIE lab, but code versions prohibited it from being used up until recently. I've been told IOL does in fact support VTP v3, so it should be considered a viable lab topic now.

So what's new in VTP v3? In no particular order:

- Supports extended VLANs (1006 - 4094)
- Support for propagating Private VLANs
- Support for propagating Multiple Spanning Tree
- Support for flagging VLANs as RSPAN (disables MAC learning on the VLAN)
- Fixes the bane of VTP v1/2, the accidental-high-configuration-revision-wipes-out-your-network issue.
- VTP can now be turned off completely, as opposed to just transparent mode
- Support for hidden passwords

My lab is two 3560s running 15.0(2)SE6, and one 3550 running 12.x code. The 3560s support VTP v3; the 3550 is only there to show backwards compatibility with v2.

My switches are plugged in in a row:

SW1 3560 --> SW2 3560 --> SW3 3550

Let's enable v3 on SW1 and SW2:

SW1(config)#vtp version 3
Cannot set the version to 3 because domain name is not configured

You can't enable v3 without specifying a domain. Previous versions of VTP just inherited the domain name from their neighbors if you didn't specify one. This ties in with the security measures: we don't necessarily want to participate in the neighbor's VTP process, so v3 doesn't make assumptions.

SW1(config)#vtp domain CCIE
Changing VTP domain name from NULL to CCIE
*Mar 1 00:04:08.638: %SW_VLAN-6-VTP_DOMAIN_NAME_CHG: VTP domain name changed to CCIE.
SW1(config)#vtp version 3
*Mar 1 00:04:12.908: %SW_VLAN-6-OLD_CONFIG_FILE_READ: Old version 2 VLAN configuration file detected and read OK. Version 3 files will be written in the future.

I've done the same on SW2.

Let's try adding some VLANs.

SW1(config)#vtp mode server
Setting device to VTP Server mode for VLANS.
SW1(config)#vlan 100
VTP VLAN configuration not allowed when device is not the primary server for vlan database.

Let's stop here and talk about a huge problem with previous versions of VTP. As a network consultant, I always recommend - especially prior to version 3 - that customers use VTP mode transparent. The problem is that VTP devices - VTP clients included - can have their VLANs removed or changed while not connected to the mothership, and inadvertently end up with a higher configuration revision. When that switch is introduced, or reintroduced, to the greater network, the higher configuration revision "wins", and the rest of the network replicates that VLAN database, erasing their own VLANs. This can be so dramatic that the entire network can end up with just VLAN 1, and the entire layer 2 domain goes down. This is a very easy problem to create, and causes a dramatic outage.
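
The failure mode is easy to see in a toy Python model. The revision numbers and VLAN lists here are invented for illustration, and `receive_update` only captures the "higher revision wins" rule described above.

```python
def receive_update(local: dict, update: dict) -> dict:
    """VTP v1/v2 rule as described above: the higher revision always wins."""
    return update if update["revision"] > local["revision"] else local

network = {"revision": 12, "vlans": {1, 10, 20, 30}}
# A switch whose VLANs were edited many times while disconnected.
rogue = {"revision": 57, "vlans": {1}}

network = receive_update(network, rogue)
print(network["vlans"])
```

Nothing in that rule asks who sent the update, which is the gap VTPv3 closes.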

VTPv3 can no longer create this issue.

VTP clients and secondary servers cannot write the VLAN database. What's a secondary server? Well, it's any server that isn't the primary! (sorry, couldn't resist).

There can only be one primary server. The primary server is the only server allowed to write the VLAN database:

SW1#vtp primary vlan

This system is becoming primary server for feature vlan

No conflicting VTP3 devices found.

Do you want to continue? [confirm]

SW1#

*Mar 1 00:30:19.564: %SW_VLAN-4-VTP_PRIMARY_SERVER_CHG: 0014.1ceb.f600 has become the primary server for the VLAN VTP feature

SW1 is now the only device that can make changes to the contiguous v3 PVST VLAN database. Note the command vtp primary vlan is in privileged EXEC mode and is not saved to the config - if you reboot you lose this privilege. This completely eliminates the possibility of having a plug-and-play way of accidentally overwriting another network's VTP database.
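
A toy Python model of the primary-server rule, assuming each switch simply remembers the device ID of the current primary. The class and function names are made up; the device IDs are the ones from this lab.

```python
class Vtp3Switch:
    """Toy model: only the current primary server may write the VLAN database."""

    def __init__(self, device_id: str):
        self.device_id = device_id
        self.primary_id = None  # nobody is primary until someone claims it
        self.vlans = {1}

    def create_vlan(self, vlan_id: int) -> bool:
        if self.primary_id != self.device_id:
            return False        # secondary servers and clients are refused
        self.vlans.add(vlan_id)
        return True

def claim_primary(claimant: Vtp3Switch, domain: list[Vtp3Switch]) -> None:
    # Models 'vtp primary vlan': every switch records the new primary's ID.
    for switch in domain:
        switch.primary_id = claimant.device_id

sw1, sw2 = Vtp3Switch("0014.1ceb.f600"), Vtp3Switch("0014.1cec.0280")
before = sw1.create_vlan(100)   # refused: SW1 is not primary yet
claim_primary(sw1, [sw1, sw2])
print(before, sw1.create_vlan(100), sw2.create_vlan(200))
```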

SW1#conf t

Enter configuration commands, one per line. End with CNTL/Z.

SW1(config)#vlan 100

SW1(config-vlan)#exit

SW2#show vlan | i 100

100 VLAN0100 active

I'd like to spend a moment looking at the primary server takeover process in a little more detail.

As I mentioned, only one server can be primary.

So if we do this from SW2:

SW2#vtp primary vlan

This system is becoming primary server for feature vlan

No conflicting VTP3 devices found.

Do you want to continue? [confirm]

SW2#

*Mar 1 00:52:46.632: %SW_VLAN-4-VTP_PRIMARY_SERVER_CHG: 0014.1cec.0280 has become the primary server for the VLAN VTP feature

So I'm confused by "No conflicting VTP3 devices found." I'm not sure what a conflicting server would be if not the existing primary server, but my switches always produce this output, so maybe it's a version/platform issue.

Anyway, if you look at SW1:

SW1(config)#

*Mar 1 00:53:36.963: %SW_VLAN-4-VTP_PRIMARY_SERVER_CHG: 0014.1cec.0280 has become the primary server for the VLAN VTP feature

SW1#show vtp status

VTP Version capable : 1 to 3

VTP version running : 3

VTP Domain Name : CCIE

VTP Pruning Mode : Disabled

VTP Traps Generation : Disabled

Device ID : 0014.1ceb.f600

Feature VLAN:

--------------

VTP Operating Mode : Server

Number of existing VLANs : 6

Number of existing extended VLANs : 0

Maximum VLANs supported locally : 1005

Configuration Revision : 2

Primary ID : 0014.1cec.0280

Primary Description : SW2

MD5 digest : 0x73 0x33 0x29 0x15 0x3B 0xA7 0x29 0x04

0x74 0x34 0x70 0x4F 0x58 0x74 0xAF 0x5E

Feature MST:

--------------

VTP Operating Mode : Transparent

Feature UNKNOWN:

--------------

VTP Operating Mode : Transparent

Note the Device ID - 0014.1ceb.f600, and then the Primary ID of Feature VLAN - 0014.1cec.0280 (SW2's ID). SW2 just stole it from SW1.

It's hard to show in the blog, but the process of becoming primary actually takes a little bit. There's a quicker way to steal it, which "doesn't check for conflicting devices" (not that I can seem to find conflicting devices anyway):

SW1#vtp primary vlan force

This system is becoming primary server for feature vlan

SW1#

*Mar 1 00:59:32.112: %SW_VLAN-4-VTP_PRIMARY_SERVER_CHG: 0014.1ceb.f600 has become the primary server for the VLAN VTP feature

While I'm on the topic, you can actually see your VTP neighbors now (provided they're running v3):

SW1#show vtp device

Retrieving information from the VTP domain. Waiting for 5 seconds.

VTP Feature Conf Revision Primary Server Device ID Device Description

------------ ---- -------- -------------- -------------- ----------------------

VLAN No 2 0014.1ceb.f600 0014.1cec.0280 SW2

Let's try to add some high-number VLANs now.

On the primary:

SW1(config)#vlan 1006

SW1(config-vlan)#exit

On the secondary, verifying:

SW2#show vlan | i 1006

1006 VLAN1006 active

Let's make a routed port on the secondary (an SVI would work as well):

SW2(config)#int fa0/1

SW2(config-if)#no switchport

SW2(config-if)#ip address 192.168.0.1 255.255.255.0

Adding another VLAN on the primary:

SW1(config)#vlan 1007

SW1(config-vlan)#exit

And verifying on the secondary:

SW2(config-if)#

*Mar 1 01:04:07.384: %PM-4-EXT_VLAN_INUSE: VLAN 1007 currently in use by FastEthernet0/1

*Mar 1 01:04:07.384: %SW_VLAN-4-VLAN_CREATE_FAIL: Failed to create VLANs 1007: VLAN(s) not available in Port Manager

Well that didn't work.

Whenever you create a routed interface (SVI or interface-based) on a Catalyst switch, it assigns a VLAN between the routed interface and the control plane (best I can tell, that's what's happening...). I've heard an argument that this behavior has to do with allocating the BIA MAC addresses to routed interfaces, but if you look around, there are some Catalyst switches that just assign the same MAC to every routed interface by default, yet they still all use separate VLAN numbers, so I'm not inclined to believe that.

Anyway, the default behavior on my 3560s is to allocate these VLANs from 1006 and counting upwards, so if 1006 is in use, 1007 will be grabbed, which is what we just saw.

vlan internal allocation policy ascending

You've seen this command if you've used a 3K series switch before, because you can't turn it off.

SW1#sh run | i ascen

vlan internal allocation policy ascending

SW1#conf t

Enter configuration commands, one per line. End with CNTL/Z.

SW1(config)#no vlan internal allocation policy ascending

SW1(config)#do sh run | i ascend

vlan internal allocation policy ascending

SW1(config)#vlan internal allocation policy ?

ascending Allocate internal VLAN in ascending order

SW1(config)#vlan internal allocation policy ascending ?

<cr>

SW1(config)#vlan internal allocation policy ascending

So, tough luck, they're ascending!

Some higher-end platforms support descending. I don't have one sitting around to lab, but I'm told descending starts at 4094 and counts downward instead.
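
Both policies boil down to "take the first free extended VLAN from one end of the range". A minimal Python sketch, assuming the extended range is 1006 to 4094 and using a hypothetical `allocate_internal_vlan` helper:

```python
def allocate_internal_vlan(in_use: set[int], policy: str = "ascending") -> int:
    """Pick the next free extended VLAN for a routed port."""
    if policy == "ascending":
        candidates = range(1006, 4095)       # 1006 up to 4094
    else:
        candidates = range(4094, 1005, -1)   # 4094 down to 1006
    for vlan in candidates:
        if vlan not in in_use:
            return vlan
    raise RuntimeError("no free extended VLANs")

print(allocate_internal_vlan({1006}))
print(allocate_internal_vlan({1006}, policy="descending"))
```

With VLAN 1006 already defined, ascending lands on 1007, which is exactly the collision seen above when VLAN 1007 was then created on the primary.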

Back to reality on my 3560s...

SW2#show vlan internal usage

VLAN Usage

---- --------------------

1007 FastEthernet0/1

Shutting down Fa0/1 releases VLAN 1007, but we're now out-of-sync. I let this sit a while and it never caught up, so I'm guessing you're just out of luck until the primary server pushes another update.

SW2(config)#int fa0/1

SW2(config-if)#shut

SW1(config)#no vlan 1007

SW1(config)#vlan 1007

SW1(config-vlan)#exit

SW2(config-if)#do show vlan | i 1007

1007 VLAN1007 active

So we've covered the primary VTP server. What exactly is the difference between a secondary VTP server and a VTP client?

Well, on the 3560, not much.

According to the VTPv3 documentation:

"Client: A device using a local temporary storage space (for example, DRAM) to hold via VTP received information for runtime use. This information is used to update other devices, such as a device that is working as a server. Local configuration of devices in the client role is not possible. After booting, a client device issues a VTP message asking for the configuration of other VTP devices."

This implies that the VTP secondary server saves its database to flash and the client doesn't store it at all. And on the 3560? My VTP clients (and secondary servers) store the full VTP database and will load it up every time unless I manually delete it. So in practice, on this equipment, I can't actually find a difference between VTP secondary servers and VTP clients.

I'm told a best practice is to demote the primary vtp server when you're done making changes. The method to accomplish that is not particularly clear, but here you have it:

SW1(config)#vtp mode client

Setting device to VTP Client mode for VLANS.

SW1(config)#vtp mode server

Setting device to VTP Server mode for VLANS.

Now you're a secondary server.

Let's take a look at the new password features. The old, plain-text password still works the same way. The hidden ones, however:

SW1(config)#vtp password CANTSEEME hidden

Setting device VTP password

SW1#show vtp password

VTP Password: 80B0218C160CD951A38982EECCC22AD5

There's apparently no way to recover it, even snooping through the vlan.dat file, so if you need to add a switch without giving the password out:

SW2(config)#vtp password 80B0218C160CD951A38982EECCC22AD5 secret

Setting device VTP password

Of note, the password is required (the unencrypted one) in order to promote a server to primary:

SW1#vtp primary vlan

This system is becoming primary server for feature vlan

Enter VTP Password:

How about interoperability with previous versions?

Previous version switches will promote themselves from v1 to v2 if connected to a v3 device:

Prior to talking to SW2:

SW3#show vtp status

VTP Version : running VTP1 (VTP2 capable)

Configuration Revision : 0

Maximum VLANs supported locally : 1005

Number of existing VLANs : 5

VTP Operating Mode : Server

VTP Domain Name :

VTP Pruning Mode : Disabled

VTP V2 Mode : Disabled

VTP Traps Generation : Disabled

MD5 digest : 0x57 0xCD 0x40 0x65 0x63 0x59 0x47 0xBD

Configuration last modified by 0.0.0.0 at 0-0-00 00:00:00

Local updater ID is 0.0.0.0 (no valid interface found)

After talking to SW2:

SW3#show vtp status

VTP Version : running VTP2

Configuration Revision : 1

Maximum VLANs supported locally : 1005

Number of existing VLANs : 7

VTP Operating Mode : Server

VTP Domain Name : CCIE

VTP Pruning Mode : Disabled

VTP V2 Mode : Enabled

VTP Traps Generation : Disabled

MD5 digest : 0x60 0xEC 0xC1 0xEF 0xEF 0xE3 0x24 0xB6

Configuration last modified by 0.0.0.0 at 3-1-93 00:00:58

Local updater ID is 0.0.0.0 (no valid interface found)

Note the automatic V2 change. I actually had a heck of a time getting this working when I re-labbed it for this document, and that's because I had left a hidden password on SW1 and SW2. VTPv1/2 aren't going to speak hidden password, turn that off first!

The important thing to grasp about the v2 compatibility is that this must be a one-way path: the v3 network needs to make the database changes. It's best to keep the entire v2 domain as clients. If you make changes in v2, the v3 devices will not accept the changes, but the v2 domain will up its configuration revision number. Then when v3 pushes a legitimate update, the v2 domain will reject it because its revision number will be no higher than the v2 domain's own. You end up with a segmented VTP domain, and a royal mess in the v2 network:

*Mar 1 06:17:23.646: %SW_VLAN-4-VTP_USER_NOTIFICATION: VTP protocol user notification: MD5 digest checksum mismatch on receipt of equal revision summary on trunk: Fa0/21

*Mar 1 06:17:23.650: %SW_VLAN-4-VTP_USER_NOTIFICATION: VTP protocol user notification: MD5 digest checksum mismatch on receipt of equal revision summary on trunk: Fa0/22
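
The revision arithmetic behind that mess can be sketched in Python. The starting revision of 3 is arbitrary, and `v2_accepts` is a simplified stand-in for the v2 acceptance rule, not the actual protocol logic.

```python
def v2_accepts(v2_revision: int, update_revision: int) -> bool:
    """A v2 switch only takes an update with a higher revision than its own."""
    return update_revision > v2_revision

v3_revision = 3
v2_revision = 3    # both sides start out in sync

v2_revision += 1   # a VLAN is changed on a v2 server; v3 ignores it
v3_revision += 1   # the v3 primary then makes a legitimate change

print(v2_revision, v3_revision, v2_accepts(v2_revision, v3_revision))
```

Both sides now sit at the same revision with different databases, which lines up with the "equal revision" digest mismatch in the log messages above.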

Disabling VTP: Why?

Most of us have been happy to use transparent for years. The big difference with disabling VTP as opposed to using transparent mode is that the switch won't even pass VTP messages in "off" mode, it deliberately filters them. The benefit would be for a network administrative boundary, like connecting trunks between two carriers.

Globally:

SW2(config)#vtp mode off

Setting device to VTP Off mode for VLANS.

Per interface:

SW2(config)#int fa0/21

SW2(config-if)#no vtp

The Private VLAN support sounds daunting, but it really does a very simple task. All it does is carry the VLAN associations; it's not assigning interface or trunk configs anywhere.

SW1(config-vlan)#vlan 601

SW1(config-vlan)# private-vlan isolated

SW1(config-vlan)#

SW1(config-vlan)#vlan 600

SW1(config-vlan)# private-vlan primary

SW1(config-vlan)# private-vlan association 601

It doesn't matter where I trunk this, or what ports are applied in what fashion. This is basically all we're replicating:

SW1#show vlan private-vlan

Primary Secondary Type Ports

------- --------- ----------------- ------------------------------------------

600 601 isolated

SW2#show vlan private-vlan

Primary Secondary Type Ports

------- --------- ----------------- ------------------------------------------

600 601 isolated Fa0/6

Of note, and this will make more sense as we move into MST, the private-vlan feature is an add-on to the "VLAN" feature, which you may have noticed inside the show vtp status output:

SW1#show vtp status

VTP Version capable : 1 to 3

VTP version running : 3

VTP Domain Name : CCIE

VTP Pruning Mode : Disabled

VTP Traps Generation : Disabled

Device ID : 0014.1ceb.f600

Feature VLAN:

--------------

VTP Operating Mode : Primary Server

Number of existing VLANs : 9

Number of existing extended VLANs : 2

Maximum VLANs supported locally : 1005

Configuration Revision : 3

Primary ID : 0014.1ceb.f600

Primary Description : SW1

MD5 digest : 0xBE 0x75 0xED 0x56 0xCB 0xAF 0xF3 0xF6

0x59 0x8D 0x91 0x6C 0x60 0x28 0x55 0xEB

Feature MST:

--------------

VTP Operating Mode : Transparent

Feature UNKNOWN:

--------------

VTP Operating Mode : Transparent

Now let's talk about those last two, Feature MST and Feature UNKNOWN.

The MST config is pretty cool. If you choose to run MST, the big drag is the manual configuration updates, and this fixes all of them.

Just as with feature VLAN, we need to make a primary server for MST. Note, this does not have to be the same switch as the feature VLAN. In fact, I'm leaving SW1 the primary for feature VLAN, and making SW2 the primary for feature MST:

SW2(config)#vtp mode server mst

Setting device to VTP Server mode for MST.

SW2(config)#exit

SW2#vtp primary mst force

This system is becoming primary server for feature mst

SW2#

*Mar 1 00:17:26.453: %SW_VLAN-4-VTP_PRIMARY_SERVER_CHG: 0014.1cec.0280 has become the primary server for the MST VTP feature

Feature VLAN:

--------------

VTP Operating Mode : Server

Number of existing VLANs : 9

Number of existing extended VLANs : 2

Maximum VLANs supported locally : 1005

Configuration Revision : 3

Primary ID : 0014.1ceb.f600

Primary Description : SW1

MD5 digest : 0xBE 0x75 0xED 0x56 0xCB 0xAF 0xF3 0xF6

0x59 0x8D 0x91 0x6C 0x60 0x28 0x55 0xEB

Feature MST:

--------------

VTP Operating Mode : Primary Server

Configuration Revision : 1

Primary ID : 0014.1cec.0280

Primary Description : SW2

SW1(config)#vtp mode client mst

Setting device to VTP Client mode for MST.

SW2(config)#spanning-tree mst config

SW2(config-mst)#instance 1 vlan 1-100

SW2(config-mst)#instance 2 vlan 101-200

SW2(config-mst)#instance 3 vlan 201-300

SW2(config-mst)#name region1

SW2(config-mst)#revision 1

SW2(config-mst)#exit

SW2(config)#spanning-tree mode mst

SW2#show spanning-tree mst config

Name [region1]

Revision 1 Instances configured 4

Instance Vlans mapped

-------- ---------------------------------------------------------------------

0 301-4094

1 1-100

2 101-200

3 201-300

-------------------------------------------------------------------------------

Enable MST on SW1:

SW1(config)#spanning-tree mode mst

And we'll see it has the configuration already:

SW1#show span mst config

Name [region1]

Revision 1 Instances configured 4

Instance Vlans mapped

-------- ---------------------------------------------------------------------

0 301-4094

1 1-100

2 101-200

3 201-300

-------------------------------------------------------------------------------

Another nifty thing is that it actually updates the running config to match:

SW1#show run | s spanning-tree mst

spanning-tree mst configuration

name region1

revision 1

instance 1 vlan 1-100

instance 2 vlan 101-200

instance 3 vlan 201-300

Reverting to PVST for simplicity --

SW1(config)#spanning-tree mode rapid-pvst

SW2(config)#spanning-tree mode rapid-pvst

The Remote SPAN flag is quite simple:

SW1(config)#vlan 150

SW1(config-vlan)#remote-span

SW2#show vlan remote-span

Remote SPAN VLANs

------------------------------------------------------------------------------

150

The purpose here is to tell all the switches in the forwarding path of the remote SPAN not to learn MAC addresses on that VLAN.

Feature UNKNOWN is actually kinda cool too, although I have no way of demonstrating it. VTPv3 is designed to carry different types of databases, so that it can be adapted to other replication tasks in the future. So what's an earlier VTPv3 IOS to do with these new formats it doesn't understand? Forward them or drop them?

SW1(config)#vtp mode off unknown

Setting device to VTP Off mode for unknown instances.

SW1(config)#vtp mode transparent unknown

Setting device to VTP Transparent mode for unknown instances.

You can set them only to off or transparent. Clearly can't be a VTP server for a format you don't understand. off works the same way explained above, drop the traffic; transparent forwards without processing.

And lastly, where's this at in the DocCD?:

Switches ->

3850 ->

Catalyst 3850-12S-E Switch ->

Configuration Guides ->

VLAN Configuration Guide, Cisco IOS XE Release 3SE (Catalyst 3850 Switches) ->

Configuring VTP

v3 is just a section in the larger VTP configuration guide, but everything you need should be there.

Cheers,

Jeff

IPv6 First Hop Security

2014-07-05T17:07:00.002-07:00

IPv6 First Hop Security is a new topic for CCIE v5. It's important to note that at the time of this writing (June/July 2014), IPv6 FH Security is not supported in IOL, so this cannot be on the CLI-based parts of the lab yet, but it can be in diagnostics or the written.

The biggest barrier to understanding IPv6 FH Security is understanding the whole first hop process to begin with. IPv6 changes this dramatically from the IPv4 model. First, we will examine IPv6 First Hop and determine where the security problems are.

First, let's answer a simple question: How do I receive a routable IPv6 address?

ICMP communicates nearly everything regarding IPv6 addressing. It's used for finding a router, building an address (typically), making sure it's unique, finding a DHCP server if necessary, locating other hosts, assigning a default route, etc.

That said, IPv6 has a major chicken-or-the-egg problem. You can't run ICMP effectively without having an IPv6 address already, but how can you have an address before... you have an address?

Enter link-local addressing. Link local addresses all exist within FE80::/10. Typically, and true on Cisco devices, this address is built as FE80::<address based on MAC address>. It can alternatively (as is done in modern Windows OS) be built from FE80::<random address>. The specific format of the address isn't necessary for my explanation, so I will consider that out-of-scope for this document, and it can be found in hundreds of other places.

Since my blog is geared towards the CCIE, and we know we'll be using all Cisco kit, we can assume that the MAC address method is what we'll be using.
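As a rough illustration of the MAC address method (modified EUI-64), here is a minimal Python sketch. The function names are my own and nothing here comes from IOS; the MAC f866.f2de.0ff1 is R2's interface MAC as it shows up in the neighbor table output later in this article.

```python
import ipaddress

def eui64_interface_id(mac: str) -> int:
    """Return the 64-bit modified EUI-64 interface ID for a 48-bit MAC."""
    # Accept Cisco (f866.f2de.0ff1), colon, or dash notation.
    digits = mac.replace(".", "").replace(":", "").replace("-", "")
    raw = bytearray.fromhex(digits)
    if len(raw) != 6:
        raise ValueError("expected a 48-bit MAC address")
    raw[0] ^= 0x02  # flip the universal/local bit of the first byte
    # Insert FF:FE between the OUI and the device-specific half.
    return int.from_bytes(bytes(raw[:3]) + b"\xff\xfe" + bytes(raw[3:]), "big")

def link_local_from_mac(mac: str) -> ipaddress.IPv6Address:
    """Build FE80::<interface ID> from a MAC address."""
    base = int(ipaddress.IPv6Address("fe80::"))
    return ipaddress.IPv6Address(base | eui64_interface_id(mac))

print(link_local_from_mac("f866.f2de.0ff1"))  # fe80::fa66:f2ff:fede:ff1
```

Python prints the address in lowercase, but it is the same FE80::FA66:F2FF:FEDE:FF1 used throughout the examples.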

So we've built our interface a "link local" address of FE80::FA66:F2FF:FEDE:FF1. This link-local is non-routable, and can only be reached on the local segment. We must immediately run a process called DAD (Duplicate Address Detection) to ensure that we're the only one using this address. When we built the link local address, we also joined a multicast group called a "solicited-node multicast", which is a multicast address derived from the low-order 24 bits of our address (so it's nearly unique to our host). DAD sends a Neighbor Solicitation to this solicited-node group, and if anyone else responds (they shouldn't, if the address is unique), then we drop the address and don't use it.
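To make the solicited-node group less abstract, here is a small Python sketch of how that multicast address is derived: the fixed prefix FF02::1:FF00:0/104 plus the low-order 24 bits of the unicast address. The function name is my own.

```python
import ipaddress

# Fixed solicited-node prefix FF02::1:FF00:0/104; the low 24 bits get filled in.
SOLICITED_NODE_BASE = int(ipaddress.IPv6Address("ff02::1:ff00:0"))

def solicited_node(addr: str) -> ipaddress.IPv6Address:
    """Return the solicited-node multicast group for a unicast address."""
    low24 = int(ipaddress.IPv6Address(addr)) & 0xFFFFFF
    return ipaddress.IPv6Address(SOLICITED_NODE_BASE | low24)

print(solicited_node("FE80::FA66:F2FF:FEDE:FF1"))  # ff02::1:ffde:ff1
```

Because only 24 bits are carried over, the link-local and the EUI-64 global address of the same interface land in the same group.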

Assuming the address is unique, our next goal is to come up with a routable address in addition to our link local. Now that we have a link-local address, we can send out a Router Solicitation (RS) and ask for the global prefix we should be on. Let's say our router's IPv6 address is 2001:100::1/64. The router will fire back a Router Advertisement (RA) to our host that its prefix is 2001:100::/64. We'll use the same process we used for link-local earlier to assign our globally routable address. Using the same example above, that address would be 2001:100::FA66:F2FF:FEDE:FF1.

This process is called StateLess Address AutoConfiguration, or SLAAC. It's important to note that the router must send out an RA with a /64 length, or this process doesn't work.
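A minimal sketch of the SLAAC step, assuming the host already has its 64-bit interface ID (here R2's, taken from the link-local address above). The function name is hypothetical, and the /64 check models the requirement just mentioned rather than any particular host implementation.

```python
import ipaddress

def slaac_address(ra_prefix: str, interface_id: int) -> ipaddress.IPv6Address:
    """Combine an RA-advertised /64 with a 64-bit interface ID."""
    net = ipaddress.IPv6Network(ra_prefix)
    if net.prefixlen != 64:
        # The interface ID is 64 bits, so only a /64 leaves exactly enough room.
        raise ValueError("SLAAC needs a /64, got /%d" % net.prefixlen)
    if not 0 <= interface_id < 2 ** 64:
        raise ValueError("interface ID must fit in 64 bits")
    return ipaddress.IPv6Address(int(net.network_address) | interface_id)

R2_INTERFACE_ID = 0xFA66F2FFFEDE0FF1  # low 64 bits of FE80::FA66:F2FF:FEDE:FF1

print(slaac_address("2001:100::/64", R2_INTERFACE_ID))  # 2001:100::fa66:f2ff:fede:ff1
```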

Perhaps more important to our examples, the RA we got the prefix from also hands out the router's IPv6 link local address, which we can optionally use as our default gateway. Just to point out again: this is not a global address. SLAAC's default routing is always done with link-local addresses.

Let's look at the rather minor config for this thus far:

R1 ("router"):

ipv6 unicast-routing

interface Gigabit0/1

ipv6 address 2001:100::1/64

R2 ("host"):

ipv6 unicast-routing ! May or may not be necessary for a "host", platform-dependent.

interface Gigabit0/1

ipv6 address autoconfig default

Omitting the "default" above would prevent it from installing R1 as a default route.

R2(config-if)#do sh ipv6 int br | s GigabitEthernet0/1

GigabitEthernet0/1 [up/up]

FE80::FA66:F2FF:FEDE:FF1

2001:100::FA66:F2FF:FEDE:FF1

R2#sh ipv6 route ::/0

Routing entry for ::/0

Known via "static", distance 2, metric 0

Route count is 1/1, share count 0

Routing paths:

FE80::F2F7:55FF:FE8D:96A2, GigabitEthernet0/1

Last updated 00:00:10 ago

FE80::F2F7:55FF:FE8D:96A2 is R1's link local address.

Cisco routers all announce themselves as viable gateways (imagine that), so if you have a "host" router - perhaps a voice gateway or whatnot - you need to tell it not to send router advertisements:

R2(config)#int gig0/1

R2(config-if)#ipv6 nd ra suppress all

The suppress keyword by itself stops periodic RAs; suppress all also stops responses to RSes.

An optional feature that's not on all platforms:

R2(config-if)#ipv6 address autoconfig prefix

That will make the router insert routes for any other routes on the same segment. So if your neighbor had a second IPv6 address of 13::1, we'd insert a route to it as well.

So that gets us a link-local address, global unicast address, and default gateway. What about DNS?

Up until very recently, the only real option was to run a stateless DHCPv6 server. Stateless because in this scenario, the DHCPv6 server doesn't actually keep track of anything, it just hands out options: DNS, Call Manager info, etc.

R1(config)#ipv6 dhcp pool DHCP-POOL

R1(config-dhcpv6)#dns-server 4::4

R1(config-dhcpv6)#dns-server 8::8

R1(config-dhcpv6)#domain-name ABC.COM

R1(config-dhcpv6)#int gig0/1

R1(config-if)#ipv6 dhcp server DHCP-POOL

R1(config-if)#ipv6 nd other-config-flag

The process on R2 is automatic, but on R1 we create the pool, which is reasonably obvious config, apply it to the interface, and then set the O-flag. This tells clients via RA that they should query a DHCP server for more information. The DHCP server and the device sending the RA do not need to be the same device.

Great! We've got our addresses, default gateway, and DNS. Before we move on to neighbor discovery, let's look at stateful DHCP.

You've got a couple options here.

Either way we need more info in our DHCP server:

R1(config)#ipv6 dhcp pool DHCP-POOL

R1(config-dhcpv6)#address prefix 2001:100::/64

We'll still get our default gateway through RAs, but we can at least track the addresses that our hosts are using.

Our first option is to just recommend to our host that it use stateful DHCP.

R1(config-if)#no ipv6 nd other-config-flag

R1(config-if)#ipv6 nd managed-config-flag

The M-Flag (Managed Flag) is a suggestion to the client that it should use the DHCP server for its host address instead of SLAAC.
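The two RA flags can be summarized as a small decision table. This is a simplified Python sketch with a made-up function name; the flags are only hints, a client is free to ignore them, and the per-prefix autonomous flag is not modeled here.

```python
def client_plan(managed_flag: bool, other_flag: bool) -> dict:
    """Summarize what the RA flags suggest to a client (simplified model)."""
    if managed_flag:
        # M-flag: get the address (and the other options) from DHCPv6.
        return {"address": "stateful DHCPv6", "options": "DHCPv6", "gateway": "RA"}
    if other_flag:
        # O-flag only: SLAAC for the address, DHCPv6 just for DNS and friends.
        return {"address": "SLAAC", "options": "stateless DHCPv6", "gateway": "RA"}
    # Neither flag: plain SLAAC, no DHCPv6 at all.
    return {"address": "SLAAC", "options": None, "gateway": "RA"}

print(client_plan(managed_flag=False, other_flag=True))
```

In every case the default gateway still comes from the RA, which matches what we saw above.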

Since we're using a Cisco client, I need to stop here and advise that I've never been able to get the client to recognize the M-flag. It could be my IOS, I haven't investigated it that much.

Since that's a bust, let's look at the other method.

R2(config-if)#int gig0/1

R2(config-if)#ipv6 address autoconfig default-route ! still need this for the default gateway

R2(config-if)#ipv6 address dhcp

R2(config-if)#ipv6 enable

I'll explain ipv6 enable momentarily.

R2(config-if)#do sh ipv6 int br | s GigabitEthernet0/1

GigabitEthernet0/1 [up/up]

FE80::FA66:F2FF:FEDE:FF1

2001:100::2C11:9212:8690:D8FE

We got our address. (Side-note: if you have ipv6 address dhcp on an interface and not ipv6 address autoconfig default-route, you won't get a connected route to the local subnet - you get a totally unusable /128 host address and that's it. The workaround is to disable ipv6 unicast-routing, then you'll get the connected route)

R1#sh ipv6 dhcp binding

Client: FE80::FA66:F2FF:FEDE:FF1

DUID: 00030001F866F2DE0FF0

Username : unassigned

IA NA: IA ID 0x00040001, T1 43200, T2 69120

Address: 2001:100::2C11:9212:8690:D8FE

preferred lifetime 86400, valid lifetime 172800

expires at Jul 02 2014 01:29 AM (172654 seconds)

and R1 knows about us - it's stateful.

There's also a new option defined by RFC 6106 that completely eliminates the need for DHCPv6 in relation to DNS. Unfortunately, at the time of this writing, you need bleeding-edge IOS-XE in order to use it. In fact, the hardware lab I'm using (which will be explained later) for this doesn't even support it - I have to turn to CSR1000v:

Router(config-if)#ipv6 nd ra dns server ?

X:X:X:X::X IPv6 address

This passes DNS servers as an RA option. I have labbed this previously, it does appear to work from what I can see in Wireshark.

I promised an explanation of ipv6 enable. I bet you've seen this command before, followed by a statically configured address. You actually need this very infrequently. With the exception of the DHCP example I showed above, the only time you need ipv6 enable is if you want an EUI-64 derived link-local address without a global unicast address. If you're statically assigning an IPv6 global unicast, or you're getting one via SLAAC, you don't need this command - so cut it out! :)

Of course your final option is a statically assigned address:

ipv6 address 1::1/64

Link locals can also be statically assigned:

ipv6 address FE80::1 link-local

Now that we're past "how do we get an address", let's move on to the other big topic of "how do we find our neighbors". As you're probably already aware, IPv6 does not use ARP; instead it uses Neighbor Solicitation (NS) and Neighbor Advertisement (NA) ICMPv6 messages. These build the neighbor table, which replaces the ARP cache.

NS and NA map pretty directly in functionality to ARP request and ARP reply in IPv4. I'm not going to go over them in detail, just understanding their function is sufficient for now.

If R2 has just gotten its address and wants to find R3, it would send out an NS.

I've reset R2 to SLAAC and setup R3 for SLAAC as well.

Addresses:

R2: 2001:100::FA66:F2FF:FEDE:FF1

R3: 2001:100::C67D:4FFF:FEF9:5340

R2#ping 2001:100::C67D:4FFF:FEF9:5340

Type escape sequence to abort.

Sending 5, 100-byte ICMP Echos to 2001:100::C67D:4FFF:FEF9:5340, timeout is 2 seconds:

!!!!!

Success rate is 100 percent (5/5), round-trip min/avg/max = 0/3/16 ms

and some filtered output from debug ipv6 icmp:

ICMPv6: Sent N-Solicit, Src=2001:100::FA66:F2FF:FEDE:FF1, Dst=FF02::1:FFF9:5340

ICMPv6: Received N-Advert, Src=2001:100::C67D:4FFF:FEF9:5340, Dst=2001:100::FA66:F2FF:FEDE:FF1

There's our NS going out, and NA coming in. It's important to note that the NS also provided R3 with R2's layer 2 address, so there's no need for the reverse process to happen.

R2#sh ipv6 neighbor

IPv6 Address Age Link-layer Addr State Interface

2001:100::C67D:4FFF:FEF9:5340 0 c47d.4ff9.5340 REACH Gi0/1

R3#sh ipv6 neighbor

IPv6 Address Age Link-layer Addr State Interface

2001:100::FA66:F2FF:FEDE:FF1 0 f866.f2de.0ff1 REACH Gi0/0

As you can see, R2 is aware of R3's addresses and vice-versa.

Before I move on to pointing out the (somewhat obvious) security problems with this whole process, I'd like to pause and look at a few other features that I felt were of important note, but didn't fit into the explanation thus far.

If you have multiple routers on a segment and want to use one or more for failover, you can specify how important their advertisements are:

ipv6 nd router-preference med !default (setting this doesn't show in config)

ipv6 nd router-preference low !depreffed

ipv6 nd router-preference high !preffed

Although, I imagine most of us would just use HSRP.

Also, if you set the RA lifetime to 0, hosts won't use it as a default gateway, but it can still be used for SLAAC: ipv6 nd ra lifetime 0
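Here is a rough Python model of how a host could choose between several advertising routers using the preference and lifetime fields. The class, the function, and the FE80::1/2/3 addresses are all hypothetical; real hosts apply additional tie-breakers (reachability, for one) that this sketch ignores.

```python
from dataclasses import dataclass

PREFERENCE_RANK = {"low": 0, "medium": 1, "high": 2}

@dataclass
class RouterAdvert:
    source: str                  # link-local address of the advertising router
    preference: str = "medium"   # medium is the default preference
    lifetime: int = 1800         # router lifetime in seconds; 0 = not a gateway

def pick_default_router(adverts):
    """Return the RA of the preferred default router, or None."""
    # A lifetime of 0 means "use my prefix for SLAAC, but don't route via me".
    candidates = [ra for ra in adverts if ra.lifetime > 0]
    if not candidates:
        return None
    return max(candidates, key=lambda ra: PREFERENCE_RANK[ra.preference])

adverts = [
    RouterAdvert("FE80::1", preference="low"),
    RouterAdvert("FE80::2", preference="high", lifetime=0),
    RouterAdvert("FE80::3", preference="medium"),
]
print(pick_default_router(adverts).source)  # FE80::3
```

FE80::2 advertises the highest preference but is skipped because its lifetime is 0.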

I've previously run production dual-stack IPv6 at home. It was an interesting experiment, but the overwhelming lesson I learned from it is that most content providers are using separate pipes for their IPv6 traffic. Instead of being wide open and unused like I'd hoped, they were miserably oversubscribed, so when my Windows installation naturally preferred an IPv6 DNS resolution when both IPv6 and IPv4 were available for the same site (Here's looking at you, Youtube!), all I got was slower web traffic. I still have the v6 service on my cablemodem, but I disabled it on the router.

One of the more interesting things I learned from it was the sudden realization of what the heck I do without NAT. Almost all home networks are setup like so on IPv4:

[provider] ---- outside /30 ----> [home router] (NAT) --- private IP range ---> hosts

So if you think about this for a minute... in order to duplicate this on IPv6, which doesn't typically use NAT, you need a routable IPv6 address block on the outside of your home router, and a routable IPv6 address block on the inside of your router too. And since home ISPs don't exactly assign you two blocks of static addresses typically, you're also going to need a way to do this with DHCP or SLAAC.

There's a really clever fix for this in IPv6. It's called prefix delegation. The idea is that our provider will delegate an IPv6 block to our router, which will then in turn function as the DHCP server for that block.

Let's look at some sample config.

ISP router:

ipv6 local pool dhcpv6-pool1 2001:DB8:1200::/40 48

ipv6 dhcp pool test

prefix-delegation pool dhcpv6-pool1 lifetime 1800 600

dns-server 4::4

dns-server 8::8

domain-name isp.net

interface GigabitEthernet1/0

ipv6 address 12::1/64

ipv6 nd other-config-flag

ipv6 dhcp server test

Here we've told the ISP router that it should delegate /48 blocks from a larger /40 block to every DHCP server that asks.
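For a sense of scale, this short Python sketch carves the /40 pool into the /48 blocks it can delegate. The function name is mine, and which block the ISP router actually hands out first is not something this models.

```python
import ipaddress

def delegable_prefixes(pool: str, delegated_length: int):
    """List every block of the given length inside the local pool."""
    return list(ipaddress.IPv6Network(pool).subnets(new_prefix=delegated_length))

blocks = delegable_prefixes("2001:DB8:1200::/40", 48)
print(len(blocks))            # 256
print(blocks[0], blocks[-1])  # 2001:db8:1200::/48 2001:db8:12ff::/48
```

So this pool supports 256 customers, each getting a /48.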

Home router:

ipv6 dhcp pool MY-DNS ! we'll use our DNS servers

dns-server 44::44

dns-server 88::88

domain-name foo.com

interface GigabitEthernet1/0

ipv6 address autoconfig default ! we want a SLAAC address

ipv6 dhcp client pd FOO ! we'll collect prefix delegation for the next interface

interface GigabitEthernet2/0

ipv6 address FOO ::/64 eui-64 ! we'll use a /64 out of whatever we got above

ipv6 nd other-config-flag

ipv6 dhcp server MY-DNS

So, lots going on on the home router. First, we're going to assign our outside interface - Gig1/0 - an address via SLAAC. This gives us the IPv6 equivalent of my "outside /30" from my ASCII diagram above, albeit on a much larger block than an IPv4 /30 :). Next, we're going to create a prefix delegation named FOO, and pick up a prefix from our ISP server, which we already know will be a /48 from my explanation above. We'll then go to our inside interface, Gig2/0, and assign ourselves an EUI-64 address from the prefix delegation FOO. We'll of course be sending RAs, so our internal hosts will also use SLAAC to get an address on the inside, delegated subnet.
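The "FOO ::/64" part can be thought of as OR-ing sub-prefix bits onto whatever prefix was delegated. Below is a minimal Python sketch of that idea. The function name is hypothetical, and 2001:db8:1200::/48 is just one block the ISP pool above could hand out, not a value taken from lab output.

```python
import ipaddress

def apply_general_prefix(delegated: str, sub_prefix: str) -> ipaddress.IPv6Network:
    """Overlay sub-prefix bits (e.g. ::/64) onto a delegated prefix."""
    parent = ipaddress.IPv6Network(delegated)
    sub = ipaddress.IPv6Network(sub_prefix)
    if sub.prefixlen < parent.prefixlen:
        raise ValueError("sub-prefix must be at least as long as the delegation")
    # The sub-prefix may only set bits to the right of the delegated prefix.
    if parent.prefixlen and int(sub.network_address) >> (128 - parent.prefixlen):
        raise ValueError("sub-prefix overlaps the delegated bits")
    combined = int(parent.network_address) | int(sub.network_address)
    return ipaddress.IPv6Network((combined, sub.prefixlen))

print(apply_general_prefix("2001:db8:1200::/48", "::/64"))         # 2001:db8:1200::/64
print(apply_general_prefix("2001:db8:1200::/48", "0:0:0:5::/64"))  # 2001:db8:1200:5::/64
```

A /48 leaves 16 bits of subnet space, so the home router could number up to 65,536 inside /64s this way.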

Our end device/host would have a simple config, just running SLAAC and being a stateless DHCP client. In IOS it'd just be:

interface GigabitEthernet1/0

ipv6 address autoconfig default

Back on the ISP server, you get reverse route injection in the form of static IPv6 routes pointing to whomever you delegated the prefix to. These can be redistributed into your IGP or BGP.

Now let's look at what security faults all these features potentially have.

There are some obvious like-for-like issues that IPv4 had, which carried over to IPv6:

- An attacker could pose as the stateful DHCP server, handing out either bad information for denial of service (DOS), or handing out an attacker's IPv6 address for a man-in-the-middle attack (MiM).

- An attacker could pose as another host. Let's say Host1 is trying to reach Host2, and Host3 is an attacker. Host1 could send out an NS for Host2, and Host3 could send an NA masquerading as Host2, and Host1 may accept it as true if timed correctly. A similar process could be done from Host2 -> Host1, with Host3 inserting itself there as well, effectively inserting itself into the entire conversation (a MiM attack) with Host1 and Host2 none the wiser.

This is where the similarities with IPv4 end. Here are some of the new security concerns:

- A router advertisement could be faked, allowing a host to insert itself into a conversation, for a MiM attack.
- A router could advertise, from one router to another, a prefix that shouldn't be on the local link. This connected route would be seen as closer than a theoretical downstream router that's legitimately advertising the same prefix.

- An attacker could respond to every DAD request from a host, or even from the entire segment, effectively preventing hosts from using their legitimately unique IPv6 addresses, creating a DOS.

- An outside host - not one on your local segment - could send traffic in rapid-fire towards a large swath of your available IPv6 space. Many IPv4 segments were either behind NAT or simply weren't all that big, but IPv6 space is really, really large - a /64 - the common LAN segment size - is 18 quintillion addresses. No last-hop router has enough memory to send NS requests for 18 quintillion addresses without running out of memory or CPU, creating a DOS by worst-case crashing the router, best-case busying the CPU out to the point where new requests aren't serviced.
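To put the "18 quintillion" figure from that last item in perspective, here is a quick back-of-the-envelope calculation in Python. The one-million-packets-per-second rate is an arbitrary number I picked for illustration.

```python
ADDRESSES_IN_SLASH_64 = 2 ** 64

def years_to_touch_every_address(packets_per_second: float) -> float:
    """How long one packet per address would take at a fixed rate."""
    seconds = ADDRESSES_IN_SLASH_64 / packets_per_second
    return seconds / (365 * 24 * 3600)

print(f"{ADDRESSES_IN_SLASH_64:,}")  # 18,446,744,073,709,551,616
print(round(years_to_touch_every_address(1_000_000)))
```

The attacker never needs to finish the sweep, of course. Every unanswered destination costs the last-hop router an NS and an incomplete neighbor entry, and that is where the DOS comes from.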

For me, the most confusing thing about IPv6 FH Security was that there is, as I understand it, an "old" way of doing things and a "new" way of doing things. These overlap a lot, and there's no real discussion of this in the documentation, so figuring out when to use what was confusing.

We're going to start with the "old" way, which is reasonably well-documented and well-blogged on the Internet, but not particularly thorough, and then move on to the "new" way, which seems more complete.

First let's see about tackling invalid RAs.

The simplest, non-automatic way to prevent invalid RAs is to write a PACL for them:

ipv6 access-list IPv6
deny icmp any any router-advertisement
permit ipv6 any any

and simply apply it to any device that shouldn't be sending RAs.

There's a gotcha here, in that there is a well-known exploit by sending fragmented packets. I'm not going to go into this in detail as hundreds have blogged about this already. The workaround is to add one more deny to the PACL:

deny ipv6 any any undetermined-transport

This will make the ACL drop any IPv6 traffic where the router is unable to determine the transport type (no layer 4 information). Some say this makes for better security than the actual RA Guard feature, but debates like that are out of scope for the CCIE, so I'll leave that to others.

This feature is simple and well-documented enough that I'm not going to lab it.

Moving on to features we will be labbing, let's look at RA Guard. It's the automated, more granular version of what we just did with the PACL.

Here's the diagram we'll work off of for the remainder of the article.

As I have in many other blogs, I use GNS3 for diagramming, but I'm using physical gear for the lab. Unfortunately, IPv6 FH Security requires a bleeding-edge IOS or IOS-XE layer 2 device, so labbing it is not so easy. I was lucky enough that some friends were able to lend me a 4948-E running 15.2(1). So, I am running entirely physical gear in this lab, despite what's implied by the diagram above.

This is both good and bad: because I'm using a remote lab, I can't recable on the fly. So there are a few scenarios that I didn't lab out as thoroughly as I would've done normally, but the knowledge gained here should still be more than sufficient for what may appear on the lab.

R1 will represent our valid router.
R2 will (usually) represent our valid host.
R3 will represent our attacker.

The most common way to set this up is to basically un-trust all ports that shouldn't have a router on them, and trust all the ones that should. This is accomplished with this config:

ipv6 nd raguard policy MY_ROUTERS

device-role router

ipv6 nd raguard policy MY_HOSTS

device-role host ! DEFAULT

ipv6 snooping logging packet drop ! NEEDED FOR LOGGING

vlan configuration 123 ! all our hosts are on vlan 123
ipv6 nd raguard attach-policy MY_HOSTS

int gig1/1
ipv6 nd raguard attach-policy MY_ROUTERS

I know the first thing I thought when I saw this was "what the heck is 'vlan configuration XYZ'"?
I always did find it a bit strange back on the Catalyst 3560, when I learned it for CCIE v4, that QoS config would be put on an SVI even though the SVI sometimes had no IP address on it - you'd just create it for the QoS config. I guess someone at Cisco thought the same way; so now there's a specific configuration section for VLANs.

So in short what we accomplished above was to configure every port in vlan 123 to assume a "host" was attached to it - a host should never be sending RAs. We then overrode that configuration on gig1/1 and told the switch to expect a router there - routers are, of course, OK to hear RAs from. Interface configuration always overrides vlan configuration, and that's true for the rest of the features we'll review in this article, so I'm going to assume that's understood from here on in.
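The override rule can be modeled in a few lines of Python. This is only a sketch of the precedence logic as I've described it, with hypothetical names; it is not how the switch implements it internally, and it treats "no policy anywhere" as "RA Guard not enabled".

```python
def effective_role(port, vlan, interface_policies, vlan_policies):
    """Interface policy wins; otherwise fall back to the VLAN policy."""
    if port in interface_policies:
        return interface_policies[port]
    return vlan_policies.get(vlan)  # None means no RA Guard applies at all

def ra_allowed(port, vlan, interface_policies, vlan_policies):
    role = effective_role(port, vlan, interface_policies, vlan_policies)
    # Without any policy RA Guard is off; with one, only "router" may send RAs.
    return role is None or role == "router"

interface_policies = {"Gi1/1": "router"}  # attach-policy MY_ROUTERS
vlan_policies = {123: "host"}             # attach-policy MY_HOSTS

print(ra_allowed("Gi1/1", 123, interface_policies, vlan_policies))  # True
print(ra_allowed("Gi1/3", 123, interface_policies, vlan_policies))  # False
```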

We enabled logging via ipv6 snooping logging packet drop, which will actually turn on the majority of the logging we need for this article. There is one additional command we'll cover later.

I instructed R2 and R3 not to send RAs. Let's see what output we get if I attempt to enable them on R3.

R3-ATTACKER(config-if)#ipv6 nd router-preference high ! Let's attempt to become the preferred router
R3-ATTACKER(config-if)#no ipv6 nd ra suppress all

WS-C4948E(config)#

*Jul 3 18:15:58.734: %SISF-4-PAK_DROP: Message dropped A=FE80::C67D:4FFF:FEF9:5340 G=- V=123 I=Gi1/3 P=NDP::RA Reason=Message unauthorized on port

That's the extreme basics, and I did it in long-hand, so to speak. Here's the shorter way to accomplish the same thing with defaults:

ipv6 nd raguard policy MY_ROUTERS

device-role router

vlan configuration 123
ipv6 nd raguard

int gig1/1
ipv6 nd raguard attach-policy MY_ROUTERS

This is true of just about every filtering command for IPv6 FH Security: if you use just the basic command with no policy, i.e. ipv6 nd raguard, you get the default untrusted configuration. Here, we just said don't trust any ports' RAs except Gig1/1.

Filtering can also be done per-VLAN:

interface GigabitEthernet1/1

switchport trunk allowed vlan 122,123

switchport mode trunk

ipv6 nd raguard attach-policy MY_ROUTERS vlan 123

ipv6 nd raguard vlan 122

This hypothetical config would trust RAs on vlan 123, but not vlan 122.

There are many other options for RA Guard:

ipv6 nd raguard policy SAMPLE

device-role router
On the 4948, I have four options here: router, host, switch, and monitor.
Router and Host we already covered. Switch, I don't understand, and I can't find any documentation explaining what it does. Suffice to say that out-of-the-box it doesn't allow RAs, so if you set it with no other options, you get something similar to "Host". Monitor I found some vague definitions on, but it too has a similar outcome as switch: No RAs.
other-config-flag on
Require the other-config-flag to be set in the RA or the policy will not pass the traffic.
trusted-port
Permits RAs -- I can't find any other benefits to it for RA Guard, although the other frameworks (namely ND Inspection) prefer ports that have trusted-port enabled if they have a conflict.

router-preference maximum
Sets the highest allowed preference on the port. If you want to enforce a backup router to being "low" or "medium" preference, this will drop the packet if anything higher is advertised.
hop-limit minimum
hop-limit maximum
RAs advertise what hosts should use as a TTL. You can use these two functions to control what the minimum and maximum advertised TTLs should be. Note, if you want to test this with all IOS devices, IOS can only send one of two options for this field - the default, which is 64 (this is NOT noted on the CLI anywhere, I figured it out the hard way) - or "unspecified", which basically means "use whatever you want".
If you want to match a specific size, such as 90, you would set minimum and maximum to the same value.
managed-config-flag on ! Ensures the managed config flag ("please get your IPv6 address statefully from DHCPv6") is on, if not, drop the packet.
match ipv6 access-list <ACL>
This will match the link local address that's advertising the RA. If no match on the access-list, drop the packet. Formatting is as such: permit ipv6 host FE80::F2F7:55FF:FE8D:96A1 any
match ra prefix-list <prefix list>
This will match the prefix being advertised in the RA. If no match, drop the packet.

Covering each of these options would make the article drag on and on, so I'm going to give one large configuration and comment on it:

ipv6 access-list LL-EXAMPLE
permit ipv6 host FE80::F2F7:55FF:FE8D:96A1 any

ipv6 prefix-list PREFIX-EXAMPLE permit 2001:100::/64

ipv6 nd raguard policy SAMPLE
device-role router
other-config-flag on
router-preference maximum medium
hop-limit minimum 64
hop-limit maximum 64
match ipv6 access-list LL-EXAMPLE
match ra prefix-list PREFIX-EXAMPLE

int gig1/1
ipv6 nd raguard attach-policy SAMPLE

This sample would allow RAs, require the other-config-flag to be enabled, not allow a router preference higher than medium, ensure a strict TTL advertisement of 64, require the RA to be sourced from FE80::F2F7:55FF:FE8D:96A1, and only permit it to advertise 2001:100::/64.
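To make the combined checks concrete, here is a Python model of the SAMPLE policy. It is only my reading of the options described above, not the switch's actual logic; the class and function names are made up, and requiring every advertised prefix to match is an assumption.

```python
import ipaddress
from dataclasses import dataclass, field

PREFERENCE_RANK = {"low": 0, "medium": 1, "high": 2}

# Values copied from the SAMPLE policy, LL-EXAMPLE and PREFIX-EXAMPLE.
ALLOWED_SOURCE = ipaddress.IPv6Address("FE80::F2F7:55FF:FE8D:96A1")
ALLOWED_PREFIXES = {ipaddress.IPv6Network("2001:100::/64")}
MAX_PREFERENCE = "medium"
HOP_LIMIT_MIN = 64
HOP_LIMIT_MAX = 64

@dataclass
class RouterAdvert:
    source: str
    prefixes: list = field(default_factory=list)
    other_config: bool = False
    preference: str = "medium"
    hop_limit: int = 64

def policy_violations(ra: RouterAdvert) -> list:
    """Return the reasons this RA would be dropped (empty list = permitted)."""
    problems = []
    if not ra.other_config:
        problems.append("other-config-flag is off")
    if PREFERENCE_RANK[ra.preference] > PREFERENCE_RANK[MAX_PREFERENCE]:
        problems.append("router preference too high")
    if not HOP_LIMIT_MIN <= ra.hop_limit <= HOP_LIMIT_MAX:
        problems.append("hop limit out of range")
    if ipaddress.IPv6Address(ra.source) != ALLOWED_SOURCE:
        problems.append("source not permitted by access-list")
    for prefix in ra.prefixes:
        if ipaddress.IPv6Network(prefix) not in ALLOWED_PREFIXES:
            problems.append("prefix %s not permitted by prefix-list" % prefix)
    return problems

good = RouterAdvert("FE80::F2F7:55FF:FE8D:96A1", ["2001:100::/64"], True, "medium", 64)
bad = RouterAdvert("FE80::C67D:4FFF:FEF9:5340", ["2001:101::/64"], False, "high", 64)
print(policy_violations(good))  # []
print(policy_violations(bad))
```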

DHCP Guard is a tad simpler.

We'll setup R1 as a stateful DHCP server, R2 as a DHCP client, and R3 as a malicious stateful DHCP server. I've removed all the RA Guard configuration to decrease the example complexity.

R1-ROUTER(config)#ipv6 dhcp pool TEST-POOL

R1-ROUTER(config-dhcpv6)#address prefix 2001:100::/64

R1-ROUTER(config-dhcpv6)#dns-server 4::4

R1-ROUTER(config-dhcpv6)#dns-server 8::8

R1-ROUTER(config-dhcpv6)#int gig0/0

R1-ROUTER(config-if)#ipv6 dhcp server TEST-POOL

R3-ATTACKER(config-if)#ipv6 dhcp pool TEST-POOL

R3-ATTACKER(config-dhcpv6)#address prefix 2001:101::/64

R3-ATTACKER(config-dhcpv6)#dns-server 2001:101::BAD

R3-ATTACKER(config-dhcpv6)#int gig0/0

R3-ATTACKER(config-if)#ipv6 address 2001:101::BAD/64

R3-ATTACKER(config-if)#ipv6 dhcp server TEST-POOL

and on our switch:

ipv6 prefix-list GOOD-PREFIX seq 5 permit 2001:100::/64 le 128

ipv6 access-list LL

permit ipv6 host FE80::F2F7:55FF:FE8D:96A2 any

ipv6 dhcp guard policy TRUSTED

device-role server

match server access-list LL

match reply prefix-list GOOD-PREFIX

vlan configuration 123

ipv6 dhcp guard attach-policy TRUSTED

I'm going to treat this a bit differently - I'm going to trust all ports, but require them to match a specific link local source and address range.

R2-HOST(config-if)#no ipv6 address autoconfig

R2-HOST(config-if)#ipv6 enable

R2-HOST(config-if)#ipv6 address dhcp

WS-C4948E#

*Jul 3 20:39:34.985: %SISF-4-PAK_DROP: Message dropped A=FE80::C67D:4FFF:FEF9:5340 G=2001:101::68:6AF1:6FC2:5983 V=123 I=Gi1/3 P=DHCPv6::ADV Reason=The source address in the DHCPv6 ADVERTISE packet is not authorized by the DHCP Guard policy

R2-HOST#sh ipv6 int br | s GigabitEthernet0/0

GigabitEthernet0/0 [up/up]

FE80::FA66:F2FF:FEDE:FF1

2001:100::F1CB:5AD:3844:CB86

We get the correct address from the correct server.
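The two match statements in the TRUSTED policy boil down to "right server, right address range". Here is a hedged Python model of that check, fed with the addresses from the log and the interface output above. The function name is mine, and the real feature looks at more than this.

```python
import ipaddress

# From access-list LL and prefix-list GOOD-PREFIX (2001:100::/64 le 128).
ALLOWED_SERVER = ipaddress.IPv6Address("FE80::F2F7:55FF:FE8D:96A2")
GOOD_PREFIX = ipaddress.IPv6Network("2001:100::/64")

def dhcp_guard_permits(server_source: str, offered_address: str) -> bool:
    """True if the reply comes from the allowed server and offers a good address."""
    from_allowed_server = ipaddress.IPv6Address(server_source) == ALLOWED_SERVER
    # "le 128" lets any address inside the /64 match the prefix-list.
    in_good_prefix = ipaddress.IPv6Address(offered_address) in GOOD_PREFIX
    return from_allowed_server and in_good_prefix

# R3's ADVERTISE from the log, then R1's legitimate offer.
print(dhcp_guard_permits("FE80::C67D:4FFF:FEF9:5340", "2001:101::68:6AF1:6FC2:5983"))  # False
print(dhcp_guard_permits("FE80::F2F7:55FF:FE8D:96A2", "2001:100::F1CB:5AD:3844:CB86"))  # True
```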

Now moving on to a-la-carte Neighbor Discovery Inspection.

The first thing to understand about Neighbor Discovery Inspection is that it's a control plane feature only - it doesn't inspect actual data traffic; it only looks at ND ICMP packets, and that's it - so if your users spoof their source address in an actual traffic flow, this doesn't protect against it.

IPv6 ND Inspection builds a table based on NS/NA messages. It then enforces the table. The funny thing about it is, with NS/NA, basically the first one it hears becomes the trusted one.

WS-C4948E(config)#vlan configuration 123
WS-C4948E(config-vlan-config)#ipv6 nd inspection

I really expected DHCP Guard to also populate this table, but the a-la-carte version of ND Inspection and DHCP Guard don't appear to talk to one another, as best I was able to tell. The unified (ipv6 snooping) feature, which we will see next, does populate the neighbor bindings from DHCP, so don't be confused as to why you see DHCP as a population feature in the screen shot.

So let's spoof R2 (Gig1/2)'s link local from R3:

R3-ATTACKER(config-if)#ipv6 nd dad attempts 0 ! DAD would prevent the spoof
R3-ATTACKER(config-if)#ipv6 address FE80::FA66:F2FF:FEDE:FF1 link-local

WS-C4948E#

*Jul 4 14:45:01.109: %SISF-4-PAK_DROP: Message dropped A=FE80::FA66:F2FF:FEDE:FF1 G=- V=123 I=Gi1/3 P=NDP::RA Reason=More trusted entry exists

More trusted entry exists. As you'll see above, there is a "Preflevel" numbering system, which I think of as administrative distance for neighbor bindings. The higher the number, the more trusted the entry is. Although I do find the message odd when you try to spoof -- what it really should say is "I already learned this entry from someone else, first come first serve!"

The most trusted method is a static entry, which looks like:

ipv6 neighbor binding vlan 123 FE80::FA66:F2FF:FEDE:FF1 interface gig1/3 c47d.4ff9.5340

You can shorten this up a whole bunch, by omitting the optional VLAN and MAC address, but in my case, I'm trying to override the legitimate link-local of R2 to allow R3 to ping from its link local. Without "fully populating" the table, it still prefs the dynamically learned entry over the static one.

We see the static entry with the priority of 100 - let's see if we can ping from R3 now, with our spoofed link-local:

R3-ATTACKER#ping FE80::F2F7:55FF:FE8D:96A2 source FE80::FA66:F2FF:FEDE:FF1
Output Interface: GigabitEthernet0/0
Type escape sequence to abort.
Sending 5, 100-byte ICMP Echos to FE80::F2F7:55FF:FE8D:96A2, timeout is 2 seconds:
Packet sent with a source address of FE80::FA66:F2FF:FEDE:FF1%GigabitEthernet0/0
!!!!!
Success rate is 100 percent (5/5), round-trip min/avg/max = 0/0/0 ms
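
The behavior seen so far (first come first serve, unless a more trusted entry shows up) can be sketched as a toy Python model. This is my mental model only, not Cisco's implementation: the class and method names are made up, the ND preference value is an assumed placeholder, and "r2-mac" is a stand-in because R2's MAC isn't shown above. Only the static priority of 100 comes from the lab.

```python
from dataclasses import dataclass

PREF_STATIC = 100  # the priority shown for the static entry above
PREF_ND = 5        # assumed value for an ND-learned entry, for illustration only


@dataclass
class Binding:
    port: str
    mac: str
    pref: int


class BindingTable:
    """Toy model of a first-come, first-served neighbor binding table."""

    def __init__(self):
        self.entries = {}  # lowercase address string -> Binding

    def learn(self, addr, port, mac, pref):
        """Return True if the claim is accepted, False if it is dropped."""
        key = addr.lower()
        current = self.entries.get(key)
        if current is None:
            self.entries[key] = Binding(port, mac, pref)
            return True
        if (current.port, current.mac) == (port, mac):
            current.pref = max(current.pref, pref)  # same owner, keep the best level
            return True
        if pref > current.pref:
            self.entries[key] = Binding(port, mac, pref)  # more trusted claim wins
            return True
        return False  # "More trusted entry exists"


table = BindingTable()
addr = "FE80::FA66:F2FF:FEDE:FF1"
print(table.learn(addr, "Gi1/2", "r2-mac", PREF_ND))              # True, R2 was heard first
print(table.learn(addr, "Gi1/3", "c47d.4ff9.5340", PREF_ND))      # False, R3's spoof is dropped
print(table.learn(addr, "Gi1/3", "c47d.4ff9.5340", PREF_STATIC))  # True, static entry overrides
```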

Let's look at the rest of the features you can optionally put in a policy.

ipv6 nd inspection policy TEST-POLICY
limit address-count X
This is relatively obvious, it limits how many addresses are permitted to participate in the ND process on a port. Keep in mind you always need a minimum of 2 if you have a global unicast: you need the link-local as well.
tracking enable
This is liveliness tracking, and is pretty cool. We'll give this its own paragraph below.
drop-unsecure
There's a cryptographic version of neighbor discovery called SeND. It requires a PKI infrastructure and certificates all the way down to the client. I'm going to call SeND "out of scope" for the CCIE v5 based on complexity. drop-unsecure is regarding SeND, so I'm skipping it.

sec-level minimum

Also used for SeND; out of scope.

device-role {host | monitor | router}

I'm not really sure what the point is here - I don't know how host or router would differ, and Cisco hasn't documented it.

validate source-mac

I imagine this validates that the source MAC is appropriate for future ND communication, but it's not documented and I can't lab it without changing wires (my lab is remote, as mentioned above), so unfortunately I haven't got a good explanation on this option.

If you enable tracking, as shown above, hosts get probed with an "are you still there?" if no ND packets are heard periodically.

WS-C4948E(config)#ipv6 nd inspection policy ND-TEST
WS-C4948E(config-nd-inspection)#tracking enable

WS-C4948E(config)#vlan configuration 123

WS-C4948E(config-vlan-config)#ipv6 nd inspection attach-policy ND-TEST

Note the "time left" field on the far right. This is how long until a NS is sent out to the neighbor asking if it's still there. This is useful for two reasons:
- Since the table is first-come first-serve, this frees up address space from being held indefinitely if it's actually not in use.
- It allows for a host to move - imagine unplugging from one switchport and moving to another on the same switch - we need a way to age out the information reasonably quickly.

We see we have 220 and 228 seconds on the two hosts presently in the table until they're probed.

One thing that confused me: if we have a pure layer 2 device, how will it send an NS?

If the switch has an IPv6 address on the appropriate VLAN (a link-local is enough), it sends the NS sourced from that address. If it doesn't have an IPv6 address on the VLAN (we're just switching at L2), it sends the NS from the IPv6 unspecified address (::). That's right, the switch will send NSes even if IPv6 routing isn't enabled.

If you want probes to go out more frequently than every 300 seconds, you can set it like so:

ipv6 nd inspection policy TEST-POLICY
tracking enable reachable-lifetime 15 ! this will set it for 15 seconds.
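
A rough way to think about the "time left" countdown, sketched in Python. I'm assuming the timer simply restarts whenever ND traffic is heard from the neighbor; the function names are hypothetical and the 300 second default is the one mentioned above.

```python
DEFAULT_REACHABLE_LIFETIME = 300  # seconds; the default probe interval mentioned above


def time_left(seconds_since_heard, reachable_lifetime=DEFAULT_REACHABLE_LIFETIME):
    """Seconds until the switch would probe the neighbor with an NS."""
    return max(0, reachable_lifetime - seconds_since_heard)


def probe_due(seconds_since_heard, reachable_lifetime=DEFAULT_REACHABLE_LIFETIME):
    """True once the countdown has run out and a probe should go out."""
    return time_left(seconds_since_heard, reachable_lifetime) == 0


print(time_left(80))      # 220, a host last heard 80 seconds ago
print(probe_due(20, 15))  # True once reachable-lifetime is lowered to 15
```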

The most confusing feature for me was the "ipv6 snooping" syntax, which accomplishes basically everything we saw above, plus has the option (via source guard and destination guard) to filter data plane traffic as well. The confusing part about it is simple: it has lots of cross-over with all the previous functions, yet it uses a different syntax.

The basic concept of ipv6 snooping is to build the neighbor database, similar to what IPv6 ND inspection did, except it uses and enforces more methods all at once.

It can use:
- Information from DHCP (Default)
- Information from ND (Default)
- Static bindings

I'm wiping out all prior FH Security functions implemented above, we're starting from scratch with basic addressing. Since this is a large topic, I'm going to give increasingly more complicated examples and explain them one at a time.

At its most basic, enabling it is very simple:

WS-C4948E(config)#vlan configuration 123
WS-C4948E(config-vlan-config)#ipv6 snooping

Unfortunately this totally breaks the network.

By default, ipv6 snooping enables its version of RA Guard and ND Inspection, so now RAs won't work any longer.

WS-C4948E#

*Jul 5 11:41:42.007: %SISF-4-PAK_DROP: Message dropped A=FE80::F2F7:55FF:FE8D:96A2 G=- V=123 I=Gi1/1 P=NDP::RA Reason=Packet not authorized on port

So let's fix that.

WS-C4948E(config)#ipv6 snooping policy TRUST_ROUTER

WS-C4948E(config-ipv6-snooping)# security-level glean

WS-C4948E(config-ipv6-snooping)#int gig1/1

WS-C4948E(config-if)#ipv6 snooping attach-policy TRUST_ROUTER

R2-HOST(config-if)#ipv6 address autoconfig default

R2-HOST(config-if)#do sh ipv6 int br | s GigabitEthernet0/0

GigabitEthernet0/0 [up/up]

FE80::FA66:F2FF:FEDE:FF1

2001:100::FA66:F2FF:FEDE:FF1

Ok great! Now what the heck does "security-level glean" mean? Let me tell you, it was a lot of 'fun' to figure out from the near zero explanation the docs give. glean learns from DHCP and ND, but doesn't enforce anything. So "glean" basically means "trust this port". There is also a trusted-port option on the policy, and on my IOS, it appears to do absolutely nothing (how handy).

For purposes of keeping this article a reasonable size, I'm going to go ahead and point out the 1:1 features in common with the a-la-carte ND inspection. By default, when you enable snooping, you're getting integrated ND Inspection. If you don't want it, you would:

WS-C4948E(config)#ipv6 snooping policy MY_POLICY

WS-C4948E(config-ipv6-snooping)#no protocol ndp

In this case, it would only learn entries from DHCPv6 "Guard" and static entries. Of note, you can do the same thing to disable dhcp inspection: no protocol dhcp

Assuming you left protocol ndp on, these features work the same way I explained them above:

limit address-count

Make sure you allow at least 2 if you're using global unicast.

tracking

Same as before, send NSes to keep the table up to date.

So let's look at the default integrated DHCP Guard inspection next. As mentioned above, this is on by default when using IPv6 Snooping.

R1-ROUTER(config)#ipv6 dhcp pool DHCP-POOL

R1-ROUTER(config-dhcpv6)#address prefix 2001:100::/64

R1-ROUTER(config-dhcpv6)#dns-server 4::4

R1-ROUTER(config-dhcpv6)#dns-server 8::8

R1-ROUTER(config-dhcpv6)#int gig0/0

R1-ROUTER(config-if)#ipv6 dhcp server DHCP-POOL

Now I've already pre-trusted this port (with the glean security level mentioned above), so it's also OK to be a DHCP server.

R2-HOST(config)#int gig0/0

R2-HOST(config-if)#ipv6 enable

R2-HOST(config-if)#no ipv6 address autoconfig default

R2-HOST(config-if)#ipv6 address dhcp

R2-HOST(config-if)#do sh ipv6 int br | s GigabitEthernet0/0

GigabitEthernet0/0 [up/up]

FE80::FA66:F2FF:FEDE:FF1

2001:100::B043:724:FE40:931D

We got our DHCP address.

And we now learned a binding via "DH" - DHCP.

There are three security models that can be applied to entire VLANs or per-port/per-vlan:

WS-C4948E(config-ipv6-snooping)#security-level ?
glean Glean addresses
guard inspect and drop un-authorized messages (default)
inspect glean and Validate message

We've discussed glean - it basically trusts the port but still keeps track of the bindings. Guard, as indicated above, is the default, and it's what we're getting when nothing is specified. Guard is, in effect, the same as enabling the a-la-carte DHCP Guard, RA Guard, and ND Inspection: It learns bindings and denies untrusted (non-glean) ports from sending RAs, DHCP offers, invalid DAD, invalid NAs, etc. It's an all-in-one control-plane security enforcer!

Best I can tell, "inspect" enforces ND only (similar to the a-la-carte ipv6 nd inspection feature) but doesn't protect against malicious RAs or DHCP servers. I validated this by enabling it on all interfaces, sending RAs and DHCP off R1, and then attempting to spoof R2's address on R3's interface. Everything was permitted except the R3 spoof of R2, which produced:

*Jul 5 13:38:48.559: %SISF-4-PAK_DROP: Message dropped A=2001:100::9193:876E:7E83:F33C G=- V=123 I=Gi1/3 P=NDP::NA Reason=More trusted entry exists

on the 4948.
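
To keep the three levels straight, here's how I'd summarize my observations as a small Python decision function. It only encodes what I saw in this lab (best I can tell), not documented behavior, and the names Level and is_dropped are made up for illustration.

```python
from enum import Enum


class Level(Enum):
    GLEAN = "glean"
    INSPECT = "inspect"
    GUARD = "guard"


def is_dropped(level, message, conflicts_with_binding=False):
    """message is "RA", "DHCP-SERVER", or "ND" (NS/NA). Models lab observations only."""
    if level is Level.GLEAN:
        return False  # learn bindings, enforce nothing
    if message == "ND":
        return conflicts_with_binding  # both inspect and guard enforce the binding table
    # RAs and DHCP server messages are only blocked by guard.
    return level is Level.GUARD


print(is_dropped(Level.INSPECT, "RA"))        # False, inspect lets RAs through
print(is_dropped(Level.INSPECT, "ND", True))  # True, the spoofed NA is dropped
print(is_dropped(Level.GUARD, "RA"))          # True, guard blocks RAs on untrusted ports
```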

That wraps up the control-plane filters. Data-plane filters are next, including Source Guard, Destination Guard, and Prefix Guard. IPv6 snooping builds the database that these features use, so it is a prerequisite for everything we'll see from here on.

Source guard is very simple. If the data-plane traffic - any traffic other than IPv6 ND/RA and DHCP - doesn't match the source address present in the prebuilt binding table, drop the traffic.

WS-C4948E(config)#vlan configuration 123

WS-C4948E(config-vlan-config)#ipv6 source-guard
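
The source guard rule is simple enough to sketch in Python. This is a conceptual model with made-up names (source_guard_permits, CONTROL_PLANE); the binding uses R2's DHCP-learned address from earlier, on the port R2 is attached to.

```python
import ipaddress

CONTROL_PLANE = {"NDP", "DHCPv6"}  # handled by snooping, not by source guard


def source_guard_permits(bindings, port, src, protocol):
    """Permit data-plane traffic only if src is bound to the port it arrived on."""
    if protocol in CONTROL_PLANE:
        return True
    return bindings.get(ipaddress.ip_address(src)) == port


# Binding table as built by ipv6 snooping: address -> port.
bindings = {ipaddress.ip_address("2001:100::B043:724:FE40:931D"): "Gi1/2"}

print(source_guard_permits(bindings, "Gi1/2", "2001:100::b043:724:fe40:931d", "ICMPv6-ECHO"))  # True
print(source_guard_permits(bindings, "Gi1/3", "2001:100::B043:724:FE40:931D", "ICMPv6-ECHO"))  # False
```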

The main additional configuration for this is to validate prefixes, which is what's known as Prefix Guard.

Far earlier in the article we discussed prefix delegations via DHCP. This is the technology that allows you to sub-lease a prefix from one DHCP server to a downstream DHCP server. I couldn't get this feature to work with my lab, and I believe it's a platform limitation, but unfortunately when you're borrowing high-end switches to lab bleeding-edge functions, beggars can't be choosers.

Here is how I think it's supposed to work:

! DHCP SERVER

ipv6 local pool dhcpv6-pool1 2001:100:123::/40 48

ipv6 dhcp pool test

prefix-delegation pool dhcpv6-pool1 lifetime 1800 600

dns-server 4::4

dns-server 8::8

domain-name servers.net

interface GigabitEthernet0/0

ipv6 address 2001:100:123::1/48

ipv6 nd other-config-flag

ipv6 dhcp server test

! CLIENT

interface GigabitEthernet0/0

ipv6 address autoconfig default

ipv6 nd ra suppress all

ipv6 dhcp client pd PREFIX-DELEGATION

interface Loopback0 ! I needed something to pretend to be a downstream client

ipv6 address PREFIX-DELEGATION ::/64 eui-64

ipv6 enable

! SWITCH

ipv6 snooping policy SNOOPING-POLICY

security-level glean

prefix-glean ! this learns prefix delegations - this did work for me, it was the filtering that didn't work.

ipv6 source-guard policy PREFIX-GUARD

no validate address

validate prefix

vlan configuration 123

ipv6 snooping attach-policy SNOOPING-POLICY

ipv6 source-guard attach-policy PREFIX-GUARD

And after applying all this, my switch shot back at me:

WS-C4948E(config)#

%warning% This filter is not supported. Vlan - 123, mac - any, prefix_length - 64

%warning% This filter is not supported. Vlan - 123, mac - any, prefix_length - 64

%warning% This filter is not supported. Vlan - 123, mac - any, prefix_length - 128

I tinkered with it for a while, changing the address to another one on Lo0 that I shouldn't be able to use, to no avail. Traffic kept passing. So I'm assuming this just can't be accomplished on this hardware or I'm hitting a bug.

I did have one other curiosity with source guard (just vanilla source guard, not prefix guard); while it did filter traffic appropriately, sometimes when I disabled the feature, traffic still wouldn't forward. Clearing the database solved the problem: clear ipv6 neigh bind

Our last major topic is Destination Guard, which is pretty darn cool. The recommended prefix size for a LAN segment in IPv6 is a /64, which is 18 quintillion addresses. To put that into perspective, that's 18,446,744,073,709,551,616 addresses in one segment.
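
For scale, here's the arithmetic behind that number as a quick Python check. The packet rate is an assumption I picked purely for illustration.

```python
# A /64 leaves 64 host bits out of 128.
addresses_in_slash_64 = 2 ** (128 - 64)
print(f"{addresses_in_slash_64:,}")  # 18,446,744,073,709,551,616

# Assumed rate, purely for scale: one million packets per second.
packets_per_second = 1_000_000
years_to_sweep = addresses_in_slash_64 / packets_per_second / (365 * 24 * 3600)
print(f"{years_to_sweep:,.0f} years to sweep the whole segment")
```

Even at that rate nobody is ever going to finish sweeping a /64, so there's always another address to ask about.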

What would happen if you tried to "arp" (IPv6 NS) for 18,446,744,073,709,551,616 addresses back-to-back?

Your router would melt. You'd run out of RAM and CPU very quickly. At best this would result in a simple DoS where legitimate NSes couldn't get through for a while; at worst the OS might crash. Most of the attacks we've looked at so far require the attacker to have access to the "inside" of your network. What's worse about this attack is that it can be accomplished from the outside of your network - potentially from the Internet.

Destination Guard addresses this. Destination Guard is a "last hop" security feature: the attack can be launched from anywhere on a routed network, and the last hop router is the only one that is heavily impacted, because interim routers don't have to NS for the final destination, they just CEF-switch the packet.

We're also assuming, for our purposes here, that the last-hop router is a layer 3 switch supporting destination guard.

WS-C4948E(config)#vlan configuration 123

WS-C4948E(config-vlan-config)#ipv6 destination-guard

There's only one very simple setting you can add if you use a policy:

WS-C4948E(config)#ipv6 destination-guard policy FOO

WS-C4948E(config-destguard)#enforcement ?

always Enforced under all conditions (default)

stressed Enforced when system is under stress

stressed isn't defined anywhere, but I'm assuming it means only kick this feature in during high CPU load, or perhaps under a great deal of NS.

I'll be changing our lab up a bit in order to test this, R1 will be out of the picture, R2 will have a static assignment and will be our "trusted" host, and R3 will be our attacker on the Internet.

R2-HOST(config-if)#int gig0/0

R2-HOST(config-if)#ipv6 address 2001:600D::2/64 ! my best representation of "GOOD" in hex :)

R2-HOST(config)#ipv6 route ::/0 2001:600D::1 ! the switch's address

R3-ATTACKER(config-if)#int gig0/0

R3-ATTACKER(config-if)#ipv6 address 2001:BAD::2/64

R3-ATTACKER(config)#ipv6 route ::/0 2001:BAD::1

WS-C4948E(config)#ipv6 unicast-routing

WS-C4948E(config)#vlan 200

WS-C4948E(config-vlan)#exit

WS-C4948E(config)#vlan 300

WS-C4948E(config-vlan)#exit

WS-C4948E(config)#int vlan200

WS-C4948E(config-if)#ipv6 address 2001:600D::1/64

WS-C4948E(config-if)#no shut

WS-C4948E(config-if)#int vlan 300

WS-C4948E(config-if)#ipv6 address 2001:BAD::1/64

WS-C4948E(config-if)#no shut

WS-C4948E(config)#int gig1/2

WS-C4948E(config-if)#switchport access vlan 200

WS-C4948E(config)#int gig1/3

WS-C4948E(config-if)#switchport access vlan 300

WS-C4948E(config-if)#vlan configuration 200,300

WS-C4948E(config-vlan-config)# ipv6 snooping

WS-C4948E(config-vlan-config)# ipv6 destination-guard

Now let's say the "BAD" neighbor, a host on the Internet, tries to hammer away at tens of thousands of addresses on the "GOOD" (600D) network in order to make the router (our 4948) collapse under the NS load.

In order to see this security feature take effect, we need to enable one more logging feature:

WS-C4948E(config)#ipv6 snooping logging resolution-veto

What the heck is resolution veto? If the switch decides it's getting asked to resolve for bogus addresses, it will "veto" the neighbor solicitation.

Before trying the attack, I went ahead and verified connectivity from R2 to R3.

R2-HOST(config)#do ping 2001:BAD::2

Type escape sequence to abort.

Sending 5, 100-byte ICMP Echos to 2001:BAD::2, timeout is 2 seconds:

.....

That should've worked. What did the 4948 have to say?

WS-C4948E(config)#

*Jul 5 16:13:12.774: %SISF-4-RESOLUTION_VETO: Resolution vetoed NS for D=2001:BAD::2 on I=Vl300 reason=Destination not active on link

This is an important point on this feature: it doesn't know what hosts are valid until it hears from them. So due to my order-of-operations in this scenario, the switch hadn't actually heard from R3 at all yet, and the NS is denied. Making R3 speak solves the problem, which would've worked itself out eventually anyway:

R3-ATTACKER(config)#do ping 2001:600D::2

Type escape sequence to abort.

Sending 5, 100-byte ICMP Echos to 2001:600D::2, timeout is 2 seconds:

!!!!!

Success rate is 100 percent (5/5), round-trip min/avg/max = 0/1/8 ms

Now let's try our theoretical attack.

R3-ATTACKER#ping 2001:600D::100 repeat 1 timeout 0

Type escape sequence to abort.

Sending 1, 100-byte ICMP Echos to 2001:600D::100, timeout is 0 seconds:

Success rate is 0 percent (0/1)

R3-ATTACKER#ping 2001:600D::101 repeat 1 timeout 0

Type escape sequence to abort.

Sending 1, 100-byte ICMP Echos to 2001:600D::101, timeout is 0 seconds:

Success rate is 0 percent (0/1)

R3-ATTACKER#ping 2001:600D::102 repeat 1 timeout 0

Type escape sequence to abort.

Sending 1, 100-byte ICMP Echos to 2001:600D::102, timeout is 0 seconds:

Success rate is 0 percent (0/1)

R3-ATTACKER#ping 2001:600D::103 repeat 1 timeout 0

Type escape sequence to abort.

Sending 1, 100-byte ICMP Echos to 2001:600D::103, timeout is 0 seconds:

Success rate is 0 percent (0/1)

(pretend there are 65,275 hypothetical pings here)

R3-ATTACKER#ping 2001:600D::FFFF repeat 1 timeout 0

Type escape sequence to abort.

Sending 1, 100-byte ICMP Echos to 2001:600D::FFFF, timeout is 0 seconds:

Success rate is 0 percent (0/1)

We expected that outcome as these aren't valid hosts, but what happened on the switch?

*Jul 5 16:18:26.226: %SISF-4-RESOLUTION_VETO: Resolution vetoed NS for D=2001:600D::100 on I=Vl200 reason=Destination not active on link

*Jul 5 16:18:34.146: %SISF-4-RESOLUTION_VETO: Resolution vetoed NS for D=2001:600D::101 on I=Vl200 reason=Destination not active on link

*Jul 5 16:18:43.338: %SISF-4-RESOLUTION_VETO: Resolution vetoed NS for D=2001:600D::102 on I=Vl200 reason=Destination not active on link

*Jul 5 16:18:50.418: %SISF-4-RESOLUTION_VETO: Resolution vetoed NS for D=2001:600D::103 on I=Vl200 reason=Destination not active on link

.....

*Jul 5 16:18:58.958: %SISF-4-RESOLUTION_VETO: Resolution vetoed NS for D=2001:600D::FFFF on I=Vl200 reason=Destination not active on link
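
The logic behind those vetoes, as I understand it, can be sketched in Python: only resolve for destinations the binding table already knows are alive. The function name handle_resolution and the data layout are hypothetical; the addresses and interfaces are from this lab.

```python
import ipaddress


def handle_resolution(active, interface, dest):
    """Return "resolve" if dest is a known-active neighbor, else a veto message."""
    if (interface, ipaddress.ip_address(dest)) in active:
        return "resolve"
    return f"vetoed NS for D={dest} on I={interface}: Destination not active on link"


# Neighbors the switch has heard from so far.
active = {("Vl200", ipaddress.ip_address("2001:600D::2"))}

print(handle_resolution(active, "Vl300", "2001:BAD::2"))     # vetoed, R3 hasn't spoken yet
active.add(("Vl300", ipaddress.ip_address("2001:BAD::2")))   # R3 sends traffic and is learned
print(handle_resolution(active, "Vl300", "2001:BAD::2"))     # resolve
print(handle_resolution(active, "Vl200", "2001:600D::100"))  # vetoed, bogus destination
```

This is also why my first ping from R2 failed: the check doesn't care whether the destination is legitimate, only whether it has been heard from.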

And that about sums up destination guard.

A few random notes for the wrap-up.

Some useful show commands:

WS-C4948E#sh ipv6 snooping policies

Target Type Policy Feature Target range

Gi1/2 PORT policy1 NDP inspection vlan all

This will show you what policies are applied where.

WS-C4948E#sh ipv6 snooping features

Feature name priority state

NDP inspection 160 READY

Snooping 128 READY

This shows which features are enabled.

And where is all this in the documentation?

Well, at the time of this writing, there's a link to all this directly under the IOS 15.2E documentation, which is what I labbed on! And ... it's a broken link! Uuuuurgh!

Best I could come up with that I could drill-down to:

Switches ->

3850 ->

Catalyst 3850-12S-E Switch ->

Configuration Guides ->

IPv6 Configuration Library, Cisco IOS XE Release 3SE (Catalyst 3850 Switches) ->

IPv6 First-Hop Security Configuration Guide, Cisco IOS XE Release 3SE (Catalyst 3850 Series)

That covers just about everything except Source Guard and Prefix Guard, but let's face it, those two features are pretty easy to understand if you have the rest of it down.

Best of luck!

Jeff

Everything BFD

2014-06-14T11:38:00.001-07:00

I sat down to start working on BFD three weeks ago thinking I'd be done in a couple of days. Three weeks later I'm finally starting blogging on it. To be fair, I missed a week to drinking heavily in California, but it still took a really long time to dive down every rabbit hole - there are a lot of BFD features, and the documentation stinks, particularly when it comes to "why would I want to use this".

So what is BFD (Bidirectional Forwarding Detection)?

BFD is a high-speed "are you up" protocol that other routing protocols subscribe to. It can detect link failures in milliseconds, with the potential for microseconds on the right platform. All routing protocols have some way of detecting failure, usually timer-related. Tuning the timers can theoretically get you sub-second failure detection in some protocols, but this produces unnecessarily high overhead as the average IGP wasn't designed with that in mind. BFD was specifically built for fast/low CPU detection, and in the case of single-hop, can offload a great deal of the checks to CEF (by using echo mode - more later), even on a typical router. Some high-end platforms can even offload the entire BFD process to the linecard. The CEF or hardware offload makes BFD a major improvement over the other obvious choice, IP SLA.

Scope of this document:
To give a granular understanding of BFD, but focused on the CCIE v5 R&S. On that note, I have covered every BFD function I can find, excluding ISIS and MPLS-TE, as these are both out of scope for the v5 R&S lab. We will cover BFD's use with single-hop BGP, OSPF, EIGRP, RIP, PIM, HSRP, static routes, hierarchical static routes, multi-hop BGP and multi-hop static routes. We will cover it with and without echo mode, and with and without authentication. We will discuss the IPv6 implementation of each of the above protocols where applicable. We'll also cover BFD dampening.

Some key items to know:

- BFD has no neighbor detection. When the routing protocol needs to monitor a neighbor, it informs BFD, and BFD establishes the neighbor relationship at that point.

- Various routing protocols can piggyback a single BFD session. If you have BGP and EIGRP running between the same two subnets on the same two routers, there's no need to have two BFD sessions for checking the same exact topology.

- There are two versions, 0 and 1. While there are some deep programmatic differences between the two, those are out of scope for this document. The major difference for us is that v1 supports echo mode (more later) and v0 doesn't. v1 is on by default on Cisco equipment. The documentation says Cisco equipment can be backwards compatible with v0 if its neighbor only supports v0, but since there's no way to manually enable this on a Cisco device, you can safely assume for the lab exam that we're always talking about v1.

- There are two modes, asynchronous and demand. Asynchronous is the "normal" BFD mode that you're used to when you think Cisco BFD: continuous, high-speed detection of neighbor failure. Demand mode is more of a steady-state operation, where it's assumed the neighbor and link are generally stable, but you'd periodically want to check to see if it's up. Cisco has output for the demand bit in show commands, and they also talk about it (vaguely) in the documentation in places, but best I can tell there's just no command to enable it - at least not on my device (CSR1000v v15.4.1). I read elsewhere that enabling echo mode (more on echo mode later) put the control packets in demand mode, but the demand bit is not set in these cases, so that appears incorrect. Perhaps it would work if the other neighbor initiated? Regardless, assume out of scope for the v5 lab.

- BFD is always unicast.

- BFD control packets are always UDP, sourced from 49152 (RFC 5881 allows any source port from 49152 through 65535) and sent to 3784. BFD echo packets are also UDP, and are sourced from 3785 and sent to 3785 (why this is will become obvious later).

Let's start with basic, single-hop configuration. This will be our topology:

As with many of my other blogs, I use GNS3 as an easy diagramming tool. In this case, I am not using GNS3 for the actual lab, because I have never been able to get BFD to work in GNS3. The second the BFD relationship tries to establish, dynamips hard locks. Instead, I am using CSR1000v running on VMware.

Each router's gigabit interfaces are assigned 192.168.AB.X/24, where AB is the router numbers on the link, and X is the router number: i.e. R2's Gig1 is 192.168.12.2 and Gig2 is 192.168.23.2. Each router has a Loopback0 address of X.X.X.X, where X is the router number: i.e. R1's Loopback0 is 1.1.1.1.

In addition, each router's gigabit interfaces are assigned AB::X/64, where AB is the router numbers on the link, and X is the router number: i.e. R2's Gig1 is 12::2/64. Each router has a Loopback0 address of X::X/128, where X is the router number: i.e. R1's Loopback0 is 1::1/128. IPv6 unicast routing is enabled on all devices.
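
If it helps to see the numbering scheme spelled out, here's a tiny Python helper that generates it. The function names are made up, and I'm assuming the lower router number always comes first in "AB", which matches the examples above.

```python
def link_addresses(a, b, x):
    """IPv4 and IPv6 addresses of router x on the link between routers a and b."""
    lo, hi = sorted((a, b))  # assumption: lower router number first, e.g. 12, 23
    return f"192.168.{lo}{hi}.{x}", f"{lo}{hi}::{x}/64"


def loopback_addresses(x):
    """Loopback0 IPv4 and IPv6 addresses of router x."""
    return f"{x}.{x}.{x}.{x}", f"{x}::{x}/128"


print(link_addresses(1, 2, 2))  # ('192.168.12.2', '12::2/64'), R2's Gig1
print(link_addresses(2, 3, 2))  # ('192.168.23.2', '23::2/64'), R2's Gig2
print(loopback_addresses(1))    # ('1.1.1.1', '1::1/128'), R1's Loopback0
```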

For this section, we'll mostly be working with R1 and R2.

R1#conf t

R1(config)#interface GigabitEthernet1

R1(config-if)#bfd interval 300 min_rx 300 multiplier 3

R2#conf t

R2(config)#interface GigabitEthernet1

R2(config-if)#bfd interval 300 min_rx 300 multiplier 3

As with a routing protocol, you'd probably expect some sort of output at this point.

R1#show bfd neighbor
R1#

As I mentioned above, no auto-discovery process. The routing protocol has to tell BFD it's needed first, and where to establish its neighbor relationship. It's also very noteworthy from a debugging perspective that if you get the blank output I showed above, you're missing something locally. Having half a BFD session configured (or having your authentication messed up) will produce output with a status of "AdminDown".

Let's get this working:

R1(config-if)#ip ospf 1 area 0
R1(config-if)#ip ospf bfd

R2(config-if)#ip ospf 1 area 0

R2(config-if)#ip ospf bfd

Now we should see some output.

R1#show bfd neighbor

IPv4 Sessions
NeighAddr LD/RD RH/RS State Int
192.168.12.2 4097/4097 Up Up Gi1

R2 has a similar output referencing R1.

Now that we have a base config up, let's test the detection.

R2(config-if)#shut

R1#

*Jun 21 02:17:21.983: %OSPF-5-ADJCHG: Process 1, Nbr 2.2.2.2 on GigabitEthernet1 from FULL to DOWN, Neighbor Down: BFD node down

R1#

It's hard to demonstrate speed in a blog, but it happens very fast. We see from the output that BFD told the routing protocol that its neighbor had been lost - "BFD node down".

Let's bring the link back up.

R2(config-if)#no shut

The command usage is not as simple as it seems - the variable names are terrible, in my opinion. Let's build a more confusing example:

R1(config-if)#bfd interval 200 min_rx 500 multiplier 5
R2(config-if)#bfd interval 250 min_rx 400 multiplier 5

The first value is the "min_tx" and the second value is the "min_rx". I don't care for the names at all. min_rx from R1 will be compared to min_tx from R2, and a per-direction transmission value will be calculated.

In our scenario above, R1's min_tx - 200 - will be compared to R2's min_rx - 400. The slower (larger) value wins. Clearly, 400ms is longer than 200ms, so 400ms will be the negotiated transmission value for R1 towards R2. Vice-versa, R2's min_tx is 250ms, and R1's min_rx is 500, so 500ms will be the transmission speed from R2 to R1.
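
The same comparison, written out as a few lines of Python so the direction of each value is explicit. The function name negotiated_tx is just for illustration; the numbers are the ones configured above.

```python
def negotiated_tx(local_min_tx, remote_min_rx):
    """Interval (ms) at which the local router sends: the slower (larger) value wins."""
    return max(local_min_tx, remote_min_rx)


r1 = {"min_tx": 200, "min_rx": 500}  # bfd interval 200 min_rx 500
r2 = {"min_tx": 250, "min_rx": 400}  # bfd interval 250 min_rx 400

r1_to_r2 = negotiated_tx(r1["min_tx"], r2["min_rx"])
r2_to_r1 = negotiated_tx(r2["min_tx"], r1["min_rx"])
print(r1_to_r2, r2_to_r1)  # 400 500
```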

For output clarity I am going to disable echo mode for the moment (not shown here) and show the simplified output of show bfd neighbor detail:

Here's the relevant output:
Rx Count: 47, Rx Interval (ms) min/max/avg: 1/500/403 last: 379 ms ago
Tx Count: 58, Tx Interval (ms) min/max/avg: 1/398/331 last: 59 ms ago

We see that our maximum Rx is 500 (speed from R2 to R1), and maximum Tx is 398 (speed from R1 to R2). It can send/receive a touch faster than this, the idea being that if the BFD control packet doesn't arrive in under the maximum time, it'll be considered lost.

The multiplier is reasonably obvious: if you miss that many consecutive BFD control packets, consider the link failed.
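
Put together with the negotiated interval, that gives a detection time. A minimal Python sketch, assuming the RFC 5880 asynchronous-mode rule that the receiver uses the sender's multiplier times the agreed interval in that direction; the function name is hypothetical.

```python
def detection_time_ms(remote_multiplier, remote_min_tx, local_min_rx):
    """How long the local router waits without a control packet before declaring failure."""
    agreed_interval = max(remote_min_tx, local_min_rx)  # the slower (larger) value wins
    return remote_multiplier * agreed_interval


# The first lab config on both routers: bfd interval 300 min_rx 300 multiplier 3
first_lab = detection_time_ms(3, 300, 300)  # 3 x 300 ms = 900 ms
```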

Since I've got this output up, also noteworthy are:

* Registered Protocols: What protocols are "subscribed" to this BFD session? We see OSPF and CEF here.

* Session state is UP and not using echo function. - as I mentioned I disabled echo.

* C bit: 0 - This is only relevant on platforms that can completely hardware offload BFD, and we'll talk about it later.

* Demand bit: 0 - I talked about this earlier, interesting that there's output for it, but I couldn't find any way to enable it.

Since we started with an OSPF example, let me recap what I did above and then we'll look at the alternative way to enable BFD.

Presently:

R1:

interface GigabitEthernet1

ip address 192.168.12.1 255.255.255.0

ip ospf bfd

ip ospf 1 area 0

negotiation auto

bfd interval 200 min_rx 500 multiplier 5

no bfd echo

R2:

interface GigabitEthernet1

ip address 192.168.12.2 255.255.255.0

ip ospf bfd

ip ospf 1 area 0

negotiation auto

bfd interval 250 min_rx 400 multiplier 3

no bfd echo

the other option:

R1(config)#int gig1

R1(config-if)#no ip ospf bfd

R2(config-if)#int gig1

R2(config-if)#no ip ospf bfd

R1(config)#router ospf 1

R1(config-router)#bfd all-interfaces

R2(config)#router ospf 1

R2(config-router)#bfd all-interfaces

and if you wanted to selectively turn it off on an interface:

R2(config-router)#int gig2

R2(config-if)#ip ospf bfd disable

R1(config-router)#do sh bfd neigh

IPv4 Sessions

NeighAddr LD/RD RH/RS State Int

192.168.12.2 1/1 Up Up Gi1

Next I'll quickly burn through the rest of the IGPs, and single-hop BGP.

For clarity, I've removed all the OSPF config beforehand.

R1(config)#router eigrp 100

R1(config-router)#network 0.0.0.0

R1(config-router)#bfd all-interfaces

R2(config)#router eigrp 100

R2(config-router)#network 0.0.0.0

R2(config-router)#bfd interface gig1

Note the syntax difference between R1 and R2: I'm showing both configuration methods at once, single interface vs all interfaces. I of course still have the interface-level BFD config in place.

R1#sh ip eigrp neigh

EIGRP-IPv4 Neighbors for AS(100)

H Address Interface Hold Uptime SRTT RTO Q Seq

(sec) (ms) Cnt Num

0 192.168.12.2 Gi1 13 00:01:42 1596 5000 0 3

R1#sh bfd neigh

IPv4 Sessions

NeighAddr LD/RD RH/RS State Int

192.168.12.2 1/1 Up Up Gi1

Removing EIGRP config for clarity.

RIP is supported, but it's a bit of an oddity. If you know RIP at all, your first question should be "how can BFD work with a neighborless routing protocol?".

It's a bit of a hack.

First item of note, Cisco advertises the feature as "BFD for RIPv2". Just to prove that it's not RIPv2 specific, I'm going to do this lab on RIPv1.

R1(config)#router rip

R1(config-router)# version 1

R1(config-router)# network 192.168.12.0

R1(config-router)# neighbor 192.168.12.2 bfd

R1(config-router)# bfd all-interfaces ! note, this is the only option

R2(config)#router rip

R2(config-router)#version 1

R2(config-router)#network 192.168.12.0

R2(config-router)#neighbor 192.168.12.1 bfd

R2(config-router)#bfd all-interfaces

R1(config-if)#do show bfd neigh

R1(config-if)#

Hmm, no luck.

Turns out RIP requires you to be advertising a route other than the transit link for the BFD relationship to establish.

R1(config-router)#network 1.1.1.1

R2(config-router)#network 2.2.2.2

R1(config-router)#do show bfd neigh

IPv4 Sessions

NeighAddr LD/RD RH/RS State Int

192.168.12.2 4097/1 Up Up Gi1

So, clearly we can't tear down a non-existent neighbor relationship if the link fails. So what's this good for?

R2(config-router)#int gig1

R2(config-if)#shut

R1(config-router)#do sh ip rip data | i 2.0.0.0

2.0.0.0/8 is possibly down

R1(config-router)#do sh ip route 2.0.0.0

Routing entry for 2.0.0.0/8

Known via "rip", distance 120, metric 4294967295 (inaccessible)

Redistributing via rip

Last update from 192.168.12.2 on GigabitEthernet1, 00:00:56 ago

Hold down timer expires in 142 secs

It marks the route as invalid immediately, rather than waiting on painfully slow RIP timers.

Removing RIP config for cleanliness...

Single-hop BGP BFD is very simple. It's also probably the most-deployed implementation of BFD.

R1(config)#router bgp 100
R1(config-router)#neighbor 192.168.12.2 remote-as 200
R1(config-router)#neighbor 192.168.12.2 fall-over bfd

R2(config)#router bgp 200
R2(config-router)#neighbor 192.168.12.1 remote-as 100
R2(config-router)#neighbor 192.168.12.1 fall-over bfd

R1(config-router)#do show bfd neigh det | i protocols
Registered protocols: BGP CEF

There's an extra flag you can use with BGP that takes some explaining. It's called the C-Bit, and if you don't understand the usage, it's a confusing thing.

There are some service provider platforms that can run BFD completely in hardware. Meaning, the line card itself knows the BFD logic, and the control plane can actually crash and BFD will keep working. On these platforms, graceful restart (GR) or non-stop forwarding (NSF) can keep the FIB populated on the line card while the control plane reboots itself. GR is actually a negotiated BGP parameter -- when BGP needs to reboot, the neighbor keeps the routes from the rebooting device. In this fashion, the neighbor keeps forwarding traffic to the device that's rebooting even though BGP keepalives have failed.

So what's this got to do with BFD?

There could be circumstances where both the control plane needs to reboot and the forwarding plane dies at the same time. BFD can help detect this. Consider this topology.

R1 --> R2. R2 is a provider platform that has NSF enabled.

R1 learns that R2 is an NSF device, and assumes it's OK to keep forwarding traffic to R2 even while R2's control plane is down.

If...: BFD control packets are still coming, and C-BIT = 0 or 1, then R1 should keep forwarding to R2
If...: BFD control packets stop coming, and the C-BIT = 0, then R1 should assume that BFD was run in software on the neighbor, and should keep forwarding packets during graceful restart.
If...: BFD control packets stop coming, and the C-BIT = 1, then R1 should assume that BFD was run in hardware on the linecard on the neighbor, and that the neighbor is genuinely broken, and to yank the routes rather than wait on graceful restart.
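The three cases above boil down to a small decision table. Here's a minimal Python sketch of that logic as I understand it, assuming R1 has been told to honor the C-Bit. The function and enum names are mine, purely for illustration, and are not anything IOS exposes.

```python
from enum import Enum


class Action(Enum):
    KEEP_FORWARDING = "keep forwarding to the neighbor"
    WITHDRAW_ROUTES = "withdraw the neighbor's routes now"


def on_bfd_status(bfd_packets_arriving: bool, c_bit: int) -> Action:
    """Decide what R1 does about a graceful-restart-capable neighbor.

    c_bit is the C-Bit value last seen from the neighbor (0 or 1).
    """
    if bfd_packets_arriving:
        # BFD is still alive, so the C-Bit value doesn't matter.
        return Action.KEEP_FORWARDING
    if c_bit == 0:
        # BFD ran in software and died with the control plane, so its
        # silence says nothing about forwarding. Trust graceful restart.
        return Action.KEEP_FORWARDING
    # c_bit == 1: BFD ran in hardware, independent of the control plane,
    # so silence means the neighbor is genuinely broken.
    return Action.WITHDRAW_ROUTES


for arriving in (True, False):
    for c_bit in (0, 1):
        print(arriving, c_bit, on_bfd_status(arriving, c_bit).value)
```

Only one of the four combinations pulls the routes: BFD silent with the C-Bit set.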

As best I can tell, without a platform to lab this on, the C-Bit is set by the BFD process itself, and isn't something you can toggle. However, you can tell your BFD process whether to ignore the setting or not. The default is to ignore. If you want to use it:

R1(config-router)#neighbor 192.168.12.2 fall-over bfd check-control-plane-failure

Of note, this feature is also available in multi-hop BGP, which we'll cover further below.

And now for PIM!

I'm not going to build a full multicast lab up here, but we can see the basics.

R1(config)#int gig1

R1(config-if)#ip pim sparse-mode

R1(config-if)#ip pim bfd

R2(config)#int gig1

R2(config-if)#ip pim sparse-mode

R2(config-if)#ip pim bfd

R1(config-if)#do show bfd neigh

IPv4 Sessions

NeighAddr LD/RD RH/RS State Int

192.168.12.2 4097/1 Up Up Gi1

R1(config-if)#do show bfd neigh det | i protocols

Registered protocols: PIM CEF

That's all there is to it. Removing PIM config...

I'll cover HSRP now as well.

R1(config)#int gig1
R1(config-if)#standby 1 ip 192.168.12.100
R1(config-if)#standby bfd

R2(config)#int gig1
R2(config-if)#standby 1 ip 192.168.12.100

R2(config-if)#standby bfd

R1(config-if)#do show bfd neigh det | i protocol

Registered protocols: HSRP CEF

R1(config-if)#do show standby | i BFD

BFD enabled

Alternatively, HSRP BFD support can be enabled globally with:

R1(config)#standby bfd all-interfaces

IOS-based VRRP doesn't appear to have BFD support at the time of this writing. I've seen some documents indicating it is supported in IOS-XR and Nexus.

And now for IPv6 IGPs and BGP.

OSPFv3's BFD usage is very similar to OSPFv2's.

R1(config-if)#int gig1
R1(config-if)#ipv6 ospf 1 area 0
R1(config-if)#ipv6 ospf bfd

R2(config-if)#int gig1

R2(config-if)#ipv6 ospf 1 area 0

R2(config-if)#ipv6 ospf bfd

R1(config-rtr)#do show bfd neigh

IPv6 Sessions

NeighAddr LD/RD RH/RS State Int

FE80::20C:29FF:FECF:21FF 1/1 Up Up Gi1

You can also use the "all interfaces" style, as with OSPFv2.

EIGRPv6 supports BFD in named EIGRP configuration, which is pretty darn different from the "old" way of doing EIGRPv6:

R1(config)# router eigrp FOO

R1(config-router)# address-family ipv6 unicast autonomous-system 1

R1(config-router-af)# af-interface default

R1(config-router-af-interface)# bfd

R1(config-router-af-interface)# exit-af-interface

R1(config-router-af)# topology base

R1(config-router-af-topology)# exit-af-topology

R1(config-router-af)# exit-address-family

R2's config is 100% identical so I am omitting it.

R1(config-router)#do show bfd neigh

IPv6 Sessions

NeighAddr LD/RD RH/RS State Int

FE80::20C:29FF:FECF:21FF 1/1 Up Up Gi1

R1(config-router)#do show bfd neigh det | i protocol

Registered protocols: EIGRP CEF

And RIPng? No such luck. It's not supported.

Multiprotocol (IPv6) BGP is basically the same as v4:

R1(config-router)#router bgp 100
R1(config-router)#neighbor 12::2 remote-as 200
R1(config-router)#neighbor 12::2 fall-over bfd
R1(config-router)#address-family ipv6
R1(config-router-af)#neighbor 12::2 activate

R2(config-router)#router bgp 200
R2(config-router)#bgp log-neighbor-changes
R2(config-router)#neighbor 12::1 remote-as 100
R2(config-router)#neighbor 12::1 fall-over bfd
R2(config-router)#address-family ipv6
R2(config-router-af)# neighbor 12::1 activate

R1(config-router)#do show bfd neigh

IPv6 Sessions
NeighAddr LD/RD RH/RS State Int
12::2 1/1 Up Up Gi1

IPv6 PIM is just as easy as v4:

R1(config)#int gig1

R1(config-if)#ipv6 pim bfd

HSRP for IPv6:

interface GigabitEthernet1

standby version 2

standby 1 ipv6 autoconfig

standby bfd

Similar to v4, BFD for VRRP over v6 doesn't seem to be supported in traditional IOS at this time.

Back to v4 for static routing.

Static routing takes a little more work, because there's no IGP to notify BFD of who to peer with, nor is there any neighbor relationship.

Let's start with the most basic usage.

R1(config)#ip route static bfd GigabitEthernet1 192.168.12.2

R1(config)#ip route 2.2.2.2 255.255.255.255 GigabitEthernet1 192.168.12.2

R2(config)#ip route static bfd GigabitEthernet1 192.168.12.1

R2(config)#ip route 1.1.1.1 255.255.255.255 GigabitEthernet1 192.168.12.1

R1(config)#do show bfd neigh

IPv4 Sessions

NeighAddr LD/RD RH/RS State Int

192.168.12.2 4097/2 Up Up Gi1

R1(config)#do show bfd neigh det | i protocols

Registered protocols: CEF IPv4 Static

With the absence of a routing protocol, we use ip route static bfd <interface> <next-hop to monitor>

Where <next-hop to monitor> is the IP we'll be pointing out static routes to.

We still need to fulfill two more prerequisites:

- We must have a static route pointing at the specified next-hop. BFD doesn't set up the neighbor otherwise. Alternatively, you can set it up in unassociated mode, covered below.

- The static route that points at the next hop must specify the egress interface if we're doing single-hop routes. (multi-hop covered below)

But what if the neighbor doesn't need a static route back to us?

Imagine R2 knew R1's routes via another protocol, or even a default, and had no need to set up static routes back towards it:

R2(config)#no ip route 1.1.1.1 255.255.255.255 GigabitEthernet1 192.168.12.1

R1(config)#int gig1

R1(config-if)#ip ospf 1 area 0

R1(config-if)#int lo0

R1(config-if)#ip ospf 1 area 0

R2(config)#int gig1

R2(config-if)#ip ospf 1 area 0

But, R1 still doesn't have a route to R2's Loopback0. So it needs that static.

Now, the BFD session has failed, because there's no route dependent on R2's statement:

ip route static bfd GigabitEthernet1 192.168.12.1

This is the default, "associated" behavior: the BFD session only comes up if a static route depends on it. An unassociated entry brings the BFD session up anyway:

R2(config)#ip route static bfd GigabitEthernet1 192.168.12.1 unassociate

R1(config-if)#do show bfd neigh

IPv4 Sessions

NeighAddr LD/RD RH/RS State Int

192.168.12.2 4097/1 Up Up Gi1

We can also hierarchically group static routes.

R1(config)#ip route static bfd GigabitEthernet1 192.168.12.2 group DOWNSTREAM

R1(config)#ip route static bfd GigabitEthernet1 192.168.12.50 group DOWNSTREAM passive

R1(config)#ip route static bfd GigabitEthernet1 192.168.12.75 group DOWNSTREAM passive

R1(config)#ip route 2.2.2.2 255.255.255.255 GigabitEthernet1 192.168.12.2

R1(config)#ip route 10.10.10.10 255.255.255.255 GigabitEthernet1 192.168.12.50

R1(config)#ip route 100.100.100.100 255.255.255.255 GigabitEthernet1 192.168.12.75

Let's walk this line by line:

R1(config)#ip route static bfd GigabitEthernet1 192.168.12.2 group DOWNSTREAM

This is our non-passive route - basically an anchor route. Let's say for example's sake that in our topology, if the BFD session to 192.168.12.2 (my neighbor) goes down, all the passive routes in my group, DOWNSTREAM, will also be offline. Perhaps they're all attached to some sort of shared Ethernet segment and 192.168.12.2 is the management IP of the first switch - if we can't reach it, we can't reach other devices on its link. There may be no reason to run BFD with the other devices, as perhaps they're on super-stable/redundant links. However, we need to pull them from our table if they're not reachable.

R1(config)#ip route static bfd GigabitEthernet1 192.168.12.50 group DOWNSTREAM passive

R1(config)#ip route static bfd GigabitEthernet1 192.168.12.75 group DOWNSTREAM passive

192.168.12.50 and 192.168.12.75 are imaginary next-hops on the shared Ethernet segment of 192.168.12.0. They don't exist in our topology anywhere, but they don't need to for our example. They're passive, meaning they're reliant on the status from the anchor BFD session (the non-passive entry). If it goes down, they need to fail their BFD "status" too.

R1(config)#ip route 2.2.2.2 255.255.255.255 GigabitEthernet1 192.168.12.2

This references our "anchor" next-hop, and is necessary for BFD to establish.

R1(config)#ip route 10.10.10.10 255.255.255.255 GigabitEthernet1 192.168.12.50

R1(config)#ip route 100.100.100.100 255.255.255.255 GigabitEthernet1 192.168.12.75

These are our routes that reference the passive BFD next-hops above.
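To make the anchor/passive relationship concrete, here's a rough Python model of it. This is only a sketch of the behavior as I've described it, not how IOS implements it; the class and function names are made up for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class BfdGroup:
    name: str
    anchor: str                                  # non-passive next hop that actually runs BFD
    passive: set = field(default_factory=set)    # next hops that only inherit state
    anchor_up: bool = False

    def next_hop_usable(self, next_hop: str) -> bool:
        """Every group member, anchor or passive, follows the anchor's state."""
        if next_hop == self.anchor or next_hop in self.passive:
            return self.anchor_up
        return True  # next hops outside the group aren't tracked by it


def installed_routes(static_routes: dict, group: BfdGroup) -> dict:
    """Return only the static routes whose next hop is currently usable."""
    return {prefix: nh for prefix, nh in static_routes.items()
            if group.next_hop_usable(nh)}


downstream = BfdGroup("DOWNSTREAM", anchor="192.168.12.2",
                      passive={"192.168.12.50", "192.168.12.75"})
routes = {
    "2.2.2.2/32": "192.168.12.2",
    "10.10.10.10/32": "192.168.12.50",
    "100.100.100.100/32": "192.168.12.75",
}

downstream.anchor_up = True
print(sorted(installed_routes(routes, downstream)))   # all three routes
downstream.anchor_up = False
print(sorted(installed_routes(routes, downstream)))   # empty list
```

The point is that only one real BFD session exists (to 192.168.12.2), yet all three static routes live or die with it.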

BFD is already up, and we can see the imaginary downstream hosts 10.10.10.10 and 100.100.100.100 are installed in our routing table.

R1#sh ip route | b subnets

1.0.0.0/32 is subnetted, 1 subnets

C 1.1.1.1 is directly connected, Loopback0

2.0.0.0/32 is subnetted, 1 subnets

S 2.2.2.2 [1/0] via 192.168.12.2, GigabitEthernet1

10.0.0.0/32 is subnetted, 1 subnets

S 10.10.10.10 [1/0] via 192.168.12.50, GigabitEthernet1

100.0.0.0/32 is subnetted, 1 subnets

S 100.100.100.100 [1/0] via 192.168.12.75, GigabitEthernet1

192.168.12.0/24 is variably subnetted, 2 subnets, 2 masks

C 192.168.12.0/24 is directly connected, GigabitEthernet1

L 192.168.12.1/32 is directly connected, GigabitEthernet1

I'm going to fail the interface on R2, which, of important note, does not bring down the line protocol on R1 in my virtual lab.

R2#conf t

Enter configuration commands, one per line. End with CNTL/Z.

R2(config)#int gig1

R2(config-if)#shut

R1#sh ip route | b subnets

1.0.0.0/32 is subnetted, 1 subnets

C 1.1.1.1 is directly connected, Loopback0

192.168.12.0/24 is variably subnetted, 2 subnets, 2 masks

C 192.168.12.0/24 is directly connected, GigabitEthernet1

L 192.168.12.1/32 is directly connected, GigabitEthernet1

And all three routes are gone!

Now, just to prove my line protocol is still up on that subnet, and this is actually BFD removing the routes and it isn't just a generic next-hop failure:

R1#sh ip cef 192.168.12.50

192.168.12.0/24

attached to GigabitEthernet1

R1#sh ip cef 192.168.12.75

192.168.12.0/24

attached to GigabitEthernet1

The IPv6 implementation isn't quite as feature-filled:

R1(config)#ipv6 route static bfd GigabitEthernet1 12::2

R1(config)#ipv6 route 2::/64 GigabitEthernet1 12::2

R2(config)#ipv6 route static bfd GigabitEthernet1 12::1 unassociated

R2(config)#do show bfd neigh | b IPv6

IPv6 Sessions

NeighAddr LD/RD RH/RS State Int

12::1 2/2 Up Up Gi1

This works much the same way IPv4 does - R1 specifies an associated BFD neighbor and corresponding static route, R2 has an unassociated route (we'll assume it knows how to get back to R1 through other means). And... that's it for v6. No groups!

I've left all the multihop options for one section, as they all share some of the same configuration.

We'll start with multihop IPv4 BGP. Now we'll be peering R1 to R4. I've set up interim IGPs throughout; assume full reachability.

R1(config)#bfd-template multi-hop MHOP-TEMPLATE

R1(config-bfd)#interval min-tx 300 min-rx 300 multiplier 3

R1(config)#bfd map ipv4 4.4.4.0/24 1.1.1.0/24 MHOP-TEMPLATE

R1(config)#router bgp 14

R1(config-router)#neighbor 4.4.4.4 remote-as 14
R1(config-router)#neighbor 4.4.4.4 update-source lo0

R1(config-router)#neighbor 4.4.4.4 fall-over bfd multi-hop

R4(config)#bfd-template multi-hop MHOP-TEMPLATE

R4(config-bfd)#interval min-tx 300 min-rx 300 multiplier 3

R4(config)#bfd map ipv4 1.1.1.1/32 4.4.4.4/32 MHOP-TEMPLATE

R4(config)#router bgp 14

R4(config-router)#neighbor 1.1.1.1 remote-as 14

R4(config-router)#neighbor 1.1.1.1 update-source lo0

R4(config-router)#neighbor 1.1.1.1 fall-over bfd multi-hop

We'll walk through this config as well:

R1(config)#bfd-template multi-hop MHOP-TEMPLATE
R1(config-bfd)#interval min-tx 300 min-rx 300 multiplier 3

This is just a series of settings to apply to the multi-hop session. Clearly we can't glean it from the interface BFD configuration because there might be different settings for different neighbors. Of note, you can also set authentication here. There are also single-hop templates, which we'll talk about later.

R1(config)#bfd map ipv4 4.4.4.0/24 1.1.1.0/24 MHOP-TEMPLATE

The BFD map is the slightly confusing part. This statement could be interpreted as:

"If I establish a multi-hop BFD session to a destination inside 4.4.4.0/24, sourced from any of my interfaces inside of 1.1.1.0/24, then use the settings from MHOP-TEMPLATE"

Note it doesn't matter what mask size you use on this. In fact, if you look at R4, I specifically used /32s instead, just to prove a point. As long as the mask encompasses the IPs in question, you're good.

It's also important to note that the BFD map isn't neighbor discovery or a static neighbor. It just assigns settings to a neighbor session that another protocol informs BFD of.

It's also important to note that, at least for me, the configuration order is backwards from the way I think. It's destination/source: 4.4.4.0/24 is my TARGET, 1.1.1.0/24 is my SOURCE. I mistype it almost every time, because I think source/dest.
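If the destination/source ordering and the "any mask that covers the IPs" rule are hard to keep straight, this little Python sketch mimics the matching as I understand it. The function name is hypothetical; it only uses the standard ipaddress module and the addresses from the configs above.

```python
import ipaddress


def bfd_map_matches(dest_prefix: str, source_prefix: str,
                    session_dst: str, session_src: str) -> bool:
    """Mimic 'bfd map ipv4 <dest_prefix> <source_prefix> <template>'.

    The map applies when the session's destination falls inside the FIRST
    prefix and the session's local source falls inside the SECOND.
    """
    dst_net = ipaddress.ip_network(dest_prefix)
    src_net = ipaddress.ip_network(source_prefix)
    return (ipaddress.ip_address(session_dst) in dst_net
            and ipaddress.ip_address(session_src) in src_net)


# R1's map: /24s that merely cover the loopbacks.
print(bfd_map_matches("4.4.4.0/24", "1.1.1.0/24", "4.4.4.4", "1.1.1.1"))

# R4's map: exact /32s work just as well.
print(bfd_map_matches("1.1.1.1/32", "4.4.4.4/32", "1.1.1.1", "4.4.4.4"))

# My usual mistake on R1: typing it source/dest. No match, so no template.
print(bfd_map_matches("1.1.1.0/24", "4.4.4.0/24", "4.4.4.4", "1.1.1.1"))
```

The first two calls match; the third, with the prefixes swapped, doesn't.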

R1(config)#router bgp 14

R1(config-router)#neighbor 4.4.4.4 remote-as 14
R1(config-router)#neighbor 4.4.4.4 update-source lo0

R1(config-router)#neighbor 4.4.4.4 fall-over bfd multi-hop

The BGP config is pretty obvious.

And the outcome...

R1(config-router)#do show bfd neigh

IPv4 Multihop Sessions
NeighAddr[vrf] LD/RD RH/RS State
4.4.4.4 4097/4097 Up Up

Let's validate by shutting down the link between R2 and R3, which is not participating in BFD other than forwarding packets for R1 and R4.

R2(config)#int gig2
R2(config-if)#shut

R1(config-router)#
*Jun 24 04:26:34.371: %BGP-5-NBR_RESET: Neighbor 4.4.4.4 reset (BFD adjacency down)
*Jun 24 04:26:34.371: %BGP-5-ADJCHANGE: neighbor 4.4.4.4 Down BFD adjacency down
*Jun 24 04:26:34.372: %BGP_SESSION-5-ADJCHANGE: neighbor 4.4.4.4 IPv4 Unicast topology base removed from session BFD adjacency down

BGP IPv6 multi-hop is identical, so I'm not going to demonstrate it here.

You may want to consider QoS on the interim routers when it comes to BFD. It's not very helpful if your RTP packets continuously push your BFD packets out of the way, only to have BFD declare the path down:

- BFD packets are marked with precedence 6 by default

- Be sure the value isn't reset by your interim routers, and that they prioritize/LLQ the Prec 6 traffic.
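If your interim routers classify on DSCP rather than precedence, it helps to know what precedence 6 looks like in those terms. A quick Python sketch of the standard bit arithmetic (the helper names are mine):

```python
def precedence_to_dscp(precedence: int) -> int:
    """IP precedence is the top 3 bits of the 6-bit DSCP field."""
    if not 0 <= precedence <= 7:
        raise ValueError("IP precedence must be 0-7")
    return precedence << 3


def dscp_to_tos_byte(dscp: int) -> int:
    """DSCP is the top 6 bits of the 8-bit ToS / traffic class byte."""
    if not 0 <= dscp <= 63:
        raise ValueError("DSCP must be 0-63")
    return dscp << 2


dscp = precedence_to_dscp(6)
print(dscp)                          # 48, i.e. CS6
print(hex(dscp_to_tos_byte(dscp)))   # 0xc0
```

So a class-map matching DSCP CS6 (48) on the transit routers catches the same packets as one matching precedence 6.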

This leads us into static route multihop. I've removed the previous BGP config.

Much the same as BGP multihop BFD, static route multihop BFD uses multihop templates and BFD maps.
Let's create a static route multihop session between R1 and R3. I've added a new loopback to R3, Lo1, with IP address 33.33.33.33/32 for validation purposes. It is not in the IGP.

R1(config)#bfd-template multi-hop MHOP-TEMPLATE
R1(config-bfd)#interval min-tx 300 min-rx 300 multiplier 3

R1(config)#bfd map ipv4 192.168.23.0/24 192.168.12.0/24 MHOP-TEMPLATE

R1(config)#ip route static bfd 192.168.23.3 192.168.12.1

R1(config)#ip route 33.33.33.33 255.255.255.255 192.168.23.3

R3(config)#bfd-template multi-hop MHOP-TEMPLATE

R3(config-bfd)#interval min-tx 300 min-rx 300 multiplier 3

R3(config-bfd)#bfd map ipv4 192.168.12.0/24 192.168.23.0/24 MHOP-TEMPLATE

R3(config)#ip route static bfd 192.168.12.1 192.168.23.3 unassociate

R1(config)#do show bfd neigh

IPv4 Multihop Sessions

NeighAddr[vrf] LD/RD RH/RS State

192.168.23.3 4097/4097 Up Up

Most of this config should be familiar if you read the entire article up until now, but there are some peculiar ways to do this incorrectly that will bust it.

R1(config)#ip route static bfd 192.168.23.3 192.168.12.1

This may seem very similar to single-hop, but here's a sample from single-hop above:

R2(config)#ip route static bfd GigabitEthernet1 192.168.12.1

Note the lack of an interface on multi-hop, and the presence of one in single-hop. These settings are mutually exclusive: you must not specify an interface on multi-hop, and you must specify an interface on single-hop. This is very poorly documented, unfortunately - the samples on the DocCD do show the right thing, but it never calls it out like this.

R1(config)#ip route 33.33.33.33 255.255.255.255 192.168.23.3

A normal static route from our multihop config - but let's look at our earlier single-hop sample:

R2(config)#ip route 1.1.1.1 255.255.255.255 GigabitEthernet1 192.168.12.1

Now this genuinely surprised me. If you specify the interface on a static route with multi-hop - even though all the other information needed is present - destination prefix and next hop - it will break multi-hop BFD. On the other hand you must have it for single-hop. Check out a quick before & after on multihop:

R1(config)#ip route 33.33.33.33 255.255.255.255 192.168.23.3

R1(config)#do show bfd neigh

IPv4 Multihop Sessions

NeighAddr[vrf] LD/RD RH/RS State

192.168.23.3 4097/4097 Up Up

R1(config)#no ip route 33.33.33.33 255.255.255.255 192.168.23.3

R1(config)#ip route 33.33.33.33 255.255.255.255 Gigabit1 192.168.23.3

R1(config)#do show bfd neigh

R1(config)#
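Since this interface rule is the one I trip over most, here it is as a tiny Python check. It's just a restatement of what I observed in the lab, with a hypothetical function name; it doesn't parse real IOS config.

```python
def static_bfd_config_ok(multihop: bool,
                         bfd_statement_has_interface: bool,
                         static_route_has_interface: bool) -> bool:
    """Lab-observed rule for static-route BFD.

    Single-hop: both the 'ip route static bfd' statement and the static
    route must name the egress interface.
    Multi-hop: neither of them may name an interface.
    """
    if multihop:
        return not bfd_statement_has_interface and not static_route_has_interface
    return bfd_statement_has_interface and static_route_has_interface


# Working multi-hop config: no interfaces anywhere.
print(static_bfd_config_ok(True, False, False))
# The "after" case above: interface added to the static route.
print(static_bfd_config_ok(True, False, True))
# Working single-hop config from earlier in the article.
print(static_bfd_config_ok(False, True, True))
```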

Multi-hop static BFD can also use groups like single-hop can, but the config is identical (aside from not specifying the egress interfaces!), so I'm going to skip them here for brevity.

I've referred to echo mode in various places in the article up until now. Echo mode is a very clever way of decreasing BFD's hit on the CPU. It took me a while to figure out how it worked, however, mostly because the RFC wins the "too vague" award of the year: "When the Echo function is active, a stream of BFD Echo packets is transmitted in such a way as to have the other system loop them back through its forwarding path." http://tools.ietf.org/html/rfc5880

I already knew echo mode was a way to save on CPU, so I theorized that the idea was to get the BFD "are you up?" packets to be processed in fast switching instead of the control plane, but that description doesn't exactly explain it programmatically. After more googling and some Wireshark, I figured out the implementation.

Echo is single-hop only, so let's use R1 and R2 as my examples.

R1 sends an echo packet (instead of a control packet) to R2, formatted as:

L3 Source: R1 (192.168.12.1)

L3 Destination: R1 (192.168.12.1)

MAC Source: Itself (000c.298f.aca3)

MAC Destination: R2 (000c.29cf.21ff)

R2 receives this packet, sees that the L3 destination is R1, and CEF-switches it straight back to R1! In this fashion, R1 knows that R2 is reachable.

R2 would perform similar behavior towards R1, for its own echo process.
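Here's the same trick as a toy Python simulation. It's a deliberately simplified model of a forwarding decision, not real CEF, and the class and function names are mine. The IPs and MACs are the ones from my lab above.

```python
from dataclasses import dataclass
from typing import Optional

R1_IP, R1_MAC = "192.168.12.1", "000c.298f.aca3"
R2_IP, R2_MAC = "192.168.12.2", "000c.29cf.21ff"


@dataclass(frozen=True)
class Frame:
    src_mac: str
    dst_mac: str
    src_ip: str
    dst_ip: str


def build_echo(sender_ip: str, sender_mac: str, neighbor_mac: str) -> Frame:
    # L3 source AND destination are the sender itself; only the L2
    # destination points at the neighbor.
    return Frame(sender_mac, neighbor_mac, sender_ip, sender_ip)


def forward(frame: Frame, my_ip: str, my_mac: str, arp_table: dict) -> Optional[Frame]:
    """Very simplified data-plane decision on the receiving router."""
    if frame.dst_mac != my_mac:
        return None  # not for us at L2, ignore it
    if frame.dst_ip == my_ip:
        return None  # addressed to us: punted to the control plane, not forwarded
    # Not ours at L3: rewrite the L2 header and switch it toward dst_ip.
    return Frame(my_mac, arp_table[frame.dst_ip], frame.src_ip, frame.dst_ip)


echo = build_echo(R1_IP, R1_MAC, R2_MAC)
returned = forward(echo, R2_IP, R2_MAC, arp_table={R1_IP: R1_MAC})
print(returned)
```

R2 never has to "understand" the echo packet. It just forwards a packet whose destination happens to be the router that sent it, which is why the control plane stays out of it.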

There's more to know, however:

- The echo packets are sent at the rate negotiated in the BFD interval (on interface or single-hop template)

- Echo mode is only supported single-hop, obviously.

- Control-plane packets are still sent, but they are sent at the "slow timers" speed, specified as: bfd slow-timers <speed>. Since the control packets are no longer vital to knowing that the neighbor is up at high-speed, you can crank down these heavier-CPU-intensive packets to slower rates.

- The Cisco documentation says you need to disable ICMP redirects first - as technically speaking, the traffic above should generate a redirect - but in modern 15.1x+ IOS I have yet to see this requirement; it appears IOS is smart enough to know not to send redirects to echo packets.

- Echo mode is on by default. It needs to be on on both sides of the link in order to work.

On a side note, I've periodically had problems getting echo mode to come up when labbing on the CSR1000v; it usually seems to have to do with other BFD config on the device. I would call it a bug. With some cleanup and tinkering you can usually get it to come up.

R1(config)#int gig1

R1(config-if)#bfd echo ! this is on by default, but I'd disabled it earlier in the article.

R1(config-if)#ip ospf bfd

R1(config-if)#exit

R1(config)#bfd slow-timers 30000 ! send control packets every 30 seconds

R2(config)#int gig1

R2(config-if)#bfd echo

R2(config-if)#ip ospf bfd

R2(config-if)#exit

R2(config)#bfd slow-timers 30000

R1#show bfd neigh det | i echo

Session state is UP and using echo function with 400 ms interval.

R1#show bfd neigh det | i Min

MinTxInt: 30000000, MinRxInt: 30000000, Multiplier: 5

Received MinRxInt: 30000000, Received Multiplier: 3

Min tx interval: 30000000 - Min rx interval: 30000000

Min Echo interval: 400000

We now see in "Min Echo interval" that the echo packets are going at the pace we expected control packets at before (400ms - negotiated by the interface values), and control packets are now sending every 30 seconds.
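Where does 400 ms come from? A minimal Python sketch of the negotiation, assuming the interface values used elsewhere in this article (R1: interval 200 min_rx 500 multiplier 5, R2: interval 250 min_rx 400 multiplier 3). The "slower side wins" and detection-time rules are from RFC 5880; treating the neighbor's min_rx as its echo receive interval is my assumption, though it matches the output above. Function names are mine.

```python
def negotiated_tx_ms(local_min_tx_ms: int, remote_min_rx_ms: int) -> int:
    """A sender may not go faster than the receiver says it can handle."""
    return max(local_min_tx_ms, remote_min_rx_ms)


def detection_time_ms(remote_min_tx_ms: int, local_min_rx_ms: int,
                      remote_multiplier: int) -> int:
    """Async-mode detection time: the remote side's multiplier times the
    rate the remote side actually transmits at."""
    return remote_multiplier * max(remote_min_tx_ms, local_min_rx_ms)


R1 = {"tx": 200, "rx": 500, "mult": 5}
R2 = {"tx": 250, "rx": 400, "mult": 3}

# R1 wants to send every 200 ms, but R2 only accepts every 400 ms.
print(negotiated_tx_ms(R1["tx"], R2["rx"]))                  # 400

# Without echo, R1 declares R2 dead after 3 missed packets at 500 ms.
print(detection_time_ms(R2["tx"], R1["rx"], R2["mult"]))     # 1500
```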

I mentioned single-hop templates briefly above. They're not of much use outside of authentication and dampening:

R1(config)#bfd-template single-hop TEST

R1(config-bfd)#?

BFD template configuration commands:

authentication Authentication type

dampening Enable session dampening

echo Use echo adjunct as bfd detection mechanism

interval Transmit interval between BFD packets

Dampening works much the same way as any other protocol's dampening works. If the BFD session flaps a bunch, mark it as "down" (pull it out of the routing table) for a certain amount of time to wait on stabilization.

I did lab this and it does work, but it's too hard to demonstrate it in a blog, so here's the basic usage:

R1(config)#bfd-template single-hop TEST-SH

R1(config-bfd)#interval both 300 multiplier 3

R1(config-bfd)#dampening 5 4000 4000 10

R1(config-bfd)#int gig1

R1(config-if)#bfd ?

R1(config-if)#no bfd interval 200 min_rx 500 multiplier 5 ! mutually exclusive with a single-hop template

R1(config-if)#bfd template TEST-SH

BFD Authentication is also reasonably straightforward.

R1(config-if)#key chain BFD

R1(config-keychain)#key 1

R1(config-keychain-key)#key-string cisco

R1(config-keychain-key)#exit

R1(config-keychain)#exit

R1(config)#bfd-template single-hop TEST-SH

R1(config-bfd)#authentication sha-1 keychain BFD

Since we configured this on only one side....

R1(config-bfd)#do show bfd neigh

IPv4 Sessions

NeighAddr LD/RD RH/RS State Int

192.168.12.2 4097/0 Down Down Gi1

R2(config-if)#key chain BFD

R2(config-keychain)#key 1

R2(config-keychain-key)#key-string cisco

R2(config-keychain-key)#exit

R2(config-keychain)#exit

R2(config)#bfd-template single-hop TEST-SH

R2(config-bfd)#interval both 300 multiplier 3

R2(config-bfd)#authentication sha-1 keychain BFD

R2(config-bfd)#int gig1

R2(config-if)#no bfd interval 250 min_rx 400 multiplier 3

R2(config-if)#bfd template TEST-SH

R1(config-bfd)#do show bfd neigh

IPv4 Sessions

NeighAddr LD/RD RH/RS State Int

192.168.12.2 4097/4097 Up Up Gi1

BFD authentication can use MD5, SHA1, or meticulous MD5 or SHA1. So what's meticulous? Out of scope of this document, but here's the IETF draft: http://tools.ietf.org/html/draft-ietf-bfd-generic-crypto-auth-06

And last but certainly not least, how do you debug BFD? Honestly, most of the times I break BFD, it's because I missed a requirement - for example, forgetting to put an egress interface on single-hop static routes. In these circumstances, you get nearly zero debug output, because IOS doesn't detect that anything needs to happen.

If you can get BFD to realize you're trying to get it to work, you can see some inner-workings with:

debug bfd event

I hope you enjoyed,

Jeff

DMVPN

2014-05-04T15:17:00.001-07:00

In this post we'll take a look at DMVPN from the perspective of what I suspect will be on the CCIE R&S v5 blueprint. Admittedly I'm guessing, as all Cisco has released is "single hub DMVPN"; some of the surrounding/related topics come from practice labs I've seen, and the rest are guesses on my part.

I'm going to briefly show some scenarios which require you to think beyond single-hub design for the command structure to make sense. I can absolutely imagine Cisco would throw requirements for commands that only make sense in a larger network into the lab. My preference for my blog is to understand the practicality, design theory, and use cases behind commands, not just "if you apply this you get action X".

So, at a high level - What is DMVPN?

DMVPN stands for Dynamic Multipoint Virtual Private Network. It's a Cisco proprietary tunnel technology with a hub-and-spoke control-plane and spoke to spoke tunnels. Assuming "Phase 2" or newer (more on phases later), a normal use case is to establish a full-mesh VPN over the Internet with minimal configuration. For example, having 10 routers that all needed VPNs to one another would have the "full mesh formula" apply of N(N-1)/2, or 10(10-1)/2 = 45 tunnels. That's a lot of config. On the other hand, with DMVPN, you create the config for just 10 tunnels. The 45 might still happen if every router in fact needed to contact every other router at the same time, but we let the routers handle that part dynamically.
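The arithmetic is easy to check for other network sizes. A quick Python sketch (the function name is mine):

```python
def full_mesh_tunnels(routers: int) -> int:
    """Point-to-point tunnels needed for a full mesh: N(N-1)/2."""
    if routers < 0:
        raise ValueError("router count cannot be negative")
    return routers * (routers - 1) // 2


for n in (4, 10, 50):
    print(f"{n} routers: {full_mesh_tunnels(n)} static tunnels "
          f"vs {n} DMVPN tunnel interfaces")
```

For 10 routers that's 45 versus 10. The gap grows quickly, because the static count is quadratic while the DMVPN config count is linear.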

Here's our diagram:

R1 will be our single hub, R2, R3, and R4 are all spokes. "INTERNET" represents the Internet. In theory, these routers could alternatively be dozens of hops from each other, but the concept doesn't change.

As I explained above, DMVPN's control plane is hub-and-spoke, and as R1 is our hub, whatever routing protocol we're using will be pinned up via R1 to each individual spoke.

So our control plane will look like this:

However, our traffic flows can be full-mesh, so the data plane will (theoretically) look like this:

This is largely dependent on which routers needed to talk to which other routers. While hub-to-spoke tunnels are always up, spoke-to-spoke are "on demand" and are established dynamically.

The nuts and bolts of how this works depends largely on what development "phase" of DMVPN you're using. We'll talk more about that shortly. First, let's take a high-level look at the three technologies that make up DMVPN - MGRE, IPSEC, and NHRP.

GRE - Generic Routing Encapsulation - creates unencrypted tunnels between two endpoints. MGRE creates Multipoint GRE tunnels. These tunnels can be established to endpoints based on information discovered via NHRP, discussed below.

I'm going to assume the audience has had a general exposure to IPSEC in the past. In our case, we're just using it to optionally encrypt the MGRE tunnel we're performing our routing on. I am not going to deep-dive IOS-based IPSEC with this post. One assumption I am making is that the IPSEC/VPN requirement for v5 is going to be "DocCD level", or something you can pull out of the documentation "stock" or "near stock" on short notice.

NHRP - this is really what makes the magic happen on DMVPN. NHRP, at its core, resolves private addresses (those behind MGRE and optionally IPSEC) to a public address. In our example, that public network will be assumed to be the Internet. NHRP treats this public network like a big NBMA area. In fact, several comparisons can be drawn between NBMA frame relay and NHRP/DMVPN, to the point where I'm betting some of the old frame-relay tricks from the R&S lab will be repeated in DMVPN. NHRP facilitates registration between the spokes and the hubs, and helps the spokes resolve the public address of another spoke based on the tunneled IPs behind it.
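Conceptually, the hub's NHRP role is just a lookup table that spokes write to (registration) and read from (resolution). Here's a toy Python model of that idea. It ignores everything about the real protocol (packet formats, timers, authentication). The class name is made up, and the spokes' public addresses are hypothetical values inside 87.14.0.0/16; only the hub's 87.14.10.1 comes from my lab.

```python
from typing import Optional


class NhrpServer:
    """Toy next-hop server: maps tunnel (private) IPs to NBMA (public) IPs."""

    def __init__(self) -> None:
        self.cache = {}

    def register(self, tunnel_ip: str, nbma_ip: str) -> None:
        # A spoke tells the hub "my tunnel IP lives behind this public IP".
        self.cache[tunnel_ip] = nbma_ip

    def resolve(self, tunnel_ip: str) -> Optional[str]:
        # A spoke asks "which public IP do I tunnel to for this tunnel IP?"
        return self.cache.get(tunnel_ip)


hub = NhrpServer()
hub.register("10.0.0.1", "87.14.10.1")   # the hub itself (from the lab)
hub.register("10.0.0.2", "87.14.20.1")   # R2, hypothetical public IP
hub.register("10.0.0.3", "87.14.30.1")   # R3, hypothetical public IP

# R2 wants to build a spoke-to-spoke tunnel to R3's tunnel address:
print(hub.resolve("10.0.0.3"))
# R4 hasn't registered yet, so there's nothing to resolve:
print(hub.resolve("10.0.0.4"))
```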

Next, let's look at the three phases of DMVPN and some sample config for all of them.

DMVPN "Phase 1". This phase is largely unused, and, as I understand it, was an early deployment model. When most people refer to "DMVPN" these days, they're talking about the behavior expected from Phase 2 or Phase 3, not Phase 1.

Phase 1 pins not only the control plane through the hub, but also the data plane, so all your traffic goes through the hub.

The differentiating components of Phase 1 are:
- An MGRE tunnel on the hub
- A standard GRE tunnel on the spokes
- A routing protocol on the hub that sets next-hop-self

A sample config based on our diagram from above:

For brevity, this config is applied on all four routers identically, but I will only show it here once:
crypto isakmp policy 1
encr aes 256
authentication pre-share
group 5

crypto isakmp key ABCcisco123 address 0.0.0.0

crypto ipsec transform-set TRANSFORM-SET esp-aes esp-sha-hmac
mode transport

crypto ipsec profile IPSEC_PROFILE
set transform-set TRANSFORM-SET

!R1 - The hub
interface Tunnel1
ip address 10.0.0.1 255.255.255.0
no ip redirects
no ip split-horizon eigrp 100
ip nhrp authentication CISCO
ip nhrp map multicast dynamic
ip nhrp network-id 1
tunnel source FastEthernet0/0
tunnel mode gre multipoint

tunnel protection ipsec profile IPSEC_PROFILE

router eigrp 100

network 1.1.1.1 0.0.0.0

network 10.0.0.0 0.0.0.255

!R2 - A Spoke

interface Tunnel1

ip address 10.0.0.2 255.255.255.0

ip nhrp authentication CISCO

ip nhrp map 10.0.0.1 87.14.10.1

ip nhrp network-id 1

ip nhrp nhs 10.0.0.1

tunnel source FastEthernet0/0

tunnel destination 87.14.10.1

tunnel protection ipsec profile IPSEC_PROFILE

router eigrp 100

network 2.2.2.2 0.0.0.0

network 10.0.0.0 0.0.0.255

!R3 - Another Spoke

interface Tunnel1

ip address 10.0.0.3 255.255.255.0

ip nhrp authentication CISCO

ip nhrp map 10.0.0.1 87.14.10.1

ip nhrp network-id 1

ip nhrp nhs 10.0.0.1

tunnel source FastEthernet0/0

tunnel destination 87.14.10.1

tunnel protection ipsec profile IPSEC_PROFILE

router eigrp 100

network 3.3.3.3 0.0.0.0

network 10.0.0.0 0.0.0.255

!R4 - Another Spoke

interface Tunnel1

ip address 10.0.0.4 255.255.255.0

ip nhrp authentication CISCO

ip nhrp map 10.0.0.1 87.14.10.1

ip nhrp network-id 1

ip nhrp nhs 10.0.0.1

tunnel source FastEthernet0/0

tunnel destination 87.14.10.1

tunnel protection ipsec profile IPSEC_PROFILE

router eigrp 100

network 4.4.4.4 0.0.0.0

network 10.0.0.0 0.0.0.255

Assume the "Internet" router knows how to reach all public IP space on the 87.14.0.0/16 network, and that each router participating in DMVPN has a private loopback of X.X.X.X, where X is its router number.

Before we look at the output from this config, let's break it apart a bit.

Note: I'm going to deliberately ignore most of the crypto config; this can be pulled out of the DocCD very easily from "Security" and then "Secure Connectivity Configuration Guide Library, Cisco IOS Release 15M&T", and then "Dynamic Multipoint VPN Configuration Guide, Cisco IOS Release 15M&T".

On R1, the hub -

crypto ipsec transform-set TRANSFORM-SET esp-aes esp-sha-hmac
mode transport

This is the only part of the crypto config I'm going to drill into. I was initially confused as to when to use "mode tunnel" and when to use "mode transport". I've seen examples with both. Unless you are doing a multi-tier DMVPN hub (one set of routers doing crypto-only, another set doing NHRP and the routing), which is clearly out of scope of the R&S lab, you want to use transport. Tunnel adds 20 bytes of overhead and comes out with the same exact results as transport. I suppose if you got a lab question that said "use the IPSEC method that requires more overhead", this might be important? The rest of this document will assume we are using transport only.

no ip split-horizon eigrp 100 - Clearly, we're going to be taking EIGRP routes in from one spoke and passing them to another. If we don't disable split horizon that process will not happen.

ip nhrp authentication CISCO - This is a plain-text key, more of an "identifier" than a password. Keep in mind that this traffic will be inside IPSEC, so it doesn't need its own encryption method per se.

ip nhrp map multicast dynamic - This tells the hub to pseudo-multicast to any spoke that registers to it. This is (usually) necessary for the routing protocol to communicate.

ip nhrp network-id 1 - This is a local identifier only. It is not communicated across the network. Think of it similarly to the OSPF process number. You must have it enabled, and it must be unique to your tunnel, or NHRP will not work.

tunnel source FastEthernet0/0 - where to source tunnel packets from

tunnel mode gre multipoint - This may as well read "tunnel mode gre dmvpn". A GRE multipoint tunnel, by its nature, must use NHRP for resolution.

tunnel protection ipsec profile IPSEC_PROFILE - Encrypt this tunnel with our IPSEC config from above. Note, the IPSEC config above used a pre-shared key (PSK), but it's worth pointing out that a public key infrastructure (PKI) can be used as well, although that is beyond the scope of this document.

On R2, a spoke (excluding any repetition of commands I explained on the hub) -

ip nhrp map 10.0.0.1 87.14.10.1 - This is a lot like the frame-relay map command, that, if you were a student of CCIE v4, you are well familiar with. In this case, we're mapping private IP 10.0.0.1 to NBMA IP 87.14.10.1. This is to facilitate registration to the hub.

ip nhrp nhs 10.0.0.1 - nhs stands for "next hop server", which is the hub. This basically says "register my private IP address to my NBMA address (87.14.20.1 in this case) with the hub, so it knows how to reach me".

tunnel destination 87.14.10.1 - You'll note a lack of the tunnel mode gre multipoint command on this tunnel. That's because in phase 1, the spokes only get regular GRE tunnels. So in this case, we have to set the destination statically to the hub.

Let's now look at the outcome of all this on R1, the hub.

R1#sh ip nhrp

10.0.0.2/32 via 10.0.0.2

Tunnel1 created 00:52:07, expire 01:35:14

Type: dynamic, Flags: unique registered used

NBMA address: 87.14.20.1

10.0.0.3/32 via 10.0.0.3

Tunnel1 created 00:51:25, expire 01:35:13

Type: dynamic, Flags: unique registered used

NBMA address: 87.14.30.1

10.0.0.4/32 via 10.0.0.4

Tunnel1 created 00:50:55, expire 01:35:13

Type: dynamic, Flags: unique registered used

NBMA address: 87.14.40.1

We see the three mappings for the three spokes that registered to the hub. We see type "dynamic" - meaning it was learned through registration, "unique registered" - meaning the spoke has instructed the hub not to take a registration from another NBMA address but with the same private address, and of course we see the NBMA address for each IP listed.
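To make the "unique" flag a bit more concrete, here's a toy Python model of the hub's registration table. It's only a sketch: the names (HubCache, register) are mine and not anything from IOS, the 87.14.99.1 address is made up for the example, and it assumes every spoke registers with the unique flag set, as in the output above.

```python
class HubCache:
    """Toy model of the hub's NHRP registration table (not IOS code)."""

    def __init__(self):
        # tunnel (private) IP -> (NBMA IP, unique flag)
        self.entries = {}

    def register(self, tunnel_ip, nbma_ip, unique=True):
        """Accept a registration unless it collides with a 'unique' entry."""
        existing = self.entries.get(tunnel_ip)
        if existing is not None:
            old_nbma, old_unique = existing
            if old_unique and old_nbma != nbma_ip:
                return False  # same private IP, different NBMA: refused
        self.entries[tunnel_ip] = (nbma_ip, unique)
        return True


hub = HubCache()
for tunnel_ip, nbma_ip in [("10.0.0.2", "87.14.20.1"),
                           ("10.0.0.3", "87.14.30.1"),
                           ("10.0.0.4", "87.14.40.1")]:
    hub.register(tunnel_ip, nbma_ip)

# A second router claiming 10.0.0.2 from a different (hypothetical) NBMA
# address is refused, and the original mapping stays in place.
print(hub.register("10.0.0.2", "87.14.99.1"))  # False
print(hub.entries["10.0.0.2"][0])              # 87.14.20.1
```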

On the topic of NHRP mappings, optionally, we could also add static entries on the hub:

R1(config-if)#ip nhrp map 10.0.0.10 4.2.2.2

R1(config-if)#ip nhrp map multicast 4.2.2.2 ! optional

R1#sh ip nhrp | s 10.0.0.10

10.0.0.10/32 via 10.0.0.10

Tunnel1 created 00:00:09, never expire

Type: static, Flags:

NBMA address: 4.2.2.2

This entry will, of course, do nothing, but I wanted to demonstrate the idea.

We can also see who we're pseudo-multicasting towards:

R1#sh ip nhrp multicast

I/F NBMA address

Tunnel1 4.2.2.2 Flags: static

Tunnel1 87.14.20.1 Flags: dynamic

Tunnel1 87.14.30.1 Flags: dynamic

Tunnel1 87.14.40.1 Flags: dynamic

What about the routing protocol?

R1#sh ip eigrp neigh

EIGRP-IPv4 Neighbors for AS(100)

H Address Interface Hold Uptime SRTT RTO Q Seq

(sec) (ms) Cnt Num

2 10.0.0.2 Tu1 14 00:33:26 299 1794 0 13

1 10.0.0.4 Tu1 11 00:33:32 818 4908 0 13

0 10.0.0.3 Tu1 11 00:33:33 409 2454 0 16

We have EIGRP peerings with all the neighbors.

At this point, we should have any-to-any connectivity, via the hub. Let's test it out:

R4#ping 3.3.3.3 source lo0

Type escape sequence to abort.

Sending 5, 100-byte ICMP Echos to 3.3.3.3, timeout is 2 seconds:

Packet sent with a source address of 4.4.4.4

!!!!!

Success rate is 100 percent (5/5), round-trip min/avg/max = 168/205/240 ms

R4#trace 3.3.3.3 source lo0

Type escape sequence to abort.

Tracing the route to 3.3.3.3

VRF info: (vrf in name/id, vrf out name/id)

1 10.0.0.1 156 msec 152 msec 128 msec

2 10.0.0.3 224 msec 236 msec 252 msec

We see we have connectivity from 4.4.4.4 (loopback of R4) to 3.3.3.3 (loopback of R3), however, it goes through the hub - not very efficient, since the hub doesn't need to be in the forwarding path. That is, however, the drawback of phase 1.

Phase 2 is where DMVPN really starts to shine, because it gets the hub (more or less) out of the data plane forwarding path for spoke-to-spoke communication.

Building off our existing config, let's implement a phase 2 configuration.

R1:

interface Tunnel1

no ip next-hop-self eigrp

no ip nhrp map 10.0.0.10 4.2.2.2 ! just for cleanup

no ip nhrp map multicast 4.2.2.2 ! just for cleanup

R2-R4:

interface Tunnel1

no tunnel destination 87.14.10.1

tunnel mode gre multipoint

ip nhrp map multicast 87.14.10.1

We'll do a high-level breakdown of this config, then spend a good bit of time on the theory behind Phase 2. While the config isn't a complex change, there's a lot more going on behind the scenes.

On the hub:

no ip next-hop-self eigrp - This is absolutely vital. You can set up the rest of NHRP to happily work spoke-to-spoke, but if you don't tell the control plane to leave the next hops alone, you're going to get behavior very akin to phase 1.

On any spoke:

no tunnel destination 87.14.10.1 - this is only used with a regular GRE tunnel and isn't required any longer.

tunnel mode gre multipoint - Swap from a point-to-point to multipoint tunnel on the spokes. Now, the spokes will be using NHRP for resolution as well as the hub.

ip nhrp map multicast 87.14.10.1 - When we were using a standard GRE tunnel, it was inherently point-to-point, and natively supported multicast without any extra instructions. Now, we have to tell it to pseudo-multicast to the hub.

Before we deep dive into what's going on behind the scenes, let's look at what's changed.

R3#sh ip route 2.2.2.2

Routing entry for 2.2.2.2/32

Known via "eigrp 100", distance 90, metric 28288000, type internal

Redistributing via eigrp 100

Last update from 10.0.0.2 on Tunnel1, 00:04:52 ago

Routing Descriptor Blocks:

* 10.0.0.2, from 10.0.0.1, 00:04:52 ago, via Tunnel1

Route metric is 28288000, traffic share count is 1

Total delay is 105000 microseconds, minimum bandwidth is 100 Kbit

Reliability 255/255, minimum MTU 1434 bytes

Loading 1/255, Hops 2

We learned 2.2.2.2 (R2's loopback) on R3 via 10.0.0.1 (R1), but the forwarding path is via 10.0.0.2.

That's great, how do we reach 10.0.0.2?

R3#sh ip route 10.0.0.2

Routing entry for 10.0.0.0/24

Known via "connected", distance 0, metric 0 (connected, via interface)

Redistributing via eigrp 100

Routing Descriptor Blocks:

* directly connected, via Tunnel1

Route metric is 0, traffic share count is 1

OK, not so fast... while it is "on subnet" on Tunnel1, Tunnel1 is NBMA, so we can't just forward there without some type of resolution.

R3#sh ip cef 10.0.0.2 internal

10.0.0.0/24, epoch 0, flags attached, connected, cover dependents, need deagg, RIB[C], refcount 5, per-destination sharing

sources: RIB

feature space:

IPRM: 0x0003800C

subblocks:

gsb Connected chain head(1): 0x6AAF5DF4

Covered dependent prefixes: 3

need deagg: 2

notify cover updated: 1

ifnums:

Tunnel1(10)

path 6B108C6C, path list 6B101100, share 1/1, type connected prefix, for IPv4

connected to Tunnel1, adjacency punt

output chain: punt

The important lines are the bottom two. I've read in other blogs that we should see a "glean adjacency" for unresolved NHRP next hops, but I haven't been able to reproduce that on 15.2; I suspect Cisco changed the output. But there's our answer plain as day: punt. This traffic cannot be CEF forwarded because we have an unresolved dependency; we don't know how to get to 10.0.0.2.

The CPU knows we need NHRP for this to work, and it doesn't have a resolution in its NHRP cache yet:

R3#sh ip nhrp

10.0.0.1/32 via 10.0.0.1

Tunnel1 created 00:20:23, never expire

Type: static, Flags: used

NBMA address: 87.14.10.1

So how do we get it?

This is a reasonably clever process, and it only gets more clever once we get into Phase 3. The CPU, after the punt, will forward the traffic to the hub by default. This ensures while we're waiting on NHRP to do its thing and the spoke-to-spoke tunnel to build, we're not dropping packets. Generally speaking you can expect the first 2-3 packets to get punted.

On the hub, debug nhrp

On the spoke:

R3#ping 2.2.2.2 source lo0

Type escape sequence to abort.

Sending 5, 100-byte ICMP Echos to 2.2.2.2, timeout is 2 seconds:

Packet sent with a source address of 3.3.3.3

NHRP: Tunnels gave us remote_nbma: 87.14.30.1 for Redirect

NHRP: Receive Resolution Request via Tunnel1 vrf 0, packet size: 85

The hub knows R3 needs a resolution for R2.

NHRP: nhrp_rtlookup for destination on 10.0.0.2 yielded interface Tunnel1, prefixlen 24

NHRP-ATTR: In nhrp_recv_resolution_request NHRP Resolution Request packet is forwarded to 10.0.0.2

NHRP: Attempting to forward to destination: 10.0.0.2

NHRP: Attempting to send packet via DEST 10.0.0.2

NHRP: Encapsulation succeeded. Sending NHRP Control Packet NBMA Address: 87.14.20.1

NHRP: Forwarding Resolution Request via Tunnel1 vrf 0, packet size: 105

src: 10.0.0.1, dst: 10.0.0.2

NHRP: 129 bytes out Tunnel1

But it doesn't answer R3. Instead, it forwards the NHRP request to R2, which included R3's NBMA address. Not pictured here, it also forwards the ping packet from R3 to R2 at the same time, so that no traffic is lost.

Meanwhile, on R2... R2 has received the initial ping echo request, along with the NHRP control packet. R2 will now set up an encrypted (IPSEC) MGRE tunnel back to R3! However, in the meantime, it still needs to forward its echo reply, and we can't just stall that until the tunnel comes up. So we see the reverse traffic from R2, trying to resolve for R3.

NHRP: Receive Resolution Request via Tunnel1 vrf 0, packet size 85

NHRP: nhrp_rtlookup for destination on 10.0.0.3 yielded interface Tunnel1, prefixlen 24

NHRP-ATTR: In nhrp_recv_resolution_request NHRP Resolution Request packet is forwarded to 10.0.0.3

NHRP: Attempting to forward to destination: 10.0.0.3

NHRP: Attempting to send packet via DEST 10.0.0.3

NHRP: Encapsulation succeeded. Sending NHRP Control Packet NBMA Address: 87.14.30.1

And the traffic is delivered to R3, via R1.

During this process I've also enabled debug dmvpn all tunnel on R2, so we can see the crypto process fire off (note, this was also edited for brevity):

IPSEC-IFC MGRE/Tu1: Checking to see if we need to delay for src 87.14.20.1 dst 87.14.30.1

IPSEC-IFC MGRE/Tu1(87.14.20.1/87.14.30.1): Opening a socket with profile IPSEC_PROFILE

IPSEC-IFC MGRE/Tu1(87.14.20.1/87.14.30.1): connection lookup returned 0

IPSEC-IFC MGRE/Tu1(87.14.20.1/87.14.30.1): Triggering tunnel immediately.

IPSEC-IFC MGRE/Tu1: Adding Tunnel1 tunnel interface to shared list

IPSEC-IFC MGRE/Tu1: Need to delay.

IPSEC-IFC MGRE/Tu1(87.14.20.1/87.14.30.1): good socket ready message

IPSEC-IFC MGRE/Tu1(87.14.20.1/87.14.30.1): Got MTU message mtu 1458

IPSEC-IFC MGRE/Tu1(87.14.20.1/87.14.30.1): tunnel_protection_socket_up

IPSEC-IFC MGRE/Tu1(87.14.20.1/87.14.30.1): Signalling NHRP

And back on R3, we can see that the tunnel is up:

R3#show dmvpn | b Tunnel1

Interface: Tunnel1, IPv4 NHRP Details

Type:Spoke, NHRP Peers:2,

# Ent Peer NBMA Addr Peer Tunnel Add State UpDn Tm Attrb

----- --------------- --------------- ----- -------- -----

1 87.14.10.1 10.0.0.1 UP 00:04:38 S

1 87.14.20.1 10.0.0.2 UP 00:04:03 D

We can also see that now that we have resolution for R2 (and a dynamic tunnel), we can now directly CEF switch to it:

R3#sh ip cef 2.2.2.2 internal

2.2.2.2/32, epoch 0, RIB[I], refcount 5, per-destination sharing

sources: RIB

feature space:

IPRM: 0x00028000

ifnums:

Tunnel1(12): 10.0.0.2

path 6B1081EC, path list 6B100D40, share 1/1, type attached nexthop, for IPv4

nexthop 10.0.0.2 Tunnel1, adjacency IP midchain out of Tunnel1, addr 10.0.0.2 6AE17200

output chain: IP midchain out of Tunnel1, addr 10.0.0.2 6AE17200 IP adj out of FastEthernet0/0, addr 87.14.30.100 6925D980

We see the appropriate next-hop on Tunnel1, and no longer a mention of "punt".

Just to further prove the point:

R3#trace 2.2.2.2 source lo0

Type escape sequence to abort.

Tracing the route to 2.2.2.2

VRF info: (vrf in name/id, vrf out name/id)

1 10.0.0.2 172 msec 192 msec 184 msec

We also see we have an NHRP resolution now:

R3#sh ip nhrp

10.0.0.1/32 via 10.0.0.1

Tunnel1 created 00:08:01, never expire

Type: static, Flags: used

NBMA address: 87.14.10.1

10.0.0.2/32 via 10.0.0.2

Tunnel1 created 00:07:26, expire 01:52:36

Type: dynamic, Flags: router used

NBMA address: 87.14.20.1

10.0.0.3/32 via 10.0.0.3

Tunnel1 created 00:07:24, expire 01:52:35

Type: dynamic, Flags: router unique local

NBMA address: 87.14.30.1

(no-socket)

You'd see the flip-side of that same output on R2.
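If it helps to see the whole Phase 2 exchange in one place, here's a rough Python sketch of it. This is a toy model under my own assumptions: the Router class and phase2_resolve function are invented for illustration, and it skips IPSEC, timers, and everything else the real debugs show. It only captures the order of events described above: the first packet goes via the hub, the hub forwards the resolution request instead of answering it, and both spokes end up with each other's NBMA address.

```python
class Router:
    """Toy DMVPN node: a name, two addresses and an NHRP cache."""

    def __init__(self, name, tunnel_ip, nbma_ip):
        self.name = name
        self.tunnel_ip = tunnel_ip
        self.nbma_ip = nbma_ip
        self.nhrp_cache = {}  # tunnel IP -> NBMA IP


def phase2_resolve(src, dst, hub):
    """Return the list of steps taken when src sends to dst."""
    if dst.tunnel_ip in src.nhrp_cache:
        return [f"{src.name} -> {dst.name}: direct (already resolved)"]
    steps = [f"{src.name} -> {hub.name}: data + resolution request"]
    # The hub does not answer; it forwards the request using its registrations.
    dst_nbma = hub.nhrp_cache[dst.tunnel_ip]
    steps.append(f"{hub.name} -> {dst.name} ({dst_nbma}): forwarded request")
    # The target learns the requester's NBMA address and builds the tunnel.
    dst.nhrp_cache[src.tunnel_ip] = src.nbma_ip
    src.nhrp_cache[dst.tunnel_ip] = dst.nbma_ip
    steps.append(f"{dst.name} -> {src.name}: spoke-to-spoke tunnel up")
    return steps


r1 = Router("R1", "10.0.0.1", "87.14.10.1")
r2 = Router("R2", "10.0.0.2", "87.14.20.1")
r3 = Router("R3", "10.0.0.3", "87.14.30.1")
# The hub already knows both spokes from their registrations.
r1.nhrp_cache = {r2.tunnel_ip: r2.nbma_ip, r3.tunnel_ip: r3.nbma_ip}

first = phase2_resolve(r3, r2, r1)   # goes via the hub
second = phase2_resolve(r3, r2, r1)  # now resolved, goes direct
for step in first + second:
    print(step)
```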

Before we push on to Phase 3, we need to spend some time looking at the various possible routing protocols for the NHRP/DMVPN control plane, and some of the oddities.

We've covered EIGRP fairly well thus far. The only thing I need to add is that in a production environment, you need to set the bandwidth manually on the interface, regardless of whether or not you're using it as a QoS value. You may remember back from your CCNA/CCNP days that EIGRP will only use half the available bandwidth of a link:

http://www.cisco.com/c/en/us/support/docs/ip/enhanced-interior-gateway-routing-protocol-eigrp/13672-12.html#band

R3#show int tun1 | i bandwidth

Tunnel transmit bandwidth 8000 (kbps)

Tunnel receive bandwidth 8000 (kbps)

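Note that this output is the tunnel transmit/receive bandwidth, not the interface bandwidth EIGRP works from. The route output earlier reported "minimum bandwidth is 100 Kbit" across Tunnel1, and unfortunately, half of 100K won't get you too many EIGRP updates.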

R1-R4:

interface Tunnel1

bandwidth 1000 ! or any reasonable number for your environment
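The arithmetic behind this is easy to check. Below is a small Python sketch, assuming classic (non-wide) EIGRP metrics with default K values and the default 50% pacing; the function names are mine. Plugging in the numbers from the earlier "sh ip route 2.2.2.2" output (minimum bandwidth 100 Kbit, total delay 105000 microseconds) reproduces the metric shown there.

```python
def eigrp_classic_metric(min_bandwidth_kbps, total_delay_usec):
    """Classic EIGRP composite metric with default K values (K1 = K3 = 1)."""
    bandwidth_term = 10**7 // min_bandwidth_kbps
    delay_term = total_delay_usec // 10  # delay is counted in tens of microseconds
    return 256 * (bandwidth_term + delay_term)


def eigrp_usable_kbps(interface_bandwidth_kbps, percent=50):
    """Bandwidth EIGRP allows itself, assuming the default of 50 percent."""
    return interface_bandwidth_kbps * percent / 100


# Values taken from the route output for 2.2.2.2 earlier in the article.
print(eigrp_classic_metric(100, 105000))  # 28288000
# What EIGRP may use before and after setting "bandwidth 1000".
print(eigrp_usable_kbps(100))             # 50.0
print(eigrp_usable_kbps(1000))            # 500.0
```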

We'll look at RIP next - it's super-easy.

R1-R4:

no router eigrp 100

router rip

version 2

network 10.0.0.0

network <their specific loopback prefix>

R1:

interface Tunnel1

no ip split-horizon

That's it ...

R3#sh ip route rip | b Gateway

Gateway of last resort is 87.14.30.100 to network 0.0.0.0

R 1.0.0.0/8 [120/1] via 10.0.0.1, 00:00:06, Tunnel1

R 2.0.0.0/8 [120/2] via 10.0.0.1, 00:00:06, Tunnel1

3.0.0.0/8 is variably subnetted, 2 subnets, 2 masks

R 3.0.0.0/8 [120/2] via 10.0.0.1, 00:01:31, Tunnel1

R 4.0.0.0/8 [120/2] via 10.0.0.1, 00:00:06, Tunnel1

R3#ping 2.2.2.2 source lo0

Type escape sequence to abort.

Sending 5, 100-byte ICMP Echos to 2.2.2.2, timeout is 2 seconds:

Packet sent with a source address of 3.3.3.3

!!!!!

Success rate is 100 percent (5/5), round-trip min/avg/max = 304/326/336 ms

R3#trace 2.2.2.2 source lo0

Type escape sequence to abort.

Tracing the route to 2.2.2.2

VRF info: (vrf in name/id, vrf out name/id)

1 10.0.0.2 160 msec 192 msec 196 msec

In order to get Spoke->Spoke, and not Spoke->Hub->Spoke, you need to make sure you're using RIPv2.

On to BGP.

R1:

no router rip

router bgp 100

bgp log-neighbor-changes

network 1.1.1.1 mask 255.255.255.255

network 10.0.0.0 mask 255.255.255.0

neighbor 10.0.0.2 remote-as 100

neighbor 10.0.0.2 route-reflector-client

neighbor 10.0.0.3 remote-as 100

neighbor 10.0.0.3 route-reflector-client

neighbor 10.0.0.4 remote-as 100

neighbor 10.0.0.4 route-reflector-client

R2-R4:

no router rip

router bgp 100

bgp log-neighbor-changes

network <local Loopback Prefix> mask 255.255.255.255

network 10.0.0.0 mask 255.255.255.0

neighbor 10.0.0.1 remote-as 100

R1 of course needs to be a route reflector, or the iBGP spokes wouldn't receive each other's routes.

R4#ping 2.2.2.2

Type escape sequence to abort.

Sending 5, 100-byte ICMP Echos to 2.2.2.2, timeout is 2 seconds:

!!!!!

Success rate is 100 percent (5/5), round-trip min/avg/max = 240/301/376 ms

R4#trace 2.2.2.2 source lo0

Type escape sequence to abort.

Tracing the route to 2.2.2.2

VRF info: (vrf in name/id, vrf out name/id)

1 10.0.0.2 156 msec 192 msec 184 msec

For OSPF, you'll need to use network type broadcast or non-broadcast. There's no point-to-multipoint support until Phase 3, which we'll go over in detail later.

R1:

no router bgp 100

router ospf 1

network 1.1.1.1 0.0.0.0 area 0

network 10.0.0.0 0.0.0.255 area 0

interface Tunnel1

ip ospf network broadcast

ip ospf priority 255

R2-R4:

no router bgp 100

router ospf 1

network <Local Loopback Address> 0.0.0.0 area 0

network 10.0.0.0 0.0.0.255 area 0

interface Tunnel1

ip ospf network broadcast

ip ospf priority 0

Broadcast is used here to avoid changing the next-hop. If it were changed, we'd end up with Spoke->Hub->Spoke flows. We want to be careful that the hub becomes the DR, hence changing the ospf priorities.
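As a quick illustration of why the priorities matter, here's a deliberately simplified Python sketch of a DR election. It assumes all routers come up at the same time, and it ignores the BDR and the fact that a real DR isn't preempted; elect_dr is my own name for it. Priority 0 means "never eligible", and ties fall back to the highest router ID.

```python
def elect_dr(routers):
    """routers: list of (router_id, priority). Returns the DR's router ID."""
    eligible = [(rid, prio) for rid, prio in routers if prio > 0]
    if not eligible:
        return None

    def rank(entry):
        rid, prio = entry
        # Highest priority wins; highest router ID breaks ties.
        return (prio, tuple(int(octet) for octet in rid.split(".")))

    return max(eligible, key=rank)[0]


# Priorities from the config above: hub at 255, spokes at 0.
configured = [("1.1.1.1", 255), ("2.2.2.2", 0), ("3.3.3.3", 0), ("4.4.4.4", 0)]
print(elect_dr(configured))  # 1.1.1.1

# With everything left at the default priority of 1, a spoke could win.
defaults = [("1.1.1.1", 1), ("2.2.2.2", 1), ("3.3.3.3", 1), ("4.4.4.4", 1)]
print(elect_dr(defaults))    # 4.4.4.4
```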

R3#ping 2.2.2.2

Type escape sequence to abort.

Sending 5, 100-byte ICMP Echos to 2.2.2.2, timeout is 2 seconds:

!!!!!

Success rate is 100 percent (5/5), round-trip min/avg/max = 164/183/204 ms

R3#trace 2.2.2.2

Type escape sequence to abort.

Tracing the route to 2.2.2.2

VRF info: (vrf in name/id, vrf out name/id)

1 10.0.0.2 172 msec 184 msec 196 msec

We can also use non-broadcast. Imagine a task that didn't allow multicast mappings to be used, but required an IGP to be run.

R1:

interface Tunnel1

ip ospf network non-broadcast

no ip nhrp map multicast dynamic

router ospf 1

neighbor 10.0.0.2

neighbor 10.0.0.3

neighbor 10.0.0.4

R2-R4:

interface Tunnel1

ip ospf network non-broadcast

no ip nhrp map multicast 87.14.10.1

** I actually rebuilt all the tunnels here to clear the NHRP cache thoroughly - I've found "clear ip nhrp" doesn't always produce the results I'd expect **

R1#sh ip ospf neigh

Neighbor ID Pri State Dead Time Address Interface

2.2.2.2 0 FULL/DROTHER 00:01:51 10.0.0.2 Tunnel1

3.3.3.3 0 FULL/DROTHER 00:01:57 10.0.0.3 Tunnel1

4.4.4.4 0 FULL/DROTHER 00:01:43 10.0.0.4 Tunnel1

R3#ping 2.2.2.2

Type escape sequence to abort.

Sending 5, 100-byte ICMP Echos to 2.2.2.2, timeout is 2 seconds:

!!!!!

Success rate is 100 percent (5/5), round-trip min/avg/max = 208/288/360 ms

R3#trace 2.2.2.2

Type escape sequence to abort.

Tracing the route to 2.2.2.2

VRF info: (vrf in name/id, vrf out name/id)

1 10.0.0.2 168 msec 188 msec 168 msec

I mentioned at the beginning of the article that I was going to go outside the scope of R&S v5 in order to explain the "why would I use this?" behind some topics. Phase 3 DMVPN, if you're only looking at it from a handful of routers, doesn't make nearly as much sense. You need to take a step back and realize the challenges Phase 2 would bring if you had, say, 1,500 DMVPN spokes.

In a scenario like that, you're clearly not going to have just one hub. In fact, not even having primary/backup would be sufficient, because one router simply cannot terminate 1,500 IPSEC tunnels from a CPU perspective. In order to scale Phase 2 in volume of spokes, you had to build a topology that looked something like this:

Let's pretend SPOKES1, 2 and 3 each represented 500 spokes. They'd register to HUB1, 2, and 3, respectively. I'm not going to get into DR/failover scenarios here, but you can start seeing the problem - each hub has its own NHRP database, which isn't shared with its neighbors. What happens when a spoke in SPOKES1 wants to reach a spoke in SPOKES3?

Phase 2 solved this by using daisy-chained NHRP. In short, HUB1 became an NHRP client of HUB2, which became an NHRP client of HUB3, which became an NHRP client of HUB1. It would look something like this:

You can reference Cisco's drawing of the same solution here; reference figure 3-4:
http://www.cisco.com/c/en/us/td/docs/solutions/Enterprise/WAN_and_MAN/DMVPDG/DMVPN_3.html

The config is reasonably simple. This isn't something I have mocked up right now, but let's pretend they're each using tunnel 1 and have a tunnel IP address of 10.0.0.X, where X is the hub number.

HUB1:
interface Tunnel1
ip nhrp map 10.0.0.2 N.B.M.A2
ip nhrp nhs 10.0.0.2

HUB2:
interface Tunnel1
ip nhrp map 10.0.0.3 N.B.M.A3
ip nhrp nhs 10.0.0.3

HUB3:
interface Tunnel1
ip nhrp map 10.0.0.1 N.B.M.A1
ip nhrp nhs 10.0.0.1

So given my earlier predicament, "What happens when a spoke in SPOKES1 wants to reach a spoke in SPOKES3?", in this case, the requesting spoke in SPOKES1 sends the initial packet to HUB1, which, not having a resolution for the spoke that's registered to HUB3, passes both the original packet and the NHRP request to HUB2, which in turn passes them to HUB3. In theory, HUB3 has the resolution for the router we're trying to reach, and tells that router via NHRP to establish a tunnel (via NBMA) back to the original requester in SPOKES1.

You can see how inefficient this is. It's not hierarchical; it scales sideways.

Let's take a worse scenario - what if the spoke in SPOKES3 is offline and not registered to HUB3? Well, HUB3 doesn't have a resolution so it passes it to HUB1, which in turn passes it to HUB2 ... yes, it loops. It eventually TTLs away and dies, but it's messy.
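Here's a small Python sketch of that daisy chain, just to show the two cases side by side. Everything in it is an assumption for illustration: the hub and spoke names, the forward_resolution function, and especially the hop limit of 8, which is an arbitrary stand-in for whatever finally kills the looping request.

```python
# Each hub is an NHRP client of the next one, closing the ring.
NEXT_HUB = {"HUB1": "HUB2", "HUB2": "HUB3", "HUB3": "HUB1"}


def forward_resolution(start_hub, target, registrations, hop_limit=8):
    """Walk the ring until a hub knows the target or the hop limit runs out.

    Returns (hubs visited, True if resolved).
    """
    path = []
    hub = start_hub
    for _ in range(hop_limit):
        path.append(hub)
        if target in registrations[hub]:
            return path, True
        hub = NEXT_HUB[hub]  # no resolution here, pass it along
    return path, False


# Hypothetical spoke names, one registered per hub.
online = {"HUB1": {"spokeA"}, "HUB2": {"spokeB"}, "HUB3": {"spokeC"}}
print(forward_resolution("HUB1", "spokeC", online))
# (['HUB1', 'HUB2', 'HUB3'], True)

# Same request, but spokeC is offline and not registered anywhere.
offline = {"HUB1": {"spokeA"}, "HUB2": {"spokeB"}, "HUB3": set()}
path, resolved = forward_resolution("HUB1", "spokeC", offline)
print(len(path), resolved)  # 8 False
```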

Another scalability issue in Phase 2 is that there's absolutely no way to summarize routes. If you summarize, you get the spoke->hub->spoke syndrome, because the next hop is always the summarizing router.

Also, I mentioned above the problem with the first few packets being punted to the CPU, and not being CEF switched.

Phase 3 fixes all these issues, and is basically better in every way.

In Phase 3, completely contrary to the way Phase 2 worked, all the routing needs to point towards the hub (initially). So the routing protocol does need some sort of "next hop self" feature enabled.

After the hub receives the first packet, instead of generating NHRP resolution packets, the hub sends an NHRP redirect any time it receives a packet in one tunnel and sends it back out the same tunnel. This redirect goes back to the originating router (the one that sent the first packet to the hub - the packet that got sent in & out the same tunnel), informing the originating router that a better path is available over DMVPN. NHRP redirect is very similar to an ICMP redirect.

Let's demonstrate with a sample of 4.4.4.4 trying to reach 2.2.2.2. It's important to note here that the route to 2.2.2.2 has a next-hop of R1's private address, which is already resolved by R4's static NHRP entry for R1, so there's no CEF failure!

So now R4 knows R1 isn't the best path to R2. At this point, R4 needs to send an NHRP resolution request to R2 (not the hub!), to find out how to reach it directly. In the meantime, it knows R1 will continue to forward packets for it.

Since R4 still can't speak directly to R2, the NHRP resolution gets forwarded via R1, but not processed via R1. In Phase 3, it's no longer R1's job to answer NHRP resolutions.

R2 receives the resolution, and responds directly to R4 (similar to the way Phase 2 worked at this point), also initiating the tunnel to R4.

At this point, R2 and R4 would still have R1 as the next hop for one another, but Phase 3 fixes that as well, rewriting CEF in order to fix this issue.

In modern versions of IOS, you can actually see the rewrite (more or less) via the routing table:

R4#sh ip route ospf | b Gateway
Gateway of last resort is 87.14.40.100 to network 0.0.0.0

1.0.0.0/32 is subnetted, 1 subnets
O 1.1.1.1 [110/1001] via 10.0.0.1, 00:03:12, Tunnel1
2.0.0.0/32 is subnetted, 1 subnets
O % 2.2.2.2 [110/2001] via 10.0.0.1, 00:02:43, Tunnel1
3.0.0.0/32 is subnetted, 1 subnets
O 3.3.3.3 [110/2001] via 10.0.0.1, 00:03:12, Tunnel1
10.0.0.0/8 is variably subnetted, 5 subnets, 2 masks
O 10.0.0.1/32 [110/1000] via 10.0.0.1, 00:03:12, Tunnel1
O 10.0.0.2/32 [110/2000] via 10.0.0.1, 00:02:43, Tunnel1
O 10.0.0.3/32 [110/2000] via 10.0.0.1, 00:03:12, Tunnel1

Note the % sign next to 2.2.2.2:

R4#sh ip route ospf | i %

+ - replicated route, % - next hop override

O % 2.2.2.2 [110/2001] via 10.0.0.1, 00:03:33, Tunnel1

"next hop override". Pretty cool.

R4#sh ip nhrp shortcut

2.2.2.2/32 via 10.0.0.2

Tunnel1 created 00:04:46, expire 01:55:15

Type: dynamic, Flags: router rib nho

NBMA address: 87.14.20.1

There's our shortcut table. What's a shortcut, you might ask? Let's look at the handful of simple commands necessary to make Phase 3 work.

First, the routing protocol must point back towards the hub, instead of towards the spoke, like we were set up for in Phase 2.

R1-R4:
interface Tunnel1
ip ospf network point-to-multipoint

Point-to-Multipoint rewrites the next hop to be the hub, which Broadcast and Non-Broadcast did not do. Also, not pictured here, I re-established the hub<->spoke multicast prior to this. Another important footnote is that with Point-to-Multipoint, we no longer need the DR/BDR we were stuck with in Phase 2 (effectively limiting us to two hubs). This design also permits summarization (or even just a default route), which Phase 2 certainly did not allow for. More on this in a bit.

R1:
interface Tunnel1
ip nhrp redirect

R2-R4:
interface Tunnel1
ip nhrp shortcut

ip nhrp redirect goes on the hub only (note many installations will just put ip nhrp redirect and ip nhrp shortcut on every device; this is not necessary). ip nhrp redirect enables the behavior described earlier: if a packet is received and transmitted on the same MGRE tunnel, send the redirect. You can actually see who we've sent redirects for recently:

R1#sh ip nhrp redirects
I/F NBMA address Destination Drop Count Expiry

Tunnel1 87.14.40.1 3.3.3.3 4 00:00:06
Tunnel1 87.14.30.1 10.0.0.4 4 00:00:06

ip nhrp shortcut goes only on the spokes; it is used to enable processing redirect packets.
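Pulling the redirect and shortcut pieces together, the Phase 3 sequence can be sketched as a toy Python model. This is my own simplification, not how IOS implements it: the Spoke class, hub_sends_redirect, and phase3_first_packet are invented names, and the model only tracks which next hop the spoke ends up using before and after the shortcut is installed.

```python
class Spoke:
    """Toy Phase 3 spoke: a routing table plus NHRP shortcut overrides."""

    def __init__(self, name, tunnel_ip, nbma_ip):
        self.name = name
        self.tunnel_ip = tunnel_ip
        self.nbma_ip = nbma_ip
        self.rib = {}        # prefix -> next hop from the routing protocol
        self.shortcuts = {}  # prefix -> (next hop, NBMA) learned via NHRP

    def next_hop(self, prefix):
        # A shortcut overrides the routing protocol's next hop.
        if prefix in self.shortcuts:
            return self.shortcuts[prefix][0]
        return self.rib[prefix]


def hub_sends_redirect(in_interface, out_interface):
    """The hub redirects when a packet enters and leaves the same tunnel."""
    return in_interface == out_interface


def phase3_first_packet(src, dst, prefix):
    """Return the events triggered by src's first packet towards prefix."""
    events = [f"{src.name} -> hub: data for {prefix}"]
    if hub_sends_redirect("Tunnel1", "Tunnel1"):
        events.append(f"hub -> {src.name}: NHRP redirect")
        events.append(f"{src.name} -> {dst.name}: resolution request, routed via hub")
        events.append(f"{dst.name} -> {src.name}: resolution reply, direct")
        src.shortcuts[prefix] = (dst.tunnel_ip, dst.nbma_ip)
    return events


r2 = Spoke("R2", "10.0.0.2", "87.14.20.1")
r4 = Spoke("R4", "10.0.0.4", "87.14.40.1")
r4.rib["2.2.2.2/32"] = "10.0.0.1"  # routing points at the hub

before = r4.next_hop("2.2.2.2/32")
events = phase3_first_packet(r4, r2, "2.2.2.2/32")
after = r4.next_hop("2.2.2.2/32")
print(before, "->", after)  # 10.0.0.1 -> 10.0.0.2
```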

That's all there is to enabling Phase 3, but I still haven't answered the scenario I proposed as the problem with Phase 2 ("sideways" scaling for hubs). With Phase 3, you can build hierarchical hubs because the NHRP daisy chain doesn't need to exist any longer. Imagine our 1,500 spoke router scenario from earlier, but now with Phase 3:

We still have 500 spokes registering to HUB1, HUB2, and HUB3, from SPOKES1, SPOKES2, and SPOKES3, respectively.

What if a router in SPOKES1 wants to build a spoke-to-spoke tunnel to a router in SPOKES3?

Here we see the first packet leave the spoke in SPOKES1. HUB1 will forward this packet (according to the routing table via CORE1, not pictured here). HUB1 then sends a redirect back towards the spoke in SPOKES1.

Because the hubs no longer answer NHRP requests, there is no need to NHRP daisy chain the hubs! So in our next diagram, we're going to watch the NHRP resolution request be routed to its destination.

Again, this is a routed IP packet, HUB1, CORE1, and HUB3 are not NHRP-processing this packet, they're just CEF-switching it.

When the target spoke in SPOKES3 receives the NHRP resolution request, it replies directly to the originating spoke in SPOKES1:

Much more scalable than Phase 2.

This leads back into summarization. In Phase 3 there is no need to have the full routing table. You can send out a summary for your network, or even a default.

I can't summarize intra-area in OSPF, so I'm switching back to EIGRP (not pictured here).

R1:
interface Tunnel1
ip summary-address eigrp 100 0.0.0.0 248.0.0.0

Sorry for the weird summary - I didn't do myself any favors by using 1.1.1.1 - 4.4.4.4 for the loopbacks. You try summarizing those :)
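If you'd rather not do that binary math by hand, Python's standard ipaddress module will do it. This is just a sketch; smallest_summary is my own helper name. It widens the first prefix one bit at a time until every loopback fits, and confirms that the mask in the summary command above really is a /5.

```python
import ipaddress


def smallest_summary(prefixes):
    """Return the longest single prefix that covers every given prefix."""
    networks = [ipaddress.ip_network(p) for p in prefixes]
    candidate = networks[0]
    while not all(net.subnet_of(candidate) for net in networks):
        candidate = candidate.supernet()  # widen by one bit and retry
    return candidate


loopbacks = ["1.1.1.1/32", "2.2.2.2/32", "3.3.3.3/32", "4.4.4.4/32"]
print(smallest_summary(loopbacks))  # 0.0.0.0/5

# The address/mask pair from the summary command, as a prefix.
configured = ipaddress.ip_network("0.0.0.0/248.0.0.0")
print(configured)                   # 0.0.0.0/5
```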

R3#sh ip route eigrp | b Gateway
Gateway of last resort is 87.14.30.100 to network 0.0.0.0

D 0.0.0.0/5 [90/3968000] via 10.0.0.1, 00:02:26, Tunnel1

One route - that'd sure be easier on my spokes if I had 1,500 spokes to consider.

R3#ping 2.2.2.2

Type escape sequence to abort.

Sending 5, 100-byte ICMP Echos to 2.2.2.2, timeout is 2 seconds:

!!!!!

Success rate is 100 percent (5/5), round-trip min/avg/max = 172/218/324 ms

We have reachability.

R3#trace 2.2.2.2

Type escape sequence to abort.

Tracing the route to 2.2.2.2

VRF info: (vrf in name/id, vrf out name/id)

1 10.0.0.2 156 msec 180 msec 172 msec

We're reaching it via one hop (our spoke-to-spoke tunnel)

R3#sh ip route | b Gateway

Gateway of last resort is 87.14.30.100 to network 0.0.0.0

S* 0.0.0.0/0 [1/0] via 87.14.30.100

D 0.0.0.0/5 [90/3968000] via 10.0.0.1, 00:04:07, Tunnel1

2.0.0.0/32 is subnetted, 1 subnets

H 2.2.2.2 [250/1] via 10.0.0.2, 00:00:29, Tunnel1

3.0.0.0/32 is subnetted, 1 subnets

C 3.3.3.3 is directly connected, Loopback0

10.0.0.0/8 is variably subnetted, 2 subnets, 2 masks

C 10.0.0.0/24 is directly connected, Tunnel1

L 10.0.0.3/32 is directly connected, Tunnel1

87.0.0.0/8 is variably subnetted, 2 subnets, 2 masks

C 87.14.30.0/24 is directly connected, FastEthernet0/0

L 87.14.30.1/32 is directly connected, FastEthernet0/0

H = NHRP

R3#sh ip route nhrp | b Gateway

Gateway of last resort is 87.14.30.100 to network 0.0.0.0

2.0.0.0/32 is subnetted, 1 subnets

H 2.2.2.2 [250/1] via 10.0.0.2, 00:00:41, Tunnel1
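The reason the NHRP route wins despite its administrative distance of 250 is plain longest-prefix matching: the /32 is more specific than the /5 summary. A minimal Python sketch of that lookup, using the three relevant routes from R3's table above (longest_match is my own helper, not a model of CEF):

```python
import ipaddress

# prefix -> next hop, taken from R3's routing table above
ROUTES = {
    "0.0.0.0/0": "87.14.30.100",  # static default
    "0.0.0.0/5": "10.0.0.1",      # EIGRP summary from the hub
    "2.2.2.2/32": "10.0.0.2",     # NHRP (H) route
}


def longest_match(routes, destination):
    """Return the next hop of the most specific prefix covering destination."""
    dest = ipaddress.ip_address(destination)
    matches = []
    for prefix, next_hop in routes.items():
        network = ipaddress.ip_network(prefix)
        if dest in network:
            matches.append((network.prefixlen, next_hop))
    if not matches:
        return None
    return max(matches)[1]


print(longest_match(ROUTES, "2.2.2.2"))     # 10.0.0.2 (spoke-to-spoke)
print(longest_match(ROUTES, "4.4.4.4"))     # 10.0.0.1 (still via the hub)
print(longest_match(ROUTES, "87.14.10.1"))  # 87.14.30.100 (default route)
```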

R3#show dmvpn | b Tunnel1

Interface: Tunnel1, IPv4 NHRP Details

Type:Spoke, NHRP Peers:2,

# Ent Peer NBMA Addr Peer Tunnel Add State UpDn Tm Attrb

----- --------------- --------------- ----- -------- -----

2 87.14.20.1 10.0.0.2 UP 00:04:09 DT1

10.0.0.2 UP 00:04:09 D

1 87.14.10.1 10.0.0.1 UP 01:25:21 S

Pretty darn slick.

What about IPv6 DMVPN?

Note, there is no IPv6 over IPv6 DMVPN yet - at least not on my IOS. So we'll be tunneling v6 over v4.

No changes to the existing tunnels are required, we just add v6 to our existing infrastructure.
I've added X::X/64 to every Loopback0, and 10::X/64 to every Tunnel1, where X is the router number.

R1:
ipv6 unicast-routing
ipv6 router eigrp 100
no shut

interface Tunnel1

no ipv6 split-horizon eigrp 100

ipv6 address 10::1/64

ipv6 eigrp 100
ipv6 nhrp map multicast dynamic
ipv6 nhrp network-id 1
ipv6 nhrp redirect

R2-R4:
ipv6 unicast-routing
ipv6 router eigrp 100
no shut
interface Tunnel1
ipv6 address 10::X/64 ! Where X is the router number
ipv6 eigrp 100
ipv6 nhrp map multicast 87.14.10.1
ipv6 nhrp map 10::1/128 87.14.10.1
ipv6 nhrp network-id 1
ipv6 nhrp nhs 10::1
ipv6 nhrp shortcut

The one thing that did throw me off here is that you don't need to map the link-local address of the hub on the spokes, or vice-versa. As I'd mentioned earlier, the ipv6 nhrp map commands remind me a lot of frame-relay, so I immediately started putting in manual mappings. No need. NHRP takes care of all of that:

R1#sh ipv6 nhrp | s FE80

FE80::C800:37FF:FEDC:8/128 via 10::3

Tunnel1 created 00:11:04, expire 01:48:56

Type: dynamic, Flags: unique registered used

NBMA address: 87.14.30.1

FE80::C801:FF:FEF8:8/128 via 10::4

Tunnel1 created 00:10:54, expire 01:49:06

Type: dynamic, Flags: unique registered used

NBMA address: 87.14.40.1

FE80::C803:13FF:FE90:8/128 via 10::2

Tunnel1 created 00:14:51, expire 01:45:08

Type: dynamic, Flags: unique registered used

NBMA address: 87.14.20.1

The link locals are auto-registered along with the unicast IPv6 addresses.

There's not much more to say - it works -

R4#ping 2::2 source lo0

Type escape sequence to abort.

Sending 5, 100-byte ICMP Echos to 2::2, timeout is 2 seconds:

Packet sent with a source address of 4::4

!!!!!

Success rate is 100 percent (5/5), round-trip min/avg/max = 156/198/260 ms

R4#trace 2::2

Type escape sequence to abort.

Tracing the route to 2::2

1 10::2 176 msec 172 msec 168 msec

Now let's look at what QoS options we have.

The QoS is largely Hub -> Spoke. You can get some Spoke -> Spoke, but it's generally a hackjob: because your neighbors are dynamic, it's difficult to fine-tune a policy.

The basic idea is that the spoke registers a value (called an NHRP "group") back to the hub, which the hub can then match and apply a policy-map to.

R2:

interface Tunnel1

ip nhrp group GROUP1

R3:

interface Tunnel1

ip nhrp group GROUP1

R4:

interface Tunnel1

ip nhrp group GROUP2

On all three spoke routers I did the following procedure:

interface tunnel1

no ip nhrp nhs 10.0.0.1

ip nhrp nhs 10.0.0.1

The reason is that the spoke doesn't dynamically re-register to the hub, so we're forcing it.

We can now see the hub is aware of the groups:

R1(config-if)#do sh ip nhrp

10.0.0.2/32 via 10.0.0.2

Tunnel1 created 00:34:47, expire 01:57:03

Type: dynamic, Flags: unique registered used

NBMA address: 87.14.20.1

Group: GROUP1

10.0.0.3/32 via 10.0.0.3

Tunnel1 created 02:24:59, expire 01:59:36

Type: dynamic, Flags: unique registered used

NBMA address: 87.14.30.1

Group: GROUP1

10.0.0.4/32 via 10.0.0.4

Tunnel1 created 02:24:41, expire 01:59:49

Type: dynamic, Flags: unique registered used

NBMA address: 87.14.40.1

Group: GROUP2

Let's build some policies on the hub. I have this all mocked up in GNS3, so we have to keep the performance expectations low.

Most of the DMVPNs I've designed used the DMVPN for bulk traffic, and used a side-by-side MPLS for the traffic that needed priority/QoS. So I have honestly never used this in production, but I suspect the design case is going to be mostly for shaping by group. You don't want a device at the hub sending towards a spoke at 100Mbit if the spoke has a pair of bonded T1s for Internet access. If we can assign "low speed" clients to one group and "high speed" clients to another, we can stop the slow spokes from getting overwhelmed while allowing the faster spokes to get traffic at, or near, the line rate of the hub Internet connection. This would also be very easy to configure; theoretically just 8 lines would take care of all spokes.

That all said, I've written slightly more complex configs for this implementation, because the CCIE lab's questions are about as far from reality as you can get.

R1:

ip access-list extended TOWARDS-R2

permit ip any host 2.2.2.2

ip access-list extended TOWARDS-R3

permit ip any host 3.3.3.3

class-map match-all TOWARDS-R3

match access-group name TOWARDS-R3

class-map match-all TOWARDS-R2

match access-group name TOWARDS-R2

policy-map GROUP1-PM

class TOWARDS-R2

shape average 4000

class TOWARDS-R3

shape average 4000

policy-map GR1-POLICY-PARENT

class class-default

shape average 6000

service-policy GROUP1-PM

policy-map GR2-POLICY-PARENT

class class-default

shape average 8000

interface Tunnel1

ip nhrp map group GROUP1 service-policy output GR1-POLICY-PARENT

ip nhrp map group GROUP2 service-policy output GR2-POLICY-PARENT

The idea here is that the cumulative bandwidth of GROUP1 should not exceed 6K, and each spoke should only get 4K maximum. Cumulative GROUP2 should not exceed 8K.

I worked up the "proof" from this, but it doesn't work into a blog well. Suffice to say it works.

You can see the policy-map hits with show policy-map multipoint, and can also get information from show dmvpn detail.

Ingress Per-Tunnel QoS (policing and remarking, basically) is not supported on DMVPN.

I know the first time I mocked this up, the first question I had was: that's great for Hub -> Spoke, but what about Spoke -> Hub, or Spoke->Spoke?

Turns out they're both kind of a pain (not to mention unsupported). As of 15.x, you can no longer apply a service policy directly to an MGRE tunnel. You can, of course, still shape, police, and queue on the physical interface that your tunnel is connected to. This more or less implies you need qos pre-classify, but interestingly, on 15.2(4)M6, I got the same results with or without it with the traffic generated on the router - if I pinged from the router, I got the inside (pre-tunnel) QoS values on the outer DSCP value. I suspect that may have differed if I was testing from behind the device, but I didn't test it.

The big nail in the coffin for Spoke->Spoke QoS is that the neighbors are dynamic. Without some way of applying a grouping to the neighbor which implies how much bandwidth they have, or what traffic is priority for them, you have to either individually manually match destinations, which defeats the dynamic nature of DMVPN, or have one generic policy that matches both the hub and every spoke.

A sample config might look something like this:

class-map match-all ef

match dscp ef

policy-map out

class ef

priority percent 50

class class-default

random-detect

interface Tunnel1

qos pre-classify

interface FastEthernet0/0

service-policy output out

And finally, some miscellaneous topics that I thought were interesting.

UNIQUE NHRP

By default, the spoke instructs the hub that its registration is unique, and not to accept a registration for the same DMVPN (private) IP from a different NBMA (public) IP.

If you're using DHCP on a spoke, and your IP might change, you'd want to disable this.

Use ip nhrp registration no-unique on the spoke.

TUNNEL KEYS

If you have multiple MGRE tunnels attached to the same physical interface, you need to put tunnel keys on them to keep them separate. Older IOSes (12.3 and older) required them on every MGRE tunnel.

Use tunnel key 123

SPOKE TO SPOKE MULTICAST

This is a very similar question to spoke-to-spoke QoS, but I can see this one getting used on the CCIE lab. It's impractical for large production networks, but in our topology:

R2:

ip pim rp-address 3.3.3.3

interface Tunnel1

ip nhrp map 10.0.0.3 87.14.30.1

ip nhrp map multicast 87.14.30.1

ip pim nbma-mode

ip pim sparse-mode

interface Loopback0

ip pim sparse-mode

ip igmp join-group 239.0.0.1

R3:

ip pim rp-address 3.3.3.3

interface Tunnel1

ip nhrp map 10.0.0.2 87.14.20.1

ip nhrp map multicast 87.14.20.1

ip pim nbma-mode

ip pim sparse-mode

interface Loopback0

ip pim sparse-mode

R3(config-if)#do sh ip pim neigh

PIM Neighbor Table

Mode: B - Bidir Capable, DR - Designated Router, N - Default DR Priority,

P - Proxy Capable, S - State Refresh Capable, G - GenID Capable

Neighbor Interface Uptime/Expires Ver DR

Address Prio/Mode

10.0.0.2 Tunnel1 00:06:00/00:01:38 v2 1 / S P G

R3(config-if)#do ping 239.0.0.1

Type escape sequence to abort.

Sending 1, 100-byte ICMP Echos to 239.0.0.1, timeout is 2 seconds:

Reply to request 0 from 2.2.2.2, 344 ms

Very ungainly and manual, but it does work. It's also of note that EIGRP peered between R2 and R3 as well:

R3(config-if)#do sh ip eigrp neigh

EIGRP-IPv4 Neighbors for AS(100)

H Address Interface Hold Uptime SRTT RTO Q Seq

(sec) (ms) Cnt Num

1 10.0.0.2 Tu1 14 00:09:50 266 1596 0 20

0 10.0.0.1 Tu1 11 02:58:23 328 1968 0 32

CRYPTO CALL ADMISSION

So one of these theoretical hubs with 500 spokes - let's assume it's not a big burly router, but it's getting along just fine in steady-state. Uh-oh, it lost power and had to reboot! Does it have the horsepower to establish 500 encrypted tunnels all trying to reconnect at the same time?

crypto call admission limit ike in-negotiation-sa <number> can control how many simultaneous tunnels it will try to negotiate (any new incoming tunnels get dropped temporarily until the first group is up).

Hope you enjoyed,

Jeff

A Thorough Approach for Debugging MPLS L3 VPNs

2014-04-15T19:39:00.001-07:00

I recently realized I needed a more organized approach to debugging MPLS L3 VPNs for the troubleshooting section. Referencing a lot of the practice labs I've taken, I'm going to give a run-down of what I think is the fastest way to track down any problem.

First let's run down my list, then we'll pick it apart with an example below.

I'm going to assume the run-of-the-mill validation of "Host A needs to be able to ping host B".

Since we're talking high-level in the first segment, and with MPLS VPNs we're always talking about a sender and a receiver, I am going to refer to the sender that's unable to reach the receiver as the originating router, and the side that cannot be reached as the terminating router, for referencing direction.

Before you start debugging...

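1) Validate the problem: ping <problem IP>

2) Find out if the problem is unidirectional. Run "debug ip icmp" on both the source and the destination. Ping both ways.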

3) From the originating router, run "sh ip route <problem IP>" and "sh ip cef <problem IP>". Sometimes some other route in the table is defeating the MPLS route on AD or, worse, with a more specific prefix. That makes it not an MPLS problem, and is out of scope for this post.

Once you clear the starting checks, you want to validate whether or not you have the route in your routing table.

1) Are you importing the route in to your VRF? Make sure the other side's exported route-target is being imported on the originating router's VRF.

2) Is the terminating router or terminating router's PE advertising the route?

3) Are route reflectors involved? If you're relying on one route reflector to relay a route through another route reflector, you need to ensure the cluster-IDs are different.

These following items are dependent on using OSPF as your PE->CE routing protocol:

4) If you're using OSPF as PE->CE, check for sham links. It's easy to break these and hard to look for them. Do a "sh run | s sham" on the PEs and see if any exist. If they do, run "show ip ospf sham-links"

5) If you're using OSPF as PE->CE, and the CE is also part of the VRF (the VRF itself exists on the CE), enable capability vrf-lite on the OSPF process on the CE.

If you don't have an internal route and you need one to beat another AD, then additionally check out:

6a) If OSPF is PE->CE, make sure domain-id is set the same on all of them, or you'll end up with external routes across the MPLS cloud.

6b) If EIGRP is PE->CE, make sure your EIGRP AS number (process number) matches on the PE routers.

If you checked into all of that, you should have an appropriate route by now. What happens when you've got the route in your routing table, pointing the right direction, but the traffic just doesn't arrive on the far side? Now we start debugging MPLS itself.

1) sh run | s mpls on every PE and P device. Look for LDP filtering. There are more elegant ways to find this, but this is the fastest.

2) From the PE on the originating side, run a "sh ip cef vrf <VRF NAME> <problem IP>". Is the correct PE listed as next-hop in the "via" field? If it's not, go investigate the PE that is originating the route; there may be more than one path (and one may not lead anywhere!)

3) If it's the correct PE from step 2, do show mpls forwarding-table <PE LDP ID>. Unless your PEs are L2 adjacent, you must have a tag listed for the PE, or "Pop Tag". If you don't, walk your adjacent routers to be sure mpls ip is enabled on every interface or OSPF MPLS auto-config is enabled. Make sure CEF is turned on on all P and PE devices - MPLS doesn't work without CEF. If necessary, re-check step 1 to make sure nothing is filtering tags. If you still haven't found the problem, do a "show mpls ldp neighbor | i Peer" and make sure you have the correct count of neighbors.

4) Note the next-hop associated with the tag you identified in step 3. Open a command prompt on the next-hop and repeat step 3. Continue until you reach a "pop tag" for the terminating PE.
5) Check for Router-ID failures. LDP design can be picky that the mask in the routing table and the mask in the label match. This is most commonly an issue when OSPF is used as the MPLS IGP; if your router ID is based off a loopback that is other than a /32. If this is the case, either change your loopback address to a /32 (if permitted), or change your ospf network type to point-to-point so that the label mask and the OSPF mask match. Also, this can sometimes be an issue with summarized routes in other protocols (such as EIGRP), so be on the lookout there.
6) As a final check, be sure to see if cost-community was disabled on the PE routers. It's possible to perform traffic engineering against the prefixes if it's been disabled, and then who knows what path your traffic might be taking? On the PEs, sh run | i cost-community. Cost community is on by default, and you want it left on. This command should show nothing if it is enabled; if it's disabled, you will find bgp bestpath cost-community ignore in the config.

Now let's walk through the scenarios that these verifications above can save you from.

1) Validate the problem: ping <problem IP>

This should be obvious, but I actually proctor a private TS test, and I'm amazed at the number of people that don't check what I put in front of them. In rare circumstances, the solution can be derived just from verifying the issue. And in a TS lab, you need to be sure you didn't somehow fix the problem at some other point.

2) Find out if the problem is unidirectional. Run "debug ip icmp" on both the source and the destination. Ping both ways. If you're taking an INE lab, be sure logging is on too: logging con 7 and logging on.

This is very important - so you "can't ping" the destination. Do you know if your echo request isn't making it from origination to destination, or that the echo reply isn't making it from destination to origination? Don't waste time debugging the wrong flow. Quite regularly only one direction is failing.

3) From the originating router, run "sh ip route <problem IP>" and "sh ip cef <problem IP>". Sometimes some other route in the table is defeating the MPLS route on AD or, worse, with a more specific prefix. That makes it not an MPLS problem, and is out of scope for this post.

This is easy to overlook. You may have the route in BGP, and the MPLS labels can be in good shape, but you're only getting a /24 across the MPLS VPN, and you're getting a bogus /32 route for the destination that leads nowhere, injected by your IGP from a router behind you. Your packet is going the wrong direction.
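To make the longest-match trap concrete, here is a minimal Python sketch of the selection rule. It is not how IOS implements the lookup, and the prefixes and descriptions are made up for illustration.

```python
import ipaddress

# Hypothetical routing table: (prefix, where it points). Values are illustrative only.
routes = [
    ("192.168.1.0/24", "MPLS VPN via PE"),
    ("192.168.1.7/32", "bogus IGP route behind you"),
]

def best_route(destination, table):
    """Return the most specific (longest-prefix) route covering destination."""
    dest = ipaddress.ip_address(destination)
    candidates = [
        (ipaddress.ip_network(prefix), via)
        for prefix, via in table
        if dest in ipaddress.ip_network(prefix)
    ]
    # The longest prefix wins; AD only breaks ties between routes to the same prefix.
    return max(candidates, key=lambda item: item[0].prefixlen)

net, via = best_route("192.168.1.7", routes)
print(net, "->", via)
```

The /24 learned across the VPN never gets a chance for 192.168.1.7, while its neighbor 192.168.1.8 still follows the MPLS path.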

1) Are you importing the route in to your VRF? Make sure the other side's exported route-target is being imported on the originating router's VRF.

Originating router:

ip vrf VPN

rd 1:1

route-target export 1:1

route-target import 3:3

route-target import 7:7

Terminating router:

ip vrf VPN

rd 3:3

route-target export 2:2

route-target import 1:1

route-target import 7:7

This config above for "Originating router" is missing route-target import 2:2. The route target is a community carried with MP-BGP; if you don't import it into your VRF, you won't see the route. The RD is basically irrelevant - as long as they're unique on each PE, they don't matter for the import process.
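The import decision boils down to a set intersection. A small Python sketch, using the route targets from the two VRF configs above (the function name is hypothetical, and this only models the RT check, nothing else in the import process):

```python
# Route targets taken from the two VRF configs above.
originating_imports = {"3:3", "7:7"}
terminating_exports = {"2:2"}

def imported(exported_rts, import_rts):
    """A VPNv4 route is imported if at least one of its RTs is in the VRF's import list."""
    return bool(set(exported_rts) & set(import_rts))

print(imported(terminating_exports, originating_imports))            # False
print(imported(terminating_exports, originating_imports | {"2:2"}))  # True
```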

2) Is the terminating router or terminating router's PE advertising the route?

This one sure got me once. I'm looking and looking for an MP-BGP problem, and it turns out that the CE just didn't advertise the route to the PE. Simple BGP error.

3) Are route reflectors involved? If you're relying on one route reflector to relay a route through another route reflector, you need to ensure the cluster-IDs are different.

If you have two route reflectors in your MP-BGP topology, unless the PEs in question both peer to the same route reflector, you need to ensure that the route reflectors have different cluster IDs. In other words, if your MP-BGP topology looks like this:

RR1 <-- PE1 --> RR2 <-- PE2

This will work fine, even if the cluster IDs are the same, because RR2 will reflect the routes from PE1 to PE2 and vice-versa. However, if you have:

PE1 --> RR1 <--> RR2 <-- PE2

Then you'll need separate cluster IDs, or RR2 will discard PE1's routes when RR1 reflects them (it sees its own cluster ID in the CLUSTER_LIST), and vice-versa.
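The mechanism behind this is the CLUSTER_LIST loop check from the route reflection RFC (RFC 4456). Here is a rough Python sketch of just that check; the cluster IDs and prefix are hypothetical and the rest of BGP is ignored.

```python
def reflect(route, cluster_id):
    """A reflecting RR prepends its own cluster ID to the CLUSTER_LIST."""
    return {**route, "cluster_list": [cluster_id] + route["cluster_list"]}

def accept(route, cluster_id):
    """A receiving RR drops any route whose CLUSTER_LIST already holds its own cluster ID."""
    return cluster_id not in route["cluster_list"]

pe1_route = {"prefix": "192.168.1.0/24", "cluster_list": []}

# Same cluster ID on both RRs: RR2 discards what RR1 reflected.
same = accept(reflect(pe1_route, "1.1.1.1"), "1.1.1.1")
# Different cluster IDs: RR2 accepts the route and can reflect it on to PE2.
different = accept(reflect(pe1_route, "1.1.1.1"), "2.2.2.2")
print(same, different)   # False True
```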

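4) If you're using OSPF as PE->CE, check for sham links. It's easy to break these and hard to look for them. Do a "sh run | s sham" on the PEs and see if any exist. If they do, run "show ip ospf sham-links"

Sham links allow you to extend an OSPF area across the "Super Area 0" backbone area. These are most commonly used to prefer an MPLS path instead of a back-door link. Topology aside, I've been bitten on broken sham links before, so look out for these. If you want to know more about them: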

http://brbccie.blogspot.com/2012/12/ospf-pe-downward-bit-super-area-0.html

5) If you're using OSPF as PE->CE, and the CE is also part of the VRF (the VRF itself exists on the CE), enable capability vrf-lite on the OSPF process on the CE.

The first time I ran into this I spent 5 hours debugging it. Some may say a waste of time, but I'll never forget it. In short: OSPF checks for the downward bit on routes exported from MP-BGP directly into the OSPF process. You'll watch the routes arrive on the PE and get put in the OSPF process no problem, and then when they hit the CE device(s), if the CEs are in the VRF as well, they'll be in the OSPF database but not get put into the RIB/FIB. This is a loop prevention mechanism. To disable it, use "capability vrf-lite" inside the OSPF process.

Also reference: http://brbccie.blogspot.com/2012/12/ospf-pe-downward-bit-super-area-0.html

6a) If OSPF is PE->CE, make sure domain-id is set the same on all of them, or you'll end up with external routes across the MPLS cloud.

This only matters if you're shooting for an internal route for some reason, and is more of a reminder than a big deal.

6b) If EIGRP is PE->CE, make sure your EIGRP AS number (process number) matches on the PE routers.

This can make a slightly bigger difference, in that EIGRP naturally deprefs (via higher AD) external routes. You may need an internal route in order to make the traffic cross the MPLS cloud. If the AS number doesn't match, you'll end up with external routes.

1) sh run | s mpls on every PE and P device. Look for LDP filtering. There are more elegant ways to find this, but this is the fastest.

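This is a bit of a hack, but it catches about 90% of LDP problems in < 60 seconds. You can't beat it for speed. I'll show more about this below.

2) From the PE on the originating side, run a "sh ip cef vrf <VRF NAME> <problem IP>". Is the correct PE listed as next-hop in the "via" field? If it's not, go investigate the PE that is originating the route; there may be more than one path (and one may not lead anywhere!)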

PE1#sh ip cef vrf VPN 192.168.1.7

192.168.1.0/24, version 8, epoch 0, cached adjacency 10.0.23.3

0 packets, 0 bytes

tag information set

local tag: VPN-route-head

fast tag rewrite with Fa0/1, 10.0.23.3, tags imposed: {17 23}

via 5.5.5.5, 0 dependencies, recursive

next hop 10.0.23.3, FastEthernet0/1 via 5.5.5.5/32

valid cached adjacency

tag rewrite with Fa0/1, 10.0.23.3, tags imposed: {17 23}

The via field above shows the PE you're heading towards. Is it the correct PE? This threw me off something awful once. The prefix in question was endlessly looping off a 3rd PE, and was being re-advertised on the 3rd PE. That PE was being preffed. Boom, an hour gone debugging - if only I'd paid more attention to the output of "sh ip cef vrf VPN"!

Assuming it is the right PE listed above, you walk the MPLS labels from there:

PE1#sh mpls forwarding-table 5.5.5.5

Local Outgoing Prefix Bytes tag Outgoing Next Hop

tag tag or VC or Tunnel Id switched interface

17 17 5.5.5.5/32 0 Fa0/1 10.0.23.3

Next hop is 10.0.23.3, via Fa0/1; that's P1:

P1#show mpls forwarding-table 5.5.5.5

Local Outgoing Prefix Bytes tag Outgoing Next Hop

tag tag or VC or Tunnel Id switched interface

17 Untagged 5.5.5.5/32 13766 Fa0/1 10.0.34.4

There's the evil Untagged! Let's go see what's up on P2.

P2#sh run | s mpls

no mpls ldp advertise-labels

mpls label protocol ldp

mpls ip

mpls label protocol ldp

mpls ip

Note, we should have caught this in MPLS debugging step 1, but just in case you didn't...!

There are about 3 scenarios you want to look out for related to label advertisement:

no mpls ldp advertise-labels will make no labels be advertised at all.

That command can be used in combination with mpls ldp advertise-labels for <standard ACL>. The standard ACL can be (rather obviously) rigged to prevent the labels you need advertised from being advertised.

The final command is mpls label range <min> <max>. If you don't allow enough labels, the prefixes you need can end up not getting a label assigned at all.

I've fixed the mpls ldp advertise-labels command above, and now we see the appropriate output on P1:

P1#show mpls forwarding-table 5.5.5.5

Local Outgoing Prefix Bytes tag Outgoing Next Hop

tag tag or VC or Tunnel Id switched interface

17 17 5.5.5.5/32 0 Fa0/1 10.0.34.4

And on P2:

P2#show mpls forwarding-table 5.5.5.5

Local Outgoing Prefix Bytes tag Outgoing Next Hop

tag tag or VC or Tunnel Id switched interface

17 Pop tag 5.5.5.5/32 508 Fa0/1 10.0.45.5

We see "Pop tag". Pop tag is OK, it's just part of the Penultimate Hop Pop process.
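The hop-by-hop walk above is mechanical enough to sketch in Python. The table below is a hand-written stand-in for the "show mpls forwarding-table 5.5.5.5" output in the broken state; the router names come from this example, but the data structure and the function are hypothetical, not something IOS exposes.

```python
# Hypothetical LFIB entries for 5.5.5.5/32, keyed by router, mirroring the broken state.
lfib = {
    "PE1": {"outgoing": "17", "next_hop": "P1"},
    "P1": {"outgoing": "Untagged", "next_hop": "P2"},
    "P2": {"outgoing": "Pop tag", "next_hop": "PE2"},
}

def walk_lsp(start, table):
    """Follow next hops until Pop tag; stop and report failure at the first Untagged hop."""
    path, router = [], start
    while router in table:
        entry = table[router]
        path.append((router, entry["outgoing"]))
        if entry["outgoing"] == "Untagged":
            return False, path
        if entry["outgoing"] == "Pop tag":
            return True, path
        router = entry["next_hop"]
    return False, path

ok, path = walk_lsp("PE1", lfib)
print(ok, path)
```

The last entry in the returned path is the router whose downstream neighbor deserves a closer look, which in this example points from P1 towards P2.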

3) If it's the correct PE from step 2, do show mpls forwarding-table <PE LDP ID>. Unless your PEs are L2 adjacent, you must have a tag listed for the PE, or "Pop Tag". If you don't, walk your adjacent routers to be sure mpls ip is enabled on every interface or OSPF MPLS auto-config is enabled. Make sure CEF is turned on on all P and PE devices - MPLS doesn't work without CEF. If necessary, re-check step 1 to make sure nothing is filtering tags. If you still haven't found the problem, do a "show mpls ldp neighbor | i Peer" and make sure you have the correct count of neighbors.

I've seen some nasty, nasty things done with VACLs on the layer 2 switches between routers on practice labs. It's not much of a stretch to think they'd block LDP. The config would look perfect and your adjacency simply wouldn't come up. Count how many adjacencies you're expecting from the diagram, and make sure you get a good head count:

P1#show mpls ldp neigh | i Peer

Peer LDP Ident: 7.7.7.7:0; Local LDP Ident 10.0.37.3:0

Peer LDP Ident: 2.2.2.2:0; Local LDP Ident 10.0.37.3:0

Peer LDP Ident: 192.168.49.4:0; Local LDP Ident 10.0.37.3:0

If you're missing one, investigate the adjacency.

And a shout out to my friend Keith Chayer, who reminded me to check for CEF being enabled as well. It is of note that you'll be missing labels if CEF is disabled on the MPLS transit path - at least LDP is smart enough to tell its neighbors "I'm broken - don't use me".

4) Note the next-hop associated with the tag you identified in step 3. Open a command prompt on the next-hop and repeat step 3. Continue until you reach a "pop tag" for the terminating PE.

I covered this above.

5) Check for Router-ID failures. LDP design can be picky that the mask in the routing table and the mask in the label match. This is most commonly an issue when OSPF is used as the MPLS IGP; if your router ID is based off a loopback that is other than a /32. If this is the case, either change your loopback address to a /32 (if permitted), or change your ospf network type to point-to-point so that the label mask and the OSPF mask match. Also, this can sometimes be an issue with summarized routes in other protocols (such as EIGRP), so be on the lookout there.

This is reasonably self-explanatory. The route prefix length and the LDP prefix length need to match. OSPF is the common culprit.

Reference: http://brbccie.blogspot.com/2013/11/mini-why-does-ldp-require-32-loopback.html
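A rough Python sketch of the mismatch, assuming a hypothetical PE loopback of 5.5.5.5 configured with a /24 mask. The helper name is made up; it only models "the RIB prefix and the LDP binding must be the same prefix".

```python
import ipaddress

def lfib_has_label(rib_prefix, ldp_bindings):
    """A label is only usable if LDP holds a binding for exactly the prefix in the RIB."""
    bound = {ipaddress.ip_network(p) for p in ldp_bindings}
    return ipaddress.ip_network(rib_prefix) in bound

# OSPF's default loopback network type advertises the loopback as a /32 host route,
# while the PE itself binds a label to the connected /24.
print(lfib_has_label("5.5.5.5/32", ["5.5.5.0/24"]))   # False
# With a /32 loopback (or ospf network point-to-point) both sides agree.
print(lfib_has_label("5.5.5.5/32", ["5.5.5.5/32"]))   # True
```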

6) As a final check, be sure to see if cost-community was disabled on the PE routers. It's possible to perform traffic engineering against the prefixes if it's been disabled, and then who knows what path your traffic might be taking? On the PEs, sh run | i cost-community. Cost community is on by default, and you want it left on. This command should show nothing if it is enabled; if it's disabled, you will find bgp bestpath cost-community ignore in the config.

I got this on a mock lab once, as well. If the PEs are disabling cost community, you need to ask yourself why: is this mandatory traffic engineering, or are they just trying to steer routes in the wrong direction?

Reference: http://brbccie.blogspot.com/2012/12/bgp-cost-community-eigrp-soo-and.html

/* Addition 11/27/14 - I apologize for not inserting this more thoroughly in the blog, but time doesn't permit right now - be sure to look for import or export maps on the VRF. It's possible to define a route-map that filters prefixes inbound or outbound of the VRF. The syntax is not particularly complex:

ip prefix-list IMPORT_PL seq 5 deny 0.0.0.0/0 le 32

route-map IMPORT-FILTER permit 10

match ip address prefix-list IMPORT_PL

vrf definition VRFTEST
rd 1:1
route-target export 1:1
route-target import 1:1
!
address-family ipv4
import ipv4 unicast map IMPORT-FILTER

*/

Cheers,

Jeff Kronlage

MPLS EXP-based QoS and QoS Groups

2014-04-12T12:44:00.002-07:00

This topic is a bit of a stretch for the R&S lab, really being more oriented towards Service Provider, but I wanted to talk about it anyway.

So what does your MPLS carrier do with those QoS settings you pass them?
It's unlikely they're queuing at congestion spots in their network based on the DSCP values you set.

You've probably heard about the EXP bits in the MPLS tag. These are used "for QoS". But no one really seems to know how. And there's only 3 bits, but we use 6 bits for DSCP, so what's the story?

Here's our topology:

We'll be setting DSCP values on H1 and manipulating them, or their MPLS equivalents, on the way to H2.

Without any special config, let's see how this works right out of the box. Of important note, I have null-routed H1's IP address on H2. This makes it easier to read the output from "debug mpls packet", because we're only seeing a one-way flow instead of a two-way flow.

H1#ping
Protocol [ip]:
Target IP address: 192.168.1.6
Repeat count [5]: 2
Datagram size [100]:
Timeout in seconds [2]:
Extended commands [n]: y
Source address or interface:
Type of service [0]: 184
Set DF bit in IP header? [no]:
Validate reply data? [no]:
Data pattern [0xABCD]:
Loose, Strict, Record, Timestamp, Verbose[none]:
Sweep range of sizes [n]:
Type escape sequence to abort.
Sending 2, 100-byte ICMP Echos to 192.168.1.6, timeout is 2 seconds:
..
Success rate is 0 percent (0/2)

(Remember, we were not expecting responses)

So we sent this as EF traffic (TOS 184, above). Any hypothesis on what's seen in transit?

P2#debug mpls packet

MPLS packet debugging is on

P2#

*Mar 1 09:20:04.473: MPLS: Fa0/0: recvd: CoS=5, TTL=253, Label(s)=16/19

*Mar 1 09:20:04.473: MPLS: Fa0/1: xmit: CoS=5, TTL=252, Label(s)=19

P2#

*Mar 1 09:20:06.437: MPLS: Fa0/0: recvd: CoS=5, TTL=253, Label(s)=16/19

*Mar 1 09:20:06.437: MPLS: Fa0/1: xmit: CoS=5, TTL=252, Label(s)=19

CoS=5, meaning the EXP bits are set to 5. The default behavior on a PE is to map the IPP (IP Precedence) on to the EXP bits. These line up nicely both being three bits. Reference the ToS value above - 184. That's a full 8 bit QoS value, in binary it's 10111000. Chop off the last two unused digits for the 6-bit DSCP value of 101110, and you have 46 (which I suspect you recognize as "EF"), knock off everything except the first three bits - 101 - for the IPP, and you have? Five. Hence, EXP becomes 5 as well. This default feature is known as "ToS Reflection".
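The bit-chopping is easy to check with a few shifts. A small Python sketch of the arithmetic only (the two low-order bits of the ToS byte are the ECN bits; copying IPP into EXP is the default behavior described above):

```python
tos = 184          # ToS byte typed into the extended ping
dscp = tos >> 2    # drop the two low-order (ECN) bits -> 6-bit DSCP
ipp = tos >> 5     # keep only the top three bits -> IP Precedence
exp = ipp          # default PE behavior copies IPP into EXP

print(f"{tos:08b}", dscp, ipp, exp)   # 10111000 46 5 5
```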

We'll look at how this value can be used to our advantage later.

What comes in on H2?

For those following my blog for a while, you may know about 14 months ago I wrote a giant ACL that matches every possible QoS value. I still have it on file, and I'll be using it here to see what values come in on H2.

H2#sh ip access-list | i match

460 permit ip any any dscp ef (2 matches)

480 permit ip any any dscp cs6 (1 match)

Ok great! We've got two EF matches, and a ... Class Selector 6?

The EF matches are the two pings arriving. I found this odd right off the bat. I would've expected that if IOS takes the IPP bits and maps them to EXP, it would then take EXP and map it back to IPP on the way out the other PE when the final label is popped. However, it doesn't work that way - instead, it just uses the DSCP that was already in the packet - which, of course, never changed. An MPLS label was put on top of it, but the underlying packet was left intact.

The class selector 6 packet is a BGP keepalive. We'll be seeing more of them throughout the post.

It turns out there are terms for the different types of MPLS QoS behavior. What we observed above would be either "Pipe Mode" or "Short Pipe Mode". Both of these behaviors include using the original ToS bits instead of replacing them based on the EXP bits. The difference between Pipe Mode and Short Pipe Mode is that Pipe Mode egress queues based on the EXP bits, and Short Pipe Mode egress queues at the PE on the original ToS (DSCP) bits. This post assumes the audience understands how to write a hierarchical QoS policy, so I'm not going to elaborate or examine the differences between them any further. Any additional mention of "Pipe Mode" assumes either of the above behaviors. The third option is "Uniform Mode", which is the process of replacing the IP Packet's ToS bits (IPP/DSCP) with something derived from the EXP bits.
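As a compact summary of the three modes, here is a hedged Python sketch. It only restates the paragraph above; the function name is hypothetical, and mapping EXP n to class selector n in Uniform Mode is an assumption, since "something derived from the EXP bits" leaves the exact mapping open.

```python
def egress(mode, original_dscp, exp):
    """Return (dscp_leaving_the_pe, field_the_pe_queues_on) for each mode."""
    if mode == "pipe":
        return original_dscp, "exp"    # original marking kept, queue on EXP
    if mode == "short-pipe":
        return original_dscp, "dscp"   # original marking kept, queue on DSCP
    if mode == "uniform":
        # Assumption: EXP n is rewritten as class selector n. The queuing field
        # is not specified in the text, so it is left as None here.
        return exp << 3, None
    raise ValueError(mode)

for mode in ("pipe", "short-pipe", "uniform"):
    print(mode, egress(mode, 46, 5))
```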

We just saw Pipe Mode in action above, let's look at how to implement Uniform Mode.

First we need to take a quick look at QoS groups.

There's a particular challenge with ingress and egress marking on a PE. On ingress, you can't set an IPP or DSCP value because the MPLS header is still on the frame. On the egress interface, you can't match on the EXP bits to set IPP or DSCP bits, because the MPLS label is already popped. So how do you match on an EXP value and set a DSCP value? Enter QoS groups.

PE2:

class-map match-all EXP5

match mpls experimental topmost 5

policy-map uniform-ingress

class EXP5

set qos-group 5

class class-default

set qos-group 0

interface fa0/0 ! MPLS side

service-policy input uniform-ingress

This config will match a decimal value of five on the topmost MPLS label - which, in our case, on the PE, is the only MPLS label thanks to Penultimate Hop Pop. We'll assign a local value of "5" (although this could be any number 1-99) if the EXP value is 5. Anything else will get reset to 0.

class-map match-all GROUP5

match qos-group 5

policy-map uniform-egress

class GROUP5

set ip dscp af41

class class-default

set ip dscp default

interface fa0/1 ! IP/VRF side

service-policy output uniform-egress

On egress, we'll match on that 5, and set af41. Why af41? Because I wanted to show the policy was doing something.

We'll ping from H1 to H2 again. I'm omitting any non-essential bits from the extended ping for brevity.

H1#ping

Target IP address: 192.168.1.6

Repeat count [5]: 2

Extended commands [n]: y

Type of service [0]: 184

Sending 2, 100-byte ICMP Echos to 192.168.1.6, timeout is 2 seconds:

Success rate is 0 percent (0/2)

Again, expected failure, this is a deliberate one-way flow.

H2#sh ip access-list | i match

340 permit ip any any dscp af41 (2 matches)

640 permit ip any any precedence routine (4 matches)

We see our two af41 hits, and 4 routine. The routine hits are the IPP 6 (BGP) packets being remarked to zero because they don't match anything else in the policy.

Now obviously this is a pretty useless policy, but it was more about showing how the function works.

Here's an adaptation for a more scalable Uniform Mode solution:

policy-map uniform-ingress
class class-default
set qos-group mpls experimental topmost

interface fa0/0

service-policy input uniform-ingress

policy-map uniform-egress

class class-default
set precedence qos-group

interface Fa0/1

service-policy output uniform-egress

Let's see what the outcome is.

H1#ping

Target IP address: 192.168.1.6

Repeat count [5]: 2

Extended commands [n]: y

Type of service [0]: 184

Sending 2, 100-byte ICMP Echos to 192.168.1.6, timeout is 2 seconds:

Success rate is 0 percent (0/2)

H2#clear ip access-list count

H2#sh ip access-list | i match

460 permit ip any any dscp ef (2 matches)

I was rather surprised the first time I saw this output. We're setting a precedence value but getting back a DSCP value. I expected to see a precedence/class-selector value. The original bits were 101110 (DSCP 46, or EF), and I expected to replace them with 101000, which would be class selector 5. This brings up an important difference in IOS's handling of class-selector vs precedence. I'd always treated them the same, but it turns out IOS is more literal - Precedence sets only the precedence bits. So we re-wrote the first three bits with 101, which ... were already set to 101. So we ended up with 101110 (DSCP 46/EF) again.
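The "only the first three bits" behavior can be modeled with a mask. A minimal Python sketch of the bit math (the function name is hypothetical; it mirrors the observed result rather than any documented IOS internals):

```python
def set_precedence(dscp, precedence):
    """Rewrite only the top three bits of the 6-bit DSCP field, keeping the low three."""
    return (precedence << 3) | (dscp & 0b000111)

original = 0b101110                      # DSCP 46, EF
rewritten = set_precedence(original, 5)
print(f"{rewritten:06b}", rewritten)     # 101110 46 - the low bits 110 survive

expected = set_precedence(0, 5)
print(f"{expected:06b}", expected)       # 101000 40 - CS5, only if the low bits were zero
```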

We could do something like this:

policy-map uniform-egress

class class-default

set dscp qos-group

But then we'd get literal DSCP values: if the QoS Group is 5, it would set DSCP 5. Not DSCP CS5 (101000), but actual binary 5 - (000101). To accomplish EF -> EXP 5 -> CS5, we'd have to use either a lengthy QoS-Group -> DSCP class-map/policy-map setup, or we could use a table map!

table-map TABMAP

map from 1 to 8 ! Group 1 to DSCP CS1

map from 2 to 16 ! Group 2 to DSCP CS2

map from 3 to 24 ! ...

map from 4 to 32

map from 5 to 40

map from 6 to 48

map from 7 to 56 ! Group 7 to DSCP CS7

policy-map uniform-egress

class class-default

set dscp qos-group table TABMAP

H1#ping

Target IP address: 192.168.1.6

Repeat count [5]: 2

Extended commands [n]: y

Type of service [0]: 184

Sending 2, 100-byte ICMP Echos to 192.168.1.6, timeout is 2 seconds:

Success rate is 0 percent (0/2)

H2#sh ip access-list | i match

400 permit ip any any dscp cs5 (2 matches)

480 permit ip any any dscp cs6 (18 matches)

I think the table map use is pretty obvious - take a qos group and match it to some other integer, which has some meaning when applied to a DSCP or IPP field. Now we have the CS5 output we were looking for.
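
If it helps, the table map is really just a lookup table. Here's a rough Python sketch of the two behaviors: the literal copy from "set dscp qos-group" versus the TABMAP translation. The function name is hypothetical, and passing unmapped values through unchanged is my assumption for the sketch, not something shown in the lab above.

```python
# Mirror of the TABMAP table-map above: QoS group N -> DSCP CSN (N * 8).
TABMAP = {group: group * 8 for group in range(1, 8)}


def set_dscp_from_qos_group(qos_group, table=None):
    """Return the DSCP written for a given QoS group.

    Without a table the group number is used literally, so group 5
    becomes DSCP 5 (000101). With a table the group is translated first.
    """
    if table is None:
        return qos_group
    # Assumption: values with no entry in the table are copied as-is.
    return table.get(qos_group, qos_group)


literal = set_dscp_from_qos_group(5)
mapped = set_dscp_from_qos_group(5, TABMAP)
print(format(literal, "06b"), format(mapped, "06b"))
```

The first value is the literal binary 5, the second is the CS5 pattern we were after.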

Now clearly, MPLS/EXP QoS needs to be able to be modified on more than just the egress PE. Let's take a look at the other spots we can match and adapt behavior to it.

So far we've been doing matches on the "topmost" label, so what other options have we got? Keeping this oriented towards the R&S CCIE, I'm not going to look at anything other than a 2-tag (VRF + MPLS PE) system. When traffic is received in from the host towards the PE, the PE is going to impose a label for the VRF. It will then add the MPLS transit label on top of that, for reaching the other PE. So to reiterate, we go from zero labels to two labels on the PE.

We can set both those labels, and it's really not hard, but you have to pay attention to what label is being manipulated on which interface. IOS is picky about the order of operations in this case.

For ingress on a PE, we can only set imposition. We clearly can't set "topmost" because there are no labels on the packet yet:

PE1:
policy-map impose1
class class-default
set mpls experimental imposition 4

int fa0/0

service-policy input impose1

H1#ping 192.168.1.6 rep 1

Type escape sequence to abort.

Sending 1, 100-byte ICMP Echos to 192.168.1.6, timeout is 2 seconds:

Success rate is 0 percent (0/1)

H2#sh ip access-list | i match

320 permit ip any any dscp cs4 (1 match)

480 permit ip any any dscp cs6 (2 matches)

And what if we set EF manually on H1?

H1#ping

Target IP address: 192.168.1.6

Repeat count [5]: 2

Extended commands [n]: y

Type of service [0]: 184

Sending 2, 100-byte ICMP Echos to 192.168.1.6, timeout is 2 seconds:

Success rate is 0 percent (0/2)

H2#sh ip access-list | i match
320 permit ip any any dscp cs4 (3 matches)
480 permit ip any any dscp cs6 (2 matches)

Still CS4. We're remarking the EXP bits on the inner label to 4 on PE1, that value is carried down to PE2, and then the qos-group-based policy remarks the DSCP to CS4.

What about the outer label?

H1#ping 192.168.1.6 rep 1

Type escape sequence to abort.
Sending 1, 100-byte ICMP Echos to 192.168.1.6, timeout is 2 seconds:
.
Success rate is 0 percent (0/1)

We'd need to look at the results on P2, because PE2 never gets the outer label - the PHP process removes it before forwarding the frame.

P2#

*Mar 2 01:44:59.409: MPLS: Fa0/0: recvd: CoS=4, TTL=253, Label(s)=16/19

*Mar 2 01:44:59.409: MPLS: Fa0/1: xmit: CoS=4, TTL=252, Label(s)=19

So P2 receives the outer label as 4, and the inner label as 4. We see 4 coming in on Fa0/0 on label 16, and going out on label 19 on Fa0/1, showing both the PHP process and the fact that both EXP values are the same. That's because the default behavior of a PE is to copy the inner label's EXP bits to the outer label. But what if we wanted to set the outer label to something different?

There are two places we could do that: egress on the PE, or ingress on the P routers.

Let's try the PE first.

PE1:
policy-map topmost1
class class-default
set mpls experimental topmost 2

interface FastEthernet0/1

service-policy output topmost1

H1#ping 192.168.1.6 rep 1

Type escape sequence to abort.

Sending 1, 100-byte ICMP Echos to 192.168.1.6, timeout is 2 seconds:

Success rate is 0 percent (0/1)

P2#

*Mar 2 01:53:51.609: MPLS: Fa0/0: recvd: CoS=2, TTL=253, Label(s)=16/19

*Mar 2 01:53:51.609: MPLS: Fa0/1: xmit: CoS=4, TTL=252, Label(s)=19

Now we see EXP 2 on the topmost and EXP 4 on the inner.

It's of some interest that if we wanted the final PE (PE2) to see that value of 2, we'd want to disable PHP. PHP is disabled from the PE, not the router upstream from it. This is done by the PE advertising an explicit-null label for the prefixes terminating on it:

PE2:

mpls ldp explicit-null

H1#ping 192.168.1.6 rep 1

Type escape sequence to abort.

Sending 1, 100-byte ICMP Echos to 192.168.1.6, timeout is 2 seconds:

Success rate is 0 percent (0/1)

P2#

*Mar 2 01:57:56.889: MPLS: Fa0/0: recvd: CoS=2, TTL=253, Label(s)=16/19

*Mar 2 01:57:56.889: MPLS: Fa0/1: xmit: CoS=2, TTL=252, Label(s)=0/19

H2#sh ip access-list | i match

160 permit ip any any dscp cs2 (1 match)

We see that P2 forwarded both labels, one of which was the explicit null/0 label (reference 0/19). The PE has to pop both labels before forwarding. Consequently, we also see that the PE now marked CS2 based on the EXP2 in the topmost label.

Now let's see about manipulating the topmost label on a P device.

For clarity's sake on P2, I am disabling the explicit null (re-enabling PHP) on PE2:

PE2(config)#no mpls ldp explicit-null

P1:

policy-map set-topmost

class class-default

set mpls experimental topmost 7

interface FastEthernet0/1

service-policy output set-topmost

Before I show the output of this, it's important to note that setting the topmost EXP on egress is the only option I could find that worked on the P routers. The P routers aren't imposing any labels (just swapping, which is different), so imposition doesn't work, and setting topmost on ingress doesn't appear to do anything (although I am not sure why). And now for the outcome:

H1#ping 192.168.1.6 rep 1

Type escape sequence to abort.

Sending 1, 100-byte ICMP Echos to 192.168.1.6, timeout is 2 seconds:

Success rate is 0 percent (0/1)

P2#

*Mar 2 02:22:25.641: MPLS: Fa0/0: recvd: CoS=7, TTL=253, Label(s)=16/19

*Mar 2 02:22:25.645: MPLS: Fa0/1: xmit: CoS=4, TTL=252, Label(s)=19

As anticipated, EXP 7 on the outer label only.

It's also important to note how P routers treat the EXP bits. By default, unless you manually change it with the processes I've demonstrated above or the one still to come, a P router swapping labels hop-by-hop will always copy the EXP of the old outer label to the new outer label unmodified.
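
Here's a toy Python model of the label operations covered so far, in case it helps to see them side by side. Everything in it (the Label class, the function names) is hypothetical and only mimics the behavior observed in the debugs above; the label values 16 and 19 are borrowed from the P2 output purely for illustration.

```python
from dataclasses import dataclass, replace
from typing import List


@dataclass(frozen=True)
class Label:
    value: int
    exp: int

# A stack is a list with the topmost (outer) label at index 0.


def impose(transport: int, vpn: int, exp: int) -> List[Label]:
    """PE ingress: both imposed labels start with the same EXP."""
    return [Label(transport, exp), Label(vpn, exp)]


def set_topmost(stack: List[Label], exp: int) -> List[Label]:
    """Rewrite EXP on the outer label only."""
    return [replace(stack[0], exp=exp)] + stack[1:]


def swap(stack: List[Label], new_value: int) -> List[Label]:
    """P router swap: new outer label value, EXP carried over unmodified."""
    return [replace(stack[0], value=new_value)] + stack[1:]


def php_pop(stack: List[Label]) -> List[Label]:
    """Penultimate hop pop: the outer label and its EXP are discarded."""
    return stack[1:]


stack = impose(transport=16, vpn=19, exp=4)   # "set mpls experimental imposition 4"
stack = set_topmost(stack, 2)                 # "set mpls experimental topmost 2"
after_php = php_pop(stack)                    # PE2 only ever sees EXP 4
after_explicit_null = swap(stack, 0)          # PE2 sees EXP 2 on label 0
print(after_php, after_explicit_null)
```

With PHP the EXP 2 marking never reaches PE2; with explicit null it rides in on label 0, which lines up with the 0/19 debug line.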

And now for our final topic - policing based on EXP.

P1:

class-map match-all EXP5

match mpls experimental topmost 5

policy-map POLICER

class EXP5

police cir 32000

conform-action transmit

exceed-action set-mpls-exp-topmost-transmit 1

interface FastEthernet0/1

service-policy output POLICER

H1#ping

Protocol [ip]:

Target IP address: 192.168.1.6

Repeat count [5]: 500

Datagram size [100]: 1000

Extended commands [n]: y

Type of service [0]: 184

Sending 500, 1000-byte ICMP Echos to 192.168.1.6, timeout is 0 seconds:

......................................................................

..........

Success rate is 0 percent (0/500)

This one is tricky to validate - we want to see some MPLS packets leave P1 as 5, and some leave as 1. Unfortunately my ACL doesn't work here (without turning PHP back off) because we're playing with the outer label and not the inner label, and the Uniform Mode config on PE2 won't take heed of the outer label, because it's popped before hitting the egress interface.

Instead, we're just going to look at a sampling of "debug mpls packet" on P2:

*Mar 2 02:48:26.057: MPLS: Fa0/0: recvd: CoS=5, TTL=253, Label(s)=16/19

*Mar 2 02:48:26.057: MPLS: Fa0/1: xmit: CoS=5, TTL=252, Label(s)=19

*Mar 2 02:48:26.097: MPLS: Fa0/0: recvd: CoS=1, TTL=253, Label(s)=16/19

*Mar 2 02:48:26.097: MPLS: Fa0/1: xmit: CoS=5, TTL=252, Label(s)=19

*Mar 2 02:48:26.097: MPLS: Fa0/0: recvd: CoS=1, TTL=253, Label(s)=16/19

*Mar 2 02:48:26.097: MPLS: Fa0/1: xmit: CoS=5, TTL=252, Label(s)=19

*Mar 2 02:48:26.101: MPLS: Fa0/0: recvd: CoS=1, TTL=253, Label(s)=16/19

Let's decipher this a bit:

Remember, P2 is performing PHP for PE2, so what we see coming in and what we see going out will be different. P1 is only making modifications to the topmost label.

*Mar 2 02:48:26.057: MPLS: Fa0/0: recvd: CoS=5, TTL=253, Label(s)=16/19

We got an MPLS packet in as EXP 5.

*Mar 2 02:48:26.057: MPLS: Fa0/1: xmit: CoS=5, TTL=252, Label(s)=19

We popped the upper label and sent the inner label on as EXP 5 as well.

*Mar 2 02:48:26.097: MPLS: Fa0/0: recvd: CoS=1, TTL=253, Label(s)=16/19

By this point, we've already gotten the policer to kick in, so we receive EXP 1.

*Mar 2 02:48:26.097: MPLS: Fa0/1: xmit: CoS=5, TTL=252, Label(s)=19

and we transmit EXP 5 based on the inner label, which was set on PE1 because of the IPP -> EXP ToS Reflection. The policer on P1 did not modify this value.
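
For anyone who wants to see why some packets stay at 5 and some drop to 1, here's a rough token-bucket sketch in Python. It is not how IOS implements the policer; the class name is hypothetical and the 1500-byte burst is an assumption I picked for the example, since the config above only specifies the CIR.

```python
class ExpPolicer:
    """Single-rate, two-color policer that remarks the topmost EXP on exceed."""

    def __init__(self, cir_bps: int, burst_bytes: int, match_exp: int, exceed_exp: int):
        self.rate = cir_bps / 8.0      # refill rate in bytes per second
        self.burst = burst_bytes       # bucket depth (assumed value, see above)
        self.tokens = float(burst_bytes)
        self.last = 0.0
        self.match_exp = match_exp
        self.exceed_exp = exceed_exp

    def police(self, now: float, size: int, topmost_exp: int) -> int:
        """Return the topmost EXP the packet leaves with."""
        if topmost_exp != self.match_exp:
            return topmost_exp         # not in class EXP5, left alone
        # Refill the bucket for the time elapsed, capped at the burst size.
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if size <= self.tokens:
            self.tokens -= size
            return topmost_exp         # conform-action transmit
        return self.exceed_exp         # exceed-action set-mpls-exp-topmost-transmit


policer = ExpPolicer(cir_bps=32000, burst_bytes=1500, match_exp=5, exceed_exp=1)
first = policer.police(now=0.000, size=1000, topmost_exp=5)
second = policer.police(now=0.001, size=1000, topmost_exp=5)
later = policer.police(now=1.001, size=1000, topmost_exp=5)
print(first, second, later)
```

The first 1000-byte packet conforms, the one right behind it exceeds and gets EXP 1 on the outer label, and after a second of refill the next one conforms again. The inner label is never touched, same as in the debug.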

That's MPLS QoS/QoS Groups in a nutshell. Hope you enjoyed!

Jeff

[mini] PIM Dense State Refresh

2014-02-15T16:55:00.001-08:00

Been brushing up on multicast recently. It was one of the first topics I ever deep-dived and some of the material is rusty now... two and a half years later.

Came across PIM-DM State Refresh. This is an interesting attempt to make dense mode PIM more scalable. If you ask a CCNP student what the common detriment with using dense mode is, he'd probably tell you "It floods all its groups to every PIM device every three minutes". That is true, but that's an attempt to solve a problem, not the problem itself.

The real problem is that dense mode has no way of letting potential receivers know what groups are available, or where to find them. It's easy to lose sight of this when labbing: you control every device and know exactly where all the transmitters and receivers are, and know what groups are on the network. The 3-minute flooding is only present because that's how dense mode tells the network what groups are available and where to find them.

State Refresh is not a new technology - it was proposed in the late 90's - but I'd never heard of it before today. With it enabled, you still have the initial densing of the actual multicast stream, but after the initial prune, instead of just firing the stream off every 3 minutes, it instead sends a state refresh every X number of seconds, where X is defined by the command that enables it:

interface FastEthernet1/0
ip pim dense-mode
ip pim state-refresh origination-interval 10

In this sample, we would send state refresh messages every 10 seconds.

In this fashion, all PIM routers in the network are still aware of the stream, but they don't get the annoying densing out of the traffic followed by having to prune it constantly.
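
As a quick back-of-the-envelope comparison, here's a tiny Python sketch counting events over an hour. It assumes fixed timers and a source that keeps transmitting the whole time; the function name is just for illustration.

```python
def events_per_period(period_s: int, interval_s: int) -> int:
    """How many timer-driven events fire in a period."""
    return period_s // interval_s


HOUR = 3600
refloods = events_per_period(HOUR, 180)   # classic dense mode: re-flood every 3 minutes
refreshes = events_per_period(HOUR, 10)   # origination-interval 10
print(refloods, refreshes)
```

So you trade 20 floods of the actual stream for 360 small control messages, which is a much better deal for the links that pruned the traffic.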

Also note, this process only works if the transmitter is still sending traffic. It does not do this "keepalive" signaling for multicast streams that are no longer in use.

Where you place the state-refresh command is important. It should always go on the PIM interface closest to the transmitting host. If you put it anywhere else, it does not work. You do not need to enable it on any other routers or interfaces. In my lab, I had three interfaces, one pointing at a host endlessly pinging 239.0.0.100, and the other two pointing towards PIM routers. I only have it enabled on the interface pointing towards the host.

All PIM Dense routers/interfaces will automatically relay state-refresh messages. This command only needs to be enabled on interfaces facing transmitting hosts.

If you don't want a PIM router to relay these messages, use this global command:

ip pim state-refresh disable

Cheers,

Jeff

The Woz!

2014-02-07T21:42:00.004-08:00

Totally off topic this time - but tonight, I met Steve Wozniak, and it was amazing.

He went to a small networking event that I attended. When I signed up for it, I was on the fence about attending - he hadn't been signed at that time - but I decided to go anyway (dragged along was more like it, but that's another story). Then the message came out that he was the guest speaker, and tickets sold out in the blink of an eye.

He talked for about 30 minutes, took Q&A for another 30. What a super guy. Incredibly personable.

Brief recap of the discussion:
- He spoke several times about Steve Jobs. Some insightful things, including the mentor Apple had in the early days, and how Steve J learned most of his technique from that mentor (one of their early investors). Interestingly, it wasn't all positive - he mentioned on several occasions how he wished Steve J had been more generous both in his personality and financially.
- He's such a nerd! (in a good way). This was a technology business networking function, so inevitably the topics tended to lean towards business. He'd start off on a business topic, and really just end up saying that if you have a cool engineering idea, it'll probably do well. Then he'd go off on a story about a neat engineering idea.
- He has some interesting ideas about wearable technology. He doesn't think we've got it nailed yet, and that the smartwatch in particular doesn't have a big enough screen to be usable.
- He's disappointed that computers in schools didn't result in smarter students. He talks a lot about how people had interest in them as they displayed something new, but the interest didn't stick unless the "new" kept coming.
- He talked a lot about the education system in general. Interesting ideas about Singularity - http://en.wikipedia.org/wiki/Technological_singularity - and how that might help in learning someday. Also he hypothesized that Moore's Law (http://en.wikipedia.org/wiki/Moore's_law) is going to fail soon, and that Singularity may be a long way off because of that.
- He made a few good jokes, including a reference that we now ask all questions of the Internet, which was clearly never designed to be an "ask me a question" resource. He says we now ask all questions of something that starts with "Go" and it's not "God".
- He had some really positive comments about Google, even a vague reference that there perhaps should have been an Apple/Google merger ... ? He said Google definitely had Apple licked on human phrase interpretation, indicating Siri did a relatively poor job of taking a human idea and producing an answer, but Google took human phrasing very well and produced ... a page of links (Google doesn't answer questions, for the most part).

He was signed up for a photo op with the sponsors, but let everyone know in advance that he'd be coming back down to mingle afterwards.

He came back down and gave everyone a chance to take photos, and chatted with everyone, as best he could, for a person that was being mobbed by 60+ fans.

This was my best photo, which was (thank God) taken unbeknownst to me by an old friend, who I happened to bump into there. Because all my other ones came out terrible. You can't see it, but I did get to shake his hand.

This is a very hastily cleaned up version of one of my other photos --

All my other ones had severe lighting problems, unfortunately.

What a great evening!

Jeff

Private VLANs - How they really work

2014-01-29T19:46:00.001-08:00

You're probably already familiar with the basics of a private VLAN: it allows you to group hosts in a single subnet on Ethernet, but limit which hosts can talk to each other at layer 2. A common design is to have the default gateway accessible to the entire subnet, but prevent the individual hosts from talking to each other (an isolated VLAN). Another common design is to break the VLAN up into various smaller groups that can talk to each other (community VLANs), but allow all hosts in all groups to talk to the default gateway. Instead of a default gateway, the promiscuous host might be a community server, such as a backup server.

Configuring them is not very tricky. Our topology will consist of two 3560 switches, SW1 and SW2, trunked together on Fa0/13. R1 will simulate our default gateway, on 192.168.0.1.

SW1:

vlan 124

private-vlan primary

private-vlan association 216,402

vlan 216

private-vlan community

vlan 402

private-vlan isolated

interface FastEthernet0/1

switchport private-vlan mapping 124 216,402

switchport mode private-vlan promiscuous

interface FastEthernet0/13 ! Trunk to SW2

switchport trunk encapsulation dot1q

switchport mode trunk

SW2:

vlan 124

private-vlan primary

private-vlan association 216,402

vlan 216

private-vlan community

vlan 402

private-vlan isolated

interface FastEthernet0/13 ! trunk to SW1

switchport trunk encapsulation dot1q

switchport mode trunk

interface FastEthernet0/2 ! Connects to R2, a community host

switchport private-vlan host-association 124 216

switchport mode private-vlan host

interface FastEthernet0/4 ! Connects to R4, a community host

switchport private-vlan host-association 124 216

switchport mode private-vlan host

interface FastEthernet0/6 ! Connects to R6, an isolated host

switchport private-vlan host-association 124 402

switchport mode private-vlan host

I'm not going to go over this config in extreme detail, because this information can be found very easily anywhere. I just needed a baseline to show a few other interesting things.

R1 is the promiscuous default gateway, R6 is an isolated host, R2 and R4 are in a community VLAN.

First, let's make sure it works.

Can everyone reach R1?

R2#ping 192.168.0.1

Type escape sequence to abort.

Sending 5, 100-byte ICMP Echos to 192.168.0.1, timeout is 2 seconds:

!!!!!

Success rate is 100 percent (5/5), round-trip min/avg/max = 1/2/4 ms

R4#ping 192.168.0.1

Type escape sequence to abort.

Sending 5, 100-byte ICMP Echos to 192.168.0.1, timeout is 2 seconds:

!!!!!

Success rate is 100 percent (5/5), round-trip min/avg/max = 1/2/4 ms

R6#ping 192.168.0.1

Type escape sequence to abort.

Sending 5, 100-byte ICMP Echos to 192.168.0.1, timeout is 2 seconds:

!!!!!

Success rate is 100 percent (5/5), round-trip min/avg/max = 1/2/4 ms

OK, our promiscuous port works.

Can R2 ping R4?

R2#ping 192.168.0.4

Type escape sequence to abort.

Sending 5, 100-byte ICMP Echos to 192.168.0.4, timeout is 2 seconds:

!!!!!

Success rate is 100 percent (5/5), round-trip min/avg/max = 1/2/4 ms

Looks good, community VLAN works.

Can R6 reach anything besides R1?

R6#ping 192.168.0.2

Type escape sequence to abort.

Sending 5, 100-byte ICMP Echos to 192.168.0.2, timeout is 2 seconds:

.....

Success rate is 0 percent (0/5)

R6#ping 192.168.0.4

Type escape sequence to abort.

Sending 5, 100-byte ICMP Echos to 192.168.0.4, timeout is 2 seconds:

.....

Success rate is 0 percent (0/5)

Nope, it really is isolated.

So it works - fantastic. But how does it work? Answering that question takes a bit of research.

This page documents the high-level workings:

http://www.cisco.com/en/US/docs/switches/lan/catalyst6500/ios/12.2SX/configuration/guide/pvlans.html

Specifically,

"• Primary VLAN— The primary VLAN carries unidirectional traffic downstream from the promiscuous ports to the (isolated and community) host ports and to other promiscuous ports.

• Isolated VLAN — [edited for brevity] An isolated VLAN is a secondary VLAN that carries unidirectional traffic upstream from the hosts toward the promiscuous ports and the gateway.

• Community VLAN—A community VLAN is a secondary VLAN that carries upstream traffic from the community ports to the promiscuous port gateways and to other host ports in the same community. [edited for brevity] "

So, in a nutshell, the primary VLAN carries traffic from the promiscuous port to everyone else, unidirectionally. An isolated VLAN carries traffic from every isolated host to the promiscuous port. And a community VLAN carries bidirectional traffic for members of the community, but only unidirectional traffic towards the promiscuous port.
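
Before digging into the mechanics, here's the intended end result written as a small Python sketch. It only models who is allowed to talk to whom, not how the switch enforces it; the Port class and can_talk function are hypothetical names, and the ports mirror the lab topology above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Port:
    name: str
    kind: str                        # "promiscuous", "community" or "isolated"
    secondary: Optional[int] = None  # secondary VLAN, if any


def can_talk(a: Port, b: Port) -> bool:
    """Layer 2 reachability between two ports in the same primary VLAN."""
    if a.kind == "promiscuous" or b.kind == "promiscuous":
        return True                  # everyone reaches the promiscuous port
    if a.kind == "community" and b.kind == "community":
        return a.secondary == b.secondary
    return False                     # isolated ports reach nothing else


R1 = Port("R1", "promiscuous")
R2 = Port("R2", "community", 216)
R4 = Port("R4", "community", 216)
R6 = Port("R6", "isolated", 402)

for src, dst in [(R2, R1), (R6, R1), (R2, R4), (R6, R2), (R6, R4)]:
    print(src.name, "->", dst.name, can_talk(src, dst))
```

The results line up with the pings earlier: everything reaches R1, R2 and R4 reach each other, and R6 reaches nobody but R1.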

OK, we got that, but still, how does it work?

It's all a clever manipulation of MAC address learning.

Referencing the pings I made above, let's look at the mac table on SW1 and SW2.

SW1#show mac address-table | i 124|216|402

402 0013.c460.2be0 DYNAMIC pv Fa0/1

402 0019.2fb8.d552 BLOCKED Fa0/13

124 0013.c460.2be0 DYNAMIC Fa0/1

124 0019.2fb8.d552 DYNAMIC pv Fa0/13

124 0019.e880.09c0 DYNAMIC pv Fa0/13

216 0013.c460.2be0 DYNAMIC pv Fa0/1

216 0019.e880.09c0 DYNAMIC Fa0/13

SW2#show mac address-table | i 124|216|402

402 0013.c460.2be0 DYNAMIC pv Fa0/13

402 0014.1ceb.f60f BLOCKED Fa0/13

402 0019.2fb8.d552 BLOCKED Fa0/6

124 0013.c460.2be0 DYNAMIC Fa0/13

124 0014.1ceb.f60f DYNAMIC pv Fa0/13

124 0019.2fb8.d552 DYNAMIC pv Fa0/6

124 0019.e880.09c0 DYNAMIC pv Fa0/2

124 0024.c4eb.ed68 DYNAMIC pv Fa0/4

216 0013.c460.2be0 DYNAMIC pv Fa0/13

216 0019.e880.09c0 DYNAMIC Fa0/2

216 0024.c4eb.ed68 DYNAMIC Fa0/4

There's a lot going on there, so let's break this down into more manageable chunks.

R6 is our simplest host - it's a single isolated host, so let's start there.

R6#show int fa0/0 | i bia

Hardware is Gt96k FE, address is 0019.2fb8.d552 (bia 0019.2fb8.d552)

So we now know that 0019.2fb8.d552 is R6's MAC address on its Fa0/0 port.

SW1#show mac address-table | i 0019.2fb8.d552

402 0019.2fb8.d552 BLOCKED Fa0/13

124 0019.2fb8.d552 DYNAMIC pv Fa0/13

SW2#show mac address-table | i 0019.2fb8.d552

402 0019.2fb8.d552 BLOCKED Fa0/6

124 0019.2fb8.d552 DYNAMIC pv Fa0/6

We see some new terms here in the CAM table, "BLOCKED" and "DYNAMIC pv".

Best I can tell without any documentation on these, this is what they mean:

BLOCKED = Do not forward traffic to this MAC (essentially the same as not learning it)

DYNAMIC pv = This is a receive-only MAC.

So putting that in context of R6, R6 is allowed to SEND on 402, and can RECEIVE on 124. It cannot receive on 402, because the MAC is "BLOCKED". This is what enforces an isolated VLAN. The MAC learning process takes place, but the CAM table flags them as unusable. Therefore, any traffic sent towards that MAC on that VLAN would be discarded.

Let's look at our earlier definition of isolated VLAN from Cisco:

• Isolated VLAN — [edited for brevity] An isolated VLAN is a secondary VLAN that carries unidirectional traffic upstream from the hosts toward the promiscuous ports and the gateway.

This is true: the MACs of R6 (and any other future isolated host) are flagged as unusable on that VLAN, with one exception...

R1#show int fa0/0 | i bia

Hardware is AmdFE, address is 0013.c460.2be0 (bia 0013.c460.2be0)

SW1#show mac address-table | i 0013.c460.2be0

402 0013.c460.2be0 DYNAMIC pv Fa0/1

[edited for brevity]

R1's MAC can be learned on 402. So R6 will be able to send frames out 402 towards R1. This completes the concept that 402 is a "one way" VLAN from Isolated -> Promiscuous.

But what about R1 -> R6?

• Primary VLAN— The primary VLAN carries unidirectional traffic downstream from the promiscuous ports to the (isolated and community) host ports and to other promiscuous ports.

SW1#show mac address-table | i 124

124 0013.c460.2be0 DYNAMIC Fa0/1

124 0019.2fb8.d552 DYNAMIC pv Fa0/13

On Fa0/1, we see R1, and on Fa0/13, we see R6. R6 has the "DYNAMIC pv" status, meaning it is a receive-only MAC. Frames originating from 0019.2fb8.d552 would not be accepted, only frames towards 0019.2fb8.d552. We'll be generating more traffic in a moment, and the switches will learn more on vlan 124.

Now let's look at the community hosts.

First, re-generating traffic for MAC learning:

R2#ping 192.168.0.4

Type escape sequence to abort.

Sending 5, 100-byte ICMP Echos to 192.168.0.4, timeout is 2 seconds:

!!!!!

Success rate is 100 percent (5/5), round-trip min/avg/max = 1/2/4 ms

R2#show int fa0/0 | i bia

Hardware is AmdFE, address is 0019.e880.09c0 (bia 0019.e880.09c0)

R4#show int fa0/0 | i bia

Hardware is Gt96k FE, address is 0024.c4eb.ed68 (bia 0024.c4eb.ed68)

There's our MACs for R2 and R4.

How's the table look?

SW1#show mac address-table | i 0019.e880.09c0 ! R2

124 0019.e880.09c0 DYNAMIC pv Fa0/13

216 0019.e880.09c0 DYNAMIC Fa0/13

SW1#show mac address-table | i 0024.c4eb.ed68 ! R4

124 0024.c4eb.ed68 DYNAMIC pv Fa0/13

216 0024.c4eb.ed68 DYNAMIC Fa0/13

SW1#show mac address-table | i 0013.c460.2be0 ! R1

402 0013.c460.2be0 DYNAMIC pv Fa0/1

124 0013.c460.2be0 DYNAMIC Fa0/1

216 0013.c460.2be0 DYNAMIC pv Fa0/1

You should notice immediately we don't have any "BLOCKED" status here. That's because R2 and R4 can talk to each other. This port has no affiliation with the isolated VLAN in any fashion, so R2 and R4's MACs simply aren't learned on VLAN 402. We do have some "DYNAMIC pv". The primary VLAN is still used to carry traffic from R1 (promiscuous) to R2 and R4. Therefore, R2 and R4 must be learned on VLAN 124 (as "DYNAMIC pv"), but not able to speak, or they'd be able to reach hosts outside their community. Also, R1 should be "DYNAMIC pv" on 216 (community VLAN), so that replies are forced over 124 (primary VLAN).

With that wrapped up, here's some other cool stuff I did while working on this!

- Pushing private VLANs through a switch that doesn't support them. This is a bad, bad idea. Before I labbed this out this granularly, I thought, sure, why not? If the switch doesn't know it's a private VLAN, the whole "DYNAMIC pv" and "BLOCKED" MAC learning concepts stop working, and all hell breaks loose. Just don't do it!

- Try putting a regular, non-private, access port up on one of the primary or secondary VLANs. It just doesn't work - you can send frames at the port and the switch just won't learn your MAC and ignores all the traffic.

- Not so exotic on my last bullet, but what about using an SVI as part of a private VLAN? Well for starters, this is totally doable, but only as a promiscuous port. You'd be using the switch as a default gateway instead of a standalone router. The config looks like this:

interface Vlan124

ip address 192.168.0.249 255.255.255.0

private-vlan mapping 216,402

R6#ping 192.168.0.249

Type escape sequence to abort.

Sending 5, 100-byte ICMP Echos to 192.168.0.249, timeout is 2 seconds:

!!!!!

Success rate is 100 percent (5/5), round-trip min/avg/max = 1/2/4 ms

Trying to make an SVI with vlan 216 or 402 does not work.

Another oddity:

SW1#show int vlan124 | i bia

Hardware is EtherSVI, address is 0014.1ceb.f641 (bia 0014.1ceb.f641)

SW1#sh mac address-table | i 0014.1ceb.f641

SW1#

Well that's not something you see every day...

Happy Studying,

Jeff

[mini] OSPF Point-to-Multipoint .... Multicast?

2014-01-12T14:40:00.000-08:00

I recently took a practice lab and got dinged for points on an OSPF network type question. Without quoting the actual practice lab, the question was referencing a frame-relay link and said something akin to "use an OSPF network type that doesn't elect a DR and multicasts updates".

I've gotten in a (bad?) habit of just using point-to-multipoint with frame-relay. If you've studied OSPF at a design/granular level, you are probably aware that point-to-multipoint is relatively inefficient. However, it works in all sorts of crazy environments, so I just like using it in labs because of versatility.

So even though "point-to-point" would've worked in this environment, I used point-to-multipoint anyway.

And the answer guide insisted point-to-point was the only viable option.

I thought about this, and what the heck is the point of "point-to-multipoint non-broadcast" if "point-to-multipoint" doesn't multicast? Also, I know point-to-multipoint auto-discovers neighbors, so how the heck is it not multicasting?

My lab is R1 -> Frame Relay Switch -> R2
DLCI is R1 102 -> R2 201

R1:
interface Serial0/0
ip address 192.168.0.1 255.255.255.0
encapsulation frame-relay
ip ospf network point-to-multipoint
no keepalive
clock rate 2000000
frame-relay map ip 192.168.0.2 102 broadcast
no frame-relay inverse-arp

router ospf 1

log-adjacency-changes

network 0.0.0.0 255.255.255.255 area 0

R2:

interface Serial0/0

ip address 192.168.0.2 255.255.255.0

encapsulation frame-relay

ip ospf network point-to-multipoint

no keepalive

clock rate 2000000

frame-relay map ip 192.168.0.1 201 broadcast

no frame-relay inverse-arp

shutdown

router ospf 1

log-adjacency-changes

network 0.0.0.0 255.255.255.255 area 0

R2(config-if)#no shut

R2(config-if)#

*Mar 1 00:20:11.075: %OSPF-5-ADJCHG: Process 1, Nbr 192.168.0.1 on Serial0/0 from LOADING to FULL, Loading Done

Clearly, we have automatic neighbor discovery.

Aha! Multicast! I knew it!

But wait... here comes an update:

So the truth comes out - Hello packets are multicast, hence the automatic neighbor discovery. But updates are unicast, so the lab was correct in docking me. Live and learn!
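
To keep the network types straight, here's a small Python lookup table built from the behavior above plus the commonly documented defaults for the other types. Treat it as a study aid under those assumptions, not an authoritative reference; the dictionary layout and function name are my own.

```python
# OSPF network types: DR election, and how hellos and updates are sent.
NETWORK_TYPES = {
    "broadcast": {"elects_dr": True, "hellos": "multicast", "updates": "multicast"},
    "non-broadcast": {"elects_dr": True, "hellos": "unicast", "updates": "unicast"},
    "point-to-point": {"elects_dr": False, "hellos": "multicast", "updates": "multicast"},
    "point-to-multipoint": {"elects_dr": False, "hellos": "multicast", "updates": "unicast"},
    "point-to-multipoint non-broadcast": {"elects_dr": False, "hellos": "unicast", "updates": "unicast"},
}


def matching_types(**wanted):
    """Return the network types whose attributes match every requirement."""
    return [
        name
        for name, attrs in NETWORK_TYPES.items()
        if all(attrs[key] == value for key, value in wanted.items())
    ]


# The practice lab's requirement: no DR election, multicast updates.
print(matching_types(elects_dr=False, updates="multicast"))
```

Only point-to-point comes back, which is exactly what the answer guide insisted on.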

Jeff

[mini] VTY Rotary

2014-01-05T21:09:00.001-08:00

I've always found it helps a great deal to have a use-case for a feature. There are thousands of features to learn and be at least somewhat familiar with when attempting the CCIE lab. Remembering them all is a real challenge, but knowing how to apply a feature and why you'd want to use it makes it all that much easier to remember. One of those crazy features is "rotary" when used in conjunction with a VTY line.

I totally get what it does:

line vty 0 4
password cisco
login
rotary 1

This config would allow you to telnet to this router on port 23, enter the password "cisco", and get privilege level 1. With "rotary 1", you could also telnet to 3001 and have the same experience. Basically, it would mimic port 23 on port 3001.

R2#telnet 192.168.0.1 3001

Trying 192.168.0.1, 3001 ... Open

User Access Verification

Password:

R1>

If you used "rotary 2", you'd be able to telnet to 3002, etc.
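
The port math is simple enough to write down as a one-liner. A minimal Python sketch, assuming only the 3000-plus-group telnet mapping shown above (the function name is hypothetical):

```python
def rotary_telnet_port(group: int) -> int:
    """TCP port that reaches a VTY rotary group via telnet."""
    if group < 1:
        raise ValueError("rotary group must be a positive integer")
    return 3000 + group


print(rotary_telnet_port(1), rotary_telnet_port(2))
```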

That's the nuts and bolts of what rotary does. I am immediately reminded of a quote from Despicable Me:

http://www.youtube.com/watch?v=aD4k148YcDU

"...because I was wondering, under what circumstances would we use this?"

I haven't exactly been dying to telnet to my equipment on alternative port numbers.

Now, I finally understand the use case. It has to do with using different authentication methods on different lines.

For example:
line vty 0 4
privilege level 1 ! default, but included for clarity
password cisco
login
line vty 5
privilege level 15
password secretpassword
login

We see line 5 has a higher privilege level than lines 0-4. So how do you hit line 5? Well, I suppose you could telnet at the router 5 times and fill up the first five lines, then hit it again, but that's not very practical. Not to mention you may not know the password for 0-4, if you're an admin-type logging in to line 5. Enter rotary:

line vty 5

privilege level 15

password secretpassword

rotary 1

Now when we telnet to port 23:

R2#telnet 192.168.0.1

Trying 192.168.0.1 ... Open

User Access Verification

Password: cisco

R1>

Now when we telnet to port 3001:

R2#telnet 192.168.0.1 3001

Trying 192.168.0.1, 3001 ... Open

User Access Verification

Password: secretpassword

R1#

Trying each line's respective password on the other's port number produces the expected failure.

That's a simple use case, let's take a more advanced one.

Let's say you're using lock & key / dynamic ACLs and need *local* auth on one line only.

R1(config)#aaa new-model

R1(config)#aaa authentication login default group radius

R1(config)#aaa authentication login LOCKANDKEY local

R1(config)#username LOCK password ANDKEY

line vty 0 40

line vty 41

login authentication LOCKANDKEY

rotary 1

autocommand access-enable host

The idea here is to use RADIUS for authentication of lines 0-40, and local auth for line 41, to allow your Lock & Key ACL to work.

I didn't actually setup a lock & key ACL or a RADIUS server, but this can get the point across still:

Regular telnet just fails in our case because of the lack of RADIUS servers:

R2#telnet 192.168.0.1

Trying 192.168.0.1 ... Open

% Authentication failed

[Connection to 192.168.0.1 closed by foreign host]

However, telnetting to 3001:

R2#telnet 192.168.0.1 3001

Trying 192.168.0.1, 3001 ... Open

User Access Verification

Username: LOCK

Password: ANDKEY

% No input access group defined for FastEthernet0/0.

[Connection to 192.168.0.1 closed by foreign host]

The error message is because of the lack of a lock & key ACL, but the proof of concept is the same.

Cheers,

Jeff Kronlage

[mini] BGP Auto-Summary

2013-12-14T16:08:00.002-08:00

I recently got a task on a practice lab that was obviously about BGP auto-summary. I'm well-practiced in BGP on production systems, but who the heck uses auto-summary any longer? It then occurred to me that I'd never even turned it on.

My first attempt was to:

int lo5
ip address 5.5.5.5 255.255.255.0

router bgp 100
auto-summary
network 5.5.5.0 mask 255.255.255.0

I peered it up with another router, and expected to see "5.0.0.0/8" in the BGP table of the other router.

No such luck; I ended up with 5.5.5.0/24.

After some googling, I found two methods to make this work:

int lo5
ip address 5.5.5.5 255.255.255.0

router bgp 100
auto-summary
network 5.0.0.0

That will produce 5.0.0.0 in both the local BGP table and anyone it peers to.

You can also:

int lo5
ip address 5.5.5.5 255.255.255.0

router bgp 100
auto-summary
redistribute connected

That will also get you 5.0.0.0 in both the local BGP table and anyone it peers to.

Interestingly, if you configure:

int lo5
ip address 5.5.5.5 255.255.255.0

int lo6
ip address 5.5.6.6 255.255.255.0

router bgp 100
auto-summary
network 5.5.0.0 mask 255.255.0.0

That will also produce 5.0.0.0/8.
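
For anyone rusty on where the classful boundary falls, here is a minimal Python sketch of the calculation auto-summary is doing. It only models the class A/B/C rule; the function name classful_network is my own and has nothing to do with IOS itself.

```python
import ipaddress


def classful_network(address: str) -> ipaddress.IPv4Network:
    """Return the classful (class A/B/C) network that contains an IPv4 address."""
    ip = ipaddress.IPv4Address(address)
    first_octet = int(ip) >> 24
    if first_octet < 128:      # class A: 0-127
        prefix_len = 8
    elif first_octet < 192:    # class B: 128-191
        prefix_len = 16
    elif first_octet < 224:    # class C: 192-223
        prefix_len = 24
    else:                      # class D/E have no classful network boundary
        raise ValueError(f"{address} is not a class A, B, or C address")
    # strict=False masks off the host bits for us
    return ipaddress.ip_network(f"{address}/{prefix_len}", strict=False)


if __name__ == "__main__":
    print(classful_network("5.5.5.5"))      # 5.0.0.0/8
    print(classful_network("192.168.1.3"))  # 192.168.1.0/24
```

Both loopbacks in the examples above (5.5.5.5 and 5.5.6.6) fall in class A space, which is why the summary always lands on 5.0.0.0/8.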

Not a complex topic, but it works differently than the way IGPs do, and I thought it was worth mentioning.

Happy studying!

Jeff

[mini] PPPoE in the DocCD

2013-11-29T13:12:00.002-08:00

I ran across a PPPoE problem a couple days ago, and let me tell you, this is not my favorite topic. I've only used it in production once, and I don't come across it in practice labs enough to keep it fresh in my mind. I've been skipping these questions when doing time-trial practice labs and just using traditional Ethernet whenever this was called for, and just taking a hit on the points. Not a good plan, but I felt there were more important things to focus on.

One of the other reasons I haven't wanted to focus on it, knowing that I only see it once in a blue moon, is that the documentation is so spread out I could never figure out where all the various pieces are. The lab questions always call for server and client installs, and they're on different pages, and spread out across those two pages.

I decided a good interim step on this problem is to nutshell exactly where the pieces are in the documentation.

First, you want the Broadband Access Aggregation and DSL Configuration Guide. It's on the main "Configuration" page for 12.4T that you've been going to in the DocCD.

The next page has a lot of options on it. Fortunately we only need two of them, and they're right on top of each other:

- PPPoE "server" is on "Providing Protocol Support for Broadband Access Aggregation of PPPoE Sessions"
- PPPoE client is on "PPP over Ethernet Client"

We'll start on the server side. "R1" will be our server router. I'm not providing a diagram; there are just two devices involved, connected via Fa0/0.

You need three sections on the "Providing Protocol Support for Broadband Access Aggregation of PPPoE Sessions" page:

- "Configuring a Virtual Template Interface"
- "Defining a PPPoE Profile"
- "Assigning a PPPoE Profile to an Ethernet Interface"

I put them in the order I felt they should be done in, so let's start with "Configuring a Virtual Template Interface". Frankly, if you don't know how to do this, it's worth memorizing. It comes up in more places than just PPPoE (PPP over Frame Relay, namely).

Let's apply the necessary pieces as we walk through this:

R1:
R1(config)#interface virtual-template 1
R1(config-if)#ip address 192.168.1.1 255.255.255.0 ! you don't actually have to use IP unnumbered
R1(config-if)#mtu 1492 ! not really a requirement but a really good idea
R1(config-if)#peer default ip address dhcp-pool TEST-POOL

To be fair, the "peer default" bit for assigning IP addresses to clients isn't actually in the above documentation snippet, but it is elsewhere on the page if you search for it. It's also not a requirement; you could assign IPs statically.

Next step -

R1(config-if)#bba-group pppoe global
R1(config-bba-group)# virtual-template 1

Yep, that's all you really must have to get the bba-group working. Now let's assign it to an interface.

R1(config)#interface fa0/0
R1(config-if)#pppoe enable
R1(config-if)#no shut

The pppoe enable command will expand to pppoe enable group global on its own, if you do a "show run".

We did reference a DHCP pool up above; we'll need to create that.

R1(config)#ip dhcp pool TEST-POOL
R1(dhcp-config)#network 192.168.1.0

That's all - now for the client side. As we saw earlier (same image repeated from above), the client side is directly underneath the "server" side.

Once you're in there, there are once again many options; however, the two you need are pretty easy to spot. Note carefully that we are on the "12.2(13)T 12.4T and Later Releases" section. There's one just above this for pre-12.2(13)T.

Configuring the dialer interface first makes more sense, so we'll start there:

R2(config)#int dialer 1
R2(config-if)#mtu 1492
R2(config-if)#encapsulation ppp
R2(config-if)#ip address negotiated
R2(config-if)#dialer pool 1

R2(config-if)#interface fa0/0 ! the pppoe-client command goes on the physical Ethernet interface
R2(config-if)#pppoe-client dial-pool-number 1
R2(config-if)#no shut

That's it - if you did it correctly, you should get output something like this on your client:

*Mar 1 00:28:51.103: %DIALER-6-BIND: Interface Vi1 bound to profile Di1
*Mar 1 00:28:51.191: %LINK-3-UPDOWN: Interface Virtual-Access1, changed state to up
*Mar 1 00:28:52.235: %LINEPROTO-5-UPDOWN: Line protocol on Interface Virtual-Access1, changed state to up

R2(config-if)#do sh ip int dialer1 | i Internet address
Internet address is 192.168.1.3/32

R2(config-if)#do ping 192.168.1.1
Type escape sequence to abort.
Sending 5, 100-byte ICMP Echos to 192.168.1.1, timeout is 2 seconds:
!!!!!
Success rate is 100 percent (5/5), round-trip min/avg/max = 16/23/36 ms

Cheers,

Jeff

[mini] Embarrassing BGP as-override misunderstanding

2013-11-10T20:16:00.003-08:00

It can be hard to post on the Internet about dramatically misunderstanding a technology.

In my defense, I've never worked for an MPLS provider, so I've never used as-override outside of a lab - actually I'm not sure I've ever used it in a lab before tonight, either.

For those unfamiliar with the basic idea, as-override is used in MP-BGP/VRF/MPLS scenarios where the customer wants to re-use an AS number on several sites. Since the CE routers see the PE routers as eBGP peers, they find their own AS number in the AS path and reject the update from the PE. as-override is the PE mechanism to overcome this problem.

Let's take a four-router scenario - two CE routers and two PE.

It might look something like this:

CE1 (AS 100) -> PE1 (AS 250) -> PE2 (AS 250) -> CE2 (AS 100)

Clearly, when PE2 advertises CE1's routes to CE2, CE2 should reject them.

Fixing this on the CE side is very easy; you can change the AS number or use allowas-in to allow the CE to ignore the fact that its own AS number is present while receiving BGP updates.

As a network consultant I regularly deal with MPLS site activations, and twice now I've had the carrier offer to use as-override to fix the problem above. I declined both times, once opting to change the AS number on the CE and once using allowas-in. Since the carrier technician was signed into the PE connected to my CE, I'd gotten the idea that that was the only place the as-override would go. Boy was I wrong.

I spent about 90 minutes this evening trying to get as-override working in the scenario described above. CE1 would send AS 100 to PE1. PE1 was configured with as-override facing CE1, and I expected PE1 to strip out AS 100 on the way to PE2. Incorrect!

I'd repeatedly pull up PE2's BGP table:

PE2#sh ip bgp vpnv4 vrf CCIE | s 1.1.1.1
*>i1.1.1.1/32 192.168.23.2 0 100 0 100 i

BGP output doesn't paste well into a non-monospaced document, but in short, it shows the prefix is still learned from AS 100 (the other "100" adjacent to that is the local preference). I sat there scratching my head, wondering how CE2 was going to be able to learn this (quick answer - it can't).

It turns out as-override is not an ingress setting at all. It's an egress setting. All it does is tell the PE it's configured on that, when passing routes to that CE, it should find the CE's AS number in the AS path and replace it with the PE's own AS number.

In other words, in our scenario:

CE1 (AS 100) -> PE1 (AS 250) -> PE2 (AS 250) -> CE2 (AS 100)

If I were to set as-override on PE1, that would enable CE1 to receive CE2's routes - not vice-versa.

CE1(config)#do sh ip bgp | i 2.2.2.2
*> 2.2.2.2/32 192.168.12.2 0 250 250 i

We see that CE1 sees 2.2.2.2 (CE2's loopback) as going through AS 250 twice, instead of AS 250 followed by AS 100.
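
To make the egress find-and-replace concrete, here is a small Python sketch of the AS path handling, under the assumption that a PE first substitutes the neighbor's AS (when as-override is on) and then prepends its own AS as usual for eBGP. The function names advertise_to_ce and ce_accepts are made up for illustration; this models the behavior described above, not the IOS implementation.

```python
def advertise_to_ce(as_path, local_as, neighbor_as, as_override=False):
    """Model the AS path a PE sends to an eBGP CE neighbor.

    as_path     -- AS path of the route as the PE holds it (e.g. learned via iBGP)
    local_as    -- the PE's own AS number
    neighbor_as -- the AS number of the CE the route is being sent to
    """
    path = list(as_path)
    if as_override:
        # Egress find-and-replace: swap the CE's AS for the PE's AS.
        path = [local_as if asn == neighbor_as else asn for asn in path]
    # Normal eBGP behavior: prepend our own AS on the way out.
    return [local_as] + path


def ce_accepts(as_path, own_as):
    """eBGP loop prevention: a router rejects paths containing its own AS."""
    return own_as not in as_path


if __name__ == "__main__":
    # CE2's route (2.2.2.2/32) as PE1 holds it: originated in AS 100.
    route_from_ce2 = [100]

    plain = advertise_to_ce(route_from_ce2, local_as=250, neighbor_as=100)
    print(plain, ce_accepts(plain, own_as=100))        # [250, 100] False

    fixed = advertise_to_ce(route_from_ce2, local_as=250, neighbor_as=100,
                            as_override=True)
    print(fixed, ce_accepts(fixed, own_as=100))        # [250, 250] True
```

The second result lines up with the "250 250" path CE1 shows for 2.2.2.2/32 above.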

Thought this might help others out there stuck on a similar misunderstanding.

Cheers,

Jeff

[mini] Why does LDP "require" a /32 Loopback?

2013-11-07T20:58:00.000-08:00

A few days ago I asked a coworker why LDP sessions had issues if they weren't peered on /32s. He answered, it doesn't have to be a /32, but the IGP and LDP had to agree on the mask length. So I asked the more specific question - why does it have to agree on the mask length? He didn't know. And neither did I.

Everyone seems to know that /32s are best practice for the LDP router ID. But it's hard to find a good, clear explanation of why this is.

Let's start with some obvious facts.

- "The router considers all the IP addresses of all operational interfaces.... If these addresses include loopback interface addresses, the router selects the largest loopback address." http://www.cisco.com/en/US/docs/ios/12_4t/12_4t2/ftldp41.html#wp1654686

As always, my posts are geared for the CCIE lab, and it's a fair bet most of your gear on the lab is going to have a loopback. So, expect the router ID to be a loopback, unless it's specified otherwise.

- You can specify the interface with mpls ldp router-id <interface>. If you don't want it to be a loopback, or you want a certain loopback to be chosen over another, then use this command. If you want to change the router-id while LDP is already up you have to add the force keyword, e.g. mpls ldp router-id lo7 force. If you don't use force, and LDP was already online, you'll have to reboot in order for the switch to take place.

- You can set the range of labels that LDP is allowed to use with mpls label range <lower> <upper>. I find this useful in debugging, because you can make your labels match your router number, which makes the output easier to read. LDP show commands are not always easy to interpret if you're not used to reading them.

-  "The LDP default behavior is to allocate local labels for all non-BGP prefixes."
http://www.cisco.com/en/US/docs/ios/12_4t/12_4t2/ftldp41.html#wp1654686

So what's that mean to us? It might be better phrased as "The LDP default behavior is to allocate local labels choosing the best administrative distance as long as it's not from BGP".

- This problem is most commonly seen with OSPF (although you could see it from a summary route as well). The sure-fire way to demonstrate it is to create a /24 loopback and not change the default network type. OSPF automatically uses network type LOOPBACK, which is always advertised as a /32.

- With MPLS VPNs, BGP actually distributes the labels for the VRFs, not LDP. You learn the stacked VRF tag, relevant only to the egress PE, from BGP. You also learn the global routing table's next hop. The next-hop is used to find out the LDP label.

Let's take a look at how this plays out.

R3 is trying to reach R1 in VRF CCIE. R3's IP address is 3.3.3.3 and R1's IP address is 1.1.1.1. R2 is sitting in the middle of the two.

R3#ping vrf CCIE 1.1.1.1
Type escape sequence to abort.
Sending 5, 100-byte ICMP Echos to 1.1.1.1, timeout is 2 seconds:
.....
Success rate is 0 percent (0/5)
As we can see, ping is failing.

R3#sh ip route vrf CCIE 1.1.1.1
Routing entry for 1.1.1.1/32
Known via "bgp 100", distance 200, metric 0, type internal
Last update from 11.11.11.11 00:26:04 ago
Routing Descriptor Blocks:
* 11.11.11.11 (Default-IP-Routing-Table), from 22.22.22.22, 00:26:04 ago
      Route metric is 0, traffic share count is 1
      AS Hops 0
We have a route to reach it.

R3#show ip cef vrf CCIE 1.1.1.1
1.1.1.1/32, version 3, epoch 0, cached adjacency 192.168.23.2
0 packets, 0 bytes
tag information set
    local tag: VPN-route-head
    fast tag rewrite with Fa0/0, 192.168.23.2, tags imposed: {200 103}
via 11.11.11.11, 0 dependencies, recursive
    next hop 192.168.23.2, FastEthernet0/0 via 11.11.11.11/32
    valid cached adjacency
    tag rewrite with Fa0/0, 192.168.23.2, tags imposed: {200 103}

I used the mpls label range command (mentioned above) to make each router's tags start with its own router number. In this case, we should be using an MPLS "transit" tag of 200, and an MPLS "VRF" tag of 103.

R3#show mpls ldp bindings | b 11.11.11.11
tib entry: 11.11.11.11/32, rev 6
        local binding: tag: 300
        remote binding: tsr: 22.22.22.22:0, tag: 200
<output omitted>

We know that tag 200 references R1's primary routing table loopback IP (11.11.11.11).

R3#show mpls forwarding-table 11.11.11.11
Local Outgoing    Prefix            Bytes tag Outgoing   Next Hop
tag    tag or VC   or Tunnel Id      switched   interface
300    200         11.11.11.11/32      0          Fa0/0      192.168.23.2

We know that means sending traffic out Fa0/0 towards R2 (192.168.23.2) with tag 200.

Ok, so this router should be able to send traffic, right?

R2#debug mpls packet
MPLS packet debugging is on
R3#ping vrf CCIE 1.1.1.1 rep 2 timeout 1
Type escape sequence to abort.
Sending 2, 100-byte ICMP Echos to 1.1.1.1, timeout is 1 seconds:
..
Success rate is 0 percent (0/2)

R2#
*Mar 1 00:36:08.651: MPLS: Fa0/1: recvd: CoS=6, TTL=255, Label(s)=0
*Mar 1 00:36:09.067: MPLS: Fa0/1: recvd: CoS=6, TTL=255, Label(s)=0
R2 gets the MPLS packet just fine! And that's all it does. Notice my debug doesn't say anything about forwarding it on.

R2#show mpls ldp binding | b 11.11.11.11
tib entry: 11.11.11.11/32, rev 10
        local binding: tag: 200
        remote binding: tsr: 33.33.33.33:0, tag: 300
<output omitted>

We see R2 has locally bound tag 200 for 11.11.11.11, and has received a tag from R3 for 11.11.11.11, but ... no tag from R1?

Let's look at the routing tables.

R2#sh ip route 11.11.11.11
Routing entry for 11.11.11.11/32
Known via "ospf 1", distance 110, metric 2, type intra area
Last update from 192.168.12.1 on FastEthernet0/0, 00:00:02 ago
Routing Descriptor Blocks:
* 192.168.12.1, from 11.11.11.11, 00:00:02 ago, via FastEthernet0/0
      Route metric is 2, traffic share count is 1
R2 sees this as a /32.

R3#sh ip route 11.11.11.11
Routing entry for 11.11.11.11/32
Known via "ospf 1", distance 110, metric 3, type intra area
Last update from 192.168.23.2 on FastEthernet0/0, 00:39:16 ago
Routing Descriptor Blocks:
* 192.168.23.2, from 11.11.11.11, 00:39:16 ago, via FastEthernet0/0
      Route metric is 3, traffic share count is 1

R3 sees this as a /32. Consequently, R3 has no problem sending the MPLS packet to R2.

R1#sh ip route 11.11.11.11
Routing entry for 11.11.11.0/24
Known via "connected", distance 0, metric 0 (connected, via interface)
Routing Descriptor Blocks:
* directly connected, via Loopback0
      Route metric is 0, traffic share count is 1
And R1 sees it as a ... /24 connected route. As mentioned above, OSPF is the common culprit here. It's advertising a /32 to everyone else, except the local router, which still sees it as a /24. In fact...

R2#sh mpls ldp binding | b 11.11.11.0/24
tib entry: 11.11.11.0/24, rev 11
        remote binding: tsr: 11.11.11.0:0, tag: exp-null
<output omitted>

R1 is advertising a /24 to R2. MPLS bindings work a bit differently than the routing table. R2's LDP process isn't simply going to choose the best route to R1; it's matching labels to prefixes, and two prefixes are considered different if they're not identical. So R2 just drops the packet: its route is 11.11.11.11/32, and the only binding it has from R1 is for 11.11.11.0/24.

The fix is to just make the two prefix lengths the same. They don't need to be /32s! The easiest way to make this happen in this scenario is to change the OSPF network type away from LOOPBACK and stop forcing the /32 advertisement:

R1(config)#int lo0
R1(config-if)#ip ospf network point-to-point
R2#sh mpls ldp binding | b 11.11.11.0/24
tib entry: 11.11.11.0/24, rev 16
        local binding: tag: 203
        remote binding: tsr: 11.11.11.11:0, tag: exp-null
        remote binding: tsr: 33.33.33.33:0, tag: 305
<output omitted>

We can see R2 now has bindings from both R1 and R3 for the same prefix and prefix length.

R3#ping vrf CCIE 1.1.1.1
Type escape sequence to abort.
Sending 5, 100-byte ICMP Echos to 1.1.1.1, timeout is 2 seconds:
!!!!!
Success rate is 100 percent (5/5), round-trip min/avg/max = 60/66/76 ms

And forwarding works end-to-end.

In a nutshell: LDP associates labels with both the IP address and subnet mask. The prefix length does have to match to become part of the same MPLS forwarding path. However, the prefix length does not have to be /32 - it's just a good, safe practice.
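
Here is a minimal Python sketch of that exact-match idea, assuming a label table keyed on (prefix, advertising router). The names outgoing_label, "R1" and "R3" are purely illustrative, and the real LIB/LFIB is more involved than a dictionary, but it shows why a /24 binding does nothing for a /32 route.

```python
import ipaddress


def outgoing_label(route_prefix, next_hop_lsr, remote_bindings):
    """Return the label the next-hop router advertised for exactly this prefix.

    Unlike a routing lookup there is no longest-match here: the address AND
    the mask must both match, otherwise there is no label (None).
    """
    key = (ipaddress.ip_network(route_prefix), next_hop_lsr)
    return remote_bindings.get(key)


if __name__ == "__main__":
    net = ipaddress.ip_network

    # Before the fix: R1 advertises a label for its connected /24,
    # while OSPF hands R2 a /32 route for the same loopback.
    before = {
        (net("11.11.11.0/24"), "R1"): "exp-null",
        (net("11.11.11.11/32"), "R3"): 300,
    }
    print(outgoing_label("11.11.11.11/32", "R1", before))  # None

    # After "ip ospf network point-to-point": route and binding are both /24.
    after = {
        (net("11.11.11.0/24"), "R1"): "exp-null",
        (net("11.11.11.0/24"), "R3"): 305,
    }
    print(outgoing_label("11.11.11.0/24", "R1", after))    # exp-null
```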

[mini] Static RP Address Blocks auto-RP Dense Flows

2013-10-27T15:21:00.001-07:00

My first 40 posts were written while I was attempting to improve my understanding of a number of topics. At this point in my studying, I've moved on to practicing interoperability of features, so I haven't written any new posts in some time. My first posts were between five and twenty page topic deep-dives. Now that I've moved on to review & practice, I'm planning on starting a new series of posts, which I will label with [mini] in front of the subject. These will cover any small problems that really got me stuck while doing practice labs. Same quality as my old posts, but much smaller scope.

Today, I got stuck on a multicast problem.

I have EIGRP running on every interface, and pim sparse-dense mode on every interface.
Every IP address has reachability to every other IP address. The last octet IP on every segment is the router number. Every router has a loopback of Y.Y.Y.Y where Y is the router number.

I was working a lab for auto-RP. Reduced to the simpler scenario above, R1 was the mapping agent and R2 was the RP candidate. R3 would then join 239.0.0.1, and R1 would send a ping towards 239.0.0.1 and expect a reply.

The setup was as follows (remember, PIM sparse-dense and routing are already set up):

R1:
ip pim send-rp-discovery Loopback0 scope 10 interval 2

R2:
ip pim send-rp-announce Loopback0 scope 10 interval 2

R3:
interface FastEthernet0/0
ip igmp join-group 239.0.0.1

And be damned if I could get the join on R3 to work. I discovered pretty quickly that R3 wasn't learning the dynamic RP address:

R3#sh ip pim rp mapping
PIM Group-to-RP Mappings

R3#

"Well there's your problem!"

After a lot of digging, I finally noticed some odd output on R2:

R2#sh ip mroute 224.0.1.40 | b 224
(*, 224.0.1.40), 00:15:00/stopped, RP 2.2.2.2, flags: SJCL
Incoming interface: Null, RPF nbr 0.0.0.0
Outgoing interface list:
    FastEthernet0/0, Forward/Sparse-Dense, 00:15:00/00:01:58

(1.1.1.1, 224.0.1.40), 00:14:47/00:02:57, flags: PLJTX
Incoming interface: FastEthernet0/0, RPF nbr 192.168.12.1
Outgoing interface list: Null

It's pretty evident that 224.0.1.40 (the mapping agent group) isn't going to reach R3, as the OIL lists "Null". That means R3 isn't going to learn the RP address, and therefore isn't going to be able to join the group. Let's look more closely at that output:

R2#sh ip mroute 224.0.1.40 | i 224
(*, 224.0.1.40), 00:21:47/stopped, RP 2.2.2.2, flags: SJCL
(1.1.1.1, 224.0.1.40), 00:21:34/00:02:59, flags: PLJTX

What's up with those flags?

S=Sparse, P=Pruned ... wait a minute! 224.0.1.40 is supposed to be dense mode forwarded.
Just to verify that, look at R1:

R1#sh ip mroute 224.0.1.40 | i 224
(*, 224.0.1.40), 00:36:03/stopped, RP 0.0.0.0, flags: DCL
(1.1.1.1, 224.0.1.40), 00:29:21/00:02:58, flags: LT

D=Dense

What the heck is R2 up to?
Turns out I hadn't removed some debugging config I'd put in earlier, which is really nothing new on these types of tasks, but this one struck me as odd:

R2:
ip pim rp-address 2.2.2.2

In fact, let's take it out and see what happens:

R2(config)#no ip pim rp-address 2.2.2.2
R2(config)#exit
R2#sh ip mroute 224.0.1.40 | b 224
(*, 224.0.1.40), 00:00:18/stopped, RP 0.0.0.0, flags: DCL
Incoming interface: Null, RPF nbr 0.0.0.0
Outgoing interface list:
    FastEthernet0/1, Forward/Sparse-Dense, 00:00:18/00:00:00
    FastEthernet0/0, Forward/Sparse-Dense, 00:00:18/00:00:00

(1.1.1.1, 224.0.1.40), 00:00:16/00:02:58, flags: LT
Incoming interface: FastEthernet0/0, RPF nbr 192.168.12.1
Outgoing interface list:
    FastEthernet0/1, Forward/Sparse-Dense, 00:00:16/00:00:00

I can't imagine why this behavior is there, but if you have ip pim rp-address Y.Y.Y.Y configured, the RP will automatically assume auto-RP groups originated by other routers are sparse mode instead of dense, which effectively breaks auto-RP. That makes no sense to me, and it took me almost two hours to find this line of config and pull it out. I also can't find any documentation on why this behavior happens.

In a nutshell: Configuring a static RP address on an auto-RP device will stop the device in question from sending auto-RP dense groups to downstream neighbors.
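
If you want to spot this quickly across several routers, a few lines of Python can pull the mode out of the "sh ip mroute | i 224" output. This is only a rough sketch: it assumes the line format shown above and uses just the two flags discussed in this post (S=Sparse, D=Dense); the function name entry_mode is my own.

```python
import re

# Matches lines such as:
# (*, 224.0.1.40), 00:21:47/stopped, RP 2.2.2.2, flags: SJCL
MROUTE_LINE = re.compile(
    r"^\((?P<source>[^,]+), (?P<group>[^)]+)\).*flags: (?P<flags>[A-Za-z]*)"
)


def entry_mode(line):
    """Classify an mroute entry as 'dense', 'sparse', or 'unknown' by its flags."""
    match = MROUTE_LINE.match(line.strip())
    if match is None:
        raise ValueError(f"not an mroute entry line: {line!r}")
    flags = match.group("flags")
    if "D" in flags:
        return "dense"
    if "S" in flags:
        return "sparse"
    # (S,G) entries in the output above carry neither flag.
    return "unknown"


if __name__ == "__main__":
    r2 = "(*, 224.0.1.40), 00:21:47/stopped, RP 2.2.2.2, flags: SJCL"
    r1 = "(*, 224.0.1.40), 00:36:03/stopped, RP 0.0.0.0, flags: DCL"
    print("R2:", entry_mode(r2))  # R2: sparse
    print("R1:", entry_mode(r1))  # R1: dense
```

Any router reporting "sparse" for 224.0.1.40 in a sparse-dense auto-RP design is worth checking for a leftover static RP.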

Cheers,

Jeff Kronlage