Jeff Kronlage's CCIE Study Blog<br />
GETVPN (posted 2016-01-02, updated 2016-01-05)<br />
<br />
GETVPN, or Group Encrypted Transport VPN, is Cisco's implementation of the GDOI standard. GDOI, or Group Domain of Interpretation, is defined in RFC 6407, which obsoleted the original RFC, 3547.<br />
<br />
GDOI was originally established to allow for a way of encrypting multicast traffic, which was rather cumbersome to do with, say, GRE-over-IPSEC tunnels previously.<br />
<br />
<a href="https://tools.ietf.org/html/rfc3547">https://tools.ietf.org/html/rfc3547 </a><br />
"GDOI Applications. Secure multicast applications include video broadcast and multicast file transfer."<br />
<br />
However, GETVPN is now commonly used to encrypt any type of traffic over a private network. Most often it runs over MPLS VPNs: an MPLS VPN separates customers but doesn't encrypt anything, so without encryption you're putting a lot of faith in your service provider not sniffing your data. GETVPN is L2/L3 agnostic, so arguably it could be used for any application <i>where NAT is not involved</i>. GETVPN does not replace DMVPN for Internet applications; more on that further down the document.<br />
<br />
At a high level, GETVPN establishes a set of rotating encryption keys that a <i>group</i> shares. In this fashion, any group member can encrypt data to any other group member without setting up a tunnel to the other group member. In fact, the entire system is "tunnel-less". Additionally, as GETVPN re-uses the original IP header, the underlying routing is preserved. So if you're using BGP to peer to an MPLS VPN, that same routing just keeps working even with the encrypted packets.<br />
<br />
How the encryption process occurs can be most easily shown over a series of slides.<br />
<br />
There are two router types involved with GETVPN: Key Servers (KS) and Group Members (GM). GMs, in this usage, are customer CEs that will be encrypting traffic to one another. KSs are control-plane-only routers that are not in the forwarding path, nor do they encrypt data.<br />
<br />
The first step is for the GMs to register to a KS. In order to do this, ISAKMP is established between the GM and the KS. This is a one-off ISAKMP session for this initial communication only.<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEij8byyKjkqXk7_TkYvVoc5Q65T4kYeeJFuV8JDZjy3Sy1lYCBtHyh29cyM_Ic_wnf7Oj6u4KXoOMLxyproCJLOeeIkayMOOPhPhgMrAOgeIqWzFeo-5WQjZxw8kI4sYtWmSp3i8gX4s8o/s1600/diagr4.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEij8byyKjkqXk7_TkYvVoc5Q65T4kYeeJFuV8JDZjy3Sy1lYCBtHyh29cyM_Ic_wnf7Oj6u4KXoOMLxyproCJLOeeIkayMOOPhPhgMrAOgeIqWzFeo-5WQjZxw8kI4sYtWmSp3i8gX4s8o/s1600/diagr4.png" /></a></div>
<br />
<br />
During this initial step, a "pull" is initiated from GM to KS. The GM receives the initial Key Encryption Key (KEK) and Traffic Encryption Key (TEK). As I mentioned above, the initial ISAKMP session, while it may stay up for a good while after registration completes, isn't used after this process - only the KEK is, to protect the rekey messages the KS sends from then on. It's important to note that all KSs, of which there can be up to 8, can encrypt using the KEK, which makes it sort of a distributed/shared phase 1, as opposed to the initial point-to-point ISAKMP session.<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjFVfg-GFbV271kNPoVnPb-ZX2Plzx83-725wqQRweVBUGcNJKZ-_JYgqhHIHxyigO_Et9V_lSd1HAFKoz8IingV2E5UI-Q-KS2zb-VgQjZxUKcwoNJRmOdUT4Ew4axmW98mJ5EaTGWpBM/s1600/diag2.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjFVfg-GFbV271kNPoVnPb-ZX2Plzx83-725wqQRweVBUGcNJKZ-_JYgqhHIHxyigO_Et9V_lSd1HAFKoz8IingV2E5UI-Q-KS2zb-VgQjZxUKcwoNJRmOdUT4Ew4axmW98mJ5EaTGWpBM/s1600/diag2.png" /></a></div>
<br />
The TEK, which is generated by the primary key server (and distributed to the other key servers), is then used by each GM to encrypt traffic to any other GM. I struggled with how to draw this to avoid the appearance of tunnels, which are inherently point-to-point. <br />
<br />
<div class="separator" style="clear: both; text-align: center;">
</div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjlkX1VQWJ8QllothXZ_DVe-lc8C-sYv7YjZuENbHf59d62kLCCJAjohq0XBgkRx63_CwEEQXRS3JXFmdf3lQHw_LyZo4UnGpiI4OuzmCYfFU9-0aAQOsmLRPMrFJlU5PHAz4gIuNmzWKE/s1600/diagr3.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjlkX1VQWJ8QllothXZ_DVe-lc8C-sYv7YjZuENbHf59d62kLCCJAjohq0XBgkRx63_CwEEQXRS3JXFmdf3lQHw_LyZo4UnGpiI4OuzmCYfFU9-0aAQOsmLRPMrFJlU5PHAz4gIuNmzWKE/s1600/diagr3.png" /></a></div>
<br />
There are some comparisons and contrasts to be drawn with both traditional IPSEC point-to-point tunnels and with DMVPN.<br />
<br />
An obvious difference with point-to-point IPSEC is that, with some exceptions we will cover throughout the document, <i>all</i> traffic egressing the CE -&gt; PE interface is encrypted, regardless of where it is destined. This makes for a tunnel-less, or group-"tunnel" style interface. Moreover, GETVPN retains the original source and destination addresses in the IP header, whereas point-to-point IPSEC in tunnel mode rewrites them with the tunnel endpoints. As such, traditional routing - and multicast - both work.<br />
<br />
While GETVPN and DMVPN may accomplish similar tasks, there are some significant differences there, as well. Without making a messy static-hack to the configuration, DMVPN only supports multicast from the DMVPN head-end to spokes. As pointed out above, native multicast works fine on GETVPN, without utilizing pseudo-multicast as is common at tunnel head-ends. Additionally, DMVPN builds dynamic tunnels from spoke-to-spoke on an as-needed basis - but that leaves the spoke still <i>building tunnels</i> every time it needs to speak to another spoke. This creates overhead, both in tunnel setup - there's a small, but measurable delay in each tunnel being created - and in scalability; if a spoke needs to speak to hundreds of other spokes, it must build and maintain hundreds of point-to-point IPSEC tunnels.<br />
<br />
A disadvantage of GETVPN is that it isn't supported with NAT, or the NAT must be engineered in such a way that it's invisible to the encryption devices. This has to do with the original IP addressing and header being preserved by GETVPN. One could arguably run GET on the Internet if the GMs and KSs used only public IP addressing, and, if needed, hide the NAT behind extra routers behind the GMs. I've seen documents on the Internet claiming even more can be done with GETVPN and NAT, but these are not supported use cases by Cisco, and I didn't try to verify them. Cisco's approach is evident: if an Internet-facing tunnel with NAT is required, it's best to use DMVPN, which works well with NAT.<br />
<br />
There's a fair amount going on behind-the-scenes in a GETVPN, and I'm going to pause explaining that at this point to look at some of the config. A key aspect of a CCIE is to know both the configuration steps and the steps happening behind-the-scenes, and I always find it best to introduce both in conjunction.<br />
<br />
Here's the topology we will be working from:<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhDRWq6djXmZLwFREyGBdo6rP3tBMYocXaXUun9EVzPtxc1j3FGW6ZRU2RvFVH4CD96eHapyCvu19_zaJMRvRBKjTPVE62730sELRGKtLcqN77spVXuOXLKDVqO7NtyhSwwavAekxhCXi4/s1600/diagr2.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhDRWq6djXmZLwFREyGBdo6rP3tBMYocXaXUun9EVzPtxc1j3FGW6ZRU2RvFVH4CD96eHapyCvu19_zaJMRvRBKjTPVE62730sELRGKtLcqN77spVXuOXLKDVqO7NtyhSwwavAekxhCXi4/s1600/diagr2.png" /> </a> </div>
<div class="separator" style="clear: both; text-align: left;">
I'll review IP address usage on-the-fly; there are too many links here to describe them all up front.</div>
<div class="separator" style="clear: both; text-align: left;">
Also, we won't be reviewing any of the P or PE devices, as they're just a basic MPLS VPN configuration.</div>
<div class="separator" style="clear: both; text-align: left;">
<br /></div>
<div class="separator" style="clear: both; text-align: left;">
</div>
<div class="separator" style="clear: both; text-align: left;">
We'll start by looking at the configuration of Key Server 1 (KS1). For now, we'll pretend KS2 doesn't exist, as I'll cover that as part of the COOP (pronounced "co-op") configuration later in the document.</div>
<div class="separator" style="clear: both; text-align: left;">
<br /></div>
<div class="separator" style="clear: both; text-align: left;">
But first, a quick review of scope. As with all my previous documentation, my articles are targeted at the CCIE R&amp;S. This means we'll only be inspecting the ISAKMP and IPSEC configuration enough for an R&amp;S understanding, and we'll be skipping any advanced topics that are irrelevant to R&amp;S (e.g. Trustsec integration).</div>
<div class="separator" style="clear: both; text-align: left;">
<br /></div>
<div class="separator" style="clear: both; text-align: left;">
KS1:</div>
<div class="separator" style="clear: both; text-align: left;">
crypto isakmp policy 1</div>
encr aes<br />
authentication pre-share<br />
group 2<br />
<br />
crypto isakmp key MYGDOIPSK address 0.0.0.0<br />
<br />
crypto ipsec transform-set aes128 esp-aes esp-sha-hmac<br />
mode tunnel<br />
<br />
crypto ipsec profile profile1<br />
set transform-set aes128<br />
<br />
crypto gdoi group GDOI-GROUP1<br />
identity number 1234<br />
server local<br />
rekey algorithm aes 128<br />
rekey authentication mypubkey rsa MYRSAKEY<br />
rekey transport unicast<br />
sa ipsec 1<br />
profile profile1<br />
match address ipv4 getvpn-acl<br />
replay time window-size 5<br />
address ipv4 192.168.111.111<br />
<div class="separator" style="clear: both; text-align: left;">
<br /></div>
<div class="separator" style="clear: both; text-align: left;">
ip access-list extended getvpn-acl</div>
deny udp any eq 848 any<br />
deny udp any any eq 848<br />
deny tcp any eq bgp any<br />
deny tcp any any eq bgp<br />
permit ip any any<br />
<div class="separator" style="clear: both; text-align: left;">
<br /></div>
<div class="separator" style="clear: both; text-align: left;">
<i>Not shown here is the BGP configuration. I have KS1 peered with PE1, advertising its loopback, 192.168.111.111. KS1 is set up similarly to how any CE router would be in an MPLS VPN.</i></div>
<div class="separator" style="clear: both; text-align: left;">
<br /></div>
<div class="separator" style="clear: both; text-align: left;">
With the understanding that I'm going to high-level the crypto explanations, here's what the various relevant pieces of the config do:</div>
<div class="separator" style="clear: both; text-align: left;">
<br /></div>
<div class="separator" style="clear: both; text-align: left;">
<b>crypto isakmp key MYGDOIPSK address 0.0.0.0</b></div>
<div class="separator" style="clear: both; text-align: left;">
<br /></div>
<div class="separator" style="clear: both; text-align: left;">
As mentioned above, all GMs stand up a temporary ISAKMP session to the KS during registration. In order to do so, they need to share a PSK (or use PKI, which is out of scope for this article). You can create a key per GM, or just one that matches all GMs. Here we've defined the key as MYGDOIPSK for all GMs ("0.0.0.0").</div>
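<br />
As a rough sketch of the per-GM alternative, the wildcard entry could be swapped for one key per GM address. The key strings below are made up for illustration, and 192.168.22.3 is a hypothetical second GM; only 192.168.11.3 (CE1) actually appears in this lab:<br />
<br />
! On the KS: one pre-shared key per GM (key strings are placeholders)<br />
crypto isakmp key CE1-GDOI-PSK address 192.168.11.3<br />
! 192.168.22.3 is a hypothetical address for a second GM<br />
crypto isakmp key CE2-GDOI-PSK address 192.168.22.3<br />
<br />
Each GM would then carry only its own key, pointed at the KS address. The trade-off is more lines to manage on the KS in exchange for being able to retire a single GM's key without touching the others.<br />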
<br />
You can view the ISAKMP sessions before they die off, if desired:<br />
<br />
KS1#show crypto isakmp sa<br />
IPv4 Crypto ISAKMP SA<br />
dst src state conn-id status<br />
192.168.111.111 192.168.11.3 GDOI_IDLE 1067 ACTIVE<br />
<br />
If all devices had come up after a fresh reboot, you'd see four connections here, but as I've only recently bounced one, the other three have expired already.<br />
<br />
<b>crypto gdoi group GDOI-GROUP1<br /> identity number 1234</b><br />
<br />
The crypto gdoi group command is where all the magic happens on the KSs, and we'll be reviewing the rest of the configuration below. It's important to note it's not assigned to an interface on the KS. The KS doesn't encrypt anything but the control-plane traffic, so this config, when used with 'server local', simply enables the GDOI KS process and opens UDP port 848 for communications with the GMs (and eventually other KSs for COOP). The <i>identity number</i> defines which encryption group this config belongs to - a KS can run multiple groups for different GMs, and keep the keying (and consequently the communication) isolated between groups. <br />
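<br />
To illustrate the multiple-group point, here is a minimal sketch of a second group on the same KS. The group name GDOI-GROUP2, identity number 5678 and ACL name getvpn-acl2 are all hypothetical and not part of this lab; the sub-commands simply mirror the ones used for GDOI-GROUP1:<br />
<br />
! Hypothetical second group; GMs configured with identity number 5678 get their own KEK/TEK<br />
crypto gdoi group GDOI-GROUP2<br />
 identity number 5678<br />
 server local<br />
 rekey algorithm aes 128<br />
 rekey authentication mypubkey rsa MYRSAKEY<br />
 rekey transport unicast<br />
 sa ipsec 1<br />
 profile profile1<br />
 ! getvpn-acl2 is a placeholder; it would need to be defined like getvpn-acl<br />
 match address ipv4 getvpn-acl2<br />
 replay time window-size 5<br />
 address ipv4 192.168.111.111<br />
<br />
A GM registers into one group or the other based on the identity number in its own config, so members of 1234 and 5678 never share keys.<br />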
<br />
<b> server local<br /> rekey algorithm aes 128<br /> rekey authentication mypubkey rsa MYRSAKEY<br /> </b><br />
Here we define the KEK and the rekey process. The KEK, as configured here, uses 128-bit AES, and rekey messages are authenticated with the RSA key "MYRSAKEY". The RSA keys need to be pre-created, which is accomplished with:<br />
<br />
<b>crypto key generate rsa label MYRSAKEY modulus 1024 exportable</b><br />
<br />
With a single KS you don't technically need to make the key exportable, but if you ever want to add a second KS, this is mandatory, so it's a good idea to do it to begin with.<br />
<br />
<b> rekey transport unicast</b> <br />
<br />
As with any IPSEC tunnel, the keys are rotated periodically so that in case they are compromised, they can't be used to decrypt messages in the future - in other words, the theory is that it takes longer to crack the keys than the actual lifetime of the key, therefore making it impossible for a hacker to decrypt data in real-time. GETVPN has to rekey both the KEK and the TEK periodically, at intervals defined at the IOS CLI. New keys are sent out prior to the expiration of the old key, so that there's a clean roll-over to the new key when the appropriate time has been reached.<br />
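<br />
For reference, a minimal sketch of where those two intervals live on the KS. The values shown (3600 seconds for the TEK, 86400 for the KEK) are commonly cited as the IOS defaults, but treat them as assumptions and check your release; this lab's config doesn't set either one explicitly:<br />
<br />
! TEK lifetime is set on the IPSEC profile referenced by "sa ipsec 1"<br />
crypto ipsec profile profile1<br />
 set security-association lifetime seconds 3600<br />
 set transform-set aes128<br />
<br />
! KEK lifetime is set under "server local" in the GDOI group<br />
crypto gdoi group GDOI-GROUP1<br />
 server local<br />
 rekey lifetime seconds 86400<br />
<br />
Keeping the TEK lifetime much shorter than the KEK lifetime means the data-plane key rotates often, while the key protecting the rekey messages themselves rotates far less frequently.<br />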
<br />
There are two methods for rekeying with GETVPN. If you look back to why GDOI was originally developed, it was to encrypt multicast traffic. So, logically, rekeying via multicast is an option. I didn't lab this as it would've required me to either move away from using MPLS as my core, or enable service-provider multicast over MPLS, which seemed excessive for the scope I was attempting to cover. Regardless, Cisco recommends using unicast rekey now, namely because there's an acknowledgement system in unicast that's not available in multicast. Multicast rekey is "fire and forget" and simply hopes the new keys reach the destination; unicast rekey double-checks that the keys were received by expecting an ACK back from each GM. Eventually, if a key is close to expiring and the GM hasn't received a replacement, it will attempt to re-register with the KS in order to resolve the issue.<br />
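<br />
As a small sketch of how that retry behavior can be tuned on the KS, the <i>rekey retransmit</i> command sets how long to wait and how many times to resend a rekey. The values 10 and 2 are illustrative assumptions, not something taken from this lab:<br />
<br />
crypto gdoi group GDOI-GROUP1<br />
 server local<br />
 rekey transport unicast<br />
 ! resend the rekey after 10 seconds, up to 2 more times (example values)<br />
 rekey retransmit 10 number 2<br />
<br />
! On the KS, to review rekey activity:<br />
show crypto gdoi ks rekey<br />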
<br />
The actual rekeying/retry logic is incredibly deep, and for more information on it, I recommend reading the Cisco documentation, which is actually quite good (to my surprise, as most of my CCIE-level articles got written in the first place because the Cisco documentation is generally awful):<br />
<br />
<a href="http://www.cisco.com/c/en/us/td/docs/ios/12_4t/12_4t11/htgetvpn.html">http://www.cisco.com/c/en/us/td/docs/ios/12_4t/12_4t11/htgetvpn.html</a><br />
<br />
<b> sa ipsec 1<br /> profile profile1<br /> match address ipv4 getvpn-acl<br /> replay time window-size 5</b><br />
<br />
Here the TEK key attributes are defined, inherited from <i>profile1</i>:<br />
<br />
crypto ipsec profile profile1<br />
set transform-set aes128<br />
<br />
The GETVPN ACL is defined as <i>getvpn-acl</i>.<br />
<br />
It's probably not desirable to encrypt <i>all</i> traffic over the MPLS circuit. For example, control-plane protocols (probably BGP) as well as the initial control plane session (UDP port 848) from GM to KS need to be exempt from this process. It might also be desirable for ICMP, SSH, and perhaps SNMP - your management protocols - to be exempt. <br />
<br />
At its most basic, your ACL should look something like this:<br />
ip access-list extended getvpn-acl<br />
deny udp any eq 848 any<br />
deny udp any any eq 848<br />
deny tcp any eq bgp any<br />
deny tcp any any eq bgp<br />
permit ip any any<br />
<br />
<i>deny</i> indicates to not encrypt traffic. <i>permit</i> indicates to encrypt traffic. Normally this ACL will end in "permit ip any any".<br />
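<br />
If you did want to exempt the management protocols mentioned above, a sketch of the extended ACL might look like this. The added entries are illustrative and weren't part of my lab; note that each TCP/UDP exemption is written in both directions, the same way the 848 and BGP entries are:<br />
<br />
ip access-list extended getvpn-acl<br />
 deny udp any eq 848 any<br />
 deny udp any any eq 848<br />
 deny tcp any eq bgp any<br />
 deny tcp any any eq bgp<br />
 ! illustrative management exemptions: SSH, SNMP, ICMP<br />
 deny tcp any eq 22 any<br />
 deny tcp any any eq 22<br />
 deny udp any eq snmp any<br />
 deny udp any any eq snmp<br />
 deny icmp any any<br />
 permit ip any any<br />
<br />
Keep in mind that everything matched by a <i>deny</i> crosses the provider network in clear text, so each exemption is a trade-off between manageability and exposure.<br />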
<br />
The <i>replay time</i> command has a fair amount going on behind the scenes. The traditional IPSEC method for anti-replay doesn't work with GETVPN. If you're not familiar with replay attacks, "A replay attack is a form of network attack in which a valid data transmission is maliciously or fraudulently repeated or delayed. It is an attempt to subvert security by someone who records legitimate communications and repeats them in order to impersonate a valid user, and to disrupt or cause negative impact for legitimate connections."<br />
<br />
<a href="http://www.cisco.com/c/en/us/support/docs/ip/internet-key-exchange-ike/116858-problem-replay-00.html">http://www.cisco.com/c/en/us/support/docs/ip/internet-key-exchange-ike/116858-problem-replay-00.html</a><br />
<br />
Also from the same document, anti-replay is described: "IPSec provides anti-replay protection against an attacker who duplicates encrypted packets with the assignment of a monotonically increasing sequence number to each encrypted packet".<br />
<br />
In a nutshell, traditional anti-replay has a counter embedded in each packet, with the far side of a point-to-point tunnel anticipating the number to continuously count up, one packet at a time. This clearly can't work with GETVPN, as <i>any</i> group member can send traffic to any other, so there's no way to maintain a two-router counting system. Enter Time-Based Anti-Replay, or TBAR.<br />
<br />
TBAR has the KS maintain a pseudo-time clock ('pseudo' as it's not based on NTP) with the GMs. This gives every GM a coordinated reference point for time. Every GM then embeds its pseudo-timestamp in every packet it sends, and if that timestamp differs from the receiving GM's pseudo-time by more than X seconds, the packet is considered a replay attack and is dropped. 'X' is defined by <i>replay time window-size 5</i>, where 5 is the number of seconds a packet is considered valid.<br />
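<br />
To check the pseudo-time state once a group is up, the GDOI show commands include a replay view on both roles. I'm listing only the commands here, without output, as I didn't capture it in this lab:<br />
<br />
! On the KS: pseudo-time the KS is handing out for the group<br />
show crypto gdoi ks replay<br />
<br />
! On a GM: local pseudo-time and time-based replay counters<br />
show crypto gdoi gm replay<br />
<br />
If a GM's pseudo-time drifts too far from the rest of the group, its packets will start falling outside the window on the receivers, so these are a sensible first stop when traffic from one site is being dropped.<br />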
<br />
<b> address ipv4 192.168.111.111</b><br />
<br />
This defines the local IP address on which to send and receive GETVPN messages. It's normally set to a loopback. Our loopback on KS1 is 192.168.111.111.<br />
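<br />
For completeness, the loopback behind that address would look something like the following. The interface number and the /32 mask are assumptions on my part, as the KS1 interface config isn't shown in this article:<br />
<br />
! Loopback0 and the /32 mask are assumed, only the address comes from the lab<br />
interface Loopback0<br />
 ip address 192.168.111.111 255.255.255.255<br />
<br />
Whatever address is used, it has to be reachable from every GM, which is why KS1 advertises it into the MPLS VPN via BGP.<br />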
<br />
Now let's move on to our first spoke configuration, on CE1/GM1:<br />
<br />
<b>crypto isakmp policy 1<br /> encr aes<br /> authentication pre-share<br /> group 2<br />crypto isakmp key MYGDOIPSK address 192.168.111.111<br />crypto gdoi group GDOI-GROUP1<br /> identity number 1234<br /> server address ipv4 192.168.111.111<br /><br />crypto map gdoimap 1 gdoi<br /> set group GDOI-GROUP1<br /><br />int e0/0<br /> crypto map gdoimap</b><br />
<br />
This configuration is notably smaller than that of the KS. Moreover, with rare exceptions, it can be pasted in identically on each GM, so config deployment is very easy and fast.<br />
<br />
<b>crypto isakmp policy 1<br /> encr aes<br /> authentication pre-share<br /> group 2<br />crypto isakmp key </b><b>MYGDOIPSK address 192.168.111.111</b><br />
<br />
This matches the ISAKMP policy shown on the KS. The GM will use it to establish the initial temporary ISAKMP session back to the KS to register and download the KEK and TEK. The only really important item here is that this config matches that of the KS.<br />
<br />
<b>crypto gdoi group GDOI-GROUP1<br /> identity number 1234<br /> server address ipv4 192.168.111.111</b><br />
<br />
Here we define our group number, which will control which key set we receive, as well as which members we can speak to. Our initial deployment will all be on group 1234 for simplicity. <i>server address</i> determines which KS we register to. There can be more than one KS, and we'll cover the GM config for that when we cover COOP on the KSes.<br />
<br />
<b>crypto map gdoimap 1 gdoi<br /> set group GDOI-GROUP1</b><br />
<br />
<b>int e0/0<br /> crypto map gdoimap </b><br />
<br />
On the GMs, we activate both the control plane and forwarding plane of GETVPN on-the-interface, unlike on the KS, which has no interface-level config.<br />
<br />
My lab has this all running already, so I'm going to manually bounce CE1 to watch the registration process.<br />
<br />
CE1#sh run int e0/0<br />
Building configuration...<br />
<br />
Current configuration : 152 bytes<br />
!<br />
interface Ethernet0/0<br />
ip address 192.168.11.3 255.255.255.0<br />
crypto map gdoimap<br />
end<br />
<br />
CE1(config)#int e0/0<br />
CE1(config-if)#no crypto map gdoimap<br />
CE1(config-if)#crypto map gdoimap<br />
<br />
I'll break the log down: <br />
*Jan 2 18:10:37.203: %CRYPTO-5-GM_REGSTER: Start registration to KS 192.168.111.111 for group GDOI-GROUP1 using address 192.168.11.3 fvrf default ivrf default<i> </i><br />
<br />
<i>We started attempting registration</i> <br />
<br />
*Jan 2 18:10:37.236: %GDOI-5-SA_TEK_UPDATED: SA TEK was updated<br />
*Jan 2 18:10:37.237: %GDOI-5-SA_KEK_UPDATED: SA KEK was updated<br />
<br />
<i>We received TEK and KEK</i><br />
<br />
*Jan 2 18:10:37.237: %GDOI-5-GM_REGS_COMPL: Registration to KS 192.168.111.111 complete for group GDOI-GROUP1 using address 192.168.11.3 fvrf default ivrf default<i> </i><br />
<br />
<i>We successfully registered to KS 192.168.111.111.</i><br />
<br />
*Jan 2 18:10:37.238: %GDOI-5-GM_INSTALL_POLICIES_SUCCESS: SUCCESS: Installation of Reg/Rekey policies from KS 192.168.111.111 for group GDOI-GROUP1 & gm identity 192.168.11.3 fvrf default ivrf default<br />
<br />
<i>Policies pushed from KS1 were activated successfully.</i><br />
<br />
<i> </i><br />
Remember that ACL we put on the key-server? It's downloaded to the GM as part of the registration process to the KS:<br />
<br />
CE1#show crypto gdoi gm acl<br />
Group Name: GDOI-GROUP1<br />
ACL Downloaded From KS 192.168.111.111:<br />
access-list deny udp any port = 848 any<br />
access-list deny udp any any port = 848<br />
access-list deny tcp any port = 179 any<br />
access-list deny tcp any any port = 179<br />
access-list permit ip any any<br />
ACL Configured Locally:<br />
<br />
Moreover, if you update the ACL on the KS, it will get re-pushed with the next scheduled rekey, or you can force a rekey at any time with:<br />
<br />
<b>crypto gdoi ks rekey </b>! refreshes the ACL and sends out the next set of keys<br />
<b>crypto gdoi ks rekey replace-now </b>! steps above, plus force swapping to a new key (traffic impacting)<br />
<b><br /></b>
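The empty "ACL Configured Locally" section in the output above hints at one more option: a GM can add its own exemptions on top of what the KS pushes. A minimal sketch, assuming a hypothetical ACL name gm-local-deny and using the Host1/Host3 addresses from this lab purely as an example; as far as I know, only <i>deny</i> entries are meaningful in a GM's local ACL:<br />
<br />
! gm-local-deny is a made-up name; entries are written in both directions<br />
ip access-list extended gm-local-deny<br />
 deny ip host 10.0.111.2 host 10.0.33.2<br />
 deny ip host 10.0.33.2 host 10.0.111.2<br />
<br />
crypto map gdoimap 1 gdoi<br />
 set group GDOI-GROUP1<br />
 match address gm-local-deny<br />
<br />
The catch is that the GM at the other end needs a matching exemption, otherwise it will still expect that traffic to arrive encrypted. For anything that applies to the whole group, the KS ACL remains the better place.<br />
<br />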
There are some other useful show commands we'll take a moment to look at.<br />
One thing that threw me initially is that the traditional ipsec "show" commands don't work all that well here. KEK and TEK are different enough that the commands developed for point-to-point throw some odd output, for example:<br />
<br />
CE1#show crypto ipsec sa<br />
<br />
interface: Ethernet0/0<br />
Crypto map tag: gdoimap, local addr 192.168.11.3<br />
<br />
protected vrf: (none)<br />
local ident (addr/mask/prot/port): (0.0.0.0/0.0.0.0/0/0)<br />
remote ident (addr/mask/prot/port): (0.0.0.0/0.0.0.0/0/0)<br />
&lt;output omitted&gt;<br />
<br />
The local and remote ident would normally describe the local and remote subnet listed in the ACE of the interesting traffic list described by this IPSEC SA. However, in the case of GDOI a single SA is shown for the whole GDOI group but no ACE information from the GDOI ACL is given.<br />
<br />
You can, however, use traditional ISAKMP commands to see the temporary tunnel to the KS:<br />
<br />
CE1#show crypto isakmp sa<br />
IPv4 Crypto ISAKMP SA<br />
dst src state conn-id status<br />
192.168.111.111 192.168.11.3 GDOI_IDLE 1067 ACTIVE<br />
<br />
That said, let's look at the GDOI-specific commands.<br />
<br />
On the GM, to see if you're registered:<br />
<br />
CE1#<b>show crypto gdoi gm</b><br />
Group Member Information For Group GDOI-GROUP1:<br />
IPSec SA Direction : Both<br />
ACL Received From KS : gdoi_group_GDOI-GROUP1_temp_acl<br />
<br />
Group member : 192.168.11.3 vrf: None<br />
Local addr/port : 192.168.11.3/848<br />
Remote addr/port : 192.168.111.111/848<br />
fvrf/ivrf : None/None<br />
Version : 1.0.8<br />
Registration status : Registered<br />
Registered with : 192.168.111.111<br />
Re-registers in : 897 sec<br />
Succeeded registration: 1<br />
Attempted registration: 1<br />
&lt;output omitted for brevity&gt;<br />
<br />
On the KS, to see who's registered to it:<br />
<br />
KS1#<b>show crypto gdoi ks members summary</b> | s 11.3<br />
Group Member ID : 192.168.11.3 GM Version: 1.0.8<br />
Group ID : 1234<br />
Group Name : GDOI-GROUP1<br />
GM State : Registered<br />
Key Server ID : 192.168.111.111<br />
<br />
This command produces a lot of output, even with "summary", when you have many GMs registered. As all of my lab GMs are up at the moment, I filtered the output to just the one GM (CE1) that we've been working with.<br />
<br />
To view the details on KEK and TEK on the GM (you may want to check the remaining lifetimes):<br />
<br />
CE1#show crypto gdoi | s KEK<br />
KEK POLICY:<br />
Rekey Transport Type : Unicast<br />
Lifetime (secs) : 84190<br />
Encrypt Algorithm : AES<br />
Key Size : 128<br />
Sig Hash Algorithm : HMAC_AUTH_SHA<br />
Sig Key Length (bits) : 1296<br />
<br />
CE1#show crypto gdoi | s TEK<br />
TEK POLICY for the current KS-Policy ACEs Downloaded:<br />
Ethernet0/0:<br />
IPsec SA:<br />
spi: 0x6BAFD3AB(1806685099)<br />
transform: esp-aes esp-sha-hmac<br />
sa timing:remaining key lifetime (sec): (1386)<br />
Anti-Replay(Time Based) : 5 sec interval<br />
tag method : disabled<br />
alg key size: 16 (bytes)<br />
sig key size: 20 (bytes)<br />
encaps: ENCAPS_TUNNEL<br />
<br />
Now, our GM is set to encrypt any traffic (minus UDP 848 and BGP) that leaves its e0/0 interface.<br />
<br />
If you've ever watched an American cooking show, there's always a moment when the celebrity chef shows the basics of how to put a yet-to-be-cooked dish together, then instantly pops out the final product that's been in the oven for two hours prior, compliments of the magic of television. This is my moment! Not shown here, I've applied the GM config to the other three CE devices, and we have encrypted communication across all CE and Host devices on the MPLS VPN.<br />
<br />
I'm going to send pings from Host1 (10.0.111.2, with a default route to CE1) to Host3 (10.0.33.2, with a default route to CE3).<br />
<br />
HOST1#ping 10.0.33.2 <b>repeat 10</b><br />
Type escape sequence to abort.<br />
Sending 10, 100-byte ICMP Echos to 10.0.33.2, timeout is 2 seconds:<br />
!!!!!!!!!!<br />
Success rate is 100 percent (10/10), round-trip min/avg/max = 5/6/7 ms<br />
<br />
Great, how do we verify that the packets were encrypted? We go ask CE1:<br />
<br />
CE1#show crypto gdoi gm dataplane counters<br />
<br />
Data-plane statistics for group GDOI-GROUP1:<br />
<b> #pkts encrypt : 10 #pkts decrypt : 10</b> #pkts tagged (send) : 0 #pkts untagged (rcv) : 0<br />
#pkts no sa (send) : 0 #pkts invalid sa (rcv) : 0<br />
#pkts encaps fail (send) : 0 #pkts decap fail (rcv) : 0<br />
#pkts invalid prot (rcv) : 0 #pkts verify fail (rcv) : 0<br />
#pkts not tagged (send) : 0 #pkts not untagged (rcv) : 0<br />
#pkts internal err (send): 0 #pkts internal err (rcv) : 0<br />
<br />
As we can see, CE1 encrypted 10 outbound packets and decrypted 10 inbound packets, so it's a fair bet that encryption happened, using our current TEK. We could go check CE3 as well, but we already know we'd get the same results, because of the number of decrypted packets on CE1.<br />
<br />
Let's test this with an ACL change on the KS:<br />
<br />
KS1(config)#ip access-list extended getvpn-acl<br />
KS1(config-ext-nacl)#1 deny tcp any eq telnet any<br />
KS1(config-ext-nacl)#2 deny tcp any any eq telnet<br />
KS1(config-ext-nacl)#end<br />
KS1#<br />
*Jan 2 19:16:51.992: %SYS-5-CONFIG_I: Configured from console by console<br />
*Jan 2 19:16:51.992: %GDOI-5-POLICY_CHANGE: GDOI group GDOI-GROUP1 policy has changed. Use 'crypto gdoi ks rekey' to send a rekey, or the changes will be send in the next scheduled rekey<br />
<br />
The KS is smart enough to know we just changed global policy, and throws a reminder that the GMs won't be aware of this until the next scheduled rekey unless we force it:<br />
<br />
KS1#crypto gdoi ks rekey<br />
KS1#<br />
*Jan 2 19:17:36.784: %GDOI-5-KS_SEND_UNICAST_REKEY: Sending Unicast Rekey with policy-replace for group GDOI-GROUP1 from address 192.168.111.111 with seq # 23<br />
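<br />
If you'd like to confirm what the KS itself believes the policy to be, before or after forcing the rekey, these KS-side show commands should do it. I'm listing the commands only, without output, since I didn't capture it here:<br />
<br />
! ACL the KS will distribute to the group<br />
show crypto gdoi ks acl<br />
<br />
! Full policy (KEK/TEK parameters) the KS is currently serving<br />
show crypto gdoi ks policy<br />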
<br />
Meanwhile, back on CE1:<br />
<br />
CE1#show crypto gdoi gm acl<br />
Group Name: GDOI-GROUP1<br />
ACL Downloaded From KS 192.168.111.111:<br />
<b> access-list deny tcp any port = 23 any<br /> access-list deny tcp any any port = 23</b><br />
access-list deny udp any port = 848 any<br />
access-list deny udp any any port = 848<br />
access-list deny tcp any port = 179 any<br />
access-list deny tcp any any port = 179<br />
access-list permit ip any any<br />
ACL Configured Locally:<br />
<br />
And then test from Host1:<br />
<br />
HOST1#telnet 10.0.33.2<br />
Trying 10.0.33.2 ... Open<br />
Password required, but none set<br />
[Connection to 10.0.33.2 closed by foreign host]<br />
<br />
I don't actually have Host3 set up to accept telnet logins, but it's irrelevant - we just generated bidirectional traffic.<br />
<br />
And for verification on CE1:<br />
<br />
CE1#show crypto gdoi gm dataplane counters<br />
<br />
Data-plane statistics for group GDOI-GROUP1:<br />
<b> #pkts encrypt : 10 #pkts decrypt : 10</b> &lt;output omitted for brevity&gt;<br />
<br />
Note, the counters didn't go up this time - because this was sent in plain text. Let's pull those new telnet exemptions back off KS1 and try again:<br />
<br />
KS1(config)#ip access-list ext getvpn-acl<br />
KS1(config-ext-nacl)#no 1<br />
KS1(config-ext-nacl)#no 2<br />
<br />
KS1#crypto gdoi ks rekey<br />
<br />
Try again on Host1:<br />
<br />
HOST1#telnet 10.0.33.2<br />
Trying 10.0.33.2 ... Open<br />
Password required, but none set<br />
[Connection to 10.0.33.2 closed by foreign host]<br />
<br />
And back to CE1 for verification:<br />
<br />
CE1#show crypto gdoi gm dataplane counters<br />
<br />
Data-plane statistics for group GDOI-GROUP1:<br />
<b> #pkts encrypt : 31 #pkts decrypt : 29</b><br />
&lt;output omitted for brevity&gt;<br />
<br />
Perfect!<br />
<br />
Another benefit of GETVPN is seamless QoS support. All Cisco tunneling solutions copy TOS markings from the original packet to the encrypted packet when creating the encrypted packet. As such, unless you're trying to perform egress marking (which would require a qos pre-classify configuration), <i>no change</i> is required in QoS to migrate to GETVPN.<br />
<br />
We'll test from Host1 to Host2.<br />
<br />
First, we need to set up QoS on CE1:<br />
<br />
CE1(config)#class-map match-all EF<br />
CE1(config-cmap)# match dscp ef<br />
CE1(config-cmap)#policy-map QOS<br />
CE1(config-pmap)# class EF<br />
CE1(config-pmap-c)# priority 50000<br />
CE1(config)#int e0/0<br />
CE1(config-if)#service-policy output QOS<br />
<br />
HOST1#ping ! Extended Ping is required to set ToS to 184 (DSCP EF)<br />
Protocol [ip]:<br />
Target IP address: 10.0.222.2 ! Host2<br />
Repeat count [5]: 100<br />
Datagram size [100]:<br />
Timeout in seconds [2]:<br />
Extended commands [n]: y<br />
Source address or interface:<br />
<b>Type of service [0]: 184</b><br />
<output omitted for brevity> <br />
Type escape sequence to abort.<br />
Sending 100, 100-byte ICMP Echos to 10.0.222.2, timeout is 2 seconds:<br />
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!<br />
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!<br />
Success rate is 100 percent (100/100), round-trip min/avg/max = 1/8/46 ms<br />
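<br />
The 184 entered at the "Type of service" prompt is just DSCP EF (46) shifted into the upper six bits of the ToS byte. A quick sketch of that arithmetic in Python; the helper names are my own and not anything from IOS:<br />

```python
def dscp_to_tos(dscp: int) -> int:
    """Return the ToS byte that carries the given DSCP, with the ECN bits zero."""
    if not 0 <= dscp <= 63:
        raise ValueError("DSCP is a 6-bit value (0-63)")
    return dscp << 2


def tos_to_dscp(tos: int) -> int:
    """Return the DSCP carried in a ToS byte, ignoring the two ECN bits."""
    if not 0 <= tos <= 255:
        raise ValueError("ToS is an 8-bit value (0-255)")
    return tos >> 2


if __name__ == "__main__":
    print(dscp_to_tos(46))    # DSCP EF as a ToS byte
    print(tos_to_dscp(184))   # and back again
```

The same conversion works for any other marking you want to feed into an extended ping.<br />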
<br />
Back to CE1 for verification:<br />
<br />
CE1#show policy-map int | s EF<br />
Class-map: EF (match-all)<br />
<b> 100 packets, 19800 bytes</b><br />
30 second offered rate 0000 bps, drop rate 0000 bps<br />
Match: dscp ef (46)<br />
Priority: 50000 kbps, burst bytes 1250000, b/w exceed drops: 0<br />
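<br />
The "burst bytes 1250000" figure in that output lines up with 200 ms worth of traffic at the configured priority rate. I'm assuming here that 200 ms is the default burst interval when none is configured; the function name is hypothetical. As a Python sketch:<br />

```python
def priority_burst_bytes(rate_kbps: int, burst_ms: int = 200) -> int:
    """Bytes that can be sent in burst_ms at rate_kbps.

    The 200 ms default is an assumption about the platform default,
    not something stated in the output itself.
    """
    bits = rate_kbps * 1000 * burst_ms // 1000
    return bits // 8


if __name__ == "__main__":
    print(priority_burst_bytes(50000))
```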
<br />
Now let's take a look at COOP, the key server redundancy protocol for GETVPN.<br />
<br />
COOP works by establishing <i>permanent</i> ISAKMP sessions between redundant key servers. It uses these sessions to keep GM registration state in sync, and uses dead peer detection (DPD) to confirm the other key servers are up.<br />
<br />
As I mentioned previously, all KSs must have the same RSA public & private keys installed. This is so if the primary KS fails, the KEK session between the secondary KS(s) and the GMs can be maintained. In this fashion, re-registration from GM to the backup KS is <i>not necessary</i> if a KS fails. Also, traffic isn't impacted - a KS failure is a 'hitless' outage for the GMs.<br />
<br />
An additional benefit of having the KSs in sync with one another as well as having the same key on all servers is that a GM can register to <u>any</u> KS - even if it's not the primary. This can become important if a network gets segmented, where some GMs can reach, say, KS1, and others can only reach KS2 (again, note there can be up to eight KSes). When the KSs can reach one another again, they sync their registration database back together!<br />
<br />
One more important item of note, if you were paying attention to the diagram: I have KS2 <i>behind a GM.</i> Key servers can be directly connected as CEs, they can be behind a GM, they can basically be anywhere that's globally routable from the rest of the network - it doesn't matter. To show this, I put KS1 in a CE-style configuration, directly attached to PE1, and KS2 behind CE2, in more of a "host-like" setup.<br />
<br />
A quick reminder of our config above used to generate the RSA keys:<br />
<b>crypto key generate rsa label MYRSAKEY modulus 1024 exportable</b><br />
<br />
Now we need to go back to KS1 and retrieve that key for KS2:<br />
<b>KS1(config)#crypto key export rsa MYRSAKEY pem terminal 3des MYSECRETPASS</b><br />
<br />
% Key name: MYRSAKEY<br />
Usage: General Purpose Key<br />
Key data:<br />
-----BEGIN PUBLIC KEY-----<br />
<b><Public Key Omitted for Brevity> </b><br />
-----END PUBLIC KEY-----<br />
-----BEGIN RSA PRIVATE KEY-----<br />
Proc-Type: 4,ENCRYPTED<br />
DEK-Info: DES-EDE3-CBC,0B2283C620CB3CCA<br />
<br />
<b><Private Key Omitted for Brevity> </b><br />
-----END RSA PRIVATE KEY-----<br />
<br />
Now that we have the keys, we can import them into KS2:<br />
<br />
KS2(config)#crypto key import rsa MYRSAKEY terminal MYSECRETPASS<br />
% Enter PEM-formatted public General Purpose key or certificate.<br />
% End with a blank line or "quit" on a line by itself.<br />
-----BEGIN PUBLIC KEY-----<br />
<b><Public Key Omitted for Brevity></b><br />
-----END PUBLIC KEY-----<br />
quit<br />
% Enter PEM-formatted encrypted private General Purpose key.<br />
% End with "quit" on a line by itself.<br />
-----BEGIN RSA PRIVATE KEY-----<br />
Proc-Type: 4,ENCRYPTED<br />
DEK-Info: DES-EDE3-CBC,0B2283C620CB3CCA<br />
<br />
<b><Private Key Omitted for Brevity> </b><br />
-----END RSA PRIVATE KEY-----<br />
quit<br />
% Key pair import succeeded.<br />
<br />
Now back to configure COOP on KS1:<br />
<br />
KS1(config)#<b>crypto isakmp keepalive 10 periodic</b><br />
<br />
KS1(config)#<b>crypto isakmp key COOPKEY address 192.168.222.222</b><br />
KS1(config)#<b>crypto gdoi group GDOI-GROUP1</b><br />
KS1(config-gdoi-group)#<b>server local</b><br />
KS1(gdoi-local-server)#<b>redundancy</b><br />
KS1(gdoi-coop-ks-config)#<b>local priority 100</b><br />
KS1(gdoi-coop-ks-config)#<b>peer address ipv4 192.168.222.222</b><br />
<br />
There's not really much new config here, but I'll run over the key elements:<br />
<br />
<b>crypto isakmp keepalive 10 periodic</b><br />
COOP uses Dead Peer Detection (DPD) to keep track of its neighbors' up/down status, and DPD needs to be enabled with this command.<br />
<br />
<b>crypto isakmp key COOPKEY address 192.168.222.222</b><br />
As the KSs maintain an ISAKMP session between them, we need a key to set the session up. I believe one could hypothetically re-use the same key from the GMs, but that seems like a <i>bad idea</i>, so I've been in the practice of using a different key. Note, if you had more than two KSs, this config would need to be replicated for each peer KS.<br />
<br />
<b>redundancy<br /> local priority 100<br /> peer address ipv4 192.168.222.222</b><br />
<br />
Enter the redundancy config, set the local priority - higher is better and more likely to become primary - and enter the address of each of the other COOP key servers. Note, each key server needs to be configured with the IPs of all the other key servers, so if you were running three COOP key servers, each KS would have two peer entries, one for each of the other two servers.<br />
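<br />
Since every KS needs a pre-shared key and a peer statement for every other KS, the config grows as a full mesh. Here's a rough Python sketch that prints the COOP lines per key server, using only the commands shown above. The function name and the third key server (192.168.133.133, priority 20) are hypothetical and only there to show the three-server case:<br />

```python
from typing import Dict, List


def coop_config(servers: Dict[str, int], key: str = "COOPKEY",
                group: str = "GDOI-GROUP1") -> Dict[str, List[str]]:
    """Build the COOP lines for each KS in a full mesh.

    servers maps each KS address to its local priority.
    """
    configs: Dict[str, List[str]] = {}
    for address, priority in servers.items():
        peers = [other for other in servers if other != address]
        lines = ["crypto isakmp keepalive 10 periodic"]
        # One pre-shared key per peer KS.
        lines += [f"crypto isakmp key {key} address {peer}" for peer in peers]
        lines += [
            f"crypto gdoi group {group}",
            " server local",
            "  redundancy",
            f"   local priority {priority}",
        ]
        # One peer statement per other KS.
        lines += [f"   peer address ipv4 {peer}" for peer in peers]
        configs[address] = lines
    return configs


if __name__ == "__main__":
    mesh = coop_config({
        "192.168.111.111": 100,   # KS1
        "192.168.222.222": 40,    # KS2
        "192.168.133.133": 20,    # hypothetical third KS
    })
    for address, lines in mesh.items():
        print(f"! {address}")
        print("\n".join(lines))
```

With two servers this collapses to exactly the one key line and one peer line per KS used in this lab.<br />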
<br />
With mild adaptation of KS1's config, KS2's config looks like this:<br />
<br />
ip access-list extended getvpn-acl<br />
deny udp any eq 848 any<br />
deny udp any any eq 848<br />
deny tcp any eq bgp any<br />
deny tcp any any eq bgp<br />
permit ip any any<br />
<br />
crypto isakmp policy 1<br />
encr aes<br />
authentication pre-share<br />
group 2<br />
crypto isakmp key COOPKEY address 192.168.111.111<br />
crypto isakmp key MYGDOIPSK address 0.0.0.0<br />
crypto isakmp keepalive 10 periodic<br />
crypto ipsec transform-set aes128 esp-aes esp-sha-hmac<br />
mode tunnel<br />
crypto ipsec profile profile1<br />
set transform-set aes128<br />
crypto gdoi group GDOI-GROUP1<br />
identity number 1234<br />
server local<br />
rekey algorithm aes 128<br />
rekey authentication mypubkey rsa MYRSAKEY<br />
rekey transport unicast<br />
sa ipsec 1<br />
profile profile1<br />
match address ipv4 getvpn-acl<br />
replay time window-size 5<br />
address ipv4 192.168.222.222<br />
redundancy<br />
local priority 40<br />
peer address ipv4 192.168.111.111<br />
<br />
And with that, COOP is up!<br />
<br />
KS1#show crypto gdoi ks coop<br />
Crypto Gdoi Group Name :GDOI-GROUP1<br />
Group handle: 2147483650, Local Key Server handle: 2147483650<br />
<br />
Local Address: 192.168.111.111<br />
Local Priority: 100<br />
Local KS Role: Primary , Local KS Status: Alive<br />
Local KS version: 1.0.8<br />
Primary Timers:<br />
Primary Refresh Policy Time: 20<br />
Remaining Time: 10<br />
Antireplay Sequence Number: 247<br />
<br />
Peer Sessions:<br />
Session 1:<br />
Server handle: 2147483651<br />
Peer Address: 192.168.222.222<br />
Peer Version: 1.0.8<br />
Peer Priority: 40<br />
Peer KS Role: Secondary , Peer KS Status: Alive<br />
Antireplay Sequence Number: 31<br />
<br />
IKE status: Established<br />
Counters:<br />
Ann msgs sent: 220<br />
Ann msgs sent with reply request: 0<br />
Ann msgs recv: 2<br />
Ann msgs recv with reply request: 1<br />
Packet sent drops: 27<br />
Packet Recv drops: 0<br />
Total bytes sent: 156002<br />
Total bytes recv: 2247<br />
<br />
As I mentioned, there's a permanent ISAKMP session established between COOP KSes, and you can see that with standard ISAKMP show commands:<br />
<br />
KS1#show crypto isakmp sa<br />
IPv4 Crypto ISAKMP SA<br />
dst src state conn-id status<br />
<b>192.168.222.222 192.168.111.111 GDOI_IDLE 1001 ACTIVE</b><br />
<br />
We then need to tell our GMs about the additional server(s):<br />
<br />
CE1, CE2, CE3, & CE4:<br />
crypto gdoi group GDOI-GROUP1<br />
server address ipv4 192.168.222.222<br />
<br />
So - let's take KS1 out of commission and try a few things.<br />
<br />
KS1(config)#int e0/0<br />
KS1(config-if)#shut<br />
<br />
Eventually KS2 will realize that KS1 is out of commission. It's important to note that unlike a redundancy protocol that's directly in the dataplane (like HSRP), a brief key server outage, in an appropriately-built GETVPN, shouldn't be a big deal. The key servers are there to distribute policy and keys, and the keys are sent out <i>well in advance</i>, so it's unlikely the GMs would even notice the KS outage until a re-registration happened sometime in the distant future.<br />
<br />
KS2 (a while later):<br />
*Jan 2 20:26:55.540: %GDOI-5-COOP_KS_TRANS_TO_PRI: KS 192.168.222.222 in group GDOI-GROUP1 <b>transitioned to Primary </b>(Previous Primary = 192.168.111.111)<br />
<br />
*Jan 2 20:27:15.543: %GDOI-3-COOP_KS_UNREACH: Cooperative KS 192.168.111.111 Unreachable in group GDOI-GROUP1. IKE SA Status = Failed to establish.<br />
<br />
KS2 realizes the primary is down and assumes the primary role itself.<br />
<br />
Let's check in on CE1, and when it expects a rekey:<br />
<br />
CE1#show crypto gdoi | i life<br />
sa timing:remaining key lifetime (sec): (<b>3063</b>)<br />
<br />
It's got a bit. Let's see if it will accept new keys from KS2:<br />
<br />
KS2#crypto gdoi ks rekey replace-now<br />
<br />
<br />
CE1#show crypto gdoi | i life<br />
sa timing:remaining key lifetime (sec): (<b>3598</b>)<br />
<br />
CE1#show crypto gdoi gm<br />
Group Member Information For Group GDOI-GROUP1:<br />
IPSec SA Direction : Both<br />
ACL Received From KS : gdoi_group_GDOI-GROUP1_temp_acl<br />
<br />
Group member : 192.168.11.3 vrf: None<br />
Local addr/port : 192.168.11.3/848<br />
Remote addr/port : 192.168.111.111/848<br />
fvrf/ivrf : None/None<br />
Version : 1.0.8<br />
Registration status : Registered<br />
<b> Registered with : 192.168.111.111</b><br />
Re-registers in : 3326 sec<br />
Succeeded registration: 1<br />
Attempted registration: 1<br />
<b> Last rekey from : 192.168.222.222</b><br />
Last rekey seq num : 0<br />
Unicast rekey received: 7<br />
Rekey ACKs sent : 7<br />
Rekey Rcvd(hh:mm:ss) : 00:01:31<br />
DP Error Monitoring : OFF<br />
<br />
As KS2 has KS1's RSA key, CE1 (GM) accepts KS2's authentication. Note CE1 still thinks it's registered with KS1, which is basically irrelevant, as KS2 has taken over all ongoing tasks of KS1.<br />
<br />
Now let's force CE2 to re-register to KS2.<br />
<br />
CE2(config)#int e0/0<br />
CE2(config-if)#no crypto map gdoimap<br />
*Jan 2 20:35:32.253: %CRYPTO-6-GDOI_ON_OFF: GDOI is OFF<br />
<br />
GDOI disabled... <br />
<br />
CE2(config-if)#crypto map gdoimap<br />
*Jan 2 20:35:34.330: %CRYPTO-5-GM_REGSTER: Start registration to KS 192.168.111.111 for group GDOI-GROUP1 using address 192.168.12.2<br />
*Jan 2 20:35:34.331: %CRYPTO-6-GDOI_ON_OFF: GDOI is ON<br />
<br />
GDOI re-enabled, and now attempting registration to KS1. <br />
<br />
*Jan 2 20:36:14.344: %CRYPTO-5-GM_REGSTER: Start registration to KS 192.168.222.222 for group GDOI-GROUP1 using address 192.168.12.2<br />
<br />
CE2 gives up on KS1 and moves down its list to KS2. <br />
<br />
<output omitted for brevity> <br />
*Jan 2 20:36:14.366: %GDOI-5-GM_REGS_COMPL: Registration to KS 192.168.222.222 complete for group GDOI-GROUP1 using address 192.168.12.2<br />
<output omitted for brevity><br />
<br />
And successful registration!<br />
<br />
Looking at KS2's registrations:<br />
KS2#show crypto gdoi ks mem summary | i Member ID<br />
Group Member ID : 192.168.12.2 GM Version: 1.0.6<br />
Group Member ID : 192.168.13.2 GM Version: 1.0.6<br />
Group Member ID : 192.168.14.2 GM Version: 1.0.8<br />
Group Member ID : 192.168.11.3 GM Version: 1.0.8<br />
<br />
The other GMs didn't re-register to KS2 - KS2 learned about them from KS1 before KS1 went offline.<br />
<br />
Let's turn KS1 back online:<br />
KS1(config-if)#no shut<br />
<br />
It's all over the Cisco documentation that <b>COOP doesn't support preemption</b>:<br />
"The recovering KS receives an announcement message reply from an existing primary, which has lower priority. In this case, there is no preemption, and the recovering KS remains a secondary KS. This eliminates unnecessary changes in the system." <br />
<a href="http://www.cisco.com/c/dam/en/us/products/collateral/security/group-encrypted-transport-vpn/GETVPN_DIG_version_1_0_External.pdf">http://www.cisco.com/c/dam/en/us/products/collateral/security/group-encrypted-transport-vpn/GETVPN_DIG_version_1_0_External.pdf</a><br />
<br />
Well, you could have fooled me:<br />
<br />
KS1(config-if)#<br />
*Jan 2 20:39:32.559: %BGP-5-ADJCHANGE: neighbor 192.168.11.1 Up<br />
*Jan 2 20:39:45.739: %GDOI-5-COOP_KS_REACH: Reachability restored with Cooperative KS 192.168.222.222 in group GDOI-GROUP1.<br />
<br />
KS1#show crypto gdoi ks coop | i Role<br />
<b>Local KS Role: Primary</b> , Local KS Status: Alive<br />
Peer KS Role: Secondary , Peer KS Status: Alive<br />
<br />
KS2#show crypto gdoi ks coop | i Role<br />
<b>Local KS Role: Secondary </b>, Local KS Status: Alive<br />
Peer KS Role: Primary , Peer KS Status: Alive<br />
<br />
I'm running IOS 15.4, and it occurs to me that this could've changed since the documentation was written, but it seems somewhat unlikely seeing how adamant Cisco was about this in all previous documentation. Does anyone have any idea here? I literally cannot get it to <i>not</i> preempt, so I find that confusing.<br />
<br />
For a final topic related to COOP, a pure failure of a KS is one thing, but what happens if you have a network segmentation that has some GMs speaking to one KS and some GMs speaking to another KS, in a 'split-brain' scenario?<br />
<br />
In that case, the KSs perform what's called a Key Server Merge, which doesn't have much relevance as a study topic (other than knowing it exists), but it does have some design implications. If you're reading this to build a large production GETVPN as opposed to study purposes, I recommend reading<br />
<br />
<a href="http://www.cisco.com/c/dam/en/us/products/collateral/security/group-encrypted-transport-vpn/GETVPN_DIG_version_1_0_External.pdf">http://www.cisco.com/c/dam/en/us/products/collateral/security/group-encrypted-transport-vpn/GETVPN_DIG_version_1_0_External.pdf</a><br />
and checking out section 3.7.4.2, "Network Split and Merge".<br />
<br />
Now, on to some final topics:<br />
<br />
<i><b>Fail Open vs Fail Closed</b></i><br />
<br />
If a crypto policy isn't in place or isn't matched, the default reaction of the router is to simply send the traffic unencrypted. This is normal, default behavior and isn't a feature. However, if security demands that traffic <i>be stopped</i> rather than being sent in the clear, the Fail Closed feature may be enabled on a per-GM basis:<br />
<br />
First, create an ACL of what to still transmit even during fail-closed. For example, your routing and management traffic should probably still be permitted: <br />
<br />
<b>ip access-list extended fail-close<br />deny tcp any eq bgp any<br />deny tcp any any eq bgp</b><br />
<br />
Much like the standard GETVPN ACL, "deny" means "send unencrypted".<br />
<br />
Then you basically enable an extension to the crypto map. For example, my GDOI map is called "gdoimap", and looks like this:<br />
<br />
crypto map gdoimap 1 gdoi <br />
set group GDOI-GROUP1<br />
<br />
In that case, you create this additional config: <br />
<b>crypto map gdoimap gdoi fail-close</b><br />
<b>match address fail-close <br /> activate</b><br />
Then, if a valid KEK key isn't present, the only traffic allowed to transmit is BGP, given the ACL above.<br />
<br />
<i><b>Local Exception ACL</b></i><br />
<br />
It's possible that one global ACL doesn't meet the needs of every GM. If you have a GM that needs to transmit some data in clear text even though it's indicated by the KS ACL that it should be encrypted, you can create a per-GM one-off ACL for this scenario.<br />
<br />
From the Cisco documentation:<br />
"The crypto ACL applied at the GM represents a concatenation of the downloaded ACL and local ACL. The order of operations is such that the locally defined ACL is checked first, followed by the one downloaded from the KS."<br />
"Note: Only deny statements can be added locally at the GM. Permit statements are not supported in the locally configured policies. In case of a conflict, local policy overrides the policy downloaded from the KS."<br />
<a href="http://www.cisco.com/c/en/us/products/collateral/security/group-encrypted-transport-vpn/deployment_guide_c07_554713.html">http://www.cisco.com/c/en/us/products/collateral/security/group-encrypted-transport-vpn/deployment_guide_c07_554713.html</a><br />
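<br />
The order of operations in that quote can be sketched in a few lines of Python: local entries are checked first, then the KS entries, first match wins, "deny" means clear text and "permit" means encrypt. This is a deliberately simplified model (protocol and ports only, no addresses), and the names are mine rather than anything IOS exposes. The local entry used here is an ICMP exception as an example:<br />

```python
from typing import List, Optional, Tuple

# (action, protocol, source port, destination port); None means "any".
Entry = Tuple[str, str, Optional[int], Optional[int]]

# Policy downloaded from the KS, matching the getvpn-acl used in this post.
KS_ACL: List[Entry] = [
    ("deny", "udp", 848, None),
    ("deny", "udp", None, 848),
    ("deny", "tcp", 179, None),
    ("deny", "tcp", None, 179),
    ("permit", "ip", None, None),
]

# Example local exception on a single GM: deny entries only.
LOCAL_ACL: List[Entry] = [("deny", "icmp", None, None)]


def entry_matches(entry: Entry, proto: str,
                  sport: Optional[int], dport: Optional[int]) -> bool:
    """True if the simplified ACL entry matches the packet fields."""
    _, e_proto, e_sport, e_dport = entry
    if e_proto not in ("ip", proto):
        return False
    if e_sport is not None and e_sport != sport:
        return False
    if e_dport is not None and e_dport != dport:
        return False
    return True


def treatment(proto: str, sport: Optional[int] = None, dport: Optional[int] = None,
              local: List[Entry] = LOCAL_ACL, downloaded: List[Entry] = KS_ACL) -> str:
    """Return 'clear' or 'encrypt' using local-then-downloaded, first match wins."""
    for entry in list(local) + list(downloaded):
        if entry_matches(entry, proto, sport, dport):
            return "encrypt" if entry[0] == "permit" else "clear"
    # Nothing matched: traffic is simply sent unencrypted.
    return "clear"


if __name__ == "__main__":
    print(treatment("icmp"))              # hits the local exception
    print(treatment("icmp", local=[]))    # falls through to the KS policy
    print(treatment("tcp", 40000, 23))    # telnet
```

Keep in mind the model only describes what the sending GM does; the receiving GM still applies its own policy to whatever arrives.<br />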
<br />
An ACL is created:<br />
<b>ip access-list extended exception-acl<br />deny icmp any any</b><br />
<br />
And then applied to the crypto map on the GM:<br />
<b>crypto map gdoimap 1 gdoi<br /> match address exception-acl</b><br />
<br />
I've gone ahead and applied this on CE1:<br />
<br />
CE1#sh crypto gdoi gm acl<br />
Group Name: GDOI-GROUP1<br />
ACL Downloaded From KS 192.168.111.111:<br />
access-list deny udp any port = 848 any<br />
access-list deny udp any any port = 848<br />
access-list deny tcp any port = 179 any<br />
access-list deny tcp any any port = 179<br />
access-list permit ip any any<br />
ACL Configured Locally:<br />
Map Name: gdoimap<br />
<b> access-list exception-acl deny icmp any any</b><br />
Let's test it...<br />
<br />
CE1#ping 192.168.33.33 ! CE3's loopback<br />
Type escape sequence to abort.<br />
Sending 5, 100-byte ICMP Echos to 192.168.33.33, timeout is 2 seconds:<br />
.....<br />
Success rate is 0 percent (0/5)<br />
<br />
I scratched my head for a split second until I remembered that CE3 doesn't agree with the exception policy and therefore won't take unencrypted ICMP traffic:<br />
<br />
CE3#<br />
*Jan 2 21:08:30.239: %CRYPTO-4-RECVD_PKT_NOT_IPSEC: Rec'd packet not an IPSEC packet. (ip) vrf/dest_addr= /192.168.33.33, src_addr= 192.168.11.3, prot= 1<br />
<br />
However, the Key Servers can <i>only</i> reply to unencrypted ICMP traffic:<br />
<br />
CE1#ping 192.168.111.111<br />
Type escape sequence to abort.<br />
Sending 5, 100-byte ICMP Echos to 192.168.111.111, timeout is 2 seconds:<br />
!!!!!<br />
Success rate is 100 percent (5/5), round-trip min/avg/max = 1/2/6 ms<br />
<br />
<i><b>Receive-only SA</b></i><br />
<br />
When deploying GETVPN on an existing network, it is almost certainly desirable to ensure all GMs can decrypt traffic before beginning to encrypt traffic - otherwise, some GMs would be sending encrypted traffic to other GMs that hadn't had the config pasted in yet. <br />
<br />
Receive-only SA is a policy pushed from the KS that tells all GMs to decrypt traffic but not encrypt it. Implementation is very simple:<br />
<br />
<b>crypto gdoi group GDOI-GROUP1<br /> server local<br /> sa receive-only </b><br />
<br />
<i><b>Passive SA</b></i><br />
<br />
While deploying Receive-only SA, it may also be a good idea to do small-scale encryption testing without globally rolling encryption and hoping for the best. Passive SA is a per-GM setting that basically overrides the <b>sa receive-only</b> command pushed from the KS. It indicates that the GM should encrypt and decrypt traffic, rather than just decrypting it. This allows for a single-GM (or however many you'd like to apply the config to) rollout of encryption, without applying it globally with the KS.<br />
<br />
<b>crypto gdoi group GDOI-GROUP1<br /> passive</b><br />
<br />
Interestingly, there's also a privileged EXEC command for the GMs that can control whether to encrypt, decrypt, or both - basically a macro for the functions described above:<br />
<br />
CE3#<b>crypto gdoi gm group 1234 ipsec direction ?</b><br />
both IPsec SA will only accept cipher text and will encrypt the packet<br />
before forwarding it out<br />
inbound Specify IPsec SA inbound options<br />
<br />
CE3#<b>crypto gdoi gm group 1234 ipsec direction inbound ?</b><br />
only IPsec SA will accept both cipher/plain text and will forward the<br />
packet in clear.<br />
optional IPsec SA will accept both cipher/plain text and will encrypt the<br />
packet before forwarding it out<br />
<br />
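Using this command, you can indicate mandatory encryption in both directions ('both'), accept both encrypted and clear text inbound while forwarding in the clear ('inbound only', which lines up with receive-only), or accept both encrypted and clear text inbound while encrypting outbound ('inbound optional', which lines up with passive).<br />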
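<br />
Reading the CLI help text above literally, the three options boil down to two yes/no questions: does the GM accept clear text inbound, and does it encrypt outbound? A small Python sketch of that table, plus a check for whether one GM's traffic would be accepted by another. It assumes both GMs hold the same group keys, and the names are my own:<br />

```python
from typing import NamedTuple


class Direction(NamedTuple):
    accepts_clear_inbound: bool
    encrypts_outbound: bool


# Taken from the help strings for "ipsec direction" shown above.
DIRECTIONS = {
    "both": Direction(accepts_clear_inbound=False, encrypts_outbound=True),
    "inbound only": Direction(accepts_clear_inbound=True, encrypts_outbound=False),
    "inbound optional": Direction(accepts_clear_inbound=True, encrypts_outbound=True),
}


def can_talk(sender: str, receiver: str) -> bool:
    """True if traffic from a GM in `sender` mode is accepted by one in `receiver` mode."""
    sends_encrypted = DIRECTIONS[sender].encrypts_outbound
    # Cipher text is accepted in every mode; clear text only where allowed.
    return sends_encrypted or DIRECTIONS[receiver].accepts_clear_inbound


if __name__ == "__main__":
    for sender in DIRECTIONS:
        for receiver in DIRECTIONS:
            print(f"{sender} -> {receiver}: {can_talk(sender, receiver)}")
```

The only failing pairing is a GM still sending in the clear towards a GM that has already moved to 'both', which is exactly why the rollout order matters.<br />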
<br />
<i><b>Multiple Group Support and Authorization Lists</b></i><br />
<br />
We've only been using one group up until now. Imagine if you had separate divisions inside a company that shared a single VRF in a L3 MPLS VPN but have no business speaking to one another, at least not without going through a firewall at HQ first (I've also heard there are service provider applications for this deployment).<br />
<br />
Let's say Division 1 is CE1 and CE2, and Division 2 is CE3 and CE4.<br />
<br />
Let's configure the new group on KS1. For brevity, I am deliberately not configuring KS2; the second group will not participate in the COOP config from above.<br />
<br />
KS1(config)#<b>crypto gdoi group GDOI-GROUP2</b><br />
KS1(config-gdoi-group)#<b>identity number 6789</b><br />
KS1(config-gdoi-group)#<b>server local</b><br />
KS1(gdoi-local-server)#<b>rekey algorithm aes 128</b><br />
KS1(gdoi-local-server)#<b>rekey authentication mypubkey rsa MYRSAKEY</b><br />
KS1(gdoi-local-server)#<b>rekey transport unicast</b><br />
KS1(gdoi-local-server)#<b>sa ipsec 1</b><br />
KS1(gdoi-sa-ipsec)#<b>profile profile1</b><br />
KS1(gdoi-sa-ipsec)#<b>match address ipv4 getvpn-acl</b><br />
KS1(gdoi-sa-ipsec)#<b>replay time window-size 5</b><br />
KS1(gdoi-sa-ipsec)#<b>address ipv4 192.168.111.111</b><br />
<br />
And on CE3 and CE4:<br />
CE3(config)#crypto gdoi group GDOI-GROUP1 ! name is only locally significant<br />
CE3(config-gdoi-group)#<b>no server address ipv4 192.168.222.222</b> ! remove KS2<br />
CE3(config-gdoi-group)#<b>identity number 6789</b> ! change the group number<br />
<br />
CE4(config)#crypto gdoi group GDOI-GROUP1 ! name is only locally significant<br />
CE4(config-gdoi-group)#<b>no server address ipv4 192.168.222.222</b> ! remove KS2<br />
CE4(config-gdoi-group)#<b>identity number 6789</b> ! change the group number<br />
<br />
CE4#ping 192.168.33.33 so lo0 ! 192.168.33.33 is CE3's loopback<br />
Type escape sequence to abort.<br />
Sending 5, 100-byte ICMP Echos to 192.168.33.33, timeout is 2 seconds:<br />
Packet sent with a source address of 192.168.44.44<br />
!!!!!<br />
Success rate is 100 percent (5/5), round-trip min/avg/max = 5/5/6 ms<br />
<br />
CE4#sh crypto gdoi gm dataplane counters<br />
<br />
Data-plane statistics for group GDOI-GROUP1:<br />
<b> #pkts encrypt : 5 #pkts decrypt : 5</b><br />
#pkts tagged (send) : 0 #pkts untagged (rcv) : 0<br />
#pkts no sa (send) : 0 #pkts invalid sa (rcv) : 0<br />
#pkts encaps fail (send) : 0 #pkts decap fail (rcv) : 0<br />
#pkts invalid prot (rcv) : 0 #pkts verify fail (rcv) : 0<br />
#pkts not tagged (send) : 0 #pkts not untagged (rcv) : 0<br />
#pkts internal err (send): 0 #pkts internal err (rcv) : 0<br />
<br />
OK, CE3 and CE4 can still talk to one another.<br />
<br />
CE4#ping 10.0.111.1 ! An interface on CE1<br />
Type escape sequence to abort.<br />
Sending 5, 100-byte ICMP Echos to 10.0.111.1, timeout is 2 seconds:<br />
.....<br />
Success rate is 0 percent (0/5)<br />
<br />
CE1#<br />
*Jan 2 21:38:41.803: %CRYPTO-4-RECVD_PKT_INV_SPI: decaps: rec'd IPSEC packet has invalid spi for destaddr=10.0.111.11, prot=50, spi=0xA0B897B5(2696452021), srcaddr=192.168.14.2, input interface=Ethernet0/0<br />
<br />
No talking from CE4 to CE1.<br />
<br />
You'll note I used the same RSA keys on both groups:<br />
KS1(gdoi-local-server)#<b>rekey authentication mypubkey rsa MYRSAKEY</b><br />
<br />
It doesn't matter - the TEKs aren't built off the RSA key. The RSA key is just for the KS to authenticate to the GM, to prove it's still part of the original group of KSs the GM registered to.<br />
<br />
<br />
But there's still an easy, easy way to work around this on the GM:<br />
<br />
CE4(config)#<b>crypto gdoi group GDOI-GROUP1</b><br />
CE4(config-gdoi-group)#<b>identity number 1234</b><br />
<br />
CE4#ping 10.0.111.1 so lo0<br />
Type escape sequence to abort.<br />
Sending 5, 100-byte ICMP Echos to 10.0.111.1, timeout is 2 seconds:<br />
Packet sent with a source address of 192.168.44.44<br />
!!!!!<br />
Success rate is 100 percent (5/5), round-trip min/avg/max = 6/6/7 ms<br />
<br />
Authorization lists to the rescue!<br />
<br />
KS1(config)#access-list 10 permit 192.168.11.3 ! CE1's registration address<br />
KS1(config)#access-list 10 permit 192.168.12.2 ! CE2's registration address<br />
<br />
KS1(config)#access-list 20 permit 192.168.13.2 ! CE3's registration address<br />
KS1(config)#access-list 20 permit 192.168.14.2 ! CE4's registration address<br />
<br />
KS1(config)#crypto gdoi group GDOI-GROUP1<br />KS1(config-gdoi-group)# server local<br />KS1(gdoi-local-server)# authorization address ipv4 10<br />
KS1(gdoi-local-server)#crypto gdoi group GDOI-GROUP2<br />KS1(config-gdoi-group)# server local<br />KS1(gdoi-local-server)# authorization address ipv4 20<br /><br />
Forcing re-registration on CE4:<br />
CE4(config-if)#no crypto map gdoimap<br />
CE4(config-if)#crypto map gdoimap<br />
<br />
KS1(config)#<br />
*Jan 2 21:48:15.202: %GDOI-1-UNAUTHORIZED_IPADDR: Group GDOI-GROUP1 received registration from unauthorized ip address: 192.168.14.2<br />
<br />
And CE4 begrudgingly goes back to his own group:<br />
CE4(config-if)#crypto gdoi group GDOI-GROUP1<br />
CE4(config-gdoi-group)#identity number 6789<br />
<br />
*Jan 2 21:49:25.208: %GDOI-5-GM_REGS_COMPL: Registration to KS 192.168.111.111 complete for group GDOI-GROUP1 using address 192.168.14.2 fvrf default ivrf default<br />
<br />
...and succeeds.<br />
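<br />
The authorization check itself is simple enough to model: each group identity has a list of permitted registration addresses, and anything else is refused. A minimal Python sketch using this lab's addresses (CE2's is taken from its registration log, 192.168.12.2); the function and table names are hypothetical:<br />

```python
from typing import Dict, Set

# Group identity number -> registration addresses permitted by its authorization ACL.
AUTHORIZED: Dict[int, Set[str]] = {
    1234: {"192.168.11.3", "192.168.12.2"},   # CE1, CE2
    6789: {"192.168.13.2", "192.168.14.2"},   # CE3, CE4
}


def registration_allowed(identity: int, gm_address: str) -> bool:
    """True if the GM's registration address is permitted for that group."""
    return gm_address in AUTHORIZED.get(identity, set())


if __name__ == "__main__":
    print(registration_allowed(1234, "192.168.14.2"))   # CE4 trying group 1234
    print(registration_allowed(6789, "192.168.14.2"))   # CE4 in its own group
```

Note this models a group that has an authorization list applied; it doesn't cover a group with no list configured.<br />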
<br />
<i><b>VRF Lite Support</b></i><br />
<br />
One CE can join multiple, different GDOI groups on the KS by using VRF-lite on the CEs. This isn't that complex conceptually, however, I saw little reason to lab it. If you want to know more, here's the link to the Cisco documentation: <a href="http://www.cisco.com/c/en/us/products/collateral/ios-nx-os-software/enterprise-class-teleworker-ect-solution/prod_white_paper0900aecd80617171.html">http://www.cisco.com/c/en/us/products/collateral/ios-nx-os-software/enterprise-class-teleworker-ect-solution/prod_white_paper0900aecd80617171.html</a><br />
<br />
Note, the KS does not support VRFs.<br />
<br />
Cheers,<br />
<br />
Jeff<br />
<br />
<br />
New Material Coming Soon... honest! (posted 2015-12-27)<br />
<br />
Just shy of a year ago, I posted:<br />
<br />
"...the blog will continue!<br />
My next step is CCNP Voice, and I plan on writing up my findings here, as well as any interesting R&S topics I come across."<br />
<br />
Well, that didn't end up panning out as anticipated. The major problem was that I spent three full years of my life working on getting my CCIE number, and after that I had a <i>long list</i> of things - not related to IT - that needed to be completed, such as:<br />
<br />
- Lots of home repair projects<br />
- Cleaning my garage (Took nearly two months; it was pretty bad) <br />
- Replacing a Jeep engine<br />
- etc etc<br />
<br />
Those tasks took a bit longer than I had anticipated. I've been non-stop busy since last January, and a year later I still have about 5% of the list left, but it's fairly manageable and something I can do in minor spare time now.<br />
<br />
I did in fact pick up the CCNP Voice material, only to have it immediately replaced with CCNP Collab, and by that point I was so embroiled in 'the list' mentioned above that I didn't bother buying it.<br />
<br />
However, with my 1-year anniversary of my lab pass looming, I figure I better start studying for the written again.<br />
<br />
I nearly decided to ditch R&S and go re-up on the Collab written, as I spend more of my time in that space now than I do R&S. However, to be fair to my family, I decided that (hopefully) the R&S written would take me less time to study for than the Collab, and I can just plan on re-upping on Collab next time when I've got more "spare time".<br />
<br />
There are a good number of topics on the v5 written that weren't on the v5 lab, but the big ones that I'm not already an expert at are GETVPN and IS-IS.<br />
<br />
I'm working on the GETVPN blog right now, which will be posted when it's done/when I have time, but I expect in the coming weeks.<br />
<br />
Cheers,<br />
<br />
Jeff<br />
<br />
46110 (posted 2015-01-05)<br />
<br />
Well folks, I am finally done. Two years, 11 months. Today, January 5th, 2015, I passed, on my 4th attempt - #46110.<br />
<br />
However, the blog will continue!<br />
<br />
My next step is CCNP Voice, and I plan on writing up my findings here, as well as any interesting R&S topics I come across.<br />
<br />
Best of luck to everyone else on this track, the hard work does eventually pay off.<br />
<br />
Cheers,<br />
<br />
Jeff<br />
<br />
[mini] Fail-Over Policy Based Routing (posted 2014-10-04)<br />
<br />
Playing with PBR recently, I came across what I thought was an odd usage - two set commands in the same statement.<br />
<br />
i.e.<br />
<br />
route-map PBR permit 10<br />
match ip address to-be-matched<br />
set ip next-hop 192.168.0.1<br />
set ip default next-hop 192.168.1.1<br />
<br />
This is a bit odd to look at until you break it down.<br />
<br />
Turns out there's an order of operations to PBR <i>set</i> statements.<br />
<br />
From the Cisco documentation:<br />
<br />
1. set ip next-hop<br />
2. set interface<br />
3. set ip default next-hop<br />
4. set default interface<br />
<br />
This means <b>set</b> <b>ip next-hop </b>will be attempted prior to, say, <b>set interface</b>. If it fails, then the next statement will be evaluated.<br />
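<br />
As a rough mental model, the four set types behave like an ordered fallback list inside a single route-map statement. The Python sketch below is my own simplification, not Cisco's implementation. In particular, it assumes the two "default" variants are only considered when the routing table has no explicit route to the destination, which is my understanding of IOS behavior rather than something from the list above:<br />

```python
def pick_pbr_action(next_hop_reachable: bool,
                    interface_up: bool,
                    explicit_route_exists: bool,
                    default_next_hop_reachable: bool,
                    default_interface_up: bool) -> str:
    """Return the first usable set action, in the documented order."""
    if next_hop_reachable:
        return "set ip next-hop"
    if interface_up:
        return "set interface"
    # Assumption: the "default" forms only apply without an explicit route.
    if not explicit_route_exists:
        if default_next_hop_reachable:
            return "set ip default next-hop"
        if default_interface_up:
            return "set default interface"
    return "normal routing"


if __name__ == "__main__":
    # Primary next hop and interface both down, no explicit route in the table.
    print(pick_pbr_action(False, False, False, True, True))
```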
<br />
When I saw that, the first place my brain went to was, why not create two route-map elements to fix this?<br />
<br />
(please note it's hard to air your dirty laundry on the Internet. Yes, this seemed dumb after I tested it)<br />
<br />
route-map PBR permit 10<br />
match ip address to-be-matched<br />
set ip next-hop 192.168.0.1<br />
route-map PBR permit 20<br />
match ip address to-be-matched<br />
set ip default next-hop 192.168.1.1<br />
<div>
<br /></div>
<div>
My thought process here was that if statement 10 failed to apply the <i>set</i> statement, then it would move on to statement 20. This is, of course, not true. Just like an ACL, a route-map stops evaluating future statements as soon as it has a match. So in the above config, using the same ACL (or even two ACLs that both matched the same traffic in different ways), statement 10 is always matched, and if it fails, traffic is just normally routed.</div>
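<br />
The first-match behavior is easy to model. In this Python sketch (hypothetical names; the match on 1.1.1.1 to 9.9.9.9 is just a stand-in for the to-be-matched ACL), statement 20 is never consulted because statement 10 already matched, even though statement 10's set action can't be used:<br />

```python
from typing import Callable, Dict, List, Optional, Tuple

Packet = Dict[str, str]
# (sequence number, match function, whether the statement's set action is usable)
Statement = Tuple[int, Callable[[Packet], bool], bool]


def evaluate(route_map: List[Statement], packet: Packet) -> Tuple[Optional[int], str]:
    """Stop at the first matching statement, whether or not its set action works."""
    for seq, matches, set_usable in route_map:
        if matches(packet):
            return seq, ("policy-routed" if set_usable else "normal routing")
    return None, "normal routing"


def to_be_matched(packet: Packet) -> bool:
    """Stand-in for the ACL referenced by both statements."""
    return packet["src"] == "1.1.1.1" and packet["dst"] == "9.9.9.9"


if __name__ == "__main__":
    route_map = [
        (10, to_be_matched, False),   # next hop unusable
        (20, to_be_matched, True),    # would work, but is never reached
    ]
    print(evaluate(route_map, {"src": "1.1.1.1", "dst": "9.9.9.9"}))
```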
<div>
<br /></div>
<div>
So there is a reason (albeit in niche cases) to put "fail-over" <i>set</i> statements into a single route-map statement. The CCIE lab is basically all about niche cases (less lovingly called a "stupid router trick" by most of us), so this seemed worth exploring.</div>
<div>
<br /></div>
<div>
Here's our topology:</div>
<div>
<br /></div>
<img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjr4ZNdawof3_WvxrXWiz8ngR9zP6qZd3wYj9aolHi2xniy4CuOFBeixerMVbsVqd0Ez1G2SVnxbXL-dWDKuHgWv04BSb-pobpZBk0R2vyanLU81TJbjE3wuNgGBRiSv0vUS3XFA70ooPY/s800/diagram1.png" />
<br />
It's a little complex but I wanted to show a lot of different possibilities in one route-map statement.<br />
<br />
R1's loopback0 (1.1.1.1/32) will be our source, travelling towards R9's loopback0 (9.9.9.9/32). Segments are addressed as 192.168.XY.Z/24, where X is the lower and Y the higher router number on the segment, and Z is the local router number. Example: The serial segment between R2 and R4 is 192.168.24.0/24, with R2's interface being 192.168.24.2 and R4's being 192.168.24.4.<br />
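If you want to sanity-check an address in the outputs that follow, the numbering scheme is easy to express in code. This is just a helper of my own (the function name is made up) and it assumes single-digit router numbers, which holds for R1 through R9.<br />

```python
def segment(a, b):
    """Return (network, address on router a, address on router b) for the a-b link."""
    low, high = sorted((a, b))
    base = f"192.168.{low}{high}"  # XY = lower then higher router number
    return f"{base}.0/24", f"{base}.{a}", f"{base}.{b}"

print(segment(2, 4))
print(segment(9, 8))
```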
<br />
EIGRP is advertising every IP in the topology. <i>However,</i> R5, R6 and R7 are summarizing all routes behind them to a default route towards R2.<br />
R2 has an offset list towards R4 to make paths through it less desirable:<br />
<br />
R2:<br />
router eigrp 100<br />
network 0.0.0.0<br />
<div>
offset-list 0 in 50 Serial4/1</div>
<div>
<br /></div>
<div>
The net result of this is that traffic will be sent from R2 through R3 towards R9 unless PBR is involved.</div>
<div>
<br /></div>
<div>
Here's R2's PBR config:</div>
<div>
<br /></div>
<div>
<div>
ip access-list extended match</div>
<div>
permit ip host 1.1.1.1 host 9.9.9.9</div>
</div>
<div>
<br /></div>
<div>
<div>
route-map PBR permit 10</div>
<div>
match ip address match</div>
<div>
set ip next-hop 192.168.24.4</div>
<div>
set interface Serial4/2</div>
<div>
set ip default next-hop 192.168.26.6</div>
<div>
set default interface Serial4/4</div>
</div>
<div>
<br /></div>
<div>
interface FastEthernet1/0</div>
<div>
ip policy route-map PBR</div>
<div>
<br /></div>
<div>
This will match traffic from 1.1.1.1 towards 9.9.9.9. The first step is to attempt to send the traffic towards R4.</div>
<div>
<br /></div>
<div>
So to be clear, non-PBR traffic will go through R3:</div>
<div>
<br /></div>
<div>
<div>
R2#sh ip cef 9.9.9.9</div>
<div>
9.9.9.9/32</div>
<div>
nexthop 192.168.23.3 Serial4/0</div>
</div>
<div>
<br /></div>
<div>
<div>
R1#trace 9.9.9.9 source fa1/0 <b>! Not from 1.1.1.1</b></div>
<div>
Type escape sequence to abort.</div>
<div>
Tracing the route to 9.9.9.9</div>
<div>
VRF info: (vrf in name/id, vrf out name/id)</div>
<div>
1 192.168.12.2 44 msec 88 msec 48 msec</div>
<div>
2 192.168.23.3 112 msec 40 msec 68 msec</div>
<div>
3 192.168.39.9 116 msec 116 msec 160 msec</div>
</div>
<div>
<br /></div>
<div>
Now let's try our PBR match:</div>
<div>
<br /></div>
<div>
<div>
R1#trace 9.9.9.9 source Loopback0</div>
<div>
Type escape sequence to abort.</div>
<div>
Tracing the route to 9.9.9.9</div>
<div>
VRF info: (vrf in name/id, vrf out name/id)</div>
<div>
1 192.168.12.2 108 msec 48 msec 72 msec</div>
<div>
2 192.168.24.4 96 msec 44 msec 76 msec</div>
<div>
3 192.168.49.9 136 msec 108 msec 108 msec</div>
</div>
<div>
<br /></div>
<div>
As expected, it went R2 -> R4 -> R9.</div>
<div>
<br /></div>
<div>
What if the link between R2 and R4 went down?</div>
<div>
<br /></div>
<div>
<div>
R2(config)#int s4/1</div>
<div>
R2(config-if)#shut</div>
</div>
<div>
<br /></div>
<div>
Referencing back to our route-map...</div>
<div>
<br /></div>
<div>
<div>
route-map PBR permit 10</div>
<div>
match ip address match</div>
<div>
<strike> set ip next-hop 192.168.24.4</strike> (now unavailable)</div>
<div>
set interface Serial4/2 </div>
<div>
set ip default next-hop 192.168.26.6</div>
<div>
set default interface Serial4/4</div>
</div>
<div>
<br /></div>
<div>
We'd expect the PBR to send the traffic through R5 next, via Serial4/2. A very important note is <b>these are point to point serial interfaces</b>. Using Ethernet is not such a good choice for <b>set interface</b>. The problem should be obvious to any CCIE candidate: We're relying on the far side to proxy ARP for 9.9.9.9, <i>which it will do in our design</i>, but also because of our design, IOS will typically reject the ARP change as "wrong interface".</div>
<div>
<br /></div>
<div>
In short, the safe answer is to use serial (P2P) interfaces.</div>
<div>
<br /></div>
<div>
Also note again that, according to the routing table, R2 should send traffic towards 9.9.9.9 via R3:</div>
<div>
<br /></div>
<div>
<div>
R2(config-if)#do sh ip cef 9.9.9.9</div>
<div>
9.9.9.9/32</div>
<div>
nexthop 192.168.23.3 Serial4/0</div>
</div>
<div>
<br /></div>
<div>
From R1's traceroute, we see that traffic does go through R5:</div>
<div>
<br /></div>
<div>
<div>
R1#trace 9.9.9.9 source Loopback0</div>
<div>
Type escape sequence to abort.</div>
<div>
Tracing the route to 9.9.9.9</div>
<div>
VRF info: (vrf in name/id, vrf out name/id)</div>
<div>
1 192.168.12.2 24 msec 116 msec 44 msec</div>
<div>
2 192.168.25.5 92 msec 48 msec 112 msec</div>
<div>
3 192.168.58.8 96 msec 128 msec 96 msec</div>
<div>
4 192.168.89.9 152 msec 108 msec 96 msec</div>
</div>
<div>
<br /></div>
<div>
We've successfully failed the first statement and moved to the 2nd one.</div>
<div>
<br /></div>
<div>
<div>
R2(config-if)#int s4/2</div>
<div>
R2(config-if)#shut</div>
</div>
<div>
<br /></div>
<div>
<div>
<div>
route-map PBR permit 10</div>
<div>
match ip address match</div>
<div>
<strike> set ip next-hop 192.168.24.4</strike> (now unavailable)</div>
<div>
<strike> set interface Serial4/2 </strike> (now unavailable)</div>
<div>
set ip default next-hop 192.168.26.6</div>
<div>
set default interface Serial4/4</div>
</div>
</div>
<div>
<br /></div>
<div>
The <i>set [ip] default </i>commands will only trigger if the route <i>towards the destination</i> is via a default, that is, when the routing table has no explicit route for it.</div>
<div>
These won't work for us yet because....</div>
<div>
<br /></div>
<div>
<div>
R2#sh ip route 9.9.9.9</div>
<div>
Routing entry for 9.9.9.9/32</div>
<div>
Known via "eigrp 100", distance 90, metric 2300416, type internal</div>
</div>
<div>
[output omitted]</div>
<div>
<br /></div>
<div>
We have a specific route.</div>
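<div>
Here's a rough Python sketch of that check: do a longest-prefix match and see whether anything more specific than 0.0.0.0/0 covers the destination. The function names are my own and the route lists are illustrative stand-ins for R2's routing table, not real output.</div>

```python
import ipaddress

def best_route(routes, destination):
    """Longest-prefix match: the most specific route covering destination, or None."""
    dest = ipaddress.ip_address(destination)
    covering = [n for n in map(ipaddress.ip_network, routes) if dest in n]
    return max(covering, key=lambda n: n.prefixlen, default=None)

def default_set_applies(routes, destination):
    """True when no explicit (non-default) route covers the destination."""
    best = best_route(routes, destination)
    return best is None or best.prefixlen == 0

# With the specific /32 present, the "default" set clauses are skipped.
print(default_set_applies(["0.0.0.0/0", "9.9.9.9/32"], "9.9.9.9"))
# With only a default (plus unrelated routes), they become eligible.
print(default_set_applies(["0.0.0.0/0", "192.168.23.0/24"], "9.9.9.9"))
```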
<div>
We see our PBR is doing nothing now:</div>
<div>
<br /></div>
<div>
<div>
R1#trace 9.9.9.9 source Loopback0</div>
<div>
Type escape sequence to abort.</div>
<div>
Tracing the route to 9.9.9.9</div>
<div>
VRF info: (vrf in name/id, vrf out name/id)</div>
<div>
1 192.168.12.2 76 msec 36 msec 68 msec</div>
<div>
<b> 2 192.168.23.3 116 msec 52 msec 60 msec (through R3)</b></div>
<div>
3 192.168.39.9 84 msec 104 msec 100 msec</div>
</div>
<div>
<br /></div>
<div>
I haven't got a smooth answer for this, so let's just make R3 send a default as well. Note I've increased the delay between R5, R6, R7 and R8, so that R3 will still be preferred even with just a default being sent.</div>
<div>
<br /></div>
<div>
<div>
R3(config)#int s2/0</div>
<div>
R3(config-if)#ip summary-address eigrp 100 0.0.0.0 0.0.0.0</div>
</div>
<div>
<br /></div>
<div>
<div>
R2#sh ip route 9.9.9.9 long</div>
<div>
[output omitted]</div>
<div>
Gateway of last resort is 192.168.23.3 to network 0.0.0.0</div>
</div>
<div>
<br /></div>
<div>
So now R2 matches a default towards R3. Let's see our PBR kick in again.</div>
<div>
<br /></div>
<div>
<div>
R1#trace 9.9.9.9 source Loopback0</div>
<div>
Type escape sequence to abort.</div>
<div>
Tracing the route to 9.9.9.9</div>
<div>
VRF info: (vrf in name/id, vrf out name/id)</div>
<div>
1 192.168.12.2 40 msec 88 msec 40 msec</div>
<div>
<b> 2 192.168.26.6 88 msec 108 msec 48 msec</b></div>
<div>
3 192.168.68.8 128 msec 140 msec 124 msec</div>
<div>
4 192.168.89.9 92 msec 116 msec 112 msec</div>
</div>
<div>
<br /></div>
<div>
Through R6!</div>
<div>
<br /></div>
<div>
And if the link to R6 is down?</div>
<div>
<br /></div>
<div>
<div>
R2(config)#int s4/3</div>
<div>
R2(config-if)#shut</div>
</div>
<div>
<br /></div>
<div>
<div>
<div>
route-map PBR permit 10</div>
<div>
match ip address match</div>
<div>
<strike> set ip next-hop 192.168.24.4</strike> (now unavailable)</div>
<div>
<strike> set interface Serial4/2 </strike> (now unavailable)</div>
<div>
<strike> set ip default next-hop 192.168.26.6</strike> (now unavailable)</div>
<div>
set default interface Serial4/4</div>
</div>
</div>
<div>
<br /></div>
<div>
R1#trace 9.9.9.9 source Loopback0</div>
<div>
<div>
Type escape sequence to abort.</div>
<div>
Tracing the route to 9.9.9.9</div>
<div>
VRF info: (vrf in name/id, vrf out name/id)</div>
<div>
1 192.168.12.2 80 msec 60 msec 88 msec</div>
<div>
<b> 2 192.168.27.7 112 msec 44 msec 88 msec</b></div>
<div>
3 192.168.78.8 96 msec 92 msec 100 msec</div>
<div>
4 192.168.89.9 88 msec 128 msec 132 msec</div>
</div>
<div>
<br /></div>
<div>
Through R7.</div>
<div>
<br /></div>
<div>
That wraps up my main point, but while we've got this setup, let's look at recursive PBR too.</div>
<div>
<br /></div>
<div>
I'm no-shutting all the interfaces we turned down earlier, and re-advertising the specific route from R3.</div>
<div>
<br /></div>
<div>
I've also added a leak-map on R5, R6 and R7 to allow R8's Lo0 (8.8.8.8) through in addition to the default route. Additionally, I de-preferenced 8.8.8.8 through R3 and R4.</div>
<div>
<br /></div>
<div>
So to be clear, 9.9.9.9 is now reachable via R3, and 8.8.8.8 is reachable via equal-cost load sharing on R5, R6 and R7:</div>
<div>
<br /></div>
<div>
<div>
R2#sh ip cef 9.9.9.9</div>
<div>
9.9.9.9/32</div>
<div>
nexthop 192.168.23.3 Serial4/0</div>
</div>
<div>
<br /></div>
<div>
<div>
R2#sh ip cef 8.8.8.8</div>
<div>
8.8.8.8/32</div>
<div>
nexthop 192.168.25.5 Serial4/2</div>
<div>
nexthop 192.168.26.6 Serial4/3</div>
<div>
nexthop 192.168.27.7 Serial4/4</div>
</div>
<div>
<br /></div>
<div>
Recursive PBR allows for ECMP (equal cost multipathing) and PBR to mix. In short, pre-PBR, the path to 9.9.9.9 is via R3. Post PBR, we'll target having "8.8.8.8" as the next hop - which will ECMP through R5, R6 and R7.</div>
<div>
<br /></div>
<div>
In our environment, however, this is a bit hard to see, because per-destination CEF ECMP won't show up on our traceroute. Let's change to per-packet:</div>
<div>
<br /></div>
<div>
<div>
R2(config-route-map)#int s4/2</div>
<div>
R2(config-if)#ip load-sharing per-packet</div>
<div>
R2(config-if)#int s4/3</div>
<div>
R2(config-if)#ip load-sharing per-packet</div>
<div>
R2(config-if)#int s4/4</div>
<div>
R2(config-if)#ip load-sharing per-packet</div>
</div>
<div>
<br /></div>
<div>
And let's re-write our route-map for recursion:</div>
<div>
<br /></div>
<div>
<div>
R2(config)#no route-map PBR permit 10</div>
<div>
R2(config)#route-map PBR permit 10</div>
<div>
R2(config-route-map)#match ip address match</div>
<div>
R2(config-route-map)#set ip next-hop recursive 8.8.8.8</div>
</div>
<div>
<br /></div>
<div>
And test:</div>
<div>
<br /></div>
<div>
<div>
R1#trace 9.9.9.9 source Loopback0</div>
<div>
Type escape sequence to abort.</div>
<div>
Tracing the route to 9.9.9.9</div>
<div>
VRF info: (vrf in name/id, vrf out name/id)</div>
<div>
1 192.168.12.2 28 msec * 44 msec</div>
<div>
2 192.168.25.5 72 msec <b><-- Indicates ECMP</b></div>
<div>
192.168.26.6 76 msec <b><-- Indicates ECMP</b></div>
<div>
192.168.27.7 60 msec <b><-- Indicates ECMP</b></div>
<div>
3 192.168.58.8 132 msec</div>
<div>
192.168.68.8 80 msec</div>
<div>
192.168.78.8 64 msec</div>
<div>
4 192.168.89.9 76 msec 76 msec 64 msec</div>
</div>
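<div>
What the traceroute shows can be modelled very roughly in Python: the policy next hop 8.8.8.8 is resolved through the FIB, and with per-packet load sharing each successive packet takes the next path in turn. This is a simplified sketch of my own (plain round-robin over the three next hops from the "sh ip cef 8.8.8.8" output), not how CEF is actually implemented.</div>

```python
from itertools import cycle

# Next hops R2 holds for 8.8.8.8/32, copied from the CEF output earlier.
FIB = {"8.8.8.8": ["192.168.25.5", "192.168.26.6", "192.168.27.7"]}

def per_packet_next_hops(recursive_target, fib):
    """Return an endless round-robin iterator over the resolved next hops."""
    return cycle(fib[recursive_target])

hops = per_packet_next_hops("8.8.8.8", FIB)
first_three = [next(hops) for _ in range(3)]
print(first_three)  # one probe per path, as in hop 2 of the traceroute
```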
<div>
<br /></div>
<div>
Hope you enjoyed!</div>
<div>
<br /></div>
<div>
Jeff</div>
brbcciehttp://www.blogger.com/profile/14586635047530183862noreply@blogger.com4tag:blogger.com,1999:blog-5968686435283454526.post-56758978091626631732014-09-07T17:59:00.002-07:002014-09-07T18:06:15.708-07:00CCIE v4 to v5: BGP NHT, SAT, FSD, Dynamic Neighbors, Multisession Transport Per AFBGP Next Hop Tracking (NHT) is an on-by-default feature that notifies BGP of a change in routing for BGP prefix next-hops. This is important because previously this only happened as part of the BGP Scanner process, which runs every 60 seconds by default. Waiting up to 60 seconds to learn that a BGP route is effectively no longer valid (because its next-hop is invalid) significantly hampers reconvergence. Instead of being timer-based, NHT makes the process of dealing with next-hop changes event-driven.<br />
<br />
<img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhXM756F_TGKXb_HP7qnlgMXrpbPl9dNwiZ6q-w8xO3QOjCtF5_DW2URmqMuHtY0OcVMEhnw4qNE0TAaZtrdz8ueVcXKC7-MLdb47FNfufJFbxlJJJJi-gblAA6pyUiqaVPb49TPjkCUvs/s1600/diagram1.png" />
<br />
<br />
EIGRP is peered on all routers on the 192.168.124.0/24 link.<br />
<br />
Here's the relevant base BGP config:<br />
<br />
R1:<br />
router bgp 1<br />
bgp log-neighbor-changes<br />
neighbor 3.3.3.3 remote-as 3<br />
neighbor 3.3.3.3 ebgp-multihop 255<br />
neighbor 3.3.3.3 update-source Loopback0<br />
neighbor 4.4.4.4 remote-as 4<br />
neighbor 4.4.4.4 ebgp-multihop 255<br />
neighbor 4.4.4.4 update-source Loopback0<br />
<div>
<br /></div>
<div>
R3:</div>
<div>
<div>
router bgp 3</div>
<div>
bgp log-neighbor-changes</div>
<div>
neighbor 1.1.1.1 remote-as 1</div>
<div>
neighbor 1.1.1.1 ebgp-multihop 255</div>
<div>
neighbor 1.1.1.1 update-source Loopback0</div>
<div>
neighbor 192.168.34.4 remote-as 4</div>
</div>
<div>
<br /></div>
<div>
R4:</div>
<div>
<div>
<div>
interface Loopback1</div>
<div>
ip address 44.44.44.44 255.255.255.255</div>
</div>
<div>
<br /></div>
<div>
router bgp 4</div>
<div>
bgp log-neighbor-changes</div>
<div>
network 44.44.44.44 mask 255.255.255.255</div>
<div>
neighbor 1.1.1.1 remote-as 1</div>
<div>
neighbor 1.1.1.1 ebgp-multihop 255</div>
<div>
neighbor 1.1.1.1 update-source Loopback0</div>
<div>
neighbor 192.168.34.3 remote-as 3</div>
</div>
<div>
<br /></div>
<div>
In short, we're using ebgp multihop in order to keep my mock-up smaller. We have two paths from R1 to R4's 44.44.44.44:</div>
<div>
<br /></div>
<div>
R1 -> R4's 4.4.4.4 (and consequently to 44.44.44.44 in the same hop)</div>
<div>
R1 -> R3's 3.3.3.3, then R3 to R4's 192.168.34.4 </div>
<div>
<br /></div>
<div>
The first route has one AS in its AS-PATH; the 2nd route has two ASes and is less preferred.</div>
<div>
<br /></div>
<div>
<div>
R1#sh ip bgp 44.44.44.44 bestpath</div>
<div>
BGP routing table entry for 44.44.44.44/32, version 11</div>
<div>
Paths: (2 available, best #1, table default)</div>
<div>
Advertised to update-groups:</div>
<div>
2</div>
<div>
Refresh Epoch 2</div>
<div>
4</div>
<div>
4.4.4.4 (metric 10880) from 4.4.4.4 (44.44.44.44)</div>
<div>
Origin IGP, metric 0, localpref 100, valid, external, best</div>
<div>
rx pathid: 0, tx pathid: 0x0</div>
</div>
<div>
<br /></div>
<div>
Let's try this experiment without NHT enabled first:</div>
<div>
<br /></div>
<div>
<div>
<div>
R1(config)#router bgp 1</div>
<div>
R1(config-router)# no bgp nexthop trigger enable</div>
</div>
</div>
<div>
<br /></div>
<div>
<div>
R1#debug ip routing</div>
<div>
IP routing debugging is on</div>
</div>
<div>
<br /></div>
<div>
<div>
R4(config-if)#int lo0 ! this is the 4.4.4.4 interface (the next-hop for 44.44.44.44 from R1)</div>
<div>
R4(config-if)#shut</div>
</div>
<div>
<br /></div>
<div>
Debug from R1 below</div>
<div>
===============</div>
<div>
<div>
*Sep 17 22:59:03.552: RT: delete route to 4.4.4.4 via 192.168.124.4, eigrp metric [90/10880]</div>
<div>
*Sep 17 22:59:03.552: RT: no routes to 4.4.4.4, delayed flush</div>
<div>
*Sep 17 22:59:03.552: RT: delete subnet route to 4.4.4.4/32</div>
<div>
*Sep 17 22:59:03.552: RT: updating eigrp 4.4.4.4/32 (0x0) :</div>
<div>
via 192.168.124.4 Gi1.124 0 1048578</div>
<div>
<br /></div>
<div>
*Sep 17 22:59:03.552: RT: rib update return code: 5</div>
<div>
================</div>
<div>
<br /></div>
<div>
This happened as fast as EIGRP converged - very quickly. So we know 4.4.4.4 isn't a valid route any longer, but what about 44.44.44.44?</div>
<div>
<br /></div>
<div>
<div>
R1#sh ip bgp 44.44.44.44 bestpath</div>
<div>
BGP routing table entry for 44.44.44.44/32, version 11</div>
<div>
Paths: (2 available, best #1, table default)</div>
<div>
Advertised to update-groups:</div>
<div>
2</div>
<div>
Refresh Epoch 2</div>
<div>
4</div>
<div>
4.4.4.4 (metric 10880) from 4.4.4.4 (44.44.44.44)</div>
<div>
Origin IGP, metric 0, localpref 100, valid, external, best</div>
<div>
rx pathid: 0, tx pathid: 0x0</div>
</div>
<div>
<br /></div>
<div>
R1#ping 44.44.44.44</div>
<div>
Type escape sequence to abort.</div>
<div>
Sending 5, 100-byte ICMP Echos to 44.44.44.44, timeout is 2 seconds:</div>
<div>
.....</div>
<div>
Success rate is 0 percent (0/5)</div>
<div>
<br /></div>
<div>
Still thinking the next-hop is 4.4.4.4, and it's Very Down.</div>
<div>
<br /></div>
<div>
I didn't time it this way specifically, but remember the scan timer runs every 60 seconds. So 51 seconds after we yanked the 4.4.4.4 next-hop, BGP finally figured out something was up and reconverged to the alternate path for 44.44.44.44 via R3.</div>
<div>
<br /></div>
<div>
*Sep 17 22:59:54.031: RT: updating bgp 44.44.44.44/32 (0x0) :</div>
<div>
via 3.3.3.3 0 1048577</div>
<div>
<br /></div>
<div>
*Sep 17 22:59:54.031: RT: closer admin distance for 44.44.44.44, flushing 1 routes</div>
<div>
*Sep 17 22:59:54.031: RT: add 44.44.44.44/32 via 3.3.3.3, bgp metric [20/0]</div>
</div>
<div>
<br /></div>
<div>
<div>
R1#ping 44.44.44.44</div>
<div>
Type escape sequence to abort.</div>
<div>
Sending 5, 100-byte ICMP Echos to 44.44.44.44, timeout is 2 seconds:</div>
<div>
!!!!!</div>
<div>
Success rate is 100 percent (5/5), round-trip min/avg/max = 1/1/3 ms</div>
<div>
<br /></div>
<div>
R1#trace 44.44.44.44</div>
<div>
Type escape sequence to abort.</div>
<div>
Tracing the route to 44.44.44.44</div>
<div>
VRF info: (vrf in name/id, vrf out name/id)</div>
<div>
1 192.168.124.3 4 msec 1 msec 0 msec</div>
<div>
2 192.168.34.4 2 msec * 2 msec</div>
</div>
<div>
<br /></div>
<div>
A 51-second reconvergence in a modern network is pretty awful.</div>
<div>
<br /></div>
<div>
<div>
R4(config-if)#int lo0</div>
<div>
R4(config-if)#no shut</div>
</div>
<div>
<br /></div>
<div>
Let's re-add the next-hop trigger and try again.</div>
<div>
<br /></div>
<div>
<div>
R1(config-router)#router bgp 1</div>
<div>
R1(config-router)#bgp nexthop trigger enable</div>
</div>
<div>
<br /></div>
<div>
<div>
R4(config-if)#int lo0</div>
<div>
R4(config-if)#shut</div>
</div>
<div>
<br /></div>
<div>
Debug from R1 below</div>
<div>
<div>
<div>
===============</div>
<div>
</div>
</div>
<div>
*Sep 17 23:11:53.582: RT: delete route to 4.4.4.4 via 192.168.124.4, eigrp metric [90/10880]</div>
<div>
*Sep 17 23:11:53.582: RT: no routes to 4.4.4.4, delayed flush</div>
<div>
*Sep 17 23:11:53.582: RT: delete subnet route to 4.4.4.4/32</div>
<div>
*Sep 17 23:11:53.582: RT: updating eigrp 4.4.4.4/32 (0x0) :</div>
<div>
via 192.168.124.4 Gi1.124 0 1048578</div>
<div>
<br /></div>
<div>
*Sep 17 23:11:53.582: RT: rib update return code: 5</div>
<div>
*Sep 17 23:11:58.582: RT: updating bgp 44.44.44.44/32 (0x0) :</div>
<div>
via 3.3.3.3 0 1048577</div>
<div>
<br /></div>
<div>
*Sep 17 23:11:58.582: RT: closer admin distance for 44.44.44.44, flushing 1 routes</div>
<div>
*Sep 17 23:11:58.582: RT: add 44.44.44.44/32 via 3.3.3.3, bgp metric [20/0]</div>
</div>
<div>
<div>
===============</div>
<div>
<br /></div>
<div>
</div>
</div>
<div>
Note the bottom two lines of output: this time we see the reconvergence in 5 seconds. Why 5 seconds?</div>
<div>
<br /></div>
<div>
The <b>bgp nexthop trigger delay</b> defines how long the NHT process waits before updating BGP. This timer is here to prevent BGP from being beaten up by a flapping IGP route. At 5 seconds, the BGP process can't get bogged down by unnecessary updates. </div>
<div>
<br /></div>
<div>
Let's set it to 2 and try again.</div>
<div>
<br /></div>
<div>
<div>
R1(config-router)#bgp nexthop trigger delay 2</div>
</div>
<div>
<br /></div>
<div>
<div>
Debug from R1 below</div>
<div>
<div>
===============</div>
<div>
</div>
</div>
</div>
<div>
<div>
*Sep 17 23:18:40.167: RT: delete route to 4.4.4.4 via 192.168.124.4, eigrp metric [90/10880]</div>
<div>
*Sep 17 23:18:40.167: RT: no routes to 4.4.4.4, delayed flush</div>
<div>
*Sep 17 23:18:40.167: RT: delete subnet route to 4.4.4.4/32</div>
<div>
*Sep 17 23:18:40.167: RT: updating eigrp 4.4.4.4/32 (0x0) :</div>
<div>
via 192.168.124.4 Gi1.124 0 1048578</div>
<div>
<br /></div>
<div>
*Sep 17 23:18:40.167: RT: rib update return code: 5</div>
<div>
*Sep 17 23:18:42.168: RT: updating bgp 44.44.44.44/32 (0x0) :</div>
<div>
via 3.3.3.3 0 1048577</div>
<div>
<br /></div>
<div>
*Sep 17 23:18:42.168: RT: closer admin distance for 44.44.44.44, flushing 1 routes</div>
<div>
*Sep 17 23:18:42.168: RT: add 44.44.44.44/32 via 3.3.3.3, bgp metric [20/0]</div>
</div>
<div>
===============</div>
<div>
<div>
<br /></div>
<div>
Now converging at 2 seconds.</div>
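<div>
If you'd rather not eyeball the debug timestamps, here's a small Python sketch that computes the gap between the EIGRP route for 4.4.4.4 being deleted and BGP installing 44.44.44.44 via 3.3.3.3 in each of the three runs above. The timestamps are copied from the debug output; the labels and the function name are my own.</div>

```python
from datetime import datetime

def gap_seconds(start, end):
    """Seconds between two debug timestamps of the form HH:MM:SS.mmm."""
    fmt = "%H:%M:%S.%f"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return delta.total_seconds()

# (IGP route deleted, BGP route updated) pairs taken from the debugs above.
RUNS = {
    "NHT disabled (BGP scanner)": ("22:59:03.552", "22:59:54.031"),
    "NHT enabled, default delay": ("23:11:53.582", "23:11:58.582"),
    "NHT enabled, trigger delay 2": ("23:18:40.167", "23:18:42.168"),
}

for label, (igp_loss, bgp_update) in RUNS.items():
    print(f"{label}: {gap_seconds(igp_loss, bgp_update):.3f} s")
```

<div>
The first gap works out to roughly 50.5 seconds, in line with the "51 seconds" quoted earlier, and the other two land on the configured trigger delays.</div>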
<div>
<br /></div>
<div>
Applying a route-map to the NHT process is provided by a feature called Selective Address Tracking, or SAT.</div>
<div>
<br /></div>
<div>
The route-map determines what prefixes can be seen as valid prefixes for next-hops.</div>
<div>
<br /></div>
<div>
For example, say 4.4.4.4 is your desired next hop and you also have a default route on your router. If you lose 4.4.4.4/32, do you want the router to consider 4.4.4.4 reachable via the default? Potentially not.</div>
<div>
<br /></div>
<div>
<div>
R1(config)#ip route 0.0.0.0 0.0.0.0 192.168.124.10 ! Deliberately non-existent next-hop</div>
</div>
<div>
<br /></div>
<div>
Without the route map....</div>
<div>
<br /></div>
<div>
<div>
R4(config-if)#int lo0</div>
<div>
R4(config-if)#shut</div>
</div>
<div>
<br /></div>
<div>
This is hard to demonstrate, because the prefix <b>might never recover</b>. In our over-simplified mock-up, the BGP process would fail at timeout (because 4.4.4.4 is actually our peer) before the prefix vanished; in a more realistic design this could be a permanent black-hole.</div>
<div>
<br /></div>
<div>
We still have the bogus static default route in place:</div>
<div>
<div>
ip route 0.0.0.0 0.0.0.0 192.168.124.10</div>
</div>
<div>
<br /></div>
<div>
<div>
R1(config-router)#ip prefix-list onlyloops seq 5 permit 0.0.0.0/0 ge 32</div>
<div>
R1(config)#route-map SAT permit 10</div>
<div>
R1(config-route-map)# match ip address prefix-list onlyloops</div>
</div>
<div>
<div>
R1(config-route-map)#router bgp 1</div>
<div>
R1(config-router)# bgp nexthop route-map SAT</div>
</div>
<div>
<br /></div>
<div>
This config only allows for /32s as viable next-hops.</div>
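<div>
To see why "0.0.0.0/0 ge 32" boils down to "host routes only", here is a small Python model of a single prefix-list permit entry with a <i>ge</i> value. The function name is made up for illustration, and it only models <i>ge</i> with no <i>le</i>, which is all the config above uses.</div>

```python
import ipaddress

def prefix_list_permits(entry_prefix, ge, candidate):
    """Model 'permit <entry_prefix> ge <ge>' against one candidate route."""
    entry = ipaddress.ip_network(entry_prefix)
    route = ipaddress.ip_network(candidate)
    # The route must fall inside the entry's prefix AND be at least /ge long.
    return route.subnet_of(entry) and route.prefixlen >= ge

print(prefix_list_permits("0.0.0.0/0", 32, "4.4.4.4/32"))  # host route
print(prefix_list_permits("0.0.0.0/0", 32, "0.0.0.0/0"))   # the bogus default
```

<div>
The /32 passes and the default route doesn't, so the default can no longer keep 4.4.4.4 looking reachable.</div>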
<div>
<br /></div>
<div>
<div>
R4(config-if)#int lo0</div>
<div>
R4(config-if)#shut</div>
</div>
<div>
<br /></div>
<div>
<div>
<div>
Debug from R1 below</div>
<div>
<div>
===============</div>
</div>
</div>
</div>
<div>
*Sep 17 23:47:09.497: RT: delete route to 4.4.4.4 via 192.168.124.4, eigrp metric [90/10880]</div>
<div>
*Sep 17 23:47:09.497: RT: no routes to 4.4.4.4, delayed flush</div>
<div>
*Sep 17 23:47:09.497: RT: delete subnet route to 4.4.4.4/32</div>
<div>
*Sep 17 23:47:09.497: RT: updating eigrp 4.4.4.4/32 (0x0) :</div>
<div>
via 192.168.124.4 Gi1.124 0 1048578</div>
<div>
<br /></div>
<div>
*Sep 17 23:47:09.497: RT: rib update return code: 5</div>
<div>
*Sep 17 23:47:11.498: RT: updating bgp 44.44.44.44/32 (0x0) :</div>
<div>
via 3.3.3.3 0 1048577</div>
<div>
<br /></div>
<div>
*Sep 17 23:47:11.498: RT: closer admin distance for 44.44.44.44, flushing 1 routes</div>
<div>
*Sep 17 23:47:11.499: RT: add 44.44.44.44/32 via 3.3.3.3, bgp metric [20/0]</div>
<div>
<div>
<div>
===============</div>
<div>
<div>
<br /></div>
<div>
Now reconverging in 2 seconds again!</div>
<div>
<br /></div>
<div>
This is great for the downstream prefix, but what about the neighbor session itself?</div>
<div>
<br /></div>
<div>
This could work...</div>
<div>
<div>
R1(config-router)#neighbor 4.4.4.4 fall-over</div>
</div>
<div>
<br /></div>
<div>
Except that pesky default is keeping 4.4.4.4 supposedly reachable....</div>
<div>
For brevity, I'll tell you that as expected, when I shut the Lo0 interface on R4, 4.4.4.4 was pulled from R1's IGP and 44.44.44.44 was pulled from R1's BGP table. However, the session is still up!</div>
<div>
<br /></div>
<div>
The same concept (even the same route-map) can be applied to the neighbor fall-over statement. This feature is called Fast Session Deactivation (FSD). </div>
<div>
<br /></div>
<div>
<div>
R1(config-router)#neighbor 4.4.4.4 fall-over route-map SAT ! re-using SAT's route-map</div>
</div>
<div>
<br /></div>
<div>
<div>
<div>
Debug from R1 below</div>
<div>
===============</div>
</div>
</div>
<div>
<div>
*Sep 18 00:11:08.107: %BGP-5-NBR_RESET: Neighbor 4.4.4.4 reset (Route to peer lost)</div>
<div>
*Sep 18 00:11:08.107: %BGP-5-ADJCHANGE: neighbor 4.4.4.4 Down Route to peer lost</div>
<div>
*Sep 18 00:11:08.107: %BGP_SESSION-5-ADJCHANGE: neighbor 4.4.4.4 IPv4 Unicast topology base removed from session Route to peer lost</div>
</div>
<div>
<div>
===============</div>
</div>
<div>
<br /></div>
<div>
And the BGP session gets torn down immediately.</div>
<div>
<br /></div>
<div>
I'm not sure of the use case for this next feature, but it was recommended as a topic, so I looked at it. Multisession Transport per AF appears to be related to Multi-Topology Routing (MTR), but MTR should be solidly out-of-scope for CCIE R&S v5.</div>
<div>
<br /></div>
<div>
Multisession transport opens a separate TCP session for each address family.</div>
<div>
<br /></div>
<div>
I've erased all the BGP config from the previous task.</div>
<div>
<br /></div>
<div>
R1:</div>
<div>
ipv6 unicast-routing</div>
<div>
<br /></div>
<div>
<div>
router bgp 100</div>
<div>
bgp log-neighbor-changes</div>
<div>
neighbor 4.4.4.4 remote-as 100</div>
<div>
neighbor 4.4.4.4 update-source Loopback0</div>
<div>
!</div>
<div>
address-family ipv4</div>
<div>
neighbor 4.4.4.4 activate</div>
<div>
exit-address-family</div>
<div>
!</div>
<div>
address-family vpnv4</div>
<div>
neighbor 4.4.4.4 activate</div>
<div>
neighbor 4.4.4.4 send-community extended</div>
<div>
exit-address-family</div>
<div>
!</div>
<div>
address-family ipv6</div>
<div>
neighbor 4.4.4.4 activate</div>
<div>
exit-address-family</div>
</div>
<div>
<br /></div>
<div>
R4:</div>
<div>
<div>
ipv6 unicast-routing</div>
<div>
<br /></div>
<div>
router bgp 100</div>
<div>
bgp log-neighbor-changes</div>
<div>
neighbor 1.1.1.1 remote-as 100</div>
<div>
neighbor 1.1.1.1 update-source Loopback0</div>
<div>
!</div>
<div>
address-family ipv4</div>
<div>
neighbor 1.1.1.1 activate</div>
<div>
exit-address-family</div>
<div>
!</div>
<div>
address-family vpnv4</div>
<div>
neighbor 1.1.1.1 activate</div>
<div>
neighbor 1.1.1.1 send-community extended</div>
<div>
exit-address-family</div>
<div>
!</div>
<div>
address-family ipv6</div>
<div>
neighbor 1.1.1.1 activate</div>
<div>
exit-address-family</div>
</div>
<div>
<br /></div>
<div>
<div>
R1(config-router-af)#do show tcp brief</div>
<div>
TCB Local Address Foreign Address (state)</div>
<div>
7F612C7742A0 1.1.1.1.40234 4.4.4.4.179 ESTAB</div>
</div>
<div>
<br /></div>
<div>
Three families, one TCP session.</div>
<div>
<br /></div>
<div>
<div>
R1(config-router)#neighbor 4.4.4.4 transport multi-session</div>
</div>
<div>
<br /></div>
<div>
<div>
R4(config-router)#neighbor 1.1.1.1 transport multi-session</div>
</div>
<div>
<br /></div>
<div>
The two sides of the session do need to agree on the setting.</div>
<div>
<br /></div>
<div>
R1:</div>
<div>
<div>
*Sep 18 00:31:19.102: %BGP-5-ADJCHANGE: neighbor 4.4.4.4 Up</div>
<div>
*Sep 18 00:31:25.940: %BGP-5-ADJCHANGE: neighbor 4.4.4.4 session 2 Up</div>
<div>
*Sep 18 00:31:28.322: %BGP-5-ADJCHANGE: neighbor 4.4.4.4 session 3 Up</div>
</div>
<div>
<br /></div>
<div>
<div>
R1(config-router)#do show tcp brief</div>
<div>
TCB Local Address Foreign Address (state)</div>
<div>
7F612C76F0F0 1.1.1.1.179 4.4.4.4.30092 ESTAB</div>
<div>
7F612C76DE20 1.1.1.1.179 4.4.4.4.42417 ESTAB</div>
<div>
7F612C76E788 1.1.1.1.48539 4.4.4.4.179 ESTAB</div>
</div>
<div>
<br /></div>
<div>
Our last topic is BGP Dynamic Neighbors. Yes, automagic BGP peerings!</div>
<div>
<br /></div>
<div>
Erasing all the pre-existing BGP config again...</div>
<div>
<br />
R1:</div>
<div>
<div>
router bgp 100</div>
<div>
bgp log-neighbor-changes</div>
<div>
bgp listen range 192.168.124.0/24 peer-group PEERS</div>
<div>
neighbor PEERS peer-group</div>
<div>
neighbor PEERS remote-as 100</div>
<div>
neighbor PEERS password CISCO</div>
<div>
neighbor PEERS update-source Loopback0</div>
<div>
neighbor PEERS route-reflector-client</div>
</div>
<div>
bgp listen limit 3</div>
<div>
<br /></div>
<div>
R2-R4:</div>
<div>
<div>
router bgp 100</div>
<div>
bgp log-neighbor-changes</div>
<div>
neighbor 192.168.124.1 remote-as 100</div>
<div>
neighbor 192.168.124.1 password CISCO</div>
</div>
<div>
<br /></div>
<div>
R1:</div>
<div>
<div>
*Sep 18 00:38:24.696: %BGP-5-ADJCHANGE: neighbor *192.168.124.2 Up</div>
<div>
*Sep 18 00:39:04.980: %BGP-5-ADJCHANGE: neighbor *192.168.124.4 Up</div>
<div>
*Sep 18 00:39:05.932: %BGP-5-ADJCHANGE: neighbor *192.168.124.3 Up</div>
</div>
<div>
<br /></div>
<div>
<div>
iBGP doesn't get any faster to set up than that!</div>
</div>
<div>
<br /></div>
<div>
I've used the most obvious settings here - the dynamic "host" would normally be a route-reflector, and would normally require authentication. </div>
<div>
<br /></div>
<div>
However, you can:</div>
<div>
- Run multiple dynamic groups</div>
<div>
- Listen to multiple ranges</div>
<div>
- Use multiple address families (this works great for VPNv4!)</div>
<div>
- Listen for more neighbors (I limited it to 3 above)</div>
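<div>
As a mental model of the listen range and listen limit, here's a toy Python sketch: a peer is accepted only if its source address falls inside the configured range and the limit hasn't been reached. The class and method names are my own invention, and it ignores everything else a real session needs (matching AS, password and so on).</div>

```python
import ipaddress

class Listener:
    """Toy model of 'bgp listen range' combined with 'bgp listen limit'."""

    def __init__(self, listen_range, limit):
        self.listen_range = ipaddress.ip_network(listen_range)
        self.limit = limit
        self.peers = set()

    def accept(self, source):
        addr = ipaddress.ip_address(source)
        if addr in self.peers:
            return True  # already an established dynamic neighbor
        if addr not in self.listen_range or len(self.peers) >= self.limit:
            return False  # outside the range, or the limit is reached
        self.peers.add(addr)
        return True

r1 = Listener("192.168.124.0/24", limit=3)
for src in ("192.168.124.2", "192.168.124.4", "192.168.124.3", "192.168.124.5"):
    print(src, r1.accept(src))
```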
<div>
<br /></div>
<div>
Cheers,</div>
<div>
<br /></div>
<div>
Jeff</div>
<div>
</div>
</div>
</div>
<div>
</div>
</div>
<div>
</div>
</div>
brbcciehttp://www.blogger.com/profile/14586635047530183862noreply@blogger.com0tag:blogger.com,1999:blog-5968686435283454526.post-46075708904899223322014-09-07T13:55:00.002-07:002014-09-07T14:02:30.578-07:00CCIE v4 to v5 Updates: NTPv4 and NetflowI didn't find these updates on any Cisco or 3rd party list, but when writing my original NTP and Netflow blogs in mid-2013, I mentioned out-of-scope topics when writing them, because they weren't supported on IOS v12.4(15)T. Now that v5 is out, all those topics are back in-scope, so I decided to blog them.<br />
<br />
Here are the original articles this one builds off of:<br />
<br />
<a href="http://brbccie.blogspot.com/2013/05/ntp.html">http://brbccie.blogspot.com/2013/05/ntp.html</a><br />
<a href="http://brbccie.blogspot.com/2013/06/netflow.html">http://brbccie.blogspot.com/2013/06/netflow.html</a><br />
<br />
The topics we'll be covering specifically are:<br />
- Netflow w/ NBAR<br />
- IPFIX (Netflow v10)<br />
- NTPv4 (IPv6 support)<br />
- NTPv4 Multicast NTP<br />
- NTP Panic<br />
- NTP Maxdistance<br />
- NTP Orphan<br />
<br />
<u>Netflow</u><br />
First, I wanted to mention an omission from my original blog. At that time I didn't have a collector that would support Flexible Netflow, so I evaluated FNF via Wireshark. That was fairly effective except I was missing a major element of netflow: the bytes transferred! I'm now using a collector that supports FNF, and I immediately noticed I wasn't graphing any traffic.<br />
<br />
flow record JIMBO<br />
match ipv4 source address<br />
match ipv4 destination address<br />
<b> collect counter bytes</b><br />
<b> collect counter packets</b><br />
<br />
This is a simple, working FNF config. Matching or collecting <i>counter bytes </i>and <i>counter packets</i> should be done to make Netflow do what you're used to it doing -- measuring traffic.<br />
<br />
What's the advantage of integrating NBAR with Netflow?<br />
By default, Netflow only exports very high-level protocol information. Integrating NBAR gives very specific/granular protocol output to the collector. Note that your collector needs to specifically support this; it is <u>not a small change</u> at the protocol level.<br />
<br />
If you're familiar with how the template is sent out for FNF every so often, the NBAR table is very similar. IOS will send out a rather large (many packets) table mapping each NBAR application to an ID at specified intervals; those IDs are then sent with the Netflow packets to identify the protocol.<br />
<br />
There are several other blogs out there that give big, complex templates for integrating NBAR with Netflow. I took a few of these as a base and worked backwards to the real requirements. This is not a hard thing to enable. Your flow record must contain <b>collect application name</b> (or match application name), and optionally you can tune the frequency of the NBAR FNF template being sent out with <b>option application-table timeout</b> in the exporter.<br />
<br />
Here's a working config:<br />
<br />
flow record FNF-RECORD<br />
match ipv4 source address<br />
match ipv4 destination address<br />
collect counter bytes<br />
collect counter packets<br />
<b> collect application name </b><br />
<b><br /></b>
flow exporter FNF-EXPORTER<br />
destination 192.168.0.5<br />
source GigabitEthernet1<br />
transport udp 9996<br />
template data timeout 60<br />
<b>option application-table timeout 30</b><br />
<br />
flow monitor FNF-MONITOR<br />
exporter FNF-EXPORTER<br />
cache timeout inactive 60<br />
cache timeout active 60<br />
record FNF-RECORD<br />
<br />
interface gig1<br />
ip flow monitor FNF-MONITOR input<br />
<div>
<br /></div>
Netflow was recently made an open standard with v10. The open version is called IPFIX. To enable IPFIX output instead of FNF v9, you would:<br />
<br />
flow exporter FNF-EXPORTER<br />
export-protocol ipfix<br />
<br />
Note I haven't tested this beyond checking it in Wireshark, because I still don't have a collector that speaks IPFIX.<br />
<br />
<u>NTP</u><br />
<br />
<img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjKEZ4gorG_2wyMBi2SmhnNpBvFreNxHo5QLuvG5I69pEdzYc2LdrcGufUvhU4UwZjU96myP6s9JiQd0l4JLAUc6MY6T8MV7iKTj7wtlTLlqtu0q7-lvKTNtQRGp3ShpuTPkTleINM-uks/s1600/diagram1.png" />
<br />
<br />
The big difference on NTP v4 is IPv6 support. There's really not much to cover on the basics... clearly broadcast NTP is gone for IPv6 (IPv6 has no broadcast), but Multicast NTP still works the same general way it did over IPv4.<br />
<br />
R1(config)#ntp master 4<br />
<div>
<br /></div>
<div>
<div>
R2(config)#ntp server 1::1</div>
</div>
<div>
<br /></div>
<div>
<div>
R2#show ntp association detail</div>
<div>
<b>1::1 configured, ipv6, our_master, sane, valid, stratum 4</b></div>
<div>
ref ID 127.127.1.1 , time D7C45F20.4AC083E0 (19:27:28.292 UTC Wed Sep 17 2014)</div>
</div>
<div>
<output omitted></div>
<br />
Really quite simple.<br />
<br />
15.x implementations of NTP now leave domain names in the config.<br />
Pre 15.x:<br />
foo.com(config)#ip host foo.com 4.4.4.4<br />
foo.com(config)#ntp server foo.com<br />
foo.com(config)#do sh run | i ntp<br />
ntp server 4.4.4.4<br />
<br />
It would translate the hostname to an IP address and the IP address would be saved in the config, not a good thing if the server changes IPs.<br />
<br />
Post 15.x:<br />
R2(config)#ip host test.com 4.1.1.1<br />
R2(config)#ntp server test.com<br />
R2(config)#do sh run | i ntp<br />
ntp server test.com<br />
<div>
<br /></div>
Let's take a look at the multicast option. As IPv6 multicast has blessedly been removed from the v5 blueprint, I'm going to cheap out and perform non-routed/same-link multicast.<br />
<br />
R2(config)#no ntp server 1::1<br />
<div>
<br /></div>
<div>
<div>
R1(config)#ntp authentication-key 1 md5 CISCO</div>
<div>
R1(config)#ntp trusted-key 1</div>
<div>
R1(config)#int gig1.123</div>
<div>
R1(config-subif)#ntp multicast FF02::123 key 1</div>
</div>
<div>
<br /></div>
<div>
<div>
R2(config)#ntp authentication-key 1 md5 CISCO</div>
<div>
R2(config)#ntp trusted-key 1</div>
<div>
R2(config)#ntp authenticate</div>
<div>
R2(config)#int gig1.123</div>
<div>
R2(config-subif)#ntp multicast client FF02::123</div>
</div>
<div>
<br /></div>
<div>
<div>
R2(config-subif)#do show ntp ass det</div>
<div>
FE80::20C:29FF:FEB6:3557 dynamic, ipv6, authenticated, our_master, sane, valid, stratum 4</div>
<div>
ref ID 127.127.1.1 , time D7C460E0.4AC083E0 (19:34:56.292 UTC Wed Sep 17 2014)</div>
</div>
<div>
<br /></div>
<div>
Maxdistance, for me, is very confusing. It appears to be a trust value. It's normally modified in NTPv4 in order to speed up convergence. As I understand it, the higher the value the faster the synchronization will happen, because the upstream time will be trusted sooner. The algorithm appears to add half the value of the root delay to the dispersion, and if that sum is lower than Maxdistance, then it's OK to consider yourself in-sync. My labbing did not produce exactly that outcome, but it was extremely hard to say for sure because my NTPv4 converges very quickly. Because you basically have to be a time expert to understand what this does, I would hope the CCIE lab would be limited to two types of questions on it:</div>
<div>
1) Set it to some value they provide</div>
<div>
2) Set it to "slowest" convergence (1) or "fastest" convergence (16)</div>
<br />
R1(config)#ntp maxdistance ?<br />
<1-16> Maximum distance for synchronization<br />
<br />
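To make the check described above concrete, here is a minimal Python sketch of it. It only models my reading of the algorithm (half the root delay plus the dispersion, compared against Maxdistance); the function names and the sample delay/dispersion values are made up for illustration and this is not how IOS implements it.<br />

```python
def root_distance(root_delay: float, root_dispersion: float) -> float:
    """Half the root delay plus the root dispersion, in seconds."""
    return root_delay / 2.0 + root_dispersion


def may_synchronize(root_delay: float, root_dispersion: float,
                    maxdistance: float) -> bool:
    """True when the computed distance is below the configured threshold."""
    return root_distance(root_delay, root_dispersion) < maxdistance


# Hypothetical samples, in seconds: a large dispersion right after the
# association comes up, and a smaller one once more samples have arrived.
early = {"root_delay": 0.020, "root_dispersion": 7.9}
later = {"root_delay": 0.020, "root_dispersion": 0.5}

for label, sample in (("early", early), ("later", later)):
    for maxdistance in (1, 16):
        verdict = may_synchronize(maxdistance=maxdistance, **sample)
        print(f"{label} sample, maxdistance {maxdistance}: in sync = {verdict}")
```

Under these assumed numbers, a Maxdistance of 16 accepts the early sample while a Maxdistance of 1 has to wait for the dispersion to shrink, which is the "fastest" versus "slowest" behavior described above.<br />
<br />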
NTP Panic is simple:<br />
<br />
R2(config)#ntp panic ?<br />
update Reject time updates > panic threshold (default 1000Sec)<br />
<br />
It does just what it says - if my peer or configured master's clock is more than 1,000 seconds off of my clock, reject the update and syslog:<br />
<br />
.Sep 8 00:51:00.155: NTP Core (ERROR): Time correction of nan seconds exceeds sanity limit of 0. seconds. Set clock manually to the correct UTC time.<br />
<br />
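The decision itself is just a threshold comparison. A tiny Python sketch of that logic, assuming the 1000-second default from the help text above (the function name and timestamps are invented for the example, and this is not IOS code):<br />

```python
PANIC_THRESHOLD_SECONDS = 1000.0  # default shown in the IOS help text


def accept_update(local_time: float, server_time: float,
                  threshold: float = PANIC_THRESHOLD_SECONDS) -> bool:
    """Accept the update only if the correction is within the threshold."""
    return abs(server_time - local_time) <= threshold


# Hypothetical clocks expressed as seconds since some epoch.
print(accept_update(local_time=50_000.0, server_time=50_030.0))  # small step
print(accept_update(local_time=50_000.0, server_time=53_600.0))  # an hour off
```
<br />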
NTP Orphan is really cool. It seems like an obvious feature now that I've seen it, but I can imagine this is a huge help for smaller organizations that rely heavily on NTP.<br />
<br />
Let's say, from our diagram, R1 is an Internet time server that our fictional organization uses as its sole NTP master. R2 and R3 are edge routers inside the company, and R4 and R5 will represent servers querying R2 and R3.<br />
<br />
So to be clear, R2 and R3 get their time from R1, and also peer towards one another (so if R3 can't reach R1 but R2 can, R3 can learn its time via R2). R4 and R5 query R2 and R3 for time, respectively.<br />
<br />
Relevant config:<br />
R1(config)#ntp master 4<br />
<div>
<br /></div>
<div>
<div>
R2(config)#int gig1.123</div>
<div>
R2(config-subif)#no ntp multicast client FF02::123</div>
<div>
R2(config-subif)#no ntp authenticate</div>
</div>
<div>
R2(config)#ntp server 1::1</div>
<div>
<div>
R2(config)#ntp peer 3::3</div>
</div>
<div>
<div>
R2(config)#ntp source lo0</div>
</div>
<div>
<br /></div>
<div>
<div>
R3(config)#ntp server 1::1</div>
<div>
R3(config)#ntp peer 2::2</div>
</div>
<div>
<div>
R3(config)#ntp source lo0</div>
</div>
<div>
<br /></div>
<div>
<div>
R4(config)#ntp server 2::2</div>
</div>
<div>
<br /></div>
<div>
<div>
R5(config)#ntp server 3::3</div>
</div>
<div>
<br /></div>
<div>
At this point every device has the up-to-date time.</div>
<div>
<br /></div>
<div>
Now let's say R1 goes offline.</div>
<div>
<div>
R1(config)#int lo0</div>
<div>
R1(config-if)#shut</div>
</div>
<div>
<br /></div>
<div>
<<wait a while>></div>
<div>
<br /></div>
<div>
<div>
R2(config)#do show ntp status</div>
<div>
Clock is unsynchronized, stratum 16, no reference clock</div>
</div>
<div>
<output omitted></div>
<div>
<br /></div>
<div>
<div>
R3(config)#do show ntp status</div>
<div>
<div>
Clock is unsynchronized, stratum 16, no reference clock</div>
</div>
</div>
<div>
<div>
<output omitted></div>
</div>
<div>
<br /></div>
<div>
and obviously R4 and R5 share the same fate.</div>
<div>
<br /></div>
<div>
What if we could program R2 and R3 to take their best stab at what the time should still be - mind you we're talking about being only a couple minutes since last sync, so the time is probably still very close to accurate - and then <i>temporarily</i> and <i>seamlessly</i> take over the NTP Master role if they lose valid clock from R1?</div>
<div>
<br /></div>
<div>
This is exactly what NTP Orphan does.</div>
<div>
<br /></div>
<div>
The config is extremely complicated:</div>
<div>
<div>
<br /></div>
<div>
R2(config)#ntp orphan 6</div>
</div>
<div>
<div>
<br /></div>
<div>
R3(config)#ntp orphan 6</div>
</div>
<div>
<br /></div>
<div>
(I was joking about the complicated part)</div>
<div>
<br /></div>
<div>
Really, that's it. Let's understand what's happening here now. Orphan kicks in when we lose sync with our server. The number 6 here is a stratum number, and must be a <u>worse</u> (numerically higher) stratum than your real upstream NTP server, which is stratum 4 here - otherwise the failover/fail-back mechanism won't work right. </div>
<div>
<br /></div>
<div>
Best practices indicate configuring the same Orphan stratum on all devices you're running Orphan on, then peering all the Orphans to one another so that only one is "elected" to be the temporary Orphan master.</div>
<div>
<br /></div>
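Here is a small Python sketch of how I picture the stratum behavior. It is a simplified model of what the lab outputs show, not the NTP election algorithm: the function names are invented, and the "elected parent" flag is simply passed in rather than computed.<br />

```python
from typing import Optional

UNSYNCHRONIZED = 16  # stratum reported when there is no usable time source


def orphan_stratum_is_sane(orphan_stratum: int, upstream_stratum: int) -> bool:
    """The orphan stratum should be worse (higher) than the real server's."""
    return upstream_stratum < orphan_stratum < UNSYNCHRONIZED


def local_stratum(upstream_stratum: Optional[int],
                  orphan_stratum: Optional[int],
                  elected_parent: bool) -> int:
    """Stratum a router would report in this simplified model."""
    if upstream_stratum is not None:      # real server reachable: one below it
        return upstream_stratum + 1
    if orphan_stratum is None:            # no server and no orphan mode
        return UNSYNCHRONIZED
    # Orphan mode: the elected parent sits at the orphan stratum,
    # the other orphan peers sync to it and land one stratum below.
    return orphan_stratum if elected_parent else orphan_stratum + 1


print(orphan_stratum_is_sane(orphan_stratum=6, upstream_stratum=4))
print(local_stratum(None, None, elected_parent=False))  # R1 down, no orphan
print(local_stratum(None, 6, elected_parent=True))      # elected orphan parent
print(local_stratum(None, 6, elected_parent=False))     # the other orphan peer
print(local_stratum(4, 6, elected_parent=False))        # R1 reachable again
```

Because the real server's stratum is better than the orphan stratum, any router that regains R1 immediately prefers it, which is why fail-back needs no extra configuration.<br />
<br />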
<div>
<div>
R2(config)#do show ntp status</div>
<div>
Clock is synchronized, stratum 6, reference is 127.0.0.1</div>
</div>
<div>
<output omitted></div>
<div>
<br /></div>
<div>
We see R2 is now stratum 6, synchronized with its own virtual Orphan server.</div>
<div>
<br /></div>
<div>
<div>
R3(config)#do show ntp status</div>
<div>
Clock is synchronized, stratum 7, reference is 26.33.33.239</div>
</div>
<div>
<output omitted></div>
<div>
<br /></div>
<div>
R3 is synchronized with R2 as its Master. </div>
<div>
<br /></div>
<div>
<div>
R4#show ntp status</div>
<div>
Clock is synchronized, stratum 7, reference is 26.33.33.239</div>
</div>
<div>
<br /></div>
<div>
R4 is synchronized with R2 as its master.</div>
<div>
<br /></div>
<div>
<div>
R5#show ntp status</div>
<div>
Clock is synchronized, stratum 9, reference is 24.235.166.45</div>
</div>
<div>
<br /></div>
<div>
R5 is synchronized with R3 as its master.</div>
<div>
<br /></div>
<div>
The most important feature of this is fail-back, so let's re-activate R1.</div>
<div>
<br /></div>
<div>
<div>
R1(config)#int lo0</div>
<div>
R1(config-if)#no shut</div>
</div>
<div>
<br /></div>
<div>
R3 was first to recover:</div>
<div>
<div>
R3(config)#do show ntp association detail</div>
<div>
1::1 configured, ipv6, our_master, sane, valid, stratum 4</div>
</div>
<div>
<br /></div>
<div>
It automatically shut down its Orphan process when it synced to the superior stratum 4.</div>
<div>
<br /></div>
<div>
R5 then received the now-correct time from R3:</div>
<div>
<div>
R5#show ntp association detail</div>
<div>
3::3 configured, ipv6, our_master, sane, valid, stratum 5</div>
</div>
<div>
<br /></div>
<div>
Cheers,</div>
<div>
<br /></div>
<div>
Jeff Kronlage</div>
<div>
<br /></div>
<div>
<br /></div>
brbcciehttp://www.blogger.com/profile/14586635047530183862noreply@blogger.com0tag:blogger.com,1999:blog-5968686435283454526.post-72393938407133917312014-09-06T22:03:00.003-07:002015-09-23T06:08:16.985-07:00OSPF LFA & Remote LFAContinuing on the same track as my recent posts regarding EIGRP FRR and BGP PIC/Add-path, today I'm writing about OSPF LFA. OSPF FRR/LFA accomplishes the same concept as EIGRP FRR, but in a much more elegant and thorough fashion.<br />
<br />
As I did in my EIGRP article, I'm going to reference back to the BGP PIC article, as that has a lengthy explanation of why fast reroute is important. If you don't understand the use case, please read this article first:<br />
<br />
<a href="http://brbccie.blogspot.com/2014/08/bgp-pic-and-add-path.html">http://brbccie.blogspot.com/2014/08/bgp-pic-and-add-path.html</a><br />
<br />
Again building off former articles, the EIGRP method of LFA is dead simple: take the feasible successor and pre-install it in the FIB for faster convergence.<br />
<br />
<a href="http://brbccie.blogspot.com/2014/08/eigrp-enhancements.html">http://brbccie.blogspot.com/2014/08/eigrp-enhancements.html</a><br />
<br />
I genuinely like this approach, because it's very easy to understand. If you're savvy enough to engineer for feasible successors, you can literally just turn on this feature and it works.<br />
<br />
OSPF takes this idea to a whole new level. Obviously, OSPF does not have a concept of feasible successors, but it does have a huge advantage: because, in the same area, the OSPF database is identical among all routers, OSPF can run the SPF algorithm with a neighboring router as root. The advantage of this is being able to find a loop-free alternate path in complex topologies that would have failed the feasible successor check in EIGRP. When we look at Remote LFA, we can even tunnel to distant routers to form loop-free paths, all calculated via the router running FRR.<br />
<br />
Note - much like EIGRP, OSPF on IOS does not support per-link LFA, so we will only be examining per-prefix LFA. IOS-XR supports both per-prefix and per-link.<br />
<br />
<img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiSB9lfVWOW6IKoriJq5xGOcfdCbP6u8VuLbsrMBErrXrKdkuiTZ0v0Z-VjdlbzW-l83Ki94ner9ddA2XQVL4QzW8JtR_Ni01XTTRbkVV-I8lwS9-80mcl8RXom9N-czmKm6rH8CAUs0CM/s1600/diagram1.png" />
<br />
<br />
All links have an IP address of 192.168.YY.X, where YY is the lower router number followed by the higher router number, and X is the router number (i.e. on the link facing R4, R1's IP address is 192.168.14.1) . Each router has a loopback0 address of X.X.X.X, where X is the router number.<br />
<br />
Consider this diagram, with R1 attempting to reach R5 (5.5.5.5).<br />
<br />
R1(config)#router ospf 1<br />
R1(config-router)#<b>fast-reroute per-prefix enable area 0 prefix-priority low</b><br />
<div>
<br /></div>
The primary path is obvious: R1 -> R2 -> R5<br />
The backup path requires some thought...<br />
<br />
If this were EIGRP, neither path would be valid for LFA. They'd both fail the feasibility condition:<br />
R1->R3->R5 has an "advertised distance" of 10, which is greater than the "feasible distance" of 2. Likewise, R1->R4->R5 has an "advertised distance" of 10.<br />
<br />
However, OSPF, being link state, can actually calculate the SPF from each neighbor's perspective (R2, R3, and R4 here). Cisco calls this process "reverse SPF" -- RSPF. I'm not going to make this a large lesson on link state protocols, but let's quickly look at what R1 would discover about its neighbors:<br />
<br />
R2:<br />
This is already the primary path, so eliminate R2.<br />
R3:<br />
When attempting to reach R5, R3 will route back through R1. <b>This will loop</b>. Eliminate R3.<br />
R4:<br />
R4 reaches R5 via the link between R4 and R5. <b>Valid Backup Route</b>.<br />
<br />
I deliberately built the scenario this way to show how a higher-metric route could beat a lower metric for the backup route - of course, in our case, the lower metric would've looped.<br />
<br />
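The neighbor-rooted check above can be sketched in a few lines of Python. This uses the loop-free and downstream inequalities as written in RFC 5286, which may not be exactly how IOS computes it internally. The link costs are assumptions chosen only to be consistent with the metrics shown in this post (the diagram is the real source of truth), and the on-router loopback cost is left out.<br />

```python
import heapq


def shortest_costs(graph, root):
    """Dijkstra: lowest cost from root to every reachable router."""
    dist = {root: 0}
    heap = [(0, root)]
    while heap:
        d, node = heapq.heappop(heap)
        if d > dist.get(node, float("inf")):
            continue  # stale heap entry
        for nbr, cost in graph[node].items():
            nd = d + cost
            if nd < dist.get(nbr, float("inf")):
                dist[nbr] = nd
                heapq.heappush(heap, (nd, nbr))
    return dist


def classify(graph, source, dest, neighbor):
    """Return (loop_free, downstream) for one candidate neighbor."""
    from_source = shortest_costs(graph, source)
    from_neighbor = shortest_costs(graph, neighbor)
    # Loop-free: the neighbor's best path to dest does not come back via us.
    loop_free = from_neighbor[dest] < from_neighbor[source] + from_source[dest]
    # Downstream: the neighbor is strictly closer to dest than we are.
    downstream = from_neighbor[dest] < from_source[dest]
    return loop_free, downstream


# Assumed symmetric link costs (hypothetical, see the note above).
links = [("R1", "R2", 1), ("R2", "R5", 1), ("R1", "R3", 1),
         ("R3", "R5", 10), ("R1", "R4", 15), ("R4", "R5", 10)]
graph = {}
for a, b, cost in links:
    graph.setdefault(a, {})[b] = cost
    graph.setdefault(b, {})[a] = cost

for neighbor in ("R3", "R4"):
    loop_free, downstream = classify(graph, "R1", "R5", neighbor)
    print(neighbor, "loop-free:", loop_free, "downstream:", downstream)
```

With these assumed costs R3 fails the loop-free test (its best path to R5 is back through R1) while R4 passes it, even though R4 is the higher-metric choice. R4 does not pass the stricter downstream test, which matters for the Downstream tie-breaker covered later.<br />
<br />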
R1#sh ip route repair 5.5.5.5<br />
Routing entry for 5.5.5.5/32<br />
Known via "ospf 1", distance 110, metric 3, type intra area<br />
Last update from 192.168.12.2 on GigabitEthernet1.12, 02:29:11 ago<br />
Routing Descriptor Blocks:<br />
<b> * 192.168.12.2, from 5.5.5.5, 02:29:11 ago, via GigabitEthernet1.12</b><br />
Route metric is 3, traffic share count is 1<br />
Repair Path: 192.168.14.4, via GigabitEthernet1.14<br />
<b> [RPR]192.168.14.4, from 5.5.5.5, 02:29:11 ago, via GigabitEthernet1.14</b><br />
Route metric is 26, traffic share count is 1<br />
<div>
<br /></div>
<div>
<div>
R1#sh ip cef 5.5.5.5</div>
<div>
5.5.5.5/32</div>
<div>
nexthop 192.168.12.2 GigabitEthernet1.12</div>
<div>
<b> repair: attached-nexthop 192.168.14.4 GigabitEthernet1.14</b></div>
</div>
<div>
<br /></div>
<div>
As with EIGRP, there are "tie-breakers" if you have multiple options for backup path. With OSPF, you can get a lot more granular than EIGRP. I still hate the term "tie-breakers"; as I explained in my EIGRP blog, I think "2nd bestpath decision maker" explains it better.</div>
<div>
<br /></div>
<div>
The tie-breakers are as follows, with their respective default priorities:</div>
<div>
<br /></div>
<div>
<div>
- SRLG 10 </div>
<div>
- Primary Path 20</div>
<div>
- Interface Disjoint 30</div>
<div>
- Lowest-Metric 40</div>
<div>
- Linecard-disjoint 50</div>
<div>
- Node protecting 60</div>
<div>
- Broadcast interface disjoint 70 </div>
<div>
- Load Sharing 256 </div>
<div>
<br /></div>
<div>
These tie-breakers are off by default:</div>
<div>
- Downstream </div>
<div>
- Secondary-Path</div>
<div>
<br /></div>
<div>
The syntax to change the priorities - or turn on downstream or secondary-path - is as follows:</div>
<div>
<br /></div>
<div>
router ospf 1</div>
<div>
fast-reroute per-prefix tie-break interface-disjoint required index 5</div>
<div>
<br /></div>
<div>
If you use the <b>fast-reroute per-prefix tie-break</b> command <i>at all</i>, it disables all the other tie-breakers. So for example, if you wanted SRLG to be the 2nd tie breaker, you would have to turn it back on after the interface-disjoint command:</div>
<div>
<br /></div>
<div>
<div>
router ospf 1</div>
<div>
fast-reroute per-prefix tie-break interface-disjoint required index 5</div>
</div>
<div>
<b> fast-reroute per-prefix tie-break srlg index 10</b></div>
<div>
<br /></div>
<div>
You may have also noticed the <b>required </b>keyword. This means that if that tie-breaker doesn't match/pass, then disallow that path completely.</div>
</div>
<div>
<br /></div>
<div>
My original plan was to show a scenario for every tie-breaker, but after it taking me two days to build a topology that showed each possible technique, I decided to just go with a written explanation on each tie-breaker and then give one semi-complex tie-breaker topology with a few examples.</div>
<div>
<br /></div>
<div>
<div>
- <b>SRLG</b></div>
<div>
SRLG - Shared Risk Link Group - is a manual setting, optionally assigned per-interface, with the intent of identifying "shared risk" elements that the router can't detect on its own. For example, if two of your Ethernet links shared a downstream switch, you might put those two in the same SRLG.</div>
<div>
<br /></div>
<div>
Usage:</div>
<div>
<div>
R1(config)#int gig1</div>
<div>
R1(config-if)#srlg gid 1</div>
<div>
R1(config-if)#int gig2</div>
<div>
R1(config-if)#srlg gid 1</div>
<div>
R1(config-if)#int gig3</div>
<div>
R1(config-if)#srlg gid 2</div>
</div>
<div>
<br /></div>
<div>
- <b>Primary Path</b></div>
<div>
Primary Path prefers a backup path that's part of equal-cost multipath (ECMP). This is the antithesis of Secondary Path, which we'll cover below.</div>
<div>
<br /></div>
<div>
- <b>Interface Disjoint</b></div>
<div>
This is fairly obvious, prefer a backup next-hop that exits through a different interface. Note, Ethernet sub-interfaces are considered different interfaces.</div>
<div>
<br /></div>
<div>
- <b>Lowest-Metric</b></div>
<div>
Prefer the path with the lowest metric (note, this command doesn't offer a "required" keyword)</div>
<div>
<br /></div>
<div>
- <b>Linecard-disjoint</b></div>
<div>
Prefer a path that exits through a different linecard than the primary path (I have no way of labbing this as I'm using a CSR1K)</div>
<div>
<br /></div>
<div>
- <b>Node protecting</b></div>
<div>
Prefer a path that doesn't pass through the same next-hop router as the primary path. Note this means <i>any interface</i> on the same next-hop router. So if R2 is the next-hop of your primary path via 192.168.12.2, and your backup path goes through (either directly or indirectly, later in the path) 192.168.25.2 on R2, node protecting will depref that path - or with the <b>required</b> keyword, would prevent it from being used completely.</div>
<div>
<br /></div>
<div>
- <b>Broadcast interface disjoint</b></div>
<div>
Broadcast interface disjoint deprefs backup routes that pass through the same broadcast area as the primary path. The thought here is if the layer 2 device (presumably a switch) connecting the interfaces together fails, we might lose the backup path too.</div>
<div>
<br /></div>
<div>
- <b>Load Sharing </b></div>
<div>
I haven't labbed this, but my understanding is this is basically a worst-case scenario. If you have two or more paths that can't be differentiated by all of the above tie-breakers, share the backup paths amongst any applicable prefixes.</div>
<div>
<br /></div>
<div>
- <b>Downstream</b> (off by default)</div>
<div>
This is very similar to the EIGRP feasibility condition - ensure that the metric, from the neighbor's RSPF perspective, is smaller than the total metric of our primary path from the calculating router's perspective. Using the original example above, the backup path we picked would <b>not</b> meet the criteria for this tie-breaker. It's important to reinforce that this is not a default option: OSPF does not need this EIGRP-feasibility-like requirement, because as a link state protocol it has the entire topology at hand and can calculate non-looping paths without relying on a metric comparison. </div>
<div>
<div>
<br /></div>
</div>
<div>
- <b>Secondary-Path </b>(off by default)</div>
</div>
<div>
This is the antithesis of the Primary-Path tie-breaker above. This instructs the process to prefer a backup path that is not part of multipathing (ECMP). The idea here is that all your multipaths may be required for your traffic flows - for example, if you are equal-cost multipathing across two 1-gig links, but consistently have 1.2 Gbps of data crossing them, it would not be desirable to just run everything over the other link in the ECMP if one failed. Secondary-Path prefers a path <i>not in the ECMP</i> for the backup. </div>
<div>
<br />
I'm going to run a couple of examples of tie-breaking, but in order to do that, I needed more paths in the topology. Pay close attention, I have shifted the OSPF costs from the prior topology:<br />
<br /></div>
<img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj4cFHp9p5u6MhUu35R3FNiBy-nKHA14Bjpj0SzyI14vZM2yK4VySXtbkKXBvr8dBQ_t2NNMupZGpOMD7LxYnRYnSperYA-nRFeI4lFvleilUntUpgskHfKhDHzcjiDYnPA0WkQLZ4zD_A/s1600/diagram2.png" />
<br />
<div>
<br />
* Please note costs listed below do not include the on-router cost to the loopback for clarity*<br />
If you look at metric alone, the paths from R1->R5 look most desirable in this order:<br />
R1 -> R3 -> R5 (Cost 2)</div>
<div>
R1 -> R6 -> R3 -> R5 (Cost 4)<br />
R1 -> R2 -> R5 (Cost 11)<br />
R1 -> R4 -> R5 (Cost 25)<br />
<br />
Clearly R3 is the winning primary path.<br />
<br />
Let's go down the decision-making process for the backup path:<br />
<br />
<div>
- SRLG 10 - Not applicable, we're not using SRLG (yet)</div>
<div>
- Primary Path 20 - Not applicable, we have no ECMP.</div>
<div>
- Interface Disjoint 30 - Applicable, but all are on separate interfaces already.<br />
- Lowest-Metric 40 - Applicable, choose R6 as backup. Do not proceed further, as all paths have different costs.</div>
<div style="-webkit-text-stroke-width: 0px; color: black; font-family: 'Times New Roman'; font-size: medium; font-style: normal; font-variant: normal; font-weight: normal; letter-spacing: normal; line-height: normal; orphans: auto; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; widows: auto; word-spacing: 0px;">
<div style="margin: 0px;">
<br /></div>
</div>
So without any modification, our primary next-hop router will be R3, and backup next-hop router will be R6:<br />
<br />
R1#sh ip route repair 5.5.5.5<br />
Routing entry for 5.5.5.5/32<br />
Known via "ospf 1", distance 110, metric 3, type intra area<br />
Last update from 192.168.13.3 on GigabitEthernet1.13, 00:14:19 ago<br />
Routing Descriptor Blocks:<br />
<b> * 192.168.13.3, from 5.5.5.5, 00:14:19 ago, via GigabitEthernet1.13</b><br />
Route metric is 3, traffic share count is 1<br />
Repair Path: 192.168.16.6, via GigabitEthernet1.16<br />
<b> [RPR]192.168.16.6, from 5.5.5.5, 00:14:19 ago, via GigabitEthernet1.16</b><br />
Route metric is 6, traffic share count is 1<br />
<div>
<br /></div>
<div>
There's an obvious flaw in that plan however, they both rely on R3 being online. </div>
<div>
<br /></div>
<div>
<div>
R1(config)#router ospf 1</div>
</div>
<div>
<div>
R1(config-router)#fast-reroute per-prefix tie-break lowest-metric index 10</div>
<div>
R1(config-router)#fast-reroute per-prefix tie-break node-protecting required index 20</div>
</div>
<div>
<br /></div>
<div>
<div>
R1(config-router)#do sh ip route repair 5.5.5.5</div>
<div>
Routing entry for 5.5.5.5/32</div>
<div>
Known via "ospf 1", distance 110, metric 3, type intra area</div>
<div>
Last update from 192.168.13.3 on GigabitEthernet1.13, 00:01:09 ago</div>
<div>
Routing Descriptor Blocks:</div>
<div>
* 192.168.13.3, from 5.5.5.5, 00:01:09 ago, via GigabitEthernet1.13</div>
<div>
Route metric is 3, traffic share count is 1</div>
<div>
Repair Path: 192.168.14.4, via GigabitEthernet1.14</div>
<div>
<b> [RPR]192.168.14.4, from 5.5.5.5, 00:01:09 ago, via GigabitEthernet1.14</b></div>
<div>
Route metric is 26, traffic share count is 1</div>
</div>
<div>
<br /></div>
<div>
Now the process has chosen the backup through R4, which eliminates R3 as a single point of failure.</div>
<div>
<br /></div>
<div>
Let's pretend that gig1.13, gig1.14, and gig1.16 all cross the same L2 switch somewhere in their path. We want to protect against that too:</div>
<div>
<br /></div>
<div>
<div>
R1(config)#router ospf 1</div>
<div>
<div>
R1(config-router)#fast-reroute per-prefix tie-break lowest-metric index 10</div>
<div>
R1(config-router)#fast-reroute per-prefix tie-break node-protecting required index 20</div>
</div>
</div>
<div>
<div>
R1(config-router)#<b>fast-reroute per-prefix tie-break srlg required index 30</b></div>
</div>
<div>
<br /></div>
<div>
<div>
R1(config-router)#int gig1.13</div>
<div>
R1(config-subif)#srlg gid 1</div>
<div>
R1(config-subif)#int gig1.14</div>
<div>
R1(config-subif)#srlg gid 1</div>
<div>
R1(config-subif)#int gig1.16</div>
<div>
R1(config-subif)#srlg gid 1</div>
<div>
R1(config-subif)#int gig1.12</div>
<div>
R1(config-subif)#srlg gid 2</div>
</div>
<div>
<br /></div>
<div>
<div>
R1(config-subif)#do sh ip route repair 5.5.5.5</div>
<div>
Routing entry for 5.5.5.5/32</div>
<div>
Known via "ospf 1", distance 110, metric 3, type intra area</div>
<div>
Last update from 192.168.13.3 on GigabitEthernet1.13, 00:18:34 ago</div>
<div>
Routing Descriptor Blocks:</div>
<div>
* 192.168.13.3, from 5.5.5.5, 00:18:34 ago, via GigabitEthernet1.13</div>
<div>
Route metric is 3, traffic share count is 1</div>
</div>
<div>
<br /></div>
<div>
Uh-oh, no backup route. We were hoping for R1->R2->R5...</div>
<div>
<br /></div>
<div>
<div>
R2#sh ip cef 5.5.5.5</div>
<div>
5.5.5.5/32</div>
<div>
nexthop 192.168.12.1 GigabitEthernet1.12</div>
</div>
<div>
<br /></div>
<div>
That's because R2 routes back through R1 - R1 would've run the RSPF with R2 as the root and disregarded the route.</div>
<div>
<br /></div>
<div>
We have two options at this point:</div>
<div>
- Remove the <b>required</b> keyword from the SRLG and fall back to the prior answer</div>
<div>
- Tinker with the metrics to make R2 a viable path.</div>
<div>
<br /></div>
<div>
<div>
R1(config)#int gig1.12</div>
<div>
R1(config-subif)#ip ospf cost 10</div>
</div>
<div>
<br /></div>
<div>
<div>
R2(config)#int gig1.12</div>
<div>
R2(config-subif)#ip ospf cost 10</div>
<div>
<br /></div>
</div>
<div>
<div>
R1(config-subif)#do sh ip route repair 5.5.5.5</div>
<div>
Routing entry for 5.5.5.5/32</div>
<div>
Known via "ospf 1", distance 110, metric 3, type intra area</div>
<div>
Last update from 192.168.13.3 on GigabitEthernet1.13, 00:00:52 ago</div>
<div>
Routing Descriptor Blocks:</div>
<div>
* 192.168.13.3, from 5.5.5.5, 00:00:52 ago, via GigabitEthernet1.13</div>
<div>
Route metric is 3, traffic share count is 1</div>
<div>
Repair Path: 192.168.12.2, via GigabitEthernet1.12</div>
<div>
[RPR]192.168.12.2, from 5.5.5.5, 00:00:52 ago, via GigabitEthernet1.12</div>
<div>
Route metric is 21, traffic share count is 1</div>
</div>
<div>
<br /></div>
<div>
Now we have a backup via R2.</div>
<div>
<br /></div>
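To tie the three outcomes above together, here is a toy Python model of the selection. It is only my guess at behavior that matches the lab output, not Cisco's documented algorithm: I assume <b>required</b> tie-breakers act as hard filters first, and the remaining tie-breakers then narrow the candidates in index order. The class names and attribute labels are invented for the sketch.<br />

```python
from dataclasses import dataclass, field


@dataclass
class Candidate:
    next_hop: str
    metric: int
    passes: set = field(default_factory=set)  # tie-breakers this path satisfies


@dataclass
class TieBreak:
    name: str
    index: int
    required: bool = False


def choose_backup(candidates, rules):
    remaining = list(candidates)
    # Assumption 1: a "required" rule removes every path that fails it.
    for rule in rules:
        if rule.required:
            remaining = [c for c in remaining if rule.name in c.passes]
    # Assumption 2: rules are then walked by ascending index, each one
    # narrowing the set to the paths it prefers (if any).
    for rule in sorted(rules, key=lambda r: r.index):
        if len(remaining) <= 1:
            break
        if rule.name == "lowest-metric":
            best = min(c.metric for c in remaining)
            preferred = [c for c in remaining if c.metric == best]
        else:
            preferred = [c for c in remaining if rule.name in c.passes]
        if preferred:
            remaining = preferred
    return remaining[0].next_hop if remaining else None


# Loop-free candidates from R1's point of view, using the metrics shown above.
# "srlg" in passes means the path does NOT share an SRLG with the primary.
r6 = Candidate("R6", 6)
r4 = Candidate("R4", 26, {"node-protecting"})
r2 = Candidate("R2", 21, {"node-protecting", "srlg"})

metric_only = [TieBreak("lowest-metric", 40)]
node_prot = [TieBreak("lowest-metric", 10),
             TieBreak("node-protecting", 20, required=True)]
with_srlg = node_prot + [TieBreak("srlg", 30, required=True)]

print(choose_backup([r6, r4], metric_only))    # defaults: metric decides
print(choose_backup([r6, r4], node_prot))      # R6 depends on R3, so drop it
print(choose_backup([r6, r4], with_srlg))      # everything shares SRLG 1
print(choose_backup([r6, r4, r2], with_srlg))  # after R2 becomes loop-free
```

The interesting part is the second case: lowest-metric has the better index, yet R4 still wins, which is why I modeled <b>required</b> as a filter rather than as just another step in the ordering.<br />
<br />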
<div>
Before we move on to remote LFA, let's cover some smaller topics.</div>
<div>
<br /></div>
<div>
There were two pieces to the initial command that I did not explain:</div>
<div>
fast-reroute per-prefix enable area 0 prefix-priority low</div>
<div>
<br /></div>
<div>
<b>enable area 0</b> may seem obvious - we want backup paths for area 0. Note, you can only specify areas the router is directly connected to, so if, for example, you wanted backup paths in areas 0, 1, and 2, your router would have to be an ABR for areas 1 and 2. This is true of both direct LFA and remote LFA.</div>
<div>
<br /></div>
<div>
But there's another issue with specifying areas:</div>
<div>
<br /></div>
<div>
<div>
R5(config)#int lo1</div>
<div>
R5(config-if)#ip address 55.55.55.55 255.255.255.255</div>
<div>
R5(config-if)#exit</div>
<div>
R5(config)#route-map lo1-extern</div>
<div>
R5(config-route-map)#match interface lo1</div>
<div>
R5(config-route-map)#exit</div>
<div>
R5(config)#router ospf 1</div>
<div>
R5(config-router)#redistribute connected route-map lo1-extern</div>
</div>
<div>
<br /></div>
<div>
<div>
R1(config)#do sh ip route repair 55.55.55.55</div>
<div>
Routing entry for 55.55.55.55/32</div>
<div>
Known via "ospf 1", distance 110, metric 20, type extern 2, forward metric 2</div>
<div>
Last update from 192.168.13.3 on GigabitEthernet1.13, 00:01:27 ago</div>
<div>
Routing Descriptor Blocks:</div>
<div>
* 192.168.13.3, from 5.5.5.5, 00:01:27 ago, via GigabitEthernet1.13</div>
<div>
Route metric is 20, traffic share count is 1</div>
</div>
<div>
<br /></div>
<div>
No repair route for 55.55.55.55 - and we won't get one, because an external route is in no area. We have to change our initial configuration to fix this:</div>
<div>
<br /></div>
<div>
<div>
R1(config-router)#no fast-reroute per-prefix enable area 0 prefix-priority low</div>
<div>
R1(config-router)#fast-reroute per-prefix enable prefix-priority low</div>
</div>
<div>
<br /></div>
<div>
A lack of an area implies all areas this router is connected to - including external routes.</div>
<div>
<br /></div>
<div>
<div>
R1(config-router)#do sh ip route repair 55.55.55.55</div>
<div>
Routing entry for 55.55.55.55/32</div>
<div>
Known via "ospf 1", distance 110, metric 20, type extern 2, forward metric 2</div>
<div>
Last update from 192.168.13.3 on GigabitEthernet1.13, 00:01:42 ago</div>
<div>
Routing Descriptor Blocks:</div>
<div>
* 192.168.13.3, from 5.5.5.5, 00:01:42 ago, via GigabitEthernet1.13</div>
<div>
Route metric is 20, traffic share count is 1</div>
<div>
Repair Path: 192.168.12.2, via GigabitEthernet1.12</div>
<div>
[RPR]192.168.12.2, from 5.5.5.5, 00:01:42 ago, via GigabitEthernet1.12</div>
<div>
Route metric is 20, traffic share count is 1</div>
</div>
<div>
<br /></div>
<div>
What's the story on <b>prefix-priority low</b>?</div>
<div>
<br /></div>
<div>
IOS prioritizes convergence events by default by prefix length. If SPF has to be calculated for thousands of routes, it's assumed by default that /32s (typical for iBGP next-hops) are "high priority". You can define what routes are priority to OSPF with:</div>
<div>
<br /></div>
R1(config-router)#prefix-priority high route-map <your route map><br />
<br />
There are only two tiers, high and low. High means (by default, unless a route map is used) calculate backup routes only for /32s; low means calculate backup routes for all routes.<br />
<br />
So you're debugging and trying to figure out why one path was chosen over another. IOS has a <u>fantastic</u> output system for this:<br />
<br />
R1(config-router)#fast-reroute keep-all-paths<br />
<div>
<br /></div>
<div>
This is basically a debugging command, and tells OSPF to keep the output from all the RSPFs it ran to calculate the backup path - including the ones it <i>didn't choose as best.</i></div>
<div>
<i><br /></i></div>
<div>
<b>show ip ospf rib</b> is our 2nd magic command:</div>
<div>
<i><br /></i></div>
<div>
<div>
R1(config-router)#do sh ip ospf rib 5.5.5.5</div>
<div>
<br /></div>
<div>
OSPF Router with ID (1.1.1.1) (Process ID 1)</div>
<div>
<br /></div>
<div>
<br /></div>
<div>
Base Topology (MTID 0)</div>
<div>
<br /></div>
<div>
OSPF local RIB</div>
<div>
Codes: * - Best, > - Installed in global RIB</div>
<div>
LSA: type/LSID/originator</div>
<div>
<br /></div>
<div>
*> 5.5.5.5/32, Intra, cost 3, area 0</div>
<div>
SPF Instance 62, age 00:13:50</div>
<div>
Flags: RIB, HiPrio</div>
<div>
via 192.168.13.3, GigabitEthernet1.13</div>
<div>
Flags: RIB</div>
<div>
LSA: 1/5.5.5.5/5.5.5.5</div>
<div>
repair path via 192.168.12.2, GigabitEthernet1.12, cost 21</div>
<div>
Flags: RIB, Repair, IntfDj, BcastDj, NodeProt</div>
<div>
LSA: 1/5.5.5.5/5.5.5.5</div>
<div>
repair path via 192.168.16.6, GigabitEthernet1.16, cost 6</div>
<div>
Flags: Ignore, Repair, IntfDj, BcastDj, SRLG</div>
<div>
LSA: 1/5.5.5.5/5.5.5.5</div>
<div>
repair path via 192.168.14.4, GigabitEthernet1.14, cost 26</div>
<div>
Flags: Ignore, Repair, IntfDj, BcastDj, SRLG, NodeProt</div>
<div>
LSA: 1/5.5.5.5/5.5.5.5</div>
<div style="font-style: italic;">
<br /></div>
</div>
Look at all that fantastic output - it lists the parameters per route so you can determine why the repair path was chosen. Let's break one of these down:<br />
<br />
<div>
repair path via 192.168.12.2, GigabitEthernet1.12, cost 21</div>
<div>
Flags: RIB, Repair, IntfDj, BcastDj, NodeProt</div>
<div>
LSA: 1/5.5.5.5/5.5.5.5</div>
<br />
This is our current best backup path - "RIB" means it's installed, "Repair" means it's a backup path - so "RIB" + "Repair" means it's the installed backup path. IntfDj means it's on a separate interface from the primary path, BcastDj means it's not sharing a broadcast interface with the primary path, and NodeProt means the path does not include shared hops with the primary path.<br />
<br />
Microloops can add complexity with fast-reroute. A microloop is what happens when one router converges significantly faster than a neighbor. Let's say two adjacent routers both receive new LSAs simultaneously. One router is high-performance, another is older. The high-performance router calculates the change and updates the FIB several seconds before the older router. Now we could end up with a scenario where the newer router starts forwarding traffic through the older router, but the older router's FIB hasn't updated yet, and it's forwarding through the faster router for that same prefix. For a couple of seconds, the two routers loop.<br />
<br />
I'm not going to go into detail on this as it's a fringe topic, but here's the starting point for using this:<br />
R1(config-router)#microloop avoidance ?<br />
disable Microloop avoidance auto-enable prohibited<br />
protected Microloop avoidance for protected prefixes only<br />
rib-update-delay Delay before updating the RIB<br />
<div>
<br /></div>
<div>
In short, it allows you to deliberately slow down updating the FIB on the faster router for prefixes that are high-risk for this type of reconvergence.</div>
<br />
If you don't want an interface being considered for fast-reroute:<br />
<br />
R1(config-router)#int gig1.12<br />
R1(config-subif)#ip ospf fast-reroute per-prefix candidate disable<br />
<div>
<br /></div>
<div>
And if you need a quick summary of what percentage of routes are and aren't protected:</div>
<div>
<div>
<br /></div>
<div>
<div>
R1#sh ip ospf fast-reroute prefix-summary</div>
<div>
<br /></div>
<div>
OSPF Router with ID (1.1.1.1) (Process ID 1)</div>
<div>
Base Topology (MTID 0)</div>
<div>
<br /></div>
<div>
Area 0:</div>
<div>
<br /></div>
<div>
Interface Protected Primary paths Protected paths Percent protected</div>
<div>
All High Low All High Low All High Low</div>
<div>
Lo0 Yes 0 0 0 0 0 0 0% 0% 0%</div>
<div>
Gi1.16 Yes 1 1 0 0 0 0 0% 0% 0%</div>
<div>
Gi1.14 Yes 0 0 0 0 0 0 0% 0% 0%</div>
<div>
Gi1.13 Yes 7 3 4 4 2 2 57% 66% 50%</div>
<div>
Gi1.12 Yes 1 1 0 0 0 0 0% 0% 0%</div>
<div>
<br /></div>
<div>
Area total: 9 5 4 4 2 2 44% 40% 50%</div>
<div>
<br /></div>
<div>
Process total: 9 5 4 4 2 2 44% 40% 50%</div>
</div>
</div>
</div>
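<div>
The percentages in that table are just protected paths divided by primary paths. Here's a minimal Python sketch that reproduces them from two of the rows above. The truncating integer division is my assumption (it's what makes 2 of 3 show up as 66%), and the function names are made up for illustration.</div>

```python
# Rows copied from the prefix-summary output above:
# name -> ((primary All, High, Low), (protected All, High, Low))
ROWS = {
    "Gi1.13": ((7, 3, 4), (4, 2, 2)),
    "Area total": ((9, 5, 4), (4, 2, 2)),
}


def percent_protected(primary, protected):
    """Integer percentage, truncated; 0 when there are no primary paths."""
    if primary == 0:
        return 0
    return 100 * protected // primary


def summarize(rows):
    """Return name -> (All %, High %, Low %)."""
    return {
        name: tuple(percent_protected(p, q) for p, q in zip(prim, prot))
        for name, (prim, prot) in rows.items()
    }


if __name__ == "__main__":
    for name, pcts in summarize(ROWS).items():
        print(name, pcts)
```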
<div>
<br /></div>
<div>
That's a wrap for direct LFA. Now we'll look at remote LFA.</div>
<div>
<br /></div>
<img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiexNU3XeTKW3Z5OBr-RfhvXEGSduA741i5zRwiSwDeAZZsfbwxA3cEOJ6RO47T7siqMnI437NxrKTTeqNA5n7Het0ZAXluqlvRE_uwJMgDLiRTJHoxjcCNRnTkWCJJbJ61A8u5VyIoYVQ/s1600/diagram3.png" />
<br />
<br />
This is a simplistic topology but it has a huge problem for direct LFA.<br />
Let's protect the path from R1 to R4.<br />
<br />
We have two paths:<br />
R1 -> R4 (cost 1)<br />
R1 -> R2 -> R3 -> R4 (cost 12)<br />
<br />
Obviously R1 -> R4 is the primary path.<br />
What does R2 see as its possible paths to R4?<br />
R2 -> R1 -> R4 (Cost 2)<br />
R2 -> R3 -> R4 (Cost 11)<br />
<br />
R2 will always send traffic back to R1 when heading towards R4.<br />
<br />
What about R3?<br />
R3 -> R4 (Cost 6)<br />
R3 -> R2 -> R1 -> R4 (Cost 7)<br />
<br />
<div>
R3 would work for a backup path... if only we could get to R3 without R2 knowing what we're up to.<br />
<br />
Enter Remote LFA.<br />
<br />
R1(config-router)#int gig1.14<br />
R1(config-subif)#mpls ip<br />
R1(config-subif)#int gig1.12<br />
R1(config-subif)#mpls ip<br />
<div>
R1(config-subif)#mpls ldp discovery targeted-hello accept</div>
</div>
<div>
<br /></div>
<div>
<div>
R2(config-subif)#int gig1.12</div>
<div>
R2(config-subif)#mpls ip</div>
<div>
R2(config-subif)#int gig1.23</div>
<div>
<div>
R2(config-subif)#mpls ip</div>
</div>
<div>
R2(config-subif)#mpls ldp discovery targeted-hello accept</div>
</div>
<div>
<br /></div>
<div>
R3(config-subif)#int gig1.23<br />
R3(config-subif)#mpls ip<br />
R3(config-subif)#int gig1.34<br />
R3(config-subif)#mpls ip<br />
R3(config-subif)#mpls ldp discovery targeted-hello accept<br />
<div>
<br /></div>
<div>
<div>
R4(config-subif)#int gig1.14</div>
<div>
R4(config-subif)#mpls ip</div>
<div>
R4(config-subif)#int gig1.34</div>
<div>
R4(config-subif)#mpls ip</div>
<div>
R4(config-subif)#mpls ldp discovery targeted-hello accept</div>
</div>
<div>
<br /></div>
R1(config-router)#router ospf 1<br />
R1(config-router)#fast-reroute per-prefix remote-lfa tunnel mpls-ldp</div>
<div>
<br /></div>
<div>
There's a complex algorithm that makes this work, but it's somewhat irrelevant from a CCIE v5 perspective. </div>
<div>
<br /></div>
<div>
Here's what you really need to know:</div>
<div>
- Direct LFA had to have failed to turn up a path already (direct is always tried first)</div>
<div>
- A tunnel is built over targeted LDP.</div>
<div>
- The destination tunnel router is picked on the following criteria:</div>
<div>
<div>
- It must be in the same area as the router running LFA</div>
</div>
<div>
- The tunnel endpoint is picked from among the group of routers that can be reached through a next-hop other than the one you're trying to protect.</div>
<div>
- Of that group of routers, it's narrowed down to the subset that can reach your repair prefix without passing through the protecting router.</div>
<div>
- Those that qualify are called the PQ space (refer to the RFC for a lot more detail, but it may be overkill for a CCIE candidate) </div>
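<div>
The selection criteria above can be sketched in a few lines of Python against the four-router topology. This is a simplified illustration, not what IOS actually runs: the link costs are inferred from the path costs listed earlier (R1-R4 = 1, R1-R2 = 1, R2-R3 = 5, R3-R4 = 6) and assumed symmetric, the function names are hypothetical, and the P/Q checks use plain distance inequalities that assume the protected link is itself the shortest path between its two ends.</div>

```python
import heapq

# Link costs inferred from the path costs in the text (assumed symmetric).
LINKS = {("R1", "R4"): 1, ("R1", "R2"): 1, ("R2", "R3"): 5, ("R3", "R4"): 6}


def build_graph(links):
    graph = {}
    for (a, b), cost in links.items():
        graph.setdefault(a, {})[b] = cost
        graph.setdefault(b, {})[a] = cost
    return graph


def dijkstra(graph, source):
    """Shortest distance from source to every router."""
    dist = {source: 0}
    heap = [(0, source)]
    while heap:
        d, node = heapq.heappop(heap)
        if d > dist.get(node, float("inf")):
            continue
        for nbr, cost in graph[node].items():
            nd = d + cost
            if nd < dist.get(nbr, float("inf")):
                dist[nbr] = nd
                heapq.heappush(heap, (nd, nbr))
    return dist


def direct_lfas(graph, s, dest, primary_nh):
    """Neighbors N of s where dist(N, dest) is less than dist(N, s) + dist(s, dest)."""
    ds = dijkstra(graph, s)
    found = []
    for n in graph[s]:
        if n == primary_nh:
            continue
        dn = dijkstra(graph, n)
        if dn[dest] < dn[s] + ds[dest]:
            found.append(n)
    return found


def pq_space(graph, s, e):
    """Candidate tunnel endpoints when protecting the link s-e."""
    ds, de = dijkstra(graph, s), dijkstra(graph, e)
    pq = []
    for x in graph:
        if x in (s, e):
            continue
        dx = dijkstra(graph, x)
        in_p = ds[x] < ds[e] + de[x]  # s reaches x without crossing s-e
        in_q = dx[e] < dx[s] + ds[e]  # x reaches e without crossing s-e
        if in_p and in_q:
            pq.append(x)
    return sorted(pq)


if __name__ == "__main__":
    g = build_graph(LINKS)
    print("direct LFAs:", direct_lfas(g, "R1", "R4", "R4"))
    print("PQ space:", pq_space(g, "R1", "R4"))
```

<div>
With these costs R2 fails the direct LFA check (it would send the traffic straight back to R1), and R3 is the only router in the PQ space, which lines up with the 3.3.3.3 repair path in the output below.</div>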
<div>
<br /></div>
<div>
<div>
R1#sh ip route repair 4.4.4.4</div>
<div>
Routing entry for 4.4.4.4/32</div>
<div>
Known via "ospf 1", distance 110, metric 2, type intra area</div>
<div>
Last update from 192.168.14.4 on GigabitEthernet1.14, 00:29:36 ago</div>
<div>
Routing Descriptor Blocks:</div>
<div>
* 192.168.14.4, from 4.4.4.4, 00:29:36 ago, via GigabitEthernet1.14</div>
<div>
Route metric is 2, traffic share count is 1</div>
<div>
Repair Path: 3.3.3.3, via MPLS-Remote-Lfa1</div>
<div>
<b> [RPR]3.3.3.3, from 4.4.4.4, 00:29:36 ago, via MPLS-Remote-Lfa1</b></div>
<div>
Route metric is 12, traffic share count is 1</div>
</div>
<div>
<br /></div>
<div>
<div>
R1#sh ip int br | i MPLS</div>
<div>
MPLS-Remote-Lfa1 192.168.12.1 YES unset up up</div>
</div>
<div>
<br /></div>
<div>
This whole process is reasonably automatic, just make sure your LDP is in good shape and targeted LDP is enabled and you're good to go.</div>
<div>
<br /></div>
<div>
<div>
You can optionally specify areas and maximum costs:</div>
<div>
<br /></div>
<div>
<div>
R1(config-router)#fast-reroute per-prefix remote-lfa area 0 maximum-cost 10</div>
</div>
<div>
<br /></div>
</div>
<div>
The areas work the same way they did with direct LFA - we're just saying we only want to protect area 0, 1, 2, 3, etc. For remote LFA, the router you're running LFA on has to be in the area you're trying to protect - you can't protect area 5 if you're only an ABR for areas 0 and 1.</div>
<div>
<br /></div>
<div>
The maximum cost option restricts which prefixes you should be building tunnels <i>for</i>. In other words, it has nothing to do with the metric to reach the tunnel endpoint - it has to do with the prefix you're trying to protect.</div>
<div>
<br /></div>
<div>
Hope you enjoyed!</div>
<div>
<br /></div>
<div>
Jeff</div>
<div>
<br /></div>
brbcciehttp://www.blogger.com/profile/14586635047530183862noreply@blogger.com17tag:blogger.com,1999:blog-5968686435283454526.post-67495258215195510662014-08-24T12:42:00.001-07:002014-08-24T12:42:03.277-07:00EIGRP EnhancementsCisco did a major overhaul of EIGRP in recent IOS. These can be loosely looked at as new features in "EIGRP Named Mode". In reality, I suspect that the EIGRP teams were working on a series of new features, and they opted to renovate the interface at the same time, hence creating named mode.<br />
<div>
<br /></div>
<div>
We'll start with the new interface and then delve into all the new features one at a time.</div>
<div>
<br /></div>
<div>
Named EIGRP mode replaces the traditional EIGRP configuration syntax we're familiar with, and puts all the various commands into one configuration section.<br />
<br />
The major distinguishing factor is the router process has a <i>name</i> instead of a <i>number</i>.<br />
<br />
Old method:<br />
router eigrp 100<br />
network 192.168.0.0 0.0.255.255<br />
<br />
New equivalent method:<br />
router eigrp SOMENAME<br />
!<br />
address-family ipv4 unicast autonomous-system 100<br />
!<br />
topology base<br />
exit-af-topology<br />
network 192.168.0.0 0.0.255.255<br />
exit-address-family<br />
<div>
<br /></div>
<div>
The name is completely arbitrary and is a local value. </div>
<div>
<br /></div>
<div>
Interface settings that were previously configured on the interface, such as hello interval, authentication, etc, are now configured as part of the EIGRP named process:</div>
<div>
<br /></div>
<div>
<div>
router eigrp SOMENAME</div>
<div>
!</div>
<div>
address-family ipv4 unicast autonomous-system 100</div>
<div>
!</div>
<div>
af-interface GigabitEthernet1</div>
<div>
authentication mode md5</div>
<div>
authentication key-chain FOO</div>
<div>
<div>
hello-interval 10</div>
</div>
<div>
no split-horizon</div>
<div>
exit-af-interface</div>
<div>
!</div>
<div>
topology base</div>
<div>
exit-af-topology</div>
<div>
network 192.168.0.0 0.0.255.255</div>
<div>
exit-address-family</div>
</div>
<div>
<br /></div>
</div>
<div>
A traditional EIGRP process can be upgraded to named mode <i>on newer IOS</i> with this command:</div>
<div>
<div>
<br /></div>
<div>
Router(config)#router eigrp 101</div>
<div>
Router(config-router)#eigrp upgrade-cli SOMENAME</div>
</div>
<div>
<br /></div>
<div>
The process also doesn't interrupt traffic flow.</div>
<div>
<br /></div>
<div>
That's the guts of the configuration reformatting, let's move on to features.</div>
<div>
<br /></div>
<div>
<b><u>Wide Metrics</u></b></div>
<div>
First and foremost, the metric has been reworked.</div>
<div>
<br /></div>
<div>
EIGRP named mode automatically uses <i>wide metrics</i> when speaking to another EIGRP named mode process. No additional configuration is necessary. If it's speaking to a traditional EIGRP process, it falls back to the old calculations.</div>
<div>
<br /></div>
<div>
The new metric is designed to be able to differentiate paths above 10 Gbps. The new metric essentially changes four things:</div>
<div>
- Delay is now measured in picoseconds instead of tens of microseconds. 10 microseconds was the smallest step previously.</div>
<div>
- Bandwidth's scaling factor is made much larger, the calculation is now 10^7 * 65536 / Interface Bandwidth, as opposed to the original 10^7 * 256 / Interface Bandwidth.</div>
<div>
- The overall metric is now 64 bit.</div>
<div>
<div>
- The K6 value has been added "for future use", but Cisco has indicated this will be used for accumulated energy or accumulated jitter. Jitter is reasonably obvious. Energy is the actual electric power it takes to use an interface, so that you could literally do "least cost" routing based on how inexpensively the packet can be sent from the various interface types in a path.</div>
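<div>
To make the new scaling concrete, here's a rough Python sketch of the wide metric with default K values, as I understand the formula: throughput is 10^7 * 65536 / bandwidth (in Kbit), latency is delay (in picoseconds) * 65536 / 10^6, and the composite is the sum of the two. The function name and the truncating integer division are my assumptions. The inputs are taken from outputs that appear elsewhere in this post.</div>

```python
WIDE_SCALE = 65536
REF_BANDWIDTH_KBPS = 10**7
PS_PER_MICROSECOND = 10**6


def wide_metric(min_bandwidth_kbps, total_delay_ps, k1=1, k3=1):
    """Composite wide metric, assuming default Ks (K2 = K4 = K5 = K6 = 0)."""
    throughput = REF_BANDWIDTH_KBPS * WIDE_SCALE // min_bandwidth_kbps
    latency = total_delay_ps * WIDE_SCALE // PS_PER_MICROSECOND
    return k1 * throughput + k3 * latency


if __name__ == "__main__":
    # 1 Gbit path with 21,250,000 ps of total delay
    print(wide_metric(1_000_000, 21_250_000))
    # "metric 1 1 1 1 1": 1 Kbit, delay of 1 (tens of microseconds) = 10,000,000 ps
    print(wide_metric(1, 10_000_000))
```

<div>
The first call works out to 2048000 and the second to 655360655360, which match the composite metric and advertised distance seen in the show outputs in this post.</div>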
<div>
<br /></div>
<div>
One important note here is that with wide metrics, the EIGRP calculated metric no longer fits into the RIB. For example:</div>
<div>
<br /></div>
<div>
<div>
Router#sh ip eigrp top 10.10.10.10/32 | i Composite metric</div>
<div>
Composite metric is (330301440/329646080), route is Internal</div>
</div>
<div>
<div>
<br /></div>
<div>
Router#sh ip route 10.10.10.10 | i Route metric</div>
<div>
Route metric is 2580480, traffic share count is 1</div>
</div>
<div>
<br /></div>
<div>
The EIGRP topology table indicates 330301440, the RIB says 2580480. </div>
<div>
The RIB's metric can't exceed 32 bits, and there are circumstances where the new, more granular metrics won't fit into the RIB. So all metrics, regardless of whether the value would fit into 32 bits, are divided by the rib-scale value. The rib-scale is 128 by default:</div>
<div>
<br /></div>
<div>
330301440/128 = 2580480</div>
<div>
<br /></div>
<div>
You can reassign it to any value 1 to 255:</div>
<div>
<br /></div>
<div>
<div>
router eigrp SOMENAME</div>
<div>
address-family ipv4 unicast autonomous-system 100</div>
<div>
<b> metric rib-scale [1-255]</b></div>
<div>
<br /></div>
</div>
<div>
Here's a catch - I've gotten in the habit of using this command for redistributing into EIGRP when labbing:</div>
<div>
<br /></div>
<div>
redistribute <some other protocol> metric 1 1 1 1 1</div>
<div>
<br /></div>
<div>
Why? It's quick and easy to type if you're not trying to do traffic engineering.</div>
<div>
<br /></div>
<div>
<div>
Router#sh ip eigrp top 13.13.13.13/32 | i Composite metric</div>
<div>
Composite metric is (655361310720/655360655360), route is External</div>
</div>
<div>
<br /></div>
<div>
655361310720/128 = 5120010240</div>
<div>
<br /></div>
<div>
The largest number that can be represented in a 32-bit unsigned integer is 4,294,967,295.</div>
<div>
<br /></div>
<div>
5120010240 > 4294967295, therefore it cannot be represented in the RIB:</div>
<div>
<br /></div>
<div>
<div>
Router#sh ip route 13.13.13.13</div>
<div>
% Network not in table</div>
</div>
<div>
<br /></div>
<div>
You read that right: This is a valid, routable prefix that simply can't make it into the RIB because of compatibility between the EIGRP topology table and the RIB. You need to adjust the rib-scale to make this work:</div>
<div>
<br /></div>
<div>
<div>
Router(config-router-af)#metric rib-scale 153</div>
<div>
Router(config-router-af)#do sh ip route 13.13.13.13 | i Route metric</div>
<div>
Route metric is 4283407259, traffic share count is 1</div>
</div>
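<div>
If you'd rather not guess at a rib-scale, the arithmetic is easy to script. This is a small Python sketch, assuming the RIB metric is the composite metric divided by rib-scale with the remainder dropped (that assumption reproduces the numbers in the outputs above). The function names are mine, not anything from IOS.</div>

```python
RIB_MAX = 2**32 - 1  # largest value a 32-bit unsigned integer can hold


def rib_metric(composite, rib_scale=128):
    """Scaled metric, or None when it would not fit in the 32-bit RIB field."""
    scaled = composite // rib_scale
    return scaled if scaled <= RIB_MAX else None


def min_rib_scale(composite):
    """Smallest rib-scale that makes the scaled metric fit in 32 bits."""
    return composite // (RIB_MAX + 1) + 1


if __name__ == "__main__":
    fd = 655361310720
    print(rib_metric(fd))        # None: too large at the default scale
    print(min_rib_scale(fd))     # smallest workable rib-scale
    print(rib_metric(fd, 153))   # value installed with rib-scale 153
```

<div>
Keep in mind the command only accepts 1 to 255, so a composite metric that needs a larger scale than that still won't fit.</div>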
<div>
<br /></div>
<div>
I imagine that would make for a really good troubleshooting problem. "A route is being redistributed on R1 with a specific metric, but is not being installed in the RIB on R3. Do not change the metric on R1, or adjust with a route-map".</div>
<div>
<br /></div>
</div>
<div>
There are a few concerns with interoperability between the traditional EIGRP metric and the wide metrics, but not many. As I mentioned above, routers unable to understand wide metrics are auto-detected and sent the old metric, however, there are circumstances where a route might get depreffed after having passed through an older EIGRP process. For example, if two paths exist to a destination, one of them running entirely wide metrics and a different one running one router with traditional metrics, the traditional metric may make the entire path look worse and it may impact load share, or the ability to ECMP.</div>
<div>
<br /></div>
<div>
<b><u>SHA Authentication</u></b></div>
<div>
Now supporting more than just MD5:</div>
<div>
<br /></div>
<div>
<div>
R1(config-subif)#router eigrp TEST1</div>
<div>
R1(config-router)#address-family ipv4 unicast autonomous-system 100</div>
<div>
R1(config-router-af)#af-interface gig1.123</div>
<div>
R1(config-router-af-interface)#authentication mode hmac-sha-256 CCIE</div>
</div>
<div>
<br /></div>
<div>
I think authentication would also make a great TS question - the authentication could be placed on the interface still, which named mode silently ignores. You'd need to know to look at the EIGRP named process to fix it:</div>
<div>
<br /></div>
<div>
<div>
interface GigabitEthernet1.123</div>
<div>
ip authentication key-chain eigrp 100 BOB ! this does nothing when named mode is enabled.</div>
</div>
<div>
<br /></div>
<div>
<div>
<b><u>Route Tag Enhancements</u></b></div>
<div>
To be fair, the route tag enhancements aren't limited to EIGRP named mode - they work with OSPF, BGP, RIP, etc. They even work in the traditional (non-named) EIGRP syntax. However, I didn't think I needed to write a separate blog post just to show it in every context; they all basically work the same.</div>
</div>
<div>
<br /></div>
<div>
In short, the route tag enhancements allow the route tag to be formatted as a dotted decimal tag (looks like an IPv4 address) that can be matched either directly (in the traditional route tag method in route-map) or via a route-tag list. The route-tag list is where things get interesting.</div>
<div>
<br /></div>
<div>
R1:</div>
<div>
<div>
interface Loopback1</div>
<div>
ip address 1.1.1.1 255.255.255.255</div>
<div>
interface Loopback2</div>
<div>
ip address 2.2.2.2 255.255.255.255</div>
<div>
interface Loopback3</div>
<div>
ip address 3.3.3.3 255.255.255.255</div>
<div>
interface Loopback4</div>
<div>
ip address 4.4.4.4 255.255.255.255</div>
<div>
interface Loopback5</div>
<div>
ip address 5.5.5.5 255.255.255.255</div>
<div>
interface Loopback6</div>
<div>
ip address 6.6.6.6 255.255.255.255</div>
<div>
interface Loopback7</div>
<div>
ip address 7.7.7.7 255.255.255.255</div>
</div>
<div>
<br /></div>
<div>
<div>
<b>route-tag notation dotted-decimal</b></div>
</div>
<div>
<br /></div>
<div>
<div>
router eigrp TEST1</div>
<div>
!</div>
<div>
address-family ipv4 unicast autonomous-system 100</div>
<div>
!</div>
<div>
topology base</div>
<div>
redistribute connected route-map tag-routes</div>
</div>
<div>
<br /></div>
<div>
<div>
route-map tag-routes permit 10</div>
<div>
match interface Loopback1 Loopback2 Loopback3</div>
<div>
set tag 100.100.100.1</div>
<div>
route-map tag-routes permit 20</div>
<div>
match interface Loopback4 Loopback5</div>
<div>
set tag 100.100.200.1</div>
<div>
route-map tag-routes permit 30</div>
<div>
match interface Loopback6 Loopback7</div>
<div>
set tag 100.100.101.1</div>
</div>
<div>
<br /></div>
<div>
So we've set some dotted-decimal tags on R1, now let's filter on R2.</div>
<div>
<br /></div>
<div>
R2:</div>
<div>
<div>
route-tag notation dotted-decimal</div>
<div>
route-tag list binary-match seq 5 permit 100.100.0.0 0.0.254.255</div>
</div>
<div>
<br /></div>
<div>
<div>
route-map filter permit 10</div>
<div>
match tag list binary-match</div>
<div>
set metric 100 100 255 1 1500</div>
<div>
route-map filter permit 20</div>
</div>
<div>
<br /></div>
<div>
<div>
router eigrp TEST2</div>
<div>
!</div>
<div>
address-family ipv4 unicast autonomous-system 100</div>
<div>
!</div>
<div>
topology base</div>
<div>
distribute-list route-map filter in GigabitEthernet1.123</div>
</div>
<div>
<br /></div>
<div>
Anyone who's done any amount of CCIE-level route filtering should catch what I just did. The route-tag list is looking for any routes that begin with 100.100 and have an even 3rd octet - if you need an explanation of filtering with wildcard masks there are many available on the Internet.</div>
<div>
<br /></div>
<div>
So now tags can be matched based on what bits are set in them -- very cool.</div>
<div>
<br /></div>
<div>
<div>
R2(config)#do sh ip eigrp top 1.1.1.1/32 | i Composite metric</div>
<div>
<b> Composite metric is (6619136000/163840), route is External</b></div>
<div>
R2(config)#do sh ip eigrp top 4.4.4.4/32 | i Composite metric</div>
<div>
<b> Composite metric is (6619136000/163840), route is External</b></div>
<div>
R2(config)#do sh ip eigrp top 6.6.6.6/32 | i Composite metric</div>
<div>
<b> Composite metric is (1392640/163840), route is External</b></div>
</div>
<div>
<br /></div>
<div>
1.1.1.1 and 4.4.4.4 were tagged with 100.100.100.1 and 100.100.200.1 respectively, both even 3rd octets, and had their metric successfully rewritten. 6.6.6.6, tagged with 100.100.101.1, was not matched, and retained its original metric.</div>
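<div>
If the wildcard logic isn't obvious, here's a short Python sketch of the comparison, assuming route-tag lists treat the wildcard the same way ACLs do (a 1 bit means "don't care"). The function name is made up for illustration.</div>

```python
import ipaddress


def tag_matches(tag, base, wildcard):
    """True when tag equals base on every bit the wildcard does not mask out."""
    t = int(ipaddress.IPv4Address(tag))
    b = int(ipaddress.IPv4Address(base))
    w = int(ipaddress.IPv4Address(wildcard))
    care_bits = ~w & 0xFFFFFFFF
    return ((t ^ b) & care_bits) == 0


if __name__ == "__main__":
    for tag in ("100.100.100.1", "100.100.200.1", "100.100.101.1"):
        print(tag, tag_matches(tag, "100.100.0.0", "0.0.254.255"))
```

<div>
The only bit of the 3rd octet the wildcard leaves "cared about" is the lowest one, and it has to be 0, so only even 3rd octets match.</div>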
<div>
<br /></div>
<div>
I immediately tried this in IPv6... however...</div>
<div>
<br /></div>
<div>
<div>
R2(config-router)#address-family ipv6 unicast autonomous-system 200</div>
<div>
R2(config-router-af)#topology base</div>
<div>
R2(config-router-af-topology)#distribute-list ?</div>
<div>
prefix-list Filter connections based on an IPv6 prefix-list</div>
<div>
R2(config-router-af-topology)#distribute-list route-map ?</div>
<div>
% Unrecognized command</div>
</div>
<div>
<br /></div>
<div>
IPv6 can't be filtered ingress with route-maps yet. I didn't expect that. For anyone curious I'm on:</div>
<div>
<div>
<br /></div>
<div>
R2(config-router-af-topology)#do sh ver | i IOS Software</div>
<div>
Cisco IOS Software, CSR1000V Software (X86_64_LINUX_IOSD-UNIVERSALK9-M), Version 15.4(1)S1, RELEASE SOFTWARE (fc2)</div>
</div>
<div>
<br /></div>
<div>
There's one more option for setting tags:</div>
<div>
<br /></div>
<div>
<div>
router eigrp TEST1</div>
<div>
address-family ipv4 unicast autonomous-system 100</div>
<div>
<b> eigrp default-route-tag 9.9.9.9</b></div>
<div>
<br /></div>
</div>
<div>
<b>default-route-tag</b> is fairly picky about what it will tag. From some tinkering, it will tag all routes <i>except</i>:</div>
<div>
- Locally redistributed routes</div>
<div>
- Routes that already had a tag set in some other fashion</div>
<div>
- Routes it learned from another router</div>
<div>
<br /></div>
<div>
So in short, unless the route was originated with a "network" statement, this tag won't take effect.</div>
<div>
<br /></div>
<div>
<b><u>IPv6 VRF Lite</u></b></div>
<div>
<br /></div>
<div>
The traditional EIGRP process doesn't support IPv6 in a VRF. </div>
<div>
<br /></div>
<div>
You also must use the new format - multiprotocol VRF - for creating VRFs. </div>
<div>
Old format:</div>
<div>
<div>
R2(config)#ip vrf FOO</div>
<div>
R2(config-vrf)#rd 1:1</div>
</div>
<div>
<div>
R2(config-vrf)#exit</div>
</div>
<div>
R2(config)#int gig1.10</div>
<div>
<div>
R2(config-subif)#ip vrf forwarding FOO</div>
</div>
<div>
<br /></div>
<div>
Multiprotocol VRF:</div>
<div>
<div>
R2(config-vrf)#vrf definition FOO</div>
<div>
R2(config-vrf)#rd 1:1</div>
<div>
R2(config-vrf)#address-family ipv6 unicast</div>
<div>
R2(config-vrf-af)#address-family ipv4 unicast</div>
</div>
<div>
<div>
R2(config-vrf-af)#exit</div>
</div>
<div>
<div>
R2(config-vrf)#int gig1.10</div>
<div>
R2(config-subif)#vrf forwarding FOO</div>
</div>
<div>
<br /></div>
<div>
<div>
router eigrp SAMPLE</div>
<div>
!</div>
<div>
address-family ipv6 unicast vrf FOO autonomous-system 200</div>
<div>
!</div>
<div>
topology base</div>
<div>
exit-af-topology</div>
<div>
<b> eigrp router-id 2.2.2.2</b></div>
<div>
exit-address-family</div>
</div>
<div>
<br /></div>
<div>
Note the bolded line - <b>eigrp router-id 2.2.2.2</b>. Unless you have an IPv4 address in the routing table of the same VRF, you <b>must specify </b>the router ID manually. <b>There is no parser error</b>, it just doesn't work. Once again, this would make a great TS problem.</div>
<div>
<br /></div>
<div>
With IPv6, things work differently than IPv4 in named EIGRP mode. This process is already up:</div>
<div>
<br /></div>
<div>
<div>
*Sep 2 23:39:52.815: %DUAL-5-NBRCHANGE: EIGRP-IPv6 200: Neighbor FE80::20C:29FF:FEF7:FE11 (GigabitEthernet1.10) is up: new adjacency</div>
</div>
<div>
<br /></div>
<div>
However, note I haven't told it what interfaces to use. In our case, it automatically includes any interface that's in the appropriate VRF and has an IPv6 address on it. If you don't want to run EIGRP on an interface, you have to manually specify:</div>
<div>
<br /></div>
<div>
<div>
R2(config)#router eigrp SAMPLE</div>
<div>
R2(config-router-af)#address-family ipv6 unicast vrf FOO autonomous-system 200</div>
<div>
R2(config-router-af)#af-interface gig1.10</div>
<div>
R2(config-router-af-interface)#shut</div>
<div>
<br /></div>
<div>
*Sep 2 23:47:10.304: %DUAL-5-NBRCHANGE: EIGRP-IPv6 200: Neighbor FE80::20C:29FF:FED7:2458 (GigabitEthernet1.10) is down: interface down</div>
</div>
<div>
<br /></div>
<div>
<u><b>3rd party Next-Hop</b></u></div>
<div>
While also not a feature specific to named mode, EIGRP has recently started supporting 3rd party next hop. The concept of 3rd party next-hop is fairly simple. The easiest way I can explain it is if you have three routers on a single segment, R1, R2, and R3. They all share the 192.168.123.0/24 space between them. However, R1 and R2 speak EIGRP, and R2 and R3 speak OSPF. R1 doesn't speak OSPF, and R3 doesn't speak EIGRP. Assume there are extra routers behind R1 and R3 on different segments that are advertised in their respective routing protocols.</div>
<div>
<br /></div>
<div>
R2 is mutually redistributing between EIGRP and OSPF.</div>
<div>
<br /></div>
<div>
Without 3rd party next-hop, R1 would have to send traffic destined for the OSPF segments to R2, then R2 would have to forward it to R3. Inefficient and messy.</div>
<div>
<br /></div>
<div>
With 3rd party next-hop, R2 is permitted to use R3's address, even though it doesn't exist in the EIGRP process, when advertising routes to R1.</div>
<div>
<br /></div>
<div>
This is an automatic feature and requires only that R2 doesn't re-write the next-hop to itself (rewriting the next hop is default EIGRP behavior):</div>
<div>
<br /></div>
<div>
<div>
router eigrp TEST2</div>
<div>
!</div>
<div>
address-family ipv4 unicast autonomous-system 100</div>
<div>
!</div>
<div>
af-interface GigabitEthernet1.123</div>
<div>
<b>no next-hop-self</b></div>
</div>
<div>
<br /></div>
<div>
<b><u>EIGRP Fast ReRoute (FRR)</u></b></div>
<div>
The point of FRR is to generate Loop Free Alternates, or LFAs. What's an LFA?</div>
<div>
An LFA is a back-up route that can be pre-programmed into the FIB as a repair route. If you're familiar with EIGRP, you might think "but EIGRP already has feasible successors". True, but it doesn't program those into the forwarding linecards. </div>
<div>
<br /></div>
<div>
I wrote a rather lengthy article regarding BGP PIC and Add-Path two weeks ago, and I covered the problem that PIC was trying to solve, which is not necessarily easy to comprehend unless you've spent a great deal of time in a large service provider environment. PIC and FRR are trying to solve the same issue with different protocols. Rather than pasting the multi-page explanation I've already typed into this document as well, please reference that one to understand the issue:</div>
<div>
<br /></div>
<div>
<a href="http://brbccie.blogspot.com/2014/08/bgp-pic-and-add-path.html">http://brbccie.blogspot.com/2014/08/bgp-pic-and-add-path.html</a></div>
<div>
<br /></div>
<div>
The good news is that EIGRP doesn't require as complex an environment to explain FRR as it took to explain BGP PIC.</div>
<div>
<br /></div>
<div>
We already know EIGRP makes feasible successors, and can rely on those during reconvergence. But if we want the FIB to be able to swap over to a feasible successor as soon as the successor route is lost, we need to pre-program it.</div>
<div>
<br /></div>
<div>
In a nutshell, FRR simply picks the "best" feasible successor and sticks it in the FIB as a backup route.</div>
<div>
<br /></div>
<div>
There are two types of FRR, per-link and per-prefix. Per-link is only supported on IOS-XR at the time of this writing, so we'll be looking only at per-prefix.</div>
<div>
<br /></div>
<div>
First and foremost, we must ensure we have a feasible successor. If we have multiple successors (no feasibles), then we have ECMP - equal cost multi-path - and there's no need for FRR.</div>
<div>
<br /></div>
<div>
R1 has two paths to prefix 4.4.4.4 on R4, one via R2 and another via R3. I've deliberately de-prefed the route through R3. Note, if you're attempting to lab along with this, you'll want to create the depref <b>on R1</b>. If you're ECMP up until you create the depref on R1, you're guaranteed to have a feasible successor!</div>
<div>
<br /></div>
<div>
<div>
R1(config-subif)#int gig1.13</div>
<div>
R1(config-subif)#delay 5000</div>
</div>
<div>
<br /></div>
<div>
<div>
R1#sh ip eigrp topo 4.4.4.4/32</div>
<div>
EIGRP-IPv4 VR(TEST) Topology Entry for AS(100)/ID(192.168.12.1) for 4.4.4.4/32</div>
<div>
State is Passive, Query origin flag is 1, 1 Successor(s), FD is 2048000, RIB is 16000</div>
<div>
Descriptor Blocks:</div>
<div>
192.168.12.2 (GigabitEthernet1.12), from 192.168.12.2, Send flag is 0x0</div>
<div>
<b> Composite metric is (2048000/1392640), route is Internal</b></div>
<div>
Vector metric:</div>
<div>
Minimum bandwidth is 1000000 Kbit</div>
<div>
Total delay is 21250000 picoseconds</div>
<div>
Reliability is 255/255</div>
<div>
Load is 1/255</div>
<div>
Minimum MTU is 1500</div>
<div>
Hop count is 2</div>
<div>
Originating router is 192.168.24.4</div>
<div>
192.168.13.3 (GigabitEthernet1.13), from 192.168.13.3, Send flag is 0x0</div>
<div>
<b> Composite metric is (3278192640/1392640), route is Internal</b></div>
<div>
Vector metric:</div>
<div>
Minimum bandwidth is 1000000 Kbit</div>
<div>
Total delay is 50011250000 picoseconds</div>
<div>
Reliability is 255/255</div>
<div>
Load is 1/255</div>
<div>
Minimum MTU is 1500</div>
<div>
Hop count is 2</div>
<div>
Originating router is 192.168.24.4</div>
</div>
<div>
<br /></div>
<div>
Since the route via 192.168.13.3 (from R3) has an advertised distance (1392640) less than the feasible distance of the successor via 192.168.12.2 (2048000), we now have a feasible successor.</div>
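<div>
The feasibility check is simple enough to sketch in Python using the two composite metrics from the topology output above. This is a simplification: it treats the feasible distance as the lowest current computed distance, where real EIGRP tracks FD across the route's history. The function name is hypothetical.</div>

```python
def classify_paths(paths):
    """paths: next hop -> (computed distance, advertised distance)."""
    fd = min(cd for cd, _ in paths.values())
    roles = {}
    for next_hop, (cd, ad) in paths.items():
        if cd == fd:
            roles[next_hop] = "successor"
        elif ad < fd:
            roles[next_hop] = "feasible successor"
        else:
            roles[next_hop] = "not feasible"
    return roles


# Composite metrics copied from "sh ip eigrp topo 4.4.4.4/32" above.
PATHS = {
    "192.168.12.2": (2048000, 1392640),
    "192.168.13.3": (3278192640, 1392640),
}

if __name__ == "__main__":
    for next_hop, role in classify_paths(PATHS).items():
        print(next_hop, role)
```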
<div>
<br /></div>
<div>
<div>
R1(config)#router eigrp TEST</div>
<div>
R1(config-router)# address-family ipv4 unicast autonomous-system 100</div>
<div>
R1(config-router-af)# topology base</div>
<div>
R1(config-router-af-topology)#<b>fast-reroute per-prefix all</b></div>
</div>
<div>
<br /></div>
<div>
<div>
R1#sh ip route 4.4.4.4 | i Repair</div>
<div>
<b> Repair Path: 192.168.13.3, via GigabitEthernet1.13</b></div>
</div>
<div>
<br /></div>
<div>
<div>
R1#sh ip cef 4.4.4.4</div>
<div>
4.4.4.4/32</div>
<div>
nexthop 192.168.12.2 GigabitEthernet1.12</div>
<div>
<b> repair: attached-nexthop 192.168.13.3 GigabitEthernet1.13</b></div>
</div>
<div>
<br /></div>
<div>
It's very simple if we only have two paths, but what if there are 3 or more? Cisco uses what it calls "tie breakers", but I really dislike the name. We're not necessarily tie-breaking, because the selection criteria aren't comparing apples to apples. It's a bit more like a "2nd bestpath decision maker".</div>
<div>
<br /></div>
<div>
Before I list off the tie-breakers, let's look at what the problems might be if we had numerous paths to choose from.</div>
<div>
<br /></div>
<div>
Let's say we have multiple neighbors on a shared segment, with varying metrics to the destination we're trying to protect. Your bestpath is on that segment, as is your "second best" feasible successor, all hanging off the same interface on your router. If you're choosing the LFA purely based on metric, the <i>same interface</i> will get chosen for the backup path as is the primary route. That doesn't help us if that WAN link fails, or if the interface goes down, etc. </div>
<div>
<br /></div>
<div>
Take that one step further and say your best-path and best feasible successor are both on the same linecard. That might also be a poor decision.</div>
<div>
<br /></div>
<div>
What I'm getting at is there's more to consider than just the metric in this scenario.</div>
<div>
<br /></div>
<div>
The four tie-breakers are:</div>
<div>
<div>
- srlg-disjoint, priority 10</div>
</div>
<div>
- interface-disjoint, priority 20</div>
<div>
<div>
- lowest-backup-path-metric, priority 30</div>
</div>
<div>
- linecard-disjoint, priority 40</div>
<div>
<br /></div>
<div>
Lower priority is better.</div>
<div>
<b><br /></b></div>
<div>
<b>srlg-disjoint </b>favors a backup-path/interface that isn't in the same Shared Risk Link Group (more below).</div>
<div>
<b><br /></b></div>
<div>
<b>interface-disjoint </b>favors a backup route that doesn't share the same interface for its next-hop. <b>BEWARE</b>, sub-interfaces are considered disjoint interfaces by the FRR process on my version of IOS-XE!</div>
<div>
<br /></div>
<div>
<b>lowest-backup-path-metric</b> favors a backup route with the lowest metric.</div>
<div>
<br /></div>
<div>
<b>linecard-disjoint </b>favors a backup route that doesn't share the same linecard.</div>
<div>
<br /></div>
<div>
So to clarify, by default, SRLG gets priority unless not set, then interface-disjoint gets priority unless the two paths are already on different interfaces (or subinterfaces), then the lowest metric is picked. If the metric is the same, it looks for a port on a different linecard.</div>
<div>
<br /></div>
<div>
So to start, what the heck is SRLG?</div>
<div>
<br /></div>
<div>
There's very little information on this feature that I can find, but the idea, as best I can tell, is that if you happen to know two physical links share some dependency (perhaps passing through the same L2 switch upstream, for example), you can tell IOS which ones have dependencies.</div>
<div>
<br /></div>
<div>
For example, if Gig1 and Gig2 on my router both passed through a single point of failure upstream, my config might look something like this:</div>
<div>
<br /></div>
<div>
<div>
R1(config)#int gig1</div>
<div>
R1(config-if)#srlg gid 1</div>
<div>
R1(config-if)#int gig2</div>
<div>
R1(config-if)#srlg gid 1</div>
<div>
R1(config-if)#int gig3</div>
<div>
R1(config-if)#srlg gid 2</div>
</div>
<div>
<br /></div>
<div>
Note gig3 didn't necessarily need to get assigned to an srlg, but I included it for clarity.</div>
<div>
<br /></div>
<div>
I'm going to introduce a new path from R1 to R4 via R5. R1, R2 and R5 are all going to share a common link, meaning R1 routes to R2 and R5 on the same interface. I'm increasing delay slightly more on the path to R5. Furthermore, I'm going to prevent R2 and R5 from peering with one another; otherwise R5 would end up only advertising its bestpath from R2, and my topology breaks.</div>
<div>
<br /></div>
<div>
<div>
R1#sh ip eigrp top 4.4.4.4/32 | i Composite metric</div>
<div>
Composite metric is (78725120/13189120), route is Internal<br />
Composite metric is (3289989120/13189120), route is Internal<br />
Composite metric is (79380480/13844480), route is Internal<br />
<br /></div>
</div>
<div>
We've got three paths. Let's look at them again with my comments:</div>
<div>
<br /></div>
<div>
<div>
R1#sh ip eigrp top 4.4.4.4/32 | i Composite metric</div>
<div>
Composite metric is (78725120/13189120), route is Internal = Path through gig1.12 via R2</div>
</div>
<div>
Composite metric is (3289989120/13189120), route is Internal = Path through gig1.12 via R5<br />
Composite metric is (79380480/13844480), route is Internal = Path through gig1.13 via R3<br />
<br /></div>
<div>
<div>
R1#sh ip cef 4.4.4.4</div>
<div>
4.4.4.4/32</div>
<div>
nexthop 192.168.12.2 GigabitEthernet1.12</div>
<div>
repair: attached-nexthop 192.168.13.3 GigabitEthernet1.13</div>
</div>
<div>
<br /></div>
<div>
We can see IOS made a very smart move here, and it's in line with the priorities we discussed above. The backup path is not the best feasible successor from a metric standpoint, it's the less risky separate "interface" (again, IOS considers a subinterface a separate interface).</div>
<div>
<br /></div>
<div>
If we instead wanted it to choose based on metric:</div>
<div>
<br /></div>
<div>
<div>
R1(config)#router eigrp TEST</div>
<div>
R1(config-router)# address-family ipv4 unicast autonomous-system 100</div>
<div>
R1(config-router-af)# topology base</div>
<div>
R1(config-router-af-topology)#fast-reroute tie-break lowest-backup-path-metric 5</div>
<div>
<br />
(Note: I normally clear the EIGRP neighbors here; these commands don't always seem to react quickly after the change.)<br />
<br /></div>
</div>
<div>
R1(config-router-af-topology)#do sh ip cef 4.4.4.4<br />
4.4.4.4/32<br />
nexthop 192.168.12.2 GigabitEthernet1.12<br />
repair: attached-nexthop 192.168.12.5 GigabitEthernet1.12<br />
<br />
Now we're preferring the backup path through the same interface, which has the better metric.<br />
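<br />
To make the selection logic concrete, here's a minimal Python sketch of how I read the tie-breaker behavior: the feasible successors are passed through each rule in priority order, and a rule only narrows the list when at least one candidate satisfies it. The Path class, the pick_repair function and the metrics are all made up for illustration (round numbers, not my lab values), linecard-disjoint is left out since I can't lab it, and this is certainly not how IOS implements it internally.<br />
<br />
```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Path:
    next_hop: str
    interface: str
    metric: int     # hypothetical metric via this neighbor
    srlg: int = 0   # assumption: 0 means no SRLG configured


def pick_repair(primary, candidates, priorities):
    """Apply tie-break rules in ascending priority order (lower wins)."""
    rules = {
        "srlg-disjoint":
            lambda c, pool: primary.srlg == 0 or c.srlg != primary.srlg,
        "interface-disjoint":
            lambda c, pool: c.interface != primary.interface,
        "lowest-backup-path-metric":
            lambda c, pool: c.metric == min(p.metric for p in pool),
    }
    pool = list(candidates)
    for name in sorted(priorities, key=priorities.get):
        preferred = [c for c in pool if rules[name](c, pool)]
        if preferred:   # a rule that nobody satisfies is simply skipped
            pool = preferred
    return pool[0]


primary = Path("192.168.12.2", "Gi1.12", metric=100)
via_r5 = Path("192.168.12.5", "Gi1.12", metric=110)
via_r3 = Path("192.168.13.3", "Gi1.13", metric=120)

defaults = {"srlg-disjoint": 10,
            "interface-disjoint": 20,
            "lowest-backup-path-metric": 30}
metric_first = {**defaults, "lowest-backup-path-metric": 5}

print(pick_repair(primary, [via_r5, via_r3], defaults).next_hop)      # 192.168.13.3
print(pick_repair(primary, [via_r5, via_r3], metric_first).next_hop)  # 192.168.12.5
```
<br />
With the default priorities, interface-disjoint removes the R5 path before the metric is ever compared. Moving lowest-backup-path-metric to priority 5 lets the metric decide first, and interface-disjoint then has nothing left to prefer.<br />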
<br />
I'm not going to show the output from srlg-disjoint here, but I have labbed it previously and it does work - just set the srlg gid on the appropriate interfaces. Also, I have no way of labbing linecard-disjoint because I'm on a virtual router.<br />
<br />
<b><u>EIGRP Over The Top (OTP)</u></b><br />
Does anyone besides me use the OTP abbreviation to mean "on the phone"? I wish they could've gone with OTT instead.<br />
<br />
It is a really neat feature though - I know a lot of people will bash EIGRP as obsolete, proprietary, distance vector ... say what you will, among Cisco-powered enterprise networks it's the most popular IGP by a landslide. As a consultant, I would say 80% of the networks I come across run it.<br />
<br />
Furthermore, finding enterprise network support personnel that are BGP experts is somewhat rare.<br />
<br />
So what is one to do when MPLS separates all your sites, and your carrier (wisely) uses BGP as a PE->CE protocol? You hire a consultant to come in and make changes to the redistribution strategy periodically.<br />
<br />
Or... you run EIGRP OTP, and toss the BGP work out the window.<br />
<br />
OTP allows remote EIGRP peerings over any underlying IP network. All you need is reachability to the other EIGRP host. That means all your carrier needs to do is advertise the PE->CE link itself (probably a /30 between you and the carrier) in their MP-BGP, and <i>the CE doesn't even need to run BGP</i> (topology dependent). All the CE needs is a static default pointing at the PE router.<br />
<br />
If you have more than a few CEs, you'll probably want an EIGRP Route Reflector, which isn't nearly as complicated as it sounds. An EIGRP RR listens for dynamic connections (optionally), and then disables split horizon and next-hop-self.<br />
<br />
LISP provides the tunneling mechanism for the neighbors to reach one another. Fortunately, no LISP knowledge is required, the config is automatic.<br />
<br />
<img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjM7958QpkbSBMmgXW7XFstO9NO7ztGazsXfdk36CoU7KyhT88aEd3qzzKvVFyBucm49403mmdfzcSd9xhTG_MH9H3TI3gjX-taAi2unQh9d-Vw9FKC5TLd8CKgQ5-FC5CfkGNnWeKFDv4/s800/diagram1.png" />
<br />
<br />
Here, R2 - R5 represent the provider network, R1 and R7 represent isolated customer sites, and R6 and R8 represent a dual-homed customer site.<br />
<br />
R7 will be our EIGRP route reflector.<br />
<br />
Assume the provider is advertising the links between the CE and PE.<br />
Here are the rest of the relevant configs:<br />
<br />
R1:<br />
router eigrp OTP-TEST<br />
!<br />
address-family ipv4 unicast autonomous-system 100<br />
!<br />
topology base<br />
exit-af-topology<br />
neighbor 192.168.37.7 GigabitEthernet1.12 remote 10 lisp-encap<br />
network 1.1.1.1 0.0.0.0<br />
network 192.168.12.0<br />
exit-address-family</div>
<div>
<div>
<br /></div>
<div>
ip route 0.0.0.0 0.0.0.0 192.168.12.2</div>
</div>
<div>
<br /></div>
<div>
Just to prove there's no BGP involved here:</div>
<div>
<div>
<br /></div>
<div>
R1#sh ip protocol sum</div>
<div>
Index Process Name</div>
<div>
0 connected</div>
<div>
1 static</div>
<div>
2 application</div>
<div>
4 eigrp 100</div>
<div>
*** IP Routing is NSF aware ***</div>
</div>
<div>
<br /></div>
<div>
R6:</div>
<div>
<div>
router eigrp OTP-TEST</div>
<div>
!</div>
<div>
address-family ipv4 unicast autonomous-system 100</div>
<div>
!</div>
<div>
topology base</div>
<div>
exit-af-topology</div>
<div>
neighbor 192.168.37.7 GigabitEthernet1.46 remote 10 lisp-encap</div>
<div>
network 6.6.6.6 0.0.0.0</div>
<div>
network 192.168.46.0</div>
<div>
network 192.168.68.0</div>
<div>
exit-address-family</div>
</div>
<div>
<br /></div>
<div>
R8:</div>
<div>
<div>
router eigrp OTP-TEST</div>
<div>
!</div>
<div>
address-family ipv4 unicast autonomous-system 100</div>
<div>
!</div>
<div>
topology base</div>
<div>
exit-af-topology</div>
<div>
neighbor 192.168.37.7 GigabitEthernet1.58 remote 10 lisp-encap</div>
<div>
network 8.8.8.8 0.0.0.0</div>
<div>
network 192.168.58.0</div>
<div>
network 192.168.68.0</div>
<div>
exit-address-family</div>
</div>
<div>
<br /></div>
<div>
R7 (route reflector):</div>
<div>
<div>
router eigrp OTP-TEST</div>
<div>
!</div>
<div>
address-family ipv4 unicast autonomous-system 100</div>
<div>
!</div>
<div>
af-interface GigabitEthernet1.37</div>
<div>
no next-hop-self</div>
<div>
no split-horizon</div>
<div>
exit-af-interface</div>
<div>
!</div>
<div>
topology base</div>
<div>
exit-af-topology</div>
<div>
remote-neighbors source GigabitEthernet1.37 unicast-listen lisp-encap</div>
<div>
network 7.7.7.7 0.0.0.0</div>
<div>
network 192.168.37.0</div>
<div>
exit-address-family</div>
</div>
<div>
<br /></div>
<div>
The route reflector is also running BGP. Route reflectors can have a topology problem requiring this if you have backdoor links. In my case, if I only ran a default on the route reflector, I'd learn the link to R8 via EIGRP from R6, as opposed to using my default route. And vice-versa, R8 would advertise connectivity to R6, and my routes would do a continual up/down because they'd learn next-hops via the LISP interface. It's a typical tunnel recursion loop issue. Running BGP puts the prefixes to reach R6 and R8 in R7's table at a lower AD and solves the problem.</div>
<div>
<br /></div>
<div>
Also note that the link between PE and CE must be advertised into EIGRP in order for LISP to come up.</div>
<div>
<br /></div>
<div>
Now we have full reachability to the EIGRP prefixes without the majority of the CEs running BGP, and none of the CEs advertising their EIGRP routes into it.</div>
<div>
<br /></div>
<div>
<div>
R1#sh ip route eigrp | b Gateway</div>
</div>
<div>
<div>
Gateway of last resort is 192.168.12.2 to network 0.0.0.0</div>
<div>
<br /></div>
<div>
6.0.0.0/32 is subnetted, 1 subnets</div>
<div>
D 6.6.6.6 [90/93994331] via 192.168.46.6, 00:06:34, LISP0</div>
<div>
7.0.0.0/32 is subnetted, 1 subnets</div>
<div>
D 7.7.7.7 [90/93994331] via 192.168.37.7, 00:06:35, LISP0</div>
<div>
8.0.0.0/32 is subnetted, 1 subnets</div>
<div>
D 8.8.8.8 [90/93994331] via 192.168.58.8, 00:06:34, LISP0</div>
<div>
D 192.168.68.0/24 [90/93998811] via 192.168.46.6, 00:06:34, LISP0</div>
</div>
<div>
<br /></div>
<div>
<div>
R1#sh ip cef 6.6.6.6</div>
<div>
6.6.6.6/32</div>
<div>
nexthop 192.168.46.6 LISP0</div>
<div>
<br /></div>
<div>
R1#ping 6.6.6.6</div>
<div>
Type escape sequence to abort.</div>
<div>
Sending 5, 100-byte ICMP Echos to 6.6.6.6, timeout is 2 seconds:</div>
<div>
!!!!!</div>
<div>
Success rate is 100 percent (5/5), round-trip min/avg/max = 1/1/2 ms</div>
</div>
<div>
<br /></div>
<div>
<b><u>Add-Path</u></b></div>
<div>
<b><br /></b></div>
<div>
Add-Path is the capability to advertise more than one bestpath to a neighbor. I've done a large write-up on the BGP implementation of it:</div>
<div>
<br /></div>
<div>
<a href="http://brbccie.blogspot.com/2014/08/bgp-pic-and-add-path.html">http://brbccie.blogspot.com/2014/08/bgp-pic-and-add-path.html</a></div>
<div>
<br /></div>
<div>
The Cisco documentation indicates a use case of DMVPN for EIGRP Add-Path, but that seems a pretty narrow use to me, as summarization with DMVPN phase 3 would make it useless. However, our scenario for OTP above is perfect! </div>
<div>
<br /></div>
<div>
<div>
R1#sh ip route eigrp | b Gateway</div>
<div>
<div>
Gateway of last resort is 192.168.12.2 to network 0.0.0.0</div>
<div>
<br /></div>
<div>
6.0.0.0/32 is subnetted, 1 subnets</div>
<div>
D 6.6.6.6 [90/93994331] via 192.168.46.6, 00:06:34, LISP0</div>
<div>
7.0.0.0/32 is subnetted, 1 subnets</div>
<div>
D 7.7.7.7 [90/93994331] via 192.168.37.7, 00:06:35, LISP0</div>
<div>
8.0.0.0/32 is subnetted, 1 subnets</div>
<div>
D 8.8.8.8 [90/93994331] via 192.168.58.8, 00:06:34, LISP0</div>
<div>
D 192.168.68.0/24 [90/93998811] via 192.168.46.6, 00:06:34, LISP0</div>
</div>
</div>
<div>
<br /></div>
<div>
R1 only learns one path to 192.168.68.0/24. Two are available, so why can't we install both? It's the same problem as with BGP: the EIGRP route reflector only sends its single best path.</div>
<div>
<br /></div>
<div>
<div>
R7(config)#router eigrp OTP-TEST</div>
<div>
R7(config-router)# address-family ipv4 unicast autonomous-system 100</div>
<div>
R7(config-router-af)# af-interface GigabitEthernet1.37</div>
<div>
R7(config-router-af-interface)#add-paths 2</div>
</div>
<div>
<br /></div>
<div>
<div>
R1#sh ip route eigrp | b Gateway</div>
<div>
Gateway of last resort is 192.168.12.2 to network 0.0.0.0</div>
<div>
<br /></div>
<div>
6.0.0.0/32 is subnetted, 1 subnets</div>
<div>
D 6.6.6.6 [90/93994331] via 192.168.46.6, 00:12:51, LISP0</div>
<div>
7.0.0.0/32 is subnetted, 1 subnets</div>
<div>
D 7.7.7.7 [90/93994331] via 192.168.37.7, 00:12:52, LISP0</div>
<div>
8.0.0.0/32 is subnetted, 1 subnets</div>
<div>
D 8.8.8.8 [90/93994331] via 192.168.58.8, 00:12:51, LISP0</div>
<div>
D 192.168.68.0/24 [90/93998811] via 192.168.58.8, 00:00:26, LISP0</div>
<div>
[90/93998811] via 192.168.46.6, 00:00:26, LISP0</div>
</div>
<div>
<br /></div>
<div>
And we've got multiple redundant paths to 192.168.68.0/24 now!</div>
<div>
<br /></div>
<div>
Note, EIGRP add-path is incompatible with <i>variance</i>.</div>
<div>
<br /></div>
<div>
Hope you enjoyed,</div>
<div>
<br /></div>
<div>
Jeff</div>
brbcciehttp://www.blogger.com/profile/14586635047530183862noreply@blogger.com9tag:blogger.com,1999:blog-5968686435283454526.post-53853025878393461772014-08-16T08:18:00.002-07:002014-10-16T20:14:49.652-07:00BGP PIC and Add-PathThe meat of this article will be Add-Path, and why it's needed in certain PIC scenarios. However, understanding where and why we need these technologies, what was done before the Add-Path implementation was widely in place, etc, is nearly as challenging to learn as the Add-Path implementation itself.<br />
<br />
This is not intended to completely document Add-Path, nor is it just a primer. My original intent was to document the entire use of Add-Path; however, I realized halfway through that this would have easily produced a 50+ page document: there are many one-off cases for Add-Path that have their own features, and to show a use case for each one would've required several different topologies and drawings. My hope is that, at the depth I took it to, it will be more than sufficient to educate to the level required for the CCIE R&S v5 lab.<br />
<br />
So - what is PIC?<br />
<br />
PIC stands for Prefix Independent Convergence.<br />
<br />
PIC is a method for speeding up convergence of the FIB under failover conditions.<br />
<br />
Unless you have a really serious lab or a Spirent to play with, forget trying to lab the performance gain. The gains we're talking about here are only seen when you have tens of thousands, hundreds of thousands, or even 1M routes in your FIB.<br />
<br />
The use case is actually pretty easy to understand - when the next-hop to a set of prefixes changes, the router (presumably talking about a 7600 or ASR) has to walk each prefix in the FIB and update the next-hop. If you have 100 routes, this time is negligible. If you're carrying 1M routes in an MPLS environment, this is <i>not a small problem</i>. I've been told first-hand (from someone who does have a Spirent to play with) that this takes about <i>two minutes</i>.<br />
<br />
This would be Prefix <b>Dependent </b>Convergence, or a problem that grows dependent upon how many prefixes are in your FIB. The solution we want is something that updates in the same amount of time (presumably small amount of time!) no matter how many FIB entries we have.<br />
<br />
The concept of the FIB dates back decades now, and when it was originally written it was made in the most efficient manner possible, for CPU and RAM conservation:<br />
<br />
Prefix = Interface/Next-Hop<br />
<br />
For example,<br />
<br />
10.10.10.10/32 = FastEthernet0/0 192.168.0.1<br />
<br />
This was great 20 years ago when a "large" routing table was 40,000 routes. To converge quickly, a new method is required. Introducing the Hierarchical FIB.<br />
<br />
When using PIC, the FIB actually restructures to a 3-tier system:<br />
<br />
Prefix = Pointer = Interface/Next-Hop<br />
<br />
Understanding why this is better takes understanding that while a router may be carrying 1M routes, it's probably only directly connected (layer 3) to a dozen or less. So you've got 1M routes, and 12 possible exits.<br />
<br />
Let's say half those routes go out to two primary edge routers. Those routers are at 192.168.1.1 and 192.168.2.1.<br />
<br />
So, roughly half your routes look like:<br />
<br />
10.10.10.10/32 = Pointer A = Gigabit0/0 192.168.1.1<br />
11.11.11.11/32 = Pointer A = Gigabit0/0 192.168.1.1<br />
.... 499,998 routes later ...<br />
197.197.197.197/32 = Pointer A = Gigabit0/0 192.168.1.1<br />
<br />
192.168.1.1 fails. However, all these same prefixes are reachable via 192.168.2.1.<br />
With an appropriately designed network, PIC can simply reassign Pointer A.<b> This takes less than 50ms as opposed to 60+ seconds.</b><br />
<b><br /></b>
10.10.10.10/32 = Pointer A = Gigabit0/1 192.168.2.1<br />
11.11.11.11/32 = Pointer A = Gigabit0/1 192.168.2.1<br />
<br />
.... 499,998 routes later ...<br />
197.197.197.197/32 = Pointer A = Gigabit0/1 192.168.2.1<br />
<br />
The CEF process updated <b>one value</b>, that of Pointer A. Previously this took 500,000 updates, now it takes one. The time required for this process is independent of how many routes use the next-hop, hence Prefix <i>Independent </i>Convergence.<br />
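<br />
If it helps to see the difference in code, here's a minimal Python sketch of the two FIB layouts. The class names and the "prefix-N" keys are hypothetical stand-ins (real CEF structures are far more involved); the point is only to count how many entries get touched when the next-hop changes.<br />
<br />
```python
class FlatFib:
    """Traditional FIB: every prefix stores its own interface/next-hop."""

    def __init__(self):
        self.entries = {}   # prefix -> (interface, next hop)

    def add(self, prefix, next_hop):
        self.entries[prefix] = next_hop

    def repoint(self, old, new):
        """Walk every prefix; return how many entries were rewritten."""
        touched = 0
        for prefix, hop in self.entries.items():
            if hop == old:
                self.entries[prefix] = new
                touched += 1
        return touched


class HierarchicalFib:
    """PIC-style FIB: prefixes reference a shared pointer to the next-hop."""

    def __init__(self):
        self.pointers = {}  # pointer name -> (interface, next hop)
        self.entries = {}   # prefix -> pointer name

    def add(self, prefix, pointer, next_hop):
        self.pointers.setdefault(pointer, next_hop)
        self.entries[prefix] = pointer

    def repoint(self, pointer, new):
        """Rewrite the shared pointer only; always a single update."""
        self.pointers[pointer] = new
        return 1

    def lookup(self, prefix):
        return self.pointers[self.entries[prefix]]


OLD = ("Gigabit0/0", "192.168.1.1")
NEW = ("Gigabit0/1", "192.168.2.1")

flat, hier = FlatFib(), HierarchicalFib()
for i in range(500_000):
    flat.add(f"prefix-{i}", OLD)
    hier.add(f"prefix-{i}", "A", OLD)

print(flat.repoint(OLD, NEW))    # 500000 entries rewritten
print(hier.repoint("A", NEW))    # 1 entry rewritten
print(hier.lookup("prefix-42"))  # ('Gigabit0/1', '192.168.2.1')
```
<br />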
<br />
Now if you're following along, you probably see the enormous catch here: unless you're multipathing, how is CEF even going to know about the second path? PIC is a data-plane/CEF/FIB feature, it doesn't touch the control-plane. Normally we'd have to wait on BGP convergence (topology dependent), which takes a heck of a lot longer than 50ms. As we're all aware, and this is key to understanding this topic, <u>BGP only sends its single best-path per-prefix to its neighbors</u>. What if we needed two or more? Even worse, what if we're crossing a route-reflector, that aggregates everyone's paths and picks only one?<br />
<br />
I am going to cover four different ways to solve this, add-path being the newest of them.<br />
<br />
Here are the options at a high-level:<br />
1) Multipath. This is by far the easiest option if your topology fits.<br />
2) BGP Advertise-Best-External. For advertising from PE->PE, or PE->RR; this tells the edge PE to send its external route (presumably from a CE via eBGP) as best. More below.<br />
3) Diverse-Path (Shadow Router). This tells a route reflector, a secondary one in a topology, to deliberately calculate a "second-best" path <i>that has a different next-hop</i>. Instead of forwarding its best-path, it forwards this "second-best" path. Only the route-reflector needs to be updated to support this feature.<br />
4) Add-Path. In short, Add-Path modifies the BGP behavior to send two or more paths instead of just one best-path. This requires that every device in the topology that needs to send or receive multiple paths supports Add-Path.<br />
<br />
I've chosen to demonstrate these solutions in a VPNv4 environment, as it's where PIC makes the most sense. Note that add-path is purely an iBGP technology, the parser gets upset if you try it on eBGP:<br />
<br />
R3(config-router)#neighbor 192.168.30.2 advertise additional-paths all<br />
% BGP: Add-Path *not* supported on EBGP peering<br />
<div>
<br /></div>
<div>
I have a hobby (perhaps more of an interest?) of the language used in IOS parser messages. Half the time, unless you know the technology already, you can't even tell what the programmer was trying to convey when you make a mistake. If it's a new feature sometimes you don't even get an error, it just doesn't apply the config. Then other times you get blunt messages with *stars*!</div>
<div>
<br />
<br /></div>
<img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhN3xW-aT5avlZHr1MMaXC1wCwCp43OY-EYBUT7xeK5XcWAis2GkOB4nKgd-pK4yybW8SfX7lKz8tDNbayxMyZW3BNPsW4bm0GhMqPbUlKeyWO2QmmhnvzeocFNcRxcuzDPKpfhIZ0Pm1w/s800/diagram1.png" />
<br />
<div>
<br /></div>
<div>
I'm running a common VPNv4 design: BGP on the PEs, VRFs between CE and PE, and a "BGP free core" (all <i>one</i> P router that isn't a route reflector :) ).<br />
<br />
On the PE->CE links, I'm using 10.0.X.Y/24, where X is a combination of the two routers the link connects (i.e. R1->R2 is "12"), and Y is the router number. This is also the same number on the subinterface on the diagram.<br />
<br />
On the PE->P or PE->RR links, the IPs are 192.168.X.Y, same explanation of X and Y as above.<br />
<br />
Every router has a loopback0 of Y.Y.Y.Y/32, where Y is the router number.<br />
<br />
Note that R4 is a route-reflector, and R6, R7 and R8 are all PEs.<br />
<br />
<div style="margin-bottom: .0001pt; margin: 0in;">
Let's talk about the two flavors of PIC. There's PIC Core and PIC
Edge. They're both applied to a PE.</div>
<div style="margin-bottom: .0001pt; margin: 0in;">
<br /></div>
<div style="margin-bottom: .0001pt; margin: 0in;">
PIC Core is far simpler than PIC Edge, so we'll start there. We've
enabled PIC Core on R2.<br />
<br />
PIC Core is enabled with <i>one command</i>:<br />
<br />
<br />
R2-PE(config)#cef table output-chain build favor convergence-speed<br />
<br />
Of note, to disable it, you replace "convergence-speed" with "memory-utilization".<br />
<br />
Unlike PIC Edge, which, depending on the
implementation, may require widespread support on the network, PIC Core can
literally be enabled on just one device if you wanted.</div>
<div style="margin-bottom: .0001pt; margin: 0in;">
<br /></div>
<img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjnF7jKAUybOQRZabtQoPWSHbmr90-LnjtKyYtmVhH0x-0ShGADh9TvXFy_iUrisfbJm3Wt-EpU9r6puIlWG8KgCIvHI_JrCTthrO51Q3v8jEckKYOxXB2CUMRXX4EXfyBe44R0K0XSzl8/s800/diagram2.png" />
<br />
<div style="margin-bottom: .0001pt; margin: 0in;">
<br /></div>
<div style="margin-bottom: .0001pt; margin: 0in;">
As mentioned above, in a typical VPNv4 scenario, the core is
BGP-free, and only the PEs (and any route reflectors) maintain the BGP table.
Next hops to the PEs are carried in the IGP. Let's look at how that plays out:</div>
<div style="margin-bottom: .0001pt; margin: 0in;">
<br /></div>
<div style="margin-bottom: .0001pt; margin: 0in;">
- Let's assume R1's bestpath to R9 is via R2. R1 is BGP peered to
R2.</div>
<div style="margin-bottom: .0001pt; margin: 0in;">
- R2 takes R1's traffic into a VRF. It imports the VRF traffic
into VPNv4.</div>
<div style="margin-bottom: .0001pt; margin: 0in;">
- R4, the route reflector, learns via iBGP that the PEs R6 and R7
can both reach R9. It chooses R6 as the bestpath.</div>
<div style="margin-bottom: .0001pt; margin: 0in;">
- R2, only peered with R4 for iBGP, learns that R6 is the
bestpath.</div>
<div style="margin-bottom: .0001pt; margin: 0in;">
- Since this is VPNv4, R2 needs to choose an LDP-enabled next hop
that has a label for 6.6.6.6. Remember, in VPNv4, the next hop inside the iBGP
network is always the iBGP next-hop. The IGP indicates that R5 is the bestpath
for R2 to reach R6 (via MPLS).</div>
<div style="margin-bottom: .0001pt; margin: 0in;">
<br /></div>
<div style="margin-bottom: .0001pt; margin: 0in;">
The key element here is the recursion between R2 and R6:</div>
<div style="margin-bottom: .0001pt; margin: 0in;">
BGP tells R2 how to reach R9 via R6: 2.2.2.2 -> 6.6.6.6</div>
<div style="margin-bottom: .0001pt; margin: 0in;">
R2 needs to find out how to reach 6.6.6.6 via the IGP: 2.2.2.2
-> R5</div>
<div style="margin-bottom: .0001pt; margin: 0in;">
R2 needs to know how to reach R5: 2.2.2.2 -> 192.168.25.5 (R5's
interface IP)</div>
<div style="margin-bottom: .0001pt; margin: 0in;">
R2 needs to pick an interface to reach 192.168.25.5: gig1.25</div>
<div style="margin-bottom: .0001pt; margin: 0in;">
<br /></div>
<div style="margin-bottom: .0001pt; margin: 0in;">
So one more time!</div>
<div style="margin-bottom: .0001pt; margin: 0in;">
iBGP: 2.2.2.2 -> 6.6.6.6</div>
<div style="margin-bottom: .0001pt; margin: 0in;">
iBGP Next-Hop via OSPF: Find 6.6.6.6 via R5 at 192.168.25.5</div>
<div style="margin-bottom: .0001pt; margin: 0in;">
CEF: Exit interface gig1.25 towards 192.168.25.5</div>
<div style="margin-bottom: .0001pt; margin: 0in;">
<br />
I'm going to harp on the high-level of this again because it's dead critical to understanding the hierarchy of the process:<br />
BGP recurses to IGP<br />
IGP recurses to one or more Next Hops<br />
FIB populates one or more next hops from the IGP<br />
<br />
When you're using PIC Core, this is what we care about:<br />
BGP recurses to IGP<br />
IGP recurses to one or more Next Hops <-- PIC CORE INFLUENCES<br />
FIB populates one or more next hops from the IGP <-- PIC CORE INFLUENCES<br />
<div>
<br /></div>
<div>
I will demonstrate below.</div>
<div>
<br /></div>
</div>
<div style="margin-bottom: .0001pt; margin: 0in;">
So given that R1 -> R2 -> R5 -> R6 -> R9 above, let's say R5 goes completely offline - dead.<br />
<br />
This does not impact the BGP session between R2 and R4, or between R4 and R6. However, the next hop R2 uses to reach the BGP next-hop (192.168.25.5), which it learned from the IGP, must change. The IGP can reconverge very quickly, but let's say the BGP process was carrying 1M routes from R9. How long will it take R2 to update the next-hops of the BGP table and CEF?<br />
<br />
So to be clear, BGP is <u>not</u> reconverging. PIC Core cannot handle a BGP reconverge, you need PIC Edge for that. But if the IGP reconverges and requires the BGP Table and FIB to update, and you have a large quantity of routes, this can create a major impact on a PE - possibly several minutes of dropping traffic.<br />
<br />
With a traditional FIB, we'd have to make 1 million updates in both the BGP table and the FIB in order to be fully forwarding again. With a hierarchical FIB - what PIC Core provides us - the following process would happen:</div>
<br />
The FIB, before:<br />
Prefix 1 -> Pointer A (192.168.25.5) -> gig1.25<br />
<br />
The IGP reconverges the path via R4.<br />
Now we update Pointer A - <b>one value instead of 1M values</b> - and we end up with:<br />
<br />
Prefix 1 -> Pointer A (192.168.24.4) -> gig1.24<br />
<br />
So to reiterate, PIC Core is for failure of non-BGP speakers. It doesn't help if BGP itself needs to reconverge, but it does dramatically speed up CEF's failover if the IGP fails.<br />
<br />
Now moving in to the more complex PIC Edge.<br />
<br />
If PIC Core was about dealing with IGP failure, PIC Edge is about dealing with BGP failure.<br />
<br />
For the moment, we'll continue using our VPNv4 topology, except we're temporarily removing the route reflector and instead installing a full-mesh iBGP.<br />
<br />
<b>Please note that using PIC Edge should involve running BFD between the BGP speakers for fast detection of a failure. For simplicity, I've omitted this step. To learn more about BFD, please see my BFD blog: <a href="http://brbccie.blogspot.com/2014/06/everything-bfd.html">http://brbccie.blogspot.com/2014/06/everything-bfd.html</a></b><br />
<br />
That's quite a few iBGP peerings. The red lines indicate all of them:<br />
<br />
<img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhYADQlIONbJmrWRoDGJZPYIW09uYNHTrE0QOVreU0P_1xwANUdIvxF7mty5T_QpyOt7iAQ9_vm8FqYk5V5r9Uji8wgpU2QWhOADudXokEZWBrJZfiFqYURariZ2DSICElng1zYZLti52k/s800/iBGP-peerings-wo-RR.png" />
</div>
<div>
<br />
In this scenario, we're going to deal with R2's convergence process again, except we're going to assume R6 - the BGP-adjacent PE - dies, instead of a P router.<br />
<br />
Let's look at our routing protocols from R2's perspective.<br />
<br />
R2-PE#sh bgp vpnv4 un all 9.9.9.9<br />
BGP routing table entry for 1:1:9.9.9.9/32, version 11<br />
Paths: (3 available, best #2, table VPN)<br />
Advertised to update-groups:<br />
1<br />
Refresh Epoch 3<br />
300<br />
8.8.8.8 (metric 3) (via default) from 8.8.8.8 (8.8.8.8)<br />
Origin IGP, metric 0, localpref 100, valid, internal<br />
Extended Community: RT:1:1<br />
mpls labels in/out nolabel/16<br />
rx pathid: 0, tx pathid: 0<br />
Refresh Epoch 3<br />
300<br />
6.6.6.6 (metric 3) (via default) from 6.6.6.6 (6.6.6.6)<br />
Origin IGP, metric 0, localpref 100, valid, internal, best<br />
Extended Community: RT:1:1<br />
mpls labels in/out nolabel/22<br />
rx pathid: 0, tx pathid: 0x0<br />
Refresh Epoch 3<br />
300<br />
7.7.7.7 (metric 3) (via default) from 7.7.7.7 (7.7.7.7)<br />
Origin IGP, metric 0, localpref 100, valid, internal<br />
Extended Community: RT:1:1<br />
mpls labels in/out nolabel/26<br />
rx pathid: 0, tx pathid: 0<br />
<br />
As expected, R2 has three BGP paths to 9.9.9.9. 6.6.6.6 is the best.<br />
<br />
How do we reach 6.6.6.6?<br />
<br />
R2-PE#sh ip ospf route | s 6.6.6.6<br />
*> 6.6.6.6/32, Intra, cost 3, area 0<br />
via 192.168.24.4, GigabitEthernet1.24<br />
via 192.168.25.5, GigabitEthernet1.25<br />
<br />
The BGP table has one selected bestpath, the IGP has two multipath bestpaths to BGP's next hop:<br />
<br />
R2-PE#sh ip route 6.6.6.6<br />
Routing entry for 6.6.6.6/32<br />
Known via "ospf 1", distance 110, metric 3, type intra area<br />
Last update from 192.168.24.4 on GigabitEthernet1.24, 00:22:04 ago<br />
Routing Descriptor Blocks:<br />
* 192.168.25.5, from 6.6.6.6, 01:31:56 ago, via GigabitEthernet1.25<br />
Route metric is 3, traffic share count is 1<br />
192.168.24.4, from 6.6.6.6, 00:22:04 ago, via GigabitEthernet1.24<br />
Route metric is 3, traffic share count is 1<br />
<br />
R2-PE#sh ip cef 6.6.6.6<br />
6.6.6.6/32<br />
nexthop 192.168.24.4 GigabitEthernet1.24 label 17<br />
nexthop 192.168.25.5 GigabitEthernet1.25 label 18<br />
<br />
Now let's refer back to my process from earlier:<br />
<br /></div>
<div>
BGP recurses to IGP<br />
IGP recurses to one or more Next Hops<br />
FIB populates one or more next hops from the IGP</div>
<div>
<br /></div>
<div>
or</div>
<div>
<br />
BGP says use 6.6.6.6<br />
IGP says to get to 6.6.6.6 use either 192.168.25.5 or 192.168.24.4<br />
FIB points to 192.168.24.4 / label 17 and 192.168.25.5 / label 18 (multipath)<br />
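<br />
The same three-step recursion can be sketched in a few lines of Python. The dictionaries and the resolve function are hypothetical stand-ins for the BGP table, the IGP routes and the resulting FIB entries; the addresses and labels are the ones from the R2 output above.<br />
<br />
```python
# prefix -> BGP next hops installed for it (only the bestpath by default)
BGP = {"9.9.9.9/32": ["6.6.6.6"]}

# BGP next hop -> (IGP next hop, exit interface, outgoing label)
IGP = {
    "6.6.6.6": [
        ("192.168.24.4", "GigabitEthernet1.24", 17),
        ("192.168.25.5", "GigabitEthernet1.25", 18),
    ],
}


def resolve(prefix, bgp=BGP, igp=IGP):
    """Return every (bgp_next_hop, igp_next_hop, interface, label) for prefix."""
    return [(bgp_nh, *hop)
            for bgp_nh in bgp[prefix]
            for hop in igp.get(bgp_nh, [])]


for entry in resolve("9.9.9.9/32"):
    print(entry)

# If 6.6.6.6 drops out of the IGP, nothing resolves until BGP picks a new path.
print(resolve("9.9.9.9/32", igp={}))   # []
```
<br />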
<br />
Now what happens if 6.6.6.6 fails?<br />
<br />
R6-PE(config)#int gig1.46<br />
R6-PE(config-subif)#shut<br />
R6-PE(config-subif)#int gig1.56<br />
<div>
R6-PE(config-subif)#shut</div>
<div>
R6-PE(config-subif)#int gig1.69<br />
<div>
R6-PE(config-subif)#shut</div>
</div>
<div>
<br /></div>
<div>
Debugging BGP updates on R2 (significantly edited for brevity):</div>
<div>
<br /></div>
<div>
<div>
*Aug 20 23:30:29.037: RT(VPN): updating bgp 9.9.9.9/32 (0x1) : via 7.7.7.7 0 26</div>
<div>
*Aug 20 23:30:29.037: RT(VPN): closer admin distance for 9.9.9.9, flushing 1 routes</div>
<div>
*Aug 20 23:30:29.037: RT(VPN): add 9.9.9.9/32 via 7.7.7.7, bgp metric [200/0]</div>
</div>
<div>
<br /></div>
<div>
BGP figures out that 6.6.6.6 is down, and picks 7.7.7.7 for the next hop. Now we have the same problem we had with PIC Core, only it's more significant:</div>
<div>
<br /></div>
<div>
BGP recurses to IGP <-- PIC EDGE INFLUENCES<br />
IGP recurses to one or more Next Hops <-- PIC EDGE & CORE INFLUENCE<br />
FIB populates one or more next hops from the IGP <-- PIC EDGE & CORE INFLUENCE<br />
<br />
Just pointing out the process there - we don't have PIC edge enabled, so our theoretical 1M routes just took minutes to reconverge.<br />
<br />
So how do we enable PIC Edge? Quite simply, we can't wait for the IGP and BGP to converge. We need <b>two paths </b>in BGP. This can be easy or difficult, depending on our topology. Let's look at the easiest methods and progress towards harder.<br />
<br />
Note we still have <b>cef table output-chain build favor convergence-speed </b>configured on R2, which is still necessary.<br />
<div>
<br /></div>
<div>
</div>
</div>
Re-enabling R6 to show how this could play out with PIC Edge.<br />
router bgp 200<br />
address-family ipv4 vrf VPN<br />
maximum-paths ibgp 3<br />
<br />
Now we've told R2 to install multiple <b>BGP</b> paths, not just multiple <b>IGP </b>paths. This way if R6's advertisement gets pulled again, there's already a pre-made alternative path.<br />
<br />
Now we have three "hot", installed <b>BGP </b>paths to 9.9.9.9, instead of just one. This means with the IGP in consideration, we have six paths:<br />
<br />
R2-PE#sh bgp vpnv4 un all | b 9.9.9.9<br />
*mi 9.9.9.9/32 7.7.7.7 0 100 0 300 i<br />
*>i 6.6.6.6 0 100 0 300 i<br />
*mi 8.8.8.8 0 100 0 300 i<br />
<div>
<br /></div>
<div>
<div>
R2-PE#sh ip cef vrf VPN 9.9.9.9 detail</div>
<div>
9.9.9.9/32, epoch 1, flags rib defined all labels, per-destination sharing</div>
<div>
recursive via 6.6.6.6 label 22</div>
<div>
nexthop 192.168.24.4 GigabitEthernet1.24 label 23</div>
<div>
nexthop 192.168.25.5 GigabitEthernet1.25 label 18</div>
<div>
recursive via 7.7.7.7 label 26</div>
<div>
nexthop 192.168.24.4 GigabitEthernet1.24 label 16</div>
<div>
nexthop 192.168.25.5 GigabitEthernet1.25 label 20</div>
<div>
recursive via 8.8.8.8 label 16</div>
<div>
nexthop 192.168.24.4 GigabitEthernet1.24 label 28</div>
</div>
<div>
<br /></div>
</div>
<div>
If we lose the path via 6.6.6.6, one of the other paths would simply pick up the load, and because of the hierarchical FIB we already implemented, there's no need to rewrite all 1M prefixes in the FIB one at a time. </div>
<div>
<br /></div>
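<div>
Here is a rough Python sketch of why the hierarchical FIB matters at failover. It is a toy model under my own assumptions: the <i>PathList</i> class is hypothetical, it uses a simple primary/backup pair rather than true multipath, and 1,000 prefixes stand in for the 1M discussed above.</div>

```python
class PathList:
    """A shared forwarding object: many prefixes point at one of these."""

    def __init__(self, primary, backup):
        self.primary = primary
        self.backup = backup
        self.active = primary

    def fail_over(self):
        # One write repairs every prefix that references this object.
        self.active = self.backup


PREFIX_COUNT = 1000  # stand-in for the 1M prefixes discussed in the post

# Flat FIB: every prefix carries its own copy of the BGP next hop.
flat_fib = {f"10.{i // 256}.{i % 256}.0/24": "6.6.6.6" for i in range(PREFIX_COUNT)}

# Hierarchical FIB: every prefix references the same PathList object.
shared = PathList(primary="6.6.6.6", backup="7.7.7.7")
hier_fib = {prefix: shared for prefix in flat_fib}

# 6.6.6.6 fails. The flat FIB is rewritten one prefix at a time...
flat_writes = 0
for prefix in flat_fib:
    flat_fib[prefix] = "7.7.7.7"
    flat_writes += 1

# ...while the hierarchical FIB needs a single change to the shared object.
shared.fail_over()
hier_writes = 1

print(flat_writes, hier_writes)
```

<div>
The flat table needs one write per prefix, while the shared object needs one write in total, which is the whole idea behind prefix-independent convergence.</div>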
<div>
This represented our first PIC solution I described above: Multipathing. </div>
<div>
<br /></div>
<div>
I'm going to temporarily cut to a much simpler scenario to show BGP Advertise-Best-External. While I could mix this in to the topology we've been using, it's getting too complex to clearly illustrate the topic.</div>
<div>
<br /></div>
<div>
Let's say multipathing isn't an option - what if one of the paths is clearly better than the others? What else can we do?</div>
<div>
<br /></div>
<div>
I've deliberately made R6 the bestpath by setting the local preference on all routes leaving it to 150. Now what we see from R2 looks like:</div>
<div>
<br /></div>
<div>
<div>
R2-PE#show bgp vpnv4 un all 9.9.9.9</div>
<div>
BGP routing table entry for 1:1:9.9.9.9/32, version 63</div>
<div>
Paths: (1 available, best #1, table VPN)</div>
<div>
Advertised to update-groups:</div>
<div>
1</div>
<div>
Refresh Epoch 1</div>
<div>
300</div>
<div>
6.6.6.6 (metric 3) (via default) from 6.6.6.6 (6.6.6.6)</div>
<div>
Origin IGP, metric 0, localpref 150, valid, internal, best</div>
<div>
Extended Community: RT:1:1</div>
<div>
mpls labels in/out nolabel/22</div>
<div>
rx pathid: 0, tx pathid: 0x0</div>
</div>
<div>
<br /></div>
<div>
Only one path ... via 3 upstreams? Yep. Depending on timing, R2 may hold three paths for just a moment, but since all routers are peered with one another, R7 will learn over its iBGP session to R6 that R6 has the better path, as will R8. Both R7 and R8 will then select R6's path as best and withdraw their own routes from their iBGP peers, R2 included. Now R2 is stuck with one path - we need at least two for PIC Edge.</div>
<div>
<br /></div>
<div>
The dead easiest solution to this design is to use Advertise-Best-External:</div>
<div>
<br /></div>
<div>
R7 & R8:</div>
<div>
<div>
router bgp 200</div>
<div>
address-family ipv4 vrf VPN</div>
<div>
bgp advertise-best-external</div>
<div>
<br /></div>
</div>
<div>
What's this do?</div>
<div>
<br /></div>
<div>
<div>
R7-PE#sh bgp vpnv4 un all 9.9.9.9</div>
<div>
BGP routing table entry for 1:1:9.9.9.9/32, version 18</div>
<div>
Paths: (3 available, best #2, table VPN)</div>
<div>
Advertised to update-groups:</div>
<div>
1 6</div>
<div>
Refresh Epoch 5</div>
<div>
300</div>
<div>
8.8.8.8 (metric 3) (via default) from 8.8.8.8 (8.8.8.8)</div>
<div>
Origin IGP, metric 0, localpref 100, valid, internal</div>
<div>
Extended Community: RT:1:1</div>
<div>
mpls labels in/out 26/16</div>
<div>
rx pathid: 0, tx pathid: 0</div>
<div>
Refresh Epoch 3</div>
<div>
300</div>
<div>
6.6.6.6 (metric 3) (via default) from 6.6.6.6 (6.6.6.6)</div>
<div>
Origin IGP, metric 0, localpref 150, valid, internal, best</div>
<div>
Extended Community: RT:1:1</div>
<div>
mpls labels in/out 26/22</div>
<div>
rx pathid: 0, tx pathid: 0x0</div>
<div>
Refresh Epoch 2</div>
<div>
300</div>
<div>
10.0.79.9 (via vrf VPN) from 10.0.79.9 (9.9.9.9)</div>
<div>
Origin IGP, metric 0, localpref 100, valid, external</div>
<div>
Extended Community: RT:1:1</div>
<div>
mpls labels in/out 26/nolabel</div>
<div>
rx pathid: 0, tx pathid: 0</div>
</div>
<div>
<br /></div>
<div>
R7 still sees the path through R6 as best. However, what's it sending to R2? It's sending its eBGP path to the CE as opposed to the path to R6.<br />
<br /></div>
<div>
Since R8 is doing the same thing, R2 now has three paths again:</div>
<div>
<br /></div>
<div>
<div>
R2-PE#sh bgp vpnv4 un all 9.9.9.9</div>
<div>
BGP routing table entry for 1:1:9.9.9.9/32, version 70</div>
<div>
Paths: (3 available, best #3, table VPN)</div>
<div>
Advertised to update-groups:</div>
<div>
1</div>
<div>
Refresh Epoch 5</div>
<div>
300</div>
<div>
8.8.8.8 (metric 3) (via default) from 8.8.8.8 (8.8.8.8)</div>
<div>
Origin IGP, metric 0, localpref 100, valid, internal</div>
<div>
Extended Community: RT:1:1</div>
<div>
mpls labels in/out nolabel/16</div>
<div>
rx pathid: 0, tx pathid: 0</div>
<div>
Refresh Epoch 3</div>
<div>
300</div>
<div>
7.7.7.7 (metric 3) (via default) from 7.7.7.7 (7.7.7.7)</div>
<div>
Origin IGP, metric 0, localpref 100, valid, internal</div>
<div>
Extended Community: RT:1:1</div>
<div>
mpls labels in/out nolabel/26</div>
<div>
rx pathid: 0, tx pathid: 0</div>
<div>
Refresh Epoch 1</div>
<div>
300</div>
<div>
6.6.6.6 (metric 3) (via default) from 6.6.6.6 (6.6.6.6)</div>
<div>
Origin IGP, metric 0, localpref 150, valid, internal, best</div>
<div>
Extended Community: RT:1:1</div>
<div>
mpls labels in/out nolabel/22</div>
<div>
rx pathid: 0, tx pathid: 0x0</div>
</div>
<div>
<br /></div>
<div>
So Advertise-Best-External sends your best eBGP route to your iBGP neighbors even when it isn't your overall bestpath, while local routing (on R7 or R8) still goes through R6 due to the local-preference.</div>
<div>
<br /></div>
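<div>
The before-and-after behaviour can be sketched in a few lines of Python. This is a deliberately crude model under my own assumptions: it compares local preference only, ignores the rest of the bestpath algorithm, and the function name is made up for illustration.</div>

```python
# Each PE's own eBGP path to 9.9.9.9 and its local preference (from the lab).
EXTERNAL_LOCALPREF = {"R6": 150, "R7": 100, "R8": 100}


def paths_seen_by_r2(best_external):
    """Return the PEs whose path R2 holds once full-mesh iBGP has converged."""
    top = max(EXTERNAL_LOCALPREF.values())
    advertised = []
    for pe, localpref in EXTERNAL_LOCALPREF.items():
        own_path_is_best = localpref == top
        # Plain BGP: a PE only advertises its external path to iBGP peers
        # while that path is its overall bestpath.
        # Best-external: the PE advertises its external path regardless.
        if own_path_is_best or best_external:
            advertised.append(pe)
    return advertised


print(paths_seen_by_r2(best_external=False))
print(paths_seen_by_r2(best_external=True))
```

<div>
Without best-external only R6's path survives on R2; with it, R2 holds all three again, which matches the outputs above.</div>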
<div>
We're not done yet however:</div>
<div>
<br /></div>
<div>
<div>
R2-PE#sh ip cef vrf VPN 9.9.9.9 detail</div>
<div>
9.9.9.9/32, epoch 1, flags rib defined all labels</div>
<div>
recursive via 6.6.6.6 label 22</div>
<div>
nexthop 192.168.24.4 GigabitEthernet1.24 label 23</div>
<div>
nexthop 192.168.25.5 GigabitEthernet1.25 label 18</div>
<div>
<br /></div>
<div>
R2 still only sees one possible path.</div>
<div>
<br /></div>
<div>
We need to implement some single-router Add-Path to make this work. <b>The key item of importance is that only the routers that need the non-multipath redundant paths have to support Add-Path in this design.</b> If we're not worried about R6, R7, or R8 having an additional path back to R1, then we might just have R2 and R3 require the Add-Path support (Add-Path is a reasonably new feature at the time of this writing, so having your entire topology support it could be challenging).</div>
<div>
<br /></div>
<div>
<div>
router bgp 200</div>
<div>
address-family ipv4 vrf VPN</div>
<div>
bgp additional-paths select backup</div>
<div>
bgp additional-paths install</div>
<div>
<br /></div>
</div>
<div>
Don't worry about the specific mechanisms of "select backup" and "install" yet; I'm going to cover them thoroughly later. In short, we need to tell this router to pick a backup path and pre-install it in the FIB so that PIC can use it in failover, which this config accomplishes:</div>
<div>
<br /></div>
<div>
<div>
R2-PE#sh ip cef vrf VPN 9.9.9.9 det</div>
<div>
9.9.9.9/32, epoch 1, flags rib defined all labels</div>
<div>
recursive via 6.6.6.6 label 22</div>
<div>
nexthop 192.168.24.4 GigabitEthernet1.24 label 23</div>
<div>
nexthop 192.168.25.5 GigabitEthernet1.25 label 18</div>
<div>
<b> recursive via 7.7.7.7 label 26, repair</b></div>
<div>
nexthop 192.168.24.4 GigabitEthernet1.24 label 16</div>
<div>
nexthop 192.168.25.5 GigabitEthernet1.25 label 20</div>
</div>
<div>
<br /></div>
<div>
Note the "repair" flag on the second path - that's the key.</div>
<div>
<br /></div>
<div>
I'm removing the R2 Add-Path config and bgp advertise-best-external on the PEs.<br />
<br /></div>
<div>
This is all fantastic with full-mesh iBGP - what if you have a huge topology and a route-reflector (or several) is more realistic? There's a big problem here, because like any BGP router, the route reflector will only choose its <u>one</u> best path to send to the other PEs. This makes multipathing impossible.</div>
</div>
<div>
<br /></div>
<div>
I've re-made R4 a route reflector, and removed all the redundant iBGP paths between the other PEs. Every PE is getting their routes via R4 now.</div>
<div>
<br /></div>
<div>
Clearly down to just one path now:</div>
<div>
<br /></div>
<div>
<div>
R2-PE#sh ip cef vrf VPN 9.9.9.9 det</div>
<div>
9.9.9.9/32, epoch 1, flags rib defined all labels</div>
<div>
recursive via 6.6.6.6 label 22</div>
<div>
nexthop 192.168.24.4 GigabitEthernet1.24 label 23</div>
<div>
nexthop 192.168.25.5 GigabitEthernet1.25 label 18</div>
</div>
<div>
<br /></div>
<div>
As usual, we have multiple IGP paths, but those will both get pulled if we lose the BGP path.</div>
<div>
<br /></div>
<div>
Without going to full-on Add-Path across the network, our simplest answer is another route-reflector running diverse-path. I'm temporarily making R5 an additional route-reflector.</div>
<div>
<br /></div>
<div>
For brevity I'm not going to include all the config necessary to make R5 a route-reflector. However, the outcome on R2 looks like this:</div>
<div>
<br /></div>
<div>
<div>
R2-PE#sh bgp vpnv4 uni all 9.9.9.9</div>
<div>
BGP routing table entry for 1:1:9.9.9.9/32, version 79</div>
<div>
Paths: (2 available, best #2, table VPN)</div>
<div>
Advertised to update-groups:</div>
<div>
1</div>
<div>
Refresh Epoch 1</div>
<div>
300</div>
<div>
6.6.6.6 (metric 3) (via default) from 5.5.5.5 (5.5.5.5)</div>
<div>
Origin IGP, metric 0, localpref 100, valid, internal</div>
<div>
Extended Community: RT:1:1</div>
<div>
Originator: 6.6.6.6, Cluster list: 5.5.5.5</div>
<div>
mpls labels in/out nolabel/22</div>
<div>
rx pathid: 0, tx pathid: 0</div>
<div>
Refresh Epoch 1</div>
<div>
300</div>
<div>
6.6.6.6 (metric 3) (via default) from 4.4.4.4 (4.4.4.4)</div>
<div>
Origin IGP, metric 0, localpref 150, valid, internal, best</div>
<div>
Extended Community: RT:1:1</div>
<div>
Originator: 6.6.6.6, Cluster list: 4.4.4.4</div>
<div>
mpls labels in/out nolabel/22</div>
<div>
rx pathid: 0, tx pathid: 0x0</div>
</div>
<div>
<br /></div>
<div>
Hey, great, we've got two paths, we can just enable Add-Path on R2 and we're done, right? </div>
<div>
<br /></div>
<div>
Not so fast.</div>
<div>
<br /></div>
<div>
The next-hop is 6.6.6.6 on both routes - in order for Add-Path to be viable, the backup path's next-hop <u>must be different</u> than the primary path's.</div>
<div>
<br /></div>
<div>
The solution, as I'd mentioned above, is to use Diverse-Path. Diverse-Path tells a BGP router to deliberately calculate the 2nd-best path that has a different next hop than the first-best path. Diverse-Path was a workaround before Add-Path was supported (or widely supported) in IOS. Only the route reflector running Diverse-Path needs to know about it; all the other routers are just following standard IOS rules. </div>
<div>
<br /></div>
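<div>
As a rough illustration of the "different next hop" rule, here is a small Python sketch. The <i>Path</i> type and both functions are hypothetical names of mine, and ranking is by local preference only, which is far simpler than the real bestpath algorithm.</div>

```python
from typing import NamedTuple, Optional


class Path(NamedTuple):
    next_hop: str
    local_pref: int


def rank(paths):
    """Order paths best-first. Only local preference is compared here."""
    return sorted(paths, key=lambda p: p.local_pref, reverse=True)


def select_backup(paths) -> Optional[Path]:
    """Return the best remaining path whose next hop differs from the bestpath's."""
    ordered = rank(paths)
    if not ordered:
        return None
    best = ordered[0]
    for candidate in ordered[1:]:
        if candidate.next_hop != best.next_hop:
            return candidate
    return None  # every alternative shares the bestpath's next hop


# Three distinct next hops: a usable backup exists.
print(select_backup([Path("6.6.6.6", 150), Path("7.7.7.7", 100), Path("8.8.8.8", 100)]))

# Two copies of the same next hop (what R2 saw from R4 and R5): no backup.
print(select_backup([Path("6.6.6.6", 150), Path("6.6.6.6", 100)]))
```

<div>
The second call returns nothing usable, which is exactly why two copies of the 6.6.6.6 route on R2 didn't help.</div>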
<div>
<div>
R5-RR(config)#router bgp 200</div>
<div>
R5-RR(config-router)#address-family vpnv4</div>
</div>
<div>
<div>
R5-RR(config-router-af)#bgp additional-paths select backup</div>
<div>
R5-RR(config-router-af)#bgp additional-paths install</div>
<div>
R5-RR(config-router-af)#neighbor 2.2.2.2 <b>advertise diverse-path backup</b></div>
</div>
<div>
<br /></div>
<div>
Here, we tell R5 to calculate a backup path, and then we tell it to advertise it to R2 as if it were R5's bestpath (in production, you'd presumably want to send this to all route-reflector clients, not just one).</div>
<div>
<br />
One more step is also required on R7 and R8 (I've done R6 as well to keep the config consistent) - right now, this topology suffers from the same problem we saw in the first advertise-best-external scenario. Consider:<br />
<br />
1) R6 sends its bestpath (its external path) to R4 and R5. This prefix has a local pref of 150.<br />
2) R7 sends its bestpath (its external path) to R4 and R5.<br />
3) R8 sends its bestpath (its external path) to R4 and R5.<br />
4) R5 starts calculating 2nd-best-path for R2<br />
5) R7 learns about R6's bestpath from R4<br />
6) R8 learns about R6's bestpath from R4<br />
7) R7 withdraws its bestpath from R4 and R5 after learning R6's path is better<br />
8) R8 withdraws its bestpath from R4 and R5 after learning R6's path is better<br />
9) R5 calculates its <i>only</i> path to 9.9.9.9 via R6<br />
<br />
Now we could put <b>bgp advertise-best-external</b> back in, but that would advertise the best external to both R4 and R5 and we'd have the same exact problem as above.<br />
<br />
Per-neighbor best-external is the solution:<br />
R6, R7 & R8:<br />
router bgp 200<br />
address-family vpnv4<br />
neighbor 5.5.5.5 advertise best-external<br />
<br />
This will advertise the "internal bestpath" (via R6, because of local preference) to R4, and the external bestpath to R5.<br />
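<br />
Putting per-neighbor best-external and diverse-path together, a simplified Python model of what R2 ends up receiving might look like this. All function names are mine, only local preference is compared, and a result of <i>None</i> simply means "R5 has no distinct backup to offer" in this model.<br />
<br />

```python
# Each PE's own eBGP path to 9.9.9.9, keyed by BGP next hop (from the lab).
EXTERNAL_LOCALPREF = {"6.6.6.6": 150, "7.7.7.7": 100, "8.8.8.8": 100}


def received_by(rr, best_external_toward):
    """Next hops a route reflector holds once the PEs have converged."""
    top = max(EXTERNAL_LOCALPREF.values())
    # A PE keeps advertising its external path if that path is the overall
    # best, or if this RR is configured as a best-external neighbor.
    return [pe for pe, lp in EXTERNAL_LOCALPREF.items()
            if lp == top or rr in best_external_toward]


def best(paths):
    return max(paths, key=EXTERNAL_LOCALPREF.get)


def diverse_backup(paths):
    """Best path whose next hop differs from the bestpath's, or None."""
    others = [p for p in paths if p != best(paths)]
    return best(others) if others else None


def r2_view(best_external_toward):
    from_r4 = best(received_by("R4", best_external_toward))            # normal reflection
    from_r5 = diverse_backup(received_by("R5", best_external_toward))  # diverse-path backup
    return from_r4, from_r5


print(r2_view(set()))    # no best-external anywhere
print(r2_view({"R5"}))   # neighbor 5.5.5.5 advertise best-external
```

With no best-external, R5 only ever hears about 6.6.6.6, so it has nothing different to hand to R2. Once the PEs advertise best-external toward R5, R5 can offer 7.7.7.7 as the diverse path.<br />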
<br /></div>
<div>
Now back to R2:</div>
<div>
<div>
R2-PE#sh bgp vpnv4 un all 9.9.9.9</div>
<div>
BGP routing table entry for 1:1:9.9.9.9/32, version 83</div>
<div>
Paths: (2 available, best #2, table VPN)</div>
<div>
Advertised to update-groups:</div>
<div>
1</div>
<div>
Refresh Epoch 2</div>
<div>
300</div>
<div>
6.6.6.6 (metric 3) (via default) from 5.5.5.5 (5.5.5.5)</div>
<div>
Origin IGP, metric 0, localpref 100, valid, internal</div>
<div>
Extended Community: RT:1:1</div>
<div>
Originator: 6.6.6.6, Cluster list: 5.5.5.5</div>
<div>
mpls labels in/out nolabel/22</div>
<div>
rx pathid: 0, tx pathid: 0</div>
<div>
Refresh Epoch 1</div>
<div>
300</div>
<div>
6.6.6.6 (metric 3) (via default) from 4.4.4.4 (4.4.4.4)</div>
<div>
Origin IGP, metric 0, localpref 150, valid, internal, best</div>
<div>
Extended Community: RT:1:1</div>
<div>
Originator: 6.6.6.6, Cluster list: 4.4.4.4</div>
<div>
mpls labels in/out nolabel/22</div>
<div>
rx pathid: 0, tx pathid: 0x0</div>
</div>
<div>
<br /></div>
<div>
Now we've got two routes with two next-hops.</div>
<div>
<br /></div>
<div>
<div>
R2-PE#sh ip cef vrf VPN 9.9.9.9 detail</div>
<div>
9.9.9.9/32, epoch 1, flags rib defined all labels</div>
<div>
recursive via 6.6.6.6 label 22</div>
<div>
nexthop 192.168.24.4 GigabitEthernet1.24 label 23</div>
<div>
nexthop 192.168.25.5 GigabitEthernet1.25 label 18</div>
</div>
<div>
<br /></div>
<div>
But we still need to enable the calculation of a backup route, otherwise PIC Edge won't work.</div>
<div>
<br /></div>
<div>
<div>
R2-PE(config)#router bgp 200</div>
<div>
R2-PE(config-router)#address-family ipv4 vrf VPN</div>
<div>
R2-PE(config-router-af)#bgp additional-paths select backup</div>
<div>
R2-PE(config-router-af)#bgp additional-paths install</div>
</div>
<div>
<br />
R2-PE#sh ip cef vrf VPN 9.9.9.9 det<br />
9.9.9.9/32, epoch 1, flags rib defined all labels<br />
recursive via 6.6.6.6 label 22<br />
nexthop 192.168.24.4 GigabitEthernet1.24 label 23<br />
nexthop 192.168.25.5 GigabitEthernet1.25 label 18<br />
recursive via 7.7.7.7 label 26, repair<br />
nexthop 192.168.24.4 GigabitEthernet1.24 label 16<br />
nexthop 192.168.25.5 GigabitEthernet1.25 label 20</div>
<div>
<br /></div>
<div>
Now we've got a working solution!</div>
<div>
<br /></div>
<div>
And last but certainly not least, the gold standard of receiving two paths: Simply rework how BGP handles multiple paths by using Add-Path.</div>
<div>
<br />
Sadly, as much as this technology seems like it's custom-built for VPNv4, if you can believe it, Add-Path isn't supported in VPNv4 on my OS:<br />
<br />
R4#sh ver<br />
Cisco IOS XE Software, Version 03.11.01.S - Standard Support Release<br />
Cisco IOS Software, CSR1000V Software (X86_64_LINUX_IOSD-UNIVERSALK9-M), Version 15.4(1)S1, RELEASE SOFTWARE (fc2)<br />
<div>
<output omitted></div>
<div>
<br /></div>
In the IPv4 (default) Family:<br />
<br />
R4(config)#router bgp 200<br />
R4(config-router)#bgp additional-paths select ?<br />
<b><i> all Select all available paths</i></b><br />
backup Select backup path<br />
<b><i> best Select best N paths</i></b><br />
best-external Select best-external path<br />
<b><i> group-best Select group-best path</i></b><br />
<div>
<br /></div>
R4(config-router)#neighbor 2.2.2.2 advertise ?<br />
<b><i> additional-paths Advertise additional paths</i></b><br />
<div>
<div>
best-external Advertise best-external (at RRs best-internal) path</div>
<div>
diverse-path Advertise diverse path</div>
</div>
<div>
<br /></div>
<div>
Note the <b>bolded </b>and <i>italic</i> items, that's what we're looking for in VPNv4:</div>
<div>
<br /></div>
R4(config-router)#address-family vpnv4<br />
<div>
R4(config-router-af)#neighbor 2.2.2.2 advertise ?</div>
best-external Advertise best-external (at RRs best-internal) path<br />
diverse-path Advertise diverse path<br />
<br />
R4(config-router-af)#bgp additional-paths select ?<br />
backup Select backup path<br />
best-external Select best-external path<br />
<br />
Completely lacking.<br />
<br />
On that note, we'll be reverting this design back to a non-MPLS scenario for the remainder of the blog.<br />
<div>
I've also reverted R5 from being a route-reflector, it's now simply a client of R4. This was necessary to carry the IPv4 BGP table through R5.<br />
<br />
<b>Note R6 deliberately still has the best path via local-preference.</b></div>
</div>
<div>
<br /></div>
<div>
Here is a diagram of roughly what we're trying to achieve.</div>
<div>
<br />
<img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg0hM4oWg2dsa9g7LpIFsQDkbqv2VzY4_TIBOwG_Mo5Zm-H98RWpTYotLtnHKPRZ0fsQNjJEHEHJqHF9P8GyrTIoeRVm-_AKwtfVCJYae_WStf3liB8Zps0QJzYYLco12MqVX6WnfjcMHI/s800/diagram3.png" />
<br />
<br /></div>
<div>
We'd like R6, R7 and R8 to all send (initially) one route to the RR. We'd like R4 to reflect back two paths for reaching 9.9.9.9 to everyone (technically speaking we'll also be reflecting two paths for 1.1.1.1 on CE1, but I chose not to focus on that).<br />
<br />
This design suffers from the same problem the last several have. Everything will start out looking good until the route-reflector reflects the superior path from R6 to R7 and R8, and those two routers both pick R6 as their bestpath. After that they'll withdraw their routes from R4, and R4 will only have a single route to send to R2, R3, etc, etc, because every path will point to R6.<br />
<br />
We can solve this with one of three methods:<br />
- BGP Advertise-Best-External on R7 and R8 (optionally on R6)<br />
- Per-neighbor advertise best-external<br />
- Running two-path Add-Path on R7 and R8 in addition to R4.<br />
<br />
The top two options I imagine are self-explanatory at this point as I covered them above, however, the final option is hopefully interesting to the reader, and therefore it's the method I will choose for this lab. What will happen if R4, R7 and R8 run add-path is as follows:<br />
<br />
1) R6, R7 and R8 all advertise their own (connected/external) bestpath to R4<br />
2) (Let's assume R7 had the 2nd-best path for this example) R4 reflects BOTH R6 and R7's bestpath to R2, R3, R5, R6, R7 and R8.<br />
3) R2 and R3 install both paths in BGP and in the FIB.<br />
4) R6 installs R7's path in the FIB as a repair route.<br />
5) R7 and R8 both change their bestpath to R6 instead of their external route.<br />
6) R7 and R8 both advertise back to the route reflector that R6 is their bestpath and R7 is their backup path.<br />
... Nothing changes on R2, R3, or R4 that would influence a shift on the route reflector, so its clients aren't modified either.<br />
<br />
The key here is that while we still have the same problem of R7 and R8 preferring R6's external path, we're still advertising two paths to the route reflector: R6's (as best), and R7's as a backup.<br />
<br />
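The sequence above can be approximated with a two-round Python simulation. This is a sketch under my own assumptions: it compares local preference only, breaks ties in favor of a PE's own external path, and treats "best N" as the number of paths each router selects and is willing to advertise. The function names are hypothetical.<br />
<br />

```python
# Each PE's own eBGP path to 9.9.9.9, keyed by BGP next hop (from the lab).
EXTERNAL_LOCALPREF = {"6.6.6.6": 150, "7.7.7.7": 100, "8.8.8.8": 100}


def pe_keeps_advertising(pe, reflected, n_best):
    """True if the PE's own external path is among its n_best selected paths."""
    candidates = set(reflected) | {pe}
    # Higher local preference first; on a tie the PE's own (external)
    # path beats a reflected (internal) one.
    ranked = sorted(candidates, key=lambda nh: (-EXTERNAL_LOCALPREF[nh], nh != pe))
    return pe in ranked[:n_best]


def rr_paths_after_convergence(n_best):
    # Round 1: every PE advertises its external path; the RR reflects its top n_best.
    received = list(EXTERNAL_LOCALPREF)
    reflected = sorted(received, key=lambda nh: -EXTERNAL_LOCALPREF[nh])[:n_best]
    # Round 2: each PE re-runs selection and may withdraw its own path.
    return [pe for pe in received if pe_keeps_advertising(pe, reflected, n_best)]


print(rr_paths_after_convergence(1))   # classic single-bestpath BGP
print(rr_paths_after_convergence(2))   # two-path Add-Path on the PEs and the RR
```

With a single bestpath the RR is left holding only R6's path. With two selected paths, each PE's own external path stays inside its selection, so the RR keeps hearing about all of them.<br />
<br />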
Here is the relevant config:<br />
R4:<br />
router bgp 200<br />
bgp additional-paths select best 2<br />
bgp additional-paths send receive<br />
bgp additional-paths install<br />
neighbor 2.2.2.2 advertise additional-paths best 2<br />
neighbor 3.3.3.3 advertise additional-paths best 2<br />
neighbor 5.5.5.5 advertise additional-paths best 2<br />
neighbor 6.6.6.6 advertise additional-paths best 2<br />
neighbor 7.7.7.7 advertise additional-paths best 2<br />
neighbor 8.8.8.8 advertise additional-paths best 2</div>
<div>
<br /></div>
<div>
R2, R3, & R5:</div>
<div>
<div>
router bgp 200</div>
<div>
bgp additional-paths select best 2</div>
<div>
bgp additional-paths receive</div>
<div>
bgp additional-paths install</div>
</div>
<div>
<br /></div>
<div>
R6 - R8:</div>
<div>
router bgp 200</div>
<div>
bgp additional-paths select best 2<br />
bgp additional-paths send receive<br />
bgp additional-paths install</div>
<div>
neighbor 4.4.4.4 advertise additional-paths best 2</div>
<div>
<br /></div>
<div>
Remember <u>not</u> to use <b>bgp additional-paths select backup </b>- that command is for diverse-path or for local (non-advertised) selection of a backup route. You're trying to create a backup path, but that's still the wrong command.</div>
<div>
<br /></div>
<div>
So we used a few new commands here:</div>
<div>
<b>bgp additional-paths select best 2</b> - This calculates the best path and 2nd best path and <i>flags them</i> in BGP. The flag is purely local: it isn't carried in updates, so the neighbors aren't aware of what your flags are.</div>
<div>
<br /></div>
<div>
<div>
R4#sh ip bgp 9.9.9.9</div>
<div>
BGP routing table entry for 9.9.9.9/32, version 5</div>
<div>
Paths: (3 available, best #3, table default)</div>
<div>
Additional-path-install</div>
<div>
Path advertised to update-groups:</div>
<div>
19 20</div>
<div>
Refresh Epoch 1</div>
<div>
300, (Received from a RR-client)</div>
<div>
7.7.7.7 (metric 2) from 7.7.7.7 (7.7.7.7)</div>
<div>
<b> Origin IGP, metric 0, localpref 100, valid, internal, backup/repair, best2</b></div>
<div>
rx pathid: 0x1, tx pathid: 0x2</div>
<div>
Path not advertised to any peer</div>
<div>
Refresh Epoch 1</div>
<div>
300, (Received from a RR-client)</div>
<div>
8.8.8.8 (metric 2) from 8.8.8.8 (8.8.8.8)</div>
<div>
Origin IGP, metric 0, localpref 100, valid, internal</div>
<div>
rx pathid: 0x1, tx pathid: 0</div>
<div>
Path advertised to update-groups:</div>
<div>
19 20</div>
<div>
Refresh Epoch 1</div>
<div>
300, (Received from a RR-client)</div>
<div>
6.6.6.6 (metric 2) from 6.6.6.6 (6.6.6.6)</div>
<div>
<b> Origin IGP, metric 0, localpref 150, valid, internal, best</b></div>
<div>
rx pathid: 0x0, tx pathid: 0x0</div>
</div>
<div>
<br /></div>
<div>
You see we've flagged "best" and "best2".</div>
<div>
<br /></div>
<div>
<b>bgp additional-paths send receive</b></div>
<div>
<br />
Unlike all the fixes we've seen up until now, Add-Path is a negotiated feature. This is why there are so many workarounds for it - to get to full Add-Path you basically have to forklift upgrade your network. On that note, you need to tell your neighbors whether you have send, receive, or both send & receive capability. This can be done globally, as we've done here, or per neighbor with:<br />
<br />
R2(config-router)#neighbor 4.4.4.4 additional-paths ?<br />
disable Disable additional paths for this neighbor<br />
receive Receive additional paths from neighbors<br />
send Send additional paths to this neighbor<br />
<div>
<br /></div>
<div>
Note per-neighbor settings override the global settings.</div>
<div>
<br /></div>
<div>
<b>bgp additional-paths install</b></div>
<div>
<br /></div>
You can select additional-paths and pass them to neighbors without installing them in your RIB or FIB. This command should be on any device requiring PIC Edge, but if your route reflector isn't in the forwarding path, you may be able to omit it.<br />
<br /></div>
<div>
<b>neighbor X.X.X.X advertise additional-paths best 2</b></div>
<div>
<br />
Even if you've negotiated the Add-Path capability with your neighbor, you still need to tell the BGP process to advertise all of, or a subset of, your calculated best paths. The way it does this is via the flag system I described above. An important element of this is that the flags are not mutually exclusive. Let's say there are 4 paths with different next-hops. You could select "all" and "best 3", and the best 3 would be flagged with "best" <i>and</i> "all", and the 4th path would only be flagged with "all". We'll show an example of this below.<br />
<br />
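Here is the flag idea as a small Python sketch, using the 4-path example from the paragraph above. It is a mental model only, not IOS internals: the function names are mine, the fourth next hop (10.10.10.10) is made up, and I'm assuming the bestpath is always advertised regardless of flags.<br />
<br />

```python
def flag_paths(paths_best_first, select):
    """Flag each path according to the enabled 'select' policies."""
    flags = {nh: set() for nh in paths_best_first}
    for policy in select:
        if policy == "all":
            for nh in paths_best_first:
                flags[nh].add("all")
        else:  # "best N"
            limit = int(policy.split()[1])
            for rank, nh in enumerate(paths_best_first[:limit], start=1):
                flags[nh].add(f"best{rank}")
    return flags


def advertise(flags, paths_best_first, policy):
    """Return the paths sent to a neighbor for one 'advertise' policy."""
    if policy == "all":
        wanted = {"all"}
    else:  # "best N"
        limit = int(policy.split()[1])
        wanted = {f"best{rank}" for rank in range(1, limit + 1)}
    bestpath = paths_best_first[0]  # assumed to always be advertised
    extras = [nh for nh in paths_best_first[1:] if flags[nh] & wanted]
    return [bestpath] + extras


# Hypothetical 4-path example; 10.10.10.10 is an invented fourth next hop.
paths = ["6.6.6.6", "7.7.7.7", "8.8.8.8", "10.10.10.10"]

both = flag_paths(paths, ["all", "best 3"])
print(both["8.8.8.8"], both["10.10.10.10"])

# Selecting only "best 3" but advertising "all": nothing carries the "all" flag.
only_best3 = flag_paths(paths, ["best 3"])
print(advertise(only_best3, paths, "all"))
```

The last call returns only the bestpath, because "advertise all" looks for a flag that "select best 3" never sets.<br />
<br />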
Let's see the output from this.<br />
<br />
R4#sh ip bgp | b Network<br />
Network Next Hop Metric LocPrf Weight Path<br />
*>i 1.1.1.1/32 2.2.2.2 0 100 0 100 i<br />
*bia 3.3.3.3 0 100 0 100 i<br />
*bia9.9.9.9/32 7.7.7.7 0 100 0 300 i<br />
* i 8.8.8.8 0 100 0 300 i<br />
*>i 6.6.6.6 0 150 0 300 i</div>
<div>
<br /></div>
<div>
We see two paths for 1.1.1.1, and three paths for 9.9.9.9. </div>
<div>
Two are flagged with "b" for backup - this is a side-effect of using the <b>bgp additional-paths install</b>. </div>
<div>
"a" is the flag for additional-paths.</div>
<div>
You'd need to do a <b>sh ip bgp 9.9.9.9</b> to see the "best", "best2", etc flags, which I am omitting for brevity - there's already a sample further above.</div>
<div>
<br /></div>
<div>
<div>
R4#sh ip cef 9.9.9.9 det</div>
<div>
9.9.9.9/32, epoch 2, flags rib only nolabel, rib defined all labels</div>
<div>
recursive via 6.6.6.6</div>
<div>
nexthop 192.168.46.6 GigabitEthernet1.46</div>
<div>
recursive via 7.7.7.7, repair</div>
<div>
nexthop 192.168.47.6 GigabitEthernet1.47</div>
</div>
<div>
<br /></div>
<div>
We can see the repair path in the FIB.</div>
<div>
<br /></div>
<div>
On R2:</div>
<div>
<br /></div>
<div>
<div>
R2(config-router)#do sh ip bgp 9.9.9.9</div>
<div>
BGP routing table entry for 9.9.9.9/32, version 3</div>
<div>
Paths: (2 available, best #2, table default)</div>
<div>
Additional-path-install</div>
<div>
Path not advertised to any peer</div>
<div>
Refresh Epoch 1</div>
<div>
300</div>
<div>
7.7.7.7 (metric 3) from 4.4.4.4 (4.4.4.4)</div>
<div>
Origin IGP, metric 0, localpref 100, valid, internal, backup/repair, best2</div>
<div>
Originator: 7.7.7.7, Cluster list: 4.4.4.4</div>
<div>
rx pathid: 0x2, tx pathid: 0x1</div>
<div>
Path advertised to update-groups:</div>
<div>
29</div>
<div>
Refresh Epoch 1</div>
<div>
300</div>
<div>
6.6.6.6 (metric 3) from 4.4.4.4 (4.4.4.4)</div>
<div>
Origin IGP, metric 0, localpref 150, valid, internal, best</div>
<div>
Originator: 6.6.6.6, Cluster list: 4.4.4.4</div>
<div>
rx pathid: 0x0, tx pathid: 0x0</div>
</div>
<div>
<br /></div>
<div>
We see a <b>best </b>and <b>best2 </b>flag. It's important to note again that this is not learned from the route reflector, it's locally decided and set by the local <b>bgp additional-paths select best 2 </b>on R2. As mentioned above, I decided to use add-path from the edge BGP devices back towards the route-reflector to avoid the problem of the single-best-path replacing all the secondaries during convergence.</div>
<div>
<br /></div>
<div>
Another important note is the pathid. Add-Path's trickery to make this work doesn't make a real integral change to BGP - it still only passes one best, unique path - it just makes each additional path unique by adding a unique pathid. Note the pathids of 0x0 and 0x1 above. Think of these as similar to Route Distinguishers in VPNv4, which make otherwise identical routes unique.</div>
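<div>
A quick way to picture the pathid is as part of the lookup key. The Python below is only an analogy using plain dictionaries; the table names are mine, and the path IDs are the rx pathid values from R2's output above.</div>

```python
# Without a path ID, the prefix alone is the key, so a second
# advertisement for 9.9.9.9/32 replaces the first one.
plain_table = {}
plain_table["9.9.9.9/32"] = "6.6.6.6"
plain_table["9.9.9.9/32"] = "7.7.7.7"   # implicit replacement of the first path

# With Add-Path, the key is effectively (prefix, path ID), so both survive.
addpath_table = {}
addpath_table[("9.9.9.9/32", 0x0)] = "6.6.6.6"
addpath_table[("9.9.9.9/32", 0x2)] = "7.7.7.7"

print(len(plain_table), len(addpath_table))
```

<div>
The same prefix overwrites itself in the first table, while the added identifier keeps both paths alive in the second.</div>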
<div>
<br /></div>
<div>
<div>
R2#sh ip cef 9.9.9.9 det</div>
<div>
9.9.9.9/32, epoch 2, flags rib only nolabel, rib defined all labels</div>
<div>
recursive via 6.6.6.6</div>
<div>
nexthop 192.168.24.4 GigabitEthernet1.24 label 23</div>
<div>
nexthop 192.168.25.5 GigabitEthernet1.25 label 18</div>
<div>
recursive via 7.7.7.7, repair</div>
<div>
nexthop 192.168.24.4 GigabitEthernet1.24 label 16</div>
<div>
nexthop 192.168.25.5 GigabitEthernet1.25 label 20</div>
</div>
<div>
<br /></div>
<div>
And there's PIC Edge and Add-Path in action on R2.</div>
<div>
<br /></div>
<div>
I'm going to quickly cover the rest of the simpler Add-Path options.<br />
Just to recap, the route-reflector has chosen two best paths so far:<br />
<br />
R4#sh ip bgp 9.9.9.9 | s from<br />
300, (Received from a RR-client)<br />
7.7.7.7 (metric 2) from 7.7.7.7 (7.7.7.7)<br />
<b> Origin IGP, metric 0, localpref 100, valid, internal, backup/repair, best2</b><br />
rx pathid: 0x1, tx pathid: 0x2<br />
300, (Received from a RR-client)<br />
8.8.8.8 (metric 2) from 8.8.8.8 (8.8.8.8)<br />
Origin IGP, metric 0, localpref 100, valid, internal<br />
rx pathid: 0x1, tx pathid: 0<br />
300, (Received from a RR-client)<br />
6.6.6.6 (metric 2) from 6.6.6.6 (6.6.6.6)<br />
<b> Origin IGP, metric 0, localpref 150, valid, internal, best</b><br />
rx pathid: 0x0, tx pathid: 0x0<br />
<div>
<br /></div>
<div>
<div>
router bgp 200</div>
<div>
bgp additional-paths select best 3</div>
</div>
<div>
<br /></div>
<div>
<div>
R4#sh ip bgp 9.9.9.9 | s from</div>
<div>
300, (Received from a RR-client)</div>
<div>
7.7.7.7 (metric 2) from 7.7.7.7 (7.7.7.7)</div>
<div>
Origin IGP, metric 0, localpref 100, valid, internal, backup/repair, best2</div>
<div>
rx pathid: 0x1, tx pathid: 0x2</div>
<div>
300, (Received from a RR-client)</div>
<div>
8.8.8.8 (metric 2) from 8.8.8.8 (8.8.8.8)</div>
<div>
Origin IGP, metric 0, localpref 100, valid, internal, best3</div>
<div>
rx pathid: 0x1, tx pathid: 0x1</div>
<div>
300, (Received from a RR-client)</div>
<div>
6.6.6.6 (metric 2) from 6.6.6.6 (6.6.6.6)</div>
<div>
Origin IGP, metric 0, localpref 150, valid, internal, best</div>
<div>
rx pathid: 0x0, tx pathid: 0x0</div>
</div>
<div>
<br /></div>
Note we've added a pathid and "best3" to the remaining path. We'd be able to send those to neighbors if we wanted. With this config we're choosing 3 but sending 2.<br />
<br />
I found this option confusing initially:<br />
<br />
R4(config-router)#no neighbor 2.2.2.2 advertise additional-paths best 2<br />
R4(config-router)#neighbor 2.2.2.2 advertise additional-paths all<br />
% BGP: AF level 'bgp additional-paths select' more restrictive than advertising policy. This is a reminder that AF level additional-path select commands are needed.<br />
<div>
<br /></div>
The way I originally read this was, I've selected 3 best paths, and I want to send all 3 of them to my neighbor -- this is incorrect. Remember this is a flag system. <i>All</i> is a flag. None of our BGP prefixes are flagged with All, so we just broke Add-Path:<br />
<br />
R4(config-router-af)#do sh ip bgp neigh 2.2.2.2 adv | b Network<br />
Network Next Hop Metric LocPrf Weight Path<br />
*>i 9.9.9.9/32 6.6.6.6 0 150 0 300 i<br />
<div>
<br /></div>
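<div>
The flag behavior described above can be sketched as a toy model. This is not how IOS implements Add-Path internally; the <i>Path</i> class and both function names are hypothetical, and the only rules assumed are the ones visible in the outputs: the bestpath is always advertised, and any other path is advertised only if it carries a flag that the neighbor's advertise policy asks for.</div>

```python
from dataclasses import dataclass, field


@dataclass
class Path:
    next_hop: str
    flags: set = field(default_factory=set)  # e.g. {"best2", "all"}


def wanted_flags(policy: str) -> set:
    """Flags matched by an advertise policy such as 'all' or 'best 2'."""
    if policy == "all":
        return {"all"}
    kind, count = policy.split()
    if kind != "best":
        raise ValueError(f"unsupported policy: {policy}")
    # 'best 2' matches the flags best and best2; 'best 3' adds best3.
    return {"best"} | {f"best{i}" for i in range(2, int(count) + 1)}


def advertised(paths, policy: str) -> list:
    """Next hops sent to a neighbor: the bestpath plus any flag match."""
    wanted = wanted_flags(policy)
    return [p.next_hop for p in paths if "best" in p.flags or p.flags & wanted]


# 'select best 3' only: nothing carries the 'all' flag.
before = [Path("7.7.7.7", {"best2"}), Path("8.8.8.8", {"best3"}),
          Path("6.6.6.6", {"best"})]
# Once 'all' is also selected, the two extra paths carry both flags.
after = [Path("7.7.7.7", {"best2", "all"}), Path("8.8.8.8", {"best3", "all"}),
         Path("6.6.6.6", {"best"})]

print(advertised(before, "all"))     # only the bestpath survives
print(advertised(before, "best 2"))  # bestpath plus the best2 path
print(advertised(after, "all"))      # all three paths
```

<div>
The point of the model is that "select" and "advertise" are independent: advertise can only match flags that select has already put on a path.</div>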
Let's fix it.<br />
<i>All</i> is meant to simulate full-mesh iBGP with a route-reflector - if all routers use it, you'll get a similar outcome to all the routers being peered together.<br />
<br />
R4(config-router)#bgp additional-paths select all<br />
<div>
<br /></div>
R4(config-router)#do sh ip bgp 9.9.9.9 | s from<br />
300, (Received from a RR-client)<br />
7.7.7.7 (metric 2) from 7.7.7.7 (7.7.7.7)<br />
<b> Origin IGP, metric 0, localpref 100, valid, internal, backup/repair, best2, all</b><br />
rx pathid: 0x1, tx pathid: 0x1<br />
300, (Received from a RR-client)<br />
8.8.8.8 (metric 2) from 8.8.8.8 (8.8.8.8)<br />
<b> Origin IGP, metric 0, localpref 100, valid, internal, best3, all</b><br />
rx pathid: 0x1, tx pathid: 0x2<br />
300, (Received from a RR-client)<br />
6.6.6.6 (metric 2) from 6.6.6.6 (6.6.6.6)<br />
<b> Origin IGP, metric 0, localpref 150, valid, internal, best</b><br />
rx pathid: 0x0, tx pathid: 0x0<br />
<div>
<br /></div>
<div>
OK, now we're flagged with both <i>All</i> and <i>Best</i> simultaneously. As mentioned above, the select system is not mutually exclusive:</div>
</div>
<div>
<br /></div>
<div>
<div>
R4#sh run | i select</div>
<div>
bgp additional-paths select <b>all best 3</b></div>
</div>
<div>
<br /></div>
<div>
<div>
R2#sh ip bgp | b 9.9.9.9</div>
<div>
*bia9.9.9.9/32 7.7.7.7 0 100 0 300 i</div>
<div>
* i 8.8.8.8 0 100 0 300 i</div>
<div>
*>i 6.6.6.6 0 150 0 300 i</div>
</div>
<div>
<br /></div>
<div>
There are a few options you can pick under "select":</div>
<div>
<br /></div>
<div>
R4(config-router)#bgp additional-paths select ?</div>
<div>
<div>
all Select all available paths </div>
<div>
backup Select backup path </div>
<div>
best Select best N paths </div>
<div>
best-external Select best-external path</div>
<div>
group-best Select group-best path </div>
<div>
<br /></div>
<div>
All, we just covered.</div>
<div>
Backup is for diverse-path</div>
<div>
Best, we've covered</div>
<div>
Best-External is a feature that permits best-external selection on a route reflector. The use case for this is complicated and is out of scope for this document.</div>
<div>
Group-Best is also very complicated. </div>
<div>
<br /></div>
<div>
Let's discuss group-best at a very high level.</div>
<div>
<br /></div>
<div>
BGP, under normal circumstances, can end up in a scenario where it never converges - it never stabilizes. This is called BGP MED oscillation. Explaining this is beyond the scope of this document; however, this blog covers it well: <a href="http://ccieblog.co.uk/bgp/bgp-deterministic-med">http://ccieblog.co.uk/bgp/bgp-deterministic-med</a></div>
<div>
<br /></div>
<div>
BGP Deterministic MED can solve this problem.</div>
<div>
<br /></div>
<div>
However, this problem gets additionally complex with Add-Path. Group-Best solves these problems.</div>
<div>
This document covers this feature: <a href="http://inl.info.ucl.ac.be/system/files/add-paths-jsac.pdf">http://inl.info.ucl.ac.be/system/files/add-paths-jsac.pdf</a></div>
</div>
<div>
<br /></div>
<div>
Route-Maps can additionally be used with Add-Path.</div>
<div>
<br /></div>
<div>
<div>
R3(config-route-map)#match additional-paths advertise-set ?</div>
<div>
all BGP Add-Path advertise all paths</div>
<div>
best BGP Add-Path advertise best n paths</div>
<div>
best-range BGP Add-Path advertise best paths (range m to n)</div>
<div>
group-best BGP Add-Path advertise group-best path</div>
</div>
<div>
<br /></div>
<div>
The two use cases I've seen for the route maps are:</div>
<div>
- Setting the egress MED</div>
<div>
- Selecting specific routes with the "best" flag to advertise</div>
<div>
<br /></div>
<div>
For example, if you wanted to only advertise the 1st best and 3rd best routes:</div>
<div>
<div>
<br /></div>
<div>
R4:</div>
<div>
route-map block2ndbest deny 10</div>
<div>
match additional-paths advertise-set best-range 2 2 ! matches the "range" of 2 through 2</div>
<div>
route-map block2ndbest permit 20</div>
</div>
<div>
<br /></div>
<div>
Before:</div>
<div>
<br /></div>
<div>
<div>
R2#sh ip bgp | b 9.9.9.9</div>
<div>
*bia9.9.9.9/32 7.7.7.7 0 100 0 300 i</div>
<div>
* i 8.8.8.8 0 100 0 300 i</div>
<div>
*>i 6.6.6.6 0 150 0 300 i</div>
</div>
<div>
<br /></div>
<div>
<div>
R4(config)#router bgp 200</div>
<div>
R4(config-router)#neighbor 2.2.2.2 route-map block2ndbest out</div>
</div>
<div>
R4(config-router)#do clear ip bgp * soft out</div>
<div>
<br /></div>
<div>
After:</div>
<div>
<div>
R2#sh ip bgp | b 9.9.9.9</div>
<div>
*bia9.9.9.9/32 8.8.8.8 0 100 0 300 i</div>
<div>
*>i 6.6.6.6 0 150 0 300 i</div>
</div>
<div>
<br /></div>
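<div>
As a rough sketch of what the route-map above does (a toy model only, with a hypothetical function name and the best-N rank written out by hand), a deny on <i>best-range 2 2</i> followed by an empty permit behaves like a filter on the rank of each path:</div>

```python
def apply_deny_best_range(ranked_paths, low, high):
    """Drop paths whose best-N rank falls in low..high; permit the rest."""
    denied = range(low, high + 1)
    return [next_hop for rank, next_hop in ranked_paths if rank not in denied]


# Ranks as flagged on R4: best, best2, best3.
ranked = [(1, "6.6.6.6"), (2, "7.7.7.7"), (3, "8.8.8.8")]
print(apply_deny_best_range(ranked, 2, 2))
```

<div>
With the range 2 through 2, only the second-best path (7.7.7.7) is removed, which matches the "After" output on R2.</div>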
<div>
As I mentioned the MED can be modified on a per-bestpath basis as well, but only from edge BGP device -> RR or edge BGP device -> edge BGP device. Route reflectors are not permitted to set MED. </div>
<div>
<br /></div>
<div>
Hope you enjoyed,</div>
<div>
<br /></div>
<div>
Jeff</div>
<div>
<br /></div>
brbcciehttp://www.blogger.com/profile/14586635047530183862noreply@blogger.com12tag:blogger.com,1999:blog-5968686435283454526.post-2775614808377631052014-07-08T21:39:00.003-07:002014-07-08T21:39:48.340-07:00VTP v3VTP v3 isn't technically a "new addition" to the CCIE lab, but code versions prohibited it from being used up until recently. I've been told IOL does in fact support VTP v3, so it should be considered a viable lab topic now.<br />
<div>
<br />
So what's new in VTP v3? In no particular order:<br />
<br />
- Supports extended VLANs (1006 - 4094)<br />
- Support for propagating Private VLANs<br />
- Support for propagating Multiple Spanning Tree<br />
- Support for flagging VLANs as RSPAN (disables MAC learning on the VLAN)<br />
- Fixes the bane of VTP v1/2, the accidental-high-configuration-revision-wipes-out-your-network issue.<br />
- VTP can now be turned off completely, as opposed to just transparent mode<br />
- Support for hidden passwords<br />
<br />
My lab is two 3560s running 15.0(2)SE6, and one 3550 running 12.x code. The 3560s support VTP v3, the 3550 I'm just using to show backwards compatibility to v2.<br />
<br />
My switches are plugged in in a row:<br />
<br />
SW1 3560 --> SW2 3560 --> SW3 3550<br />
<br />
Let's enable v3 on SW1 and SW2:<br />
<br />
SW1(config)#vtp version 3<br />
Cannot set the version to 3 because domain name is not configured<br />
<br />
You can't enable v3 without specifying a domain. Previous versions of VTP simply inherited the domain name from a neighbor if you didn't specify one. This ties into the security measures: we don't necessarily want to participate in the neighbor's VTP process, so v3 makes no assumptions.<br />
<br />
SW1(config)#vtp domain CCIE<br />
Changing VTP domain name from NULL to CCIE<br />
*Mar 1 00:04:08.638: %SW_VLAN-6-VTP_DOMAIN_NAME_CHG: VTP domain name changed to CCIE.<br />
SW1(config)#vtp version 3<br />
*Mar 1 00:04:12.908: %SW_VLAN-6-OLD_CONFIG_FILE_READ: Old version 2 VLAN configuration file detected and read OK. Version 3<br />
files will be written in the future.<br />
<br />
I've done the same on SW2.<br />
<br />
Let's try adding some VLANs.<br />
<br />
SW1(config)#vtp mode server<br />
Setting device to VTP Server mode for VLANS.<br />
SW1(config)#vlan 100<br />
VTP VLAN configuration not allowed when device is not the primary server for vlan database.<br />
<div>
<br /></div>
<div>
Let's stop here and talk about a huge problem with previous versions of VTP. As a network consultant, I always recommend - especially prior to version 3 - that customers use VTP mode transparent. The problem is that VTP devices - <i>VTP clients included</i> - can have their VLANs removed or changed while not connected to the mothership, and inadvertently end up with a higher configuration revision. When that switch is introduced, or reintroduced, to the greater network, the higher configuration revision "wins", and the rest of the network replicates that VLAN database, erasing their own VLANs. This can be so dramatic that the entire network can end up with just VLAN 1, and the entire layer 2 domain goes down. This is a very easy problem to create, and causes a dramatic outage.</div>
</div>
<div>
<br /></div>
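<div>
The failure mode described above comes down to a single comparison. Here is a minimal sketch of it, assuming a simplified VTP v1/2 where a switch in the same domain adopts any advertised database with a higher configuration revision; the <i>VlanDb</i> class, the function name and the revision numbers are all made up for illustration:</div>

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VlanDb:
    revision: int
    vlans: frozenset


def receive_v2(local: VlanDb, advert: VlanDb) -> VlanDb:
    """Simplified v1/v2 rule: the higher configuration revision wins."""
    return advert if advert.revision > local.revision else local


production = VlanDb(revision=12, vlans=frozenset({1, 10, 20, 30}))
# A lab switch whose VLANs were deleted offline, bumping its revision.
reintroduced = VlanDb(revision=40, vlans=frozenset({1}))

result = receive_v2(production, reintroduced)
print(sorted(result.vlans))  # the production VLANs are gone
```

<div>
Nothing in that comparison asks who made the change, which is exactly the gap the v3 primary server closes.</div>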
<div>
VTPv3 can no longer create this issue.</div>
<div>
<br /></div>
<div>
VTP clients and secondary servers cannot write the VLAN database. What's a secondary server? Well, it's any server that isn't the primary! (sorry, couldn't resist).</div>
<div>
<br /></div>
<div>
There can only be one <i>primary</i> server. The primary server is the only server allowed to write the VLAN database:</div>
<div>
<br /></div>
<div>
<div>
SW1#vtp primary vlan</div>
<div>
This system is becoming primary server for feature vlan</div>
<div>
No conflicting VTP3 devices found.</div>
<div>
Do you want to continue? [confirm]</div>
</div>
<div>
<div>
SW1#</div>
<div>
*Mar 1 00:30:19.564: %SW_VLAN-4-VTP_PRIMARY_SERVER_CHG: 0014.1ceb.f600 has become the primary server for the VLAN VTP feature</div>
</div>
<div>
<br /></div>
<div>
SW1 is now the only device that can make changes to the contiguous v3 PVST VLAN database. Note that the command <b>vtp primary vlan</b> is run from privileged EXEC mode and is not saved to the config - if you reboot, you lose this privilege. This completely eliminates the possibility of having a plug-and-play way of accidentally overwriting another network's VTP database.</div>
<div>
<br /></div>
<div>
<div>
SW1#conf t</div>
<div>
Enter configuration commands, one per line. End with CNTL/Z.</div>
<div>
SW1(config)#vlan 100</div>
<div>
SW1(config-vlan)#exit</div>
</div>
<div>
<br /></div>
<div>
<div>
SW2#show vlan | i 100</div>
<div>
100 VLAN0100 active</div>
</div>
<div>
<output omitted for brevity></div>
<div>
<br /></div>
<div>
I'd like to spend a moment looking at the primary server takeover process in a little more detail.</div>
<div>
As I mentioned, only one server can be primary.</div>
<div>
<br /></div>
<div>
So if we do this from SW2:</div>
<div>
<br /></div>
<div>
<div>
SW2#vtp primary vlan</div>
<div>
This system is becoming primary server for feature vlan</div>
<div>
No conflicting VTP3 devices found.</div>
<div>
Do you want to continue? [confirm]</div>
<div>
SW2#</div>
<div>
*Mar 1 00:52:46.632: %SW_VLAN-4-VTP_PRIMARY_SERVER_CHG: 0014.1cec.0280 has become the primary server for the VLAN VTP feature</div>
</div>
<div>
<br /></div>
<div>
So I'm confused by "No conflicting VTP3 devices found." I'm not sure what a conflicting server would be if not the existing primary server, but my switches always produce this output, so maybe it's a version/platform issue.</div>
<div>
<br /></div>
<div>
Anyway, if you look at SW1:</div>
<div>
<br /></div>
<div>
<div>
SW1(config)#</div>
<div>
*Mar 1 00:53:36.963: %SW_VLAN-4-VTP_PRIMARY_SERVER_CHG: 0014.1cec.0280 has become the primary server for the VLAN VTP feature</div>
</div>
<div>
<br /></div>
<div>
<div>
SW1#show vtp status</div>
<div>
VTP Version capable : 1 to 3</div>
<div>
VTP version running : 3</div>
<div>
VTP Domain Name : CCIE</div>
<div>
VTP Pruning Mode : Disabled</div>
<div>
VTP Traps Generation : Disabled</div>
<div>
Device ID : 0014.1ceb.f600</div>
<div>
<br /></div>
<div>
Feature VLAN:</div>
<div>
--------------</div>
<div>
VTP Operating Mode : Server</div>
<div>
Number of existing VLANs : 6</div>
<div>
Number of existing extended VLANs : 0</div>
<div>
Maximum VLANs supported locally : 1005</div>
<div>
Configuration Revision : 2</div>
<div>
Primary ID : 0014.1cec.0280</div>
<div>
Primary Description : SW2</div>
<div>
MD5 digest : 0x73 0x33 0x29 0x15 0x3B 0xA7 0x29 0x04</div>
<div>
0x74 0x34 0x70 0x4F 0x58 0x74 0xAF 0x5E</div>
<div>
<br /></div>
<div>
<br /></div>
<div>
Feature MST:</div>
<div>
--------------</div>
<div>
VTP Operating Mode : Transparent</div>
<div>
<br /></div>
<div>
<br /></div>
<div>
Feature UNKNOWN:</div>
<div>
--------------</div>
<div>
VTP Operating Mode : Transparent</div>
</div>
<div>
<br /></div>
<div>
Note the <b>Device ID</b> - 0014.1ceb.f600, and then the <b>Primary ID </b>of Feature VLAN - 0014.1cec.0280 (SW2's ID). SW2 just stole it from SW1. </div>
<div>
<br /></div>
<div>
It's hard to show in the blog, but the process of becoming primary actually takes a little bit. There's a quicker way to steal it, which "doesn't check for conflicting devices" (not that I can seem to find conflicting devices anyway):</div>
<div>
<br /></div>
<div>
<div>
SW1#vtp primary vlan force</div>
<div>
This system is becoming primary server for feature vlan</div>
<div>
SW1#</div>
<div>
*Mar 1 00:59:32.112: %SW_VLAN-4-VTP_PRIMARY_SERVER_CHG: 0014.1ceb.f600 has become the primary server for the VLAN VTP feature</div>
</div>
<div>
<br /></div>
<div>
While I'm on the topic, you can actually see your VTP neighbors now (provided they're running v3):</div>
<div>
<br /></div>
<div>
<div>
SW1#show vtp device</div>
<div>
Retrieving information from the VTP domain. Waiting for 5 seconds.</div>
<div>
<br /></div>
<div>
VTP Feature Conf Revision Primary Server Device ID Device Description</div>
<div>
------------ ---- -------- -------------- -------------- ----------------------</div>
<div>
VLAN No 2 0014.1ceb.f600 0014.1cec.0280 SW2</div>
</div>
<div>
<br /></div>
<div>
Let's try to add some high-number VLANs now.</div>
<div>
<br /></div>
<div>
On the primary:</div>
<div>
<div>
SW1(config)#vlan 1006</div>
<div>
SW1(config-vlan)#exit</div>
</div>
<div>
<br /></div>
<div>
On the secondary, verifying</div>
<div>
<div>
SW2#show vlan | i 1006</div>
<div>
1006 VLAN1006 active</div>
</div>
<div>
<output omitted for brevity></div>
<div>
<br /></div>
<div>
Let's make a routed port on the secondary (an SVI would work as well):</div>
<div>
<div>
SW2(config)#int fa0/1</div>
<div>
SW2(config-if)#no switchport</div>
<div>
SW2(config-if)#ip address 192.168.0.1 255.255.255.0</div>
</div>
<div>
<br /></div>
<div>
Adding another VLAN on the primary:</div>
<div>
<div>
SW1(config)#vlan 1007</div>
<div>
SW1(config-vlan)#exit</div>
</div>
<div>
<br /></div>
<div>
And verifying on the secondary:</div>
<div>
<div>
SW2(config-if)#</div>
<div>
*Mar 1 01:04:07.384: %PM-4-EXT_VLAN_INUSE: VLAN 1007 currently in use by FastEthernet0/1</div>
<div>
*Mar 1 01:04:07.384: %SW_VLAN-4-VLAN_CREATE_FAIL: Failed to create VLANs 1007: VLAN(s) not available in Port Manager</div>
</div>
<div>
<br /></div>
<div>
Well that didn't work.</div>
<div>
<br /></div>
<div>
Whenever you create a routed interface (SVI or interface-based) on a Catalyst switch, it assigns a VLAN between the routed interface and the control plane (best I can tell, that's what's happening...). I've heard an argument that this behavior has to do with allocating the BIA MAC addresses to routed interfaces, but if you look around, there are some Catalyst switches that just assign the same MAC to every routed interface by default, yet they still all use separate VLAN numbers, so I'm not inclined to believe that.</div>
<div>
<br /></div>
<div>
Anyway, the default behavior on my 3560s is to allocate these VLANs from 1006 and counting upwards, so if 1006 is in use, 1007 will be grabbed, which is what we just saw. </div>
<div>
<br /></div>
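<div>
The ascending behavior is easy to picture as a first-free search. This is only a sketch of what I observed on the 3560s, not Cisco's actual allocator; the function name is hypothetical, and the 1006 to 4094 bounds are taken from the extended VLAN range mentioned earlier:</div>

```python
def allocate_internal_vlan(in_use, start=1006, end=4094):
    """Return the first VLAN ID at or above start that is not in use."""
    for vlan in range(start, end + 1):
        if vlan not in in_use:
            return vlan
    raise RuntimeError("no free internal VLAN")


# VLAN 1006 already exists (learned via VTP), so the routed port gets 1007.
print(allocate_internal_vlan({1006}))
```

<div>
That is why the later attempt to create VLAN 1007 from the primary server fails on SW2: the number has already been taken by Fa0/1.</div>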
<div>
<b>vlan internal allocation policy ascending</b></div>
<div>
<br /></div>
<div>
You've seen this command if you've used a 3K series switch before, because you can't turn it off.</div>
<div>
<br /></div>
<div>
<div>
SW1#sh run | i ascen</div>
<div>
vlan internal allocation policy ascending</div>
<div>
SW1#conf t</div>
<div>
Enter configuration commands, one per line. End with CNTL/Z.</div>
<div>
SW1(config)#no vlan internal allocation policy ascending</div>
<div>
SW1(config)#do sh run | i ascend</div>
</div>
<div>
vlan internal allocation policy ascending</div>
<div>
<div>
SW1(config)#vlan internal allocation policy ?</div>
<div>
ascending Allocate internal VLAN in ascending order</div>
<div>
SW1(config)#vlan internal allocation policy ascending ?</div>
<div>
<cr></div>
<div>
<br /></div>
<div>
SW1(config)#vlan internal allocation policy ascending</div>
</div>
<div>
<br /></div>
<div>
So, tough luck, they're ascending! </div>
<div>
<br /></div>
<div>
Some higher-end platforms support descending. I don't have one sitting around to lab, but I'm told descending starts at 4094 and counts downward instead.</div>
<div>
<br /></div>
<div>
Back to reality on my 3560s...</div>
<div>
<div>
SW2#show vlan internal usage</div>
<div>
<br /></div>
<div>
VLAN Usage</div>
<div>
---- --------------------</div>
<div>
1007 FastEthernet0/1</div>
</div>
<div>
<br /></div>
<div>
Shutting down Fa0/1 releases Vlan 1007, but we're now out-of-sync. I let this sit a while and it never caught up, so I'm guessing you're just out of luck until the primary server pushes another update.</div>
<div>
<br /></div>
<div>
<div>
SW2(config)#int fa0/1</div>
<div>
SW2(config-if)#shut</div>
</div>
<div>
<br /></div>
<div>
<div>
SW1(config)#no vlan 1007</div>
<div>
SW1(config)#vlan 1007</div>
<div>
SW1(config-vlan)#exit</div>
</div>
<div>
<br /></div>
<div>
<div>
SW2(config-if)#do show vlan | i 1007</div>
<div>
1007 VLAN1007 active</div>
</div>
<div>
<output omitted></div>
<div>
<br /></div>
<div>
So we've covered the primary VTP server. What exactly is the difference between a secondary VTP server and a VTP client?</div>
<div>
<br /></div>
<div>
Well, on the 3560, not much.</div>
<div>
<br /></div>
<div>
<div>
According to the VTPv3 documentation:</div>
<div>
"Client: A device using a local temporary storage space (for example, DRAM) to hold via VTP received information for runtime use. This information is used to update other devices, such as a device that is working as a server. Local configuration of devices in the client role is not possible. After booting, a client device issues a VTP message asking for the configuration of other VTP devices."</div>
<div>
<br /></div>
<div>
This implies that the VTP secondary server saves its database to flash and the client doesn't store it at all. And on the 3560? My VTP clients (and secondary servers) store the full VTP database and will load it up every time unless I manually delete it. So in practice, on this equipment, I can't actually find a difference between VTP secondary servers and VTP clients.</div>
<div>
<br /></div>
<div>
I'm told a best practice is to demote the primary vtp server when you're done making changes. The method to accomplish that is not particularly clear, but here you have it:</div>
<div>
<br /></div>
<div>
<div>
SW1(config)#vtp mode client</div>
<div>
Setting device to VTP Client mode for VLANS.</div>
<div>
SW1(config)#vtp mode server</div>
<div>
Setting device to VTP Server mode for VLANS.</div>
</div>
<div>
<br /></div>
<div>
Now you're a secondary server.</div>
<div>
<br /></div>
<div>
Let's take a look at the new password features. The old, plain-text password still works the same way. The hidden ones, however:</div>
<div>
<br /></div>
<div>
<div>
SW1(config)#vtp password CANTSEEME hidden</div>
<div>
Setting device VTP password</div>
</div>
</div>
<div>
<br /></div>
<div>
<div>
SW1#show vtp password</div>
<div>
VTP Password: 80B0218C160CD951A38982EECCC22AD5</div>
</div>
<div>
<br /></div>
<div>
There's apparently no way to recover it, even snooping through the vlan.dat file, so if you need to add a switch without giving the password out:</div>
<div>
<br /></div>
<div>
<div>
SW2(config)#vtp password 80B0218C160CD951A38982EECCC22AD5 secret</div>
<div>
Setting device VTP password</div>
</div>
<div>
<br /></div>
<div>
Of note, the password is required (the unencrypted one) in order to promote a server to primary:</div>
<div>
<div>
SW1#vtp primary vlan</div>
<div>
This system is becoming primary server for feature vlan</div>
<div>
<b>Enter VTP Password:</b></div>
</div>
<div>
<br /></div>
<div>
How about interoperability with previous versions?</div>
<div>
<br /></div>
<div>
Previous version switches will promote themselves from v1 to v2 if connected to a v3 device:</div>
<div>
<br /></div>
<div>
Prior to talking to SW2:</div>
<div>
<br /></div>
<div>
<div>
SW3#show vtp status</div>
<div>
VTP Version : running VTP1 (VTP2 capable)</div>
<div>
Configuration Revision : 0</div>
<div>
Maximum VLANs supported locally : 1005</div>
<div>
Number of existing VLANs : 5</div>
<div>
VTP Operating Mode : Server</div>
<div>
VTP Domain Name :</div>
<div>
VTP Pruning Mode : Disabled</div>
<div>
VTP V2 Mode : Disabled</div>
<div>
VTP Traps Generation : Disabled</div>
<div>
MD5 digest : 0x57 0xCD 0x40 0x65 0x63 0x59 0x47 0xBD</div>
<div>
Configuration last modified by 0.0.0.0 at 0-0-00 00:00:00</div>
<div>
Local updater ID is 0.0.0.0 (no valid interface found)</div>
</div>
<div>
<br /></div>
<div>
After talking to SW2:</div>
<div>
<div>
SW3#show vtp status</div>
<div>
VTP Version : running VTP2</div>
<div>
Configuration Revision : 1</div>
<div>
Maximum VLANs supported locally : 1005</div>
<div>
Number of existing VLANs : 7</div>
<div>
VTP Operating Mode : Server</div>
<div>
VTP Domain Name : CCIE</div>
<div>
VTP Pruning Mode : Disabled</div>
<div>
VTP V2 Mode : Enabled</div>
<div>
VTP Traps Generation : Disabled</div>
<div>
MD5 digest : 0x60 0xEC 0xC1 0xEF 0xEF 0xE3 0x24 0xB6</div>
<div>
Configuration last modified by 0.0.0.0 at 3-1-93 00:00:58</div>
<div>
Local updater ID is 0.0.0.0 (no valid interface found)</div>
</div>
<div>
<br /></div>
<div>
Note the automatic V2 change. I actually had a heck of a time getting this working when I re-labbed it for this document, and that's because I had left a hidden password on SW1 and SW2. VTPv1/2 don't understand hidden passwords, so turn that off first!</div>
<div>
<br /></div>
<div>
The important thing to grasp about the v2 compatibility is this must be a one-way path: The v3 network needs to make the database changes. It's best to keep the entire v2 domain as clients. If you make changes in v2, the v3 devices will not accept the changes, but the v2 domain will up its configuration revision number. Then when v3 pushes a legitimate update, the v2 domain will reject it because it will by definition be lower than that of v2. You end up with a segmented VTP domain, and a royal mess in the v2 network:</div>
<div>
<br /></div>
<div>
<div>
*Mar 1 06:17:23.646: %SW_VLAN-4-VTP_USER_NOTIFICATION: VTP protocol user notification: MD5 digest checksum mismatch on receipt of equal revision summary on trunk: Fa0/21</div>
<div>
<br /></div>
<div>
*Mar 1 06:17:23.650: %SW_VLAN-4-VTP_USER_NOTIFICATION: VTP protocol user notification: MD5 digest checksum mismatch on receipt of equal revision summary on trunk: Fa0/22</div>
<div>
<br /></div>
<div>
Disabling VTP: Why?</div>
</div>
<div>
<br /></div>
<div>
Most of us have been happy to use transparent for years. The big difference with disabling VTP as opposed to using transparent mode is that the switch won't even pass VTP messages in "off" mode; it deliberately filters them. The benefit would be at a network administrative boundary, like connecting trunks between two carriers.</div>
<div>
<br /></div>
<div>
Globally:</div>
<div>
<div>
SW2(config)#vtp mode off</div>
<div>
Setting device to VTP Off mode for VLANS.</div>
</div>
<div>
<br /></div>
<div>
Per interface:</div>
<div>
<div>
SW2(config)#int fa0/21</div>
<div>
SW2(config-if)#no vtp</div>
</div>
<div>
<br /></div>
<div>
The Private VLAN support sounds daunting, but it really does a very simple task. All it does is carry the VLAN associations, it's not assigning interface or trunk configs anywhere.</div>
<div>
<br /></div>
<div>
<div>
SW1(config-vlan)#vlan 601</div>
<div>
SW1(config-vlan)# private-vlan isolated</div>
</div>
<div>
<div>
SW1(config-vlan)#</div>
</div>
<div>
<div>
SW1(config-vlan)#vlan 600</div>
<div>
SW1(config-vlan)# private-vlan primary</div>
<div>
SW1(config-vlan)# private-vlan association 601</div>
<div>
<br /></div>
</div>
<div>
It doesn't matter where I trunk this, or what ports are applied in what fashion. This is basically all we're replicating:</div>
<div>
<br /></div>
<div>
<div>
SW1#show vlan private-vlan</div>
<div>
<br /></div>
<div>
Primary Secondary Type Ports</div>
<div>
------- --------- ----------------- ------------------------------------------</div>
<div>
600 601 isolated</div>
</div>
<div>
<br /></div>
<div>
<div>
SW2#show vlan private-vlan</div>
<div>
<br /></div>
<div>
Primary Secondary Type Ports</div>
<div>
------- --------- ----------------- ------------------------------------------</div>
<div>
600 601 isolated Fa0/6</div>
</div>
<div>
<br /></div>
<div>
Of note, and this will make more sense as we move into MST, the private-vlan feature is an add-on to the "VLAN" feature, which you may have noticed inside the show vtp status output:</div>
<div>
<br /></div>
<div>
<div>
SW1#show vtp status</div>
<div>
VTP Version capable : 1 to 3</div>
<div>
VTP version running : 3</div>
<div>
VTP Domain Name : CCIE</div>
<div>
VTP Pruning Mode : Disabled</div>
<div>
VTP Traps Generation : Disabled</div>
<div>
Device ID : 0014.1ceb.f600</div>
<div>
<br /></div>
<div>
Feature VLAN:</div>
<div>
--------------</div>
<div>
VTP Operating Mode : Primary Server</div>
<div>
Number of existing VLANs : 9</div>
<div>
Number of existing extended VLANs : 2</div>
<div>
Maximum VLANs supported locally : 1005</div>
<div>
Configuration Revision : 3</div>
<div>
Primary ID : 0014.1ceb.f600</div>
<div>
Primary Description : SW1</div>
<div>
MD5 digest : 0xBE 0x75 0xED 0x56 0xCB 0xAF 0xF3 0xF6</div>
<div>
0x59 0x8D 0x91 0x6C 0x60 0x28 0x55 0xEB</div>
<div>
<br /></div>
<div>
<br /></div>
<div>
Feature MST:</div>
<div>
--------------</div>
<div>
VTP Operating Mode : Transparent</div>
<div>
<br /></div>
<div>
<br /></div>
<div>
Feature UNKNOWN:</div>
<div>
--------------</div>
<div>
VTP Operating Mode : Transparent</div>
</div>
<div>
<br /></div>
<div>
Now let's talk about those last two, Feature MST and Feature UNKNOWN.</div>
<div>
<br /></div>
<div>
The MST config is pretty cool. If you choose to run MST, the big drag is the manual configuration updates, and this fixes all of them.</div>
<div>
<br /></div>
<div>
Just as with feature VLAN, we need to make a primary server for MST. <b>Note, this does not have to be the same switch as the feature VLAN.</b> In fact, I'm leaving SW1 the primary for feature VLAN, and making SW2 the primary for feature MST:</div>
<div>
<br /></div>
<div>
<div>
SW2(config)#vtp mode server mst</div>
<div>
Setting device to VTP Server mode for MST.</div>
<div>
SW2(config)#exit</div>
<div>
<br /></div>
<div>
SW2#vtp primary mst force</div>
<div>
This system is becoming primary server for feature mst</div>
</div>
<div>
<div>
SW2#</div>
<div>
*Mar 1 00:17:26.453: %SW_VLAN-4-VTP_PRIMARY_SERVER_CHG: 0014.1cec.0280 has become the primary server for the MST VTP feature</div>
</div>
<div>
<br /></div>
<div>
<div>
Feature VLAN:</div>
<div>
--------------</div>
<div>
VTP Operating Mode :<b> Server</b></div>
<div>
Number of existing VLANs : 9</div>
<div>
Number of existing extended VLANs : 2</div>
<div>
Maximum VLANs supported locally : 1005</div>
<div>
Configuration Revision : 3</div>
<div>
Primary ID :<b> 0014.1ceb.f600</b></div>
<div>
Primary Description : <b>SW1</b></div>
<div>
MD5 digest : 0xBE 0x75 0xED 0x56 0xCB 0xAF 0xF3 0xF6</div>
<div>
0x59 0x8D 0x91 0x6C 0x60 0x28 0x55 0xEB</div>
<div>
<br /></div>
<div>
<br /></div>
<div>
Feature MST:</div>
<div>
--------------</div>
<div>
VTP Operating Mode : <b>Primary Server</b></div>
<div>
Configuration Revision : 1</div>
<div>
Primary ID : <b>0014.1cec.0280</b></div>
<div>
Primary Description : SW2</div>
<div>
<output omitted for brevity></div>
</div>
<div>
<br /></div>
<div>
<div>
SW1(config)#vtp mode client mst</div>
<div>
Setting device to VTP Client mode for MST.</div>
</div>
<div>
<br /></div>
<div>
<div>
SW2(config)#spanning-tree mst config</div>
<div>
SW2(config-mst)#instance 1 vlan 1-100</div>
<div>
SW2(config-mst)#instance 2 vlan 101-200</div>
<div>
SW2(config-mst)#instance 3 vlan 201-300</div>
<div>
SW2(config-mst)#name region1</div>
<div>
SW2(config-mst)#revision 1</div>
<div>
SW2(config-mst)#exit</div>
<div>
<br /></div>
<div>
SW2(config)#spanning-tree mode mst</div>
<div>
<br /></div>
<div>
SW2#show spanning-tree mst config</div>
<div>
Name [region1]</div>
<div>
Revision 1 Instances configured 4</div>
<div>
<br /></div>
<div>
Instance Vlans mapped</div>
<div>
-------- ---------------------------------------------------------------------</div>
<div>
0 301-4094</div>
<div>
1 1-100</div>
<div>
2 101-200</div>
<div>
3 201-300</div>
<div>
-------------------------------------------------------------------------------</div>
<div>
<br /></div>
<div>
Enable MST on SW1:</div>
<div>
<br /></div>
<div>
SW1(config)#spanning-tree mode mst</div>
<div>
<br /></div>
<div>
And we'll see it has the configuration already:</div>
<div>
<br /></div>
<div>
SW1#show span mst config</div>
<div>
Name [region1]</div>
<div>
Revision 1 Instances configured 4</div>
<div>
<br /></div>
<div>
Instance Vlans mapped</div>
<div>
-------- ---------------------------------------------------------------------</div>
<div>
0 301-4094</div>
<div>
1 1-100</div>
<div>
2 101-200</div>
<div>
3 201-300</div>
<div>
-------------------------------------------------------------------------------</div>
<div>
<br /></div>
<div>
Another nifty thing is that it actually updates the running config to match:</div>
<div>
<br /></div>
<div>
SW1#show run | s spanning-tree mst</div>
<div>
spanning-tree mst configuration</div>
<div>
name region1</div>
<div>
revision 1</div>
<div>
instance 1 vlan 1-100</div>
<div>
instance 2 vlan 101-200</div>
<div>
instance 3 vlan 201-300</div>
</div>
<div>
<br /></div>
<div>
Reverting to PVST for simplicity --</div>
<div>
<div>
SW1(config)#spanning-tree mode rapid-pvst</div>
</div>
<div>
SW2(config)#spanning-tree mode rapid-pvst</div>
<div>
<br /></div>
<div>
The Remote SPAN flag is quite simple:</div>
<div>
<br /></div>
<div>
<div>
SW1(config)#vlan 150</div>
<div>
SW1(config-vlan)#remote-span</div>
</div>
<div>
<br /></div>
<div>
<div>
SW2#show vlan remote-span</div>
<div>
<br /></div>
<div>
Remote SPAN VLANs</div>
<div>
------------------------------------------------------------------------------</div>
<div>
150</div>
</div>
<div>
<br /></div>
<div>
The purpose here is to tell all the switches in the forwarding path of the remote SPAN not to learn MAC addresses on that VLAN.</div>
<div>
<br /></div>
<div>
Feature UNKNOWN is actually kinda cool too, although I have no way of demonstrating it. VTPv3 is designed to carry different types of databases, so that it can be adapted to other replication tasks in the future. So what's an earlier VTPv3 IOS to do with these new formats it doesn't understand? Forward them or drop them?</div>
<div>
<br /></div>
<div>
<div>
SW1(config)#vtp mode off unknown</div>
<div>
Setting device to VTP Off mode for unknown instances.</div>
<div>
SW1(config)#vtp mode transparent unknown</div>
<div>
Setting device to VTP Transparent mode for unknown instances.</div>
</div>
<div>
<br /></div>
<div>
You can set them only to <b>off </b>or <b>transparent</b>. Clearly can't be a VTP server for a format you don't understand. <b>off</b> works the same way explained above, drop the traffic; <b>transparent </b>forwards without processing.</div>
<div>
<br /></div>
<div>
And lastly, where's this at in the DocCD?:</div>
<div>
<br /></div>
<div>
<div>
Switches -> </div>
<div>
3850 -> </div>
<div>
Catalyst 3850-12S-E Switch -> </div>
<div>
Configuration Guides -> </div>
<div>
VLAN Configuration Guide, Cisco IOS XE Release 3SE (Catalyst 3850 Switches) -></div>
<div>
Configuring VTP</div>
</div>
<div>
<br /></div>
<div>
v3 is just a section in the larger VTP configuration guide, but everything you need should be there.</div>
<div>
<br /></div>
<div>
Cheers,</div>
<div>
<br /></div>
<div>
Jeff</div>
<div>
<br /></div>
<div>
<br /></div>
brbcciehttp://www.blogger.com/profile/14586635047530183862noreply@blogger.com10tag:blogger.com,1999:blog-5968686435283454526.post-9855373633646475352014-07-05T17:07:00.002-07:002014-07-05T18:01:45.529-07:00IPv6 First Hop SecurityIPv6 First Hop Security is a new topic for CCIE v5. It's important to note that at the time of this writing (June/July 2014), IPv6 FH Security is not supported in IOL, so this cannot be on the CLI-based parts of the lab yet, but it can be in diagnostics or the written.<br />
<div>
<br /></div>
<div>
The biggest barrier to understanding IPv6 FH Security is understanding the whole first hop process to begin with. IPv6 changes this dramatically from the IPv4 model. First, we will examine IPv6 First Hop and determine where the security problems are.</div>
<div>
<br /></div>
<div>
First, let's answer a simple question: How do I receive a routable IPv6 address?</div>
<div>
<br /></div>
<div>
ICMPv6 communicates nearly everything regarding IPv6 addressing. It's used for finding a router, building an address (typically), making sure that address is unique, finding a DHCP server if necessary, locating other hosts, assigning a default route, and so on.<br />
<br /></div>
<div>
</div>
<div>
That said, IPv6 has a major chicken-or-the-egg problem. You can't run ICMPv6 effectively without having an IPv6 address already, but how can you have an address before... you have an address?</div>
<div>
<br /></div>
<div>
Enter link-local addressing. Link-local addresses all exist within FE80::/10. Typically, and this is true on Cisco devices, the address is built as FE80::&lt;address based on MAC address&gt;. It can alternatively (as is done in modern Windows OSes) be built as FE80::&lt;random address&gt;. The specific format of the address isn't necessary for my explanation, so I will consider that out-of-scope for this document; it can be found in hundreds of other places.</div>
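<div>
<br /></div>
<div>
For anyone who does want to see the MAC-based method spelled out, here is a minimal Python sketch of the modified EUI-64 derivation: flip the universal/local bit of the first MAC byte, insert FFFE in the middle, and prepend FE80::/64. The function name <b>eui64_link_local</b> is my own invention for illustration, and the MAC is the one R2 shows later in its neighbor table entry.</div>

```python
import ipaddress

def eui64_link_local(mac: str) -> ipaddress.IPv6Address:
    """Build a modified EUI-64 link-local address from a 48-bit MAC (illustrative helper)."""
    # Accept Cisco dotted (f866.f2de.0ff1) as well as colon/dash separated notation.
    digits = mac.replace(".", "").replace(":", "").replace("-", "")
    raw = bytearray.fromhex(digits)
    if len(raw) != 6:
        raise ValueError("expected a 48-bit MAC address")
    raw[0] ^= 0x02                                        # flip the universal/local bit
    iid = bytes(raw[:3]) + b"\xff\xfe" + bytes(raw[3:])   # insert FFFE in the middle
    prefix = bytes.fromhex("fe80") + bytes(6)             # FE80::/64
    return ipaddress.IPv6Address(prefix + iid)

print(eui64_link_local("f866.f2de.0ff1"))
```

<div>
Python prints addresses in lowercase, but the result is the same FE80::FA66:F2FF:FEDE:FF1 used throughout this article.</div>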
<div>
<br /></div>
<div>
Since my blog is geared towards the CCIE, and we know we'll be using all Cisco kit, we can assume that the MAC address method is what we'll be using.</div>
<div>
<br /></div>
<div>
So we've built our interface a "link-local" address of FE80::FA66:F2FF:FEDE:FF1. This link-local is non-routable, and can only be reached on the local segment. We must immediately run a process called DAD (Duplicate Address Detection) to ensure that we're the only one using this address. When we built the link-local address, we also joined a multicast group called a "solicited-node multicast", which is a multicast address derived from the last 24 bits of our address, so it is shared with very few (usually zero) other hosts. DAD sends a Neighbor Solicitation to this solicited-node group, and if anyone else responds (they shouldn't, if the address is unique), then we drop the address and don't use it.</div>
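<div>
<br /></div>
<div>
The solicited-node group is easy to compute by hand, but a short Python sketch makes the rule explicit: take FF02::1:FF00:0/104 and append the low 24 bits of the unicast address. The helper name <b>solicited_node</b> is hypothetical; the addresses are the ones used in this article.</div>

```python
import ipaddress

def solicited_node(addr: str) -> ipaddress.IPv6Address:
    """Return the solicited-node multicast group for a unicast address (illustrative helper)."""
    low24 = ipaddress.IPv6Address(addr).packed[-3:]        # last 24 bits of the address
    base = ipaddress.IPv6Address("ff02::1:ff00:0").packed  # FF02::1:FF00:0/104
    return ipaddress.IPv6Address(base[:13] + low24)

print(solicited_node("FE80::FA66:F2FF:FEDE:FF1"))
print(solicited_node("2001:100::C67D:4FFF:FEF9:5340"))
```

<div>
Because only the last 24 bits matter, a host's link-local and its EUI-64 based global address normally map to the same group.</div>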
<div>
<br /></div>
<div>
Assuming the address is unique, our next goal is to come up with a routable address in addition to our link-local. Now that we have a link-local address, we can send out a Router Solicitation (RS) and ask for the global prefix we should be on. Let's say our router's IPv6 address is 2001:100::1/64. The router will fire back a Router Advertisement (RA) telling our host that its prefix is 2001:100::/64. We'll build the host portion the same way we did for the link-local earlier to assign our globally routable address. Using the same example above, that address would be 2001:100::FA66:F2FF:FEDE:FF1.</div>
<div>
<br /></div>
<div>
This process is called StateLess Address AutoConfiguration, or SLAAC. It's important to note that the router must send out an RA with a /64 length, or this process doesn't work.</div>
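<div>
<br /></div>
<div>
As a rough sketch of what the host does with the RA, the snippet below glues the advertised /64 onto the interface identifier it already used for its link-local address. The function name <b>slaac_address</b> is made up for illustration, and it only models the EUI-64 style of SLAAC described here, not privacy or random identifiers.</div>

```python
import ipaddress

def slaac_address(prefix: str, link_local: str) -> ipaddress.IPv6Address:
    """Combine an advertised /64 with the interface ID of a link-local address (sketch)."""
    net = ipaddress.IPv6Network(prefix)
    if net.prefixlen != 64:
        # Mirrors the note above: a 64-bit interface ID needs a /64 prefix.
        raise ValueError("SLAAC with a 64-bit interface ID expects a /64 prefix")
    iid = ipaddress.IPv6Address(link_local).packed[8:]     # low 64 bits
    return ipaddress.IPv6Address(net.network_address.packed[:8] + iid)

print(slaac_address("2001:100::/64", "FE80::FA66:F2FF:FEDE:FF1"))
```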
<div>
<br /></div>
<div>
Perhaps more important to our examples, the RA we got the prefix from also hands out the router's IPv6 link local address, which we can optionally use as our default gateway. Just to point out again: this is not a global address. SLAAC's default routing is always done with link-local addresses.</div>
<div>
<br /></div>
<div>
Let's look at the rather minor config for this thus far:</div>
<div>
<br /></div>
<div>
R1 ("router"):</div>
<div>
ipv6 unicast-routing</div>
<div>
<br /></div>
<div>
interface Gigabit0/1</div>
<div>
ipv6 address 2001:100::1/64</div>
<div>
<br /></div>
<div>
R2 ("host"):</div>
<div>
ipv6 unicast-routing ! May or may not be necessary for a "host", platform-dependent.</div>
<div>
<br /></div>
<div>
interface Gigabit0/1</div>
<div>
ipv6 address autoconfig default</div>
<div>
<br /></div>
<div>
Omitting the "default" above would prevent it from installing R1 as a default route.</div>
<div>
<br /></div>
<div>
<div>
R2(config-if)#do sh ipv6 int br | s GigabitEthernet0/1</div>
<div>
GigabitEthernet0/1 [up/up]</div>
<div>
FE80::FA66:F2FF:FEDE:FF1</div>
<div>
2001:100::FA66:F2FF:FEDE:FF1</div>
</div>
<div>
<br /></div>
<div>
<div>
<div>
R2#sh ipv6 route ::/0</div>
<div>
Routing entry for ::/0</div>
<div>
Known via "static", distance 2, metric 0</div>
<div>
Route count is 1/1, share count 0</div>
<div>
Routing paths:</div>
<div>
FE80::F2F7:55FF:FE8D:96A2, GigabitEthernet0/1</div>
<div>
Last updated 00:00:10 ago</div>
</div>
</div>
<div>
<br /></div>
<div>
FE80::F2F7:55FF:FE8D:96A2 is R1's link local address.</div>
<div>
<br /></div>
<div>
Cisco routers all announce themselves as viable gateways (imagine that), so if you have a "host" router - perhaps a voice gateway or whatnot - you need to tell it not to send router advertisements:</div>
<div>
<br /></div>
<div>
<div>
R2(config)#int gig0/1</div>
<div>
R2(config-if)#ipv6 nd ra suppress all</div>
</div>
<div>
<br /></div>
<div>
The <b>suppress</b> keyword on its own stops the periodic RAs; <b>suppress all</b> additionally stops the router from responding to RSes.</div>
<div>
<br /></div>
<div>
An optional feature that's not on all platforms:</div>
<div>
<div>
R2(config-if)#ipv6 address autoconfig prefix</div>
</div>
<div>
<br /></div>
<div>
That will make the router install routes for any other prefixes on the same segment. So if your neighbor had a second IPv6 address of 13::1, we'd insert a route to it as well.</div>
<div>
<br /></div>
<div>
So that gets us a link-local address, global unicast address, and default gateway. What about DNS?</div>
<div>
<br /></div>
<div>
Up until very recently, the only real option was to run a stateless DHCPv6 server. It's stateless because in this scenario the DHCPv6 server doesn't actually keep track of anything; it just hands out options: DNS, Call Manager info, etc.</div>
<div>
<br /></div>
<div>
<div>
R1(config)#ipv6 dhcp pool DHCP-POOL</div>
<div>
R1(config-dhcpv6)#dns-server 4::4</div>
<div>
R1(config-dhcpv6)#dns-server 8::8</div>
</div>
<div>
<div>
R1(config-dhcpv6)#domain-name ABC.COM</div>
</div>
<div>
<div>
R1(config-dhcpv6)#int gig0/1</div>
<div>
<div>
R1(config-if)#ipv6 dhcp server DHCP-POOL</div>
</div>
<div>
R1(config-if)#ipv6 nd other-config-flag</div>
</div>
<div>
<br /></div>
<div>
The process on R2 is automatic. On R1 we create the pool (reasonably obvious config), apply it to the interface, and then set the O-flag. The O-flag tells clients via the RA that they should query a DHCP server for more information. The DHCP server and the device sending the RA do not need to be the same device.</div>
<div>
<br /></div>
<div>
Great! We've got our addresses, default gateway, and DNS. Before we move on to neighbor discovery, let's look at stateful DHCP.</div>
<div>
<br /></div>
<div>
You've got a couple options here.</div>
<div>
<br /></div>
<div>
Either way we need more info in our DHCP server:</div>
<div>
<br /></div>
<div>
<div>
R1(config)#ipv6 dhcp pool DHCP-POOL</div>
<div>
R1(config-dhcpv6)#address prefix 2001:100::/64</div>
</div>
<div>
<br /></div>
<div>
We'll still get our default gateway through RAs, but we can at least track the addresses that our hosts are using.</div>
<div>
<br /></div>
<div>
Our first option is to just recommend to our host that it use stateful DHCP.</div>
<div>
<br /></div>
<div>
R1(config-if)#no ipv6 nd other-config-flag</div>
<div>
<div>
R1(config-if)#ipv6 nd managed-config-flag</div>
</div>
<div>
<br /></div>
<div>
The M-Flag (Managed Flag) is a <i>suggestion</i> to the client that it should use the DHCP server for its host address instead of SLAAC.</div>
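<div>
<br /></div>
<div>
To keep the two RA flags straight, here is a tiny Python sketch of how a well-behaved client is expected to interpret them. The function name and the wording of the return strings are mine; as noted above, the flags are hints, so a real client may behave differently.</div>

```python
def client_behaviour(m_flag: bool, o_flag: bool) -> str:
    """Map the RA M and O flags to the addressing method a client is expected to use."""
    if m_flag:
        # Managed flag: address (and options) come from a stateful DHCPv6 server.
        return "stateful DHCPv6 for the address and options"
    if o_flag:
        # Other-config flag only: address via SLAAC, options via stateless DHCPv6.
        return "SLAAC for the address, stateless DHCPv6 for options"
    return "SLAAC only"

for flags in [(False, False), (False, True), (True, False)]:
    print(flags, client_behaviour(*flags))
```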
<div>
<br /></div>
<div>
Since we're using a Cisco client, I need to stop here and advise that I've never been able to get the client to recognize the M-flag. It could be my IOS, I haven't investigated it that much.</div>
<div>
<br /></div>
<div>
Since that's a bust, let's look at the other method.</div>
<div>
<br /></div>
<div>
<div>
R2(config-if)#int gig0/1</div>
<div>
R2(config-if)#ipv6 address autoconfig default ! still need this for the default gateway</div>
<div>
R2(config-if)#ipv6 address dhcp</div>
<div>
R2(config-if)#ipv6 enable</div>
</div>
<div>
<br /></div>
<div>
I'll explain <b>ipv6 enable</b> momentarily.</div>
<div>
<br /></div>
<div>
<div>
R2(config-if)#do sh ipv6 int br | s GigabitEthernet0/1</div>
<div>
GigabitEthernet0/1 [up/up]</div>
<div>
FE80::FA66:F2FF:FEDE:FF1</div>
<div>
2001:100::2C11:9212:8690:D8FE</div>
</div>
<div>
<br /></div>
<div>
We got our address. (Side note: if you have <b>ipv6 address dhcp</b> on an interface and not <b>ipv6 address autoconfig default</b>, you <i>won't</i> get a connected route to the local subnet; you get a totally unusable /128 host address and that's it. The workaround is to disable <b>ipv6 unicast-routing</b>, and then you'll get the connected route.)</div>
<div>
<br /></div>
<div>
<div>
R1#sh ipv6 dhcp binding</div>
<div>
Client: FE80::FA66:F2FF:FEDE:FF1</div>
<div>
DUID: 00030001F866F2DE0FF0</div>
<div>
Username : unassigned</div>
<div>
IA NA: IA ID 0x00040001, T1 43200, T2 69120</div>
<div>
Address: 2001:100::2C11:9212:8690:D8FE</div>
<div>
preferred lifetime 86400, valid lifetime 172800</div>
<div>
expires at Jul 02 2014 01:29 AM (172654 seconds)</div>
</div>
<div>
<br /></div>
<div>
and R1 knows about us - it's <i>stateful</i>.</div>
<div>
<br /></div>
<div>
There's also a new option defined by RFC 6106 that completely eliminates the need for DHCPv6 in relation to DNS. Unfortunately, at the time of this writing, you need bleeding-edge IOS-XE in order to use it. In fact, the hardware lab I'm using (which will be explained later) for this doesn't even support it - I have to turn to CSR1000v:</div>
<div>
<br /></div>
<div>
Router(config-if)#ipv6 nd ra dns server ?</div>
<div>
<div>
X:X:X:X::X IPv6 address</div>
</div>
<div>
<br /></div>
<div>
This passes DNS servers as an RA option. I have labbed this previously, and it does appear to work from what I can see in Wireshark.</div>
<div>
<br /></div>
<div>
I promised an explanation of <b>ipv6 enable</b>. I bet you've seen this command before, followed by a statically configured address. You actually need this very infrequently. With the exception of the DHCP example I showed above, <i>the only time you need <b>ipv6 enable</b> is if you want an EUI-64-derived link-local address without a global unicast address</i>. If you're statically assigning an IPv6 global unicast, or you're getting one via SLAAC, you don't need this command - so cut it out! :)</div>
<div>
<br /></div>
<div>
Of course your final option is a statically assigned address:</div>
<div>
<br /></div>
<div>
ipv6 address 1::1/64 </div>
<div>
<br /></div>
<div>
Link locals can also be statically assigned:</div>
<div>
<br /></div>
<div>
ipv6 address FE80::1 link-local</div>
<div>
<br /></div>
<div>
Now that we're past "how do we get an address", let's move on to the other big topic of "how do we find our neighbors". As you're probably already aware, IPv6 does not use ARP; instead it uses Neighbor Solicitation (NS) and Neighbor Advertisement (NA) ICMPv6 messages. These build the neighbor table, which replaces the ARP cache.</div>
<div>
<br /></div>
<div>
NS and NA map pretty directly in functionality to ARP request and ARP reply in IPv4. I'm not going to go over them in detail, just understanding their function is sufficient for now.</div>
<div>
<br /></div>
<div>
If R2 has just gotten its address and wants to find R3, it would send out an NS.</div>
<div>
I've reset R2 to SLAAC and set up R3 for SLAAC as well.</div>
<div>
Addresses:</div>
<div>
R2: 2001:100::FA66:F2FF:FEDE:FF1</div>
<div>
R3: 2001:100::C67D:4FFF:FEF9:5340</div>
<div>
<br /></div>
<div>
R2#ping 2001:100::C67D:4FFF:FEF9:5340</div>
<div>
<div>
<br /></div>
<div>
Type escape sequence to abort.</div>
<div>
Sending 5, 100-byte ICMP Echos to 2001:100::C67D:4FFF:FEF9:5340, timeout is 2 seconds:</div>
<div>
!!!!!</div>
<div>
Success rate is 100 percent (5/5), round-trip min/avg/max = 0/3/16 ms</div>
</div>
<div>
<br /></div>
<div>
and some filtered output from <b>debug ipv6 icmp</b>:</div>
<div>
<br /></div>
<div>
<div>
ICMPv6: Sent <b>N-Solicit,</b> Src=2001:100::FA66:F2FF:FEDE:FF1, Dst=FF02::1:FFF9:5340</div>
</div>
<div>
<div>
ICMPv6: Received <b>N-Advert</b>, Src=2001:100::C67D:4FFF:FEF9:5340, Dst=2001:100::FA66:F2FF:FEDE:FF1</div>
</div>
<div>
<br /></div>
<div>
There's our NS going out, and NA coming in. It's important to note that the NS also provided R3 with R2's layer 2 address, so there's no need for the reverse process to happen.</div>
<div>
<br /></div>
<div>
<div>
R2#sh ipv6 neighbor</div>
<div>
IPv6 Address Age Link-layer Addr State Interface</div>
<div>
2001:100::C67D:4FFF:FEF9:5340 0 c47d.4ff9.5340 REACH Gi0/1</div>
</div>
<div>
<br /></div>
<div>
<div>
R3#sh ipv6 neighbor</div>
<div>
IPv6 Address Age Link-layer Addr State Interface</div>
<div>
2001:100::FA66:F2FF:FEDE:FF1 0 f866.f2de.0ff1 REACH Gi0/0</div>
</div>
<div>
<div>
<br /></div>
</div>
<div>
As you can see, R2 is aware of R3's addresses and vice-versa.</div>
<div>
<br /></div>
<div>
Before I move on to pointing out the (somewhat obvious) security problems with this whole process, I'd like to pause and look at a few other features that I felt were worth noting, but didn't fit into the explanation thus far.</div>
<div>
<br /></div>
<div>
If you have multiple routers on a segment and want to use one or more for failover, you can specify how important their advertisements are:</div>
<div>
<br /></div>
<div>
<div>
ipv6 nd router-preference med !default (setting this doesn't show in config)</div>
<div>
ipv6 nd router-preference low !depreffed</div>
<div>
ipv6 nd router-preference high !preffed</div>
</div>
<div>
<br /></div>
<div>
Although, I imagine most of us would just use HSRP.</div>
<div>
<br /></div>
<div>
Also, if you set the RA lifetime to 0, hosts won't use it as a default gateway, but it can still be used for SLAAC: <b>ipv6 nd ra lifetime 0</b></div>
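<div>
<br /></div>
<div>
Putting the preference and lifetime knobs together, a host's default-router choice can be sketched roughly as follows. This is a simplified model under my own assumptions (the <b>HeardRA</b> class and <b>pick_default_router</b> function are hypothetical names), not how any particular host stack is implemented.</div>

```python
from dataclasses import dataclass
from typing import List, Optional

RANK = {"low": 0, "medium": 1, "high": 2}

@dataclass
class HeardRA:
    router: str        # link-local source of the RA
    preference: str    # "low", "medium" or "high"
    lifetime: int      # router lifetime in seconds; 0 means "not a default router"

def pick_default_router(ras: List[HeardRA]) -> Optional[str]:
    """Pick the highest-preference router among RAs with a non-zero lifetime."""
    candidates = [ra for ra in ras if ra.lifetime != 0]
    if not candidates:
        return None
    return max(candidates, key=lambda ra: RANK[ra.preference]).router

heard = [
    HeardRA("fe80::1", "medium", 1800),
    HeardRA("fe80::2", "high", 1800),
    HeardRA("fe80::3", "high", 0),      # lifetime 0: usable for SLAAC, never a gateway
]
print(pick_default_router(heard))
```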
<div>
<br /></div>
<div>
I've previously run production dual-stack IPv6 at home. It was an interesting experiment, but the overwhelming lesson I learned from it is that most content providers are using separate pipes for their IPv6 traffic, and instead of being wide open and unused like I'd hoped, they were miserably oversubscribed. So when my Windows installation naturally preferred an IPv6 DNS resolution when both IPv6 and IPv4 were available for the same site (here's looking at you, YouTube!), all I got was slower web traffic. I still have the v6 service on my cable modem, but I disabled it on the router.</div>
<div>
<br /></div>
<div>
One of the more interesting things I learned from it was the sudden realization of what the heck I do without NAT. Almost all home networks are set up like so on IPv4:</div>
<div>
<br /></div>
<div>
[provider] ---- outside /30 ----> [home router] (NAT) --- private IP range ---> hosts</div>
<div>
<br /></div>
<div>
So if you think about this for a minute... in order to duplicate this on IPv6, which doesn't typically use NAT, you need a routable IPv6 address block on the outside of your home router, and a routable IPv6 address block on the inside of your router too. And since home ISPs don't exactly assign you two blocks of static addresses typically, you're also going to need a way to do this with DHCP or SLAAC.</div>
<div>
<br /></div>
<div>
There's a really clever fix for this in IPv6. It's called prefix delegation. The idea is that our provider will delegate an IPv6 block to our router, which will then in turn function as the DHCP server for that block.</div>
<div>
<br />
Let's look at some sample config.</div>
<div>
<br /></div>
<div>
ISP router:</div>
<div>
ipv6 local pool dhcpv6-pool1 2001:DB8:1200::/40 48</div>
<div>
<div>
<div>
<br /></div>
<div>
ipv6 dhcp pool test</div>
<div>
prefix-delegation pool dhcpv6-pool1 lifetime 1800 600</div>
</div>
</div>
<div>
dns-server 4::4</div>
<div>
<div>
dns-server 8::8 </div>
<div>
domain-name isp.net</div>
<div>
<br /></div>
<div>
interface GigabitEthernet1/0</div>
<div>
ipv6 address 12::1/64</div>
<div>
ipv6 nd other-config-flag</div>
<div>
ipv6 dhcp server test </div>
<div>
<br /></div>
</div>
<div>
Here we've told the ISP router that it should delegate /48 blocks from a larger /40 block to every downstream router that asks (each of which will, in turn, act as the DHCP server for its block). </div>
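<div>
<br /></div>
<div>
A quick bit of arithmetic shows how far that pool stretches. The Python sketch below carves the /40 from the config above into /48s; whether IOS hands them out in this order is not something I've verified, so treat the ordering as an assumption.</div>

```python
import ipaddress

pool = ipaddress.IPv6Network("2001:db8:1200::/40")
delegations = pool.subnets(new_prefix=48)        # generator of /48 blocks
first_three = [str(next(delegations)) for _ in range(3)]
count = 2 ** (48 - 40)                           # 8 bits of room between /40 and /48

print(count, "delegations available, starting with", first_three)
```

<div>
So this particular pool can serve 256 home routers before it runs dry.</div>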
<div>
<br /></div>
<div>
Home router:</div>
<div>
<div>
ipv6 dhcp pool MY-DNS ! we'll use our DNS servers</div>
<div>
dns-server 44::44</div>
<div>
dns-server 88::88</div>
<div>
domain-name foo.com</div>
<div>
<br /></div>
<div>
interface GigabitEthernet1/0</div>
<div>
ipv6 address autoconfig default ! we want a SLAAC address</div>
<div>
ipv6 dhcp client pd FOO ! we'll collect prefix delegation for the next interface</div>
<div>
<br /></div>
<div>
interface GigabitEthernet2/0</div>
<div>
ipv6 address FOO ::/64 eui-64 ! we'll use a /64 out of whatever we got above</div>
<div>
ipv6 nd other-config-flag</div>
<div>
ipv6 dhcp server MY-DNS</div>
</div>
<div>
<br /></div>
<div>
So, lots going on with the home router. First, we're going to assign our outside interface - Gig1/0 - an address via SLAAC. This gives us the IPv6 equivalent of my "outside /30" from my ASCII diagram above, albeit on a much larger block than an IPv4 /30 :). Next, we're going to create a prefix delegation named FOO, and pick up a prefix from our ISP server, which we already know will be a /48 from my explanation above. We'll then go to our inside interface, Gig2/0, and assign ourselves an EUI-64 address from the prefix delegation FOO. We'll of course be sending RAs, so our internal hosts will also use SLAAC to get an address on the inside, delegated subnet.</div>
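<div>
<br /></div>
<div>
The "::/64" in the inside interface's address line is just the locally chosen bits that get merged into whatever prefix was delegated. Here's a rough Python sketch of that merge. The helper name <b>subnet_from_delegation</b> is hypothetical, 2001:DB8:1200::/48 is only an assumed delegation, and the 0:0:0:1::/64 variant is my own example of picking a second subnet.</div>

```python
import ipaddress

def subnet_from_delegation(delegated: str, sub_bits: str) -> ipaddress.IPv6Network:
    """OR the locally chosen sub-prefix bits into the delegated prefix (sketch)."""
    parent = ipaddress.IPv6Network(delegated)
    local = ipaddress.IPv6Interface(sub_bits)
    combined = int(parent.network_address) | int(local.ip)
    return ipaddress.IPv6Network((combined, local.network.prefixlen))

delegated = "2001:db8:1200::/48"                          # assumed delegation from the ISP
print(subnet_from_delegation(delegated, "::/64"))         # the subnet used on Gig2/0 above
print(subnet_from_delegation(delegated, "0:0:0:1::/64"))  # a hypothetical second inside subnet
```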
<div>
<br /></div>
<div>
Our end device/host would have a simple config, just running SLAAC and being a stateless DHCP client. In IOS it'd just be:</div>
<div>
<br /></div>
<div>
<div>
interface GigabitEthernet1/0</div>
<div>
ipv6 address autoconfig default</div>
</div>
<div>
<br /></div>
<div>
Back on the ISP server, you get reverse route injection in the form of static IPv6 routes pointing to whomever you delegated the prefix to. These can be redistributed into your IGP or BGP.</div>
<div>
<br /></div>
<div>
Now let's look at what security faults all these features potentially have.</div>
<div>
<br /></div>
<div>
There are some obvious like-for-like issues that IPv4 had and that carried over to IPv6:</div>
<div>
<br /></div>
<div>
- An attacker could pose as the stateful DHCP server, handing out either bad information for denial of service (DOS), or handing out an attacker's IPv6 address for a man-in-the-middle attack (MiM).</div>
<div>
- An attacker could pose as another host. Let's say Host1 is trying to reach Host2, and Host3 is an attacker. Host1 could send out an NS for Host2, and Host3 could send an NA masquerading as Host2, which Host1 may accept as true if timed correctly. A similar process could be done from Host2 -> Host1, with Host3 inserting itself there as well, effectively inserting itself into the entire conversation - a MiM attack - with Host1 and Host2 none the wiser.</div>
<div>
<br /></div>
<div>
This is where the similarities with IPv4 end. Here are some of the new security concerns:</div>
<div>
<br /></div>
<div>
- A router advertisement could be faked, allowing a host to insert itself into a conversation, for a MiM attack.<br />
- A router could advertise, from one router to another, a prefix that shouldn't be on the local link. This connected route would be seen as closer than a theoretical downstream router that's legitimately advertising the same prefix.</div>
<div>
- An attacker could respond to every DAD request from a host, or even from the entire segment, effectively preventing hosts from using their legitimately unique IPv6 addresses, creating a DOS.</div>
<div>
- An outside host - not one on your local segment - could send traffic in rapid-fire towards a large swath of your available IPv6 space. Many IPv4 segments were either behind NAT or simply weren't all that big, but IPv6 space is really, really large: a /64, the common LAN segment size, is about 18 quintillion addresses. No last-hop router has enough memory or CPU to send NS requests for, and track neighbor entries for, 18 quintillion addresses. The result is a DOS: worst-case the router crashes, best-case the CPU is so busy that new requests aren't serviced.</div>
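<div>
<br /></div>
<div>
To put the "18 quintillion" figure in perspective, here is some back-of-the-envelope Python. The probe rate is a number I picked purely for illustration. The point is that the attacker never runs out of fresh addresses to aim at, so every probe can force the last-hop router to attempt a new resolution.</div>

```python
import ipaddress

segment = ipaddress.IPv6Network("2001:100::/64")
addresses = segment.num_addresses            # 2**64 addresses in a single /64
PROBES_PER_SECOND = 1_000_000                # hypothetical attacker rate
years_to_sweep = addresses / PROBES_PER_SECOND / (365 * 24 * 3600)

print(f"{addresses:,} addresses, roughly {years_to_sweep:,.0f} years to sweep once")
```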
<div>
<br /></div>
<div>
For me, the most confusing thing about IPv6 FH Security was that there is, as I understand it, an "old" way of doing things and a "new" way of doing things. These overlap a lot, and there's no real discussion of this in the documentation, so figuring out when to use what was confusing.<br />
<br />
We're going to start with the "old" way, which is reasonably well-documented and well-blogged on the Internet, but not particularly thorough, and then move on to the "new" way, which seems more complete.<br />
<br />
First let's see about tackling invalid RAs.<br />
<br />
The simplest, non-automatic way to prevent invalid RAs is to write a PACL for them:<br />
<br />
ipv6 access-list IPv6<br />
deny icmp any any router-advertisement<br />
permit ipv6 any any<br />
<br />
and simply apply it to any port facing a device that shouldn't be sending RAs.<br />
<br />
There's a gotcha here: there is a well-known exploit that evades this filter by sending fragmented packets. I'm not going to go into this in detail as hundreds have blogged about this already. The workaround is to add one more deny to the PACL:<br />
<br />
deny ipv6 any any undetermined-transport</div>
<div>
<br /></div>
<div>
This will make the ACL drop any IPv6 traffic where the router is unable to determine the transport type (no layer 4 information). Some say this makes for better security than the actual RA Guard feature, but debates like that are out of scope for the CCIE, so I'll leave that to others.</div>
<div>
<br /></div>
<div>
This feature is simple and well-documented enough that I'm not going to lab it.</div>
<div>
<br /></div>
<div>
Moving on to features we will be labbing, let's look at RA Guard. It's the automated, more granular version of what we just did with the PACL.<br />
<br />
Here's the diagram we'll work off of for the remainder of the article.</div>
<div>
<br /></div>
<div>
<img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhc2My5N1oaYmNe6QSo0clxfJFbSLm_Fs5ioFc0HkMcQACWohFgL1ayLHwink-lhk1w6RjbWGbHRWapFRdw9cSGdBb-rp5fyjrGYNfFLhMnoCeO8OQh5ka40rs7k4b5RVuSY_TV05BzAkA/s1600/diagram1.png" />
<br />
<div>
<br />
As I have in many other blogs, I use GNS3 for diagramming, but I'm using physical gear for the lab. Unfortunately, IPv6 FH Security requires a bleeding-edge IOS or IOS-XE layer 2 device, so labbing it is not so easy. I was lucky enough that some friends were able to lend me a 4948-E running 15.2(1). So, I am running entirely physical gear in this lab, despite what's implied by the diagram above.<br />
<br />
This is both good and bad: because I'm using a remote lab, I can't recable on the fly. So there are a few scenarios that I didn't lab out as thoroughly as I would've done normally, but the knowledge gained here should still be more than sufficient for what may appear on the lab.<br />
<br />
R1 will represent our valid router.<br />
R2 will (usually) represent our valid host.<br />
R3 will represent our attacker.<br />
<br />
The most common way to set this up is to basically un-trust all ports that shouldn't have a router on them, and trust all the ones that should. This is accomplished with this config:<br />
<br /></div>
<div>
<div>
ipv6 nd raguard policy MY_ROUTERS</div>
<div>
device-role router</div>
<div>
<br /></div>
<div>
ipv6 nd raguard policy MY_HOSTS</div>
<div>
device-role host ! DEFAULT</div>
<div>
<br /></div>
<div>
ipv6 snooping logging packet drop ! NEEDED FOR LOGGING</div>
<div>
<br />
vlan configuration 123 ! all our hosts are on vlan 123<br />
ipv6 nd raguard attach-policy MY_HOSTS<br />
<br />
int gig1/1<br />
ipv6 nd raguard attach-policy MY_ROUTERS<br />
<br />
I know the first thing I thought when I saw this was "what the heck is 'vlan configuration XYZ'"?<br />
I always did find it a bit strange back on the Catalyst 3560, when I learned it for CCIE v4, that QoS config would be put on an SVI even though the SVI sometimes had no IP address on it - you'd just create it for the QoS config. I guess someone at Cisco thought the same way; so now there's a specific configuration section for VLANs.<br />
<br />
So in short what we accomplished above was to configure every port in vlan 123 to assume a "host" was attached to it - a host should never be sending RAs. We then overrode that configuration on gig1/1 and told the switch to expect a router there - routers are, of course, OK to hear RAs from. Interface configuration always overrides vlan configuration, and that's true for the rest of the features we'll review in this article, so I'm going to assume that's understood from here on in.<br />
<br />
We enabled logging via <b>ipv6 snooping logging packet drop</b>, which will actually turn on the majority of the logging we need for this article. There is one additional command we'll cover later.<br />
<br />
I instructed R2 and R3 not to send RAs. Let's see what output we get if I attempt to enable them on R3.<br />
<br />
R3-ATTACKER(config-if)#ipv6 nd router-preference high ! Let's attempt to become the preferred router<br />
R3-ATTACKER(config-if)#no ipv6 nd ra suppress all</div>
<div>
<br /></div>
<div>
<div>
WS-C4948E(config)#</div>
<div>
*Jul 3 18:15:58.734: %SISF-4-PAK_DROP: Message dropped A=FE80::C67D:4FFF:FEF9:5340 G=- V=123 I=Gi1/3 P=NDP::RA Reason=<b>Message unauthorized on port</b></div>
</div>
<div>
<br /></div>
<div>
Those are the extreme basics, and I did it in long-hand, so to speak. Here's the shorter way to accomplish the same thing with defaults:</div>
<div>
<br /></div>
<div>
<div>
ipv6 nd raguard policy MY_ROUTERS</div>
<div>
device-role router</div>
<div>
<br /></div>
<div>
vlan configuration 123<br />
ipv6 nd raguard<br />
<br />
int gig1/1<br />
ipv6 nd raguard attach-policy MY_ROUTERS</div>
</div>
<div>
<br /></div>
<div>
This is true of just about every filtering command for IPv6 FH Security: if you use just the basic command with no policy, i.e. <b>ipv6 nd raguard</b>, you get the default untrusted configuration. Here, we just said don't trust any port's RAs except Gig1/1.</div>
<div>
<br /></div>
<div>
Filtering can also be done per-VLAN: </div>
<div>
<br /></div>
<div>
interface GigabitEthernet1/1</div>
<div>
switchport trunk allowed vlan 122,123</div>
<div>
switchport mode trunk</div>
<div>
ipv6 nd raguard attach-policy MY_ROUTERS vlan 123</div>
<div>
ipv6 nd raguard vlan 122<br />
<br />
This hypothetical config would trust RAs on vlan 123, but not vlan 122.</div>
<div>
<br />
There are many other options for RA Guard:<br />
<br />
ipv6 nd raguard policy SAMPLE</div>
<div>
<b><u>device-role router</u></b><br />
On the 4948, I have four options here: router, host, switch, and monitor.<br />
Router and Host we already covered. Switch I don't understand, and I can't find any documentation explaining what it does. Suffice to say that out-of-the-box it doesn't allow RAs, so if you set it with no other options, you get something similar to "Host". For Monitor I found some vague definitions, but it too has a similar outcome to Switch: no RAs.<br />
<b style="text-decoration: underline;">other-config-flag on</b><br />
Require the other-config-flag to be set in the RA or the policy will not pass the traffic.<br />
<b><u>trusted-port</u></b><br />
Permits RAs -- I can't find any other benefits to it for RA Guard, although the other frameworks (namely ND Inspection) prefer ports that have trusted-port enabled if there is a conflict.</div>
<div>
<b><u>router-preference maximum</u></b><br />
Sets the highest allowed preference on the port. If you want to enforce a backup router to being "low" or "medium" preference, this will drop the packet if anything higher is advertised.<br />
<b><u>hop-limit minimum</u></b><br />
<b><u>hop-limit maximum</u></b><br />
RAs advertise the hop limit (TTL) that hosts should use. You can use these two options to control what the minimum and maximum advertised hop limits should be. Note, if you want to test this with all IOS devices, IOS can only send one of two options for this field - the default, which is 64 (this is NOT noted on the CLI anywhere, I figured it out the hard way) - or "unspecified", which basically means "use whatever you want".<br />
If you want to match a specific value, such as 90, you would set minimum and maximum to the same value.<br />
<b><u>managed-config-flag on</u></b><br />
Ensures the managed-config-flag ("please get your IPv6 address statefully from DHCPv6") is on; if not, drop the packet.<br />
<b><u>match ipv6 access-list &lt;ACL&gt;</u></b><br />
This will match the <i>link local</i> address that's advertising the RA. If no match on the access-list, drop the packet. Formatting is as such: permit ipv6 host FE80::F2F7:55FF:FE8D:96A1 any<br />
<b><u>match ra prefix-list &lt;prefix list&gt;</u></b><br />
This will match the prefix being advertised in the RA. If no match, drop the packet.</div>
<div>
<br />
Covering each of these options would make the article drag on and on, so I'm going to give one large configuration and comment on it:<br />
<br />
ipv6 access-list LL-EXAMPLE<br />
permit ipv6 host FE80::F2F7:55FF:FE8D:96A1 any<br />
<br />
ipv6 prefix-list PREFIX-EXAMPLE permit 2001:100::/64<br />
<br />
ipv6 nd raguard policy SAMPLE<br />
device-role router<br />
other-config-flag on<br />
router-preference maximum medium<br />
hop-limit minimum 64<br />
hop-limit maximum 64<br />
match ipv6 access-list LL-EXAMPLE<br />
match ra prefix-list PREFIX-EXAMPLE<br />
<br />
int gig1/1<br />
ipv6 nd raguard attach-policy SAMPLE</div>
</div>
</div>
<div>
<br /></div>
<div>
This sample would allow RAs, require the other-config-flag to be enabled, not allow a router preference higher than medium, ensure a strict TTL advertisement of 64, require the RA to be sourced from FE80::F2F7:55FF:FE8D:96A1, and only permit it to advertise 2001:100::/64.</div>
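<div>
<br /></div>
<div>
If it helps to see the SAMPLE policy as logic rather than config, here's a conceptual Python model of the checks it describes. To be clear, this is only my mental model: the <b>RouterAdvert</b> class and <b>sample_policy_permits</b> function are hypothetical, and I'm not claiming the switch evaluates things in this order or this way internally.</div>

```python
from dataclasses import dataclass
from typing import List
import ipaddress

ALLOWED_SOURCE = ipaddress.IPv6Address("FE80::F2F7:55FF:FE8D:96A1")   # access-list LL-EXAMPLE
ALLOWED_PREFIX = ipaddress.IPv6Network("2001:100::/64")               # prefix-list PREFIX-EXAMPLE

@dataclass
class RouterAdvert:
    source: str            # link-local source address of the RA
    prefixes: List[str]    # prefixes carried in the RA
    hop_limit: int         # advertised hop limit
    preference: str        # "low", "medium" or "high"
    other_config: bool     # O-flag

def sample_policy_permits(ra: RouterAdvert) -> bool:
    """Return True only if the RA passes every check in the SAMPLE policy (model)."""
    checks = [
        ra.other_config,                                   # other-config-flag on
        ra.preference != "high",                           # router-preference maximum medium
        ra.hop_limit == 64,                                # hop-limit minimum 64 / maximum 64
        ipaddress.IPv6Address(ra.source) == ALLOWED_SOURCE,
        all(ipaddress.IPv6Network(p) == ALLOWED_PREFIX for p in ra.prefixes),
    ]
    return all(checks)

good = RouterAdvert("FE80::F2F7:55FF:FE8D:96A1", ["2001:100::/64"], 64, "medium", True)
rogue = RouterAdvert("FE80::C67D:4FFF:FEF9:5340", ["2001:100::/64"], 64, "high", False)
print(sample_policy_permits(good), sample_policy_permits(rogue))
```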
<div>
<br /></div>
<div>
DHCP Guard is a tad simpler.</div>
<div>
<br /></div>
<div>
We'll set up R1 as a stateful DHCP server, R2 as a DHCP client, and R3 as a malicious stateful DHCP server. I've removed all the RA Guard configuration to decrease the example complexity.</div>
<div>
<br /></div>
<div>
<div>
R1-ROUTER(config)#ipv6 dhcp pool TEST-POOL</div>
<div>
R1-ROUTER(config-dhcpv6)#address prefix 2001:100::/64</div>
<div>
R1-ROUTER(config-dhcpv6)#dns-server 4::4</div>
<div>
R1-ROUTER(config-dhcpv6)#dns-server 8::8</div>
<div>
R1-ROUTER(config-dhcpv6)#int gig0/0</div>
<div>
R1-ROUTER(config-if)#ipv6 dhcp server TEST-POOL</div>
</div>
<div>
<br /></div>
<div>
<div>
R3-ATTACKER(config-if)#ipv6 dhcp pool TEST-POOL</div>
<div>
R3-ATTACKER(config-dhcpv6)#address prefix 2001:101::/64</div>
<div>
R3-ATTACKER(config-dhcpv6)#dns-server 2001:101::BAD</div>
<div>
R3-ATTACKER(config-dhcpv6)#int gig0/0</div>
<div>
R3-ATTACKER(config-if)#ipv6 address 2001:101::BAD/64</div>
<div>
R3-ATTACKER(config-if)#ipv6 dhcp server TEST-POOL</div>
</div>
<div>
<br /></div>
<div>
and on our switch:</div>
<div>
<br /></div>
<div>
<div>
ipv6 prefix-list GOOD-PREFIX seq 5 permit 2001:100::/64 le 128 </div>
<div>
ipv6 access-list LL</div>
<div>
permit ipv6 host FE80::F2F7:55FF:FE8D:96A2 any</div>
<div>
<br /></div>
<div>
ipv6 dhcp guard policy TRUSTED</div>
<div>
device-role server</div>
<div>
match server access-list LL</div>
<div>
match reply prefix-list GOOD-PREFIX</div>
<div>
<br /></div>
<div>
vlan configuration 123</div>
<div>
ipv6 dhcp guard attach-policy TRUSTED</div>
</div>
<div>
<br /></div>
<div>
I'm going to treat this a bit differently - I'm going to trust all ports, but require them to match a specific link local source and address range.</div>
<div>
<br /></div>
<div>
<div>
R2-HOST(config-if)#no ipv6 address autoconfig</div>
<div>
R2-HOST(config-if)#ipv6 enable</div>
<div>
R2-HOST(config-if)#ipv6 address dhcp</div>
</div>
<div>
<br /></div>
<div>
<div>
WS-C4948E#</div>
<div>
*Jul 3 20:39:34.985: %SISF-4-PAK_DROP: Message dropped A=FE80::C67D:4FFF:FEF9:5340 G=2001:101::68:6AF1:6FC2:5983 V=123 I=Gi1/3 P=DHCPv6::ADV Reason=The source address in the DHCPv6 ADVERTISE packet is not authorized by the DHCP Guard policy</div>
</div>
<div>
<br /></div>
<div>
<div>
R2-HOST#sh ipv6 int br | s GigabitEthernet0/0</div>
<div>
GigabitEthernet0/0 [up/up]</div>
<div>
FE80::FA66:F2FF:FEDE:FF1</div>
<div>
2001:100::F1CB:5AD:3844:CB86</div>
</div>
<div>
<br /></div>
<div>
We get the correct address from the correct server.</div>
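<div>
The two match statements in the TRUSTED policy boil down to a pair of membership checks. Here is a minimal Python sketch of that logic using the addresses from this lab. The function name is made up for illustration, and it models the behavior seen above rather than how the switch actually implements it.</div>

```python
import ipaddress

# Values copied from the TRUSTED policy above.
ALLOWED_SERVERS = {ipaddress.ip_address("FE80::F2F7:55FF:FE8D:96A2")}  # access-list LL
GOOD_PREFIX = ipaddress.ip_network("2001:100::/64")  # prefix-list GOOD-PREFIX, le 128


def dhcp_guard_permits(server_source, offered_address):
    """Return True when a DHCPv6 server message passes both match statements."""
    source_ok = ipaddress.ip_address(server_source) in ALLOWED_SERVERS
    prefix_ok = ipaddress.ip_address(offered_address) in GOOD_PREFIX
    return source_ok and prefix_ok
```

<div>
With these values, a reply from FE80::F2F7:55FF:FE8D:96A2 handing out a 2001:100::/64 address passes, while the advertisement in the log above (sourced from FE80::C67D:4FFF:FEF9:5340, carrying a 2001:101:: address) fails both checks.</div>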
<div>
<br />
Now moving on to a-la-carte Neighbor Discovery Inspection.<br />
<br />
The first thing to understand about Neighbor Discovery Inspection is that it's a control-plane feature only: it doesn't inspect actual data traffic, only ND ICMP packets. If your users spoof their source address in an actual traffic flow, this doesn't protect against it.<br />
<br />
IPv6 ND Inspection builds a table based on NS/NA messages. It then enforces the table. The funny thing about it is, with NS/NA, basically the first one it hears becomes the trusted one.<br />
<br />
WS-C4948E(config)#vlan configuration 123<br />
WS-C4948E(config-vlan-config)#ipv6 nd inspection<br />
<div>
<br />
<img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg6VmzTkEt9m-1SOnmavZS0SxvtFU0Sh2x_Fr32m7K_ym_8eMiLoiQKh-tiAbZMBaF0f_uuptLqtqPImNHhYSVLnBKo2DmDJZZ_Df7JWXzRuPBH80w5z0eVfYOHKexwu5_BH6EXfYQGWjQ/s1600/diagram2.png" />
<br />
<br /></div>
I really expected DHCP Guard to also populate this table, but the a-la-carte version of ND Inspection and DHCP Guard don't appear to talk to one another, as best I was able to tell. The unified (ipv6 snooping) feature, which we will see next, does populate the neighbor bindings from DHCP, so don't be confused as to why you see DHCP as a population feature in the screen shot.<br />
<br />
So let's spoof R2 (Gig1/2)'s link local from R3:<br />
<br />
R3-ATTACKER(config-if)#ipv6 nd dad attempts 0 ! DAD would prevent the spoof<br />
R3-ATTACKER(config-if)#ipv6 address FE80::FA66:F2FF:FEDE:FF1 link-local<br />
<div>
<br /></div>
WS-C4948E#<br />
<div>
*Jul 4 14:45:01.109: %SISF-4-PAK_DROP: Message dropped A=FE80::FA66:F2FF:FEDE:FF1 G=- V=123 I=Gi1/3 P=NDP::RA Reason=<b>More trusted entry exists</b></div>
<div>
<br /></div>
More trusted entry exists. As you'll see above, there is a "Preflevel" numbering system, which I think of as administrative distance for neighbor bindings. The higher the number, the more trusted the entry is. Although I do find the message odd when you try to spoof -- what it really should say is "I already learned this entry from someone else, first come first serve!"<br />
<br />
The most trusted method is a static entry, which looks like:<br />
<br />
ipv6 neighbor binding vlan 123 FE80::FA66:F2FF:FEDE:FF1 interface gig1/3 c47d.4ff9.5340<br />
<div>
<br /></div>
<div>
You can shorten this up a whole bunch by omitting the optional VLAN and MAC address, but in my case, I'm trying to override the legitimate link-local of R2 to allow R3 to ping from its link-local. Without "fully populating" the table, the switch still prefers the dynamically learned entry over the static one.</div>
<div>
<br /></div>
<img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiYRoFFvRXNyu3aE_sdxELm2T682Yw9z5EN8pYieQXK5ri37Jqqas4tyijOrmIOGcEd5q2XSLlf3n0okQh1FyLpPuwR360Onecf8oa0_P_ht37ZwFWuDvQiK0QMXsBXsG2rWpmrx4QrbSA/s1600/diagram3.png" />
<br />
<div>
<br /></div>
<div>
We see the static entry with the priority of 100 - let's see if we can ping from R3 now, with our spoofed link-local:<br />
<br />
R3-ATTACKER#ping FE80::F2F7:55FF:FE8D:96A2 source FE80::FA66:F2FF:FEDE:FF1<br />
Output Interface: GigabitEthernet0/0<br />
Type escape sequence to abort.<br />
Sending 5, 100-byte ICMP Echos to FE80::F2F7:55FF:FE8D:96A2, timeout is 2 seconds:<br />
Packet sent with a source address of FE80::FA66:F2FF:FEDE:FF1%GigabitEthernet0/0<br />
!!!!!<br />
Success rate is 100 percent (5/5), round-trip min/avg/max = 0/0/0 ms<br />
<div>
<br /></div>
</div>
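The "first come, first served, unless something more trusted shows up" behavior can be sketched as a small Python model. The static preference of 100 comes from the output above; the value used for ND-learned entries and all of the names are assumptions for illustration only, not the switch's real internals.<br />

```python
PREF_STATIC = 100  # matches the static entry shown above
PREF_ND = 5        # hypothetical value for entries learned from NS/NA


class BindingTable:
    def __init__(self):
        self.entries = {}  # address -> (port, preflevel)

    def learn(self, address, port, preflevel):
        """Accept a claim unless an equally or more trusted entry exists on another port."""
        current = self.entries.get(address)
        if current is not None:
            cur_port, cur_pref = current
            if cur_port == port:
                preflevel = max(preflevel, cur_pref)  # refresh, never downgrade
            elif cur_pref >= preflevel:
                return False  # "More trusted entry exists"
        self.entries[address] = (port, preflevel)
        return True
```

In this model R2's link-local is learned on Gi1/2 first, R3's spoof on Gi1/3 at the same preference is refused, and only the static binding (preference 100) moves the address to Gi1/3.<br />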
Let's look at the rest of the features you can optionally put in a policy.<br />
<br />
ipv6 nd inspection policy TEST-POLICY<br />
<b> <u>limit address-count X</u></b><br />
This is relatively obvious: it limits how many addresses are permitted to participate in the ND process on a port. Keep in mind you always need a minimum of 2 if you have a global unicast address, because you need the link-local as well.<br />
<b> <u>tracking enable</u></b><br />
This is liveness tracking, and is pretty cool. We'll give this its own paragraph below.<br />
<b><u>drop-unsecure</u></b><br />
There's a cryptographic version of neighbor discovery called SeND. It requires a PKI infrastructure and certificates all the way down to the client. I'm going to call SeND "out of scope" for the CCIE v5 based on complexity. drop-unsecure is regarding SeND, so I'm skipping it.<br />
<div>
<b><u>sec-level minimum</u> </b></div>
<div>
Also used for SeND; out of scope.</div>
<div>
<u><b>device-role</b> {host | monitor | router}</u> </div>
<div>
I'm not really sure what the point is here - I don't know how host or router would differ, and Cisco hasn't documented it.</div>
<div>
<b><u>validate source-mac</u></b></div>
<div>
I imagine this validates that the source MAC is appropriate for future ND communication, but it's not documented and I can't lab it without changing wires (my lab is remote, as mentioned above), so unfortunately I haven't got a good explanation on this option.</div>
<div>
<br /></div>
If you enable tracking, as shown above, hosts get probed with an "are you still there?" if no ND packets are heard periodically.<br />
<br />
WS-C4948E(config)#ipv6 nd inspection policy ND-TEST<br />
WS-C4948E(config-nd-inspection)#tracking enable<br />
<div>
<div>
WS-C4948E(config)#vlan configuration 123</div>
<div>
WS-C4948E(config-vlan-config)#ipv6 nd inspection attach-policy ND-TEST</div>
</div>
<br />
<img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi0RJc_HrIqLhyFZUBokoQotcM6chL_raYKjQRmBbpBE6mo-BYoMjkiBtwePMEXHeHawkPBGylJR96tBEH-30kL4PiK-ozoBU-o7nDlb5MTs9lslGuhNH_LFzzzUyu76Jy31EtfvmtQeF0/s1600/diagram4.png" /><br />
<br />
Note the "time left" field on the far right. This is how long until an NS is sent out to the neighbor asking if it's still there. This is useful for two reasons:<br />
- Since the table is first-come first-serve, this frees up address space from being held indefinitely if it's actually not in use.<br />
- It allows for a host to move - imagine unplugging from one switchport and moving to another on the same switch - we need a way to age out the information reasonably quickly.<br />
<br />
We see we have 220 and 228 seconds on the two hosts presently in the table until they're probed.<br />
<br />
This one sure confused me: if we have a pure layer 2 device, how will it send an NS?<br />
<br />
If the switch has an IPv6 address on the appropriate VLAN (a link-local is enough), it sends the NS sourced from that address. If it doesn't have an IPv6 address on the VLAN (we're just switching at L2 only), it sends the NS from the IPv6 unspecified address. That's right, the switch will send NSes even if IPv6 routing isn't enabled.<br />
<br />
If you want probes to go out more frequently than every 300 seconds, you can set it like so:<br />
<br />
ipv6 nd inspection policy TEST-POLICY<br />
tracking enable reachable-lifetime 15 ! this will set it for 15 seconds.<br />
<br />
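The "time left" column is just a countdown from the reachable lifetime. A tiny Python sketch, assuming the 300-second default mentioned above and a hypothetical function name:<br />

```python
DEFAULT_REACHABLE_LIFETIME = 300  # seconds, per the default described above


def seconds_until_probe(last_heard, now, reachable_lifetime=DEFAULT_REACHABLE_LIFETIME):
    """Seconds left before the switch would probe a silent neighbor with an NS."""
    remaining = reachable_lifetime - (now - last_heard)
    return max(0, remaining)
```

A neighbor last heard 80 seconds ago shows 220 seconds left with the default, and with reachable-lifetime 15 the same neighbor would already be due for a probe.<br />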
The most confusing feature for me was the "ipv6 snooping" syntax, which accomplishes basically everything we saw above, plus has the option (via source guard and destination guard) to filter data plane traffic as well. The confusing part about it is simple: it has lots of cross-over with all the previous functions, yet it uses a different syntax.<br />
<br />
The basic concept of ipv6 snooping is to build the neighbor database, similar to what IPv6 ND inspection did, except it uses and enforces more methods all at once.<br />
<br />
It can use:<br />
- Information from DHCP (Default)<br />
- Information from ND (Default)<br />
- Static bindings<br />
<br />
I'm wiping out all prior FH Security functions implemented above, we're starting from scratch with basic addressing. Since this is a large topic, I'm going to give increasingly more complicated examples and explain them one at a time.<br />
<br />
At its most basic, enabling it is very simple:<br />
<br />
WS-C4948E(config)#vlan configuration 123<br />
WS-C4948E(config-vlan-config)#ipv6 snooping</div>
<div>
<br /></div>
<div>
Unfortunately this totally breaks the network.</div>
<div>
<br /></div>
<div>
By default, ipv6 snooping enables its version of RA Guard and ND Inspection, so now RAs won't work any longer.</div>
<div>
<br /></div>
<div>
<div>
WS-C4948E#</div>
<div>
*Jul 5 11:41:42.007: %SISF-4-PAK_DROP: Message dropped A=FE80::F2F7:55FF:FE8D:96A2 G=- V=123 I=Gi1/1 P=NDP::RA Reason=Packet not authorized on port</div>
</div>
<div>
<br /></div>
<div>
So let's fix that.</div>
<div>
<br /></div>
<div>
<div>
WS-C4948E(config)#ipv6 snooping policy TRUST_ROUTER</div>
<div>
WS-C4948E(config-ipv6-snooping)# security-level glean</div>
<div>
WS-C4948E(config-ipv6-snooping)#int gig1/1</div>
<div>
WS-C4948E(config-if)#ipv6 snooping attach-policy TRUST_ROUTER</div>
</div>
<div>
<br /></div>
<div>
<div>
R2-HOST(config-if)#ipv6 address autoconfig default</div>
<div>
R2-HOST(config-if)#do sh ipv6 int br | s GigabitEthernet0/0</div>
<div>
GigabitEthernet0/0 [up/up]</div>
<div>
FE80::FA66:F2FF:FEDE:FF1</div>
<div>
2001:100::FA66:F2FF:FEDE:FF1</div>
</div>
<div>
<br /></div>
<div>
Ok great! Now what the heck does "security-level glean" mean? Let me tell you, it was a lot of 'fun' to figure out from the near zero explanation the docs give. <b>glean</b> learns from DHCP and ND, but doesn't enforce anything. So "glean" basically means "trust this port". There is also a <b>trusted-port</b> option on the policy, and on my IOS, it appears to do absolutely nothing (how handy).</div>
<div>
<br /></div>
<div>
For purposes of keeping this article a reasonable size, I'm going to go ahead and point out the 1:1 features in common with the a-la-carte ND inspection. By default, when you enable snooping, you're getting integrated ND Inspection. If you don't want it, you would:</div>
<div>
<br /></div>
<div>
<div>
WS-C4948E(config)#ipv6 snooping policy MY_POLICY</div>
<div>
WS-C4948E(config-ipv6-snooping)#no protocol ndp</div>
</div>
<div>
<br /></div>
<div>
In this case, it would only learn entries from DHCPv6 "Guard" and static entries. Of note, you can do the same thing to disable dhcp inspection: <b>no protocol dhcp</b></div>
<div>
<br /></div>
<div>
Assuming you left <b>protocol ndp</b> on, these features work the same way I explained them above:</div>
<div>
<br /></div>
<div>
<div>
<b>limit address-count</b></div>
<div>
Make sure you allow at least 2 if you're using global unicast.</div>
<div>
<b>tracking</b></div>
<div>
Same as before, send NSes to keep the table up to date.</div>
</div>
<div>
<br /></div>
<div>
So let's look at the default integrated DHCP Guard inspection next. As mentioned above, this is <i>on by default </i>when using IPv6 Snooping.</div>
<div>
<br /></div>
<div>
<div>
R1-ROUTER(config)#ipv6 dhcp pool DHCP-POOL</div>
<div>
R1-ROUTER(config-dhcpv6)#address prefix 2001:100::/64</div>
<div>
R1-ROUTER(config-dhcpv6)#dns-server 4::4</div>
<div>
R1-ROUTER(config-dhcpv6)#dns-server 8::8</div>
</div>
<div>
R1-ROUTER(config-dhcpv6)#int gig0/0</div>
<div>
<div>
R1-ROUTER(config-if)#ipv6 dhcp server DHCP-POOL</div>
</div>
<div>
<br /></div>
<div>
Now I've already pre-trusted this port (with the <b>glean</b> security level mentioned above), so it's also OK to be a DHCP server.</div>
<div>
<br /></div>
<div>
<div>
R2-HOST(config)#int gig0/0</div>
<div>
R2-HOST(config-if)#ipv6 enable</div>
<div>
R2-HOST(config-if)#no ipv6 address autoconfig default</div>
<div>
R2-HOST(config-if)#ipv6 address dhcp</div>
<div>
R2-HOST(config-if)#do sh ipv6 int br | s GigabitEthernet0/0</div>
<div>
GigabitEthernet0/0 [up/up]</div>
<div>
FE80::FA66:F2FF:FEDE:FF1</div>
<div>
2001:100::B043:724:FE40:931D</div>
<div>
<br /></div>
</div>
<div>
We got our DHCP address.</div>
<div>
<br /></div>
<img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjj2B_9k7pWE6p6Yv_LAtUR84u0PNdIGO7EfOjXI6w2feELl-4qPSzBvHaj7th2Y_7QVuhMV-aGiM_qWxjADdUmNw8e59by9Xg1vog3uvU8RjErFYTJrqRAM3Jyt5NRwYpGjezYiAJIBMI/s1600/diagram5.png" />
<br />
<div>
<br />
And we now learned a binding via "DH" - DHCP.<br />
<br />
There are three security models that can be applied to entire VLANs or per-port/per-vlan:<br />
<br />
WS-C4948E(config-ipv6-snooping)#security-level ?<br />
glean Glean addresses<br />
guard inspect and drop un-authorized messages (default)<br />
inspect glean and Validate message</div>
<div>
<br /></div>
<div>
We've discussed glean - it basically trusts the port but still keeps track of the bindings. Guard, as indicated above, is the default, and it's what we're getting when nothing is specified. Guard is, in effect, the same as enabling the a-la-carte DHCP Guard, RA Guard, and ND Inspection: it learns bindings and prevents untrusted (non-glean) ports from sending RAs, DHCP offers, invalid DAD, invalid NAs, etc. It's an all-in-one control-plane security enforcer! </div>
<div>
<br /></div>
<div>
Best I can tell, "inspect" enforces ND only (similar to the a-la-carte ipv6 nd inspection feature) but doesn't protect against malicious RAs or DHCP servers. I validated this by enabling it on all interfaces, sending RAs and DHCP off R1, and then attempting to spoof R2's address on R3's interface. Everything was permitted except the R3 spoof of R2, which produced:</div>
<div>
<br /></div>
<div>
<div>
*Jul 5 13:38:48.559: %SISF-4-PAK_DROP: Message dropped A=2001:100::9193:876E:7E83:F33C G=- V=123 I=Gi1/3 P=NDP::NA Reason=More trusted entry exists</div>
</div>
<div>
<br /></div>
<div>
on the 4948.</div>
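<div>
Putting the three security levels side by side, here is a rough Python sketch of the behavior described above. It reflects only what was observed in this lab (including the "best I can tell" reading of inspect), and the function and message names are invented for illustration.</div>

```python
def is_dropped(security_level, message, conflicts_with_binding=False):
    """message is one of "RA", "DHCP_SERVER" or "NA" arriving on the port."""
    if security_level == "glean":
        return False  # learn bindings only, never enforce
    if conflicts_with_binding:
        return True   # inspect and guard both enforce the binding table
    if security_level == "guard":
        # guard also refuses router and DHCP server roles on the port
        return message in ("RA", "DHCP_SERVER")
    return False      # inspect lets RAs and DHCP server replies through
```

<div>
So an RA is dropped on a guard port but passes on glean and inspect ports, while a spoofed NA that conflicts with an existing binding is dropped under both inspect and guard.</div>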
<div>
<br /></div>
<div>
That wraps up the control-plane filters. Data-plane filters are next, including Source Guard, Destination Guard, and Prefix Guard. IPv6 snooping builds the database that these features use, so it is a prerequisite for everything we'll see from here on.</div>
<div>
<br /></div>
<div>
Source guard is very simple. If the data-plane traffic - any traffic <i>other than</i> IPv6 ND/RA and DHCP - doesn't match the source address present in the prebuilt binding table, drop the traffic.</div>
<div>
<br /></div>
<div>
<div>
WS-C4948E(config)#vlan configuration 123</div>
<div>
WS-C4948E(config-vlan-config)#ipv6 source-guard</div>
</div>
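<div>
As a mental model, source guard is a lookup against the binding table that snooping already built. A minimal Python sketch, with made-up names and a plain dict standing in for the binding table:</div>

```python
CONTROL_PLANE = {"ND", "RA", "DHCP"}  # handled by snooping, not source guard


def source_guard_forwards(bindings, port, source_address, traffic_type="DATA"):
    """bindings maps an IPv6 address to the port it was learned on."""
    if traffic_type in CONTROL_PLANE:
        return True
    return bindings.get(source_address) == port
```

<div>
Data traffic is forwarded only when its source address has a binding on the port it arrived on; anything else is dropped.</div>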
<div>
<br /></div>
<div>
The main additional configuration for this is to validate prefixes, which is what's known as Prefix Guard.</div>
<div>
<br /></div>
<div>
Far earlier in the article we discussed prefix delegations via DHCP. This is the technology that allows you to sub-lease a prefix from one DHCP server to a downstream DHCP server. I couldn't get this feature to work with my lab, and I believe it's a platform limitation, but unfortunately when you're borrowing high-end switches to lab bleeding-edge functions, beggars can't be choosers.</div>
<div>
<br /></div>
<div>
Here is how I think it's supposed to work:</div>
<div>
<br /></div>
<div>
! DHCP SERVER</div>
<div>
<div>
ipv6 local pool dhcpv6-pool1 2001:100:123::/40 48</div>
<div>
<div>
<br /></div>
<div>
ipv6 dhcp pool test</div>
<div>
prefix-delegation pool dhcpv6-pool1 lifetime 1800 600</div>
</div>
<div>
dns-server 4::4</div>
<div>
<div>
dns-server 8::8 </div>
<div>
domain-name servers.net</div>
<div>
<br /></div>
<div>
interface GigabitEthernet0/0</div>
<div>
ipv6 address 2001:100:123::1/48</div>
<div>
ipv6 nd other-config-flag</div>
<div>
ipv6 dhcp server test </div>
<div>
</div>
</div>
</div>
<div>
<br /></div>
<div>
! CLIENT</div>
<div>
interface GigabitEthernet0/0</div>
<div>
<div>
ipv6 address autoconfig default</div>
<div>
ipv6 nd ra suppress all</div>
<div>
ipv6 dhcp client pd PREFIX-DELEGATION</div>
<div>
<br /></div>
<div>
interface Loopback0 ! I needed something to pretend to be a downstream client</div>
<div>
ipv6 address PREFIX-DELEGATION ::/64 eui-64 </div>
<div>
ipv6 enable</div>
<div>
<br /></div>
<div>
! SWITCH</div>
<div>
ipv6 snooping policy SNOOPING-POLICY</div>
<div>
security-level glean</div>
<div>
prefix-glean ! this learns prefix delegations - this did work for me, it was the filtering that didn't work.</div>
<div>
<br /></div>
<div>
ipv6 source-guard policy PREFIX-GUARD</div>
<div>
no validate address</div>
<div>
validate prefix</div>
<div>
<br /></div>
<div>
vlan configuration 123</div>
<div>
ipv6 snooping attach-policy SNOOPING-POLICY</div>
<div>
ipv6 source-guard attach-policy PREFIX-GUARD</div>
</div>
<div>
<br /></div>
<div>
And after applying all this, my switch shot back at me:</div>
<div>
<br /></div>
<div>
<div>
WS-C4948E(config)#%warning% This filter is not supported. Vlan - 123, mac - any, prefix_length - 64%warning% This filter is not supported. Vlan - 123, mac - any, prefix_length - 64%warning% This filter is not supported. Vlan - 123, mac - any, prefix_length - 128 </div>
<div>
<br /></div>
<div>
I tinkered with it for a while, changing the address to another one on Lo0 that I shouldn't be able to use, to no avail. Traffic kept passing. So I'm assuming this just can't be accomplished on this hardware or I'm hitting a bug.</div>
<div>
<br /></div>
<div>
I did have one other curiosity with source guard (just vanilla source guard, not prefix guard); while it did filter traffic appropriately, sometimes when I disabled the feature, traffic still wouldn't forward. Clearing the database solved the problem: <b>clear ipv6 neigh bind</b></div>
<div>
<b><br /></b></div>
<div>
Our last major topic is Destination Guard, which is pretty darn cool. The recommended prefix size for a LAN segment in IPv6 is a /64, which is 18 quintillion addresses. To put that into perspective, that's 18,446,744,073,709,551,616 addresses in one segment.</div>
</div>
<div>
<br /></div>
<div>
What would happen if you tried to "arp" (IPv6 NS) for 18,446,744,073,709,551,616 addresses back-to-back? </div>
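<div>
To put a number on it, here is a quick Python check of the /64 size, plus how long a full sweep would take. The probe rate is an arbitrary assumption; the attacker doesn't need to finish the sweep, only to keep the router resolving.</div>

```python
import ipaddress

lan = ipaddress.ip_network("2001:100::/64")
total = lan.num_addresses  # 2**64 addresses in a single /64


def years_to_sweep(probes_per_second):
    """Years needed to touch every address in the /64 at a constant rate."""
    seconds = total / probes_per_second
    return seconds / (365 * 24 * 3600)
```

<div>
Even at a million probes per second, a complete sweep takes well over half a million years, so the supply of bogus destinations never runs out.</div>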
<div>
<br /></div>
<div>
Your router would melt. You'd run out of RAM and CPU very quickly. Best case, this results in a simple DoS where legitimate NSes can't get through for a while; worst case, the OS might crash. Most of the attacks we've looked at so far require the attacker to have access to the "inside" of your network. What's worse about this attack is that it can be accomplished from outside your network, potentially from the Internet.</div>
<div>
<br /></div>
<div>
Destination Guard addresses this. Destination Guard is a "last hop" security feature: the attack can be launched from anywhere on a routed network, and the last hop router is the only one that is heavily impacted, because interim routers don't have to NS for the final destination, they just CEF-switch the packet.</div>
<div>
<br /></div>
<div>
We're also assuming, for our purposes here, that the last-hop router is a layer 3 switch supporting destination guard.</div>
<div>
<br /></div>
<div>
<div>
WS-C4948E(config)#vlan configuration 123</div>
<div>
WS-C4948E(config-vlan-config)#ipv6 destination-guard</div>
</div>
<div>
<br /></div>
<div>
There's only one very simple setting you can add if you use a policy:</div>
<div>
<br /></div>
<div>
<div>
WS-C4948E(config)#ipv6 destination-guard policy FOO</div>
</div>
<div>
WS-C4948E(config-destguard)#enforcement ?</div>
<div>
<div>
always Enforced under all conditions (default)</div>
<div>
stressed Enforced when system is under stress</div>
</div>
<div>
<br /></div>
<div>
<b>stressed </b>isn't defined anywhere, but I'm assuming it means only kick this feature in during high CPU load, or perhaps under a great deal of NS.</div>
<div>
<br /></div>
<div>
I'll be changing our lab up a bit in order to test this: R1 will be out of the picture, R2 will have a static assignment and will be our "trusted" host, and R3 will be our attacker on the Internet.</div>
<div>
<br /></div>
<div>
<div>
R2-HOST(config-if)#int gig0/0</div>
<div>
R2-HOST(config-if)#ipv6 address 2001:600D::2/64 ! my best representation of "GOOD" in hex :)</div>
</div>
<div>
R2-HOST(config)#ipv6 route ::/0 2001:600D::1 ! the switch's address</div>
<div>
<br /></div>
<div>
R3-ATTACKER(config-if)#int gig0/0</div>
<div>
<div>
R3-ATTACKER(config-if)#ipv6 address 2001:BAD::2/64</div>
</div>
<div>
<div>
R3-ATTACKER(config)#ipv6 route ::/0 2001:BAD::1</div>
</div>
<div>
<br /></div>
<div>
WS-C4948E(config)#ipv6 unicast-routing</div>
<div>
<div>
<div>
</div>
</div>
<div>
<div>
WS-C4948E(config)#vlan 200</div>
<div>
WS-C4948E(config-vlan)#exit</div>
<div>
WS-C4948E(config)#vlan 300</div>
<div>
WS-C4948E(config-vlan)#exit</div>
<div>
<div>
WS-C4948E(config)#int vlan200</div>
<div>
WS-C4948E(config-if)#ipv6 address 2001:600D::1/64</div>
<div>
<div>
WS-C4948E(config-if)#no shut</div>
</div>
<div>
WS-C4948E(config-if)#int vlan 300</div>
<div>
WS-C4948E(config-if)#ipv6 address 2001:BAD::1/64</div>
</div>
<div>
<div>
WS-C4948E(config-if)#no shut</div>
</div>
<div>
WS-C4948E(config)#int gig1/2</div>
<div>
<div>
WS-C4948E(config-if)#switchport access vlan 200</div>
</div>
<div>
WS-C4948E(config)#int gig1/3</div>
<div>
WS-C4948E(config-if)#switchport access vlan 300</div>
</div>
</div>
<div>
<br /></div>
<div>
<div>
WS-C4948E(config-if)#vlan configuration 200,300</div>
<div>
WS-C4948E(config-vlan-config)# ipv6 snooping</div>
<div>
WS-C4948E(config-vlan-config)# ipv6 destination-guard</div>
</div>
<div>
<br /></div>
<div>
Now let's say the "BAD" neighbor, a host on the Internet, tries to hammer away at tens of thousands of addresses on the "GOOD" (600D) network in order to make the router (our 4948) collapse under the NS load.</div>
<div>
<br /></div>
<div>
In order to see this security feature take effect, we need to enable one more logging feature:</div>
<div>
<br /></div>
<div>
<div>
WS-C4948E(config)#ipv6 snooping logging resolution-veto</div>
</div>
<div>
<br /></div>
<div>
What the heck is resolution veto? If the switch decides it's getting asked to resolve for bogus addresses, it will "veto" the neighbor solicitation.</div>
<div>
<br /></div>
<div>
Before trying the attack, I went ahead and verified connectivity from R2 to R3. </div>
<div>
<br /></div>
<div>
<div>
R2-HOST(config)#do ping 2001:BAD::2</div>
<div>
<br /></div>
<div>
Type escape sequence to abort.</div>
<div>
Sending 5, 100-byte ICMP Echos to 2001:BAD::2, timeout is 2 seconds:</div>
<div>
.....</div>
</div>
<div>
<br /></div>
<div>
That should've worked. What did the 4948 have to say?</div>
<div>
<br /></div>
<div>
<div>
WS-C4948E(config)#</div>
<div>
*Jul 5 16:13:12.774: %SISF-4-RESOLUTION_VETO: Resolution vetoed NS for D=2001:BAD::2 on I=Vl300 reason=Destination not active on link</div>
<div>
<br /></div>
</div>
<div>
This is an important point on this feature: it doesn't know what hosts are valid until it hears from them. So due to my order-of-operations in this scenario, the switch hadn't actually heard from R3 at all yet, and the NS is denied. Making R3 speak solves the problem, which would've worked itself out eventually anyway:</div>
<div>
<br /></div>
<div>
<div>
R3-ATTACKER(config)#do ping 2001:600D::2</div>
<div>
Type escape sequence to abort.</div>
<div>
Sending 5, 100-byte ICMP Echos to 2001:600D::2, timeout is 2 seconds:</div>
<div>
!!!!!</div>
<div>
Success rate is 100 percent (5/5), round-trip min/avg/max = 0/1/8 ms</div>
</div>
<div>
<br /></div>
<div>
Now let's try our theoretical attack.</div>
<div>
<br /></div>
<div>
<div>
R3-ATTACKER#ping 2001:600D::100 repeat 1 timeout 0</div>
<div>
Type escape sequence to abort.</div>
<div>
Sending 1, 100-byte ICMP Echos to 2001:600D::100, timeout is 0 seconds:</div>
<div>
.</div>
<div>
Success rate is 0 percent (0/1)</div>
<div>
R3-ATTACKER#ping 2001:600D::101 repeat 1 timeout 0</div>
<div>
Type escape sequence to abort.</div>
<div>
Sending 1, 100-byte ICMP Echos to 2001:600D::101, timeout is 0 seconds:</div>
<div>
.</div>
<div>
Success rate is 0 percent (0/1)</div>
<div>
R3-ATTACKER#ping 2001:600D::102 repeat 1 timeout 0</div>
<div>
Type escape sequence to abort.</div>
<div>
Sending 1, 100-byte ICMP Echos to 2001:600D::102, timeout is 0 seconds:</div>
<div>
.</div>
<div>
Success rate is 0 percent (0/1)</div>
<div>
R3-ATTACKER#ping 2001:600D::103 repeat 1 timeout 0</div>
<div>
Type escape sequence to abort.</div>
<div>
Sending 1, 100-byte ICMP Echos to 2001:600D::103, timeout is 0 seconds:</div>
<div>
.</div>
<div>
Success rate is 0 percent (0/1)</div>
<div>
<br /></div>
<div>
(pretend there are 65,431 hypothetical pings here)</div>
<div>
<br /></div>
<div>
R3-ATTACKER#ping 2001:600D::FFFF repeat 1 timeout 0</div>
<div>
Type escape sequence to abort.</div>
<div>
Sending 1, 100-byte ICMP Echos to 2001:600D::FFFF, timeout is 0 seconds:</div>
<div>
.</div>
<div>
Success rate is 0 percent (0/1)</div>
</div>
<div>
<br /></div>
<div>
We expected that outcome as these aren't valid hosts, but what happened on the switch?</div>
<div>
<br /></div>
<div>
<div>
*Jul 5 16:18:26.226: %SISF-4-RESOLUTION_VETO: Resolution vetoed NS for D=2001:600D::100 on I=Vl200 reason=Destination not active on link</div>
<div>
*Jul 5 16:18:34.146: %SISF-4-RESOLUTION_VETO: Resolution vetoed NS for D=2001:600D::101 on I=Vl200 reason=Destination not active on link</div>
<div>
*Jul 5 16:18:43.338: %SISF-4-RESOLUTION_VETO: Resolution vetoed NS for D=2001:600D::102 on I=Vl200 reason=Destination not active on link</div>
<div>
*Jul 5 16:18:50.418: %SISF-4-RESOLUTION_VETO: Resolution vetoed NS for D=2001:600D::103 on I=Vl200 reason=Destination not active on link</div>
<div>
.....</div>
<div>
*Jul 5 16:18:58.958: %SISF-4-RESOLUTION_VETO: Resolution vetoed NS for D=2001:600D::FFFF on I=Vl200 reason=Destination not active on link</div>
</div>
<div>
<br /></div>
<div>
And that about sums up destination guard.</div>
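<div>
The whole exchange above, including the surprise veto of the first ping toward R3, can be replayed with a small Python model. The class and method names are hypothetical, and addresses are compared as plain strings to keep it short.</div>

```python
class DestinationGuard:
    def __init__(self):
        self.active = set()  # addresses snooping has already heard from

    def heard_from(self, address):
        """Record a host once it has spoken on the link."""
        self.active.add(address)

    def resolve(self, destination):
        """Return True if the switch would send an NS, False if it vetoes."""
        return destination in self.active
```

<div>
Until R3 speaks, resolution for 2001:BAD::2 is vetoed; after R3 sends traffic it resolves normally, and the never-seen 2001:600D::100 through 2001:600D::FFFF targets are vetoed every time.</div>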
<div>
<br /></div>
<div>
A few random notes for the wrap-up.</div>
<div>
<br /></div>
<div>
Some useful show commands:</div>
<div>
<br /></div>
<div>
<div>
WS-C4948E#sh ipv6 snooping policies</div>
<div>
Target Type Policy Feature Target range</div>
<div>
Gi1/2 PORT policy1 NDP inspection vlan all</div>
<div>
<br /></div>
<div>
This will show you what policies are applied where.</div>
<div>
<br /></div>
<div>
WS-C4948E#sh ipv6 snooping features</div>
<div>
Feature name priority state</div>
<div>
NDP inspection 160 READY</div>
<div>
Snooping 128 READY</div>
</div>
<div>
</div>
<div>
This shows which features are enabled.</div>
<div>
<br /></div>
<div>
And where is all this in the documentation?</div>
<div>
<br /></div>
<div>
Well, at the time of this writing, there's a link to all this directly under the IOS 15.2E documentation, which is what I labbed on... and it's a broken link! Uuuuurgh!</div>
<div>
<br /></div>
<div>
Best I could come up with that I could drill-down to:</div>
<div>
<br /></div>
<div>
Switches -> </div>
<div>
3850 -> </div>
<div>
Catalyst 3850-12S-E Switch -> </div>
<div>
Configuration Guides -> </div>
<div>
IPv6 Configuration Library, Cisco IOS XE Release 3SE (Catalyst 3850 Switches) -> </div>
<div>
IPv6 First-Hop Security Configuration Guide, Cisco IOS XE Release 3SE (Catalyst 3850 Series)</div>
<div>
<br /></div>
<div>
That covers just about everything except Source Guard and Prefix Guard, but let's face it, those two features are pretty easy to understand if you have the rest of it down.</div>
<div>
<br /></div>
<div>
Best of luck!</div>
<div>
<br /></div>
<div>
Jeff</div>
brbcciehttp://www.blogger.com/profile/14586635047530183862noreply@blogger.com8tag:blogger.com,1999:blog-5968686435283454526.post-70495057700089789662014-06-14T11:38:00.001-07:002016-02-22T19:17:33.954-08:00Everything BFD
I sat down to start working on BFD three weeks ago thinking I'd be done in a couple of days. Three weeks later I'm finally starting to blog on it. To be fair, I missed a week to drinking heavily in California, but it still took a really long time to dive down every rabbit hole - there are a <b>lot</b> of BFD features, and the documentation stinks, particularly when it comes to "why would I want to use this".<br />
<br />
So what is BFD (Bidirectional Forwarding Detection)?<br />
<br />
BFD is a high-speed "are you up" protocol that routing protocols subscribe to. It can detect link failures in milliseconds, with the potential for microseconds on the right platform. All routing protocols have some way of detecting failure, usually timer-related. Tuning the timers can theoretically get you sub-second failure detection in some protocols, but this produces unnecessarily high overhead, as the average IGP wasn't designed with that in mind. BFD was specifically built for fast, low-CPU detection, and in the case of single-hop, can offload a great deal of the checks to CEF (by using echo mode - more later), even on a typical router. Some high-end platforms can even offload the entire BFD process to the linecard. The CEF or hardware offload makes BFD a major improvement over the other obvious choice, IP SLA.<br />
<br />
Scope of this document:<br />
To give a granular understanding of BFD, but focused on the CCIE v5 R&S. On that note, I have covered every BFD function I can find, excluding ISIS and MPLS-TE, as these are both out of scope for the v5 R&S lab. We will cover BFD's use with single-hop BGP, OSPF, EIGRP, RIP, PIM, HSRP, static routes, hierarchical static routes, multi-hop BGP, and multi-hop static routes. We will cover it with and without echo, and with and without authentication. We will discuss the IPv6 implementation of all of the applicable protocols above. We'll also cover BFD dampening.<br />
<br />
Some key items to know:<br />
<br />
- BFD has no neighbor detection. When the routing protocol needs to monitor a neighbor, it informs BFD, and BFD establishes the neighbor relationship at that point.<br />
<br />
- Various routing protocols can piggyback a single BFD session. If you have BGP and EIGRP running between the same two subnets on the same two routers, there's no need to have two BFD sessions for checking the same exact topology.<br />
<br />
- There are two versions, 0 and 1. While there are some deep programmatic differences between the two, those are out of scope for this document. The major difference for us is that v1 supports echo mode (more later) and v0 doesn't. v1 is on by default on Cisco equipment. The documentation says Cisco equipment can be backwards compatible with v0 if its neighbor only supports v0, but since there's no way to manually enable this on a Cisco device, you can safely assume for the lab exam that we're always talking about v1.<br />
<br />
- There are two modes, asynchronous and demand. Asynchronous is the "normal" BFD mode that you're used to when you think Cisco BFD: continuous, high-speed detection of neighbor failure. Demand mode is more of a steady-state operation, where it's assumed the neighbor and link are generally stable, but you'd periodically want to check to see if it's up. Cisco has output for the demand bit in show commands, and they also talk about it (vaguely) in the documentation in places, but <i>best I can tell </i>there's just no command to enable it - at least not on my device (CSR1000v v15.4.1). I read elsewhere that enabling echo mode (more on echo mode later) put the control packets in demand mode, but the demand bit is <i>not set</i> in these cases, so that appears incorrect. Perhaps it would work if the other neighbor initiated? Regardless, assume out of scope for the v5 lab.<br />
<br />
- BFD is always unicast.<br />
<br />
- BFD control packets are always UDP, sourced from 49152 (RFC 5881 allows any source port from 49152 through 65535) and sent to 3784. BFD echo packets are also UDP, and are sourced from 3785 and sent to 3785 (why this is will become obvious later).<br />
<br />
Let's start with basic, single-hop configuration. This will be our topology:<br />
<br />
<img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj3vJYwTCyDU_32gwlxJp-vxPudZa4x7i4gU0vUd7lP0TeoODIOFICv75vuYYXr9tXbRa34p56UnH6IvB_JZwxzI8iQkYajMGMaHsjQPJU7TipcoT1SEjG50iUXohAYxYckwd5Rn5LN4Lg/s800/diagram1.png" />
<br />
As with many of my other blogs, I use GNS3 as an easy diagramming tool. In this case, I am not using GNS3 for the actual lab, because I have never been able to get BFD to work in it. The second the BFD relationship tries to establish, dynamips hard locks. Instead, I am using CSR1000v running on VMware.<br />
<br />
Each router's gigabit interfaces are assigned 192.168.AB.X/24, where AB is the router numbers on the link, and X is the router number: i.e. R2's Gig1 is 192.168.12.2 and Gig2 is 192.168.23.2. Each router has a Loopback0 address of X.X.X.X, where X is the router number: i.e. R1's Loopback0 is 1.1.1.1.<br />
<br />
In addition, each router's gigabit interfaces are assigned AB::X/64, where AB is the router numbers on the link, and X is the router number: i.e. R2's Gig1 is 12::2/64. Each router has a Loopback0 address of X::X/128, where X is the router number: i.e. R1's Loopback0 is 1::1/128. IPv6 unicast routing is enabled on all devices.<br />
<br />
For this section, we'll mostly be working with R1 and R2.<br />
<br />
R1#conf t<br />
<div>
R1(config)#interface GigabitEthernet1</div>
<div>
R1(config-if)#bfd interval 300 min_rx 300 multiplier 3</div>
<div>
<br /></div>
<div>
R2#conf t</div>
<div>
R2(config)#interface GigabitEthernet1</div>
<div>
R2(config-if)#bfd interval 300 min_rx 300 multiplier 3<br />
<br />
If this were a routing protocol, you'd probably expect some sort of output by now.<br />
<br />
R1#show bfd neighbor<br />
R1#<br />
<div>
<br /></div>
As I mentioned above, no auto-discovery process. The routing protocol has to tell BFD it's needed first, and where to establish its neighbor relationship. It's also very noteworthy from a debugging perspective that if you get the blank output I showed above, you're missing something <b>locally</b>. Having half a BFD session configured (or having your authentication messed up) will produce output with a status of "AdminDown".<br />
<br />
Let's get this working:<br />
<br />
R1(config-if)#ip ospf 1 area 0<br />
R1(config-if)#ip ospf bfd<br />
<div>
<br /></div>
<div>
<div>
R2(config-if)#ip ospf 1 area 0</div>
<div>
R2(config-if)#ip ospf bfd</div>
</div>
<div>
<br /></div>
<div>
Now we should see some output.</div>
<div>
<br /></div>
R1#show bfd neighbor<br />
<br />
IPv4 Sessions<br />
NeighAddr LD/RD RH/RS State Int<br />
192.168.12.2 4097/4097 Up Up Gi1</div>
<div>
<br /></div>
<div>
R2 has a similar output referencing R1.</div>
<div>
<br /></div>
<div>
Now that we have a base config up, let's test the detection.</div>
<div>
<br /></div>
<div>
<div>
R2(config-if)#shut</div>
</div>
<div>
<br /></div>
<div>
<div>
R1#</div>
<div>
*Jun 21 02:17:21.983: %OSPF-5-ADJCHG: Process 1, Nbr 2.2.2.2 on GigabitEthernet1 from FULL to DOWN, Neighbor Down: BFD node down</div>
<div>
R1#</div>
</div>
<div>
<br /></div>
<div>
It's hard to demonstrate speed in a blog, but it happens very fast. We see from the output that BFD told the routing protocol that its neighbor had been lost - "BFD node down".</div>
<div>
<br /></div>
<div>
Let's bring the link back up.</div>
<div>
<br /></div>
<div>
<div>
R2(config-if)#no shut</div>
</div>
<div>
<br /></div>
<div>
The command usage is not as simple as it seems - the variable names are terrible, in my opinion. Let's build a more confusing example:</div>
<div>
<div>
<br />
R1(config-if)#bfd interval 200 min_rx 500 multiplier 5<br />
R2(config-if)#bfd interval 250 min_rx 400 multiplier 5</div>
</div>
<div>
<br /></div>
<div>
The first value is the "min_tx" and the second value is the "min_rx". I don't care for the names at all. min_rx from R1 will be compared to min_tx from R2, and a per-direction transmission value will be calculated.</div>
<div>
<br /></div>
<div>
In our scenario above, R1's min_tx - 200 - will be compared to R2's min_rx - 400. The <b>slower (larger)</b> value wins. Clearly, 400ms is longer than 200ms, so 400ms will be the negotiated <b>transmission </b>value for R1 towards R2. Vice-versa, R2's min_tx is 250ms, and R1's min_rx is 500, so 500ms will be the transmission speed from R2 to R1. Put another way, each direction runs at max(sender's min_tx, receiver's min_rx).</div>
<div>
<br /></div>
<div>
For output clarity I am going to disable echo mode for the moment (not shown here) and show the simplified output of <b>show bfd neighbor detail</b>:</div>
<div>
<br /></div>
<img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjo6bpPZN7rYKXiCWVvhM_mpvwRpupiyC8eS2-XIRUlynTzhi317ml1_V_OiPVU4h0eOnYHS_c8mlbuoA-nFy_JEkjOrw6HxtGzpJA26mZJ58aMDQTr7HMTRf8YFdmn_QaGlZVNmohpTjw/s800/diagram2.png" />
<br />
<div>
<br />
Here's the relevant output:<br />
Rx Count: 47, Rx Interval (ms) min/max/avg: 1/500/403 last: 379 ms ago<br />
Tx Count: 58, Tx Interval (ms) min/max/avg: 1/398/331 last: 59 ms ago</div>
<div>
<br /></div>
<div>
We see that our maximum Rx is 500 (speed from R2 to R1), and maximum Tx is 398 (speed from R1 to R2). Packets go out a touch faster than the negotiated value because BFD deliberately jitters each transmit interval downward (RFC 5880 calls for up to 25%), the idea being that if the BFD control packet doesn't arrive in <i>under</i> the maximum time, it'll be considered lost.</div>
<div>
<br /></div>
<div>
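The multiplier is reasonably obvious: if you miss that many BFD control packets in a row, consider the link failed. With the original 300/300/3 config on both routers, that works out to roughly 900ms to detect a failure.</div>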
<div>
<br /></div>
<div>
Since I've got this output up, also noteworthy are:</div>
<div>
* Registered Protocols: What protocols are "subscribed" to this BFD session? We see OSPF and CEF here.</div>
<div>
* Session state is UP and not using echo function. - as I mentioned I disabled echo.</div>
<div>
* C bit: 0 - This is only relevant on platforms that can completely hardware offload BFD, and we'll talk about it later.</div>
<div>
* Demand bit: 0 - I talked about this earlier, interesting that there's output for it, but I couldn't find any way to enable it.</div>
<div>
<br /></div>
<div>
Since we started with an OSPF example, let me recap what I did above and then we'll look at the alternative way to enable BFD.</div>
<div>
<br /></div>
<div>
Presently:</div>
<div>
R1:</div>
<div>
<div>
interface GigabitEthernet1</div>
<div>
ip address 192.168.12.1 255.255.255.0</div>
<div>
ip ospf bfd</div>
<div>
ip ospf 1 area 0</div>
<div>
negotiation auto</div>
<div>
bfd interval 200 min_rx 500 multiplier 5</div>
<div>
no bfd echo</div>
</div>
<div>
<br /></div>
<div>
R2:</div>
<div>
<div>
interface GigabitEthernet1</div>
<div>
ip address 192.168.12.2 255.255.255.0</div>
<div>
ip ospf bfd</div>
<div>
ip ospf 1 area 0</div>
<div>
negotiation auto</div>
<div>
bfd interval 250 min_rx 400 multiplier 3</div>
<div>
no bfd echo</div>
</div>
<div>
<br /></div>
<div>
The other option:</div>
<div>
<br /></div>
<div>
<div>
R1(config)#int gig1</div>
<div>
R1(config-if)#no ip ospf bfd</div>
</div>
<div>
<div>
<br /></div>
<div>
R2(config-if)#int gig1</div>
<div>
R2(config-if)#no ip ospf bfd</div>
</div>
<div>
<br /></div>
<div>
<div>
R1(config)#router ospf 1</div>
<div>
R1(config-router)#bfd all-interfaces</div>
</div>
<div>
<br /></div>
<div>
<div>
R2(config)#router ospf 1</div>
<div>
R2(config-router)#bfd all-interfaces</div>
</div>
<div>
<br /></div>
<div>
and if you wanted to selectively turn it off on an interface:</div>
<div>
<br /></div>
<div>
<div>
R2(config-router)#int gig2</div>
<div>
R2(config-if)#ip ospf bfd disable</div>
</div>
<div>
<br /></div>
<div>
<div>
R1(config-router)#do sh bfd neigh</div>
<div>
<br /></div>
<div>
IPv4 Sessions</div>
<div>
NeighAddr LD/RD RH/RS State Int</div>
<div>
192.168.12.2 1/1 Up Up Gi1</div>
</div>
<div>
<br /></div>
<div>
Next I'll quickly burn through the rest of the IGPs, and single-hop BGP.</div>
<div>
<br /></div>
<div>
For clarity, I've removed all the OSPF config beforehand.</div>
<div>
<br /></div>
<div>
<div>
R1(config)#router eigrp 100</div>
<div>
R1(config-router)#network 0.0.0.0</div>
<div>
R1(config-router)#bfd all-interfaces</div>
</div>
<div>
<br /></div>
<div>
<div>
R2(config)#router eigrp 100</div>
<div>
R2(config-router)#network 0.0.0.0</div>
<div>
R2(config-router)#bfd interface gig1</div>
</div>
<div>
<br /></div>
<div>
Note the syntax difference between R1 and R2: I'm showing both configuration methods at once, all interfaces on R1 versus a single interface on R2. I of course still have the interface-level BFD config in place.</div>
<div>
<br /></div>
<div>
<div>
R1#sh ip eigrp neigh</div>
<div>
EIGRP-IPv4 Neighbors for AS(100)</div>
<div>
H Address Interface Hold Uptime SRTT RTO Q Seq</div>
<div>
(sec) (ms) Cnt Num</div>
<div>
0 192.168.12.2 Gi1 13 00:01:42 1596 5000 0 3</div>
<div>
<br /></div>
<div>
R1#sh bfd neigh</div>
<div>
<br /></div>
<div>
IPv4 Sessions</div>
<div>
NeighAddr LD/RD RH/RS State Int</div>
<div>
192.168.12.2 1/1 Up Up Gi1</div>
</div>
<div>
<br /></div>
<div>
Removing EIGRP config for clarity.</div>
<div>
<br /></div>
<div>
RIP is supported, but it's a bit of an oddity. If you know RIP at all, your first question should be "how can BFD work with a neighborless routing protocol?". </div>
<div>
<br /></div>
<div>
It's a bit of a hack.</div>
<div>
<br /></div>
<div>
First item of note, Cisco advertises the feature as "BFD for RIPv2". Just to prove that it's not RIPv2 specific, I'm going to do this lab on RIPv1.</div>
<div>
<br /></div>
<div>
<div>
<div>
R1(config)#router rip</div>
<div>
R1(config-router)# version 1</div>
<div>
R1(config-router)# network 192.168.12.0</div>
<div>
R1(config-router)# neighbor 192.168.12.2 bfd</div>
<div>
R1(config-router)# bfd all-interfaces ! note, this is the only option</div>
</div>
<div>
<br /></div>
<div>
<div>
R2(config)#router rip</div>
</div>
<div>
R2(config-router)#version 1</div>
<div>
<div>
R2(config-router)#network 192.168.12.0</div>
<div>
R2(config-router)#neighbor 192.168.12.1 bfd</div>
<div>
R2(config-router)#bfd all-interfaces</div>
</div>
<div>
<br /></div>
</div>
<div>
<div>
R1(config-if)#do show bfd neigh</div>
</div>
<div>
<div>
R1(config-if)#</div>
</div>
<div>
<br /></div>
<div>
Hmm, no luck.</div>
<div>
<br /></div>
<div>
Turns out RIP requires you to be advertising a route <i>other than the transit link</i> for the BFD relationship to establish.</div>
<div>
<br /></div>
<div>
<div>
R1(config-router)#network 1.1.1.1</div>
</div>
<div>
<br /></div>
<div>
<div>
R2(config-router)#network 2.2.2.2</div>
</div>
<div>
<br /></div>
<div>
<div>
R1(config-router)#do show bfd neigh</div>
<div>
<br /></div>
<div>
IPv4 Sessions</div>
<div>
NeighAddr LD/RD RH/RS State Int</div>
<div>
192.168.12.2 4097/1 Up Up Gi1</div>
</div>
<div>
<br /></div>
<div>
So, clearly we can't tear down a non-existent neighbor relationship if the link fails. So what's this good for?</div>
<div>
<br /></div>
<div>
<div>
R2(config-router)#int gig1</div>
<div>
R2(config-if)#shut</div>
</div>
<div>
<br /></div>
<div>
<div>
R1(config-router)#do sh ip rip data | i 2.0.0.0</div>
<div>
<b>2.0.0.0/8 is possibly down</b></div>
<div>
<b>2.0.0.0/8 is possibly down</b></div>
<div>
<br /></div>
<div>
R1(config-router)#do sh ip route 2.0.0.0</div>
<div>
Routing entry for 2.0.0.0/8</div>
<div>
Known via "rip", distance 120, <b>metric 4294967295 (inaccessible)</b></div>
<div>
Redistributing via rip</div>
<div>
Last update from 192.168.12.2 on GigabitEthernet1, 00:00:56 ago</div>
<div>
Hold down timer expires in 142 secs</div>
</div>
<div>
<br /></div>
<div>
It marks the route as invalid immediately, rather than waiting on painfully slow RIP timers.</div>
<div>
<br /></div>
<div>
Removing RIP config for cleanliness...</div>
<div>
<br />
Single-hop BGP BFD is very simple. It's also probably the most-deployed implementation of BFD.<br />
<br />
R1(config)#router bgp 100<br />
R1(config-router)#neighbor 192.168.12.2 remote-as 200<br />
R1(config-router)#neighbor 192.168.12.2 fall-over bfd<br />
<br />
R2(config)#router bgp 200<br />
R2(config-router)#neighbor 192.168.12.1 remote-as 100<br />
R2(config-router)#neighbor 192.168.12.1 fall-over bfd<br />
<br />
R1(config-router)#do show bfd neigh det | i protocols<br />
Registered protocols: BGP CEF</div>
<div>
<br />
There's an extra flag you can use with BGP that takes some explaining. It's called the C-Bit, and if you don't understand the usage, it's a confusing thing.<br />
<br />
There are some service provider platforms that can run BFD completely in hardware. Meaning, the line card itself knows the BFD logic, and the control plane can actually crash and BFD will keep working. On these platforms, graceful restart (GR) or non-stop forwarding (NSF) can keep the FIB populated on the line card while the control plane reboots itself. GR is actually a negotiated BGP parameter -- when BGP needs to reboot, the neighbor keeps the routes from the rebooting device. In this fashion, the <b>neighbor</b> keeps forwarding traffic to the device that's rebooting even though BGP keepalives have failed. <br />
<br />
So what's this got to do with BFD?<br />
<br />
There could be circumstances where both the control plane needs to reboot and the forwarding plane dies at the same time. BFD can help detect this. Consider this topology.<br />
<br />
R1 --> R2. R2 is a provider platform that has NSF enabled.<br />
<br />
R1 learns that R2 is an NSF device, and assumes that it's OK to keep forwarding traffic to R2 even if R2's control plane dies.<br />
<br />
If...: BFD control packets are still coming, and C-BIT = 0 or 1, then R1 should keep forwarding to R2<br />
If...: BFD control packets stop coming, and the C-BIT = 0, then R1 should assume that BFD was run <b>in software</b> on the neighbor, and should keep forwarding packets during graceful restart.<br />
If...: BFD control packets stop coming, and the C-BIT = 1, then R1 should assume that BFD was run <b>in hardware on the linecard</b> on the neighbor, and that the neighbor is genuinely broken, and to yank the routes rather than wait on graceful restart.<br />
<br />
As best I can tell, without a platform to lab this on, the C-Bit is set by the BFD process itself, and isn't something you can toggle. However, you can tell your BFD process whether to ignore the setting or not. The default is to ignore. If you want to use it:<br />
<br />
R1(config-router)#neighbor 192.168.12.2 fall-over bfd <b>check-control-plane-failure</b><br />
<br />
Of note, this feature is also available in multi-hop BGP, which we'll cover further below.<br />
<br /></div>
<div>
And now for PIM!</div>
<div>
<br /></div>
<div>
I'm not going to build a full multicast lab up here, but we can see the basics.</div>
<div>
<br /></div>
<div>
<div>
R1(config)#int gig1</div>
<div>
R1(config-if)#ip pim sparse-mode</div>
<div>
R1(config-if)#ip pim bfd</div>
<div>
<br /></div>
<div>
<div>
R2(config)#int gig1</div>
<div>
R2(config-if)#ip pim sparse-mode</div>
<div>
R2(config-if)#ip pim bfd</div>
</div>
<div>
<br /></div>
<div>
R1(config-if)#do show bfd neigh</div>
<div>
<br /></div>
<div>
IPv4 Sessions</div>
<div>
NeighAddr LD/RD RH/RS State Int</div>
<div>
192.168.12.2 4097/1 Up Up Gi1</div>
<div>
<br /></div>
<div>
R1(config-if)#do show bfd neigh det | i protocols</div>
<div>
Registered protocols: PIM CEF</div>
</div>
<div>
<br /></div>
<div>
That's all there is to it. Removing PIM config...</div>
<div>
<br />
I'll cover HSRP now as well.<br />
<br />
R1(config)#int gig1<br />
R1(config-if)#standby 1 ip 192.168.12.100<br />
R1(config-if)#standby bfd<br />
<div>
<br /></div>
R2(config)#int gig1<br />
R2(config-if)#standby 1 ip 192.168.12.100<br />
<div>
R2(config-if)#standby bfd</div>
<div>
<br /></div>
<div>
<div>
R1(config-if)#do show bfd neigh det | i protocol</div>
<div>
Registered protocols: HSRP CEF</div>
<div>
<br /></div>
<div>
R1(config-if)#do show standby | i BFD</div>
<div>
BFD enabled</div>
</div>
<div>
<br /></div>
<div>
Alternatively, HSRP BFD support can be enabled globally with:</div>
<div>
<div>
R1(config)#standby bfd all-interfaces</div>
</div>
<div>
<br /></div>
<div>
IOS-based VRRP doesn't appear to have BFD support at the time of this writing. I've seen some documents indicating it is supported in IOS-XR and Nexus.</div>
<div>
<br /></div>
And now for IPv6 IGPs and BGP.<br />
<br />
OSPFv3's BFD usage is very similar to OSPFv2's.<br />
<br />
R1(config-if)#int gig1<br />
R1(config-if)#ipv6 ospf 1 area 0<br />
R1(config-if)#ipv6 ospf bfd</div>
<div>
<br /></div>
<div>
<div>
R2(config-if)#int gig1</div>
<div>
R2(config-if)#ipv6 ospf 1 area 0</div>
<div>
R2(config-if)#ipv6 ospf bfd</div>
</div>
<div>
<br /></div>
<div>
<div>
R1(config-rtr)#do show bfd neigh</div>
<div>
<br /></div>
<div>
IPv6 Sessions</div>
<div>
NeighAddr LD/RD RH/RS State Int</div>
<div>
FE80::20C:29FF:FECF:21FF 1/1 Up Up Gi1</div>
</div>
<div>
<br /></div>
<div>
You can also use the "all interfaces" style, just like with OSPFv2.</div>
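<div>
<br /></div>
<div>
A minimal sketch of that style, assuming the same OSPFv3 process 1 as above. This is not pasted lab output, and Gig2 as the excluded interface is just an example:</div>
<div>
<br /></div>
<div>
R1(config)#ipv6 router ospf 1</div>
<div>
R1(config-rtr)#bfd all-interfaces</div>
<div>
R1(config-rtr)#int gig2</div>
<div>
R1(config-if)#ipv6 ospf bfd disable ! optional, opts a single interface back out</div>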
<div>
<br /></div>
<div>
EIGRPv6 supports BFD in named EIGRP configuration, which is pretty darn different from the "old" way of doing EIGRPv6:</div>
<div>
<div>
<br /></div>
<div>
R1(config)# router eigrp FOO</div>
</div>
<div>
R1(config-router)# address-family ipv6 unicast autonomous-system 1</div>
<div>
<div>
R1(config-router-af)# af-interface default</div>
<div>
R1(config-router-af-interface)# bfd</div>
<div>
R1(config-router-af-interface)# exit-af-interface</div>
<div>
R1(config-router-af)# topology base</div>
<div>
R1(config-router-af-topology)# exit-af-topology</div>
<div>
R1(config-router-af)# exit-address-family</div>
</div>
<div>
<br /></div>
<div>
<div>
R2's config is 100% identical so I am omitting it.</div>
</div>
<div>
<br /></div>
<div>
<div>
R1(config-router)#do show bfd neigh</div>
<div>
<br /></div>
<div>
IPv6 Sessions</div>
<div>
NeighAddr LD/RD RH/RS State Int</div>
<div>
FE80::20C:29FF:FECF:21FF 1/1 Up Up Gi1</div>
<div>
<br /></div>
<div>
R1(config-router)#do show bfd neigh det | i protocol</div>
<div>
Registered protocols: EIGRP CEF</div>
</div>
<div>
<br /></div>
<div>
and RIPng? No such luck. It's not supported.</div>
<div>
<br />
Multiprotocol (IPv6) BGP is basically the same as v4:<br />
<br />
R1(config-router)#router bgp 100<br />
R1(config-router)#neighbor 12::2 remote-as 200<br />
R1(config-router)#neighbor 12::2 fall-over bfd<br />
R1(config-router)#address-family ipv6<br />
R1(config-router-af)#neighbor 12::2 activate<br />
<br />
R2(config-router)#router bgp 200<br />
R2(config-router)#bgp log-neighbor-changes<br />
R2(config-router)#neighbor 12::1 remote-as 100<br />
R2(config-router)#neighbor 12::1 fall-over bfd<br />
R2(config-router)#address-family ipv6<br />
R2(config-router-af)# neighbor 12::1 activate<br />
<div>
<br /></div>
R1(config-router)#do show bfd neigh<br />
<br />
IPv6 Sessions<br />
NeighAddr LD/RD RH/RS State Int<br />
12::2 1/1 Up Up Gi1<br />
<div>
<br /></div>
IPv6 PIM is just as easy as v4:</div>
<div>
<br /></div>
<div>
<div>
R1(config)#int gig1</div>
<div>
R1(config-if)#ipv6 pim bfd</div>
</div>
<div>
<br /></div>
<div>
HSRP for IPv6:</div>
<div>
<br /></div>
<div>
<div class="MsoNormalCxSpFirst">
interface GigabitEthernet1</div>
<div class="MsoNormalCxSpMiddle">
standby version 2</div>
<div class="MsoNormalCxSpMiddle">
standby 1 ipv6 autoconfig</div>
<div class="MsoNormalCxSpMiddle">
standby bfd</div>
<div class="MsoNormalCxSpMiddle">
<br /></div>
<div class="MsoNormalCxSpMiddle">
Similar to v4, BFD for IPv6 VRRP doesn't seem to be supported in traditional IOS at this time.</div>
<div class="MsoNormalCxSpMiddle">
<br /></div>
</div>
<div>
<br /></div>
<div>
Back to v4 for static routing.</div>
<div>
<br /></div>
<div>
Static routing takes a little more work, because there's no IGP to notify BFD of who to peer with, nor is there any neighbor relationship.</div>
<div>
<br /></div>
<div>
Let's start with the most basic usage.</div>
<div>
<br /></div>
<div>
<div>
R1(config)#ip route static bfd GigabitEthernet1 192.168.12.2</div>
<div>
R1(config)#ip route 2.2.2.2 255.255.255.255 GigabitEthernet1 192.168.12.2</div>
</div>
<div>
<br /></div>
<div>
<div>
R2(config)#ip route static bfd GigabitEthernet1 192.168.12.1</div>
<div>
R2(config)#ip route 1.1.1.1 255.255.255.255 GigabitEthernet1 192.168.12.1</div>
</div>
<div>
<br /></div>
<div>
<div>
R1(config)#do show bfd neigh</div>
<div>
<br /></div>
<div>
IPv4 Sessions</div>
<div>
NeighAddr LD/RD RH/RS State Int</div>
<div>
192.168.12.2 4097/2 Up Up Gi1</div>
<div>
<br /></div>
<div>
R1(config)#do show bfd neigh det | i protocols</div>
<div>
Registered protocols: CEF <b>IPv4 Static</b></div>
</div>
<div>
<br /></div>
<div>
In the absence of a routing protocol, we use <b>ip route static bfd &lt;interface&gt; &lt;next-hop to monitor&gt;</b></div>
<div>
<b><br /></b></div>
<div>
Where &lt;next-hop to monitor&gt; is the IP we'll be pointing our static routes to.</div>
<div>
We still need to fulfill two more prerequisites:</div>
<div>
- We must have a static route pointing at the specified next-hop. BFD doesn't set up the neighbor otherwise. Alternatively, you can set it up in unassociated mode, covered below.</div>
<div>
- The static route that points at the next hop must specify the egress interface <i>if we're doing single-hop routes. </i>(multi-hop covered below)</div>
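<div>
<br /></div>
<div>
If the session doesn't come up, it's worth checking what the static routing process believes it has registered with BFD. As a hedged pointer (command only, no output reproduced here, and availability may vary by release):</div>
<div>
<br /></div>
<div>
R1#show ip static route bfd</div>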
<div>
<br /></div>
<div>
But what if the neighbor doesn't need a static route back to us?</div>
<div>
<br /></div>
<div>
Imagine R2 knew R1's routes via another protocol, or even a default, and had no need to set up static routes back towards it:</div>
<div>
<br /></div>
<div>
<div>
R2(config)#no ip route 1.1.1.1 255.255.255.255 GigabitEthernet1 192.168.12.1</div>
</div>
<div>
<div>
<br />
R1(config)#int gig1</div>
<div>
R1(config-if)#ip ospf 1 area 0</div>
<div>
R1(config-if)#int lo0</div>
<div>
R1(config-if)#ip ospf 1 area 0</div>
</div>
<div>
<br /></div>
<div>
<div>
R2(config)#int gig1</div>
<div>
R2(config-if)#ip ospf 1 area 0</div>
</div>
<div>
<br /></div>
<div>
But, R1 still doesn't have a route to R2's Loopback0. So it needs that static.</div>
<div>
<br /></div>
<div>
Now, the BFD session has failed, because there's no route dependent on R2's statement:</div>
<div>
ip route static bfd GigabitEthernet1 192.168.12.1</div>
<div>
<br /></div>
<div>
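This is because the default, dependent method is known as an "associated" route: the session only comes up when a static route actually uses the monitored next-hop. An unassociated route brings the BFD session up anyway:</div>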
<div>
<br /></div>
<div>
<div>
R2(config)#ip route static bfd GigabitEthernet1 192.168.12.1 <b>unassociate</b></div>
</div>
<div>
<br /></div>
<div>
<div>
R1(config-if)#do show bfd neigh</div>
<div>
<br /></div>
<div>
IPv4 Sessions</div>
<div>
NeighAddr LD/RD RH/RS State Int</div>
<div>
192.168.12.2 4097/1 Up Up Gi1</div>
</div>
<div>
<br /></div>
<div>
We can also hierarchically group static routes.</div>
<div>
<br /></div>
<div>
<div>
R1(config)#ip route static bfd GigabitEthernet1 192.168.12.2 group DOWNSTREAM</div>
<div>
<div>
R1(config)#ip route static bfd GigabitEthernet1 192.168.12.50 group DOWNSTREAM passive</div>
<div>
R1(config)#ip route static bfd GigabitEthernet1 192.168.12.75 group DOWNSTREAM passive</div>
<div>
R1(config)#ip route 2.2.2.2 255.255.255.255 GigabitEthernet1 192.168.12.2</div>
<div>
R1(config)#ip route 10.10.10.10 255.255.255.255 GigabitEthernet1 192.168.12.50</div>
<div>
R1(config)#ip route 100.100.100.100 255.255.255.255 GigabitEthernet1 192.168.12.75</div>
</div>
</div>
<div>
<br /></div>
<div>
Let's walk this line by line:</div>
<div>
<div>
<br /></div>
<div>
R1(config)#ip route static bfd GigabitEthernet1 192.168.12.2 group DOWNSTREAM</div>
<div>
</div>
</div>
<div>
<div>
<br /></div>
</div>
<div>
This is our non-passive route - basically an anchor route. Let's say for example's sake that from our topology, if the BFD session to 192.168.12.2 (my neighbor) goes down, all the passive routes in my group, DOWNSTREAM, will also be offline. Perhaps they're all attached to some sort of shared Ethernet segment and 192.168.12.2 is the management IP of the first switch - if we can't reach it, we can't reach other devices on its link. There may be no reason to run BFD with the other devices, as perhaps they're on super-stable/redundant links. However, we need to pull them from our table if they're not reachable.</div>
<div>
<br /></div>
<div>
<div>
R1(config)#ip route static bfd GigabitEthernet1 192.168.12.50 group DOWNSTREAM passive</div>
<div>
R1(config)#ip route static bfd GigabitEthernet1 192.168.12.75 group DOWNSTREAM passive</div>
</div>
<div>
<br /></div>
<div>
192.168.12.50 and 192.168.12.75 are imaginary next-hops on the shared Ethernet segment of 192.168.12.0. They don't exist in our topology anywhere, but they don't need to for our example. They're passive, meaning they're reliant on the status from the anchor BFD session (the non-passive entry). If it goes down, they need to fail their BFD "status" too.</div>
<div>
<br /></div>
<div>
<div>
<div>
R1(config)#ip route 2.2.2.2 255.255.255.255 GigabitEthernet1 192.168.12.2</div>
<div>
<br /></div>
<div>
This references our "anchor" next-hop, and is necessary for BFD to establish.</div>
<div>
<br /></div>
<div>
R1(config)#ip route 10.10.10.10 255.255.255.255 GigabitEthernet1 192.168.12.50</div>
<div>
R1(config)#ip route 100.100.100.100 255.255.255.255 GigabitEthernet1 192.168.12.75</div>
</div>
</div>
<div>
<br /></div>
<div>
These are our routes that reference the passive BFD next-hops above.</div>
<div>
<br /></div>
<div>
BFD is already up, and we can see the imaginary downstream hosts 10.10.10.10 and 100.100.100.100 are installed in our routing table.</div>
<div>
<br /></div>
<div>
<div>
R1#sh ip route | b subnets</div>
<div>
1.0.0.0/32 is subnetted, 1 subnets</div>
<div>
C 1.1.1.1 is directly connected, Loopback0</div>
<div>
2.0.0.0/32 is subnetted, 1 subnets</div>
<div>
S 2.2.2.2 [1/0] via 192.168.12.2, GigabitEthernet1</div>
<div>
10.0.0.0/32 is subnetted, 1 subnets</div>
<div>
S <b>10.10.10.10 </b>[1/0] via 192.168.12.50, GigabitEthernet1</div>
<div>
100.0.0.0/32 is subnetted, 1 subnets</div>
<div>
S <b>100.100.100.100</b> [1/0] via 192.168.12.75, GigabitEthernet1</div>
<div>
192.168.12.0/24 is variably subnetted, 2 subnets, 2 masks</div>
<div>
C 192.168.12.0/24 is directly connected, GigabitEthernet1</div>
<div>
L 192.168.12.1/32 is directly connected, GigabitEthernet1</div>
</div>
<div>
<br /></div>
<div>
I'm going to fail the interface on R2, which, of important note, does not bring down the line protocol on R1 in my virtual lab.</div>
<div>
<br /></div>
<div>
<div>
R2#conf t</div>
<div>
Enter configuration commands, one per line. End with CNTL/Z.</div>
<div>
R2(config)#int gig1</div>
<div>
R2(config-if)#shut</div>
</div>
<div>
<br /></div>
<div>
<div>
<div>
R1#sh ip route | b subnets</div>
<div>
1.0.0.0/32 is subnetted, 1 subnets</div>
<div>
C 1.1.1.1 is directly connected, Loopback0</div>
<div>
192.168.12.0/24 is variably subnetted, 2 subnets, 2 masks</div>
<div>
C 192.168.12.0/24 is directly connected, GigabitEthernet1</div>
<div>
L 192.168.12.1/32 is directly connected, GigabitEthernet1</div>
</div>
</div>
<div>
<br /></div>
<div>
And all three routes are gone!</div>
<div>
<br /></div>
<div>
Now, just to prove my line protocol is still up on that subnet, and this is actually BFD removing the routes and it isn't just a generic next-hop failure:</div>
<div>
<br /></div>
<div>
<div>
R1#sh ip cef 192.168.12.50</div>
<div>
192.168.12.0/24</div>
<div>
attached to GigabitEthernet1</div>
<div>
<br /></div>
<div>
R1#sh ip cef 192.168.12.75</div>
<div>
192.168.12.0/24</div>
<div>
attached to GigabitEthernet1</div>
</div>
<div>
<br /></div>
<div>
The IPv6 implementation isn't quite as feature-filled:</div>
<div>
<br /></div>
<div>
R1(config)#ipv6 route static bfd GigabitEthernet1 12::2</div>
<div>
<div>
R1(config)#ipv6 route 2::/64 GigabitEthernet1 12::2</div>
</div>
<div>
<br /></div>
<div>
R2(config)#ipv6 route static bfd GigabitEthernet1 12::1 unassociated</div>
<div>
<br /></div>
<div>
<div>
R2(config)#do show bfd neigh | b IPv6</div>
<div>
IPv6 Sessions</div>
<div>
NeighAddr LD/RD RH/RS State Int</div>
<div>
12::1 2/2 Up Up Gi1</div>
</div>
<div>
<br /></div>
<div>
This works much the same way IPv4 does - R1 specifies an associated BFD neighbor and corresponding static route, R2 has an unassociated route (we'll assume it knows how to get back to R1 through other means). And... that's it for v6. No groups!</div>
<div>
<br /></div>
<div>
I've left all the multihop options for one section, as they all share some of the same configuration.<br />
<br />
We'll start with multihop IPv4 BGP. Now we'll be peering R1 to R4. I've set up interim IGPs throughout; assume full reachability.<br />
<br />
<div>
R1(config)#bfd-template multi-hop MHOP-TEMPLATE</div>
<div>
R1(config-bfd)#interval min-tx 300 min-rx 300 multiplier 3</div>
<div>
<div>
R1(config)#bfd map ipv4 4.4.4.0/24 1.1.1.0/24 MHOP-TEMPLATE</div>
</div>
<div>
R1(config)#router bgp 14</div>
R1(config-router)#neighbor 4.4.4.4 remote-as 14<br />
R1(config-router)#neighbor 4.4.4.4 update-source lo0<br />
<div>
<div>
R1(config-router)#neighbor 4.4.4.4 fall-over bfd multi-hop</div>
</div>
<div>
<br /></div>
<div>
<div>
<div>
R4(config)#bfd-template multi-hop MHOP-TEMPLATE</div>
<div>
R4(config-bfd)#interval min-tx 300 min-rx 300 multiplier 3</div>
</div>
</div>
<div>
<div>
R4(config)#bfd map ipv4 1.1.1.1/32 4.4.4.4/32 MHOP-TEMPLATE</div>
<div>
R4(config)#router bgp 14</div>
</div>
<div>
<div>
R4(config-router)#neighbor 1.1.1.1 remote-as 14</div>
<div>
R4(config-router)#neighbor 1.1.1.1 update-source lo0</div>
</div>
<div>
<div>
R4(config-router)#neighbor 1.1.1.1 fall-over bfd multi-hop</div>
</div>
<div>
<br /></div>
<div>
We'll walk through this config as well:<br />
<br />
<div>
R1(config)#bfd-template multi-hop MHOP-TEMPLATE<br />
R1(config-bfd)#interval min-tx 300 min-rx 300 multiplier 3</div>
</div>
<div>
<br />
This is just a series of settings to apply to the multi-hop session. Clearly we can't glean it from the interface BFD configuration because there might be different settings for different neighbors. Of note, you can also set authentication here. There are also single-hop templates, which we'll talk about later.<br />
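<br />
As a rough sketch of the authentication piece, assuming a key chain named BFD-KEYS and a placeholder key-string (both made up for illustration, and not labbed in this post), it would look something like:<br />
<br />
R1(config)#key chain BFD-KEYS<br />
R1(config-keychain)#key 1<br />
R1(config-keychain-key)#key-string NOT-A-REAL-SECRET<br />
R1(config)#bfd-template multi-hop MHOP-TEMPLATE<br />
R1(config-bfd)#authentication sha-1 keychain BFD-KEYS<br />
<br />
The other end needs a matching key chain and authentication type in its own template.<br />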
<br />
R1(config)#bfd map ipv4 4.4.4.0/24 1.1.1.0/24 MHOP-TEMPLATE<br />
<br />
The BFD map is the slightly confusing part. This statement could be interpreted as:<br />
<br />
"<b>If</b> I establish a multi-hop BFD session to a destination inside 4.4.4.0/24, sourced from any of my interfaces inside of 1.1.1.0/24, then use the settings from MHOP-TEMPLATE"<br />
<br />
Note it doesn't matter what mask size you use on this. In fact, if you look at R4, I specifically used /32s instead, just to prove a point. As long as the mask encompasses the IPs in question, you're good.<br />
<br />
It's also important to note that the BFD map isn't neighbor discovery or a static neighbor. It just assigns settings to a neighbor session that another protocol informs BFD of.<br />
<br />
It's also important to note that, at least for me, the configuration is backwards from the way I think. It's destination/source: 4.4.4.0/24 is my TARGET, 1.1.1.0/24 is my SOURCE. I mis-type it almost every time, because I think source/dest.<br />
<br />
<div>
R1(config)#router bgp 14</div>
R1(config-router)#neighbor 4.4.4.4 remote-as 14<br />
R1(config-router)#neighbor 4.4.4.4 update-source lo0<br />
<div>
R1(config-router)#neighbor 4.4.4.4 fall-over bfd multi-hop<br />
<br />
The BGP config is pretty obvious.<br />
<br />
And the outcome...<br />
<br />
R1(config-router)#do show bfd neigh<br />
<br />
IPv4 Multihop Sessions<br />
NeighAddr[vrf] LD/RD RH/RS State<br />
4.4.4.4 4097/4097 Up Up<br />
<div>
<br /></div>
Let's validate by shutting down the link between R2 and R3, which is not participating in BFD other than forwarding packets for R1 and R4.<br />
<br />
R2(config)#int gig2<br />
R2(config-if)#shut<br />
<br />
R1(config-router)#<br />
*Jun 24 04:26:34.371: %BGP-5-NBR_RESET: Neighbor 4.4.4.4 reset (BFD adjacency down)<br />
*Jun 24 04:26:34.371: %BGP-5-ADJCHANGE: neighbor 4.4.4.4 Down BFD adjacency down<br />
*Jun 24 04:26:34.372: %BGP_SESSION-5-ADJCHANGE: neighbor 4.4.4.4 IPv4 Unicast topology base removed from session BFD adjacency down</div>
</div>
<div>
<br /></div>
<div>
BGP IPv6 multi-hop is identical, so I'm not going to demonstrate it here.</div>
<div>
<br /></div>
<div>
You may want to consider QoS on the interim routers when it comes to BFD. It's not very helpful if your RTP packets continuously push your BFD packets out of the way, only to have BFD declare the neighbor down:</div>
<div>
- BFD packets are marked with precedence 6 by default</div>
<div>
- Be sure the value isn't reset by your interim routers, and that they prioritize/LLQ the Prec 6 traffic.</div>
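<div>
<br /></div>
If your QoS policy matches on DSCP or the full ToS byte rather than precedence, the conversion is simple arithmetic. A quick Python sketch (the function names are mine, purely for illustration):

```python
def precedence_to_dscp(precedence):
    """IP precedence occupies the top 3 bits of the 6-bit DSCP field."""
    return precedence * 8


def precedence_to_tos_byte(precedence):
    """IP precedence occupies the top 3 bits of the 8-bit ToS byte."""
    return precedence * 32


print(precedence_to_dscp(6))      # 48, which is CS6
print(precedence_to_tos_byte(6))  # 192, which is 0xC0
```

So precedence 6 is the same marking as DSCP CS6 (48), which is what you'd look for in a packet capture or match in a class-map on the interim routers.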
</div>
<div>
<br />
This leads us into static route multihop. I've removed the previous BGP config.</div>
<div>
<br /></div>
<div>
Much the same as BGP multihop BFD, static route multihop BFD uses multihop templates and BFD maps.<br />
Let's create a static route multihop session between R1 and R3. I've added a new loopback to R3, Lo1, with IP address 33.33.33.33/32 for validation purposes. It is not in the IGP.<br />
<br />
R1(config)#bfd-template multi-hop MHOP-TEMPLATE<br />
R1(config-bfd)#interval min-tx 300 min-rx 300 multiplier 3<br />
<div>
<div>
R1(config)#bfd map ipv4 192.168.23.0/24 192.168.12.0/24 MHOP-TEMPLATE</div>
</div>
<div>
<div>
R1(config)#ip route static bfd 192.168.23.3 192.168.12.1</div>
</div>
<div>
<div>
R1(config)#ip route 33.33.33.33 255.255.255.255 192.168.23.3</div>
</div>
<div>
<br /></div>
<div>
<div>
R3(config)#bfd-template multi-hop MHOP-TEMPLATE</div>
<div>
R3(config-bfd)#interval min-tx 300 min-rx 300 multiplier 3</div>
<div>
R3(config-bfd)#bfd map ipv4 192.168.12.0/24 192.168.23.0/24 MHOP-TEMPLATE</div>
<div>
R3(config)#ip route static bfd 192.168.12.1 192.168.23.3 unassociate</div>
</div>
<div>
<br /></div>
<div>
<div>
R1(config)#do show bfd neigh</div>
<div>
<br /></div>
<div>
IPv4 Multihop Sessions</div>
<div>
NeighAddr[vrf] LD/RD RH/RS State</div>
<div>
192.168.23.3 4097/4097 Up Up</div>
</div>
<div>
<br /></div>
<div>
Most of this config should be familiar if you read the entire article up until now, but there are some peculiar ways to do this incorrectly that will bust it.</div>
<div>
<br /></div>
<div>
R1(config)#ip route static bfd 192.168.23.3 192.168.12.1</div>
<div>
<br /></div>
<div>
This may seem very similar to single-hop, but here's a sample from single-hop above:</div>
<div>
<br /></div>
<div>
<div>
R2(config)#ip route static bfd GigabitEthernet1 192.168.12.1<br />
<br />
Note the lack of an interface on multi-hop, and the presence of one in single-hop. These settings are mutually exclusive: you <i>must not</i> specify an interface on multi-hop, and you <i>must</i> specify an interface on single-hop. This is very poorly documented, unfortunately - the samples on the DocCD do show the right thing, but they never call it out like this.<br />
<br />
R1(config)#ip route 33.33.33.33 255.255.255.255 192.168.23.3<br />
<br />
A normal static route from our multihop config - but let's look at our earlier single-hop sample:<br />
<br /></div>
<div>
R2(config)#ip route 1.1.1.1 255.255.255.255 GigabitEthernet1 192.168.12.1</div>
</div>
<div>
<br /></div>
<div>
Now this genuinely surprised me. If you specify the interface on a static route with multi-hop - even though <i>all the other information needed is present</i> - destination prefix and next hop - it will break multi-hop BFD. On the other hand you <b>must</b> have it for single-hop. Check out a quick before & after on multihop:</div>
<div>
<br /></div>
<div>
<div>
R1(config)#ip route 33.33.33.33 255.255.255.255 192.168.23.3</div>
<div>
R1(config)#do show bfd neigh</div>
<div>
<br /></div>
<div>
IPv4 Multihop Sessions</div>
<div>
NeighAddr[vrf] LD/RD RH/RS State</div>
<div>
192.168.23.3 4097/4097 Up Up</div>
</div>
<div>
<br /></div>
<div>
<div>
R1(config)#no ip route 33.33.33.33 255.255.255.255 192.168.23.3</div>
<div>
R1(config)#ip route 33.33.33.33 255.255.255.255 Gigabit1 192.168.23.3</div>
<div>
R1(config)#do show bfd neigh</div>
<div>
R1(config)#</div>
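<div>
<br /></div>
The interface rules from the last few paragraphs boil down to a small truth table. Here's a Python sketch of it. It only encodes what I observed in the lab, and the function name and arguments are made up for illustration:

```python
def static_bfd_should_work(multihop, bfd_cmd_has_interface, route_has_interface):
    """Encode the observed rule for static-route BFD.

    multihop:               True for multi-hop, False for single-hop
    bfd_cmd_has_interface:  'ip route static bfd' names an egress interface
    route_has_interface:    the 'ip route' itself names an egress interface
    """
    if multihop:
        # Multi-hop: an interface on either command breaks the session.
        return not bfd_cmd_has_interface and not route_has_interface
    # Single-hop: both commands need the interface.
    return bfd_cmd_has_interface and route_has_interface


print(static_bfd_should_work(True, False, False))   # True  (the working multi-hop config)
print(static_bfd_should_work(True, False, True))    # False (interface added to the static route)
print(static_bfd_should_work(False, True, True))    # True  (the working single-hop config)
print(static_bfd_should_work(False, True, False))   # False (single-hop route without an interface)
```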
</div>
<div>
<br /></div>
<div>
Multi-hop static BFD can also use groups like single-hop can, but the config is identical (aside from not specifying the egress interfaces!), so I'm going to skip them here for brevity.</div>
<div>
<br /></div>
<div>
I've referred to echo mode in various places in the article up until now. Echo mode is a very clever way of decreasing BFD's hit on the CPU. It took me a while to figure out how it worked, however, mostly because the RFC wins the "too vague" award of the year: "When the Echo function is active, a stream of BFD Echo packets is transmitted in such a way as to have the other system loop them back through its forwarding path." http://tools.ietf.org/html/rfc5880</div>
<div>
<br /></div>
<div>
I already knew echo mode was a way to save on CPU, so I theorized that the idea was to get the BFD "are you up?" packets to be processed in fast switching instead of the control plane, but that description doesn't exactly explain it programmatically. After more googling and some Wireshark, I figured out the implementation.</div>
<div>
<br /></div>
<div>
Echo is single-hop only, so let's use R1 and R2 as my examples.</div>
<div>
<br /></div>
<div>
R1 sends an echo packet (instead of a control packet) to R2, formatted as:</div>
<div>
L3 Source: R1 (192.168.12.1)</div>
<div>
L3 Destination: R1 (192.168.12.1) </div>
<div>
MAC Source: Itself (000c.298f.aca3)</div>
<div>
MAC Destination: (000c.29cf.21ff)</div>
<div>
<br /></div>
<div>
R2 receives this packet, sees that the L3 destination is R1 rather than itself, and CEF-switches it straight back to R1! In this fashion, R1 knows that R2 is reachable.</div>
<div>
<br /></div>
<div>
R2 would perform similar behavior towards R1, for its own echo process.</div>
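<div>
<br /></div>
Here's the same idea as a small Python sketch. It's a toy model of the forwarding decision, not real packet handling; the helper names are invented, and R2's address of 192.168.12.2 is taken from the show output elsewhere in this article.

```python
R1_IP = "192.168.12.1"
R2_IP = "192.168.12.2"


def build_echo(sender_ip, sender_mac, peer_mac):
    """An echo packet is addressed at L3 to the sender itself, at L2 to the peer."""
    return {
        "src_ip": sender_ip,
        "dst_ip": sender_ip,
        "src_mac": sender_mac,
        "dst_mac": peer_mac,
    }


def forward(router_ip, packet):
    """Rough stand-in for the peer's forwarding plane."""
    if packet["dst_ip"] == router_ip:
        # Addressed to this router: it would have to go to the control plane.
        return "punt to control plane"
    # Not for us: just switch it toward the destination.
    return "forward toward " + packet["dst_ip"]


echo = build_echo(R1_IP, "000c.298f.aca3", "000c.29cf.21ff")
print(forward(R2_IP, echo))  # forward toward 192.168.12.1
```

R2's control plane never touches the packet, which is where the CPU savings come from.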
<div>
<br /></div>
<div>
There's more to know, however:</div>
<div>
- The echo packets are sent at the rate negotiated in the BFD interval (on interface or single-hop template)</div>
<div>
- Echo mode is only supported single-hop, obviously.</div>
<div>
- Control-plane packets are still sent, but they are sent at the "slow timers" speed, specified as: <b>bfd slow-timers</b> <speed>. Since the control packets are no longer vital to knowing that the neighbor is up at high-speed, you can crank down these heavier-CPU-intensive packets to slower rates.</div>
<div>
- The Cisco documentation says you need to disable ICMP redirects first - as technically speaking, the traffic above should generate a redirect - but in modern 15.1x+ IOS I have yet to see this requirement; it appears IOS is smart enough to know not to send redirects to echo packets.</div>
</div>
<div>
- Echo mode is on by default. It needs to be enabled on both sides of the link in order to work.</div>
<div>
<br /></div>
<div>
On a side note, I've periodically had problems getting echo mode to come up when labbing on the CSR1000v; it usually seems to have to do with other BFD config on the device. I would call it a bug. With some cleanup and tinkering you can usually get it to come up.</div>
<div>
<br /></div>
<div>
<div>
R1(config)#int gig1</div>
<div>
R1(config-if)#bfd echo ! this is on by default, but I'd disabled it earlier in the article.</div>
<div>
R1(config-if)#ip ospf bfd</div>
</div>
<div>
<div>
R1(config-if)#exit</div>
<div>
R1(config)#bfd slow-timers 30000 ! send control packets every 30 seconds</div>
</div>
<div>
<br /></div>
<div>
<div>
R2(config)#int gig1</div>
<div>
R2(config-if)#bfd echo</div>
<div>
R2(config-if)#ip ospf bfd</div>
</div>
<div>
<div>
R2(config-if)#exit</div>
<div>
R2(config)#bfd slow-timers 30000</div>
</div>
<div>
<br /></div>
<div>
R1#show bfd neigh det | i echo</div>
<div>
<div>
Session state is UP and using echo function with 400 ms interval.</div>
</div>
<div>
<br /></div>
<div>
<div>
R1#show bfd neigh det | i Min</div>
<div>
MinTxInt: 30000000, MinRxInt: 30000000, Multiplier: 5</div>
<div>
Received MinRxInt: 30000000, Received Multiplier: 3</div>
<div>
Min tx interval: 30000000 - Min rx interval: 30000000</div>
<div>
Min Echo interval: 400000</div>
</div>
<div>
<br /></div>
<div>
We now see in "Min Echo interval" that the echo packets are going at the pace we expected control packets at before (400ms - negotiated by the interface values), and control packets are now sending every 30 seconds.</div>
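<div>
<br /></div>
For anyone wondering where a "negotiated" value like 400ms comes from, here's a Python sketch of the asynchronous-mode timer math from RFC 5880. I'm assuming the interface values that appear elsewhere in this article (R1: interval 200, min_rx 500, multiplier 5; R2: interval 250, min_rx 400, multiplier 3), and the function names are my own. It models control packets without slow timers in play; the 400ms echo interval above is consistent with the same "no faster than the peer will accept" logic.

```python
def tx_interval_ms(local_min_tx, remote_min_rx):
    """A system sends no faster than its peer is willing to receive."""
    return max(local_min_tx, remote_min_rx)


def detection_time_ms(local_min_rx, remote_min_tx, remote_multiplier):
    """Detection time: the peer's multiplier times the agreed receive interval."""
    return remote_multiplier * max(local_min_rx, remote_min_tx)


r1 = {"min_tx": 200, "min_rx": 500, "multiplier": 5}
r2 = {"min_tx": 250, "min_rx": 400, "multiplier": 3}

# How often R1 transmits toward R2
print(tx_interval_ms(r1["min_tx"], r2["min_rx"]))  # 400
# How long R1 waits without hearing from R2 before declaring it down
print(detection_time_ms(r1["min_rx"], r2["min_tx"], r2["multiplier"]))  # 1500
```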
<div>
<br /></div>
<div>
I mentioned single-hop templates briefly above. They're not of much use outside of authentication and dampening:</div>
<div>
<br /></div>
<div>
R1(config)#bfd-template single-hop TEST</div>
<div>
R1(config-bfd)#?</div>
<div>
BFD template configuration commands:</div>
<div>
<b> authentication Authentication type</b></div>
<div>
<b> dampening Enable session dampening</b></div>
<div>
echo Use echo adjunct as bfd detection
mechanism</div>
<div>
interval Transmit interval between BFD packets</div>
<div>
</div>
<div>
<br /></div>
<div>
Dampening works much the same way as any other protocol's dampening works. If the BFD session flaps a bunch, mark it as "down" (pull it out of the routing table) for a certain amount of time to wait on stabilization.</div>
<div>
I did lab this and it does work, but it's too hard to demonstrate it in a blog, so here's the basic usage:</div>
<div>
<br /></div>
<div>
<div>
R1(config)#bfd-template single-hop TEST-SH</div>
<div>
R1(config-bfd)#interval both 300 multiplier 3</div>
<div>
R1(config-bfd)#dampening 5 4000 4000 10</div>
<div>
R1(config-bfd)#int gig1</div>
<div>
R1(config-if)#bfd ?</div>
<div>
R1(config-if)#no bfd interval 200 min_rx 500 multiplier 5 ! mutually exclusive with a single-hop template </div>
<div>
R1(config-if)#bfd template TEST-SH</div>
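<div>
<br /></div>
Since dampening is hard to demonstrate in a blog, here's a rough Python model of the usual penalty-and-decay scheme instead. Big assumptions here: I'm treating the four arguments as half-life, reuse threshold, suppress threshold, and max suppress time (the same order other IOS dampening commands use), and the units and per-flap penalty are not something I've verified for BFD. The function names are made up.

```python
def decayed_penalty(penalty, elapsed, half_life):
    """Penalty halves once per half-life (elapsed and half_life share a unit)."""
    return penalty * 0.5 ** (elapsed / half_life)


def is_suppressed(penalty, currently_suppressed, reuse, suppress):
    """Suppress at or above the suppress threshold; release once at or below reuse."""
    if currently_suppressed:
        return penalty > reuse
    return penalty >= suppress


# Values from 'dampening 5 4000 4000 10' (assumed meaning, see above)
HALF_LIFE, REUSE, SUPPRESS = 5, 4000, 4000

print(decayed_penalty(4000, 5, HALF_LIFE))         # 2000.0 after one half-life
print(is_suppressed(4000, False, REUSE, SUPPRESS)) # True: penalty hit the suppress threshold
print(is_suppressed(2000.0, True, REUSE, SUPPRESS))# False: decayed below reuse, session usable again
```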
</div>
<div>
<br /></div>
<div>
BFD Authentication is also reasonably straightforward.</div>
<div>
<br /></div>
<div>
R1(config-if)#key chain BFD</div>
<div>
<div>
R1(config-keychain)#key 1</div>
<div>
R1(config-keychain-key)#key-string cisco</div>
<div>
R1(config-keychain-key)#exit</div>
<div>
R1(config-keychain)#exit</div>
<div>
R1(config)#bfd-template single-hop TEST-SH</div>
<div>
R1(config-bfd)#authentication sha-1 keychain BFD</div>
</div>
<div>
<br /></div>
<div>
Since we configured this on only one side....</div>
<div>
<br /></div>
<div>
<div>
R1(config-bfd)#do show bfd neigh</div>
<div>
<br /></div>
<div>
IPv4 Sessions</div>
<div>
NeighAddr LD/RD RH/RS State Int</div>
<div>
192.168.12.2 4097/0 Down Down Gi1</div>
</div>
<div>
<br /></div>
<div>
<div>
R2(config-if)#key chain BFD</div>
<div>
<div>
R2(config-keychain)#key 1</div>
<div>
R2(config-keychain-key)#key-string cisco</div>
<div>
R2(config-keychain-key)#exit</div>
<div>
R2(config-keychain)#exit</div>
<div>
R2(config)#bfd-template single-hop TEST-SH</div>
<div>
R2(config-bfd)#interval both 300 multiplier 3</div>
<div>
R2(config-bfd)#authentication sha-1 keychain BFD</div>
</div>
</div>
<div>
<div>
R2(config-bfd)#int gig1</div>
<div>
<div>
R2(config-if)#no bfd interval 250 min_rx 400 multiplier 3</div>
</div>
<div>
R2(config-if)#bfd template TEST-SH</div>
</div>
<div>
<br /></div>
<div>
<div>
R1(config-bfd)#do show bfd neigh</div>
<div>
<br /></div>
<div>
IPv4 Sessions</div>
<div>
NeighAddr LD/RD RH/RS State Int</div>
<div>
192.168.12.2 4097/4097 Up Up Gi1</div>
</div>
<div>
<br /></div>
<div>
BFD authentication can use MD5, SHA1, or meticulous MD5 or SHA1. So what's meticulous? Out of scope of this document, but here's the relevant IETF draft: http://tools.ietf.org/html/draft-ietf-bfd-generic-crypto-auth-06</div>
<div>
<br /></div>
<div>
And last but certainly not least, how do you debug BFD? Honestly, most of the time I break BFD, it's because I missed a requirement - for example, forgetting to put an egress interface on single-hop static routes. In these circumstances, you get nearly zero debug output, because IOS doesn't detect that anything needs to happen. </div>
<div>
<br /></div>
<div>
If you can get BFD to realize you're trying to get it to work, you can see some inner-workings with:</div>
<div>
<b>debug bfd event</b></div>
<div>
<br /></div>
<div>
I hope you enjoyed,</div>
<div>
<br /></div>
<div>
Jeff</div>
brbcciehttp://www.blogger.com/profile/14586635047530183862noreply@blogger.com35tag:blogger.com,1999:blog-5968686435283454526.post-81547936601626797212014-05-04T15:17:00.001-07:002014-05-04T22:55:07.475-07:00
DMVPN<br />
<br />
In this post we'll take a look at DMVPN from the perspective of what I suspect will be on the CCIE R&S v5 blueprint. Admittedly I'm guessing, as all Cisco has released is "single hub DMVPN"; some of the surrounding/related topics come from practice labs I've seen, and the rest are just guesses on my part.<br />
<br />
I'm going to briefly show some scenarios which require you to think beyond single-hub design for the command structure to make sense. I can absolutely imagine Cisco would throw requirements for commands that only make sense in a larger network into the lab. My preference for my blog is to understand the practicality, design theory, and use cases behind commands, not just "if you apply this you get action X".<br />
<br />
So, at a high level - What is DMVPN?<br />
<br />
DMVPN stands for Dynamic Multipoint Virtual Private Network. It's a Cisco proprietary tunnel technology with a hub-and-spoke control plane and spoke-to-spoke tunnels. Assuming "Phase 2" or newer (more on phases later), a normal use case is to establish a full-mesh VPN over the Internet with minimal configuration. For example, 10 routers that all need VPNs to one another fall under the "full mesh formula" of N(N-1)/2, or 10(10-1)/2 = 45 tunnels. That's a lot of config. On the other hand, with DMVPN, you create the config for just 10 tunnels. The 45 might still happen if every router in fact needed to contact every other router at the same time, but we let the routers handle that part dynamically.<br />
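<br />
The gap between the two approaches grows quickly with the number of routers. A quick Python sketch of the arithmetic (function names are mine):<br />

```python
def full_mesh_tunnels(n):
    """Statically configured point-to-point tunnels for a full mesh: N(N-1)/2."""
    return n * (n - 1) // 2


def dmvpn_tunnel_interfaces(n):
    """With DMVPN, each router gets one tunnel interface."""
    return n


for routers in (4, 10, 50):
    print(routers, full_mesh_tunnels(routers), dmvpn_tunnel_interfaces(routers))
# 4 6 4
# 10 45 10
# 50 1225 50
```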
<br />
Here's our diagram:<br />
<br />
<img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhURp67zZNdQOnE6owJ_YuYeIJzigo5nIL7pmguve_cVmuNcX8jLQ2kLsdp4xEv6PFuyt-MBBzB_TMVZifKvpkVNdYMOgoqtZf6GPpCBbH9Ztx63xKETc5X7POw2OGcKrhmFvKhBazyZfM/s1600/diagram1.png" />
<br />
<br />
R1 will be our single hub, R2, R3, and R4 are all spokes. "INTERNET" represents the Internet. In theory, these routers could alternatively be dozens of hops from each other, but the concept doesn't change.<br />
<br />
As I explained above, DMVPN's control plane is hub-and-spoke, and since R1 is our hub, whatever routing protocol we're using will be pinned up via R1 to each individual spoke.<br />
<br />
So our control plane will look like this:<br />
<br />
<img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjkbODg4krTg3a-E0fQsI0wU1kEKZ0cAq4OPdsvi5QvoLb1S3TB_lkOOS-P9lO87l1hc_oEHGeoZoj98iOnMglLRjUXXTgTpEOeUs48rJN3IJ-Qan3BMdI-HJoTW1EdfMurd5mVlYj2dUI/s1600/diagram2.png" />
<br />
<br />
However, our traffic flows can be full-mesh, so the data plane will (theoretically) look like this:<br />
<br />
<img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjn34XulsgZ4J6APoJ60iutnA0-4kJ0A49XRDejKe29nN68HTmuoIPRkvjICbSM05jmEFVwrHr8ibeECFDNfelQF1I2d55oHzTahoEUf1Plu0RmLu8vKkT6jejIbK5Nj7PPR2s-A5bqi4s/s1600/diagram3.png" />
<br />
<br />
This is largely dependent on which routers needed to talk to which other routers. While hub-to-spoke tunnels are always up, spoke-to-spoke are "on demand" and are established dynamically.<br />
<br />
The nuts and bolts of how this works depend largely on which development "phase" of DMVPN you're using. We'll talk more about that shortly. First, let's take a high-level look at the three technologies that make up DMVPN - MGRE, IPSEC, and NHRP.<br />
<br />
GRE - Generic Routing Encapsulation - creates unencrypted tunnels between two endpoints. MGRE creates Multipoint GRE tunnels. These tunnels can be established to endpoints based on information discovered via NHRP, discussed below.<br />
<br />
I'm going to assume the audience has had a general exposure to IPSEC in the past. In our case, we're just using it to optionally encrypt the MGRE tunnel we're performing our routing on. I am not going to deep-dive IOS-based IPSEC with this post; one assumption I am making is that the IPSEC/VPN requirement for v5 is going to be "DocCD level", or something you can pull out of the documentation "stock" or "near stock" on short notice.<br />
<br />
NHRP - this is really what makes the magic happen on DMVPN. NHRP, at its core, resolves private addresses (those behind MGRE and optionally IPSEC) to a public address. In our example, that public network will be assumed to be the Internet. NHRP treats this public network like a big NBMA area. In fact, several comparisons can be drawn between NBMA frame relay and NHRP/DMVPN, to the point where I'm betting some of the old frame-relay tricks from the R&S lab will be repeated in DMVPN. NHRP facilitates registration between the spokes and the hubs, and helps the spokes resolve the public address of another spoke based on the tunneled IPs behind it.<br />
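<br />
If it helps, think of the hub's NHRP role as a lookup table keyed on tunnel IP. Here's a bare-bones Python sketch of registration and resolution. It ignores timers, authentication, and everything else NHRP really does, and the class name is invented; the addresses are the ones used in the lab below.<br />

```python
class NhrpServer:
    """Toy next-hop server: maps tunnel (private) IPs to NBMA (public) IPs."""

    def __init__(self):
        self.cache = {}

    def register(self, tunnel_ip, nbma_ip):
        # A spoke tells the hub: "this tunnel IP lives behind this public IP".
        self.cache[tunnel_ip] = nbma_ip

    def resolve(self, tunnel_ip):
        # Another spoke asks: "what public IP do I send to for this tunnel IP?"
        return self.cache.get(tunnel_ip)


hub = NhrpServer()
hub.register("10.0.0.2", "87.14.20.1")
hub.register("10.0.0.3", "87.14.30.1")

print(hub.resolve("10.0.0.2"))  # 87.14.20.1
print(hub.resolve("10.0.0.9"))  # None (nobody registered that address)
```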
<br />
Next, let's look at the three phases of DMVPN and some sample config for all of them.<br />
<br />
DMVPN "Phase 1". This phase is largely unused, and, as I understand it, was an early deployment model. When most people refer to "DMVPN" these days, they're talking about the behavior expected from Phase 2 or Phase 3, not Phase 1.<br />
<br />
Phase 1 pins not only the control plane through the hub, but also the data plane, so <b>all</b> your traffic goes through the hub.<br />
<br />
The differentiating components of Phase 1 are:<br />
- An MGRE tunnel on the hub<br />
- A standard GRE tunnel on the spokes<br />
- A routing protocol on the hub that sets next-hop-self<br />
<br />
A sample config based on our diagram from above:<br />
<br />
<i>For brevity, this config is applied on all four routers identically, but I will only show it here once:</i><br />
crypto isakmp policy 1<br />
encr aes 256<br />
authentication pre-share<br />
group 5<br />
<br />
crypto isakmp key ABCcisco123 address 0.0.0.0<br />
<br />
crypto ipsec transform-set TRANSFORM-SET esp-aes esp-sha-hmac<br />
mode transport<br />
<br />
crypto ipsec profile IPSEC_PROFILE<br />
set transform-set TRANSFORM-SET<br />
<div>
<br /></div>
!R1 - The hub<br />
interface Tunnel1<br />
ip address 10.0.0.1 255.255.255.0<br />
no ip redirects<br />
no ip split-horizon eigrp 100<br />
ip nhrp authentication CISCO<br />
ip nhrp map multicast dynamic<br />
ip nhrp network-id 1<br />
tunnel source FastEthernet0/0<br />
tunnel mode gre multipoint<br />
<div>
<div>
tunnel protection ipsec profile IPSEC_PROFILE</div>
</div>
<div>
<br /></div>
<div>
<div>
router eigrp 100</div>
<div>
network 1.1.1.1 0.0.0.0</div>
<div>
network 10.0.0.0 0.0.0.255</div>
</div>
<div>
<br /></div>
<div>
!R2 - A Spoke</div>
<div>
<div>
interface Tunnel1</div>
<div>
ip address 10.0.0.2 255.255.255.0</div>
<div>
ip nhrp authentication CISCO</div>
<div>
ip nhrp map 10.0.0.1 87.14.10.1</div>
<div>
ip nhrp network-id 1</div>
<div>
ip nhrp nhs 10.0.0.1</div>
<div>
tunnel source FastEthernet0/0</div>
<div>
tunnel destination 87.14.10.1</div>
</div>
<div>
<div>
<div>
<div>
tunnel protection ipsec profile IPSEC_PROFILE</div>
</div>
<div>
<br /></div>
<div>
router eigrp 100</div>
<div>
network 2.2.2.2 0.0.0.0</div>
<div>
network 10.0.0.0 0.0.0.255</div>
</div>
</div>
<div>
<br /></div>
<div>
!R3 - Another Spoke</div>
<div>
<div>
interface Tunnel1</div>
<div>
ip address 10.0.0.3 255.255.255.0</div>
<div>
ip nhrp authentication CISCO</div>
<div>
ip nhrp map 10.0.0.1 87.14.10.1</div>
<div>
ip nhrp network-id 1</div>
<div>
ip nhrp nhs 10.0.0.1</div>
<div>
tunnel source FastEthernet0/0</div>
<div>
tunnel destination 87.14.10.1</div>
</div>
<div>
<div>
tunnel protection ipsec profile IPSEC_PROFILE</div>
</div>
<div>
<br /></div>
<div>
<div>
router eigrp 100</div>
<div>
network 3.3.3.3 0.0.0.0</div>
<div>
network 10.0.0.0 0.0.0.255</div>
</div>
<div>
<br /></div>
<div>
!R4 - Another Spoke</div>
<div>
<div>
interface Tunnel1</div>
<div>
ip address 10.0.0.4 255.255.255.0</div>
<div>
ip nhrp authentication CISCO</div>
<div>
ip nhrp map 10.0.0.1 87.14.10.1</div>
<div>
ip nhrp network-id 1</div>
<div>
ip nhrp nhs 10.0.0.1</div>
<div>
tunnel source FastEthernet0/0</div>
<div>
tunnel destination 87.14.10.1</div>
<div>
<div>
tunnel protection ipsec profile IPSEC_PROFILE</div>
</div>
<div>
<br /></div>
<div>
router eigrp 100</div>
<div>
network 4.4.4.4 0.0.0.0</div>
<div>
network 10.0.0.0 0.0.0.255</div>
</div>
<div>
<br /></div>
<div>
Assume the "Internet" router knows how to reach all public IP space on the 87.14.0.0/16 network, and that each router participating in DMVPN has a <b>private</b> loopback of X.X.X.X, where X is its router number.</div>
<div>
<br /></div>
<div>
Before we look at the output from this config, let's break it apart a bit.</div>
<div>
<br /></div>
<div>
<b>Note: </b>I'm going to deliberately ignore most of the crypto config, this can be pulled out of the DocCD very easily from "Security" and then "Secure Connectivity Configuration Guide Library, Cisco IOS Release 15M&T", and then "Dynamic Multipoint VPN Configuration Guide, Cisco IOS Release 15M&T". </div>
<div>
<br /></div>
<div>
On R1, the hub -</div>
<div>
<br />
<b>crypto ipsec transform-set TRANSFORM-SET esp-aes esp-sha-hmac</b><br />
<b> mode transport</b><br />
<div>
<br /></div>
<div>
This is the only part of the crypto config I'm going to drill into. I was initially confused as to when to use "mode tunnel" and when to use "mode transport". I've seen examples with both. Unless you are doing a multi-tier DMVPN hub (one set of routers doing crypto-only, another set doing NHRP and the routing), which is clearly out of scope of the R&S lab, you want to use <b>transport.</b> Tunnel adds 20 bytes of overhead and comes out with the same exact results as transport. I suppose if you got a lab question that said "use the IPSEC method that requires more overhead", this might be important? The rest of this document will assume we are using <b>transport</b> only.</div>
<br />
<b>no ip split-horizon eigrp 100 </b>- Clearly, we're going to be taking EIGRP routes in from one spoke and passing them to another. If we don't disable split horizon that process will not happen.<br />
<br />
<b>ip nhrp authentication CISCO </b>- This is a plain-text key, more of an "identifier" than a password. Keeping in mind that this traffic will be inside IPSEC, it doesn't need its own encryption method per se.<br />
<br />
<b>ip nhrp map multicast dynamic </b>- This tells the hub to pseudo-multicast to any spoke that registers to it. This is (usually) necessary for the routing protocol to communicate.<br />
<br />
<b>ip nhrp network-id 1 </b>- This is a <i>local identifier only</i>. It is not communicated across the network. Think of it similarly to the OSPF process number. You must have it enabled, and it must be unique to your tunnel, or NHRP will not work.<br />
<br />
<b>tunnel source FastEthernet0/0 </b>- where to source tunnel packets from<br />
<br />
<b>tunnel mode gre multipoint </b>- This may as well read "tunnel mode gre dmvpn". A GRE multipoint tunnel, by its nature, must use NHRP for resolution.<br />
<br />
<b>tunnel protection ipsec profile IPSEC_PROFILE </b>- Encrypt this tunnel with our IPSEC config from above. Note, the IPSEC config above used a pre-shared key (PSK), but it's worth pointing out that a public key infrastructure (PKI) can be used as well, although that is beyond the scope of this document.</div>
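<div>
<br /></div>
Of the commands above, split horizon is the one that's easiest to forget, so here's a small Python sketch of why it matters on the hub. This is a simplified model of the split-horizon rule, not EIGRP itself, and the function name is made up. The loopback prefixes follow the X.X.X.X convention used in this lab.

```python
def advertise(routes, out_interface, split_horizon):
    """Return the prefixes a router would advertise out of out_interface.

    routes: list of (prefix, interface_the_route_was_learned_on)
    """
    advertised = []
    for prefix, learned_on in routes:
        # Split horizon: never advertise a route back out the interface it came in on.
        if split_horizon and learned_on == out_interface:
            continue
        advertised.append(prefix)
    return advertised


hub_routes = [
    ("1.1.1.1/32", "Loopback0"),  # the hub's own loopback
    ("2.2.2.2/32", "Tunnel1"),    # learned from R2
    ("3.3.3.3/32", "Tunnel1"),    # learned from R3
]

print(advertise(hub_routes, "Tunnel1", split_horizon=True))
# ['1.1.1.1/32']
print(advertise(hub_routes, "Tunnel1", split_horizon=False))
# ['1.1.1.1/32', '2.2.2.2/32', '3.3.3.3/32']
```

Every spoke lives on the same Tunnel1 interface from the hub's perspective, so with split horizon left on, the spokes never hear about each other.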
<div>
<br /></div>
<div>
On R2, a spoke (excluding any repetition of commands I explained on the hub) -</div>
<div>
<div>
<div>
<br /></div>
<div>
<b>ip nhrp map 10.0.0.1 87.14.10.1 </b>- This is a <b>lot</b> like the <b>frame-relay map </b>command, which you'll know well if you were a student of CCIE v4. In this case, we're mapping private IP 10.0.0.1 to NBMA IP 87.14.10.1. This is to facilitate registration to the hub.</div>
<div>
<br /></div>
<div>
<b>ip nhrp nhs 10.0.0.1</b> - nhs stands for "next hop server", which is the hub. This basically says "register my private IP address to my NBMA address (87.14.20.1 in this case) with the hub, so it knows how to reach me." </div>
<div>
<br /></div>
<div>
<b>tunnel destination 87.14.10.1 </b>- You'll note a lack of the <b>tunnel mode gre multipoint</b> command on this tunnel. That's because in phase 1, the spokes only get regular GRE tunnels. So in this case, we have to set the destination statically to the hub.</div>
</div>
<div>
<br /></div>
<div>
Let's now look at the outcome of all this on R1, the hub.</div>
</div>
<div>
<br /></div>
<div>
<div>
R1#sh ip nhrp</div>
<div>
10.0.0.2/32 via 10.0.0.2</div>
<div>
Tunnel1 created 00:52:07, expire 01:35:14</div>
<div>
Type: dynamic, Flags: unique registered used</div>
<div>
NBMA address: 87.14.20.1</div>
<div>
10.0.0.3/32 via 10.0.0.3</div>
<div>
Tunnel1 created 00:51:25, expire 01:35:13</div>
<div>
Type: dynamic, Flags: unique registered used</div>
<div>
NBMA address: 87.14.30.1</div>
<div>
10.0.0.4/32 via 10.0.0.4</div>
<div>
Tunnel1 created 00:50:55, expire 01:35:13</div>
<div>
Type: dynamic, Flags: unique registered used</div>
<div>
NBMA address: 87.14.40.1</div>
</div>
<div>
<br /></div>
<div>
We see the three mappings for the three spokes that registered to the hub. We see type "dynamic" - meaning it was learned through registration, "unique registered" - meaning the spoke has instructed the hub not to accept a registration for the same private address from a different NBMA address, and of course we see the NBMA address for each IP listed.</div>
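<div>
<br /></div>
The "unique" flag is easier to see in code than in prose. A Python sketch of the behavior as I've described it; the function is invented and the second NBMA address below is made up for the example:

```python
def register(cache, tunnel_ip, nbma_ip, unique=True):
    """Try to register tunnel_ip at nbma_ip; return True if the hub accepts it."""
    existing = cache.get(tunnel_ip)
    if existing is not None and existing["unique"] and existing["nbma"] != nbma_ip:
        # Same private address, different public address: refuse it.
        return False
    cache[tunnel_ip] = {"nbma": nbma_ip, "unique": unique}
    return True


hub_cache = {}
print(register(hub_cache, "10.0.0.2", "87.14.20.1"))  # True: first registration
print(register(hub_cache, "10.0.0.2", "87.14.20.1"))  # True: same spoke refreshing
print(register(hub_cache, "10.0.0.2", "87.14.99.1"))  # False: different NBMA, rejected
print(hub_cache["10.0.0.2"]["nbma"])                  # 87.14.20.1
```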
<div>
<br /></div>
<div>
On the topic of NHRP mappings, optionally, we could also add static entries on the hub:</div>
<div>
<br /></div>
<div>
<div>
R1(config-if)#ip nhrp map 10.0.0.10 4.2.2.2</div>
<div>
<div>
R1(config-if)#ip nhrp map multicast 4.2.2.2 ! optional</div>
</div>
<div>
<br /></div>
<div>
R1#sh ip nhrp | s 10.0.0.10</div>
<div>
10.0.0.10/32 via 10.0.0.10</div>
<div>
Tunnel1 created 00:00:09, never expire</div>
<div>
Type: static, Flags:</div>
<div>
NBMA address: 4.2.2.2</div>
</div>
<div>
<br /></div>
<div>
This entry will, of course, do nothing, but I wanted to demonstrate the idea.</div>
<div>
We can also see who we're pseudo-multicasting towards:</div>
<div>
<div>
R1#sh ip nhrp multicast</div>
<div>
I/F NBMA address</div>
<div>
Tunnel1 4.2.2.2 Flags: static</div>
<div>
Tunnel1 87.14.20.1 Flags: dynamic</div>
<div>
Tunnel1 87.14.30.1 Flags: dynamic</div>
<div>
Tunnel1 87.14.40.1 Flags: dynamic</div>
</div>
<div>
<br /></div>
<div>
What about the routing protocol?</div>
<div>
<br /></div>
<div>
<div>
R1#sh ip eigrp neigh</div>
<div>
EIGRP-IPv4 Neighbors for AS(100)</div>
<div>
H Address Interface Hold Uptime SRTT RTO Q Seq</div>
<div>
(sec) (ms) Cnt Num</div>
<div>
2 10.0.0.2 Tu1 14 00:33:26 299 1794 0 13</div>
<div>
1 10.0.0.4 Tu1 11 00:33:32 818 4908 0 13</div>
<div>
0 10.0.0.3 Tu1 11 00:33:33 409 2454 0 16</div>
</div>
<div>
<br /></div>
<div>
We have EIGRP peerings with all the neighbors.</div>
<div>
<br /></div>
<div>
At this point, we should have any-to-any connectivity, via the hub. Let's test it out:</div>
<div>
<br /></div>
<div>
<div>
R4#ping 3.3.3.3 source lo0</div>
<div>
Type escape sequence to abort.</div>
<div>
Sending 5, 100-byte ICMP Echos to 3.3.3.3, timeout is 2 seconds:</div>
<div>
Packet sent with a source address of 4.4.4.4</div>
<div>
!!!!!</div>
<div>
Success rate is 100 percent (5/5), round-trip min/avg/max = 168/205/240 ms</div>
<div>
<br /></div>
</div>
<div>
<div>
R4#trace 3.3.3.3 source lo0</div>
<div>
Type escape sequence to abort.</div>
<div>
Tracing the route to 3.3.3.3</div>
<div>
VRF info: (vrf in name/id, vrf out name/id)</div>
<div>
1 10.0.0.1 156 msec 152 msec 128 msec</div>
<div>
2 10.0.0.3 224 msec 236 msec 252 msec</div>
</div>
<div>
<br /></div>
<div>
We see we have connectivity from 4.4.4.4 (loopback of R4) to 3.3.3.3 (loopback of R3); however, it goes through the hub - not very efficient, since the hub doesn't need to be in the forwarding path. That is, however, the drawback of phase 1.</div>
<div>
<br /></div>
<div>
Phase 2 is where DMVPN really starts to shine, because it gets the hub (more or less) out of the data plane forwarding path for spoke-to-spoke communication.</div>
<div>
<br /></div>
<div>
Building off our existing config, let's implement a phase 2 configuration.</div>
<div>
<br /></div>
<div>
R1:</div>
<div>
<div>
interface Tunnel1</div>
<div>
no ip next-hop-self eigrp 100</div>
</div>
<div>
<div>
no ip nhrp map 10.0.0.10 4.2.2.2 ! just for cleanup</div>
<div>
no ip nhrp map multicast 4.2.2.2 ! just for cleanup</div>
</div>
<div>
<br /></div>
<div>
R2-R4:</div>
<div>
<div>
interface Tunnel1</div>
</div>
<div>
no tunnel destination 87.14.10.1</div>
<div>
<div>
tunnel mode gre multipoint</div>
</div>
<div>
<div>
ip nhrp map multicast 87.14.10.1</div>
</div>
<div>
<br /></div>
<div>
We'll do a high-level breakdown of this config, then spend a good bit of time on the theory behind Phase 2. While the config isn't a complex change, there's a lot more going on behind the scenes.</div>
<div>
<br /></div>
<div>
On the hub:<br />
<div>
<b>no ip next-hop-self eigrp 100 </b>- This is absolutely vital. You can set up the rest of NHRP to happily work spoke-to-spoke, but if you don't tell the hub to leave the next hops unchanged when it re-advertises routes, you're going to get behavior very akin to phase 1.</div>
<div>
<br /></div>
<div>
On any spoke:</div>
<div>
<div>
<b>no tunnel destination 87.14.10.1 </b>- this is only used with a regular GRE tunnel and isn't required any longer.</div>
<div>
<br /></div>
<div>
<b>tunnel mode gre multipoint </b>- Swap from a point-to-point to multipoint tunnel on the spokes. Now, the spokes will be using NHRP for resolution as well as the hub.</div>
<div>
<br /></div>
<div>
<b>ip nhrp map multicast 87.14.10.1 </b>- When we were using a standard GRE tunnel, it was inherently point-to-point, and natively supported multicast without any extra instructions. Now, we have to tell it to pseudo-multicast to the hub.</div>
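<div>
<br /></div>
To show why the next-hop change on the hub is the piece that actually moves traffic off the hub, here's a Python sketch of what the hub does to a route it learned from R2 before passing it on to R3. It's a simplification of the routing protocol's behavior with an invented function name:

```python
def readvertise(route, hub_tunnel_ip, next_hop_self):
    """Model the hub passing a spoke's route along to the other spokes."""
    if next_hop_self:
        # Default behavior: the hub inserts itself as the next hop.
        next_hop = hub_tunnel_ip
    else:
        # With next-hop-self disabled: the original spoke stays the next hop.
        next_hop = route["next_hop"]
    return {"prefix": route["prefix"], "next_hop": next_hop}


learned_from_r2 = {"prefix": "2.2.2.2/32", "next_hop": "10.0.0.2"}

print(readvertise(learned_from_r2, "10.0.0.1", next_hop_self=True)["next_hop"])
# 10.0.0.1 (traffic from R3 to 2.2.2.2 goes to the hub, like phase 1)
print(readvertise(learned_from_r2, "10.0.0.1", next_hop_self=False)["next_hop"])
# 10.0.0.2 (R3 now has a reason to resolve R2 directly)
```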
</div>
<div>
<br /></div>
<div>
Before we deep dive into what's going on behind the scenes, let's look at what's changed.</div>
<div>
<br /></div>
<div>
<div>
R3#sh ip route 2.2.2.2</div>
<div>
Routing entry for 2.2.2.2/32</div>
<div>
Known via "eigrp 100", distance 90, metric 28288000, type internal</div>
<div>
Redistributing via eigrp 100</div>
<div>
Last update from <b>10.0.0.2 on Tunnel1</b>, 00:04:52 ago</div>
<div>
Routing Descriptor Blocks:</div>
<div>
* 10.0.0.2, from <b>10.0.0.1</b>, 00:04:52 ago, via Tunnel1</div>
<div>
Route metric is 28288000, traffic share count is 1</div>
<div>
Total delay is 105000 microseconds, minimum bandwidth is 100 Kbit</div>
<div>
Reliability 255/255, minimum MTU 1434 bytes</div>
<div>
Loading 1/255, Hops 2</div>
</div>
<div>
<br /></div>
<div>
We learned 2.2.2.2 (R2's loopback) on R3 via 10.0.0.1 (R1), but the forwarding path is via 10.0.0.2.</div>
<div>
That's great, how do we reach 10.0.0.2?</div>
<div>
<br /></div>
<div>
<div>
R3#sh ip route 10.0.0.2</div>
<div>
Routing entry for 10.0.0.0/24</div>
<div>
Known via "connected", distance 0, metric 0 (connected, via interface)</div>
<div>
Redistributing via eigrp 100</div>
<div>
Routing Descriptor Blocks:</div>
<div>
* directly connected, via Tunnel1</div>
<div>
Route metric is 0, traffic share count is 1</div>
</div>
<div>
<br /></div>
<div>
OK, not so fast... while it is "on subnet" on Tunnel1, Tunnel1 is NBMA, so we can't just forward there without some type of resolution.</div>
<div>
<br /></div>
<div>
<div>
R3#sh ip cef 10.0.0.2 internal</div>
<div>
10.0.0.0/24, epoch 0, flags attached, connected, cover dependents, need deagg, RIB[C], refcount 5, per-destination sharing</div>
<div>
sources: RIB</div>
<div>
feature space:</div>
<div>
IPRM: 0x0003800C</div>
<div>
subblocks:</div>
<div>
gsb Connected chain head(1): 0x6AAF5DF4</div>
<div>
Covered dependent prefixes: 3</div>
<div>
need deagg: 2</div>
<div>
notify cover updated: 1</div>
<div>
ifnums:</div>
<div>
Tunnel1(10)</div>
<div>
path 6B108C6C, path list 6B101100, share 1/1, type connected prefix, for IPv4</div>
<div>
<b> connected to Tunnel1, adjacency punt</b></div>
<div>
<b> output chain: punt</b></div>
</div>
<div>
<br /></div>
<div>
The important lines are the bottom two. I've read in other blogs that we should see a "glean adjacency" for unresolved NHRP next hops, but I haven't been able to reproduce that on 15.2; I suspect Cisco changed the output. But there's our answer plain as day: <b>punt</b>. This traffic cannot be CEF forwarded because we have an unresolved dependency; we don't know how to get to 10.0.0.2.</div>
<div>
<br /></div>
<div>
The CPU knows we need NHRP for this to work, and it doesn't have a resolution in its NHRP cache yet:</div>
<div>
<br /></div>
<div>
<div>
R3#sh ip nhrp</div>
<div>
10.0.0.1/32 via 10.0.0.1</div>
<div>
Tunnel1 created 00:20:23, never expire</div>
<div>
Type: static, Flags: used</div>
<div>
NBMA address: 87.14.10.1</div>
<div>
<br /></div>
</div>
<div>
So how do we get it?</div>
<div>
<br /></div>
<div>
This is a reasonably clever process, and it only gets more clever once we get into Phase 3. The CPU, after the punt, will forward the traffic to the hub by default. This ensures while we're waiting on NHRP to do its thing and the spoke-to-spoke tunnel to build, we're not dropping packets. Generally speaking you can expect the first 2-3 packets to get punted.</div>
<div>
<br /></div>
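<div>
To make that fallback concrete, here's a minimal Python sketch of the decision the spoke is making. This is a toy model and not IOS internals: the <b>forward</b> function, the <b>pending</b> set and the dictionary cache are names I made up for illustration; only the addresses come from the lab.</div>

```python
# Toy model of a Phase 2 spoke's forwarding decision (not IOS internals).
HUB_TUNNEL_IP = "10.0.0.1"

def forward(next_hop, nhrp_cache, pending):
    """Return the NBMA address a packet for next_hop is sent to."""
    if next_hop in nhrp_cache:
        # Resolved: send straight to the peer's NBMA address.
        return nhrp_cache[next_hop]
    # Unresolved: note that a resolution is outstanding, and relay via
    # the hub (whose mapping is static) in the meantime.
    pending.add(next_hop)
    return nhrp_cache[HUB_TUNNEL_IP]

cache = {"10.0.0.1": "87.14.10.1"}            # static hub mapping on R3
pending = set()
first = forward("10.0.0.2", cache, pending)   # relayed via the hub
cache["10.0.0.2"] = "87.14.20.1"              # resolution arrives
later = forward("10.0.0.2", cache, pending)   # direct spoke-to-spoke
print(first, later)
```

<div>
The first lookup falls back to the hub's NBMA address; once the cache has an entry for 10.0.0.2, the same lookup goes direct.</div>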
<div>
On the hub, <b>debug nhrp</b></div>
<div>
<br /></div>
<div>
On the spoke:</div>
<div>
<div>
R3#ping 2.2.2.2 source lo0</div>
<div>
Type escape sequence to abort.</div>
<div>
Sending 5, 100-byte ICMP Echos to 2.2.2.2, timeout is 2 seconds:</div>
<div>
Packet sent with a source address of 3.3.3.3</div>
<div>
!</div>
</div>
<div>
<br /></div>
<div>
<cutting to the debugs on R1></div>
<div>
<note this has been cut to the key elements for brevity, <b>debug nhrp</b> is wordy></div>
<div>
<br /></div>
<div>
<div>
NHRP: Tunnels gave us remote_nbma: 87.14.30.1 for Redirect</div>
<div>
NHRP: Receive Resolution Request via Tunnel1 vrf 0, packet size: 85</div>
<div>
<br /></div>
<div>
The hub knows R3 needs a resolution for R2.</div>
<div>
<br /></div>
<div>
NHRP: nhrp_rtlookup for destination on 10.0.0.2 yielded interface Tunnel1, prefixlen 24</div>
<div>
NHRP-ATTR: In nhrp_recv_resolution_request NHRP Resolution Request packet is forwarded to 10.0.0.2</div>
<div>
NHRP: Attempting to forward to destination: 10.0.0.2</div>
<div>
NHRP: Attempting to send packet via DEST 10.0.0.2</div>
<div>
NHRP: Encapsulation succeeded. Sending NHRP Control Packet NBMA Address: 87.14.20.1</div>
<div>
NHRP: Forwarding Resolution Request via Tunnel1 vrf 0, packet size: 105</div>
<div>
src: 10.0.0.1, dst: 10.0.0.2</div>
<div>
NHRP: 129 bytes out Tunnel1</div>
<div>
<br /></div>
<div>
<div>
But it doesn't answer R3. Instead, it forwards the NHRP request, which includes R3's NBMA address, on to R2. Not pictured here, it also forwards the ping packet from R3 to R2 at the same time, so that no traffic is lost.</div>
</div>
<div>
<br /></div>
<div>
Meanwhile, on R2... R2 has received the initial ping echo request, along with the NHRP control packet. R2 will now set up an encrypted (IPSEC) MGRE tunnel back to R3! However, in the meantime, it still needs to forward its echo reply, and we can't just stall that until the tunnel comes up. So we see the reverse traffic from R2, trying to resolve for R3.</div>
<div>
<br /></div>
<div>
NHRP: Receive Resolution Request via Tunnel1 vrf 0, packet size 85</div>
<div>
NHRP: nhrp_rtlookup for destination on 10.0.0.3 yielded interface Tunnel1, prefixlen 24</div>
<div>
NHRP-ATTR: In nhrp_recv_resolution_request NHRP Resolution Request packet is forwarded to 10.0.0.3</div>
<div>
NHRP: Attempting to forward to destination: 10.0.0.3</div>
<div>
NHRP: Attempting to send packet via DEST 10.0.0.3</div>
<div>
NHRP: Encapsulation succeeded. Sending NHRP Control Packet NBMA Address: 87.14.30.1</div>
<div>
<br /></div>
</div>
<div>
And the traffic is delivered to R3, via R1.</div>
<div>
<br /></div>
<div>
During this process I've also enabled <b>debug dmvpn all tunnel </b>on R2, so we can see the crypto process fire off (note, this was also edited for brevity):</div>
<div>
<br /></div>
<div>
<div>
IPSEC-IFC MGRE/Tu1: Checking to see if we need to delay for src 87.14.20.1 dst 87.14.30.1</div>
<div>
IPSEC-IFC MGRE/Tu1(87.14.20.1/87.14.30.1): Opening a socket with profile IPSEC_PROFILE</div>
<div>
IPSEC-IFC MGRE/Tu1(87.14.20.1/87.14.30.1): connection lookup returned 0</div>
<div>
IPSEC-IFC MGRE/Tu1(87.14.20.1/87.14.30.1): Triggering tunnel immediately.</div>
<div>
IPSEC-IFC MGRE/Tu1: Adding Tunnel1 tunnel interface to shared list</div>
<div>
IPSEC-IFC MGRE/Tu1: Need to delay.</div>
<div>
IPSEC-IFC MGRE/Tu1(87.14.20.1/87.14.30.1): good socket ready message</div>
<div>
IPSEC-IFC MGRE/Tu1(87.14.20.1/87.14.30.1): Got MTU message mtu 1458</div>
<div>
IPSEC-IFC MGRE/Tu1(87.14.20.1/87.14.30.1): tunnel_protection_socket_up</div>
<div>
IPSEC-IFC MGRE/Tu1(87.14.20.1/87.14.30.1): Signalling NHRP</div>
</div>
<div>
<br /></div>
<div>
And back on R3, we can see that the tunnel is up:</div>
</div>
<div>
<br /></div>
<div>
<div>
R3#show dmvpn | b Tunnel1</div>
<div>
Interface: Tunnel1, IPv4 NHRP Details</div>
<div>
Type:Spoke, NHRP Peers:2,</div>
<div>
<br /></div>
<div>
# Ent Peer NBMA Addr Peer Tunnel Add State UpDn Tm Attrb</div>
<div>
----- --------------- --------------- ----- -------- -----</div>
<div>
1 87.14.10.1 10.0.0.1 UP 00:04:38 S</div>
<div>
<b> 1 87.14.20.1 10.0.0.2 UP 00:04:03 D</b></div>
</div>
<div>
<br /></div>
<div>
We can also see that now that we have resolution for R2 (and a dynamic tunnel), we can now directly CEF switch to it:</div>
<div>
<br /></div>
<div>
<div>
R3#sh ip cef 2.2.2.2 internal</div>
<div>
2.2.2.2/32, epoch 0, RIB[I], refcount 5, per-destination sharing</div>
<div>
sources: RIB</div>
<div>
feature space:</div>
<div>
IPRM: 0x00028000</div>
<div>
ifnums:</div>
<div>
Tunnel1(12): 10.0.0.2</div>
<div>
path 6B1081EC, path list 6B100D40, share 1/1, type attached nexthop, for IPv4</div>
<div>
<b> nexthop 10.0.0.2 Tunnel1, adjacency IP midchain out of Tunnel1, addr 10.0.0.2 6AE17200</b></div>
<div>
output chain: IP midchain out of Tunnel1, addr 10.0.0.2 6AE17200 IP adj out of FastEthernet0/0, addr 87.14.30.100 6925D980</div>
</div>
<div>
<br /></div>
<div>
We see the appropriate next-hop on Tunnel1, and no longer a mention of "punt".</div>
<div>
<br /></div>
<div>
Just to further prove the point:</div>
<div>
<div>
R3#trace 2.2.2.2 source lo0</div>
<div>
Type escape sequence to abort.</div>
<div>
Tracing the route to 2.2.2.2</div>
<div>
VRF info: (vrf in name/id, vrf out name/id)</div>
<div>
1 10.0.0.2 172 msec 192 msec 184 msec</div>
</div>
<div>
<br /></div>
<div>
We also see we have an NHRP resolution now:</div>
<div>
<br /></div>
<div>
<div>
R3#sh ip nhrp</div>
<div>
10.0.0.1/32 via 10.0.0.1</div>
<div>
Tunnel1 created 00:08:01, never expire</div>
<div>
Type: static, Flags: used</div>
<div>
NBMA address: 87.14.10.1</div>
<div>
<b>10.0.0.2/32 via 10.0.0.2</b></div>
<div>
<b> Tunnel1 created 00:07:26, expire 01:52:36</b></div>
<div>
<b> Type: dynamic, Flags: router used</b></div>
<div>
<b> NBMA address: 87.14.20.1</b></div>
<div>
10.0.0.3/32 via 10.0.0.3</div>
<div>
Tunnel1 created 00:07:24, expire 01:52:35</div>
<div>
Type: dynamic, Flags: router unique local</div>
<div>
NBMA address: 87.14.30.1</div>
<div>
(no-socket)</div>
</div>
<div>
<br /></div>
<div>
You'd see the flip-side of that same output on R2.</div>
<div>
<br /></div>
<div>
Before we push on to Phase 3, we need to spend some time looking at the various possible routing protocols for the NHRP/DMVPN control plane, and some of the oddities.</div>
<div>
<br /></div>
<div>
We've covered EIGRP fairly well thus far. The only thing I need to add is that in a production environment, you need to set the bandwidth manually on the interface, regardless of whether or not you're using it as a QoS value. You may remember back from your CCNA/CCNP days that EIGRP will, by default, only use up to half of the configured bandwidth of a link: </div>
<div>
<br /></div>
<div>
<a href="http://www.cisco.com/c/en/us/support/docs/ip/enhanced-interior-gateway-routing-protocol-eigrp/13672-12.html#band">http://www.cisco.com/c/en/us/support/docs/ip/enhanced-interior-gateway-routing-protocol-eigrp/13672-12.html#band</a></div>
<div>
<br /></div>
<div>
<div>
R3#show int tun1 | i bandwidth</div>
<div>
Tunnel transmit bandwidth 8000 (kbps)</div>
<div>
Tunnel receive bandwidth 8000 (kbps)</div>
</div>
<div>
<br /></div>
<div>
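Note that EIGRP keys off the interface <b>bandwidth</b> value rather than these tunnel transmit/receive figures; the earlier route output showed a minimum bandwidth of 100 Kbit on the path, and half of that won't get you too many EIGRP updates.</div>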
<div>
<br /></div>
<div>
R1-R4:</div>
<div>
interface Tunnel1</div>
<div>
bandwidth 1000 ! or any reasonable number for your environment</div>
<div>
<br /></div>
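<div>
The arithmetic is trivial, but it's worth seeing side by side. A minimal sketch, assuming the default 50 percent limit (the knob behind <b>ip bandwidth-percent eigrp</b>); the function name is mine, the 100 Kbit figure is the minimum bandwidth from the earlier route output, and 1000 is the value configured above.</div>

```python
def eigrp_budget_kbps(bandwidth_kbps, percent=50):
    """Bandwidth EIGRP may use, given the interface bandwidth and percent."""
    return bandwidth_kbps * percent / 100

for bw in (100, 1000):
    print(bw, "kbps interface:", eigrp_budget_kbps(bw), "kbps for EIGRP")
```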
<div>
We'll look at RIP next - it's super-easy.</div>
<div>
<br /></div>
<div>
R1-R4:</div>
<div>
no router eigrp 100</div>
<div>
router rip</div>
<div>
version 2</div>
<div>
network 10.0.0.0</div>
<div>
network <their specific loopback prefix></div>
<div>
<br /></div>
<div>
R1:</div>
<div>
interface Tunnel1</div>
<div>
no ip split-horizon</div>
<div>
<br /></div>
<div>
That's it ... </div>
<div>
<br /></div>
<div>
<div>
R3#sh ip route rip | b Gateway</div>
<div>
Gateway of last resort is 87.14.30.100 to network 0.0.0.0</div>
<div>
<br /></div>
<div>
R 1.0.0.0/8 [120/1] via 10.0.0.1, 00:00:06, Tunnel1</div>
<div>
R 2.0.0.0/8 [120/2] via 10.0.0.1, 00:00:06, Tunnel1</div>
<div>
3.0.0.0/8 is variably subnetted, 2 subnets, 2 masks</div>
<div>
R 3.0.0.0/8 [120/2] via 10.0.0.1, 00:01:31, Tunnel1</div>
<div>
R 4.0.0.0/8 [120/2] via 10.0.0.1, 00:00:06, Tunnel1</div>
</div>
<div>
<br /></div>
<div>
<div>
R3#ping 2.2.2.2 source lo0</div>
<div>
Type escape sequence to abort.</div>
<div>
Sending 5, 100-byte ICMP Echos to 2.2.2.2, timeout is 2 seconds:</div>
<div>
Packet sent with a source address of 3.3.3.3</div>
<div>
!!!!!</div>
<div>
Success rate is 100 percent (5/5), round-trip min/avg/max = 304/326/336 ms</div>
</div>
<div>
<br /></div>
<div>
<div>
<div>
R3#trace 2.2.2.2 source lo0</div>
<div>
Type escape sequence to abort.</div>
<div>
Tracing the route to 2.2.2.2</div>
<div>
VRF info: (vrf in name/id, vrf out name/id)</div>
<div>
1 10.0.0.2 160 msec 192 msec 196 msec</div>
</div>
</div>
<div>
<br /></div>
<div>
In order to get Spoke->Spoke, and not Spoke->Hub->Spoke, you need to make sure you're using RIPv2, since RIPv1 has no next-hop field in which to preserve the originating spoke as the next hop.</div>
<div>
<br /></div>
<div>
On to BGP.</div>
<div>
<br /></div>
<div>
R1:</div>
<div>
no router rip</div>
<div>
<div>
router bgp 100</div>
<div>
bgp log-neighbor-changes</div>
<div>
network 1.1.1.1 mask 255.255.255.255</div>
<div>
network 10.0.0.0 mask 255.255.255.0</div>
<div>
neighbor 10.0.0.2 remote-as 100</div>
<div>
neighbor 10.0.0.2 route-reflector-client</div>
<div>
neighbor 10.0.0.3 remote-as 100</div>
<div>
neighbor 10.0.0.3 route-reflector-client</div>
<div>
neighbor 10.0.0.4 remote-as 100</div>
<div>
neighbor 10.0.0.4 route-reflector-client</div>
</div>
<div>
<br /></div>
<div>
R2-R4:</div>
<div>
no router rip</div>
<div>
<div>
router bgp 100</div>
<div>
bgp log-neighbor-changes</div>
<div>
network <local Loopback Prefix> mask 255.255.255.255</div>
<div>
network 10.0.0.0 mask 255.255.255.0</div>
<div>
neighbor 10.0.0.1 remote-as 100</div>
</div>
<div>
<br /></div>
<div>
R1 of course needs to be a route reflector; otherwise it wouldn't pass routes learned from one iBGP spoke on to the other spokes.</div>
<div>
<br /></div>
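<div>
If the iBGP rule is fuzzy, here's a rough Python sketch of the hub's re-advertisement decision. It only models the standard reflection rule (client routes go to everyone else, non-client routes go to clients only); the <b>advertise</b> function is a made-up name and this is not how IOS implements it.</div>

```python
def advertise(learned_from, peers, rr_clients):
    """Which iBGP peers the hub re-advertises an iBGP-learned route to."""
    if learned_from in rr_clients:
        # Routes from a client are reflected to all other iBGP peers.
        return sorted(p for p in peers if p != learned_from)
    # Routes from a non-client are only reflected to clients.
    return sorted(p for p in rr_clients if p != learned_from)

spokes = {"10.0.0.2", "10.0.0.3", "10.0.0.4"}
without_rr = advertise("10.0.0.2", spokes, rr_clients=set())
with_rr = advertise("10.0.0.2", spokes, rr_clients=spokes)
print(without_rr)
print(with_rr)
```

<div>
Without any route-reflector clients, R2's loopback never makes it past R1; with the spokes configured as clients, R3 and R4 both receive it.</div>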
<div>
<div>
R4#ping 2.2.2.2</div>
<div>
Type escape sequence to abort.</div>
<div>
Sending 5, 100-byte ICMP Echos to 2.2.2.2, timeout is 2 seconds:</div>
<div>
!!!!!</div>
<div>
Success rate is 100 percent (5/5), round-trip min/avg/max = 240/301/376 ms</div>
<div>
<br /></div>
<div>
R4#trace 2.2.2.2 source lo0</div>
<div>
Type escape sequence to abort.</div>
<div>
Tracing the route to 2.2.2.2</div>
<div>
VRF info: (vrf in name/id, vrf out name/id)</div>
<div>
1 10.0.0.2 156 msec 192 msec 184 msec</div>
</div>
<div>
<br /></div>
<div>
For OSPF, you'll need to use network type broadcast or non-broadcast. There's no point-to-multipoint support until Phase 3, which we'll go over in detail later.</div>
<div>
<br /></div>
<div>
R1:</div>
<div>
no router bgp 100</div>
<div>
<div>
router ospf 1</div>
<div>
network 1.1.1.1 0.0.0.0 area 0</div>
<div>
network 10.0.0.0 0.0.0.255 area 0</div>
<div>
<br /></div>
<div>
interface Tunnel1</div>
<div>
ip ospf network broadcast</div>
<div>
ip ospf priority 255</div>
<div>
<br /></div>
</div>
<div>
R2-R4:</div>
<div>
no router bgp 100</div>
<div>
<div>
router ospf 1</div>
<div>
network <Local Loopback Address> 0.0.0.0 area 0</div>
<div>
network 10.0.0.0 0.0.0.255 area 0</div>
<div>
<br /></div>
<div>
interface Tunnel1</div>
<div>
ip ospf network broadcast</div>
<div>
ip ospf priority 0</div>
</div>
<div>
<br /></div>
<div>
Broadcast is used here to avoid changing the next-hop. If it were changed, we'd end up with Spoke->Hub->Spoke flows. We want to be careful that the hub becomes the DR, hence changing the OSPF priorities.</div>
<div>
<br /></div>
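<div>
Here's a simplified Python sketch of why the priorities matter. It only models a fresh election (highest priority wins, highest router ID breaks ties, priority 0 is ineligible) and deliberately ignores the BDR and the fact that a real election is non-preemptive; <b>elect_dr</b> is a name I made up.</div>

```python
def elect_dr(routers):
    """routers maps router ID to OSPF priority; returns the DR's router ID."""
    eligible = {rid: prio for rid, prio in routers.items() if prio != 0}
    if not eligible:
        return None

    def rank(rid):
        # Highest priority wins; the highest router ID breaks ties.
        return (eligible[rid], tuple(int(octet) for octet in rid.split(".")))

    return max(eligible, key=rank)

configured = {"1.1.1.1": 255, "2.2.2.2": 0, "3.3.3.3": 0, "4.4.4.4": 0}
defaults = {"1.1.1.1": 1, "2.2.2.2": 1, "3.3.3.3": 1, "4.4.4.4": 1}
print(elect_dr(configured), elect_dr(defaults))
```

<div>
With the priorities from the config above, R1 is the only candidate. Left at the defaults, the highest router ID (a spoke) could win a fresh election, and the other spokes have no adjacency with it.</div>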
<div>
<div>
R3#ping 2.2.2.2</div>
<div>
Type escape sequence to abort.</div>
<div>
Sending 5, 100-byte ICMP Echos to 2.2.2.2, timeout is 2 seconds:</div>
<div>
!!!!!</div>
<div>
Success rate is 100 percent (5/5), round-trip min/avg/max = 164/183/204 ms</div>
<div>
<br /></div>
<div>
R3#trace 2.2.2.2</div>
<div>
Type escape sequence to abort.</div>
<div>
Tracing the route to 2.2.2.2</div>
<div>
VRF info: (vrf in name/id, vrf out name/id)</div>
<div>
1 10.0.0.2 172 msec 184 msec 196 msec</div>
</div>
<div>
<br /></div>
<div>
We can also use non-broadcast. Imagine a task that didn't allow multicast mappings to be used, but required an IGP to be run.</div>
<div>
<br /></div>
<div>
R1:</div>
<div>
<div>
<div>
interface Tunnel1</div>
<div>
ip ospf network non-broadcast</div>
<div>
no ip nhrp map multicast dynamic</div>
</div>
</div>
<div>
<br /></div>
<div>
router ospf 1</div>
<div>
neighbor 10.0.0.2</div>
<div>
neighbor 10.0.0.3</div>
<div>
neighbor 10.0.0.4</div>
<div>
<br /></div>
<div>
R2-R4:</div>
<div>
<div>
<div>
interface Tunnel1</div>
<div>
ip ospf network non-broadcast</div>
<div>
<div>
no ip nhrp map multicast 87.14.10.1</div>
</div>
</div>
</div>
<div>
<br /></div>
<div>
** I actually rebuilt all the tunnels here to clear the NHRP cache thoroughly - I've found "clear ip nhrp" doesn't always produce the results I'd expect **</div>
<div>
<br /></div>
<div>
<div>
R1#sh ip ospf neigh</div>
<div>
<br /></div>
<div>
Neighbor ID Pri State Dead Time Address Interface</div>
<div>
2.2.2.2 0 FULL/DROTHER 00:01:51 10.0.0.2 Tunnel1</div>
<div>
3.3.3.3 0 FULL/DROTHER 00:01:57 10.0.0.3 Tunnel1</div>
<div>
4.4.4.4 0 FULL/DROTHER 00:01:43 10.0.0.4 Tunnel1</div>
</div>
<div>
<br /></div>
<div>
<div>
R3#ping 2.2.2.2</div>
<div>
Type escape sequence to abort.</div>
<div>
Sending 5, 100-byte ICMP Echos to 2.2.2.2, timeout is 2 seconds:</div>
<div>
!!!!!</div>
<div>
Success rate is 100 percent (5/5), round-trip min/avg/max = 208/288/360 ms</div>
<div>
<br /></div>
<div>
R3#trace 2.2.2.2</div>
<div>
Type escape sequence to abort.</div>
<div>
Tracing the route to 2.2.2.2</div>
<div>
VRF info: (vrf in name/id, vrf out name/id)</div>
<div>
1 10.0.0.2 168 msec 188 msec 168 msec</div>
</div>
<div>
<br /></div>
<div>
I mentioned at the beginning of the article that I was going to go outside the scope of R&S v5 in order to explain the "why would I use this?" behind some topics. Phase 3 DMVPN, if you're only looking at it from a handful of routers, doesn't make nearly as much sense. You need to take a step back and realize the challenges Phase 2 would bring if you had, say, 1,500 DMVPN spokes.</div>
<div>
<br /></div>
<div>
In a scenario like that, you're clearly not going to have just one hub. In fact, not even having primary/backup would be sufficient, because one router simply cannot terminate 1,500 IPSEC tunnels from a CPU perspective. In order to scale Phase 2 in volume of spokes, you had to build a topology that looked something like this:</div>
<div>
<br /></div>
<img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgg95ZXosutpuTyQoI2mh7VkPqupuGG3sKI-57yDSbZmqZegHR8T6gz8q6uT8EZh69a35lg6ub8UCUJ2B-9ts1SsjIJszgRTo0us7y_tbqPDa0C9SmIm8DkocWwIw1-n1f5bJetrl5uPSA/s1600/diagram4.png" /><br />
<div>
Let's pretend SPOKES1, 2 and 3 each represented 500 spokes. They'd register to HUB1, 2, and 3, respectively. I'm not going to get into DR/failover scenarios here, but you can start seeing the problem - each hub has its own NHRP database, which isn't shared with its neighbors. What happens when a spoke in SPOKES1 wants to reach a spoke in SPOKES3?<br />
<br />
Phase 2 solved this by using daisy-chained NHRP. In short, HUB1 became an NHRP client of HUB2, which became an NHRP client of HUB3, which became an NHRP client of HUB1. It would look something like this:<br />
<br />
<img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiskcqTprnZg_P_NuYnFL3BJ9BUeAUnPySI-2Co9rFd9avN6NFl0WwsH9-MG1mpUoiAHv9CPCv4Wwb46pVlh8p9RY4cq7fTNZ1jdAAZ9X6gVXJOOvdFwX1KDmeAI6XAXI3lIGVqbNJ-4MI/s1600/diagram5.png" />
<br />
You can reference Cisco's drawing of the same solution here; reference figure 3-4:<br />
<a href="http://www.cisco.com/c/en/us/td/docs/solutions/Enterprise/WAN_and_MAN/DMVPDG/DMVPN_3.html">http://www.cisco.com/c/en/us/td/docs/solutions/Enterprise/WAN_and_MAN/DMVPDG/DMVPN_3.html</a><br />
<br />
The config is reasonably simple. This isn't something I have mocked up right now, but let's pretend they're each using tunnel 1 and have a tunnel IP address of 10.0.0.X, where X is the hub number.<br />
<br />
HUB1:<br />
interface Tunnel1<br />
ip nhrp map N.B.M.A2 10.0.0.2<br />
ip nhrp nhs 10.0.0.2<br />
<br />
HUB2:<br />
interface Tunnel1<br />
ip nhrp map N.B.M.A3 10.0.0.3<br />
ip nhrp nhs 10.0.0.3<br />
<br />
HUB3:<br />
interface Tunnel1<br />
ip nhrp map N.B.M.A1 10.0.0.1<br />
ip nhrp nhs 10.0.0.1<br />
<div>
<br /></div>
<div>
So given my earlier predicament, "What happens when a spoke in SPOKES1 wants to reach a spoke in SPOKES3?", in this case, the requesting spoke in SPOKES1 sends the initial packet to HUB1. HUB1, not having a resolution for the spoke that's registered to HUB3, passes both the original packet and the NHRP request to HUB2, which in turn passes them to HUB3. In theory, HUB3 has the resolution for the router we're trying to reach, and tells that router via NHRP to establish a tunnel (via NBMA) back to the original requester in SPOKES1. </div>
<div>
<br /></div>
<div>
You can see how inefficient this is. It's not hierarchical; it scales sideways.</div>
<div>
<br /></div>
<div>
Let's take a worse scenario - what if the spoke in SPOKES3 is offline and not registered to HUB3? Well, HUB3 doesn't have a resolution so it passes it to HUB1, which in turn passes it to HUB2 ... yes, it loops. It eventually TTLs away and dies, but it's messy.</div>
<div>
<br /></div>
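<div>
You can model the loop in a few lines of Python. This is just a sketch of the daisy chain described above: the spoke names, the <b>resolve</b> function and the hop limit of 8 are all made up for illustration, not values taken from IOS.</div>

```python
def resolve(start_hub, target, registrations, next_hub, hop_limit=8):
    """Walk the hub chain; return (answering hub or None, hubs visited)."""
    hub, visited = start_hub, []
    for _ in range(hop_limit):
        visited.append(hub)
        if target in registrations[hub]:
            return hub, visited
        hub = next_hub[hub]   # no answer here, pass it down the chain
    return None, visited      # hop budget exhausted, the request dies

next_hub = {"HUB1": "HUB2", "HUB2": "HUB3", "HUB3": "HUB1"}
online = {"HUB1": {"spokeA"}, "HUB2": set(), "HUB3": {"spokeZ"}}
offline = {"HUB1": {"spokeA"}, "HUB2": set(), "HUB3": set()}
found = resolve("HUB1", "spokeZ", online, next_hub)
lost = resolve("HUB1", "spokeZ", offline, next_hub)
print(found)
print(lost)
```

<div>
When the target is registered, the request stops at HUB3 after three hubs. When it isn't, the request keeps circling the ring until the hop budget runs out.</div>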
<div>
Another scalability issue in Phase 2 is that there's absolutely no way to summarize routes. If you summarize, you get the spoke->hub->spoke syndrome, because the next hop is always the summarizing router.</div>
<div>
<br /></div>
<div>
Also, I mentioned above the problem with the first few packets being punted to the CPU, and not being CEF switched.</div>
<div>
<br /></div>
<div>
Phase 3 fixes all these issues, and is basically better in every way.</div>
<div>
<br /></div>
<div>
In Phase 3, completely contrary to the way Phase 2 worked, all the routing needs to point towards the hub (initially). So the routing protocol does need some sort of "next hop self" feature enabled.<br />
<br />
After the hub receives the first packet, instead of generating NHRP resolution packets, the hub sends an NHRP redirect any time it receives a packet in one tunnel and sends it back out the same tunnel. This redirect goes back to the originating router (the one that sent the first packet to the hub - the packet that got sent in & out the same tunnel), informing the originating router that a better path is available over DMVPN. NHRP redirect is very similar to an ICMP redirect. </div>
<br />
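<div>
The hub's rule is simple enough to write down as a predicate. A minimal Python sketch, assuming the only trigger is "in and out of the same tunnel" as described above; <b>hub_actions</b> and the action strings are illustrative names, not anything IOS exposes.</div>

```python
def hub_actions(in_interface, out_interface, redirect_enabled=True):
    """Actions a Phase 3 hub takes for one transit data packet."""
    actions = ["forward packet"]
    if redirect_enabled and in_interface == out_interface:
        # Hairpinned through the same tunnel: tell the sender that a
        # better path exists.
        actions.append("send NHRP redirect to sender")
    return actions

print(hub_actions("Tunnel1", "Tunnel1"))
print(hub_actions("Tunnel1", "FastEthernet0/0"))
```

<div>
Note that the packet is forwarded either way; the redirect is a hint sent alongside it, not a replacement for forwarding.</div>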
Let's demonstrate with a sample of 4.4.4.4 trying to reach 2.2.2.2. It's important to note here that the route to 2.2.2.2 has a next-hop of R1's private address, which is already resolved by the static NHRP entry for R1, so there's no CEF failure!<br />
<br />
<img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjoLtvetTQNLx_s5sO_cB7V8siNbOWJHF27no71QUIy_k4hImC5gri8rBGsgnlOcNgMvWKHFTl9fDU-esEDJ3n6JuJvQc6uyl3RjOe5Zg12ACFLzcF5ZITbey2JRsO9rlWj_nxPVBwEkDI/s1600/diagram6.png" />
<br />
<div class="separator" style="clear: both; text-align: center;">
</div>
<br />
<img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjnVGKHgZcQ8Bc4KM6frKgDmvhhPe0rrU5atLtLZeR6pnwnIHfq3TNzVXXigOoKN8PduzUF5oYjqlkPqRl-4iuKhE8j6WadByHtBQdzCHprrSK6QsSRT89C2UR1pDdbopRpwXnIkfg0qT0/s1600/diagram8.png" />
<br />
<br />
So now R4 knows R1 isn't the best path to R2. At this point, R4 needs to send an NHRP resolution request <i>to R2 (not the hub!)</i>, to find out how to reach it directly. In the meantime, it knows R1 will continue to forward packets for it.<br />
<img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj3frf5yLHHIDzkyRM8Li15jcwUL3g4qwSPhDo5edQKJeW7G3EXiNN381v7IBCu1z0xVitbIcHloOIlepnhfQVSiWLt9c03eH34hlvapjwEHeBlBQmA_eqIECQd6lpQmWg-IKFarEWBhYw/s1600/diagram9.png" />
<br />
Since R4 still can't speak directly to R2, the NHRP resolution request gets forwarded via R1, but not processed by R1. In Phase 3, it's no longer R1's job to answer NHRP resolution requests.<br />
<br />
R2 receives the resolution request, and responds directly to R4 (similar to the way Phase 2 worked at this point), also initiating the tunnel to R4.<br />
<br />
<img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjub8hyphenhypheng_gJbXWjGs7fOF4COtRAsnqpxHIy3Ve0k71GaRp49dCV3xAgLWXfA0Wa6zCjNdgNR3eIJHObJaIMyihpCYLcXLJ5DgcXdeB9xW4Vg9pAGqjeCchz5KBIWadqfLqZNjXpf3DrSNA/s1600/diagram11.png" />
<br />
At this point, R2 and R4 would still have R1 as the next hop for one another, but Phase 3 fixes that as well, <u>rewriting CEF</u> in order to fix this issue.<br />
<br />
In modern versions of IOS, you can actually see the rewrite (more or less) via the routing table:<br />
<br />
R4#sh ip route ospf | b Gateway<br />
Gateway of last resort is 87.14.40.100 to network 0.0.0.0<br />
<br />
1.0.0.0/32 is subnetted, 1 subnets<br />
O 1.1.1.1 [110/1001] via 10.0.0.1, 00:03:12, Tunnel1<br />
<b> 2.0.0.0/32 is subnetted, 1 subnets</b><br />
<b>O % 2.2.2.2 [110/2001] via 10.0.0.1, 00:02:43, Tunnel1</b><br />
3.0.0.0/32 is subnetted, 1 subnets<br />
O 3.3.3.3 [110/2001] via 10.0.0.1, 00:03:12, Tunnel1<br />
10.0.0.0/8 is variably subnetted, 5 subnets, 2 masks<br />
O 10.0.0.1/32 [110/1000] via 10.0.0.1, 00:03:12, Tunnel1<br />
O 10.0.0.2/32 [110/2000] via 10.0.0.1, 00:02:43, Tunnel1<br />
O 10.0.0.3/32 [110/2000] via 10.0.0.1, 00:03:12, Tunnel1<br />
<div>
<br /></div>
<div>
Note the % sign next to 2.2.2.2:</div>
<div>
<br /></div>
<div>
<div>
R4#sh ip route ospf | i %</div>
<div>
+ - replicated route, % - next hop override</div>
<div>
O % 2.2.2.2 [110/2001] via 10.0.0.1, 00:03:33, Tunnel1</div>
</div>
<div>
<br /></div>
<div>
"next hop override". Pretty cool.</div>
<div>
<br /></div>
<div>
<div>
R4#sh ip nhrp <b>shortcut</b></div>
<div>
2.2.2.2/32 via 10.0.0.2</div>
<div>
Tunnel1 created 00:04:46, expire 01:55:15</div>
<div>
Type: dynamic, Flags: router rib nho</div>
<div>
NBMA address: 87.14.20.1</div>
</div>
<div>
<br /></div>
There's our shortcut table. What's a shortcut, you might ask? Let's look at the handful of simple commands necessary to make Phase 3 work.<br />
<br />
First, the routing protocol must point back towards the hub, instead of towards the spoke, like we were set up for in Phase 2.<br />
<br />
R1-R4:<br />
interface Tunnel1<br />
ip ospf network point-to-multipoint<br />
<br />
Unlike Broadcast or Non-Broadcast, Point-to-Multipoint rewrites the next hop to be the hub. Also, not pictured here, I re-established the hub<->spoke multicast prior to this. Another important footnote is that with Point-to-Multipoint, we no longer need the DR/BDR we were stuck with in Phase 2 (which effectively limited us to two hubs). This design also permits summarization (or even just a default route), which Phase 2 certainly did not allow for. More on this in a bit.<br />
<br />
R1:<br />
interface Tunnel1<br />
ip nhrp redirect<br />
<br />
R2-R4:<br />
interface Tunnel1<br />
ip nhrp shortcut<br />
<br />
<b>ip nhrp redirect</b> goes on the hub only (note many installations will just put <b>ip nhrp redirect</b> and <b>ip nhrp shortcut</b> on every device; this is <u>not</u> necessary). <b>ip nhrp redirect</b> enables the behavior described earlier: if a packet is received and transmitted on the same MGRE tunnel, send the redirect. You can actually see who we've sent redirects for recently:<br />
<br />
R1#sh ip nhrp redirects<br />
I/F NBMA address Destination Drop Count Expiry<br />
<br />
Tunnel1 87.14.40.1 3.3.3.3 4 00:00:06<br />
Tunnel1 87.14.30.1 10.0.0.4 4 00:00:06<br />
<div>
<br /></div>
<div>
<b>ip nhrp shortcut</b> goes only on the spokes; it is used to enable processing redirect packets.</div>
<div>
<br /></div>
<div>
That's all there is to enabling Phase 3, but I still haven't addressed the scenario I proposed as the problem with Phase 2 ("sideways" scaling for hubs). With Phase 3, you can build hierarchical hubs because the NHRP daisy chain doesn't need to exist any longer. Imagine our 1,500 spoke router scenario from earlier, but now with Phase 3:</div>
<div>
<img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEibKvf7_VhzghY3y7sSxvMBhyspiV4zw3UYzQp5MyeZslR6RDY9l3pN4FQK43gXS0G7SkfSgxPkJmV8b6HN3vDyQvsJCMrxNU-uqK5Q9C4blRokh-Tzoz5nS-FXGA5_9e9IFb7JSeQn1U4/s1600/diagram12.png" /><br />
<br /></div>
We still have 500 spokes registering to HUB1, HUB2, and HUB3, from SPOKES1, SPOKES2, and SPOKES3, respectively.<br />
<br />
What if a router in SPOKES1 wants to build a spoke-to-spoke tunnel to a router in SPOKES3?<br />
<br />
<img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj_cYfy91bjYQCzXm55RT8qQ8pRZH6BWzBFtMuAz98vMk72MP4v4wxps3jcMoLynGhtHBtzSpn1ZnrKVERRyWNRFK0wnRGDS2U1vScr80b45wqqUTJ7jQQJJGAYDANZlopy9os5Xa9soaY/s1600/diagram13.png" />
<br />
Here we see the first packet leave the spoke in SPOKES1. HUB1 will forward this packet (according to the routing table via CORE1, not pictured here). HUB1 then sends a redirect back towards the spoke in SPOKES1.<br />
<br />
Because the hubs no longer answer NHRP requests, there is <u>no need to NHRP daisy chain the hubs!</u> So in our next diagram, we're going to watch the NHRP resolution request be routed to its destination.<br />
<br />
<img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjNB2qKC4h2bRWAQ9q9mHifXywNc1i139Kd8XDO6g0QPdSoiw3PagHj1U-0_TsxmbblCI3essP7xA9uC8OBwYxEIKEgG-EVvbGviPNbbrM2ydc_9NdeKcaS3Ko-TKN6mhM-u7m-9YsK9TM/s1600/diagram14.png" />
<br />
Again, this is a <u>routed IP packet</u>; HUB1, CORE1, and HUB3 are not NHRP-processing this packet, they're just CEF-switching it.<br />
<br />
When the target spoke in SPOKES3 receives the NHRP resolution request, it replies directly to the originating spoke in SPOKES1:<br />
<br />
<img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiUi2ScqUpE5lV837tgpGBMumXv6niFLFHhcgeofv-mFBjUWzM_g_sD9mmGKZjgMKEnRShfDcnInfSB5dJjSo5whuDOhyhsn_UstgSstKIUJ6ngUYaZbAGBO4ErPjidJSFdsymDchhfxro/s1600/diagram15.png" />
<br />
Much more scalable than Phase 2.<br />
<br />
This leads back into summarization. In Phase 3 there is no need to have the full routing table. You can send out a summary for your network, or even a default.<br />
<br />
I can't summarize intra-area in OSPF, so I'm switching back to EIGRP (not pictured here).<br />
<br />
R1:<br />
interface Tunnel1<br />
ip summary-address eigrp 100 0.0.0.0 248.0.0.0<br />
<div>
<br /></div>
Sorry for the weird summary - I didn't do myself any favors by using 1.1.1.1 - 4.4.4.4 for the loopbacks. You try summarizing those :)<br />
<br />
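<div>
If you'd rather not do the binary by hand, Python's standard <b>ipaddress</b> module will find the smallest block that covers all four loopbacks. A quick sketch; the widening loop is just my way of illustrating it, not anything the router does.</div>

```python
import ipaddress

loopbacks = [ipaddress.ip_address(f"{n}.{n}.{n}.{n}") for n in range(1, 5)]

# Widen a /32 one bit at a time until every loopback fits inside it.
summary = ipaddress.ip_network("1.1.1.1/32")
while not all(ip in summary for ip in loopbacks):
    summary = summary.supernet()

print(summary, summary.netmask)
```

<div>
1 is 00000001 and 4 is 00000100 in the first octet, so only the top five bits are common; that's where the /5 (mask 248.0.0.0) comes from.</div>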
R3#sh ip route eigrp | b Gateway<br />
Gateway of last resort is 87.14.30.100 to network 0.0.0.0<br />
<br />
D 0.0.0.0/5 [90/3968000] via 10.0.0.1, 00:02:26, Tunnel1<br />
<div>
<br /></div>
<div>
One route - that'd sure be easier on my spokes if I had 1,500 spokes to consider.</div>
<div>
<br /></div>
<div>
<div>
R3#ping 2.2.2.2</div>
<div>
Type escape sequence to abort.</div>
<div>
Sending 5, 100-byte ICMP Echos to 2.2.2.2, timeout is 2 seconds:</div>
<div>
!!!!!</div>
<div>
Success rate is 100 percent (5/5), round-trip min/avg/max = 172/218/324 ms</div>
<div>
<br /></div>
<div>
We have reachability.</div>
<div>
<br /></div>
<div>
R3#trace 2.2.2.2</div>
<div>
Type escape sequence to abort.</div>
<div>
Tracing the route to 2.2.2.2</div>
<div>
VRF info: (vrf in name/id, vrf out name/id)</div>
<div>
1 10.0.0.2 156 msec 180 msec 172 msec</div>
<div>
<br /></div>
<div>
We're reaching it via one hop (our spoke-to-spoke tunnel)</div>
<div>
<br /></div>
<div>
R3#sh ip route | b Gateway</div>
<div>
Gateway of last resort is 87.14.30.100 to network 0.0.0.0</div>
<div>
<br /></div>
<div>
S* 0.0.0.0/0 [1/0] via 87.14.30.100</div>
<div>
D 0.0.0.0/5 [90/3968000] via 10.0.0.1, 00:04:07, Tunnel1</div>
<div>
2.0.0.0/32 is subnetted, 1 subnets</div>
<div>
<b>H 2.2.2.2 [250/1] via 10.0.0.2, 00:00:29, Tunnel1</b></div>
<div>
<b> 3.0.0.0/32 is subnetted, 1 subnets</b></div>
<div>
C 3.3.3.3 is directly connected, Loopback0</div>
<div>
10.0.0.0/8 is variably subnetted, 2 subnets, 2 masks</div>
<div>
C 10.0.0.0/24 is directly connected, Tunnel1</div>
<div>
L 10.0.0.3/32 is directly connected, Tunnel1</div>
<div>
87.0.0.0/8 is variably subnetted, 2 subnets, 2 masks</div>
<div>
C 87.14.30.0/24 is directly connected, FastEthernet0/0</div>
<div>
L 87.14.30.1/32 is directly connected, FastEthernet0/0</div>
<div>
<br /></div>
<div>
<b>H = NHRP</b></div>
<div>
<br /></div>
<div>
R3#sh ip route nhrp | b Gateway</div>
<div>
Gateway of last resort is 87.14.30.100 to network 0.0.0.0</div>
<div>
<br /></div>
<div>
2.0.0.0/32 is subnetted, 1 subnets</div>
<div>
H 2.2.2.2 [250/1] via 10.0.0.2, 00:00:41, Tunnel1</div>
</div>
<div>
<br /></div>
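<div>
The reason the H route wins over the summary is plain longest-prefix match. Here's a small Python sketch using the two relevant prefixes and next hops from the R3 output above; the dictionary "RIB" and the <b>next_hop</b> function are a toy of my own, not how CEF stores anything.</div>

```python
import ipaddress

# Prefixes and next hops copied from the R3 output; the lookup is a toy.
rib = {
    ipaddress.ip_network("0.0.0.0/5"): "10.0.0.1",    # D, summary from the hub
    ipaddress.ip_network("2.2.2.2/32"): "10.0.0.2",   # H, installed by NHRP
}

def next_hop(destination):
    """Longest-prefix match over the toy RIB."""
    dest = ipaddress.ip_address(destination)
    matches = [net for net in rib if dest in net]
    if not matches:
        return None
    return rib[max(matches, key=lambda net: net.prefixlen)]

print(next_hop("2.2.2.2"), next_hop("4.4.4.4"))
```

<div>
Traffic for 2.2.2.2 takes the spoke-to-spoke next hop, while 4.4.4.4 (no shortcut yet) still follows the summary to the hub.</div>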
<div>
<div>
R3#show dmvpn | b Tunnel1</div>
<div>
Interface: Tunnel1, IPv4 NHRP Details</div>
<div>
Type:Spoke, NHRP Peers:2,</div>
<div>
<br /></div>
<div>
# Ent Peer NBMA Addr Peer Tunnel Add State UpDn Tm Attrb</div>
<div>
----- --------------- --------------- ----- -------- -----</div>
<div>
2 87.14.20.1 10.0.0.2 UP 00:04:09 DT1</div>
<div>
10.0.0.2 UP 00:04:09 D</div>
<div>
1 87.14.10.1 10.0.0.1 UP 01:25:21 S</div>
</div>
<div>
<br /></div>
Pretty darn slick.<br />
<br />
What about IPv6 DMVPN?<br />
<br />
Note, there is no IPv6 over IPv6 DMVPN yet - at least not on my IOS. So we'll be tunneling v6 over v4.<br />
<br />
No changes to the existing tunnels are required, we just add v6 to our existing infrastructure.<br />
I've added X::X/64 to every Loopback0, and 10::X/64 to every Tunnel1, where X is the router number.<br />
<br />
R1:<br />
ipv6 unicast-routing<br />
ipv6 router eigrp 100<br />
no shut<br />
<div>
interface Tunnel1</div>
<div>
<div>
no ipv6 split-horizon eigrp 100</div>
</div>
<div>
ipv6 address 10::1/64</div>
ipv6 eigrp 100<br />
ipv6 nhrp map multicast dynamic<br />
ipv6 nhrp network-id 1<br />
ipv6 nhrp redirect<br />
<br />
R2-R4:<br />
ipv6 unicast-routing<br />
ipv6 router eigrp 100<br />
no shut<br />
interface Tunnel1<br />
ipv6 address 10::X/64 ! Where X is the router number<br />
ipv6 eigrp 100<br />
ipv6 nhrp map multicast 87.14.10.1<br />
ipv6 nhrp map 10::1/128 87.14.10.1<br />
ipv6 nhrp network-id 1<br />
ipv6 nhrp nhs 10::1<br />
ipv6 nhrp shortcut</div>
<div>
<br /></div>
<div>
The one thing that did throw me off here is that you don't need to map the link-local address of the hub on the spokes, or vice-versa. As I'd mentioned earlier, the <b>ipv6 nhrp map </b>commands remind me a lot of frame-relay, so I immediately started putting in manual mappings. No need. NHRP takes care of all of that:</div>
<div>
<br /></div>
<div>
<div>
R1#sh ipv6 nhrp | s FE80</div>
<div>
FE80::C800:37FF:FEDC:8/128 via 10::3</div>
<div>
Tunnel1 created 00:11:04, expire 01:48:56</div>
<div>
Type: dynamic, Flags: unique registered used</div>
<div>
NBMA address: 87.14.30.1</div>
<div>
FE80::C801:FF:FEF8:8/128 via 10::4</div>
<div>
Tunnel1 created 00:10:54, expire 01:49:06</div>
<div>
Type: dynamic, Flags: unique registered used</div>
<div>
NBMA address: 87.14.40.1</div>
<div>
FE80::C803:13FF:FE90:8/128 via 10::2</div>
<div>
Tunnel1 created 00:14:51, expire 01:45:08</div>
<div>
Type: dynamic, Flags: unique registered used</div>
<div>
NBMA address: 87.14.20.1</div>
</div>
<div>
<br /></div>
<div>
The link locals are auto-registered along with the unicast IPv6 addresses.</div>
<div>
<br /></div>
<div>
There's not much more to say. It works:</div>
<div>
<br /></div>
<div>
<div>
R4#ping 2::2 source lo0</div>
<div>
Type escape sequence to abort.</div>
<div>
Sending 5, 100-byte ICMP Echos to 2::2, timeout is 2 seconds:</div>
<div>
Packet sent with a source address of 4::4</div>
<div>
!!!!!</div>
<div>
Success rate is 100 percent (5/5), round-trip min/avg/max = 156/198/260 ms</div>
<div>
<br /></div>
<div>
R4#trace 2::2</div>
<div>
Type escape sequence to abort.</div>
<div>
Tracing the route to 2::2</div>
<div>
<br /></div>
<div>
1 10::2 176 msec 172 msec 168 msec</div>
</div>
<div>
<br /></div>
<div>
Now let's look at what QoS options we have.</div>
<div>
<br /></div>
<div>
DMVPN QoS is largely Hub -> Spoke. You can get some Spoke -> Spoke QoS, but it's generally a hack job: because your neighbors are dynamic, it's difficult to fine-tune a policy.</div>
<div>
<br /></div>
<div>
The basic idea is that the spoke registers a value (called an NHRP "group") back to the hub, which the hub can then match and apply a policy-map to.</div>
<div>
<br /></div>
<div>
R2:</div>
<div>
<div>
interface Tunnel1</div>
<div>
ip nhrp group GROUP1</div>
</div>
<div>
<br /></div>
<div>
R3:</div>
<div>
<div>
<div>
interface Tunnel1</div>
<div>
ip nhrp group GROUP1</div>
</div>
</div>
<div>
<br /></div>
<div>
R4:</div>
<div>
<div>
<div>
interface Tunnel1</div>
<div>
ip nhrp group GROUP2</div>
</div>
</div>
<div>
<br /></div>
<div>
On all three spoke routers I did the following procedure:</div>
<div>
<br /></div>
<div>
interface tunnel1</div>
<div>
<div>
no ip nhrp nhs 10.0.0.1</div>
<div>
ip nhrp nhs 10.0.0.1</div>
</div>
<div>
<br /></div>
<div>
The reason is that the spoke doesn't re-register with the hub as soon as the group is configured, so removing and re-adding the NHS forces a fresh registration.</div>
<div>
We can now see the hub is aware of the groups:</div>
<div>
<br /></div>
<div>
<div>
R1(config-if)#do sh ip nhrp</div>
<div>
10.0.0.2/32 via 10.0.0.2</div>
<div>
Tunnel1 created 00:34:47, expire 01:57:03</div>
<div>
Type: dynamic, Flags: unique registered used</div>
<div>
NBMA address: 87.14.20.1</div>
<div>
<b> Group: GROUP1</b></div>
<div>
10.0.0.3/32 via 10.0.0.3</div>
<div>
Tunnel1 created 02:24:59, expire 01:59:36</div>
<div>
Type: dynamic, Flags: unique registered used</div>
<div>
NBMA address: 87.14.30.1</div>
<div>
<b> Group: GROUP1</b></div>
<div>
10.0.0.4/32 via 10.0.0.4</div>
<div>
Tunnel1 created 02:24:41, expire 01:59:49</div>
<div>
Type: dynamic, Flags: unique registered used</div>
<div>
NBMA address: 87.14.40.1</div>
<div>
<b> Group: GROUP2</b></div>
</div>
<div>
<br /></div>
<div>
Let's build some policies on the hub. I have this all mocked up in GNS3, so we have to keep the performance expectations low.</div>
<div>
<br /></div>
<div>
Most of the DMVPNs I've designed carried bulk traffic, with a side-by-side MPLS network for the traffic that needed priority/QoS. So I have honestly never used this in production, but I suspect the main design case is shaping by group. You don't want a device at the hub sending towards a spoke at 100Mbit if that spoke has a pair of bonded T1s for Internet access. If we put "low speed" spokes in one group and "high speed" spokes in another, we can stop the slow spokes from getting overwhelmed while allowing the faster spokes to get traffic at, or near, the line rate of the hub Internet connection. This would also be very easy to configure; theoretically, just 8 lines would take care of all spokes.</div>
<div>
<br /></div>
<div>
That all said, I've written slightly more complex configs for this implementation, because the CCIE lab's questions are about as far from reality as you can get.</div>
<div>
<br /></div>
<div>
R1:</div>
<div>
<div>
<div>
<div>
<div>
ip access-list extended TOWARDS-R2</div>
<div>
permit ip any host 2.2.2.2</div>
<div>
<br /></div>
<div>
ip access-list extended TOWARDS-R3</div>
<div>
permit ip any host 3.3.3.3</div>
</div>
</div>
<div>
<br /></div>
<div>
<div>
class-map match-all TOWARDS-R3</div>
<div>
match access-group name TOWARDS-R3</div>
<div>
class-map match-all TOWARDS-R2</div>
<div>
match access-group name TOWARDS-R2</div>
</div>
<div>
<br /></div>
<div>
policy-map GROUP1-PM</div>
<div>
class TOWARDS-R2</div>
<div>
shape average 4000</div>
<div>
class TOWARDS-R3</div>
<div>
shape average 4000</div>
<div>
<br /></div>
<div>
policy-map GR1-POLICY-PARENT</div>
<div>
class class-default</div>
<div>
shape average 6000</div>
<div>
service-policy GROUP1-PM</div>
<div>
<br /></div>
<div>
policy-map GR2-POLICY-PARENT</div>
<div>
class class-default</div>
<div>
shape average 8000</div>
<div>
<br /></div>
<div>
interface Tunnel1</div>
</div>
</div>
<div>
<div>
<b>ip nhrp map group GROUP1 service-policy output GR1-POLICY-PARENT</b></div>
<div>
<b>ip nhrp map group GROUP2 service-policy output GR2-POLICY-PARENT</b></div>
</div>
<div>
<br /></div>
<div>
The idea here is that the cumulative bandwidth of GROUP1 should not exceed 6K, and each spoke should only get 4K maximum. Cumulative GROUP2 should not exceed 8K.</div>
<div>
<br /></div>
<div>
I worked up the "proof" for this, but it doesn't fit into a blog post well. Suffice it to say, it works.</div>
<div>
You can see the policy-map hits with <b>show policy-map multipoint</b>, and can also get information from <b>show dmvpn detail</b>.</div>
<div>
<br /></div>
<div>
Ingress Per-Tunnel QoS (policing and remarking, basically) is not supported on DMVPN.</div>
<div>
<br /></div>
<div>
I know the first time I mocked this up, the first question I had was: that's great for Hub -> Spoke, but what about Spoke -> Hub, or Spoke->Spoke?</div>
<div>
<br /></div>
<div>
Turns out they're both kind of a pain (not to mention unsupported). As of 15.x, you can no longer apply a service policy directly to an MGRE tunnel. You can, of course, still shape, police, and queue on the physical interface that your tunnel is connected to. This more or less implies you need <b>qos pre-classify</b>, but interestingly, on 15.2(4)M6, I got the same results with or without it with the traffic generated on the router - if I pinged from the router, I got the inside (pre-tunnel) QoS values on the outer DSCP value. I suspect that may have differed if I was testing from behind the device, but I didn't test it.</div>
<div>
<br /></div>
<div>
The big nail in the coffin for Spoke->Spoke QoS is that the neighbors are dynamic. Without some way of applying a grouping to the neighbor that implies how much bandwidth they have, or what traffic is priority for them, you have to either manually match individual destinations, which defeats the dynamic nature of DMVPN, or have one generic policy that matches both the hub and every spoke.</div>
<div>
<br /></div>
<div>
A sample config might look something like this:</div>
<div>
<br /></div>
<div>
<div>
class-map match-all ef</div>
<div>
match dscp ef</div>
</div>
<div>
<br /></div>
<div>
<div>
policy-map out</div>
<div>
class ef</div>
<div>
priority percent 50</div>
<div>
class class-default</div>
<div>
random-detect</div>
</div>
<div>
<br /></div>
<div>
interface Tunnel1</div>
<div>
qos pre-classify</div>
<div>
<br /></div>
<div>
interface FastEthernet0/0</div>
<div>
service-policy output out</div>
<div>
<br /></div>
<div>
And finally, some miscellaneous topics that I thought were interesting.</div>
<div>
<br /></div>
<div>
<b><span style="font-family: inherit;">UNIQUE NHRP</span></b></div>
<div>
<br /></div>
<div>
By default, the spoke instructs the hub that its registration is unique, and not to accept a registration for the same DMVPN (private) IP from a different NBMA (public) IP.</div>
<div>
<br /></div>
<div>
If you're using DHCP on a spoke, and your IP might change, you'd want to disable this.</div>
<div>
<br /></div>
<div>
Use <span style="font-family: Calibri, sans-serif; font-size: 11pt; line-height: 115%;"><b>ip
nhrp registration no-unique</b> on the <b>spoke</b>.</span></div>
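<div>
To make the unique flag concrete, here is a minimal Python sketch of how a hub might treat registrations. It is a toy model, not IOS code: the <b>register</b> function, the cache layout, and the 87.14.30.99 address are all made up for illustration.</div>

```python
def register(cache, tunnel_ip, nbma_ip, unique=True):
    """Toy model of a hub handling an NHRP registration (hypothetical, not IOS code)."""
    entry = cache.get(tunnel_ip)
    # An existing entry flagged unique may not be replaced from a different NBMA address.
    if entry is not None and entry["unique"] and entry["nbma"] != nbma_ip:
        return False
    cache[tunnel_ip] = {"nbma": nbma_ip, "unique": unique}
    return True

# Default behaviour: the spoke registers as unique, then its public IP changes.
strict_cache = {}
first = register(strict_cache, "10.0.0.3", "87.14.30.1")
after_ip_change = register(strict_cache, "10.0.0.3", "87.14.30.99")

# With no-unique, the same change is accepted and the mapping is updated.
relaxed_cache = {}
register(relaxed_cache, "10.0.0.3", "87.14.30.1", unique=False)
relaxed_change = register(relaxed_cache, "10.0.0.3", "87.14.30.99", unique=False)
print(first, after_ip_change, relaxed_change)
```

<div>
In this model, the rejected registration only clears once the old entry is gone, which is why a DHCP-addressed spoke is better off registering as non-unique.</div>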
<div>
<span style="font-family: Calibri, sans-serif; font-size: 11pt; line-height: 115%;"><br /></span></div>
<div>
<span style="line-height: 16.866666793823242px;"><b><span style="font-family: inherit;">TUNNEL KEYS</span></b></span></div>
<div>
<span style="font-family: Calibri, sans-serif;"><span style="font-size: 15px; line-height: 16.866666793823242px;"><b><br /></b></span></span></div>
<div>
If you have multiple MGRE tunnels attached to the same physical interface, you need to put tunnel keys on them to keep them separate. Older IOSes (12.3 and older) required them on every MGRE tunnel.</div>
<div>
<br /></div>
<div>
Use <b>tunnel key 123</b></div>
<div>
<b><br /></b></div>
<div>
<b>SPOKE TO SPOKE MULTICAST</b></div>
<div>
<b><br /></b></div>
<div>
This is a very similar question to spoke-to-spoke QoS, but I can see this one getting used on the CCIE lab. It's impractical for large production networks, but in our topology:</div>
<div>
<br /></div>
<div>
R2:</div>
<div>
<div>
ip pim rp-address 3.3.3.3</div>
</div>
<div>
<br /></div>
<div>
interface Tunnel1</div>
<div>
ip nhrp map 10.0.0.3 87.14.30.1</div>
<div>
ip nhrp map multicast 87.14.30.1</div>
<div>
<div>
ip pim nbma-mode</div>
<div>
ip pim sparse-mode</div>
</div>
<div>
<br /></div>
<div>
interface Loopback0</div>
<div>
<div>
ip pim sparse-mode</div>
</div>
<div>
ip igmp join-group 239.0.0.1</div>
<div>
<br /></div>
<div>
R3:</div>
<div>
<div>
<div>
ip pim rp-address 3.3.3.3</div>
</div>
<div>
<br /></div>
<div>
interface Tunnel1</div>
<div>
ip nhrp map 10.0.0.2 87.14.20.1</div>
<div>
ip nhrp map multicast 87.14.20.1</div>
</div>
<div>
ip pim nbma-mode</div>
<div>
<div>
ip pim sparse-mode</div>
</div>
<div>
<br /></div>
<div>
<div>
<div>
interface Loopback0</div>
</div>
</div>
<div>
<div>
ip pim sparse-mode</div>
</div>
<div>
<br /></div>
<div>
<div>
R3(config-if)#do sh ip pim neigh</div>
<div>
PIM Neighbor Table</div>
<div>
Mode: B - Bidir Capable, DR - Designated Router, N - Default DR Priority,</div>
<div>
P - Proxy Capable, S - State Refresh Capable, G - GenID Capable</div>
<div>
Neighbor Interface Uptime/Expires Ver DR</div>
<div>
Address Prio/Mode</div>
<div>
10.0.0.2 Tunnel1 00:06:00/00:01:38 v2 1 / S P G</div>
</div>
<div>
<br /></div>
<div>
<div>
R3(config-if)#do ping 239.0.0.1</div>
<div>
Type escape sequence to abort.</div>
<div>
Sending 1, 100-byte ICMP Echos to 239.0.0.1, timeout is 2 seconds:</div>
<div>
<br /></div>
<div>
Reply to request 0 from 2.2.2.2, 344 ms</div>
<div>
Reply to request 0 from 2.2.2.2, 344 ms</div>
</div>
<div>
<br /></div>
<div>
Very ungainly and manual, but it does work. It's also of note that EIGRP peered between R2 and R3 as well:</div>
<div>
<div>
R3(config-if)#do sh ip eigrp neigh</div>
<div>
EIGRP-IPv4 Neighbors for AS(100)</div>
<div>
H Address Interface Hold Uptime SRTT RTO Q Seq</div>
<div>
(sec) (ms) Cnt Num</div>
<div>
1 10.0.0.2 Tu1 14 00:09:50 266 1596 0 20</div>
<div>
0 10.0.0.1 Tu1 11 02:58:23 328 1968 0 32</div>
</div>
<div>
<br /></div>
<div>
<b>CRYPTO CALL ADMISSION</b></div>
<div>
<b><br /></b></div>
<div>
So one of these theoretical hubs with 500 spokes - let's assume it's not a big burly router, but it's getting along just fine in steady-state. Uh-oh, it lost power and had to reboot! Does it have the horsepower to establish 500 encrypted tunnels all trying to reconnect at the same time?</div>
<div>
<br /></div>
<div>
<span style="font-family: Calibri, sans-serif; font-size: 11pt; line-height: 115%;"><b>crypto call admission limit ike in-negotiation-sa </b>(followed by a number) controls how many IKE negotiations it will process simultaneously (any new incoming tunnels get dropped temporarily until the first group is up).</span></div>
<div>
<span style="font-family: Calibri, sans-serif; font-size: 11pt; line-height: 115%;"><b><br /></b></span></div>
<div>
<span style="font-family: Calibri, sans-serif; font-size: 11pt; line-height: 115%;"><b> </b>Hope you enjoyed,</span></div>
<div>
<span style="font-family: Calibri, sans-serif; font-size: 11pt; line-height: 115%;"><br /></span></div>
<div>
<span style="font-family: Calibri, sans-serif; font-size: 11pt; line-height: 115%;">Jeff</span></div>
brbcciehttp://www.blogger.com/profile/14586635047530183862noreply@blogger.com10tag:blogger.com,1999:blog-5968686435283454526.post-70109821465347858632014-04-15T19:39:00.001-07:002014-11-27T16:27:31.426-08:00A Thorough Approach for Debugging MPLS L3 VPNs<br />
I recently realized I needed a more organized approach to debugging MPLS L3 VPNs for the troubleshooting section. Referencing a lot of the practice labs I've taken, I'm going to give a run-down of what I think is the fastest way to track down any problem.<br />
<div>
<br /></div>
<div>
First let's run down my list, then we'll pick it apart with an example below.</div>
<div>
<br /></div>
<div>
I'm going to assume the run-of-the-mill validation of "Host A needs to be able to ping host B".</div>
<div>
Since we're talking high-level in the first segment, and with MPLS VPNs we're always talking about a sender and a receiver, I am going to refer to the sender that's unable to reach the receiver as the <b>originating router</b>, and the side that cannot be reached as the <b>terminating router</b>, to keep the direction clear. </div>
<div>
<br /></div>
<div>
Before you start debugging...</div>
<div>
<br /></div>
<div>
1) Validate the problem: <b>ping </b><problem IP></div>
<div>
2) Find out if the problem is unidirectional. Run "<b>debug ip icmp</b>" on both the source and the destination. Ping both ways. If you're taking an INE lab, be sure logging is on too: <b>logging con 7 </b>and <b>logging on</b>. </div>
<div>
3) From the originating router, run "<b>sh ip route</b> <problem IP>" and "<b>sh ip cef</b> <problem IP>". Sometimes another route in the table is beating the MPLS route on AD or, worse, with a more specific prefix. That makes it not an MPLS problem, and out of scope for this post.</div>
<div>
<br /></div>
<div>
Once you clear the starting checks, you want to validate whether or not you have the route in your routing table.</div>
<div>
<br /></div>
<div>
1) Are you importing the route into your VRF? Make sure the other side's exported route-target is being imported on the originating router's VRF.</div>
<div>
2) Is the terminating router or terminating router's PE advertising the route?</div>
<div>
3) Are route reflectors involved? If you're relying on one route reflector to relay a route through another route reflector, you need to ensure the cluster-IDs are different.</div>
<div>
<br /></div>
<div>
The following items apply when OSPF is your PE->CE routing protocol:</div>
<div>
4) If you're using OSPF as PE->CE, check for sham links. It's easy to break these and hard to look for them. Do a "sh run | s sham" on the PEs and see if any exist. If they do, run "show ip ospf sham-links"</div>
<div>
5) If you're using OSPF as PE->CE, <i>and</i> the CE is also part of the VRF (the VRF itself exists on the CE), enable <b>capability vrf-lite</b> on the OSPF process on the CE.</div>
<div>
<br /></div>
<div>
If you don't have an internal route and you need one to beat another AD, then additionally check out:</div>
<div>
6a) If OSPF is PE->CE, make sure <b>domain-id </b>is set the same on all of them, or you'll end up with external routes across the MPLS cloud.</div>
<div>
6b) If EIGRP is PE->CE, make sure your EIGRP AS number (process number) matches on the PE routers.</div>
<div>
<br /></div>
<div>
If you checked into all of that, you should have an appropriate route by now. What happens when you've got the route in your routing table, pointing the right direction, but the traffic just doesn't arrive on the far side? Now we start debugging MPLS itself.</div>
<div>
<br /></div>
<div>
1) sh run | s mpls on every PE and P device. Look for LDP filtering. There are more elegant ways to find this, but this is the fastest.</div>
<div>
2) From the PE on the originating side, run "sh ip cef vrf <VRF NAME> <problem IP>". Is the correct PE listed as next-hop in the "via" field? If it's not, go investigate the PE that is originating the route; there may be more than one path (and one may not lead anywhere!).</div>
<div>
3) If it's the correct PE from step 2, do <b>show mpls forwarding-table <PE LDP ID></b>. <i>Unless your PEs are L2 adjacent</i>, you must have a tag listed for the PE, <i>or </i>"Pop Tag". If you don't, walk your adjacent routers to be sure <b>mpls ip</b> is enabled on every interface or OSPF MPLS auto-config is enabled. Make sure CEF is turned on on all P and PE devices - MPLS doesn't work without CEF. If necessary, re-check step 1 and make sure nothing is filtering tags. If you still haven't found the problem, do a "show mpls ldp neighbor | i Peer" and make sure you have the correct count of neighbors.</div>
<div>
4) Note the next-hop associated with the tag you identified in step 3. Open a command prompt on the next-hop and repeat step 3. Continue until you reach a "pop tag" for the terminating PE.<br />
5) Check for Router-ID failures. LDP is picky that the mask in the routing table and the mask bound to the label match. This is most commonly an issue when OSPF is used as the MPLS IGP and your router ID is based on a loopback with a mask other than /32, because OSPF advertises a loopback as a /32 by default. If this is the case, either change your loopback address to a /32 (if permitted), or change your OSPF network type on the loopback to point-to-point so that the label mask and the OSPF mask match. This can also be an issue with summarized routes in other protocols (such as EIGRP), so be on the lookout there.<br />
6) As a final check, see whether cost-community was disabled on the PE routers. It's possible to perform traffic engineering against the prefixes if it's been disabled, and then who knows what path your traffic might be taking? On the PEs, run <b>sh run | i cost-community</b>. Cost community is on by default, and you want it left on. This command should show nothing if it is enabled; if it's disabled, you will find <b>bgp bestpath cost-community ignore</b> in the config.<br />
<br />
Now let's walk through the scenarios that these verifications above can save you from.<br />
<br />
<div>
1) Validate the problem: <b>ping </b><problem IP><br />
<br />
This should be obvious, but I actually proctor a private TS test, and I'm amazed at the number of people who don't check what I put in front of them. In rare circumstances, the solution can be derived just from verifying the issue. And in a TS lab, you need to be sure you didn't somehow fix the problem at some other point.<br />
<br /></div>
<div>
2) Find out if the problem is unidirectional. Run "<b>debug ip icmp</b>" on both the source and the destination. Ping both ways. If you're taking an INE lab, be sure logging is on too: <b>logging con 7 </b>and <b>logging on</b>.<br />
<br />
This is very important - so you "can't ping" the destination. Do you know if your echo request isn't making it from origination to destination, or that the echo reply isn't making it from destination to origination? Don't waste time debugging the wrong flow. Quite regularly only one direction is failing.<br />
<br />
3) From the originating router, run "<b>sh ip route</b> <problem IP>" and "<b>sh ip cef</b> <problem IP>". Sometimes another route in the table is beating the MPLS route on AD or, worse, with a more specific prefix. That makes it not an MPLS problem, and out of scope for this post.</div>
<div>
<br />
This is easy to overlook. You may have the route in BGP, and the MPLS labels can be in good shape, but you're only getting a /24 across the MPLS VPN, and you're getting a bogus /32 route for the destination that leads nowhere, injected by your IGP from a router behind you. Your packet is going the wrong direction.<br />
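<div>
The /24-versus-/32 trap is just longest-prefix match at work. Here is a minimal Python sketch of that lookup. The prefixes and route descriptions are illustrative assumptions, and <b>best_route</b> is a made-up helper, not anything the router exposes.</div>

```python
import ipaddress

def best_route(rib, destination):
    """Return the (network, source) pair a longest-prefix-match lookup would pick."""
    dest = ipaddress.ip_address(destination)
    candidates = []
    for prefix, source in rib:
        network = ipaddress.ip_network(prefix)
        if dest in network:
            candidates.append((network, source))
    # The most specific (longest) matching prefix wins, regardless of AD.
    return max(candidates, key=lambda c: c[0].prefixlen, default=None)

# Hypothetical routing table: a good /24 across the VPN and a bogus /32 from the IGP.
rib = [
    ("192.168.1.0/24", "BGP, learned across the MPLS VPN"),
    ("192.168.1.7/32", "IGP, injected by a router behind you"),
]
winner = best_route(rib, "192.168.1.7")
print(winner)
```

<div>
AD never gets a vote here: it only breaks ties between routes for the same prefix and mask, so the bogus /32 wins no matter how trustworthy the /24 is.</div>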
<br />
<div style="margin-bottom: .0001pt; margin: 0in;">
1) Are you importing the route into your VRF? Make sure the other side's exported route-target is being imported on the originating router's VRF.</div>
<div style="margin-bottom: .0001pt; margin: 0in;">
<br /></div>
<div style="margin-bottom: .0001pt; margin: 0in;">
Originating router:</div>
<div style="margin-bottom: .0001pt; margin: 0in;">
ip vrf VPN</div>
<div style="margin-bottom: .0001pt; margin: 0in;">
rd 1:1</div>
<div style="margin-bottom: .0001pt; margin: 0in;">
route-target export 1:1</div>
<div style="margin-bottom: .0001pt; margin: 0in;">
route-target import 3:3</div>
<div style="margin-bottom: .0001pt; margin: 0in;">
route-target import 7:7</div>
<div>
<br /></div>
<div>
Terminating router:</div>
<div>
<div>
ip vrf VPN</div>
<div>
rd 3:3</div>
<div>
route-target export 2:2</div>
<div>
route-target import 1:1</div>
<div>
route-target import 7:7</div>
</div>
<div>
<br /></div>
<div>
The config above for "Originating router" is missing <b>route-target import 2:2.</b> The route target is an extended community carried in MP-BGP; if you don't import it into your VRF, you won't see the route. The RD is basically irrelevant - as long as they're unique on each PE, they don't matter for the import process. </div>
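<div>
The import check is a simple set comparison, which can be sketched in a few lines of Python. The dictionaries mirror the two VRF configs above; <b>missing_imports</b> is a hypothetical helper for illustration, not a router command.</div>

```python
def missing_imports(local_imports, remote_exports):
    """Route targets the remote side exports that the local VRF never imports."""
    return sorted(set(remote_exports) - set(local_imports))

# Values copied from the two "ip vrf VPN" configs shown above.
originating = {"export": ["1:1"], "import": ["3:3", "7:7"]}
terminating = {"export": ["2:2"], "import": ["1:1", "7:7"]}

# What the originating side fails to import from the terminating side.
forward_gap = missing_imports(originating["import"], terminating["export"])
# The reverse direction, checked the same way.
reverse_gap = missing_imports(terminating["import"], originating["export"])
print(forward_gap, reverse_gap)
```

<div>
An empty list in both directions is what you want. Here only one direction is broken, which is why it pays to check each side separately.</div>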
<div style="margin-bottom: .0001pt; margin: 0in;">
<br /></div>
<div style="margin-bottom: .0001pt; margin: 0in;">
2) Is the terminating router or terminating router's PE advertising the route?</div>
<div style="margin-bottom: .0001pt; margin: 0in;">
<br /></div>
<div style="margin-bottom: .0001pt; margin: 0in;">
This one sure got me once. I'm looking and looking for an MP-BGP problem, and it turns out that the CE just didn't advertise the route to the PE. Simple BGP error.</div>
<div style="margin-bottom: .0001pt; margin: 0in;">
<br />
3) Are route reflectors involved?
If you're relying on one route reflector to relay a route through another
route reflector, you need to ensure the cluster-IDs are different.</div>
<div style="margin-bottom: .0001pt; margin: 0in;">
<br /></div>
<div style="margin-bottom: .0001pt; margin: 0in;">
If you have two route reflectors in your MP-BGP topology, unless the PEs in question both peer to the same route reflector, you need to ensure that the route reflectors have different cluster IDs. In other words, if your MP-BGP topology looks like this:</div>
<div style="margin-bottom: .0001pt; margin: 0in;">
<br /></div>
<div style="margin-bottom: .0001pt; margin: 0in;">
RR1 <-- PE1 --> RR2 <-- PE2</div>
<div style="margin-bottom: .0001pt; margin: 0in;">
<br /></div>
<div style="margin-bottom: .0001pt; margin: 0in;">
This will work fine, even if the cluster IDs are the same, because RR2 will reflect the routes from PE1 to PE2 and vice-versa. However, if you have:</div>
<div style="margin-bottom: .0001pt; margin: 0in;">
<br /></div>
<div style="margin-bottom: .0001pt; margin: 0in;">
PE1 --> RR1 <--> RR2 <-- PE2</div>
<div style="margin-bottom: .0001pt; margin: 0in;">
<br /></div>
<div style="margin-bottom: .0001pt; margin: 0in;">
Then you'll need separate cluster IDs. Otherwise RR2 will see its own cluster ID in the cluster list of the routes RR1 reflects from PE1 and discard them, and vice-versa.</div>
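<div>
The cluster ID rule comes from BGP's reflection loop prevention: a reflector adds its cluster ID to a route's cluster list when it reflects it, and drops any route that arrives already carrying its own cluster ID. A minimal Python sketch of that check follows. The function names and cluster ID values are illustrative assumptions.</div>

```python
def reflect(cluster_id, cluster_list):
    """A route reflector prepends its own cluster ID when it reflects a route."""
    return [cluster_id] + cluster_list

def accepts(own_cluster_id, cluster_list):
    """A route reflector discards routes that already carry its own cluster ID."""
    return own_cluster_id not in cluster_list

# PE1 --> RR1 <--> RR2 <-- PE2, with PE1's route starting with an empty cluster list.
same_id_route = reflect("1.1.1.1", [])            # RR1 reflects PE1's route
same_id_ok = accepts("1.1.1.1", same_id_route)    # RR2 shares RR1's cluster ID

diff_id_route = reflect("1.1.1.1", [])            # RR1 reflects PE1's route
diff_id_ok = accepts("2.2.2.2", diff_id_route)    # RR2 has its own cluster ID
print(same_id_ok, diff_id_ok)
```

<div>
With matching cluster IDs the route dies at RR2, so PE2 never hears about it. With different IDs it passes through and gets reflected on.</div>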
<div style="margin-bottom: .0001pt; margin: 0in;">
<br /></div>
<div style="margin-bottom: .0001pt; margin: 0in;">
4) If you're using OSPF as PE->CE, check for sham links. It's easy to break these and hard to look for them. Do a "sh run | s sham" on the PEs and see if any exist. If they do, run "show ip ospf sham-links"</div>
<div style="margin-bottom: .0001pt; margin: 0in;">
<br /></div>
<div style="margin-bottom: .0001pt; margin: 0in;">
Sham links allow you to extend an OSPF area across the "Super Area 0" backbone area. These are most commonly used to prefer an MPLS path over a back-door link. Topology aside, I've been bitten by broken sham links before, so look out for these. If you want to know more about them:</div>
<div style="margin-bottom: .0001pt; margin: 0in;">
http://brbccie.blogspot.com/2012/12/ospf-pe-downward-bit-super-area-0.html</div>
<div style="margin-bottom: .0001pt; margin: 0in;">
<br /></div>
<div style="margin-bottom: .0001pt; margin: 0in;">
5) If you're using OSPF as PE->CE, and the CE is also part of the VRF (the VRF itself exists on the CE), enable capability vrf-lite on the OSPF process on the CE.</div>
<div style="margin-bottom: .0001pt; margin: 0in;">
<br /></div>
<div style="margin-bottom: .0001pt; margin: 0in;">
The first time I ran into this I spent 5 hours debugging it. Some may say a waste of time, but I'll never forget it. In short: OSPF checks for the downward bit on routes exported from MP-BGP directly into the OSPF process. You'll watch the routes arrive on the PE and get put in the OSPF process no problem, and then when they hit the CE device(s), if the CEs are in the VRF as well, they'll be in the OSPF database but not get put into the RIB/FIB. This is a loop prevention mechanism. To disable it, use "capability vrf-lite" inside the OSPF process.</div>
<div style="margin-bottom: .0001pt; margin: 0in;">
Also reference: http://brbccie.blogspot.com/2012/12/ospf-pe-downward-bit-super-area-0.html</div>
<div style="margin-bottom: .0001pt; margin: 0in;">
<br /></div>
<div style="margin-bottom: .0001pt; margin: 0in;">
6a) If OSPF is PE->CE, make sure domain-id is set the same on all of them, or you'll end up with external routes across the MPLS cloud.</div>
<div style="margin-bottom: .0001pt; margin: 0in;">
<br /></div>
<div style="margin-bottom: .0001pt; margin: 0in;">
This only matters if you're shooting for an internal route for some reason, and is more of a reminder than a big deal.</div>
<div style="margin-bottom: .0001pt; margin: 0in;">
<br /></div>
<div style="margin-bottom: .0001pt; margin: 0in;">
6b) If EIGRP is PE->CE, make sure your EIGRP AS number (process number) matches on the PE routers.</div>
<div>
<br /></div>
<div style="margin-bottom: .0001pt; margin: 0in;">
This can make a slightly bigger difference, in that EIGRP naturally deprefs (via higher AD) external routes. You may need an internal route in order to make the traffic cross the MPLS cloud. If the AS number doesn't match, you'll end up with external routes.</div>
<div style="margin-bottom: .0001pt; margin: 0in;">
<br /></div>
<div style="margin-bottom: .0001pt; margin: 0in;">
1) sh run | s mpls on every PE and P device. Look for LDP filtering. There are more elegant ways to find this, but this is the fastest.</div>
<div style="margin-bottom: .0001pt; margin: 0in;">
<br /></div>
<div style="margin-bottom: .0001pt; margin: 0in;">
This is a bit of a hack, but it catches about 90% of LDP problems in < 60 seconds. You can't beat it for speed. I'll show more about this below.</div>
<div style="margin-bottom: .0001pt; margin: 0in;">
<br /></div>
<div style="margin-bottom: .0001pt; margin: 0in;">
2) From the PE on the originating side, run "sh ip cef vrf <VRF NAME> <problem IP>". Is the correct PE listed as next-hop in the "via" field? If it's not, go investigate the PE that is originating the route; there may be more than one path (and one may not lead anywhere!).</div>
<div>
<br /></div>
<div style="margin-bottom: .0001pt; margin: 0in;">
PE1#sh ip cef vrf VPN 192.168.1.7</div>
<div style="margin-bottom: .0001pt; margin: 0in;">
192.168.1.0/24, version 8, epoch 0, cached adjacency 10.0.23.3</div>
<div style="margin-bottom: .0001pt; margin: 0in;">
0 packets, 0 bytes</div>
<div style="margin-bottom: .0001pt; margin: 0in;">
tag information set</div>
<div style="margin-bottom: .0001pt; margin: 0in;">
local tag: VPN-route-head</div>
<div style="margin-bottom: .0001pt; margin: 0in;">
fast tag rewrite with Fa0/1, 10.0.23.3, tags imposed: {17 23}</div>
<div style="margin-bottom: .0001pt; margin: 0in;">
<b> via 5.5.5.5</b>, 0 dependencies, recursive</div>
<div style="margin-bottom: .0001pt; margin: 0in;">
next hop 10.0.23.3, FastEthernet0/1 via 5.5.5.5/32</div>
<div style="margin-bottom: .0001pt; margin: 0in;">
valid cached adjacency</div>
<div style="margin-bottom: .0001pt; margin: 0in;">
tag rewrite with Fa0/1, 10.0.23.3, tags imposed: {17 23}</div>
<div>
<br /></div>
<div style="margin-bottom: .0001pt; margin: 0in;">
The via field above shows the PE you're heading towards. Is it the correct PE? This threw me off something awful once. The prefix in question was endlessly looping off a 3rd PE, and was being re-advertised on the 3rd PE. That PE was being preffed. Boom, an hour gone debugging - if only I'd paid more attention to the output of "sh ip cef vrf VPN"!</div>
<div style="margin-bottom: .0001pt; margin: 0in;">
<br /></div>
<div style="margin-bottom: .0001pt; margin: 0in;">
Assuming it is the right PE listed above, you walk the MPLS labels from there:</div>
<div style="margin-bottom: .0001pt; margin: 0in;">
<br /></div>
<div style="margin-bottom: .0001pt; margin: 0in;">
PE1#sh mpls forwarding-table 5.5.5.5</div>
<div style="margin-bottom: .0001pt; margin: 0in;">
Local Outgoing Prefix Bytes tag Outgoing Next Hop</div>
<div style="margin-bottom: .0001pt; margin: 0in;">
tag tag or VC or Tunnel Id switched interface</div>
<div style="margin-bottom: .0001pt; margin: 0in;">
17 17 5.5.5.5/32 0 Fa0/1 10.0.23.3</div>
<div>
<br /></div>
<div>
Next hop is 10.0.23.3, via Fa0/1; that's P1:</div>
<div>
<br /></div>
<div>
<div>
P1#show mpls forwarding-table 5.5.5.5</div>
<div>
Local Outgoing Prefix Bytes tag Outgoing Next Hop</div>
<div>
tag tag or VC or Tunnel Id switched interface</div>
<div>
17 <b>Untagged </b>5.5.5.5/32 13766 Fa0/1 10.0.34.4</div>
</div>
<div>
<br /></div>
<div>
There's the evil Untagged! Let's go see what's up on P2.</div>
</div>
</div>
<div>
<br /></div>
<div>
<div>
P2#sh run | s mpls</div>
<div>
<b>no mpls ldp advertise-labels</b></div>
<div>
mpls label protocol ldp</div>
<div>
mpls ip</div>
<div>
mpls label protocol ldp</div>
<div>
mpls ip</div>
<div>
<br /></div>
</div>
<div>
Note, we should have caught this in MPLS debugging step 1, but just in case you didn't...!</div>
<div>
There are about three scenarios you want to look out for related to label advertisement:</div>
<div>
<br /></div>
<div>
<b>no mpls ldp advertise-labels</b> stops all labels from being advertised.</div>
<div>
That command can be used in combination with <b>mpls ldp advertise-labels for <standard ACL></b>. The standard ACL can be (rather obviously) rigged to prevent the labels you need advertised from being advertised.</div>
<div>
The final command is <b>mpls label range</b> <min> <max>. If the range doesn't allow <i>enough labels</i>, the prefixes you need can end up without a label at all.</div>
<div>
<br /></div>
<div>
I've fixed the <b>mpls ldp advertise-labels </b>command above, and now we see the appropriate output on P1:</div>
<div>
<br /></div>
<div>
<div>
P1#show mpls forwarding-table 5.5.5.5</div>
<div>
Local Outgoing Prefix Bytes tag Outgoing Next Hop</div>
<div>
tag tag or VC or Tunnel Id switched interface</div>
<div>
17 17 5.5.5.5/32 0 Fa0/1 10.0.34.4</div>
</div>
<div>
<br /></div>
<div>
And on P2:</div>
<div>
<br /></div>
<div>
<div>
P2#show mpls forwarding-table 5.5.5.5</div>
<div>
Local Outgoing Prefix Bytes tag Outgoing Next Hop</div>
<div>
tag tag or VC or Tunnel Id switched interface</div>
<div>
17 Pop tag 5.5.5.5/32 508 Fa0/1 10.0.45.5</div>
</div>
<div>
<br /></div>
<div>
We see "Pop tag". Pop tag is OK, it's just part of the Penultimate Hop Pop process.<br />
<br />
3) If it's the correct PE from step 2, do <b>show mpls forwarding-table <PE LDP ID></b>. <i>Unless your PEs are L2 adjacent</i>, you must have a tag listed for the PE, <i>or </i>"Pop Tag". If you don't, walk your adjacent routers to be sure <b>mpls ip</b> is enabled on every interface or OSPF MPLS auto-config is enabled. Make sure CEF is turned on for all P and PE devices - MPLS doesn't work without CEF. If necessary, re-check step 1 and make sure nothing is filtering tags. If you still haven't found the problem, do a "show mpls ldp neighbor | i Peer" and make sure you have the correct count of neighbors.</div>
<div>
<div style="-webkit-text-stroke-width: 0px; color: black; font-family: 'Times New Roman'; font-size: medium; font-style: normal; font-variant: normal; font-weight: normal; letter-spacing: normal; line-height: normal; orphans: auto; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; widows: auto; word-spacing: 0px;">
<div style="margin: 0px;">
<br /></div>
</div>
</div>
<div>
I've seen some nasty, nasty things done with VACLs on the layer 2 switches between routers on practice labs. It's not much of a stretch to think they'd block LDP. The config would look perfect and your adjacency simply wouldn't come up. Count how many adjacencies you're expecting from the diagram, and make sure you get a good head count:</div>
<div>
<br /></div>
<div>
<div>
P1#show mpls ldp neigh | i Peer</div>
<div>
Peer LDP Ident: 7.7.7.7:0; Local LDP Ident 10.0.37.3:0</div>
<div>
Peer LDP Ident: 2.2.2.2:0; Local LDP Ident 10.0.37.3:0</div>
<div>
Peer LDP Ident: 192.168.49.4:0; Local LDP Ident 10.0.37.3:0</div>
</div>
<div>
<br /></div>
<div>
If you're missing one, investigate the adjacency.<br />
<br />
And a shout out to my friend Keith Chayer, who reminded me to check for CEF being enabled as well. It is of note that you'll be missing labels if CEF is disabled on the MPLS transit path - at least LDP is smart enough to tell its neighbors "I'm broken - don't use me".</div>
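<div>
The neighbor head count above is easy to script if you are pasting output into a notes file. This is only a rough sketch: the sample text is the P1 output shown earlier, while the expected set (including the made-up peer 9.9.9.9) is a hypothetical stand-in for whatever your lab diagram says you should see.</div>

```python
import re

# Sample text copied from the P1 output above.
SAMPLE_OUTPUT = """\
    Peer LDP Ident: 7.7.7.7:0; Local LDP Ident 10.0.37.3:0
    Peer LDP Ident: 2.2.2.2:0; Local LDP Ident 10.0.37.3:0
    Peer LDP Ident: 192.168.49.4:0; Local LDP Ident 10.0.37.3:0
"""

# Captures the router-ID part of "Peer LDP Ident: a.b.c.d:0".
PEER_RE = re.compile(r"Peer LDP Ident:\s*(\d+\.\d+\.\d+\.\d+):\d+")


def ldp_peers(output: str) -> set:
    """Return the set of peer LDP IDs found in the command output."""
    return set(PEER_RE.findall(output))


def missing_peers(output: str, expected: set) -> set:
    """Return the expected peers that do not appear in the output."""
    return set(expected) - ldp_peers(output)


if __name__ == "__main__":
    # Hypothetical expectation taken from a diagram; 9.9.9.9 is invented.
    expected = {"7.7.7.7", "2.2.2.2", "192.168.49.4", "9.9.9.9"}
    print("Peers seen:", sorted(ldp_peers(SAMPLE_OUTPUT)))
    print("Missing:", sorted(missing_peers(SAMPLE_OUTPUT, expected)))
```

<div>
Anything listed as missing is an adjacency to go investigate.</div>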
<div>
<br /></div>
<div>
4) Note the next-hop associated with the tag you identified in step 3. Open a command prompt on the next-hop and repeat step 3. Continue until you reach a "pop tag" for the terminating PE.</div>
<div>
<br /></div>
<div>
I covered this above.</div>
<div>
<br /></div>
<div>
<div>
5) Check for Router-ID failures. LDP is picky: the mask in the routing table and the mask in the label binding must match. This is most commonly an issue when OSPF is used as the MPLS IGP and your router ID is based on a loopback that is not a /32. If this is the case, either change your loopback address to a /32 (if permitted), or change the loopback's OSPF network type to point-to-point so that the label mask and the OSPF mask match. This can also be an issue with summarized routes in other protocols (such as EIGRP), so be on the lookout there.</div>
</div>
<div>
<br /></div>
<div>
This is reasonably self-explanatory. The route prefix length and the LDP prefix length need to match. OSPF is the common culprit. </div>
<div>
Reference: http://brbccie.blogspot.com/2013/11/mini-why-does-ldp-require-32-loopback.html</div>
<div>
<br /></div>
<div>
<div>
6) As a final check, be sure to see if cost-community was disabled on the PE routers. It's possible to perform traffic engineering against the prefixes if it's been disabled, and then who knows what path your traffic might be taking? On the PEs, run sh run | i cost-community. Cost community is on by default, and you want it left on. This command should show nothing if it is enabled; if it's disabled, you will find bgp bestpath cost-community ignore in the config.</div>
</div>
<div>
<br /></div>
<div>
I got this on a mock lab once, as well. If the PEs are disabling cost community, you need to ask yourself why: is this mandatory traffic engineering, or are they just trying to steer routes in the wrong direction?</div>
<div>
<br /></div>
<div>
Reference: http://brbccie.blogspot.com/2012/12/bgp-cost-community-eigrp-soo-and.html</div>
<div>
<br />
/* Addition 11/27/14 - I apologize for not inserting this more thoroughly in the blog, but time doesn't permit right now - be sure to look for import or export maps on the VRF. It's possible to define a route-map that filters prefixes inbound or outbound of the VRF. The syntax is not particularly complex:<br />
<br />
ip prefix-list IMPORT_PL seq 5 deny 0.0.0.0/0 le 32<br />
<div>
<div>
<div>
route-map SNAFU permit 10</div>
<div>
match ip address prefix-list IMPORT_PL</div>
</div>
</div>
<div>
<br /></div>
vrf definition VRFTEST<br />
rd 1:1<br />
route-target export 1:1<br />
route-target import 1:1<br />
!<br />
address-family ipv4<br />
import ipv4 unicast map SNAFU<br />
<br />
*/<br />
<div>
<div>
<br /></div>
</div>
</div>
<div>
Cheers,</div>
<div>
<br /></div>
<div>
Jeff Kronlage</div>
<div>
<br /></div>
brbcciehttp://www.blogger.com/profile/14586635047530183862noreply@blogger.com1tag:blogger.com,1999:blog-5968686435283454526.post-57662698711380775092014-04-12T12:44:00.002-07:002014-04-12T12:44:46.922-07:00
MPLS EXP-based QoS and QoS Groups<br />
<br />
This topic is a bit of a stretch for the R&S lab, really being more oriented towards Service Provider, but I wanted to talk about it anyway.<br />
<br />
So what does your MPLS carrier do with those QoS settings you pass them?<br />
It's unlikely they're queuing at congestion spots in their network based on the DSCP values you set.<br />
<br />
You've probably heard about the EXP bits in the MPLS tag. These are used "for QoS". But no one really seems to know how. And there are only 3 bits, but we use 6 bits for DSCP, so what's the story?<br />
<br />
Here's our topology:<br />
<br />
<img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiJlp3y3hAndkNDB83z2_3CugnzQJL0v5H4pjmuDh7abT_k_czxVXXe-yjUiEMmigleA3mDK3FqoIQmtuTJMhr4b94byHUHfllP2I8mvLU4pJVH7si9vG789uDrHLxjmT2woKpaz7b8-xo/s1600/drawing1.png" />
<br />
<br />
We'll be setting DSCP values on H1 and manipulating them, or their MPLS equivalents, on the way to H2.<br />
<br />
Without any special config, let's see how this works right out of the box. Of important note, I have null-routed H1's IP address on H2. This makes it easier to read the output from "debug mpls packet", because we're only seeing a one-way flow instead of a two-way flow.<br />
<br />
H1#ping<br />
Protocol [ip]:<br />
Target IP address: 192.168.1.6<br />
Repeat count [5]: 2<br />
Datagram size [100]:<br />
Timeout in seconds [2]:<br />
Extended commands [n]: y<br />
Source address or interface:<br />
Type of service [0]: 184<br />
Set DF bit in IP header? [no]:<br />
Validate reply data? [no]:<br />
Data pattern [0xABCD]:<br />
Loose, Strict, Record, Timestamp, Verbose[none]:<br />
Sweep range of sizes [n]:<br />
Type escape sequence to abort.<br />
Sending 2, 100-byte ICMP Echos to 192.168.1.6, timeout is 2 seconds:<br />
..<br />
Success rate is 0 percent (0/2)<br />
<div>
<br /></div>
<div>
(Remember, we were not expecting responses)</div>
<div>
So we sent this as EF traffic (TOS 184, above). Any hypothesis on what's seen in transit?</div>
<div>
<div>
<br /></div>
<div>
P2#debug mpls packet</div>
<div>
MPLS packet debugging is on</div>
</div>
<div>
P2#</div>
<div>
<div>
*Mar 1 09:20:04.473: MPLS: Fa0/0: recvd: CoS=5, TTL=253, Label(s)=16/19</div>
<div>
*Mar 1 09:20:04.473: MPLS: Fa0/1: xmit: CoS=5, TTL=252, Label(s)=19</div>
<div>
P2#</div>
<div>
*Mar 1 09:20:06.437: MPLS: Fa0/0: recvd: CoS=5, TTL=253, Label(s)=16/19</div>
<div>
*Mar 1 09:20:06.437: MPLS: Fa0/1: xmit: CoS=5, TTL=252, Label(s)=19</div>
</div>
<div>
<br /></div>
<div>
CoS=5, meaning the EXP bits are set to 5. The default behavior on a PE is to map the IPP (IP Precedence) onto the EXP bits. These line up nicely, both being three bits. Reference the ToS value above - 184. That's the full 8-bit ToS byte; in binary it's 10111000. Chop off the last two bits, which aren't part of DSCP, for the 6-bit DSCP value of 101110, and you have 46 (which I suspect you recognize as "EF"). Knock off everything except the first three bits - 101 - for the IPP, and you have? Five. Hence, EXP becomes 5 as well. This default feature is known as "ToS Reflection".</div>
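<div>
If the bit-chopping is easier to follow as arithmetic, here is a small sketch of the same breakdown. The function name split_tos is my own invention for illustration; the only input taken from the lab is the ToS value of 184.</div>

```python
def split_tos(tos: int) -> dict:
    """Break an 8-bit IP ToS byte into its DSCP and IP Precedence fields."""
    if not 0 <= tos <= 0xFF:
        raise ValueError("ToS must fit in one byte")
    return {
        "binary": format(tos, "08b"),  # full 8-bit ToS byte
        "dscp": tos >> 2,              # top six bits
        "precedence": tos >> 5,        # top three bits (what lands in EXP by default)
        "low_bits": tos & 0b11,        # the two bits DSCP does not use
    }


if __name__ == "__main__":
    fields = split_tos(184)
    print(fields["binary"])      # 10111000
    print(fields["dscp"])        # 46, i.e. EF
    print(fields["precedence"])  # 5, the value seen as CoS=5 in the debug
```

<div>
The shift by five is the whole of "ToS Reflection" as far as the numbers go: whatever is in the top three bits of the ToS byte is what shows up in EXP.</div>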
<div>
<br /></div>
<div>
We'll look at how this value can be used to our advantage later.</div>
<div>
<br /></div>
<div>
What comes in on H2?</div>
<div>
<br /></div>
<div>
If you've been following my blog for a while, you may know that about 14 months ago I wrote a giant ACL that matches every possible QoS value. I still have it on file, and I'll be using it here to see what values come in on H2.</div>
<div>
<br /></div>
<div>
<div>
H2#sh ip access-list | i match</div>
<div>
460 permit ip any any dscp ef (2 matches)</div>
<div>
480 permit ip any any dscp cs6 (1 match)</div>
</div>
<div>
<br /></div>
<div>
Ok great! We've got two EF matches, and a ... Class Selector 6?</div>
<div>
The EF matches are the two pings arriving. I found this odd right off the bat. I would've expected that if IOS takes the IPP bits and maps them to EXP, it would then take EXP and map it back to IPP on the way out the other PE when the final label is popped. However, it doesn't work that way - instead, it just uses the DSCP that was already in the packet - which, of course, never changed. An MPLS label was put on top of it, but the underlying packet was left intact. </div>
<div>
<br /></div>
<div>
The class selector 6 packet is a BGP keepalive. We'll be seeing more of them throughout the post.</div>
<div>
<br /></div>
<div>
It turns out there are terms for the different types of MPLS QoS behavior. What we observed above would be either "Pipe Mode" or "Short Pipe Mode". Both of these behaviors include using the original ToS bits instead of replacing them based on the EXP bits. The difference between Pipe Mode and Short Pipe Mode is that Pipe Mode egress queues based on the EXP bits, and Short Pipe Mode egress queues at the PE on the original ToS (DSCP) bits. This post assumes the audience understands how to write a hierarchical QoS policy, so I'm not going to elaborate or examine the differences between them any further. Any additional mention of "Pipe Mode" assumes either of the above behaviors. The third option is "Uniform Mode", which is the process of replacing the IP Packet's ToS bits (IPP/DSCP) with something derived from the EXP bits.</div>
<div>
<br /></div>
<div>
We just saw Pipe Mode in action above; now let's look at how to implement Uniform Mode.</div>
<div>
<br /></div>
<div>
First we need to take a quick look at QoS groups.</div>
<div>
<br /></div>
<div>
There's a particular challenge with ingress and egress marking on a PE. On ingress, you can't set an IPP or DSCP value because the MPLS header is still on the frame. On the egress interface, you can't match on the EXP bits to set IPP or DSCP bits, because the MPLS label is already popped. So how do you match on an EXP value and set a DSCP value? Enter QoS groups.</div>
<div>
<br /></div>
<div>
PE2:</div>
<div>
<br /></div>
<div>
<div>
<div>
class-map match-all EXP5</div>
<div>
match mpls experimental topmost 5</div>
</div>
</div>
<div>
<br /></div>
<div>
<div>
policy-map uniform-ingress</div>
<div>
class EXP5</div>
<div>
set qos-group 5</div>
<div>
class class-default</div>
<div>
set qos-group 0</div>
</div>
<div>
<br /></div>
<div>
interface fa0/0 ! MPLS side</div>
<div>
service-policy input uniform-ingress</div>
<div>
<br /></div>
<div>
This config will match a <b>decimal value</b> of five on the <i>topmost</i> MPLS label - which, in our case, on the PE, is the <b>only</b> MPLS label thanks to Penultimate Hop Pop. We'll assign a local value of "5" (although this could be any number 1-99) if the EXP value is 5. Anything else will get reset to 0.</div>
<div>
<br /></div>
<div>
<div>
class-map match-all GROUP5</div>
<div>
match qos-group 5</div>
<div>
<br /></div>
</div>
<div>
<div>
policy-map uniform-egress</div>
<div>
class GROUP5</div>
<div>
set ip dscp af41 </div>
<div>
class class-default</div>
<div>
set ip dscp default</div>
</div>
<div>
<br /></div>
<div>
interface fa0/1 ! IP/VRF side</div>
<div>
service-policy output uniform-egress</div>
<div>
<br /></div>
<div>
On egress, we'll match on that 5, and set af41. Why af41? Because I wanted to show the policy was doing something.</div>
<div>
<br /></div>
<div>
We'll ping from H1 to H2 again. I'm omitting any non-essential bits from the extended ping for brevity.</div>
<div>
<br /></div>
<div>
<div>
H1#ping</div>
<div>
Target IP address: 192.168.1.6</div>
<div>
Repeat count [5]: 2</div>
<div>
Extended commands [n]: y</div>
<div>
Type of service [0]: 184</div>
<div>
Sending 2, 100-byte ICMP Echos to 192.168.1.6, timeout is 2 seconds:</div>
<div>
..</div>
<div>
Success rate is 0 percent (0/2)</div>
</div>
<div>
<br /></div>
<div>
Again, expected failure, this is a deliberate one-way flow.</div>
<div>
<br /></div>
<div>
<div>
H2#sh ip access-list | i match</div>
<div>
340 permit ip any any dscp af41 (2 matches)</div>
<div>
640 permit ip any any precedence routine (4 matches)</div>
</div>
<div>
<br /></div>
<div>
We see our two af41 hits, and 4 routine. The routine hits are the IPP 6 packets (the BGP keepalives), which are being remarked to zero because they don't match anything else in the policy. </div>
<div>
<br /></div>
<div>
Now obviously this is a pretty useless policy, but it was more about showing how the function works.</div>
<div>
Here's an adaptation for a more scalable Uniform Mode solution:</div>
<div>
<br />
policy-map uniform-ingress<br />
class class-default<br />
set qos-group mpls experimental topmost<br />
<div>
<br /></div>
<div>
interface fa0/0</div>
<div>
service-policy input uniform-ingress</div>
<div>
<br /></div>
<div>
policy-map uniform-egress</div>
</div>
<div>
class class-default<br />
set precedence qos-group</div>
<div>
<br /></div>
<div>
interface Fa0/1</div>
<div>
service-policy output uniform-egress</div>
<div>
<br /></div>
<div>
Let's see what the outcome is.</div>
<div>
<br /></div>
<div>
<div>
<div>
H1#ping</div>
<div>
Target IP address: 192.168.1.6</div>
<div>
Repeat count [5]: 2</div>
<div>
Extended commands [n]: y</div>
<div>
Type of service [0]: 184</div>
<div>
Sending 2, 100-byte ICMP Echos to 192.168.1.6, timeout is 2 seconds:</div>
<div>
..</div>
<div>
Success rate is 0 percent (0/2)</div>
</div>
<div>
</div>
</div>
<div>
<br /></div>
<div>
<div>
H2#clear ip access-list count</div>
<div>
<br /></div>
<div>
H2#sh ip access-list | i match</div>
<div>
460 permit ip any any dscp ef (2 matches)</div>
</div>
<div>
<br /></div>
<div>
I was rather surprised the first time I saw this output. We're setting a precedence value but getting back a DSCP value. I expected to see a precedence/class-selector value. The original bits were 101110 (DSCP 46, or EF), and I expected to replace them with 101000, which would be class selector 5. This brings up an important difference in IOS's handling of class-selector vs precedence. I'd always treated them the same, but it turns out IOS is more literal - Precedence sets <b>only</b> the precedence bits. So we re-wrote the first three bits with 101, which ... were already set to 101. So we ended up with 101110 (DSCP 46/EF) again.</div>
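<div>
Here's the same idea as bit arithmetic, in case it helps. This is just my model of the behavior observed above (precedence touches only the top three of the six DSCP bits), not anything pulled from IOS itself, and the function names are made up for the example.</div>

```python
def set_precedence(dscp: int, precedence: int) -> int:
    """Overwrite only the top three bits of a 6-bit DSCP value."""
    return ((precedence & 0b111) << 3) | (dscp & 0b000111)


def set_dscp(_old_dscp: int, new_dscp: int) -> int:
    """Overwrite all six DSCP bits, ignoring whatever was there before."""
    return new_dscp & 0b111111


if __name__ == "__main__":
    ef = 0b101110  # DSCP 46
    # Rewriting precedence 5 onto EF leaves the low three bits (110) alone.
    print(format(set_precedence(ef, 5), "06b"))  # 101110, still EF
    # Getting CS5 requires replacing the whole field.
    print(format(set_dscp(ef, 0b101000), "06b"))  # 101000, CS5
```

<div>
The low three bits survive a precedence rewrite, which is why EF stayed EF.</div>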
<div>
<br /></div>
<div>
We could do something like this:</div>
<div>
<br /></div>
<div>
<div>
policy-map uniform-egress</div>
<div>
class class-default</div>
<div>
set dscp qos-group </div>
</div>
<div>
<br /></div>
<div>
But then we'd get literal DSCP values: if the QoS Group is 5, it would set DSCP 5. Not DSCP CS5 (101000), but actual binary 5 - (000101). To accomplish EF -> EXP 5 -> CS5, we'd have to use either a lengthy QoS-Group -> DSCP class-map/policy-map setup, or we could use a table map!</div>
<div>
<br /></div>
<div>
<div>
table-map TABMAP</div>
<div>
map from 1 to 8 ! Group 1 to DSCP CS1</div>
<div>
map from 2 to 16 ! Group 2 to DSCP CS2</div>
<div>
map from 3 to 24 ! ...</div>
<div>
map from 4 to 32</div>
<div>
map from 5 to 40</div>
<div>
map from 6 to 48</div>
<div>
map from 7 to 56 ! Group 7 to DSCP CS7</div>
</div>
<div>
<br /></div>
<div>
<div>
policy-map uniform-egress</div>
<div>
class class-default</div>
<div>
set dscp qos-group table TABMAP</div>
</div>
<div>
<br /></div>
<div>
<div>
<div>
H1#ping</div>
<div>
Target IP address: 192.168.1.6</div>
<div>
Repeat count [5]: 2</div>
<div>
Extended commands [n]: y</div>
<div>
Type of service [0]: 184</div>
<div>
Sending 2, 100-byte ICMP Echos to 192.168.1.6, timeout is 2 seconds:</div>
<div>
..</div>
<div>
Success rate is 0 percent (0/2)</div>
</div>
<div>
<br /></div>
<div>
<div>
H2#sh ip access-list | i match</div>
<div>
400 permit ip any any dscp cs5 (2 matches)</div>
<div>
480 permit ip any any dscp cs6 (18 matches)</div>
</div>
<div>
<br /></div>
<div>
I think the table map use is pretty obvious - take a qos group and map it to some other integer, which has some meaning when applied to a DSCP or IPP field. Now we have the CS5 output we were looking for.</div>
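<div>
The table map above is really just a lookup from group number to "group shifted left three bits". A quick sketch of that lookup follows. One assumption to flag: I'm modelling values that aren't in the table as being copied through unchanged, which is my understanding of the table-map default, so treat that part as unverified against your platform.</div>

```python
# Same entries as TABMAP above: group N maps to class selector N (N * 8).
TABMAP = {group: group << 3 for group in range(1, 8)}


def apply_table_map(qos_group: int, table: dict = TABMAP) -> int:
    """Return the DSCP value for a qos-group.

    Values missing from the table are copied through unchanged
    (assumed default behavior, see the note above).
    """
    return table.get(qos_group, qos_group)


def literal_set_dscp(qos_group: int) -> int:
    """What 'set dscp qos-group' without a table does: DSCP = group number."""
    return qos_group


if __name__ == "__main__":
    print(apply_table_map(5))   # 40, DSCP CS5 (101000)
    print(literal_set_dscp(5))  # 5, binary 000101
```

<div>
Group 5 through the table gives 40 (CS5); group 5 taken literally gives DSCP 5, which is the result we were trying to avoid.</div>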
<div>
<br /></div>
<div>
Now clearly, we need to be able to modify MPLS/EXP QoS on more than just the egress PE. Let's take a look at the other spots where we can match on it and change behavior.</div>
<div>
<br /></div>
<div>
So far we've been doing matches on the "topmost" label, so what other options have we got? Keeping this oriented towards the R&S CCIE, I'm not going to look at anything other than a 2-tag (VRF + MPLS PE) system. When traffic is received from the host by the PE, the PE is going to <i>impose </i>a label for the VRF. It will then add the MPLS transit label on top of that, for reaching the other PE. So to reiterate, we go from zero labels to two labels on the PE.</div>
<div>
<br /></div>
<div>
We can set both those labels, and it's really not hard, but you have to pay attention to what label is being manipulated on which interface. IOS is picky about the order of operations in this case.</div>
<div>
<br /></div>
<div>
For ingress on a PE, we can only set <i>imposition</i>. We clearly can't set "topmost" because there are no labels on the packet yet:<br />
<br />
PE1:<br />
policy-map impose1<br />
class class-default<br />
set mpls experimental imposition 4<br />
<div>
<br /></div>
<div>
<div>
int fa0/0</div>
<div>
service-policy input impose1</div>
</div>
<div>
<br /></div>
<div>
H1#ping 192.168.1.6 rep 1</div>
<div>
<div>
<br /></div>
<div>
Type escape sequence to abort.</div>
<div>
Sending 1, 100-byte ICMP Echos to 192.168.1.6, timeout is 2 seconds:</div>
<div>
.</div>
<div>
Success rate is 0 percent (0/1)</div>
</div>
<div>
<br /></div>
<div>
<div>
H2#sh ip access-list | i match</div>
<div>
320 permit ip any any dscp cs4 (1 match)</div>
<div>
480 permit ip any any dscp cs6 (2 matches)</div>
</div>
<br />
And what if we set EF manually on H1?<br />
<br />
<div>
H1#ping</div>
<div>
Target IP address: 192.168.1.6</div>
<div>
Repeat count [5]: 2</div>
<div>
Extended commands [n]: y</div>
<div>
Type of service [0]: 184</div>
<div>
Sending 2, 100-byte ICMP Echos to 192.168.1.6, timeout is 2 seconds:</div>
<div>
..</div>
<div>
Success rate is 0 percent (0/2)<br />
<br />
H2#sh ip access-list | i match<br />
320 permit ip any any dscp cs4 (3 matches)<br />
480 permit ip any any dscp cs6 (2 matches)</div>
<div>
<br /></div>
Still CS4. We're remarking the EXP bits on the <i>inner</i> label on PE1 to 4, that value is carried down to PE2, and then the qos-group-based policy remarks the DSCP to CS4.<br />
<br />
What about the outer label?<br />
<br />
H1#ping 192.168.1.6 rep 1<br />
<br />
Type escape sequence to abort.<br />
Sending 1, 100-byte ICMP Echos to 192.168.1.6, timeout is 2 seconds:<br />
.<br />
Success rate is 0 percent (0/1)<br />
<div>
<br /></div>
<div>
We'd need to look at the results on P2, because PE2 never gets the outer label - the PHP process removes it before forwarding the frame.</div>
<div>
<br /></div>
<div>
<div>
P2#</div>
<div>
<div>
*Mar 2 01:44:59.409: MPLS: Fa0/0: recvd: CoS=4, TTL=253, Label(s)=16/19</div>
<div>
*Mar 2 01:44:59.409: MPLS: Fa0/1: xmit: CoS=4, TTL=252, Label(s)=19</div>
</div>
</div>
<div>
<br /></div>
So P2 receives the outer label as 4, and the inner label as 4. We see 4 coming in on Fa0/0 on label 16, and going out on label 19 on Fa0/1, showing both the PHP process and the fact that both EXP values are the same. That's because the default behavior of a PE is to copy the inner label's EXP bits to the outer label. But what if we wanted to set the outer label to something different?<br />
<br />
There are two places we could do that: egress on the PE, or ingress on the P routers.<br />
<br />
Let's try the PE first.<br />
<br />
PE1:<br />
policy-map topmost1<br />
class class-default<br />
set mpls experimental topmost 2</div>
<div>
<br /></div>
<div>
<div>
interface FastEthernet0/1</div>
<div>
service-policy output topmost1</div>
</div>
<div>
<br /></div>
<div>
<div>
H1#ping 192.168.1.6 rep 1</div>
<div>
<br /></div>
<div>
Type escape sequence to abort.</div>
<div>
Sending 1, 100-byte ICMP Echos to 192.168.1.6, timeout is 2 seconds:</div>
<div>
.</div>
<div>
Success rate is 0 percent (0/1)</div>
</div>
<div>
<br /></div>
<div>
<div>
P2#</div>
</div>
<div>
<div>
*Mar 2 01:53:51.609: MPLS: Fa0/0: recvd: CoS=2, TTL=253, Label(s)=16/19</div>
<div>
*Mar 2 01:53:51.609: MPLS: Fa0/1: xmit: CoS=4, TTL=252, Label(s)=19</div>
</div>
<div>
<br /></div>
<div>
Now we see EXP 2 on the topmost and EXP 4 on the inner.</div>
<div>
<br /></div>
<div>
It's of some interest that if we wanted the final PE (PE2) to see that value of 2, we'd want to disable PHP. PHP is disabled <i>from the PE</i>, not the router upstream from it. This is done by the PE advertising an explicit null label (label 0) for the prefixes terminating on it:</div>
<div>
<br /></div>
<div>
PE2:</div>
<div>
<br /></div>
<div>
mpls ldp explicit-null</div>
<div>
<br /></div>
<div>
<div>
H1#ping 192.168.1.6 rep 1</div>
<div>
<br /></div>
<div>
Type escape sequence to abort.</div>
<div>
Sending 1, 100-byte ICMP Echos to 192.168.1.6, timeout is 2 seconds:</div>
<div>
.</div>
<div>
Success rate is 0 percent (0/1)</div>
</div>
<div>
<br /></div>
<div>
<div>
P2#</div>
</div>
<div>
<div>
*Mar 2 01:57:56.889: MPLS: Fa0/0: recvd: CoS=2, TTL=253, Label(s)=16/19</div>
<div>
*Mar 2 01:57:56.889: MPLS: Fa0/1: xmit: CoS=2, TTL=252, Label(s)=0/19</div>
</div>
<div>
<br /></div>
<div>
<div>
H2#sh ip access-list | i match</div>
<div>
160 permit ip any any dscp cs2 (1 match)</div>
</div>
<div>
<br /></div>
<div>
We see that P2 forwarded both labels, one of which was the explicit null/0 label (reference 0/19). The PE has to pop both labels before forwarding. Consequently, we also see that the PE now marked CS2 based on the EXP2 in the topmost label.</div>
<div>
<br /></div>
<div>
Now let's see about manipulating the topmost label on a P device.</div>
<div>
For clarity's sake on P2, I am disabling the explicit null (re-enabling PHP) on PE2:</div>
<div>
<br /></div>
<div>
<div>
PE2(config)#no mpls ldp explicit-null</div>
</div>
<div>
<br /></div>
<div>
P1:</div>
<div>
<br /></div>
<div>
<div>
policy-map set-topmost</div>
<div>
class class-default</div>
<div>
set mpls experimental topmost 7</div>
</div>
<div>
<br /></div>
<div>
<div>
interface FastEthernet0/1</div>
<div>
service-policy output set-topmost</div>
</div>
<div>
<br /></div>
<div>
Before I show the output of this, it's important to note that setting the topmost EXP on egress is <i>the only</i> option I could find that worked on the P routers. The P routers aren't imposing any labels (just swapping, which is different), so imposition doesn't work, and setting topmost on ingress doesn't appear to do anything (although I am not sure why). And now for the outcome:</div>
<div>
<br /></div>
<div>
<div>
H1#ping 192.168.1.6 rep 1</div>
<div>
<br /></div>
<div>
Type escape sequence to abort.</div>
<div>
Sending 1, 100-byte ICMP Echos to 192.168.1.6, timeout is 2 seconds:</div>
<div>
.</div>
<div>
Success rate is 0 percent (0/1)</div>
</div>
<div>
<br /></div>
<div>
<div>
P2#</div>
<div>
*Mar 2 02:22:25.641: MPLS: Fa0/0: recvd: CoS=7, TTL=253, Label(s)=16/19</div>
<div>
*Mar 2 02:22:25.645: MPLS: Fa0/1: xmit: CoS=4, TTL=252, Label(s)=19</div>
</div>
<div>
<br /></div>
<div>
As anticipated, EXP 7 on the outer label only.</div>
<div>
<br /></div>
<div>
It's also important to note how P routers treat the EXP bits. By default, unless you manually change it with the processes I've demonstrated above or the policer that follows, the P router, as it swaps labels hop-by-hop, will always copy the EXP of the old outer label to the new outer label unmodified.</div>
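<div>
To keep the imposition / swap / PHP behavior straight in my head, I find it useful to model the label stack as a list. The sketch below is only a toy that reproduces what the debugs above showed; the function names are invented, and label 17 on the PE1-to-P1 hop is a made-up value since that hop was never captured. Labels 16 and 19 are the ones from the P2 debug.</div>

```python
from typing import List, Optional, Tuple

Label = Tuple[int, int]  # (label value, EXP); index 0 of a stack is the topmost label


def impose(tos: int, vpn_label: int, transport_label: int,
           imposition_exp: Optional[int] = None) -> List[Label]:
    """PE ingress: push two labels. EXP defaults to the IP Precedence bits."""
    exp = (tos >> 5) if imposition_exp is None else imposition_exp
    return [(transport_label, exp), (vpn_label, exp)]


def swap(stack: List[Label], new_label: int) -> List[Label]:
    """P router: swap the top label and copy its EXP unmodified."""
    (_, exp), rest = stack[0], stack[1:]
    return [(new_label, exp)] + rest


def set_topmost(stack: List[Label], exp: int) -> List[Label]:
    """Egress policy: rewrite EXP on the topmost label only."""
    (label, _), rest = stack[0], stack[1:]
    return [(label, exp)] + rest


def php(stack: List[Label]) -> List[Label]:
    """Penultimate hop: pop the top label, exposing the inner label's own EXP."""
    return stack[1:]


if __name__ == "__main__":
    # PE1 with "set mpls experimental imposition 4"; label 17 is hypothetical.
    stack = impose(tos=184, vpn_label=19, transport_label=17, imposition_exp=4)
    # P1 swaps to label 16, then its egress policy sets topmost EXP 7.
    stack = set_topmost(swap(stack, 16), 7)
    print("P2 receives:", stack)       # [(16, 7), (19, 4)]
    print("P2 transmits:", php(stack))  # [(19, 4)]
```

<div>
That lines up with the P2 debug: CoS=7 received on 16/19, CoS=4 transmitted on label 19.</div>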
<div>
<br /></div>
<div>
And now for our final topic - policing based on EXP.</div>
<div>
<br /></div>
<div>
P1:</div>
<div>
<div>
class-map match-all EXP5</div>
<div>
match mpls experimental topmost 5</div>
</div>
<div>
<br /></div>
<div>
<div>
policy-map POLICER</div>
<div>
class EXP5</div>
<div>
police cir 32000</div>
<div>
conform-action transmit</div>
<div>
exceed-action set-mpls-exp-topmost-transmit 1</div>
</div>
<div>
<br /></div>
<div>
<div>
interface FastEthernet0/1</div>
<div>
service-policy output POLICER</div>
</div>
<div>
<br /></div>
<div>
<div>
H1#ping</div>
<div>
Protocol [ip]:</div>
<div>
Target IP address: 192.168.1.6</div>
<div>
Repeat count [5]: 500</div>
<div>
Datagram size [100]: 1000</div>
<div>
Extended commands [n]: y</div>
<div>
Type of service [0]: 184</div>
<div>
Sending 500, 1000-byte ICMP Echos to 192.168.1.6, timeout is 0 seconds:</div>
<div>
......................................................................</div>
<div>
<output omitted></div>
<div>
..........</div>
<div>
Success rate is 0 percent (0/500)</div>
</div>
<div>
<br /></div>
<div>
This one is tricky to validate - we want to see some MPLS packets leave P1 as 5, and some leave as 1. Unfortunately my ACL doesn't work here (without turning PHP back off) because we're playing with the outer label and not the inner label, and the Uniform Mode config on PE2 won't take heed of the outer label, because it's popped before the packet reaches PE2.</div>
<div>
<br /></div>
<div>
Instead, we're just going to look at a sampling of "debug mpls packet" on P2:</div>
<div>
<br /></div>
<div>
<div>
*Mar 2 02:48:26.057: MPLS: Fa0/0: recvd: CoS=5, TTL=253, Label(s)=16/19</div>
<div>
*Mar 2 02:48:26.057: MPLS: Fa0/1: xmit: CoS=5, TTL=252, Label(s)=19</div>
<div>
*Mar 2 02:48:26.097: MPLS: Fa0/0: recvd: CoS=1, TTL=253, Label(s)=16/19</div>
<div>
*Mar 2 02:48:26.097: MPLS: Fa0/1: xmit: CoS=5, TTL=252, Label(s)=19</div>
<div>
*Mar 2 02:48:26.097: MPLS: Fa0/0: recvd: CoS=1, TTL=253, Label(s)=16/19</div>
<div>
*Mar 2 02:48:26.097: MPLS: Fa0/1: xmit: CoS=5, TTL=252, Label(s)=19</div>
<div>
*Mar 2 02:48:26.101: MPLS: Fa0/0: recvd: CoS=1, TTL=253, Label(s)=16/19</div>
</div>
<div>
<br /></div>
<div>
Let's decipher this a bit:</div>
<div>
<br /></div>
<div>
Remember, P2 is performing PHP for PE2, so what we see coming in and what we see going out will be different. P1 is only making modifications to the topmost label.</div>
<div>
<br /></div>
<div>
<div>
*Mar 2 02:48:26.057: MPLS: Fa0/0: recvd: CoS=5, TTL=253, Label(s)=16/19</div>
<div>
<br /></div>
<div>
We got an MPLS packet in as EXP 5.</div>
<div>
<br /></div>
<div>
*Mar 2 02:48:26.057: MPLS: Fa0/1: xmit: CoS=5, TTL=252, Label(s)=19</div>
</div>
<div>
<br /></div>
<div>
We popped the upper label and sent the inner label on as EXP 5 as well.</div>
<div>
<br /></div>
<div>
<div>
*Mar 2 02:48:26.097: MPLS: Fa0/0: recvd: CoS=1, TTL=253, Label(s)=16/19</div>
<div>
<br /></div>
<div>
By this point, we've already gotten the policer to kick in, so we receive EXP 1.</div>
<div>
<br /></div>
<div>
*Mar 2 02:48:26.097: MPLS: Fa0/1: xmit: CoS=5, TTL=252, Label(s)=19</div>
</div>
<div>
<br /></div>
<div>
and we transmit EXP 5 based on the inner label, which was set on PE1 because of the IPP -> EXP ToS Reflection. The policer on P1 did not modify this value.</div>
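<div>
For anyone who wants to see why the first packet conforms and the next few exceed, here's a bare-bones single-rate token bucket. Big caveat: this is a generic textbook model, not IOS's actual policer, and the 1500-byte burst is a number I picked for illustration (I didn't configure or check the burst in the lab). The packet timings are loosely based on the debug timestamps.</div>

```python
def police(packets, cir_bps: int, burst_bytes: int) -> list:
    """Classify packets as 'conform' or 'exceed' with a single token bucket.

    packets: iterable of (arrival_time_seconds, size_bytes), in time order.
    """
    tokens = float(burst_bytes)  # bucket starts full
    last_arrival = None
    results = []
    for arrival, size in packets:
        if last_arrival is not None:
            # Refill at CIR (converted to bytes/second), capped at the burst size.
            refill = (arrival - last_arrival) * cir_bps / 8
            tokens = min(burst_bytes, tokens + refill)
        last_arrival = arrival
        if size <= tokens:
            tokens -= size
            results.append("conform")
        else:
            results.append("exceed")
    return results


def mark_exp(results, conform_exp: int = 5, exceed_exp: int = 1) -> list:
    """Map policer results to the topmost EXP value the next hop would see."""
    return [conform_exp if r == "conform" else exceed_exp for r in results]


if __name__ == "__main__":
    # Four 1000-byte packets arriving close together (times are illustrative).
    packets = [(0.0, 1000), (0.04, 1000), (0.04, 1000), (0.044, 1000)]
    results = police(packets, cir_bps=32000, burst_bytes=1500)
    print(mark_exp(results))  # [5, 1, 1, 1]
```

<div>
At 32000 bps the bucket only refills 4000 bytes per second, so a rapid string of 1000-byte pings drains it almost immediately and nearly everything after the first packet goes out as EXP 1.</div>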
<div>
<br /></div>
<div>
That's MPLS QoS/QoS Groups in a nutshell. Hope you enjoyed!</div>
<div>
<br /></div>
<div>
Jeff</div>
<div>
</div>
</div>
brbcciehttp://www.blogger.com/profile/14586635047530183862noreply@blogger.com3tag:blogger.com,1999:blog-5968686435283454526.post-14602259808929695622014-02-15T16:55:00.001-08:002014-02-15T17:04:39.910-08:00
[mini] PIM Dense State Refresh<br />
<br />
Been brushing up on multicast recently. It was one of the first topics I ever deep-dived and some of the material is rusty now... two and a half years later. <br />
<br />
Came across PIM-DM State Refresh. This is an interesting attempt to make dense mode PIM more scalable. If you ask a CCNP student what the common detriment with using dense mode is, he'd probably tell you "It floods all its groups to every PIM device every three minutes". That is true, but that's an attempt to solve a problem, not the problem itself.<br />
<br />
The real problem is that dense mode has no way of letting potential receivers know what groups are available, or where to find them. It's easy to lose sight of this when labbing: you control every device and know exactly where all the transmitters and receivers are, and know what groups are on the network. The 3-minute flooding is only present because that's how dense mode tells the network what groups are available and where to find them.<br />
<br />
State Refresh is not a new technology - it was proposed in the late 90s - but I'd never heard of it before today. With it enabled, you still have the initial flood of the actual multicast stream, but after the initial prune, instead of re-flooding the stream every 3 minutes, the router sends a state refresh every X seconds, where X is defined by the command that enables it:<br />
<br />
interface FastEthernet1/0<br />
ip pim dense-mode<br />
<b> ip pim state-refresh origination-interval 10</b><br />
<div>
<br /></div>
<div>
In this sample, we would send state refresh messages every 10 seconds.<br />
<br />
In this fashion, all PIM routers in the network are still aware of the stream, but they don't get the annoying flooding of the traffic followed by having to prune it constantly.</div>
<div>
<br /></div>
<div>
Also note, this process only works if the transmitter is still sending traffic. It does not do this "keepalive" signaling for multicast streams that are no longer in use.</div>
<div>
<br /></div>
<div>
Where you place the state-refresh command is important. It should always go on the PIM interface closest to the transmitting host. If you put it anywhere else, it does not work. You do not need to enable it on any other routers or interfaces. In my lab, I had three interfaces, one pointing at a host endlessly pinging 239.0.0.100, and the other two pointing towards PIM routers. I only have it enabled on the interface pointing towards the host.</div>
<div>
<br /></div>
<div>
All PIM Dense routers/interfaces will automatically relay state-refresh messages. This command only needs to be enabled on interfaces facing transmitting hosts.</div>
<div>
<br /></div>
<div>
If you don't want a PIM router to relay these messages, use this global command:</div>
<div>
<b>ip pim state-refresh disable</b></div>
<div>
<b><br /></b></div>
<div>
Cheers,</div>
<div>
<br /></div>
<div>
Jeff</div>
<div>
<b><br /></b></div>
brbcciehttp://www.blogger.com/profile/14586635047530183862noreply@blogger.com0tag:blogger.com,1999:blog-5968686435283454526.post-58117507632153320552014-02-07T21:42:00.004-08:002014-02-07T21:42:32.897-08:00
The Woz!<br />
<br />
Totally off topic this time - but tonight, I met Steve Wozniak, and it was amazing.<br />
<br />
He spoke at a small networking event that I attended. When I signed up for it, I was on the fence about attending - he hadn't been signed on at that time - but I decided to go anyway (dragged along was more like it, but that's another story). Then the message came out that he was the guest speaker, and tickets sold out in the blink of an eye.<br />
<br />
He talked for about 30 minutes, took Q&A for another 30. What a super guy. Incredibly personable. <br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjBA4MvpzcvRO4U65GIqLaQXj_-M_9y8mDFHBEkvfslWd8jQYLo2bB-rg9236bV6cyEc7BbISkHzM5Bg5zx7RmyJWMc2hOso_GqCkBuYComstlIRFku_OWwi2zwjvxM-4u5Dn4Va5YlBEg/s1600/photor.JPG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjBA4MvpzcvRO4U65GIqLaQXj_-M_9y8mDFHBEkvfslWd8jQYLo2bB-rg9236bV6cyEc7BbISkHzM5Bg5zx7RmyJWMc2hOso_GqCkBuYComstlIRFku_OWwi2zwjvxM-4u5Dn4Va5YlBEg/s1600/photor.JPG" height="320" width="240" /></a></div>
<br />
<br />
Brief recap of the discussion:<br />
- He spoke several times about Steve Jobs. Some insightful things, including the mentor Apple had in the early days, and how Steve J learned most of his technique from that mentor (one of their early investors). Interestingly, it wasn't all positive - he mentioned on several occasions how he wished Steve J had been more generous both in his personality and financially.<br />
- He's such a nerd! (in a good way). This was a technology business networking function, so inevitably the topics tended to lean towards business. He'd start off on a business topic, and really just end up saying that if you have a cool engineering idea, it'll probably do well. Then he'd go off on a story about a neat engineering idea.<br />
- He has some interesting ideas about wearable technology. He doesn't think we've got it nailed yet, and thinks the smartwatch in particular doesn't have a big enough screen to be usable.<br />
- He's disappointed that computers in schools didn't result in smarter students. He talked a lot about how people were interested in them while they offered something new, but the interest didn't stick unless the "new" kept coming.<br />
- He talked a lot about the education system in general. Interesting ideas about Singularity - http://en.wikipedia.org/wiki/Technological_singularity - and how that might help in learning someday. Also he hypothesized that Moore's Law (http://en.wikipedia.org/wiki/Moore's_law) is going to fail soon, and that Singularity may be a long way off because of that.<br />
- He made a few good jokes, including a reference that we now ask all questions of the Internet, which was clearly never designed to be an "ask me a question" resource. He says we now ask all questions of something that starts with "Go" and it's not "God".<br />
- He had some really positive comments about Google, even a vague reference that there perhaps should have been an Apple/Google merger ... ? He said Google definitely had Apple licked on human phrase interpretation, indicating Siri did a relatively poor job of taking a human idea and producing an answer, but Google took human phrasing very well and produced ... a page of links (Google doesn't answer questions, for the most part).<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiaFmSfqKO7fDK-td71cQ0BOUnJ_ybtV8O7kqhz2wwDSyhoTHyuhw5nK-7vhUix-6b44l0OOBShAy0dCGyyJg8_gcHrR5-MFYWJvlJWA9adtGFc003FhNNEak06GbJe3U5UD-V-2iIDhzU/s1600/photoe.JPG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiaFmSfqKO7fDK-td71cQ0BOUnJ_ybtV8O7kqhz2wwDSyhoTHyuhw5nK-7vhUix-6b44l0OOBShAy0dCGyyJg8_gcHrR5-MFYWJvlJWA9adtGFc003FhNNEak06GbJe3U5UD-V-2iIDhzU/s1600/photoe.JPG" height="320" width="240" /></a></div>
<br />
<br />
He was signed up for a photo op with the sponsors, but let everyone know in advance that he'd be coming back down to mingle afterwards. <br />
<br />
He came back down and gave everyone a chance to take photos, and chatted with everyone, as best he could, for a person that was being mobbed by 60+ fans.<br />
<br />
This was my best photo, which was (thank God) taken unbeknownst to me by an old friend who I happened to bump into there, because all my other ones came out terrible. You can't see it, but I did get to shake his hand.<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhGLZ4lMQW_ix8ehgfIXxNi62BPofCCEq-B7hXcALARZjVpZWKsuLV1djnEZA_Cl7WPi3kd-Z_v2hFNF0CN-KC3N27W-1x2laYZ8QBGKgr8cXUm4piABdEYCdlxSqoHD2NefMcNdMUBmVg/s1600/1898228_695634260457894_1255939804_n.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhGLZ4lMQW_ix8ehgfIXxNi62BPofCCEq-B7hXcALARZjVpZWKsuLV1djnEZA_Cl7WPi3kd-Z_v2hFNF0CN-KC3N27W-1x2laYZ8QBGKgr8cXUm4piABdEYCdlxSqoHD2NefMcNdMUBmVg/s1600/1898228_695634260457894_1255939804_n.jpg" height="180" width="320" /></a></div>
<br />
This is a very hastily cleaned up version of one of my other photos --<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj7ik_6dC5kfPihLrgbmsTqTAPR6AcgZNPxNwnP3uQINdvVLD4qAZwbGeLop52tV6fKaLbSI7_UlqHe2xHJJzh9OBMLNgWBpf0reBnEa3vIIe1fQ1q_NOIQ6rD11qf02AOW04WyF2UEf1s/s1600/photo+3-a.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj7ik_6dC5kfPihLrgbmsTqTAPR6AcgZNPxNwnP3uQINdvVLD4qAZwbGeLop52tV6fKaLbSI7_UlqHe2xHJJzh9OBMLNgWBpf0reBnEa3vIIe1fQ1q_NOIQ6rD11qf02AOW04WyF2UEf1s/s1600/photo+3-a.jpg" height="320" width="240" /></a></div>
<div class="separator" style="clear: both; text-align: center;">
<br /></div>
<div class="separator" style="clear: both; text-align: left;">
All my other ones had severe lighting problems, unfortunately.</div>
<div class="separator" style="clear: both; text-align: left;">
<br /></div>
<div class="separator" style="clear: both; text-align: left;">
What a great evening!</div>
<div class="separator" style="clear: both; text-align: left;">
<br /></div>
<div class="separator" style="clear: both; text-align: left;">
Jeff</div>
<div class="separator" style="clear: both; text-align: left;">
<br /></div>
Private VLANs - How they really work (posted 2014-01-29)<br />
<br />
You're probably already familiar with the basics of a private VLAN: it allows you to group hosts in a single subnet on Ethernet, but limit which hosts can talk to each other at layer 2. A common design is to have the default gateway accessible to the entire subnet, but prevent the individual hosts from talking to each other (an isolated VLAN). Another common design is to break the VLAN up into various smaller groups that can talk to each other (community VLANs), but allow all hosts in all groups to talk to the default gateway. Instead of a default gateway, the promiscuous host might be a community server, such as a backup server.<br />
<div>
<br /></div>
<div>
Configuring them is not very tricky. Our topology will consist of two 3560 switches, SW1 and SW2, trunked together on Fa0/13. R1 will simulate our default gateway, on 192.168.0.1.</div>
<div>
<br /></div>
<div>
SW1:</div>
<div>
<div>
vlan 124</div>
<div>
private-vlan primary</div>
<div>
private-vlan association 216,402</div>
<div>
<br /></div>
<div>
vlan 216</div>
<div>
private-vlan community</div>
<div>
<br /></div>
<div>
vlan 402</div>
<div>
private-vlan isolated</div>
</div>
<div>
<br /></div>
<div>
<div>
interface FastEthernet0/1</div>
<div>
switchport private-vlan mapping 124 216,402</div>
<div>
switchport mode private-vlan promiscuous</div>
</div>
<div>
<br /></div>
<div>
<div>
interface FastEthernet0/13 ! Trunk to SW2</div>
<div>
switchport trunk encapsulation dot1q</div>
<div>
switchport mode trunk</div>
</div>
<div>
<br /></div>
<div>
SW2:</div>
<div>
<div>
vlan 124</div>
<div>
private-vlan primary</div>
<div>
private-vlan association 216,402</div>
<div>
<br /></div>
<div>
vlan 216</div>
<div>
private-vlan community</div>
<div>
<br /></div>
<div>
vlan 402</div>
<div>
private-vlan isolated</div>
</div>
<div>
<br /></div>
<div>
<div>
interface FastEthernet0/13 ! trunk to SW1</div>
<div>
switchport trunk encapsulation dot1q</div>
<div>
switchport mode trunk</div>
<div>
<br /></div>
<div>
interface FastEthernet0/2 ! Connects to R2, a community host</div>
<div>
switchport private-vlan host-association 124 216</div>
<div>
switchport mode private-vlan host</div>
<div>
<br /></div>
<div>
interface FastEthernet0/4 ! Connects to R4, a community host</div>
<div>
switchport private-vlan host-association 124 216</div>
<div>
switchport mode private-vlan host</div>
<div>
<br /></div>
<div>
interface FastEthernet0/6 ! Connects to R6, an isolated host</div>
<div>
switchport private-vlan host-association 124 402</div>
<div>
switchport mode private-vlan host</div>
</div>
<div>
<br /></div>
<div>
I'm not going to go over this config in extreme detail, because this information can be found very easily anywhere. I just needed a baseline to show a few other interesting things.</div>
<div>
<br /></div>
<div>
R1 is the promiscuous default gateway, R6 is an isolated host, R2 and R4 are in a community VLAN.</div>
<div>
First, let's make sure it works.</div>
<div>
Can everyone reach R1?</div>
<div>
<br /></div>
<div>
<div>
R2#ping 192.168.0.1</div>
<div>
<br /></div>
<div>
Type escape sequence to abort.</div>
<div>
Sending 5, 100-byte ICMP Echos to 192.168.0.1, timeout is 2 seconds:</div>
<div>
!!!!!</div>
<div>
Success rate is 100 percent (5/5), round-trip min/avg/max = 1/2/4 ms</div>
</div>
<div>
<br /></div>
<div>
<div>
R4#ping 192.168.0.1</div>
<div>
<br /></div>
<div>
Type escape sequence to abort.</div>
<div>
Sending 5, 100-byte ICMP Echos to 192.168.0.1, timeout is 2 seconds:</div>
<div>
!!!!!</div>
<div>
Success rate is 100 percent (5/5), round-trip min/avg/max = 1/2/4 ms</div>
</div>
<div>
<br /></div>
<div>
<div>
R6#ping 192.168.0.1</div>
<div>
<br /></div>
<div>
Type escape sequence to abort.</div>
<div>
Sending 5, 100-byte ICMP Echos to 192.168.0.1, timeout is 2 seconds:</div>
<div>
!!!!!</div>
<div>
Success rate is 100 percent (5/5), round-trip min/avg/max = 1/2/4 ms</div>
</div>
<div>
<br /></div>
<div>
OK, our promiscuous port works.</div>
<div>
<br /></div>
<div>
Can R2 ping R4?</div>
<div>
<br /></div>
<div>
<div>
R2#ping 192.168.0.4</div>
<div>
<br /></div>
<div>
Type escape sequence to abort.</div>
<div>
Sending 5, 100-byte ICMP Echos to 192.168.0.4, timeout is 2 seconds:</div>
<div>
!!!!!</div>
<div>
Success rate is 100 percent (5/5), round-trip min/avg/max = 1/2/4 ms</div>
</div>
<div>
<br /></div>
<div>
Looks good, community VLAN works.</div>
<div>
<br /></div>
<div>
Can R6 reach anything besides R1?</div>
<div>
<br /></div>
<div>
<div>
R6#ping 192.168.0.2</div>
<div>
<br /></div>
<div>
Type escape sequence to abort.</div>
<div>
Sending 5, 100-byte ICMP Echos to 192.168.0.2, timeout is 2 seconds:</div>
<div>
.....</div>
<div>
Success rate is 0 percent (0/5)</div>
<div>
R6#ping 192.168.0.4</div>
<div>
<br /></div>
<div>
Type escape sequence to abort.</div>
<div>
Sending 5, 100-byte ICMP Echos to 192.168.0.4, timeout is 2 seconds:</div>
<div>
.....</div>
<div>
Success rate is 0 percent (0/5)</div>
</div>
<div>
<br /></div>
<div>
Nope, it really is isolated.</div>
<div>
<br /></div>
<div>
So it works - fantastic. But <b>how </b>does it work? Answering that question takes a bit of research.</div>
<div>
<br /></div>
<div>
This page documents the high-level workings:</div>
<div>
http://www.cisco.com/en/US/docs/switches/lan/catalyst6500/ios/12.2SX/configuration/guide/pvlans.html</div>
<div>
<br /></div>
<div>
Specifically,</div>
<div>
<br /></div>
<div>
"• Primary VLAN— The primary VLAN carries unidirectional traffic downstream from the promiscuous ports to the (isolated and community) host ports and to other promiscuous ports.</div>
<br />
<div class="pBu1_Bullet1">
• Isolated VLAN — [edited for brevity] An isolated VLAN is a secondary VLAN that carries unidirectional traffic upstream from the hosts toward the promiscuous ports and the gateway. </div>
<br />
<div class="pBu1_Bullet1">
• Community VLAN—A community VLAN is a secondary VLAN that carries upstream traffic from the community ports to the promiscuous port gateways and to other host ports in the same community. [edited for brevity] "</div>
<div class="pBu1_Bullet1">
<br /></div>
<div class="pBu1_Bullet1">
So, in a nutshell, the primary VLAN carries traffic from the promiscuous port to everyone else, unidirectionally. An isolated VLAN carries traffic from every isolated host to the promiscuous port. And a community VLAN carries bidirectional traffic for members of the community, but only unidirectional traffic towards the promiscuous port.</div>
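<div>
Those three definitions boil down to a small rule set. Here is a minimal Python sketch of it, assuming all ports sit in the same primary VLAN. The <code>Port</code> type and <code>can_exchange_frames</code> function are hypothetical names for illustration only; the port roles are taken from the lab config above.</div>

```python
from typing import NamedTuple, Optional


class Port(NamedTuple):
    kind: str                 # "promiscuous", "community" or "isolated"
    secondary: Optional[int]  # secondary VLAN ID; None for promiscuous ports


def can_exchange_frames(a: Port, b: Port) -> bool:
    """Layer 2 reachability between two ports of one private VLAN domain."""
    if "promiscuous" in (a.kind, b.kind):
        return True                        # promiscuous ports reach everyone
    if a.kind == b.kind == "community":
        return a.secondary == b.secondary  # only within the same community
    return False                           # isolated ports reach no other host


# Roles from the lab: primary 124, community 216, isolated 402
LAB = {
    "R1": Port("promiscuous", None),
    "R2": Port("community", 216),
    "R4": Port("community", 216),
    "R6": Port("isolated", 402),
}

for src, dst in [("R2", "R1"), ("R2", "R4"), ("R6", "R1"), ("R6", "R2")]:
    print(src, "->", dst, can_exchange_frames(LAB[src], LAB[dst]))
```

<div>
The results line up with the ping tests earlier: everyone reaches R1, R2 reaches R4, and R6 reaches nothing but R1.</div>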
<div class="pBu1_Bullet1">
<br /></div>
<div class="pBu1_Bullet1">
OK, we got that, but still, <b>how</b> does it work?</div>
<div class="pBu1_Bullet1">
<br /></div>
<div class="pBu1_Bullet1">
It's all a clever manipulation of MAC address learning.</div>
<div class="pBu1_Bullet1">
<br /></div>
<div class="pBu1_Bullet1">
Referencing the pings I made above, let's look at the mac table on SW1 and SW2.</div>
<div class="pBu1_Bullet1">
<br /></div>
<div class="pBu1_Bullet1">
SW1#show mac address-table | i 124|216|402</div>
<div class="pBu1_Bullet1">
402 0013.c460.2be0 DYNAMIC pv Fa0/1</div>
<div class="pBu1_Bullet1">
402 0019.2fb8.d552 BLOCKED Fa0/13</div>
<div class="pBu1_Bullet1">
124 0013.c460.2be0 DYNAMIC Fa0/1</div>
<div class="pBu1_Bullet1">
124 0019.2fb8.d552 DYNAMIC pv Fa0/13</div>
<div class="pBu1_Bullet1">
124 0019.e880.09c0 DYNAMIC pv Fa0/13</div>
<div class="pBu1_Bullet1">
216 0013.c460.2be0 DYNAMIC pv Fa0/1</div>
<div class="pBu1_Bullet1">
216 0019.e880.09c0 DYNAMIC Fa0/13</div>
<div>
<br /></div>
<div class="pBu1_Bullet1">
SW2#show mac address-table | i 124|216|402</div>
<div class="pBu1_Bullet1">
402 0013.c460.2be0 DYNAMIC pv Fa0/13</div>
<div class="pBu1_Bullet1">
402 0014.1ceb.f60f BLOCKED Fa0/13</div>
<div class="pBu1_Bullet1">
402 0019.2fb8.d552 BLOCKED Fa0/6</div>
<div class="pBu1_Bullet1">
124 0013.c460.2be0 DYNAMIC Fa0/13</div>
<div class="pBu1_Bullet1">
124 0014.1ceb.f60f DYNAMIC pv Fa0/13</div>
<div class="pBu1_Bullet1">
124 0019.2fb8.d552 DYNAMIC pv Fa0/6</div>
<div class="pBu1_Bullet1">
124 0019.e880.09c0 DYNAMIC pv Fa0/2</div>
<div class="pBu1_Bullet1">
124 0024.c4eb.ed68 DYNAMIC pv Fa0/4</div>
<div class="pBu1_Bullet1">
216 0013.c460.2be0 DYNAMIC pv Fa0/13</div>
<div class="pBu1_Bullet1">
216 0019.e880.09c0 DYNAMIC Fa0/2</div>
<div class="pBu1_Bullet1">
216 0024.c4eb.ed68 DYNAMIC Fa0/4</div>
<div>
<br /></div>
<div>
There's a lot going on there, so let's break this down into more manageable chunks.</div>
<div>
<br /></div>
<div>
R6 is our simplest host - it's a single isolated host, so let's start there.</div>
<div>
<br /></div>
<div>
<div>
R6#show int fa0/0 | i bia</div>
<div>
Hardware is Gt96k FE, address is 0019.2fb8.d552 (bia 0019.2fb8.d552)</div>
</div>
<div>
<br /></div>
<div>
So we now know that 0019.2fb8.d552 is R6's MAC address on its Fa0/0 port.</div>
<div>
<br /></div>
<div>
<div>
SW1#show mac address-table | i 0019.2fb8.d552</div>
<div>
402 0019.2fb8.d552 BLOCKED Fa0/13</div>
<div>
124 0019.2fb8.d552 DYNAMIC pv Fa0/13</div>
</div>
<div>
<br /></div>
<div>
<div>
SW2#show mac address-table | i 0019.2fb8.d552</div>
<div>
402 0019.2fb8.d552 BLOCKED Fa0/6</div>
<div>
124 0019.2fb8.d552 DYNAMIC pv Fa0/6</div>
</div>
<div>
<br /></div>
<div>
We see some new terms here in the CAM table, "BLOCKED" and "DYNAMIC pv".</div>
<div>
<br /></div>
<div>
Best I can tell without any documentation on these, this is what they mean:</div>
<div>
BLOCKED = Do not forward traffic to this MAC (essentially the same as not learning it)</div>
<div>
DYNAMIC pv = This is a receive-only MAC.</div>
<div>
<br /></div>
<div>
So putting that in context of R6, R6 is allowed to SEND on 402, and can RECEIVE on 124. It cannot receive on 402, because the MAC is "BLOCKED". This is what enforces an isolated VLAN. The MAC learning process takes place, but the CAM table flags them as unusable. Therefore, any traffic sent towards that MAC on that VLAN would be discarded.</div>
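<div>
As a sanity check on that reading of the flags, here is a small Python sketch of the lookup. To be clear, this models my interpretation above, not any documented switch behavior: the function name and the "drop"/"flood" return values are invented for illustration, and the table entries are copied from the SW2 output earlier.</div>

```python
R1_MAC = "0013.c460.2be0"  # promiscuous gateway
R6_MAC = "0019.2fb8.d552"  # isolated host

# (vlan, mac) -> (type flag, port), copied from the SW2 output above
CAM_SW2 = {
    (402, R6_MAC): ("BLOCKED", "Fa0/6"),
    (124, R6_MAC): ("DYNAMIC pv", "Fa0/6"),
    (402, R1_MAC): ("DYNAMIC pv", "Fa0/13"),
    (124, R1_MAC): ("DYNAMIC", "Fa0/13"),
}


def forward_decision(cam, vlan, dst_mac):
    """Return the egress port, "drop", or "flood" for a frame to dst_mac."""
    entry = cam.get((vlan, dst_mac))
    if entry is None:
        return "flood"  # simplification: treated like any unlearned MAC
    flag, port = entry
    if flag == "BLOCKED":
        return "drop"   # learned, but flagged unusable as a destination
    return port         # "DYNAMIC" and "DYNAMIC pv" can both receive


print(forward_decision(CAM_SW2, 402, R6_MAC))  # towards R6 on the isolated VLAN
print(forward_decision(CAM_SW2, 124, R6_MAC))  # towards R6 on the primary VLAN
print(forward_decision(CAM_SW2, 402, R1_MAC))  # R6 sending up to R1 on 402
```

<div>
Under this model, a frame for R6 is dropped on 402 but delivered out Fa0/6 on 124, while a frame for R1 on 402 heads out the trunk towards SW1.</div>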
<div>
<br /></div>
<div>
Let's look at our earlier definition of isolated VLAN from Cisco:</div>
<div>
<br /></div>
<div>
<div>
• Isolated VLAN — [edited for brevity] An isolated VLAN is a secondary VLAN that carries unidirectional traffic upstream from the hosts toward the promiscuous ports and the gateway.</div>
<div class="pBu1_Bullet1">
<br /></div>
<div class="pBu1_Bullet1">
This checks out: R6 (and any other future isolated host) can never be reached on that VLAN, because its MAC is BLOCKED there. There is one exception...</div>
<div class="pBu1_Bullet1">
<br /></div>
<div class="pBu1_Bullet1">
R1#show int fa0/0 | i bia</div>
<div class="pBu1_Bullet1">
Hardware is AmdFE, address is 0013.c460.2be0 (bia 0013.c460.2be0)</div>
<div class="pBu1_Bullet1">
<br /></div>
<div class="pBu1_Bullet1">
SW1#show mac address-table | i 0013.c460.2be0</div>
<div class="pBu1_Bullet1">
402 0013.c460.2be0 DYNAMIC pv Fa0/1</div>
<div>
[edited for brevity]</div>
<div>
<br /></div>
<div>
R1's MAC can be learned on 402. So R6 will be able to send frames out 402 towards R1. This completes the concept that 402 is a "one way" VLAN from Isolated -> Promiscuous.</div>
<div>
<br /></div>
<div>
But what about R1 -> R6?</div>
<div class="pBu1_Bullet1">
<br /></div>
</div>
<div>
<div>
• Primary VLAN— The primary VLAN carries unidirectional traffic downstream from the promiscuous ports to the (isolated and community) host ports and to other promiscuous ports.</div>
<div>
<br /></div>
<div>
<div>
SW1#show mac address-table | i 124</div>
<div>
124 0013.c460.2be0 DYNAMIC Fa0/1</div>
<div>
124 0019.2fb8.d552 DYNAMIC pv Fa0/13</div>
</div>
<div>
<br /></div>
<div>
On Fa0/1, we see R1, and on Fa0/13, we see R6. R6 has the "DYNAMIC pv" status, meaning it is a receive-only MAC. Frames sourced from 0019.2fb8.d552 would not be accepted on VLAN 124, only frames destined to 0019.2fb8.d552. We'll be generating more traffic in a moment, and the switches will learn more on VLAN 124.</div>
<div>
<br /></div>
<div>
Now let's look at the community hosts.</div>
<div>
<br /></div>
<div>
First, re-generating traffic for MAC learning:</div>
<div>
<br /></div>
<div>
<div>
R2#ping 192.168.0.4</div>
<div>
<br /></div>
<div>
Type escape sequence to abort.</div>
<div>
Sending 5, 100-byte ICMP Echos to 192.168.0.4, timeout is 2 seconds:</div>
<div>
!!!!!</div>
<div>
Success rate is 100 percent (5/5), round-trip min/avg/max = 1/2/4 ms</div>
</div>
<div>
<br /></div>
<div>
<div>
R2#show int fa0/0 | i bia</div>
<div>
Hardware is AmdFE, address is 0019.e880.09c0 (bia 0019.e880.09c0)</div>
</div>
<div>
<br /></div>
<div>
<div>
R4#show int fa0/0 | i bia</div>
<div>
Hardware is Gt96k FE, address is 0024.c4eb.ed68 (bia 0024.c4eb.ed68)</div>
</div>
</div>
<div>
<br /></div>
<div>
Those are our MACs for R2 and R4.</div>
<div>
<br /></div>
<div>
How's the table look?</div>
<div>
<br /></div>
<div>
<div>
SW1#show mac address-table | i 0019.e880.09c0 <b>! R2</b></div>
<div>
124 0019.e880.09c0 DYNAMIC pv Fa0/13</div>
<div>
216 0019.e880.09c0 DYNAMIC Fa0/13</div>
<div>
<br /></div>
<div>
SW1#show mac address-table | i 0024.c4eb.ed68 <b>! R4</b></div>
<div>
124 0024.c4eb.ed68 DYNAMIC pv Fa0/13</div>
<div>
216 0024.c4eb.ed68 DYNAMIC Fa0/13</div>
</div>
<div>
<br /></div>
<div>
<div>
SW1#show mac address-table | i 0013.c460.2be0 <b>!</b> <b>R1</b></div>
<div>
402 0013.c460.2be0 DYNAMIC pv Fa0/1</div>
<div>
124 0013.c460.2be0 DYNAMIC Fa0/1</div>
<div>
216 0013.c460.2be0 DYNAMIC pv Fa0/1</div>
</div>
<div>
<br /></div>
<div>
You should notice immediately that we don't have any "BLOCKED" status here. That's because R2 and R4 can talk to each other. Their ports have no affiliation with the isolated VLAN in any fashion, so R2 and R4's MACs simply aren't learned on VLAN 402. We do have some "DYNAMIC pv" entries. The primary VLAN is still used to carry traffic from R1 (promiscuous) to R2 and R4. Therefore, R2 and R4 must be learned on VLAN 124 (as "DYNAMIC pv") so they can receive there, but they can't send on it, or they'd be able to reach hosts outside their community. Likewise, R1 is "DYNAMIC pv" on 216 (community VLAN): it can receive frames from the community there, but its replies are forced over 124 (primary VLAN).</div>
<div>
<br /></div>
<div>
With that wrapped up, here's some other cool stuff I did while working on this!</div>
<div>
<br /></div>
<div>
- Pushing private VLANs through a switch that doesn't support them. This is a bad, bad idea. Before I labbed this out this granularly, I thought, sure, why not? But if the switch doesn't know it's a private VLAN, the whole "DYNAMIC pv" and "BLOCKED" MAC learning concepts stop working, and all hell breaks loose. Just don't do it!</div>
<div>
<br /></div>
<div>
- Try putting a regular, non-private, access port up on one of the primary or secondary VLANs. It just doesn't work - you can send frames at the port and the switch just won't learn your MAC and ignores all the traffic.</div>
<div>
<br /></div>
<div>
- Not so exotic on my last bullet, but what about using an SVI as part of a private VLAN? Well for starters, this is totally doable, but only as a promiscuous port. You'd be using the switch as a default gateway instead of a standalone router. The config looks like this:</div>
<div>
<br /></div>
<div>
<div>
interface Vlan124</div>
<div>
ip address 192.168.0.249 255.255.255.0</div>
<div>
private-vlan mapping 216,402</div>
</div>
<div>
<br /></div>
<div>
R6#ping 192.168.0.249</div>
<div>
<div>
<br /></div>
<div>
Type escape sequence to abort.</div>
<div>
Sending 5, 100-byte ICMP Echos to 192.168.0.249, timeout is 2 seconds:</div>
<div>
!!!!!</div>
<div>
Success rate is 100 percent (5/5), round-trip min/avg/max = 1/2/4 ms</div>
</div>
<div>
<br /></div>
<div>
<div>
Trying to make an SVI with vlan 216 or 402 does not work.</div>
</div>
<div>
<br /></div>
<div>
Another oddity:</div>
<div>
<br /></div>
<div>
<div>
SW1#show int vlan124 | i bia</div>
<div>
Hardware is EtherSVI, address is 0014.1ceb.f641 (bia 0014.1ceb.f641)</div>
<div>
<br /></div>
<div>
SW1#sh mac address-table | i 0014.1ceb.f641</div>
</div>
<div>
SW1#</div>
<div>
<br /></div>
<div>
Well that's not something you see every day...</div>
<div>
<br /></div>
<div>
Happy Studying,</div>
<div>
<br /></div>
<div>
Jeff</div>
<div>
<br /></div>
[mini] OSPF Point-to-Multipoint .... Multicast? (posted 2014-01-12)<br />
<br />
I recently took a practice lab and got dinged for points on an OSPF network type question. Without quoting the actual practice lab, the question was referencing a frame-relay link and said something akin to "use an OSPF network type that doesn't elect a DR and multicasts updates".<br />
<br />
I've gotten into a (bad?) habit of just using point-to-multipoint with frame-relay. If you've studied OSPF at a design/granular level, you are probably aware that point-to-multipoint is relatively inefficient. However, it works in all sorts of crazy environments, so I just like using it in labs because of its versatility.<br />
<br />
So even though "point-to-point" would've worked in this environment, I used point-to-multipoint anyway.<br />
<br />
And the answer guide insisted point-to-point was the only viable option.<br />
<br />
I thought about this, and what the heck is the point of "point-to-multipoint non-broadcast" if "point-to-multipoint" doesn't multicast? Also, I <b>know</b> point-to-multipoint auto-discovers neighbors, so how the heck is it not multicasting?<br />
<br />
My lab is R1 -> Frame Relay Switch -> R2<br />
DLCI is R1 102 -> R2 201<br />
<br />
R1:<br />
interface Serial0/0<br />
ip address 192.168.0.1 255.255.255.0<br />
encapsulation frame-relay<br />
ip ospf network point-to-multipoint<br />
no keepalive<br />
clock rate 2000000<br />
frame-relay map ip 192.168.0.2 102 broadcast<br />
no frame-relay inverse-arp<br />
<div>
<br /></div>
<div>
<div>
router ospf 1</div>
<div>
log-adjacency-changes</div>
<div>
network 0.0.0.0 255.255.255.255 area 0</div>
</div>
<div>
<br /></div>
<div>
R2:</div>
<div>
<div>
interface Serial0/0</div>
<div>
ip address 192.168.0.2 255.255.255.0</div>
<div>
encapsulation frame-relay</div>
<div>
ip ospf network point-to-multipoint</div>
<div>
no keepalive</div>
<div>
clock rate 2000000</div>
<div>
frame-relay map ip 192.168.0.1 201 broadcast</div>
<div>
no frame-relay inverse-arp</div>
</div>
<div>
shutdown</div>
<div>
<br /></div>
<div>
<div>
router ospf 1</div>
<div>
log-adjacency-changes</div>
<div>
network 0.0.0.0 255.255.255.255 area 0</div>
</div>
<div>
<br /></div>
<div>
<div>
R2(config-if)#no shut</div>
<div>
R2(config-if)#</div>
<div>
*Mar 1 00:20:11.075: %OSPF-5-ADJCHG: Process 1, Nbr 192.168.0.1 on Serial0/0 from LOADING to FULL, Loading Done</div>
</div>
<div>
<br /></div>
Clearly, we have automatic neighbor discovery.<br />
<br />
<img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjfq_HvTru9Uf_LAYm31yfxO9uWNFmrMqIDnuWhUPnG1GSEhJHB1GS5q9gRoS3HVrkYnh2JUucYHe-AC5K_LB2waAHx-xnOyI8ZLoWy2HUCvHYxpqw88nnGDaBxaynOqDuRtNYfCQcrv1E/s800/multicast.png" />
<br />
<div class="separator" style="clear: both; text-align: center;">
<br /></div>
Aha! Multicast! I knew it!<br />
<br />
But wait... here comes an update:<br />
<br />
<img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgkx0RGNwWa6-teE9ebGouSqpX1U20UPNxzk38adQmu-F5sETmrHwpvuASWpTNe38tzEc0J3kvfIHtUt99O7ZaU2F_VCN5xiyBksPA0xHXwb_txed9z7u4TVOK6RVle2j9SYwwokQGu2dk/s1600/unicast.png" />
<br />
<br />
So the truth comes out - Hello packets are multicast, hence the automatic neighbor discovery. But updates are unicast, so the lab was correct in docking me. Live and learn!<br />
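<br />
For next time, here is the cheat sheet I should have had in my head, written as a small Python lookup. The point-to-multipoint row reflects the two captures above (multicast hellos, unicast updates); the other rows are the standard behavior of the Cisco OSPF network types as I understand them, and the names <code>NetType</code> and <code>matching_types</code> are just made up for this sketch.<br />

```python
from typing import NamedTuple


class NetType(NamedTuple):
    elects_dr: bool
    multicast_hellos: bool   # True also means neighbors are discovered automatically
    multicast_updates: bool


OSPF_NETWORK_TYPES = {
    "broadcast": NetType(True, True, True),
    "non-broadcast": NetType(True, False, False),
    "point-to-point": NetType(False, True, True),
    "point-to-multipoint": NetType(False, True, False),
    "point-to-multipoint non-broadcast": NetType(False, False, False),
}


def matching_types(elects_dr, multicast_updates):
    """List the network types that satisfy a lab-style requirement."""
    return [
        name
        for name, t in OSPF_NETWORK_TYPES.items()
        if t.elects_dr == elects_dr and t.multicast_updates == multicast_updates
    ]


# "doesn't elect a DR and multicasts updates"
print(matching_types(elects_dr=False, multicast_updates=True))
```

Asked the lab's question, the lookup leaves point-to-point as the only match, which is what the answer guide wanted.<br />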
<br />
Jeff<br />
<br />
[mini] VTY Rotary (posted 2014-01-05)<br />
<br />
I've always found it helps a great deal to have a use-case for a feature. There are thousands of features to learn and be at least somewhat familiar with when attempting the CCIE lab. Remembering them all is a real challenge, but knowing how to apply a feature and why you'd want to use it makes it all that much easier to remember. One of those crazy features is "rotary" when used in conjunction with a VTY line.<br />
<br />
I totally get what it does:<br />
<br />
line vty 0 4<br />
password cisco<br />
login<br />
rotary 1 <br />
<br />
<div>
This config would allow you to telnet to this router on port 23, enter the password "cisco", and get privilege level 1. With "rotary 1", you could also telnet to 3001 and have the same experience. Basically, it would mimic port 23 on port 3001. </div>
<div>
<br /></div>
<div>
<div>
R2#telnet 192.168.0.1 3001</div>
<div>
Trying 192.168.0.1, 3001 ... Open</div>
<div>
<br /></div>
<div>
<br /></div>
<div>
User Access Verification</div>
<div>
<br /></div>
<div>
Password:</div>
<div>
R1></div>
</div>
<div>
<br /></div>
<div>
If you used "rotary 2", you'd be able to telnet to 3002, etc.</div>
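<div>
The port arithmetic is trivial, but for completeness here it is as a tiny Python helper. It only covers the telnet mapping shown above (3000 plus the rotary group number); the function and constant names are hypothetical.</div>

```python
ROTARY_TELNET_BASE = 3000  # "rotary N" answers telnet on 3000 + N, as shown above


def rotary_telnet_port(group: int) -> int:
    """Return the TCP port that reaches lines configured with 'rotary <group>'."""
    return ROTARY_TELNET_BASE + group


print(rotary_telnet_port(1))
print(rotary_telnet_port(2))
```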
<div>
<br /></div>
<div>
That's the nuts and bolts of what rotary does. I am immediately reminded of a quote from Despicable Me:</div>
<div>
<a href="http://www.youtube.com/watch?v=aD4k148YcDU">http://www.youtube.com/watch?v=aD4k148YcDU</a></div>
<div>
<br /></div>
"...because I was wondering, under what circumstances would we use this?"<br />
<br />
I haven't exactly been dying to telnet to my equipment on alternative port numbers.<br />
<br />
Now, I finally understand the use case. It has to do with using different authentication methods on different lines.<br />
<br />
For example:<br />
line vty 0 4<br />
privilege level 1 ! default, but included for clarity<br />
password cisco<br />
login<br />
line vty 5<br />
privilege level 15<br />
password secretpassword<br />
login<br />
<div>
<br /></div>
<div>
We see line 5 has a higher privilege level than lines 0-4. So how do you hit line 5? Well, I suppose you could telnet to the router five times to fill up the first five lines (0-4), then hit it again, but that's not very practical. Not to mention you may not know the password for 0-4, if you're an admin-type logging in to line 5. Enter rotary:</div>
<div>
<br /></div>
<div>
<div>
line vty 5</div>
<div>
privilege level 15</div>
<div>
password secretpassword</div>
<div>
login</div>
<div>
<b> rotary 1</b></div>
</div>
<div>
<br /></div>
<div>
Now when we telnet to port 23:</div>
<div>
<div>
R2#telnet 192.168.0.1</div>
<div>
Trying 192.168.0.1 ... Open</div>
<div>
<br /></div>
<div>
<br /></div>
<div>
User Access Verification</div>
<div>
<br /></div>
<div>
Password: cisco</div>
<div>
R1></div>
</div>
<div>
<br /></div>
<div>
Now when we telnet to port 3001:</div>
<div>
<div>
R2#telnet 192.168.0.1 3001</div>
<div>
Trying 192.168.0.1, 3001 ... Open</div>
<div>
<br /></div>
<div>
<br /></div>
<div>
User Access Verification</div>
<div>
<br /></div>
<div>
Password: secretpassword</div>
<div>
R1#</div>
</div>
<div>
<br /></div>
<div>
Trying each line's respective password on the other's port number produces the expected failure.</div>
<div>
<br /></div>
<div>
That's a simple use case, let's take a more advanced one.</div>
<div>
Let's say you're using lock & key / dynamic ACLs and need *local* auth on one line only. </div>
<div>
<br /></div>
<div>
R1(config)#aaa new-model</div>
<div>
<div>
R1(config)#aaa authentication login default group radius</div>
</div>
<div>
<div>
R1(config)#aaa authentication login LOCKANDKEY local</div>
</div>
<div>
<div>
R1(config)#username LOCK password ANDKEY</div>
</div>
<div>
<br /></div>
<div>
<div>
line vty 0 40</div>
<div>
login authentication default</div>
<div>
<div>
line vty 41</div>
<div>
login authentication LOCKANDKEY</div>
<div>
rotary 1</div>
<div>
autocommand access-enable host</div>
</div>
</div>
<div>
</div>
<div>
The idea here is to use RADIUS for authentication of lines 0-40, and local auth for line 41, to allow your Lock & Key ACL to work. </div>
<div>
<br /></div>
<div>
I didn't actually set up a lock & key ACL or a RADIUS server, but this can get the point across still:</div>
<div>
<br /></div>
<div>
Regular telnet just fails in our case because of the lack of RADIUS servers:</div>
<div>
<div>
R2#telnet 192.168.0.1</div>
<div>
Trying 192.168.0.1 ... Open</div>
<div>
<br /></div>
<div>
% Authentication failed</div>
<div>
<br /></div>
<div>
% Authentication failed</div>
<div>
<br /></div>
<div>
% Authentication failed</div>
<div>
<br /></div>
<div>
[Connection to 192.168.0.1 closed by foreign host]</div>
</div>
<div>
<br /></div>
<div>
However, telnetting to 3001:</div>
<div>
<div>
R2#telnet 192.168.0.1 3001</div>
<div>
Trying 192.168.0.1, 3001 ... Open</div>
<div>
<br /></div>
<div>
<br /></div>
<div>
User Access Verification</div>
<div>
<br /></div>
<div>
Username: LOCK</div>
<div>
Password: ANDKEY</div>
<div>
<br /></div>
<div>
% No input access group defined for FastEthernet0/0.</div>
<div>
[Connection to 192.168.0.1 closed by foreign host]</div>
</div>
<div>
<br /></div>
<div>
The error message is because of the lack of a lock & key ACL, but the proof of concept is the same.</div>
<div>
<br /></div>
<div>
Cheers,</div>
<div>
<br /></div>
<div>
Jeff Kronlage</div>
[mini] BGP Auto-Summary (posted 2013-12-14)<br />
<br />
I recently got a task on a practice lab that was obviously regarding BGP auto summary. I'm well-practiced in BGP on production systems, but who the heck uses auto-summary any longer? It then occurred to me that I'd never even turned it on.<br />
<br />
My first attempt was to:<br />
<br />
int lo5<br />
ip address 5.5.5.5 255.255.255.0<br />
<br />
router bgp 100<br />
auto-summary<br />
network 5.5.5.0 mask 255.255.255.0<br />
<br />
I peered it up with another router, and expected to see "5.0.0.0/8" in the BGP table of the other router.<br />
<br />
No such luck; I ended up with 5.5.5.0/24.<br />
<br />
After some googling, I found two methods to make this work:<br />
<br />
int lo5<br />
ip address 5.5.5.5 255.255.255.0<br />
<br />
router bgp 100<br />
auto-summary<br />
network 5.0.0.0 <br />
<br />
That will produce 5.0.0.0 in both the local BGP table and the BGP table of any router it peers with.<br />
<br />
You can also:<br />
<br />
int lo5<br />
ip address 5.5.5.5 255.255.255.0<br />
<br />
router bgp 100<br />
auto-summary<br />
redistribute connected<br />
<br />
That will also get you 5.0.0.0 in both the local BGP table and the BGP table of any router it peers with.<br />
<br />
Interestingly, if you:<br />
<br />
int lo5<br />
ip address 5.5.5.5 255.255.255.0<br />
<br />
int lo6<br />
ip address 5.5.6.6 255.255.255.0<br />
<br />
router bgp 100<br />
auto-summary<br />
network 5.5.0.0 mask 255.255.0.0<br />
<br />
That will also produce 5.0.0.0/8.<br />
<br />
Not a complex topic, but it works differently than the way IGPs do, and I thought it was worth mentioning.<br />
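<br />
To make the "5.5.5.5 becomes 5.0.0.0/8" step concrete, here is a minimal Python sketch of the classful boundary that auto-summary rolls a prefix up to. The function name <code>classful_network</code> is made up for illustration, and this only models the class A/B/C arithmetic, not IOS's actual route selection.<br />

```python
import ipaddress


def classful_network(address: str) -> str:
    """Return the classful (Class A/B/C) network containing an IPv4 address."""
    ip = ipaddress.IPv4Address(address)
    first_octet = int(ip) >> 24
    if first_octet < 128:        # Class A: 0.0.0.0 - 127.255.255.255
        prefix_len = 8
    elif first_octet < 192:      # Class B: 128.0.0.0 - 191.255.255.255
        prefix_len = 16
    elif first_octet < 224:      # Class C: 192.0.0.0 - 223.255.255.255
        prefix_len = 24
    else:
        raise ValueError(f"{address} is not a Class A, B, or C address")
    # strict=False masks off the host bits instead of raising an error
    network = ipaddress.IPv4Network((int(ip), prefix_len), strict=False)
    return str(network)


print(classful_network("5.5.5.5"))  # 5.0.0.0/8
```

The loopback used in this post sits in class A space, which is why the summary shows up as a /8.<br />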
<br />
Happy studying!<br />
<br />
Jeff<br />
<br />
[mini] PPPoE in the DocCD (posted 2013-11-29)<br />
<br />
I ran across a PPPoE problem a couple days ago, and let me tell you, this is not my favorite topic. I've only used it in production once, and I don't come across it in practice labs enough to keep it fresh in my mind. I've been skipping these questions when doing time-trial practice labs and just using traditional Ethernet whenever this was called for, and just taking a hit on the points. Not a good plan, but I felt there were more important things to focus on.<br />
<br />
One of the other reasons I haven't wanted to focus on it, knowing that I only see it once in a blue moon, is that the documentation is so spread out I could never figure out where all the various pieces are. The lab questions always call for server <strong>and</strong> client installs, which are documented on different pages, with the relevant pieces scattered across each page. <br />
<br />
I decided a good interim step on this problem is to nutshell exactly where the pieces are in the documentation.<br />
<br />
First, you want the Broadband Access Aggregation and DSL Configuration Guide. It's on the main "Configuration" page for 12.4T that you've been going to in the DocCD. See below.<br />
<br />
<img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgYPXQQzkr10HV_uIHTGcFbCP78fsu4j-oriua1A0R1WRfOdpDX7Gz_M_AOCuEYBJqQkeo8lqf1c3r4WQywRw7PiY-A5z0GYoI3Qg4sGNs_r3vG_eAk8pUsAjCP2h-8QOL594FfsNF_obM/s800/image2.png" /><br />
<br />
The next page has a <em>lot of options on it.</em> Fortunately we only need two of them, and they're right on top of each other:<br />
<br />
- PPPoE "server" is on Providing Protocol Support for Broadband Access Aggregation of PPPoE Sessions.<br />
- PPPoE client is on "PPP over Ethernet Client"<br />
<br />
<img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEin-OVjiLecGPM9jW73UXO8FIGzAtBWTgyMpZs8QZTarhjQ0PUnEIyNXaZ6hlBowamoQsdRMjwIL4Nw7xLBo2H3D53FVe4XGkfL0HFrWJklLVodD51XwKQDliVafInXn_mDRKn2nLVuGFk/s800/image3.png" />
<br />
<br />
We'll start on the server side first. "R1" will be our server router. I'm not providing a diagram; it's just two devices connected to each other on Fa0/0.<br />
<br />
You need three sections on the "Providing Protocol Support for Broadband Access Aggregation of PPPoE Sessions" page:<br />
<br />
- "Configuring a Virtual Template Interface"<br />
- "Defining a PPPoE Profile"<br />
- "Assigning a PPPoE Profile to an Ethernet Interface"<br />
<br />
<img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhtXggVj-PGpDOET7Y47VLiFubk5H9UDoeZbdDO_wgG1iAhpmuQmCYfWV80W6c5dOPKaOElvnW-QLCDS3Nm6iXwND_7G13qZmPYREc-ocnigTOwJx40sn4Tl3mcZT6NAIScz3wbttFXqHg/s800/image4.png" />
<br />
<br />
I put them in the order I felt they should be done in, so let's start with "Configuring a Virtual Template Interface". Frankly, if you don't know how to do this, it's worth memorizing. It comes up in more places than just PPPoE (PPP over Frame Relay, namely). <br />
<br />
<img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjS7i128KUwGwFMKZAUWTQJ6ptm8ShV6znGyJgKbw5lp8Cu7BrLGFIYmNQzCkzp_cQyDEabDz9JuiAd2jY2MGsfYE866N-LdQBkewOiiI37M_4Y6GMX7xuGFXbySjcu2ULFkclBbz-9atc/s800/image5.png" />
<br />
Let's apply the necessary pieces as we walk through this:<br />
<br />
R1:<br />
R1(config)#interface virtual-template 1<br />
R1(config-if)#ip address 192.168.1.1 255.255.255.0 ! you don't actually have to use IP unnumbered<br />
R1(config-if)#mtu 1492 ! not really a requirement but a really good idea<br />
R1(config-if)#peer default ip address dhcp-pool TEST-POOL<br />
<br />
To be fair, the "peer default" bit for assigning IP addresses to clients isn't actually in the above documentation snippet, but it is elsewhere on the page if you search for it. It's also not a <em>requirement</em>, you could assign IPs statically.<br />
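<br />
As a side note on the <strong>mtu 1492</strong> line: the number comes from subtracting the PPPoE and PPP overhead from a standard 1500-byte Ethernet payload. A tiny Python sketch of that arithmetic follows; the constant and function names are made up for illustration.<br />

```python
ETHERNET_PAYLOAD = 1500  # bytes of payload in a standard Ethernet frame
PPPOE_HEADER = 6         # PPPoE version/type, code, session ID, and length fields
PPP_PROTOCOL_ID = 2      # PPP protocol field carried inside the PPPoE payload


def pppoe_mtu(ethernet_payload: int = ETHERNET_PAYLOAD) -> int:
    """Largest IP packet that fits in one Ethernet frame once PPPoE/PPP headers are added."""
    return ethernet_payload - PPPOE_HEADER - PPP_PROTOCOL_ID


print(pppoe_mtu())  # 1492
```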
<br />
Next step -<br />
<br />
<img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjI4qa6V10PWHOI4xfectc_aflTw1KjMf42CzNPmGBG1j1IHM9CI4wHE4bhu6bH-qSWIpBGWt9T8gMEaTKIG6fSA7_HZ7xbLoIul9S0O-qBxR9XlbUFzv3WZjLhpG6VPW8Qxnwe3eKXqFI/s800/image6.png" />
<br />
<br />
R1(config-if)#bba-group pppoe global<br />
R1(config-bba-group)# virtual-template 1<br />
<br />
Yep, that's all you really must have to get the bba-group working. Now let's assign it to an interface.<br />
<br />
<img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi_G-9NQ5CeslEeotP0kzyPADlYl0FeDCfA7UT5506disVRdz1JxwUL8eW1ndtpl7ZjoMF_RNqLQV3eEPmM95hRqsmiIr3rqag8FaYCDKaze3QJERfx0MRHWysZbLPR4Hrkiv7vm1taSeY/s800/image7.png" />
<br />
<br />
R1(config)#interface fa0/0<br />
R1(config-if)#pppoe enable<br />
R1(config-if)#no shut<br />
<br />
The <strong>pppoe enable</strong> command will expand to <strong>pppoe enable group global</strong> on its own, if you do a "show run".<br />
<br />
We did reference a DHCP pool up above; we'll need to create that.<br />
<br />
R1(config)#ip dhcp pool TEST-POOL<br />
R1(dhcp-config)#network 192.168.1.0<br />
<br />
That's all - now for the client side. As we saw earlier (same image repeated from above), the client side is directly underneath the "server" side.<br />
<br />
<img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjnUxDKXRH54nJnD8-VZo2hEGpqCyjwR2mefdyDREvTds6DBE6R78mkL3E4QqlgS3__lU-xYSibS0q4ijj29mRxKOAFHpmB0tGz9y_DNbJnAdbHzIuLdkAEIASVMvid6WcqnK6a6HXCZPw/s800/image3.png" />
<br />
<br />
Once you're in there, there are once again many options; however, the two you need are pretty easy to spot. <strong>Note carefully that we are on the "12.2(13)T 12.4T and Later Releases" section</strong>. There's one just above this for pre-12.2(13)T.<br />
<br />
<img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhCKErDI0mtHKhpoSLAVRy5ajg6WIhrVPTHEOULhNICeve6HrHaB8i6DyMO4AYOog8OTCvwCQNK8Cmm5F5dHEzdmAcGZ9S3i4Xp1fz2DV6EFLzxR-PPx8znhLD9NJ78n-V6q9rjJDkRWtw/s800/image8.png" />
<br />
<br />
Configuring the dialer interface first makes more sense, so we'll start there:<br />
<br />
<img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjN_E0LKv8-J2jFWTgruCx8Q1g186SCPG6srbWi1OfdKDTxB_6a19gnGy3Ez6gBY5kCnoJJFk_iEyVYa0IoJ8PNFQIeREOQTSqfMF2aNlmqNExrf7nmi5MsAiARnuLWZ_twlcaG0tNZzbY/s800/image9.png" />
<br />
<br />
R2(config)#int dialer 1<br />
R2(config-if)#mtu 1492<br />
R2(config-if)#encapsulation ppp<br />
R2(config-if)#ip address negotiated<br />
R2(config-if)#dialer pool 1<br />
<br />
<img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiH79xzBI2J46YGQTW_X7XqfRjmmEvlXgSPc2Mp5RHurx4C7WjNXaL-EsFf_kNnyQMm8ys9RWrxE17r4pC1EWNAa1QE2DTo_yW9IjIdXaDyV3P8cNMEd8OalyEJ2AcNDDU_KVN_kIy5b5Q/s800/image10.png" />
<br />
<br />
R2(config-if)#interface fa0/0<br />
R2(config-if)#pppoe-client dial-pool-number 1<br />
R2(config-if)#no shut<br />
<br />
That's <strong>it</strong> - if you did it correctly, you should get output something like this on your client:<br />
<br />
*Mar 1 00:28:51.103: %DIALER-6-BIND: Interface Vi1 bound to profile Di1<br />
*Mar 1 00:28:51.191: %LINK-3-UPDOWN: Interface Virtual-Access1, changed state to up<br />
*Mar 1 00:28:52.235: %LINEPROTO-5-UPDOWN: Line protocol on Interface Virtual-Access1, changed state to up<br />
<br />
R2(config-if)#do sh ip int dialer1 | i Internet address<br />
Internet address is 192.168.1.3/32<br />
<br />
R2(config-if)#do ping 192.168.1.1<br />
Type escape sequence to abort.<br />
Sending 5, 100-byte ICMP Echos to 192.168.1.1, timeout is 2 seconds:<br />
!!!!!<br />
Success rate is 100 percent (5/5), round-trip min/avg/max = 16/23/36 ms<br />
<br />
Cheers,<br />
<br />
Jeff<br />
<br />
[mini] Embarrassing BGP as-override misunderstanding (posted 2013-11-10)<br />
<br />
It can be hard to post on the Internet about dramatically misunderstanding a technology. <br />
<br />
In my defense, I've never worked for an MPLS provider, so I've never used <strong>as-override</strong> outside of a lab - actually I'm not sure I've ever used it in a lab before tonight, either.<br />
<br />
For those unfamiliar with the basic idea, <strong>as-override</strong> is used in MP-BGP/VRF/MPLS scenarios where the customer wants to re-use an AS number on several sites. Since the CE routers peer with the PE routers over eBGP, they see their own AS number in the AS path and reject the update from the PE. <strong>as-override</strong> is the PE mechanism to overcome this problem.<br />
<br />
Let's take a four-router scenario - two CE routers and two PE.<br />
<br />
It might look something like this:<br />
<br />
CE1 (AS 100) -> PE1 (AS 250) -> PE2 (AS 250) -> CE2 (AS 100)<br />
<br />
Clearly, when PE2 advertises CE1's routes to CE2, CE2 should reject them.<br />
<br />
Fixing this on the CE side is very easy; you can change the AS number or use <strong>allowas-in </strong>to allow the CE to ignore the fact that its own AS number is present while receiving BGP updates.<br />
<br />
As a network consultant I regularly deal with MPLS site activations, and twice now I've had the carrier offer to use as-override to fix the problem above. I declined both times, once opting to change the AS number on the CE and once using allowas-in. Because the carrier technician was signed into the PE connected to my CE, I'd gotten the idea that this PE was the only place where <strong>as-override</strong> would go. Boy was I wrong.<br />
<br />
I spent about 90 minutes this evening trying to get <strong>as-override</strong> working in the scenario described above. CE1 would send AS 100 to PE1. PE1 was configured with <strong>as-override</strong> facing CE1, and I expected PE1 to strip out AS 100 as the routes made their way to PE2. Incorrect! <br />
<br />
I'd repeatedly pull up PE2's BGP table:<br />
<br />
PE2#sh ip bgp vpnv4 vrf CCIE | s 1.1.1.1<br />
*>i1.1.1.1/32 192.168.23.2 0 100 0 100 I<br />
<br />
BGP output doesn't paste the best into a non-monospaced document, but in short, it shows the prefix is still learned from AS 100 (the other "100" adjacent to that is the local preference). I sat there scratching my head, wondering how CE2 was going to be able to learn this (quick answer - it can't).<br />
<br />
It turns out <strong>as-override</strong> is not an ingress setting at all. It's an egress setting. All it does is tell the PE it's configured on that, when passing routes to a CE, it should find the CE's AS number in the AS path and replace it with the PE's own AS number.<br />
<br />
In other words, in our scenario:<br />
<br />
CE1 (AS 100) -> PE1 (AS 250) -> PE2 (AS 250) -> CE2 (AS 100)<br />
<br />
If I were to set <strong>as-override</strong> on PE1, that would enable CE1 to receive CE2's routes - not vice-versa.<br />
<br />
CE1(config)#do sh ip bgp | i 2.2.2.2<br />
*> 2.2.2.2/32 192.168.12.2 0 250 250 I<br />
<br />
We see that CE1 sees 2.2.2.2 (CE2's loopback) as going through AS 250 twice, instead of AS 250 followed by AS 100.<br />
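<br />
The find-and-replace behavior can be modeled in a few lines of Python. This is only a sketch of the logic as described above, not how IOS implements it; <code>advertise_to_ce</code> and <code>ce_accepts</code> are names invented for illustration.<br />

```python
def advertise_to_ce(as_path, local_as, ce_as, as_override=False):
    """Model a PE sending a route to an eBGP CE neighbor.

    With as_override, every occurrence of the CE's AS in the path is replaced
    with the PE's own AS before the usual eBGP prepend of the local AS.
    """
    path = list(as_path)
    if as_override:
        path = [local_as if asn == ce_as else asn for asn in path]
    return [local_as] + path


def ce_accepts(as_path, ce_as):
    """eBGP loop prevention: reject any path that already contains our own AS."""
    return ce_as not in as_path


# CE2's route as PE1 holds it: the AS path is just CE2's AS (100).
route_from_ce2 = [100]

plain = advertise_to_ce(route_from_ce2, local_as=250, ce_as=100)
overridden = advertise_to_ce(route_from_ce2, local_as=250, ce_as=100, as_override=True)

print(plain, ce_accepts(plain, 100))            # [250, 100] False
print(overridden, ce_accepts(overridden, 100))  # [250, 250] True
```

The second result matches the "250 250" path in the CE1 output above. For CE2 to accept CE1's routes as well, the same rewrite would have to happen on PE2 facing CE2.<br />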
<br />
Thought this might help others out there stuck on a similar misunderstanding.<br />
<br />
Cheers,<br />
<br />
Jeff<br />
<br />
[mini] Why does LDP "require" a /32 Loopback? (posted 2013-11-07)<br />
<br />
A few days ago I asked a coworker why LDP sessions had issues if they weren't peered on /32s. He answered that it doesn't have to be a /32, but the IGP and LDP have to agree on the mask length. So I asked the more specific question - why do they have to agree on the mask length? He didn't know. And neither did I.<br />
<br />
Everyone seems to know that /32s are best practice for the LDP router ID. But it's hard to find a good, clear explanation of why this is.<br />
<br />
Let's start with some obvious facts.<br />
<br />
- "The router considers all the IP addresses of all operational interfaces.... If these addresses include loopback interface addresses, the router selects the
largest loopback address." <a href="http://www.cisco.com/en/US/docs/ios/12_4t/12_4t2/ftldp41.html#wp1654686">http://www.cisco.com/en/US/docs/ios/12_4t/12_4t2/ftldp41.html#wp1654686</a><br />
<br />
As always, my posts are geared for the CCIE lab, and it's a fair bet most of your gear on the lab is going to have a loopback. So, expect the router ID to be a loopback, unless it's specified otherwise.<br />
<br />
- You can specify the interface with <strong>mpls ldp router-id <interface>. </strong>If you don't want it to be a loopback, or you want a certain loopback to be chosen over another, then use this command. If you want to change the router-id while LDP is already up you have to add the <strong>force</strong> keyword, e.g. <strong>mpls ldp router-id lo7 force</strong>. If you don't use force, and LDP was already online, the change won't take effect until the next time the router has to select a router ID (for example, after a reboot).<br />
<br />
- You can set the range of labels that LDP is allowed to use with <strong>mpls label range <lower> <upper></strong>. I find this useful in debugging, because you can make your labels match your router number and it's easier to read the output. LDP show commands are not always easy to interpret if you're not used to reading them.<br />
<br />
- "The LDP default behavior is to allocate local labels for all non-BGP prefixes." <br />
<a href="http://www.cisco.com/en/US/docs/ios/12_4t/12_4t2/ftldp41.html#wp1654686">http://www.cisco.com/en/US/docs/ios/12_4t/12_4t2/ftldp41.html#wp1654686</a><br />
<br />
So what's that mean to us? It might be better phrased as "The LDP default behavior is to allocate local labels <em>choosing the best administrative distance</em> as long as it's not from BGP".<br />
<br />
- This problem is most commonly seen with OSPF (although you could see it from a summary route as well). The sure-fire way to demonstrate it is to create a /24 loopback and not change the default network type. OSPF automatically uses network type LOOPBACK, which is always advertised as a /32.<br />
<br />
- With MPLS VPNs, BGP actually distributes the labels for the VRFs, not LDP. You learn the stacked VRF tag, relevant only to the egress PE, from BGP. You also learn the global routing table's next hop. The next-hop is used to find out the LDP label.<br />
<br />
Let's take a look at how this plays out.<br />
<br />
R3 is trying to reach R1 in VRF CCIE. R3's IP address is 3.3.3.3 and R1's IP address is 1.1.1.1. R2 is sitting in the middle of the two.<br />
<br />
R3#ping vrf CCIE 1.1.1.1<br />
Type escape sequence to abort.<br />Sending 5, 100-byte ICMP Echos to 1.1.1.1, timeout is 2 seconds:<br />.....<br />Success rate is 0 percent (0/5)<br />
As we can see, ping is failing.<br />
<br />
R3#sh ip route vrf CCIE 1.1.1.1<br />Routing entry for 1.1.1.1/32<br /> Known via "bgp 100", distance 200, metric 0, type internal<br /> Last update from 11.11.11.11 00:26:04 ago<br /> Routing Descriptor Blocks:<br /> * 11.11.11.11 (Default-IP-Routing-Table), from 22.22.22.22, 00:26:04 ago<br /> Route metric is 0, traffic share count is 1<br /> AS Hops 0<br />
We have a route to reach it.<br />
<br />
R3#show ip cef vrf CCIE 1.1.1.1<br />1.1.1.1/32, version 3, epoch 0, cached adjacency 192.168.23.2<br />0 packets, 0 bytes<br /> tag information set<br /> local tag: VPN-route-head<br /> fast tag rewrite with Fa0/0, 192.168.23.2, tags imposed: {200 103}<br /> via 11.11.11.11, 0 dependencies, recursive<br /> next hop 192.168.23.2, FastEthernet0/0 via 11.11.11.11/32<br /> valid cached adjacency<br /> tag rewrite with Fa0/0, 192.168.23.2, tags imposed: {200 103}<br />
<br />
I used the <strong>mpls label range</strong> command (mentioned above) so that each router's labels start with its own router number. In this case, we should be using an MPLS "transit" tag of 200 and an MPLS "VRF" tag of 103.<br />
<br />
R3#show mpls ldp bindings | b 11.11.11.11<br /> tib entry: 11.11.11.11/32, rev 6<br /> local binding: tag: 300<br /> remote binding: tsr: 22.22.22.22:0, tag: 200<br />
<output omitted><br />
<br />
We know that tag 200 references R1's primary routing table loopback IP (11.11.11.11).<br />
<br />
R3#show mpls forwarding-table 11.11.11.11<br />Local Outgoing Prefix Bytes tag Outgoing Next Hop<br />tag tag or VC or Tunnel Id switched interface<br />300 200 11.11.11.11/32 0 Fa0/0 192.168.23.2<br />
<br />
We know that means sending traffic out Fa0/0 towards R2 (192.168.23.2) with tag 200.<br />
<br />
Ok, so this router should be able to send traffic, right?<br />
<br />
R2#debug mpls packet<br />MPLS packet debugging is on<br />
R3#ping vrf CCIE 1.1.1.1 rep 2 timeout 1<br />
Type escape sequence to abort.<br />Sending 2, 100-byte ICMP Echos to 1.1.1.1, timeout is 1 seconds:<br />..<br />Success rate is 0 percent (0/2)<br />
<br />
R2#<br />*Mar 1 00:36:08.651: MPLS: Fa0/1: recvd: CoS=6, TTL=255, Label(s)=0<br />*Mar 1 00:36:09.067: MPLS: Fa0/1: recvd: CoS=6, TTL=255, Label(s)=0<br />
R2 gets the MPLS packet just fine! And that's all it does. Notice my debug doesn't say anything about forwarding it on.<br />
<br />
R2#show mpls ldp binding | b 11.11.11.11<br /> tib entry: 11.11.11.11/32, rev 10<br /> local binding: tag: 200<br /> remote binding: tsr: 33.33.33.33:0, tag: 300<br />
<output omitted><br />
<br />
We see R2 has locally bound tag 200 for 11.11.11.11, and has received a tag from R3 for 11.11.11.11, but ... no tag from R1?<br />
<br />
Let's look at the routing tables.<br />
<br />
R2#sh ip route 11.11.11.11<br />Routing entry for 11.11.11.11/32<br /> Known via "ospf 1", distance 110, metric 2, type intra area<br /> Last update from 192.168.12.1 on FastEthernet0/0, 00:00:02 ago<br /> Routing Descriptor Blocks:<br /> * 192.168.12.1, from 11.11.11.11, 00:00:02 ago, via FastEthernet0/0<br /> Route metric is 2, traffic share count is 1<br />
R2 sees this as a /32.<br />
<br />
R3#sh ip route 11.11.11.11<br />Routing entry for 11.11.11.11/32<br /> Known via "ospf 1", distance 110, metric 3, type intra area<br /> Last update from 192.168.23.2 on FastEthernet0/0, 00:39:16 ago<br /> Routing Descriptor Blocks:<br /> * 192.168.23.2, from 11.11.11.11, 00:39:16 ago, via FastEthernet0/0<br /> Route metric is 3, traffic share count is 1<br />
<br />
R3 sees this as a /32. Consequently, R3 has no problem sending the MPLS packet to R2.<br />
<br />
R1#sh ip route 11.11.11.11<br />Routing entry for 11.11.11.0/24<br /> Known via "connected", distance 0, metric 0 (connected, via interface)<br /> Routing Descriptor Blocks:<br /> * directly connected, via Loopback0<br /> Route metric is 0, traffic share count is 1<br />
And R1 sees it as a ... /24 connected route. As mentioned above, OSPF is the common culprit here. It's advertising a /32 to everyone else, except the local router, which still sees it as a /24. In fact...<br />
<br />
R2#sh mpls ldp binding | b 11.11.11.0/24<br /> tib entry: 11.11.11.0/24, rev 11<br /> remote binding: tsr: 11.11.11.0:0, tag: exp-null<br />
<output omitted><br />
<br />
R1 is advertising a /24 to R2. MPLS bindings work a bit differently than the routing table: R2's LDP process isn't simply going to choose the best route to R1. It matches labels to prefixes, and two prefixes are treated as distinct unless they're identical. So R2 just drops the packet, as it has no <em>binding</em> from R1 that matches the 11.11.11.11/32 prefix it routes on.<br />
<br />
The fix is to just make the two prefix lengths the same. They don't need to be /32s! The easiest way to make this happen in this scenario is to change the OSPF network type away from LOOPBACK and stop forcing the /32 advertisement:<br />
<br />
R1(config)#int lo0<br />R1(config-if)#ip ospf network point-to-point<br />
R2#sh mpls ldp binding | b 11.11.11.0/24<br /> tib entry: 11.11.11.0/24, rev 16<br /> local binding: tag: 203<br /> remote binding: tsr: 11.11.11.11:0, tag: exp-null<br /> remote binding: tsr: 33.33.33.33:0, tag: 305<br /><output omitted><br />
<br />
We can see R2 now has a binding from R1 and R3 that matches the same prefix length.<br />
<br />
R3#ping vrf CCIE 1.1.1.1<br />
Type escape sequence to abort.<br />Sending 5, 100-byte ICMP Echos to 1.1.1.1, timeout is 2 seconds:<br />!!!!!<br />Success rate is 100 percent (5/5), round-trip min/avg/max = 60/66/76 ms<br />
<br />
And forwarding works end-to-end.<br />
<br />
In a nutshell: LDP associates labels with both the IP address and subnet mask. The prefix length does have to match to become part of the same MPLS forwarding path. However, the prefix length does not have to be /32 - it's just a good, safe practice.<br />
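<br />
The exact-match behavior can be sketched as a dictionary lookup keyed on the full prefix (address plus mask). This is a simplified Python model using the prefixes from this post; the function name <code>outgoing_label</code> and the dictionary layout are assumptions for illustration, not LDP's real data structures.<br />

```python
import ipaddress


def outgoing_label(route_prefix, remote_bindings):
    """Return the next hop's label for exactly this prefix, or None if it sent none.

    The lookup is an exact match on address and mask: a covering /24 binding
    does not satisfy a /32 route.
    """
    return remote_bindings.get(route_prefix)


# What R1 advertised to R2 before the fix: a binding for the connected /24 only.
bindings_from_r1 = {ipaddress.ip_network("11.11.11.0/24"): "exp-null"}

# Before the fix, OSPF gives R2 a /32 route to R1's loopback.
before = outgoing_label(ipaddress.ip_network("11.11.11.11/32"), bindings_from_r1)

# After "ip ospf network point-to-point", R2's route is the /24 as well.
after = outgoing_label(ipaddress.ip_network("11.11.11.0/24"), bindings_from_r1)

print(before)  # None
print(after)   # exp-null
```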
<br />
[mini] Static RP Address Blocks auto-RP Dense Flows (posted 2013-10-27)<br />
<br />
My first 40 posts were written while I was attempting to improve my understanding of a number of topics. At this point in my studying, I've moved on to practicing interoperability of features, so I haven't written any new posts in some time. My first posts were between five and twenty page topic deep-dives. Now that I've moved on to review & practice, I'm planning on starting a new series of posts, which I will label with [mini] in front of the subject. These will cover any small problems that really got me stuck while doing practice labs. Same quality as my old posts, but much smaller scope.<br />
<br />
Today, I got stuck on a multicast problem.<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjQAjAMgakY5hX3fQkFpm2FzunILpumb5GFQN94tK0IMPi8KCN26vpOmk6Y7LkZ7E29UqJfk0Rgfc-X1f-_jMuUHnzPKu46vjdc8kzxCtW-WMENXfNZDe5z5q171EQ1dlyaC2BLq0y0qSI/s1600/diagram1.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjQAjAMgakY5hX3fQkFpm2FzunILpumb5GFQN94tK0IMPi8KCN26vpOmk6Y7LkZ7E29UqJfk0Rgfc-X1f-_jMuUHnzPKu46vjdc8kzxCtW-WMENXfNZDe5z5q171EQ1dlyaC2BLq0y0qSI/s1600/diagram1.png" /></a></div>
<br />
I have EIGRP running on every interface, and pim <strong>sparse-dense</strong> mode on every interface.<br />
Every IP address has reachability to every other IP address. The last octet IP on every segment is the router number. Every router has a loopback of Y.Y.Y.Y where Y is the router number.<br />
<br />
I was working a lab for auto-RP. Mapped onto the simpler scenario above, R1 was the mapping agent and R2 was the RP candidate. R3 would then join 239.0.0.1, and R1 would send a ping towards 239.0.0.1 and expect a reply.<br />
<br />
The setup was as follows (remember, PIM sparse-dense and routing are already set up):<br />
<br />
R1:<br />
ip pim send-rp-discovery Loopback0 scope 10 interval 2<br />
<br />
R2:<br />
ip pim send-rp-announce Loopback0 scope 10 interval 2<br />
<br />
R3:<br />
interface FastEthernet0/0<br />
ip igmp join-group 239.0.0.1<br />
<br />
And be damned if I could get the join on R3 to work. I discovered pretty quickly that R3 wasn't learning the dynamic RP address:<br />
<br />
R3#sh ip pim rp mapping<br />
PIM Group-to-RP Mappings<br />
<br />
R3#<br />
<br />
"Well there's your problem!" <br />
<br />
After a lot of digging, I finally noticed some odd output on R2:<br />
<br />
R2#sh ip mroute 224.0.1.40 | b 224<br />
(*, 224.0.1.40), 00:15:00/stopped, RP 2.2.2.2, flags: SJCL<br />
Incoming interface: Null, RPF nbr 0.0.0.0<br />
Outgoing interface list:<br />
FastEthernet0/0, Forward/Sparse-Dense, 00:15:00/00:01:58<br />
<br />
(1.1.1.1, 224.0.1.40), 00:14:47/00:02:57, flags: PLJTX<br />
Incoming interface: FastEthernet0/0, RPF nbr 192.168.12.1<br />
Outgoing interface list: Null<br />
<br />
It's pretty evident that 224.0.1.40 (the mapping agent group) isn't going to reach R3, as the OIL lists "Null", so R3 isn't going to learn the RP address and therefore isn't going to be able to join the group. Let's look closer at that output:<br />
<br />
R2#sh ip mroute 224.0.1.40 | i 224<br />
(*, 224.0.1.40), 00:21:47/stopped, RP 2.2.2.2, flags: SJCL<br />
(1.1.1.1, 224.0.1.40), 00:21:34/00:02:59, flags: PLJTX<br />
<br />
What's up with those flags?<br />
<br />
S=Sparse, P=Pruned ... wait a minute! 224.0.1.40 is supposed to be <strong>dense mode forwarded</strong>.<br />
Just to verify that, look at R1:<br />
<br />
R1#sh ip mroute 224.0.1.40 | i 224<br />
(*, 224.0.1.40), 00:36:03/stopped, RP 0.0.0.0, flags: DCL<br />
(1.1.1.1, 224.0.1.40), 00:29:21/00:02:58, flags: LT<br />
<br />
D=Dense<br />
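<br />
If you find yourself decoding these flag strings a lot, a small helper makes the comparison quicker. The Python below is just a convenience sketch; the letter meanings beyond S, P and D are taken from the legend IOS prints at the top of <strong>show ip mroute</strong>, and only the letters that appear in this post are included.<br />

```python
import re

# Subset of the "show ip mroute" flag legend, limited to letters seen in this post.
MROUTE_FLAGS = {
    "D": "Dense",
    "S": "Sparse",
    "C": "Connected",
    "L": "Local",
    "P": "Pruned",
    "J": "Join SPT",
    "T": "SPT-bit set",
    "X": "Proxy Join Timer Running",
}


def decode_flags(mroute_line):
    """Pull the flag letters off an mroute entry line and expand each one."""
    match = re.search(r"flags:\s*(\S+)", mroute_line)
    if match is None:
        return []
    return [MROUTE_FLAGS.get(flag, f"unknown ({flag})") for flag in match.group(1)]


r2_line = "(*, 224.0.1.40), 00:21:47/stopped, RP 2.2.2.2, flags: SJCL"
r1_line = "(*, 224.0.1.40), 00:36:03/stopped, RP 0.0.0.0, flags: DCL"

print(decode_flags(r2_line))  # ['Sparse', 'Join SPT', 'Connected', 'Local']
print(decode_flags(r1_line))  # ['Dense', 'Connected', 'Local']
```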
<br />
What the heck is R2 up to?<br />
Turns out I hadn't removed some debugging config I'd put in earlier. Leftover config is really nothing new on these types of tasks, but this one struck me as odd:<br />
<br />
R2:<br />
ip pim rp-address 2.2.2.2<br />
<br />
In fact, let's take it out and see what happens:<br />
<br />
R2(config)#no ip pim rp-address 2.2.2.2<br />
R2(config)#exit<br />
R2#sh ip mroute 224.0.1.40 | b 224<br />
(*, 224.0.1.40), 00:00:18/stopped, RP 0.0.0.0, flags: DCL<br />
Incoming interface: Null, RPF nbr 0.0.0.0<br />
Outgoing interface list:<br />
FastEthernet0/1, Forward/Sparse-Dense, 00:00:18/00:00:00<br />
FastEthernet0/0, Forward/Sparse-Dense, 00:00:18/00:00:00<br />
<br />
(1.1.1.1, 224.0.1.40), 00:00:16/00:02:58, flags: LT<br />
Incoming interface: FastEthernet0/0, RPF nbr 192.168.12.1<br />
Outgoing interface list:<br />
FastEthernet0/1, Forward/Sparse-Dense, 00:00:16/00:00:00<br />
<br />
I can't imagine why this behavior is there, but if you have <strong>ip pim rp-address Y.Y.Y.Y</strong> configured, the RP will automatically assume auto-RP groups <strong>originated by other routers </strong>are sparse mode instead of dense, which effectively breaks auto-RP. That makes no sense to me, and it took me almost two hours to go pull this line of config out. I also can't find any documentation on why this behavior happens.<br />
<br />
In a nutshell: Configuring a static RP address on an auto-RP device will stop the device in question from sending auto-RP dense groups to downstream neighbors.<br />
<br />
Cheers,<br />
<br />
Jeff Kronlage<br />
<br />