Thursday, December 27, 2012

A different perspective on CIR, PIR, Tc, Bc & Be

The best topics in my CCIE studies have been the ones where I've experienced a true paradigm shift in my thinking. With this topic, I've had three, and one of those came months after the first, when I thought I was most of the way done typing the first revision of this document. I will do my best to convey all three here and now, and perhaps save someone else the same long journey.

But first, an introduction....

Both policing and shaping are tools for dealing with a service provider handing out a higher-speed physical interface with the understanding that the customer will only use a fraction of that speed.  From the provider's side, this is an ingress control to keep the core and egress links from becoming swamped.  For example, an SP might give out Gig-E interfaces with the understanding that each customer will only use 200 Mbps of it.  If every customer actually used the full 1 Gbps, the edge router, core, or egress routers could easily run out of bandwidth.

Policing is the tool used at the SP side to enforce the traffic policy, and shaping is the tool used at the enterprise edge towards the policer, to conform to the policy.
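To make the two roles concrete before the deep dive, here's a minimal sketch of each side.  The interface names and the 200 Mbps CIR are hypothetical, not from any lab in this post:

! Enterprise edge: shape outbound traffic down to the contracted CIR
policy-map SHAPE-OUT
 class class-default
  shape average 200000000
interface GigabitEthernet0/0
 service-policy output SHAPE-OUT

! Provider edge: police inbound traffic to enforce the same CIR
policy-map POLICE-IN
 class class-default
  police cir 200000000 conform-action transmit exceed-action drop
interface GigabitEthernet0/0
 service-policy input POLICE-IN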

By the way, I've had an argument about this with a couple of people in the past. The CIR CAN equal the line rate of the interface. Marketing nonsense from some ISPs may make you believe otherwise.  In a scenario where CIR equals line rate, you don't need to shape or police, and none of this matters!

Before I delve into how CIR, PIR, Bc, Be, Tc, etc work, I will share with you the first two secrets to understanding all of this.

Saturday, December 22, 2012

Serial Link Compression

At a high-level, there are basically two compression techniques supported on serial links on Cisco routers:

Stacker - Based on the Lempel-Ziv (LZ) algorithm, this method provides the best compression, particularly in the case of varying data, but at the cost of more CPU usage.
Predictor - Attempts to predict the next bits to be transmitted based on history.  Does not compress as well as Stacker, but is lighter on CPU usage.

These two high-level techniques are implemented for payload compression as:
Stacker - Obvious; uses the LZ algorithm mentioned above, and is supported on both PPP and HDLC.  This is the only compression method supported on HDLC.
Predictor - Again, obvious; only supported on PPP.
MPPC - A Microsoft algorithm (Microsoft Point-to-Point Compression), implemented in Cisco devices for interoperability with Microsoft clients.  Cannot be used router-to-router, only router-to-Microsoft OS.  Uses LZ / Stacker compression.  I'm not going to talk about this one further because I have no convenient way to lab it.
LZS - I only know this exists because it's on the menu.  I had a terribly hard time finding details about this, either in my CCIE study guides, the Cisco documentation, or other blogs.  It's obviously LZ-based, but it doesn't seem anyone uses it!  If you have details I'm lacking, please reply to the post.  Also not going to talk about this one any further.
Frame Relay Packet-by-Packet  - Cisco proprietary, uses LZ algorithm, but uses a per-packet dictionary.  Pre-FRF9
Frame Relay Data Stream - Cisco proprietary, uses LZ algorithm, but uses the same dictionary across all packets.  Pre-FRF9
Frame Relay FRF9 - Industry standard; enabled per DLCI or per sub-interface. Uses LZ algorithm.  Requires IETF LMI. 

So here's how we apply all these things.
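As a quick syntax reference first (just a sketch with hypothetical interface numbers, not a full walkthrough):

! Payload compression on PPP (stac also works under HDLC):
interface Serial0/0
 encapsulation ppp
 compress predictor          ! or: compress stac / compress mppc

! FRF9 on a Frame Relay point-to-point subinterface:
interface Serial0/1.102 point-to-point
 frame-relay payload-compression frf9 stac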

Sunday, December 16, 2012

The nitty-gritty of WRED

WRED is such a simple topic at first glance.  It's effective congestion avoidance, and it can be enabled with one command!  If you dig a little deeper, however, there can be quite a lot there.  I've been avoiding doing the WRED deep-dive for a while now, but it finally caught up with me. 

I assume most anyone reading this already understands what WRED does at a high level, so I will only touch on the general idea.  When an interface's transmit (egress) buffer fills up, it goes into tail drop - a state where all newly arriving packets are dropped.  This is bad because, if TCP sessions are running through that interface, the packet loss causes every TCP session caught in the tail drop to shrink its window size and go into slow start.  This behavior is called global synchronization.  It produces a saw-tooth effect on traffic graphs, as all the TCP flows slow down at once, gradually speed back up, hit congestion and packet loss at the same time, and then repeat the slowdown, indefinitely.

RED (random early detection) solves this problem by randomly dropping packets before the transmit buffer fills up.  The idea is that some TCP flows will go into slow start instead of all of them.  Theoretically, tail drop is avoided, and therefore global synchronization is also avoided.  Note that RED/WRED does absolutely nothing for UDP flows: UDP has no transport-layer ACK, so there's no way to know at the UDP level whether packets were lost, and therefore UDP cannot implement slow start at the transport layer.  If you have an entire interface full of UDP traffic, there's no benefit to running RED at all.

Cisco only implements WRED, as opposed to plain RED.  WRED is Weighted Random Early Detection; it takes IP Precedence (the default) or DSCP values into account, so that "less important" flows are dropped more aggressively.

WRED can be implemented in two fashions:
1) On the command line
2) As part of a CBWFQ policy
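
Here's a minimal sketch of both fashions, with hypothetical class and interface names:

! 1) On the command line (interface level), IP Precedence-based by default:
interface Serial0/0
 random-detect

! 2) As part of a CBWFQ policy, switched to DSCP-based WRED:
class-map match-all MYCLASS
 match dscp af21
policy-map CBWFQ-WRED
 class MYCLASS
  bandwidth percent 25
  random-detect dscp-based
interface Serial0/0
 service-policy output CBWFQ-WRED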

Friday, December 7, 2012

BGP Cost Community, EIGRP SoO, and backdoor links

BGP Cost Community (in relation to EIGRP) and EIGRP Site of Origin, or SoO, are two related and somewhat overlapping topics.  The intent of Cost Community is to prevent suboptimal routing and routing loops between EIGRP sites that are (sometimes) separated by MPLS.  Site of Origin is focused more squarely on loop prevention.
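
As a point of reference for later, EIGRP SoO is applied on the PE's CE-facing interface through a route-map.  This is just a sketch with hypothetical names and values, not the lab config:

route-map SET-SOO permit 10
 set extcommunity soo 65001:10
!
interface Serial0/0
 ip vrf forwarding CUST-A
 ip vrf sitemap SET-SOO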

We'll be working off this diagram:

Sunday, December 2, 2012

OSPF PE: Downward bit, Super Area 0, Domain IDs, capability vrf-lite, sham links

This post will bite off quite a lot.  I wanted to write one post that encompassed the entirety of the interaction of using OSPF as a PE to CE routing protocol.

Let me begin by saying... what a disastrously bad idea doing this is.  BGP is the obvious PE to CE routing protocol.  I've never deployed OSPF as a PE to CE protocol in production, but I know someone who has, and he hated it too.  Even the service provider (AT&T) that offered the OSPF option won't let you opt for it any longer if you're a new customer.  The only argument I've heard for using anything besides BGP - that actually made sense - is if you have a great many routers running basic IOS images that don't have BGP as a routing option. 

The reason it's a disastrously bad idea is because it's too ambitious.  To me, it feels like the designers sat down with the concept of converting a large layer 2 frame-relay OSPF network natively to MPLS without having to rethink the OSPF design.  With all the band-aids available, you can keep your area design intact, even if it makes no sense whatsoever in an MPLS world.

In a nutshell, these are the "add-ons" we'll be looking at:
The OSPF Down Bit - designed to prevent loops from forming in an OSPF area that's multihomed to the MPLS backbone.
Super Area 0 - The MPLS network is treated as an "area 0" in and of itself.  This is in case your areas become disconnected from area 0 due to the migration to MPLS.  This way, each area is always attached to area 0.  Can be disabled with "capability vrf-lite".
Sham Links - Creates a control-plane intra-area link over the Super Area 0.  Can be useful for traffic engineering.
Domain-ID (community) - Controls whether routes should be considered inter-area or external.  The domain-id is populated from the OSPF process number by default if it's not specified.  Same Domain-ID = inter-area, different = external.  A different domain-id implies the routes should be treated as coming from a separate OSPF process.
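
Here's a minimal sketch of where these knobs live.  The process numbers, VRF name, and sham-link endpoint addresses are all hypothetical:

! On the PE, per-VRF OSPF process:
router ospf 10 vrf CUST-A
 domain-id 1.1.1.1
 area 0 sham-link 10.255.0.1 10.255.0.2 cost 10

! On a multi-VRF CE that should behave like a plain OSPF router:
router ospf 10 vrf CUST-A
 capability vrf-lite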

This is the diagram we'll be referencing. 

 

Sunday, November 25, 2012

MPLS Tunnel Next-Hop & LDP Filtering

I ran into a rather tricky-to-debug MPLS scenario.

We're going to setup a rather traditional MPLS configuration - two PEs, two CEs, and one BGP-free P MPLS-only router:  The CEs both represent the same customer and will be sharing a VRF. 


As I'm sure you're all already aware, setting up MP-BGP, VRFs, MPLS, etc, is quite a lot of config.  Nonetheless, in order to accurately convey the point I'd like to make, here's the relevant config from each router:

Saturday, November 24, 2012

Automatic v6 over v4 tunnels: ISATAP vs 6to4

Let's compare Cisco's ISATAP and automatic 6to4 tunneling methods.

Here's the diagram:

The square represents an IPv4-only service provider, and the circles are IPv6 islands.  The goal is to have any-to-any IPv6 connectivity between the circles, across the IPv4-only network.
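
Before getting into the details, here's a rough sketch of what each tunnel type looks like.  All the addressing is hypothetical: 192.168.1.1 as the IPv4 tunnel source (which maps to the 6to4 prefix 2002:C0A8:101::/48) and 2001:DB8:1::/64 as an ISATAP prefix.  The exact RA-suppression keyword varies a bit by IOS release.

! 6to4:
interface Tunnel0
 ipv6 address 2002:C0A8:101::1/64
 tunnel source FastEthernet0/0
 tunnel mode ipv6ip 6to4
ipv6 route 2002::/16 Tunnel0

! ISATAP (the host portion of the address is derived from the IPv4 tunnel source):
interface Tunnel1
 ipv6 address 2001:DB8:1::/64 eui-64
 no ipv6 nd suppress-ra
 tunnel source FastEthernet0/0
 tunnel mode ipv6ip isatap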

Thursday, November 22, 2012

MSDP & Anycast RP

Anycast RP is a pretty "easy" topic, but it has a gotcha I'd forgotten about.

Anycast RP allows RP load balancing inside a single AS by configuring the same IP address in multiple spots in the domain, and then announcing that IP as the RP.  Routers wishing to receive their respective multicast feeds simply join toward the closest instance of that RP address, as determined by the IGP metric.

Of course, the multicast sources don't know about all this extra-RP craziness we're doing.  So we need to tie all the RPs together in a full mesh with MSDP.  MSDP shares information about active sources between the RPs.

The config for this is simple enough that I'm not going to bother with a diagram.
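
For reference, here's a minimal sketch of what ends up on each RP.  The addressing is hypothetical: 10.0.0.100 is the shared anycast address, and the MSDP peer statement points at the other RP's unique loopback:

interface Loopback1
 ip address 10.0.0.100 255.255.255.255
 ip pim sparse-mode
!
ip pim rp-address 10.0.0.100
ip msdp peer 2.2.2.2 connect-source Loopback0
ip msdp originator-id Loopback0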

Wednesday, November 21, 2012

Rethinking mroutes; Multicast BGP

I started working on using BGP to distribute multicast routes today.  I've touched on this topic a few other times, and while I "kind of" got the idea, it never really sat well with me.  The problem I've always had, probably stemming from my CCNP studies or early CCIE work, is that I learned mroutes as a way to fix issues with unicast routing.  In other words, if the IGP didn't jive with what the multicast was doing, you could use an mroute to band-aid it.  Layer BGP on top of that logic, and you've got nested confusion. 

Today, I had a paradigm shift on how I thought of mroutes, and it clarified everything for me.  I'm hoping that some of you have the same problem and this may help.

Here's our topology:


All links are 192.168.YZ.0/24; where Y is the low router number, and Z is the high router number.  For example, the link between R2 and R3 is 192.168.23.0/24.  The interface IPs are 192.168.YZ.X, where X is the router number.  R1 has a loopback address of 1.1.1.1/32.

For the moment, ignore the BGP AS numbers.  Pretend this is all one IGP-driven network.

The thing that kept throwing me off is the way PIM incorporates the IGP, or unicast routing table, into its calculations.  Everything feels kind of "backwards".  Unicasting is inherently about where traffic is going, and (to some extent) multicasting is about where traffic came from (think RPF check).  When you slap multicasting on top of a unicast network, without defining mroutes, you end up thinking "backwards".  When you look at a unicast route, you think "I'd send traffic that way".  When you look at a unicast route being used for RPF, you think "I'd receive traffic that way".  The way mroutes are taught initially is as an override for this behavior. 

For example, let's say we're running sparse mode, and R1's Lo0 is the RP.  R4 is generating ping traffic towards 239.1.1.1, and R3 has done an IGMP join on its Fa0/0 interface.  For the moment, let's assume the serial link between R1 and R3 (which is high-speed; call it an OC-12) is in shutdown.  Several processes take place here:

1) R3 will send a join towards 1.1.1.1.  In this case, because of the unicast routing table, R3 sends that join to R2.  R2, again because of the unicast routing table, sends it towards R1.  R1 processes the join and puts its Fa0/0 interface into the OIL for 239.1.1.1.
2) R4's ping is heard by the PIM process running on the local router.  R4 directs that traffic towards the RP, which means sending the traffic towards R1, because of the unicast routing table.  R1 processes the traffic, adds the (S,G) of (192.168.14.4, 239.1.1.1), and sends it towards R3 via R2, because of the unicast routing table.
3) R3 hears the (192.168.14.4, 239.1.1.1) traffic, and initiates an SPT join towards 192.168.14.4, via R2 and R1, because of the unicast routing table.

We've got a gob of RPF checks here, and a great deal of forward routing, too.  For example, the join would never reach from R3 to R1 if we were just thinking in terms of RPF.

Now, for a moment, let's say the serial link is turned on and advertised into the IGP.  Since the Fast Ethernet links are all 100 Mbps and the serial link is roughly 600 Mbps, the serial link becomes the preferred path.  However, whoops, we didn't enable PIM on the serial link!  Now we are screwed.  Looking at the steps above again:
1) R3 will send the join towards 1.1.1.1 via the serial link, no such luck, no PIM here!
2) R4 would still reach the RP OK, but the RP would try to reach R3 via the serial link; no PIM here.
3) We can't even contemplate this step because steps 1 and 2 failed.

Now, a basic understanding of mrouting tells us we can fix this with some static mroutes:
R1:
  ip mroute 192.168.23.3 255.255.255.255 192.168.12.2

R3:
  ip mroute 1.1.1.1 255.255.255.255 192.168.23.2

And boom, our problem is solved.

That's how I thought of mroutes up until attempting to apply BGP to them.  My brain kept saying "how can I apply targeted/spot repairs with a routing protocol?".  That's where it all broke down.  It's difficult to think of a routing protocol in the same sense you think of "fix it" static routes.  We've all put that goofy unicast static route into production - the one you wish wasn't there for cleanliness.  It's pointing at some VPN tunnel on some firewall you can't run RRI on, and there's no way to get the route into your IGP without just defining the damned thing statically.  Now, again, imagine trying to fix that with BGP.  Ugh, my brain hurts.  And this is where I ended up turning my thinking around.

First things first: DROP THE IGP.  You can route an entire multicast configuration without having a single IGP or static unicast route in the network.

Same topology, dump the IGP.  We're not going to do any unicast here at all.  Assume PIM sparse-mode is enabled everywhere.  Let's build the multicast topology as if we were statically routing any traditional unicast network.

R1:
 ip mroute 192.168.23.0 255.255.255.0 192.168.12.2

R2:
 ip mroute 192.168.14.0 255.255.255.0 192.168.12.1
 ip mroute 1.1.1.1 255.255.255.255 192.168.12.1

R3:
 ip mroute 192.168.12.0 255.255.255.0 192.168.23.2
 ip mroute 192.168.14.0 255.255.255.0 192.168.23.2
 ip mroute 1.1.1.1 255.255.255.255 192.168.23.2

R4:
 ip mroute 192.168.12.0 255.255.255.0 192.168.14.1
 ip mroute 192.168.23.0 255.255.255.0 192.168.14.1
 ip mroute 1.1.1.1 255.255.255.255 192.168.14.1

No IGP, but everything works.  This takes the mroute out of the role of bandaid, and into the role of controlling the network.  The first thing I'd like to point out is we're not just "fixing RPF failures" here, but we control bi-directional communication, in a way.  For example, R3 can locate the RP via the mroutes.  This is a very "forward" behavior.

(Let's not forget if you're using pings to test, you'll need to use "debug ip icmp" on R3.  It can't reply because there are no unicast routes.)

This gave me the feeling of using mroutes as the primary workings of the network, and PIM's interworking with the IGP as more of a "backup" strategy.  "If I have no mroute, turn to the IGP's tables". When you start using this logic, replacing the above static routes with BGP makes complete sense!  This was my "aha!" moment.

Let's turn the same strategy used above into BGP.  Again, the serial link is in shutdown (I used it entirely for the example of how to break things above, so it's basically off from here on in). 

R1:
 router bgp 100
  neighbor 192.168.14.4 remote-as 100
  neighbor 192.168.12.2 remote-as 200

  address-family ipv4 multicast
   neighbor 192.168.14.4 activate
   neighbor 192.168.14.4 next-hop-self
   neighbor 192.168.12.2 activate
   network 1.1.1.1 mask 255.255.255.255
   network 192.168.14.0 mask 255.255.255.0
   network 192.168.12.0 mask 255.255.255.0

R4:
 router bgp 100
  neighbor 192.168.14.1 remote-as 100

  address-family ipv4 multicast
   neighbor 192.168.14.1 activate
   neighbor 192.168.14.1 next-hop-self ! not really necessary here, but I'll explain below
   network 192.168.14.0 mask 255.255.255.0

R2:
 router bgp 200
  neighbor 192.168.12.1 remote-as 100
  neighbor 192.168.23.3 remote-as 200
  neighbor 192.168.23.3 next-hop-self
 
 address-family ipv4 multicast
  neighbor 192.168.23.3 activate
  neighbor 192.168.23.3 next-hop-self
  network 192.168.12.0 mask 255.255.255.0
  network 192.168.23.0 mask 255.255.255.0

R3:
 router bgp 200
  neighbor 192.168.23.2 remote-as 200
 
  address-family ipv4 multicast
   neighbor 192.168.23.2 activate
   neighbor 192.168.23.2 next-hop-self  ! not really necessary here, but I'll explain below
   network 192.168.23.0 mask 255.255.255.0

A bit wordier than the static configuration, but at least it's a dynamic protocol.  And now it makes sense.  Stop looking at mroutes as a patch for a problem, look at them as the "right" answer in their own right, and it all falls into place.

One thing to look out for is that multicast BGP won't recurse onto other multicast BGP routes.  So you can't count on, for example, R3 being able to reach 1.1.1.1 just because it knows how to reach 192.168.12.0 via R2.  Good use of next-hop-self on the iBGP sessions is necessary.  eBGP still rewrites the next hop to itself by default, as usual.

Hope you enjoyed...

Jeff Kronlage

Tuesday, November 20, 2012

Multicast Equal Cost Multipathing (ECMP)

Imagine a PIM sparse-mode scenario where multiple senders are sending to two different groups, and you have multiple equal-cost paths to receive the traffic on, but PIM, by default, always picks the PIM neighbor with the highest IP address and sends every join up that one path.

How can you get some load sharing?

Here's our diagram:
 
R1 is pinging 239.1.1.1 and R2 is pinging 239.1.1.2.  R3 is the RP.  R5 is joined to 239.1.1.1 and 239.1.1.2. 

We have two equal-cost paths between R3 and R4, and we want to use one for 239.1.1.1 and the other for 239.1.1.2, headed towards R5.  Right now Fa0/1, whose neighbor has a higher IP address than Fa0/0's neighbor, is getting all the traffic:

R4(config)#do sh ip mroute
<output omitted>

(192.168.13.1, 239.1.1.1), 00:00:19/00:02:47, flags: JT
  Incoming interface: FastEthernet0/1, RPF nbr 192.168.234.3
  Outgoing interface list:
    FastEthernet1/0, Forward/Sparse, 00:00:19/00:02:40

<output omitted>

(192.168.23.2, 239.1.1.2), 00:00:19/00:02:46, flags: JT
  Incoming interface: FastEthernet0/1, RPF nbr 192.168.234.3
  Outgoing interface list:
    FastEthernet1/0, Forward/Sparse, 00:00:19/00:02:40

ip multicast multipath is the answer.

R4(config)#ip multicast multipath
R4(config)#do sh ip mroute
<output omitted>

(192.168.13.1, 239.1.1.1), 00:05:39/00:02:56, flags: JT
  Incoming interface: FastEthernet0/1, RPF nbr 192.168.234.3
  Outgoing interface list:
    FastEthernet1/0, Forward/Sparse, 00:05:39/00:02:13

<output omitted>

(192.168.23.2, 239.1.1.2), 00:05:39/00:02:55, flags: JT
  Incoming interface: FastEthernet0/0, RPF nbr 192.168.34.3
  Outgoing interface list:
    FastEthernet1/0, Forward/Sparse, 00:05:39/00:02:10

Not too difficult.  It uses a hash to achieve this, which I'm frankly not interested enough to look into, because on the IOS version I'm using, you can't change the hash anyway.  It looks like starting in IOS 15 you can make some modifications to it.
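
For reference, the IOS 15 knobs I'm referring to look something like the following; I haven't labbed them here, so treat the exact keywords as an assumption on my part:

ip multicast multipath s-g-hash basic
! or
ip multicast multipath s-g-hash next-hop-based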

Also, one catch here is this only balances by source.  If you want to balance by group, you'll need to do that with your RP assignments, sending some traffic to one RP and other traffic to a different RP.

Enjoy

Jeff Kronlage

Monday, November 19, 2012

IP Multicast Boundary

In this post we'll take a look at the ip multicast boundary interface-level command.  This command's function isn't hard to understand - it references a standard or extended access list, and either permits or denies multicast traffic through the interface.  It can also optionally manipulate auto-rp to remove groups you don't want advertised downstream.


The subnets are on the diagram.  Every router is using its router number as the last octet; every router also has a loopback of X.X.X.X, where X is the router number.  Every router is running EIGRP on every interface, and PIM sparse-dense mode on every interface. 

R1 is set up for auto-rp announcement & discovery via its Loopback0 address:

ip access-list standard GRL
 permit 232.0.0.0 7.255.255.255
 permit 239.0.0.0 0.255.255.255
ip pim send-rp-announce Loopback0 scope 5 group-list GRL
ip pim send-rp-discovery Loopback0 scope 16

We can verify that R3 is receiving the mappings for these groups:

R3#show ip pim rp mapping
PIM Group-to-RP Mappings
Group(s) 232.0.0.0/5
  RP 1.1.1.1 (?), v2v1
    Info source: 1.1.1.1 (?), elected via Auto-RP
         Uptime: 00:07:33, expires: 00:02:13
Group(s) 239.0.0.0/8
  RP 1.1.1.1 (?), v2v1
    Info source: 1.1.1.1 (?), elected via Auto-RP
         Uptime: 00:07:33, expires: 00:02:13

Let's have R3's Lo0 interface join 239.1.1.1 and 224.1.1.1:

interface Lo0
 ip igmp join-group 239.1.1.1
 ip igmp join-group 224.1.1.1

And ping them from R1:

R1#ping 239.1.1.1
<output omitted>
Reply to request 0 from 192.168.23.3, 72 ms
R1#ping 224.1.1.1
<output omitted>
Reply to request 0 from 192.168.23.3, 112 ms
Now let's see if we can use R2 to selectively filter 239.0.0.0/8 from reaching R3:

R2:
ip access-list standard blockthings
 deny   239.0.0.0 0.255.255.255
 permit any
interface FastEthernet0/1
 ip multicast boundary blockthings out
R1:
R1#ping 239.1.1.1
Type escape sequence to abort.
Sending 1, 100-byte ICMP Echos to 239.1.1.1, timeout is 2 seconds:
.
R2:
sh ip mroute 239.1.1.1
<output omitted>

(192.168.12.1, 239.1.1.1), 00:03:22/00:01:55, flags: PFT
  Incoming interface: FastEthernet0/0, RPF nbr 192.168.12.1
  Outgoing interface list: Null

It works!

However, on R3, we still think we can join the group:

R3#sh ip pim rp map
PIM Group-to-RP Mappings
Group(s) 224.0.0.0/5
  RP 1.1.1.1 (?), v2v1
    Info source: 1.1.1.1 (?), elected via Auto-RP
         Uptime: 00:00:53, expires: 00:02:05
Group(s) 239.0.0.0/8
  RP 1.1.1.1 (?), v2v1

    Info source: 1.1.1.1 (?), elected via Auto-RP
         Uptime: 00:00:53, expires: 00:02:04
We can fix that as well.

R2:
interface Fa0/1
  ip multicast boundary blockthings filter-autorp

Please note this command is in addition to the prior "blockthings out" statement, not a replacement for it.

R3#sh ip pim rp map
PIM Group-to-RP Mappings
Group(s) 224.0.0.0/5
  RP 1.1.1.1 (?), v2v1
    Info source: 1.1.1.1 (?), elected via Auto-RP
         Uptime: 00:00:04, expires: 00:02:53
... and it's gone from auto-rp, as well.

Now, let's change that access-list on R2 a bit:

no ip access-list standard blockthings
ip access-list standard blockthings
 deny 224.0.0.0 0.255.255.255
 permit any

R2 & R3:
clear ip pim rp-mapping

Now you'd think we'd be able to ping 239.1.1.1 and not 224.1.1.1, right?

R1#ping 224.1.1.1
Type escape sequence to abort.
Sending 1, 100-byte ICMP Echos to 224.1.1.1, timeout is 2 seconds:
.
OK, that was expected.

R1#ping 239.1.1.1
Type escape sequence to abort.
Sending 1, 100-byte ICMP Echos to 239.1.1.1, timeout is 2 seconds:
.
Uh-oh.

R3#sh ip pim rp map
PIM Group-to-RP Mappings
R3#
More uh-oh.

So what happened?  We just blocked auto-rp.  Let's try this again:

no ip access-list standard blockthings
ip access-list standard blockthings
  permit 224.0.1.40 0.0.0.0
  deny 224.0.0.0 0.255.255.255
  permit any

If our auto-rp mapping agent were behind R2, we'd also want to permit 224.0.1.39.  In fact, it's best if you just permit both all the time if you're using auto-rp, just to be safe.
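
In other words, if I were writing the list for general use with auto-rp, I'd start with something like this (a sketch, not the exact list used above):

ip access-list standard blockthings
 permit 224.0.1.39
 permit 224.0.1.40
 deny   224.0.0.0 0.255.255.255
 permit any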

R1#ping 239.1.1.1
<output omitted>
Reply to request 0 from 192.168.23.3, 144 ms

R3#sh ip pim rp map
PIM Group-to-RP Mappings
Group(s) 239.0.0.0/8
  RP 1.1.1.1 (?), v2v1
    Info source: 1.1.1.1 (?), elected via Auto-RP
         Uptime: 00:00:03, expires: 00:02:53
Much happier.

There are many other things you can do with ip multicast boundary, which I will give a high-level view of here.

If you wanted to block R2 from being able to perform joins to 224.1.1.1 as well, you'd place the boundary on Fa0/0 as "in".  This is probably obvious but the command help is written in such a way that it leaves you scratching your head.

R2:
int lo0
 ip igmp join-group 224.1.1.1

R1#ping 224.1.1.1
<output omitted>
.
The interesting thing is the way this shows up on the mroute table on R2:
R2(config-if)#do sh ip mroute 224.1.1.1
<output omitted>

(192.168.12.1, 224.1.1.1), 00:02:03/00:01:12, flags: PT
  Incoming interface: FastEthernet0/0, RPF nbr 192.168.12.1
  Outgoing interface list: Null

The stream is allowed on to R2, but R2 won't allow Lo0 to be added to the OIL (Outgoing interface list).

Labbing up the extended access lists version of this can get rather tricky.  Tricky to the point where I really hope this never shows up on the lab.

In short, you can filter by (S,G) instead of just by group.  The format is:
permit ip <source ip> <source wildcard> <group address> <group wildcard>

The catch is, you have to consider traffic from the RP, after the SPT join, etc.  Even the joins can have issues.  It's difficult to come up with a "clean" way of showing how to make this work.
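
Just to make the format concrete, here's a rough illustration (not labbed above; the source 192.168.12.1 is only a hypothetical sender).  It permits one specific (S,G) and blocks the rest of 239.0.0.0/8:

ip access-list extended SG-FILTER
 permit ip host 192.168.12.1 host 239.1.1.1
 deny   ip any 239.0.0.0 0.255.255.255
 permit ip any any
interface FastEthernet0/1
 ip multicast boundary SG-FILTER out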

...Here's hoping we all dodge that bullet.

Jeff Kronlage

Sunday, November 18, 2012

PIM Assert

The PIM Assert is an election process that prevents multiple senders on a broadcast media from replicating the same multicast stream on to the wire. The scenario this can happen in is somewhat specific; I've only been able to think of a way to create it using PIM dense mode. 

Here's our topology:

The "top" segment is IPed 192.168.234.X, where X is the router number.  The "bottom" segment is IPed 192.168.123.X, where X is the router number. R4 has a loopback address of 4.4.4.4/32.  All routers are running EIGRP on all interfaces.

R4 is our receiver and R1 is our transmitter.  PIM dense mode is running on all interfaces except FastEthernet0/0 on R1 (R1 is not running any type of multicast routing protocol).  R4's lo0 has IGMP join-group configured for 239.1.1.1:

interface Loopback0
 ip address 4.4.4.4 255.255.255.255
 ip pim dense-mode
 ip igmp join-group 239.1.1.1
When R1 pings 239.1.1.1, the packet hits both R2 and R3.  This being dense mode, both R2 and R3 forward the packet on to R4.  This is obviously inefficient, and creates duplicate packets.  During this process, R2 and R3 will hear each other's packets, and will start trying to sort the situation out.

There's a catch here when using a router to generate the multicast traffic, and there's a reason I set the lab up this way.  In order to trigger a PIM Assert, which we are about to see, the (S,G) has to match exactly on the packets egressing both R2 and R3.  If the links between R1 and R2, and R1 and R3, were any type of point-to-point link, the traffic would simply be duplicated and the assert would never happen.  IOS's implementation for creating multicast traffic is to source it off every router interface and send it out every interface.  Even if you use "ping 239.1.1.1 source Lo0", the source part of the command is completely ignored - the packets always originate from their respective interfaces.  If you're pinging out two separate interfaces, you have two separate (S,G) entries, and the traffic is duplicated.  If you've only got one interface, pointed at a broadcast medium, you end up with one (S,G), and this lab is possible.

So, going back to our Assert between R2 and R3...  R2 and R3 both send an Assert packet at one another, saying why they should be the ones sending the traffic.  The winner is decided by:
- Lowest Administrative Distance (AD) back to the source
- In a tie, best metric value
- In a tie, highest IP address

Our routers both have the same AD (Internal EIGRP, 90) and metric.  In this case, R3 is always chosen for highest IP address, which we can see in this excerpt from show ip mroute from R3:

(192.168.123.1, 239.1.1.1), 00:00:03/00:02:56, flags: FT
  Incoming interface: FastEthernet0/0, RPF nbr 0.0.0.0
  Outgoing interface list:
    FastEthernet0/1, Forward/Dense, 00:00:03/00:00:00, A
The red "A" indicates it's the assert winner.

We're clearly not going to be able to fiddle with the metric or the AD back to the source in this case, as both routers have connected interfaces.  This isn't the way I originally planned this lab, but after modifying it repeatedly due to the source-interface problem described above, we're going with what we've got!  So, let's play with the IP addresses and watch the magic:

R2:
int Fa0/1
  no ip address 192.168.234.2 255.255.255.0
  ip address 192.168.234.222 255.255.255.0

R3:
clear ip mroute *

R1:
ping 239.1.1.1

....

R2#sh ip mroute 239.1.1.1
IP Multicast Routing Table
Flags: D - Dense, S - Sparse, B - Bidir Group, s - SSM Group, C - Connected,
       L - Local, P - Pruned, R - RP-bit set, F - Register flag,
       T - SPT-bit set, J - Join SPT, M - MSDP created entry,
       X - Proxy Join Timer Running, A - Candidate for MSDP Advertisement,
       U - URD, I - Received Source Specific Host Report,
       Z - Multicast Tunnel, z - MDT-data group sender,
       Y - Joined MDT-data group, y - Sending to MDT-data group
Outgoing interface flags: H - Hardware switched, A - Assert winner
 Timers: Uptime/Expires
 Interface state: Interface, Next-Hop or VCD, State/Mode

(*, 239.1.1.1), 00:00:03/stopped, RP 192.168.234.4, flags: S
  Incoming interface: FastEthernet0/1, RPF nbr 192.168.234.4
  Outgoing interface list:
    FastEthernet0/0, Forward/Dense, 00:00:03/00:00:00

(192.168.123.1, 239.1.1.1), 00:00:03/00:02:56, flags: T
  Incoming interface: FastEthernet0/0, RPF nbr 0.0.0.0
  Outgoing interface list:
    FastEthernet0/1, Forward/Dense, 00:00:03/00:00:00, A
And there it is, now on R2!

Cheers,

Jeff Kronlage

Saturday, November 17, 2012

BGP Capability ORF

ORF - Outbound Route Filtering - is not a hard concept to grasp, but I hadn't actually seen it before this, and it's a fantastic idea.

Anyone who's been a BGP admin is familiar with prefix filtering on the "customer edge" side.  The real-world example is that service providers normally only offer a handful of options for receiving the BGP table from them: Full routes, no routes (a default), and connected customers + a default.  Normally the last two are used for customer edge routers that have limited CPU or RAM and don't have the capacity to store and parse the entire BGP table. 

A common solution from the customer edge side - one I've personally implemented - is to take the entire BGP table and filter it down with a prefix list to what the CE actually wants to keep in memory.  This works fine; however, the PE router still has to send the entire BGP table to the CE router, and the CE router then has to reject a rather large percentage of it.  This is terribly inefficient.

What if you could ask the PE router to only send you the routes you wanted, dynamically?  This is exactly what ORF does.

ORF "sends" a prefix list from the CE to the PE, the PE keeps the prefix list in memory (not in the configuration), and then only transmits that prefix list to the CE.

The configuration is simple:

PE:
router bgp 1
 no synchronization
 bgp log-neighbor-changes
 network 1.1.1.1 mask 255.255.255.255
 network 2.2.2.2 mask 255.255.255.255
 network 3.3.3.3 mask 255.255.255.255
 network 4.4.4.4 mask 255.255.255.255
 network 5.5.5.5 mask 255.255.255.255
 neighbor 192.168.12.2 remote-as 2
 neighbor 192.168.12.2 capability orf prefix-list receive
 no auto-summary
CE:
ip prefix-list someroutes seq 5 permit 2.2.2.2/32
ip prefix-list someroutes seq 10 permit 4.4.4.4/32
router bgp 2
 no synchronization
 bgp log-neighbor-changes
 neighbor 192.168.12.1 remote-as 1
 neighbor 192.168.12.1 capability orf prefix-list send
 neighbor 192.168.12.1 prefix-list someroutes in

 no auto-summary
One really nice thing about this config is that even if the PE doesn't support the method, you still get the filtering (via traditional CE-side prefix filtering).

Now obviously, the filtering happens on the CE one way or the other.  So how do you verify this is working?

PE#sh ip bgp neighbors 192.168.12.2 | s capabilities
  Neighbor capabilities:
    Route refresh: advertised and received(old & new)
    Address family IPv4 Unicast: advertised and received
  AF-dependant capabilities:
    Outbound Route Filter (ORF) type (128) Prefix-list:
      Send-mode: received
      Receive-mode: advertised

PE#sh ip bgp neighbors 192.168.12.2 advertised-routes
BGP table version is 6, local router ID is 5.5.5.5
Status codes: s suppressed, d damped, h history, * valid, > best, i - internal,
              r RIB-failure, S Stale
Origin codes: i - IGP, e - EGP, ? - incomplete
   Network          Next Hop            Metric LocPrf Weight Path
*> 2.2.2.2/32       0.0.0.0                  0         32768 i
*> 4.4.4.4/32       0.0.0.0                  0         32768 i

 
There's our prefix filtering, now on the PE router!

Jeff Kronlage

BGP TTL Security

I had actually never labbed TTL Security before today, and I got "good and stuck" for a while on the mechanics.  One of the items that baffled me is that I could see how it was good for an enterprise, but I didn't realize the service provider has to play along, or it's essentially useless.  Here's our diagram:



In this case, R1 is the service provider core router, and R4 is the customer.

 
So what's the threat?

Thursday, November 15, 2012

Using Extended Access Lists as a Substitute for Prefix Lists

I've known this feature was out there for a long while now, but my brain has just rejected learning it. 

Let's say you get a lab task that has one of the two following requirements:
1) Filtering by prefix size, but don't use a prefix list
2) Filtering by prefix size and arbitrary bits in the prefix

Neither of these has any real-world purpose, unfortunately (fortunately?).

So let's take this prefix list and turn it into an extended access list:
ip prefix-list prefixmatch permit 10.5.0.0/16 ge 18 le 24

So, just to recap basic prefix-list matching, this would match anything in 10.5.0.0/16 that has a prefix length of /18 through /24.  So, these would match:

10.5.40.0/24
10.5.40.0/20
10.5.100.0/18

These would not match:

10.5.40.0/26
10.5.0.0/17
10.6.40.0/20

To replicate this match in an extended access-list, the following format is used:
[permit|deny] ip [prefix] [prefix wildcard] [mask for the GE length] [wildcard extending the mask match up to the LE length]

The prefix and its wildcard are really straightforward (unless you're doing arbitrary binary bit matching).  The GE/LE portion takes some staring at to understand, because you have to do binary matching.

The easy part of the translation looks like this:

ip prefix-list prefixmatch permit 10.5.0.0/16
... is equivalent to...
access-list 100 permit ip 10.5.0.0 0.0.255.255

Now to understand the hard part.
So we're looking to match masks 18 bits (GE) to 24 bits (LE).
GE on an access-list, in this case, is 255.255.192.0.  That part makes sense.  /18 = 255.255.192.0.
Now we already know the second part of the mask must be a wildcard mask.

In order for my brain to wrap around this, I always have to use binary as an intermediary.  The LE wildcard is based off the GE mask, so let's translate the GE to binary first:
255.255.192.0 = 11111111.11111111.11000000.00000000

The LE match needs to specify all the bits between the GE and the LE.  The LE is /24, so translating to binary, we have:
11111111.11111111.11111111.00000000

We need the difference of the two, LE minus GE:
 11111111.11111111.11111111.00000000
-11111111.11111111.11000000.00000000
=00000000.00000000.00111111.00000000

translate your answer back to decimal:
0.0.63.0

Now we can figure out the rest of the solution:
ip prefix-list prefixmatch permit 10.5.0.0/16 ge 18 le 24
... is equivalent to...
access-list 100 permit ip 10.5.0.0 0.0.255.255 255.255.192.0 0.0.63.0

Thanks folks, but I'll stick with prefix lists!

Now just to throw one more curveball, let's try the task that can't be done with prefix lists.
Same prefix list: ip prefix-list prefixmatch permit 10.5.0.0/16 ge 18 le 24
However, this time we want to match subnets that have only even values in the third octet.

access-list 100 permit ip 10.5.0.0 0.0.254.255 255.255.192.0 0.0.63.0

I'm not going to go over the binary math behind the 254 match (there are dozens of posts out there about this already), but it's quite clear this type of arbitrary non-sequential bit match is impossible with a prefix list.
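
In case you're wondering where such a list would actually get applied: one common spot is as a BGP distribute-list, which accepts extended ACLs precisely for this prefix-plus-mask style of matching.  A sketch, with a hypothetical neighbor address:

router bgp 2
 neighbor 192.168.12.1 distribute-list 100 in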

Cheers...

Jeff Kronlage

Wednesday, November 14, 2012

The many maps of BGP

Every time I sit down with BGP for a prolonged period, I get quickly overwhelmed by the quantity of different types of "maps" that can be applied to various parts of the configuration.

I tried to count them this morning, and I came up with:
Suppress-Map
Unsuppress-Map
Inject-Map
Advertise-Map
Attribute-Map
Exist-Map
Non-exist-map
and of course traditional route-maps, which make for a total of 8.

When I sit down to fine-tune summarization, or do a conditional advertisement, I can never remember the term for the map I'm looking for.  So let's do a thorough run-down of what they all do.

Saturday, November 3, 2012

iBGP Route-Reflector Loop Prevention

I've always been a bit foggy on the loop prevention mechanism of a route reflector.  I originally assumed it used some sort of split horizon, but as I've discovered, this is simply not the case. 

We'll be using two topologies here, starting with this simple one:




R1 will be our route reflector, R2 and R3 will be route reflector clients.  The fourth octet in the diagram's IP address is the router number. In addition to the IPs indicated on the diagram, each router has a loopback of X.X.X.X, where X is the router number.

I'm going to peer R1 to R2 and R1 to R3, but not R2 to R3.
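
For reference, the reflector side of this is just the client statements on R1.  A minimal sketch, assuming AS 100 and peering on the loopbacks (both of which are my assumptions, not necessarily the lab values):

router bgp 100
 neighbor 2.2.2.2 remote-as 100
 neighbor 2.2.2.2 update-source Loopback0
 neighbor 2.2.2.2 route-reflector-client
 neighbor 3.3.3.3 remote-as 100
 neighbor 3.3.3.3 update-source Loopback0
 neighbor 3.3.3.3 route-reflector-client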

Thursday, November 1, 2012

OSPF max-metric command

The usage of max-metric router-lsa eluded me, so I labbed it up.  Here's our topology:



As usual, X represents the router number. 
 
OSPF is running on all links in the topology, including R4's loopback.  Our test traffic flow will be travelling from R1 to R4.  Traffic has been manually preferred to go R1 -> R2 -> R4:
 
R1:
interface FastEthernet0/0
 ip ospf cost 10000
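
For reference, the command itself lives under the OSPF process.  A minimal sketch, assuming process 1 (applying it on a transit router makes that router advertise maximum link metrics, so other routers avoid it for transit traffic):

router ospf 1
 max-metric router-lsa
! or, only while the router boots and BGP converges:
 max-metric router-lsa on-startup wait-for-bgp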
 

Wednesday, October 24, 2012

RIP Default Information Originate w/ Route-Map

I found an unexpected circumstance with RIP's default-information originate command, when using route-maps.

Here is the reference diagram:


....where X is the router number.

All interfaces are up, and RIPv2 is set up to advertise network 0.0.0.0.  Auto-summary is disabled.

I'd like to inject a default from R1 to R3.  This can be accomplished with this configuration:

route-map just_fa01 permit 10
 set interface FastEthernet0/1


router rip
 default-information originate route-map just_fa01


No issues thus far, this technically works:

R3#sh ip route 0.0.0.0
Routing entry for 0.0.0.0/0, supernet
  Known via "rip", distance 120, metric 1, candidate default path
  Redistributing via rip
  Last update from 192.168.13.1 on FastEthernet0/0, 00:00:00 ago
  Routing Descriptor Blocks:
  * 192.168.13.1, from 192.168.13.1, 00:00:00 ago, via FastEthernet0/0
      Route metric is 1, traffic share count is 1

R2#sh ip route 0.0.0.0
Routing entry for 0.0.0.0/0, supernet
  Known via "rip", distance 120, metric 2, candidate default path
  Redistributing via rip
  Last update from 192.168.23.3 on FastEthernet0/1, 00:00:00 ago
  Routing Descriptor Blocks:
  * 192.168.23.3, from 192.168.23.3, 00:00:00 ago, via FastEthernet0/1
      Route metric is 5, traffic share count is 1

Notice R2 learns the route from R3, as we'd expect.

Of course, the problem is, this gets advertised from R1 -> R3, R3 -> R2, and then from R2 back to R1.  This happens because split horizon doesn't help at R2: R2 learned the route from R3, not from R1, so nothing stops it from advertising the default right back to R1.  R1 will actually accept its own default back, and this loops around the network until a hop count of 16 is reached ("counting to infinity").  Then the route is pulled, a new one is introduced, and the problem starts all over again.

You can see the counting to infinity problem quite easily on any of the routers:

R3#
*Mar  1 00:58:23.807: RT: rip's 0.0.0.0/0 (via 192.168.13.1) metric changed from distance/metric [120/1] to [120/4]
R3#
*Mar  1 00:58:29.891: RT: rip's 0.0.0.0/0 (via 192.168.13.1) metric changed from distance/metric [120/4] to [120/7]
R3#
*Mar  1 00:58:35.971: RT: rip's 0.0.0.0/0 (via 192.168.13.1) metric changed from distance/metric [120/7] to [120/10]
I've described a solution to this problem in another blog post here:
http://brbccie.blogspot.com/2012/10/rip-summarization-null0-routes.html

My answer is to add a static route to null0 for the summarization.  In this case, the summarization is to a default route.  So let's add one and see what happens:

R1(config)#ip route 0.0.0.0 0.0.0.0 null0

The counting to infinity stops:

R3#
*Mar  1 01:00:37.427: RT: add 0.0.0.0/0 via 192.168.13.1, rip metric [120/1]
*Mar  1 01:00:37.431: RT: NET-RED 0.0.0.0/0
*Mar  1 01:00:37.431: RT: default path is now 0.0.0.0 via 192.168.13.1
*Mar  1 01:00:37.431: RT: new default network 0.0.0.0
*Mar  1 01:00:37.435: RT: NET-RED 0.0.0.0/0
*Mar  1 01:01:08.007: RT: NET-RED 0.0.0.0/0
*Mar  1 01:02:08.007: RT: NET-RED 0.0.0.0/0

We have the appropriate route on R3:

R3#sh ip route 0.0.0.0
Routing entry for 0.0.0.0/0, supernet
  Known via "rip", distance 120, metric 1, candidate default path
  Redistributing via rip
  Last update from 192.168.13.1 on FastEthernet0/0, 00:00:00 ago
  Routing Descriptor Blocks:
  * 192.168.13.1, from 192.168.13.1, 00:00:00 ago, via FastEthernet0/0
      Route metric is 1, traffic share count is 1

We have the default on R2, but... hold the phone!:

R2#sh ip route 0.0.0.0
Routing entry for 0.0.0.0/0, supernet
  Known via "rip", distance 120, metric 1, candidate default path
  Redistributing via rip
  Last update from 192.168.12.1 on FastEthernet0/0, 00:00:07 ago
  Routing Descriptor Blocks:
  * 192.168.12.1, from 192.168.12.1, 00:00:07 ago, via FastEthernet0/0
      Route metric is 1, traffic share count is 1

R2 learned that route from R1!  We were expecting it via R3, as before, because R1 has a route-map telling it not to advertise a default towards R2.

The intent of using a route-map with default-information originate is to use either "match ip address", to check for the existence of a specific route as a condition of originating the default, or "set interface", for the reasons we've shown above.  Commonly the two are used together: match route X, then send a default out interface Y.
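
In other words, the commonly intended form looks something like this.  The ACL and the 172.16.0.0/16 route it watches for are hypothetical; the idea is "only originate the default if this route exists, and only send it out Fa0/1":

access-list 10 permit 172.16.0.0 0.0.255.255
!
route-map just_fa01 permit 10
 match ip address 10
 set interface FastEthernet0/1
!
router rip
 default-information originate route-map just_fa01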

My hypothesis for this issue, although I cannot find any reference to it in the Cisco documentation, is that because a default route now exists statically on the originating router, the route-map we applied is deliberately ignored.  I believe the thought process in the router is "why should I check for the existence of a specific route when I clearly have a default in my routing table?".  Without consulting the route-map, it has no idea about the set interface command, and therefore sends the default to all neighbors.

I welcome comments on this!   Does anyone have a documented or more specific reason as to why this is happening?

Jeff Kronlage

Monday, October 22, 2012

RIP Summarization & Null0 Routes

Here's an oddity I came across today with RIP.

This scenario is simple enough I'm not even going to bother creating a diagram.

Take two routers, connect two Ethernet interfaces from each router to the other router.
In our scenario, we'll use:
R1: Fa0/0 192.168.0.1  ->  R2 Fa0/0 192.168.0.2
R1: Fa0/1 10.0.0.1  ->  R2 Fa0/1 10.0.0.2

Saturday, October 20, 2012

OER/PfR Configuration [Part 2 of 2]

We'll be picking up right where we left off in part 1, transitioning our topology from dynamips 7200s to dynamips 3725s.  The 7200s had too many issues running BGP under dynamips, and also had many bugs with OER's interoperability with BGP.  We'll also be using OER v2.1 (12.4(15)T) instead of OER v2.2 (12.4(24)T) for this section.

Our new topology looks very similar to the old, but note that the interface numbers have all changed:

 

Thursday, October 18, 2012

OER/PfR Configuration [Part 1 of 2]

In this two-part document, I will cover my OER/PfR labbing experience, covering static routing, BGP (inbound and outbound), and PBR, including all the things that stumped me for any length of time.

This document assumes a basic understanding of what OER and PfR are intended to accomplish.  If you’d like an introduction, I highly recommend the first hour of Brian Dennis’ PfR video at http://www.ine.com/all-access-pass/training/playlist/ccie-rs-pfr-vseminar/-pfr--vseminar-22200011.html.

I will be covering only the features in 12.4T.  I understand that PfR is much more mature in IOS 15, however, my first goal is to clarify this topic for CCIE studies, and 12.4T is the version presently on the CCIE R&S.

This is the topology I will be referencing for this document: