Friday, December 7, 2012

BGP Cost Community, EIGRP SoO, and backdoor links

BGP Cost Community (in relation to EIGRP) and EIGRP Site-of-Origin, or SoO, are two related and somewhat overlapping topics.  The intent of Cost Community is to prevent suboptimal routing and routing loops between EIGRP sites that are (sometimes) separated by MPLS.  Site of Origin is more focused on loop prevention.

We'll be working off this diagram:

First let's look at what problems we're trying to solve.

1) Suboptimal routing.  In a race condition, if CE3 advertises its routes to PE3, and the routes reach PE1 before PE2, PE2 may end up installing CE3's routes via the backdoor link.  In other words, PE1 advertises the routes to CE1, CE1 to CE2, and CE2 to PE2.  PE2 then redistributes these routes from EIGRP back into BGP.  Cisco routers automatically set the weight of any locally originated prefix to 32768, and weight is normally the first thing BGP compares.  So when PE2 later learns about CE3's routes from PE3, it doesn't install them.  Instead, from then on, it prefers to go the "long way around" via CE2 and CE1.

2) Loops.  Because of the scenario described above, it's actually possible, depending on the BGP configuration, that PE2's route will cycle around to PE1 and be installed there.

BGP Cost Community, which is on by default when EIGRP is used as the CE -> PE protocol, can solve problems #1 and #2.  SoO, which is not in use by default, can solve problem #2, but not necessarily #1.

Let's start with Cost Community.  There's probably a bunch of stuff I could refer to in the RFC that would make my explanation far more detailed, but the bottom line is this: If this feature is enabled, BGP takes the metric from EIGRP and pulls it into the cost community.  It then compares the cost community first, before all other BGP path selection criteria.  This means that weight, AS-PATH, local-pref, etc., mean almost nothing.

There's really not much to say or configure here.  This fixes problem #1 and #2 right out of the box.  Let's break down the steps of how.

1) CE3 advertises its routes to PE3. 
2) PE3 advertises its routes to the MPLS cloud
3) Let's assume worst case scenario and that PE1 gets CE3's routes from PE3 significantly (several seconds) sooner than PE2. 
4) PE1 installs these routes into EIGRP and passes them on to CE1
5) CE1 passes the routes to CE2
6) CE2 passes the routes to PE2
7) PE2, again assuming worst case outcome in a race condition, installs the routes as well,  redistributes them into BGP, and weights them at 32768.
8) PE1 hears the new advertisement from PE2 of CE3's routes.  It does not install them.  PE1 already knows the cost-community of the routes advertised by PE3. This value, based on the EIGRP metric between CE3 and PE3, no matter how large, is guaranteed to be the lowest.  That's because EIGRP's metric is carried in BGP communities across the MPLS cloud, in addition to the cost community, which is carried separately.  When PE1 placed CE3's routes back into EIGRP in step 4, it copied the metric from CE3->PE3 back into the EIGRP route.  Now that original metric, plus the metric from PE1->CE1->CE2->PE2, must always be larger than just the original metric.  Given that a cost-community enabled router will always prefer the lowest cost-community over all other BGP attributes, the original, and correct, route sticks.
9) Eventually, PE2 hears PE3's advertisement of CE3's routes.  Even though it has that nasty 32768 weighted route from "the long way around" via PE1, it immediately installs the route with the lower cost community into the BGP table.  It then redistributes it into EIGRP.  Theoretically (metric dependent), CE2 would then route via PE2 to reach CE3.  If it didn't, the metrics could be tweaked at that point.
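
The decision in steps 8 and 9 can be sketched in a few lines of Python. This is a toy model, not IOS's real best-path code: the Path class, the best_path function, and the backdoor cost of 161280 are all made up for illustration. The only things taken from above are the 32768 weight and the idea that the lowest cost community is compared before everything else.

```python
from dataclasses import dataclass


@dataclass
class Path:
    via: str
    cost: int             # cost community value, copied from the EIGRP metric
    weight: int = 0       # 32768 for locally originated routes
    local_pref: int = 100


def best_path(paths, use_cost_community=True):
    """Pick a winner from candidate paths (heavily simplified).

    With cost community enabled, the lowest cost is compared first and
    weight / local-pref only break ties. Without it, highest weight wins.
    """
    def key(p):
        usual = (-p.weight, -p.local_pref)
        return (p.cost, *usual) if use_cost_community else usual

    return min(paths, key=key)


# Hypothetical view from PE2 in step 9.
from_pe3 = Path(via="PE3 (MPLS)", cost=156160)
from_backdoor = Path(via="CE2 (backdoor)", cost=161280, weight=32768)

print(best_path([from_pe3, from_backdoor]).via)
print(best_path([from_pe3, from_backdoor], use_cost_community=False).via)
```

With the cost community in play the MPLS path wins despite the backdoor path's weight; with it ignored, the weighted backdoor path wins, which is problem #1.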

Even though this all happens automatically, there are a few commands of note regarding this feature.

How do I see it in action?

PE1#show bgp vpnv4 unicast all 3.3.3.3/32 | b Extended Community
      Extended Community: RT:1:1 Cost:pre-bestpath:128:156160 0x8800:32768:0
        0x8801:1:130560 0x8802:65281:25600 0x8803:65281:1500
      mpls labels in/out nolabel/18

(3.3.3.3/32 is CE3's Lo0 address)

156160 is the EIGRP metric from CE3 -> PE3.  Check it out:

PE3#sh ip eigrp vrf VPN_A topology 6.6.6.6/32 | i Composite metric
      Composite metric is (156160/128256), Route is Internal
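
You can also tie the cost community back to the other extended communities in that BGP output.  The sketch below is Python and rests on an assumption of mine: that the second number in 0x8801 is the scaled EIGRP delay and the second number in 0x8802 is the scaled EIGRP bandwidth.  If that's right, then with default K values the two should add up to the composite metric.

```python
import re

# Extended communities copied from the PE1 output above.
ext = "0x8800:32768:0 0x8801:1:130560 0x8802:65281:25600 0x8803:65281:1500"


def parse(text):
    """Return {type: (first_value, second_value)} for each 0x88xx community."""
    out = {}
    for kind, a, b in re.findall(r"(0x88[0-9a-fA-F]{2}):(\d+):(\d+)", text):
        out[kind] = (int(a), int(b))
    return out


fields = parse(ext)
scaled_delay = fields["0x8801"][1]   # assumed: scaled delay
scaled_bw = fields["0x8802"][1]      # assumed: scaled bandwidth
composite = scaled_delay + scaled_bw

print(composite)  # 156160, matching Cost:pre-bestpath:128:156160
```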

About the only other command you need to know is how to turn this behavior off if for some reason you don't want it.

PE1:
router bgp 1
 bgp bestpath cost-community ignore

Let's say you have a lab problem that prevents you from using the cost-community, but requires you to prevent loops.  You could certainly make some elaborate match-and-tag schema, which I've personally done in a scenario similar to this one in the past.  However, that does take a lot of thinking and a lot of typing.  SoO does almost the entire process for you.

I am going to disable BGP cost community for this part of the exercise.  

Let's look at PE1 and CE1, and PE2 and CE2, as diverse sites.  Let's say that backdoor link is a pay-per-MB point-to-point MOE link:


Getting your head around that so far is pretty easy.  I wanted to break the thought process up into chunks.

The scenario we need to prevent in order to solve the looping problem, assuming the same race condition described above, is PE1 receiving CE3's routes first, propagating them via EIGRP across the MOE backdoor link, PE2 receiving them, reintroducing them to the MPLS cloud, and PE1 re-accepting PE2's advertisement, causing a loop or counting-to-infinity problem.

For clarity I will diagram this:

Here, we see half of what I just described:  For whatever reason, PE1 gets the routes a moment or two sooner than PE2.  It sends the routes across the MOE to PE2, which, not having a BGP route yet, accepts this EIGRP route.

You have to use your imagination on this one.  Let's pretend we have a more complex MP-BGP cloud where PE2's re-origination of CE3's routes, now lacking cost community, could be seen as preferable to PE1.  PE3 would ignore this update, but PE1 might take it.  Let's say it does, for this demonstration.




You'd eventually end up with the loop shown above.  This would eventually resolve itself via counting-to-infinity but, depending on timing, it has the potential to start all over again.

We need to do some creative blocking here in order to fix this problem.  As I'd already mentioned, you could absolutely fix this with a mix of BGP communities and EIGRP route tags.  However, this is cumbersome.  SoO does precisely this for us, but without 90% of the configuration.

In a nutshell, SoO is all of the following:
1) A BGP community
2) An EIGRP "route tag" of sorts
3) A filtering mechanism

Once SoO is set, it is automatically converted from BGP community to EIGRP TLV (think of it as an EIGRP route tag, it's easier that way) and vice-versa at the appropriate borders.  The same commands that set the "tags" also check for their tag coming back inbound, and don't allow routes that have them.

Let's update the diagram with these concepts.






Routes leaving Site A will be tagged with 1:1.  Routes leaving Site B will be tagged with 2:2.  Site A will accept Site B's routes, and Site B will accept Site A's routes, but Site A won't accept its own routes back in on the "border" routers.  One of the confusing items here for me was understanding that the CE is in fact a border router. That's why I drew that circle around Site A.  This is all one big EIGRP process between CE1 and CE2, but you want to filter on it because of the problem we've been describing.  Now let's look at what happens during the loop scenario we've been describing.

 
CE3/PE3 send out the routes, PE1 gets them first.  PE1 at this point sets SoO 1:1 on the EIGRP route as it passes it to CE1. CE1 sends it to CE2.  CE2 accepts this route because SoO 1:1 is different than SoO 2:2.  A very important item here is that CE2 does not re-write the tag from 1:1 to 2:2. Tags are only written if they're missing, they are never re-written.






CE2 sends the un-modified route to PE2.  PE2 again, same as CE2 did, does not re-write the tag (now a community) when it redistributes it into BGP.  1:1 is maintained.  PE3 ignores this route as it has a better path via EIGRP.  PE1 receives the route, and...
DENIED!  PE1 refuses to put the route back into the VRF, because the SoO community is set to 1:1 and its own community is also set to 1:1.

Let's look at a similar, yet subtly different, scenario that shows why the CE routers need to participate in SoO.

Let's say the race condition was slightly different.  Perhaps CE2 communicates its EIGRP route to CE1 before PE2.  PE1 learns it from CE1, and redistributes it back to PE2:
 

SoO, when applied to the CEs, fixes this problem, too. 

CE2 marks the route with SoO 2:2.  CE1 allows it in because it doesn't match SoO 1:1, and sends it to PE1.  PE1 redistributes it into BGP (the SoO tag is carried along as a community).  PE2 receives it and refuses to put it back into the VRF, because 2:2 is not allowed back in.  If the CEs were not involved in the SoO process, this loop could theoretically have happened.
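
Both scenarios boil down to two rules: write the tag only if it's missing, and refuse a route carrying your own site's tag.  Here's a minimal Python sketch of just those two rules as I've described them.  The stamp and accepts functions are hypothetical names of mine, and this models the behavior in this post, not IOS internals.

```python
from typing import Optional


def stamp(route_soo: Optional[str], site_soo: str) -> str:
    """Write the site's SoO only if the route has none; never rewrite."""
    return route_soo if route_soo is not None else site_soo


def accepts(route_soo: Optional[str], site_soo: str) -> bool:
    """A border router rejects a route carrying its own site's SoO."""
    return route_soo != site_soo


SITE_A, SITE_B = "1:1", "2:2"

# Scenario 1: CE3's route reaches PE1 first, untagged.
soo = stamp(None, SITE_A)        # PE1 tags it 1:1 on the way to CE1
ce2_ok = accepts(soo, SITE_B)    # CE2 accepts: 1:1 is not 2:2
soo = stamp(soo, SITE_B)         # CE2 and PE2 do not rewrite: still 1:1
pe1_ok = accepts(soo, SITE_A)    # PE1 sees its own 1:1: denied

# Scenario 2: CE2's own route crosses the backdoor before PE2 hears it.
soo2 = stamp(None, SITE_B)       # CE2 tags it 2:2
ce1_ok = accepts(soo2, SITE_A)   # CE1 accepts: 2:2 is not 1:1
pe2_ok = accepts(soo2, SITE_B)   # PE2 sees its own 2:2: denied

print(soo, ce2_ok, pe1_ok)
print(soo2, ce1_ok, pe2_ok)
```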

Let's look at the configuration.  It's a little unusual, but not so bad once you know the commands.

PE1:
route-map soowhat permit 10
 set extcommunity soo 1:1

interface Fa0/1
  ip vrf sitemap soowhat

CE1:
route-map soowhat permit 10
 set extcommunity soo 1:1

interface Fa0/0
   ip vrf sitemap soowhat

PE2 and CE2 are the flipside, just change the soo to 2:2.

So I found this config kind of weird.  Not only is the concept of a vrf sitemap foreign (it doesn't appear to have much use aside from this feature), but why the heck would you apply what appears to be an extended BGP community to an EIGRP-only (no BGP) router?  Also, a "VRF" command on a CE router that isn't running VRFs?  But what the heck, it works, weird or not.

So how do I see this thing in action? 

In EIGRP -

CE1#show ip eigrp top 2.2.2.2/32
IP-EIGRP (AS 1): Topology entry for 2.2.2.2/32
  State is Passive, Query origin flag is 1, 1 Successor(s), FD is 158720
  Routing Descriptor Blocks:
  172.16.13.1 (FastEthernet0/0), from 172.16.13.1, Send flag is 0x0
      Composite metric is (158720/156160), Route is Internal
      Vector metric:
        Minimum bandwidth is 100000 Kbit
        Total delay is 5200 microseconds
        Reliability is 255/255
        Load is 1/255
        Minimum MTU is 1500
        Hop count is 2
      Extended Community: SoO:2:2

In BGP -

PE1#show bgp vpnv4 unicast all 2.2.2.2/32
BGP routing table entry for 1:1:2.2.2.2/32, version 51
Paths: (1 available, best #1, table VPN_A)
Flag: 0x820
  Not advertised to any peer
  Local
    7.7.7.7 (metric 2) from 7.7.7.7 (7.7.7.7)
      Origin incomplete, metric 156160, localpref 100, valid, internal, best
      Extended Community: SoO:2:2 RT:1:1 Cost:pre-bestpath:128:156160
        0x8800:32768:0 0x8801:1:130560 0x8802:65281:25600 0x8803:65281:1500
      mpls labels in/out nolabel/21

You may have noticed I made a point of showing the backdoor link being more costly than the MPLS.  This was my cheesy attempt to kill two birds with one stone, and basically shove another topic into this same post.  Now let's take the scenario that says we want to use the costly MOE for backup only, and use the cheap MPLS as our primary link. 

CE1 & CE2:
Interface FastEthernet0/1
  delay 9999999

This threw me off the first time I saw it.  Wouldn't the EIGRP routes coming from BGP be external, and we'd always be stuck using the backdoor?  NOPE!

Let's look at the metric before the above delay was added.

CE1#sh ip eigrp top 2.2.2.2/32  ! 2.2.2.2 is CE2's Loopback0
IP-EIGRP (AS 1): Topology entry for 2.2.2.2/32
  State is Passive, Query origin flag is 1, 1 Successor(s), FD is 156160
  Routing Descriptor Blocks:
  10.9.0.2 (FastEthernet0/1), from 10.9.0.2, Send flag is 0x0
      Composite metric is (156160/128256), Route is Internal
      Vector metric:
        Minimum bandwidth is 100000 Kbit
        Total delay is 5100 microseconds
        Reliability is 255/255
        Load is 1/255
        Minimum MTU is 1500
        Hop count is 1
  172.16.13.1 (FastEthernet0/0), from 172.16.13.1, Send flag is 0x0
      Composite metric is (158720/156160), Route is Internal
      Vector metric:
        Minimum bandwidth is 100000 Kbit
        Total delay is 5200 microseconds
        Reliability is 255/255
        Load is 1/255
        Minimum MTU is 1500
        Hop count is 2

We've got two EIGRP routes here.  The first one is from across the MOE backdoor.  Route is Internal.  OK, we expected that. The second one is from across the MPLS, redistributed from BGP -> EIGRP.  Route is Internal.  What the heck?

First, a quick recap of EIGRP CE->PE routing.  If your autonomous systems are the same in the different PEs...

router eigrp 1
 !
 address-family ipv4 vrf VPN_A
  autonomous-system 1
 exit-address-family

...between two MPLS-separated EIGRP sites, then the EIGRP routes are re-created as internal routes.  That's right, BGP -> EIGRP redistribution does not create an external route, as it would normally, unless the ASes don't match.

This is key to understanding the idea that it's even possible to prefer the MPLS routes over the "native" EIGRP routes.  We're not battling the higher AD for external routes, because they all appear internal anyway (provided the same ASes were used).
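
As a toy picture of that rule in Python: the function name and arguments are mine, and real IOS makes this call from the EIGRP information carried in the BGP extended communities.  The administrative distances 90 and 170 are EIGRP's defaults for internal and external routes.

```python
from typing import Optional, Tuple


def redistributed_route_type(origin_as: Optional[int], local_as: int) -> Tuple[str, int]:
    """Return (route type, administrative distance) for a BGP -> EIGRP route.

    origin_as is the EIGRP AS the route came from at the remote site,
    or None if the route never was an EIGRP route.
    """
    if origin_as is not None and origin_as == local_as:
        return ("internal", 90)
    return ("external", 170)


print(redistributed_route_type(1, 1))     # same AS on both PEs
print(redistributed_route_type(2, 1))     # mismatched ASes
print(redistributed_route_type(None, 1))  # not originally EIGRP
```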

So, yes, it's just that simple.  Just change the delay on the CEs.

CE1 & CE2:
Interface FastEthernet0/1
  delay 9999999

CE1#sh ip eigrp top 2.2.2.2/32
IP-EIGRP (AS 1): Topology entry for 2.2.2.2/32
  State is Passive, Query origin flag is 1, 1 Successor(s), FD is 158720
  Routing Descriptor Blocks:
  172.16.13.1 (FastEthernet0/0), from 172.16.13.1, Send flag is 0x0
      Composite metric is (158720/156160), Route is Internal
      Vector metric:
        Minimum bandwidth is 100000 Kbit
        Total delay is 5200 microseconds
        Reliability is 255/255
        Load is 1/255
        Minimum MTU is 1500
        Hop count is 2
  10.9.0.2 (FastEthernet0/1), from 10.9.0.2, Send flag is 0x0
      Composite metric is (2560153344/128256), Route is Internal
      Vector metric:
        Minimum bandwidth is 100000 Kbit
        Total delay is 100004990 microseconds
        Reliability is 255/255
        Load is 1/255
        Minimum MTU is 1500
        Hop count is 1

Now the route through the MPLS looks far better than the huge-metric route through the MOE.
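
If you want to check the math, all three composite metrics above fall out of the classic EIGRP formula with default K values.  Here's a small Python sketch.  The eigrp_metric function is my own helper, and I'm assuming the interface delay command is entered in tens of microseconds and that CE2's loopback contributes the default 5000 microseconds.

```python
def eigrp_metric(min_bw_kbit: int, total_delay_usec: int) -> int:
    """Classic EIGRP composite metric with default K values (K1 = K3 = 1)."""
    return 256 * (10**7 // min_bw_kbit + total_delay_usec // 10)


# Before the change: backdoor path, 100 Mbit, 5100 usec total delay.
print(eigrp_metric(100000, 5100))        # 156160

# MPLS path, 100 Mbit, 5200 usec total delay.
print(eigrp_metric(100000, 5200))        # 158720

# After "delay 9999999" (tens of usec) on Fa0/1, plus 5000 usec for the loopback.
backdoor_delay_usec = 9999999 * 10 + 5000
print(backdoor_delay_usec)               # 100004990
print(eigrp_metric(100000, backdoor_delay_usec))  # 2560153344
```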

Testing...

CE1#trace 2.2.2.2
Type escape sequence to abort.
Tracing the route to 2.2.2.2
  1 172.16.13.1 40 msec 24 msec 20 msec
  2 172.16.24.2 [MPLS: Label 19 Exp 0] 40 msec 56 msec 36 msec
  3 172.16.24.4 60 msec *  72 msec

Yep, it goes through the MPLS.

Cheers,

Jeff Kronlage

16 comments:

  1. Thank you Jeff for the clarification. Nicely explained mate.

  2. Amazing the explanation.. Especialy the draw you did.

    Thanks

  3. Ok I have been fighting with my brain all day just because this concept! and just because I didnt read your post on the morning!!!! fuuuu!!.

    Thank you very mucho!

  4. Thanks for the post, have been hitting my head for understand for last two days

  5. what a wonderful explanation !! Thanks !!

  6. Hello,

    Nice write up! What versuion of code were you using? I find that the SoO does not get set when it goes "outbound"...instead, only when received "inbound". When you say "outbound, I set 1:1...inbound I take everything except 1:1". This sounds like it would make sense, but when I have tested it out on several code versions I dont get your results. For example, if PE1 has a route via BGP to CE3, and it redistributes it into EIGRP so that CE1 can receive it, I would think that going to CE1 and checking the topology table should show the prifix with the SoO...but it does not! I find that only when the prefix is RECEIVED via an interface with the sitemap configured, and then it gets redistributed into BGP, I can look at the BGP VPNv4 table and see the SoO.

    Replies
    1. Pablo, for all I know, it may be handled differently in different versions. I used 3725 v12.4(15)T14 on GNS3. I'd be curious to see if you're able to swap to this version and still have problems. I recall labbing this, it was very tricky because there was hardly any documentation available that explained *how* to deploy it, so I fooled with this for some time -- pretty confident my results were correct, at least for my IOS version. Good luck!

  7. Great info Jeff - this has definitely helped to make things far clearer for me (lab date shortly due).
    One thing I would note is that I was not able to see the SoO tag in the output on the CEs, although I could on the PEs.
    For CEs I'm testing with 3750s using 12.2(25)SEE
    For PEs I'm testing with 1841s using 15.1(3)T4

    thanks, Will.

  8. Thank you for the wonderful explanation . I read ine pdf and cisco documentations too, but nothing was as perfect as this !!!

  9. Wow, this was mind blowing, thanks a lot mate! Studying for CCIE SP, and couldn´t find this one anywhere...

  10. Thanks for this simple yet to the point explanation. It is really informative.

  11. excellent write up need to lab this up

  12. Thanks mate for such detailed explanation ! You made this confusing topic easy.
