Tuesday, April 15, 2014

A Thorough Approach for Debugging MPLS L3 VPNs

I recently realized I needed a more organized approach to debugging MPLS L3 VPNs for the troubleshooting section. Drawing on a lot of the practice labs I've taken, I'm going to give a run-down of what I think is the fastest way to track down any problem.

First let's run down my list, then we'll pick it apart with an example below.

I'm going to assume the run-of-the-mill validation of "Host A needs to be able to ping host B".
Since we're talking high-level in the first segment, and with MPLS VPNs we're always talking about a sender and a receiver, I need a way to reference direction. I'll call the sender that's unable to reach the receiver the originating router, and the side that cannot be reached the terminating router.

Before you start debugging...

1) Validate the problem: ping <problem IP>
2) Find out if the problem is unidirectional.  Run "debug ip icmp" on both the source and the destination.  Ping both ways.  If you're taking an INE lab, be sure logging is on too: logging con 7 and logging on
3) From the originating router, run "sh ip route <problem IP>" and "sh ip cef <problem IP>".  Sometimes some other route in the table is defeating the MPLS route on AD or, worse, with a more specific prefix.  That makes it not an MPLS problem, and is out of scope for this post.

Once you clear the starting checks, you want to validate whether or not you have the route in your routing table.

1) Are you importing the route in to your VRF?  Make sure the other side's exported route-target is being imported on the originating router's VRF.
2) Is the terminating router or terminating router's PE advertising the route?
3) Are route reflectors involved?  If you're relying on one route reflector to relay a route through another route reflector, you need to ensure the cluster-IDs are different.

These following items are dependent on using OSPF as your PE->CE routing protocol:
4) If you're using OSPF as PE->CE, check for sham links.  It's easy to break these and hard to look for them.  Do a "sh run | s sham" on the PEs and see if any exist.  If they do, run "show ip ospf sham-links"
5) If you're using OSPF as PE->CE, and the CE is also part of the VRF (the VRF itself exists on the CE), enable capability vrf-lite on the OSPF process on the CE.

If you don't have an internal route and you need one to beat another AD, then additionally check out:
6a) If OSPF is PE->CE, make sure domain-id is set the same on all of them, or you'll end up with external routes across the MPLS cloud.
6b) If EIGRP is PE->CE, make sure your EIGRP AS number (process number) matches on the PE routers.

If you checked into all of that, you should have an appropriate route by now.  What happens when you've got the route in your routing table, pointing the right direction, but the traffic just doesn't arrive on the far side?  Now we start debugging MPLS itself.

1) sh run | s mpls on every PE and P device. Look for LDP filtering.  There are more elegant ways to find this, but this is the fastest.
2) From the PE on the originating side, run a "sh ip cef vrf <VRF NAME> <problem IP>".  Is the correct PE listed as next-hop in the "via" field? If it's not, go investigate the PE that is originating the route; there may be more than one path (and one may not lead anywhere!)
3) If it's the correct PE from step 2, do show mpls forwarding-table <PE LDP ID>. Unless your PEs are L2 adjacent, you must have a tag listed for the PE, or "Pop Tag".  If you don't, walk your adjacent routers to be sure mpls ip is enabled on every interface or OSPF MPLS auto-config is enabled. Make sure CEF is turned on on all P and PE devices - MPLS doesn't work without CEF. If necessary, re-check step 1 and make sure nothing is filtering tags. If still no problem is found, do a "show mpls ldp neighbor | i Peer" and make sure you have the correct count of neighbors.
4) Note the next-hop associated with the tag you identified in step 3. Open a command prompt on the next-hop and repeat step 3. Continue until you reach a "pop tag" for the terminating PE.
5) Check for Router-ID failures. LDP design can be picky that the mask in the routing table and the mask in the label match. This is most commonly an issue when OSPF is used as the MPLS IGP; if your router ID is based off a loopback that is other than a /32.  If this is the case, either change your loopback address to a /32 (if permitted), or change your ospf network type to point-to-point so that the label mask and the OSPF mask match.  Also, this can sometimes be an issue with summarized routes in other protocols (such as EIGRP), so be on the lookout there.
6) As a final check, be sure to see if cost-community was disabled on the PE routers. It's possible to perform traffic engineering against the prefixes if it's been disabled, and then who knows what path your traffic might be taking?  On the PEs, sh run | i cost-community. Cost community is on by default, and you want it left on. This command should show nothing if it is enabled; if it's disabled you will find bgp bestpath cost-community ignore in the config.

Now let's walk through the scenarios that these verifications above can save you from.

1) Validate the problem: ping <problem IP>

This should be obvious, but I actually proctor a private TS test, and I'm amazed at the number of people that don't check what I put in front of them. In rare circumstances, the solution can be derived just from verifying the issue.  And in a TS lab, you need to be sure you didn't somehow fix the problem at some other point.

2) Find out if the problem is unidirectional.  Run "debug ip icmp" on both the source and the destination.  Ping both ways.  If you're taking an INE lab, be sure logging is on too: logging con 7 and logging on.

This is very important - so you "can't ping" the destination.  Do you know if your echo request isn't making it from origination to destination, or that the echo reply isn't making it from destination to origination? Don't waste time debugging the wrong flow. Quite regularly only one direction is failing.

3) From the originating router, run "sh ip route <problem IP>" and "sh ip cef <problem IP>". Sometimes some other route in the table is defeating the MPLS route on AD or, worse, with a more specific prefix.  That makes it not an MPLS problem, and is out of scope for this post.

This is easy to overlook.  You may have the route in BGP, and the MPLS labels can be in good shape, but you're only getting a /24 across the MPLS VPN, and you're getting a bogus /32 route for the destination that leads nowhere, injected by your IGP from a router behind you. Your packet is going the wrong direction.
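To make the /24 versus /32 point concrete, here is a minimal Python sketch of longest-prefix matching. The route list and the source descriptions are made up for illustration; they are not taken from a real routing table.

```python
import ipaddress

def best_route(routes, dest):
    """Return the (prefix, source) pair that wins longest-prefix match for dest."""
    addr = ipaddress.ip_address(dest)
    candidates = []
    for prefix, source in routes:
        net = ipaddress.ip_network(prefix)
        if addr in net:
            candidates.append((net, source))
    if not candidates:
        return None
    # The most specific prefix wins before AD is ever compared.
    net, source = max(candidates, key=lambda c: c[0].prefixlen)
    return str(net), source

# Hypothetical table: a good /24 from the VPN and a bogus /32 from the IGP.
routes = [
    ("192.168.1.0/24", "MP-BGP across the MPLS VPN"),
    ("192.168.1.7/32", "IGP, from a router behind you"),
]

print(best_route(routes, "192.168.1.7"))
print(best_route(routes, "192.168.1.8"))
```

Only the host covered by the /32 is pulled off the MPLS path, which is why this kind of fault can look like a single unreachable address rather than a broken VPN.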

1) Are you importing the route in to your VRF?  Make sure the other side's exported route-target is being imported on the originating router's VRF.

Originating router:
ip vrf VPN
 rd 1:1
 route-target export 1:1
 route-target import 3:3
 route-target import 7:7

Terminating router:
ip vrf VPN
 rd 3:3
 route-target export 2:2
 route-target import 1:1
 route-target import 7:7

This config above for "Originating router" is missing route-target import 2:2.  The route target is an extended community carried with MP-BGP; if you don't import it into your VRF, you won't see the route.  The RD is basically irrelevant here - as long as they're unique on each PE, they don't matter for the import process.
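The import check boils down to a set intersection: a route lands in the VRF only if at least one of the route targets it was exported with is also imported locally. A small Python sketch using the two VRF configs above (the dictionary layout is just an illustration, not anything IOS produces):

```python
def is_imported(local_imports, remote_exports):
    """True if at least one exported route target is imported locally."""
    return bool(set(local_imports) & set(remote_exports))

# Values copied from the two "ip vrf VPN" configs above.
originating = {"export": ["1:1"], "import": ["3:3", "7:7"]}
terminating = {"export": ["2:2"], "import": ["1:1", "7:7"]}

# Does the originating side see the terminating side's routes?
print(is_imported(originating["import"], terminating["export"]))   # False
# And the reverse direction?
print(is_imported(terminating["import"], originating["export"]))   # True
```

Note the result is one-directional, which ties back to checking whether the failure is unidirectional before digging in.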

2) Is the terminating router or terminating router's PE advertising the route?

This one sure got me once.  I'm looking and looking for an MP-BGP problem, and it turns out that the CE just didn't advertise the route to the PE.  Simple BGP error.

3) Are route reflectors involved?  If you're relying on one route reflector to relay a route through another route reflector, you need to ensure the cluster-IDs are different.

If you have two route reflectors in your MP-BGP topology, unless the PEs in question both peer to the same route reflector, you need to ensure that the route reflectors have different cluster IDs.   In other words, if your MP-BGP topology looks like this:

RR1 <-- PE1 --> RR2 <-- PE2

This will work fine, even if the cluster IDs are the same, because RR2 will reflect the routes from PE1 to PE2 and vice-versa. However, if you have:

PE1 --> RR1 <--> RR2 <-- PE2

Then you'll need separate cluster IDs. With matching cluster IDs, RR2 sees its own cluster ID in the CLUSTER_LIST of the routes RR1 reflects from PE1 and discards them, and vice-versa.
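Here is a rough Python model of the cluster-ID loop check, assuming the standard behavior that a reflector adds its cluster ID to the CLUSTER_LIST when it reflects a route and discards any route that already carries its own cluster ID. The cluster ID values are hypothetical.

```python
def route_survives(rr_cluster_ids):
    """Walk a route through a chain of route reflectors, in order.

    Each reflector drops the route if its own cluster ID is already in the
    CLUSTER_LIST, otherwise it adds its cluster ID and reflects the route on.
    """
    cluster_list = []
    for cluster_id in rr_cluster_ids:
        if cluster_id in cluster_list:
            return False          # loop prevention kicks in, route discarded
        cluster_list.append(cluster_id)
    return True

# RR1 <-- PE1 --> RR2 <-- PE2: the route only crosses RR2.
print(route_survives(["0.0.0.1"]))
# PE1 --> RR1 <--> RR2 <-- PE2 with the same cluster ID on both reflectors.
print(route_survives(["0.0.0.1", "0.0.0.1"]))
# Same topology with distinct cluster IDs.
print(route_survives(["0.0.0.1", "0.0.0.2"]))
```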

4) If you're using OSPF as PE->CE, check for sham links. It's easy to break these and hard to look for them.  Do a "sh run | s sham" on the PEs and see if any exist.  If they do, run "show ip ospf sham-links"

Sham links allow you to extend an OSPF area across the "Super Area 0" backbone area. These are most commonly used to prefer an MPLS path over a back-door link.  Topology aside, I've been bitten by broken sham links before, so look out for these.  If you want to know more about them:
http://brbccie.blogspot.com/2012/12/ospf-pe-downward-bit-super-area-0.html

5) If you're using OSPF as PE->CE, and the CE is also part of the VRF (the VRF itself exists on the CE), enable capability vrf-lite on the OSPF process on the CE.

The first time I ran into this I spent 5 hours debugging it. Some may say a waste of time, but I'll never forget it. In short: OSPF checks for the downward bit on routes exported from MP-BGP directly into the OSPF process. You'll watch the routes arrive on the PE and get put in the OSPF process no problem, and then when they hit the CE device(s), if the CEs are in the VRF as well, they'll be in the OSPF database but not get put into the RIB/FIB.  This is a loop prevention mechanism. To disable it, use "capability vrf-lite" inside the OSPF process.
Also reference: http://brbccie.blogspot.com/2012/12/ospf-pe-downward-bit-super-area-0.html

6a) If OSPF is PE->CE, make sure domain-id is set the same on all of them, or you'll end up with external routes across the MPLS cloud.

This only matters if you're shooting for an internal route for some reason, and is more of a reminder than a big deal.

6b) If EIGRP is PE->CE, make sure your EIGRP AS number (process number) matches on the PE routers.

This can make a slightly bigger difference, in that EIGRP naturally deprefs (via higher AD) external routes.  You may need an internal route in order to make the traffic cross the MPLS cloud. If the AS number doesn't match, you'll end up with external routes.

1) sh run | s mpls on every PE and P device. Look for LDP filtering. There are more elegant ways to find this, but this is the fastest.

This is a bit of a hack, but it catches about 90% of LDP problems in < 60 seconds.  You can't beat it for speed. I'll show more about this below.

2) From the PE on the originating side, run a "sh ip cef vrf <VRF NAME> <problem IP>".  Is the correct PE listed as next-hop in the "via" field? If it's not, go investigate the PE that is originating the route; there may be more than one path (and one may not lead anywhere!)

PE1#sh ip cef vrf VPN 192.168.1.7
192.168.1.0/24, version 8, epoch 0, cached adjacency 10.0.23.3
0 packets, 0 bytes
  tag information set
    local tag: VPN-route-head
    fast tag rewrite with Fa0/1, 10.0.23.3, tags imposed: {17 23}
  via 5.5.5.5, 0 dependencies, recursive
    next hop 10.0.23.3, FastEthernet0/1 via 5.5.5.5/32
    valid cached adjacency
    tag rewrite with Fa0/1, 10.0.23.3, tags imposed: {17 23}

The via field above shows the PE you're heading towards.  Is it the correct PE?  This threw me off something awful once. The prefix in question was endlessly looping off a 3rd PE, and was being re-advertised on the 3rd PE.  That PE was being preffed.  Boom, an hour gone debugging - if only I'd paid more attention to the output of "sh ip cef vrf VPN"!

Assuming it is the right PE listed above, you walk the MPLS labels from there:

PE1#sh mpls forwarding-table 5.5.5.5
Local  Outgoing    Prefix            Bytes tag  Outgoing   Next Hop
tag    tag or VC   or Tunnel Id      switched   interface
17     17          5.5.5.5/32        0          Fa0/1      10.0.23.3

Next hop is 10.0.23.3, via Fa0/1; that's P1:

P1#show mpls forwarding-table 5.5.5.5
Local  Outgoing    Prefix            Bytes tag  Outgoing   Next Hop
tag    tag or VC   or Tunnel Id      switched   interface
17     Untagged    5.5.5.5/32        13766      Fa0/1      10.0.34.4

There's the evil Untagged! Let's go see what's up on P2.

P2#sh run | s mpls
no mpls ldp advertise-labels
 mpls label protocol ldp
 mpls ip
 mpls label protocol ldp
 mpls ip

Note, we should have caught this in MPLS debugging step 1, but just in case you didn't...!
There are about three scenarios you want to look out for related to label advertisement:

no mpls ldp advertise-labels will make no labels be advertised at all.
That command can be used in combination with mpls ldp advertise-labels for <standard ACL>. The standard ACL can be (rather obviously) rigged to prevent the labels you need advertised from being advertised.
The final command is mpls label range <min> <max>.  If you don't allow enough labels the ones you need can end up not getting assigned one at all.

I've fixed the mpls ldp advertise-labels command above, and now we see the appropriate output on P1:

P1#show mpls forwarding-table 5.5.5.5
Local  Outgoing    Prefix            Bytes tag  Outgoing   Next Hop
tag    tag or VC   or Tunnel Id      switched   interface
17     17          5.5.5.5/32        0          Fa0/1      10.0.34.4

And on P2:

P2#show mpls forwarding-table 5.5.5.5
Local  Outgoing    Prefix            Bytes tag  Outgoing   Next Hop
tag    tag or VC   or Tunnel Id      switched   interface
17     Pop tag     5.5.5.5/32        508        Fa0/1      10.0.45.5

We see "Pop tag".  Pop tag is OK, it's just part of the Penultimate Hop Pop process.
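The hop-by-hop walk above is mechanical enough to sketch in Python. The tables below are hand-copied from the show mpls forwarding-table outputs for 5.5.5.5/32; the router names as dictionary keys and the "terminating PE" label for the last hop are my own shorthand, not something IOS prints.

```python
def walk_lsp(tables, start, max_hops=16):
    """Follow the outgoing label for one prefix from router to router.

    tables maps a router name to (outgoing label, next-hop router).
    Returns the hops visited and a verdict string.
    """
    hops = []
    router = start
    for _ in range(max_hops):
        out_label, next_hop = tables[router]
        hops.append((router, out_label))
        if out_label == "Untagged":
            # The downstream neighbor never advertised a label: go look there.
            return hops, "broken at " + router + ", check " + next_hop
        if out_label == "Pop tag":
            return hops, "ok"     # penultimate hop reached
        router = next_hop
    return hops, "gave up after too many hops"

broken = {
    "PE1": ("17", "P1"),
    "P1": ("Untagged", "P2"),
    "P2": ("Pop tag", "terminating PE"),
}
fixed = dict(broken, P1=("17", "P2"))

print(walk_lsp(broken, "PE1"))
print(walk_lsp(fixed, "PE1"))
```

The verdict for the broken case points at P2 rather than P1, matching what happened above: P1 showed Untagged, but the misconfiguration lived on the neighbor that should have advertised the label.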

3) If it's the correct PE from step 2, do show mpls forwarding-table <PE LDP ID>. Unless your PEs are L2 adjacent, you must have a tag listed for the PE, or "Pop Tag".  If you don't, walk your adjacent routers to be sure mpls ip is enabled on every interface or OSPF MPLS auto-config is enabled. Make sure CEF is turned on on all P and PE devices - MPLS doesn't work without CEF. If necessary, re-check step 1 and make sure nothing is filtering tags. If still no problem is found, do a "show mpls ldp neighbor | i Peer" and make sure you have the correct count of neighbors.

I've seen some nasty, nasty things done with VACLs on the layer 2 switches between routers on practice labs.  It's not much of a stretch to think they'd block LDP.  The config would look perfect and your adjacency simply wouldn't come up.  Count how many adjacencies you're expecting from the diagram, and make sure you get a good head count:

P1#show mpls ldp neigh | i Peer
    Peer LDP Ident: 7.7.7.7:0; Local LDP Ident 10.0.37.3:0
    Peer LDP Ident: 2.2.2.2:0; Local LDP Ident 10.0.37.3:0
    Peer LDP Ident: 192.168.49.4:0; Local LDP Ident 10.0.37.3:0

If you're missing one, investigate the adjacency.
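If you'd rather not count by eye, the head count is easy to script. This Python sketch parses the filtered output shown above; the expected peer list stands in for whatever your diagram says, and 6.6.6.6 is a hypothetical neighbor added to show a miss.

```python
import re

# Output of "show mpls ldp neigh | i Peer" copied from P1 above.
SAMPLE = """\
    Peer LDP Ident: 7.7.7.7:0; Local LDP Ident 10.0.37.3:0
    Peer LDP Ident: 2.2.2.2:0; Local LDP Ident 10.0.37.3:0
    Peer LDP Ident: 192.168.49.4:0; Local LDP Ident 10.0.37.3:0
"""

def ldp_peers(output):
    """Extract the peer LDP router IDs from the filtered neighbor output."""
    return re.findall(r"Peer LDP Ident:\s*(\d+\.\d+\.\d+\.\d+):\d+", output)

def missing_peers(output, expected):
    """Return the expected peers that have no LDP adjacency."""
    return sorted(set(expected) - set(ldp_peers(output)))

# Hypothetical expectation read off the lab diagram.
expected = ["2.2.2.2", "7.7.7.7", "192.168.49.4", "6.6.6.6"]

print(ldp_peers(SAMPLE))
print(missing_peers(SAMPLE, expected))
```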

And a shout out to my friend Keith Chayer, who reminded me to check for CEF being enabled as well. It is of note that you'll be missing labels if CEF is disabled on the MPLS transit path - at least LDP is smart enough to tell its neighbors "I'm broken - don't use me".

4) Note the next-hop associated with the tag you identified in step 3. Open a command prompt on the next-hop and repeat step 3. Continue until you reach a "pop tag" for the terminating PE.

I covered this above.

5) Check for Router-ID failures. LDP design can be picky that the mask in the routing table and the mask in the label match. This is most commonly an issue when OSPF is used as the MPLS IGP; if your router ID is based off a loopback that is other than a /32.  If this is the case, either change your loopback address to a /32 (if permitted), or change your ospf network type to point-to-point so that the label mask and the OSPF mask match.  Also, this can sometimes be an issue with summarized routes in other protocols (such as EIGRP), so be on the lookout there.

This is reasonably self-explanatory.  The route prefix length and the LDP prefix length need to match. OSPF is the common culprit.  
Reference: http://brbccie.blogspot.com/2013/11/mini-why-does-ldp-require-32-loopback.html
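A simplified Python model of the mismatch, assuming the usual behavior that OSPF advertises a loopback as a /32 host route unless the interface is set to network type point-to-point, while LDP binds a label to the connected prefix with its configured mask. The 5.5.5.5/24 loopback is a made-up example.

```python
import ipaddress

def ospf_advertised(loopback, point_to_point=False):
    """Prefix OSPF advertises for a loopback interface."""
    iface = ipaddress.ip_interface(loopback)
    if point_to_point:
        return iface.network                      # real mask is advertised
    return ipaddress.ip_network(f"{iface.ip}/32")  # default: host route

def ldp_bound(loopback):
    """Prefix the local router binds a label to (the connected network)."""
    return ipaddress.ip_interface(loopback).network

def masks_match(loopback, point_to_point=False):
    return ospf_advertised(loopback, point_to_point) == ldp_bound(loopback)

print(masks_match("5.5.5.5/24"))                       # False: /32 route, /24 label
print(masks_match("5.5.5.5/24", point_to_point=True))  # True
print(masks_match("5.5.5.5/32"))                       # True
```

Both fixes from the step above show up here: either make the loopback a /32, or flip the OSPF network type so the advertised mask follows the interface.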

6) As a final check, be sure to see if cost-community was disabled on the PE routers. It's possible to perform traffic engineering against the prefixes if it's been disabled, and then who knows what path your traffic might be taking?  On the PEs, sh run | i cost-community. Cost community is on by default, and you want it left on. This command should show nothing if it is enabled; if it's disabled you will find bgp bestpath cost-community ignore in the config.

I got this on a mock lab once, as well.  If the PEs are disabling cost community, you need to ask yourself why: is this a mandatory traffic engineering, or are they just trying to steer routes in the wrong direction?

Reference: http://brbccie.blogspot.com/2012/12/bgp-cost-community-eigrp-soo-and.html

/* Addition 11/27/14 - I apologize for not inserting this more thoroughly in the blog, but time doesn't permit right now - be sure to look for import or export maps on the VRF. It's possible to define a route-map that filters prefixes inbound or outbound of the VRF.  The syntax is not particularly complex:

ip prefix-list IMPORT_PL seq 5 deny 0.0.0.0/0 le 32
route-map SNAFU permit 10
 match ip address prefix-list  IMPORT_PL

vrf definition VRFTEST
 rd 1:1
 route-target export 1:1
 route-target import 1:1
 !
 address-family ipv4
  import ipv4 unicast map SNAFU

*/

Cheers,

Jeff Kronlage

Saturday, April 12, 2014

MPLS EXP-based QoS and QoS Groups

This topic is a bit of a stretch for the R&S lab, really being more oriented towards Service Provider, but I wanted to talk about it anyway.

So what does your MPLS carrier do with those QoS settings you pass them?
It's unlikely they're queuing at congestion spots in their network based on the DSCP values you set.

You've probably heard about the EXP bits in the MPLS tag.  These are used "for QoS".  But no one really seems to know how.  And there's only 3 bits, but we use 6 bits for DSCP, so what's the story?

Here's our topology:



We'll be setting DSCP values on H1 and manipulating them, or their MPLS equivalents, on the way to H2.

Without any special config, let's see how this works right out of the box. Of important note, I have null-routed H1's IP address on H2. This makes it easier to read the output from "debug mpls packet", because we're only seeing a one-way flow instead of a two-way flow.

H1#ping
Protocol [ip]:
Target IP address: 192.168.1.6
Repeat count [5]: 2
Datagram size [100]:
Timeout in seconds [2]:
Extended commands [n]: y
Source address or interface:
Type of service [0]: 184
Set DF bit in IP header? [no]:
Validate reply data? [no]:
Data pattern [0xABCD]:
Loose, Strict, Record, Timestamp, Verbose[none]:
Sweep range of sizes [n]:
Type escape sequence to abort.
Sending 2, 100-byte ICMP Echos to 192.168.1.6, timeout is 2 seconds:
..
Success rate is 0 percent (0/2)

(Remember, we were not expecting responses)
So we sent this as EF traffic (TOS 184, above).  Any hypothesis on what's seen in transit?

P2#debug mpls packet
MPLS packet debugging is on
P2#
*Mar  1 09:20:04.473: MPLS: Fa0/0: recvd: CoS=5, TTL=253, Label(s)=16/19
*Mar  1 09:20:04.473: MPLS: Fa0/1: xmit: CoS=5, TTL=252, Label(s)=19
P2#
*Mar  1 09:20:06.437: MPLS: Fa0/0: recvd: CoS=5, TTL=253, Label(s)=16/19
*Mar  1 09:20:06.437: MPLS: Fa0/1: xmit: CoS=5, TTL=252, Label(s)=19

CoS=5, meaning the EXP bits are set to 5.  The default behavior on a PE is to map the IPP (IP Precedence) on to the EXP bits. These line up nicely both being three bits.  Reference the ToS value above - 184.  That's a full 8 bit QoS value, in binary it's 10111000.  Chop off the last two unused digits for the 6-bit DSCP value of 101110, and you have 46 (which I suspect you recognize as "EF"), knock off everything except the first three bits - 101 - for the IPP, and you have?  Five.  Hence, EXP becomes 5 as well. This default feature is known as "ToS Reflection".
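The bit-chopping above is just two right shifts. A quick Python sketch of the arithmetic (the function name is mine):

```python
def tos_fields(tos):
    """Split an 8-bit ToS byte into (DSCP, IP Precedence)."""
    dscp = tos >> 2   # drop the two low-order bits, keep the top six
    ipp = tos >> 5    # keep only the top three bits
    return dscp, ipp

dscp, ipp = tos_fields(184)
print(format(184, "08b"), dscp, ipp)   # 10111000 46 5
```

With the default PE behavior described above, the imposed EXP value equals the IPP, so ToS 184 ends up as EXP 5.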

We'll look at how this value can be used to our advantage later.

What comes in on H2?

For those following my blog for a while, you may know about 14 months ago I wrote a giant ACL that matches every possible QoS value.  I still have it on file, and I'll be using it here to see what values come in on H2.

H2#sh ip access-list | i match
    460 permit ip any any dscp ef (2 matches)
    480 permit ip any any dscp cs6 (1 match)

Ok great! We've got two EF matches, and a ... Class Selector 6?
The EF matches are the two pings arriving.  I found this odd right off the bat. I would've expected that if IOS takes the IPP bits and maps them to EXP, it would then take EXP and map it back to IPP on the way out the other PE when the final label is popped. However, it doesn't work that way - instead, it just uses the DSCP that was already in the packet - which, of course, never changed.  An MPLS label was put on top of it, but the underlying packet was left intact.

The class selector 6 packet is a BGP keepalive.  We'll be seeing more of them throughout the post.

It turns out there are terms for the different types of MPLS QoS behavior.  What we observed above would be either "Pipe Mode" or "Short Pipe Mode". Both of these behaviors include using the original ToS bits instead of replacing them based on the EXP bits.  The difference between Pipe Mode and Short Pipe Mode is that Pipe Mode egress queues based on the EXP bits, and Short Pipe Mode egress queues at the PE on the original ToS (DSCP) bits.  This post assumes the audience understands how to write a hierarchical QoS policy, so I'm not going to elaborate or examine the differences between them any further. Any additional mention of "Pipe Mode" assumes either of the above behaviors.  The third option is "Uniform Mode", which is the process of replacing the IP Packet's ToS bits (IPP/DSCP) with something derived from the EXP bits.

We just saw Pipe Mode in action above, let's look at how to implement Uniform Mode.

First we need to take a quick look at QoS groups.

There's a particular challenge with ingress and egress marking on a PE. On ingress, you can't set an IPP or DSCP value because the MPLS header is still on the frame.  On the egress interface, you can't match on the EXP bits to set IPP or DSCP bits, because the MPLS label is already popped.  So how do you match on an EXP value and set a DSCP value?  Enter QoS groups.

PE2:

class-map match-all EXP5
 match mpls experimental topmost 5

policy-map uniform-ingress
 class EXP5
  set qos-group 5
 class class-default
  set qos-group 0

interface fa0/0 ! MPLS side
 service-policy input uniform-ingress

This config will match a decimal value of five on the topmost MPLS label - which, in our case, on the PE, is the only MPLS label thanks to Penultimate Hop Pop.  We'll assign a local value of "5" (although this could be any number 1-99) if the EXP bit is 5.  Anything else will get reset to 0.

class-map match-all GROUP5
 match qos-group 5

policy-map uniform-egress
 class GROUP5
  set ip dscp af41  
 class class-default
  set ip dscp default

interface fa0/1 ! IP/VRF side
  service-policy output uniform-egress

On egress, we'll match on that 5, and set af41.  Why af41?  Because I wanted to show the policy was doing something.

We'll ping from H1 to H2 again.  I'm omitting any non-essential bits from the extended ping for brevity.

H1#ping
Target IP address: 192.168.1.6
Repeat count [5]: 2
Extended commands [n]: y
Type of service [0]: 184
Sending 2, 100-byte ICMP Echos to 192.168.1.6, timeout is 2 seconds:
..
Success rate is 0 percent (0/2)

Again, expected failure, this is a deliberate one-way flow.

H2#sh ip access-list | i match
    340 permit ip any any dscp af41 (2 matches)
    640 permit ip any any precedence routine (4 matches)

We see our two af41 hits, and 4 routine.  The routine hits are the IPP 6 packets being remarked to zero because they don't match anything else in the policy.

Now obviously this is a pretty useless policy, but it was more about showing how the function works.
Here's an adaptation for a more scalable Uniform Mode solution:

policy-map uniform-ingress
 class class-default
  set qos-group mpls experimental topmost

interface fa0/0
 service-policy input uniform-ingress

policy-map uniform-egress
 class class-default
  set precedence qos-group

interface Fa0/1
  service-policy output uniform-egress

Let's see what the outcome is.

H1#ping
Target IP address: 192.168.1.6
Repeat count [5]: 2
Extended commands [n]: y
Type of service [0]: 184
Sending 2, 100-byte ICMP Echos to 192.168.1.6, timeout is 2 seconds:
..
Success rate is 0 percent (0/2)

H2#clear ip access-list count

H2#sh ip access-list | i match
    460 permit ip any any dscp ef (2 matches)

I was rather surprised the first time I saw this output.  We're setting a precedence value but getting back a DSCP value. I expected to see a precedence/class-selector value. The original bits were 101110 (DSCP 46, or EF), and I expected to replace them with 101000, which would be class selector 5.  This brings up an important difference in IOS's handling of class-selector vs precedence. I'd always treated them the same, but it turns out IOS is more literal - Precedence sets only the precedence bits.  So we re-wrote the first three bits with 101, which ... were already set to 101.  So we ended up with 101110 (DSCP 46/EF) again.
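A short Python sketch of what "sets only the precedence bits" means at the bit level. This models the observed behavior on the 6-bit DSCP field; it is not IOS code.

```python
def set_precedence(dscp, precedence):
    """Overwrite the top three bits of a 6-bit DSCP, keep the low three."""
    return (precedence << 3) | (dscp & 0b000111)

EF = 0b101110   # DSCP 46

print(set_precedence(EF, 5), format(set_precedence(EF, 5), "06b"))  # 46 101110
print(set_precedence(EF, 2), format(set_precedence(EF, 2), "06b"))  # 22 010110
```

Writing precedence 5 over EF is a no-op, which is exactly the EF match seen on H2. Writing a different precedence would change the result but still drag the original low-order bits along instead of producing a clean class selector.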

We could do something like this:

policy-map uniform-egress
 class class-default
  set dscp qos-group 

But then we'd get literal DSCP values: if the QoS Group is 5, it would set DSCP 5.  Not DSCP CS5 (101000), but actual binary 5 - (000101).  To accomplish EF -> EXP 5 -> CS5, we'd have to use either a lengthy QoS-Group -> DSCP class-map/policy-map setup, or we could use a table map!

table-map TABMAP
 map from 1 to 8     ! Group 1 to DSCP CS1
 map from 2 to 16   ! Group 2 to DSCP CS2
 map from 3 to 24   ! ...
 map from 4 to 32
 map from 5 to 40
 map from 6 to 48
 map from 7 to 56   ! Group 7 to DSCP CS7

policy-map uniform-egress
 class class-default
  set dscp qos-group table TABMAP

H1#ping
Target IP address: 192.168.1.6
Repeat count [5]: 2
Extended commands [n]: y
Type of service [0]: 184
Sending 2, 100-byte ICMP Echos to 192.168.1.6, timeout is 2 seconds:
..
Success rate is 0 percent (0/2)

H2#sh ip access-list | i match
    400 permit ip any any dscp cs5 (2 matches)
    480 permit ip any any dscp cs6 (18 matches)

I think the table map use is pretty obvious - take a qos group and match it to some other integer, which has some meaning when applied to a DSCP or IPP field.  Now we have the CS5 output we were looking for.
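The table map is a lookup from group number to class-selector DSCP, and each entry is the group shifted left three bits. A Python sketch of the same mapping; the fall-through for unmapped groups assumes the table map's default action is to copy the value, which is how I understand the IOS default.

```python
# Same entries as TABMAP above: group N maps to DSCP CSN (N * 8).
TABMAP = {group: group << 3 for group in range(1, 8)}

def dscp_from_group(group):
    """DSCP written on egress for a given QoS group.

    Assumption: groups with no entry are copied through unchanged.
    """
    return TABMAP.get(group, group)

print(TABMAP)
print(dscp_from_group(5), format(dscp_from_group(5), "06b"))  # 40 101000
```

Compare that with the plain set dscp qos-group policy, where group 5 would be written as DSCP 5 (000101) instead of CS5.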

Now clearly, MPLS/EXP QoS needs to be able to be modified on more than just the egress PE.  Let's take a look at the other spots we can match and adapt behavior to it.

So far we've been doing matches on the "topmost" label, so what other options have we got?  Keeping this oriented towards the R&S CCIE, I'm not going to look at anything other than a 2-tag (VRF + MPLS PE) system. When traffic is received in from the host towards the PE, the PE is going to impose a label for the VRF. It will then add the MPLS transit label on top of that, for reaching the other PE. So to reiterate, we go from zero labels to two labels on the PE.

We can set both those labels, and it's really not hard, but you have to pay attention to what label is being manipulated on which interface. IOS is picky about the order of operations in this case.

For ingress on a PE, we can only set imposition. We clearly can't set "topmost" because there are no labels on the packet yet:

PE1:
policy-map impose1
 class class-default
  set mpls experimental imposition 4

int fa0/0
 service-policy input impose1

H1#ping 192.168.1.6 rep 1

Type escape sequence to abort.
Sending 1, 100-byte ICMP Echos to 192.168.1.6, timeout is 2 seconds:
.
Success rate is 0 percent (0/1)

H2#sh ip access-list | i match
    320 permit ip any any dscp cs4 (1 match)
    480 permit ip any any dscp cs6 (2 matches)

And what if we set EF manually on H1?

H1#ping
Target IP address: 192.168.1.6
Repeat count [5]: 2
Extended commands [n]: y
Type of service [0]: 184
Sending 2, 100-byte ICMP Echos to 192.168.1.6, timeout is 2 seconds:
..
Success rate is 0 percent (0/2)

H2#sh ip access-list | i match
    320 permit ip any any dscp cs4 (3 matches)
    480 permit ip any any dscp cs6 (2 matches)

Still CS4, because we're remarking the EXP bits on the inner label on PE1 to 4, that's carried down to PE2, and then the qos-group-based policy remarks the DSCP to CS4.

What about the outer label?

H1#ping 192.168.1.6 rep 1

Type escape sequence to abort.
Sending 1, 100-byte ICMP Echos to 192.168.1.6, timeout is 2 seconds:
.
Success rate is 0 percent (0/1)

We'd need to look at the results on P2, because PE2 never gets the outer label - the PHP process removes it before forwarding the frame.

P2#
*Mar  2 01:44:59.409: MPLS: Fa0/0: recvd: CoS=4, TTL=253, Label(s)=16/19
*Mar  2 01:44:59.409: MPLS: Fa0/1: xmit: CoS=4, TTL=252, Label(s)=19

So P2 receives the outer label as 4, and the inner label as 4.  We see 4 coming in on Fa0/0 on label 16, and going out on label 19 on Fa0/1, showing both the PHP process and the fact that both EXP values are the same.  That's because the default behavior of a PE is to copy the inner label's EXP bits to the outer label.  But what if we wanted to set the outer label to something different?

There are two places we could do that: egress on the PE, or on the P routers.

Let's try the PE first.

PE1:
policy-map topmost1
 class class-default
  set mpls experimental topmost 2

interface FastEthernet0/1
 service-policy output topmost1

H1#ping 192.168.1.6 rep 1

Type escape sequence to abort.
Sending 1, 100-byte ICMP Echos to 192.168.1.6, timeout is 2 seconds:
.
Success rate is 0 percent (0/1)

P2#
*Mar  2 01:53:51.609: MPLS: Fa0/0: recvd: CoS=2, TTL=253, Label(s)=16/19
*Mar  2 01:53:51.609: MPLS: Fa0/1: xmit: CoS=4, TTL=252, Label(s)=19

Now we see EXP 2 on the topmost and EXP 4 on the inner.

It's of some interest that if we wanted the final PE (PE2) to see that value of 2, we'd want to disable PHP.  PHP is disabled from the PE, not the router upstream from it.  This is done by the PE advertising an explicit null label for the prefixes terminating on it:

PE2:

mpls ldp explicit-null

H1#ping 192.168.1.6 rep 1

Type escape sequence to abort.
Sending 1, 100-byte ICMP Echos to 192.168.1.6, timeout is 2 seconds:
.
Success rate is 0 percent (0/1)

P2#
*Mar  2 01:57:56.889: MPLS: Fa0/0: recvd: CoS=2, TTL=253, Label(s)=16/19
*Mar  2 01:57:56.889: MPLS: Fa0/1: xmit: CoS=2, TTL=252, Label(s)=0/19

H2#sh ip access-list | i match
    160 permit ip any any dscp cs2 (1 match)

We see that P2 forwarded both labels, one of which was the explicit null/0 label (reference 0/19).  The PE has to pop both labels before forwarding.  Consequently, we also see that the PE now marked CS2 based on the EXP2 in the topmost label.

Now let's see about manipulating the topmost label on a P device.
For clarity's sake on P2, I am disabling the explicit null (re-enabling PHP) on PE2:

PE2(config)#no mpls ldp explicit-null

P1:

policy-map set-topmost
 class class-default
  set mpls experimental topmost 7

interface FastEthernet0/1
 service-policy output set-topmost

Before I show the output of this, it's important to note that setting the topmost EXP on egress is the only option I could find that worked on the P routers.  The P routers aren't imposing any labels (just swapping, which is different), so imposition doesn't work, and setting topmost on ingress doesn't appear to do anything (although I am not sure why).  And now for the outcome:

H1#ping 192.168.1.6 rep 1

Type escape sequence to abort.
Sending 1, 100-byte ICMP Echos to 192.168.1.6, timeout is 2 seconds:
.
Success rate is 0 percent (0/1)

P2#
*Mar  2 02:22:25.641: MPLS: Fa0/0: recvd: CoS=7, TTL=253, Label(s)=16/19
*Mar  2 02:22:25.645: MPLS: Fa0/1: xmit: CoS=4, TTL=252, Label(s)=19

As anticipated, EXP 7 on the outer label only.

It's also important to note how P routers treat the EXP bits.  By default, unless you manually change it with the processes I've demonstrated and will demonstrate to come, the P router, as it swaps labels hop-by-hop, will always copy the EXP of the old outer label to the new outer label unmodified.
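To keep the inner and outer EXP values straight, here is a toy Python model of the two-label stack as observed in the debugs above. The stack is a list with the outer label's EXP first; the function names are mine and only loosely mirror the IOS keywords.

```python
def impose(exp):
    """PE ingress: push VPN and transport labels, both carrying the same EXP."""
    return [exp, exp]             # [outer, inner]

def set_topmost(stack, exp):
    """Egress policy with 'set mpls experimental topmost': outer label only."""
    return [exp] + stack[1:]

def swap(stack):
    """P router label swap: EXP is copied to the new outer label unmodified."""
    return list(stack)

def php(stack):
    """Penultimate hop pop: the outer label, and its EXP, is removed."""
    return stack[1:]

stack = impose(4)                 # PE1: set mpls experimental imposition 4
stack = set_topmost(stack, 7)     # P1: set mpls experimental topmost 7
print(stack)                      # what P2 receives: [7, 4]
print(php(swap(stack)))           # what P2 transmits to PE2: [4]
```

That matches the last debug on P2: CoS=7 received on the outer label, CoS=4 transmitted once the outer label is popped.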

And now for our final topic - policing based on EXP.

P1:
class-map match-all EXP5
 match mpls experimental topmost 5

policy-map POLICER
 class EXP5
   police cir 32000
     conform-action transmit
     exceed-action set-mpls-exp-topmost-transmit 1

interface FastEthernet0/1
 service-policy output POLICER

H1#ping
Protocol [ip]:
Target IP address: 192.168.1.6
Repeat count [5]: 500
Datagram size [100]: 1000
Extended commands [n]: y
Type of service [0]: 184
Sending 500, 1000-byte ICMP Echos to 192.168.1.6, timeout is 0 seconds:
......................................................................
<output omitted>
..........
Success rate is 0 percent (0/500)

This one is tricky to validate - we want to see some MPLS packets leave P1 as 5, and some leave as 1.  Unfortunately my ACL doesn't work here (Without turning PHP back off) because we're playing with the upper label and not the inner label, and the Uniform Mode config on PE2 won't take heed of the outer label, because it's popped before hitting the egress interface.

Instead, we're just going to look at a sampling of "debug mpls packet" on P2:

*Mar  2 02:48:26.057: MPLS: Fa0/0: recvd: CoS=5, TTL=253, Label(s)=16/19
*Mar  2 02:48:26.057: MPLS: Fa0/1: xmit: CoS=5, TTL=252, Label(s)=19
*Mar  2 02:48:26.097: MPLS: Fa0/0: recvd: CoS=1, TTL=253, Label(s)=16/19
*Mar  2 02:48:26.097: MPLS: Fa0/1: xmit: CoS=5, TTL=252, Label(s)=19
*Mar  2 02:48:26.097: MPLS: Fa0/0: recvd: CoS=1, TTL=253, Label(s)=16/19
*Mar  2 02:48:26.097: MPLS: Fa0/1: xmit: CoS=5, TTL=252, Label(s)=19
*Mar  2 02:48:26.101: MPLS: Fa0/0: recvd: CoS=1, TTL=253, Label(s)=16/19

Let's decipher this a bit:

Remember, P2 is performing PHP for PE2, so what we see coming in and what we see going out will be different.  P1 is only making modifications to the topmost label.

*Mar  2 02:48:26.057: MPLS: Fa0/0: recvd: CoS=5, TTL=253, Label(s)=16/19

We got an MPLS packet in as EXP 5.

*Mar  2 02:48:26.057: MPLS: Fa0/1: xmit: CoS=5, TTL=252, Label(s)=19

We popped the upper label and sent the inner label on as EXP 5 as well.

*Mar  2 02:48:26.097: MPLS: Fa0/0: recvd: CoS=1, TTL=253, Label(s)=16/19

By this point, we've already gotten the policer to kick in, so we receive EXP 1.

*Mar  2 02:48:26.097: MPLS: Fa0/1: xmit: CoS=5, TTL=252, Label(s)=19

and we transmit EXP 5 based on the inner label, which was set on PE1 because of the IPP -> EXP ToS Reflection.  The policer on P1 did not modify this value.

That's MPLS QoS/QoS Groups in a nutshell.  Hope you enjoyed!

Jeff