So what does your MPLS carrier do with those QoS settings you pass them?
It's unlikely they're queuing at congestion spots in their network based on the DSCP values you set.
You've probably heard about the EXP bits in the MPLS tag. These are used "for QoS". But no one really seems to know how. And there's only 3 bits, but we use 6 bits for DSCP, so what's the story?
Here's our topology:
We'll be setting DSCP values on H1 and manipulating them, or their MPLS equivalents, on the way to H2.
Without any special config, let's see how this works right out of the box. One important note: I have null-routed H1's IP address on H2. This makes the output from "debug mpls packet" easier to read, because we only see a one-way flow instead of a two-way flow.
H1#ping
Protocol [ip]:
Target IP address: 192.168.1.6
Repeat count [5]: 2
Datagram size [100]:
Timeout in seconds [2]:
Extended commands [n]: y
Source address or interface:
Type of service [0]: 184
Set DF bit in IP header? [no]:
Validate reply data? [no]:
Data pattern [0xABCD]:
Loose, Strict, Record, Timestamp, Verbose[none]:
Sweep range of sizes [n]:
Type escape sequence to abort.
Sending 2, 100-byte ICMP Echos to 192.168.1.6, timeout is 2 seconds:
..
Success rate is 0 percent (0/2)
(Remember, we were not expecting responses)
So we sent this as EF traffic (TOS 184, above). Any hypothesis on what's seen in transit?
P2#debug mpls packet
MPLS packet debugging is on
P2#
*Mar 1 09:20:04.473: MPLS: Fa0/0: recvd: CoS=5, TTL=253, Label(s)=16/19
*Mar 1 09:20:04.473: MPLS: Fa0/1: xmit: CoS=5, TTL=252, Label(s)=19
P2#
*Mar 1 09:20:06.437: MPLS: Fa0/0: recvd: CoS=5, TTL=253, Label(s)=16/19
*Mar 1 09:20:06.437: MPLS: Fa0/1: xmit: CoS=5, TTL=252, Label(s)=19
CoS=5 means the EXP bits are set to 5. The default behavior on a PE is to map the IPP (IP Precedence) onto the EXP bits. These line up nicely, both being three bits. Refer back to the ToS value above: 184. That's the full 8-bit ToS byte, which in binary is 10111000. Chop off the last two digits, which aren't part of DSCP, for the 6-bit DSCP value of 101110, and you have 46 (which I suspect you recognize as "EF"). Knock off everything except the first three bits, 101, for the IPP, and you have five. Hence, EXP becomes 5 as well. This default feature is known as "ToS Reflection".
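The bit arithmetic above is easy to check in a few lines of Python. This is only a sketch of the field layout; the helper name `tos_fields` is made up for illustration and has nothing to do with IOS itself.

```python
def tos_fields(tos: int) -> dict:
    """Split an 8-bit ToS byte into the fields discussed above."""
    if not 0 <= tos <= 0xFF:
        raise ValueError("ToS is a single byte (0-255)")
    dscp = tos >> 2   # top six bits
    ipp = tos >> 5    # top three bits (IP Precedence)
    return {
        "tos_bin": format(tos, "08b"),
        "dscp": dscp,
        "ipp": ipp,
        "exp": ipp,   # default PE behavior observed above: IPP copied to EXP
    }


fields = tos_fields(184)
print(fields)
```

For ToS 184 this gives DSCP 46 and IPP 5, matching the CoS=5 seen in the debug output.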
We'll look at how this value can be used to our advantage later.
What comes in on H2?
For those following my blog for a while, you may know about 14 months ago I wrote a giant ACL that matches every possible QoS value. I still have it on file, and I'll be using it here to see what values come in on H2.
H2#sh ip access-list | i match
460 permit ip any any dscp ef (2 matches)
480 permit ip any any dscp cs6 (1 match)
Ok great! We've got two EF matches, and a ... Class Selector 6?
The EF matches are the two pings arriving. I found this odd right off the bat. I would've expected that if IOS takes the IPP bits and maps them to EXP, it would then take EXP and map it back to IPP on the way out the other PE when the final label is popped. However, it doesn't work that way. Instead, it just uses the DSCP that was already in the packet, which, of course, never changed. An MPLS label was put on top of it, but the underlying packet was left intact.
The class selector 6 packet is a BGP keepalive. We'll be seeing more of them throughout the post.
It turns out there are terms for the different types of MPLS QoS behavior. What we observed above would be either "Pipe Mode" or "Short Pipe Mode". Both of these behaviors include using the original ToS bits instead of replacing them based on the EXP bits. The difference between Pipe Mode and Short Pipe Mode is that Pipe Mode egress queues based on the EXP bits, and Short Pipe Mode egress queues at the PE on the original ToS (DSCP) bits. This post assumes the audience understands how to write a hierarchical QoS policy, so I'm not going to elaborate or examine the differences between them any further. Any additional mention of "Pipe Mode" assumes either of the above behaviors. The third option is "Uniform Mode", which is the process of replacing the IP Packet's ToS bits (IPP/DSCP) with something derived from the EXP bits.
We just saw Pipe Mode in action above, let's look at how to implement Uniform Mode.
First we need to take a quick look at QoS groups.
There's a particular challenge with ingress and egress marking on a PE. On ingress, you can't set an IPP or DSCP value because the MPLS header is still on the frame. On the egress interface, you can't match on the EXP bits to set IPP or DSCP bits, because the MPLS label is already popped. So how do you match on an EXP value and set a DSCP value? Enter QoS groups.
PE2:
class-map match-all EXP5
match mpls experimental topmost 5
policy-map uniform-ingress
class EXP5
set qos-group 5
class class-default
set qos-group 0
interface fa0/0 ! MPLS side
service-policy input uniform-ingress
This config will match a decimal value of five in the EXP bits of the topmost MPLS label, which, in our case, on the PE, is the only MPLS label thanks to Penultimate Hop Pop. We'll assign a local value of "5" (although this could be any number 1-99) if the EXP value is 5. Anything else will get reset to 0.
class-map match-all GROUP5
match qos-group 5
policy-map uniform-egress
class GROUP5
set ip dscp af41
class class-default
set ip dscp default
interface fa0/1 ! IP/VRF side
service-policy output uniform-egress
On egress, we'll match on that 5, and set af41. Why af41? Because I wanted to show the policy was doing something.
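To make the two-stage logic concrete, here is a rough Python model of the ingress and egress policies. It is a sketch of the decision logic only, not of how IOS implements it; the function names simply mirror the policy-map names, and 34 is the decimal DSCP value for af41.

```python
AF41 = 34  # DSCP af41 is binary 100010


def uniform_ingress(exp: int) -> int:
    """Mirror of policy-map uniform-ingress: EXP 5 -> qos-group 5, else 0."""
    return 5 if exp == 5 else 0


def uniform_egress(qos_group: int) -> int:
    """Mirror of policy-map uniform-egress: group 5 -> af41, else default (0)."""
    return AF41 if qos_group == 5 else 0


# The label is gone by the time the egress policy runs, so the qos-group
# is the only thing carrying the EXP information across the router.
for exp in (5, 6, 0):
    print("EXP", exp, "-> DSCP", uniform_egress(uniform_ingress(exp)))
```

Note that anything other than EXP 5 falls into class-default on both sides and leaves with DSCP 0.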
We'll ping from H1 to H2 again. I'm omitting any non-essential bits from the extended ping for brevity.
H1#ping
Target IP address: 192.168.1.6
Repeat count [5]: 2
Extended commands [n]: y
Type of service [0]: 184
Sending 2, 100-byte ICMP Echos to 192.168.1.6, timeout is 2 seconds:
..
Success rate is 0 percent (0/2)
Again, expected failure, this is a deliberate one-way flow.
H2#sh ip access-list | i match
340 permit ip any any dscp af41 (2 matches)
640 permit ip any any precedence routine (4 matches)
We see our two af41 hits, and 4 routine. The routine hits are the IPP 6 (CS6) packets: they don't match anything else in the policy, so they fall into class-default and get remarked to zero.
Now obviously this is a pretty useless policy, but it was more about showing how the function works.
Here's an adaptation for a more scalable Uniform Mode solution:
policy-map uniform-ingress
class class-default
set qos-group mpls experimental topmost
interface fa0/0
service-policy input uniform-ingress
policy-map uniform-egress
class class-default
set precedence qos-group
interface Fa0/1
service-policy output uniform-egress
Let's see what the outcome is.
H1#ping
Target IP address: 192.168.1.6
Repeat count [5]: 2
Extended commands [n]: y
Type of service [0]: 184
Sending 2, 100-byte ICMP Echos to 192.168.1.6, timeout is 2 seconds:
..
Success rate is 0 percent (0/2)
H2#clear ip access-list count
H2#sh ip access-list | i match
460 permit ip any any dscp ef (2 matches)
I was rather surprised the first time I saw this output. We're setting a precedence value but getting back a DSCP value. I expected to see a precedence/class-selector value. The original bits were 101110 (DSCP 46, or EF), and I expected to replace them with 101000, which would be class selector 5. This brings up an important difference in IOS's handling of class-selector vs precedence. I'd always treated them the same, but it turns out IOS is more literal: precedence sets only the precedence bits. So we re-wrote the first three bits with 101, which ... were already set to 101. So we ended up with 101110 (DSCP 46/EF) again.
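Here is a small Python sketch of that behavior as I understand it from the output above: only the top three bits of the 6-bit field get overwritten, and the low three are left alone. The function name `set_precedence` is just for illustration.

```python
def set_precedence(dscp: int, precedence: int) -> int:
    """Overwrite only the top three bits of a 6-bit DSCP, keeping the low three."""
    return (precedence << 3) | (dscp & 0b000111)


EF = 0b101110   # DSCP 46
CS5 = 0b101000  # DSCP 40

result = set_precedence(EF, 5)
print(format(result, "06b"), result)  # 101110 46: still EF, not CS5
print(result == CS5)                  # False
```

Starting from a DSCP of 0, the same call would produce 101000 (CS5), which is why precedence and class-selector look interchangeable most of the time.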
We could do something like this:
policy-map uniform-egress
class class-default
set dscp qos-group
But then we'd get literal DSCP values: if the QoS Group is 5, it would set DSCP 5. Not DSCP CS5 (101000), but actual binary 5 - (000101). To accomplish EF -> EXP 5 -> CS5, we'd have to use either a lengthy QoS-Group -> DSCP class-map/policy-map setup, or we could use a table map!
table-map TABMAP
map from 1 to 8 ! Group 1 to DSCP CS1
map from 2 to 16 ! Group 2 to DSCP CS2
map from 3 to 24 ! ...
map from 4 to 32
map from 5 to 40
map from 6 to 48
map from 7 to 56 ! Group 7 to DSCP CS7
policy-map uniform-egress
class class-default
set dscp qos-group table TABMAP
H1#ping
Target IP address: 192.168.1.6
Repeat count [5]: 2
Extended commands [n]: y
Type of service [0]: 184
Sending 2, 100-byte ICMP Echos to 192.168.1.6, timeout is 2 seconds:
..
Success rate is 0 percent (0/2)
H2#sh ip access-list | i match
400 permit ip any any dscp cs5 (2 matches)
480 permit ip any any dscp cs6 (18 matches)
I think the table map use is pretty obvious - take a qos group and map it to some other integer, which has some meaning when applied to a DSCP or IPP field. Now we have the CS5 output we were looking for.
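The values in TABMAP aren't arbitrary: each class selector is just the group number shifted left by three bits. A quick Python sketch of the same lookup follows. Treating unmapped values as copied through unchanged is my assumption about the table-map default, and `apply_table_map` is a made-up helper name.

```python
# Group N -> DSCP CSN, i.e. N shifted into the top three bits of the DSCP field
TABMAP = {group: group << 3 for group in range(1, 8)}


def apply_table_map(qos_group: int, table: dict) -> int:
    """Return the mapped value; unmapped inputs are copied through (assumed default)."""
    return table.get(qos_group, qos_group)


print(TABMAP)
print(apply_table_map(5, TABMAP))  # 40, which is CS5 (101000)
```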
Now clearly, MPLS/EXP QoS needs to be able to be modified on more than just the egress PE. Let's take a look at the other spots we can match and adapt behavior to it.
So far we've been doing matches on the "topmost" label, so what other options have we got? Keeping this oriented towards the R&S CCIE, I'm not going to look at anything other than a 2-tag (VRF + MPLS PE) system. When traffic is received in from the host towards the PE, the PE is going to impose a label for the VRF. It will then add the MPLS transit label on top of that, for reaching the other PE. So to reiterate, we go from zero labels to two labels on the PE.
We can set both those labels, and it's really not hard, but you have to pay attention to what label is being manipulated on which interface. IOS is picky about the order of operations in this case.
For ingress on a PE, we can only set imposition. We clearly can't set "topmost" because there are no labels on the packet yet:
PE1:
policy-map impose1
class class-default
set mpls experimental imposition 4
int fa0/0
service-policy input impose1
H1#ping 192.168.1.6 rep 1
Type escape sequence to abort.
Sending 1, 100-byte ICMP Echos to 192.168.1.6, timeout is 2 seconds:
.
Success rate is 0 percent (0/1)
H2#sh ip access-list | i match
320 permit ip any any dscp cs4 (1 match)
480 permit ip any any dscp cs6 (2 matches)
And what if we set EF manually on H1?
H1#ping
Target IP address: 192.168.1.6
Repeat count [5]: 2
Extended commands [n]: y
Type of service [0]: 184
Sending 2, 100-byte ICMP Echos to 192.168.1.6, timeout is 2 seconds:
..
Success rate is 0 percent (0/2)
H2#sh ip access-list | i match
320 permit ip any any dscp cs4 (3 matches)
480 permit ip any any dscp cs6 (2 matches)
Still CS4, because we're remarking the EXP bits on the inner label on PE1 to 4, that's carried down to PE2, and then the qos-group-based policy remarks the DSCP to CS4.
What about the outer label?
H1#ping 192.168.1.6 rep 1
Type escape sequence to abort.
Sending 1, 100-byte ICMP Echos to 192.168.1.6, timeout is 2 seconds:
.
Success rate is 0 percent (0/1)
We'd need to look at the results on P2, because PE2 never gets the outer label - the PHP process removes it before forwarding the frame.
P2#
*Mar 2 01:44:59.409: MPLS: Fa0/0: recvd: CoS=4, TTL=253, Label(s)=16/19
*Mar 2 01:44:59.409: MPLS: Fa0/1: xmit: CoS=4, TTL=252, Label(s)=19
So P2 receives the outer label as 4, and the inner label as 4. We see 4 coming in on Fa0/0 on label 16, and going out on label 19 on Fa0/1, showing both the PHP process and the fact that both EXP values are the same. That's because the default behavior of a PE is to copy the inner label's EXP bits to the outer label. But what if we wanted to set the outer label to something different?
There's two places we could do that: egress on the PE, or ingress on the P routers.
Let's try the PE first.
PE1:
policy-map topmost1
class class-default
set mpls experimental topmost 2
interface FastEthernet0/1
service-policy output topmost1
H1#ping 192.168.1.6 rep 1
Type escape sequence to abort.
Sending 1, 100-byte ICMP Echos to 192.168.1.6, timeout is 2 seconds:
.
Success rate is 0 percent (0/1)
P2#
*Mar 2 01:53:51.609: MPLS: Fa0/0: recvd: CoS=2, TTL=253, Label(s)=16/19
*Mar 2 01:53:51.609: MPLS: Fa0/1: xmit: CoS=4, TTL=252, Label(s)=19
Now we see EXP 2 on the topmost and EXP 4 on the inner.
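If it helps to picture the label stack, here is a toy Python model of what just happened. The `Label` class and the three helper functions are purely illustrative; the label values 16 and 19 are the ones from the debug output, and the assumption is simply that each label carries its own EXP field.

```python
from dataclasses import dataclass


@dataclass
class Label:
    value: int
    exp: int


def impose(vpn_label: int, transport_label: int, exp: int) -> list:
    """Ingress PE: push both labels with the same EXP. Index 0 is the topmost label."""
    return [Label(transport_label, exp), Label(vpn_label, exp)]


def set_topmost(stack: list, exp: int) -> list:
    """Egress policy: rewrite the EXP of the outer label only."""
    return [Label(stack[0].value, exp)] + stack[1:]


def php_pop(stack: list) -> list:
    """Penultimate hop: drop the outer label, and its EXP along with it."""
    return stack[1:]


stack = impose(vpn_label=19, transport_label=16, exp=4)  # imposition 4
stack = set_topmost(stack, 2)                            # topmost 2
after_php = php_pop(stack)

print(stack[0].exp, after_php[0].exp)  # 2 4, as in "recvd: CoS=2" / "xmit: CoS=4"
```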
It's of some interest that if we wanted the final PE (PE2) to see that value of 2, we'd want to disable PHP. PHP is disabled from the PE, not the router upstream from it. This is done by the PE advertising an explicit-null label for the prefixes terminating on it:
PE2:
mpls ldp explicit-null
H1#ping 192.168.1.6 rep 1
Type escape sequence to abort.
Sending 1, 100-byte ICMP Echos to 192.168.1.6, timeout is 2 seconds:
.
Success rate is 0 percent (0/1)
P2#
*Mar 2 01:57:56.889: MPLS: Fa0/0: recvd: CoS=2, TTL=253, Label(s)=16/19
*Mar 2 01:57:56.889: MPLS: Fa0/1: xmit: CoS=2, TTL=252, Label(s)=0/19
H2#sh ip access-list | i match
160 permit ip any any dscp cs2 (1 match)
We see that P2 forwarded both labels, one of which was the explicit null/0 label (reference 0/19). The PE has to pop both labels before forwarding. Consequently, we also see that the PE now marked CS2 based on the EXP2 in the topmost label.
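The difference between the two behaviors at the penultimate hop can be sketched like this in Python. It's a simplified model, assuming the stack is a list of (label, EXP) tuples with the topmost first; label 0 is the reserved IPv4 explicit-null value, and `penultimate_hop` is a made-up name.

```python
EXPLICIT_NULL = 0  # reserved IPv4 explicit-null label value


def penultimate_hop(stack: list, explicit_null: bool) -> list:
    """stack is a list of (label, exp) tuples, topmost first."""
    if explicit_null:
        _, top_exp = stack[0]
        # Swap the outer label to 0, carrying the outer EXP forward to the PE
        return [(EXPLICIT_NULL, top_exp)] + stack[1:]
    # PHP: the outer label and its EXP are discarded here
    return stack[1:]


received = [(16, 2), (19, 4)]
print(penultimate_hop(received, explicit_null=False))  # [(19, 4)]
print(penultimate_hop(received, explicit_null=True))   # [(0, 2), (19, 4)]
```

With explicit null, the topmost EXP that PE2 matches on is the outer label's 2 rather than the inner label's 4, which lines up with the CS2 hit on H2.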
Now let's see about manipulating the topmost label on a P device.
For clarity's sake on P2, I am disabling the explicit null (re-enabling PHP) on PE2:
PE2(config)#no mpls ldp explicit-null
P1:
policy-map set-topmost
class class-default
set mpls experimental topmost 7
interface FastEthernet0/1
service-policy output set-topmost
Before I show the output of this, it's important to note that setting the topmost EXP on egress is the only option I could find that worked on the P routers. The P routers aren't imposing any labels (just swapping, which is different), so imposition doesn't work, and setting topmost on ingress doesn't appear to do anything (although I am not sure why). And now for the outcome:
H1#ping 192.168.1.6 rep 1
Type escape sequence to abort.
Sending 1, 100-byte ICMP Echos to 192.168.1.6, timeout is 2 seconds:
.
Success rate is 0 percent (0/1)
P2#
*Mar 2 02:22:25.641: MPLS: Fa0/0: recvd: CoS=7, TTL=253, Label(s)=16/19
*Mar 2 02:22:25.645: MPLS: Fa0/1: xmit: CoS=4, TTL=252, Label(s)=19
As anticipated, EXP 7 on the outer label only.
It's also important to note how P routers treat the EXP bits. By default, unless you manually change it with the methods demonstrated in this post, the P router, as it swaps labels hop-by-hop, will always copy the EXP of the old outer label to the new outer label unmodified.
And now for our final topic - policing based on EXP.
P1:
class-map match-all EXP5
match mpls experimental topmost 5
policy-map POLICER
class EXP5
police cir 32000
conform-action transmit
exceed-action set-mpls-exp-topmost-transmit 1
interface FastEthernet0/1
service-policy output POLICER
H1#ping
Protocol [ip]:
Target IP address: 192.168.1.6
Repeat count [5]: 500
Datagram size [100]: 1000
Extended commands [n]: y
Type of service [0]: 184
Sending 500, 1000-byte ICMP Echos to 192.168.1.6, timeout is 0 seconds:
......................................................................
<output omitted>
..........
Success rate is 0 percent (0/500)
This one is tricky to validate - we want to see some MPLS packets leave P1 as 5, and some leave as 1. Unfortunately my ACL doesn't work here (without turning PHP back off) because we're playing with the outer label and not the inner label, and the Uniform Mode config on PE2 won't take heed of the outer label, because it's popped before hitting the egress interface.
Instead, we're just going to look at a sampling of "debug mpls packet" on P2:
*Mar 2 02:48:26.057: MPLS: Fa0/0: recvd: CoS=5, TTL=253, Label(s)=16/19
*Mar 2 02:48:26.057: MPLS: Fa0/1: xmit: CoS=5, TTL=252, Label(s)=19
*Mar 2 02:48:26.097: MPLS: Fa0/0: recvd: CoS=1, TTL=253, Label(s)=16/19
*Mar 2 02:48:26.097: MPLS: Fa0/1: xmit: CoS=5, TTL=252, Label(s)=19
*Mar 2 02:48:26.097: MPLS: Fa0/0: recvd: CoS=1, TTL=253, Label(s)=16/19
*Mar 2 02:48:26.097: MPLS: Fa0/1: xmit: CoS=5, TTL=252, Label(s)=19
*Mar 2 02:48:26.101: MPLS: Fa0/0: recvd: CoS=1, TTL=253, Label(s)=16/19
Let's decipher this a bit:
Remember, P2 is performing PHP for PE2, so what we see coming in and what we see going out will be different. P1 is only making modifications to the topmost label.
*Mar 2 02:48:26.057: MPLS: Fa0/0: recvd: CoS=5, TTL=253, Label(s)=16/19
We got an MPLS packet in as EXP 5.
*Mar 2 02:48:26.057: MPLS: Fa0/1: xmit: CoS=5, TTL=252, Label(s)=19
We popped the upper label and sent the inner label on as EXP 5 as well.
*Mar 2 02:48:26.097: MPLS: Fa0/0: recvd: CoS=1, TTL=253, Label(s)=16/19
By this point, we've already gotten the policer to kick in, so we receive EXP 1.
*Mar 2 02:48:26.097: MPLS: Fa0/1: xmit: CoS=5, TTL=252, Label(s)=19
and we transmit EXP 5 based on the inner label, which was set on PE1 because of the IPP -> EXP ToS Reflection. The policer on P1 did not modify this value.
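For anyone who wants to reason about why the first packet conforms and the following ones get marked down, here is a rough single-rate token-bucket sketch in Python. The burst size of 1500 bytes and the packet timings are assumptions for illustration (the policer config above doesn't state a burst), and the real IOS accounting may well differ.

```python
def police(packets, cir_bps=32000, burst_bytes=1500, exceed_exp=1):
    """packets: list of (arrival_time_s, size_bytes, exp).

    Returns the topmost EXP each packet leaves with.
    """
    tokens = float(burst_bytes)  # bucket starts full
    last = None
    out = []
    for t, size, exp in packets:
        if last is not None:
            # Refill at CIR (bits/s -> bytes/s), capped at the burst size
            tokens = min(burst_bytes, tokens + (t - last) * cir_bps / 8)
        last = t
        if size <= tokens:
            tokens -= size
            out.append(exp)         # conform-action transmit
        else:
            out.append(exceed_exp)  # exceed-action set-mpls-exp-topmost-transmit 1
    return out


# Three 1000-byte EXP 5 packets arriving 40 ms apart (hypothetical timings)
print(police([(0.0, 1000, 5), (0.04, 1000, 5), (0.08, 1000, 5)]))  # [5, 1, 1]
```

At 32 kbps the bucket only refills 4000 bytes per second, so back-to-back 1000-byte pings drain it almost immediately. Only the outer label is remarked; the inner label's EXP is untouched, as the debug output shows.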
That's MPLS QoS/QoS Groups in a nutshell. Hope you enjoyed!
Jeff