The Linux kernel offers us lots of queueing disciplines. By far the most widely used is the pfifo_fast queue - this is the default. This also explains why these advanced features are so robust. They are nothing more than 'just another queue'.
Each of these queues has specific strengths and weaknesses. Not all of them may be as well tested.
This queue is, as the name says, First In, First Out, which means that no packet receives special treatment. At least, not quite. This queue has 3 so called 'bands'. Within each band, FIFO rules apply. However, as long as there are packets waiting in band 0, band 1 won't be processed. Same goes for band 1 and band 2.
SFQ, as said earlier, is not quite deterministic, but works (on average). Its main benefits are that it requires little CPU and memory. 'Real' fair queueing requires that the kernel keep track of all running sessions.
Stochastic Fairness Queueing (SFQ) is a simple implementation of fair queueing algorithms family. It's less accurate than others, but it also requires less calculations while being almost perfectly fair.
The key word in SFQ is conversation (or flow), being a sequence of data packets having enough common parameters to distinguish it from other conversations. The parameters used in case of IP packets are source and destination address, and the protocol number.
SFQ consists of dynamically allocated number of FIFO queues, one queue for one conversation. The discipline runs in round-robin, sending one packet from each FIFO in one turn, and this is why it's called fair. The main advantage of SFQ is that it allows fair sharing the link between several applications and prevent bandwidth take-over by one client. SFQ however cannot determine interactive flows from bulk ones -- one usually needs to do the selection with CBQ before, and then direct the bulk traffic into SFQ.
The Token Bucket Filter (TBF) is a simple queue, that only passes packets arriving at rate in bounds of some administratively set limit, with possibility to buffer short bursts.
The TBF implementation consists of a buffer (bucket), constatly filled by some virtual pieces of information called tokens, at specific rate (token rate). The most important parameter of the bucket is its size, that is number of tokens it can store.
Each arriving token lets one incoming data packet of out the queue and is then deleted from the bucket. Associating this algorithm with the two flows -- token and data, gives us three possible scenarios:
The last scenario is very important, because it allows to administratively shape the bandwidth available to data, passing the filter. The accumulation of tokens allows short burst of overlimit data to be still passed without loss, but any lasting overload will cause packets to be constantly dropped.
The Linux kernel seems to go beyond this specification, and also allows us to limit the speed of the burst transmission. However, Alexey warns us:
Note that the peak rate TBF is much more tough: with MTU 1500
P_crit = 150Kbytes/sec. So, if you need greater peak
rates, use alpha with HZ=1000 :-)
FIXME: is this still true with TSC (pentium+)? Well sort of
FIXME: if not, add section on raising HZ
RED has some extra smartness built in. When a TCP/IP session starts, neither end knows the amount of bandwidth available. So TCP/IP starts to transmit slowly and goes faster and faster, though limited by the latency at which ACKs return.
Once a link is filling up, RED starts dropping packets, which indicate to TCP/IP that the link is congested, and that it should slow down. The smart bit is that RED simulates real congestion, and starts to drop some packets some time before the link is entirely filled up. Once the link is completely saturated, it behaves like a normal policer.
For more information on this, see the Backbone chapter.
The Ingress qdisc comes in handy if you need to ratelimit a host without help from routers or other Linux boxes. You can police incoming bandwidth and drop packets when this bandwidth exceeds your desired rate. This can save your host from a SYN flood, for example, and also works to slow down TCP/IP, which responds to dropped packets by reducing speed.
FIXME: instead of dropping, can we also assign it to a real queue?
FIXME: shaping by dropping packets seems less desirable than using, for example, a token bucket filter. Not sure though, Cisco CAR works this way, and people appear happy with it.
See the reference to IOS Committed Access Rate at the end of this document.
In short: you can use this to limit how fast your computer downloads files, thus leaving more of the available bandwidth for others.
See the section on protecting your host from SYN floods for an example on how this works.
This chapter was written by Esteve Camps <esteve@hades.udg.es>.
First of all, first of all, it would be a great idea for you to read RFCs written about this (RFC2474, RFC2475, RFC2597 and RFC2598) at IETF DiffServ working Group web site and Werner Almesberger web site (he wrote the code to support Differentiated Services on Linux).
Dsmark is a queueing discipline that offers the capabilities needed in Differentiated Services (also called DiffServ or, simply, DS). DiffServ is one of two actual QoS architectures (the other one is called Integrated Services) that is based on a value carried by packets in the DS field of the IP header.
One of the first solutions in IP designed to offer some QoS level was the Type of Service field (TOS byte) in IP header. Changing that value we could choose a high/low level of throughput, delay or reliability. But this didn't provided sufficient flexibility to the needs of new services (such as real-time applications, interactive applications and others). After this, new architectures appeared. One of these was DiffServ which kept TOS bits and renamed DS field.
Differentiated Services is group-oriented. I mean, we don't know nothing about flows (this will be the Integrated Services purpose); we know about flow aggregations and we will apply different behaviours depending on which aggregation a packet belongs to.
When a packet arrives to an edge node (entry node to a DiffServ domain) entering to a DiffServ Domain we'll have to policy, shape and/or mark those packets (marking refers to assign a value to DS field. It's just like the cows :-) ). This will be the mark/value that the internal/core nodes on our DiffServ Domain will look at to determine which behaviour or QoS level apply.
As you can deduce, Differentiated Services involves a domain on which all DS rules will have to be applied. In fact you can think "I will classify all the packets entering my domain. Once they enter my domain they will be subjected to the rules that my classification dictates and every traversed node will apply that QoS level".
In fact, you can apply your own policies into your local domains, but some Service Level Agreements should be considered when connecting to other DS domains.
At this point, you maybe have a lot of questions. DiffServ is more than I've explained. In fact, you can understand that I can not resume more than 3 RFC's in just 50 lines :-).
As the DiffServ bibliography specifies, we differentiate boundary nodes and interior nodes. These are two important points in the traffic path. Both types perform a classification when the packets arrive. Its result may be used in different places along the DS process before the packet is released to the network. It's just because of this diffserv code supplies an structure called sk_buff, including a new field called skb->tc_index where we'll store the result of initial classification that may be used in several points in DS treatment.
The skb->tc_index value will be initially set by the DSMARK qdisc, retrieving it from the DS field in IP header of every received packet. Besides, cls_tcindex classifier will read all or part of skb->tcindex value and use it to select classes.
But, first of all, take a look at DSMARK qdisc command and its parameters:
... dsmark indices INDICES [ default_index DEFAULT_INDEX ] [ set_tc_index ]
What do these parameters mean?
This qdisc will apply the next steps:
New_Ds_field = ( Old_DS_field & mask ) | value
skb->ihp->tos
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - >
| | ^
| -- If you declare set_tc_index, we set DS | | <-----May change
| value into skb->tc_index variable | |O DS field
| A| |R
+-|-+ +------+ +---+-+ Internal +-+ +---N|-----|----+
| | | | tc |--->| | |--> . . . -->| | | D| | |
| | |----->|index |--->| | | Qdisc | |---->| v | |
| | | |filter|--->| | | +---------------+ | ---->(mask,value) |
-->| O | +------+ +-|-+--------------^----+ / | (. , .) |
| | | ^ | | | | (. , .) |
| | +----------|---------|----------------|-------|--+ (. , .) |
| | sch_dsmark | | | | |
+-|------------|---------|----------------|-------|------------------+
| | | <- tc_index -> | |
| |(read) | may change | | <--------------Index to the
| | | | | (mask,value)
v | v v | pairs table
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - ->
skb->tc_index
How to do marking? Just change the mask and value of the class you want to remark. See next line of code:
tc class change dev eth0 classid 1:1 dsmark mask 0x3 value 0xb8
This changes the (mask,value) pair in hash table, to remark packets belonging to class 1:1.You have to "change" this values
because of default values that (mask,value) gets initially (see table below).
Now, we'll explain how TC_INDEX filter works and how fits into this. Besides, TCINDEX filter can be used in other configurations rather than those including DS services.
This is the basic command to declare a TC_INDEX filter:
... tcindex [ hash SIZE ] [ mask MASK ] [ shift SHIFT ]
[ pass_on | fall_through ]
[ classid CLASSID ] [ police POLICE_SPEC ]
Next, we show the example used to explain TC_INDEX operation mode. Pay attention to bolded words:First of all, supose we receive a packet marked as EF . If you read RFC2598, you'll see that DSCP recommended value for EF traffic is 101110. This means that DS field will be 10111000 (remember that less signifiant bits in TOS byte are not used in DS) or 0xb8 in hexadecimal codification.
TC INDEX
FILTER
+---+ +-------+ +---+-+ +------+ +-+ +-------+
| | | | | | | |FILTER| +-+ +-+ | | | |
| |----->| MASK | -> | | | -> |HANDLE|->| | | | -> | | -> | |
| | . | =0xfc | | | | |0x2E | | +----+ | | | | |
| | . | | | | | +------+ +--------+ | | | |
| | . | | | | | | | | |
-->| | . | SHIFT | | | | | | | |-->
| | . | =2 | | | +----------------------------+ | | |
| | | | | | CBQ 2:0 | | |
| | +-------+ +---+--------------------------------+ | |
| | | |
| +-------------------------------------------------------------+ |
| DSMARK 1:0 |
+-------------------------------------------------------------------------+
The packet arrives, then, set with 0xb8 value at DS field. As we explained before, dsmark qdisc identified by 1:0 id in the example, retrieves DS field and store it in skb->tc_index variable. Next step in the example will correspond to the filter associated to this qdisc (second line in the example). This will perform next operations:
Value1 = skb->tc_index & MASK
Key = Value1 >> SHIFT
In the example, MASK=0xFC i SHIFT=2.
Value1 = 10111000 & 11111100 = 10111000
Key = 10111000 >> 2 = 00101110 -> 0x2E in hexadecimal
The returned value will correspond to a qdisc interal filter handle (in the example, identifier 2:0). If a filter with this id exists, policing and metering conditions will be verified (in case that filter includes this) and the classid will be returned (in our example, classid 2:1) and stored in skb->tc_index variable.
But if any filter with that identifier is found, the result will depend on fall_through flag declaration. If so,
value key is returned as classid. If not, an error is returned and process continues with the rest filters. Be
careful if you use fall_through flag; this can be done if a simple relation exists between values
of skb->tc_index variable and class id's.
The latest parameters to comment are hash and pass_on. The first one relates to hash table size. Pass_on will be used to indicate that if no classid equal to the result of this filter is found, try next filter. The default action is fall_through (look at next table).
Finally, let's see which possible values can be set to all this TCINDEX parameters:
TC Name Value Default
-----------------------------------------------------------------
Hash 1...0x10000 Implementation dependent
Mask 0...0xffff 0xffff
Shift 0...15 0
Fall through / Pass_on Flag Fall_through
Classid Major:minor None
Police ..... None
This kind of filter is very powerfull. It's necessary to explore all possibilities. Besides, this filter it's not only used in DiffServ configurations. You can use it as any other kind of filter.
I recommend you to look at all DiffServ examples included in iproute2 distribution. I promise I will try to complement this text as soon as I can. Besides, all I have explained is the result of a lot of tests. I would thank you tell me if I'm wrong in any point.