NOTES ON THE ALGORITHM AND ITS IMPLEMENTATION
05.09.12: The "missing comments"
It might help to think of CoDel as two separate algorithms being run simultaneously. One is time-based and keeps the delay low & utilization high when the bottleneck bandwidth is large (MSS-packet time << target). The other is backlog based and keeps utilization high and delay as low as possible when the bottleneck bandwidth is small (MSS-packet time >> target). The two terms of the "if" are these two different algorithms. (Figure 6 of the Queue paper sort of shows these two regimes in action and the transition between them.)
The comments around the maxpacket stuff may be inadequate since what's going on is possibly a bit subtle. If you test for !curq (i.e., "Is the queue non-empty?), this can completely hose link utilization on low speed links with mixed bulk data/web traffic or bi-directional traffic (intermixed acks with the bulk data packets). Figure 2 of the Queue paper sort of hints at why - TCP's "self clocking" tends to make bulk-data packets arriving at the head of the queue be spaced a bottleneck time apart. So if only one packet is left on the queue and it's small, after it's sent the link will often go idle, waiting a maxpacket bottleneck time for the next packet to show up. To get high utilization on low speed links (links where the time to send an MSS-sized packet is bigger than 'target'), one has to leave enough backlog on the queue to keep the link busy for a worst-case inter-arrival time. The appropriate amount of backlog can be inferred directly from the data stream - just use the maximum outgoing packet size. Remember that our goals were to get the queue as small as possible without impacting utilization.
05.11.12: Calculating the inverse square root in real life
There's no need for a lookup table. We're doing inverse square root
which is even cheaper than square root (only requires multiplies).
And we generate the roots sequentially which means you can start
from the previous value and remove about half the computation. See
E.g., if you save the previous invsqrt (and set it to 1 when you set
count to 1) and use one iteration of Newton's method, the
computation would be:
invsqrt = (invsqrt >> 1) * (3 - count * invsqrt * invsqrt)
The worst case error is 0.02% (when count is 2) and goes down
quadratically with count (the error is 4 parts in 1,000,000 when
count is 1000). If you're doing the computation in 32 bit ints,
since invsqrt is < 1 for count > 1, you prescale count by 2^16
(i.e., treat it as a fixed point binary integer with the binary
point at the 16th bit) so the control law computation becomes the
(interval * invsqrt) >> 16
Just shifts, adds & integer multiplies - can't get much simpler
hardware than that. It's like 2 bytes of sequencer state if their
design uses an ALU and <2000 gates if they unroll the whole thing as