doc/developer/TrafficShapingIssues

Main contributors: Benoit Grégoire, last modified: 2008-06-11

Traffic shaping and abuse control

This is a subject that repeatedly comes up in discussions on the mailing list and on IRC. I've put this document here so I can easily point to it and update it over time.

There are different types of problems that can be solved through traffic shaping, and they must not be confused with one another, because the solutions for each are different. However, solving them all at the same time is difficult in practice, because the solutions interact.

Before we begin, let's give some practical definitions for the two main performance characteristics of networks:

  • Bandwidth: The amount of data the network can transfer in a given unit of time. This is what determines how long it takes to transfer a large file.
  • Latency: The amount of time required for a small packet of data to make a round trip. This is what determines how long it takes before something happens when you click something on the Internet (for example, when you click on your friend's head in Counter-Strike...). Note that except for large file transfers, high latency is the primary cause of perceived slowdown.

Of course, networks have other performance characteristics (jitter, packet loss, etc.), but we'll ignore them for now.

Issue 1: Buffers in DSL/Cable/WiMax modems are way too large for good multi-user performance.

Problem caused: Maxing out upload will lead to higher latency and lower download bandwidth, and vice versa. This is of course much more likely to happen if you have many users.

Historical reason for the problem's existence: The ISPs designed their infrastructure to optimize performance for single users. While large buffers can cause latency problems for even a single user (say you are downloading large files and playing a first-person shooter at the same time), they will lead to the highest peak bandwidth. In other words, the ISP will fare better when the connection is benchmarked on bandwidth.
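
To get a feel for why an oversized buffer hurts latency, here is a rough back-of-the-envelope sketch. It only applies the standard drain-time relation (bits queued divided by link rate). The buffer size and uplink rate are made-up illustrative numbers, not measurements of any real modem, and `queue_delay_seconds` is a hypothetical helper name.

```python
def queue_delay_seconds(buffer_bytes: int, link_bps: float) -> float:
    """Time needed to drain a completely full buffer over a link of link_bps bits/s.

    A packet arriving behind a full buffer waits roughly this long
    before it is even sent.
    """
    if link_bps <= 0:
        raise ValueError("link_bps must be positive")
    return buffer_bytes * 8 / link_bps


# Illustrative assumption: a 256 KiB modem buffer on a 512 kbit/s uplink.
delay = queue_delay_seconds(256 * 1024, 512_000)
print(f"{delay:.1f} s")  # prints "4.1 s"
```

With these assumed numbers, a single saturating upload would add about four seconds to every round trip, which is why keeping the modem's buffer empty matters so much.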

Typical solution(s): Limiting total incoming and outgoing bandwidth to slightly less than the available bandwidth (90-95% of it), and prioritizing TCP ACK packets.
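
As a minimal sketch of the rate-limiting half of that solution, the function below derives a shaping ceiling from a measured link rate. The function name, the default 92% factor and the example rate are assumptions chosen for illustration. ACK prioritization is not shown.

```python
def shaped_rate_kbps(measured_kbps: float, headroom: float = 0.92) -> int:
    """Return the rate to shape to, as a fraction of the measured link rate.

    headroom is the fraction of the measured rate we allow ourselves to use;
    the text above suggests somewhere between 0.90 and 0.95.
    """
    if measured_kbps <= 0:
        raise ValueError("measured_kbps must be positive")
    if not 0 < headroom < 1:
        raise ValueError("headroom must be strictly between 0 and 1")
    return round(measured_kbps * headroom)


# Hypothetical measurement: the modem actually syncs at 1800 kbit/s.
print(shaped_rate_kbps(1800))        # ceiling at the default 92%
print(shaped_rate_kbps(1800, 0.90))  # more conservative ceiling
```

The point of shaping below the real rate is that the queue then builds up in the gateway, where it can be managed, rather than in the modem.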

Solution in the wifidog context: The same solution is applicable. Challenges:

  1. It is impractical to know in advance how much bandwidth is available. Not only can every hotspot have a different Internet plan, but even if one knows the maximum bandwidth of the ISP's plan, that bandwidth may not be available. For instance, if you subscribe to a 5Mbps DSL plan but have long or poor phone lines, your modem might connect at only 1.8Mbps. So that bandwidth has to be measured by wifidog.
  2. It is not always practical for the wifidog gateway to be the very first thing plugged into the modem. If you are plugged into a LAN that is served by a DSL modem, and someone uses bandwidth somewhere else on the LAN, the measurement above will no longer be valid, and your shaping may actually make things worse if you are not very careful.

Issue 2: Users not getting their fair share of bandwidth.

Problem caused: When someone downloads large files using modern P2P applications, in addition to triggering Issue 1, he will cause additional problems. Say 3 users are on the network, two downloading a mail attachment and one downloading a file over P2P. Typically, they will NOT get 1/3 of the available bandwidth each.

Historical reason for the problem's existence: In the beginning, shapers did "fair queuing" between IP/port pairs. So in theory, if the P2P client in the scenario above opened 100 connections, that user would get ~98% of the bandwidth. Modern shapers and default kernel configs aren't nearly that bad, but a frequent problem is that the oldest open connection will keep hogging most of the bandwidth.
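
The ~98% figure follows directly from the arithmetic of per-connection fairness. The sketch below assumes an idealized shaper that gives every connection an equal share; the user names and the helper `per_user_share` are hypothetical.

```python
from typing import Dict


def per_user_share(flows_per_user: Dict[str, int]) -> Dict[str, float]:
    """Fraction of the link each user gets when every *connection* gets an equal share."""
    total = sum(flows_per_user.values())
    if total == 0:
        return {user: 0.0 for user in flows_per_user}
    return {user: n / total for user, n in flows_per_user.items()}


# The scenario from the text: one P2P user with 100 connections,
# two mail users with one connection each.
shares = per_user_share({"p2p": 100, "mail_a": 1, "mail_b": 1})
print(f"{shares['p2p']:.1%}")  # prints "98.0%"
```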

Typical solution(s): Various misguided "solutions" are frequently applied to this problem:

  • Static user classes: Say you have a 3Mbps uplink, and you only allow users to use up to 300Kbps. This will fix the problem (assuming you have no more than 10 concurrent users), at the cost of making the connection suck for everyone, 100% of the time.
  • Connection aging: Make each connection fast at first, and then slow it down (yes, people really do this...). The rationale is (presumably) that web browsing will be fast (lots of small, short connections) and the user will only look at the download speed of a large file at the beginning. Besides being stupid, this will actually give a significant advantage to our P2P user compared to our two mail users: the P2P client will snub peers that look like they are slowing down, and will open brand new, fast connections to new peers.
  • Trying to block/throttle P2P users: Above and beyond the fact that this is ethically questionable for various reasons, it is an arms race that network administrators are unlikely to win. See examples in my last email. It's also extremely shortsighted, since it's very expensive (both computationally (L7 shaping) and in manpower) and has to be revisited over and over.

Solution in the wifidog context: ESFQ (Enhanced Stochastic Fair Queuing), which would allow each wireless client to get no more than its share of bandwidth, but allow the entire amount of bandwidth to be used.

Challenges:

  1. Issue 1 must be solved for ESFQ to have any chance of working at all.
  2. It is not possible to instantly throttle downstream bandwidth. The lag time in doing so can cause problems for ESFQ.
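
The goal stated for Issue 2 (nobody gets more than their share under contention, yet nothing is left unused) corresponds to what is usually called a max-min fair allocation per client. The sketch below illustrates that allocation goal only. It is not how ESFQ works internally (ESFQ is a queuing discipline, not an allocator), and the function name, client names and demand figures are assumptions.

```python
from typing import Dict


def max_min_fair(capacity: float, demands: Dict[str, float]) -> Dict[str, float]:
    """Split capacity among clients so that no client gets more than it asks for,
    and bandwidth left over by light users is shared among the heavy ones."""
    alloc = {client: 0.0 for client in demands}
    active = {client for client, demand in demands.items() if demand > 0}
    remaining = float(capacity)
    while active and remaining > 1e-9:
        share = remaining / len(active)
        # Clients whose outstanding demand fits within an equal share.
        satisfied = {c for c in active if demands[c] - alloc[c] <= share}
        if not satisfied:
            # Everyone wants more than an equal share: split evenly and stop.
            for c in active:
                alloc[c] += share
            break
        for c in satisfied:
            remaining -= demands[c] - alloc[c]
            alloc[c] = demands[c]
        active -= satisfied
    return alloc


# Hypothetical demands in kbit/s on a 3000 kbit/s link.
print(max_min_fair(3000, {"mail_a": 200, "mail_b": 200, "p2p": 10000}))
```

In this example the two mail users get everything they ask for and the P2P user gets the remaining 2600 kbit/s, regardless of how many connections it opens. If all three users were greedy, each would be held to 1000 kbit/s.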

Issue 3: Applications that would need priority, such as VOIP

Problem caused: Depending on the user, some applications should have priority over others for network latency. For example:

  • For one user: VOIP > web browsing. SSH > FTP
  • For another user: World of Warcraft > BitTorrent > Everything else.

Historical reason for the problem's existence: Ever since IPv4 was standardised, there has been a QoS flag that you were supposed to set when an application needs priority. Sadly, human nature being what it is, if users notice that an application goes faster when they set the flag, they will start to set it for every application (not caring that it may slow down their neighbour). Once the neighbour notices, he will very rationally set the flag as well to defend himself, leaving the whole network ... right back where it started! So in practice no one ever obeys the QoS flag.

Typical solution(s): First, try to discriminate the type of service from the port range (or more sophisticated packet analysis). Then, make a subjective value judgement about which service is more important than another. Finally, give priority according to that grid. The problems are:

  1. once again, the questionable ethics of it
  2. the simple fact that what is a priority for one user may not be for another
  3. that if ISPs started to give priority to everything VOIP, you can be sure that P2P apps would offer an option to transfer data over VOIP protocols.

Solution in the wifidog context: Actually obey the QoS flag, but only up to a part (say 10%) of the slice the user would get in the solution to Issue 2. In other words, pass ACKs first, QoS traffic second (up to 10% of the user's slice), and pass the rest after.

Challenges:

  1. Issues 1 and 2 must be solved.
  2. If your VOIP handset doesn't set the QoS flag, it doesn't help you (although you'll probably still get decent performance from the solution to Issue 2).
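
The ordering proposed for Issue 3 (ACKs first, flagged traffic second but only within a budget, everything else last) could be sketched as a per-user classifier like the one below. This is an illustration under stated assumptions, not wifidog code: the class name, the band constants, and the idea of accounting the budget in bytes per interval are all hypothetical.

```python
from dataclasses import dataclass

# Priority bands, lowest number is served first (hypothetical names).
ACK_BAND, QOS_BAND, BULK_BAND = 0, 1, 2


@dataclass
class UserQosBudget:
    slice_bytes: int       # bytes this user's fair slice allows per interval
    qos_percent: int = 10  # part of the slice that flagged traffic may use
    qos_used: int = 0      # flagged bytes already accepted this interval

    def classify(self, size: int, is_pure_ack: bool, qos_flag: bool) -> int:
        """Return the priority band for one packet of the given size in bytes."""
        if is_pure_ack:
            return ACK_BAND
        budget = self.slice_bytes * self.qos_percent // 100
        if qos_flag and self.qos_used + size <= budget:
            self.qos_used += size
            return QOS_BAND
        # Unflagged traffic, or flagged traffic beyond the budget.
        return BULK_BAND

    def new_interval(self) -> None:
        """Reset the budget at the start of each accounting interval."""
        self.qos_used = 0


user = UserQosBudget(slice_bytes=10_000)
print(user.classify(40, is_pure_ack=True, qos_flag=False))    # ACK band
print(user.classify(600, is_pure_ack=False, qos_flag=True))   # within the QoS budget
print(user.classify(600, is_pure_ack=False, qos_flag=True))   # over budget, falls to bulk
```

Because the budget is a fraction of the user's own slice, setting the flag on everything gains a user nothing at his neighbour's expense: the excess simply competes with his own bulk traffic.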

Issue 4: Chronic bandwidth abuse over a long period/reducing bandwidth cost.

Bandwidth costs real money, and takes real resources to provide. Whether you run a free or for-pay network, you may decide that there is a maximum amount of network resources that your users should be allowed to use per day/month/hotspot.

Typical solution(s):

  • Bandwidth capping: Not allowing the user to use more than 1Mbps
  • Data transfer capping: Not allowing the user to transfer more than 40GB per month.

Solutions in the wifidog context:

  1. Dynamic abuse control. Allow defining criteria for maximum data transfer per unit of time, at a hotspot, over the entire network, etc.
  2. Opening hours support. For free networks designed to be used in public places, closing access when the public place is closed can drastically reduce monthly bandwidth consumption.
  3. Supporting the "password of the day" model. This drastically reduces the bandwidth leeched by a hotspot's neighbors by forcing them to physically visit the place to get access.

Note that technically, none of the above requires any kind of actual traffic shaping. Traffic shaping is involved if you want to implement policies that are a little less drastic than cutting off access once a user/machine reaches the threshold. Let's say that your quota is 5GB per hotspot per month. Instead of cutting off the user once he reaches 5GB, at 4GB you would progressively slow down the user in such a way that he would never reach 5GB, or you would slow down the user to a low maximum bandwidth (say 128Kbps), or make his traffic pass after that of all other users.
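
One possible shape for such a "softer" policy is sketched below. It combines two of the options just mentioned: a linear slowdown that starts at 4GB and bottoms out at a 128Kbps floor at 5GB. The linear ramp, the function name and the parameter names are assumptions made for illustration. A linear ramp does not by itself guarantee that the quota is never reached, so treat this as a starting point rather than a finished policy.

```python
def allowed_rate_kbps(
    used_gb: float,
    full_kbps: float,
    soft_gb: float = 4.0,
    hard_gb: float = 5.0,
    floor_kbps: float = 128.0,
) -> float:
    """Maximum rate for a user given the data already transferred this period.

    Below soft_gb the user is unrestricted; between soft_gb and hard_gb the
    rate falls linearly; at or above hard_gb the user is held at floor_kbps.
    """
    if hard_gb <= soft_gb:
        raise ValueError("hard_gb must be greater than soft_gb")
    floor_kbps = min(floor_kbps, full_kbps)
    if used_gb <= soft_gb:
        return full_kbps
    if used_gb >= hard_gb:
        return floor_kbps
    progress = (used_gb - soft_gb) / (hard_gb - soft_gb)
    return full_kbps - progress * (full_kbps - floor_kbps)


# Hypothetical user on a 1000 kbit/s slice.
for used in (3.0, 4.5, 5.0):
    print(used, allowed_rate_kbps(used, full_kbps=1000))
```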

Challenges:

  1. The wifidog protocol needs to be redesigned to allow the auth server to specify the maximum bandwidth for each user individually, and update that number periodically.
  2. The user could open a new account/spoof the MAC address. There are ways to make that very inconvenient, but that's another arms race (and another feature list altogether).

Issue 5: Selling the user monthly access with fixed bandwidth (say 512Kbps).

Typical solution(s): Client side user classes

Solutions in the wifidog context: server side token architecture and per user bandwidth specification. Basically, if we have per user bandwidth specifications in the gateway and protocol, selling fixed slices is just a degenerate case of the general problem.