GSU’s internet went down today. Actually, the internet-based phone system hit a critical number of restarts and took the rest of the network down with it. It was basically a self-inflicted denial-of-service attack.
In essence, if too many phones request a new IP address at once, the DHCP server takes longer to respond than a phone is willing to wait for its address. (It is a little more complex than this, as the phones also download their system software, so the process takes a “measurable” interval.) So the phone stops listening and eventually issues a new request, which of course restarts the process. If that request is not answered in time, the phone issues another one after a “random” delay. The whole thing snowballs out of control, and soon the network is full of nothing but DHCP requests and invalid responses (answers to requests the phones have already given up on).
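The snowball effect can be reproduced in a toy model. This is only a sketch under my own assumptions, not how the real phones or DHCP server behave: time moves in ticks, the server answers `rate` requests per tick in arrival order, a phone gives up after `timeout` ticks and retries after a random delay of up to `max_backoff` ticks. All of these names and numbers are made up for illustration.

```python
import random
from collections import deque

def simulate_restart(n_phones, rate=10, timeout=5, max_backoff=4,
                     ticks=200, seed=0):
    """Toy model of a mass restart; returns (phones configured, queue length)."""
    rng = random.Random(seed)
    queue = deque()                              # pending (phone, issue_tick) requests
    configured = set()                           # phones that got an address
    waiting_since = {}                           # phone -> tick of its live request
    retry_at = {p: 0 for p in range(n_phones)}   # every phone asks at tick 0

    for tick in range(ticks):
        # Phones whose retry delay has expired issue a fresh request.
        for p in [p for p, t in retry_at.items() if t == tick]:
            del retry_at[p]
            waiting_since[p] = tick
            queue.append((p, tick))

        # The server answers up to `rate` requests, oldest first.
        for _ in range(min(rate, len(queue))):
            p, issued = queue.popleft()
            if waiting_since.get(p) == issued:
                configured.add(p)                # the phone was still listening
                del waiting_since[p]
            # Otherwise the phone already gave up: the answer is wasted work.

        # Phones that have waited too long stop listening and schedule a retry.
        for p, since in list(waiting_since.items()):
            if tick - since >= timeout:
                del waiting_since[p]
                retry_at[p] = tick + rng.randint(1, max_backoff)

    return len(configured), len(queue)

small = simulate_restart(50)     # below the threshold
large = simulate_restart(1000)   # far above it
print(small, large)
```

With these made-up numbers the small group is fully served and the queue drains, while the large group stalls: the server spends all of its time answering requests that nobody is listening for any more, and the queue keeps growing.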
It would not make sense to size the network and the phone DHCP server for the maximum possible load (although a somewhat bigger size might make sense). The extra capacity would sit idle 99.999% of the time, which is a huge waste of resources.
The retry scheme is a stochastic algorithm that works well under moderate load and fails catastrophically above a critical threshold. Is there a way to cross over to a more deterministic algorithm when it is needed? (Right now they more or less manually reset parts of the network.)
A simple solution would be a gated network of physically distinct subnets, each smaller than the maximum capacity of the DHCP server. A deterministic switch between the subnets would then let them recover one at a time, while limiting the damage. This description is a bit simplistic, but it could work. It is somewhat similar in spirit to “token ring”, but for DHCP only.
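To make the gating idea concrete, here is a rough sketch of how the subnets could be sized and scheduled. It assumes a toy model of my own, in which the server answers `rate` requests per tick and a phone waits `timeout` ticks; the function name `gate_schedule` and the numbers are hypothetical.

```python
import math

def gate_schedule(n_phones, rate=10, timeout=5):
    """Return (subnet index, tick its gate opens, phones in it) per subnet."""
    # Assumption: the server answers `rate` requests per tick and a phone
    # waits `timeout` ticks, so rate * timeout phones can be answered
    # before any of them gives up.
    safe_size = rate * timeout
    n_subnets = math.ceil(n_phones / safe_size)
    schedule = []
    for i in range(n_subnets):
        size = min(safe_size, n_phones - i * safe_size)
        # Each subnet holds the gate for one full timeout window.
        schedule.append((i, i * timeout, size))
    return schedule

for subnet, opens_at, size in gate_schedule(120):
    print(f"subnet {subnet}: gate opens at tick {opens_at}, {size} phones")
```

The total recovery time grows linearly with the number of phones, but no subnet ever puts more load on the server than it can answer in time.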
Another simple solution would be to use a different physical layer for the phones and the data. This would work, but it would defeat the economic advantage of the internet phones. On the other hand, it would preserve the integrity of the data network, which is sort of important when the students are registering online for their courses and faculty are trying to write grant proposals. (But then that’s another cost center.)
Edit:
Apparently the crash was caused by “water damage”.
There is a software-only solution, though. The central server should monitor the depth of its queue of unresolved requests. When the queue gets too deep, the server should issue “shutup” messages to the clients, reset the queue to zero, and then systematically (in O(n)) check each client and restart it as needed. While this may take longer for a total reset, the time is bounded and, more importantly, it will not shut the network down.
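Here is a minimal sketch of that idea in a toy model. Everything in it is an assumption for illustration: real DHCP has no “shutup” message, and the tick-based server, the `max_depth` threshold, and the parameter values are invented. The server runs the usual first-come-first-served queue until the queue depth crosses `max_depth`, then silences the clients, clears the queue, and walks through the unconfigured phones in order.

```python
import random
from collections import deque

def simulate_guarded(n_phones, rate=10, timeout=5, max_backoff=4,
                     max_depth=100, ticks=300, seed=0):
    """Toy model with a queue-depth guard; returns (configured, finish tick)."""
    rng = random.Random(seed)
    queue = deque()                              # pending (phone, issue_tick) requests
    configured = set()
    waiting_since = {}                           # phone -> tick of its live request
    retry_at = {p: 0 for p in range(n_phones)}   # every phone asks at tick 0
    sweep = deque()                              # phones left in the deterministic pass
    done_tick = None

    for tick in range(ticks):
        if sweep:
            # Deterministic mode: configure phones in order, `rate` per tick.
            for _ in range(min(rate, len(sweep))):
                configured.add(sweep.popleft())
        else:
            # Normal mode: phones request, time out, and retry at random.
            for p in [p for p, t in retry_at.items() if t == tick]:
                del retry_at[p]
                waiting_since[p] = tick
                queue.append((p, tick))
            for _ in range(min(rate, len(queue))):
                p, issued = queue.popleft()
                if waiting_since.get(p) == issued:
                    configured.add(p)
                    del waiting_since[p]
            for p, since in list(waiting_since.items()):
                if tick - since >= timeout:
                    del waiting_since[p]
                    retry_at[p] = tick + rng.randint(1, max_backoff)
            if len(queue) > max_depth:
                # Guard trips: tell everyone to be quiet, drop the queue,
                # and line up every unconfigured phone for the sweep.
                queue.clear()
                waiting_since.clear()
                retry_at.clear()
                sweep.extend(p for p in range(n_phones) if p not in configured)
        if done_tick is None and len(configured) == n_phones:
            done_tick = tick

    return len(configured), done_tick

print(simulate_guarded(50))     # guard never trips
print(simulate_guarded(1000))   # guard trips on the first tick
```

In this model the sweep needs at most about `n_phones / rate` ticks, so the reset time is bounded and grows linearly with the number of phones, which is the O(n) behaviour described above.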