Captain’s log, stardate 170619. I have just piloted the SS Enterprise WLAN out of the Codesuck Nebula after hostilities with both the Switchites and the WAPs. It was a trying 48 hours of lost man-hours cleaning up after a breakdown in WLC update procedures, but I’m glad to be heading home. Regrettably, we did suffer casualties. Two valiant 802.11ac access points were cut down in their prime (hee hee, Prime). Ah well, time for an adult beverage and some cheese.
– Captain Beef Wellington, Intergalactic Wi-Fi Warrior
I feel for Captain Wellington. In fact, it’s impossible to tell his story without revealing a bit of my own. Do you remember this missive about network bits and pieces not living up to their responsibilities? Of course you do. And now that the cleanup work from that misadventure is done, let’s talk about the indirect costs of a code upgrade gone a bit wrong on a large wireless network.
On this particular code upgrade, I did three failover pairs of WLCs. The first pair hosts 144 APs. The second, 908 APs. The third currently has 3,212 access points. All WLCs are the same model, had the same starting and ending code, and all APs are uplinked to switches of two different models (but all running the same OS version).
The first WLC pair went swimmingly. The WLC pair and 144 access points upgraded in a textbook maintenance maneuver that yielded no surprises.
The second upgraded pair was generally OK, but three APs were orphaned. They seemingly lost their configurations and names, and kept hitting the upgraded controller and falling away. Over, and over, and over, and over, and over, and over. This went on until their switchports were identified and the interface PoE was cycled. Then TWO came back fully configured, properly named, and code-upgraded, while the remaining AP did upgrade, but lost its shit and had to be fully reconfigured.
- INDIRECT COSTS:
- The loss of use of each AP during their little visit to the Muffin Man
- Around a man-hour and a half to locate the APs’ MAC addresses on the switching side, deal with the PoE, verify, and configure the lone problem child.
The last and largest environment didn’t go so well. That’s despite the fact that this environment has not changed much since the last upgrade, and that I have done this procedure many times in the past. Here, roughly 80 access points did not take the upgrade. For the math-minded, that’s around 2.5 percent of the APs in this big environment. Many completely dumped their configs and went stupid, some only seemed stupid until PoE resets, about half needed multiple PoE resets (after waiting a goodly period each time to see if the AP would snap out of it), and two completely failed and had to be replaced.
- INDIRECT COSTS:
- The loss of use of each AP during its outage, which is a lot of capacity denied to end users
- Because the APs that failed the upgrade were scattered far and wide across several dozen switches, and many needed to be power-cycled more than once, it took at least the equivalent of five full working days at the engineer level to tame the chaos and reconfigure the APs that needed it.
- DIRECT COSTS:
- Two current-model APs were irrecoverably lost in this process
- One man-hour per AP to get each replaced
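To put rough numbers on the pain: a quick back-of-the-envelope using the figures above, assuming eight-hour working days (the AP and day counts come from this post; the hours-per-day figure is my assumption):

```python
total_aps = 3212
failed_aps = 80          # roughly 80 APs did not take the upgrade
engineer_days = 5        # equivalent full engineer days spent on cleanup
hours_per_day = 8        # assumption: an eight-hour engineering day

failure_rate = failed_aps / total_aps
hours_per_failed_ap = engineer_days * hours_per_day / failed_aps

print(f"failure rate: {failure_rate:.1%}")                    # 2.5%
print(f"cleanup per failed AP: {hours_per_failed_ap:.1f} h")  # 0.5 h
```

Half an hour of engineer time per misbehaving AP may not sound like much until you multiply it by 80, and none of it shows up as a line item on the upgrade plan.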
Items of note throughout this:
- We did the code upgrade in the name of stability and bug fixes on the WLAN side (yes, irony, shut up)
- We recently learned of a PoE bug or two on the switching side, which may or may not have been in play
- Top-end gear is not without problems
- Even “routine” changes can go off the rails, at least in this product set
- System complexity and scale lead to more indirect costs in the form of support overhead; that’s just a fact of life on certain product sets
- There is no moving away from bugs, only trading bugs for other bugs, at least in my own reality
And there you have it.