Tag Archives: Bugs

Chasing Down Errant Cisco APs

Some product sets definitely require more care and feeding than others… that’s all I’ll say in that regard lest I let go with the rant that is on the tip of my tongue. What I’m about to present is in regard to Cisco 3702 access points specifically on 8.5.151.0 code, although I have no doubt the condition applies to many models and code versions.

Problem statement

The freakin’ APs cut and run. They go over the wall, but they are real sneaky about it. They do it in a way that ain’t so easy to detect… Or in Cisco’s own words: “As per FN70330 – IOS AP stranded due to flash corruption issue, due to a number of software bugs an AP in normal operation, the flash file system on some IOS APs may become corrupt over time. This is seen especially after an upgrade is performed to the WLC but not necessarily limited to this scenario. AP may be working fine, servicing client, etc, while on this problem state which is not easily detectable”.

See this Cisco doc as the source of the above statement, and please know that I’m not saying that MY issue is absolutely THIS issue. Although it could be. There are many fine bugs to choose from.

What it Looks Like, and What it Doesn’t Look Like.

Cisco rightly says that the “problem state is not easily detectable”, and I agree. We’ll focus on a single 3702 AP for this blog, but I know from first-hand conversation that some folks have been bitten by dozens or hundreds of similar free-spirited APs all going for an intent-based spontaneous joyride in the name of innovation.

Prime Infrastructure doesn’t show my AP as being “out”, and I have yet to find any reliable way to show this condition via any other reports in PI. If you ping it, it responds. Look at it in CDP, it’s there. But… all is not well, sir. Not at all, sir. Despite the obvious indicators. This AP that has been up and fine and doing its job suddenly got cabin fever:

[Image: 4704]

So… the normal ways of finding out that APs are essentially out of service (like using your expensive NMS) don’t apply in this scenario, and you basically have to stumble upon it, or be alerted when users can’t connect to the AP- which unfortunately is a common canary in the coalmine when dealing with bugs in this particular framework.

Say there- did I mention that the AP never recovers in this situation? It stays in perpetual “Downloading” until you figure out a way to recover it. Value. Buy more licenses… because the one this AP is using is worthless while it’s in this innovative state of self-determinism.

No Resetting Through the Controller UI

It stands to reason that maybe rebooting the AP will get it back to where it needs to be. That’s a pretty common troubleshooting step. But you can’t do it from the controller interface while the AP is trying to go to a happy place that it will never reach.

[Image: 4704-1]

Allow me to digress… I like to think that when the AP gets to this point, it probably hears Soul Asylum singing Runaway Train in its mind…

It seems no one can help me now
I’m in too deep
There’s no way out
This time I have really led myself astray

Runaway train never going back
Wrong way on a one-way track
Seems like I should be getting somewhere
Somehow I’m neither here nor there

Ahem. Back to topic. (But what a great song.)

Off to the Switch We Go

Being that we can’t reboot THE VALUE from the controller interface while the AP is riding the runaway train, we need to visit the switch for command line operations. Basically, we pull the PoE plug via command entry, then restore it (informational note: no innovation licenses are required to enter commands- yet).

[Image: 4704-2]
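For anyone who would rather script the PoE bounce than type it, here is a minimal sketch of the command sequence. It assumes a Cisco IOS switch and uses the standard `power inline never` / `power inline auto` interface commands; that may or may not match exactly what I typed in the screenshot. The function name and the port name are made up for illustration.

```python
from __future__ import annotations


def poe_bounce_commands(interface: str) -> list[str]:
    """Return IOS CLI lines that remove and then restore PoE on one switch port.

    Assumption: the switch accepts 'power inline never' and 'power inline auto'
    in interface configuration mode.
    """
    port = interface.strip()
    if not port:
        raise ValueError("interface name is required")
    return [
        "configure terminal",
        f"interface {port}",
        "power inline never",  # cut PoE so the AP loses power
        "power inline auto",   # restore PoE so the AP boots and tries to rejoin
        "end",
    ]


if __name__ == "__main__":
    # Hypothetical port name, for illustration only.
    print("\n".join(poe_bounce_commands("GigabitEthernet1/0/12")))
```

The function only builds the lines; how you deliver them (console, SSH, your automation tool of choice) is up to you. When entering them by hand, it is probably wise to pause a few seconds between the two `power inline` lines so the AP fully powers down.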

If all goes well, a couple of minutes later you’ll have an AP that has atoned for its separatist thoughts of independence and freedom, and you can welcome it back to the fleet.

[Image: 4704-3]

Simple Fix (Maybe)

I’m guessing that you’d agree after reading this that the fix for my situation was fairly easy. I’ve seen maybe 20 of these goofball 3702 instances in the last year, now more reliably found after my office mate found a way to poll them with some degree of success via SNMP using AKIPS.

[Image: day downloading]
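The polling idea boils down to this: an AP that reports “Downloading” for a few minutes is probably just upgrading, while one that reports it for hours has gone over the wall. Below is a rough sketch of that check. It is not what AKIPS does internally; the sample format, the state strings, the 30-minute threshold, and the function name are all assumptions for illustration, and you would feed it whatever your own poller collects.

```python
from __future__ import annotations

from datetime import datetime, timedelta


def find_stuck_aps(samples, threshold=timedelta(minutes=30), stuck_state="downloading"):
    """Return sorted names of APs whose latest unbroken run of stuck_state
    has lasted at least threshold.

    samples: iterable of (ap_name, timestamp, state) tuples, in any order.
    """
    by_ap: dict[str, list[tuple[datetime, str]]] = {}
    for name, when, state in samples:
        by_ap.setdefault(name, []).append((when, state.lower()))

    stuck = []
    for name, history in by_ap.items():
        history.sort()  # oldest sample first
        run_start = None
        # Walk backward from the newest sample while the AP is still stuck.
        for when, state in reversed(history):
            if state != stuck_state:
                break
            run_start = when
        if run_start is not None and history[-1][0] - run_start >= threshold:
            stuck.append(name)
    return sorted(stuck)


if __name__ == "__main__":
    t0 = datetime(2020, 1, 1, 8, 0)
    demo = [
        ("ap-lobby", t0, "associated"),
        ("ap-lobby", t0 + timedelta(minutes=10), "downloading"),
        ("ap-lobby", t0 + timedelta(minutes=50), "downloading"),
    ]
    print(find_stuck_aps(demo))
```

An AP that was downloading earlier but has since rejoined is not flagged, because only the most recent run counts. Tune the threshold to however long a legitimate image download takes on your network.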

So… finding them may be harder than fixing them, depending on how you are equipped and IF you are dealing exactly with whatever nuanced issue I happen to have in play. But let me again bring you back to this Cisco doc on the topic of corrupt AP flash. Your situation may end up being a lot messier than mine, given the hoops mentioned in the document.

That Which Pisses Us Wireless Folk Off- Vendor Edition

Now there’s a title. And since you’re reading this, you bit on it… Sucka. Now that you’re here, let’s share some observations from the WLAN community over the last few weeks. This is not (totally) a “Lee’s complaining again” blog; it’s more a collection of sentiments from dozens of friends and colleagues from across the Wi-Fi Fruited Plain that stuck with me for one reason or another.

Most of these observations are aimed squarely at our vendors- those who we do business with “above” as we shape their offerings into the systems and services we offer to clients “below”, with us in the middle.

You may not agree with all of these. Perhaps some of your own beefs didn’t make my list. Either way, I’d love to hear from you in the comments section. Now, in no specific order:

  • Marketing claims. OK, we’re starting out with the obvious. Wi-Fi marketing has always been about hype, far-fetchedness, and creative blather. Nothing new under the sun here. I truly hope that your 10x better Wi-Fi is serving up 500 APs per client that are all streaming 62 Netflix movies each simultaneously from a range of 37 miles away from the AP.
  • “Enterprise” switches that don’t stack. Stacking is neither new, nor special. Do your bigger switches stack? Is it not even an option? If not, maybe tone down calling them “enterprise”.
  • Big Bucks for power cords. You got major balls as a vendor if you’re pricing garden variety power cables at $20 per. Shame on you. Same same for PoE injectors, nothing-special antennas, rack mounts and assorted other parts/pieces that can be gotten for pennies on YOUR dollar elsewhere. C’mon…
  • No version numbers. By now, we all get “cloud”. And most cloud infrastructure vendors ARE using OS version numbers as a point of reference for their customers. The absence of version numbers becomes more onerous as ever more features get added. Give us the damn version number. Do it. Doooooo it.
  • No CPU/Memory/Interface stats. It doesn’t matter what the “thing” is, or whether it’s cloud-managed or not. EVERY interface needs to show statistics and errors, and every thingy needs to show CPU and memory information. Whatever your argument to the contrary may be, I promise that you are wrong.
  • Frequent product name changes. Just stop already.
  • The same stinking model numbers used for everything. Why? Maybe someone has a 3 and 5 fetish out in Silly Valley. It’s confusing, it’s weird, and it’s weirdly confusing in its weirdness, which leaves me confused.
  • The notion that EVERYTHING to do with wireless must be monetized. After a while, we start to feel like pimps as opposed to WLAN admins. I get that vendors need to be creative with new revenue streams, but it can be carried to extremes when applied to the WLAN ecosystem.
  • Too many models. It seems like some vendors must be awarding bonuses to HW developers based on how many different versions of stuff they can turn out, but customers are left confused about what to use when and where and why versus the other thing down the page a bit. Variety is good, but massive variety is not.
  • Complexity. This might be news to some vendors: the ultimate goals in deploying your systems for both us and the end user are STABILITY and WELL-PERFORMING ACCESS. Somewhere, vendors have lost track of that, and they are delivering BLOATED and HYPER-COMPLICATED FRAMEWORKS that place a cornucopia of buggy features higher on the priority list than wireless that simply works as users expect it to.
  • Slow quote/support ticket turnaround. Most times when we ask for pricing or open a case with technical support, it’s because there is a need. As in, we need something. And our assumptions are that our needs will be fielded with some degree of urgency, as we’re all in the business of service at the end of the day. No one likes slow service. No one likes asking over, and over, and over, and over, and over if there are any updates to our need possibly getting addressed.
  • Escalation builds/engineering code bugs. At the WLAN professional level, most of us work off the assumption that if we don’t typically do our jobs right the first time, we may not get follow-up work and ultimately may be unemployed. That’s kind of how we see the world. I’m guessing that WLAN code developers play by different rules. ‘Nough said.
  • Bad, deceitful specs. Integrity is what keeps many of us in the game as professionals. Our word is our bond, as they say. Can you imagine telling someone that you can deliver X, but then when they need X, you can actually only provide a fraction of X- and then expecting that person to not be pissed off? Why are networking specs any different? Enough truth-stretching and hyper-qualified performance claims that you have to call a product manager and sign an NDA to get the truth about.
  • Mixed messages. OK, we ALL own this one- not just the vendors. The examples are many- grand platitudes and declarations that might sound elegant and world-changing in our own minds, but then they often fizzle in the light of day. Things like…
    • We need mGig switches for 802.11ac! 
    • We’ll never need more than a Gig uplink for 802.11ac!
    • 2.4 GHz is dead!
    • Boy, there’s a lot of 2.4 GHz-only clients out there!
    • We’re Vendor X, and we’re enterprise-grade!
    • Why do I see Vendor X gear everywhere, mounted wrong and in nonsensical quantities for the situation?
    • That one agency is awesome at interoperability!
    • Why does so much of this stuff NOT interoperate?
    • You must be highly-skilled with $50K worth of licensed WLAN tools or your Wi-Fi will suck!
    • Vendor X sells more Wi-Fi than anyone, most people putting it in are obviously untrained, yet there are lots of happy clients on those networks!
    • Pfft- just put in one AP per classroom. Done!
    • Cloud Wi-Fi is a ripoff!
    • Cloud Wi-Fi saves me soooo much money and headaches!
    • Here’s MY version of “cloud!”
    • Here’s MY version of “cloud!”
    • I freakin hate how buggy this expensive gear is!
    • At least those bugs are numbered on a pretty table!

It goes on and on and on. Always has, always will. Behind the electronics that we bring to life and build systems from are We the People. The humanity involved pervades pretty much everything written here, from all sides and all angles. And I have no doubt that every vendor could write their own blog called “That Which Pisses Us Vendor Folk Off- WLAN Pro Edition”. Touché on that.

Ah well- there’s still nothing I’d rather be doing for a living.

The Thing About Code

Code is amazing stuff. Good code puts people into space, runs super-colliders, and keeps the Internet ticking. Bad code on the other hand, winds up on wireless controllers.

OK, just kidding.

Maybe.

For the life of me I can’t understand how vendors keep crappy code listed on their download pages, often at the top of the list, for customers to find. You know, the kind of half-baked stuff that everyone from sales engineers to tech support cringes at when you tell them what version you are running. Which often also happens to be the same code that others from the same company declare to be “the good code”, and recommend that you move to in order to get past some other problem with earlier buggy code. Ever been there? It pretty much sucks, yet this rhythm seems to have become an operational model for some vendors.

This is where we pause, and I read minds. Quiet please….. quiet….. shhhhhh. I’m picking something up….. ah yes, got it. The “testing” fallacy- I’ll address that….. wait, one more coming…. what’s that? Oh, sure- the release notes thing. Let’s talk about both of those.

I hear an awful lot of “test, test, test!” from colleagues and respected industry folk. And I do agree that nothing, including code, should be rushed into. But please tell me- other than just being a mantra, what does “test, test, test!” really mean? Does it mean load the code on a test box, configure it the way you’d use it in prod, throw clients at it, and then wait for smoke and screams? OK, that’s acceptable. Or maybe it means that you should actually take what I just mentioned and add whatever new features interest you into the mix, and make sure they don’t create problems. Fine, yes- this too is arguably reasonable.

But guess what, vendors? If you expect us (and evidently some of you do) to be your crowd-sourced QA departments, let’s call it what it is and put warning labels on code:

“Caution: we either don’t quite know WTF this code will do in many environments, or we have some inkling, and it ain’t pretty. But we’re putting it out there anyways so you can be our debug squad. Stuff that has always worked now may crash, but it’s worth it because this is NEW code.”

We buy the hardware and code, pay for support on it all, eat the pain and suffering that comes with the shaky code, and the vendor gets to say “you really need to test new code and let us know what you find”. Everybody wins- except for the customer.

We don’t know what modules and packages were added and changed, and we’re not programmers with access to source views to that which is causing us pain. (Funny how we don’t tend to have these problems in the mobile network world.)

Then there are the release notes. Hats off to vendors that are open and honest about their shortcomings with their code. But… when the same bugs are listed for years, you start not to pay attention. And some unresolved issues sound minor, but can bring the house down. Others sound apocalyptic, but actually happen so rarely or have such minimal real impact that they can be safely disregarded. But they are all listed in the same terse “you figure it out, and good luck with that” manner. The onus is unfairly on the customer to wade through it all, and that is wrong for COTS gear- it would be different if this were all open source.

So how do “we” fix this?

  • Stop putting out shitty code. Plain and simple. Just stop. New features aren’t worth instability- client access is the key mission of the WLAN, and if the WLAN is melting down from crappy code the key mission is compromised.
  • If code is found to be crappy on a catastrophic scale, PULL IT. Don’t leave it up for others to find. And reach out to customers proactively, like an automotive recall, to let us know about it. Many WLANs these days have million-dollar-plus price tags- we deserve better.

It’s time to stop the code insanity.