Tag Archives: Code bugs

Unfortunately, Seeing Isn’t Always Believing With Meraki Network Topology View (updated)

Update: After opening a support case, Meraki was able to verify what I describe below in their own lab testing: some MX ports were not doing LLDP right. And… the fix: “The issue of the transmission of LLDP packets from the MX is fixed in versions starting MX14.2”, which so far is proving to be the case. (Now we need to solve client MAC addresses showing up on the wrong switch ports, but that’s another tale for another time.) Read on if you’d like, but keep in mind that they DID fix this problem at long last.

The only thing worse than an important feature that’s missing from a network management solution is an important feature that works wrong. Those of us who pay for these systems expect the features we trade our dollars for to actually, well, work. And bad information is the worst of all: it sends us down time-wasting bad roads.

Meraki- you have a problem. Your topology view doesn’t refresh for hours, days, or weeks after changes are made. And for whatever reason, you have not given us decent LLDP tools that we can invoke on demand. That’s the polite description, which can be abbreviated to “your topology view sucks at times.”

Before I show and tell my latest frustration in this regard, let me share Meraki’s summary of its topology view and evidence that what I’m about to describe is not my frustration alone. See this post from the Meraki community. Now on to the defective topology feature.

It Just NEVER Updates

In a faraway branch location, we had two switches that daisy-chained off of another switch due to physical layer constraints. Here’s the old topology:

Pay attention to VillaRosa-3, Villino-1, and Villino-2. V1 and V2 were daisy-chained off of V3. THEN THEY WERE MOVED TO EACH DIRECTLY CONNECT TO THE MX. And then I got an email… “The switches got moved to ports 6 and 7 on the MX, but the topology isn’t updating. Maybe I should restart the devices…” As we see in the community forums, we may be waiting quite a while for this map to update…

The Implications

Few things are more basic in network support than simply knowing what connects to what- both physically and logically. Commands like “show cdp neighbors” and “show lldp neighbors” are pretty important, but not available in the Meraki dashboard paradigm. Instead, we have to rely on incomplete or inaccurate graphical information.

As I mentioned above, V1 and V2 moved to direct connections on the MX. So what does the MX say about the switches that are connected to it? Unfortunately, nothing. For whatever reason, Meraki has never seen it as important to give full LLDP information for devices connected to an MX:

This is utterly nonsensical for enterprise-grade networking equipment. But it gets worse. Let’s look at one of the switches that has had its uplink moved to the MX. Days after the move, the switch still says it’s connected to its old upstream V3 switch:

This is pretty bad. It’s wrong, it’s outdated, it’s misleading, and it’s inexcusable. Let’s have a look at “VillaRosa-3 / 51”.

To recap:

  • Topology views don’t refresh as they should (if ever)
  • The MX doesn’t tell you who its LLDP neighbors are, which is a glaring deficiency
  • The LLDP neighbor information on the Meraki switches can’t be trusted
  • There is no obvious way to invoke on-demand refreshes for topology and LLDP views; we are at the mercy of some undeclared loooooooong refresh timer

That this “feature set” can be this off-kilter defies logic- especially when you consider the cost of the gear and its licensing, and the importance of LLDP to network support.
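For what it’s worth, there is a partial escape hatch: the Meraki Dashboard API does expose a per-device LLDP/CDP report (GET /devices/{serial}/lldpCdp). What follows is a minimal sketch only- it assumes API access is enabled for the org, and the response field names (ports, lldp, systemName, and so on) are based on the documented shape, so verify them against your own dashboard before trusting the output:

```python
# Sketch of a poor-man's "show lldp neighbors" for a Meraki device,
# using the Dashboard API endpoint GET /devices/{serial}/lldpCdp.
# Field names in the parsed response follow the documented shape;
# confirm against real output before relying on this.
import json
import urllib.request


def fetch_lldp_cdp(api_key: str, serial: str) -> dict:
    """Pull the raw LLDP/CDP report for one device from the dashboard."""
    url = f"https://api.meraki.com/api/v1/devices/{serial}/lldpCdp"
    req = urllib.request.Request(url, headers={"X-Cisco-Meraki-API-Key": api_key})
    with urllib.request.urlopen(req, timeout=10) as resp:
        return json.load(resp)


def summarize_neighbors(report: dict) -> list:
    """Flatten the per-port LLDP/CDP dict into 'port -> neighbor' lines."""
    lines = []
    for port, protocols in sorted(report.get("ports", {}).items()):
        lldp = protocols.get("lldp", {})
        cdp = protocols.get("cdp", {})
        name = lldp.get("systemName") or cdp.get("deviceId") or "unknown"
        peer_port = lldp.get("portId") or cdp.get("portId") or "?"
        lines.append(f"port {port}: {name} ({peer_port})")
    return lines
```

In the V1/V2 scenario above, you’d hope to see each switch reporting the MX on its uplink port- without waiting on the topology map’s mystery refresh timer.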

Jake Talks Wireless Code Quality- and It’s Worth Hearing

Jake Snyder is a smart guy. He’s got experience, credentials, and is just a well-rounded networking gent no matter what other descriptors you assign him. In this video, Jake spends about a dozen minutes contemplating the current state of code cranked out by WLAN vendors. Give a listen, and see what YOU think.

My own opinion: the more complexity that vendors cram into their code, the more we’re going to deal with bugs. That’s a pretty simple equation. And… the basic notions of reliability and providing access for clients are getting deprioritized for more exciting features that read well in marketing materials.

So- we’re doomed.

Nah- I’m kidding! It’s not that bad.

Except maybe I’m not kidding.

Will Reliability Be Prioritized Before Wi-Fi’s Whizzbang Future Gets Here?

This blog looks forward, but before we go there we need to zoom back to 1983, where I will corrupt John Mellencamp’s “Crumblin’ Down”:

Some features ain’t no damn good
You can’t trust ’em, you can’t love ’em
No good deed goes unpunished
And I don’t mind being their whipping boy
I’ve had that pleasure for years and years

Indeed. I too have had that pleasure for years and years. Whether it’s what comes out of mechanisms that are supposed to ensure that standards and interoperability testing bring harmony to the wireless world (but don’t), or code suck that flows like an avalanche coming down a mountain, I’ve been there and suffered that a-plenty. Somewhere during one of many wireless system malfunctions, the opening lyrics of “Crumblin’ Down” started blaring in my head, usually followed up by Annie Lennox singing this line from 1992’s “Why”:

Why can’t you see this boat is sinking
(this boat is sinking this boat is sinking)

But enough of the musical ghosts trapped in my head, waiting to sing to me when the network breaks. We’re going forward, and as Timbuk3 sang in 1986: “The future’s so bright, I gotta wear shades.”

Maybe, maybe not on that.

Super-Systems Become Super-Terrific Systems

Soon, market-leading WLAN vendors will likely unveil grand strategies that finally bring real SDN kinda stuff to the Wi-Fi space. And just like the day is fast coming where you can’t just buy a simple RADIUS server from the same folks (you have to invest in a NAC system and then simply NOT use the parts that aren’t RADIUS to get a RADIUS server), one day some Grand Orchestrator of All Networky Things will get its tentacles into our wireless access points and controllers, and you might not have a say in that. (Some of this is already happening with specific vendors, but it’s all just warm-up for the big show, in my opinion.)

This magic in the middle will promise API-enabled everything network-wide, so provisioning and ongoing operations on LAN and WLAN will be child’s play. The frameworks will have spiffy marketing names, and get pushed hard as “where our customers should be going”.

Some of you are probably thinking “So what? This is evolution. Deal with it.” I’m down with that, to a point.

What If They Don’t Fix What’s Broke First?

I know well that I’m not alone in feeling a bit behind the 8-ball when it comes to our networking vendors. There are far too many code bugs impacting far too many components, end users, and networking teams. There’s also an entrenched culture that keeps chronically problematic operating systems alive when they should arguably be scrapped, and that keeps the bug factories in full production.

I personally shudder to think what might happen if that grand vision for the future meets the Culture of Suck, and a whole new species of bug is unleashed on end users. Ideally, vendors would take a hard look at their code bases, their developers, and their cultures and ask if what’s in place today is worth rigging up a bunch of APIs to as part of The New Stuff.

As an end user, it terrifies me.

A House Built on Suck Can Not Stand

As a man-of-action-living-in-the-world, I’ve been around. I’ve seen first-hand what happens during earthquakes to buildings and people when there are no rules governing building quality. I’ve seen carnage and devastation in multiple situations “out there” that all could have been prevented, and when I became Deputy Mayor of my village, I came to appreciate what our Code Enforcement Officer does to keep people and buildings safe. Often it’s just curbing somebody’s foolish way of doing something.

As silly as it sounds, I’d love to see independent Code Enforcement Officers for the network industry who enforce… well, code quality. They would audit developers, their track records, and the pain inflicted on end users. Any vendor that gets too sloppy gets fined, or has to clean up their mess before they can keep developing. Like I said, I know how silly that sounds- but the current culture of poor Quality Assurance and protracted debug sessions at customer expense does not serve as a suitable foundation for the Super-Terrific Systems that are coming our way.

What’s really scary is that vendors tend to go all-in on these initiatives. It’s not like they leave a de-bloated, scalable option (key phrase) for those who don’t want all the Terrific Superness as they develop these monster frameworks of complex functionality.

I’d like to put on my sunglasses for the future of wireless, but if things aren’t cleaned up first for certain vendors, the current cloud over their wireless units is just going to get darker.

Code Suck Regulation: Should We Fine Vendors For Major Code Bugs?

Tell me if this sounds familiar: you spend top dollar on brand-name networking gear, only to put it into service and have some major feature bork out and cause your organization significant embarrassment. You’ve researched the product, have been cajoled into buying from a vendor that swears you’re getting a great piece of gear, and yet something catastrophic makes your deployment go sideways. You engage tech support, verify that your topology and configurations are OK, yet the suck storm still pummels the networked landscape. You’ve found yourself in The Bug Zone.

Ever been here? It gives a bloke or blokette a powerful lonely feelin’. With users in pain, managers who may or may not be sympathetic, and the little voice in the back of your head asking “what could I have done differently?” that ultimately answers itself with “maybe I shoulda cut this vendor off after the last dozen major code issues. But like a victim of domestic abuse, I keep going back for more, hoping it’ll get better.” 

Does this ring familiar with anyone?

I’ve heard from a lot of individuals in the greater IT community of late about all of the many bugs they have hit, and 75% of the time the lament is accompanied by something like “the rush to get ever more features under the hood is making the whole damn thing a time bomb of suck, and it feels like QA is being short-cutted in the name of getting it to market faster”.

What if, in our support contracts, we added a section that gave us a weapon against major code bugs? Perhaps we need to become our own CSRs (Code Suck Regulators) and put it in our agreements that any major code bug verified to cause network downtime or significant user impact results in a fine of $1,000 a day until the vendor resolves it? Would code development maybe slow down a bit, and QA labs be better funded, staffed, and used? Would major bugs drag out for weeks and months if the meter was running at each affected customer site? I’d also suggest making vendors keep all of their verified major bugs in plain view of the world on a vendor-neutral website that requires no login to see bug details and impact, with posting a mandatory requirement enforced by somebody or other- or again, fines are levied.
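Just to put toy numbers on that hypothetical meter (the $1,000/day rate and the site counts below are made-up figures, not anything from a real contract):

```python
def bug_fine(days_open: int, affected_sites: int, rate_per_day: int = 1000) -> int:
    """Total fine for one verified major bug, with the meter running
    daily at every affected customer site until the vendor ships a fix."""
    return days_open * affected_sites * rate_per_day


# A bug that lingers 60 days across 40 affected sites:
# 60 days * 40 sites * $1,000/day = $2,400,000. That funds a QA lab.
```

Suddenly “we’ll fix it in a future release” has a price tag attached.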

OK- I get that the networking industry and all of its various niches doesn’t, and won’t, ever work this way. At the same time, it’s mildly fun to think about not being victimized anymore by companies that don’t seem to really care about their code quality, once you’ve used their stuff long enough to see definite trends in significant bugs. And I am talking about SIGNIFICANT bugs- the ones that are devastating to network performance and your organizational and personal reputations, not just horrible misspellings or cryptic broken-English error messages on a webpage. Maybe fines aren’t the answer, but if you’ve got a better idea on how to change the trend of Free-Flowing Suck when it comes to code, I’d love to hear it.

(This is where some of you are thinking: bah, just do better testing before you deploy the code that you say sucks. My reaction: yeah, good luck with that. There’s only so much you can test, and only so far you should have to go to be the vendor’s QA department.)

What’s Up With Cisco’s 5760?

So the new 5760 Controller is here. It’s IOS based, it supports 1000 APs, it has 10 Gig interfaces at long last… what’s not to love?

Plenty, actually. At least right now.

Cisco’s wireless controllers are fairly complicated beasts, especially on large networks that use multiple SSIDs with differing feature sets across each one. With each code release, more features get unleashed, which ups the complexity in exchange for capabilities like RF Groups, application visibility and control, rate limiting, and Clean Air. This complexity pretty much demands that multiple controllers and lots of APs serving huge volumes of clients be managed by the likes of WCS, NCS, Prime NCS, Prime Infrastructure, Supreme Excellent Unificated Management Suite, or whatever we call Cisco’s wireless management platform this week. It can be challenging to stay on top of Cisco’s endless parade of new features, capabilities, bugs, interface changes, gaps between CLI/Controller UI/Management UI, licensing changes and other nuances, but that is the nature of the beast. We can do complex, even quirky.

For wireless controller code, we have other challenges. Some versions are to be avoided even by Cisco’s own recommendations, while others are the darlings that we all love. If you want stable code, that’s not always the same thing as the latest code. You have to talk to SEs and TAC to find out what code is preferred, and what is the other stuff. (Who uses the other stuff, and why is it even out there?) Then there is the dance between controller code, Prime Infrastructure code, and the Mobility Services Engines. They all tend to have mutual dependencies. Complex, quirky… again, we can deal with that.

Back to the 5760 Controller.

A controller that supports 1000 APs is aimed at big environments. Big wireless networks tend to require trending, configuration templates, and reporting- you know, management-type stuff. This is why we all have PI or one of its earlier versions. But… the 5760 isn’t compatible with the current PI (1.3). So, for now you get real-time views of client and AP behavior at best, if you can scrape what you need directly out of the 5760.

In fairness to Cisco, they did note in their January 2013 announcement of the new controllers that the 5760 would not be manageable by Prime until PI 2.0.

At the same time, SEs and sales folks that know their customers’ environments arguably have a duty to say “you know… you can’t manage this thing in your version of PI- are you sure you want it?” That it was even released “unmanageably” is pretty confusing to me when I contemplate trying to support thousands of clients on a 5760 with no NMS after years of running a big WLAN.

The UI on the controller itself currently looks like a knock-off of the 5508’s interface (it actually strikes me as a phishing-kinda cheesy copy of a real UI). And many of its features are buried in the CLI, with no exposure in the UI.

Speaking of features, AVC was a big thing when it came out earlier on other WLC versions- huge, actually. Once you turn it on and start using it, you wonder how you did without it. On the 5760, you won’t have to wonder- you will do without it, as AVC (and other big-deal features) isn’t in this biggest, newest controller.

Nor is happy coexistence with 5508 controllers- unless you are willing to drop your 5508s back to 7.3 code, or wait for the new 7.5 code to come out sometime in the future. If you are on current 5508 code (the 7.4 train), your clients won’t roam seamlessly between the 5760 and your 5508s.

(I won’t even get into the HA thing that was touted when the 5760s were announced, that you can’t leverage yet either.)

Final word: today, the 5760 is almost like a real controller that you can’t yet properly manage. Things are supposed to get sunnier later in the calendar year for some of the limitations described here, but why didn’t Cisco simply wait until they had a more fully baked unit to dazzle us with?

This is just a bit weird. Are IOS and the 1000 AP count supposed to be the sparkly things that distract you from all the warts? Complex and quirky are arguably acceptable. Beta-quality and incomplete are other animals completely. Don’t we deserve better by now?

Am I missing something? Would love to be wrong in my analysis…

Code Bugs as Value Adds

Oh baby, oh baby 
Then it fell apart, it fell apart 
Oh baby, oh baby 
Like it always does, always does.
                            – Moby, “Extreme Ways”

Reliable code on network equipment is stodgy, unadventurous, even “lame”. What is life without surprises?

So you read the white papers, got hooked on impressive reports of blazing performance, and snagged a great discount on that new solution you’re about to get into. You’re getting features out the wazoo, but are you getting the right framework for code bugs? Where many organizations never give the notion of bugs and poor QA practices even a fleeting thought, you really should slow it down and put some time into making sure that you are getting your fair share of bugs.

Though it may be counter-intuitive, problem-ridden operating systems and code versions are actually gems in disguise. When the right bug hits at just the worst time, it’s like no other experience in the IT realm. What you should be considering:

  • Nerdy IT Folks Aren’t The Most Social Creatures. But get a few people all pulling their hair out over some issue that never should have made it into the release version of code, and the same introverts now have reason to come out of their shells and intermingle on social media and in support forums. Vendors supplying crappy code could legitimately charge for this service.
  • Bugs Let You Test Your Moxie. So your management system craps out for weeks and the vendor is stumped. Or network switch ports arbitrarily stop passing traffic. Or access points reboot whenever someone on another continent takes a sip of Red Bull. These are opportunities for you to show your managers that you can carve endless hours out of your already-busy schedule, relay with creative expression that even the vendor is stumped, and exercise great patience as someone you can barely understand guides you through days of debug to finally declare “oh, that is a known bug.” Have you got what it takes? Again, this character-building could legitimately be billed as a service.
  • You Can Be Part of the QA “Matrix”. Although Keanu Reeves and that rather attractive young lady raged against The Man in the famous movie The Matrix, you’ll want to be part of the “crowd-sourced quality assurance” experience provided by the right vendor. More simple-minded customers might bitch about catastrophic bugs that should have been caught before code release, but many of us are thrilled to donate countless hours to being part of our vendor’s Quality Army. Let ’em push whatever half-baked nonsense the developers can burp out; we’ll find the problems, and our clients trying to use the network can just shut up about it. Everybody wins.

But the notion doesn’t end with sub-par code. You also want to find a framework that provides for confusing, time-consuming upgrades as well. This is important if you are looking for “the complete experience”. Ideally, your upgrades to things like management servers or security appliances should fail or bring a whole new crop of problems, regardless of how methodical you are in your procedures. (Remember, the goal here is to really get quality time with the vendor’s support wing while minimizing uptime.) At a minimum, you want to have to rebuild databases multiple times and ride the Licensing Merry-Go-Round until you almost get sick- that’s when you know you’re realizing the full value of your investment.

So how do you know you’re going to get properly served the right allotment of bugs? Unfortunately, there are no guarantees. But you want to stick with vendors that prolifically churn out lots of new features before really fixing old bugs- that way the effect is additive and unpredictable. Search release notes for lots of problem conditions that look like they could impact you, and make sure there are plenty of “no workaround” and “don’t use the feature that you paid good money for” entries as suggested fixes.
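If you want to size up a code train before buying in, even a crude scan of the release notes is instructive. A toy sketch- the red-flag phrases are my own picks, and in practice you’d feed it the real release-notes text rather than a string:

```python
import re

# Phrases that tend to signal trouble in release notes / open caveats.
# This list is illustrative; tune it to your vendor's vocabulary.
RED_FLAGS = ("no workaround", "do not use", "not supported")


def count_red_flags(release_notes: str) -> dict:
    """Count case-insensitive occurrences of each tell-tale phrase."""
    text = release_notes.lower()
    return {phrase: len(re.findall(re.escape(phrase), text))
            for phrase in RED_FLAGS}
```

A high “no workaround” count is a pretty reliable sign that you’re about to get your full allotment of exciting little bonuses.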

These are exciting times in networking, and there are a slew of bugs to be had. Choose your vendor wisely, and make sure that you are getting your share of these exciting little bonuses.