I'm a big fan of simplicity, and when I design things I do my best to ensure they are straightforward, easy to understand, and easy to support. That means I sometimes have to rein in the part of me that really enjoyed CCIE labs, or the other part that is tempted by any shiny new thing found online or in a vendor presentation.
I mention CCIE labs because of the any-crazy-solution mindset that came with them: those hours upon hours of intricate labs improve your problem-solving skills to an extent that's dangerous for real networks.
A while ago I wrote about a VSS-related outage a customer had due to too much reliance on a high-availability solution that was not well deployed. I've been watching from the sidelines the fun my colleagues have had with Cisco TAC while trying to investigate what happened that day, and I've realized how much time was ultimately wasted because things were not kept simple.
First of all, let me say that VSS, especially the quad-SUP variety, is not simple. By simple I mean a number of things:
- the solution can be understood by average support engineers
- it's easy to troubleshoot
- the failure model is predictable
- a failure, be it hardware or software, is easy for the vendor to reproduce and fix
- changes can be implemented with very low risk by engineers of any level
Furthermore, interaction with TAC has become increasingly difficult. Let me tell you what I observed in this particular case:
- the case was urgent, so engineers rotated with shift coverage, and the list grew to about five or six (I lost count)
- they were in different timezones and handovers were done via case notes
- each new engineer did not have time to fully understand the case yet was under pressure to provide an update, so their responses added no value whatsoever
- most responses were a copy-paste of previous answers
- investigation and failure analysis were mixed into the same case but handled by different teams, so only one work stream could progress at a time
But most of this investment of time and effort stems from the complexity of the solution, which leads to too many questions being asked:
- how did the failure happen?
- why was there a total network outage when the design should have prevented one?
- is what was designed actually what got deployed?
- how can we prevent it from happening again?
- based on the history of the box, do we need to replace it entirely or change the design?
In the end, had the design been simpler (and the failure scenarios documented and tested), it would have been a matter of a quick RMA for the part that failed and a very short TAC case (for a post-mortem of the supervisor).
It also seems to me that the support model is changing: from being able to perform complex investigations to a position where the answer is either a hardware swap or a software upgrade, no matter what happened. I suppose it's due to the scale at which a company the size of Cisco operates: there simply isn't enough engineering capacity to deal with the massive volume of cases in any detail, and it's cheaper to just swap (be it HW or SW).
So simplify your design as much as possible, avoid overly complex pieces of hardware whose behavior you cannot predict, and test, test, test. No matter how much documentation you read, real life will find a way to surprise you.
And, as always, thanks for reading.