Our last iNOG meeting was hosted by Riot Games (it was a blast, check out the recording here) - and as part of getting to know them I found out that they have a pretty interesting engineering blog with long, well-written posts. Only yesterday did I manage to read through their network-oriented posts, and I can recommend Fixing the Internet for real-time applications: Part I, Part II and Part III. In fact I enjoyed the story so much that I started writing a short comment on Part III, then added a few more lines to it and, well, I'm surprised Disqus allowed me to post this little monster in their comments section.
The comment itself, as addressed to the author:
First of all, I'll say that as a network engineer I really enjoyed reading these posts; they describe a very interesting journey from using the network as a high-cost, "meh"-performance black box to understanding how it functions and iterating on it until it became an enabler for services and a good end-user experience.
Talking about the hyperscale companies, I totally agree - you don't need to be at that level to build a scalable, reliable and visible (metrics) infrastructure - but a lot of people are stuck in the old way of organic-growth-that-only-solves-short-term-needs. One important advantage the hyperscalers hold over others is that they have lots of engineering power in house that they are not afraid to use to solve the problems that inevitably arise from operating at scale.
But here's the thing: companies like Riot have strong software engineering that can be used to build more than just the core product - tools for deployment, testing and monitoring that fit your particular business and functional requirements, as opposed to vendor solutions that come with limitations, slow release cycles and high price tags, and give you a limited (or convoluted) feature set to work with. These self-developed tools are often based on opensource projects and inspired by work done in the sysadmin/dev world, because you don't really need to reinvent the wheel every time (maybe just reshape it a bit and add a few spikes). The awesome part is that this model also encourages giving back to the wider community, be it through blog posts, talks at conferences or code contributions back into those opensource projects.
"Bring technology-agnostic expertise in house"
Many network teams don't have someone design/architecture-focused who knows to also look at things from the business perspective, leaving low-level technical discussions and vendor-specific ways of doing things at the door. Too often these teams are so busy firefighting, or delivering whatever infrastructure the rest of the IT organization requires, that they lose sight of everything else. It becomes an unwinnable race against time, and breaking this inertia takes a lot of work.
"Create knowable and measurable networks"
I think the reactive model needs to go. Polling SNMP every now and then (but not too often, as you said - don't want to over-stress the poor CPU) and receiving traps is not enough. It works for simple hard failures (with clear remedies) but not for complex gray failures that may be gone by the time you log in to investigate. Streaming metrics sounds very good, and with a bit of intelligence built around it to detect anomalies (baselines are important), more data and even packets can be captured on the spot for later analysis. How many vendor solutions can do such a thing? None, because no vendor will ever be able to build something specifically for your business needs.
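To make the "baselines are important" point concrete, here's a minimal sketch (in Python; the `BaselineDetector` class and the sample numbers are entirely made up for illustration, not anyone's actual implementation) of flagging a streamed metric that drifts far from a rolling baseline. A real pipeline would react to the alert by triggering a packet capture or deeper telemetry collection on the spot:

```python
class BaselineDetector:
    """Toy streaming anomaly detector: keep a rolling baseline (EWMA of
    the metric) and flag samples that deviate from it by more than a
    multiple of the running mean-absolute-deviation estimate."""

    def __init__(self, alpha=0.1, threshold=4.0, warmup=10):
        self.alpha = alpha          # EWMA smoothing factor
        self.threshold = threshold  # deviation multiplier that counts as anomalous
        self.warmup = warmup        # samples to observe before alerting
        self.mean = None            # running baseline
        self.dev = 0.0              # running mean absolute deviation
        self.seen = 0

    def observe(self, value):
        """Feed one streamed sample; return True if it looks anomalous."""
        self.seen += 1
        if self.mean is None:
            self.mean = value
            return False
        deviation = abs(value - self.mean)
        anomalous = (self.seen > self.warmup
                     and self.dev > 0
                     and deviation > self.threshold * self.dev)
        # update the baseline *after* the check, so a spike doesn't
        # immediately drag the baseline toward itself
        self.mean += self.alpha * (value - self.mean)
        self.dev += self.alpha * (deviation - self.dev)
        return anomalous

# Interface latency (ms) hovers around 20, then a gray-failure spike hits:
detector = BaselineDetector()
samples = [20, 21, 19, 20, 22, 20, 19, 21, 20, 20, 21, 19, 20, 95]
alerts = [i for i, v in enumerate(samples) if detector.observe(v)]
# alerts -> [13]: only the 95ms sample trips the baseline check;
# this is where you'd kick off an on-box capture for later analysis
```

The fixed threshold here is deliberately naive - in practice the baseline would be per-interface and time-of-day aware - but it shows why streamed samples plus a learned baseline catch transient gray failures that a five-minute SNMP poll would sail right past.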
This brings us to the next part. I love the honesty and the inevitable realization that, in the end, no external third party can fully understand and address your needs (it's also why I cringe whenever anyone suggests outsourcing as a viable model for critical infrastructure):
"In this case, that involved hiring vendors to help us with connectivity problems. And while we have many great partners, we’re often not 100% aligned - they want to sell hardware and network access. To them, 300ms latency was the same as 60ms - to us those numbers are worlds apart. ... So we kept buying things that vendors told us would fix our problems, whether it be a new piece of hardware, or a new data center location."
Vendors do a lot of things well, and they have a wealth of knowledge and very experienced people who can give good advice. But they also need to make a profit and can't tailor solutions to every possible need out there (although: if lots($$$) then introduce(nerd-knobs)). So the solution becomes clear: you should know what's best for your business and have enough expertise in house to make (most of the time) good decisions about how to solve your problems.
One final thing: you mentioned having interoperability problems when you went multi-vendor. They will always happen, and it's a risk/complexity balancing game when choosing to run multiple vendors at the same time. Jose's presentation at iNOG::9 (hosted with the help of your guys in Dublin - 'twas awesome!) covers exactly such a scenario.
"To spoil the ending: those networks won’t be built by vendors, they will be built by us, the networking community."
Totally agree, and this is why sharing with the larger community is so important: the IETF draft suggestion above was great, but so is giving talks (Rodrigo@iNOG, maybe RIPE next?) and contributing to opensource projects and standards. Keep it up, loving it!