Description
Challenges of self-hosting services for network engineers
In the last few years we have seen an industry-wide push to take services back from the cloud and run them "on-prem". There is a wide variety of reasons for taking that step, and it inevitably brings challenges for network engineers, most notably when the organization's main focus is not infrastructure (DCs and servers). I will share a few stories about self-hosted services (a Proxmox cluster, K8s, Ceph, and a complex Prometheus+VictoriaMetrics cluster, just to mention a few) that failed because of the network or, ominously, where the network was the primary suspect for a large portion of the debugging sprint but the true cause proved to be unrelated to the network after all.
I am going to focus on the debugging procedures and tools needed to instrument the network in the different contexts pertinent to my failure stories. Since the tooling is obviously totally different for a self-managed fabric within a DC, for outsourced DC-to-DC interconnects or SD-WAN, and for the Internet, there is no hope whatsoever for a uniform approach to this. Or is there...?
Short Annotation
Overview of network debugging tools: the current state (with usage examples in moderately entertaining stories) and a few notes about the latest contributions and desirable improvements.