The ICMP Illusion: Bypassing Cloud SDN to Measure True L4 Latency

If you run stateful workloads on a major cloud provider, Azure in our case, your ICMP telemetry may be lying to you.

While debugging a persistent 4-6 ms latency floor between our application platform and our persistence layer (MongoDB, Redis, and a managed SQL database), we saw the floor appear to degrade under high compute load. The standard observability stack, which leaned on ping metrics, showed erratic spikes reaching about 10 ms. Our first hypothesis was a saturated interface or a hypervisor bottleneck.

The telemetry was a false positive: we were hitting a Software-Defined Networking (SDN) abstraction, not a network limit.

The problem with ICMP in the cloud

In Azure, host networking policy (private virtual networks, load-balancer NAT, ACLs, QoS) is implemented on the physical host by the Virtual Filtering Platform (VFP), Azure's programmable virtual switch. On VMs with Accelerated Networking, the steady-state datapath for established flows is offloaded to a SmartNIC (FPGA) and bypasses the host CPU. The host software path handles only the first packet of a flow and any traffic that is not offloaded.

ICMP is one of the traffic classes that does not benefit from that offload. Microsoft's documentation states that ICMP is treated differently from application traffic and that ping does not measure the accelerated-networking datapath, so ICMP round-trip times do not represent what TCP and UDP workloads experience. ICMP echo is also commonly deprioritized by hosts and routers under load: when more important traffic is queued, echo requests are among the first to be delayed or dropped.

The failure mode we observed follows from this, though it is inference from behavior rather than a documented Azure guarantee. Under host CPU or packets-per-second pressure, ICMP rides the contended software path and its latency inflates, while the offloaded TCP and UDP datapath is unaffected. If a Grafana panel derives inter-node latency from ICMP, any compute load manufactures a network bottleneck that does not exist at L4 or L7.

To measure the real data-plane propagation delay, without the latency added by application-level accept() loops or thread-pool exhaustion in the database, we needed an L4 telemetry agent.

The requirements were strict:

Measure TCP connection-establishment time, which is effectively one round trip (SYN, then SYN-ACK), the cost a real client pays before it can send a byte.
Consume minimal CPU and memory on the target, to keep the observer effect small.
Avoid the standard asynchronous-runtime overhead.

Existing options either shipped heavy agents or required kernel facilities like eBPF, which was not viable across all of our heterogeneous environments. So I built wire-probe.

Architecture decisions, as implemented

wire-probe is written in Rust and avoids the standard async ecosystem in favor of system-level kernel interfaces and aggressive compiler settings. The description below matches the shipped code, including its trade-offs.

1. io_uring instead of tokio (server mode)

A general-purpose async runtime such as tokio carries a multi-megabyte RSS baseline (typically single-digit MB for a minimal multi-threaded runtime, counting its worker threads) before a single packet is processed. For a binary whose only job is to acknowledge and drop TCP connections, that overhead is not justified.

The server mode, which runs on the DB nodes, uses a serial io_uring accept loop. It submits an Accept SQE, calls submit_and_wait for completion, drops the connection with a synchronous libc::close(fd), re-arms, and repeats.

There is a trade-off here. This is a serial loop, with one submit_and_wait syscall per accepted connection, and the close is an ordinary blocking syscall rather than an io_uring operation. It is not the zero-syscall, multishot-accept design that io_uring makes possible (multishot accept would cut the per-connection syscall cost on kernels 5.19 and newer). What it does guarantee is a flat memory profile: no per-connection allocation and no thread per connection, so the daemon holds at roughly 500 KB RSS regardless of inbound connection rate.

2. Blocking connect for deterministic probing

In probe mode, which runs on the application nodes, asynchronous timers introduce scheduler drift under host load. Instead, the RTT is taken by wrapping std::net::TcpStream::connect_timeout with std::time::Instant.

connect_timeout issues a non-blocking connect and then waits on the socket's readiness through the OS poller. The calling thread parks in that kernel wait until the handshake completes or --timeout elapses, with no userland spin. The measured interval is the establishment time, about one RTT. No application data is sent or received, so the number is free of recv and send latency.
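The measurement described above can be sketched in a few lines of std-only Rust. This is a minimal illustration, not the shipped wire-probe code: the function names probe_once and as_millis_f64 are hypothetical, and the timeout argument stands in for whatever --timeout is parsed into.

```rust
use std::io;
use std::net::{SocketAddr, TcpStream};
use std::time::{Duration, Instant};

/// Time one TCP handshake to `addr` (hypothetical helper, not wire-probe's API).
/// Returns the elapsed time on success, or the connect error
/// (refused, timed out, unreachable) on failure.
fn probe_once(addr: &SocketAddr, timeout: Duration) -> io::Result<Duration> {
    let start = Instant::now();
    // The thread parks in the kernel until the handshake completes
    // or `timeout` elapses. A zero timeout is rejected by std.
    let stream = TcpStream::connect_timeout(addr, timeout)?;
    let elapsed = start.elapsed();
    // No payload is sent or received; close the socket right away.
    drop(stream);
    Ok(elapsed)
}

/// Convert a Duration to fractional milliseconds for export.
fn as_millis_f64(d: Duration) -> f64 {
    d.as_secs_f64() * 1_000.0
}
```

Note that the interval also covers the socket setup that connect_timeout performs before the SYN leaves, so on a fast local path the number is a slight overestimate of the wire RTT.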

DNS is resolved once at startup, not per probe. The target is resolved to a single SocketAddr before the loop begins, so there is no getaddrinfo call (a libc resolver lookup, not a syscall) in the hot path and a slow resolver cannot distort the probe cadence. The trade-off is at the other end: the first resolved address is pinned for the life of the process, so DNS-level failover or record changes are not picked up until restart. For fixed-target probing against known DB nodes that is the right bias. If you need failover, pass an IP per instance or restart on change.
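A resolve-once step like the one described here might look as follows. It is a sketch under assumptions: resolve_once is a hypothetical name, and it takes the target as a "host:port" string, which may differ from how wire-probe accepts its arguments.

```rust
use std::io;
use std::net::{SocketAddr, ToSocketAddrs};

/// Resolve `target` ("host:port") once and pin the first address returned.
/// Hypothetical helper: call it before the probe loop, never inside it.
fn resolve_once(target: &str) -> io::Result<SocketAddr> {
    target
        .to_socket_addrs()? // may hit the resolver; literal IPs are parsed directly
        .next()
        .ok_or_else(|| {
            io::Error::new(io::ErrorKind::NotFound, "target resolved to no addresses")
        })
}
```

Taking only the first address is what makes the pinning explicit: if the name has several records, the rest are ignored for the life of the process.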

3. Binary density and an allocation-free export path

To keep the binary small and avoid fragmentation over long runs, it is compiled against musl-libc with fat LTO, codegen-units = 1, panic = "abort", and strip = true. panic = "abort" removes the stack-unwinding tables. The result is a 370 KB static binary with no glibc version dependency. The probe process runs at roughly 300 KB RSS. These figures are from the project's own release build.
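For reference, the settings named above map onto a Cargo release profile roughly like this. It is a config fragment reconstructed from the description, not copied from the repository, so the real Cargo.toml may carry additional keys.

```toml
# Cargo.toml (fragment): release profile matching the settings listed above
[profile.release]
lto = "fat"          # whole-program link-time optimization
codegen-units = 1    # single codegen unit, better optimization, slower build
panic = "abort"      # abort on panic instead of unwinding
strip = true         # strip symbols from the final binary
```

The musl linkage comes from the build target rather than the profile, for example cargo build --release --target x86_64-unknown-linux-musl. The x86_64 triple is an assumption here; use the triple that matches your hosts.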

The export and formatting path is allocation-free. The static portion of each line (tcp_latency,target=...,az=... rtt_ms= for Influx Line Protocol, or the PUTVAL ... prefix for Collectd) is built once at construction. On each send, the value and timestamp are written into a single reused buffer using ryu for floats and itoa for integers, both of which format into stack space and skip the standard library's String machinery. There is no format! call and no per-send heap allocation.
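The prefix-once, reuse-one-buffer pattern can be illustrated with the sketch below. To stay dependency-free it swaps in std's write! where the text above uses ryu and itoa, so it shows the buffer handling rather than the exact formatting path. LineBuf and its methods are hypothetical names, and the tag values are assumed to contain no characters that need Influx escaping (commas, spaces, equals signs).

```rust
use std::io::Write;

/// Reusable Influx Line Protocol buffer (hypothetical type, not wire-probe's).
struct LineBuf {
    buf: Vec<u8>,
    prefix_len: usize, // length of the static "measurement,tags field=" part
}

impl LineBuf {
    /// Build the static prefix once, at construction.
    fn new(target: &str, az: &str) -> Self {
        let mut buf = Vec::with_capacity(256);
        write!(buf, "tcp_latency,target={},az={} rtt_ms=", target, az)
            .expect("writing to a Vec cannot fail");
        let prefix_len = buf.len();
        LineBuf { buf, prefix_len }
    }

    /// Overwrite the variable tail in place and return the full line.
    fn render(&mut self, rtt_ms: f64, ts_ns: u64) -> &[u8] {
        // truncate keeps the allocation; only the length is reset.
        self.buf.truncate(self.prefix_len);
        write!(self.buf, "{} {}\n", rtt_ms, ts_ns)
            .expect("writing to a Vec cannot fail");
        &self.buf
    }
}
```

As long as a line fits in the initial capacity, the buffer never grows after construction, so each render call reuses the same allocation.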

4. The Collectd integration

Collectd's Exec plugin runs a long-lived child process once and reads its stdout over a pipe. It does not re-fork or re-exec per interval. That model works, but it means deploying and supervising a separate binary alongside collectd.

wire-probe also ships a native Python plugin (wire_probe.py). Rather than wrapping the Rust binary, it is a small, self-contained reimplementation that uses Python's socket.create_connection, hooks collectd's read_cb, and emits the same value types as the stock ping plugin (ping, ping_droprate, ping_stddev), so existing dashboards and thresholds work without changes. The plugin also bounds its own work: a total per-read_cb wall-clock budget stops a misconfiguration from starving collectd's read thread.

One caveat: this path runs in CPython doing blocking connects. It shares none of the Rust binary's properties (the musl footprint, the io_uring server, the allocation-free export path). It is a convenience integration for collectd shops, not the zero-footprint agent. Use the Rust probe where the footprint matters, and the Python plugin where drop-in collectd compatibility matters more.

Conclusion

With wire-probe deployed, we separated the actual network floor from our internal compute constraints. The 4-6 ms floor is independent of compute load and of ICMP queuing. It is a stable property of the underlying L3 and L4 path, not an application bottleneck and not a ping artifact. Once the measurement stopped going through ICMP, the spikes under load disappeared.

The cause needs to be stated carefully. Azure targets inter-zone (cross-AZ, same region) round-trip latency under about 2 ms, and real measurements usually come in well below that. A 6 ms floor does not fit a plain cross-AZ explanation. It points to traffic crossing regions, or to a network virtual appliance or gateway hop in the path. Either way it is a routing reality you can design around, but confirm it against your own topology before naming it.

If you are hunting ghosts in cloud network telemetry, stop trusting ping. Measure the wire.

Source and pre-compiled musl binaries: github.com/vorjdux/wire-probe