How Network Monitoring Works: From Ping to Modern Observability
Network monitoring is the practice of continuously observing network infrastructure -- routers, switches, firewalls, links, servers, and services -- to detect failures, measure performance, plan capacity, and troubleshoot problems. It ranges from the simplest possible check (can I ping this host?) to sophisticated observability pipelines that correlate metrics, logs, and traces across thousands of devices in real time. Every network operator, from a small business with a single router to a Tier 1 ISP managing a global backbone, relies on monitoring to keep their infrastructure running.
The evolution of network monitoring mirrors the evolution of networks themselves. In the early days of the internet, ping and manual inspection were sufficient. As networks grew, SNMP provided a standardized way to poll device metrics. The rise of high-speed networks brought NetFlow and sFlow for traffic analysis. Today's cloud-native and microservice architectures demand distributed tracing, time-series databases capable of ingesting millions of metrics per second, and intelligent alerting that can distinguish a real outage from normal variance.
This article covers the full stack: from basic reachability testing through protocol-level monitoring, traffic analysis, modern time-series architectures, alerting pipelines, and the distinction between monitoring and observability.
Layer 1: Reachability -- Ping and Traceroute
The most fundamental monitoring question is: can I reach this host? ICMP (Internet Control Message Protocol) provides the answer.
ICMP Ping
Ping sends ICMP Echo Request packets to a target and measures the time until an Echo Reply is received. It provides three critical metrics:
- Reachability -- does the host respond at all? 100% packet loss indicates the host is down, unreachable, or blocking ICMP.
- Round-trip time (RTT) -- how long the packet takes to reach the target and return. RTT varies with distance, congestion, and routing path. A sudden increase in RTT often signals congestion or a routing change.
- Packet loss -- what percentage of packets are lost. Healthy networks show 0% loss; consistent loss above 1% indicates a problem.
Monitoring systems send pings at regular intervals (typically every 30-60 seconds) and alert when a host becomes unreachable or when RTT or loss exceed thresholds. While simple, ping has significant limitations:
- Many firewalls and hosts block or rate-limit ICMP, so a failed ping does not necessarily mean the target is down.
- Ping measures ICMP path performance, which may differ from TCP or UDP path performance due to different QoS treatment.
- Ping tells you nothing about the cause of a problem -- only that a problem exists.
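The arithmetic a poller applies to one cycle of ping results is simple to sketch. The Python below is illustrative only -- the function names and thresholds are hypothetical, not from any particular tool:

```python
from dataclasses import dataclass

@dataclass
class PingSummary:
    loss_pct: float    # percentage of probes that got no reply
    avg_rtt_ms: float  # mean RTT over successful probes (0.0 if none)

def summarize_pings(rtts_ms):
    """Summarize one polling cycle; None entries are timed-out probes."""
    replies = [r for r in rtts_ms if r is not None]
    lost = len(rtts_ms) - len(replies)
    loss_pct = 100.0 * lost / len(rtts_ms)
    avg = sum(replies) / len(replies) if replies else 0.0
    return PingSummary(loss_pct, avg)

def breaches_thresholds(summary, max_loss_pct=1.0, max_rtt_ms=150.0):
    """Flag the target when loss or latency exceeds its thresholds."""
    return summary.loss_pct > max_loss_pct or summary.avg_rtt_ms > max_rtt_ms
```

With four probes of which one timed out, `summarize_pings([10.0, 12.0, None, 14.0])` reports 25% loss and a 12 ms average RTT, which trips the default 1% loss threshold.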
Traceroute
Traceroute extends ping by revealing the path packets take through the network. It sends packets with incrementally increasing TTL values, causing each router along the path to return an ICMP Time Exceeded message. This reveals every hop between source and destination, along with the RTT to each hop.
Network operators use traceroute to:
- Identify which hop is causing packet loss or high latency.
- Verify that traffic is following the expected routing path (important after BGP changes).
- Detect asymmetric routing where forward and return paths differ.
- Identify third-party networks contributing to performance problems.
Modern variants include MTR (My Traceroute), which combines traceroute with continuous ping to show per-hop loss and latency statistics over time, and Paris Traceroute, which uses consistent flow identifiers to avoid per-packet load-balancing artifacts that cause traceroute to show false paths.
SNMP: The Universal Device Monitoring Protocol
SNMP (Simple Network Management Protocol) is the standard protocol for monitoring network devices. Virtually every managed switch, router, firewall, UPS, and server supports SNMP. The protocol works on a poll-and-response model: the monitoring system (the manager) periodically queries devices (the agents) for specific metrics identified by OIDs (Object Identifiers) in the MIB (Management Information Base).
Key SNMP Metrics
The most commonly polled SNMP metrics include:
- Interface counters (IF-MIB) -- bytes in/out (ifHCInOctets, ifHCOutOctets), packets in/out, errors, discards, and interface operational status (ifOperStatus). These are the foundation of bandwidth monitoring and are polled every 30-300 seconds.
- CPU and memory utilization -- device CPU load and memory usage, indicating whether a device is under stress. Vendor-specific MIBs provide these metrics (Cisco CISCO-PROCESS-MIB, Juniper jnxOperatingTable).
- BGP session state -- the state of BGP sessions (established, idle, active, connect), the number of received/advertised prefixes, and session uptime. Critical for monitoring peering and transit connections.
- Temperature and power -- environmental sensors on network equipment, alerting before a device overheats or loses a power supply.
- SNMP traps -- unsolicited notifications sent by devices when events occur (link up/down, BGP session state change, authentication failure, fan failure). Unlike polling, traps are event-driven and provide immediate notification.
SNMP Polling Architecture
A typical SNMP monitoring system polls hundreds or thousands of devices at regular intervals. Each polling cycle queries multiple OIDs per device. For a network with 1,000 devices, each with 50 interfaces polled every 60 seconds, the monitoring system must retrieve on the order of 50,000 interface counters per minute -- more if several OIDs are polled per interface.
Counter metrics (bytes, packets) require computation: the monitoring system stores the current counter value and computes the rate of change (bytes per second) between consecutive polls. This introduces artifacts: a counter reset (device reboot) can produce a massive false spike; a polling interval that is too long can miss short-duration traffic bursts.
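The rate computation, including wrap and reset handling, fits in a few lines. A hedged sketch in Python -- the half-range heuristic for telling a counter wrap apart from a reset is one common choice, not a standard:

```python
def counter_rate(prev, curr, interval_s, bits=64):
    """Rate of change (units/sec) between two SNMP counter samples.

    Handles a single counter wrap and returns None on an apparent
    counter reset (e.g. device reboot), where any computed rate
    would be a false spike.
    """
    if curr >= prev:
        delta = curr - prev
    else:
        # Counter went backwards: either it wrapped past 2^bits or the
        # device rebooted. An implied delta larger than half the counter
        # range is treated as a reset and discarded.
        delta = (1 << bits) - prev + curr
        if delta > (1 << bits) // 2:
            return None
    return delta / interval_s
```

600 extra octets over a 60-second poll yields `counter_rate(100, 700, 60.0)` = 10.0 bytes/sec, and the same rate is recovered when the 64-bit counter wraps between polls.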
Flow Analysis: NetFlow, sFlow, and IPFIX
While SNMP tells you how much traffic is on an interface, flow protocols tell you what that traffic is: which source and destination IP addresses, ports, protocols, and AS numbers are responsible for the bandwidth. Flow data answers questions that interface counters cannot:
- Which application is consuming 80% of the WAN link's bandwidth?
- What percentage of traffic crosses which peering links vs. transit links?
- Is a sudden bandwidth spike caused by legitimate traffic or a DDoS attack?
- Which customers or business units generate the most traffic?
NetFlow (Cisco) and IPFIX
NetFlow (versions 5 and 9) was developed by Cisco and has become a de facto standard. It works by having routers and switches maintain a flow table that tracks active connections. When a flow expires (inactivity timeout, typically 15 seconds, or active timeout, commonly configured between 1 and 5 minutes), the device exports a flow record to a collector.
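The flow-table mechanics can be sketched as a small cache keyed by the flow tuple. This is a toy model, not any vendor's implementation; the timeout defaults mirror the values above:

```python
class FlowCache:
    """Toy NetFlow-style flow table with inactive and active timeouts."""

    def __init__(self, inactive_s=15, active_s=300):
        self.inactive_s = inactive_s
        self.active_s = active_s
        self.flows = {}  # flow key -> [first_seen, last_seen, bytes, packets]

    def packet(self, key, size, now):
        """Account one packet against its flow (creating it on first sight)."""
        f = self.flows.setdefault(key, [now, now, 0, 0])
        f[1] = now
        f[2] += size
        f[3] += 1

    def expire(self, now):
        """Export and remove flows whose inactive or active timer fired."""
        exported = []
        for key, (first, last, nbytes, pkts) in list(self.flows.items()):
            if now - last >= self.inactive_s or now - first >= self.active_s:
                exported.append((key, nbytes, pkts))
                del self.flows[key]
        return exported
```

A flow that goes quiet for 15 seconds, or that stays active past the 5-minute mark, is exported on the next expiry sweep.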
A NetFlow v5 record contains:
Source IP: 192.0.2.10
Destination IP: 203.0.113.50
Source Port: 54321
Destination Port: 443
Protocol: TCP (6)
Bytes: 1,523,400
Packets: 1,203
Start Time: 2025-03-15T14:22:33Z
End Time: 2025-03-15T14:27:18Z
TCP Flags: SYN, ACK, PSH, FIN
Input Interface: Gi0/0/1
Output Interface: Gi0/0/2
Source AS: 64500
Destination AS: 13335
Next Hop: 198.51.100.1
IPFIX (IP Flow Information Export, RFC 7011) is the IETF-standardized evolution of NetFlow v9. It uses a template-based format that supports arbitrary fields, making it extensible for new protocols and vendor-specific information.
sFlow
sFlow (version 4 was documented in RFC 3176; the current version 5 specification is maintained at sflow.org) takes a fundamentally different approach. Instead of tracking every flow, sFlow samples packets at a configurable rate (e.g., 1 in every 1,000 packets) and exports the sampled packet headers along with interface counter data. The collector uses statistical extrapolation to estimate the total traffic.
sFlow's sampling approach is less CPU-intensive on the network device (critical for high-speed switches that may not have a separate flow processing ASIC), provides a statistically accurate picture of traffic composition, and works equally well for short-lived and long-lived flows. The trade-off is that short, low-volume flows may be missed entirely if no packets from that flow are sampled.
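The extrapolation itself is just multiplication by the sampling rate, and the statistical error shrinks with the number of samples observed for a traffic class. A sketch -- the error formula follows the approximation published by sFlow's authors and should be treated as a rule of thumb:

```python
import math

def estimate_traffic(sampled_bytes, sampled_pkts, sampling_rate):
    """Scale 1-in-N sampled totals up to an estimate of wire traffic."""
    return sampled_bytes * sampling_rate, sampled_pkts * sampling_rate

def pct_error_95(samples_in_class):
    """Approximate 95%-confidence error bound (percent) for a traffic
    class observed in `samples_in_class` samples: ~196 * sqrt(1/c)."""
    return 196.0 * math.sqrt(1.0 / samples_in_class)
```

At 1-in-1,000 sampling, a traffic class seen in 100 samples carries roughly a ±19.6% error bound; one seen in 10,000 samples is within about ±2% -- which is why sFlow characterizes heavy flows well but may miss light ones entirely.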
Flow Collection and Analysis
Flow data is exported to a flow collector that stores, indexes, and analyzes the records. Common flow collectors include:
- pmacct / nfacctd -- open-source NetFlow/IPFIX/sFlow collector with SQL database backend, widely used by ISPs.
- Kentik -- cloud-based flow analytics platform designed for ISPs and large networks.
- ntopng -- open-source network traffic analysis with real-time flow processing and visualization.
- Elasticsearch -- flow records can be ingested into Elasticsearch for full-text search and Kibana visualization.
- ClickHouse -- columnar database increasingly used for high-volume flow analytics due to its query performance on time-series data.
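Whatever the backend, the first question usually asked of collected flow records is "who are the top talkers?" -- a group-by-and-sum over one field. A minimal sketch with hypothetical record dictionaries:

```python
from collections import defaultdict

# Simplified flow records; real NetFlow/IPFIX records carry many more fields.
flows = [
    {"src": "192.0.2.10",   "dst": "203.0.113.50", "dport": 443, "bytes": 1_523_400},
    {"src": "192.0.2.10",   "dst": "203.0.113.80", "dport": 443, "bytes": 200_000},
    {"src": "198.51.100.7", "dst": "203.0.113.50", "dport": 53,  "bytes": 4_800},
]

def top_talkers(records, key="src", n=10):
    """Sum bytes per distinct value of `key`; return the heaviest contributors."""
    totals = defaultdict(int)
    for r in records:
        totals[r[key]] += r["bytes"]
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)[:n]
```

Grouping by `dport` instead of `src` answers the "which application is consuming the bandwidth" question from the list above; production collectors run the same aggregation as an indexed database query.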
Synthetic Monitoring
While SNMP and flow analysis monitor the network's internal state, synthetic monitoring measures the network from the user's perspective. Synthetic monitors periodically execute transactions -- loading a web page, resolving a DNS name, connecting to a TCP port, making an API call -- and measure the response time, availability, and correctness of the result.
Types of Synthetic Tests
- HTTP/HTTPS probes -- issue HTTP requests to web services and verify response status codes, content, response times, and TLS certificate validity. The most common synthetic test.
- DNS resolution probes -- query DNS resolvers for specific records and verify the response is correct and timely. Detects DNS misconfiguration, DNS server outages, and DNS hijacking.
- TCP/UDP connect probes -- attempt to open a connection to a specific port and measure connection setup time. Detects service outages and firewall misconfigurations.
- Multi-step transactions -- execute a sequence of HTTP requests simulating a user workflow (login, navigate, search, checkout). These detect application-level problems that simple health checks miss.
- BGP monitoring probes -- specialized probes that monitor BGP route announcements for a set of prefixes, detecting hijacks, leaks, and origin changes in real time.
Synthetic tests should run from multiple geographic locations to detect problems that are specific to certain paths or regions. A service might be reachable from the US but unreachable from Europe due to a routing issue, cable cut, or regional DNS failure.
Common synthetic monitoring tools include Prometheus Blackbox Exporter (open-source), Grafana Synthetic Monitoring, ThousandEyes (Cisco), Catchpoint, and Pingdom.
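As an illustration of the simplest probe type listed above, a TCP connect check needs only the standard socket library. A sketch, with an arbitrary timeout value:

```python
import socket
import time

def tcp_connect_probe(host, port, timeout_s=3.0):
    """Attempt a TCP connection and measure setup time.

    Returns (ok, elapsed_ms); ok is False on refusal or timeout.
    """
    start = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return True, (time.monotonic() - start) * 1000.0
    except OSError:
        return False, (time.monotonic() - start) * 1000.0
```

A real scheduler would run this on an interval from several vantage points and feed each (ok, elapsed_ms) result into the time-series pipeline.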
gNMI and Streaming Telemetry: Beyond SNMP
SNMP's poll-based model has fundamental limitations at scale. Polling thousands of devices every 30 seconds creates a burst of queries that can overwhelm both the monitoring system and the managed devices. The polling interval creates a resolution ceiling -- events that happen between polls are invisible. And SNMP's BER-encoded, flat OID-addressed data model is relatively inefficient to generate and parse.
Streaming telemetry inverts the model: instead of the monitoring system pulling data from devices, devices push data to collectors continuously. The collector subscribes to a set of metric paths, and the device streams updates at a configured interval (as low as every second, or even sub-second for critical metrics) or on change (only when a value changes).
gNMI (gRPC Network Management Interface) is the dominant streaming telemetry protocol, using gRPC over HTTP/2 with Protocol Buffers encoding. gNMI operates on YANG data models -- the same models used by NETCONF and RESTCONF -- providing a standardized, vendor-neutral way to access device configuration and operational state.
# gNMI subscription example (using the gnmic CLI tool)
gnmic subscribe \
  --address router1.example.com:57400 \
  --username admin \
  --password secret \
  --encoding json_ietf \
  --mode stream \
  --stream-mode sample \
  --sample-interval 10s \
  --path "/interfaces/interface/state/counters" \
  --path "/network-instances/network-instance/protocols/protocol/bgp/neighbors/neighbor/state"
Advantages of streaming telemetry over SNMP:
- Higher resolution -- 1-second or sub-second collection intervals, compared to SNMP's typical 30-300 seconds.
- Lower device overhead -- push-based streaming is more efficient than handling thousands of SNMP GET requests per polling cycle.
- Structured data -- gNMI uses YANG models and Protobuf encoding, providing strongly typed, hierarchical data rather than flat OID/value pairs.
- On-change subscriptions -- for state data (interface up/down, BGP session state), the device sends an update only when the value changes, eliminating redundant polling for unchanged values.
Major network OS platforms support gNMI: Arista EOS, Cisco IOS-XR and NX-OS, Juniper Junos, Nokia SR OS, and SONiC. Open-source collectors include gnmic and Telegraf (via its gNMI input plugin).
Time-Series Databases
All monitoring data -- SNMP counters, flow metrics, synthetic test results, streaming telemetry -- is time-series data: values associated with timestamps. Storing and querying this data efficiently requires specialized time-series databases (TSDBs) optimized for high write throughput, time-range queries, and downsampling.
Prometheus
Prometheus is the dominant open-source monitoring system and TSDB, particularly in cloud-native and Kubernetes environments. Its architecture is pull-based: Prometheus scrapes metrics from HTTP endpoints (/metrics) on monitored targets at regular intervals. Targets expose metrics in a text-based format:
# HELP node_network_receive_bytes_total Network device statistic receive_bytes
# TYPE node_network_receive_bytes_total counter
node_network_receive_bytes_total{device="eth0"} 1.234567890e+12
node_network_receive_bytes_total{device="eth1"} 5.678901234e+11
# HELP node_network_transmit_drop_total Network device statistic transmit_drop
# TYPE node_network_transmit_drop_total counter
node_network_transmit_drop_total{device="eth0"} 42
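The exposition format is simple enough that a hand-rolled renderer fits in a few lines. This sketch covers only a counter with a single label, while the official client libraries handle escaping, timestamps, and the other metric types:

```python
def render_counter(name, help_text, samples):
    """Render one counter metric in the Prometheus text exposition format.

    `samples` maps a device label value to the current counter reading.
    """
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} counter"]
    for device, value in samples.items():
        lines.append(f'{name}{{device="{device}"}} {value}')
    return "\n".join(lines) + "\n"
```

Serving the concatenated output of renderers like this over HTTP at /metrics is, in essence, what an exporter does.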
Prometheus stores data locally in a custom TSDB optimized for write-ahead logging and block compaction. For network monitoring, key Prometheus exporters include:
- SNMP Exporter -- translates SNMP polling into Prometheus metrics, bridging legacy network monitoring with modern TSDB architectures.
- Blackbox Exporter -- performs synthetic probes (HTTP, DNS, TCP, ICMP) and exposes the results as Prometheus metrics.
- Node Exporter -- collects host-level metrics (CPU, memory, disk, network interfaces) on Linux systems.
- cAdvisor -- collects container-level resource usage metrics.
For large-scale deployments, Prometheus is paired with a long-term storage backend: Thanos, Cortex, Mimir (Grafana Labs), or VictoriaMetrics. These provide horizontal scalability, multi-cluster federation, and long-term retention (months to years) that single-node Prometheus cannot achieve.
Alerting Pipelines
Monitoring data is only useful if it drives action. An alerting pipeline transforms metrics into notifications that reach the right people at the right time. Effective alerting is one of the hardest problems in operations engineering -- too many alerts cause fatigue and ignored pages; too few alerts mean outages go undetected.
Alert Design Principles
- Alert on symptoms, not causes -- alert when users are impacted (high error rate, increased latency) rather than on internal causes (high CPU, full disk). A server at 95% CPU is not a problem if it is serving traffic with acceptable latency. A server at 10% CPU is a critical problem if it is returning 500 errors.
- Every alert should be actionable -- if the on-call engineer receives an alert, they should know what to check and have a clear path to remediation. Alerts that say "this number is high" without context or action guidance are useless.
- Use multiple severity levels -- not every problem is a page-the-on-call emergency. Use tiered severities: critical (page immediately, production impact), warning (ticket, address during business hours), informational (dashboard only, investigate if time permits).
- De-duplicate and group related alerts -- a single switch failure might cause 50 interface-down alerts. The alerting system should group these into a single notification rather than paging the engineer 50 times.
- Suppress during known maintenance -- define maintenance windows during which alerts are silenced for the affected devices. This prevents false alarms during planned work.
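The grouping principle can be sketched as a reduction over an alert stream. This mirrors what Alertmanager's group_by configuration does, but is illustrative only:

```python
from collections import defaultdict

def group_alerts(alerts, group_by=("device",)):
    """Collapse related alerts into one notification per group key."""
    groups = defaultdict(list)
    for alert in alerts:
        key = tuple(alert[k] for k in group_by)
        groups[key].append(alert)
    return [
        {"key": key, "count": len(members), "alerts": members}
        for key, members in groups.items()
    ]
```

Fifty interface-down alerts from one failed switch become a single notification whose count field tells the on-call engineer the blast radius at a glance.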
Alerting Tools
The Prometheus ecosystem uses Alertmanager for alert routing, grouping, inhibition, and notification. Alert rules are defined in Prometheus, and fired alerts are sent to Alertmanager, which handles the notification pipeline:
# Prometheus alert rule
groups:
  - name: network-alerts
    rules:
      - alert: InterfaceDown
        expr: ifOperStatus{ifAlias!~".*unused.*"} != 1
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Interface {{ $labels.ifName }} down on {{ $labels.instance }}"
          description: "{{ $labels.ifAlias }} has been operationally down for more than 2 minutes"
      - alert: HighBandwidthUtilization
        expr: rate(ifHCInOctets[5m]) * 8 / ifHighSpeed / 1e6 > 0.85
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "Interface {{ $labels.ifName }} above 85% utilization on {{ $labels.instance }}"
Alert notification channels include PagerDuty (for on-call escalation), Slack/Teams (for team awareness), email (for non-urgent notifications), and webhooks (for custom integrations with ticketing systems or automation platforms).
Network Observability vs Monitoring
The distinction between monitoring and observability is increasingly relevant in network engineering. Monitoring answers predefined questions: "Is this interface up?" "What is the bandwidth utilization?" "Is the BGP session established?" These questions are encoded in dashboards and alert rules. Monitoring works well for known failure modes.
Observability addresses the unknown unknowns: "Why is traffic from AS64500 taking a different path than expected?" "Why did latency to a specific destination increase by 200ms at 3:47 PM?" These questions cannot be anticipated in advance. Observability requires:
- Rich, high-cardinality data -- not just aggregate interface counters, but per-flow, per-prefix, per-peer metrics that allow arbitrary slicing and dicing.
- Ad-hoc query capability -- the ability to ask new questions of the data without modifying the collection infrastructure. PromQL, LogQL, and SQL-based flow analytics enable this.
- Correlation across signals -- combining metrics (time-series), logs (events), and traces (request paths) to build a complete picture of a problem. In network engineering, this might mean correlating a BGP route change (log) with a latency increase (metric) and a traceroute path change (synthetic test result).
- Exploration tools -- interactive dashboards, topology visualizations, and anomaly detection that help engineers explore data rather than just view pre-built charts.
Network observability platforms are emerging to fill this gap. Traditional tools like Cacti, Nagios, and MRTG are being replaced (or supplemented) by modern stacks built on Prometheus, Grafana, Elasticsearch, and specialized network analytics platforms.
Common Monitoring Anti-Patterns
- Monitoring the monitoring system -- if Prometheus goes down, who monitors Prometheus? Always have an independent health check for your monitoring infrastructure, ideally from a separate failure domain.
- Alert storms -- a core router failure can generate hundreds of alerts (every downstream interface, BGP session, and dependent service triggers independently). Use alert grouping, inhibition rules, and dependency modeling to consolidate these into a single actionable notification.
- Dashboard rot -- dashboards created years ago that no one looks at, monitoring devices that no longer exist, or showing metrics that are no longer relevant. Regularly audit and prune dashboards.
- Polling interval mismatches -- polling a counter every 5 minutes but alerting on a 1-minute spike. The spike is invisible to the monitoring system because it averages out over the polling interval.
- Ignoring the control plane -- monitoring data plane metrics (bandwidth, errors) but not control plane health (BGP session state, OSPF adjacencies, spanning tree topology changes). A routing protocol flap can cause traffic loss even when all interfaces show "up" and "no errors."
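The polling-interval mismatch is worth making concrete with arithmetic (the numbers are invented for illustration):

```python
# One-minute throughput samples in Mbit/s: a 60-second burst near line
# rate inside an otherwise idle 5-minute window.
per_minute_mbps = [0, 0, 950, 0, 0]

# A system polling every 5 minutes sees only the window average:
five_min_avg = sum(per_minute_mbps) / len(per_minute_mbps)

print(five_min_avg)  # 190.0 -- an "alert above 800 Mbit/s" rule never fires
```

The minute-long saturation event is invisible at the 5-minute resolution, which is why alert thresholds must match the resolution of the data they evaluate.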
See It in Action
Network monitoring tools observe the same infrastructure visible in the global BGP routing table. The traceroutes that monitoring systems run, the SNMP counters they poll, and the flow data they analyze all reflect the routing decisions BGP makes. Use the god.ad BGP Looking Glass to explore the networks behind major monitoring platforms and the infrastructure they observe:
- AS15169 -- Google, operator of one of the most sophisticated internal monitoring systems in the world
- AS13335 -- Cloudflare, which provides distributed synthetic monitoring across its anycast network
- AS36351 -- SoftLayer (IBM Cloud), where many hosted monitoring platforms run
- AS14618 -- Amazon (AWS), host of CloudWatch, one of the largest cloud monitoring platforms
- AS8075 -- Microsoft Azure, provider of Azure Monitor and Network Watcher
This BGP looking glass is itself a monitoring tool -- it continuously observes the global routing table via RIS Live, tracking route announcements and withdrawals in real time. Look up any IP address or AS number to see its current routing state, and consider how each hop along the path is monitored by dozens of overlapping systems, from simple ping checks to sophisticated observability platforms.