How Network Monitoring Works: From Ping to Modern Observability

Network monitoring is the practice of continuously observing network infrastructure -- routers, switches, firewalls, links, servers, and services -- to detect failures, measure performance, plan capacity, and troubleshoot problems. It ranges from the simplest possible check (can I ping this host?) to sophisticated observability pipelines that correlate metrics, logs, and traces across thousands of devices in real time. Every network operator, from a small business with a single router to a Tier 1 ISP managing a global backbone, relies on monitoring to keep their infrastructure running.

The evolution of network monitoring mirrors the evolution of networks themselves. In the early days of the internet, ping and manual inspection were sufficient. As networks grew, SNMP provided a standardized way to poll device metrics. The rise of high-speed networks brought NetFlow and sFlow for traffic analysis. Today's cloud-native and microservice architectures demand distributed tracing, time-series databases capable of ingesting millions of metrics per second, and intelligent alerting that can distinguish a real outage from normal variance.

This article covers the full stack: from basic reachability testing through protocol-level monitoring, traffic analysis, modern time-series architectures, alerting pipelines, and the distinction between monitoring and observability.

Layer 1: Reachability -- Ping and Traceroute

The most fundamental monitoring question is: can I reach this host? ICMP (Internet Control Message Protocol) provides the answer.

ICMP Ping

Ping sends ICMP Echo Request packets to a target and measures the time until an Echo Reply is received. It provides three critical metrics: reachability (did any reply arrive at all?), round-trip time (RTT), and packet loss (the fraction of probes that went unanswered).

Monitoring systems send pings at regular intervals (typically every 30-60 seconds) and alert when a host becomes unreachable or when RTT or loss exceeds a threshold. While simple, ping has significant limitations: routers often rate-limit or deprioritize ICMP, so loss and latency figures can overstate real problems; a successful ping proves only that the host's network stack is up, not that the services running on it are working; and many firewalls block ICMP entirely.
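
The three ping metrics can be derived from a train of probe results. A minimal sketch in Python (the probe transport itself -- raw ICMP sockets or an external ping binary -- is omitted; `summarize_pings` is an illustrative name, not a library function):

```python
def summarize_pings(rtts_ms):
    """Summarize a train of ping results.

    rtts_ms: list of round-trip times in milliseconds, one entry per
    probe sent, with None marking a probe that got no reply.
    Returns the three core ping metrics: reachability, loss %, RTT stats.
    """
    replies = [r for r in rtts_ms if r is not None]
    sent = len(rtts_ms)
    loss_pct = 100.0 * (sent - len(replies)) / sent if sent else 100.0
    return {
        "reachable": bool(replies),
        "loss_pct": loss_pct,
        "rtt_min_ms": min(replies) if replies else None,
        "rtt_avg_ms": sum(replies) / len(replies) if replies else None,
        "rtt_max_ms": max(replies) if replies else None,
    }
```

A monitoring system would feed a window of recent probes through a function like this and compare the result against its alert thresholds.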

Traceroute

Traceroute extends ping by revealing the path packets take through the network. It sends packets with incrementally increasing TTL values, causing each router along the path to return an ICMP Time Exceeded message. This reveals every hop between source and destination, along with the RTT to each hop.

Network operators use traceroute to localize a problem to a specific hop (is the loss inside our network, at a peering point, or deep in a transit provider?), verify that traffic follows the intended path, and detect routing changes and path asymmetry.

Modern variants include MTR (My Traceroute), which combines traceroute with continuous ping to show per-hop loss and latency statistics over time, and Paris Traceroute, which uses consistent flow identifiers to avoid per-packet load-balancing artifacts that cause traceroute to show false paths.

SNMP: The Universal Device Monitoring Protocol

SNMP (Simple Network Management Protocol) is the standard protocol for monitoring network devices. Virtually every managed switch, router, firewall, UPS, and server supports SNMP. The protocol works on a poll-and-response model: the monitoring system (the manager) periodically queries devices (the agents) for specific metrics identified by OIDs (Object Identifiers) in the MIB (Management Information Base).

Key SNMP Metrics

The most commonly polled SNMP metrics include interface traffic counters (ifHCInOctets and ifHCOutOctets), interface error and discard counters (ifInErrors, ifInDiscards), interface operational status (ifOperStatus), and device health indicators such as CPU load, memory usage, and temperature.

SNMP Polling Architecture

A typical SNMP monitoring system polls hundreds or thousands of devices at regular intervals. Each polling cycle queries multiple OIDs per device. For a network with 1,000 devices, each with 50 interfaces polled every 60 seconds, the monitoring system must handle at least 50,000 interface polls per minute -- more in practice, since each interface requires several OIDs.

Counter metrics (bytes, packets) require computation: the monitoring system stores the current counter value and computes the rate of change (bytes per second) between consecutive polls. This introduces artifacts: a counter reset (device reboot) can produce a massive false spike; a polling interval that is too long can miss short-duration traffic bursts.
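
The rate computation and its artifacts can be sketched directly. One common policy, assumed here, is to discard a backwards-moving 64-bit counter as a reset (a genuine 64-bit wrap is implausible) while treating a backwards-moving 32-bit counter as a single wrap:

```python
def counter_rate(prev, curr, interval_s, bits=64):
    """Rate of change between two successive counter samples.

    Returns units/second, or None when the sample pair is unusable.
    """
    if interval_s <= 0:
        return None
    if curr >= prev:
        delta = curr - prev
    elif bits == 32:
        # 32-bit octet counters wrap in minutes on fast links;
        # assume exactly one wrap occurred between polls.
        delta = curr + 2**32 - prev
    else:
        # A 64-bit counter going backwards almost certainly means a
        # device reboot / counter reset: drop this sample pair rather
        # than emit a massive false spike.
        return None
    return delta / interval_s
```

This is also why long polling intervals are risky with 32-bit counters: if the counter wraps more than once between polls, the computed rate is silently wrong, which is one motivation for the 64-bit ifHC* counters.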

The full network monitoring stack, from bottom to top:

Reachability -- ICMP ping / traceroute / MTR (up/down, RTT, loss)
Device metrics -- SNMP polling / gNMI streaming (bandwidth, CPU, errors)
Traffic analysis -- NetFlow / sFlow / IPFIX (who, what, where)
Synthetic testing -- HTTP probes / DNS checks / TCP connect (user experience)
Time-series DB -- Prometheus / InfluxDB / VictoriaMetrics / Mimir
Alerting pipeline -- Alertmanager / PagerDuty / Grafana Alerts / OpsGenie

Flow Analysis: NetFlow, sFlow, and IPFIX

While SNMP tells you how much traffic is on an interface, flow protocols tell you what that traffic is: which source and destination IP addresses, ports, protocols, and AS numbers are responsible for the bandwidth. Flow data answers questions that interface counters cannot: who is consuming the bandwidth, which applications and destinations dominate, and whether a traffic spike is legitimate load or an attack.

NetFlow (Cisco) and IPFIX

NetFlow (versions 5 and 9) was developed by Cisco and has become a de facto standard. It works by having routers and switches maintain a flow table that tracks active connections. When a flow expires (inactivity timeout, typically 15 seconds, or active timeout, typically 5 minutes), the device exports a flow record to a collector.
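
The two expiry timers can be illustrated with a toy flow cache. This is a deliberately simplified model (expiry is only checked when a packet arrives, and real devices also flush flows on TCP FIN/RST and under table-memory pressure):

```python
def expire_flows(packets, inactive=15.0, active=300.0):
    """Replay (timestamp, flow_key, bytes) packets through a toy flow
    cache, exporting records per NetFlow's two expiry timers."""
    cache = {}      # flow_key -> [first_seen, last_seen, bytes, packets]
    exported = []

    def export(key):
        first, last, nbytes, npkts = cache.pop(key)
        exported.append({"key": key, "start": first, "end": last,
                         "bytes": nbytes, "packets": npkts})

    for ts, key, size in sorted(packets):
        # Expire flows idle past the inactivity timeout or whose total
        # duration exceeds the active timeout.
        for k in [k for k, (f, l, _, _) in cache.items()
                  if ts - l > inactive or ts - f > active]:
            export(k)
        if key in cache:
            entry = cache[key]
            entry[1] = ts
            entry[2] += size
            entry[3] += 1
        else:
            cache[key] = [ts, ts, size, 1]

    for k in list(cache):   # flush whatever remains at end of input
        export(k)
    return exported
```

Note the consequence for analysis: a single long-lived TCP session appears in the collector as several consecutive flow records, split at the active timeout, which must be stitched back together for session-level views.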

A NetFlow v5 record contains:

Source IP:      192.0.2.10
Destination IP: 203.0.113.50
Source Port:    54321
Destination Port: 443
Protocol:       TCP (6)
Bytes:          1,523,400
Packets:        1,203
Start Time:     2025-03-15T14:22:33Z
End Time:       2025-03-15T14:27:18Z
TCP Flags:      SYN, ACK, PSH, FIN
Input Interface: Gi0/0/1
Output Interface: Gi0/0/2
Source AS:      64500
Destination AS: 13335
Next Hop:       198.51.100.1
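
On the wire, the NetFlow v5 record is a fixed 48-byte structure, which makes it easy to parse directly. A sketch of the standard field layout (note that interfaces are carried as SNMP ifIndex numbers, not names like Gi0/0/1, and the start/end times are device sysUptime offsets):

```python
import struct

# Fixed 48-byte NetFlow v5 flow record (follows the 24-byte header):
# srcaddr, dstaddr, nexthop, input, output, dPkts, dOctets, First, Last,
# srcport, dstport, pad, tcp_flags, prot, tos, src_as, dst_as,
# src_mask, dst_mask, pad.
V5_RECORD = struct.Struct("!IIIHHIIIIHHxBBBHHBBxx")

def parse_v5_record(data):
    (src, dst, nexthop, in_if, out_if, pkts, octets, first, last,
     sport, dport, flags, proto, tos, src_as, dst_as,
     src_mask, dst_mask) = V5_RECORD.unpack(data)
    dotted = lambda n: ".".join(str((n >> s) & 0xFF) for s in (24, 16, 8, 0))
    return {"src": dotted(src), "dst": dotted(dst), "proto": proto,
            "sport": sport, "dport": dport, "bytes": octets,
            "packets": pkts, "src_as": src_as, "dst_as": dst_as,
            "in_if": in_if, "out_if": out_if}
```

A real collector would first parse the 24-byte header to get the record count and sysUptime base, then iterate over the records in the datagram.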

IPFIX (IP Flow Information Export, RFC 7011) is the IETF-standardized evolution of NetFlow v9. It uses a template-based format that supports arbitrary fields, making it extensible for new protocols and vendor-specific information.

sFlow

sFlow (RFC 3176) takes a fundamentally different approach. Instead of tracking every flow, sFlow samples packets at a configurable rate (e.g., 1 in every 1,000 packets) and exports the sampled packet headers along with interface counter data. The collector uses statistical extrapolation to estimate the total traffic.

sFlow's sampling approach is less CPU-intensive on the network device (critical for high-speed switches that may not have a separate flow processing ASIC), provides a statistically accurate picture of traffic composition, and works equally well for short-lived and long-lived flows. The trade-off is that short, low-volume flows may be missed entirely if no packets from that flow are sampled.
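
The extrapolation itself is simple: scale the sampled totals by the sampling rate. A minimal sketch:

```python
def estimate_traffic(sampled_packet_lengths, sampling_rate):
    """Estimate total traffic from 1-in-N sampled packets.

    sampled_packet_lengths: byte lengths of the packets that were sampled.
    sampling_rate: the N in 1-in-N sampling (e.g. 1000).
    Returns (estimated_packets, estimated_bytes).
    """
    sampled = list(sampled_packet_lengths)
    return len(sampled) * sampling_rate, sum(sampled) * sampling_rate
```

The estimate's relative error shrinks with the number of samples, which is why sFlow is accurate for aggregate traffic composition but can miss a flow entirely if none of its packets happen to be sampled.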

Flow Collection and Analysis

Flow data is exported to a flow collector that stores, indexes, and analyzes the records. Common flow collectors include nfdump/NfSen, pmacct, GoFlow2, and ElastiFlow on the open-source side, and commercial platforms such as Kentik and Plixer Scrutinizer.

Synthetic Monitoring

While SNMP and flow analysis monitor the network's internal state, synthetic monitoring measures the network from the user's perspective. Synthetic monitors periodically execute transactions -- loading a web page, resolving a DNS name, connecting to a TCP port, making an API call -- and measure the response time, availability, and correctness of the result.

Types of Synthetic Tests

Common synthetic tests include ICMP and TCP-connect probes (basic reachability and port availability), HTTP/HTTPS checks (status code, response time, content correctness, TLS certificate expiry), DNS resolution checks, and full browser-based transaction tests that script a multi-step user journey.

Synthetic tests should run from multiple geographic locations to detect problems that are specific to certain paths or regions. A service might be reachable from the US but unreachable from Europe due to a routing issue, cable cut, or regional DNS failure.
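
As a concrete example of one synthetic test type, a DNS resolution check can be sketched in a few lines (`dns_check` is an illustrative name; a real probe would also validate the returned records and ship the result to a time-series database):

```python
import socket
import time

def dns_check(hostname, port=None):
    """Synthetic DNS test: resolve a name and time the resolution."""
    start = time.monotonic()
    try:
        infos = socket.getaddrinfo(hostname, port)
        elapsed_ms = (time.monotonic() - start) * 1000
        addresses = sorted({info[4][0] for info in infos})
        return {"ok": True, "latency_ms": elapsed_ms,
                "addresses": addresses}
    except socket.gaierror as exc:
        return {"ok": False, "error": str(exc)}
```

Run from several vantage points, a check like this catches regional DNS failures that a single in-datacenter monitor would never see.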

Common synthetic monitoring tools include Prometheus Blackbox Exporter (open-source), Grafana Synthetic Monitoring, ThousandEyes (Cisco), Catchpoint, and Pingdom.

gNMI and Streaming Telemetry: Beyond SNMP

SNMP's poll-based model has fundamental limitations at scale. Polling thousands of devices every 30 seconds creates a burst of queries that can overwhelm both the monitoring system and the managed devices. The polling interval creates a resolution ceiling -- events that happen between polls are invisible. And SNMP's ASN.1/BER encoding, with its per-OID addressing, is verbose and costly to parse compared with modern binary serializations.

Streaming telemetry inverts the model: instead of the monitoring system pulling data from devices, devices push data to collectors continuously. The collector subscribes to a set of metric paths, and the device streams updates at a configured interval (as low as every second, or even sub-second for critical metrics) or on change (only when the value changes).

gNMI (gRPC Network Management Interface) is the dominant streaming telemetry protocol, using gRPC over HTTP/2 with Protocol Buffers encoding. gNMI operates on YANG data models -- the same models used by NETCONF and RESTCONF -- providing a standardized, vendor-neutral way to access device configuration and operational state.

# gNMI subscription example (using gnmic CLI tool)
gnmic subscribe \
  --address router1.example.com:57400 \
  --username admin \
  --password secret \
  --encoding json_ietf \
  --mode stream \
  --stream-mode sample \
  --sample-interval 10s \
  --path "/interfaces/interface/state/counters" \
  --path "/network-instances/network-instance/protocols/protocol/bgp/neighbors/neighbor/state"

Advantages of streaming telemetry over SNMP: higher resolution (second or sub-second sampling), on-change notifications instead of blind polling, an efficient binary encoding (Protocol Buffers over gRPC), structured vendor-neutral YANG data models, and better scalability, since devices push data on their own schedule instead of absorbing synchronized bursts of queries.

Major network OS platforms support gNMI: Arista EOS, Cisco IOS-XR (and NX-OS), Juniper Junos, Nokia SR OS, and SONiC. Open-source collectors include gnmic itself and Telegraf (via its gNMI input plugin).

Time-Series Databases

All monitoring data -- SNMP counters, flow metrics, synthetic test results, streaming telemetry -- is time-series data: values associated with timestamps. Storing and querying this data efficiently requires specialized time-series databases (TSDBs) optimized for high write throughput, time-range queries, and downsampling.

Prometheus

Prometheus is the dominant open-source monitoring system and TSDB, particularly in cloud-native and Kubernetes environments. Its architecture is pull-based: Prometheus scrapes metrics from HTTP endpoints (/metrics) on monitored targets at regular intervals. Targets expose metrics in a text-based format:

# HELP node_network_receive_bytes_total Network device statistic receive_bytes
# TYPE node_network_receive_bytes_total counter
node_network_receive_bytes_total{device="eth0"} 1.234567890e+12
node_network_receive_bytes_total{device="eth1"} 5.678901234e+11

# HELP node_network_transmit_drop_total Network device statistic transmit_drop
# TYPE node_network_transmit_drop_total counter
node_network_transmit_drop_total{device="eth0"} 42
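
Generating this exposition format is simple enough that many small exporters do it by hand. A minimal sketch of a renderer for one counter metric (a production exporter should also escape label values and follow the full format rules):

```python
def render_counter(name, help_text, samples):
    """Render a counter in the Prometheus text exposition format.

    samples: list of (labels_dict, value) pairs for this metric.
    """
    lines = [f"# HELP {name} {help_text}",
             f"# TYPE {name} counter"]
    for labels, value in samples:
        label_str = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
        lines.append(f"{name}{{{label_str}}} {value}")
    return "\n".join(lines) + "\n"
```

Serving the rendered text from an HTTP handler at /metrics is all it takes for Prometheus to scrape it.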

Prometheus stores data locally in a custom TSDB built around a write-ahead log and periodic block compaction. For network monitoring, key Prometheus exporters include SNMP Exporter (polls SNMP-capable devices and re-exposes the results as Prometheus metrics), Blackbox Exporter (synthetic ICMP, TCP, DNS, and HTTP probes), and Node Exporter (host-level metrics, including the per-interface network counters shown above).

For large-scale deployments, Prometheus is paired with a long-term storage backend: Thanos, Cortex, Mimir (Grafana Labs), or VictoriaMetrics. These provide horizontal scalability, multi-cluster federation, and long-term retention (months to years) that single-node Prometheus cannot achieve.

The Prometheus monitoring pipeline, end to end:

Exporters -- SNMP Exporter, Blackbox Exporter, Node Exporter, application metrics, gNMI collectors
Prometheus -- scrapes the exporters; TSDB storage, PromQL engine, recording rules, alert rules
Alertmanager -- routes, groups, silences, and notifies (PagerDuty, Slack, email, webhooks)
Grafana -- queries Prometheus for dashboards, network maps, traffic graphs, and alert annotations
Long-term storage -- Thanos, Mimir, or VictoriaMetrics for months-to-years retention

A typical PromQL expression: rate(ifHCInOctets{device="xe-0/0/0"}[5m]) * 8 yields bits per second on an interface.

Alerting Pipelines

Monitoring data is only useful if it drives action. An alerting pipeline transforms metrics into notifications that reach the right people at the right time. Effective alerting is one of the hardest problems in operations engineering -- too many alerts cause fatigue and ignored pages; too few alerts mean outages go undetected.

Alert Design Principles

Good alerts are actionable (every page demands a human response), symptom-based (alert on user-visible impact, not on every internal metric), tiered by severity (page for critical issues, open tickets for the rest), and debounced (a sustained-duration condition prevents a flapping metric from paging repeatedly).

Alerting Tools

The Prometheus ecosystem uses Alertmanager for alert routing, grouping, inhibition, and notification. Alert rules are defined in Prometheus, and fired alerts are sent to Alertmanager, which handles the notification pipeline:

# Prometheus alert rule
groups:
- name: network-alerts
  rules:
  - alert: InterfaceDown
    expr: ifOperStatus{ifAlias!~".*unused.*"} != 1
    for: 2m
    labels:
      severity: critical
    annotations:
      summary: "Interface {{ $labels.ifName }} down on {{ $labels.instance }}"
      description: "{{ $labels.ifAlias }} has been operationally down for more than 2 minutes"

  - alert: HighBandwidthUtilization
    expr: rate(ifHCInOctets[5m]) * 8 / ifHighSpeed / 1e6 > 0.85
    for: 15m
    labels:
      severity: warning
    annotations:
      summary: "Interface {{ $labels.ifName }} above 85% utilization on {{ $labels.instance }}"
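
The utilization expression can be mirrored in plain arithmetic to make the unit conversions explicit: a byte-counter rate times 8 gives bits per second, and ifHighSpeed is reported in Mb/s. A sketch:

```python
def utilization(prev_octets, curr_octets, interval_s, if_high_speed_mbps):
    """Fraction of interface capacity in use between two counter samples.

    Mirrors rate(ifHCInOctets[...]) * 8 / ifHighSpeed / 1e6:
    bytes/sec -> bits/sec, divided by capacity in bits/sec.
    """
    bits_per_sec = (curr_octets - prev_octets) / interval_s * 8
    return bits_per_sec / (if_high_speed_mbps * 1e6)
```

For example, a counter that grows by 33.75 GB over five minutes on a 1000 Mb/s (ifHighSpeed = 1000) interface works out to 0.9, which would trip the 0.85 threshold above.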

Alert notification channels include PagerDuty (for on-call escalation), Slack/Teams (for team awareness), email (for non-urgent notifications), and webhooks (for custom integrations with ticketing systems or automation platforms).
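
The for: clause in the rules above can be emulated offline to see how it suppresses flapping. A simplified sketch (Prometheus evaluates rules on a fixed interval; here the sample list stands in for those evaluations):

```python
def firing_times(samples, duration):
    """Timestamps at which an alert with a `for: duration` fires.

    samples: time-ordered list of (timestamp_s, condition_is_true).
    The alert fires only once the condition has been continuously
    true for at least `duration` seconds; any false sample resets it.
    """
    since = None
    firing = []
    for ts, bad in samples:
        if not bad:
            since = None        # condition cleared: reset the timer
            continue
        if since is None:
            since = ts          # condition just became true
        if ts - since >= duration:
            firing.append(ts)
    return firing
```

A brief one-evaluation blip never fires, which is exactly the behavior the 2-minute InterfaceDown rule relies on to ignore momentary flaps.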

Network Observability vs Monitoring

The distinction between monitoring and observability is increasingly relevant in network engineering. Monitoring answers predefined questions: "Is this interface up?" "What is the bandwidth utilization?" "Is the BGP session established?" These questions are encoded in dashboards and alert rules. Monitoring works well for known failure modes.

Observability addresses the unknown unknowns: "Why is traffic from AS64500 taking a different path than expected?" "Why did latency to a specific destination increase by 200ms at 3:47 PM?" These questions cannot be anticipated in advance. Observability requires: high-cardinality, high-resolution data; the ability to run ad-hoc queries rather than only view predefined dashboards; and correlation across data types -- metrics, flow records, logs, and routing state -- so an engineer can follow a question wherever it leads.

Network observability platforms are emerging to fill this gap. Traditional tools like Cacti, Nagios, and MRTG are being replaced (or supplemented) by modern stacks built on Prometheus, Grafana, Elasticsearch, and specialized network analytics platforms.

Common Monitoring Anti-Patterns

Frequent mistakes include alerting on every metric instead of on symptoms (producing alert fatigue and ignored pages), monitoring only from inside the network (missing failures that only external users experience), hiding short bursts behind long polling intervals, mishandling counter resets and wraps (producing false traffic spikes), and treating dashboards as a substitute for alerting (problems no one happens to be watching go undetected).

See It in Action

Network monitoring tools observe the same infrastructure visible in the global BGP routing table. The traceroutes that monitoring systems run, the SNMP counters they poll, and the flow data they analyze all reflect the routing decisions BGP makes. Use the god.ad BGP Looking Glass to explore the networks behind major monitoring platforms and the infrastructure they observe:

This BGP looking glass is itself a monitoring tool -- it continuously observes the global routing table via RIS Live, tracking route announcements and withdrawals in real time. Look up any IP address or AS number to see its current routing state, and consider how each hop along the path is monitored by dozens of overlapping systems, from simple ping checks to sophisticated observability platforms.
