How Protocol Buffers Work
Protocol Buffers (protobuf) is a binary serialization format created at Google in 2001 and open-sourced in 2008. It encodes structured data into a compact binary representation that is smaller, faster to parse, and more rigorously typed than text formats like JSON or XML. Protobuf is the wire format underlying gRPC, and it is used internally at Google for virtually all inter-service communication and data storage — reportedly processing billions of messages per second across their infrastructure.
Understanding protobuf requires understanding three layers: the schema language (.proto files that define message structures), the wire format (how data is actually encoded as bytes on the wire), and the code generation pipeline (the protoc compiler that produces language-specific serialization code). This article covers all three in depth.
The Wire Format: How Bytes Are Laid Out
At its core, a protobuf message is a sequence of tag-value pairs. There are no delimiters between fields, no field names, and no structural markers like braces or brackets. Each pair consists of a tag (which encodes the field number and wire type) followed by the field value. The decoder reads pairs one after another until it reaches the end of the message.
The tag — itself encoded as a varint — packs together two pieces of information: the field number and the wire type. The wire type occupies the lowest 3 bits, and the field number occupies the remaining upper bits. So the tag is computed as (field_number << 3) | wire_type. For field 1 with wire type 0 (varint), the tag is (1 << 3) | 0 = 0x08. For field 2 with wire type 2 (length-delimited), the tag is (2 << 3) | 2 = 0x12.
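The tag arithmetic is simple enough to sketch directly. This Python fragment (an illustration only, not part of any protobuf library) packs and unpacks tags:

```python
def make_tag(field_number: int, wire_type: int) -> int:
    """Pack a field number and a wire type into a protobuf tag."""
    return (field_number << 3) | wire_type

def split_tag(tag: int) -> tuple[int, int]:
    """Recover (field_number, wire_type) from a tag."""
    return tag >> 3, tag & 0x7

assert make_tag(1, 0) == 0x08  # field 1, VARINT
assert make_tag(2, 2) == 0x12  # field 2, LEN
assert split_tag(0x12) == (2, 2)
```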
Varint Encoding
Protobuf uses variable-length integer encoding (varints) extensively — for integer field values, for tags, and for length prefixes. A varint encodes an integer using one or more bytes, where the most significant bit (MSB) of each byte is a continuation flag. If the MSB is 1, more bytes follow. If it is 0, this is the last byte. The remaining 7 bits of each byte carry the actual value, in little-endian order.
Small integers are extremely compact under this encoding. The value 1 takes a single byte. The value 127 still takes one byte. The value 128 requires two bytes. This is ideal for the common case where field numbers are small (1-15 fit in a single-byte tag) and values are modest. By contrast, a fixed 32-bit integer always uses 4 bytes even to encode the value 0.
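The scheme is compact enough to implement in full. Here is a minimal Python sketch of a varint codec following the MSB-continuation rule described above (non-negative values only; the helper names are mine):

```python
def encode_varint(n: int) -> bytes:
    """Encode a non-negative integer as a protobuf varint."""
    out = bytearray()
    while True:
        b, n = n & 0x7F, n >> 7   # low 7 bits carry the payload
        out.append((b | 0x80) if n else b)  # MSB set while more bytes follow
        if not n:
            return bytes(out)

def decode_varint(buf: bytes, pos: int = 0) -> tuple[int, int]:
    """Decode one varint starting at pos; return (value, next_pos)."""
    result = shift = 0
    while True:
        b = buf[pos]
        result |= (b & 0x7F) << shift  # 7-bit groups are little-endian
        pos += 1
        if not b & 0x80:
            return result, pos
        shift += 7

assert encode_varint(1) == b"\x01"
assert encode_varint(127) == b"\x7f"
assert encode_varint(128) == b"\x80\x01"
assert decode_varint(b"\x80\x01") == (128, 2)
```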
Signed Integers: ZigZag Encoding
Varints encode unsigned integers directly, but negative numbers are problematic. A negative int32 in two's complement has its high bit set, which means the varint encoding would always use the maximum 10 bytes (since protobuf varints extend to 64 bits). To avoid this waste, protobuf provides sint32 and sint64 types that use ZigZag encoding, which maps signed integers to unsigned integers so that values with small absolute magnitude produce small varints:
 0 -> 0
-1 -> 1
 1 -> 2
-2 -> 3
 2 -> 4
...
The formula is (n << 1) ^ (n >> 31) for 32-bit values, where >> is an arithmetic (sign-extending) shift. This means -1 encodes as 1 (one byte) instead of the 10-byte monstrosity you would get with a plain int32.
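In Python, where integers are arbitrary precision and >> is already an arithmetic shift, the 32-bit mapping and its inverse can be sketched as:

```python
def zigzag_encode32(n: int) -> int:
    """Map a signed 32-bit int to an unsigned int (small magnitude -> small value)."""
    return ((n << 1) ^ (n >> 31)) & 0xFFFFFFFF

def zigzag_decode(z: int) -> int:
    """Invert the ZigZag mapping."""
    return (z >> 1) ^ -(z & 1)

# The table from the text:
assert [zigzag_encode32(n) for n in (0, -1, 1, -2, 2)] == [0, 1, 2, 3, 4]
assert zigzag_decode(zigzag_encode32(-12345)) == -12345
```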
Wire Types
There are six wire types in the protobuf binary format, though only four are commonly encountered:
| Wire type | Name | Field types |
| 0 | VARINT | int32, int64, uint32, uint64, sint32, sint64, bool, enum |
| 1 | I64 | fixed64, sfixed64, double |
| 2 | LEN | string, bytes, nested messages, packed repeated fields |
| 5 | I32 | fixed32, sfixed32, float |
Wire types 3 and 4 (SGROUP and EGROUP) were used for the deprecated "groups" feature in proto2 and are no longer used in proto3. The wire type tells the decoder how many bytes the value occupies without needing to know the schema — this is what makes it possible to skip unknown fields during deserialization.
For wire type 0 (VARINT), the decoder reads a varint. For types 1 and 5, it reads a fixed-size chunk (8 or 4 bytes). For type 2 (LEN), it reads a varint giving the byte length, then reads that many bytes. This self-describing aspect of the wire format is critical for forwards and backwards compatibility.
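That decode loop can be written out concretely. The following Python sketch walks any protobuf message with no schema at all — exactly the property that field-skipping relies on (helper names are mine, not from any library):

```python
def decode_varint(buf: bytes, pos: int) -> tuple[int, int]:
    """Read one varint from buf at pos; return (value, next_pos)."""
    result = shift = 0
    while True:
        b = buf[pos]
        result |= (b & 0x7F) << shift
        pos += 1
        if not b & 0x80:
            return result, pos
        shift += 7

def walk_fields(buf: bytes):
    """Yield (field_number, wire_type, raw_value) for every field in buf."""
    pos = 0
    while pos < len(buf):
        tag, pos = decode_varint(buf, pos)
        field, wtype = tag >> 3, tag & 0x7
        if wtype == 0:                              # VARINT
            value, pos = decode_varint(buf, pos)
        elif wtype == 1:                            # I64: 8 fixed bytes
            value, pos = buf[pos:pos + 8], pos + 8
        elif wtype == 2:                            # LEN: length prefix, then payload
            n, pos = decode_varint(buf, pos)
            value, pos = buf[pos:pos + n], pos + n
        elif wtype == 5:                            # I32: 4 fixed bytes
            value, pos = buf[pos:pos + 4], pos + 4
        else:
            raise ValueError(f"unsupported wire type {wtype}")
        yield field, wtype, value

# Field 1 = varint 150, field 2 = string "hi"
assert list(walk_fields(b"\x08\x96\x01\x12\x02hi")) == [(1, 0, 150), (2, 2, b"hi")]
```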
The .proto Schema Language
Protobuf messages are defined in .proto files. Here is an example illustrating the major features:
syntax = "proto3";

package example;

enum Status {
  STATUS_UNKNOWN = 0;
  STATUS_ACTIVE = 1;
  STATUS_PAUSED = 2;
}

message SearchRequest {
  string query = 1;
  int32 page_number = 2;
  int32 result_per_page = 3;
  Status status = 4;
  repeated string tags = 5;

  oneof test_type {
    string a_test = 6;
    int32 b_test = 7;
  }

  map<string, string> metadata = 8;
}

message SearchResponse {
  repeated Result results = 1;
  int32 total_count = 2;

  message Result {
    string url = 1;
    string title = 2;
    float score = 3;
  }
}
Each field has a type, a name, and a field number. The field number is the identifier that appears on the wire — field names exist only in the schema, not in the encoded data. This is a fundamental difference from JSON, where the key "query" is literally spelled out in every message. In protobuf, field 1 is just the tag byte 0x0A.
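To make the size difference concrete, here is a Python sketch that encodes a single string field by hand (it assumes a field number below 16, so the tag fits in one byte; helper names are mine):

```python
def encode_varint(n: int) -> bytes:
    """Encode a non-negative integer as a varint."""
    out = bytearray()
    while True:
        b, n = n & 0x7F, n >> 7
        out.append((b | 0x80) if n else b)
        if not n:
            return bytes(out)

def encode_string_field(field_number: int, text: str) -> bytes:
    """Encode a string field: one tag byte (field < 16), length prefix, UTF-8 bytes."""
    payload = text.encode("utf-8")
    tag = (field_number << 3) | 2  # wire type 2: LEN
    return bytes([tag]) + encode_varint(len(payload)) + payload

encoded = encode_string_field(1, "hello")
assert encoded == b"\x0a\x05hello"       # 7 bytes total
assert len('{"query":"hello"}') == 17    # the JSON equivalent spells out the key
```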
Scalar Types
Protobuf provides a rich set of scalar types. Each maps to a specific wire type:
- double / float — 64-bit / 32-bit IEEE 754 floating point (wire type 1 / 5)
- int32 / int64 — variable-length encoding, inefficient for negative numbers (wire type 0)
- uint32 / uint64 — variable-length unsigned (wire type 0)
- sint32 / sint64 — ZigZag-encoded signed integers, efficient for negative values (wire type 0)
- fixed32 / fixed64 — always 4 / 8 bytes, more compact when values are often large (wire type 5 / 1)
- sfixed32 / sfixed64 — signed fixed-width (wire type 5 / 1)
- bool — varint, 0 or 1 (wire type 0)
- string — UTF-8 text, length-prefixed (wire type 2)
- bytes — arbitrary byte sequence, length-prefixed (wire type 2)
Choosing between int32 and sint32 matters if your field frequently holds negative values. Choosing between int32 and fixed32 matters if your values are typically large (above ~2^28), where the fixed encoding is actually smaller than the varint encoding.
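The crossover point is easy to verify: a varint spends one byte per 7 bits of payload, so values of 2^28 and above cost five or more bytes, while fixed32 always costs four. A small Python check (the helper name is mine):

```python
def varint_size(n: int) -> int:
    """Number of bytes in the varint encoding of a non-negative integer."""
    size = 1
    while n >= 0x80:
        n >>= 7
        size += 1
    return size

assert varint_size(0) == 1
assert varint_size(2**28 - 1) == 4   # still beats fixed32
assert varint_size(2**28) == 5       # now loses to fixed32's constant 4 bytes
```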
Nested Messages
Messages can contain other messages as fields. On the wire, a nested message is encoded as a length-delimited byte sequence (wire type 2) — the outer message contains a length prefix followed by the complete serialization of the inner message. This makes nesting recursive: you can have arbitrarily deep message trees.
Because the outer message only knows the byte length of the inner message, a decoder without the inner schema can still skip the nested message entirely. This is how protobuf achieves forward compatibility — old code encountering a new nested message type simply skips the unknown bytes.
Repeated Fields and Packed Encoding
A repeated field represents a list. In the original encoding, each element appears as a separate tag-value pair with the same field number. If you have a repeated int32 field numbered 4 with values [3, 270, 86942], the wire format would contain three separate tag-value pairs, each with tag 0x20.
Proto3 introduced packed encoding as the default for repeated scalar fields. Instead of separate tag-value pairs, all values are concatenated into a single length-delimited byte sequence. This eliminates the per-element tag overhead:
// Unpacked (proto2 default): 3 tags + 3 values
20 03 20 8E 02 20 9E A7 05

// Packed (proto3 default): 1 tag + 1 length + 3 values
22 06 03 8E 02 9E A7 05
For a list of 100 integers, packed encoding saves 99 tag bytes — a significant reduction. Packed encoding applies to scalar numeric types; strings, bytes, and nested messages cannot be packed because each element has variable size and the decoder would not know where one ends and the next begins.
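The two layouts from the example above can be reproduced byte-for-byte with a short Python sketch (helper names are mine):

```python
def encode_varint(n: int) -> bytes:
    """Encode a non-negative integer as a varint."""
    out = bytearray()
    while True:
        b, n = n & 0x7F, n >> 7
        out.append((b | 0x80) if n else b)
        if not n:
            return bytes(out)

def encode_unpacked(field_number: int, values: list[int]) -> bytes:
    """One tag-value pair per element (proto2 default)."""
    tag = bytes([(field_number << 3) | 0])  # wire type 0: VARINT
    return b"".join(tag + encode_varint(v) for v in values)

def encode_packed(field_number: int, values: list[int]) -> bytes:
    """One length-delimited blob holding all elements (proto3 default)."""
    payload = b"".join(encode_varint(v) for v in values)
    tag = bytes([(field_number << 3) | 2])  # wire type 2: LEN
    return tag + encode_varint(len(payload)) + payload

assert encode_unpacked(4, [3, 270, 86942]).hex() == "2003208e02209ea705"
assert encode_packed(4, [3, 270, 86942]).hex() == "2206038e029ea705"
```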
Oneof Fields
A oneof declaration says that at most one of the listed fields can be set at a time. On the wire, oneof has no special encoding — the fields are regular tag-value pairs. The constraint is enforced by the generated code: setting one field in a oneof automatically clears the others. If multiple oneof fields appear in the encoded bytes (due to manual construction or concatenation), the last one wins.
Map Fields
A map<K, V> field is syntactic sugar for a repeated field of key-value entry messages. On the wire, map<string, int32> scores = 5; is equivalent to:
message ScoresEntry {
  string key = 1;
  int32 value = 2;
}

repeated ScoresEntry scores = 5;
Each map entry is a length-delimited message containing exactly two fields. Map iteration order is not guaranteed to match insertion order — the wire format does not preserve ordering, and different language implementations use different internal data structures (hash maps, B-trees, etc.).
Enums
Enum values are encoded as varints on the wire — they are just integers. In proto3, every enum must have a value with number 0, which serves as the default. When a decoder encounters an unknown enum value (one not in its copy of the schema), proto3 preserves the integer value in the parsed message, while proto2 would put it in the unknown field set.
Default Values and Field Presence
This is one of the most misunderstood aspects of protobuf. In proto3, scalar fields have implicit presence by default — there is no way to distinguish between a field that was explicitly set to its default value (0, empty string, false) and a field that was never set. If you serialize a message with age = 0, the encoder omits the field entirely (zero bytes on the wire), and the decoder produces 0 for the age field. The encoding is identical to never having set the field at all.
Proto2, by contrast, tracks field presence for all singular fields: you can distinguish "field is set to 0" from "field is not set." Proto3 added the optional keyword (reintroduced in 2020) to opt individual fields into explicit presence tracking, generating has_ methods in the generated code.
This default value behavior means you must design your schema carefully. If distinguishing "absent" from "zero" matters, use optional, a wrapper type like google.protobuf.Int32Value, or a oneof.
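Implicit presence is visible at the encoder level: a proto3 serializer simply skips fields that hold their default value. A sketch (helper names are mine; non-negative values only — a real int32 encoder sign-extends negatives to 10 bytes):

```python
def encode_varint(n: int) -> bytes:
    """Encode a non-negative integer as a varint."""
    out = bytearray()
    while True:
        b, n = n & 0x7F, n >> 7
        out.append((b | 0x80) if n else b)
        if not n:
            return bytes(out)

def encode_int32_field(field_number: int, value: int) -> bytes:
    """Proto3 implicit presence: a default value produces zero bytes on the wire."""
    if value == 0:
        return b""  # indistinguishable from "never set"
    return bytes([(field_number << 3) | 0]) + encode_varint(value)

assert encode_int32_field(2, 0) == b""           # age = 0: nothing is emitted
assert encode_int32_field(2, 42) == b"\x10\x2a"  # tag 0x10, then the value 42
```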
Proto2 vs Proto3
Proto3, released in 2016, simplified the language significantly but also removed features that some teams relied on:
| Feature | Proto2 | Proto3 |
| Field presence | All singular fields tracked | Only optional-annotated fields |
| Required fields | Supported (with required) | Removed entirely |
| Default values | Custom defaults per field | Always zero/empty/false |
| Enums | Closed (unknown values rejected) | Open (unknown values preserved) |
| Groups | Supported (deprecated) | Removed |
| Extensions | Supported | Removed (use Any) |
| Packed repeated | Opt-in ([packed = true]) | Default for scalars |
| JSON mapping | Not standardized | Canonical JSON mapping defined |
The removal of required fields in proto3 was deliberate. Google's internal experience showed that required fields create a permanent compatibility hazard — once deployed, a required field can never be safely removed because old readers will reject messages without it. Proto3's position is that all fields should be optional on the wire, with application-level validation handling mandatory semantics.
A new edition-based system called Protobuf Editions (starting with "edition 2023") is replacing the proto2/proto3 syntax keywords, allowing per-feature opt-in rather than a monolithic version switch.
Backwards and Forwards Compatibility
Schema evolution is the primary design goal of protobuf. The rules are straightforward:
- Adding a field — safe. Old readers skip the unknown field (they know its wire type, so they know how many bytes to skip). New readers use the default value when reading old data that lacks the field.
- Removing a field — safe, as long as you never reuse the field number. Old readers just will not find the field in new data, and use the default.
- Changing a field type — generally unsafe unless the two types share a wire representation (e.g., int32 and int64 are both varints).
- Renaming a field — safe. Field names do not appear on the wire; only field numbers matter.
- Changing field numbers — breaks everything. The field number is the identity of the field on the wire.
The reserved keyword prevents accidental reuse of retired field numbers:
message User {
  reserved 2, 15, 9 to 11;
  reserved "email", "phone";

  string name = 1;
  int32 age = 3;
}
If anyone tries to define a new field using number 2, 15, or 9-11, the protobuf compiler will reject it. This is a safety mechanism for large teams where one developer might not know that field 2 was once "password_hash" and should never be recycled.
Code Generation with protoc
The protobuf compiler protoc reads .proto files and generates serialization/deserialization code in the target language. It has built-in code generators for C++, Java, Kotlin, Python, C#, Objective-C, Ruby, and PHP. Additional languages (Go, Dart, Rust, TypeScript, Swift) are supported via plugins.
# Generate Go code
protoc --go_out=. --go_opt=paths=source_relative user.proto

# Generate Python code
protoc --python_out=. user.proto

# Generate C++ code
protoc --cpp_out=. user.proto

# Use a plugin for Rust
protoc --prost_out=. user.proto
The generated code provides typed message classes with getters, setters, serialization (SerializeToString / encode), deserialization (ParseFromString / decode), and utilities like deep copy, equality comparison, and merging. In statically typed languages, the compiler catches field type errors at compile time rather than at runtime.
The code generator uses a plugin architecture. The protoc binary handles parsing and validation of .proto files, then passes a structured representation (itself a protobuf message — the FileDescriptorProto) to a language-specific plugin that emits source code. This means anyone can write a protoc plugin to generate code for a new language, framework, or purpose (validation, mock generation, documentation, etc.).
Reflection and Descriptors
Protobuf supports runtime reflection — the ability to inspect and manipulate messages without compile-time knowledge of their schema. This is powered by descriptors, which are protobuf messages (defined in google/protobuf/descriptor.proto) that describe the structure of other protobuf messages.
A FileDescriptorProto contains the complete description of a .proto file: its messages, fields, enums, services, and options. A FileDescriptorSet bundles multiple file descriptors together. You can generate one with:
protoc --descriptor_set_out=schema.pb user.proto
With the descriptor, you can write generic code that serializes, deserializes, transforms, or validates any protobuf message without generated code. This is how tools like grpcurl, protoc-gen-doc, and the gRPC server reflection protocol work — they use descriptors to understand message schemas at runtime.
Self-Describing Messages and Well-Known Types
Protobuf messages are not self-describing on the wire: you need the schema to interpret the bytes beyond basic tag-value parsing. However, Google provides the google.protobuf.Any type as a way to embed arbitrary messages with type information:
message Any {
  string type_url = 1;  // e.g., "type.googleapis.com/example.User"
  bytes value = 2;      // serialized message
}
The Any type acts like a typed envelope — the type_url identifies the message type, and value contains the serialized bytes. The receiver uses the type URL to find the appropriate descriptor and deserialize the payload. This is used extensively in Google's gRPC ecosystem, particularly in APIs like the Google Cloud error model.
Other well-known types address common needs:
- Wrapper types (google.protobuf.Int32Value, StringValue, etc.) — wrap scalars in a message so that "absent" (null) is distinguishable from the default value, since a message field has explicit presence
- Timestamp — represents a point in time as seconds + nanoseconds since the Unix epoch
- Duration — a signed time span (seconds + nanoseconds)
- Struct / Value / ListValue — represent arbitrary JSON-like structures in protobuf, useful when the schema is truly dynamic
- FieldMask — specifies a subset of fields to read or update, used in CRUD APIs to implement partial updates
Performance Characteristics
Protobuf's performance advantages come from several sources:
- Compact encoding — no field names on the wire, varint compression for small integers, no whitespace or structural characters. A typical protobuf message is 3-10x smaller than its JSON equivalent.
- Zero-copy parsing — string and bytes fields can reference the original buffer without copying in some implementations (C++, Rust). The decoder can skip unknown fields without allocating memory for them.
- No tokenization — JSON parsing requires a lexer to identify strings, numbers, braces, and commas. Protobuf parsing is a tight loop: read a tag, switch on wire type, read value. No string matching, no Unicode handling, no escape processing.
- Schema-driven code — the generated code knows the exact field layout at compile time, enabling direct struct field writes rather than hash table lookups.
In benchmarks, protobuf serialization is typically 2-10x faster than JSON, and deserialization is 2-20x faster, depending on message complexity and language runtime. The gap is largest for messages with many numeric fields (where JSON must parse decimal strings) and smallest for messages dominated by large strings (where both formats must copy the bytes).
However, protobuf is not the fastest binary format. Parsing and serialization must still touch every byte of the message before any field can be accessed. This is where zero-copy formats like FlatBuffers and Cap'n Proto have an advantage.
Comparison with Other Formats
Protobuf vs JSON
JSON is human-readable, universally supported, and schema-optional. Protobuf is binary, requires a schema, and requires code generation or a descriptor for deserialization. JSON is the natural choice for browser-facing APIs, configuration files, and debugging. Protobuf is the natural choice for service-to-service communication, persistent storage of structured data, and any context where bandwidth or CPU cost matters. Proto3 defines a canonical JSON mapping, so you can transcode between the two formats losslessly.
Protobuf vs MessagePack
MessagePack is a binary serialization format that is "like JSON but fast and small." It is schema-less — field names appear on the wire, just in binary encoding. This means MessagePack messages are self-describing but larger than protobuf (field names take space), and there is no compile-time type checking. MessagePack is a good choice when you want binary compactness without the overhead of schema management. Protobuf is a better choice when you have a stable schema and want maximum compactness and type safety.
Protobuf vs FlatBuffers
FlatBuffers (also from Google) is designed for zero-copy access — you can read fields directly from the serialized buffer without parsing into an object. This makes read access essentially free (a pointer dereference), at the cost of more complex serialization and slightly larger encoded size. FlatBuffers is ideal for performance-critical applications like games, where you mmap a file and read fields directly. Protobuf is better for general-purpose RPC and storage where the full deserialization cost is acceptable.
Protobuf vs Cap'n Proto
Cap'n Proto (created by a former protobuf tech lead at Google) takes the zero-copy approach further: the in-memory representation and the wire format are the same thing. There is no serialization step at all — you write fields into a buffer, and that buffer is the wire format. Cap'n Proto achieves lower latency than protobuf for large messages but uses more wire space due to fixed-width fields and alignment padding. It also has a less mature ecosystem and fewer supported languages.
| Format | Schema | Human-readable | Size | Parse speed | Zero-copy read |
| JSON | Optional | Yes | Largest | Slowest | No |
| MessagePack | No | No | Medium | Fast | No |
| Protobuf | Required | No | Smallest | Fast | Partial |
| FlatBuffers | Required | No | Medium | Fastest* | Yes |
| Cap'n Proto | Required | No | Medium | Fastest* | Yes |
* "Fastest" for read-heavy workloads because there is no parsing step. Serialization speed is comparable across formats.
Protobuf in the Networking Stack
Protobuf plays a significant role in the infrastructure that powers internet routing. gRPC, which uses protobuf as its default wire format, is the communication backbone for many network management and telemetry systems. OpenConfig and gNMI (gRPC Network Management Interface) use protobuf-encoded messages to stream real-time telemetry data from network devices — including the BGP route updates and link state changes that populate tools like this BGP looking glass.
BGP Monitoring Protocol (BMP) data, route collector feeds, and network automation platforms increasingly use protobuf as their interchange format, replacing older text-based formats. Google's B4 WAN — one of the largest private networks in the world — uses protobuf-encoded gRPC for all control plane communication between its SDN controller and the switches that manage traffic across autonomous systems.
Further Reading
If you work with protobuf in production, understanding gRPC is the natural next step — read How gRPC Works for a deep dive into the RPC framework built on top of protobuf. You can also explore the networking concepts that protobuf-based telemetry systems monitor:
- What is BGP? — the routing protocol that protobuf-based telemetry monitors
- Understanding AS Paths — routing data frequently serialized in protobuf
- Look up AS15169 — Google's network, where protobuf was born