In this post, we'll go over the protocol buffer wire format, which specifies the exact way your message is transmitted over the wire and how much room it takes up on your computer's hard drive. This isn't something you need to know in order to use protocol buffers in your application, but it could come in handy when trying to optimize things.
Encoding a simple message
The following is a very simplified definition of a message:
message Message1 {
optional int32 a = 1;
}
For this example, we'll create a Message1
message in the app and set its parameter to 150
. The next step is to serialize the message and write the results to a stream. The encoded message consists of three bytes, and here they are:
08 96 01
Clearly, this is a very abstract and numerical concept at this point, but what does it mean? You'd get a value around 1: 150
if you dumped those bytes with the Protoscope
program. Where does it get that information that this is what the message is about?
Base128 Varints
The wire format is built around variable-width integers, or varints. They enable the encoding of unsigned 64-bit integers using one to ten bytes, with smaller values requiring fewer bytes.
Each byte in the varint has a continuation bit that indicates whether or not the byte after it is a part of the varint. This is the byte's most significant bit (MSB
) (sometimes also called the sign bit). The lower 7 bits are a payload; the resulting integer is constructed by appending the 7-bit payloads of its constituent bytes together.
So, for example, here is the number one encoded as 01
- because it is a single byte, the MSB
is not set:
And 150
is 9601
in binary - It gets a little trickier here:
How do you know this equals 150
? First, you take the most significant bit (MSB
) out of each byte, since it's only there to tell us if we've reached the end of the number (as you can see, it's set in the first byte because the variable has more than one byte). Then we put the 7-bit payloads together and read it as a little-endian, 64-bit unsigned integer:
Message Structure
A message in a protocol buffer is a list of key-value pairs. The key for the binary version of a message is just the field number. The name and declared type of each field can only be found by looking at the message type's definition (i.e. the .proto
file). Protoscope doesn't have access to this information, so it can only give the field numbers.
When a message is encoded, each key-value pair is turned into a record with a field number, a wire type, and a payload. The size of the payload is told to the parser by the wire type. This lets old parsers skip over fields they don't know how to read. Tag-Length-Value
, or TLV
, is sometimes used to describe this kind of plan.
There are six wire types: VARINT
, I64
, LEN
, SGROUP
, EGROUP
, and I32
ID | Name | Used For |
---|---|---|
0 |
VARINT |
int32 , int64 , uint32 , uint64 , sint32 , sint64 , bool , enum
|
1 |
I64 |
fixed64 , sfixed64 , double
|
2 |
LEN |
string , bytes , embedded messages , packed repeated files
|
3 |
SGROUP |
group start (deprecated) |
4 |
EGROUP |
group end (deprecated) |
5 |
I32 |
fixed32 , sfixed32 , float
|
Using the formula (field number << 3) | wire type
, the "tag" of a record is encoded as a variable made up of the field number and the type of wire. In other words, after decoding the variable that represents a field, the low 3 bits tell us the type of wire, and the rest of the integer tells us the number of the field.
Let's take another look at our simple example. You now know that the first number in a stream is always a variable key. In this case, that number is 08
, or (without the MSB
):
000 1000
You take the last three bits to get the wire type (0
), and then move three bits to the right to get the field number (1
). A tag in Protoscope is made up of an integer, a colon, and the type of wire, so we can write the above bytes as 1:VARINT
. Since the wire type is 0
, or VARINT
, we know that to get the payload, we need to decode a varint. As we saw above, the bytes 9601
vary-decode to 150
, which gives us our record. We can write it as 1:VARINT 150
in Protoscope.
More Integer Types
Bools and Enums
Both Bools and Enums are encoded like they are int32
s. Particularly, Bools always encode as either 00
or 01
. false
and true
are other names for these byte strings in Protoscope.
Signed Integers
As we saw in the last section, all of the protocol buffer types for wire type 0
are encoded as variables. Varints, on the other hand, are not signed, so negative integers are encoded differently in sint32
and sint64
than in int32
and int64
.
The intN
types store negative numbers as two's complement, which means that their highest bit is set because they are unsigned 64-bit integers. So, this means that each of the ten bytes must be used. For example, when protoscope sees -2
, it changes it to
11111110 11111111 11111111 11111111 11111111
11111111 11111111 11111111 11111111 00000001
This is the two's complement of 2. This is written as 0 - 2 + 1
in unsigned math, where 0
is the 64-bit integer with all ones. This is a good way to figure out why it makes so many ones.
Calculating the complement
Every computer I know uses two's complement to represent integers. To get the two's complement negative notation of an integer, write it out in binary. You then invert the digits and add one to the result.
Assume we're working with 8-bit numbers (for simplicity's sake) and want to know how to express -28 in two's complement notation. We begin by writing 28 in binary form:
00011100
The digits are then inverted. 1 becomes 0, and 0 becomes 1.
111000011
Then we add 1.
11100100
That is how -28 would be written in 8 bit binary.
sintN
, on the other hand, encodes negative integers using the "ZigZag" encoding rather than two's complement. Positive integers n
are represented as 2-n
(the even numbers), while negative integers -n
are encoded as 2 * n + 1
(the odd numbers). The encoding thus "zig-zags" between positive and negative numbers.
Signed Original | Encoded As |
---|---|
0 |
0 |
-1 |
1 |
1 |
2 |
-2 |
3 |
... |
... |
0x7FFFFFFF |
0xFFFFFFFE |
0x80000000 |
0xFFFFFFFF |
Using some bit voodoo, we can convert n
into its ZigZag representation:
n + n + (n < 0)
In this case, we assume that the boolean n < 0
is converted to an integer 1
if true
and an integer 0
if false
. When a sint32
or sint64
value is parsed, it is decoded back to the original, signed version.
In protoscope, appending a z
to an integer causes it to encode as ZigZag. For example, -500z
is equivalent to the variant 1001
.
Non-varint Numbers
Non-variable numeric types are straightforward - double
and fixed64
have wire type I64
, which instructs the parser to expect a fixed eight-byte chunk of data. A double
record is specified by 5: 25.4
, and a fixed64
record is specified by 6: 200i64
. The absence of an explicit wire type infers the I64
wire type in both cases.
Similarly, float
and fixed32
have wire type I32
, which indicates that four bytes are expected instead. The syntax for these is as follows: add an i32
prefix. Both 25.4i32
and 200i32
will emit four bytes. Tag types are deduced to be I32
.
Length-Delimited Records
Another important concept in the wire format is length prefixes. The LEN
wire type has a variable length, which is specified by a variable immediately after the tag and followed by the payload as usual.
Consider the following message schema:
message Message2 {
optional string b = 2;
}
A string is a record for the field b
, and strings are LEN
-encoded. If we set b
to "testing," we encoded as an LEN
record with the ASCII string "testing" in field number 2
. The resulting value is 120774657374696e67
. By splitting the bytes:
12 07 [74 65 73 74 69 6e 67]
The tag 12
is 00010 010
, or 2:LEN
, as we can see. The byte that follows is the varint 7
, and the next seven bytes are the UTF-8 encoding of "testing".
This is written as 2:LEN 7 "testing"
in Protoscope. However, repeating the length of the string can be inconvenient (which, in Protoscope text, is already quote-delimited). Wrapping Protoscope content in braces generates the following length prefix: {"testing"}
is an abbreviation for 7 "testing"
. {}
is always inferred by fields to be a LEN
record, so we can write this record simply as 2: {"testing"}
.
Submessages
Submessage fields also use the LEN
wire type. Here’s a message definition with an embedded message of our original example message, Message1
:
message Message3 {
optional Message1 c = 3;
}
If Message1
's a
field (i.e., Message3
's c.a
field) is set to 150, we get 1a03089601
. Breaking it up we get:
1a 03 [08 96 01]
The last three bytes (in []
) are identical to those in our first example. In the same way that strings are encoded, these bytes are preceded by an LEN
-typed tag and a length of 3
.
In Protoscope, submessages are quite succinct. 1a03089601
can be written as 3: {1: 150}
.
Optional and Repeated Elements
Missing optional fields are simple to encode: we simply leave the record out if they are not present. As a result, "huge" protos with only a few fields set are scarce.
repeated
fields are a little more complex. Ordinary (unpacked) repeated fields produce one record for each field element. As a result, if we have:
message Message4 {
optional string d = 4;
repeated int32 e = 5;
}
and we construct a Message4
message with d
set to "hello"
, and e
set to 1
, 2
, and 3
, this could be encoded as 220568656c6c6f280128022803
, or written out as Protoscope:
4: {"hello"}
5: 1
5: 2
5: 3
Records for e
, on the other hand, do not have to appear consecutively and can be interleaved with other fields; only the order of records for the same field in relation to each other is preserved. As a result, this could also have been encoded as:
5: 1
5: 2
4: {"hello"}
5: 3
There is no special treatment for oneof
s in the wire format.
Last One Wins
Normally, a message that has been encoded wouldn't have more than one instance of a field that isn't repeated
. But parsers should be able to handle the case where they do. If the same field appears more than once and it is a number or string, the parser takes the last value it sees. For embedded message fields, the parser combines multiple instances of the same field, as if with the Message::MergeFrom
method, all of the single-scalar fields in the second instance replace the ones in the first, single-embedded messages are merged, and fields that appear more than once are added together. Because of these rules, decoding two encoded messages that are joined together gives you the same result as if you had decoded each message separately and then joined the objects that were made. In other words:
MyMessage message;
message.ParseFromString(str1 + str2);
is equivalent to this:
MyMessage message, message2;
message.ParseFromString(str1);
message2.ParseFromString(str2);
message.MergeFrom(message2);
Packed Repeated Fields
As of v2.1.0
, scalar fields that are repeated can be declared as "packed." In proto2, you use the [packed=true]
to do this, but in proto3, it's already done.
Instead of being encoded as a separate record for each entry, they are encoded as a single LEN
record with all of the elements joined together. To decode, each part of the LEN
record is decoded one at a time until the payload is gone. The length of the previous element, which depends on the type of field, determines where the next element begins.
For example, let's say you have the type of message:
message Message5 {
repeated int32 f = 6 [packed=true];
}
Now, let's say you make a Message5
and give the repeated field f
the values 3
, 270
, and 86942
. When this is turned into code, we get 3206038e029ea705
or, in Protoscope text,
6: {3 270 86942}
"Packed" can only be used to describe repeated fields with simple numeric types. These are types that would usually use the VARINT, I32, or I64 wire types.
Note that there is usually no reason to encode more than one key-value pair for a packed repeated field, but parsers must be ready to accept multiple key-value pairs. In this situation, the payloads should be joined together. There must be an even number of elements in each pair. This is a valid way for parsers to encode the same message as above:
6: {3 270}
6: {86942}
Protocol buffer parsers must be able to read repeated fields that were compiled as packed as if they were not packed, and vice versa. This makes it possible to add [packed=true]
to existing fields in a way that works both forward and backward.
Maps
Map fields are just a shorthand for a special kind of repeated field. If we've got:
message Message6 {
map<string, int32> g = 7;
}
is actually the same as
message Message6 {
message g_Entry {
optional string key = 1;
optional int32 value = 2;
}
repeated g_Entry g = 7;
}
Thus, maps are encoded in the same way that repeated message fields are: as a series of LEN
-typed records with two fields each.
Field Order
In a.proto file, field numbers can be declared in any order. The order of the messages has no bearing on how they are serialized.
When a message is serialized, there is no guarantee that its known or unknown fields will be written in the same order. Serialization order is an implementation detail, and any particular implementation's details may change in the future. As a result, protocol buffer parsers must be capable of parsing fields in any order.
Implications
- Do not assume the byte output of a serialized message is stable. This is especially true for messages with transitive bytes fields representing other serialized protocol buffer messages.
- By default, repeated invocations of serialization methods on the same protocol buffer message instance may not produce the same byte output. That is, the default serialization is not deterministic.
- Deterministic serialization only guarantees the same byte output for a particular binary. The byte output may change across different versions of the binary.
- The following checks may fail for a protocol buffer message instance
foo
:-
foo.SerializeAsString() == foo.SerializeAsString()
-
Hash(foo.SerializeAsString()) == Hash(foo.SerializeAsString())
-
CRC(foo.SerializeAsString()) == CRC(foo.SerializeAsString())
-
FingerPrint(foo.SerializeAsString()) == FingerPrint(foo.SerializeAsString())
-
- Here are a few example scenarios where logically equivalent protocol buffer messages
foo
andbar
may serialize to different byte outputs:-
bar
is serialized by an old server that treats some fields as unknown. -
bar
is serialized by a server that is implemented in a different programming language and serializes fields in different order. -
bar
has a field that serializes in a non-deterministic manner. -
bar
has a field that stores a serialized byte output of a protocol buffer message which is serialized differently. -
bar
is serialized by a new server that serializes fields in a different order due to an implementation change. -
foo
andbar
are concatenations of the same individual messages in a different order.
-
Conclusion
In this post we discussed the protocol buffer wire format, which defines the details of how your message is sent on the wire and how much space it consumes on disk. On the next post of this series we are going to start diving on the internal workings of gRPC. Stay tuned!
Top comments (0)