
Packed Repeated Fields in Protobuf in Java


1. Overview

In this tutorial, we’ll discuss packed repeated fields in Google’s Protocol Buffer (protobuf) messages. Protocol Buffers help define highly optimized language-neutral and platform-neutral data structures for achieving extremely efficient serialization. In protobuf, the repeated keyword helps define fields that can hold multiple values.

Additionally, to achieve even higher optimization during serialization on repeated fields, a new option packed was introduced in protobuf. It applies a special encoding technique to reduce the messages’ size further.

Let’s explore more on this.

2. Repeated Fields

Before we discuss the packed option on the repeated fields, let’s find out the meaning of the label repeated. Let’s consider a proto file repeated.proto:

syntax = "proto3";
option java_multiple_files = true;
option java_package = "com.baeldung.grpc.repeated";
package repeated;
message PackedOrder {
  int32 orderId = 1;
  repeated int32 productIds = 2 [packed = true];
}
message UnpackedOrder {
  int32 orderId = 1;
  repeated int32 productIds = 2 [packed = false];
}
service OrderService {
  rpc createOrder(UnpackedOrder) returns (UnpackedOrder){}
}

The file defines two message types (DTOs), PackedOrder and UnpackedOrder, and a service called OrderService. The repeated label on the productIds field indicates that it can hold multiple int32 values, similar to a collection or an array. The packed option was introduced in protobuf v2.1.0, and in proto3 it's enabled by default for repeated fields of scalar numeric types. Therefore, to disable the packed behavior, we explicitly set packed = false on UnpackedOrder for now to focus on the repeated feature.

Interestingly, if we modify a repeated field and add the packed = true option, we don’t need to adjust the code to make it work. The only difference is how the protobuf library encodes the field during serialization. We’ll discuss this in the upcoming sections.

Let’s define the OrderService that has the RPC createOrder():

public class OrderService extends OrderServiceGrpc.OrderServiceImplBase {
    @Override
    public void createOrder(UnpackedOrder unpackedOrder, StreamObserver<UnpackedOrder> responseObserver) {
        List<Integer> productIds = unpackedOrder.getProductIdsList();
        if (validateProducts(productIds)) {
            int orderID = insertOrder(unpackedOrder);
            UnpackedOrder createdUnpackedOrder = UnpackedOrder.newBuilder(unpackedOrder)
              .setOrderId(orderID)
              .build();
            responseObserver.onNext(createdUnpackedOrder);
            responseObserver.onCompleted();
        }
    }
}

The protoc Maven plugin auto-generates the method getProductIdsList() for fetching the list of elements in the repeated fields. This applies irrespective of the packed or unpacked fields. Finally, we set the generated orderID in the UnpackedOrder object, and return it to the client.

Let’s now invoke the RPC:

@Test
void whenUnpackedRepeatedProductIds_thenCreateUnpackedOrderAndInvokeRPC() {
    UnpackedOrder.Builder unpackedOrderBuilder = UnpackedOrder.newBuilder();
    unpackedOrderBuilder.setOrderId(1);
    Arrays.stream(fetchProductIds()).forEach(unpackedOrderBuilder::addProductIds);
    UnpackedOrder unpackedOrderRequest = unpackedOrderBuilder.build();
    UnpackedOrder unpackedOrderResponse = orderClientStub.createOrder(unpackedOrderRequest);
    assertInstanceOf(Integer.class, unpackedOrderResponse.getOrderId());
}

When we compile the code using the protoc Maven plugin, it generates the Java class for the UnpackedOrder message type defined in the proto file. We call the method addProductIds() once per element while iterating through the Stream to populate the repeated field productIds in the UnpackedOrder object. In general, when the proto file is compiled, a method prefixed with add is generated for every repeated field, whether packed or unpacked.

After this, we invoke the RPC createOrder() that returns the field orderId.

3. Packed Repeated Fields

So far, we know that packed repeated fields differ from unpacked repeated fields mainly in how they are encoded during serialization. To understand the encoding technique, let’s first see how to serialize the PackedOrder and UnpackedOrder message types defined in the proto file:

void serializeObject(String file, GeneratedMessageV3 object) throws IOException {
    try(FileOutputStream fileOutputStream = new FileOutputStream(file)) {
        object.writeTo(fileOutputStream);
    }
}

The method serializeObject() calls the writeTo() method in the object of type GeneratedMessageV3 to serialize it to the file system.

PackedOrder and UnpackedOrder message types inherit the writeTo() method from their parent GeneratedMessageV3 class. Hence, we’ll use the serializeObject() method to write their instances into the file system:

@Test
void whenSerializeUnpackedOrderAndPackedOrderObject_thenSizeofPackedOrderObjectIsLess() throws IOException {
    UnpackedOrder.Builder unpackedOrderBuilder = UnpackedOrder.newBuilder();
    unpackedOrderBuilder.setOrderId(1);
    Arrays.stream(fetchProductIds()).forEach(unpackedOrderBuilder::addProductIds);
    UnpackedOrder unpackedOrder = unpackedOrderBuilder.build();
    String unpackedOrderObjFileName = FOLDER_TO_WRITE_OBJECTS + "unpacked_order.bin";
    serializeObject(unpackedOrderObjFileName, unpackedOrder);
    PackedOrder.Builder packedOrderBuilder = PackedOrder.newBuilder();
    packedOrderBuilder.setOrderId(1);
    Arrays.stream(fetchProductIds()).forEach(packedOrderBuilder::addProductIds);
    PackedOrder packedOrder = packedOrderBuilder.build();
    String packedOrderObjFileName = FOLDER_TO_WRITE_OBJECTS + "packed_order.bin";
    serializeObject(packedOrderObjFileName, packedOrder);
    
    long sizeOfUnpackedOrderObjectFile = getFileSize(unpackedOrderObjFileName);
    long sizeOfPackedOrderObjectFile = getFileSize(packedOrderObjFileName);
    long sizeReductionPercentage = (sizeOfUnpackedOrderObjectFile - sizeOfPackedOrderObjectFile) * 100/sizeOfUnpackedOrderObjectFile;
    logger.info("Packed field saved {}% over unpacked field", sizeReductionPercentage);
    assertTrue(sizeOfUnpackedOrderObjectFile > sizeOfPackedOrderObjectFile);
}

First, we create the unpackedOrder and packedOrder objects by adding the same set of product IDs to each. Then, we serialize both objects and compare their file sizes. The program also calculates the percentage reduction in file size achieved by the packed version of productIds. As anticipated, the file containing the unpackedOrder object is larger than the file containing the packedOrder object.

Let’s now look at the console output of the program:

Packed field saved 29% over unpacked field

This example, with 20 product IDs, demonstrates a 29% reduction in file size for the packedOrder object. The relative saving grows with the number of product IDs and eventually levels off, because the fixed cost of the orderId field and the length prefix becomes negligible.
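The size difference can be estimated by hand from the wire format. The following is a minimal sketch in plain Java that doesn't use the protobuf library; the class and method names are hypothetical, and it assumes a field number below 16 so that the tag fits in a single byte:

```java
import java.util.Arrays;

public class PackedSizeEstimate {

    // Number of bytes a value occupies as a base-128 varint
    static int varintSize(long value) {
        int size = 1;
        while ((value >>>= 7) != 0) {
            size++;
        }
        return size;
    }

    // Unpacked: every element is its own record (1-byte tag + varint value)
    static int unpackedSize(int[] values) {
        int total = 0;
        for (int v : values) {
            total += 1 + varintSize(v);
        }
        return total;
    }

    // Packed: one 1-byte tag, one varint length prefix, then the values back to back
    static int packedSize(int[] values) {
        int payload = 0;
        for (int v : values) {
            payload += varintSize(v);
        }
        return 1 + varintSize(payload) + payload;
    }

    public static void main(String[] args) {
        int[] productIds = new int[20];
        Arrays.fill(productIds, 500); // illustrative value that needs two varint bytes
        System.out.println("unpacked field bytes: " + unpackedSize(productIds));
        System.out.println("packed field bytes:   " + packedSize(productIds));
    }
}
```

For 20 two-byte IDs, this estimates 60 bytes for the unpacked field and 42 bytes for the packed one. The packed form pays for the tag once instead of once per element.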

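Since packed repeated fields produce smaller messages, there is less data to write and transfer. However, we can use the packed option only on scalar numeric types (including bool and enum); strings, bytes, and nested messages can’t be packed.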

4. Encoded Unpacked vs Packed Fields

Earlier, we created two files unpacked_order.bin and packed_order.bin corresponding to UnpackedOrder and PackedOrder objects respectively. We’ll use the protoscope tool to inspect the encoded contents of these two files. Protoscope is a simple, human-editable language that helps us view the low-level Protobuf wire format of the messages in transit.

Let’s inspect the contents of unpacked_order.bin:

#cat unpacked_order.bin | protoscope -explicit-wire-types
1:VARINT 1
2:VARINT 266
2:VARINT 629
2:VARINT 725
2:VARINT 259
2:VARINT 353
2:VARINT 746
more elements...

The protoscope command dumps the encoded protocol buffers as text. In the text, the field and its values are represented in a key-value format, where the key is the field number defined in the repeated.proto file. The productIds field with key 2 is repeated, with each of its values represented as a VARINT wire-format record. This means that each element is encoded separately as its own key-value pair.
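To make the per-element overhead concrete, here is a hedged sketch that hand-encodes two of the values above, 266 and 629, as unpacked records for field number 2. It relies on the documented wire-format rule that a record's tag is (field number << 3) | wire type; the class and helper names are hypothetical and not part of any protobuf API:

```java
import java.io.ByteArrayOutputStream;

public class UnpackedWireSketch {

    static final int WIRE_TYPE_VARINT = 0;

    // Writes a value as a base-128 varint, least significant 7-bit group first
    static void writeVarint(ByteArrayOutputStream out, long value) {
        while ((value & ~0x7FL) != 0) {
            out.write((int) ((value & 0x7F) | 0x80)); // set the continuation bit
            value >>>= 7;
        }
        out.write((int) value);
    }

    // Encodes each element as a separate record: tag followed by the varint value
    static byte[] encodeUnpacked(int fieldNumber, int[] values) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (int v : values) {
            writeVarint(out, ((long) fieldNumber << 3) | WIRE_TYPE_VARINT);
            writeVarint(out, v);
        }
        return out.toByteArray();
    }

    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02x", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // prints 108a0210f504: the tag byte 0x10 appears before every element
        System.out.println(toHex(encodeUnpacked(2, new int[] {266, 629})));
    }
}
```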

Similarly, let’s look at the contents of packed_order.bin in protoscope text format:

#cat packed_order.bin | protoscope -explicit-wire-types -explicit-length-prefixes
1:VARINT 1
2:LEN 38 `fc06c0058e047293069702ea04c203ba0165c005d601da02dc02a307a804f101ca019a02df03`

Interestingly, once we enable the packed option on the productIds field, the protobuf library encodes all its elements together for serialization. It represents them as a single LEN wire-format record whose payload is 38 bytes long, shown here in hexadecimal:

fc 06 c0 05 8e 04 72 93 06 97 02 ea 04 c2 03 ba 01 65 c0 05 d6 01 da 02 dc 02 a3 07 a8 04 f1 01 ca 01 9a 02 df 03
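As an illustration, the payload can be split back into its elements with a few lines of plain Java. This is only a sketch with hypothetical class and method names; it assumes that every element is a varint, which holds for an int32 field:

```java
import java.util.ArrayList;
import java.util.List;

public class PackedPayloadDecoder {

    static byte[] fromHex(String hex) {
        byte[] bytes = new byte[hex.length() / 2];
        for (int i = 0; i < bytes.length; i++) {
            bytes[i] = (byte) Integer.parseInt(hex.substring(2 * i, 2 * i + 2), 16);
        }
        return bytes;
    }

    // Splits a packed payload into its varint-encoded elements
    static List<Long> decodeVarints(byte[] payload) {
        List<Long> values = new ArrayList<>();
        long current = 0;
        int shift = 0;
        for (byte b : payload) {
            current |= (long) (b & 0x7F) << shift; // lower 7 bits carry data
            if ((b & 0x80) == 0) {                 // no continuation bit: value complete
                values.add(current);
                current = 0;
                shift = 0;
            } else {
                shift += 7;
            }
        }
        return values;
    }

    public static void main(String[] args) {
        String payload = "fc06c0058e047293069702ea04c203ba0165c005d601da02dc02a307a804f101ca019a02df03";
        List<Long> productIds = decodeVarints(fromHex(payload));
        // 20 elements, starting with 892, 704, 526, 114
        System.out.println(productIds.size() + " elements: " + productIds);
    }
}
```

The 38 bytes decode to 20 product IDs with no tag between them, which is where the saving over the unpacked encoding comes from.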

We’ll not discuss the encoding of protobuf messages as the official site already covers it in detail. We can also refer to other sites to understand the encoding algorithm in detail.

5. Conclusion

In this article, we explored the packed option for repeated fields in protobuf. The elements of a packed field are encoded together, and as a result, the message size reduces considerably. This can improve performance, since there is less data to serialize and transfer. It’s important to note that we can declare only repeated fields of scalar numeric types, which are encoded with the VARINT, I32, or I64 wire types, as packed.

As usual, the code used in this article is available over on GitHub.

       
