<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Jerónimo López</title>
    <description>The latest articles on DEV Community by Jerónimo López (@jerolba).</description>
    <link>https://dev.to/jerolba</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F838032%2F146f2a4a-38c6-4ca0-b15f-f76673040d64.jpg</url>
      <title>DEV Community: Jerónimo López</title>
      <link>https://dev.to/jerolba</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/jerolba"/>
    <language>en</language>
    <item>
      <title>The Carpet feature that nobody will use</title>
      <dc:creator>Jerónimo López</dc:creator>
      <pubDate>Thu, 15 May 2025 12:02:07 +0000</pubDate>
      <link>https://dev.to/jerolba/the-carpet-feature-that-nobody-will-use-46il</link>
      <guid>https://dev.to/jerolba/the-carpet-feature-that-nobody-will-use-46il</guid>
      <description>&lt;p&gt;This week I released a new version of &lt;a href="https://github.com/jerolba/parquet-carpet" rel="noopener noreferrer"&gt;Carpet&lt;/a&gt;, the Java library for working with Parquet files. In this version, I've added a feature that I believe nobody will ever use: &lt;strong&gt;the ability to read and write BSON-type columns&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Parquet supports &lt;a href="https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#embedded-types" rel="noopener noreferrer"&gt;embedded types&lt;/a&gt;, which allow defining complex data types within a column. Each of these types has its own internal representation as a byte array.&lt;/p&gt;

&lt;p&gt;Historically, the Parquet format has defined &lt;code&gt;JSON&lt;/code&gt; and &lt;code&gt;BSON&lt;/code&gt; types:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt; &lt;strong&gt;JSON&lt;/strong&gt;, with &lt;code&gt;binary value (JSON)&lt;/code&gt; in the schema, represents data as a UTF-8 text (String) serialization of a JSON object. The difference from storing the same content in a &lt;code&gt;STRING&lt;/code&gt; type column is that it explicitly defines the content as a JSON object, not just plain text.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;BSON&lt;/strong&gt;, with &lt;code&gt;binary value (BSON)&lt;/code&gt; in the schema, represents data as a byte array following &lt;a href="https://bsonspec.org/spec.html" rel="noopener noreferrer"&gt;the BSON specification&lt;/a&gt;. By typing it this way, it also explicitly identifies the content as a BSON object, rather than just any byte array.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In my limited experience with Parquet, I haven't come across examples where these embedded types are used, and no one had requested support for them in Carpet.&lt;/p&gt;

&lt;p&gt;So why did I decide to implement support for them? The answer is simple: &lt;strong&gt;to challenge myself and see if I could do it&lt;/strong&gt;, and to test whether the codebase was flexible enough to support these features without breaking anything or imposing limitations on Carpet.&lt;/p&gt;

&lt;p&gt;Another reason is that &lt;a href="https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#variant" rel="noopener noreferrer"&gt;new embedded types&lt;/a&gt; have recently been defined in the Parquet specification: VARIANT, GEOMETRY, and GEOGRAPHY. There are still no functional implementations of these new types in Parquet, but they are already defined and work is underway on their implementation in at least two languages. It's wise to lay the groundwork for their future availability.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Implementation Problem
&lt;/h2&gt;

&lt;p&gt;The key challenge when implementing these types was deciding &lt;strong&gt;which Java types to use for representation&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;What's the best Java class to represent JSON? Should I create a Carpet-specific class? Use the &lt;code&gt;JSONObject&lt;/code&gt; class from the &lt;code&gt;org.json&lt;/code&gt; library? Or perhaps the &lt;code&gt;JsonNode&lt;/code&gt; class from Jackson? Could I use any object and serialize it to JSON? And if so, with which library? Any of these approaches would tie Carpet to a specific implementation and add unnecessary dependencies for 99.99% of use cases, while also forcing users into a particular library.&lt;/p&gt;

&lt;p&gt;The same issue applies to BSON. I could use the &lt;a href="https://central.sonatype.com/artifact/org.mongodb/bson" rel="noopener noreferrer"&gt;MongoDB bson library&lt;/a&gt;, but that would create a dependency on a specific implementation that virtually no one needs.&lt;/p&gt;

&lt;p&gt;Instead, I opted to use Java's &lt;code&gt;String&lt;/code&gt; type to represent JSON content and Parquet's &lt;a href="https://github.com/apache/parquet-java/blob/5f079b98e63c814535e8709ab5c6fb672c2aedc5/parquet-column/src/main/java/org/apache/parquet/io/api/Binary.java" rel="noopener noreferrer"&gt;Binary&lt;/a&gt; class for BSON. This approach keeps Carpet independent of any specific implementation, allowing users to choose their preferred library for handling the content.&lt;/p&gt;

&lt;p&gt;The trade-off is that anyone using this functionality (which I suspect will be very few people) will need to declare their attributes as &lt;code&gt;String&lt;/code&gt; or &lt;code&gt;Binary&lt;/code&gt; instead of using their own business classes, and will need to handle serialization and deserialization themselves before using Carpet:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="n"&gt;record&lt;/span&gt; &lt;span class="nf"&gt;ProductEvent&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;
    &lt;span class="kt"&gt;long&lt;/span&gt; &lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt;
    &lt;span class="nc"&gt;String&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt;
    &lt;span class="nc"&gt;String&lt;/span&gt; &lt;span class="n"&gt;jsonData&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt;
    &lt;span class="nc"&gt;Binary&lt;/span&gt; &lt;span class="n"&gt;bsonData&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt;
    &lt;span class="nc"&gt;Instant&lt;/span&gt; &lt;span class="n"&gt;timestamp&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Annotating Java Types
&lt;/h2&gt;

&lt;p&gt;I couldn't simply use &lt;code&gt;String&lt;/code&gt; for JSON and &lt;code&gt;Binary&lt;/code&gt; for BSON because Java's &lt;code&gt;String&lt;/code&gt; type already automatically maps to Parquet's &lt;code&gt;STRING&lt;/code&gt; type. Additionally, &lt;code&gt;Binary&lt;/code&gt; needs to support multiple Parquet logical types (&lt;code&gt;STRING&lt;/code&gt;, &lt;code&gt;BSON&lt;/code&gt;, and eventually &lt;code&gt;VARIANT&lt;/code&gt;, &lt;code&gt;GEOMETRY&lt;/code&gt;, and &lt;code&gt;GEOGRAPHY&lt;/code&gt;).&lt;/p&gt;

&lt;p&gt;To solve this, &lt;strong&gt;I created annotations for Java record attributes&lt;/strong&gt; to specify which Parquet data type they should map to.&lt;/p&gt;

&lt;p&gt;When you annotate a &lt;code&gt;String&lt;/code&gt; attribute with &lt;code&gt;@ParquetJson&lt;/code&gt; or a &lt;code&gt;Binary&lt;/code&gt; attribute with &lt;code&gt;@ParquetBson&lt;/code&gt;, you're telling Carpet that these attributes contain JSON or BSON data respectively, not just regular text or byte arrays.&lt;/p&gt;

&lt;p&gt;This record, when written to a Parquet file:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="n"&gt;record&lt;/span&gt; &lt;span class="nf"&gt;ProductEvent&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;
    &lt;span class="kt"&gt;long&lt;/span&gt; &lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt;
    &lt;span class="nc"&gt;String&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt;
    &lt;span class="nd"&gt;@ParquetJson&lt;/span&gt; &lt;span class="nc"&gt;String&lt;/span&gt; &lt;span class="n"&gt;jsonData&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt;
    &lt;span class="nd"&gt;@ParquetBson&lt;/span&gt; &lt;span class="nc"&gt;Binary&lt;/span&gt; &lt;span class="n"&gt;bsonData&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt;
    &lt;span class="nc"&gt;Instant&lt;/span&gt; &lt;span class="n"&gt;timestamp&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;will generate the following Parquet schema:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;message ProductEvent {
    required int64 id;
    optional binary name (STRING);
    optional binary jsonData (JSON);
    optional binary bsonData (BSON);
    required int64 timestamp (TIMESTAMP(MILLIS,true));
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After introducing annotations for changing attribute types in the Parquet schema, I expanded this capability beyond just embedded types to include Java's &lt;code&gt;String&lt;/code&gt; and &lt;code&gt;Enum&lt;/code&gt; types as well.&lt;/p&gt;

&lt;p&gt;I added the &lt;code&gt;@ParquetString&lt;/code&gt; and &lt;code&gt;@ParquetEnum&lt;/code&gt; annotations to modify the Parquet logical type of Java attributes of type &lt;code&gt;String&lt;/code&gt;, &lt;code&gt;Enum&lt;/code&gt;, or &lt;code&gt;Binary&lt;/code&gt;. This is particularly useful when conforming to a third-party contract while still using the most convenient data types in your Java code.&lt;/p&gt;

&lt;p&gt;This record, when written to a Parquet file:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="n"&gt;record&lt;/span&gt; &lt;span class="nf"&gt;ProductEvent&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;
    &lt;span class="kt"&gt;long&lt;/span&gt; &lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt;
    &lt;span class="nc"&gt;String&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt;
    &lt;span class="nd"&gt;@ParquetString&lt;/span&gt; &lt;span class="nc"&gt;Binary&lt;/span&gt; &lt;span class="n"&gt;productCode&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt;
    &lt;span class="nd"&gt;@ParquetString&lt;/span&gt; &lt;span class="nc"&gt;MyEnum&lt;/span&gt; &lt;span class="n"&gt;category&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt;
    &lt;span class="nd"&gt;@ParquetEnum&lt;/span&gt; &lt;span class="nc"&gt;String&lt;/span&gt; &lt;span class="n"&gt;type&lt;/span&gt;&lt;span class="o"&gt;){&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;will generate the following Parquet schema:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;message ProductEvent {
    required int64 id;
    optional binary name (STRING);
    optional binary productCode (STRING);
    optional binary category (STRING);
    optional binary type (ENUM);
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If you want to know more about the new features, you can read the &lt;a href="https://carpet.jerolba.com/advanced/java-type-annotations/" rel="noopener noreferrer"&gt;Carpet documentation on Java type annotations&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;While I doubt anyone will use the JSON and BSON functionality in Carpet, developing this feature has been valuable for testing the flexibility of Carpet's codebase and preparing it for the new embedded types being defined in the Parquet specification.&lt;/p&gt;

&lt;p&gt;As a bonus, I've enhanced support for Parquet's &lt;code&gt;Binary&lt;/code&gt; type and added the ability to modify logical types of attributes in the Parquet schema, capabilities I hadn't originally planned but that could prove useful in certain scenarios.&lt;/p&gt;

&lt;p&gt;With this new functionality requiring documentation, I took the opportunity to create a &lt;strong&gt;dedicated documentation site for Carpet&lt;/strong&gt;, moving all the content from the GitHub repository's README to this new resource.&lt;/p&gt;

&lt;p&gt;The documentation is built with &lt;a href="https://www.mkdocs.org/" rel="noopener noreferrer"&gt;MkDocs&lt;/a&gt; and hosted on &lt;a href="https://pages.github.com/" rel="noopener noreferrer"&gt;GitHub Pages&lt;/a&gt;. You can browse the complete documentation at &lt;a href="https://carpet.jerolba.com/" rel="noopener noreferrer"&gt;carpet.jerolba.com&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>java</category>
      <category>parquet</category>
      <category>bigdata</category>
      <category>backend</category>
    </item>
    <item>
      <title>The two versions of Parquet</title>
      <dc:creator>Jerónimo López</dc:creator>
      <pubDate>Thu, 20 Feb 2025 06:00:00 +0000</pubDate>
      <link>https://dev.to/jerolba/the-two-versions-of-parquet-16il</link>
      <guid>https://dev.to/jerolba/the-two-versions-of-parquet-16il</guid>
      <description>&lt;p&gt;A few days ago, the creators of DuckDB wrote the article: &lt;a href="https://duckdb.org/2025/01/22/parquet-encodings.html" rel="noopener noreferrer"&gt;Query Engines: Gatekeepers of the Parquet File Format&lt;/a&gt;, which explained how the engines that process Parquet files as SQL tables are blocking the evolution of the format. This is because those engines are not fully supporting the latest specification, and without this support, the rest of the ecosystem has no incentive to adopt it.&lt;/p&gt;

&lt;p&gt;In my experience, this issue is not limited to Query Engines but extends to the tools within the ecosystem. Soon after releasing the first version of &lt;a href="https://github.com/jerolba/parquet-carpet" rel="noopener noreferrer"&gt;Carpet&lt;/a&gt;, I discovered that there was a version 2 of the format and that the core Java Parquet library does not activate it by default. Since the specification had been finalized for some time, I decided that the best approach was to &lt;a href="https://github.com/jerolba/parquet-carpet/commit/eecb813cdbddfde9a4b27c46104c5a8969ef4477" rel="noopener noreferrer"&gt;make Carpet use version 2 by default&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;A week later, I discovered at work the hard way that if you are not up to date with Pandas in Python, you cannot read files written with version 2. I had to &lt;a href="https://github.com/jerolba/parquet-carpet/commit/983f87ec23b1d3c5c091daa4dc7dfb95c919f6ef" rel="noopener noreferrer"&gt;rollback the change&lt;/a&gt; immediately.&lt;/p&gt;

&lt;h2&gt;
  
  
  Parquet Version 2
&lt;/h2&gt;

&lt;p&gt;Upon researching the topic, you'll find that even though the format specification is finalized, it is not fully implemented across the ecosystem. Ideally, the standard would be whatever the specification defines, but in reality, there is no agreement on the minimum set of features an implementation must support to be considered compatible with version 2.&lt;/p&gt;

&lt;p&gt;In &lt;a href="https://github.com/apache/parquet-format/pull/164" rel="noopener noreferrer"&gt;this Pull Request&lt;/a&gt; from the project that describes the file format, &lt;strong&gt;there has been an ongoing discussion for four years about what constitutes the core&lt;/strong&gt;, and there are no signs of a resolution anytime soon. Reading &lt;a href="https://lists.apache.org/thread/th3ls02pd1yn74mtj12s05tbbx0x8bjj" rel="noopener noreferrer"&gt;this other thread&lt;/a&gt; on the mailing list, I came to the conclusion that although they are part of the specification, two concepts are mixed that could evolve independently:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Given a series of values in a column, how to encode them efficiently. Being able to incorporate new encodings such as &lt;code&gt;RLE_DICTIONARY&lt;/code&gt; or &lt;code&gt;DELTA_BYTE_ARRAY&lt;/code&gt;, which further improve compression.&lt;/li&gt;
&lt;li&gt;Given an encoded column’s data, where to write it within the file along with its metadata such as headers, nulls, or statistics, which helps to maximize the available metadata while minimizing its size and the number of file reads. This is what they call Data Page V2.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Many would likely prefer to prioritize improvements in encoding over page structure. Finding a file that uses an unknown encoding would make a column unreadable, but a change in how pages are structured would make the entire file unreadable.&lt;/p&gt;

&lt;p&gt;What I came to understand is that new &lt;a href="https://github.com/apache/parquet-format/blob/master/LogicalTypes.md" rel="noopener noreferrer"&gt;logical types&lt;/a&gt; are not tied to a specific format version. On the one hand, there are the primitive types that are fixed, but on top of them, logical types are defined: a date is a representation of an &lt;code&gt;int64&lt;/code&gt;, a Big Decimal or String is represented with a &lt;code&gt;BYTE_ARRAY&lt;/code&gt;. Now the &lt;code&gt;VARIANT&lt;/code&gt; type is being &lt;a href="https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#variant" rel="noopener noreferrer"&gt;defined&lt;/a&gt; and I have not seen it associated with either of the two versions.&lt;/p&gt;

&lt;p&gt;Meanwhile, in the Machine Learning world, Parquet and ORC have become limiting, requiring specialized features such as handling files with thousands of columns. To address this, two new formats have emerged: &lt;a href="https://github.com/facebookincubator/nimble" rel="noopener noreferrer"&gt;Nimble&lt;/a&gt; from Facebook and &lt;a href="https://blog.lancedb.com/lance-v2/" rel="noopener noreferrer"&gt;LV2&lt;/a&gt; from LanceDB.&lt;/p&gt;

&lt;p&gt;If you want to delve deeper into the topic, I recommend &lt;a href="https://materializedview.io/p/nimble-and-lance-parquet-killers" rel="noopener noreferrer"&gt;this introductory article&lt;/a&gt;. I consider these two formats to be niche solutions and &lt;strong&gt;Parquet will continue to reign in the world of data engineering&lt;/strong&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Performance of Version 2
&lt;/h2&gt;

&lt;p&gt;The DuckDB article prompted me to investigate the performance implications of Parquet Version 2, which I hadn't considered in [my previous post on compression algorithms]](&lt;a href="https://dev.to/jerolba/compression-algorithms-in-parquet-java-4kec"&gt;https://dev.to/jerolba/compression-algorithms-in-parquet-java-4kec&lt;/a&gt;).&lt;/p&gt;

&lt;p&gt;Configuring file writing with version 2 is straightforward, requiring only a property setting in the writer’s builder:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="nc"&gt;CarpetWriter&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="no"&gt;T&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;writer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;CarpetWriter&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;Builder&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&amp;gt;(&lt;/span&gt;&lt;span class="n"&gt;outputFile&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;clazz&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;withWriterVersion&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;WriterVersion&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;PARQUET_2_0&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;build&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  File Size
&lt;/h3&gt;

&lt;p&gt;Italian government dataset:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Format&lt;/th&gt;
&lt;th&gt;Version 1&lt;/th&gt;
&lt;th&gt;Version 2&lt;/th&gt;
&lt;th&gt;Improvement&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;CSV&lt;/td&gt;
&lt;td&gt;1761 MB&lt;/td&gt;
&lt;td&gt;1761 MB&lt;/td&gt;
&lt;td&gt;-&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;UNCOMPRESSED&lt;/td&gt;
&lt;td&gt;564 MB&lt;/td&gt;
&lt;td&gt;355 MB&lt;/td&gt;
&lt;td&gt;37 %&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;SNAPPY&lt;/td&gt;
&lt;td&gt;220 MB&lt;/td&gt;
&lt;td&gt;198 MB&lt;/td&gt;
&lt;td&gt;10 %&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;GZIP&lt;/td&gt;
&lt;td&gt;146 MB&lt;/td&gt;
&lt;td&gt;138 MB&lt;/td&gt;
&lt;td&gt;5 %&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;ZSTD&lt;/td&gt;
&lt;td&gt;148  MB&lt;/td&gt;
&lt;td&gt;144 MB&lt;/td&gt;
&lt;td&gt;2 %&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;LZ4_RAW&lt;/td&gt;
&lt;td&gt;209 MB&lt;/td&gt;
&lt;td&gt;192 MB&lt;/td&gt;
&lt;td&gt;8 %&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;LZO&lt;/td&gt;
&lt;td&gt;215 MB&lt;/td&gt;
&lt;td&gt;195 MB&lt;/td&gt;
&lt;td&gt;9 %&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;New York taxi dataset:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Format&lt;/th&gt;
&lt;th&gt;Version 1&lt;/th&gt;
&lt;th&gt;Version 2&lt;/th&gt;
&lt;th&gt;Improvement&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;CSV&lt;/td&gt;
&lt;td&gt;2983 MB&lt;/td&gt;
&lt;td&gt;2983 MB&lt;/td&gt;
&lt;td&gt;-&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;UNCOMPRESSED&lt;/td&gt;
&lt;td&gt;760 MB&lt;/td&gt;
&lt;td&gt;511 MB&lt;/td&gt;
&lt;td&gt;33 %&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;SNAPPY&lt;/td&gt;
&lt;td&gt;542 MB&lt;/td&gt;
&lt;td&gt;480 MB&lt;/td&gt;
&lt;td&gt;11 %&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;GZIP&lt;/td&gt;
&lt;td&gt;448 MB&lt;/td&gt;
&lt;td&gt;444 MB&lt;/td&gt;
&lt;td&gt;1 %&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;ZSTD&lt;/td&gt;
&lt;td&gt;430 MB&lt;/td&gt;
&lt;td&gt;444 MB&lt;/td&gt;
&lt;td&gt;-3 %&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;LZ4_RAW&lt;/td&gt;
&lt;td&gt;547 MB&lt;/td&gt;
&lt;td&gt;482 MB&lt;/td&gt;
&lt;td&gt;12 %&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;LZO&lt;/td&gt;
&lt;td&gt;518 MB&lt;/td&gt;
&lt;td&gt;479 MB&lt;/td&gt;
&lt;td&gt;7 %&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The new encodings in Version 2 compact the data more effectively before compression, resulting in a larger relative improvement for the UNCOMPRESSED version. This leaves less room for subsequent compression algorithms to further reduce the file size (or even slightly worsening it like ZSTD).&lt;/p&gt;

&lt;h3&gt;
  
  
  Writing
&lt;/h3&gt;

&lt;p&gt;Italian government dataset (seconds):&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Format&lt;/th&gt;
&lt;th&gt;Version 1&lt;/th&gt;
&lt;th&gt;Version 2&lt;/th&gt;
&lt;th&gt;Improvement&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;UNCOMPRESSED&lt;/td&gt;
&lt;td&gt;25.0&lt;/td&gt;
&lt;td&gt;23.6&lt;/td&gt;
&lt;td&gt;6 %&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;SNAPPY&lt;/td&gt;
&lt;td&gt;25.2&lt;/td&gt;
&lt;td&gt;23.5&lt;/td&gt;
&lt;td&gt;7 %&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;GZIP&lt;/td&gt;
&lt;td&gt;39.3&lt;/td&gt;
&lt;td&gt;35.8&lt;/td&gt;
&lt;td&gt;9 %&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;ZSTD&lt;/td&gt;
&lt;td&gt;27.3&lt;/td&gt;
&lt;td&gt;25.7&lt;/td&gt;
&lt;td&gt;6 %&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;LZ4_RAW&lt;/td&gt;
&lt;td&gt;24.9&lt;/td&gt;
&lt;td&gt;23.8&lt;/td&gt;
&lt;td&gt;4 %&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;LZO&lt;/td&gt;
&lt;td&gt;26.0&lt;/td&gt;
&lt;td&gt;24.6&lt;/td&gt;
&lt;td&gt;5 %&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;New York taxi dataset (seconds):&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Format&lt;/th&gt;
&lt;th&gt;Version 1&lt;/th&gt;
&lt;th&gt;Version 2&lt;/th&gt;
&lt;th&gt;Improvement&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;UNCOMPRESSED&lt;/td&gt;
&lt;td&gt;57.9&lt;/td&gt;
&lt;td&gt;50.2&lt;/td&gt;
&lt;td&gt;13 %&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;SNAPPY&lt;/td&gt;
&lt;td&gt;56.4&lt;/td&gt;
&lt;td&gt;50.7&lt;/td&gt;
&lt;td&gt;10 %&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;GZIP&lt;/td&gt;
&lt;td&gt;91.1&lt;/td&gt;
&lt;td&gt;66.9&lt;/td&gt;
&lt;td&gt;27 %&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;ZSTD&lt;/td&gt;
&lt;td&gt;64.1&lt;/td&gt;
&lt;td&gt;57.1&lt;/td&gt;
&lt;td&gt;11 %&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;LZ4_RAW&lt;/td&gt;
&lt;td&gt;56.5&lt;/td&gt;
&lt;td&gt;50.5&lt;/td&gt;
&lt;td&gt;11 %&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;LZO&lt;/td&gt;
&lt;td&gt;56.1&lt;/td&gt;
&lt;td&gt;51.1&lt;/td&gt;
&lt;td&gt;9 %&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The improvement in writing times is remarkable, but especially in the New York taxi dataset, with a majority of numeric values. Especially noteworthy is the improvement in GZIP format times.&lt;/p&gt;

&lt;h3&gt;
  
  
  Reading
&lt;/h3&gt;

&lt;p&gt;Italian government dataset (seconds):&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Format&lt;/th&gt;
&lt;th&gt;Version 1&lt;/th&gt;
&lt;th&gt;Version 2&lt;/th&gt;
&lt;th&gt;Improvement&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;UNCOMPRESSED&lt;/td&gt;
&lt;td&gt;11.4&lt;/td&gt;
&lt;td&gt;11.3&lt;/td&gt;
&lt;td&gt;1 %&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;SNAPPY&lt;/td&gt;
&lt;td&gt;12.5&lt;/td&gt;
&lt;td&gt;11.5&lt;/td&gt;
&lt;td&gt;8 %&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;GZIP&lt;/td&gt;
&lt;td&gt;13.6&lt;/td&gt;
&lt;td&gt;12.8&lt;/td&gt;
&lt;td&gt;6 %&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;ZSTD&lt;/td&gt;
&lt;td&gt;13.1&lt;/td&gt;
&lt;td&gt;12.2&lt;/td&gt;
&lt;td&gt;7 %&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;LZ4_RAW&lt;/td&gt;
&lt;td&gt;12.8&lt;/td&gt;
&lt;td&gt;11.3&lt;/td&gt;
&lt;td&gt;12 %&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;LZO&lt;/td&gt;
&lt;td&gt;13.1&lt;/td&gt;
&lt;td&gt;12.1&lt;/td&gt;
&lt;td&gt;7 %&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;New York taxi dataset (seconds):&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Format&lt;/th&gt;
&lt;th&gt;Version 1&lt;/th&gt;
&lt;th&gt;Version 2&lt;/th&gt;
&lt;th&gt;Improvement&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;UNCOMPRESSED&lt;/td&gt;
&lt;td&gt;37.4&lt;/td&gt;
&lt;td&gt;33.0&lt;/td&gt;
&lt;td&gt;12 %&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;SNAPPY&lt;/td&gt;
&lt;td&gt;39.9&lt;/td&gt;
&lt;td&gt;34.0&lt;/td&gt;
&lt;td&gt;15 %&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;GZIP&lt;/td&gt;
&lt;td&gt;40.9&lt;/td&gt;
&lt;td&gt;34.4&lt;/td&gt;
&lt;td&gt;16  %&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;ZSTD&lt;/td&gt;
&lt;td&gt;41.5&lt;/td&gt;
&lt;td&gt;34.1&lt;/td&gt;
&lt;td&gt;18 %&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;LZ4_RAW&lt;/td&gt;
&lt;td&gt;41.5&lt;/td&gt;
&lt;td&gt;33.6&lt;/td&gt;
&lt;td&gt;19 %&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;LZO&lt;/td&gt;
&lt;td&gt;41.1&lt;/td&gt;
&lt;td&gt;33.7&lt;/td&gt;
&lt;td&gt;18 %&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;In reading, we again see a notable improvement, but even better in the taxi dataset with many decimal types.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Although this post might seem like a critique of Parquet, that is not my intention.  I am simply documenting what I have learned and explaining the challenges maintainers of an open format face when evolving it. &lt;strong&gt;All the benefits and utilities that a format like Parquet has far outweigh these inconveniences&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;The improvements that the latest version of Parquet brings help reduce file sizes and processing times, but the difference is not dramatic. Given the low adoption of Version 2 in the ecosystem, for now these improvements do not help to justify potential compatibility problems when you integrate with third parties. However, if you control all parts of the process, consider adopting the latest specification.&lt;/p&gt;

&lt;p&gt;Most of what I have written is my interpretation, and I could be wrong. If you have better sources or a different opinion, feel free to share it in the comments.&lt;/p&gt;

</description>
      <category>parquet</category>
      <category>bigdata</category>
      <category>java</category>
    </item>
    <item>
      <title>Compression algorithms in Parquet Java</title>
      <dc:creator>Jerónimo López</dc:creator>
      <pubDate>Mon, 20 Jan 2025 09:06:16 +0000</pubDate>
      <link>https://dev.to/jerolba/compression-algorithms-in-parquet-java-4kec</link>
      <guid>https://dev.to/jerolba/compression-algorithms-in-parquet-java-4kec</guid>
      <description>&lt;p&gt;Apache Parquet is a columnar storage format optimized for analytical workloads, though it can also be used to store any type of structured data solving multiple use cases.&lt;/p&gt;

&lt;p&gt;One of its most notable features is the ability to efficiently compress data using different compression techniques at two stages of its process. This reduces storage costs and improves reading performance.&lt;/p&gt;

&lt;p&gt;This article explains file compression in Parquet for Java, provides usage examples, and analyzes its performance.&lt;/p&gt;

&lt;h2&gt;
  
  
  Compression Techniques
&lt;/h2&gt;

&lt;p&gt;Unlike traditional row-based storage formats, Parquet uses a columnar approach, allowing the usage of more specific and effective compression techniques based on data locality and redundancy of values of the same type.&lt;/p&gt;

&lt;p&gt;Parquet writes information in binary and applies compression at two distinct levels, using different techniques at each:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;While writing the values of a column, it adaptively chooses the &lt;a href="https://parquet.apache.org/docs/file-format/data-pages/encodings/" rel="noopener noreferrer"&gt;encoding type&lt;/a&gt; based on the characteristics of the initial values: Dictionary, Run-Length Encoding Bit-Packing, Delta Encoding, etc.&lt;/li&gt;
&lt;li&gt;Every time a certain amount of bytes is reached (1MB by default) a page is formed, and the binary block is compressed with the algorithm configured by the programmer (none, GZip, Snappy, LZ4, etc.).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Although the compression algorithm is configured at the file level, the encoding of each column is automatically selected using an internal heuristic (at least in the parquet-java implementation).&lt;/p&gt;

&lt;p&gt;The performance of different compression techniques will depend heavily on your data, so there's no silver bullet that guarantees the fastest processing time and lowest space consumption. &lt;strong&gt;You will need to execute your own tests&lt;/strong&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Code
&lt;/h2&gt;

&lt;p&gt;The configuration is straightforward, and it only needs to be explicitly set when writing. When reading a file, Parquet discovers which compression algorithm was used and applies the corresponding decompression algorithm.&lt;/p&gt;

&lt;h3&gt;
  
  
  Configuring the Algorithm or Codec
&lt;/h3&gt;

&lt;p&gt;In both &lt;a href="https://github.com/jerolba/parquet-carpet" rel="noopener noreferrer"&gt;Carpet&lt;/a&gt; and Parquet with &lt;a href="https://dev.to/jerolba/working-with-parquet-files-in-java-using-protocol-buffers-fdf"&gt;Protocol Buffers&lt;/a&gt; and &lt;a href="https://dev.to/jerolba/working-with-parquet-files-in-java-using-avro-134"&gt;Avro&lt;/a&gt;, to configure the compression algorithm, you just need to call the &lt;code&gt;withCompressionCodec&lt;/code&gt; method of the builder:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Carpet&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="nc"&gt;CarpetWriter&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="no"&gt;T&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;writer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;CarpetWriter&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;Builder&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&amp;gt;(&lt;/span&gt;&lt;span class="n"&gt;outputFile&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;clazz&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;withCompressionCodec&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;CompressionCodecName&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;ZSTD&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;build&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Avro&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="nc"&gt;ParquetWriter&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;Organization&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;writer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;AvroParquetWriter&lt;/span&gt;&lt;span class="o"&gt;.&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;Organization&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;builder&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;outputFile&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;withSchema&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Organization&lt;/span&gt;&lt;span class="o"&gt;().&lt;/span&gt;&lt;span class="na"&gt;getSchema&lt;/span&gt;&lt;span class="o"&gt;())&lt;/span&gt;
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;withCompressionCodec&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;CompressionCodecName&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;ZSTD&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;build&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Protocol Buffers&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="nc"&gt;ParquetWriter&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;Organization&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;writer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;ProtoParquetWriter&lt;/span&gt;&lt;span class="o"&gt;.&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;Organization&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;builder&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;outputFile&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;withMessage&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;Organization&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;class&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;withCompressionCodec&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;CompressionCodecName&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;ZSTD&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;build&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The value must be one of the available ones in the &lt;a href="https://github.com/apache/parquet-java/blob/master/parquet-common/src/main/java/org/apache/parquet/hadoop/metadata/CompressionCodecName.java#L26" rel="noopener noreferrer"&gt;CompressionCodecName&lt;/a&gt; enum: UNCOMPRESSED, SNAPPY, GZIP, LZO, BROTLI, LZ4, ZSTD, and LZ4_RAW (LZ4 is deprecated, and you should use LZ4_RAW).&lt;/p&gt;

&lt;h3&gt;
  
  
  Compression Level
&lt;/h3&gt;

&lt;p&gt;Some compression algorithms offer a way to fine-tune the compression level. This level is usually related to the effort they need to apply to find repetition patterns, and the higher the compression, the more time and memory is required for the compression process.&lt;/p&gt;

&lt;p&gt;Although they come with a default value, it is modifiable using Parquet's generic configuration mechanism, although each codec uses a different key.&lt;/p&gt;

&lt;p&gt;Additionally, the value to choose is not standard and depends on each codec, so you must refer to the documentation of each algorithm to understand what each level offers.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;ZSTD&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;To reference the configuration of the level, ZSTD codec declares a constant: &lt;code&gt;ZstandardCodec.PARQUET_COMPRESS_ZSTD_LEVEL&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Possible values range from &lt;a href="https://facebook.github.io/zstd/zstd_manual.html#Chapter1" rel="noopener noreferrer"&gt;1 to 22&lt;/a&gt;, and the default value is 3.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="nc"&gt;ParquetWriter&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;Organization&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;writer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;ProtoParquetWriter&lt;/span&gt;&lt;span class="o"&gt;.&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;Organization&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;builder&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;outputFile&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;withMessage&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;Organization&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;class&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;withCompressionCodec&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;CompressionCodecName&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;ZSTD&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;config&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;ZstandardCodec&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;PARQUET_COMPRESS_ZSTD_LEVEL&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"6"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;build&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;LZO&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;To reference the configuration of the level, LZO codec declares a constant: &lt;code&gt;LzoCodec.LZO_COMPRESSION_LEVEL_KEY&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Possible values range from &lt;a href="https://github.com/nemequ/lzo/blob/master/doc/LZO.TXT#L74" rel="noopener noreferrer"&gt;1 to 9, 99, and 999&lt;/a&gt;, and the default value is "999".&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="nc"&gt;ParquetWriter&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;Organization&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;writer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;ProtoParquetWriter&lt;/span&gt;&lt;span class="o"&gt;.&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;Organization&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;builder&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;outputFile&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;withMessage&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;Organization&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;class&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;withCompressionCodec&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;CompressionCodecName&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;LZO&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;config&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;LzoCodec&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;LZO_COMPRESSION_LEVEL_KEY&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"99"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;build&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;GZIP&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;It does not declare any constant, and you have to use the string &lt;code&gt;"zlib.compress.level"&lt;/code&gt; directly, with possible values ranging from &lt;a href="https://www.euccas.me/zlib/#zlib_compression_levels" rel="noopener noreferrer"&gt;0 to 9&lt;/a&gt; and with a default value of "6"&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="nc"&gt;ParquetWriter&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;Organization&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;writer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;ProtoParquetWriter&lt;/span&gt;&lt;span class="o"&gt;.&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;Organization&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;builder&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;outputFile&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;withMessage&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;Organization&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;class&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;withCompressionCodec&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;CompressionCodecName&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;GZIP&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;config&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"zlib.compress.level"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"9"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;build&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Performance Tests
&lt;/h2&gt;

&lt;p&gt;To analyze the performance of different compression algorithms, I will use two public datasets containing different types of data:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://www.nyc.gov/site/tlc/about/tlc-trip-record-data.page" rel="noopener noreferrer"&gt;New York City Taxi Trips&lt;/a&gt;: with a large number of numeric values and few string values in a few columns. It has 23 columns and contains 19.6 million records.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://opencoesione.gov.it/en/opendata/#!basedati_section" rel="noopener noreferrer"&gt;Cohesion Projects of the Italian Government&lt;/a&gt;: many columns with float values and a large quantity and variety of text strings. It has 91 columns and contains 2 million rows.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I will evaluate some of the compression algorithms enabled in Parquet Java: UNCOMPRESSED, SNAPPY, GZIP, LZO, ZSTD, LZ4_RAW.&lt;/p&gt;

&lt;p&gt;As expected, I will use &lt;a href="https://github.com/jerolba/parquet-carpet" rel="noopener noreferrer"&gt;Carpet&lt;/a&gt; with the default configuration that &lt;a href="https://github.com/apache/parquet-java/blob/4aeba6cb7348a8fb57d8585058d27f53ce592b48/parquet-column/src/main/java/org/apache/parquet/column/ParquetProperties.java#L49" rel="noopener noreferrer"&gt;parquet-java&lt;/a&gt; brings, and the default compression level of each algorithm.&lt;/p&gt;

&lt;p&gt;You can find the source code on GitHub, and the tests were done on a laptop with an AMD Ryzen 7 4800HS CPU and JDK 17.&lt;/p&gt;

&lt;h3&gt;
  
  
  File Size
&lt;/h3&gt;

&lt;p&gt;To understand how each compression performs, we will take the equivalent CSV file as a reference.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Format&lt;/th&gt;
&lt;th&gt;gov.it&lt;/th&gt;
&lt;th&gt;NYC Taxis&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;CSV&lt;/td&gt;
&lt;td&gt;1761 MB&lt;/td&gt;
&lt;td&gt;2983 MB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;UNCOMPRESSED&lt;/td&gt;
&lt;td&gt;564 MB&lt;/td&gt;
&lt;td&gt;760 MB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;SNAPPY&lt;/td&gt;
&lt;td&gt;220 MB&lt;/td&gt;
&lt;td&gt;542 MB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;GZIP&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;146 MB&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;448 MB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;ZSTD&lt;/td&gt;
&lt;td&gt;148  MB&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;430 MB&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;LZ4_RAW&lt;/td&gt;
&lt;td&gt;209 MB&lt;/td&gt;
&lt;td&gt;547 MB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;LZO&lt;/td&gt;
&lt;td&gt;215 MB&lt;/td&gt;
&lt;td&gt;518 MB&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;In both tests, compression with GZip and Zstandard stands out as the most efficient.&lt;/p&gt;

&lt;p&gt;Using only Parquet encoding techniques, the file size can be reduced to 25-32% of the original CSV size. Applying additional compression reduces it to between &lt;strong&gt;9% and 15% of the CSV size&lt;/strong&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Writing
&lt;/h3&gt;

&lt;p&gt;How much overhead does compressing the information bring?&lt;/p&gt;

&lt;p&gt;If we write the same information three times and average the seconds, we get:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Algorithm&lt;/th&gt;
&lt;th&gt;gov.it&lt;/th&gt;
&lt;th&gt;NYC Taxis&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;UNCOMPRESSED&lt;/td&gt;
&lt;td&gt;25.0&lt;/td&gt;
&lt;td&gt;57.9&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;SNAPPY&lt;/td&gt;
&lt;td&gt;25.2&lt;/td&gt;
&lt;td&gt;56.4&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;GZIP&lt;/td&gt;
&lt;td&gt;39.3&lt;/td&gt;
&lt;td&gt;91.1&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;ZSTD&lt;/td&gt;
&lt;td&gt;27.3&lt;/td&gt;
&lt;td&gt;64.1&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;LZ4_RAW&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;24.9&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;56.5&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;LZO&lt;/td&gt;
&lt;td&gt;26.0&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;56.1&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;SNAPPY, LZ4, and LZO achieve similar times to not compressing, while ZSTD adds a bit of overhead. GZIP performs the worst, worsening the writing time by 50%.&lt;/p&gt;

&lt;h3&gt;
  
  
  Reading
&lt;/h3&gt;

&lt;p&gt;Reading the files is faster than writing since fewer computations are needed.&lt;/p&gt;

&lt;p&gt;Reading all the columns from the file, the times in seconds are:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Algorithm&lt;/th&gt;
&lt;th&gt;gov.it&lt;/th&gt;
&lt;th&gt;NYC Taxis&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;UNCOMPRESSED&lt;/td&gt;
&lt;td&gt;11.4&lt;/td&gt;
&lt;td&gt;37.4&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;SNAPPY&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;12.5&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;39.9&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;GZIP&lt;/td&gt;
&lt;td&gt;13.6&lt;/td&gt;
&lt;td&gt;40.9&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;ZSTD&lt;/td&gt;
&lt;td&gt;13.1&lt;/td&gt;
&lt;td&gt;41.5&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;LZ4_RAW&lt;/td&gt;
&lt;td&gt;12.8&lt;/td&gt;
&lt;td&gt;41.6&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;LZO&lt;/td&gt;
&lt;td&gt;13.1&lt;/td&gt;
&lt;td&gt;41.1&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Reading times are close to not compressing the information, and the overhead of decompression is between 10% and 20%.&lt;/p&gt;

&lt;h3&gt;
  
  
  Conclusion
&lt;/h3&gt;

&lt;p&gt;No algorithm has stood out significantly over the others in reading and writing times, all being within a similar range. &lt;strong&gt;In most cases, compressing the information compensates for the space savings (and transmission) over the time penalty&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;In these two use cases, the determining factor for selecting one or another would probably be the compression ratio achieved, with ZSTD and Gzip standing out (but with poor writing time).&lt;/p&gt;

&lt;p&gt;Each algorithm has its strengths, so the best option is to test with your data, considering which factor is more important:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Minimizing storage usage, because you store a lot of data that you rarely use.&lt;/li&gt;
&lt;li&gt;Minimizing file generation time.&lt;/li&gt;
&lt;li&gt;Minimizing reading time, since files are read many times.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;As with everything in life, it's a trade-off, and you will have to see what compensates the most. In Carpet, by default, if you configure nothing, it compresses with Snappy.&lt;/p&gt;

&lt;h2&gt;
  
  
  Implementation Details
&lt;/h2&gt;

&lt;p&gt;The value must be one of those available in the &lt;a href="https://github.com/apache/parquet-java/blob/master/parquet-common/src/main/java/org/apache/parquet/hadoop/metadata/CompressionCodecName.java#L26" rel="noopener noreferrer"&gt;CompressionCodecName&lt;/a&gt; enum. Associated with each enum value is the name of the class implementing the algorithm:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="kd"&gt;public&lt;/span&gt; &lt;span class="kd"&gt;enum&lt;/span&gt; &lt;span class="nc"&gt;CompressionCodecName&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
  &lt;span class="no"&gt;UNCOMPRESSED&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="kc"&gt;null&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;CompressionCodec&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;UNCOMPRESSED&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="s"&gt;""&lt;/span&gt;&lt;span class="o"&gt;),&lt;/span&gt;
  &lt;span class="no"&gt;SNAPPY&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"org.apache.parquet.hadoop.codec.SnappyCodec"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;CompressionCodec&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;SNAPPY&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="s"&gt;".snappy"&lt;/span&gt;&lt;span class="o"&gt;),&lt;/span&gt;
  &lt;span class="no"&gt;GZIP&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"org.apache.hadoop.io.compress.GzipCodec"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;CompressionCodec&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;GZIP&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="s"&gt;".gz"&lt;/span&gt;&lt;span class="o"&gt;),&lt;/span&gt;
  &lt;span class="no"&gt;LZO&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"com.hadoop.compression.lzo.LzoCodec"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;CompressionCodec&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;LZO&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="s"&gt;".lzo"&lt;/span&gt;&lt;span class="o"&gt;),&lt;/span&gt;
  &lt;span class="no"&gt;BROTLI&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"org.apache.hadoop.io.compress.BrotliCodec"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;CompressionCodec&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;BROTLI&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="s"&gt;".br"&lt;/span&gt;&lt;span class="o"&gt;),&lt;/span&gt;
  &lt;span class="no"&gt;LZ4&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"org.apache.hadoop.io.compress.Lz4Codec"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;CompressionCodec&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;LZ4&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="s"&gt;".lz4hadoop"&lt;/span&gt;&lt;span class="o"&gt;),&lt;/span&gt;
  &lt;span class="no"&gt;ZSTD&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"org.apache.parquet.hadoop.codec.ZstandardCodec"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;CompressionCodec&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;ZSTD&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="s"&gt;".zstd"&lt;/span&gt;&lt;span class="o"&gt;),&lt;/span&gt;
  &lt;span class="no"&gt;LZ4_RAW&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"org.apache.parquet.hadoop.codec.Lz4RawCodec"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;CompressionCodec&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;LZ4_RAW&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="s"&gt;".lz4raw"&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
  &lt;span class="o"&gt;...&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Parquet will use reflection to instantiate the specified class, which must implement the &lt;a href="https://github.com/apache/hadoop/blob/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/io/compress/CompressionCodec.java" rel="noopener noreferrer"&gt;&lt;code&gt;CompressionCodec&lt;/code&gt;&lt;/a&gt; interface. If you look at its source code, you will see that it is within the Hadoop project, not Parquet. This shows how coupled Parquet is with Hadoop in the Java implementation.&lt;/p&gt;

&lt;p&gt;To use one of the codecs, you must ensure you have added a JAR containing its implementation as a dependency.&lt;/p&gt;

&lt;p&gt;Not all implementations are available in the transitive dependencies you have when adding &lt;code&gt;parquet-java&lt;/code&gt;, or you may have excluded Hadoop dependencies too aggressively.&lt;/p&gt;

&lt;p&gt;In the &lt;code&gt;org.apache.parquet:parquet-hadoop&lt;/code&gt; dependency, the implementations of &lt;code&gt;SnappyCodec&lt;/code&gt;, &lt;code&gt;ZstandardCodec&lt;/code&gt;, and &lt;code&gt;Lz4RawCodec&lt;/code&gt; are included, which transitively imports the &lt;code&gt;snappy-java&lt;/code&gt;, &lt;code&gt;zstd-jni&lt;/code&gt;, and &lt;code&gt;aircompressor&lt;/code&gt; dependencies with the actual implementations of the three algorithms.&lt;/p&gt;

&lt;p&gt;In the &lt;code&gt;hadoop-common:hadoop-common&lt;/code&gt; dependency, the implementation of &lt;code&gt;GzipCodec&lt;/code&gt; is included.&lt;/p&gt;

&lt;p&gt;Where are the implementations of &lt;code&gt;BrotliCodec&lt;/code&gt; and &lt;code&gt;LzoCodec&lt;/code&gt;? &lt;strong&gt;They are not in any of the Parquet or Hadoop dependencies&lt;/strong&gt;, so if you use them without adding additional dependencies, your application will not work with files compressed with those formats.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;To support LZO, you need to add the &lt;a href="https://central.sonatype.com/artifact/org.anarres.lzo/lzo-hadoop/1.0.1" rel="noopener noreferrer"&gt;dependency&lt;/a&gt; &lt;code&gt;org.anarres.lzo:lzo-hadoop&lt;/code&gt; to your pom or gradle files.&lt;/li&gt;
&lt;li&gt;Even more complex is the case of Brotli: the &lt;a href="https://github.com/rdblue/brotli-codec" rel="noopener noreferrer"&gt;dependency&lt;/a&gt; is not in Maven Central, and you must also add the &lt;a href="https://jitpack.io" rel="noopener noreferrer"&gt;JitPack&lt;/a&gt; repository.&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>parquet</category>
      <category>java</category>
      <category>compression</category>
      <category>bigdata</category>
    </item>
    <item>
      <title>Working with Parquet files in Java using Carpet</title>
      <dc:creator>Jerónimo López</dc:creator>
      <pubDate>Wed, 19 Jun 2024 16:55:22 +0000</pubDate>
      <link>https://dev.to/jerolba/working-with-parquet-files-in-java-using-carpet-lmk</link>
      <guid>https://dev.to/jerolba/working-with-parquet-files-in-java-using-carpet-lmk</guid>
      <description>&lt;p&gt;After some time working with Parquet files in Java using the Parquet Avro library, and studying how it worked, I concluded that despite &lt;strong&gt;being very useful&lt;/strong&gt; in multiple use cases and having great potential, &lt;strong&gt;the documentation and ecosystem needed for adoption in the Java world was very poor&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Many people are using suboptimal solutions (CSV or JSON files), applying more complex solutions (Spark), or using languages they are not familiar with (Python) because they don't know how to work with Parquet files easily. That's why I decided to &lt;strong&gt;write this &lt;a href="https://dev.to/jerolba/working-with-parquet-files-in-java-3f1j"&gt;series of articles&lt;/a&gt;&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Once you understand it and have the examples, everything is easier. But, &lt;strong&gt;can it be even easier?&lt;/strong&gt; Can we avoid the hassle of using &lt;em&gt;strange&lt;/em&gt; libraries that serialize other formats? &lt;strong&gt;Yes, it should be even easier.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;That's why I decided to &lt;strong&gt;implement an Open Source library&lt;/strong&gt; that makes working with Parquet from Java extremely simple, something that covers it: &lt;strong&gt;Carpet&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Carpet is a Java library that serializes and deserializes Parquet files to Java 17 Records, abstracting you (if you want) from the particularities of Parquet and Hadoop, and minimizing the number of necessary dependencies, because it works directly with Parquet code. It is available on &lt;a href="https://central.sonatype.com/artifact/com.jerolba/carpet-record"&gt;Maven Central&lt;/a&gt; and you can find its source code on &lt;a href="https://github.com/jerolba/parquet-carpet"&gt;GitHub&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Hello world
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Carpet works by reflection&lt;/strong&gt;: it inspects your class model and there is no need to define an IDL, implement interfaces, or use annotations. &lt;strong&gt;Carpet is based on Java records&lt;/strong&gt;, the primitive created by the JDK for &lt;a href="https://www.infoq.com/articles/data-oriented-programming-java/"&gt;Data Oriented Programming&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Continuing with the same examples from previous articles, we will have a collection of Organization objects, which have a list of Attributes:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="n"&gt;record&lt;/span&gt; &lt;span class="nf"&gt;Org&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;String&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;String&lt;/span&gt; &lt;span class="n"&gt;category&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;String&lt;/span&gt; &lt;span class="n"&gt;country&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;Type&lt;/span&gt; &lt;span class="n"&gt;type&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;List&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;Attr&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;attributes&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt; &lt;span class="o"&gt;}&lt;/span&gt;

&lt;span class="n"&gt;record&lt;/span&gt; &lt;span class="nf"&gt;Attr&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;String&lt;/span&gt; &lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;byte&lt;/span&gt; &lt;span class="n"&gt;quantity&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;byte&lt;/span&gt; &lt;span class="n"&gt;amount&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;boolean&lt;/span&gt; &lt;span class="n"&gt;active&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;double&lt;/span&gt; &lt;span class="n"&gt;percent&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;short&lt;/span&gt; &lt;span class="n"&gt;size&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt; &lt;span class="o"&gt;}&lt;/span&gt;

&lt;span class="kd"&gt;enum&lt;/span&gt; &lt;span class="nc"&gt;Type&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt; &lt;span class="no"&gt;FOO&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="no"&gt;BAR&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="no"&gt;BAZ&lt;/span&gt; &lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;With Carpet, it is not necessary to create special classes or perform transformations. &lt;strong&gt;Carpet works directly with your model&lt;/strong&gt;, as long as it fits the Parquet schema you need.&lt;/p&gt;

&lt;h3&gt;
  
  
  Serialization
&lt;/h3&gt;

&lt;p&gt;With Carpet, you don't need to use Parquet writers or Hadoop classes:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="k"&gt;try&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;OutputStream&lt;/span&gt; &lt;span class="n"&gt;outputStream&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;FileOutputStream&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;filePath&lt;/span&gt;&lt;span class="o"&gt;))&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;try&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;CarpetWriter&lt;/span&gt; &lt;span class="n"&gt;writer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;CarpetWriter&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&amp;gt;(&lt;/span&gt;&lt;span class="n"&gt;outputStream&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;Org&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;class&lt;/span&gt;&lt;span class="o"&gt;))&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;writer&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;write&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;organizations&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
    &lt;span class="o"&gt;}&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The code can be found on &lt;a href="https://github.com/jerolba/parquet-for-java-posts/blob/master/src/main/java/com/jerolba/parquet/carpet/ToParquetUsingCarpetWriter.java#L14"&gt;GitHub&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;If your records match the required Parquet schema, class conversion is not necessary. If you don't need special Parquet configuration, you don't have to create builders, and you can use a Java &lt;code&gt;OutputStream&lt;/code&gt; directly.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;By reflection, it creates the Parquet schema&lt;/strong&gt;, using the names and types of the fields in your records as column names and types.&lt;/p&gt;

&lt;p&gt;Carpet supports complex data structures, as long as all objects are records, collections (List, Set, etc.), and maps.&lt;/p&gt;

&lt;h3&gt;
  
  
  Deserialization
&lt;/h3&gt;

&lt;p&gt;Deserialization is equally simple, or even simpler.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="nc"&gt;List&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;Org&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;organizations&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;CarpetReader&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&amp;gt;(&lt;/span&gt;&lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;File&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;filePath&lt;/span&gt;&lt;span class="o"&gt;),&lt;/span&gt; &lt;span class="nc"&gt;Org&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;class&lt;/span&gt;&lt;span class="o"&gt;).&lt;/span&gt;&lt;span class="na"&gt;toList&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can also iterate through the file with a stream:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="nc"&gt;List&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;Org&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;organizations&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;CarpetReader&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&amp;gt;(&lt;/span&gt;&lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;File&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;filePath&lt;/span&gt;&lt;span class="o"&gt;),&lt;/span&gt; &lt;span class="nc"&gt;Org&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;class&lt;/span&gt;&lt;span class="o"&gt;).&lt;/span&gt;&lt;span class="na"&gt;stream&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt;
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;filter&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;somePredicate&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;toList&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The code can be found on &lt;a href="https://github.com/jerolba/parquet-for-java-posts/blob/master/src/main/java/com/jerolba/parquet/carpet/FromParquetUsingCarpetReader.java#L13"&gt;GitHub&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Since Carpet uses reflection, it conventionally expects the types and names of the fields to match those of the columns in the Parquet file.&lt;/p&gt;

&lt;p&gt;None of the Parquet or Hadoop classes are imported into your code.&lt;/p&gt;

&lt;h3&gt;
  
  
  Deserialization using a projection
&lt;/h3&gt;

&lt;p&gt;Carpet reads only the columns that are defined in the records and ignores any other columns that exist in the file. &lt;strong&gt;Defining a projection with a subset of attributes is as simple as defining a record in Java&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="n"&gt;record&lt;/span&gt; &lt;span class="nf"&gt;OrgProjection&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;String&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;String&lt;/span&gt; &lt;span class="n"&gt;category&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;String&lt;/span&gt; &lt;span class="n"&gt;country&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;Type&lt;/span&gt; &lt;span class="n"&gt;type&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt; &lt;span class="o"&gt;}&lt;/span&gt;

&lt;span class="kt"&gt;var&lt;/span&gt; &lt;span class="n"&gt;organizations&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;CarpetReader&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&amp;gt;(&lt;/span&gt;&lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;File&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;filePath&lt;/span&gt;&lt;span class="o"&gt;),&lt;/span&gt; &lt;span class="nc"&gt;OrgProjection&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;class&lt;/span&gt;&lt;span class="o"&gt;).&lt;/span&gt;&lt;span class="na"&gt;toList&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In this case, reading time is reduced to hundreds of milliseconds.&lt;/p&gt;

&lt;p&gt;The code can be found on &lt;a href="https://github.com/jerolba/parquet-for-java-posts/blob/master/src/main/java/com/jerolba/parquet/carpet/FromParquetUsingCarpetReaderProjection.java#L14"&gt;GitHub&lt;/a&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Parquet way
&lt;/h2&gt;

&lt;p&gt;If for any reason you need to customize some parameter of file generation or use it with Hadoop, Carpet provides an implementation of the &lt;code&gt;ParquetWriter&lt;/code&gt; and &lt;code&gt;ParquetReader&lt;/code&gt; builders. This way, all Parquet configurations are exposed.&lt;/p&gt;

&lt;h3&gt;
  
  
  Serialization
&lt;/h3&gt;

&lt;p&gt;We will need to instantiate a Parquet writer:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="nc"&gt;OutputFile&lt;/span&gt; &lt;span class="n"&gt;outputFile&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;FileSystemOutputFile&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;File&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;filePath&lt;/span&gt;&lt;span class="o"&gt;));&lt;/span&gt;
&lt;span class="k"&gt;try&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;ParquetWriter&lt;/span&gt;&lt;span class="o"&gt;.&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;Org&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;writer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;CarpetParquetWriter&lt;/span&gt;&lt;span class="o"&gt;.&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;Org&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;builder&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;outputFile&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;Org&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;class&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
        &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;withCompressionCodec&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;CompressionCodecName&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;GZIP&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
        &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;withWriteMode&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;Mode&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;OVERWRITE&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
        &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;build&lt;/span&gt;&lt;span class="o"&gt;())&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;Org&lt;/span&gt; &lt;span class="n"&gt;org&lt;/span&gt; &lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="n"&gt;organizations&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;writer&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;write&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;org&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
    &lt;span class="o"&gt;}&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The code can be found on &lt;a href="https://github.com/jerolba/parquet-for-java-posts/blob/master/src/main/java/com/jerolba/parquet/carpet/ToParquetUsingCarpetParquetWriter.java#L19"&gt;GitHub&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Carpet implements a &lt;code&gt;ParquetWriter&amp;lt;T&amp;gt;&lt;/code&gt; builder with all the logic to &lt;strong&gt;convert Java records to Parquet API calls&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;To avoid using Hadoop classes (and importing all their dependencies), &lt;strong&gt;Carpet implements the &lt;code&gt;InputFile&lt;/code&gt; and &lt;code&gt;OutputFile&lt;/code&gt; interfaces using regular files&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Therefore:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;OutputFile&lt;/code&gt; and &lt;code&gt;ParquetWriter&lt;/code&gt; are classes defined by the Parquet API&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;CarpetParquetWriter&lt;/code&gt; and &lt;code&gt;FileSystemOutputFile&lt;/code&gt; are classes implemented by Carpet&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;Org&lt;/code&gt; and &lt;code&gt;Attr&lt;/code&gt; are Java records from your domain, unrelated to Parquet or Carpet&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Carpet implicitly generates the Parquet schema from the fields of your records.&lt;/p&gt;

&lt;h3&gt;
  
  
  Deserialization
&lt;/h3&gt;

&lt;p&gt;We will need to instantiate a Parquet reader using the &lt;code&gt;CarpetParquetReader&lt;/code&gt; builder:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="nc"&gt;InputFile&lt;/span&gt; &lt;span class="n"&gt;inputFile&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;FileSystemInputFile&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;File&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;filePath&lt;/span&gt;&lt;span class="o"&gt;));&lt;/span&gt;
&lt;span class="k"&gt;try&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;ParquetReader&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;Org&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;reader&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;CarpetParquetReader&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;builder&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;inputFile&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;Org&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;class&lt;/span&gt;&lt;span class="o"&gt;).&lt;/span&gt;&lt;span class="na"&gt;build&lt;/span&gt;&lt;span class="o"&gt;())&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
    &lt;span class="nc"&gt;List&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;Org&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;organizations&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;ArrayList&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&amp;gt;();&lt;/span&gt;
    &lt;span class="nc"&gt;Org&lt;/span&gt; &lt;span class="n"&gt;next&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;null&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt;
    &lt;span class="k"&gt;while&lt;/span&gt; &lt;span class="o"&gt;((&lt;/span&gt;&lt;span class="n"&gt;next&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;reader&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;read&lt;/span&gt;&lt;span class="o"&gt;())&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="kc"&gt;null&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;organizations&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;add&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;next&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
    &lt;span class="o"&gt;}&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;organizations&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can find the code on &lt;a href="https://github.com/jerolba/parquet-for-java-posts/blob/master/src/main/java/com/jerolba/parquet/carpet/FromParquetUsingParquetCarpetReader.java#L18"&gt;GitHub&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Parquet defines a class called &lt;code&gt;ParquetReader&amp;lt;T&amp;gt;&lt;/code&gt;, and Carpet implements it with &lt;code&gt;CarpetParquetReader&lt;/code&gt;, handling the logic to &lt;strong&gt;convert internal data structures of Parquet&lt;/strong&gt; to your Java records.&lt;/p&gt;

&lt;p&gt;In this case:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;InputFile&lt;/code&gt;  and &lt;code&gt;ParquetReader&lt;/code&gt; are classes defined by the Parquet API&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;CarpetParquetReader&lt;/code&gt; and &lt;code&gt;FileSystemOutputFile&lt;/code&gt; are classes implemented by Carpet&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;Org&lt;/code&gt; (and &lt;code&gt;Attr&lt;/code&gt;) are Java records from your domain, unrelated to Parquet&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The instantiation of the &lt;code&gt;ParquetReader&lt;/code&gt; class is also done with a Builder to maintain the pattern followed by Parquet.&lt;/p&gt;

&lt;p&gt;Carpet validates that the schema of the Parquet file is compatible with the Java records. If not, it throws an exception.&lt;/p&gt;




&lt;h2&gt;
  
  
  Performance
&lt;/h2&gt;

&lt;p&gt;With identical schemas and data, the file sizes compared to &lt;code&gt;parquet-avro&lt;/code&gt; and &lt;code&gt;parquet-protobuf&lt;/code&gt; are the same. However, what is the overhead cost of using reflection?&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Library&lt;/th&gt;
&lt;th&gt;Serialization&lt;/th&gt;
&lt;th&gt;Deserialization&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Parquet Avro&lt;/td&gt;
&lt;td&gt;15,381 ms&lt;/td&gt;
&lt;td&gt;7,665 ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Parquet Protocol Buffers&lt;/td&gt;
&lt;td&gt;16,174 ms&lt;/td&gt;
&lt;td&gt;11,025 ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Carpet&lt;/td&gt;
&lt;td&gt;12,769 ms&lt;/td&gt;
&lt;td&gt;8,881 ms&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Writing, Carpet is 20% faster than using Avro and Protocol Buffers. The overhead of reflection is less than the work required to create Avro or Protocol Buffers objects.&lt;/p&gt;

&lt;p&gt;In terms of reading, Carpet is slightly slower than the fastest version of Parquet Avro. The use of reflection does not significantly penalize performance, and in return, we avoid using custom data types of the library.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Parquet is a very powerful format, yet underutilized in the Java ecosystem. This is partly due to lack of awareness and the difficulty in working with it, and partly because being a binary format, it is not very comfortable to work with it.&lt;/p&gt;

&lt;p&gt;Even if you're not into Big Data, Parquet can still be useful in scenarios involving large datasets. Often, due to unfamiliarity, complex or inefficient solutions and architectures are adopted.&lt;/p&gt;

&lt;p&gt;The format, with its schema, &lt;strong&gt;ensures that the defined types are satisfied or the data cannot be null&lt;/strong&gt;. How many times have you struggled parsing a CSV file?&lt;/p&gt;

&lt;p&gt;Carpet provides a very simple API, making it extremely easy to write and process Parquet files in 99% of use cases. For me, &lt;strong&gt;working with Parquet files is now more convenient than CSVs&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Carpet is an open-source library under the Apache 2.0 license. You can find its source code on &lt;a href="https://github.com/jerolba/parquet-carpet"&gt;GitHub&lt;/a&gt; and it's available on &lt;a href="https://central.sonatype.com/artifact/com.jerolba/carpet-record"&gt;Maven Central&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The &lt;a href="https://github.com/jerolba/parquet-carpet?tab=readme-ov-file#table-of-contents"&gt;README.md&lt;/a&gt; of the project provides a detailed explanation of its various functionalities, customization options, and how to use its API. &lt;strong&gt;I encourage you to use Carpet and share your feedback or tell me about your use cases working with Parquet.&lt;/strong&gt;&lt;/p&gt;

</description>
      <category>parquet</category>
      <category>java</category>
      <category>bigdata</category>
      <category>dataengineering</category>
    </item>
    <item>
      <title>Working with Parquet files in Java using Protocol Buffers</title>
      <dc:creator>Jerónimo López</dc:creator>
      <pubDate>Thu, 07 Dec 2023 05:00:00 +0000</pubDate>
      <link>https://dev.to/jerolba/working-with-parquet-files-in-java-using-protocol-buffers-fdf</link>
      <guid>https://dev.to/jerolba/working-with-parquet-files-in-java-using-protocol-buffers-fdf</guid>
      <description>&lt;p&gt;This post continues the &lt;a href="https://dev.to/jerolba/working-with-parquet-files-in-java-3f1j"&gt;series&lt;/a&gt; of &lt;a href="https://dev.to/jerolba/working-with-parquet-files-in-java-using-avro-134"&gt;articles&lt;/a&gt; about working with Parquet files in Java. This time, I'll explain how to do it using the Protocol Buffers (PB) library.&lt;/p&gt;

&lt;p&gt;Finding examples and documentation on how to use Parquet with Avro is challenging, but with &lt;strong&gt;Protocol Buffers, it's even more complicated&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Protocol Buffers and Parquet both support complex data structures, offering a direct and clear mapping between them.&lt;/p&gt;

&lt;p&gt;I will base this on the same example I used in previous articles talking about serialization. The code will be very similar to that in the &lt;a href="https://dev.to/jerolba/java-serialization-with-protocol-buffers-4pbj"&gt;article about Protocol Buffers&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;In the example, we will work with a collection of Organization objects, which in turn have a list of Attributes:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="n"&gt;record&lt;/span&gt; &lt;span class="nf"&gt;Org&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;String&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;String&lt;/span&gt; &lt;span class="n"&gt;category&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;String&lt;/span&gt; &lt;span class="n"&gt;country&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;Type&lt;/span&gt; &lt;span class="n"&gt;type&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;List&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;Attr&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;attributes&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;

&lt;span class="n"&gt;record&lt;/span&gt; &lt;span class="nf"&gt;Attr&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;String&lt;/span&gt; &lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;byte&lt;/span&gt; &lt;span class="n"&gt;quantity&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;byte&lt;/span&gt; &lt;span class="n"&gt;amount&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;boolean&lt;/span&gt; &lt;span class="n"&gt;active&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;double&lt;/span&gt; &lt;span class="n"&gt;percent&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;short&lt;/span&gt; &lt;span class="n"&gt;size&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;

&lt;span class="kd"&gt;enum&lt;/span&gt; &lt;span class="nc"&gt;Type&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
  &lt;span class="no"&gt;FOO&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="no"&gt;BAR&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="no"&gt;BAZ&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In Parquet with PB we must also use classes generated from Protocol Buffers &lt;a href="https://dev.to/jerolba/java-serialization-with-protocol-buffers-4pbj#idl-and-code-generation"&gt;IDL&lt;/a&gt;. This capability is specific to PB, not Parquet, but it is inherited by &lt;code&gt;parquet-protobuf&lt;/code&gt;, the library that implements this integration.&lt;/p&gt;

&lt;p&gt;Internally, the library transforms the PB schema into the Parquet schema, so most tools and libraries that can work with PB classes will be able to work indirectly with Parquet with few changes.&lt;/p&gt;

&lt;p&gt;Apart from the classes used to write or read the files, the remaining logic for building the objects generated by PB or reading their data remains unchanged compared to PB serialization.&lt;/p&gt;

&lt;h3&gt;
  
  
  Serialization
&lt;/h3&gt;

&lt;p&gt;We will need to instantiate a Parquet writer that supports the writing of objects created by PB:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="nc"&gt;Path&lt;/span&gt; &lt;span class="n"&gt;path&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Path&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"/tmp/my_output_file.parquet"&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
&lt;span class="nc"&gt;OutputFile&lt;/span&gt; &lt;span class="n"&gt;outputFile&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;HadoopOutputFile&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;fromPath&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Configuration&lt;/span&gt;&lt;span class="o"&gt;());&lt;/span&gt;
&lt;span class="nc"&gt;ParquetWriter&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;Organization&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;writer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;ProtoParquetWriter&lt;/span&gt;&lt;span class="o"&gt;.&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;Organization&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;builder&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;outputFile&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
      &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;withMessage&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;Organization&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;class&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
      &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;withWriteMode&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;Mode&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;OVERWRITE&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
      &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;config&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;ProtoWriteSupport&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;PB_SPECS_COMPLIANT_WRITE&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"true"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
      &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;build&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Parquet defines a class named &lt;code&gt;ParquetWriter&amp;lt;T&amp;gt;&lt;/code&gt; and the &lt;code&gt;parquet-protobuf&lt;/code&gt; library extends it by implementing in &lt;code&gt;ProtoParquetWriter&amp;lt;T&amp;gt;&lt;/code&gt; the logic of &lt;strong&gt;converting PB objects into calls to the Parquet API&lt;/strong&gt;. The object we will serialize is &lt;code&gt;Organization&lt;/code&gt;, which has been generated using the PB utility and implements the PB API.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;Path&lt;/code&gt; class is not the one from &lt;code&gt;java.nio.file&lt;/code&gt;, but a Hadoop-specific abstraction for referencing file paths. Whereas the &lt;code&gt;OutputFile&lt;/code&gt; class is Parquet's file abstraction with the capability to write to them.&lt;/p&gt;

&lt;p&gt;Therefore:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;Path&lt;/code&gt;, &lt;code&gt;OutputFile&lt;/code&gt;, &lt;code&gt;HadoopOutputFile&lt;/code&gt;, and &lt;code&gt;ParquetWriter&lt;/code&gt; are classes defined by the Parquet API&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;ProtoParquetWriter&lt;/code&gt; is a class defined by the &lt;code&gt;parquet-protobuf&lt;/code&gt; API, a library that encapsulates Parquet with Protocol Buffers&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;Organization&lt;/code&gt; and &lt;code&gt;Attribute&lt;/code&gt; are classes generated by the PB utility, not related to Parquet&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The way to create an instance of &lt;code&gt;ParquetWriter&lt;/code&gt; is through a Builder, where you can configure many Parquet-specific parameters or those of the library we are using (Protocol Buffers). For example:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;withMessage&lt;/code&gt;: class generated with Protocol Buffers that we want to persist (which internally defines the schema)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;withCompressionCodec&lt;/code&gt;: compression method to use: SNAPPY, GZIP, LZ4, etc. By default, it doesn't configure any.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;withWriteMode&lt;/code&gt;: by default it is CREATE, so if the file already existed, it would not overwrite it and would throw an exception. To avoid this, use OVERWRITE.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;withValidation&lt;/code&gt;: if we want it to validate the data types passed against the defined schema.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Generic configuration can be passed using the &lt;code&gt;config(String property, String value)&lt;/code&gt; method. In this case, we configure it to internally use a &lt;a href="https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#lists"&gt;three-level structure&lt;/a&gt; to represent nested lists.&lt;/p&gt;

&lt;p&gt;Once the &lt;code&gt;ParquetWriter&lt;/code&gt; class is instantiated, the major complexity lies in transforming your POJOs into the &lt;code&gt;Organization&lt;/code&gt; classes generated from the PB IDL:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="nc"&gt;Path&lt;/span&gt; &lt;span class="n"&gt;path&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Path&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"/tmp/my_output_file.parquet"&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
&lt;span class="nc"&gt;OutputFile&lt;/span&gt; &lt;span class="n"&gt;outputFile&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;HadoopOutputFile&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;fromPath&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Configuration&lt;/span&gt;&lt;span class="o"&gt;());&lt;/span&gt;
&lt;span class="k"&gt;try&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;ParquetWriter&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;Organization&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;writer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;ProtoParquetWriter&lt;/span&gt;&lt;span class="o"&gt;.&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;Organization&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;builder&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;outputFile&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
      &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;withMessage&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;Organization&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;class&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
      &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;withWriteMode&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;Mode&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;OVERWRITE&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
      &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;config&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;ProtoWriteSupport&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;PB_SPECS_COMPLIANT_WRITE&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"true"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
      &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;build&lt;/span&gt;&lt;span class="o"&gt;())&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;Org&lt;/span&gt; &lt;span class="n"&gt;org&lt;/span&gt; &lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="n"&gt;organizations&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
        &lt;span class="kt"&gt;var&lt;/span&gt; &lt;span class="n"&gt;organizationBuilder&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Organization&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;newBuilder&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt;
            &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;setName&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;org&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="o"&gt;())&lt;/span&gt;
            &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;setCategory&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;org&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;category&lt;/span&gt;&lt;span class="o"&gt;())&lt;/span&gt;
            &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;setCountry&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;org&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;country&lt;/span&gt;&lt;span class="o"&gt;())&lt;/span&gt;
            &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;setType&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;OrganizationType&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;forNumber&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;org&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="o"&gt;().&lt;/span&gt;&lt;span class="na"&gt;ordinal&lt;/span&gt;&lt;span class="o"&gt;()));&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;Attr&lt;/span&gt; &lt;span class="n"&gt;attr&lt;/span&gt; &lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="n"&gt;org&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;attributes&lt;/span&gt;&lt;span class="o"&gt;())&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
            &lt;span class="kt"&gt;var&lt;/span&gt; &lt;span class="n"&gt;attribute&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Attribute&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;newBuilder&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt;
                &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;setId&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;attr&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="o"&gt;())&lt;/span&gt;
                &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;setQuantity&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;attr&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;quantity&lt;/span&gt;&lt;span class="o"&gt;())&lt;/span&gt;
                &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;setAmount&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;attr&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;amount&lt;/span&gt;&lt;span class="o"&gt;())&lt;/span&gt;
                &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;setActive&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;attr&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;active&lt;/span&gt;&lt;span class="o"&gt;())&lt;/span&gt;
                &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;setPercent&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;attr&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;percent&lt;/span&gt;&lt;span class="o"&gt;())&lt;/span&gt;
                &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;setSize&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;attr&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;size&lt;/span&gt;&lt;span class="o"&gt;())&lt;/span&gt;
                &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;build&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;
            &lt;span class="n"&gt;organizationBuilder&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;addAttributes&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;attribute&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
        &lt;span class="o"&gt;}&lt;/span&gt;
        &lt;span class="n"&gt;writer&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;write&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;organizationBuilder&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;build&lt;/span&gt;&lt;span class="o"&gt;());&lt;/span&gt;
    &lt;span class="o"&gt;}&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Instead of converting the entire collection of organizations and then writing it, we can convert and persist each &lt;code&gt;Organization&lt;/code&gt; one by one.&lt;/p&gt;

&lt;p&gt;You can find the code on &lt;a href="https://github.com/jerolba/parquet-for-java-posts/blob/master/src/main/java/com/jerolba/parquet/protocol/ToParquetUsingProtocolBuffers.java#L29"&gt;GitHub&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Deserialization
&lt;/h2&gt;

&lt;p&gt;Deserialization is very straightforward if we are willing to work with the classes generated by Protocol Buffers.&lt;/p&gt;

&lt;p&gt;To read the file, we will need to instantiate a Parquet reader:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="nc"&gt;Path&lt;/span&gt; &lt;span class="n"&gt;path&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Path&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;filePath&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
&lt;span class="nc"&gt;InputFile&lt;/span&gt; &lt;span class="n"&gt;inputFile&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;HadoopInputFile&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;fromPath&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Configuration&lt;/span&gt;&lt;span class="o"&gt;());&lt;/span&gt;
&lt;span class="nc"&gt;ParquetReader&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;Organization&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;Builder&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;reader&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt;
    &lt;span class="nc"&gt;ProtoParquetReader&lt;/span&gt;&lt;span class="o"&gt;.&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;Organization&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;Builder&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;builder&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;inputFile&lt;/span&gt;&lt;span class="o"&gt;).&lt;/span&gt;&lt;span class="na"&gt;build&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Parquet defines a class named &lt;code&gt;ParquetReader&amp;lt;T&amp;gt;&lt;/code&gt; and the &lt;code&gt;parquet-protobuf&lt;/code&gt; library extends it by implementing in &lt;code&gt;ProtoParquetReader&lt;/code&gt; the logic of &lt;strong&gt;converting Parquet's internal data structures&lt;/strong&gt; into classes generated by Protocol Buffers.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;InputFile&lt;/code&gt; is Parquet's file abstraction with the capability to read from them.&lt;/p&gt;

&lt;p&gt;Therefore:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;Path&lt;/code&gt;, &lt;code&gt;InputFile&lt;/code&gt;, &lt;code&gt;HadoopInputFile&lt;/code&gt;, and &lt;code&gt;ParquetReader&lt;/code&gt; are classes defined by the Parquet API&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;ProtoParquetReader&lt;/code&gt; implements &lt;code&gt;ParquetReader&lt;/code&gt; and is defined in &lt;code&gt;parquet-protobuf&lt;/code&gt;, a library that encapsulates Parquet with PB&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;Organization&lt;/code&gt; (and &lt;code&gt;Attribute&lt;/code&gt;) are classes generated by the PB utility, not related to Parquet&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The instantiation of the &lt;code&gt;ParquetReader&lt;/code&gt; class is also done with a Builder, although the options to configure are much fewer, as all its configuration is determined by the file we are going to read. The reader does not need to know if the file uses dictionary encoding or if it is compressed, so it is not necessary to configure it; it discovers this by reading the file.&lt;/p&gt;

&lt;p&gt;It's important to note that the data type the reader will return is a &lt;code&gt;Builder&lt;/code&gt; for &lt;code&gt;Organization&lt;/code&gt;, rather than the &lt;code&gt;Organization&lt;/code&gt; itself, and we will need to call the &lt;code&gt;build()&lt;/code&gt; method:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="nc"&gt;Path&lt;/span&gt; &lt;span class="n"&gt;path&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Path&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;filePath&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
&lt;span class="nc"&gt;InputFile&lt;/span&gt; &lt;span class="n"&gt;inputFile&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;HadoopInputFile&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;fromPath&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Configuration&lt;/span&gt;&lt;span class="o"&gt;());&lt;/span&gt;
&lt;span class="k"&gt;try&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;ParquetReader&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;Organization&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;Builder&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;reader&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;ProtoParquetReader&lt;/span&gt;&lt;span class="o"&gt;.&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;Organization&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;Builder&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;builder&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;inputFile&lt;/span&gt;&lt;span class="o"&gt;).&lt;/span&gt;&lt;span class="na"&gt;build&lt;/span&gt;&lt;span class="o"&gt;())&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
    &lt;span class="nc"&gt;List&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;Organization&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;organizations&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;ArrayList&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&amp;gt;();&lt;/span&gt;
    &lt;span class="nc"&gt;Organization&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;Builder&lt;/span&gt; &lt;span class="n"&gt;next&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;null&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt;
    &lt;span class="k"&gt;while&lt;/span&gt; &lt;span class="o"&gt;((&lt;/span&gt;&lt;span class="n"&gt;next&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;reader&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;read&lt;/span&gt;&lt;span class="o"&gt;())&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="kc"&gt;null&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;organizations&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;add&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;next&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;build&lt;/span&gt;&lt;span class="o"&gt;());&lt;/span&gt;
    &lt;span class="o"&gt;}&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;organizations&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If the IDL used to generate the code contained only a subset of the columns existing in the file, when reading it, it would be ignoring all the columns not present in the IDL. This would save us disk reads and data deserialization/decoding.&lt;/p&gt;

&lt;p&gt;You can find the code on &lt;a href="https://github.com/jerolba/parquet-for-java-posts/blob/master/src/main/java/com/jerolba/parquet/protocol/FromParquetUsingProtocolBuffers.java#L21"&gt;GitHub&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Performance
&lt;/h2&gt;

&lt;p&gt;What performance does Parquet with Protocol Buffers offer when serializing and deserializing large volumes of data? To what extent do different compression options affect it? Should we choose compression with Snappy or no compression at all? And what about enabling the dictionary or not?&lt;/p&gt;

&lt;p&gt;Leveraging the analyses I conducted &lt;a href="https://dev.to/jerolba/java-serialization-with-avro-1j91#analysis-and-impressions"&gt;previously&lt;/a&gt; on different serialization formats, we can get an idea of their strengths and weaknesses. The benchmarks were conducted on the same computer, making them comparable.&lt;/p&gt;

&lt;h3&gt;
  
  
  File Size
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;Uncompressed&lt;/th&gt;
&lt;th&gt;Snappy&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Dictionary False&lt;/td&gt;
&lt;td&gt;1 034 MB&lt;/td&gt;
&lt;td&gt;508 MB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Dictionary True&lt;/td&gt;
&lt;td&gt;289 MB&lt;/td&gt;
&lt;td&gt;281 MB&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Given the difference in sizes, we can see that in my synthetic example, the use of dictionaries compresses the information quite effectively, almost better than the Snappy algorithm itself. The decision to enable compression or not will depend on the performance penalty it entails.&lt;/p&gt;

&lt;h3&gt;
  
  
  Serialization Time
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;Uncompressed&lt;/th&gt;
&lt;th&gt;Snappy&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Dictionary False&lt;/td&gt;
&lt;td&gt;14,802 ms&lt;/td&gt;
&lt;td&gt;15,450 ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Dictionary True&lt;/td&gt;
&lt;td&gt;16,018 ms&lt;/td&gt;
&lt;td&gt;16,174 ms&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The time is very similar in all cases, and we can say that the different compression techniques do not significantly affect the time taken.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://dev.to/jerolba/java-serialization-with-avro-1j91#analysis-and-impressions"&gt;Compared with other serialization formats&lt;/a&gt;, it takes between 50% (Jackson) and 300% (Protocol Buffers/Avro) longer, but in return, it achieves files that are between 60% (Protocol Buffers/Avro) and 90% (Jackson) smaller.&lt;/p&gt;

&lt;h3&gt;
  
  
  Deserialization Time
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;Uncompressed&lt;/th&gt;
&lt;th&gt;Snappy&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Dictionary False&lt;/td&gt;
&lt;td&gt;12 ,64 ms&lt;/td&gt;
&lt;td&gt;13,028 ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Dictionary True&lt;/td&gt;
&lt;td&gt;10,492 ms&lt;/td&gt;
&lt;td&gt;11,025 ms&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;In this case, the use of the dictionary has a significant impact on time, saving the effort of decoding information that is repeated. There is definitely no reason to disable this feature.&lt;/p&gt;

&lt;p&gt;Compared to other formats, it is 100% slower than pure Protocol Buffers and 30% slower than pure Avro, but it is almost twice as fast as Jackson.&lt;/p&gt;

&lt;p&gt;To put the performance into perspective, on my laptop, it reads 45,000 &lt;code&gt;Organization&lt;/code&gt;s per second, which in turn contain almost 2.6 million instances of the &lt;code&gt;Attribute&lt;/code&gt; type.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;If you were already familiar with Protocol Buffers, most of the code and particularities related to PB will be recognizable to you. If not, it increases the entry barrier, as you need to learn about two distinct but similar technologies, and it's not always clear which part belongs to each one.&lt;/p&gt;

&lt;p&gt;The biggest change compared to pure Protocol Buffers is the way to create the writer and reader objects, where we have to deal with all the configuration and particularities specific to Parquet.&lt;/p&gt;

&lt;p&gt;Although possible, Protocol Buffers is not intended for serializing large amounts of data, so it's not comparable to Parquet using Protocol Buffers. If I had to choose between using Parquet with PB or Parquet with Avro, I would probably opt for the Avro version, as Avro is often used in the data engineering world and you could benefit more from the experience.&lt;/p&gt;

&lt;p&gt;The data I used in the example is random and results may vary depending on the characteristics of your data. Test with your data to draw conclusions.&lt;/p&gt;

&lt;p&gt;In environments where you write once and read multiple times, the time spent on serialization should not be a determining factor. More important are the file size, transfer time, or processing speed (especially if you can filter columns by making projections).&lt;/p&gt;

&lt;p&gt;Despite using different compression and encoding techniques, the processing speed of files is quite high. Coupled with its ability to work with a typed schema, this makes Parquet a data exchange format to consider in projects with a high volume of data.&lt;/p&gt;

</description>
      <category>parquet</category>
      <category>java</category>
      <category>protocolbuffers</category>
      <category>bigdata</category>
    </item>
    <item>
      <title>Working with Parquet files in Java using Avro</title>
      <dc:creator>Jerónimo López</dc:creator>
      <pubDate>Sun, 26 Nov 2023 11:09:55 +0000</pubDate>
      <link>https://dev.to/jerolba/working-with-parquet-files-in-java-using-avro-134</link>
      <guid>https://dev.to/jerolba/working-with-parquet-files-in-java-using-avro-134</guid>
      <description>&lt;p&gt;In the previous article, I wrote an introduction to using Parquet files in Java, but I did not include any examples. In this article, I will explain how to do this using the Avro library.&lt;/p&gt;

&lt;p&gt;Parquet with Avro &lt;strong&gt;is one of the most popular ways to work with Parquet files in Java&lt;/strong&gt; due to its simplicity, flexibility, and because it is the library with the most examples.&lt;/p&gt;

&lt;p&gt;Both Avro and Parquet allow complex data structures, and there is a mapping between the types of one and the other.&lt;/p&gt;

&lt;p&gt;The post will use the same example I used in previous articles talking about serialization. The code will be very similar to the article about Avro. For specific details about Avro, I refer you to &lt;a href="https://dev.to/jerolba/java-serialization-with-avro-1j91"&gt;that article&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;In the example, we will work with a collection of Organization objects (&lt;code&gt;Org&lt;/code&gt;), which have also a list of Attributes (&lt;code&gt;Attr&lt;/code&gt;):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="n"&gt;record&lt;/span&gt; &lt;span class="nf"&gt;Org&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;String&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;String&lt;/span&gt; &lt;span class="n"&gt;category&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;String&lt;/span&gt; &lt;span class="n"&gt;country&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;Type&lt;/span&gt; &lt;span class="n"&gt;type&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;List&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;Attr&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;attributes&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;

&lt;span class="n"&gt;record&lt;/span&gt; &lt;span class="nf"&gt;Attr&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;String&lt;/span&gt; &lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;byte&lt;/span&gt; &lt;span class="n"&gt;quantity&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;byte&lt;/span&gt; &lt;span class="n"&gt;amount&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;boolean&lt;/span&gt; &lt;span class="n"&gt;active&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;double&lt;/span&gt; &lt;span class="n"&gt;percent&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;short&lt;/span&gt; &lt;span class="n"&gt;size&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;

&lt;span class="kd"&gt;enum&lt;/span&gt; &lt;span class="nc"&gt;Type&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
  &lt;span class="no"&gt;FOO&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="no"&gt;BAR&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="no"&gt;BAZ&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Similar to saving files in Avro format, this version of Parquet with Avro allows writing files using classes generated from the &lt;a href="https://dev.to/jerolba/java-serialization-with-avro-1j91#idl-and-code-generation"&gt;IDL&lt;/a&gt; or the &lt;code&gt;GenericRecord&lt;/code&gt; data structure. This capability is specific to Avro, not Parquet, but is inherited by &lt;code&gt;parquet-avro&lt;/code&gt;, the library that implements this integration.&lt;/p&gt;

&lt;p&gt;Internally, the library transforms the Avro schema into the Parquet schema, so most tools and libraries that know how to work with Avro classes will be able to work indirectly with Parquet with few changes.&lt;/p&gt;

&lt;h2&gt;
  
  
  Using code generation
&lt;/h2&gt;

&lt;p&gt;The only difference when compared to serializing in Avro format lies in the class used for writing or reading files; otherwise, the logic for &lt;a href="https://dev.to/jerolba/java-serialization-with-avro-1j91#idl-and-code-generation"&gt;building the Avro-generated classes&lt;/a&gt; and reading their data remains unchanged.&lt;/p&gt;

&lt;h3&gt;
  
  
  Serialization
&lt;/h3&gt;

&lt;p&gt;We will need to instantiate a Parquet writer that supports the writing of objects created by Avro:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="nc"&gt;Path&lt;/span&gt; &lt;span class="n"&gt;path&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Path&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"/tmp/my_output_file.parquet"&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
&lt;span class="nc"&gt;OutputFile&lt;/span&gt; &lt;span class="n"&gt;outputFile&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;HadoopOutputFile&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;fromPath&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Configuration&lt;/span&gt;&lt;span class="o"&gt;());&lt;/span&gt;
&lt;span class="nc"&gt;ParquetWriter&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;Organization&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;writer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;AvroParquetWriter&lt;/span&gt;&lt;span class="o"&gt;.&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;Organization&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;builder&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;outputFile&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;withSchema&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Organization&lt;/span&gt;&lt;span class="o"&gt;().&lt;/span&gt;&lt;span class="na"&gt;getSchema&lt;/span&gt;&lt;span class="o"&gt;())&lt;/span&gt;
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;withWriteMode&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;Mode&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;OVERWRITE&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;config&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;AvroWriteSupport&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;WRITE_OLD_LIST_STRUCTURE&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"false"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;build&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Parquet defines a class called &lt;code&gt;ParquetWriter&amp;lt;T&amp;gt;&lt;/code&gt; and the &lt;code&gt;parquet-avro&lt;/code&gt; library extends it implementing in &lt;code&gt;AvroParquetWriter&amp;lt;T&amp;gt;&lt;/code&gt; the logic of &lt;strong&gt;converting Avro objects into calls to the Parquet API&lt;/strong&gt;. The object we will serialize is &lt;code&gt;Organization&lt;/code&gt;, which has been generated using the Avro utility and implements the Avro API.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;Path&lt;/code&gt; class is not the one existing in &lt;code&gt;java.nio.file&lt;/code&gt;, but a Hadoop-specific abstraction for referencing file paths. Whereas the &lt;code&gt;OutputFile&lt;/code&gt; class is Parquet's file abstraction with the capability to write to them.&lt;/p&gt;

&lt;p&gt;Therefore:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;Path&lt;/code&gt;, &lt;code&gt;OutputFile&lt;/code&gt;, &lt;code&gt;HadoopOutputFile&lt;/code&gt;, and &lt;code&gt;ParquetWriter&lt;/code&gt; are classes defined by the Parquet API.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;AvroParquetWriter&lt;/code&gt; is a class defined by the &lt;code&gt;parquet-avro&lt;/code&gt; API, a library that encapsulates Parquet with Avro.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;Organization&lt;/code&gt; and &lt;code&gt;Attribute&lt;/code&gt; are classes generated by the Avro utility, not related to Parquet.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The way to create an instance of &lt;code&gt;ParquetWriter&lt;/code&gt; is through a Builder, where you can configure many Parquet-specific parameters or those of the library we are using (Avro). For example:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;withSchema&lt;/code&gt;: schema of the Organization class in Avro, which internally converts to a Parquet schema.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;withCompressionCodec&lt;/code&gt;: compression method to use: SNAPPY, GZIP, LZ4, etc. By default, it doesn't configure any.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;withWriteMode&lt;/code&gt;: by default it is CREATE, so if the file already existed, it would not overwrite it and would throw an exception. To avoid this, use OVERWRITE.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;withValidation&lt;/code&gt;: if we want it to validate the data types passed against the defined schema.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;withBloomFilterEnabled&lt;/code&gt;: if we want to enable the creation of &lt;a href="https://en.wikipedia.org/wiki/Bloom_filter"&gt;bloom filters&lt;/a&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A most generic configuration of both libraries (not defined in the API) can be passed with the &lt;code&gt;config(String property, String value)&lt;/code&gt; method. In this case, we configure it to internally use a &lt;a href="https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#lists"&gt;3-level structure&lt;/a&gt; to represent nested lists.&lt;/p&gt;

&lt;p&gt;Once the &lt;code&gt;ParquetWriter&lt;/code&gt; class is instantiated, the greatest complexity lies in transforming your POJOs into the &lt;code&gt;Organization&lt;/code&gt; classes generated from Avro's IDL. The complete code would be as follows:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="nc"&gt;Path&lt;/span&gt; &lt;span class="n"&gt;path&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Path&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"/tmp/my_output_file.parquet"&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
&lt;span class="nc"&gt;OutputFile&lt;/span&gt; &lt;span class="n"&gt;outputFile&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;HadoopOutputFile&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;fromPath&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Configuration&lt;/span&gt;&lt;span class="o"&gt;());&lt;/span&gt;
&lt;span class="k"&gt;try&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;ParquetWriter&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;Organization&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;writer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;AvroParquetWriter&lt;/span&gt;&lt;span class="o"&gt;.&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;Organization&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;builder&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;outputFile&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;withSchema&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Organization&lt;/span&gt;&lt;span class="o"&gt;().&lt;/span&gt;&lt;span class="na"&gt;getSchema&lt;/span&gt;&lt;span class="o"&gt;())&lt;/span&gt;
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;withWriteMode&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;Mode&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;OVERWRITE&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;config&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;AvroWriteSupport&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;WRITE_OLD_LIST_STRUCTURE&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"false"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;build&lt;/span&gt;&lt;span class="o"&gt;())&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
  &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;var&lt;/span&gt; &lt;span class="n"&gt;org&lt;/span&gt; &lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="n"&gt;organizations&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
    &lt;span class="nc"&gt;List&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;Attribute&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;attrs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;org&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;attributes&lt;/span&gt;&lt;span class="o"&gt;().&lt;/span&gt;&lt;span class="na"&gt;stream&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt;
      &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;map&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;a&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nc"&gt;Attribute&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;newBuilder&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt;
        &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;setId&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="o"&gt;())&lt;/span&gt;
        &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;setQuantity&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;quantity&lt;/span&gt;&lt;span class="o"&gt;())&lt;/span&gt;
        &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;setAmount&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;amount&lt;/span&gt;&lt;span class="o"&gt;())&lt;/span&gt;
        &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;setSize&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;size&lt;/span&gt;&lt;span class="o"&gt;())&lt;/span&gt;
        &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;setPercent&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;percent&lt;/span&gt;&lt;span class="o"&gt;())&lt;/span&gt;
        &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;setActive&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;active&lt;/span&gt;&lt;span class="o"&gt;())&lt;/span&gt;
        &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;build&lt;/span&gt;&lt;span class="o"&gt;())&lt;/span&gt;
      &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;toList&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;
    &lt;span class="nc"&gt;Organization&lt;/span&gt; &lt;span class="n"&gt;organization&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Organization&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;newBuilder&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt;
      &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;setName&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;org&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="o"&gt;())&lt;/span&gt;
      &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;setCategory&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;org&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;category&lt;/span&gt;&lt;span class="o"&gt;())&lt;/span&gt;
      &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;setCountry&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;org&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;country&lt;/span&gt;&lt;span class="o"&gt;())&lt;/span&gt;
      &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;setOrganizationType&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;OrganizationType&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;valueOf&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;org&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="o"&gt;().&lt;/span&gt;&lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="o"&gt;()))&lt;/span&gt;
      &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;setAttributes&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;attrs&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
      &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;build&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;
    &lt;span class="n"&gt;writer&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;write&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;organization&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
  &lt;span class="o"&gt;}&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Instead of converting the entire collection of organizations and then writing it, we can convert and persist each &lt;code&gt;Organization&lt;/code&gt; one by one.&lt;/p&gt;

&lt;p&gt;You can find the code on &lt;a href="https://github.com/jerolba/parquet-for-java-posts/blob/master/src/main/java/com/jerolba/parquet/avro/ToParquetUsingAvroWithGeneratedClasses.java#L26"&gt;GitHub&lt;/a&gt;.&lt;/p&gt;

&lt;h4&gt;
  
  
  Deserialization
&lt;/h4&gt;

&lt;p&gt;Deserialization is very straightforward if we agree to work with the classes generated by Avro.&lt;/p&gt;

&lt;p&gt;To read the file, we will need to instantiate a Parquet reader:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="nc"&gt;Path&lt;/span&gt; &lt;span class="n"&gt;path&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Path&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;filePath&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
&lt;span class="nc"&gt;InputFile&lt;/span&gt; &lt;span class="n"&gt;inputFile&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;HadoopInputFile&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;fromPath&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Configuration&lt;/span&gt;&lt;span class="o"&gt;());&lt;/span&gt;
&lt;span class="nc"&gt;ParquetReader&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;Organization&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;reader&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;AvroParquetReader&lt;/span&gt;&lt;span class="o"&gt;.&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;Organization&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;builder&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;inputFile&lt;/span&gt;&lt;span class="o"&gt;).&lt;/span&gt;&lt;span class="na"&gt;build&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Parquet defines a class called &lt;code&gt;ParquetReader&amp;lt;T&amp;gt;&lt;/code&gt; and the &lt;code&gt;parquet-avro&lt;/code&gt; library extends it by implementing in &lt;code&gt;AvroParquetReader&lt;/code&gt; the logic of &lt;strong&gt;converting Parquet's internal data structures&lt;/strong&gt; into the classes generated by Avro.&lt;/p&gt;

&lt;p&gt;In this case, &lt;code&gt;InputFile&lt;/code&gt; is Parquet's file abstraction with the capability to read from them.&lt;/p&gt;

&lt;p&gt;Therefore:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;Path&lt;/code&gt;, &lt;code&gt;InputFile&lt;/code&gt;, &lt;code&gt;HadoopInputFile&lt;/code&gt;, and &lt;code&gt;ParquetReader&lt;/code&gt; are classes defined by the Parquet API.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;AvroParquetReader&lt;/code&gt; implements &lt;code&gt;ParquetReader&lt;/code&gt; and is defined in &lt;code&gt;parquet-avro&lt;/code&gt;, a library that encapsulates Parquet with Avro.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;Organization&lt;/code&gt; (and &lt;code&gt;Attribute&lt;/code&gt;) are classes generated by the Avro utility, not related to Parquet.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The instantiation of the &lt;code&gt;ParquetReader&lt;/code&gt; class is also done with a Builder, although the options to configure are much fewer, as all its configuration is determined by the file we are going to read. The reader does not need to know if the file uses dictionary encoding or if it is compressed, so it is not necessary to configure it; it discovers this by reading the file.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="nc"&gt;Path&lt;/span&gt; &lt;span class="n"&gt;path&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Path&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;filePath&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
&lt;span class="nc"&gt;InputFile&lt;/span&gt; &lt;span class="n"&gt;inputFile&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;HadoopInputFile&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;fromPath&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Configuration&lt;/span&gt;&lt;span class="o"&gt;());&lt;/span&gt;
&lt;span class="k"&gt;try&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;ParquetReader&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;Organization&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;reader&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;AvroParquetReader&lt;/span&gt;&lt;span class="o"&gt;.&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;Organization&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;builder&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;inputFile&lt;/span&gt;&lt;span class="o"&gt;).&lt;/span&gt;&lt;span class="na"&gt;build&lt;/span&gt;&lt;span class="o"&gt;())&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
    &lt;span class="nc"&gt;List&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;Organization&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;organizations&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;ArrayList&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&amp;gt;();&lt;/span&gt;
    &lt;span class="nc"&gt;Organization&lt;/span&gt; &lt;span class="n"&gt;next&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;null&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt;
    &lt;span class="k"&gt;while&lt;/span&gt; &lt;span class="o"&gt;((&lt;/span&gt;&lt;span class="n"&gt;next&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;reader&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;read&lt;/span&gt;&lt;span class="o"&gt;())&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="kc"&gt;null&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;organizations&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;add&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;next&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
    &lt;span class="o"&gt;}&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;organizations&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If the IDL used to generate the code contains a subset of the attributes persisted in the file, when reading it we would be ignoring all the columns not present in the IDL. This would save us from disk reads and the deserialization/decoding of data.&lt;/p&gt;

&lt;p&gt;You can find the code on &lt;a href="https://github.com/jerolba/parquet-for-java-posts/blob/master/src/main/java/com/jerolba/parquet/avro/FromParquetUsingAvroWithGeneratedClasses.java#L18"&gt;GitHub&lt;/a&gt;.&lt;/p&gt;




&lt;h3&gt;
  
  
  Using GenericRecord
&lt;/h3&gt;

&lt;p&gt;Here, it will not be necessary to generate any code, and we will work with the &lt;code&gt;GenericRecord&lt;/code&gt; class provided by Avro, but the code will be a bit more verbose.&lt;/p&gt;

&lt;h4&gt;
  
  
  Serialization
&lt;/h4&gt;

&lt;p&gt;As we do not have generated files containing the embedded schema, we need to programmatically define the Avro schema we are going to use. The code is the same as in the article about Avro:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="nc"&gt;Schema&lt;/span&gt; &lt;span class="n"&gt;attrSchema&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;SchemaBuilder&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;record&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Attribute"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
  &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;fields&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt;
  &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;requiredString&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"id"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
  &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;requiredInt&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"quantity"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
  &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;requiredInt&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"amount"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
  &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;requiredInt&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"size"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
  &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;requiredDouble&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"percent"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
  &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;requiredBoolean&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"active"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
  &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;endRecord&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;
&lt;span class="kt"&gt;var&lt;/span&gt; &lt;span class="n"&gt;enumSymbols&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Stream&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;of&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;Type&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;values&lt;/span&gt;&lt;span class="o"&gt;()).&lt;/span&gt;&lt;span class="na"&gt;map&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nl"&gt;Type:&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;).&lt;/span&gt;&lt;span class="na"&gt;toArray&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;String&lt;/span&gt;&lt;span class="o"&gt;[]::&lt;/span&gt;&lt;span class="k"&gt;new&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
&lt;span class="nc"&gt;Schema&lt;/span&gt; &lt;span class="n"&gt;orgsSchema&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;SchemaBuilder&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;record&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Organizations"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
  &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;fields&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt;
  &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;requiredString&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"name"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
  &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;requiredString&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"category"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
  &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;requiredString&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"country"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
  &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"organizationType"&lt;/span&gt;&lt;span class="o"&gt;).&lt;/span&gt;&lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="o"&gt;().&lt;/span&gt;&lt;span class="na"&gt;enumeration&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"organizationType"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
  &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;symbols&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;enumSymbols&lt;/span&gt;&lt;span class="o"&gt;).&lt;/span&gt;&lt;span class="na"&gt;noDefault&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt;
  &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"attributes"&lt;/span&gt;&lt;span class="o"&gt;).&lt;/span&gt;&lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="o"&gt;().&lt;/span&gt;&lt;span class="na"&gt;array&lt;/span&gt;&lt;span class="o"&gt;().&lt;/span&gt;&lt;span class="na"&gt;items&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;attrSchema&lt;/span&gt;&lt;span class="o"&gt;).&lt;/span&gt;&lt;span class="na"&gt;noDefault&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt;
  &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;endRecord&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;
&lt;span class="kt"&gt;var&lt;/span&gt; &lt;span class="n"&gt;typeField&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;orgsSchema&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getField&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"organizationType"&lt;/span&gt;&lt;span class="o"&gt;).&lt;/span&gt;&lt;span class="na"&gt;schema&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;
&lt;span class="nc"&gt;EnumMap&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;Type&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;EnumSymbol&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;enums&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;EnumMap&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&amp;gt;(&lt;/span&gt;&lt;span class="nc"&gt;Type&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;class&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
&lt;span class="n"&gt;enums&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;put&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;Type&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;BAR&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;EnumSymbol&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;typeField&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;Type&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;BAR&lt;/span&gt;&lt;span class="o"&gt;));&lt;/span&gt;
&lt;span class="n"&gt;enums&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;put&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;Type&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;BAZ&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;EnumSymbol&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;typeField&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;Type&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;BAZ&lt;/span&gt;&lt;span class="o"&gt;));&lt;/span&gt;
&lt;span class="n"&gt;enums&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;put&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;Type&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;FOO&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;EnumSymbol&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;typeField&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;Type&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;FOO&lt;/span&gt;&lt;span class="o"&gt;));&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Instead of using an &lt;code&gt;AvroParquetWriter&lt;/code&gt; of the type &lt;code&gt;Organization&lt;/code&gt;, we create one of the type &lt;code&gt;GenericRecord&lt;/code&gt; and construct instances of it as if it were a &lt;code&gt;Map&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="nc"&gt;Path&lt;/span&gt; &lt;span class="n"&gt;path&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Path&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;filePath&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
&lt;span class="nc"&gt;OutputFile&lt;/span&gt; &lt;span class="n"&gt;outputFile&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;HadoopOutputFile&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;fromPath&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Configuration&lt;/span&gt;&lt;span class="o"&gt;());&lt;/span&gt;
&lt;span class="k"&gt;try&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;ParquetWriter&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;GenericRecord&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;writer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;AvroParquetWriter&lt;/span&gt;&lt;span class="o"&gt;.&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;GenericRecord&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;builder&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;outputFile&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;withSchema&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;orgsSchema&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;withWriteMode&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;Mode&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;OVERWRITE&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;config&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;AvroWriteSupport&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;WRITE_OLD_LIST_STRUCTURE&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"false"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;build&lt;/span&gt;&lt;span class="o"&gt;())&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
  &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;var&lt;/span&gt; &lt;span class="n"&gt;org&lt;/span&gt; &lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="n"&gt;organizations&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
    &lt;span class="nc"&gt;List&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;GenericRecord&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;attrs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;ArrayList&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&amp;gt;();&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;var&lt;/span&gt; &lt;span class="n"&gt;attr&lt;/span&gt; &lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="n"&gt;org&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;attributes&lt;/span&gt;&lt;span class="o"&gt;())&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
      &lt;span class="nc"&gt;GenericRecord&lt;/span&gt; &lt;span class="n"&gt;attrRecord&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;GenericData&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;Record&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;attrSchema&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
      &lt;span class="n"&gt;attrRecord&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;put&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"id"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;attr&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="o"&gt;());&lt;/span&gt;
      &lt;span class="n"&gt;attrRecord&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;put&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"quantity"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;attr&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;quantity&lt;/span&gt;&lt;span class="o"&gt;());&lt;/span&gt;
      &lt;span class="n"&gt;attrRecord&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;put&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"amount"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;attr&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;amount&lt;/span&gt;&lt;span class="o"&gt;());&lt;/span&gt;
      &lt;span class="n"&gt;attrRecord&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;put&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"size"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;attr&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;size&lt;/span&gt;&lt;span class="o"&gt;());&lt;/span&gt;
      &lt;span class="n"&gt;attrRecord&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;put&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"percent"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;attr&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;percent&lt;/span&gt;&lt;span class="o"&gt;());&lt;/span&gt;
      &lt;span class="n"&gt;attrRecord&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;put&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"active"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;attr&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;active&lt;/span&gt;&lt;span class="o"&gt;());&lt;/span&gt;
      &lt;span class="n"&gt;attrs&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;add&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;attrRecord&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
    &lt;span class="o"&gt;}&lt;/span&gt;
    &lt;span class="nc"&gt;GenericRecord&lt;/span&gt; &lt;span class="n"&gt;orgRecord&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;GenericData&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;Record&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;orgsSchema&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
    &lt;span class="n"&gt;orgRecord&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;put&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"name"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;org&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="o"&gt;());&lt;/span&gt;
    &lt;span class="n"&gt;orgRecord&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;put&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"category"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;org&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;category&lt;/span&gt;&lt;span class="o"&gt;());&lt;/span&gt;
    &lt;span class="n"&gt;orgRecord&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;put&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"country"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;org&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;country&lt;/span&gt;&lt;span class="o"&gt;());&lt;/span&gt;
    &lt;span class="n"&gt;orgRecord&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;put&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"organizationType"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;enums&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;get&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;org&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="o"&gt;()));&lt;/span&gt;
    &lt;span class="n"&gt;orgRecord&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;put&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"attributes"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;attrs&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
    &lt;span class="n"&gt;writer&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;write&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;orgRecord&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
  &lt;span class="o"&gt;}&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can find the code on &lt;a href="https://github.com/jerolba/parquet-for-java-posts/blob/master/src/main/java/com/jerolba/parquet/avro/ToParquetUsingAvroWithGenericRecord.java#L33"&gt;GitHub&lt;/a&gt;.&lt;/p&gt;

&lt;h4&gt;
  
  
  Deserialization
&lt;/h4&gt;

&lt;p&gt;As in the original version of Avro, most of the work consists in converting the &lt;code&gt;GenericRecord&lt;/code&gt; into our data structure. Because it behaves like a &lt;code&gt;Map&lt;/code&gt;, we will have to cast the types of the values:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="nc"&gt;Path&lt;/span&gt; &lt;span class="n"&gt;path&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Path&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;filePath&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
&lt;span class="nc"&gt;InputFile&lt;/span&gt; &lt;span class="n"&gt;inputFile&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;HadoopInputFile&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;fromPath&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Configuration&lt;/span&gt;&lt;span class="o"&gt;());&lt;/span&gt;
&lt;span class="k"&gt;try&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;ParquetReader&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;GenericRecord&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;reader&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;AvroParquetReader&lt;/span&gt;&lt;span class="o"&gt;.&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;GenericRecord&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;builder&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;inputFile&lt;/span&gt;&lt;span class="o"&gt;).&lt;/span&gt;&lt;span class="na"&gt;build&lt;/span&gt;&lt;span class="o"&gt;())&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
  &lt;span class="nc"&gt;List&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;Org&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;organizations&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;ArrayList&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&amp;gt;();&lt;/span&gt;
  &lt;span class="nc"&gt;GenericRecord&lt;/span&gt; &lt;span class="n"&gt;record&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;null&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt;
  &lt;span class="k"&gt;while&lt;/span&gt; &lt;span class="o"&gt;((&lt;/span&gt;&lt;span class="n"&gt;record&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;reader&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;read&lt;/span&gt;&lt;span class="o"&gt;())&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="kc"&gt;null&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
    &lt;span class="nc"&gt;List&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;GenericRecord&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;attrsRecords&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;List&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;GenericRecord&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;)&lt;/span&gt; &lt;span class="n"&gt;record&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;get&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"attributes"&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
    &lt;span class="kt"&gt;var&lt;/span&gt; &lt;span class="n"&gt;attrs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;attrsRecords&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;stream&lt;/span&gt;&lt;span class="o"&gt;().&lt;/span&gt;&lt;span class="na"&gt;map&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;attr&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Attr&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;attr&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;get&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"id"&lt;/span&gt;&lt;span class="o"&gt;).&lt;/span&gt;&lt;span class="na"&gt;toString&lt;/span&gt;&lt;span class="o"&gt;(),&lt;/span&gt;
        &lt;span class="o"&gt;((&lt;/span&gt;&lt;span class="nc"&gt;Integer&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="n"&gt;attr&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;get&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"quantity"&lt;/span&gt;&lt;span class="o"&gt;)).&lt;/span&gt;&lt;span class="na"&gt;byteValue&lt;/span&gt;&lt;span class="o"&gt;(),&lt;/span&gt;
        &lt;span class="o"&gt;((&lt;/span&gt;&lt;span class="nc"&gt;Integer&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="n"&gt;attr&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;get&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"amount"&lt;/span&gt;&lt;span class="o"&gt;)).&lt;/span&gt;&lt;span class="na"&gt;byteValue&lt;/span&gt;&lt;span class="o"&gt;(),&lt;/span&gt;
        &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;boolean&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="n"&gt;attr&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;get&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"active"&lt;/span&gt;&lt;span class="o"&gt;),&lt;/span&gt;
        &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;double&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="n"&gt;attr&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;get&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"percent"&lt;/span&gt;&lt;span class="o"&gt;),&lt;/span&gt;
        &lt;span class="o"&gt;((&lt;/span&gt;&lt;span class="nc"&gt;Integer&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="n"&gt;attr&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;get&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"size"&lt;/span&gt;&lt;span class="o"&gt;)).&lt;/span&gt;&lt;span class="na"&gt;shortValue&lt;/span&gt;&lt;span class="o"&gt;())).&lt;/span&gt;&lt;span class="na"&gt;toList&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;
    &lt;span class="nc"&gt;Utf8&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;Utf8&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="n"&gt;record&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;get&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"name"&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
    &lt;span class="nc"&gt;Utf8&lt;/span&gt; &lt;span class="n"&gt;category&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;Utf8&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="n"&gt;record&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;get&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"category"&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
    &lt;span class="nc"&gt;Utf8&lt;/span&gt; &lt;span class="n"&gt;country&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;Utf8&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="n"&gt;record&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;get&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"country"&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
    &lt;span class="nc"&gt;Type&lt;/span&gt; &lt;span class="n"&gt;type&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Type&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;valueOf&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;record&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;get&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"organizationType"&lt;/span&gt;&lt;span class="o"&gt;).&lt;/span&gt;&lt;span class="na"&gt;toString&lt;/span&gt;&lt;span class="o"&gt;());&lt;/span&gt;
    &lt;span class="n"&gt;organizations&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;add&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Org&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;toString&lt;/span&gt;&lt;span class="o"&gt;(),&lt;/span&gt; &lt;span class="n"&gt;category&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;toString&lt;/span&gt;&lt;span class="o"&gt;(),&lt;/span&gt; &lt;span class="n"&gt;country&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;toString&lt;/span&gt;&lt;span class="o"&gt;(),&lt;/span&gt; &lt;span class="n"&gt;type&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;attrs&lt;/span&gt;&lt;span class="o"&gt;));&lt;/span&gt;
  &lt;span class="o"&gt;}&lt;/span&gt;
  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;organizations&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;As we are using the Avro interface, it maintains its logic that Strings are encoded within the &lt;code&gt;Utf8&lt;/code&gt; class and it will be necessary to extract their values.&lt;/p&gt;

&lt;p&gt;You can find the code on &lt;a href="https://github.com/jerolba/parquet-for-java-posts/blob/master/src/main/java/com/jerolba/parquet/avro/FromParquetUsingAvroWithGenericRecord.java#L23"&gt;GitHub&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;By default, when reading the file, it deserializes all fields of each row because it does not know the schema of what you need to read, and processes everything. &lt;strong&gt;If you want a projection of the fields, you must pass it in the form of an Avro schema&lt;/strong&gt; when creating the &lt;code&gt;ParquetReader&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="nc"&gt;Schema&lt;/span&gt; &lt;span class="n"&gt;projection&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;SchemaBuilder&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;record&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Organizations"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
  &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;fields&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt;
  &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;requiredString&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"name"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
  &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;requiredString&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"category"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
  &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;requiredString&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"country"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
  &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;endRecord&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;
&lt;span class="nc"&gt;Configuration&lt;/span&gt; &lt;span class="n"&gt;configuration&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Configuration&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;
&lt;span class="n"&gt;configuration&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;set&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;AvroReadSupport&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;AVRO_REQUESTED_PROJECTION&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;orgsSchema&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;toString&lt;/span&gt;&lt;span class="o"&gt;());&lt;/span&gt;
&lt;span class="k"&gt;try&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;ParquetReader&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;GenericRecord&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;reader&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;AvroParquetReader&lt;/span&gt;&lt;span class="o"&gt;.&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;GenericRecord&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;builder&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;inputFile&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
  &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;withConf&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;configuration&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
  &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;build&lt;/span&gt;&lt;span class="o"&gt;())&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
&lt;span class="o"&gt;....&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The rest of the process would be the same, but with fewer fields. You can see the entire source code of the example &lt;a href="https://github.com/jerolba/parquet-for-java-posts/blob/master/src/main/java/com/jerolba/parquet/avro/FromParquetUsingAvroWithGenericRecordProjection.java#L24"&gt;here&lt;/a&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  Performance
&lt;/h2&gt;

&lt;p&gt;What performance does Parquet Avro offer when serializing and deserializing a large volume of data? To what extent do the different compression options influence? Do we choose compression with Snappy or no compression at all? And what about activating the dictionary or not?&lt;/p&gt;

&lt;p&gt;Taking advantage of the analyses I did &lt;a href="https://dev.to/jerolba/java-serialization-with-avro-1j91#analysis-and-impressions"&gt;previously&lt;/a&gt; on different serialization formats, we can get an idea of their strengths and weaknesses. The benchmarks were done on the same computer, so they are comparable to give us an idea.&lt;/p&gt;

&lt;h3&gt;
  
  
  File Size
&lt;/h3&gt;

&lt;p&gt;Both using code generation and GenericRecord, the result is the same, as they are different ways of defining the same schema and persisting the same data:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;Uncompressed&lt;/th&gt;
&lt;th&gt;Snappy&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Dictionay False&lt;/td&gt;
&lt;td&gt;1,034 MB&lt;/td&gt;
&lt;td&gt;508 MB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Dictionay True&lt;/td&gt;
&lt;td&gt;289 MB&lt;/td&gt;
&lt;td&gt;281 MB&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Given the difference in sizes, we can see that in my synthetic example, the use of dictionaries compresses the information significantly, even more than the Snappy algorithm itself. The decision to activate compression or not will depend on the performance penalty it entails.&lt;/p&gt;

&lt;h3&gt;
  
  
  Serialization Time
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Using code generation:&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;Uncompressed&lt;/th&gt;
&lt;th&gt;Snappy&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Dictionary False&lt;/td&gt;
&lt;td&gt;14,386 ms&lt;/td&gt;
&lt;td&gt;14,920 ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Dictionary True&lt;/td&gt;
&lt;td&gt;15,110 ms&lt;/td&gt;
&lt;td&gt;15,381 ms&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Using GenericRecord:&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;Uncompressed&lt;/th&gt;
&lt;th&gt;Snappy&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Dictionary False&lt;/td&gt;
&lt;td&gt;15,287 ms&lt;/td&gt;
&lt;td&gt;15,809 ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Dictionary True&lt;/td&gt;
&lt;td&gt;16,119 ms&lt;/td&gt;
&lt;td&gt;16,432 ms&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The time is very similar in all cases, and we can say that the different compression techniques do not significantly affect the time spent.&lt;/p&gt;

&lt;p&gt;There are no notable time differences between generated code and the use of &lt;code&gt;GenericRecord&lt;/code&gt;, so performance should not be a determining factor in choosing a solution.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://dev.to/jerolba/java-serialization-with-avro-1j91#analysis-and-impressions"&gt;Compared to other serialization formats&lt;/a&gt;, it takes between 40% (Jackson) and 300% (Protocol Buffers/Avro) more time, but in return achieves files that are between 70% (Protocol Buffers/Avro) and 90% (Jackson) smaller.&lt;/p&gt;

&lt;h3&gt;
  
  
  Deserialization Time
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Using code generation:&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;Uncompressed&lt;/th&gt;
&lt;th&gt;Snappy&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Dictionary False&lt;/td&gt;
&lt;td&gt;10,722 ms&lt;/td&gt;
&lt;td&gt;10,736 ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Dictionary True&lt;/td&gt;
&lt;td&gt;7,707 ms&lt;/td&gt;
&lt;td&gt;7,665 ms&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Using GenericRecord:&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;Uncompressed&lt;/th&gt;
&lt;th&gt;Snappy&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Dictionary False&lt;/td&gt;
&lt;td&gt;12,089 ms&lt;/td&gt;
&lt;td&gt;11,931 ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Dictionary True&lt;/td&gt;
&lt;td&gt;8,374 ms&lt;/td&gt;
&lt;td&gt;8,451 ms&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;In this case, the use of the dictionary has a significant impact on time, as it saves decoding information that is repeated. There is definitely no reason to disable this functionality.&lt;/p&gt;

&lt;p&gt;If we compare with other formats, it is twice as slow as Protocol Buffers and on par with Avro, but more than twice as fast as Jackson.&lt;/p&gt;

&lt;p&gt;To put the performance into perspective, on my laptop, it reads 50,000 &lt;code&gt;Organization&lt;/code&gt;s per second, which in turn contain almost 3 million instances of type &lt;code&gt;Attribute&lt;/code&gt;, per second!&lt;/p&gt;

&lt;h3&gt;
  
  
  Deserialization Time Using a Projection
&lt;/h3&gt;

&lt;p&gt;What is the performance like if we use a projection and we only read three fields of the Organization object and ignore its collection of attributes?&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;Uncompressed&lt;/th&gt;
&lt;th&gt;Snappy&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Dictionay False&lt;/td&gt;
&lt;td&gt;289 ms&lt;/td&gt;
&lt;td&gt;304 ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Dictionay True&lt;/td&gt;
&lt;td&gt;195 ms&lt;/td&gt;
&lt;td&gt;203 ms&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;We confirm the promise that if we access a subset of columns, we will read and decode much less information. In this case, &lt;strong&gt;it takes only 2.5% of the time&lt;/strong&gt;, or in other words, &lt;strong&gt;it is 40 times faster&lt;/strong&gt; at processing the same file.&lt;/p&gt;

&lt;p&gt;This is where Parquet shows its full power, by allowing to read and decode a subset of data quickly, taking advantage of how the data is distributed in the file.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;If you are already using Avro or are familiar with it, most of the code and nuances related to Avro will be familiar to you. If you are not, it increases the entry barrier, as you have to learn about two different technologies, and it may not be clear what corresponds to each.&lt;/p&gt;

&lt;p&gt;The major change compared to using only Avro is how the writer and reader objects are created, where we will have to deal with all the configuration and particularities specific to Parquet.&lt;/p&gt;

&lt;p&gt;If I had to choose between using only Avro or Parquet with Avro, I would choose the latter, as it produces more compact files and we have the opportunity to take advantage of the columnar format.&lt;/p&gt;

&lt;p&gt;The data I have used in the example are synthetic, and the results may vary depending on the characteristics of your data. I recommend doing tests, but unless all your values are very random, the compression rates will be high.&lt;/p&gt;

&lt;p&gt;In environments where you write once and read multiple times, the time spent serializing should not be decisive. More important, for example, are the consumption of your storage, the file transfer time, or the processing speed (especially if you can filter the columns you access).&lt;/p&gt;

&lt;p&gt;Despite using different compression and encoding techniques, the file processing time is quite fast. Along with its ability to work with a typed schema, this makes it a &lt;strong&gt;data interchange format to be considered in projects with a heavy load of data&lt;/strong&gt;.&lt;/p&gt;

</description>
      <category>parquet</category>
      <category>java</category>
      <category>avro</category>
      <category>bigdata</category>
    </item>
    <item>
      <title>Working with Parquet files in Java</title>
      <dc:creator>Jerónimo López</dc:creator>
      <pubDate>Tue, 21 Nov 2023 06:00:00 +0000</pubDate>
      <link>https://dev.to/jerolba/working-with-parquet-files-in-java-3f1j</link>
      <guid>https://dev.to/jerolba/working-with-parquet-files-in-java-3f1j</guid>
      <description>&lt;p&gt;Parquet is a widely used format in the Data Engineering realm and holds significant potential for traditional Backend applications. This article serves as an &lt;strong&gt;introduction to the format&lt;/strong&gt;, including some of the unique challenges I've faced while using it, to spare you from similar experiences.&lt;/p&gt;

&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://parquet.apache.org/"&gt;Apache Parquet&lt;/a&gt;, released by &lt;a href="https://blog.twitter.com/engineering/en_us/a/2013/announcing-parquet-10-columnar-storage-for-hadoop"&gt;Twitter&lt;/a&gt; and &lt;a href="https://web.archive.org/web/20130504133255/http://blog.cloudera.com/blog/2013/03/introducing-parquet-columnar-storage-for-apache-hadoop/"&gt;Cloudera&lt;/a&gt; in 2013, is an efficient and &lt;strong&gt;general-purpose columnar&lt;/strong&gt; file format for the Apache Hadoop ecosystem. Inspired by Google's paper &lt;a href="https://research.google/pubs/pub36632/"&gt;"Dremel: Interactive Analysis of Web-Scale Datasets"&lt;/a&gt;, Parquet is optimized to support complex and nested data structures.&lt;/p&gt;

&lt;p&gt;Although it emerged almost simultaneously with &lt;a href="https://es.wikipedia.org/wiki/Apache_ORC"&gt;ORC&lt;/a&gt; from Hortonworks and Facebook, it appears that Parquet has become the most used format.&lt;/p&gt;

&lt;p&gt;Unlike row-oriented formats, Parquet organizes data by columns, enabling more efficient data persistence through advanced encoding and compression techniques.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Logical table&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbfrsln6h7pu2ly14g8ho.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbfrsln6h7pu2ly14g8ho.png" alt="Logical table structure" width="354" height="271"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Row Oriented storage:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0irtko1wlqvm59kr73k3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0irtko1wlqvm59kr73k3.png" alt="Row oriented structure" width="800" height="112"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Column Oriented Storage:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjaowudxgev1f3ds32xrm.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjaowudxgev1f3ds32xrm.png" alt="Column oriented structure" width="800" height="111"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;When it comes to reading time, Parquet's columnar data organization proves beneficial. If there's no need to access every column, this arrangement allows you to bypass reading and processing numerous data blocks, significantly enhancing efficiency.&lt;/p&gt;

&lt;p&gt;&lt;iframe class="tweet-embed" id="tweet-1725849122823983419-35" src="https://platform.twitter.com/embed/Tweet.html?id=1725849122823983419"&gt;
&lt;/iframe&gt;

  // Detect dark theme
  var iframe = document.getElementById('tweet-1725849122823983419-35');
  if (document.body.className.includes('dark-theme')) {
    iframe.src = "https://platform.twitter.com/embed/Tweet.html?id=1725849122823983419&amp;amp;theme=dark"
  }



 &lt;/p&gt;

&lt;p&gt;Columnar data formats are frequently employed in analytical database systems (such as &lt;a href="https://cassandra.apache.org/_/index.html"&gt;Cassandra&lt;/a&gt;, &lt;a href="https://cloud.google.com/bigquery"&gt;BigQuery&lt;/a&gt;, &lt;a href="https://clickhouse.com/"&gt;ClickHouse&lt;/a&gt;, and &lt;a href="https://questdb.io/"&gt;QuestDB&lt;/a&gt;) and in large dataset processing systems like &lt;a href="https://arrow.apache.org/"&gt;Apache Arrow&lt;/a&gt; and &lt;a href="https://orc.apache.org/"&gt;ORC&lt;/a&gt;.&lt;/p&gt;
&lt;h2&gt;
  
  
  Format
&lt;/h2&gt;

&lt;p&gt;Data in Parquet is stored in binary form, making it unreadable when printed on the console, unlike other text-based formats like JSON, XML, or CSV. The binary storage format of Parquet offers benefits in terms of efficiency and performance, but lacks the human-readable aspect inherent to text-based data formats.&lt;/p&gt;

&lt;p&gt;Given that Parquet is designed to handle vast volumes of data, it is practical to include &lt;strong&gt;the schema information directly within the files&lt;/strong&gt;, as well as additional statistical metadata. This feature is particularly advantageous for working with files when their schema is not previously known. Each Parquet file inherently contains all the necessary details to deduce its schema, enabling seamless data reading and analysis, even when dealing with unknown or variable data structures.&lt;/p&gt;

&lt;p&gt;Parquet supports &lt;a href="https://github.com/apache/parquet-format/tree/master#types"&gt;basic data types&lt;/a&gt;, with the ability to extend them through logical types, giving them their own semantics:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;BOOLEAN&lt;/strong&gt;: 1-bit boolean (boolean in Java)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;INT32&lt;/strong&gt;: 32-bit signed integer (int in Java)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;INT64&lt;/strong&gt;: 64-bit signed integer (long in Java)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;INT96&lt;/strong&gt;: 96-bit signed integer (no direct equivalent in Java)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;FLOAT&lt;/strong&gt;: 32-bit IEEE floating-point value (float in Java)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;DOUBLE&lt;/strong&gt;: 64-bit IEEE floating-point value (double in Java)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;BYTE_ARRAY&lt;/strong&gt;: byte array of indeterminate size&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;FIXED_LEN_BYTE_ARRAY&lt;/strong&gt;: fixed-size byte array&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;More complex data types such as Strings, Enums, UUIDs, and different kinds of date formats can be efficiently constructed using these foundational types. This ability demonstrates the flexibility of Parquet in accommodating a wide range of data structures by building upon basic data types.&lt;/p&gt;

&lt;p&gt;The format supports persisting &lt;strong&gt;complex data structures, &lt;a href="https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#nested-types"&gt;lists, and maps&lt;/a&gt;&lt;/strong&gt; in a nested manner, which opens the door to storing any type of data that can be structured as a Document.&lt;/p&gt;

&lt;p&gt;Historically, collections can be represented internally in multiple ways, depending on how you want to handle the possibility of a collection being null or empty. Additionally, as an implementation detail, how to name each element of the collection can vary between implementations of the format. These variations have led to different utilities in various languages generating files with differences that make them incompatible.&lt;/p&gt;

&lt;p&gt;The &lt;a href="https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#lists"&gt;official representation of collections&lt;/a&gt; in Parquet is now defined, yet many tools still generate files with legacy formats to preserve backward compatibility. This requires an explicit configuration to ensure files are written in the  standardized format. For instance, when using Pandas with &lt;a href="https://arrow.apache.org/docs/python/generated/pyarrow.parquet.ParquetWriter.html#pyarrow.parquet.ParquetWriter"&gt;PyArrow&lt;/a&gt;, the &lt;code&gt;use_compliant_nested_type&lt;/code&gt; option must be enabled, while in Java Avro Parquet, the &lt;code&gt;WRITE_OLD_LIST_STRUCTURE_DEFAULT&lt;/code&gt; flag should be disabled.&lt;/p&gt;

&lt;p&gt;While Parquet utilizes an &lt;a href="https://en.wikipedia.org/wiki/Interface_description_language"&gt;IDL&lt;/a&gt; to define the data format within a file (its schema), &lt;strong&gt;it lacks a direct, standard tool for generating Java code from an IDL&lt;/strong&gt; to facilitate the serialization and deserialization of Parquet data in simple Java classes.&lt;/p&gt;

&lt;p&gt;Parquet inherently includes data compression within its encoding process. It automatically applies techniques like Run-length encoding (RLE) or Bit Packing. Additionally, it offers the option to compress data blocks with compressors such as Snappy, GZip, or LZ4, and to use Dictionaries for the normalization of repeated values.&lt;/p&gt;

&lt;p&gt;Snappy compression is typically employed as the default, due to its favorable balance between compression efficiency and CPU time consumption.&lt;/p&gt;

&lt;p&gt;When serializing or deserializing large amounts of data, Parquet allows us to &lt;strong&gt;write or read records one at a time&lt;/strong&gt;, obviating the need for retaining all data in memory, unlike Protocol Buffers or FlatBuffers. You can serialize a stream of records or iterate through a file reading records one by one.&lt;/p&gt;
&lt;h2&gt;
  
  
  Documentation
&lt;/h2&gt;

&lt;p&gt;Despite the format's significance in the world of Data Engineering, &lt;strong&gt;documentation on its basic use is quite scarce&lt;/strong&gt;, particularly in the context of Java.&lt;/p&gt;

&lt;p&gt;How would you feel if, to learn about how to read or write JSON files, you had to go through Pandas or Spark and it wasn't straightforward to do it directly? That's the sensation you get when you start studying Parquet.&lt;/p&gt;

&lt;p&gt;High-level tools commonly used by Data Engineers, such as Pandas and Spark, offer built-in methods to seamlessly export and import data to Parquet (and other formats), simplifying the process by hiding the underlying complexities. However, finding documentation and examples for using Parquet independently of these tools is challenging, as the information is scattered across multiple articles written by different individuals.&lt;/p&gt;

&lt;p&gt;What would you say if, to read or write JSON files, you had to go through other tools/formats such as Avro or Protocol Buffers, and there was no library that supported it directly? This is the scenario encountered with Parquet.&lt;/p&gt;

&lt;p&gt;The absence of a straightforward library specifically for Parquet files, necessitating the use of third-party libraries for other format serializations, &lt;strong&gt;makes it more challenging to become proficient with Parquet&lt;/strong&gt;.&lt;/p&gt;
&lt;h2&gt;
  
  
  Libraries
&lt;/h2&gt;

&lt;p&gt;The Parquet library in Java does not offer a direct way to read or write Parquet files. Just as the Jackson library handles JSON files or the Protocol Buffers library works with its own format, Parquet does not include a function to read or write Java Objects (POJOs) or Parquet-specific data structures.&lt;/p&gt;

&lt;p&gt;When working with Parquet in Java, there are two main approaches:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Using the low-level API provided by the Parquet library (this would be equivalent to processing the tokens of a JSON or XML parser).&lt;/li&gt;
&lt;li&gt;Employing the functionalities of other serialization libraries, like Avro or Protocol Buffers.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Among the libraries that make up the &lt;a href="https://github.com/apache/parquet-mr"&gt;Apache Parquet project in Java&lt;/a&gt;, there are specific libraries that use Protocol Buffers or Avro classes and interfaces for reading and writing Parquet files. These libraries employ the low-level API of &lt;code&gt;parquet-mr&lt;/code&gt; to convert objects of Avro or Protocol Buffers type into Parquet files and vice versa.&lt;/p&gt;

&lt;p&gt;To summarize, while working with Parquet in Java, you'll engage with three types of classes, each associated with three different APIs:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The API of your chosen serialization library, which provides the way to define serialized Objects and interact with them.&lt;/li&gt;
&lt;li&gt;The API of the wrapper library for the serialization library you have chosen, with the readers and writers for these Objects.&lt;/li&gt;
&lt;li&gt;The API of the &lt;code&gt;parquet-mr&lt;/code&gt; low-level library itself that defines common interfaces and configurations, and handles the actual serialization process.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The most user-friendly and flexible library, often encountered in internet examples, is &lt;a href="https://github.com/apache/parquet-mr/tree/master/parquet-avro"&gt;Avro&lt;/a&gt;. However, &lt;a href="https://github.com/apache/parquet-mr/tree/master/parquet-protobuf"&gt;Protocol Buffers&lt;/a&gt; is also a viable option.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The use of utilities from these formats does not imply that the information is serialized twice&lt;/strong&gt;, first through an intermediate format. Instead, it involves reusing the classes generated by Avro or Protocol Buffers, which hold the data to be stored. Each wrapper implementation uses the Parquet MR API.&lt;/p&gt;
&lt;h2&gt;
  
  
  Abstraction Over Files
&lt;/h2&gt;

&lt;p&gt;The Parquet library is agnostic to the location of the data - it could be on a local file system, within a Hadoop cluster, or stored in S3.&lt;/p&gt;

&lt;p&gt;To provide an abstraction layer for file locations, Parquet defines the interfaces &lt;a href="https://javadoc.io/doc/org.apache.parquet/parquet-common/latest/org/apache/parquet/io/OutputFile.html"&gt;&lt;code&gt;org.apache.parquet.io.OutputFile&lt;/code&gt;&lt;/a&gt; and &lt;a href="https://javadoc.io/doc/org.apache.parquet/parquet-common/latest/org/apache/parquet/io/InputFile.html"&gt;&lt;code&gt;org.apache.parquet.io.InputFile&lt;/code&gt;&lt;/a&gt;. These interfaces contain methods to create specialized types of Output and Input Streams for data management.&lt;/p&gt;

&lt;p&gt;For those interfaces, it provides an implementation responsible for implementing access to files in Hadoop, SFTP, or local files:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://www.javadoc.io/doc/org.apache.parquet/parquet-hadoop/latest/org/apache/parquet/hadoop/util/HadoopOutputFile.html"&gt;&lt;code&gt;org.apache.parquet.hadoop.util.HadoopOutputFile&lt;/code&gt;&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.javadoc.io/doc/org.apache.parquet/parquet-hadoop/latest/org/apache/parquet/hadoop/util/HadoopInputFile.html"&gt;&lt;code&gt;org.apache.parquet.hadoop.util.HadoopInputFile&lt;/code&gt;&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These implementations, in turn, require to reference files using the &lt;a href="https://hadoop.apache.org/docs/stable/api/org/apache/hadoop/fs/Path.html"&gt;&lt;code&gt;org.apache.hadoop.fs.Path&lt;/code&gt;&lt;/a&gt; class, which is unrelated to the Java &lt;a href="https://docs.oracle.com/javase/8/docs/api/java/nio/file/Path.html"&gt;&lt;code&gt;Path&lt;/code&gt;&lt;/a&gt; class, along with an &lt;a href="https://hadoop.apache.org/docs/stable/api/org/apache/hadoop/conf/Configuration.html"&gt;&lt;code&gt;org.apache.hadoop.conf.Configuration&lt;/code&gt;&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;To reference a file, we would need to write code like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="nc"&gt;Path&lt;/span&gt; &lt;span class="n"&gt;path&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Path&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"/tmp/my_file.parquet"&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
&lt;span class="nc"&gt;OutputFile&lt;/span&gt; &lt;span class="n"&gt;outputFile&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;HadoopOutputFile&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;fromPath&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Configuration&lt;/span&gt;&lt;span class="o"&gt;());&lt;/span&gt;
&lt;span class="nc"&gt;InputFile&lt;/span&gt; &lt;span class="n"&gt;inputFile&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;HadoopInputFile&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;fromPath&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Configuration&lt;/span&gt;&lt;span class="o"&gt;());&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;But fortunately, this won't last long, as a second implementation (that allows working with local files only) has recently been &lt;a href="https://github.com/apache/parquet-mr/pull/1111"&gt;merged to master&lt;/a&gt;, and &lt;a href="https://github.com/apache/parquet-mr/pull/1141"&gt;it is being decoupled&lt;/a&gt; from the Hadoop configuration class. No release has yet included this, but it is expected to be available in version 1.14.0.&lt;/p&gt;

&lt;h2&gt;
  
  
  Dependencies
&lt;/h2&gt;

&lt;p&gt;One of the major drawbacks of using Parquet in Java is the &lt;strong&gt;large number of transitive dependencies that its libraries have&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Parquet was conceived to be used in conjunction with Hadoop. The &lt;a href="https://github.com/apache/parquet-mr"&gt;project on GitHub&lt;/a&gt; for the Java implementation is named &lt;code&gt;parquet-mr&lt;/code&gt;, and &lt;code&gt;mr&lt;/code&gt; stands for Map Reduce. As you can see, the package of the file classes refers to &lt;code&gt;hadoop&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Over time, it has evolved and become more independent, but it has not been able to completely decouple from Hadoop and still has many transitive dependencies from the libraries used by Hadoop (ranging from a Jetty server to a Kerberos or Yarn clients).&lt;/p&gt;

&lt;p&gt;If your project is going to use Hadoop, all those dependencies will be necessary, but if you intend to use standard files outside of Hadoop, it makes your application heavier. &lt;strong&gt;I suggest excluding those transitive dependencies in your pom.xml or build.gradle&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;In addition to including a lot of unnecessary code, this can pose a problem when resolving conflicts with versions of transitive dependencies that you are also using.&lt;/p&gt;

&lt;p&gt;If you don't exclude any dependencies, you might end up with more than 130 JARs and 75 MB in your deployable artifact. However, by selectively excluding unused dependencies, I've been able to trim this down to just 30 JARs, with a total size of 23 to 29MB.&lt;/p&gt;

&lt;p&gt;As previously noted, efforts are underway to address this, but it's not all ready yet.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;The Parquet format stands as a crucial tool in the Data Engineering ecosystem, providing an efficient solution for the storage and processing of large volumes of data.&lt;/p&gt;

&lt;p&gt;Although its adoption has been solid in Big Data environments, its potential transcends this field and &lt;strong&gt;can also be leveraged in the world of traditional Backend&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Its column-oriented design, combined with advanced data compression and complex data structuring capabilities, make it a &lt;strong&gt;robust option for those seeking to enhance performance and efficiency in handling data within traditional Backend environments&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Despite its indisputable advantages, the adoption of Parquet as a data interchange format in Java application development faces obstacles, primarily due to the complexity of its low-level API and the lack of a high-level interface that simplifies its use. The need to rely on third-party libraries adds an additional layer of complexity and dependencies, and the scarcity of accessible documentation and concrete examples poses a significant barrier for many developers.&lt;/p&gt;

&lt;p&gt;This post has been an introduction to the format, its advantages, and the &lt;em&gt;WTFs&lt;/em&gt; I've encountered along the way; don't be discouraged. With this foundational knowledge, &lt;strong&gt;the forthcoming posts will focus on how to work with Parquet using different libraries&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://dev.to/jerolba/working-with-parquet-files-in-java-using-avro-134"&gt;Working with Parquet files in Java using Avro&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://dev.to/jerolba/working-with-parquet-files-in-java-using-protocol-buffers-fdf"&gt;Working with Parquet files in Java using Protocol Buffers&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://dev.to/jerolba/working-with-parquet-files-in-java-using-carpet-lmk"&gt;Working with Parquet files in Java using Carpet&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>parquet</category>
      <category>java</category>
      <category>bigdata</category>
      <category>dataengineering</category>
    </item>
    <item>
      <title>Java serialization with Avro</title>
      <dc:creator>Jerónimo López</dc:creator>
      <pubDate>Mon, 05 Dec 2022 14:23:58 +0000</pubDate>
      <link>https://dev.to/jerolba/java-serialization-with-avro-1j91</link>
      <guid>https://dev.to/jerolba/java-serialization-with-avro-1j91</guid>
      <description>&lt;p&gt;In previous posts I've analyzed &lt;a href="https://dev.to/jerolba/java-serialization-with-protocol-buffers-4pbj"&gt;Protocol Buffers&lt;/a&gt; and &lt;a href="https://dev.to/jerolba/java-serialization-with-flatbuffers-45nl"&gt;FlatBuffers&lt;/a&gt;, using JSON as the baseline. In this post, I will analyze Apache Avro and compare it with the previously studied formats.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://avro.apache.org/docs/current/" rel="noopener noreferrer"&gt;Apache Avro&lt;/a&gt; was developed as a component of the Apache Hadoop project and was released in 2009 under the Apache 2.0 license.&lt;/p&gt;

&lt;p&gt;Avro should not be confused with Parquet. Often when searching for documentation on Parquet, you end up reading about Avro, confusing them. &lt;strong&gt;Although the Avro library is useful for generating Parquet files and both are widely used in the world of Big Data, the formats have no relation to each other&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;As Protocol Buffers and Flat Buffers, you can predefine a schema (in JSON), and generates binary content that is not human-readable. It also has support for &lt;a href="https://avro.apache.org/docs/1.11.1/#comparison-with-other-systems" rel="noopener noreferrer"&gt;multiple languages&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;It is not only used to persist large batches of data. For example, Avro is widely used in &lt;a href="https://www.confluent.io/blog/avro-kafka-data/" rel="noopener noreferrer"&gt;Kafka&lt;/a&gt; to serialize the information in its messages.&lt;/p&gt;

&lt;p&gt;Because the format or schema can be included in the serialized data, code generation is optional, which makes it easy to build systems that process their files generically. It provides the flexibility of JSON, but with a more compact and efficient format.&lt;/p&gt;

&lt;p&gt;I'm going to explore both approaches: code generation from the schema and generated programmatically.&lt;/p&gt;




&lt;h3&gt;
  
  
  IDL and code generation
&lt;/h3&gt;

&lt;p&gt;The file with the equivalent schema to the example in the previous articles would be &lt;a href="https://github.com/jerolba/xbuffers-article/blob/master/src/main/resources/organizations.avsc" rel="noopener noreferrer"&gt;this one&lt;/a&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Organization"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"record"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"namespace"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"com.jerolba.xbuffers.avro"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"fields"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"string"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt; 
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"category"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"string"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt; 
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"organizationType"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"enum"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"OrganizationType"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"symbols"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"FOO"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"BAR"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"BAZ"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt; 
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"country"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"string"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt; 
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"attributes"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"array"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"items"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"record"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Attribute"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="nl"&gt;"fields"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"id"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"string"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt; 
            &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"quantity"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"int"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt; 
            &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"amount"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"int"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt; 
            &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"size"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"int"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt; 
            &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"percent"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"double"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt; 
            &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"active"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"boolean"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To generate all Java classes you need to download &lt;a href="https://downloads.apache.org/avro/stable/java/avro-tools-1.11.1.jar" rel="noopener noreferrer"&gt;avro-tools&lt;/a&gt;, &lt;br&gt;
and run it with parameters that reference where the IDL file is located and the target path of the generated files:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt; java &lt;span class="nt"&gt;-jar&lt;/span&gt; avro-tools-1.11.1.jar compile schema ./src/main/resources/organizations.avsc ./src/main/java/
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Or directly using Docker with an image ready to execute the command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker run &lt;span class="nt"&gt;--rm&lt;/span&gt; &lt;span class="nt"&gt;-v&lt;/span&gt; &lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;pwd&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;/src:/avro/src kpnnv/avro-tools:1.11.1 compile schema /avro/src/main/resources/organizations.avsc /avro/src/main/java/
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h3&gt;
  
  
  Using generated classes
&lt;/h3&gt;

&lt;h4&gt;
  
  
  Serialization
&lt;/h4&gt;

&lt;p&gt;Like Protocol Buffers, Avro does not directly serialize your POJOs, and you need to copy the information to the objects generated by the schema compiler.&lt;/p&gt;

&lt;p&gt;But in this case, if you are persisting a collection of objects you don't need to have all them in memory because you can serialize objects one by one as you instantiate them.&lt;/p&gt;

&lt;p&gt;The code needed to serialize the information starting from the POJOs would look like &lt;a href="https://github.com/jerolba/xbuffers-article/blob/master/src/main/java/com/jerolba/xbuffers/ToAvroWithGeneratedClasses.java#L37" rel="noopener noreferrer"&gt;this&lt;/a&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="nc"&gt;DatumWriter&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;Organization&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;datumWriter&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;SpecificDatumWriter&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&amp;gt;(&lt;/span&gt;&lt;span class="nc"&gt;Organization&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;class&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
&lt;span class="kt"&gt;var&lt;/span&gt; &lt;span class="n"&gt;dataFileWriter&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;DataFileWriter&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&amp;gt;(&lt;/span&gt;&lt;span class="n"&gt;datumWriter&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
&lt;span class="k"&gt;try&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;var&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;FileOutputStream&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"/tmp/organizations.avro"&lt;/span&gt;&lt;span class="o"&gt;))&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
  &lt;span class="n"&gt;dataFileWriter&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;create&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Organization&lt;/span&gt;&lt;span class="o"&gt;().&lt;/span&gt;&lt;span class="na"&gt;getSchema&lt;/span&gt;&lt;span class="o"&gt;(),&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
  &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;var&lt;/span&gt; &lt;span class="n"&gt;org&lt;/span&gt; &lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="n"&gt;organizations&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
    &lt;span class="nc"&gt;List&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;Attribute&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;attrs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;org&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;attributes&lt;/span&gt;&lt;span class="o"&gt;().&lt;/span&gt;&lt;span class="na"&gt;stream&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt;
      &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;map&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;a&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nc"&gt;Attribute&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;newBuilder&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt;
        &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;setId&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="o"&gt;())&lt;/span&gt;
        &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;setQuantity&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;quantity&lt;/span&gt;&lt;span class="o"&gt;())&lt;/span&gt;
        &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;setAmount&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;amount&lt;/span&gt;&lt;span class="o"&gt;())&lt;/span&gt;
        &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;setSize&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;size&lt;/span&gt;&lt;span class="o"&gt;())&lt;/span&gt;
        &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;setPercent&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;percent&lt;/span&gt;&lt;span class="o"&gt;())&lt;/span&gt;
        &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;setActive&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;active&lt;/span&gt;&lt;span class="o"&gt;())&lt;/span&gt;
      &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;build&lt;/span&gt;&lt;span class="o"&gt;())&lt;/span&gt;
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;toList&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;
    &lt;span class="nc"&gt;Organization&lt;/span&gt; &lt;span class="n"&gt;organization&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Organization&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;newBuilder&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt;
      &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;setName&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;org&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="o"&gt;())&lt;/span&gt;
      &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;setCategory&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;org&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;category&lt;/span&gt;&lt;span class="o"&gt;())&lt;/span&gt;
      &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;setCountry&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;org&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;country&lt;/span&gt;&lt;span class="o"&gt;())&lt;/span&gt;
      &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;setOrganizationType&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt; &lt;span class="nc"&gt;OrganizationType&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;valueOf&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;org&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="o"&gt;().&lt;/span&gt;&lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="o"&gt;()))&lt;/span&gt;
      &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;setAttributes&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;attrs&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
      &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;build&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;
    &lt;span class="n"&gt;dataFileWriter&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;append&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;organization&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
  &lt;span class="o"&gt;}&lt;/span&gt;
  &lt;span class="n"&gt;dataFileWriter&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;close&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Instead of converting the whole collection and then persisting it, we can convert and persist each &lt;code&gt;Organization&lt;/code&gt; one by one.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Serialization time: 5,409 ms&lt;/li&gt;
&lt;li&gt;File size: 846 MB&lt;/li&gt;
&lt;li&gt;Compressed file size: 530 MB&lt;/li&gt;
&lt;li&gt;Memory required: because we are directly serializing to &lt;code&gt;OutputStream&lt;/code&gt;, it does not consume anything other than the necessary internal IO buffers (and original objects)&lt;/li&gt;
&lt;li&gt;Library size (avro-1.11.1.jar + dependencies): 3,552,326 bytes&lt;/li&gt;
&lt;li&gt;Size of generated classes: 37,283 bytes&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  Deserialization
&lt;/h4&gt;

&lt;p&gt;Due to the internal representation of the data, Avro needs to move around the file parsing the data and you need to provide a seekable InputStream or directly a &lt;code&gt;File&lt;/code&gt;. For example, you can not use directly an &lt;code&gt;InputStream&lt;/code&gt; from an HTTP request.&lt;/p&gt;

&lt;p&gt;With &lt;a href="https://github.com/jerolba/xbuffers-article/blob/master/src/main/java/com/jerolba/xbuffers/FromAvroWithGeneratedClasses.java#L31" rel="noopener noreferrer"&gt;few lines&lt;/a&gt; you can read and process the whole object graph:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="nc"&gt;File&lt;/span&gt; &lt;span class="n"&gt;file&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;File&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"/tmp/organizations.avro"&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
&lt;span class="nc"&gt;DatumReader&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;Organization&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;datumReader&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;SpecificDatumReader&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&amp;gt;(&lt;/span&gt;&lt;span class="nc"&gt;Organization&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;class&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
&lt;span class="nc"&gt;List&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;Organization&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;organizations&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;ArrayList&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&amp;gt;();&lt;/span&gt;
&lt;span class="k"&gt;try&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;var&lt;/span&gt; &lt;span class="n"&gt;dataFileReader&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;DataFileReader&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&amp;gt;(&lt;/span&gt;&lt;span class="n"&gt;file&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;datumReader&lt;/span&gt;&lt;span class="o"&gt;))&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
  &lt;span class="k"&gt;while&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dataFileReader&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;hasNext&lt;/span&gt;&lt;span class="o"&gt;())&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;organizations&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;add&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dataFileReader&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;next&lt;/span&gt;&lt;span class="o"&gt;());&lt;/span&gt;
  &lt;span class="o"&gt;}&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The objects are instances of the classes generated from the schema, not the original records. Because we are iterating a reader, we can transform each instance into our representation if necessary, without having both representations repeated in memory.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Deserialization time: 8,197 ms&lt;/li&gt;
&lt;li&gt;Memory required: reconstructing all the object structures defined by the schema takes up 2,520 MB&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;
  
  
  Using Generic Record
&lt;/h3&gt;

&lt;p&gt;In Avro, you can have the schema &lt;a href="https://avro.apache.org/docs/1.11.1/specification/_print/#object-container-files" rel="noopener noreferrer"&gt;embedded in the binary file&lt;/a&gt;, and it allows you to read a serialized record without needing to know or agree on the schema in advance.  This enables us to deserialize and know the contents of any file, without requiring code generation&lt;/p&gt;

&lt;p&gt;You can define the schema at runtime and decide, based on your needs, the fields and structure of the serialized information.&lt;/p&gt;

&lt;h4&gt;
  
  
  Serialization
&lt;/h4&gt;

&lt;p&gt;Instead of copying the data to the generated classes, you can do it through an Avro &lt;code&gt;GenericRecord&lt;/code&gt;, which behaves like a Map. First, you need to define the Avro schema using code (or load it from a static JSON file):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="nc"&gt;Schema&lt;/span&gt; &lt;span class="n"&gt;attrSchema&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;SchemaBuilder&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;record&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Attribute"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;fields&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt;
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;requiredString&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"id"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;requiredInt&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"quantity"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;requiredInt&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"amount"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;requiredInt&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"size"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;requiredDouble&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"percent"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;requiredBoolean&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"active"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;endRecord&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;
&lt;span class="kt"&gt;var&lt;/span&gt; &lt;span class="n"&gt;enumSymbols&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Stream&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;of&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;Type&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;values&lt;/span&gt;&lt;span class="o"&gt;()).&lt;/span&gt;&lt;span class="na"&gt;map&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nl"&gt;Type:&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;toArray&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;String&lt;/span&gt;&lt;span class="o"&gt;[]::&lt;/span&gt;&lt;span class="k"&gt;new&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
&lt;span class="nc"&gt;Schema&lt;/span&gt; &lt;span class="n"&gt;orgsSchema&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;SchemaBuilder&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;record&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Organizations"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;fields&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt;
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;requiredString&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"name"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;requiredString&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"category"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;requiredString&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"country"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"organizationType"&lt;/span&gt;&lt;span class="o"&gt;).&lt;/span&gt;&lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="o"&gt;().&lt;/span&gt;&lt;span class="na"&gt;enumeration&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"organizationType"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
                             &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;symbols&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;enumSymbols&lt;/span&gt;&lt;span class="o"&gt;).&lt;/span&gt;&lt;span class="na"&gt;noDefault&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt;
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"attributes"&lt;/span&gt;&lt;span class="o"&gt;).&lt;/span&gt;&lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="o"&gt;().&lt;/span&gt;&lt;span class="na"&gt;array&lt;/span&gt;&lt;span class="o"&gt;().&lt;/span&gt;&lt;span class="na"&gt;items&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;attrSchema&lt;/span&gt;&lt;span class="o"&gt;).&lt;/span&gt;&lt;span class="na"&gt;noDefault&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt;
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;endRecord&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;
&lt;span class="c1"&gt;//Auxiliar Map to encode Enums&lt;/span&gt;
&lt;span class="kt"&gt;var&lt;/span&gt; &lt;span class="n"&gt;typeField&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;orgsSchema&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getField&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"organizationType"&lt;/span&gt;&lt;span class="o"&gt;).&lt;/span&gt;&lt;span class="na"&gt;schema&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;
&lt;span class="nc"&gt;EnumMap&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;Type&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;EnumSymbol&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;enums&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;EnumMap&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&amp;gt;(&lt;/span&gt;&lt;span class="nc"&gt;Type&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;class&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
&lt;span class="n"&gt;enums&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;put&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;Type&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;BAR&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;EnumSymbol&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;typeField&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;Type&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;BAR&lt;/span&gt;&lt;span class="o"&gt;));&lt;/span&gt;
&lt;span class="n"&gt;enums&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;put&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;Type&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;BAZ&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;EnumSymbol&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;typeField&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;Type&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;BAZ&lt;/span&gt;&lt;span class="o"&gt;));&lt;/span&gt;
&lt;span class="n"&gt;enums&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;put&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;Type&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;FOO&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;EnumSymbol&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;typeField&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;Type&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;FOO&lt;/span&gt;&lt;span class="o"&gt;));&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And the code necessary to serialize the collection would look like &lt;a href="https://github.com/jerolba/xbuffers-article/blob/master/src/main/java/com/jerolba/xbuffers/ToAvroWithGenericRecord.java#L68" rel="noopener noreferrer"&gt;this&lt;/a&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="nc"&gt;DatumWriter&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;GenericRecord&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;datumWriter&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;GenericDatumWriter&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&amp;gt;(&lt;/span&gt;&lt;span class="n"&gt;orgsSchema&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
&lt;span class="kt"&gt;var&lt;/span&gt; &lt;span class="n"&gt;dataFileWriter&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;DataFileWriter&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&amp;gt;(&lt;/span&gt;&lt;span class="n"&gt;datumWriter&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
&lt;span class="k"&gt;try&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;var&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;FileOutputStream&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"/tmp/organizations.avro"&lt;/span&gt;&lt;span class="o"&gt;))&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
  &lt;span class="n"&gt;dataFileWriter&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;create&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;orgsSchema&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
  &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;var&lt;/span&gt; &lt;span class="n"&gt;org&lt;/span&gt; &lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="n"&gt;organizations&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
    &lt;span class="nc"&gt;List&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;GenericRecord&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;attrs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;ArrayList&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&amp;gt;();&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;var&lt;/span&gt; &lt;span class="n"&gt;attr&lt;/span&gt; &lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="n"&gt;org&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;attributes&lt;/span&gt;&lt;span class="o"&gt;())&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
      &lt;span class="nc"&gt;GenericRecord&lt;/span&gt; &lt;span class="n"&gt;attrRecord&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;GenericData&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;Record&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;attrSchema&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
      &lt;span class="n"&gt;attrRecord&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;put&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"id"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;attr&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="o"&gt;());&lt;/span&gt;
      &lt;span class="n"&gt;attrRecord&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;put&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"quantity"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;attr&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;quantity&lt;/span&gt;&lt;span class="o"&gt;());&lt;/span&gt;
      &lt;span class="n"&gt;attrRecord&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;put&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"amount"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;attr&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;amount&lt;/span&gt;&lt;span class="o"&gt;());&lt;/span&gt;
      &lt;span class="n"&gt;attrRecord&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;put&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"size"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;attr&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;size&lt;/span&gt;&lt;span class="o"&gt;());&lt;/span&gt;
      &lt;span class="n"&gt;attrRecord&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;put&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"percent"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;attr&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;percent&lt;/span&gt;&lt;span class="o"&gt;());&lt;/span&gt;
      &lt;span class="n"&gt;attrRecord&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;put&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"active"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;attr&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;active&lt;/span&gt;&lt;span class="o"&gt;());&lt;/span&gt;
      &lt;span class="n"&gt;attrs&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;add&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;attrRecord&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
    &lt;span class="o"&gt;}&lt;/span&gt;
    &lt;span class="nc"&gt;GenericRecord&lt;/span&gt; &lt;span class="n"&gt;orgRecord&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;GenericData&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;Record&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;orgsSchema&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
    &lt;span class="n"&gt;orgRecord&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;put&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"name"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;org&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="o"&gt;());&lt;/span&gt;
    &lt;span class="n"&gt;orgRecord&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;put&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"category"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;org&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;category&lt;/span&gt;&lt;span class="o"&gt;());&lt;/span&gt;
    &lt;span class="n"&gt;orgRecord&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;put&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"country"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;org&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;country&lt;/span&gt;&lt;span class="o"&gt;());&lt;/span&gt;
    &lt;span class="n"&gt;orgRecord&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;put&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"organizationType"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;enums&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;get&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;org&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="o"&gt;()));&lt;/span&gt;
    &lt;span class="n"&gt;orgRecord&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;put&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"attributes"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;attrs&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
    &lt;span class="n"&gt;dataFileWriter&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;append&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;orgRecord&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
  &lt;span class="o"&gt;}&lt;/span&gt;
  &lt;span class="n"&gt;dataFileWriter&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;close&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Because we are using the same Schema, and we are only changing how we serialize the data, the file size and memory used are the same.&lt;/p&gt;

&lt;p&gt;The serialization time grows to &lt;strong&gt;5,903 ms&lt;/strong&gt;, 10% more than using generated code. The implementation of &lt;code&gt;GenericRecord&lt;/code&gt; introduces a slight overhead.&lt;/p&gt;

&lt;h4&gt;
  
  
  Deserialization
&lt;/h4&gt;

&lt;p&gt;Using a different &lt;code&gt;Reader&lt;/code&gt;, the result of the deserialized data is again an untyped &lt;code&gt;GenericRecord&lt;/code&gt; object. In this case, we need to convert each instance to the original data structure, &lt;a href="https://github.com/jerolba/xbuffers-article/blob/master/src/main/java/com/jerolba/xbuffers/FromAvroWithGenericRecord.java#L35" rel="noopener noreferrer"&gt;mapping each type&lt;/a&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="nc"&gt;List&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;Org&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;organizations&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;ArrayList&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&amp;gt;();&lt;/span&gt;
&lt;span class="nc"&gt;DatumReader&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;GenericRecord&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;datumReader&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;GenericDatumReader&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&amp;gt;();&lt;/span&gt;
&lt;span class="k"&gt;try&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;var&lt;/span&gt; &lt;span class="n"&gt;dataFileReader&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;DataFileReader&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&amp;gt;(&lt;/span&gt;&lt;span class="n"&gt;file&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;datumReader&lt;/span&gt;&lt;span class="o"&gt;))&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
  &lt;span class="k"&gt;while&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dataFileReader&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;hasNext&lt;/span&gt;&lt;span class="o"&gt;())&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
    &lt;span class="nc"&gt;GenericRecord&lt;/span&gt; &lt;span class="n"&gt;record&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;dataFileReader&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;next&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;
    &lt;span class="nc"&gt;List&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;GenericRecord&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;attrsRecords&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;List&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;GenericRecord&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;)&lt;/span&gt; &lt;span class="n"&gt;record&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;get&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"attributes"&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
    &lt;span class="kt"&gt;var&lt;/span&gt; &lt;span class="n"&gt;attrs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;attrsRecords&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;stream&lt;/span&gt;&lt;span class="o"&gt;().&lt;/span&gt;&lt;span class="na"&gt;map&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;attr&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Attr&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;attr&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;get&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"id"&lt;/span&gt;&lt;span class="o"&gt;).&lt;/span&gt;&lt;span class="na"&gt;toString&lt;/span&gt;&lt;span class="o"&gt;(),&lt;/span&gt;
      &lt;span class="o"&gt;((&lt;/span&gt;&lt;span class="nc"&gt;Integer&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="n"&gt;attr&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;get&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"quantity"&lt;/span&gt;&lt;span class="o"&gt;)).&lt;/span&gt;&lt;span class="na"&gt;byteValue&lt;/span&gt;&lt;span class="o"&gt;(),&lt;/span&gt;
      &lt;span class="o"&gt;((&lt;/span&gt;&lt;span class="nc"&gt;Integer&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="n"&gt;attr&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;get&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"amount"&lt;/span&gt;&lt;span class="o"&gt;)).&lt;/span&gt;&lt;span class="na"&gt;byteValue&lt;/span&gt;&lt;span class="o"&gt;(),&lt;/span&gt;
      &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;boolean&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="n"&gt;attr&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;get&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"active"&lt;/span&gt;&lt;span class="o"&gt;),&lt;/span&gt;
      &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;double&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="n"&gt;attr&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;get&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"percent"&lt;/span&gt;&lt;span class="o"&gt;),&lt;/span&gt;
      &lt;span class="o"&gt;((&lt;/span&gt;&lt;span class="nc"&gt;Integer&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="n"&gt;attr&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;get&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"size"&lt;/span&gt;&lt;span class="o"&gt;)).&lt;/span&gt;&lt;span class="na"&gt;shortValue&lt;/span&gt;&lt;span class="o"&gt;())).&lt;/span&gt;&lt;span class="na"&gt;toList&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;
    &lt;span class="nc"&gt;Utf8&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;Utf8&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="n"&gt;record&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;get&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"name"&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
    &lt;span class="nc"&gt;Utf8&lt;/span&gt; &lt;span class="n"&gt;category&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;Utf8&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="n"&gt;record&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;get&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"category"&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
    &lt;span class="nc"&gt;Utf8&lt;/span&gt; &lt;span class="n"&gt;country&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;Utf8&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="n"&gt;record&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;get&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"country"&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
    &lt;span class="nc"&gt;Type&lt;/span&gt; &lt;span class="n"&gt;type&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Type&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;valueOf&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;record&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;get&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"organizationType"&lt;/span&gt;&lt;span class="o"&gt;).&lt;/span&gt;&lt;span class="na"&gt;toString&lt;/span&gt;&lt;span class="o"&gt;());&lt;/span&gt;
    &lt;span class="n"&gt;organizations&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;add&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Org&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;toString&lt;/span&gt;&lt;span class="o"&gt;(),&lt;/span&gt; &lt;span class="n"&gt;category&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;toString&lt;/span&gt;&lt;span class="o"&gt;(),&lt;/span&gt; &lt;span class="n"&gt;country&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;toString&lt;/span&gt;&lt;span class="o"&gt;(),&lt;/span&gt; &lt;span class="n"&gt;type&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;attrs&lt;/span&gt;&lt;span class="o"&gt;));&lt;/span&gt;
  &lt;span class="o"&gt;}&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The code is verbose and is plenty of castings. Avro does not support the &lt;code&gt;byte&lt;/code&gt; and &lt;code&gt;short&lt;/code&gt; types, and they are persisted as &lt;code&gt;int&lt;/code&gt;, so we need to downcast their values. As an optimization, Strings are created with an internal representation called &lt;a href="https://avro.apache.org/docs/1.4.1/api/java/org/apache/avro/util/Utf8.html" rel="noopener noreferrer"&gt;Utf8&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The deserialization time grows to &lt;strong&gt;8,471 ms&lt;/strong&gt;, around 5% of overhead compared to the static code version.&lt;/p&gt;




&lt;h3&gt;
  
  
  Using optimized Generic Record
&lt;/h3&gt;

&lt;p&gt;If you inspect the implementation of the &lt;code&gt;GenericRecord&lt;/code&gt; class, you can see that &lt;code&gt;get(String key)&lt;/code&gt; and &lt;code&gt;put(String key, Object value)&lt;/code&gt; &lt;a href="https://github.com/apache/avro/blob/master/lang/java/avro/src/main/java/org/apache/avro/Schema.java#L928" rel="noopener noreferrer"&gt;access by that key to a Map to get the index in an array&lt;/a&gt;. Since each attribute will always be in the same position when reading the file, we can access it only once and reuse its value with a variable, improving the execution time.&lt;/p&gt;

&lt;h3&gt;
  
  
  Serialization
&lt;/h3&gt;

&lt;p&gt;Because you need to store the index of each field in the Schema, the code is even more verbose. After creating the Schema, the &lt;a href="https://github.com/jerolba/xbuffers-article/blob/master/src/main/java/com/jerolba/xbuffers/ToAvroWithGenericRecordByPosition.java#L68" rel="noopener noreferrer"&gt;code&lt;/a&gt; is:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;idPos&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;attrSchema&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getField&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"id"&lt;/span&gt;&lt;span class="o"&gt;).&lt;/span&gt;&lt;span class="na"&gt;pos&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;
&lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;quantityPos&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;attrSchema&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getField&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"quantity"&lt;/span&gt;&lt;span class="o"&gt;).&lt;/span&gt;&lt;span class="na"&gt;pos&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;
&lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;amountPos&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;attrSchema&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getField&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"amount"&lt;/span&gt;&lt;span class="o"&gt;).&lt;/span&gt;&lt;span class="na"&gt;pos&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;
&lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;activePos&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;attrSchema&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getField&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"active"&lt;/span&gt;&lt;span class="o"&gt;).&lt;/span&gt;&lt;span class="na"&gt;pos&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;
&lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;percentPos&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;attrSchema&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getField&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"percent"&lt;/span&gt;&lt;span class="o"&gt;).&lt;/span&gt;&lt;span class="na"&gt;pos&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;
&lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;sizePos&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;attrSchema&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getField&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"size"&lt;/span&gt;&lt;span class="o"&gt;).&lt;/span&gt;&lt;span class="na"&gt;pos&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;
&lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;namePos&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;orgsSchema&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getField&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"name"&lt;/span&gt;&lt;span class="o"&gt;).&lt;/span&gt;&lt;span class="na"&gt;pos&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;
&lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;categoryPos&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;orgsSchema&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getField&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"category"&lt;/span&gt;&lt;span class="o"&gt;).&lt;/span&gt;&lt;span class="na"&gt;pos&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;
&lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;countryPos&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;orgsSchema&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getField&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"country"&lt;/span&gt;&lt;span class="o"&gt;).&lt;/span&gt;&lt;span class="na"&gt;pos&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;
&lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;organizationTypePos&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;orgsSchema&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getField&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"organizationType"&lt;/span&gt;&lt;span class="o"&gt;).&lt;/span&gt;&lt;span class="na"&gt;pos&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;
&lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;attributesPos&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;orgsSchema&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getField&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"attributes"&lt;/span&gt;&lt;span class="o"&gt;).&lt;/span&gt;&lt;span class="na"&gt;pos&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;

&lt;span class="k"&gt;try&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;var&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;FileOutputStream&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"/tmp/organizations.avro"&lt;/span&gt;&lt;span class="o"&gt;))&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
  &lt;span class="nc"&gt;DatumWriter&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;GenericRecord&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;datumWriter&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;GenericDatumWriter&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&amp;gt;(&lt;/span&gt;&lt;span class="n"&gt;orgsSchema&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
  &lt;span class="nc"&gt;DataFileWriter&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;GenericRecord&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;dataFileWriter&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;DataFileWriter&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&amp;gt;(&lt;/span&gt;&lt;span class="n"&gt;datumWriter&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
  &lt;span class="n"&gt;dataFileWriter&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;create&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;orgsSchema&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
  &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;var&lt;/span&gt; &lt;span class="n"&gt;org&lt;/span&gt; &lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="n"&gt;organizations&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
    &lt;span class="nc"&gt;List&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;GenericRecord&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;attrs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;ArrayList&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&amp;gt;();&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;var&lt;/span&gt; &lt;span class="n"&gt;attr&lt;/span&gt; &lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="n"&gt;org&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;attributes&lt;/span&gt;&lt;span class="o"&gt;())&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
      &lt;span class="nc"&gt;GenericRecord&lt;/span&gt; &lt;span class="n"&gt;attrRecord&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;GenericData&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;Record&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;attrSchema&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
      &lt;span class="n"&gt;attrRecord&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;put&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;idPos&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;attr&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="o"&gt;());&lt;/span&gt;
      &lt;span class="n"&gt;attrRecord&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;put&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;quantityPos&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;attr&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;quantity&lt;/span&gt;&lt;span class="o"&gt;());&lt;/span&gt;
      &lt;span class="n"&gt;attrRecord&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;put&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;amountPos&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;attr&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;amount&lt;/span&gt;&lt;span class="o"&gt;());&lt;/span&gt;
      &lt;span class="n"&gt;attrRecord&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;put&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sizePos&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;attr&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;size&lt;/span&gt;&lt;span class="o"&gt;());&lt;/span&gt;
      &lt;span class="n"&gt;attrRecord&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;put&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;percentPos&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;attr&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;percent&lt;/span&gt;&lt;span class="o"&gt;());&lt;/span&gt;
      &lt;span class="n"&gt;attrRecord&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;put&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;activePos&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;attr&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;active&lt;/span&gt;&lt;span class="o"&gt;());&lt;/span&gt;
      &lt;span class="n"&gt;attrs&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;add&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;attrRecord&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
    &lt;span class="o"&gt;}&lt;/span&gt;
    &lt;span class="nc"&gt;GenericRecord&lt;/span&gt; &lt;span class="n"&gt;orgRecord&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;GenericData&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;Record&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;orgsSchema&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
    &lt;span class="n"&gt;orgRecord&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;put&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;namePos&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;org&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="o"&gt;());&lt;/span&gt;
    &lt;span class="n"&gt;orgRecord&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;put&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;categoryPos&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;org&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;category&lt;/span&gt;&lt;span class="o"&gt;());&lt;/span&gt;
    &lt;span class="n"&gt;orgRecord&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;put&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;countryPos&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;org&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;country&lt;/span&gt;&lt;span class="o"&gt;());&lt;/span&gt;
    &lt;span class="n"&gt;orgRecord&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;put&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;organizationTypePos&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;enums&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;get&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;org&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="o"&gt;()));&lt;/span&gt;
    &lt;span class="n"&gt;orgRecord&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;put&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;attributesPos&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;attrs&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
    &lt;span class="n"&gt;dataFileWriter&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;append&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;orgRecord&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
  &lt;span class="o"&gt;}&lt;/span&gt;
  &lt;span class="n"&gt;dataFileWriter&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;close&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The serialization time is &lt;strong&gt;5,381 ms&lt;/strong&gt;, very close to the time used in the version with generated code..&lt;/p&gt;

&lt;h3&gt;
  
  
  Deserialization
&lt;/h3&gt;

&lt;p&gt;The &lt;a href="https://github.com/jerolba/xbuffers-article/blob/master/src/main/java/com/jerolba/xbuffers/FromAvroWithGenericRecordByPosition.java#L36" rel="noopener noreferrer"&gt;code&lt;/a&gt; is very similar, we only add variables to know the position of each field:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="nc"&gt;List&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;Org&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;organizations&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;ArrayList&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&amp;gt;();&lt;/span&gt;
&lt;span class="nc"&gt;DatumReader&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;GenericRecord&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;datumReader&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;GenericDatumReader&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&amp;gt;();&lt;/span&gt;
&lt;span class="k"&gt;try&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;DataFileReader&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;GenericRecord&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;dataFileReader&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;DataFileReader&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&amp;gt;(&lt;/span&gt;&lt;span class="n"&gt;file&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;datumReader&lt;/span&gt;&lt;span class="o"&gt;))&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
  &lt;span class="nc"&gt;Schema&lt;/span&gt; &lt;span class="n"&gt;attributes&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;dataFileReader&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getSchema&lt;/span&gt;&lt;span class="o"&gt;().&lt;/span&gt;&lt;span class="na"&gt;getField&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"attributes"&lt;/span&gt;&lt;span class="o"&gt;).&lt;/span&gt;&lt;span class="na"&gt;schema&lt;/span&gt;&lt;span class="o"&gt;().&lt;/span&gt;&lt;span class="na"&gt;getElementType&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;
  &lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;idPos&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;attributes&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getField&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"id"&lt;/span&gt;&lt;span class="o"&gt;).&lt;/span&gt;&lt;span class="na"&gt;pos&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;
  &lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;quantityPos&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;attributes&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getField&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"quantity"&lt;/span&gt;&lt;span class="o"&gt;).&lt;/span&gt;&lt;span class="na"&gt;pos&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;
  &lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;amountPos&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;attributes&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getField&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"amount"&lt;/span&gt;&lt;span class="o"&gt;).&lt;/span&gt;&lt;span class="na"&gt;pos&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;
  &lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;activePos&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;attributes&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getField&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"active"&lt;/span&gt;&lt;span class="o"&gt;).&lt;/span&gt;&lt;span class="na"&gt;pos&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;
  &lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;percentPos&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;attributes&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getField&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"percent"&lt;/span&gt;&lt;span class="o"&gt;).&lt;/span&gt;&lt;span class="na"&gt;pos&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;
  &lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;sizePos&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;attributes&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getField&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"size"&lt;/span&gt;&lt;span class="o"&gt;).&lt;/span&gt;&lt;span class="na"&gt;pos&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;
  &lt;span class="nc"&gt;Schema&lt;/span&gt; &lt;span class="n"&gt;orgs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;dataFileReader&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getSchema&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;
  &lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;namePos&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;orgs&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getField&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"name"&lt;/span&gt;&lt;span class="o"&gt;).&lt;/span&gt;&lt;span class="na"&gt;pos&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;
  &lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;categoryPos&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;orgs&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getField&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"category"&lt;/span&gt;&lt;span class="o"&gt;).&lt;/span&gt;&lt;span class="na"&gt;pos&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;
  &lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;countryPos&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;orgs&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getField&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"country"&lt;/span&gt;&lt;span class="o"&gt;).&lt;/span&gt;&lt;span class="na"&gt;pos&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;
  &lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;organizationTypePos&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;orgs&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getField&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"organizationType"&lt;/span&gt;&lt;span class="o"&gt;).&lt;/span&gt;&lt;span class="na"&gt;pos&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;
  &lt;span class="k"&gt;while&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dataFileReader&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;hasNext&lt;/span&gt;&lt;span class="o"&gt;())&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
    &lt;span class="nc"&gt;GenericRecord&lt;/span&gt; &lt;span class="n"&gt;record&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;dataFileReader&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;next&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;
    &lt;span class="nc"&gt;List&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;GenericRecord&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;attrsRecords&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;List&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;GenericRecord&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;)&lt;/span&gt; &lt;span class="n"&gt;record&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;get&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"attributes"&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
    &lt;span class="kt"&gt;var&lt;/span&gt; &lt;span class="n"&gt;attrs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;attrsRecords&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;stream&lt;/span&gt;&lt;span class="o"&gt;().&lt;/span&gt;&lt;span class="na"&gt;map&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;attr&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Attr&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;attr&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;get&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;idPos&lt;/span&gt;&lt;span class="o"&gt;).&lt;/span&gt;&lt;span class="na"&gt;toString&lt;/span&gt;&lt;span class="o"&gt;(),&lt;/span&gt;
      &lt;span class="o"&gt;((&lt;/span&gt;&lt;span class="nc"&gt;Integer&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="n"&gt;attr&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;get&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;quantityPos&lt;/span&gt;&lt;span class="o"&gt;)).&lt;/span&gt;&lt;span class="na"&gt;byteValue&lt;/span&gt;&lt;span class="o"&gt;(),&lt;/span&gt;
      &lt;span class="o"&gt;((&lt;/span&gt;&lt;span class="nc"&gt;Integer&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="n"&gt;attr&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;get&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;amountPos&lt;/span&gt;&lt;span class="o"&gt;)).&lt;/span&gt;&lt;span class="na"&gt;byteValue&lt;/span&gt;&lt;span class="o"&gt;(),&lt;/span&gt;
      &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;boolean&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="n"&gt;attr&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;get&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;activePos&lt;/span&gt;&lt;span class="o"&gt;),&lt;/span&gt;
      &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;double&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="n"&gt;attr&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;get&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;percentPos&lt;/span&gt;&lt;span class="o"&gt;),&lt;/span&gt;
      &lt;span class="o"&gt;((&lt;/span&gt;&lt;span class="nc"&gt;Integer&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="n"&gt;attr&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;get&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sizePos&lt;/span&gt;&lt;span class="o"&gt;)).&lt;/span&gt;&lt;span class="na"&gt;shortValue&lt;/span&gt;&lt;span class="o"&gt;())).&lt;/span&gt;&lt;span class="na"&gt;toList&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;
    &lt;span class="nc"&gt;Utf8&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;Utf8&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="n"&gt;record&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;get&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;namePos&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
    &lt;span class="nc"&gt;Utf8&lt;/span&gt; &lt;span class="n"&gt;category&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;Utf8&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="n"&gt;record&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;get&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;categoryPos&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
    &lt;span class="nc"&gt;Utf8&lt;/span&gt; &lt;span class="n"&gt;country&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;Utf8&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="n"&gt;record&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;get&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;countryPos&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
    &lt;span class="nc"&gt;Type&lt;/span&gt; &lt;span class="n"&gt;type&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Type&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;valueOf&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;record&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;get&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;organizationTypePos&lt;/span&gt;&lt;span class="o"&gt;).&lt;/span&gt;&lt;span class="na"&gt;toString&lt;/span&gt;&lt;span class="o"&gt;());&lt;/span&gt;
    &lt;span class="n"&gt;organizations&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;add&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Org&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;toString&lt;/span&gt;&lt;span class="o"&gt;(),&lt;/span&gt; &lt;span class="n"&gt;category&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;toString&lt;/span&gt;&lt;span class="o"&gt;(),&lt;/span&gt; &lt;span class="n"&gt;country&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;toString&lt;/span&gt;&lt;span class="o"&gt;(),&lt;/span&gt; &lt;span class="n"&gt;type&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;attrs&lt;/span&gt;&lt;span class="o"&gt;));&lt;/span&gt;
  &lt;span class="o"&gt;}&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The deserialization time drops to &lt;strong&gt;7,353 ms&lt;/strong&gt;, around 10% faster than the generated code version. Why? I don't have enough knowledge of its internals to venture an answer, but I was surprised by the result.&lt;/p&gt;

&lt;h4&gt;
  
  
  Avro Summary
&lt;/h4&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;Generated Code&lt;/th&gt;
&lt;th&gt;Generic Record&lt;/th&gt;
&lt;th&gt;Optimized &lt;br&gt;Generic Record&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Serialization time&lt;/td&gt;
&lt;td&gt;5,409 ms&lt;/td&gt;
&lt;td&gt;5,903 ms&lt;/td&gt;
&lt;td&gt;5,381 ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Deserialization time&lt;/td&gt;
&lt;td&gt;8,197 ms&lt;/td&gt;
&lt;td&gt;8,471 ms&lt;/td&gt;
&lt;td&gt;7,353 ms&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Using GenericRecord allows us to gain some flexibility in the process without losing performance, but it makes the code much more verbose and prone to errors due to the manual mapping of fields..&lt;/p&gt;

&lt;p&gt;The included dependencies are the same in all cases, and we can only save the generation of the source code.&lt;/p&gt;

&lt;h2&gt;
  
  
  Analysis and impressions
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;JSON&lt;/th&gt;
&lt;th&gt;Protocol Buffers&lt;/th&gt;
&lt;th&gt;FlatBuffers&lt;/th&gt;
&lt;th&gt;Avro&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Serialization time&lt;/td&gt;
&lt;td&gt;11,718 ms&lt;/td&gt;
&lt;td&gt;5,823 ms&lt;/td&gt;
&lt;td&gt;3,803 ms&lt;/td&gt;
&lt;td&gt;5,409 ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;File size&lt;/td&gt;
&lt;td&gt;2,457 MB&lt;/td&gt;
&lt;td&gt;1,044 MB&lt;/td&gt;
&lt;td&gt;600 MB&lt;/td&gt;
&lt;td&gt;846 MB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;GZ file size&lt;/td&gt;
&lt;td&gt;525 MB&lt;/td&gt;
&lt;td&gt;448 MB&lt;/td&gt;
&lt;td&gt;414 MB&lt;/td&gt;
&lt;td&gt;530 MB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Memory serializing&lt;/td&gt;
&lt;td&gt;N/A&lt;/td&gt;
&lt;td&gt;1.29 GB&lt;/td&gt;
&lt;td&gt;0.6 GB - 1 GB&lt;/td&gt;
&lt;td&gt;N/A&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Deserialization time&lt;/td&gt;
&lt;td&gt;20,410 ms&lt;/td&gt;
&lt;td&gt;4,535 ms&lt;/td&gt;
&lt;td&gt;202 - 1,876 ms&lt;/td&gt;
&lt;td&gt;8,197 ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Memory deserialization&lt;/td&gt;
&lt;td&gt;2,193 MB&lt;/td&gt;
&lt;td&gt;2,710 MB&lt;/td&gt;
&lt;td&gt;0 - 600 MB&lt;/td&gt;
&lt;td&gt;2,520 MB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;JAR library size&lt;/td&gt;
&lt;td&gt;1,910 KB&lt;/td&gt;
&lt;td&gt;1,636 KB&lt;/td&gt;
&lt;td&gt;64 KB&lt;/td&gt;
&lt;td&gt;3,469 KB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Size of generated classes&lt;/td&gt;
&lt;td&gt;N/A&lt;/td&gt;
&lt;td&gt;40 KB&lt;/td&gt;
&lt;td&gt;9 KB&lt;/td&gt;
&lt;td&gt;36 KB&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;ul&gt;
&lt;li&gt;If we don't consider the optimizations applied in FlatBuffers, the Avro file takes up less space.&lt;/li&gt;
&lt;li&gt;In the example, although all fields are required, you can easily make them nullable (consuming slightly more space).&lt;/li&gt;
&lt;li&gt;For me, its main strength is that it doesn't require having all the data in memory to serialize the information. For example, you can read it from a database or a file, transform or enrich it according to your logic, and at the same time persist it in some type of OutputStream (created from a file or HTTP connection).&lt;/li&gt;
&lt;li&gt;The possibility of defining a schema programmatically gives us the option of modifying the output format depending on the business logic and creating our own serialization tooling.&lt;/li&gt;
&lt;li&gt;Avro is halfway between JSON and binary formats like Protocol Buffers and Flatbuffers:

&lt;ul&gt;
&lt;li&gt;Avro supports flexible schema&lt;/li&gt;
&lt;li&gt;It is not necessary to know or agree on the schema beforehand to be able to read an Avro file&lt;/li&gt;
&lt;li&gt;In JSON you can not get the schema from the file itself (without fully parsing it first)&lt;/li&gt;
&lt;li&gt;Avro is more compact than JSON (but not human-readable)&lt;/li&gt;
&lt;li&gt;Serialization and deserialization time is faster than JSON&lt;/li&gt;
&lt;li&gt;You can easily serialize/deserialize all objects in a loop/stream without needing to have it all in memory&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

</description>
      <category>career</category>
      <category>community</category>
    </item>
    <item>
      <title>Java Serialization with Flatbuffers</title>
      <dc:creator>Jerónimo López</dc:creator>
      <pubDate>Mon, 14 Nov 2022 18:37:26 +0000</pubDate>
      <link>https://dev.to/jerolba/java-serialization-with-flatbuffers-45nl</link>
      <guid>https://dev.to/jerolba/java-serialization-with-flatbuffers-45nl</guid>
      <description>&lt;p&gt;In the &lt;a href="https://dev.to/jerolba/java-serialization-with-protocol-buffers-4pbj"&gt;previous post&lt;/a&gt; I analyzed Protocol Buffers format, using JSON as baseline. In this post I'm going to analyze FlatfBuffers and compare it with previously studied formats.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://google.github.io/flatbuffers/"&gt;FlatBuffers&lt;/a&gt; was also created within Google in 2014 under Apache 2.0 license. It was developed to meet specific needs within the world of video games and mobile applications, where resources are more constrained.&lt;/p&gt;

&lt;p&gt;Like Protocol Buffers, it relies on predefining the data format with similar syntax and generates a human-unreadable binary content. It has support for &lt;a href="https://google.github.io/flatbuffers/flatbuffers_support.html"&gt;multiple languages&lt;/a&gt; and allows adding and deprecating fields without breaking compatibility.&lt;/p&gt;

&lt;p&gt;The main distinction of FlatBuffers is that it implements &lt;em&gt;zero-copy&lt;/em&gt; deserialization: it does not need to create objects or reserve new memory areas to parse the information, because it always works with the information in binary within a memory or disk area. &lt;strong&gt;The objects that represent the deserialized information do not contain the information, but know how to resolve the value when their &lt;em&gt;get&lt;/em&gt; methods are called&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;In the specific case of Java, this is not strictly correct, because for Strings you have to instantiate the &lt;code&gt;char[]&lt;/code&gt; needed for its internal structure. But it is only necessary if you call the getter of the String type attribute. &lt;strong&gt;Only the information that is accessed is deserialized.&lt;/strong&gt;&lt;/p&gt;




&lt;h3&gt;
  
  
  IDL and code generation
&lt;/h3&gt;

&lt;p&gt;The file with the schema could be &lt;a href="https://github.com/jerolba/xbuffers-article/blob/master/src/main/resources/organizations.fbs"&gt;this one&lt;/a&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;namespace com.jerolba.xbuffers.flat&lt;span class="p"&gt;;&lt;/span&gt;

enum OrganizationType : byte &lt;span class="o"&gt;{&lt;/span&gt; FOO, BAR, BAZ &lt;span class="o"&gt;}&lt;/span&gt;

table Attribute &lt;span class="o"&gt;{&lt;/span&gt;
    &lt;span class="nb"&gt;id&lt;/span&gt;: string&lt;span class="p"&gt;;&lt;/span&gt;
    quantity: byte&lt;span class="p"&gt;;&lt;/span&gt; 
    amount: byte&lt;span class="p"&gt;;&lt;/span&gt;
    size: short&lt;span class="p"&gt;;&lt;/span&gt;
    percent: double&lt;span class="p"&gt;;&lt;/span&gt;
    active: bool&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;

table Organization &lt;span class="o"&gt;{&lt;/span&gt;
    name: string&lt;span class="p"&gt;;&lt;/span&gt;
    category: string&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="nb"&gt;type&lt;/span&gt;: OrganizationType&lt;span class="p"&gt;;&lt;/span&gt;
    country: string&lt;span class="p"&gt;;&lt;/span&gt;
    attributes: &lt;span class="o"&gt;[&lt;/span&gt;Attribute]&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;

table Organizations &lt;span class="o"&gt;{&lt;/span&gt;
    organizations: &lt;span class="o"&gt;[&lt;/span&gt;Organization]&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;

root_type Organizations&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To generate all Java classes you need to install the compiler &lt;code&gt;flatc&lt;/code&gt; and execute it with some parameters referencing where the IDL file is located and the target path of the generated files:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt; flatc &lt;span class="nt"&gt;-j&lt;/span&gt; &lt;span class="nt"&gt;-o&lt;/span&gt; /opt/src/main/java/  /opt/src/main/resources/organizations.fbs
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;As with Protocol Buffers, I prefer to use directly docker with an image ready to execute the command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker run &lt;span class="nt"&gt;--rm&lt;/span&gt; &lt;span class="nt"&gt;-v&lt;/span&gt; &lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;pwd&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;/src:/opt/src neomantra/flatbuffers flatc &lt;span class="nt"&gt;-j&lt;/span&gt; &lt;span class="nt"&gt;-o&lt;/span&gt; /opt/src/main/java/  /opt/src/main/resources/organizations.fbs
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h3&gt;
  
  
  Serialization
&lt;/h3&gt;

&lt;p&gt;This is the worst part of FlatBuffers: it's complex and it's not automatic. The serialization operation requires a &lt;strong&gt;manual coding process&lt;/strong&gt;, where you describe step by step how to fill a binary buffer. &lt;/p&gt;

&lt;p&gt;As you add elements of your data structures to the buffer, it returns offsets or &lt;strong&gt;pointers&lt;/strong&gt;, which are the values used as references in the data structures that contain them. All in a recursive way. &lt;/p&gt;

&lt;p&gt;If we see the data structure as a tree, we would have to do &lt;strong&gt;&lt;a href="https://en.wikipedia.org/wiki/Tree_traversal#Post-order,_LRN"&gt;a postorder traversal&lt;/a&gt;&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;The process is very fragile and it is very easy to make mistakes, so it will be necessary to cover this part with &lt;strong&gt;unit tests&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;The code necessary to serialize the information starting from POJOs would look like &lt;a href="https://github.com/jerolba/xbuffers-article/blob/master/src/main/java/com/jerolba/xbuffers/ToFlatbuffers.java#L48"&gt;this&lt;/a&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="kt"&gt;var&lt;/span&gt; &lt;span class="n"&gt;organizations&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;dataFactory&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getOrganizations&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;400_000&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
&lt;span class="nc"&gt;FlatBufferBuilder&lt;/span&gt; &lt;span class="n"&gt;builder&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;FlatBufferBuilder&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;

&lt;span class="kt"&gt;int&lt;/span&gt;&lt;span class="o"&gt;[]&lt;/span&gt; &lt;span class="n"&gt;orgsArr&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="kt"&gt;int&lt;/span&gt;&lt;span class="o"&gt;[&lt;/span&gt;&lt;span class="n"&gt;organizations&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;size&lt;/span&gt;&lt;span class="o"&gt;()];&lt;/span&gt;
&lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;contOrgs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;Org&lt;/span&gt; &lt;span class="n"&gt;org&lt;/span&gt; &lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="n"&gt;organizations&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
  &lt;span class="kt"&gt;int&lt;/span&gt;&lt;span class="o"&gt;[]&lt;/span&gt; &lt;span class="n"&gt;attributes&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="kt"&gt;int&lt;/span&gt;&lt;span class="o"&gt;[&lt;/span&gt;&lt;span class="n"&gt;org&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;attributes&lt;/span&gt;&lt;span class="o"&gt;().&lt;/span&gt;&lt;span class="na"&gt;size&lt;/span&gt;&lt;span class="o"&gt;()];&lt;/span&gt;
  &lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;contAttr&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt;
  &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;Attr&lt;/span&gt; &lt;span class="n"&gt;attr&lt;/span&gt; &lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="n"&gt;org&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;attributes&lt;/span&gt;&lt;span class="o"&gt;())&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
    &lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;idOffset&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;builder&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;createString&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;attr&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="o"&gt;());&lt;/span&gt;
    &lt;span class="n"&gt;attributes&lt;/span&gt;&lt;span class="o"&gt;[&lt;/span&gt;&lt;span class="n"&gt;contAttr&lt;/span&gt;&lt;span class="o"&gt;++]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Attribute&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;createAttribute&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;builder&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; 
        &lt;span class="n"&gt;idOffset&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;attr&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;quantity&lt;/span&gt;&lt;span class="o"&gt;(),&lt;/span&gt; &lt;span class="n"&gt;attr&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;amount&lt;/span&gt;&lt;span class="o"&gt;(),&lt;/span&gt; &lt;span class="n"&gt;attr&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;size&lt;/span&gt;&lt;span class="o"&gt;(),&lt;/span&gt; 
        &lt;span class="n"&gt;attr&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;percent&lt;/span&gt;&lt;span class="o"&gt;(),&lt;/span&gt; &lt;span class="n"&gt;attr&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;active&lt;/span&gt;&lt;span class="o"&gt;());&lt;/span&gt;
  &lt;span class="o"&gt;}&lt;/span&gt;
  &lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;attrsOffset&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Organization&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;createAttributesVector&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;builder&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;attributes&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;

  &lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;nameOffset&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;builder&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;createString&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;org&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="o"&gt;());&lt;/span&gt;
  &lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;categoryOffset&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;builder&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;createString&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;org&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;category&lt;/span&gt;&lt;span class="o"&gt;());&lt;/span&gt;
  &lt;span class="kt"&gt;byte&lt;/span&gt; &lt;span class="n"&gt;type&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;byte&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="n"&gt;org&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="o"&gt;().&lt;/span&gt;&lt;span class="na"&gt;ordinal&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;
  &lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;countryOffset&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;builder&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;createString&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;org&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;country&lt;/span&gt;&lt;span class="o"&gt;());&lt;/span&gt;
  &lt;span class="n"&gt;orgsArr&lt;/span&gt;&lt;span class="o"&gt;[&lt;/span&gt;&lt;span class="n"&gt;contOrgs&lt;/span&gt;&lt;span class="o"&gt;++]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Organization&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;createOrganization&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;builder&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; 
    &lt;span class="n"&gt;nameOffset&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;categoryOffset&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;type&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;countryOffset&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;attrsOffset&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;organizationsOffset&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Organizations&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;createOrganizationsVector&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;builder&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;orgsArr&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
&lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;root_table&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Organizations&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;createOrganizations&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;builder&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;organizationsOffset&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
&lt;span class="n"&gt;builder&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;finish&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;root_table&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;

&lt;span class="k"&gt;try&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;var&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;FileOutputStream&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"/tmp/flatbuffer.json"&lt;/span&gt;&lt;span class="o"&gt;))&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
  &lt;span class="nc"&gt;InputStream&lt;/span&gt; &lt;span class="n"&gt;sizedInputStream&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;builder&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;sizedInputStream&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;
  &lt;span class="n"&gt;sizedInputStream&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;transferTo&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;As you can see, it's a pretty ugly code, and you can easily make a mistake.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Serialization time: 5,639 ms&lt;/li&gt;
&lt;li&gt;File size: 1,284 MB&lt;/li&gt;
&lt;li&gt;Compressed file size: 530 MB&lt;/li&gt;
&lt;li&gt;Memory required: internally creates a &lt;code&gt;ByteBuffer&lt;/code&gt; that grows by power of two. To fit the 1.2 GB of data you need to allocate 2 GB of memory, unless you initially set it to 1.2 GB if you know its size beforehand.&lt;/li&gt;
&lt;li&gt;Library size (flatbuffers-java-1.12.0.jar): 64,873 bytes&lt;/li&gt;
&lt;li&gt;Size of generated classes: 9,080 bytes&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;
  
  
  Deserialization
&lt;/h3&gt;

&lt;p&gt;Deserialization is simpler than serialization, and the only difficulty lies in instantiating a &lt;code&gt;ByteBuffer&lt;/code&gt; containing the binary serialized information.&lt;/p&gt;

&lt;p&gt;Depending on the requirements we can bring all the content into memory or use &lt;a href="https://en.wikipedia.org/wiki/Memory-mapped_file"&gt;a memory-mapped file&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Once deserialized (or rather read the file), the objects you use do not actually contain the information, they are simply a proxy that knows how to locate it on demand. Therefore, if some data is not accessed, it is not really deserialized... on the other hand, every time you access the same value, you are deserializing it.&lt;/p&gt;

&lt;h4&gt;
  
  
  Reading the entire file into memory
&lt;/h4&gt;

&lt;p&gt;If the available memory allows it and you are going to use the data intensively, you are more likely to want to &lt;a href="https://github.com/jerolba/xbuffers-article/blob/master/src/main/java/com/jerolba/xbuffers/FromFlatBuffersMemoryAll.java#L27"&gt;read everything in memory&lt;/a&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="k"&gt;try&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;RandomAccessFile&lt;/span&gt; &lt;span class="n"&gt;file&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;RandomAccessFile&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"/tmp/organizations.flatbuffers"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"r"&lt;/span&gt;&lt;span class="o"&gt;))&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
  &lt;span class="nc"&gt;FileChannel&lt;/span&gt; &lt;span class="n"&gt;inChannel&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;file&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getChannel&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;
  &lt;span class="nc"&gt;ByteBuffer&lt;/span&gt; &lt;span class="n"&gt;buffer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;ByteBuffer&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;allocate&lt;/span&gt;&lt;span class="o"&gt;((&lt;/span&gt;&lt;span class="kt"&gt;int&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="n"&gt;inChannel&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;size&lt;/span&gt;&lt;span class="o"&gt;());&lt;/span&gt;
  &lt;span class="n"&gt;inChannel&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;read&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;buffer&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
  &lt;span class="n"&gt;inChannel&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;close&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;
  &lt;span class="n"&gt;buffer&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;flip&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;
  &lt;span class="nc"&gt;Organizations&lt;/span&gt; &lt;span class="n"&gt;organizations&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Organizations&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getRootAsOrganizations&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;buffer&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
  &lt;span class="nc"&gt;Organization&lt;/span&gt; &lt;span class="n"&gt;organization&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;organizations&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;organizations&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
  &lt;span class="nc"&gt;String&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;organization&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;
  &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;organization&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;attributesLength&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="o"&gt;++){&lt;/span&gt;
    &lt;span class="nc"&gt;String&lt;/span&gt; &lt;span class="n"&gt;attrId&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;organization&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;attributes&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="o"&gt;).&lt;/span&gt;&lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;
  &lt;span class="o"&gt;......&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;Deserialization time accessing some attributes: 640 ms&lt;/li&gt;
&lt;li&gt;Deserialization time accessing all attributes: 2,184 ms&lt;/li&gt;
&lt;li&gt;Memory required: loading the entire file into memory, 1,284 MB&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  Mapping the file into memory
&lt;/h4&gt;

&lt;p&gt;The Java &lt;code&gt;FileChannel&lt;/code&gt; system abstracts whether all the information is directly in memory or whether it is read from disk &lt;a href="https://github.com/jerolba/xbuffers-article/blob/master/src/main/java/com/jerolba/xbuffers/FromFlatBuffersMappedFileAll.java#L21"&gt;as needed&lt;/a&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="k"&gt;try&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;RandomAccessFile&lt;/span&gt; &lt;span class="n"&gt;file&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;RandomAccessFile&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"/tmp/organizations.flatbuffers"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"r"&lt;/span&gt;&lt;span class="o"&gt;))&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
  &lt;span class="nc"&gt;FileChannel&lt;/span&gt; &lt;span class="n"&gt;inChannel&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;file&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getChannel&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;
  &lt;span class="nc"&gt;MappedByteBuffer&lt;/span&gt; &lt;span class="n"&gt;buffer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;inChannel&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;map&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;MapMode&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;READ_ONLY&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;inChannel&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;size&lt;/span&gt;&lt;span class="o"&gt;());&lt;/span&gt;
  &lt;span class="n"&gt;buffer&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;load&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;
  &lt;span class="nc"&gt;Organizations&lt;/span&gt; &lt;span class="n"&gt;organizations&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Organizations&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getRootAsOrganizations&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;buffer&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
    &lt;span class="o"&gt;.....&lt;/span&gt;
  &lt;span class="n"&gt;inChannel&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;close&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;Deserialization time accessing some attributes: 306 ms&lt;/li&gt;
&lt;li&gt;Deserialization time accessing all attributes: 2,044 ms&lt;/li&gt;
&lt;li&gt;Memory required: as it only creates file reading objects and temporary buffers, I do not know how to measure it, and I think it can be considered negligible.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I am surprised that using a memory-mapped file takes slightly less time. Probably as I run the benchmark multiple times, the operating system has cached the file in memory. If I were rigorous with the test I should change the process, but it is good enough to get an idea.&lt;/p&gt;




&lt;h2&gt;
  
  
  Analysis and impressions
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;JSON&lt;/th&gt;
&lt;th&gt;Protocol Buffers&lt;/th&gt;
&lt;th&gt;FlatBuffers&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Serialization time&lt;/td&gt;
&lt;td&gt;11,718 ms&lt;/td&gt;
&lt;td&gt;5,823 ms&lt;/td&gt;
&lt;td&gt;5,639 ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;File size&lt;/td&gt;
&lt;td&gt;2,457 MB&lt;/td&gt;
&lt;td&gt;1,044 MB&lt;/td&gt;
&lt;td&gt;1,284 MB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;GZ file size&lt;/td&gt;
&lt;td&gt;525 MB&lt;/td&gt;
&lt;td&gt;448 MB&lt;/td&gt;
&lt;td&gt;530 MB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Memory serializing&lt;/td&gt;
&lt;td&gt;N/A&lt;/td&gt;
&lt;td&gt;1.29 GB&lt;/td&gt;
&lt;td&gt;1.3 GB - 2 GB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Deserialization time&lt;/td&gt;
&lt;td&gt;20,410 ms&lt;/td&gt;
&lt;td&gt;4,535 ms&lt;/td&gt;
&lt;td&gt;306 - 2,184 ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Memory deserialization&lt;/td&gt;
&lt;td&gt;2,193 MB&lt;/td&gt;
&lt;td&gt;2,710 MB&lt;/td&gt;
&lt;td&gt;0 - 1,284 MB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;JAR library size&lt;/td&gt;
&lt;td&gt;1,910 KB&lt;/td&gt;
&lt;td&gt;1,636 KB&lt;/td&gt;
&lt;td&gt;64 KB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Size of generated classes&lt;/td&gt;
&lt;td&gt;N/A&lt;/td&gt;
&lt;td&gt;40 KB&lt;/td&gt;
&lt;td&gt;9 KB&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;From the data and from what I have been able to see by playing with the formats, we can conclude:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;FlatBuffers is highly recommended when deserialization time and memory consumption is important. This explains why it is so widely used in the world of video games and mobile applications. Its serialization API is very delicate and prone to errors.&lt;/li&gt;
&lt;li&gt;In FlatBuffers, in an example like mine where there is a lot of data, in the serialization process it is important to &lt;strong&gt;configure a buffer size&lt;/strong&gt; close to the final result, otherwise, the library will have to continuously extend the buffer, spending more memory and time.&lt;/li&gt;
&lt;li&gt;In FlatBuffers, the JAR dependency needed to serialize and deserialize, together with the generated classes, is &lt;strong&gt;ridiculously small&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;While in JSON and Protocol Buffers you need to deserialize all the information to access a part of it, in FlatBuffers &lt;strong&gt;you can randomly access any element without having to traverse and parse all the information that precedes it&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Protocol Buffers takes less space than FlatBuffers because the overhead per data structure (message/table) is smaller, and especially because Protocol Buffers serializes scalars with the value that takes the least bytes, compressing the information.&lt;/li&gt;
&lt;li&gt;For scalar values, FlatBuffers &lt;strong&gt;does not support null&lt;/strong&gt;. If a value is not present it takes the default value of the corresponding primitive (0, 0.0, or false). But Strings can be &lt;code&gt;null&lt;/code&gt;. &lt;/li&gt;
&lt;li&gt;If you need to represent the &lt;code&gt;null&lt;/code&gt; value for a scalar, you can define a struct: &lt;code&gt;struct NullableInt32 { i:int32 }&lt;/code&gt;. This adds more complexity to the serialization code and consumes more bytes.&lt;/li&gt;
&lt;li&gt;If you follow the serialization algorithm, all normalized strings (countries and categories) from the original object graph will appear repeated in the serialized file.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Iterating serialization implementation
&lt;/h2&gt;

&lt;p&gt;Implementing the serialization with FlatBuffers &lt;strong&gt;is really hard&lt;/strong&gt;, but on the other hand, if you understand how it works internally, this problem can become a great advantage that works in its favor.&lt;/p&gt;

&lt;p&gt;Every time you do something like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;idOffset&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;builder&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;createString&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;attr&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="o"&gt;());&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;you are adding to the buffer a character array, and getting a kind of pointer or offset to use in the object that contains it.&lt;/p&gt;

&lt;p&gt;If the same String is repeated in a serialization, can I reuse the pointer each time it appears? &lt;strong&gt;Yes&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;If in a serialization you know that the same String will be repeated multiple times you can reuse its offset &lt;strong&gt;without breaking the internal representation of the serialization&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;With a small change every time we serialize a String, using a &lt;code&gt;Map&amp;lt;String, Integer&amp;gt;&lt;/code&gt; and querying it every time we add a String, we would have something &lt;a href="https://github.com/jerolba/xbuffers-article/blob/master/src/main/java/com/jerolba/xbuffers/ToFlatbuffersV2.java#L52"&gt;like this&lt;/a&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="kt"&gt;var&lt;/span&gt; &lt;span class="n"&gt;organizations&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;dataFactory&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getOrganizations&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;400_000&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
&lt;span class="nc"&gt;FlatBufferBuilder&lt;/span&gt; &lt;span class="n"&gt;builder&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;FlatBufferBuilder&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;
&lt;span class="nc"&gt;Map&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;String&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;Integer&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;strOffsets&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;HashMap&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&amp;gt;();&lt;/span&gt;   &lt;span class="c1"&gt;//&amp;lt;-- offsets cache&lt;/span&gt;

&lt;span class="kt"&gt;int&lt;/span&gt;&lt;span class="o"&gt;[]&lt;/span&gt; &lt;span class="n"&gt;orgsArr&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="kt"&gt;int&lt;/span&gt;&lt;span class="o"&gt;[&lt;/span&gt;&lt;span class="n"&gt;organizations&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;size&lt;/span&gt;&lt;span class="o"&gt;()];&lt;/span&gt;
&lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;contOrgs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;Org&lt;/span&gt; &lt;span class="n"&gt;org&lt;/span&gt; &lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="n"&gt;organizations&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
  &lt;span class="kt"&gt;int&lt;/span&gt;&lt;span class="o"&gt;[]&lt;/span&gt; &lt;span class="n"&gt;attributes&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="kt"&gt;int&lt;/span&gt;&lt;span class="o"&gt;[&lt;/span&gt;&lt;span class="n"&gt;org&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;attributes&lt;/span&gt;&lt;span class="o"&gt;().&lt;/span&gt;&lt;span class="na"&gt;size&lt;/span&gt;&lt;span class="o"&gt;()];&lt;/span&gt;
  &lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;contAttr&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt;
  &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;Attr&lt;/span&gt; &lt;span class="n"&gt;attr&lt;/span&gt; &lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="n"&gt;org&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;attributes&lt;/span&gt;&lt;span class="o"&gt;())&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
    &lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;idOffset&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;strOffsets&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;computeIfAbsent&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;attr&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="o"&gt;(),&lt;/span&gt; &lt;span class="nl"&gt;builder:&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="n"&gt;createString&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;  &lt;span class="c1"&gt;// &amp;lt;--&lt;/span&gt;
    &lt;span class="n"&gt;attributes&lt;/span&gt;&lt;span class="o"&gt;[&lt;/span&gt;&lt;span class="n"&gt;contAttr&lt;/span&gt;&lt;span class="o"&gt;++]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Attribute&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;createAttribute&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;builder&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; 
        &lt;span class="n"&gt;idOffset&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;attr&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;quantity&lt;/span&gt;&lt;span class="o"&gt;(),&lt;/span&gt; &lt;span class="n"&gt;attr&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;amount&lt;/span&gt;&lt;span class="o"&gt;(),&lt;/span&gt; &lt;span class="n"&gt;attr&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;size&lt;/span&gt;&lt;span class="o"&gt;(),&lt;/span&gt;
        &lt;span class="n"&gt;attr&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;percent&lt;/span&gt;&lt;span class="o"&gt;(),&lt;/span&gt; &lt;span class="n"&gt;attr&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;active&lt;/span&gt;&lt;span class="o"&gt;());&lt;/span&gt;
  &lt;span class="o"&gt;}&lt;/span&gt;
  &lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;attrsOffset&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Organization&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;createAttributesVector&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;builder&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;attributes&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;

  &lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;nameOffset&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;strOffsets&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;computeIfAbsent&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;org&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="o"&gt;(),&lt;/span&gt; &lt;span class="nl"&gt;builder:&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="n"&gt;createString&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;  &lt;span class="c1"&gt;// &amp;lt;--&lt;/span&gt;
  &lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;categoryOffset&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;strOffsets&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;computeIfAbsent&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;org&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;category&lt;/span&gt;&lt;span class="o"&gt;(),&lt;/span&gt; &lt;span class="nl"&gt;builder:&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="n"&gt;createString&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;  &lt;span class="c1"&gt;// &amp;lt;--&lt;/span&gt;
  &lt;span class="kt"&gt;byte&lt;/span&gt; &lt;span class="n"&gt;type&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;byte&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="n"&gt;org&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="o"&gt;().&lt;/span&gt;&lt;span class="na"&gt;ordinal&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;
  &lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;countryOffset&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;strOffsets&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;computeIfAbsent&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;org&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;country&lt;/span&gt;&lt;span class="o"&gt;(),&lt;/span&gt; &lt;span class="nl"&gt;builder:&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="n"&gt;createString&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;  &lt;span class="c1"&gt;// &amp;lt;--&lt;/span&gt;
  &lt;span class="n"&gt;orgsArr&lt;/span&gt;&lt;span class="o"&gt;[&lt;/span&gt;&lt;span class="n"&gt;contOrgs&lt;/span&gt;&lt;span class="o"&gt;++]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Organization&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;createOrganization&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;builder&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;nameOffset&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt;
      &lt;span class="n"&gt;categoryOffset&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;type&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;countryOffset&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;attrsOffset&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;organizationsOffset&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Organizations&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;createOrganizationsVector&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;builder&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;orgsArr&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
&lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;root_table&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Organizations&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;createOrganizations&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;builder&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;organizationsOffset&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
&lt;span class="n"&gt;builder&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;finish&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;root_table&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;

&lt;span class="k"&gt;try&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;var&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;FileOutputStream&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"/tmp/flatbuffer.json"&lt;/span&gt;&lt;span class="o"&gt;))&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
  &lt;span class="nc"&gt;InputStream&lt;/span&gt; &lt;span class="n"&gt;sizedInputStream&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;builder&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;sizedInputStream&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;
  &lt;span class="n"&gt;sizedInputStream&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;transferTo&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;Serialization time: 3,803 ms&lt;/li&gt;
&lt;li&gt;File size: 600 MB&lt;/li&gt;
&lt;li&gt;Compressed file size: 414 MB&lt;/li&gt;
&lt;li&gt;Memory needed: to store those 600 MB it will use a &lt;code&gt;ByteBuffer&lt;/code&gt; that will require 1 GB if you do not preconfigure it.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In this synthetic example, &lt;strong&gt;the reduction of the file size reaches more than 50%&lt;/strong&gt;, and despite having to constantly look up a Map, the time it takes to serialize is significantly lower.&lt;/p&gt;

&lt;p&gt;Because the format does not change, the deserialization retains the same properties and the &lt;strong&gt;read code does not change&lt;/strong&gt;. If you choose to read the entire file into memory, you will gain the same memory space, while if you read it as a mapped file, the change will not be noticeable except for the lower I/O usage.&lt;/p&gt;

&lt;p&gt;If we add this optimization to the comparison table, we have:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;JSON&lt;/th&gt;
&lt;th&gt;Protocol Buffers&lt;/th&gt;
&lt;th&gt;FlatBuffers&lt;/th&gt;
&lt;th&gt;FlatBuffers V2&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Serialization time&lt;/td&gt;
&lt;td&gt;11,718 ms&lt;/td&gt;
&lt;td&gt;5,823 ms&lt;/td&gt;
&lt;td&gt;5,639 ms&lt;/td&gt;
&lt;td&gt;3,803 ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;File size&lt;/td&gt;
&lt;td&gt;2,457 MB&lt;/td&gt;
&lt;td&gt;1,044 MB&lt;/td&gt;
&lt;td&gt;1,284 MB&lt;/td&gt;
&lt;td&gt;600 MB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;GZ file size&lt;/td&gt;
&lt;td&gt;525 MB&lt;/td&gt;
&lt;td&gt;448 MB&lt;/td&gt;
&lt;td&gt;530 MB&lt;/td&gt;
&lt;td&gt;414 MB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Memory serializing&lt;/td&gt;
&lt;td&gt;N/A&lt;/td&gt;
&lt;td&gt;1.29 GB&lt;/td&gt;
&lt;td&gt;1.3 GB - 2 GB&lt;/td&gt;
&lt;td&gt;0.6 GB - 1 GB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Deserialization time&lt;/td&gt;
&lt;td&gt;20,410 ms&lt;/td&gt;
&lt;td&gt;4,535 ms&lt;/td&gt;
&lt;td&gt;306 - 2,184 ms&lt;/td&gt;
&lt;td&gt;202 - 1,876 ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Memory deserialization&lt;/td&gt;
&lt;td&gt;2,193 MB&lt;/td&gt;
&lt;td&gt;2,710 MB&lt;/td&gt;
&lt;td&gt;0 - 1,284 MB&lt;/td&gt;
&lt;td&gt;0 - 600 MB&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;What we are actually doing is compressing the information at serialization time, taking advantage of the fact that we have some knowledge about the data. We are creating a &lt;a href="https://en.wikipedia.org/wiki/Dictionary_coder"&gt;compression dictionary&lt;/a&gt; manually.&lt;/p&gt;




&lt;h2&gt;
  
  
  Iterating again the idea
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Can this process of String compression or normalization be applied to data structures that you serialize?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;What if the tuple of attribute values were repeated between Organizations, can we reuse it?  &lt;strong&gt;Yes&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;We can reuse the pointer of a previously serialized tuple to be referenced from the serialization of another organization. We would only need to create a Map where the key is the Attr class and the value is the offset of its first occurrence: &lt;code&gt;Map&amp;lt;Attr, Integer&amp;gt; attrOffsets = new HashMap&amp;lt;&amp;gt;()&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;The example I have created for this article is synthetic and the values used are random. But in my real case, with data structures similar to the example, &lt;strong&gt;data is more repetitive and data reuse is higher&lt;/strong&gt;. &lt;/p&gt;

&lt;p&gt;In our real use case, we load the information from a MongoDB collection, and the numbers are these:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The records in MongoDB occupy 737 MB (which compresses on disk with &lt;a href="https://docs.mongodb.com/manual/core/wiredtiger/"&gt;WiredTiger&lt;/a&gt;, taking physically 510 MB)&lt;/li&gt;
&lt;li&gt;If we serialize the same records with FlatBuffers without optimizations, it consumes 390 MB&lt;/li&gt;
&lt;li&gt;If we further optimize by normalizing Strings, it takes 186 MB&lt;/li&gt;
&lt;li&gt;If we normalize the &lt;code&gt;Attr&lt;/code&gt; instances, the file requires 45 MB&lt;/li&gt;
&lt;li&gt;If we compress the file with Gzip, the resulting file is only 25 MB&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--zkx_Bhq---/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.jeronimo.dev/images/mongowiredtiger_vs_flatbuffers.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--zkx_Bhq---/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.jeronimo.dev/images/mongowiredtiger_vs_flatbuffers.png" alt="MongoDB Wiredtigler vs FlatBuffers optimizations" width="717" height="380"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Knowing how the technologies you use work, their strengths and weaknesses, and the principles they are based on, allows you to optimize their use and push them beyond the usual use cases you are told about.&lt;/p&gt;

&lt;p&gt;Knowing how your data looks like is also important when your data volume is high and your resources are limited. How can you structure your data to make it lighter? How can you structure your data to make it faster to process? &lt;/p&gt;

&lt;p&gt;In our case, these optimizations have allowed us to postpone a complex architecture change. In an online service, instead of taking 60 seconds to load a batch of information, we can reduce to 3-4 seconds.&lt;/p&gt;

&lt;p&gt;If the volume of data becomes really Big Data, I would advise you to use formats where these optimizations are already built-in, such as dictionaries or grouping data into columns. &lt;/p&gt;

</description>
      <category>java</category>
      <category>flatbuffers</category>
      <category>serialization</category>
    </item>
    <item>
      <title>Java Serialization with Protocol Buffers</title>
      <dc:creator>Jerónimo López</dc:creator>
      <pubDate>Tue, 01 Nov 2022 10:42:28 +0000</pubDate>
      <link>https://dev.to/jerolba/java-serialization-with-protocol-buffers-4pbj</link>
      <guid>https://dev.to/jerolba/java-serialization-with-protocol-buffers-4pbj</guid>
      <description>&lt;p&gt;At &lt;a href="https://clarity.ai/"&gt;Clarity AI&lt;/a&gt;, we generate batches of data that our application has to load and process to show our clients the social impact information of many companies. By volume of information, it is not Big Data, but it is enough to be a problem to read and load it efficiently in processes with online users.&lt;/p&gt;

&lt;p&gt;The information can be stored in a database or as files, serialized in a standard format and with a schema agreed with your Data Engineering team. Depending on your information and requirements, it can be as simple as CSV, XML or JSON, or Big Data formats such as &lt;a href="https://parquet.apache.org/documentation/latest/"&gt;Parquet&lt;/a&gt;, &lt;a href="https://avro.apache.org/docs/current/"&gt;Avro&lt;/a&gt;, &lt;a href="https://orc.apache.org/"&gt;ORC&lt;/a&gt;, &lt;a href="https://arrow.apache.org/"&gt;Arrow&lt;/a&gt;, or message  serialization formats like &lt;a href="https://developers.google.com/protocol-buffers"&gt;Protocol Buffers&lt;/a&gt;, &lt;a href="https://google.github.io/flatbuffers/"&gt;FlatBuffers&lt;/a&gt;, &lt;a href="https://msgpack.org/"&gt;MessagePack&lt;/a&gt;, &lt;a href="https://thrift.apache.org/"&gt;Thrift&lt;/a&gt;, or &lt;a href="https://capnproto.org/"&gt;Cap'n Proto&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;My idea is to analyze some of these systems over a collection of articles, using JSON as a reference, to compare all with something everyone knows.&lt;/p&gt;

&lt;p&gt;To allow anyone to reach their own conclusions according to their requirements, I will analyze different technical aspects: creation time, size of the resulting file, size of the file on compression (gzip), the memory needed to create the file, size of the libraries used, deserialization time or the memory needed to parse and access the data.&lt;/p&gt;

&lt;p&gt;In my case, because my access pattern is Write Once Read Many, in my final selection read factors will take preference over write factors.&lt;/p&gt;

&lt;h2&gt;
  
  
  Data model
&lt;/h2&gt;

&lt;p&gt;Given &lt;a href="https://github.com/jerolba/xbuffers-article/blob/master/src/main/java/com/jerolba/xbuffers/SampleDataFactory.java#L13"&gt;this data model&lt;/a&gt; of organizations and their attributes, using new Java records:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="n"&gt;record&lt;/span&gt; &lt;span class="nf"&gt;Org&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;String&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;String&lt;/span&gt; &lt;span class="n"&gt;category&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;String&lt;/span&gt; &lt;span class="n"&gt;country&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;Type&lt;/span&gt; &lt;span class="n"&gt;type&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt;
           &lt;span class="nc"&gt;List&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;Attr&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;attributes&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt; &lt;span class="o"&gt;}&lt;/span&gt;

&lt;span class="n"&gt;record&lt;/span&gt; &lt;span class="nf"&gt;Attr&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;String&lt;/span&gt; &lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;byte&lt;/span&gt; &lt;span class="n"&gt;quantity&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;byte&lt;/span&gt; &lt;span class="n"&gt;amount&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;boolean&lt;/span&gt; &lt;span class="n"&gt;active&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; 
            &lt;span class="kt"&gt;double&lt;/span&gt; &lt;span class="n"&gt;percent&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;short&lt;/span&gt; &lt;span class="n"&gt;size&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt; &lt;span class="o"&gt;}&lt;/span&gt;

&lt;span class="kd"&gt;enum&lt;/span&gt; &lt;span class="nc"&gt;Type&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt; &lt;span class="no"&gt;FOO&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="no"&gt;BAR&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="no"&gt;BAZ&lt;/span&gt; &lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I will simulate a heavy scenario, where each Organization will randomly have between 40 and 70 different Attributes, and in total, we will have about 400K organizations. The values of the attributes are also randomized.&lt;/p&gt;




&lt;h2&gt;
  
  
  JSON
&lt;/h2&gt;

&lt;p&gt;It is the data interchange format par excellence and the facto standard in web services communication. Although it was defined as a format in the early 2000s, it was not until 2013 that the &lt;a href="https://www.ecma-international.org/"&gt;Ecma&lt;/a&gt; published the first version, which became an international standard in 2017.&lt;/p&gt;

&lt;p&gt;With a simple syntax, it does not need to predefine the schema to be able to parse it. Because it is a plain text-based format, it is human readable and there are a lot of libraries to process it in all languages.&lt;/p&gt;

&lt;h3&gt;
  
  
  Serialization
&lt;/h3&gt;

&lt;p&gt;Using a library like &lt;a href="https://github.com/FasterXML/jackson"&gt;Jackson&lt;/a&gt; and without special annotations, the code is &lt;a href="https://github.com/jerolba/xbuffers-article/blob/master/src/main/java/com/jerolba/xbuffers/ToJson.java#L22"&gt;very simple&lt;/a&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="kt"&gt;var&lt;/span&gt; &lt;span class="n"&gt;organizations&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;dataFactory&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getOrganizations&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;400_000&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;

&lt;span class="nc"&gt;ObjectMapper&lt;/span&gt; &lt;span class="n"&gt;mapper&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;ObjectMapper&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;
&lt;span class="k"&gt;try&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;var&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;FileOutputStream&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"/tmp/organizations.json"&lt;/span&gt;&lt;span class="o"&gt;))&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
  &lt;span class="n"&gt;mapper&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;writeValue&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;organizations&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Metrics:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Serialization time: 11,718 ms&lt;/li&gt;
&lt;li&gt;File size: 2,457 MB&lt;/li&gt;
&lt;li&gt;Compressed file size: 525 MB&lt;/li&gt;
&lt;li&gt;Memory required: because we are serializing directly to &lt;code&gt;OutputStream&lt;/code&gt;, it consumes nothing apart from the required internal IO buffers.&lt;/li&gt;
&lt;li&gt;Library size (jackson-xxx.jar): 1,956,679 bytes&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Deserialization
&lt;/h3&gt;

&lt;p&gt;It is difficult to define how to measure deserialization in this analysis. The objective is to recover the state of the entities from their binary representation, but which entities? the original class or the one provided by the tool with a similar interface?&lt;/p&gt;

&lt;p&gt;To exploit the capabilities and strengths of each tool, we will keep the representation of the entity generated by each library. In the case of JSON will be the original class, but in other libraries will be a different class.&lt;/p&gt;

&lt;p&gt;Jackson makes it very easy again and solves the problem with only &lt;a href="https://github.com/jerolba/xbuffers-article/blob/master/src/main/java/com/jerolba/xbuffers/FromJson.java#L29"&gt;3 lines&lt;/a&gt; of code:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="k"&gt;try&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;InputStream&lt;/span&gt; &lt;span class="n"&gt;is&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;FileInputStream&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"/tmp/organizations.json"&lt;/span&gt;&lt;span class="o"&gt;))&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
  &lt;span class="nc"&gt;ObjectMapper&lt;/span&gt; &lt;span class="n"&gt;mapper&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;ObjectMapper&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;
  &lt;span class="nc"&gt;List&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;Org&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;organizations&lt;/span&gt; &lt;span class="n"&gt;mapper&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;readValue&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;is&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;TypeReference&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;List&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;Org&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&amp;gt;()&lt;/span&gt; &lt;span class="o"&gt;{});&lt;/span&gt;
  &lt;span class="o"&gt;....&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;Deserialization time: 20,410 ms&lt;/li&gt;
&lt;li&gt;Memory required: because is reconstructing the original object structures, they consume 2,193 MB&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Protocol Buffers
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://developers.google.com/protocol-buffers"&gt;Protocol Buffers&lt;/a&gt; is the system that Google developed internally to serialize data in a way that is both CPU and storage efficient, especially if you &lt;strong&gt;compare it with how it was done in those days with XML&lt;/strong&gt;. In 2008 it was released under BSD license.&lt;/p&gt;

&lt;p&gt;It is based on predefining what format the data will have through an IDL (Interface Definition Language) and from it, generating the source code that will be able to either write and read files with data. The producer and the consumer must somehow share the format defined in the IDL. &lt;/p&gt;

&lt;p&gt;The format is flexible enough to support adding new fields and deprecating existing fields without breaking compatibility.&lt;/p&gt;

&lt;p&gt;The information generated after serialization is an array of bytes, &lt;strong&gt;unreadable to humans&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Support for different programming languages is provided by the existence of a code generator for each language. If a language is not officially supported you can always &lt;a href="https://github.com/protocolbuffers/protobuf/blob/master/docs/third_party.md"&gt;find an implementation&lt;/a&gt; from the community.&lt;/p&gt;

&lt;p&gt;Google has standardized it and made it the base of their server-to-server communication mechanism: &lt;a href="https://grpc.io/"&gt;gRPC&lt;/a&gt;, instead of the usual REST with JSON.&lt;/p&gt;

&lt;p&gt;Although the &lt;a href="https://developers.google.com/protocol-buffers/docs/techniques#large-data"&gt;documentation discourages&lt;/a&gt; its use with large datasets and proposes to split large collections into a "concatenation" of individual objects serialization, I'm going to evaluate and try it.&lt;/p&gt;

&lt;h3&gt;
  
  
  IDL and code generation
&lt;/h3&gt;

&lt;p&gt;The file with the schema could be &lt;a href="https://github.com/jerolba/xbuffers-article/blob/master/src/main/resources/organizations.proto"&gt;this&lt;/a&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;syntax &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"proto3"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

package com.jerolba.xbuffers.protocol&lt;span class="p"&gt;;&lt;/span&gt;

option java_multiple_files &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;true&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
option java_package &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"com.jerolba.xbuffers.protocol"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
option java_outer_classname &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"OrganizationsCollection"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

message Organization &lt;span class="o"&gt;{&lt;/span&gt;
  string name &lt;span class="o"&gt;=&lt;/span&gt; 1&lt;span class="p"&gt;;&lt;/span&gt;
  string category &lt;span class="o"&gt;=&lt;/span&gt; 2&lt;span class="p"&gt;;&lt;/span&gt;
  OrganizationType &lt;span class="nb"&gt;type&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; 3&lt;span class="p"&gt;;&lt;/span&gt;
  string country &lt;span class="o"&gt;=&lt;/span&gt; 4&lt;span class="p"&gt;;&lt;/span&gt;
  repeated Attribute attributes &lt;span class="o"&gt;=&lt;/span&gt; 5&lt;span class="p"&gt;;&lt;/span&gt;

  enum OrganizationType &lt;span class="o"&gt;{&lt;/span&gt;
    FOO &lt;span class="o"&gt;=&lt;/span&gt; 0&lt;span class="p"&gt;;&lt;/span&gt;
    BAR &lt;span class="o"&gt;=&lt;/span&gt; 1&lt;span class="p"&gt;;&lt;/span&gt;
    BAZ &lt;span class="o"&gt;=&lt;/span&gt; 2&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="o"&gt;}&lt;/span&gt;

  message Attribute &lt;span class="o"&gt;{&lt;/span&gt;
    string &lt;span class="nb"&gt;id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; 1&lt;span class="p"&gt;;&lt;/span&gt;
    int32 quantity &lt;span class="o"&gt;=&lt;/span&gt; 2&lt;span class="p"&gt;;&lt;/span&gt;
    int32 amount &lt;span class="o"&gt;=&lt;/span&gt; 3&lt;span class="p"&gt;;&lt;/span&gt;
    int32 size &lt;span class="o"&gt;=&lt;/span&gt; 4&lt;span class="p"&gt;;&lt;/span&gt;
    double percent &lt;span class="o"&gt;=&lt;/span&gt; 5&lt;span class="p"&gt;;&lt;/span&gt;
    bool active &lt;span class="o"&gt;=&lt;/span&gt; 6&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="o"&gt;}&lt;/span&gt;

&lt;span class="o"&gt;}&lt;/span&gt;

message Organizations &lt;span class="o"&gt;{&lt;/span&gt;
  repeated Organization organizations &lt;span class="o"&gt;=&lt;/span&gt; 1&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To generate all Java classes you need to install the compiler &lt;code&gt;protoc&lt;/code&gt; and execute it with some parameters referencing where the IDL file is located and the target path of the generated files:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;protoc &lt;span class="nt"&gt;--java_out&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;./src/main/java &lt;span class="nt"&gt;-I&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;./src/main/resources ./src/main/resources/organizations.proto
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You have all the instructions &lt;a href="https://developers.google.com/protocol-buffers/docs/proto#generating"&gt;here&lt;/a&gt; and you can download the compiler utility from &lt;a href="https://developers.google.com/protocol-buffers/docs/downloads"&gt;here&lt;/a&gt;, but I prefer to use directly docker with an image ready to execute the command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt; docker run &lt;span class="nt"&gt;--rm&lt;/span&gt; &lt;span class="nt"&gt;-v&lt;/span&gt; &lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;pwd&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;:&lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;pwd&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt; &lt;span class="nt"&gt;-w&lt;/span&gt; &lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;pwd&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt; znly/protoc &lt;span class="nt"&gt;--java_out&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;./src/main/java &lt;span class="nt"&gt;-I&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;./src/main/resources ./src/main/resources/organizations.proto
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Serialization
&lt;/h3&gt;

&lt;p&gt;Protocol Buffers does not directly serialize your POJOs, but you need to copy the information to the objects generated by the schema compiler.&lt;/p&gt;

&lt;p&gt;The code needed to serialize the information from the POJOs would look like &lt;a href="https://github.com/jerolba/xbuffers-article/blob/master/src/main/java/com/jerolba/xbuffers/ToProtocolBuffers.java#L26"&gt;this&lt;/a&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="kt"&gt;var&lt;/span&gt; &lt;span class="n"&gt;organizations&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;dataFactory&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getOrganizations&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;400_000&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;

&lt;span class="kt"&gt;var&lt;/span&gt; &lt;span class="n"&gt;orgsBuilder&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Organizations&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;newBuilder&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;Org&lt;/span&gt; &lt;span class="n"&gt;org&lt;/span&gt; &lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="n"&gt;organizations&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
    &lt;span class="kt"&gt;var&lt;/span&gt; &lt;span class="n"&gt;organizationBuilder&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Organization&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;newBuilder&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt;
        &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;setName&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;org&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="o"&gt;())&lt;/span&gt;
        &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;setCategory&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;org&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;category&lt;/span&gt;&lt;span class="o"&gt;())&lt;/span&gt;
        &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;setCountry&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;org&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;country&lt;/span&gt;&lt;span class="o"&gt;())&lt;/span&gt;
        &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;setType&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;OrganizationType&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;forNumber&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;org&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="o"&gt;().&lt;/span&gt;&lt;span class="na"&gt;ordinal&lt;/span&gt;&lt;span class="o"&gt;()));&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;Attr&lt;/span&gt; &lt;span class="n"&gt;attr&lt;/span&gt; &lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="n"&gt;org&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;attributes&lt;/span&gt;&lt;span class="o"&gt;())&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
        &lt;span class="kt"&gt;var&lt;/span&gt; &lt;span class="n"&gt;attribute&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Attribute&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;newBuilder&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt;
            &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;setId&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;attr&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="o"&gt;())&lt;/span&gt;
            &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;setQuantity&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;attr&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;quantity&lt;/span&gt;&lt;span class="o"&gt;())&lt;/span&gt;
            &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;setAmount&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;attr&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;amount&lt;/span&gt;&lt;span class="o"&gt;())&lt;/span&gt;
            &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;setActive&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;attr&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;active&lt;/span&gt;&lt;span class="o"&gt;())&lt;/span&gt;
            &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;setPercent&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;attr&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;percent&lt;/span&gt;&lt;span class="o"&gt;())&lt;/span&gt;
            &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;setSize&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;attr&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;size&lt;/span&gt;&lt;span class="o"&gt;())&lt;/span&gt;
            &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;build&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;
        &lt;span class="n"&gt;organizationBuilder&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;addAttributes&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;attribute&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
    &lt;span class="o"&gt;}&lt;/span&gt;
    &lt;span class="n"&gt;orgsBuilder&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;addOrganizations&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;organizationBuilder&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;build&lt;/span&gt;&lt;span class="o"&gt;());&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;span class="nc"&gt;Organizations&lt;/span&gt; &lt;span class="n"&gt;orgsBuffer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;orgsBuilder&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;build&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;
&lt;span class="k"&gt;try&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;var&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;FileOutputStream&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"/tmp/organizations.protobuffer"&lt;/span&gt;&lt;span class="o"&gt;))&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
  &lt;span class="n"&gt;orgsBuffer&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;writeTo&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The code is verbose, but simple. If for some reason you decided to make your business logic work directly with the Protocol Buffers generated classes, all that copy code would be unnecessary.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Serialization time: 5,823 ms&lt;/li&gt;
&lt;li&gt;File size: 1,044 MB&lt;/li&gt;
&lt;li&gt;Compressed file size: 448 MB&lt;/li&gt;
&lt;li&gt;Memory required: instantiating all these intermediate objects in memory before serializing them requires 1,315 MB&lt;/li&gt;
&lt;li&gt;Library size (protobuf-java-3.16.0.jar): 1,675,739 bytes&lt;/li&gt;
&lt;li&gt;Size of generated classes: 41,229 bytes&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Deserialization
&lt;/h3&gt;

&lt;p&gt;It is also quite straightforward, and it is enough to pass an &lt;code&gt;InputStream&lt;/code&gt; to &lt;a href="https://github.com/jerolba/xbuffers-article/blob/master/src/main/java/com/jerolba/xbuffers/FromProtocolBuffers.java#L26"&gt;rebuild in memory&lt;/a&gt; the whole object graph:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="k"&gt;try&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;InputStream&lt;/span&gt; &lt;span class="n"&gt;is&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;FileInputStream&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"/tmp/organizations.protobuffer"&lt;/span&gt;&lt;span class="o"&gt;))&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
    &lt;span class="nc"&gt;Organizations&lt;/span&gt; &lt;span class="n"&gt;organizations&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Organizations&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;parseFrom&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;is&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
    &lt;span class="o"&gt;.....&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The objects are instances of the classes generated from the schema, not the original records.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Deserialization time: 4,535 ms&lt;/li&gt;
&lt;li&gt;Memory required: reconstructing the object structures defined by the schema takes up 2,710 MB&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Analysis and impressions
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;JSON&lt;/th&gt;
&lt;th&gt;Protocol Buffers&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Serialization time&lt;/td&gt;
&lt;td&gt;11,718 ms&lt;/td&gt;
&lt;td&gt;5,823 ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;File size&lt;/td&gt;
&lt;td&gt;2,457 MB&lt;/td&gt;
&lt;td&gt;1,044 MB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;GZ file size&lt;/td&gt;
&lt;td&gt;525 MB&lt;/td&gt;
&lt;td&gt;448 MB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Memory serializing&lt;/td&gt;
&lt;td&gt;N/A&lt;/td&gt;
&lt;td&gt;1.29 GB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Deserialization time&lt;/td&gt;
&lt;td&gt;20,410 ms&lt;/td&gt;
&lt;td&gt;4,535 ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Memory deserialization&lt;/td&gt;
&lt;td&gt;2,193 MB&lt;/td&gt;
&lt;td&gt;2,710 MB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;JAR library size&lt;/td&gt;
&lt;td&gt;1,910 KB&lt;/td&gt;
&lt;td&gt;1,636 KB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Size of generated classes&lt;/td&gt;
&lt;td&gt;N/A&lt;/td&gt;
&lt;td&gt;40 KB&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;From the data and from what I have been able to see by playing with the formats, we can conclude:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;JSON is undoubtedly slower and heavier, but it is a very comfortable and convenient system.&lt;/li&gt;
&lt;li&gt;Protocol Buffers is a good choice: fast serialization and deserialization, and with compact and compressible files. Its API is simple and intuitive, and the format is widely used, becoming a market standard with multiple use cases.&lt;/li&gt;
&lt;li&gt;Both formats are all or nothing: &lt;strong&gt;you need to deserialize all the information to access a part of it&lt;/strong&gt;. While in other tools you can access any element without having to traverse and parse all the information that precedes it.&lt;/li&gt;
&lt;li&gt;Protocol Buffers uses 32-bit integers for all integer types (int, short, byte), but when serializing &lt;strong&gt;represents them with the value that takes the least bytes&lt;/strong&gt;, saving some bytes.&lt;/li&gt;
&lt;li&gt;For the same reason, when Protocol Buffers generates the classes from the IDL, it defines everything as &lt;strong&gt;32-bit integers&lt;/strong&gt; and deserializes the values to int32.  That is why the memory consumption of deserialized objects is 23% higher than JSON. What it gains by compressing integers it loses in memory consumption.&lt;/li&gt;
&lt;li&gt;For scalar values, Protocol Buffers &lt;strong&gt;does not support null&lt;/strong&gt;. If a value is not present it takes the default value of the corresponding primitive (0, 0.0, or false), while Null Strings are deserialized to "". &lt;a href="https://itnext.io/protobuf-and-null-support-1908a15311b6"&gt;This article&lt;/a&gt; proposes different strategies to simulate null values.&lt;/li&gt;
&lt;li&gt;Although in the original object graph the name of the countries or categories are Strings that are only present in memory once, when serialized to a binary format they will appear as many times as references you have, and when deserialized you will have as many instances as references you had. That is why in both cases, the memory consumed when deserializing &lt;strong&gt;occupies twice the memory occupied by the objects originally serialized&lt;/strong&gt;&lt;sup&gt;1&lt;/sup&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;span id="1"&gt;&lt;sup&gt;1&lt;/sup&gt;&lt;/span&gt; Since Java 8u20, thanks to &lt;a href="https://openjdk.java.net/jeps/192"&gt;JEP 192&lt;/a&gt;, we have an option in G1 to de-duplicate Strings during garbage collection. But it is disabled by default and when it is enabled we have no control over when it is executed, so we can't rely on that optimization to reduce the size of deserialization. &lt;/p&gt;

</description>
      <category>java</category>
      <category>protocolbuffers</category>
      <category>serialization</category>
    </item>
  </channel>
</rss>
