DEV Community

Cover image for BSON and OSON: documents are designed to be nested, not flat
Franck Pachot
Franck Pachot

Posted on

BSON and OSON: documents are designed to be nested, not flat

I like vendors benchmarks as they are often good illustrations of worst practice. Rather than focusing on real implementation trade-offs, they pick an extreme case that favors their own implementation by chance but is not optimized in the competitor's — because it is simply not how that technology is meant to be used. For example, Oracle published a "2x" benchmark results using YCSB, a key-value benchmark (slide 14 here), and a "529x" test misusing the raw BSON purpose (here). Both have something in common: they compare a relational database against a document database by using neither technology as it was designed to be used — no normalized tables, no nested documents, just flat key-value fields. The latter test is even worse: it uses 1,000 top-level fields in a single document. Don't do that!


While having thousands of columns in an SQL table is usually undesirable, it's acceptable to have hundreds or even thousands of fields in a document, as they represent multiple entities and value objects. However, nobody creates a flat structure with 1000 top-level fields. The advantage of JSON is its ability to organize entities into nested sub-documents. For instance, instead of storing first_name and last_name as separate fields, you can have a name field containing a sub-object with first and last. During queries, using dot notation to reference name.first and name.last makes no difference with first_name or last_name. It simplifies reading and displaying the document, and, as it is how it should be used, the binary JSON formats are optimized for it.

To illustrate this, I create two collections, flat with one thousand fields at the top level, and nest with ten top-level fields, each containing a sub-object with ten sub-objects.


{
echo field{0..9}{0..9}{0..9}
for i in {1..100000} ; do echo {0..9}{0..9}{0..9} ; done
} | sed 's/ /,/g' |
 mongoimport --collection "flat" --type=csv --headerline --drop

{
echo field{0..9}.sub{0..9}.sub{0..9}
for i in {1..100000} ; do echo {0..9}{0..9}{0..9} ; done
} | sed 's/ /,/g' |
 mongoimport --collection "nest" --type=csv --headerline --drop

Enter fullscreen mode Exit fullscreen mode

The flat documents are like:

  { _id: ObjectId('6a1a1ed41d55219677a79c22'),
  field000: 0, field001: 1, field002: 2, field003: 3, field004: 4, field005: 5, field006: 6, field007: 7, field008: 8, field009: 9,
  field010: 10, field011: 11, field012: 12, field013: 13, field014: 14, field015: 15, field016: 16, field017: 17, field018: 18, field019: 19,
  field020: 20, field021: 21, field022: 22, field023: 23, field024: 24, field025: 25, field026: 26, field027: 27, field028: 28, field029: 29,
  field030: 30, field031: 31, field032: 32, field033: 33, field034: 34, field035: 35, field036: 36, field037: 37, field038: 38,
...
Enter fullscreen mode Exit fullscreen mode

The nested documents are like:

  {                                                                                                                                                                                               
    _id: ObjectId('6a1a160041c5538a7e93a9c5'),
    field0: {
      sub0: { sub0: 0, sub1: 1, sub2: 2, sub3: 3, sub4: 4, sub5: 5, sub6: 6, sub7: 7, sub8: 8, sub9: 9 },
      sub1: { sub0: 10, sub1: 11, sub2: 12, sub3: 13, sub4: 14, sub5: 15, sub6: 16, sub7: 17, sub8: 18, sub9: 19 },
      sub2: { sub0: 20, sub1: 21, sub2: 22, sub3: 23, sub4: 24, sub5: 25, sub6: 26, sub7: 27, sub8: 28, sub9: 29 },
      sub3: { sub0: 30, sub1: 31, sub2: 32, sub3: 33, sub4: 34, sub5: 35,
...
Enter fullscreen mode Exit fullscreen mode

They hold the same values, with a different field structure.

BSON serialization format optimization

BSON is designed for efficient sequential scanning through network or disk access. Each field is stored one after another, with a small header containing the field type and name, followed by the value. Because every value type has either a fixed size (like a 64-bit integer or a double) or an explicit length prefix (like a string or a binary blob), the parser can skip over any field it doesn't care about without reading its value at all — it simply jumps forward by the declared length.

Nested sub-documents and arrays are stored the same way: the type is object or array, followed by the total byte length of the entire nested structure, followed by the nested fields themselves (for arrays, the field names are simply "0", "1", "2", and so on). That length prefix is the key optimization: if the parser is looking for a top-level field and the current field is a sub-document that doesn't match, it can skip the entire sub-document in one jump — all its nested fields, however deep — without reading a single byte inside it.

When you query a field using dot notation, such as "name.first", the parser works level by level. It scans the top-level fields looking for name. Any top-level field that doesn't match name is skipped, including any sub-documents, in a single jump. Once name is found and confirmed to be of type object, the parser steps into that sub-document and begins scanning its fields for first, again skipping anything that doesn't match. It never needs to read the sibling sub-documents of name at all.

The worst case is a field that doesn't exist: the parser must scan every field at the relevant nesting level before concluding it isn't there. But even then, sub-documents and arrays that are clearly irrelevant — because their parent name doesn't match — are each skipped in a single operation regardless of how large or deeply nested they are.

This is precisely why organizing many fields into a nested hierarchy is not just a cosmetic preference. A flat document with 1,000 top-level fields forces the parser to read 1,000 field names in the worst case. A document with 10 top-level fields, each containing 10 sub-fields, each containing 10 values, holds the same 1,000 values but the parser only ever reads at most 10 + 10 + 10 = 30 field names to locate any one of them. The BSON length prefix turns the nested structure into an implicit index, making the format genuinely faster to navigate at scale.

With the collection flat that has one thousand top-level fields from "field000" to "field999", looking for the last field "field999" takes more than one second:


db.flat.find(
 { "field999" : {$lt:42} }
).explain("executionStats").executionStats
;

{
  executionSuccess: true,
  nReturned: 0,
  executionTimeMillis: 1081,
  totalKeysExamined: 0,
  totalDocsExamined: 100000,
  executionStages: {
    isCached: false,
    stage: 'COLLSCAN',
    filter: { field999: { '$lt': 42 } },
    nReturned: 0,
    executionTimeMillisEstimate: 1070,
    works: 100001,
    advanced: 0,
    needTime: 100000,
    needYield: 0,
    saveState: 64,
    restoreState: 64,
    isEOF: 1,
    direction: 'forward',
    docsExamined: 100000
  }
}
Enter fullscreen mode Exit fullscreen mode

With the collection nest that has ten top-level fields from "field0" to "field9", and two levels of ten sub-objects from "sub0" to "sub9", looking for the last field "field9.sub9.sub9" is two times faster:


db.nest.find(
 { "field9.sub9.sub9" : {$lt:42} }
).explain("executionStats").executionStats
;

{
  executionSuccess: true,
  nReturned: 0,
  executionTimeMillis: 424,
  totalKeysExamined: 0,
  totalDocsExamined: 100000,
  executionStages: {
    isCached: false,
    stage: 'COLLSCAN',
    filter: { 'field9.sub9.sub9': { '$lt': 42 } },
    nReturned: 0,
    executionTimeMillisEstimate: 420,
    works: 100001,
    advanced: 0,
    needTime: 100000,
    needYield: 0,
    saveState: 21,
    restoreState: 21,
    isEOF: 1,
    direction: 'forward',
    docsExamined: 100000
  }
}
Enter fullscreen mode Exit fullscreen mode

A major part of the time is fetching the document from storage. Navigating into it is a smaller part, but the difference is still clearly visible: 1,081 ms versus 424 ms for 100,000 documents, with the same number of values, and the same worst-case field to find: the last one. The only difference is how those fields are arranged. Nesting them three levels deep, with ten branches at each level, makes the collection scan faster — with no index, no schema change, and no query rewrite beyond the dot notation that you would use anyway.

This is not a micro-optimization or an edge case. Any collection scan, whether triggered by a missing index, a low-selectivity filter, or a background analytics query, pays this cost on every document it reads. The deeper and wider your documents are, the more the BSON length-prefix trick pays off, because the parser can leap over entire branches of the document tree in a single bounds check.

What about Oracle's OSON format?

Oracle Database stores JSON documents in its proprietary binary format called OSON. At first glance, OSON appears to take a fundamentally different approach: rather than storing field names inline with each value as BSON does, OSON builds a field name dictionary at the beginning of the document. Think of it like a mini-datastore with it's catalog and indexes, rather than a protocol buffer. Each field in the document body then refers to its name by a short numeric ID rather than repeating the full string. This makes individual field names cheaper to compare and reduces document size when the same field name appears many times.

Given that design, you might expect nesting to make no difference in Oracle: if every field is just a numeric ID anyway, scanning 1,000 flat fields should cost the same as scanning 10 + 10 + 10 fields spread across three levels. The dictionary lookup cost is the same either way.

In practice, however, the same experiment on Oracle shows a very similar result to MongoDB.

Querying field999 on the flat table takes about 1.6 seconds:


set autotrace traceonly

select /*+ gather_plan_statistics */ "DATA" from "ORA"."flat"
 where JSON_EXISTS("DATA",'$?(@.field999.numberOnly() < $B0)'
 passing 42 as "B0" type(strict))
;

PLAN_TABLE_OUTPUT
________________________________________________________________________________________________
SQL_ID  1h98k751w7dws, child number 0
-------------------------------------
select /*+ gather_plan_statistics */ "DATA" from "ORA"."flat"  where
JSON_EXISTS("DATA",'$?(@.field999.numberOnly() < $B0)'  passing 42 as
"B0" type(strict))

Plan hash value: 4073748891
---------------------------------------------------------------------------------------------
| Id  | Operation         | Name | Starts | E-Rows | A-Rows |   A-Time   | Buffers | Reads  |
---------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT  |      |      1 |        |      0 |00:00:01.60 |     301K|    739 |
|*  1 |  TABLE ACCESS FULL| flat |      1 |     12 |      0 |00:00:01.60 |     301K|    739 |
---------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   1 - filter(JSON_EXISTS2( /*+ QJSNMD_USE_SHR_XCTX */ "DATA" /*+ LOB_BY_VALUE */
              FORMAT OSON , '$?(@.field999.numberOnly() < $B0)' /* json_path_str
              $?(@.field999.numberOnly() < $B0)  */  PASSING 42 AS "B0" FALSE ON ERROR
              TYPE(STRICT) )=1)

Note
-----
   - dynamic statistics used: dynamic sampling (level=2)


Statistics
-----------------------------------------------------------
             117  CPU used by this session
             117  CPU used when call started
             160  DB time
         1174164  RM usage by this session
               3  Requests to/from client
             738  Session total flash IO requests
         6053888  cell physical IO interconnect bytes
          301074  consistent gets
          301074  consistent gets from cache
          301074  consistent gets pin
            1068  consistent gets pin (fastpath)
             739  gcs data block access records
               1  messages sent
             747  non-idle wait count
              48  non-idle wait time
               2  opened cursors cumulative
               1  opened cursors current
             738  physical read total IO requests
         6053888  physical read total bytes
               2  process last non-idle time
          301074  session logical reads
              48  user I/O wait time
               4  user calls
Enter fullscreen mode Exit fullscreen mode

Querying field9.sub9.sub9 on the nested table takes about 1.0 second:


select /*+ gather_plan_statistics */ "DATA" from "ORA"."nest"
 where JSON_EXISTS("DATA",'$?(@.field9.sub9.sub9.numberOnly() < $B0)'
 passing 42 as "B0" type(strict))
;

PLAN_TABLE_OUTPUT
__________________________________________________________________________________________________
SQL_ID  7yuwrup6rscrc, child number 0
-------------------------------------
select /*+ gather_plan_statistics */ "DATA" from "ORA"."nest"   where
JSON_EXISTS("DATA",'$?(@.field9.sub9.sub9.numberOnly() < $B0)'
passing 42 as "B0" type(strict))

Plan hash value: 3225864993
---------------------------------------------------------------------------------------------
| Id  | Operation         | Name | Starts | E-Rows | A-Rows |   A-Time   | Buffers | Reads  |
---------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT  |      |      1 |        |      0 |00:00:01.04 |     100K|   1143 |
|*  1 |  TABLE ACCESS FULL| nest |      1 |   1000 |      0 |00:00:01.04 |     100K|   1143 |
---------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   1 - filter(JSON_EXISTS2( /*+ QJSNMD_USE_SHR_XCTX */ "DATA" /*+ LOB_BY_VALUE */
              FORMAT OSON , '$?(@.field9.sub9.sub9.numberOnly() < $B0)' /* json_path_str
              $?(@.field9.sub9.sub9.numberOnly() < $B0)  */  PASSING 42 AS "B0" FALSE ON ERROR
              TYPE(STRICT) )=1)


Statistics
-----------------------------------------------------------
              40  CPU used by this session
              40  CPU used when call started
             106  DB time
          404934  RM usage by this session
               4  Requests to/from client
            1125  Session total flash IO requests
         9363456  cell physical IO interconnect bytes
          100221  consistent gets
          100221  consistent gets from cache
          100221  consistent gets pin
           99096  consistent gets pin (fastpath)
            1142  gcs data block access records
             158  global enqueue gets sync
             158  global enqueue releases
            1134  non-idle wait count
              71  non-idle wait time
               2  opened cursors cumulative
               1  opened cursors current
            1125  physical read total IO requests
         9363456  physical read total bytes
              61  process last non-idle time
          100221  session logical reads
              71  user I/O wait time
               4  user calls

Enter fullscreen mode Exit fullscreen mode

The most telling number is not the execution time but the Buffers column: 301,074 consistent gets for the flat table versus 100,221 for the nested table. Oracle is reading three times as many data blocks for the flat documents. To understand why, I check the actual segment sizes:


with
table_seg (owner,table_name, table_blocks) as ( select owner, segment_name, sum(blocks) from dba_segments group by owner,segment_name),
lob_seg   (owner,table_name,   lob_blocks) as ( select owner, table_name,   sum(blocks) from dba_lobs join dba_segments using (owner, segment_name) group by owner,table_name),
tab_stats (owner,table_name,  avg_row_len) as ( select owner, table_name,   avg_row_len from dba_tables)
select owner, table_name, table_blocks, nvl(lob_blocks, 0) as lob_blocks, table_blocks + nvl(lob_blocks, 0) as total_blocks, avg_row_len
 from tab_stats natural join table_seg natural left join lob_seg
 where owner = 'ORA' and table_name in ('flat', 'nest')
;

OWNER    TABLE_NAME       TABLE_BLOCKS    LOB_BLOCKS    TOTAL_BLOCKS    AVG_ROW_LEN
________ _____________ _______________ _____________ _______________ ______________
ORA      flat                    1,152       317,608         318,760            166
ORA      nest                  100,992            32         101,024          7,724

Enter fullscreen mode Exit fullscreen mode

The segment query reveals the full story. The nested documents have an average row length of 7,724 bytes and are stored entirely inline in the table segment's blocks, with almost no LOB blocks. The flat documents have an average row length of only 166 bytes in the table segment, but they spill into 317,608 LOB blocks. That tiny average row length for flat is not a sign of small documents — it is the sign of a pointer because large documents do not fit in fixed-sized table blocks. The actual OSON bytes have been pushed out of the row into a separate LOB segment managed by Oracle's securefile infrastructure, and every document access requires an additional pointer dereference to fetch them from there.

The root cause is the OSON dictionary. OSON stores each distinct field name only once in a per-document dictionary and replaces every occurrence in the document body with a short numeric ID. The dictionary itself is the mechanism that compresses the field names. A flat document with 1,000 entirely unique field names like field000 through field999 requires a dictionary with 1,000 entries, one per distinct string, with no repetition to exploit. The resulting OSON document is large enough to exceed Oracle's block size and gets stored out of row as a LOB. A nested document whose entire structure uses only 20 distinct names — field0 through field9 and sub0 through sub9 — has a dictionary with just 20 short entries. Furthermore, objects that share the same field structure can reuse each other's field ID array entirely rather than repeating it. The resulting document is compact enough to be stored inline, directly in the table block, with no LOB indirection at all.

The consequence is dramatic here. Scanning the flat collection means fetching a pointer from the table block, then chasing it to a LOB block, for every single one of the 100,000 documents. Scanning the nested collection means reading the document directly from the table block. That is the difference between 301,074 consistent gets and 100,221.

So the two formats reach the same conclusion by different paths. BSON benefits from nesting because the length-prefix on sub-documents lets the parser skip entire branches with a single jump, reducing the number of field names it reads. OSON benefits from nesting because the dictionary compresses repeated field names, shrinking the documents enough to keep them inline in the table and avoiding the cost of LOB storage entirely.

Conclusion

The underlying mechanisms differ, but the guidance is the same: documents are nested structures that aggregate multiple entities and value objects. Don’t chase performance with documents containing 1,000 top-level fields — fix the data model first.

The practical takeaway is straightforward: model your documents the way you naturally think about the data — as a hierarchy of related objects — and the binary JSON format your database uses will reward you for it, whether that is BSON in MongoDB or pg_documentdb_core, or OSON. Flat documents with hundreds of top-level fields are not just harder to read. They are measurably slower to query and heavier on storage once compression is applied. Nesting is not just good document design — it is how these formats are engineered to be used.

Top comments (0)