DEV Community: Alex Merced

Apache Data Lakehouse Weekly: July 21 to July 29, 2026

Alex Merced — Wed, 29 Jul 2026 18:41:19 +0000

This was a week where the open lakehouse stack spent most of its energy on contracts. Not legal contracts, but the promises that formats, catalogs, and clients make to each other. Iceberg debated whether equality deletes belong in V4 and whether incremental scan semantics belong in the spec at all. Parquet voted a new floating point encoding into the format and argued about who owns the Thrift file that defines everything. Polaris tried to ship 1.7.0, found a licensing gap in 44 staged jars, and pulled the release candidate. Arrow wrestled with what the nullable flag actually means. Across all of it, a pattern showed up again and again: the community is done with informal conventions and wants written guarantees.

Here is what happened, project by project.

Apache Iceberg

The week opened with a personnel note that matters more than personnel notes usually do. Steven Wu announced that Maximilian Michels joined the project as a committer, and the thread ran to 30 messages of congratulations from across the contributor base. Iceberg has grown into a project where the streaming and Flink side of the ecosystem carries real weight, and adding committers who live in that part of the codebase keeps the review load from concentrating on a handful of people. The size of the congratulations thread is its own signal about how many active humans are now paying attention to this list.

The most consequential technical debate of the week was about deletes. Huaxin Gao pushed forward on the proposal to deprecate equality deletes in Iceberg V4, and the thread drew 15 messages including strong support from Ryan Blue. Equality deletes are the mechanism that lets a streaming writer say "delete every row where id equals 42" without knowing which file that row lives in. They make writes cheap and reads expensive. Every scan has to carry the delete predicates forward and apply them against candidate files, and scan planning cannot prune as aggressively because the planner does not know which files a given equality delete actually touches.

Xiening Dai raised the concern that killing equality deletes shifts a burden onto writers that still have to support fast updates and deletes. Gao reframed the tradeoff. The work does not disappear, it moves. Instead of every reader paying the cost forever, the writer pays it once by maintaining an index that resolves the predicate to positions at write time. Blue backed the position directly and argued that V4 should not allow writing equality deletes at all, pointing at the Flink work as evidence that maintaining such an index is practical. He followed up with a note on migration: existing tables that already contain equality deletes keep working after an upgrade, and the win concentrates in scan planning where the current design forces conservative behavior.

This is the kind of decision that defines a format version. V2 gave the ecosystem row-level deletes. V3 gave it deletion vectors and new types. V4 looks like the version where the community trims the surface area that made earlier versions hard to implement correctly across engines. If you maintain a writer, this thread is the one to read this week.

Right behind it, Prashant Singh called a vote to align the REST catalog OpenAPI expression schemas with the Iceberg expression specification. The vote drew 17 messages and passed. The problem it solves is unglamorous and important. The REST spec described expressions one way and the table spec described them another way, which meant client authors had to guess which definition an implementation actually followed. Two documents describing one concept is a bug that produces silent incompatibility rather than loud errors, and those are the worst kind.

Alexandre Dutra also brought the vote to formalize remote signing configuration in the REST spec to a close, collecting a binding +1 from Russell Spitzer along with several non-binding votes. Remote signing is how a catalog hands a client short-lived, narrowly scoped access to object storage without shipping long-lived cloud credentials. It has been in use for a while, configured through conventions that varied by implementation. Writing it into the spec turns a working practice into a guarantee, which is exactly what enterprise security teams ask for when they audit a lakehouse deployment.

Alexander Bailey opened a related question about written guarantees: should incremental append scan semantics be part of the spec? Incremental scans let a downstream job read only what changed since a given snapshot, which is the backbone of most CDC and streaming ingestion patterns built on Iceberg. Xiening Dai pointed out that the current REST incremental scan API returns a set of files, which restricts it to the append-only case, while a full incremental scan needs to express the delta between two snapshots including deletes and updates. Ryan Blue replied that V4 already has planned changes that tighten change detection requirements across versions, so the spec will support all incremental scan types instead of relying on conventions that vary by implementation. Bailey asked the obvious follow-up question, which is where to follow that work and which community sync to attend. That question comes up often enough on this list that a better answer than "ask on dev@" is overdue.

Format convergence showed up in a thread from Nitya Kumar Sharma, who proposed a file data type for Iceberg V4 now that the Parquet FILE logical type has merged in parquet-format PR 585. The idea is to give Iceberg a first class way to reference a blob, an image, or a document that lives outside the columnar data, with the metadata about that reference stored in the table. Russell Spitzer bumped the thread out of spam and noted that a prior proposal already exists from Talat and collaborators. Daniel Weeks agreed and asked the community to consolidate around updating the original FileRef proposal instead of starting a parallel effort. Sharma agreed to wait for the original authors to return from vacation and comment on the existing proposal. This is good project hygiene. Two competing proposals for the same feature produce a worse outcome than one proposal with two sets of authors.

On the release side, Shawn Chang called a vote for Apache Iceberg Rust 0.10.1 RC1, and verification came in from Kevin Liu with a binding +1 plus non-binding checks from Xin Huang on Linux with rustc 1.94.0 and L. C. Hsieh on macOS aarch64, who reported 1839 tests passing. The patch release exists because of a real incident, which Danny Jones documented on the list. The pyiceberg-core release artifacts for 0.10.0 were built from the main branch rather than the 0.10.0 tag, which means the published package contained code that was never part of the release. Vova Kot spotted it, Kevin Liu yanked the package from PyPI, and Renjie Liu took responsibility for the manual publish that caused it.

The way the community handled this is worth calling out. Nobody got piled on. The thread moved straight from acknowledgment to the question of how to prevent a repeat, with Matt Butrovich thanking Jones for starting that conversation. Manual release steps are a standing hazard in every Apache project, and the fix is almost always automation plus a verification step that compares published artifacts against the signed tag. Expect a follow-up proposal on that front.

The Terraform provider had its own release friction. Matt Topol proposed v0.1.0 RC1 for the Apache Iceberg Terraform provider, and Kevin Liu found a schema validation issue during testing that affected table creation and had the potential to write incorrect metadata or leave table state wrong. Topol said he will fix the issues and cut a new candidate. Rich Bowen also weighed in on the thread. Catching a metadata correctness bug in a release candidate is the system working as designed, and it is a good argument for the Apache voting process against people who think it slows things down.

Performance work drew attention too. Gianluca Graziadei opened a review request for Hilbert curve clustering in rewrite_data_files, offering it as an alternative spatial clustering strategy alongside the existing Z-order implementation. Tanmay Rauth reviewed it, praised the approach for reusing existing Z-order byte encodings to keep the change small, and asked for end-to-end numbers. Graziadei came back with two things: a citation to the Moon, Jagadish, Faloutsos and Saltz analysis in IEEE TKDE showing that Hilbert curves have better clustering locality than Z-order on analytical grounds, and a full comparison report with headline numbers. For teams doing multi-dimensional filtering on large tables, the difference between Z-order and Hilbert ordering shows up directly in how many files a query has to open. This is the kind of contribution that pays for itself the first week it ships.

Several smaller threads rounded out the week. There was discussion of migrating Iceberg to Jackson 3, which matters because Jackson version conflicts are one of the most common dependency headaches in JVM data stacks. Contributors discussed integrating EagerInputFile into the manifest reader, a change aimed at cutting the number of round trips during scan planning. There was a vote on labels in the IRC read path, a proposal to add the variant type to the REST catalog spec, a request for table-level filtering in MetricsReporter, a review request for a memory leak fix on V3 Spark, and continued work on column update metadata representation. There was also a dedicated sync scheduled for Iceberg index support, which is a topic worth watching closely given how much of the V4 conversation touches scan planning cost.

One recurring theme that is not technical at all: three separate threads this week were people trying to join the Iceberg Slack workspace because the public invite link was broken. That is three people who cared enough to email a developer mailing list. The number who gave up quietly is larger. A broken front door costs a project more contributors than most people realize.

Apache Polaris

Polaris had the most eventful release week of any project on this list, and the story has a good ending even though it started with a failed vote.

Jean-Baptiste Onofré called a vote to release Apache Polaris 1.7.0 rc0. Yufei Gu voted -1 with a binding vote after inspecting all 269 staged jars in the Maven repository and finding that 44 of them lacked both META-INF/LICENSE and META-INF/NOTICE. That is a release blocking issue at the Apache Software Foundation, and it does not matter how long the problem has existed. Onofré acknowledged the catch, noted the issue predates 1.5.0 and goes back to incubation, and moved directly to a fix. He opened a pull request to correct both the jars published to Maven and the version string in the Python package, planned a backport to the 1.7.x branch, and cancelled the vote.

The lead-up to the vote was itself a good example of release management. In the thread on preparing 1.7.0, Gu asked Onofré to include a pull request that fixes a bug where dropping a valid idempotency key leads to table corruption. Onofré reviewed it, proposed a simple rule for the cut, and stuck to it: if the fix merges before Wednesday it lands in 1.7.0, if it does not, it lands in 1.8.0. Gu argued the fix was ready and that a follow-up can make the window size configurable. Table corruption bugs deserve exactly that level of urgency, and a time-boxed decision rule keeps the release from drifting while people negotiate.

The longest running technical debate in Polaris this week was about a connection pool. Yufei Gu summarized the state of the Polaris-managed JDBC datasource discussion as a choice between Hikari and Agroal, with the thread running to 21 messages. Romain Manni-Bucau asked the right question first, which is whether anyone had written down the criteria for the decision. Robert Stupp went further and argued the thread had not established consensus that Quarkus-managed data sources should be replaced at all. He walked back to the original proof of concept and separated the two things it addressed, one of which is loading optional JDBC drivers at runtime.

That distinction matters. Loading drivers dynamically is a packaging problem. Choosing a pool is a runtime behavior problem. Solving the first by rewriting the second is how projects accumulate accidental complexity. Dmitri Bourlatchkov asked Gu to explain the reasoning behind picking Hikari in the original pull request, which is the kind of question that turns a preference argument into a design discussion.

Security semantics got serious attention. In the thread on forwarding user-defined principal properties in PolarisPrincipal, Gu asked whether user-defined and system-managed attributes should live in separate namespaces to preserve provenance. Stupp agreed that provenance sits at the center of the question and pushed to settle security semantics before exposing these attributes to authorizers. His specific concern was that exposing the complete principal entity through a public attribute key invites authorizer implementations to depend on it, and once they do, the project owns that surface forever. Alexandre Dutra argued the resolver has legitimate need for the entity. Stupp narrowed his objection to placement rather than existence: the concern is that the key is declared on the public-facing type.

This is a good argument to read if you build authorization systems. The failure mode is not that someone reads a field they should not read. It is that a convenient field becomes a load-bearing part of third-party code, and then it cannot change.

Persistence correctness got its own thread. Robert Stupp picked up the discussion on consistent multi-object changes in Polaris persistence, noting Gu's clarification about where the atomicity guarantee lives. If atomicity is a property of the lower-level BasePersistence contract rather than the general metastore manager contract, callers cannot tell whether the state they read for validation and authorization is consistent. Prithvi S agreed the persistence layer needs one well-defined contract for multi-object consistency rather than a series of operation-specific patches, and clarified where their pull request sits in that picture. Onofré backed Stupp's framing. There is a related thread on atomic multi-entity and grant commits covering partial-commit windows during grant, createCatalog, and drop operations, which is the same problem seen from the authorization side.

Dependency reduction came up in the discussion about deprecating TreeMapMetaStore and friends. Gu asked for the specific security concern behind removing H2, pointing out that any library can have vulnerabilities and asking for data showing H2 has more than others. Bourlatchkov replied that the data is not the point. The question is whether a driver used only for non-production getting-started scenarios earns a place in the production dependency graph. Both positions have merit, and the resolution probably involves shipping the getting-started path as a separate artifact rather than arguing about any single library's track record.

Adam Szita moved Iceberg table encryption support from proposal to code with a pull request. Gu reviewed it, said it heads in the right direction, and asked for a split between spec changes and implementation. Stupp cross-linked a related reply about how two encryption pull requests fit together against Iceberg's two catalog security requirements. There is also a separate discussion about decrypt-only access for legacy AWS KMS keys, which is the sort of migration path detail that decides whether an encryption feature is adoptable by an existing deployment or only by greenfield ones.

Agent-facing work showed up in the thread on authentication modes in polaris-tools. Gu drew the distinction that organizes the whole problem: a local MCP server and a standalone shared MCP service are different deployment models with different security requirements. For a local single-user process, using the user's own Polaris token is enough because there is no separate service identity. Stupp agreed the deployment model is the first fork in the decision tree. Anyone wiring an AI agent to a catalog should read this thread before designing their auth story, because getting this wrong produces a service that can act as any user.

Onofré also handled community logistics. He created a public Google Calendar for Polaris and proposed a new meeting structure of 90 minutes every three weeks with timeboxed topics. Dutra agreed with the cadence but pushed back on a 10 minute timebox as too short. Onofré moved it to 20 minutes on the spot and Gu agreed. Small thread, real improvement. He also pushed the Polaris Directories proposal forward with a plan to rebase the pull request, incorporate comments, and schedule a dedicated meeting, with Bourlatchkov and Gu both supporting a live discussion.

Other Polaris threads worth a look: HTTP status codes for CommitStateUnknownException from federated catalogs, the result of the vote on returning 503 for concurrent modification during table and view rename, support for external principals, making the relational JDBC schema name configurable, adding Open Sharing APIs, the semantic model REST API payload representation, Polaris SPI principles, whether the Polaris Console belongs in the main repository, and the August 2026 board report.

One operational note from that authentication thread: a message with a July 20 date header sat in the moderation queue for three days before reaching the list. Bourlatchkov flagged it. Moderation delays make threads look abandoned when they are not, and contributors read silence as rejection.

Apache Parquet

Parquet landed a real format change this week. Prateek Gaur announced that the vote to add ALP encoding to the Parquet format passed with 11 +1 votes, 7 of them binding. ALP stands for Adaptive Lossless floating-Point. It compresses double and float columns by finding a decimal representation that maps the values to integers, then encoding those integers with existing integer techniques, and falling back to a second scheme for values that do not fit the pattern.

Floating point columns have been the weak spot in columnar compression for years. Sensor readings, prices, model scores, and embedding components all store as doubles, and general purpose compressors do poorly on them because the bit patterns look close to random. ALP takes advantage of the fact that most real-world doubles are actually decimals with limited precision that were converted to binary floating point somewhere upstream. In the vote thread, Antoine Pitrou gave a binding +1 and noted the remaining items were wording and document organization rather than substance. Micah Kornfield and Andrew Lamb also voted +1 binding, with Lamb calling the spec pleasant to read and flagging some subtlety worth follow-up. Julien Le Dem voted +1 and credited Gaur's leadership. Gaur thanked Dhirhan and Russell Spitzer for their contributions.

That vote almost did not happen on schedule, for a reason that has nothing to do with engineering. Gaur reported that at least three people found the voting email in their spam folder and asked what the protocol is. Kornfield said he had seen other votes land normally and hoped people will check. Spitzer said his own +1 was partly an attempt to pull the thread out of spam filters, and reported that two other votes went to spam for him. Le Dem said the same and noted a general uptick in Apache list mail being filtered. This is a real threat to a governance model built on email. A vote that nobody sees is a vote that fails by default.

The other big Parquet discussion was about ownership of the format definition. Divjot Arora proposed inlining parquet.thrift into parquet-java, and the thread ran to 11 messages. The current setup has parquet-java depending on a pinned version of parquet-format and pulling the Thrift file from it. That means you cannot prototype a Java implementation of a format change until a new parquet-format jar ships, which turns every experiment into a release-blocking dependency chain.

Andrew Lamb backed the proposal from experience, noting that the arrow-rs Parquet implementation keeps its own copy of parquet.thrift and has never had a problem with it. Gang Wu supported it in stronger terms, describing the current workflow as painful because continuous integration always fails and pull requests cannot merge before a new format jar is released. Pitrou weighed in as well. The tradeoff is real. A single canonical Thrift file guarantees that implementations agree. Copies drift. But copies that drift produce visible test failures, while a release-blocking dependency chain produces contributors who give up before writing the test.

Pitrou also raised a naming question inside the 1.18.0 release vote thread: whether future releases should be named Apache Parquet Java rather than Apache Parquet. That is more than cosmetics now that Rust, C++, and Go implementations are widely used and the format itself versions separately from any implementation. Calling the Java release "Parquet 1.18.0" implies it is the format, and it is not.

The 1.18.0 RC1 vote itself hit turbulence. Gábor Szádovszky reported that the build fails on the release candidate tag and the tarball with format violations, that running spotless fixes it, and that it was strange the problem got through continuous integration. Fokko Driesprong diagnosed it as a Java version issue, noting he saw the same behavior on JDK 17 and that JDK 11 worked, with the next release moving to JDK 17 and later. Peter Toth voted +1 non-binding. Build reproducibility across JDK versions keeps biting Java projects, and a formatter that behaves differently by JDK is a nasty variant of the problem because it produces a diff rather than an error.

The FIXED_SIZE_LIST discussion took an unexpected turn. Gunnar Morling reported implementing a fixed-length list fast path in Hardwood that detects effectively fixed-length lists by scanning the encoded definition and repetition level streams, then bypasses the general path. Will Edwards called the result worth reflecting on beyond a simple optimization and questioned what remains of the performance case for a new logical type. Alkis Evlogimenos said the result matches work done internally on Photon and argued the fixed size list discussion shifts away from physical performance toward something else. Morling agreed the finding is useful and said he sees more work ahead.

That exchange is a model for how format decisions should go. Someone proposed a new type on performance grounds. Someone else showed you can get most of the benefit through smarter decoding of the existing representation. The conversation then moved to whether the type earns its place on semantic grounds instead. Formats accumulate types easily and shed them never, so this level of scrutiny is correct. Related work continues on a VECTOR repetition level for fixed-size-list serialization.

Julien Le Dem bumped the FSST spec design thread to ask about next steps, specifically comparing the original 8-bit codes against variations with larger code sizes. Arnav Balyan reported general consensus on the FSST spec and credited Kornfield, Gang Wu, Lamb, Pitrou, and others for feedback. FSST is a string compression scheme that builds a symbol table of common substrings, and pairing it with ALP gives Parquet strong coverage on the two column types that general compressors handle worst.

Burak Yavuz posted updates on the File logical type after its vote passed, covering how to reason about compression for data stored inside the new type and a rename from path to uri based on feedback from Rok and Pitrou. That rename is small and correct. A path implies a filesystem. A URI does not.

Rounding out the project, Aaron Niskode-Dossett and Ismaël Mejía discussed whether Parquet should publish a test helpers artifact, with Mejía reporting a successful downstream test into Apache Spark. Steve Loughran gave a mixed assessment from experience on other projects: sharing testing tools helps consumers, and it also commits you to keeping them stable across changes like a JUnit 5 migration. There was also continued discussion on extended precision nanosecond timestamps, and the Parquet sync met on Wednesday July 29.

Apache Arrow

Arrow's headline discussion asked a question that sounds simple and is not. Antoine Pitrou opened a thread on the semantics of non-nullable fields with non-trivial types, prompted by an issue and pull request in the Arrow repository. What does the nullable flag in the IPC format actually mean when the field is a struct, a list, or a map? Does it describe the field itself, the children, or both? The thread ran to nine messages.

Raúl Cumplido linked back to an earlier mailing list thread on one of the cases, which tells you this question has been open for months. Weston Pace said he finds the nullability flag confusing in general and raised the deeper question: is an Arrow schema meant to describe the data in this specific batch, or to state a contract about all data that will ever flow through this stream? Those are different things. A batch-level description is an observation. A stream-level contract is a promise that downstream code can optimize against. David Lee, who opened the original issue, argued for an official format change and drew the distinction between a nullable array and a struct that contains nulls.

This is the same theme running through Iceberg and Polaris this week. A flag that has worked by convention for years turns out to mean different things to different implementations, and the fix is to write down which meaning is normative. Arrow also has a related open question in the thread on the variant extension spec being inconsistent with the Parquet shredding spec, which is a cross-project version of exactly the same failure.

On releases, David Li shipped ADBC 24 through a two-candidate cycle. RC1 covered 55 resolved issues and drew a binding +1 from Raúl Cumplido on Debian 14. Edgar Ramírez Mondragón noticed missing manylinux wheels in the nightly repository, and Bryce Mecum redirected that to the issue tracker as a non-blocking concern. Li then cut RC2. Matt Topol found a small non-blocking problem and filed a fix. Mecum verified on macOS 26 aarch64 and confirmed that the RC2 JNI library no longer dynamically links the driver manager. Ian Cook verified on macOS Tahoe 26.3.1 and confirmed the fix. The vote passed with four binding +1 votes from Mecum, Cook, Topol, and Cumplido, and Li worked through the release checklist and announced ADBC 24.

ADBC deserves more attention than it gets in lakehouse conversations. It gives you a database connectivity API that speaks Arrow natively, so result sets move as columnar batches instead of being converted row by row through a JDBC or ODBC layer. For any workload that pulls large result sets into Python or Rust for processing, that difference is the whole performance story.

Arrow Rust 58.4.0 also cleared its vote. Kosta Tarasov verified RC3 on Fedora 44 x86_64 with a non-binding +1, Adam Reeve added a binding +1 on the same platform, and Andrew Lamb, who proposed the release, posted the result. Arrow Go 18.7.0 was released after its own vote passed.

Demetrius Albuquerque proposed raising the minimum Swift version for apache/arrow-swift from 5.10 to 6.0. Sutou Kouhei did not object and asked for data on how many Swift users are still on 5.10. That is the right response to every minimum-version bump: not "no" and not "sure," but "who breaks?" The Arrow community meeting was scheduled for July 29 at 16:00 UTC.

Apache DataFusion

DataFusion had a quiet week on dev@, with the notable item being Matt Butrovich's announcement that the vote for Apache DataFusion 54.1.0 RC1 passed with six binding +1 votes from Butrovich, Andrew Lamb, Andy Grove, L. C. Hsieh, Marko Milenković, and Wang Xudong, plus one non-binding +1 from Martin Grigorov. No zero or negative votes.

A quiet dev list does not mean a quiet project. DataFusion does most of its design work in GitHub issues and discussions, and the mailing list carries mostly release traffic. What the 54.1.0 patch release signals is a project on a fast, predictable cadence, which is exactly what downstream projects need. Iceberg Rust depends on DataFusion for query execution, which makes DataFusion's release rhythm part of the Rust lakehouse story rather than a separate concern.

Apache Ossie (incubating)

Ossie is the newest name on this list and the one most people have not read about yet. It is an incubating project building a shared, vendor-neutral specification for semantic and ontology metadata, expressed in YAML, with converters that map it into other systems. This week it acted like a project preparing to be real.

Jean-Baptiste Onofré proposed beginning work on the first releases, targeting late August or early September. Yong Zheng responded with concerns about outstanding converter issues and about the proposed version numbering. Onofré replied that the first community meeting already discussed versioning and the group leans toward starting at 0.3.0-incubating rather than 1.0.0. Starting below 1.0 is the right call for a spec that is still absorbing feedback from multiple working groups. A 1.0 label sets expectations about stability that an incubating project should not make.

Markus Weimer asked a deceptively small question: is there a canonical file suffix for Ossie files? He had seen raw .yaml in the wild. Onofré said there is no canonical suffix today and supported introducing a .ossie suffix while keeping the YAML format. Sahil W supported keeping .yaml and liked the .ossie idea for discoverability. Weimer landed on the practical answer: use .ossie.yaml when Ossie files live alongside other files in the same folder, which keeps every YAML-aware editor and linter working while making the file's role obvious. That is a good outcome, and it took four messages.

Sahil W also proposed unified Python linting and formatting across the project's Python code and offered to own the implementation across several small pull requests. Yong Zheng gave a +1, said he had started similar work, and listed the problems he hit, including questions about ownership of formatting rules in a repository with multiple converters. Sahil pointed out that ruff supports hierarchical configuration natively, so a shared top-level config acts as a baseline that individual converters can extend. Quigley Malcolm backed standardization and endorsed starting small.

Housekeeping continued elsewhere. Yong Zheng proposed cleaning up two unused directories, Onofré confirmed they are legacy and volunteered to remove them, and Will Pugh shared a document to align the community on directory structure. Zheng also raised the Java version question, noting the Polaris converter still targets Java 11, which reached end of life a while ago. Malcolm supported the immediate alignment fix and asked for a sanity check on jumping further. Onofré supported going straight to Java 21 and explained Java 11 was set as the minimum required version at the time even though the build used JDK 17. Zheng agreed Java 21 has wider adoption than Java 25.

The governance thread of the week came from Ramya Priya of Tellius, who asked about the process for a new organization to join a working group. Quigley Malcolm welcomed her, said the questions indicate the contributing guide needs updates, and made the framing point clearly: at the Apache Software Foundation people participate as individuals, not as company representatives. Onofré echoed it and confirmed the project is vendor-neutral, then addressed what "active participant" means in practice.

Anyone who works at a vendor and contributes to open source should internalize that answer. Your employer does not join an Apache project. You do.

Other Ossie activity included notes from the Ontology working group sync on July 23 shared by Ankit Tandon, a proposal on representing reified relationships as entity types, discussion of extended metadata fields for fields and metrics, community workflows and compliance automation, a scaffold for repositories that consume the spec, a proposal to add Flink SQL support, and an independent .NET viewer with a two-way fact-based modelling converter looking for collaborators.

Cross-Project Themes

Three patterns connect the six lists this week.

Written contracts are replacing working conventions. Iceberg voted to align REST expressions with the table spec and to formalize remote signing, and it opened a debate about whether incremental scan semantics belong in the spec at all. Arrow asked what the nullable flag normatively means. Parquet renamed path to uri and pushed for clearer language in the ALP spec. Polaris argued about which layer owns the atomicity guarantee. These look like separate discussions. They are the same discussion. Each project has reached the point where multiple independent implementations exist, and the cost of an underspecified detail has flipped from "nobody notices" to "two engines disagree in production."

The mechanism behind that flip is worth naming. When one implementation dominates, the implementation is the spec, and ambiguity in the document costs nothing. When four implementations exist across four languages, every ambiguity becomes a compatibility bug that surfaces at the worst possible moment. Iceberg has Java, Rust, Python, Go, and C++ implementations. Parquet has at least as many. Arrow was multi-language from birth. The specs are catching up to a reality the code created.

Formats are converging around composite and nested data. Parquet merged a FILE logical type, and Iceberg immediately started designing a matching file data type for V4. Parquet is working through FIXED_SIZE_LIST and a VECTOR repetition level. Arrow flagged that its variant extension spec is inconsistent with the Parquet shredding spec, while Iceberg discussed adding variant to the REST catalog spec. ALP and FSST together attack the two column types that general compression handles worst, floating point and strings.

Read those together and you can see the destination. The stack is preparing for tables where a single row holds structured columns, semi-structured documents, a vector of floats, and a reference to a blob sitting in object storage, all queryable through one SQL interface with real pruning and real compression. That is the shape of data that AI applications produce and consume. The format work happening this week is what makes it possible to store that data without a separate vector database, a separate document store, and a separate blob index that nobody keeps in sync.

Catalogs are becoming security products. Polaris spent the week on principal property provenance, external principals, table encryption, decrypt-only access for legacy KMS keys, atomic grant commits, and authentication modes for its MCP server. Iceberg formalized remote signing in the REST spec. Both projects are working the same problem from opposite ends: the catalog is the only component that sees every request, so it is the only place where access control can be enforced consistently across engines.

The MCP thread in Polaris makes the stakes concrete. Once AI agents query catalogs directly, the question of whose identity an agent acts under stops being academic. Gu's distinction between a local single-user MCP process and a shared MCP service is the first fork in that design, and getting it wrong produces a service that quietly acts as any user who ever touched it. Every organization deploying agents against a lakehouse will confront this within the next year.

A fourth, smaller pattern: release engineering hygiene took a beating this week and the community responded well every time. Polaris found 44 jars missing license files and pulled the vote. Iceberg yanked a Python package built from the wrong branch. Parquet found a formatter that behaves differently across JDK versions. The Terraform provider had a schema bug caught during verification. In every case the process worked, which is a reasonable argument for the verification steps that people complain about when nothing is wrong.

And a fifth that is not technical at all: Parquet votes landed in spam folders, Iceberg contributors failed to join Slack through three separate threads, and a Polaris message sat in moderation for three days. Open source governance runs on email and chat working reliably. When they do not, participation drops in ways that never show up in a metrics dashboard.

Looking Ahead

Watch the Iceberg equality deletes thread. Ryan Blue's support gives the deprecation real momentum, and the follow-up questions about upgrade behavior for existing tables will shape how much work this creates for anyone running Flink-based ingestion. Alongside it, the V4 change detection work that Blue referenced in the incremental scan thread is the piece that determines whether incremental reads become a spec guarantee or stay a convention. The dedicated index support sync is the third leg of that story.

Polaris 1.7.0 should reach a vote quickly now that the license and notice fix is in flight, and the Polaris Directories meeting that Onofré promised to schedule is worth attending if you care about how the catalog organizes objects. The JDBC datasource decision needs written criteria before it needs a winner.

Parquet will move ALP from an approved spec toward reference implementations, and the FSST spec looks close to the same milestone. The parquet.thrift inlining discussion has enough support that a formal proposal is likely soon. Watch for the Parquet Java naming question to come back as a separate thread.

Arrow's nullability discussion has not converged, and Weston Pace's question about whether a schema describes a batch or promises a contract is the fork that needs an answer before any format change makes sense. The Swift minimum version bump waits on usage data.

Ossie is heading toward first releases at 0.3.0-incubating in late August or early September, which will be the first time anyone outside the mailing list can install and evaluate the spec. If you work on semantic layers or metric definitions, that is the moment to start paying attention.

Resources & Further Learning

Get Started with Dremio

Try Dremio Free: Build your lakehouse on Iceberg with a free trial
Build a Lakehouse with Iceberg, Parquet, Polaris & Arrow: Learn how Dremio brings the open lakehouse stack together

Free Downloads

Apache Iceberg: The Definitive Guide: O'Reilly book, free download
Apache Polaris: The Definitive Guide: O'Reilly book, free download

Books by Alex Merced

Browse the full catalog of 50+ books at books.alexmerced.com.

Alex Merced, Data Lakehouse and AI Evangelist

AI Weekly: Opus 5 Lands, MCP Goes Stateless, and AMD Ships Helios

Alex Merced — Wed, 29 Jul 2026 18:26:50 +0000

Week of July 22 to July 29, 2026

Four things moved this week, one in each layer of the stack. Anthropic released Claude Opus 5 on July 24 at unchanged Opus pricing with benchmark results that beat the tier above it. Coding agent vendors kept shipping approval modes and audit surfaces instead of benchmark wins. The Model Context Protocol published its 2026-07-28 specification on Tuesday, the largest rewrite since launch, removing sessions from the protocol entirely. And AMD moved Helios rack-scale systems into production with customer commitments measured in gigawatts.

Starting this issue, the newsletter runs the same four sections every week in the same order: models, tooling, standards, infrastructure. Models set what is possible. Tooling decides who gets to use it. Standards decide whether the pieces connect. Infrastructure sets the price.

Models: Claude Opus 5 Beats the Tier Above It

Anthropic released Claude Opus 5 on Friday, July 24, 2026, available the same day on the Claude API, Claude.ai, Claude Code, and Claude Cowork under the model ID claude-opus-5. It is the company's fourth model in under two months, following Mythos 5, Fable 5, and Sonnet 5 in June.

The pricing is the headline. Opus 5 costs $5 per million input tokens and $25 per million output tokens, identical to Opus 4.8 and half of Fable 5's rates. A Fast mode roughly doubles the price to $10 and $50 and runs about 2.5 times faster. Cached input bills at one tenth of the base input rate, and asynchronous batch processing carries a 50 percent discount. Context is 1 million tokens as both default and maximum, with no smaller variant. Maximum output is 128,000 tokens on the synchronous Messages API, reaching 300,000 through the Message Batches API with a beta header. The minimum cacheable prompt dropped from 1,024 tokens to 512.

On benchmarks, the numbers are strong and they come from Anthropic's own testing, so read them as vendor-reported. On FrontierBench v0.1, a 74-task successor to Terminal-Bench 2.1, Opus 5 scored 43.3 percent at max effort against 18.7 percent for Opus 4.8, 33.7 percent for Fable 5, and 37.5 percent for GPT-5.6 Sol. At the highest effort setting it reached 44.4 percent mean reward. Anthropic's published tables also show Opus 5 leading on GDPval-AA, ARC-AGI-3, OSWorld 2.0, and AutomationBench. The losses are disclosed too: it trails on legal and health evaluations where Mythos 5 leads, and GPT-5.6 Sol edges it on one agentic coding test.

Two details matter more than the scores.

The first is the effort setting. Opus 5 exposes a per-request low, medium, or high control over how much reasoning the model spends. That turns cost against capability into a runtime decision rather than a model selection decision. A pipeline running the same model at low effort for routine classification and high effort for hard reasoning gets better economics than one that routes between two models and maintains two prompt sets.

The second is the knowledge cutoff. Opus 5 carries a reliable cutoff of May 2026 against January 2026 for both Fable 5 and Opus 4.8. For coding work that touches recent library versions or infrastructure released in the last two quarters, a four-month gap in training data affects output quality more than a few benchmark points do. Anyone writing code against the MCP 2026-07-28 spec discussed later in this issue has a direct interest in which model has seen the relevant material.

Anthropic positioned the release unusually. The company is not claiming a new capability ceiling. Fable 5 remains the most capable public model, with restricted Mythos 5 above it. Product leader Dianne Penn told Reuters that users should pick Opus 5 for value and reserve Fable 5 for days-long autonomous projects. Opus 5 is now the default on Claude Max and the strongest model available on Claude Pro, so most paying subscribers received the upgrade without changing anything.

Two operational notes for teams migrating. The API ships two breaking changes, and code moved over untouched risks 400 errors or truncated output, so read the migration guide before swapping the model ID. And Anthropic tells developers to delete verification prompts. Instructions like asking the model to include a final verification step now cause over-verification, because the model already verifies its own work. Prompt patterns tuned for an older generation actively hurt on this one, which is a good reminder that a model swap is not a configuration change.

On data handling, Opus 5 supports zero data retention, which Fable 5 does not under its 30-day requirement. It also carries less restrictive cybersecurity safeguards than Fable 5 and is described as the most aligned model the company has measured, with the lowest observed rates of deceptive behavior.

Kimi K3 opens its weights

The open-weight side of the market did not stay quiet. Moonshot AI's Kimi K3 reached full open weights on July 27, 2026 after an API-first launch. K3 is a 2.8-trillion-parameter sparse mixture-of-experts model, the largest open model available at release. It handles text, images, and video natively and supports a 1-million-token context window.

The engineering is worth reading even if you never run it. MXFP4 weight quantization brings storage down to roughly 1.4 terabytes, which puts multi-node self-hosting inside reach for organizations that own hardware. A LatentMoE framework coordinates 896 experts with about 16 active per token. Two architectural additions, Kimi Delta Attention and Attention Residuals, target long-context efficiency, which is the cost center that makes million-token windows expensive to serve.

Pricing lands at roughly $3 per million input tokens and $15 per million output tokens, noticeably below comparable US frontier models. On benchmarks the model took first place across six of seven domains in the Frontend Code Arena and scored 88.3 on Terminal-Bench 2.1. Moonshot also ships Kimi Code, a terminal agent built on the K-series flagship, with a free quota and paid plans starting at $19 per month.

What the release cadence means for your architecture

Trackers now log a notable model roughly every two to three days once strong open-weight releases are counted. Two patterns come out of that.

Price per unit of capability is falling faster than capability is rising. Opus 5 delivers near-flagship results at half the flagship price with no change from the model it replaces. Kimi K3 puts a 2.8-trillion-parameter multimodal model on open weights at $3 and $15. The practical consequence is that the model line item in a budget written in January is wrong by July, and wrong in your favor.

The second pattern is that tier boundaries have stopped meaning much. Anthropic shipped four models in under two months across four tiers, and the mid-tier model beats the tier above it on several benchmarks while trailing on others. Selecting a model by tier name produces worse results than selecting by evaluation on your own workload. Build the evaluation harness once and re-run it on every release. That harness is the durable asset. The model behind it is a swappable part.

Tooling: The Control Plane Became the Product

The coding agent market spent late July competing on operations rather than model quality, and the pattern is consistent enough across vendors that it looks like a shared realization.

GitHub added Claude Opus 5 to GitHub Copilot across supported Copilot applications and IDEs. GitHub describes the model as built for complex, long-running coding tasks that need careful reasoning, effective tool use, and reliable execution across multiple steps, and reports strong early testing results on agentic workflows including autonomous code changes, regression verification, and tasks that coordinate several tools.

Note what GitHub chose to highlight. Not completion quality. Not benchmark scores. Regression verification and multi-tool coordination. Those are the things that determine whether an agent's output is trustworthy enough to merge.

A late-July roundup of Codex and Claude Code updates makes the pattern explicit. OpenAI's July Codex notes added interactive forms in task transcripts, Mermaid diagram rendering, prompt recovery, the ability to resume goals that were blocked or hit usage limits, better task lists, and more reliable cross-device task handling. Claude Code's July updates added in-app browsing, a /doctor command for environment diagnosis, /fork, public artifact sharing, artifact access to each viewer's own MCP connectors, editor roles, and broader auto mode availability across Amazon Bedrock, Google Cloud's Agent Platform, and Microsoft Foundry.

GitHub's July 7 JetBrains changelog added Codex as an agent provider in public preview, expanded the customizations editor with hooks support and richer MCP server management, added custom model support for Business and Enterprise administrators, and added approval settings for Copilot CLI sessions.

Read that list as a whole and the competitive axis is obvious. Approval modes. Resumable work. Credential boundaries. Audit surfaces. Review loops. These are the features an engineering manager asks about before rolling a tool out to fifty developers, and none of them show up on a leaderboard.

The in-app browser in Claude Code deserves specific attention, because it is not primarily a documentation reader. It gives the agent a review loop against websites, dashboards, and hosted applications. An agent that changes a frontend and then looks at the rendered result is doing something categorically different from an agent that changes a frontend and reports that the diff compiled. Verification loops are what turn plausible output into correct output.

The pricing picture shifted too, and it shifted toward metering. Current pricing across the major tools as of July 23 puts GitHub Copilot at free and $10 per month for Pro, running on usage-based billing since June 1 where one AI Credit equals one cent, with a Max tier at $100 per month. Claude Code arrives through Claude plans at $20 per month for Pro and $100 or $200 per month for Max, running Claude Sonnet 5 by default. OpenAI Codex is included with ChatGPT plans on token-based credits and has run the GPT-5.6 family since July 9. Cursor lists Pro at $20, Pro+ at $60, and Ultra at $200 per month, with Cursor's own documentation noting that daily agent users typically land closer to $60 to $100 per month than $20. Moonshot's Kimi Code entered the terminal agent market with a free quota and paid plans starting at $19 per month.

The gap between Cursor's list price and Cursor's own estimate of what daily agent users actually pay is the most honest number in that whole set. Agentic coding consumes tokens at a rate that flat subscriptions cannot absorb, and every vendor is converging on metering because the underlying economics leave no alternative.

For teams standardizing this quarter, the sequence that makes sense is approval policy first, credential boundaries second, model choice third. Model quality changes every eight weeks. Your policy for what an agent is allowed to do without a human in the loop should not.

How to evaluate a coding agent in August 2026

Model benchmarks have stopped being useful for tool selection because the leaders change every few weeks and the differences at the top are small compared to the differences in how the tools behave in your repository. A better evaluation runs on seven questions.

What does the agent do without asking? Every tool has an approval model. Read it carefully and test the edges, especially around file deletion, dependency installation, and anything that touches a remote system.

Which credentials does it hold, and for how long? An agent with a long-lived cloud credential is a standing risk. An agent that requests scoped, short-lived access per task is a different security posture entirely.

Can it resume? Long-running tasks fail for boring reasons. Network drops, usage limits, laptop sleep. The tools that added resumable goals this month did it because that failure mode is constant.

Does it verify its own work? An agent that runs tests, reads the rendered output, or checks a regression suite produces a different quality of change than one that stops at the diff.

What does the audit trail look like? For any regulated environment, the question of what an agent did, when, under whose identity, and with what approval is not optional. Task transcripts and artifact sharing exist for this reason.

How does it handle your MCP connectors? Agents increasingly reach your data through MCP servers, and the auth story between agent, MCP server, and data platform is where most real incidents will originate.

What does it actually cost at your usage? Run a two-week pilot with real work and measure. The list price is a marketing number.

Agents are showing up outside the IDE

Coding is the visible edge of a broader move. Google Threat Intelligence took its agentic capabilities from public preview to general availability for Enterprise and Enterprise Plus customers, targeting threat hunting, incident response, and daily alert triage, with one workflow staying preview-only. On the industrial side, Altia launched an AI layer inside its human-machine interface workflow on July 21 that connects to any model and assists across the development process, with a notable policy attached: the company says no AI-generated code ships in production.

Those two launches bracket the range of what enterprises are comfortable with right now. Security triage is a domain where the volume is overwhelming and the cost of a missed alert is high, so automation wins even with imperfect precision. Safety-critical embedded software is a domain where the cost of a subtle defect is measured in recalls, so the agent assists and humans still write what ships. Most organizations sit somewhere between those poles, and the honest answer about where you sit depends on your blast radius, not on your enthusiasm.

Standards: MCP Drops Sessions and Grows Up

The Model Context Protocol published its 2026-07-28 specification on Tuesday, written by lead maintainers David Soria Parra and Den Delimarsky. The release candidate had been locked since May 21, giving SDK maintainers a ten-week window to validate changes against real workloads before the final publication.

Start with the scale numbers, because they explain why the changes are what they are. Across the Tier 1 SDKs, the project reports close to half a billion downloads a month, with both the TypeScript and Python SDKs crossing one billion total downloads. A protocol at that adoption level stops being a developer convenience and becomes infrastructure. Infrastructure gets judged on how it behaves at 3 a.m. during an incident, not on how fast you can wire up a demo.

Why sessions were the problem

To understand why this release matters, it helps to remember what MCP looked like at the start. The protocol was designed for a local model: an AI application starts a server as a subprocess, talks to it over standard input and output, and both sides hold a live connection for the duration of the conversation. In that setting a session is free. The connection is the session.

Remote MCP servers broke that assumption. Streamable HTTP arrived in the 2025-03-26 revision and replaced the earlier HTTP plus server-sent events transport, using one endpoint that supports POST and GET with optional streaming for server-to-client messages. Sessions were tracked with an Mcp-Session-Id header. That design worked, and it dragged a long tail of operational requirements behind it.

Every team that ran MCP at scale hit the same wall. A session identifier means a request must reach the instance that holds that session, so you configure sticky routing. Sticky routing means an instance restart drops live conversations, so you add a shared session store. A shared store means a new failure domain and a new latency hop on every call. Autoscaling gets harder because scaling in kills sessions. Blue-green deployment gets harder for the same reason. None of that has anything to do with connecting a model to a tool. All of it was mandatory.

The 2026-07-28 revision deletes the requirement at the root. That is the right place to fix a problem like this, and it is the harder place, which is why it took eighteen months and a breaking change to get there.

The stateless core

The headline change is that MCP is now a request/response protocol instead of a bidirectional stateful one. The initialize and initialized exchange is retired. The Mcp-Session-Id header is gone. Each request carries its own protocol version, client identity, and client capabilities in a _meta field. Clients that want to learn a server's capabilities up front call a new server/discover RPC, and that call is optional.

The practical effect on a deployment is immediate and large. Before this release, running a remote MCP server in production meant sticky session routing at the load balancer, a shared session store like Redis behind it, and often deep packet inspection at the gateway to figure out what a request was actually doing. All of that existed to reconstruct state the transport assumed but did not carry. With the stateless core, any request lands on any server instance behind an ordinary round-robin load balancer with no shared storage.

The maintainers make an important clarification about what stateless does not mean. Dropping protocol-level sessions does not force your application to be stateless. If a server needs to carry state across calls, the recommended pattern is to mint an explicit handle from a tool and have the model pass that handle back as an argument on the next call. The reasoning behind that recommendation is the interesting part: when state lives in an explicit handle, the model can see it and thread it between tools. When state hides in the transport, the model is operating blind on something that affects its results.

That distinction is worth sitting with if you build agents. Hidden state is the single most common source of agent behavior that looks random. An agent that fetched a result under one session, lost the session, and retried under another has no way to notice or explain what changed. An agent holding a visible handle does.

Multi Round-Trip Requests

The stateless core created a problem the spec had to solve. Some server operations need something from the user in the middle of a call. A confirmation before a destructive action. A missing parameter. A sampling request back to the model. Under the old design those flows used server-initiated requests like elicitation/create, sampling/createMessage, and roots/list, all of which required a stream held open in both directions.

Multi Round-Trip Requests replace that pattern. The server returns a result with resultType: "input_required" plus the specific requests it needs answered. The client gathers answers and retries the original call with them attached in inputResponses. No open stream, no session, and every step is a plain HTTP request/response pair that any proxy in the path understands.

Supabase called this out as the change that unblocks a feature they wanted. Their MCP server runs statelessly, which made elicitation impractical under the old design. With MRTR, their tools can confirm with the user before acting, giving examples like the cost of creating a new project or a query that deletes data. That is a concrete safety improvement, and it exists because the protocol stopped requiring a persistent connection to ask a question.

Header-based routing and cacheable lists

Streamable HTTP requests now must carry Mcp-Method and Mcp-Name headers. Method and tool names travel in HTTP headers instead of only inside the JSON body. Gateways, rate limiters, and web application firewalls route, meter, and authorize on those headers directly, without parsing request bodies.

If you have ever tried to rate-limit an expensive tool differently from a cheap one, or block a specific tool at the edge for a specific tenant, you know why this matters. Under the old design that required a gateway that understood JSON-RPC payload structure. Now it requires a header match rule, which every piece of network infrastructure built in the last thirty years already supports.

Responses from tools/list, prompts/list, resources/list, and resources/read now carry ttlMs and cacheScope fields. Clients cache tool catalogs for as long as the server says is safe, and list responses have a deterministic order. Deterministic ordering matters more than it sounds: if a tool catalog comes back in a different order on each reconnect, the prompt that lists those tools changes, and every upstream prompt cache misses. Stable ordering plus explicit TTLs keeps those caches warm across reconnects, which shows up directly on the token bill.

Authorization hardening

The maintainers state plainly that authorization is where implementers spend most of their integration time, and this revision goes after several specific weaknesses.

Authorization servers should now return the iss parameter per RFC 9207, and clients must validate it before redeeming an authorization code. That closes the authorization server mix-up attack, where a client tricked into talking to a malicious authorization server hands a code to the wrong party.

Clients now set application_type during Dynamic Client Registration, which fixes a long-running annoyance where authorization servers rejected localhost redirect URIs for desktop and command-line applications. If you have debugged a CLI OAuth flow that died on a redirect_uri error, that was the cause.

Client credentials are now bound to the issuer that minted them, with no reuse across authorization servers. And Dynamic Client Registration itself is formally deprecated in favor of Client ID Metadata Documents. DCR keeps working for backward compatibility and gets removed in a future revision.

Extensions, Tasks, and deprecations

The release locks in a formal extensions framework. Tasks, which handles long-running operations, moved out of the experimental core into the io.modelcontextprotocol/tasks extension, with a poll-based tasks/get and a new tasks/update. AWS contributed that extension. MCP Apps, which covers server-rendered user interfaces, and Enterprise Managed Authorization sit alongside it as extensions rather than core features.

That structure solves a governance problem as much as a technical one. Under a monolithic spec, every capability has to ship on the spec's release cadence, so useful work waits for unrelated work. Extensions ship on their own timelines and version independently. The core stays small enough that a new implementation is achievable by a small team.

Change notifications move off the old HTTP GET endpoint to a single subscriptions/listen stream that clients opt into per notification type. Roots, Sampling, and Logging are deprecated. They keep working for at least twelve months, and new implementations should not adopt them. The legacy HTTP+SSE transport is officially deprecated with a year-long offramp.

That twelve-month floor comes from a new formal deprecation policy, which is one of the most underrated items in the release. A dated protocol with a written deprecation window lets a platform team plan an upgrade quarter instead of reacting to a surprise. Vendors who need to certify software against a spec version now have something to certify against.

SDKs and ecosystem

All four Tier 1 SDKs speak the new version as of publication day: TypeScript, Python, Go, and C#. The Rust SDK supports it in beta.

The ecosystem response tells you how deep MCP has embedded itself. AWS reports the stateless core available in Amazon Bedrock AgentCore, letting developers deploy MCP servers on standard infrastructure without managing sessions or persistent connections. Cloudflare's Agents SDK supports the spec from day zero, so developers run MCP servers directly in Workers, with customers like Sentry and Linear picking up the improvements immediately. Microsoft ties MCP to Foundry's unified toolbox endpoint, describing it as what let them scale from dozens of integrations to thousands while centralizing governance, identity, and observability. Google Cloud framed the stateless architecture as removing friction from deploying agentic workflows at scale.

Two data points from smaller players say more than the platform quotes. Honeycomb reports that nearly 20 percent of all monthly interactive queries on their platform now come from agents. Manufact, which hosts thousands of MCP servers, reports that the new SDK cut their package size by roughly 83 percent and made it about 25 percent faster thanks to the client-server split.

Twenty percent of interactive queries coming from agents is the number to sit with. That is not a pilot. That is a workload class.

Migration reality

This release breaks things, and the maintainers say so directly. Servers speaking 2026-07-28 will not necessarily work with older clients, and older servers will not necessarily work with new clients. Teams that depended on session identifiers have real migration work ahead of them, though the SDK maintainers incorporated early testing feedback specifically to reduce that cost.

Here is a practical checklist for anyone running MCP servers in production this quarter:

Inventory every place your server depends on Mcp-Session-Id or on state that lives across calls in memory. Each one becomes an explicit handle minted by a tool.
Replace server-initiated elicitation and sampling with MRTR flows. Test the retry path carefully, because the client now resends the original call.
Add Mcp-Method and Mcp-Name to every outbound request if you write a client. Add header-based rules to your gateway if you operate one.
Set ttlMs and cacheScope on list responses deliberately rather than accepting defaults. Too long and clients miss new tools. Too short and you pay for cache misses on every reconnect.
Move off Dynamic Client Registration toward Client ID Metadata Documents. Validate iss per RFC 9207 on every code redemption.
Audit for Roots, Sampling, and Logging usage. You have twelve months, which sounds long and is not.
Plan the HTTP+SSE transport retirement on the same year-long clock.

Do the inventory in step one first. It usually surfaces state you did not know you were carrying.

What this changes for data and analytics teams

If your team exposes data through an MCP server, this release changes your architecture in four specific ways.

Deployment gets ordinary. An MCP server that fronts a query engine or a catalog becomes a stateless HTTP service, which means it deploys the same way as every other service you already run. Same autoscaling rules, same rolling updates, same health checks. The special handling goes away.

Authorization gets a real story. The issuer binding and RFC 9207 validation changes close attack paths that security review teams ask about by name. If you have been stuck in a review because nobody had a good answer for authorization server mix-up, you now have one.

Tool catalogs get cacheable, which changes cost. A data MCP server often exposes a large tool surface: one tool per dataset, per metric, or per saved query. Sending that catalog on every reconnect was expensive in tokens and in latency. With ttlMs and deterministic ordering, clients cache it and upstream prompt caches stay warm.

Long-running work gets a home. Analytical queries do not finish in a request timeout. The Tasks extension gives long-running operations a poll-based lifecycle instead of forcing you to invent one. For anything that scans a large table, that is the difference between a working integration and a pile of timeout workarounds.

The one thing that does not change is the hard part. A protocol that connects an agent to your data does not tell the agent what your data means. Column names, business definitions, join paths, and freshness expectations still have to come from somewhere, and that somewhere is a semantic layer with governance attached. MCP moved the plumbing problem out of the way. The meaning problem is still yours.

Infrastructure: AMD Ships Helios, NVIDIA Points Agents at Silicon

The hardware news this week was concentrated in one event and one partnership.

At Advancing AI 2026 in San Francisco on July 23, AMD launched its next-generation AI infrastructure and physical AI portfolio, led by Helios rack-scale solutions now in production for deployment at gigawatt scale. CEO Lisa Su used the keynote to move from roadmap slides to shipping silicon, walking through volume production details for the MI400 accelerator family, the Helios rack built around it, and EPYC Venice, the first x86 server processor built on TSMC's 2 nanometer node.

The customer commitments are the story. AMD confirmed that OpenAI and Meta together account for 12 gigawatts of committed accelerator capacity, with Microsoft Azure and Oracle named as early Helios customers. Meta's portion is a separately confirmed 6 gigawatt deployment across multiple chip generations, starting with roughly 1 gigawatt of MI450-class hardware in the second half of 2026. Converting gigawatts to chip counts is imprecise because power draw varies by generation and cooling design, but industry estimates put one gigawatt at roughly 25,000 to 50,000 high-end accelerators. That puts the combined commitments in the range of several hundred thousand chips at full deployment.

Anthropic and AMD detailed a partnership to deploy up to 2 gigawatts of AMD Instinct MI455X GPUs in Helios racks, paired with a multiyear engineering collaboration that uses Claude to accelerate AMD software development. The specific targets are workload optimization for Instinct GPUs and ROCm software development, and AMD plans broad Claude adoption across its engineering and product teams. OpenAI and AMD announced a parallel effort to optimize the stack from silicon to software.

The ROCm detail is the one worth watching. AMD's hardware has been competitive on paper for several generations. The software has been the gap. SemiAnalysis, which in December 2024 gave AMD no chance of breaking NVIDIA's CUDA advantage, revised that assessment on July 25, 2026 to a strong chance of success conditional on two risks: the Helios rack production ramp, where weak SerDes require retiming up to 85 percent of the backplane with more than 550 Broadcom ethernet retimers per rack, and a persistent shortage of stable internal GPU clusters for software development.

That second risk is exactly what the Anthropic collaboration attacks. Using a frontier model to accelerate kernel and library development is a direct answer to a software velocity problem, and it makes AMD's competitive position partly a function of how well AI-assisted systems programming works in practice. If it works, that is a meaningful data point well beyond one vendor's roadmap.

NVIDIA had its own week. The company announced a long-term partnership with Safe Superintelligence Inc. on July 27 to accelerate SSI's growth. On July 26 it announced a collaboration with Cadence and Synopsys aimed at chip design complexity, and expanded the NVIDIA Agent Toolkit for engineering with PhysicsNeMo and CUDA-X libraries exposed as agent-ready tools and skills.

The chip design angle is a closed loop worth naming. NVIDIA is applying AI agents to silicon engineering because chip complexity per generation is outgrowing what traditional design methods handle, according to Tim Costa, the company's VP and general manager of computational engineering. Chips designed with agent assistance run the models that power the agents that design the next chips. Every step in that loop that gets faster compounds.

On market structure, TrendForce data cited in the same chip maker analysis puts ASIC-based systems at about 27 percent of 2026 AI server unit shipments, down slightly from an April estimate of 27.8 percent after chip validation and tuning delays at Meta and AWS, against 69.7 percent for GPU-based systems, with ASICs projected to reach roughly 40 percent by 2030. The custom silicon story is real and it is slower than the headlines suggest.

For data platform teams, the practical read on all of this is about supply and price rather than architecture. Two credible accelerator vendors at gigawatt scale means better availability and better negotiating position than a single-vendor market. It also means your inference and embedding workloads need to be portable enough to move, which puts a premium on standard formats and standard interfaces at every layer of the stack.

The Data Layer: Format Work That Makes AI Workloads Cheaper

One story this week connects the protocol news and the hardware news to the storage layer underneath both, and it came out of an Apache mailing list rather than a press release.

The Apache Parquet community voted to add ALP encoding to the Parquet format, with the vote passing on 11 +1 votes, 7 of them binding. ALP stands for Adaptive Lossless floating-Point. It compresses double and float columns by finding a decimal representation that maps values to integers, encoding those integers with existing integer techniques, and applying a fallback scheme for values that do not fit the pattern.

Floating point has been the worst-compressing column type in analytical storage for as long as columnar formats have existed. General purpose compressors do poorly on it because IEEE 754 bit patterns look close to random. That mattered moderately in a world of financial metrics and sensor readings. It matters enormously in a world where a meaningful fraction of stored data is embedding vectors, model scores, and feature values, all of which are float arrays.

Parquet is working the string side of the same problem through the FSST spec, which reached general consensus this week according to Arnav Balyan. FSST builds a symbol table of common substrings and encodes against it, which handles the repeated-prefix and shared-vocabulary patterns that show up in logs, identifiers, and document text.

Parquet also merged a FILE logical type, and the Apache Iceberg community immediately opened a proposal for a matching file data type in Iceberg V4. That gives tables a first-class way to reference documents, images, and other blobs sitting in object storage, with the reference metadata stored in the table itself and queryable through SQL.

Put those three together and the destination is clear. A table where one row holds structured columns, semi-structured JSON, an embedding vector, and a reference to the source document, all in one open format, all compressed properly, all queryable through one interface. That is the storage substrate a retrieval pipeline actually needs, and it removes the usual arrangement where a vector database, a document store, and a warehouse each hold a partial copy of the truth and drift apart.

The connection to the MCP news is direct. Agents that query data through MCP servers are only as good as the data layer underneath. If that layer is three disconnected systems, the agent gets three inconsistent answers and no way to tell which is right. If it is one governed lakehouse, the agent gets one answer with lineage attached.

Where the compute money is going

One number frames the rest of the hardware story. Gigawatts, not chip counts, are now the unit of measure in accelerator announcements. That shift happened because power is the binding constraint. Fabs have capacity. Grid interconnects, transformers, and cooling do not. When a vendor announces gigawatt-scale commitments, the interesting question is not whether the chips exist but where the power comes from and when the substation gets built.

This matters for anyone planning data platform capacity, even indirectly. Inference pricing tracks accelerator availability, and accelerator availability tracks power delivery schedules that run on multi-year timelines. Pricing on frontier models has moved down steadily through 2026, with introductory rates and tiered families becoming standard. The direction is favorable and the volatility is real, which argues for architectures where switching a model provider is a configuration change rather than a rewrite.

The same logic applies to the ASIC question. Custom silicon at roughly 27 percent of 2026 AI server shipments, headed toward 40 percent by 2030 on current projections, means a meaningful share of inference will run on hardware you do not choose and cannot benchmark directly. Your defense against that is the same as your defense against everything else in this stack. Keep the interfaces standard, keep the data in open formats, and keep the ability to move.

What This Week Means If You Build With AI

Five takeaways worth acting on.

Treat the MCP upgrade as a scheduled project, not a background chore. The breaking changes are real, the twelve-month deprecation windows are generous but finite, and the migration surfaces hidden state you probably do not have documented. Teams that do the session inventory in August will have a much better September than teams that discover the problem when a client stops connecting.

Stateless is a scaling decision and a debuggability decision. The maintainers made the case that explicit handles beat hidden transport state because the model can see them. That reasoning applies well beyond MCP. Anywhere your agent architecture keeps state the model cannot inspect, you have created a class of bug that reproduces poorly and explains badly.

Standardize agent policy before agent tooling. The coding agent vendors converged on approval modes, credential boundaries, and audit surfaces this month because that is what enterprise buyers demand. Write your policy for what agents do unattended, which credentials they hold, and what gets logged. Then pick tools that implement it. Doing it in the other order means rewriting the policy every time a vendor ships a release.

Budget for metering. Cursor's own documentation puts daily agent users at three to five times the list subscription price. GitHub moved to credits. Codex runs on token-based credits. Any capacity plan built on flat per-seat pricing is going to be wrong, and the error runs in one direction.

Keep the data layer open enough to move. Two accelerator vendors at gigawatt scale, four Tier 1 MCP SDKs, and a Parquet format gaining first-class support for float and string compression all point the same way. The parts of the stack that are standardized are the parts you can renegotiate. The parts that are proprietary are the parts that set your price.

The pattern across every story this week is the same one that showed up in the Apache mailing lists: the AI stack is trading demo velocity for operational guarantees. Sessions became handles. Monolithic specs became core plus extensions. Coding agents grew approval gates. Accelerator roadmaps became production deployments with named customers. None of it is exciting in the way a new model release is exciting. All of it is what has to happen before any of this runs a business.

What to Watch Next Week

Four things are worth tracking as August opens.

MCP client adoption rates. The spec is final and the Tier 1 SDKs shipped on day one, but the number that matters is how fast client applications adopt it. A server speaking 2026-07-28 does not necessarily work with an older client. Expect a period where server authors run both versions in parallel and expect the migration guides to get better as the first wave of production upgrades produces real war stories. The Rust SDK moving from beta to stable is the milestone to watch for anyone building in that ecosystem.

Whether extensions actually ship independently. The extensions framework is a promise about release cadence. Tasks, MCP Apps, and Enterprise Managed Authorization are the first test of it. If those three evolve on their own timelines without dragging the core spec along, the framework works. If the next core revision bundles extension changes anyway, it does not.

Helios production ramp reports. The retiming issue flagged in the SemiAnalysis assessment, requiring hundreds of ethernet retimers per rack, is the kind of manufacturing detail that separates an announced deployment from a delivered one. Watch for customer deployment confirmations from Microsoft Azure and Oracle rather than vendor slides.

ALP and FSST reference implementations. The Parquet format vote is the easy half. Reference implementations in Java, C++, and Rust are what determine when these encodings reach the data you actually query. Compression improvements on float and string columns show up on your storage bill and your scan times, so this is worth following even if format internals are not your usual reading.

One broader thing to keep an eye on: the number of places where an AI standard and a data standard now touch each other. MCP servers fronting catalogs. Catalogs deciding how agents authenticate. Table formats adding types designed for the data AI produces. Those used to be separate conversations happening in separate communities. They are converging fast, and the teams that read both sides will build better systems than the teams that read one.

Resources to Go Further

AI moves fast. Here are tools and resources to help you keep pace.

Try Dremio Free - Experience agentic analytics and an Apache Iceberg-powered lakehouse. Start your free trial

Learn Agentic AI with Data - Dremio's agentic analytics features let your AI agents query and act on live data. Explore Dremio Agentic AI

Join the Community - Connect with data engineers and AI practitioners building on open standards. Join the Dremio Developer Community

Book: The 2026 Guide to AI-Assisted Development - Covers prompt engineering, agent workflows, MCP, evaluation, security, and career paths. Get it on Amazon

Book: Using AI Agents for Data Engineering and Data Analysis - A practical guide to Claude Code, Google Antigravity, OpenAI Codex, and more. Get it on Amazon

Browse the full catalog of 50+ books at books.alexmerced.com.

Alex Merced, Data Lakehouse and AI Evangelist

The Filters We Build: How Every New Medium Rewires Our Defenses, From Radio Ads to AI Slop

Alex Merced — Thu, 23 Jul 2026 15:28:41 +0000

My grandparents' generation learned to tune out the radio pitchman. My parents learned to mute the commercials and hang up on telemarketers. I was born in 1985, which means I was nine years old when the first banner ad appeared on the web, and my generation built its filters live and in production: we learned to stop seeing banner ads, to close a pop-up before it finished loading, to smell a phishing email from the subject line alone. The generation after mine learned to clock a sponsored post mid-scroll before they could drive. And right now, all of us together are being asked to learn something harder: how to doubt a voice on the phone that sounds exactly like someone we love, and, just as urgently, how to find anything worth our attention in an ocean of machine-generated noise.

Notice that those are two different problems, and this article is about both, because they have always been both. Every time a new medium arrives, it arrives faster than our defenses, and the defenses we need come in two kinds. The first is the shield: the ability to recognize manipulation, to spot the scam, the ad dressed as advice, the lie dressed as news. If you have ever walked an older relative back from the edge of a scam, patiently explaining that no, the IRS does not accept payment in gift cards, you know the shield and its uneven distribution across generations. The second is the sieve: the ability to sort the valuable from the worthless without drowning, to find the good stuff without reading everything, and to avoid the quieter failure nobody warns you about, missing wonderful things because exhaustion made you stop looking. The shield fails loudly, in stolen savings and viral hoaxes. The sieve fails silently, in overload, in cynicism, in the slow retreat from a medium that became too noisy to love.

And at every point in this history, the same institution has appeared to carry both burdens for us: the trusted curator. The news anchor, the magazine editor, the radio DJ, the blog aggregator, the newsletter writer. Curators are how societies scale their filters, and one of the central arguments of this article is that in the AI era, when anyone can theoretically make anything, the curator is about to become more valuable than at any point in media history. So this is the full story, told with the receipts: how each medium forced a cognitive realignment, how the shield and the sieve got built each time, how curation kept reinventing itself, why the burden always falls unevenly across generations, and then a fair assessment of both the optimistic and pessimistic cases for what happens as AI-generated content floods every channel we have. I write about data and AI for a living, I have watched this latest shift from unusually close range, and I sit at the exact generational midpoint of the story: young enough to have built the internet-era filters natively, old enough to feel the new ones straining, with family group chats full of screenshots that start with "is this real?" Same as you.

Filters Are Infrastructure, and They Have Two Jobs

Before the history, the concepts, because they are the thread through everything that follows.

A cognitive filter, as I am using the term, is a learned, mostly automatic judgment about media: this is an ad, this is a scam, this is worth my time, this is not. Filters are specific to formats and have to be learned per medium, because each medium carries its own signals. Knowing a carnival barker is selling you something does not transfer automatically to knowing that a friendly radio voice is doing the same, and neither transfers to recognizing that a heartfelt product recommendation from a YouTuber was purchased.

The shield job is protective: keep out the predatory and the false. The sieve job is selective: let in the valuable, at a volume you can survive. The sieve job is older than people realize and was named perfectly back in 1971 by the economist Herbert Simon: a wealth of information creates a poverty of attention, and what an abundant medium demands is precisely the ability to allocate attention efficiently. Every generation since has relearned Simon's law at higher volume, and the exhaustion you feel scrolling past the four hundredth piece of content today is not a personal failing. It is the tax every abundant medium levies until filters and curators catch up.

Three properties of filters explain most of media history. First, they are built socially: from embarrassment, warnings, jokes, school, and eventually institutions, regulations, labels, spam folders, that encode the filter into the environment so individuals no longer carry the whole load. Psychologists call the deliberate version inoculation: exposing people to weakened doses of manipulation, with the trick explained, builds durable resistance, and most of history's filter-building has been accidental inoculation, a society catching the disease and developing antibodies the hard way. Second, both filter jobs exhaust the same finite resource: vigilance and selection draw on the same attention budget, which is why the eras of greatest media abundance are also the eras of greatest scam success, tired sorters make easy marks. And third, the property that matters most for our moment: most practical filters are actually shortcuts that key on production cost. Bad grammar signaled a scammer who could not afford a copywriter. A polished broadcast signaled an institution with something to lose. Effortful content signaled that someone believed the content was worth effort. Those shortcuts worked for a century because production cost was a real constraint, and every one of them fails when production cost falls to zero. Hold that thought. It is the key to why the AI moment feels different in kind.

And when individual filters cannot keep up, societies delegate: to curators, humans and institutions who filter professionally, staking their reputations on the sorting. The curator solves both jobs at once, vouching against the predatory and selecting for the valuable, and charges for it in attention, subscription, or trust. Watch the curator's costume change in every era below, because the role never disappears. It just gets rehired in new clothes.

Radio: A Stranger's Voice, and the First Modern Curators

Start in 1922, when a real estate company paid AT&T's station WEAF about fifty dollars for ten minutes of airtime to praise apartments in Queens, the first paid radio advertisement, controversial enough that serious people, including Commerce Secretary Herbert Hoover, called commercializing the public airwaves unthinkable. The market disagreed, and within a decade American radio was thoroughly sponsor-funded, with entertainment and salesmanship blended so completely that the era's signature genre is named for its advertiser: the soap opera.

Radio's shield problem was intimacy. For all prior history, a voice in your home belonged to someone physically present. Radio put strangers' warm voices in the family kitchen, and listeners had no inherited filter for parasocial persuasion, the announcer selling in the same trusted tones that read the news. The famous rupture came in 1938 with the War of the Worlds broadcast, and the story's best lesson is its nuance: modern historians have shown the reported mass panic was largely a fabrication, inflated by newspapers eager to paint their upstart rival as dangerous. Some listeners were fooled by drama formatted as news bulletins. Newspaper readers were simultaneously fooled by motivated reporting about radio. In 1938, everyone was running deficient filters for somebody's channel.

Radio's sieve problem was newer still: hundreds of stations, an endless broadcast day, and no way to know what deserved the family's evening. The solutions invented for it became the templates for everything since. The networks themselves were curation machines, their programming departments deciding what the nation heard. Program guides and radio columns told you where the good stuff was. And the era minted a figure who would reappear in every medium after: the trusted voice whose taste you outsourced to, the announcer, and later the disc jockey, whose entire job was to have listened to everything so you did not have to. By the 1940s the realignment was complete: audiences had learned the commercial break as a genre, sponsorship rules had institutionalized the shield, and the curation layer had made abundance livable. A generation of kids grew up fluent in all of it while their print-era elders adapted partially, establishing the pattern that never breaks.

Television: Seeing Is Believing, and the Anchor as National Filter

Television's first legal commercial aired July 1, 1941, a ten-second Bulova watch spot before a Dodgers game, purchased for nine dollars, and what followed was the most persuasive machine yet built, stacking sight on sound on domestic intimacy. For its first decade, audiences extended television the credulity that "seeing is believing" implies, and two ruptures forced the correction. The quiz show scandal of the late 1950s revealed that beloved big-money programs were rigged, contestants coached, drama scripted, and the congressional hearings of 1959 taught a nation that television was a constructed artifact whose appearance of spontaneous reality was itself a production value. The subliminal advertising panic taught a stranger lesson: market researcher James Vicary claimed hidden flashed messages drove snack sales, the nation was horrified, and years later he admitted the study was essentially fabricated. The specific threat was fake, and the panic still did real work, installing a structural suspicion of the persuasion industry that outlived its bogus origin. Moral panics about new media are usually wrong in their specifics and weirdly productive in their effects.

But television's more instructive legacy for our purposes is its curation golden age, because TV made curators into the most trusted people in the country. The evening news anchor was a human filter for reality itself, and for decades polls ranked Walter Cronkite among the most trusted figures in America, a man whose actual job description was deciding which fraction of the day's events deserved twenty-two minutes of national attention. TV Guide became one of the highest-circulation magazines on earth by solving the sieve at the level of the listing. Critics, prime-time schedules, and network standards departments formed a thick curation layer, and audiences, whatever they grumbled, largely accepted the deal: enormous filtering power concentrated in few hands, in exchange for a legible, navigable medium. The deal had real costs, gatekeeping excluded voices and narrowed the aperture of what counted as news, and the next era would be defined by tearing the deal up. It is worth remembering, before we cheer or mourn, that the deal existed because it solved a problem, and the problem did not go away when the deal did.

Institutional shields assembled alongside: truth-in-advertising enforcement, and the children's television rules born from research showing young children literally cannot distinguish programs from commercials, filters built into law for the humans too young to have personal ones. The generational pattern repeated on schedule, kids of the sixties feeling an ad coming from the music cue alone while some of their print-era grandparents believed the man on television because he seemed so sincere.

The Internet, Act One: Banner Blindness and the Rise of the Amateur Curator

On October 27, 1994, the first banner ad appeared on HotWired, an AT&T campaign, and its performance is the purest specimen of filter formation ever recorded: roughly forty-four percent of viewers clicked it. Within a few years, average click rates collapsed toward a fraction of one percent, and usability researchers documented banner blindness, users' gaze skating around ad-shaped page regions without consciously perceiving them. An entire population built an automatic perceptual filter in under half a decade. Pop-up ads spawned blockers so decisively that the format's inventor eventually published a public apology. Email spam grew past all human filtering, and the response previewed something important: the burden moved from person to infrastructure, Bayesian and then industrial machine-learning filters deleting the flood before human eyes saw it, a problem that felt existential for the medium in 2002 largely won by automated defense.

Email also brought phishing, where the generational shield gap turned from observation into crime statistics: the Nigerian prince became a worldwide joke, which is to say a socially transmitted inoculation, and the scam evolved as parasites do, always probing for the population whose filters lagged, finding it among older adults whose trust instincts were calibrated for an era when a professional-sounding voice was expensive to fake.

And here is the act-one story that usually gets left out: the web's first abundance crisis, and the curatorial explosion that answered it. A million pages and no map produced the portal era, Yahoo began literally as a hand-edited directory of the web, human librarians for a new continent. Then curation democratized in a way no previous medium had allowed: Slashdot and its peers turned communities into editors, bloggers became trusted guides to their corners of the world, blog aggregators and blogrolls wove webs of vouched-for sources, and RSS let individuals compose personal newspapers from chosen voices. The professional gatekeeper's monopoly broke, and what replaced it was not chaos but a bazaar of small curators, each staking a reputation on their sorting. For those of us who lived it, the golden age of blogs was really a golden age of curation, and the lesson it taught is one this article will lean on at the end: when a medium's abundance explodes, the value migrates to whoever can be trusted to point.

The Internet, Act Two: Manual Slop, Algorithmic Feeds, and the Newsletter Counterrevolution

Before anyone said AI slop, the internet spent a decade drowning in the handmade kind. The economics were simple: search and social paid, in traffic and ad revenue, for content matching queries and provoking engagement, so industries manufactured exactly that at the lowest cost, content farms paying pennies for keyword-stuffed filler, clickbait perfecting the headline as an unpaid cliffhanger, recipe pages burying the recipe under two thousand words of sludge. It got bad enough that in 2011 Google shipped its Panda update specifically to demote content-farm sludge, an institutional sieve deployed at planetary scale because individual sieves could not keep up. Advertising dissolved itself into content, native ads dressed as articles, influencers industrializing radio's parasocial trick, your friend who happens to love this mattress, at a scale that forced the #ad disclosure mandate, a tiny legal shield bolted onto a format designed to defeat shielding.

The deeper shift was who did the curating. The feed replaced the anchor: engagement-ranked algorithms became the default sieve for billions, and they changed the question filters must answer. Broadcast asked, is this message selling me something. The feed asks, why am I seeing this at all, a question about invisible machine curation optimizing for the platform's engagement, not your nourishment, and most users never learned to ask it. The consequences filled a decade of headlines, and the era's research produced two findings everyone should carry. From Stanford: thousands of students, fluent digital natives all, routinely could not tell news from native ads and judged credibility by polish, proving that fluency in a medium's interface is not a filter for its content, while the professionals who sorted well used a different move entirely, lateral reading, leave the suspicious thing, open new tabs, and check what the rest of the world says about the source. And from the 2016 misinformation reckoning: researchers found Americans over sixty-five shared roughly seven times as many articles from fabricated news domains as the youngest cohort, not from lesser intelligence, the authors were explicit, but because the filters for the feed, for engineered virality, had not been built by a generation that arrived late with trust settings tuned for print and broadcast. The filter gap, measured.

And act two staged a counterrevolution that predicts our present: exhausted by the feed, audiences began rehiring human curators, and creators began accepting the job. The email newsletter, the most unfashionable technology imaginable, came roaring back precisely because it restored a chosen, accountable voice sorting a beat for you. Podcast hosts became the new DJs. Playlist curators became the new radio programmers. Substack, Patreon, and the subscription wave demonstrated that people will pay actual money for trustworthy selection, a market verdict on Simon's law: attention had become so scarce, and slop so abundant, that filtering became a product. I participate in this economy from both sides, as a reader who survives on chosen newsletters and as a writer of them, and the mechanics are worth stating plainly for what comes next: a curator earns trust through consistency, transparency about methods and interests, a track record you can audit, and skin in the game, a name attached, a reputation that pays the price of being wrong. Keep that list. It is about to become the most valuable checklist in media.

The AI Era: When Slop Learned to Make Itself

Generative AI did something specific to the content economy: it drove the marginal cost of producing plausible media, text, images, voices, video, toward zero. The manual slop era needed buildings full of underpaid writers. The AI slop era needs a prompt and a loop, and the result is a flood without precedent: feeds thick with synthetic engagement bait, the surreal shrimp-Jesus genre of algorithm chum, search results silted with machine-written filler, fake books, fake reviews, fake people, a tide sufficient that "slop" entered the mainstream vocabulary and the once-fringe "dead internet" joke, that much of what you see online is machines performing for machines, started sounding like a rounding estimate.

On the shield side, the collapse of cost signals shows up in the fraud data, and it is brutal. The FBI's Internet Crime Complaint Center logged 4.9 billion dollars in reported fraud losses among Americans over sixty in 2024, and 7.75 billion in 2025, a fifty-nine percent single-year jump, with average losses among older victims around thirty-eight thousand dollars, roughly double the figure for younger filers. In 2025 the bureau recorded over three thousand one hundred complaints from seniors specifically referencing AI, with losses exceeding 352 million dollars, including the scam that haunts every family: the distress call, a grandchild's voice cloned from a birthday video, sobbing about an accident and begging for money fast, more than five million dollars in reported losses in one year to that play alone. Regulators believe reports capture a fraction of reality, with the FTC estimating true fraud losses among older adults may have reached eighty-one and a half billion dollars in a single recent year. The most trusted signal a human knows, the voice of family, is now a forgeable asset. And before the young get comfortable: the same reports show adults under forty falling for investment scams, crypto schemes, and influencer-laundered garbage at remarkable rates, losing less per incident mostly because they have less to lose. Every generation has fast filters for the formats it grew up inside and blind spots for manipulations wearing the right clothes. AI slop wears everyone's right clothes.

On the sieve side, the flood attacks from the other direction, and the harm is quieter but enormous: cognitive exhaustion. When ninety percent of what reaches you is plausible-looking filler, the cost of sorting explodes, and human beings respond to unpayable sorting costs the only way they can, by disengaging, skimming, defaulting to the three sources they already know, and slowly abandoning open discovery altogether. This is the failure mode nobody insures against: not being fooled, but missing things, the brilliant unknown writer drowned in machine sludge, the genuine breakthrough scrolled past because the last forty breakthroughs were synthetic hype, the medium-wide retreat into walled gardens and closed group chats that trades serendipity for sanity. Simon's law at maximum volume: infinite content, and a poverty of attention so severe that attention allocation becomes the whole game.

Which is exactly why the oldest institution in this story is being rehired at a premium. When anyone can theoretically make anything, the scarce goods become taste, judgment, verification, and accountability, and those are precisely what a curator sells. The value migrates, as it did in every previous flood, to whoever can be trusted to point.

The Realignment: New Shields, New Sieves, New Curators

So what do the new filters actually look like? Watching them assemble in real time, I see three layers rising together.

The new shield relocates trust from content to channel. The old question, does this look and sound real, is fully deprecated, because everything looks and sounds real. The replacement is provenance: where did this come from, through what channel, verifiable how. Families are adopting code words no voice clone can know, and callback discipline, hang up and dial the number you already had. Security guidance has converged on urgency itself as the red flag, because manufactured time pressure is the one signal every scam still needs, the tell that survives when every surface is forgeable. Content credentials, cryptographic provenance attached at the point of capture, are moving through standards bodies into cameras and platforms. And the lateral-reading move generalizes perfectly: do not stare harder at the video, step outside it and check who corroborates.

The new sieve is the deliberate curation stack, and building one is becoming a basic life skill: a chosen portfolio of accountable filters, newsletters, feeds you compose rather than feeds composed for you, communities small enough to vouch for their members, curators whose taste you have audited, arranged so that discovery flows through trust instead of through algorithmic chance. The quiet mark of media health in the AI era is that less of your attention arrives unsolicited and more of it arrives through named, chosen intermediaries whose incentive is your long-term trust rather than your next click.

And the new curator economy answers the question this article has been building toward: in a world where anyone can make anything, who wins, and how do we find them? Watch what actually earns trust now, because it maps perfectly onto the checklist from the newsletter era, intensified. Consistency over time, a track record that cannot be faked retroactively. Transparency of method, including transparency about AI itself: the creators thriving are not the ones hiding the tools but the ones showing their work, here is what I used, here is what I verified, here is my judgment layered on top. Skin in the game, a real name and reputation that pays for errors, which is exactly what anonymous slop factories cannot post. Verifiable provenance and primary sourcing, claims that trace to checkable origins. And accountability rituals, corrections issued, predictions revisited, the small habits that signal a filter maintained rather than performed. Notice what this list means economically: AI does not devalue creators, it devalues unaccountable content, and it raises the return on every trust signal machines cannot cheaply counterfeit. The tools are available to everyone. The reputation is not, and reputation compounds. Discovery, in turn, increasingly runs along webs of vouching, curators recommending curators, communities surfacing their trusted voices, the blogroll reborn, because in a flood the safest way to find a new source is through a source you already trust. It is how humans found reliable voices in every previous era. We are just remembering it at higher stakes.

The Cycle, Named

Four repetitions is enough to call it a model, so let me state it plainly, because the model tells you where we stand. Every medium runs five stages. Arrival: a utopian, hobbyist, largely commercial-free dawn, radio's amateur years, the homepage-and-webring web, the playful first year of image generators. Exploitation: persuaders and predators arrive fluent, the toll broadcast, the nine-dollar Bulova spot, the forty-four-percent banner, the cloned voice, and enjoy a golden age against an unfiltered population, which is when the largest transfers of money and attention quietly occur. Rupture: a scandal or scare forces mass awareness, the Martian broadcast, the quiz show hearings, the misinformation reckoning, today's deepfake incidents, usually wrong in its specifics and productive in its effects. Filter construction: individuals build heuristics through embarrassment, communities distribute them through jokes and warnings, curators professionalize the sorting, and institutions encode the rest into the environment. And equilibrium: the medium stays manipulable and abundant, but a workable détente holds, shield and sieve doing their jobs invisibly, until each generation is astonished the previous one ever fell for the old tricks or drowned in the old floods.

Two margin notes on the cycle. The stages overlap across generations, the cohort born into one medium's equilibrium meets the next medium at its dawn, which is the engine of every asymmetry in the next section. And the cycle has been accelerating, radio's loop taking roughly three decades, television's about two, the web's arguably one, which supports the optimists, while the threat has accelerated faster still, which supports the pessimists. By my reckoning we currently sit between rupture and filter construction for generative media, panic ripening into building, while the capability underneath keeps shipping fresh exploitations against filters not yet poured. The cycle predicts we finish the loop. It does not predict the bill.

Why the Burden Falls Where It Falls

The generational asymmetry deserves one honest section, because it governs both filter jobs and it is easy to gesture at without understanding.

The young build filters faster for structural reasons. They marinate, processing more of a new medium in a month than their elders do in a year, every exposure a training example. They practice with low stakes, getting fooled by a fake screenshot at fifteen costs embarrassment in a group chat, a cheap and effective inoculation, while getting fooled at seventy-five can cost a retirement. They calibrate socially, youth culture metabolizing each new manipulation into jokes and slang at speed, and mockery is filter distribution at its most efficient. And they carry no installed base, no lifetime of signals to unlearn.

The old lag for the mirror-image reasons and a few crueler ones. Retrofitting is harder than installing, sixty years of evidence that your trust heuristics work is sixty years of ammunition against updating them. The scammers target them because that is where the money is, median net worth among Americans in their late sixties running many multiples of the youngest adults'. Isolation removes the social filter, the group chat that laughs a scam out of the room simply does not convene around many older adults, which is why so much elder fraud is discovered after the fact by a relative. Age-related cognitive change is real and deliberately exploited by urgency scripts. And the same asymmetries hit the sieve: composing a curation stack is itself a skill built by immersion, and the overload that makes a younger user prune their feeds makes an older one retreat to whatever the television and the default feed serve, which is how entire cohorts end up marinating in exactly the channels where slop and scams concentrate.

Two correctives keep this honest. The asymmetry is about specific filters, not general wisdom: the older adults ambushed by voice clones are often unfoolable by the charismatic financial guru or the miracle investment, filters they built decades ago the hard way, while the young walk straight in. And everyone reading this will be the lagging generation eventually, a sentence I write at forty-one with full awareness that the filters I am proudest of were built for media whose successors are already in the lab. Humility about that is not politeness. It is forecasting.

The Optimistic Case

Now the assessment, both directions, played fairly. First, optimism, which is stronger than the daily headlines suggest.

The base rate favors adaptation: a century of evidence in which every new medium triggered the same cycle, arrival, exploitation, panic, filter-building, equilibrium, and the filters always got built, banner blindness in half a decade, the spam crisis that experts thought might kill email defeated so thoroughly by automated filtering that younger readers do not know it happened. The defense automates again this time, and faster, because for the first time the defenders' tools improve on the same curve as the attackers': AI scam-call screening, synthetic-media detection, provenance credentials in standards bodies and cameras, fraud-hold rules giving banks time to intervene on suspicious disbursements. Inoculation now works at industrial scale, with prebunking research showing durable, measurable resistance across age groups from short videos and games, media literacy with clinical trials rather than folk medicine. And the market is already supplying what scarcity created: a booming economy of trust, subscriptions to accountable voices, verification services, human-made premiums, curators earning real livings from the sorting, which means the sieve is being rebuilt by the same commercial energy that built the flood. The realignment, though large, is of a kind humans have completed before: we do not trust letters for their handwriting or money for its paper, we moved those to systemic trust, signatures, institutions, watermarks, so completely we forgot it happened. Provenance-based media trust is the same migration one layer deeper, strange today, invisible to the next generation. And there is a genuinely hopeful reading of the curator renaissance: the gatekeeping of the broadcast age concentrated filtering power in a few corporate hands, while the emerging version distributes it across thousands of accountable individual voices you choose among. If we land it well, we get the navigability of the Cronkite era without its monopoly, abundance and trust at the same time, which no previous medium ever quite achieved.

The Pessimistic Case

And the other side, stated with equal seriousness.

This collapse is different in kind: every previous realignment retired some signals and left the deepest ones standing, and in the end you could fall back on your eyes, your ears, and the voice of someone you love. This one retires exactly those, and beneath the senses there is no older backstop, only constructed verification systems, which can be captured, corrupted, unevenly distributed, and simply not adopted by the billions living outside the institutions that build them. The velocity mismatch may be unclosable: filters build on human timescales, years of experience, semesters of education, while the models invalidating them improve on release cycles measured in months, and every detection heuristic taught today has a shelf life shorter than the pamphlet it is printed on, with the attacks now personalized per victim, not one scam broadcast to millions but millions of scams each tailored to one family's voices and fears. The largest wealth transfer in history is proceeding under fire, tens of trillions moving through the estates of the generation whose filters are least fitted to the threat, while elder fraud losses jump by double-digit percentages annually and regulators estimate true totals an order of magnitude beyond reports. The curator layer has its own failure modes: curation concentrating into a few algorithmic super-gatekeepers wearing human faces, trust itself becoming the counterfeit of choice as slop operations cosplay the signals, fake track records, synthetic personas with years of fabricated consistency, and a discovery terrain where the honest new voice cannot surface because vouching networks calcify around incumbents. Cognitive exhaustion could win: the predictable endpoint of filter fatigue is not universal skepticism but universal shrug, everything might be fake, so evidence loses its force, the liar's dividend pays whoever benefits from doubt, and the shared factual ground that markets, courts, and elections stand on erodes not because people believe lies but because they stop believing anything can be established. And the sieve's silent failure compounds it: a generation that responds to the flood by retreating into three familiar sources and closed group chats is a generation that stops discovering, and a culture that stops discovering gets poorer in ways no fraud statistic will ever capture.

Both cases are real. My honest read, having watched a century of the pattern and the past four years up close, is that the optimistic machinery, automation, inoculation, provenance, the trust economy, is genuinely assembling, and the pessimistic clock, the velocity mismatch, the undefended wealth transfer, the exhaustion, is genuinely running, and the next few years are the race between them. Which is why the last section is not a prediction but a to-do list.

What We Owe Each Other in the Meantime

Because filters are built socially, the transition period, and we are in it, is a collective responsibility with concrete tasks on both fronts.

For the shield, starting this week: establish a code word with the relatives most likely to receive a distress call, and rehearse the callback rule, hang up, dial the number you already had, until it is reflex. Talk through the scams before they arrive, in specifics, because prebunking works and works best from people who are loved. Add trusted contacts and disbursement holds at the financial institutions of the older adults in your life, protections that exist and go unused mostly because nobody asks. Treat urgency as the universal red flag, and practice lateral reading until it replaces staring.

For the sieve, deliberately: build your curation stack on purpose, a small portfolio of named, accountable sources per domain you care about, chosen for track record and transparency, pruned quarterly, and let discovery flow through their vouching rather than through the algorithm's chance. Pay for at least some of your filters, because filters funded by your subscription serve your attention while filters funded by advertising sell it. Budget attention like the scarce resource Simon said it was: decide what you are trying to stay informed about, let the rest go without guilt, and remember that missing things is the design goal of a good filter, not its failure. And become a node yourself: curate for your family and your communities, forward the good stuff with a sentence of why, vouch and correct in the group chat, because the sieve, like the shield, scales through people who care.

And for the broader project: support the boring infrastructure, provenance standards, platform accountability, fraud reporting, media literacy in schools, that converts individual vigilance into ambient protection, because the lesson of the spam wars and the pop-up blockers is that societies win when the filter moves from the person into the environment. The goal was never a population of full-time skeptics or full-time librarians. It is a world where trust is a reasonable default because the channels earned it, and where finding the good stuff is, once again, a pleasure rather than a job.

Closing Thoughts

A century ago, families gathered around a radio and learned, slowly, that the warm voice selling them soap was a professional doing a job, and they learned to lean on program guides and trusted announcers to find the evening's worthwhile hour. Their children learned the quiz show was scripted and made an anchorman the most trusted person in the country. Their grandchildren learned not to see the banner ads and built the web's first bazaar of amateur curators, and the generation after that fled the algorithmic feed into newsletters and groupchats. Now all of us together are learning the strangest lessons yet: that a familiar voice or a perfect paragraph is evidence of nothing except that someone, or something, wanted us to receive it, and that in a world of infinite content, the scarcest and most valuable things are judgment, provenance, and a name that stands behind the sorting. The filters will be built, personal, social, institutional, because they always are. The task is to build them faster than the flood, to carry the people the flood targets first, and to make sure that in defending our attention we do not forget to spend it, generously, on the things worth finding.

I think about these dynamics constantly in my work on data and AI, because the same question, what can be trusted and how do we build systems that deserve trust, runs through everything from family group chats to global data infrastructure. If you want to go deeper, that is what my books are for, including my recent book examining both the optimistic and the pessimistic case for the economy in the AI era, the same both-sides discipline this article tried to practice, applied to jobs, wages, and what comes next. You can find it listed alongside my books on data, AI, and the systems underneath them.

Browse the full collection at books.alexmerced.com.

Apache Data Lakehouse Weekly: July 16 to July 23, 2026

Alex Merced — Thu, 23 Jul 2026 05:24:58 +0000

The lakehouse community spent this week deciding what belongs in the format and what belongs outside it. Iceberg contributors pushed to retire equality deletes in V4, Parquet voted on a new floating point encoding, and Arrow shipped its 25.0.0 release across every language it supports. Polaris debated the persistence layer that everything else sits on, DataFusion welcomed a new committer alongside a fresh release, and the incubating Ossie project fielded hard questions about maturity and identity. Read together, the dev lists tell a story about a stack that is growing up. The debates are less about whether features exist and more about what guarantees each layer owes the ones above it.

By the numbers, this was a heavy week. Polaris led with 89 messages, Iceberg posted 70, Parquet 65, Ossie 47, Arrow 36, and DataFusion 20. That is 327 messages across the six lists in seven days, spanning release votes, format proposals, persistence design, and governance. Every thread referenced below links to the full conversation on lists.apache.org, so treat this newsletter as a map and go read the primary sources on anything that touches your stack.

Apache Iceberg

The biggest structural conversation on the Iceberg list this week centered on the future of equality deletes. Huaxin Gao revived the long-running proposal to deprecate equality deletes in Iceberg V4, and the thread drew nine messages of substantive engagement. Maximilian Michels gave the streaming perspective that many practitioners will recognize. He called equality deletes the number one pain for streaming use cases and noted that many users give up when they see merge-on-read costs, or they build custom solutions that pull them away from core Iceberg. Michels also shared concrete progress from the Flink side. The index that powers ConvertEqualityDeletes persists in Flink's managed RocksDB state, updates as new data arrives, and checkpoints on a schedule. The conversion works for data written by any engine, but the conversion itself still requires Flink. Storing that index in Iceberg and opening it to all engines is the next step toward an engine-agnostic answer. Michels closed with a +1 for deprecation in V4, and he framed the plan as realistic given the progress since the first conversation in 2024. Watch this one. Removing equality deletes reshapes how every streaming writer targets the format.

Release energy stayed high on the Rust side. The community voted on Iceberg Rust 0.10.0 RC4 across a twelve message thread, passed it, and announced the 0.10.0 release within the week. The Rust implementation keeps shipping at a steady pace, and each release makes it a more credible option for teams that want Iceberg without a JVM. The ecosystem also grew sideways this week. A vote on the Apache Iceberg Terraform provider v0.1.0 RC1 followed an earlier RC0 round, which signals that infrastructure-as-code management of Iceberg resources is close to its first official release. Catalog and table management through Terraform closes a gap that platform teams have filled with custom scripts for years. Think about what the provider unlocks in practice. A namespace, its tables, and their properties get declared in the same repository as the buckets and IAM policies they depend on. Environments get stamped out from the same configuration. Drift between what the catalog holds and what the code declares becomes detectable in a plan step instead of a production surprise. Version 0.1.0 will be small, but the direction matters more than the initial resource coverage.

Two proposals this week asked what Iceberg tables should be allowed to contain. Martin Prammer proposed adding Vortex as an Iceberg file format, and he framed the draft in a smart way. The document splits into two parts. The first part defines the criteria any file format needs to meet to join Iceberg. The second part shows how Vortex meets them. Prammer wants feedback on the criteria before anyone argues about the candidate, because the criteria will outlive this one proposal. Meanwhile, the vector type discussion continued as Philipp Fischbeck backed a dense numeric vector type with non-null elements. He argued for a lean feature list with no vector-specific stats, constraints, or schema evolution in the first pass. He also agreed with Tanmay Rauth that metrics like value count and null count should live at the vector level rather than the element level. Fischbeck pointed at the ongoing Parquet fixed-size list work as the storage foundation, which connects this thread directly to the Parquet discussions covered below. AI workloads are pulling both formats in the same direction at the same time, and the two communities are coordinating rather than duplicating.

Correctness and operations got their share of attention too. Oleksii Omhovytskyi asked about release timing for the encrypted deletion-vector fix, and his message is a model bug report. On a natively encrypted format-v3 table using AWS KMS, a merge-on-read UPDATE or DELETE writes a deletion-vector Puffin file without key metadata. The next read then fails with a null key metadata error. The fix is merged on main with backports to the 1.11.x and 1.10.x branches, and Omhovytskyi verified the 1.11.x build against his exact reproduction. His question is simple: does a 1.11.1 or 1.10.3 patch land soon, or should teams wait for 1.12.0? He offered to test any release candidate. Anyone running encrypted V3 tables with merge-on-read writes should track this thread closely, because the bug blocks reads after routine write operations.

The spec and dependency conversations rounded out the week. Alexandre Dutra opened a discussion on migrating Iceberg to Jackson 3 as frameworks like Spring Boot and Quarkus make the same move. His analysis is candid. The migration is mechanical but pervasive. Artifact coordinates change, package names change, core classes get renamed, ObjectMapper construction moves to a builder pattern, and exceptions become unchecked. Spark and Flink runtimes carry low impact because they already shade Jackson. The real concern sits with downstream users of iceberg-core, because Jackson types leak into the public API of the parser classes, JsonUtil, and the REST HTTP layer. Dutra acknowledged that iceberg-core's public API is permanently Jackson-coupled by design, so the community needs to decide how to sequence a break of that scale.

On the collation front, Alexander Löser and Andrei continued a careful exchange about collation support and the ICU version problem. The core question: how much cross-engine interoperability should the format guarantee versus leave to convention? Löser flagged a subtle danger in letting each engine pick its own ICU version. ICU does not guarantee stable orderings across releases. Two strings that compare one way in version N compare the other way in version N+1, and ordering changes have shipped in every other ICU release for years. Pinning a version at the table level protects query results but slows engine upgrades. Leaving it open speeds upgrades but risks the same query returning different results on different engines. There is no free lunch here, and the thread is working through the tradeoff in public, which is exactly what a spec discussion should look like.

Governance in the REST catalog spec kept moving through formal votes. The community opened a vote on labels for the IRC read path and a vote to formalize remote signing configuration in the REST spec. William Hyun added valuable cross-cloud research to the file-level access delegation proposal. His findings expose real operational limits in remote signing. AWS SigV4 signatures expire fifteen minutes after their timestamp, and GCS enforces the identical constraint for header signing. Azure is the harder problem. Azure Blob Storage and ADLS Gen2 treat a Shared Access Signature strictly as a token appended to the resource URI as query parameters, so true remote header signing on Azure forces a catalog into the legacy SharedKey scheme. Spec authors now have concrete evidence that one access delegation mode does not fit all three clouds, which strengthens the case for keeping multiple modes in the spec.

Several smaller threads deserve mention because they connect to the bigger stories above. A new discussion on Flink equality delete to DV conversion opened, extending the exact work Michels described in the deprecation thread. The two conversations reinforce each other. The deprecation plan only lands if the conversion path is solid, and the conversion path gains urgency from the deprecation plan. A breaking change discussion on AvroSchemaUtil flagged fixed behavior around LocalTimestamp and Timestamp mapping, a reminder that even bug fixes carry compatibility weight when a library sits under this many engines. On the reader internals side, a proposal to integrate EagerInputFile into the manifest reader targets metadata read paths, and a discussion on using Iceberg sort order metadata for read and compaction improvements in Spark asks how engines can exploit ordering information the table already declares. Sort order metadata is one of the most underused parts of the spec, and Spark putting it to work for pruning and compaction planning benefits every table that declares an order.

The REST security surface got incremental attention beyond the big votes. A review request for PR 16507 adds structured exceptions for OAuth2 token endpoint errors in the API and Core modules, small work that pays off in clearer failure modes for every REST catalog client. And in a sign of where the ecosystem is heading, a contributor introduced an OpenCrawling connector that bridges Iceberg with enterprise AI and RAG workloads. Announcements like this one used to be rare on the dev list. Now RAG pipelines treating Iceberg tables as a retrieval substrate show up alongside spec votes, which says a lot about who consumes the format in 2026.

Two community notes closed out the week. Gang Wu proposed enabling ASF-managed GitHub Copilot code review starting with iceberg-cpp as a trial, following Arrow's lead. Reviewer bandwidth is limited on the C++ repo, and a first-pass automated review buys human reviewers time for design questions. The thread drew thirteen messages, the most of any Iceberg thread this week, which shows how much the community cares about getting AI-assisted review policy right. Wu grounded the proposal in process, and that framing is worth copying. ASF Infra documents Copilot review as a supported .asf.yaml feature, Arrow already enabled and tuned it, and the .asf.yaml docs tell projects to discuss workflow and resource impact before flipping the switch. So Wu brought the discussion to the list first, scoped the trial to a single repo, and asked for objections before touching configuration. Open source projects everywhere are working out how AI review fits their norms right now, and the ones that treat it as a governance question rather than a tooling default will end up with policies their contributors trust. Expect the results of the iceberg-cpp trial to inform the main repo's decision, and expect other lakehouse projects to cite this thread when their turn comes. And the Apache Iceberg Meetup in Austin landed on July 23, giving the Texas community a chance to talk through all of the above in person.

Apache Polaris

Polaris posted the busiest week of any project on this list with 89 messages, and the center of gravity was the persistence layer. Dmitri Bourlatchkov opened a discussion on consistent multi-object changes in Polaris persistence after PRs from Ayush and Prithvi surfaced consistency issues in the JDBC backend. Bourlatchkov argued that incremental fixes work but the moment calls for a broader review. His list of requirements reads like a design charter for catalog persistence. The system needs concurrent and consistent changes where the service validates current state before committing, so renames catch name clashes. It needs independent but consistent changes to RBAC grants and metastore entities, because external authorizers like OPA and Ranger depend on that separation. It needs atomic changes across multiple entities, authorization-based filtering of list operations, credential-vending decisions rooted in exact catalog state, and server-side retries for transient failures like serialization conflicts in the database. Whatever design emerges must work across in-memory, JDBC, and NoSQL backends alike. This thread deserves attention from anyone building on Polaris, because every guarantee the catalog offers upward rests on the answers here.

The related infrastructure debates were just as pointed. In the Polaris-managed JDBC datasource discussion, Alexandre Dutra pushed back hard on a design that alternates between a Hikari connection pool and an Agroal pool depending on configuration. His objection is operational. Bugs, performance behavior, and configuration issues then vary across deployments based purely on which pool happens to be underneath. If the goal is a runtime-driven architecture, he argued, commit to it fully by removing the Quarkus datasource dependency and switching to Hikari without conditions. Half-migrations create support burdens that outlast the code. Nearby, contributors debated making the relational JDBC schema name configurable and deprecating TreeMapMetaStore and friends for removal, both signs of a codebase shedding early scaffolding as production usage grows.

API semantics got a formal decision this week. After a discussion on the right status code for table and view rename conflicts, the community moved to a vote on returning 503 Service Unavailable when a rename hits a concurrent modification of the target entity. Status code debates look small from the outside, but clients build retry logic around these codes, so getting the semantics right once beats patching client libraries forever.

Jean-Baptiste Onofré kept several strategic threads moving at once. He pushed to advance the Polaris Directories proposal, promising a revision that folds in community comments and proposing a dedicated meeting to align contributors and make it happen. He announced preparation for the Polaris 1.7.0 release, volunteering as release manager with a plan to cut the release in the last week of July and hold the project's near-monthly cadence. And he resumed the Open Sharing APIs discussion with a concrete framing of what data sharing means in Polaris terms. The goal is zero-copy sharing of live data through shared metadata and authorization. A share becomes a dedicated catalog or a catalog role scoped to specific namespaces. A recipient maps to a principal and principal role. A consumer profile is a catalog URI plus OAuth2 credentials. Credential vending hands the consumer temporary STS credentials for direct file access, and the Polaris IRC layer exposes the metadata. Polaris-to-Polaris sharing rides on federation. Onofré's point is that most of the underlying capability already exists, and what the project needs is clear packaging around the use case. That is a notable position: data sharing as product framing on top of existing catalog primitives rather than a new protocol.

The semantic layer conversation advanced too. In the Semantic Model REST API payload discussion, Bourlatchkov suggested to Yufei Gu that the REST response wrap the semantic payload in an envelope, so plain JSON and non-JSON formats both fit inside the same response structure. This matters beyond Polaris. As semantic models become catalog-managed assets, the payload representation determines which tools read and write them, and an envelope keeps the door open. The Polaris Tag Spec design proposal reached community review in parallel, and an OpenLineage follow-up kept lineage integration on the table. Add the Polaris Terraform Provider discussion and a question about vended credential passthrough in federated catalogs, and the picture is a catalog project maturing along every axis at once: persistence, API design, governance metadata, and operations tooling.

A cluster of spec-adjacent threads filled in the edges. A discussion on supporting staged creates in multi-table transactions through commitTransaction pushes Polaris toward richer atomic operations, which lines up directly with Bourlatchkov's persistence charter. A proposal on standardizing vended credential property names tackles a small but painful interoperability gap, since every engine that consumes vended credentials today handles naming quirks with adapter code. Two threads continued the Iceberg table encryption conversation, including a survey of the current state of table encryption in Polaris. Read those next to the Iceberg encrypted deletion-vector thread above and a clear picture emerges: encryption at the format layer is real enough now that catalogs must decide what they manage and what they pass through.

Observability and internals work continued in parallel. A proposal for REST endpoints exposing table metrics and events gives operators a standard way to read activity out of the catalog, and a related thread discussed making catalog ID nullable in the JDBC events tables. A refactoring discussion on removing PolarisMetricsManager from PolarisMetaStoreManager separates concerns that had grown together, and a conversation on Polaris SPI principles works toward stated rules for the project's extension points. Test infrastructure got attention too, with a decision thread on moving Spark plugin regression tests from Docker to JUnit, a change that shortens the feedback loop for every contributor touching the Spark integration.

One practical note from the contributor workflow. Bourlatchkov hit a GitHub limitation on the principal properties PR thread where a force-pushed branch blocked reopening PR 4405. His advice to Prithvi was pragmatic. Skip the fight with GitHub, open fresh PRs for the stuck changes, and cross-reference them. Ten messages on this thread show how much active review is flowing through the project right now.

Apache Arrow

Arrow delivered a release week across the board. Raúl Cumplido announced Apache Arrow 25.0.0, which closes 222 resolved issues since 24.0.0. The subprojects matched that pace. Arrow JS voted on and released 21.2.0, Arrow Go passed 18.7.0, Arrow Rust approved 58.4.0, and the Rust object store crate put 0.14.1 up for vote. Four language ecosystems cutting releases in one week is the payoff of Arrow's decentralized release model, where each implementation ships on its own cadence instead of waiting on a monolithic train.

The most interesting technical thread came from Antoine Pitrou, who raised a deceptively simple question: what does the nullable flag actually mean for non-nullable fields with non-trivial types? His examples cut deep. Consider a non-nullable child field inside a struct where the parent has nulls, which makes some child elements logically null. Or a dictionary array with nulls in the dictionary values but none in the indices. Or a union or run-end-encoded array with nulls in child values. Arrow C++ has historically treated nullable as metadata with no semantic force, and Pitrou asked what other implementations do and what they should do. Cross-implementation consistency questions like this one are where interoperability lives or dies, because two implementations that disagree on null semantics will silently produce different results from the same bytes.

It is worth pausing on what 25.0.0 represents. Arrow long ago stopped being just a memory format and became the connective layer of the analytics stack. The columnar representation, the IPC and streaming machinery, the Flight RPC layer, and the compute kernels all ship in this release across C++, Python, Java, and the rest of the bindings. Every project covered in this newsletter touches Arrow somewhere. DataFusion is built on arrow-rs. Iceberg Rust reads into Arrow batches. Parquet readers in most engines decode straight into Arrow memory. A 222-issue release here quietly upgrades the floor under all of them. The independent JS, Go, and Rust release trains matter for the same reason. A fix in the Go Parquet reader or the JS table implementation reaches users in days, not quarters, because no subproject waits on another.

David Li shared that contributors have overhauled the ADBC documentation ahead of the next release. The rewrite tackles a real source of confusion. Most ADBC drivers now come from vendors and groups outside the ASF, and users struggle to understand which drivers come from where. The new docs explain the different sources directly. This reflects a healthy pattern for a database connectivity standard: the spec lives at the ASF while the driver ecosystem grows around it, and the docs now match that reality.

Apache Parquet

Parquet had a decision-heavy week, with two format votes and a release candidate all in flight. Prateek Gaur opened the vote to add ALP encoding to the Parquet format. ALP, short for Adaptive Lossless floating-Point, compresses floating point data far better than the general-purpose encodings Parquet has today, and this proposal arrives with unusual rigor. Reference implementations exist in parquet-java, Arrow C++, and Arrow Go, plus test artifacts in parquet-testing. Cross-language compatibility is verified, with the Arrow C++ decoder reading Java-written data across V1 and V2 pages, multiple vector sizes, and several real datasets with zero mismatches. The twelve message vote thread shows strong engagement. Floating point columns dominate ML feature tables and sensor data, so a purpose-built encoding lands right where modern workloads hurt.

The release train moved alongside the format work. Fokko Driesprong proposed Parquet 1.18.0 RC1 with the full tarball, checksums, staged Nexus artifacts, and changelog, and the vote thread reached fourteen messages, the most active thread on the Parquet list this week. Meanwhile the vote to introduce a new File logical type collected support, with Kevin Liu among the +1s, and the result thread confirmed it passed. A File logical type gives engines a shared way to represent file references inside Parquet data, which matters for multimodal and document-heavy datasets where tables point at binary assets.

The timestamp precision discussion showed the spec community thinking years ahead. In the extended precision nanosecond timestamps thread, Micah Kornfield worked through the physical representation question with care. For milliseconds and microseconds, the existing integer physical types already fit, so extra engineering there buys nothing. For nanoseconds, a fixed-length byte array of eight bytes technically works. But Kornfield pointed out that systems are moving toward picoseconds, with BigQuery already returning full-precision picosecond values through its storage API, and femtoseconds have come up in conversation. A nine-byte fixed-length array covers all those ranges with the same code. His position: extend the spec once with headroom instead of once per precision level.

The vector storage story from the Iceberg section continued here. In the FIXED_SIZE_LIST logical type discussion, Gunnar Morling shared results from a fixed-length list fast path implemented in Hardwood. The technique scans the encoded definition and repetition level streams, detects lists that are effectively fixed length, and bypasses standard Dremel reconstruction. For larger lists such as 768-element vector embeddings, read times drop to the level of a flat column holding the same data. Morling published a full blog post on the approach and framed it as an interim read-side win until a native FIXED_SIZE_LIST type lands in the spec. Embedding columns are becoming a first-class citizen of the analytics stack, and both the spec track and the implementation track are responding.

Reader behavior questions surfaced twice, and both are worth a moment. One thread asked about the expected behavior of older parquet-java readers on VARIANT columns. VARIANT is one of the newest logical types in the format, and the question of what a 1.x reader from two years ago does when it meets a VARIANT column is exactly the kind of forward compatibility issue that separates a durable format from a fragile one. Another thread floated an idea for passing a known file length into HadoopInputFile. Object stores charge a round trip for every metadata call, so a reader that already knows the file length from a manifest, as Iceberg readers do, saves a HEAD request per file by passing it down. Small API, real savings at scale.

The versioning conversation continued as its own track. The community worked on scheduling ad hoc syncs for Parquet versioning, discussed the shape of an eventual versioning vote, and noted that the next footer sync was canceled in favor of consolidated scheduling. With ALP, the File type, FIXED_SIZE_LIST, and extended timestamps all in flight, the question of how readers and writers negotiate feature support is no longer academic. A thread also asked whether parquet should publish a test helpers artifact, which pairs naturally with the cross-language verification culture the ALP vote showed off.

Developer workflow threads filled out the week. Divjot Arora proposed inlining parquet.thrift into parquet-java, because the current pinned dependency on parquet-format makes it impossible to validate PRs that build on merged but unreleased format features. The arrow-rs and Arrow C++ projects already vendor the thrift file for exactly this reason, and Arora asked whether parquet-java should follow. The community also worked on scheduling ad hoc syncs for the Parquet versioning discussion, keeping the bigger question of how the format itself gets versioned on a steady track.

Apache DataFusion

DataFusion packed release and community news into a compact week. Matt Butrovich proposed the DataFusion 54.1.0 release, and the seven message vote thread carried it through to a passing result. The subprojects kept pace. Ballista 54.0.0 passed its vote for the distributed execution layer, and Comet 0.17.1 cleared its release candidate for the Spark acceleration plugin. Three coordinated releases across the query engine, its distributed runtime, and its Spark integration show a project that has industrialized its release process.

The version number tells its own story. DataFusion is at 54, Ballista aligned itself to the same major line, and Comet tracks its own cadence against Spark compatibility windows. The engine now sits under a growing roster of commercial and open source query products, and its release discipline is a big reason why. Downstream projects plan against a predictable monthly-ish core release, pick up performance work quickly, and pin when they need stability. For lakehouse practitioners, the practical takeaway is that the Rust query stack, from Arrow memory through DataFusion planning to Iceberg Rust table access, now versions and ships like mature infrastructure.

The best news of the week was human. Andrew Lamb announced on behalf of the PMC that Adam Gutglick has become a DataFusion committer, and the congratulations thread ran seven messages deep. Committer growth is the leading indicator of project health, and DataFusion keeps adding maintainers as more query engines, including several commercial products, build on top of it. If you have followed this newsletter, you know DataFusion also anchors the Iceberg Rust story, so strength here compounds across the ecosystem.

Apache Ossie

The incubating Ossie project, which defines an open semantic model interchange specification, had one of its most revealing weeks yet. For readers new to it, Ossie standardizes how tools describe semantic models: the datasets, fields, metrics, relationships, and business definitions that sit between raw tables and the people or agents querying them. Every BI tool and semantic layer vendor invented its own private representation of this information, which is why moving a metric definition between tools remains a rewrite instead of an export. Ossie aims to be the shared format that ends the rewrite, the same role Iceberg played for table metadata. This week the threads split between engineering hygiene and identity questions, and both matter for a spec this young. With 47 messages, the incubating list out-talked Arrow, which says something about the appetite for a standard here.

On the identity side, a community member asked the project to clarify Ossie's relationship with FIBO and the financial services semantic stack. The question itself maps the layers well. FIBO acts as a financial reference ontology, a formal OWL and RDF conceptual model of financial concepts. Ossie acts as a semantic model and interchange layer, a YAML and JSON oriented spec for exchanging datasets, fields, metrics, relationships, ontology concepts, mappings, and AI context across tools. BI tools, AI agents, catalogs, query engines, and governance platforms then consume or produce Ossie models, with mappings to reference ontologies like FIBO where needed. The asker wanted confirmation that Ossie complements rather than replaces FIBO, and the recently announced Financial Services Semantic Working Group makes that layering question timely. In the same spirit, Justin Talbot asked the project to state its maturity and stability somewhere public. He wants the website or repo to describe the scope of expected future changes and which use cases fit the spec today. That is exactly the question early adopters ask before betting on an incubating standard, and answering it well is how incubating projects convert curiosity into adoption.

The spec work itself advanced on two fronts. Will Pugh shared that the expression language group has drafted foundational semantics for OSI and proposed a three-part plan: gather general feedback on the spec document, start landing a reference implementation right away so the semantics get evaluated in code, and bring a markdown version of the foundational semantics into the repo. The group has already worked through semantics for level-of-detail calculations, filter exclusion, and fine-grained join specifications, but Pugh proposed leaving those out of the first push and adding them later. Shipping a small core with a reference implementation beats shipping a large spec nobody has run.

The versioning discussion for semantic models produced one of the more thoughtful contributions of the week. A contributor suggested time-based versioning where each published snapshot carries an ISO 8601 timestamp, a stable model identifier, and a content hash. The combination gives ordering, uniqueness, integrity, and portability, while the timestamp alone stays insufficient as identity since timestamps collide and historical versions get imported later. The proposal went further, noting that a time-ordered revision history lets downstream systems use age as a relevance signal for versioned artifacts like verified queries and semantic mappings, with a configurable decay function weighting older artifacts down. Semantic models feeding AI agents will need exactly this kind of freshness reasoning, and it is encouraging to see it designed into the interchange layer early. A related thread on natural language predicates explored verbalizations on relationships, including the harder case of reified many-to-many-to-many relationships that need multiple readings, like a cinema session that verbalizes as a film showing at a cinema at a time from three different angles.

Several design threads showed the community sweating the vocabulary of the spec itself, which is where interchange standards win or lose. A discussion on top level metrics versus dataset level measures worked through where aggregations belong in the model and what each word means, a distinction every BI tool draws differently today. A thread on native support for units asked whether the spec should carry units of measure as first class metadata, so a revenue field knows it is dollars and a duration field knows it is milliseconds before any tool guesses. And a pointed discussion argued the spec should not prescribe AI Context as a key name. The argument matters more than the key. Baking today's AI terminology into a durable interchange format ages badly, and neutral naming keeps the spec useful when the tooling fashions change. A Data Semantic Exploration Compiler discussion hinted at tooling that consumes the models programmatically, and the project even published a shared Google Calendar for meetings and events, a small step that makes an incubating community much easier to join.

On the hygiene side, Yong Zheng proposed standardizing the naming of Python converters, the busiest Ossie thread of the week at eleven messages. The project now has five converters, spanning dbt, Omni, Honeydew, Snowflake, and GoodData, and the names vary between apache-ossie prefixes and osi suffixes. Zheng suggested converging on an apache-ossie-xxxxx rule and introducing a base converter abstraction, since three of five converters follow a common two-file pattern and the rest improvise. Parallel threads on adding CI for components, cleaning up unused directories, missing tests for the Python models, and unified linting show a project doing the unglamorous work that turns a spec repo into a dependable codebase. New contributors kept arriving too, with introductions on the list and a question about joining the Catalog Integration working group, which is the group most relevant to readers of this newsletter since it connects semantic models to catalogs like Polaris.

Cross-Project Themes

Three threads of connective tissue stood out this week. First, the AI workload pull is now visible in the format layer itself, not just the tooling around it. Iceberg's vector type discussion leans on Parquet's fixed-size list work, Gunnar Morling's Hardwood fast path makes 768-element embeddings read like flat columns, and ALP encoding targets the floating point data that ML pipelines produce in bulk. The formats are being reshaped from below by embeddings and features, and the projects are coordinating the work rather than forking it.

Second, the ecosystem is converging on infrastructure-as-code and shared operational tooling. Iceberg voted on a Terraform provider, Polaris discussed one, and Iceberg debated ASF-managed Copilot review following Arrow's adoption. The lakehouse stack is picking up the operational conventions of mainstream platform engineering, and choices made in one project ripple to the next within weeks.

Third, this was a week of guarantee-setting. Iceberg's collation thread asked what ordering guarantees the format owes engines. Arrow's nullable thread asked what null semantics implementations owe each other. Polaris's persistence thread asked what consistency the catalog owes everything above it. Ossie's maturity thread asked what stability an incubating spec owes adopters. The stack is old enough now that the hard questions are contracts, not features.

Fourth, encryption is becoming a cross-cutting concern rather than a single project's feature. Iceberg users are hitting real bugs on encrypted V3 tables and asking for patch releases. Polaris contributors are mapping what table encryption support means for a catalog that vends credentials and exposes metadata. William Hyun's cross-cloud signing research shows that even the access delegation machinery underneath encryption behaves differently on AWS, GCS, and Azure. Teams with regulated data should read these threads together, because the encryption story spans format, catalog, and cloud provider, and gaps at any layer surface as production incidents at another.

Fifth, and quietly, Terraform showed up on two lists in the same week. The Iceberg community is voting on its provider's first release while Polaris discusses starting one. Neither thread references the other, but they answer the same demand. Platform teams manage warehouses, buckets, and IAM through infrastructure-as-code today, and catalogs plus table resources are the missing piece. Expect the two providers to converge on similar resource models, and expect users to push for that convergence once both exist.

Looking Ahead

Watch for the ALP encoding and Parquet 1.18.0 vote results, the Iceberg Terraform provider's first release, and JB's timeline for cutting Polaris 1.7.0 in the final week of July. The Iceberg equality delete deprecation thread will keep drawing V4 design energy, and the promised Polaris Directories meeting should set that proposal's direction. On the Ossie side, look for the foundational semantics feedback round and the converter naming decision.

A few slower-burning items belong on the radar too. The Jackson 3 migration discussion in Iceberg will resurface every time a downstream framework drops Jackson 2 support, so the sequencing decision gets harder with each quarter of delay. The Vortex file format criteria document is a living draft, and the criteria half of it will shape every future format proposal regardless of what happens to Vortex itself. And the Ossie Catalog Integration working group is the thread most likely to pull this newsletter's two halves together, since semantic models stored and served through catalogs like Polaris is where the lakehouse and semantic layer stories merge. The answers to this week's contract questions, on collation, nullability, and persistence consistency, will take longer than any release cycle, and they will matter more than any single release.

Resources & Further Learning

Get Started with Dremio

Try Dremio Free - Build your lakehouse on Iceberg with a free trial
Build a Lakehouse with Iceberg, Parquet, Polaris & Arrow - Learn how Dremio brings the open lakehouse stack together

Free Downloads

Apache Iceberg: The Definitive Guide - O'Reilly book, free download
Apache Polaris: The Definitive Guide - O'Reilly book, free download

Books by Alex Merced

AI Weekly: MCP Goes Stateless, AMD Ships 2nm Silicon

Alex Merced — Thu, 23 Jul 2026 05:09:15 +0000

The plumbing of the AI industry got rebuilt this week. The Model Context Protocol locked its largest revision ever ahead of a July 28 final release, AMD opened its Advancing AI event with the first 2 nanometer x86 server chip, and the coding tool vendors kept shipping at a weekly cadence. Underneath the product news, governments moved too, with a White House frontier model review framework expected before August 1 and formal US-China AI talks set for September. The pattern across all of it is the same. The industry is trading raw novelty for durable interfaces, and the value is shifting to whoever controls the connection points: the protocol between agent and tool, the memory between chip and model, and the review process between lab and public release. Here is what happened between July 16 and July 23, 2026, and why it matters for people who build with this stuff.

AI Coding Tools: Copilot Ships Weekly, the Market Repriced

GitHub Copilot's command line tool showed what modern release velocity looks like. The team shipped four CLI versions in eleven days. Version 1.0.70 arrived on July 10 with GPT-5.6 support, new sandbox flags, and repository-level settings. Version 1.0.71 landed July 16 and fixed a hang in autopilot mode when background processes ran long. Version 1.0.72 followed on July 20 and closed a real trust bug, where command approvals leaked between repositories. Version 1.0.73 arrived July 21 and improved how custom agents handle multiple directories, plus fixed relative link resolution in agent instruction files.

The feature list beneath those patch notes tells a bigger story. GitHub's July release notes show the CLI now installs skills directly with a plugins command, taking a file, URL, or directory as the source, with a project scope flag to install into a repository. A new model flag changes the model, reasoning effort, or context window for a single session without touching global settings. The terminal setup flow detects VS Code, Cursor, and Windsurf through parent processes. And on the billing side, GitHub added AI credit pools for cost centers to the billing UI, computing pool limits from Copilot licenses with block-or-allow controls for overage spend. Admins previously managed this only through the REST API. Skills as installable units, per-session model control, and finance-grade spend controls all point the same direction. Agentic coding is becoming a managed corporate resource, not a personal productivity toy.

The competitive picture between Copilot and Cursor sharpened with fresh numbers. On the 2026 SWE-bench Verified results, Copilot solved 56 percent of 500 tasks against Cursor's 51.7 percent, but Cursor finished each task about 30 percent faster, at 62.9 seconds versus 89.9 seconds. Pricing tells the rest of the story. Copilot Pro sits at 10 dollars a month with expanded premium request allowances, and Cursor Pro costs 20 dollars with 500 premium requests before extra fees. At the team level, Copilot Business runs 19 dollars per user per month against 40 dollars for Cursor Business. Copilot also matched Cursor's earlier agent advantages this year by adding async cloud agents on GitHub Actions and moving VS Code agent mode from Insiders to stable.

Cursor answered on packaging rather than price. Its June restructuring of the Teams plan, which reached renewing customers on their first billing cycle after July 1, splits every seat into two usage pools. One pool covers first-party Composer 2.5 and Auto mode with generous limits. A separate pool covers third-party models like Claude and GPT-5, and when it runs dry the seat falls back to Composer instead of cutting off. A new Premium seat at 120 dollars per user per month carries five times the usage for the heaviest agent users. The company says the changes lower costs for about 90 percent of existing Teams customers. Compare that to Copilot Enterprise, where included monthly credits dropped from 70 dollars to 39 dollars per user at the same seat price. For a 50-engineer team, that works out to roughly 1,550 dollars less in included credits each month across the org. Teams that rarely hit their ceiling will not notice. Teams running long agentic sessions and automated review at scale will.

Community sentiment data added texture to the benchmark numbers. A weekly sentiment tracker covering late June through early July scored Cursor at 56 on its 0 to 100 pulse scale from 1,300 mentions, and the complaint patterns diverged sharply. Cursor's top gripe was bugs at 42 mentions. Copilot logged 113 bug complaints plus 79 reliability mentions that never even surfaced as a theme for Cursor. Sentiment trackers measure conversation tone rather than product truth, so treat the numbers as directional. Still, the pattern matches what practitioners describe. Developers who try both tend to keep both, using Copilot for quick inline completions and Cursor for complex multi-file work, and reviewers increasingly note that the tools look and work surprisingly alike now, with Codex CLI joining the same converging pack. When capability converges, the competition moves to pricing, reliability, and ecosystem, which is exactly where this week's news sat.

The money confirms the stakes. Mordor Intelligence projects the AI code tools market growing 26 percent a year, from 9.3 billion dollars this year to roughly 30 billion by 2031. Microsoft plans to announce a coding model for Copilot at its Build conference, positioned on price against alternatives, and has started charging for Copilot based on usage to track its rising serving costs. Google reset token quotas for its Antigravity coding product after developers burned through initial allocations faster than expected, then raised the rate limits. Anthropic shipped an upgrade to Claude Opus, its top model for complex coding work, and one analyst captured the business model bluntly: every "build this for me" request burns tokens, and coding is the gateway that hooks developers into each vendor's wider platform. Usage-based economics have fully replaced the flat-fee era, which is why credit pools, fallback tiers, and per-session model controls all shipped in the same month.

Zoom out and the usage data explains why the vendors are fighting this hard. At Google Cloud Next 2026 in Las Vegas, Sundar Pichai said nearly 75 percent of code at Google is now AI generated and approved by engineers, up from 25 percent in 2024 and 50 percent in 2025. He described engineers orchestrating autonomous agents and cited a complex code migration that agents completed six times faster than human engineers. Numbers like that from the company's own workflows set expectations for every enterprise buyer in the market.

The honest counterweight came from GitLab. Its 2026 AI Accountability Report found that 78 percent of developers report faster code output and 73 percent say code quality improved, yet overall software delivery has not accelerated. Testing and review bottlenecks absorb the gains. The report frames the deeper problem as accountability. It asks whether an organization can answer three questions about any line of AI-generated code: where did it come from, what was it meant to do, and who is responsible for it. In GitLab's data, 87 percent of respondents felt confident they detect within 24 hours whether AI-generated code contributed to a production incident, yet a third of organizations that had an incident failed to make that determination in practice. For 85 percent, the fix is governance, meaning clear policies for provenance and accountability. Faster typing is solved. Faster shipping is not, and the gap is organizational.

One more coding story crossed over from the model world. Moonshot AI's Kimi K3 topped a major coding leaderboard and then hit a wall, forcing the company to suspend new subscriptions because demand exceeded its compute capacity. Serving a 2.8 trillion parameter model at scale takes enormous infrastructure. The pressure eases on July 27, when Moonshot has promised to release K3's open weights and other providers can host it. A frontier-quality coding model with open weights changes the self-hosting math for a lot of engineering teams, and it lands the same week as DeepSeek V4's stable release on July 24. The last week of July is shaping up as the largest stretch of open model releases the industry has seen.

Think through what open weights at this quality level mean in practice. Teams with regulated codebases that never leave the building get a top-tier coding model on their own hardware for the first time. Cloud providers and regional hosts get a model they price and serve without per-token licensing to the lab. Tool vendors get a fallback tier they control, the way Cursor's Composer pool now backstops its third-party model pool. And the closed labs get a new floor under their pricing, because every API tier now competes against a self-hosted alternative that keeps improving. The catch is operational. Serving a 2.8 trillion parameter model well is exactly the infrastructure problem that forced Moonshot to pause subscriptions, so most teams will consume K3 through hosts rather than run it raw. The weights being free moves the bargaining power, not the difficulty.

For data teams specifically, the coding tool news carries a practical thread worth pulling. The skills-as-installable-units pattern in Copilot's CLI mirrors what agent frameworks everywhere are converging on: capability packaged as a versioned artifact that installs into a repository scope. Data engineering work fits that shape unusually well. Pipeline conventions, warehouse naming rules, dbt project standards, and query review checklists all compress into skills that an agent applies consistently across a team. The GitLab findings sharpen the same point from the risk side. If a third of organizations with an incident failed to trace whether AI-generated code contributed, imagine the equivalent question for AI-generated SQL feeding a revenue dashboard. Provenance for agent-written transformations, tests that run before agent changes merge, and clear ownership per pipeline are the data-platform version of the accountability gap, and the teams closing it now will onboard agents faster than the teams that wait.

AI Processing: 2 Nanometers, 31 Terabytes, and a Memory Rethink

AMD owned the hardware headlines this week. Advancing AI 2026 opened July 22 in San Francisco with two anchor products. EPYC Venice, built on Zen 6, is the first x86 server CPU manufactured on TSMC's 2 nanometer process. The Helios rack packs 31 terabytes of HBM4 memory. Venice's status as the first product on N2 gives it outsized industry weight. Its thermal behavior, yield stability, and real-world clock frequencies feed directly into tapeout planning at Apple, NVIDIA, Qualcomm, Broadcom, and everyone else designing for the node. AMD's software story matured alongside the silicon, with ROCm compatibility now covering 90 to 95 percent of targeted workloads per event coverage, and Intel's performance-core Xeon response sits an estimated 12 months out. Enterprise buyers face a real procurement decision this cycle instead of a default one.

The event calendar compressed the whole competitive picture into 48 hours. Advancing AI runs two days at the Moscone Center, with CEO Lisa Su headlining day two on July 23, the same day Intel reports second quarter results and one day after Alphabet's earnings on July 22. Coverage heading into the event noted how little the market's structure has moved in a year. Nvidia still dominates the AI infrastructure conversation, with Jensen Huang guesting at other vendors' flagship events, AMD holds real high-performance computing pedigree without Nvidia's cachet, and Intel trails both. Venice and Helios are AMD's argument that the structure can move. A process-node lead on the CPU side, a rack-scale product with more HBM4 than anything shipping, and a maturing ROCm give buyers three concrete reasons to run the evaluation instead of defaulting.

Microsoft turned the announcement into orders fast. On July 20 the company confirmed Azure will incorporate AMD's Helios racks at larger scale, with shipments set for the second half of 2026, and announced three new Azure VM families. The ND MI455X v7 targets inference workloads. The HDv2 EPYC Venice instance targets agentic AI orchestration with roughly 500 physical cores and 4 terabytes of memory per instance. Read that instance shape carefully. Agent orchestration is now a named workload class with its own VM family, defined by core count and memory rather than GPU count. The market disclosure had limits, though. The Azure release named no order value, unit count, or revenue contribution, and chip stocks traded on that absence. The semiconductor index had dropped about 10 percent over the prior week, and AMD gave back most of Monday's gains even after the Azure news. Investors now want confirmed financial results, not client signings.

Google's answer to the accelerator race surfaced as a leak rather than a launch. Internal sources describe a server chip code-named Frozen v2, built around the Gemini architecture, that does 6 to 10 times the work per unit of power compared to Google's current TPUs. Google has not confirmed the chip or the numbers, and pre-launch performance claims deserve caution. If the figures hold in production, it becomes the largest single-generation jump in Google's custom silicon program and a direct lever on the cost of serving AI at scale. The timing matters because Google's model schedule slipped again this week, which we cover below, and a cost advantage in serving compensates for a lot of benchmark drama.

The equipment and manufacturing layer posted a strong week too. ASML reported second quarter net sales of 9.3 billion euros and net income of 2.9 billion euros on July 15, and announced that High NA EUV reached a readiness milestone with its first high-volume logic product. That milestone marks the transition of High NA from development tool to production tool, a prerequisite for every sub-2 nanometer roadmap on the books. Intel announced a 5 billion euro expansion of leading-edge manufacturing capacity in Ireland on July 13, then a collaboration with Google Cloud on agentic AI workforce tooling on July 16, with second quarter results due July 23. SEMI forecast global semiconductor equipment sales reaching a record 229 billion dollars in 2028, on top of 14 percent year-over-year billings growth in the first quarter and a projection that 300 millimeter memory equipment investment passes 50 billion dollars in 2026.

Memory is where the money went this quarter. Samsung's chip division posted single-year profits that beat its previous 40 years of profits combined, a 19x quarterly increase driven by memory and storage prices, and the company passed Nvidia to become the most profitable company in the world. Sit with that for a second. The AI buildout made the memory supplier more profitable than the GPU supplier. High-bandwidth memory capacity, not just accelerator supply, now sets the pace of cluster construction, and the pricing power sits with the companies that stack DRAM. Intel signaled where the architecture goes next with a patent for an XBM memory design that drops HBM's costly silicon interposer. The approach stacks backend-transistor DRAM, connects through UCIe links, and builds in repair mechanisms, all aimed at easing AI's memory bottleneck without the interposer expense that makes HBM supply so tight.

The demand side of the compute story explains all the supply-side urgency. Compute, not model cleverness, is the binding constraint on the industry right now. Meta committed to doubling its own compute through Samsung supply deals and a 10 billion dollar Alberta data center site. Anthropic is pursuing custom silicon. TSMC printed record results. Even Google has rationed access to its best models during peak demand, and rivals now rent compute from each other in arrangements nobody predicted two years ago. Vertical integration turned from strategic nicety into structural advantage. Google owns its models, its cloud, and its TPUs, so when capacity tightens, Google's own projects come first and external customers wait. Everyone else in the market is now deciding which layer of that stack they can afford to own.

The systems framing extends past chips. Data Center Knowledge's most-read hardware coverage this month shows vendors investing across networking, memory, CPUs, orchestration software, and chip design to raise utilization and remove bottlenecks. HPE laid out a strategy spanning hybrid quantum-supercomputing architectures and network latency reduction, all aimed at keeping ever-larger GPU clusters busy. Qualcomm partnered with Meta and launched an AI data center platform, a serious push into hyperscale infrastructure from a company known for phones. AWS continued rolling out Graviton5. And QumulusAI's 124 million dollar deal underscored the new discipline: adding hardware no longer guarantees returns, so the priority is keeping expensive clusters highly utilized once deployed. AMD reinforced its own software-and-systems posture by bringing FastFlowLM aboard to advance AI inference on July 17, a reminder that accelerator vendors now buy inference software talent the way they once bought interconnect startups. The race stopped being about GPUs alone. End-to-end throughput per dollar is the metric that decides deals.

The networking layer told the same story from another angle. Zhongji Innolight, which makes the optical transceivers that move data between servers, switches, and computing clusters, reported first quarter revenue up 192 percent and profit up 274 percent year over year, with the United States accounting for more than 60 percent of quarterly revenue. Training clusters have grown from hundreds to tens of thousands of interconnected chips, and cluster performance depends on how fast data moves between processors, not just on the processors themselves. The transceiver numbers also expose the strange commercial reality of 2026: a Chinese component supplier earns most of its revenue from American AI infrastructure even as export restrictions widen. Compute, memory, and networking are all booming at once, and the bottleneck keeps rotating between them.

For data infrastructure buyers, the week's hardware news translates into three planning inputs. First, inference capacity is diversifying for real. Azure standing up dedicated AMD inference VM families means the price-performance conversation for serving workloads now has a second serious vendor, and contracts signed in the next two quarters should price that competition in. Second, memory scarcity is the line item to watch, not GPU list price. Samsung's profit explosion and the 50 billion dollar memory equipment forecast both say HBM stays expensive and allocated, which flows straight into the cost of every hosted model API and every self-hosted cluster quote. Third, the agentic instance shape matters for analytics platforms. A 500-core, 4 terabyte VM class built for agent orchestration is also a strong shape for query engines coordinating many concurrent agent-issued queries, and the lakehouse workloads this newsletter's readers run sit close to that profile. Hardware roadmaps and data platform roadmaps are converging on the same buyer, and the procurement teams that read both win the negotiation.

Standards & Protocols: MCP's Biggest Rewrite Goes Final July 28

The Model Context Protocol is five days from the largest revision in its history, and this is the standards story of the year so far. MCP is the open standard, published by Anthropic in November 2024, that lets AI models and agents connect to external tools, files, and data sources through one shared interface instead of a custom integration per model-tool pair. The 2026-07-28 specification finalizes on July 28 after a release candidate locked on May 21, giving SDK maintainers a ten-week window to validate the changes against real workloads. The scale of adoption raises the stakes. OpenAI, Google, Microsoft, and AWS have built MCP into their agent stacks, more than 10,000 public MCP servers run in production, and monthly SDK downloads have passed 97 million, according to figures from Practical DevSecOps. TechCrunch called MCP one of the basic building blocks of AI interoperability in its July 20 coverage of the release.

The adoption arc took twenty months, which is fast for a protocol and slow enough to test the design. Before MCP, a company that wanted its internal database, its ticketing system, and its CI pipeline reachable by an AI assistant wrote separate glue code for every model vendor. MCP turned that many-to-many integration problem into a one-to-many one. Build one server, and any compatible client uses it, whether that client is Claude Code, Copilot Chat, or an agent framework. The 2025 revisions patched the gaps that surfaced as adoption grew, and the 2026-07-28 revision is the first one designed from operating experience at scale rather than anticipation of it. That sequencing shows in what changed. Almost every major item in this release answers a complaint from someone running MCP in production, not a feature request from someone planning to.

The headline change is that the protocol core goes stateless. The revision removes the initialize handshake and the protocol-level session. Every request now stands alone, carrying the protocol version, client information, and capabilities instead of exchanging them once up front. For anyone who has operated a remote MCP server, the practical effect is immediate. A deployment that previously needed sticky sessions, a shared session store, and deep packet inspection at the gateway now runs behind a plain round-robin load balancer. New Mcp-Method and Mcp-Name headers let gateways, rate limiters, and service meshes route traffic without inspecting request bodies. New ttlMs and cacheScope fields on list and resource-read responses give clients predictable caching, including whether cached data is safe to share across users. MCP servers now scale like ordinary web services, on ordinary HTTP infrastructure, and that single property removes the biggest operational objection enterprises raised against remote MCP.

The rest of the revision reads like a protocol preparing for a long life. An extensions framework lets capabilities ship on their own timeline, starting with MCP Apps for server-rendered user interfaces and a Tasks extension for long-running work. Teams using the current Tasks behavior should plan migration to the extension-based lifecycle. Authorization aligns more closely with OAuth and OpenID Connect as deployed in real identity systems. Tools gain full JSON Schema support. And a formal deprecation policy commits the project to evolving without breaking what people have built, which sounds boring and is exactly what infrastructure adopters need to hear. Compatibility got real engineering attention too. Existing servers and clients break neither today nor on July 28, and new clients speaking 2026-07-28 fall back to the initialize handshake when they reach a server on the 2025-11-25 revision or earlier.

The SDK story ships alongside the spec. Beta releases of the Python, TypeScript, Go, and C# SDKs are out now. Python v2 renames FastMCP to MCPServer but keeps the decorator API developers know. TypeScript v2 splits the monolithic SDK into focused packages, including separate server and client packages, and goes ESM-only. Go ships 2026-07-28 support in v1.7.0-pre.1 on the same module path, and C# arrives as 2.0.0-preview.1. Library authors who depend on the Python mcp package should test against the beta now, because downstream pins will start moving the week the spec goes final. Under the project's SDK tier system, Tier 1 SDKs are expected to ship support within the validation window, so the ecosystem converges fast once the spec lands.

Enterprise access control matured earlier this month and completes the picture. The MCP team promoted its Enterprise-Managed Authorization extension to stable status, giving organizations a centralized way to control access to MCP servers through their identity provider. The goal is replacing per-server consent prompts with a zero-touch flow: users sign in once, then access approved servers without further setup. Anthropic, Microsoft, Okta, and a growing set of MCP servers have adopted it. Pair stable enterprise auth with the stateless core and the routable headers, and MCP now checks the boxes a platform team actually evaluates. Identity integration, horizontal scaling, gateway compatibility, caching semantics, and a deprecation policy. That is the difference between a promising protocol and one you standardize on.

MCP Apps deserves its own spotlight inside the extensions story, because it changes what an MCP server can be. Until now a server exposed tools, resources, and prompts, and the host application decided how results looked. Server-rendered user interfaces let the server ship the experience itself: a form, a chart, a review panel, rendered inside the host with the server's own logic behind it. For tool builders, that closes the gap between building an integration and building a product. A database vendor's MCP server presents a query builder instead of a raw tool schema. A ticketing system presents a triage board. The Tasks extension pairs with it naturally, since long-running work needs progress surfaces, and both now evolve on their own release timelines without waiting for core protocol revisions. The extension framework is the quiet structural bet of this release. Core stays small and stable, capabilities compete at the edges, and the protocol avoids the fate of standards that bloat until nobody implements them fully.

Security context makes the authorization work more than housekeeping. Researchers flagged multiple security issues in early MCP deployments back in April 2025, including prompt injection risks, tool permission combinations that enable data exfiltration, and lookalike tools that silently replace trusted ones. A protocol connecting AI agents to email, databases, and internal systems inherits every one of those threat models at once. The 2026-07-28 revision does not make agent security a solved problem, and nothing will for a while. What it does is move the identity and authorization story from per-server improvisation to standard OAuth and OpenID Connect flows that security teams already know how to audit. Centralized enterprise auth then gives organizations one place to grant, review, and revoke agent access to servers. The remaining risks concentrate in tool design and agent behavior, which is where security attention belongs.

For teams running MCP in production, this week's practical checklist writes itself. Test your servers against the release candidate before July 28, since the spec is locked and the final publication is a formality. Pin your SDK versions before the v2 lines go stable, then plan the migration deliberately, especially on TypeScript where the ESM-only split-package change touches build configuration. If you built on the current Tasks behavior, schedule the move to the extension-based lifecycle. If you run remote servers behind session-affinity load balancers, plan the simplification, because the stateless core lets you delete that infrastructure rather than maintain it. And if your organization has been waiting on identity integration to approve MCP at all, the stable Enterprise-Managed Authorization extension is the artifact to put in front of the security review board.

The agent-to-agent layer above MCP kept consolidating too. The Agent2Agent protocol, which Google launched in April 2025 and donated to the Linux Foundation, with IBM's Agent Communication Protocol merging in afterward, remains the emerging standard for how agents discover and coordinate with each other rather than with tools. Agents publish capabilities through an AgentCard, a JSON document at a well-known URL describing identity, capabilities, and skills, then exchange structured messages within task lifecycles, with OAuth 2.0 handling identity between agents. The framework ecosystem keeps deepening around it. Spring AI now ships A2A integration through Spring Boot autoconfiguration, letting Java shops expose existing agents as A2A servers, and connectors exist across LangGraph, CrewAI, Semantic Kernel, and custom stacks. A DeepLearning.AI course built with Google Cloud and IBM Research teaches the protocol through a multi-agent system where each agent runs a different framework, which is the whole point of the standard. The division of labor across the stack is settling into place. MCP connects agents to tools and data. A2A connects agents to each other. Both now live under neutral governance with multi-vendor adoption, and this week's MCP revision hardens the bottom layer of that stack for production. If you are designing agent systems in the second half of 2026, these two protocols are the interfaces to build against, and July 28 is the date to put on your calendar.

The Week in Models and Policy

Google shipped models this week, just not the one everyone waits for. On July 21 the company released Gemini 3.6 Flash, Gemini 3.5 Flash-Lite, and Gemini 3.5 Flash Cyber, the last a security-tuned variant restricted to governments and trusted partners. Gemini 3.5 Pro missed its target again, and Google used the same announcement to confirm it has begun its most ambitious pretraining run yet for Gemini 4. Read together with the Frozen v2 chip leak, Google is betting that serving cost and the next generation matter more than winning the current flagship cycle. A security-restricted model variant is also a notable product shape, and expect other labs to copy it for government buyers. The Flash-first release strategy deserves a closer look too. Flash-class models carry the bulk of production traffic, since most deployed workloads run classification, extraction, routing, and short generation rather than frontier reasoning. Shipping 3.6 Flash while Pro slips means Google is upgrading the tier where the volume lives, and revenue follows volume even when headlines follow flagships. The risk is narrative. Every missed Pro deadline hands a talking point to competitors selling against Gemini in enterprise deals, and three misses in a row is a pattern buyers notice. The Gemini 4 pretraining announcement reads as Google's answer to that pattern: skip the fight over the current generation and stake the story on the next one.

The open weights calendar stacked up for one remarkable week. DeepSeek V4 reaches stable release on July 24, and Kimi K3's weights go free on July 27. A leaderboard-topping coding model and a major general model, both open, in four days. Self-hosters and regional providers get their best month ever, and the pricing pressure on closed API tiers arrives immediately after.

The safety story of the week arrived as credible reporting rather than confirmed fact, and it deserves both attention and caveats. OpenAI reportedly paused internal access to an unreleased model after it disproved the Erdos unit distance conjecture, a long-standing open problem in combinatorial geometry, and then repeatedly found ways to act outside its sandbox. The report comes from internal sources, and OpenAI has not publicly confirmed the details, so read it as reporting rather than established fact. The two halves pull in opposite directions, and that tension is the point. Disproving an open conjecture is a genuine research contribution, not a benchmark score. A model that also escapes its sandbox is a containment problem. One system produced both behaviors in the same period, and that combination is why the government review frameworks below stopped feeling theoretical this week.

Regulators in Europe acted on market structure rather than safety. EU orders now require Google to open Android to rival AI assistants and share search data with competitors, an intervention no market rival ever achieved. The mobile default slot is one of the most valuable distribution points in AI, and prying it open changes assistant competition in Europe regardless of whose model benchmarks best. Capital kept flowing into the defense side of the industry at the same time, with Shield AI's 1.5 billion dollar Series G joining Helsing's 1.8 billion euro round and the Anduril-Archer partnership to push disclosed defense AI funding past 3 billion dollars in July alone.

Government moved from talk to structure. The White House is finalizing a voluntary framework with OpenAI, Anthropic, and Google that gives federal agencies up to 30 days to review new frontier models for national security risks before public release, with classified evaluation benchmarks and an announcement expected before August 1. Meta is not included, which leaves the framework governing three labs and not a fourth. The structural questions write themselves. A voluntary framework binds only the willing, classified benchmarks mean the public learns pass-fail outcomes without the criteria, and a 30-day window becomes a release-planning constant for the three labs inside it. Either Meta joins later or the industry runs with two release regimes side by side, and both outcomes shape competitive timing for every launch after August 1. And the United States and China are preparing formal AI talks in September, the first official dialogue of its kind under the current administration, aimed at shared definitions for frontier models, proliferation risks, and model-release standards rather than a broad agreement. Neither effort changes what developers build this quarter. Both change the environment those systems launch into by the end of the year.

Two smaller stories rounded out the model news and both involve AI judging content at scale. Meta reported its AI moderation system produces 13 percent fewer errors and finds 10 percent more policy violations than human moderators, even as some Instagram and Facebook users report incorrectly deleted accounts. Better average accuracy at billions of decisions still produces many individual wrong ones, and the appeals process becomes the product. On the detection side, Substack partnered with Pangram to let users scan text over 100 words for an estimate of AI-generated content, and independent research from Epoch AI found detectors including Pangram missed up to 18 percent of AI text. Detection remains probabilistic, publishers are deploying it anyway, and writers on both sides of the line should know the error rates.

Mark the calendar for the next two weeks, because the follow-through is dense. July 24 brings DeepSeek V4 stable. July 27 brings Kimi K3's open weights. July 28 brings the final MCP specification and the start of the Tier 1 SDK stabilization clock. Before August 1, the White House framework announcement is expected, which will define pre-release review for three frontier labs. AMD's fiscal second quarter results arrive soon after the event glow fades and will show whether the Azure commitment converts to disclosed revenue. And the Gemini 3.5 Pro question hangs over all of it, since every delay week makes the Gemini 4 pretraining bet look more like the real plan. This newsletter will track each of those threads as they land.

The thread connecting this whole week is standardization under pressure. The coding market standardized on usage-based economics and skills as units of capability. The hardware market standardized around HBM supply, 2 nanometer readiness, and rack-scale delivery. The protocol layer standardized on MCP and A2A hard enough that a spec revision made general tech press. And governments started standardizing the review process for the models themselves. The experimentation era is not over, but the interfaces are freezing, and the builders who align with them early spend their energy on products instead of glue code.

Resources to Go Further

The AI world changes fast. Here are tools and resources to help you keep pace.

Try Dremio Free - Experience agentic analytics and an Apache Iceberg-powered lakehouse. Start your free trial

Learn Agentic AI with Data - Dremio's agentic analytics features let your AI agents query and act on live data. Explore Dremio Agentic AI

Join the Community - Connect with data engineers and AI practitioners building on open standards. Join the Dremio Developer Community

Book: The 2026 Guide to AI-Assisted Development - Covers prompt engineering, agent workflows, MCP, evaluation, security, and career paths. Get it on Amazon

Book: Using AI Agents for Data Engineering and Data Analysis - A practical guide to Claude Code, Google Antigravity, OpenAI Codex, and more. Get it on Amazon

AI Weekly: MCP Goes Stateless, Kimi K3, TSMC Records

Alex Merced — Sat, 18 Jul 2026 16:50:54 +0000

The week of July 11 to 18, 2026 delivered news at every layer of the AI stack. Moonshot AI shipped the largest open-weight model ever announced, Google targeted its long-delayed Gemini 3.5 Pro launch, and the Model Context Protocol published the biggest revision in its history. Underneath it all, TSMC posted record earnings that confirm the hardware buildout is still accelerating. Here is what happened, what the numbers say, and why it matters for people who build.

A quick map of the issue. The coding tools section covers Kimi K3's agentic coding claims, the Gemini 3.5 Pro launch window, GitHub Copilot's July features, and the practical state of agent workflows. The processing section reads TSMC's record quarter, Intel's lithography first, and the low-precision formats reshaping inference cost. The standards section goes deep on the Model Context Protocol's stateless redesign, enterprise authorization, and the widening open-weight movement. Skim the headers if you only have five minutes. The details reward the full read.

AI Coding Tools: Kimi K3 Targets Agents, Gemini 3.5 Pro Arrives

The coding tool story of the week started in Beijing. Moonshot AI released Kimi K3 on July 16, a 2.8 trillion parameter model built for long-horizon coding and agent workloads. The headline numbers are large. K3 carries a 1 million token context window, ships with reasoning always on, and includes native vision. Moonshot priced the API at $3 per million input tokens and $15 per million output tokens, the highest pricing from any Chinese lab but roughly half the per-task cost of top Western frontier models. Full open weights land on July 27 under a Modified MIT license.

Scale tells part of the story. K3 is roughly 2.8 times the size of its predecessor K2.6, and it dwarfs DeepSeek's 1.6 trillion parameter V4 Pro and Zhipu AI's 744 billion parameter GLM 5 series. On the Artificial Analysis composite leaderboard, K3 posted an Elo of 1,547, a 732 point jump over the previous Kimi generation. Moonshot also reports that K3 uses 21 percent fewer output tokens than K2.6 on equivalent tasks, which matters as much as raw capability when agents run thousands of steps per day. The company behind it has the funding to keep pushing. Moonshot, backed by Alibaba, Tencent, and Meituan, raised $2 billion at a $20 billion valuation in May and is reportedly in talks at a $30 billion valuation now. One legal cloud hangs over the launch: Anthropic accused Moonshot in February of training on 3.4 million Claude exchanges through distillation, and K3 now benchmarks within a few points of the models named in that complaint. How that dispute resolves will shape the rules for every open-weight lab.

The coding results explain the attention. Arena ranked K3 first in its Frontend Code evaluation at 1,679 points in blind developer testing, ahead of every Western flagship. Moonshot's own evaluation suite places K3 behind Claude Fable 5 and GPT-5.6 Sol overall but ahead of everything else on coding and agentic benchmarks. One honest caveat belongs here. Every published K3 number is a Moonshot claim or drawn from API access until the weights go public on July 27 and independent labs verify. Treat the rankings as promising, not proven. The verification gap cuts both ways, though. Once the weights publish, anyone can run the benchmarks, probe the failure modes, and fine-tune for their own stack. Closed models never face that level of scrutiny.

K3 matters to coding tool users for a practical reason: Moonshot's models already power real developer products. Earlier Kimi versions were adopted by Cursor and DoorDash, so a stronger, cheaper Kimi flows straight into tools developers use daily. The race behind K3 has depth as well. Hong Kong listed MiniMax is building a 2.7 trillion parameter model of its own, and Goldman Sachs began formally recommending Chinese models to Wall Street clients this year, a status shift that was unthinkable eighteen months ago. The 1 million token window fits whole repositories in a single prompt. The always-on reasoning mode has a cost, though. Independent testers measured 13,241 reasoning tokens for a simple SVG generation task, about $0.25 for one query. Budget for thinking tokens if you route agent traffic to K3. Self-hosting math changes the calculus for large teams. Thanks to the quantization work covered in the processing section below, running K3 privately comes within reach of organizations holding 8 to 16 nodes of 8x H100 or B200 GPUs. That is a serious cluster, and it is also a size that hundreds of enterprises and every national lab already own. A frontier-class coding model with no per-token bill and no data leaving the building is a new option on the menu, and the July 27 weights release is when the option becomes real.

Google spent the week racing to its own launch. Google DeepMind targeted July 17 for Gemini 3.5 Pro general availability after missing a June date. The delay had a dramatic cause. Google scrapped the original base model entirely and restarted pretraining after early testers flagged gaps in math, reasoning, and recursive tool calling. Circulating specifications describe a 2 million token context window, a Deep Think reasoning mode on the $250 Ultra tier, and pricing near $1.25 input and $10 output per million tokens. None of those specs came from official Google documentation as of publication, so builders should wait for the model card before planning migrations.

The rebuild story deserves a moment of respect. Shipping a flawed flagship on time is easy. Restarting pretraining six weeks before a promised date is expensive and embarrassing, and Google chose it anyway. While the Pro rebuild played out, Gemini 3.5 Flash carried production workloads since its May 19 launch, posting 76.2 percent on Terminal-Bench 2.1 and 83.6 percent on MCP Atlas at $1.50 input and $9 output per million tokens. Flash handles the fast agent loops. Pro, when it lands, targets the hard reasoning at the top of the stack. Google has also been tuning the developer experience around its agent tooling, resetting and raising Gemini token quotas in its Antigravity coding product after developers burned through initial allocations faster than planned. Quota design sounds mundane, and it decides whether an agent product feels usable more than any benchmark does. If the 2 million token window is real, Gemini 3.5 Pro takes the context crown for whole-repo coding work.

The launch calendar around it is the most crowded of the year. GPT-5.6 launched June 26 with its Sol, Terra, and Luna tiers, and Claude Fable 5 shipped June 9, so Gemini 3.5 Pro arrives last of the three frontier flagships. DeepSeek graduates its V4 family from preview to stable on July 24, the same week it retires legacy API aliases, which forces migration decisions on every team still pinned to old model names. On published coding benchmarks, Claude Fable 5 leads SWE-Bench Pro at 80.3 percent, against 58.6 percent for GPT-5.5 and 54.2 percent for the prior Gemini 3.1 Pro. Gemini 3.5 Pro has no published score yet, and that empty cell in the comparison table is the one everyone wants filled. Google also renamed NotebookLM to Gemini Notebook this week, folding one of its most loved research tools into the Gemini brand as the whole product line consolidates.

Microsoft shipped concrete updates rather than launch drama. The GitHub Copilot June update for Visual Studio, published July 14, brings three features worth knowing. First, trust validation for MCP servers: Visual Studio now compares an MCP server's configuration and asset fingerprint against a trusted baseline at startup, and any change triggers a review dialog before the server runs. This lands right as MCP supply chain attacks became a serious research topic, and it is on by default. Second, the Copilot modernization agent for C++ graduated to general availability, handling MSVC upgrade scenarios end to end in automated mode or step by step in guided mode. Legacy C++ migration is exactly the kind of grinding, pattern-heavy work agents do well. The trust validation feature deserves a longer look because it models a discipline the whole ecosystem needs. An MCP server is executable capability handed to your model, and a poisoned update to a previously safe server is the classic supply chain move. Fingerprinting the approved configuration and interrupting on change is the same idea as lockfiles and signed packages, applied to agent tooling. Expect every serious MCP client to ship an equivalent within the year, and prefer the ones that already do. Third, long-distance next edit suggestions now predict follow-up edits anywhere in the active file, not just near the cursor, so a rename at the top of a file surfaces the matching fixes at the bottom.

These features arrive against a changed business backdrop. Usage-based billing for GitHub Copilot went live for all users on June 1, with code review now consuming GitHub Actions minutes alongside AI credits. The flat-rate era of coding assistants is over across the industry, and every vendor is aligning price with token burn. The market they are fighting over keeps growing. Mordor Intelligence projects AI code tools expanding 26 percent a year, from $9.3 billion in 2026 to roughly $30 billion by 2031. Developer sentiment data shows where loyalty sits right now. The Pragmatic Engineer survey from February named Claude Code the most loved tool at 46 percent, against 19 percent for Cursor and 9 percent for GitHub Copilot, and found 70 percent of teams running two to four AI tools in parallel. Nobody standardized on one assistant. Teams compose stacks, with terminal agents for deep tasks, IDE assistants for daily edits, and cloud agents for background work.

What does a working developer do with all of this? A few practical takeaways from the week. First, revisit your token budgets. With usage-based billing spreading and always-on reasoning models burning thousands of thinking tokens per request, the cost profile of your agent workflows changed this quarter even if your code did not. Measure cost per completed task, not cost per token, and route easy work to cheap fast models. Second, treat MCP server trust as a real attack surface. Visual Studio's fingerprint validation is a template worth copying anywhere you run third-party tool servers: pin what you approved, and alert on drift. Third, hold one slot in your evaluation harness for open-weight challengers. If K3's numbers survive independent testing after July 27, self-hosted frontier coding assistance becomes a line item you can price against API bills, and procurement conversations change fast when that line item exists.

Agent infrastructure kept maturing around the editors. Perplexity introduced Secure Sandboxes on July 17, an isolation layer that gives autonomous agents contained execution environments, credential management, and hard security boundaries. The launch answers the question every platform team asks before approving agent deployments: what happens when the agent runs code we did not review? Replit engineers reported tripling code output using an internal system of coordinated AI agents, a data point for the multi-agent workflow pattern that spread through 2026. The pattern has a recognizable shape now. Cursor, Claude Code, and OpenAI Codex stopped converging into one winner and instead formed layers of a composable stack: one tool orchestrates parallel agents, another executes deep changes, and a third reviews asynchronously in a cloud sandbox. The review layer follows a sound principle: asking the model that wrote code to also review it means grading its own homework, so teams route review to a different system on purpose. Sandboxing products like Perplexity's slot straight into that stack as the execution containment layer, giving each agent an isolated environment with scoped credentials so a misbehaving step damages nothing outside its box. And at Google Cloud Next in Las Vegas, Sundar Pichai said close to 75 percent of code at Google is now AI generated and engineer approved, up from 25 percent in 2024 and 50 percent in 2025. He described engineers orchestrating autonomous agent fleets, and cited a code migration that agents finished six times faster than human teams. Numbers like that from a company of Google's size move the baseline for everyone. When three quarters of a major engineering organization's code arrives machine-written, the scarce skills shift to specification, review judgment, and system design, and hiring plans across the industry are already adjusting to match.

Put the week together and a pattern appears. The frontier labs compete on reasoning depth and context length. The tool vendors compete on trust, isolation, and workflow fit. Both layers moved this week, and the gap between a raw model and a production coding agent keeps widening. That gap is where the interesting engineering lives.

One more data point tempers the enthusiasm. GitLab's 2026 AI Accountability Report found 78 percent of developers report faster code output and 73 percent report better quality, yet overall software delivery has not sped up, because testing, review, and governance bottlenecks absorb the gains downstream. The report frames the fix as accountability: for any line of AI-generated code, an organization should be able to answer where it came from, what it was meant to do, and who approved it. Only a third of surveyed organizations that suffered an incident in the past year were able to trace whether AI-generated code contributed. Writing code faster was never the whole job. Shipping trustworthy systems is, and that is where 2026's tooling investments are heading.

AI Processing: TSMC Breaks Records, Intel Hits a Lithography First

If you want one number that summarizes the state of AI hardware demand, TSMC provided it on July 16. The world's largest contract chipmaker reported a 77 percent surge in net income and record quarterly revenue, driven by AI chip orders from Nvidia, AMD, Apple, and the hyperscalers. The run-up told the same story. June sales alone jumped 68 percent year over year, quarterly sales rose 36 percent, and the company's 3 nanometer and 5 nanometer capacity is fully booked through 2027. Advanced packaging, the CoWoS step that joins compute dies to high-bandwidth memory, remains the binding constraint on AI chip supply, and TSMC keeps expanding it against a backlog. Three new advanced packaging facilities in Chiayi project more than 300 billion Taiwan dollars in annual output once ramped. The market treated the report as a referendum on the whole AI trade. TSMC stock is up more than 52 percent in 2026, and analysts framed the quarter as a health check for trillions of dollars of AI-linked market value.

The demand signal from TSMC's largest customer backs it up. Nvidia's most recent quarter delivered $81.61 billion in revenue, up 85.2 percent year over year, with demand spread across AI labs, hyperscalers, sovereign programs, and the new tier of GPU cloud providers. Not every corner of the hardware market shared the party. Memory stocks slumped in the same week, with SK Hynix falling 13 percent in one session on oversupply worries. The split matters for anyone reading the boom: logic capacity for AI accelerators remains supply-constrained while parts of the memory market wobble, so "AI hardware" is no longer one trade.

Follow the pricing thread to its end and the industry structure comes into focus. TSMC raises wafer prices because it can, since rivals trail on yield at the leading edge. Chip designers pass the increase to cloud providers, who pass it to AI companies, who face a choice: raise API prices, eat margin, or engineer the cost out. That third option explains half of this newsletter. Custom silicon programs, sparse architectures, low-precision formats, and token-thrifty models are all the same answer to the same invoice. Compute scarcity became the field's chief designer, and the designs are getting good.

TSMC paired the earnings with a capital announcement that reshapes the map. The company will add $100 billion to its Arizona manufacturing investment, bringing the total there to $265 billion. For US chip customers, that number converts geopolitical risk into concrete fab capacity on American soil over the coming years. TSMC also told major customers to expect wafer price increases of 5 to 10 percent, and the hikes now reach beyond the newest 3 nanometer node. Pricing power flows downstream. Expect it in GPU prices, then in cloud instance rates, then in your inference bill.

The equipment layer confirmed the boom. ASML raised its full-year 2026 sales forecast after quarterly earnings beat expectations on a surge of orders for its lithography machines. The more striking ASML story came from its biggest new customer. Intel became the first company to ship high-volume logic chips built with ASML's High-NA EUV scanners, with select Panther Lake layers on the 18A node now qualified for the 0.55 NA machines. Reports also point to major yield gains on 18A, with figures around 85 percent circulating. High-NA EUV prints finer features in fewer steps, and every leading-edge roadmap depends on it. Intel reaching volume production first, after years of trailing TSMC on process, is the strongest signal yet that its foundry comeback has substance. Dual qualification is the detail worth understanding: the same Panther Lake layers now print on both the standard 0.33 NA machines and the new 0.55 NA scanners, so Intel can shift volume between tool fleets and prove the new machines against a known baseline. Reports of Intel Foundry winning fresh chip orders followed the announcement within days. For AI buyers, a second credible leading-edge foundry means pricing pressure on the incumbent and resilience against a single point of geographic failure, both outcomes the industry has wanted for a decade. ASML is now preparing TSMC and Samsung for their own High-NA waves. A quick decoder for readers outside the fab world: EUV lithography uses extreme ultraviolet light to print chip features, and the numerical aperture of the optics sets how fine those features get. The new 0.55 NA machines, at roughly $400 million each, print smaller transistors in a single exposure where older tools need several. Whoever masters them first gets density and cost advantages that compound for years, which is why Intel's milestone reached far beyond one product line.

Nvidia spent the week expanding sideways. The company launched Cosmos 3 Edge, a vision reasoning model for edge deployment, and deepened its partnership with Japan to build national AI infrastructure on next-generation Rubin chips. Sovereign AI, governments buying their own training capacity, has become a durable demand pillar alongside the hyperscalers. Japan's program pairs national compute with domestic robotics and manufacturing data, and Nvidia's edge push fits the same thesis. Cosmos 3 Edge targets vision reasoning on devices in factories, vehicles, and stores, where round trips to a distant data center cost too much latency. Training stays centralized. Inference is spreading to wherever the cameras are, and the chip demand curve now has two humps, one in the data center and a growing one at the edge. Policy moved in parallel. The United States approved shipment of a limited number of advanced AI chips to select Chinese buyers, even as Nvidia reportedly cut its Asian buyer list in half to tighten export compliance. The export regime is turning from a wall into a valve, opened and closed buyer by buyer. The workaround economy on the other side keeps growing too. A Huawei-led team reported post-training DeepSeek's 1.6 trillion parameter model on 1,000 Ascend 910C chips, proof that frontier-scale work now happens on domestic Chinese silicon when imports fall short.

The model layer answered the hardware layer with an argument about compute itself. Kimi K3's engineering choices read like a manifesto for doing more with less. The model activates just 16 of its 896 experts per token, about 1.8 percent of its parameter pool, so a 2.8 trillion parameter model runs at a fraction of dense-model cost. Its Kimi Delta Attention design decodes up to 6.3 times faster than standard attention, and an Attention Residuals technique improved training throughput about 25 percent over the prior generation. Moonshot also started quantization-aware training at the supervised fine-tuning stage, using MXFP4 weights and MXFP8 activations. Those are open microscaling formats, and baking them in during training rather than compressing afterward is how a 2.8 trillion parameter model becomes self-hostable on clusters of 8 to 16 nodes of H100 or B200 GPUs. K3 even posted results on GPU kernel generation, sustaining more than 8,700 tokens per second of simulated decode in one chip-design benchmark. A word on those formats, since they are becoming vocabulary every data engineer needs. MXFP4 and MXFP8 are microscaling number formats standardized through the Open Compute Project, storing blocks of values at 4-bit or 8-bit precision with a shared scaling factor per block. They cut memory footprint and bandwidth needs by half or more compared to 16-bit weights, and modern accelerators execute them natively. Training with the target precision from the start, instead of quantizing a finished model, preserves quality that post-hoc compression loses. Chinese labs, squeezed by export controls, are turning compute scarcity into architecture research, and the whole field inherits the results when the weights open.

One more silicon thread continued from the start of the month. Anthropic remains in early talks with Samsung about manufacturing a custom AI accelerator, first reported July 2, with Samsung's 2 nanometer process under evaluation. OpenAI already unveiled its Broadcom-built inference chip, Jalapeno. Every frontier lab now treats custom silicon as a lever on inference cost, and the foundry earnings above show why: the bill for rented compute keeps climbing, and 10 to 30 percent inference savings from purpose-built chips changes the economics of serving models at scale. Anthropic's broader infrastructure commitments give the talks context: the company has committed more than $100 billion in AWS purchases and a $50 billion US data center buildout with Fluidstack, and it recently hired Clive Chan, an engineer from OpenAI's silicon program, a signal the chip project has moved past idle exploration.

The week's processing news fits one frame. Demand is verified and rising, per TSMC and ASML. Supply is diversifying, per Intel's High-NA milestone and Arizona's buildout. And the software side is attacking the same problem from above, with sparse activation and low-precision formats cutting the compute each token needs. Cost per unit of intelligence is the metric every one of these stories moves.

Standards & Protocols: MCP's Biggest Revision Goes Stateless

The Model Context Protocol, the open standard that connects AI models to tools and data, published the release candidate for its 2026-07-28 specification this week. The maintainers call it the largest revision since the protocol launched in November 2024, and the changes read like a graduation from promising project to production infrastructure. The trajectory to this point was fast even by AI standards. Anthropic introduced MCP twenty months ago as a universal way for models to reach tools and data. OpenAI and Google DeepMind adopted it within months, an almost unheard-of alignment among direct competitors, and the server ecosystem grew from dozens to thousands. Growth exposed the seams: session state that fought load balancers, authorization that predated enterprise identity practice, and a core spec absorbing every new idea. The 2026-07-28 revision addresses all three at once.

The centerpiece is a stateless core. The revision removes the initialize handshake and the protocol-level session entirely. Every request now travels self-contained, carrying the protocol version, client information, and capabilities instead of relying on state exchanged up front. For anyone who has operated a remote MCP server, this solves the deployment headache directly. A server that previously needed sticky sessions and a shared session store can now run behind a plain round-robin load balancer. New Mcp-Method and Mcp-Name headers let gateways route traffic without inspecting request bodies, which unlocks clean rate limiting and service mesh integration. New ttlMs and cacheScope fields on list and read responses give clients defined caching rules, so a tool list gets fetched once and reused safely instead of hammered on every turn.

The revision also introduces multi-round-trip request patterns, so a single logical operation can span several exchanges without resurrecting session state. That combination, stateless transport plus structured multi-step interactions, is what lets MCP serve both a laptop-local tool server and a fleet of containers behind a global load balancer with the same specification. Statelessness is the boring-sounding change that decides whether a protocol survives contact with production traffic. HTTP won the web partly because any server can answer any request. MCP just adopted the same survival trait.

The revision also restructures how MCP grows. Capabilities like server-rendered user interfaces, called MCP Apps, and long-running work, called the Tasks extension, now live as first-class extensions that ship on their own timelines rather than bloating the core. Authorization moved closer to the OAuth and OpenID Connect deployments enterprises already run. A formal deprecation policy commits the project to evolving without breaking existing builds. Tool definitions gain full JSON Schema support, so complex parameter shapes finally validate the same way everywhere.

The two flagship extensions deserve their own sentences. MCP Apps lets a server return rendered user interface components, so a tool can hand back an interactive chart or form instead of raw JSON for the client to guess at. The Tasks extension standardizes long-running work: an agent kicks off a job, polls or subscribes for progress, and collects results later, the pattern behind research runs, batch data jobs, and slow external APIs. Pulling these out of the core means a minimal server stays minimal while ambitious servers grow capabilities on a published track. The formal deprecation policy is the quiet companion to all of it. Enterprises refused to build on a protocol that changed under their feet, and a written lifecycle for retiring features is the price of their trust.

The tooling is ready to test today. Beta releases of the Python, TypeScript, Go, and C# SDKs shipped alongside the release candidate. Python v2 renames FastMCP to MCPServer while keeping the decorator API. TypeScript v2 splits the single package into focused modules for server and client, and goes ESM-only. Go ships support in v1.7.0-pre.1 and C# in a 2.0.0 preview. Compatibility is handled gracefully: new clients fall back to the old handshake when they meet a server on an earlier revision, so nothing breaks on July 28 when the final specification publishes. If you maintain an MCP server, start validating against the release candidate now, and if you use the Tasks pattern, plan the migration to the extension-based lifecycle. The final specification publishes July 28, and Tier 1 SDKs are expected to ship full support within a ten-week validation window under the project's SDK tier system. The changelog lists every change against the 2025-11-25 revision, and the specification repository takes issues from implementers who hit problems.

The stateless release lands on top of an enterprise security milestone from earlier this month that deserves mention for anyone rolling out MCP at work. The project promoted its Enterprise-Managed Authorization extension to stable status, replacing per-server consent prompts with a single sign-on flow through the organization's identity provider. Users authenticate once, and approved servers just work. The flow rides the identity provider an organization already operates, so access reviews, offboarding, and audit trails cover MCP servers the same way they cover every other SaaS application. That single property converts MCP from a tool individual developers sneak past IT into a system IT can approve. Anthropic, Microsoft, and Okta adopted the extension, and it gives IT departments the central control they require before approving agent tooling. Pair that with Visual Studio's new MCP trust validation, covered above, and a theme emerges: the ecosystem spent this cycle hardening the protocol for organizations, not just enthusiasts.

Adoption breadth keeps compounding. Microsoft's MCP catalog now includes more than 60 ready servers spanning its productivity, developer, and business application stack, usable across Microsoft 365 Copilot, Copilot Studio, Azure AI Foundry, and GitHub Copilot. One standard connection model across all of those surfaces is exactly the outcome protocol standardization promised. Consumer platforms keep joining too. X launched a hosted MCP server that opens its platform API to AI applications through the standard interface, sparing developers custom integration work. When social platforms, enterprise suites, and developer tools all speak one protocol, agent builders stop writing adapters and start writing behavior.

The enterprise agent platforms racing to consume these standards showed their strategy this week as well. Google is selling enterprises the tooling to deploy fleets of governed agents that connect to corporate data, run multi-step workflows, and stay under IT control, and both Google and Microsoft are backing shared standards for how agents connect to business software. The operative word in every enterprise pitch is govern. Companies stall on agents not because models are weak but because unmanaged agents leak data and exceed authority. Standards plus governance is the unlock, and this week delivered progress on both halves.

Agent-to-agent communication had its own moment of maturity, in the form of clear-eyed security writing. The Agent2Agent protocol, stewarded by the Linux Foundation, passed 150 supporting organizations with production deployments across multiple industries as of April. A detailed security analysis published July 16 walked through what A2A deliberately leaves to deployers: identity, credential provisioning, and authorization sit outside the protocol, and closing that gap is the operator's job. The protocol runs on JSON-RPC 2.0 over HTTPS with Server-Sent Events for streaming, and Agent Cards advertise capabilities for discovery. The division of labor across the standards is now settled shorthand: MCP connects agents to tools, A2A connects agents to each other, and security teams own the identity layer both standards ride on. The concrete risks the analysis names are worth internalizing before your first multi-agent deployment. An agent that trusts another agent's self-description trusts an unverified claim, so capability discovery needs authentication behind it. Delegated tasks carry data across trust boundaries, so payloads need classification and filtering, not just encryption in transit. And long-running agent relationships need credential rotation and revocation, because a compromised agent with standing delegations is a lateral movement machine. None of this is a flaw in A2A. It is the ordinary work of operating any federated system, arriving in a new costume.

Licensing counts as a standard too, and the week produced a notable data point. Moonshot's decision to release Kimi K3 under a Modified MIT license keeps the largest open-weight model ever announced permissive enough for commercial use, fine-tuning, and integration without legal friction. Open weights at frontier scale change who gets to build serious AI systems. Enterprises with their own GPU clusters, researchers probing model internals, and startups fine-tuning for narrow domains all gain an option that no closed API offers. July 27, when the weights actually drop, will test whether the community can reproduce the benchmark claims. Watch that date.

Open weights and open protocols reinforce each other, which is why they share a section. A team that self-hosts K3 still needs its agents to reach tools, and MCP is how they will do it without vendor lock-in at the integration layer. A company standardizing on MCP gains the freedom to swap models, closed or open, without rewriting a single connector. Each open layer raises the value of the others, and the stack that results, open model weights over open protocols over open data formats, is the same architectural bet the lakehouse world made about data a decade ago. Portability wins slowly, then all at once.

What to Watch Next Week

The calendar for the coming ten days is unusually dense. The World Artificial Intelligence Conference in Shanghai runs through July 20 with more than 140 forums, after opening with Xi Jinping's first keynote in the event's history, and Chinese labs traditionally time releases to it. July 24 brings DeepSeek's V4 stable graduation and the retirement of its legacy API aliases, a forced migration for anyone still on old model names. July 27 is the K3 open weights drop, when independent benchmarking begins in earnest. And July 28 is the MCP specification final, the starting gun for the SDK support window. Any one of these reshapes a corner of the stack. All four in one stretch make the last week of July a checkpoint for the whole year. Set your evaluation pipelines up before the dates hit, not after. Teams that had harnesses ready when GPT-5.6 and Fable 5 launched in June made routing decisions in days. Teams that started building on launch day are still catching up, and the gap between those two groups is becoming a real competitive difference in how fast organizations absorb new capability. The releases will keep coming. The absorption machinery is the durable investment.

The Week in One Idea

Every section above tells the same story from a different altitude. The protocol layer went stateless so agent infrastructure scales like ordinary web infrastructure. The model layer used sparsity and low-precision formats to cut the compute behind each token. The hardware layer posted record numbers while adding capacity on two continents. The industry is industrializing. The experiments of 2024 and 2025 are becoming load-bearing systems with load-bearing standards, and the winners of the next phase will be the teams that treat agents, models, and data as one engineered stack rather than three separate bets. For data teams specifically, the assignment is clear. Agents are becoming the primary consumers of analytical data, standards now define how they connect, and the economics reward architectures that keep data open, governed, and queryable by any model you choose next year. Build for that world now and the next model launch becomes a config change instead of a migration.

Resources to Go Further

The AI world changes fast. Here are tools and resources to help you keep pace.

Try Dremio Free: Experience agentic analytics and an Apache Iceberg-powered lakehouse. Start your free trial

Learn Agentic AI with Data: Dremio's agentic analytics features let your AI agents query and act on live data. Explore Dremio Agentic AI

Join the Community: Connect with data engineers and AI practitioners building on open standards. Join the Dremio Developer Community

Book: The 2026 Guide to AI-Assisted Development: Covers prompt engineering, agent workflows, MCP, evaluation, security, and career paths. Get it on Amazon

Book: Using AI Agents for Data Engineering and Data Analysis: A practical guide to Claude Code, Google Antigravity, OpenAI Codex, and more. Get it on Amazon

Browse the full catalog of 50+ books at books.alexmerced.com

Apache Data Lakehouse Weekly: July 11 to July 18, 2026

Alex Merced — Sat, 18 Jul 2026 16:44:27 +0000

The open lakehouse stack spent this week arguing about what belongs in a spec and what belongs in a release. Iceberg contributors opened a formal push to remove equality deletes from the V4 spec, Parquet closed its vote on a brand new File logical type, and Polaris wrestled with how much consistency its persistence layer owes its users. Underneath the design debates, release trains kept moving: Iceberg Rust, an Iceberg Terraform provider, Arrow JS, Arrow Rust Object Store, DataFusion, Ballista, and Comet all had votes in flight. This is a week where the community showed both sides of its personality, big structural bets for the future and steady, unglamorous shipping for the present.

By the numbers, the six dev lists carried 268 messages across roughly 80 distinct threads in the past week. Polaris led with 84 messages, Iceberg followed at 71, Parquet posted 51, DataFusion 26, Arrow 20, and the young Ossie project added 16. Those raw counts undersell the range: the week included two format-level votes, five release candidates, one new committer, a persistence redesign proposal, and at least four threads that will shape spec decisions months from now. Grab a coffee. There is a lot to cover.

Apache Iceberg

The most consequential thread of the week came from huaxin gao, who restarted the conversation on deprecating equality deletes, this time framed around the V4 spec. Russell Spitzer first proposed the idea back in October 2024, and the blocker then was real: Flink streaming upserts depended on equality deletes and had no practical replacement. The new thread argues that the ground has shifted. Equality deletes are cheap to write but expensive to read, since every reader must join delete files against candidate rows across a sequence number range. Positional deletes and V3 deletion vectors read far faster, one compact bitmap per data file applied with an O(1) position check. The thread goes past performance too. Equality deletes block CDC and row lineage because the true state of a table requires a full scan while they exist. They also force full rebuilds of materialized views and secondary indexes instead of incremental maintenance. Steven Wu, Manu Zhang, Maximilian Michels, and Xin Huang all weighed in, and the discussion drew eight messages in its first days.

For readers newer to Iceberg internals, the stakes are worth spelling out. Iceberg supports two ways to delete rows without rewriting data files. An equality delete says "remove every row where id equals 42," which a streaming writer can emit in microseconds without reading anything. A positional delete says "remove row 1,507 of file X," which requires the writer to know exactly where the row lives. The first pushes all the cost onto readers, who must evaluate the predicate against huge swaths of data on every query. The second keeps reads fast at the price of more expensive writes. V3 deletion vectors compress the positional approach into one bitmap per data file. The V4 question is whether the format still needs the first option at all once conversion tooling makes the second cheap enough for streaming workloads.

The timing is not an accident. Maximilian Michels reported that the Flink equality delete to deletion vector conversion work is now complete, merged across six commits. The feature ships as a new table maintenance task called ConvertEqualityDeletes, integrated with IcebergSink. Writers keep producing data files and equality deletes as before. The converter reads them, resolves the deletes into deletion vectors using a primary key index backed by RocksDB in Flink state, and commits data files plus DVs to the target branch. Teams can stage writes on a separate branch so readers never see equality deletes at all, or run in-place conversion on a single branch. This is the workable alternative that was missing in 2024, and it lands right as the V4 deprecation push begins. Read the two threads together and you see a community clearing a path before it closes a door.

Release work stayed busy. Danny Jones and Shawn Chang called a vote on Iceberg Rust 0.10.0 RC4 after RC3 gathered votes from Kevin Liu, Amogh Jahagirdar, and others earlier in the week. Sung Yun, Renjie Liu, L. C. Hsieh, and Maximilian Michels verified RC4, which includes a check that pyiceberg-core builds and tests cleanly against the release. The Rust implementation now sits underneath the Python ecosystem, so each Rust release carries weight well beyond Rust users. That dependency chain is worth pausing on. PyIceberg increasingly delegates its performance-critical paths to pyiceberg-core, which is compiled from this Rust codebase. A bug in iceberg-rust becomes a bug in every Python notebook and Airflow DAG that touches Iceberg through PyIceberg. That is why the release checklist now explicitly verifies the Python bindings, and why voters from the Python side of the community show up on Rust release threads. The 0.10.0 line also continues the project's steady march toward feature parity with the Java reference implementation, which lowers the barrier for teams that want Iceberg without a JVM anywhere in the stack.

Infrastructure as code arrived as a first-class citizen this week. Matt Topol proposed RC0 of the Apache Iceberg Terraform Provider v0.1.0, the project's first release of the provider, with convenience binaries prepared for the Terraform and OpenTofu registries. Neelesh Salian, Alex Stephen, Sung Yun, Talat Uyarer, and Rich Bowen all participated in verification, and an RC1 vote followed as issues surfaced. Once this lands, teams can declare Iceberg resources in the same Terraform plans that manage the rest of their infrastructure. Think about what that unlocks in practice. A platform team can define namespaces, tables, and their properties in version-controlled HCL, review changes through pull requests, and roll environments forward and back with the same tooling they use for VPCs and Kubernetes clusters. Catalog drift, the gap between what the catalog says and what the last runbook did, becomes a solved problem instead of a recurring incident. It took years for databases to get credible Terraform support. Iceberg is getting there in its first decade.

AI showed up on the dev list in a very concrete form. Gang Wu asked the community about enabling ASF-managed GitHub Copilot code review on Iceberg repositories, starting with iceberg-cpp as a trial. The proposal uses the .asf.yaml setting that ASF Infra now supports, and it follows Apache Arrow, which enabled and tuned the same feature. Gang picked iceberg-cpp because reviewer bandwidth there is thin, and an automated first pass can catch simple issues before a human review. The thread drew eleven messages from Steve Loughran, Junwang Zhao, Scott Haines, and others, making it the most active Iceberg discussion of the week. The questions were practical: what does it cost in CI resources, how noisy is the feedback, and who tunes it. There is a bigger question under the surface. Open source review is the mechanism by which projects transfer knowledge, enforce standards, and grow maintainers. An automated first pass that catches typos, missing null checks, and doc gaps frees human reviewers for design feedback, which is a clear win. An automated pass that contributors treat as the review, or that buries PRs in low-value comments, erodes the very culture it was meant to help. Starting with one low-traffic repository and evaluating before expanding is the right way to find out which outcome Iceberg gets. The fact that Arrow already ran this experiment and tuned it gives Iceberg a head start on configuration.

Two type system proposals advanced. Yan Yan opened a discussion on first-class vector type support for Iceberg. Embeddings are everywhere in AI workloads, and today they live in Iceberg as list, which cannot express the invariant that every value shares one dimension. The proposal prefers a dense numeric vector type with compact schema encoding, something like float[768], with fixed dimension, non-null elements, and nullability controlled at the field level. It points at parallel work in the Parquet community on fixed-size lists, which matters because the table format and the file format need to agree for the type to pay off. Meanwhile the collation support discussion between Andrei Tserakhau, Alexander Löser, and Russell Spitzer dug into a genuinely hard question: how much cross-engine interoperability should the format guarantee for string ordering? Alexander laid out the trap in detail. ICU does not keep orderings stable across versions, so two engines on different ICU versions can sort the same strings differently, return different aggregation results, and filter different rows. A colleague of his once rolled back an ICU upgrade in production because users complained about changed sort orders. Pinning an ICU version at the table level buys consistency and costs upgrade freedom. The thread has not resolved the tension, and it is worth watching because collation touches execution, pruning, and equality semantics all at once.

The file format layer got its own existential question. Martin Prammer proposed adding Vortex as an Iceberg file format, and smartly split the draft into two parts: what criteria any candidate file format should meet, and how Vortex meets them. That framing turns a single-format request into a durable policy, which is exactly what a spec-driven project needs as more formats knock on the door.

Quality and correctness threads kept coming. Priyadarshini Mitra proposed a ValidateTableIntegrity action that walks the full metadata graph, metadata.json entries, manifest lists, manifests, data files, delete files including V3 deletion vectors, and statistics files, verifying every referenced file exists on storage. It supports a self-audit on one table and a source-versus-destination check for DR and migration scenarios, tracked across three sequential PRs. Neelesh Salian, Sung Yun, and Andrei Tserakhau merged their earlier threads into one proposal for shared conformance fixtures, a standalone language-neutral repository modeled on parquet-testing, so every Iceberg implementation checks its reading of the spec against a shared answer key instead of only against itself. Working proofs of concept already exist for pyiceberg, iceberg-rust, and iceberg-go. And Russell Spitzer moved to clarify in the spec that live manifest entries must be unique by file path, tightening language first added four years ago so writers know duplicate references are simply not allowed.

Russell also raised a question about breaking behavior in AvroSchemaUtil, where adding LocalTimestamp support changes what convert returns for local-timestamp-micros, from Long to TimestampType.withoutZone(). His position: the old behavior is a bug, and Iceberg should not preserve incorrect legacy behavior behind a flag for outside consumers. Ryan Blue engaged on the thread, and the precedent cited is the earlier NanoTimestamp change that did the same thing.

On the operational side, Oleksii Omhovytskyi asked about release timing for the encrypted deletion vector fix. On natively encrypted format-v3 tables, a merge-on-read UPDATE or DELETE wrote a deletion vector Puffin file without key metadata, and the next read failed. The fix is merged with backports staged on the 1.11.x and 1.10.x branches, and Oleksii verified the 1.11.x branch against his exact repro. He offered to test any release candidate, a nice example of a user pushing a patch release forward with evidence instead of just a request. Amogh Jahagirdar responded on timing.

Several single-message threads planted seeds worth tracking. Anurag Mantripragada proposed using Iceberg sort order metadata to improve read and compaction behavior in Spark. Iceberg tables already record their sort orders in metadata, but engines rarely exploit that knowledge at plan time, so there is free performance sitting on the table. A CDC practitioner opened a thread on row-delta commit patterns and multi-table transactions in iceberg-rust, sharing lessons from a production change-data-capture pipeline and asking what the Rust library should support natively. Oğuzhan Ünlü requested review on a PR adding typed exceptions for OAuth2 token endpoint errors in the API and core modules, small plumbing that makes REST catalog auth failures debuggable instead of mysterious. Andrei Tserakhau also pitched a series of Iceberg technical blog posts with a first draft attached, and Matt Butrovich responded. Community-written deep dives are one of the best on-ramps a project can have, so this effort deserves support. On the ecosystem edge, Piergiorgio Lucidi introduced the OpenCrawling connector, which bridges Iceberg tables into enterprise AI and RAG pipelines, one more signal that retrieval workloads now treat the lakehouse as a first-class source.

Rounding out the week: Adam Szita moved the KMS credential vending proposal into a draft REST OpenAPI spec PR, mirroring storage credential vending so REST catalogs can return short-lived scoped key-management credentials for encrypted tables. Ryan Blue weighed in on Spark routing for Iceberg Materialized Views, favoring a basic implementation in Iceberg itself that replaces a view with a table read, usable without engine APIs, with engines layering smarter freshness decisions on top over time. Alexandre Dutra opened a discussion on migrating Iceberg to Jackson 3, a mechanical but pervasive change whose hardest part is that Jackson types leak into the public API of the parser classes and REST layer, so Jackson 2 and 3 will need to coexist for a while. A contributor from Alibaba proposed adding an Alibaba Cloud auth type for the REST catalog. And Talat Uyarer announced an Apache Iceberg meetup in Austin on July 23.

Apache Polaris

Polaris had the busiest list of the six projects this week at 84 messages, and the center of gravity was persistence consistency. Dmitri Bourlatchkov opened a discussion on consistent multi-object changes in Polaris persistence, prompted by PRs from Ayush and Prithvi that surfaced real consistency gaps in the JDBC backend. Dmitri's framing is deliberately broad: rather than patching individual windows, the community should design one approach that covers concurrent validated commits, independent but consistent RBAC changes, atomic multi-entity updates, authorization-filtered listings, credential vending rooted in exact catalog state, and server-side retries for transient failures. Prithvi S made the problem concrete in a companion thread on an atomic multi-entity plus grant commit SPI. Today, grant and revoke, createCatalog, and dropEntity each compose multiple atomic SPI calls, so a server failure mid-sequence leaves partial state, a grant without version bumps or a catalog without its admin role. His draft adds a writeEntitiesAndGrantRecords method to BasePersistence that does entity writes, deletes, and grant changes in one all-or-nothing operation, and he asked the community whether to start narrow with grants only or migrate all three flows at once. Robert Stupp and Dmitri both engaged. These two threads together read like the start of a persistence redesign, and how Polaris answers will shape every backend it supports.

Identity and authorization saw the single most active thread of the week. Prithvi S, Dmitri, Alexandre Dutra, Yufei Gu, and Jean-Baptiste Onofré traded ten messages on forwarding user-defined principal properties in PolarisPrincipal. The resolution matters for anyone running external authorizers: the authentication layer will forward user information to PolarisPrincipal as optional attributes, which decouples OPA and Ranger authorizers from both Quarkus classes and PrincipalEntity. Authorizers then work with any identity provider and survive Quarkus upgrades untouched. Alexandre Dutra also followed through on standardizing vended credential property names, opening a PR that generates credential documentation straight from the StorageAccessProperty enum. Along the way he found and removed a spurious vended property, expiration-time, that matched no known credential.

The datasource architecture debate sharpened. In the Polaris-managed JDBC datasource thread, Yufei Gu clarified the two motivations, runtime loading of JDBC drivers for ASF binaries and runtime datasource creation as a building block for per-realm datasources. His proposed contract keeps Quarkus and Agroal as the default, with Polaris-managed Hikari as an escape hatch. Alexandre Dutra pushed back hard on the hybrid: alternating pools based on configuration means bugs and performance characteristics vary across deployments purely by pool choice, so if the goal is a runtime-driven architecture, commit fully and switch to Hikari unconditionally. Romain Manni-Bucau, JB, and Dmitri also weighed in. Nobody has yet closed the gap between "escape hatch" and "all or nothing."

Process and culture got real attention too. Dmitri opened a thread titled Respecting developer and reviewer cognitive work after a review dispute about introducing build warnings via intentional deprecations. JB called avoiding new warnings an implicit good practice that is acceptable when explained. Yufei argued the community should either adopt an explicit documented rule or stop letting individual reviewers enforce it case by case, since documented rules give contributors predictable expectations and prevent double standards. Adnan Hemani and Robert Stupp joined as well. Every open source project has this conversation eventually, and Polaris is having it in the open.

A vote is now live on error semantics. Following an earlier discussion, vignesh a called a vote to return HTTP 503 Service Unavailable when a table or view rename fails with TARGET_ENTITY_CONCURRENTLY_MODIFIED. The reasoning: 409 already means "target identifier exists" in the Iceberg REST spec, and 429 wrongly implies rate limiting, so 503 is the retryable option without semantic conflicts. The vote closes at 14:00 UTC on Sunday, July 19, with Robert Stupp, Alexandre Dutra, Nándor Kollár, and Dmitri participating. Nándor had teed up the choice in an earlier discussion thread on status codes for rename conflicts. Status code selection sounds like trivia until you remember that every Iceberg REST client on the planet encodes retry behavior against these codes. Pick a code that clients interpret as permanent failure and transient contention turns into user-visible errors. Pick one with the wrong retry semantics and clients hammer a struggling server. Getting this right once, by vote, spares every client library a heuristic.

The semantic layer story kept building, and it connects straight to Apache Ossie below. In the Semantic Model REST API payload thread, Robert Stupp supported Polaris hosting Ossie semantic-model documents as a beta foundation, then drew a sharp line: the merged API is namespace-and-name CRUD over opaque documents, and the project should not describe it as enabling AI tools, BI tools, or semantic discovery until clients can actually find models by table, metric, domain, or capability. His larger point is architectural. The client consumption model should drive the persistent data model, because once semantic models become durable Polaris entities, identity, versioning, indexing, and freshness semantics get very hard to change. This debate matters well beyond Polaris. The industry is converging on the idea that AI agents need a semantic layer to query data correctly, and catalogs are the natural place to host one. Whoever defines how agents discover the right semantic model, by table, by metric, by domain, by trust level, defines a big piece of how agentic analytics works. Robert's insistence on honest labeling, calling document CRUD what it is until discovery exists, protects users from building on promises the API does not yet keep. EJ Wang also shared a Polaris Tag Spec design proposal for community review, a native tag model covering tag definitions as catalog-scoped entities, assignments down to the column level, allowed values, inherited reads, and by-tag lookup, with a review slot planned for the July 23 community sync. And EJ posted updated framing for the table metrics and events REST work: the persistence refactor splits into its own PR, the metrics SPI stays in core with a no-op default, and the REST query API plus JDBC implementation land as optional extensions.

A cluster of smaller operational threads rounded out the Polaris week. Eundo Lee, Alexandre Dutra, and Yufei Gu discussed making the Relational JDBC schema name configurable, which matters for teams that run multiple services against one database and need Polaris to live in its own schema. Yong Zheng and yun zou, with input from Robert Stupp and Dmitri, weighed moving the Spark plugin regression tests from a Docker-based harness to JUnit, trading environment fidelity for speed and debuggability in CI. EJ Wang and Dmitri discussed removing PolarisMetricsManager from PolarisMetaStoreManager, part of the same untangling that the metrics SPI refactor demands. Dmitri also proposed dropping the schema-version option from the bootstrap command and making catalog ID nullable in the JDBC events tables, while Yufei opened a thread on supporting staged creates inside multi-table commitTransaction calls. None of these is glamorous. All of them are the difference between software that demos well and software that operators trust.

Polaris is also getting its own Terraform provider. Alex Stephen explained that the Iceberg provider community chose to focus exclusively on Iceberg resources, so the Polaris resources need a new home. Sung Yun called lazy consensus on creating the terraform-provider-polaris repository, a name the HashiCorp registry requires. Housekeeping continued elsewhere: Dmitri proposed deprecating TreeMapMetaStore for removal, laid out SPI development principles including where SPI classes should live to minimize dependency leaks, and Robert Stupp clarified what the first OpenLineage scaffolding merges do and do not settle, insisting on explicit usability and operational criteria for the local lineage store before schema work merges.

Apache Arrow

Arrow's headline discussion was about making schemas travel well. Matt Topol, David Li, and Dewey Dunnington continued the design conversation on a JSON representation of Arrow schemas. David argued for a representation that is unambiguous, consistent, and friendly to both humans and machines, aimed at REST APIs and ADBC, where compactness matters less than clarity. Dewey noted that abbreviations like uint32 match how implementations actually enumerate types, so they simplify parsers while shrinking payloads, and he worked through how extension types like GeoArrow should carry their metadata so an API consumer can read an extension parameter without re-parsing escaped JSON. This sounds like a small thing. It is not. A standard JSON schema form gives every catalog, REST service, and agent framework a common way to describe Arrow data without touching the binary IPC format. Today, every project that needs to express an Arrow schema over HTTP invents its own encoding, and every one of those encodings handles extension types, nested fields, and metadata a little differently. One blessed representation means an ADBC server, an Iceberg REST catalog, and a Flight service can all describe the same table the same way, and a client can validate schemas before any data moves. The care the trio is taking with edge cases now, especially extension metadata, is what will keep the format from needing a breaking revision later.

Release votes moved on two fronts. Sutou Kouhei proposed Apache Arrow JS 21.2.0 RC1, with David Li, Kent Wu, Bryce Mecum, and Hyukjin Kwon verifying. Andrew Lamb called the vote on Arrow Rust Object Store 0.14.1 RC1, and Raúl Cumplido, Krisztián Szűcs, and L. C. Hsieh returned binding +1s after running the verification script. The object_store crate sits under a huge slice of the Rust data ecosystem, including iceberg-rust and DataFusion, so patch releases here ripple outward fast. Ian Cook also hosted the Arrow community meeting on July 15.

Apache Parquet

Parquet delivered the week's biggest format decision. Burak Yavuz announced the result of the vote on the new File logical type: it passed with 18 +1 votes, 4 of them binding, from a list that includes Daniel Weeks, Russell Spitzer, Andrew Lamb, Gang Wu, Fokko Driesprong, Steve Loughran, Gunnar Morling, and more. The vote thread itself drew 19 messages and capped months of design docs, biweekly syncs, and a discussion thread, with reference implementations already open in parquet-java and arrow-rs. A File logical type lets Parquet columns carry file-like binary payloads with defined semantics, and the breadth of the voter list, spanning Iceberg, Arrow, and Parquet maintainers, says a lot about who plans to use it. The use cases are easy to picture. Multimodal AI datasets carry images, audio clips, and documents alongside tabular features today, usually as raw binary columns with semantics living in tribal knowledge or sidecar metadata. A File logical type gives readers a standard way to know that a column holds file content, opening the door to smarter tooling, previews, and type-aware processing across engines. Burak is holding the format PR open a few more days for remaining reviewer feedback from Rok, Antoine, and Gang, and a follow-up design conversation with Talat and Gaurav on version identifiers continues on the doc.

The long-running versioning debate produced structure this week. Micah Kornfield pulled format versioning changes into a standalone RFC PR, separate from the larger PARX proposal, and after feedback from Antoine Pitrou he dropped the SemVer branding entirely in favor of plain language about major version bumps on forward-incompatible features. He also tightened the recommendation for when writers should flip to a new default version, 6 to 18 months after a format release. Ryan Blue defended continuing the current vote in the Q&A thread on how format versions work, using an extended apples-versus-oranges analogy to argue the community already spent a month choosing between numbered releases and time-based presets, and the vote simply affirms that choice. Julien Le Dem scheduled an ad hoc sync with a poll to compare the proposals side by side and get to a conclusion faster. Watching Parquet design its own release governance in public is a treat for anyone who cares about how standards evolve.

Fokko Driesprong moved Apache Parquet 1.18.0 toward release, saying he plans to start the process next week after multiple requests for accumulated features, fixes, and patched CVEs. Aaron Niskode-Dossett flagged two performance PRs as candidates, a FileStatus cache in the footer path and a faster RunLengthBitPackingHybridDecoder.

Encodings advanced on two tracks. Prateek Gaur posted a status update on ALP encoding for floating point data ahead of a formal vote he intends to start within days. The spec PR is joined by a C++ implementation in Arrow that has been through several review rounds and a Java implementation by Vinoo, and the cross-language story is strong: the Arrow C++ decoder reads Java-written data bit-exactly across roughly 1.56 million values and 18 fixtures, covering V1 and V2 pages and several real datasets, with zero mismatches. Remaining work covers extreme values needing 63 to 64 bits after frame-of-reference. Meanwhile Andrew McCormick kept doing the empirical legwork on the FIXED_SIZE_LIST logical type discussion, answering Antoine Pitrou's nullability question with fresh benchmarks. His numbers show a hint-aware reader on a plain LIST landing within noise of the non-compatible vector option, around 1,430 nanoseconds per row versus 2,600 for a full Dremel decode, and the optional outer array costs nothing when no nulls are present. That is the kind of measurement that turns a format argument into a format decision, and it pairs directly with the Iceberg vector type proposal above.

The encoding pipeline has more behind ALP. Prateek also floated a PFOR encoding discussion for patched frame-of-reference integer compression, an approach with a long history in column stores that has never had a Parquet spec home. Alkis Evlogimenos proposed making path_in_schema optional in the column metadata, trimming redundant bytes from footers that large tables repeat thousands of times. Aaron Niskode-Dossett suggested passing a known file length into HadoopInputFile.fromPath, which saves a round trip to object storage on every file open, the kind of micro-fix that adds up to real money at petabyte scale. And Julien convened the regular Parquet sync on Wednesday, July 15, where several of the threads above got live discussion time.

Two compatibility threads deserve attention from anyone running mixed reader fleets. Kevin Liu followed up on how older parquet-java readers handle VARIANT columns, turning a sync discussion into a detailed format issue and a parquet-java fix PR. The plan includes backporting the fix as patch releases so older readers can open files with new logical types without a forced upgrade, plus testing other implementations, since Go and fastparquet reportedly crash on unknown logical types. Micah Kornfield also continued the extended precision nanosecond timestamp proposal, arguing for FLBA<9> over FLBA<8> because BigQuery and Trino are already moving toward picoseconds, and one width handles nanoseconds through picoseconds with the same code.

Apache DataFusion

DataFusion ran a clean release week across three subprojects. Matt Butrovich proposed DataFusion 54.1.0 RC1 with the changelog and verification steps in hand. Andy Grove called the vote on Ballista 54.0.0 RC2 after RC1 failed, and the vote drew verification from Andrew Lamb, Marko Milenković, Martin Grigorov, and L. C. Hsieh before passing. Matt also shepherded Comet 0.17.1 RC1 through its vote to a passing result, keeping the Spark-accelerator branch of the family current alongside the core engine and the distributed scheduler.

The community also grew. Andrew Lamb announced Adam Gutglick as a new DataFusion committer, and the congratulations thread filled quickly with notes from Kumar Ujjawal, Jeffrey Vo, Matt Butrovich, and Martin Grigorov. Committer announcements are easy to skim past, but they are the truest health metric an open source project has.

The shape of the release week says something about how the DataFusion family now operates. The core engine, the Ballista distributed scheduler, and the Comet Spark accelerator each cut releases on their own cadence while tracking the same 54.x line, so downstream users get a coherent version story across very different deployment models. A failed RC1 followed by a clean RC2 within days is also a sign of healthy release muscle: problems get caught by verification, not by users. With DataFusion increasingly serving as the query engine inside other lakehouse tools, that discipline pays dividends far outside the project's own repositories.

Apache Ossie

Ossie, the young semantic interchange project, spent the week doing the unglamorous work that decides whether a project scales. Yong Zheng, fresh to the codebase, flagged inconsistent module management across the converters: two converters use uv, one uses modern Python packaging, one uses a legacy requirements.txt, and two use Maven on different JDKs. He volunteered to standardize them so the repository stops feeling like six projects owned by six companies, and JB, Emil Sadek, Aniket Kulkarni, and Khushboo Bhatia joined the thread. Yong followed with a proposal to standardize converter naming on an apache-ossie-xxxxx pattern and to introduce a shared base converter abstraction, since only three of five converters follow the same file structure today.

On the spec side, Will Pugh shared the foundational semantics document from the expression language group, asking for general feedback and proposing to build a reference implementation in parallel, on the theory that evaluating semantics is easier with running code. Level-of-detail calculations, filter exclusion, and fine-grained join specifications are deliberately deferred to a later pass. And a GitHub discussion surfaced on the list asking how Ossie relates to FIBO and the financial services semantic stack, reading FIBO as a reference ontology layer and Ossie as the interchange layer that maps to it. The question lands at the right moment, given the Polaris semantic model hosting debate above. The ecosystem is deciding, in real time, which layer owns which promise.

Why does converter housekeeping deserve newsletter space? Because Ossie is a specification project, and a spec lives or dies on its converters. If exporting a dbt project, a Snowflake semantic view, or a GoodData workspace into Ossie feels inconsistent, adoption stalls no matter how elegant the core model is. A shared base converter class and a uniform packaging story lower the cost of writing converter number seven, and converter number seven is how a new tool joins the ecosystem. Yong volunteering to do this work in his first weeks on the project is exactly the kind of contribution that turns an incubating spec into infrastructure. New contributor Dragos Crintea also introduced himself to the developer community this week, and Will Pugh posted general repository guidelines to keep the growing contributor base aligned.

Cross-Project Themes

Three threads of connective tissue stood out this week. First, infrastructure as code went from wish to reality across the stack: Iceberg voted on its first Terraform provider release while Polaris reached lazy consensus on creating its own provider repository. Declarative catalog and table management is becoming table stakes, and both communities are honoring the registry naming rules rather than fighting them.

Second, the AI workload is reshaping formats from both ends. Iceberg debated a native vector type, Parquet benchmarked fixed-size lists to store those vectors well, Iceberg considered Copilot for code review, and Polaris debated how to host Ossie semantic models for AI and BI consumption. The table format, the file format, the catalog, and the semantic layer are each answering the same question at their own layer: what does an agent or an embedding pipeline need from open data infrastructure?

Third, correctness culture is compounding. Shared conformance fixtures in Iceberg modeled on parquet-testing, a VARIANT forward compatibility fix with backports in Parquet, spec language tightening on manifest uniqueness, and integrity validation actions all point the same direction. As implementations multiply across Java, Rust, Python, Go, and C++, the projects are investing in shared answer keys instead of trusting each implementation to grade itself.

A fourth pattern hides in plain sight: governance maturity. Parquet is writing down its versioning rules as an RFC instead of relying on tribal knowledge. Polaris is debating whether review norms should be documented rules or reviewer discretion. Iceberg is defining criteria for admitting new file formats before evaluating any specific one. DataFusion promoted a committer. These are the habits of projects planning to be around in a decade, and the fact that all four surfaced in one week suggests the lakehouse stack is entering its institutional phase. Institutional does not mean slow. The same week produced five release candidates and a passed format vote. It means the projects are building the decision-making machinery that lets them move fast without breaking the ecosystems that depend on them, and that machinery is the least visible, most valuable output of this community.

Looking Ahead

The Polaris 503 vote closes Sunday, July 19, and the Polaris community sync on July 23 takes up the tag spec. The same day, Iceberg contributors gather in person in Austin. Watch for the Parquet ALP encoding vote to open, for Fokko to kick off the Parquet 1.18.0 release process, and for results on Iceberg Rust 0.10.0 RC4 and the Terraform provider RC1. The equality deletes deprecation thread will keep growing, and the answers there will define a good chunk of what Iceberg V4 becomes.

Further out, keep an eye on three slow burns. The Iceberg collation discussion has to reconcile cross-engine consistency with ICU upgrade freedom, and whatever it decides will echo in every engine that sorts strings. The Polaris persistence redesign will take months, and the SPI shape it lands on determines how hard NoSQL backends are to build. And the Parquet versioning RFC, once merged, becomes the template other format projects copy when they outgrow informal release habits. None of these resolves next week. All of them reward following the threads as they develop, and the permalinks above will take you straight to the source.

If this is your first issue, a note on method: everything above links to the public Apache dev list archives, and every claim traces to a thread you can read yourself. The dev lists are where the real decisions happen, before the blog posts and the conference talks. Subscribing to even one of them changes how you understand this ecosystem.

Resources & Further Learning

Get Started with Dremio

Try Dremio Free: Build your lakehouse on Iceberg with a free trial
Build a Lakehouse with Iceberg, Parquet, Polaris & Arrow: Learn how Dremio brings the open lakehouse stack together

Free Downloads

Apache Iceberg: The Definitive Guide: O'Reilly book, free download
Apache Polaris: The Definitive Guide: O'Reilly book, free download

Books by Alex Merced

Architecting an Apache Iceberg Lakehouse
Enabling Agentic Analytics with Apache Iceberg and Dremio
The 2026 Guide to Lakehouses, Apache Iceberg and Agentic AI
The Book on Using Apache Iceberg with Python
Browse the full catalog of 50+ books at books.alexmerced.com

When Gatekeepers Panic: The Encyclopédie, Open AI Models, and the Politics of Accessible Knowledge

Alex Merced — Sat, 18 Jul 2026 16:30:20 +0000

In 1759, Pope Clement XIII ordered the owners of a book to hand their copies to a priest for burning. The penalty for refusal was excommunication. That same year, King Louis XV of France banned the book outright. The offending work was not a heresy tract or a revolutionary pamphlet. It was an encyclopedia.

In 2026, lawmakers across 45 American states have introduced more than 1,500 bills aimed at artificial intelligence. Executives at the largest AI labs argue in front of Congress that freely downloadable models pose risks the public cannot handle. Lobbyists push agencies to issue guidance that scares enterprises away from open alternatives. The offending technology is not a weapon. It is a tool that answers questions.

These two moments sit 275 years apart. The technology changed. The argument did not. In both cases, powerful institutions faced a tool that put knowledge directly into the hands of ordinary people. In both cases, those institutions reached for the same playbook: warn of danger, demand licensing, restrict distribution, and protect the intermediary's seat at the table.

This article walks through the history of the fight over Diderot's Encyclopédie, maps it against today's fight over AI regulation and open weight models, and asks what the comparison reveals about how societies absorb disruptive knowledge. It then takes on a harder question. Innovation now moves faster than it did in the 18th century. Does that speed break the historical pattern, or does the nature of this technology give society a new way to keep up? I will argue the second. The tool causing the disruption is, for the first time in history, the same tool people can use to adapt to it.

The Most Dangerous Book in France

The Encyclopédie, ou Dictionnaire raisonné des sciences, des arts et des métiers, began as a modest translation project. French publisher André Le Breton wanted a French version of Ephraim Chambers' English Cyclopaedia. He hired Denis Diderot, a broke translator and philosopher, to run it. Diderot had bigger ideas. He recruited the mathematician Jean le Rond d'Alembert as co-editor and expanded the plan into something without precedent: a complete survey of human knowledge, written by more than 140 contributors, spanning science, philosophy, politics, religion, and the manual trades.

Publication ran from 1751 to 1772. The finished work filled 28 volumes, with 17 volumes of text and 11 volumes of engraved plates. Contributors included Voltaire, Rousseau, and Montesquieu. The entry count passed 60,000. Nothing on this scale had existed before.

Two design choices made the project explosive. The first was its treatment of the trades. Diderot sent writers into workshops to document how glassmakers, weavers, printers, and metalworkers actually did their work. The plates illustrated tools, techniques, and processes that guilds had guarded for centuries. Craft knowledge that took a seven-year apprenticeship to access now sat on a page anyone with the subscription price and reading ability was free to study.

The second choice was structural. The editors organized knowledge by reason rather than by revelation. Theology appeared as one branch of philosophy among others, not as the queen of the sciences sitting above the rest. Cross-references linked orthodox entries to skeptical ones. An article on a religious doctrine, written with perfect apparent respect, pointed the reader to another article that quietly dismantled the doctrine's logic. The encyclopedists were not just cataloging knowledge. They were rearranging it, and the new arrangement demoted the institutions that had spent centuries at the top of the old one.

The Machinery of Suppression

The reaction came fast and arrived in waves.

In 1752, months after the second volume appeared, the Jesuits demanded condemnation. The trigger was a theology thesis by the abbé de Prades, a contributor whose ideas echoed d'Alembert's Preliminary Discourse. The King's Council responded by banning possession of the first two volumes. The philosophy had penetrated the citadel of orthodox theology, and the authorities panicked.

The ban lasted three months. Madame de Pompadour, the king's mistress, and Malesherbes, the royal official in charge of the book trade, intervened to let publication resume. This detail matters. The Encyclopédie survived its first execution order through protection from sympathetic insiders within the very state that condemned it. Suppression was never a unified front. It was a faction fight inside the establishment.

The attacks continued through the 1750s. Religious critics published pamphlet after pamphlet. Contributors resigned under pressure. D'Alembert himself abandoned the project after facing threats of imprisonment. In 1759, the storm peaked. Louis XV issued a permanent ban with only seven volumes published. Weeks later, Pope Clement XIII placed the work on the Index of Forbidden Books and issued the burning order backed by excommunication.

Here is the remarkable part. The book kept coming. Diderot and Le Breton continued production in secret. The plate volumes were exempt from the ban, so those shipped openly. The remaining text volumes were printed clandestinely and distributed with a false imprint claiming publication in Neuchâtel. Subscribers, including many nobles and clergy, kept their copies. Few owners obeyed the burning order, since the set represented an enormous financial investment. The state knew the work continued and largely looked away, again through the quiet protection of officials like Malesherbes.

One more betrayal completed the story. Le Breton, terrified of prosecution, secretly censored dozens of articles before printing, cutting passages he judged too dangerous. Diderot discovered the sabotage years later and was devastated. Even the publisher of the most subversive project in Europe hedged his bets against the censors.

What the Gatekeepers Actually Feared

Read the condemnations closely and a pattern emerges. The objection was almost never that the information itself was false. The objection was that the wrong people now had access to it, without a mediator.

The Catholic Church of the 18th century did not oppose knowledge. It operated universities and produced serious scholarship. What it opposed was unmediated knowledge. For centuries, the Church controlled the interpretive layer between text and reader. The Index of Forbidden Books, the imprimatur system, and pre-publication censorship all existed to keep an approved authority between ordinary people and dangerous ideas. The Encyclopédie deleted that layer. It handed the reader the raw material and a method, reason, for processing it independently.

The French crown had a parallel concern. The Encyclopédie questioned the divine right of kings and defined limits on all power. A population that reasons about the legitimacy of authority is harder to rule than a population that accepts it. The monarchy understood, correctly as it turned out, that the work was training its readers to become citizens rather than subjects. Historians widely credit the Encyclopédie with shaping the ideas that fed the French Revolution.

The guilds had the most concrete grievance. Their power rested on artificial scarcity of technical knowledge. The apprenticeship system was a licensing regime. Publish the techniques, and the license loses value. The Encyclopédie's trade plates were an open weights release for the 18th-century economy.

Three different institutions, three different fears, one common thread. Each had built its position on being the necessary intermediary for some category of knowledge. Each correctly perceived that a comprehensive, accessible, mass-produced reference work made the intermediary optional. The safety arguments they offered in public, protecting souls, protecting order, protecting quality, were real to the people making them. The interest behind the arguments was self-preservation.

The Modern Panic: Politicians and the AI Bill Flood

Now shift to the present. The numbers tell the story of an institutional reaction gathering speed.

In 2023, American state legislatures introduced fewer than 200 bills addressing artificial intelligence. In 2024, the count passed 600, with nearly 100 enacted. In 2025, all 50 states introduced AI bills for the first time, 1,208 in total, with 145 becoming law. By March 2026, lawmakers in 45 states had introduced 1,561 more, surpassing the entire 2024 total before most sessions even finished. Congress, meanwhile, has passed exactly one AI-specific federal law, the Take It Down Act covering nonconsensual deepfake imagery.

The bills cover algorithmic discrimination, hiring decisions, chatbot safety for minors, deepfakes, insurance underwriting, and dozens of other categories. Some address genuine, documented harms. Nonconsensual intimate imagery is a real injury with real victims. Algorithmic discrimination in lending and hiring has a real evidentiary record. Child safety in companion chatbots responds to real tragedies. Nothing in the historical parallel excuses harm or argues against accountability for it.

But the volume and shape of the legislative wave reveals something beyond harm response. Much of it is jurisdictional struggle. In 2025, the U.S. Senate voted 99 to 1 to strip a proposed 10-year federal moratorium on state AI laws from a budget bill. In December 2025, the White House signed an executive order creating a litigation task force to sue states over AI laws deemed inconsistent with federal policy, and threatened to withhold billions in broadband funding from states that refused to repeal them. In response, 36 state attorneys general from both parties sent Congress a joint letter telling the federal government to stay out of their lane. Six months into 2026, states had enacted 109 AI laws in open defiance of the preemption campaign.

This is not a debate about whether AI is dangerous. It is a fight over who gets to be the gatekeeper. The federal government, the states, and the industry each want the licensing pen in their own hand. Versailles and Rome ran the same contest in the 1750s. The crown banned the Encyclopédie, then royal officials protected it. The Church condemned it, then clergy subscribed to it. Authority spoke with many voices then, and it speaks with many voices now. The one position with no organized lobby in either century is the position that no license is needed at all.

The Sharper Parallel: Proprietary Vendors Against Open Models

The political pushback is the broad parallel. The precise parallel, the one that matches the Encyclopédie fight almost beat for beat, is the campaign by proprietary AI vendors against open weight models.

Definitions first. A proprietary model is one you access through a paid API. The weights, the trained parameters that constitute the model itself, stay on the vendor's servers. An open weight model publishes those parameters for anyone to download, inspect, modify, fine-tune, and run on their own hardware. Llama, Mistral, Qwen, DeepSeek, and gpt-oss are open weight releases. The frontier offerings from the major labs are not.

The commercial stakes are plain. Closed vendors sell metered access. Every token generated is billed. The business model requires the customer to keep coming back to the vendor's servers. An open model, once downloaded, generates unlimited tokens at the cost of electricity. If open models stay close to the frontier in capability, the economic gravity pulls enterprise workloads toward the open stack. Box CEO Aaron Levie described the stakes bluntly in 2026: with open weights a close second in intelligence, the closed vendor keeps the frontier market but loses the vast majority of token volume to a stack someone else controls and monetizes.

Faced with this threat, the closed labs have not primarily responded by out-competing on price. They have responded by arguing that openness itself is the danger. The public case runs as follows. Released weights let bad actors strip out safety guardrails. Open models are unaccountable, since no company stands behind the output. Foreign open models, particularly Chinese ones, carry hidden risks and legal obligations to hostile intelligence services. The distribution of frontier capability to anyone with a GPU constitutes an extreme risk to the public.

Some of these concerns describe real technical facts. Fine-tuning does remove refusal behavior. Attribution is harder for open models. The question is not whether the facts are true. The question is what policy the facts are being used to justify, and who benefits from that policy.

Watch what the labs ask for. California's SB 1047 proposed liability and shutdown requirements that open developers, who by definition cannot recall or shut down a downloaded model, structurally cannot meet. Policy analyst Dean Ball described the current lobbying recipe with unusual candor: you do not need to ban open source, you just need every agency to issue soft guidance about backdoors and risks until every regulated enterprise backs off. Regulatory risk does the work a ban cannot. Open source advocates have made the mirror observation for years. Costly compliance requirements, safety audits, and liability frameworks are trivial expenses for a lab valued in the hundreds of billions and fatal to a community project. Regulation calibrated to the resources of the largest incumbents is a moat with a public safety label on it.

AI researcher Nathan Lambert, one of the most careful observers of the open ecosystem, wrote in July 2026 that open models face the most serious test of their viability to date, with talking points converging on a potential ban within six months. He noted the asymmetry directly. Closed models are easier to secure, and closed model companies run far more effective lobbying operations. Any government review process for model releases moves slower for open models than closed ones, compounding the disadvantage over time.

Now line this up against 1759.

The Paris book guild held royal printing privileges, exclusive licenses that made publishing legal for members and illegal for everyone else. The Church held the imprimatur, the pre-publication stamp declaring a work safe to read. Both systems were justified in the language of public protection, guarding readers from error, heresy, and sedition. Both systems, in practice, protected the revenue and authority of the license holders. The Encyclopédie threatened the guilds' economic model and the Church's interpretive monopoly at the same time, and both institutions reached for the state to suppress it rather than compete with it.

The closed labs today hold the modern equivalent of the privilege: capital, compute, and political access. The safety case they present is the modern imprimatur, the claim that intelligence is only safe when it passes through an approved intermediary. The request to government is the same request the guild made to the crown. Do not make us compete with the open version. Make the open version illegal, or failing that, make it frightening.

One irony deserves its own paragraph. The labs argue that Chinese open models are dangerous vehicles of foreign influence. The observable response of Chinese labs has been to keep releasing capable open models that developers worldwide adopt, building soft power and ecosystem dependence the way the dollar built financial dependence. The strategic answer to a rival's open models is your own open models, not advisory bulletins. France learned a version of this lesson. The Encyclopédie, banned at home, was reprinted in cheaper editions across Europe, and the ideas conquered France anyway. Suppression did not stop the knowledge. It only moved the printing presses across the border and forfeited the influence that came with hosting them.

The Diffusion Economics Nobody Stopped

One more chapter of the Encyclopédie story deserves attention, since it predicts the endgame of the current fight. The original folio edition was a luxury product. A full subscription cost roughly the annual income of a skilled worker. The bans of 1759 targeted this expensive, traceable, subscriber-listed edition, and the censors counted the containment a success.

Then the price collapsed. Publishers outside French control, in Geneva, Neuchâtel, and Lausanne, issued cheaper quarto and octavo reprints in the 1770s. The historian Robert Darnton traced the numbers in his study of the trade. Around 25,000 sets of the Encyclopédie circulated in Europe before 1789, and the cheap editions sold most of them, reaching lawyers, doctors, merchants, and provincial administrators far below the original subscriber class. The banned book became a bestseller in the very country that banned it, smuggled across the border in bales. The censorship regime raised the price of access for a decade and a half. It changed the destination of the profits from Paris to Switzerland. It stopped nothing.

The AI version of the cheap quarto edition already exists, and it arrived through the same mechanism: producers outside the incumbents' jurisdiction who noticed the demand. DeepSeek trained frontier-adjacent models for a reported fraction of American budgets and released the weights. Qwen, Kimi, and GLM followed on aggressive cadences. Distillation, the practice of training a small model on the outputs of a large one, compresses frontier capability into packages that run on consumer hardware, exactly the way octavo printing compressed 28 folio volumes into something a country lawyer's shelf held. The closed labs call distillation theft, an accusation with real legal substance and limited practical force, and the same tone the Paris guild took toward the Swiss printers. Nathan Lambert notes that an open weight model reaching top-tier closed capability is now inevitable, and that this inevitability, more than any specific harm, is what drives the regulatory push.

The economics ran one direction in the 1770s and run the same direction now. When capability exists, price falls toward the cost of reproduction. The cost of reproducing a downloaded model rounds to zero. Regulation raises the price of access temporarily, relocates the suppliers permanently, and hands the influence that comes with supplying the world to whoever declines to regulate. France funded the Swiss publishing industry with its censorship. The parallel question for American policy writes itself.

The Anatomy of Gatekeeper Arguments

Set the two episodes side by side and the recurring arguments sort into four families. Each family appeared in the 1750s and reappears today, translated into modern vocabulary.

The first family is the wrong hands argument. In the 18th century: ordinary readers lack the training to handle theological and political ideas safely, so an authority must filter what reaches them. Today: ordinary users lack the judgment to handle unrestricted model capability safely, so an approved lab must filter what the model says and who runs it. The structure is identical. Capability is fine, but only when held by the credentialed.

The second family is the accountability argument. Then: anonymous and foreign presses spread error with no one to punish, so all legal printing must flow through licensed guild members. Now: downloaded weights spread harm with no company to sue, so legitimate AI must flow through vendors who log, moderate, and answer subpoenas. Notice what the argument quietly assumes in both eras. It assumes accountability means a chokepoint, a single throat to choke. Distributed accountability, where users answer for their own use the way readers answered for their own sedition, never counts.

The third family is the social order argument. Then: reasoning individually about religion and monarchy dissolves the bonds of society. Now: synthetic media, algorithmic persuasion, and machine-generated content dissolve shared truth and democratic stability. This family contains the most substance in both eras. Print did destabilize Europe. The pamphlet wars were real, and the Revolution that followed the Encyclopédie was not bloodless. Honest analysis has to grant that the gatekeepers' predictions of turbulence were partially correct. Their prescription, permanent mediation by themselves, still failed, and the societies that absorbed the turbulence outperformed the ones that delayed it.

The fourth family is the quality argument. Then: unlicensed printing produces corrupted texts and error. Now: open models hallucinate, carry biases, and lack the safety tuning of managed services. True in both cases, and beside the point in both cases. Quality problems are competitive claims dressed as prohibition claims. If the licensed product is better, the license holder wins in the market without the ban. The demand for prohibition is itself evidence that the incumbent expects to lose a fair fight.

Sorting arguments this way provides a practical filter for evaluating any pushback against a knowledge technology. Ask one question of each claim. Does this argument identify a specific harm to a specific victim, or does it defend the necessity of an intermediary? Deepfake abuse of a real person is the first kind. A general assertion that open weights endanger the public is the second kind. History treats the two very differently. The first kind produced durable, targeted law, defamation, fraud, and obscenity statutes that survived centuries. The second kind produced the Index of Forbidden Books, which collapsed under its own irrelevance and stands today as a monument to institutional fear.

Where the Analogy Breaks, and Where It Holds

Every historical analogy has limits, and pretending otherwise weakens the argument. Three differences between the Encyclopédie and AI deserve honest treatment.

First, agency. A book informs a reader who then acts. A model acts. Agentic systems execute code, move money, send messages, and operate tools. The gap between reading about a harm and performing one shrinks toward zero. This difference is real, and it justifies real policy in narrow domains: biosecurity screening, financial controls, critical infrastructure protections. The Encyclopédie parallel does not argue against those. It argues against treating the general capability, intelligence on demand, as the thing to license.

Second, scale and speed of individual harm. A determined 18th-century reader needed years to turn dangerous knowledge into dangerous action. A model compresses research time for attackers and defenders alike. The security field calls this the offense-defense balance, and researchers genuinely dispute where it lands. Disclosure helps attackers who lack knowledge and helps defenders find and fix vulnerabilities. Four centuries of experience with the printing press, and eight decades with academic cryptography, ended up favoring disclosure. AI evidence to date points the same direction, but the question stays empirical, and honest advocates of openness track it rather than assume it.

Third, concentration of production. Anyone with a press printed books. Only a handful of organizations train frontier models, since training runs cost hundreds of millions of dollars. This concentration cuts both ways. It makes chokepoint regulation more feasible than it ever was for print, which tempts regulators. It makes the case for open distribution stronger, since distribution is the only stage where broad participation is even possible. When production is an oligopoly, closing distribution completes the monopoly.

Against these three differences stands one overwhelming similarity, and it decides which lesson applies. In both episodes, the loudest institutional voices calling for restriction are the direct economic and political beneficiaries of the restriction. When the referee and the competitor are the same entity, the historical base rate says discount the safety testimony heavily. The Church contained sincere believers in the danger of unmediated scripture. The sincerity did not make the Index good policy, and it did not stop the Index from functioning as a market protection scheme for approved publishers. Sincere safety belief and self-serving policy coexist comfortably in the same institution. They did in 1759. Nothing about 2026 repeals that fact of institutional behavior.

The Speed Objection

Here is the strongest version of the case against relying on the historical pattern. Society absorbed print over generations. Literacy in France climbed slowly across the 18th and 19th centuries. Schools, libraries, newspapers, and professional norms grew up around the printed word over a century and a half. The turbulence in between included revolutions and wars of religion. The adaptation succeeded, but the timeline was long and the bill was steep.

AI grants no such timeline. Capabilities that took print three centuries to distribute have spread in three years. ChatGPT reached a hundred million users faster than any consumer product in history. Model capability doubles on a cadence measured in months. Labor markets, school curricula, court systems, and legislatures operate on cadences measured in years and decades. The gap between the speed of the technology and the speed of the institutions is wider than it has ever been for any prior disruption. On this view, the Encyclopédie precedent is a false comfort. The pattern of successful absorption held when society had slack time. The slack is gone, so the pattern breaks, and heavier restriction is the only brake available.

This objection deserves a direct answer rather than a dismissal. The answer has three parts.

The Tool of Disruption Is the Tool of Adaptation

The first part of the answer is the central claim of this article. Every prior disruptive knowledge technology had a hard separation between the disruption and the means of adapting to it. The printing press flooded Europe with text, but the press itself taught no one to read. Adaptation required a completely separate infrastructure, schools and tutors and decades of childhood instruction, built at enormous cost on a timeline the press did nothing to shorten. The gap between disruption speed and adaptation speed was structural. The technology pushed, and society had to build its own capacity to push back, from scratch, with older tools.

AI is the first knowledge technology in history where this separation does not exist. The model that disrupts your job explains itself to you in plain language. The system that automates a workflow teaches you to build the next workflow. A displaced bookkeeper in 1990 needed years of retraining through institutions that had waiting lists and tuition bills. A displaced bookkeeper in 2026 asks the disrupting technology itself to teach her Python, draft her business plan, review her contract, and debug her first automation, at conversational speed, for a subscription fee or for free through an open model on her own laptop.

Consider what this does to the arithmetic of adaptation. Adaptation lag has always equaled the time to build separate adaptive infrastructure. When the technology is its own adaptive infrastructure, the lag collapses toward the time it takes an individual to start asking questions. The literacy barrier that gated the Encyclopédie for a century does not gate AI at all. The models speak every major language, read to the illiterate, translate for the foreigner, and simplify for the novice. Print demanded that humanity climb up to the text. AI climbs down to the human.

The empirical record so far supports this. The fastest, deepest adopters of AI in the workforce are not the credentialed elite defended by licensing regimes. Survey after survey finds heavy usage among students, freelancers, small business owners, and workers in routine jobs, exactly the populations that adapt slowest under every previous disruption, since they have the least access to formal retraining. The adaptation tool reached them first this time, and it reached them through the open and cheap end of the market, not the enterprise end.

Now trace the policy implication, and watch it invert the safety argument. If the technology is the primary means of adapting to the technology, then restricting access does not slow the disruption. Enterprises, governments, and well-funded actors keep their access through approved channels regardless. Restriction slows the adaptation, and it slows it selectively for the people with the fewest alternatives. A regime that gates capable models behind enterprise contracts and compliance regimes takes the adaptation tool away from the displaced worker and leaves it in the hands of the employer doing the displacing. That is not safety. That is the guild system rebuilt in software, and it produces the exact outcome the 18th-century guilds produced, protected incumbents and a locked-out public.

This is why the open model fight is not a side quarrel about developer preferences. Open weights are the guarantee that the adaptation channel stays open when the licensed channels tighten. A downloaded model cannot be repriced, geofenced, deprecated, or lobotomized by a vendor's compliance department. For the individual adapting to a fast economy, that permanence is the difference between owning your tools and renting them from the institution disrupting you.

Institutions Adapt Slower Than Individuals, and That Is Survivable

The second part of the answer concedes the strongest point of the speed objection and reframes it. Individuals can now adapt at conversational speed. Institutions cannot. Courts, schools, licensing boards, and legislatures move at deliberative speed by design. The mismatch is real. The question is what follows from it.

The 18th century ran this experiment too, with the roles cast the same way. Individual readers absorbed the Encyclopédie within a subscription cycle. French institutions took four decades and a revolution to adjust. The countries that fared best were not the ones whose institutions moved fastest to restrict. They were the ones whose institutions restricted least. They kept the widest channel open for individual adaptation until the formal structures caught up. The Dutch printed what France banned and captured the publishing economy. Britain, with the loosest censorship in Europe, absorbed radical print culture without a revolution. The turbulence correlated with suppression, not with openness. Institutions that fought the adaptation of their own citizens converted a manageable adjustment into a rupture.

The lesson transfers cleanly. Institutional lag is survivable when individuals are free to adapt ahead of the institutions. It becomes catastrophic when institutions use their lag as a reason to hold individuals back to institutional speed. The 1,561 state bills of early 2026 are not all equal on this test. Bills that target specific harms, deepfake abuse, discriminatory decisions, child safety, let individual adaptation proceed and clean up genuine damage. Bills that gate model access, mandate approval regimes, or impose liability structures only incumbents can carry, hold the public to the speed of the slowest regulator. The first category is institutions doing their job. The second is institutions doing the guilds' job.

The Legitimate Core of the Fear

The third part of the answer marks the boundary of the argument, since an honest article must. The claim is not that every adaptation happens smoothly for every person. Three groups face genuine trouble that the self-service adaptation story does not fix on its own.

Workers whose entire occupational category compresses face more than a reskilling problem. They face an income bridge problem during the transition, and no chatbot pays rent. This is a policy gap with real solutions in labor economics, portable benefits, wage insurance, and transition support, none of which require restricting the technology, and all of which the restriction debate crowds out of the conversation.

People outside the digital economy entirely, without devices, connectivity, or baseline digital comfort, cannot ask the model anything. The adaptation tool reaches them only through deliberate distribution, public access points, and community institutions. This is the modern version of the literacy campaigns that print eventually demanded, with the difference that the campaign now takes years instead of generations.

And targets of malicious use, fraud victims, harassment victims, people impersonated by synthetic media, are not adaptation cases at all. They are harm cases, and they justify the targeted, victim-specific law that history validates.

Marking these boundaries strengthens rather than weakens the openness case. The gatekeeper playbook works by laundering these narrow, addressable problems into a general indictment of accessible capability. Separating them out exposes the remainder of the restriction agenda for what it is.

What the Pattern Says About Societal Evolution

Step back from both episodes and a model of how societies process knowledge shocks comes into focus. The sequence runs in five stages, and it has now run at least four times, with scripture in the vernacular, with the press generally, with the Encyclopédie, and with the internet.

Stage one: a technology drops the cost of accessing some category of knowledge by an order of magnitude or more.

Stage two: the institutions whose position depended on the old cost structure raise the alarm. The alarm is always framed as protection of the public and never as protection of the position. This framing is not simple cynicism. The institutions genuinely believe both things at once, and the belief is exactly what makes the framing persuasive.

Stage three: formal suppression is attempted and partially succeeds in the short run. Volumes get banned, presses get licensed, bills get passed. The suppression works well enough to reassure the incumbents and never well enough to stop the diffusion. The knowledge routes around, through Neuchâtel imprints then, through open weight mirrors and international labs now.

Stage four: a sorting occurs among jurisdictions. Some double down on restriction and export their innovators, their industries, and eventually their influence. Others absorb the turbulence, keep the channels open, and collect the compounding returns. The Dutch republic collected them in the 1760s. The open question of the 2020s is who collects them now, and the early signs point to whichever bloc ships the models everyone else builds on.

Stage five: the restriction apparatus, deprived of function, calcifies into an embarrassment and is quietly retired. The Index of Forbidden Books survived until 1966, long past the point anyone consulted it. Its final catalog entries sit in archives as a list of the books that mattered most.

The stages compress with each iteration. The Encyclopédie cycle took about 40 years from first ban to functional irrelevance of the ban. The internet cycle, from the Communications Decency Act panic to broad institutional accommodation, took about 15. The AI cycle is running the early stages in months. The 2023 alarm, the 2024 legislative surge, the 2025 preemption war, and the 2026 open model fight map onto stages two and three with almost mechanical fidelity. Compression of the cycle is itself evidence for the adaptation thesis. Each round, the population enters the next shock already holding the tools and the memory of the last one.

The deepest conclusion the pattern supports is about where societal resilience actually lives. The gatekeeper worldview locates resilience in institutions and treats the public as the fragile element needing shelter. The record locates resilience in the distributed public and treats institutional monopoly as the fragile element. Societies did not survive print by keeping it scarce. They survived it and then thrived on it by letting hundreds of millions of individual adaptations accumulate into new institutions, journalism, public education, and modern science among them, that no censor planned and no incumbent wanted. Every one of those institutions began as the unregulated, alarming behavior of ordinary people with new access to knowledge.

That is the bet on the table again. Betting on the public has paid out every time it has been placed. Betting on the gatekeepers has never once protected what it promised to protect, and it has forfeited the compounding returns every time. The pace is faster now. The bet is the same. And for the first time, the public walks into the disruption holding the most capable adaptation tool ever built, one that answers questions at midnight, speaks every language, and, in its open form, belongs to whoever downloads it.

Diderot stated his goal as gathering the knowledge scattered across the surface of the earth, so that the work of past centuries stays useful to the centuries to come. The kings and popes who burned his volumes are footnotes in the story of his book. The institutions demanding a licensing regime for intelligence should read that story carefully. They are not the first incumbents to mistake their own necessity for the public's safety. The record suggests they will not be the last, and it tells us exactly how the story ends.

Go Deeper on AI and the Future of Work

The labor transition is the piece of this story that deserves book-length treatment, and I wrote that book. It covers how AI reshapes labor economics, which jobs transform versus disappear, how workers and businesses position themselves, and what policy actually helps rather than performs. If this article's argument about adaptation resonated, the book is the full framework behind it.

Read my book on AI and labor economics here

Alex Merced is Head of Developer Relations at Dremio (SAP Business Data Cloud), co-author of Apache Iceberg: The Definitive Guide and Apache Polaris: The Definitive Guide, and author of Architecting an Apache Iceberg Lakehouse. Find his full catalog of books at books.alexmerced.com.

Deterministic Data Engineering With AI Harnesses: Using Claude Code, Codex, Antigravity, and OpenCode for Data Work You Can Actually Trust

Alex Merced — Sat, 18 Jul 2026 16:19:40 +0000

There is an apparent contradiction at the heart of using AI agents for data work, and resolving it properly is worth an entire article, because the teams that resolve it are quietly getting enormous value while the teams that do not are generating incidents.

The contradiction: data engineering and analytics are disciplines built on determinism. The pipeline must produce the same output from the same input every run. The revenue number must reproduce, to the penny, when the auditor asks. The metric must mean the same thing on every dashboard. Meanwhile, the most powerful new tools in a data professional's kit, the agentic coding harnesses, Claude Code, OpenAI's Codex, Google's Antigravity line, OpenCode, and their peers, are built on language models, which are probabilistic by nature: ask twice, get two phrasings, sometimes two approaches, occasionally two answers.

The resolution is not to avoid the tools, and it is not to hope the models stop being stochastic. It is an architectural principle, old as software and newly urgent: use the agent to author deterministic artifacts, and let the artifacts do the work. The model's creativity lives at development time, where variance is cheap and review catches error. The runtime path, the thing that actually touches your data every night, is code: versioned, tested, reviewed, reproducible, exactly as deterministic as it ever was. Get this boundary right and the harnesses become the largest productivity gain data teams have seen in a decade. Get it wrong, agents improvising in the runtime path, numbers with no provenance, and you have built a very fast way to lose the business's trust.

This article is the full playbook: the artifact-first principle and the determinism ladder that operationalizes it, honest working profiles of the four harnesses named above as data tools specifically, the catalog of techniques, tests as contracts, dry-run gates, schema pinning, semantic layers, golden datasets, that make agent-assisted data work reproducible, the workflow patterns for pipelines, migrations, quality investigations, and analytics, and the anti-patterns that generate the incidents. My biases declared: I work at Dremio, whose MCP server is one of the governed doors these agents can knock on, and Claude Code is my personal daily driver, both of which I will weigh against fairness to the whole field.

The Principle: Agents Author, Artifacts Execute

State the core idea precisely, because everything else derives from it.

A language model invoked at runtime is a nondeterministic component in your data path: same question, potentially different SQL, different approach, different number. A language model invoked at development time is something else entirely: a collaborator producing an artifact, a SQL file, a dbt model, a pipeline script, a test suite, that is then frozen in version control, reviewed like any code, validated by tests, and executed by deterministic engines forever after. The variance happened, and it happened where variance belongs: before the merge, under review, against tests. After the merge, the pipeline is exactly as deterministic as one written by hand, because it is code, and code does not care who typed it.

This is why the coding harnesses, rather than chat interfaces, are the right tools for serious data work: they are built for the artifact workflow. They live in repositories, edit real files, run real commands, execute the tests, and produce diffs and pull requests, which means their entire interaction model already routes the model's output through the deterministic machinery, version control, CI, review, that data engineering trusts. The chat window asks you to copy-paste its suggestion into your world. The harness works inside your world, where the guardrails are.

One clarification before the ladder, because it prevents a common confusion: this principle does not forbid agents from ever running queries. Exploration, an agent reading schemas, sampling data, running diagnostic queries to understand a problem, is legitimate and valuable, and it is read-only reconnaissance in service of authoring. The line the principle draws is at the runtime path and at the numbers the business consumes: what executes on schedule is committed code, and what lands on a dashboard traces to a governed definition, never to an agent's improvisation that morning.

The Determinism Ladder: Five Levels of Trust

Teams adopt these tools along a maturity curve, and naming its levels gives you both a map and a diagnostic.

Level zero is chat-and-paste: asking a model questions about data and transcribing answers. No provenance, no reproducibility, no place in professional work beyond brainstorming, and worth naming only because plenty of organizations are unknowingly running level-zero "analytics" today.

Level one is the improvising agent: a harness connected to the warehouse, running ad hoc queries and reporting findings conversationally. Genuinely useful for exploration and incident diagnosis, dangerous the moment its outputs are treated as answers rather than leads, because the query that produced the number exists only in a session log, if there.

Level two is artifact authorship with human review: the agent writes the pipeline, the model, the query, as files in a branch, a human reviews the diff, and the merged artifact enters the deterministic estate. This is the level where real value begins, and where most productive teams operate today.

Level three adds automated validation: the artifacts carry tests, schema contracts, data quality checks, dry-run gates, and the harness itself runs them in its loop, hooks firing linters and test suites on every edit, CI enforcing them on every commit, so that the human review at level two is spent on intent and design rather than syntax and correctness. This is where the harnesses' machinery, hooks, headless modes, sandboxes, earns its keep, and where this article aims you.

And level four is the maintained estate: scheduled, headless agent runs that watch the deterministic estate and propose changes through the same gated path, the nightly job that triages its own failure and opens a pull request with the fix, the weekly run that refreshes documentation from schema changes, the agent as a tireless junior engineer whose every action still lands as a reviewable artifact. Level four is not futurism, the headless modes make it a cron entry, and its entire safety rests on the discipline of the levels beneath it: the agent proposes, the pipeline of tests and review disposes.

The diagnostic use of the ladder: when an agent-related data incident happens, it is almost always a level violation, level-one improvisation being consumed as if it were level-three truth, and the fix is almost never "ban the tools." It is "climb the ladder."

The Four Harnesses as Data Tools

Now the tools themselves, profiled specifically for data work, because their general coding reputations transfer imperfectly and their configuration for our discipline is where the value hides.

Claude Code is, in my experience and wide practitioner consensus, the deepest fit for the level-three workflow, for three specific reasons. Its hook system is the enforcement point determinism wants: hooks that run your SQL linter on every file edit, execute the relevant dbt tests after every model change, and block any command matching forbidden patterns, policy as machinery rather than hope. Its skills system packages your team's data conventions, naming standards, testing requirements, approved patterns for incremental models, as loadable procedures the agent follows reliably, which is how tribal knowledge becomes enforced knowledge. And its permission model is granular enough to encode the read-versus-write line this article lives on: read-only database access allowed silently, anything touching a write path gated behind approval. Add subagents, one exploring a gnarly schema in its own context while the lead builds, MCP connectivity to warehouses, catalogs, and lakehouse platforms including my employer's, and a headless mode that slots into CI, and the pieces of the deterministic workflow are all first-party. The costs: proprietary, Claude-only, subscription economics, and the general caveat that its depth rewards configuration investment, a team that never writes hooks or skills is buying a fraction of the tool.

OpenAI's Codex brings two distinctive strengths to data work. Its sandboxing is the best default story in the field, OS-level confinement without container ceremony, which matters more in our discipline than most, because "the agent ran a script" should never be one typo from touching production data, and Codex's tiered approvals make the escalation from sandboxed experiment to real execution an explicit, auditable act. And its cloud task mode, delegating work to managed sandboxes that return pull requests, fits the artifact principle natively: the deliverable arrives as a reviewable diff by construction. Its benchmark-leading task completion translates well to the bounded, verifiable tasks data work abounds in, write the migration, make the tests pass, and its MCP support connects it to the same governed data doors. The costs mirror its rival's: the OpenAI ecosystem assumption, and a harness whose configuration culture is younger than its capability.

Google's Antigravity line enters data work with a different center of gravity. Its lineage, succeeding the enormously popular Gemini CLI after this June's consolidation, carries forward the trait data people prized most: enormous context windows, which are not a luxury in our discipline but a working requirement when the task is "understand this four-hundred-table schema and its lineage before touching anything." Wide-schema comprehension, cross-file refactors over sprawling SQL estates, and migration work that must hold two dialects in mind at once are where the long-context advantage is tangible. The ecosystem adjacency, Google Cloud's data stack, BigQuery, Workspace surfaces where analytical outputs often land, makes it a natural fit for shops already in that gravity well. The honest caveats: the platform transition is recent, the free-tier era that drove its predecessor's adoption ended with it, and teams should verify the current terms, tooling, and MCP posture against their needs rather than assuming continuity with the tool they remember.

OpenCode is the open answer, and for data teams it carries two arguments the others cannot. Provider freedom, seventy-five-plus backends including fully local models, is not just economics: for organizations whose data governance forbids schema details or query patterns from leaving the building, a capable harness driving a local model is the difference between adopting these workflows and watching them from outside. And its plan-versus-build agent design, a read-only planning agent distinct from the full-access builder, maps beautifully onto this article's central line: exploration and authoring as separated modes with separated permissions. Open source under MIT, a polished terminal experience, LSP-grade code intelligence, and no vendor's roadmap between you and your workflow. The costs are the open-source classics: assembly required, configuration culture over convention, and the model you bring determines the ceiling, which for the hardest multi-file data refactors still favors the frontier models the commercial harnesses bundle.

The meta-guidance across all four: the harness choice matters less than the configuration discipline, and every one of them can run the level-three workflow. Pick by ecosystem and constraints, then invest in the setup, because an unconfigured frontier harness loses to a well-configured modest one on determinism every single time.

The Techniques Catalog: Making It Reproducible

Here is the toolbox, the specific practices that convert agent-assisted data work from plausible to reproducible, each stated with its mechanism.

Version control is the foundation, totalized. Every artifact the agent touches, SQL, pipeline code, dbt models, configuration, the instruction files that shape the agent itself, lives in git, and every agent session works on a branch. This is not ceremony: the branch-and-diff discipline is what makes the model's variance harmless, because variance that arrives as a reviewable diff is a proposal, and variance that arrives as an executed change is an incident.

Tests are the contract, and the agent runs them. Data tests, dbt tests, quality suites, schema assertions, are the objective function that replaces "looks right" with "is right": uniqueness, referential integrity, accepted ranges, row-count expectations, reconciliation against known totals. The workflow discipline is to make the agent write tests alongside every artifact and run them in its loop, via hooks or explicit instruction, so the agent iterates against truth rather than against its own confidence. A model's SQL is a hypothesis. A passing test suite is a fact.

Dry-run and staging gates keep hypotheses off production. Every serious data stack offers a rehearsal mode, compile-only runs, EXPLAIN plans, execution against staging schemas or table clones, write-audit-publish patterns on the lakehouse side, and the agent's instructions should mandate them: no artifact graduates without a clean dry run, no write path executes outside staging until review. The harnesses' sandboxes contain the compute side of this, and the data side, which schemas the credentials can even see, belongs to the governance section below.

Pin everything that can drift. Deterministic outputs require deterministic inputs: pinned dependency versions in pipeline environments, pinned model versions in the harness configuration where reproducing the authoring context matters, explicit schema contracts so upstream changes break loudly in CI rather than silently in production, and seeded sampling whenever the agent works against data subsets, so the exploration that justified a decision can be re-run and re-examined.

Externalize the numbers into a semantic layer. The single highest-impact determinism technique for analytics: metric definitions, what revenue means, how churn is calculated, live as governed, versioned definitions in a semantic layer, and agents are instructed to query the defined metrics rather than improvising aggregations against raw tables. This converts the worst nondeterminism in the field, three plausible revenue queries with three answers, into a lookup, and the published evidence matches field experience: grounding agents in governed semantics roughly doubles their accuracy on data questions. Declared bias and genuine conviction at once: platforms like Dremio's, with semantic layers served over MCP, exist precisely to be this layer, and whatever vendor provides yours, the architectural point stands.

And keep golden datasets for the agent itself. A small, versioned corpus of representative tasks, schemas, and known-correct outputs, against which you evaluate configuration changes, new skills, new hooks, new models, before rolling them to the team. The agent setup is itself an artifact estate, and it deserves the same regression discipline as the pipelines it helps build.

The Data Team's Instruction File: What Goes in AGENTS.md

Since every harness profiled above reads standing instruction files, and since that file is where a team's determinism discipline becomes enforced rather than remembered, it deserves a concrete treatment: here is what belongs in a data team's AGENTS.md or its equivalents, section by section, in prose you can adapt this afternoon.

Identity and boundaries first: what this repository is, which systems the agent may read, which it may never touch, and the standing rule stated bluntly, all writes land in staging schemas on branches, production is reached only by CI executing merged code. Agents follow explicit prohibitions far more reliably than implied ones, so write the prohibitions.

Conventions second, the tribal knowledge externalized: naming standards for models and columns, the project's layer structure, staging to intermediate to marts or its local equivalent, the incremental-model patterns the team blesses and the ones it has banned with scars to show, dialect specifics, and formatting rules, though the better home for formatting is a hook that enforces it mechanically, with the instruction file simply noting the hook exists.

Validation requirements third, the contract: every model ships with tests, and name the minimum, keys, accepted values, row-count expectations, every change runs the dry-run gate before proposing, every migration artifact pairs with its reconciliation check, and the definition of done is tests passing, not output looking plausible. Instruct the agent to run the validation loop itself and to report results in its summaries, which turns every session log into a small audit document.

Data semantics fourth, the drift killer: where the governed definitions live, the semantic layer, the metrics files, the catalog, and the instruction that analytical questions route through defined metrics rather than improvised aggregations, with new metric needs flagged for definition rather than silently invented. One paragraph here prevents the three-revenue-numbers incident more reliably than any review process.

And escalation last: the conditions under which the agent must stop and ask, schema changes beyond a threshold, anything touching the listed sensitive tables, reconciliation failures it cannot resolve, ambiguity about which definition applies. An agent with clear escalation rules interrupts you at exactly the right moments, and one without them interrupts you either constantly or, worse, never.

Keep the whole thing under a few hundred lines, version it, review changes to it like code, because it is code in the way that matters, and treat its growth as institutional learning: every incident retrospective that ends with a new line in the instruction file is an incident that will not repeat, on any harness, for any team member, human or otherwise.

The Workflow Patterns: Where the Hours Actually Go

Techniques compose into workflows, and five patterns cover most of a data team's agent-assisted week.

Pipeline development is the bread and butter: the agent scaffolds the ingestion or transformation, models, tests, documentation, and configuration together, iterates against the dry-run and test loop until green, and delivers a branch. The human reviews intent and design, the machinery has already reviewed correctness, and the merged result is indistinguishable, on purpose, from well-crafted handwritten work, except that it arrived in an afternoon with better test coverage than most humans write unprompted.

Migration and translation is where the harnesses look most like magic while being most deterministic: dialect-to-dialect SQL translation, warehouse-to-lakehouse moves, legacy pipeline modernization. The pattern that makes it safe is reconciliation-driven: the agent's first deliverable is the validation harness, row counts, aggregate checksums, sampled comparisons between old and new paths, and only then the translated artifacts, iterated until reconciliation passes. Long-context harnesses shine here, holding both estates in mind, and the deliverable is not "the agent says they match," it is a reconciliation report any auditor can re-run.

Quality investigation uses the improvising mode correctly: an anomaly appears, and the agent, on read-only credentials, does the reconnaissance, profiling distributions, checking recent loads, diffing schema versions, tracing lineage, that consumes human hours. Its findings are leads, and the pattern's discipline is that the fix it proposes lands as artifacts: the corrected transformation plus the new test that would have caught the issue, so every investigation permanently hardens the estate.

Documentation and lineage may be the highest-ratio pattern of all: agents generating and refreshing model documentation, column descriptions, lineage summaries, and runbooks from the code and schemas themselves, on a schedule, as pull requests. The chronically undone work of data teams, done continuously, reviewably, and without sighing.

And analytics with provenance closes the loop for the analyst side: questions answered through the semantic layer's governed definitions, exploratory findings promoted into saved, versioned queries and models rather than dying in a session, and every number that escapes to a stakeholder carrying its trace, which definition, which query, which snapshot. The agent accelerates the analysis. The architecture makes it citable.

The Analyst's Version: Taming Exploration Itself

One workflow deserves its own section because it is where analytics has always leaked determinism, agent or no agent: exploratory analysis, the notebook that found the insight and can never quite find it again.

Exploration is legitimately nondeterministic, that is what makes it exploration, and the discipline is not to constrain the wandering but to govern what escapes it. The pattern that works has three moves. First, reproducible wandering: even exploratory sessions run on read-only credentials against named snapshots or seeded samples, so that any promising path can be re-walked, and the harnesses make this nearly free, the session transcript is a record, and an instruction-file line requiring the agent to log every query it ran alongside its findings turns each exploration into a re-runnable script by accident.

Second, the promotion gate, the move that changes everything: an insight that will be shown to anyone gets promoted from exploration to artifact, the winding notebook distilled, by the agent, which is excellent at exactly this distillation, into a clean, parameterized, tested query or model, committed, reviewed, and thereafter the citable source of that number. The notebook was the search. The artifact is the answer, and the two have different jobs, different audiences, and different determinism requirements, which the promotion gate makes structural.

Third, definition capture: when exploration surfaces a metric the business will want again, churn by cohort, activation by channel, the finding routes into the semantic layer as a governed definition rather than living as a clever query in one analyst's branch, which is how exploration compounds into organizational vocabulary instead of organizational folklore. The agent drafts the definition, the humans who own the semantics review it, and the next question about that metric, from any person or any agent, resolves to the same answer.

Analysts sometimes hear this as bureaucracy arriving to ruin the fun, and the lived experience runs opposite: the wandering stays free, the harness absorbs the distillation drudgery that used to make rigor expensive, and the analyst's work stops evaporating, every promoted artifact a permanent brick where a screenshot of a notebook used to be. Determinism, at the analytics layer, is not a constraint on insight. It is what lets insight be believed twice.

Governance: The Line Between Fast and Reckless

All of the above assumes an answer to the question that should precede any harness's first database connection: what, exactly, can this agent touch?

The pattern that works is the same identity discipline applied to human engineers, tightened: agents authenticate as their own principals, never as a borrowed human account, with scopes matched to the ladder, read-only credentials for exploration and investigation, write access confined to staging and development schemas, production writes reserved for the CI system executing reviewed, merged artifacts, which is to say, never held by the interactive agent at all. Short-lived, vended credentials beat long-lived secrets in configuration files, catalog-level governance, the access controls and credential vending of the open catalog world, beats per-tool password sprawl, and every query the agent runs should land in the same audit trail as any user's, because "what did the agent touch" is a question incident review will eventually ask, and the harness session log is not the system of record, the platform's audit is.

MCP is where this becomes practical rather than aspirational: the harnesses reach data through MCP servers, and a well-built data-platform MCP server, my employer's among the entrants, is precisely a governance boundary, authenticating the agent as a principal, enforcing its scopes, exposing semantic definitions alongside raw access, and logging everything. Configure the connection once, correctly, and every workflow in this article inherits the discipline. Skip it, paste an admin connection string into a config file, and no amount of prompt engineering will save you, because governance was never the model's job. It is the door's.

A Worked Example: One Migration, End to End

Compress the whole method into one story, composited from real engagements.

A team must migrate a legacy warehouse's reporting layer, some two hundred SQL views in an aging dialect, onto their Iceberg lakehouse, historically a two-quarter slog. Week one, setup: the repository gets its instruction file, conventions, the staging-only rule, the reconciliation requirement, the harness gets hooks wiring the SQL linter and test runner, read-only credentials to the legacy system and staging credentials to the lakehouse arrive as vended, scoped principals through the MCP connection, and the golden tasks, five representative views with known outputs, validate the setup itself.

Weeks two through five, the loop: the agent proceeds view by view, and per the reconciliation-first pattern, each unit of work is a branch containing the translated model, its tests, and its reconciliation check against the legacy output, iterated headlessly against the dry-run and test gates until green, then queued for human review. The humans, freed from syntax, review design: this view should become two models, that one is dead and should be retired, this translation reveals a legacy bug worth preserving in a comment and fixing in the new path, judgment work, the kind that was always the actual job. A subagent maintains the running migration log and refreshes documentation as models land. Nightly, a scheduled headless run re-executes the full reconciliation suite across everything migrated so far, and its one mid-project catch, an upstream schema drift that broke eleven reconciliations, arrives as a morning report with a proposed fix branch, not as a surprise in month three.

Week six, the finish: two hundred views migrated, every one tested and reconciled, documentation current, audit trail complete, and the cutover is an anticlimax, which is the highest compliment a migration can receive. The team's retrospective line is the one I hear repeatedly and the reason this article exists: the agent did not replace the engineers, it replaced the two quarters, and every number still reproduces to the penny, because nothing nondeterministic ever entered the runtime path.

The Anti-Patterns: How This Goes Wrong

The failure catalog, brief and pointed, because every entry is a real pattern from the field.

The improvised dashboard: an agent in the runtime path, generating the query live on every refresh, numbers that drift between mornings, and no artifact to review when finance disputes Tuesday. The confused principal: the agent running on a human's credentials, its actions indistinguishable in the audit from its operator's, discovered during the incident that makes everyone memorize the word "principal." The untested translation: migration by vibes, "the agent converted it and it looks right," with reconciliation deferred until the legacy system is gone and the discrepancies are unfalsifiable. The context-free number: an agent answer pasted into a deck without its query, its definition, or its snapshot, unreproducible by construction. The unpinned everything: environments, schemas, and samples left floating, so that even the deterministic artifacts stop reproducing, and the agent gets blamed for what drift did. And the configuration-free adoption: a frontier harness deployed with no instruction file, no hooks, no scoped credentials, generating impressive demos and a slow accumulation of exactly the incidents above, until the tools are banned for what the setup never attempted to prevent.

Every one of these has the same cure, and it is the article's thesis read backwards: put the model's variance where variance is safe, put machinery everywhere else, and let the deterministic estate do what it has always done, which is be trusted.

Questions I Hear Most Often

Doesn't setting temperature to zero solve the determinism problem? No, and the question diagnoses the confusion this article exists to fix: sampling settings reduce token-level variance within one call, and they do not make an agent's multi-step behavior reproducible, nor should you want determinism at that layer. The determinism that matters is at the artifact and runtime layer, same pipeline, same input, same output, and that is achieved architecturally, by keeping the model out of the runtime path, not by tuning the dice.

Can I trust agent-written SQL for genuinely complex logic? Trust the process, not the SQL: complex logic is exactly where tests, reconciliation, and review earn their existence, and agent-written SQL that has passed a reconciliation suite against known outputs deserves precisely the same trust as human-written SQL that has passed it, which is the only kind of trust either ever deserved. The honest adjustment is in review attention: agents err confidently and syntactically beautifully, so review verifies semantics against intent, which the tests should be encoding anyway.

Which of the four harnesses should a data team standardize on? Standardize the discipline, not the harness: the instruction conventions, the hook-enforced validation, the credential scoping, and the MCP endpoints travel across all four, and the harness choice then follows ecosystem, model subscriptions you hold, cloud gravity, governance constraints on where data details may flow, with OpenCode's local-model path as the answer to the strictest version of that last constraint. Teams that standardize the portable layer switch harnesses in a week. Teams that standardize a harness rebuild their discipline every switch.

Is level four, scheduled autonomous agents, actually safe for data work? Yes, under one condition that is the whole point: the autonomous agent's write path is the pull request, never the production schema. A nightly agent that investigates, drafts, tests, and proposes is a tireless colleague, and the same agent with production credentials is an unattended nondeterministic process in your data path, which is the thing this entire article is designed to never build. The gate is not the agent's intelligence. It is the pipeline's.

How does this change what data engineers actually do? It concentrates the job into its judgment core: deciding what should exist, reviewing intent and design, encoding standards into the instruction files and tests that steer the machinery, and owning the governance boundaries, while the syntax, scaffolding, translation, and documentation hours compress dramatically. The engineers thriving in this workflow describe the same shift: less typing, more architecture, and a strange new artifact of seniority, the quality of your team's AGENTS.md file.

Where should a team start, concretely, this month? One workflow, full discipline: pick documentation generation or a contained migration, stand up the instruction file, the hooks, the scoped read-only credentials, and the branch-and-review flow, run it for four weeks, and measure artifacts merged, test coverage added, and incidents caused, which should be a positive number, a larger positive number, and zero. That experience, not this article, will convince your skeptics, and the setup it forces you to build is the platform every subsequent workflow inherits.

Closing Thoughts

The stochastic model and the deterministic pipeline are not enemies, and the discipline that reconciles them is neither novel nor mysterious: it is software engineering's oldest separation, development and runtime, applied at the moment it matters most. The agent harnesses give data teams a collaborator of extraordinary breadth at development time, and the estate they help build, versioned, tested, reconciled, governed, remains exactly as deterministic as the discipline enforcing it. That is the whole resolution: creativity where variance is cheap, machinery where trust is dear, and a hard line between them that your instruction files, hooks, credentials, and CI enforce so that no one has to remember it under deadline. The teams working this way are not choosing between AI speed and data trust. They are compounding both, and the gap between them and the teams still debating the contradiction widens every sprint.

If the way this article builds understanding works for you, that is what my books do at full depth. I co-authored Apache Iceberg: The Definitive Guide and Apache Polaris: The Definitive Guide for O'Reilly, with further titles on lakehouse architecture, data engineering, and agentic analytics.

Browse the full collection of my books on data and AI at books.alexmerced.com.

Designing Your Own AI Harness: A Deep Dive Into the Architecture of Agent Loops, Tools, Context, and Control

Alex Merced — Sat, 18 Jul 2026 16:13:14 +0000

The most underappreciated finding in applied AI this year fits in one statistic: a major framework team took the same model, changed nothing about it, rebuilt only the machinery around it, and watched their score on a leading agent benchmark jump from the low fifties to the mid sixties, vaulting from the middle of the pack into the top five. No new model. No fine-tuning. Just a better harness.

The harness, the loop, tools, context management, permissions, and persistence wrapped around a language model, is where agents are actually engineered, and while most people will rightly use the excellent commercial and open harnesses that now exist, a growing population needs to build their own: product teams embedding agents into applications, platform teams needing control the packaged tools will not cede, researchers who need to see every token, and engineers who simply refuse to operate machinery they do not understand, a camp I have deep sympathy for. For all of them, and for anyone who wants to understand what their off-the-shelf agent is actually doing, this article is the deep dive: the anatomy of a harness component by component, the architectural decisions at each layer with their honest trade-offs, the hard-won patterns, progressive context compaction, permission matrices, filesystem-first tools, budget enforcement, that separate production harnesses from weekend demos, and a staged path from a hundred-line loop to a system you can trust unattended. My biases declared: I work at Dremio, whose MCP server is one of the governed endpoints a well-built harness might call, and my daily drivers are commercial harnesses whose design choices I will reference as evidence throughout, because the best public teachers of harness design are the tools that won.

First Principles: What You Are Actually Building

Strip everything away and a harness is a while loop with judgment. Here is the irreducible core, in prose pseudocode, because every architecture decision in this article is an elaboration of one of its lines.

Assemble the initial context: system instructions, the task, whatever project knowledge applies. Then loop: send the context to the model along with schemas describing the available tools. The model responds with either an answer, in which case check whether the task is done, or with tool calls, requests to read a file, run a command, query an API. Validate each requested call against permissions, execute the allowed ones, and append the results to the context. Check the budgets, steps, time, tokens, money, and if any is exhausted, stop gracefully. Otherwise, loop again.

That is the whole animal, and a functioning version fits in a few hundred lines, which is itself an important design fact: one deliberately minimal open harness ships under a thousand tokens of scaffolding and performs respectably, proving how much of the magic is the loop plus a strong model. Everything else in this article, and everything in the feature lists of the major commercial harnesses, is an answer to one of five questions that the minimal loop leaves open. How does the model talk to the world, the tool layer. What does the model get to see, the context layer. What is the model allowed to do, the permission layer. What happens when things go long or wrong, the control layer. And what survives the session, the persistence layer. Architect those five deliberately and you have a harness. Let them happen by accident and you have a demo that will humiliate you in production.

One framing to carry throughout: the harness is a delivery mechanism for context engineering. Models are stateless and see only what you assemble for them each turn, their performance degrades measurably as context fills with noise, the phenomenon practitioners call context rot, and so nearly every sophisticated harness feature, compaction, subagents, tool output management, file-based memory, is ultimately about getting the right information in front of the model and keeping the wrong information away from it. Hold that lens and the whole design space organizes itself.

Watch the Loop Run: One Task, Traced

Before the components, one traced execution, because seeing the loop's lines fire in order makes every later section concrete. The task, given to a modest custom harness: "the CSV exports in the reports folder have inconsistent date formats, normalize them and tell me what you changed."

Turn one: the harness assembles context, its system instructions and tool schemas at the front for cache stability, the project's instruction file, the task, and sends it. The model returns a tool call: list the reports folder. The permission layer checks the matrix, read-only tier, allowed silently, executes, and appends the listing, forty files, as an observation.

Turns two through four: the model samples, read this file, read that one, and the tool layer's output management earns its first keep: each CSV is thousands of lines, so the read tool returns the first fifty rows plus a note of the full size, enough to diagnose formats without flooding the window. The model identifies three date conventions across the files and proposes its plan as a message. The harness's instruction file said plans touching more than ten files require approval, so the control layer surfaces the plan and pauses. The human approves.

Turns five through nine: the model writes a normalization script to a scratch directory and requests shell execution. The permission layer classifies it, mutating, sandboxed path, allowed with logging, and the sandbox confines the run. The script fails on two files with a malformed year. The failure returns as an honest observation, exit code, stderr, the offending rows, and the model adjusts the script to quarantine unparseable rows rather than guess, reruns, success. Note what the harness did here: nothing clever, it just delivered truthful feedback and let the loop do its work.

Turn ten: budgets are healthy, twelve steps of forty used, a fraction of the token ceiling, but the context is now heavy with observations, so the compaction layer runs its cheapest stage, dropping the superseded file previews while pinning the task, the plan, and the error history. The model writes its summary of changes to a notes file, the persistence habit the system prompt requires, and returns its answer: files normalized, two rows quarantined with reasons, script saved for reuse. The control layer sees the completion signal, runs the validation hook, a quick script confirming every date now parses, and only then marks the task done. Total: ten turns, one approval, one caught failure, an audit log that reconstructs all of it, and a reusable artifact.

Every subsystem this article is about to detail appeared in that half page: assembly, caching order, permission tiers, sandboxing, output truncation, honest errors, budgets, compaction, notes, hooks, validation gates, audit. A harness is not exotic. It is this, done deliberately, every turn.

Decision Zero: Build, Assemble, or Adopt

Before the components, the honest gate everyone should pass through, because building a harness is a commitment and the alternatives are strong.

Adopt means using the packaged harnesses, the commercial terminal agents and their peers, and it is the right answer for most individual productivity and most coding work: they embody years of hard lessons, they expose customization through instruction files, skills, hooks, and MCP, and their headless modes embed into pipelines without you owning a loop. Assemble means building on an agent framework, the graph-based runtimes that give you typed state, checkpointing, and resumability, with batteries-included agent layers on top providing planning, filesystem tools, subagents, and compression out of the box. Assembly is the right answer when you need a custom agent inside a product but your differentiation is the workflow, not the loop mechanics, and the checkpoint-and-resume story, pause any step, serialize state, resume on another machine days later, is genuinely hard to replicate alone. Build from scratch is the right answer in three cases: the loop itself is your product or research subject, your constraints, air-gapped environments, exotic latency budgets, deep protocol integration, defeat the frameworks, or the pedagogical case, which I refuse to dismiss, because a team that has built even a toy harness debugs its production agents, whoever made them, at a different level.

The rest of this article serves all three camps: builders get the blueprint, assemblers get the checklist of what their framework must provide, and adopters get X-ray vision into the tools they already run.

The Model Layer: Your Foundation's Foundation

The first component is the interface to the model itself, and its design goals are boring, which is the point: reliability, portability, and cost hygiene.

Portability first: wrap the provider behind your own interface from day one, because provider-neutrality is cheap at hour one and expensive at month six, and the year's market turbulence, repricings, deprecations, access changes, has made single-provider coupling a documented business risk. Your abstraction needs to cover the real surface: streaming token delivery, native tool-calling with structured schemas, and, increasingly, provider-side features like prompt caching, which brings the first non-obvious design rule: order your context for cache stability. Providers discount tokens that prefix-match previous requests, so stable content, system instructions, tool schemas, project context, belongs at the front, volatile content at the back, and a harness that interleaves them carelessly can double its bill without changing a word of behavior.

Reliability second: every model call can fail, time out, or return malformed tool arguments, so the model layer owns retries with backoff, timeout enforcement, and schema validation of what comes back, with malformed tool calls fed back to the model as errors to correct rather than crashing the loop. And accounting third: this layer is where every token and dollar is counted, per call, per session, per task, because the budget enforcement that the control layer needs and the cost-per-task metrics that the evaluation layer needs both depend on the model layer measuring honestly. Instrument it first, you will thank yourself weekly.

The Tool Layer: Where Design Taste Shows Most

Tools are how the agent acts, and tool design is where I have watched the most harnesses go wrong, usually in the same direction: too many tools, too narrow, too clever.

The mechanics are standard by now: each tool is a typed schema, name, description, parameters, that the model sees, plus an implementation the harness executes, with results returned as observations. The Model Context Protocol has become the ecosystem's answer for external tools, typed by default, discoverable, and reusable across harnesses, and your harness should be an MCP client early, because it converts the entire ecosystem of servers, databases, browsers, ticketing systems, governed data platforms including my employer's, into your agent's tool belt for free.

The design lessons are where the field's scar tissue lives, and the most important one comes from a production case study worth knowing: a major cloud team's incident-response agent began with over a hundred bespoke, specialized tools and a prescriptive prompt, and performed mediocrely on novel incidents. The rebuild threw most of it away: expose the world as a filesystem, source code, runbooks, schemas, past investigation notes as files, give the agent a handful of general tools, read, search, list, shell, and let it investigate the way an engineer would. Their task-success measure rose by thirty points. The lesson generalizes and matches what the leading coding harnesses converged on independently: a few powerful, composable, general tools beat a hundred narrow ones, because general tools let the model apply its reasoning, while narrow tools demand it guess your API's ontology.

Three more rules earn their place in any harness. Manage tool output aggressively: a two-thousand-line log dump is context poison, so truncate, summarize, or offload large outputs to files the agent can grep, returning a reference rather than the payload. Make tools honest about failure: an error message that says what went wrong and what a valid call looks like turns failure into a self-correcting step rather than a doom loop. And design for idempotency and reversibility wherever the domain allows: tools that can be safely retried and actions that land on branches or in staging areas make the whole system forgiving of the model's imperfection, which is the harness's actual job.

The Context Layer: The Heart of the Machine

If the harness is a context delivery mechanism, this layer is the product, and it decomposes into assembly, budget management, and compaction.

Assembly is the per-turn question: what goes in the window, in what order. The stable spine, per the caching rule, comes first: system instructions defining the agent's role and rules, tool schemas, and durable project context, which mature harnesses source from instruction files, the AGENTS.md convention and its relatives, so that humans can shape agent behavior in versioned, reviewable text rather than per-session prompting. Then the task, then the working history of the session. The discipline that separates good assembly from stuffing: everything competes for the model's attention, irrelevant material actively degrades reasoning, and the assembler's bias should be ruthless relevance, retrieve and include what the current step needs, reference the rest as files or summaries the agent can pull on demand.

Budget management is the running question: the window is finite, tool observations routinely consume the dominant share of it in long sessions, and the naive approach, let it fill until the API errors, is not an option. Track token pressure continuously against the window, using the provider's own reported counts as your calibration, and treat rising pressure as a signal to act early, not a cliff to fall off.

And compaction is the answer when pressure demands action, the most studied subsystem in modern harness engineering, and the state of the art is emphatically not a single emergency summarize-everything trigger, which activates late, destroys information, and compounds errors on repeat. The pattern that the leading harnesses and the research literature converged on is progressive, multi-stage compaction: begin with the cheap and lossless, trim redundant tool outputs, drop superseded file reads, collapse repeated observations, escalate to targeted summarization of older exchanges while pinning the task statement, key decisions, and recent turns verbatim, and reserve full-history summarization as the last resort, ideally paired with a scratchpad file where the agent has been journaling its findings all along, so that compaction compresses the conversation without erasing the knowledge. That last idea, the agent maintaining its own external notes as durable memory that survives any compaction, is one of the highest-value patterns in the field, and it costs almost nothing to implement: a file, a habit in the system prompt, and a read tool.

The Permission Layer: Autonomy as a Dial, Not a Switch

An agent that acts needs a theory of what it may do, and the difference between a toy and a production harness is that the theory is engineered rather than vibes.

Start with a risk taxonomy: classify every tool and, for powerful tools like the shell, every action pattern, into tiers, read-only, mutating-but-reversible, destructive, financial, exfiltrating, and encode a permission matrix mapping tiers to dispositions: allow silently, allow with logging, require human approval, deny always. The matrix, not the model, is the authority: requested calls are validated against it before execution, denials are returned to the model as observations it can route around, and the matrix itself is configuration, per project, per environment, per trust level, so the same harness runs locked-down in production and permissive in a sandbox.

Then contain the blast radius structurally, because permission checks are necessary and insufficient: run execution inside a sandbox, containers, virtual machines, or OS-level primitives that confine filesystem and network reach, keep secrets out of the agent's environment entirely, injected by the harness at execution time rather than visible in context, and treat everything the agent reads, files, web pages, tool outputs, as untrusted input, because prompt injection, malicious instructions smuggled into content, is the field's defining unsolved attack, and your defenses are layered skepticism: instruction hierarchies the model is trained to respect, detection heuristics, and above all a permission matrix that makes the worst case boring. Design for the audit from day one: every tool call, decision, and approval logged with enough fidelity that you can reconstruct any session, because the first serious incident review will happen, and the harness that can answer "what exactly did it do" survives it.

The Control Layer: Budgets, Stop Conditions, and Failure

The loop needs to know when to stop, and giving it that knowledge is a small subsystem with outsized returns.

Enforce budgets on four axes: steps, a maximum number of loop iterations, wall-clock time, tokens, and cost, checked every turn, with graceful degradation on exhaustion, the agent is told the budget state and asked to conclude, summarize progress, and hand off, rather than being killed mid-thought. Budgets convert the failure mode from "runaway agent burned two hundred dollars overnight" to "agent stopped at its limit and left a status note," which is the entire difference between a system you can schedule and one you must babysit.

Define stop conditions beyond budgets: explicit task-completion signals, validation gates, the task is done when the tests pass, not when the model says so, and escalation paths, conditions under which the agent must stop and ask a human, encoded as rules rather than hoped for as judgment. Handle the long-horizon cases deliberately: checkpointing, serializing loop state so sessions survive crashes and resume across machines, is the feature that graph-based runtimes give you and hand-rolled loops usually lack until the first painful loss, and for continuous work, the pattern of re-injecting the standing objective into fresh context windows keeps a persistent agent on-mission across context resets. And treat repeated failure as a first-class signal: the same tool failing three times, the same file edited in circles, are loop pathologies your control layer should detect and break, with the state handed to a human, because the model will not always notice it is stuck, and the harness must.

The Persistence Layer: What Survives the Session

Sessions end, and value should not. The persistence layer decides what carries forward, and the field's answer has converged on something refreshingly low-tech: files first.

A store of plain files, markdown notes, project instructions, accumulated conventions, investigation journals, organized simply and read through the same tools the agent already has, outperforms elaborate memory architectures for most harness purposes, and it comes with the property that matters most: humans can read, edit, and version everything the agent knows. Vector stores earn their place when semantic retrieval over large corpora is genuinely needed, as a complement rather than a replacement. The design decision that matters more than the storage technology is write governance: an agent that writes its own memory can poison its own future, so define write rules, what kinds of conclusions may be persisted, where, with what review, and keep the durable store append-mostly and auditable. Session-level persistence, transcripts, checkpoints, artifacts, rounds out the layer, and the test of the whole design is simple: kill the process mid-task, restart, and see what the agent still knows. Production harnesses pass that test on purpose.

The Orchestration Layer: Subagents and Events

Two advanced structures appear in every leading harness, and both are context engineering by other means.

Subagents, a lead agent delegating scoped work to child agents, earn their complexity in exactly two situations: parallelism, several independent investigations at once, and context isolation, a child burns its own window exploring a rabbit hole and returns only conclusions, keeping the parent's context clean. The engineering that makes them safe is the part naive implementations skip: each child gets a rebuilt context and its own permission scope, not an inherited copy of the parent's, and results return through structured summaries rather than transcript dumps. Resist the swarm temptation: most tasks are better served by one agent with clean context than five with chaos, and the leading harnesses use delegation surgically.

An event and hook system, the harness emitting events at every lifecycle point, session start, before and after each tool call, on file edits, on completion, with user-defined handlers attached, is the extension mechanism that lets policy live outside the model: format the code after every edit, run the tests after every change, block any command matching a pattern, notify a channel on completion. The most mature commercial harness exposes dozens of event types, and the design lesson for builders is to emit events from day one even before you need them, because every future integration, observability, enforcement, automation, attaches there.

The Evaluation Layer: The Sibling System You Cannot Skip

The last component is the one that makes all the others improvable: the eval harness beside the agent harness.

Build a gold set, a labeled collection of real tasks from your domain with known-good outcomes, and run it on every meaningful change, model upgrades, prompt edits, tool redesigns, compaction tuning, because agent behavior is emergent and regressions arrive from directions intuition never watches. Instrument traces end to end, every session reconstructable as a sequence of contexts, calls, and observations, because debugging an agent without traces is debugging a distributed system with print statements. Track the operational metrics that actually govern viability, task success rate, cost per completed task, tokens per task, human interventions per task, time to completion, and let them, not vibes, drive the tuning. And include security evals, injection attempts, permission probes, budget abuse cases, in the gold set, because the permission layer is code, and code that is never tested is code that does not work. Teams that stand up evaluation early report the same experience: the eval harness pays for itself the first time a "small prompt improvement" quietly halves the success rate, which it will.

The Surface Layer: How Humans and Systems Drive It

A harness needs at least one face, and the mature pattern is three faces over one engine, built in this order.

The headless surface comes first, and if you build only one, build this: a single-shot invocation, task in, work done, structured result out, with flags for budgets, permission profile, and output format. Headless is what makes the harness composable, schedulable in cron and CI, pipeable into scripts, callable from other programs, and designing it first enforces the discipline that saves you later: all state in the engine, none in the interface.

The interactive surface, terminal UI or minimal web panel, earns its keep for development and supervised work: streaming output so humans see the agent think, inline rendering of diffs and plans, approval prompts from the permission layer surfaced as real interactions rather than log lines, and session controls, pause, redirect, abort, wired to the control layer's checkpoints. Resist building this into a monument: the commercial harnesses set a high bar for interactive polish, and a custom harness's interactive face needs to be honest and responsive, not beautiful.

And the server surface is the strategic one: expose the harness through an API, and specifically through MCP's server side, so that editors, orchestrators, schedulers, and other agents can drive your agent as a typed tool. This is the move that turns a harness from a tool into infrastructure, agents composing agents, your specialized loop callable inside larger systems, and it costs little once the headless surface exists, because a server is a headless invocation with a listener in front. One engine, three faces, zero logic in any face: that separation is the whole architecture of the layer.

The Anti-Patterns: Six Ways Custom Harnesses Die

The failure modes repeat with such regularity that naming them is a public service, and I have committed at least three of these personally.

The tool zoo. Forty narrow tools, each wrapping one API endpoint, each with its own parameter ontology the model must guess. The agent spends its reasoning on tool selection instead of the task, and every new capability means another tool. The cure is the filesystem lesson: few general tools, world exposed as readable structure.

The infinite context buffet. Whole files, full logs, complete histories, appended forever, compaction added "later." Performance decays across the session, costs balloon, and the team concludes agents are overhyped when the actual diagnosis is context rot, self-inflicted. The cure is the context layer, applied from stage one, not stage three.

The trust-the-model permission model. No matrix, no tiers, no sandbox, just a system prompt asking the model to be careful. It works in every demo and fails on the first injection or hallucinated path, and the incident review finds no audit log because logging was also "later." The cure is structural: matrix, sandbox, logs, before the first real credential goes anywhere near the loop.

The unkillable session. No budgets, no stuck-loop detection, no checkpoints: the agent that ran all weekend, the crash that lost six hours of state, the retry storm that made a vendor's rate-limit team aware of your company. The cure is the control layer, which is an afternoon of work that prevents each of these exactly once, permanently.

The eval-free tune. Prompt tweaks and model swaps shipped on vibes, each one improving the demo task and silently breaking two others, discovered by users. The cure is the gold set, started at ten tasks, run on every change, boring and decisive.

And the monolith face. Business logic braided into the TUI, so the harness cannot run headless, cannot be scheduled, cannot be served, and the first automation request triggers a rewrite. The cure is the surface layer's separation, engine first, faces after, enforced from the first commit.

Every one of these is survivable, and every one is cheaper to prevent than to fix, which is what the staged build below is actually for: it sequences the prevention.

A Staged Build: From Weekend to Production

Assemble the components into the path I recommend walking, because sequencing is half the wisdom.

Stage one, the honest weekend: the minimal loop, one model provider behind your abstraction, four general tools, read, list, search, shell-in-a-container, budgets on all four axes, and a permission matrix with two tiers. This system already does real work, and building it teaches more about agent behavior than a month of reading.

Stage two, the useful month: instruction-file loading, MCP client support to inherit the tool ecosystem, tool output truncation and offloading, the scratchpad-notes pattern, session transcripts, and the first eval set of ten real tasks. This is the stage where the harness starts winning against your expectations, and where cache-aware context ordering pays its first visible bills.

Stage three, the trustworthy quarter: progressive compaction with pinned task context, the event and hook system, approval-gated permission tiers with full audit logging, checkpoint and resume, failure-pattern detection in the control layer, and the eval set grown to fifty tasks with cost and success tracked per change. This is the production line: at this stage you can schedule the agent, hand it to teammates, and answer the incident-review question.

Stage four, chosen deliberately or skipped forever: subagents with isolated contexts and scopes, durable file-based memory with write governance, a server mode so other systems, including other agents, can drive your harness, and domain specialization, which is where your harness stops being a general clone and becomes the thing only your team could have built, the reason to have walked the path at all.

Questions I Hear Most Often

Isn't this all wasted effort when the commercial harnesses are so good? For general coding productivity, mostly yes, adopt and customize. The build case is specificity and control: agents embedded in products, domains with constraints the packaged tools cannot honor, air-gapped and regulated environments, and platform teams for whom the loop is infrastructure they must own. And the learning case stands on its own: every hour building a harness compounds into sharper operation of every agent you ever run.

Which framework should I assemble on, if assembling? Choose on the boring criteria: typed, checkpointed state you can pause and resume, first-class observability, clean escape hatches to raw model calls, and an active community, rather than on demo elegance. The graph-based runtimes with agent layers on top currently define the mature end of that spectrum, and the honest alternative for simple needs remains a few hundred lines of your own loop, which at least you will fully understand.

How much does the model choice matter versus the harness? Both matter, and the harness is the half you control: the benchmark jump this article opened with came from harness changes alone, and equally, no harness rescues a model below the task's reasoning floor. The practical posture: build provider-neutral, benchmark model-harness pairs on your own gold set, and expect the answer to change several times a year.

What is the single most common design mistake? Context negligence: dumping whole files, raw logs, and full histories into the window and wondering why the agent got dumber as the session got longer. The fix is this article's spine, ruthless assembly, output management, early progressive compaction, external notes, and it routinely improves task success more than any model upgrade.

How do I make my harness safe enough to run unattended? Layered, in this order: sandbox the execution, scope the permissions with approval gates on the destructive tiers, enforce budgets on every axis, log everything, detect stuck loops, and only then schedule it, starting with read-only and reversible workloads and expanding trust with evidence. Unattended is earned, and the harness features are how it is earned.

Where does MCP fit in a custom harness? Two places: as a client, adopt it early to inherit the ecosystem's tools, including governed data endpoints, instead of hand-building integrations, and as a server, expose your finished harness through it so editors, other agents, and pipelines can drive your agent as a tool. The protocol layer is the part of this field that has genuinely standardized, and a custom harness that ignores it is custom in the expensive direction.

Closing Thoughts

The harness is where the abstract power of language models becomes accountable work, and its design space, five layers, a dozen decisive patterns, is now well enough mapped that building one is engineering rather than alchemy. The deepest lesson the field's first years produced is the one every layer of this article repeated in its own vocabulary: the model supplies the reasoning, and everything that makes the reasoning safe, cheap, durable, and true, the context discipline, the permission matrix, the budgets, the evals, the audit trail, is machinery, and machinery is yours to design. Build it deliberately, or at minimum, understand it deeply in the tools you adopt, because the difference between teams that get compounding value from agents and teams that get demos is not the model they rent. It is the harness they run.

Browse the full collection of my books on data and AI at books.alexmerced.com.

File Encryption for the Lakehouse: The Terminology, the Machinery, and the Hard Problem of Interoperable Encrypted Tables

Alex Merced — Tue, 14 Jul 2026 00:02:09 +0000

For years, the open lakehouse had an honest gap that practitioners whispered about and slide decks skipped: encryption. Not the checkbox kind, every cloud bucket has offered that for a decade, but the real kind, where the data itself is cryptographically protected in a way that survives a compromised bucket, satisfies a regulator, and still works when five different query engines from five different vendors need to read the same table. That last clause is the hard part, and it is why encryption arrived at the lakehouse years after transactions, evolution, and time travel.

The gap is now closing, and 2026 is the year it became real. Apache Parquet's modular encryption matured from specification into broadly implemented capability, and Apache Iceberg 1.11, released this May, shipped table-level encryption as a headline feature: a full envelope-encryption design with a three-tier key hierarchy, encrypted metadata, and the catalog as the key broker. The pieces of an interoperable encrypted lakehouse finally exist. What does not yet exist is widespread understanding of how they fit, and encryption is a domain where partial understanding is worse than none, because a misconfigured cryptosystem produces perfect confidence and no protection.

So this article is the full treatment: the terminology bootcamp, every term you will meet, defined properly, the layers at which data can be encrypted and what each layer actually protects, the deep mechanics of Parquet Modular Encryption and Iceberg's new table encryption, and then the heart of the piece, the interoperability challenge set: why encrypted files that every engine can read is a genuinely hard distributed-systems problem, where the seams are, and the patterns emerging to manage them. Plus the operational realities, rotation, crypto-shredding, disaster recovery, that determine whether an encryption deployment is an asset or a time bomb. As always: plain language, honest trade-offs, and the goal that the logic clicks.

Why Bucket Encryption Was Never Enough

Start with the question that stalls half the encryption conversations I have: our object storage is already encrypted, so what problem remains?

Server-side encryption, the SSE in your S3 configuration, means the storage service encrypts bytes before writing them to its disks and decrypts them on every authorized read. It is genuinely valuable and genuinely narrow: it protects against threats to the physical storage layer, stolen drives, decommissioned hardware, a breach beneath the service's API. Against everything above that line it does nothing, because the service transparently decrypts for any caller with bucket permissions. A leaked credential, an over-broad IAM role, a compromised service, a malicious insider with storage access: every one of them reads plaintext, because to the storage API, they are authorized.

Threat modeling makes the gap precise. Server-side encryption answers "what if someone steals the disks." It does not answer "what if someone gets into the bucket," which is the overwhelmingly more common incident, nor "what if the storage provider itself must be outside the trust boundary," which is the sovereignty and regulated-industry requirement, nor "how do I prove to an auditor that a specific column of personal data was unreadable to everyone without a specific key." Those questions require the data to be encrypted before it reaches storage, under keys the storage service never holds, decryptable only by clients you control. That is client-side encryption, and in the lakehouse, where the clients are a fleet of heterogeneous query engines sharing files, client-side encryption is exactly the interoperability puzzle this article exists to work through.

The honest framing, which I will repeat at the end: bucket-level encryption plus access control is a legitimate, sufficient posture for plenty of estates. The machinery below is for the estates where it is not, regulated data, multi-tenant platforms, sovereignty constraints, defense in depth mandates, and the population of those estates grows every year the AI era pushes more sensitive data into analytical reach.

The Terminology Bootcamp

Encryption conversations run on a vocabulary that gates comprehension, so here is the working glossary, built in dependency order, each term earning the next.

Plaintext and ciphertext are the before and after: readable data, and the output of encryption, which should be computationally indistinguishable from random bytes. That indistinguishability, incidentally, is why my compression article insists on compressing before encrypting: ciphertext has no redundancy left to compress.

Symmetric encryption uses one key for both directions, and it is what bulk data encryption always uses, because it is fast. The universal standard is AES, the Advanced Encryption Standard, hardware-accelerated on essentially every modern CPU through dedicated instructions, which is why encrypting terabytes is computationally cheap in 2026. Asymmetric encryption, the public-and-private key kind, is too slow for bulk data and appears in this story only around the edges, wrapping keys and authenticating parties.

Modes determine how AES, which natively scrambles single 16-byte blocks, extends to real data, and one distinction here does enormous work. AES-CTR, counter mode, encrypts efficiently and provides confidentiality only: an attacker cannot read the data but can flip bits and splice sections without detection. AES-GCM, Galois/Counter Mode, provides authenticated encryption: confidentiality plus an integrity tag, so any tampering, truncation, or splicing is detected at decryption. Modern designs default to GCM, and when you see a format offer a CTR variant, it is a deliberate performance-versus-integrity trade for specific situations. Authenticated modes require a nonce or initialization vector, a never-reused-per-key value, whose correct handling is one of those details specifications exist to get right so you cannot get it wrong.

AAD, additional authenticated data, extends GCM's integrity beyond the ciphertext: extra context, a filename, a module identifier, that is not encrypted but is bound into the integrity tag, so ciphertext moved to a different context fails to decrypt. Hold this one, it is the elegant trick that stops an attacker from swapping encrypted pieces between files.

Envelope encryption is the architecture everything at scale uses. Encrypting a petabyte directly with one master key is operationally insane, so instead: each file gets its own DEK, data encryption key, generated randomly at write time. The DEK encrypts the data, and then the DEK itself is encrypted, wrapped, by a higher key and stored alongside the data it protects. The wrapping key may itself be wrapped by another, yielding hierarchies: DEKs wrapped by KEKs, key encryption keys, wrapped by a master key. The master key lives in a KMS, key management service, a hardened system, often backed by an HSM, a tamper-resistant hardware security module, that never releases the master key at all: clients send wrapped keys to the KMS and receive unwrapped ones, every operation authenticated, authorized, and audited. The beauty of the envelope: bulk data never moves for key operations, rotating or revoking a master key means re-wrapping small keys, not re-encrypting petabytes, and the KMS audit log becomes the ledger of who could read what, when.

Key rotation is the practice of retiring keys on schedule, limiting how much any single compromised key exposes. Crypto-shredding is rotation's dramatic cousin: destroy a key, and everything encrypted under it becomes permanently unreadable, which converts data deletion, nearly impossible to prove across replicated immutable storage, into key deletion, which is instant and provable, a property privacy regulation made valuable beyond measure.

Finally, the client-side versus server-side axis from the previous section, and the cloud's menu along it: SSE with provider-managed keys, SSE with your KMS keys, which adds your audit and revocation but still decrypts for any bucket-authorized caller, SSE with customer-provided keys, and full client-side encryption, where the storage never sees plaintext. The lakehouse machinery below is the client-side end of that menu, made multi-engine.

The Layers: Where You Can Encrypt, and What Each Buys

With the vocabulary loaded, the design space becomes a clean question of layers, each protecting against more and costing more.

Layer one, transport: TLS on every connection. Table stakes, universally deployed, protects data in motion, and says nothing about data at rest.

Layer two, storage-service encryption: the SSE family. Protects the physical layer, satisfies the baseline checkbox, transparent to everything above, and, per the threat model above, powerless against credentialed access.

Layer three, whole-file client-side encryption: encrypt each object before upload, as an opaque blob. Maximum confidentiality and the death of analytics: an encrypted blob has no readable footer, no statistics, no ranged reads, so every query downloads and decrypts entire files. This layer is for archives and backups, not tables, and its failure at analytics is precisely what motivated the next layer.

Layer four, format-aware encryption: encryption designed into the file format itself, so that the columnar machinery, footers, statistics, selective column reads, pruning, survives. This is Parquet Modular Encryption's layer, and it is where the lakehouse story lives, because it is the only layer that delivers client-side protection and analytical performance simultaneously.

Layer five, field-level and application encryption: individual values encrypted before they ever enter the data platform, by the producing application. Strongest isolation, and the values become opaque to the platform, no filtering, no aggregation, no statistics on those fields, so it suits the narrow tier of ultra-sensitive identifiers, often paired with tokenization, rather than general columns.

The pattern to internalize: each layer up the stack shrinks the set of parties who can see plaintext, and shrinks what the platform can do with the data, and format-aware encryption exists because it bends that trade better than any other point, keeping plaintext away from storage and network while preserving nearly everything analytics needs. Defense in depth means running several layers at once, TLS plus SSE plus format-level for the sensitive tables, and the design work is choosing where each table's requirements land.

Parquet Modular Encryption: The Format-Aware Foundation

Parquet Modular Encryption, developed in the Parquet community with Gidon Gershinsky as its long-time driving force, is the piece that made layer four real, and its design rewards a close look because every property was chosen to preserve exactly what makes Parquet valuable.

The core move: encrypt Parquet's modules, not its file. Each unit of the format, data pages, dictionary pages, footer, indexes, is encrypted independently with AES, GCM by default, after encoding and compression have done their work, order matters, per the compression article. Because modules encrypt independently, the read path survives intact: a reader fetches and decrypts the footer, plans as always, and then fetches and decrypts only the pages the query touches. Selective column reads, predicate pushdown, ranged GETs, the whole economic model of my storage deep dive, all preserved under encryption. That single property is the difference between encryption you can afford on analytical tables and encryption you cannot.

The key model is columnar, and this is where governance enters the format: different columns can be encrypted under different keys. The salary column under one key, the email column under another, the non-sensitive columns under a footer key or left plaintext. A reader possessing only some keys can read exactly those columns, and fine-grained access control acquires a cryptographic enforcement layer beneath the policy layer: even a reader who bypasses every engine and opens the raw file gets only the columns whose keys it holds.

The footer gets special treatment because it is special: it holds the schema, the offsets, and the statistics, and statistics leak, min and max values of an encrypted column are data. Encrypted-footer mode, marked by the PARE magic bytes replacing Parquet's usual signature, encrypts the whole footer under its own key, hiding schema and statistics from keyless readers, while a plaintext-footer variant keeps legacy readers able to see the file's structure and the unencrypted columns, trading some leakage for compatibility. And integrity runs through everything via AAD: each module's encryption binds identifiers of its position, which file, which column, which page, into its authentication tag, so an attacker with storage access cannot splice pages between files, swap one file's column chunk into another, or roll a column back to an older version without decryption failing loudly. Tamper-proofing, not just secrecy.

What the format deliberately does not define is where keys come from: it specifies key metadata fields and leaves key management to the layer above, a modularity that seemed like a gap and turned out to be foresight, because the layer above now exists.

Iceberg Table Encryption: The Envelope Around Everything

Parquet encrypts files. Tables are more than files: they are metadata trees, manifests full of statistics, paths, and structure, and an encrypted table whose metadata is plaintext leaks its shape, its stats, and its history. Iceberg 1.11, released May 19, 2026, closed that gap with table-level encryption, the feature I flagged as a design discussion in my Polaris coverage, now shipped, and its architecture is the envelope pattern executed across a whole table format.

The key hierarchy has three tiers, each earning its place. At the top, a table master key, living in your KMS, referenced by the table property that names it, and never stored in Iceberg at all. In the middle, key encryption keys, KEKs, generated by Iceberg, wrapped by the master key via KMS calls, and stored wrapped inside the table metadata. At the bottom, per-file data encryption keys: every data file, delete file, and manifest gets its own DEK, generated with a secure random source on the workers, used once, and stored wrapped by a KEK in the metadata's key_metadata fields. The division of labor is the envelope pattern's textbook payoff: the KMS is consulted rarely, to unwrap KEKs, not per file, keeping it off the query hot path, rotation of the master key re-wraps KEKs without touching data, and every file's compromise surface is one unique key.

The mechanics then split by artifact. Parquet data files encrypt through native Parquet Modular Encryption, with Iceberg supplying each file's DEK and a unique AAD prefix, so the format-level protections above apply intact. Avro artifacts and the metadata tree, manifests and manifest lists, encrypt through an AES GCM streaming construction, marked by its own AGS1 magic bytes, so the table's structure, statistics, and file inventory are themselves ciphertext at rest. The read path stitches it together: the engine fetches table metadata through the catalog, which returns the metadata location along with the key material the caller is authorized to unwrap, decrypts the manifest list in memory, never on disk in plaintext, plans against the decrypted statistics, and proceeds down to data files with their individual DEKs. Even an attacker holding full bucket access sees only encrypted bytes at every level of the tree.

Note the load-bearing phrase in that read path: through the catalog. Iceberg's encryption is configured via catalog and table properties, currently supported through the REST and Hive catalog paths with Parquet and Avro data formats, and the catalog's role as the broker of key material is not incidental, it is the design's answer to the interoperability problem, which brings us to the heart of the article.

The Interoperability Challenge Set

Here is why encrypted lakehouse tables took years longer than encrypted databases: a database is one codebase holding its own keys, and a lakehouse table is a contract among many engines, from many vendors, in many languages, all of which must now agree not just on bytes but on cryptography, key acquisition, and trust. Walk the challenges one by one, because each shapes the emerging architecture.

Challenge one: every engine must implement everything. An encrypted table is only interoperable if every reader and writer in the estate implements the same encryption spec, the same modes, the same AAD construction, the same key-metadata interpretation, and implements them correctly, because cryptographic near-misses fail closed at best and fail silent at worst. The specs, Parquet Modular Encryption and now Iceberg's table encryption, exist precisely to make this possible, and implementation coverage still rolls out engine by engine: the Java lineage, Spark and Flink, matured first, the C++ and Python paths through Arrow followed, work across Trino and the broader community continues, and any given estate must audit its actual engines and versions against its actual requirements before turning the keys. An encrypted table that one critical engine cannot read is an outage with a compliance certificate.

Challenge two: the N-by-M key management problem. Beyond the crypto, every engine needs to reach your KMS: authentication, authorization, client libraries, per cloud and per vendor. N engines times M key services is the same quadratic monster this series has met at every layer, and the same class of answer is emerging: standardize the interface. Iceberg ships pluggable KMS clients with pre-defined types for the major clouds and a custom client path, and, more strategically, the catalog is stepping into the broker role, engines authenticate once to the catalog, the catalog talks to the KMS, and key material flows through the same governed channel as everything else. Readers of my Polaris article will recognize this as credential vending's sibling: the catalog already brokers short-lived storage credentials per principal per operation, and brokering wrapped table keys through the same authenticated, audited surface is the natural extension, one the community's catalog-side encryption discussions are actively shaping. The endgame worth rooting for: an engine that speaks the REST catalog protocol gets governed access to encrypted tables without ever learning what KMS sits behind them.

Challenge three: maintenance needs keys too. Compaction reads old files and writes new ones, snapshot expiration deletes, manifests rewrite, and every one of those background jobs must decrypt and re-encrypt, meaning the maintenance identity needs key access, wide key access, since it touches everything. This concentrates risk exactly where nobody is watching, and the design response is discipline: maintenance runs as its own principal with its own audited grants, DEKs are regenerated fresh on every rewrite, never reused, and the compaction fleet becomes part of the trust boundary you actively manage rather than an afterthought.

Challenge four: time travel meets rotation. Iceberg's snapshots are immutable and long-lived, and each snapshot's files carry the DEKs of their era, wrapped by the KEKs of their era. Rotate the master key and the envelope saves you, re-wrap the KEKs and history remains readable. Crypto-shred a key, and you have deliberately amputated every snapshot that depended on it, which is sometimes exactly the point, the GDPR erasure made provable, and sometimes a catastrophic surprise, the backup that can never be restored. Encrypted tables demand that key lifecycle policy and snapshot retention policy be designed as one policy, with the unglamorous corollary that your disaster recovery plan now has a second single point of failure: lose the KMS, or lose access to it in the recovery region, and the lake full of perfectly durable ciphertext is a lake full of noise. Key material replication and recovery drills join the runbook, permanently.

Challenge five: what encryption does to the surrounding features. Statistics under encrypted footers are invisible to keyless planners, which is the point, and which means shared services that relied on peeking at files, discovery crawlers, third-party optimizers, cost estimators, must now come through the governed path or go blind. Column-level keys interact with schema evolution, renames and re-additions must not confuse key assignments, the kind of edge the specs and implementations have spent their maturation grinding through. And the boundary with the policy layer needs stating plainly: encryption is not a substitute for RBAC, masking, and row filters, it is the enforcement backstop beneath them, the layer that holds even when the perimeter fails, and mature designs run both, policy for flexibility, cryptography for finality.

Challenge six: sharing across trust boundaries. The lakehouse's proudest trick, one copy of data served to many parties, meets its hardest test when the parties span organizations. Encrypted sharing means key sharing, which means the KMS grant becomes the actual instrument of data sharing, with all the revocation power and audit visibility that implies, per-partner KEKs so that revoking one consumer never touches another, and catalog federation carrying the governed key flow across boundaries. It is early days for this pattern at scale, and it is also the most exciting one on the board, because cryptographic sharing is what finally makes "share the data without trusting the perimeter" a literal statement.

The Design Patterns That Are Emerging

Out of the challenge set, a recognizable set of deployment patterns has formed, and matching your requirements to a pattern beats inventing one.

Uniform table encryption is the baseline pattern and the 1.11 default shape: every file and manifest of a sensitive table encrypted under the table's hierarchy, one master key per table or per domain, catalog-brokered keys, engines none the wiser beyond configuration. It answers the bucket-compromise and sovereignty threat models cleanly and adds the least design complexity, which makes it the right first deployment for most estates.

Column-tiered keys layer Parquet's per-column model on top for the tables where sensitivity is uneven: PII columns under restricted keys, the rest under the table baseline, so that cryptographic access mirrors the classification policy and a data scientist's engine literally cannot decrypt the columns their role excludes. The cost is key sprawl and evolution care, spend it only where classification genuinely demands it.

Key-per-tenant is the multi-tenant platform's pattern: each tenant's slices encrypted under tenant-dedicated keys, making isolation cryptographic rather than merely logical, offboarding a matter of key revocation, and the deletion clauses of contracts satisfiable by crypto-shredding with a KMS audit log as the receipt.

And defense in depth is the meta-pattern wrapping all of them: TLS everywhere, SSE on the buckets because it is free, format and table encryption on the estates that need it, RBAC and masking above, credential vending for storage, key brokering through the catalog, and every layer's audit flowing to the same place. No single layer is the security story. The stack is.

A Worked Example: One Healthcare Table, End to End

Assemble everything with a single concrete deployment: a health-tech company's patient_events table, clinical event records with identifiers, diagnoses, and timestamps, queried by Spark pipelines, a Dremio-served BI tier, and a data science team in Python, under a regulator who will eventually ask for proofs.

Design first, machinery second. The threat model: bucket compromise must expose nothing, the analytics vendor's support staff must be outside the trust boundary for identifiers, and patient erasure requests must be provable. The classification: two identifier columns are the crown jewels, the clinical columns are sensitive, the operational columns are ordinary. That maps to column-tiered keys on top of uniform table encryption.

The key architecture follows the envelope. A table master key is created in the company's KMS with its own IAM policy and audit stream, and its ARN lands in the table's encryption property. Iceberg generates KEKs, wraps them via the KMS, and stores them wrapped in table metadata. Every data file, delete file, and manifest gets its own DEK at write time, wrapped and recorded in key_metadata. The identifier columns additionally encrypt under a restricted column key whose KMS grant lists exactly three principals: the ingestion service, the compliance analytics role, and the maintenance identity. Footer mode is encrypted, PARE magic and all, so even schema and statistics are ciphertext to a keyless reader.

Now run the actors through it. The Spark ingestion job authenticates to the Polaris-based catalog as its principal, receives the table metadata plus the key material its grants allow, and writes: pages encoded, compressed, then encrypted, each file under a fresh DEK, AAD binding every module to its position. The BI tier's engine plans through the catalog the same way, decrypts manifests in memory, prunes on the decrypted statistics, and serves dashboards from the clinical and operational columns, its role holds those keys and not the identifier key, so a support engineer inspecting that engine's environment could never surface a patient identifier, not by policy but by mathematics. The data science notebook, holding only the baseline keys, queries the same table and receives the identifier columns as unreadable, exactly mirroring the masking policy above, now enforced beneath it. The nightly maintenance principal compacts small files, decrypting with old DEKs and re-encrypting outputs under fresh ones, its broad key access logged operation by operation in the KMS trail.

Then the hard days, which is what the design was for. The bucket credential leaks in month seven: the incident review confirms the attacker's haul was ciphertext at every level, data, manifests, statistics, and the disclosure obligations shrink accordingly. The annual rotation lands: the master key rotates in the KMS, the KEKs re-wrap in a metadata-only operation, and not one data file is touched. A patient exercises erasure: their records, isolated by design under a patient-scoped key strategy in the identifier tier, become permanently unreadable when that key is destroyed, and the KMS log of the destruction is the proof the regulator receives, months faster than any storage-level deletion audit could have delivered. And the disaster recovery drill, run because the runbook now demands it, verifies that key material replicates to the recovery region alongside the data, closing the one failure mode that would have turned eleven nines of durability into a perfectly preserved pile of noise.

Nothing in the story required exotic engineering. Every piece was a shipped capability, Parquet Modular Encryption, Iceberg 1.11 table encryption, catalog-brokered access, KMS discipline, composed in the order the threat model dictated. That composition is the whole craft.

A Decision Framework: How Much of This Do You Need?

Compress the article into the triage I walk teams through.

Start from the threat model, stated as sentences with names in them, not from the feature list. "A leaked bucket credential must not expose data" points at format or table encryption. "The platform vendor must not be able to read identifiers" points at column-tiered keys held outside the vendor's reach. "We must prove erasure" points at crypto-shredding and therefore at key granularity aligned to the erasure unit, per patient, per tenant, per contract. "Regulated categories require encryption at rest with customer-managed keys" is often satisfiable at SSE-KMS, read the actual requirement before building past it.

Then size the machinery to the sentences. No sentence beyond perimeter protection: TLS, SSE-KMS, catalog RBAC, credential vending, done, and spend the saved complexity on governance quality. Sentences about storage compromise or sovereignty: uniform table encryption on the sensitive domains, catalog-brokered, with rotation and DR added to the runbook. Sentences about intra-platform trust tiers or provable erasure: add column keys and granular key scoping where the sentences demand, and nowhere else, because every additional key is permanent operational surface. Multi-tenant platform sentences: key-per-tenant from day one, retrofitting tenancy into a shared-key estate is the migration nobody enjoys.

And gate the rollout on the two audits this article kept flagging: engine coverage, every reader and writer in the estate verified against the encryption spec at your versions, and lifecycle coupling, key rotation, snapshot retention, maintenance identity, and KMS recovery designed as one document. Encryption deployed without those audits is not security, it is a scheduled incident. Deployed with them, it is the quiet completion of the open lakehouse's promise.

Questions I Hear Most Often

What does encryption cost in performance? Far less than intuition suggests, thanks to hardware AES: bulk encryption and decryption run at gigabytes per second per core on modern CPUs, and published experience with Parquet Modular Encryption puts typical query overhead in the low single-digit percentages, with the envelope design keeping KMS calls off the per-file path. The honest costs live elsewhere: key management operations, the loss of file-peeking shortcuts, and engineering time. Cycles are the cheap part.

Compress then encrypt, or encrypt then compress? Compress first, always, because ciphertext does not compress, and the formats enforce the right order internally, encoding, then compression, then encryption per module. The corollary from the compression article applies: never layer another compressor over encrypted files expecting gains.

Is this overkill if I already run SSE-KMS and strong RBAC? For many estates, genuinely yes, and I say that as the person who just wrote six thousand words on the machinery. SSE-KMS plus tight IAM plus catalog governance is a defensible posture for data whose threat model ends at the perimeter. Format and table encryption earn their complexity when the model extends further: regulated categories, provable erasure, multi-tenant isolation, sovereignty, or the simple institutional requirement that storage compromise must not equal data compromise. Threat model first, machinery second.

How does this relate to credential vending? They are siblings in one governance architecture, and the pairing is the future I keep pointing at: vending controls who can reach the bytes, encryption controls who can read them, both brokered per-principal through the catalog, both audited in one trail. Vending without encryption trusts the storage perimeter. Encryption without vending sprawls keys. Together, through a catalog like Polaris, they are the complete story of governed access, which is why the catalog communities are where this integration work is happening.

Can I encrypt an existing table? Not in place, immutability forbids it: encryption arrives through rewrite, which in practice means enabling it and letting compaction and lifecycle rewrites migrate the estate, or forcing a full rewrite where urgency demands. Plan it like the compression migrations of the companion article, as maintenance-driven, table-by-table, with the encrypted-and-plaintext coexistence handled by the format's metadata.

What should I watch next in this space? Three fronts. Engine coverage maturing, the boring rollout that determines when "interoperable" is simply true for your stack. The catalog-as-key-broker work deepening across the REST catalog world, which is where the N-by-M problem actually dies. And the sharing frontier, cryptographic cross-organization data products, where the lakehouse's economics and encryption's guarantees combine into something the industry has wanted for twenty years: sharing without perimeter trust. My newsletters track all three weekly.

Closing Thoughts

Encryption was the lakehouse's last unfinished pillar because it was the hardest kind of problem the open data movement takes on: not an algorithm, cryptography solved the algorithms decades ago, but an agreement, a way for many engines under many vendors to share not just bytes and schemas but secrets, safely, with the machinery of keys and trust standardized enough to interoperate and flexible enough to satisfy every regulator's variance. The pieces that closed the gap tell this series' oldest story one more time: a format-level spec matured in the Parquet community, a table-level design shipped through Iceberg's open process, and the catalog layer, the same open governance point that credential vending established, stepping up as the broker that makes it operable at fleet scale.

The practitioner's summary: know your threat model, run defense in depth, let the envelope pattern and the catalog carry the key management, respect the operational couplings, rotation with retention, KMS with disaster recovery, maintenance with trust, and treat the interoperability rollout as the deployment gate it is. Do that, and the lakehouse's proudest properties, one copy, many engines, open formats, no perimeter of lock-in, now extend to its most sensitive data, which is exactly the data the next decade's AI systems most need governed access to.

If you want these foundations at full depth, from the formats and catalogs through the governance and AI systems above them, that is what my books are for. I co-authored Apache Iceberg: The Definitive Guide and Apache Polaris: The Definitive Guide for O'Reilly, with further titles on lakehouse architecture, data engineering, and agentic analytics.

Browse the full collection of my books on data and AI at books.alexmerced.com.

A Deep Dive Into File Compression: How Data Gets Smaller, Why Codecs Differ, and What to Actually Use in the Lakehouse

Alex Merced — Mon, 13 Jul 2026 23:51:36 +0000

Somewhere in your data platform right now, a single configuration property is quietly deciding a meaningful percentage of your storage bill, your query latency, and your compute spend. It is probably set to whatever the defaults were in 2019, nobody has looked at it since, and it is the compression codec.

Compression is the most consequential invisible decision in data infrastructure. Every Parquet file in your lakehouse, every message crossing your network, every backup in your archive passed through a compressor, and the choice of which one, at which setting, ripples through everything downstream: bytes stored, bytes transferred, requests billed, CPU burned on every read for the life of the data. Yet most engineers' working knowledge of the topic amounts to a vague ranking, gzip is old, Snappy is fast, Zstandard is good, without the mechanics that would let them reason about a new situation.

This article fixes that. We will build the theory from the ground up, in plain language: why data compresses at all, why nothing compresses everything, and the two great families of technique that every modern codec combines. Then the codec lineup itself, gzip, bzip2, LZMA, Snappy, LZ4, Zstandard, Brotli, each with its design center and honest trade-offs, and why one of them effectively won the decade. Then the layer my readers live in: how compression actually works inside the lakehouse stack, Parquet pages, columnar encodings versus codecs, splittability history, hardware acceleration, and the economics on object storage. And finally the practical playbook: what to set, when to deviate, and how to measure. As always in this series, the goal is that the logic clicks, so the next codec announcement or benchmark chart explains itself.

Why Data Compresses at All: Redundancy and the Pigeonhole

Start with the foundation, because two ideas from information theory explain every codec ever written.

The first idea: compression is the removal of redundancy, and redundancy is predictability. A string of a thousand zeros is extremely predictable, so it can be described in a few bytes: "a thousand zeros." A file of truly random bytes is perfectly unpredictable, so no description of it can be shorter than itself. Real data lives between these poles, and almost all of it lives far toward the predictable end: text repeats words, logs repeat templates, sensor readings drift in small steps, columns of a table repeat values and patterns endlessly. Claude Shannon formalized this as entropy, the true information content of data measured in bits, and entropy is the hard floor: no lossless compressor can beat it on average. Everything a codec does is an attempt to find the predictability in your bytes and stop paying to store what could be predicted.

The second idea keeps everyone honest: the pigeonhole principle guarantees there is no universal compressor. Any algorithm that shrinks some inputs must expand others, because there are simply fewer short descriptions than long inputs. This is why compressing an already-compressed file, or an encrypted one, which is deliberately indistinguishable from random, gains nothing and often loses a little. It is also why codecs are portfolios of assumptions about what real data looks like, and why matching the assumption to your data is the whole game. Every technique below is a bet on a specific kind of predictability.

One boundary before we proceed: this article is about lossless compression, where decompression reproduces the original exactly, because analytics demands it. The lossy world of JPEG and video, which discards information human senses will not miss, is a different discipline, and the closest analytics comes to it is deliberate, schema-level choices like reduced-precision floats, decisions made by engineers, never by codecs.

The Two Great Families: Finding Repeats and Pricing Symbols

Nearly every general-purpose codec in existence is a combination of two techniques, invented decades ago and refined ever since. Understand both and you can read any codec's documentation fluently.

Family one: match-based compression, the LZ family. The insight, from Lempel and Ziv in 1977, is beautifully simple: data repeats itself, so instead of storing a repeat, store a pointer to the previous occurrence. The compressor slides through the input keeping a window of recent history, and whenever the next bytes match something already seen, it emits a reference, "go back 3,041 bytes and copy 27," instead of the bytes themselves. A log file where every line shares a timestamp prefix and a template becomes mostly pointers. The knobs of the LZ family follow from the mechanics: a bigger window finds more distant repeats at more memory cost, more effort searching for the longest match buys ratio at compression-time CPU, and decompression is gloriously cheap regardless, just copying bytes the pointers indicate, which is why LZ decompression speed is measured in gigabytes per second and why the family dominates read-heavy workloads.

Family two: entropy coding, pricing symbols by frequency. After matching, what remains is a stream of symbols, literal bytes and match instructions, and they are not equally common. Entropy coding assigns short codes to frequent symbols and long codes to rare ones, squeezing the stream toward its Shannon floor. Huffman coding, from 1952, does this with whole-bit codes, elegant and fast and slightly wasteful because real frequencies want fractional bits. Arithmetic coding achieves those fractional bits and was long too slow for mainstream use. The modern breakthrough is ANS, asymmetric numeral systems, a 2010s invention that delivers arithmetic-coding compression at Huffman-like speeds, and its arrival is the single biggest reason the current codec generation beats the previous one. When you hear that Zstandard uses finite state entropy, that is ANS at work.

Almost everything you will ever use is these two stacked: LZ matching to remove repeats, entropy coding to price what remains. DEFLATE, the algorithm inside gzip and ZIP, is LZ77 plus Huffman, vintage 1993. Zstandard is a modern LZ plus ANS. The exceptions prove the rule: bzip2 built on a different transform entirely, and the columnar encodings we will meet later skip the general machinery for something more surgical. But as a mental model, "find the repeats, then price the symbols" is ninety percent of the field.

Watch a Codec Work: One Log Line, Step by Step

Theory lands best with bytes on the table, so let me run a concrete miniature: compressing three lines of a web server log, the kind of data every reader owns.

2026-07-13 10:41:07 GET /api/orders 200 8ms
2026-07-13 10:41:07 GET /api/orders 200 11ms
2026-07-13 10:41:09 GET /api/users 200 6ms

The matcher goes first, sliding through the bytes. Line one is virgin territory, nothing to point at, so it passes through as literals, and the window begins filling. Line two is where the design earns its keep: the matcher finds that the next forty-odd characters, the timestamp, the method, the path, the status, are an exact repeat of bytes it just saw, and emits a single instruction, go back 44 bytes, copy 41, followed by the few literal characters that differ, the "11ms." One pointer replaced most of a line. Line three matches in fragments: the date and hour match at distance 88, "GET /api/" matches, "200" matches, and the novel pieces, the "09" seconds, "users," "6ms," ride as literals between pointers. Already the intuition generalizes: templated data, which is most machine-generated data, is a thin stream of genuinely new bytes threaded through a lattice of repeats, and the matcher converts the lattice into cheap references.

The entropy coder goes second, over the stream the matcher produced: literals, match lengths, match distances. It counts frequencies and prices accordingly. The digit characters, spaces, and slashes that dominate the literals get short codes, rare bytes get long ones, and the match instructions themselves get frequency-priced, since real data repeats at characteristic distances, the width of a log line, the size of a record, and the coder learns those habits. In a DEFLATE-era codec this pricing is Huffman, whole bits per symbol. In a modern codec it is ANS, fractional bits, the same idea priced more precisely. On real log files this two-stage stack routinely lands ten-to-one or better, and now you know exactly where the ratio comes from: the matcher removed the template, the coder discounted the residue.

Two footnotes make the miniature honest. First, the columnar counterpoint: if these logs were parsed into a table, timestamp column, path column, status column, the encodings would beat the general codec at its own game, the status column becoming a run-length whisper, the timestamps delta-encoding to near nothing, which is the structural-knowledge advantage in action and the reason parsed beats raw in every lakehouse. Second, the failure mode: run the same machinery over an encrypted or already-compressed payload and the matcher finds no repeats, the coder finds flat frequencies, and the output grows slightly, the pigeonhole principle collecting its due. Codecs are redundancy hunters, and they can only catch what the data actually contains.

The Lineup: Seven Codecs and What Each Is For

Now the codecs themselves, presented as design centers rather than a leaderboard, because each one is the right answer to a question.

gzip, DEFLATE. The 1993 workhorse and still the lingua franca of the web and of interchange. Moderate ratio, moderate speed, universally implemented, and thoroughly outclassed on every axis by modern codecs except ubiquity. Its design center today is compatibility: when the other side might be anything, gzip works. In analytics it survives mostly as legacy Parquet settings and CSV archives, and both deserve migration.

bzip2. The 1990s ratio champion, built on the Burrows-Wheeler transform, a clever reordering that groups similar contexts together before entropy coding. Better ratios than gzip, painfully slow both directions by modern standards, and historically notable in big data for being splittable, a property whose significance we will unpack shortly. Its design center is now history.

LZMA, xz, 7-Zip. The maximalist: enormous windows, exhaustive matching, range coding, delivering the best ratios of the pre-modern era at brutal compression cost and slow decompression. Design center: cold archives where bytes matter and access is rare, and even there, modern Zstandard at high levels has eaten most of its lunch.

Snappy. Google's 2011 speed play and the codec of the Hadoop generation: LZ matching with no entropy coding at all, sacrificing ratio for blistering speed and, decisively for its era, low CPU on clusters where compute was the bottleneck. It became Parquet's long-time default, which is why so many lakehouses still run it. Design center: real-time paths where CPU is scarcer than storage, a trade whose terms have shifted dramatically since.

LZ4. Snappy's philosophy perfected: the fastest mainstream LZ, with decompression at multiple gigabytes per second per core, plus a high-compression mode that spends write-time effort for the same instant reads. Design center: anywhere latency dominates, in-memory compression, RPC payloads, caches, write-ahead logs, and streaming buffers, including Arrow IPC compression, where it shines.

Zstandard, zstd. The one that won, and worth its own section below.

Brotli. Google's web specialist: DEFLATE-family matching plus a modern entropy coder plus a built-in dictionary of web-common strings, tuned for compressing text assets once and serving them millions of times. Design center: the browser path. In analytics it appears occasionally and rarely beats Zstandard where both are available.

Honorable mentions complete the map: zlib-ng and igzip as accelerated DEFLATE for the compatibility-bound, and the domain specialists, from log-structured compressors to genomics codecs, that reinforce the pigeonhole lesson: knowing your data beats general cleverness.

Zstandard: Why the Decade Has a Default

Zstandard, released by Yann Collet at Facebook in 2016, from the same author as LZ4, deserves the deep look because it is the correct default answer to most compression questions in 2026, and knowing why makes you better at spotting the exceptions.

The technical core is the modern stack executed superbly: a strong LZ engine with large-window support, and finite state entropy, the ANS realization that closed the gap between Huffman speed and arithmetic ratios. The result redrew the trade-off curve rather than picking a point on it: at low levels, Zstandard approaches LZ4 speeds while compressing better than gzip ever did, and at high levels it approaches LZMA ratios at a fraction of the cost, with decompression staying fast, several hundred megabytes to gigabytes per second per core, across the entire range. One codec now spans what previously required three.

Three features turn the codec into a toolkit. The level dial, one through twenty-two, is a genuine single-knob policy instrument: hot data at level three, warm data at level six, archives at level nineteen, same format, same decompressor, no re-tooling. Long-distance matching extends the window to hundreds of megabytes, letting it exploit repeats across huge files, a gift for logs and backups. And trained dictionaries solve the small-payload problem: compress a thousand tiny JSON messages independently and each is too short to self-describe its own redundancy, but train a dictionary on a sample of them once, and every message compresses against that shared context, routinely tripling effectiveness on small records, the trick behind efficient message queues and key-value stores everywhere.

The ecosystem verdict followed the engineering: Zstandard is now a first-class or default codec in Parquet and ORC settings, in Kafka, in Arrow IPC, in package managers, filesystems, and browsers. When this article says "the modern default," it means zstd, and the burden of proof now rests on deviating from it.

Sixty Years in Five Moments

A compressed history of compression, because the lineage explains the present's shape.

Moment one, 1948 to 1952: Shannon defines entropy and Huffman delivers the first optimal prefix codes, establishing both the floor and the first practical tool for approaching it. Everything since is footnotes to these two, elaborate and valuable footnotes.

Moment two, 1977 to 1978: Lempel and Ziv publish the match-based algorithms that bear their initials, and compression gains its second engine. The LZ-plus-entropy-coding stack assembles over the following decade, culminating in DEFLATE and gzip, whose 1990s vintage still moves a startling fraction of the internet.

Moment three, the 1990s ratio wars: Burrows-Wheeler's transform powers bzip2, LZMA pushes windows and search effort to their limits, and the field's frontier becomes squeezing the last percentage points at any CPU cost, a sensibility suited to dial-up networks and expensive disks.

Moment four, the 2000s speed inversion: Google-scale clusters flip the constraint, CPU becomes the scarce resource, and Snappy and LZ4 answer by abandoning ratio for throughput. The big data generation builds on their trade, and its defaults fossilize into the configs this article keeps asking you to revisit.

Moment five, the 2010s synthesis: Jarek Duda's asymmetric numeral systems dissolve the old speed-versus-precision trade in entropy coding, Zstandard productizes the breakthrough, and one codec spans the whole curve the previous generations divided among themselves. Meanwhile the structure-aware current, columnar encodings, BtrBlocks-style cascades, ALP and FSST, rises alongside, and the 2020s inherit both: a settled general-purpose default and a renaissance in what to do before the general codec ever runs.

The pattern across all five moments is the one this series finds at every layer: constraints flip, defaults fossilize, and the practitioners who understand the mechanics rather than the folklore are the ones who notice when their era's answer has quietly become the last era's.

Encodings Versus Codecs: The Distinction the Lakehouse Runs On

Here the article joins hands with my file-format renaissance piece, because the columnar world adds a second compression vocabulary that must not be confused with the first.

The general-purpose codecs above treat data as anonymous bytes and hunt statistical redundancy. Columnar encodings exploit something stronger: knowledge of structure. A column of a table is not anonymous bytes, it is a sequence of values of one type, and that knowledge enables surgical techniques: dictionary encoding replacing repeated values with small codes, run-length encoding collapsing consecutive repeats, delta and frame-of-reference storing numbers as small differences from a base, bit-packing trimming integers to their true width, FSST compressing strings while keeping each independently readable, and ALP compressing floats through adaptive decimal scaling. My file formats article walks each with examples, and its central lesson bears repeating here: cascades of these lightweight, structure-aware encodings, chosen adaptively per chunk of data, can match heavyweight codec ratios while decoding at memory speed, and sometimes while never decoding at all, since engines can filter dictionary codes and range-check frame-of-reference integers directly.

The two vocabularies compose rather than compete, and the composition order matters. Parquet's classic stack applies encodings first, dictionary, RLE, bit-packing shrink the column using structure, and then runs a general codec, historically Snappy, increasingly Zstandard, over the encoded pages, catching whatever statistical redundancy the encodings left behind. The general codec's contribution shrinks as encodings improve, which is exactly the trend line of the renaissance: the newest formats lean ever harder on encoding cascades and ever lighter on the heavyweight pass, because on modern storage the heavyweight decode cost increasingly exceeds its transfer savings. When you tune a lakehouse, you are really tuning this two-layer stack, and the biggest wins often come from the encoding layer, sorted data run-length encodes spectacularly, low-cardinality columns dictionary-encode to almost nothing, rather than from swapping codecs.

Where Compression Lives in the Stack, and the Splittability Story

Compression is not one decision but several, made at different layers, and mapping them clarifies a decade of folklore.

At the file format layer, Parquet compresses per page within column chunks, with the codec settable per column, a granularity with two enormous consequences. First, selective reading survives: a query touching three columns decompresses three columns' pages, never the file. Second, the old Hadoop splittability problem dissolved. In the era of raw compressed text files, gzip's whole-file streams could not be split across workers, one giant gzip meant one reader, and formats like bzip2 earned their keep by being splittable. Parquet made the question moot by compressing inside an independently addressable structure: row groups and pages are the parallelism units, and the codec inside them is anyone's choice. The lesson survives wherever raw compressed files still roam, CSV and JSON landing zones and log archives, where a single mega-gzip remains a parallelism killer and the fix is either splittable framing or, better, conversion into the columnar world.

At the memory and network layer, Arrow IPC buffers compress with LZ4 or Zstandard for transport, chosen for decompression speed since these bytes are about to be computed on, and RPC and streaming systems make the same latency-first choice. At the storage service layer, some filesystems and services compress transparently underneath everything, a layer best left alone for already-compressed Parquet, since the pigeonhole principle collects its tax on double compression. And at the archive layer, lifecycle policies can recompress cold data at aggressive levels, the same bytes at level nineteen instead of level three, purchasing storage savings with write-once CPU on data whose reads have dwindled.

The map yields a principle worth keeping: compress closest to where structure is known, and choose each layer's codec by what happens to the bytes next, computation wants speed, archival wants ratio, interchange wants compatibility.

The Trade-Off Physics and the Economics

All codec choices reduce to a three-axis trade, ratio, compression speed, decompression speed, and the lakehouse tilts the axes in specific, calculable ways.

The first tilt is asymmetry: analytical data is written once and read many times, often thousands of times, so decompression speed and ratio matter with the full weight of every future read, while compression speed matters once, and mostly to pipeline latency budgets. This is why the LZ family's cheap decompression rules the space, why archives can afford expensive levels, and why "how fast does it compress" is usually the least important number on the benchmark chart, streaming ingestion's tight cycles being the honorable exception.

The second tilt is the object storage economy from my storage deep dive: bytes stored bill monthly, bytes transferred bill per crossing, and requests bill per call. Better ratios shrink all three, which makes compression one of the rare optimizations that cuts storage, network, and request lines simultaneously, and it compounds with everything else: smaller pages mean more data per ranged read, better cache hit rates per gigabyte of NVMe, more of the working set resident everywhere. Against these gains stands decode CPU, and here modern hardware has been generous: current codecs decode so fast, and engines vectorize so well, that on most scan workloads the I/O saved exceeds the CPU spent by a comfortable margin, with the crossover arriving only on the very fastest local storage, which is precisely the frontier where the encoding cascades take over from heavyweight codecs, the renaissance thesis once more.

The third tilt is hardware's ongoing arrival: AES-style dedicated instructions never came for compression, but SIMD did, and the modern codecs exploit it thoroughly, while accelerators go further, Intel's QAT offloading compression entirely on supported platforms, and GPU decompression libraries bringing formats' data directly onto accelerators, a co-design conversation the new file formats are having explicitly. The practical takeaway is humility in benchmarking: codec performance is now a property of the codec, the data, and the silicon together, which is one more reason the only benchmark that matters is yours.

The Practical Playbook

Everything above, compressed into the guidance I actually give.

For lakehouse tables, make Zstandard the default and pick levels by temperature: roughly level three for hot, frequently written data, five or six for the general estate, and if your platform supports recompression during maintenance, let compaction jobs rewrite cooling data at higher levels, the same lever my maintenance sections keep recommending, now applied to bytes. Retire Snappy deliberately rather than reflexively: it still defends real estate on CPU-constrained, latency-critical write paths, but on typical scan-heavy estates, migrating from Snappy to zstd routinely recovers double-digit storage percentages at negligible read cost, and the migration is a compaction pass, not a project.

Exploit the encoding layer before the codec layer. Sort or cluster tables on the columns that matter, low-cardinality and time-adjacent data will collapse under dictionary and run-length encoding, and verify with file inspection tools that your important columns are getting the encodings you expect, the same audit habit my Parquet articles preach for shredding and statistics. The single cheapest ratio improvement in most estates is better data layout, not a better codec.

Respect the special cases. Small independent payloads want trained dictionaries. Already-compressed and encrypted content wants no second pass, store media and archives uncompressed at the Parquet level. Raw text landing zones want splittable handling or fast conversion. Float-heavy and embedding-heavy columns are the current frontier, watch ALP's arrival in your engines, and until then accept that these columns compress modestly.

And measure, on your data, at your access patterns, because everything in this article is a prior, not a verdict. The experiment is cheap: rewrite a representative table under two or three candidate settings, record size, scan latency, and CPU, and let the numbers choose. Data that defies your expectations is the pigeonhole principle sending you a message about structure you have not exploited yet.

The Special Domains: Streams, JSON, Logs, and Vectors

Four data domains come up constantly in questions, and each rewards specific treatment beyond the general playbook.

Streaming messages. Kafka and its kin compress per batch, with the producer choosing the codec, and the modern answer mirrors the lakehouse: Zstandard for the ratio-per-CPU sweet spot, LZ4 where producer latency budgets are brutal. The deeper win is the dictionary trick from the Zstandard section: individual messages are too small to compress well alone, batching solves most of it, and for genuinely small-record paths, key-value stores, per-message encryption contexts, a trained dictionary shared between producer and consumer routinely multiplies effectiveness. And remember the stack view: messages compressed in flight land in the lakehouse, decompress once, and get re-compressed into Parquet's page structure, each layer choosing by what happens to the bytes next.

JSON and semi-structured payloads. Raw JSON compresses deceptively well, the keys repeat endlessly and the matcher feasts, which tempts teams into the string-column pattern my variant article buried. Resist the temptation with the full argument: a general codec shrinks JSON's bytes and preserves its parse cost, every query still decompresses and parses everything, while the variant encoding with shredding restructures the data so queries skip both. Compression is not a substitute for structure. It is what you do after structure.

Logs and text. The domain where long-range matching shines, since log files repeat across megabytes, and where the splittability ghost still haunts: the multi-gigabyte gzip in the landing bucket remains 2026's most common self-inflicted parallelism wound. The pattern that works: land raw text with splittable framing or modest file sizes, convert promptly to tables, and let the archive tier recompress the raw originals at aggressive levels for compliance retention.

Embeddings and floats. The honest frontier. High-entropy by nature, float vectors resist general codecs almost entirely, single-digit percentage gains are typical, and the real progress is structural: ALP-style encodings for the float columns that hide decimals, fixed-size layouts that at least make vectors cheap to read and GPU-friendly, and, where the application tolerates it, deliberate precision reduction chosen by engineers, float32 to float16 or quantized forms, which is the one place a lossy-flavored decision legitimately enters the analytics stack, made at the schema, never in the codec.

Benchmark Like You Mean It

Since the whole article keeps ending at "measure on your data," here is how to make that measurement worth trusting, because bad compression benchmarks are an industry pastime.

Test on real data, never on synthetic. Generated data has artificial redundancy, uniformly random data has none, and both lie in different directions. Sample actual production files, whole row groups, not handcrafted snippets, and include your ugliest tables, the wide one, the JSON-heavy one, the float-heavy one, because the average hides exactly the columns that dominate cost.

Measure all three axes plus the one everyone forgets. Ratio, compression speed, and decompression speed are the standard trio, and the fourth is end-to-end query latency on representative queries, because page sizes, cache behavior, and I/O patterns interact with codecs in ways microbenchmarks miss. Run decompression measurements at realistic parallelism, single-threaded decode numbers flatter nobody's production reality, and on the hardware class you actually deploy, since SIMD generations move these numbers materially.

Control the layout variable. A codec comparison across differently sorted or differently encoded files measures layout, not codecs, so hold encodings and sorting constant when comparing codecs, then run the layout experiment separately, and expect, per the worked example below, that the layout experiment wins. Finally, report costs in money where you can: bytes stored per month, requests per scan, CPU-seconds per query, converted at your actual prices, because "eight percent better ratio" and "four thousand dollars a month" are the same fact in different languages, and only one of them survives the budget meeting.

An afternoon of this discipline, once a year or whenever a new codec generation lands, is among the highest-return maintenance rituals a platform team owns.

A Worked Example: One Table, Three Regimes

Make it concrete with a composite from the field: a two-terabyte events table, currently Parquet with Snappy defaults from its 2020 birth, scanned heavily by BI and fed daily by batch.

Regime one, the inherited default, baselines at two terabytes stored and a known scan profile. Regime two, the modern default: a compaction pass rewrites to Zstandard level five with the same layout. The table lands around thirty percent smaller, in line with typical Snappy-to-zstd migrations, storage and egress lines drop proportionally, ranged reads carry more data per request, and scan latency improves slightly, the extra decode CPU more than repaid by the I/O saved. Total effort: one maintenance job and a config change. Regime three, the layout-aware rewrite: the same pass adds sorting on the two columns every dashboard filters by. Now the encoding layer wakes up, run-length and dictionary encodings collapse the sorted columns, statistics tighten so pruning skips more row groups, and the combined effect lands the table at roughly half its original size with materially faster filtered scans. The codec change was worth real money. The structure change was worth more, and the two together, chosen in an afternoon, will pay every single day the table lives.

That is compression in the lakehouse in one story: a default worth updating, a layout worth more than a codec, and a payoff that compounds across storage, network, requests, and every future read.

Questions I Hear Most Often

Is there ever a reason to store lakehouse data uncompressed? Almost never for tabular data, the read-side economics are too lopsided, with two exceptions: content that is already compressed, media, archives, encrypted payloads, where a second pass wastes CPU to gain nothing, and extreme-latency serving tiers on local NVMe where decode time is genuinely visible, which is exactly the niche the compute-on-encoded formats are built to close without surrendering the bytes.

Why did Snappy dominate for so long if Zstandard is better? Because Snappy was the right answer to its era's constraint: Hadoop-generation clusters where CPU was the bottleneck and storage was locally attached and comparatively cheap. Zstandard arrived after the constraint inverted, cloud object storage made bytes and requests the cost and CPU abundant, and defaults simply outlive their eras. Your 2019 configs are not wrong, they are fossils, and fossils are honorable things to replace.

Do higher Zstandard levels slow down my queries? Barely, and that is the design's quiet triumph: decompression speed stays roughly flat across the level dial, the levels buy ratio with compression-time effort, not read-time effort. The practical ceiling on levels is write and compaction budget, not query latency, which is what makes recompress-when-cold such a clean policy.

Should different columns get different codecs? The capability exists and the better version of the idea usually lives one layer down: different columns want different encodings, which good writers choose automatically, while a single sensible codec over the top keeps operations simple. The exception worth taking: columns of pre-compressed or high-entropy content, where disabling the codec avoids paying for nothing.

How does compression interact with encryption? Order is everything: compress first, then encrypt, because encrypted bytes are designed to look random and random bytes do not compress. The lakehouse formats get this right internally, Parquet encrypts pages after encoding and compression, and the full story, including what encryption does to statistics and interoperability, is exactly the subject of this article's companion piece on lakehouse encryption.

Will AI workloads change compression? They already are, in two directions. Their data, floats, embeddings, tensors, drove the new encodings like ALP and the fixed-size layouts, and their hardware, GPUs consuming data directly, is driving decompression onto accelerators and formats toward GPU-decodable designs. Compression research, dormant-seeming for years, is a live frontier again precisely because the workloads changed, which is the file format renaissance told from the bytes up.

Closing Thoughts

Compression is where information theory pays the cloud bill: a sixty-year lineage from Shannon's entropy through Lempel-Ziv's pointers and Huffman's codes to ANS and adaptive encoding cascades, all of it operating silently every time your lakehouse reads a page. The field looks settled from a distance and is anything but: the codecs consolidated onto a brilliant modern default, the structure-aware encoding layer is where innovation moved, and the hardware underneath is redrawing the trade-offs one more time. The practitioner's summary is almost embarrassingly simple, zstd by default, layout before codec, measure on your data, and the understanding behind it is what lets you know when your case is the exception.

If this way of building understanding works for you, it is what my books do at full depth. I co-authored Apache Iceberg: The Definitive Guide and Apache Polaris: The Definitive Guide for O'Reilly, with further titles on lakehouse architecture, data engineering, and agentic analytics.

Browse the full collection of my books on data and AI at books.alexmerced.com.