A bit of a streamlined edition, this month. Lots of interesting links still, but less commentary. You can put that down to me procrastinating on getting my previous blog about Materialized Tables in Apache Flink finished, and leaving myself little time to work on this one :) Not including the detailed narration actually knocks a bunch of time off the preparation---I'd be interested in your feedback as to how much the absence of narration impacts (if at all) your enjoyment of reading it. Let me know in the comments below!
Something that I'm slowly changing is how I categorise links to do with AI. A few months back anything "AI" got its own section. It wasn't much more than a novelty really; certainly not something worth distracting the regular link sections with. But now AI is just part-and-parcel of many people's workflows, a regular component in their toolbox. So where an article is about credibly using AI as part of an existing topic (such as data engineering), I'll file it in that section. (And if this news makes you cross because you abhor anything AI, well, I've got news for you).
Current London 2026 - wanna free ticket? 🎟️
If you're in the UK and interested in Kafka, Flink, Iceberg, etc etc (which, since you're reading this blog post, I assume you at least have a passing interest in) then you might be interested in Current London in May---and I have a free ticket code for you to use! Register with code L-CMP-LDNKafka and it's all yours :)
Analytics
Ben Sykes - Interval-Aware Caching for Druid at Netflix Scale
Dorothée Clerc - How BlaBlaCar PMs use AI to self-serve data
DuckDB 1.5.2 has been released, with support for DuckLake 1.0, even better Iceberg support, and fixes as a result of initial Jepsen testing.
Randy Au - Dashboard rot as org attention grave markers
Ahmed Youssef - Nobody Is Making Decisions With Your Dashboards
🔥 Torsten Grust has published a course about the Design and Implementation of DuckDB Internals
Hamel Husain - The Revenge of the Data Scientist
Data Platforms, Architectures, and Modelling
Antonia Badarau and team at Monzo - A "meshy" approach to Data: Enabling 100+ teams to build Data Models
Justina Bartulevičienė & Benediktas Kazanavičius (Vinted) - Serving Personalised Search Autocomplete
Rishabh Kumar (Airbnb) - Building a fault-tolerant metrics storage system at Airbnb
Matt Lawhon and team at Pinterest - Scaling Recommendation Systems with Request-Level Deduplication
Facundo Agriel (Dropbox) - Improving storage efficiency in Magic Pocket, our immutable blob store
🔥 A couple of interesting posts from the teams at Notion: Enabling Multi-Region Data Systems, and Two years of vector search: 10x scale, 1/10th cost
Nikola Ilic - Data Modeling for Analytics Engineers: The Complete Primer
Chris Gambill - The Medallion Masterclass: Why Knowing the Colors Isn't Enough
Joe Reis - Why Time Matters in Data Modeling
Data Engineering, Pipelines, and CDC
Alexander Goida - Three Kafka S3 Sink Settings for Easier File Processing
Chris Gambill - AI Agents are Failing Your Data Engineers
Sugat Mahanti (Zapier) - Lessons from using the outbox pattern at scale
Couple of good posts from Chris Hillman - Your Data Platform Costs More Than It Should, and Why Your Pipeline Finishes Later Every Month
Jin-won Park (Karrot) - In the AI era where everyone handles data, how has the data team changed over the past year? (original Korean)
Tristan Handy (dbt Labs) - Five things I believe about the future of analytics
Igor Shurmin (Riskified Tech) - Data Exploration for Software Engineers: Evaluating and Integrating External Datasets
Aleksandr Klein (Just Eat Takeaway) - Daedalus and the Data Labyrinth
🔥 An excellent deep-dive from George Zefkilis, looking at PostgreSQL WAL Internals in the context of building a CDC pipeline.
Debezium 3.6.0.Alpha1 and Debezium 3.5.0.Final have been released.
Yaroslav Tkachenko analysed the performance of different technologies for getting data from Postgres into Iceberg.
Leonard Xu looks at good practices when building Large-Scale Lake Ingestion with Flink CDC and Paimon
Real-world details from Nathan Smit of how they've been using Debezium with Oracle for four years, and how they addressed issues with Oracle CDC Replication Lag.
🔥 Yanquan Lv published the announcement of the release of Apache Flink CDC 3.6.0 as well as an excellent Deep Dive into Apache Flink CDC 3.6.0
Jason Ganz & Benoit Perigaud (dbt Labs) - Semantic Layer vs. Text-to-SQL: 2026 Benchmark Update
Kafka and Event Streaming
Zapier - Reducing Kafka connections by 10x with a sidecar pattern
Yunhong Zheng - How Apache Fluss Achieves True Pruning in Streaming Storage
Bibek Maharjan - AI-Driven Autonomous Optimization of Apache Kafka on AWS MSK for High-Volume Financial Systems
Piotr Minkowski - Deep Dive into Kafka Offset Commit with Spring Boot
StrimziCon 2026 is on 3rd June, and the schedule has been published.
Flink
🔥 Robin Moffatt (that's me!) - Materialized Tables in Apache Flink
Yaroslav Tkachenko - Apache Flink: Reading and Modifying Kafka Consumer Offsets Using the State Processor API
Lee Seung-min / Choi Won-yong - Extending Real-time Ad Frequency Capping Aggregation to One Week with Apache Flink + RocksDB Tuning (original)
Katya Gorshkova - Hands-On with Flink --- Part 5: Managing State (previously: 1, 2, 3, 4)
Viktor Gamov digs out the open source toolbox, using Kafka, Flink, Iceberg, Superset and more, in Building a Streaming Lakehouse.
Open Table Formats (OTF), Catalogs, Lakehouses etc.
Gunnar Morling's Hardwood project has had its second beta release, which includes a very cool TUI for working with Parquet files.
Laurent Saint-Félix has written aq - "query and transform Parquet, Arrow IPC, CSV, and NDJSON files using jq-style expressions."
Yusuf Gözübüyük (TOM Tech) - The Performance Improvement Journey in Apache Iceberg Tables
Ved Prakash - Deep Dive into Apache Iceberg Architecture
🔥 CMU-DB tech talk - Kurt Westerfeld & Mark Cusack - Floe: A SQL Compute Service for the Data Lakehouse
Apache Iceberg has moved to "adopt" on the latest Technology Radar from Thoughtworks
Qiegang Long - Preliminary Notes on Open-Source Variant Performance
Steve Loughran - Benchmarking Parquet Variants through Iceberg
Anahita Singla (Picnic) - Leveraging contextual data in real-time analytics with Apache Iceberg
DuckLake version 1.0 has been released, and thus is now deemed production-ready. AFAIK it's only got real support within DuckDB, but do let me know if you see it supported elsewhere. Thoughtworks have marked it as "assess" on their Tech Radar.
A nice hands-on guide for setting up a local playground with Iceberg using Minio and Gravitino
Pedro Holanda describes how DuckLake deals with the small-files problem (often encountered when one starts streaming data to these types of table format). Using Data Inlining in DuckLake, they saw vast performance improvements over the same kind of processing done with Iceberg.
RDBMS
🔥 Ohad Ravid - The Best (Query) Plans of Mice and Men
Radim Marek - PostgreSQL MVCC, Byte by Byte
Simeon Griggs - Keeping a Postgres queue healthy
Thomas Kejser - Joins are NOT Expensive!
Mike Freedman - Introducing TigerFS - a filesystem backed by PostgreSQL, and a filesystem interface to PostgreSQL (Renato Losio wrote an InfoQ article about it)
Nikita Volkov - My 14-Year Journey Away from ORMs
General Data Stuff
Almog Gavra - The Broken Economics of Databases
Kirill Bobrov - The Power of Data Sketches: A Comprehensive Guide
🔥 Gergely Orosz (a.k.a. The Pragmatic Engineer) interviews Martin Kleppmann about the second edition of Designing Data-intensive Applications.
Animesh Kumar - AI-Ready Data vs. Analytics-Ready Data
I'm slightly fascinated by the idea of ggsql, which brings SQL to the world of ggplot2 and the Grammar of Graphics.
Akshat Vig & Andrew Davidson (MongoDB) - Open Source, Community, and Consequence: The Story of MongoDB (InfoQ London 2026)
AI
I warned you previously...this AI stuff is here to stay, and it'd be short-sighted to think otherwise. As I read and learn more about it, I'm going to share interesting links (the clue is in the blog post title) that I find---whilst trying to avoid the breathless hype and slop.
🔥 Joe Reis - Why Electricity (Not Dot-Com) Is the Right AI Analogy. I like this idea from Joe. It also makes me think of the lift-and-shift that folk did with on-premises workloads to VMs in the Cloud, instead of re-architecting properly.
Jason Ganz - A Dispatch from the Jagged Frontier of Analytics Engineering (referencing Ethan Mollick's jagged frontier article from 2023).
Industry legends Mark Russinovich and Scott Hanselman wrote this opinion piece for ACM: Redefining the Software Engineering Profession for AI.
"Without the hiring of early-in-career developers, the profession's talent pipeline will collapse, and organizations will face a future without the next generation of experienced engineers."
🔥 Elena Verna - Confessions of a Millennial in Tech
🎥 Vik Gamov - If Memento was about AI Agents. I watched Memento in preparation for this...I still have no idea what was going on in either 😆
Addy Osmani - Agent Harness Engineering
🔥 Hamel Husain - LLM Evals: Everything You Need to Know
Robin Moffatt - Kicking the Tyres on Harbor for Agent Evals
🔥 Bryan Cantrill - The peril of laziness lost
Adam Jacob - Laziness, Impatience, and Hubris
Alex Woods - Don't Let AI Write For You. (Reminder: I disclose my use of AI, and it's NEVER for writing!)
And finally...
Nothing to do with data, but stuff that I've found interesting or has made me smile.
Mitchell Hashimoto - Ghostty Is Leaving GitHub
Tool
I love the agility with which one can collaboratively work in GDocs, but I also prefer working with plain text and Markdown (or even better, Asciidoc). mist brings the concept of GDocs collaboration to Markdown files. It's pretty neat, and it's now open source.
A useful reminder from Christian Hofstede-Kuhn of Shell Tricks That Actually Make Life Easier (And Save Your Sanity)
Watch/Listen
🔥 A very cool example from the demo-scene: Razor1911
The Internet Archive isn't just about finding webpages that have gone offline---it also hosts tons of media, like this recording of Nirvana Live at Dreamerz 1989-07-08
I love this idea: TrainJazz: Every train, a note.
Nerd
😸 Not all specification drafts published are serious. Meow.
The ways in which one can play Doom continue to increase, with DOOM, played over cURL, and Can it Resolve DOOM? Game Engine in 2,000 DNS Records
HackerNews members share their memories of What was it like in the era of BBS before the internet?
My own memories are around Acorn-based BBSes. My favourite was Arcade BBS. Ah, memories. Fidonet, filebases...good times :)
What's more important than the code that you're writing (or rather, that Claude's writing for you)? Getting it in the right font of course! Shave many a yak and waste plenty of time at Codingfont picking just the right font...