So. Many. Interesting. Links. Not got time for all this? I’ve marked 🔥 for my top reads of the month :)
Data Engineering
🔥 A good article from Andrew Jones on the concept of "shift left"
Useful writeup from Anders Swanson on Iceberg, the Iceberg REST Catalog Specification, and more
Kafka
🔥 Taking out the Trash: Garbage Collection of Object Storage at Massive Scale
KIP-1150: Diskless Topics - Apache Kafka - Apache Software Foundation
ktea - a Kafka TUI client
Behind Sending Millions of Messages Per Second: A Look Under the Hood of Kafka Producer
Benchmarking Kafka: Distributed Workers and Workload topology in OpenMessaging Benchmark
Queues for Kafka, my opinion (see also: no true scotsman)
CDC
🔥 A Deep Dive Into Ingesting Debezium Events From Kafka With Flink SQL
A really good illustration of how CDC can enable low-latency use of data from transactional systems without impacting the OLTP workloads
Best Practices for Flink CDC YAML in Realtime Compute for Apache Flink
Using Debezium and Kafka Connect with Iceberg part I & part II
How Kleinanzeigen used Debezium and Apache Kafka for data migration
Stream Processing
🔥 A good article on using Flink SQL’s MATCH_RECOGNIZE for Real Time Fraud Detection
A new paper discussing Snowflake Dynamic Tables
A Flink CDC Pipeline connector for Apache Iceberg has been added into the project ahead of Flink CDC release 3.4.0
A good writeup of performance optimisations made in Zomato’s Flink data streaming pipeline
Build Kafka Streams Apps Faster with kstreamplify and Spring Boot
A proposal (SPIP) to add a Declarative Pipeline Framework to Apache Spark
Pedro Mazala writes about The case for a Custom Window in Flink
Flash: A Next-gen Vectorized Stream Processing Engine Compatible with Apache Flink
A talk from Liang Wu about LinkedIn’s internal Darwin tool for running Flink SQL in a notebook-like interface
AI
🔥 Fundamental Challenges in Evaluating Text2SQL Solutions and Detecting Their Limitations
Meal planning with AI (not just AI, but Event-Driven, Multi-Agent AI Architectures 😁)
Hands-on MCP Server Deep Dive: Connecting Flink SQL Gateway to the LLM Ecosystem
General Data Stuff
Some interesting articles from LanceDB, including where they see The Future of Open Source Table Formats: Apache Iceberg and Lance, why LanceDB is a suitable table format for ML Workloads, and details of Lance File 2.1: Smaller and Simpler
Slides from a seminar given by Will Deakin using some excellent dataviz to tell us about the UK rail network and its usage
Cloudflare have been busy, acquiring stream processing startup Arroyo, launching managed Apache Iceberg tables, and optimising their tool for migrating data from other providers' object stores into their own
I recently discovered okbob/pspg which is a very nice pager for working with database CLIs such as psql
Details of v3 of LinkedIn’s Nuage tool, which they describe as a control plane for data systems
TigerBeetle recently published a technical overview of the internals of TigerBeetle
Data in Action
A couple of interesting blogs from Salesforce, covering handling a lot of search queries with sub-second latency and their use of Trino for ETL at Petabyte-Scale
Some interesting blogs from Discord (both recently, and in the past), covering various facets of their infrastructure: storage, indexing, processing, and their use of dbt
I really enjoyed this article about how Zillow use knowledge graphs to help people find a house to buy
One of the departments within Amazon built a data lake platform called Nexus around Spark and Hudi (recording)
Klaviyo wrote about the evolution of their event analytics platform to include Clickhouse, having originally built it on Cassandra before adding Kafka and Flink (and optimising it further)
An account from Lyka of their migration from a data warehouse on BigQuery to a lakehouse using Iceberg on S3 with Athena, and a data warehouse on Snowflake
Details of how Adevinta moved from a Medallion-based lakehouse architecture to one built around data contracts and data mesh.
And finally…
Nothing to do with data, but stuff that I’ve found interesting or has made me smile.