DEV Community

Alex Merced

My Non-Fiction Library: Books on Data Lakehouses, Apache Iceberg, AI, and Beyond

I have spent the last several years writing, teaching, and speaking about data architecture, open table formats, and AI-assisted development. Along the way, I published nine non-fiction books that cover the full spectrum from deep technical references to broad intellectual histories. This post walks through each one, explains what it covers, who it is for, and where it fits in the larger story of modern data engineering.

The common thread across the technical books is open standards. Apache Iceberg, Apache Arrow, Apache Polaris, and the Model Context Protocol (MCP) all appear repeatedly. Each book approaches these technologies from a different angle: some go deep into architecture, some focus on hands-on code, and one connects it all to autonomous AI agents. If you work in data, at least one of these books addresses a problem you are dealing with right now.

The Technical Core: Data Lakehouse and Apache Iceberg

Apache Iceberg: The Definitive Guide (O'Reilly, 2024)

Full title: Apache Iceberg: The Definitive Guide: Data Lakehouse Functionality, Performance, and Scalability on the Data Lake

This is the O'Reilly reference on Apache Iceberg. At 341 pages, it covers the full Iceberg specification: snapshot isolation, partition evolution, schema evolution, time travel, hidden partitioning, and the mechanics of how metadata files, manifest lists, and data files work together under the hood.

Traditional data architecture patterns force you to ETL data into each tool separately. That is expensive and fragile. Iceberg provides the capabilities, performance, and scalability that fulfill the promise of an open data lakehouse. This book shows you how.

Who it is for: Data engineers and architects who want a thorough reference on the Iceberg table format. If you need to understand not just how to use Iceberg but why it works the way it does, this is the book.

What you will learn:

  • How Iceberg's metadata tree works and why it enables atomic operations at scale
  • Snapshot isolation and how concurrent readers and writers avoid conflicts
  • Partition evolution and why you never have to rewrite data when your partitioning strategy changes
  • Hidden partitioning and how Iceberg eliminates the partition column problem that Hive tables have
  • Time travel queries and how to roll back tables to previous states
  • Performance tuning with file compaction, sort ordering, and metadata pruning
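The metadata tree and time travel in the bullets above can be pictured with a toy model. This is a conceptual sketch in plain Python, not the real Iceberg format or API: table metadata holds a list of immutable snapshots, each commit atomically swaps the current-snapshot pointer, and time travel is just reading through an older snapshot.

```python
from dataclasses import dataclass, field

# Toy model of Iceberg's metadata tree: table metadata -> snapshots ->
# data files. In real Iceberg, snapshots point to manifest lists, which
# point to manifests, which point to data files; this sketch collapses
# that chain into a single list per snapshot.

@dataclass
class Snapshot:
    snapshot_id: int
    timestamp_ms: int
    data_files: list

@dataclass
class TableMetadata:
    snapshots: list = field(default_factory=list)
    current_snapshot_id: int = -1

    def commit(self, snapshot_id, timestamp_ms, data_files):
        # Each commit appends an immutable snapshot and atomically
        # swaps the current pointer; old snapshots remain readable.
        self.snapshots.append(Snapshot(snapshot_id, timestamp_ms, data_files))
        self.current_snapshot_id = snapshot_id

    def scan(self, as_of_ms=None):
        # Time travel: pick the latest snapshot at or before as_of_ms;
        # with no timestamp, read the current snapshot.
        if as_of_ms is None:
            snap = next(s for s in self.snapshots
                        if s.snapshot_id == self.current_snapshot_id)
        else:
            eligible = [s for s in self.snapshots if s.timestamp_ms <= as_of_ms]
            snap = max(eligible, key=lambda s: s.timestamp_ms)
        return snap.data_files

table = TableMetadata()
table.commit(1, 1000, ["file-a.parquet"])
table.commit(2, 2000, ["file-a.parquet", "file-b.parquet"])

print(table.scan())               # current state: both files
print(table.scan(as_of_ms=1500))  # time travel: only the first commit's file
```

Because snapshots are immutable and only the pointer swap is atomic, concurrent readers never see a half-written table state, which is the essence of the snapshot isolation the book explains.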

Get it on Amazon

Free Digital Copy of Apache Iceberg: The Definitive Guide

Architecting an Apache Iceberg Lakehouse (Manning, 2026)

Full title: Architecting an Apache Iceberg Lakehouse: A scalable, open-source data platform

This is the Manning book on Iceberg lakehouse architecture, coming in at 358 pages. Where the O'Reilly book is a format reference, this one is an architecture guide. It walks through how to design a modular, scalable lakehouse using Iceberg as the foundation and then shows where Apache Spark, Apache Flink, Dremio, and Apache Polaris fit into that design.

The key question this book answers: you know Iceberg is the table format, but how do you actually architect a production system around it? Which engine handles which workload? How do you set up your catalog? What does the Medallion Architecture look like when you implement it with real tools?

Who it is for: Architects and senior engineers making technology choices for their data platform. This is the book you read before committing to a lakehouse stack.

What you will learn:

  • How to evaluate and select the right combination of engines, catalogs, and storage for your workload
  • Medallion Architecture implementation patterns with Iceberg
  • Where Spark, Flink, Dremio, and Trino each fit and where they overlap
  • How Apache Polaris works as an Iceberg REST catalog and why it matters for interoperability
  • Production operational patterns for managing Iceberg tables at scale

Get it on Amazon

Apache Polaris: The Definitive Guide (2025)

Full title: Apache Polaris: The Definitive Guide: Enriching Apache Iceberg Data Lakehouses with an Open Source Catalog

Apache Polaris is the open-source catalog that implements the Iceberg REST catalog specification. It is incubating at the Apache Software Foundation and has become a critical infrastructure component for multi-engine lakehouse deployments. This 258-page guide covers Polaris's architecture and features in detail.

The catalog layer is the piece most teams underestimate when building a lakehouse. Without a proper catalog, your Iceberg tables become isolated islands that each engine discovers differently. Polaris solves this by providing a single REST API that Spark, Flink, Dremio, Trino, and any Iceberg-compatible engine can use to discover and manage the same tables.
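That "single REST API" point is concrete: every Iceberg-compatible engine resolves a table through the same route. A minimal sketch (the path shape follows the Iceberg REST catalog specification that Polaris implements; the host and prefix here are made-up examples):

```python
# Hypothetical Polaris endpoint; real deployments configure their own URI.
BASE = "https://polaris.example.com/api/catalog"

def load_table_route(prefix: str, namespace: str, table: str) -> str:
    # Spark, Flink, Dremio, and Trino all resolve a table through the same
    # route, which is what makes the catalog the interoperability layer.
    return f"{BASE}/v1/{prefix}/namespaces/{namespace}/tables/{table}"

print(load_table_route("my_catalog", "sales", "orders"))
```

Whichever engine issues the request, the catalog answers with the same table metadata location, so there is one source of truth instead of per-engine islands.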

Who it is for: Data engineers and architects who need to understand how Iceberg catalogs work and how to integrate Polaris with their existing tools.

What you will learn:

  • Polaris architecture: how it stores metadata, manages namespaces, and handles credential vending
  • Integration with Spark, Snowflake, Dremio, and other engines
  • Security and access control in a multi-engine catalog environment
  • How Polaris compares to other catalog options like AWS Glue and Hive Metastore
  • Deployment and operational patterns for production use

Get it on Amazon

Free Digital Copy of Apache Polaris: The Definitive Guide

The Book on Using Apache Iceberg with Python (2026)

Full title: The Book on Using Apache Iceberg with Python: PyIceberg, PySpark, Datafusion and more

This is a hands-on, code-first guide focused entirely on the Python ecosystem for Apache Iceberg. At 119 pages, it is intentionally compact. Every chapter is built around working code examples and production-ready patterns.

Python has become the dominant language for data engineering, but most Iceberg documentation focuses on Spark and Java. This book fills that gap. You start with the fundamentals of Iceberg's metadata architecture and then work through each major Python tool: PyIceberg for direct table manipulation, PySpark for large-scale processing, PyFlink for streaming, and DataFusion for high-performance analytical queries.

Who it is for: Python developers and data engineers who want practical, code-driven guidance on working with Iceberg tables using Python-native tools.

What you will learn:

  • PyIceberg: creating tables, reading/writing data, managing snapshots, and running maintenance tasks
  • PySpark with Iceberg: reading and writing Iceberg tables from PySpark, handling schema evolution
  • PyFlink with Iceberg: streaming data into Iceberg tables from Flink
  • DataFusion: running analytical queries against Iceberg tables with Apache Arrow-native performance
  • Production patterns for ingestion, compaction, and snapshot management
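The maintenance pattern in the last bullet reduces to simple retention logic. Here is a stdlib sketch of snapshot expiration, the kind of task PyIceberg's maintenance APIs automate for you (the real API differs; this only illustrates the policy):

```python
import time

def expire_snapshots(snapshots, retain_ms, now_ms=None):
    """Keep snapshots newer than the retention window, but never drop the
    most recent one (the current table state must stay readable).
    Conceptual sketch of an expire-snapshots maintenance task."""
    now_ms = now_ms if now_ms is not None else int(time.time() * 1000)
    cutoff = now_ms - retain_ms
    latest = max(snapshots, key=lambda s: s["timestamp_ms"])
    kept = [s for s in snapshots
            if s["timestamp_ms"] >= cutoff or s is latest]
    expired = [s for s in snapshots if s not in kept]
    return kept, expired

snaps = [{"id": 1, "timestamp_ms": 1_000},
         {"id": 2, "timestamp_ms": 5_000},
         {"id": 3, "timestamp_ms": 9_000}]

kept, expired = expire_snapshots(snaps, retain_ms=5_000, now_ms=10_000)
print([s["id"] for s in kept])     # [2, 3]
print([s["id"] for s in expired])  # [1]
```

Expiring old snapshots is what lets the follow-on cleanup delete data files no surviving snapshot references, which keeps storage costs bounded.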

Get it on Amazon

The Big Picture: Lakehouse + AI

The 2026 Guide to Lakehouses, Apache Iceberg and Agentic AI (2026)

Full title: The 2026 Guide to Lakehouses, Apache Iceberg and Agentic AI: A Hands-On Practitioner's Guide to Modern Data Architecture, Open Table Formats, and Intelligent Analytics

At 530 pages, this is the most comprehensive book in the collection. It is an end-to-end reference that connects three major trends: the data lakehouse, Apache Iceberg, and agentic AI. If you want one book that covers the full modern data stack from storage formats to autonomous AI agents, this is it.

What makes it different from the other Iceberg books: it covers the complete Iceberg spec progression from V1 through V4, including the new Variant, Geometry, and Geography types. It also dedicates significant chapters to agentic AI: foundation models, tool use, function calling, the Model Context Protocol (MCP), agent-to-agent protocols (A2A), and building agents that query governed lakehouse data in real time.

The hands-on chapters cover Spark, Flink, Dremio, Trino, and the full Python ecosystem (Polars, DuckDB, PyIceberg, LangChain, DataFusion). There is also practical guidance on streaming and ingestion with Confluent/Kafka, Redpanda, RisingWave, Apache Pulsar, Fivetran, and Airbyte.

Who it is for: Data engineers designing pipelines, architects evaluating lakehouse platforms, and AI engineers building agents that connect to structured data.

What you will learn:

  • Data lakehouse architecture from first principles
  • The full Apache Iceberg spec (V1 through V4)
  • Agentic AI fundamentals: foundation models, MCP, A2A protocols
  • Working code examples for every major engine and Python tool
  • Streaming and batch ingestion patterns at scale
  • Building agentic analytics pipelines with semantic layers and governed SQL access

Get it on Amazon

The Book on Agentic Analytics (2026)

Full title: The Book on Agentic Analytics: Building the Data Architecture for Autonomous AI

This is the conceptual companion to the hands-on guides. At 87 pages, it is a focused manifesto that argues that the AI models are not the bottleneck; the data architecture is.

The core thesis: AI agents failed not because the models were bad, but because they were fed "dark data": unstructured, ungoverned, and inaccessible information scattered across dozens of systems. The book presents a blueprint for the Agentic Lakehouse: a data stack that is Open (Apache Iceberg), Unified (Federation), Accelerated (Apache Arrow), and Intelligent (Semantic Layer).

Who it is for: Technical leaders, data architects, and anyone building the case for why data infrastructure needs to change before AI can deliver on its promise.

What you will learn:

  • Why the AI model is not the problem and the data foundation is
  • The four pillars of an Agentic Lakehouse architecture
  • How federation, semantic layers, and open table formats create the conditions for reliable AI analytics
  • Practical frameworks for evaluating your organization's readiness for agentic analytics

Get it on Amazon

AI-Assisted Development

The 2026 Guide to AI-Assisted Development (2026)

Full title: The 2026 Guide to AI-Assisted Development: Prompt Engineering, Agent Workflows, MCP, Evaluation, Security, and Career Paths for Current and Aspiring Developers

At 676 pages, this is the longest book in the collection. It covers the full stack of AI-assisted development: writing high-signal prompts, building MCP servers, deploying production-hardened AI features with cost controls and security guardrails, and navigating career paths in an AI-augmented engineering landscape.

This book exists because AI-assisted development is moving fast and most existing resources cover a thin slice of it. Prompt engineering guides tell you how to talk to models. Engineering books tell you how to build software. This one covers the overlap: how to be a developer who works effectively with AI agents as co-pilots, from the first prompt to production deployment.

Who it is for: Software developers, data engineers, and engineering managers who want a practical handbook on working alongside AI agents.

What you will learn:

  • Prompt engineering patterns that produce reliable, production-quality code
  • Building MCP servers that connect AI agents to your internal tools and data sources
  • Agent workflows: when to let the agent drive, when to intervene, and how to evaluate output quality
  • Security guardrails for AI-generated code in production environments
  • Cost management strategies for teams using AI coding tools at scale
  • Career path guidance for developers navigating the shift to AI-augmented engineering
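To make the MCP bullet concrete: MCP is built on JSON-RPC 2.0, and invoking a server tool is a `tools/call` request. A minimal sketch of the message an MCP client sends (the envelope shape follows the Model Context Protocol spec; the tool name and arguments here are hypothetical):

```python
import json

def make_tool_call(request_id, tool_name, arguments):
    # JSON-RPC 2.0 envelope for an MCP tool invocation: the method is
    # "tools/call" and the params carry the tool name and its arguments.
    return json.dumps({
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool_name, "arguments": arguments},
    })

# A hypothetical tool exposed by an MCP server in front of a lakehouse:
msg = make_tool_call(1, "query_lakehouse", {"sql": "SELECT COUNT(*) FROM orders"})
print(msg)
```

The server replies with a JSON-RPC response keyed to the same `id`, which is how an agent framework matches tool results back to the request it issued.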

Get it on Amazon

Bonus: Beyond Technology

The technical books represent my professional focus, but I have also published two non-fiction books outside the data and AI space. These reflect personal intellectual interests that I have been thinking about for a long time.

Economic Ideas: From Beginning to Early 2026 (2026)

Full title: Economic Ideas: From Beginning to Early 2026: A Complete History of Economic Thought from Ancient Civilizations to the AI Economy

This 428-page book traces the entire arc of economic thought from the earliest recorded ideas about trade and money through the classical economists, the twentieth-century revolutions in Keynesian and monetarist thinking, and the cutting-edge debates of the 2020s. Across 53 chapters, it moves from the ethics of exchange in ancient civilizations to cryptocurrency and the AI economy.

I wrote this because I have always been fascinated by how economic ideas evolve over time and how today's debates map to arguments that people have been having for centuries. The book connects the dots between ancient trade practices, mercantilism, classical liberalism, Marxism, Austrian economics, behavioral economics, and the emerging questions about how AI and automation reshape labor markets and wealth distribution.

Get it on Amazon

The Field Guide to Libertarianism (2026)

Full title: The Field Guide to Libertarianism: Understanding the Tribes, the Debates, and the Case for a Kinder Freedom

This 306-page book maps the landscape of libertarian thought: Objectivists, anarcho-capitalists, minarchists, classical liberals, and the many tribes in between. It examines the major policy debates within libertarianism and concludes with what I call "Lovatarianism," in which I argue that the future of freedom depends not just on policy positions but on cultural values like kindness, pluralism, and forgiveness.

This one is personal. I have spent years in these intellectual communities, and I wanted to write a guide that explains the internal diversity of libertarian thought honestly, including the parts that frustrate me, while making the case for a version of freedom grounded in empathy rather than just non-aggression.

Get it on Amazon

The Reading Path

If you are wondering where to start, it depends on what you need.

If you are new to data lakehouses and Iceberg: Start with The 2026 Guide to Lakehouses, Apache Iceberg and Agentic AI. It covers everything in one place.

If you already know Iceberg and need an architecture reference: Go to Architecting an Apache Iceberg Lakehouse (Manning) or Apache Iceberg: The Definitive Guide (O'Reilly).

If you work in Python and want code examples: The Book on Using Apache Iceberg with Python is the shortest path.

If you are building AI agents that need data access: Read The Book on Agentic Analytics for the strategic framework, then The 2026 Guide to AI-Assisted Development for implementation patterns.

If you want to set up a catalog for multi-engine access: Apache Polaris: The Definitive Guide covers that specific problem.

If you are interested in ideas beyond technology: The economics and libertarianism books are independent reads that do not require any technical background.

Every book is available on Amazon in paperback and Kindle formats. I write because teaching forces clarity, and clarity is what the data industry needs more of right now. Pick one, dig in, and let me know what you think.
