<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Danica Fine</title>
    <description>The latest articles on DEV Community by Danica Fine (@thedanicafine).</description>
    <link>https://dev.to/thedanicafine</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1019791%2F0aa326d2-abe8-4286-bf90-0ab1917566db.jpg</url>
      <title>DEV Community: Danica Fine</title>
      <link>https://dev.to/thedanicafine</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/thedanicafine"/>
    <language>en</language>
    <item>
      <title>A Dive into Apache Iceberg™'s Metadata</title>
      <dc:creator>Danica Fine</dc:creator>
      <pubDate>Wed, 08 Oct 2025 07:01:22 +0000</pubDate>
      <link>https://dev.to/thedanicafine/a-dive-into-apache-icebergs-metadata-gpp</link>
      <guid>https://dev.to/thedanicafine/a-dive-into-apache-icebergs-metadata-gpp</guid>
      <description>&lt;p&gt;The promise of the data lakehouse is simple: combine the scalability of data lakes with the reliability of data warehouses. Apache Iceberg™ has emerged as the de facto table format for delivering that promise. But why?&lt;/p&gt;

&lt;p&gt;The answer lies in Iceberg’s robust metadata layer. It’s a structured, versioned system that enables features like time travel, schema evolution, and efficient query planning. This post explores how Iceberg’s metadata architecture works, and why it’s the foundation of reliable, high-performance data operations in the modern lakehouse.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Challenge: Finding Data Reliably in a Massive Data Lake
&lt;/h2&gt;

&lt;p&gt;A data lake may contain billions of files, constantly being updated, merged, or deleted. To query it reliably, you need to be able to answer questions like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Which files belong to a specific table?&lt;/li&gt;
&lt;li&gt;What was the table's state at a particular point in time?&lt;/li&gt;
&lt;li&gt;How has the schema changed?&lt;/li&gt;
&lt;li&gt;Which files are relevant to a query without scanning everything?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Traditional approaches often relied on directory listings, which are slow, inconsistent, and prone to errors. Iceberg solves this with a structured, versioned metadata system and a catalog.&lt;/p&gt;

&lt;p&gt;Let's dive in. &lt;/p&gt;

&lt;h2&gt;
  
  
  Iceberg's Metadata Layer: A Hierarchical View
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fep587iym10a4luop20mm.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fep587iym10a4luop20mm.png" alt="A sample view of Iceberg's metadata layer situated between a catalog and the data layer of your data lake." width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Catalog
&lt;/h3&gt;

&lt;p&gt;The catalog maps table identifiers to their current metadata file, acting as the “address book” for tables. It's the entry point for query and compute engines that want to interact with your Iceberg tables. &lt;/p&gt;

&lt;p&gt;Examples include REST catalogs like Apache Polaris (incubating), Hive Metastore, or custom implementations.&lt;/p&gt;

&lt;h3&gt;
  
  
  Metadata Files
&lt;/h3&gt;

&lt;p&gt;Each commit (transaction) to an Iceberg table generates a new Metadata File. This file contains useful (non-data) information on the table, such as:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Schemas:&lt;/strong&gt; The columns in the table, including a full history of the schema as it’s evolved over time. Field IDs are also stored here to ensure that changes are handled correctly across versions.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Partition specs:&lt;/strong&gt; How the data is physically partitioned, including a full history of the partition specs as they’ve evolved over time. &lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Snapshot IDs:&lt;/strong&gt; A unique ID for each version of the table. Many snapshots can be stored, and you can configure the table to expire snapshots after a certain time has passed or number of snapshots have been accumulated.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Pointers to Manifest Lists:&lt;/strong&gt; Specifically, each snapshot ID maps to a Manifest List file. This is important, because the Metadata File doesn’t list all data files, but rather pointers to Manifest Lists.&lt;/li&gt;
&lt;/ul&gt;
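&lt;p&gt;The chain above can be sketched in a few lines of Python. Everything here is illustrative: the dictionary layouts are hypothetical stand-ins for the idea, not Iceberg's actual file formats.&lt;/p&gt;

```python
# Illustrative sketch of the catalog -> metadata file -> snapshot -> manifest
# list chain. The dictionary layouts are hypothetical stand-ins, not the real
# Iceberg file formats.

catalog = {"db.orders": "s3://bucket/metadata/v3.metadata.json"}

metadata_files = {
    "s3://bucket/metadata/v3.metadata.json": {
        "current-snapshot-id": 1002,
        "snapshots": {
            1001: "s3://bucket/metadata/snap-1001.avro",
            1002: "s3://bucket/metadata/snap-1002.avro",
        },
    },
}

def manifest_list_for(table, snapshot_id=None):
    """Resolve a table name down to the manifest list for one snapshot."""
    metadata = metadata_files[catalog[table]]       # catalog lookup
    sid = snapshot_id or metadata["current-snapshot-id"]
    return metadata["snapshots"][sid]               # snapshot -> manifest list

print(manifest_list_for("db.orders"))        # current snapshot's manifest list
print(manifest_list_for("db.orders", 1001))  # an older snapshot (time travel)
```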

&lt;h3&gt;
  
  
  Manifest Lists
&lt;/h3&gt;

&lt;p&gt;Each snapshot links to a Manifest List, which points to one or more Manifest Files. Summary statistics are aggregated in the manifest list and include row counts, partitions, and min/max values across all files in the snapshot.&lt;/p&gt;

&lt;h3&gt;
  
  
  Manifest Files
&lt;/h3&gt;

&lt;p&gt;Each Manifest File tracks a subset of individual Data Files (e.g., Parquet, ORC, Avro) within a single Iceberg table. For each Data File, it stores:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;File Path:&lt;/strong&gt; The exact location in your object storage (S3, ADLS, GCS).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;File Format:&lt;/strong&gt; Parquet, ORC, etc.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Partition Data:&lt;/strong&gt; Which partition this file belongs to.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Column-level Statistics:&lt;/strong&gt; Min/max values, null counts, value counts for each column within that specific Data File. This is incredibly useful for pruning.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Content (ADDED, DELETED, EXISTING):&lt;/strong&gt; Whether the file was added, deleted, or already existed in this snapshot.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Note: As the table evolves, a Manifest File can be referenced by multiple Manifest Lists.&lt;/p&gt;
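&lt;p&gt;To make the pruning idea concrete, here's a minimal Python sketch of how per-file min/max statistics let a planner skip files. The entry layout is hypothetical, not Iceberg's actual manifest schema.&lt;/p&gt;

```python
# Minimal sketch of stats-based file pruning. The entry layout is hypothetical,
# not Iceberg's real manifest schema: each entry carries per-column min/max
# values for one data file.

manifest_entries = [
    {"path": "s3://bucket/data/f1.parquet", "min": {"region": "AR"}, "max": {"region": "MX"}},
    {"path": "s3://bucket/data/f2.parquet", "min": {"region": "NL"}, "max": {"region": "ZA"}},
]

def prune(entries, column, value):
    """Keep only files whose [min, max] range for `column` could contain `value`."""
    return [
        e["path"]
        for e in entries
        if value >= e["min"][column] and e["max"][column] >= value
    ]

# A filter like region = 'US' only needs to touch the second file.
print(prune(manifest_entries, "region", "US"))
```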

&lt;h2&gt;
  
  
  Why Metadata Matters
&lt;/h2&gt;

&lt;p&gt;By building out the metadata layer in this way, Iceberg gives users a set of powerful features that unlock far more flexibility in how they can use and evolve their data—efficiently.&lt;/p&gt;

&lt;h3&gt;
  
  
  Time Travel
&lt;/h3&gt;

&lt;p&gt;Every commit creates a new metadata file, resulting in a version history of your entire table. Time travel is invaluable for auditing your data and for ensuring that you always know what the result of a query was at any given point in time.&lt;/p&gt;

&lt;p&gt;Want to see the data from a specific snapshot? Or data from last Tuesday? It's easy!&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Query the system metadata table to see the snapshots and the time at which they became current.&lt;/li&gt;
&lt;li&gt;Point to an older snapshot ID in the metadata with &lt;code&gt;VERSION AS OF &amp;lt;SNAPSHOT_ID&amp;gt;&lt;/code&gt; or a specific timestamp by adding &lt;code&gt;TIMESTAMP AS OF '2025-09-30 17:53:01.284'&lt;/code&gt;. &lt;/li&gt;
&lt;/ol&gt;
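&lt;p&gt;Under the hood, the timestamp lookup amounts to finding the last snapshot that became current at or before the requested time. A rough Python sketch, using a hypothetical snapshot log rather than the real metadata structures:&lt;/p&gt;

```python
from datetime import datetime

# Hypothetical snapshot log: (time the snapshot became current, snapshot ID),
# in chronological order. Not the real metadata structure, just the idea.
snapshot_log = [
    (datetime(2025, 9, 28, 12, 0), 9871),
    (datetime(2025, 9, 30, 9, 30), 9872),
    (datetime(2025, 10, 1, 8, 15), 9873),
]

def snapshot_as_of(ts):
    """Return the ID of the snapshot that was current at timestamp `ts`."""
    candidates = [sid for t, sid in snapshot_log if ts >= t]
    if not candidates:
        raise ValueError("no snapshot existed at that time")
    return candidates[-1]

# A query as of 2025-09-30 17:53 resolves to the second snapshot.
print(snapshot_as_of(datetime(2025, 9, 30, 17, 53, 1)))  # 9872
```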

&lt;h3&gt;
  
  
  Schema Evolution without Rewrite
&lt;/h3&gt;

&lt;p&gt;Because Iceberg tracks columns by unique ID (not position), adding, dropping, renaming, or reordering columns is a metadata-only operation. Old schemas are maintained so that you can continue to query your data without issues.&lt;/p&gt;
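&lt;p&gt;Here's a tiny Python sketch of why field IDs make renames safe. The structures are illustrative, not how Iceberg readers are actually implemented.&lt;/p&gt;

```python
# Illustrative sketch of ID-based column resolution (not Iceberg's actual
# reader). Data files store values keyed by field ID; a rename only changes
# the name attached to an ID in the table metadata, so old files stay
# readable without being rewritten.

old_schema = {1: "cust_id", 2: "total"}
new_schema = {1: "customer_id", 2: "order_total"}  # rename: metadata-only

# A row as it was written under the old schema, keyed by field ID.
row_by_field_id = {1: 42, 2: 99.5}

def read_row(row, schema):
    """Project a stored row through whichever schema is current."""
    return {name: row[fid] for fid, name in schema.items() if fid in row}

print(read_row(row_by_field_id, new_schema))
# {'customer_id': 42, 'order_total': 99.5}
```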

&lt;h3&gt;
  
  
  Efficient Query Planning
&lt;/h3&gt;

&lt;p&gt;Query engines don't need to list directories in Iceberg. Instead, they read the current metadata file (an &lt;code&gt;O(1)&lt;/code&gt; operation), then scan manifest lists and manifest files to assemble an efficient query plan. The column-level statistics within manifest files allow for aggressive file and column pruning. If your query only needs &lt;code&gt;customer_id&lt;/code&gt; and &lt;code&gt;order_total&lt;/code&gt; and filters by &lt;code&gt;region='US'&lt;/code&gt;, Iceberg knows exactly which data files and which columns within those files to read, skipping the rest of a potentially wide table.&lt;/p&gt;

&lt;h3&gt;
  
  
  ACID Transactions
&lt;/h3&gt;

&lt;p&gt;Updates, deletes, and merges are managed by creating new snapshots in the metadata, atomically swapping out old data files for new ones. Readers always see a consistent snapshot, preventing dirty reads—crucial for dependable analytics.&lt;/p&gt;

&lt;h3&gt;
  
  
  Open Format, Decoupled Engines
&lt;/h3&gt;

&lt;p&gt;The metadata files themselves are open (JSON, Avro) and self-describing. This allows any Iceberg-compatible compute engine (Apache Spark™, Apache Flink®, Trino, Presto, Snowflake, Dremio, etc.) to read the same table reliably, fostering true vendor-neutrality and interoperability.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Metadata Difference
&lt;/h2&gt;

&lt;p&gt;Iceberg's popularity and power as a table format isn't just in its features, but in the metadata system that makes those features possible. By managing tables through structured, versioned metadata, Iceberg transforms the raw sprawl of the data lake into a reliable, high-performance lakehouse that data engineers and data scientists alike can trust.&lt;/p&gt;

&lt;p&gt;So if you haven't done so yet, check it out and &lt;a href="https://iceberg.apache.org/spark-quickstart/" rel="noopener noreferrer"&gt;get started&lt;/a&gt;!&lt;/p&gt;

</description>
      <category>dataengineering</category>
      <category>iceberg</category>
      <category>database</category>
      <category>architecture</category>
    </item>
    <item>
      <title>The Apache Iceberg™ Small File Problem</title>
      <dc:creator>Danica Fine</dc:creator>
      <pubDate>Wed, 11 Dec 2024 18:13:08 +0000</pubDate>
      <link>https://dev.to/thedanicafine/the-apache-iceberg-small-file-problem-1k2m</link>
      <guid>https://dev.to/thedanicafine/the-apache-iceberg-small-file-problem-1k2m</guid>
      <description>&lt;p&gt;If you've been following Apache Iceberg™ at all, you've no doubt heard whispers about "the small file problem". So what is it? And why does it matter when building the data lakehouse of your dreams? &lt;/p&gt;

&lt;p&gt;You've come to the right place! Let's dive in!&lt;/p&gt;

&lt;h1&gt;
  
  
  Small Files, Big Problem
&lt;/h1&gt;

&lt;p&gt;To start, the small file problem is exactly what it sounds like on the surface. We have some dataset. In the case of Iceberg, our dataset is a bunch of data files bound together through metadata as a single Iceberg table. The issue is that, over time, as we add data to this dataset, we find that it's made up of many small files rather than fewer, bigger ones.&lt;/p&gt;

&lt;p&gt;Having more small files might not sound like a big deal, but it actually has quite a few implications for Iceberg and can ultimately have a negative impact on performance, scalability, and efficiency in a number of ways:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;🗄️ &lt;strong&gt;High Metadata Overhead&lt;/strong&gt;: As we know already, an Iceberg table IS its metadata. So in Iceberg, we're constantly tracking every file in metadata for each table version. More small files increase the size of metadata files and, in turn, the cost of maintaining table snapshots.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;🐢 &lt;strong&gt;Inefficient Query Planning and Execution&lt;/strong&gt;: When it comes time to interact with our data, query engines like Apache Spark, Trino, or Snowflake need to read those many small files, which results in higher I/O overhead, slower data scanning, and reduced parallelism.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;💰 &lt;strong&gt;Costs of Object Storage Operations&lt;/strong&gt;: We've all experienced the frustration of unexpected cloud bills! In cloud object stores like S3 or GCS, frequent API calls for listing or retrieving many small files incur significant latency and cost. &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;🔊 &lt;strong&gt;Write Amplification&lt;/strong&gt;: If you're unfamiliar, write amplification just means that more data is written, touched, or modified than originally intended. So, for Iceberg, many small writes will eventually generate unnecessary work for compaction and cleanup processes down the line.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Now that you know a bit more about it, you can see how the small file problem is actually a problem. But what can we do about it? 🤷‍♀️ &lt;/p&gt;

&lt;h2&gt;
  
  
  Taking Action
&lt;/h2&gt;

&lt;p&gt;The good news is that the broader Iceberg community isn't just sitting on this issue. You just have to know what's out there and how to take advantage of it!&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;🤖 The biggest fix is to eliminate existing small files through compaction and expiring snapshots. Iceberg already has compaction built into Spark through the &lt;code&gt;rewriteDataFiles&lt;/code&gt; Action. The v2 Apache Flink Sink that was released as part of &lt;a href="https://www.youtube.com/watch?v=OP8VOo39Q-E" rel="noopener noreferrer"&gt;Apache Iceberg 1.7&lt;/a&gt; includes support for small-file compaction, as well!&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;⚙️ Check your configs! You can set the target file size during writes in Iceberg with the configuration parameter, &lt;code&gt;write.target-file-size-bytes&lt;/code&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;🔀 Leveraging Copy-on-Write (CoW) mode can help you get around some of the headaches of small files. Keep in mind that CoW means any change to an existing file (think row updates or removals) results in the &lt;em&gt;entire&lt;/em&gt; file being copied over to a new one. This sounds horribly inefficient, but it's only a little inefficient because CoW will consolidate data files where it can, so a rewrite covers more than just one data file. This means more work up front for the writer, but fewer small files and less work for readers later on.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;😵 Speaking of updates and removals, if you're using Merge-on-Read (MoR) mode, then you'll be introducing delete files into the mix to track which rows need to be removed when you query against your data files. For sparse deletes against your dataset, this can result in a number of small delete files floating around. And that's no good! Thankfully, with the upcoming Iceberg v3 table spec, we'll be introducing deletion vectors, which means that multiple files' worth of deletes can be stored in a single file.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;📋 Query engines are also stepping up with smarter query planning that still allows for the existence of small files but optimizes how data is accessed.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
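&lt;p&gt;At its core, compaction planning is just bin-packing. Here's an illustrative Python sketch of the idea; in practice the real work is done for you by actions like Spark's &lt;code&gt;rewriteDataFiles&lt;/code&gt;, not code you'd write yourself.&lt;/p&gt;

```python
# Illustrative bin-packing sketch of compaction planning: greedily group small
# files into rewrite groups of roughly the target size. The real logic lives
# in actions like rewriteDataFiles; this just shows the idea.

def plan_compaction(file_sizes_mb, target_mb=512):
    groups, current, size = [], [], 0
    for f in file_sizes_mb:
        current.append(f)
        size += f
        if size >= target_mb:       # group is big enough: close it out
            groups.append(current)
            current, size = [], 0
    if current:                     # leftover small files form a final group
        groups.append(current)
    return groups

# Ten 64 MB files become two rewrite groups instead of ten tiny reads.
print(plan_compaction([64] * 10, target_mb=512))
```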

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;That was kind of a big post on what's otherwise a... &lt;em&gt;small&lt;/em&gt; problem 😂 . But now you have a better idea of what the small file problem is, why it's important for folks building out a data lakehouse with Apache Iceberg, and what your options are for tackling it.&lt;/p&gt;

&lt;p&gt;If you're interested in more Apache Iceberg content, like, follow, and find me across social media.&lt;/p&gt;

</description>
      <category>bigdata</category>
      <category>apacheiceberg</category>
      <category>datalakehouse</category>
      <category>dataengineering</category>
    </item>
    <item>
      <title>What’s in a name? Making Sense of Apache Kafka’s auto.offset.reset</title>
      <dc:creator>Danica Fine</dc:creator>
      <pubDate>Wed, 22 May 2024 15:17:16 +0000</pubDate>
      <link>https://dev.to/thedanicafine/whats-in-a-name-making-sense-of-apache-kafkas-autooffsetreset-1dn0</link>
      <guid>https://dev.to/thedanicafine/whats-in-a-name-making-sense-of-apache-kafkas-autooffsetreset-1dn0</guid>
      <description>&lt;p&gt;Users of Apache Kafka® have varying opinions on so many things—it’s what makes the community diverse and interesting. But regardless of what your policy on schemas is (they’re obviously good) or whether you believe that adding partitions to a Kafka Topic is okay (it’s definitely not), I’ll bet you can still find some common ground with your fellow users. &lt;/p&gt;

&lt;p&gt;Case in point: pretty much everyone that works with Kafka can agree that &lt;code&gt;auto.offset.reset&lt;/code&gt; is quite possibly the worst named Kafka configuration out there.&lt;/p&gt;

&lt;p&gt;Driven by my frustration as a user and my own curiosity as a developer, I decided to dive into this configuration and try to understand it and its awful name, once and for all. &lt;/p&gt;

&lt;p&gt;But first…&lt;/p&gt;

&lt;h2&gt;
  
  
  Some background on &lt;code&gt;auto.offset.reset&lt;/code&gt;
&lt;/h2&gt;

&lt;p&gt;I strongly believe that this particular origin story will be helpful for anyone, regardless of whether they’ve been using Kafka for years or days. So let’s start by getting everyone on the same page.&lt;/p&gt;

&lt;p&gt;Kafka Consumers. You know them, right? They’re pretty useful in helping you to consume messages from Kafka Topics. As you read messages from a Topic, every so often, the Consumer will record its progress by committing the offset of the last message it saw and processed. This is really helpful so that, if the Consumer goes down for any reason, you can rest assured that, when it comes back online, it can use that stored offset to pick back up from its last committed position. &lt;/p&gt;

&lt;p&gt;Once our Consumers are up and running, offsets and Consumers go hand-in-hand. But what about &lt;em&gt;before&lt;/em&gt; the Consumer has started running? What about brand new Consumers that have never seen a single message from the Kafka Topic? That’s where &lt;code&gt;auto.offset.reset&lt;/code&gt; comes in. &lt;/p&gt;

&lt;p&gt;Consumers that don’t have an offset available to use still need to know where to start reading from in the Topic. To put it simply, &lt;code&gt;auto.offset.reset&lt;/code&gt; is a &lt;a href="https://kafka.apache.org/documentation/#consumerconfigs_auto.offset.reset" rel="noopener noreferrer"&gt;Kafka Consumer configuration&lt;/a&gt;, which can be set to either &lt;code&gt;earliest&lt;/code&gt; or &lt;code&gt;latest&lt;/code&gt; (or &lt;code&gt;none&lt;/code&gt;), that defines where a Consumer should begin reading from in the Kafka Topic &lt;em&gt;when it doesn’t have any other valid offsets to start from&lt;/em&gt;. &lt;/p&gt;

&lt;p&gt;That last bit is really important, and it’s what trips folks up when they first try to use and understand this configuration parameter.&lt;/p&gt;

&lt;h2&gt;
  
  
  So, what &lt;em&gt;is&lt;/em&gt; in a name?
&lt;/h2&gt;

&lt;p&gt;The biggest issue that folks appear to have with &lt;code&gt;auto.offset.reset&lt;/code&gt; is its name, which is a little disappointing because one would think that a three-word configuration would have the potential to be short, sweet, and to-the-point. Alas, &lt;code&gt;sasl.oauthbearer.jwks.endpoint.retry.backoff.ms&lt;/code&gt; appears to both be more descriptive and far less problematic than &lt;code&gt;auto.offset.reset&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;So what does &lt;code&gt;auto.offset.reset&lt;/code&gt; &lt;em&gt;imply&lt;/em&gt;? What does someone using this configuration for the first time assume, erroneously, that it does? I’ve heard a number of variations, but generally folks assume that it automatically resets the stored Consumer offset to &lt;code&gt;earliest&lt;/code&gt; or &lt;code&gt;latest&lt;/code&gt;. &lt;/p&gt;

&lt;p&gt;That… would make sense, right? Except that we all know the frustration, the pain, the anguish, when we set &lt;code&gt;auto.offset.reset=earliest&lt;/code&gt; and realize that our Kafka Consumer is not, in fact, automatically reading from the beginning of the Topic, nor did it reset our offsets. The reality is that, if there's a valid offset, &lt;code&gt;auto.offset.reset&lt;/code&gt; doesn't do a single thing.&lt;/p&gt;

&lt;p&gt;So what gives? Where is this disconnect with &lt;code&gt;auto&lt;/code&gt; and &lt;code&gt;reset&lt;/code&gt; coming from?&lt;/p&gt;

&lt;h2&gt;
  
  
  Auto vs. Manual
&lt;/h2&gt;

&lt;p&gt;Let’s start with &lt;code&gt;auto&lt;/code&gt;. Take a look at the rest of the Kafka Consumer configurations that have &lt;code&gt;auto&lt;/code&gt; in them; they all, truly, have to do with handling something automatically for the Consumer. &lt;/p&gt;

&lt;p&gt;What does &lt;code&gt;auto.offset.reset&lt;/code&gt; do for us automatically? It certainly doesn't appear to be overriding Consumer offsets. But to really make sense of what &lt;code&gt;auto.offset.reset&lt;/code&gt; does for us behind the scenes, we have to think about the manual version of this offset-setting process.&lt;/p&gt;

&lt;h3&gt;
  
  
  A Consumer without an offset
&lt;/h3&gt;

&lt;p&gt;Consider what would happen if you didn’t want to set &lt;code&gt;auto.offset.reset&lt;/code&gt;, but you had a brand new Consumer (or a Consumer whose offsets you've deleted).&lt;/p&gt;

&lt;p&gt;If you read the &lt;code&gt;auto.offset.reset&lt;/code&gt; &lt;a href="https://kafka.apache.org/documentation/#consumerconfigs_auto.offset.reset" rel="noopener noreferrer"&gt;documentation&lt;/a&gt; closely, you’ll find that you don’t actually have to set it. But, when &lt;code&gt;auto.offset.reset=none&lt;/code&gt; and you have no stored offsets, your Kafka Consumer will throw an exception when it tries to poll the Topic. This implies that, oops, our Kafka Consumers need initial offsets to actually read from the Topic.&lt;/p&gt;

&lt;h3&gt;
  
  
  Setting offsets manually
&lt;/h3&gt;

&lt;p&gt;A brand new Consumer or a Consumer without offsets doesn’t have to rely on &lt;code&gt;auto.offset.reset&lt;/code&gt;, though. You have the option to manually set your offsets using &lt;a href="https://kafka.apache.org/37/javadoc/org/apache/kafka/clients/consumer/KafkaConsumer.html#seek(org.apache.kafka.common.TopicPartition,long)" rel="noopener noreferrer"&gt;&lt;code&gt;Consumer.seek()&lt;/code&gt;&lt;/a&gt;—as well as the related methods &lt;a href="https://kafka.apache.org/37/javadoc/org/apache/kafka/clients/consumer/KafkaConsumer.html#seekToBeginning(java.util.Collection)" rel="noopener noreferrer"&gt;&lt;code&gt;Consumer.seekToBeginning()&lt;/code&gt;&lt;/a&gt; and &lt;a href="https://kafka.apache.org/37/javadoc/org/apache/kafka/clients/consumer/KafkaConsumer.html#seekToEnd(java.util.Collection)" rel="noopener noreferrer"&gt;&lt;code&gt;Consumer.seekToEnd()&lt;/code&gt;&lt;/a&gt;. &lt;/p&gt;

&lt;p&gt;&lt;code&gt;seek()&lt;/code&gt; is an in-memory operation that takes in a TopicPartition and an offset and manually overrides the offset that the Consumer has for that partition—even if it currently has no offset for that partition. &lt;/p&gt;

&lt;p&gt;Never heard of &lt;code&gt;Consumer.seek()&lt;/code&gt;? That’s probably because it’s mostly used in manual one-off scripts—at least, that’s been my experience with it.&lt;/p&gt;

&lt;p&gt;Now that you know about the existence of a way to manually set Consumer offsets, doesn’t the &lt;code&gt;auto&lt;/code&gt; in &lt;code&gt;auto.offset.reset&lt;/code&gt; make just a little bit more sense? That is, until we think a bit more about the fact that it doesn’t reset anything…&lt;/p&gt;

&lt;h2&gt;
  
  
  Reset, but only sometimes
&lt;/h2&gt;

&lt;p&gt;Okay, bear with me, here.&lt;/p&gt;

&lt;p&gt;Let’s go back to the official Apache Kafka &lt;code&gt;auto.offset.reset&lt;/code&gt; &lt;a href="https://kafka.apache.org/documentation/#consumerconfigs_auto.offset.reset" rel="noopener noreferrer"&gt;documentation&lt;/a&gt; and focus on a bit that I’ll bet you missed the first time around:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;What to do when there is no initial offset in Kafka &lt;em&gt;or if the current offset does not exist any more on the server (e.g. because that data has been deleted)&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Bingo.&lt;/p&gt;

&lt;p&gt;Unless you have infinite retention set up, data doesn’t live in a Kafka Topic forever. Eventually, it’ll be deleted or compacted. If, for some reason, our Consumer is inactive and the message with the last offset that that Consumer has seen is deleted, then suddenly we have an invalid offset for that Topic—which, when you think about it, is just as bad as having no offset at all because having no offset is just a special case of an invalid offset...&lt;/p&gt;

&lt;p&gt;Gee, it sure would be nice if our Consumer knew where to start from when it comes back online. If we have &lt;code&gt;auto.offset.reset&lt;/code&gt; set up, the Consumer will automatically reset its invalid offset to one of &lt;code&gt;earliest&lt;/code&gt; or &lt;code&gt;latest&lt;/code&gt;, meaning that we don’t have to worry about the Consumer throwing an exception. &lt;/p&gt;

&lt;h2&gt;
  
  
  Automatic and resetting
&lt;/h2&gt;

&lt;p&gt;So where are we with &lt;code&gt;auto.offset.reset&lt;/code&gt; now?&lt;/p&gt;

&lt;p&gt;In the event that there aren’t any valid offsets available for a Kafka Consumer—whether that means no offsets at all or a stale offset from a deleted message—&lt;code&gt;auto.offset.reset&lt;/code&gt; defines how to &lt;em&gt;automatically reset&lt;/em&gt; the invalid offsets and allow the Consumer to continue operating.&lt;/p&gt;
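&lt;p&gt;That whole decision fits in a few lines of Python. This is illustrative logic, not the actual Kafka client code:&lt;/p&gt;

```python
# Illustrative sketch of the decision auto.offset.reset governs (not the
# actual Kafka client code). The policy only kicks in when the committed
# offset is missing or no longer valid on the broker.

def resolve_start_offset(committed, log_start, log_end, policy):
    """Pick the offset a consumer should start from for one partition."""
    if committed is not None and committed >= log_start and log_end >= committed:
        return committed                  # valid stored offset: policy ignored
    if policy == "earliest":
        return log_start
    if policy == "latest":
        return log_end
    raise RuntimeError("no valid offset and auto.offset.reset=none")

# A valid committed offset wins; the policy does nothing.
print(resolve_start_offset(42, 0, 100, "earliest"))   # 42
# The committed offset fell below retention (data deleted): policy resets it.
print(resolve_start_offset(42, 50, 100, "earliest"))  # 50
```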

&lt;p&gt;This probably isn’t what you wanted to hear, but, after all of that, the name &lt;code&gt;auto.offset.reset&lt;/code&gt; does a pretty decent job describing exactly what it does. Kafka users everywhere just needed a little bit more context to understand why this frustrating configuration is called what it’s called.&lt;/p&gt;

&lt;p&gt;So what do you think? Is this explanation enough to make you just a little bit less frustrated at &lt;code&gt;auto.offset.reset&lt;/code&gt;? If not, what would you call it?&lt;/p&gt;

</description>
      <category>kafka</category>
      <category>opensource</category>
      <category>backend</category>
      <category>programming</category>
    </item>
    <item>
      <title>Quick Tips for a Winning Abstract</title>
      <dc:creator>Danica Fine</dc:creator>
      <pubDate>Thu, 18 Jan 2024 18:07:40 +0000</pubDate>
      <link>https://dev.to/thedanicafine/quick-tips-for-a-winning-abstract-dnn</link>
      <guid>https://dev.to/thedanicafine/quick-tips-for-a-winning-abstract-dnn</guid>
      <description>&lt;p&gt;If you’ve kept up with the rest of this blog series, you saw the components that make up a solid conference abstract along with some examples to inspire you. By now, you should have a pretty good idea of how you can get started writing great conference submissions. &lt;/p&gt;

&lt;p&gt;But all of my advice is meant to be a foundation—a scaffolding for you to run with and build something that’s truly your own. You saw some of that in the examples in the previous post. So how do you get to that point? Once you’ve internalized the key information that you need to include in your abstract, it’s time to let your creative juices flow! To make it your own! ... just make sure your abstract is still achieving what you set out to do.&lt;/p&gt;

&lt;p&gt;To help you stay on track and rein you in as you artfully craft your abstract, here are a few dos and don’ts that I’d recommend you keep in mind. &lt;/p&gt;

&lt;h2&gt;
  
  
  A few tips
&lt;/h2&gt;

&lt;p&gt;Note: Some of these tips are going to boil down to personal preference, but I (and other members of my Program Committee) feel that they’re generally best practice:&lt;/p&gt;

&lt;p&gt;❌ &lt;strong&gt;DON’T&lt;/strong&gt; waste precious abstract space describing who you are or what your company is unless it’s absolutely necessary. Let the technical details speak for themselves. Instead, use the speaker biography section to talk about yourself and what you do.&lt;br&gt;
👍 &lt;strong&gt;DO&lt;/strong&gt; cut straight to the point. Be concise and clear in your problem statement and goals.&lt;/p&gt;

&lt;p&gt;👍 &lt;strong&gt;DO&lt;/strong&gt; be yourself and write as yourself. If this means being casual, then be casual. Your ideas are going to come across more clearly and your abstract will read as genuine.&lt;br&gt;
❌ &lt;strong&gt;DON’T&lt;/strong&gt; feel the need to use an artificially academic or potentially alienating tone (unless an academic tone is required by the conference).&lt;/p&gt;

&lt;p&gt;❌ &lt;strong&gt;DON’T&lt;/strong&gt; include links or references to other talks in your abstract. &lt;br&gt;
👍 &lt;strong&gt;DO&lt;/strong&gt; feel free to reference big picture ideas or an author, e.g. Data Mesh and Zhamak Dehghani, but remember to express your own ideas.&lt;/p&gt;

&lt;p&gt;👍 &lt;strong&gt;DO&lt;/strong&gt; break your abstract up into sections and separate paragraphs for readability.&lt;br&gt;
❌ &lt;strong&gt;DON’T&lt;/strong&gt; write a novel. Generally speaking, the abstract is what is printed on the conference agenda. It should be concise and able to be read within a minute or two.&lt;/p&gt;

&lt;p&gt;❌ &lt;strong&gt;DON’T&lt;/strong&gt; overcommit. It’s easy to go overboard and promise the world in your abstract. Keep the length of the conference session in mind, and only include material that you think you can reasonably cover.&lt;br&gt;
👍 &lt;strong&gt;DO&lt;/strong&gt; feel free to add more content later on! Not every detail needs to be included in your abstract in order for you to bring it up during your actual session. While I advise not trimming content, I find it perfectly acceptable to add additional, relevant information.&lt;/p&gt;

&lt;p&gt;❌ &lt;strong&gt;DON’T&lt;/strong&gt; rely exclusively on bullet points to outline your talk in your abstract.&lt;br&gt;
👍 &lt;strong&gt;DO&lt;/strong&gt; try to tell a story. It’s okay to use bullet points where appropriate, but, for the most part, it’s meant to be prose. You’re explaining to an attendee what they can expect and the journey that they’ll go through in attending your talk.&lt;/p&gt;

&lt;p&gt;👍 &lt;strong&gt;DO&lt;/strong&gt; add personal touches... and maybe a pun or two!&lt;br&gt;
❌ &lt;strong&gt;DON’T&lt;/strong&gt; let your flair get in the way of the content.&lt;/p&gt;

&lt;p&gt;❌ &lt;strong&gt;DON’T&lt;/strong&gt; make [too many] assumptions; a lot of this boils down to knowing your audience. If you're unsure, it's always safe to start from 0 and build from there!&lt;br&gt;
👍 &lt;strong&gt;DO&lt;/strong&gt; have peers and friends from a variety of backgrounds review your abstract before submitting. Bonus points if they have absolutely nothing to do with tech and still understand your abstract!&lt;/p&gt;

&lt;h2&gt;
  
  
  A final request
&lt;/h2&gt;

&lt;p&gt;The above list is, by no means, exhaustive. In fact, it took me multiple conference seasons to assemble what's here, and I still feel there's so much more I could share!&lt;/p&gt;

&lt;p&gt;So, with that in mind, now I'm curious to hear from &lt;em&gt;you&lt;/em&gt;. What have you found that works (and doesn't work) when writing abstracts and submitting to a conference? 🤔&lt;/p&gt;

</description>
      <category>writing</category>
      <category>techtalks</category>
      <category>speaking</category>
      <category>learning</category>
    </item>
    <item>
      <title>Learning by Example</title>
      <dc:creator>Danica Fine</dc:creator>
      <pubDate>Mon, 10 Jul 2023 14:24:15 +0000</pubDate>
      <link>https://dev.to/thedanicafine/learning-by-example-i8c</link>
      <guid>https://dev.to/thedanicafine/learning-by-example-i8c</guid>
      <description>&lt;p&gt;In my last post on responding to Calls for Papers, I outlined a few things to consider and some questions to ask yourself to get started with writing an abstract.  While it’s easy for me to say “Consider X…” or “Ask yourself Y…”, I realize that it might still feel a bit fuzzy. To counter that, I want to take that hand-wavy, intangible process and bring it back down to earth with some concrete examples.&lt;/p&gt;

&lt;h2&gt;
  
  
  Abstracts that do it well
&lt;/h2&gt;

&lt;p&gt;Let’s make it real by taking a look at a few abstracts that I and other Kafka Summit and Current Program Committee Members thought were particularly good… And more importantly, &lt;em&gt;why&lt;/em&gt; they were good.&lt;/p&gt;

&lt;h3&gt;
  
  
  The basics
&lt;/h3&gt;

&lt;p&gt;To give you a good starting point, here’s a solid, no-frills abstract on the subject of event-driven architectures.&lt;/p&gt;

&lt;h4&gt;
  
  
  4 Patterns to Jumpstart your Event-Driven Architecture Journey
&lt;/h4&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;&lt;em&gt;The shift from monolithic applications to microservices is anything but easy.&lt;/em&gt;&lt;/strong&gt; Since services usually don't operate in isolation, it's vital to implement proper communication models among them. A crucial aspect in this regard is to avoid tight coupling and numerous point-to-point connections between any two services. One effective approach is to build upon messaging infrastructure as a decoupling element and employ an event-driven application architecture.&lt;br&gt;&lt;br&gt;
During this session, we explore selected event-driven architecture patterns commonly found in the field: &lt;strong&gt;&lt;em&gt;the claim-check pattern, the content enricher pattern, the message translator pattern, and the outbox pattern&lt;/em&gt;&lt;/strong&gt;. For each of the four patterns, we look into a live demo scenario based on Apache Kafka and discuss some variations and trade-offs regarding the chosen implementation.&lt;br&gt;&lt;br&gt;
&lt;strong&gt;&lt;em&gt;You will walk away with a solid understanding of how the discussed event-driven architecture patterns help you&lt;/em&gt;&lt;/strong&gt; with building robust and decoupled service-to-service communication and how to apply them in your next Apache Kafka-based project.   &lt;/p&gt;
&lt;/blockquote&gt;

&lt;h4&gt;
  
  
  What it does well
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;The author uses the opening sentence as a way to connect with the audience over something that most folks can agree on—transitioning between monoliths and microservices is difficult. What’s even better here is that connecting with the people who have gone through this process doesn’t necessarily alienate people who haven’t undergone this process just yet. In my opinion, it’s a solid way to open the abstract.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;After introducing the subject of the session, we see more detail on the event-driven patterns that the talk will cover. The author then goes into more specifics about the technical demo that will be a part of the session and the goals of that demo.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Finally, he summarizes the takeaways that the audience should expect—a &lt;em&gt;must&lt;/em&gt; IMO.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Thanks to Hans-Peter Grahsl, then Developer Advocate at Red Hat, for agreeing to include his abstract in this blog. For more, watch this &lt;a href="https://www.confluent.io/events/current/2023/4-patterns-to-jumpstart-your-event-driven-architecture-journey/" rel="noopener noreferrer"&gt;talk&lt;/a&gt; from Current 2023.&lt;/p&gt;

&lt;h3&gt;
  
  
  Non-standard subject matter... or how to make serious content fun
&lt;/h3&gt;

&lt;p&gt;Next, we’ll mix things up a bit with an abstract covering a less-standard technical subject. Even so, there are some similarities between this and the previous example, and you might start to see a pattern emerge.&lt;/p&gt;

&lt;h4&gt;
  
  
  &lt;em&gt;🎶🎵Bo-stream-ian Rhapsody: A Musical Demo of Kafka Connect and Kafka Streams 🎵🎶&lt;/em&gt;
&lt;/h4&gt;

&lt;blockquote&gt;
&lt;p&gt;You’ve heard of Apache Kafka. You know that real-time event streaming can be a powerful tool to power your project, product, or even company. But beyond storing and relaying messages, what can Kafka do?&lt;br&gt;&lt;br&gt;
&lt;strong&gt;&lt;em&gt;In this talk, get an overview of two key components of the Kafka ecosystem beyond just brokers and clients:&lt;/em&gt;&lt;/strong&gt; Kafka Connect, a distributed ingest/export framework, and Kafka Streams, a distributed stream processing library. Learn about the APIs available for developing and deploying a custom source and sink connector, and for bringing up a Streams application to manipulate the data in between them. &lt;strong&gt;&lt;em&gt;Through a musical demonstration involving Kafka Connect and Kafka Streams, audio will be recorded, distorted, analyzed, and played back–live and in real time.&lt;/em&gt;&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
&lt;strong&gt;&lt;em&gt;Audience members should expect to come away with a good understanding of how to develop Kafka Connect connectors and Kafka Streams applications&lt;/em&gt;&lt;/strong&gt;, as well as some basics of digital signal processing.  &lt;/p&gt;
&lt;/blockquote&gt;

&lt;h4&gt;
  
  
  What it does well
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Before we even get into the abstract, we’re hit with a fun, attention-grabbing title. Who wouldn’t want to attend a talk with a live demo… and a live musical performance? Obviously this has more to do with the subject at hand, but the author could have just as easily used a more technical title without the fun. Making it fun was a stylistic choice. (You should always feel free to add your own stylistic flair to your abstracts; read on for more on this.)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The abstract is then supported with technical details: the talk will go beyond Kafka Brokers and Kafka Clients and instead will focus on the APIs for Kafka Connect and Kafka Streams. He follows this with an outline of the stages of the demo. Cool! Now we know what technical content to expect from the talk!&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Then he brings it home with the final paragraph. Even though this talk is a fun reprieve from the usual, heavy content of a conference, it still has relevant and interesting technical takeaways. Just because you’re presenting something fun doesn’t mean that the audience won’t get something out of attending the session.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This abstract was written by Chris Egerton, then Staff Apache Kafka Open Source Developer at Aiven, and is being showcased here with his permission. (If this talk sounds super cool—and it should—check out the recording from &lt;a href="https://www.confluent.io/events/kafka-summit-london-2024/bo-stream-ian-rhapsody-a-musical-demo-of-kafka-connect-and-kafka-streams/" rel="noopener noreferrer"&gt;Kafka Summit London 2024&lt;/a&gt; to see an awesome live performance!)&lt;/p&gt;

&lt;h3&gt;
  
  
  Adding personality
&lt;/h3&gt;

&lt;p&gt;For our final example, let's consider... what happens when you let your personality shine through in an abstract? Let’s take a look at a session introducing very useful, deeply technical content in an honest, accessible way.&lt;/p&gt;

&lt;h4&gt;
  
  
  Pragmatic Patterns (and Pitfalls) for Event Streaming in Brownfield Environments
&lt;/h4&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;&lt;em&gt;Unlike Greenfield development where everything is new, shiny and smells like cheese sticks most developers must integrate with existing systems, aptly named, brownfield development.&lt;/em&gt;&lt;/strong&gt; Event streaming can be complex enough when starting from scratch; throw in a mainframe, existing synchronous workflows, legacy databases and an assortment of EBCDIC and it can seem impossible. &lt;strong&gt;&lt;em&gt;Using Kafka Streams as our library of choice we will discuss patterns that enable successful integration with the solid systems from our storied past&lt;/em&gt;&lt;/strong&gt;, including:  &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Translating synchronous workflows to async using completion criteria and the routing slip pattern
&lt;/li&gt;
&lt;li&gt;Getting Ghosted: Detecting missing responses from external systems with the Absence of an Event and Thou Shall Not Pass patterns
&lt;/li&gt;
&lt;li&gt;Change Data Capture, the Outbox Pattern, Derivative Events and transaction pitfalls
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;Armed with these pragmatic best practices, you will be able to successfully bring eventing into your stack and avoid turning your brownfield…into a minefield.&lt;/em&gt;&lt;/strong&gt; &lt;/p&gt;
&lt;/blockquote&gt;

&lt;h4&gt;
  
  
  What it does well
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;The opening sentence presents a heavy topic in a comical way, and you see a bit of the author’s personality shine through. What does the smell of cheese sticks have to do with tech? Almost nothing. But it’s fun, it’s attention-grabbing, it sets the tone for the audience.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;It gets to the point. After introducing the legacy systems that form the basis of the talk (and connecting with audience members using these systems), the author dives into the technology that they’ll be using and includes examples of the patterns that will be covered.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;And again, as expected, she summarizes the talk with a set of takeaways… and ends it with a pun. I love that this abstract strikes a balance between the technical details that form its foundation and having a bit of fun.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This abstract was written by the incredibly talented Anna McDonald, Principal Customer Success Technical Architect at Confluent, and is included here with her permission. (If you want to see the whole talk, she delivered this session at &lt;a href="https://www.confluent.io/events/kafka-summit-london-2023/pragmatic-patterns-and-pitfalls-for-event-streaming-in-brownfield/" rel="noopener noreferrer"&gt;Kafka Summit London 2023&lt;/a&gt;.)&lt;/p&gt;

&lt;h2&gt;
  
  
  Try it yourself
&lt;/h2&gt;

&lt;p&gt;The three examples shown here are different from one another; they each cover a different subject at a different level, and they’re meant to appeal to different audiences. Even so, you may have noticed that they all more or less follow the abstract formula that I introduced in the last blog post on responding to a CfP. You don’t have to (and shouldn’t!) reinvent the wheel each time you want to write a new abstract.&lt;/p&gt;

&lt;p&gt;So make it easy for yourself. Take inspiration from great abstracts that you see out in the wild and follow these steps:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Determine your topic&lt;/li&gt;
&lt;li&gt;Answer the questions, being sure to write for your future audience&lt;/li&gt;
&lt;li&gt;Follow the abstract template&lt;/li&gt;
&lt;li&gt;(Optional) Add a bit of personal flair&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;If you liked this post and thought these examples were helpful, be sure to check out the rest of the blogs in this series! And, on that note, be sure to keep an eye out for more posts on this topic in the future... 👀&lt;/p&gt;

</description>
      <category>techtalks</category>
      <category>writing</category>
      <category>speaking</category>
      <category>learning</category>
    </item>
    <item>
      <title>Responding to a CfP</title>
      <dc:creator>Danica Fine</dc:creator>
      <pubDate>Tue, 27 Jun 2023 16:22:37 +0000</pubDate>
      <link>https://dev.to/thedanicafine/so-you-want-to-speak-at-a-technical-conference-responding-to-a-cfp-54m6</link>
      <guid>https://dev.to/thedanicafine/so-you-want-to-speak-at-a-technical-conference-responding-to-a-cfp-54m6</guid>
      <description>&lt;p&gt;The technical conference season never stops. Hundreds of technical conferences are always occurring around the globe, giving engineers and developers the chance to share their knowledge and experiences with members of the community. But even as these events are occurring, the Calls for Papers (CfPs) for next season are already opening.&lt;/p&gt;

&lt;p&gt;If you’ve ever felt compelled to deliver a talk at a conference, there’s no time like the present to submit to a CfP.&lt;/p&gt;

&lt;p&gt;And I’d like to help you do just that by beginning with the most important part—writing an abstract.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why listen to me?
&lt;/h2&gt;

&lt;p&gt;I’ve had the privilege of serving on a technical conference Program Committee for a couple of years now, during which time I’ve reviewed hundreds of technical abstracts and offered personalized feedback on a large percentage of them. On top of that, I’ve successfully seen 5 globally-recognized events through, start-to-finish, as Program Committee Chair.&lt;/p&gt;

&lt;p&gt;After all of that, I have a few thoughts on what we looked for in abstracts on the Program Committee, how you might extrapolate that for any technical conference, and what you can do to prepare to submit for your next conference CfP. &lt;/p&gt;

&lt;h2&gt;
  
  
  The Secret: Write a Good Abstract
&lt;/h2&gt;

&lt;p&gt;It shouldn’t come as a surprise to anyone that the secret to getting into a technical conference (or really any conference, for that matter) is writing a good, relevant abstract. After all, the conference organizers and the Program Committee only have your abstract to base their decisions on. You may as well make it easy for them.&lt;/p&gt;

&lt;p&gt;Conference abstracts can be beautiful pieces of prose that introduce your topic, summarize your session, and provide some proof that you are articulate enough to present on the subject at hand. That's a heavy ask for a short write-up... If you don’t write on a regular basis, that might sound daunting. But it doesn’t need to be.&lt;/p&gt;

&lt;p&gt;Let’s break it down.&lt;/p&gt;

&lt;h3&gt;
  
  
  💭 Decide on a Relevant Topic
&lt;/h3&gt;

&lt;p&gt;I shouldn’t have to say this, but I’ve seen enough irrelevant CfP submissions that I feel the need to bring it up. The biggest favor you can do for yourself is to understand the conference that you're submitting to and the audience that will be attending. This is going to involve a little bit of research on your part. To guide you in that, consider asking yourself these questions:&lt;/p&gt;

&lt;h4&gt;
  
  
  Is the topic I want to speak on relevant to the conference to which I’m submitting?
&lt;/h4&gt;

&lt;p&gt;(And vice versa for those of you who would like to speak at a specific conference.) If it is your dream to speak at Blah Blah Generative AI Conf, maybe don’t try submitting a talk on your experience working with OCaml. Unless you have a good angle. Which leads to the next question…&lt;/p&gt;

&lt;h4&gt;
  
  
  Does it make sense for an attendee of Conference XYZ to come to your talk?
&lt;/h4&gt;

&lt;p&gt;Someone at a .NET conference might not necessarily find Apache Kafka useful to them and their everyday work. But that doesn’t mean that there couldn’t be an Apache Kafka talk at a .NET conference, and that’s where having the right angle comes in. If you are targeting a specific conference with a specific (perhaps non-obvious) topic, I would challenge you to start by convincing yourself that a nontrivial number of attendees from Conference XYZ would be interested in your talk before you spend time crafting an abstract and submitting.&lt;/p&gt;

&lt;h4&gt;
  
  
  Are there any conference themes or tracks to be aware of?
&lt;/h4&gt;

&lt;p&gt;Some conferences are general and bring together a variety of subject matter, speakers, and presentation types. In which case, the world is your oyster—submit away!&lt;/p&gt;

&lt;p&gt;But keep in mind that even some general developer conferences use themes and tracks to ensure that their audience can still get enough of a certain type of content. For example, a conference might have an AI/ML theme for that year, or perhaps they’re committing to a handful of specific content tracks like Big Data, Streaming Data, and MLOps. It’s worth checking into those themes and tracks ahead of time; tailoring your content and abstract to them will do wonders for increasing your chances of acceptance.&lt;/p&gt;

&lt;p&gt;And on the other hand, not abiding by the tracks can really make it difficult for the committee to evaluate and accept your session.&lt;br&gt;
&lt;iframe class="tweet-embed" id="tweet-1671226112779616257-103" src="https://platform.twitter.com/embed/Tweet.html?id=1671226112779616257"&gt;
&lt;/iframe&gt;

&lt;/p&gt;

&lt;h3&gt;
  
  
  🛠️ Build a Foundation
&lt;/h3&gt;

&lt;p&gt;Alright, so you’ve done the research and have a topic and conference you plan to submit to. The next step is to begin to formalize exactly what you will include in your abstract. I usually start doing this by writing a rough outline of key details.&lt;/p&gt;

&lt;p&gt;As you do this, keep one thing in mind: every talk and session is there to benefit the people in the audience, and each abstract should reflect that. The conference is for the attendees.&lt;/p&gt;

&lt;p&gt;Here are some questions to get you started:&lt;/p&gt;

&lt;h4&gt;
  
  
  1. Why should someone care about your talk?
&lt;/h4&gt;

&lt;p&gt;Put yourself in the shoes of an attendee at that conference. They’re looking through the conference agenda and trying to figure out which relevant sessions they should attend. Make it easy for them! An attendee should know from the very first sentence of your abstract (or better yet, the title) what they can expect to get out of your session. Oftentimes, attendees are choosing sessions in the 5 minutes prior to the talk, so you need to be quick to capture their attention. Don’t waste your precious first sentence explaining who you are or which company you work for—focus.&lt;/p&gt;

&lt;h4&gt;
  
  
  2. What technical details are you covering?
&lt;/h4&gt;

&lt;p&gt;After the attendee has been reeled in by a good title or opening sentence, now’s your chance to really prove to them that they’re going to get something good out of this session. Show them what they should expect to be covered in your conference talk. Explain exactly what tech you’ll be presenting on or using in your demo. Tell them how you accomplished what you did. Convince them (and the Program Committee) that you know your stuff. &lt;/p&gt;

&lt;h4&gt;
  
  
  3. What takeaways can an attendee expect?
&lt;/h4&gt;

&lt;p&gt;Finally, make it super explicit what an attendee should expect to get out of the talk. Will they know the ins and outs of a specific API? Will you be giving them all the tools they need to build a certain app? Tell them! It may not make for beautiful prose, but know that there’s absolutely no reason why you can’t use the final (short) paragraph of your abstract to explicitly state your takeaways.&lt;/p&gt;

&lt;h3&gt;
  
  
  📝 Put Pen to Paper
&lt;/h3&gt;

&lt;p&gt;Now for the easy part—write the abstract. 🙃&lt;/p&gt;

&lt;p&gt;Don’t let this intimidate you. You don’t need to be a great writer to craft an incredible abstract. Really. You start with your foundation—conveniently, these are just the facts and key details from your outline—and build from there. &lt;/p&gt;

&lt;p&gt;What does this look like? Here’s a template:&lt;/p&gt;

&lt;h4&gt;
  
  
  Abstract Template
&lt;/h4&gt;

&lt;blockquote&gt;
&lt;p&gt;[Descriptive Title]&lt;br&gt;&lt;br&gt;
[Opening sentence with a good hook based on your answer to question #1 above.] [Optional sentence to segue into the details of the talk.]&lt;br&gt;&lt;br&gt;
[The meat of your abstract. 2-4 sentences on the technical details of the talk based on question #2 above.]&lt;br&gt;&lt;br&gt;
[Summary and takeaways based on question #3.]&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Bam, look at that! If you answered the questions from the previous section, you basically have an abstract right there! But remember, this is meant to be a starting point. This first draft should not be your final draft. &lt;/p&gt;

&lt;h4&gt;
  
  
  Ready, Set, Iterate
&lt;/h4&gt;

&lt;p&gt;Now we refine. And I’m going to purposefully do a little hand-waving here. &lt;em&gt;How&lt;/em&gt; you refine and &lt;em&gt;how much&lt;/em&gt; you refine will be up to you and your writing style. I mean that—use this time to find your style and what works for you. &lt;/p&gt;

&lt;p&gt;Generally speaking though, with your draft in hand, do roughly the following until it’s done (by some definition of ‘done’):&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Read the draft out loud to yourself. Does it flow? Does it sound good? Before you completely discount this part and skip to the next one, stop. Reading out loud is important; written text can look fine, but saying it out loud will help you spot parts that don’t feel natural, rephrase them more comfortably in your own words, and catch any pesky grammatical and spelling errors.&lt;/li&gt;
&lt;li&gt;Remove unnecessary information. It’s common for CfPs to request that your abstract be relatively short, and a concise abstract is easier to reuse across the many CfPs you’ll want to submit to.&lt;/li&gt;
&lt;li&gt;Confirm that your draft contains all of the information from the questions you asked yourself in the previous section. If not, add it back in! At a bare minimum, the audience should know what your session is about, what technical details it will cover, and what they will get out of it.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  A final note… and a dose of reality
&lt;/h2&gt;

&lt;p&gt;I strongly believe that technical conferences are meant to be by the community for the community. Anyone who wants to present should have the tools and resources at their disposal to make that possible. If you’ve been looking to submit to speak at a technical conference but haven’t known where to start, I hope that this post serves as the nudge that encourages you to finally do so. &lt;/p&gt;

&lt;p&gt;As you start on your journey of writing abstracts and submitting to CfPs, I have one final thing for you to keep in mind: &lt;em&gt;even if you write an objectively incredible abstract, there’s no guarantee that you’ll be accepted to every conference you submit to&lt;/em&gt;. You shouldn’t let that discourage you. Speaking from experience, sometimes conference organizers have to reject even some great sessions due to time and space constraints. &lt;/p&gt;

&lt;p&gt;To stay on track:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Start small—apply to local or regional technical conferences that may have a smaller pool of applicants. &lt;/li&gt;
&lt;li&gt;Take advantage of your rejections and iterate. If you’ve been rejected, reach out to the conference organizer and see if they have any advice or feedback as to why your session was rejected. &lt;/li&gt;
&lt;li&gt;Know that it’s a numbers game. Even great speakers I know still receive rejections, so they increase their chances of acceptance by submitting to many conferences. &lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;And watch this space for followup posts in this series to make this process feel a little more tangible.&lt;/strong&gt; 👀&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;With that, get on out there and start applying to CfPs; with any luck, I’ll see you on stage at a conference someday soon!&lt;/p&gt;

</description>
      <category>writing</category>
      <category>techtalks</category>
      <category>speaking</category>
      <category>learning</category>
    </item>
  </channel>
</rss>
