After a long, and probably too long, time I finally attended PyData Berlin (2025) again. And it was great! In this post, I will list the talks I attended and share some highlights. Later, I will probably write more about some specific topics. So, let's start.
Beyond Linear Funnels: Visualizing Conditional User Journeys with Python
Yaseen presented a nice project he has been working on: funnelius, designed to help with visualizing non-linear, conditional user funnels. It uses pandas, Graphviz, and Streamlit, and the results are useful! Getting started seems to be very simple and requires only a simple data set:
| user_id | action | action_start | answer |
|---|---|---|---|
| 1 | 1st question | 2025-04-10 12:04:15.00 | Yes |
| 1 | 2nd question | 2025-04-10 12:05:17.00 | No |
| 2 | 2nd question | 2025-04-10 12:05:17.00 | Yes |
| 2 | 3rd question | 2025-04-10 12:08:27.00 | Yes |
The tool will construct the journeys based on the user_id and their timestamps.
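To make the expected input concrete, here is a minimal pandas sketch that builds such a data set (the column names follow the table above; this only prepares the input and does not call funnelius itself, whose exact API I have not checked yet):

```python
import pandas as pd

# Minimal event log in the shape described above (one row per user action).
events = pd.DataFrame(
    {
        "user_id": [1, 1, 2, 2],
        "action": ["1st question", "2nd question", "2nd question", "3rd question"],
        "action_start": pd.to_datetime(
            [
                "2025-04-10 12:04:15",
                "2025-04-10 12:05:17",
                "2025-04-10 12:05:17",
                "2025-04-10 12:08:27",
            ]
        ),
        "answer": ["Yes", "No", "Yes", "Yes"],
    }
)

# Ordering by user and timestamp is what allows the journeys to be reconstructed.
journeys = events.sort_values(["user_id", "action_start"])
print(journeys)
```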
Accessible Data Visualizations
With the upcoming European Accessibility Act and the related EN 301 549 standard, accessibility is going to get a lot more attention. In this talk, Maris presented a design system aimed at accessibility. I found the use of patterns alongside the colors very appealing.
During the talk a link to a color palette finder was shared.
More than DataFrames: Data Pipelines with the Swiss Army Knife DuckDB (Workshop)
This was a very interesting workshop that will require a lot of "post-processing" on my part.
💡 DuckDB was mentioned a lot during the conference and I have been playing with it a little. However, I still don't understand how it can be used in many of the practical cases described. At the end of the day, the data has to be loaded onto the machine where duckdb is running. How does that work? The WASM support was raised, but for that to work doesn't the whole data set have to be fetched by the client first? Am I missing something? Leave some comments below!
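For context, this is the kind of usage I have been playing with; a minimal sketch with a placeholder URL, assuming the data is published as a Parquet file over HTTP (DuckDB still has to pull the relevant bytes to wherever it runs, which is exactly the part I find unclear at larger scale):

```python
import duckdb

# Query a remote Parquet file directly; the URL is a placeholder, not a real data set.
con = duckdb.connect()
result = con.sql(
    "SELECT answer, COUNT(*) AS n "
    "FROM 'https://example.com/funnel_events.parquet' "
    "GROUP BY answer"
).df()
print(result)
```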
AI-Ready Data in Action: Powering Smarter Agents (Workshop)
By Violetta Mishechkina and Chang She
This workshop covered a lot of material and two tools that I had never used before: dlthub (and the git repo here) and lancedb. I am surely going to look deeper into these tools (probably starting with dlthub). The workshop followed this notebook.
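I have not used dlthub yet, but from a first look at its documentation the entry point seems to be roughly the following (a sketch with made-up pipeline, dataset, and table names, not the workshop's code):

```python
import dlt

# A tiny dlt pipeline that loads a list of dicts into a local DuckDB database.
pipeline = dlt.pipeline(
    pipeline_name="pydata_demo",  # hypothetical name
    destination="duckdb",
    dataset_name="demo_data",     # hypothetical name
)

rows = [{"user_id": 1, "answer": "Yes"}, {"user_id": 2, "answer": "No"}]
load_info = pipeline.run(rows, table_name="answers")
print(load_info)
```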
Narwhals: enabling universal dataframe support (Keynote)
In this talk Marco:
- presented the Narwhals package, and
- shared insights about managing an open source project.

The following is taken from the slides:
- Non-negotiable:
  - Clear contributing guide
  - Code of conduct
- Nice to have:
  - Low-pressure communication (like Discord)
  - Open community calls
  - Recognise contributors by elevating their permissions (but be careful!)
  - Have a clear vision, don’t democratise decision-making too early
  - Share a roadmap and priorities
- Roles:
  - Merge commits: be liberal. As long as people get the spirit of the project, it’s all reversible
  - Releases: be picky. Only let people who you know in person, or whose identity you’re sure of, do this
As stated in the talk, narwhals is "for builders"; therefore, at the moment, it is not of high interest for me, but it is surely worth knowing about (a quick sketch of the idea follows below)! I will come back to it later. Two topics came up during the talk and I will come back to them for sure:
- https://github.com/ibis-project/ibis and
- https://pola.rs/

Polars was mentioned almost as often as duckdb. I have to admit that I haven't had the chance to use it so far.
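As I understand it from the talk and the documentation, the core idea of Narwhals is to write dataframe-agnostic code once and run it on pandas, Polars, and other supported backends. A minimal sketch of my own (not from the talk):

```python
import narwhals as nw
import pandas as pd
import polars as pl


def add_total(df_native):
    """Dataframe-agnostic helper: works on any backend Narwhals supports."""
    df = nw.from_native(df_native)
    return df.with_columns((nw.col("a") + nw.col("b")).alias("total")).to_native()


# The same function handles both pandas and Polars inputs.
print(add_total(pd.DataFrame({"a": [1, 2], "b": [3, 4]})))
print(add_total(pl.DataFrame({"a": [1, 2], "b": [3, 4]})))
```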
The Importance and Elegance of Polars Expressions
First, a word of disclaimer: I have no experience with polars. You should surely watch this talk (it should become available on YouTube soon). Maybe I will discuss it more in another post, but I will briefly mention the interesting discussion I had with Jeroen. I believe we ended up agreeing that it could be a good idea to wrap expressions in functions where, for example, column names are parameterized. Doing that makes the expressions more reusable and testable (see the sketch below). As mentioned, more on that in the future.
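To illustrate what I mean, here is my own small sketch (not Jeroen's example): the column name becomes a parameter, and the expression can be unit-tested in isolation before being used in a query:

```python
import polars as pl


def fraction_of_total(value_col: str, alias: str = "fraction") -> pl.Expr:
    """Reusable expression: each row's share of the column's total."""
    return (pl.col(value_col) / pl.col(value_col).sum()).alias(alias)


df = pl.DataFrame({"region": ["a", "b", "c"], "revenue": [10.0, 30.0, 60.0]})
print(df.with_columns(fraction_of_total("revenue")))
```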
Building Reactive Data Apps with Shinylive and WebAssembly
An interesting solution for delivering data to consumers directly in the browser. This approach can bring both usability and maintainability to data delivery processes. However, I suspect, as was also discussed in the talk, that the main challenge is the last mile: authentication. Specifically, securely using the credentials needed to access upstream data sources is a serious challenge, since the whole app is exposed in the browser. So, without some backend it is probably not possible. Here's a link to an example.
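For reference, a Shiny for Python app of the kind that Shinylive can compile to run entirely client-side looks roughly like this (my own minimal sketch, not the example from the talk):

```python
from shiny import App, render, ui

# A tiny reactive app; with Shinylive it runs in the browser via WebAssembly.
app_ui = ui.page_fluid(
    ui.input_slider("n", "Number of points", min=10, max=100, value=50),
    ui.output_text("summary"),
)


def server(input, output, session):
    @render.text
    def summary():
        return f"You selected {input.n()} points."


app = App(app_ui, server)
```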
Deep Dive into the Synthetic Data SDK (Workshop)
By Tobias Hann - Talk's page
In this workshop, Tobias presented the Synthetic Data SDK, which helps generate synthetic data based on existing data sets. The following Colab notebook was shared; it covers:
- SDK Core Capabilities
- Differential Privacy
- Conditional Generation
- Multi-Table Synthesis
- Fair Synthetic Data
I believe this is a tool that can be extremely helpful for data scientists and researchers who need to generate synthetic data quickly and efficiently. The SDK provides a wide range of capabilities and even some guarantees!
Forget the Cloud: Building Lean Batch Pipelines from TCP Streams with Python and DuckDB
By Orell Garten - Check out the talk's material
I really liked Orell's question at the beginning of the talk: "WHO PROCESSES LESS THAN 100 GB OF DATA PER DAY?". In too many cases, cloud platforms like Databricks are an insane overkill. The lean approach presented by Orell is definitely worth a deeper look. Maybe also connecting it to dlthub? What do you think?
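I have not reproduced Orell's setup, but the lean pattern I took away is roughly: buffer the incoming records from the stream to local files, then let DuckDB batch them into Parquet. A sketch under those assumptions, with made-up paths and schema:

```python
import duckdb

# Batch step of a lean local pipeline: read JSON-lines files collected from the
# stream, aggregate them, and write the result as Parquet.
con = duckdb.connect("pipeline.duckdb")
con.sql(
    """
    COPY (
        SELECT sensor_id, AVG(value) AS avg_value, COUNT(*) AS n
        FROM read_json_auto('landing/*.jsonl')
        GROUP BY sensor_id
    ) TO 'warehouse/daily_summary.parquet' (FORMAT PARQUET)
    """
)
```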
Bye-Bye Query Spaghetti: Write Queries You'll Actually Understand Using Pipelined SQL Syntax
I think this was the most practical talk I attended in this conference. In case you didn't know it, just like I didn't, there's a new dialect in town: SQL Pipeline Syntax
! In contrast to traditional SQL, it guarantees that the first n
rows of SQL statements will always be valid! No more commenting out sections, moving things around, etc. while developing/debugging SQL code. Here's a little example taken from Databricks. The following two statements are identical:
```sql
SELECT
  c_count,
  COUNT(*) AS custdist
FROM
  (
    SELECT
      c_custkey,
      COUNT(o_orderkey) c_count
    FROM
      customer
      LEFT OUTER JOIN orders ON c_custkey = o_custkey
      AND o_comment NOT LIKE '%unusual%packages%'
    GROUP BY
      c_custkey
  ) AS c_orders
GROUP BY
  c_count
ORDER BY
  custdist DESC,
  c_count DESC;
```
and
```sql
FROM customer
|> LEFT OUTER JOIN orders ON c_custkey = o_custkey
   AND o_comment NOT LIKE '%unusual%packages%'
|> AGGREGATE COUNT(o_orderkey) c_count
   GROUP BY c_custkey
|> AGGREGATE COUNT(*) AS custdist
   GROUP BY c_count
|> ORDER BY custdist DESC, c_count DESC;
```
So far I haven't tried it, but I am definitely planning to. I find it an important development for SQL. What do you think? Have you used it?
Docling: Get your documents ready for gen AI
Yet another great talk that brought to my attention docling, a lean and useful tool that extracts AI-ready files (e.g., Markdown) from various document types like PDF, PPT, etc. The tool is very easy to use, and I will come back to it in the future to explore its capabilities further.
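From a quick look at the documentation, basic usage seems to be as simple as the following (a sketch with a placeholder file name; I have not explored the many options yet):

```python
from docling.document_converter import DocumentConverter

# Convert a document (PDF, PPTX, ...) into Markdown ready for GenAI pipelines.
converter = DocumentConverter()
result = converter.convert("report.pdf")  # placeholder file name
print(result.document.export_to_markdown())
```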
Scraping urban mobility: analysis of Berlin carsharing
In this talk, Florian presented his attempt at optimizing the (dis-)positioning of vehicles in a carsharing fleet. I enjoyed hearing about his approach a lot. Unfortunately, he was not able to crack the secret, but I am sure that with more reliable data he could help a lot in this domain! Check it out on your own: here is a link to the slides.