After a long, and probably too long, time I finally attended PyData Berlin (2025) again. And it was great! In this post, I will list the talks I attended and share some highlights. Later, I will probably write more about some specific topics. So, let's start.
Beyond Linear Funnels: Visualizing Conditional User Journeys with Python
Yaseen presented a nice project he has been working on: funnelius, designed to help with visualizing non-linear, conditional user funnels. It uses pandas, Graphviz, and Streamlit, and the results are useful! Getting started seems to be very simple and requires only a simple data set:
| user_id | action | action_start | answer |
|---|---|---|---|
| 1 | 1st question | 2025-04-10 12:04:15.00 | Yes |
| 1 | 2nd question | 2025-04-10 12:05:17.00 | No |
| 2 | 2nd question | 2025-04-10 12:05:17.00 | Yes |
| 2 | 3rd question | 2025-04-10 12:08:27.00 | Yes |
The tool will construct the journeys based on the user_id and their timestamps.
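To make the expected input concrete, here is a minimal pandas sketch that builds such a data set (the column names follow the table above; this only prepares the input and does not call funnelius itself, whose exact API I have not checked yet):

```python
import pandas as pd

# Minimal event log in the shape described above (one row per user action).
events = pd.DataFrame(
    {
        "user_id": [1, 1, 2, 2],
        "action": ["1st question", "2nd question", "2nd question", "3rd question"],
        "action_start": pd.to_datetime(
            [
                "2025-04-10 12:04:15",
                "2025-04-10 12:05:17",
                "2025-04-10 12:05:17",
                "2025-04-10 12:08:27",
            ]
        ),
        "answer": ["Yes", "No", "Yes", "Yes"],
    }
)

# Ordering by user and timestamp is what allows the journeys to be reconstructed.
journeys = events.sort_values(["user_id", "action_start"])
print(journeys)
```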
Accessible Data Visualizations
With the upcoming European Accessibility Act and the related EN 301 549 standard, accessibility is going to get a lot more attention. In this talk, Maris presented a design system aimed at accessibility. I found the use of patterns alongside the colors very appealing.
During the talk a link to a color palette finder was shared.
More than DataFrames: Data Pipelines with the Swiss Army Knife DuckDB (Workshop)
This was a very interesting workshop that will require a lot of "post-processing" on my part.
💡 DuckDB was mentioned a lot during the conference and I have been playing with it a little. However, I still don't understand how it can be used in many of the practical cases described. At the end of the day, the data has to be loaded onto the machine where duckdb is running. How does that work? The WASM support was raised, but for that to work doesn't the whole data set have to be fetched by the client first? Am I missing something? Leave some comments below!
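For context, this is the kind of usage I have been playing with; a minimal sketch with a placeholder URL, assuming the data is published as a Parquet file over HTTP (DuckDB still has to pull the relevant bytes to wherever it runs, which is exactly the part I find unclear at larger scale):

```python
import duckdb

# Query a remote Parquet file directly; the URL is a placeholder, not a real data set.
con = duckdb.connect()
result = con.sql(
    "SELECT answer, COUNT(*) AS n "
    "FROM 'https://example.com/funnel_events.parquet' "
    "GROUP BY answer"
).df()
print(result)
```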
AI-Ready Data in Action: Powering Smarter Agents (Workshop)
By Violetta Mishechkina and Chang She
This workshop covered a lot of material and two tools that I had never used before: dlthub (and the git repo here) and lancedb. I am surely going to look deeper into these tools (probably starting with dlthub). The workshop followed this notebook.
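I have not used dlthub yet, but from a first look at its documentation the entry point seems to be roughly the following (a sketch with made-up pipeline, dataset, and table names, not the workshop's code):

```python
import dlt

# A tiny dlt pipeline that loads a list of dicts into a local DuckDB database.
pipeline = dlt.pipeline(
    pipeline_name="pydata_demo",  # hypothetical name
    destination="duckdb",
    dataset_name="demo_data",     # hypothetical name
)

rows = [{"user_id": 1, "answer": "Yes"}, {"user_id": 2, "answer": "No"}]
load_info = pipeline.run(rows, table_name="answers")
print(load_info)
```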
Narwhals: enabling universal dataframe support (Keynote)
In this talk Marco:
- presented the Narwhals package, and
- shared insights about managing an open source project.

The following is taken from the slides:
- Non-negotiable:
  - Clear contributing guide
  - Code of conduct
- Nice to have:
  - Low-pressure communication (like Discord)
  - Open community calls
  - Recognise contributors by elevating their permissions (but be careful!)
  - Have a clear vision, don’t democratise decision-making too early
  - Share a roadmap and priorities
- Roles:
  - Merge commits: be liberal. As long as people get the spirit of the project, it’s all reversible
  - Releases: be picky. Only let people who you know in person, or whose identity you’re sure of, do this
As stated in the talk, narwhals is "for builders"; therefore, at the moment, it is not of high interest for me, but it is surely worth knowing about (a quick sketch of the idea follows below)! I will come back to it later. Two topics came up during the talk and I will come back to them for sure:
- https://github.com/ibis-project/ibis and
- https://pola.rs/

Polars was mentioned almost as often as duckdb. I have to admit that I haven't had the chance to use it so far.
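As I understand it from the talk and the documentation, the core idea of Narwhals is to write dataframe-agnostic code once and run it on pandas, Polars, and other supported backends. A minimal sketch of my own (not from the talk):

```python
import narwhals as nw
import pandas as pd
import polars as pl


def add_total(df_native):
    """Dataframe-agnostic helper: works on any backend Narwhals supports."""
    df = nw.from_native(df_native)
    return df.with_columns((nw.col("a") + nw.col("b")).alias("total")).to_native()


# The same function handles both pandas and Polars inputs.
print(add_total(pd.DataFrame({"a": [1, 2], "b": [3, 4]})))
print(add_total(pl.DataFrame({"a": [1, 2], "b": [3, 4]})))
```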
The Importance and Elegance of Polars Expressions
First, a word of disclaimer: I have no experience with polars. You should surely watch this talk (it should become available on YouTube soon). Maybe I will discuss it more in another post, but I will briefly mention the interesting discussion I had with Jeroen. I believe we ended up agreeing that it could be a good idea to wrap expressions in functions where, for example, column names are parameterized. Doing that makes the expressions more reusable and testable (see the sketch below). As mentioned, more on that in the future.
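To illustrate what I mean, here is my own small sketch (not Jeroen's example): the column name becomes a parameter, and the expression can be unit-tested in isolation before being used in a query:

```python
import polars as pl


def fraction_of_total(value_col: str, alias: str = "fraction") -> pl.Expr:
    """Reusable expression: each row's share of the column's total."""
    return (pl.col(value_col) / pl.col(value_col).sum()).alias(alias)


df = pl.DataFrame({"region": ["a", "b", "c"], "revenue": [10.0, 30.0, 60.0]})
print(df.with_columns(fraction_of_total("revenue")))
```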
Building Reactive Data Apps with Shinylive and WebAssembly
An interesting solution for delivering data to consumers directly in the browser. This approach can bring both usability and maintainability to data delivery processes. However, I suspect, as was also discussed in the talk, that the main challenge is the last mile: authentication. Specifically, securely using the credentials needed to access upstream data sources is a serious challenge, since the whole app is exposed in the browser. So, without some backend it is probably not possible. Here's a link to an example.
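For reference, a Shiny for Python app of the kind that Shinylive can compile to run entirely client-side looks roughly like this (my own minimal sketch, not the example from the talk):

```python
from shiny import App, render, ui

# A tiny reactive app; with Shinylive it runs in the browser via WebAssembly.
app_ui = ui.page_fluid(
    ui.input_slider("n", "Number of points", min=10, max=100, value=50),
    ui.output_text("summary"),
)


def server(input, output, session):
    @render.text
    def summary():
        return f"You selected {input.n()} points."


app = App(app_ui, server)
```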
Deep Dive into the Synthetic Data SDK (Workshop)
By Tobias Hann - Talk's page
In this workshop, Tobias presented the Synthetic Data SDK, which helps generate synthetic data based on existing data sets. The following Colab notebook was shared; it covers:
- SDK Core Capabilities
- Differential Privacy
- Conditional Generation
- Multi-Table Synthesis
- Fair Synthetic Data
I believe this is a tool that can be extremely helpful for data scientists and researchers who need to generate synthetic data quickly and efficiently. The SDK provides a wide range of capabilities and even some guarantees!
Forget the Cloud: Building Lean Batch Pipelines from TCP Streams with Python and DuckDB
By Orell Garten - Check out the talk's material
I really liked Orell's question at the beginning of the talk: "WHO PROCESSES LESS THAN 100 GB OF DATA PER DAY?". In too many cases, cloud platforms like Databricks are an insane overkill. The lean approach presented by Orell is definitely worth a deeper look. Maybe also connecting it to dlthub? What do you think?
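I have not reproduced Orell's setup, but the lean pattern I took away is roughly: buffer the incoming records from the stream to local files, then let DuckDB batch them into Parquet. A sketch under those assumptions, with made-up paths and schema:

```python
import duckdb

# Batch step of a lean local pipeline: read JSON-lines files collected from the
# stream, aggregate them, and write the result as Parquet.
con = duckdb.connect("pipeline.duckdb")
con.sql(
    """
    COPY (
        SELECT sensor_id, AVG(value) AS avg_value, COUNT(*) AS n
        FROM read_json_auto('landing/*.jsonl')
        GROUP BY sensor_id
    ) TO 'warehouse/daily_summary.parquet' (FORMAT PARQUET)
    """
)
```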
Bye-Bye Query Spaghetti: Write Queries You'll Actually Understand Using Pipelined SQL Syntax
I think this was the most practical talk I attended in this conference. In case you didn't know it, just like I didn't, there's a new dialect in town: SQL Pipeline Syntax
! In contrast to traditional SQL, it guarantees that the first n
rows of SQL statements will always be valid! No more commenting out sections, moving things around, etc. while developing/debugging SQL code. Here's a little example taken from Databricks. The following two statements are identical:
```sql
SELECT
  c_count,
  COUNT(*) AS custdist
FROM
  (
    SELECT
      c_custkey,
      COUNT(o_orderkey) c_count
    FROM
      customer
      LEFT OUTER JOIN orders ON c_custkey = o_custkey
      AND o_comment NOT LIKE '%unusual%packages%'
    GROUP BY
      c_custkey
  ) AS c_orders
GROUP BY
  c_count
ORDER BY
  custdist DESC,
  c_count DESC;
```
and
```sql
FROM customer
|> LEFT OUTER JOIN orders ON c_custkey = o_custkey
   AND o_comment NOT LIKE '%unusual%packages%'
|> AGGREGATE COUNT(o_orderkey) c_count
   GROUP BY c_custkey
|> AGGREGATE COUNT(*) AS custdist
   GROUP BY c_count
|> ORDER BY custdist DESC, c_count DESC;
```
So far I haven't tried it, but I am definitely planning to. I find it an important development for SQL. What do you think? Have you used it?
Docling: Get your documents ready for gen AI
Yet another great talk that brought to my attention docling, a lean and useful tool that extracts AI-ready files (e.g., Markdown) from various document types like PDF, PPT, etc. The tool is very easy to use, and I will come back to it in the future to explore its capabilities further.
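From a quick look at the documentation, basic usage seems to be as simple as the following (a sketch with a placeholder file name; I have not explored the many options yet):

```python
from docling.document_converter import DocumentConverter

# Convert a document (PDF, PPTX, ...) into Markdown ready for GenAI pipelines.
converter = DocumentConverter()
result = converter.convert("report.pdf")  # placeholder file name
print(result.document.export_to_markdown())
```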
Scraping urban mobility: analysis of Berlin carsharing
In this talk, Florian presented his attempt at optimizing the (dis-)positioning of vehicles in a carsharing fleet. I enjoyed hearing about his approach a lot. Unfortunately, he was not able to crack the secret, but I am sure that with more reliable data he could help a lot in this domain! Check it out on your own: here is a link to the slides.