
Floor Drees

Big Data Europe 2022 - my summary

On November 23-24, Big Data Conference Europe took place, all virtual, with 2,000+ attendees, 40+ talks, and 4 tracks. I got to host (part of) the Machine Learning track on day 1, and the Cloud & Streaming track on day 2. Below is a summary of what I learned - far from a complete summary.

Me, hosting Big Data Europe

Steve Upton, Lead QA Consultant at Thoughtworks, straight away told us we’re probably overthinking data quality. I found it wildly interesting how he explored the idea that, in order for us to make sound decisions, we don’t need data to be 100% accurate all the time, nor necessarily live. As a person with diabetes, he says, he doesn’t even need to know his exact blood sugar levels, as long as he knows the range and can adjust accordingly.

Our striving for perfection might just drive us down a rabbit hole, where we lose sight of which metrics actually matter.

Bias and dirty data

Shalvi Mahajan, AI & Data Scientist at SAP SE, and Johnathan Azaria, Data Scientist Tech Lead at Imperva, talked about “Gender Bias in Artificial Intelligence”, and “Learning from Poisoned Data”, respectively.

There are countless documented instances where machine learning algorithms have been biased, exclusionary, and even dangerous. One book recommendation I acted on was Shalvi’s raving review of Data Feminism, by Catherine D'Ignazio and Lauren F. Klein. I’m curious to find out “how challenges to the male/female binary can help challenge other hierarchical (and empirically wrong) classification systems”.

Johnathan talked about the “Security lifecycle of new technologies”, which takes us from the "Peak of inflated expectations" through the "Trough of disillusionment" to the "Slope of enlightenment", and finally the "Plateau of productivity". On the way up to the peak there's zero security and the first vulns appear; after the peak we see the FUD (Fear, Uncertainty and Doubt) we know from many people’s first reactions to open source software; and during the slope, security solutions are introduced.

When it comes to securing web applications and APIs, Johnathan explored two possible models. The Negative security model is rule-based / signature-based security: “all's good except for what we know is bad”, leaning heavily on blocklists. The Positive security model, the more traditional approach, learns a baseline profile for Web/API traffic and blocks or alerts on deviation (anomaly detection). Its “all's bad except for what we know is good” stance delivers a lot of false positives, but protects better against zero-days.
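For intuition (and not code Johnathan showed), here’s a minimal Python sketch of the two stances, with hypothetical blocklist signatures and a toy learned baseline:

```python
import re

# Negative model: "all's good except for what we know is bad".
# A hypothetical blocklist of signatures; anything not matched is allowed.
BLOCKLIST_SIGNATURES = [
    re.compile(r"(?i)union\s+select"),  # naive SQL injection pattern
    re.compile(r"(?i)<script>"),        # naive XSS pattern
]

def negative_model_allows(request_body: str) -> bool:
    return not any(sig.search(request_body) for sig in BLOCKLIST_SIGNATURES)

# Positive model: "all's bad except for what we know is good".
# Learn a baseline from observed traffic, then block/alert on deviation.
class PositiveModel:
    def __init__(self):
        self.seen_paths = set()

    def learn(self, path: str):
        self.seen_paths.add(path)

    def allows(self, path: str) -> bool:
        # Anything outside the learned baseline counts as an anomaly -
        # hence the higher false positive rate, but also the zero-day coverage.
        return path in self.seen_paths

baseline = PositiveModel()
for p in ["/login", "/api/orders", "/api/products"]:
    baseline.learn(p)

print(negative_model_allows("id=1 UNION SELECT password FROM users"))  # False: matches the blocklist
print(baseline.allows("/api/brand-new-endpoint"))                      # False: not in the baseline
```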

Graph databases

Christopher Woodward is a Developer Relations Engineer at ArangoDB. In his session Machine Learning + Graph Databases for Better Recommendations, he showed how pairing machine learning with graph databases can improve the quality of recommendations.

Chris says that the term “relational database” is a bit of a misnomer. Graph databases make relationships a first-class citizen: rather than focusing on individual rows or products, they capture the dependencies and relationships between those products. You'd use SQL for product listings, and a graph for co-purchase patterns.
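To make the co-purchase idea concrete, here’s my own toy sketch in plain Python (not ArangoDB or AQL), treating the “bought together” relationship as a first-class edge with a weight:

```python
from collections import defaultdict
from itertools import combinations

# Toy orders: each order is the set of products bought together.
orders = [
    {"laptop", "mouse", "usb-c hub"},
    {"laptop", "mouse"},
    {"mouse", "mousepad"},
]

# Build a co-purchase graph: nodes are products, edge weights count
# how often two products appear in the same order.
co_purchases = defaultdict(lambda: defaultdict(int))
for order in orders:
    for a, b in combinations(sorted(order), 2):
        co_purchases[a][b] += 1
        co_purchases[b][a] += 1

def recommend(product, top_n=3):
    """Products most often bought together with `product`."""
    neighbours = co_purchases[product]
    return sorted(neighbours, key=neighbours.get, reverse=True)[:top_n]

print(recommend("laptop"))  # e.g. ['mouse', 'usb-c hub']
```

In a graph database that edge would be stored and queried directly, instead of being reconstructed with joins over order rows.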

Data relationships are the foundation of AI/ML models. Graph database use cases include fraud detection, supply chain management, recommendations, customer 360, network management, risk management, …

Chris shared a notebook with the implementation of a content-based recommendation engine. The model uses a Python TF-IDF library - Term Frequency: how often a word shows up in a document; Inverse Document Frequency: how often the word shows up across all documents - to compute similarity between movies using the title, tagline, and overview: https://colab.research.google.com/github/arangodb/interactive_tutorials/blob/master/notebooks/arangoflix/similarMovie_TFIDF_ML_Inference.ipynb
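As a rough sketch of the same idea (my own toy data, using scikit-learn rather than the notebook’s exact code):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy catalogue: title, tagline and overview flattened into one string per movie.
movies = {
    "The Matrix": "A hacker discovers that reality is a simulation",
    "Inception": "A thief steals corporate secrets through dream sharing",
    "Finding Nemo": "A clownfish searches the ocean for his missing son",
}

titles = list(movies)
texts = list(movies.values())

# TF-IDF up-weights words that are frequent in one description but rare overall.
tfidf = TfidfVectorizer(stop_words="english")
matrix = tfidf.fit_transform(texts)

# Pairwise cosine similarity between the TF-IDF vectors.
similarity = cosine_similarity(matrix)

def similar_to(title, top_n=2):
    idx = titles.index(title)
    ranked = similarity[idx].argsort()[::-1]  # most similar first
    return [titles[i] for i in ranked if i != idx][:top_n]

print(similar_to("The Matrix"))
```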

Fun exercise - I still think it’s magic that Netflix manages to keep my viewing patterns and my son’s separate while we’re on the same account.

Watermarks

Christian Tzolov is a Tech Lead and Staff Software Engineer at VMware, Spring Team Member, and Apache Software Foundation Committer / PMC Member.

He claims over a decade of experience in building distributed, data-intensive systems, and defines data-intensive systems as “the ultimate data system that would answer any question based on all information acquired from the past up to the present”.

Streaming as a data processing pattern is a continuous sequence of events, processed incrementally as they arrive. Rather than a (global) external store, internal state can be used to enable queries over time windows. An example event is a credit card authorization attempt, where fraud is identified when multiple authorization attempts are performed by the same user in a short interval.

Hardly ever will you see perfect correspondence between Event Time (when events occur) and Processing Time (when events are processed); there's always a time skew. Continuous events from multiple sources at varying rates, flowing along distinct paths, encountering varying delays, and arriving at different consumers at different times, present a challenge.

Watermarks are heuristic metadata, computed from the input events’ timestamps, that inform a processing node that most likely all events with lower timestamps have already been received.
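My own minimal sketch of that heuristic (with a made-up lateness budget), showing how a watermark lets a processing node decide when a time window can be closed despite out-of-order arrivals:

```python
from dataclasses import dataclass

@dataclass
class Event:
    user: str
    event_time: float  # when the authorization attempt actually happened

ALLOWED_LATENESS = 5.0  # hypothetical skew budget, in seconds
WINDOW_END = 60.0       # end of the time window we want to close

watermark = float("-inf")

def on_event(event: Event):
    """Advance the heuristic watermark as events arrive (possibly out of order)."""
    global watermark
    # Watermark: "most likely, all events with lower timestamps have arrived".
    watermark = max(watermark, event.event_time - ALLOWED_LATENESS)
    if watermark >= WINDOW_END:
        print(f"window [0, {WINDOW_END}) can be closed and its result emitted")

# Arrival (processing) order differs from event time - that's the skew.
for e in [Event("alice", 58.0), Event("alice", 52.0), Event("alice", 67.0)]:
    on_event(e)
```

Real engines like Flink or Dataflow compute and propagate watermarks for you; the publications below go into the trade-offs.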

Christian mentioned some interesting publications for those wanting to learn more about watermarks:

  • Watermarks in Stream Processing Systems: Semantics and Comparative Analysis of Apache Flink and Google Cloud Dataflow
  • The Dataflow Model: A Practical Approach to Balancing Correctness, Latency, and Costs in Massive-Scale, Unbounded, Out-of-Order Data Processing
  • Streaming Systems

(Data) awareness

In a panel discussion with Javier Ideami, CEO of Ideami and Cofounder of The Geniverse, Kuba Misiorny, CTO at Untrite, and Radovan Bacovic, Senior Data Engineer at GitLab, we talked about generative art, and System 1 and System 2 thinking in AI (and AGI - Artificial General Intelligence). The good news: the panelists don’t think we’ll lose our jobs to robots any time soon.

In Increasing Data Awareness with Self-Serve Analytics Shadi Rostami, Senior VP of Engineering at Amplitude, taught us some best practices on how we can influence data culture and enable our stakeholders.

Few tools truly enable teams to create/consume/use customer data. Data teams are the facilitators that let non-technical teams start analyzing customer data, self-service. According to Shadi, the sweet spot is between the inflexible, auto-track, precomputed, low-setup tools for non-technical users, and the flexible but technical and slow tools like SQL and data warehouses: “Democratization is a balancing act between chaos and control”. And we’ll want to be change agents, as data literacy is a key skill.

What’s next?

10/10 would attend Big Data Europe again next year! Or host, if they’ll have me! All talk recordings for Big Data Europe are now available for your on-demand viewing pleasure.
