<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Monica Miller</title>
    <description>The latest articles on DEV Community by Monica Miller (@monimiller).</description>
    <link>https://dev.to/monimiller</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F874413%2F6cba4db3-0e87-497e-ab34-3176a9f93692.png</url>
      <title>DEV Community: Monica Miller</title>
      <link>https://dev.to/monimiller</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/monimiller"/>
    <language>en</language>
    <item>
      <title>Monica vs the Machine - Women's 2023 World Cup Analysis</title>
      <dc:creator>Monica Miller</dc:creator>
      <pubDate>Fri, 21 Jul 2023 13:32:49 +0000</pubDate>
      <link>https://dev.to/monimiller/monica-vs-the-machine-womens-2023-world-cup-analysis-kf8</link>
      <guid>https://dev.to/monimiller/monica-vs-the-machine-womens-2023-world-cup-analysis-kf8</guid>
      <description>&lt;h2&gt;
  
  
  Once upon a time
&lt;/h2&gt;

&lt;p&gt;Once upon a time there was a girl who played soccer. She played so much soccer that, after years and years of playing, she now has a couple of D3 records to her name. While I am now a fully retired, washed-up fan who strictly spectates, I still love the game, and I still love watching anytime I can. It doesn't hurt that I live in the United States, home to arguably the best women's soccer program in the world, so of course it's fun to watch - especially international competitions. Because of my sports background, I have my own opinions about equality between men's and women's sports, but I think one thing we can all agree on is raising awareness for the incredible women athletes playing today. As a data girl living in a data world, I thought: what better way to do that than to pit myself against a machine learning algorithm and see who wins? I'm not going to lie, I am nervous about the outcome. I made my predictions the way I make all my decisions - on blind gut instinct - so I definitely feel vulnerable sharing my decision making with the entire world. But I am also excited to see the outcome. I have predicted the entire bracket, and I will be checking back in often to review my results. If both brackets are too terrible by the time we get to the Round of 16, I may redo the predictions, so stay tuned.&lt;/p&gt;

&lt;p&gt;A quick note as you continue on - all the sports takes are my own, and do not reflect the opinions of my family, friends, dog, or employer.&lt;/p&gt;

&lt;h2&gt;
  
  
  The project premise and scope
&lt;/h2&gt;

&lt;p&gt;My initial inspiration was born out of trying to find new data for one of my example projects with Starburst. I started researching and found a men's World Cup analysis performed by &lt;a href="https://gustavorsantos.medium.com/predicting-results-and-goals-with-machine-learning-599e99d6e3e0"&gt;Gustavo Santos&lt;/a&gt;. Gustavo created a tremendous project that I highly recommend checking out. I was able to essentially streamline my own analysis from his prior work, so I want to say a major thank you to him. I also want to give credit where credit is due - I learned so much from his analysis and could not have replicated it on the women's side without his prior work. That was my entire goal for this project: to push myself out of my comfort zone and give the women's side some love by completing a simplified version of Gustavo's analysis to share with the community. My husband always reminds me of the importance of reproducibility within our field, so I hope this project will be a good example of that concept as well.&lt;/p&gt;

&lt;p&gt;I plan to write a four-part article series around this - The Introduction, The Data Wrangling, The Analysis, and The Result. This is the first part of that series, and I will link to the other parts as I publish them.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Data Wrangling
&lt;/h2&gt;

&lt;p&gt;Every project starts with data wrangling. I found &lt;a href="https://www.kaggle.com/datasets/martj42/womens-international-football-results?select=results.csv"&gt;this&lt;/a&gt; Kaggle dataset, which was a great starting point. It records the outcomes of women's international football matches since before the historic 1991 World Cup, which many consider the true beginning of the women's international game. I did some initial data exploration, then cleaned the data to only include matches from the last 20 years. I picked 20 years because there have been six Women's World Cups since 2003, and if Kristine Lilly could play on the national team for 23 years, I think this choice strikes a good balance between recency and history. I also filtered the data to only consider countries that are in this World Cup. While it would be interesting to do a version two that takes all countries into account, the reality is that most of the best teams in the world are in the tournament. I thought that leaving the other countries in would skew the data too much, and I wanted to keep the analysis as clean as possible. I also needed to update the dataset, since it did not reflect all the international results from the last year. You can't work in data without doing some sort of manual reconciliation at some point. I feel like many of us in this field have a Marie Kondo attitude and secretly kind of love the mess that is data. Nevertheless, I persisted, and I got the data into a format I could use for my analysis. I have included only the wrangled data in my project, but I am happy to share the full datasets and process with anyone who is interested.&lt;/p&gt;
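&lt;p&gt;For a flavor of that filtering logic, here is a minimal sketch. The column names ("date", "home_team", "away_team") follow the Kaggle file's conventions but are assumptions here, and the team set is abbreviated - this is an illustration, not the actual wrangling code from my project:&lt;/p&gt;

```python
# Sketch of the wrangling filters: keep only matches from the last ~20 years
# whose teams are both in the World Cup field. Column names are assumed.
from datetime import date

CUTOFF = date(2003, 1, 1)  # roughly 20 years and six World Cups of history
WC_TEAMS = {"United States", "Australia", "Spain", "England"}  # abbreviated set

def wrangle(rows):
    kept = []
    for row in rows:
        match_date = date.fromisoformat(row["date"])
        recent = match_date >= CUTOFF
        both_in_cup = {row["home_team"], row["away_team"]}.issubset(WC_TEAMS)
        if recent and both_in_cup:
            kept.append(row)
    return kept

sample = [
    {"date": "1999-07-10", "home_team": "United States", "away_team": "China PR"},
    {"date": "2023-04-11", "home_team": "United States", "away_team": "Australia"},
    {"date": "2023-04-11", "home_team": "Spain", "away_team": "Norway"},
]
print(wrangle(sample))  # only the United States vs Australia match survives
```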

&lt;h2&gt;
  
  
  The Analysis
&lt;/h2&gt;

&lt;p&gt;I will dive into the analysis of the data in my second post, but for now I wanted to share the basic premise. I used the CatBoost algorithm to create my predictive model. The &lt;a href="https://github.com/monimiller/womens_wc_23/blob/main/notebooks/train_model.ipynb"&gt;train model notebook&lt;/a&gt; is where I ran the analysis to create the model. Much of this was me simplifying Gustavo's previous work and tailoring it to my needs, but I did make some significant logic changes, such as weighting every match equally: instead of treating the home team differently from the away team, I treated both outcomes as statistically equally probable. My run predictions notebook is where I then used the trained model to produce probabilities for all the results. If you want to recreate the analysis with the same model, all you need to do is run that second notebook. Based on the group play predictions, I then ran through all the knockout games and built out the full predictive bracket.&lt;/p&gt;
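&lt;p&gt;To illustrate the equal-weighting idea (this is a toy, not the actual notebook code, and the field names are hypothetical): each match can be mirrored into two training rows so that neither side is privileged as the "home" team:&lt;/p&gt;

```python
# Toy version of "weight every match equally": one match becomes two mirrored
# rows with identical weights, so home/away status carries no extra signal.
def symmetric_rows(match):
    goal_diff = match["score_a"] - match["score_b"]
    return [
        {"team": match["team_a"], "opponent": match["team_b"],
         "goal_diff": goal_diff, "weight": 1.0},
        {"team": match["team_b"], "opponent": match["team_a"],
         "goal_diff": -goal_diff, "weight": 1.0},
    ]

rows = symmetric_rows(
    {"team_a": "United States", "team_b": "Netherlands", "score_a": 2, "score_b": 0}
)
print(rows)
```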

&lt;h3&gt;
  
  
  The Predicted Bracket
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--k4MJG7lW--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/0zg1sfye0f7g8ntt3rzx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--k4MJG7lW--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/0zg1sfye0f7g8ntt3rzx.png" alt="Image description" width="800" height="628"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The Result
&lt;/h2&gt;

&lt;p&gt;I am extremely interested to see these results because the algorithm and I predicted very similar outcomes in the group stage. Our analyses start to differ as we move into the Round of 16, and that's because I'm a human who made all my decisions through the filter of my own biases. I mean, a third World Cup win in a row, in this field of competition? I'd be ecstatic as a USWNT fan, but I'm not sure the odds are ever in our favor. This is where I hope the algorithm is right and I am wrong - I would love to see the three-peat happen.&lt;/p&gt;

&lt;p&gt;I wish I had had more time to research the other teams for my own predictions, but I didn't - I only know what I know. I sense that Australia and Spain are both hot right now, so I'm hoping those two countries have a successful result as well.&lt;/p&gt;

&lt;h3&gt;
  
  
  Monica's bracket
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--GJTuticX--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/cqs2t2jj3j86s2a5tr1k.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--GJTuticX--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/cqs2t2jj3j86s2a5tr1k.png" alt="Image description" width="800" height="673"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I will check back throughout the tournament to see who's winning - Me or the Algorithm. I will also be posting my results on &lt;a href="https://www.linkedin.com/in/monica-miller-/"&gt;LinkedIn&lt;/a&gt;, so feel free to follow along there as well.&lt;/p&gt;

&lt;h2&gt;
  
  
  View the GitHub repo
&lt;/h2&gt;

&lt;p&gt;You can find all the information in this &lt;a href="https://github.com/monimiller/womens_wc_23"&gt;GitHub repository&lt;/a&gt;. If you like what you see, throw it a star!&lt;/p&gt;

</description>
      <category>datascience</category>
      <category>python</category>
      <category>machinelearning</category>
      <category>learning</category>
    </item>
    <item>
      <title>Datanova 2023</title>
      <dc:creator>Monica Miller</dc:creator>
      <pubDate>Mon, 30 Jan 2023 00:24:01 +0000</pubDate>
      <link>https://dev.to/monimiller/datanova-2023-4c0k</link>
      <guid>https://dev.to/monimiller/datanova-2023-4c0k</guid>
      <description>&lt;p&gt;Hello friends. I'm here to tell you about a conference that has consumed my life for the past three months. On February 8th and 9th, Starburst will be hosting our FREE and VIRTUAL event. I'm usually the last one to self promote, but I'm extremely proud of what we were able to build. I think this is amazing data content with awesome data leaders. I've listed out my favorite sessions for you.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Fun with founders: Expectation versus reality with Matt, Drew, Max and Michel - listen for a candid conversation with the founders of dbt, Preset, Airbyte, and Starburst&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Joe Reis and Andy Mott will debate whether data mesh is the end of data engineering&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Benn Stancil will talk about how data engineering fails&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Chad Sanderson will give an introduction into data contracts&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://bit.ly/3GU45jX"&gt;Register now!&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;And yes, that is indeed me on the sign up website making a fool of myself. &lt;/p&gt;

</description>
      <category>data</category>
      <category>news</category>
      <category>conferences</category>
      <category>community</category>
    </item>
    <item>
      <title>ETL vs Interactive Queries: The Case for Both</title>
      <dc:creator>Monica Miller</dc:creator>
      <pubDate>Mon, 13 Jun 2022 19:52:00 +0000</pubDate>
      <link>https://dev.to/monimiller/etl-vs-interactive-queries-the-case-for-both-ff7</link>
      <guid>https://dev.to/monimiller/etl-vs-interactive-queries-the-case-for-both-ff7</guid>
      <description>&lt;p&gt;&lt;strong&gt;&lt;em&gt;TL/DR&lt;/em&gt;&lt;/strong&gt; – Data engineers are wizards and witches. Creating ETL pipelines is a long and sometimes unnecessary process when answering business questions that can be solved in other ways. Interactive (or ad-hoc) queries are awesome, especially for specific self-service insights. Instead of fighting to pick between the two, as the American funk band, War, points out: “Why can’t we be friends?”&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--VzNH-45L--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/52capj7uvinovuriqgrd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--VzNH-45L--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/52capj7uvinovuriqgrd.png" alt="ETL and Interactive War" width="800" height="703"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Professor McGonagall, teacher of Transfiguration at Hogwarts School of Witchcraft and Wizardry, educates her students on the branch of magic that alters the form or appearance of an object. While I never got my official Hogwarts letter, I sincerely reject the notion that the practice of data wrangling does not qualify as pure magic - nay, Transfiguration - in its own right. The ability to alter the form or appearance of data into meaningful business revelations is a skill that separates the data engineering muggles from the witches and wizards.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://i.giphy.com/media/kDWbmeiUFMx6EOfQti/giphy.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://i.giphy.com/media/kDWbmeiUFMx6EOfQti/giphy.gif" width="400" height="279"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Traditionally, ETL (extract, transform, load) is accepted as the blueprint for implementing any sort of data transformation logic within a typical data warehousing architecture, but it is becoming an over-utilized strategy as the modern data stack evolves and new technology is developed. Interactive or ad-hoc query engines have recently proven worthy of contributing quick insights that ETL cannot provide. However, completely overhauling one’s data technology stack to replace one with the other does not necessarily make sense, as ETL can provide value that interactive queries can’t, and vice versa. In a world that is so categorical, why not embrace these differences by adopting both on the same platform? &lt;em&gt;Why can’t [ETL and Interactive Query Engines] be friends?&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What is ETL?&lt;/strong&gt;&lt;br&gt;
ETL is a three-phase data integration process to extract data from multiple sources, transform the data into consistent and meaningful information, and then load the information into the intended target. Due to an established ecosystem of tools and data engineering practices, ETL pipelines are frequently relied upon to provide an automatic and reliable way to move data from disparate locations into the theoretical single source of truth, most often a data warehouse. With the emergence of cloud-based data environments, the similar practice of ELT (extract, load, transform) has also gained popularity to first move raw data into an object store, such as a data lake, and then enable further transformations downstream to eventually produce the desired output.&lt;/p&gt;
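&lt;p&gt;As a toy illustration of those three phases (with invented source data and an in-memory list standing in for the warehouse):&lt;/p&gt;

```python
# Extract raw rows, transform them into consistent records, load them
# into a target. Everything here is an invented stand-in.
def extract():
    return [{"name": " ada ", "spend": "29.5"}, {"name": "grace", "spend": "3.25"}]

def transform(rows):
    # Normalize names and parse numbers into consistent, meaningful records.
    return [{"name": r["name"].strip().title(), "spend": float(r["spend"])} for r in rows]

def load(rows, target):
    target.extend(rows)

warehouse = []
load(transform(extract()), warehouse)
print(warehouse)  # [{'name': 'Ada', 'spend': 29.5}, {'name': 'Grace', 'spend': 3.25}]
```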

&lt;p&gt;&lt;strong&gt;ETL Pipeline Creation Lifecycle&lt;/strong&gt;&lt;br&gt;
The hero origin story—&lt;em&gt;or villain origin story, depending on the particular request&lt;/em&gt;—of an ETL pipeline commonly begins with an inquiry from a data scientist, data analyst, or other data consumer. The consumer submits a request to the data engineering team, which goes to the backlog, where its fate is left to chance: worked next week, next month, or never (death by backlog). Let’s speculate that we have an underworked data engineering team (ha!) and a fabulous product owner running the backlog, so the work gets started within 2-3 days. Still assuming our data warehousing architecture is in play, a data engineer must then haggle with a DBA to get the new target table created in both a development and a production environment.&lt;/p&gt;

&lt;p&gt;For those playing along at home, that’s two requests and two backlogs with two SLAs thus far. Sometimes the data engineer can start development work while the table is being created; but in many scenarios this strategy creates unnecessary rework and should be avoided. Eventually, an ETL pipeline is built to move the data from the source to the target, automation testing and data validation are completed, and the original business ask is fulfilled weeks, if not months, later. The entire process potentially starts over after a week or two has passed, when the data consumer realizes that the business insight delivered actually wasn’t the question they initially wanted answered, even though they swore it was in multiple meetings.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://i.giphy.com/media/LMUOHJCNPvN4I/giphy.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://i.giphy.com/media/LMUOHJCNPvN4I/giphy.gif" width="500" height="256"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Interactive Query Engine Intervention&lt;/strong&gt;&lt;br&gt;
As an alternative to the ETL method, data consumers can self-service answers to some of these business questions through an interactive query engine like Starburst Galaxy, a cloud-native service built on Trino. Instead of waiting for fully transformed data to land in the target, a data consumer can write a SQL query that pulls data directly from multiple sources at once. The best feature of an interactive query engine is exactly what you think it is: the interactivity. Say a query is created, results are discovered, and it turns out the consumer wants to add more information to make the business insight more descriptive. A simple edit to the interactive query produces the newly updated result in minutes, a notable difference from the weeks it would take if the insight relied strictly on ETL pipelines. Trino is built for speed and has &lt;a href="https://trino.io/docs/current/optimizer/cost-based-optimizations.html"&gt;optimized performance&lt;/a&gt; even for copious amounts of data. And because Trino is based on ANSI SQL, it can be adopted quickly, with a robust ecosystem of connectors ready to use.&lt;/p&gt;
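&lt;p&gt;Trino federates a single SQL statement across real catalogs (Hive, PostgreSQL, and so on). As a self-contained stand-in for that idea, sqlite's ATTACH lets one query join tables that live in two separate databases - every table and column name below is invented for the illustration:&lt;/p&gt;

```python
# Toy stand-in for a federated query: two "sources" joined by one SQL
# statement. Trino does this across real catalogs; sqlite ATTACH is only
# used here so the example runs anywhere.
import sqlite3

orders_db = sqlite3.connect(":memory:")
orders_db.execute("ATTACH DATABASE ':memory:' AS crm")

orders_db.execute("CREATE TABLE orders (customer_id INT, amount REAL)")
orders_db.execute("CREATE TABLE crm.customers (id INT, name TEXT)")
orders_db.execute("INSERT INTO orders VALUES (1, 9.5), (1, 20.0), (2, 3.25)")
orders_db.execute("INSERT INTO crm.customers VALUES (1, 'Ada'), (2, 'Grace')")

# One interactive query joining across both "sources"; editing it and
# re-running takes seconds, not an ETL release cycle.
rows = orders_db.execute(
    """
    SELECT c.name, SUM(o.amount) AS total
    FROM orders o JOIN crm.customers c ON o.customer_id = c.id
    GROUP BY c.name ORDER BY total DESC
    """
).fetchall()
print(rows)  # [('Ada', 29.5), ('Grace', 3.25)]
```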

&lt;p&gt;&lt;strong&gt;When to Use ETL vs an Interactive Query Engine&lt;/strong&gt;&lt;br&gt;
There is no silver bullet defining clear guidelines for when each data manipulation technique is the right choice. The best suggestion is to learn the tradeoffs, understand the data, and &lt;em&gt;know when to hold ’em or when to fold ’em&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Speed:&lt;/em&gt;&lt;br&gt;
Time to insight - the time taken until actionable insight is achieved from the source data - can differ greatly depending on the data manipulation method of choice. I’ve already lamented my personal and painful qualms with the weeks or months required for the turnaround on a data pipeline. That is vastly different from an interactive query engine, which can start providing value within minutes to hours. And if an ETL pipeline needs altering, the change still requires substantial resourcing, since testing is required for each integrated application in the pipeline; a SQL change to an interactive query can be implemented with far less time and effort.&lt;/p&gt;

&lt;p&gt;Because interactive query engines pull results from multiple different sources, query response time is also a consideration. The completion time of an interactive analytics query may be slightly longer than that of a data warehouse built for aggregation, &lt;strong&gt;&lt;em&gt;if&lt;/em&gt;&lt;/strong&gt; that data warehouse is optimized correctly. However, that assumption leans heavily on the &lt;em&gt;if&lt;/em&gt;: data warehouses that are not optimized efficiently create their own host of query response time issues, and may actually end up slower in comparison, since Trino has implemented many cost-based optimizations - specifically the pushdown model, which offloads processing effort to the underlying storage system.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Automation vs Exploration:&lt;/em&gt; &lt;br&gt;
ETL relies on running jobs in sequential order, and many batch orchestration tools (such as Airflow and Dagster) provide workflow integration to create a completely automated process. Beyond being automatic, these jobs can also be scheduled through the workflow manager and are seen as a dependable option for moving data within the defined batch window. Interactive queries are, by nature, &lt;em&gt;interactive&lt;/em&gt; and are invaluable for exploratory analytics on unfamiliar data sets or problems, which means they require hands-on-keyboard action. However, Trino and Starburst do allow users to easily schedule these queries using the tools mentioned above, letting users transition between interactive and batch with minimal effort.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Cost:&lt;/em&gt; &lt;br&gt;
The infrastructure required to support either an on-premises or a cloud-based data warehouse costs a pretty penny. As companies fill up their data warehouses with tables for one-time usage, the data usually gets forgotten, but the storage it occupies does not. After a couple of years of this habit, more storage is required and more cost is accrued than necessary.&lt;/p&gt;

&lt;p&gt;Meanwhile, Trino is based on the concept of separation of storage and compute, engages in autoscale clustering, and incorporates many more cost-based optimizations to reduce expenses. The utilization of an interactive query engine for specific self-service tasks can prevent data warehouses from the unnecessary storage of unused data.&lt;/p&gt;

&lt;p&gt;Another big cost contributing factor is developer efficiency and agility. Development of an ETL pipeline requires the integration of multiple technologies to create one workflow, whereas interactive query engines utilize ANSI SQL, which many data consumers are already familiar with. It is much easier to focus on making your join statement more efficient by adding a &lt;code&gt;WITH&lt;/code&gt; clause as opposed to debugging corrupted pipeline jobs that arise from infrastructure problems.&lt;/p&gt;
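&lt;p&gt;As a small sketch of that WITH-clause refactor (again using sqlite3 so it runs anywhere; the tables are invented), a CTE can pre-aggregate order amounts once instead of joining against every raw order row:&lt;/p&gt;

```python
# A WITH clause (CTE) that pre-aggregates before the join, keeping the
# final SELECT simple. Tables and data are invented for the illustration.
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE orders (customer_id INT, amount REAL)")
db.execute("CREATE TABLE customers (id INT, name TEXT)")
db.execute("INSERT INTO orders VALUES (1, 10.0), (1, 5.0), (2, 7.5)")
db.execute("INSERT INTO customers VALUES (1, 'Ada'), (2, 'Grace')")

rows = db.execute(
    """
    WITH totals AS (
        SELECT customer_id, SUM(amount) AS total
        FROM orders GROUP BY customer_id
    )
    SELECT c.name, t.total
    FROM customers c JOIN totals t ON t.customer_id = c.id
    ORDER BY c.name
    """
).fetchall()
print(rows)  # [('Ada', 15.0), ('Grace', 7.5)]
```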

&lt;p&gt;&lt;em&gt;Trustworthiness:&lt;/em&gt;&lt;br&gt;
Historically, the biggest trade-off between batch processing and interactive query analytics has been the battle between speed and query failure recovery. Trino was developed to be fast, so the original decision was intentionally made to exclude fault tolerance in order to lower latency, creating an all-or-nothing architecture. As a result, the failure of any Trino subtask would cause the overall query to fail, which made long-running queries tricky to manage at scale without the proper knowledge. While ETL pipelines face their own set of failures, the stakes are usually lower, as this data manipulation method splits the load between three different stages and jobs usually finish upon rerun. ETL pipelines also typically build in data quality checks, whereas an interactive query engine does not have that same ability. Despite these tradeoffs, Trino-based interactive queries are resilient and boast a surprisingly low overall failure rate, assuming the clusters are configured properly.&lt;/p&gt;

&lt;p&gt;On the other hand, some siloed data organizations build ETL pipelines on top of ETL pipelines as an intended shortcut, contributing to slower SLAs and awful debugging challenges (ETLception). If an extract job pulls zero records, that is usually only discovered after the pipeline fails on the final data quality steps. The team may or may not have access to the source’s source, and troubleshooting quickly becomes a nightmare. With interactive query engine integration, each team can pull the data they need directly from the source and easily perform the data manipulation their own use case requires.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Introducing Query Failure Recovery&lt;/em&gt;&lt;br&gt;
In a world so accustomed to the trade-off dilemma of choosing this or that (so much so that a TikTok trend took off exploiting that very concept), every decision seems irreversible. But just like the creators of Reese’s Peanut Butter Cups (probably) thought: why can’t we combine the best of both worlds?&lt;/p&gt;

&lt;p&gt;&lt;a href="https://i.giphy.com/media/3o85xIO33l7RlmLR4I/giphy.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://i.giphy.com/media/3o85xIO33l7RlmLR4I/giphy.gif" width="350" height="263"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Project Tardigrade is a dedicated effort within Trino that seeks to increase optionality and stop pigeonholing different use cases into entirely different systems. While every scenario’s needs will vary, the addition of query failure recovery at both the task and the query level opens up more possibilities for using Trino for ETL. Ideally, query failure recovery will increase ETL pipeline predictability, reduce cost, and add guardrails against the pitfalls historically associated with combining Trino and ETL. Now a data analyst can run exploratory analytical queries to identify meaningful new insights with the interactive query engine, and the steps to make that process repeatable and reliable for daily dashboarding have significantly decreased.&lt;/p&gt;

&lt;p&gt;If you are interested in seeing for yourself and have a few minutes to spare, I invite you to try &lt;a href="https://www.starburst.io/platform/starburst-galaxy/"&gt;Starburst Galaxy&lt;/a&gt; to play around with both interactive and batch clusters. Check out my &lt;a href="https://starburstdata.wistia.com/medias/tc7qhfmv5w"&gt;video&lt;/a&gt; which demonstrates how to navigate between your clusters interchangeably. Do some querying, transfigure some data, and pat yourself on the back since you are one step closer to the magical world than the rest of the muggles.&lt;/p&gt;

</description>
      <category>database</category>
      <category>dataengineering</category>
      <category>datascience</category>
      <category>etl</category>
    </item>
  </channel>
</rss>
