<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Kyle Eaton</title>
    <description>The latest articles on DEV Community by Kyle Eaton (@supercokyle).</description>
    <link>https://dev.to/supercokyle</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F328845%2Fe9053fbb-88eb-4715-8eb2-ae806e072acc.jpg</url>
      <title>DEV Community: Kyle Eaton</title>
      <link>https://dev.to/supercokyle</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/supercokyle"/>
    <language>en</language>
    <item>
      <title>How DAGs grow: Deeper, wider, and thicker</title>
      <dc:creator>Kyle Eaton</dc:creator>
      <pubDate>Tue, 17 Nov 2020 19:45:59 +0000</pubDate>
      <link>https://dev.to/supercokyle/how-dags-grow-deeper-wider-and-thicker-2gki</link>
      <guid>https://dev.to/supercokyle/how-dags-grow-deeper-wider-and-thicker-2gki</guid>
      <description>&lt;p&gt;&lt;a href="https://i.giphy.com/media/tgrui4cxDT5aq0PYYu/giphy.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://i.giphy.com/media/tgrui4cxDT5aq0PYYu/giphy.gif" alt="growing dag"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;After working in data science and engineering for many years, we’ve observed a common pattern. The effort level required to maintain data systems typically looks like this:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--fbFQv3-9--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/tdzptfc19y1uqrn66fqr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--fbFQv3-9--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/tdzptfc19y1uqrn66fqr.png" alt="cost/complexity"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;There are three phases. First, when you start building a data system, there’s an up-front cost to figure out how it should all work, fit the pieces together, and build it.&lt;/p&gt;

&lt;p&gt;Second, for a while it just works. We call this phase the “miracle of software.”&lt;/p&gt;

&lt;p&gt;Third, after a while, you start noticing that you're spending a lot of time on maintenance: fixing bugs, doing root cause analysis to figure out why things are breaking in your data pipelines, and making additions and adjustments as other systems appear or change.&lt;/p&gt;

&lt;p&gt;Over time, this turns into a steady, compounding creep in the maintenance time that you're putting into the project. If that gets too far out of control, the work just stops being fun. Stuff breaks unpredictably, timelines become highly variable, and people burn out.&lt;/p&gt;

&lt;p&gt;We wanted to articulate why this happens, to understand how to prevent it. In the end, we arrived at this mental model of the core dynamics.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why compound growth?
&lt;/h2&gt;

&lt;p&gt;To understand why the cost of maintenance compounds over time, we need to break down how data scientists, analysts, and engineers really spend their time. At a high level, most data teams spend their time asking questions and building infrastructure to make that analysis repeatable. Since we’re interested in the cost of maintaining data systems, we’re going to focus on the repeatable aspects.&lt;/p&gt;

&lt;p&gt;When data analysts, scientists, or engineers build repeatable pipelines, they build DAGs.&lt;/p&gt;

&lt;p&gt;Here’s a stylized picture of a DAG. For the sake of this example, we’ll call it an ETL/ELT project. But it could just as easily be a machine learning pipeline, a set of BI aggregations, or an operational dashboard. In data work, everything is a DAG.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--diGevUTi--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/ofbnkln4fxu81dqx6rpg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--diGevUTi--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/ofbnkln4fxu81dqx6rpg.png" alt="dag1"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Returning to our ELT example: say we ingest three tables from an upstream data source. We then need to munge and clean the data. This likely includes things like small joins and filtering.&lt;/p&gt;

&lt;p&gt;Let’s say that the end result of that pipeline is two cleaned tables, derived from the three tables of messy, external data that we started with.&lt;/p&gt;

&lt;p&gt;Boom. Functioning data pipeline.&lt;/p&gt;
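&lt;p&gt;A pipeline like this is easy to sketch in code. Here’s a minimal Python illustration (the table names are made up) that represents the DAG as an adjacency mapping and computes the “blast radius” of a node, i.e. everything downstream of it:&lt;/p&gt;

```python
# The example DAG as an adjacency mapping: three ingested tables
# feeding two cleaned tables. Table names are hypothetical.
dag = {
    "raw_orders":      ["clean_orders"],
    "raw_customers":   ["clean_orders", "clean_customers"],
    "raw_events":      ["clean_customers"],
    "clean_orders":    [],
    "clean_customers": [],
}

def downstream(dag, node, seen=None):
    """Every node reachable from `node`: the blast radius of a change."""
    seen = set() if seen is None else seen
    for child in dag.get(node, []):
        if child not in seen:
            seen.add(child)
            downstream(dag, child, seen)
    return seen
```

As the DAG grows deeper, wider, and thicker, the blast radius of any single upstream node grows with it.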

&lt;h2&gt;
  
  
  Deeper, wider, and thicker
&lt;/h2&gt;

&lt;p&gt;What happens next? Answer: the DAG grows.&lt;/p&gt;

&lt;p&gt;When people trust a source of information, they tend to ask more questions of it. When the same questions have repeated value, their supporting DAGs tend to grow---in three specific ways.&lt;/p&gt;

&lt;p&gt;First, they grow &lt;strong&gt;deeper&lt;/strong&gt;. As soon as we've got that nice clean, normalized data, someone is going to ask to see it. In the simplest case, this could be a one-off query. If it’s made repeatable, those queries probably become dashboards or reports.&lt;/p&gt;

&lt;p&gt;For the sake of this example, let’s say you end up building four dashboards on top of the initial cleaned table.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--0j9jicWI--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/4ham46h0odipmubxgoe4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--0j9jicWI--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/4ham46h0odipmubxgoe4.png" alt="dag2"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Second, data systems grow by getting &lt;strong&gt;wider&lt;/strong&gt;--by adding additional data sources. Of course, each of these will need their own ELT as well. And then new data products will be built on top of those nodes in the DAG.&lt;/p&gt;

&lt;p&gt;For our example, let’s say that this leads to 2 new ingested sources, and 2 normalized tables. On top of that, we build an alerting system with three types of alerts at the bottom of the DAG.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Pqlgm6-B--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/2sszlfk4mpo859ny637x.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Pqlgm6-B--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/2sszlfk4mpo859ny637x.png" alt="dag3"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The third way that data systems grow is by getting &lt;strong&gt;thicker&lt;/strong&gt;. In our example, it won’t be long before users start asking to add alerts that use some of the same data as the dashboards, and add dashboards that report on the alerting system.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--BgrImcue--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/taoy3mh23avhx3a8dxlj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--BgrImcue--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/taoy3mh23avhx3a8dxlj.png" alt="dag4"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Saying that DAGs “grow thicker” is the same as saying that they become more interconnected. When you map this out visually for actual DAGs, they usually turn into messy-looking hairballs, fast.&lt;/p&gt;

&lt;p&gt;The thing is, that messy-looking interconnectedness is often the whole point of the data system. Breaking down information silos, sharing data, creating more contextually aware decision support systems---this is the work that data teams are paid to do.&lt;/p&gt;

&lt;h2&gt;
  
  
  Downstream consequences
&lt;/h2&gt;

&lt;p&gt;We’ve now sketched out the main causes of compounding maintenance burden in data pipelines. There’s one last piece: downstream consequences.&lt;/p&gt;

&lt;p&gt;Because data flows through the DAG, changes in the upstream DAG can affect the behavior of the downstream DAG. Even seemingly small changes can have large, unexpected consequences.&lt;/p&gt;

&lt;p&gt;For example, changing an upstream table to allow null values could change the denominator in important downstream calculations. Perversely, adding null values might not have any impact on the numerator---which means that many reports based on exactly the same tables and columns might not be affected.&lt;/p&gt;
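&lt;p&gt;Here’s a toy Python illustration of that effect (the “converted” column is hypothetical): newly-allowed nulls dilute a downstream rate without touching the numerator at all.&lt;/p&gt;

```python
# Illustrative only: how newly-allowed nulls can shift a downstream metric.
# Before the schema change, every row had a real boolean value.
before = [True, False, True, False]
# After the change, the same table admits null (None) rows.
after = [True, False, True, False, None, None]

def conversion_rate(rows):
    # Counting all rows in the denominator silently dilutes the rate;
    # the numerator (count of True) is unchanged.
    return sum(1 for r in rows if r is True) / len(rows)

conversion_rate(before)  # 0.5
conversion_rate(after)   # ~0.33: same numerator, bigger denominator
```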

&lt;p&gt;Another example: suppose an upstream logging team changes the enumerated values assigned to certain types of logs. If those values were being used to trigger alerts and the alerting system isn’t updated at the same time, you might suddenly find yourself responding to lots of false alarms.&lt;/p&gt;

&lt;p&gt;More subtly, if the values were being used as inputs in a machine learning model, the model might silently start generating bad predictions. Depending on the type of model and how much weight was assigned to the affected variables, the impact on predictive accuracy could be large or small, widespread or concentrated in a few cases.&lt;/p&gt;

&lt;p&gt;All of these types of problems have a few things in common:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Upstream changes can have unexpected downstream consequences.&lt;/li&gt;
&lt;li&gt;Unintended consequences don’t necessarily show up in nodes immediately adjacent to where the change was made. They can skip levels and show up much deeper in the DAG.&lt;/li&gt;
&lt;li&gt;Unintended consequences can cascade to cause additional havoc further downstream in your DAG.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  A quasi-proof for compounding maintenance cost
&lt;/h2&gt;

&lt;p&gt;Putting all of this together, we can start to understand why the cost of maintenance compounds.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The cost of maintenance is directly tied to the probability of downstream consequences of upstream changes in the DAG.&lt;/li&gt;
&lt;li&gt;The probability of unintended consequences is a joint function of the number of nodes in the DAG, the density of edges in the DAG, and the frequency of changes in the upstream DAG.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Both of these factors tend to increase as DAGs grow, which is why the probability of downstream consequences and the cost of maintenance increase as a super-linear function of the size of the DAG.&lt;/p&gt;
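&lt;p&gt;One way to see the super-linearity is a toy probability model (our own illustration, not a formal proof): if each upstream change independently breaks any given edge with some small probability, the chance that at least one change propagates grows rapidly with the product of edge count and change frequency.&lt;/p&gt;

```python
# Toy model: each of `changes` upstream changes breaks each of `edges`
# edges independently with probability p_per_edge. The chance that at
# least one breakage occurs is the complement of "nothing breaks".
def p_any_breakage(edges, changes, p_per_edge=0.01):
    return 1 - (1 - p_per_edge) ** (edges * changes)

# Both inputs grow together as the DAG grows, so risk compounds:
p_any_breakage(edges=10, changes=5)    # ~0.39
p_any_breakage(edges=40, changes=20)   # ~0.9997
```

The exact numbers are made up; the point is that doubling both the density of edges and the rate of change more than doubles the risk.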

&lt;h2&gt;
  
  
  Recap
&lt;/h2&gt;

&lt;p&gt;Okay, this is a good breaking point. In a followup article, I’ll get into more details about how DAG maintenance plays out in practice, especially when data flows cross team boundaries and downstream consequences of changing DAGs surface at unexpected moments.&lt;/p&gt;

&lt;p&gt;This blog was originally written by Abe Gong here: &lt;a href="https://greatexpectations.io/blog/deeper-wider-thicker/"&gt;https://greatexpectations.io/blog/deeper-wider-thicker/&lt;/a&gt;&lt;/p&gt;

</description>
      <category>datascience</category>
      <category>database</category>
      <category>dataquality</category>
      <category>devops</category>
    </item>
    <item>
      <title>Your data tests failed! Now what?</title>
      <dc:creator>Kyle Eaton</dc:creator>
      <pubDate>Mon, 09 Nov 2020 19:41:37 +0000</pubDate>
      <link>https://dev.to/supercokyle/your-data-tests-failed-now-what-4cl4</link>
      <guid>https://dev.to/supercokyle/your-data-tests-failed-now-what-4cl4</guid>
      <description>&lt;p&gt;&lt;a href="https://i.giphy.com/media/WRQBXSCnEFJIuxktnw/source.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://i.giphy.com/media/WRQBXSCnEFJIuxktnw/source.gif" alt="Now what?"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Congratulations&lt;/strong&gt;, you’ve successfully implemented data testing in your pipeline! Whether that’s using an off-the-shelf tool or home-cooked validation code, you know that securing your data through data testing is absolutely crucial to ensuring high-quality reliable data insights, and you’ve taken the necessary steps to get there. All your data problems are now solved and you can sleep soundly knowing that your data pipelines will be delivering beautiful, high-quality data to your stakeholders! &lt;strong&gt;But wait… not so fast&lt;/strong&gt;. There’s just one detail you may have missed: What happens when your tests actually fail? Do you know how you’re going to be alerted? Is anyone monitoring the alerts? Who is in charge of responding to them? How would you be able to tell what went wrong? And… how do you fix any data issues that arise?&lt;/p&gt;

&lt;p&gt;As excited as data teams might be about implementing data validation in their pipelines - &lt;strong&gt;the real challenge (and art!) of data testing is not only how you detect data problems, but also how you respond to them&lt;/strong&gt;. In this article, we’ll talk through some of the key stages of responding to data tests, and outline some of the important things to consider when developing a data quality strategy for your team. The diagram below shows the steps we will cover:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;System response to failure&lt;/li&gt;
&lt;li&gt;Logging and alerting&lt;/li&gt;
&lt;li&gt;Alert response&lt;/li&gt;
&lt;li&gt;Root cause identification&lt;/li&gt;
&lt;li&gt;Issue resolution&lt;/li&gt;
&lt;li&gt;Stakeholder communication (across several stages)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fruvfm48unc9si6rme7m2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fruvfm48unc9si6rme7m2.png" alt="Data testing response"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  System response
&lt;/h2&gt;

&lt;p&gt;The first line of response to a failed data test, before any humans are notified, is the automated response of the system to the test failure, which decides whether and how to continue any pipeline runs. This could take one of the following forms:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Do nothing. Continue to run the pipeline and simply log the failure or alert the team (more on that below).&lt;/li&gt;
&lt;li&gt;Isolate the “bad” data, e.g. move the rows that fail the tests to a separate table or file, but continue to run the pipeline for the remainder.&lt;/li&gt;
&lt;li&gt;Stop the pipeline.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The system response can also vary depending on the level of severity of the detected issue and the downstream use case: Maybe it’s okay to keep running the pipeline and only notify stakeholders for certain “warning” level problems, but it should absolutely not proceed for other, “critical”, errors.&lt;/p&gt;
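&lt;p&gt;As a sketch, the three response modes above might be keyed on severity like this (function and level names are illustrative, not from any particular tool):&lt;/p&gt;

```python
# A sketch of the three system responses, keyed on severity.
def handle_validation_result(result, severity):
    if result["success"]:
        return "continue"
    if severity == "warning":
        log_and_alert(result)               # do nothing else; keep running
        return "continue"
    if severity == "error":
        quarantine(result["failed_rows"])   # isolate the "bad" data
        return "continue_partial"
    return "halt"                           # critical: stop the pipeline

# Stubs standing in for real logging / quarantine infrastructure.
def log_and_alert(result):
    print("WARN:", result)

def quarantine(rows):
    print(f"quarantined {len(rows)} rows")
```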

&lt;h2&gt;
  
  
  Logging and alerting
&lt;/h2&gt;

&lt;p&gt;While it is absolutely possible for data validation results to be simply written to some form of log, we assume that at least some of your tests will be critical enough to require alerting. Some things to consider here are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Which errors need alerting, and which ones can be simply logged as a warning? Make sure to choose the correct level of severity for your alerts and only notify stakeholders when it’s absolutely necessary in order to avoid alert fatigue.&lt;/li&gt;
&lt;li&gt;Which medium do you choose for the alerts? Are you sending messages to a busy Slack channel or someone’s email inbox where they might go unnoticed? Do critical alerts get mixed in with daily status reports that might be less relevant to look at? Using a tool such as PagerDuty allows you to fine-tune your alerts to match the level of severity and responsiveness required.&lt;/li&gt;
&lt;li&gt;What is the timeliness of alerts? Do alerts get sent out at a certain time or do they just show up at some point during the day? This is an important factor to consider &lt;strong&gt;when your alerting mechanism fails&lt;/strong&gt; - would anyone notice?&lt;/li&gt;
&lt;/ul&gt;
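&lt;p&gt;In practice, these choices often boil down to a small routing table from severity to medium. A minimal sketch, with made-up channel and service names:&lt;/p&gt;

```python
# Illustrative routing table: match alert severity to a medium with the
# right interrupt level. Channel and service names are hypothetical.
ROUTES = {
    "info":     ("log", None),                 # written to logs only
    "warning":  ("slack", "#data-quality"),    # visible, not interrupting
    "critical": ("pagerduty", "data-oncall"),  # pages a human
}

def route_alert(severity, message):
    # Unknown severities escalate rather than disappear silently.
    medium, target = ROUTES.get(severity, ROUTES["critical"])
    return {"medium": medium, "target": target, "message": message}
```

Escalating unknown severities by default is one way to make sure a misconfigured alert still gets noticed.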

&lt;h2&gt;
  
  
  Alert response
&lt;/h2&gt;

&lt;p&gt;Now that your alerting is nicely set up, you’re onto the next hurdle: Who will actually see and respond to those notifications? Some factors to take into account are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Who gets notified and when? Upstream data producers, downstream data consumers, the team that owns the data pipelines, anyone else? Make sure you have a clear map of who touches your data and who needs to know if there are any issues.&lt;/li&gt;
&lt;li&gt;Who is actually responsible for acknowledging and investigating the alert? This is probably one of the most crucial factors to consider when setting up data testing: Someone actually needs to own the response. This might not always be the same person or team for all types of tests, but you’d better have a clear plan in order to avoid issues going unnoticed or ignored, which in turn can cause frustration with stakeholders. I’m not saying you need an on-call rotation, but maybe… maybe, you need an on-call rotation. Having said that, please see the previous section on fine-tuning the severity of your alerts: On-call does not necessarily mean getting a PagerDuty call in the middle of the night. It just means that someone knows they’re responsible for those alerts, and their team and stakeholders know who is responsible.&lt;/li&gt;
&lt;li&gt;Are your notifications clear enough for your stakeholders to know what they imply? In particular, do your data consumers know how to interpret an alert and know what steps to take to get more information about the problem or a potential resolution? (Hint: Having a clear point of contact, such as an on-call engineer, often helps with this, too!)&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Stakeholder communication
&lt;/h2&gt;

&lt;p&gt;While it’s easy to jump right into responding to a test failure and figure out what’s going on, you should probably stop for a moment to think about who else needs to know. Most importantly, in most cases you’ll want to let your data consumers know that “something is up with the data” before they notice. Of course, this is not specific to data pipelines, but it’s often harder for downstream data consumers to see that data is “off” compared to, say, a web app being down or buggy. Stakeholders could either already be notified through automated alerting, or through a playbook that includes notifying the right people or teams depending on the level of severity of your alerts. You’ll also want to keep an open line of communication with your stakeholders to give them updates on the issue resolution process and be available to answer any questions, or if (and only if) absolutely necessary, make some quick fixes in case there are some urgent data needs.&lt;/p&gt;

&lt;h2&gt;
  
  
  Root cause identification
&lt;/h2&gt;

&lt;p&gt;At a high level, we think of root causes for data test failures as belonging to one of the following categories:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The data is actually correct, but our tests need to be adjusted. This can happen, for example, when there are unusual, but correct, outliers.&lt;/li&gt;
&lt;li&gt;The data is indeed “broken”, but it can be fixed. A straightforward example for this is incorrect formatting of dates or phone numbers.&lt;/li&gt;
&lt;li&gt;The data is indeed corrupted, and it can’t be fixed, for example, when it is missing values.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;One very common source of data issues that arise at the data loading or ingestion stage is changes that are mostly out of the control of the data team. In my time working with third party healthcare data, I’ve seen a variety of data problems that arose seemingly out of nowhere. Some common examples include data not being up-to-date due to delayed data deliveries, table properties such as column names and types changing unexpectedly, or values and ranges digressing from what’s expected due to changes in how the data is generated.&lt;br&gt;
Another major cause of data ingestion issues is problems with the actual ingestion runs or orchestration, which often manifest as “stale data”. This can happen when processes hang, crash, or get backed up due to long runtimes.&lt;/p&gt;

&lt;p&gt;Now, how do you approach identifying the root cause of data ingestion issues? The key here is to be methodical about&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Identifying the exact issue that’s actually happening and&lt;/li&gt;
&lt;li&gt;Identifying what causes the issue.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Regarding the former, my recommendation is to not take problems and test failures at face value. For example, a test for NULL values in a column could fail because some rows have actual NULL values - or because that column no longer exists. Make sure you look at all failures and identify what exactly the problem is. Once the problem is clear, it’s time to put on your detective hat and start investigating what could have caused it. Of course we can’t list all potential causes here, but some common ones you might want to check include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Recent changes to ingestion code (ask your team mates or go through your version control log)&lt;/li&gt;
&lt;li&gt;Crashed processes or interrupted connections (log files are usually helpful)&lt;/li&gt;
&lt;li&gt;Delays in data delivery (check if all your source data made it to where it’s ingested from in time)&lt;/li&gt;
&lt;li&gt;Upstream data changes (check in the source data and confirm with the data producers whether this was intentional or not)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;And finally, while data ingestion failures are often outside of our control, test failures on the transformed data are usually caused by changes to the transformation code. One way to counteract these kinds of unexpected side effects is to make data pipeline testing part of your development and CI/CD processes. Enabling engineers and data scientists to automatically test their code, e.g. against a golden data set, makes it less likely for unwanted side effects to actually go into production.&lt;/p&gt;
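&lt;p&gt;The NULL-check example from above can be sketched in a few lines: the same failing test has two very different root causes, and a small diagnostic tells them apart (the column name and row shapes are hypothetical):&lt;/p&gt;

```python
# "Don't take the failure at face value": a null-check can fail because
# some values really are null, or because the column vanished upstream.
def diagnose_null_failure(rows, column):
    if not rows or column not in rows[0]:
        return "column_missing"   # likely an upstream schema change
    nulls = sum(1 for r in rows if r[column] is None)
    if nulls:
        return f"{nulls}_null_values"   # genuinely broken data
    return "test_misconfigured"   # data looks fine; inspect the test

diagnose_null_failure([{"id": 1}], "email")        # column_missing
diagnose_null_failure([{"email": None}], "email")  # 1_null_values
```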

&lt;h2&gt;
  
  
  Issue resolution
&lt;/h2&gt;

&lt;p&gt;Now... how do I fix this? Of course, there is no single approach to fixing data issues, as the fix heavily depends on its actual cause - duh. Going back to our framework of the three types of root causes for test failures we laid out in the previous section, we can consider the following three categories of “fixes” to make your tests go green again:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;If you determine that the data is indeed correct but your tests failed, you need to adjust your tests in order to take into account this new knowledge.&lt;/li&gt;
&lt;li&gt;If the data is fixable, some of the potential resolutions include re-running your pipelines, potentially with increased robustness towards disruptions such as connection timeouts or resource constraints, or fixing your pipeline code and ideally adding some mechanism to allow engineers to test their code to prevent the same issue from happening again.&lt;/li&gt;
&lt;li&gt;If the data is broken beyond your control, you might have to connect with the data producers to re-issue the data, if that’s at all possible. However, there may also be situations in which you need to isolate the “broken” records, data sets, or partitions, until the issue is resolved, or perhaps for good. Especially when you’re dealing with third party data, it sometimes happens that data is deleted, modified, or no longer updated, to the point where it’s simply no longer suitable for your use case.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  My data tests pass, I’m good!
&lt;/h2&gt;

&lt;p&gt;Ha! You wish! Not to ruin your day here, but you might also want to consider that your data tests pass because you’re simply not testing for the right thing. And trust me, given that it’s almost impossible to write data tests for every single possible data problem before you encounter it the first time, you’ll likely be missing some cases, whether that’s small and very rare edge cases, or something glaringly obvious. I am happy to admit that I once managed a daily data ingestion pipeline that would alert if record counts &lt;em&gt;dropped&lt;/em&gt; significantly from one day to the next, since that was usually our biggest concern. Little did I know that a bug in our pipeline would accidentally &lt;em&gt;double&lt;/em&gt; the record counts, which, besides some “hmm, those pipelines are running very slow today” comments, aroused shockingly little suspicion - until a human actually looked at the resulting dashboards and noticed that our user count had skyrocketed that day.&lt;/p&gt;

&lt;p&gt;So what do you do to make your tests more robust against these “unknown unknowns”? Well, to be honest, this is a yet-to-be-solved problem for us, too, but here are some ideas:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Use an automated profiler to generate data tests in order to increase test coverage in areas that might not be totally obvious to you. For example, you might not even consider testing for the mean of a numeric column, but an automatically generated test could make your data more robust against unexpected shifts that are not caught by simply asserting the min and max of that column. One option to consider is putting these “secondary” tests into a separate test suite and reducing the alerting level, so you only get notified about actual meaningful changes.&lt;/li&gt;
&lt;li&gt;Make sure to socialize your data tests within the team and do code reviews of the tests whenever they are added or modified, just like you would with the actual pipeline code. This will make it easier to surface all the assumptions the team working on the pipeline has about the data and highlight any shortcomings in the tests.&lt;/li&gt;
&lt;li&gt;Do manual spot checks on your data, possibly also with the help of a profiler. Automated tests are great, but I would claim that familiarity with data is always an important factor in how quickly a team can spot when something is “off”, even when there is no test in place. One last step of your data quality strategy could be to implement a periodical “audit” of your data assets to ensure things still look the way they should and that tests are complete and accurate (and actually run).&lt;/li&gt;
&lt;/ul&gt;
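&lt;p&gt;As an illustration of the first idea, here’s a library-agnostic sketch of a profiler-style “mean drift” check; in Great Expectations this corresponds to expectations on aggregate statistics rather than only per-value bounds:&lt;/p&gt;

```python
import statistics

# A "secondary" test a profiler might generate: bound the column mean,
# not just its min and max. Bounds here are illustrative.
def check_mean_drift(values, lo, hi):
    mean = statistics.mean(values)
    # Clamping trick avoids explicit comparisons: the clamped mean
    # equals the mean exactly when lo ≤ mean ≤ hi.
    in_bounds = min(max(mean, lo), hi) == mean
    return {"success": in_bounds, "observed_mean": mean}

# Every value is inside the old min/max, yet the distribution shifted;
# a min/max test alone would pass while the mean check fails.
check_mean_drift([1, 1, 1, 2], lo=3.0, hi=6.0)
```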

&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;p&gt;We really hope this post has given you a good idea of the different steps to consider when you’re implementing data validation for your pipelines. Keep in mind that developing and running tests in production is only one aspect of a data quality strategy. You’ll also need to factor in things like alerting, ownership of response, communication with stakeholders, root cause analysis, and issue resolution, which can take a considerable amount of time and effort if you want to do it well.&lt;/p&gt;

&lt;p&gt;If you want some more concrete examples, check out our case studies on how some users of Great Expectations, such as &lt;a href="https://greatexpectations.io/blog/komodo-case-study/" rel="noopener noreferrer"&gt;Komodo Health&lt;/a&gt;, &lt;a href="https://greatexpectations.io/blog/calm-case-study/" rel="noopener noreferrer"&gt;Calm&lt;/a&gt;, and &lt;a href="https://greatexpectations.io/blog/avanade-case-study/" rel="noopener noreferrer"&gt;Avanade&lt;/a&gt; integrate Great Expectations into their data workflows.&lt;/p&gt;

</description>
      <category>datascience</category>
      <category>devops</category>
      <category>database</category>
      <category>dataops</category>
    </item>
    <item>
      <title>Continuous Integration for your data with GitHub Actions and Great Expectations</title>
      <dc:creator>Kyle Eaton</dc:creator>
      <pubDate>Thu, 01 Oct 2020 19:49:49 +0000</pubDate>
      <link>https://dev.to/supercokyle/continuous-integration-for-your-data-with-github-actions-and-great-expectations-4pla</link>
      <guid>https://dev.to/supercokyle/continuous-integration-for-your-data-with-github-actions-and-great-expectations-4pla</guid>
      <description>&lt;p&gt;&lt;strong&gt;If you are reading this before Oct 8th, you can join our Community Show and Tell, where we will demo this outstanding integration for the first time. &lt;a href="https://greatexpectations.io/blog/show-and-tell-10-08-signup/"&gt;Sign up here&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--HpZd-mkU--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_66%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/rqy6cxujz45vcox92r1n.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--HpZd-mkU--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_66%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/rqy6cxujz45vcox92r1n.gif" alt="great_expectations-github-action"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You might have noticed that we’ve been busy in the past few weeks working on some really amazing collaborations with fellow tech and data folks in the Great Expectations community (like the &lt;a href="https://greatexpectations.io/blog/dagster-integration-announcement/"&gt;Dagster integrations&lt;/a&gt; and our &lt;a href="https://greatexpectations.io/blog/komodo-case-study/"&gt;Komodo Health&lt;/a&gt; and &lt;a href="https://greatexpectations.io/blog/calm-case-study/"&gt;Calm&lt;/a&gt; case studies). This project has been brewing for a while, and we’re absolutely over the moon (yes!) to announce that we’ve just published a &lt;strong&gt;GitHub Action for Great Expectations&lt;/strong&gt; aka &lt;strong&gt;“CI/CD for data”&lt;/strong&gt;, live in GitHub. This means that you can now have data validation as part of your continuous integration (CI) workflows to secure your data pipelines and prevent data pipeline bugs from getting into production. Read this post to learn more about what we worked on and how you can make use of the integration, or just &lt;a href="https://github.com/superconductive/great_expectations_action"&gt;go straight to the repo and check out all the info in the README&lt;/a&gt;!&lt;/p&gt;

&lt;h2&gt;
  
  
  What are GitHub Actions?
&lt;/h2&gt;

&lt;p&gt;GitHub Actions are a feature in GitHub that helps you automate your software development workflows in the same place you store code and collaborate on pull requests and issues. You can write individual tasks, called &lt;em&gt;actions&lt;/em&gt;, and combine them to create a custom workflow. Workflows are custom automated processes that you can set up in your repository to build, test, package, release, or deploy any code project on GitHub. With GitHub Actions you can build end-to-end continuous integration (CI) and continuous deployment (CD) capabilities &lt;strong&gt;directly in your repository&lt;/strong&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  How do GitHub Actions integrate with Great Expectations?
&lt;/h2&gt;

&lt;p&gt;Over the past couple of months, our team (in particular GE engineer Taylor Miller) has been working closely with &lt;a href="https://hamel.dev/"&gt;Hamel Husain&lt;/a&gt; from the GitHub team to create an action that allows you to run data validation with Great Expectations from your GitHub repository when you create or update a PR (or based on other GitHub events). You can find detailed step-by-step instructions in the &lt;a href="https://github.com/superconductive/great_expectations_action"&gt;documentation for this action&lt;/a&gt;, but here’s a quick peek at what your workflow will look like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Make sure your data pipelines or model code is in a GitHub repo.&lt;/li&gt;
&lt;li&gt;Set up a deployment of Great Expectations, &lt;a href="https://docs.greatexpectations.io/en/latest/guides/how_to_guides/configuring_datasources.html"&gt;connect to your data&lt;/a&gt; (files, SQLAlchemy sources, Spark dataframes…), and &lt;a href="https://docs.greatexpectations.io/en/latest/guides/how_to_guides/creating_and_editing_expectations.html"&gt;create Expectations&lt;/a&gt; to assert what you expect your data to look like. The data could be either real data in a dev/testing environment or static data fixtures.&lt;/li&gt;
&lt;li&gt;Configure your GitHub repository to use the GE action, and connect it to your datasource by &lt;a href="https://docs.github.com/en/actions/configuring-and-managing-workflows/creating-and-storing-encrypted-secrets"&gt;adding credentials to GitHub Secrets&lt;/a&gt;, if needed.&lt;/li&gt;
&lt;li&gt;Modify your data pipelines and re-run them in a dev or test environment.&lt;/li&gt;
&lt;li&gt;Push the modified code and create a PR.&lt;/li&gt;
&lt;li&gt;This will then trigger the GitHub action to run data validation with Great Expectations on the dev/test data environment and publish the validation result to your PR as a comment. You can also configure Data Docs to be served on a platform such as Netlify.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We can think of several applications for this action. In an ETL pipeline, for example, it could be as simple as making sure that changes to the pipeline don’t introduce data quality issues into the downstream data. To isolate issues caused by pipeline changes from those caused by data changes, we recommend running these tests on static test data. In an ML context, you can also test that the output of your model still meets certain expectations after you modify the model.&lt;/p&gt;
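&lt;p&gt;To make the idea concrete, here is a minimal plain-Python sketch of what such a CI data test does conceptually (this is &lt;em&gt;not&lt;/em&gt; the Great Expectations API, and all column names are made up): run the pipeline’s output against a set of declarative expectations and fail the build if any of them don’t hold.&lt;/p&gt;

```python
def validate(rows, expectations):
    """Return a list of failed expectation names; empty means the data passed."""
    failures = []
    for name, check in expectations.items():
        if not all(check(row) for row in rows):
            failures.append(name)
    return failures

# Static test fixture: the pipeline output we expect to see in CI.
rows = [
    {"user_id": 1, "revenue": 19.99},
    {"user_id": 2, "revenue": 0.0},
]

# Declarative expectations about the data, named so failures are readable.
expectations = {
    "user_id is present": lambda r: r.get("user_id") is not None,
    "revenue is non-negative": lambda r: r["revenue"] >= 0,
}

# In a CI job, a non-empty failure list would fail the build.
assert validate(rows, expectations) == []
```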

&lt;h2&gt;Why are we so excited about this collaboration?&lt;/h2&gt;

&lt;p&gt;To the best of our knowledge, &lt;strong&gt;this is one of the first integrations of data testing and documentation into a CI/CD workflow supported by a platform as big as GitHub&lt;/strong&gt;. We all know we should test our data pipelines, but testing is often done either manually by a data engineer during development or by a home-grown data validation system. Neither approach is particularly reliable, scalable, or sustainable in the long term. Just as you would run integration tests on a PR for code, the GE GitHub Action runs data tests on your updated data and catches potential issues in the code changes before they reach production. Any engineer or data scientist changing the pipeline can still run the regular GE tests locally, but the CI tests provide an additional safety net, and you can even run more extensive tests on remote infrastructure.&lt;/p&gt;

&lt;p&gt;You’ll find detailed information and instructions about the action in the &lt;strong&gt;&lt;a href="https://github.com/superconductive/great_expectations_action"&gt;Great Expectations Action repo&lt;/a&gt;&lt;/strong&gt;, so hop over, check it out, and get started. And as always, feel free to join the &lt;a href="https://greatexpectations.io/slack"&gt;GE Slack channel&lt;/a&gt; if you have any questions or want to contribute to the open source project!&lt;/p&gt;

&lt;p&gt;And finally, once more a &lt;strong&gt;big thanks to Hamel from the GitHub team&lt;/strong&gt; for this amazing collaboration; it’s been an absolute pleasure working with you!&lt;/p&gt;

</description>
      <category>datascience</category>
      <category>testing</category>
      <category>devops</category>
      <category>github</category>
    </item>
    <item>
      <title>Why data quality is key to successful ML Ops</title>
      <dc:creator>Kyle Eaton</dc:creator>
      <pubDate>Mon, 28 Sep 2020 15:39:12 +0000</pubDate>
      <link>https://dev.to/supercokyle/why-data-quality-is-key-to-successful-ml-ops-ngk</link>
      <guid>https://dev.to/supercokyle/why-data-quality-is-key-to-successful-ml-ops-ngk</guid>
      <description>&lt;p&gt;&lt;em&gt;In this first post in our 2-part ML Ops series, we are going to look at ML Ops and highlight how and why data quality is key to ML Ops workflows.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Machine learning has been, and will continue to be, one of the biggest topics in data for the foreseeable future. And while we in the data community are all still riding the high of discovering and tuning predictive algorithms that can tell us whether a picture shows &lt;a href="https://www.freecodecamp.org/news/chihuahua-or-muffin-my-search-for-the-best-computer-vision-api-cbda4d6b425d/"&gt;a dog or a blueberry muffin&lt;/a&gt;, we’re also beginning to realize that &lt;strong&gt;ML isn’t just a magic wand&lt;/strong&gt; you can wave at a pile of data to quickly get insightful, reliable results.&lt;/p&gt;

&lt;p&gt;Instead, we are starting to treat ML like other software engineering disciplines that require processes and tooling to ensure seamless workflows and reliable outputs. &lt;strong&gt;Data quality&lt;/strong&gt;, in particular, has been a consistent focus, as it often leads to issues that can go unnoticed for a long time, bring entire pipelines to a halt, and erode the trust of stakeholders in the reliability of their analytical insights:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“Poor data quality is Enemy #1 to the widespread, profitable use of machine learning, and for this reason, the growth of machine learning increases the importance of data cleansing and preparation. The quality demands of machine learning are steep, and bad data can backfire twice -- first when training predictive models and second in the new data used by that model to inform future decisions.” (&lt;a href="https://tdwi.org/articles/2019/04/16/diq-all-data-quality-problems-will-haunt-your-analytics-future.aspx"&gt;tdwi blog&lt;/a&gt;)&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;In this post, we are going to look at ML Ops, a recent development in ML that bridges the gap between ML and traditional software engineering, and highlight how data quality is key to ML Ops workflows in order to accelerate data teams and maintain trust in your data.&lt;/p&gt;

&lt;h2&gt;What is ML Ops?&lt;/h2&gt;

&lt;p&gt;Let’s take a step back and first look at what we actually mean by “ML Ops”. The term evolved from the better-known concept of &lt;a href="https://en.wikipedia.org/wiki/DevOps"&gt;“DevOps”&lt;/a&gt;, which refers to the set of tools and practices that combine software development and IT operations. The goal of DevOps is to &lt;strong&gt;accelerate software development&lt;/strong&gt; and deployment throughout the entire development lifecycle while &lt;strong&gt;ensuring the quality&lt;/strong&gt; of software by streamlining and automating many of the required steps. Familiar examples of DevOps practices include version control of code using tools such as git, code reviews, continuous integration (CI), i.e. frequently merging code into a shared mainline, automated testing, and continuous deployment (CD), i.e. frequent automated releases of code into production.&lt;/p&gt;

&lt;p&gt;When applied to a machine learning context, the goals of ML Ops are very similar: &lt;em&gt;to accelerate the development and production deployment of machine learning models while ensuring the quality of model outputs&lt;/em&gt;. However, unlike with software development, ML deals with both code and data:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Machine learning starts with data that’s being ingested from various sources, cleaned, transformed, and stored using code.&lt;/li&gt;
&lt;li&gt;That data is then made available to data scientists who write code to engineer features, develop, train and test machine learning models, which, in turn, are eventually deployed to a production environment.&lt;/li&gt;
&lt;li&gt;In production, ML models exist as code that takes input data, which again may be ingested from various sources, and creates output data that feeds into products and business processes.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--jFSnWRpn--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/m46lvwtgnqfxtusxbuyb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--jFSnWRpn--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/m46lvwtgnqfxtusxbuyb.png" alt="ML workflow diagram"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;And while our description of this process is obviously simplified, it’s clear that &lt;strong&gt;code and data&lt;/strong&gt; are tightly coupled in a machine learning environment, and ML Ops needs to take care of both.&lt;/p&gt;

&lt;p&gt;Concretely, this means that ML Ops incorporates tasks such as:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Version control of any code used for data transformations and model definitions&lt;/li&gt;
&lt;li&gt;Automated testing of the ingested data and model code before going into production&lt;/li&gt;
&lt;li&gt;Deployment of the model in production in a stable and scalable environment&lt;/li&gt;
&lt;li&gt;Monitoring of the model performance and output&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;How do data testing and documentation fit into ML Ops?&lt;/h2&gt;

&lt;p&gt;Let’s go back to the original goal of ML Ops: to accelerate the development and production deployment of machine learning models while ensuring the quality of model outputs. Of course, as data quality folks, we at Great Expectations believe that data testing and documentation are absolutely essential to accomplishing those key goals of acceleration and quality at various stages in the ML workflow:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;On the stakeholder side, poor data quality undermines the trust stakeholders have in a system, which hurts their ability to make decisions based on it. Even worse, data quality issues that go unnoticed can lead to incorrect conclusions and wasted time rectifying the resulting problems.&lt;/li&gt;
&lt;li&gt;On the engineering side, scrambling to fix data quality problems spotted by downstream consumers is one of the biggest drains on a team’s time, and it slowly erodes productivity and morale.&lt;/li&gt;
&lt;li&gt;Moreover, data &lt;strong&gt;documentation&lt;/strong&gt; is essential for all stakeholders to communicate about the data and establish data contracts: &lt;em&gt;“Here is what we know to be true about the data, and we want to ensure that continues to be the case.”&lt;/em&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In the following sections, we’ll look at the individual stages of an ML pipeline at a very abstract level and discuss how data testing and documentation fit into each stage.&lt;/p&gt;

&lt;h3&gt;At the data ingestion stage&lt;/h3&gt;

&lt;p&gt;Even at the earliest stages of working with a data set, establishing quality checks around your data and documenting those can immensely speed up operations in the long run. Solid data testing gives engineers confidence that they can safely make changes to ingestion pipelines without causing unwanted problems. At the same time, when ingesting data from internal and external upstream sources, data validation at the ingestion stage is absolutely critical to ensure that there are no unexpected changes to the data that go unnoticed.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://twitter.com/fishnets88/status/1304711387898314752?s=20"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--0wxQMRLL--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://greatexpectations.io/static/6d103b94a7c3000ad697143e1eb41e84/2d849/tweet_data_quality.png" alt="tweet"&gt;&lt;/a&gt;&lt;br&gt;
&lt;small&gt;Twitter thread by &lt;a href="https://twitter.com/peteskomoroch"&gt;Pete Skomoroch&lt;/a&gt; and &lt;a href="https://twitter.com/fishnets88"&gt;Vincent D. Warmerdam&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;

&lt;p&gt;We’ve been trying really hard to avoid this cliché in this post, but here we go: &lt;strong&gt;Garbage in, garbage out&lt;/strong&gt;. Thoroughly testing your input data is absolutely fundamental to ensuring your model output isn’t completely useless.&lt;/p&gt;
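&lt;p&gt;As a minimal illustration of an ingestion-stage check (standard-library Python only, with made-up column names, not any particular framework’s API), even a simple header check can stop an unexpected upstream schema change at the door:&lt;/p&gt;

```python
# Sketch of an ingestion-stage check: verify that an incoming CSV
# contains the columns we expect before it enters the pipeline.
import csv
import io

EXPECTED_COLUMNS = {"order_id", "amount", "currency"}

def check_header(csv_text):
    """Return True if the file's header row contains every expected column."""
    reader = csv.reader(io.StringIO(csv_text))
    header = set(next(reader, []))
    return EXPECTED_COLUMNS.issubset(header)

good = "order_id,amount,currency\n1,9.99,USD\n"
bad = "order_id,amount\n1,9.99\n"  # upstream source silently dropped a column

assert check_header(good)
assert not check_header(bad)
```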

&lt;h3&gt;When developing a model&lt;/h3&gt;

&lt;p&gt;For the purpose of this article, we’ll consider feature engineering, model training, and model testing to all be part of the core model development process. During this often-iterative process, guardrails around the data transformation code and model output support data scientists so they can make changes in one place without potentially breaking things in others.&lt;/p&gt;

&lt;p&gt;In classic DevOps tradition, continuous testing via CI/CD workflows quickly surfaces any issues introduced by code changes. Going even further, most software engineering teams require developers not just to run their code against existing tests, but also to &lt;strong&gt;add new tests when creating new features&lt;/strong&gt;. In the same way, we believe that running tests &lt;strong&gt;as well as writing new tests&lt;/strong&gt; should be part of the ML model development process.&lt;/p&gt;
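&lt;p&gt;A hypothetical sketch of that habit in plain Python (the function and column names are invented for illustration): a new feature lands in the same PR as a test that pins down what its values must look like.&lt;/p&gt;

```python
# New feature added by a data scientist: bucket a user's age into a
# coarse category for the model.
def add_age_bucket(row):
    age = row["age"]
    if age >= 65:
        bucket = "senior"
    elif age >= 18:
        bucket = "adult"
    else:
        bucket = "minor"
    return {**row, "age_bucket": bucket}

# The test that ships with the feature: output values stay in a known set,
# so CI catches any later change that breaks this contract.
def test_age_bucket_values():
    rows = [{"age": a} for a in (3, 18, 40, 64, 65, 99)]
    buckets = {add_age_bucket(r)["age_bucket"] for r in rows}
    assert buckets == {"minor", "adult", "senior"}

test_age_bucket_values()
```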

&lt;h3&gt;When running a model in production&lt;/h3&gt;

&lt;p&gt;As with all things ML Ops, a model running in production depends on both the code and the data it is fed in order to produce reliable results. Similar to the data ingestion stage, we need to secure the data &lt;strong&gt;input&lt;/strong&gt; to avoid issues stemming from either code changes or changes in the actual data. At the same time, we should also have some testing around the model &lt;strong&gt;output&lt;/strong&gt; to ensure that it continues to meet our expectations. We occasionally hear from data teams that a faulty value in their model output went undetected for several weeks, and in the worst case, stakeholders alerted them before the team spotted the issue itself.&lt;/p&gt;
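&lt;p&gt;For illustration, a model-output check of this kind can be as simple as the following standard-library sketch (the thresholds and names are made up, and a real deployment would page someone rather than assert):&lt;/p&gt;

```python
# Sketch of a production-side output check: flag a batch of model scores
# whose values or distribution drift outside the documented contract.
import statistics

def output_looks_healthy(scores, lo=0.0, hi=1.0,
                         baseline_mean=0.5, max_mean_shift=0.2):
    """All scores within [lo, hi] and the batch mean close to the baseline."""
    in_range = all(hi >= s >= lo for s in scores)
    mean_ok = max_mean_shift >= abs(statistics.mean(scores) - baseline_mean)
    return in_range and mean_ok

assert output_looks_healthy([0.4, 0.5, 0.6])
assert not output_looks_healthy([0.4, 0.5, 7.0])  # faulty value caught
```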

&lt;p&gt;Especially in an environment with black box ML models, establishing and maintaining standards for quality is crucial in order to trust the model output. In the same way, &lt;strong&gt;documenting&lt;/strong&gt; the expected output of a model in a shared place can help data teams and stakeholders define and communicate “data contracts” in order to increase transparency and trust in ML pipelines.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--zS9DYrwb--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://greatexpectations.io/static/ab90fe5e5725a6cfe8b3aebccca58575/f570d/ml_workflow_validation.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--zS9DYrwb--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://greatexpectations.io/static/ab90fe5e5725a6cfe8b3aebccca58575/f570d/ml_workflow_validation.png" alt="ML workflow diagram"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;What’s next?&lt;/h2&gt;

&lt;p&gt;By this point, it’s probably clear how data validation and documentation fit into ML Ops: they allow you to implement tests against both your data and your code at any stage of the ML pipeline we outlined above.&lt;/p&gt;

&lt;p&gt;We believe that data testing and documentation are going to become one of the key focus areas of ML Ops in the near future, with teams moving away from “homegrown” data testing solutions to off-the-shelf packages and platforms that provide sufficient expressivity and connectivity to meet their specific needs and environments. &lt;strong&gt;Great Expectations&lt;/strong&gt; is one such data validation and documentation framework that lets users specify what they expect from their data in simple, declarative statements. &lt;a href="https://greatexpectations.io/blog/ml-ops-great-expectations"&gt;In the second blog post in this two-part series, we will go into more detail on how Great Expectations fits into ML Ops&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>datascience</category>
      <category>devops</category>
    </item>
    <item>
      <title>Watch Great Expectations 101: Getting Started Webinar</title>
      <dc:creator>Kyle Eaton</dc:creator>
      <pubDate>Wed, 12 Aug 2020 17:34:52 +0000</pubDate>
      <link>https://dev.to/supercokyle/watch-great-expectations-101-getting-started-webinar-42fn</link>
      <guid>https://dev.to/supercokyle/watch-great-expectations-101-getting-started-webinar-42fn</guid>
      <description>&lt;p&gt;&lt;iframe width="710" height="399" src="https://www.youtube.com/embed/uM9DB2ca8T8"&gt;
&lt;/iframe&gt;
&lt;/p&gt;

&lt;p&gt;This is the recording of the first webinar in our "Great Expectations 101" series. The goal of this webinar is to show you what it takes to deploy and run Great Expectations successfully.&lt;/p&gt;

&lt;p&gt;By the end of the video you’ll be able to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Create and edit Expectation Suites&lt;/li&gt;
&lt;li&gt;Configure new Datasources&lt;/li&gt;
&lt;li&gt;Understand what Great Expectations does under the hood&lt;/li&gt;
&lt;li&gt;Validate your data with Great Expectations&lt;/li&gt;
&lt;li&gt;Navigate validation output in Data Docs&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We will continue to host more sessions on how to implement and use Great Expectations. To be notified of future events like this, please sign up here: &lt;a href="https://www.surveymonkey.com/r/great-expectations-events"&gt;sign-up form&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;We also announce events in our &lt;a href="https://greatexpectations.io/slack"&gt;Slack channel&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>datascience</category>
      <category>opensource</category>
      <category>python</category>
      <category>testing</category>
    </item>
    <item>
      <title>Webinar - Data Validation Tool Great Expectations 101: Getting Started</title>
      <dc:creator>Kyle Eaton</dc:creator>
      <pubDate>Tue, 04 Aug 2020 14:35:48 +0000</pubDate>
      <link>https://dev.to/supercokyle/webinar-data-validation-tool-great-expectations-101-getting-started-4d9o</link>
      <guid>https://dev.to/supercokyle/webinar-data-validation-tool-great-expectations-101-getting-started-4d9o</guid>
      <description>&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--D_Ga_z47--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://i.imgflip.com/49xeap.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--D_Ga_z47--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://i.imgflip.com/49xeap.jpg" alt="get in data person!"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;&lt;span&gt;🧠&lt;/span&gt; Great Expectations 101 Webinar: Getting Started with Q&amp;amp;A&lt;/h3&gt;

&lt;p&gt;Happy to announce our new webinar series: Great Expectations 101!&lt;/p&gt;

&lt;p&gt;&lt;em&gt;For those just learning about Great Expectations check out our GitHub: &lt;a href="https://github.com/great-expectations/great_expectations"&gt;https://github.com/great-expectations/great_expectations&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;We are going to kick off our new webinar series by hosting a “Getting Started” session, which will be focused on getting you up and running with Great Expectations. &lt;a href="https://twitter.com/spbail"&gt;Sam Bail&lt;/a&gt; and the core Great Expectations engineering team will be guiding you through what it takes to deploy and run Great Expectations successfully.&lt;/p&gt;

&lt;p&gt;By the end of the session you’ll be able to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Create and edit Expectation Suites&lt;/li&gt;
&lt;li&gt;Configure new Datasources&lt;/li&gt;
&lt;li&gt;Understand what Great Expectations does under the hood&lt;/li&gt;
&lt;li&gt;Validate your data with Great Expectations&lt;/li&gt;
&lt;li&gt;Navigate validation output in Data Docs&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;After the demo there will be plenty of time for a Q&amp;amp;A with Sam and the core Great Expectations engineering team.&lt;/p&gt;

&lt;p&gt;The session is aimed at new users who would like some help getting started with Great Expectations, existing users who would like a refresher on the core concepts, or anyone who just wants to chat with the core Great Expectations engineering team!&lt;/p&gt;

&lt;p&gt;We are going to run this event twice to cover as many time zones as possible! Here’s the schedule:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Thursday July 30th @ 9am US Eastern (&lt;em&gt;event has passed; the video will be published&lt;/em&gt;)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Thursday August 6th @ 4pm US Pacific &lt;a href="https://www.surveymonkey.com/r/greatexpectationswebinaraug7"&gt;sign up here&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Alternatively, you can find the Zoom link for the July 30th @ 9am session in our &lt;a href="https://greatexpectations.io/slack"&gt;Slack channel&lt;/a&gt; in #announcements, or sign up with the links above. The sessions will also be recorded; watch out for an update if you can't make any of the dates.&lt;/p&gt;

&lt;p&gt;Looking forward to seeing you all there!&lt;/p&gt;

</description>
      <category>datascience</category>
      <category>python</category>
      <category>database</category>
    </item>
  </channel>
</rss>
