Darko Mesaroš ⛅️ for AWS

Build On Data & Analytics - Show notes

Welcome to our first Build On Live event, which focuses on data and analytics. AWS experts and friends cover topics ranging from data engineering and large-scale data processing to open source data analytics and machine learning for optimizing data insights.

This event was hosted by Dani Traphagen and Darko Mesaros (me) and we had a blast welcoming all the amazing speakers and interacting with you, our audience. 🥳

These are the show notes from that event, which was live streamed on the 7th of July 2022. The recording of this event is available on our YouTube channel, and in this article I will be linking each segment’s notes to its video, so make sure to hit that subscribe button! 👏


Individual session notes:

Intro and Welcome

Guest: Damon Cortesi, Developer Advocate at AWS

Key Points:

  • Why should people care about data analytics? The answer: Insight.
  • There’s all this data all around us—sales data, marketing data, application data, etc.—but we need to make sense of it. How can we better do that?
  • Data is valuable. We need to ask questions about the data in order to make informed decisions.
  • Understand the difference between “junk data” vs “valuable data.”
  • A common challenge is deciding which data you need to log and collect.
  • How should people know what data to collect? It boils down to knowing what questions you have to ask from the data, and working backwards from there.
  • The biggest paradigm shift in the last 5-10 years is that data is now stored in the cloud, not on computers.
  • Everything’s moving up the stack. We’re making it easier for people to get the job done, without having to go as deep. You can now get started just by using SQL.
  • Three key steps to dealing with data: Collect data. Understand data. Keep your data safe.

Joke Break: Did you know that under half of all data science jokes are below average?

What's Happening in the World of Data and Analytics?

Guests: Lena Hall, Head of Developer Relations North America, and Matt Wood, VP Business Analytics at AWS

Key Points:

  • Matt Wood talks about how he got into working with AWS: He was working on the sequencing of a human genome in Cambridge. In the earlier days, they used an iPod for data coordination, and physically shipped it to different sites.
  • Back then, you could do individual genomes in about a week. The industry was dealing in gigabytes of data. But as things evolved, they had to think in terabytes instead—hundreds of terabytes per week.
  • They couldn’t get enough cost-effective power in the data center to plug in more storage. They had to think about where they would store all this data. They figured if they’re having this problem in the genome field, others may have figured it out in other fields.
  • This data problem got Matt interested in cloud computing and led him to AWS. And here he is today!
  • So, what exactly is data analytics? For starters, data analytics is really messy. Because of the way the data is collected, some of it is organized functionally, while sometimes it’s stored logically.
  • What does data governance entail? You want to have control, but usually you can only control about 20% of data. However, a whopping 80% of data tends to be available. As such, it gives builders creative freedom to take that data and build applications around it. It becomes a “highly polished gem” that has real value.
  • The paradigm shift to data mesh: It’s early days for this, but it solves some of the messiness while providing teams with agility and speed. The idea is that you have a series of consumers and producers. It gives you a lot of insight into the data, such as who owns it. It allows you to take data and add value to it.
  • There are emerging best practices that will give organizations a tremendous accelerating effect—but don’t jump on a new paradigm just because everyone else is doing it.
  • Questions to ask your team when thinking about data: What is our level of data-readiness? Where is the bright, highly-polished jewel in our crown that we can do something with? Where are we less ready? What data do we wish we had started on sooner?
  • You must really understand where the business priorities are. What is a real needle-mover for the organization?
  • Be candid about your team’s level of expertise, and then build on that over time. Learning is a life-long pursuit. Be mindful that you’ll have to keep learning to remain relevant with the latest literature.
  • In short, get a good sense of data readiness, business priorities, and where you have to expand.

NoSQL Databases

Guest: Alex DeBrie, AWS Data Hero

Key Points:

  • Core differences between SQL and NoSQL: It’s a matter of scaling. NoSQL allows horizontal scalability, putting different parts of your data on different machines. As you scale up, you simply get more machines instead of bigger machines.
  • Comparing different databases: Dynamo vs Cassandra, and Dynamo vs Mongo.
  • Dynamo is more authoritarian, Mongo is more libertarian. If you aren’t careful, you can lose some horizontal scaling. Do you want more flexibility but less consistent performance, or the other way around? Think about your query model up front, knowing what you want to leverage as you progress with your database.
  • There are two places where Dynamo excels: high-scale applications, like Amazon, and in the serverless world.
  • NoSQL: You’re replicating the same piece of data across multiple machines.
  • What is “single table” design? Basically, you can’t join tables in NoSQL databases. What you can do is pre-join data. In your single table, you don’t have an “orders table” and a “customer table”; you have a single application table. There are no joins, but you still model the relationships (see the sketch after this list).
  • What is the performance impact of single table design? It helps you avoid multiple requests in Dynamo.
  • On your main table, you can get strong consistency if you have conditions.
  • Alex’s DynamoDB wish list: Adding a more flexible query model to Dynamo. Indexing some of your data, allowing for more flexible querying. And finally, billing. Currently there’s a provision model and a pay-per-request model. Alex would love to see a combination of it, where you can set a provision capacity and get billed for anything above that.
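
To make the single-table idea concrete, here is a minimal sketch using boto3 against a hypothetical table named app-table, with made-up PK/SK conventions. It illustrates pre-joining related items under one partition key; it is not the exact model Alex discussed.

```python
import boto3
from boto3.dynamodb.conditions import Key

# Hypothetical single application table: a customer profile and its orders share one
# table, "pre-joined" under the same partition key so one Query returns all of them.
table = boto3.resource("dynamodb").Table("app-table")

table.put_item(Item={"PK": "CUSTOMER#123", "SK": "PROFILE", "name": "Alice"})
table.put_item(Item={"PK": "CUSTOMER#123", "SK": "ORDER#2022-07-07", "total": 42})

# One request fetches the customer and all of their orders -- no join needed.
resp = table.query(KeyConditionExpression=Key("PK").eq("CUSTOMER#123"))
for item in resp["Items"]:
    print(item)
```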

Resources:

Joke Break: Two DBAs walked into a NoSQL bar, but they left when no one would let them join their tables.

Data Ingestion: Change Data Capture

Guests: Abhishek Gupta, Principal Developer Advocate & Steffen Hausmann, Specialist Solutions Architect at AWS

Key Points:

  • The problem with real-time data is that traditional batch processing may not suffice.
  • The value of insights can diminish over time. The quicker you can respond, the more valuable the insights will be. For example: How quickly can a store react to weather conditions so that enough umbrellas are in stock?
  • How much more complicated and expensive does it get when you go real-time? Well, it can be more challenging than batch-based systems. Things get more complex from an operations perspective. As such, don’t make the switch to real-time just because it’s “nice and interesting.” You need a valid reason to invest in a real-time data stream.
  • Change data capture: Almost all databases have this notion of a log. It’s used to capture changes to data and tables, and it’s a useful way of detecting changes in a database (see the sketch after this list).
  • Watch Abhishek Gupta demo Change Data Capture, capturing changes to a database in real-time.
  • Watch Steffen Hausmann walk through Zeppelin, using Change Data Capture to build something useful with real-time analytics, using revenue as an example.
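
As a rough illustration of log-based change data capture (not the exact AWS setup used in the demos), the sketch below reads changes from a PostgreSQL logical replication slot with psycopg2. The connection details and slot name are hypothetical, and it assumes wal_level is set to logical and the user has replication privileges.

```python
import psycopg2

# Hypothetical connection; assumes a PostgreSQL database with logical decoding enabled
# (wal_level = logical) and the built-in test_decoding output plugin.
conn = psycopg2.connect("dbname=app user=postgres password=secret host=localhost")
conn.autocommit = True
cur = conn.cursor()

# Create the replication slot once; from then on the write-ahead log captures every change.
cur.execute("SELECT pg_create_logical_replication_slot('cdc_slot', 'test_decoding');")

# Poll the slot: each row describes an INSERT/UPDATE/DELETE decoded from the log.
cur.execute("SELECT lsn, xid, data FROM pg_logical_slot_get_changes('cdc_slot', NULL, NULL);")
for lsn, xid, change in cur.fetchall():
    print(lsn, change)
```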

Resources:

Data Quality, Data Observability, Data Reliability

Guest: Barr Moses, CEO and Co-Founder of Monte Carlo

Key Points:

  • Data is huge, and it’s accelerating even more. Data is still front and center. It’s at the forefront of strategy.
  • Companies are using data to make really important decisions. For example: Financial loans and lending; Fintech is driven by data more than ever; Companies like Jet Blue are running on data. It’s powering all products, and these products are relying on data to be accurate.
  • Industries from media to eCommerce and FinTech are all using data. Even traditionally non-data centric companies are using it. As such, data has to be of high quality. It has to be accurate and trustworthy.
  • How do you measure high-quality data? Historically, it was centered on accurate data that’s clean upon ingestion. But today, data has to be accurate throughout the stack.
  • What is data observability? The organization’s ability to fully understand and trust the health of the data in their systems, and to eliminate data downtime (e.g., when Netflix was down for 45 minutes in 2016 due to duplicate data). In other words, wrong data can cause application downtime.
  • What is the difference between Data Governance and Data Observability? The more useful question to ask is: What problem are you trying to solve?
  • Data Observability is solving situations where people are using data, but the data is wrong. For example, when the price of an item on a website is inaccurate.
  • Data Governance is the method in which companies try to manage their data in better ways.
  • How data breaks: It’s usually because of miscommunication between teams.
  • Data detection problems: The data team should be the first to know when data breaks, but they’re often the last.
    • Resolution: Reduce the time it takes to resolve a problem from weeks and even months to just hours.
    • Prevention: Can we maintain the velocity of how we build, while also maintaining the trust in the data we have?
  • Having an end-to-end approach and a focus on automation and machine learning helps monitor data accuracy (a minimal health-check sketch follows this list).
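
As a tiny, hypothetical illustration of automated data-quality checks (not Monte Carlo’s product), the sketch below flags two common data-downtime symptoms, stale tables and unexpected nulls, using pandas. The thresholds and the timestamp column name are assumptions.

```python
import pandas as pd

def check_table_health(df, ts_col, max_staleness_hours=6, max_null_rate=0.01):
    """Flag two common data-downtime symptoms: stale data and unexpected nulls."""
    issues = []

    # Freshness: how old is the newest row?
    latest = pd.to_datetime(df[ts_col], utc=True).max()
    age = pd.Timestamp.now(tz="UTC") - latest
    if age > pd.Timedelta(hours=max_staleness_hours):
        issues.append(f"stale table: newest row is {age} old")

    # Completeness: columns with an unusually high share of nulls.
    for col, rate in df.isna().mean().items():
        if rate > max_null_rate:
            issues.append(f"high null rate in '{col}': {rate:.1%}")

    return issues

# Example usage with a hypothetical orders dataframe:
# alerts = check_table_health(orders_df, ts_col="updated_at")
```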

Resources:

The Serverless Future of Cloud Data Analytics

Guest: Wojciech Gawronski, Sr. Developer Advocate, AWS

Key Points:

  • Data analytics traditionally requires a lot of operational expertise and knowledge.
  • Why should we even think about serverless? It goes beyond efficiency and speed. We can better leverage scalability and tackle operational issues.
  • Serverless data analytics services help ease the learning curve.
  • By leveraging the provider’s offering, you can focus on development activities instead of scrambling to build operational experience when scaling up to meet demand.
  • Because the services are based on, and fully compatible with, open source software, you can safely choose between a fully managed service and running things yourself, and easily change direction later.
  • According to estimates, by 2030 we will have 572 ZB of data (1 zettabyte is 10^21 bytes, a one followed by 21 zeros). We have been generating vast amounts of data for years, and the question arises: how do we prepare for the continually growing demand for infrastructure without multiplying the cost of maintaining it? Serverless data analytics answers that question.
  • It’s easier to get started, scale, and operate when you go serverless, especially if you lack operational experience (see the sketch after this list).
  • Watch Wojtek walk through a three-step demo: Acquire the data, transform the data for loading into the data warehouse to gain insights, and leverage the serverless approach.
  • Learn more about Redshift serverless with a walkthrough from Wojtek.
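
For a feel of the serverless approach, here is a minimal sketch that runs a SQL query with Amazon Athena through boto3, with no cluster to provision or operate. The database, table, and S3 output location are hypothetical placeholders (the live demo itself used Redshift Serverless).

```python
import time
import boto3

athena = boto3.client("athena")

# Kick off a serverless query; Athena scales the execution for you.
query = athena.start_query_execution(
    QueryString="SELECT product, sum(revenue) AS total FROM sales GROUP BY product",
    QueryExecutionContext={"Database": "analytics"},          # hypothetical database
    ResultConfiguration={"OutputLocation": "s3://my-bucket/athena-results/"},
)
execution_id = query["QueryExecutionId"]

# Poll until the query finishes, then fetch the result set.
while True:
    state = athena.get_query_execution(QueryExecutionId=execution_id)["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(1)

if state == "SUCCEEDED":
    rows = athena.get_query_results(QueryExecutionId=execution_id)["ResultSet"]["Rows"]
    print(rows)
```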

Joke Break: What do you call a group of data engineers? A cluster.

Resources:

Large-Scale Data Processing

Guest: Gwen Shapira, Chief Product Officer, Stealth Startup

Key Points:

  • “Data product” is one of those terms that everyone throws around. It first appeared in 2012.
  • We have all this data—how can we use it to drive a good user experience?
  • Control Plane as a Service: First ask if you need this kind of architecture. A lot of people are thinking about adopting this early on as an architectural concept, which is a big shift from a few years back.
  • Control Plane as a Service gives you an opinionated way to use specific services and databases (e.g., data planes). For example, with DynamoDB you can do almost anything. However, if you want to use it for a specific use case, a lot of effort needs to go into mapping a given problem onto DynamoDB.
  • User experience is the most important thing in analytics today. This is a shift from before, when the end user considerations came last.
  • Gwen’s recommendations for up and coming data builders: Start with asking who is going to be using the product, and build from there. Good user experience is key, and you want your infrastructure to support this.
  • Make your product as smooth and frictionless as possible. Start with the customer and work backwards, always.

Resources:

Not Your Dad’s ETL: Accelerating ETL Modernization and Migration

Guest: Navnit Shukla, Sr. Solutions Architect, AWS

Key Points:

  • Navnit helps customers extract value from their data at AWS.
  • ETL, which stands for “extract, transform, and load,” is a data integration process that combines data from multiple data sources into a single, consistent data store that is loaded into a data warehouse or other target system.
  • ETL is one of the most important parts of data. One of the biggest problems with traditional ETL is getting lots and lots of data from different sources. Traditional ETL tools weren’t built to scale, or to handle a variety of data.
  • Watch Navnit walk through AWS Glue, a powerful ETL tool. It does batching, scaling, and securing for you—all the important tools you need. It’s like having someone else do the laundry for you!
  • Navnit’s advice for up-and-coming data scientists and engineers: Stop doing ETL. Go to ELT instead: extract, load, and transform. Bring all the data in as it is, and then do your transformations on top of it (a minimal sketch follows this list).
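
Here is a minimal sketch of an ELT-style AWS Glue job: extract a table from the Glue Data Catalog and load it to S3 as-is, leaving transformation to a later step. The database, table, and bucket names are hypothetical, and this is a skeleton rather than the walkthrough Navnit gave.

```python
import sys
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

# Standard Glue job boilerplate: resolve arguments and initialize the job.
args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext.getOrCreate())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Extract: read the source table through the Glue Data Catalog (names are hypothetical).
orders = glue_context.create_dynamic_frame.from_catalog(database="raw_db", table_name="orders")

# Load as-is to a landing area; transformations run later on top of the loaded data (ELT).
glue_context.write_dynamic_frame.from_options(
    frame=orders,
    connection_type="s3",
    connection_options={"path": "s3://my-bucket/landing/orders/"},
    format="parquet",
)
job.commit()
```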

Resources:

Stream-Processing, Large-Scale Data Analytics

Guest: Tim Berglund, Vice President of Developer Relations, StarTree

Key Points:

  • We’re in the middle of an architectural paradigm shift. The last major software architecture paradigm emerged in the late 80s: it was called “client server.” Then the web happened, but it didn’t depart from client server in a meaningful way.
  • Today, event-driven architecture is commonplace. This is demanding a new way of building systems. The whole stack is different now.
  • Kafka is where event-based systems live (see the consumer sketch after this list).
  • Event data is very valuable right now. The value of that event declines quickly. You want to respond right away. On the other hand, accumulating context gets you more and more value over time.
  • There’s a fear regarding old tools that no longer work, while not having clarity over what the new tools are just yet.
  • People are building things further down the stack, because the stack hasn’t come up to meet them yet.
  • OLTP databases record transactions in a system on an ongoing basis. In OLAP databases, you dump a lot of data in, and then ask questions about it.
  • Data loses its value over time. For example, the answer to “how long will it take to get a Smashburger delivered to me right now?” becomes less relevant with each passing second.
  • Tim’s advice for Cassandra or GitHub users: Stuff is changing faster than you know. See the upcoming waves faster than you did before.
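
As a small illustration of consuming events while they are still fresh, here is a hypothetical Kafka consumer using the kafka-python library; the topic name and broker address are placeholders.

```python
import json
from kafka import KafkaConsumer

# Hypothetical topic and broker; in an event-driven stack these events are most
# valuable in the seconds right after they are produced.
consumer = KafkaConsumer(
    "order-events",
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
    auto_offset_reset="latest",
)

for message in consumer:
    event = message.value
    # React immediately (alerting, enrichment, OLAP ingestion) while the event is fresh;
    # accumulated context can be built up separately for longer-term value.
    print(event)
```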

Resources:

Create, Train, and Deploy Machine Learning Models in Amazon Redshift Using SQL with Amazon Redshift ML

Guest: Rohit Bansal, Analytics Specialist Solutions Architect, AWS

Key Points:

  • Rohit has over two decades of experience in data analytics.
  • What’s Redshift? Amazon Redshift is a fully managed, petabyte-scale data warehouse service in the cloud. It’s giving machine learning capabilities to data analysts.
  • Anyone who knows SQL can use Redshift ML to create, train, and deploy ML models in Amazon Redshift.
  • Why do we need Redshift ML?
    • Data analysts/SQL users want to use ML with simple SQL commands without learning external tools.
    • Data scientists/ML experts want to simplify their pipeline and eliminate the need to export data from Amazon Redshift.
    • BI professionals want to leverage ML from their SQL queries used in the Dashboard.
  • When you run the SQL command to create the model, Amazon Redshift ML securely exports the specified data from Amazon Redshift to Amazon S3 and calls SageMaker Autopilot to automatically prepare the data, select the appropriate pre-built algorithm, and apply the algorithm for model training. Amazon Redshift ML handles all the interactions between Amazon Redshift, Amazon S3, and SageMaker, abstracting the steps involved in training and compilation. After the model is trained, Amazon Redshift ML makes it available as a SQL function in your Amazon Redshift data warehouse (see the sketch after this list).
  • Types of algorithms supported by Redshift ML:
    • Supervised: model types XGBoost, Linear Learner, and Multi-Layer Perceptron (MLP)
    • Problem type: Binary classification, Multi-class classification, etc.
    • Unsupervised (clustering).
  • Redshift ML helps operationalize insights, addressing the same needs listed above for SQL users, data scientists, and BI professionals.
  • How much does Redshift ML cost? It leverages your existing cluster resources for prediction, so you can avoid additional Amazon Redshift charges.
  • Redshift changes Rohit would like to see: support for handling data drift in the future.
  • Watch Rohit do a walkthrough of Redshift, showcasing how easy it is to create ML models within Redshift using Redshift ML on sales data.
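
To show what “ML with plain SQL” looks like, here is a hedged sketch that submits a Redshift ML CREATE MODEL statement through the Redshift Data API with boto3. The cluster, table, column, IAM role, and bucket names are all hypothetical.

```python
import boto3

rsd = boto3.client("redshift-data")

# CREATE MODEL exports the query result to S3, invokes SageMaker Autopilot behind the
# scenes, and registers the trained model as a SQL function (predict_customer_churn).
create_model_sql = """
CREATE MODEL customer_churn_model
FROM (SELECT age, plan, monthly_spend, churned FROM customer_activity)
TARGET churned
FUNCTION predict_customer_churn
IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftMLRole'
SETTINGS (S3_BUCKET 'my-redshift-ml-bucket');
"""

rsd.execute_statement(
    ClusterIdentifier="my-cluster",   # hypothetical cluster name
    Database="dev",
    DbUser="awsuser",
    Sql=create_model_sql,
)

# Once training finishes, predictions are just another SQL expression.
rsd.execute_statement(
    ClusterIdentifier="my-cluster",
    Database="dev",
    DbUser="awsuser",
    Sql="SELECT customer_id, predict_customer_churn(age, plan, monthly_spend) FROM new_customers;",
)
```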

Resources:

Using Amazon Managed Workflows for Apache Airflow (MWAA) to Schedule Spark Jobs on Amazon EMR Serverless

Guest: Damon Cortesi, Principal Developer Advocate, AWS

Key Points:

  • The current trends in data and analytics: “managed” data lakes with modern storage layers like Hudi, Iceberg, and Delta; a flurry of activity mid-stack (data catalogs, streaming services); and a desire to move up the stack and abstract away the hard stuff.
  • Apache Airflow makes your job easy.
  • Why is MWAA better? Why should developers and builders use it? Airflow can be complex to set up and run: a production environment needs several components (scheduler, web server, task runner). You can run it all yourself, but if you want the environment run for you, MWAA will do that, and you can worry about more important things (see the DAG sketch after this list).
  • Is there a cost to MWAA? Yes. It's a minimum of ~$70 a month.
  • If you have a small Airflow environment, it might be worth running it on your own.
  • EMR Serverless: It’s easy and fast to run Spark and Hive jobs on AWS.
  • How is EMR Serverless different from other EMR offerings (such as EC2 / EKS)? You can think of it as a broad spectrum.
  • EMR on EC2 gives you the most flexibility and control of configuration and underlying instances. If your job has very specific characteristics and you need to optimize the underlying instances (CPU vs Memory vs GPU, for example) or you need a specific set of resources, EMR on EC2 is a good option.
  • EMR on EC2 also has the broadest set of frameworks supported. Today, EMR on EKS supports Spark and EMR Serverless supports Spark and Hive.
  • Watch Damon do an extensive walkthrough running EMR Serverless Spark jobs with Amazon Managed Workflows for Apache Airflow.
  • The team talks about the differences between using EMR, Glue, Athena, etc, to collect and manage data.
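
Below is a minimal sketch of an Airflow DAG that kicks off a Spark job on EMR Serverless, assuming a recent apache-airflow-providers-amazon package that ships EmrServerlessStartJobOperator. The application ID, role ARN, script path, and log bucket are hypothetical placeholders, not the exact setup from Damon’s walkthrough.

```python
from datetime import datetime

from airflow import DAG
from airflow.providers.amazon.aws.operators.emr import EmrServerlessStartJobOperator

with DAG(
    dag_id="emr_serverless_spark",
    start_date=datetime(2022, 7, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    # Submit a Spark job to an existing EMR Serverless application; MWAA runs the
    # scheduler, web server, and workers for you.
    run_spark_job = EmrServerlessStartJobOperator(
        task_id="run_spark_job",
        application_id="<emr-serverless-application-id>",   # hypothetical placeholder
        execution_role_arn="arn:aws:iam::123456789012:role/emr-serverless-job-role",
        job_driver={
            "sparkSubmit": {
                "entryPoint": "s3://my-bucket/scripts/etl_job.py",
            }
        },
        configuration_overrides={
            "monitoringConfiguration": {
                "s3MonitoringConfiguration": {"logUri": "s3://my-bucket/emr-logs/"}
            }
        },
    )
```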

Resources:

Unify Data Silos by Connecting and Sharing Varied Data Sources: A Financial Service Use Case

Guests: Jessica Ho, Sr. Partner Solutions Architect and Saman Irfan, Specialist SA at AWS

Key Points:

  • Let’s talk about data sharing. What kind of data sharing are we talking about exactly? And why are we doing it?
  • One of the best ways we use data sharing in our day-to-day lives is for making informed decisions. For example, when shopping: We can see reviews, which is basically other people sharing data. Then we analyze it. Based on this, we decide whether we should buy the product or not.
  • When talking about organizations, they can make informed business decisions through data sharing across different business units.
  • There’s a growing desire to drive insights from the massive amounts of data we have. But how can we access it in a secure and efficient manner? We need to bring all the data together into a single source, and then share it out to different business units based on their needs—all without compromising the data’s integrity.
  • For example, Stripe produces software and APIs that allow businesses to process payments, conduct ecommerce, etc. It’s important for their customers to have access to this data so that they can understand how their business operates.
  • What problems did Stripe solve with data sharing? They launched a product called Data Pipeline, and it’s built around the Redshift sharing ability. Their customers can acquire specific data that’s unique to that customer. Stripe is always looking for ways for customers to extract relevant data from them.
  • How do we share data through Redshift? Storage is shared across different compute clusters (see the sketch after this list).
  • Watch Saman demo live data sharing on Amazon Redshift. It’s very simple and secure with Redshift.
  • Hear more about Amazon Redshift Spectrum too.
  • What is a feature Saman would remove from Redshift data sharing? Currently, the producer cluster owns the data. Instead, it would be good to have no owners of the data. A one-click painless migration would be great too.
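
As a rough sketch of the producer/consumer flow Saman demonstrated, the snippet below issues Redshift data sharing SQL through the Redshift Data API. Cluster identifiers, namespace GUIDs, and table names are hypothetical placeholders.

```python
import boto3

rsd = boto3.client("redshift-data")

# Producer side: publish a schema and table through a datashare, then grant it to the
# consumer cluster's namespace.
producer_sql = [
    "CREATE DATASHARE sales_share;",
    "ALTER DATASHARE sales_share ADD SCHEMA public;",
    "ALTER DATASHARE sales_share ADD TABLE public.ticket_sales;",
    "GRANT USAGE ON DATASHARE sales_share TO NAMESPACE '<consumer-namespace-guid>';",
]
for sql in producer_sql:
    rsd.execute_statement(ClusterIdentifier="producer-cluster", Database="dev", DbUser="awsuser", Sql=sql)

# Consumer side: mount the shared data as a database and query it. Storage is shared,
# compute stays separate, and no data is copied.
consumer_sql = [
    "CREATE DATABASE sales_db FROM DATASHARE sales_share OF NAMESPACE '<producer-namespace-guid>';",
    "SELECT count(*) FROM sales_db.public.ticket_sales;",
]
for sql in consumer_sql:
    rsd.execute_statement(ClusterIdentifier="consumer-cluster", Database="dev", DbUser="awsuser", Sql=sql)
```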

Resources:

Transactional Data Lakes Using Apache Hudi

Guest: Shana Schipers, Specialist SA, Analytics, AWS

Key Points:

  • Shana focuses on anything Big Data at AWS.
  • Defining Apache Hudi: No one term can cover it all. It’s more than a table format; it’s a transactional data lake framework. It adds features on top of your data lake, such as indexing, transactions, record-level updates, concurrency control, etc.
  • Why does this matter? We’re using data lakes more and more often, and collecting more data than ever before. Data warehouses are getting expensive to scale and are limiting in the kinds of data they can handle.
  • If we want to run machine learning and aggregate data, we need to move toward a data lake.
  • At scale, companies want to know that their data is good. We need to ensure data consistency. That’s where Hudi comes in.
  • If you dump stuff into a data lake, you want to be able to delete data too. You need things like Hudi to make this possible. Typically, objects in data lakes are immutable. This can be very time-consuming. Hudi manages this for you.
  • Hudi also does file size management.
  • Apache Hudi is an open-source tool that allows you to use transactional elements on top of the data lake.
  • Hudi has two table types; one of them is amazing for streaming data ingestion. There are lots of options for working with Hudi, including Glue, EMR, Athena, Presto, Trino, Spark, etc. (see the sketch after this list).
  • Watch Shana do an extensive demo on Apache Hudi. It’s hard to explain, but easy to learn!
  • What feature would Shana change in Apache Hudi? The ability to ingest into multiple tables using Spark streaming, and for EMR to autoconfigure all its sizing and memory for Spark when using Hudi.
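
Here is a minimal PySpark sketch of a Hudi upsert into a data lake table, assuming the Hudi Spark bundle is on the classpath (as it is on EMR). The table name, key fields, and S3 paths are hypothetical.

```python
from pyspark.sql import SparkSession

# Assumes the Hudi Spark bundle is available (e.g., on an EMR cluster with Hudi enabled).
spark = SparkSession.builder.appName("hudi-upsert").getOrCreate()

# Read some raw records (path is a hypothetical placeholder).
df = spark.read.json("s3://my-bucket/raw/orders/")

# Hudi handles record-level upserts, indexing, and file size management on the lake.
hudi_options = {
    "hoodie.table.name": "orders",
    "hoodie.datasource.write.recordkey.field": "order_id",      # record key for upserts
    "hoodie.datasource.write.precombine.field": "updated_at",   # pick the latest version
    "hoodie.datasource.write.operation": "upsert",
}

df.write.format("hudi").options(**hudi_options).mode("append").save("s3://my-bucket/lake/orders/")
```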

Resources:

Close the Multi-Cloud Gap with Open Source Solutions

Guests: Lotfi Mouhib, Sr. Territory Solutions Architect and Alex Tarasov, Senior Solutions Architect at AWS

Key Points:

  • Lotfi and Alex talk about Apache Nifi and Apache Hop.
  • There are a lot of different types of customers, some more heavily into data than others. Managing or building a framework yourself can be a very complicated task. Open source tools can help speed up your development process and make it more sustainable as well.
  • Apache NiFi allows you to move data from Source A to Source B, without needing extensive data transformation. You can deploy without heavy coding behind the scenes.
  • Apache Hop allows small bits of data to hop between different parts of the pipeline.
  • Both NiFi and Hop are similar in the way that the data is moving from one place to another. It’s data flow. They can accelerate migration projects, as there’s less code to write! They also help close the gap when there are no required connectors available in managed services.
  • Both Hop and NiFi allow you to extend the functionality by adding existing java libraries and using them with your data pipeline.
  • The Apache Hop community has done great work by decoupling the actual data pipeline from the execution engine, so it can run on different engines. That gives the flexibility to use managed services like EMR to run your pipelines with less management overhead.
  • Apache NiFi is a data movement tool—it’s not a strong processing engine. It helps you move data from one source to a sink, while applying light transformation on it.
  • Why use Apache NiFi and not other tools? Apache NiFi has been in the market for more than 15 years, has reached maturity, and has a strong supporting community.
  • Apache NiFi also focuses solely on data movement, and allows you to apply the transformation engine of your choice (such as Apache Spark or any other ETL engine).
  • Apache NiFi allows you to follow IAM best practices.
  • Apache NiFi’s core value is in data movement, so it’s not suitable for complex joins like a relational database or complex event processing for streams like Apache Flink.
  • Watch Alex demo Apache Hop.
  • Watch Lotfi demo Apache NiFi.

Resources:

The History and Future of Data Analytics in the Cloud

Guests: Imtiaz Sayed, Tech Leader, Analytics and TJ Green, Principal Engineer, at AWS

Key Points:

  • How is data analytics evolving? What does the future of analytics look like? It’s too early to talk about its history, as it isn’t actually that old. But the evolution has been fascinating.
  • The biggest changes have happened in the last 25 years. The rate of evolution is huge.
  • Everything we do today leaves a digital footprint. There’s a need to mine these data points and generate insights, to make improvements.
  • We’ve evolved from data warehouses to data lakes to data meshes. And to think that data used to live on something as simple as a cassette!
  • Much has changed: Kafka used to be a message bus, but now these systems are being used differently; Kafka is now being placed as a buffer. Data lakes were supposed to be immutable, but now you can modify, delete, and insert data. Redshift and Aurora now scale compute and storage independently. Redshift is now doing machine learning and real-time streaming.
  • The big change that happened 10 years ago was the beginning of the move to the cloud. Before, we were living in an on-premise warehousing world. It was very expensive, a large investment that companies had to deal with.
  • Redshift was one of the pioneers of using cloud in 2013. You could just sign up and not have to buy or administer anything yourself. The price was also very compelling.
  • Competition has always been fierce in this space since then, moving innovation along. The cloud also allows us to scale and be more flexible.
  • Use cases for data analytics today have traversed every industry, such as retail, healthcare, and oil and gas. It enables both large enterprises and startups.
  • What is moving the needle for customers today? Price performance and ease of use. A lot of these factors are built into AWS products.
  • For example: Glue has features that provide an easy UI/UX experience to work with your data. You don’t have to be in the weeds with ETL.
  • Most customers don’t want to see how the sausage is made! They want the thing to perform well and be cheap, but they don’t want to have to turn a bunch of knobs to make it happen. We’re taking on the pain to make it easier for the customer.
  • Pain points of moving to the cloud: It’s quite complex. One of the major pain points is bringing all that data together.
  • Defining data mesh: Having the ability to share your data securely with multiple producers, who can share the data securely with multiple consumers. Data mesh is not easy to implement, and it’s not for everyone! Ultimately, you’ll want to start small and scale fast.

Joke Break: What’s the difference between God and a DBA?
God does not think he’s a DBA.

Joke Break: Two managers are talking to each other.
One asks: “How many data engineers work on your team?”
The other replies: “Half of them.”

Resources:

How does the AWS Prototyping team build?

Guests: Sebastien Stormacq, Principal Developer Advocate & Ahmmad Youssef, Team Lead Big Data Prototyping at AWS

In this segment, check out how our AWS Prototyping team builds together with our customers. And see an amazing demo of, get this, moving data between a Relational Database and an Object store.

Ahmmad will show us how to migrate data from a Microsoft SQL Server, running on an EC2 instance, to an Amazon S3 bucket! All that using AWS Database Migration Service (DMS)! Magic 🪄
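
As a small, hedged companion to the demo, the sketch below starts an existing DMS replication task (SQL Server source to S3 target) with boto3 and checks its full-load progress. The task ARN is a hypothetical placeholder, and the source and target endpoints are assumed to be set up already.

```python
import boto3

dms = boto3.client("dms")

# Hypothetical task ARN: assumes a DMS replication task already exists with a
# SQL Server source endpoint and an S3 target endpoint, as in the demo.
task_arn = "arn:aws:dms:us-west-2:123456789012:task:EXAMPLETASK"

# Kick off the migration (full load of the source tables into S3).
dms.start_replication_task(
    ReplicationTaskArn=task_arn,
    StartReplicationTaskType="start-replication",
)

# Check progress: task status and full-load percentage.
task = dms.describe_replication_tasks(
    Filters=[{"Name": "replication-task-arn", "Values": [task_arn]}]
)["ReplicationTasks"][0]
print(task["Status"], task.get("ReplicationTaskStats", {}).get("FullLoadProgressPercent"))
```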


Thank you all for being part of this wonderful event. Please stay tuned for more events like this. And if you are looking for a weekly dose of Build On, join us every Thursday at 9AM Pacific live on Twitch. 😎

📅 Keep up to date with us by answering a simple form here!
