<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Metaplane</title>
    <description>The latest articles on DEV Community by Metaplane (@metaplane).</description>
    <link>https://dev.to/metaplane</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1287749%2F5889d585-4b5e-4d67-a7af-74fd0b515ef3.png</url>
      <title>DEV Community: Metaplane</title>
      <link>https://dev.to/metaplane</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/metaplane"/>
    <language>en</language>
    <item>
      <title>How to manage tags for objects in Snowflake</title>
      <dc:creator>Metaplane</dc:creator>
      <pubDate>Fri, 08 Mar 2024 01:53:27 +0000</pubDate>
      <link>https://dev.to/metaplane/how-to-manage-tags-for-objects-in-snowflake-2lae</link>
      <guid>https://dev.to/metaplane/how-to-manage-tags-for-objects-in-snowflake-2lae</guid>
      <description>&lt;p&gt;Your data is the foundation for all the insights and strategies that drive your business forward. But the more data you collect, the higher the chances you encounter problems managing it.&lt;/p&gt;

&lt;p&gt;Without a solid data management strategy, you're essentially crossing your fingers in hopes that nothing goes wrong. At best, this approach leads to inefficiencies and increased costs as you scramble to patch up emerging problems. At worst, it triggers a detrimental cycle that erodes trust in the data and in the capabilities of the teams managing it.&lt;/p&gt;

&lt;p&gt;Thankfully, we can draw on decades of industry best practices for implementing the processes and tools that establish data governance and ensure trust in data. Effective data governance spans a wide range of principles and practices touching on availability, usability, integrity, security, and compliance. And in Snowflake, a big part of a proper data governance strategy involves using tags.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why use tags in Snowflake
&lt;/h2&gt;

&lt;p&gt;Think of your data as a city. For fun, let’s call it NYC. Here, your data points are the people of NYC, data tables are skyscrapers, and your data warehouse is the entire metropolitan area. Without any form of organization, finding what you need when you need it would be like trying to locate a single person amongst the 8.5 million who live there (all without knowing anything about where they live or work).&lt;/p&gt;

&lt;p&gt;Data classification, in this analogy, is NYC’s zoning laws and addressing system. Residential, commercial, and industrial zones allow people to understand the general purpose and location of different areas. Tagging adds detailed signs and labels to each building, street, and neighborhood which provides additional, specific information beyond the basic structure provided by zoning and addresses.&lt;/p&gt;

&lt;p&gt;TLDR: Data classification groups data into categories (e.g., sensitive information, financial records, or customer data). Tagging is a data classification technique that further refines these categories. It enables data stewards to monitor sensitive data for compliance, discovery, protection, and resource usage use cases through either a centralized or decentralized data governance management approach.&lt;/p&gt;

&lt;p&gt;In Snowflake, a tag is a schema-level object that is assigned to another object as an arbitrary string value. This object-string match is then stored as a key-value pair that is unique to your schema.&lt;/p&gt;

&lt;p&gt;There are multiple benefits to using these in Snowflake:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Ease of use&lt;/strong&gt;: You can define a tag once and apply it to as many different objects as you like.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tag lineage&lt;/strong&gt;: Since tags are inherited, applying the tag to objects higher in the securable objects hierarchy results in the tag being applied to all child objects. For example, if a tag is set on a table, the tag will be inherited by all columns in that table. This makes it easier to track object relations in future audits.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Sensitive data tracking&lt;/strong&gt;: Tags simplify identifying sensitive data (e.g. PII, secrets). When you add tags to tables, views, and columns, you can find every part of the database that holds sensitive information just by searching for those tags. Once found, data stewards can figure out the best way to share it safely. They might limit who can see certain rows or decide if the data should be tokenized, fully masked, partially masked, or unmasked.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Resource usage tracking&lt;/strong&gt;: Tags bring visibility to Snowflake resource usage. With data and metadata in the same system, analysts can quickly determine which resources consume the most Snowflake credits based on the tag definition (e.g. cost_center, department).&lt;/li&gt;
&lt;/ul&gt;
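&lt;p&gt;As a sketch of that last point: once warehouses carry a hypothetical &lt;code&gt;cost_center&lt;/code&gt; tag, credit usage can be grouped by tag value using Snowflake's Account Usage views (the tag name here is illustrative):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;-- Credits consumed per cost_center value (sketch)
SELECT tr.tag_value AS cost_center,
       SUM(wmh.credits_used) AS credits
FROM SNOWFLAKE.ACCOUNT_USAGE.WAREHOUSE_METERING_HISTORY wmh
JOIN SNOWFLAKE.ACCOUNT_USAGE.TAG_REFERENCES tr
  ON tr.object_name = wmh.warehouse_name
 AND tr.domain = 'WAREHOUSE'
WHERE tr.tag_name = 'COST_CENTER'
GROUP BY tr.tag_value
ORDER BY credits DESC;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;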

&lt;p&gt;To support different management approaches and fulfill regulatory requirements, we recommend starting your tagging journey with a proper strategy.&lt;/p&gt;

&lt;h3&gt;
  
  
  Tagging Strategy
&lt;/h3&gt;

&lt;p&gt;Tags are extremely versatile; they can be linked to various types of objects, such as warehouses and tables, simultaneously. That’s why you need a tagging strategy before you begin object tagging in your account. This high-level plan should include: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A list of objects or datasets that need tags&lt;/li&gt;
&lt;li&gt;Tag naming convention&lt;/li&gt;
&lt;li&gt;Use cases for assigning specific tags&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Because tags are schema-level objects in Snowflake, they can be created in a single database/schema and assigned to other Snowflake objects across the account. You can choose between two approaches here: centralized or decentralized.&lt;/p&gt;

&lt;h3&gt;
  
  
  Centralized vs. Decentralized Approach
&lt;/h3&gt;

&lt;p&gt;In a centralized approach to tagging, a single team or department within the organization is responsible for defining, implementing, and managing the tagging strategy. This team sets the standards for how data is categorized, ensures consistency in tagging across all data assets, and monitors compliance with these standards. &lt;/p&gt;

&lt;p&gt;While it has plenty of benefits (consistency, control, and efficiency), a centralized approach can also lead to bottlenecks, as the centralized team may become a single point of failure or delay in the tagging process.&lt;/p&gt;

&lt;p&gt;Conversely, a decentralized approach allows individual departments or teams within an organization to define and manage their own tags according to their specific needs and use cases. This approach inherently has more flexibility, speed, and customization, but without overarching governance, it often leads to inconsistencies in tagging across the organization, which can complicate efforts to manage data at a company-wide level.&lt;/p&gt;

&lt;p&gt;Both approaches have their pros and cons, so the “right approach” really comes down to what’s most important to your organization. Either way, you’ll want to audit the tags periodically and make changes to the tagging plan according to the business context. &lt;/p&gt;

&lt;h2&gt;
  
  
  How to create and manage tags in Snowflake
&lt;/h2&gt;

&lt;p&gt;When assigning tags, identifiers can either be consistent across multiple objects (e.g., all tagged objects labeled as sales) or vary to reflect different categories like engineering, marketing, or finance. To manage tags for objects in Snowflake using SQL commands, you generally create a tag, assign it to an object, and query that object by its tags. Here’s what each of those steps looks like:&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Create tags
&lt;/h3&gt;

&lt;p&gt;Creating a tag is a simple process with two parts:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Use the &lt;code&gt;CREATE TAG&lt;/code&gt; statement&lt;/li&gt;
&lt;li&gt;Specify the tag string value when assigning the tag to an object. &lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Here’s what that looks like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;CREATE TAG tag_name;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
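&lt;p&gt;Optionally, a tag can restrict which string values it accepts via an &lt;code&gt;ALLOWED_VALUES&lt;/code&gt; clause at creation time. The tag name and values below are illustrative:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;CREATE TAG cost_center
  ALLOWED_VALUES 'finance', 'engineering', 'marketing';
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;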



&lt;h3&gt;
  
  
  2. Assign a Tag to an Object
&lt;/h3&gt;

&lt;p&gt;After creating a tag, you can assign it to an object such as a table, view, or column. You can do this either when you're creating the object or by altering an existing object.&lt;/p&gt;

&lt;p&gt;When creating a new object, you can assign a tag directly in the &lt;code&gt;CREATE&lt;/code&gt; statement by using the &lt;code&gt;TAG&lt;/code&gt; clause as follows:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;CREATE TABLE table_name (
 column_name datatype
) TAG (tag_name = 'tag_value');
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Replace &lt;code&gt;table_name&lt;/code&gt;, &lt;code&gt;column_name&lt;/code&gt;, &lt;code&gt;datatype&lt;/code&gt;, &lt;code&gt;tag_name&lt;/code&gt;, and &lt;code&gt;tag_value&lt;/code&gt; with your specific details. Note that the &lt;code&gt;tag_value&lt;/code&gt; is the value you want to assign to this tag for the object.&lt;/p&gt;

&lt;p&gt;To assign a tag to an existing object, use the &lt;code&gt;ALTER&lt;/code&gt; statement:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ALTER TABLE table_name SET TAG tag_name = 'tag_value';
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This example assigns a tag to a table, but you can similarly alter other types of objects.&lt;/p&gt;
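&lt;p&gt;For example, to tag a single column rather than the whole table (all identifiers below are placeholders):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ALTER TABLE table_name MODIFY COLUMN column_name
  SET TAG tag_name = 'tag_value';
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;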

&lt;h3&gt;
  
  
  3. Query Objects by Tags
&lt;/h3&gt;

&lt;p&gt;To find objects that have been assigned a specific tag or to see the tags associated with a particular object, you can use the &lt;code&gt;SHOW TAGS&lt;/code&gt; statement, the &lt;code&gt;INFORMATION_SCHEMA.TAG_REFERENCES&lt;/code&gt; table function, or the &lt;code&gt;SNOWFLAKE.ACCOUNT_USAGE.TAG_REFERENCES&lt;/code&gt; view.&lt;/p&gt;

&lt;p&gt;To confirm that a tag exists and see where it is defined, use the following command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SHOW TAGS LIKE 'tag_name';
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This lists the tags whose names match the pattern, along with the database and schema where each is defined. Note that it shows tag definitions, not the objects they are assigned to.&lt;/p&gt;

&lt;p&gt;To query specific tag assignments:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT * FROM INFORMATION_SCHEMA.TAG_REFERENCES
WHERE TAG_NAME = 'tag_name';
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This query returns detailed information about objects tagged with tag_name, including the object type and tag value.&lt;/p&gt;
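&lt;p&gt;To inspect the tags on one specific object, Snowflake also provides the &lt;code&gt;INFORMATION_SCHEMA.TAG_REFERENCES&lt;/code&gt; table function, which takes an object name and its domain (the object name below is a placeholder):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT *
FROM TABLE(INFORMATION_SCHEMA.TAG_REFERENCES('db_name.schema_name.table_name', 'table'));
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;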

&lt;h2&gt;
  
  
  How to manage tag quotas
&lt;/h2&gt;

&lt;p&gt;When specifying tags, here are some “quotas” to keep in mind for each object tag:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The string value for each tag can be up to 256 characters, with the option to specify allowed values for a tag. &lt;/li&gt;
&lt;li&gt;A maximum of 50 unique tags can be set on a single object. &lt;/li&gt;
&lt;li&gt;A single CREATE or ALTER statement can specify at most 100 tags.&lt;/li&gt;
&lt;li&gt;For a table or view and its columns combined, the maximum number of unique tags that can be specified in a single CREATE or ALTER statement is 100. &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Luckily, you can manage the tag quotas for an object easily:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Query the &lt;code&gt;TAG_REFERENCES&lt;/code&gt; view to determine the tag assignments.&lt;/li&gt;
&lt;li&gt;Unset the tag from the object or column. For objects, you can use the corresponding &lt;code&gt;ALTER... UNSET TAG&lt;/code&gt; command. For a table or view column, use the corresponding &lt;code&gt;ALTER { TABLE | VIEW } ... { ALTER | MODIFY } COLUMN ... UNSET TAG&lt;/code&gt; command.&lt;/li&gt;
&lt;li&gt;Drop the tag using a &lt;code&gt;DROP TAG&lt;/code&gt; statement.&lt;/li&gt;
&lt;/ul&gt;
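&lt;p&gt;Putting those steps together, freeing up tag quota on a table might look like this (all identifiers are placeholders):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;-- Unset the tag on the table and on one of its columns
ALTER TABLE table_name UNSET TAG tag_name;
ALTER TABLE table_name MODIFY COLUMN column_name UNSET TAG tag_name;

-- Once the tag is no longer needed anywhere, drop it
DROP TAG tag_name;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;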

&lt;p&gt;Your organizational structure in Snowflake (i.e. tagging strategy) can also be applied in Metaplane. So if you want to &lt;a href="https://docs.metaplane.dev/reference/tagtables"&gt;apply a tag to a collection of tables&lt;/a&gt; identified by absolute paths like &lt;code&gt;{database}.{schema}.{table}&lt;/code&gt;, Metaplane can help. &lt;/p&gt;

&lt;p&gt;In Metaplane, you can bulk-apply tags from a &lt;a href="https://docs.metaplane.dev/docs/dashboards#custom-dashboards"&gt;custom dashboard&lt;/a&gt;. After navigating to the dashboard using the tag you'd like to bulk apply to objects or monitors, you'll be able to search and add additional objects. From there, you can &lt;a href="https://app.metaplane.dev/settings/alerts"&gt;set up alerting rules&lt;/a&gt; based on your tags (e.g. you can direct alerts for all data with a particular tag to a channel, email, or other alert destination). And &lt;a href="https://app.metaplane.dev/incidents"&gt;when an incident occurs in Metaplane&lt;/a&gt;, you can view the tags that are affected to help your team prioritize incidents that affect critical data over ones that don't.&lt;/p&gt;

&lt;p&gt;Want to see how Metaplane tags can improve your overall governance strategy? &lt;a href="https://www.metaplane.dev/book-a-demo"&gt;Talk to us&lt;/a&gt; or &lt;a href="https://metaplane.dev/signup"&gt;start a free trial today&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>dataquality</category>
      <category>dataengineering</category>
      <category>sql</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>Announcing Metaplane’s $13.8M Series A</title>
      <dc:creator>Metaplane</dc:creator>
      <pubDate>Tue, 05 Mar 2024 16:43:14 +0000</pubDate>
      <link>https://dev.to/metaplane/announcing-metaplanes-138m-series-a-2ik</link>
      <guid>https://dev.to/metaplane/announcing-metaplanes-138m-series-a-2ik</guid>
      <description>&lt;p&gt;&lt;strong&gt;Other links:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;PR Newswire&lt;/strong&gt;: &lt;a href="https://www.prnewswire.com/news-releases/metaplane-announces-13-8m-series-a-led-by-felicis-on-heels-of-rapid-growth-302079081.html"&gt;Metaplane Announces $13.8M Series A led by Felicis on Heels of Rapid Growth&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;VentureBeat Exclusive&lt;/strong&gt;: &lt;a href="https://venturebeat.com/data-infrastructure/exclusive-metaplane-nets-13m-to-detect-data-anomalies-with-ai/"&gt;Metaplane nets $13M to detect data anomalies with AI&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Metaplane Blog&lt;/strong&gt;: &lt;a href="https://www.metaplane.dev/blog/metaplane-raises-13m-series-a"&gt;Announcing Metaplane's $13.8M Series A&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.linkedin.com/posts/kevinzenghu_dataengineering-dataquality-dataobservability-activity-7170814269066887168-WpFs?utm_source=share&amp;amp;utm_medium=member_desktop"&gt;&lt;strong&gt;CEO &amp;amp; Co-Founder, Kevin Hu's, LinkedIn Announcement&lt;/strong&gt;&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Today I’m happy to announce our Series A led by Felicis Ventures with participation from existing investors Khosla Ventures, Y Combinator, Flybridge Capital Partners, and Stage 2 Capital, along with new investor B37 Ventures. We’re also welcoming Javier Soltero, the SVP &amp;amp; GM of Canva Enterprise, previously the VP &amp;amp; GM of Google Workspace and a two-time founder, to the board.&lt;/p&gt;

&lt;p&gt;Following 6x growth in the past year, this Series A investment brings our total amount raised to $22.2M. Most importantly, it’s consistent with our company principle of raising the right amount of money at the right time at the right valuation. This way, we make sure the success of our company is aligned with our customers’ success.&lt;/p&gt;

&lt;p&gt;Since our &lt;a href="https://www.metaplane.dev/blog/the-next-stage-of-metaplane"&gt;previous fundraise&lt;/a&gt; last year, over 100 companies like Ramp, Bose, Anduril, and Ro have trusted Metaplane to ensure trust in their data. We are the highest rated &lt;a href="https://www.g2.com/products/metaplane/reviews"&gt;Data Observability product on G2&lt;/a&gt;, with 4.8/5.0 stars across 90+ reviews. And today, I’m happy to share our plans to help 10,000 companies ensure trust in the data that powers their business. But first, we should start from the beginning: why does trust in data matter at all?&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fp9tyqgdn3rqaxzki30v0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fp9tyqgdn3rqaxzki30v0.png" alt="Image description" width="800" height="418"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Why This Matters: Untrusted Data is Worse than Useless
&lt;/h2&gt;

&lt;p&gt;In 2024, the most common method for detecting data issues is still CDT, or Customer-Driven Testing. Customers who are looking at the data notice that something looks wrong, then fire off a Slack message to the data team like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;“Is this dashboard broken?”&lt;/li&gt;
&lt;li&gt;“Why does this number seem off?”&lt;/li&gt;
&lt;li&gt;“Why aren’t these accounts up-to-date?”&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Time and trust – the two things that are easy to lose and hard to regain – are lost. The investments of data teams start to get undermined. Companies slowly lose confidence in data, and as a result, lose their ability to build competitive advantage from one of their truly unique and defensible assets. Trusted data is the foundation on which companies progress from business intelligence to automation to generative AI.&lt;/p&gt;

&lt;p&gt;The downward spiral happens for two reasons. The first reason is &lt;strong&gt;asymmetry&lt;/strong&gt;. While data teams are responsible for hundreds or thousands of data assets across fragmented systems, the consumers of data care most about the number that they’re currently looking at. The second reason is &lt;strong&gt;entropy&lt;/strong&gt;. The surface area of data maintained by the data team tends to grow, leading to more opportunities for breakage. These two forces trigger a vicious cycle in which maintaining trust in data feels like a Sisyphean task of pushing a boulder up a hill.&lt;/p&gt;

&lt;p&gt;Metaplane was founded to provide leverage in this fight against asymmetry and entropy. Using automations on top of metadata, our hope is that data work in 2034 will feel less like “working in the dark” and more like software engineering work. Issues still happen, but engineering teams have tools like Datadog and Splunk to help prevent, anticipate, and resolve them. Vicious cycles give way to virtuous cycles of increasing trust and reliability. &lt;/p&gt;

&lt;p&gt;We’re already seeing the influence of engineering best practices in the work of moving, transforming, and using data. Now is the time to see maturation in the higher level governance problem.&lt;/p&gt;

&lt;h2&gt;
  
  
  What We’re Building: The Next Tier of the Tech Tree
&lt;/h2&gt;

&lt;p&gt;Metaplane is already at the forefront of what is possible in data observability. Last year we said we would extend data quality from detection to prevention, expand integrations, and expand data observability from monitoring quality to usage and spend. We did all of that and more over the course of &lt;a href="https://www.metaplane.dev/blog/category/changelog"&gt;50+ product launches&lt;/a&gt; in the past year, which were made possible by working closely with our customers while laying robust foundations.&lt;/p&gt;

&lt;p&gt;The end result is that data observability became possible. Now, we want data observability to feel powerful, but in the way that water is powerful. Data observability should mold itself around the needs of your organization. It should be soft when we need light requirements, and heavy when we have critical workloads. And when it works, it’s like it’s hardly even there.&lt;/p&gt;

&lt;p&gt;We think this next era, in which data observability feels powerful, will rest on three key pillars:&lt;/p&gt;

&lt;h3&gt;
  
  
  Observe Everything
&lt;/h3&gt;

&lt;p&gt;Especially when it comes to data, anything that can go wrong, will go wrong. Worse: data is so interconnected that there are cascading effects, such that the root cause of an issue in one corner could be something completely unexpected in another corner.&lt;/p&gt;

&lt;p&gt;That’s why every piece of metadata counts (especially if your product is named after the “metadata plane”). Metaplane was already the first data observability tool to launch integrations with transactional databases, ETL tools like Fivetran, and reverse ETL tools like Hightouch and Census.&lt;/p&gt;

&lt;p&gt;But what about upstream issues like unsanitized inputs in a CRM or a faulty migration in your application DB? Or dashboards that suddenly go unused, or spiking query times? Every single piece of telemetry emitted by your data systems should be observed, stored, centralized, and monitored with the proper architecture.&lt;/p&gt;

&lt;h3&gt;
  
  
  Automated Monitoring Architecture
&lt;/h3&gt;

&lt;p&gt;We define a monitoring architecture as the design decisions that determine what you monitor, alongside how and when you monitor it. Just like data architectures, monitoring architectures should reflect and anticipate the needs and evolution of the business.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsm11wwmkdxvegqmzf0bl.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsm11wwmkdxvegqmzf0bl.png" alt="Image description" width="800" height="800"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Specifically, this “inverted pyramid” architecture is frequently the optimal trade-off between signal and noise, coverage and cost, and centralization and context. Ideally, this architecture is expressed on the semantic level through constraints like “give me all tables two layers upstream of a Tableau worksheet used by our VP of Marketing” or “show me all Fivetran landing zones that have been delayed in the past month.”&lt;/p&gt;

&lt;p&gt;By combining metadata from your data stack alongside these semantic filters, Metaplane should help you implement and maintain this architecture automatically by default.&lt;/p&gt;

&lt;h3&gt;
  
  
  Trust Everywhere
&lt;/h3&gt;

&lt;p&gt;Achieving 100% trustworthy data is a Sisyphean task, like pushing a boulder up a hill. But the goal isn’t to have perfect data. The goal is to get as much value out of data as possible, which means knowing how trustworthy it is given a business goal.&lt;/p&gt;

&lt;p&gt;Customer-Driven Testing is not a tenable solution. Neither is continually refreshing an open tab. Given that reality, the trustworthiness of data should be like a label on data where and when it is used. If that data is consumed in a dashboard, there should be a red/yellow/green status check. If it’s consumed in a web application, there should be a Chrome extension. If it’s consumed in an Airflow job, there should be an API to retrieve the status of underlying tables.&lt;/p&gt;

&lt;p&gt;By observing everything and maintaining an optimal alerting architecture on top, the last mile is to make sure that data is trusted at the time of consumption.&lt;/p&gt;

&lt;h3&gt;
  
  
  Building on Solid Foundations
&lt;/h3&gt;

&lt;p&gt;If you already use Metaplane, you’ve probably used features that are part of those big pillars. Progress is well underway. But rest assured, all progress we make comes hand-in-hand with investing in what got us here:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Metadata extraction&lt;/strong&gt; from the tools you use most, including column-level lineage parsing from query history and deep integrations with BI tools.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Domain-specific machine learning&lt;/strong&gt; that is tailored to the unique patterns of your business, to ensure every alert is helpful and to avoid dreaded alert fatigue.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Integrated experiences&lt;/strong&gt; that are easy to get started with (you can connect to Metaplane in less than 30 minutes without talking to a salesperson), easy to get value from, and deeply blend together metadata so that you get what you want, when you want it.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  How We Get There: Iron sharpens iron
&lt;/h2&gt;

&lt;p&gt;Our Series A comes with expectations for growth. And of course we’re investing in go-to-market to educate the market and provide great human experiences for our customers. But the primary way we invest in growth is by continuing to invest in building the best product.&lt;/p&gt;

&lt;p&gt;We build the best product by being the best partners to our customers. The future of data observability is yet to be built, but the problems that it solves exist today.&lt;/p&gt;

&lt;p&gt;Learning from legendary partnerships like that between Uber and Twilio, our approach is to partner closely with the companies that live in the future. These forward-thinking companies bring in Metaplane to solve a problem. But because every company is unique, they stretch Metaplane to solve their problem. We look at the ways in which the product is stretched, combine it with feedback, then solidify it into a real product. &lt;/p&gt;

&lt;p&gt;Almost every feature we have has been co-developed with a customer: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;After ClickUp asked for lineage down to Reverse ETL tools, Todd and Jen built our Hightouch and Census integrations within two days. &lt;/li&gt;
&lt;li&gt;Colby worked with CarGurus to add visualizations of circular references to our lineage map. &lt;/li&gt;
&lt;li&gt;Our public-facing API was developed together with Klaviyo, who wanted to version control their monitors using Terraform. &lt;/li&gt;
&lt;li&gt;We re-architected high cardinality GROUP BY monitors to meet Bluecore’s scale of 1000s of distinct groups.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Customers often view Metaplane, both the product and people, as extensions of their teams. And that’s exactly how we want it to feel.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fepoyh29x3b9zfni1wa6b.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fepoyh29x3b9zfni1wa6b.png" alt="Image description" width="800" height="418"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As the number of customers depending on Metaplane grows, our commitment is to get even closer to our customer’s needs, building the best possible product for them as quickly as possible. This is the crucible in which the best product is built.&lt;/p&gt;

&lt;h2&gt;
  
  
  Making the People Who Believed in Us Look Brilliant
&lt;/h2&gt;

&lt;p&gt;During our fundraise announcement last year, I wrote about a quote from HubSpot co-founder Dharmesh Shah: “Success is making those who believed in you look brilliant.” This year, that is more true than ever. While we’re welcoming new investors, new teammates, new partners, and new customers to the table, we’re devoted to strengthening the relationships that got us here.&lt;/p&gt;

&lt;p&gt;So, thank you to our customers, old and new. Helping you is the reason we exist as a company, and we measure our success by how happy you are. Please keep the feedback coming in our shared Slack channels :).&lt;/p&gt;

&lt;p&gt;Thank you to our new investors Felicis and B37 for trusting us with your time and resources, along with our existing investors at Y Combinator, Flybridge, and SNR for doubling down.&lt;/p&gt;

&lt;p&gt;Thank you to our partners at Snowflake, dbt, Sigma, Brooklyn Data Co, and many other organizations who keep pushing the state-of-the-art in our industry forward.&lt;/p&gt;

&lt;p&gt;Thank you to the team for continually raising the bar, and most of all for building what we’d buy ourselves. With you all, there’s no choice but to enjoy the flight :).&lt;/p&gt;

&lt;p&gt;And to those of you who want to help your company look brilliant by ensuring trust in data, &lt;a href="https://www.metaplane.dev/book-a-demo"&gt;we’d love to chat&lt;/a&gt;. Cheers to a bright future where data is the lifeblood of companies without the teams behind that data getting paged at 3am!&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>dataengineering</category>
      <category>dataobservability</category>
      <category>announcement</category>
    </item>
    <item>
      <title>4 numeric distribution metrics to track in Snowflake (and how to track them)</title>
      <dc:creator>Metaplane</dc:creator>
      <pubDate>Sat, 02 Mar 2024 23:17:57 +0000</pubDate>
      <link>https://dev.to/metaplane/4-numeric-distribution-metrics-to-track-in-snowflake-and-how-to-track-them-41l2</link>
      <guid>https://dev.to/metaplane/4-numeric-distribution-metrics-to-track-in-snowflake-and-how-to-track-them-41l2</guid>
      <description>&lt;p&gt;Everything in your business runs smoothly—until it doesn’t. Out of nowhere, sales dip, website traffic plummets, or customer complaints shoot up. And scrambling to figure out what went wrong can feel like a game of business whack-a-mole.&lt;/p&gt;

&lt;p&gt;Ultimately, when you react to problems as they pop up, you spend time and resources that would be better allocated elsewhere. If only you knew what issues were headed your way before they actually hit…&lt;/p&gt;

&lt;p&gt;This is where tracking numeric data comes into play. By monitoring 4 key metrics in your Snowflake data warehouse, you can uncover hidden patterns, predict upcoming trends, and make decisions based on solid data, not just gut feelings.&lt;/p&gt;

&lt;p&gt;In this guide, we’ll walk you through which numeric data metrics you need to be tracking and how to track them in Snowflake.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is Numeric Data?
&lt;/h2&gt;

&lt;p&gt;Numeric data is simply measurable information. It’s a fundamental component of statistics, often shown by its distribution (i.e. the shape or pattern of the data) in the form of a histogram.&lt;/p&gt;

&lt;p&gt;While visual methods are great for human eyes to see and understand, sometimes we need to describe data patterns in terms of numbers. For instance, in the context of data observability, when we use machine learning to alert on anomalies within our numeric data.&lt;/p&gt;

&lt;p&gt;For that, we rely on 4 standard metrics to summarize these distributions: minimum, maximum, mean, and standard deviation.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Minimum (min): This is the smallest number in your data set.&lt;/li&gt;
&lt;li&gt;Maximum (max): This is the largest number in your data set. &lt;/li&gt;
&lt;li&gt;Mean: Often referred to as the average, this is calculated by adding all the numeric values together and dividing by the number of values. It gives a central value for the data.&lt;/li&gt;
&lt;li&gt;Standard deviation: This is the amount of variation or dispersion in a set of values. A low standard deviation means that the values tend to be close to the mean, while a high standard deviation indicates that the values are spread out over a wider range.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Now, we’ll take a deeper look at each metric, and using sample SQL queries, we’ll show you how to track them in Snowflake. For context, the sample queries below follow an e-commerce site. They assume that you have a 'sales' table that records your online store's daily revenue, as shown below.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fave45ygalzwkr1258out.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fave45ygalzwkr1258out.png" alt="Image description" width="459" height="512"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This ‘sales’ table includes two columns: 'amount', which shows the daily earnings, and 'date', which indicates the day of these earnings.&lt;/p&gt;

&lt;h3&gt;
  
  
  Minimum
&lt;/h3&gt;

&lt;p&gt;The minimum is the smallest value within a data set. So, if you have a data set representing daily sales, the minimum would be the lowest sales amount recorded on any given day.&lt;/p&gt;

&lt;p&gt;While it’s often easy to overlook low sales figures when they’re overshadowed by high-performing days, the minimum can be a major indicator of your business's health during its quietest periods. Understanding the context of the lowest sales you have on any given day can reveal the underlying stability and consistency of your business.&lt;/p&gt;

&lt;p&gt;Take this e-commerce example, for instance:&lt;/p&gt;

&lt;p&gt;As the owner of an online store, you maintain a Snowflake dashboard with a graph showing the store's daily sales over the past six months. But over the past few months, you’ve noticed a concerning trend in this graph. While your peak sales days, typically around new product launches or holiday seasons, are doing well, there's a noticeable decline in the minimum daily sales.&lt;/p&gt;

&lt;p&gt;This trend of decreasing minimum daily sales is a red flag. It could mean several things:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;There's an issue with the website's user experience on off-peak days. This could be due to various factors such as poor website navigation, technical glitches, or less compelling content during these periods.&lt;/li&gt;
&lt;li&gt;A new competitor has entered the market. A drop in the lowest sales figures could be a sign that customers are choosing to spend their money elsewhere. Tracking the minimum sales can be an early indicator of losing market share to competitors.&lt;/li&gt;
&lt;li&gt;Customer preferences are changing. It might mean that your product range is no longer aligning with current market trends or customer interests. This decline could prompt a review of your product offerings and marketing strategies to ensure they align with evolving customer preferences.&lt;/li&gt;
&lt;li&gt;There are issues with inventory. Customers unable to find what they need might turn to other sources, leading to reduced sales. You might need to reassess your inventory management to guarantee that popular products are always in stock.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Tracking this minimum daily sales figure gives you an early warning, so you can dig deeper into the root cause of the issue. Here’s how this command looks in Snowflake, assuming you have a sales table similar to the one above:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT MIN(amount) AS MinDailySales
FROM sales;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
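
&lt;p&gt;If you also want to know which day that minimum occurred on, a slight variation of the same query (still assuming the ‘sales’ table above) returns the date alongside the amount:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;--- Return the single lowest-revenue day and its date
SELECT date, amount
FROM sales
ORDER BY amount ASC
LIMIT 1;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;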



&lt;h3&gt;
  
  
  Maximum
&lt;/h3&gt;

&lt;p&gt;The maximum is the largest value within a data set. In contrast to the minimum definition above, if you have a data set representing daily sales, the maximum would be the highest sales amount recorded on any given day.&lt;/p&gt;

&lt;p&gt;These high points can often be linked to specific actions you took, like a great marketing campaign or a popular product launch. So, tracking them means you can figure out what's working really well (and try to do it again). It also helps you with planning: making sure you have enough products in stock, enough people working, and that your website can handle a lot of visitors, all without spending too much when it's not necessary.&lt;/p&gt;

&lt;p&gt;Let’s revisit our e-commerce example:&lt;/p&gt;

&lt;p&gt;On the Snowflake dashboard that you've been using to track your daily sales over the past six months, there's also a graph showing the days when sales peaked. Seeing an upward trend in these peak sales days is usually a great sign. More often than not, it suggests that your recent marketing strategies are hitting the mark and there's a high demand for what you're selling.&lt;/p&gt;

&lt;p&gt;But before you celebrate, verify that these numbers are accurate and not skewed by data recording errors, like incorrect transaction logging. You want to be certain that this upward trend is genuine.&lt;/p&gt;

&lt;p&gt;Once you've verified the accuracy of your data and you're confident that these peak sales figures are, in fact, real, this success leads to a set of new questions and considerations. For instance, can your website handle the increased traffic on these busy days? You don't want the site crashing or slowing down just when customer interest is at its peak. Also, think about your inventory. Can you keep up with this higher demand without ending up with too much stock afterward?&lt;/p&gt;

&lt;p&gt;At the end of the day, the benefit of tracking the maximum metric is twofold—you validate the effectiveness of your marketing and product strategies and ensure the robustness/reliability of your operations. Here’s how this command looks in Snowflake, assuming you have a sales table similar to the one above:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT MAX(amount) AS MaxDailySales
FROM sales;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Mean
&lt;/h3&gt;

&lt;p&gt;The mean is the average of all your data. While you might have days with exceptional sales due to a product launch and days with very low sales, the mean helps you understand your typical performance. It smooths out the extreme highs and lows, giving you a balanced view of your overall performance and business health.&lt;/p&gt;

&lt;p&gt;This metric is especially important in planning and setting realistic expectations. If you're only focusing on the high points, you might overestimate your performance. And if you're only looking at the lows, you might underestimate it.&lt;/p&gt;

&lt;p&gt;Now, back to our e-commerce example:&lt;/p&gt;

&lt;p&gt;Right next to your min and max graphs in your Snowflake dashboard, you also have a graph of average daily sales. Like tracking your maximum, you want to see an upward trend. In this case, though, we’re looking for a steady rise.&lt;/p&gt;

&lt;p&gt;A steady sales growth rate is a good indicator that overall business is growing, reflecting not just occasional spikes from big events but consistent growth across regular days as well. It suggests that your strategies—be it marketing, customer engagement, or inventory management—are resonating with your customers.&lt;/p&gt;

&lt;p&gt;On the flip side, seeing a steady decline in sales would indicate that your strategies are not resonating with your customers. In this case, you’ll need to revisit them to pinpoint what’s falling flat, and pivot accordingly.&lt;/p&gt;

&lt;p&gt;Either way, tracking your mean distribution over time helps identify gradual changes in business performance, ensuring that true growth (or decline) isn’t masked by high-variation days. Here’s how this command looks in Snowflake, assuming you have a sales table similar to the one above:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT AVG(amount) AS AvgDailySales
FROM sales;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
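
&lt;p&gt;And if you want to see that steady rise (or decline) over time rather than a single overall figure, one option is a rolling average. This is a sketch, assuming the ‘sales’ table holds exactly one row per day:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;--- 7-day rolling average of daily sales
SELECT
  date,
  AVG(amount) OVER (ORDER BY date ROWS BETWEEN 6 PRECEDING AND CURRENT ROW) AS Rolling7DayAvg
FROM sales
ORDER BY date;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;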



&lt;h3&gt;
  
  
  Standard deviation
&lt;/h3&gt;

&lt;p&gt;The standard deviation is the amount of variation or dispersion in a set of values. It’s a measure of how consistent or inconsistent your numbers are. A small standard deviation means your sales numbers are pretty close to each other most of the time—which is good because it tends to make your business more predictable.&lt;/p&gt;

&lt;p&gt;A large standard deviation, on the other hand, means your data is highly inconsistent, and it’s usually a sign that there are factors affecting your business that you need to understand.&lt;/p&gt;

&lt;p&gt;Let’s go back to our e-commerce example one last time:&lt;/p&gt;

&lt;p&gt;Next to your min, max, and mean graphs in your Snowflake dashboard is also a graph of standard deviation. If you start noticing fluctuations in your sales patterns, the standard deviation of daily sales is the place to look to understand them better.&lt;/p&gt;

&lt;p&gt;For instance, a big product launch will probably produce a huge spike in sales, resulting in a high standard deviation for that period. This might indicate that your marketing campaign for the launch was effective. But if that spike is followed by several days of significantly lower sales, the pattern of high-sales days followed by low-sales days will push your standard deviation even higher.&lt;/p&gt;

&lt;p&gt;If your goal is to smooth out these fluctuations, you might need to rethink your marketing approach to not just create a buzz around new launches but also to maintain steady sales afterward.&lt;/p&gt;

&lt;p&gt;And by tracking the standard deviation in your Snowflake sales data, you can gauge how well your revised marketing approaches are working towards achieving more consistent sales, and adjust your strategies accordingly for better business stability and predictability. Here’s how this command looks in Snowflake, assuming you have a sales table similar to the one above:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT STDDEV(amount) AS StdDevDailySales
FROM sales;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
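
&lt;p&gt;As a rough sketch of how these metrics feed into anomaly detection (not a substitute for a model that accounts for trend and seasonality), you can flag days that fall more than three standard deviations from the mean:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;--- Flag days more than 3 standard deviations from the mean
WITH stats AS (
  SELECT AVG(amount) AS mean_amount, STDDEV(amount) AS stddev_amount
  FROM sales
)
SELECT s.date, s.amount
FROM sales s
CROSS JOIN stats
WHERE ABS(s.amount - stats.mean_amount) &amp;gt; 3 * stats.stddev_amount;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;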



&lt;h2&gt;
  
  
  Useful conditional statements for numeric data functions in Snowflake
&lt;/h2&gt;

&lt;p&gt;Now that we’ve covered the basics, you can also refine your functions using conditional statements. For instance, if you want to focus on a specific time frame, such as a particular month, you can use a WHERE clause. Here’s what that looks like in Snowflake.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT AVG(amount) AS AvgDailySales
FROM sales
WHERE date &amp;gt;= '2024-01-01'
AND date &amp;lt; '2024-02-01';
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can also calculate these metrics over different partitions of your data (e.g. per month or per product category). Snowflake supports window functions for this (used in the combined query in the next section), but when you only need one value per group, a plain GROUP BY does the job. Here's an example of how you could calculate the average daily sales per month:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
SELECT 
  DATE_TRUNC('MONTH', date) AS Month, 
  AVG(amount) AS AvgMonthlySales
FROM sales
  GROUP BY DATE_TRUNC('MONTH', date);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Putting it all together
&lt;/h2&gt;

&lt;p&gt;For better SQL form, you can condense all these commands into a single SQL query in Snowflake. Here’s how this looks all together:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;WITH MonthlySalesStats AS (
  SELECT
  DATE_TRUNC('MONTH', date) AS Month,
  MIN(amount) OVER(PARTITION BY DATE_TRUNC('MONTH', date)) AS MinMonthlySales,
  MAX(amount) OVER(PARTITION BY DATE_TRUNC('MONTH', date)) AS MaxMonthlySales,
  AVG(amount) OVER(PARTITION BY DATE_TRUNC('MONTH', date)) AS AvgMonthlySales,
  STDDEV(amount) OVER(PARTITION BY DATE_TRUNC('MONTH', date)) AS StdDevMonthlySales
FROM sales
WHERE date &amp;gt;= '2024-01-01' AND date &amp;lt; '2024-07-01' -- Specifying the time frame for the first half of 2024
)
SELECT DISTINCT Month,
  MinMonthlySales,
  MaxMonthlySales,
  AvgMonthlySales,
  StdDevMonthlySales
FROM MonthlySalesStats
ORDER BY Month;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In this query:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The &lt;code&gt;DATE_TRUNC('MONTH', date)&lt;/code&gt; function is used to group sales by month.&lt;/li&gt;
&lt;li&gt;The &lt;code&gt;OVER(PARTITION BY DATE_TRUNC('MONTH', date))&lt;/code&gt; clause with each aggregate function calculates the min, max, average, and standard deviation for each month within the specified date range.&lt;/li&gt;
&lt;li&gt;The &lt;code&gt;WHERE&lt;/code&gt; clause limits the data to the first half of 2024 (from January 1, 2024, to June 30, 2024).&lt;/li&gt;
&lt;li&gt;The &lt;code&gt;SELECT DISTINCT&lt;/code&gt; statement ensures that each month is listed only once along with its corresponding sales statistics.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This gives you a complete monthly breakdown of the minimum, maximum, average, and standard deviation of sales for each month in the specified time frame, so you can track sales performance trends over time.&lt;/p&gt;
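
&lt;p&gt;Since this query only needs one value per month rather than per-row running values, the same breakdown can also be written with a plain GROUP BY, which avoids the need for &lt;code&gt;SELECT DISTINCT&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT
  DATE_TRUNC('MONTH', date) AS Month,
  MIN(amount) AS MinMonthlySales,
  MAX(amount) AS MaxMonthlySales,
  AVG(amount) AS AvgMonthlySales,
  STDDEV(amount) AS StdDevMonthlySales
FROM sales
WHERE date &amp;gt;= '2024-01-01' AND date &amp;lt; '2024-07-01'
GROUP BY 1
ORDER BY 1;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;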

&lt;h2&gt;
  
  
  Wrapping Up
&lt;/h2&gt;

&lt;p&gt;In a perfect world, you could sense when there’s a hint of change in the air—especially when those sudden shifts could balloon into larger business problems. But while that sixth sense might not exist, tracking your numeric distribution metrics gives you a proactive leg up.&lt;/p&gt;

&lt;p&gt;When you track the min, max, mean, and standard deviation in Snowflake, you can pick up on business-changing patterns and trends. That way, you can act fast, fix small issues before they become big headaches, and even capitalize on opportunities as they arise—all because you have the full picture of your data.&lt;/p&gt;

&lt;p&gt;Want to track your numeric data in Snowflake tables and views within minutes, then be alerted on anomalies with machine learning that accounts for trends and seasonality? &lt;a href="https://www.metaplane.dev/signup"&gt;Get started &lt;/a&gt; with Metaplane for free or &lt;a href="https://www.metaplane.dev/talk-to-us"&gt;book a demo&lt;/a&gt; to learn more.&lt;/p&gt;

</description>
      <category>sql</category>
      <category>snowflake</category>
      <category>dataengineering</category>
      <category>datascience</category>
    </item>
    <item>
      <title>Detecting table insert misses with adaptive flatline alerts</title>
      <dc:creator>Metaplane</dc:creator>
      <pubDate>Sat, 02 Mar 2024 23:12:21 +0000</pubDate>
      <link>https://dev.to/metaplane/detecting-table-insert-misses-with-adaptive-flatline-alerts-3o3b</link>
      <guid>https://dev.to/metaplane/detecting-table-insert-misses-with-adaptive-flatline-alerts-3o3b</guid>
      <description>&lt;p&gt;Have you tried building tests or ML models to detect anomalies in your data warehouse? Turns out it’s ridiculously hard! &lt;/p&gt;

&lt;p&gt;For example, say you have a table storing ingested 3rd-party data, which should keep getting new rows every day. You want to know if it stops growing, but how long should you wait for new data to be loaded before you get an alert? It’s a delicate balance. You definitely want to know as soon as possible that data aren’t loading, but you also don’t want to cause panic every time there’s a small delay.&lt;/p&gt;

&lt;p&gt;But maybe you have to try this—despite the headache—because an API change caused your data to stop loading, and nobody knew for a week. Or a transformation job failed to run on schedule. Or maybe an application broke, nights were spent backfilling data, and you swore to never let it happen again.&lt;/p&gt;

&lt;p&gt;Whatever the reason, this is how you can set up adaptive flatline alerts for proactive monitoring in your data systems.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step function modeling
&lt;/h2&gt;

&lt;p&gt;Let’s say you’ve set up a test to check the count of rows in an important table with ingested data. Your test probably has a few characteristics:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You run it on a regular interval (maybe hourly)&lt;/li&gt;
&lt;li&gt;You usually get the same value over and over when the table isn’t changing, but then it jumps to a new plateau after an insert (maybe daily)&lt;/li&gt;
&lt;li&gt;There is some variation in the size and timing of the table changes&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Because of these characteristics, your row counts follow an irregular step function, which means your model needs to account for both the increases and the flatlines. So, when there’s a change your model checks that it’s not too big, and when there’s a plateau your model checks that it’s not extending too long. &lt;/p&gt;

&lt;p&gt;At this point, your model might look like the graph below. The blue lines are your row count values sampled at various points throughout the day, and the orange bounds are the acceptable ranges for the increases.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fv159zrlyrjtutkx0bhbi.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fv159zrlyrjtutkx0bhbi.png" alt="Image description" width="800" height="647"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Notice how on November 27th, on the right side of the graph, the row count values flatline? This is where the logic needs to kick in to say “Hey, this table usually gets a daily insert but nothing’s happening—help!” &lt;/p&gt;

&lt;h2&gt;
  
  
  Flatline buffer
&lt;/h2&gt;

&lt;p&gt;But maybe you don’t want to get an alert immediately after the 24th hour with no increase. Maybe the insert and the test run at the same time, and the test just happens to run before the insert is complete. Or maybe there’s some variance in the timing of the insert. Either way, you probably want the model to wait at least one hour to confirm that there’s really a problem before an alert yanks you out of your flow. &lt;/p&gt;
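
&lt;p&gt;As a rough illustration of the mechanics (Metaplane learns this automatically, but a hand-rolled version helps build intuition), suppose you log your hourly test results to a hypothetical &lt;code&gt;row_count_history&lt;/code&gt; table with &lt;code&gt;measured_at&lt;/code&gt; and &lt;code&gt;row_count&lt;/code&gt; columns. A flatline check with a 24-hour cadence plus a 1-hour buffer could be sketched as:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;--- Alert if the row count hasn't changed in 25 hours
--- (24-hour insert cadence plus a 1-hour buffer)
WITH latest AS (
  SELECT row_count
  FROM row_count_history
  ORDER BY measured_at DESC
  LIMIT 1
),
baseline AS (
  SELECT row_count
  FROM row_count_history
  WHERE measured_at &amp;lt;= DATEADD('hour', -25, CURRENT_TIMESTAMP)
  ORDER BY measured_at DESC
  LIMIT 1
)
--- Returns a row (i.e. fires) only when the counts are equal
SELECT 'flatline detected' AS status
FROM latest
JOIN baseline ON latest.row_count = baseline.row_count;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;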

&lt;p&gt;So, you establish some buffer time, either manually or with a model that learns the right amount of buffer to set. Because each table behaves differently and has different levels of importance for your data pipelines, you’ll want different buffers per table. Below is an example of a trend-based buffer, which uses the average historical trend (orange line) to determine how soon to alert:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fp7ni77b3qve991fcf0wg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fp7ni77b3qve991fcf0wg.png" alt="Image description" width="800" height="460"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;See how the row count value (blue) has stayed flat but the model (orange line) is projecting that the row count should have increased after 24 hours? This is where you can look to determine how much buffer you want. As long as the value stays within the prediction bounds (orange area), no alert will trigger. &lt;/p&gt;

&lt;p&gt;At Metaplane, we have two options for you to tune this buffer:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://docs.metaplane.dev/docs/configuring-monitors#sensitivity"&gt;Sensitivity&lt;/a&gt;: Increasing sensitivity would raise the lower bound so the alert would sound sooner.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://docs.metaplane.dev/docs/providing-model-feedback"&gt;Mark as normal&lt;/a&gt;: If the model alerts too soon you can mark the flatline as normal with a click, and the model will learn to wait longer to alert during flatlines.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Navigating the fine line between staying informed and being overwhelmed by alerts is an art, especially with high-stakes data integrity issues. So, we invite you to compare your alerting results to our ML-based anomaly detection! &lt;/p&gt;

&lt;p&gt;Metaplane’s flatline detection algorithm learns about your data over time, tailoring itself to be more meaningful and actionable for your context. Go on, put it to the test! &lt;a href="https://metaplane.dev/signup"&gt;Create an account&lt;/a&gt; and get set up within 30 minutes today.&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>sql</category>
      <category>todayilearned</category>
      <category>datascience</category>
    </item>
    <item>
      <title>Three ways to track schema drift in Snowflake</title>
      <dc:creator>Metaplane</dc:creator>
      <pubDate>Sat, 02 Mar 2024 23:07:32 +0000</pubDate>
      <link>https://dev.to/metaplane/three-ways-to-track-schema-drift-in-snowflake-5a9i</link>
      <guid>https://dev.to/metaplane/three-ways-to-track-schema-drift-in-snowflake-5a9i</guid>
      <description>&lt;p&gt;Changes in database schema over time—whether that’s additions, deletions, or modifications of columns, tables, or data types—lead to schema drift. These changes can be planned or unplanned, gradual or unexpected. &lt;/p&gt;

&lt;p&gt;These are some of the common causes of schema drift:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Data gets corrupted during migration&lt;/li&gt;
&lt;li&gt;Data warehouse updates, such as:
&lt;ul&gt;
&lt;li&gt;Adding features or fixing issues&lt;/li&gt;
&lt;li&gt;Establishing new relationships between tables&lt;/li&gt;
&lt;li&gt;Removing existing relationships between tables when they become unnecessary or irrelevant&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Your organization switches to a different data warehouse&lt;/li&gt;
&lt;li&gt;Your organization’s business requirements change and you need to:
&lt;ul&gt;
&lt;li&gt;Add new fields for new types of data to be collected and stored&lt;/li&gt;
&lt;li&gt;Remove fields if certain types of data are no longer needed&lt;/li&gt;
&lt;li&gt;Modify the data type of a field to correctly reflect the nature of the data being stored&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;You introduce new data sources&lt;/li&gt;
&lt;li&gt;Technology standards change and/or new regulations are introduced&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;With all those causes, it’s no wonder that schema drift leads to so many data pipeline outages. You've got columns being added, data types changing on the fly, and when they go unnoticed, they swiftly erode data quality. &lt;/p&gt;

&lt;p&gt;This results in missing or inconsistent data that not only compromises the integrity and reliability of your queries and reports but also diminishes the overall trust in data across the organization. To mitigate this, you have to track any and all possible schema changes. Here are three ways to track schema drift in Snowflake.&lt;/p&gt;

&lt;h2&gt;
  
  
  Option 1: Generate and compare schema snapshots
&lt;/h2&gt;

&lt;p&gt;Snowflake and other cloud warehouses’ support for native Schema Change Tracking is still in its infancy, but that doesn’t preclude users from creating their own history of changes. &lt;/p&gt;

&lt;p&gt;One way that you can do this is through periodic, recurring snapshots of your schemas. Here’s a simple sample query that you could run to snapshot this for table(s) within a given database:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;USE DATABASE my_database; --- replace with your database name

CREATE OR REPLACE TABLE snapshot_t0 AS
SELECT
  table_schema,
  table_name,
  column_name,
  data_type
FROM information_schema.columns
ORDER BY 1, 2, 3, 4;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After you create that snapshot, you’ll want to compare for deltas. Assuming we’re staying with just SQL, you’ll then write a few queries to check for common schema changes. Below are sample queries, with snapshot_t0 and snapshot_t1 being placeholder names for your snapshot tables. Note that a renamed column or table will show up as one drop plus one addition.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;--- Finding new or dropped columns
SELECT
  COALESCE(t0.table_schema, t1.table_schema) AS table_schema,
  COALESCE(t0.table_name, t1.table_name) AS table_name,
  COALESCE(t0.column_name, t1.column_name) AS column_name,
  CASE WHEN t0.column_name IS NULL THEN 'added' ELSE 'dropped' END AS change
FROM snapshot_t0 t0
FULL OUTER JOIN snapshot_t1 t1
  ON  t0.table_schema = t1.table_schema
  AND t0.table_name   = t1.table_name
  AND t0.column_name  = t1.column_name
WHERE t0.column_name IS NULL OR t1.column_name IS NULL;

--- Finding new or dropped tables
SELECT
  COALESCE(t0.table_schema, t1.table_schema) AS table_schema,
  COALESCE(t0.table_name, t1.table_name) AS table_name,
  CASE WHEN t0.table_name IS NULL THEN 'added' ELSE 'dropped' END AS change
FROM (SELECT DISTINCT table_schema, table_name FROM snapshot_t0) t0
FULL OUTER JOIN (SELECT DISTINCT table_schema, table_name FROM snapshot_t1) t1
  ON  t0.table_schema = t1.table_schema
  AND t0.table_name   = t1.table_name
WHERE t0.table_name IS NULL OR t1.table_name IS NULL;

--- Finding changed data types
SELECT
  t0.table_schema,
  t0.table_name,
  t0.column_name,
  t0.data_type AS old_data_type,
  t1.data_type AS new_data_type
FROM snapshot_t0 t0
JOIN snapshot_t1 t1
  ON  t0.table_schema = t1.table_schema
  AND t0.table_name   = t1.table_name
  AND t0.column_name  = t1.column_name
WHERE t0.data_type != t1.data_type;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;There are a few gaps in this approach, the most significant being:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The need to specify which database(s) and schema(s) you’d like to run this for&lt;/li&gt;
&lt;li&gt;Orchestration to schedule both snapshots and deltas&lt;/li&gt;
&lt;li&gt;An understanding of whether a schema change was significant&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Option 2: Snowflake Community Python script
&lt;/h2&gt;

&lt;p&gt;Inspired by the &lt;a href="https://flywaydb.org/"&gt;Flyway database migration tool&lt;/a&gt;, the &lt;a href="https://github.com/Snowflake-Labs/schemachange/blob/master/schemachange/cli.py"&gt;Snowflake Community python script&lt;/a&gt; (&lt;code&gt;schemachange&lt;/code&gt;) is a simple Python-based tool to manage all of your Snowflake objects. You can read all about the &lt;a href="https://github.com/Snowflake-Labs/schemachange"&gt;open-source script here&lt;/a&gt;, but here’s an overview:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You or someone on your team should have the ability to run Python. Note that you’ll need the ‎&lt;a href="https://docs.snowflake.com/en/developer-guide/python-connector/python-connector-install"&gt;Snowflake Python driver&lt;/a&gt; installed wherever you’ll be running this script. You’ll also want to familiarize yourselves with Jinja templating if you want to simplify inserting variables as you find yourself with new tables.&lt;/li&gt;
&lt;li&gt;You’ll need to create a table in Snowflake to write changes to, with the default location being: &lt;code&gt;METADATA.SCHEMACHANGE.CHANGE_HISTORY&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;You’ll need to specify your Snowflake connection parameters in &lt;code&gt;schemachange-config.yml&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;You’ll need to write queries that output your desired schemas to compare, following their &lt;a href="https://github.com/Snowflake-Labs/schemachange/tree/master?tab=readme-ov-file#change-scripts"&gt;naming conventions&lt;/a&gt;, structured in &lt;a href="https://github.com/Snowflake-Labs/schemachange/tree/master?tab=readme-ov-file#project-structure"&gt;this way&lt;/a&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This script helps you manually track all of your schema changes. But you’ll also need to explicitly define what tables, schemas, and databases you're tracking—and be able to run Python and be familiar with the CLI to do so. &lt;/p&gt;

&lt;h2&gt;
  
  
  Option 3: Leverage Snowflake’s information_schema
&lt;/h2&gt;

&lt;p&gt;Similar to Option 2, you can leverage Snowflake’s information_schema to get a full view of all schema changes. This solution can be helpful for ad-hoc checks when triaging a data quality incident. But keep in mind that, by default, Snowflake only retains this information for 7 days. &lt;/p&gt;

&lt;p&gt;An example query for the full list of schema changes would look like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT
  database_name,
  schema_name,
  query_text
FROM TABLE(information_schema.query_history())
WHERE
  ---specify schema change query(s) here
  query_text LIKE 'ALTER TABLE%'
  ---optionally filter by database, schema, role, user, or compute warehouse:
  ---AND database_name = 'MY_DATABASE'
  ---AND schema_name = 'MY_SCHEMA'
  ---AND role_name = 'MY_ROLE'
  ---AND user_name = 'MY_USER'
  ---AND warehouse_name = 'MY_WAREHOUSE'
  ---alternatively, you can use the information_schema.query_history_by_* table functions
;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is a great option for triaging incidents that occurred within the past week. You can optionally save the results for a full history of schema changes to reference, but it can quickly become compute-heavy for Snowflake instances with a high volume of queries. &lt;/p&gt;
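
&lt;p&gt;If you need to look back further than 7 days, one alternative is the &lt;code&gt;SNOWFLAKE.ACCOUNT_USAGE.QUERY_HISTORY&lt;/code&gt; view, which retains up to a year of history (at the cost of some ingestion latency). For example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;--- Schema changes over the past 90 days
SELECT database_name, schema_name, query_text, start_time
FROM snowflake.account_usage.query_history
WHERE query_text ILIKE 'ALTER TABLE%'
  AND start_time &amp;gt;= DATEADD('day', -90, CURRENT_TIMESTAMP);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;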

&lt;p&gt;So, if none of these options quite fit the bill and you’re looking for an automated way to track your schema changes, we have another option.&lt;/p&gt;

&lt;p&gt;Metaplane automatically monitors your data warehouse for schema changes (e.g. column additions, data type changes, table drops, etc). There’s no need to define what tables, schemas, and databases you're tracking.&lt;/p&gt;

&lt;p&gt;The best part? With Metaplane, you can filter out which schema changes you want to monitor and receive notifications about. That way, you can only receive the alerts that are critical to your data's integrity and operational continuity—not just a barrage of noise.&lt;/p&gt;

&lt;p&gt;Whether you opt for a manual or an automated approach, the bottom line is this: start tracking your schema changes if you aren’t already. So, choose a method and stick with it! That’s the only way to prevent any more schema drift-related pain, increase the reliability of your data systems, and boost trust across the organization in the process. &lt;/p&gt;

&lt;p&gt;Want to get automatically alerted on schema changes? &lt;a href="https://www.metaplane.dev/book-a-demo"&gt;Talk to us&lt;/a&gt; or start a &lt;a href="https://metaplane.dev/signup"&gt;free trial&lt;/a&gt; today.&lt;/p&gt;

</description>
      <category>schemadrift</category>
      <category>sql</category>
      <category>snowflake</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>Comparing Snowflake Dynamic Tables with dbt</title>
      <dc:creator>Metaplane</dc:creator>
      <pubDate>Fri, 01 Mar 2024 20:15:59 +0000</pubDate>
      <link>https://dev.to/metaplane/comparing-snowflake-dynamic-tables-with-dbt-1dcc</link>
      <guid>https://dev.to/metaplane/comparing-snowflake-dynamic-tables-with-dbt-1dcc</guid>
      <description>&lt;p&gt;Snowflake is one of the most popular data lakehouses today - and for good reason; they not only made it extremely easy to manage the infrastructure traditionally associated with data warehouses, such as scaling storage and compute, but continue to push the envelope, with new features such as Dynamic Tables.&lt;/p&gt;

&lt;h2&gt;
  
  
  What are Snowflake Dynamic Tables?
&lt;/h2&gt;

&lt;p&gt;Dynamic tables are a new type of table in Snowflake, created from other existing objects upstream, that update (i.e. re-run their defining query) when their parent table(s) update. This is useful for any sort of modeling where you intend on reusing the results and need the data to be current. The image below, taken from Snowflake’s documentation, is a simplified overview of how Dynamic Tables are created.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgfx73jkli8em3a04a6gu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgfx73jkli8em3a04a6gu.png" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The “automated refresh process” natively refreshes on fixed intervals (e.g. every x minutes or y hours) rather than at specific times of the day. If you’d rather use a cron schedule, a workaround with tasks can be used instead.&lt;/p&gt;
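&lt;p&gt;As a concrete sketch (the table, column, and warehouse names below are hypothetical), creating a Dynamic Table with a ten-minute refresh target looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;-- Refreshes whenever upstream data changes, targeting at most 10 minutes of lag
CREATE OR REPLACE DYNAMIC TABLE daily_revenue
  TARGET_LAG = '10 minutes'
  WAREHOUSE = transform_wh
AS
  SELECT order_date, SUM(amount) AS revenue
  FROM raw.orders
  GROUP BY order_date;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;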

&lt;h2&gt;
  
  
  What is dbt?
&lt;/h2&gt;

&lt;p&gt;dbt is one of the most popular frameworks today for transformations (i.e. modeling), in part due to its ability to increase accessibility and collaboration in modeling through features such as SQL-based modeling and a centralized code repository with a history of documentation. Two core parts of dbt include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A profile configuration file (in YAML): this is where you’ll specify how, and to which data warehouse, you’ll be connecting&lt;/li&gt;
&lt;li&gt;Model logic (i.e. SQL files): a collection of reusable queries (i.e. models)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Note that the use of dbt is not constrained to Snowflake. A full list of supported integrations can be found here.&lt;/p&gt;
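&lt;p&gt;As a minimal (and hypothetical) sketch, those two parts might look like this; every connection value below is a placeholder:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# profiles.yml -- connection details (all values are placeholders)
my_project:
  target: dev
  outputs:
    dev:
      type: snowflake
      account: your_account
      user: your_user
      password: your_password
      database: analytics
      warehouse: transform_wh
      schema: dbt_dev

-- models/daily_revenue.sql -- a reusable query (model)
SELECT order_date, SUM(amount) AS revenue
FROM raw.orders
GROUP BY order_date
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;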

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy9a22pznigaoqo9fqbwk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy9a22pznigaoqo9fqbwk.png" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Snowflake dynamic tables vs dbt
&lt;/h2&gt;

&lt;p&gt;After listening to the Snowflake presentation, my mind immediately leapt to one question (which is where this article came from): does this replace dbt? The short answer is “no”. Here’s why:&lt;/p&gt;

&lt;h3&gt;
  
  
  Similarities
&lt;/h3&gt;

&lt;p&gt;Starting with similarities: it’s easy to be confused about the role of Snowflake’s Dynamic Tables in a world where many organizations have already implemented, or are looking into implementing, dbt for transformations. &lt;/p&gt;

&lt;p&gt;Both Dynamic Tables and dbt will update an “object” in the warehouse based on the results of your query. One caveat is that dbt can create views and (normal) tables in addition to Dynamic Tables.&lt;/p&gt;

&lt;p&gt;The primary modeling language for both is SQL. Both also have some ability to self-reference: dbt models can reference other models, and Dynamic Tables can be created and updated based on other Dynamic Tables.&lt;/p&gt;

&lt;h3&gt;
  
  
  Differences
&lt;/h3&gt;

&lt;p&gt;Once we get past the upfront benefit of automating updates to the data fed through models, there are quite a few differences between Snowflake Dynamic Tables and dbt. A non-exhaustive list:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Frequency of updates&lt;/strong&gt;: dbt runs are batched (or micro-batched), which means that models are updated at fixed times throughout the day, whereas Dynamic Tables will update once “source” data is detected to be updated (with a user-configured time lag in place).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Modularity&lt;/strong&gt;: dbt model code can be easily reused in other models. Although you can reference Dynamic Tables in your Dynamic Table creation, you’ll create dependencies, with the biggest issue likely being time lag.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Version Control&lt;/strong&gt;: In part because dbt code is stored in git repositories, users will get the benefit of version control - that is, the ability to enforce code structure validation tests prior to merging the code into production, where the git repository also gives you the added benefit of storing a history of all changes.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Validation Rules (Tests)&lt;/strong&gt;: dbt also has a native feature that can be built into model runs, “dbt tests”. These are data validation rules that users specify and can also reuse, much like models, which allows consistency in model outputs.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Lineage + Documentation&lt;/strong&gt;: dbt generates a diagram that shows which objects are referenced in your model, and allows users to define variables, such as the Owner of a model. Both of these features allow for clearer context into what a model is used for (and how to use it).&lt;/li&gt;
&lt;/ul&gt;
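&lt;p&gt;To sketch the last two points: tests and documentation properties are declared alongside models in YAML (the model and column names below are hypothetical):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# models/schema.yml
version: 2
models:
  - name: orders
    description: "One row per order; owned by the analytics team"
    columns:
      - name: order_id
        tests:
          - unique
          - not_null
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;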

&lt;h2&gt;
  
  
  How to use Snowflake Dynamic Tables with dbt
&lt;/h2&gt;

&lt;p&gt;In case it’s not clear: if you’re already using dbt, there’s no need to move everything over to Dynamic Tables. If you’re keen on migrating something to take advantage of this new native Snowflake feature, a good candidate would be existing models being run in your Snowflake instance with a combination of Streams + Tasks.&lt;/p&gt;

&lt;p&gt;Shortly after Dynamic Tables entered Preview, dbt released support for Dynamic Table creation (&lt;em&gt;Note: you must be viewing docs for dbt v1.6 to see the section. Scroll to the top of the docs to set your version number&lt;/em&gt;). &lt;/p&gt;
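&lt;p&gt;With that support in place, opting an existing model into the dynamic table materialization is mostly a config change. A rough sketch (the model name, warehouse, and lag below are placeholders):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;-- models/daily_revenue.sql (hypothetical model)
{{ config(
    materialized='dynamic_table',
    snowflake_warehouse='transform_wh',
    target_lag='10 minutes'
) }}

SELECT order_date, SUM(amount) AS revenue
FROM {{ ref('stg_orders') }}
GROUP BY order_date
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;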

&lt;p&gt;This is amazing because it effectively gives you the best of both worlds: all of the benefits of a git environment for your modeling code (e.g. version control), built-in lineage and documentation, AND updates to models automatically triggered by refreshed data (independent of any schedule configured for your dbt runs).&lt;/p&gt;

&lt;p&gt;You likely won’t want to migrate any of the dbt models that you’re currently triggering at a daily or 12-hour cadence, but any dbt runs (and downstream data products) that require near-real-time data would be good candidates for testing Dynamic Table usage.&lt;/p&gt;

&lt;h2&gt;
  
  
  Upcoming releases for Snowflake’s Dynamic Tables
&lt;/h2&gt;

&lt;p&gt;Dynamic Tables are still fairly new, so we should reasonably expect a few things to change:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Expanded dbt support for Dynamic Tables&lt;/strong&gt;: You can track the scope of the changes here, with the first step (dynamic table as a materialization option) already having been implemented within ~2 months.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Expanded function support for Dynamic Tables&lt;/strong&gt;: One gap right now is the lack of support for a few non-deterministic functions (e.g. CURRENT_DATE()), and the inability to trigger Tasks or Stored Procedures. We’ll likely see support for this over time.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Ecosystem integration improvements&lt;/strong&gt;: One notable callout has been an issue with PowerBI where the list of queryable objects in Snowflake doesn’t always show all of the Dynamic Tables. Other integrations that don’t currently support Dynamic Tables will likely build in support shortly.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Git support&lt;/strong&gt;: Take this with a grain of salt, as the timeline of this release seems to vary depending on who you talk to and when, and there’s no officially released scope, but it appears that Snowflake is building support for hosting code (e.g. Snowsight worksheets) in a git repository.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Dynamic Tables Data Quality
&lt;/h2&gt;

&lt;p&gt;Both Snowflake Dynamic Tables and dbt improve automation for your modeling, but offer limited support for ensuring data quality. While dbt tests are a great starting point for validation rules, they fall short when it comes to asynchronous deployment at scale across all of your tables, with one tangible gap being a lack of automation for acceptance test thresholds. &lt;/p&gt;

&lt;p&gt;This is where a tool like Metaplane would shine, ensuring data quality for Dynamic Tables (created through dbt or otherwise) and other objects in your Snowflake instance along with alerting on anomalous dbt job durations.&lt;/p&gt;

&lt;p&gt;Data teams at high-growth companies (like Drift, Vendr, and SpotOn) use the Metaplane data observability platform to save engineering time and increase trust in data by understanding when things break, what went wrong, and how to fix it — before an executive messages them about a broken dashboard.&lt;/p&gt;

</description>
      <category>snowflake</category>
      <category>dbt</category>
      <category>dynamictables</category>
      <category>sql</category>
    </item>
    <item>
      <title>Three Ways to Retrieve Row Counts in Redshift Tables and Views</title>
      <dc:creator>Metaplane</dc:creator>
      <pubDate>Fri, 01 Mar 2024 20:10:39 +0000</pubDate>
      <link>https://dev.to/metaplane/three-ways-to-retrieve-row-counts-in-redshift-tables-and-views-27f7</link>
      <guid>https://dev.to/metaplane/three-ways-to-retrieve-row-counts-in-redshift-tables-and-views-27f7</guid>
      <description>&lt;p&gt;As your data grows in your Amazon Redshift cluster, it’s important to have an accurate count of the number of rows in your tables or views. You might need this information for capacity planning, performance tuning, or simply to satisfy your curiosity. Fortunately, Redshift provides several methods for retrieving this information.&lt;/p&gt;

&lt;h2&gt;
  
  
  Method 1: Using the COUNT Function
&lt;/h2&gt;

&lt;p&gt;To count the number of rows in a table or view in Redshift, you can use the built-in COUNT function. Here’s an example SQL snippet that you can use:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT COUNT(*) FROM table_name;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Replace “table_name” with the name of the table or view that you want to count. This query will return a single row containing the total number of rows in the table or view.&lt;/p&gt;

&lt;p&gt;If you want to track the number of rows over time, you can run this query periodically and store the results in a separate table. Here’s an example SQL snippet that creates a table to store the row counts for a table called “orders”:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;CREATE TABLE row_counts (
    timestamp TIMESTAMP,
    row_count BIGINT
);

INSERT INTO row_counts (timestamp, row_count)
SELECT
    SYSDATE,
    COUNT(*)
FROM
    orders;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This code creates a table called &lt;code&gt;row_counts&lt;/code&gt; with two columns: “timestamp” and “row_count”. The “timestamp” column stores the current date and time, while the “row_count” column stores the current row count for the “orders” table. The INSERT INTO statement runs the COUNT query and inserts the result into the “row_counts” table.&lt;/p&gt;

&lt;p&gt;You can then run this code periodically (e.g., daily, hourly) to track changes in the row count over time. Here’s an example SQL snippet that you can use:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;INSERT INTO row_counts (timestamp, row_count)
SELECT
    SYSDATE,
    COUNT(*)
FROM
    orders;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This query inserts a new row into the “row_counts” table with the current date and time and the current row count for the “orders” table.&lt;/p&gt;

&lt;h2&gt;
  
  
  Method 2: Using System Statistics
&lt;/h2&gt;

&lt;p&gt;Redshift automatically collects statistics on your tables, including row counts, and makes them available in system views such as SVV_TABLE_INFO. Here’s an example SQL snippet that you can use to retrieve the row count for a table from the SVV_TABLE_INFO view:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT tbl_rows FROM SVV_TABLE_INFO WHERE "table" = 'table_name';
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Replace “table_name” with the name of the table that you want to count. This query will return a single row containing the (estimated) total number of rows in the table.&lt;/p&gt;

&lt;p&gt;One advantage of using system statistics is that they are updated automatically and don’t require you to run any additional queries or scripts to track the row count. However, keep in mind that system statistics may not always be up-to-date or accurate, especially if you have recently loaded data or made other changes to your table.&lt;/p&gt;
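&lt;p&gt;If you suspect the statistics are stale, running ANALYZE refreshes them, which may bring the reported row counts back in line with reality (replace “table_name” with your table):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ANALYZE table_name;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;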

&lt;h2&gt;
  
  
  Method 3: Using Multiple Methods
&lt;/h2&gt;

&lt;p&gt;To ensure the accuracy of your row counts, it’s a good practice to use multiple methods to track the number of rows in your tables or views. For example, you might use the COUNT function to get an exact count of the rows in your table, and also use system statistics to get a faster, approximate count.&lt;/p&gt;

&lt;p&gt;Here’s an example SQL snippet that combines the two methods:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT
    COUNT(*) AS count_exact,
    "rows" AS count_estimate
FROM
    table_name
    JOIN SVV_TABLE_INFO ON "table_name" = "table"
WHERE
    "table" = 'table_name';
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Final thoughts
&lt;/h2&gt;

&lt;p&gt;Always take into account factors such as table size, frequency of updates, desired accuracy, and associated costs when choosing the method that best fits your specific needs. Each method has its trade-offs, and the best approach depends on your particular situation and needs.&lt;/p&gt;

&lt;p&gt;In the end, understanding and navigating these options will empower you to optimize your Redshift performance and maximize your data analysis capabilities.&lt;/p&gt;

&lt;p&gt;Just remember: extracting row counts is an integral part of managing and understanding your data within Redshift. By choosing the right method tailored to your needs, you can gain insights more quickly, optimize performance, and control costs. With the knowledge and techniques outlined in this article, you are now equipped to navigate the landscape of Redshift row count retrieval effectively and efficiently. Happy querying!&lt;/p&gt;

</description>
      <category>rowcounts</category>
      <category>redshift</category>
      <category>sql</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>Stay Fresh: Four Ways to Track Update Times for BigQuery Tables and Views</title>
      <dc:creator>Metaplane</dc:creator>
      <pubDate>Fri, 01 Mar 2024 20:05:35 +0000</pubDate>
      <link>https://dev.to/metaplane/stay-fresh-four-ways-to-track-update-times-for-bigquery-tables-and-views-2i79</link>
      <guid>https://dev.to/metaplane/stay-fresh-four-ways-to-track-update-times-for-bigquery-tables-and-views-2i79</guid>
      <description>&lt;p&gt;Ever experienced a delayed dashboard? Been frustrated by late data for that critical report? That's the sting of stale data. As a data or analytics engineer, you know how crucial it is to have timely, up-to-date data at your fingertips.&lt;/p&gt;

&lt;p&gt;In this post, we'll explore several ways to determine the "freshness" of your tables and views in Google BigQuery. We'll dive into both relevant SQL queries and metadata via Information Schema to give you multiple tools to keep your data transformations running smoothly.&lt;/p&gt;

&lt;h2&gt;
  
  
  Determining Last Update Time Using the MAX Function
&lt;/h2&gt;

&lt;p&gt;The most straightforward approach to determine the last update time in BigQuery leverages the MAX() function on a timestamp column within your table. This method can be especially useful when your table rows include a timestamp column that gets updated whenever a new record is inserted or an existing one is modified.&lt;/p&gt;

&lt;p&gt;Here's an example of how you can use the MAX() function:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT 
 MAX(timestamp_column) AS last_modified
FROM 
  project_id.dataset.table
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In this SQL command, replace project_id, dataset, and table with your respective Google Cloud Project ID, BigQuery dataset name, and table name. Also, replace timestamp_column with the name of the timestamp column in your table that records when each row was last updated.&lt;/p&gt;

&lt;p&gt;This command returns the most recent timestamp in the timestamp_column column, which corresponds to the last time any row in the table was updated. This approach gives a precise picture of data freshness at the row level, which can be more informative than just the last time the table schema was updated.&lt;/p&gt;

&lt;p&gt;However, for this method to work, your tables need to have a timestamp column that gets updated with each data modification. If such a column doesn't exist, you might want to consider adding one to your data ingestion pipelines or ETL processes to track row-level updates better.&lt;/p&gt;
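&lt;p&gt;For example, adding such a column to an existing table is a one-line DDL change (the project, dataset, table, and column names below are placeholders); your load jobs would then populate it with CURRENT_TIMESTAMP() on each insert or update:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ALTER TABLE project_id.dataset.table
ADD COLUMN updated_at TIMESTAMP;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;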

&lt;p&gt;Note that this method works on both tables and views, provided the underlying data of the views have a timestamp column that tracks updates.&lt;/p&gt;

&lt;h2&gt;
  
  
  Last Modified Time via Metadata
&lt;/h2&gt;

&lt;p&gt;One straightforward approach to find out when a table was last updated in BigQuery is by checking the last_modified_time from the table's metadata. &lt;/p&gt;

&lt;p&gt;You can run the following command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT 
  table_id, 
  TIMESTAMP_MILLIS(last_modified_time) AS last_modified
FROM 
  project_id.dataset.__TABLES__
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In the above SQL command, replace project_id with your Google Cloud Project ID and dataset with your BigQuery dataset name. This script returns a list of tables in the specified dataset and their corresponding last modification timestamps. &lt;/p&gt;

&lt;p&gt;&lt;em&gt;Note that this method only works for tables and not for views, as views in BigQuery do not have a last_modified_time property.&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Tracking Updates via INFORMATION_SCHEMA
&lt;/h2&gt;

&lt;p&gt;Google BigQuery also provides an Information Schema, a series of system-generated views that provide metadata about your datasets, tables, and views. &lt;/p&gt;

&lt;p&gt;To retrieve the last update timestamp for both tables and views, you can use the last_change_time column from the INFORMATION_SCHEMA.TABLES view. Here's an example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT 
  table_name, 
  TIMESTAMP(last_change_time) AS last_changed
FROM 
  project_id.dataset.INFORMATION_SCHEMA.TABLES
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Like before, replace project_id and dataset with your respective project and dataset names. &lt;/p&gt;

&lt;p&gt;However, there's an important caveat to note here. The last_change_time column represents the last time the table schema was updated, not necessarily the data. So, if you only added or removed rows but didn't modify the schema, last_change_time wouldn't reflect those changes.&lt;/p&gt;

&lt;h2&gt;
  
  
  Employing Partitioning and Clustering
&lt;/h2&gt;

&lt;p&gt;For a more granular understanding of data freshness, BigQuery's native partitioning and clustering features can be utilized. If your tables are partitioned, you can identify the most recent partition, which often corresponds to the latest data.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT 
  MAX(_PARTITIONTIME) AS last_modified
FROM 
  project_id.dataset.table
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Remember to replace project_id, dataset, and table with your respective details. &lt;/p&gt;

&lt;p&gt;This method is applicable only for partitioned tables, and it won't work for views or non-partitioned tables.&lt;/p&gt;

&lt;h2&gt;
  
  
  Final thoughts
&lt;/h2&gt;

&lt;p&gt;Google BigQuery provides multiple methods to track the freshness of your data, each with its specific use cases and limitations. It's essential to understand these nuances and select the most appropriate method based on your needs. &lt;/p&gt;

&lt;p&gt;In data-intensive environments where timeliness is of the essence, having these tools at your disposal ensures you can maintain the integrity and reliability of your data.&lt;/p&gt;

&lt;p&gt;Want to track the freshness of BigQuery tables and views within minutes, then be alerted on anomalies with machine learning that accounts for trends and seasonalities? &lt;a href="https://www.metaplane.dev/signup"&gt;Get started&lt;/a&gt; with Metaplane for free or &lt;a href="https://www.metaplane.dev/talk-to-us"&gt;book a demo&lt;/a&gt; to learn more.&lt;/p&gt;

</description>
      <category>freshness</category>
      <category>bigquery</category>
      <category>sql</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>Stay Fresh: Two Ways to Track Update Times for Snowflake Tables and Views</title>
      <dc:creator>Metaplane</dc:creator>
      <pubDate>Fri, 01 Mar 2024 20:02:13 +0000</pubDate>
      <link>https://dev.to/metaplane/stay-fresh-two-ways-to-track-update-times-for-snowflake-tables-and-views-2ade</link>
      <guid>https://dev.to/metaplane/stay-fresh-two-ways-to-track-update-times-for-snowflake-tables-and-views-2ade</guid>
      <description>&lt;p&gt;Ever experienced a delayed dashboard? Been frustrated by late data for that critical report? That's the sting of stale data, or rather, data that isn’t fresh.&lt;/p&gt;

&lt;p&gt;The freshness of a table or view is how frequently it is updated relative to requirements. If a table is expected to be fresh to the hour, but hasn’t been updated in a day, then it is stale. Why is data freshness important? Because inaccurate or outdated information can lead to misguided decisions, muddled forecasting, and even regulatory non-compliance. &lt;/p&gt;

&lt;p&gt;Understanding when your Snowflake table or view was last updated isn't just a nice-to-know—it's a need-to-know. It's about ensuring your data is as fresh as your morning coffee, ready to power your day's insights and actions.&lt;/p&gt;

&lt;p&gt;In this guide, we'll be equipping you with two vital tools in your data freshness arsenal: the MAX function and the LAST_ALTERED column. These are your hammer and screwdriver for ensuring your data is up-to-date and accurate, ready to power the decisions that matter.&lt;/p&gt;

&lt;h2&gt;
  
  
  Determining Last Update Time Using the MAX Function
&lt;/h2&gt;

&lt;p&gt;If you have a timestamp column in your Snowflake table, one of the simplest ways to find the most recent update is to use the MAX function. The MAX function returns the maximum value of the specified column. For example, if you have a column named timestamp_column that is updated whenever a row is modified, you can use the following SQL to get the latest update time:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT MAX(timestamp_column) AS last_update_time
FROM your_table_name;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Tip: Replace your_table_name with the name of your table, and you'll get the latest timestamp from the timestamp_column column.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;This approach works equally well for both tables and views. However, remember that views in Snowflake are essentially saved queries. They don't store data themselves but reflect the data in the underlying tables. Therefore, the freshness of the data in a view is dependent on the freshness of the data in the underlying tables.&lt;/p&gt;

&lt;h3&gt;
  
  
  Two quick notes:
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Note that the precision of the timestamp column used with the MAX function can impact the accuracy of the last update time. If the precision is set to seconds, for example, multiple updates within the same second may not be accurately represented.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Using the MAX function on a large table can be resource-intensive and may impact performance. Consider using partitioning, clustering or materialized views to optimize this operation.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Leveraging the LAST_ALTERED Column
&lt;/h2&gt;

&lt;p&gt;What if your table doesn't have a timestamp column? Don't worry, Snowflake has you covered. You can retrieve the last update time from the LAST_ALTERED column in the information_schema.tables or information_schema.views system view.&lt;/p&gt;

&lt;p&gt;Here's an example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT table_name, last_altered
FROM information_schema.tables
WHERE table_schema = 'your_schema' AND table_name = 'your_table_name';
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In this query, your_schema and your_table_name should be replaced with your schema and table name, respectively.&lt;/p&gt;

&lt;p&gt;This approach provides system-level information, which can be especially useful if your table or view doesn't have a timestamp column. However, there are a couple of things to keep in mind:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;The last_altered column reflects the last time the table structure (like adding a new column) was altered, not the last time the data within the table was updated.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;This approach works well for tables, but for views, the last_altered timestamp may not reflect the latest data update time, as it only tracks changes to the view's structure or definition.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The freshness of a view is contingent upon the underlying tables' data freshness. A view does not hold any data, but instead, it represents the data residing in the base tables.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Wrapping Up
&lt;/h2&gt;

&lt;p&gt;Understanding the freshness of your data in Snowflake is crucial for accurate and timely data analysis. With the MAX function and the LAST_ALTERED column, you can keep track of the last update time in your tables and views. Just remember that while these methods are robust, they have their nuances. &lt;/p&gt;

&lt;p&gt;Make sure to consider whether you're dealing with a table or a view, and whether you're interested in changes to the data or changes to the structure of the database object. Happy data tracking!&lt;/p&gt;

&lt;p&gt;Want to track the freshness of Snowflake tables and views within minutes, then be alerted on anomalies with machine learning that accounts for trends and seasonalities? Get started with Metaplane for free or book a demo to learn more.&lt;/p&gt;

</description>
      <category>freshness</category>
      <category>snowflake</category>
      <category>sql</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>Four Efficient Techniques to Retrieve Row Counts in BigQuery Tables and Views</title>
      <dc:creator>Metaplane</dc:creator>
      <pubDate>Fri, 01 Mar 2024 19:57:30 +0000</pubDate>
      <link>https://dev.to/metaplane/four-efficient-techniques-to-retrieve-row-counts-in-bigquery-tables-and-views-20h6</link>
      <guid>https://dev.to/metaplane/four-efficient-techniques-to-retrieve-row-counts-in-bigquery-tables-and-views-20h6</guid>
      <description>&lt;p&gt;When interacting with Google's BigQuery, it's often vital to ascertain the number of rows within a table or view. This information serves various purposes such as optimizing performance, analyzing data, and monitoring data flow. &lt;/p&gt;

&lt;p&gt;In this article, we'll explore four methods to extract row counts in BigQuery, ranging from simple SQL queries using COUNT(*) to harnessing table statistics. We'll also delve into key considerations and potential hurdles for each approach.&lt;/p&gt;

&lt;h2&gt;
  
  
  Method 1: Utilizing COUNT(*)
&lt;/h2&gt;

&lt;p&gt;The COUNT(*) function is the most elementary method to fetch the row count. Although it's a straightforward approach, it can be resource-heavy for larger tables, potentially causing extended execution times and increased costs. Keep in mind that if the table is actively being written to or modified, the row count could change during the query execution, leading to possibly inconsistent results.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT COUNT(*) AS row_count
FROM project.dataset.table
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This method works for views as well as tables because it simply executes the underlying SQL query of the view and counts the results. However, keep in mind that this can be resource-intensive and slow for complex views or those built on top of large tables.&lt;/p&gt;

&lt;h2&gt;
  
  
  Method 2: Leveraging INFORMATION_SCHEMA
&lt;/h2&gt;

&lt;p&gt;The INFORMATION_SCHEMA provides invaluable metadata about datasets, tables, and views in BigQuery. Querying the TABLES view from this schema is efficient, but it's crucial to remember that the row count fetched from this method is an approximation and may not always be current. BigQuery intermittently updates the row_count field in the INFORMATION_SCHEMA.TABLES, which might not mirror the latest count if the table has been recently modified.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT 
  table_name,
  row_count
FROM 
  project.dataset.INFORMATION_SCHEMA.TABLES
WHERE 
  table_name = 'table'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Note that this method will not work for views.&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Method 3: Deploying BigQuery API
&lt;/h2&gt;

&lt;p&gt;By utilizing the BigQuery API, you can programmatically and efficiently retrieve the row count. This method yields accurate results and is immune to potential inconsistencies due to concurrent writes or updates. It's important, however, to ensure you have the necessary access and authentication credentials set up to make API requests. Also, consider that frequent API requests to obtain row counts may introduce additional network overhead and consequentially, increased costs.&lt;/p&gt;

&lt;p&gt;Here's an example using Python and the BigQuery Python client library:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from google.cloud import bigquery
‍
client = bigquery.Client()
table_ref = client.dataset('dataset').table('table')
table = client.get_table(table_ref)
row_count = table.num_rows
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Similar to INFORMATION_SCHEMA, the BigQuery API does not support getting row counts for views directly. The num_rows attribute is only available for tables, not views.&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Method 4: Tapping into BigQuery Table Statistics
&lt;/h2&gt;

&lt;p&gt;BigQuery keeps track of table statistics, including an approximate row count. Using table statistics offers a fast and cost-effective means to estimate the row count without executing a full table scan. Nonetheless, remember that these statistics might not always be up-to-date, particularly if the table has undergone recent modifications. Thus, the row count obtained from table statistics should be viewed as an estimate rather than an exact value.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT
  table_id,
  row_count
FROM
  project.dataset.__TABLES__
WHERE
  table_id = 'table'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;As with the previous two methods, table statistics are not available for views, so this method will not work either.&lt;/p&gt;

&lt;h2&gt;
  
  
  Wrapping Up
&lt;/h2&gt;

&lt;p&gt;Determining the row count in BigQuery tables or views is crucial for numerous scenarios. While the COUNT(*) method offers an accurate count, it might be resource-intensive for large tables. On the other hand, methods like INFORMATION_SCHEMA, BigQuery API, or table statistics provide efficient ways to retrieve row counts. &lt;/p&gt;

&lt;p&gt;However, it's essential to acknowledge the potential challenges and limitations inherent in each method. Always take into account factors such as table size, frequency of updates, desired accuracy, and associated costs when choosing the method that best fits your specific needs. &lt;/p&gt;

&lt;p&gt;For instance, for large and infrequently updated datasets, INFORMATION_SCHEMA or table statistics provide a quick and cost-effective answer. Conversely, for smaller tables or datasets that change frequently, the COUNT(*) function is the most accurate option, despite its potential resource intensity. And if you require programmatic access, the BigQuery API is your go-to solution.&lt;/p&gt;

&lt;p&gt;In the end, each method has its trade-offs: COUNT(*) provides accuracy but may consume more resources; the BigQuery API ensures consistency but may incur network overhead; INFORMATION_SCHEMA and table statistics offer efficiency but may not reflect the most recent changes. By choosing the method tailored to your needs, you can gain insights more quickly, optimize performance, and control costs. Happy querying!&lt;/p&gt;

</description>
      <category>rowcounts</category>
      <category>bigquery</category>
      <category>sql</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>Three Ways to Retrieve Row Counts for Snowflake Tables and Views</title>
      <dc:creator>Metaplane</dc:creator>
      <pubDate>Fri, 01 Mar 2024 19:53:09 +0000</pubDate>
      <link>https://dev.to/metaplane/three-ways-to-retrieve-row-counts-for-snowflake-tables-and-views-22i</link>
      <guid>https://dev.to/metaplane/three-ways-to-retrieve-row-counts-for-snowflake-tables-and-views-22i</guid>
      <description>&lt;p&gt;Determining the number of rows in a table or view is often essential when working with Snowflake. This information can prove valuable for various purposes, such as performance optimization, data analysis, and monitoring. &lt;/p&gt;

&lt;p&gt;In this article, we will explore different approaches to obtain row counts in Snowflake, ranging from simple SQL queries using &lt;code&gt;COUNT(*)&lt;/code&gt; to leveraging table statistics. We will also highlight essential considerations and provide SQL snippets to ensure correct execution.&lt;/p&gt;

&lt;h2&gt;
  
  
  Method 1: COUNT(*)
&lt;/h2&gt;

&lt;p&gt;The most straightforward way to retrieve row counts for both tables and views in Snowflake is by using the COUNT(*) function in SQL. This method provides an accurate count but can be resource-intensive for larger tables and views.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT COUNT(*) AS row_count
FROM database.schema.table;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Important note: Replace database, schema, and table with the appropriate identifiers for your Snowflake environment.&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Method 2: Snowflake Metadata Queries
&lt;/h2&gt;

&lt;p&gt;Snowflake's metadata views contain information about databases, schemas, tables, and views, but row counts are stored only for tables, not for views. By querying the appropriate metadata view and filtering on the desired table, we can efficiently obtain a table's row count without scanning the table itself; this method is not applicable to views.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT 
  TABLE_CATALOG AS database,
  TABLE_SCHEMA AS schema,
  TABLE_NAME AS table,
  ROW_COUNT AS row_count
FROM 
  SNOWFLAKE.ACCOUNT_USAGE.TABLES
WHERE 
  TABLE_NAME = 'table'
  AND TABLE_SCHEMA = 'schema'
  AND TABLE_CATALOG = 'database';
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Important note: Replace database, schema, and table with the correct identifiers for your Snowflake environment (unquoted identifiers are stored in uppercase). Additionally, ensure your user has the necessary privileges to access the metadata views, and keep in mind that ACCOUNT_USAGE views can lag behind real time, often by an hour or more.&lt;/em&gt;&lt;/p&gt;
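&lt;p&gt;For reference, access to the ACCOUNT_USAGE views is typically granted by sharing the SNOWFLAKE database with a role (role_name below is a placeholder):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;-- Run as ACCOUNTADMIN (or another role authorized to manage grants)
GRANT IMPORTED PRIVILEGES ON DATABASE SNOWFLAKE TO ROLE role_name;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;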

&lt;h2&gt;
  
  
  Method 3: Using Snowflake Information Schema
&lt;/h2&gt;

&lt;p&gt;Snowflake also provides an INFORMATION_SCHEMA in every database, containing metadata about schemas, tables, and views. As with the ACCOUNT_USAGE views, row counts are stored only for tables, so this method is not applicable to views either. By querying the appropriate INFORMATION_SCHEMA view and filtering on the desired table, we can efficiently obtain a table's row count.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT 
  TABLE_CATALOG AS database,
  TABLE_SCHEMA AS schema,
  TABLE_NAME AS table,
  ROW_COUNT AS row_count
FROM 
  INFORMATION_SCHEMA.TABLES
WHERE 
  TABLE_NAME = 'table'
  AND TABLE_SCHEMA = 'schema'
  AND TABLE_CATALOG = 'database';
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Important note: Replace database, schema, and table with the correct identifiers for your Snowflake environment. Ensure your user has the necessary privileges to access the INFORMATION_SCHEMA.&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Final thoughts
&lt;/h2&gt;

&lt;p&gt;Obtaining row counts in Snowflake tables or views is crucial for various use cases. While the COUNT(*) method provides an accurate count, it can be resource-intensive for large tables. Alternatively, leveraging Snowflake metadata queries or INFORMATION_SCHEMA enables efficient row count retrieval. &lt;/p&gt;

&lt;p&gt;However, it's essential to note the privileges required to access the ACCOUNT_USAGE or INFORMATION_SCHEMA views. Additionally, metadata-based row counts may lag behind the live table, so they may not always reflect the latest count. Choose the method that best suits your requirements based on table size, desired accuracy, and associated costs in your Snowflake environment.&lt;/p&gt;

</description>
      <category>rowcounts</category>
      <category>snowflake</category>
      <category>sql</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>3 ways to improve data sampling efficiency in Snowflake</title>
      <dc:creator>Metaplane</dc:creator>
      <pubDate>Fri, 01 Mar 2024 19:48:45 +0000</pubDate>
      <link>https://dev.to/metaplane/3-ways-to-improve-data-sampling-efficiency-in-snowflake-203a</link>
      <guid>https://dev.to/metaplane/3-ways-to-improve-data-sampling-efficiency-in-snowflake-203a</guid>
      <description>&lt;p&gt;The longer a query takes to execute, the more expensive it becomes. Not just in terms of compute resources, but also our most precious resource—time.&lt;/p&gt;

&lt;p&gt;While it’s not much of a problem when your tables are small, as your tables grow in size, the cost, execution, and iteration time of downstream tasks follow suit. That’s why writing efficient queries is just as important as writing queries that work.&lt;/p&gt;

&lt;p&gt;That's where conditional statements come into play. In Snowflake in particular, conditional statements can drastically reduce the resources and time spent on data queries.&lt;/p&gt;

&lt;h2&gt;
  
  
  How do conditional statements in Snowflake make sampling queries more efficient?
&lt;/h2&gt;

&lt;p&gt;Think of conditional statements as setting up smart filters for your data collection. These “filters” sort through all the data and pick out just the bits that are actually useful and relevant to the problem you’re trying to solve. Like if you're fishing and you only want to catch salmon, you'd ideally use a net that only lets salmon through and keeps all the other fish out (if such a thing existed).&lt;/p&gt;

&lt;p&gt;Using conditional statements before you start sampling makes sure you’re only working with the data that matters, so you’re not wasting time and resources on data you don't need. And when you do take a representative sample, it's more likely to give you the information you need without hitting random data roadblocks (or worse, sampling bias/sampling error).&lt;/p&gt;

&lt;p&gt;With that said, we’re starting with a fairly simple sampling technique. It’s not a conditional statement per se, but it’s a great way to wrangle your larger datasets more efficiently.&lt;/p&gt;

&lt;h2&gt;
  
  
  Option 1: Using Partitioned Tables with Sampling
&lt;/h2&gt;

&lt;p&gt;Though more of a technique than a conditional statement, partitioning and sampling data in Snowflake is a great, easy way to enhance your query performance—especially if you’re using large datasets. Partitioning organizes the data based on certain keys or criteria, facilitating quicker access to relevant data segments and reducing the scope of data scans during queries. Essentially, it speeds up query execution by focusing only on pertinent data partitions.&lt;/p&gt;

&lt;p&gt;Sampling after partitioning allows you to work with a smaller sample size that represents the larger whole. By analyzing a sample, you can infer patterns, trends, and insights without the overhead of processing the entire dataset, saving you on data storage down the road.&lt;/p&gt;

&lt;p&gt;To combine these steps into a single SQL query in Snowflake, you’d typically make sure your table is organized into partitions based on a key that is relevant to your query patterns. But since Snowflake automatically manages micro-partitions and doesn’t allow manual partitioning like traditional databases, we'll focus on using cluster sampling for organizing and then sampling data.&lt;/p&gt;

&lt;p&gt;Here’s what that looks like in practice:&lt;/p&gt;

&lt;p&gt;Let's say we have a sales data table (sales_data), and we're interested in analyzing sales performance by region. We assume that the table is clustered by region_id to optimize performance for queries filtered by region. Now, we want to sample a subset of this data for a quick analysis using the SQL query below:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT 
  * 
FROM 
  sales_data TABLESAMPLE BERNOULLI (10);
-- Sample approximately 10% of the rows
WHERE 
  region_id = 'NorthAmerica' -- Assuming you're interested in North American sales data
  AND DATE(sale_date) BETWEEN '2023-01-01' 
  AND '2023-12-31'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;‍&lt;br&gt;
In this case:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;WHERE region_id = 'NorthAmerica' focuses the query on the North American sales data, using the table's clustering on region_id to improve performance.&lt;/li&gt;
&lt;li&gt;AND DATE(sale_date) BETWEEN '2023-01-01' AND '2023-12-31' further filters the data to only include sales from the year 2023.&lt;/li&gt;
&lt;li&gt;TABLESAMPLE BERNOULLI (10) applies a sampling method to retrieve approximately 10% of the rows from the filtered result set. The BERNOULLI sampling method provides a random sample of the data.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This query is designed to efficiently filter and sample the data based on the table's organization (clustering by region_id)—it aligns the data with query patterns, and then samples the targeted subset of data.&lt;/p&gt;

&lt;p&gt;While partitioning is great for speeding up searches targeting specific regions, if your data isn't neatly organized around the criteria you're using to partition it, or if your searches don't align well with how the data is split up, partitioning won't help much. For instance, if you need information that's spread across multiple partitions, or if your search conditions change a lot and don't match the partitioning scheme, you might not see the performance boost you were hoping for. If that’s the case, you might want to investigate conditional statements.&lt;/p&gt;
&lt;h2&gt;
  
  
  Option 2: Using CASE statements
&lt;/h2&gt;

&lt;p&gt;Using CASE statements in your data sampling queries in Snowflake adds a layer of conditional logic to the sampling process, which is particularly useful when you want to apply different sampling rates or methods based on specific criteria within your data. &lt;/p&gt;

&lt;p&gt;For instance, you might want to sample more heavily in areas where your data is denser or more variable, and less so in more uniform areas. The CASE statement allows you to dynamically adjust the sampling rate or method based on the characteristics of the data (e.g. region, time period, or any other relevant dimension). &lt;/p&gt;

&lt;p&gt;To use CASE statements to analyze sales performance by region (like in the above sales_data table example), you can design a query that selects a sample of sales data based on certain conditions related to regions. Since Snowflake's SQL does not support using TABLESAMPLE directly within a CASE statement, you’ll have to use a workaround that involves filtering data in subqueries or using conditional logic to assign sample rates and then applying these rates in a subsequent operation.&lt;/p&gt;

&lt;p&gt;Here's what this looks like in practice:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;WITH region_sample AS (
  SELECT 
    sale_id, 
    region, 
    sale_amount, 
    CASE 
      WHEN sale_amount &amp;lt; 1000 THEN 1 
      WHEN sale_amount BETWEEN 1000 AND 10000 THEN 2 
      WHEN sale_amount &amp;gt; 10000 THEN 3 
    END AS sale_group 
  FROM 
    sales_data
), 
sampled_data AS (
  SELECT 
    * 
  FROM 
    region_sample 
  WHERE 
    (
      sale_group = 1 
      AND RANDOM() &amp;lt; 0.05
    ) -- For sales under $1000, sample ~5%
    OR (
      sale_group = 2 
      AND RANDOM() &amp;lt; 0.1
    ) -- For sales between $1000 and $10000, sample ~10%
    OR (
      sale_group = 3 
      AND RANDOM() &amp;lt; 0.15
    ) -- For sales over $10000, sample ~15%
) 
SELECT 
  * 
FROM 
  sampled_data;
‍
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In this case:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The function above for ‘region_sample’ assigns a sample_group value to each row based on the region and sale_amount. Each region (and condition within the region) is associated with a different group.&lt;/li&gt;
&lt;li&gt;sampled_data then filters the region_sample data by applying a random sampling condition to each sample_group. The RANDOM() function generates a random value between 0 and 1, and rows are selected based on whether this random value falls below the specified threshold (e.g., 0.05 for a 5% sample rate).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Rather than partitioning, this approach allows for nuanced sampling based on region and sales amount. As a result, you get a much more targeted data analysis of sales performance. &lt;/p&gt;

&lt;p&gt;But you also have to deal with increased complexity and reduced readability of your SQL queries. As you add more conditions and logic with CASE statements, the queries become harder to understand and maintain (which is especially true for teams where multiple analysts work on the same codebase). If this doesn’t work for your scenario, try using a JOIN. &lt;/p&gt;

&lt;h2&gt;
  
  
  Option 3: Using JOINS
&lt;/h2&gt;

&lt;p&gt;Using JOIN statements with conditional logic allows you to sample data based on relationships between tables or within subsets of a single table. You can create a derived table or a Common Table Expression (CTE) that contains the specific conditions or subsets you care about, then join this derived table or CTE with the original (or another related) table and apply the sampling on this joined result set.&lt;/p&gt;

&lt;p&gt;This method is particularly useful when the sampling criteria involve complex conditions or multiple tables.&lt;/p&gt;

&lt;p&gt;Now, back to the sales_data table example from above. Let's assume we have a related table (e.g., regions) that contains detailed information about different sales regions. And suppose we want to sample sales data from the NorthAmerica region more efficiently by joining the sales_data table with the regions table.&lt;/p&gt;

&lt;p&gt;This is what that SQL query looks like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;WITH RegionSales AS (
  SELECT 
    sd.* 
  FROM 
    sales_data sd 
    JOIN regions r ON sd.region_id = r.region_id 
  WHERE 
    r.region_name = 'NorthAmerica' -- Condition to filter sales data by region
    AND DATE(sd.sale_date) BETWEEN '2023-01-01' 
    AND '2023-12-31'
) 
SELECT 
  * 
FROM 
  RegionSales TABLESAMPLE BERNOULLI (10);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In this case:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;RegionSales creates a temporary result set that joins the sales_data table with the regions table. It filters the sales data to include only those records from the NorthAmerica region and within the specified date range ('2023-01-01' to '2023-12-31').&lt;/li&gt;
&lt;li&gt;The TABLESAMPLE BERNOULLI (10) clause is applied to this filtered and joined dataset, sampling approximately 10% of the rows.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;JOINS are particularly advantageous when the sampling criteria involves complex conditions or multiple tables that are interconnected. Imagine trying to get a snapshot of data that's spread across different tables, each with its own set of rules or relationships. JOINS bring all that related information together first, so you can then apply your sampling logic to one, combined dataset. This is super helpful when your analysis depends on understanding how different pieces of data relate to each other, like how customer profiles link to their purchase histories. &lt;/p&gt;

&lt;p&gt;But keep in mind: while JOINS are powerful for relating datasets, they’re not always the best choice if simplicity and performance are priorities. When you join tables, especially multiple or large ones, you increase the amount of data being processed before sampling can even occur which requires more compute resources upfront and slows down query execution time. Doing the JOIN after sampling will improve this efficiency slightly, but it won’t fix the problem entirely.&lt;/p&gt;
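<p>Sampling first and joining afterwards can be sketched like this (reusing the hypothetical sales_data and regions tables from above):<br>
</p>

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;WITH sampled_sales AS (
  SELECT 
    * 
  FROM 
    sales_data TABLESAMPLE BERNOULLI (10) -- Sample before joining to shrink the join input
) 
SELECT 
  ss.* 
FROM 
  sampled_sales ss 
  JOIN regions r ON ss.region_id = r.region_id 
WHERE 
  r.region_name = 'NorthAmerica';
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Note the trade-off: this samples ~10% of all sales before the region filter runs, so the join processes less data, but some sampled rows may be discarded by the filter afterwards.&lt;/p&gt;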

&lt;h2&gt;
  
  
  A better way
&lt;/h2&gt;

&lt;p&gt;Metaplane makes it easy for users to configure how they want their data to be sampled in future Snowflake queries. This includes options for users such as Time Windows and WHERE clauses. With Time Windows and “Include Data Since” options, users can configure their lookback periods to only include their most recent data. In WHERE clauses, users can further restrict the amount of data within a table being queried by any dimension of their table.&lt;/p&gt;

&lt;p&gt;As a bonus, if your goal is to ensure that new data is accurate, consider using date functions supported by your warehouse, such as CURRENT_DATE(), to scan only data created or updated from today onwards.&lt;/p&gt;
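&lt;p&gt;As a sketch, a freshness-focused WHERE clause along those lines might look like this (assuming a hypothetical orders table with an updated_at timestamp column):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT 
  * 
FROM 
  orders 
WHERE 
  updated_at &amp;gt;= CURRENT_DATE(); -- Only scan rows created or updated today
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;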

&lt;p&gt;Want to see how Metaplane’s configurable monitors can make your Snowflake data sampling more efficient? &lt;a href="https://www.metaplane.dev/book-a-demo"&gt;Talk to us&lt;/a&gt; or &lt;a href="https://metaplane.dev/signup"&gt;start a free trial today&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>sql</category>
      <category>tutorial</category>
      <category>learning</category>
      <category>snowflake</category>
    </item>
  </channel>
</rss>
