<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Henning</title>
    <description>The latest articles on DEV Community by Henning (@radbrt).</description>
    <link>https://dev.to/radbrt</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1003084%2F46d869b8-1d00-4883-9d71-fea500f56663.png</url>
      <title>DEV Community: Henning</title>
      <link>https://dev.to/radbrt</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/radbrt"/>
    <language>en</language>
    <item>
      <title>Think bigger about data quality</title>
      <dc:creator>Henning</dc:creator>
      <pubDate>Fri, 03 Mar 2023 22:38:14 +0000</pubDate>
      <link>https://dev.to/radbrt/think-bigger-about-data-quality-3nb8</link>
      <guid>https://dev.to/radbrt/think-bigger-about-data-quality-3nb8</guid>
      <description>&lt;p&gt;There is a lot of thinking and writing about data quality nowadays, but much of it treats data quality as something akin to a KPI.&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--nKl-PBO5--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/zewwdn9bofo9iff0gqvb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--nKl-PBO5--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/zewwdn9bofo9iff0gqvb.png" alt="Metadata quality score" width="880" height="254"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In some respects, we have to look at it that way. Data contains multitudes, but we have very little capacity to understand, let alone communicate, all the nuances.&lt;/p&gt;

&lt;p&gt;But quality is not a percentage. For all practical purposes, quality must be seen through the lens of what you use the data for.&lt;/p&gt;

&lt;p&gt;Fortunately, there is a trove of prior work on this, hidden in plain sight: National statistics, survey data, the Total Survey Error framework, and recent (read: last 10-15 years) work to adapt thinking on survey errors to new register data.&lt;/p&gt;

&lt;h2&gt;
  
  
  Total Survey Error
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--cUs-P7-l--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/fkwa7uzkjb2xomn2yd31.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--cUs-P7-l--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/fkwa7uzkjb2xomn2yd31.png" alt="Total survey error model" width="880" height="514"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Surveys have been around for a long time, and there is a lot of academic work around how to measure errors. I'm sure someone would be able to find a semantic distinction between quality and errors, but for our purposes, we can think of this as a model for evaluating data quality in a &lt;em&gt;survey&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;I know, we don't do surveys (usually), and if you do, there is a good chance you outsource the job to someone who knows what they are doing.&lt;/p&gt;

&lt;p&gt;But surveys are, conceptually, as good a place as any to start thinking about data quality - because they force you to answer two questions: what are you asking, and who are you asking?&lt;/p&gt;

&lt;p&gt;The Total Survey Error framework helps you think critically about what errors can be introduced as you go from formulating a question to surveying people to processing the results. Many of us are used to thinking about sampling errors, and they are beautiful because we can do math, come up with a probability range, add error bars and look really smart. But the other errors don't usually come with error bars.&lt;/p&gt;

&lt;p&gt;Adjacent to the sampling error is the coverage error: basically, whether you are asking the kind of people you want an answer from, or someone completely different.&lt;/p&gt;

&lt;p&gt;Similarly, there is nonresponse error - basically, you need to correct for any pattern in who doesn't answer.&lt;/p&gt;

&lt;p&gt;Then, there is the question itself. Is it understood correctly? Does the person know the answer? Misremember? Lie? And lastly, after you have gathered the responses, are they processed correctly? Does the OCR tend to register 1s as Is?&lt;/p&gt;

&lt;p&gt;But what does any of this have to do with you?&lt;/p&gt;

&lt;h2&gt;
  
  
  From surveys to... that other thing
&lt;/h2&gt;

&lt;p&gt;Most of the data we use today doesn't come from surveys, which means we probably don't have to deal with sampling errors. But most of the other errors have parallels in the business world.&lt;/p&gt;

&lt;p&gt;There are still processing errors. The risk of processing errors may even be way higher in business, because survey data tends to have a straightforward structure, while business data can be organized in very complex data models optimized for something entirely different from analysis.&lt;/p&gt;

&lt;p&gt;The survey concept of validity has a parallel too: the measures in the business data might be different from what you want to answer. Maybe you sell furniture and want to know the size of people's houses, but the gross square footage you have includes all areas covered by a roof - including garages, sheds etc. You will overestimate the potential sales for a number of customers with big garages and sheds, but it's still valuable information.&lt;/p&gt;

&lt;p&gt;Measurement errors still exist: someone could have jotted down the wrong number, there could have been an error when the old physical records were digitized, or maybe a current owner tries to evade taxes by reporting a much lower square footage.&lt;/p&gt;

&lt;p&gt;The same goes for representation: the group of people you want to study might not be the group of people you have data on. If you want to know what proportion of a country's population has higher education, having graduation data from universities might be a really good start. But some people got their education abroad. And some people might have moved abroad after graduating. So you have not just a subset, not just a superset, but a largely overlapping set. For your purpose, this is a quality issue. For someone else, it might be perfect.&lt;/p&gt;

&lt;p&gt;Time is also a potential problem. Data only goes back so far, or maybe there are unfixable breaks in the data rendering older data useless. This isn't coverage per se, and it isn't nonresponse, but it is a problem. Try coming up with a name for it.&lt;/p&gt;

&lt;p&gt;There are a lot more complex attempts at adapting the TSE framework to administrative data, see &lt;a href="https://onlinelibrary.wiley.com/doi/abs/10.1111/j.1467-9574.2011.00508.x"&gt;Zhang 2012&lt;/a&gt; (paywalled) or &lt;a href="https://statswiki.unece.org/download/attachments/127664358/Seminar%202014%2001%2021%20-%20Zhang%20-%20Integration%20Register%20Survey%20Data%20Framework.pdf?version=1&amp;amp;modificationDate=1477308498957&amp;amp;api=v2"&gt;a brief overview in this slide deck&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;An illustration of Total Survey Error adapted for administrative data, from Zhang:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--sFsTo5C2--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/90jg2t9vbx4hajtm7t3s.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--sFsTo5C2--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/90jg2t9vbx4hajtm7t3s.png" alt="Total administrative data error framework" width="880" height="574"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Note that we have gone through all of this without calculating a single percentage or KPI, or trying to quantify anything. This is all just conceptual.&lt;/p&gt;

&lt;h2&gt;
  
  
  Turning the table
&lt;/h2&gt;

&lt;p&gt;But as a data producer, what do you do? Do you just throw your hands up when someone asks you about the data quality, because you don't know what they need the data for?&lt;/p&gt;

&lt;p&gt;There are of course some things you can do. Measurement errors and quirks in the data collection can be described. Make sure to include special values and weird modes, like transactions of $0 or a negative house value. A negative house value can be a glaring data quality issue if it isn't explained, or a valuable feature of the data if you know what it means.&lt;/p&gt;
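
&lt;p&gt;A sketch of what such a description could look like in practice - the column name and special values here are made up for illustration:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Hypothetical data dictionary entry documenting a "weird mode" explicitly,
# so a negative value reads as a feature instead of a defect.
column_docs = {
    "house_value": {
        "description": "Assessed value of the property",
        "special_values": {
            0: "transaction between related parties, price not observed",
            -1: "valuation under appeal - treat as missing",
        },
    },
}
&lt;/code&gt;&lt;/pre&gt;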

&lt;h2&gt;
  
  
  But... what about reality?
&lt;/h2&gt;

&lt;p&gt;We like to say that high-quality data represents reality truthfully, but reality is often a red herring. Not always, of course: something like a person's age is fairly simple. If the data says someone is 58 trillion years old, or if their zip code contains emojis, you can assume there is a data quality issue. So yes, there are easy things that you can document and call data quality and be happy.&lt;/p&gt;

&lt;p&gt;But data doesn't necessarily have poor quality just because it is more complex than someone thinks, or is intended for a different purpose. &lt;/p&gt;

&lt;p&gt;One month a year, my salary is negative. I make a negative amount of money. Of course, I don't really. But it looks like that on my paycheck. I get paid for one month, but get deducted for 5 weeks of vacation - which is more than one month, and so I am paid negative money. Of course, this is in a way an artifact. Do I actually get a bill from work that month? No, of course not. Technically, I don't have paid vacations, but instead of normal pay I get a vacation allowance - which is usually higher than my normal pay. But it isn't technically salary. The total amount of money paid to me that month is higher than normal, but the salary part of my paycheck is negative.&lt;/p&gt;

&lt;p&gt;Is my salary that month a data quality issue? Or is the negative amount a truthful representation of reality?&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Prefect in Azure Container Instances</title>
      <dc:creator>Henning</dc:creator>
      <pubDate>Fri, 13 Jan 2023 18:39:24 +0000</pubDate>
      <link>https://dev.to/radbrt/prefect-in-azure-container-instances-2jph</link>
      <guid>https://dev.to/radbrt/prefect-in-azure-container-instances-2jph</guid>
      <description>&lt;p&gt;With Prefect 2, Prefect agents can run in Azure Container Instances. This has a number of benefits:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Less infrastructure to manage. No VMs, no AKS clusters, just your flow. Running.&lt;/li&gt;
&lt;li&gt;Request the resources you need for each job. Up to 6 cores, 56 GB RAM and 4 GPUs if you need them. 1 CPU and 1 GB RAM is fine too.&lt;/li&gt;
&lt;li&gt;Better security - because everything other than the container is managed by Azure, there is less for you to keep safe.&lt;/li&gt;
&lt;li&gt;In many cases cheaper than the alternatives.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;There are a few drawbacks though, that you need to keep in mind:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Running a flow requires you to create a new Azure container group, which takes longer than starting a job either on a VM or on AKS.&lt;/li&gt;
&lt;li&gt;If you are used to AKS, you might have many systems running and communicating seamlessly using k8s services. ACI has no such feature. Every instance gets an IP address, but no service name that can be used.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Because ACIs are ephemeral, you might need to think about authentication differently. If you are used to VMs in Azure, and you use RBAC, you might be familiar with how the VM gets assigned a managed identity which makes it easy for it to authenticate with other resources. But first, it needs to be given that access, which isn't that big of a problem.&lt;/p&gt;

&lt;p&gt;But with an ACI that lives for just a few seconds, the container needs its access rights right from the start. This is where user-assigned managed identities help. They are identities you create explicitly, grant the permissions they need, and attach to VMs, containers and more.&lt;/p&gt;
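
&lt;p&gt;As a rough sketch of what that looks like from inside the flow - the client ID and storage account name below are hypothetical placeholders - the attached identity can be picked up without any secrets in the container:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;from azure.identity import DefaultAzureCredential
from azure.storage.blob import BlobServiceClient

# The user-assigned identity is attached to the container group, so the
# credential can be resolved without storing any secrets in the image.
credential = DefaultAzureCredential(
    managed_identity_client_id="&amp;lt;client-id-of-user-assigned-identity&amp;gt;"
)
client = BlobServiceClient(
    account_url="https://&amp;lt;storage-account&amp;gt;.blob.core.windows.net",
    credential=credential,
)
&lt;/code&gt;&lt;/pre&gt;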

&lt;p&gt;All of this can be set up with an ARM template, plus a few scripts to create the necessary Prefect blocks. A full example deployment (both in the Azure ARM sense and the Prefect sense) is located at &lt;a href="https://github.com/radbrt/prefect_aci" rel="noopener noreferrer"&gt;https://github.com/radbrt/prefect_aci&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>github</category>
      <category>gratitude</category>
      <category>productivity</category>
      <category>developer</category>
    </item>
    <item>
      <title>The best SQL function you never heard of</title>
      <dc:creator>Henning</dc:creator>
      <pubDate>Sat, 07 Jan 2023 23:34:37 +0000</pubDate>
      <link>https://dev.to/radbrt/the-best-sql-function-you-never-heard-of-1omf</link>
      <guid>https://dev.to/radbrt/the-best-sql-function-you-never-heard-of-1omf</guid>
      <description>&lt;p&gt;There are two things in IT that made me feel like my brain turned inside out when learning them. One of them is the SQL concept of &lt;em&gt;row pattern matching&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;Think of row pattern matching as doing regex between rows. Instead of finding single rows where some column has some value (or, more often, some combination of values like &lt;code&gt;WHERE col_a&amp;gt;1 AND col_b='puzzle'&lt;/code&gt;), we can find sets of rows that together constitute some interesting pattern.&lt;/p&gt;

&lt;h2&gt;
  
  
  A basic example
&lt;/h2&gt;

&lt;p&gt;The canonical example is analyzing stock prices: Can we find stocks that have seen a rally of 5 days or more? That have rebounded for at least 5 days after declining for at least 5 days? The longest stretch of continuous price increase? None of these questions can be answered by a &lt;code&gt;WHERE&lt;/code&gt; clause alone, and even though the &lt;code&gt;lag()&lt;/code&gt; function might get you far, it gets messy quickly even if it &lt;strong&gt;can&lt;/strong&gt; give you an answer.&lt;/p&gt;
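
&lt;p&gt;To see why, here is a minimal sketch of the equivalent bookkeeping in pandas (made-up numbers; the SQL &lt;code&gt;lag()&lt;/code&gt; version has the same shape): flag each increase with a lag-style comparison, then label consecutive runs. It works, but the intent gets buried:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import pandas as pd

df = pd.DataFrame({"value": [67.95, 68.02, 68.18, 68.11, 68.17, 68.28, 68.10]})
df["up"] = df["value"].diff() &amp;gt; 0     # the LAG(value) &amp;lt; value test
df["run_id"] = (~df["up"]).cumsum()    # a new id starts at every non-increase
longest = df[df["up"]].groupby("run_id").size().max()
print(longest)  # length of the longest stretch of consecutive increases
&lt;/code&gt;&lt;/pre&gt;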

&lt;p&gt;Row pattern matching introduces some new concepts, though, that we need to go through carefully in order to grasp them.&lt;/p&gt;

&lt;p&gt;Instead of stock prices we will use currency rates, which have all the same properties. A basic row pattern matching query can look something like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;currency&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tstamp&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;value&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;sdr_rates&lt;/span&gt;
&lt;span class="n"&gt;MATCH_RECOGNIZE&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="k"&gt;PARTITION&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;currency&lt;/span&gt;
    &lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;tstamp&lt;/span&gt;
    &lt;span class="k"&gt;ALL&lt;/span&gt; &lt;span class="k"&gt;ROWS&lt;/span&gt; &lt;span class="n"&gt;PER&lt;/span&gt; &lt;span class="k"&gt;MATCH&lt;/span&gt;
    &lt;span class="n"&gt;PATTERN&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt; &lt;span class="n"&gt;up&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;DEFINE&lt;/span&gt; &lt;span class="n"&gt;up&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;LAG&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;value&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We recognize the first line, but beyond that there are many new keywords. &lt;/p&gt;

&lt;p&gt;All the action happens within the &lt;code&gt;MATCH_RECOGNIZE&lt;/code&gt; clause, which is the row pattern matching function. The first thing we do is to partition the data by currency. Partitioning here works similarly to partitioning elsewhere, like in an &lt;code&gt;OVER()&lt;/code&gt; clause. We want to split the data by currency; we aren't interested in comparing different currencies to each other. Each currency should be seen as a separate time series. &lt;code&gt;ORDER BY&lt;/code&gt; is also quite simple, we specify how we should order the data. Here, we order the data by date (ascending, earliest first).&lt;/p&gt;

&lt;p&gt;We will come back to the &lt;code&gt;ALL ROWS PER MATCH&lt;/code&gt; clause, but for now, I'll note that row pattern matching by default aggregates the data and only shows one line per match. Right now, we want to see all the rows.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;PATTERN&lt;/code&gt; clause is the main regex-like clause. We specify we want to select one or more (the &lt;code&gt;+&lt;/code&gt; sign is basically regex) of something we call &lt;code&gt;up&lt;/code&gt;, which is defined on the next line.&lt;/p&gt;

&lt;p&gt;The last line, &lt;code&gt;DEFINE&lt;/code&gt;, is where we define &lt;code&gt;up&lt;/code&gt; by a set of criteria. Any row that matches these criteria is labeled &lt;code&gt;up&lt;/code&gt;, and will match in the pattern we define. The definition uses a navigation function - Oracle's version is &lt;code&gt;PREV&lt;/code&gt;, while Snowflake, as here, uses &lt;code&gt;LAG&lt;/code&gt;. We specify &lt;code&gt;up&lt;/code&gt; to be any row where the &lt;code&gt;value&lt;/code&gt; is higher than in the previous row.&lt;/p&gt;

&lt;p&gt;The output might look something like this:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;CURRENCY&lt;/th&gt;
&lt;th&gt; TSTAMP&lt;/th&gt;
&lt;th&gt;VALUE&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;LKR&lt;/td&gt;
&lt;td&gt;1994-01-04&lt;/td&gt;
&lt;td&gt;68.0266&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;LKR&lt;/td&gt;
&lt;td&gt;1994-01-05&lt;/td&gt;
&lt;td&gt;68.1848&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;LKR&lt;/td&gt;
&lt;td&gt;1994-01-07&lt;/td&gt;
&lt;td&gt;68.171&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;LKR&lt;/td&gt;
&lt;td&gt;1994-01-10&lt;/td&gt;
&lt;td&gt;68.2884&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;LKR&lt;/td&gt;
&lt;td&gt;1994-01-12&lt;/td&gt;
&lt;td&gt;68.1946&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;LKR&lt;/td&gt;
&lt;td&gt;1994-01-17&lt;/td&gt;
&lt;td&gt;68.211&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;LKR&lt;/td&gt;
&lt;td&gt;1994-01-18&lt;/td&gt;
&lt;td&gt;68.284&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;LKR&lt;/td&gt;
&lt;td&gt;1994-01-19&lt;/td&gt;
&lt;td&gt;68.3353&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;LKR&lt;/td&gt;
&lt;td&gt;1994-01-21&lt;/td&gt;
&lt;td&gt;68.5181&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;This might not be what you had expected though. The price is jumping all over the place. But it is exactly what we asked for. The currency LKR had several runs of increasing value; the longest stretch seems to be from the 17th to the 19th of January.&lt;/p&gt;

&lt;p&gt;But what about the 12th? That's just one row!?! Well, trust me that the value on the 11th was lower than on the 12th, but we didn't actually ask to get that first row returned. We referenced it in the &lt;code&gt;DEFINE&lt;/code&gt; clause (via &lt;code&gt;LAG&lt;/code&gt;), but it doesn't match anything in the &lt;code&gt;PATTERN&lt;/code&gt; clause, and so it isn't returned.&lt;/p&gt;

&lt;h2&gt;
  
  
  Expanding the example
&lt;/h2&gt;

&lt;p&gt;In order to make the output a little more informative, let's expand the pattern we are looking for and introduce a specialized type of calculated field called &lt;code&gt;MEASURES&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;sdr_rates&lt;/span&gt;
&lt;span class="n"&gt;MATCH_RECOGNIZE&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="k"&gt;PARTITION&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;currency&lt;/span&gt;
    &lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;tstamp&lt;/span&gt;
    &lt;span class="n"&gt;MEASURES&lt;/span&gt; &lt;span class="n"&gt;CLASSIFIER&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;clf&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;MATCH_NUMBER&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;mnum&lt;/span&gt;
    &lt;span class="k"&gt;ALL&lt;/span&gt; &lt;span class="k"&gt;ROWS&lt;/span&gt; &lt;span class="n"&gt;PER&lt;/span&gt; &lt;span class="k"&gt;MATCH&lt;/span&gt;
    &lt;span class="n"&gt;PATTERN&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;strt&lt;/span&gt; &lt;span class="n"&gt;up&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;DEFINE&lt;/span&gt; &lt;span class="n"&gt;up&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;LAG&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;value&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We have introduced a new item, &lt;code&gt;strt&lt;/code&gt;, in the &lt;code&gt;PATTERN&lt;/code&gt; clause, but it isn't defined in the &lt;code&gt;DEFINE&lt;/code&gt; clause. This might be unintuitive, but any pattern variable that isn't defined matches any row. So now we also return the row just before each run of increases, and we can see the increase from start to finish. In other words, the 11th is now included in the result.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;MEASURES&lt;/code&gt; clause defines two extra columns, using special &lt;code&gt;MATCH_RECOGNIZE&lt;/code&gt;-related functions. &lt;code&gt;CLASSIFIER()&lt;/code&gt; returns which part of the pattern the row matched - in our case either &lt;code&gt;strt&lt;/code&gt; or &lt;code&gt;up&lt;/code&gt;. &lt;code&gt;MATCH_NUMBER()&lt;/code&gt; enumerates the matches, so that we can easily see which rows relate to each other.&lt;/p&gt;

&lt;p&gt;The resulting table is as follows:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;CURRENCY&lt;/th&gt;
&lt;th&gt;TSTAMP&lt;/th&gt;
&lt;th&gt;VALUE&lt;/th&gt;
&lt;th&gt;CLF&lt;/th&gt;
&lt;th&gt;MNUM&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;LKR&lt;/td&gt;
&lt;td&gt;1994-01-03&lt;/td&gt;
&lt;td&gt;67.9518&lt;/td&gt;
&lt;td&gt;STRT&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;LKR&lt;/td&gt;
&lt;td&gt;1994-01-04&lt;/td&gt;
&lt;td&gt;68.0266&lt;/td&gt;
&lt;td&gt;UP&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;LKR&lt;/td&gt;
&lt;td&gt;1994-01-05&lt;/td&gt;
&lt;td&gt;68.1848&lt;/td&gt;
&lt;td&gt;UP&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;LKR&lt;/td&gt;
&lt;td&gt;1994-01-06&lt;/td&gt;
&lt;td&gt;68.1103&lt;/td&gt;
&lt;td&gt;STRT&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;LKR&lt;/td&gt;
&lt;td&gt;1994-01-07&lt;/td&gt;
&lt;td&gt;68.171&lt;/td&gt;
&lt;td&gt;UP&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;LKR&lt;/td&gt;
&lt;td&gt;1994-01-10&lt;/td&gt;
&lt;td&gt;68.2884&lt;/td&gt;
&lt;td&gt;UP&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;LKR&lt;/td&gt;
&lt;td&gt;1994-01-11&lt;/td&gt;
&lt;td&gt;68.1027&lt;/td&gt;
&lt;td&gt;STRT&lt;/td&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;LKR&lt;/td&gt;
&lt;td&gt;1994-01-12&lt;/td&gt;
&lt;td&gt;68.1946&lt;/td&gt;
&lt;td&gt;UP&lt;/td&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;LKR&lt;/td&gt;
&lt;td&gt;1994-01-13&lt;/td&gt;
&lt;td&gt;68.1825&lt;/td&gt;
&lt;td&gt;STRT&lt;/td&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;LKR&lt;/td&gt;
&lt;td&gt;1994-01-17&lt;/td&gt;
&lt;td&gt;68.211&lt;/td&gt;
&lt;td&gt;UP&lt;/td&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;LKR&lt;/td&gt;
&lt;td&gt;1994-01-18&lt;/td&gt;
&lt;td&gt;68.284&lt;/td&gt;
&lt;td&gt;UP&lt;/td&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;LKR&lt;/td&gt;
&lt;td&gt;1994-01-19&lt;/td&gt;
&lt;td&gt;68.3353&lt;/td&gt;
&lt;td&gt;UP&lt;/td&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;LKR&lt;/td&gt;
&lt;td&gt;1994-01-20&lt;/td&gt;
&lt;td&gt;68.1867&lt;/td&gt;
&lt;td&gt;STRT&lt;/td&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;LKR&lt;/td&gt;
&lt;td&gt;1994-01-21&lt;/td&gt;
&lt;td&gt;68.5181&lt;/td&gt;
&lt;td&gt;UP&lt;/td&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  Going further
&lt;/h2&gt;

&lt;p&gt;Patterns can be a lot more advanced. We could, for instance, find W-patterns by defining &lt;code&gt;down&lt;/code&gt; and searching for &lt;code&gt;PATTERN ( STRT (DOWN UP){2,})&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;sdr_rates&lt;/span&gt;
&lt;span class="n"&gt;MATCH_RECOGNIZE&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="k"&gt;PARTITION&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;currency&lt;/span&gt;
    &lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;tstamp&lt;/span&gt;
    &lt;span class="n"&gt;MEASURES&lt;/span&gt; &lt;span class="n"&gt;CLASSIFIER&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;clf&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;MATCH_NUMBER&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;mnum&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;MAX&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;up&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;max_up&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;FIRST&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tstamp&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;start_at&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;FINAL&lt;/span&gt; &lt;span class="k"&gt;LAST&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tstamp&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;end_at&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;FINAL&lt;/span&gt; &lt;span class="k"&gt;COUNT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;n_days&lt;/span&gt;
    &lt;span class="k"&gt;ALL&lt;/span&gt; &lt;span class="k"&gt;ROWS&lt;/span&gt; &lt;span class="n"&gt;PER&lt;/span&gt; &lt;span class="k"&gt;MATCH&lt;/span&gt;
    &lt;span class="n"&gt;PATTERN&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt; &lt;span class="n"&gt;strt&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt; &lt;span class="n"&gt;down&lt;/span&gt; &lt;span class="n"&gt;up&lt;/span&gt; &lt;span class="p"&gt;){&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,}&lt;/span&gt; &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;DEFINE&lt;/span&gt; &lt;span class="n"&gt;up&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;LAG&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;down&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;LAG&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;value&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In addition to the expanded pattern, we now define some more measures: the max value of rows classified as &lt;code&gt;UP&lt;/code&gt;, the first timestamp of each match, and the last timestamp of the match (measures by default do not look ahead; the &lt;code&gt;FINAL&lt;/code&gt; keyword finds the actual last value). Similarly, we take the final row count of each match.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;CURRENCY&lt;/th&gt;
&lt;th&gt;TSTAMP&lt;/th&gt;
&lt;th&gt;VALUE&lt;/th&gt;
&lt;th&gt;CLF&lt;/th&gt;
&lt;th&gt;MNUM&lt;/th&gt;
&lt;th&gt;max_up&lt;/th&gt;
&lt;th&gt;start_at&lt;/th&gt;
&lt;th&gt;end_at&lt;/th&gt;
&lt;th&gt;n_days&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;SGD&lt;/td&gt;
&lt;td&gt;2010-02-03&lt;/td&gt;
&lt;td&gt;2.19037&lt;/td&gt;
&lt;td&gt;STRT&lt;/td&gt;
&lt;td&gt;243&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;2010-02-03&lt;/td&gt;
&lt;td&gt;2010-02-09&lt;/td&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;SGD&lt;/td&gt;
&lt;td&gt;2010-02-04&lt;/td&gt;
&lt;td&gt;2.18179&lt;/td&gt;
&lt;td&gt;DOWN&lt;/td&gt;
&lt;td&gt;243&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;2010-02-03&lt;/td&gt;
&lt;td&gt;2010-02-09&lt;/td&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;SGD&lt;/td&gt;
&lt;td&gt;2010-02-05&lt;/td&gt;
&lt;td&gt;2.18932&lt;/td&gt;
&lt;td&gt;UP&lt;/td&gt;
&lt;td&gt;243&lt;/td&gt;
&lt;td&gt;2.18932&lt;/td&gt;
&lt;td&gt;2010-02-03&lt;/td&gt;
&lt;td&gt;2010-02-09&lt;/td&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;SGD&lt;/td&gt;
&lt;td&gt;2010-02-08&lt;/td&gt;
&lt;td&gt;2.1875&lt;/td&gt;
&lt;td&gt;DOWN&lt;/td&gt;
&lt;td&gt;243&lt;/td&gt;
&lt;td&gt;2.18932&lt;/td&gt;
&lt;td&gt;2010-02-03&lt;/td&gt;
&lt;td&gt;2010-02-09&lt;/td&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;SGD&lt;/td&gt;
&lt;td&gt;2010-02-09&lt;/td&gt;
&lt;td&gt;2.18961&lt;/td&gt;
&lt;td&gt;UP&lt;/td&gt;
&lt;td&gt;243&lt;/td&gt;
&lt;td&gt;2.18961&lt;/td&gt;
&lt;td&gt;2010-02-03&lt;/td&gt;
&lt;td&gt;2010-02-09&lt;/td&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;AUD&lt;/td&gt;
&lt;td&gt;2011-08-15&lt;/td&gt;
&lt;td&gt;1.53733&lt;/td&gt;
&lt;td&gt;STRT&lt;/td&gt;
&lt;td&gt;224&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;2011-08-15&lt;/td&gt;
&lt;td&gt;2011-08-19&lt;/td&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;AUD&lt;/td&gt;
&lt;td&gt;2011-08-16&lt;/td&gt;
&lt;td&gt;1.53521&lt;/td&gt;
&lt;td&gt;DOWN&lt;/td&gt;
&lt;td&gt;224&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;2011-08-15&lt;/td&gt;
&lt;td&gt;2011-08-19&lt;/td&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;AUD&lt;/td&gt;
&lt;td&gt;2011-08-17&lt;/td&gt;
&lt;td&gt;1.53681&lt;/td&gt;
&lt;td&gt;UP&lt;/td&gt;
&lt;td&gt;224&lt;/td&gt;
&lt;td&gt;1.53681&lt;/td&gt;
&lt;td&gt;2011-08-15&lt;/td&gt;
&lt;td&gt;2011-08-19&lt;/td&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;AUD&lt;/td&gt;
&lt;td&gt;2011-08-18&lt;/td&gt;
&lt;td&gt;1.53167&lt;/td&gt;
&lt;td&gt;DOWN&lt;/td&gt;
&lt;td&gt;224&lt;/td&gt;
&lt;td&gt;1.53681&lt;/td&gt;
&lt;td&gt;2011-08-15&lt;/td&gt;
&lt;td&gt;2011-08-19&lt;/td&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;AUD&lt;/td&gt;
&lt;td&gt;2011-08-19&lt;/td&gt;
&lt;td&gt;1.55462&lt;/td&gt;
&lt;td&gt;UP&lt;/td&gt;
&lt;td&gt;224&lt;/td&gt;
&lt;td&gt;1.55462&lt;/td&gt;
&lt;td&gt;2011-08-15&lt;/td&gt;
&lt;td&gt;2011-08-19&lt;/td&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  References
&lt;/h2&gt;

&lt;p&gt;Row pattern matching was originally introduced by Oracle in Database 12c. It became part of ANSI SQL:2016, but it is not widely implemented. Snowflake is one of only a handful of other databases that have this feature.&lt;/p&gt;

&lt;p&gt;Snowflake reference: &lt;a href="https://docs.snowflake.com/en/sql-reference/constructs/match_recognize.html"&gt;https://docs.snowflake.com/en/sql-reference/constructs/match_recognize.html&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Oracle reference: &lt;a href="https://docs.oracle.com/database/121/DWHSG/pattern.htm#DWHSG8956"&gt;https://docs.oracle.com/database/121/DWHSG/pattern.htm#DWHSG8956&lt;/a&gt;&lt;/p&gt;

</description>
      <category>sql</category>
      <category>snowflake</category>
      <category>oracle</category>
    </item>
    <item>
      <title>The joys of loading CSV files</title>
      <dc:creator>Henning</dc:creator>
      <pubDate>Sat, 07 Jan 2023 13:46:45 +0000</pubDate>
      <link>https://dev.to/radbrt/the-joys-of-loading-csv-files-ke1</link>
      <guid>https://dev.to/radbrt/the-joys-of-loading-csv-files-ke1</guid>
      <description>&lt;p&gt;There is no shortage of CSV files. Reading, parsing and writing these files to databases can probably be a full-time job in some cases, and there are a number of both open-source and SaaS solutions to load CSV files into a database. But not all solutions are the same, and the devil is often in the details. Let's take a look at some of the common and less common aspects you need to handle.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;File encoding&lt;/li&gt;
&lt;li&gt;File- and folder patterns&lt;/li&gt;
&lt;li&gt;Separator&lt;/li&gt;
&lt;li&gt;Headers&lt;/li&gt;
&lt;li&gt;Extra headers&lt;/li&gt;
&lt;li&gt;Column type inference&lt;/li&gt;
&lt;li&gt;Blank lines&lt;/li&gt;
&lt;li&gt;Uniqueness&lt;/li&gt;
&lt;li&gt;File name reuse&lt;/li&gt;
&lt;li&gt;Quoting&lt;/li&gt;
&lt;li&gt;Extra columns&lt;/li&gt;
&lt;li&gt;Line breaks&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Some of these are obvious to many, others are subtle. Let's go through them one by one.&lt;/p&gt;

&lt;h2&gt;
  
  
  File encoding
&lt;/h2&gt;

&lt;p&gt;Most of us have argued with encodings, but we are also generally quite lucky nowadays to use UTF-8 most of the time. While some tools are able to guess the encoding of files, the encoding of a file is not authoritatively declared anywhere. Many Extract/Load tools either assume that the encoding is UTF-8, or require the encoding to be specified when setting up the job.&lt;/p&gt;

&lt;p&gt;And of course, there are many interesting encodings. While UTF-8 and miscellaneous Latin-1 encodings are common in Europe, and I won't even guess about encodings common in Asia, there are also things like windows-1252 and utf-8-sig, which apparently is UTF-8 with a byte order mark.&lt;/p&gt;
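
&lt;p&gt;In Python, for instance, the encoding is declared when opening the file - a minimal sketch with a hypothetical file name:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import csv

# utf-8-sig strips the byte order mark if present; plain utf-8 would keep it.
with open("orders.csv", encoding="utf-8-sig", newline="") as f:
    rows = list(csv.reader(f))
&lt;/code&gt;&lt;/pre&gt;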

&lt;h2&gt;
  
  
  File- and folder patterns
&lt;/h2&gt;

&lt;p&gt;One folder on one SFTP server (or wherever you may read from) might contain a bunch of different files that should be loaded into different tables, typically based on the file prefix. Being able to use regex to select only certain files in the folder might be vital. Especially if not all files in there are CSV. Imagine trying to parse some random PDF that made its way in there.&lt;/p&gt;
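
&lt;p&gt;A minimal sketch of that kind of selection - the file names and the prefix are hypothetical:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import re

listing = ["orders_2023-01-01.csv", "orders_2023-01-02.csv", "manifest.pdf"]
pattern = re.compile(r"^orders_\d{4}-\d{2}-\d{2}\.csv$")
# Keep only the files that look like daily order extracts.
csv_files = [name for name in listing if pattern.match(name)]
&lt;/code&gt;&lt;/pre&gt;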

&lt;h2&gt;
  
  
  Separator
&lt;/h2&gt;

&lt;p&gt;Common field separators are comma (&lt;code&gt;,&lt;/code&gt;), semicolon (&lt;code&gt;;&lt;/code&gt;), tab (&lt;code&gt;\t&lt;/code&gt;) and pipe (&lt;code&gt;|&lt;/code&gt;). I haven't seen all that many uncommon field separators, but colon (&lt;code&gt;:&lt;/code&gt;) is one. Specifying the correct separator probably isn't difficult, and some tools are even able to guess. But sometimes, there is a very limited list to choose from. And you might be out of luck.&lt;/p&gt;

&lt;h2&gt;
  
  
  Headers
&lt;/h2&gt;

&lt;p&gt;Some files have a header row, others don't. And some files have many, as we'll get into. Reading CSV files into databases, the header usually becomes the column names. Actually, most tools require a header row and have no ability to just name columns from &lt;code&gt;c1&lt;/code&gt; to &lt;code&gt;c100&lt;/code&gt;.&lt;/p&gt;
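
&lt;p&gt;The naming itself is trivial if you control the load - a pandas sketch with a hypothetical file:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import pandas as pd

# header=None stops the first data row from being consumed as column names.
df = pd.read_csv("no_header.csv", header=None)
df.columns = [f"c{i + 1}" for i in range(df.shape[1])]
&lt;/code&gt;&lt;/pre&gt;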

&lt;p&gt;Another issue with headers is special characters. Special characters often aren't valid column names, so they need to be cleaned up. Even worse, if you are truly unlucky, the cleaned-up name clashes with another column name. Say, for instance, there are two columns that differ only in their special characters. Replacing special characters with underscores (&lt;code&gt;_&lt;/code&gt;), which is common, would lead to duplicate column names.&lt;/p&gt;
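
&lt;p&gt;A minimal sketch of a cleanup that at least keeps the names unique, assuming the common replace-with-underscores strategy:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import re

def clean_headers(headers):
    """Replace special characters with underscores, then deduplicate."""
    seen, cleaned = {}, []
    for header in headers:
        name = re.sub(r"[^0-9a-zA-Z_]", "_", header)
        seen[name] = seen.get(name, 0) + 1
        cleaned.append(name if seen[name] == 1 else f"{name}_{seen[name]}")
    return cleaned

print(clean_headers(["price($)", "price(%)"]))  # ['price___', 'price____2']
&lt;/code&gt;&lt;/pre&gt;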

&lt;p&gt;Related to both headers and separators, there might also be padding at the start and end of each field. This is fortunately rare, but it can make column names weird if the padding (usually spaces) isn't trimmed away. Dealing with extra spaces in the data itself is also annoying, and may cause a lot of fields to be cast as strings instead of numbers or other data types.&lt;/p&gt;

&lt;h2&gt;
  
  
  Extra headers
&lt;/h2&gt;

&lt;p&gt;Some files have extra lines with random data before the actual CSV file begins. This might be metadata in the form of extra column references, or perhaps a manifest specifying where and how the file was produced. In any case, before you can parse the file as a CSV, you have to skip a number of lines. Hopefully, you know how many lines beforehand, and the tool you use lets you specify a number of lines to skip.&lt;/p&gt;
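
&lt;p&gt;In pandas, for example, this is just the &lt;code&gt;skiprows&lt;/code&gt; parameter - the file name is hypothetical:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import pandas as pd

# Skip two lines of manifest/metadata before the real header row.
df = pd.read_csv("report_with_preamble.csv", skiprows=2)
&lt;/code&gt;&lt;/pre&gt;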

&lt;h2&gt;
  
  
  Null values
&lt;/h2&gt;

&lt;p&gt;Most sane CSV files simply don't write anything for null values - they are marked only by two separator signs in a row. But sometimes, nulls get written out as &lt;code&gt;null&lt;/code&gt; or &lt;code&gt;NULL&lt;/code&gt; or &lt;code&gt;N/A&lt;/code&gt;. It might not be the end of the world that these get parsed as strings, but it creates a lot of cleanup.&lt;/p&gt;
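
&lt;p&gt;Most readers let you declare these markers up front - a pandas sketch with a hypothetical file:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import pandas as pd

# Parse the written-out markers as real nulls instead of strings
# (pandas already treats several of these as null by default).
df = pd.read_csv("data.csv", na_values=["null", "NULL", "N/A"])
&lt;/code&gt;&lt;/pre&gt;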

&lt;h2&gt;
  
  
  Column type inference
&lt;/h2&gt;

&lt;p&gt;Most CSV readers try to infer the data type by sampling. Sometimes though, the sampling doesn't catch the full variation of values, and it ends up declaring numbers where it should have created strings or ints where it should have created numbers. This is practically bound to fail downstream.&lt;/p&gt;

&lt;p&gt;Most systems do type inference by default, but some have the option to either turn it off, specify the schema manually, or review and correct the result of the auto-inference.&lt;/p&gt;
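
&lt;p&gt;The bluntest version of turning it off, sketched in pandas with a hypothetical file:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import pandas as pd

# Read everything as strings; cast to proper types deliberately afterwards.
df = pd.read_csv("data.csv", dtype=str)
&lt;/code&gt;&lt;/pre&gt;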

&lt;p&gt;In extreme cases, automatic column type inference can cause data loss. The string &lt;code&gt;123456789.0123456789&lt;/code&gt; would be truncated if cast as a floating point number, and likely to turn up as roughly &lt;code&gt;123456789.0123456&lt;/code&gt; in the target database due to a combination of truncation and floating point rounding. That's three characters out the window, never to be seen or heard from again.&lt;/p&gt;

&lt;h2&gt;
  
  
  Blank lines
&lt;/h2&gt;

&lt;p&gt;There are two types of blank lines. The first kind originates in the source system and maintains the field separator - the entire line is just a bunch of commas or similar. This might be a problem because the key column is null, which is hardly compatible with uniqueness (if that is configured).&lt;/p&gt;

&lt;p&gt;The other kind is the entirely blank line - these might come from processes that append lines to existing files, leaving additional blank lines at the end. CSV parsers might have trouble with these files, and so your loading process might fail.&lt;/p&gt;
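
&lt;p&gt;In pandas terms, with a hypothetical file: entirely blank lines are skipped by default, while separator-only lines arrive as all-null rows that you have to drop yourself:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import pandas as pd

df = pd.read_csv("data.csv", skip_blank_lines=True)  # the default behavior
df = df.dropna(how="all")  # drop rows that were only separators
&lt;/code&gt;&lt;/pre&gt;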

&lt;h2&gt;
  
  
  Uniqueness
&lt;/h2&gt;

&lt;p&gt;You may or may not care about uniqueness, but in either case you need to be aware of it. Some systems explicitly &lt;em&gt;require&lt;/em&gt; some definition of uniqueness, others use file name + line number as uniqueness, and others again allow you to forego uniqueness altogether.&lt;/p&gt;

&lt;p&gt;If your files are written once and never updated thereafter, uniqueness might not matter much. You read any file that has been added and append it to your target database, pretty much no matter how the key is specified (as long as you make sure you don't specify a key that isn't actually unique).&lt;/p&gt;

&lt;p&gt;Often, the unique key is only unique to that particular file. In order to get global uniqueness, you need to specify the file name to be part of the key. The file name is often included as a meta-column, so being able to refer to this as part of the unique key is probably important.&lt;/p&gt;

&lt;p&gt;Using file name + line number as uniqueness might seem scary, but it is actually a very common approach, and there is only one edge case where it might fail: if the file is updated very randomly - practically rewritten, with rows removed - you may end up with duplicate values on business keys.&lt;/p&gt;
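
&lt;p&gt;A minimal sketch of building that kind of key while reading, with a hypothetical path:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import csv

def read_with_key(path):
    """Yield each row with file name + row number as a synthetic key."""
    with open(path, newline="") as f:
        for rownum, row in enumerate(csv.reader(f), start=1):
            yield {"_key": f"{path}:{rownum}", "row": row}
&lt;/code&gt;&lt;/pre&gt;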

&lt;h2&gt;
  
  
  File name reuse
&lt;/h2&gt;

&lt;p&gt;Sometimes, the file to load doesn't change name, only content. Think of something like a file named &lt;code&gt;data.csv&lt;/code&gt;, which gets replaced once a day. In that case, you probably want to append to the target table. This works nicely if you are able to specify that there is no unique key, or specify explicitly that your table should only be appended to.&lt;/p&gt;
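
&lt;p&gt;Sketched with pandas and SQLAlchemy - the file, database and table names are hypothetical:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import pandas as pd
from sqlalchemy import create_engine

engine = create_engine("sqlite:///warehouse.db")
df = pd.read_csv("data.csv")
# Append-only: never replace the table, even though the file name repeats.
df.to_sql("raw_data", engine, if_exists="append", index=False)
&lt;/code&gt;&lt;/pre&gt;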

&lt;h2&gt;
  
  
  Quoting
&lt;/h2&gt;

&lt;p&gt;Text fields should be quoted in order to make sure commas, semicolons or other signs are not interpreted as field delimiters. But when fields get quoted, any quote signs inside the text should be escaped so that they aren't interpreted as the end of the text field. Whatever system produces the CSV file should be able to quote and escape the file appropriately. If not, it's an error waiting to happen.&lt;/p&gt;
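
&lt;p&gt;A quick sketch of what correct quoting and escaping looks like, using Python's standard &lt;code&gt;csv&lt;/code&gt; module:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import csv
import io

buf = io.StringIO()
writer = csv.writer(buf, quoting=csv.QUOTE_ALL)  # quote every field
writer.writerow([1, 'He said "hi, there"'])
print(buf.getvalue())  # "1","He said ""hi, there""" - quotes doubled, comma kept
&lt;/code&gt;&lt;/pre&gt;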

&lt;h2&gt;
  
  
  Extra columns
&lt;/h2&gt;

&lt;p&gt;The number of columns in the header and the number of columns in the rows should match, but there are no guarantees. What should happen if there are extra columns in some rows? Situations like this might happen if a text field isn't quoted appropriately, and a comma in a free-text field gets interpreted as a field separator. Or, in Europe, if commas are used both as field separator and a decimal mark on floating point numbers.&lt;/p&gt;

&lt;p&gt;This is a bad situation no matter what, but you should be aware of it and have thoughts about what you would want to happen. Maybe you actually prefer that the pipeline fails in those cases. Or maybe the priority is to capture all the data and try to figure out the mess later.&lt;/p&gt;
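
&lt;p&gt;Recent pandas versions make that choice explicit - a sketch assuming pandas 1.3 or later and a hypothetical file:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import pandas as pd

df = pd.read_csv("data.csv", on_bad_lines="error")  # fail loudly (the default)
# on_bad_lines="warn" keeps going and reports each bad row;
# on_bad_lines="skip" silently drops them.
&lt;/code&gt;&lt;/pre&gt;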

&lt;h2&gt;
  
  
  Line breaks
&lt;/h2&gt;

&lt;p&gt;Free-text fields especially might contain line breaks, which is probably OK if the field is quoted - but check to make sure. If it isn't, these line breaks are interpreted as new rows, which will also mess with the column count. No matter how the loader handles line breaks in unquoted fields, they will be a problem.&lt;/p&gt;

&lt;p&gt;As long as the fields are quoted and the loader can handle it, the only thing you need to keep in mind is that this can mess with row counts. The number of rows in the file will be higher than the number of records in the resulting table, and that's how it should be.&lt;/p&gt;
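
&lt;p&gt;A tiny sketch showing a quoted line break parsing as a single record:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import csv
import io

data = 'id,comment\n1,"first line\nsecond line"\n'
rows = list(csv.reader(io.StringIO(data)))
print(len(rows))  # 2: a header and one record, despite three physical lines
&lt;/code&gt;&lt;/pre&gt;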

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Hopefully you won't stumble into all of these issues all at once, but when evaluating a robust extract/load system these are some issues you should be aware of and test for.&lt;/p&gt;

&lt;p&gt;And I promise this isn't an exhaustive list.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>A short intro to Azure Storage account access</title>
      <dc:creator>Henning</dc:creator>
      <pubDate>Fri, 06 Jan 2023 22:36:38 +0000</pubDate>
      <link>https://dev.to/radbrt/a-short-intro-to-azure-storage-account-access-451i</link>
      <guid>https://dev.to/radbrt/a-short-intro-to-azure-storage-account-access-451i</guid>
      <description>&lt;p&gt;Managing access in Azure can be very confusing - there are many options, partially overlapping, and often poorly documented. This isn't the place to explain access in Azure in general, rather I want to point out some minutiae around access credentials to blob storage in particular.&lt;/p&gt;

&lt;p&gt;I want to begin by saying that if you can, you should use Managed Identity to access Azure resources. It is (for the most part) way better than using any kind of access credential both in terms of convenience and security.&lt;/p&gt;

&lt;p&gt;But if you are sharing data with someone outside of your Azure account, access credentials can be great.&lt;/p&gt;

&lt;p&gt;There are three types of access keys I want to cover:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Access keys&lt;/li&gt;
&lt;li&gt;Shared Access Signature (SAS keys) on the storage account level&lt;/li&gt;
&lt;li&gt;Shared Access Signatures (SAS keys) on the container level (also called "Shared Access Token" in the portal menu)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Fortunately, SAS keys on the storage account level and on the container level have more similarities than differences. Access keys, however, are something else and serve a different purpose (although they too can seem very similar).&lt;/p&gt;

&lt;h2&gt;
  
  
  Access keys
&lt;/h2&gt;

&lt;p&gt;All storage accounts come with two sets of access keys by default, and they can be rotated independently. This makes it possible to rotate credentials seamlessly. If your application is integrating with the storage account and you want to rotate credentials regularly, you can first update the application to use key number 2, rotate key 1, and update the application again to use the new key 1. And then rotate key 2 for good measure.&lt;/p&gt;

&lt;p&gt;These keys are permanent in the sense that you will always have two valid access keys to your storage account, and in the sense that they never expire. Rotating a key is the only way to invalidate it.&lt;/p&gt;

&lt;p&gt;The key itself is relatively short, and looks something like this: &lt;code&gt;4QVUVfk5GFJDUqpsWrVRS70c92rJuGBcRe13p137gAIfA2v+v/CfTH5ngL4k7D+YCy9aHBWUi+6k+AStqEXrMQ==&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;There is also a connection string which contains the key and additional information required to connect to the storage account:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;DefaultEndpointsProtocol=https;AccountName=radbrtstorage4180;AccountKey=4QVUVfk5GFJDUqpsWrVRS70c92rJuGBcRe13p137gAIfA2v+v/CfTH5ngL4k7D+YCy9aHBWUi+6k+AStqEXrMQ==;EndpointSuffix=core.windows.net&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;As you can see, this is a weird collection of keywords: the account key, the account name, the endpoint suffix and the protocol. All in all, they can be assembled into the URL of the account, plus the key to access it.&lt;/p&gt;
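
&lt;p&gt;A small sketch of that assembly, with the key shortened to a placeholder:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;conn_str = (
    "DefaultEndpointsProtocol=https;AccountName=radbrtstorage4180;"
    "AccountKey=&amp;lt;key&amp;gt;;EndpointSuffix=core.windows.net"
)
parts = dict(kv.split("=", 1) for kv in conn_str.split(";"))
url = f"{parts['DefaultEndpointsProtocol']}://{parts['AccountName']}.blob.{parts['EndpointSuffix']}"
print(url)  # https://radbrtstorage4180.blob.core.windows.net
&lt;/code&gt;&lt;/pre&gt;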

&lt;p&gt;We will return to a few examples where we use it.&lt;/p&gt;

&lt;h2&gt;
  
  
  SAS keys on the storage account level
&lt;/h2&gt;

&lt;p&gt;There are two important differences between account keys and SAS keys:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;SAS keys expire at some (configurable) point in the future, making them ideal for granting time-limited access or forcing key rotation.&lt;/li&gt;
&lt;li&gt;SAS keys can grant granular access to resources, like read-only (or, interestingly, write-only)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The SAS key is presented (in the portal at least) in three different formats:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The plain key&lt;/li&gt;
&lt;li&gt;a connection string which bears some resemblance to the connection string for Account Keys&lt;/li&gt;
&lt;li&gt;a Blob Service SAS URL, which looks like a regular URL with the SAS key tacked on to it.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The connection string is very long and involved, explicitly listing the URLs for all the components of the storage account: Blob storage, Queue, File storage and Table storage. But it is still, at its core, just some simple endpoints and a token in a key-value format.&lt;/p&gt;

&lt;p&gt;A lot of services/apps that integrate with Azure Storage will accept either the full connection string, or the storage account name plus the token. The Azure Python SDK, on the other hand, is partial to the full connection string.&lt;/p&gt;

&lt;h2&gt;
  
  
  SAS Keys for containers
&lt;/h2&gt;

&lt;p&gt;Generating a Shared Access Token for a given container renders a token and a blob SAS URL, but no connection string. &lt;/p&gt;

&lt;p&gt;Since the connection string is what we want to use for connecting with Python, it might seem we're out of luck. But it is possible to construct our own connection string. The only endpoint our token will support is the blob endpoint, and we can assemble our own blob endpoint URL by taking the domain from the URL, &lt;code&gt;https://radbrtstorage4180.blob.core.windows.net&lt;/code&gt;, and the token from the container SAS (something like &lt;code&gt;sp=r&amp;amp;st=2023-01-06T18:43:46Z&amp;amp;se=2023-01-07T02:43:46Z&amp;amp;spr=https&amp;amp;sv=2021-06-08&amp;amp;sr=c&amp;amp;sig=FjFD2uwAvH22Dy2ugLtz6Lri2PoSz%2FMtgwcx8dr3jhE%3D&lt;/code&gt;) and assemble them into: &lt;/p&gt;

&lt;p&gt;&lt;code&gt;BlobEndpoint=https://radbrtstorage4180.blob.core.windows.net/;SharedAccessSignature=sp=r&amp;amp;st=2023-01-06T18:43:46Z&amp;amp;se=2023-01-07T02:43:46Z&amp;amp;spr=https&amp;amp;sv=2021-06-08&amp;amp;sr=c&amp;amp;sig=FjFD2uwAvH22Dy2ugLtz6Lri2PoSz%2FMtgwcx8dr3jhE%3D&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;So SAS keys are pretty much the same whether they are generated as Shared Access Tokens for a specific container or for the storage account as a whole. Even though the container-specific SAS keys don't come with a connection string, we are able to assemble one quite easily. And even though the storage account SAS lists a lot of endpoints, it is OK to strip away the ones you won't use. So in effect, the connection string above will hold no matter how the SAS key was generated.&lt;/p&gt;

&lt;p&gt;As a curiosity, even though we saw that the connection string for storage account access keys has a very different format, it turns out you can pass the storage account access key as the SAS token in the structure above. Even though it isn't a SAS token, Azure will accept it. Let's hope that is a feature, not a bug.&lt;/p&gt;

&lt;p&gt;Finally, a demo of listing objects in a blob in Python. If you haven't already, start with &lt;code&gt;pip install azure-storage-blob&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;azure.storage.blob&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;BlobServiceClient&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;count_objects_in_container&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sas_key&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;container_name&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;blob_service_client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;BlobServiceClient&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;from_connection_string&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sas_key&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;container_client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;blob_service_client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;get_container_client&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;container_name&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;blob_names&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;blob&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;blob&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;container_client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;list_blobs&lt;/span&gt;&lt;span class="p"&gt;()]&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nb"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;blob_names&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;container_name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"&amp;lt;my-container-name&amp;gt;"&lt;/span&gt;
&lt;span class="n"&gt;storage_account_name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"&amp;lt;my-storage-account-name&amp;gt;"&lt;/span&gt;
&lt;span class="n"&gt;token&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"&amp;lt;my-container-sas-token, storage-account-sas-token or account key&amp;gt;"&lt;/span&gt;

&lt;span class="n"&gt;connection_string&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="s"&gt;"BlobEndpoint=https://&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;storage_account_name&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;.blob.core.windows.net/;SharedAccessSignature=&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;token&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;"&lt;/span&gt;


&lt;span class="n"&gt;n_objects&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;count_objects_in_container&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;connection_string&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;container_name&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="s"&gt;"We counted &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;n_objects&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; objects in the container"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  Postscript 1:
&lt;/h4&gt;

&lt;p&gt;There is one kind of SAS key I haven't covered: Single-object SAS keys. In a container, you can click on any object and generate an access token, looking exactly like the container-level SAS key, but the URL is for direct file download. It's a neat feature, but not one I'm using.&lt;/p&gt;

&lt;h4&gt;
  
  
  Postscript 2:
&lt;/h4&gt;

&lt;p&gt;SAS tokens are sometimes used with a leading question-mark (&lt;code&gt;?&lt;/code&gt;), for instance when creating an external stage in Snowflake:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;create&lt;/span&gt; &lt;span class="n"&gt;stage&lt;/span&gt; &lt;span class="n"&gt;DWH&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;RAW&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;STAGE&lt;/span&gt;
  &lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'azure://radbrtstorage4180.blob.core.windows.net/mycontainer/files/'&lt;/span&gt;
  &lt;span class="n"&gt;credentials&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;azure_sas_token&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'?sp=r&amp;amp;st=2023-01-06T18:43:46Z&amp;amp;se=2023-01-07T02:43:46Z&amp;amp;spr=https&amp;amp;sv=2021-06-08&amp;amp;sr=c&amp;amp;sig=FjFD2uwAvH22Dy2ugLtz6Lri2PoSz%2FMtgwcx8dr3jhE%3D'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="n"&gt;file_format&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;DWH&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;RAW&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;LOAD_CSV&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



</description>
      <category>azure</category>
      <category>python</category>
    </item>
  </channel>
</rss>
