DEV Community: Paul Brabban

How to get pwned with --extra-index-url

Paul Brabban — Sat, 06 Dec 2025 00:00:00 +0000

Python's built-in pip package manager is unsafe when used with the --extra-index-url flag (there are other dangerous variants too). An attacker can publish a malicious package with the same name and a higher version to PyPI, and their package will be installed.

This post confirms that the vulnerability (CVE-2018-20225) is still a problem today. Despite its CVSS score of 7.8 (High), the maintainers have refused to change the behaviour.

I also introduce a test suite and publicly-available test packages that you can use to more easily confirm the safety - or not - of your own setup.

Two variants of package `example-package-cve-2018-20225`

I've written two variants of a new package that I'll use to demonstrate the problem. The package is essentially a single __init__.py file that prints a message to show which package has been installed when it's imported, along with minimal metadata required to publish the package to a registry.

The "safe" variant

The "safe" variant of the package is at version 0.0.1. It prints this is the safe, private package when imported.

This package stands in for your intended, usually private, package. I've published it to GitLab and made the registry public for the convenience of testing.

The "malicious" variant

The "malicious" variant of the package is at version 1.0.0. There's nothing special about 1.0.0, it's just "higher" than 0.0.1.

This package prints oops, this is the malicious package when imported. It's published to PyPI.
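If you'd rather check which variant ended up in an environment without importing it, something like the following sketch could work. It relies only on the package name and the two version numbers given above; the function names are my own, purely for illustration.

```python
from importlib.metadata import PackageNotFoundError, version

PACKAGE = "example-package-cve-2018-20225"
SAFE_VERSION = "0.0.1"       # the variant published to GitLab
MALICIOUS_VERSION = "1.0.0"  # the variant published to PyPI


def classify(installed_version):
    """Map an installed version string (or None) to a human-readable verdict."""
    if installed_version is None:
        return "not installed"
    if installed_version == SAFE_VERSION:
        return "safe"
    if installed_version == MALICIOUS_VERSION:
        return "malicious"
    return "unknown"


def installed_variant(package=PACKAGE):
    """Look up the installed version via package metadata, without importing it."""
    try:
        found = version(package)
    except PackageNotFoundError:
        found = None
    return classify(found)


if __name__ == "__main__":
    print(installed_variant())
```

Reading the metadata rather than importing the package means the check itself never runs the package's code.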

Testing approach

I've created a GitHub actions workflow to test a variety of install and update scenarios. There are far too many potential tools and combinations to test them all, which is why I've made these packages available publicly. You can use them to test whatever specific scenario you want.

Warning: the usual disclaimers apply. My intentions are good, but that could change or I could be compromised in the future. Take whatever precautions you can to establish trustworthiness - I've kept the packages simple to aid manual audit.

All the tests are run against the latest versions (at time of writing) of the package management software. The tests report failure if the malicious package is installed. You can see the current latest test run in the repo's GitHub actions tab. You can also see the packages and how I published them to PyPI and GitLab in the repo.

Test scenarios

I'm trying out a few scenarios I'm interested in. What happens when you specify various combinations of flags (including forgetting the flags) with pip?

pip with and without flags

pip install ${PACKAGE}: 🚨 Malicious (Default behaviour if the flags are forgotten)
pip install ${PACKAGE} --index-url ${GITLAB_INDEX_URL}: ✅ Safe (Replaces PyPI with GitLab as the only source)
pip install ${PACKAGE} --extra-index-url ${GITLAB_INDEX_URL}: 🚨 Malicious (Searches both PyPI and GitLab, installs highest version)
pip install ${PACKAGE} --index-url ${PYPI_INDEX_URL} --extra-index-url ${GITLAB_INDEX_URL}: 🚨 Malicious (Sets PyPI as primary, GitLab as extra, same behaviour when flag order reversed)
export PIP_EXTRA_INDEX_URL=${GITLAB_INDEX_URL}; pip install ...: 🚨 Malicious (Uses environment variable instead of CLI flag)
pip install -r requirements.txt (File contains ${PACKAGE}): 🚨 Malicious (Installs from PyPI)
pip install ${PACKAGE} --index-url ...; pip install -U ${PACKAGE} --extra-index-url: 🚨 Malicious (Installs "safe", then runs update with both indexes to get "malicious")
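The results above are consistent with a simple mental model: candidates from every configured index go into one pool and the highest version wins, regardless of where it came from. Here's a toy sketch of that model. It is a deliberate simplification and not pip's actual resolver; the index names and the naive version parsing are assumptions for illustration.

```python
def parse_version(text):
    """Turn '1.0.0' into (1, 0, 0) so versions compare numerically."""
    return tuple(int(part) for part in text.split("."))


def pick_from_pooled_indexes(candidates_by_index):
    """Pool every candidate from every index, then take the highest version."""
    pooled = [
        (parse_version(candidate), candidate, index_name)
        for index_name, candidates in candidates_by_index.items()
        for candidate in candidates
    ]
    if not pooled:
        return None
    _, candidate, index_name = max(pooled)
    return candidate, index_name


# --extra-index-url: both indexes are searched, so the PyPI version wins.
print(pick_from_pooled_indexes({"gitlab": ["0.0.1"], "pypi": ["1.0.0"]}))
# prints ('1.0.0', 'pypi')

# --index-url pointing only at GitLab: PyPI is never consulted.
print(pick_from_pooled_indexes({"gitlab": ["0.0.1"]}))
# prints ('0.0.1', 'gitlab')
```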

GitLab's PyPI pass-through behaviour

A GitLab registry will pass through requests for packages that it doesn't hold to PyPI. This pass-through is flagged as a security risk, but if you're exposed to CVE-2018-20225, it seems like a solid step forward to me. It resolves the dependency confusion problem, works with different package managers and is easy for users.

pip install ${PACKAGE} requests --index-url ${GITLAB_INDEX_URL}: ✅ Safe (Installs target package + public lib (requests) from GitLab index only, succeeds and installs the right package)

Is `uv` vulnerable?

There are many Python package managers. uv is the current darling of the community and has different behaviour when this flag is used in a pip-like manner. I've added a couple of tests to confirm the behaviour.

uv pip install ${PACKAGE} --extra-index-url ${GITLAB_INDEX_URL}: ✅ Safe (Uses uv with the risky flag)
uv pip install ${PACKAGE} --index-strategy unsafe-best-match: 🚨 Malicious (Uses uv but forces legacy pip behaviour)
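As I understand the uv documentation, the difference comes down to index strategy: by default uv stops at the first index that has any version of the package, and extra indexes are consulted before the default one. The sketch below contrasts that with the pooled "best match" behaviour. It is a toy model rather than uv's implementation, and the index ordering is my assumption based on that reading of the docs.

```python
def parse_version(text):
    """Turn '1.0.0' into (1, 0, 0) so versions compare numerically."""
    return tuple(int(part) for part in text.split("."))


def first_index(indexes):
    """Use only the first index (in priority order) that has the package at all."""
    for index_name, candidates in indexes:
        if candidates:
            return max(candidates, key=parse_version), index_name
    return None


def unsafe_best_match(indexes):
    """Pool candidates from all indexes and take the highest version."""
    pooled = [
        (parse_version(candidate), candidate, index_name)
        for index_name, candidates in indexes
        for candidate in candidates
    ]
    if not pooled:
        return None
    _, candidate, index_name = max(pooled)
    return candidate, index_name


# Assumed priority order: the extra (GitLab) index first, then PyPI.
indexes = [("gitlab", ["0.0.1"]), ("pypi", ["1.0.0"])]
print(first_index(indexes))        # prints ('0.0.1', 'gitlab')
print(unsafe_best_match(indexes))  # prints ('1.0.0', 'pypi')
```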

What about lockfiles?

Assuming you didn't already lock the malicious package, lockfiles only offer temporary protection. When you update, if you update unsafely, you get the malicious package.

Summary

There are many ways to put yourself at risk of CVE-2018-20225; if you get it wrong, an attacker has a trivially easy route onto your computer or infrastructure.
Being confident that what you're doing is safe isn't trivial; I've provided source code, a suite of scenario results and a test harness to help you.

Avoid

Using --extra-index-url with pip 🚨
Using --index-strategy with uv 🚨

Consider

Using a private registry with a PyPI pass-through as the only index, which I demonstrated with GitLab.

I doubt I would have known about this problem had it not been for a vulnerability scanner alerting me to it last year. If you're currently exposed to this problem, you're certainly not alone. When I asked ChatGPT how to safely use a private package registry, the response it generated (based, of course, on the content it's been trained on) included using --extra-index-url with no mention of this risk.

ChatGPT recommending the vulnerable --extra-index-url approach

BigQuery, safer by default from September 2025

Paul Brabban — Thu, 17 Jul 2025 00:00:00 +0000

On September 1st 2025, Google will make BigQuery a lot safer by default, changing the default quotas for projects under the default on-demand pricing model. Instead of unlimited financial damage, the default for new projects will be around $1000 per day. Existing projects will be updated to a custom limit based on prior 30-day usage. Projects that already have a limit set will not be changed.

I received an email entitled "[Action Advised] Review and set appropriate daily usage limit for BigQuery projects before Sep 1, 2025." yesterday. The content is a big step forward for making BigQuery safer by default. Unfortunately, Google's copyrighted the content so I can't share it in full here, and the quota documentation doesn't cover the full content, but I can summarize the key points.

New projects

As of September 1st, 2025, new projects using the default on-demand BigQuery pricing model will have a daily usage limit of 200TiB. The current model has no default limit, potentially leading to unlimited financial exposure - as I've written about in $1,370 Gone in Sixty Seconds, explaining how a single query could run up a bill over $1,300 in less than a minute, and in The BigQuery Safety Net, explaining how the quota system can mitigate these risks.

200TiB is still a lot, and at the $5/TiB current pricing, it's $1000 per day. Most SMEs won't be affected, but an unexpected spike could still be painful. I see no indication that any smart adjustments will be made automatically based on usage patterns, so I'll still be setting my own values for these quotas anyway.
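The arithmetic is simple enough to sketch, which also makes it easy to work backwards from a budget to a quota. The $5/TiB price is the figure quoted above; the function names and the example $50 budget are mine, for illustration only.

```python
def daily_cost_usd(tib_scanned, usd_per_tib):
    """Worst-case daily spend if the whole quota is used."""
    return tib_scanned * usd_per_tib


def tib_for_budget(budget_usd, usd_per_tib):
    """Quota (in TiB) that keeps worst-case daily spend within a budget."""
    return budget_usd / usd_per_tib


print(daily_cost_usd(200, 5.0))  # prints 1000.0
print(tib_for_budget(50, 5.0))   # prints 10.0
```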

Existing projects

If a project already has a quota value other than unlimited set for QueryUsagePerDay or QueryUsagePerUserPerDay, then no changes will be made. Otherwise, Google will set a quota based on usage in the last 30 days.
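My reading of that rule, as a tiny sketch: Google only steps in when neither quota has been given a value. Using None to stand for "unlimited" is my own convention here, not how the quota API represents it.

```python
UNLIMITED = None  # hypothetical marker for "no custom quota set"


def google_will_set_quota(query_usage_per_day, query_usage_per_user_per_day):
    """True only when both quotas are still unlimited."""
    return (
        query_usage_per_day is UNLIMITED
        and query_usage_per_user_per_day is UNLIMITED
    )


print(google_will_set_quota(UNLIMITED, UNLIMITED))  # prints True
print(google_will_set_quota(10, UNLIMITED))         # prints False
```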

Auditing

The changes will be visible in audit logs when they occur.

Summary

There are a whole bunch of blog posts out there that are going to need updating!

Here are the updated BigQuery quota-related docs. BigQuery's defaults are about to get a lot safer, but could still hurt to the tune of $1000/day. I'll still be setting my own much lower values when I set up projects that use BigQuery.

GitHub Codespaces, one year later

Paul Brabban — Sat, 07 Jun 2025 00:00:00 +0000

Back in early 2024 I tried GitHub Codespaces, and quickly ditched my local dev setup entirely. This post shares my experience going cloud-native for development: benefits for onboarding, agility, and security, alongside the real-world snags like network dependency and unexpected billing quirks.

A year ago, I wrote about why I was giving GitHub Codespaces a serious try - mainly dealing with my security concerns. Fast forward to today, and that "try" has become my exclusive development environment. I have not touched my local VSCode setup for personal or professional work since then. From open-source contributions and evaluating new GitHub repos to building a product for a startup client who happened to be using GitHub, Codespaces has been my daily driver.

I found a few stumbling blocks here and there, but I see those as minor inconveniences. In my opinion, the benefits far outweigh the problems. Let's start with the rainbows before moving onto the rainclouds.

Works on my machine, and yours

I joined a team and found another new team member who had been struggling to get their local development environment working correctly for a while. It was consuming a lot of their time, destroying their productivity, and the rest of the small team were unable to help.

Codespaces uses the devcontainer spec. I had already created a devcontainer.json for the repo we were working on to get myself up and running. When I paired with my team-mate and saw the kinds of difficulties they were having, I suggested trying a Codespace. A couple of clicks plus a minute or so later, they were up and running in their web browser, able to start orienting themselves with the work in progress and very quickly started making significant contributions. It is not just the IDE - the OS version, supporting software like gcloud or awscli tooling to authenticate, IDE plugins - all of this can be quite easily managed as part of the repo.

There are different ways of running Codespaces, including plugging my local VSCode into the Codespace running in the cloud. (I think some other IDEs like IntelliJ are supported by Codespaces, but I have not tried them myself.) I used this option temporarily to continue working when an obscure privacy setting I had enabled broke Codespaces and a number of other websites via a Chrome browser update back in October. The devcontainer configuration can also be used with a local container runtime, although I was unable to get this working owing to my use of rootless Docker (and Podman) being unsupported at the time.

I've been using the browser-based VSCode option, and it places no more demand on my physical computer than Google Sheets. I can get away with a relatively cheap, light machine, and I have been stripping away software that I used to install and no longer need. The fact that I can source-control OS, software installation and configuration declaratively in the devcontainer setup and then anyone with a functional computer that can run a web browser can become productive in minutes feels like a huge step forward. I've certainly spent plenty of time over the years dealing with local environment problems on my own and other people's computers!

Context-switching made easy

Switching between repos and branches was always a minor pain, occasionally a significant one. Do I have the repo cloned already? Is it up to date? Do I need to switch branches? Install dependencies? Urgh I have to stash everything before I can switch, and so on.

With Codespaces, I can stop what I'm doing and spin up a new instance on any branch at any point. I know I'm getting a consistent, isolated computing environment that's up to date. I can also keep a Codespace around for a while if I want, but I tend to spin one up for each new unit of work for team-based work. There's even a nice button to delete the associated Codespace when a PR is merged.

I automate dependency updates as part of my devcontainer initialization. I update OS packages when the Codespace starts, and I have scripts in the repos that automatically run to update project-specific packages every time I open a terminal. This ensures I and anyone else I'm working with are always using the latest versions, including security updates. I've talked about why in how I do Python data supply chain security. A 30-second pip update and security scan is usually not noticeable running in the terminal window as I orient myself to the code and start thinking about how to proceed.
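As a rough illustration of the idea (not the actual scripts in my repos), a terminal-startup hook could call something like this. The requirements.txt file name and the function names are assumptions.

```python
import subprocess
import sys


def build_update_command(requirements_file="requirements.txt"):
    """Build a pip command that upgrades everything listed in the file."""
    return [
        sys.executable, "-m", "pip", "install",
        "--upgrade", "-r", requirements_file,
    ]


def update_dependencies(requirements_file="requirements.txt"):
    """Run the upgrade, raising CalledProcessError if pip fails."""
    subprocess.run(build_update_command(requirements_file), check=True)
```

Calling pip through sys.executable keeps the upgrade tied to the interpreter that is actually running in the Codespace.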

Minimising blast radius

Before Codespaces, I felt that my local environment had a nasty concentrating effect on supply chain risk - all supply chains, for everything I'd worked on, brought together in one place.

Local development environments concentrate supply chain risk

Now, with Codespaces, I feel that my local environment is well-isolated from each Codespace I use, exposed only to the user interface and interactions. Each Codespace is exposed to the minimal risk necessary for that specific piece of software.

Codespaces isolates supply chain risk to individual environments

If I'm experimenting with a new library or evaluating an open-source repo that I don't trust, I can spin it up in a Codespace with a click. It has no access to anything other than what's in that empty Codespace. It's not the first time I've seen recommendations for software only to spin it up in a blank Codespace and find it's running three-year-old dependencies that are riddled with critical vulnerabilities. Erm - I'll pass thanks.

Infostealing malware is all the rage for the bad actors these days. A recent report says that infostealers stole 2.1 billion credentials in 2024. I'm scared of these things, I want to minimise the risk of being exposed to one, and the impact if it should happen.

If a package, IDE extension or some other software contains credential-stealing malware, it can only steal secrets that that specific Codespace has access to. There's still risk there as my Codespaces often contain powerful credentials like cloud authentication tokens, but I keep secrets Codespace-specific and I minimise the number of secrets and how long they are valid for. As far as I know, my local machine and web browser, logged into countless personal and professional services, is completely out of reach of malware that might slip into a Codespace.

The devcontainer setup for this website's repo is an example. devcontainer.json points to a Dockerfile, sets up githooks and grants read access to a private fork of my mkdocs theme. Things are locked down by default and I need to explicitly add permissions.

{
  "build": { "dockerfile": "Dockerfile" },
  "postCreateCommand": "git config --local core.hooksPath .githooks/",
  "customizations": {
    "codespaces": {
      "repositories": {
        "brabster/mkdocs-material-insiders": {
          "permissions": {
            "contents": "read"
          }
        }
      }
    }
  }
}

The associated Dockerfile just updates the OS and installs a few packages related to image optimisation.

FROM mcr.microsoft.com/devcontainers/python:3

# image processing dependencies for optimize plugin
RUN apt-get update && \
    apt-get install -y webp imagemagick bash-completion

I did not want to install those packages on my local machine, as I felt I couldn't justify the risk just to optimise my blog images, but in a Codespace, I don't see much risk. If one of those packages is compromised the bad actors might get credentials that let them read my private fork of my theme. Much less scary than the same compromise on my local machine.

Onto the rainclouds.

The network tether

Codespaces, being cloud-based, demands consistent network connectivity. For me in the UK, on good home broadband or a stable phone hotspot, the latency is rarely noticeable. It feels just like working locally. There's no lag at the terminal, navigating directory structures, opening and scrolling through files is smooth. I've put together a short video to illustrate what the experience of opening and starting work in a Codespace is like on a good network connection.

When connectivity is slow or spotty, like when I'm on the train over to Manchester or up to Leeds, the experience deteriorates quickly. It's a bit of a "cliff edge": the terminal becomes laggy, file operations introduce noticeable spinners and delays, and eventually, Codespaces will pop up a "reconnecting" modal, halting all work. Working on trains over a patchy mobile network, for example, is usually frustrating. To be fair, I had already found working on the move through spotty network areas was frustrating. I normally need documentation or access to cloud-based services at some point - and that point will always be immediately after we head into a tunnel.

I've accepted the limitation and use that time for catching up on the reading I lament never having time to do, or working on Google Docs things in offline mode instead.

The cost

GitHub offers a monthly free tier for personal accounts (see GitHub Codespaces pricing for the latest details). Currently, this is 120 core hours and 15GB-month storage for Free plans, 180 core hours and 20GB for Pro plans. The smallest instance size you can get is two-core, so that's 60 hours by the clock on the wall, or about 3 hours a day for a 20-day working month (4.5 for the pro plan).
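To redo that sum for a different instance size or working pattern, here's a small sketch. The allowances are the ones quoted above; the 20-day month is the same assumption I made in the text.

```python
def wall_clock_hours(core_hours, cores):
    """Hours of real time a core-hour allowance buys on a given instance size."""
    return core_hours / cores


def hours_per_working_day(core_hours, cores, working_days=20):
    """Spread the allowance evenly over the working days in a month."""
    return wall_clock_hours(core_hours, cores) / working_days


print(wall_clock_hours(120, 2))       # prints 60.0
print(hours_per_working_day(120, 2))  # prints 3.0
print(hours_per_working_day(180, 2))  # prints 4.5
```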

I have not exceeded these limits so far. You can get instances with more CPU cores and memory, and they are billed at higher rates, but I never needed to use one - the basic 2-core instance worked perfectly with the repos I needed. Codespaces automatically suspend after a configurable timeout (I think the default is 30 minutes, I reduced mine), but spending a lot of hours with a Codespace will eventually exceed the free tier. Running more than one Codespace at a time also runs up a bill faster, so I adapted to avoid doing that.

The 2-core instance bills at $0.18/hr (so the per-core price is currently $0.09/hr). Maybe a dollar a day for fairly heavy use, which aligns with the billing information I have available. GitHub changed their billing system in April 2025, and I've been unable to get any info from before that. What I do have is the bill for the startup I worked for, which came in at around $18 (orgs do not get the free tiers) for the heaviest month.
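A quick sketch of the sums, in case you want to plug in your own usage. The $0.18/hr rate is the one quoted above; treating six hours a day as "fairly heavy use" is my own assumption.

```python
TWO_CORE_RATE_USD_PER_HOUR = 0.18  # rate quoted in the post


def per_core_rate(instance_rate, cores):
    """Hourly price per core for a given instance size."""
    return instance_rate / cores


def daily_cost(instance_rate, hours):
    """Cost of running one instance for a number of hours."""
    return instance_rate * hours


print(round(per_core_rate(TWO_CORE_RATE_USD_PER_HOUR, 2), 2))
print(round(daily_cost(TWO_CORE_RATE_USD_PER_HOUR, 6), 2))  # assumed 6 hours/day
```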

GitHub codespaces billing

So far, I've never run up enough usage or storage to get billed outside the free tier.

Sharp edges in billing

I hit a particularly frustrating "sharp edge" with GitHub's billing system. Twice, despite being well within the free Codespaces tier, I was denied access to Codespaces because I had an insufficient balance to cover other minor GitHub charges (like sponsorships). Resolving this involved hours of contacting support to expedite the process. I now ensure my linked payment account always has enough to cover GitHub charges, and I've raised the issue with GitHub support. I'm sure they do not want to disincentivise people paying for the pro tier or sponsorships!

Final thoughts and advice

If you're considering GitHub Codespaces as your primary development environment, here's my advice:

Try it out on personal projects. Take advantage of the free tier. I write these blog posts in Codespaces!
Codespaces (and devcontainer.json in general) provide an environment that is far more similar to CI/CD systems like GitHub Actions than a traditional local setup. I'm experimenting with ways to reuse more between Codespaces and GitHub Actions.
It's a far better experience than full virtual desktops. In my experience, full virtual desktop environments have always been sluggish and awkward to work with. Codespaces slots into my day-to-day working without getting in the way.
Consider local devcontainers as a stepping stone. Using devcontainers for local development could bring some of the consistency and security benefits without, or before, fully shifting to the cloud. I tried local devcontainers first but ran into issues with my rootless container runtimes.
Explore alternatives if GitHub isn't your ecosystem. While my experience has been tied to GitHub, similar concepts exist. For example, Gitpod is a cloud development environment that I've seen integrated with GitLab.

My year with GitHub Codespaces has been overwhelmingly positive. The day-to-day experience has been consistently good, sufficiently similar to local working that I've not been irritated by it. The challenges have not been a big deal and it's definitely my preferred option today given the choice!

GROUP BY ALL solves a really annoying SQL problem

Paul Brabban — Thu, 29 May 2025 00:00:00 +0000

Does your SQL still copy most of your columns from SELECT after GROUP BY?

Behold: GROUP BY ALL.

The problem

A simple example of the problem looks like this. I have a table of page views, one row per view. I want to know how many views I had each day, so I write some SQL like this:

SELECT
    view_date,
    COUNT(1) num_views
FROM the_raw_views_table
GROUP BY
    view_date

See how I have to repeat view_date in the GROUP BY clause? It's required, and it's pretty much the only appropriate simple value. I must add any columns that I'm not aggregating (I used the aggregate function COUNT() here) to the GROUP BY clause for the query to be valid.
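To see the query in action without a warehouse to hand, here's a sketch that runs it against an in-memory SQLite database from Python. The three sample rows are made up, and I've added an ORDER BY so the output order is predictable.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE the_raw_views_table (view_date TEXT)")
conn.executemany(
    "INSERT INTO the_raw_views_table (view_date) VALUES (?)",
    [("2025-05-01",), ("2025-05-01",), ("2025-05-02",)],  # made-up sample rows
)

rows = conn.execute(
    """
    SELECT
        view_date,
        COUNT(1) num_views
    FROM the_raw_views_table
    GROUP BY
        view_date
    ORDER BY
        view_date
    """
).fetchall()

print(rows)  # prints [('2025-05-01', 2), ('2025-05-02', 1)]
```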

Grouping is something we do all the time. It's a minor irritation when there are only a couple of columns to add, but I've seen queries where there are tens, maybe even hundreds of columns that have to be carefully kept synchronised.

A chunkier example from GitHub:

GROUP BY the_date, countryname, twitter_trend, google_trend, latcent, longcent

The poor solution

I've seen a bad solution around, where you don't need to actually name the columns but can instead use the column's ordinal number. The previous query would look like:

GROUP BY
    1

This is a bad idea for readability, and now you have a list of sequential numbers to keep in sync instead. Here's an example I found on GitHub for how silly it can get:

GROUP BY 1, 2, 3, 4, 5, 6, 7, 8, 9, i.indnatts

The good solution

On 23 May 2024, Google made GROUP BY ALL generally available, and I totally missed it. Now, I can just say what I mean 🎉.

SELECT
    view_date,
    COUNT(1) num_views
FROM the_raw_views_table
GROUP BY ALL

It doesn't matter if you have one plain select column or 100. GROUP BY ALL infers the list. The full documentation explains the specifics and how the inference works.
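As a mental model, the inference amounts to "group by every select item that isn't an aggregate". The toy sketch below captures just that idea; it is not how any engine implements it, and the real rules in the documentation cover more cases. The (expression, is_aggregate) representation is my own invention.

```python
def infer_group_by(select_items):
    """Return the non-aggregated select expressions, in select-list order.

    select_items is a list of (expression, is_aggregate) pairs.
    """
    return [expr for expr, is_aggregate in select_items if not is_aggregate]


select_items = [("view_date", False), ("COUNT(1)", True)]
print(infer_group_by(select_items))  # prints ['view_date']
```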

Supporting platforms

I actually found GROUP BY ALL first on the Databricks platform.

I don't think Trino (and by extension AWS Athena) has GROUP BY ALL.

My path to consultancy

Paul Brabban — Fri, 11 Apr 2025 00:00:00 +0000

I had big doubts about becoming a consultant or contractor. Could I do it? Would I find work? Could I run my own business? Would I need to change who I am, wear a suit, or buy a briefcase? Seven years have flown by since I took the plunge, so I'm going to share my story in case it's helpful for you!

TL;DR

Becoming a contractor and consultant back in 2018 was one of the best moves I've made.

I didn't need to wear a suit, or buy a briefcase.
I had no problems finding work - contractors are just humans too, and my experience was up to the task.
The income was a big step up at a time when I needed it.
Setting up and running a business was easy enough, and felt pretty good.
I wasted time on a custom website before going with a static site generator and trustworthy theme.
Accounting and tax were more complicated than I realised, causing some hassle and unexpected, painful tax bills. If I were doing it again, I'd look for a well-recommended human accountant to show me the ropes and help me through the first year or two.
I've been able to pursue opportunities with teams and organisations that probably wouldn't have been open to me otherwise.
The work has been challenging in good ways, and I feel my professional development has been accelerated.
My fears about becoming mercenary and distant from the people I worked with were unfounded.
I've gained access to a great network of people who challenge and push me, and I've been able to stay in touch with the teams I've worked with over the years and see how my decisions played out.

I took a permanent job with Equal Experts when I was offered it at a time of great personal and professional uncertainty in 2020 - new baby, COVID, and IR35. I've never been fired or let go from a role throughout my career - so far - and I don't feel like I'm at significantly lower risk of being unemployed as an employee than I was as an independent associate.

I feel like I've got the best of both worlds now, getting to be more consultant than contractor whilst being part of a great organisation. Tempered Works Ltd. is still alive but dormant in case I want to go back to being my own boss one day.

I recognise that I don't belong to an underrepresented group in the tech industry and I can't speak to the challenges others may face. I hope sharing my experiences can still provide some value or insight for those considering a similar path. I'm also happy to answer any questions you might have that I can help with, so please feel free to give me a shout on LinkedIn. I will not share anything from any conversation without permission.

Prologue

They waited until the end of the first year at uni to tell us that there weren't many jobs in physics, so that was my dream of writing maths on whiteboards for a living up the spout. My mum had gone to university after the divorce, getting a degree in computing and working as a contractor before getting a job in IT with a bank. I was too young to understand much about it at the time.

A few years later, she told twenty-one-year-old me to apply for the bank's graduate scheme. I got in, and that set up my next thirteen years with HSBC, starting out in IT Security (first job: stuffing Lotus Notes passwords into envelopes) and ending up programming in Global Banking and Markets. I was aware of some contractors here and there, but I don't recall ever meeting one, let alone talking to them about contracting or anything much else.

The other contact with contractors or consultants I can remember from these early days was my avid reading of Martin Fowler, Sam Newman, and the other luminaries at Thoughtworks.

First contact - Thoughtworks

I went from a huge multinational to a little company in Manchester that was trying to grow back in 2014. We were having trouble trying to sort out an unwieldy, monolithic codebase running on bare metal servers that resisted any extension and struggled under any sort of load. Our efforts to get the team aligned and flow moving just weren't working. What was I doing wrong?

I suggested we engage Thoughtworks and a couple of their people came to talk to us about our needs. I needed someone experienced to talk to, someone who could help me figure out what I was doing wrong and how to put it right. I asked for advice and guidance, but it turned out that wasn't an option. We'd need to pay for a team of at least two full-time contractors as a minimum engagement. I can't remember what the $$$ number was, but it was comically out of any budget I could justify, so that didn't go anywhere. Disappointing, but my first real contact with the contracting world.

I worked with some great people there. It was a real struggle, but I learned a lot, and we did manage to rewrite the beast and move everything to AWS. I've really enjoyed seeing them succeed since I moved on. One of the other backend devs contacted me recently to let me know that the thing we hastily built together had been the foundations for seven years of growth without major problems.

First contact - Equal Experts

Next stop - Sky Betting and Gaming, building a new service for the Italian market from Sheffield's Electric Works. This was the first time I really worked with contractors. No suits, no briefcases. A friendly bunch, always happy to share their experience, knowledge, and opinions. I went along to my first Equal Experts "Expert Talks" event there.

I think this is where I first really considered that I could "do" contracting. I was still a bit hung up on whether it would be right for me though. Running my own company seemed like an interesting but daunting prospect, and I recall dismissing the idea in a conversation because I wanted to be invested in what I was doing. If I'm honest, I also didn't feel like I was good enough.

I felt that contracting might be a bit mercenary for me, and I wanted to be invested in what the company I worked for was doing. Then again, I wasn't staying with companies for very long before moving on. I'd consciously started thinking in terms of "going in, making a difference, making myself redundant, and moving on" to broaden my experience more quickly.

Need some money, quick

My next role was with a six-person startup. I really like the broad scope, urgency of value and need for adaptation that comes with startup roles! Alongside rapidly building the system we needed, I was getting involved with product development and marketing. This was my first contact with things like AB testing, and I had to deal with GDPR that landed towards the end of my tenure there.

Then, as far as I know, the startup ran short of money. The other people there were young compared to me, and on much lower pay. I agreed to go without salary for a month to keep the company going. Then a second month. We'd been having conversations about how the company probably needed more product development guidance than I could provide, so when I didn't get paid for a third month, I resigned - I honestly don't think there was anything more I could do. That's startups for you.

Bills to pay.

Time to embrace my inner mercenary.

Tempered Works Ltd.

LinkedIn says I started Tempered Works Ltd. in May 2018. I knew a couple of contractor people (as well as my mum, although she'd long since gone permanent!) well enough to buy them a coffee and pick their brains about what to do next and got advice on things like how to find work, how to set up my company, and how the banking and accounting side of things works. It's a bit of a blur now, but here's what I remember.

I remember the coffee shop I was sitting in before an interview with Infinity Works where I was deciding on a company name. I was going to use the domain I got for my old blog, crossedstreams.com, but Urban Dictionary warned me about other meanings related to "crossing the streams" that have nothing to do with the Ghostbusters movies. Scratch that idea then. Sheffield has loads of places today built in old industrial buildings called things like "Cutlery Works" and "Electric Works". We've got a proud history of steelmaking, and tempering is a process to make steel tougher and less likely to break under stress. I liked that idea as an analogy for what I like to think I do, so Tempered Works Ltd. was born.

I'll be honest. Day-rate money was frankly ridiculous compared to the salaries I'd been working for before. I'd been worried about not making enough pension contributions in the latter part of my career to date, and I was able to pay the bills and start building a better retirement buffer very quickly. On the advice of one of the contractors I spoke to, I went with crunch.co.uk for accounting. I didn't have the confidence to do "my books" on my own, but I didn't want to pay for an actual human accountant. Despite my best efforts, I did get some things wrong here and there and ended up with occasional hassle and painful, unexpected tax bills.

I remember heading to the library to take photocopies of documents (maybe I even had to fax them like it was the late 1900s) to Metro Bank to get my bank account set up. Recording expenses and filing paperwork was a worry, but that part turned out to be pretty easy. It all takes a bit of time, but it's not a big deal.

If I were doing it again, I'd pay for a human accountant for the first year or two until I'd got a better handle on how accounting and tax works.

My first gig

It didn't take long to get my first engagement. I can't remember how I advertised my services; perhaps it was just on LinkedIn. I had a fair bit of AWS experience and was a certified solutions architect, but the gig needed Google Cloud. I remember standing at a bus stop in the sunshine with the recruiter telling me not to worry about it, my skills would transfer and I'd be fine. I took the gig. There was a tech test and interview to get through, but they were no different to what I'd expect for a permanent role.

He was right, and I had a really enjoyable first engagement with Dunnhumby, a super-cool data science company. I'm normally not sure if I can directly link myself to a client but in this case, I'd had writing published on their public blog, so the link is already out there.

I decided I wanted to diversify my experience, so I politely refused an extension to go and do something else. I recall the person paying for me saying that I wasn't like other contractors he'd worked with, more like part of the team. It was great to get that feedback, reinforcing that I could succeed as a contractor without being aloof and distant!

Evolving this site

I wasted a lot of time building my own custom React-based site in the early days. I wanted to have my own look and feel and thought that exercising my React skills would be helpful in finding work. After a while, I realised that I was spending too much time on the site and not enough on why the site was there - to support my business and to give me an accessible medium for sharing my experience and thoughts.

I eventually found a neat, blog-capable static site generator in mkdocs-material that provided me the kind of publishing workflow, flexibility and clean look and feel I wanted. It's also got a lot of sponsorship and funding and a maintenance team that seems responsible and responsive, so I think it's a trustworthy option.

Equal Experts - Associate

I'd met lots of inspiring Equal Experts people at events like the Expert Talks I mentioned earlier. Everyone seemed to speak highly of EE, and I applied. If I recall correctly, there was a fairly easy take-home test and then an interview, with the first half a pairing exercise and the second a whiteboard session.

I did not enjoy the pairing exercise much. I really don't work well in that time-pressure situation, and it's not the kind of pressure you actually get in real life. I like to let a problem settle in my mind a little before I start racing to write code and tests. Feedback afterwards was that I didn't do very well in that part. Meh 🤷 Fortunately for me, my performance in the whiteboard session impressed more. I'd enthusiastically explained how we'd built out capabilities at the last startup and taken iterative, fail-fast approaches to get value out fast.

!!! note

EE's changed its recruitment process a great deal since then. The pairing-under-pressure is a thing of the past, phew.

My first engagement with EE wasn't a great fit, being quite slow-moving (I'm not known for my patience), but I just flagged that it wasn't really working for me and got lots of support finding and moving onto something that suited me better. My new engagement was fantastic, working in a diverse team with a large UK retailer on search optimisation. I was directly involved with delivering and measuring significant revenue uplifts and learned a huge amount from that engagement. It was the first time I'd seen all the things I'd read about as practices and philosophies come together to make my team and the teams around us really fly!

Equal Experts - Employee

In 2020 I was offered a permanent position as a consultant with EE and I took it, despite the drop in income. Why?

I had a great experience working with Equal Experts.
My son had just been born.
COVID had just hit.
IR35 legislation was about to be implemented in the private sector.

Whilst I had done nothing to avoid tax, I remember feeling unsure that I knew enough to be on the right side of the IR35 legislation. I felt that the legalese and opinions surrounding it were confusing and that if I got something wrong for any period of time, the financial consequences could be severe. Taking the job felt like a safe option, and I hoped to learn more about the inner workings of a successful consultancy.

Tempered Works Ltd. still exists but is dormant at the moment. I'm able to keep the company alive for less than a couple of hundred GBP per year. I was also able to get a tweak to my contract so I could wrap up a light engagement with The Developer Academy, providing some course materials and teaching for their new data science bootcamp. I write on the EE blog, but I'm still able to write independently here, and I just credit Equal Experts where my writing crosses over significantly or involves some time in working hours, like this:

As an associate and an employee, I've had access to lots of other people and their experiences through our Slack. That's been really useful in collecting up other people's ideas and putting my own out there to be challenged.

I've also had the huge benefit of staying in communication with the teams I've worked with over the years, and in particular the people who saw how my decisions played out after I moved on. By and large, the things I leave behind for the next person seem to have worked out positively in my absence! I'm very grateful for those long-term perspectives where I've been able to get them.

Next steps

If you've got a few years of experience and you're considering contracting or consulting, I'm happy to answer any questions you might have that I can help with, so please feel free to give me a shout on LinkedIn. I will not share anything from any conversation without permission.

I'd also really recommend taking a look at joining the Equal Experts network, for all the reasons I talked about earlier. Up-to-date info about joining is available on the site. I'll signpost our values, which are really important to all of us and haven't really changed since I first read them back in the day.

Generating portable and user-friendly identifiers

Paul Brabban — Sat, 08 Mar 2025 00:00:00 +0000

I'll share how I generate unique identifiers from data in 2025, avoiding the pitfalls I've seen along the way. TL;DR: I'm using MD5 to produce a digest from a string or bytes value, then I'm using plain old hexadecimal encoding of that digest, specifying upper or lowercase for the alpha characters. This solution best meets the needs I describe next.

The problem I'm solving

I have some chunk of data, often a string of text, but it can be more complex like an array or a JSON dictionary. I want to produce a value that I can use to refer to this data in user-friendly ways.

User-facing needs

It produces clean-looking URLs and copy-pastes easily. The value might end up in front of users in a URL or an API endpoint. That means it might end up being copy-pasted into messaging systems, tickets, spreadsheets and the like, so I want it to be easy and convenient to copy-paste the full value correctly. Characters that require URL-encoding are bad. Characters that require click/tap-and-drag to select across are also bad - double-click or tap-and-hold should select the whole value.
It doesn't carry any inappropriate meaning. I don't want any confusion about the value being an identifier. It should not be likely that the value could be confused with, for example, a natural language word.
It isn't possible to reverse the process to reveal the original value. I don't need cryptographic protection from this kind of reverse engineering, but I would prefer to hide how the value is computed in case there are details in the original value that should not be available to inspect. Being able to compute and compare the identifier, given a candidate value, isn't a problem.

Implementation needs

In addition to these more user-facing needs, I have a couple of constraints to ensure the solution doesn't create problems down the road.

It must be deterministic and stable over time. These values might end up being stored directly in other systems that interface with the system I'm working on, or internally between different parts of the system. An example would be a feature where a user can "favourite" something this value refers to. We don't want those values to change!
It can generate identifiers for other data structures than strings. Sometimes I need to make an ID for an array of values, a JSON dictionary, an image, and so on.
It doesn't create lock-in. I would like to avoid locking into whatever ecosystem I'm using today so that ID generation does not become a factor in future decisions. I should be able to generate the same value for the same input over as broad a range of technology as possible.
It avoids depending on new software supply chains. Software supply chains bring multiple types of risk, which I talk about a little in an earlier post touching on supply chain security. I'd rather not trust any source of software that I can avoid.
Collisions are very unlikely. It needs to be unlikely that two different pieces of data produce the same identifier. I can design the system to make this scenario an irritation rather than a catastrophe, but the less likely it is, the better.
It's efficient. I'd like an efficient solution that meets the other constraints. I don't want significant latency, nor do I want to waste compute power without good reason.

Quite the list for such an apparently simple problem!

Solution

The general solution I use is to first produce a hash from the value, then encode the hash in a way that meets my usability needs. I'll quickly summarise the steps.

Hashing the value

The first step produces a hash from the value. Hashing is a bit of computing magic that takes a chunk of data of any size and produces a fixed-size output. Information is lost in the process, so you can't "undo" a hash to retrieve the original value. The idea has been around forever, and I've yet to come across a computing platform that doesn't support the concept.

Which hash function to choose? There are many hash functions, and most would break my portability and supply chain needs.

I've fallen into this trap before. I needed to generate hashes in AWS Athena (based on the Trino open-source engine), so I followed the advice in the Trino hashing functions documentation:

For general purpose hashing, use xxhash64(), as it is much faster and produces a better quality hash.

Unfortunately, when we migrated to BigQuery, xxhash64 was not natively available! I could have used a Javascript UDF, but that would have introduced complexity and new supply chains, so... no. I switched to a hashing algorithm that BigQuery did support, then generated a mapping table for existing xxhash64-based IDs in Athena until all the original values had been found and replaced.

Learning from that experience I use MD5 hashing now. MD5 has been around since the 1990s and is supported everywhere I've looked. MD5 might be a poor choice for use in crypto, but I see no reason why it's not suitable for the purpose of producing a fixed-length identifier. MD5: The broken algorithm suggests that the chance of an accidental collision between two values is on the order of 10^-29.

But, as you can imagine, the probability of collision of hashes even for MD5 is terribly low. That probability is lower than the number of water drops contained in all the oceans of the earth together.

That's small enough that I'm not going to worry about it. I've certainly never seen an accidental hash collision in real life in the last 25 years of my career. The same article also says that MD5 is the faster algorithm compared to SHA1 and SHA256, although not by enough that I'd consider it a factor. The last benefit of MD5 over those other algorithms for my purposes is that it produces a shorter value - 16 bytes compared to 20 or 32. The shorter the value is, the shorter the identifier will be.
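As a rough cross-check on how unlikely accidental collisions are, here's a minimal sketch of the standard birthday approximation. It assumes MD5 output behaves like a uniformly random 128-bit value, which is an idealisation and says nothing about deliberately crafted collisions. The function name `collision_probability` is just for illustration.

```python
import math


def collision_probability(n_items: int, bits: int = 128) -> float:
    """Approximate chance that at least two of n_items random hashes collide."""
    # Number of distinct pairs that could collide.
    pairs = n_items * (n_items - 1) // 2
    # Birthday approximation: p ~ 1 - exp(-pairs / 2**bits).
    # expm1 keeps precision when the probability is tiny.
    return -math.expm1(-pairs / 2**bits)


for n in (10**6, 10**9, 10**12):
    print(f"{n:>18,} identifiers -> p ~ {collision_probability(n):.2e}")
```

Even at a billion identifiers, the estimate under this assumption stays far below anything I'd plan around.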

That brings us neatly to part two - encoding.

Encoding the hashed value

Hash functions produce a "digest" - a little blob, with a specific length, of binary data. For example running: SELECT MD5('tempered.works') on BigQuery produces (in binary): 00101100110101000000000110111101110001010101101010011011100111110111011100110111111100100001110010000111010111010010111000100110

Not a great identifier, as it is so long! The second part of the process is to encode this value in a larger alphabet than ones and zeroes to get something a bit easier to work with. This is where I hit the next two facepalm moments. Firstly, not all encodings meet my usability needs. The obvious choice (and indeed the default representation if you run that query in BigQuery) is Base64 encoding, which produces LNQBvcVam593N/Ich10uJg==.

That value is not user-friendly - try selecting it! The / character breaks the selection in the middle, as does the == padding at the end. It also causes usability issues if those values end up as keys in S3-like storage systems - yes, I hit that one too. On top of that it needs encoding again to be used as a URL:

SELECT bqutil.fn_eu.cw_url_encode(TO_BASE64(MD5('tempered.works')))

LNQBvcVam593N%2FIch10uJg%3D%3D

So Base64 is a bad choice, despite being commonly available in databases and programming languages. Base32 works better, being an alphabet without those funny symbols. It still has the problem of those equals symbols padding the end, but there's a worse problem - it's not very portable. Again, Base32 isn't available on BigQuery, for example.
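The encoding problems above can be reproduced outside the database. This is a small Python sketch using only standard library modules. It starts from the digest bytes for 'tempered.works' as quoted in this post, rather than recomputing the hash, so the only thing being compared is the encoding.

```python
import base64
from urllib.parse import quote

# MD5 digest of 'tempered.works', copied from the hex value quoted in this post.
digest = bytes.fromhex("2CD401BDC55A9B9F7737F21C875D2E26")

# Base64: compact, but contains '/' and '=' characters.
b64 = base64.b64encode(digest).decode("ascii")

# Base64 needs a second round of encoding before it is safe in a URL.
b64_url = quote(b64, safe="")

# Base32: no awkward symbols in the body, but still padded with '='.
b32 = base64.b32encode(digest).decode("ascii")

print(b64)
print(b64_url)
print(b32)
```

The first two lines printed match the Base64 and URL-encoded values shown above, and the Base32 value ends in a run of padding characters.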

So the encoding I've chosen is boring old hexadecimal. It's available everywhere, only includes the characters 0-9 and a-f, and produces values of a manageable size. One last gotcha - hex conversion functions may return upper or lowercase a-f, so be aware of the need to pick one and upper/lower accordingly. The non-SQL examples in the following table require imports, but the modules are part of the runtime, not third-party dependencies.

Ecosystem	Expression	Output
BigQuery	`SELECT UPPER(TO_HEX(MD5('tempered.works')))`	`2CD401BDC55A9B9F7737F21C875D2E26`
Athena/Trino	`SELECT TO_HEX(MD5(TO_UTF8('tempered.works')))`	`2CD401BDC55A9B9F7737F21C875D2E26`
Snowflake	`SELECT UPPER(MD5('tempered.works'))`	`2CD401BDC55A9B9F7737F21C875D2E26`
Python	`hashlib.md5(b'tempered.works').hexdigest().upper()`	`2CD401BDC55A9B9F7737F21C875D2E26`
NodeJS	`crypto.createHash('md5').update('tempered.works').digest('hex').toUpperCase()`	`2CD401BDC55A9B9F7737F21C875D2E26`

Pretty portable!
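For data that isn't already a string, the value has to be turned into bytes in a repeatable way before hashing. Here's a minimal sketch of one way to do that in Python. The `make_id` helper is hypothetical, and the choice of canonical JSON (sorted keys, no extra whitespace) is my assumption for illustration, not something the approach above prescribes.

```python
import hashlib
import json


def make_id(value) -> str:
    """Return an uppercase hex MD5 identifier for str, bytes or JSON-like data."""
    if isinstance(value, bytes):
        data = value
    elif isinstance(value, str):
        data = value.encode("utf-8")
    else:
        # Assumption: serialise to canonical JSON so that equal structures
        # always produce the same bytes, whatever order the keys arrived in.
        data = json.dumps(
            value, sort_keys=True, separators=(",", ":"), ensure_ascii=False
        ).encode("utf-8")
    return hashlib.md5(data).hexdigest().upper()


print(make_id("tempered.works"))
print(make_id({"b": [1, 2], "a": 1}))
print(make_id({"a": 1, "b": [1, 2]}))  # same identifier as the line above
```

For a plain string this is the same expression as the Python row in the table. For structured data, the portability of the identifier then depends on being able to reproduce exactly the same serialisation in every ecosystem, so it's worth keeping that step as simple as possible.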

Summary

So how does this solution compare to my list of needs?

It produces clean-looking URLs and copy-pastes easily ✓
It doesn't carry any inappropriate meaning ✓
It isn't possible to reverse the process to reveal the original value ✓
It must be deterministic and stable over time ✓
It can generate identifiers for other data structures than strings ✓
It doesn't create lock-in ✓
It avoids depending on new software supply chains ✓
Collisions are very unlikely ✓
It's efficient ✓

Looking good. I'll update this post if I come across any drawbacks in the future.

Using AWS billing to track down lost resources

Paul Brabban — Sun, 09 Feb 2025 00:00:00 +0000

My AWS bill was higher than I expected, and it wasn't immediately clear what was driving the cost. Here's how I tracked down the culprits.

The mystery bill

I'd been running an RDS + DMS CloudFormation stack for a few months as part of my CDC writing. Expect a post at some point going over the obvious and more subtle costs involved. When I tore the stack down in January 2025, I was still getting a bill of a couple of dollars a day and I wasn't sure why.

Cost and usage report showing current month costs at $14.45 and forecasted end costs at $64.68. Includes bar chart of various service costs across five months.

When I tore the stack down, one resource was left behind - the S3 bucket that DMS wrote the change data capture stream into. It contains very little data though, so it shouldn't be running up a noticeable bill. I could do without a $64-per-month bill for nothing. How can I find the resources responsible?

Billing - cost explorer

My next step is to click the offending dollar value on the homepage, which takes me to the cost explorer. Here, I see a fairly useless chart of the current month's costs, with a very useful list of costs by service.

I've learned a lot over the years by being curious about costs - and I've used that knowledge to save clients money. If you can see these actual/predicted cost numbers when you head to the homepage, I'd recommend clicking in and taking a look around. It'll help you understand what's expensive and what's not, and there's a chance you'll spot something consuming money for nothing - a prime target for quick and easy cost-efficiency improvements. If you can't see these numbers and would like to, try clicking the AWS console home link (top left in the console) or asking your administrator.

Cost breakdown table showing current month expenses for various services, including VPC, tax, Secrets Manager, S3, and Relational Database Service.

This view is a useful first step. The costs are in the VPC service, ruling out the S3 bucket as the cause. Unfortunately, VPC covers a lot of different kinds of resources - how do I track down the ones that are costing me money?

Tag editor

AWS is a big, complex beast and I've had problems sometimes tracking down deployed resources that cost money - particularly tricky when the resources are deployed in a region you're not expecting! The Tag Editor allows the enumeration of resources across regions, but as far as I know, it's not straightforward to use to track down these resources. I have very little deployed into this account, but when I search across all regions for all resources, I see the following:

The top six resource search results are listed in Tag Editor, showing items with identifiers, types, and regions. Types include Stack, Trail, and DHCPOptions. Regions include eu-west-1, 2 and 3. There are no tags assigned to the resources.

The vast majority of the resources I see are, I believe, network-related resources that AWS deploys by default into all regions for my account. Perhaps I should go through and delete all that stuff, assuming it won't cause any problems in my account. There are no tags against any resources other than the S3 bucket I already know about and an account I custom-tagged with a client identifier. None of it helps me understand what's costing me this money.

I have had the thought that maybe I should be tagging stuff to help in this kind of situation. Whilst it might help, I don't think it would guarantee anything - I suspect there would still be resources that get created automatically as part of my efforts that don't get tagged. Whilst custom tagging might be useful, I still need a way of finding stuff that falls outside or pre-dates my tagging strategy anyway.

Back to cost explorer

I've found the most effective tool to track down these resources is back in the billing console.

Cost by usage type

I switch the breakdown to usage type in the report parameters section, like this:

Interface showing report parameters: Date range set to the last week; monthly granularity; grouped by usage type.

Again, the chart isn't very useful, but the breakdown panel underneath is.

Table showing cost and usage breakdown with totals for various AWS usage types. The usage types reveal specific information about exactly what resources in which regions are driving the cost.

Aha. This table is packed with useful information.

USE1-VpcEndpoint-Hours is responsible for almost all the cost - VPC endpoints in us-east-1. That makes sense - I had to have private endpoints to get from the RDS database locked away in a private VPC to the S3 and Secrets Manager services. I expected them to get cleaned up as part of my CloudFormation destroy operation, but it seems they did not.

I'm also paying for USE1-AWSSecretsManager-Secrets - makes sense, I had a secret in there, the RDS database password, which hasn't been cleaned up in the destroy. Aurora:BackupUsage also makes sense - I have some backups of the RDS database. Aside from being in eu-west-2, the culprit behind EUW2-Requests-Tier1 isn't clear. I can switch the breakdown from "Usage type" to "API operation" to get another useful perspective.

Cost by API operation

Table displaying API cost breakdown, with total costs and individual operation expenses listed.

PutObject and CreateDBInstance:0021 look like possible culprits. There's no region information available here, so I find API operation and usage type work well together to get actionable insights into costs.

API operation and usage type together

Where I'm confused about what a billing item is, I can combine usage type and API operation, with one as dimension and the other as a filter, to clear up the confusion. To understand CreateDBInstance:0021, I set usage type as the dimension and API operation as a filter in the report parameters. There's exactly one item in the output table now.

Usage type	Value
Aurora:BackupUsage	$0.03

That clears that one up. How about EUW2-Requests-Tier1? Resetting the filters to usage type EUW2-Requests-Tier1 and setting API operation as the dimension, I'm told:

API operation	Value
PutObject costs	$0.03
PutObject usage	5,340.00 Requests

Not sure what's causing those PutObject requests in this account, but at least I know what I'm looking for now!
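The same dimension-plus-filter combination can also be scripted against the Cost Explorer API. This is a minimal sketch rather than something I ran for this post: it assumes the boto3 SDK is installed and that credentials with permission to call `get_cost_and_usage` are available. The helper names `build_request`, `rows` and `fetch` are hypothetical. As far as I know the API is billed per request, so check the pricing before looping over it.

```python
from datetime import date, timedelta


def build_request(group_by, filter_key=None, filter_values=None, days=7):
    """Build keyword arguments for Cost Explorer's get_cost_and_usage call."""
    end = date.today()
    start = end - timedelta(days=days)
    request = {
        "TimePeriod": {"Start": start.isoformat(), "End": end.isoformat()},
        "Granularity": "MONTHLY",
        "Metrics": ["UnblendedCost", "UsageQuantity"],
        # The dimension to break costs down by, e.g. USAGE_TYPE or OPERATION.
        "GroupBy": [{"Type": "DIMENSION", "Key": group_by}],
    }
    if filter_key and filter_values:
        # Restrict the report to specific values of another dimension.
        request["Filter"] = {
            "Dimensions": {"Key": filter_key, "Values": list(filter_values)}
        }
    return request


def rows(response):
    """Flatten a response into (key, cost) pairs, most expensive first."""
    out = []
    for period in response["ResultsByTime"]:
        for group in period["Groups"]:
            cost = float(group["Metrics"]["UnblendedCost"]["Amount"])
            out.append((group["Keys"][0], cost))
    return sorted(out, key=lambda row: row[1], reverse=True)


def fetch(request):
    """Call the real API. Needs boto3 and AWS credentials."""
    import boto3  # third-party AWS SDK, imported lazily

    return boto3.client("ce").get_cost_and_usage(**request)


# Which API operations sit behind the EUW2-Requests-Tier1 usage type?
request = build_request("OPERATION", "USAGE_TYPE", ["EUW2-Requests-Tier1"])
# for key, cost in rows(fetch(request)):
#     print(key, cost)
```

Swapping the arguments, grouping by `USAGE_TYPE` and filtering on an `OPERATION` value, gives the other view described above.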

Cost by resource

There is another option in the "Dimension" drop-down - "Resource". When I click this option, I get an access denied error:

Resource dimension error message.

I need to turn on resource-level billing information to use this feature. It may contain even more explicit information, but I've found I can get enough out of the on-by-default views I've already covered to solve my problems. I've turned it on anyway, but it takes a while to become effective. I'll follow up if the information or cost associated with this feature is worth shouting about.

Summary

AWS Billing's cost explorer reporting is packed with actionable information to track down costs, including service, region and API operation usage. Using these different perspectives together can help quickly find the resources responsible, regardless of which region or service they're hiding in.

Google Chrome Oct 15 update broke GitHub Codespaces

Paul Brabban — Sun, 27 Oct 2024 00:00:00 +0000

The Oct 15, 2024 update of Google Chrome stable (130.0.6723.58) suddenly broke some sites, such as GitHub Codespaces and 1Password, due to a JavaScript-related setting.

Discovery

Connection error: unauthorised client refused

This was the error message I was greeted with on Friday 18th Oct when I tried to get back to work in my GitHub Codespace after a four-day break. Everything was just fine when I pushed my last PR from the ferry in Europoort Rotterdam on Monday morning! Could it be something to do with the six-monthly rebuild I'd just done? A problem with GitHub? Was there a Chrome update between Monday morning when I shut down and Friday morning when I ran my usual start of day sudo update?

The short version: there's a Chrome setting DefaultJavaScriptJitSetting that disables the JS JIT compiler when set to value 2. DefaultJavaScriptJitSetting is documented in the Chrome enterprise policies.

You can check the effective settings on your browser by navigating to chrome://policy.

Chrome policies showing DefaultJavaScriptJitSetting

Setting this value to 1 allows JIT on all sites, or you can add exceptions on a site-by-site basis. I've updated my installation to allow JIT on the specific sites I need for now, and here's the relevant snippet from my Chrome policies file.

{
    "DefaultJavaScriptJitSetting": 2,
    "JavaScriptJitAllowedForSites": [
        "[*.]github.dev",
        "[*.]1password.eu",
        "[*.]1password.com"
    ]
}

I've raised an issue against Chromium which has been reproduced successfully. Hopefully, the problem will be resolved soon and the exceptions will no longer be needed.

Remote codespace in local VSCode

I didn't make much progress against it on Friday other than discovering that I could plug my local VSCode installation into the remote Codespace successfully. Microsoft provide how-to documentation and you need the GitHub Codespaces extension installed locally to do that - but it's provided by Microsoft, who also provide the VSCode application and GitHub Codespaces service, so there's no additional supply chain exposure there. It's nice to know this option is available and works well, and it pointed very clearly at a problem with Google Chrome (my browser of choice, again for reasons of supply chain trust).

Investigation

I eventually narrowed the issue down to a privacy-related Chrome setting I use.

I needed to sort the problem out - there's a ton of client work to do and it's going to hurt to have my productivity impacted. I'm also worried about any implications for the choices I've made to use this browser and services like Codespaces, so I dug into the problem on Friday night.

I spun up my Xubuntu install USB stick. When you boot into it you have a working, ephemeral Xubuntu installation with root privileges, unlike my locked-down persistent installation. Between this and other testing, I established that:

The issue was reproducible on multiple machines.
The issue only manifested when I applied my policy settings. The vanilla browser was fine, which would explain why others weren't seeing the problem. Just me with my rather risk-averse settings.
A process of elimination narrowed the problem down to a specific setting.
Firefox manifested the same issue on Codespaces. Turned out to be a different privacy-related setting, EnableTrackingProtection. That was pretty confusing.
The issue is present in beta and dev channels for Chrome, not just current stable.

Other affected sites

I also found issues with regex101.com, 1password and even Google Sheets.

"Unfortunately it seems your browser does not meet the criteria to properly render and utilize this website." regex101.com not working on the latest Chrome

"Update your browser to keep using 1password" 1password not working on latest Chrome

"Troubleshoot this issue by clearing application resources" Google Sheets not working on latest Chrome

WebAssembly seems to be mentioned in error messages more often than coincidence would suggest, and it might make sense for an issue related to compilation and WebAssembly to manifest like this. We'll see, but at least there's a workaround, even if it does involve "dropping the shields" on some sites.

I already have to place significant trust in GitHub Codespaces and 1password by the nature of the service they provide. Sorry, but I can live without regex101.com - I'm using regexr.com instead for now.

If you want to get in touch with me about the content in this post, you can find me on LinkedIn or raise an issue/start a discussion in the GitHub repo. I'll be happy to credit you for any corrections or additions!

If you liked this, you can find content from other great consultants on the Equal Experts network blogs page 🎉

Map over an array in BigQuery

Paul Brabban — Wed, 31 Jul 2024 00:00:00 +0000

This walkthrough shows how I can use the functional programming techniques map and filter that I already know and love in SQL engines like BigQuery. These techniques give me a lot of processing power whilst keeping my SQL simple and relatively easy to understand. Unlike custom code, I can use the same SQL and infrastructure I'd use to process ten rows to process ten billion rows in seconds.

BigQuery supports array-typed columns but doesn't provide an obvious map or filter function. I missed these functions until I realised that you can use these functional programming concepts; it's just not obvious how.

Why `map` over an array in BigQuery?

An associate recently mentioned that they'd found a problem with a SQL pipeline I'd written some time ago. I'd used some fiddly open source JavaScript code to implement a Porter stemmer as a JavaScript UDF. The text I was stemming was often multiple words, and I'd assumed the stemmer would stem all the words.

Nope. It turns out that the stemmer code I'd used would stem whatever text it was given as one word. For example, if the text to stem was "connecting connected connections", then it got stemmed to "connecting connected connect" instead of "connect connect connect".

Oops. Sorry, person who had to clean up my mess. The solution? Break the text up on whitespace, then map the stemming function over the words.
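To make the bug concrete, here's a small Python sketch. The `toy_stem` function is a made-up suffix stripper that I'm using purely for illustration. It is not the Porter algorithm and not the JavaScript code from the pipeline, but it fails in the same way when handed several words at once.

```python
# Toy suffix rules for illustration only, not the Porter algorithm.
SUFFIXES = ("ions", "ing", "ed", "s")


def toy_stem(word: str) -> str:
    """Strip the first matching suffix from the end of the given text."""
    for suffix in SUFFIXES:
        if word.endswith(suffix):
            return word[: -len(suffix)]
    return word


text = "connecting connected connections"

# The bug: the whole text is treated as one word, so only the end changes.
whole_text = toy_stem(text)

# The fix: split on whitespace, then map the stemmer over the words.
per_word = " ".join(toy_stem(word) for word in text.split(" "))

print(whole_text)  # connecting connected connect
print(per_word)    # connect connect connect
```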

It doesn't matter what the function we're going to map over the array actually does, so rather than complicate things with a custom stemmer, I'll use a built-in function for this example. I'll use the LEFT function to simulate a stemmer.

How to `map` in BigQuery

First - a couple of example strings

WITH examples AS (
    -- stemmed version is 'connect connect connect'
    SELECT 'connecting connected connections' AS the_text
    -- stemmed version is 'eardrum eardrum'
    UNION ALL SELECT 'eardrummed eardrums'
)

SELECT
    the_text
FROM examples

the_text
connecting connected connections
eardrummed eardrums

Split the text to get an array

SELECT
    SPLIT(the_text, ' ') words
FROM examples

words
[connecting,connected,connections]
[eardrummed,eardrums]

UNNEST and SELECT

I think about the next step as making a little table out of the array.

SELECT word FROM UNNEST(['connecting', 'connected', 'connections']) AS word

UNNEST turns the array into a table, and I have aliased the resulting rows as word. Then, I can select word, which maps the identity function over the array. If I put that back into my query...

SELECT
    (SELECT word FROM UNNEST(SPLIT(the_text, ' ')) AS word) AS words
FROM examples

I get an error Scalar subquery produced more than one element. That's because I haven't collected the little table back up into an array yet. That's easy to do:

SELECT
    ARRAY(SELECT word FROM UNNEST(SPLIT(the_text, ' ')) AS word) AS words
FROM examples

Now we have the arrays we started with back:

words
[connecting,connected,connections]
[eardrummed,eardrums]

Mapping the dummy stemmer

The next step is straightforward, using LEFT(text, 7) as a dummy stemmer:

SELECT
    ARRAY(SELECT LEFT(word, 7) FROM UNNEST(SPLIT(the_text, ' ')) AS word) AS words
FROM examples

Which produces the output we want:

words
[connect,connect,connect]
[eardrum,eardrum]

Applying `filter`

You've already figured this out, but once I've unnested my array into a table, I can use WHERE to filter.

SELECT
    ARRAY(SELECT word FROM UNNEST(SPLIT(the_text, ' ')) AS word WHERE word != 'connected') AS words
FROM examples

Which filters out the word 'connected' from my arrays.

words
[connecting,connections]
[eardrummed,eardrums]

Naturally, I can combine map and filter operations in the same statement.
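If you think in list comprehensions, the correspondence might be easier to see side by side. This is a Python sketch of the same map, filter and combined operations over the two example strings, with slicing standing in for LEFT(word, 7) just as LEFT stands in for a real stemmer.

```python
examples = ["connecting connected connections", "eardrummed eardrums"]

# map: ARRAY(SELECT LEFT(word, 7) FROM UNNEST(SPLIT(the_text, ' ')) AS word)
mapped = [[word[:7] for word in text.split(" ")] for text in examples]

# filter: ARRAY(SELECT word FROM UNNEST(...) AS word WHERE word != 'connected')
filtered = [
    [word for word in text.split(" ") if word != "connected"]
    for text in examples
]

# map and filter together: the SELECT expression and WHERE clause in one subquery
combined = [
    [word[:7] for word in text.split(" ") if word != "connected"]
    for text in examples
]

print(mapped)    # [['connect', 'connect', 'connect'], ['eardrum', 'eardrum']]
print(filtered)  # [['connecting', 'connections'], ['eardrummed', 'eardrums']]
print(combined)  # [['connect', 'connect'], ['eardrum', 'eardrum']]
```

The expression in the SELECT plays the role of the mapped function, and the WHERE clause plays the role of the filter predicate.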

Multiple return values

"But wait" I hear you say, "it looks like I can select multiple columns in my map!". You're right, let's try:

SELECT
    ARRAY(SELECT word, LEFT(word, 7) FROM UNNEST(SPLIT(the_text, ' ')) AS word) AS words
FROM examples

We get an error: ARRAY subquery cannot have more than one column unless using SELECT AS STRUCT to build STRUCT values at [9:5]. At least this error gives a clue how to proceed. Just like a regular map operation, you can't return multiple arrays, you have to pack the values into a single structure.

SELECT
    ARRAY(SELECT STRUCT(word, LEFT(word, 7) AS stemmed) FROM UNNEST(SPLIT(the_text, ' ')) AS word) AS words
FROM examples

The resulting structure isn't easily represented in a table, so I'll format the structs as JSON:

jsonified
[{"word":"connecting","stemmed":"connect"},{"word":"connected","stemmed":"connect"},{"word":"connections","stemmed":"connect"}]
[{"word":"eardrummed","stemmed":"eardrum"},{"word":"eardrums","stemmed":"eardrum"}]

Summary

I showed how to "map" the functional programming map and filter concepts to BigQuery by thinking in terms of little embedded SQL statements over unnested arrays as tables. I've not yet had any need to use reduce, the other function that springs to mind, so I'll update this if I ever need it.

This approach seems to work generally across SQL engines with minimal dialect variation. Some data warehouses provide functions that implement these capabilities more directly and naturally. For example, the Trino (was Presto) engine under AWS Athena provides transform (for map), filter and reduce functions over arrays.

Handling CVE-2018-20225

Paul Brabban — Sat, 18 May 2024 00:00:00 +0000

CVE-2018-20225 in all versions of pip tripped my vulnerability alerting this morning. If you're scanning for vulnerabilities using Safety, you've probably seen the same alarm. This post captures my reasoning and decision-making process to understand the risk and impact of this vulnerability and then deal with it.

Discovery

When I launched my pypi-vulnerabilities codespace this morning, VSCode ran my init_and_update script automatically. This script installs my dependencies and then checks them for known vulnerabilities. The first indication that something was wrong was a red task in the terminal sidebar.

init_and_update task red with an error indicator in a VSCode sidebar

Clicking into that terminal, I see:

Scan was completed. 1 vulnerability was reported.

Scrolling up, I get this detail:

-> Vulnerability found in pip version 24.0
   Vulnerability ID: 67599
   Affected spec: >=0
   ADVISORY: ** DISPUTED ** An issue was discovered in pip (all versions) because it installs the version with the highest version number, even if the user had
   intended to obtain a private package from a private index. This only affects use of the --extra-index-url option, and exploitation requires that the package does not...
   CVE-2018-20225
   For more information about this vulnerability, visit https://data.safetycli.com/v/67599/97c
   To ignore this vulnerability, use PyUp vulnerability id 67599 in safety’s ignore command-line argument or add the ignore to your safety policy file.

The Facts

Or at least, my understanding of the facts. What do I think I know at this point?

I've installed and updated my dependencies to the latest versions, including pip.
I've run safety check and it found a known vulnerability, CVE-2018-20225, is present in the updated dependencies.
There's no known fixed release available, as I would already have it thanks to open upper version bounds and automatic update before checking for vulnerabilities.
I'm running in a codespace and the only credentials this vulnerability can expose are:
- write access to the public-read pypi-vulnerabilities repo.
- BigQuery data editor access to the public PyPI data that I've been working on.
The same vulnerability will be present in GitHub Actions runs (executed daily and on PRs), which have the same risk exposure as Codespaces.
As it's a pip vulnerability, any other Python repo I have will currently have the same vulnerability (like this one for my website!)

I'm already ahead of where I used to be before I embraced automation and auto-updating. My automation made sure I was up to date and that I ran a vulnerability scan while I was still daydreaming about fixing the failed GitHub Actions run I got an email about yesterday. I know it's not noise that I can fix by simply updating and I have to take a look and decide what to do next. Running in a codespace lets me relax a little, as it dramatically reduces the scope of things that the vulnerability could exploit or steal.

Taking Action

Two actions spring to mind. First up, look at the vulnerability information and decide what I should do. Let's check out https://data.safetycli.com/v/67599/97c/.

** DISPUTED ** An issue was discovered in pip (all versions) because it installs the version with the highest version number, even if the user had intended to obtain a private package from a private index. This only affects use of the --extra-index-url option, and exploitation requires that the package does not already exist in the public index (and thus the attacker can put the package there with an arbitrary version number). NOTE: it has been reported that this is intended functionality and the user is responsible for using --extra-index-url securely.

I'm not using private indexes, and I don't use the --extra-index-url option. Looks like something I can safely ignore, but my paranoia tells me that I shouldn't do so forever. It's possible, even if I believe it unlikely, that more powerful ways of exploiting this CVE might be found in future. Ideally, I want to ignore it for a while and be prompted again in future if the CVE does not get withdrawn and a fixed release does not become available.
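To make the advisory a bit more concrete, here's a deliberately simplified model of the behaviour it describes: candidates from every configured index go into one pool and the highest version wins, regardless of where it came from. This is not pip's actual code, it only handles plain dotted versions, and the index names and version numbers are hypothetical.

```python
def parse_version(version: str) -> tuple[int, ...]:
    """Turn a simple dotted version like '1.0.0' into a comparable tuple."""
    return tuple(int(part) for part in version.split("."))


def pick_candidate(candidates: list[tuple[str, str]]) -> tuple[str, str]:
    """Return the (index, version) pair with the highest version.

    Simplified model: the index a candidate came from plays no part in the choice.
    """
    return max(candidates, key=lambda candidate: parse_version(candidate[1]))


# Hypothetical candidates for one package name found on two indexes.
candidates = [
    ("private-index", "0.0.1"),  # the package I meant to install
    ("public-index", "1.0.0"),   # same name, higher version, published by someone else
]
print(pick_candidate(candidates))
```

In this model the only thing protecting the private package is having the higher version number, which is exactly the thing an attacker controls.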

Safety Policy

Safety has come a long way since I first used it. I remember when it only supported passing "ignore" flags to ignore specific vulnerabilities, which was problematic to work with and automate around. Now I see that Safety supports a "policy file", so let's generate one and see what we can do with it.

(venv) @brabster ➜ /workspaces/pypi_vulnerabilities (main) $ safety generate policy_file
A default Safety policy file has been generated! Review the file contents in the path . in the file: .safety-policy.yml

Checking out that new file, I see lots of configuration being set. That seems like a pretty bad idea to me. I just want to ignore this one vulnerability for a while, I don't want to hardcode a bunch of configuration options. In doing so, I effectively take responsibility for what their effects are going forward. Nah. Delete delete delete. Instead, I'll use the ignore example given in the documentation to just configure what I want to happen for this specific vulnerability. The other Safety defaults are fine going forward, thanks.

Here's what I end up with in my .safety-policy.yml...

security:
  ignore-vulnerabilities:
    '67599':
      reason: do not use a private index
      expires: 2024-07-02

Running safety check now...

-> Vulnerability found in pip version 24.0
   Vulnerability ID: 67599
   This vulnerability is being ignored until 2024-07-02 00:00:00 UTC. See your configurations.
   Reason: do not use a private index
   For more information about this vulnerability, visit https://data.safetycli.com/v/67599/97c
### snip ###
 Scan was completed using a safety policy file. 0 vulnerabilities were reported. 1 vulnerability from 1 package was ignored.

Seems to be working fine, ignoring the expected vulnerability until the correct date. Nice confirmation in the output too ✅. I'm expecting it to return successfully instead of an error status so that my automation is all good. Does it?

(venv) @brabster ➜ /workspaces/pypi_vulnerabilities (main) $ echo $?
0

Yes ('0' means OK, it would be pretty dumb if it returned an error!). Let's just check that it returns to being an error when the date is in the past. Updating the policy file to set the date to yesterday (2024-05-17) and running safety check again... yes, no ignore information is printed to the output and the status code is '64', so an error 🙌.
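The behaviour I just checked can be summarised in a few lines of Python. This is a sketch of what I observed, not Safety's implementation: I'm assuming an ignore applies while today's date is before the expiry date, and I'm reusing the 0 and 64 status codes seen above. The function names are my own.

```python
from datetime import date


def is_ignored(expires: date, today: date) -> bool:
    """Assumed semantics: an ignore applies only while today is before the expiry date."""
    return today < expires


def exit_status(vulnerabilities_found: int, expires: date, today: date) -> int:
    """Model a scan with one ignore rule covering every finding: 0 if nothing is left, else 64."""
    remaining = 0 if is_ignored(expires, today) else vulnerabilities_found
    return 0 if remaining == 0 else 64


today = date(2024, 5, 18)
print(exit_status(1, date(2024, 7, 2), today))   # ignore still active
print(exit_status(1, date(2024, 5, 17), today))  # ignore expired yesterday
```

The useful property is that the ignore fails closed: once the date passes, the build goes red again without me having to remember anything.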

Confirming Measures Are Effective

I'll set the date back to 2024-07-02. Why? The free safety database is updated monthly on the 1st, and I'm pretty relaxed about the risk for this one. I'll give it the rest of May and June before checking back in. Does my init_and_update task work cleanly now?

init_and_update task with no colour and checkmark indicating success

Yes. Obviously, it was going to work, but I guess I've worked with software for long enough that I have trust issues. I don't believe it 'til I see it 🤷. I'll commit that policy file back with the commit message "Ignore low-risk pip vuln https://data.safetycli.com/v/67599/97c" and check that the resulting Actions workflow is also clean.

GitHub Actions update and safety run logs ignore and succeeds

The Actions run still fails though - oh yes, that's why I was launching the workspace in the first place! Before I get back to that though, I said two actions didn't I?

Ignoring At Scale

The other action relates to the other repositories I have that are going to be affected by the same vulnerability. Every repo that checks for vulnerabilities from this point until this vulnerability is withdrawn or fixed will raise the alarm. The safety policy file is local to pypi-vulnerabilities, so that's no help. How to manage this sort of vulnerability scanning and alerting at any sort of scale?

To be honest, that's a problem I've had in every role I've taken on in the past decade, employee and consultant alike. I've not had a good solution to date and I end up just working through the repos applying whatever ignore solution I had at the time. Given that I have stuff I'm maintaining in the public domain now, and a reason to blog about it, I'm going to give it a little thought this time and see if I can come up with something better.

What do you do to solve this problem?

dbt 1.8 breaks on update

Paul Brabban — Sun, 12 May 2024 00:00:00 +0000

On updating dbt-bigquery to latest 1.8.0: No module named 'dbt.adapters.factory'

Issue - dbt 1.8.0 won't run

I spun up a Codespace on pypi_vulnerabilities to resolve a failing build.
My auto-update ran as usual, updating all my Python dependencies to latest stable release and I got dbt-core 1.8.0, now out of beta as of 2024-05-09.

The first thing my auto-update does that depends on dbt - install/update dbt packages - failed.

Successfully installed daff-1.3.46 dbt-adapters-1.1.1 dbt-bigquery-1.8.0 dbt-common-1.0.4 dbt-core-1.8.0 dbt-semantic-interfaces-0.5.1
install or upgrade dbt dependencies
Traceback (most recent call last):
  File "/workspaces/pypi_vulnerabilities/venv/bin/dbt", line 5, in <module>
    from dbt.cli.main import cli
  File "/workspaces/pypi_vulnerabilities/venv/lib/python3.11/site-packages/dbt/cli/__init__.py", line 1, in <module>
    from .main import cli as dbt_cli  # noqa
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/workspaces/pypi_vulnerabilities/venv/lib/python3.11/site-packages/dbt/cli/main.py", line 14, in <module>
    from dbt.cli import requires, params as p
  File "/workspaces/pypi_vulnerabilities/venv/lib/python3.11/site-packages/dbt/cli/requires.py", line 10, in <module>
    from dbt.adapters.factory import adapter_management, register_adapter, get_adapter
ModuleNotFoundError: No module named 'dbt.adapters.factory'

Clean Install to Fix

Clearing my venv and reinstalling from scratch gets things back up and running. Here's the specific procedure I used:

$ deactivate # deactivate the venv, if active
$ rm -r venv # recursively delete the venv folder
$ python -m venv venv
$ . venv/bin/activate
$ pip install -U -r requirements.txt
$ dbt deps

I actually ran my init_and_update.sh script rather than typing those commands by hand, but those are the relevant things it does.

This also explains why my build wasn't breaking (well, not for this reason anyway) - it always starts from scratch.

Changes in Adapter Dependencies

That's how you can fix the No module named 'dbt.adapters.factory' error if it comes up.
As I read the v1.8 upgrade docs I also learned that dbt Labs intend to change the way Python adapter dependencies work.

Beginning in v1.8, dbt-core and adapters are decoupled. Going forward, your installations should explicitly include both dbt-core and the desired adapter.

Scanning through the related technical docs, I think the auto-update approach I use will still work as intended without explicitly specifying the core dependency.

Up to now adapters have been required release a new minor version to declare compatibility with dbt-core’s minor version. Post dbt-core version 1.8, adapters will not need to do this. Instead maintainers will need to declare their compatibility with dbt-adapters’ versions.

So, if you're setting your adapter dependency with an open upper bound (eg. dbt-bigquery>=1.8.0 as mine will be shortly) then you should also get the latest compatible dbt-core version each time you run pip install -U -r requirements.txt on your development environment or your build system. I'll post how that shakes out over time.

How I Do Python Data Supply Chain Security

Paul Brabban — Wed, 01 May 2024 00:00:00 +0000

We data practitioners - data scientists, data engineers, analytics engineers, et al. - have a hard time when it comes to security. We're exposed to tools that demand we write code and deal with the messy world of programming languages and packages. We often have little choice but to drag insights out of real and sensitive data, exposing us to risks other developers can avoid, because insights don't hide in test data. Training, career paths and dev-experience efforts typically overlook data folks, depriving them of knowledge about the risks they're exposed to and how to mitigate them. Read on and I'll share what I do (and why) to protect myself, Equal Experts and my clients from the security risks lurking behind every piece of software.

When I speak to fellow data practitioners I find there's a common concern about the security risks they're exposed to, but a lack of clear, pragmatic guidance on how to mitigate them. A lot of the guidance out there, like Snyk's Python security best practices, focuses on traditional software engineering. "Data" work can be a little different.

This story covers the things I've learned to do in my day-to-day over the past decade or more, including:

Partitioning My Work
Using pip
Assessing Dependency Risk
Keeping Dependencies Up To Date
Scanning for Vulnerabilities
Using Least-Privilege Credentials
Cloud Controls
Templating

Partitioning My Work

I have at least three different, isolated working environments that I use locally. One laptop is for my personal stuff. At any given time it may have access to credentials for my AWS accounts, my GCP account, my password manager and GMail. If someone bad got into the first two they could run me up a painful bill, despite having billing alarms set up (I talk about that in $1,370 Gone in Sixty Seconds). If they got into either of the last two they can steal my identity and do untold harm to me and my family. I really need to minimise the risk on that computer!

Then I have a laptop that Equal Experts gave me. This will have credentials for EE things that I have a responsibility to protect. EE-related work goes on here.

Finally, I have a laptop that I do client work on. Same deal, with even greater responsibility to protect. This laptop is used for nothing but client work.

I have a separate long, strong password for all three (I can reliably hold about four such passwords in my brain at once, if I'm using them regularly and don't have to change them all at once!), and their disks are all encrypted. That's the basic layer of protection - if any one were ever compromised, it's bad enough. But it's better than all three being compromised. There's more about how I manage multiple laptops and additional security measures I use in my posts on automating my laptop build and living with an automated laptop build.

!!! warning
Like many things I'll talk about in this post, password managers are awesome but ship with unsafe defaults - convenience over security 🤷. I make a point of setting the delay before my password manager clears the clipboard to 30 seconds (default is never!) and setting the timeout before locking again to five minutes. That reduces the time window things have to steal passwords without significantly impacting my day to day. Don't forget the password manager app on your phone...

I've recently been having a great experience with even more protective partitioning - a dedicated, isolated, customisable development environment per repository with GitHub Codespaces. I think it's the future. If coding in the cloud is an option for you, I'd recommend giving it a try with an open mind. I've even got a Codespaces walkthrough and video to help!

Using Virtualenvs

It's really easy to install into your system Python by mistake - you forget to activate the venv, or you think it's active when it's not. Ideally, your system permissions are set up so that you can't write to your system Python installation, but I find that's quite rare. It's certainly not the case on the current Ubuntu OS on the laptop I'm sitting in front of right now.

You can change the default behaviour of pip on your computer so that it won't install in the system Python.
I set an environment variable PIP_REQUIRE_VIRTUALENV to true in scripts where I interact with pip. For example, the init_and_update.sh script in this repository sets PIP_REQUIRE_VIRTUALENV. I also set it one-off as a system-wide environment variable.

$ export PIP_REQUIRE_VIRTUALENV=true
$ pip install safety
ERROR: Could not find an activated virtualenv (required).

There are several other ways to tell pip not to install in the system Python if environment variables don't work for you.
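If you have Python scripts of your own that shouldn't touch the system Python, a similar guard is easy to sketch. This is separate from pip's setting and only an illustration: it relies on the standard library fact that sys.prefix differs from sys.base_prefix inside a venv, and the function name in_virtualenv is my own.

```python
import sys


def in_virtualenv(prefix: str = sys.prefix, base_prefix: str = sys.base_prefix) -> bool:
    """True when the interpreter's prefix differs from the base installation's prefix."""
    return prefix != base_prefix


# Report the state of the interpreter running this script.
print("virtualenv active:", in_virtualenv())
```

A script could check this at startup and refuse to continue when it returns False, mirroring what PIP_REQUIRE_VIRTUALENV does for pip.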

Using pip

For the past 18 months, I've used plain old pip, the Python Packaging Authority's recommended tooling. I'm very happy with it and would recommend it to other Python data practitioners. My workflow with pip and venv is:

$ python -m venv venv # create a venv for the current project if needed
$ source venv/bin/activate # activate the venv (slightly different command on Windows)
$ which python # check that the venv is active; IDE usually indicates clearly when venv is active
$ pip install -U -r requirements.txt # install and update packages in this venv based on requirements.txt file

Pipenv, Poetry et al.

I've used pipenv. I've used Poetry. There are several more. In my experience, they don't deliver significant benefits over pip and are more trouble than they're worth. I have a bunch of stories about this that I'll share another day.

!!! note
pip got a proper dependency resolver in 20.3 released in 2020. I can't recall using pip before that but I imagine it wasn't the most reliable. I haven't had any problems updating on a daily basis since sometime in 2022. Kudos to fellow consultant João Neves for the feedback 🙇.

Conda

I haven't used conda. One of the reasons for that is the complexity around commercial use in the terms of service, which was enough to warrant a clarifying blog post. I think the advice here applies equally to conda users but I can't speak from my own experience.

Assessing Dependency Risk

Any software you bring onto your computer has the potential to hurt you. In the case of Python, just installing a package can let bad actors loose on your computer. My working assumption is that any software running on the computer I'm using can do anything I can do, including accessing any passwords, access tokens, and session tokens I have. Maybe even my password manager if it's unlocked.

I don't trust what I see on the internet. Anyone can make up an identity, publish anything they want and say anything they want. Identities can and have been stolen and used to inject malware into previously safe software. Maintainers can be bought out or get burnt out. I don't think there's anything I can do here that completely mitigates the risks, but I can reduce them. I'll take this opportunity to plug a responsible, in-depth treatment of package handling over at python.land.

Is the package popular?

A heavily-used package is a juicy target for bad actors. On the other hand... heavily-used packages have more eyeballs on them. They're more likely to be looked at by security researchers, and if there is a problem I'm in a larger crowd of potential victims, so a lower chance that I will be exploited before I have a chance to respond or my credentials and so on expire or are changed. On balance, I prefer packages with evidence of large user communities.

search for articles and blogs mentioning it
check out GitHub stars, forks

Special mention to libraries.io, providing a search interface with metrics about how a package is used by others - Sort: Dependents.

!!! warning
Copy-paste the package name into requirements.txt from somewhere you trust! The bad actors love "typosquatting": publishing malware-laden packages with similar names to popular legitimate packages. These packages are downloaded thousands of times before they are identified and removed!
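As a rough illustration of what a typosquatting check might look like, here's a sketch that flags a name resembling, but not matching, something on a list of names I already trust. The TRUSTED list and the 0.8 cutoff are assumptions for the example, and a fuzzy match like this is a prompt to go and look, not proof of anything.

```python
import difflib

# Hypothetical allow-list of names I already trust; yours will differ.
TRUSTED = ["requests", "numpy", "pandas", "dbt-bigquery"]


def lookalikes(name: str, trusted: list[str] = TRUSTED, cutoff: float = 0.8) -> list[str]:
    """Return trusted names that `name` closely resembles without matching exactly."""
    if name in trusted:
        return []  # an exact match is the real thing, nothing to flag
    return difflib.get_close_matches(name, trusted, n=3, cutoff=cutoff)


print(lookalikes("reqeusts"))  # two letters swapped
print(lookalikes("requests"))  # exact match
```

Anything this returns for a new entry in requirements.txt is worth a second look before installing.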

Minimising Dependency Use

There's so much software out there, a solution for every problem or suboptimal thing you could imagine. I save myself the time of more in-depth thinking by just trying to be honest with myself about whether I really need something or not.

Python is famous for its "batteries included" philosophy, so I find it's worth checking whether you can use something built in for no additional risk rather than a package that does expose you to new risks.

Do I really need to use Poetry? No, I can use boring old pip.
Do I really need to use murmurhash? No, I can use Python's boring old built-in hash.
Do I really need to use colorama? No, I can live with boring old monochrome terminal text.

Just because I'm making boring choices in the name of safety does not mean the packages I depend on are making similar choices. Every time I avoid a dependency, I'm cutting out that dependency, and its dependencies, and their dependencies and so on! That whole subtree of dependencies, and the choices their maintainers make, and their vulnerabilities? Not my problem.
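To put a shape on that subtree idea, here's a small sketch that works out which packages disappear when one direct dependency is dropped. The graph is entirely hypothetical (the package names are placeholders) and real dependency metadata is messier than a dict of lists.

```python
# Hypothetical dependency graph: package name -> direct dependencies.
GRAPH = {
    "my-project": ["tool-a", "tool-b"],
    "tool-a": ["lib-x", "lib-y"],
    "tool-b": ["lib-y"],
    "lib-x": ["lib-z"],
    "lib-y": [],
    "lib-z": [],
}


def transitive_dependencies(graph: dict[str, list[str]], root: str) -> set[str]:
    """Collect every package reachable from root, not including root itself."""
    seen: set[str] = set()
    stack = list(graph.get(root, []))
    while stack:
        name = stack.pop()
        if name not in seen:
            seen.add(name)
            stack.extend(graph.get(name, []))
    return seen


def removed_by_dropping(graph: dict[str, list[str]], root: str, dropped: str) -> set[str]:
    """Packages that leave root's tree when the direct dependency `dropped` is removed."""
    pruned = dict(graph)
    pruned[root] = [dep for dep in graph[root] if dep != dropped]
    return transitive_dependencies(graph, root) - transitive_dependencies(pruned, root)


print(sorted(removed_by_dropping(GRAPH, "my-project", "tool-a")))
```

Note that lib-y stays in this example because tool-b still needs it, so the saving depends on how much of the subtree is shared.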

Using Common Cross-Project Dependencies

I also try to use the same dependencies everywhere, instead of allowing variation without good reason. That helps me really get to know those dependencies and their maintainers whilst reducing the exposure I have to different supply chains generally. Want more on this topic? ZDD (Zero Dependency Development) is a well-argued and more detailed case for minimising and eliminating dependencies.

Well-Maintained Dependencies

I'm suspicious of packages that:

have only one maintainer - bus factor, high risk of burnout, where are checks, measures and accountability?
have more than five maintainers - (who are all these people? How do they decide what to approve or not?)
don't have a history of being updated regularly
don't have any obvious source of funding and don't ask for any
aren't backed by an organisation I trust
don't have a security policy (eg. security tab in GitHub)

I have more Opinions on this one, but they go broader than the scope of this post. I'll pop that on my backlog for a future post.

Updating Dependencies Automatically

I think it's fair to say that keeping your dependencies up to date is not an industry standard practice[^1]. Tools like Pipenv, Poetry and the like default you to locking the exact version of every dependency, and their dependencies, and so on. They instruct you to commit these lockfiles to source control without mentioning the drawbacks. Unless you go and run special commands to update them and then commit those changes, your app will be frozen in time, accumulating vulnerabilities that you won't even know about unless you're scanning them for vulnerabilities.

Another 👍 for pip which does not lock by default. If you look at any of my more recent Python repositories, you'll find minimum-bound version constraints, along with builds and IDE support for automatically updating versions.

If you have to use a tool that creates lockfiles for now, you can git rm them, then add them to your .gitignore file to have Git ignore them going forward. That cuts out the need to commit updates back and so simplifies updating.

Example: dbt_bigquery_template

I have exactly one dependency in dbt_bigquery_template, which is currently this:

dbt-bigquery>=1.7.0

This translates as "get me the latest release of dbt-bigquery that's no older than 1.7.0". The low bound ensures that I know things will blow up if somehow I get an older dependency than the last one I had reason to look at (1.7.0 in this case). If there's a problem with the latest version, and I can't fix it right now, I can still pin to the previous working version temporarily - but I've found this is a rare exception rather than a fatiguing everyday occurrence.

Unlike some other ecosystems I've had the misfortune of needing to work with, pip has a wonderful feature in requiring an explicit flag --pre to include pre-release versions, so you won't get the technically-latest-but-unstable 1.8.0b2 beta. Yay! The latest release is currently dbt-bigquery 1.7.7.
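Here's a simplified sketch of the selection I'm describing: take the available versions, skip pre-releases, apply the minimum bound and pick the highest. It is not pip's resolver and it doesn't implement the full version specification; it only understands plain dotted releases with optional a/b/rc suffixes. The 1.6.9 entry in the list is made up to show the lower bound doing its job.

```python
import re
from typing import Optional

# Matches simple releases such as 1.7.7 and pre-releases such as 1.8.0b2 or 1.8.0rc1.
VERSION = re.compile(r"^(\d+(?:\.\d+)*)((?:a|b|rc)\d+)?$")


def release_tuple(release: str) -> tuple[int, ...]:
    """Turn '1.7.7' into (1, 7, 7) so releases compare numerically."""
    return tuple(int(part) for part in release.split("."))


def latest_stable(available: list[str], minimum: str) -> Optional[str]:
    """Pick the highest non-pre-release version that is at least `minimum`."""
    floor = release_tuple(minimum)
    stable = []
    for version in available:
        match = VERSION.match(version)
        if match is None or match.group(2):
            continue  # skip anything unparseable or marked as a pre-release
        release = release_tuple(match.group(1))
        if release >= floor:
            stable.append((release, version))
    return max(stable)[1] if stable else None


available = ["1.6.9", "1.7.0", "1.7.7", "1.8.0b2"]
print(latest_stable(available, "1.7.0"))
```

With the versions above, the beta is never considered and the older release is ruled out by the bound, which matches what I see from pip without the --pre flag.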

In the IDE

When I open dbt_bigquery_template in VSCode, the init-and-update task automatically kicks off and runs a series of commands updating different kinds of dependencies including pip install -U -r ${PROJECT_DIR}/requirements.txt. -U means --upgrade and updates any dependencies you've already got to their latest versions if newer versions are available.

In the Build

The build for dbt_bigquery_template does the same thing. This line in the workflow is basically the same as the line in the VSCode task script. Assuming you close and re-open VSCode at least once each day, that means your development environment and your build or workflow management system are both within a few hours of latest and one another at any given point in time. You don't need to freeze the world forever to avoid "works-on-my-machine" problems.

There's a lot more to say about automatically updating dependencies, but I'll keep things minimal for this post.

The Least-Bad Solution

First - in my opinion, this is the least bad approach, not a perfect solution. The big risk - if something you're depending on becomes bad, you'll pick it up automatically. I think vulnerabilities that we do and do not know about in old software are a much bigger risk. Updating automatically you get all the security fixes straight away, at no time cost to your team, before the manual updaters have had a chance to realise there's a problem, prioritise the work to update, decide whether it needs fixing, and get around to dealing with it.

I take it as validating that regulators are increasingly calling for timely and automatic updates everywhere from IoT devices to phones, servers and so on - so staying up to date seems to be generally accepted as the better position to be in.

Works In Practice

I've used this approach for several months or so with multiple teams collaborating over multiple repositories and I can't recall any significant problems. The main inconvenience that occurs is those rare occasions when a dependency lets a breaking change through, which you find out about the next day.

I see this as a feature, not a bug. These kinds of breaks aren't subtle. Any sort of automated build process or your orchestration tooling is going to notice when version conflicts can't be resolved, an API change prevents your tests from running or the maintainers change something that your permissions don't let you do. I maybe wouldn't pick the most mission-critical thing I could find to try auto-updating for the first time!

Oh - and yes, setting the constraint to allow semantic-versioning-major "breaking changes" through is by design, not accidental. I'd rather find out about a breaking change that actually affects me when it happens, not months later with a critical vulnerability to fix and no update path except through the breaking version. In my experience the ideals of semver 2.0.0 and reality don't really line up all that well - yet another post for another day.

What about renovate and dependabot?

When I talk with someone about automatically updating, automated PR-raisers often come up. I haven't used either tool myself and I avoid them. If I'm keeping up to date and not committing every update back to source control, I don't need a tool raising PRs to help me manage the relentless torrent of vulnerability notices because I don't have that problem. Plus, they're not immune from exploitation themselves. Another supply chain bites the dust 👋

Increased Awareness

It's very useful to know when a breaking change just landed on you. You can deal with it while it's fresh before the work has had any chance to pile up. You get to understand how reliable your dependencies really are - perhaps a dependency that keeps breaking you isn't so trustworthy after all? Most importantly - you don't find out you've got multiple breaking changes in the way when your scans alert you to a critical must-fix vulnerability in the two-year-old version of that dependency you haven't updated.

Speaking of which...

Scanning for Vulnerabilities

Tools like safetycli and Snyk will scan your installed dependencies and tell you whether there are any known vulnerabilities in there. You'll see my use of the safetycli Python package to scan dependencies as part of my init-and-update process both locally and in build and workflow infrastructure.

They both offer free single-developer plans at the time of writing for your personal and side-projects and paid plans for teams and enterprises. There are other options too - GitLab, for example, provides some vulnerability scanning tooling, so check what your organisation might already be using for a potentially easy option.

I wrote a little about how and when to check your dependencies back in Checking your Dependencies.

Safety Usage Example

$ safety check
...snip...
  Using open-source vulnerability database
  Found and scanned 54 packages
  Timestamp 2024-05-02 21:13:23
  0 vulnerabilities reported
  0 vulnerabilities ignored
+============+

 No known security vulnerabilities reported. 

+============+

  Safety is using PyUp's free open-source vulnerability database. This data is 30 days old and limited. 
  For real-time enhanced vulnerability data, fix recommendations, severity reporting, cybersecurity support, team and project policy management and more sign up at
https://pyup.io or email sales@pyup.io

Scanning vs. Automatically Updating?

I scan my dependencies after installing the latest versions. As I mentioned before, auto-updating is not a perfect solution, and I could still have a vulnerability even after updating - for example, a known issue with no fix available. We seem to be getting pretty good at responsible disclosure of late. It seems much more rare that I get a vulnerability notice that's not already fixed - auto-updating means I'll already have the fix installed by the time the alert would have landed in my inbox.

If I do end up in that situation, my options are limited. Try to fix it myself? Not very safe nor likely to be feasible. Add an ignore? I haven't had to do this for a while but safety, for example, looks to have better support now for expiring ignores than last time I had to do it. Lastly - shut the thing down until a fix is available. It's worth considering in a pinch, particularly if the software or pipeline isn't all that time- or mission-critical.

Using Least-Privilege Credentials

It's been a long read to here, so I'll briefly mention a couple more points and wrap up.

Try to avoid having powerful credentials lying around, or using more powerful credentials than are needed for a job. Partitioning your development environments helps, by letting you reduce the variety of credentials you have in the same place. My experience with Codespaces shines here as I can restrict a repository to exactly the permissions it needs with nothing else lying around.

Cloud Controls

There are powerful controls in Cloud infrastructures to limit cost exposure and where data can be copied. I've been looking around at Google Cloud, where quotas are far more proactive and effective cost controls than billing alarms. VPC Service Perimeters can prevent data from being transferred outside your organisation. Both are effectively disabled by default and not straightforward to use, but I'm building my understanding and will share some pragmatic advice when I have some.

Templating

A lot is going on here, and you'll hit issues as you work with more repositories. I start small and build up. dbt_bigquery_template is one way I'm speeding up and improving consistency in my dbt projects (and there's a short walkthrough video that touches on some of the content here). I've also had some interesting success using git submodules to reuse scripts and so on across multiple repositories and teams in a fairly effortless manner, which I'll write about another day.

Wrap

I'll wrap up by saying that whilst this post focuses on Python packages, the risks and ideas apply to most if not all of the other supply chains I'm exposed to. Off the top of my head - Operating system. Firmware. Application software. IDE plugins. Browser plugins. dbt packages. The list goes on and on. Any of these things might be able to get up to the kind of mischief we've been talking about, so stay vigilant and think before installing. I hope the content here helps with the thinking part!

If you got this far - wow, well done, and best of luck. I'd love any feedback - including anything I got wrong, anything that didn't make sense, or anything I should have covered! There's some information about how to get in touch here.
