<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: catatsuy</title>
    <description>The latest articles on DEV Community by catatsuy (@catatsuy).</description>
    <link>https://dev.to/catatsuy</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F42659%2F727807c8-3987-4502-8757-67dd99c89449.jpg</url>
      <title>DEV Community: catatsuy</title>
      <link>https://dev.to/catatsuy</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/catatsuy"/>
    <language>en</language>
    <item>
      <title>Let's Encrypt short-lived certificates are quite strict, so you should use an ARI-capable client</title>
      <dc:creator>catatsuy</dc:creator>
      <pubDate>Sun, 19 Apr 2026 07:24:36 +0000</pubDate>
      <link>https://dev.to/catatsuy/lets-encrypt-short-lived-certificates-are-quite-strict-so-you-should-use-an-ari-capable-client-j5a</link>
      <guid>https://dev.to/catatsuy/lets-encrypt-short-lived-certificates-are-quite-strict-so-you-should-use-an-ari-capable-client-j5a</guid>
      <description>&lt;p&gt;Let's Encrypt short-lived certificates are much harder than they look if you think of them as just a shorter version of 90-day certificates.&lt;/p&gt;

&lt;p&gt;If you issue and renew multiple certificates for multiple subdomains in a short interval, you can hit certificate issuance rate limits more easily. Short-lived certificates increase the number of renewals, so these limits become much more visible.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://letsencrypt.org/docs/rate-limits/" rel="noopener noreferrer"&gt;https://letsencrypt.org/docs/rate-limits/&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  You should test in the staging environment first
&lt;/h2&gt;

&lt;p&gt;For development and testing, you should use the staging environment instead of production. It has the same kind of behavior, but the limits are much looser. That makes it safer when you are still deciding how to split certificates and how to renew them.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Environment&lt;/th&gt;
&lt;th&gt;ACME directory URL&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Staging&lt;/td&gt;
&lt;td&gt;&lt;code&gt;https://acme-staging-v02.api.letsencrypt.org/directory&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Production&lt;/td&gt;
&lt;td&gt;&lt;code&gt;https://acme-v02.api.letsencrypt.org/directory&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;a href="https://letsencrypt.org/docs/staging-environment/" rel="noopener noreferrer"&gt;https://letsencrypt.org/docs/staging-environment/&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Short-lived certificates hit rate limits more easily
&lt;/h2&gt;

&lt;p&gt;Short-lived certificates are valid for only 160 hours. Let's Encrypt recommends renewing them every 3 days. That means many more renewals than 90-day certificates, so rate limits become much easier to hit.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://letsencrypt.org/2026/01/15/6day-and-ip-general-availability" rel="noopener noreferrer"&gt;https://letsencrypt.org/2026/01/15/6day-and-ip-general-availability&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://letsencrypt.org/docs/faq/" rel="noopener noreferrer"&gt;https://letsencrypt.org/docs/faq/&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The strictest limit here is the one for the same exact set of domain names: 5 certificates per 7 days. If you keep issuing certificates for the same set of names, you get close to that limit quickly.&lt;/p&gt;

&lt;p&gt;Also, this limit does not fully reset all at once after 7 days. Let's Encrypt says the ability to request new certificates for the same exact set of identifiers refills at a rate of 1 certificate every 34 hours. With short-lived certificates, the renewal interval is short, so this refill speed matters too.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://letsencrypt.org/docs/rate-limits/" rel="noopener noreferrer"&gt;https://letsencrypt.org/docs/rate-limits/&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If you split certificates by subdomain, the number of certificates grows. And because all subdomains under the same registered domain count against the same per-domain limit, there is less headroom than it first seems.&lt;/p&gt;
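&lt;p&gt;As a rough sketch of the arithmetic (the 3-day renewal interval and the 1-certificate-per-34-hours refill rate come from the Let's Encrypt documentation linked above; the doubling for RSA plus ECDSA is discussed in the next section):&lt;/p&gt;

```shell
# Renewing one exact set of names every 3 days:
renewals_per_week=$(( (7 * 24) / (3 * 24) ))
echo "renewals per 7 days: ${renewals_per_week}"       # 2

# The same-names bucket refills at 1 certificate per 34 hours:
refill_per_week=$(( (7 * 24) / 34 ))
echo "bucket refills per 7 days: ${refill_per_week}"   # 4

# Issuing both RSA and ECDSA doubles every renewal:
echo "with RSA+ECDSA: $(( renewals_per_week * 2 )) certificates per 7 days"
```

With a 5-certificates-per-7-days limit, that doubled count already leaves very little margin for retries or re-issuance.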

&lt;h2&gt;
  
  
  It gets worse if you issue both RSA and ECDSA certificates
&lt;/h2&gt;

&lt;p&gt;If you use both RSA and ECDSA, you need two certificates for the same domain names.&lt;/p&gt;

&lt;p&gt;That means the number of certificates doubles immediately.&lt;/p&gt;

&lt;p&gt;With short-lived certificates, the renewal interval is already short, so a setup that splits certificates by subdomain and also keeps both RSA and ECDSA certificates can hit rate limits quite easily.&lt;/p&gt;

&lt;h2&gt;
  
  
  That is why you should use an ARI-capable client
&lt;/h2&gt;

&lt;p&gt;This is where ARI becomes important. ARI stands for ACME Renewal Information: a mechanism that lets the CA tell the ACME client when it should renew a certificate.&lt;/p&gt;

&lt;p&gt;With Let's Encrypt, renewals that use ARI are exempt from all rate limits. Since short-lived certificates assume renewal every 3 days, this exemption makes a big difference. It matters even more if you want both RSA and ECDSA certificates.&lt;/p&gt;

&lt;p&gt;If you want to use short-lived certificates, you should use an ARI-capable client.&lt;/p&gt;

&lt;h2&gt;
  
  
  ARI is not inside the certificate
&lt;/h2&gt;

&lt;p&gt;ARI is not a certificate extension. Even if you inspect a certificate with &lt;code&gt;openssl x509 -text&lt;/code&gt;, you cannot tell whether it was renewed with ARI.&lt;/p&gt;

&lt;p&gt;ARI works through ACME &lt;code&gt;renewalInfo&lt;/code&gt;, so it is part of the ACME protocol, not part of the certificate itself. To know whether ARI is being used, you need to look at the client and the CA interaction, not only at the certificate.&lt;/p&gt;
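&lt;p&gt;One way to see this from the outside is the ACME directory: a CA that supports ARI advertises a &lt;code&gt;renewalInfo&lt;/code&gt; endpoint there. A minimal check might look like this (the JSON below is an abbreviated, illustrative excerpt, not a verbatim directory document):&lt;/p&gt;

```shell
# In a real check you would fetch the directory first, e.g.:
#   curl -s https://acme-v02.api.letsencrypt.org/directory
# Abbreviated, illustrative excerpt of a directory document:
directory_json='{"newOrder":"https://ca.example/acme/new-order","renewalInfo":"https://ca.example/acme/renewal-info"}'

# An ARI-capable CA lists a renewalInfo endpoint in its directory:
if printf '%s' "$directory_json" | grep -q '"renewalInfo"'; then
  echo "CA advertises ARI"
else
  echo "no renewalInfo endpoint in directory"
fi
```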

&lt;h2&gt;
  
  
  lego is very useful for this
&lt;/h2&gt;

&lt;p&gt;If you want to use short-lived certificates, you need both &lt;code&gt;shortlived&lt;/code&gt; profile support and ARI support. lego supports both, so it is very useful for this use case.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/go-acme/lego" rel="noopener noreferrer"&gt;https://github.com/go-acme/lego&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Its usage is also simple. Use &lt;code&gt;run&lt;/code&gt; for the first issuance, then use &lt;code&gt;renew&lt;/code&gt; for later renewals. For short-lived certificates, &lt;code&gt;--profile shortlived&lt;/code&gt; is the important option, and &lt;code&gt;renew --dynamic&lt;/code&gt; is a simple way to run renewals. If you change &lt;code&gt;--server&lt;/code&gt; to the staging URL, you can test the same setup in staging.&lt;/p&gt;

&lt;p&gt;For example, the first issuance looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;lego &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--accept-tos&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--email&lt;/span&gt; you@example.com &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--server&lt;/span&gt; https://acme-v02.api.letsencrypt.org/directory &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--dns&lt;/span&gt; route53 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--domains&lt;/span&gt; example.com &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--domains&lt;/span&gt; www.example.com &lt;span class="se"&gt;\&lt;/span&gt;
  run &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--profile&lt;/span&gt; shortlived
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Renewal looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;lego &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--accept-tos&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--email&lt;/span&gt; you@example.com &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--server&lt;/span&gt; https://acme-v02.api.letsencrypt.org/directory &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--dns&lt;/span&gt; route53 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--domains&lt;/span&gt; example.com &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--domains&lt;/span&gt; www.example.com &lt;span class="se"&gt;\&lt;/span&gt;
  renew &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--dynamic&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--profile&lt;/span&gt; shortlived
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If you want to test in staging, change &lt;code&gt;--server&lt;/code&gt; to this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nt"&gt;--server&lt;/span&gt; https://acme-staging-v02.api.letsencrypt.org/directory
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  I still do not know whether it is ready for real production use
&lt;/h2&gt;

&lt;p&gt;I still do not know how practical short-lived certificates are for real production services.&lt;/p&gt;

&lt;p&gt;At least for now, I am using them on personal domains, and that already showed several problems.&lt;/p&gt;

&lt;p&gt;First, short-lived certificates increase the number of renewals, so CA logs also increase. There is simply too much to monitor.&lt;/p&gt;

&lt;p&gt;Also, the maximum validity is only 160 hours, so common certificate monitoring services tend to stay in a critical state all the time: a typical expiry-alert threshold such as "warn when fewer than two weeks remain" is longer than the certificate's entire lifetime.&lt;/p&gt;

&lt;p&gt;And as described above, these certificates are also more likely to hit rate limits.&lt;/p&gt;

&lt;p&gt;Because of these characteristics, my own site started breaking more often after I switched to short-lived certificates. The feature is interesting, but I think stable operations are still hard if you want to use it widely in real services.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;With Let's Encrypt short-lived certificates, it is easier to hit certificate issuance limits.&lt;/p&gt;

&lt;p&gt;It gets even harder if you want both RSA and ECDSA certificates, because the number of certificates simply doubles. Since short-lived certificates also renew more often, a setup with multiple subdomains can get close to rate limits much faster than expected.&lt;/p&gt;

&lt;p&gt;To make this easier to operate, it is very important to test in staging first and to use an ARI-capable client.&lt;/p&gt;

</description>
      <category>letsencrypt</category>
      <category>tls</category>
      <category>acme</category>
      <category>security</category>
    </item>
    <item>
      <title>Designing and Optimizing Image Delivery with a CDN</title>
      <dc:creator>catatsuy</dc:creator>
      <pubDate>Sat, 11 Apr 2026 06:21:53 +0000</pubDate>
      <link>https://dev.to/catatsuy/designing-and-optimizing-image-delivery-with-a-cdn-35ia</link>
      <guid>https://dev.to/catatsuy/designing-and-optimizing-image-delivery-with-a-cdn-35ia</guid>
      <description>&lt;p&gt;In modern web services, it is no longer enough to handle images the way we did in the past.&lt;/p&gt;

&lt;p&gt;A long time ago, it was common to generate a few fixed thumbnail sizes when a user uploaded an image, then serve those files as they were. That approach was often good enough. But today, high-density smartphone displays are common, and the same image may need different sizes and formats depending on the client. On top of that, CDNs and image optimization services can now generate image variants dynamically and cache them efficiently.&lt;/p&gt;

&lt;p&gt;Because of this, image delivery is no longer just a frontend concern. It is part of system design.&lt;/p&gt;

&lt;p&gt;What matters most to me is this:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;image size matters more than quality&lt;/li&gt;
&lt;li&gt;image URL design should be controlled by the backend, not the frontend&lt;/li&gt;
&lt;li&gt;the backend should return multiple candidate URLs for the same image, and the browser should choose one&lt;/li&gt;
&lt;li&gt;modern formats and image optimization services are useful, but they need to be used with a good understanding of their behavior&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In this article, I will explain how I think image delivery should be designed for modern web services.&lt;/p&gt;

&lt;h2&gt;
  
  
  Image size matters more than quality
&lt;/h2&gt;

&lt;p&gt;For JPEG-like images, there are two major parameters: image size and quality. But if your goal is visual quality, image size is more important than quality.&lt;/p&gt;

&lt;p&gt;When an image is too small, people notice it immediately. A difference in quality is often more subtle. Because of that, the first step should not be fine-tuning quality. The first step should be making sure the image is large enough.&lt;/p&gt;

&lt;p&gt;Today, devices like iPhones with Retina displays are normal. On smartphones, an image often needs to be 2x larger than its display size, and sometimes 3x larger, to look good. The visual quality of thumbnail images directly affects user experience. In some services, it can even affect click-through rate or sales.&lt;/p&gt;

&lt;p&gt;Of course, larger images increase transfer size. That affects CDN cost, response time, and mobile data usage. But even with that trade-off, the right order is still important: first make sure the image is large enough, then tune quality and format.&lt;/p&gt;

&lt;p&gt;For photo-like images, this is especially important. If you want the same visual result, using a larger image with slightly lower quality can sometimes be more efficient than trying to compensate for an image that is too small by increasing quality. If the image is too small, raising quality does not solve the real problem.&lt;/p&gt;
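&lt;p&gt;The density math above can be sketched directly (the 320px display width is a made-up example):&lt;/p&gt;

```shell
# An image displayed at 320 CSS pixels needs a proportionally
# larger source image on high-density displays:
css_width=320
for density in 1 2 3; do
  echo "${density}x display: source should be $(( css_width * density ))px wide"
done
```

So a thumbnail that looks fine on a 1x desktop monitor can be badly undersized on a 2x or 3x smartphone screen.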

&lt;h2&gt;
  
  
  The backend should control image URL design
&lt;/h2&gt;

&lt;p&gt;A CDN or image optimization service can generate different sizes and formats from the original image. That is very useful. But it does not mean the frontend should build image URLs freely.&lt;/p&gt;

&lt;p&gt;If the frontend builds URLs on its own, long-term operation becomes harder.&lt;/p&gt;

&lt;p&gt;For example, later you may want to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;change parameter names&lt;/li&gt;
&lt;li&gt;change how quality is handled&lt;/li&gt;
&lt;li&gt;introduce signed URLs&lt;/li&gt;
&lt;li&gt;move to another image optimization service&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If the client owns the URL format, backward compatibility becomes painful. This is especially true for native apps, because old versions remain in use for a long time.&lt;/p&gt;

&lt;p&gt;This also matters for CDN cache efficiency. Image transformation works best when the same request conditions lead to the same URL. If the frontend calculates width and quality freely, many slightly different URLs will appear, and the cache will fragment. Even changing query parameter order can create a different URL. That lowers cache hit rate, increases transformation work, and adds more origin load and cost.&lt;/p&gt;

&lt;p&gt;Because of that, image URL design should be controlled by the backend. The backend should define the allowed sizes, quality policy, and format policy, then return valid candidate URLs. The frontend should use those candidates.&lt;/p&gt;
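&lt;p&gt;A minimal sketch of that idea (the host, parameter names, and allowed widths are all hypothetical): the backend only emits URLs from an allow-list of sizes, always with the same parameter order, so the same request conditions always produce the same cache key.&lt;/p&gt;

```shell
# Hypothetical backend-side URL builder: allow-listed widths only,
# fixed parameter order so the CDN cache key never fragments.
build_image_url() {
  path=$1
  width=$2
  case "$width" in
    320|640|1280) ;;                                   # allowed sizes only
    *) echo "unsupported width: ${width}" >&2; return 1 ;;
  esac
  # width first, then quality: never vary the parameter order
  echo "https://img.example.com${path}?w=${width}&q=80"
}

build_image_url /products/42.jpg 640
# -> https://img.example.com/products/42.jpg?w=640&q=80
```

Rejecting arbitrary widths also prevents clients from generating unbounded URL variants against the transformation service.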

&lt;h2&gt;
  
  
  Return multiple candidate URLs for the same image, then let the browser choose
&lt;/h2&gt;

&lt;p&gt;Even if the backend controls image URLs, that does not mean the application has to decide the exact image for every device by itself.&lt;/p&gt;

&lt;p&gt;With &lt;code&gt;picture&lt;/code&gt; and &lt;code&gt;srcset&lt;/code&gt;, browsers can choose images based on display width and pixel density. Those are standard browser features and they are already good enough for this job.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://developer.mozilla.org/en-US/docs/Web/HTML/Reference/Elements/picture" rel="noopener noreferrer"&gt;https://developer.mozilla.org/en-US/docs/Web/HTML/Reference/Elements/picture&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://developer.mozilla.org/en-US/docs/Web/HTML/Guides/Responsive_images" rel="noopener noreferrer"&gt;https://developer.mozilla.org/en-US/docs/Web/HTML/Guides/Responsive_images&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The important design is this:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;for the same image, the backend returns multiple candidate URLs&lt;/li&gt;
&lt;li&gt;the browser chooses the best one&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For example, for one product image, the backend can prepare several URLs for different sizes. Then &lt;code&gt;picture&lt;/code&gt; or &lt;code&gt;srcset&lt;/code&gt; can let the browser choose the right one for the current layout and display density.&lt;/p&gt;

&lt;p&gt;This keeps URL design under backend control while still using the browser’s built-in image selection features.&lt;/p&gt;

&lt;h2&gt;
  
  
  A low-effort starting point: use the &lt;code&gt;Accept&lt;/code&gt; header
&lt;/h2&gt;

&lt;p&gt;If you want to introduce modern image formats, you do not need to start with full &lt;code&gt;picture&lt;/code&gt; and &lt;code&gt;srcset&lt;/code&gt; support everywhere.&lt;/p&gt;

&lt;p&gt;A lower-effort starting point is to use the &lt;code&gt;Accept&lt;/code&gt; header.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;Accept&lt;/code&gt; header tells the server what content types the client can receive. If your CDN or image optimization service supports it, the image delivery side can look at that header and return AVIF or WebP for supported clients, and JPEG or PNG for others, while keeping the same image URL.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://developer.mozilla.org/en-US/docs/Web/HTTP/Reference/Headers/Accept" rel="noopener noreferrer"&gt;https://developer.mozilla.org/en-US/docs/Web/HTTP/Reference/Headers/Accept&lt;/a&gt;&lt;/p&gt;
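&lt;p&gt;The decision itself is simple. A sketch of the logic (the &lt;code&gt;Accept&lt;/code&gt; value is an example of what a modern browser sends; in practice this runs inside the CDN or transformation layer, not in shell):&lt;/p&gt;

```shell
# Example Accept header from a modern browser image request:
accept="image/avif,image/webp,image/apng,image/*,*/*;q=0.8"

# Prefer AVIF, then WebP, with JPEG/PNG as the fallback:
case "$accept" in
  *image/avif*) echo "serve AVIF" ;;
  *image/webp*) echo "serve WebP" ;;
  *)            echo "serve JPEG/PNG fallback" ;;
esac
```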

&lt;p&gt;This is practical because it does not require large template changes.&lt;/p&gt;

&lt;p&gt;But this method has an important limitation: it mainly solves format negotiation, not size selection.&lt;/p&gt;

&lt;p&gt;It also depends on where the logic runs. This approach works when the CDN or image transformation service that actually serves the image supports &lt;code&gt;Accept&lt;/code&gt;-based negotiation. It is difficult to complete this logic only in the API that returns image URLs, because that API does not directly handle the final &lt;code&gt;Accept&lt;/code&gt; header of the browser’s image request.&lt;/p&gt;

&lt;p&gt;Many image optimization services already support this kind of format switching, so it is often a very practical first step.&lt;/p&gt;

&lt;h2&gt;
  
  
  A stronger approach: use &lt;code&gt;picture&lt;/code&gt; and &lt;code&gt;srcset&lt;/code&gt;
&lt;/h2&gt;

&lt;p&gt;If you want to optimize not only format but also size, &lt;code&gt;picture&lt;/code&gt; and &lt;code&gt;srcset&lt;/code&gt; are much more powerful.&lt;/p&gt;

&lt;p&gt;Here is a simple example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight html"&gt;&lt;code&gt;&lt;span class="nt"&gt;&amp;lt;picture&amp;gt;&lt;/span&gt;
  &lt;span class="nt"&gt;&amp;lt;source&lt;/span&gt;
    &lt;span class="na"&gt;type=&lt;/span&gt;&lt;span class="s"&gt;"image/avif"&lt;/span&gt;
    &lt;span class="na"&gt;srcset=&lt;/span&gt;&lt;span class="s"&gt;"
      /images/example-640.avif 640w,
      /images/example-1280.avif 1280w
    "&lt;/span&gt;
    &lt;span class="na"&gt;sizes=&lt;/span&gt;&lt;span class="s"&gt;"(max-width: 640px) 100vw, 640px"&lt;/span&gt;
  &lt;span class="nt"&gt;&amp;gt;&lt;/span&gt;
  &lt;span class="nt"&gt;&amp;lt;img&lt;/span&gt;
    &lt;span class="na"&gt;src=&lt;/span&gt;&lt;span class="s"&gt;"/images/example-1280.jpg"&lt;/span&gt;
    &lt;span class="na"&gt;alt=&lt;/span&gt;&lt;span class="s"&gt;"example"&lt;/span&gt;
  &lt;span class="nt"&gt;&amp;gt;&lt;/span&gt;
&lt;span class="nt"&gt;&amp;lt;/picture&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This lets the browser choose both by format and by size.&lt;/p&gt;

&lt;p&gt;If you only want a low-effort improvement, &lt;code&gt;Accept&lt;/code&gt;-based format switching is a good start. If you want a more complete solution that also handles responsive size selection, &lt;code&gt;picture&lt;/code&gt; and &lt;code&gt;srcset&lt;/code&gt; are the right tools.&lt;/p&gt;

&lt;p&gt;They do require template changes, so the cost is higher. But for thumbnails, product images, and other places where image quality and transfer size matter a lot, the benefit is worth that cost.&lt;/p&gt;

&lt;h2&gt;
  
  
  WebP is very useful, but it is not magic
&lt;/h2&gt;

&lt;p&gt;WebP is still a very useful format. For photo-like images, it can often provide smaller files than JPEG while keeping good visual quality.&lt;/p&gt;

&lt;p&gt;A key advantage of WebP is that it tends to avoid some of the visible artifacts that appear when JPEG quality is pushed too low. Because of that, WebP can often use lower quality values than JPEG.&lt;/p&gt;

&lt;p&gt;But that does not mean you should convert everything to WebP without thinking.&lt;/p&gt;

&lt;p&gt;A common mistake is to focus too much on quality and forget size. Even with WebP, the better approach is still to secure enough image size first, then lower quality as needed.&lt;/p&gt;

&lt;h2&gt;
  
  
  Some images need extra care with WebP
&lt;/h2&gt;

&lt;p&gt;WebP is not equally good for every kind of image.&lt;/p&gt;

&lt;p&gt;Text images and pixel-art-like images can be tricky. The default settings of &lt;code&gt;cwebp&lt;/code&gt; are tuned for photo compression, and that can produce a slightly blurred look. For photos, this is often acceptable. For text and pixel art, it can look like visible degradation.&lt;/p&gt;

&lt;p&gt;In some cases, &lt;code&gt;cwebp&lt;/code&gt; presets such as &lt;code&gt;text&lt;/code&gt; or &lt;code&gt;icon&lt;/code&gt; can improve the result. But in a real service, it is not always easy to choose different conversion parameters for every image type.&lt;/p&gt;

&lt;p&gt;Because of that, it is often reasonable not to force WebP conversion for images that are already very small, such as text-heavy images or simple pixel art. WebP should be used where it has clear value.&lt;/p&gt;

&lt;p&gt;Also, WebP has no exact equivalent of JPEG progressive rendering or PNG/GIF interlacing. If your service depends on those loading behaviors, you should verify the difference before rollout.&lt;/p&gt;

&lt;h2&gt;
  
  
  AVIF is a strong choice for new support
&lt;/h2&gt;

&lt;p&gt;AVIF is another important modern image format. Today, browser support for AVIF is already broad, and the support gap between AVIF and WebP is much smaller than it used to be.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://caniuse.com/avif" rel="noopener noreferrer"&gt;https://caniuse.com/avif&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In practice, there are image types where WebP needs extra care, but AVIF tends to show fewer obvious issues. AVIF also works naturally with &lt;code&gt;Accept&lt;/code&gt;-based content negotiation.&lt;/p&gt;

&lt;p&gt;AVIF is often said to have heavier encoding cost, but when you use a CDN or image optimization service, you usually do not have to bear that cost directly. What matters more is whether the format works naturally in real browser environments and whether it behaves well for your images.&lt;/p&gt;

&lt;p&gt;Because of that, for a new implementation, it is often realistic to keep JPEG or another traditional format as the fallback and support only AVIF as the modern format.&lt;/p&gt;

&lt;p&gt;In other words:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;return AVIF for clients that support it&lt;/li&gt;
&lt;li&gt;return JPEG or another fallback for the rest&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Today, in many cases, it is no longer necessary to support WebP as well just because it is a modern image format.&lt;/p&gt;

&lt;p&gt;That said, quality numbers are not directly comparable between JPEG, WebP, and AVIF. You still need to tune them by looking at real output.&lt;/p&gt;

&lt;h2&gt;
  
  
  The main way to detect WebP support is the &lt;code&gt;Accept&lt;/code&gt; header
&lt;/h2&gt;

&lt;p&gt;If you want to return WebP only to supported clients, the main method is to look at the &lt;code&gt;Accept&lt;/code&gt; header.&lt;/p&gt;

&lt;p&gt;Browsers include supported content types in that header. If &lt;code&gt;image/webp&lt;/code&gt; is present, the image delivery side can treat the client as WebP-capable.&lt;/p&gt;

&lt;p&gt;The important point is that this logic should not be forced into the API that returns image URLs. API requests often use &lt;code&gt;Accept&lt;/code&gt; values for JSON and do not directly reflect the final image request. That is why this method works naturally at the image-serving CDN or transformation layer.&lt;/p&gt;

&lt;h2&gt;
  
  
  JavaScript detection is not the main path, but it is useful for stricter checks
&lt;/h2&gt;

&lt;p&gt;In some cases, header-based detection is not enough.&lt;/p&gt;

&lt;p&gt;JavaScript can detect support more strictly by loading a small test image. This is useful if you want to check not only general WebP support but also features like lossless, alpha, or animation.&lt;/p&gt;

&lt;p&gt;Google’s WebP FAQ shows code like this:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://developers.google.com/speed/webp/faq" rel="noopener noreferrer"&gt;https://developers.google.com/speed/webp/faq&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;check_webp_feature&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;feature&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;callback&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kd"&gt;var&lt;/span&gt; &lt;span class="nx"&gt;kTestImages&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="na"&gt;lossy&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;UklGRiIAAABXRUJQVlA4IBYAAAAwAQCdASoBAAEADsD+JaQAA3AAAAAA&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="na"&gt;lossless&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;UklGRhoAAABXRUJQVlA4TA0AAAAvAAAAEAcQERGIiP4HAA==&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="na"&gt;alpha&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;UklGRkoAAABXRUJQVlA4WAoAAAAQAAAAAAAAAAAAQUxQSAwAAAARBxAR/Q9ERP8DAABWUDggGAAAABQBAJ0BKgEAAQAAAP4AAA3AAP7mtQAAAA==&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="na"&gt;animation&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;UklGRlIAAABXRUJQVlA4WAoAAAASAAAAAAAAAAAAQU5JTQYAAAD/////AABBTk1GJgAAAAAAAAAAAAAAAAAAAGQAAABWUDhMDQAAAC8AAAAQBxAREYiI/gcA&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;
    &lt;span class="p"&gt;};&lt;/span&gt;
    &lt;span class="kd"&gt;var&lt;/span&gt; &lt;span class="nx"&gt;img&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Image&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
    &lt;span class="nx"&gt;img&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;onload&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;function &lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="kd"&gt;var&lt;/span&gt; &lt;span class="nx"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;img&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;width&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;img&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;height&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
        &lt;span class="nf"&gt;callback&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;feature&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;result&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="p"&gt;};&lt;/span&gt;
    &lt;span class="nx"&gt;img&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;onerror&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;function &lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="nf"&gt;callback&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;feature&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="p"&gt;};&lt;/span&gt;
    &lt;span class="nx"&gt;img&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;src&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;data:image/webp;base64,&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="nx"&gt;kTestImages&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;feature&lt;/span&gt;&lt;span class="p"&gt;];&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is helpful when you need very strict compatibility checks. But it is not the main path for normal WebP rollout. In most cases, &lt;code&gt;Accept&lt;/code&gt;-based handling plus &lt;code&gt;picture&lt;/code&gt; when needed is enough.&lt;/p&gt;

&lt;h2&gt;
  
  
  There used to be compatibility issues that affected delivery strategy
&lt;/h2&gt;

&lt;p&gt;In the past, image delivery strategy was affected by compatibility issues such as Safari not supporting WebP and Android 4.3 and below being difficult to support reliably. On Android 4.3 and below, even if surface-level detection suggested WebP support, real rendering could still break.&lt;/p&gt;

&lt;p&gt;That situation has changed a lot. Safari 14 added WebP support, and the impact of very old Android environments has become much smaller as more services dropped support for TLS versions below 1.2.&lt;/p&gt;

&lt;p&gt;Today, this is more of a historical compatibility concern than a core design issue.&lt;/p&gt;

&lt;h2&gt;
  
  
  You should monitor whether large images are being served by mistake
&lt;/h2&gt;

&lt;p&gt;When people talk about image optimization, they often focus on formats and transformation logic. But in real operation, it is also important to monitor whether oversized images are being served by mistake.&lt;/p&gt;

&lt;p&gt;Even with a good design, mistakes happen. Configuration issues or unexpected input images can result in large objects being delivered. That directly affects bandwidth, response time, and CDN cost.&lt;/p&gt;

&lt;p&gt;Because of that, you should use metrics from your CDN or monitoring system to observe object size ranges and transfer volume continuously.&lt;/p&gt;

&lt;p&gt;In practice, bugs often appear in edge cases. For example, a very tall or very wide image with an unusual aspect ratio can trigger an unexpected resize result and cause a much larger image to be served than intended. You will not catch that by thinking only about format choice. You need to observe the actual object sizes being delivered.&lt;/p&gt;

&lt;h2&gt;
  
  
  Converting images is easy, but operating an image transformation service is not
&lt;/h2&gt;

&lt;p&gt;If you only want to convert JPEG to WebP on your own machine, that is easy. If you use Homebrew on macOS, you can install the tools like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;brew &lt;span class="nb"&gt;install &lt;/span&gt;webp
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;cwebp&lt;/code&gt; converts images to WebP, and &lt;code&gt;dwebp&lt;/code&gt; converts WebP images to other formats.&lt;/p&gt;

&lt;p&gt;But that does not mean building and operating your own image transformation service is easy.&lt;/p&gt;

&lt;p&gt;In real services, you will constantly receive user-uploaded images that are technically broken but still need to be handled. For example, JPEG files with corrupted ICC profiles still show up in real systems. Even then, you need to transform them while keeping color shifts as small as possible.&lt;/p&gt;

&lt;p&gt;That is why image optimization services and image CDNs are valuable. The hard part is not running a conversion command. The hard part is operating the full system safely with broken files, strange inputs, and edge cases.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;If I were designing image delivery for a modern web service, I would think about it like this.&lt;/p&gt;

&lt;p&gt;First, image size matters more than quality. You should deliver images that are large enough for the real display environment, especially now that high-density displays are normal.&lt;/p&gt;

&lt;p&gt;Second, image URL design should be controlled by the backend. The backend should return grouped candidate URLs for the same image, and the browser should choose one.&lt;/p&gt;

&lt;p&gt;Third, if you want a low-effort starting point, use &lt;code&gt;Accept&lt;/code&gt;-based format negotiation. If you want a more complete solution, use &lt;code&gt;picture&lt;/code&gt; and &lt;code&gt;srcset&lt;/code&gt; so that size selection is also handled properly.&lt;/p&gt;

&lt;p&gt;Fourth, modern formats are useful, but they are not interchangeable in every case. WebP is still valuable, but some image types need extra care. AVIF is now broadly supported and can be a very practical choice when paired with a traditional fallback format.&lt;/p&gt;

&lt;p&gt;Finally, do not stop at conversion logic. You also need monitoring, and you need to think about real operational bugs such as oversized outputs caused by unusual aspect ratios.&lt;/p&gt;

&lt;p&gt;Understanding these points and using a CDN or image optimization service well is one of the most practical ways to improve image delivery and user experience today.&lt;/p&gt;

</description>
      <category>webdev</category>
      <category>performance</category>
      <category>cdn</category>
      <category>images</category>
    </item>
    <item>
      <title>Safely Updating Existing Files in Go</title>
      <dc:creator>catatsuy</dc:creator>
      <pubDate>Sun, 05 Apr 2026 07:30:24 +0000</pubDate>
      <link>https://dev.to/catatsuy/safely-updating-existing-files-in-go-1hlc</link>
      <guid>https://dev.to/catatsuy/safely-updating-existing-files-in-go-1hlc</guid>
      <description>&lt;p&gt;If you want to safely update an existing file in Go, the basic rule is simple: &lt;strong&gt;do not write to the original file directly&lt;/strong&gt;.&lt;br&gt;
Instead, write the new content to a temporary file first, and replace the original file with &lt;code&gt;rename&lt;/code&gt; only after the write is fully complete.&lt;/p&gt;

&lt;p&gt;In this article, I focus on Linux and explain three points:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;why direct overwrite is dangerous&lt;/li&gt;
&lt;li&gt;why using &lt;code&gt;os.CreateTemp("", ...)&lt;/code&gt; can cause problems&lt;/li&gt;
&lt;li&gt;what to watch out for when systemd &lt;code&gt;PrivateTmp&lt;/code&gt; is enabled&lt;/li&gt;
&lt;/ol&gt;
&lt;h2&gt;
  
  
  Why you should not overwrite a file directly
&lt;/h2&gt;

&lt;p&gt;When you build a CLI tool that updates an existing file, such as a formatter like &lt;code&gt;gofmt -w&lt;/code&gt; or a tool that regenerates a cache file, it is better not to open the original file and overwrite it in place.&lt;/p&gt;

&lt;p&gt;There are three reasons.&lt;/p&gt;
&lt;h3&gt;
  
  
  1. If the process crashes halfway, the original file may be left incomplete
&lt;/h3&gt;

&lt;p&gt;If an error happens during writing, or the process crashes before it finishes, the original file may be left in a partially written state.&lt;/p&gt;

&lt;p&gt;That means even a small update can destroy the whole file.&lt;/p&gt;

&lt;p&gt;So the final replacement must happen only after all data has already been written successfully.&lt;/p&gt;
&lt;h3&gt;
  
  
  2. Other processes may read the file at the same time
&lt;/h3&gt;

&lt;p&gt;Another process may already be reading that file.&lt;/p&gt;

&lt;p&gt;If you overwrite the file directly, that process may observe a partially written file.&lt;/p&gt;

&lt;p&gt;A safer pattern is to write the full content to a separate file first, and replace the destination only after the new file is complete.&lt;/p&gt;

&lt;p&gt;On Linux, &lt;code&gt;rename&lt;/code&gt; is atomic when the source and destination are on the same filesystem.&lt;br&gt;
Because of that, readers opening the destination path will typically see either the old file or the new file, but not a half-written replacement.&lt;/p&gt;
&lt;h3&gt;
  
  
  3. Concurrent writes become easier to reason about
&lt;/h3&gt;

&lt;p&gt;If two or more processes try to overwrite the same file at the same time, the result may become unclear or corrupted.&lt;/p&gt;

&lt;p&gt;If each process writes to its own temporary file and only calls &lt;code&gt;rename&lt;/code&gt; after completion, the behavior becomes easier to reason about.&lt;/p&gt;

&lt;p&gt;In that case, whichever process renames its file last wins.&lt;/p&gt;

&lt;p&gt;So the basic pattern is:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;write everything to another file&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;rename&lt;/code&gt; it to the final path&lt;/li&gt;
&lt;/ol&gt;
&lt;h2&gt;
  
  
  A common trap with &lt;code&gt;os.CreateTemp&lt;/code&gt;
&lt;/h2&gt;

&lt;p&gt;In Go, a common way to create a temporary file is this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;CreateTemp&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;""&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"tmp-*"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If the first argument is an empty string, Go creates the file under &lt;code&gt;os.TempDir()&lt;/code&gt;.&lt;br&gt;
On Linux, that is usually &lt;code&gt;/tmp&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;The problem is that &lt;code&gt;/tmp&lt;/code&gt; may be on a different filesystem from the directory where you want to place the final file.&lt;/p&gt;

&lt;p&gt;For example:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;/tmp&lt;/code&gt; may be a &lt;code&gt;tmpfs&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;your application data may be under &lt;code&gt;/var/lib/...&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;or your target file may be under &lt;code&gt;/home/...&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In that case, &lt;code&gt;rename&lt;/code&gt; from &lt;code&gt;/tmp&lt;/code&gt; to the destination file fails with this error:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;invalid cross-device link
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;On Linux, &lt;code&gt;rename&lt;/code&gt; works only within the same filesystem.&lt;/p&gt;

&lt;p&gt;So if you already know the final destination, you should create the temporary file in the destination directory from the beginning.&lt;/p&gt;

&lt;h2&gt;
  
  
  Create the temporary file in the destination directory
&lt;/h2&gt;

&lt;p&gt;Instead of creating a temporary file under &lt;code&gt;/tmp&lt;/code&gt;, do this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="n"&gt;dst&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="s"&gt;"/var/lib/myapp/cache.json"&lt;/span&gt;

&lt;span class="n"&gt;dir&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;filepath&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Dir&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dst&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;tmp&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;CreateTemp&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dir&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;".tmp-*"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="k"&gt;defer&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Remove&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tmp&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Name&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt; &lt;span class="c"&gt;// cleanup if something fails&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This way, &lt;code&gt;tmp&lt;/code&gt; and &lt;code&gt;dst&lt;/code&gt; are in the same directory, so they are also on the same filesystem.&lt;/p&gt;

&lt;p&gt;Then you can replace the destination with &lt;code&gt;os.Rename&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;tmp&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Close&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Rename&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tmp&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Name&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="n"&gt;dst&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The important order is:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;write all data to the temporary file&lt;/li&gt;
&lt;li&gt;close it&lt;/li&gt;
&lt;li&gt;rename it&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Also note that the default permission of the temporary file is &lt;code&gt;0600&lt;/code&gt;, so if you need different permissions, you should change them explicitly.&lt;/p&gt;

&lt;h2&gt;
  
  
  A pitfall with systemd &lt;code&gt;PrivateTmp&lt;/code&gt;
&lt;/h2&gt;

&lt;p&gt;This is not specific to Go. It is a Linux and systemd topic.&lt;/p&gt;

&lt;p&gt;This matters because some programs assume that a file created under &lt;code&gt;/tmp&lt;/code&gt; is visible to other processes.&lt;/p&gt;

&lt;p&gt;If &lt;code&gt;PrivateTmp&lt;/code&gt; is enabled for a service, the process still sees a directory called &lt;code&gt;/tmp&lt;/code&gt;, but that directory is isolated from &lt;code&gt;/tmp&lt;/code&gt; used by other processes.&lt;/p&gt;

&lt;p&gt;That can cause problems like this:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Process A creates a temporary file in &lt;code&gt;/tmp&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Process B tries to read the same path under &lt;code&gt;/tmp&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;but Process B cannot see it&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;So if your design assumes that a temporary file in &lt;code&gt;/tmp&lt;/code&gt; is shared with other processes, that assumption breaks when &lt;code&gt;PrivateTmp&lt;/code&gt; is enabled.&lt;/p&gt;

&lt;p&gt;In that case, you need one of these approaches:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;disable &lt;code&gt;PrivateTmp&lt;/code&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;this is useful if you want to keep using the normal &lt;code&gt;/tmp&lt;/code&gt; cleanup behavior&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;p&gt;create and use a shared directory that all related processes can access&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;but make sure the directory exists, or file creation will fail&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;p&gt;If you use the pattern from this article — creating the temporary file in the final destination directory — you usually avoid this problem from the start.&lt;/p&gt;

&lt;h2&gt;
  
  
  Think about concurrent execution
&lt;/h2&gt;

&lt;p&gt;You should also think about what happens when multiple processes update the same file at the same time.&lt;/p&gt;

&lt;p&gt;If your requirement is simply that the last completed update wins, the approach above is usually enough.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;os.CreateTemp&lt;/code&gt; generates unique file names, so multiple processes can create temporary files in the same directory without colliding on file names.&lt;/p&gt;

&lt;p&gt;But if you need stricter control, such as:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;do not allow overwriting with stale data&lt;/li&gt;
&lt;li&gt;only write if the file generation is still based on the latest state&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;then you need additional coordination such as a lock file or &lt;code&gt;flock&lt;/code&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Do not overwrite the original file directly&lt;/li&gt;
&lt;li&gt;Always write the full content to another file first, then replace it with &lt;code&gt;rename&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;If you want the final replacement to be atomic, create the temporary file in the same directory as the destination file&lt;/li&gt;
&lt;li&gt;If the temporary file is on a different filesystem, &lt;code&gt;rename&lt;/code&gt; fails with &lt;code&gt;invalid cross-device link&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;If systemd &lt;code&gt;PrivateTmp&lt;/code&gt; is enabled, &lt;code&gt;/tmp&lt;/code&gt; is not shared in the way you may expect, so either disable it or use a shared directory&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>go</category>
      <category>linux</category>
      <category>files</category>
      <category>systemd</category>
    </item>
    <item>
      <title>I Built lls, a Go CLI to List 33.12 Million Files</title>
      <dc:creator>catatsuy</dc:creator>
      <pubDate>Sun, 29 Mar 2026 06:56:46 +0000</pubDate>
      <link>https://dev.to/catatsuy/i-built-lls-a-go-cli-to-list-3312-million-files-4bm9</link>
      <guid>https://dev.to/catatsuy/i-built-lls-a-go-cli-to-list-3312-million-files-4bm9</guid>
      <description>&lt;p&gt;Sometimes a problem looks simple at first.&lt;/p&gt;

&lt;p&gt;In my case, I needed a complete file list from a huge directory on storage mounted over NFS from an application server. At first, this sounded like something existing tools should be able to handle. But once the number of files became extremely large, that assumption stopped being true.&lt;/p&gt;

&lt;p&gt;I eventually built a Go CLI called &lt;code&gt;lls&lt;/code&gt; to solve this problem.&lt;/p&gt;

&lt;p&gt;This was not a toy project. I built &lt;code&gt;lls&lt;/code&gt; to solve a real production problem, and in the end it was able to list &lt;strong&gt;33.12 million files&lt;/strong&gt; from a single directory on NFS.&lt;/p&gt;

&lt;p&gt;Repository:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/catatsuy/lls" rel="noopener noreferrer"&gt;https://github.com/catatsuy/lls&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In this article, I will explain what failed, why I decided to use the Linux &lt;code&gt;getdents64&lt;/code&gt; system call directly, how the implementation works, and how &lt;code&gt;lls&lt;/code&gt; finally solved the problem.&lt;/p&gt;

&lt;h2&gt;
  
  
  The problem
&lt;/h2&gt;

&lt;p&gt;The directory I had to deal with was on storage mounted over NFS, and it contained an extremely large number of files.&lt;/p&gt;

&lt;p&gt;If a directory is small, &lt;code&gt;ls&lt;/code&gt; and &lt;code&gt;find&lt;/code&gt; are usually enough. But once the number of files becomes too large, even getting a complete file list becomes difficult. And when the directory is on NFS instead of local storage, the situation can become even worse.&lt;/p&gt;

&lt;p&gt;What I needed was simple in theory: get the full list of files and finish successfully. In practice, that turned out to be the hard part.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;code&gt;ls -U1&lt;/code&gt; and &lt;code&gt;find&lt;/code&gt; could not finish
&lt;/h2&gt;

&lt;p&gt;The first thing I tried was &lt;code&gt;ls -U1&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;I disabled sorting because sorting is one of the well-known reasons &lt;code&gt;ls&lt;/code&gt; becomes painful on huge directories. But even with &lt;code&gt;ls -U1&lt;/code&gt;, it still could not finish. The number varied from run to run, but at best it stopped after outputting about &lt;strong&gt;6 million files&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;I did not fully investigate why it stopped, but I suspected the storage server might have stopped responding.&lt;/p&gt;

&lt;p&gt;Next, I tried &lt;code&gt;find&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;I thought &lt;code&gt;find&lt;/code&gt; might handle more entries than &lt;code&gt;ls&lt;/code&gt;, but it also failed. The result also varied, but at best it output about &lt;strong&gt;12 million lines&lt;/strong&gt; before it stopped responding.&lt;/p&gt;

&lt;p&gt;At that point, I was almost ready to give up. I started thinking I might have to output part of the file list, delete files in multiple rounds, and somehow work around the problem manually.&lt;/p&gt;

&lt;p&gt;But I wanted a real solution.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why I built &lt;code&gt;lls&lt;/code&gt;
&lt;/h2&gt;

&lt;p&gt;Around that time, I found an article describing how someone listed a directory containing 8 million files by calling &lt;code&gt;getdents&lt;/code&gt; directly with a large buffer. That was the key idea I needed. The article showed the C approach, but not a ready-to-use implementation, so I decided to build my own tool in Go.&lt;/p&gt;

&lt;p&gt;&lt;a href="http://be-n.com/spw/you-can-list-a-million-files-in-a-directory-but-not-with-ls.html" rel="noopener noreferrer"&gt;http://be-n.com/spw/you-can-list-a-million-files-in-a-directory-but-not-with-ls.html&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In Go, I could call Linux-specific system calls through the &lt;code&gt;syscall&lt;/code&gt; package. That meant I could stay in Go, avoid cgo, and still work directly with the kernel interface I needed.&lt;/p&gt;

&lt;p&gt;That was how &lt;code&gt;lls&lt;/code&gt; started.&lt;/p&gt;

&lt;p&gt;The point of &lt;code&gt;lls&lt;/code&gt; was not to replace &lt;code&gt;ls&lt;/code&gt; in general. It was a narrow tool for one difficult job: keep reading directory entries from a huge directory until the end.&lt;/p&gt;

&lt;h2&gt;
  
  
  System calls and &lt;code&gt;getdents64&lt;/code&gt;
&lt;/h2&gt;

&lt;p&gt;On Linux, userland programs ask the kernel to do work through system calls. Directory reading is no exception.&lt;/p&gt;

&lt;p&gt;For this problem, the important system call was &lt;code&gt;getdents64&lt;/code&gt;. The older &lt;code&gt;getdents&lt;/code&gt; exists, but &lt;code&gt;getdents64&lt;/code&gt; was added because the original interface did not handle large filesystems and large file offsets well. In Go, the function exposed as &lt;code&gt;syscall.Getdents&lt;/code&gt; uses &lt;code&gt;getdents64&lt;/code&gt;, which was exactly what I needed here.&lt;/p&gt;

&lt;p&gt;The returned data is not a high-level file list. It is raw directory-entry data packed into a byte buffer.&lt;/p&gt;

&lt;p&gt;Conceptually, the data corresponds to a structure with fields such as inode number, offset, record length, type, and a null-terminated file name.&lt;/p&gt;

&lt;p&gt;That detail matters, because if you use &lt;code&gt;getdents64&lt;/code&gt; directly, you have to parse the buffer yourself.&lt;/p&gt;

&lt;h2&gt;
  
  
  My first idea: one large buffer
&lt;/h2&gt;

&lt;p&gt;My first idea was simple:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;allocate a buffer based on the directory size&lt;/li&gt;
&lt;li&gt;call &lt;code&gt;getdents64&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;print each file name to standard output&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The directory size here was the same value you can see with &lt;code&gt;ls -dl&lt;/code&gt;. The idea was that this value should be large enough to hold the full result. If the buffer was smaller than what was really needed, the output would be incomplete.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;lls&lt;/code&gt; also had a &lt;code&gt;-buf-size&lt;/code&gt; option so I could adjust the size manually, and a &lt;code&gt;-debug&lt;/code&gt; option to show how much of the buffer was actually used.&lt;/p&gt;

&lt;p&gt;However, on the real directory, this did not work as expected.&lt;/p&gt;

&lt;p&gt;The directory size reported by &lt;code&gt;ls -dl&lt;/code&gt; was over 2 GB, and running &lt;code&gt;lls&lt;/code&gt; with that default buffer size produced &lt;code&gt;EINVAL&lt;/code&gt;. After trying different values, I found that &lt;code&gt;2147483647&lt;/code&gt; worked but &lt;code&gt;2147483648&lt;/code&gt; did not, which suggests the buffer size had to fit in a signed 32-bit integer.&lt;/p&gt;

&lt;p&gt;Even after increasing the buffer size as much as possible, that approach still was not the real solution. The important point was not “make one call with a bigger buffer.” The real solution was to change the design.&lt;/p&gt;

&lt;h2&gt;
  
  
  The real fix: call &lt;code&gt;getdents64&lt;/code&gt; repeatedly
&lt;/h2&gt;

&lt;p&gt;The real fix was to stop thinking in terms of one huge call.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;getdents64&lt;/code&gt; can be called repeatedly. If you keep calling it until it returns &lt;code&gt;0&lt;/code&gt;, you can continue reading the remaining directory entries.&lt;/p&gt;

&lt;p&gt;This became the key change in &lt;code&gt;lls&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Instead of relying on a single enormous buffer, &lt;code&gt;lls&lt;/code&gt; now uses a reasonable buffer and keeps calling &lt;code&gt;syscall.Getdents&lt;/code&gt; until the directory is fully consumed. That change made it possible to list &lt;strong&gt;all 33.12 million files&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;That was the point where &lt;code&gt;lls&lt;/code&gt; became a practical tool for extremely large directories rather than a one-shot experiment.&lt;/p&gt;

&lt;h2&gt;
  
  
  The core implementation
&lt;/h2&gt;

&lt;p&gt;The implementation in &lt;code&gt;lls&lt;/code&gt; is built around &lt;code&gt;syscall.Dirent&lt;/code&gt;, which Go defines on Linux with fields like inode number, offset, record length, type, and a fixed-size name field.&lt;/p&gt;

&lt;p&gt;The core loop is straightforward:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;allocate a buffer&lt;/li&gt;
&lt;li&gt;call &lt;code&gt;syscall.Getdents(int(f.Fd()), buf)&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;if the return value is &lt;code&gt;0&lt;/code&gt;, stop&lt;/li&gt;
&lt;li&gt;otherwise parse the returned bytes entry by entry&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The most important part is how the parsing works.&lt;/p&gt;

&lt;p&gt;The returned buffer contains multiple directory entries. Each entry has a variable size, so the code cannot move by a fixed structure size. Instead, it casts the current position in the buffer to &lt;code&gt;*syscall.Dirent&lt;/code&gt;, reads &lt;code&gt;Reclen&lt;/code&gt;, and moves forward by that many bytes.&lt;/p&gt;

&lt;p&gt;That is how it walks through the buffer correctly.&lt;/p&gt;

&lt;p&gt;The code also checks &lt;code&gt;Ino&lt;/code&gt;. If the inode number is &lt;code&gt;0&lt;/code&gt;, the entry is skipped, because that means the file no longer exists.&lt;/p&gt;

&lt;p&gt;For file names, the implementation uses the &lt;code&gt;Name&lt;/code&gt; field from &lt;code&gt;syscall.Dirent&lt;/code&gt;. In Go this is &lt;code&gt;[256]int8&lt;/code&gt;, so the code first treats it as bytes and then converts the bytes before the terminating null into a string.&lt;/p&gt;

&lt;p&gt;In other words, the implementation stays intentionally close to the kernel interface:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;call &lt;code&gt;getdents64&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;interpret the returned bytes as directory entries&lt;/li&gt;
&lt;li&gt;move forward using &lt;code&gt;Reclen&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;extract the file name&lt;/li&gt;
&lt;li&gt;repeat until &lt;code&gt;getdents64&lt;/code&gt; returns &lt;code&gt;0&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Why this worked
&lt;/h2&gt;

&lt;p&gt;One especially useful detail is that libc's &lt;code&gt;readdir&lt;/code&gt; implementation often uses a fixed internal buffer. In one of the articles I read, the example used a 2048-byte buffer internally. If your directory is huge, that means a large number of system calls just to read through it.&lt;/p&gt;

&lt;p&gt;You cannot easily change that buffer size from outside, which is why directly calling &lt;code&gt;getdents64&lt;/code&gt; yourself can make sense in an extreme case like this.&lt;/p&gt;

&lt;p&gt;That does not mean low-level code is always better. It only means this particular problem was narrow enough, and extreme enough, that the lower-level interface matched the problem better than a general-purpose tool.&lt;/p&gt;

&lt;h2&gt;
  
  
  The result
&lt;/h2&gt;

&lt;p&gt;In the end, the comparison looked like this:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;ls -U1&lt;/code&gt;: about &lt;strong&gt;6 million files&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;find&lt;/code&gt;: about &lt;strong&gt;12 million lines&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;lls&lt;/code&gt;: &lt;strong&gt;33.12 million files&lt;/strong&gt; listed successfully&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That was the result that mattered.&lt;/p&gt;

&lt;p&gt;This was not just an experiment in system programming. It solved a real production problem on NFS. I needed the full file list, and &lt;code&gt;lls&lt;/code&gt; made that possible.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;I built &lt;code&gt;lls&lt;/code&gt; because standard tools could not finish the job on a huge NFS-mounted directory.&lt;/p&gt;

&lt;p&gt;The important ideas were:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;use the Linux &lt;code&gt;getdents64&lt;/code&gt; interface through Go's &lt;code&gt;syscall.Getdents&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;parse the returned directory entries directly&lt;/li&gt;
&lt;li&gt;advance through the buffer using &lt;code&gt;Reclen&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;keep calling the system call until it returns &lt;code&gt;0&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That approach finally made it possible to list &lt;strong&gt;33.12 million files&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;If you want to look at the code, here is the repository again:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/catatsuy/lls" rel="noopener noreferrer"&gt;https://github.com/catatsuy/lls&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Useful references:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://man7.org/linux/man-pages/man2/getdents.2.html" rel="noopener noreferrer"&gt;https://man7.org/linux/man-pages/man2/getdents.2.html&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://pkg.go.dev/syscall" rel="noopener noreferrer"&gt;https://pkg.go.dev/syscall&lt;/a&gt;&lt;/p&gt;

</description>
      <category>go</category>
      <category>linux</category>
      <category>nfs</category>
      <category>cli</category>
    </item>
    <item>
      <title>Designing a File Tampering Detection Tool for a Legacy PHP Application</title>
      <dc:creator>catatsuy</dc:creator>
      <pubDate>Fri, 20 Mar 2026 05:06:17 +0000</pubDate>
      <link>https://dev.to/catatsuy/designing-a-file-tampering-detection-tool-for-a-legacy-php-application-21p6</link>
      <guid>https://dev.to/catatsuy/designing-a-file-tampering-detection-tool-for-a-legacy-php-application-21p6</guid>
      <description>&lt;p&gt;I work on a legacy PHP application that runs on AWS EC2. The application is deployed from a deploy server with &lt;code&gt;rsync&lt;/code&gt;. In this environment, I needed a practical way to detect file tampering on application servers.&lt;/p&gt;

&lt;p&gt;Existing tools did not fit this deployment model well, so I built a small Go tool called &lt;code&gt;kekkai&lt;/code&gt; and open-sourced it. In this post, I want to explain not only the design choices, but also the implementation and operational details that mattered in practice.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/catatsuy/kekkai" rel="noopener noreferrer"&gt;https://github.com/catatsuy/kekkai&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The environment
&lt;/h2&gt;

&lt;p&gt;This application has these characteristics:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;it runs on AWS EC2&lt;/li&gt;
&lt;li&gt;it is a legacy PHP application&lt;/li&gt;
&lt;li&gt;dependencies are installed on a deploy server&lt;/li&gt;
&lt;li&gt;the application is deployed with &lt;code&gt;rsync&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is a common setup for older PHP applications. I wanted a solution that fits this environment instead of assuming container images or immutable deployments.&lt;/p&gt;

&lt;h2&gt;
  
  
  The basic model
&lt;/h2&gt;

&lt;p&gt;The model is simple.&lt;/p&gt;

&lt;p&gt;First, the deploy server calculates hashes for files and creates a manifest. The manifest can be stored either in S3 or in a local file. Then the application server verifies local files against that manifest.&lt;/p&gt;

&lt;p&gt;The tool has two main commands:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;generate&lt;/code&gt;: create a manifest from the current files&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;verify&lt;/code&gt;: compare current files with the manifest&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I wanted the data flow to stay easy to understand. The deploy server creates the trusted data, and the application server only reads it and verifies local files.&lt;/p&gt;

&lt;h2&gt;
  
  
  Manifest structure
&lt;/h2&gt;

&lt;p&gt;The manifest contains these values for each file:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;path&lt;/li&gt;
&lt;li&gt;SHA-256 hash&lt;/li&gt;
&lt;li&gt;file size&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It also contains the exclude rules used at generation time.&lt;/p&gt;

&lt;p&gt;I wanted the manifest itself to describe what should be checked. I did not want verification behavior to depend on extra local configuration on the application server.&lt;/p&gt;

&lt;p&gt;This is also why &lt;code&gt;verify&lt;/code&gt; does not accept additional exclude rules. If the application server is compromised, I do not want it to be able to silently skip more files.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why I only hash file contents
&lt;/h2&gt;

&lt;p&gt;I only check file contents. I do not check timestamps or other metadata.&lt;/p&gt;

&lt;p&gt;The reason is simple: metadata changes too easily. Normal operational work can change timestamps even when the file contents are still the same. If a tool alerts on that, it creates noisy alerts, and eventually people stop trusting the alerts.&lt;/p&gt;

&lt;p&gt;This is also why I did not want an approach that archives the whole source tree into a tar file and hashes that tar file. A tar file can change for reasons that do not mean the application code was tampered with. I wanted the tool to fail only when the content of an actual file changed.&lt;/p&gt;

&lt;h2&gt;
  
  
  How &lt;code&gt;generate&lt;/code&gt; works
&lt;/h2&gt;

&lt;p&gt;The &lt;code&gt;generate&lt;/code&gt; command walks the target directory and creates manifest entries one by one.&lt;/p&gt;

&lt;p&gt;For regular files, it reads the file, calculates a SHA-256 hash, and stores the path, hash, and file size in the manifest.&lt;/p&gt;

&lt;p&gt;Exclude rules are applied at this stage. I made this choice on purpose. The deploy server is the side that creates trusted data, so exclude handling should be fixed there.&lt;/p&gt;

&lt;p&gt;After all entries are collected, the manifest is written either to a local file or to S3.&lt;/p&gt;

&lt;p&gt;I also made &lt;code&gt;generate&lt;/code&gt; flexible enough to work even when some excluded directories do not exist on the deploy server. That helps in real deployment environments where some paths only exist on application servers.&lt;/p&gt;

&lt;h2&gt;
  
  
  How &lt;code&gt;verify&lt;/code&gt; works
&lt;/h2&gt;

&lt;p&gt;The &lt;code&gt;verify&lt;/code&gt; command loads the manifest first. Then it walks the target directory and compares each current file with the manifest entry.&lt;/p&gt;

&lt;p&gt;It checks:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;whether the path exists in the manifest&lt;/li&gt;
&lt;li&gt;whether the file type matches&lt;/li&gt;
&lt;li&gt;whether the file size matches&lt;/li&gt;
&lt;li&gt;whether the calculated hash matches&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It also detects files that exist in the manifest but are missing on disk.&lt;/p&gt;

&lt;p&gt;When verification fails, the command exits with a non-zero status. It also writes error details to standard error, including the path that failed. This makes the tool easy to integrate with monitoring systems.&lt;/p&gt;

&lt;h2&gt;
  
  
  How symlinks are handled in Go
&lt;/h2&gt;

&lt;p&gt;Symlinks needed special handling.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;kekkai&lt;/code&gt; does not follow symlinks. Instead, it verifies the symlink itself.&lt;/p&gt;

&lt;p&gt;The implementation is roughly like this:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;use &lt;code&gt;os.Lstat&lt;/code&gt; to check whether the entry is a symlink&lt;/li&gt;
&lt;li&gt;use &lt;code&gt;os.Readlink&lt;/code&gt; to read the target path string&lt;/li&gt;
&lt;li&gt;add a &lt;code&gt;symlink:&lt;/code&gt; prefix to that string&lt;/li&gt;
&lt;li&gt;calculate the SHA-256 hash of that prefixed string&lt;/li&gt;
&lt;li&gt;during verification, check both the file type and the stored hash&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This lets the tool detect:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;a changed symlink target path&lt;/li&gt;
&lt;li&gt;a type change between a regular file and a symlink&lt;/li&gt;
&lt;li&gt;added or removed symlinks&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This design is intentional. If the symlink target path stays the same but the target file contents change, that is outside the scope of this check. I accepted that trade-off because I wanted predictable behavior and simple logic.&lt;/p&gt;

&lt;p&gt;I also do not cache symlink verification results. The hashed input is only a short string, so the cost is small.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why I support both S3 and local files
&lt;/h2&gt;

&lt;p&gt;The manifest itself must be protected.&lt;/p&gt;

&lt;p&gt;If an attacker can modify both the application files and the manifest, verification becomes meaningless. That is why the main production model stores the manifest in S3 instead of next to the application files.&lt;/p&gt;

&lt;p&gt;At the same time, I also wanted local file output. Without that, even simple tests would require AWS credentials. So the tool supports both:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;S3 for production&lt;/li&gt;
&lt;li&gt;local files for testing and development&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I also recommend being careful with local manifest output. If you deploy that manifest into the same target directory, &lt;code&gt;verify&lt;/code&gt; can fail because the manifest itself appears as an unexpected file.&lt;/p&gt;

&lt;h2&gt;
  
  
  Protecting the manifest with S3 and IAM
&lt;/h2&gt;

&lt;p&gt;Using S3 also makes it easier to separate permissions.&lt;/p&gt;

&lt;p&gt;The application side only needs &lt;code&gt;GetObject&lt;/code&gt;.&lt;br&gt;
The deploy side only needs &lt;code&gt;PutObject&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;That separation is useful because the deploy server and the application servers have different roles. If needed, S3 features such as versioning can also help protect the manifest further.&lt;/p&gt;

&lt;p&gt;I also recommend keeping &lt;code&gt;base-path&lt;/code&gt; fixed in production and managing it explicitly. Since &lt;code&gt;base-path&lt;/code&gt; and &lt;code&gt;app-name&lt;/code&gt; become part of the S3 path, this helps avoid accidentally overwriting production data.&lt;/p&gt;
&lt;h2&gt;
  
  
  Why I chose SHA-256
&lt;/h2&gt;

&lt;p&gt;For this kind of verification, I needed a hash function with the right security properties. I did not want a fast but cryptographically weak hash that would make it easier to craft a different file whose hash matches the stored value.&lt;/p&gt;

&lt;p&gt;In security terms, the important property here is second-preimage resistance.&lt;/p&gt;

&lt;p&gt;I considered SHA-256 and SHA-512. I chose SHA-256 because it is standard, well known, and easy to justify. I also did not see a meaningful advantage from SHA-512 for source-code-sized files in this use case.&lt;/p&gt;
&lt;h2&gt;
  
  
  How I reduced production load
&lt;/h2&gt;

&lt;p&gt;Performance was the hardest practical problem.&lt;/p&gt;

&lt;p&gt;Hashing a large codebase uses CPU, memory, and I/O. If the verification tool itself harms production stability, that defeats the purpose. Because of that, I added several controls.&lt;/p&gt;
&lt;h3&gt;
  
  
  &lt;code&gt;GOMAXPROCS&lt;/code&gt; and workers
&lt;/h3&gt;

&lt;p&gt;First, I rely on normal Go controls such as &lt;code&gt;GOMAXPROCS&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;kekkai&lt;/code&gt; also has a &lt;code&gt;--workers&lt;/code&gt; option to control how many files are hashed in parallel. By default, it uses the same value as &lt;code&gt;GOMAXPROCS&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;This helps, but it is not enough. Even with one worker, the process can still keep one CPU core busy when many files are processed.&lt;/p&gt;
&lt;h3&gt;
  
  
  I/O rate limiting with &lt;code&gt;golang.org/x/time/rate&lt;/code&gt;
&lt;/h3&gt;

&lt;p&gt;To make the tool safer in production, I added I/O rate limiting with &lt;code&gt;golang.org/x/time/rate&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Instead of only limiting concurrency, I also limit how fast the tool reads file data. This makes it possible to slow verification down on purpose and reduce the production impact.&lt;/p&gt;

&lt;p&gt;The core idea is simple:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;create a limiter&lt;/li&gt;
&lt;li&gt;read file data in chunks&lt;/li&gt;
&lt;li&gt;wait on the limiter before each chunk&lt;/li&gt;
&lt;li&gt;write the chunk into the hasher&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This approach gave me the most flexible control. In practice, this mattered more than worker limits alone.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;kekkai&lt;/code&gt; exposes this through the &lt;code&gt;--rate-limit&lt;/code&gt; option. Of course, if the value is too small, verification will become very slow, so this needs to be tuned carefully.&lt;/p&gt;
&lt;h3&gt;
  
  
  Cache
&lt;/h3&gt;

&lt;p&gt;I also added a local cache to make repeated verification faster.&lt;/p&gt;

&lt;p&gt;The cache stores file metadata and can skip hash calculation when &lt;code&gt;mtime&lt;/code&gt;, &lt;code&gt;ctime&lt;/code&gt;, and file size have not changed. Here, &lt;code&gt;ctime&lt;/code&gt; means file change time, not creation time.&lt;/p&gt;

&lt;p&gt;I know that metadata-based skipping is not a perfect security check by itself. That is why the cache is only an optimization layer.&lt;/p&gt;

&lt;p&gt;There is also some risk that the cache file itself could be tampered with. Because of that, the default behavior is to recalculate hashes with a 10% probability even when the cache says the file is unchanged. This probability can be changed with &lt;code&gt;--verify-probability&lt;/code&gt;. If it is set to &lt;code&gt;0&lt;/code&gt;, hash recalculation is skipped as long as the cache metadata still matches.&lt;/p&gt;

&lt;p&gt;The cache file also embeds a hash of its own contents, so tampering with the cache can be detected; when it is, the cache is disabled. In addition, files under &lt;code&gt;/tmp&lt;/code&gt; may eventually be deleted, so the cache gets rebuilt naturally over time.&lt;/p&gt;
&lt;h2&gt;
  
  
  Go implementation notes
&lt;/h2&gt;

&lt;p&gt;I also made a few implementation choices to reduce overhead in Go itself.&lt;/p&gt;

&lt;p&gt;When hashing many files, I do not want to allocate a new hasher for every file if I can avoid it. So I reuse &lt;code&gt;hash.Hash&lt;/code&gt; with &lt;code&gt;Reset()&lt;/code&gt; instead of calling &lt;code&gt;sha256.New()&lt;/code&gt; every time.&lt;/p&gt;

&lt;p&gt;The same idea applies to buffers. I reuse the &lt;code&gt;io.CopyBuffer&lt;/code&gt; buffer for each worker, instead of allocating a new buffer per file.&lt;/p&gt;

&lt;p&gt;This matters because &lt;code&gt;sha256.New()&lt;/code&gt; is not free, and repeated allocations across many files and workers increase GC cost and cache misses.&lt;/p&gt;

&lt;p&gt;One important detail is that &lt;code&gt;hash.Hash&lt;/code&gt; is not goroutine-safe. So if hashing is done in parallel, each worker needs its own hasher and buffer.&lt;/p&gt;
&lt;h2&gt;
  
  
  How I run it on the deploy server
&lt;/h2&gt;

&lt;p&gt;In production, the deploy process is implemented with shell scripts. The deploy server installs dependencies first and then runs &lt;code&gt;rsync&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Because of that, I run &lt;code&gt;kekkai generate&lt;/code&gt; at the end of the deploy script.&lt;/p&gt;

&lt;p&gt;A typical command looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kekkai generate &lt;span class="nt"&gt;--target&lt;/span&gt; /var/www/app &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--s3-bucket&lt;/span&gt; &lt;span class="s1"&gt;'kekkai-test'&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--base-path&lt;/span&gt; production &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--app-name&lt;/span&gt; kekkai &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--exclude&lt;/span&gt; &lt;span class="s2"&gt;".git/**"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;This stores the manifest as &lt;code&gt;production/kekkai/manifest.json&lt;/code&gt; in the specified S3 bucket.&lt;/p&gt;

&lt;p&gt;At this stage, it is important to list every directory that must be ignored, such as log directories or NFS mount points. Since exclude rules are stored in the manifest, mistakes here will affect later verification.&lt;/p&gt;
&lt;h2&gt;
  
  
  How I run it on application servers
&lt;/h2&gt;

&lt;p&gt;The minimum command on an application server is simple:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kekkai verify &lt;span class="nt"&gt;--target&lt;/span&gt; /var/www/app &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--s3-bucket&lt;/span&gt; &lt;span class="s1"&gt;'kekkai-test'&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--base-path&lt;/span&gt; production &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--app-name&lt;/span&gt; kekkai
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;In real production, I also care about these points:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;the application server must not be able to write to S3&lt;/li&gt;
&lt;li&gt;I want alerts on failure&lt;/li&gt;
&lt;li&gt;I want to limit load on EC2&lt;/li&gt;
&lt;li&gt;I want to use the cache to reduce execution time&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;So the application side only gets &lt;code&gt;s3:GetObject&lt;/code&gt;.&lt;/p&gt;
&lt;h2&gt;
  
  
  Monitoring and alerts
&lt;/h2&gt;

&lt;p&gt;I run verification as a periodic check from our monitoring system.&lt;/p&gt;

&lt;p&gt;If I alert on a single failure, I may get alerts during deployment. That would create false positives and reduce trust in alerts. So I only alert after repeated failures.&lt;/p&gt;

&lt;p&gt;Timeout is also important. Full verification can take several minutes, so the monitoring side needs a longer timeout than a normal health check.&lt;/p&gt;
&lt;h2&gt;
  
  
  Why I use &lt;code&gt;systemd-run&lt;/code&gt;
&lt;/h2&gt;

&lt;p&gt;I also use &lt;code&gt;systemd-run&lt;/code&gt; when running &lt;code&gt;kekkai verify&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;The reason is simple: I do not want this check to run with strong privileges or compete too aggressively with the main application.&lt;/p&gt;

&lt;p&gt;A real example looks like this:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;systemd-run &lt;span class="nt"&gt;--quiet&lt;/span&gt; &lt;span class="nt"&gt;--wait&lt;/span&gt; &lt;span class="nt"&gt;--pipe&lt;/span&gt; &lt;span class="nt"&gt;--collect&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-p&lt;/span&gt; &lt;span class="nv"&gt;Type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;oneshot &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-p&lt;/span&gt; &lt;span class="nv"&gt;CPUQuota&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;25% &lt;span class="nt"&gt;-p&lt;/span&gt; &lt;span class="nv"&gt;CPUWeight&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;50 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-p&lt;/span&gt; &lt;span class="nv"&gt;PrivateTmp&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;no &lt;span class="nt"&gt;-p&lt;/span&gt; &lt;span class="nv"&gt;User&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;nobody &lt;span class="se"&gt;\&lt;/span&gt;
  /bin/bash &lt;span class="nt"&gt;-lc&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="s1"&gt;'nice -n 10 ionice -c2 -n7 /usr/local/bin/kekkai verify --s3-bucket kekkai-test --app-name app --base-path production --target /var/www/app --use-cache --rate-limit 10485760 2&amp;gt;&amp;amp;1'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;There are several reasons for this setup.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;User=nobody&lt;/code&gt; makes the command run as a low-privilege user&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;nice&lt;/code&gt; and &lt;code&gt;ionice&lt;/code&gt; reduce CPU and I/O priority&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;CPUQuota&lt;/code&gt; and &lt;code&gt;CPUWeight&lt;/code&gt; reduce CPU usage further through cgroup control&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;PrivateTmp=no&lt;/code&gt; is necessary if I want to use the cache in &lt;code&gt;/tmp&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That last point is easy to miss. If &lt;code&gt;PrivateTmp=no&lt;/code&gt; is not set, the process gets a different private &lt;code&gt;/tmp&lt;/code&gt;, and the cache file cannot be reused.&lt;/p&gt;

&lt;p&gt;The Go version matters here as well. Before Go 1.25, &lt;code&gt;GOMAXPROCS&lt;/code&gt; could still reflect the host machine's CPU count even when cgroup limits were applied. Go 1.25 made &lt;code&gt;GOMAXPROCS&lt;/code&gt; cgroup-aware by default, so &lt;code&gt;kekkai&lt;/code&gt; targets Go 1.25 or later.&lt;/p&gt;
&lt;h2&gt;
  
  
  Alert contents
&lt;/h2&gt;

&lt;p&gt;When verification fails, &lt;code&gt;kekkai&lt;/code&gt; writes the error to standard error, including the affected path.&lt;/p&gt;

&lt;p&gt;Some monitoring systems include standard output in notifications, so I redirect standard error to standard output when needed. That way, a notification to Slack or another channel can include the actual file path that failed verification.&lt;/p&gt;

&lt;p&gt;This makes investigation much faster.&lt;/p&gt;
&lt;h2&gt;
  
  
  Production results
&lt;/h2&gt;

&lt;p&gt;In production, the application has about 17,000 files including dependencies.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;manifest generation takes a few seconds&lt;/li&gt;
&lt;li&gt;verification takes about 4 to 5 minutes with &lt;code&gt;--rate-limit 10485760&lt;/code&gt; (10 MB/s)&lt;/li&gt;
&lt;li&gt;with &lt;code&gt;--use-cache&lt;/code&gt;, a cache hit can reduce that to about 25 seconds&lt;/li&gt;
&lt;li&gt;verification runs once per hour on application servers&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This difference is intentional. I want &lt;code&gt;generate&lt;/code&gt; to finish quickly as part of deployment, but I want &lt;code&gt;verify&lt;/code&gt; to run slowly and safely on production servers. Even if verification takes about five minutes, running it once per hour is enough for this use case.&lt;/p&gt;
&lt;h2&gt;
  
  
  Final thoughts
&lt;/h2&gt;

&lt;p&gt;I did not want to build a large security platform. I wanted a small tool that fits a specific real-world environment: a legacy PHP application on EC2, deployed with &lt;code&gt;rsync&lt;/code&gt;, with a deploy server and application servers playing different roles.&lt;/p&gt;

&lt;p&gt;That focus shaped both the design and the implementation: content-only hashing, strict exclude rules, explicit symlink handling, S3 and IAM for manifest protection, local cache with probabilistic re-verification, rate limiting, and safe execution with &lt;code&gt;systemd-run&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;If you work with a similar deployment model, this approach may be useful for you too. I have open-sourced the tool on GitHub as &lt;code&gt;catatsuy/kekkai&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;

&lt;/p&gt;
&lt;div class="ltag-github-readme-tag"&gt;
  &lt;div class="readme-overview"&gt;
    &lt;h2&gt;
      &lt;img src="https://assets.dev.to/assets/github-logo-5a155e1f9a670af7944dd5e12375bc76ed542ea80224905ecaf878b9157cdefc.svg" alt="GitHub logo"&gt;
      &lt;a href="https://github.com/catatsuy" rel="noopener noreferrer"&gt;
        catatsuy
      &lt;/a&gt; / &lt;a href="https://github.com/catatsuy/kekkai" rel="noopener noreferrer"&gt;
        kekkai
      &lt;/a&gt;
    &lt;/h2&gt;
    &lt;h3&gt;
      A lightweight Go tool for detecting file tampering by comparing content-based hashes stored securely in S3.
    &lt;/h3&gt;
  &lt;/div&gt;
  &lt;div class="ltag-github-body"&gt;
    
&lt;div id="readme" class="md"&gt;
&lt;div class="markdown-heading"&gt;
&lt;h1 class="heading-element"&gt;Kekkai&lt;/h1&gt;
&lt;/div&gt;

&lt;p&gt;A simple and fast Go tool for file integrity monitoring. Detects unauthorized file modifications caused by OS command injection and other attacks by recording file hashes during deployment and verifying them periodically.&lt;/p&gt;

&lt;p&gt;The name "Kekkai" comes from the Japanese word 結界 (kekkai), meaning "barrier" - a protective boundary that keeps unwanted things out, perfectly representing this tool's purpose of protecting your files from tampering.&lt;/p&gt;

&lt;div class="markdown-heading"&gt;
&lt;h2 class="heading-element"&gt;Takumi, the AI offensive security engineer&lt;/h2&gt;
&lt;/div&gt;

&lt;p&gt;&lt;a href="https://flatt.tech/oss/gmo/trampoline" rel="nofollow noopener noreferrer"&gt;&lt;img src="https://camo.githubusercontent.com/3a15e3e01ccd3cd6a238fdf4a7d1b23953ed233640439974047f80f7aaafe0fe/68747470733a2f2f666c6174742e746563682f6173736574732f696d616765732f6261646765732f676d6f2d6f73732e737667" height="24px"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;div class="markdown-heading"&gt;
&lt;h2 class="heading-element"&gt;Design Philosophy&lt;/h2&gt;
&lt;/div&gt;

&lt;p&gt;Kekkai was designed to solve specific challenges in production server environments:&lt;/p&gt;

&lt;div class="markdown-heading"&gt;
&lt;h3 class="heading-element"&gt;Why Kekkai?&lt;/h3&gt;

&lt;/div&gt;

&lt;p&gt;Traditional tools like &lt;code&gt;tar&lt;/code&gt; or file sync utilities (e.g., &lt;code&gt;rsync&lt;/code&gt;) include metadata like timestamps in their comparisons, causing false positives when only timestamps change. In environments with heavy NFS usage or dynamic log directories, existing tools become difficult to configure and maintain.&lt;/p&gt;

&lt;div class="markdown-heading"&gt;
&lt;h3 class="heading-element"&gt;Core Principles&lt;/h3&gt;

&lt;/div&gt;


&lt;ol&gt;

&lt;li&gt;

&lt;p&gt;&lt;strong&gt;Content-Only Hashing&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Hashes only file contents, ignoring timestamps and metadata&lt;/li&gt;
&lt;li&gt;Detects actual content changes, not superficial modifications&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;p&gt;&lt;strong&gt;Immutable Exclude&lt;/strong&gt;…&lt;/p&gt;


&lt;/li&gt;

&lt;/ol&gt;
&lt;/div&gt;
&lt;br&gt;
  &lt;/div&gt;
&lt;br&gt;
  &lt;div class="gh-btn-container"&gt;&lt;a class="gh-btn" href="https://github.com/catatsuy/kekkai" rel="noopener noreferrer"&gt;View on GitHub&lt;/a&gt;&lt;/div&gt;
&lt;br&gt;
&lt;/div&gt;





</description>
      <category>go</category>
      <category>php</category>
      <category>security</category>
      <category>aws</category>
    </item>
    <item>
      <title>Why I, as Someone Who Likes MySQL, Now Want to Recommend PostgreSQL</title>
      <dc:creator>catatsuy</dc:creator>
      <pubDate>Sun, 15 Mar 2026 05:48:21 +0000</pubDate>
      <link>https://dev.to/catatsuy/why-i-as-someone-who-likes-mysql-now-want-to-recommend-postgresql-2a8i</link>
      <guid>https://dev.to/catatsuy/why-i-as-someone-who-likes-mysql-now-want-to-recommend-postgresql-2a8i</guid>
      <description>&lt;p&gt;I like MySQL. I have used it for a long time, and I have also operated it in on-premises environments.&lt;/p&gt;

&lt;p&gt;However, since joining my current company, I have had more opportunities to use PostgreSQL. At first, I honestly felt a lot of resistance to it. I had used MySQL for so long, so part of it was just habit, and I think I was also more wary of PostgreSQL than I needed to be.&lt;/p&gt;

&lt;p&gt;But as I actually used it, I gradually started to see what was good about PostgreSQL. These days, if someone asks me which one I would choose for a new project, I have come to feel that I would want to choose PostgreSQL.&lt;/p&gt;

&lt;p&gt;Because I have used MySQL for a long time, I also know the rough edges that older MySQL had. At the same time, I think it is inaccurate to talk about MySQL today based only on old impressions. If you configure &lt;code&gt;sql_mode&lt;/code&gt; properly, you can avoid many dangerous behaviors, and MySQL 8 added a large number of features.&lt;/p&gt;

&lt;p&gt;Also, this time I want to compare current MySQL and PostgreSQL on the assumption that they will run in the cloud, rather than based on impressions from the on-premises era. Some of the things that used to be described as disadvantages of PostgreSQL are no longer very important issues now.&lt;/p&gt;

&lt;p&gt;This is not a story about “MySQL is bad.” It is also not a story like “the philosophy of PostgreSQL is beautiful.”&lt;/p&gt;

&lt;p&gt;If I write only the conclusion, it is these two points:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Things that used to be considered disadvantages of PostgreSQL have become much less important. The feature gap has narrowed a lot, and under the assumption of managed services, there are more things you do not need to worry about.&lt;/li&gt;
&lt;li&gt;On the other hand, from the perspective of application implementation, there are still points where PostgreSQL is clearly better.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In this article, I will organize the discussion from that perspective.&lt;/p&gt;

&lt;h2&gt;
  
  
  The things that used to be considered disadvantages of PostgreSQL have become much less significant
&lt;/h2&gt;

&lt;p&gt;In older comparisons, PostgreSQL’s weaknesses were often said to be heavier operations and awkwardness around DDL.&lt;/p&gt;

&lt;p&gt;But I think bringing those points up as-is today is a bit outdated.&lt;/p&gt;

&lt;p&gt;MySQL has become very strong at online DDL, but at least for everyday tasks like adding columns, I no longer see a clear difference between MySQL and PostgreSQL. Partitioning is also no longer as big an issue as it once felt.&lt;/p&gt;

&lt;p&gt;Also, operational topics specific to PostgreSQL, such as &lt;code&gt;VACUUM&lt;/code&gt;, come up much less often when you assume managed services, because users have far fewer situations where they need to handle them directly. I do not think it is very fair to bring comparisons from the old on-premises era, where you had to manage everything yourself, directly into the current cloud era.&lt;/p&gt;

&lt;p&gt;The differences around replication have also become less visible recently, because managed services have become the mainstream, and there are more parts that users do not directly touch. I feel there are fewer situations than before where I strongly notice an advantage on the MySQL side.&lt;/p&gt;

&lt;p&gt;In other words, some of the things that used to be valid reasons not to recommend PostgreSQL have now become much weaker.&lt;/p&gt;

&lt;h2&gt;
  
  
  Even so, PostgreSQL is stronger for application implementation
&lt;/h2&gt;

&lt;p&gt;This is the main point.&lt;/p&gt;

&lt;p&gt;MySQL 8 closed a lot of the gap. Even so, when I look at things from the standpoint of someone actually writing applications, there are still reasons why PostgreSQL is easier to recommend.&lt;/p&gt;

&lt;h2&gt;
  
  
  First, the things MySQL 8 added and narrowed the gap on
&lt;/h2&gt;

&lt;p&gt;I want to make this clear first. The following are things that used to be described as PostgreSQL advantages, but are no longer decisive because MySQL 8 added them:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;CHECK&lt;/code&gt; constraints&lt;/li&gt;
&lt;li&gt;Window functions&lt;/li&gt;
&lt;li&gt;&lt;code&gt;SKIP LOCKED&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;At this point, it is not fair to talk about these as strengths that only PostgreSQL has.&lt;/p&gt;

&lt;p&gt;However, as I will explain later, “window functions themselves were added in MySQL 8” and “being able to naturally bring window functions into update processing” are different things. I still think PostgreSQL is much easier to work with for the latter.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;code&gt;ON CONFLICT DO NOTHING&lt;/code&gt; is not a replacement for &lt;code&gt;INSERT IGNORE&lt;/code&gt;
&lt;/h2&gt;

&lt;p&gt;This is a feature MySQL did not originally have, and it is one of the fairly big reasons why I recommend PostgreSQL.&lt;/p&gt;

&lt;p&gt;MySQL has &lt;code&gt;INSERT IGNORE&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;However, this is hard to treat as a replacement for &lt;code&gt;ON CONFLICT DO NOTHING&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;PostgreSQL’s &lt;code&gt;ON CONFLICT DO NOTHING&lt;/code&gt; is, basically, a feature that explicitly says: “Do not insert only when a unique constraint conflict occurs.” What you want to do becomes SQL exactly as it is.&lt;/p&gt;

&lt;p&gt;By contrast, MySQL’s &lt;code&gt;INSERT IGNORE&lt;/code&gt; is not a dedicated feature only for ignoring duplicates. It is a feature that turns errors into warnings and continues processing, so it is too broad for the use case of “I only want to ignore duplicates.”&lt;/p&gt;

&lt;p&gt;This difference may look small, but in practice it is quite large.&lt;/p&gt;

&lt;p&gt;It makes the behavior easier to read during review, and it makes it less likely that unintended invalid input will be silently accepted.&lt;/p&gt;
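&lt;p&gt;For example, assuming a unique constraint on &lt;code&gt;email&lt;/code&gt;, inserting a user only when the address is not already registered reads exactly like the intent:&lt;/p&gt;

```sql
INSERT INTO users(name, email)
VALUES('catatsuy', 'catatsuy@example.com')
ON CONFLICT (email) DO NOTHING;
```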

&lt;h2&gt;
  
  
  &lt;code&gt;RETURNING&lt;/code&gt; is very powerful
&lt;/h2&gt;

&lt;p&gt;This is also a feature MySQL did not originally have, and it is another fairly big reason why I recommend PostgreSQL.&lt;/p&gt;

&lt;p&gt;In PostgreSQL, you can use &lt;code&gt;INSERT/UPDATE/DELETE ... RETURNING&lt;/code&gt;. Because you can return the changed result right there, you can naturally complete “get the result of the change” in a single statement.&lt;/p&gt;

&lt;p&gt;For example, you can do this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;INSERT&lt;/span&gt; &lt;span class="k"&gt;INTO&lt;/span&gt; &lt;span class="n"&gt;users&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;email&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;VALUES&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'catatsuy'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'catatsuy@example.com'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;RETURNING&lt;/span&gt; &lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;email&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;created_at&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When you have this, the following become very natural:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;receive the inserted ID directly&lt;/li&gt;
&lt;li&gt;receive default values or stored values directly&lt;/li&gt;
&lt;li&gt;return the updated row as-is and use it as the API response&lt;/li&gt;
&lt;li&gt;receive the result of an upsert directly&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;To be honest, the &lt;code&gt;LAST_INSERT_ID()&lt;/code&gt;-based style that is common in MySQL is quite limiting.&lt;/p&gt;

&lt;p&gt;The information you can get back is narrow, and basically centered on getting a numeric &lt;code&gt;AUTO_INCREMENT&lt;/code&gt; ID.&lt;/p&gt;

&lt;p&gt;You cannot naturally receive arbitrary columns from the inserted result, and you also cannot return the completed row as-is including default values and generated columns.&lt;/p&gt;

&lt;p&gt;For example, what you may want is something like this:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;you use UUIDs as primary keys, so you want to return them directly&lt;/li&gt;
&lt;li&gt;you want to return the completed row including generated columns and default values&lt;/li&gt;
&lt;li&gt;you want to pass the entire inserted row directly to the next step&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You cannot do this with &lt;code&gt;LAST_INSERT_ID()&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;In addition, even when multiple rows are inserted in a single statement, &lt;code&gt;LAST_INSERT_ID()&lt;/code&gt; does not return the inserted result as-is. What you can get is only the first &lt;code&gt;AUTO_INCREMENT&lt;/code&gt; value.&lt;/p&gt;

&lt;p&gt;So if you want to handle the result of a multi-row &lt;code&gt;INSERT&lt;/code&gt; directly in the application, it is inconvenient. With PostgreSQL’s &lt;code&gt;RETURNING&lt;/code&gt;, you can return the inserted rows directly, and this difference is very large.&lt;/p&gt;
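&lt;p&gt;For example, a multi-row &lt;code&gt;INSERT&lt;/code&gt; in PostgreSQL can return every inserted row at once (the rows here are made up for illustration):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;INSERT INTO users(name, email)
VALUES
  ('alice', 'alice@example.com'),
  ('bob', 'bob@example.com')
RETURNING id, name;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;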

&lt;p&gt;“Being able to return the changed result directly” is not just a convenience feature. It affects the very way you structure application implementation.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;code&gt;VALUES&lt;/code&gt; helps in real implementation
&lt;/h2&gt;

&lt;p&gt;This is not a feature that MySQL entirely lacks. Rather, I think PostgreSQL lets you use it much more naturally.&lt;/p&gt;

&lt;p&gt;It is easy to create a small constant table on the spot, join with it, and connect it directly to update processing. When you have this, you do not need to push half-baked temporary-table-like processing out to the application side.&lt;/p&gt;

&lt;p&gt;For example, when you want to join a small number of values received from the application and update based on them, in PostgreSQL you can write:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;UPDATE&lt;/span&gt; &lt;span class="n"&gt;users&lt;/span&gt; &lt;span class="n"&gt;u&lt;/span&gt;
&lt;span class="k"&gt;SET&lt;/span&gt; &lt;span class="n"&gt;plan&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;v&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;plan&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="k"&gt;VALUES&lt;/span&gt;
    &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'pro'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'free'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'team'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;v&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;plan&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;u&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;v&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This kind of processing comes up very often in real implementation.&lt;/p&gt;

&lt;p&gt;For example, you may want to pass a small set of master-like values on the spot and update with them, or send a group of values received from an API directly into SQL.&lt;/p&gt;

&lt;p&gt;As another example, it is also natural to treat a group of values received from the application as a join target:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;u&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;users&lt;/span&gt; &lt;span class="n"&gt;u&lt;/span&gt;
&lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="k"&gt;VALUES&lt;/span&gt;
    &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;v&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;u&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;v&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It is not that MySQL cannot do similar things at all. Since MySQL 8.0.19, it has had the &lt;code&gt;VALUES&lt;/code&gt; statement, and it can be treated as a table value constructor.&lt;/p&gt;

&lt;p&gt;However, in MySQL you need to write it with &lt;code&gt;ROW(...)&lt;/code&gt;, and if you do not alias the columns, you get default names like &lt;code&gt;column_0&lt;/code&gt; and &lt;code&gt;column_1&lt;/code&gt;. It feels a bit different from PostgreSQL, where you create a small constant table on the spot, give it natural column names, and flow it directly into a &lt;code&gt;JOIN&lt;/code&gt; or &lt;code&gt;UPDATE&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;For example, in MySQL the same idea would look like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;u&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;users&lt;/span&gt; &lt;span class="n"&gt;u&lt;/span&gt;
&lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="k"&gt;VALUES&lt;/span&gt; &lt;span class="k"&gt;ROW&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="k"&gt;ROW&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="k"&gt;ROW&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;v&lt;/span&gt;
&lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;u&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;v&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;column_0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It is not a flashy feature, but this kind of thing affects the ease of everyday implementation.&lt;/p&gt;

&lt;h2&gt;
  
  
  Being able to bring window functions into update processing is powerful
&lt;/h2&gt;

&lt;p&gt;This part is important.&lt;/p&gt;

&lt;p&gt;Window functions themselves were added in MySQL 8.&lt;/p&gt;

&lt;p&gt;So it is wrong to talk about window functions themselves as a PostgreSQL-only strength.&lt;/p&gt;

&lt;p&gt;However, in PostgreSQL, by combining them with &lt;code&gt;WITH&lt;/code&gt; and &lt;code&gt;UPDATE ... FROM&lt;/code&gt;, it is easy to bring the result of window functions naturally into update processing. I think there is still a difference here.&lt;/p&gt;

&lt;p&gt;For example, if you want to set a flag only on the latest row for each user, you can write:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;WITH&lt;/span&gt; &lt;span class="n"&gt;ranked&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="k"&gt;SELECT&lt;/span&gt;
    &lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;ROW_NUMBER&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="n"&gt;OVER&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;PARTITION&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;user_id&lt;/span&gt; &lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;created_at&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;rn&lt;/span&gt;
  &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;sessions&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;UPDATE&lt;/span&gt; &lt;span class="n"&gt;sessions&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;
&lt;span class="k"&gt;SET&lt;/span&gt; &lt;span class="n"&gt;is_latest&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;rn&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;ranked&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Again, window functions themselves do exist in MySQL 8. However, PostgreSQL is much more natural when it comes to connecting them to this kind of update logic.&lt;/p&gt;

&lt;p&gt;This is not just a convenience feature for analytics. It works as a weapon for application implementation.&lt;/p&gt;

&lt;h2&gt;
  
  
  Partial indexes are a clear feature difference
&lt;/h2&gt;

&lt;p&gt;This is something I can clearly describe as a feature difference.&lt;/p&gt;

&lt;p&gt;MySQL has never had them, and it still does not have them today.&lt;/p&gt;

&lt;p&gt;PostgreSQL has partial indexes, and you can create an index only on some rows, such as with &lt;code&gt;WHERE deleted_at IS NULL&lt;/code&gt;. This fits very well with soft delete patterns, and it is also useful for managing records by status.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;INDEX&lt;/span&gt; &lt;span class="n"&gt;idx_users_active_email&lt;/span&gt;
&lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;users&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;email&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;deleted_at&lt;/span&gt; &lt;span class="k"&gt;IS&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For example, if you have a table using soft deletes and you only want to speed up “search by email address among active users,” you can write that directly.&lt;/p&gt;

&lt;p&gt;In MySQL, you can do something similar using generated columns or functional indexes. However, that is not a replacement for partial indexes.&lt;/p&gt;

&lt;p&gt;PostgreSQL partial indexes put only the rows that satisfy a condition like &lt;code&gt;WHERE deleted_at IS NULL&lt;/code&gt; into the index. In other words, unnecessary rows are excluded from the index from the beginning.&lt;/p&gt;

&lt;p&gt;By contrast, MySQL generated columns and functional indexes basically evaluate an expression for all rows and then index that result. If you design the expression well, you can use them for similar purposes, but they do not directly express “an index that physically stays small by containing only some rows.”&lt;/p&gt;
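&lt;p&gt;To make the contrast concrete, here is one possible MySQL-side workaround (the column and index names are hypothetical): a stored generated column that is &lt;code&gt;NULL&lt;/code&gt; for soft-deleted rows, with an ordinary index on it. Note that every row still gets an index entry; only the indexed value becomes &lt;code&gt;NULL&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;ALTER TABLE users
  ADD COLUMN active_email varchar(255)
    GENERATED ALWAYS AS (IF(deleted_at IS NULL, email, NULL)) STORED,
  ADD INDEX idx_users_active_email (active_email);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;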

&lt;p&gt;So in terms of size, update cost, and clarity of intent, PostgreSQL’s partial indexes are more straightforward. The MySQL side can be used as a workaround, but it is hard to say it has the same feature.&lt;/p&gt;

&lt;p&gt;This is not just a difference in how the SQL feels to write. It is a real feature difference, and a reason to recommend PostgreSQL.&lt;/p&gt;

&lt;h2&gt;
  
  
  Foreign keys are much better in PostgreSQL
&lt;/h2&gt;

&lt;p&gt;This is my personal impression, but many people in the MySQL world seem to consider foreign keys unnecessary, while many people in the PostgreSQL world consider them essential.&lt;/p&gt;

&lt;p&gt;I think this comes not so much from a difference in philosophy as from differences in how easy foreign keys are to test with, how easy they are to operate, and how hard they make it for bugs to slip in.&lt;/p&gt;

&lt;h3&gt;
  
  
  PostgreSQL supports deferred constraints
&lt;/h3&gt;

&lt;p&gt;In PostgreSQL, foreign keys can be made &lt;code&gt;DEFERRABLE&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;This is extremely important.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;authors&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="nb"&gt;bigint&lt;/span&gt; &lt;span class="k"&gt;PRIMARY&lt;/span&gt; &lt;span class="k"&gt;KEY&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;books&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="nb"&gt;bigint&lt;/span&gt; &lt;span class="k"&gt;PRIMARY&lt;/span&gt; &lt;span class="k"&gt;KEY&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;author_id&lt;/span&gt; &lt;span class="nb"&gt;bigint&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="k"&gt;CONSTRAINT&lt;/span&gt; &lt;span class="n"&gt;books_author_fk&lt;/span&gt;
    &lt;span class="k"&gt;FOREIGN&lt;/span&gt; &lt;span class="k"&gt;KEY&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;author_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;REFERENCES&lt;/span&gt; &lt;span class="n"&gt;authors&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;DEFERRABLE&lt;/span&gt; &lt;span class="k"&gt;INITIALLY&lt;/span&gt; &lt;span class="k"&gt;DEFERRED&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Because of this, you can delay constraint checks until the end of the transaction.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;BEGIN&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;INSERT&lt;/span&gt; &lt;span class="k"&gt;INTO&lt;/span&gt; &lt;span class="n"&gt;books&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;author_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;VALUES&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="k"&gt;INSERT&lt;/span&gt; &lt;span class="k"&gt;INTO&lt;/span&gt; &lt;span class="n"&gt;authors&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;VALUES&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="k"&gt;COMMIT&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This SQL works in PostgreSQL.&lt;/p&gt;

&lt;p&gt;The parent row does not exist partway through the transaction, but that is fine as long as consistency is satisfied at commit time.&lt;/p&gt;

&lt;p&gt;What is this useful for?&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;loading data that includes circular references&lt;/li&gt;
&lt;li&gt;creating complex test data&lt;/li&gt;
&lt;li&gt;migration processes where the order is temporarily reversed&lt;/li&gt;
&lt;li&gt;bulk inserts and replacement operations&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These kinds of processes come up normally in real work.&lt;/p&gt;

&lt;p&gt;And whether or not you can write them naturally is a very big deal.&lt;/p&gt;

&lt;h3&gt;
  
  
  MySQL forces strict ordering
&lt;/h3&gt;

&lt;p&gt;MySQL does not have this mechanism.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;NO ACTION&lt;/code&gt; is effectively &lt;code&gt;RESTRICT&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;In other words, you cannot structure processing in the form of “it is okay as long as it is consistent in the end.” You are always constrained by ordering rules where the parent must come first and the child must come after.&lt;/p&gt;

&lt;p&gt;This may look like a small issue, but it makes loading test data and writing migration processes much harder.&lt;/p&gt;

&lt;p&gt;For example, in test code, if you want to roughly load fixtures spanning multiple tables, in PostgreSQL you can write it so that consistency is satisfied by the end of the transaction. In MySQL you cannot do that, so you always need to manage fixture insertion order strictly.&lt;/p&gt;

&lt;p&gt;If using foreign keys makes testing more troublesome, I think it is natural that a culture emerges where people say, “Then let’s stop using foreign keys.”&lt;/p&gt;

&lt;h3&gt;
  
  
  MySQL makes it easy to escape by disabling foreign keys
&lt;/h3&gt;

&lt;p&gt;In MySQL, you can disable constraints with &lt;code&gt;foreign_key_checks=0&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;This looks convenient, but it is quite dangerous.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SET&lt;/span&gt; &lt;span class="n"&gt;foreign_key_checks&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="k"&gt;INSERT&lt;/span&gt; &lt;span class="k"&gt;INTO&lt;/span&gt; &lt;span class="n"&gt;books&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;author_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;VALUES&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;999&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="k"&gt;SET&lt;/span&gt; &lt;span class="n"&gt;foreign_key_checks&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;With this, inconsistent data inserted while constraints are disabled can remain.&lt;/p&gt;

&lt;p&gt;Even if you turn the checks back on, MySQL does not go back and re-verify the data that was inserted while they were off.&lt;/p&gt;

&lt;p&gt;With this behavior, it becomes easy for accidents to happen where constraints are turned off for testing or migration convenience, and inconsistent data is brought in as-is.&lt;/p&gt;

&lt;p&gt;PostgreSQL has a tool that says, “Keep the constraints, but delay the check timing.”&lt;/p&gt;

&lt;p&gt;MySQL tends to go in the direction of “turn off the constraints themselves.”&lt;/p&gt;

&lt;p&gt;This difference is quite large.&lt;/p&gt;

&lt;h3&gt;
  
  
  You can see the difference even from a foreign key example alone
&lt;/h3&gt;

&lt;p&gt;For example, consider an ordinary pair of tables with a parent-child relationship.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;users&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="nb"&gt;bigint&lt;/span&gt; &lt;span class="k"&gt;PRIMARY&lt;/span&gt; &lt;span class="k"&gt;KEY&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;orders&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="nb"&gt;bigint&lt;/span&gt; &lt;span class="k"&gt;PRIMARY&lt;/span&gt; &lt;span class="k"&gt;KEY&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;user_id&lt;/span&gt; &lt;span class="nb"&gt;bigint&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="k"&gt;CONSTRAINT&lt;/span&gt; &lt;span class="n"&gt;orders_user_fk&lt;/span&gt;
    &lt;span class="k"&gt;FOREIGN&lt;/span&gt; &lt;span class="k"&gt;KEY&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;REFERENCES&lt;/span&gt; &lt;span class="n"&gt;users&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;DEFERRABLE&lt;/span&gt; &lt;span class="k"&gt;INITIALLY&lt;/span&gt; &lt;span class="k"&gt;DEFERRED&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In PostgreSQL, even during tests, you can write something like inserting into &lt;code&gt;orders&lt;/code&gt; first and then inserting into &lt;code&gt;users&lt;/code&gt; later, as long as it is within a transaction.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;BEGIN&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="k"&gt;INSERT&lt;/span&gt; &lt;span class="k"&gt;INTO&lt;/span&gt; &lt;span class="n"&gt;orders&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;VALUES&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="k"&gt;INSERT&lt;/span&gt; &lt;span class="k"&gt;INTO&lt;/span&gt; &lt;span class="n"&gt;users&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;VALUES&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="k"&gt;COMMIT&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This flexibility helps a lot with fixture creation, data migration, and simplifying test code.&lt;/p&gt;

&lt;p&gt;MySQL does not have this.&lt;/p&gt;

&lt;p&gt;So in MySQL it is easier for people to drift toward thinking, “Foreign keys are in the way, so turn them off,” or “Guarantee it in the application,” while in PostgreSQL it is easier to drift toward thinking, “Let’s use foreign keys properly.”&lt;/p&gt;

&lt;p&gt;The reason PostgreSQL foreign keys are better is not simply that they have more features.&lt;/p&gt;

&lt;p&gt;A big part of it is that they make it easier to write tests, migrations, and data loading while keeping the constraints intact, and as a result, it becomes easier to actually use foreign keys in production.&lt;/p&gt;

&lt;h2&gt;
  
  
  In MySQL, you cannot do vector operations
&lt;/h2&gt;

&lt;p&gt;Recently, I think this is probably the reason most often mentioned for adopting PostgreSQL.&lt;/p&gt;

&lt;p&gt;PostgreSQL has &lt;code&gt;pgvector&lt;/code&gt;, which not only allows you to store vectors, but also lets you use distance operations and similarity search directly from applications. It also has indexes for nearest-neighbor search, so it is easy to use directly in implementation.&lt;/p&gt;
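&lt;p&gt;As a minimal sketch of what that looks like (the table and the vector dimension are made up for illustration): store an embedding, add a nearest-neighbor index, and search by distance.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;CREATE EXTENSION IF NOT EXISTS vector;

CREATE TABLE items (
  id bigint PRIMARY KEY,
  embedding vector(3)
);

-- HNSW index for approximate nearest-neighbor search (pgvector 0.5.0+)
CREATE INDEX ON items USING hnsw (embedding vector_l2_ops);

-- nearest neighbors by L2 distance
SELECT id
FROM items
ORDER BY embedding &amp;lt;-&amp;gt; '[1, 2, 3]'
LIMIT 5;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;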

&lt;p&gt;By contrast, looking at the OSS edition: MySQL added a &lt;code&gt;VECTOR&lt;/code&gt; type in MySQL 9.0, which has been released as an Innovation Release, while no LTS version includes it yet. However, the distance functions are provided only in MySQL HeatWave on OCI and MySQL AI, and are not included in MySQL Commercial or Community. In other words, the OSS edition cannot do vector operations, so it is not really usable for this. This is a clear difference from PostgreSQL + &lt;code&gt;pgvector&lt;/code&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Character sets and collations are still more complicated in MySQL
&lt;/h2&gt;

&lt;p&gt;This is also very important.&lt;/p&gt;

&lt;p&gt;I still think character sets and collations are more likely to cause trouble in MySQL than in PostgreSQL.&lt;/p&gt;

&lt;p&gt;However, this is not only a problem with MySQL itself. It includes frameworks, connectors, and default settings as well.&lt;/p&gt;

&lt;p&gt;There are well-known examples in Japan such as the so-called “Haha-Papa problem” and “Sushi-Beer problem.”&lt;/p&gt;

&lt;p&gt;Both of these are not so much character encoding problems as collation problems.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The Haha-Papa problem is when strings that look different are treated as the same because of collation rules.&lt;/li&gt;
&lt;li&gt;The Sushi-Beer problem is when comparisons involving emoji and similar characters do not behave the way you intuitively expect.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;What makes these problems troublesome is that “we changed it to utf8mb4, so we are done” is not enough. In reality, you need to understand both the character set and the collation.&lt;/p&gt;
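&lt;p&gt;As a hedged illustration (assuming a utf8mb4 connection), both problems can be reproduced with a one-line comparison; with these collations, both comparisons are treated as equal:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- utf8mb4_unicode_ci ignores dakuten/handakuten at its comparison level
SELECT 'ハハ' = 'パパ' COLLATE utf8mb4_unicode_ci;

-- utf8mb4_general_ci gives all supplementary-plane characters the same weight
SELECT '🍣' = '🍺' COLLATE utf8mb4_general_ci;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;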

&lt;p&gt;And MySQL 8, rather than making this simpler, actually gave us even more things to think about. New collations were added, and they coexist with the older ones, so “the old style,” “the post-MySQL-8 style,” and “framework defaults” do not always line up.&lt;/p&gt;

&lt;p&gt;In other words, MySQL 8 certainly improved some things, but because old and new styles now coexist as a result of those improvements, the overall situation has in some ways become even more chaotic.&lt;/p&gt;

&lt;p&gt;I think this is not so much because MySQL itself is bad, but because it has a long history and has evolved while carrying compatibility with it.&lt;/p&gt;

&lt;p&gt;Still, from the point of view of an application developer, that complexity directly becomes an entry point for accidents.&lt;/p&gt;

&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;p&gt;In the past, PostgreSQL had some clear weaknesses too.&lt;/p&gt;

&lt;p&gt;But now, many of them have become much less significant. The feature gap has narrowed, and under the assumption of managed services, there are more things users do not need to think about directly.&lt;/p&gt;

&lt;p&gt;On the other hand, from the point of view of application implementation, there are still reasons why PostgreSQL is easier to recommend.&lt;/p&gt;

&lt;p&gt;The especially big ones are these.&lt;/p&gt;

&lt;h3&gt;
  
  
  Things MySQL 8 added and narrowed the gap on
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;CHECK&lt;/code&gt; constraints&lt;/li&gt;
&lt;li&gt;Window functions&lt;/li&gt;
&lt;li&gt;&lt;code&gt;SKIP LOCKED&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Clear reasons to recommend PostgreSQL
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;ON CONFLICT DO NOTHING&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;RETURNING&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;VALUES&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;being able to bring window functions into update processing&lt;/li&gt;
&lt;li&gt;partial indexes&lt;/li&gt;
&lt;li&gt;the maturity of foreign keys&lt;/li&gt;
&lt;li&gt;vector operations through &lt;code&gt;pgvector&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;being less likely to cause trouble around character sets and collations&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I like MySQL.&lt;/p&gt;

&lt;p&gt;I have used it for a long time, and I still think it is a good database that is easy to get good performance from.&lt;/p&gt;

&lt;p&gt;Even so, if the question is which one I would adopt for a new project today, the one I would recommend is PostgreSQL.&lt;/p&gt;

&lt;p&gt;That is not because MySQL is bad.&lt;/p&gt;

&lt;p&gt;It is because even now, after many of its old weaknesses have been filled in, I still think PostgreSQL has an advantage when it comes to ease of application implementation.&lt;/p&gt;

</description>
      <category>mysql</category>
      <category>postgres</category>
      <category>database</category>
      <category>sql</category>
    </item>
  </channel>
</rss>
