Mustafa ERBAY

Originally published at mustafaerbay.com.tr

Secret Rotation: Practical Ways to Enhance Security

I've seen countless times how much risk static secrets (API keys, database passwords, certificates) pose in my systems. A few years ago, on a client project, we experienced a serious security vulnerability due to an old service account's API key that was forgotten in the production environment. We had only disabled the secret instead of rotating it, and later realized the old key was still active in another system. This incident clearly showed me that secret rotation is not just a "best practice," but also a fundamental security requirement.

In this post, I'll explain why secret rotation is so important, different rotation strategies, and the practical methods I've implemented in my own systems. My goal is to share the challenges I've faced and the solutions I've found to make my secret management processes more robust.

Why Is Secret Rotation a Critical Security Step?

The longer any static secret remains unchanged in a system, the greater the risk of it being compromised or misused. In the event of a breach, an attacker's first target is usually these types of static credentials. If these secrets are not regularly renewed, a once-compromised key can remain valid indefinitely, creating a persistent backdoor.

In my experience, especially in legacy systems or projects with rapid development, I've seen how easily secrets can be overlooked. In a production ERP, there was a database user defined for an old integration that hadn't changed in six years. This user had broad privileges, and this situation was flagged as a major risk during a cybersecurity audit. This example alone demonstrates how vital regular rotation is.

⚠️ The Danger of Long-Lived Secrets

Long-lived secrets can provide attackers with persistent access in the event of a breach. This makes detection difficult and increases the extent of the damage. The longer a secret's lifespan, the higher the probability of that secret being obtained and used by malicious actors.

Furthermore, human error is a significant factor. A developer might accidentally commit a secret to a code repository or leave it exposed in a log file. If this secret is subject to a rotation policy, even such an error will be rendered ineffective after a certain period. I remember accidentally writing an S3 bucket key to test logs while developing the backend for one of my side products. Fortunately, this key had a 30-day rotation period and was automatically renewed a few days after the incident. This limited the impact of a potential vulnerability.

Secret Rotation Strategies and Approaches

There are several different ways to implement secret rotation, and each has its own advantages and disadvantages. Generally, they fall on a spectrum from manual to fully automated.

1. Manual Rotation

This is the simplest method. At regular intervals (e.g., once a month), an administrator or developer manually changes the secret and updates it in all relevant systems. This approach might be feasible for small systems with few secrets. However, it's prone to human error, time-consuming, and tends to be inconsistent.

I tried this method initially for one of my small side products. Every month, I'd put a note on my calendar: "Change DB password and API keys." But I remember skipping a month or two during a busy period and then getting frustrated with myself. As the scale grows or the number of secrets increases, this method becomes unsustainable. Especially changing a secret used by more than 10 applications on a single database server could turn into almost half a day's work.

2. Semi-Automated Rotation

In this strategy, the creation or modification of the secret is automated, but its distribution or the updating of applications might still require manual intervention. For example, a script might generate a new secret, but the system administrator copies this secret to the relevant configuration files and restarts the services.

On a client project, I saw that the security team automatically generated certain certificates and placed them in a repository, but the distribution to the Nginx servers using these certificates and the restart of the Nginx service were the responsibility of the operations team. While better than manual, this still carried coordination and human factor risks. I experienced a similar situation with the deploy-hook feature of the certbot tool I used to renew Let's Encrypt certificates on my own server. certbot would renew the certificate, but if I forgot to restart Nginx, the old certificate would remain active. That's why I added the ExecReload command to my systemd unit to automate this process.

3. Fully Automated Rotation

This is the ideal approach. The creation, distribution, updating of relevant applications or services, and even the cleanup of old secrets are completely automated. This is typically achieved with a Secret Management Tool (SMT) or custom automation scripts and CI/CD processes.

In a production ERP I used, database passwords, API keys, and service tokens were managed with HashiCorp Vault. Applications fetched current secrets from Vault at startup and at regular intervals, so when we rotated a secret, every dependent system picked up the new value automatically. This significantly reduced operational overhead while strengthening the security posture. I go into more detail in my post on security integration in CI/CD processes.
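The "fetch at startup or at regular intervals" pattern can be sketched as a small cache around whatever call reads the secret. This is a minimal illustration, not the setup from that project: CachedSecret and its fetch callable are hypothetical names, and fetch stands in for a real client call to the secret store.

```python
import time
from typing import Callable, Optional


class CachedSecret:
    """Caches one secret and re-fetches it once the cached copy is older than ttl_seconds."""

    def __init__(self, fetch: Callable[[], str], ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._fetch = fetch      # hypothetical: wraps the real call to the secret store
        self._ttl = ttl_seconds
        self._clock = clock      # injectable so the behaviour can be tested without sleeping
        self._value: Optional[str] = None
        self._fetched_at = 0.0

    def get(self) -> str:
        """Return the cached secret, refreshing it first if it is missing or stale."""
        now = self._clock()
        if self._value is None or now - self._fetched_at >= self._ttl:
            self._value = self._fetch()
            self._fetched_at = now
        return self._value

    def invalidate(self) -> None:
        """Force the next get() to re-fetch, for example after an authentication error."""
        self._value = None
```

With a TTL of a few minutes, a rotated secret reaches every process within that window without a restart; invalidate() covers the case where the old value is rejected before the TTL runs out.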

Database Credentials and Rotation

Databases typically house the most sensitive secrets of systems. Therefore, regularly rotating database credentials is one of the highest priorities. I have experience in this area, especially in projects where I worked with PostgreSQL.

Changing a user's password in PostgreSQL is quite simple with the ALTER USER command:

ALTER USER myapp_user WITH PASSWORD 'new_strong_password';

However, the real challenge is implementing this change in a live system without causing downtime. My strategy in a production ERP was as follows:

  1. Creating a New User (Optional but Secure): If needed, creating a new user with the same privileges provides a safety net for rollback scenarios.
  2. Transition at the Application Layer: How applications manage database connection pools is critical. A pool does not notice a password change on its own: connections that are already open keep working, and only new connections authenticate again. Some pools let you swap credentials at runtime (HikariCP in Java, for example, exposes username and password setters through its HikariConfigMXBean), and a custom pool written on top of asyncpg in Python can be built to re-read credentials on a reload command. If neither is available, applications need to be restarted one at a time.
  3. Two-Phase Rotation: In some cases, I implemented a transition strategy in which both the old and the new credential were valid for a period. A PostgreSQL role holds only one password at a time, so the overlap comes from the second user created in step 1: the new user receives the new password, applications are switched over to it, and the old user is disabled once every application has completed the transition. This is particularly useful for minimizing downtime in large and complex deployments.
  4. pg_hba.conf Management: Authentication methods are defined in the pg_hba.conf file. If IP-based restrictions or different authentication mechanisms are used here, these changes must also be included in the rotation plan.
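Because a PostgreSQL role stores a single password, one way to get an overlap window is to alternate between two roles with identical privileges. The sketch below only builds the SQL for one rotation step and executes nothing. The role names myapp_user_a and myapp_user_b and the helper names are assumptions for illustration.

```python
import secrets
import string
from typing import Tuple

# Letters and digits only, so the SQL string literal needs no escaping.
ALPHABET = string.ascii_letters + string.digits


def other_role(active: str, roles: Tuple[str, str]) -> str:
    """Return the role that is currently idle and will receive the new password."""
    if active not in roles:
        raise ValueError("unknown role: %s" % active)
    return roles[1] if active == roles[0] else roles[0]


def plan_rotation(active: str,
                  roles: Tuple[str, str] = ("myapp_user_a", "myapp_user_b")):
    """Build the statements for one rotation step without running them."""
    standby = other_role(active, roles)
    password = "".join(secrets.choice(ALPHABET) for _ in range(32))
    # Phase 1: enable the standby role with a fresh password.
    activate = "ALTER ROLE %s WITH LOGIN PASSWORD '%s';" % (standby, password)
    # Phase 2: run only after every application has switched to the standby role.
    retire = "ALTER ROLE %s WITH NOLOGIN;" % active
    return standby, password, activate, retire
```

Between the two phases both roles can log in, which is the window applications use to switch over. Keep in mind that a statement containing a password may end up in server logs, depending on the logging configuration.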

Once, while rotating the PostgreSQL password for the backend of a task management application I developed, I realized that the connection pool wasn't automatically picking up the new password. Everything worked fine after I restarted the application, but even this brief outage made me more cautious. This situation highlights the importance of understanding how each component reacts to secret rotation. I specifically automated reloading secrets using commands like ExecReload or ExecStartPost for systemd services. I also touched on the intricacies of database management in my post on PostgreSQL performance tuning and WAL bloat issues.
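One way to make a component tolerate a rotated password is to reload the credential and retry once when authentication fails. This is a generic sketch rather than the fix I applied: connect and load_password are placeholder callables, and AuthFailed is a hypothetical exception that the driver's real authentication error would have to be mapped onto.

```python
from typing import Callable, Tuple, TypeVar

Conn = TypeVar("Conn")


class AuthFailed(Exception):
    """Hypothetical error; map the driver's authentication failure onto it."""


def connect_with_reload(connect: Callable[[str], Conn],
                        load_password: Callable[[], str],
                        cached_password: str) -> Tuple[Conn, str]:
    """Try the cached password first; on an auth failure, reload once and retry."""
    try:
        return connect(cached_password), cached_password
    except AuthFailed:
        fresh = load_password()
        if fresh == cached_password:
            raise  # nothing newer to try, so surface the original failure
        return connect(fresh), fresh
```

The function returns the password that worked so the caller can cache it for the next connection attempt.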

API Keys and Service Tokens

API keys and service tokens used in inter-application communication are also important categories of secrets that require regular rotation. Especially keys used for publicly exposed APIs or integrations with third-party services should be rotated more frequently as they expand the attack surface.

JWT and OAuth2 Tokens

Rotation strategies for JWT (JSON Web Tokens) and OAuth2 tokens, commonly used in modern applications, are slightly different. JWTs typically have a short lifespan (minutes or hours). The crucial part is the regular rotation of the keys used to sign these tokens (HMAC secret or RSA private key).

In a production ERP I used, I rotated the signing keys for JWTs used for user sessions every 30 days. This meant that even if a key was compromised, it would stop being useful within a maximum of one month. I set up this process to happen automatically in my key management system. When a new key was generated, application services loaded it dynamically: the ExecReload command in the systemd units let them pick up the new key without being stopped by a SIGTERM signal, which kept the transition seamless. The previous key has to remain available for verification until the last token signed with it has expired; otherwise valid sessions are rejected the moment the key changes.
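A common way to keep old tokens verifiable during a signing-key rotation is to tag each token with a key ID (the kid header) and verify against a small set of known keys. The sketch below implements HS256 with the standard library only to show the idea. The key IDs and the keys dictionary are assumptions, expiry (exp) checking is left out, and a production system would normally use a maintained JWT library.

```python
import base64
import hashlib
import hmac
import json


def _b64(data: bytes) -> str:
    """Base64url-encode without padding, as JWTs do."""
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode()


def _unb64(text: str) -> bytes:
    """Decode base64url text, restoring the stripped padding first."""
    return base64.urlsafe_b64decode(text + "=" * (-len(text) % 4))


def sign(payload: dict, kid: str, keys: dict) -> str:
    """Issue an HS256 token signed with the key named by kid."""
    header = {"alg": "HS256", "typ": "JWT", "kid": kid}
    signing_input = _b64(json.dumps(header).encode()) + "." + _b64(json.dumps(payload).encode())
    mac = hmac.new(keys[kid], signing_input.encode(), hashlib.sha256).digest()
    return signing_input + "." + _b64(mac)


def verify(token: str, keys: dict) -> dict:
    """Verify with the key the header names; a retired key is simply absent from keys."""
    head_b64, body_b64, sig_b64 = token.split(".")
    header = json.loads(_unb64(head_b64))
    if header.get("alg") != "HS256":
        raise ValueError("unexpected algorithm")
    key = keys.get(header.get("kid"))
    if key is None:
        raise ValueError("unknown or retired key id")
    expected = hmac.new(key, (head_b64 + "." + body_b64).encode(), hashlib.sha256).digest()
    if not hmac.compare_digest(expected, _unb64(sig_b64)):
        raise ValueError("bad signature")
    return json.loads(_unb64(body_b64))
```

Rotation then means adding a new entry to keys, signing new tokens with its kid, and deleting the old entry only after the longest-lived token signed with it has expired.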

Third-Party API Keys

Many applications use APIs from third-party services like Stripe, Twilio, or similar. The rotation of these API keys depends on the capabilities offered by the service provider. Typically, a new key is generated from the service provider's management panel, and the old key is deactivated.

In the backend of my Android spam blocker app, I was integrated with an SMS gateway service. I needed to rotate this service's API key every 90 days. I managed this process with an automation script:

  1. Generate a new key via the service provider's API.
  2. Check the validity of the old key.
  3. Add the new key to the application's configuration file.
  4. Restart application services or reload the configuration.
  5. Deactivate the old key.
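The five steps above can be sketched as one function. This is not the script I ran: the Provider interface is hypothetical (real gateways expose different calls), and write_config and reload_app are placeholders for whatever updates the configuration file and reloads the service. Here step 2 is read as checking whether the old key can still serve as a fallback.

```python
from typing import Callable, Protocol


class Provider(Protocol):
    """Hypothetical provider client; a real SMS gateway API will look different."""

    def create_key(self) -> str: ...
    def is_valid(self, key: str) -> bool: ...
    def revoke_key(self, key: str) -> None: ...


def rotate_api_key(provider: Provider, old_key: str,
                   write_config: Callable[[str], None],
                   reload_app: Callable[[], None]) -> str:
    """Run the five rotation steps in order and return the new key."""
    new_key = provider.create_key()              # 1. generate a new key
    can_roll_back = provider.is_valid(old_key)   # 2. is the old key still a usable fallback?
    write_config(new_key)                        # 3. put the new key into the configuration
    try:
        reload_app()                             # 4. restart or reload the application
    except Exception:
        if can_roll_back:
            write_config(old_key)                # old key is not revoked yet, so restore it
        raise
    provider.revoke_key(old_key)                 # 5. deactivate only after a clean reload
    return new_key
```

Revoking the old key last is the important part: if anything fails earlier, the application still has a working credential to fall back to.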

Automating this process was critical because it was very prone to being forgotten when done manually. Once, I forgot to set up this automation, and the key expired, causing SMS deliveries to stop for 6 hours. This situation showed how important automation is not just for convenience, but also for reliability.

Automation Tools and Processes

Automation is indispensable for successful secret rotation. Manual operations are error-prone and don't scale. Here are some automation approaches I've used in my own systems and client projects:

Secret Management Tools (SMT)

SMTs like HashiCorp Vault, AWS Secrets Manager, and Azure Key Vault offer ideal solutions for centrally managing secrets, dynamically generating them, and automating their rotation. These tools also simplify auditing and logging access to secrets by applications.

On a client project, we were managing secrets for over 3000 services via HashiCorp Vault. Vault could automatically generate and rotate database credentials and API keys with specific TTL (Time-To-Live) periods. Applications would retrieve these secrets using Vault client libraries or tools like envconsul. This way, when we rotated a secret, Vault automatically generated a new one, and applications would fetch this secret within minutes, ensuring a seamless transition. This type of configuration significantly reduces operational overhead, especially in microservice architectures with a large number of services.

CI/CD Integration

CI/CD pipelines offer a powerful platform for automating secret rotation processes. Steps for creating a new secret, updating configuration files, and restarting services can be integrated into the CI/CD workflow.

In the deployment process for one of my side products, I use GitLab CI. Here, there's a step that ensures a newly generated API key is automatically added to the env file deployed to the production environment.

# A snippet from .gitlab-ci.yml
deploy_production:
  stage: deploy
  script:
    - export NEW_API_KEY=$(generate_new_api_key_script) # Generate new key with custom script
    - sed -i "s|^API_KEY=.*$|API_KEY=${NEW_API_KEY}|" .env.production # "|" as delimiter because keys often contain "/"
    - ssh user@prod-server "sudo systemctl reload myapp-backend.service"
  only:
    - master

In this example, generate_new_api_key_script stands for a custom, external script that obtains a new key from a key management system or directly from the service's API. This approach ensures that the most up-to-date secrets are used at the time of deployment. I cover this topic further in my post on building reliable CI/CD pipelines.
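Editing an env file with a regular-expression substitution can misbehave when the new key contains characters that are special to the substitution. A small helper that rewrites the line by plain string comparison avoids that. This is a sketch under my own assumptions: set_env_value is a hypothetical helper, and it expects simple NAME=value lines without quoting or export prefixes.

```python
import os
import tempfile


def set_env_value(path: str, name: str, value: str) -> None:
    """Replace or append NAME=value in an env file, swapping the file in atomically."""
    with open(path, encoding="utf-8") as fh:
        lines = fh.read().splitlines()

    prefix = name + "="
    replaced = False
    for i, line in enumerate(lines):
        if line.startswith(prefix):
            lines[i] = prefix + value   # plain string assignment, no escaping rules
            replaced = True
    if not replaced:
        lines.append(prefix + value)

    # Write to a temporary file in the same directory, then rename over the original,
    # so a reader never sees a half-written file. mkstemp creates it owner-readable only.
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w", encoding="utf-8") as fh:
        fh.write("\n".join(lines) + "\n")
    os.replace(tmp_path, path)
```

A pipeline step could call this instead of the substitution, for example set_env_value(".env.production", "API_KEY", new_key).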

Custom Scripts and systemd Timers

For smaller-scale systems or specific needs, I use custom shell scripts or Python scripts with systemd timers for automation. For example, I use a systemd timer to renew TLS certificates used for an Nginx reverse proxy and reload Nginx.

# /etc/systemd/system/nginx-cert-rotate.service
[Unit]
Description=Nginx Certificate Rotation Script

[Service]
Type=oneshot
ExecStart=/usr/local/bin/rotate_nginx_certs.sh
User=root

# /etc/systemd/system/nginx-cert-rotate.timer
[Unit]
Description=Run Nginx Certificate Rotation Daily

[Timer]
OnCalendar=daily
Persistent=true

[Install]
WantedBy=timers.target

The /usr/local/bin/rotate_nginx_certs.sh script renews the certificates and then reloads Nginx with systemctl reload nginx to activate the new certificates (the unit already runs as root, so sudo is not needed). This is very useful, especially on bare-metal servers or when I'm not using container orchestration.

Challenges and Solutions of Secret Rotation

While secret rotation offers significant security benefits, it also brings some operational challenges. Knowing these challenges beforehand and developing solution strategies is critical for a seamless transition.

1. Risk of Interruption and Downtime

An incorrectly performed rotation can lead to applications being unable to access secrets, thus causing downtime. Especially in large systems, updating all components simultaneously is a challenging task.

Solution:

  • Phased Rollouts (Blue/Green or Canary Deployments): Deploying new service instances configured with new secrets and gradually shifting traffic to them.
  • Two-Phase Secret Policy: Ensuring that both the old and new secrets are valid for a certain period. This allows applications to gradually transition to the new secret. For example, alternating between two database users so that the old credential keeps working until the switch is complete, or implementing a "continue to accept the old one" policy for an API key.
  • Connection Pool Reload: Ensuring that application connection pools can dynamically reload secrets. If this isn't possible, ensure the application can pick up new secrets with a graceful restart.
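On the side that validates an API key, a "continue to accept the old one" policy can be as small as the check below. It is an illustrative sketch: key_is_accepted and its parameters are hypothetical names, and previous_expires_at is a Unix timestamp marking the end of the grace period.

```python
import hmac
import time
from typing import Optional


def key_is_accepted(presented: str, current: str, previous: Optional[str],
                    previous_expires_at: float, now: Optional[float] = None) -> bool:
    """Accept the current key always, and the previous key until its grace period ends."""
    now = time.time() if now is None else now
    # compare_digest compares in constant time, which avoids leaking key contents via timing.
    if hmac.compare_digest(presented.encode(), current.encode()):
        return True
    if previous is not None and now < previous_expires_at:
        return hmac.compare_digest(presented.encode(), previous.encode())
    return False
```

The grace period should be just long enough for every client to pick up the new key; after that the old key stops working without any further action.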

Once, when rotating the database password for a service of my side product, I forgot to add the new password to the deployment pipeline. The service started but couldn't connect to the database, and I experienced a 15-minute outage. This showed how important it is to meticulously test every step of the automation.

2. Dependency Management

When a secret is used by multiple applications or services, identifying and updating all dependent systems can be challenging. An old, forgotten service or cron job can cause problems after rotation.

Solution:

  • Centralized Secret Management (SMT): Managing all secrets in one place makes it easier to track dependencies.
  • Secret Mapping: Documenting which secret is used by which application or service and regularly reviewing it.
  • Access Control and Auditing: SMTs typically log secret accesses. By analyzing these logs, we can see which services accessed which secrets and when.
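Audit logs can be turned into a secret-to-consumer map with a few lines. The log format here is an assumption made for the example: one JSON object per line with "secret" and "client" fields. Real audit logs from a secret manager use their own schema and would need a matching parser.

```python
import json
from collections import defaultdict
from typing import Dict, Iterable, List


def consumers_by_secret(audit_lines: Iterable[str]) -> Dict[str, List[str]]:
    """Group the clients seen in the audit log by the secret they accessed."""
    usage = defaultdict(set)
    for line in audit_lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines between entries
        entry = json.loads(line)
        usage[entry["secret"]].add(entry["client"])
    return {secret: sorted(clients) for secret, clients in usage.items()}
```

Comparing this map with the documented secret mapping before a rotation shows which consumers are missing from the documentation.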

During a rotation on a production project, we discovered a reporting script from 2018 that ran in a test environment but connected to the production database. It had stopped producing reports because it never received the new password. Such "ghost" dependencies can only be identified through regular audits and inventorying.

3. Debugging and Observability

To quickly identify and resolve issues that arise during or after rotation, it's necessary to have adequate logging and monitoring mechanisms.

Solution:

  • Detailed Logging: Log secret rotation operations and related service secret access errors in detail. Error messages should be clear and understandable.
  • Metrics and Alerts: Collect proactive metrics and set up alerts for secret access errors, connection errors, or service outages.
  • Audit Logs: Maintain audit logs showing who rotated or accessed which secret and when.
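An alert on secret access errors usually comes down to counting failures in a time window. The sketch below shows that logic in isolation; FailureWindow, the threshold, and the window length are assumptions, and in practice the counting would normally live in the monitoring system rather than in application code.

```python
from collections import deque


class FailureWindow:
    """Counts authentication failures inside a sliding time window."""

    def __init__(self, threshold: int, window_seconds: float) -> None:
        self._threshold = threshold
        self._window = window_seconds
        self._events = deque()

    def record(self, timestamp: float) -> bool:
        """Record one failure; return True when the window holds enough to alert."""
        self._events.append(timestamp)
        # Drop failures that have fallen out of the window.
        while self._events and self._events[0] <= timestamp - self._window:
            self._events.popleft()
        return len(self._events) >= self._threshold
```

A burst of failures right after a rotation is the typical signature of a service that still holds the old secret.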

In my system, when I rotated the Redis password, some services kept trying to connect to Redis with the old password and received an ERR invalid password error. The journald logs let me identify the error quickly and restart the affected service. In situations like this, logs and metrics that surface failures immediately shorten troubleshooting time significantly.

ℹ️ How Often Should Rotation Occur?

The rotation period depends on the secret's sensitivity, your risk tolerance, and operational complexity. Generally, for sensitive secrets (database passwords, root API keys), 30-90 days is ideal. This period can be extended for less sensitive or short-lived tokens. However, the better the automation, the more frequently rotation can be performed.
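A rotation policy like this is easy to check mechanically against an inventory of secrets. The sketch below flags secrets whose last rotation is older than their class allows. The two classes and their limits (30 and 90 days) are assumptions chosen from the range above, and the inventory structure is hypothetical.

```python
from datetime import datetime, timedelta
from typing import Dict, List, Tuple

# Assumed policy for illustration; pick limits that match your own risk tolerance.
MAX_AGE = {"sensitive": timedelta(days=30), "standard": timedelta(days=90)}


def overdue_secrets(inventory: Dict[str, Tuple[str, datetime]],
                    now: datetime) -> List[str]:
    """Return the names of secrets whose last rotation is older than their class allows."""
    return sorted(
        name
        for name, (sensitivity, rotated_at) in inventory.items()
        if now - rotated_at > MAX_AGE[sensitivity]
    )
```

Run on a schedule, a check like this catches the skipped month that a calendar reminder misses.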

Conclusion

Secret rotation is one of the cornerstones of modern system security. Transitioning from manual approaches to full automation not only increases operational efficiency but also significantly strengthens the security posture. In my 20 years of field experience, I've seen numerous projects and systems pay the price for underestimating this issue.

Remember, a secret compromise may be inevitable, but by shortening the secret's lifespan, we can minimize potential damage. Automation, detailed monitoring, and well-defined processes are the keys to turning secret rotation from a dreaded task into a routine security practice. My preference is to aim for full automation wherever possible and remove the human factor from the process as much as I can. Always being prepared for things to go wrong, rather than just saying "it happens," means far fewer headaches in the long run.