
Darian Vance

Originally published at wp.me

Solved: Managing short-lived tokens on VMs — a small open-source config-driven solution

🚀 Executive Summary

TL;DR: Managing short-lived access tokens on VMs for external resource access poses significant operational challenges, leading to application outages and security risks. The article presents solutions ranging from cloud-native IAM roles (AWS Instance Profiles, Azure Managed Identities, GCP Service Accounts) for automated credential refresh to a custom, config-driven open-source agent called TokenRelay for hybrid and on-premise environments.

🎯 Key Takeaways

  • Short-lived tokens, while a security best practice, introduce operational complexities on VMs, including the need for automated refresh, robust error handling, and preventing manual toil.
  • Cloud provider IAM roles (e.g., AWS Instance Profiles, Azure Managed Identities, GCP Service Accounts) offer the most secure and scalable solution for cloud-native workloads by providing temporary credentials via metadata services without storing long-lived keys.
  • For hybrid, on-premise, or multi-cloud environments requiring access to diverse external services, a dedicated, config-driven agent like “TokenRelay” can bridge the gap by securely fetching, refreshing, and distributing tokens from various sources.
  • Manual scripting for token renewal, while simple, carries significant security risks (hardcoded secrets), high maintenance burden, and poor scalability for large VM fleets.
  • Secure implementation practices include running agents with low privileges, restricting file permissions for stored tokens, and avoiding hardcoding sensitive credentials directly in scripts or configurations.

Struggling with short-lived token management on your VMs? This post dissects common challenges and compares three approaches: simple scripting, cloud-native IAM roles, and a custom, config-driven open-source agent, all aimed at secure and automated credential refresh.

The Perpetual Headache: Managing Short-Lived Tokens on VMs

In modern cloud and hybrid environments, applications and services running on Virtual Machines (VMs) frequently require access to external resources like databases, object storage, APIs, or secret management systems. Security best practice dictates the use of short-lived access tokens or credentials rather than long-lived static keys. While this limits the damage if a credential leaks, it introduces a significant operational challenge: how do you reliably and automatically refresh these tokens before they expire, without introducing manual toil or security gaps?

Symptoms of Token Management Woes

Failure to properly manage short-lived tokens can lead to a cascade of issues for IT professionals:

  • Application Outages: An application suddenly stops functioning because its access token to a critical backend service expired, leading to runtime errors and downtime.
  • Manual Overheads: DevOps teams are forced to manually renew tokens, update configuration files, and restart services, diverting valuable time from more strategic initiatives.
  • Security Risks: To avoid frequent outages, teams might be tempted to increase token validity periods, store tokens insecurely on the VM, or embed them directly into configuration files, increasing the blast radius in case of a breach.
  • Complex Error Handling: Applications need robust error handling for expired tokens, adding complexity to development and debugging.
  • Lack of Scalability: Manual processes don’t scale with a growing fleet of VMs, making consistent token management across hundreds or thousands of instances nearly impossible.

Solution 1: The Manual Chore & Simple Scripting Approach

This is often the default, especially in smaller environments or when migrating legacy applications. It involves scripting the token renewal process and scheduling it.

How it Works

A shell script is written to perform the following steps:

  • Authenticate with an identity provider (e.g., OAuth server, Vault, cloud STS) to obtain a new short-lived token.
  • Store the new token in a designated secure location (e.g., a file with restricted permissions, an environment variable).
  • Notify or restart the application that uses the token, if necessary, for it to pick up the new credential.

This script is then scheduled to run periodically using a tool like cron or a systemd timer, typically before the token’s expiration.

Example: Basic Shell Script with Cron

Let’s assume an application needs an OAuth token to access an external API, and we’re renewing it from an OAuth provider using curl and jq.

#!/bin/bash

# Configuration
CLIENT_ID="your_client_id"
CLIENT_SECRET="your_client_secret" # In a real scenario, fetch this from a secure store
TOKEN_ENDPOINT="https://oauth.example.com/token"
TOKEN_FILE="/opt/app/current_token.txt"
LOG_FILE="/var/log/token_renew.log"

# Function to log messages
log_message() {
    echo "$(date '+%Y-%m-%d %H:%M:%S') - $1" >> "$LOG_FILE"
}

log_message "Starting token renewal process..."

# Request a new token
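# --fail makes HTTP errors produce an empty response; --data-urlencode escapes special characters
RESPONSE=$(curl -sS --fail -X POST "$TOKEN_ENDPOINT" \
    -H "Content-Type: application/x-www-form-urlencoded" \
    --data-urlencode "grant_type=client_credentials" \
    --data-urlencode "client_id=$CLIENT_ID" \
    --data-urlencode "client_secret=$CLIENT_SECRET")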

ACCESS_TOKEN=$(echo "$RESPONSE" | jq -r '.access_token')

if [[ -z "$ACCESS_TOKEN" || "$ACCESS_TOKEN" == "null" ]]; then
    log_message "ERROR: Failed to retrieve access token. Response: $RESPONSE"
    exit 1
fi

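# Store the new token atomically; the file is never readable by other users
umask 077
TMP_FILE=$(mktemp "${TOKEN_FILE}.XXXXXX") # mktemp creates the file with mode 0600
printf '%s\n' "$ACCESS_TOKEN" > "$TMP_FILE"
mv "$TMP_FILE" "$TOKEN_FILE" # Rename is atomic on the same filesystem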

log_message "Successfully renewed token. Stored in $TOKEN_FILE."

# Optional: Signal or restart the application
# For example, if your application picks up environment variables on restart:
# systemctl restart my-application.service
# Or, if your application has a reload endpoint:
# curl -X POST http://localhost:8080/reload-token

log_message "Token renewal process completed."

To schedule this script, add an entry to the VM’s cron table (crontab -e), for example, to run every 30 minutes if the token expires hourly:

*/30 * * * * /usr/local/bin/renew_token.sh >/dev/null 2>&1
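
The 30-minute interval above is picked by hand. Where the token response carries an `expires_in` field (standard in OAuth 2.0 token responses), the delay could be derived from it instead. A minimal sketch, assuming a hypothetical helper named `refresh_delay_seconds`; the 0.5 fraction and 30-second floor are illustrative choices, not values from any specification:

```python
import json


def refresh_delay_seconds(token_response: str, fraction: float = 0.5,
                          minimum: int = 30) -> int:
    """Return how long to wait before renewing, given a raw JSON token response.

    `fraction` and `minimum` are illustrative defaults, not values from a spec.
    """
    payload = json.loads(token_response)
    expires_in = int(payload["expires_in"])  # token lifetime in seconds
    return max(minimum, int(expires_in * fraction))


if __name__ == "__main__":
    sample = '{"access_token": "abc", "token_type": "Bearer", "expires_in": 3600}'
    print(refresh_delay_seconds(sample))  # 1800
```

Renewing at a fraction of the lifetime, rather than just before expiry, leaves room for a failed attempt to be retried while the old token is still valid.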

Pros and Cons

  • Pros:
    • Simple to understand and implement for basic cases.
    • Highly flexible for custom token sources.
    • No external dependencies beyond standard Linux utilities.
  • Cons:
    • Security Risk: Client secrets or sensitive credentials might be hardcoded or stored insecurely on the VM to facilitate renewal.
    • Maintenance Burden: Requires careful management of scripts across multiple VMs. Debugging failures can be challenging.
    • Error Handling: Robust error handling (retries, alerts) needs to be manually built into each script.
    • Application Impact: May require application restarts or custom reload mechanisms, leading to brief service interruptions.
    • Scalability: Does not scale well; managing scripts on a large fleet of VMs becomes a nightmare.
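
To make the error-handling point concrete, here is a rough sketch of the retry logic each script would otherwise have to reimplement. The function name `fetch_with_retry`, the attempt count, and the delays are assumptions chosen for illustration; `fetch` stands in for whatever call obtains the token:

```python
import time
from typing import Callable, Optional


def fetch_with_retry(fetch: Callable[[], str], attempts: int = 4,
                     base_delay: float = 1.0,
                     sleep: Callable[[float], None] = time.sleep) -> str:
    """Call `fetch` until it succeeds, doubling the wait after each failure."""
    last_error: Optional[Exception] = None
    for attempt in range(attempts):
        try:
            return fetch()
        except Exception as exc:  # real code should catch narrower error types
            last_error = exc
            if attempt < attempts - 1:
                sleep(base_delay * (2 ** attempt))  # 1s, 2s, 4s, ...
    raise RuntimeError(f"token fetch failed after {attempts} attempts") from last_error
```

Passing `sleep` as a parameter keeps the function testable without real delays; an alerting hook could be added at the final `raise`.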

Solution 2: Leveraging Cloud Provider IAM Roles & Managed Identities

For workloads running natively within a public cloud (AWS, Azure, GCP), the most secure and scalable approach is to use the cloud provider’s native Identity and Access Management (IAM) capabilities for VM instances. These mechanisms provide a secure way for VMs to obtain temporary, short-lived credentials without ever storing long-lived keys on the instance itself.

Cloud-Native Security at its Best

The core principle is that the VM itself is assigned an identity. Applications running on the VM can then query a local metadata service to obtain temporary credentials associated with that identity. The cloud provider handles the entire credential rotation and distribution process transparently.

AWS EC2 Instance Profiles

In AWS, you attach an IAM Role to an EC2 instance via an Instance Profile. This role defines the permissions the instance’s applications will have.

  • How it Works: Applications on the EC2 instance make calls to the instance metadata service (typically http://169.254.169.254/latest/meta-data/iam/security-credentials/<role-name>) to retrieve temporary security credentials (access key ID, secret access key, and session token). The AWS SDKs and CLIs automatically handle this process.
  • Example:
  1. Create an IAM Role (e.g., MyEC2AppRole) with policies granting necessary permissions (e.g., S3 read-only).
  {
      "Version": "2012-10-17",
      "Statement": [
          {
              "Effect": "Allow",
              "Action": [
                  "s3:GetObject",
                  "s3:ListBucket"
              ],
              "Resource": [
                  "arn:aws:s3:::my-secure-bucket",
                  "arn:aws:s3:::my-secure-bucket/*"
              ]
          }
      ]
  }
  2. Attach this role to your EC2 instance when launching it, or modify an existing instance.

  3. From within the EC2 instance, use the AWS CLI or an SDK. No credentials need to be configured on the instance.

  # Using AWS CLI on the EC2 instance
  aws s3 ls s3://my-secure-bucket/

  # Or, directly query the metadata service (for understanding, not typical application use)
  TOKEN=$(curl -s -X PUT "http://169.254.169.254/latest/api/token" -H "X-aws-ec2-metadata-token-ttl-seconds: 21600") && \
  curl -s -H "X-aws-ec2-metadata-token: $TOKEN" http://169.254.169.254/latest/meta-data/iam/security-credentials/MyEC2AppRole
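
The credentials document returned by that last call includes an `Expiration` timestamp alongside the access key, secret key, and session token. The AWS SDKs track it for you, so the following is only a sketch of the underlying check; the helper names and the five-minute margin are assumptions for illustration:

```python
import json
from datetime import datetime, timedelta, timezone


def parse_expiration(credentials_json: str) -> datetime:
    """Extract the `Expiration` timestamp from a metadata credentials document."""
    doc = json.loads(credentials_json)
    # The timestamp is UTC, formatted like 2024-01-01T12:00:00Z.
    parsed = datetime.strptime(doc["Expiration"], "%Y-%m-%dT%H:%M:%SZ")
    return parsed.replace(tzinfo=timezone.utc)


def needs_refresh(expiration: datetime, now: datetime,
                  margin: timedelta = timedelta(minutes=5)) -> bool:
    """True once `now` is within `margin` of the expiry time."""
    return now >= expiration - margin


if __name__ == "__main__":
    sample = ('{"Code": "Success", "AccessKeyId": "ASIAEXAMPLE", '
              '"SecretAccessKey": "example", "Token": "example", '
              '"Expiration": "2024-01-01T12:00:00Z"}')
    expiry = parse_expiration(sample)
    print(needs_refresh(expiry, datetime.now(timezone.utc)))
```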

Azure Managed Identities

Azure’s Managed Identities provide an identity for your Azure services in Microsoft Entra ID (formerly Azure Active Directory, or Azure AD). You can assign them to Azure VMs.

  • How it Works: Azure AD handles the credential lifecycle. Applications running on the VM can obtain an Azure AD access token for accessing Azure resources (like Key Vault, Storage accounts) by making requests to a local, non-routable endpoint.
  • Example:
  1. Enable a System-assigned Managed Identity for your Azure VM (via Azure Portal, CLI, or ARM templates).
  az vm identity assign --name MyVm --resource-group MyResourceGroup
  2. Grant this Managed Identity permissions to the desired Azure resource (e.g., read secrets from an Azure Key Vault).
  # Get the principal ID of the VM's managed identity
  VM_PRINCIPAL_ID=$(az vm identity show -g MyResourceGroup -n MyVm --query principalId -o tsv)

  # Grant 'Get' permission on a Key Vault secret to the VM's managed identity
  az keyvault set-policy --name MyKeyVault \
      --object-id $VM_PRINCIPAL_ID \
      --secret-permissions get
  3. From within the VM, retrieve an access token for Azure Key Vault and use it.
  # Using curl to get an access token
  TOKEN_RESPONSE=$(curl 'http://169.254.169.254/metadata/identity/oauth2/token?api-version=2018-02-01&resource=https%3A%2F%2Fvault.azure.net' -H Metadata:true -s)
  ACCESS_TOKEN=$(echo "$TOKEN_RESPONSE" | jq -r .access_token)

  # Use the token to access a secret in Key Vault
  curl "https://mykeyvault.vault.azure.net/secrets/mysecret?api-version=2016-10-01" \
      -H "Authorization: Bearer $ACCESS_TOKEN"
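
Calling the identity endpoint before every request is wasteful, so applications usually hold on to the token until shortly before it expires. A minimal sketch of such a cache, assuming a hypothetical `CachedToken` class and a `fetch` callable that returns the parsed JSON response (whose `expires_on` field is a string of Unix seconds); the 300-second margin is an arbitrary choice:

```python
import time
from typing import Callable, Dict, Optional


class CachedToken:
    """Caches an access token and refetches it shortly before it expires."""

    def __init__(self, fetch: Callable[[], Dict[str, str]],
                 margin_seconds: int = 300,
                 clock: Callable[[], float] = time.time) -> None:
        self._fetch = fetch            # returns the parsed JSON token response
        self._margin = margin_seconds
        self._clock = clock
        self._token: Optional[str] = None
        self._expires_on = 0.0

    def get(self) -> str:
        """Return the cached token, fetching a new one if it is missing or near expiry."""
        if self._token is None or self._clock() >= self._expires_on - self._margin:
            response = self._fetch()
            self._token = response["access_token"]
            # expires_on arrives as a string of Unix seconds.
            self._expires_on = float(response["expires_on"])
        return self._token
```

The injected `clock` exists only so the expiry logic can be exercised without waiting; production code can leave it at the default.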

Google Cloud Platform Service Accounts

GCP uses Service Accounts attached to Compute Engine VMs to provide granular permissions.

  • How it Works: A service account is assigned to a VM instance. Applications use the metadata server (http://169.254.169.254/computeMetadata/v1/instance/service-accounts/default/token, also reachable as metadata.google.internal) to obtain short-lived OAuth 2.0 access tokens; each request must include the Metadata-Flavor: Google header. GCP SDKs and tools handle this automatically.
  • Example:
  1. Create a Service Account (e.g., my-gce-app@<project-id>.iam.gserviceaccount.com) with necessary permissions (e.g., Cloud Storage Object Viewer).
  gcloud iam service-accounts create my-gce-app --display-name "My GCE App Service Account"
  gcloud projects add-iam-policy-binding <PROJECT_ID> \
      --member "serviceAccount:my-gce-app@<PROJECT_ID>.iam.gserviceaccount.com" \
      --role "roles/storage.objectViewer"
  2. Launch your Compute Engine VM instance, specifying this service account.
  gcloud compute instances create my-gce-vm \
      --service-account my-gce-app@<PROJECT_ID>.iam.gserviceaccount.com \
      --scopes "https://www.googleapis.com/auth/cloud-platform" \
      --image-family debian-11 --image-project debian-cloud --zone us-central1-a
  3. Applications on the VM use GCP SDKs, which automatically fetch tokens. For example, using Python to access Cloud Storage:
  from google.cloud import storage

  client = storage.Client()
  bucket = client.get_bucket("my-gcp-bucket")
  blobs = bucket.list_blobs()
  for blob in blobs:
      print(blob.name)
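
For understanding only, this is roughly what the SDK does underneath. The sketch below queries the metadata server directly using the standard library; the function names are made up for this example, and `fetch_token` only succeeds when run on a Compute Engine VM:

```python
import json
import urllib.request
from typing import Dict, Tuple

METADATA_TOKEN_URL = (
    "http://169.254.169.254/computeMetadata/v1/"
    "instance/service-accounts/default/token"
)


def token_request_parts() -> Tuple[str, Dict[str, str]]:
    """Return the URL and headers for a metadata-server token request."""
    # The metadata server rejects requests that lack this header.
    return METADATA_TOKEN_URL, {"Metadata-Flavor": "Google"}


def parse_token_response(body: str) -> Tuple[str, int]:
    """Return (access_token, expires_in seconds) from the JSON response body."""
    doc = json.loads(body)
    return doc["access_token"], int(doc["expires_in"])


def fetch_token(timeout: float = 2.0) -> Tuple[str, int]:
    """Perform the request; this only works on a Compute Engine VM."""
    url, headers = token_request_parts()
    request = urllib.request.Request(url, headers=headers)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return parse_token_response(response.read().decode("utf-8"))
```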

Pros and Cons

  • Pros:
    • Highly Secure: No long-lived credentials stored on the VM. Tokens are ephemeral and rotated automatically.
    • Zero Maintenance: Cloud provider manages the entire lifecycle of credentials.
    • Granular Permissions: Policies can be applied with fine-grained control to specific VMs.
    • Scalability: Easily applicable across thousands of VMs with consistent policies.
    • Integrated: Works seamlessly with cloud SDKs and tools.
  • Cons:
    • Cloud-Specific: Only applicable to VMs running in their respective public cloud environments.
    • Limited to Cloud Resources: Primarily designed for accessing other cloud services within the same provider, not external, third-party APIs or on-premise systems without additional federation.
    • Learning Curve: Requires familiarity with the specific cloud provider’s IAM model.

Solution 3: Introducing TokenRelay – A Config-Driven Open-Source Agent

What if your environment is hybrid, on-premise, or requires access to a mix of cloud and external services that don’t directly integrate with cloud provider IAM roles? This is where a dedicated, lightweight, config-driven agent comes into play. Let’s call our hypothetical solution “TokenRelay.”

Bridging the Gap: The Need for an Agent

TokenRelay is designed to run as a daemon on each VM. Its purpose is to securely fetch, refresh, and distribute short-lived tokens from various sources to applications running on the same VM, all managed via a simple, declarative configuration.

How TokenRelay Works

  • Configuration-Driven: TokenRelay reads a YAML or JSON configuration file that defines which tokens to fetch, from where, their refresh intervals, and where to store them.
  • Modular Fetchers: It uses pluggable “fetchers” (e.g., Vault fetcher, OAuth client credentials fetcher, custom API fetcher) to interact with different identity and secret providers.
  • Secure Storage: Once fetched, tokens are stored securely in memory or in temporary, permission-restricted files.
  • Application Integration:
    • File-based: Writes the token to a specific file path, which applications can monitor or reload.
    • Environment Variables: Exposes tokens as environment variables for child processes (e.g., by running application as a child of TokenRelay or injecting into a shell script wrapper).
    • Local HTTP Endpoint: (Advanced) Provides a local HTTP endpoint for applications to query for their latest token, similar to cloud metadata services.
  • Health Checks & Monitoring: Integrates with system monitoring for status and alerts on renewal failures.
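
One way the "pluggable fetcher" idea could be structured is a registry keyed by the config `type` field. TokenRelay is hypothetical, so everything below (the `register` decorator, the `FETCHERS` table, and the `static_demo` type) is an assumption made for illustration rather than an actual API:

```python
from typing import Callable, Dict

# A fetcher takes the `source` mapping from the config and returns token fields.
Fetcher = Callable[[Dict[str, str]], Dict[str, str]]

FETCHERS: Dict[str, Fetcher] = {}


def register(token_type: str) -> Callable[[Fetcher], Fetcher]:
    """Decorator that registers a fetcher under a config `type` name."""
    def decorator(func: Fetcher) -> Fetcher:
        FETCHERS[token_type] = func
        return func
    return decorator


@register("static_demo")
def static_demo_fetcher(source: Dict[str, str]) -> Dict[str, str]:
    """Stand-in fetcher for illustration; a real one would call the provider."""
    return {"token": "demo-" + source["client_id"]}


def fetch_token(entry: dict) -> Dict[str, str]:
    """Dispatch one config entry to the fetcher registered for its type."""
    try:
        fetcher = FETCHERS[entry["type"]]
    except KeyError:
        raise ValueError(f"no fetcher registered for type {entry['type']!r}")
    return fetcher(entry["source"])


if __name__ == "__main__":
    entry = {"name": "demo", "type": "static_demo", "source": {"client_id": "vm1"}}
    print(fetch_token(entry))  # {'token': 'demo-vm1'}
```

With this shape, supporting a new provider means adding one decorated function, and the refresh loop never needs to know which provider it is talking to.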

Example: TokenRelay Configuration

A typical /etc/tokenrelay/config.yaml might look like this:

# /etc/tokenrelay/config.yaml
tokens:
  - name: "my_app_oauth_token"
    type: "oauth2_client_credentials"
    source:
      endpoint: "https://auth.example.com/oauth/token"
      client_id: "tokenrelay_vm_client"
      client_secret_path: "/etc/tokenrelay/secrets/oauth_client_secret" # Path to client secret
      scope: "api.read api.write"
    refresh_interval_minutes: 45
    output:
      file_path: "/var/run/tokenrelay/my_app_token.json"
      format: "json" # token and expiration_timestamp
      permissions: "0640" # rw-r-----

  - name: "vault_db_creds"
    type: "hashicorp_vault_kv"
    source:
      vault_addr: "https://vault.example.com:8200"
      vault_role: "my-vm-app-role" # Uses Vault's AppRole or similar auth method
      vault_path: "secret/data/my-app/database"
    refresh_interval_minutes: 30
    output:
      file_path: "/var/run/tokenrelay/db_creds.env"
      format: "env" # DB_USER=..., DB_PASS=...
      permissions: "0600"

# Global settings (optional)
log_level: "info"

The client_secret_path would point to a file managed by your secrets management solution (e.g., Ansible Vault, manual deployment, or even another token from Vault).
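
The `output` section implies a writer that serializes token fields and applies the requested permissions. A sketch of how that step might look, assuming hypothetical `render` and `write_output` helpers; note that the `env` format here does no shell quoting, which a real implementation would need for values containing spaces or special characters:

```python
import json
import os
import tempfile
from typing import Dict


def render(payload: Dict[str, str], fmt: str) -> str:
    """Serialize token fields as JSON or as KEY=value lines."""
    if fmt == "json":
        return json.dumps(payload) + "\n"
    if fmt == "env":
        return "".join(f"{key}={value}\n" for key, value in payload.items())
    raise ValueError(f"unsupported format: {fmt}")


def write_output(path: str, payload: Dict[str, str], fmt: str, permissions: str) -> None:
    """Write atomically so readers never observe a partially written file."""
    directory = os.path.dirname(path) or "."
    fd, tmp_path = tempfile.mkstemp(dir=directory)  # created with mode 0600
    try:
        with os.fdopen(fd, "w") as handle:
            handle.write(render(payload, fmt))
        os.chmod(tmp_path, int(permissions, 8))  # e.g. "0640" -> 0o640
        os.replace(tmp_path, path)               # atomic rename on POSIX
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
```

Writing to a temporary file in the same directory and renaming it means an application reading the token mid-refresh sees either the old file or the new one, never a truncated mix.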

Example: Deploying and Managing TokenRelay

Assuming TokenRelay is packaged as a systemd service:

# Install TokenRelay (e.g., via package manager or direct binary download)
# For example:
# curl -L https://github.com/tokenrelay/tokenrelay/releases/download/v1.0.0/tokenrelay_linux_amd64 -o /usr/local/bin/tokenrelay
# chmod +x /usr/local/bin/tokenrelay

# Create necessary directories and set permissions
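mkdir -p /etc/tokenrelay/secrets
mkdir -p /var/run/tokenrelay /var/log/tokenrelay # Both paths are listed in ReadWritePaths below and must exist
chown -R tokenrelay:tokenrelay /etc/tokenrelay /var/run/tokenrelay /var/log/tokenrelay # Assuming a 'tokenrelay' user/group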

# Place the config.yaml and any required secret files
# Example: Client secret for OAuth (ensure this file is highly restricted)
echo "my_super_secret_client_password" > /etc/tokenrelay/secrets/oauth_client_secret
chmod 400 /etc/tokenrelay/secrets/oauth_client_secret
chown tokenrelay:tokenrelay /etc/tokenrelay/secrets/oauth_client_secret

# Example systemd service unit file for TokenRelay:
# /etc/systemd/system/tokenrelay.service
# --------------------------------------------------------------------------------------------------
[Unit]
Description=TokenRelay Agent for Short-Lived Token Management
After=network.target

[Service]
ExecStart=/usr/local/bin/tokenrelay --config /etc/tokenrelay/config.yaml
Restart=always
RestartSec=5
# Run as a dedicated, low-privilege user
# (systemd does not support trailing comments, so comments go on their own lines)
User=tokenrelay
Group=tokenrelay
StandardOutput=journal
StandardError=journal
PrivateTmp=true
ProtectSystem=full
NoNewPrivileges=true
ReadOnlyPaths=/
# Only allow writing to token output and logs
ReadWritePaths=/var/run/tokenrelay /var/log/tokenrelay
# Only needed if the agent binds a privileged port (below 1024) for a local HTTP endpoint
CapabilityBoundingSet=CAP_NET_BIND_SERVICE

[Install]
WantedBy=multi-user.target
# --------------------------------------------------------------------------------------------------

# Reload systemd, enable, and start the service
systemctl daemon-reload
systemctl enable tokenrelay.service
systemctl start tokenrelay.service

# Check status
systemctl status tokenrelay.service
journalctl -u tokenrelay.service -f

Pros and Cons

  • Pros:
    • Vendor-Agnostic: Works across hybrid, multi-cloud, and on-premise environments.
    • Config-Driven: Declarative configuration makes management and auditing easier.
    • Extensible: New token sources (e.g., specific IDPs, custom APIs) can be added via modular fetchers.
    • Decoupling: Separates token management logic from application code.
    • Enhanced Security: Keeps long-lived credentials out of application code and away from the file system where possible, and rotates short-lived ones automatically.
  • Cons:
    • Additional Component: Requires deploying and managing another agent on each VM.
    • Bootstrap Credentials: Initial credentials (e.g., client secrets for OAuth, Vault AppRole ID/Secret) still need to be securely provided to TokenRelay, potentially via an instance identity or secret injection at provisioning time.
    • Integration: Applications need to be designed to read tokens from files or environment variables, or query a local endpoint.
    • Maturity: As a custom or open-source solution, its feature set and community support may vary compared to commercial alternatives or cloud-native options.
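
On the integration point, file-based consumption can be fairly small on the application side. The sketch below assumes a hypothetical `TokenFile` helper that re-reads the file only when its modification time changes, so the application picks up rotated tokens without a restart:

```python
import os
from typing import Optional


class TokenFile:
    """Re-reads a token file only when its modification time changes."""

    def __init__(self, path: str) -> None:
        self._path = path
        self._mtime_ns: Optional[int] = None
        self._value = ""

    def get(self) -> str:
        """Return the current token, reloading it if the file was replaced."""
        mtime_ns = os.stat(self._path).st_mtime_ns
        if mtime_ns != self._mtime_ns:
            with open(self._path, "r", encoding="utf-8") as handle:
                self._value = handle.read().strip()
            self._mtime_ns = mtime_ns
        return self._value
```

An application would call `get()` right before each outbound request; the extra `stat` call is cheap compared with the request itself.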

Solution Comparison: Choosing Your Path

Here’s a comparison of the three approaches to help you decide which is best suited for your specific use case:

| Feature | 1. Manual/Scripted Renewal | 2. Cloud Provider IAM Roles | 3. TokenRelay Agent |
|---|---|---|---|
| Deployment Environment | Any (Cloud, On-prem, Hybrid) | Cloud-specific (AWS, Azure, GCP) | Any (Cloud, On-prem, Hybrid) |
| Security of Credentials | Low (secrets potentially on disk) | High (no long-lived keys on instance) | Medium-High (initial bootstrap required, then ephemeral) |
| Management Overhead | High (script development, deployment, monitoring) | Very Low (cloud provider managed) | Medium (agent deployment, config management) |
| Scalability | Poor | Excellent | Good (via infrastructure-as-code) |
| Flexibility (Token Sources) | High (can script anything) | Limited (primarily cloud resources) | High (pluggable fetchers) |
| Application Integration | Custom script interaction/restart | Native SDKs, metadata service | File system, env vars, local endpoint |
| Complexity | Low (individual script) to High (fleet-wide) | Low (once understood) | Medium (agent, config, bootstrap) |
| Ideal Use Case | Small, isolated deployments; niche token sources | Cloud-native applications accessing other cloud services | Hybrid environments, multi-cloud, diverse token sources, custom integrations |

Conclusion

Managing short-lived tokens on VMs is a critical aspect of modern secure operations. While simple scripting can provide a quick fix, it quickly becomes unsustainable and insecure at scale. Cloud-native IAM roles offer the most seamless and secure solution for applications strictly within their respective public clouds. However, for hybrid environments, multi-cloud deployments, or access to external, non-cloud-native services, a dedicated, config-driven agent like our conceptual TokenRelay provides a powerful and flexible solution.

By carefully evaluating your environment, security requirements, and the types of resources your VMs need to access, you can choose the approach that best balances security, operational efficiency, and scalability, moving beyond the perpetual headache of manual token management.



👉 Read the original article on TechResolve.blog
