DEV Community: Jintao Zhang

From Deprecated npm Classic Tokens to OIDC Trusted Publishing: A CI/CD Troubleshooting Journey

Jintao Zhang — Sun, 04 Jan 2026 02:03:03 +0000

In January 2026, I encountered a series of cryptic authentication errors while publishing an npm package. This post documents the complete journey from problem discovery to final resolution—hopefully saving others from the same headaches.

Background

I maintain an npm package called amp-acp, an adapter that bridges Amp Code to the Agent Client Protocol (ACP). The project uses GitHub Actions for automated releases: pushing a v* tag triggers automatic publishing to npm and creates a GitHub Release.

This workflow had been running smoothly—until late December 2025...

The Problem

Starting with v0.3.1, every publish attempt failed. The GitHub Actions logs showed:

npm error code ENEEDAUTH
npm error need auth This command requires you to be logged in to https://registry.npmjs.org/
npm error need auth You need to authorize this machine using `npm adduser`

Even more confusing was this warning:

npm notice Security Notice: Classic tokens have been revoked. 
Granular tokens are now limited to 90 days and require 2FA by default. 
Update your CI/CD workflows to avoid disruption. 
Learn more https://gh.io/all-npm-classic-tokens-revoked

Root Cause Analysis

The End of npm Classic Tokens

After investigation, I discovered that npm permanently revoked all Classic Tokens on December 9, 2025. According to GitHub's official announcement:

- All existing npm classic tokens have been permanently revoked
- Classic tokens can no longer be created or restored
- New Granular tokens have a maximum validity of 90 days and require 2FA by default

This means the traditional approach of storing a long-lived NPM_TOKEN in GitHub Secrets is no longer practical: a Granular token still works, but it has to be rotated at least every 90 days.

The New Authentication Method: OIDC Trusted Publishing

npm's recommended solution is OIDC Trusted Publishing. This OpenID Connect-based authentication mechanism offers several advantages:

- No token management – No need to create, store, or rotate tokens
- Enhanced security – Uses short-lived, cryptographically signed, workflow-specific credentials
- Automatic provenance – Automatically generates provenance statements, providing build-origin transparency
- Industry standard – Aligns with PyPI, RubyGems, crates.io, and other major package registries

Troubleshooting Log

Attempt 1: Upgrading npm Version

Initially, I assumed the issue was an outdated npm version, so I added this to the workflow:

- name: Update npm to latest
  run: npm install -g npm@latest

Result: Failed ❌

Attempt 2: Removing registry-url

Someone suggested removing the registry-url parameter from actions/setup-node:

- uses: actions/setup-node@v4
  with:
    node-version: '22'
    # Removed registry-url

Result: Failed ❌

Attempt 3: Setting NODE_AUTH_TOKEN to Empty String

Based on some outdated resources, I tried setting NODE_AUTH_TOKEN to an empty string:

- name: Publish to npm
  run: npm publish --access public
  env:
    NODE_AUTH_TOKEN: ''

Result: Failed ❌

Here's the critical misconception: setting an empty NODE_AUTH_TOKEN actually prevents OIDC from working, because npm attempts to use the empty token instead of OIDC.

Attempt 4: Completely Removing NODE_AUTH_TOKEN

I finally realized that for OIDC Trusted Publishing, NODE_AUTH_TOKEN should not be set at all:

- name: Publish to npm
  run: npm publish --access public
  # Note: no env section

Result: Partial success ⚠️

This time OIDC authentication started working (logs showed Signed provenance statement), but a new error appeared:

npm error 422 Unprocessable Entity - PUT https://registry.npmjs.org/amp-acp - 
Error verifying sigstore provenance bundle: Failed to validate repository information: 
package.json: "repository.url" is "", expected to match 
"https://github.com/tao12345666333/amp-acp" from provenance

Attempt 5 (Final Success): Adding the repository Field

It turns out npm's Provenance validation requires package.json to include a repository field matching the GitHub repository:

{
  "name": "amp-acp",
  "version": "0.3.7",
  "repository": {
    "type": "git",
    "url": "https://github.com/tao12345666333/amp-acp"
  }
}

Result: Success! ✅
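A check like this can run before `npm publish` to catch the mismatch early. The following is a minimal Python sketch, not part of the original workflow: the helper names `normalize_repo_url` and `repository_matches` are hypothetical, and stripping a `git+` prefix or `.git` suffix before comparing is an assumption about which URL spellings are equivalent.

```python
import json


def normalize_repo_url(url: str) -> str:
    """Strip an optional 'git+' prefix and '.git' suffix (assumed normalization)."""
    if url.startswith("git+"):
        url = url[len("git+"):]
    if url.endswith(".git"):
        url = url[: -len(".git")]
    return url.rstrip("/")


def repository_matches(package_json_text: str, expected_url: str) -> bool:
    """Return True when package.json's repository URL equals the expected URL."""
    pkg = json.loads(package_json_text)
    repo = pkg.get("repository", "")
    # "repository" may be a plain string or an object with a "url" key.
    url = repo.get("url", "") if isinstance(repo, dict) else repo
    # The comparison is case-sensitive on purpose.
    return normalize_repo_url(url) == normalize_repo_url(expected_url)


if __name__ == "__main__":
    expected = "https://github.com/tao12345666333/amp-acp"
    good = '{"name": "amp-acp", "repository": {"type": "git", "url": "%s"}}' % expected
    print(repository_matches(good, expected))
    print(repository_matches('{"name": "amp-acp"}', expected))
```

In a GitHub Actions job, the expected URL could be assembled from the `GITHUB_SERVER_URL` and `GITHUB_REPOSITORY` environment variables. Shorthand forms such as `github:user/repo` are not handled by this sketch.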

The Correct Configuration

1. Configure Trusted Publisher on npmjs.com

First, configure Trusted Publisher on the npm website:

1. Navigate to https://www.npmjs.com/package/YOUR_PACKAGE/settings
2. Find the "Trusted Publisher" section
3. Select "GitHub Actions"
4. Fill in the following:
   - Organization/User: Your GitHub username or organization name
   - Repository: Your repository name
   - Workflow filename: The workflow file name (e.g., release.yml)
   - Environment: (Optional) If using GitHub Environments

2. GitHub Actions Workflow Configuration

name: Release

on:
  push:
    tags:
      - 'v*'

permissions:
  id-token: write   # Required for OIDC authentication
  contents: write   # Required for creating GitHub Release

jobs:
  release:
    runs-on: ubuntu-latest
    steps:
      - name: Checkout
        uses: actions/checkout@v4

      - name: Setup Node.js
        uses: actions/setup-node@v4
        with:
          node-version: '22'
          registry-url: 'https://registry.npmjs.org'

      - name: Update npm to latest
        run: npm install -g npm@latest

      - name: Install dependencies
        run: npm ci

      - name: Publish to npm
        run: npm publish --access public
        # Note: Do NOT set NODE_AUTH_TOKEN!

      - name: Create GitHub Release
        uses: softprops/action-gh-release@v2
        with:
          generate_release_notes: true

3. Required package.json Fields

{
  "name": "your-package-name",
  "version": "x.y.z",
  "repository": {
    "type": "git",
    "url": "https://github.com/YOUR_USERNAME/YOUR_REPO"
  }
}

Key Takeaways

- npm Classic Tokens are dead – As of December 9, 2025, all classic tokens are permanently invalidated
- OIDC Trusted Publishing is the new standard – No token management, enhanced security, built-in provenance
- Do not set NODE_AUTH_TOKEN – For OIDC, this environment variable should not be set at all
- Configure Trusted Publisher on npmjs.com – This step is often overlooked
- package.json must include the repository field – Required for provenance validation
- Ensure id-token: write permission – Otherwise, OIDC token generation will fail
- npm CLI version requirement – Requires npm 11.5.1 or later
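The version requirement is easy to get wrong with a plain string comparison, because "11.10.0" sorts before "11.5.1" as text. Here is a minimal Python sketch of a numeric comparison. The helper names are hypothetical, and it assumes a simple major.minor.patch version string such as the output of `npm --version`.

```python
def parse_version(version: str) -> tuple[int, int, int]:
    """Parse 'major.minor.patch' (optionally prefixed with 'v') into integers."""
    core = version.strip().lstrip("v").split("-")[0]  # drop any pre-release suffix
    major, minor, patch = (int(part) for part in core.split("."))
    return major, minor, patch


def meets_minimum(npm_version: str, minimum: str = "11.5.1") -> bool:
    """Compare versions numerically, component by component."""
    return parse_version(npm_version) >= parse_version(minimum)


if __name__ == "__main__":
    for candidate in ("10.9.2", "11.5.1", "11.10.0"):
        print(candidate, meets_minimum(candidate))
```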

FAQ

Q: Can I use OIDC to publish the first version of a new package?

A: No. The first version must be published manually or using a traditional token. Trusted Publisher can only be configured afterward.

Q: Can I use OIDC with self-hosted runners?

A: Currently, only GitHub/GitLab-hosted runners are supported. Self-hosted runners are not yet supported.

Q: Why doesn't setting NODE_AUTH_TOKEN to an empty string work?

A: An empty string is still a value—npm will attempt to use it rather than falling back to OIDC. Only when this variable is completely unset will npm automatically use OIDC.
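The difference between "unset" and "empty" is a general property of environment variables. The Python sketch below only illustrates that distinction with a hypothetical helper, `token_state`, and does not reproduce npm's internal logic.

```python
import os


def token_state(env: dict[str, str], name: str = "NODE_AUTH_TOKEN") -> str:
    """Classify a variable as 'unset', 'empty', or 'set'."""
    if name not in env:
        return "unset"  # the variable does not exist at all
    return "empty" if env[name] == "" else "set"  # "" is still a value


if __name__ == "__main__":
    # Inspect the real process environment.
    print(token_state(dict(os.environ)))
```

A step with `env: NODE_AUTH_TOKEN: ''` lands in the "empty" state, which is different from omitting the `env` section.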

Q: What should I do if provenance validation fails?

A: Verify that repository.url in package.json exactly matches the GitHub repository URL (including case sensitivity).


Written on January 4, 2026, based on the experience of publishing the amp-acp project from v0.3.1 to v0.3.7.

Mastering Kubernetes Services and Ingress: A Comprehensive Guide to Deploying Applications with Ease

Jintao Zhang — Sun, 12 Feb 2023 00:24:03 +0000

I. Introduction

Kubernetes is an open-source platform for automating the deployment, scaling, and management of containerized applications. It has become the industry standard for container orchestration and is widely used in production environments.

In a Kubernetes cluster, there are many components that work together to make the deployment of applications seamless. Services and Ingress are two of these components that are essential for making applications accessible from the outside world. In this article, we'll explore the details of Kubernetes Services and Ingress, and demonstrate how to use them effectively.

II. Kubernetes Services

A. Overview of Kubernetes Services

In Kubernetes, a Service is a resource that represents a set of pods running the same application. It provides stable network endpoints for pods, making it easier to route network traffic to them. The Service abstraction helps to hide the complexities of network topology and ensures that network traffic reaches the right pods, even if they move around the cluster.

B. Types of Services in Kubernetes

Kubernetes offers several types of Services, including ClusterIP, NodePort, LoadBalancer, and ExternalName. ClusterIP, the default and most commonly used type, provides a cluster-internal IP address that routes traffic to pods. NodePort exposes the Service on a static port on each node's IP address, while LoadBalancer provisions a cloud-provider load balancer to route external traffic to pods. ExternalName maps a Service to an external DNS name by returning a CNAME record, so workloads inside the cluster can reach an external service through a Service name.

C. Service Discovery in Kubernetes

Service discovery is the process of finding the network endpoint of a Service. In Kubernetes, Services can be discovered through their DNS names, which are served by the cluster DNS add-on (such as CoreDNS or kube-dns), or through environment variables that the kubelet injects into each pod. The DNS name of a Service has the format <service-name>.<namespace>.svc.cluster.local.
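As a small illustration of the naming scheme, the following Python sketch builds the fully qualified name from its parts. The function name `service_fqdn` and the example Service and namespace names are hypothetical. `cluster.local` is the common default cluster domain, but a cluster can be configured with a different one.

```python
def service_fqdn(service: str, namespace: str = "default",
                 cluster_domain: str = "cluster.local") -> str:
    """Build <service-name>.<namespace>.svc.<cluster-domain>."""
    return f"{service}.{namespace}.svc.{cluster_domain}"


if __name__ == "__main__":
    # Hypothetical Service "web" in namespace "shop".
    print(service_fqdn("web", "shop"))
```

Pods in the same namespace can usually reach the Service by its short name alone, because the pod's DNS search path fills in the rest.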

D. Creating and Managing Services in Kubernetes

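To create a Service in Kubernetes, you can use a YAML file that defines the Service resource, or use the kubectl command-line tool. Once a Service is created, it can be managed with kubectl, for example by modifying its ports or selector. Scaling applies to the workload behind the Service (such as a Deployment); the Service automatically routes to whichever pods match its selector.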
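To make the shape of a Service definition concrete, here is a minimal Python sketch that builds a ClusterIP Service manifest as a dictionary and prints it as JSON, which kubectl accepts alongside YAML. The function name, the Service name `web`, the `app` label, and the port numbers are all illustrative assumptions.

```python
import json


def cluster_ip_service(name: str, app_label: str, port: int, target_port: int) -> dict:
    """Build a ClusterIP Service that selects pods labeled app=<app_label>."""
    return {
        "apiVersion": "v1",
        "kind": "Service",
        "metadata": {"name": name},
        "spec": {
            "type": "ClusterIP",
            "selector": {"app": app_label},
            # "port" is what clients connect to; "targetPort" is the container port.
            "ports": [{"protocol": "TCP", "port": port, "targetPort": target_port}],
        },
    }


if __name__ == "__main__":
    print(json.dumps(cluster_ip_service("web", "web", 80, 8080), indent=2))
```

The printed JSON could be saved to a file, or piped into `kubectl apply -f -`, to create the Service.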

III. Kubernetes Ingress

A. Explanation of Ingress in Kubernetes

Ingress is a Kubernetes resource that manages inbound HTTP and HTTPS traffic to Services in the cluster. It provides a single entry point for external traffic, which is then routed to the appropriate Service based on the URL path or hostname. This enables you to expose multiple Services on a single IP address, making it easier to manage and secure external access to your applications.

B. Ingress Controllers in Kubernetes

An Ingress Controller is a component in the cluster that is responsible for implementing the rules defined in Ingress resources. There are several popular Ingress Controllers available, including Nginx, Traefik, and Istio. The choice of Ingress Controller depends on the needs of your deployment, such as performance, security, and extensibility.

C. Ingress Rules and Path Routing

Ingress resources define rules that determine how incoming traffic is routed to Services. Each rule can match on hostname and URL path, and it names the backend Service and port that should receive the traffic. Additional settings such as SSL/TLS termination and authentication are configured through the tls section or through controller-specific annotations. Ingress rules are defined in the YAML file for the Ingress resource, and they can be updated at any time.
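The routing idea can be sketched in a few lines of Python. This is a simplified model written for illustration, not the algorithm of any particular Ingress Controller. It assumes exact host matching and prefix matching on whole path segments, and prefers the longest matching prefix. The hostnames and Service names are hypothetical.

```python
from typing import Optional

Rule = tuple[str, str, str]  # (host, path prefix, backend Service name)


def split_path(path: str) -> list[str]:
    """Split a URL path into its non-empty segments."""
    return [part for part in path.split("/") if part]


def path_matches(prefix: str, request_path: str) -> bool:
    """Prefix match on whole segments: '/api' matches '/api/v1' but not '/apix'."""
    wanted = split_path(prefix)
    return split_path(request_path)[: len(wanted)] == wanted


def pick_backend(rules: list[Rule], host: str, path: str) -> Optional[str]:
    """Return the backend of the most specific matching rule, or None."""
    candidates = [r for r in rules if r[0] == host and path_matches(r[1], path)]
    if not candidates:
        return None
    return max(candidates, key=lambda r: len(split_path(r[1])))[2]


if __name__ == "__main__":
    rules = [("shop.example.com", "/", "web"), ("shop.example.com", "/api", "api")]
    print(pick_backend(rules, "shop.example.com", "/api/v1/items"))
    print(pick_backend(rules, "shop.example.com", "/cart"))
```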

D. Setting up SSL/TLS Encryption with Ingress

Enabling SSL/TLS encryption for your applications is a best practice for security and privacy. With Ingress, you can set up SSL/TLS encryption by having the Ingress Controller terminate SSL/TLS connections and then forward the traffic to the appropriate Service. This is typically done by referencing a TLS Secret in the tls section of the Ingress resource, with controller-specific annotations or controller configuration for finer-grained settings.

E. Creating and Managing Ingress Resources in Kubernetes

Just like Services, Ingress resources can be created and managed using YAML files or the kubectl command-line tool. Once an Ingress resource is created, the Ingress Controller in the cluster will automatically implement the rules defined in the resource. You can also update or delete the Ingress resource at any time to change the behavior of the Ingress Controller.
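As a concrete illustration, the Python sketch below builds an Ingress manifest for the `networking.k8s.io/v1` API with one host rule and TLS termination, and prints it as JSON for use with kubectl. The function name, hostname, Secret name, Service name, and the `nginx` ingress class are assumptions for the example. The TLS Secret is assumed to already exist in the same namespace.

```python
import json


def tls_ingress(name: str, host: str, tls_secret: str, service: str,
                port: int, ingress_class: str = "nginx") -> dict:
    """Build an Ingress that terminates TLS for `host` and routes '/' to `service`."""
    return {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "Ingress",
        "metadata": {"name": name},
        "spec": {
            "ingressClassName": ingress_class,
            # The Secret holds the certificate and key used for TLS termination.
            "tls": [{"hosts": [host], "secretName": tls_secret}],
            "rules": [{
                "host": host,
                "http": {"paths": [{
                    "path": "/",
                    "pathType": "Prefix",
                    "backend": {"service": {"name": service, "port": {"number": port}}},
                }]},
            }],
        },
    }


if __name__ == "__main__":
    manifest = tls_ingress("web", "shop.example.com", "shop-tls", "web", 80)
    print(json.dumps(manifest, indent=2))
```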

IV. Best Practices for Kubernetes Services and Ingress

A. Designing Scalable and Efficient Services and Ingress

When designing your Services and Ingress, it's important to consider scalability and efficiency. This includes things like selecting the appropriate Service type, optimizing network traffic routing, and choosing the right Ingress Controller. By following best practices, you can ensure that your applications remain performant as they grow in size and complexity.

B. Securing Services and Ingress with Network Policies

Securing your Services and Ingress is critical for protecting your applications and data. Kubernetes provides Network Policies, which allow you to control network access to Services and Pods. By using Network Policies, you can restrict incoming and outgoing network traffic and ensure that only trusted sources can access your applications.

C. Monitoring and Logging Services and Ingress

Monitoring and logging are essential for understanding the behavior of your applications and troubleshooting issues. Commonly used tools in the Kubernetes ecosystem include the Kubernetes Dashboard, Prometheus, and the ELK Stack. By setting up monitoring and logging, you can quickly detect and resolve problems with your Services and Ingress.

V. Conclusion

In this article, we've covered the basics of Kubernetes Services and Ingress and shown you how to use them to deploy and manage your applications. Services provide stable network endpoints for pods, while Ingress provides a single entry point for external traffic. By using these resources together, you can build scalable, efficient, and secure applications in Kubernetes.

VI. References

Kubernetes official documentation: https://kubernetes.io/docs/
Kubernetes Services: https://kubernetes.io/docs/concepts/services-networking/service/
Kubernetes Ingress: https://kubernetes.io/docs/concepts/services-networking/ingress/
Kubernetes Network Policies: https://kubernetes.io/docs/concepts/services-networking/network-policies/
Kubernetes Monitoring and Logging: https://kubernetes.io/docs/tasks/debug-application-cluster/logging-monitoring/

20 tips for Prometheus Monitoring

Jintao Zhang — Wed, 01 Feb 2023 10:37:10 +0000

Prometheus is an open-source monitoring system and time series database that is widely used for monitoring and alerting. It's a powerful tool that can provide deep insights into the performance of your infrastructure and applications.

In this article, I'll provide you with 20 tips for mastering Prometheus.

- Choose the Right Data Sources

One of the first things to consider when setting up Prometheus is to ensure that you are collecting the right metrics from the right sources. The metrics you collect should be relevant to your use case and provide meaningful insights into the performance of your systems.

- Use Labels Effectively

Labels are a powerful tool for organizing and grouping metrics in Prometheus. It's important to use them wisely to ensure that your metrics are easily searchable and queryable. Labels allow you to segment your metrics based on attributes such as environment, application, and host.

- Utilize Built-In Functionalities

Prometheus provides a range of built-in functionality for complex queries and visualization. PromQL is a powerful query language for selecting and aggregating metrics, and the expression browser built into the Prometheus web UI lets you run ad hoc queries and graph the results. (PromDash, the original dashboard builder, has long been deprecated in favor of Grafana.)

- Service Discovery

Prometheus provides service discovery, which allows you to automatically discover and scrape metrics from targets. This feature can save you time and effort by eliminating the need to manually configure targets.

- Use Alerts Wisely

Alerts are an important feature of Prometheus, but it's important to be selective when setting them up. Ensure that your alerts are actionable and meaningful, and tune them so that they neither fire on non-issues (false positives) nor miss real problems (false negatives).

- Keep Your Data Fresh

Prometheus relies on up-to-date metrics to provide meaningful insights into the performance of your systems. Ensure that your metrics are being updated frequently and that they are not stale.

- Store Your Data Effectively

Prometheus stores your metrics in a time series database, and it's important to store your data in a highly available and scalable backend. Options include local disk storage, remote write to a third-party database, and cloud-based storage solutions.

- Optimize Your Queries

PromQL is a powerful query language, but it's important to optimize your queries to improve performance and reduce load on your servers. Ensure that your queries are efficient and well-optimized.

- Monitor Your Monitoring

It's important to keep an eye on your Prometheus instances and their performance to ensure they are running smoothly. Regularly monitor your Prometheus instances to ensure they are functioning as expected.

- Use Pushgateway for Short-Lived Jobs

Pushgateway is a component of Prometheus that is designed to handle metrics from short-lived jobs, such as batch jobs and cron jobs. If you have short-lived jobs in your environment, consider using Pushgateway to collect and store their metrics.
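Pushing a metric is an HTTP request to the Pushgateway's `/metrics/job/<job>` path with the metric in the text exposition format as the body. The Python sketch below builds such a request using only the standard library. The function name, the address `http://localhost:9091`, and the job and metric names are illustrative assumptions, and the job name is assumed to be URL-safe.

```python
import urllib.request


def build_push_request(base_url: str, job: str, metric: str,
                       value: float) -> urllib.request.Request:
    """Build a PUT request that replaces the metrics pushed for `job`."""
    url = f"{base_url.rstrip('/')}/metrics/job/{job}"
    # Text exposition format: one "<name> <value>" sample per line.
    body = f"{metric} {value}\n".encode("utf-8")
    return urllib.request.Request(url, data=body, method="PUT")


if __name__ == "__main__":
    req = build_push_request("http://localhost:9091", "nightly_backup",
                             "backup_duration_seconds", 12.5)
    print(req.get_method(), req.full_url)
    # To actually send it (requires a reachable Pushgateway):
    # urllib.request.urlopen(req)
```

A batch job would call this once at the end of its run, and Prometheus then scrapes the Pushgateway like any other target.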

- Use Grafana for Visualization

Grafana is a popular open-source dashboard solution that works well with Prometheus. It provides a range of visualization options and is easy to use. If you need to visualize your metrics, consider using Grafana.

- Use Remote Write and Remote Read

Remote Write and Remote Read are features in Prometheus that let you send samples to, and query them back from, a remote storage system. They are commonly used for long-term retention and for centralizing data from multiple Prometheus instances, so consider them if your metrics need to stay available beyond a single Prometheus server.

- Use Recording Rules

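Recording rules let you precompute frequently used or expensive expressions and store the results as new time series. Querying the precomputed series is much faster than evaluating the original expression each time, which improves dashboard performance and reduces query load on your servers.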

- Monitor Your Application and Infrastructure

Prometheus is designed to monitor both your applications and infrastructure, so ensure that you are monitoring both to gain a complete picture of your systems. This can include metrics such as resource usage, network traffic, and application-specific metrics.

- Use Exporters for Non-Prometheus Systems

Many systems do not expose metrics in the Prometheus format natively. An exporter bridges that gap by translating a system's own metrics into a format that Prometheus can scrape. Consider using exporters to integrate your existing systems into your Prometheus monitoring solution.

- Manage Data Retention

Prometheus provides options for managing data retention, such as limiting how long data is kept (--storage.tsdb.retention.time) and how much disk space it may use (--storage.tsdb.retention.size). Ensure that you have appropriate settings in place to manage your data retention, as retaining too much data can consume disk space and negatively impact performance.

- Use Alertmanager for Alerting

Alertmanager is a component of the Prometheus ecosystem that handles alerts after Prometheus fires them, providing grouping, routing, silencing, and inhibition. Prometheus itself only evaluates alerting rules; use Alertmanager to deduplicate alerts and deliver notifications to receivers such as email or chat.

- Monitor Your Exporters

If you are using exporters to integrate non-Prometheus systems into Prometheus, it's important to monitor the health of your exporters. Ensure that your exporters are running smoothly and that they are providing up-to-date metrics.

- Consider a Scalable Monitoring Solution

Prometheus is designed to be scalable, but it can become challenging to manage as your environment grows. Consider using a scalable monitoring solution, such as Thanos or Cortex, to provide a more scalable and flexible solution for your monitoring needs.

- Regularly Review Your Metrics

Regularly review your metrics to ensure that they are providing meaningful insights into the performance of your systems. Ensure that your metrics are relevant, up-to-date, and well-organized, and that they are providing the information you need to make informed decisions.

In conclusion, Prometheus is a powerful and flexible monitoring solution that provides deep insights into the performance of your systems.

By following these tips, you can master Prometheus and make the most of its capabilities.

If you are interested in my articles, please subscribe to my Newsletter!

Newsletter | CloudCraftAI with Jintao

Subscribe to CloudCraftAI with Jintao's newsletter.

blog.moelove.info

How to reduce the cost of GitHub Actions

Jintao Zhang — Fri, 27 Jan 2023 02:08:55 +0000

I'll cover how to reduce the cost of GitHub Actions, and give some advice.

According to G2's statistical report, GitHub Actions is the easiest-to-use CI/CD tool, and more and more people like it.

Because GitHub Actions is GitHub's native CI/CD tool, the tens of thousands of Actions in the marketplace can be used directly, and it is free for public repositories. More and more projects are switching their CI to GitHub Actions.

I also really like GitHub Actions and use it for almost all my GitHub-hosted repositories.

But recently I was working on a project that hit the GitHub Actions quota limit, which made me take a closer look at its cost.


Why is the quota exhausted?

Recently I found an interesting project: upptime/upptime: ⬆️ Free uptime monitor and status page powered by GitHub

I wanted to use it to monitor some of the services I have developed and to build a status page. This involves some API configuration that I don't want to make public, so I forked the project into a private repository. After some simple configuration, it worked fine.

Since I wanted more data, I tweaked the CI schedule configuration to make these tasks run more frequently.

workflowSchedule:
  graphs: "0 * * * *"
  responseTime: "0 * * * *"
  staticSite: "0 * * * *"
  summary: "0 * * * *"
  updateTemplate: "0 * * * *"
  updates: "0 * * * *"
  uptime: "*/5 * * * *"
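To see how a schedule translates into run counts, here is a minimal Python sketch that counts how many times per hour a cron expression fires. It is not part of upptime, and the helper name `runs_per_hour` is hypothetical. It looks only at the minute field, handles just the `*`, `*/n`, and comma-separated forms, and assumes the remaining four fields are all `*`.

```python
def runs_per_hour(cron_expression: str) -> int:
    """Count the minutes of an hour matched by the cron minute field."""
    minute_field = cron_expression.split()[0]
    if minute_field == "*":
        return 60
    if minute_field.startswith("*/"):
        step = int(minute_field[2:])
        return len(range(0, 60, step))
    # Single value or comma-separated list of minutes.
    return len({int(value) for value in minute_field.split(",")})


if __name__ == "__main__":
    for name, cron in (("uptime", "*/5 * * * *"), ("summary", "0 * * * *")):
        print(name, runs_per_hour(cron))
```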

According to the billing documentation for GitHub Actions, usage is free for public repositories, but private repositories have a quota limit.

GitHub Actions usage is free for standard GitHub-hosted runners in public repositories, and for Self-hosted runners. For private repositories, each GitHub account receives a certain amount of free minutes and storage for use with GitHub-hosted runners, depending on the product used with the account. Any usage beyond the included amounts is controlled by spending limits.

Soon I received a quota reminder email from GitHub, reminding me that the quota was about to be used up.

This got me thinking about how to solve it.

Cost of using GitHub Actions

Making the repository public is the most straightforward way, but I explained above why it cannot be made public, so I had to find another solution.

Paying for GitHub Actions is also a very straightforward solution.

Before deciding to pay for it, I wanted to estimate the cost. GitHub provides a Pricing Calculator, which makes this easy.

Since I modified the CI's scheduling configuration, the most frequently run tasks will run every 5 minutes.

I used Meercode to collect the running data of GitHub Actions in this repository. It provides some dashboards by default:

It also allows users to customize it themselves. I created my dashboard. If you are interested in Meercode, please let me know in the comments.

The dashboard shows that each task takes no more than 0.5 minutes and that there are no more than 12 tasks per hour. Using the pricing calculator, that works out to roughly $35 per month.
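The estimate can be reproduced with simple arithmetic. In the Python sketch below, the 12 runs per hour and 0.5 minutes per run come from the numbers above. The per-minute rate of $0.008 for a Linux runner and the 30-day month are assumptions, chosen because they reproduce a figure close to $35. Free included minutes and any per-job rounding are not modeled.

```python
def monthly_cost(runs_per_hour: float, minutes_per_run: float,
                 rate_per_minute: float, days: int = 30) -> float:
    """Estimate monthly spend as runs x minutes per run x price per minute."""
    runs = runs_per_hour * 24 * days
    return runs * minutes_per_run * rate_per_minute


if __name__ == "__main__":
    # 12 runs/hour x 24 hours x 30 days = 8640 runs per month.
    estimate = monthly_cost(runs_per_hour=12, minutes_per_run=0.5, rate_per_minute=0.008)
    print(f"${estimate:.2f} per month")
```

Changing `minutes_per_run` or `rate_per_minute` shows how sensitive the estimate is to billing details, which is why measuring real usage first matters.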

Ways to save costs

Since my repository mainly runs uptime checks, each task consumes few resources but tasks run frequently, so I wondered whether a self-hosted runner would save money.

I compared the prices of 3 lower-priced cloud service providers:

Among them, both Civo and Vultr provide 1C1G (1 vCPU, 1 GB RAM) instances at $5/month, and DigitalOcean instances with the same specifications cost $6/month.

I finally chose Civo, which is a cloud-native service provider, and there is an introduction on its homepage:

Transparent pricing from just $5 a month

Civo provides a variety of services, such as Kubernetes (based on k3s), or compute instances.

Among them, the Extra Small instance type is 1C1G and includes 1 TB of traffic. If you choose the Kubernetes service, you do not pay for the control plane (the same as Azure AKS). Even the larger specs look cheap.

I tried both its Kubernetes service and its compute instances, and both work fine.

Using compute instances

Deploying the GitHub Actions runner on a Linux compute instance is simple: add a runner to the project at https://github.com/<Your name>/<Project name>/settings/actions/runners/new.

That page lists the complete deployment steps; just follow them.

My installation process is as follows:

civo@polished-bush-99d8-1926a1:~$ mkdir actions-runner && cd actions-runner
civo@polished-bush-99d8-1926a1:~/actions-runner$ curl -o actions-runner-linux-x64-2.301.1.tar.gz -L https://github.com/actions/runner/releases/download/v2.301.1/actions-runner-linux-x64-2.301.1.tar.gz
civo@polished-bush-99d8-1926a1:~/actions-runner$ echo "3ee9c3b83de642f919912e0594ee2601835518827da785d034c1163f8efdf907  actions-runner-linux-x64-2.301.1.tar.gz" | shasum -a 256 -c
actions-runner-linux-x64-2.301.1.tar.gz: OK                                                                     
civo@polished-bush-99d8-1926a1:~/actions-runner$ tar xzf ./actions-runner-linux-x64-2.301.1.tar.gz              
civo@polished-bush-99d8-1926a1:~/actions-runner$ ./config.sh --url https://github.com/MoeLove/monitoring --token $TOKEN

After the execution is complete, some files will be added to the current directory. Execute ./run.sh to start the GitHub Actions runner.

civo@polished-bush-99d8-1926a1:~/actions-runner$ ls
_diag  _work  actions-runner-linux-x64-2.301.1.tar.gz  bin  config.sh  env.sh  externals  run-helper.cmd.template  run-helper.sh  run-helper.sh.template  run.sh  safe_sleep.sh  svc.sh

If you want the runner to run reliably in the background, you can execute sudo ./svc.sh install followed by sudo ./svc.sh start to install the runner as a systemd service and manage its life cycle through systemd.

Using Kubernetes

Civo does not charge for the Kubernetes control plane, but only for Worker Nodes. The advantage of using Kubernetes is that I can automatically scale up and down in the cluster, and I can easily run and create multiple runners for different projects.

Since GitHub did not officially provide a way to deploy a self-hosted runner on Kubernetes, I used the Actions Runner Controller (ARC) project, which allows rapid deployment of self-hosted runners through Runner custom resources.

The deployment process is clearly described in the documentation. The following is my deployment process.

# deploy cert-manager
(MoeLove) ➜ kubectl apply -f https://github.com/cert-manager/cert-manager/releases/download/v1.11.0/cert-manager.yaml

# deploy ARC
(MoeLove) ➜ helm repo add actions-runner-controller https://actions-runner-controller.github.io/actions-runner-controller
(MoeLove) ➜ helm upgrade --install --namespace actions-runner-system --create-namespace\
  --set=authSecret.create=true\
  --set=authSecret.github_token="REPLACE_YOUR_TOKEN_HERE"\
  --wait actions-runner-controller actions-runner-controller/actions-runner-controller

# create runner
(MoeLove) ➜ cat <<EOF | kubectl apply -f -
apiVersion: actions.summerwind.dev/v1alpha1
kind: RunnerDeployment
metadata:
  name: moelove-runner
spec:
  replicas: 1
  template:
    spec:
      repository: MoeLove/monitoring
EOF

After installation, the following results are achieved:

[Embedded tweet 1616251840429002755 showing the result]

Self-hosted vs GitHub-managed

Above, I described how I used Meercode to measure key CI metrics and estimate the cost of GitHub Actions. Because my workload consumes few resources but runs very frequently, I chose the self-hosted runner.

So when is it more appropriate to choose a GitHub-managed runner? What are the benefits of GitHub-managed?

The GitHub-managed runner has the following advantages:

- Support for multiple operating systems: In addition to Linux, GitHub-managed runners also support macOS and Windows, whereas most cloud providers do not offer macOS environments. (I used to put some Mac minis in the data center as servers for specific scenarios.)
- VM-level isolation: According to the GitHub Actions documentation, a GitHub-managed runner creates a fresh VM for each job, which brings certain security and isolation guarantees. A self-hosted runner started from the binary shares the host environment across jobs, and one running through ARC is isolated only at the Pod level. Both are weaker than VM-level isolation and can introduce security risks.
- Low maintenance costs: In any large system, maintenance is expensive. For personal use, or when only a few projects use the self-hosted runner, the maintenance cost is manageable, but it brings a lot of complexity as usage grows. GitHub-managed runners are maintained by GitHub.

There are also two products that offer self-hosted runner services:

They reduce the cost of runner maintenance and management and provide more secure isolation and support for Arm-based environments. cirun also provides GPU runner support.

If you have the above requirements, you may also wish to consider these services.

Summary

In general, the following steps are required to reduce the cost of GitHub Actions.

Visualization/Observability: Estimate costs using actual data.
Compare multiple vendors/solutions: Different vendors offer different pricing for different scenarios or products, and you can choose according to your actual situation.
Security and maintenance costs also need to be considered.


My Rust journey and how to learn Rust

Jintao Zhang — Tue, 17 Jan 2023 13:08:16 +0000

I'll share my Rust journey, how I learned Rust and some free Rust learning resources.

Rust has become more and more popular. Through the StackOverflow 2022 Developer Survey, we can see that many people are interested in Rust.

Rust is on its seventh year as the most loved language with 87% of developers saying they want to continue using it.

Rust also ties with Python as the most wanted technology with TypeScript running a close second

Most Wanted
Most Loved vs. Dreaded

But Rust has a steep learning curve.

This made me want to share my Rust journey, why I chose Rust, and how to learn Rust.

Getting connected with Rust

I had heard about Rust when it was first released, and my impression was that it was a systems programming language that could replace C/C++ and was safe enough. But I didn't learn or use it. (I had only used it to write Hello World!)

Five years ago, I was leading the transformation of my company's infrastructure into a cloud-native stack.

I needed to build a monitoring stack based entirely on Prometheus to replace the company's in-house monitoring software, which had more than 10 years of history, along with other monitoring tools such as Nagios, Zabbix, and Graphite.

Yes, you read that right, we are using a lot of surveillance software. There are a few reasons for this:

No single piece of software could meet all needs
The team was scattered, and most of the time new software was introduced just to meet a specific need rather than to solve the underlying problem

Anyway, these are historical reasons.

As mentioned above, our self-developed monitoring software had more than 10 years of history, so as you can see, our infrastructure was slow to iterate.

And because we ran our own physical data centers, many of our servers were old machines that had not been updated. (This is one of the reasons why I used Rust later)

I first replaced the monitoring stack in a newly launched small data center with about 400 machines, and the results were good. Prometheus monitored all the servers in this small data center and the services running on them, with dashboards in Grafana and alert notifications through Alertmanager.

Later, I rolled out these changes in two data centers. Overall it went relatively smoothly, and Kubernetes monitoring was also completed during this process.

But when it was implemented in the last data center, I faced the biggest challenge.

node_exporter failed to start on some machines, and on others it crashed after running for some time.

I started to investigate this issue. For the crash issue, I temporarily worked around it by adding a restart script.

I was mainly concerned with why node_exporter wouldn't start. I found that the affected machines were running CentOS 5 with kernel 2.6.18.

I found that there are already similar issues in the community: https://github.com/prometheus/node_exporter/issues/691

At the same time, I also noticed that the Go documentation clearly stated that CentOS 5 is not supported and that kernel version 2.6.32 or above is required.

(I forgot the minimum requirements at the time I checked, but through the web archive, I see that the minimum kernel version required in 2017 was 2.6.23)

After some searching, I also found articles like How to install Go 1.1 on CentOS 5.9, but they also mentioned some known issues.

So I decided not to keep fighting it.

I wanted to re-implement one myself, which could also solve the crash problem above.

In the end, I used Rust to implement a tool similar to node_exporter and completed the upgrade and transformation of the monitoring system.

This is where my journey started with Rust in production.

Next, let me introduce why I chose Rust.

Why choose Rust

I have introduced some background above. At that time, the easiest choice would have been Python, which is simple and has a rich ecosystem. I also had many years of experience in Python development, so I could quickly build the tools I needed.

The reasons for not choosing Python are:

Not all of these machines have a Python environment, and the versions of Python are also different. I was asked not to modify the environment on these machines as much as possible;
Since I may make some modifications later, I think the subsequent distribution may not be convenient;

Then I rethought my goal:

Can be compiled into binary executable files for easy distribution and deployment. I used Ansible for unified deployment.

So a more suitable option is C/C++/Rust.

I have more experience in C development and a little experience in C++. All three languages easily meet my first requirement.

When most people compare Rust and C/C++, they are comparing their performance and safety.

In my use case at the time, I didn't think the results in the other two languages would be worse than in Rust, although performance and safety were also considerations. And since I was just starting to learn Rust, my Rust implementation might even have been worse than a C implementation.

But I wanted more of a challenge and to try something new, and in terms of Prometheus monitoring, the C/C++ ecosystem was not very active. I also believed Rust had a bright future.

So in the end I chose Rust.

How I learned Rust

Rust is not simple, and it's not quite the same as other languages, so some practices that work in other languages may not work in Rust.

Since I have a specific problem that needs to be solved, I need to implement a node_exporter to complete the transformation of the monitoring stack. So I learned Rust through the learning-by-doing mode.

I first took a quick look at the following:

The Rust Programming Language: This book is very complete, and I didn't read it all at first. Instead, I used it to understand the main concepts and some usages in Rust.
Rust By Example: There are many examples here, and you can also increase your familiarity with Rust by practicing these examples.
Rust std lib docs: Documentation of the standard library. I took a quick overview to understand some keywords, modules, etc. It is not necessary to read it in its entirety initially.

This way I quickly implemented a basic version of node_exporter. Then I continued to iterate, applied it to the production environment, and completed the construction of the Prometheus monitoring stack.

Later, I continued to implement some small tools in Rust, learned its best practices, and studied some open-source projects implemented in Rust to increase my Rust experience.

Recommend some Rust learning resources

There are many learning resources for Rust now. In addition to the ones I listed above, I recommend the following free content:

videos:

Summary

This is how my Rust journey started, and it continues.

Although I focus on Cloud Native and Kubernetes-related technologies, and now write more Go, I still write some tools in Rust and use Rust in WebAssembly.

In the future, I will also share relevant content. If you are interested in my articles, please subscribe to my Newsletter!

Opportunities and Challenges of Technological Evolution in Cloud Native

Jintao Zhang — Thu, 15 Dec 2022 17:06:05 +0000

Nowadays, Cloud Native is becoming increasingly popular, and the CNCF defines Cloud Native as:

Based on a modern and dynamic environment, aka cloud environment.
With containerization as the fundamental technology, including Service Mesh, immutable infrastructure, declarative API, etc.
Key features include autoscaling, manageability, observability, automation, frequent change, etc.

According to the CNCF 2021 survey, there are a very significant number (over 62,000) of contributors in the Kubernetes community. With the current trend of technology, more and more companies are investing in Cloud Native and moving early to deploy on the cloud. Why are companies embracing Cloud Native as they grow, and what does Cloud Native mean for them?

Technical Advantages of Cloud Native

The popularity of Cloud Native comes from its advantages at the technical level. There are two main aspects of Cloud Native technology, including containerization led by Docker, and container orchestration led by Kubernetes.

Docker introduced container images to the technology world, making container images a standardized delivery unit. In fact, before Docker, containerization technology already existed. Let's talk about a more recent technology, LXC (Linux Containers) in 2008. Compared to Docker, LXC is less popular since Docker provides container images, which can be more standardized and more convenient to migrate. Also, Docker created the DockerHub public service, which has become the world's largest container image repository. In addition, containerization technology can also achieve a certain degree of resource isolation, including not only CPU, memory, and other resources isolation, but also network stack isolation, which makes it easier to deploy multiple copies of applications on the same machine.

Kubernetes became popular due to the booming of Docker. The container orchestration technology, led by Kubernetes, provides several important capabilities, such as fault self-healing, resource scheduling, and service orchestration. Kubernetes has a built-in DNS-based service discovery mechanism, and thanks to its scheduling architecture, it can be scaled very quickly to achieve service orchestration.

Now more and more companies are actively embracing Kubernetes and transforming their applications for Kubernetes deployment. The Cloud Native we are talking about is in fact premised on Kubernetes, the cornerstone of Cloud Native technology.

Containerization Advantages

Standardized Delivery

Container images have now become a standardized delivery unit. By containerization technology, users can directly complete the delivery through a container image instead of binary or source code. Relying on the packaging mechanism of the container image, you can use the same image to start a service and produce the same behavior in any container runtime.

Portable and Light-weight, Cost-saving

Containerization technology achieves a degree of isolation through the Linux kernel's capabilities, which in turn makes it easier to migrate. Moreover, containerization technology can run applications directly, which is lighter in technical implementation than virtualization technology because there is no need for a guest OS in each virtual machine. All applications can share the kernel, which saves cost. And the larger the application, the greater the cost savings.

Convenience of resource management

When starting a container, you can set the CPU, memory, or disk IO properties that can be used for the container service, which allows for better planning and deployment of resources when starting application instances through containers.

Container Orchestration Advantages

Simplify the Workflow

In Kubernetes, application deployment is easier to manage than in Docker, since Kubernetes uses declarative configuration. For example, a user can simply declare in a configuration file what container image the application will use and what service ports are exposed without the need for additional management. The operations corresponding to the declarative configuration greatly simplify the workflow.
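To make the declarative idea concrete, here is a minimal Python sketch. It is a toy model rather than the Kubernetes API: the `reconcile` function, the dictionary keys, and the image names are all hypothetical and exist only for illustration. The user states the desired image and port, and a reconcile step works out what has to change.

```python
# Toy illustration of declarative configuration; this is not the Kubernetes API.
# All names here (reconcile, the dict keys, the image names) are hypothetical.

def reconcile(desired: dict, actual: dict) -> list[str]:
    """Return the actions needed to move `actual` toward `desired`."""
    actions = []
    for key, want in desired.items():
        have = actual.get(key)
        if have != want:
            actions.append(f"set {key}: {have!r} -> {want!r}")
    return actions


# The user declares what they want, not the steps to get there.
desired = {"image": "example/app:v2", "port": 8080}
# What is currently running (assumed for the example).
actual = {"image": "example/app:v1", "port": 8080}

for action in reconcile(desired, actual):
    print(action)
```

Only the image differs in this example, so that is the only action reported. When the actual state already matches the declaration, nothing needs to be done.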

Improve Efficiency and Save Costs

Another advantageous feature of Kubernetes is failover. When a node in Kubernetes crashes, Kubernetes automatically schedules the applications on it to other normal nodes and gets them up and running. The entire recovery process does not require human intervention and operation, so it not only improves operation and maintenance efficiency at the operational level but also saves time and cost.

With the rise of Docker and Kubernetes, you will see that their emergence has brought great innovation and opportunity to application delivery. Container images, as standard delivery units, shorten the delivery process and make it easier to integrate with CI/CD systems.

Considering that application delivery is becoming faster, how is application architecture following the Cloud Native trend?

Application Architecture Evolution: from Monoliths, Microservice to Service Mesh

The starting point of application architecture evolution is still from monolithic architecture. As the size and requirements of applications increased, the monolithic architecture no longer met the needs of collaborative team development, thus distributed architectures were gradually introduced.

Among the distributed architectures, the most popular one is the microservice architecture. Microservice architecture can split services into multiple modules, which communicate with each other, complete service registration and discovery, and achieve common capabilities such as rate limiting and circuit breaking.

In addition, there are various patterns included in a microservice architecture. For example, the per-service database pattern, which represents each microservice with an individual database, is a pattern that avoids database-level impact on the application but may introduce more database instances.

Another one is the API Gateway pattern, which receives the entrance traffic of the cluster or the whole microservice architecture through a gateway and completes the traffic distribution through APIs. This is one of the most used patterns, and gateway products like Spring Cloud Gateway or Apache APISIX can be applied.

The popular architectures are gradually extending to Cloud Native architectures. Can a microservice architecture under Cloud Native simply build the original microservice as a container image and migrate it directly to Kubernetes?

In theory, it seems possible, but in practice there are some challenges. In a Cloud Native microservice architecture, these components need to run not just in containers, but also include other aspects such as service registration, discovery, and configuration.

The migration process also involves business-level transformation and adaptation, requiring the migration of common logic such as authentication, authorization, and observability-related capabilities (logging, monitoring, etc.) to K8s. Therefore, the migration from the original physical machine deployment to the K8s platform is much more complex than it seems.

In this case, we can use the Sidecar model to abstract and simplify the above scenario.

Typically, the Sidecar model comes in the form of a Sidecar Proxy, which evolves from the left side of the diagram below to the right side by sinking some generic capabilities (such as authentication, authorization, security, etc.) into Sidecar. As you can see from the diagram, this model has been adapted from requiring multiple components to be maintained to requiring only two things (application + Sidecar) to be maintained. At the same time, the Sidecar model itself contains some common components, so it does not need to be maintained by the business side itself, thus easily solving the problem of microservice communication.

To avoid separate configuration and reinventing the wheel each time a Sidecar is introduced for a microservice, the process can be implemented by introducing a control plane or by control plane injection, which gradually forms the current Service Mesh.

Service Mesh usually requires two components, i.e., control plane + data plane. The control plane completes the distribution of configuration and the execution of the related logic, such as Istio, which is currently the most popular. On the data plane, you can choose an API gateway like Apache APISIX for traffic forwarding and service communication. Thanks to the high performance and scalability of APISIX, it is also possible to perform some customization requirements and custom logic. The following shows the architecture of the Service Mesh solution with Istio+APISIX.

The advantage of this solution is that when you want to migrate from the previous microservice architecture to a Cloud Native architecture, you can avoid massive changes on the business side by using a Service Mesh solution directly.

Technical Challenges of Cloud Native

The preceding sections covered some of the technical advantages of the current Cloud Native trend. However, every coin has two sides. New technologies bring fresh elements and opportunities, but they also bring challenges.

Problems Caused by Containerization and K8s

In the beginning part of the article, we mentioned that containerization technology uses a shared kernel, and the shared kernel brings lightness but creates a lack of isolation. If container escape occurs, the corresponding host may be attacked. Therefore, to meet these security challenges, technologies such as secure containers have been introduced.

In addition, although container images provide a standardized delivery method, they are prone to be attacked, such as supply chain attacks.

Similarly, the introduction of K8s has also brought about challenges in component security. The increase in components has led to a rise in the attack surface, as well as additional vulnerabilities related to the underlying components and dependency levels. At the infrastructure level, migrating from traditional physical or virtual machines to K8s involves infrastructure transformation costs and more labor costs to perform cluster data backups, periodic upgrades, and certificate renewals.

Also, in the Kubernetes architecture, the apiserver is the core component of the cluster and needs to handle all the inside and outside traffic. Therefore, in order to avoid border security issues, how to protect the apiserver also becomes a key question. For example, we can use Apache APISIX to protect it.

Security

The use of new technologies requires additional attention at the security level:

At the network security level, fine-grained control of traffic can be implemented by Network Policy, or other connection encryption methods like mTLS to form a zero-trust network.
At the data security level, K8s provides the secret resource for handling confidential data, but actually, it is not secure. The contents of the secret resource are encoded in Base64, which means you can access the contents through Base64 decoding, especially if they are placed in etcd, which can be read directly if you have access to etcd.
At the level of permission security, there is also a situation where RBAC settings are not reasonable, which leads to an attacker using the relevant Token to communicate with the apiserver to achieve the purpose of the attack. This kind of permission setting is mostly seen in the controller and operator scenarios.
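The point about Secrets can be checked in a few lines of Python. This is a minimal sketch with a made-up value (`s3cr3t-password` is hypothetical and not taken from any real cluster). It shows that whoever can read the encoded string can recover the original without any key.

```python
import base64

# A hypothetical value, encoded the way it would appear in a Secret's data field.
encoded = base64.b64encode(b"s3cr3t-password").decode("ascii")
print(encoded)

# Anyone who can read the Secret (or etcd) can reverse the encoding.
# No key is required, because Base64 is an encoding, not encryption.
decoded = base64.b64decode(encoded).decode("utf-8")
print(decoded)
```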

Observability

Most of the Cloud Native scenarios involve some observability-related operations such as logging, monitoring, etc.

In K8s, if you want to collect logs in a variety of ways, you need to collect them directly on each K8s node through aggregation. If logs are collected in this way, the application needs to write them to standard output or standard error.

However, if the business does not make relevant changes and still chooses to write all the application logs to a file in the container, it means that a Sidecar is needed for log collection in each instance, which makes the deployment architecture extremely complex.

Back to the architecture governance level, the selection of monitoring solutions in the Cloud Native environment also poses some challenges. Once the solution selection is wrong, the subsequent cost of use is very high, and the loss can be huge if the direction is wrong.

Also, there are capacity issues involved at the monitoring level. When deploying an application in K8s, you can simply configure resource limits to cap the resources the application can use. However, in a K8s environment, it is still rather easy to oversell resources, over-utilize them, and run out of memory as a result.
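One way overselling can arise is sketched below in Python with made-up numbers. The node size and the pod names and values are assumptions for illustration only. In Kubernetes, scheduling is based on a pod's resource requests, while its limits are the cap it may grow to, so the requests can fit on a node even when the limits add up to more than the node has.

```python
# Toy numbers, assumed for illustration only.
NODE_MEMORY_MIB = 8192

# Each pod has a memory request (used for scheduling) and a limit (the cap).
pods = [
    {"name": "api", "request_mib": 1024, "limit_mib": 4096},
    {"name": "worker", "request_mib": 2048, "limit_mib": 4096},
    {"name": "cache", "request_mib": 1024, "limit_mib": 2048},
]

total_requests = sum(p["request_mib"] for p in pods)
total_limits = sum(p["limit_mib"] for p in pods)

# The requests fit on the node, so all three pods can be placed there...
fits = total_requests <= NODE_MEMORY_MIB
# ...but the limits add up to more than the node has: it is oversold.
overcommit_ratio = total_limits / NODE_MEMORY_MIB

print(f"requests={total_requests}MiB fits={fits}")
print(f"limits={total_limits}MiB overcommit={overcommit_ratio:.2f}x")
```

If every pod in this sketch grew to its limit at the same time, the node would run out of memory.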

In addition, when an entire cluster or a node runs out of resources, eviction is triggered, which means workloads already running on a node are evicted to other nodes. If a cluster's resources are tight, a node storm can easily cause the entire cluster to crash.

Application Evolution and Multi-cluster Pattern

At the application architecture evolution level, the core issue is service discovery.

K8s provides a DNS-based service discovery mechanism by default, but if cloud workloads coexist with existing (legacy) workloads, handling that with DNS-based service discovery becomes more complicated.

Meanwhile, if enterprises choose Cloud Native technology, with the expansion of business scale, they will gradually go to consider the direction of multi-node processing, which will then involve multi-cluster issues.

For example, we may want to provide customers with a higher availability model through multiple clusters. This involves the orchestration of services across clusters, multi-cluster load distribution and configuration synchronization, and how to handle and deploy strategies for clusters in multi-cloud and hybrid cloud scenarios. These are some of the challenges that will be faced.

How APISIX Enables Digital Transformation

Apache APISIX is a Cloud Native API gateway under the Apache Software Foundation, which is dynamic, real-time, and high-performance, providing rich traffic management features such as load balancing, dynamic upstream, canary release, circuit breaking, authentication, observability, etc. You can use Apache APISIX to handle traditional north-south traffic, as well as east-west traffic between services.

Currently, based on the architectural evolution and application changes described above, APISIX-based Ingress controller and Service Mesh solutions have also been derived in Apache APISIX to help enterprises to better carry out digital transformation.

APISIX Ingress Solution

Apache APISIX Ingress Controller is a Kubernetes Ingress Controller implementation that serves primarily as a traffic gateway for handling north-south Kubernetes traffic.

The APISIX Ingress Controller architecture is similar to APISIX in that it is a separate architecture for the control plane and the data plane. In this case, APISIX is used as the data plane for the actual traffic processing.

Currently, APISIX Ingress Controller supports the following three configuration methods and is compatible with all APISIX plugins out of the box:

Support for Ingress resources native to K8s. This approach allows APISIX Ingress Controller to have a higher level of adaptability. So far, APISIX Ingress Controller supports more versions of the Ingress resource than any other influential open-source Ingress controller product.
Support for using custom resources. The current custom resources of APISIX Ingress Controller are a set of CRD specifications designed according to APISIX semantics. Using custom resources makes it easy to integrate with APISIX and is more native.
Support for Gateway API. Gateway API is the next generation of the Ingress standard, and APISIX Ingress Controller has started to support it (Beta stage). As the Gateway API evolves, it is likely to become a built-in resource for K8s directly.

APISIX Ingress Controller has the following advantages over Ingress NGINX:

Architectural separation. In APISIX Ingress, the architecture of the data plane and control plane are separated. When the traffic processing pressure is high and you want to expand the capacity, you can simply do the expansion of the data plane, which allows more data planes to be served externally without the need to make any adjustments to the control plane.
High scalability and support for custom plugins.
APISIX as the data plane, with high performance and fully dynamic features. Thanks to the fully dynamic nature of APISIX, APISIX Ingress can protect business traffic as much as possible.

Currently, APISIX Ingress Controller is used by many companies worldwide, such as China Mobile Cloud Open Platform (an open API and cloud IDE product), Upyun, and Copernicus (part of Europe's Eyes on Earth).

APISIX Ingress Controller is still in continuous iteration, and we plan to improve more functions in the following ways:

Complete support for the Gateway API to enable more scenario configurations.
Support external service proxy.
Native support for multiple registries to make APISIX Ingress Controller more versatile.
Architectural updates to create a new architectural model.
Integrate with Argo CD/Flux and other GitOps tools to create a rich ecosystem.

If you are interested in the APISIX Ingress solution, please feel free to follow the community updates for product iterations and community trends.

APISIX Service Mesh Solution

Currently, in addition to the API gateway and Ingress solution, the APISIX-based Service Mesh solution is also in active iteration.

The APISIX-based Service Mesh solution consists of two main components, namely the control plane and the data plane. Istio was chosen for the control plane since it is an industry leader with an active community and is supported by multiple vendors. APISIX was chosen to replace Envoy on the data plane, allowing APISIX's high performance and scalability to come into play.

APISIX's Service Mesh is still being actively pursued, with subsequent iterations planned in the following directions:

Performing eBPF acceleration to improve overall effectiveness.
Performing plugin capability integration to allow better use of APISIX Ingress capabilities within the Service Mesh architecture.
Creating a seamless migration tool to provide easier tools and simplify the process for users.

In general, the evolution of architecture and technology in the Cloud Native era brings us both opportunities and challenges. Apache APISIX as a Cloud Native gateway has been committed to more technical adaptations and integrations for the Cloud Native trend. Various solutions based on APISIX have also started to help enterprise users to carry out digital transformation and help enterprises to transition to the Cloud Native track more smoothly.

Thoroughly understand Events in Kubernetes

Jintao Zhang — Tue, 12 Apr 2022 15:42:35 +0000

Hi everyone, this is Jintao Zhang.

Previously I wrote an article, "A More Elegant Kubernetes Cluster Event Measurement Scheme", which used Jaeger tracing to collect and display events in a Kubernetes cluster. The final effect is as follows:

When I wrote that article, I promised to explain the principles in detail. I have been putting it off for a long time. Now it's the end of the year, and it's time to deliver.

Events overview

Let's first make a simple example to see what events in a Kubernetes cluster are.

Create a new namespace called moelove, and then create a deployment called redis in it. Next, look at all events in this namespace.

(MoeLove) ➜ kubectl create ns moelove
namespace/moelove created
(MoeLove) ➜ kubectl -n moelove create deployment redis --image=ghcr.io/moelove/redis:alpine 
deployment.apps/redis created
(MoeLove) ➜ kubectl -n moelove get deploy
NAME    READY   UP-TO-DATE   AVAILABLE   AGE
redis   1/1     1            1           11s
(MoeLove) ➜ kubectl -n moelove get events
LAST SEEN   TYPE     REASON              OBJECT                        MESSAGE
21s         Normal   Scheduled           pod/redis-687967dbc5-27vmr    Successfully assigned moelove/redis-687967dbc5-27vmr to kind-worker3
21s         Normal   Pulling             pod/redis-687967dbc5-27vmr    Pulling image "ghcr.io/moelove/redis:alpine"
15s         Normal   Pulled              pod/redis-687967dbc5-27vmr    Successfully pulled image "ghcr.io/moelove/redis:alpine" in 6.814310968s
14s         Normal   Created             pod/redis-687967dbc5-27vmr    Created container redis
14s         Normal   Started             pod/redis-687967dbc5-27vmr    Started container redis
22s         Normal   SuccessfulCreate    replicaset/redis-687967dbc5   Created pod: redis-687967dbc5-27vmr
22s         Normal   ScalingReplicaSet   deployment/redis              Scaled up replica set redis-687967dbc5 to 1

But we will find that by default the output of kubectl get events is not sorted in the order in which the events occurred, so we often need to add the --sort-by='{.metadata.creationTimestamp}' parameter so that the output is ordered by time.

This is why Kubernetes added the kubectl alpha events command in v1.23. I covered it in detail in the previous article, so I won't expand on it here.

After sorting by time, you can see the following results:

(MoeLove) ➜ kubectl -n moelove get events --sort-by='{.metadata.creationTimestamp}'
LAST SEEN   TYPE     REASON              OBJECT                        MESSAGE
2m12s       Normal   Scheduled           pod/redis-687967dbc5-27vmr    Successfully assigned moelove/redis-687967dbc5-27vmr to kind-worker3
2m13s       Normal   SuccessfulCreate    replicaset/redis-687967dbc5   Created pod: redis-687967dbc5-27vmr
2m13s       Normal   ScalingReplicaSet   deployment/redis              Scaled up replica set redis-687967dbc5 to 1
2m12s       Normal   Pulling             pod/redis-687967dbc5-27vmr    Pulling image "ghcr.io/moelove/redis:alpine"
2m6s        Normal   Pulled              pod/redis-687967dbc5-27vmr    Successfully pulled image "ghcr.io/moelove/redis:alpine" in 6.814310968s
2m5s        Normal   Created             pod/redis-687967dbc5-27vmr    Created container redis
2m5s        Normal   Started             pod/redis-687967dbc5-27vmr    Started container redis
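The same ordering can be reproduced client-side. The Python sketch below assumes the events are available as dictionaries shaped like the output of kubectl get events -o json. The three sample events and their timestamps are made up for illustration, and `creation_time` is a hypothetical helper, not part of any Kubernetes client.

```python
from datetime import datetime


def creation_time(event: dict) -> datetime:
    """Parse metadata.creationTimestamp (UTC, second precision)."""
    ts = event["metadata"]["creationTimestamp"]
    return datetime.strptime(ts, "%Y-%m-%dT%H:%M:%SZ")


# Sample events in arbitrary order; the timestamps are assumed for illustration.
events = [
    {"reason": "Started", "metadata": {"creationTimestamp": "2021-12-28T19:31:21Z"}},
    {"reason": "Scheduled", "metadata": {"creationTimestamp": "2021-12-28T19:31:13Z"}},
    {"reason": "Pulled", "metadata": {"creationTimestamp": "2021-12-28T19:31:20Z"}},
]

# Sort by creation time, which is the field --sort-by points at above.
for event in sorted(events, key=creation_time):
    print(event["metadata"]["creationTimestamp"], event["reason"])
```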

Through the above operations, we can find that events is actually a resource in the Kubernetes cluster. When the resource status in the Kubernetes cluster changes, new events can be generated.

In-depth Events

Single Event object

Since Event is a resource in a Kubernetes cluster, each Event object should normally have a name in metadata.name that can be used to operate on it individually. So we can use the following command to output the names:

(MoeLove) ➜ kubectl -n moelove get events --sort-by='{.metadata.creationTimestamp}' -o jsonpath='{range .items[*]}{.metadata.name}{"\n"}{end}'
redis-687967dbc5-27vmr.16c4fb7bde8c69d2
redis-687967dbc5.16c4fb7bde6b54c4
redis.16c4fb7bde1bf769
redis-687967dbc5-27vmr.16c4fb7bf8a0ab35
redis-687967dbc5-27vmr.16c4fb7d8ecaeff8
redis-687967dbc5-27vmr.16c4fb7d99709da9
redis-687967dbc5-27vmr.16c4fb7d9be30c06

Select any one of the event records and output it in YAML format for viewing:

(MoeLove) ➜ kubectl -n moelove get events redis-687967dbc5-27vmr.16c4fb7bde8c69d2 -o yaml
action: Binding
apiVersion: v1
eventTime: "2021-12-28T19:31:13.702987Z"
firstTimestamp: null
involvedObject:
  apiVersion: v1
  kind: Pod
  name: redis-687967dbc5-27vmr
  namespace: moelove
  resourceVersion: "330230"
  uid: 71b97182-5593-47b2-88cc-b3f59618c7aa
kind: Event
lastTimestamp: null
message: Successfully assigned moelove/redis-687967dbc5-27vmr to kind-worker3
metadata:
  creationTimestamp: "2021-12-28T19:31:13Z"
  name: redis-687967dbc5-27vmr.16c4fb7bde8c69d2
  namespace: moelove
  resourceVersion: "330235"
  uid: e5c03126-33b9-4559-9585-5e82adcd96b0
reason: Scheduled
reportingComponent: default-scheduler
reportingInstance: default-scheduler-kind-control-plane
source: {}
type: Normal

You can see that it contains a lot of information, which we will not expand on here. Let's look at another example.

Events in `kubectl describe`

Run describe on the Deployment object and the Pod object respectively, and you get the following results (the intermediate output is omitted):

Describe the Deployment

(MoeLove) ➜ kubectl -n moelove describe deploy/redis                
Name:                   redis
Namespace:              moelove
...
Events:
  Type    Reason             Age   From                   Message
  ----    ------             ----  ----                   -------
  Normal  ScalingReplicaSet  15m   deployment-controller  Scaled up replica set redis-687967dbc5 to 1

Describe the Pod

(MoeLove) ➜ kubectl -n moelove describe pods redis-687967dbc5-27vmr
Name:         redis-687967dbc5-27vmr                                                                 
Namespace:    moelove
Priority:     0
Events:
  Type    Reason     Age   From               Message
  ----    ------     ----  ----               -------
  Normal  Scheduled  18m   default-scheduler  Successfully assigned moelove/redis-687967dbc5-27vmr to kind-worker3
  Normal  Pulling    18m   kubelet            Pulling image "ghcr.io/moelove/redis:alpine"
  Normal  Pulled     17m   kubelet            Successfully pulled image "ghcr.io/moelove/redis:alpine" in 6.814310968s
  Normal  Created    17m   kubelet            Created container redis
  Normal  Started    17m   kubelet            Started container redis

We can see that when we describe different resource objects, the events shown are only those directly related to the object itself. When you describe the Deployment, you cannot see Pod-related Events.

This shows that an Event object contains information about the resource object it describes; the two are directly linked.

Looking back at the single Event object we saw earlier, we can see that its involvedObject field identifies the resource object associated with the Event.
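The link can be illustrated with a short Python sketch that filters events by involvedObject. This shows the idea only and is not how kubectl describe is implemented. The `events_for` helper is hypothetical, and the sample events are trimmed to the fields used here, with kinds and names taken from the listings above.

```python
def events_for(events: list[dict], kind: str, name: str) -> list[dict]:
    """Keep only events whose involvedObject matches the given kind and name."""
    return [
        e for e in events
        if e["involvedObject"]["kind"] == kind
        and e["involvedObject"]["name"] == name
    ]


# Events reduced to the fields needed for the example.
events = [
    {"reason": "Scheduled",
     "involvedObject": {"kind": "Pod", "name": "redis-687967dbc5-27vmr"}},
    {"reason": "SuccessfulCreate",
     "involvedObject": {"kind": "ReplicaSet", "name": "redis-687967dbc5"}},
    {"reason": "ScalingReplicaSet",
     "involvedObject": {"kind": "Deployment", "name": "redis"}},
]

# Only the Deployment's own event is selected; Pod events are not included.
for e in events_for(events, "Deployment", "redis"):
    print(e["reason"])
```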

Learn more about Events

Let's take a look at the following example, creating a Deployment, but using a non-existing image:

(MoeLove) ➜ kubectl -n moelove create deployment non-exist --image=ghcr.io/moelove/non-exist
deployment.apps/non-exist created
(MoeLove) ➜ kubectl -n moelove get pods
NAME                        READY   STATUS         RESTARTS   AGE
non-exist-d9ddbdd84-tnrhd   0/1     ErrImagePull   0          11s
redis-687967dbc5-27vmr      1/1     Running        0          26m

We can see that the Pod is now in the ErrImagePull state. Let's view the events in the current namespace (I have omitted the earlier records for deploy/redis):

(MoeLove) ➜ kubectl -n moelove get events --sort-by='{.metadata.creationTimestamp}'                                                           
LAST SEEN   TYPE      REASON              OBJECT                           MESSAGE
35s         Normal    SuccessfulCreate    replicaset/non-exist-d9ddbdd84   Created pod: non-exist-d9ddbdd84-tnrhd
35s         Normal    ScalingReplicaSet   deployment/non-exist             Scaled up replica set non-exist-d9ddbdd84 to 1
35s         Normal    Scheduled           pod/non-exist-d9ddbdd84-tnrhd    Successfully assigned moelove/non-exist-d9ddbdd84-tnrhd to kind-worker3
17s         Warning   Failed              pod/non-exist-d9ddbdd84-tnrhd    Error: ErrImagePull
17s         Warning   Failed              pod/non-exist-d9ddbdd84-tnrhd    Failed to pull image "ghcr.io/moelove/non-exist": rpc error: code = Unknown desc = failed to pull and unpack image "ghcr.io/moelove/non-exist:latest": failed to resolve reference "ghcr.io/moelove/non-exist:latest": failed to authorize: failed to fetch anonymous token: unexpected status: 403 Forbidden
18s         Normal    Pulling             pod/non-exist-d9ddbdd84-tnrhd    Pulling image "ghcr.io/moelove/non-exist"
4s          Warning   Failed              pod/non-exist-d9ddbdd84-tnrhd    Error: ImagePullBackOff
4s          Normal    BackOff             pod/non-exist-d9ddbdd84-tnrhd    Back-off pulling image "ghcr.io/moelove/non-exist"

Run kubectl describe on this Pod:

(MoeLove) ➜ kubectl -n moelove describe pods non-exist-d9ddbdd84-tnrhd
...
Events:
  Type     Reason     Age                    From               Message
  ----     ------     ----                   ----               -------
  Normal   Scheduled  4m                     default-scheduler  Successfully assigned moelove/non-exist-d9ddbdd84-tnrhd to kind-worker3
  Normal   Pulling    2m22s (x4 over 3m59s)  kubelet            Pulling image "ghcr.io/moelove/non-exist"
  Warning  Failed     2m21s (x4 over 3m59s)  kubelet            Failed to pull image "ghcr.io/moelove/non-exist": rpc error: code = Unknown desc = failed to pull and unpack image "ghcr.io/moelove/non-exist:latest": failed to resolve reference "ghcr.io/moelove/non-exist:latest": failed to authorize: failed to fetch anonymous token: unexpected status: 403 Forbidden
  Warning  Failed     2m21s (x4 over 3m59s)  kubelet            Error: ErrImagePull
  Warning  Failed     2m9s (x6 over 3m58s)   kubelet            Error: ImagePullBackOff
  Normal   BackOff    115s (x7 over 3m58s)   kubelet            Back-off pulling image "ghcr.io/moelove/non-exist"

The output here differs from that of the Pod that was running correctly. The main difference is in the Age column, where we now see output such as 115s (x7 over 3m58s).

It means that this type of event has occurred 7 times in 3m58s, and the most recent occurrence was 115s ago.

But when we ran kubectl get events directly, we did not see 7 repeated events. This shows that Kubernetes automatically merges duplicate events into one.

Select the last Event (the method was described earlier) and output it in YAML format:
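To make the merging idea concrete, here is a simplified Python sketch. It assumes that two occurrences are duplicates when their object, reason, message, and source all match; the real correlation logic in Kubernetes is more involved, and the merge_events helper and its input shape are hypothetical.

```python
def merge_events(raw_events):
    """Collapse duplicate occurrences into one record per key, keeping a count."""
    merged = {}
    for ev in raw_events:
        # Assumption: these four fields together identify "the same" event.
        key = (ev["object"], ev["reason"], ev["message"], ev["source"])
        record = merged.get(key)
        if record is None:
            merged[key] = {
                "object": ev["object"],
                "reason": ev["reason"],
                "message": ev["message"],
                "source": ev["source"],
                "count": 1,
                "firstTimestamp": ev["time"],
                "lastTimestamp": ev["time"],
            }
        else:
            record["count"] += 1
            # Same-format UTC timestamps compare correctly as plain strings.
            record["firstTimestamp"] = min(record["firstTimestamp"], ev["time"])
            record["lastTimestamp"] = max(record["lastTimestamp"], ev["time"])
    return list(merged.values())


# Hypothetical raw occurrences; the timestamps are made up for illustration.
pod = "pod/non-exist-d9ddbdd84-tnrhd"
backoff = 'Back-off pulling image "ghcr.io/moelove/non-exist"'
raw = [
    {"object": pod, "reason": "Scheduled", "message": "Successfully assigned",
     "source": "default-scheduler", "time": "2021-12-28T20:00:00Z"},
    {"object": pod, "reason": "BackOff", "message": backoff,
     "source": "kubelet", "time": "2021-12-28T20:00:20Z"},
    {"object": pod, "reason": "BackOff", "message": backoff,
     "source": "kubelet", "time": "2021-12-28T20:00:45Z"},
    {"object": pod, "reason": "BackOff", "message": backoff,
     "source": "kubelet", "time": "2021-12-28T20:01:30Z"},
]

result = merge_events(raw)
for rec in result:
    print(rec["reason"], rec["count"], rec["firstTimestamp"], rec["lastTimestamp"])
```

Three BackOff occurrences end up as a single record with count 3, which is why a plain listing shows one row rather than one row per occurrence.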

(MoeLove) ➜ kubectl -n moelove get events non-exist-d9ddbdd84-tnrhd.16c4fce570cfba46 -o yaml
apiVersion: v1
count: 43
eventTime: null
firstTimestamp: "2021-12-28T19:57:06Z"
involvedObject:
  apiVersion: v1
  fieldPath: spec.containers{non-exist}
  kind: Pod
  name: non-exist-d9ddbdd84-tnrhd
  namespace: moelove
  resourceVersion: "333366"
  uid: 33045163-146e-4282-b559-fec19a189a10
kind: Event
lastTimestamp: "2021-12-28T18:07:14Z"
message: Back-off pulling image "ghcr.io/moelove/non-exist"
metadata:
  creationTimestamp: "2021-12-28T19:57:06Z"
  name: non-exist-d9ddbdd84-tnrhd.16c4fce570cfba46
  namespace: moelove
  resourceVersion: "334638"
  uid: 60708be0-23b9-481b-a290-dd208fed6d47
reason: BackOff
reportingComponent: ""
reportingInstance: ""
source:
  component: kubelet
  host: kind-worker3
type: Normal

Here we can see a count field, which indicates how many times the same type of event has occurred. firstTimestamp and lastTimestamp record when this event first occurred and when it most recently occurred, respectively. This also explains the duration shown for the events in the previous output.
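The relationship between these three fields and the Age column can be sketched as follows. This is a rough Python approximation rather than kubectl's actual formatting code: the human_duration and age_column helpers are hypothetical, the duration formatting only covers seconds and minutes, and the timestamps are made up so that the result matches the earlier 115s (x7 over 3m58s) example.

```python
from datetime import datetime, timezone


def parse_ts(value):
    """Parse an RFC 3339 UTC timestamp such as '2021-12-28T20:00:00Z'."""
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def human_duration(seconds):
    """Simplified formatting: plain seconds below 2 minutes, then minutes and seconds."""
    seconds = int(seconds)
    if seconds < 120:
        return f"{seconds}s"
    minutes, rest = divmod(seconds, 60)
    return f"{minutes}m{rest}s" if rest else f"{minutes}m"


def age_column(event, now):
    """Build an Age string from count, firstTimestamp and lastTimestamp."""
    since_last = human_duration((now - parse_ts(event["lastTimestamp"])).total_seconds())
    if event["count"] <= 1:
        return since_last
    since_first = human_duration((now - parse_ts(event["firstTimestamp"])).total_seconds())
    return f"{since_last} (x{event['count']} over {since_first})"


# Hypothetical values chosen for illustration only.
event = {
    "count": 7,
    "firstTimestamp": "2021-12-28T20:00:00Z",
    "lastTimestamp": "2021-12-28T20:02:03Z",
}
now = datetime(2021, 12, 28, 20, 3, 58, tzinfo=timezone.utc)
print(age_column(event, now))
```

The first number is the time since lastTimestamp, and the value after "over" is the time since firstTimestamp.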

Understand Events thoroughly

The following is an Event picked at random; we can see some of the fields it contains:

apiVersion: v1
count: 1
eventTime: null
firstTimestamp: "2021-12-28T19:31:13Z"
involvedObject:
  apiVersion: apps/v1
  kind: ReplicaSet
  name: redis-687967dbc5
  namespace: moelove
  resourceVersion: "330227"
  uid: 11e98a9d-9062-4ccb-92cb-f51cc74d4c1d
kind: Event
lastTimestamp: "2021-12-28T19:31:13Z"
message: 'Created pod: redis-687967dbc5-27vmr'
metadata:
  creationTimestamp: "2021-12-28T19:31:13Z"
  name: redis-687967dbc5.16c4fb7bde6b54c4
  namespace: moelove
  resourceVersion: "330231"
  uid: 8e37ec1e-b3a1-420c-96d4-3b3b2995c300
reason: SuccessfulCreate
reportingComponent: ""
reportingInstance: ""
source:
  component: replicaset-controller
type: Normal

The meanings of the main fields are as follows:

count: how many times this kind of event has occurred (described earlier)
involvedObject: the resource object directly related to this event (introduced above). Its structure is as follows:

type ObjectReference struct {
    Kind string
    Namespace string
    Name string
    UID types.UID
    APIVersion string
    ResourceVersion string
    FieldPath string
}

source: the component that reported the event. Its structure is as follows:

type EventSource struct {
    Component string
    Host string
}

reason: a short summary (or a fixed code) that is well suited to filtering and is mainly meant to be machine readable. There are currently more than 50 such codes.
message: a detailed description that is easier for people to understand
type: currently only Normal and Warning exist, and their meanings are documented in the source code:

// staging/src/k8s.io/api/core/v1/types.go
const (
    // Information only and will not cause any problems
    EventTypeNormal string = "Normal"
    // These events are to warn that something might go wrong
    EventTypeWarning string = "Warning"
)

Therefore, when we collect these Events as a tracing source, we can group them by involvedObject and sort them by time.
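A minimal Python sketch of that grouping step might look like this. The build_timelines helper is hypothetical, the events are trimmed-down dicts using the field names from the YAML above, and the timestamps are invented for illustration.

```python
from collections import defaultdict


def build_timelines(events):
    """Group events by their involvedObject and sort each group by time."""
    timelines = defaultdict(list)
    for ev in events:
        obj = ev["involvedObject"]
        timelines[(obj["kind"], obj["name"])].append(ev)
    for group in timelines.values():
        # Same-format UTC timestamps sort correctly as plain strings.
        group.sort(key=lambda e: e["lastTimestamp"])
    return dict(timelines)


# Hypothetical collected events, deliberately out of order.
collected = [
    {"reason": "BackOff", "lastTimestamp": "2021-12-28T20:00:31Z",
     "involvedObject": {"kind": "Pod", "name": "non-exist-d9ddbdd84-tnrhd"}},
    {"reason": "ScalingReplicaSet", "lastTimestamp": "2021-12-28T20:00:00Z",
     "involvedObject": {"kind": "Deployment", "name": "non-exist"}},
    {"reason": "Scheduled", "lastTimestamp": "2021-12-28T20:00:00Z",
     "involvedObject": {"kind": "Pod", "name": "non-exist-d9ddbdd84-tnrhd"}},
    {"reason": "Pulling", "lastTimestamp": "2021-12-28T20:00:18Z",
     "involvedObject": {"kind": "Pod", "name": "non-exist-d9ddbdd84-tnrhd"}},
]

timelines = build_timelines(collected)
for (kind, name), group in timelines.items():
    print(kind, name, [e["reason"] for e in group])
```

Each key then holds a time-ordered history for one resource, which is the shape a tracing backend needs in order to display the events as a sequence.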

Summary

In this article, I used two examples, a correctly deployed Deployment and a Deployment that uses a non-existent image, to introduce what the Events object actually does and what each of its fields means.

For Kubernetes, Events contain a lot of useful information, but this information has no effect on Kubernetes itself, and Events are not the actual Kubernetes logs. By default, Events are cleaned up after 1 hour in order to reduce the resources they occupy in etcd.

So, to let cluster administrators better understand what happened, we usually collect the events of the Kubernetes cluster in production environments. The tool I personally recommend is: https://github.com/opsgenie/kubernetes-event-exporter

Of course, you can also follow my previous article "A More Elegant Kubernetes Cluster Event Measurement Scheme", which uses Jaeger tracing to collect the events in a Kubernetes cluster and display them.

You are welcome to subscribe to my articles 【MoeLove】
