Never Miss a Cloud Run Error: A Guide to Comprehensive GCP Alerting

#googlecloud #devops #monitoring #webdev

Deploying an application to Google Cloud Run is an exhilarating experience. The serverless magic, the automatic scaling, the elegant containerization—it all works seamlessly. But what happens when things go wrong? When your NestJS API throws an unhandled exception, or your Next.js app starts returning 5xx errors?

Without a proper alerting system, you're flying blind. You most likely won't know about the problem until a user reports it. In this post, we'll walk through setting up a comprehensive, "catch-all" alerting system for all your Cloud Run applications using Google's built-in tools: Cloud Logging and Cloud Monitoring.

We'll cover the right way to query logs, how to create a single, unified metric for all errors, and how to configure an email alert that notifies you as soon as Cloud Monitoring registers the first error.

Step 1: The Core Principle: From Logs to Alerts

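The foundation of our strategy is that both of your applications (NestJS and Next.js) automatically write all their standard output and errors to Cloud Logging. Error entries are expected to carry a severity of ERROR, and we'll leverage this to build our alerting system. Plain-text lines written to stderr may not receive that severity on their own, so confirm in the Logs Explorer that your error entries really show up as ERROR, or emit structured JSON logs with a severity field.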

Our goal is to create a single Logs-based Metric that acts as a central counter for every error from every Cloud Run service in your project. Once we have that metric, we can set an alert policy that triggers on the very first error.
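
Since everything downstream depends on error entries actually carrying an ERROR severity, here is a minimal sketch of structured logging, assuming your service writes one JSON object per line and that Cloud Logging picks up the `severity` and `message` fields from it. The helper names `formatLogEntry` and `logError` are hypothetical and not part of NestJS or Next.js.

```typescript
// Subset of the severity names understood by Cloud Logging.
type Severity = "INFO" | "WARNING" | "ERROR" | "CRITICAL";

// Build a single-line JSON log entry. The "severity" and "message" keys are
// the fields Cloud Logging is assumed to read from structured output.
function formatLogEntry(
  severity: Severity,
  message: string,
  extra: Record<string, unknown> = {},
): string {
  // Spread `extra` first so it cannot overwrite severity or message.
  return JSON.stringify({ ...extra, severity, message });
}

// Hypothetical helper: log any thrown value as an ERROR entry on stderr.
function logError(err: unknown, context: Record<string, unknown> = {}): void {
  const message = err instanceof Error ? (err.stack ?? err.message) : String(err);
  console.error(formatLogEntry("ERROR", message, context));
}
```

You could call `logError(err, { route: "/orders" })` from a global exception filter or an API route's catch block; the `route` field here is only an illustration.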

Step 2: Creating a Unified Logs-Based Metric

Instead of creating separate alerts for each service, we'll create one metric to rule them all.

Navigate to Logs-based Metrics: In the Google Cloud Console, go to Cloud Logging and select Log-based Metrics from the left-hand menu.
Create a New Metric: Click the "Create Metric" button at the top of the page.
Configure the Metric:
- Metric Type: Select Counter. This will count the number of matching log entries.
- Name: Give it a clear name like cloud-run-errors.
- Description: Add a helpful description: "A counter for all log entries with a severity of ERROR or higher from all Cloud Run services."
- Log Filter: This is the most crucial part. Paste this query into the "Build filter" box. It matches every log entry with a severity of ERROR or higher (so CRITICAL, ALERT, and EMERGENCY as well) from any Cloud Run revision in your project.
```
resource.type="cloud_run_revision" AND severity>=ERROR
```
Finalize: Click "Create Metric". The metric is now created but won't be visible in the alerting policy selector until it receives its first data point.
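
If you would rather keep the metric definition in code than click through the console, the same settings can be expressed as a request body. The sketch below assumes the Cloud Logging API's `projects.metrics.create` method, which takes a `name`, `description`, and `filter`; the function name `buildErrorMetric` is hypothetical, and actually sending the request (authentication, project ID) is left out.

```typescript
// The subset of log-based metric fields used in this guide.
interface LogMetricBody {
  name: string;
  description: string;
  filter: string;
}

// Build the metric definition from Step 2. `minSeverity` defaults to ERROR,
// matching the filter pasted into the console above.
function buildErrorMetric(name: string, minSeverity: string = "ERROR"): LogMetricBody {
  return {
    name,
    description: `A counter for all log entries with a severity of ${minSeverity} or higher from all Cloud Run services.`,
    filter: `resource.type="cloud_run_revision" AND severity>=${minSeverity}`,
  };
}

const metric = buildErrorMetric("cloud-run-errors");
console.log(JSON.stringify(metric, null, 2));
```

The gcloud CLI offers an equivalent in `gcloud logging metrics create`, which accepts the same description and filter through its `--description` and `--log-filter` flags.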

Step 3: Setting Up a "Zero-Tolerance" Alerting Policy

Now we'll create the policy that sends you a notification as soon as a new error is logged.

Create an Alerting Policy: Go to Cloud Monitoring > Alerting > Create Policy.
Select Your Metric: In the "Select a metric" dialog, search for the name of the metric you just created: cloud-run-errors.
Note: If it doesn't appear, you need to wait for a new error to be logged in your service. The metric won't be searchable until it has received its first data point.
Configure the Trigger:
- Condition: Any time series violates a value threshold
- Threshold: Is above
- Value: 0
- For: 1 minute (This ensures you get an alert as soon as the first error occurs, and the metric count goes above zero).
Add Your Notification Channel: If you haven't already, you need to create a notification channel (e.g., your email address) in Alerting > Edit notification channels.
Select your email channel to receive notifications.
Finalize the Alert:
- Give the policy a descriptive name, like "Cloud Run: First Error Detected."
- In the Documentation section, add a helpful message. This is essential, as it will be included in the email. It should tell you where to go to find the error. A perfect addition is a link to the Logs Explorer with a pre-filled query:

Troubleshoot this issue by checking the logs:

https://console.cloud.google.com/logs/query;query=resource.type%3D%22cloud_run_revision%22%20AND%20severity%3E%3DERROR;timeRange=1h
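
Hand-encoding that link is error-prone, so here is a small sketch that builds it from the raw filter. It assumes the URL layout shown above (`query=` and `timeRange=` segments separated by semicolons) and percent-encodes the whole filter, including the double quotes as %22, so mail clients are less likely to cut the link short. The function name `buildLogsExplorerUrl` is hypothetical.

```typescript
// Build a Logs Explorer link with a pre-filled query.
// `timeRange` is passed through unchanged; "1h" is the value used in this guide.
function buildLogsExplorerUrl(filter: string, timeRange: string = "1h"): string {
  const base = "https://console.cloud.google.com/logs/query";
  // encodeURIComponent escapes =, >, spaces, and double quotes.
  return `${base};query=${encodeURIComponent(filter)};timeRange=${timeRange}`;
}

const url = buildLogsExplorerUrl('resource.type="cloud_run_revision" AND severity>=ERROR');
console.log(url);
```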

Finally, click "Create Policy" and you're all set!
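
For reference, the policy configured above can be sketched as a JSON body for the Cloud Monitoring API's alert policy resource. This is an illustration under assumptions, not an export of what the console creates: the `ALIGN_SUM` aligner and 60-second alignment period are my choices for a counter metric, the function name `buildFirstErrorPolicy` is hypothetical, and the notification channel value is a placeholder you would replace with your own channel's resource name.

```typescript
// The subset of alert policy fields used in this sketch.
interface AlertPolicyBody {
  displayName: string;
  combiner: "OR";
  conditions: Array<{
    displayName: string;
    conditionThreshold: {
      filter: string;
      comparison: "COMPARISON_GT";
      thresholdValue: number;
      duration: string;
      aggregations: Array<{ alignmentPeriod: string; perSeriesAligner: string }>;
    };
  }>;
  notificationChannels: string[];
  documentation: { content: string; mimeType: "text/markdown" };
}

function buildFirstErrorPolicy(
  metricName: string,
  notificationChannels: string[],
  documentation: string,
): AlertPolicyBody {
  return {
    displayName: "Cloud Run: First Error Detected",
    combiner: "OR",
    conditions: [
      {
        displayName: "Error count above 0",
        conditionThreshold: {
          // User-defined log-based metrics live under logging.googleapis.com/user/.
          filter: `metric.type="logging.googleapis.com/user/${metricName}" AND resource.type="cloud_run_revision"`,
          comparison: "COMPARISON_GT", // "Is above"
          thresholdValue: 0, // "Value: 0"
          duration: "60s", // "For: 1 minute"
          // Assumption: sum the counter over one-minute windows.
          aggregations: [{ alignmentPeriod: "60s", perSeriesAligner: "ALIGN_SUM" }],
        },
      },
    ],
    notificationChannels,
    documentation: { content: documentation, mimeType: "text/markdown" },
  };
}

const policy = buildFirstErrorPolicy(
  "cloud-run-errors",
  ["projects/PROJECT_ID/notificationChannels/CHANNEL_ID"], // placeholder
  "Troubleshoot this issue by checking the logs.",
);
console.log(JSON.stringify(policy, null, 2));
```

Keeping the policy as data like this makes it easy to review in version control before applying it with whichever tool you use to manage Cloud Monitoring.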

By following these steps, you've established a robust, centralized alerting system for your Cloud Run environment. You've moved beyond reactive troubleshooting and can now proactively respond to issues the moment they happen. This "set it and forget it" approach not only gives you peace of mind but is a critical step in building reliable and scalable cloud-native applications.