A production failure log from implementing a Backstage Golden Path for KServe model deployments — nine distinct failures with exact error output, root causes, and fixes.
If you have ever deployed Backstage and stared at a blank /create page wondering what went wrong, this article is for you.
Most Backstage tutorials end at "the portal is running." This one starts there.
This is a complete production failure log from implementing a Backstage Golden Path that deploys KServe model inference endpoints on Kubernetes. Nine distinct failures. Every one with exact error output, root cause, and the fix that worked.
The goal was simple: a developer fills a Backstage form, a GitHub PR opens, the PR merges, ArgoCD deploys a KServe InferenceService, and the endpoint responds to predictions.
Getting there took nine failures across three days.
This is part three of a series on building a production-hardened AI inference platform:
- Part 1: Why Your KServe InferenceService Won't Become Ready
- Part 2: 5 GitOps Failure Modes That Break KServe Deployments
Project repo: github.com/sodiq-code/neuroscale-platform
What I was trying to build
The Golden Path demo contract:
Backstage form → PR opened → merge → ArgoCD sync → InferenceService Ready=True → curl returns {"predictions":[1,1]}
Stack:
- Backstage (Helm chart, self-hosted on k3d)
- ArgoCD (GitOps reconciliation)
- KServe (model inference endpoints)
- GitHub (scaffolder target)
- Kyverno (admission policies)
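For context, the scaffolder template that drives the form follows the standard scaffolder.backstage.io/v1beta3 shape. The sketch below is illustrative only: the parameter names, skeleton path, and repoUrl values are assumptions for this walkthrough, not the repo's actual template file.

```yaml
# backstage/templates/model-endpoint/template.yaml (illustrative sketch)
apiVersion: scaffolder.backstage.io/v1beta3
kind: Template
metadata:
  name: model-endpoint
  title: KServe Model Endpoint
spec:
  parameters:
    - title: Model details
      required: [name]
      properties:
        name:
          type: string
          description: InferenceService name
  steps:
    - id: fetch
      action: fetch:template
      input:
        url: ./skeleton            # assumed skeleton location
        values:
          name: ${{ parameters.name }}
    - id: pr
      action: publish:github:pull-request
      input:
        repoUrl: github.com?repo=neuroscale-platform&owner=sodiq-code
        branchName: add-${{ parameters.name }}
        title: Add InferenceService ${{ parameters.name }}
```

The publish:github:pull-request step is what produces the "Open pull request" behavior discussed in Failure 5.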
Failure 1: Template Not Visible in Catalog — Silent Rejection With No UI Error
Time lost: 30 minutes
Symptom
After adding the template file and registering it in infrastructure/backstage/values.yaml, the template did not appear in Backstage's /create page. No error was visible in the UI. The page simply showed an empty catalog.
Digging In
$ kubectl -n backstage logs deploy/neuroscale-backstage --tail=50
...
[backstage] warn Failed to process location
{"location":{"type":"url","target":"https://github.com/sodiq-code/
neuroscale-platform/blob/main/backstage/templates/model-endpoint/
template.yaml"},
"error":"NotAllowedError: Forbidden: entity of kind Template
is not allowed from that location"}
The error only appears in server logs. The UI shows nothing.
Root Cause
Backstage's catalog configuration allows only specific entity kinds from each registered location. The default allow list for repository-based locations does not include Template. Without an explicit allow: [Template] rule, entities of kind Template are silently rejected. This is security-by-default behavior — but the complete silence in the UI makes it look like a misconfiguration rather than a permission issue.
Fix
# infrastructure/backstage/values.yaml
backstage:
  backstage:
    appConfig:
      catalog:
        locations:
          - type: url
            target: https://github.com/sodiq-code/neuroscale-platform/blob/main/backstage/templates/model-endpoint/template.yaml
            rules:
              - allow: [Template]
After rolling out the updated Backstage deployment:
$ kubectl -n backstage rollout restart deploy/neuroscale-backstage
$ kubectl -n backstage rollout status deploy/neuroscale-backstage --timeout=300s
deployment "neuroscale-backstage" successfully rolled out
The template appeared in /create within 60 seconds.
Lesson
For a platform team deploying Backstage for internal users, this silent failure means developers see an empty template catalog and assume the platform is broken — not that a config rule is missing. Always check server logs, not just the UI, when Backstage catalog ingestion seems to fail.
Failure 2: Scaffolder /create Page Loads Blank — 401 on Actions API
Time lost: 45 minutes
Symptom
After the template was visible, clicking into it showed a blank form. The browser developer console revealed:
GET /api/scaffolder/v2/actions HTTP/1.1 401 Unauthorized
{"error":{"name":"AuthenticationError","message":"Missing credentials"}}
The page route returned HTTP 200 — the React app loaded — but the actions API returned 401, so the form had no data to render.
Root Cause
Backstage's new backend architecture (introduced in 1.x) adds an internal authentication policy requiring all service-to-service calls to include a valid Backstage token. The scaffolder frontend makes an internal API call to list available actions. Because no auth provider was configured for local development, this internal call was rejected. This is a breaking change from older Backstage versions where the actions endpoint was unauthenticated.
Fix
# infrastructure/backstage/values.yaml
backstage:
  backstage:
    appConfig:
      backend:
        auth:
          dangerouslyDisableDefaultAuthPolicy: true
Production note:
dangerouslyDisableDefaultAuthPolicy: true is acceptable for local development only. For production, configure GitHub OAuth via values-prod.yaml with a proper sign-in policy. The production profile uses auth.providers.guest.dangerouslyAllowOutsideDevelopment: true instead — which keeps the auth subsystem active and provides a real user:default/guest identity, rather than disabling auth entirely.
Lesson
An empty scaffolder form is indistinguishable from a misconfigured form to an end user. The 401 error is only visible in browser developer tools. This is the second failure in this series that generated zero visible error in the UI.
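One way to surface this class of failure without opening browser developer tools is to probe the actions endpoint directly from the CLI. A minimal sketch, assuming the port-forward on 7010 used throughout this article:

```shell
# Print the HTTP status of the scaffolder actions endpoint.
# "000" means the backend is unreachable; 401 means the form will render blank.
actions_status() {
  curl -s -o /dev/null -w '%{http_code}' --max-time 3 \
    "$1/api/scaffolder/v2/actions"
}

actions_status http://localhost:7010; echo
```

A 200 here means the scaffolder can list its actions and the form should populate.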
Failure 3: Frontend Crashes With Blank White Screen — Missing Required Config Key
Time lost: 20 minutes
Symptom
After the auth policy fix, reloading Backstage showed a blank white screen. The browser console:
Uncaught Error: Missing required config value at 'app.title' in 'app'
at validateConfigSchema (config.esm.js:234)
at BackstageApp.render (app.esm.js:891)
Root Cause
The Backstage frontend requires app.title to be present in the runtime configuration. This key was absent from the appConfig section of values.yaml, so the React application crashed on initialization before any content could render. The key is required on first boot, but the documentation does not flag it prominently as such.
Fix
# infrastructure/backstage/values.yaml
backstage:
  backstage:
    appConfig:
      app:
        title: NeuroScale Platform
        baseUrl: http://localhost:7010
      backend:
        baseUrl: http://localhost:7010
        cors:
          origin: http://localhost:7010
Note: app.baseUrl and backend.baseUrl also needed to match the port used for port-forwarding (7010), not the default 7007.
Lesson
A blank white screen with no network errors means the JavaScript runtime crashed before rendering. Always check the browser console — not just network requests — for Backstage frontend failures.
Failure 4: Backstage CrashLoopBackOff — Helm Dependency Values Mis-Nesting
Time lost: 2 hours | Impact: Developer portal completely unavailable
Symptom
$ kubectl get pods -n backstage -w
NAME READY STATUS RESTARTS
neuroscale-backstage-7d9f5b8c4-xqr2m 0/1 CrashLoopBackOff 8 12m
$ kubectl describe pod neuroscale-backstage-7d9f5b8c4-xqr2m -n backstage
...
Events:
Warning Unhealthy 30s kubelet
Startup probe failed: connect: connection refused
Root Cause
The Backstage Helm chart is a wrapper chart with backstage as a dependency. Configuration for the Backstage container itself must be nested under backstage.backstage.*, not backstage.*. The misconfiguration meant probe settings and resource requests were silently ignored, so Kubernetes used default probe timings — a 2-second initial delay — that were far too aggressive for Backstage's ~90-second startup time.
The pod was killed before it could become healthy, triggering CrashLoopBackOff.
Backstage requires:
startupProbe:
  initialDelaySeconds: 120
  failureThreshold: 30
The default gives it 2 seconds.
Fix
Correct the values hierarchy and harden probe timings:
# infrastructure/backstage/values.yaml
backstage:
  backstage:            # <-- must be nested here, not at backstage.*
    appConfig:
      ...
    startupProbe:
      initialDelaySeconds: 120
      failureThreshold: 30
    readinessProbe:
      initialDelaySeconds: 120
    livenessProbe:
      initialDelaySeconds: 300
    resources:
      requests:
        cpu: 100m
        memory: 512Mi
Lesson
If a Helm chart is a wrapper with a dependency, configuration for the dependency must be nested under the dependency's alias key. Values placed at the wrong hierarchy level are silently ignored — Kubernetes uses chart defaults, not your overrides. This incident directly motivated adding CI validation for rendered Helm manifests: if the final Deployment spec had been checked in CI, the wrong probe values would have been caught before deployment. Full RCA: infrastructure/INCIDENT_BACKSTAGE_CRASHLOOP_RCA.md
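The CI check this incident motivated can be sketched as a render-and-grep gate. The chart path and release name below are assumptions about this repo's layout, not confirmed values:

```shell
# Fail CI if the startup probe override never made it into the rendered manifest,
# which is exactly what mis-nested wrapper-chart values produce.
probe_override_present() {
  # $1: rendered manifest text
  printf '%s\n' "$1" | grep -q 'initialDelaySeconds: 120'
}

rendered=$(helm template neuroscale-backstage ./infrastructure/backstage 2>/dev/null || true)
probe_override_present "$rendered" \
  || echo "probe override missing from rendered manifest (or helm unavailable)"
```

The point is to assert against what Kubernetes will actually receive, not against the values file you edited.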
Failure 5: PR Creation Fails — GitHub Token Secret Contains Placeholder Value
Time lost: 30 minutes
Symptom
After the portal was stable, the scaffolder's "Open pull request" step spun for 30 seconds then failed:
Error: Request failed with status 401: Bad credentials
No PR was created in GitHub.
Root Cause
The Kubernetes Secret neuroscale-backstage-secrets contained a placeholder GITHUB_TOKEN value — literally <YOUR_TOKEN_HERE>. The environment variable was present, satisfying kubectl describe secret output, but the value was not a valid token.
A secondary issue: after updating the secret with the correct token, the running pod did not pick up the change. Environment variables from Secrets are injected at pod start time, not dynamically. The pod needed a restart.
Fix
# Update the secret with a valid token
read -s GITHUB_TOKEN
kubectl -n backstage create secret generic neuroscale-backstage-secrets \
--from-literal=GITHUB_TOKEN="$GITHUB_TOKEN" \
--dry-run=client -o yaml | kubectl apply -f -
# Restart to reload env vars
kubectl -n backstage rollout restart deploy/neuroscale-backstage
kubectl -n backstage rollout status deploy/neuroscale-backstage --timeout=300s
# Verify token is present — check length, never the value
kubectl -n backstage exec deploy/neuroscale-backstage -- \
sh -c 'echo ${#GITHUB_TOKEN} chars'
Lesson
kubectl describe secret shows the key exists and has bytes. It does not show whether the value is a valid token or a placeholder string. Always verify token presence by checking character length in the running container, never by reading the secret value directly.
Failure 6: PR Merged But ArgoCD Stays OutOfSync — Fix Not Committed to Git
Time lost: 1 hour of confusion
Symptom
The Backstage scaffolder created the PR correctly. CI passed. The PR was merged. ArgoCD detected the new application. But the child app immediately showed OutOfSync/Degraded:
$ kubectl -n argocd get application demo-iris-2
NAME SYNC STATUS HEALTH STATUS
demo-iris-2 OutOfSync Degraded
$ kubectl -n argocd describe application demo-iris-2
...
Message: Internal error occurred: failed calling webhook
"inferenceservice.kserve-webhook-server.validator.webhook":
no endpoints available for service "kserve-webhook-server-service"
Root Cause
This was the kube-rbac-proxy ImagePullBackOff failure from earlier — reappearing after a cluster restart. The fix had been applied with kubectl patch directly, not committed to Git. ArgoCD's selfHeal: true reverted it on the next sync cycle. The cluster restart exposed that the fix was never persisted.
Fix
# Verify the patch is in kustomization.yaml
cat infrastructure/serving-stack/kustomization.yaml | grep -A2 patches
# Commit and push
git add infrastructure/serving-stack/
git commit -m "serving-stack: persist kube-rbac-proxy removal patch"
git push origin main
ArgoCD picked up the change within 3 minutes.
Lesson
Any fix applied with kubectl directly in a GitOps-managed cluster is temporary. The next sync cycle will revert it. Every fix must be committed to Git to survive. The PR-merged-but-nothing-deployed experience is the worst possible failure for a Golden Path demo — the developer did everything correctly and the platform failed silently.
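The persisted patch referenced in the fix above can take the shape of a kustomize JSON6902 patch. This is a sketch: the target Deployment name, the resource file, and the container index are assumptions about the KServe install, not values confirmed by the article.

```yaml
# infrastructure/serving-stack/kustomization.yaml (sketch; target details assumed)
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - kserve.yaml
patches:
  - target:
      kind: Deployment
      name: kserve-controller-manager
      namespace: kserve
    patch: |-
      # drop the kube-rbac-proxy sidecar that was stuck in ImagePullBackOff
      - op: remove
        path: /spec/template/spec/containers/1
```

Because the patch lives in Git, ArgoCD re-applies it on every sync instead of reverting it.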
Failure 7: Inference Endpoint Returns HTTP 307 Redirect — Traefik Intercepts Before Kourier
Time lost: 45 minutes
Symptom
After demo-iris-2 became Ready=True, the inference test returned an unexpected redirect:
$ curl -v \
-H 'Content-Type: application/json' \
-d '{"instances":[[6.8,2.8,4.8,1.4]]}' \
http://172.20.0.3/v1/models/demo-iris-2:predict
< HTTP/1.1 307 Temporary Redirect
< Location: https://172.20.0.3/v1/models/demo-iris-2:predict
Root Cause
k3d's built-in Traefik ingress was intercepting the request and applying an HTTP-to-HTTPS redirect before it reached Kourier. The request never reached the Knative routing layer at all.
Fix
Use direct pod port-forward for canonical local verification, bypassing Traefik and Kourier entirely:
# Find predictor pod
kubectl -n default get pods \
-l serving.knative.dev/revision=demo-iris-2-predictor-00001 \
-o jsonpath='{.items[0].metadata.name}'
# Port-forward directly to the pod
kubectl -n default port-forward \
pod/demo-iris-2-predictor-00001-deployment-<hash> 18080:8080
# Predict
curl -sS \
-H "Content-Type: application/json" \
-d '{"instances":[[6.8,2.8,4.8,1.4],[6.0,3.4,4.5,1.6]]}' \
http://127.0.0.1:18080/v1/models/demo-iris-2:predict
{"predictions":[1,1]}
Lesson
A healthy inference endpoint can look completely broken if your test path hits an unexpected intermediary. For local k3d clusters, disable Traefik at cluster creation:
k3d cluster create neuroscale \
--k3s-arg "--disable=traefik@server:0"
Failure 8: Catalog Ingestion Silently Rejects Template After Values Update
Time lost: 20 minutes
Symptom
After updating values.yaml and rolling out a new Backstage deployment, the template disappeared from /create again — the same symptom as Failure 1, but after it had been working.
Root Cause
The rolling update caused a brief period where the new pod was starting and the old pod was terminating. During this window, the catalog re-ingested all locations. The updated values.yaml had a YAML indentation error in the catalog.locations block, which caused the allow rule for Template to be silently dropped during parsing.
Fix
# Check catalog ingestion in the new pod logs immediately after rollout
kubectl -n backstage logs deploy/neuroscale-backstage --tail=100 | \
grep -i "warn\|error\|fail\|forbidden"
Fixed the YAML indentation:
# Correct indentation
catalog:
  locations:
    - type: url
      target: https://github.com/...
      rules:
        - allow: [Template]   # must be under rules:, not misaligned
Lesson
YAML indentation errors in Backstage config values are never surfaced as errors — the field is simply ignored. After every Backstage rollout that touches appConfig, immediately verify catalog ingestion by checking server logs and confirming the template appears in /create.
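A cheap pre-rollout guard is to check that the allow rule survived the edit before restarting the deployment. This is a deliberately crude grep-based sketch (it assumes the rule sits on the line directly after rules:, as in this repo's values file):

```shell
# Return 0 only if an "allow: [Template]" rule appears directly under a rules: block.
template_rule_present() {
  grep -A1 'rules:' "$1" 2>/dev/null | grep -q 'allow: \[Template\]'
}

template_rule_present infrastructure/backstage/values.yaml \
  && echo "Template allow rule present" \
  || echo "WARNING: Template allow rule missing; /create will be empty"
```

A schema-aware check (parsing the YAML and walking catalog.locations) would be more robust, but even this catches the misalignment that caused this failure.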
Failure 9: Scaffolder Task Hangs Then Fails — Port-Forward Session Died Mid-Task
Time lost: 15 minutes
Symptom
The scaffolder task started successfully, progress spinner ran for 60 seconds, then failed with a network error. The Backstage UI showed the task as failed with no specific error message. A second attempt worked immediately.
Root Cause
The kubectl port-forward session for Backstage had silently died between opening the browser and submitting the scaffolder form. The React app was loaded from cache — so the page appeared fully functional — but all API calls were failing because the backend was unreachable. The scaffolder task started, sent the first API call, and failed on the network layer.
Fix
# Before running any Backstage scaffolder task, verify the port-forward is alive
curl -s http://localhost:7010/api/catalog/entities?limit=1 | head -c 100
# If it returns nothing or errors, restart the port-forward
kubectl -n backstage port-forward svc/neuroscale-backstage 7010:7007
Better: use scripts/port-forward-all.sh from the repository, which starts all required port-forwards as background processes with clean shutdown handling.
Lesson
A React app loaded from browser cache looks fully functional even when the backend is unreachable. Always verify the backend API is responding before running a scaffolder task, not just that the UI loaded.
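The pre-flight check can be wrapped into a small guard; service name and ports below are the ones used throughout this article:

```shell
# Return 0 only if the Backstage backend answers through the port-forward.
backend_alive() {
  curl -sf --max-time 3 "$1/api/catalog/entities?limit=1" >/dev/null
}

if ! backend_alive http://localhost:7010; then
  echo "port-forward dead, restarting"
  kubectl -n backstage port-forward svc/neuroscale-backstage 7010:7007 \
    >/dev/null 2>&1 &
fi
```

Running this before submitting a scaffolder form turns a confusing mid-task failure into an immediate, actionable message.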
What the Golden Path Actually Proves After Nine Failures
Final working state:
$ kubectl -n default get inferenceservice demo-iris-2
NAME URL READY AGE
demo-iris-2 http://demo-iris-2.default.example.com True 25m
$ curl -sS \
-H "Content-Type: application/json" \
-d '{"instances":[[6.8,2.8,4.8,1.4],[6.0,3.4,4.5,1.6]]}' \
http://127.0.0.1:18080/v1/models/demo-iris-2:predict
{"predictions":[1,1]}
The Golden Path demo is a chain of seven moving parts: Backstage config, GitHub auth, ArgoCD app-of-apps, KServe controller, Knative routing, Kourier gateway, and the predictor pod. In production, any link in that chain can fail independently.
The debugging process for these nine failures is a direct map to what a platform SRE does on an on-call shift.
Debugging Commands Reference
# Backstage catalog ingestion errors
kubectl -n backstage logs deploy/neuroscale-backstage | \
grep -i "warn\|error\|fail\|forbidden"
# Backstage runtime config
kubectl -n backstage describe configmap neuroscale-backstage-app-config
# Verify GitHub token is present (check length only)
kubectl -n backstage exec deploy/neuroscale-backstage -- \
sh -c 'echo ${#GITHUB_TOKEN} chars'
# ArgoCD child app sync status
kubectl -n argocd get applications
kubectl -n argocd describe application demo-iris-2
# InferenceService conditions
kubectl -n default describe inferenceservice demo-iris-2
# Admission webhook endpoints
kubectl -n kserve get endpoints kserve-webhook-server-service
The Pattern Across All Nine Failures
Looking back at the nine failures, they fall into three categories:
Silent failures (no UI error, log only):
Failures 1, 2, 8 — catalog ingestion rejections and auth failures that show nothing in the UI. Rule: always check server logs, not just the browser.
Configuration hierarchy failures:
Failures 3, 4 — missing required keys and wrong Helm nesting. Rule: validate rendered manifests in CI before applying them.
State and dependency failures:
Failures 5, 6, 7, 9 — stale secrets, unreversioned fixes, intercepting proxies, dead sessions. Rule: verify the complete dependency chain before debugging the thing that appears broken.
See Also
- infrastructure/INCIDENT_BACKSTAGE_CRASHLOOP_RCA.md — full 12-section RCA for Failure 4
- docs/REALITY_CHECK_MILESTONE_3_GOLDEN_PATH.md — complete implementation record
- infrastructure/backstage/values.yaml — working dev Backstage configuration
- infrastructure/backstage/values-prod.yaml — production profile with GitHub OAuth
- scripts/smoke-test.sh — automated end-to-end verification
Jimoh Sodiq Bolaji | Platform Engineer | Technical Content Engineer | Abuja, Nigeria
NeuroScale Platform · Dev.to