A production failure log from implementing a Backstage Golden Path for KServe model deployments — nine distinct failures with exact error output, root causes, and fixes.
If you have ever deployed Backstage and stared at a blank /create page wondering what went wrong, this article is for you.
Most Backstage tutorials end at "the portal is running." This one starts there.
This is a complete production failure log from implementing a Backstage Golden Path that deploys KServe model inference endpoints on Kubernetes. Nine distinct failures. Every one with exact error output, root cause, and the fix that worked.
The goal was simple: a developer fills a Backstage form, a GitHub PR opens, the PR merges, ArgoCD deploys a KServe InferenceService, and the endpoint responds to predictions.
Getting there took nine failures across three days.
This is part three of a series on building a production-hardened AI inference platform:
- Part 1: Why Your KServe InferenceService Won't Become Ready
- Part 2: 5 GitOps Failure Modes That Break KServe Deployments
Project repo: github.com/sodiq-code/neuroscale-platform
What I was trying to build
The Golden Path demo contract:
Backstage form → PR opened → merge → ArgoCD sync → InferenceService Ready=True → curl returns {"predictions":[1,1]}
Stack:
- Backstage (Helm chart, self-hosted on k3d)
- ArgoCD (GitOps reconciliation)
- KServe (model inference endpoints)
- GitHub (scaffolder target)
- Kyverno (admission policies)
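For context, the scaffolder template that drives the form follows the standard scaffolder.backstage.io/v1beta3 shape. The sketch below is illustrative only: the parameter names, skeleton path, and repoUrl values are assumptions for this walkthrough, not the repo's actual template file.

```yaml
# backstage/templates/model-endpoint/template.yaml (illustrative sketch)
apiVersion: scaffolder.backstage.io/v1beta3
kind: Template
metadata:
  name: model-endpoint
  title: KServe Model Endpoint
spec:
  parameters:
    - title: Model details
      required: [name]
      properties:
        name:
          type: string
          description: InferenceService name
  steps:
    - id: fetch
      action: fetch:template
      input:
        url: ./skeleton            # assumed skeleton location
        values:
          name: ${{ parameters.name }}
    - id: pr
      action: publish:github:pull-request
      input:
        repoUrl: github.com?repo=neuroscale-platform&owner=sodiq-code
        branchName: add-${{ parameters.name }}
        title: Add InferenceService ${{ parameters.name }}
```

The publish:github:pull-request step is what produces the "Open pull request" behavior discussed in Failure 5.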
Failure 1: Template Not Visible in Catalog — Silent Rejection With No UI Error
Time lost: 30 minutes
Symptom
After adding the template file and registering it in infrastructure/backstage/values.yaml, the template did not appear in Backstage's /create page. No error was visible in the UI. The page simply showed an empty catalog.
Digging In
$ kubectl -n backstage logs deploy/neuroscale-backstage --tail=50
...
[backstage] warn Failed to process location
{"location":{"type":"url","target":"https://github.com/sodiq-code/
neuroscale-platform/blob/main/backstage/templates/model-endpoint/
template.yaml"},
"error":"NotAllowedError: Forbidden: entity of kind Template
is not allowed from that location"}
The error only appears in server logs. The UI shows nothing.
Root Cause
Backstage's catalog configuration allows only specific entity kinds from each registered location. The default allow list for repository-based locations does not include Template. Without an explicit allow: [Template] rule, entities of kind Template are silently rejected. This is security-by-default behavior — but the complete silence in the UI makes it look like a misconfiguration rather than a permission issue.
Fix
# infrastructure/backstage/values.yaml
backstage:
  backstage:
    appConfig:
      catalog:
        locations:
          - type: url
            target: https://github.com/sodiq-code/neuroscale-platform/blob/main/backstage/templates/model-endpoint/template.yaml
            rules:
              - allow: [Template]
After rolling out the updated Backstage deployment:
$ kubectl -n backstage rollout restart deploy/neuroscale-backstage
$ kubectl -n backstage rollout status deploy/neuroscale-backstage --timeout=300s
deployment "neuroscale-backstage" successfully rolled out
The template appeared in /create within 60 seconds.
Lesson
For a platform team deploying Backstage for internal users, this silent failure means developers see an empty template catalog and assume the platform is broken — not that a config rule is missing. Always check server logs, not just the UI, when Backstage catalog ingestion seems to fail.
Failure 2: Scaffolder /create Page Loads Blank — 401 on Actions API
Time lost: 45 minutes
Symptom
After the template was visible, clicking into it showed a blank form. The browser developer console revealed:
GET /api/scaffolder/v2/actions HTTP/1.1 401 Unauthorized
{"error":{"name":"AuthenticationError","message":"Missing credentials"}}
The page route returned HTTP 200 — the React app loaded — but the actions API returned 401, so the form had no data to render.
Root Cause
Backstage's new backend architecture (introduced in 1.x) adds an internal authentication policy requiring all service-to-service calls to include a valid Backstage token. The scaffolder frontend makes an internal API call to list available actions. Because no auth provider was configured for local development, this internal call was rejected. This is a breaking change from older Backstage versions where the actions endpoint was unauthenticated.
Fix
# infrastructure/backstage/values.yaml
backstage:
  backstage:
    appConfig:
      backend:
        auth:
          dangerouslyDisableDefaultAuthPolicy: true
Production note:
dangerouslyDisableDefaultAuthPolicy: true is acceptable for local development only. For production, configure GitHub OAuth via values-prod.yaml with a proper sign-in policy. The production profile uses auth.providers.guest.dangerouslyAllowOutsideDevelopment: true instead — which keeps the auth subsystem active and provides a real user:default/guest identity, rather than disabling auth entirely.
Lesson
An empty scaffolder form is indistinguishable from a misconfigured form to an end user. The 401 error is only visible in browser developer tools. This is the second failure in this series that generated zero visible error in the UI.
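One way to surface this class of failure without opening browser developer tools is to probe the actions endpoint directly from the CLI. A minimal sketch, assuming the port-forward on 7010 used throughout this article:

```shell
# Print the HTTP status of the scaffolder actions endpoint.
# "000" means the backend is unreachable; 401 means the form will render blank.
actions_status() {
  curl -s -o /dev/null -w '%{http_code}' --max-time 3 \
    "$1/api/scaffolder/v2/actions"
}

actions_status http://localhost:7010; echo
```

A 200 here means the scaffolder can list its actions and the form should populate.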
Failure 3: Frontend Crashes With Blank White Screen — Missing Required Config Key
Time lost: 20 minutes
Symptom
After the auth policy fix, reloading Backstage showed a blank white screen. The browser console:
Uncaught Error: Missing required config value at 'app.title' in 'app'
at validateConfigSchema (config.esm.js:234)
at BackstageApp.render (app.esm.js:891)
Root Cause
The Backstage frontend requires app.title to be present in the runtime configuration. This key was absent from the appConfig section of values.yaml, so the React application crashed on initialization before any content could render. The key is required on first boot, but the documentation does not flag it prominently as such.
Fix
# infrastructure/backstage/values.yaml
backstage:
  backstage:
    appConfig:
      app:
        title: NeuroScale Platform
        baseUrl: http://localhost:7010
      backend:
        baseUrl: http://localhost:7010
        cors:
          origin: http://localhost:7010
Note: app.baseUrl and backend.baseUrl also needed to match the port used for port-forwarding (7010), not the default 7007.
Lesson
A blank white screen with no network errors means the JavaScript runtime crashed before rendering. Always check the browser console — not just network requests — for Backstage frontend failures.
Failure 4: Backstage CrashLoopBackOff — Helm Dependency Values Mis-Nesting
Time lost: 2 hours | Impact: Developer portal completely unavailable
Symptom
$ kubectl get pods -n backstage -w
NAME READY STATUS RESTARTS
neuroscale-backstage-7d9f5b8c4-xqr2m 0/1 CrashLoopBackOff 8 12m
$ kubectl describe pod neuroscale-backstage-7d9f5b8c4-xqr2m -n backstage
...
Events:
Warning Unhealthy 30s kubelet
Startup probe failed: connect: connection refused
Root Cause
The Backstage Helm chart is a wrapper chart with backstage as a dependency. Configuration for the Backstage container itself must be nested under backstage.backstage.*, not backstage.*. The misconfiguration meant probe settings and resource requests were silently ignored, so Kubernetes used default probe timings — a 2-second initial delay — that were far too aggressive for Backstage's ~90-second startup time.
The pod was killed before it could become healthy, triggering CrashLoopBackOff.
Backstage requires:
startupProbe:
  initialDelaySeconds: 120
  failureThreshold: 30
The default gives it 2 seconds.
Fix
Correct the values hierarchy and harden probe timings:
# infrastructure/backstage/values.yaml
backstage:
  backstage:            # <-- must be nested here, not at backstage.*
    appConfig:
      ...
    startupProbe:
      initialDelaySeconds: 120
      failureThreshold: 30
    readinessProbe:
      initialDelaySeconds: 120
    livenessProbe:
      initialDelaySeconds: 300
    resources:
      requests:
        cpu: 100m
        memory: 512Mi
Lesson
If a Helm chart is a wrapper with a dependency, configuration for the dependency must be nested under the dependency's alias key. Values placed at the wrong hierarchy level are silently ignored — Kubernetes uses chart defaults, not your overrides. This incident directly motivated adding CI validation for rendered Helm manifests: if the final Deployment spec had been checked in CI, the wrong probe values would have been caught before deployment. Full RCA: infrastructure/INCIDENT_BACKSTAGE_CRASHLOOP_RCA.md
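The CI check this incident motivated can be sketched as a render-and-grep gate. The chart path and release name below are assumptions about this repo's layout, not confirmed values:

```shell
# Fail CI if the startup probe override never made it into the rendered manifest,
# which is exactly what mis-nested wrapper-chart values produce.
probe_override_present() {
  # $1: rendered manifest text
  printf '%s\n' "$1" | grep -q 'initialDelaySeconds: 120'
}

rendered=$(helm template neuroscale-backstage ./infrastructure/backstage 2>/dev/null || true)
probe_override_present "$rendered" \
  || echo "probe override missing from rendered manifest (or helm unavailable)"
```

The point is to assert against what Kubernetes will actually receive, not against the values file you edited.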
Failure 5: PR Creation Fails — GitHub Token Secret Contains Placeholder Value
Time lost: 30 minutes
Symptom
After the portal was stable, the scaffolder's "Open pull request" step spun for 30 seconds then failed:
Error: Request failed with status 401: Bad credentials
No PR was created in GitHub.
Root Cause
The Kubernetes Secret neuroscale-backstage-secrets contained a placeholder GITHUB_TOKEN value — literally <YOUR_TOKEN_HERE>. The environment variable was present, satisfying kubectl describe secret output, but the value was not a valid token.
A secondary issue: after updating the secret with the correct token, the running pod did not pick up the change. Environment variables from Secrets are injected at pod start time, not dynamically. The pod needed a restart.
Fix
# Update the secret with a valid token
read -s GITHUB_TOKEN
kubectl -n backstage create secret generic neuroscale-backstage-secrets \
--from-literal=GITHUB_TOKEN="$GITHUB_TOKEN" \
--dry-run=client -o yaml | kubectl apply -f -
# Restart to reload env vars
kubectl -n backstage rollout restart deploy/neuroscale-backstage
kubectl -n backstage rollout status deploy/neuroscale-backstage --timeout=300s
# Verify token is present — check length, never the value
kubectl -n backstage exec deploy/neuroscale-backstage -- \
sh -c 'echo ${#GITHUB_TOKEN} chars'
Lesson
kubectl describe secret shows the key exists and has bytes. It does not show whether the value is a valid token or a placeholder string. Always verify token presence by checking character length in the running container, never by reading the secret value directly.
Failure 6: PR Merged But ArgoCD Stays OutOfSync — Fix Not Committed to Git
Time lost: 1 hour of confusion
Symptom
The Backstage scaffolder created the PR correctly. CI passed. The PR was merged. ArgoCD detected the new application. But the child app immediately showed OutOfSync/Degraded:
$ kubectl -n argocd get application demo-iris-2
NAME SYNC STATUS HEALTH STATUS
demo-iris-2 OutOfSync Degraded
$ kubectl -n argocd describe application demo-iris-2
...
Message: Internal error occurred: failed calling webhook
"inferenceservice.kserve-webhook-server.validator.webhook":
no endpoints available for service "kserve-webhook-server-service"
Root Cause
This was the kube-rbac-proxy ImagePullBackOff failure from earlier — reappearing after a cluster restart. The fix had been applied with kubectl patch directly, not committed to Git. ArgoCD's selfHeal: true reverted it on the next sync cycle. The cluster restart exposed that the fix was never persisted.
Fix
# Verify the patch is in kustomization.yaml
cat infrastructure/serving-stack/kustomization.yaml | grep -A2 patches
# Commit and push
git add infrastructure/serving-stack/
git commit -m "serving-stack: persist kube-rbac-proxy removal patch"
git push origin main
ArgoCD picked up the change within 3 minutes.
Lesson
Any fix applied with kubectl directly in a GitOps-managed cluster is temporary. The next sync cycle will revert it. Every fix must be committed to Git to survive. The PR-merged-but-nothing-deployed experience is the worst possible failure for a Golden Path demo — the developer did everything correctly and the platform failed silently.
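The persisted patch referenced in the fix above can take the shape of a kustomize JSON6902 patch. This is a sketch: the target Deployment name, the resource file, and the container index are assumptions about the KServe install, not values confirmed by the article.

```yaml
# infrastructure/serving-stack/kustomization.yaml (sketch; target details assumed)
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - kserve.yaml
patches:
  - target:
      kind: Deployment
      name: kserve-controller-manager
      namespace: kserve
    patch: |-
      # drop the kube-rbac-proxy sidecar that was stuck in ImagePullBackOff
      - op: remove
        path: /spec/template/spec/containers/1
```

Because the patch lives in Git, ArgoCD re-applies it on every sync instead of reverting it.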
Failure 7: Inference Endpoint Returns HTTP 307 Redirect — Traefik Intercepts Before Kourier
Time lost: 45 minutes
Symptom
After demo-iris-2 became Ready=True, the inference test returned an unexpected redirect:
$ curl -v \
-H 'Content-Type: application/json' \
-d '{"instances":[[6.8,2.8,4.8,1.4]]}' \
http://172.20.0.3/v1/models/demo-iris-2:predict
< HTTP/1.1 307 Temporary Redirect
< Location: https://172.20.0.3/v1/models/demo-iris-2:predict
Root Cause
k3d's built-in Traefik ingress was intercepting the request and applying an HTTP-to-HTTPS redirect before it reached Kourier. The request never reached the Knative routing layer at all.
Fix
Use direct pod port-forward for canonical local verification, bypassing Traefik and Kourier entirely:
# Find predictor pod
kubectl -n default get pods \
-l serving.knative.dev/revision=demo-iris-2-predictor-00001 \
-o jsonpath='{.items[0].metadata.name}'
# Port-forward directly to the pod
kubectl -n default port-forward \
pod/demo-iris-2-predictor-00001-deployment-<hash> 18080:8080
# Predict
curl -sS \
-H "Content-Type: application/json" \
-d '{"instances":[[6.8,2.8,4.8,1.4],[6.0,3.4,4.5,1.6]]}' \
http://127.0.0.1:18080/v1/models/demo-iris-2:predict
{"predictions":[1,1]}
Lesson
A healthy inference endpoint can look completely broken if your test path hits an unexpected intermediary. For local k3d clusters, disable Traefik at cluster creation:
k3d cluster create neuroscale \
--k3s-arg "--disable=traefik@server:0"
Failure 8: Catalog Ingestion Silently Rejects Template After Values Update
Time lost: 20 minutes
Symptom
After updating values.yaml and rolling out a new Backstage deployment, the template disappeared from /create again — the same symptom as Failure 1, but after it had been working.
Root Cause
The rolling update caused a brief period where the new pod was starting and the old pod was terminating. During this window, the catalog re-ingested all locations. The updated values.yaml had a YAML indentation error in the catalog.locations block, which caused the allow rule for Template to be silently dropped during parsing.
Fix
# Check catalog ingestion in the new pod logs immediately after rollout
kubectl -n backstage logs deploy/neuroscale-backstage --tail=100 | \
grep -i "warn\|error\|fail\|forbidden"
Fixed the YAML indentation:
# Correct indentation
catalog:
  locations:
    - type: url
      target: https://github.com/...
      rules:
        - allow: [Template]   # must be under rules:, not misaligned
Lesson
YAML indentation errors in Backstage config values are never surfaced as errors — the field is simply ignored. After every Backstage rollout that touches appConfig, immediately verify catalog ingestion by checking server logs and confirming the template appears in /create.
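A cheap pre-rollout guard is to check that the allow rule survived the edit before restarting the deployment. This is a deliberately crude grep-based sketch (it assumes the rule sits on the line directly after rules:, as in this repo's values file):

```shell
# Return 0 only if an "allow: [Template]" rule appears directly under a rules: block.
template_rule_present() {
  grep -A1 'rules:' "$1" 2>/dev/null | grep -q 'allow: \[Template\]'
}

template_rule_present infrastructure/backstage/values.yaml \
  && echo "Template allow rule present" \
  || echo "WARNING: Template allow rule missing; /create will be empty"
```

A schema-aware check (parsing the YAML and walking catalog.locations) would be more robust, but even this catches the misalignment that caused this failure.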
Failure 9: Scaffolder Task Hangs Then Fails — Port-Forward Session Died Mid-Task
Time lost: 15 minutes
Symptom
The scaffolder task started successfully, progress spinner ran for 60 seconds, then failed with a network error. The Backstage UI showed the task as failed with no specific error message. A second attempt worked immediately.
Root Cause
The kubectl port-forward session for Backstage had silently died between opening the browser and submitting the scaffolder form. The React app was loaded from cache — so the page appeared fully functional — but all API calls were failing because the backend was unreachable. The scaffolder task started, sent the first API call, and failed on the network layer.
Fix
# Before running any Backstage scaffolder task, verify the port-forward is alive
curl -s http://localhost:7010/api/catalog/entities?limit=1 | head -c 100
# If it returns nothing or errors, restart the port-forward
kubectl -n backstage port-forward svc/neuroscale-backstage 7010:7007
Better: use scripts/port-forward-all.sh from the repository, which starts all required port-forwards as background processes with clean shutdown handling.
Lesson
A React app loaded from browser cache looks fully functional even when the backend is unreachable. Always verify the backend API is responding before running a scaffolder task, not just that the UI loaded.
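The pre-flight check can be wrapped into a small guard; service name and ports below are the ones used throughout this article:

```shell
# Return 0 only if the Backstage backend answers through the port-forward.
backend_alive() {
  curl -sf --max-time 3 "$1/api/catalog/entities?limit=1" >/dev/null
}

if ! backend_alive http://localhost:7010; then
  echo "port-forward dead, restarting"
  kubectl -n backstage port-forward svc/neuroscale-backstage 7010:7007 \
    >/dev/null 2>&1 &
fi
```

Running this before submitting a scaffolder form turns a confusing mid-task failure into an immediate, actionable message.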
What the Golden Path Actually Proves After Nine Failures
Final working state:
$ kubectl -n default get inferenceservice demo-iris-2
NAME URL READY AGE
demo-iris-2 http://demo-iris-2.default.example.com True 25m
$ curl -sS \
-H "Content-Type: application/json" \
-d '{"instances":[[6.8,2.8,4.8,1.4],[6.0,3.4,4.5,1.6]]}' \
http://127.0.0.1:18080/v1/models/demo-iris-2:predict
{"predictions":[1,1]}
The Golden Path demo is a chain of seven moving parts: Backstage config, GitHub auth, ArgoCD app-of-apps, KServe controller, Knative routing, Kourier gateway, and the predictor pod. In production, any link in that chain can fail independently.
The debugging process for these nine failures is a direct map to what a platform SRE does on an on-call shift.
Debugging Commands Reference
# Backstage catalog ingestion errors
kubectl -n backstage logs deploy/neuroscale-backstage | \
grep -i "warn\|error\|fail\|forbidden"
# Backstage runtime config
kubectl -n backstage describe configmap neuroscale-backstage-app-config
# Verify GitHub token is present (check length only)
kubectl -n backstage exec deploy/neuroscale-backstage -- \
sh -c 'echo ${#GITHUB_TOKEN} chars'
# ArgoCD child app sync status
kubectl -n argocd get applications
kubectl -n argocd describe application demo-iris-2
# InferenceService conditions
kubectl -n default describe inferenceservice demo-iris-2
# Admission webhook endpoints
kubectl -n kserve get endpoints kserve-webhook-server-service
The Pattern Across All Nine Failures
Looking back at the nine failures, they fall into three categories:
Silent failures (no UI error, log only):
Failures 1, 2, 8 — catalog ingestion rejections and auth failures that show nothing in the UI. Rule: always check server logs, not just the browser.
Configuration hierarchy failures:
Failures 3, 4 — missing required keys and wrong Helm nesting. Rule: validate rendered manifests in CI before applying them.
State and dependency failures:
Failures 5, 6, 7, 9 — stale secrets, unreversioned fixes, intercepting proxies, dead sessions. Rule: verify the complete dependency chain before debugging the thing that appears broken.
See Also
- infrastructure/INCIDENT_BACKSTAGE_CRASHLOOP_RCA.md — full 12-section RCA for Failure 4
- docs/REALITY_CHECK_MILESTONE_3_GOLDEN_PATH.md — complete implementation record
- infrastructure/backstage/values.yaml — working dev Backstage configuration
- infrastructure/backstage/values-prod.yaml — production profile with GitHub OAuth
- scripts/smoke-test.sh — automated end-to-end verification
Jimoh Sodiq Bolaji | Platform Engineer | Technical Content Engineer | Abuja, Nigeria
NeuroScale Platform · Dev.to