We audited a 29-service Kotlin/Spring + Go + React platform and found 14 issues; fixes for 8 of them are already in our release branch on the way to QA. Here are those 8, with file paths, code snippets, and exact fixes. The pattern, spelled out at the end: the dev environment masks the problem, the healthcheck lies, libraries fall back silently to defaults, and config drifts between two places.
If you're going to grep your codebase after reading this, the four greps to start with are at the bottom.
1. ML weights missing from the Docker image
Symptom: prediction endpoint returns 500 in k8s. Locally fine. CI passes.
Root cause chain:
ml-api-service/.gitignore:58 → models/ # excluded
ml-api-service/.dockerignore → missing
ml-api-service/Dockerfile:35 → COPY models/ ./models/
docker-compose.yml:976 → ./ml-api-service/models:/app/models (bind mount)
models/ is in .gitignore, so the weights never reach the repo, and there is no .dockerignore to say what the image actually needs. In CI the build context has no model files, COPY models/ brings an empty directory into the image, and the bind mount in docker-compose hides all of this locally.
The lying healthcheck:
@app.get("/ready")
async def ready():
return {"status": "ok"} # TODO: real check
Fix: explicit .dockerignore, track weights in git, real readiness probe:
@app.get("/ready")
async def ready():
model_dir = Path("/app/models")
if not any(model_dir.glob("*.joblib")) and not any(model_dir.glob("*.onnx")):
raise HTTPException(status_code=503, detail="No models loaded")
return {"models_loaded": len(list(model_dir.iterdir()))}
Switch HEALTHCHECK in Dockerfile to /ready.
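A sketch of that Dockerfile change, assuming the service listens on port 8000 and curl is present in the image:
HEALTHCHECK --interval=30s --timeout=5s --start-period=30s --retries=3 \
  CMD curl -fsS http://localhost:8000/ready || exit 1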
2. MapStruct + Kotlin is-prefix → silent false
Symptom: five mappers return false for every is-prefixed field (isActive, isDemoData, isIgMember, isSubscriptionActive), regardless of DB state.
Root cause: Kotlin compiles the isActive property into a getter named isActive() (no get prefix). MapStruct introspects that getter and reads the property name as active. The DTO constructor expects isActive. The names don't match, so MapStruct silently falls back to the default value. Build stays green, no warning.
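For illustration, a hypothetical entity/DTO pair with the shape that triggers it (names and fields invented):
import java.util.UUID

// Kotlin exposes the getter of an is-prefixed property as isActive(), no "get" prefix,
// so JavaBeans-style introspection reports the source property as "active".
class User(
    val id: UUID,
    val isActive: Boolean
)

// The DTO constructor parameter is named "isActive"; no source property with that name
// is found, so the mapper silently leaves the target at its default.
data class UserDto(
    val id: UUID,
    val isActive: Boolean = false
)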
Fix per field:
@Mapper(componentModel = "spring")
abstract class UserMapper {

    @Mapping(
        target = "isActive",
        expression = "java(entity.isActive())"
    )
    @Mapping(
        target = "isDemoData",
        expression = "java(entity.isDemoData())"
    )
    abstract fun toDto(entity: User): UserDto
}
Permanent fix, in build.gradle.kts:
kapt {
    arguments {
        arg("mapstruct.unmappedTargetPolicy", "ERROR")
    }
}
The annotation processor now fails the build on the next mismatch instead of swallowing it.
3. Edge-agent default config: vessel-001 and insecure_skip_verify: true
Symptom: none yet - caught before first prod deploy.
The bad default:
# edge-agent/config.yaml
vessel_id: "vessel-001"
mqtt_broker: "tcp://mosquitto:1883"
cloud_url: "http://gateway:8080"
mongodb_uri: "mongodb://mongodb:27017"
tls:
  insecure_skip_verify: true
Fix: split into two files: config.yaml (production template, empty placeholders, TLS strict) and config.demo.yaml (dev values). Refuse to start on placeholder or insecure values:
func validateConfig(cfg *Config) error {
	if cfg.VesselID == "" {
		return errors.New("vessel_id is required")
	}
	if matched, _ := regexp.MatchString(`^vessel-\d+$`, cfg.VesselID); matched {
		return errors.New("vessel_id matches placeholder pattern, refusing to start")
	}
	if cfg.TLS.InsecureSkipVerify && !cfg.AllowInsecureFromCLI {
		return errors.New("insecure_skip_verify requires --allow-insecure-tls")
	}
	return nil
}
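A sketch of what the production template could look like; the keys mirror the demo file above, but the exact names depend on the agent's config struct:
# edge-agent/config.yaml (production template, filled in at deploy time)
vessel_id: ""
mqtt_broker: ""        # e.g. tls://broker.example.com:8883
cloud_url: ""
mongodb_uri: ""
tls:
  insecure_skip_verify: false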
4. Three different env-variable prefixes
Symptom: env-based configuration silently ignored. Found when somebody overrode MQTT_BROKER and the agent ignored it.
The drift:
// edge-agent/internal/config/config.go
viper.AutomaticEnv()
viper.SetEnvPrefix("AGENT") // → AGENT_VESSEL_ID
// edge-agent/cmd/edge-agent/cli/root.go
viper.BindEnv("vessel-id", "EDGE_VESSEL_ID") // → EDGE_VESSEL_ID
# docker-compose.yml
edge-agent:
  environment:
    EDGE_AGENT_VESSEL_ID: "v-042" # → EDGE_AGENT_VESSEL_ID
Fix: canonical prefix EDGE_AGENT_*, explicit binds for backward compat, integration test:
viper.BindEnv("vessel_id",
"EDGE_AGENT_VESSEL_ID", // canonical
"AGENT_VESSEL_ID", // legacy
"EDGE_VESSEL_ID", // legacy
)
func TestEnvOverridesConfig(t *testing.T) {
	t.Setenv("EDGE_AGENT_VESSEL_ID", "test-vessel-99")
	cfg := loadConfig()
	if cfg.VesselID != "test-vessel-99" {
		t.Fatalf("env override ignored: got %q", cfg.VesselID)
	}
}
5. One hypertable in two Postgres schemas
Symptom: dashboards show stale data. Logs clean. Continuous aggregates maintaining empty views.
The double-create:
-- docker/init-db/01_create_databases.sql
CREATE TABLE public.telemetry_values (...);
SELECT create_hypertable('public.telemetry_values', 'time_measured');
-- ingest-service/db/changelog/001-init-schema.sql
CREATE TABLE ingest.telemetry_values (...);
SELECT create_hypertable('ingest.telemetry_values', 'time_measured');
ingest writes to ingest.telemetry_values. The view unified_vessel_positions (created by init script) reads public.telemetry_values.
Fix: delete the public-schema table from the init script. Single source of truth = the service's own migrations.
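For an environment that already has both tables, a cleanup sketch; the view's column list here is invented, so carry over whatever the original init script selected, and check what CASCADE will take with it before running anything like this:
-- Drop the orphaned public-schema copy (and anything still depending on it)
DROP TABLE IF EXISTS public.telemetry_values CASCADE;

-- Recreate the view against the table the ingest service actually writes to
CREATE OR REPLACE VIEW public.unified_vessel_positions AS
SELECT vessel_id, time_measured, latitude, longitude   -- illustrative columns
FROM ingest.telemetry_values;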
6. WebClient calls with no timeouts, no circuit breakers
Symptom: when one service crash-loops, every caller accumulates in-flight requests that will never return and drains its connection pool. Cascade.
The default:
// passage-service-client.kt
@Bean
fun passageWebClient() = WebClient.create("http://passage-service:8080")
// no responseTimeout, no connectTimeout
Fix:
@Bean
fun passageWebClient(): WebClient {
    val httpClient = HttpClient.create()
        .responseTimeout(Duration.ofSeconds(5))
        .option(ChannelOption.CONNECT_TIMEOUT_MILLIS, 2000)
    return WebClient.builder()
        .baseUrl("http://passage-service:8080")
        .clientConnector(ReactorClientHttpConnector(httpClient))
        .build()
}
Plus Resilience4j circuit breaker on critical paths:
@CircuitBreaker(name = "passage-service", fallbackMethod = "fallbackVoyagePlan")
suspend fun getVoyagePlan(vesselId: UUID): VoyagePlan {
    return passageWebClient.get()
        .uri("/voyage-plans/{vesselId}", vesselId)
        .retrieve()
        .awaitBody()
}

private fun fallbackVoyagePlan(vesselId: UUID, ex: Throwable): VoyagePlan =
    VoyagePlan.empty(vesselId)
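The annotation needs matching Resilience4j configuration for the "passage-service" instance; a sketch with illustrative values to tune per call path:
# application.yml
resilience4j:
  circuitbreaker:
    instances:
      passage-service:
        sliding-window-size: 20
        failure-rate-threshold: 50
        wait-duration-in-open-state: 30s
        permitted-number-of-calls-in-half-open-state: 3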
7. Eurostat GPKG can't open inside Spring Boot fat JAR
Symptom: passage-service crash-loops on @PostConstruct. Every downstream that calls it (see #6) cascades.
Root cause: searoute-core ships marnet.gpkg (a SQLite-backed GeoPackage) inside its JAR. SQLite needs a real filesystem path. In a Spring Boot fat JAR the file is a zip entry inside a zip, so SQLite can't open it.
Fix chain.
Gradle task to extract at build:
// passage-service/build.gradle.kts
val extractMarnetGpkg by tasks.registering(Copy::class) {
    val searouteJar = configurations.runtimeClasspath.get()
        .files
        .first { it.name.startsWith("searoute-core") }
    from(zipTree(searouteJar)) {
        include("marnet.gpkg")
    }
    into("$projectDir/src/main/resources/marnet/")
}

tasks.named("processResources") {
    dependsOn(extractMarnetGpkg)
}
Runtime extraction in @PostConstruct:
@PostConstruct
fun init() {
    val target = Paths.get(workingDir, "marnet.gpkg")
    if (!Files.exists(target)) {
        javaClass.getResourceAsStream("/marnet/marnet.gpkg").use { input ->
            Files.copy(input!!, target)
        }
    }
    try {
        spatialIndex = loadFromGpkg(target.toString())
        eurostatAvailable = true
    } catch (e: Exception) {
        log.warn("Eurostat unavailable, falling back to bisection routing", e)
        eurostatAvailable = false
    }
}
Plus a healthcheck start_period: 120s in docker-compose, so the container has time to finish the extraction before the first health verdict counts.
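A sketch of that compose entry; the port and path assume Spring Boot actuator defaults and curl in the image, so adjust to the service's setup:
# docker-compose.yml
passage-service:
  healthcheck:
    test: ["CMD", "curl", "-fsS", "http://localhost:8080/actuator/health"]
    interval: 15s
    timeout: 5s
    retries: 5
    start_period: 120s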
8. Keycloak realm without fixed user IDs
Symptom: every fresh Keycloak deploy generates new user UUIDs. The JWT sub claim diverges from users.id in the DB, and data isolation silently breaks.
The bad realm import:
{
  "users": [
    {
      "username": "admin",
      "email": "admin@example.com",
      "credentials": [{"type": "password", "value": "admin123"}]
    }
  ]
}
Fix: pin id to match DB:
{
  "users": [
    {
      "id": "d0000000-0000-0000-0000-000000000003",
      "username": "admin",
      "email": "admin@example.com",
      "credentials": [{"type": "password", "value": "admin123"}]
    }
  ]
}
DB migration must match:
INSERT INTO users (id, email)
VALUES ('d0000000-0000-0000-0000-000000000003', 'admin@example.com');
Integration test that catches this:
@Test
fun `JWT sub matches pinned admin UUID`() {
    val token = loginAs("admin", "admin123")
    val sub = parseJwt(token).getClaim("sub").asString()
    assertEquals("d0000000-0000-0000-0000-000000000003", sub)
}
The four greps
If you're auditing your own system after reading this, here's where to start.
# 1. What's in your image that isn't in build context?
diff <(cat .gitignore) <(cat .dockerignore)
find . -name Dockerfile -exec grep -H "COPY" {} \;
# 2. What does your /health actually verify?
grep -rn "fun health\|/health" --include="*.kt" --include="*.py" --include="*.go"
# Look for stubs returning {"status":"ok"} unconditionally.
# 3. Inter-service calls without timeouts
grep -rn "WebClient.create\|FeignClient\|RestTemplate(" --include="*.kt"
# Cross-check against responseTimeout / circuit breaker config.
# 4. Configuration drift
grep -rn "ENV\|env\|prefix" --include="*.go" --include="*.yaml"
# Look for diverging prefixes/keys across config init, CLI, compose, and helm.
What ties these together
Operationally, all eight are the same class of bug.
Some piece of the dev environment was filling in for what production wouldn't (bind mount, internal docker-compose DNS, exploded Gradle classpath, persistent dev DB). The healthcheck or build signal was reporting fine when the service wasn't. A library somewhere was defaulting silently - MapStruct to false, WebClient to an unbounded wait, Keycloak to a random UUID. Configuration was duplicated in two places and had drifted.
The four greps above are pointed at exactly those four. Run them.