DEV Community

Tiana

8 architectural bugs we found auditing 29 microservices (with code)

We audited a 29-service Kotlin/Spring + Go + React platform and found 14 issues. Fixes for 8 of them are already in our release branch on the way to QA. This post shares all 8 with file paths, code snippets, and exact fixes. The pattern, covered at the end: the dev environment masks the problem, the healthcheck lies, a library silently falls back to a default, and config drifts between two places.

If you're going to grep your codebase after reading this, the four greps to start with are at the bottom.


1. ML weights missing from the Docker image

Symptom: prediction endpoint returns 500 in k8s. Locally fine. CI passes.

Root cause chain:

ml-api-service/.gitignore:58   →  models/         # excluded
ml-api-service/.dockerignore   →  missing
ml-api-service/Dockerfile:35   →  COPY models/ ./models/
docker-compose.yml:976         →  ./ml-api-service/models:/app/models  (bind mount)

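Docker itself never reads .gitignore; only .dockerignore filters the build context. The weights go missing one step earlier: .gitignore keeps them out of the repository, so a CI build from a fresh clone has no weight files in its build context, and COPY models/ has nothing to put in the image. The bind mount in docker-compose hides this locally.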

The lying healthcheck:

@app.get("/ready")
async def ready():
    return {"status": "ok"}  # TODO: real check

Fix: explicit .dockerignore, track weights in git, real readiness probe:

from pathlib import Path
from fastapi import HTTPException

@app.get("/ready")
async def ready():
    model_dir = Path("/app/models")
    # Count only weight files, not every entry in the directory
    weights = [*model_dir.glob("*.joblib"), *model_dir.glob("*.onnx")]
    if not weights:
        raise HTTPException(status_code=503, detail="No models loaded")
    return {"models_loaded": len(weights)}

Point the Dockerfile HEALTHCHECK at /ready. Kubernetes ignores HEALTHCHECK, so set the pod's readinessProbe to /ready as well.
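A rough way to catch this class of bug before the image ships is to cross-check COPY sources against .gitignore. The Python sketch below is an illustration, not part of our audit tooling: the function names are made up, it only understands plain directory patterns such as models/, and it only compares the top-level path component.

```python
from pathlib import PurePosixPath


def ignored_dirs(gitignore_text):
    """Collect plain directory patterns such as 'models/' (globs and negations are skipped)."""
    dirs = set()
    for raw in gitignore_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        if line.endswith("/") and not any(ch in line for ch in "*?!["):
            dirs.add(line.strip("/"))
    return dirs


def copy_sources(dockerfile_text):
    """Return source paths of COPY instructions that read from the build context."""
    sources = []
    for raw in dockerfile_text.splitlines():
        parts = raw.split()
        if len(parts) < 3 or parts[0].upper() != "COPY":
            continue
        if any(p.startswith("--from") for p in parts[1:]):
            continue  # copies from another build stage, not from the context
        args = [p for p in parts[1:] if not p.startswith("--")]
        sources.extend(args[:-1])  # the last argument is the destination
    return sources


def gitignored_copies(gitignore_text, dockerfile_text):
    """Flag COPY sources whose top-level directory is excluded by .gitignore."""
    ignored = ignored_dirs(gitignore_text)
    flagged = []
    for src in copy_sources(dockerfile_text):
        parts = PurePosixPath(src).parts
        if parts and parts[0] in ignored:
            flagged.append(src)
    return flagged


# Sample inputs shaped like the files in the root cause chain above.
gitignore = "# build output\n__pycache__/\nmodels/\n*.log\n"
dockerfile = (
    "FROM python:3.11-slim\n"
    "COPY requirements.txt .\n"
    "COPY models/ ./models/\n"
    "COPY --from=builder /wheels /wheels\n"
)
print(gitignored_copies(gitignore, dockerfile))
```

Anything this flags is a path that exists on a developer machine but not in a fresh clone, which is exactly the case the bind mount was hiding.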


2. MapStruct + Kotlin is-prefix → silent false

Symptom: five mappers return false for every isActive, isDemoData, isIgMember, isSubscriptionActive, regardless of DB state.

Root cause: Kotlin compiles the property isActive into a getter that is also named isActive() (no get prefix). MapStruct applies JavaBeans rules to that getter and reads the property as active. The DTO constructor expects isActive, so the names don't match and MapStruct leaves the field at its default, false. The build stays green, with no warning.

Fix per field:

@Mapper(componentModel = "spring")
abstract class UserMapper {
    @Mapping(
        target = "isActive",
        expression = "java(entity.isActive())"
    )
    @Mapping(
        target = "isDemoData",
        expression = "java(entity.isDemoData())"
    )
    abstract fun toDto(entity: User): UserDto
}

Permanent fix: in build.gradle.kts,

kapt {
    arguments {
        arg("mapstruct.unmappedTargetPolicy", "ERROR")
    }
}

The build now fails on the next unmapped target instead of silently defaulting it.
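The name mismatch is easier to see as a toy model. The Python sketch below imitates the JavaBeans naming convention (strip is/get, then decapitalize) and Kotlin's getter naming for is-prefixed properties. It is an approximation for illustration, not MapStruct's actual code, and all function names are hypothetical.

```python
from typing import Optional


def decapitalize(name: str) -> str:
    """JavaBeans-style decapitalize: 'Active' -> 'active', while 'URL' stays 'URL'."""
    if len(name) > 1 and name[0].isupper() and name[1].isupper():
        return name
    return name[:1].lower() + name[1:]


def bean_property_name(getter: str) -> Optional[str]:
    """Derive the property name a JavaBeans-style introspector reads from a getter name."""
    for prefix in ("get", "is"):
        rest = getter[len(prefix):]
        if getter.startswith(prefix) and rest[:1].isupper():
            return decapitalize(rest)
    return None  # not a getter under this convention


def kotlin_getter(prop: str) -> str:
    """Kotlin reuses an 'is' + uppercase property name as the getter name; others get 'get'."""
    if prop.startswith("is") and prop[2:3].isupper():
        return prop
    return "get" + prop[:1].upper() + prop[1:]


for prop in ("isActive", "isDemoData", "email"):
    getter = kotlin_getter(prop)
    seen_as = bean_property_name(getter)
    status = "match" if seen_as == prop else "MISMATCH"
    print(f"{prop}: getter {getter}() is read as '{seen_as}' ({status})")
```

Under this model every is-prefixed Boolean loses its prefix on the source side while the DTO target keeps it, which is why only those fields were affected.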


3. Edge-agent default config: vessel-001 and insecure_skip_verify: true

Symptom: none yet - caught before first prod deploy.

The bad default:

# edge-agent/config.yaml
vessel_id: "vessel-001"
mqtt_broker: "tcp://mosquitto:1883"
cloud_url: "http://gateway:8080"
mongodb_uri: "mongodb://mongodb:27017"
tls:
  insecure_skip_verify: true

Fix: split into two files. config.yaml (production template, empty placeholders, TLS strict) and config.demo.yaml (dev values). Refuse to start on bad placeholders:

func validateConfig(cfg *Config) error {
    if cfg.VesselID == "" {
        return errors.New("vessel_id is required")
    }
    if matched, _ := regexp.MatchString(`^vessel-\d+$`, cfg.VesselID); matched {
        return errors.New("vessel_id matches placeholder pattern, refusing to start")
    }
    if cfg.TLS.InsecureSkipVerify && !cfg.AllowInsecureFromCLI {
        return errors.New("insecure_skip_verify requires --allow-insecure-tls")
    }
    return nil
}

4. Three different env-variable prefixes

Symptom: env-based configuration silently ignored. Found when somebody overrode MQTT_BROKER and the agent ignored it.

The drift:

// edge-agent/internal/config/config.go
viper.AutomaticEnv()
viper.SetEnvPrefix("AGENT")     // → AGENT_VESSEL_ID
// edge-agent/cmd/edge-agent/cli/root.go
viper.BindEnv("vessel-id", "EDGE_VESSEL_ID")  // → EDGE_VESSEL_ID
# docker-compose.yml
edge-agent:
  environment:
    EDGE_AGENT_VESSEL_ID: "v-042"   # → EDGE_AGENT_VESSEL_ID

Fix: canonical prefix EDGE_AGENT_*, explicit binds for backward compat, integration test:

viper.BindEnv("vessel_id",
    "EDGE_AGENT_VESSEL_ID",  // canonical
    "AGENT_VESSEL_ID",       // legacy
    "EDGE_VESSEL_ID",        // legacy
)
The integration test:
func TestEnvOverridesConfig(t *testing.T) {
    t.Setenv("EDGE_AGENT_VESSEL_ID", "test-vessel-99")
    cfg := loadConfig()
    if cfg.VesselID != "test-vessel-99" {
        t.Fatalf("env override ignored: got %q", cfg.VesselID)
    }
}
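To find this kind of drift without waiting for someone to trip over it, one option is to collect every prefix a setting appears with across the repo. The Python sketch below is a minimal illustration under assumptions: the function name is hypothetical, it only looks at SetEnvPrefix("...") calls and literal names ending in the setting, and the sample inputs are trimmed copies of the three snippets above.

```python
import re

# Prefix declared through Viper, e.g. viper.SetEnvPrefix("AGENT")
SET_PREFIX = re.compile(r'SetEnvPrefix\("([A-Z0-9_]+)"\)')


def env_prefixes(sources, setting):
    """Map each env prefix used for `setting` to the files that use it."""
    # Literal variable names such as EDGE_AGENT_VESSEL_ID
    literal = re.compile(r"\b([A-Z][A-Z0-9_]*)_" + re.escape(setting) + r"\b")
    seen = {}
    for path, text in sources.items():
        prefixes = set(SET_PREFIX.findall(text)) | set(literal.findall(text))
        for prefix in prefixes:
            seen.setdefault(prefix, []).append(path)
    return seen


sources = {
    "edge-agent/internal/config/config.go":
        'viper.AutomaticEnv()\nviper.SetEnvPrefix("AGENT")\n',
    "edge-agent/cmd/edge-agent/cli/root.go":
        'viper.BindEnv("vessel-id", "EDGE_VESSEL_ID")\n',
    "docker-compose.yml":
        'edge-agent:\n  environment:\n    EDGE_AGENT_VESSEL_ID: "v-042"\n',
}

result = env_prefixes(sources, "VESSEL_ID")
for prefix in sorted(result):
    print(prefix, "->", result[prefix])
```

More than one key in the result means the same setting is being read under different names somewhere.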

5. One hypertable in two Postgres schemas

Symptom: dashboards show stale data. Logs clean. Continuous aggregates maintaining empty views.

The double-create:

-- docker/init-db/01_create_databases.sql
CREATE TABLE public.telemetry_values (...);
SELECT create_hypertable('public.telemetry_values', 'time_measured');
-- ingest-service/db/changelog/001-init-schema.sql
CREATE TABLE ingest.telemetry_values (...);
SELECT create_hypertable('ingest.telemetry_values', 'time_measured');

The ingest service writes to ingest.telemetry_values. The view unified_vessel_positions (created by the init script) reads public.telemetry_values, which nothing populates.

Fix: delete the public-schema table from the init script. Single source of truth = the service's own migrations.
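A double-create like this can be spotted mechanically by scanning init scripts and migrations for the same table name under different schemas. The Python sketch below assumes schema-qualified CREATE TABLE statements; the function name is hypothetical and the column lists in the sample are placeholders, since the real definitions are elided above.

```python
import re
from collections import defaultdict

# Only schema-qualified names are considered, e.g. CREATE TABLE ingest.telemetry_values
CREATE_TABLE = re.compile(
    r"CREATE\s+TABLE\s+(?:IF\s+NOT\s+EXISTS\s+)?(\w+)\.(\w+)", re.IGNORECASE
)


def tables_in_multiple_schemas(sql_files):
    """Return {table: {schema: file}} for every table created in more than one schema."""
    owners = defaultdict(dict)
    for path, sql in sql_files.items():
        for schema, table in CREATE_TABLE.findall(sql):
            owners[table.lower()][schema.lower()] = path
    return {table: schemas for table, schemas in owners.items() if len(schemas) > 1}


sql_files = {
    "docker/init-db/01_create_databases.sql":
        "CREATE TABLE public.telemetry_values (time_measured timestamptz);",
    "ingest-service/db/changelog/001-init-schema.sql":
        "CREATE TABLE ingest.telemetry_values (time_measured timestamptz);",
}

duplicates = tables_in_multiple_schemas(sql_files)
for table, schemas in duplicates.items():
    print(table, "is created in:", schemas)
```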


6. WebClient calls with no timeouts, no circuit breakers

Symptom: when one service crash-loops, every caller's reactor worker pool parks on calls that won't return. Cascade.

The default:

// passage-service-client.kt
@Bean
fun passageWebClient() = WebClient.create("http://passage-service:8080")
// no responseTimeout, no connectTimeout

Fix:

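import io.netty.channel.ChannelOption
import java.time.Duration
import org.springframework.http.client.reactive.ReactorClientHttpConnector
import org.springframework.web.reactive.function.client.WebClient
import reactor.netty.http.client.HttpClient

@Bean
fun passageWebClient(): WebClient {
    val httpClient = HttpClient.create()
        .responseTimeout(Duration.ofSeconds(5))              // bounds the wait for response data
        .option(ChannelOption.CONNECT_TIMEOUT_MILLIS, 2000)  // bounds the TCP connect
    return WebClient.builder()
        .baseUrl("http://passage-service:8080")
        .clientConnector(ReactorClientHttpConnector(httpClient))
        .build()
}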

Plus Resilience4j circuit breaker on critical paths:

@CircuitBreaker(name = "passage-service", fallbackMethod = "fallbackVoyagePlan")
suspend fun getVoyagePlan(vesselId: UUID): VoyagePlan {
    return passageWebClient.get()
        .uri("/voyage-plans/{vesselId}", vesselId)
        .retrieve()
        .awaitBody()
}

private fun fallbackVoyagePlan(vesselId: UUID, ex: Throwable): VoyagePlan =
    VoyagePlan.empty(vesselId)
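To find the remaining unbounded clients, a coarse file-level check is usually enough for a first pass. The Python sketch below is an assumption-laden heuristic, not a parser: the function name and the fixed-client.kt file name are hypothetical, and a file counts as covered if it mentions responseTimeout or CONNECT_TIMEOUT_MILLIS anywhere.

```python
import re

CLIENT_CTOR = re.compile(r"WebClient\.(create|builder)\(")
TIMEOUT_HINT = re.compile(r"responseTimeout|CONNECT_TIMEOUT_MILLIS")


def clients_without_timeouts(kotlin_files):
    """List files that construct a WebClient but never mention a timeout setting."""
    return sorted(
        path
        for path, text in kotlin_files.items()
        if CLIENT_CTOR.search(text) and not TIMEOUT_HINT.search(text)
    )


kotlin_files = {
    # The default from above: no timeouts at all.
    "passage-service-client.kt":
        'fun passageWebClient() = WebClient.create("http://passage-service:8080")',
    # Shaped like the fixed bean: timeouts set on the underlying HttpClient.
    "fixed-client.kt":
        "val httpClient = HttpClient.create().responseTimeout(Duration.ofSeconds(5))\n"
        "return WebClient.builder().clientConnector(ReactorClientHttpConnector(httpClient)).build()",
}

print(clients_without_timeouts(kotlin_files))
```

A file that defines several clients and configures only one of them will slip through, so treat a clean result as a starting point rather than proof.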

7. Eurostat GPKG can't open inside Spring Boot fat JAR

Symptom: passage-service crash-loops on @PostConstruct. Every downstream that calls it (see #6) cascades.

Root cause: searoute-core ships marnet.gpkg (SQLite) inside its JAR. SQLite needs a filesystem path. Spring Boot fat JAR is zip-inside-zip. SQLite can't open it.

Fix chain.

Gradle task to extract at build:

// passage-service/build.gradle.kts
val extractMarnetGpkg by tasks.registering(Copy::class) {
    val searouteJar = configurations.runtimeClasspath.get()
        .files
        .first { it.name.startsWith("searoute-core") }
    from(zipTree(searouteJar)) {
        include("marnet.gpkg")
    }
    into("$projectDir/src/main/resources/marnet/")
}

tasks.named("processResources") {
    dependsOn(extractMarnetGpkg)
}

Runtime extraction in @PostConstruct:

@PostConstruct
fun init() {
    val target = Paths.get(workingDir, "marnet.gpkg")
    try {
        // Extraction sits inside the try so a failed copy also degrades to the fallback
        if (!Files.exists(target)) {
            javaClass.getResourceAsStream("/marnet/marnet.gpkg").use { input ->
                Files.copy(requireNotNull(input) { "marnet.gpkg missing from classpath" }, target)
            }
        }
        spatialIndex = loadFromGpkg(target.toString())
        eurostatAvailable = true
    } catch (e: Exception) {
        log.warn("Eurostat unavailable, falling back to bisection routing", e)
        eurostatAvailable = false
    }
}

Also set the healthcheck start_period to 120s in docker-compose to give extraction time to finish.
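The underlying constraint is not specific to the JVM: SQLite opens files by filesystem path, so a database packed inside an archive has to be extracted first. The Python sketch below reproduces the extract-then-open step with a zip file standing in for the JAR. It is an analog for illustration, not the service code; the function names and the one-row nodes table are made up.

```python
import sqlite3
import tempfile
import zipfile
from pathlib import Path


def open_packaged_db(archive, member, workdir):
    """Extract `member` from `archive` into `workdir` once, then open it by path."""
    target = workdir / Path(member).name
    if not target.exists():
        with zipfile.ZipFile(archive) as zf, zf.open(member) as src:
            target.write_bytes(src.read())
    return sqlite3.connect(str(target))


def demo():
    """Package a one-row stand-in database in a zip, then read it back after extraction."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)

        # Build the stand-in database on disk.
        seed = root / "seed.db"
        con = sqlite3.connect(str(seed))
        con.execute("CREATE TABLE nodes (id INTEGER)")
        con.execute("INSERT INTO nodes VALUES (1)")
        con.commit()
        con.close()

        # Pack it the way a library would ship it inside its archive.
        archive = root / "lib.jar"
        with zipfile.ZipFile(archive, "w") as zf:
            zf.write(seed, "marnet/marnet.gpkg")

        workdir = root / "work"
        workdir.mkdir()
        db = open_packaged_db(archive, "marnet/marnet.gpkg", workdir)
        try:
            return db.execute("SELECT COUNT(*) FROM nodes").fetchone()[0]
        finally:
            db.close()


print(demo())
```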


8. Keycloak realm without fixed user IDs

Symptom: every fresh Keycloak deploy generates new user UUIDs. JWT.sub diverges from users.id in DB. Data isolation silently breaks.

The bad realm import:

{
  "users": [
    {
      "username": "admin",
      "email": "admin@example.com",
      "credentials": [{"type": "password", "value": "admin123"}]
    }
  ]
}

Fix: pin id to match DB:

{
  "users": [
    {
      "id": "d0000000-0000-0000-0000-000000000003",
      "username": "admin",
      "email": "admin@example.com",
      "credentials": [{"type": "password", "value": "admin123"}]
    }
  ]
}

DB migration must match:

INSERT INTO users (id, email)
VALUES ('d0000000-0000-0000-0000-000000000003', 'admin@example.com');

Integration test that catches this:

@Test
fun `JWT sub matches pinned admin UUID`() {
    val token = loginAs("admin", "admin123")
    val sub = parseJwt(token).getClaim("sub").asString()
    assertEquals("d0000000-0000-0000-0000-000000000003", sub)
}
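The same mismatch can be caught earlier, without a running Keycloak, by comparing the realm import against the seeded IDs. The Python sketch below is a minimal illustration: the function name is hypothetical, and db_ids is an assumed username-to-users.id mapping that you would build from your own migration.

```python
import json


def unpinned_users(realm_json, db_ids):
    """Return usernames whose realm entry has no id, or an id that differs from the DB row."""
    problems = []
    for user in json.loads(realm_json).get("users", []):
        name = user.get("username")
        if user.get("id") is None or user["id"] != db_ids.get(name):
            problems.append(name)
    return problems


db_ids = {"admin": "d0000000-0000-0000-0000-000000000003"}

bad_realm = json.dumps({
    "users": [{"username": "admin", "email": "admin@example.com"}]
})
fixed_realm = json.dumps({
    "users": [{
        "id": "d0000000-0000-0000-0000-000000000003",
        "username": "admin",
        "email": "admin@example.com",
    }]
})

print(unpinned_users(bad_realm, db_ids))
print(unpinned_users(fixed_realm, db_ids))
```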

The four greps

If you're auditing your own system after reading this, here's where to start.

# 1. What does your Dockerfile COPY that never reaches the build context?
diff .gitignore .dockerignore
find . -name Dockerfile -exec grep -H "COPY" {} \;
# 2. What does your /health actually verify?
grep -rn "fun health\|/health\|/ready" --include="*.kt" --include="*.py" --include="*.go" .
# Look for stubs returning {"status":"ok"} unconditionally.
# 3. Inter-service calls without timeouts
grep -rn "WebClient.create\|FeignClient\|RestTemplate(" --include="*.kt"
# Cross-check against responseTimeout / circuit breaker config.
# 4. Configuration drift
grep -rnE "SetEnvPrefix|BindEnv|Getenv|environment:" --include="*.go" --include="*.yaml" --include="*.yml" .
# Look for diverging prefixes/keys across config init, CLI, compose, and helm.

What ties these together

Operationally, all eight are the same class of bug.

Some piece of the dev environment was filling in for what production wouldn't (bind mount, internal docker-compose DNS, exploded Gradle classpath, persistent dev DB). The healthcheck or build signal was reporting fine when the service wasn't. A library somewhere was defaulting silently - MapStruct to false, WebClient to an unbounded wait, Keycloak to a random UUID. Configuration was duplicated in two places and had drifted.

The four greps above target exactly those four patterns. Run them.
