Upgrading Google ADK to 2.0 on a Cloud SQL Postgres Backend: The Three Things That Bit Us

#python #gcp #postgres #ai

We run an agent built on Google's Agent Development Kit (ADK), deployed on Cloud Run with a Cloud SQL (PostgreSQL) session store via ADK's DatabaseSessionService. Bumping google-adk from 1.x to >=2.0.0 looked like a one-line dependency change. It wasn't.

Three things bit us, in increasing order of subtlety:

1. ADK 2.0 talks to Postgres through asyncpg, which forces a connection-URL change, and that URL is shared with sync code.
2. The events table needs two new columns that ADK 2.0 reads unconditionally. Deploy without them and every chat turn returns a 500 while the container still looks healthy.
3. The legacy v0 (Pickle) schema still works, but logs a deprecation warning. Migrating to v1 (JSON) is optional and cannot be done in place.

Here's the field report.

1. The async driver switch — and the URL you now share with sync code

ADK 2.0's session service is async and expects an async Postgres driver. In practice that means your DATABASE_URL changes scheme:

postgresql://appuser:...@host/db          # 1.x
postgresql+asyncpg://appuser:...@host/db   # 2.0

Easy enough: update the secret, redeploy. The catch is that the same URL is read by code that is not async. We have custom storage (token storage, pending-state storage) built on plain synchronous SQLAlchemy, and create_engine() cannot drive asyncpg, which is an asyncio-only driver. Feed it the 2.0 URL and the sync engine ends up trying to run an async driver from synchronous code and falls over.

The fix is a tiny normalization layer: store the async URL (because ADK is the primary consumer), and strip the driver suffix at the point where sync engines are created.

from sqlalchemy import create_engine
from sqlalchemy.engine import Engine


def _sync_db_url(db_url: str) -> str:
    """Normalize an async-driver URL for use with a sync SQLAlchemy engine."""
    return db_url.replace("postgresql+asyncpg://", "postgresql://", 1)


def create_db_engine(db_url: str) -> Engine:
    return create_engine(
        _sync_db_url(db_url),
        pool_size=2,
        max_overflow=1,
        pool_pre_ping=True,
        pool_recycle=300,
    )

The design decision worth calling out: one URL, normalized at the edge, rather than two secrets. ADK gets the +asyncpg form it wants; every sync consumer goes through create_db_engine() and gets the driver suffix stripped. replace(..., 1) rewrites only the first occurrence, which is the scheme whenever the stored URL starts with postgresql+asyncpg://, so the same substring appearing later in the URL (in a password, say) is left alone. If you have any synchronous DB access alongside ADK 2.0, you need a shim like this. Otherwise the async URL leaks into create_engine() and you get an import error at startup that looks unrelated to the upgrade.
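
A slightly stricter variant of the shim, as a sketch: it rewrites the scheme only when the URL actually starts with the async prefix, so a URL that is already in sync form passes through byte-for-byte. The function and constant names here are hypothetical and are not part of SQLAlchemy or ADK.

```python
ASYNC_PREFIX = "postgresql+asyncpg://"
SYNC_PREFIX = "postgresql://"


def sync_db_url_strict(db_url: str) -> str:
    """Swap the async scheme for the sync one only when it is the URL's prefix."""
    if db_url.startswith(ASYNC_PREFIX):
        # Keep everything after the scheme (credentials, host, db) untouched.
        return SYNC_PREFIX + db_url[len(ASYNC_PREFIX):]
    # Already a sync URL (or some other scheme): return it unchanged.
    return db_url
```

Whether the extra strictness is worth it depends on how the secret is managed: if the stored URL is always the +asyncpg form, the one-line replace() behaves identically.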

2. The missing event columns — a silent 500 in production

This is the one that actually took the service down in our dev environment before we caught it.

ADK 2.0 added two columns to the events table:

input_transcription  jsonb
output_transcription  jsonb

ADK 2.0 reads these columns unconditionally on session GET and on the /run_sse streaming endpoint. If your database was created under 1.x, the columns don't exist, Postgres rejects the query, and asyncpg surfaces that as UndefinedColumnError. The symptom is not a clear startup crash: the container boots fine and /health returns 200, but every chat turn 500s and session reads fail. We reproduced it in dev as exactly that: healthy container, dead chat.

The fix is a forward-compatible ALTER TABLE that you must run before deploying the 2.0 image:

ALTER TABLE events ADD COLUMN IF NOT EXISTS input_transcription jsonb;
ALTER TABLE events ADD COLUMN IF NOT EXISTS output_transcription jsonb;

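IF NOT EXISTS makes it idempotent, and adding a nullable column with no default is a metadata-only change on Postgres: no table rewrite, and the ACCESS EXCLUSIVE lock it takes is held only briefly, so it is safe on a live DB as long as no long-running transaction is sitting on the table. The ordering matters: patch the DB first, then deploy. Do it the other way and you have a window where the new image is live against the old schema and chat is down.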

Connecting through the Cloud SQL Auth Proxy, the whole patch is:

cloud_sql_proxy -instances=PROJECT:asia-northeast1:INSTANCE=tcp:127.0.0.1:15433 &

PGPASSWORD="$DB_PASSWORD" psql -h 127.0.0.1 -p 15433 -U appuser -d appdb <<'SQL'
ALTER TABLE events ADD COLUMN IF NOT EXISTS input_transcription jsonb;
ALTER TABLE events ADD COLUMN IF NOT EXISTS output_transcription jsonb;
SELECT column_name FROM information_schema.columns
WHERE table_name = 'events'
  AND column_name IN ('input_transcription', 'output_transcription');
-- expect 2 rows
SQL

Good news for rollback: these columns are ignored by ADK 1.x, so adding them doesn't break the old version. You can patch ahead of time without committing to the upgrade.
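
If you want a deploy pipeline to gate on this, the check reduces to a set difference. The sketch below assumes you have already fetched the events column names (for example with the information_schema query above) and simply reports what is missing and which statements would fix it. The function names are hypothetical; the column names and types are the two listed in this post.

```python
from typing import Iterable

# Columns this post reports ADK 2.0 reading from the events table.
REQUIRED_EVENT_COLUMNS = {
    "input_transcription": "jsonb",
    "output_transcription": "jsonb",
}


def missing_event_columns(existing: Iterable[str]) -> list[str]:
    """Return the required column names that are absent from `existing`."""
    present = set(existing)
    return [name for name in REQUIRED_EVENT_COLUMNS if name not in present]


def alter_statements(missing: Iterable[str]) -> list[str]:
    """Build idempotent ALTER TABLE statements for the given column names."""
    return [
        f"ALTER TABLE events ADD COLUMN IF NOT EXISTS {name} "
        f"{REQUIRED_EVENT_COLUMNS[name]};"
        for name in missing
    ]
```

A pipeline step could fail the deploy whenever missing_event_columns() returns a non-empty list, which enforces the patch-first ordering mechanically.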

3. The v0 → v1 schema migration is optional (and you probably want to defer it)

On startup, ADK 2.0 logs this if your DB was created under 1.x:

The database is using the legacy v0 schema, which uses Pickle to serialize
event actions. The v0 schema will not be supported going forward and will be
deprecated in a few rollouts. Please migrate to the v1 schema which uses JSON
serialization for event data.

The key realization: ADK 2.0 reads and writes v0 fine. This is a deprecation warning, not a hard requirement. We chose to run 2.0 on the v0 schema and defer the migration — the upgrade and the migration are independent decisions, and decoupling them shrinks the risky deploy.

When you do migrate, the important constraint is that it cannot be done in place. The schemas are structurally different:

| Schema element | v0 | v1 |
| --- | --- | --- |
| `events.actions` | `bytea` (Pickle) | absent |
| `events.event_data` | absent | `jsonb` (all event data) |
| metadata table | none | `adk_internal_metadata` |

v0 stores event fields as individual columns plus a pickled actions blob; v1 collapses everything into one event_data JSONB column. Because the column set changes, ADK ships a migration command that reads from one DB and writes to a freshly created one:

# CREATE DATABASE can't run inside a transaction — separate statement
psql ... -d postgres -c "CREATE DATABASE appdb_v1;"

SOURCE_URL="postgresql://appuser:${PW}@127.0.0.1:15433/appdb"
DEST_URL="postgresql://appuser:${PW}@127.0.0.1:15433/appdb_v1"

uv run adk migrate session \
  --source_db_url="${SOURCE_URL}" \
  --dest_db_url="${DEST_URL}"

adk migrate session covers ADK's own four tables: app_states, user_states, sessions, events. Anything you added yourself (OAuth tokens, app-specific state) is not touched and has to be copied separately — but that's outside ADK's scope and outside this post.

Verify the destination after migrating:

# 1 means v1
psql ... -d appdb_v1 -c \
  "SELECT value FROM adk_internal_metadata WHERE key='schema_version';"

# event_data present, actions gone
psql ... -d appdb_v1 -c \
  "SELECT column_name FROM information_schema.columns WHERE table_name='events';"

Cut over by repointing the connection secret at the new DB and redeploying. Because you migrated into a new database, the original is untouched — rollback is just repointing the secret back. No data loss, no destructive step until you're confident.
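
Before repointing the secret, the two verification queries can be folded into one scripted check. This is a minimal sketch that assumes you pass in the schema_version value and the events column names you read from the destination DB; the function name is hypothetical, and the three conditions are just the ones described in this post (version is 1, event_data present, actions gone).

```python
from typing import Iterable, Optional


def v1_schema_problems(
    schema_version: Optional[str], event_columns: Iterable[str]
) -> list[str]:
    """Return a list of problems; an empty list means the destination looks like v1."""
    problems = []
    columns = set(event_columns)
    # adk_internal_metadata should report schema_version = 1 after migration.
    if schema_version is None or str(schema_version).strip() != "1":
        problems.append(f"schema_version is {schema_version!r}, expected '1'")
    # v1 keeps all event data in a single event_data column.
    if "event_data" not in columns:
        problems.append("events.event_data is missing")
    # The pickled v0 column should no longer exist.
    if "actions" in columns:
        problems.append("events.actions is still present")
    return problems
```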

The deploy order that actually works

Pulling it together, the sequence is:

1. Patch the DB (ALTER TABLE events ...) before anything else, to prevent the 500 window.
2. Switch the URL to postgresql+asyncpg:// (and make sure sync consumers normalize it back).
3. Deploy the 2.0 image.
4. Smoke test: /health → 200, an existing session GET → not 500, a new /run_sse chat → streams a response.
5. (Optional, later) migrate v0 → v1 into a new DB and cut over.
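
The smoke-test step is easy to script. The sketch below is one possible shape, using only the standard library: each check is a named probe that returns an HTTP status plus a rule for what counts as passing. Check, http_status, run_smoke_checks, and build_checks are hypothetical names, and session_path is deliberately a parameter because the route for reading a session depends on how your service is mounted.

```python
import urllib.error
import urllib.request
from typing import Callable, Iterable, NamedTuple


class Check(NamedTuple):
    name: str
    probe: Callable[[], int]    # returns an HTTP status code
    ok: Callable[[int], bool]   # decides whether that status passes


def http_status(url: str, timeout: float = 10.0) -> int:
    """GET a URL and return its status code, including 4xx/5xx responses."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as exc:
        # urlopen raises on 4xx/5xx; the code is still what we want to inspect.
        return exc.code


def run_smoke_checks(checks: Iterable[Check]) -> list[str]:
    """Run every check and return failure messages (empty list means pass)."""
    failures = []
    for check in checks:
        try:
            status = check.probe()
        except Exception as exc:  # connection refused, timeout, DNS failure, ...
            failures.append(f"{check.name}: {exc!r}")
            continue
        if not check.ok(status):
            failures.append(f"{check.name}: unexpected status {status}")
    return failures


def build_checks(base_url: str, session_path: str) -> list[Check]:
    """Assemble the /health and session-read checks for one deployment."""
    return [
        Check("health", lambda: http_status(f"{base_url}/health"), lambda s: s == 200),
        Check(
            "session GET",
            lambda: http_status(f"{base_url}{session_path}"),
            lambda s: s < 500,  # any 5xx counts as the schema-mismatch symptom
        ),
    ]
```

The /run_sse chat turn is a POST whose body is specific to your app, so it is left out here; it fits the same pattern as one more Check with its own probe function.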

Gotchas worth pinning

pg_dump version skew. Don't reach for pg_dump to copy data if your local client is older than the Cloud SQL server (e.g. client 16 vs server 17) — it just refuses. Either match versions or copy via a script.
CREATE DATABASE outside a transaction. It can't run inside one, so it has to be its own statement — not bundled into a BEGIN ... COMMIT block with the grants.
Session compatibility across versions. Sessions written by 2.0 may not be readable by 1.x (especially older 1.x). Treat the version downgrade as lossy for any session created after cutover, and keep the old image only as a short-term escape hatch.
/health lies. A 200 from your health check says nothing about whether the schema matches. Smoke-test an actual session read and a real chat turn.
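
For the pg_dump version-skew gotcha, a pre-flight comparison of major versions saves a failed dump. This sketch assumes you feed it the text printed by pg_dump --version and the server's server_version setting, and that both are PostgreSQL 10 or newer, where the major version is the first number. The function names are hypothetical.

```python
import re


def pg_major(version_text: str) -> int:
    """Extract the major version from text like 'pg_dump (PostgreSQL) 16.4'."""
    match = re.search(r"(\d+)", version_text)
    if match is None:
        raise ValueError(f"no version number found in {version_text!r}")
    return int(match.group(1))


def pg_dump_can_dump(client_version: str, server_version: str) -> bool:
    """pg_dump refuses servers newer than itself, so the client must not be older."""
    return pg_major(client_version) >= pg_major(server_version)
```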

Summary

The google-adk 2.0 bump is small on paper and sharp in practice. The async driver switch ripples into any sync DB code sharing the URL; the new events columns turn a healthy-looking container into a chat outage if you deploy before patching; and the v0 deprecation warning is loud but not load-bearing — you can stay on v0 and migrate on your own schedule into a fresh DB. Patch first, normalize the URL at the edge, smoke-test the real path, and treat the schema migration as a separate project.