GCP Track — **Ava** on Vertex AI

Peace Thabiwa

1) GCP Track — Ava on Vertex AI

⚙️ Architecture (GCP)

Dev -> GitHub Actions -> Artifact Registry -> Vertex AI (Train/Batch/Endpoints)
                         |                  -> Cloud Run (feature API / workers)
Data -> GCS (raw/feat) -> BigQuery (analytics)
Meta -> Firestore (pattern_ledger) + Cloud Logging + Cloud Monitoring

🗂 Repo layout (GCP)

ava-vertex-ml/
├─ infra/
│  ├─ terraform/
│  │  ├─ main.tf            # project, iam, artifact registry, gcs, bq
│  │  ├─ vertex.tf          # endpoints, models, service accounts
│  │  └─ outputs.tf
├─ services/
│  ├─ trainer/
│  │  ├─ Dockerfile
│  │  ├─ train.py
│  │  └─ requirements.txt
│  ├─ batch_infer/
│  │  ├─ Dockerfile
│  │  └─ batch.py
│  └─ feature_api/
│     ├─ Dockerfile
│     └─ app.py             # FastAPI on Cloud Run
├─ pipelines/
│  ├─ vertex_pipeline.py    # Vertex AI Pipeline (KFP v2)
│  └─ components/...
├─ ops/
│  ├─ gh-actions/
│  │  ├─ build_push_gcp.yml
│  │  └─ deploy_vertex.yml
│  └─ makefile
├─ binflow/
│  ├─ ledger.py             # minimal pattern ledger (Firestore)
│  └─ phases.py             # Focus, Loop, Transition, Pause, Emergence
└─ README.md

🧱 Terraform (GCP) — minimal core

# infra/terraform/main.tf
provider "google" {
  project = var.project_id
  region  = var.region
}

resource "google_artifact_registry_repository" "repo" {
  location      = var.region
  repository_id = "ml-images"
  format        = "DOCKER"
}

resource "google_storage_bucket" "data" {
  name     = "${var.project_id}-ml-data"
  location = var.region
}

resource "google_bigquery_dataset" "ds" {
  dataset_id                 = "ml_analytics"
  location                   = var.region
  delete_contents_on_destroy = true
}

# Service Account for Vertex
resource "google_service_account" "vertex_sa" {
  account_id   = "vertex-exec"
  display_name = "Vertex Execution SA"
}

🐳 Trainer (GCP)

# services/trainer/Dockerfile
FROM python:3.11-slim
WORKDIR /app
# build from the repo root (docker build -f services/trainer/Dockerfile .)
# so the shared binflow package that train.py imports ends up in the image
COPY services/trainer/requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY binflow ./binflow
COPY services/trainer/train.py .
CMD ["python", "train.py"]
# services/trainer/requirements.txt
google-cloud-storage
google-cloud-aiplatform
google-cloud-firestore
pandas
scikit-learn
# services/trainer/train.py
import os, json, time
from google.cloud import storage, aiplatform
from binflow.ledger import log_event

PROJECT = os.getenv("PROJECT")
BUCKET  = os.getenv("BUCKET")
REGION  = os.getenv("REGION","us-central1")

def main():
    log_event(phase="Focus", action="trainer:start", payload={"region": REGION})
    # mock train
    time.sleep(2)
    model_uri = f"gs://{BUCKET}/models/model.pkl"
    # save artifact…
    log_event(phase="Emergence", action="trainer:complete", payload={"model_uri": model_uri})
    print(json.dumps({"model_uri": model_uri}))

if __name__ == "__main__":
    main()
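
The "# save artifact…" step above is left open; here is a minimal sketch of one way it could upload a pickled model with google-cloud-storage (the helper, local path, and blob name are assumptions, not part of the repo above):

# services/trainer/save_artifact.py (hypothetical helper)
import pickle
from google.cloud import storage

def save_model_to_gcs(model, bucket_name, blob_name="models/model.pkl"):
    """Pickle the trained model locally, then upload it to GCS."""
    local_path = "/tmp/model.pkl"
    with open(local_path, "wb") as f:
        pickle.dump(model, f)
    blob = storage.Client().bucket(bucket_name).blob(blob_name)
    blob.upload_from_filename(local_path)
    return f"gs://{bucket_name}/{blob_name}"

# usage inside train.py: model_uri = save_model_to_gcs(model, BUCKET)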

🔁 Vertex pipeline (KFP v2) — skeleton

# pipelines/vertex_pipeline.py
from kfp import dsl
from google_cloud_pipeline_components.v1.custom_job import CustomTrainingJobOp

@dsl.pipeline(name="ava-vertex-pipeline")
def pipeline():
    train = CustomTrainingJobOp(
        display_name="trainer",
        worker_pool_specs=[{
          "machine_spec": {"machine_type": "n1-standard-4"},
          "replica_count": "1",
          "container_spec": {
            "image_uri": "REGION-docker.pkg.dev/PROJECT/ml-images/trainer:latest",
            "args": []
          }
        }]
    )
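
To actually run this skeleton, one option is to compile it with the KFP compiler and submit it as a Vertex AI PipelineJob; a minimal sketch (project, region, bucket, and service account values are placeholders):

# pipelines/run_pipeline.py (hypothetical; compiles and submits the skeleton above)
from kfp import compiler
from google.cloud import aiplatform
from vertex_pipeline import pipeline

# compile the KFP v2 pipeline to a local spec file
compiler.Compiler().compile(pipeline_func=pipeline, package_path="ava_pipeline.json")

# submit the compiled spec to Vertex AI Pipelines
aiplatform.init(project="PROJECT", location="us-central1", staging_bucket="gs://PROJECT-ml-data")
aiplatform.PipelineJob(
    display_name="ava-vertex-pipeline",
    template_path="ava_pipeline.json",
).run(service_account="vertex-exec@PROJECT.iam.gserviceaccount.com")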

🔧 GitHub Actions (GCP)

# ops/gh-actions/build_push_gcp.yml
name: Build & Push (GCP)
on: [push]
jobs:
  build-push:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: google-github-actions/auth@v2
        with:
          credentials_json: ${{ secrets.GCP_SA_KEY }}
      - uses: google-github-actions/setup-gcloud@v2
      - run: gcloud auth configure-docker REGION-docker.pkg.dev --quiet
      - run: |
          IMAGE=REGION-docker.pkg.dev/${{ secrets.GCP_PROJECT }}/ml-images/trainer:$(git rev-parse --short HEAD)
          # build from the repo root so the shared binflow/ package is inside the Docker context
          docker build -f services/trainer/Dockerfile -t $IMAGE .
          docker push $IMAGE

2) AWS Track — Noah on SageMaker

⚙️ Architecture (AWS)

Dev -> GitHub Actions -> ECR -> SageMaker (Train/Batch/Realtime Endpoints)
                         |    -> ECS Fargate (feature API / workers)
Data -> S3 (raw/feat) -> Athena/Glue (analytics)
Meta -> DynamoDB (pattern_ledger) + CloudWatch + X-Ray

🗂 Repo layout (AWS)

noah-sagemaker-ml/
├─ infra/
│  ├─ terraform/
│  │  ├─ main.tf            # s3, ecr, iam roles, dynamodb
│  │  └─ sagemaker.tf       # endpoints, exec roles
├─ services/
│  ├─ trainer/
│  │  ├─ Dockerfile
│  │  ├─ train.py
│  │  └─ requirements.txt
│  ├─ batch_infer/
│  └─ feature_api/          # FastAPI -> ECS Fargate
├─ pipelines/
│  └─ sagemaker_pipeline.py # SM Pipeline (optional)
├─ ops/
│  ├─ gh-actions/
│  │  ├─ build_push_aws.yml
│  │  └─ deploy_sagemaker.yml
│  └─ makefile
├─ binflow/
│  ├─ ledger_dynamo.py      # minimal pattern ledger on DynamoDB
│  └─ phases.py
└─ README.md

🧱 Terraform (AWS) — minimal core

# infra/terraform/main.tf
provider "aws" { region = var.region }

resource "aws_s3_bucket" "data" {
  bucket = "${var.project}-ml-data"
}

resource "aws_ecr_repository" "repo" {
  name = "ml-images"
  image_scanning_configuration { scan_on_push = true }
}

resource "aws_dynamodb_table" "ledger" {
  name         = "${var.project}-pattern-ledger"
  billing_mode = "PAY_PER_REQUEST"
  hash_key     = "pattern_id"
  attribute { name = "pattern_id" type = "S" }
}

🐳 Trainer (AWS)

# services/trainer/Dockerfile
FROM python:3.11-slim
WORKDIR /app
# build from the repo root (docker build -f services/trainer/Dockerfile .)
# so the shared binflow package that train.py imports ends up in the image
COPY services/trainer/requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY binflow ./binflow
COPY services/trainer/train.py .
ENV AWS_DEFAULT_REGION=us-east-1
CMD ["python", "train.py"]
# services/trainer/requirements.txt
boto3
pandas
scikit-learn
# services/trainer/train.py
import os, json, time, boto3
from binflow.ledger_dynamo import log_event

S3_BUCKET = os.getenv("S3_BUCKET")

def main():
    log_event(phase="Focus", action="trainer:start", payload={"bucket": S3_BUCKET})
    time.sleep(2)
    model_uri = f"s3://{S3_BUCKET}/models/model.pkl"
    # save artifact…
    log_event(phase="Emergence", action="trainer:complete", payload={"model_uri": model_uri})
    print(json.dumps({"model_uri": model_uri}))

if __name__ == "__main__":
    main()
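
Same idea as the GCP trainer: the "# save artifact…" step is left open, so here is a minimal boto3 upload sketch (helper name, local path, and key are assumptions):

# services/trainer/save_artifact.py (hypothetical helper)
import pickle
import boto3

def save_model_to_s3(model, bucket, key="models/model.pkl"):
    """Pickle the trained model locally, then upload it to S3."""
    local_path = "/tmp/model.pkl"
    with open(local_path, "wb") as f:
        pickle.dump(model, f)
    boto3.client("s3").upload_file(local_path, bucket, key)
    return f"s3://{bucket}/{key}"

# usage inside train.py: model_uri = save_model_to_s3(model, S3_BUCKET)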

🔧 GitHub Actions (AWS)

# ops/gh-actions/build_push_aws.yml
name: Build & Push (AWS)
on: [push]
jobs:
  build-push:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: aws-actions/configure-aws-credentials@v4
        with:
          aws-access-key-id: ${{ secrets.AWS_ACCESS_KEY_ID }}
          aws-secret-access-key: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
          aws-region: ${{ secrets.AWS_REGION }}
      - run: |
          REGISTRY="${{ secrets.AWS_ACCOUNT_ID }}.dkr.ecr.${{ secrets.AWS_REGION }}.amazonaws.com"
          REPO="$REGISTRY/ml-images"
          aws ecr get-login-password --region ${{ secrets.AWS_REGION }} | docker login --username AWS --password-stdin $REGISTRY
          TAG=$(git rev-parse --short HEAD)
          # build from the repo root so the shared binflow/ package is inside the Docker context
          docker build -f services/trainer/Dockerfile -t $REPO:$TAG .
          docker push $REPO:$TAG

3) The Shared “Reps Framework” (both clouds)

🧩 What “reps” means here

  • Every experiment is a pattern (code + config + data snapshot).
  • Each run logs BINFLOW phase events: Focus (setup), Loop (iterations), Transition (deploy), Pause (idle), Emergence (new artifact). A minimal phases.py sketch follows this list.
  • Proof-of-Leverage (PoL) aggregates usage across time (how often a pattern is reused, promoted, or composed with others).
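
binflow/phases.py appears in both repo layouts but isn't shown; a minimal sketch of what it could contain (the Phase enum name is an assumption):

# binflow/phases.py (sketch; the enum name is an assumption)
from enum import Enum

class Phase(str, Enum):
    """BINFLOW phases attached to every ledger event."""
    FOCUS = "Focus"            # setup
    LOOP = "Loop"              # iterations
    TRANSITION = "Transition"  # deploy
    PAUSE = "Pause"            # idle
    EMERGENCE = "Emergence"    # new artifact

# usage: log_event(phase=Phase.FOCUS.value, action="trainer:start")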

🧾 Minimal pattern ledger (GCP Firestore or AWS DynamoDB)

GCP (Firestore):

# binflow/ledger.py
import os, time, uuid
from google.cloud import firestore

PROJECT = os.getenv("PROJECT")
_db = firestore.Client(project=PROJECT)
COLL = "pattern_ledger"

def log_event(phase, action, payload=None, pattern_id=None, actor_id="system"):
    doc = {
      "pattern_id": pattern_id or str(uuid.uuid4()),
      "phase": phase,
      "action": action,
      "actor_id": actor_id,
      "t_external": firestore.SERVER_TIMESTAMP,
      "t_internal_ms": int(time.time()*1000),
      "payload": payload or {}
    }
    _db.collection(COLL).add(doc)
    return doc["pattern_id"]
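
For the PoL aggregation later on, you'll also want to read events back; a sketch of a read helper appended to the same module (get_events is an assumption, not in the original ledger):

# appended to binflow/ledger.py (sketch)
def get_events(pattern_id):
    """Return all ledger events logged for one pattern, oldest first."""
    query = _db.collection(COLL).where("pattern_id", "==", pattern_id)
    return sorted((doc.to_dict() for doc in query.stream()),
                  key=lambda e: e["t_internal_ms"])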

AWS (DynamoDB):

# binflow/ledger_dynamo.py
import os, time, uuid, boto3
ddb = boto3.resource("dynamodb").Table(os.getenv("LEDGER_TABLE","project-pattern-ledger"))

def log_event(phase, action, payload=None, pattern_id=None, actor_id="system"):
    pid = pattern_id or str(uuid.uuid4())
    item = {
      "pattern_id": pid,
      "event_id": str(uuid.uuid4()),
      "phase": phase,
      "action": action,
      "actor_id": actor_id,
      "t_internal_ms": int(time.time()*1000),
      "t_external": int(time.time()),
      "payload": payload or {}
    }
    ddb.put_item(Item=item)
    return pid
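
And the matching read-back on DynamoDB, assuming the pattern_id + event_id key schema from the Terraform above (get_events is again an assumption):

# appended to binflow/ledger_dynamo.py (sketch)
from boto3.dynamodb.conditions import Key

def get_events(pattern_id):
    """Return all ledger events logged for one pattern, oldest first."""
    resp = ddb.query(KeyConditionExpression=Key("pattern_id").eq(pattern_id))
    return sorted(resp["Items"], key=lambda e: e["t_internal_ms"])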

🧮 Quick PoL scoring (both)

  • Weight phases (e.g., Emergence 1.8, Loop 1.4…)
  • PoL = Σ phase_weight × log(1 + tokens_out/size) × time_decay

(You can compute PoL in BigQuery/Athena on the ledger table, or inside your trainer/CI to display it on dashboards.)
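
A minimal Python sketch of that scoring over ledger events; it pairs with the get_events helpers sketched earlier, and the remaining phase weights, the tokens_out/size payload fields, and the decay half-life are all assumptions:

# binflow/pol.py (sketch)
import math, time

# Emergence 1.8 and Loop 1.4 come from the note above; the other weights are assumed
PHASE_WEIGHTS = {"Focus": 1.0, "Loop": 1.4, "Transition": 1.2, "Pause": 0.5, "Emergence": 1.8}
HALF_LIFE_DAYS = 30  # assumed time-decay half-life

def pol_score(events, now_ms=None):
    """PoL = Σ phase_weight × log(1 + tokens_out/size) × time_decay over ledger events."""
    now_ms = now_ms or int(time.time() * 1000)
    score = 0.0
    for e in events:
        weight = PHASE_WEIGHTS.get(e["phase"], 1.0)
        payload = e.get("payload", {})
        leverage = math.log1p(payload.get("tokens_out", 0) / max(payload.get("size", 1), 1))
        age_days = float(now_ms - e["t_internal_ms"]) / 86_400_000  # ms per day
        decay = 0.5 ** (age_days / HALF_LIFE_DAYS)
        score += weight * leverage * decay
    return score

# e.g. pol_score(get_events(pattern_id))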


4) Fast “Hello World” flows

Ava @ GCP — train & deploy

  1. terraform apply
  2. Build + push trainer image via Actions
  3. Trigger the Vertex AI CustomJob from a GH Action or locally (a Python alternative is sketched after this list):
gcloud ai custom-jobs create \
  --region=$REGION \
  --display-name="trainer" \
  --config=trainer_job.yaml
  4. Register model → create endpoint
  5. Cloud Run feature_api for online features
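
If you'd rather not maintain trainer_job.yaml, roughly the same custom job can be submitted from Python with the Vertex AI SDK (project, region, bucket, image URI, and service account are placeholders):

# run_trainer_job.py (hypothetical alternative to the gcloud command above)
from google.cloud import aiplatform

aiplatform.init(project="PROJECT", location="us-central1", staging_bucket="gs://PROJECT-ml-data")

job = aiplatform.CustomJob(
    display_name="trainer",
    worker_pool_specs=[{
        "machine_spec": {"machine_type": "n1-standard-4"},
        "replica_count": 1,
        "container_spec": {
            "image_uri": "us-central1-docker.pkg.dev/PROJECT/ml-images/trainer:latest",
            # env vars read by train.py
            "env": [{"name": "PROJECT", "value": "PROJECT"},
                    {"name": "BUCKET", "value": "PROJECT-ml-data"}],
        },
    }],
)
job.run(service_account="vertex-exec@PROJECT.iam.gserviceaccount.com")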

Noah @ AWS — train & deploy

  1. terraform apply
  2. Build + push trainer image to ECR
  3. Create a SageMaker Training Job (console or boto3; see the sketch after this list)
  4. Register model + EndpointConfig + Endpoint
  5. ECS Fargate feature_api for online features
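
A minimal boto3 sketch for step 3 (job name, role ARN, image URI, instance type, and bucket are placeholders/assumptions):

# run_trainer_job.py (hypothetical; uses the trainer image pushed to ECR above)
import boto3

sm = boto3.client("sagemaker")
sm.create_training_job(
    TrainingJobName="noah-trainer-001",
    AlgorithmSpecification={
        "TrainingImage": "ACCOUNT_ID.dkr.ecr.us-east-1.amazonaws.com/ml-images:TAG",
        "TrainingInputMode": "File",
        # the Dockerfile uses CMD, so point SageMaker at train.py explicitly
        "ContainerEntrypoint": ["python", "train.py"],
    },
    RoleArn="arn:aws:iam::ACCOUNT_ID:role/sagemaker-exec",
    OutputDataConfig={"S3OutputPath": "s3://PROJECT-ml-data/models/"},
    ResourceConfig={"InstanceType": "ml.m5.large", "InstanceCount": 1, "VolumeSizeInGB": 30},
    StoppingCondition={"MaxRuntimeInSeconds": 3600},
    # env vars read by train.py and the DynamoDB ledger
    Environment={"S3_BUCKET": "PROJECT-ml-data", "LEDGER_TABLE": "PROJECT-pattern-ledger"},
)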

5) Why this slaps (for collabs)

  • Symmetry: same mental model on GCP/AWS—easy cross-cloud hiring.
  • Reps-first: experiments are first-class citizens; every run is reusable.
  • BINFLOW phases: adds time-aware semantics to your logs without changing core ML code.
  • Day-1 deployable: both stacks boot with Terraform and minimal Docker.
