DEV Community: David Montoya

Building a schema-aware RAG agent with DuckDB and LangChain Go

David Montoya — Mon, 26 Jan 2026 16:08:20 +0000

This guide translates abstract concepts of agentic RAG into a concrete, end-to-end implementation.

The problem

I recently had to interpret a user-provided "change request" string and match it against the one or more API fields that must be modified to fulfill the request. This is a classic example of "semantic classification" in AI engineering.

The most obvious approach is a simple one-shot inference call: ask the LLM to select the correct field(s) from a provided list. This is easy to implement, as you simply enumerate the fields in the prompt with enough detail for the LLM to make a decision. However, it does not scale: the list of fields can grow to hundreds or even thousands of entries, which floods the context window, increases latency (processing more tokens takes longer), and ultimately sacrifices accuracy due to the "lost in the middle" phenomenon.

Asking the LLM to filter through large blocks of input text is known as the "finding a needle in a haystack" challenge. The main problem that arises when searching for a piece of information buried in a massive context window is that models tend to forget details located in the middle, while remembering details at the beginning and end best. This is the "lost in the middle" phenomenon. One approach to mitigate it is to use a search engine to do the filtering first (finding the potential needles), allowing the LLM to work directly with the subset of most relevant data. This is the retrieval-augmented generation (RAG) approach.

RAG is not always the best solution. It depends on the nature of the data: when processing large paragraphs of text, RAG can be limited in its ability to reason over distant parts of the data because chunks are retrieved in isolation rather than as a coherent whole. For my use case, however, API fields are discrete pieces of information that are already clumped by domain (API schema and endpoint), making RAG a strong fit.

RAG alone, however, is often not enough. Retrieval typically returns entries that score highly by similarity, which can include results that look relevant but don't best match the user's intent. Reranking is a crucial technique to re-evaluate retrieved candidates in light of the user's request and select the most contextually appropriate entries.

Anatomy of a RAG

Let's go through the steps required to build and consume a RAG system to find the field(s) that match a user's change request.

1. Prepare the schemas: Collect the schemas and fields that are supported. The field details are important: field name, path, description, type, additional context and the source schema.
2. Generate embeddings for each field: Iterate over all fields and generate a vector embedding (numerical representation) for each field's string representation (a concatenation of name, description, context, etc).
3. Store the fields in a vector DB: Insert each field's vector embedding in a database that supports vector similarity search, like DuckDB 😎.
4. Wire up your app to access the DB: Set up your app so that it can connect to the vector DB instance.
5. Generate an embedding for the user request: Using the same "embedder" from step 2, generate a vector representation of the user query. This vector is used to query the DB.
6. Set up a RAG retriever: The retriever is a function that executes the "similarity search" against the DB relying on components from step #4 and #5. It returns a list of candidate "documents" (fields) that are mathematically similar to the query. LangChain supports this 😎.
7. Set up the RAG chain: The chain puts together the retriever from step #6 along with the instructions for the LLM on how to choose the appropriate fields. This is where the final "reasoning" happens. LangChain is great at this 😎.

Steps 1 through 3 are "build" concerns (often called the Ingestion Pipeline). Steps 4 through 7 are "runtime" concerns performed by the app to process live user requests.

The implementation

Let's put it all together with DuckDB and LangChain Go. For the API fields I'm going to use the GitHub API. For the LLM and embedder I'm going to use GCP's Vertex models.

1. Prepare the schemas:

Collect each API field into a list of SchemaField instances that we can iterate over later on to generate the embeddings.

The SchemaField type:

// SchemaField represents the data we want to embed
type SchemaField struct {
    // The path to the field in dot notation
    Path string // E.g.: pull_request.required_approving_review_count
    // The "humanized" context
    Context string // E.g.: required pull request reviews
    // The field name
    FieldName string // E.g.: required approving review count
    // The intent of the field
    Description string // E.g.: Number of reviewers required to approve pull requests (0-6, where 0 means not required)
    // The type to disambiguate from fields with similar names
    Type string // E.g.: integer
    // The schema source of the field
    Schema string // E.g.: "classic_branch_protection" or "ruleset_branch_protection"
    // The aliases of the field for cases where the field is referred to by a different name
    Aliases []string // E.g.: ["approvals", "approving reviews", "review approvals", "review approvals count"]
}

A list of "repo" fields:

I'm only providing a few fields to keep it readable. In the same way there's a method getRepoFields() to capture all the "repo" fields, there would be other methods to collect "branch protection" fields, "rulesets" fields, etc.

func getRepoFields() []SchemaField {
  return []SchemaField{
    {
      Path:        "allow_auto_merge",
      Context:     "repository",
      FieldName:   "allow auto merge",
      Description: "Either true to allow auto-merge on pull requests, or false to disallow auto-merge. Default: false",
      Type:        "boolean",
      Schema:      "repo_fields",
    },
    {
      Path:        "allow_forking",
      Context:     "repository",
      FieldName:   "allow forking",
      Description: "Either true to allow private forks, or false to prevent private forks. Default: false",
      Type:        "boolean",
      Schema:      "repo_fields",
    },
    {
      Path:        "allow_merge_commit",
      Context:     "repository",
      FieldName:   "allow merge commit",
      Description: "Either true to allow merging pull requests with a merge commit, or false to prevent merging pull requests with merge commits. Default: true",
      Type:        "boolean",
      Schema:      "repo_fields",
    },
    {
      Path:        "allow_rebase_merge",
      Context:     "repository",
      FieldName:   "allow rebase merge",
      Description: "Either true to allow rebase-merging pull requests, or false to prevent rebase-merging. Default: true",
      Type:        "boolean",
      Schema:      "repo_fields",
    },
    {
      Path:        "allow_squash_merge",
      Context:     "repository",
      FieldName:   "allow squash merge",
      Description: "Either true to allow squash-merging pull requests, or false to prevent squash-merging. Default: true",
      Type:        "boolean",
      Schema:      "repo_fields",
    },
    // More fields go here
    // ...
  }
}

2. Generate embeddings for each field:

To generate the embedding we need a string representation of each field. We can do so by adding a toEmbed method to the SchemaField type like this:

func (s SchemaField) toEmbed() string {
    aliases := ""
    if len(s.Aliases) > 0 {
        aliases = fmt.Sprintf(" (also known as: %s)", strings.Join(s.Aliases, ", "))
    }

    // Natural language format:
    // "Field '<Name>' <Aliases> is a <Type> used to <Description> [Context: <Context>]"
    return fmt.Sprintf(
        "Field '%s'%s is a %s used to %s [Context: %s]",
        s.FieldName, aliases, humanizeType(s.Type), s.Description, s.Context,
    )
}

func humanizeType(t string) string {
    switch t {
    case "boolean":
        return "Toggle switch (True/False)"
    case "integer":
        return "Integer count (e.g. 1, 2, 5)"
    }
    return t
}

A few best-practice considerations for the embed string:

Frontload high-signal information: Place the most semantically rich fields at the beginning of the string, as they are the most likely to match the terminology used by the user.
Use natural language serialization: Many embedding models are optimized for natural language sentences. Framing the data as a coherent statement can yield better results than a robotic list of key/value pairs.
Encode aliases: Users may refer to some fields with alternative names. Explicitly encode those to increase the chances of matching.
Iterate!: I found variations of the format to yield subtle differences in the similarity search. You should rely on integration tests to control variance and account for the subtleties of your data as you try out new formats.
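To make these considerations concrete, here is a minimal, self-contained sketch that prints the embed string for one field. It repeats toEmbed and humanizeType from above with a trimmed-down SchemaField (only the fields the method reads); the sample field values are illustrative.

```go
package main

import (
	"fmt"
	"strings"
)

// SchemaField is a trimmed-down copy of the type from step 1,
// keeping only the fields that toEmbed reads.
type SchemaField struct {
	Context     string
	FieldName   string
	Description string
	Type        string
	Aliases     []string
}

// humanizeType maps a schema type to a more descriptive phrase.
func humanizeType(t string) string {
	switch t {
	case "boolean":
		return "Toggle switch (True/False)"
	case "integer":
		return "Integer count (e.g. 1, 2, 5)"
	}
	return t
}

// toEmbed serializes the field as a natural language sentence.
func (s SchemaField) toEmbed() string {
	aliases := ""
	if len(s.Aliases) > 0 {
		aliases = fmt.Sprintf(" (also known as: %s)", strings.Join(s.Aliases, ", "))
	}
	return fmt.Sprintf(
		"Field '%s'%s is a %s used to %s [Context: %s]",
		s.FieldName, aliases, humanizeType(s.Type), s.Description, s.Context,
	)
}

func main() {
	field := SchemaField{
		Context:     "required pull request reviews",
		FieldName:   "required approving review count",
		Description: "Number of reviewers required to approve pull requests",
		Type:        "integer",
		Aliases:     []string{"approvals", "approving reviews"},
	}
	// Prints:
	// Field 'required approving review count' (also known as: approvals, approving reviews) is a Integer count (e.g. 1, 2, 5) used to Number of reviewers required to approve pull requests [Context: required pull request reviews]
	fmt.Println(field.toEmbed())
}
```

Printing the string this way makes awkward phrasing (such as "is a Integer count ... used to Number of ...") easy to spot, which is one cheap way to find candidates for the next format iteration.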

3. Store the fields in a vector DB:

In order to perform similarity search against the fields, let's create a DuckDB instance and table with vector embedding support:

import (
    "database/sql"
    "fmt"

    // Import the DuckDB driver
    _ "github.com/duckdb/duckdb-go/v2"
)

const (
    FieldEmbeddingsTableDDL = `
    -- Enable the VSS extension for vector similarity search
    INSTALL vss;
    LOAD vss;

    -- Set the experimental flag that allows the HNSW index to be persisted in a file-backed database.
    SET hnsw_enable_experimental_persistence = true;

    CREATE TABLE schema_fields (
        path        TEXT, -- E.g.: "required_pull_request_reviews.count"
        context     TEXT, -- E.g.: "required pull request reviews"
        field_name  TEXT, -- E.g.: "count"
        description TEXT, -- E.g.: "Number of reviewers..."
        field_type  TEXT, -- E.g.: "integer" (renamed from 'type' to avoid keyword conflict)
        schema      TEXT, -- E.g.: "ruleset_branch_protection"

        -- The Vector Column
        -- Size 768 is specific to Google's "text-embedding-004" model
        embedding   FLOAT[768]
    );

    -- Since we rely on matching meaning (which is mathematically "direction" in vector space),
    -- we must use Cosine Similarity.
    CREATE INDEX idx_schema_fields_embedding
    ON schema_fields
    USING HNSW (embedding)
    WITH (metric = 'cosine');
`

    FieldEmbeddingDML = `
    INSERT INTO schema_fields (path, context, field_name, description, field_type, schema, embedding)
    VALUES (?,?,?,?,?,?,?);
    `
)

And now the actual logic to write the DB instance to a file and insert the embeddings:

type FieldEmbeddingsDB struct {
    db *sql.DB
}

func NewFieldEmbeddingsDB() (*FieldEmbeddingsDB, error) {
    // create in-file DuckDB instance
    db, err := sql.Open("duckdb", "dist/embeddings/schema-fields.duckdb")
    if err != nil {
        return nil, fmt.Errorf("failed to open in-file duckdb: %w", err)
    }

    // create embeddings table schema
    if _, err := db.Exec(FieldEmbeddingsTableDDL); err != nil {
        db.Close()
        return nil, fmt.Errorf("failed to create field embeddings table schema: %w", err)
    }

    return &FieldEmbeddingsDB{db: db}, nil
}

func (f *FieldEmbeddingsDB) StoreFieldEmbeddings(embeddings []float32, field SchemaField) error {
    _, err := f.db.Exec(FieldEmbeddingDML,
        field.Path,
        field.Context,
        field.FieldName,
        field.Description,
        field.Type,
        field.Schema,
        embeddings,
    )
    if err != nil {
        return fmt.Errorf("failed to insert field embedding in DuckDB: %w", err)
    }
    return nil
}

func (f *FieldEmbeddingsDB) Close() error {
    return f.db.Close()
}

Lastly, iterate over the fields, generate the embedding and use the FieldEmbeddingsDB client to insert them:

    // 1. Initialize the Embedder
    embedder, err := vertex.NewGoogleEmbedder(ctx)
    if err != nil {
        log.Fatalf("Failed to create embedder: %v", err)
    }

    // 2. Define the schema fields to embed
    schemaFields := getRepoFields()

    // 3. Prepare the text to be embedded
    var textsToEmbed []string
    for _, field := range schemaFields {
        textsToEmbed = append(textsToEmbed, field.toEmbed())
    }

    // 4. Generate Embeddings (Batch Call)
    vectors, err := embedder.EmbedDocuments(ctx, textsToEmbed)
    if err != nil {
        log.Fatalf("Failed to embed fields: %v", err)
    }

    // 5. Generate the DuckDB embeddings instance
    db, err := NewFieldEmbeddingsDB()
    if err != nil {
        log.Fatalf("Failed to create DuckDB embeddings instance: %v", err)
    }
    defer db.Close()

    // 6. Store the vectors in a DuckDB instance
    for i, field := range schemaFields {
        fieldVector := vectors[i]

        err := db.StoreFieldEmbeddings(fieldVector, field)
        if err != nil {
            log.Fatalf("Failed to store field embeddings: %v", err)
        }
    }

This concludes the "ingestion phase". Let's now move on to the runtime processing phase. Note that I didn't include the NewGoogleEmbedder function. It is covered in the next steps, as we'll also need it during runtime query processing.

4. Wire up your app to access the DB:

Let's now set up the app with a client to access and query the "fields" DuckDB instance written by step #3.

import (
    "context"
    "database/sql"
    "fmt"

    // Import the DuckDB driver
    _ "github.com/duckdb/duckdb-go/v2"
)

// FieldsDB is a readonly DuckDB instance to find fields that match a vector embedding
type FieldsDB struct {
    db *sql.DB
}

func NewFieldsDB(fieldsDBDirectoryPath string) (*FieldsDB, error) {
    // ?access_mode=READ_ONLY is crucial for concurrency safety
    dsn := fmt.Sprintf("%s/schema-fields.duckdb?access_mode=READ_ONLY", fieldsDBDirectoryPath)

    db, err := sql.Open("duckdb", dsn)
    if err != nil {
        return nil, fmt.Errorf("failed to open in-file duckdb with fields embeddings: %w", err)
    }

    return &FieldsDB{db: db}, nil
}

func (f *FieldsDB) Close() error {
    return f.db.Close()
}

type MatchedField struct {
    FieldName   string  `json:"fieldName"`
    Context     string  `json:"context"`
    Path        string  `json:"path"`
    Description string  `json:"description"`
    FieldType   string  `json:"fieldType"`
    Schema      string  `json:"schema"`
    Score       float64 `json:"score"`
}

// FindFields finds the fields most similar to the given embedding
func (f *FieldsDB) FindFields(
    ctx context.Context,
    embedding []float32,
) ([]MatchedField, error) {
    query := `
    LOAD vss;

    SELECT field_name,
        context,
        path,
        description,
        field_type,
        schema,
        array_cosine_similarity(embedding, ?::FLOAT[768]) AS similarity_score
    FROM schema_fields
    ORDER BY similarity_score DESC
    LIMIT 10
    `

    rows, err := f.db.QueryContext(
        ctx,
        query,
        embedding,
    )
    if err != nil {
        return nil, fmt.Errorf("failed to query fields: %w", err)
    }
    defer rows.Close()

    fields := []MatchedField{}
    for rows.Next() {
        var field MatchedField

        // Scan the columns in the same order as the SELECT list
        err := rows.Scan(
            &field.FieldName,
            &field.Context,
            &field.Path,
            &field.Description,
            &field.FieldType,
            &field.Schema,
            &field.Score,
        )
        if err != nil {
            return nil, fmt.Errorf("failed to scan field: %w", err)
        }

        fields = append(fields, field)
    }

    // Surface any error that ended the iteration early
    if err := rows.Err(); err != nil {
        return nil, fmt.Errorf("failed to iterate over matched fields: %w", err)
    }

    return fields, nil
}

5. Generate an embedding for the user request & 6. Set up a RAG retriever:

I'm bundling steps 5 and 6 since the RAG retriever in my app takes care of generating the embedding for the user request.

⚠️ Note ⚠️: You should not pass the raw user-provided string directly to the embedder. Normalize it before generating the embedding by removing action verbs, stop words and any irrelevant characters to ensure high-quality matching.
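The normalization itself is not shown in this post. As a minimal sketch of the idea, assuming a hand-maintained word list, it could look like the following. The normalizeChangeDescription function and the ignoredWords list are hypothetical names introduced for illustration; a real list would need tuning against your own requests.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// ignoredWords is a hypothetical, deliberately small list of action verbs
// and stop words that carry little signal for matching a field.
var ignoredWords = map[string]bool{
	"please": true, "set": true, "change": true, "update": true,
	"enable": true, "disable": true,
	"the": true, "a": true, "an": true, "to": true,
	"for": true, "of": true, "on": true, "in": true,
}

// normalizeChangeDescription lowercases the input, replaces every character
// that is not a letter or digit with a space, and drops ignored words.
func normalizeChangeDescription(raw string) string {
	cleaned := strings.Map(func(r rune) rune {
		if unicode.IsLetter(r) || unicode.IsDigit(r) {
			return unicode.ToLower(r)
		}
		return ' '
	}, raw)

	kept := []string{}
	for _, word := range strings.Fields(cleaned) {
		if !ignoredWords[word] {
			kept = append(kept, word)
		}
	}
	return strings.Join(kept, " ")
}

func main() {
	// Prints: auto merge repo
	fmt.Println(normalizeChangeDescription("Please enable auto-merge on the repo!"))
}
```

Dropping verbs like "enable" loses the direction of the change, which is acceptable for retrieval because only the field needs to be found at this stage; the raw request is still available for the final reasoning step.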

Let's now take a look at the embedder that must be used during both the ingestion and runtime processing phases:

import (
    "context"
    "fmt"

    "github.com/tmc/langchaingo/embeddings"
    "github.com/tmc/langchaingo/llms/googleai"
    "github.com/tmc/langchaingo/llms/googleai/vertex"
)

// NewGoogleEmbedder initializes the Google AI client specifically for embeddings
func NewGoogleEmbedder(ctx context.Context) (embeddings.Embedder, error) {
    llm, err := vertex.New(ctx,
        googleai.WithDefaultEmbeddingModel("text-embedding-004"),
        googleai.WithCloudProject("your-gcp-project"),
        googleai.WithCloudLocation("us-central1"),
    )
    if err != nil {
        return nil, fmt.Errorf("failed to create googleai client: %w", err)
    }

    // Create the Embedder wrapper
    // This interface unifies calls across OpenAI, Google, Ollama, etc.
    embedder, err := embeddings.NewEmbedder(llm)
    if err != nil {
        return nil, fmt.Errorf("failed to create embedder wrapper: %w", err)
    }

    return embedder, nil
}

And the LangChain retriever:

import (
    "context"
    "fmt"
    "log/slog"

    "github.com/tmc/langchaingo/embeddings"
    "github.com/tmc/langchaingo/schema"
)

type FieldRetriever struct {
    fieldsDB *database.FieldsDB
    embedder embeddings.Embedder
}

func NewFieldRetriever(
    fieldsDB *database.FieldsDB,
    embedder embeddings.Embedder,
) *FieldRetriever {
    return &FieldRetriever{
        fieldsDB: fieldsDB,
        embedder: embedder,
    }
}

// matchFields matches a change description to one or more fields using
// similarity search against the schema embeddings database.
// The change description should already be normalized.
func (r *FieldRetriever) matchFields(ctx context.Context, changeDescription string) ([]database.MatchedField, error) {
    embedding, err := r.embedder.EmbedQuery(ctx, changeDescription)
    if err != nil {
        return nil, fmt.Errorf("failed to embed change description: %w", err)
    }

    fields, err := r.fieldsDB.FindFields(ctx, embedding)
    if err != nil {
        return nil, fmt.Errorf("failed to match fields in schema embeddings database: %w", err)
    }

    return fields, nil
}

// GetRelevantDocuments implements schema.Retriever.
func (r *FieldRetriever) GetRelevantDocuments(
    ctx context.Context,
    query string,
) ([]schema.Document, error) {
    matchedFields, err := r.matchFields(ctx, query)
    if err != nil {
        slog.Error("failed to match fields via the RAG retriever", "error", err)
        return nil, fmt.Errorf("failed to match fields via the RAG retriever: %w", err)
    }

    docs := make([]schema.Document, len(matchedFields))
    for i, field := range matchedFields {
        content := fmt.Sprintf(
            "Field: %s\nContext: %s\nPath: %s\nDescription: %s\nFieldType: %s\nSchema: %s\nScore: %f",
            field.FieldName,
            field.Context,
            field.Path,
            field.Description,
            field.FieldType,
            field.Schema,
            field.Score,
        )
        docs[i] = schema.Document{
            PageContent: content,
            Metadata: map[string]any{
                "field_name": field.FieldName,
                "path":       field.Path,
                "type":       field.FieldType,
                "score":      field.Score, // similarity score from the vector DB
            },
        }
    }
    slog.Debug("matched fields via the RAG retriever", "count", len(docs))

    return docs, nil
}

7. Set up the RAG chain:

Lastly, set up the RAG chain with the retriever and instructions to select the fields:

⚠️ Note ⚠️: Notice how the agent method MatchFields accepts both a changeDescription with the raw user query and a normalizedChangeDescription with the normalized string. changeDescription provides the full context to perform the final reasoning.

type FieldMatcherAgent struct {
    llm            llms.Model
    fieldRetriever *FieldRetriever
}

// MatchFields matches a change description to the fields that must be modified to fulfill the change.
func (a *FieldMatcherAgent) MatchFields(
    ctx context.Context,
    changeDescription string,
    normalizedChangeDescription string,
) (MatchedFieldsResponse, error) {

    // Prepare the instructions for the RAG chain
    outputParser, err := outputparser.NewDefined(MatchedFieldsResponse{})
    if err != nil {
        slog.Error("failed to create output parser for the field matcher RAG chain", "error", err)
        return MatchedFieldsResponse{}, fmt.Errorf("failed to create output parser for the field matcher RAG chain: %w", err)
    }

    systemPrompt := prompts.NewSystemMessagePromptTemplate(`You are an expert in GitHub API configuration.
You will receive a "Change Description" and a list of "Candidate Fields".

Your task:
1. Analyze the Change Description.
2. Review the Candidate Fields provided in the Context.
3. Select ALL fields that must be modified to fulfill the change.
4. Copy the exact metadata (FieldName, Path, Score, etc.) from the candidate text into the response.

If no fields are a good match for the change description, return an empty list.`, []string{})

    // Interpolate the change description directly as it can't be passed via the RAG chain
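    // The format instructions are passed as an argument (not concatenated into the
    // format string) so that any '%' they contain is not treated as a formatting verb
    userPromptTemplate := fmt.Sprintf(`Match the change description to the fields that must be modified to fulfill the change.

Change Description: %s

Candidate Fields (Context):
{{.context}}

Format instructions:
%s`, changeDescription, outputParser.GetFormatInstructions())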

    userPrompt := prompts.NewHumanMessagePromptTemplate(userPromptTemplate, []string{"question", "context"})

    agentPrompt := prompts.NewChatPromptTemplate([]prompts.MessageFormatter{
        systemPrompt,
        userPrompt,
    })

    // Build the LLM chain that re-evaluates the candidates and selects the best field matches
    llmChain := chains.NewLLMChain(a.llm, agentPrompt)

    // Build StuffDocuments Chain
    // This chain concatenates all retrieved docs into the "{{.context}}" variable
    stuffChain := chains.NewStuffDocuments(llmChain)

    // Build the RAG chain
    rag := chains.NewRetrievalQA(
        stuffChain,
        a.fieldRetriever,
    )
    rag.InputKey = "query" // The key under which the chain reads the retriever query from the input map

    ragResult, err := chains.Call(ctx, rag, map[string]any{
        // Will be passed to the retriever (GetRelevantDocuments) as its query
        "query": normalizedChangeDescription,
    })
    if err != nil {
        slog.Error("failed to run RAG chain to find the matching fields", "error", err, "change_description", changeDescription)
        return MatchedFieldsResponse{}, fmt.Errorf("failed to run RAG chain: %w", err)
    }

    // The result is now a single JSON string containing all matches
    textResult, ok := ragResult["text"].(string)
    if !ok {
        return MatchedFieldsResponse{}, fmt.Errorf("unexpected output type from RAG chain")
    }

    matches, err := outputParser.Parse(textResult)
    if err != nil {
        slog.Error("failed to parse matched fields", "error", err)
        return MatchedFieldsResponse{}, fmt.Errorf("failed to parse matched fields: %w", err)
    }

    return matches, nil
}
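The MatchedFieldsResponse type is not shown above. As a sketch only, here is one hypothetical shape for it, together with a plain encoding/json parse of the kind of JSON the LLM would be asked to return. The struct fields, the parseMatchedFields helper and the sample values are all assumptions for illustration, not the types used in the original app.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// MatchedFieldResult is a hypothetical entry describing one selected field.
type MatchedFieldResult struct {
	FieldName string  `json:"fieldName"`
	Path      string  `json:"path"`
	Schema    string  `json:"schema"`
	Score     float64 `json:"score"`
}

// MatchedFieldsResponse is a hypothetical shape for the agent's output:
// the list of fields that must be modified to fulfill the change.
type MatchedFieldsResponse struct {
	Fields []MatchedFieldResult `json:"fields"`
}

// parseMatchedFields decodes the JSON text returned by the LLM.
func parseMatchedFields(raw string) (MatchedFieldsResponse, error) {
	var resp MatchedFieldsResponse
	if err := json.Unmarshal([]byte(raw), &resp); err != nil {
		return MatchedFieldsResponse{}, fmt.Errorf("failed to decode matched fields: %w", err)
	}
	return resp, nil
}

func main() {
	// Made-up example of a response; real scores come from the vector DB.
	raw := `{"fields":[{"fieldName":"allow auto merge","path":"allow_auto_merge","schema":"repo_fields","score":0.83}]}`

	resp, err := parseMatchedFields(raw)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	for _, f := range resp.Fields {
		// Prints: allow_auto_merge (repo_fields) 0.83
		fmt.Printf("%s (%s) %.2f\n", f.Path, f.Schema, f.Score)
	}
}
```

An empty "fields" list decodes to a response with no entries, which lines up with the prompt instruction to return an empty list when nothing is a good match.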

That's all!

10 milestones for the Internal Developer Platform. A roadmap for teams getting started

David Montoya — Wed, 04 Oct 2023 22:22:25 +0000

Internal Developer Platform (IDP) teams lay the rails for domain teams to ship apps and features (ultimately, to production) with low friction. Team Topologies (Skelton, Pais, 2019) defines the purpose of a platform team as "to enable stream-aligned teams to deliver work with substantial autonomy". The ability of domain teams to deliver that work depends partly (there are other organizational factors, of course) on the capabilities and developer experience provided by the platform. How, then, can a platform team set out to build those capabilities and lay the foundation to provide that developer experience? How can they take an incremental approach to building it so that domain teams can benefit from platform features early on? The milestones listed below trace a route for platform teams to lay that foundation and to provide essential features for their platform, like tenancy controls and secret management.

Not all milestones are achieved sequentially. A team or multiple teams may be at some point working simultaneously toward several milestones. Each milestone is a journey on which platform teams progress from early to advanced practitioners, as they incrementally add features and create abstractions, and gain momentum by building on top of stabilized tooling.

Before you go build an IDP

As Kris Nova and Justin Garrison mention in Cloud Native Infrastructure (Garrison, Nova, 2018), “When you’re building a platform to run applications, it’s important to know what you are getting into. Initial development is only a small fraction of what it takes to build and maintain a platform”. Maintaining and evolving the platform along with the organization will consume most of the team's effort. Before you set out to build a developer platform, define the guiding principles and the architectural -ilities that should influence your design and implementation. Keep a watchful eye out for unnecessary complexity and redundant infrastructure, as unchecked growth will increase cognitive load and get in the way of delivering new features for devs. Having "self-serviceability" influence the design (e.g., of pipelines, APIs or Kubernetes CRDs) can help unlock domain teams' autonomy. Modularity and evolvability are other -ilities to consider so that the platform can weather the flux of change in organizations and avoid costly migrations every few years.

1. Bootstrapping a developer platform

On day 0, there is only a root service account for some cloud provider. There are no "Admin" or "Team" personas, and no definition of "Environments". There are no pipelines ("Provisioner" persona) to manage cloud infrastructure nor clusters for workloads to run. Using the root service account, the platform team seeds the roles and IAM policies required by the "Admin" and "Provisioner" personas to continue scaffolding the platform. If using Terraform, the state file should be clear of secrets and versioned along with the bootstrapping code. A bucket or other storage backend is required to store the state for subsequent infrastructure. Resources managed by this component may require more permissions than those given to admins; creating a service account with elevated permissions and allowing admins to cut short-lived keys serves as an escape hatch when additional permissions are required. At the end of day 0, members of the platform team should be able to use their “admin” credentials to access the cloud.

Code produced in this milestone should have a low frequency of change as time passes. A high frequency of change indicates that a given configuration should be promoted to a tenancy control in a separate pipeline with fewer permissions.

2. Implement access controls for teams

This is day 1. There are already one or a couple of teams waiting for access to begin deploying their apps. Team boundaries must be delineated to prevent teams from stepping on each other, to enforce resource quotas, and to grant least privilege access. A pipeline can be created to bind IAM roles and policies to each domain team. If using Terraform, optimize for configurability by having an internal module represent a single instance of a team and invoke it for each onboarded team with different inputs. Optimize for self-serviceability by surfacing all customizable settings in a format that teams can edit easily, like a YAML file, and by requiring approvals from platform and security engineers. Customizable IAM roles can help enforce the principle of least privilege. Permissions granted via this pipeline may be long-lived, but only until the team unlocks tenancy 2.0 and allows for break-glass workflows to get just-in-time elevated access to environments and cloud resources.

Code produced in this milestone evolves as the organization grows in teams and workloads. Separating language-specific configuration from tenancy metadata allows evolving this component into an API or a Kubernetes CRD.
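As a rough sketch of what tool-agnostic tenancy metadata could look like, here is a small Go type with the kind of validation a pipeline might run before binding any roles. The TeamTenancy type, its fields and the sample teams are hypothetical and only illustrate keeping the metadata separate from Terraform or YAML specifics.

```go
package main

import (
	"errors"
	"fmt"
)

// TeamTenancy is a hypothetical, tool-agnostic record for one onboarded team.
type TeamTenancy struct {
	Name         string
	Members      []string
	Environments []string // e.g. "development", "production"
	Roles        []string // names of the IAM roles to bind to the team
}

// Validate checks the minimum a pipeline would need before binding roles.
func (t TeamTenancy) Validate() error {
	if t.Name == "" {
		return errors.New("team name is required")
	}
	if len(t.Members) == 0 {
		return fmt.Errorf("team %q has no members", t.Name)
	}
	if len(t.Roles) == 0 {
		return fmt.Errorf("team %q requests no roles", t.Name)
	}
	return nil
}

func main() {
	teams := []TeamTenancy{
		{
			Name:         "payments",
			Members:      []string{"ana"},
			Environments: []string{"development"},
			Roles:        []string{"viewer"},
		},
		{Name: "search"}, // incomplete on purpose
	}

	for _, team := range teams {
		if err := team.Validate(); err != nil {
			fmt.Println("rejected:", err)
			continue
		}
		fmt.Println("accepted:", team.Name)
	}
}
```

Because the record carries no tool-specific details, the same data could later be served by an API or mapped onto a CRD spec without changing what teams have to provide.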

3. Provide isolated environments for applications

Developers require environments to deploy and iterate on their apps as they progress towards production. Defining environment boundaries at the network level ensures isolation and stability. A blueprint for early platform teams to quickly get started is to create a Kubernetes cluster per environment, each with its dedicated VPC network. As the organization grows, so does the footprint of the apps and APIs running on each cluster, which requires the platform team to evolve the cluster topology to allow for more specialized definitions of environments or advanced cluster management strategies like blue/green deployment. Once an environment is operational, enhance the pipelines and components produced on milestone #2 to allow managing teams' access to namespaces with Kubernetes RBAC.

Code produced in this milestone has a consistent frequency of change as new "cluster components" will be added to support various platform features like sidecar injectors, secret and certificate managers, config reloaders, telemetry collectors, API gateways, etc. Separating language-specific configuration from cluster metadata allows evolving this component into a pipeline to manage the lifecycle of multiple clusters.

4. Unlock configuration management

Both admins and users require a development workflow to configure Kubernetes and Cloud APIs. Terraform coupled with Atlantis or Argo Workflows can allow managing cloud resources a la GitOps. Kubernetes apps are configured with YAML documents, which can be templated with tools like Helm, Jsonnet or CUE to reduce repetitiveness, perform validations, and modify applications in bulk. For mid-level to advanced teams, choosing a configuration language like Jsonnet allows for creating expressive abstractions with functions or object-orientation; those can then be published as libraries (or "libsonnets") for other apps to import. Both templated and raw YAML documents can then be applied to their destination clusters with tools like Flux or ArgoCD.

As the platform team grows, they may favor Kubernetes-style configuration and unify their cloud configuration workflows with tools like GoogleCloudPlatform/k8s-config-connector or by developing custom CRDs with Kubebuilder or Crossplane, which I talk about in milestone #9.

Libraries, pipelines and documentation created in this milestone begin focused on the needs of the platform team and, once tested, are extended to domain teams.

5. Allow reusing build jobs and steps

Build pipelines are required to build and test code, and to publish packaged applications like Docker images. Platform teams can create reusable pipeline steps with tools like GitHub Actions or CircleCI Orbs, which can then be combined to implement CI (Continuous Integration) and CD (Continuous Delivery) workflows. You will need an artifact registry and a set of base images that developers can extend (FROM). Put them on a nightly build so you can keep them patched. Encourage teams to practice semantic versioning to tag the artifacts produced and track them across environments. Platform teams with an advanced Kubernetes practice or particular build requirements may choose to run their own CI/CD system with Tekton or Argo Workflows.

Code produced includes artifact registries, reusable actions or jobs, IAM service accounts, example applications and documentation. Build pipelines may require over-privileged service accounts. If using HashiCorp Vault, use a scheduled task to rotate the API key at least once daily.

6. Implement secret management

Secrets are tricky to deal with. Devs can't store them with code, or at least not unencrypted; they vary across environments (development vs production); some are required by apps to run successfully; and rotating them in the event of a security incident (or just as good security hygiene) should require neither application downtime nor action from more than one team. Cloud native secret managers like AWS Systems Manager Parameter Store and GCP's Secret Manager are handy for the platform team to bootstrap infrastructure and share secrets with admins. For Kubernetes platforms, there are tools like ExternalSecrets and the Kubernetes Container Storage Interface (CSI), which map secrets from various sources to Kubernetes secrets and volumes that are then mounted by Deployments or StatefulSets. HashiCorp Vault is also a great companion to any Kubernetes app. When injected as a sidecar, it allows the app to auto-reload on secret changes or to issue short-lived credentials to access databases and cloud APIs, using Vault's Secret Backends and custom plugins. For mid-level to advanced platform teams, HashiCorp Vault helps prevent secret sprawl and unlocks workflows for devs to access cloud resources securely. Thanks to the Vault agent's caching capabilities, it also mitigates the thundering herd problem on the Vault server and the Kubernetes API server (used for authentication).

Code produced includes secret stores and RBAC controls for workloads and users to access them, cluster operators and sidecar injectors, and secret self-service documentation.
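On the application side, "auto-reload on secret changes" can be as simple as re-reading a mounted file. The Go sketch below assumes the secret is rendered to a file by a sidecar or a CSI volume; the `FileSecret` type and its methods are hypothetical names of mine, and a temp file stands in for the mounted path so the example runs anywhere.

```go
package main

import (
	"fmt"
	"os"
	"strings"
	"sync"
)

// FileSecret caches a secret read from a file, such as one mounted into the
// Pod by a CSI volume or rendered by a sidecar. The caller supplies the path.
type FileSecret struct {
	path  string
	mu    sync.RWMutex
	value string
}

// NewFileSecret returns a cache for the secret stored at path.
func NewFileSecret(path string) *FileSecret {
	return &FileSecret{path: path}
}

// Reload re-reads the file and reports whether the cached value changed.
func (s *FileSecret) Reload() (bool, error) {
	data, err := os.ReadFile(s.path)
	if err != nil {
		return false, err
	}
	next := strings.TrimSpace(string(data))
	s.mu.Lock()
	defer s.mu.Unlock()
	if next == s.value {
		return false, nil
	}
	s.value = next
	return true, nil
}

// Value returns the most recently loaded secret.
func (s *FileSecret) Value() string {
	s.mu.RLock()
	defer s.mu.RUnlock()
	return s.value
}

func main() {
	// A temp file stands in for the mounted secret path.
	f, err := os.CreateTemp("", "db-password")
	if err != nil {
		panic(err)
	}
	defer os.Remove(f.Name())
	f.Close()

	secret := NewFileSecret(f.Name())
	for _, v := range []string{"first-password", "rotated-password"} {
		if err := os.WriteFile(f.Name(), []byte(v+"\n"), 0o600); err != nil {
			panic(err)
		}
		changed, err := secret.Reload()
		if err != nil {
			panic(err)
		}
		// Print only metadata; never log the secret itself.
		fmt.Println("changed:", changed, "length:", len(secret.Value()))
	}
}
```

In a long-running service, `Reload` would be called from a ticker or a file watcher, and a `true` result would trigger reconnecting to the database with the new credential, so rotation needs no restart.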

7. Ensure application monitoring and observability

Teams can't just launch critical services into production with no visibility. Part of making a Kubernetes cluster operational for apps and humans is to deploy log collectors and metrics scrapers as part of the lifecycle of every cluster. Allow domain teams to access those logs and metrics so they can create visualization dashboards and monitoring alerts, which they'll need to operate their service in production. If your org is not already using a managed monitoring stack, Prometheus + Alertmanager + Grafana is an industry battle-tested OSS stack for Kubernetes. At the very least, your Kubernetes monitoring stack should report on each app's resource utilization and allow visualizing patterns over time. As cluster operators, platform teams implement safety checks and quotas to enforce tenancy and ensure stability for other apps; as reliability consultants, they encourage best practice configurations for resource allocation and logging, and advise on performance optimizations and alert tuning. If using Grafana, manage your dashboards with code using grafana-operator.

Deliverables produced in this milestone include cluster features, best-practice libraries and snippets, dashboards, self-service guides and operator manuals.
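For domain teams wondering what a metrics scraper actually collects, here is a rough Go sketch of an app exposing one counter in the Prometheus text exposition format, using only the standard library. The metric name `app_http_requests_total` and the handler names are made up for illustration; a real service would more likely use the official Prometheus client library rather than formatting the output by hand.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"sync/atomic"
)

// requestsTotal counts handled requests. The metric name used below is hypothetical.
var requestsTotal atomic.Uint64

// renderMetrics formats the counter in the Prometheus text exposition format:
// a HELP line, a TYPE line, then one sample per line.
func renderMetrics(count uint64) string {
	return "# HELP app_http_requests_total Total HTTP requests handled.\n" +
		"# TYPE app_http_requests_total counter\n" +
		fmt.Sprintf("app_http_requests_total %d\n", count)
}

// metricsHandler serves the current counter value to a scraper.
func metricsHandler(w http.ResponseWriter, r *http.Request) {
	w.Header().Set("Content-Type", "text/plain; version=0.0.4")
	fmt.Fprint(w, renderMetrics(requestsTotal.Load()))
}

func main() {
	// Pretend the app has served three requests.
	requestsTotal.Add(3)

	// Exercise the handler in-process instead of opening a port, so the
	// sketch terminates. A real service would register it with
	// http.HandleFunc("/metrics", metricsHandler) and call http.ListenAndServe.
	rec := httptest.NewRecorder()
	metricsHandler(rec, httptest.NewRequest(http.MethodGet, "/metrics", nil))
	fmt.Print(rec.Body.String())
}
```

Once an endpoint like this exists, the scraper deployed with the cluster can pick it up, and the team can build dashboards and alerts on the resulting time series.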

8. Templatize best practices and delineate golden paths

As domain teams iterate on their apps and improve on their software craft, they'll arrive at practices and conventions that improve the ergonomics (and economics) of building and running their software. As a platform engineer, spot the patterns across teams and codebases, and seek consensus on redundant libraries or divergent themes. Take the practices and patterns most voted/favored by teams and use them to publish template apps that devs can easily clone and rename to quickly get started with a new app. Keep each template fully functional and deploy it to every environment, production included; this way templates serve both as live documentation and as the canary in the coal mine to test new platform features and cluster upgrades. The build and deployment workflows showcased by a template are the golden paths that users follow to take their app to production. This is a key place to showcase best practices like semantic versioning. To create advanced templating workflows, such as allowing devs to choose among different storage backends or whether their new app is a UI or a gRPC API, create a wizard-like experience with Backstage's Software Templates feature.

In this milestone there's an intentional focus on the developer experience, hence surveying developers and reading their code are key to producing templates that devs will want to use. Deliverables include template apps running live, templating workflows, deployment pipelines and release workflows, how-to wikis, demo videos and developer surveys.

9. Enter domain abstractions

This is a milestone for teams and organizations gaining momentum, where roadmaps are signaling new features and product development. Take configuration abstractions that have been in use by domain teams and that have stabilized (see milestone #4), and consolidate them behind a common interface. Compose them together into new abstractions and give them names that communicate to users what they do, e.g. an Application API Key, a Highly Available Application, a Tenant Namespace. Hide away the boilerplate bits, cement the good practices already internalized by the platform team (like resource allocation best practices), and focus on surfacing the config settings most relevant to devs. In Kubernetes platforms, Custom Resource Definitions (CRDs) are the building blocks for creating domain abstractions. They allow configuring apps and infrastructure with YAML code, which can be easily templated and fits well with GitOps and pull request workflows. In addition to Kubernetes-style declarative configuration, CRDs come with "controllers" that ensure resources are reconciled to a healthy state, allowing for orchestration of resources based on their health and lifecycle stage using "operators". For teams getting started, Kubebuilder is a must-try to understand the lifecycle of CRDs and how the Reconciler Pattern works in action. For teams looking to reconcile resources outside of Kubernetes, Crossplane provides a way to represent third-party resources and manage their lifecycle with pluggable providers.

This milestone increases the configuration surface for platform teams and simplifies it for domain teams. It produces Kubernetes controllers and operators, new configuration templates, declarative config examples, wikis.
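To give a feel for the Reconciler Pattern mentioned in this milestone, here is a toy Go sketch of a reconcile function that drives an "actual" state toward a "desired" one. It deliberately does not use the Kubebuilder or controller-runtime APIs; the `State` type, the resource names and the action strings are hypothetical stand-ins for real cluster objects and API calls.

```go
package main

import (
	"fmt"
	"sort"
)

// State maps a resource name to its replica count. Both the type and the
// resource names are hypothetical stand-ins for real cluster objects.
type State map[string]int

// Reconcile compares desired and actual state, applies the differences to
// actual, and returns the actions it took (sorted for stable output).
// Calling it again on a converged state returns no actions, which is what
// makes a reconcile loop safe to repeat.
func Reconcile(desired, actual State) []string {
	var actions []string
	for name, want := range desired {
		have, exists := actual[name]
		switch {
		case !exists:
			actions = append(actions, fmt.Sprintf("create %s with %d replicas", name, want))
		case have != want:
			actions = append(actions, fmt.Sprintf("scale %s from %d to %d", name, have, want))
		default:
			continue // already converged, nothing to do
		}
		actual[name] = want
	}
	// Remove anything that exists but is no longer declared.
	for name := range actual {
		if _, wanted := desired[name]; !wanted {
			actions = append(actions, "delete "+name)
			delete(actual, name)
		}
	}
	sort.Strings(actions)
	return actions
}

func main() {
	desired := State{"api": 3, "worker": 2}
	actual := State{"api": 1, "legacy": 1}
	for pass := 1; pass <= 2; pass++ {
		actions := Reconcile(desired, actual)
		fmt.Printf("pass %d: %d action(s) %v\n", pass, len(actions), actions)
	}
}
```

The first pass performs three actions and the second performs none. A real controller runs the same comparison every time a resource changes, with the "actions" being calls to the Kubernetes or cloud APIs.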

10. Address concerns at the edge with an API Gateway

This is a milestone most pressing for engineering teams that support an internet-facing product and for teams looking to benefit from the endpoint tenancy model unlocked by the Kubernetes Gateway API. This is also a milestone for mid-level to advanced teams, which makes it a subject for another post!

How to approach a codebase a la DevOps

David Montoya — Fri, 28 Jan 2022 03:29:13 +0000

We devs often have to jump from repo to repo as we work through implementing a new feature or making a change to an app or API (hello SREs). Approaching a complex codebase that we haven't touched before (or recently) can be a daunting task. Having a systematic approach for getting acquainted with a codebase before rushing to introduce change will give you a more encompassing view of the code, help you put the required change in context, and save you from shaving the wrong yak.

Whether solo or pair programming, small or large codebase, open source or proprietary code, follow these steps before you start hacking away.

1. Start from the README

A README is the de facto index page of a program or codebase for users and future maintainers. Good READMEs welcome developers to self-service code changes in open organizations. Codebase owners should ensure the most important things a maintainer needs to know about the app are documented here, along with a quick try-it-yourself guide and one-liners to build, test or set up the app. At a minimum, the README should serve as an index that points to more detailed documents and diagrams.

Questions to ponder: What does this codebase do? Does it have tests? Can I install it? Does it have diagrams?

2. Poke at the CI pipeline

Looking at a codebase from the perspective of the CI pipeline gives you insights into the change frequency, stability and overall health of the codebase. Confirming the codebase is in a healthy state and "ready for change" before making a code change can save you from going down rabbit-holes, troubleshooting errors unrelated to your change.

When exploring a CI pipeline, look for common failures and signs of flaky tests so you know what to expect when running the tests locally. Browsing through recent builds (and commits) can reveal patterns about a common type of change, the average size of change, or major refactorings or features that have just been introduced. From the list of releases, you can tell the "release cadence" and when to expect your change to make it to production. Lastly, rerun the most recent job to confirm the pipeline is idempotent and that the build artifacts it outputs are consistent on every run.
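One way to check that two runs of the same job produced consistent artifacts is to compare their checksums. The Go sketch below assumes you have downloaded both artifacts and loaded their bytes (for instance with `os.ReadFile`); the function names and the sample byte strings are illustrative only.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// Digest returns the hex-encoded SHA-256 checksum of an artifact's bytes.
func Digest(artifact []byte) string {
	sum := sha256.Sum256(artifact)
	return hex.EncodeToString(sum[:])
}

// SameArtifact reports whether two build outputs are byte-for-byte identical.
func SameArtifact(a, b []byte) bool {
	return Digest(a) == Digest(b)
}

func main() {
	// In practice these would come from os.ReadFile on the artifacts
	// downloaded from two runs of the same pipeline job.
	run1 := []byte("binary built from the same commit")
	run2 := []byte("binary built from the same commit")
	// A build that embeds a timestamp is a common source of drift.
	drifted := []byte("binary built from the same commit at 03:29:13")

	fmt.Println("run1 vs run2:", SameArtifact(run1, run2))
	fmt.Println("run1 vs drifted:", SameArtifact(run1, drifted))
}
```

If the digests differ between reruns of the same commit, something non-deterministic (timestamps, unpinned dependencies, build paths) is leaking into the artifact, which is worth knowing before you add your own change on top.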

If the pipeline is red, adding new revisions would only increase noise and make it harder for others to troubleshoot. Hold off pushing your changes until the codebase is back to continuous integration mode.

Questions to ponder: Is the pipeline green? When was the last time it ran? Does it fail often? Does it perform linting? Does it have flaky tests? Does it have e2e tests? Who was a recent contributor I could reach out to for help?

3. Run the tests from your local machine

Running the test suite from your machine gives you a baseline for when you start hacking away on code and iterating on the new test case. Running the tests can yield some insights on the level of test coverage, testing patterns used by maintainers, potential external dependencies, and the overall maintainability of the codebase. Codebases with consistent test patterns and sensible test coverage make it safe and efficient to introduce change.

Questions to ponder: Are the tests passing? Does it even have tests? Can the tests run on my machine? Do I have the required dependencies? Does it have external dependencies? Can the tests run with my Wi-Fi off? Can I add a new test?

4. Identify the entry-point

The entry-point in a software program determines how it is initiated and executed. Knowing where the entry-point is gives you an idea of how to consume and test the code you're about to change. Most apps perform some form of configuration task upon start. Any configuration required to run the app is likely being read and validated near the entry-point. When introducing new configuration options to an app, like a new environment variable, the entry-point is a good place to start.
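As an illustration of configuration being read and validated near the entry-point, here is a small Go sketch. The variable names `APP_PORT` and `APP_LOG_LEVEL`, their defaults and the `Config` type are assumptions of mine; the point is only the shape: load, validate, fail fast, then start.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
)

// Config holds the settings the app needs before it can start.
// The field set is hypothetical.
type Config struct {
	Port     int
	LogLevel string
}

// LoadConfig reads settings through getenv (os.Getenv in production, a fake
// in tests), applies defaults, and rejects invalid values up front.
func LoadConfig(getenv func(string) string) (Config, error) {
	cfg := Config{Port: 8080, LogLevel: "info"}

	if raw := getenv("APP_PORT"); raw != "" {
		port, err := strconv.Atoi(raw)
		if err != nil || port < 1 || port > 65535 {
			return Config{}, fmt.Errorf("APP_PORT %q is not a valid port", raw)
		}
		cfg.Port = port
	}

	if lvl := getenv("APP_LOG_LEVEL"); lvl != "" {
		switch lvl {
		case "debug", "info", "warn", "error":
			cfg.LogLevel = lvl
		default:
			return Config{}, fmt.Errorf("APP_LOG_LEVEL %q is not supported", lvl)
		}
	}
	return cfg, nil
}

func main() {
	// The entry-point: configuration is loaded and validated before anything else runs.
	cfg, err := LoadConfig(os.Getenv)
	if err != nil {
		fmt.Fprintln(os.Stderr, "configuration error:", err)
		os.Exit(1)
	}
	fmt.Printf("starting on port %d with log level %s\n", cfg.Port, cfg.LogLevel)
}
```

When you need to add a new environment variable, a function like `LoadConfig` is where you would look first, and passing `getenv` as a parameter keeps it easy to test without touching the real environment.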

What the entry-point looks like depends on what the program does and how it's consumed. In HTTP-based programs like web apps or JSON APIs, the entry-point is an "http server" that opens a port and accepts TCP connections. You'd then need a client to consume it. Search the codebase for occurrences of that port or references to HTTP resource paths. In the case of libraries, the entry-point would be a set of public interfaces or methods that expose a certain functionality. Tests are a good place to start when digging into library code. For command line apps (or CLIs), the entry-point would be a "command" function that's meant to be invoked from a terminal once installed.

In Dockerized apps, start by looking at the Dockerfile or docker-compose.yaml files. If an entry-point is not explicitly configured, one will be required when launching the container on the target platform. If running in Kubernetes, the entry-point would then be found in the command field of the Pod specification.

Questions to ponder: How does it run? How is it initialized? Does it need configuration? How is it executed in production?

5. Read up. Spot the patterns.

At this point, we should have a better view of the state and form of the codebase. The next step is to look at its structure and the actual business code. Every codebase has patterns set forward by the early or main maintainers. Depending on the size and type of change, you may need to emulate or adapt those patterns. Spotting the patterns, practices and overall structure of the code puts the required code change in context and keeps you focused on introducing only the necessary code modifications. For consistency's sake, practices and conventions already established across the codebase should prevail over individual preferences.

If the domain model is not too anemic, scan for types (or classes) and their publicly scoped methods.
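If the codebase happens to be written in Go, scanning for types and their public methods can even be scripted with the standard library's `go/parser` and `go/ast` packages. The sketch below is a rough illustration: the `ExportedAPI` helper and the sample source are hypothetical, it looks at a single file passed as a string, and it does not resolve generic receivers.

```go
package main

import (
	"fmt"
	"go/ast"
	"go/parser"
	"go/token"
	"sort"
)

// sample is a made-up source file standing in for real business code.
const sample = `package orders

type Order struct{ total int }

func (o *Order) Total() int { return o.total }
func (o *Order) validate() error { return nil }

type repository struct{}
`

// ExportedAPI returns the exported type names and exported methods
// (as "Type.Method") declared in one Go source file, sorted alphabetically.
func ExportedAPI(src string) ([]string, error) {
	fset := token.NewFileSet()
	file, err := parser.ParseFile(fset, "sample.go", src, 0)
	if err != nil {
		return nil, err
	}
	var names []string
	for _, decl := range file.Decls {
		switch d := decl.(type) {
		case *ast.GenDecl:
			// Type declarations live inside general declarations.
			for _, spec := range d.Specs {
				if ts, ok := spec.(*ast.TypeSpec); ok && ts.Name.IsExported() {
					names = append(names, ts.Name.Name)
				}
			}
		case *ast.FuncDecl:
			// Keep only exported functions that have a receiver, i.e. methods.
			if d.Recv == nil || len(d.Recv.List) == 0 || !d.Name.IsExported() {
				continue
			}
			names = append(names, receiverName(d.Recv.List[0].Type)+"."+d.Name.Name)
		}
	}
	sort.Strings(names)
	return names, nil
}

// receiverName strips the pointer from a receiver type such as *Order.
// Generic receivers are not resolved in this sketch and come back as "?".
func receiverName(expr ast.Expr) string {
	if star, ok := expr.(*ast.StarExpr); ok {
		expr = star.X
	}
	if id, ok := expr.(*ast.Ident); ok {
		return id.Name
	}
	return "?"
}

func main() {
	api, err := ExportedAPI(sample)
	if err != nil {
		panic(err)
	}
	fmt.Println(api) // prints "[Order Order.Total]"
}
```

For other languages, an IDE's outline view or the generated API docs give you the same bird's-eye view of the domain model.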

Questions to ponder: Where does my code change fit in all this? Can I change the code with the least impact to existing features? Can the codebase support the required change or will it require a refactoring?

At this point I'd strongly encourage you to practice TDD and write a unit test first before actually jumping into changing the code, but that's a subject for another post :)

And you? How do you approach a codebase?

5 practices to take Hashicorp Vault in Kubernetes to production readiness

David Montoya — Wed, 10 Mar 2021 19:22:33 +0000

Are you setting out to deploy HashiCorp Vault in Kubernetes? There are a variety of recommended practices documented out there; however, many of them won't give you an overarching picture of what it takes to do so securely and reliably in production.

Follow these practices to guide you on the path from prototype to production-readiness:

https://expel.io/blog/production-readiness-hashicorp-vault-kubernetes