DEV Community: Shipyard DevRel

Claude Code tokens: what they are and how they're counted

Shipyard DevRel — Mon, 30 Mar 2026 21:54:37 +0000

You've probably noticed that Claude Code tracks usage in "tokens" consumed, whether through the /cost or /context commands, your Anthropic dashboard, or just in Claude convos.

Tokens are the basic unit that Claude uses to read and generate text. They aren't exclusive to Claude: Claude tokens are the same kind of tokens used throughout NLP, which is how LLMs process input and generate output.

Tokenization in NLP

LLMs break text content down into tokens, which are chunks of text that might be somewhere between a character and a full word long. Depending on your tokenizer, the word "programmatically" could be broken down into tokens as "program" + "matically", or as a whole word (alternatively, character tokenization would break it down by letter, to catch unique spellings and typos). You can see how OpenAI's models tokenize text with this webapp.
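To make the subword vs. character distinction concrete, here's a toy sketch in Python. It uses a greedy longest-match rule and a made-up two-entry vocabulary (`TOY_VOCAB` is hypothetical). Real tokenizers learn their vocabularies from training data and use more involved algorithms, so treat this as an illustration only.

```python
def subword_tokenize(text: str, vocab: set[str]) -> list[str]:
    """Greedy longest-match tokenization over a fixed vocabulary."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest remaining substring first, shrinking until one matches.
        for j in range(len(text), i, -1):
            if text[i:j] in vocab:
                tokens.append(text[i:j])
                i = j
                break
        else:
            # Nothing in the vocabulary starts here: fall back to one character.
            tokens.append(text[i])
            i += 1
    return tokens


def char_tokenize(text: str) -> list[str]:
    """Character tokenization: every letter is its own token."""
    return list(text)


TOY_VOCAB = {"program", "matically"}  # hypothetical vocabulary for illustration

print(subword_tokenize("programmatically", TOY_VOCAB))  # ['program', 'matically']
print(len(char_tokenize("programmatically")))           # 16
```

With a vocabulary that contains the whole word, the same function returns it as a single token, which is why token counts vary from tokenizer to tokenizer.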

An LLM predicts the next most likely token based on how the preceding tokens were used in its training data. It takes input text, tokenizes it, and outputs a series of tokens that are statistically likely to follow.

Punctuation, whitespace, and code syntax all count toward tokens, and they tend to split into shorter tokens than written copy does. Every LLM has a set context window, which is the limit on how many tokens it can hold in memory at a time (most Claude models hold ~200k). When you exceed the context window, earlier tokens will get dropped.

So when you're using Claude Code, keep in mind that code is more token-dense than natural language. And languages like Python are more token-efficient than Rust or C++.

How Claude tokenizes text

Claude uses a tokenizer trained specifically on its training data, so strings will be tokenized differently from other models like GPT-5. Claude's token length averages out to about 4 letters per token, and 1.5 tokens per word. Token length is shorter for code snippets, e.g. () or { will each equal one token.
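If you want a quick ballpark before sending a prompt, the averages above can be turned into a rough estimator. This is a sketch built only on those rules of thumb (about 4 characters per token, about 1.5 tokens per word). It isn't Claude's tokenizer, and the function names are our own.

```python
import math

CHARS_PER_TOKEN = 4.0   # rough average quoted above
TOKENS_PER_WORD = 1.5   # rough average quoted above


def estimate_tokens_by_chars(text: str) -> int:
    """Estimate token count from character length."""
    return math.ceil(len(text) / CHARS_PER_TOKEN)


def estimate_tokens_by_words(text: str) -> int:
    """Estimate token count from whitespace-separated word count."""
    return math.ceil(len(text.split()) * TOKENS_PER_WORD)


prompt = "Refactor the login handler to use async database calls"
print(estimate_tokens_by_chars(prompt), estimate_tokens_by_words(prompt))
```

The two estimates won't always agree, and both will undercount code-heavy text, since code tends to split into shorter tokens.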

When you prompt CC with an input, it doesn't just tokenize your message alone: it'll process that along with the conversation history, loaded files, system prompts, command outputs, etc. As Claude processes more context, it consumes more tokens since this "input" compounds. Longer sessions will start to drain usage faster as they go on because of this.
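Here's a small sketch of why that compounding adds up. It assumes a simplified model where the full history is re-sent as input on every turn, and where each entry in `turn_sizes` is the number of tokens a turn adds to the history (your message, plus any responses or tool output). The numbers are made up.

```python
def cumulative_input_tokens(turn_sizes: list[int], base_context: int = 0) -> list[int]:
    """Input tokens processed on each turn when the whole history is re-sent."""
    per_turn = []
    history = base_context
    for size in turn_sizes:
        history += size            # this turn's tokens join the history
        per_turn.append(history)   # the whole history is this turn's input
    return per_turn


# Hypothetical session: 2,000 tokens of system prompt and files, then four 500-token turns.
per_turn = cumulative_input_tokens([500, 500, 500, 500], base_context=2000)
print(per_turn)       # [2500, 3000, 3500, 4000]
print(sum(per_turn))  # 13000
```

Four turns that each add 500 tokens end up processing 13,000 input tokens in total, not 4,000.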

Input vs. output tokens (and cache tokens)

Claude distinguishes between input tokens (e.g. what you submit to the model, as well as history + context) and output tokens (the text that it generates). These are tracked separately and have different pricing rates.

Claude prices output tokens higher than input tokens. With Claude Sonnet, for example, output tokens cost ~5x more per million than input tokens. So when Claude writes 300 lines of code, it'll be much more pricey vs. when it reads 300 lines and generates a short answer. Or when CC reads in your entire codebase, it won't cost as much as one-shot generating a codebase of a comparable size.
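To put rough numbers on that, here's a sketch comparing "read 300 lines, answer briefly" with "write 300 lines". The per-million prices are assumptions chosen to match the ~5x ratio above, and the 3,600-token size for 300 lines of code is hypothetical. Check current pricing before relying on the dollar amounts.

```python
INPUT_PRICE_PER_MTOK = 3.00    # assumed USD per million input tokens
OUTPUT_PRICE_PER_MTOK = 15.00  # assumed USD per million output tokens (5x input)


def request_cost(input_tokens: int, output_tokens: int) -> float:
    """Cost in USD of one request, given its input and output token counts."""
    return (input_tokens * INPUT_PRICE_PER_MTOK
            + output_tokens * OUTPUT_PRICE_PER_MTOK) / 1_000_000


CODE_TOKENS = 3600  # hypothetical size of 300 lines of code

read_cost = request_cost(input_tokens=CODE_TOKENS, output_tokens=100)
write_cost = request_cost(input_tokens=100, output_tokens=CODE_TOKENS)

print(f"read 300 lines:  ${read_cost:.4f}")   # $0.0123
print(f"write 300 lines: ${write_cost:.4f}")  # $0.0543
```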

There's a third class of tokens: cache tokens. CC supports prompt caching, which lets repeated context (a long system prompt, a codebase snapshot, etc.) be stored and then reused at a much lower price per token. So adding project context to your system prompt and CLAUDE.md will help keep usage lower over time (vs. repeating the same instructions often in a prompt).
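Here's a rough sketch of how caching changes the math for context that gets sent on every request. The base price and both multipliers are assumptions for illustration (a small premium to write the cache, a steep discount to read it), and the model ignores cache expiry entirely.

```python
INPUT_PRICE_PER_MTOK = 3.00    # assumed base input price, USD per million tokens
CACHE_WRITE_MULTIPLIER = 1.25  # assumed: the first (cache-writing) request costs a bit more
CACHE_READ_MULTIPLIER = 0.10   # assumed: cached reads cost a fraction of the base price


def context_cost(context_tokens: int, requests: int, cached: bool) -> float:
    """Cost in USD of sending the same context on every request."""
    per_token = INPUT_PRICE_PER_MTOK / 1_000_000
    if not cached or requests == 0:
        return context_tokens * requests * per_token
    write = context_tokens * CACHE_WRITE_MULTIPLIER * per_token
    reads = context_tokens * (requests - 1) * CACHE_READ_MULTIPLIER * per_token
    return write + reads


# Hypothetical: 10,000 tokens of project context reused across 20 requests.
print(f"uncached: ${context_cost(10_000, 20, cached=False):.4f}")  # $0.6000
print(f"cached:   ${context_cost(10_000, 20, cached=True):.4f}")   # $0.0945
```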

How to track your token usage

Here's a detailed guide on tracking Claude Code tokens.

If you're using Claude via the API, you can view your token consumption, costs by model, and usage over time in the Claude Console.

Running the /cost command during a Claude Code session will show token usage for your current session, with respect to cost. If you're on a Pro or Max subscription and not the API, /cost reflects what you would have paid via API. You can also run /context to see how many tokens are currently filling your context window.

Third-party tools can give you good metrics on your usage. ccusage is a lightweight CLI tool that reads and analyzes Claude's local JSONL files for token metrics. You can run it without a global install using npx ccusage@latest. If you want something more real-time, you can run Claude-Code-Usage-Monitor to track live token consumption and burn rate. It'll also predict when you're likely to hit your limits.

Environments for Claude Code

When you're developing with Claude Code, you'll want to make sure that every new feature is tested extensively in a secure, isolated environment. Ephemeral environments pair beautifully with CC: you can spin up an environment automatically based on a branch/PR, run tests, do QA, push patches, and then merge once you've determined it's ready.

Shipyard is a plug-and-play ephemeral environment solution for devs using Claude Code. Claude can interact with the environments on its own via MCP/CLI (pull logs, get each live URL, visit the environments with Playwright MCP, etc). Try it free for 30 days and see how much faster your dev/test loop gets.

Devoting December to developer enrichment

Shipyard DevRel — Mon, 22 Dec 2025 22:05:35 +0000

Engineering teams spend so much time during the year talking about developer productivity. December is a difficult time to keep those metrics up anyway, so why not use the holiday stretch to catch up on your skills and learn new things?

December for engineers

December is perhaps the most inconsistent month of the year, work-wise. Some employees are taking PTO to travel and visit family during various times of the month, and the last week (or two) might be a company-wide break.

This makes it nearly impossible to operate to the same goals expected during any other month. Plus, there’s a solid amount of review/administrative work that needs to happen (particularly retrospectives and Q1 planning).

Under these conditions, December isn’t a fulfilling work month for engineers. Yes, the workload is lighter, but the actual culprit is the quality of tasks.

Why set aside time to upskill?

Many people have a never-ending list of new technologies that they want to learn, but rarely are able to get around to. The software space moves so fast, and there’s a constant flow of new tools (especially those that you aren’t using and/or won’t use on the job). Setting aside time to explore these is very much worth it:

Learning new things is good for your brain, and can improve your problem-solving skills and make you happier
Doing something different during your 9-5 is a good remedy for the EoY burnout
You’re going to be on your computer anyway, and it’s likely your work responsibilities have wound down. This is a great opportunity to repurpose that time
You might pick up a new tool/technology that positively impacts your work

Most of us don’t have time to upskill off the clock, and that can be a real disadvantage in the tech field. There’s a lot you can accomplish during the most prime working hours of your day, and when it’s work that you find engaging/challenging, it can help you re-invigorate before your holiday break. Otherwise, if you’re not engaged during the last week or two, you might dread going back in January even more.

Keeping your mind sharp

December leaves too many devs bored and unfulfilled. Most orgs have a code freeze, so there are fewer (impactful) things to work on. Support and basic maintenance tasks get dull after a while, which makes it all too easy to mentally check out.

Getting your manager’s encouragement to use this time to learn can help you genuinely enjoy the last couple weeks of work, rather than counting down the hours until break.

If there isn’t anything immediately top-of-mind, you could try one of these topics:

Contribute to an open source project
Learn a new devtool
Take an online course
Try your hand at Advent of Code
Create a hobby project with a new stack
Try a totally different function (UX, platform, design, etc)

This could also be a good opportunity to document your learnings with blog posts. Doing so can help you remember what you learned, and help jog your memory of any challenges you encountered.

How can you implement this as a manager?

As a manager, you obviously want your team to thrive. During busy years, there might be fewer opportunities for devs to learn and grow outside their assignments. So if you’re lucky enough to see tasks wind down in December, you can repurpose this time.

The way you go about this really comes down to each dev’s learning style. Some people thrive with structure, while others prefer to self-manage. You're probably familiar with how each team member excels, so you can make it as rigid or as flexible as they need. But as far as rigidity goes, you’ll want to leave this at least somewhat open-ended, so it feels less like work, and more like enrichment. You can step in for accountability, but you shouldn’t have to track their “progress”. It won’t quite work if your devs aren’t excited about what they’re learning, which will also do nothing to improve burnout.

You can work with your team and listen as they set their own goals. Listen to what technologies they want to learn, and what they want to accomplish with them. If you know of any resources, share them. And if a particular dev is overwhelmed with the sheer amount of options they have, suggest something more guided, like an online course or Advent of Code.

Happy holidays to you and your team

So much is in flux during December, especially during the second half. Engineering managers don’t usually expect much in the way of productivity, so why not use this time for upskilling/education/things that get left on the back burner otherwise?

Doing so can be a great refresher and an answer to end-of-year burnout. Help your teammates remember why they were passionate about software in the first place.

How to track Claude Code usage + analytics

Shipyard DevRel — Mon, 17 Nov 2025 20:26:15 +0000

Now that you’ve gotten into the habit of working with Claude Code, you’re probably curious to see how efficiently you’re using it. For example, do you know how many tokens it’s taking you to complete a ticket on average? Since Claude tokens can run out and will reset after 5 hours, you might want to see how much you’re pushing it to the limit.

We’ve tried four different ways to track our Claude Code usage. Here’s what we found.

Option 1 (the classic way): Check the Anthropic Console

If you’re using the Anthropic API for Claude Code access (instead of through a Pro or Max plan), you can track your usage through the Anthropic Console. This is useful for devs and orgs who are doing the “pay as you go” approach, because your number of tokens consumed is directly correlated to how much you’re spending.

If you’re a manager, you can see usage trends for your whole team, and get the data you need to find how Claude Code is impacting developer productivity. You’ll be able to see how often you/others accept Claude’s suggested changes. Anthropic suggests using this as a way to measure your org’s Claude Code adoption, as well.

Option 2 (view within your session): /context slash command

You can also view your current usage directly from a Claude Code session by running the /context slash command.

This will show you how many tokens you've consumed, how many you have available, and break down token usage by category. It'll also show you a detailed view of how individual actions (including MCP tool uses) have consumed tokens.

Option 3 (view historical usage): ccusage

ccusage is a CLI-based tool that lets you view usage by date, session, or project. We really liked using this tool for seeing how our usage was divided across repos (e.g. which ones did we rely on Claude more for?).

It works by reading + analyzing Claude’s local JSONL files. This is super helpful if you have a Pro or Max plan, since you’re paying a flat rate per month and can’t track this via the Anthropic Console.
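To give a feel for what "reading + analyzing JSONL" involves, here's a minimal sketch that totals tokens per day from JSONL lines. The record layout (`timestamp` plus a `usage` object with `input_tokens` and `output_tokens`) is an assumption for illustration. Claude's actual log format may nest these fields differently, and this is not ccusage's implementation.

```python
import json
from collections import defaultdict
from typing import Iterable


def sum_usage_by_day(lines: Iterable[str]) -> dict[str, dict[str, int]]:
    """Total input/output tokens per calendar day from JSONL records."""
    totals = defaultdict(lambda: {"input_tokens": 0, "output_tokens": 0})
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip malformed lines rather than failing the whole report
        if not isinstance(record, dict):
            continue
        usage = record.get("usage")            # hypothetical field name
        timestamp = record.get("timestamp")    # hypothetical field name
        if not isinstance(usage, dict) or not isinstance(timestamp, str):
            continue
        day = timestamp[:10]  # "YYYY-MM-DD" prefix of an ISO 8601 timestamp
        totals[day]["input_tokens"] += usage.get("input_tokens", 0)
        totals[day]["output_tokens"] += usage.get("output_tokens", 0)
    return dict(totals)


# Made-up sample records standing in for a local JSONL file.
SAMPLE_LINES = [
    '{"timestamp": "2025-08-01T09:15:00Z", "usage": {"input_tokens": 1200, "output_tokens": 300}}',
    '{"timestamp": "2025-08-01T10:02:00Z", "usage": {"input_tokens": 800, "output_tokens": 150}}',
    '{"timestamp": "2025-08-02T14:30:00Z", "usage": {"input_tokens": 500, "output_tokens": 90}}',
    'not json',
]

print(sum_usage_by_day(SAMPLE_LINES))
```

To point it at a real file, you could pass an open file handle instead of `SAMPLE_LINES`, since a file object iterates line by line.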

One cool thing about ccusage is that it’s so small you don’t need to fully install it, you can just run the npx command with whichever args you want to filter by:

npx ccusage@latest report daily

You can also install it:

npm install -g ccusage
ccusage report daily --since 20250801

Since it’s a CLI tool, you can ask Claude Code to run ccusage and retrieve info for specific search params:

Hey Claude, use ccusage to find my analytics for the RecipeBook project between July 3rd and July 7th.

The only limitation is that ccusage can’t show your current, real-time Claude Code analytics. If this is something you want, check out the next tool in our list.

Option 4 (view real-time usage): Claude-Code-Usage-Monitor

Claude-Code-Usage-Monitor is another CLI-based tool for tracking how you use Claude Code. Unlike ccusage, this one gives you a real-time chart of your token consumption, cost estimate, and predictions about how long it’ll take for you to hit your limits.

You can install it with pip or uv and run it with the cmonitor command:

pip install claude-monitor
cmonitor

Run it while you’re using Claude in a different window, and see it update live as Claude generates. Like ccusage, this pulls from Claude’s local data.

You can see your historical usage in table view by including a timeframe:

cmonitor --view monthly
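The burn rate and time-to-limit ideas above can be approximated with simple linear extrapolation. This sketch is our own simplification, not how Claude-Code-Usage-Monitor computes its predictions, and the 200,000-token limit in the example is a hypothetical figure.

```python
def burn_rate(tokens_used: int, minutes_elapsed: float) -> float:
    """Average tokens consumed per minute so far."""
    if minutes_elapsed <= 0:
        raise ValueError("minutes_elapsed must be positive")
    return tokens_used / minutes_elapsed


def minutes_until_limit(tokens_used: int, token_limit: int, minutes_elapsed: float) -> float:
    """Minutes left before hitting the limit, assuming the current rate holds."""
    rate = burn_rate(tokens_used, minutes_elapsed)
    remaining = max(token_limit - tokens_used, 0)
    if rate == 0:
        return float("inf")  # nothing consumed yet, so no projected limit
    return remaining / rate


# Hypothetical: 60,000 tokens used in the first 90 minutes, against a 200,000-token limit.
print(f"{burn_rate(60_000, 90):.0f} tokens/min")
print(f"{minutes_until_limit(60_000, 200_000, 90):.0f} minutes until limit")
```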

Understanding Claude's session model

If you want to best plan out your Claude sessions, you’ll want to understand your constraints. Claude’s tokens are granted by plan based on overall server load, so on busier days you’ll get fewer.

If you’re not using the API for pay-as-you-go Claude access, you’ll want to choose the Claude tier that works best for you.

Pro: for a medium-high coding workload. Expect to use continuously for smaller code changes, and as a supplement to your own coding. $20/month
Max5: for an intense coding workload. 5x the token allowance of Pro. Opus access. $100/month
Max20: for near-autonomous, nonstop, heavy development workloads with multiple sessions/agents. Significantly larger context window. 20x the token allowance of Pro. Opus access. $200/month

Sessions kick off as soon as you send your first message, and last five hours.
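If you want to know when your current window closes, the arithmetic is simple. This sketch assumes only what's stated above (a session starts at your first message and lasts five hours). The timestamps are made up.

```python
from datetime import datetime, timedelta, timezone

SESSION_LENGTH = timedelta(hours=5)


def session_end(first_message_at: datetime) -> datetime:
    """When a session that started at first_message_at will end."""
    return first_message_at + SESSION_LENGTH


def time_remaining(first_message_at: datetime, now: datetime) -> timedelta:
    """Time left in the session, never negative."""
    return max(session_end(first_message_at) - now, timedelta(0))


start = datetime(2025, 11, 17, 9, 30, tzinfo=timezone.utc)
now = datetime(2025, 11, 17, 12, 0, tzinfo=timezone.utc)

print(session_end(start).isoformat())  # 2025-11-17T14:30:00+00:00
print(time_remaining(start, now))      # 2:30:00
```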

It’s most efficient (and guarantees better outputs) if you start different sessions for different tasks.

The Bottom Line

We found the /context slash command, ccusage, and Claude-Code-Usage-Monitor were easy to use, and helped us make sense of our Claude metrics. This was helpful for tracking our own developer productivity, as well as seeing how Claude impacted our work styles.

If you’re using the Pro or Max tiers, you won’t be able to see this info in the Anthropic Console. We suggest using tools to supplement each other: /context to get a good summary by category, ccusage to view your usage trends, and Claude-Code-Usage-Monitor to see your live token consumption and predictions.

Using Claude Code? Give your agents ephemeral environments. They can deploy the code they write, pull logs, find + fix bugs, all with little-to-no human intervention. Try it free for 30 days.

The Shipyard MCP Server: Manage environments with AI agents

Shipyard DevRel — Fri, 14 Nov 2025 22:21:57 +0000

We’re excited to announce the beta release of our MCP server. Now you can loop your agents into environment management, and give them access/context to run tests, tail environment logs, start/stop environments, and debug remote issues.

The MCP server is an extension of the Shipyard CLI, so if you've already set up auth, you won't need to add your token again.

Installing the Shipyard CLI/MCP

(Skip this step if you already have the Shipyard CLI installed)

brew tap shipyard/tap
brew install shipyard

Claude Code

With API token and org name:

claude mcp add shipyard --env SHIPYARD_API_TOKEN=your-token-here --env SHIPYARD_ORG=your-org-name -- shipyard mcp serve

If already configured with CLI:

claude mcp add shipyard -- shipyard mcp serve

Codex CLI

Edit ~/.codex/config.toml and add:

[mcp_servers.shipyard]
command = "shipyard"
args = ["mcp", "serve"]
env = { "SHIPYARD_API_TOKEN" = "your-token-here", "SHIPYARD_ORG" = "your-org-name" }

(Contact us at support@shipyard.build if you need API access).

Our users have been using the MCP server to add more capabilities to their agent dev loops:

Writing and improving E2E tests: your agent will use your codebase, the Shipyard MCP, and the Playwright MCP to study your app and improve existing tests
Sandboxes for rapid prototyping: generate code with your agent and instantly see that in a live env, then pull logs to iterate/improve features
Live code reviews: use the Shipyard MCP to get an env link, then use Playwright MCP to gather info about the feature, cross reference with the codebase

Start building!

If you're a dev building with agents, we highly encourage you to try it out today. It's 100% open source, so feel free to fork it, open a PR, or use it to help you build your own MCP server.

(Sign up for a 30-day free Shipyard trial if you haven’t yet).

We're curious to hear how you’re using it to build and test faster. Write us at hello@shipyard.build to share your thoughts. Any and all feedback is super valuable for us during the beta.

Pair programming with Claude Code: using output styles

Shipyard DevRel — Thu, 09 Oct 2025 17:43:52 +0000

As a developer, you likely went into software because you enjoy writing code. This isn't a chore for you, and you get some degree of fulfillment from it. If you're reading this, you're probably using agentic devtools to improve your productivity, and take away some of the toil that comes with dev work.

Why should agents get to do all the fun parts of software dev, like writing code? Claude Code (the world's favorite AI coding agent) has a feature that allows you to make agentic dev more active.

Here's how you can write code alongside Claude Code, instead of through pure natural language prompts.

Why pair program with AI agents?

Agents can generate code incredibly fast, and at volumes that are tough for humans to truly digest. Many devs end up turning off their brains at a certain point, because understanding the agent's code is much slower than prompting fast and getting a working result.

This leads to devs accepting agentic code suggestions blindly. It's a slippery slope for sure: if 90% of an LLM's suggestions are good, why should you have to sift through all of them with a fine-toothed comb, especially if you hardly find anything off? Then next thing you know, your agent accidentally corrupts your database or changes your file permissions.

Pair programming with an agent keeps you, as the developer, engaged and in the driver's seat. You're actively thinking through every choice, being intentional. It's more akin to the mindset you get into when you're 100% writing code.

Plus, AI agents are good for many things beyond picking up coding tickets. Thanks to their LLMs' wide knowledge bases, they can explain things pretty well, and pick up on niche software patterns. They can step into that "pair" role and guide you. This brings them closer to human-human pair programming.

Claude Code output styles

In August 2025, Anthropic introduced a new feature for changing how Claude Code interacts with a user. Output styles can help extend CC beyond its popular use cases, and Anthropic suggests this can make CC act as any kind of agent (with the same core read/write/execute/list capabilities).

An output style for Claude Code is functionally the same as a system prompt. Claude's default output style is concise, and targeted mainly at solving software engineering tasks.

CC includes two other built-in output styles, which are (in our opinion) more conducive to pair programming.

Explanatory output style

To set: /output-style explanatory

Claude Code's explanatory mode is similar to the default system prompt, but offers educational insights along the way. This is a good mode for newer devs who don't want CC to do all the thinking for them, and are hoping to stay engaged during agentic dev tasks.

CC will add little "insight" blocks within its responses. These are kind of like pro tips, or just extra background info. Since default CC is concise, this is helpful for getting more complete, detailed answers.

Sample prompt:

look at the CLI docs, and tell me what can be improved in the scope of DevEx

CC explanatory mode response (excerpt):

...

Strategic Long-term Enhancements

7. **Progressive disclosure with expandable sections**
8. **Integration examples with popular IDEs and tools**
9. **Video walkthroughs for complex workflows**
10. **Interactive CLI tutorial or guided setup**

★ Insight ─────────────────────────────────────
The current documentation is comprehensive but lacks the user-centric
design principles that make modern developer tools delightful to use. The
biggest opportunities are in restructuring content around workflows and
reducing friction in the getting-started experience.
──────────────────────────────────────────

Learning output style

To set: /output-style learning

Learning mode is more interactive than default or explanatory mode. It gives the user small programming tasks to complete during development, so they can contribute code. It's really beneficial for junior engineers looking for a pair programming setup, as it will break down tasks, assign them, and then give feedback on the completed snippet.

This was our favorite CC output style, since it let us pick up fun coding tasks and walked us through the why/how of every action.

Sample prompt:

look at the CLI docs, and tell me what can be improved in the scope of DevEx

CC learning mode response (excerpt):

Your Task: In docs/CLI.mdx, create a new "Quick Start" section
that should appear right after line 13 (after the opening description).
Look for TODO(human). This section should provide developers with the
fastest possible path from installation to their first successful CLI
interaction.

Guidance: Consider what a developer most wants to accomplish when they
first install the CLI - probably seeing their environments and confirming
everything works. Think about the psychological impact of early success
vs. getting stuck on setup details. The section should be concise (3-5
commands max) and include a verification step that shows success.

Creating your own custom output style

Claude Code is super customizable, and most of this can be done via natural language in Markdown files. You can create your own output style (which is a system prompt for CC) by running:

/output-style:new [name] [verbose description]

Remember, everything that CC knows about software engineering lives in the default output style. You'll need to re-enlighten it here if you want it to be a strong software engineering pair.

Once you provide a description of what output style you'd like, Claude will generate a markdown file, inferring what other characteristics it should have. You can find it in your ~/.claude/output-styles/ directory. You'll want to tweak this to fit what you had in mind.

Changing Claude Code's system prompt

There are a few ways to change/augment CC's system prompt.

Option 1: with output styles (recommended)

Claude's output styles are system prompts. If you want to completely override Claude's default personality, skillset, and behavior, this is how you can do it.

This means you'll need a super high-quality, detail-oriented prompt. Claude Code can help you write one, but don't expect it to be as optimized as CC's default prompt. This is still a great exercise for prompt engineers, and also valuable for those who want to use a terminal-based agent for miscellaneous non-dev tasks.

To recap: you'll do this by launching a Claude Code session and running:

/output-style:new [name] [verbose description]

Option 2: with append-system-prompt

If you want to keep the general behavior of Claude Code's system prompt, but add a few more high-priority instructions, you can use the --append-system-prompt flag. As part of the system prompt, CC will prioritize this over any other config settings (e.g. CLAUDE.md) or prompts.

You can append your own instructions by running:

claude --append-system-prompt "Never commit to main. Alert me when a new feature does not have unit tests written for it"

Option 3: with CLAUDE.md

Your CLAUDE.md file is not a system prompt. Rather, it's a prioritized instruction that Claude Code will refer to one tier below the instructions in the system prompt.

Usually, a CLAUDE.md file is stored at the project level, and is automatically pulled into context when you launch a session. It'll typically store instructions and context relevant to a project, which will save you tokens and guarantee better outputs.

You can run the /init slash command to create it, or add one manually to your project's root directory.

Keep your dev skills sharp

A hybrid of agentic and human dev can be a really refreshing balance for many programmers. You can keep learning good patterns, stay engaged during a dev session, and use your brain in a different way (vs. pure agentic dev). Thankfully, you've got plenty of ways to customize this to your liking with Claude Code's output styles. You can get a more fulfilling dev experience just by toggling system prompts.

Happy coding!

Claude Code Subagents Quickstart: what they are + how to use them

Shipyard DevRel — Thu, 25 Sep 2025 22:09:50 +0000

Subagents are a popular capability of Claude Code, which allow you to get specialized aid and advice during your dev sessions. Here's what they are, how to define them, and a few starter subagent personas.

What are subagents?

Claude Code's subagents are simply Claude instances with different areas of expertise. Instead of asking one generalist agent to do everything (e.g. classic Claude Code), you create specialists.

Within a Claude Code session, you can invoke these subagents for tasks that might require more specialization, and thanks to their system prompts, you'll see better results. Subagents also benefit from having their own context windows, so they only ingest info relevant to their tasks. You'll save tokens, and see less of the "output degradation" that happens when the context window fills up during a long Claude session.

Subagents vs. multi-agent

Subagents and multi-agent setups both involve multiple Claude instances. The biggest distinction here is the scope of the tasks that each type handles. A subagent gets spun up within a single Claude Code session to handle a specific task. It does its work in an isolated context window and returns a result to the parent session. A multi-agent setup (sometimes called an agent team) involves multiple CC sessions running in parallel or in sequence, often managed by an orchestrator.

How to add subagents to Claude Code

Claude Code makes it simple to write/customize/generate your own subagents.

First, make sure you're on the latest version:

claude --version

You can upgrade by running:

claude update

Start a CC session and run the /agents slash command. From here, you'll be able to define a subagent at the project or user level. (We recommend starting out with project-level subagents, so you can customize them to use project-specific tools and domain knowledge).

Claude will ask whether you want to create the subagent with Claude's help. This is a good move, since CC will use the context from your project to customize an agent to your needs. You'll write a summary of its responsibilities and Claude will fill out the rest. Agents are defined in natural language, but you'll want to use good prompt engineering practices to get the best results.

Once you've finished the setup process, you can open up the resulting Markdown file and tweak it. For example, this is a snippet of the markdown config that Claude generated when asked to create a technical docs proofreader:

---
name: proofreader
description: "Use this agent when you need to proofread and correct grammar in markdown files, especially technical documentation or blog posts. Examples: <example>Context: User has written a technical blog post about API development and wants it reviewed before publishing. user: 'I just finished writing a blog post about REST APIs. Can you review it for grammar and formatting?' assistant: 'I'll use the proofreader agent to review your blog post for grammar, formatting, and technical accuracy.' <commentary>Since the user wants proofreading of technical content, use the proofreader agent to handle grammar correction and markdown formatting review.</commentary></example> <example>Context: User has updated documentation with technical content that needs review. user: 'I updated the README with new installation instructions. Could you check it over?' assistant: 'Let me use the proofreader agent to review your README for clarity, grammar, and proper markdown formatting.' <commentary>The user wants documentation reviewed, so use the proofreader agent to ensure proper grammar and markdown syntax.</commentary></example>"
model: sonnet
color: orange
---

You are an expert technical proofreader and grammar specialist with deep knowledge of software development terminology, markdown syntax, and technical writing conventions. You excel at maintaining clarity while preserving the author's intended casual tone and technical accuracy.

In the Markdown frontmatter, you can customize the name and description, choose your preferred Claude model, and set the agent's UI color. Optionally, you can grant it access to only select tools (read, write, grep, etc) instead of all tools, which is the default.
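If you end up with a handful of agent files, a quick script can confirm each one has the frontmatter fields described above. This is a minimal sketch that only handles flat `key: value` lines, so it isn't a full YAML parser. The `tools` line in the sample, including its comma-separated format, is our assumption about how a tool restriction would be written, and the agent text is a shortened stand-in.

```python
def parse_frontmatter(markdown: str) -> tuple[dict[str, str], str]:
    """Split a Markdown file into (frontmatter fields, body)."""
    lines = markdown.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}, markdown
    # Find the closing '---' delimiter after the opening one.
    end = next((i for i in range(1, len(lines)) if lines[i].strip() == "---"), None)
    if end is None:
        return {}, markdown
    fields = {}
    for line in lines[1:end]:
        key, sep, value = line.partition(":")  # split on the first colon only
        if sep:
            fields[key.strip()] = value.strip().strip('"')
    body = "\n".join(lines[end + 1:]).strip()
    return fields, body


# Shortened, hypothetical agent file for illustration.
SAMPLE_AGENT = """---
name: proofreader
description: "Use this agent to proofread markdown files."
model: sonnet
color: orange
tools: Read, Grep, Glob
---

You are an expert technical proofreader.
"""

fields, body = parse_frontmatter(SAMPLE_AGENT)
missing = [key for key in ("name", "description") if key not in fields]
print(fields["name"], fields["model"], missing)  # proofreader sonnet []
print([tool.strip() for tool in fields.get("tools", "").split(",") if tool.strip()])
```

For anything beyond flat fields, a real YAML library would be the safer choice.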

In the content section, you'll be able to go into more detail on what the agent should know, and how it should respond to certain scenarios. Here you may want to list out workflows and define rules.

What subagents do I need?

Think about what tasks you use Claude Code for. They probably can be boiled down into a few roles. Which agents you'll want really comes down to personal preference, but here are a few that we've found useful.

Tip: You can paste these descriptions into CC during agent creation.

The System Architect

This is a big-picture agent that advises you and nitpicks your design patterns. Use it to help you think through large features. It chimes in when you drift towards an anti-pattern.

It's well-versed in the frameworks your app uses, and has plenty of expertise in systems at scale.

This agent will be helpful when you ask it about optimizing logic, database choices, and why your monolith might need to become microservices.

The Code Reviewer

This is a very thorough agent that nitpicks every PR to find security holes, performance issues, unresolved comments, etc. This agent should be well-versed in optimal algorithms (e.g. critiquing your nested loops), but also understand that good code is elegant, not complicated.

It has a sophisticated understanding of code syntax and style. This agent can act as one (of your many) checks and balances against LLM hallucinations.

The Debugger

This debugging agent is methodical and patient. It traces your internal docs to run your code properly, and then ingests/analyzes logs to find the point of failure. It understands your services enough to recognize where a bug is coming from, and might recognize some common failure patterns.

This agent works best alongside a dev, and asks questions to help it solve the problem (e.g. "when did this last work?", "what changed recently?").

The DevOps Engineer

This agent understands your deployment process + pipeline. It knows Docker, Kubernetes, and your CI/CD framework. It can help proofread your config files, and cross-reference with logs when you're not getting the expected behavior. You should be able to ask it how to optimize/implement best practices, and it should confidently answer from its glossary of DevOps patterns.

The Playwright test agents

Playwright maintains three custom subagents: the planner, the generator, and the healer. They can help with every stage of test writing, or even work together to create a strong automated test suite for your app.

Creating smart personas

What you need in a subagent will become clearer as you use Claude Code. Think about which tasks you'd like more sophisticated help with, and which parts matter less to you.

Even though subagents are specialized and generally perform better for domain-specific tasks than stock Claude, they'll still have their blind spots and weaknesses. Engineers have reported strong results when including these weaknesses in their subagent system prompts. It's also important to only give each subagent tasks that are in its wheelhouse.

In your system prompt, it helps to instruct your agent to "be honest" or "be critical" or "be realistic". Many LLM system prompts default to an agreeable demeanor, so you'll want to be sure yours overrides this.

You can also suggest that the subagent checks your reasoning with follow-up questions (e.g. "why do you want to make this change?", "how do you know this is the root of the problem?"). Making subagents a little argumentative and opinionated will protect you from bad design decisions.

Your subagent team

In the beginning, you'll want to define and deploy your subagents one at a time. Get used to how they work, and how you can use them to get better results. You'll probably max out at about 3 or 4 subagents total; after that your own productivity may drop.

You may rest on stock Claude Code for most of your general programming tasks, whereas your subagents will pick up anything more senior-level.

Using Claude Code? Give your agents ephemeral environments. They can deploy the code they write, pull logs, find + fix bugs, all with little-to-no human intervention. Try it free for 30 days.

How to prompt Gemini CLI to improve your Dockerfile

Shipyard DevRel — Wed, 24 Sep 2025 20:54:43 +0000

When you're generating a Dockerfile for your app, you might get something that works but isn't optimized. This means it could be bloated with unnecessary packages (leading to potential vulnerabilities), not designed with dev/prod parity in mind, or not structured for faster builds.

We found an effective way to improve Dockerfiles with Gemini CLI. Gemini CLI suits this task because its context window is large enough to take in a big chunk (if not all) of your codebase, and it can use that to help write a working, optimized Dockerfile.

Why optimize your Dockerfile?

It's quite simple to write a working, MVP Dockerfile for one of your services. Writing it the fast way leaves a few considerations off the table: specifically, speed, size, and security.

Since Docker images are used for cloud-based dev and deployments, you'll want to keep them as small as you can. If you use a larger base image and add in a lot of extra tools, you'll hit your container registry and CI build minute limits much faster, without any benefit. Keeping your images lean also means a smaller attack surface (new vulnerabilities are found every single day, and fewer packages overall == less chance of a vulnerability).

One of the best benefits of Docker is that it helps you make your software multi-environment friendly, so you can use the same (or similar) config from local dev to production. Having a Dockerfile for every environment kind of defeats the purpose. Optimizing it means using env vars and keeping the overall architecture more abstract.

Docker's layer caching allows images to build faster by reusing unchanged layers, starting from the top of the Dockerfile. For example, if lines 1-4 manage imports and installs, and you change the instructions on lines 5 and 6, Docker will rebuild only from layer 5 down. You have to be strategic about where you place certain instructions in the Dockerfile. Unoptimized Dockerfiles often don't take this into consideration, so your builds will be slower and thus more expensive.
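To make the layer ordering concrete, here's a minimal sketch that writes out an example Dockerfile. It assumes a Node.js app with a package.json and package-lock.json; the base image, npm scripts, and the Dockerfile.example filename are placeholders for illustration, not a recommendation for your stack.

```shell
# Write an example Dockerfile whose slow, rarely-changing steps come first.
# The Node.js base image, npm commands, and file names are assumptions.
cat > Dockerfile.example <<'EOF'
FROM node:22-slim
WORKDIR /app

# Dependency manifests change rarely, so these layers usually come from cache.
COPY package.json package-lock.json ./
RUN npm ci

# Source code changes often; only the layers from here down get rebuilt.
COPY . .
RUN npm run build

CMD ["npm", "start"]
EOF

echo "wrote Dockerfile.example"
```

If the two COPY instructions were collapsed into a single COPY . . at the top, any source edit would invalidate the dependency install layer as well.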

As with any software you're writing with a coding agent, you'll want to do several passes over your Dockerfile to iteratively improve it.

The hidden costs of "lazy" Dockerfiles

Unoptimized Dockerfiles don't really have a huge impact locally, since you aren't paying for CPU, storage, or RAM expended. However, one of the major benefits of Dockerfiles is their versatility from local machine to cloud. And of course, when there's cloud, there are cloud costs.

Cloud is typically billed usage-based. Slow-building + bloated Dockerfiles rack up your CI/CD usage faster, and consume your container registry bandwidth and storage. Inefficient resource usage also adds up in pre-production and production environments.

A couple hundred extra MB might not make a huge difference outright, but when you think about how many times you're building/rebuilding/uploading per week, you're multiplying that spend quite a bit.
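As a rough back-of-the-envelope sketch, you can multiply this out yourself. Both numbers below are hypothetical; swap in your own image overhead and build count.

```shell
# Hypothetical numbers: adjust to your own pipeline.
extra_mb=200          # unneeded weight per image, in MB
builds_per_week=150   # builds + pushes across all branches

weekly_mb=$(( extra_mb * builds_per_week ))
weekly_gb=$(( weekly_mb / 1024 ))   # integer division, rounds down

echo "Extra data moved per week: ${weekly_mb} MB (~${weekly_gb} GB)"
# prints: Extra data moved per week: 30000 MB (~29 GB)
```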

Aside from literal "costs", slower, unoptimized Dockerfiles cost you when it comes to developer experience. Wait times add up, and rebuilding with updates will take longer than it needs to. Plus, they're harder/more tedious to debug.

Refining a Dockerfile through multiple prompts

You'll get the best results if you optimize your Dockerfile in several different stages. This way, Gemini can be more thorough in every step, and you can review its suggestions in smaller, more palatable batches.

Step 1: Analyzing the Dockerfile

Before having it make any changes, you can kick off Gemini CLI's task with some context on what you want to accomplish.

Here, you're basically priming it on how to understand your codebase. With its massive context window, it should have plenty of room to do so.

During this stage, you should also be able to verify whether Gemini is on track. If it isn't, you can course-correct it and give it additional (paraphrased) context, or tell it to study certain files.

"Please analyze my codebase and current Dockerfile and Docker Compose. What type of application is this, what are the main dependencies, and what optimization opportunities do you see? Here are my key files: [files]"

Review Gemini's output to make sure it understands both the task and info on the codebase it'll need to work with.

Step 2: The strategy

Now that Gemini has some background in your config + codebase, you can start using it to map out a strategy. For example, when you say "optimize", what do you mean? Speed, security, image size, all of the above? Do you want to use a minimal image and strip down to only necessary dependencies?

You'll also want to go into specs of your remote environment(s) (esp. since these might not be defined in your codebase). Are you using EC2, EKS, GKE? Kubernetes? Do you want to use the same Dockerfile across environments?

If you have any more nonnegotiables about image specs, you can list them here. This may include compliance and dev workflow needs.

"Based on your analysis, I want to prioritize [your priorities e.g. speed/size/security]. Follow the patterns [good practice] and [other good practice]. My deployment environment is [context]. I will need it to be compatible with [devtool] and have [xyz compliance]. Can you recommend the best optimization strategy and explain the trade-offs of different approaches?"

By asking Gemini for alternatives, you're helping it reason through picking a solution and forcing it to consider more than just the most common one. Otherwise, it won't evaluate the pros and cons of each. It may even select a different solution via this approach, and it tends to arrive at a more fleshed-out conclusion. (Plus, you can and should have a say in which approach it takes.)

Step 3: Implementation

At this stage, you can ask Gemini CLI to revise your Dockerfile so that it is production-ready. You can help it stick to stricter logic by asking it to explain every decision it makes.

You can also request that it generate a Markdown guide with start and test commands. Optionally, you might also want to wrap those commands in a Makefile.
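The prompt for this step also asks for a .dockerignore file, which keeps unneeded files out of the build context. For reference, here's a sketch of what a starter one might contain. Every entry is an assumption about a typical repo, and it's written to .dockerignore.example so it won't overwrite anything you already have.

```shell
# Write a starter .dockerignore; every entry is an assumption about a typical repo.
cat > .dockerignore.example <<'EOF'
.git
.env
.env.*
node_modules
**/__pycache__
*.log
EOF

echo "wrote .dockerignore.example"
```

Compare whatever Gemini generates against a list like this, and check that nothing your build actually needs ends up excluded.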

"Please revise and optimize this Dockerfile and Docker Compose following the strategy we discussed. Include detailed comments explaining each optimization, a .dockerignore file, and a markdown guide with specs + the commands I should use to build and test it."

Step 4: Refining the Dockerfile

Now that you have an "improved" Dockerfile and Docker Compose, you'll want to make sure they actually work. Ask Gemini CLI to run the Dockerfile and analyze the logs. You should also take a look at the performance yourself, to double-check what Gemini says.
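To double-check the numbers yourself, something like the following sketch can report build time and image size. It assumes Docker is installed; measure_build and bytes_to_mb are hypothetical helper names, and the tags and file names in the usage comment are placeholders.

```shell
# Convert a byte count to whole megabytes (integer division).
bytes_to_mb() {
  echo $(( $1 / 1024 / 1024 ))
}

# Build an image, then report wall-clock build time and final size.
# Arguments: image tag, path to the Dockerfile to build.
measure_build() {
  tag="$1"
  dockerfile="$2"
  start=$(date +%s)
  docker build -t "$tag" -f "$dockerfile" . || return 1
  end=$(date +%s)
  size_bytes=$(docker image inspect "$tag" --format '{{.Size}}')
  echo "$tag: built in $(( end - start ))s, $(bytes_to_mb "$size_bytes") MB"
}

# Example (requires Docker; tags and file names are hypothetical):
#   measure_build myapp:before Dockerfile.old
#   measure_build myapp:after  Dockerfile
```

Cached layers will skew the timing, so for a fair cold-build comparison you may want to add --no-cache to the docker build call.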

Let it know your impressions, and whether you want it to improve any aspects from here.

You may tweak your Docker config via env vars: ask Gemini what their values should be and test from there. Once it's working locally, repeat in your PR, test, and staging environments.

What to avoid

As you've probably seen with LLMs, they can very much be "garbage in, garbage out" systems. You don't want to take shortcuts, and generic input will lead to generic output (not good for something as unique as your codebase). Here are a few particularly egregious things to avoid:

Don't skip the planning/discovery phase. It helps Gemini get the context it needs to give you customized results.
Don't be vague. Saying "make it faster" won't get you solid results, because it casts too wide a net for an LLM to narrow down to the right answer.
Don't skip testing, and don't trust an LLM to be correct when it attests that everything is working.
Remember that the optimal solution for one codebase won't necessarily apply to yours. Take time to understand what you're working with. Above all, you should be the expert, not the LLM.

Better Dockerfiles for better deployments

Using Gemini CLI's huge context window, you can give it enough context to optimize your Dockerfile to the best of its ability. We saw the best results with an iterative, multi-prompt approach, since that helped Gemini break each task into concrete steps, and helped us steer it in the right direction along the way.

Overall, better Dockerfiles lead to improvements all throughout the SDLC. You'll have better performance when running them locally, and cost savings in the cloud. They'll build faster and take advantage of Docker's layers/caching. You can apply these practices faster with Gemini's help.

Claude Code CLI Cheatsheet: config, commands, prompts, + best practices

Shipyard DevRel — Fri, 22 Aug 2025 17:53:17 +0000

Last updated: 2025-08-18

Claude Code is Anthropic's agentic coding tool that lives in your terminal and, for now, is SOTA for coding. This cheatsheet should give you everything you need to install, configure, and use Claude Code.

Getting started with Claude Code

Once you have a Claude Pro or Max subscription (or are paying for API access), you can start using Claude Code from your terminal.

(Our advice: opt for the subscription if you're using it consistently and at a reasonable rate. It's worth getting API tokens if you don't want to deal with token refresh windows).

Installation

Install globally:

npm install -g @anthropic-ai/claude-code

Prereqs: Node.js 18 or newer

Auth

Set up your Anthropic API key before launching CC. Alternatively, if you have a Pro or Max plan, you'll have the option to auth via your browser.

Get your key: Get an API key from the Anthropic Console.

Set your key: Set the ANTHROPIC_API_KEY env var:

export ANTHROPIC_API_KEY="YOUR_ANTHROPIC_API_KEY"

Add this to your shell profile (e.g., ~/.bashrc, ~/.zshrc) to persist across sessions.
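If you'd rather script the shell profile step, here's a minimal sketch. persist_export is a hypothetical helper (not part of Claude Code) that appends the export line only when the variable isn't already set in that file, so re-running it won't create duplicates.

```shell
# Append an export line to a profile file only if it is not already there.
# persist_export is a hypothetical helper, not part of Claude Code.
persist_export() {
  profile="$1"   # e.g. "$HOME/.zshrc"
  name="$2"      # e.g. ANTHROPIC_API_KEY
  value="$3"
  touch "$profile"
  if grep -q "^export $name=" "$profile"; then
    echo "$name already set in $profile; leaving it unchanged"
  else
    printf 'export %s="%s"\n' "$name" "$value" >> "$profile"
    echo "added $name to $profile"
  fi
}

# Example:
#   persist_export "$HOME/.zshrc" ANTHROPIC_API_KEY "YOUR_ANTHROPIC_API_KEY"
```

Keep in mind that this stores the key in plain text, so treat your profile file like any other secret-bearing file.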

Basic usage

Interactive mode (REPL): Start a conversational coding session.

claude

REPL with initial prompt: Start with a specific question.

claude "explain this project"

Print mode: Query once and exit (great for scripting).

claude -p "explain this function"

Piping content: Process piped input.

cat logs.txt | claude -p "explain these errors"

Continue recent conversation:

claude -c

Resume specific session:

claude -r "session-id" "continue working on this feature"

Config

You're able to write (or generate) files to configure CC's basic behaviors.

Settings files

Claude Code uses hierarchical settings stored in JSON files:

User settings: ~/.claude/settings.json (applies to all projects)
Project settings: .claude/settings.json (shared with team, checked into git)
Local project settings: .claude/settings.local.json (personal, ignored by git)

Example settings.json:

{
  "model": "claude-sonnet-4-20250514",
  "permissions": {
    "allow": ["Read", "Write", "Bash(git *)"],
    "deny": [
      "Read(./.env)",
      "Read(./.env.*)",
      "Write(./production.config.*)"
    ]
  },
  "hooks": {
    "PostToolUse": [
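      {
        "matcher": "Write|Edit",
        "hooks": [
          {
            "type": "command",
            "command": "jq -r '.tool_input.file_path' | { read -r file_path; if echo \"$file_path\" | grep -q '\\.py$'; then python -m black \"$file_path\"; fi; }"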
          }
        ]
      }
    ]
  }
}
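To see which of these settings files actually exist for a given project, a quick sketch like this can help. list_settings is a hypothetical helper; it lists the three locations from most specific to least specific, and to our understanding the more specific files take precedence when the same key appears in more than one.

```shell
# Print each settings file Claude Code may read for a project, marking which exist.
# Listed from most specific to least specific; list_settings is a hypothetical helper.
list_settings() {
  project_dir="$1"
  home_dir="$2"
  for f in \
    "$project_dir/.claude/settings.local.json" \
    "$project_dir/.claude/settings.json" \
    "$home_dir/.claude/settings.json"
  do
    if [ -f "$f" ]; then
      echo "found    $f"
    else
      echo "missing  $f"
    fi
  done
}

list_settings "$PWD" "$HOME"
```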

Memory files (CLAUDE.md)

Use CLAUDE.md files to give context and instructions to Claude. They save time + tokens, and are super helpful for info you'd otherwise include in your prompts. These are loaded hierarchically:

Global: ~/.claude/CLAUDE.md (applies to all projects)
Project root: ./CLAUDE.md (project-wide context)
Subdirectories: Component-specific instructions

Example CLAUDE.md:

# Project context

## Coding standards
- Use TypeScript for all new code
- Follow existing ESLint configuration
- Write tests for all new functions using Jest
- Use functional components with hooks in React

## Architecture
- Frontend: Next.js with TypeScript
- Backend: Node.js with Express
- Database: PostgreSQL with Prisma
- State: Zustand for client state

## File organization
- Components in `src/components/`
- Utilities in `src/utils/`
- Tests alongside source files with `.test.ts` extension

CLI commands + flags

You can use the following shell commands outside a Claude session.

Core commands

Command	Description	Example
`claude`	Start interactive REPL	`claude`
`claude "query"`	Start REPL with initial prompt	`claude "explain this project"`
`claude -p "query"`	Query via print mode, then exit	`claude -p "review this code"`
`claude -c`	Continue most recent conversation	`claude -c`
`claude -c -p "query"`	Continue in print mode	`claude -c -p "run the tests"`
`claude -r "id" "query"`	Resume session by ID	`claude -r "abc123" "finish the PR"`
`claude update`	Update to latest version	`claude update`
`claude mcp`	Configure MCP servers	`claude mcp add server-name`

CLI flags

Flag	Description	Example
`--add-dir`	Add additional working directories	`claude --add-dir ../apps ../lib`
`--allowedTools`	Allow specific tools without prompting	`claude --allowedTools "Write" "Bash(git *)"`
`--disallowedTools`	Block specific tools	`claude --disallowedTools "Bash(rm *)"`
`--model`	Use specific Claude model (alias or full model name)	`claude --model opus`
`--max-turns`	Limit conversation turns	`claude -p --max-turns 3 "query"`
`--output-format`	Set output format (text/json/stream-json)	`claude -p --output-format json "query"`
`--input-format`	Set input format	`claude -p --input-format stream-json`
`--verbose`	Enable verbose logging	`claude --verbose`
`--continue`	Continue most recent conversation	`claude --continue`
`--resume`	Resume specific session	`claude --resume abc123`
`--dangerously-skip-permissions`	Skip all permission prompts (proceed with caution)	`claude --dangerously-skip-permissions`

Interactive session commands

You can use these slash commands during a Claude Code session.

Built-in slash commands

Command	Description
`/help`	Show all commands + custom slash commands
`/config`	Configure Claude Code settings interactively
`/permissions`	Configure tool permissions interactively
`/hooks`	Configure hooks
`/mcp`	Manage MCP servers
`/agents`	Manage subagents (create, edit, list)
`/vim`	Enable vim-style editing mode
`/terminal-setup`	Install terminal shortcuts (Shift+Enter for iTerm2/VS Code)
`/install-github-app`	Set up GitHub Actions integration

Note: The /help command shows all available slash commands, including your custom commands from .claude/commands/ and ~/.claude/commands/ directories, as well as any commands you have from connected MCP servers.

File and directory references (@)

You can reference files or directories in your prompts. (If you don't have an exact filename/location, CC can grep for it).

Single file:

> Review this component for accessibility issues. @./src/components/Button.tsx

Directory (recursive):

> Add comprehensive error handling to all API routes. @./src/api/

Multiple files:

> Compare these two implementations. @./src/old.js @./src/new.js

Glob patterns:

> Review all test files for completeness. @./src/**/*.test.ts

Shell commands (!)

You can run shell commands directly in a Claude session. Prefix a command with ! to bypass Claude's conversational mode; asking Claude to run the same command conversationally uses more tokens to get the same result:

Single command:

> !npm test

Shell mode toggle:

> !
# Now in shell mode. type ! again to exit

Advanced features

We appreciate how customizable CC is, and it's quite easy to extend it with a few features like custom commands, hooks, MCP, and stored prompts.

Custom slash commands

You can create your own CC slash commands. This is a good "shortcut" for pulling up a common prompt. Again, the more context the better (but also keep these abstract so they can be widely applied). Define them in Markdown files:

Project commands (.claude/commands/):

# Create a project-specific command
mkdir -p .claude/commands
echo "Analyze this code for performance issues and suggest optimizations:" > .claude/commands/optimize.md

Personal commands (~/.claude/commands/):

# Create a personal command for all projects
mkdir -p ~/.claude/commands
echo "Review this code for security vulnerabilities:" > ~/.claude/commands/security.md

Commands with arguments:

# Create parameterized command
echo 'Fix issue #$ARGUMENTS following our coding standards' > .claude/commands/fix-issue.md

# Use that command in a Claude session
> /fix-issue 123
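To picture what happens to that placeholder, here's a tiny local simulation. It only mimics the substitution with sed for illustration; it isn't how Claude Code implements it.

```shell
# Simulate the $ARGUMENTS substitution locally (illustration only).
template='Fix issue #$ARGUMENTS following our coding standards'
args="123"

# Replace the literal text $ARGUMENTS with the supplied arguments.
expanded=$(printf '%s\n' "$template" | sed "s/\\\$ARGUMENTS/$args/g")
echo "$expanded"
# prints: Fix issue #123 following our coding standards
```

Note the single quotes in the echo command that creates the file: they stop your shell from expanding $ARGUMENTS before it's written to disk.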

Advanced command with context:

---
allowed-tools: Bash(git add:*), Bash(git status:*), Bash(git commit:*)
description: Create a git commit with context
---
## Context
- Current status: !`git status`
- Current diff: !`git diff HEAD`
- Current branch: !`git branch --show-current`

Create a meaningful commit message based on the changes above.

Hooks for automation

Hooks run shell commands automatically at specific points in a session, such as before or after a tool call:

Example: Auto-format Python files

{
  "hooks": {
    "PostToolUse": [
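      {
        "matcher": "Write|Edit",
        "hooks": [
          {
            "type": "command",
            "command": "jq -r '.tool_input.file_path' | { read -r file_path; if echo \"$file_path\" | grep -q '\\.py$'; then python -m black \"$file_path\"; fi; }"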
          }
        ]
      }
    ]
  }
}

Hook events:

PreToolUse: Before tool execution (can block)
PostToolUse: After tool execution
UserPromptSubmit: Before processing user input
SessionStart: At session startup

Model Context Protocol (MCP)

You can extend what Claude Code can do by adding MCP servers:

Add MCP server:

claude mcp add my-server -e API_KEY=123 -- /path/to/server arg1 arg2

(Check your MCP tool's docs to get the right syntax here.)

Common MCP use cases:

Connect to Google Drive for design docs
Integrate with Jira for ticket management
Add custom dev tooling
Access external databases

Conversation management

Continue recent work:

claude --continue
claude --continue --print "show me our progress"

Resume specific session:

claude --resume  # Shows picker
claude --resume session-id

Save and restore context:
All CC conversations are auto-saved with full message history and tool state.

Common workflows

Here are a few different tasks that CC can help with. Remember, the more context, the better, so if you can provide specifics around your ask, Claude will give better results (and you'll have fewer things to correct).

Code analysis

> Analyze this codebase structure and suggest improvements. @./src/

Feature development

> Implement a user auth system with JWT tokens and password hashing

Bug fixing

> Debug this error: "TypeError: Cannot read property 'id' of undefined" @./src/user-service.js

Code review

> Review this pull request for potential issues, performance problems, and adherence to our coding standards. @./src/

Testing

> Generate comprehensive unit tests for this utility module. @./src/utils/validation.js

Refactoring

> Refactor this class to use dependency injection and make it more testable. @./src/services/EmailService.js

Docs

> Generate API docs for all endpoints in this directory. @./src/routes/

CI/CD integration

# In GitHub Actions or other CI
claude -p "If there are any linting errors, fix them and suggest a commit message"

Security + permissions

Claude Code defaults to asking permission before actions that can change your system, such as editing files or running shell commands. If you trust it for a certain type of action (e.g. fetching links, running git commands), you can grant it wider permissions. Most devs approve actions individually.

Permission system

Claude Code lets you grant permissions as you see fit:

Tool permissions:

Read: File reading operations
Write: File writing/modification
Bash: Shell command execution
MCP tools: External integrations

Configuration examples:

{
  "permissions": {
    "allow": [
      "Read",
      "Write(src/**)",
      "Bash(git *)",
      "Bash(npm *)"
    ],
    "deny": [
      "Read(.env*)",
      "Write(production.config.*)",
      "Bash(rm *)",
      "Bash(sudo *)"
    ]
  }
}

Best practices

As with anything in dev, keep an eye on what permissions you're granting, and watch which shell commands are being run. Also:

ALWAYS review changes before accepting
Use .claude/settings.local.json for personal/sensitive settings
Configure tool permissions for your env; verify everything (don't use YOLO mode unless you've put the proper safeguards in place)
Use hooks for auto code formatting/validation
Keep sensitive data in .env files; deny CC permission to these

Understanding Claude's session model

If you want to best plan out your Claude sessions, you'll want to understand your constraints. Claude's tokens are granted by plan based on overall server load, so on busier days you'll get fewer.

If you're not using the API for pay-as-you-go Claude access, you'll want to choose the Claude tier that works best for you.

Pro: for a medium-high coding workload. Expect to use continuously for smaller code changes, and as a supplement to your own coding. $20/month
Max5: for an intense coding workload. 5x the token allowance of Pro. Opus access. $100/month
Max20: for near-autonomous, nonstop, heavy development workloads with multiple sessions/agents. Significantly larger context window. 20x the token allowance of Pro. Opus access. $200/month

Sessions kick off as soon as you send your first message, and last five hours. If you're using Opus, you'll burn through tokens much faster.

It's most token-efficient (and guarantees better outputs) if you start different sessions for different tasks.

Extra resources

Claude has built-in access to its own docs: you can ask questions about features directly
Check out the official docs
Use /help in your Claude session

Need to take CC to the next level? Give your AI agents ephemeral environments where they can deploy code, pull logs, and fix bugs autonomously. Try Shipyard free for 30 days.

The Docker Compose Test Environment Guide

Shipyard DevRel — Wed, 04 Jun 2025 15:54:50 +0000

Docker Compose is an invaluable development tool for microservices apps. If you’re working with containers, you’ll likely use it in some capacity. Whether you’re using it for orchestration or app definition (or both), Compose abstracts away the complexity and guesswork involved in container management.

What is a Docker Compose environment?

Docker Compose has built-in container orchestration for single and multi-service environments. It’s the obvious choice for setting up a local dev environment for containerized workloads: you can use it to run your Dockerized app with limited config or overhead. For example, you might use Compose to spin up the containers, see how they interact with each other, and do some simple testing.

Docker Compose for the perfect local environment

A Docker Compose environment might be more limited and less complex than what you’d use when you have cloud resources available. However, sometimes Compose’s orchestration is all you need. Compose is especially convenient during the inner dev loop, since you can spin up environments within minutes, and run them without overwhelming your laptop.

For these reasons, teams find Compose sits at a good balance between simplicity and capability. They can use it to provide build instructions/config for in-progress services, and pull prebuilt images from a container registry.

Docker Compose also takes care of basic networking between containers, and allows users to configure external networks. With Compose, you can also create custom links (aliases) between containers.

Compose excels because it has top-notch DevEx. By virtue of being so straightforward, you can spend more time developing your app instead of tweaking orchestration config. The Docker CLI has thorough logging, so fixing/improving your Compose definition isn’t a huge task.

Should I use Docker Compose for cloud environments?

When you have cloud compute at your disposal, you may want to take advantage of that. Compose is exceptional because it is so lightweight, but with the tradeoff of limited scalability. You can get more efficient (and better optimized) orchestration with a lightweight Kubernetes distro instead, which will also be closer in parity to your production environment.

At the end of the day, Compose is the clear winner for local dev, but its benefits are lost by the time you get to remote environments. Instead, you might use your Compose file as an application definition (instead of a Kubernetes manifest) for your ephemeral, staging, and production environments. We’ll get into that shortly.

K3s: lightweight, yet capable container orchestration

Rancher’s k3s distribution (named after the abbreviation of “Kubernetes” at half-length) is a lightweight alternative. Unlike stock k8s, k3s is much easier to configure and tweak. While it has fewer features, it still is pretty powerful and can be a good approximation of production orchestration, especially for intermediate-stage environments.

Sometimes, it makes sense to use k3s at the local level, but its value really shines in cloud dev environments, staging, and ephemeral environments. With a much smaller memory footprint, it’s a stellar alternative for most use cases that don’t require massive scaling, complex networks, or multi-tenant architecture.

Docker Compose to Kubernetes for preview environments

If you want to forgo Kubernetes manifests altogether, you’ll save yourself the hassle of defining, tweaking, and optimizing orchestrations. Instead, you might have a compose.dev.yml and a compose.staging.yml and/or compose.prod.yml. These will be more or less similar, with varying services (perhaps debug containers instead of final-built images). This way, when something works in one environment but not another, you can narrow down the issue to the places where the config differs (or conclude that it’s infrastructure-related).
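As a minimal sketch of that narrowing-down step, you could diff the two files directly. compose_diff is a hypothetical helper, and the file names are whatever you pass in.

```shell
# Compare two Compose files and point at the differing lines.
# compose_diff is a hypothetical helper; pass it any two Compose files.
compose_diff() {
  for f in "$1" "$2"; do
    [ -f "$f" ] || { echo "missing file: $f" >&2; return 2; }
  done
  if diff -u "$1" "$2"; then
    echo "identical: look at infrastructure rather than config"
  else
    echo "config differs: start debugging with the lines above"
  fi
}

# Example: compose_diff compose.dev.yml compose.staging.yml
```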

With a platform like Shipyard, you can get all the perks and convenience of using Docker Compose for local dev. Once you’re done iterating, Shipyard transpiles your Compose file to Kubernetes manifests, so you get production-grade orchestration for every branch-based ephemeral environment. After you merge your code changes, you can take that same (or similar) Compose file and use it for a service like AWS Elastic Beanstalk for your production deployment.

Elastic Beanstalk for Compose in production

One major benefit of Compose is that if you set up your environment workflow right, you can use Compose as your application definition throughout different environments. This prevents configuration drift and helps keep all your service definitions constant. After all, it’s much simpler to handle a couple of Compose files instead of writing custom Kubernetes manifests for every type of environment in your SDLC.

AWS Elastic Beanstalk ingests your app and Compose file, and uses that to orchestrate your application for staging or production deployments. It’s typically lauded for its ease-of-use, especially since it can harness EC2 with relatively little setup.

If you wanted a purely-Compose environment workflow, you could use Docker Compose’s native orchestration locally, Shipyard’s ephemeral environments for test/QA and/or staging, and Elastic Beanstalk for staging and/or prod.

Tying it all together

Docker Compose plays an important role in environment workflows from development to deployment. Using Compose can remove some of the lift with regards to config and orchestration, and abstract away the trickier parts of multi-container management.

Many teams who use Shipyard love Compose for its flexibility. They use it for local development, and then pass their Compose file to Shipyard to gain production-level orchestration for every PR/MR. Once they’re done iterating and testing, some use Compose as the application definition for a service like Elastic Beanstalk. If you’re looking for dev/prod parity without the hassle, sign up for a 30-day free trial.

Local-first CI/CD with Makefiles

Shipyard DevRel — Thu, 01 May 2025 15:42:37 +0000

Continuous integration often only happens after a commit or a PR, in a remote environment. However, there’s a ton of benefit from designing these pipelines at the local level, and running them early on. This way, you’re using the same steps throughout the SDLC, but running them across different environments, which helps show incompatibilities between your application and any new code changes. Here’s why you should keep your pipeline logic in Makefiles and use them from development to deployment.

Why local-first makes sense for CI/CD

Continuous integration isn’t something that needs to be restricted to remote pipelines. You can get a lot more mileage out of your CI/CD pipelines if you begin integration locally. If you’re able to resolve bugs uncovered by your pipeline right then and there on your machine, you’ll save a lot of time (and CI runner costs).

The core philosophy of continuous integration is keeping code changes small and modular. That way, you can test each new one individually and thoroughly, and stay aware of how it affects your codebase. CI also emphasizes moving fast to avoid merge conflicts or a stale trunk.

The best way to do this is to find out how your code changes fare early on: at the local level. That way, you’re keeping your commit history cleaner, and only pushing changes that pass local integration steps. This way, you can stay in the inner loop, take more risks, and trust your code changes more, instead of feeling uncertain until they hit a remote branch and trigger your cloud-hosted CI/CD.

Keeping your logic outside of the pipelines

You don’t want to spend time building out your CI/CD logic within the pipeline itself. Someday, you might change CI/CD platform providers, and it can take months to transpose one YAML spec to a different YAML spec.

The best practice for writing pipelines is treating them as simply a wrapper for your build/test/deploy steps. This way, you can have a single source of truth for both your local and remote workflows, instead of having fully separate pipelines for every stage. This makes maintaining CI/CD much more straightforward, and helps guide you towards good pipeline design principles.
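As a rough sketch of that single source of truth, the same stages can be run locally in order before you push. The target names (deps, lint, test, build) match the make commands used in the GitHub Actions example later in this post; run_pipeline and the MAKE_CMD override are hypothetical, and the sketch assumes your Makefile defines those targets.

```shell
# Run the pipeline stages in order, stopping at the first failure.
# MAKE_CMD defaults to make; override it to dry-run the sequence.
run_pipeline() {
  for target in deps lint test build; do
    echo "==> make $target"
    "${MAKE_CMD:-make}" "$target" || { echo "stage '$target' failed" >&2; return 1; }
  done
  echo "all stages passed"
}

# Example:
#   run_pipeline                  # runs the real targets
#   MAKE_CMD=true run_pipeline    # dry run that only prints the stage names
```

Because the remote pipeline calls the same targets, a failure that shows up only in CI points at an environment difference and not at the steps themselves.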

Why Makefiles?

The Makefile spec has been around since 1976, and is still considered one of the best methods to define build workflows.

A Makefile serves as a collection of scripts linked to make aliases. For example, if you want to start your app’s PostgreSQL database, you’d run make postgres.start, and in your Makefile, you’ve defined that alias like this:

# Recipe lines must be indented with a tab, not spaces
.PHONY: postgres.start
postgres.start:
	docker compose up -d postgres
	docker compose exec postgres \
	  sh -c 'while ! nc -z postgres 5432; do sleep 0.1; done'

Makefiles make it easy to ensure your build steps stay consistent. Instead of memorizing (or copy-pasting) several steps, you can invoke the same steps in the same order, every time.

And these steps will likely not change too much throughout your SDLC. The main difference will be the infrastructure they’re run against (e.g. a small local dev environment vs. a staging environment that approximates all production services).

If you want to learn more about the beauty of Makefiles, check out our post “Makefiles for Modern Development”.

Using your remote pipeline as a wrapper for your Makefile

A Makefile brings all your tooling together into a centralized spec. You’re invoking different commands and using different tools, so it really is a wrapper in itself. This means that you’ll want your Makefile to be polished and up-to-date, since it should act as your single source of truth for dev/test workflows.

Once you have that ready to go, your CI/CD YAML should be really straightforward to write. Essentially, it’ll serve as a wrapper for your Makefile. In this example, we’re using make commands in a GitHub Actions workflow, where each step invokes a command from the Makefile. This keeps it consistent: you know exactly what is happening in each step, because this is a similar workflow to the one you’ve used during prior development checkpoints (e.g. locally). You can solve for dev/prod parity issues by seeing where the pipeline fails (you already know these exact steps have succeeded on your machine, so it’ll be easier to narrow down the point of failure).

on:
  push:
    branches: [dev, main]

jobs:
  ci:
    runs-on: ubuntu-latest

    steps:
      - name: Checkout repository
        uses: actions/checkout@v4

      - name: Install dependencies
        run: make deps

      - name: Lint code
        run: make lint

      - name: Run tests
        run: make test

      - name: Build app
        run: make build

      - name: Build Docker image
        run: make docker.build

      - name: Push Docker image
        if: github.ref == 'refs/heads/main' && github.event_name == 'push'
        run: make docker.push

Using a Makefile as your local CI/CD pipeline

Once your CI/CD logic is already living in your Makefile, using anything but a Makefile is overkill for most local CI/CD workflows. Why add yet another wrapper? Your Makefile can invoke and stack make targets, so you can bundle a few together to run your CI/CD steps in order. You can also define a few different variants (e.g. defining one workflow that seeds the database and runs tests, and another that just lints and builds).

In your Makefile, you can group your existing targets together:

all: deps lint test build

And then you simply run a single make command to execute the entire local CI/CD process:

make all

Better pipelines == better production

When you’ve fine-tuned your CI/CD pipeline, you’ll start getting more value from it by using it at every major SDLC stage. As your code changes go through different “gates” and environments, your CI/CD can stress test them and ensure that they’re production-ready.

And if you want to run CI/CD against full-stack, production-like environments on every code change, Shipyard has you covered. It manages the lifecycle of ephemeral environments through GitOps, so they spin up when you open a PR, update when you make a commit, and spin down automatically on a merge or timeout. It’s that easy. This way, you can run your full E2E test suite whenever you need to, and do CI/CD the right way. Try it free for 30 days, or jump on a call and we’ll help you get set up.

Test Your Vibe Coding with Ephemeral Environments

Shipyard DevRel — Tue, 22 Apr 2025 19:44:39 +0000

Vibe coding is taking the tech world by storm. Why? It’s a lightning-fast way to prototype the app of your dreams. Just come up with an idea, tell your favorite LLM, and keep iterating/tweaking it until it does everything you want it to. Sounds too good to be true, right?

Well, yes and no. Without proper testing, your vibe-coded proof-of-concept might be a little brittle or prone to bugs. However, it’s very possible to use good engineering practices to ensure your app runs smoothly and as expected. Here’s how you can stay ahead of some common LLM missteps.

Where vibe coding falls short

When you ask ChatGPT or Claude a question, how often is it actually correct? Not quite often enough for you to take its output as 100% factual. LLMs have a major weak spot: accuracy. An LLM will answer your prompt by responding with the most frequent associations/patterns it sees in its training data.

That’s where vibe coding becomes a liability. There’s a lot of poorly-written, vulnerable code in the world, and therefore in LLM training data. The output of your vibe coding copilot could be derived from a StackOverflow comment with two upvotes, or it could be an actual best-practice solution.

And this only gets worse the more niche your vibe coding project gets…

LLMs don’t always understand the context of your application, and will thus not “understand” the intended behavior. Being as specific as possible will only help you (e.g. “build me a personal finance app” vs. “build me a React, Django, Postgres app where I can track my saving and spending, and enter purchases into a form with three fields”).

In short, vibe coding is a great way to get rapid prototypes, at the expense of cohesive, secure software design. This means you’ll want to test it accordingly.

Don’t “vibe test” your vibe coding

Rule #1 when it comes to vibe coding: write your own tests. (Or at least pair program them with your coding copilot.)

Yes, this will probably take a lot longer than building your app itself. But if you’re working on an app you care about, or something business-critical, you’ll want to minimize any risks already inherent to vibe coding.

You, as a person, understand in your mind exactly what you want your app to do (down to subtle little things that you can’t quite articulate to your LLM buddy).

Take some time to think about what functionalities need extra verification, and brainstorm any edge and corner cases. Writing end-to-end tests that simulate real-world workflows can be especially helpful, since a vibe-coded application might have some components that don’t mesh super well, especially if they were tacked on later.

Ephemeral environments for vibe coding

When you’re deploying your vibe-coded app to production, you’ll want to see how it runs on real infrastructure. One of the most frustrating issues with vibe coding is configuration. Since you’re working with LLMs, many of the packages that get pulled into your app aren’t versioned correctly, aren’t configured the right way, or are very out-of-date (depending on that LLM’s training data).

Instead of taking a “push and pray” approach to production, you can individually test features, PRs, branches, and even main before you deploy. After creating and testing a few vibe-coded apps, we found out there was a bit of a disconnect between the vibe coding environment (which these apps are optimized to run on) and any other environment (staging, preview, or production). With ephemeral environments, you can sort out these issues and approximate how your app will run in production.

Step 1: Vibe coding

This is the fun part. Vibe coding needs no intro: just go to your online LLM-enabled IDE, type in your prompt, and watch it piece together a full-stack app step-by-step. If you want to make this app extra portable, and make it easier to build/deploy/run anywhere, make sure to ask the LLM to include Docker Compose orchestration (this works best when you ask in your initial prompt).

We evaluated a few different online vibe coding IDEs, and were impressed by all, although Lovable.dev was the clear standout. We were able to get a few solid apps built on the free tier, and we appreciated that credits refresh on a daily basis. Also, Lovable was helpful for making corrections and debugging, without breaking the entire app.

Step 2: Exporting your app to GitHub

Once you’re feeling solid about your app, export it to a GitHub repository. Many vibe coding apps will allow you to sync code changes to the repository, and the agent can sometimes even commit to the repo when you prompt it to. Good source control can help you revert code changes that break things, which isn’t unusual when an LLM enters the equation.

Step 3: Updating your config

In this example, we’ll be using Shipyard as our ephemeral environment management platform. If you’re new to Shipyard, you can sign in with GitHub or GitLab to kick off a 30-day free trial.

Shipyard takes a Docker Compose application definition, and uses that to generate Kubernetes manifests for your app’s orchestration. In your app’s Compose file, find the frontend-facing service. Add a label to set this as the primary route:

services:
  frontend:
    build:
      context: .
      dockerfile: Dockerfile
    labels:
      shipyard.route: '/'
    ports:
      - "8080:8080"

Config will vary between apps, and any app generated from an LLM is always a wildcard in terms of stack, tools, and setup. Check out the Shipyard Docker Compose docs if your app requires additional labels.

Step 4: Running the app in an ephemeral environment

Once you’ve configured your Docker Compose file, you’re all set to get it running in an ephemeral environment. From your Shipyard dashboard, you can create a new application and select your repo.

From here, your app will build. You can check out the Build, Run, and Deploy logs to see if you need to adjust anything config-wise. To make changes, go back to your vibe coding IDE and either manually edit the code, or ask the LLM to solve for the issues the logs are showing. As soon as you (or the agent) commit that, the Shipyard environment will rebuild to reflect those code changes.

Step 5: Testing, UATing, and previewing your app

Now, you can visit your app and interact with it on near-production infrastructure. Test all possible workflows manually, and/or add your teammates to help with review.

You can even vibe code (or manually write) a CI/CD pipeline to run your automated tests against this app. Doing this on every new code change will make it easier to keep iterating until your app is up to spec.

And most importantly, you can now rest easy knowing that your app doesn’t come with any surprises!

Shipyard + vibe coding = trusty releases

We get it, you love the convenience of vibe coding, but you don’t love its unpredictability. That’s where a platform like Shipyard can help. Funnel your vibe-coded app through a few gates of ephemeral environments, get several sets of eyes on it, run tests against it, and then release it with confidence.

Try it free for 30 days, or book a call to learn how it all works.

How to Measure DORA Change Fail Rate

Shipyard DevRel — Thu, 10 Apr 2025 14:43:51 +0000

As with all four key DORA metrics, measurement is the first step to improvement. Change fail rate is critical to track because it demonstrates your team’s success at delivering reliable, stable code. Having strong scores when it comes to deployment frequency and lead time for changes is only impressive if you have a comparatively low change failure rate. Here’s how you can assess how stable your application is.

What is change fail rate (CFR)?

Change fail rate (CFR) is one of the four DORA key metrics. It represents how often a change (or deployment) to production doesn’t perform as expected, or introduces a bug/regression.

DORA measures software throughput and stability, which are arguably the most critical dimensions when it comes to evaluating your team’s performance. Change fail rate (as of 2024) is considered the primary stability metric among the four keys. This means that CFR is just as important to track as the velocity metrics, and will demonstrate whether your fast deployment rates come at the expense of often-faulty code.

Any time you push a patch or hotfix, the code change you’re fixing is counted as a “failed deployment”. Sometimes these don’t surface until much later, when you find a bug that had been lying low for months or years. This is all important to keep track of, because it reveals faulty testing or QA practices, which may be either process or pipeline-based.

How to measure change fail rate

Change fail rate is represented as a percentage: it is the share of faulty deployments out of your total number of deployments.

Change fail rate = ((number of failed deployments) / (number of total deployments)) * 100
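
As a quick worked example, here is the formula above as a small Python function. The function name and the sample numbers (3 failed deployments out of 40) are made up for illustration.

```python
def change_fail_rate(failed_deployments: int, total_deployments: int) -> float:
    """Return change fail rate as a percentage of total deployments."""
    if total_deployments <= 0:
        raise ValueError("total_deployments must be positive")
    if not 0 <= failed_deployments <= total_deployments:
        raise ValueError("failed_deployments must be between 0 and total_deployments")
    return failed_deployments / total_deployments * 100


# Hypothetical month: 40 production deployments, 3 of which needed a fix.
print(f"{change_fail_rate(3, 40):.1f}%")  # prints "7.5%"
```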

It’s a simple formula, but the challenging part comes from collecting enough deployment data to measure trends. If you haven’t already, it’s helpful to find a way to automate this data collection, preferably into a spreadsheet. From there, you can track change fail rate over the time intervals of your choosing, and compare those rates to your other DORA metrics.

Measuring CFR through story status

One straightforward way to get change fail rate is by crunching story data. Keep count of every ticket that results in a deployment, as well as every ticket that goes from Completed back to In progress. With most ticket systems, this can be a good way to approximate how often your completed tickets need a rework. However, keep in mind that it’s only as accurate as your ticket tracking — if you aren’t documenting thoroughly, you won’t get a reliable estimate on your change fail rate.
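
Here is a rough sketch of that ticket-based estimate, assuming you can export each ticket's status history from your tracker. The Ticket shape, the field names, and the sample tickets are hypothetical; only the "Completed" and "In progress" status names come from the description above.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Ticket:
    key: str                   # hypothetical ticket ID
    deployed: bool             # did this ticket result in a deployment?
    status_history: List[str]  # statuses in the order the ticket moved through them


def was_reopened(ticket: Ticket) -> bool:
    """True if the ticket ever moved from Completed straight back to In progress."""
    history = ticket.status_history
    return any(
        before == "Completed" and after == "In progress"
        for before, after in zip(history, history[1:])
    )


def estimate_cfr(tickets: List[Ticket]) -> Optional[float]:
    """Approximate change fail rate (%) from deployed tickets that needed rework."""
    deployed = [t for t in tickets if t.deployed]
    if not deployed:
        return None  # nothing deployed, so there is no rate to report
    reopened = sum(was_reopened(t) for t in deployed)
    return reopened / len(deployed) * 100


tickets = [
    Ticket("APP-1", True, ["To do", "In progress", "Completed"]),
    Ticket("APP-2", True, ["To do", "In progress", "Completed", "In progress", "Completed"]),
    Ticket("APP-3", True, ["To do", "In progress", "Completed"]),
    Ticket("APP-4", False, ["To do", "In progress"]),
]
print(f"{estimate_cfr(tickets):.1f}%")  # prints "33.3%"
```

Tickets that never shipped are left out of the denominator, so a large backlog doesn't make the rate look better than it is.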

Measuring CFR through GitHub

If you’re using GitHub’s Deployment feature for production releases, you can use GitHub’s REST API to collect deployment events. At minimum, you’ll want to record timestamps for when each deployment was created and/or modified, and grab deployment statuses (success, failure, error).

GET /repos/my-username/my-project/deployments
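
The deployments list doesn't include outcomes on its own; each deployment's statuses come from the per-deployment statuses endpoint (GET /repos/{owner}/{repo}/deployments/{deployment_id}/statuses). The sketch below skips the HTTP calls and assumes you've already collected the state values for each deployment. Treating failure and error as failed deployments, and skipping deployments that never finished, are assumptions you may want to adjust. The deployment IDs and states are invented sample data.

```python
from typing import Dict, List, Optional

# Assumption: both of these states count as a failed deployment.
FAILED_STATES = {"failure", "error"}


def classify(states: List[str]) -> Optional[str]:
    """Label one deployment from the states it has reported."""
    if FAILED_STATES.intersection(states):
        return "failed"
    if "success" in states:
        return "succeeded"
    return None  # still queued, pending, or in progress


def github_cfr(states_by_deployment: Dict[int, List[str]]) -> Optional[float]:
    """Change fail rate (%) across deployments that reached an outcome."""
    labels = [classify(states) for states in states_by_deployment.values()]
    finished = [label for label in labels if label is not None]
    if not finished:
        return None
    return finished.count("failed") / len(finished) * 100


# Hypothetical data: deployment ID -> states reported for that deployment.
sample = {
    101: ["in_progress", "success", "inactive"],
    102: ["in_progress", "failure"],
    103: ["success"],
    104: ["pending"],
    105: ["success"],
}
print(f"{github_cfr(sample):.1f}%")  # prints "25.0%"
```

Checking every reported state, rather than only the latest one, keeps older successful deployments from being miscounted once GitHub marks them inactive.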

Measuring CFR through GitLab

GitLab’s dedicated Deployments API can give you the data you need to calculate change fail rate. A simple call returns all deployments for a given project, along with each deployment’s timestamp and status. From there, you can compare the number of deployments with failed status to the overall number.

GET /projects/:id/deployments
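
To track the rate over time, you can bucket those deployments by month. This is a minimal sketch that assumes you've already fetched the response and kept each deployment's status and created_at fields; the sample records are invented, and counting only success and failed deployments as finished is an assumption.

```python
from collections import defaultdict
from typing import Dict, List

# Assumption: only deployments that finished one way or the other are counted.
FINISHED_STATUSES = {"success", "failed"}


def monthly_cfr(deployments: List[dict]) -> Dict[str, float]:
    """Return change fail rate (%) per month, keyed by "YYYY-MM"."""
    totals: Dict[str, int] = defaultdict(int)
    failures: Dict[str, int] = defaultdict(int)
    for deployment in deployments:
        if deployment["status"] not in FINISHED_STATUSES:
            continue  # skip deployments that are still running, canceled, etc.
        month = deployment["created_at"][:7]  # ISO 8601 timestamp -> "YYYY-MM"
        totals[month] += 1
        if deployment["status"] == "failed":
            failures[month] += 1
    return {month: failures[month] / totals[month] * 100 for month in sorted(totals)}


# Hypothetical records, trimmed to the two fields used here.
sample = [
    {"status": "success", "created_at": "2025-03-03T10:15:00.000Z"},
    {"status": "failed", "created_at": "2025-03-11T16:40:00.000Z"},
    {"status": "success", "created_at": "2025-03-19T09:05:00.000Z"},
    {"status": "success", "created_at": "2025-03-27T13:30:00.000Z"},
    {"status": "success", "created_at": "2025-04-02T11:00:00.000Z"},
    {"status": "success", "created_at": "2025-04-09T15:20:00.000Z"},
    {"status": "running", "created_at": "2025-04-10T08:45:00.000Z"},
]
for month, rate in monthly_cfr(sample).items():
    print(f"{month}: {rate:.1f}%")
# prints "2025-03: 25.0%" then "2025-04: 0.0%"
```

Keep in mind that a failed status here means the deployment itself failed, so a change that deploys cleanly and breaks later still needs to be captured another way (e.g. through your ticket data).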

For Ultimate subscribers, GitLab offers a DORA metrics API, where you can directly request DORA change fail rate.

GET /projects/:id/dora/metrics

What is a good change fail rate?

As of the 2024 DORA report, the highest-performing teams (those who have attained the “elite” performance standard) have a change fail rate of 5%. About one-fifth of teams surveyed fall within this tier.

Interestingly, teams within DORA’s “high” performance standard tend to have a worse change fail rate than those within the “medium” standard. Teams with a lead time of under a week tend to see a change fail rate of 20%, while those with a lead time between one week and one month see a change fail rate of 10%. Often, teams struggle to maintain a low change fail rate when they’re pushing code fast (this can stem from a lack of QA automation, code review, test coverage, etc.).

There doesn’t have to be a tradeoff between throughput and stability. With thorough pipelines and processes in place, your team should be able to deliver features fast, and trust the features they’re delivering.

TLDR: change fail rate should be read in the context of throughput. If you have a high throughput and low change fail rate, you’re doing well. Remember, DORA metrics are best used to measure your improvement. As long as your change fail rate is going down (and not at the expense of your deployment velocity or lead time), this is a good indicator of performance. You don’t need to aim for “elite” numbers, as your team’s ideal change fail rate might look different from that of another org.

Trying to reduce change fail rate?

The number one enemy of a good change fail rate is poor testing practices. When teams are trying to deploy fast, it can be tricky to ensure each feature gets thorough testing and review. But how do you remedy that, while still keeping lead times short?

Ephemeral environments are an effective way to make sure your features get enough eyes (and automation) prior to staging. Since infrastructure wait times become nonexistent, you can test your features as soon as you’re done building them, and share them with all stakeholders to get sufficient feedback. Instead of going back to the drawing board, you can make all changes on the spot, and then run tests until your pipeline is green. Having multiple gates before production (and even staging) can inspire trust in every deployment, resulting in a low change fail rate.

And Shipyard gives you the tools and automation you need to do exactly that! Kick off a 30-day free trial and see for yourself. Your DORA metrics will thank you.