I've been building Shipwright, a PM toolkit that runs inside Claude Code. It's not a chatbot. Not a "give me a PRD" prompt. It's a structured system: 44 atomic skills, 7 specialist agents, 16 multi-step workflows, binary pass/fail quality gates, and deterministic recovery playbooks.
I want to show what it actually produces, because the artifacts are the proof.
Project 1: LatAm Credential Verification
I was researching a potential product in the cross-border credential verification space. I ran Shipwright's discovery and research workflows against this problem. Here's what came out:
A full TAM/SAM/SOM analysis with three-source triangulation, regulatory forcing functions (EU eIDAS 2.0 mandating digital wallet infrastructure by December 2026), and a credible SAM estimate of $150–300M. Separate country-level briefs for Colombia, Mexico, and Venezuela, including the Spain homologation backlog signal (60,000 applications/year, 84% from LatAm, 3–7 year wait times) and remote tech hiring trends (50% YoY growth in LatAm-to-US placements through EOR platforms that don't systematically verify credentials).
Then came an Opportunity-Solution Tree: five ranked opportunities, twelve testable assumptions, twelve experiments scoped with timelines ($0–$1,500 cost, 3–4 weeks each), and explicit decision gates that define what constitutes a pass before any solution work begins.
Finally, a technical feasibility audit of seven credential registries — ToS reviewed, API availability tested, and commercial resale terms documented. Verdict: FAIL. At most one of the seven offers viable programmatic access at under $2/query with commercial-use rights. The audit invalidated a core assumption before I built anything and identified a pivot to a concierge model with different unit economics.
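The audit's verdict boils down to a simple filter over the registries. Here's a minimal sketch of that check in Python; the registry entries and field names are illustrative placeholders, not the actual audit data:

```python
# Hypothetical registry records; names and figures are made up for illustration.
registries = [
    {"name": "Registry A", "api": True,  "cost_per_query": 5.00, "commercial_ok": False},
    {"name": "Registry B", "api": False, "cost_per_query": None, "commercial_ok": False},
    {"name": "Registry C", "api": True,  "cost_per_query": 1.50, "commercial_ok": False},
]

def is_viable(r):
    """Viable = programmatic access at under $2/query with commercial-use rights."""
    return (
        r["api"]
        and r["cost_per_query"] is not None
        and r["cost_per_query"] < 2.00
        and r["commercial_ok"]
    )

viable = [r["name"] for r in registries if is_viable(r)]
print(f"{len(viable)}/{len(registries)} viable -> {'PASS' if viable else 'FAIL'}")
```

With zero viable registries, the audit returns FAIL and the workflow pivots rather than proceeding to build.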
Realistic PM time for this research stack: 4–8 days.
Project 2: Pre-Sales Discovery Research
This was pre-sales work for a healthcare client — competitive analysis, company profiling, and a discovery tool for the first meeting.
Shipwright produced a company profile with confidence-tagged unknowns explicitly listed for the first discovery call, and a competitive analysis covering four major competitors: not just positioning claims, but a sourced capability-gap matrix across nine automation dimensions, with revenue impact quantified from published industry benchmarks.
It provided a 29-question discovery questionnaire (behavioral framing, not hypothetical questions) with a scoring rubric (Friction Severity × Lens Relevance = Opportunity Score), industry benchmarks that auto-flag underperformance, and a decision tree that routes findings to the next appropriate skill.
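The rubric and the benchmark flagging can be sketched in a few lines of Python. The 1–5 scales and the 10% underperformance tolerance are my assumptions, not Shipwright's actual values:

```python
def opportunity_score(friction_severity: int, lens_relevance: int) -> int:
    """Rubric from the questionnaire: Friction Severity x Lens Relevance.
    Assumes both inputs are on a 1-5 scale (an illustrative choice)."""
    return friction_severity * lens_relevance

def flag_underperformance(observed: float, benchmark: float, tolerance: float = 0.10) -> bool:
    """Auto-flag when an observed metric falls more than `tolerance` below
    the published industry benchmark (10% threshold is assumed here)."""
    return observed < benchmark * (1 - tolerance)

# e.g. severe friction (4) in a highly relevant lens (5) outranks
# severe friction (4) in a marginal lens (1)
print(opportunity_score(4, 5), opportunity_score(4, 1))
```

Multiplicative scoring like this pushes high-friction, high-relevance answers to the top of the list, which is what you want when ranking discovery findings.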
Realistic PM time: 4–6 days.
What makes the artifacts strong
Every output includes a Decision Frame: a recommendation, the trade-off of acting now versus waiting, an explicit confidence level, an owner, a decision date, and a revisit trigger. Evidence is tagged with confidence levels. Assumptions are separated from findings. Unknowns are listed explicitly rather than papered over.
The system enforces this through binary pass/fail gates before any output is used. An artifact either satisfies the structural and evidence requirements, or it goes back through a deterministic recovery playbook.
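A minimal sketch of such a gate, assuming an artifact is a dict carrying the Decision Frame fields and confidence-tagged evidence described above (the field names and routing label are mine, not Shipwright's schema):

```python
# Decision Frame fields every artifact must carry (names assumed for this sketch).
REQUIRED_FRAME_FIELDS = {
    "recommendation", "tradeoff", "confidence", "owner",
    "decision_date", "revisit_trigger",
}

def gate(artifact: dict) -> bool:
    """Binary pass/fail: structural and evidence requirements, no partial credit."""
    frame_ok = REQUIRED_FRAME_FIELDS <= set(artifact.get("decision_frame", {}))
    evidence = artifact.get("evidence", [])
    evidence_ok = bool(evidence) and all("confidence" in e for e in evidence)
    return frame_ok and evidence_ok

def review(artifact: dict) -> str:
    # Failing artifacts route deterministically to a recovery playbook.
    return "PASS" if gate(artifact) else "RECOVERY_PLAYBOOK"
```

The point is that the gate is a predicate, not a score: an artifact either satisfies every requirement or it is sent back, so "mostly done" never leaks downstream.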
The T3 API audit failing is the system working — a testable assumption got tested, came back false, and the strategy updated accordingly.
What it doesn't replace
The discovery conversation. Assumption validation experiments. Real customers. The artifacts create the foundation and the questions; they don't answer them.
Shipwright is open source. It runs on Claude Code. However, the skills are plain markdown, so they also work in Cursor, Codex, and Gemini CLI.
I would really appreciate feedback on how to improve it, and I am happy to answer any questions.
Find it on GitHub: https://github.com/EdgeCaser/shipwright