Hi! I’m part of the team building AIPOCH, an open-source library of 450+ executable Agent Skills designed specifically for medical research workflows.
AIPOCH GitHub Repository
AIPOCH Website
Why we built AIPOCH
Most medical research AI tools today are essentially a bundle of prompt engineering, fixed toolchains, and a UI. They handle "published knowledge" well (like summarizing a paper), but they fall apart the moment you say: "Now validate this hypothesis using my own cohort data." Existing tools often lack a persistent research context: there is no version-controlled hypothesis tracking and no seamless link between literature evidence and actual data execution. We wanted to move beyond point solutions to a modular, extensible protocol.
What is AIPOCH?
AIPOCH is a curated library of 450+ Medical Research Agent Skills, built to work with OpenClaw and other AI agent platforms, including OpenCode and Claude. To make these Skills reliable for research work, we encode specialized medical research logic directly into them:
- Scientific Integrity Constraints
- Study Type Identification
- Medically Specialized Prompt Logic
A Skill is a structured capability package consisting of:
- skill.md: A "contract" containing YAML metadata (trigger logic) and specific operational steps.
- Python scripts: Executable engines invoked directly via bash under the guidance of skill.md.
In other words, skill.md is the trigger contract and the Python scripts are the execution engine. We embed medical research constraints directly into the skill.md files, their reference materials, and the Python scripts themselves.
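To make the structure concrete, here is a minimal sketch of what such a package's skill.md might look like. The file name pattern comes from the description above, but every field name, trigger phrase, and step shown here is illustrative, not AIPOCH's actual schema:

```markdown
---
# YAML metadata acts as the trigger contract (field names are assumed)
name: cohort-baseline-table
description: Build a baseline characteristics table from a tabular cohort file
triggers:
  - "baseline table"
  - "table 1"
constraints:
  - "Report counts and percentages; never impute or fabricate missing values"
---

## Steps
1. Validate the input CSV schema before any analysis.
2. Run `python scripts/baseline_table.py --input cohort.csv` via bash.
3. Return the table plus a short methods note for reproducibility.
```

The key design idea is that the agent reads the contract (metadata and constraints) first, then delegates the actual computation to the script rather than improvising the analysis itself.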
AIPOCH Medical Skill Auditor (in development)
What is Medical Skill Auditor?
Skill Auditor is AIPOCH’s evaluation framework under active development for scoring Medical Research Agent Skills with rigorous, multi‑dimensional quality metrics. It’s intended to go beyond static descriptions by measuring both core capability and real execution performance—giving users and developers a clearer, data‑driven understanding of skill quality.
How does it work?
🧰 Core Capability
Evaluates a skill's design and contract against key dimensions such as Functional Suitability, Reliability, Performance & Context, Agent Usability, Human Usability, Security, Agent-Specific Behavior, and Maintainability.
📊 Medical Task
Assesses actual outputs of a skill with layered criteria, weighting general competence and category‑specific behaviors to reflect real‑world execution quality.
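The layered, weighted scoring described above can be sketched as a simple normalized weighted average. This is a hypothetical illustration of the idea, not the Skill Auditor's actual scoring code; the dimension names and weights are made up:

```python
def weighted_score(scores: dict[str, float], weights: dict[str, float]) -> float:
    """Combine per-dimension scores (each 0-1) using normalized weights."""
    total_weight = sum(weights[dim] for dim in scores)
    return sum(scores[dim] * weights[dim] for dim in scores) / total_weight


# Illustrative only: general competence weighted alongside
# category-specific behavior, as the layered criteria suggest.
scores = {"general_competence": 0.9, "category_specific": 0.7}
weights = {"general_competence": 0.4, "category_specific": 0.6}
print(round(weighted_score(scores, weights), 2))  # 0.78
```

Normalizing by the total weight keeps the result on the same 0–1 scale even if a category's checks are only partially applicable to a given skill.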
🚫 Veto Gates
To enforce strict quality control, Skill Auditor is designed with two layers of veto mechanisms. Any failure in these checks may lead to immediate rejection of a skill.
Skill Veto
- Operational Stability
- Structural Consistency
- Result Determinism
- System Security
Research Veto
- Scientific Integrity
- Practice Boundaries
- Methodological Ground
- Code Usability
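Because veto gates are hard pass/fail checks rather than weighted scores, they can be modeled as a short-circuit evaluated before any scoring. A minimal sketch, with gate identifiers derived from the lists above but the function and data shapes assumed:

```python
# Gate names follow the two veto layers listed above; the
# dict-of-booleans interface is an assumption for illustration.
SKILL_VETO = [
    "operational_stability", "structural_consistency",
    "result_determinism", "system_security",
]
RESEARCH_VETO = [
    "scientific_integrity", "practice_boundaries",
    "methodological_ground", "code_usability",
]


def passes_veto(results: dict[str, bool]) -> bool:
    """A single failed (or missing) gate rejects the skill immediately."""
    return all(results.get(gate, False) for gate in SKILL_VETO + RESEARCH_VETO)
```

Treating a missing result as a failure keeps the default conservative: a skill is only accepted when every gate has explicitly passed.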
The Most Frustrating Moment
One of our biggest early mistakes was using a cheaper LLM to "vibe code" the initial batch of scripts.
On the surface, it worked. The scripts ran, and the logic seemed okay. The nightmare only surfaced during our audit: we realized the executing agent was silently correcting the script's logic on the fly. Because the agent read the intent in skill.md, it would "patch" the sloppy edge cases and vague error branches in the Python code during execution.
The result? We were burning massive amounts of extra tokens just to fix errors that shouldn't have existed. Nothing ever threw an error; the cost just showed up on the API bill.
We eventually scrapped the lot. We learned the hard way: Quantity isn't a moat; high-quality scripts are.
All questions/feedback welcome!😎😎😎

