AIPOCH

Posted on Apr 9

AIPOCH Medical Skill Auditor: How We Evaluate Agent Skills

#agents #ai #codequality #testing

You can explore a growing collection of Medical Research Agent Skills on the AIPOCH GitHub.

If you find it useful, consider giving it a ⭐ to support the project!

What is Medical Skill Auditor?

AIPOCH Medical Skill Auditor is a framework for assessing the quality of AIPOCH's Agent Skills. Its core function is to run a comprehensive quality check on a skill before it is released to users.

How does Medical Skill Auditor work?

Veto Gates

To enforce strict quality control, Skill Auditor has two layers of veto gates: Skill Veto and Research Veto. A failure in any of these checks may lead to immediate rejection of the skill.

Skill Veto

Take the agent skill “medical-research-literature-reader-pro” as an example:

Operational Stability
Structural Consistency
Result Determinism
System Security

Research Veto

Take the agent skill “medical-research-literature-reader-pro” as an example:

Scientific Integrity
Practice Boundaries
Methodological Ground
Code Usability
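
The two veto layers above can be read as a gate in front of scoring: a skill moves on only if every check in both layers passes. Below is a minimal Python sketch of that idea. The check names come from the lists above. The function names `failed_veto_checks` and `passes_veto_gates`, the boolean results dictionary, and the choice to treat a missing result as a failure are assumptions made for illustration, not AIPOCH's actual implementation.

```python
# Hypothetical sketch: check names are from the article; everything else is assumed.
SKILL_VETO = ("Operational Stability", "Structural Consistency",
              "Result Determinism", "System Security")
RESEARCH_VETO = ("Scientific Integrity", "Practice Boundaries",
                 "Methodological Ground", "Code Usability")


def failed_veto_checks(results):
    """Return the names of veto checks that did not pass.

    `results` maps a check name to True (passed) or False (failed).
    A check with no recorded result is treated as failed (an assumption).
    """
    return [name for name in SKILL_VETO + RESEARCH_VETO
            if not results.get(name, False)]


def passes_veto_gates(results):
    """A skill passes only if no veto check failed."""
    return not failed_veto_checks(results)


# Example: one failing security check is enough to block the skill.
all_pass = {name: True for name in SKILL_VETO + RESEARCH_VETO}
one_fail = {**all_pass, "System Security": False}
print(passes_veto_gates(all_pass))
print(failed_veto_checks(one_fail))
```

Returning the list of failed checks, rather than only a yes/no answer, makes it possible to report why a skill was rejected.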

Core Capability

Take the agent skill “medical-research-literature-reader-pro” as an example:

Core Capability evaluates a skill’s design and contract against key dimensions such as Functional Suitability, Reliability, Performance & Context, Agent Usability, Human Usability, Security, Agent-Specific, and Maintainability.

Medical Task

Take the agent skill “medical-research-literature-reader-pro” as an example:

Medical Task assesses the actual outputs of a skill against layered criteria.

For skill testing, the AI generates the inputs automatically. The number of inputs in each category increases or decreases with the complexity of the skill. The following seven inputs represent the most comprehensive version.

Canonical
Variant A
Edge
Variant B
Stress
Scope Boundary
Adversarial

Skill Complexity Classification

Label	Code/Rank	Definition
Simple	S	Narrow task scope
Moderate	M	Moderate branching or multiple task types
Complex	C	Broad or multi-step specialized skill

Simple (S): 3 inputs

Moderate (M): 5 inputs

Complex (C): 7 inputs
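
The mapping from complexity to input count can be sketched in a few lines of Python. The counts and category names below are taken from the text above. The article does not say which categories are dropped for Simple and Moderate skills, so the sketch assumes the smaller sets are simply the first entries of the full list; that selection rule, and the names `INPUTS_PER_COMPLEXITY` and `planned_inputs`, are hypothetical.

```python
# Input categories in the order the article lists them.
INPUT_CATEGORIES = ["Canonical", "Variant A", "Edge", "Variant B",
                    "Stress", "Scope Boundary", "Adversarial"]

# Number of inputs per complexity code, as stated in the article.
INPUTS_PER_COMPLEXITY = {"S": 3, "M": 5, "C": 7}


def planned_inputs(complexity):
    """Return the input categories to generate for a complexity code.

    Assumption: smaller sets are a prefix of the full list. The article
    gives only the counts, not which categories are kept.
    """
    count = INPUTS_PER_COMPLEXITY[complexity.upper()]
    return INPUT_CATEGORIES[:count]


for code in ("S", "M", "C"):
    print(code, planned_inputs(code))
```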

Final Score

Take the agent skill “medical-research-literature-reader-pro” as an example:

The Skill Auditor uses a two-stage scoring system: static evaluation (design quality, weighted 40%) and dynamic evaluation (runtime performance, weighted 60%). The final score is the weighted sum of the two.

Static (40%)
Dynamic (60%)

Final Score = Static Score × 40% + Dynamic Score × 60%
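
As a worked example of the formula above, here is a small Python sketch. The weights come from the article. The function name `final_score` and the sample scores are hypothetical, and the sketch assumes both stage scores are reported on the same scale (for example 0 to 100).

```python
STATIC_WEIGHT = 0.4   # design quality, per the article
DYNAMIC_WEIGHT = 0.6  # runtime performance, per the article


def final_score(static_score, dynamic_score):
    """Combine the two stage scores into one weighted score.

    Assumes both inputs use the same scale.
    """
    return STATIC_WEIGHT * static_score + DYNAMIC_WEIGHT * dynamic_score


# Illustrative numbers only: a skill with a static score of 80
# and a dynamic score of 90.
print(round(final_score(80, 90), 2))
```

Because the dynamic stage carries the larger weight, a skill that performs well at runtime is penalized less for design gaps than the reverse.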

You can view evaluation results for selected AIPOCH skills here.

Feedback and possible future directions

This framework is still under active development. Right now it is applied only to a subset of AIPOCH’s skills, but we’re considering expanding it more broadly.

Top comments (1)

AIPOCH • Apr 9

Are there evaluation frameworks you’re already using? We look forward to your comments!