
Best Practices for Engineer Evaluation Systems in the Age of AI (Overview)

Introduction


I run a development company with around 40 people. In recent years, AI-driven development has become the norm, and the acceleration brought by the era of vibe coding—together with the growing difficulty of fairly evaluating engineers—motivated me to develop an engineer evaluation platform: CodeRanker.

Source Code

Background

External Factor: AI

With the rise of so-called “vibe coding” (real-time code generation and thread-based workflows powered by AI), there’s been a clear shift from traditional team-based division of labor to a relative emphasis on individual ability. It’s not that engineering jobs are disappearing—rather, it’s akin to the emergence of a new high-level language and a fundamental change in the cognitive demands required of engineers.

For clarity, the “AI” here refers to vibe-coding-style AI support tools like Cursor. However, I see this trend as a transitional phase. In the near future, development will converge further toward declarative, structured, spec-driven paradigms (for example, Kiro), where prompts themselves become declarative code. Even as technology advances, the fundamental trend toward the abstraction of thought—where development is completed by specifying logic and requirements—will remain unchanged.

The reason I believe we will shift from vibe coding to declarative, spec-driven development is the expected growth in LLM context window sizes. If cheap, widely available models can handle context windows of several megabytes, roughly the size of an average codebase, even prompt-level adjustments may no longer be necessary.

As a result, future engineers will need to possess abstract thinking skills: the ability to see the entire system, maintain coherence, and define requirements across modules, rather than simply breaking work into modules and dividing tasks.

This means that, in the development environment of tomorrow, having one person manage the whole—leveraging AI—will naturally result in higher quality output, rather than splitting up work among many people.

Division of Labor Will Be Disrupted in the AI Era

Therefore, I no longer see value in teams composed solely of simple coders. Ideally, projects should be handled by a small group, or even a single person, who can see the whole picture. Much as if society were to shift from agriculture back to hunting, traditional SIer-style, large-scale waterfall development will be phased out. I believe engineers will come to be regarded more as talents than as organizational "employees." In this world, differences in individual productivity can be tenfold or a hundredfold. Engineering and other intellectual work should be the domain of a select few, free of unnecessary communication overhead.

If a product is built by one person, its value is equivalent to the value of that developer. Thus, I consider this field to be more like the service or sales professions—where results-based incentives and a supporting evaluation system become necessary.

How Should Humans and AI Coexist?

There were also internal factors in my organization: with a lack of senior engineers, we hit the limitations of conventional management, training, and evaluation systems. The challenge became: how do we boost the performance of a team made up mostly of mid-level and junior engineers? The answer, I believe, is to use AI for correction and upskilling, but also to help engineers recognize the areas where AI falls short and develop human abilities that surpass AI—ultimately ensuring client satisfaction.

AI as the “Tool,” Humans as the “Judges”

AI can help with quality assurance and coverage measurement, but if AI were perfect, there would be no need for humans. In reality, AI alone cannot yet determine the quality or requirement-fit of complex systems. The meaning of requirements, technical depth, and the thoroughness of testing still demand the insight of senior engineers. AI should play a supportive role. To compensate for the thin senior layer, it’s important to combine automated measurement by AI with human judgment, and to continue on-site learning and review.

Treating Engineers as “Talents,” Not Just Organizational Members

Traditionally, organizations have often treated engineers as just parts of the company, making promotions and compensation decisions based on a manager’s evaluation. In my organization, we value engineers as talents based on three axes: quality of output, volume of output, and client satisfaction. This ensures fair evaluation according to each engineer’s projects and skill set, and makes their roles and rewards objective.

360-Degree Evaluation and a Democratic Process

To guarantee transparency and fairness, we use 360-degree evaluation by multiple reviewers. Deliverable evaluation, commit evaluation, and manager evaluation are all made public. Evaluation standards and calculation methods should be clearly documented and disclosed. Transparent and democratic evaluation is the basis for a flat, decentralized organization—not a top-down one.

Transparency and Decentralization

Another important factor for organizational growth is transparency. We focus on publishing rankings, instant reflection of evaluation results, and sharing of criteria, so all employees act with the same information. Flat evaluations—where hierarchy is minimized—reduce central authority and foster self-management and autonomy.

Proactive Competition

Fair competition improves productivity across the organization. We make monthly rankings public and link them directly to rewards and project assignments. This encourages ambition and covers for a thin senior layer by motivating everyone to improve. When connecting rankings and rewards, it’s important to clarify evaluation criteria and enforce penalties for misconduct, maintaining a healthy competitive culture.

Toward a Self-Regenerating Organization

The ideal is an organization that can grow and renew itself with minimal management cost. By combining quantitative AI-based evaluation with qualitative human judgment, and designing a transparent, decentralized evaluation process, each engineer understands their own strengths and challenges and takes initiative for improvement. This balance of competition and co-creation is the future of talent management and evaluation for the AI era.

This article outlines the concepts and practical framework of the CodeRanker engineer evaluation system as an organizational theory.

What is the 3-Axis Evaluation System?


The HR evaluation system proposed by CodeRanker measures an engineer’s performance on three axes: Quality, Quantity, and Client Evaluation.

| Axis | Content | Purpose |
| --- | --- | --- |
| Quality Evaluation | Senior engineers (with AI support) evaluate deliverables: code quality, design, test coverage, technical depth. | Guaranteeing and standardizing technical quality |
| Quantity Evaluation | Analyzes Git commit history. AI evaluates the quantity and quality of output by parsing commit messages and file changes. | Objective visualization of output volume |
| Client Evaluation | Project managers or clients assess requirement fit and business value. | Reflection of business value and customer satisfaction |

By combining these, a balanced evaluation is possible, integrating technical, quantitative, and customer perspectives.
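
To make the three axes concrete, here is a minimal sketch of how a single weekly evaluation record might be represented. The field names and the 0–100 scale are illustrative assumptions, not CodeRanker's actual schema.

```python
# Hypothetical record for one engineer's weekly evaluation.
# Field names and the 0-100 scale are assumptions for illustration only.
from dataclasses import dataclass

@dataclass
class WeeklyEvaluation:
    engineer_id: str
    week: str        # e.g. "2025-W30"
    quality: float   # deliverable quality (senior review + AI checks), 0-100
    quantity: float  # output volume derived from Git history, 0-100
    client: float    # manager/client satisfaction, 0-100

record = WeeklyEvaluation("eng-001", "2025-W30", quality=82.0, quantity=74.5, client=90.0)
print(record)
```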


The Structure and Theory Behind the 3-Axis Evaluation System

Modern software organizations need mechanisms to correctly and fairly assess engineering ability. CodeRanker was designed to meet this need, with the 3-axis evaluation system as its core. This section explains the framework and underlying theory behind the system.

Why a New Evaluation System is Needed

Advances in AI have transformed engineering. With automated code generation and sophisticated CI/CD, individual productivity has increased dramatically. As a result, it’s become difficult to judge “who creates real value” using only traditional, subjective, or qualitative methods. Seniority- and impression-based evaluation makes it hard to recognize truly outstanding talent and risks harming organizational competitiveness.

This is why a data-driven, highly transparent evaluation system is needed. By visualizing performance with clear, quantitative data, you can fairly assess each engineer’s effort and results, and boost motivation. CodeRanker’s 3-axis system is designed to objectively measure “true value” in the AI era.

Three-Dimensional Evaluation

The 3-axis system assesses engineers from the following perspectives, with each axis addressing a different aspect to achieve an unbiased, comprehensive evaluation:

[Screenshot: deliverable quality evaluation page]

  • Deliverable Quality (Output-Focused): Evaluates the quality of code and products produced by engineers. Assesses requirements coverage, test completeness, design quality, security, and performance. Evaluation is based on automated test results and code reviews by senior engineers using specialized tools. This axis focuses on the quality of results—emphasizing technical depth and code completeness.

[Screenshot: output quantity evaluation page]

  • Quantity of Output (Process-Focused): Measures the amount and quality of output from Git commit history. Quantifies weekly features, added/modified lines, commit frequency, and other contributions. AI analyzes commit content to score how well requirements were implemented and how efficient the workflow was. Because this is collected and analyzed automatically in CI, nearly real-time feedback is possible (a rough sketch of this kind of commit analysis follows this list).

[Screenshot: manager/client evaluation page]

  • Manager/Client Evaluation (Human Evaluation): The manager or client provides an overall evaluation. Each week (in under 25 minutes), they assess whether delivered features meet the requirements and deliver business value, and whether there were any process issues. They check things like "requirement alignment," "user usability," "priority of business-critical features," and "communication quality." Subjective satisfaction and the field perspective, which AI cannot judge, are captured here.
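
As a rough illustration of the quantity axis, the sketch below pulls simple volume metrics (commit count, files touched, lines added and deleted) for one author out of `git log`. It is only a sketch under my own assumptions: the AI-based classification of commit content is not reproduced here, and the function name and output shape are hypothetical.

```python
# Rough sketch: summarize one author's recent Git activity for the quantity axis.
# Only raw volume metrics are collected; AI scoring of commit content is omitted.
import subprocess
from collections import Counter

def commit_stats(repo_path: str, author: str, since: str = "1 week ago") -> dict:
    """Count commits, touched files, and added/deleted lines from git log."""
    out = subprocess.run(
        ["git", "-C", repo_path, "log", f"--author={author}", f"--since={since}",
         "--numstat", "--pretty=format:@@%H"],
        capture_output=True, text=True, check=True,
    ).stdout

    stats = Counter()
    for line in out.splitlines():
        if line.startswith("@@"):          # one pretty-format line per commit
            stats["commits"] += 1
        elif line.strip():                 # numstat line: "added<TAB>deleted<TAB>path"
            added, deleted, _path = line.split("\t", 2)
            if added.isdigit():            # binary files show "-" instead of a number
                stats["lines_added"] += int(added)
            if deleted.isdigit():
                stats["lines_deleted"] += int(deleted)
            stats["files_touched"] += 1
    return dict(stats)

# Example usage:
# print(commit_stats(".", "alice@example.com"))
```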

Theory and Aims Behind the 3-Axis Approach

The three-axis approach stems from the belief that a one-dimensional evaluation can’t capture an engineer’s true skill. If you measure only code quantity, you miss out on quality; if you look only at elegance, you lose track of productivity. Some value (such as user experience or teamwork) can’t be captured by numbers alone. By combining quality, quantity, and human evaluation, you get a balanced view.

The aims and theoretical foundations of this method are:

  • Eliminating Bias & Ensuring Fairness: Multiple axes prevent unbalanced evaluations. Because people have strengths and weaknesses, a combination of quality, quantity, and manager perspective ensures fairness. Combining automated (AI) and human review balances objectivity and human insight.
  • Short-Term Results and Long-Term Growth: By evaluating both deliverable quality (short-term) and quantity/process (long-term), the system encourages sustainable growth, not just one-off wins. This also encourages a culture of ongoing improvement.
  • AI-Era Appropriateness: In an age where AI writes code and generates tests, human value lies in “how well you use AI to create high-quality results efficiently.” CodeRanker quantifies what can be automated and leaves creativity and higher-order decisions to humans, reflecting this division in its design.
  • Transparency & Buy-In: Clear criteria and calculation methods for each axis are available to all. Everyone can see how their score is calculated and understand why they received a given evaluation, boosting trust and satisfaction.

Evaluation Flow and Score Integration

[Screenshot: CodeRanker ranking page]

In CodeRanker, the results from the three axes are combined into a single integrated score for each engineer. The general flow is as follows:

  1. Data Collection: Automatically collects project specs, test code, Git history, and CI test results.
  2. Automated Evaluation (Deliverable Quality): Automatically checks requirements coverage, test results, and code security. Senior engineers also regularly review code with specialized tools (semi-automated, human-in-the-loop).
  3. Automated Evaluation (Output Quantity): CI scripts analyze recent commits: number of features, lines changed, commit frequency/granularity, etc. AI also classifies commits (feature, bugfix, etc.) to score both quantity and content.
  4. Manager/Client Evaluation (Human): The project manager reviews the auto-evaluation results and the actual product weekly, checking for requirement completeness, non-functional requirements, communication quality, and more.
  5. Score Aggregation: Each axis's score is combined with predefined weights (default: Deliverable Quality 40%, Output Quantity 35%, Manager Evaluation 25%), producing an overall ranking.
  6. Feedback: The final result, with explanations for each score and improvement recommendations, is provided to the engineer. They can use this feedback for personal growth and goal setting.

This flow repeats weekly or monthly, so the evaluation and feedback cycle stays close to real-time.
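
As a minimal sketch of step 5, here is how the default weighting stated above (Deliverable Quality 40%, Output Quantity 35%, Manager Evaluation 25%) could be applied to the three axis scores. The 0–100 scale and the function shape are assumptions for illustration.

```python
# Default weights from the evaluation flow above; the 0-100 scale is assumed.
DEFAULT_WEIGHTS = {"quality": 0.40, "quantity": 0.35, "manager": 0.25}

def integrated_score(quality: float, quantity: float, manager: float,
                     weights: dict = DEFAULT_WEIGHTS) -> float:
    """Combine the three axis scores (each 0-100) into one weighted score."""
    total = (quality * weights["quality"]
             + quantity * weights["quantity"]
             + manager * weights["manager"])
    return round(total, 1)

# Example: an engineer scoring 85 / 70 / 90 on the three axes.
print(integrated_score(85, 70, 90))  # -> 81.0
```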

Benefits of the 3-Axis Evaluation System


Implementing the 3-axis system yields benefits not possible with traditional approaches:

  • Fair, Objective Evaluation: Using multiple axes prevents unfair, one-dimensional judgments. Combining data-driven automation and human review delivers fair, bias-resistant results.
  • Transparency: Scoring standards and calculation methods are shared, and everyone knows what is being measured. This increases buy-in and eliminates mistrust of “black box” HR decisions.
  • Real-Time Feedback: Output scoring happens in near real time, so engineers get instant feedback, can adjust quickly, and don’t have to wait for an annual review to improve.
  • Continuous Growth: By evaluating both output and process, engineers are encouraged to improve quality and efficiency over time, fostering a culture of self-improvement.
  • Human–AI Collaboration: The system automates everything possible, reducing burden on reviewers, while keeping human-only evaluations for context and nuance, building a truly human-centric system for the AI era.
  • Meritocracy: Scores and rankings make clear who delivers what, creating healthy tension and competition, rewarding results, and nurturing an achievement-oriented culture.

Summary: CodeRanker’s 3-axis system is a new framework that combines quality, quantity, and human review. It brings fairness and transparency to engineering evaluation for the AI era, supporting both individual and organizational growth. For CTOs and HR managers looking to modernize their evaluation systems, the 3-axis approach is a powerful solution.

Solving Management Issues with Payroll and Monthly Statistics Dashboards

[Screenshot: CodeRanker users page]

CodeRanker’s system aggregates weekly evaluation points each month and publishes rankings and statistics for all members. Each engineer’s points are calculated based on the quality of deliverables, quantity of output, and manager (or client) satisfaction. Scoring follows a “Quality + Quantity + Satisfaction – Penalty + Bonus” formula, with documented standards and calculation methods. Weekly points are totaled monthly, and rankings are made by department and company-wide. This enables visualization of each member’s trends, team averages, and the distribution of high and low performers—helpful for managing and improving organizational performance.

Monthly rankings aren't just for show; they directly impact pay. CodeRanker links evaluation results to compensation, so higher ranks earn bonuses or raises, and lower ranks receive reductions or improvement programs. Each tier (T0–T7) has a base salary table, and monthly scores automatically determine the adjustment. For example, exceeding a certain threshold earns a pay raise, while falling below it triggers a penalty. Calculation methods and rates are public, so everyone understands how pay is determined.
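
Below is a sketch of how this pay linkage could look in code. The scoring formula (Quality + Quantity + Satisfaction - Penalty + Bonus) is the one described above, while the tier salaries, thresholds, and adjustment rate are purely hypothetical placeholders, not CodeRanker's published tables.

```python
# The formula below follows the article; every concrete number is a placeholder.
def monthly_points(quality: float, quantity: float, satisfaction: float,
                   penalty: float = 0.0, bonus: float = 0.0) -> float:
    """Monthly points = Quality + Quantity + Satisfaction - Penalty + Bonus."""
    return quality + quantity + satisfaction - penalty + bonus

# Hypothetical base salaries per tier (T0-T7) and adjustment rules.
BASE_SALARY = {f"T{i}": 300_000 + i * 50_000 for i in range(8)}
RAISE_THRESHOLD, CUT_THRESHOLD = 240.0, 150.0   # assumed monthly-point cutoffs
ADJUSTMENT_RATE = 0.05                           # assumed +/-5% adjustment

def adjusted_salary(tier: str, points: float) -> int:
    """Apply the assumed raise/cut rule to the tier's base salary."""
    base = BASE_SALARY[tier]
    if points >= RAISE_THRESHOLD:
        return int(base * (1 + ADJUSTMENT_RATE))
    if points < CUT_THRESHOLD:
        return int(base * (1 - ADJUSTMENT_RATE))
    return base

# Example: a T3 engineer with 250 monthly points gets the assumed 5% raise.
print(adjusted_salary("T3", monthly_points(90, 80, 85, penalty=5)))  # -> 472500
```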

Key points for monthly stats and pay linkage:

  • Automatic Aggregation and Publication: Weekly points are summed each month and shared company-wide. This keeps performance transparent and visible.
  • Immediate Reward Reflection: Monthly rankings affect next month’s salary. Top performers get bonuses/raises, while underperformers get reductions or improvement plans—effort directly impacts pay.
  • Automated Calculation: Scoring is standardized, so rankings and salary adjustments are automatic, reducing admin burden.
  • Fairness and Buy-In: Criteria and methods are public, so everyone can check monthly stats and pay changes, ensuring trust and transparency.

Currently, fully automated CI/CD-driven monthly stats and payroll features are not yet live, but the architecture for weekly-to-monthly aggregation and payroll adjustment is in place. Full automation will bring even more efficiency in future updates.


Product

With the rapid advancement of AI, engineer evaluation systems also need to evolve. CodeRanker’s 3-axis system—grounded in fairness and transparency—offers valuable insight for many organizations.

For details, see the CodeRanker official website.

Source Code
