Manikanta Suru

I Added Claude to Our MR Pipelines. It Now Reviews Every Code Change Before Humans Do.

Three incidents. Three things that passed human code review and shouldn't have.

A developer pushed an API key into a test repository. It sat committed in the git history for days before anyone noticed.

A Terraform merge request got approved with a security group ingress rule open to 0.0.0.0/0. It went to pre-production.

A GitLab API token ended up hardcoded in a README.md file. The README. The file everyone reads. Merged without a flag.

I'm the sole DevOps engineer at a pre-seed energy startup. I support a 10+ person engineering team across backend, frontend, and Terraform infrastructure. We process 100+ merge requests every month. I can't be in every MR. And even when senior engineers reviewed, things slipped through — especially at the end of the week, especially under deadline pressure, especially on Terraform where most developers aren't fluent.

So I added a Claude AI review stage to our Jenkins MR validation pipeline. Now every merge request gets an automated AI review before any human sees it — posting comments directly on the changed code in GitLab, flagging security issues, checking standards, catching the things tired humans miss.

Here's exactly how it works.


The Architecture: Jenkins + GitLab + Claude

Our setup uses GitLab for source control and Jenkins for CI/CD. When a developer opens or updates a merge request in GitLab, a webhook fires and triggers our Jenkins pipeline. Inside that pipeline lives a file called Jenkinsfile-mr-validation — our dedicated MR validation pipeline. One of its stages runs the Claude AI review.

The flow:

Developer opens/updates MR in GitLab
          │
          ▼
GitLab webhook fires → Jenkins triggered
          │
          ▼
Jenkinsfile-mr-validation runs
          │
    ┌─────┴──────────────────────┐
    │  Stage 1: Lint & Validate  │
    │  Stage 2: Unit Tests       │
    │  Stage 3: Claude AI Review │ ← this is what we're building
    │  Stage 4: Security Scan    │
    └────────────────────────────┘
          │
          ▼
Claude fetches MR diff from GitLab API
          │
          ▼
Claude reviews changed files
          │
          ▼
Per-file review comments posted back to the GitLab MR
          │
          ▼
Developer sees AI feedback before human reviewer arrives

The AI review stage never blocks a merge. It's informational — a first pass that runs automatically so human reviewers can focus on what actually needs human judgment.
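
Wiring the trigger side is one webhook: on the GitLab project, a merge request webhook points at the Jenkins job exposed by the GitLab plugin. A minimal sketch with python-gitlab; the Jenkins URL, job name, and secret below are placeholders rather than values from our setup:

import os
import gitlab

# Register an MR webhook so GitLab triggers Jenkins on MR open/update.
# The URL targets the Jenkins GitLab plugin's /project/<job> endpoint.
gl = gitlab.Gitlab("https://gitlab.com", private_token=os.environ["GITLAB_TOKEN"])
project = gl.projects.get("my-group/my-repo")  # placeholder project path

project.hooks.create({
    "url": "https://jenkins.example.com/project/mr-validation",  # placeholder Jenkins job
    "merge_requests_events": True,   # fire on MR open/update
    "push_events": False,            # skip plain branch pushes
    "token": os.environ["JENKINS_WEBHOOK_SECRET"],  # placeholder shared secret
})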


The Jenkinsfile Stage

Inside Jenkinsfile-mr-validation, the Claude review stage looks like this:

pipeline {
    agent any

    environment {
        ANTHROPIC_API_KEY = credentials('anthropic-api-key')
        GITLAB_TOKEN = credentials('gitlab-api-token')
    }

    stages {

        stage('Lint & Validate') {
            // ... existing stages
        }

        stage('Claude AI Review') {
            when {
                expression { env.gitlabActionType == 'MERGE' || 
                             env.gitlabActionType == 'UPDATE' }
            }
            steps {
                // A broken AI review must never block the pipeline,
                // so keep the build green even if this step fails
                catchError(buildResult: 'SUCCESS', stageResult: 'FAILURE') {
                    sh '''
                        pip install anthropic python-gitlab --quiet
                        python3 scripts/claude_mr_review.py \
                            --project-id ${gitlabMergeRequestTargetProjectId} \
                            --mr-iid ${gitlabMergeRequestIid}
                    '''
                }
            }
            post {
                failure {
                    echo 'Claude review failed — continuing pipeline'
                }
            }
        }

        stage('Security Scan') {
            // ... existing stages
        }
    }
}

A few important details:

The when condition ensures this stage only runs on actual MR events — not on branch pushes or scheduled builds. No point reviewing code that isn't being merged.

The catchError wrapper means the pipeline continues even if the AI review script crashes or the Anthropic API times out; the post { failure } block just logs that it happened. A failed AI review should never block a deployment.

Credentials are stored in the Jenkins credentials store — anthropic-api-key and gitlab-api-token — never hardcoded. Ironically, hardcoded credentials were one of the problems this system was built to catch.


The Python Review Script

The Jenkinsfile stage shells out to scripts/claude_mr_review.py. Here is the full script:

import anthropic
import gitlab
import argparse
import os
import sys

def get_mr_changes(gl_client, project_id, mr_iid):
    """Fetch changed files from GitLab MR."""
    project = gl_client.projects.get(project_id)
    mr = project.mergerequests.get(mr_iid)
    changes = mr.changes()
    return changes['changes'], mr

def detect_file_type(filename):
    """Route to correct review prompt based on file type."""
    if filename.endswith(('.tf', '.tfvars')):
        return 'terraform'
    elif filename.endswith(('.js', '.ts', '.jsx', '.tsx', '.vue')):
        return 'frontend'
    elif filename.endswith(('.py', '.java', '.go', '.rb')):
        return 'backend'
    return 'general'

def build_system_prompt(file_type):
    """Build file-type-aware review prompt."""

    base = """You are a senior software engineer doing a code review.
Analyze the diff and identify real issues only.

Format every finding as:
SEVERITY: [CRITICAL/HIGH/MEDIUM/LOW]
FILE: [filename]
ISSUE: [specific description]
SUGGESTION: [concrete fix]

Rules:
- Maximum 10 findings per review
- Only report genuine issues — no hypotheticals
- CRITICAL = security risk or data loss potential
- HIGH = bug or significant quality issue
- MEDIUM = code quality concern
- LOW = style or minor improvement
- If nothing significant found, say so clearly"""

    terraform_checks = """

Terraform-specific checks:
- Security groups with ingress open to 0.0.0.0/0 on any port
- Unencrypted RDS instances, EBS volumes, or S3 buckets
- IAM policies with wildcard actions or resources
- Resources missing required tags (Name, Environment, Owner)
- Sensitive outputs without sensitive=true
- Hardcoded credentials or tokens in any value"""

    frontend_checks = """

Frontend-specific checks:
- User input rendered without sanitization (XSS risk)
- Sensitive data written to localStorage or console
- Hardcoded API endpoints, keys, or tokens
- Missing input validation on form fields"""

    backend_checks = """

Backend-specific checks:
- Hardcoded credentials, API keys, or tokens
- Missing authentication or authorization checks
- SQL queries built with string concatenation
- Sensitive data written to logs
- Missing error handling in async operations"""

    if file_type == 'terraform':
        return base + terraform_checks
    elif file_type == 'frontend':
        return base + frontend_checks
    elif file_type == 'backend':
        return base + backend_checks
    return base

def review_file_with_claude(filename, diff, file_type):
    """Send file diff to Claude for review."""

    # Using Haiku — faster and cheaper than Sonnet or Opus
    # At 100+ MRs/month, cost compounds quickly
    # Haiku handles code review quality well for this use case
    client = anthropic.Anthropic(
        api_key=os.environ['ANTHROPIC_API_KEY']
    )

    # Truncate large diffs — stay within token limits
    if len(diff) > 8000:
        diff = diff[:8000] + "\n\n[diff truncated — file too large]"

    response = client.messages.create(
        model="claude-haiku-4-5",
        max_tokens=1500,
        system=build_system_prompt(file_type),
        messages=[{
            "role": "user",
            "content": f"Review this diff for {filename}:\n\n{diff}"
        }]
    )
    return response.content[0].text

def post_review_comment(mr, review_text, filename):
    """Post review as MR comment in GitLab."""
    comment = f"### 🤖 AI Review — `{filename}`\n\n{review_text}"
    comment += "\n\n*Automated review. Human approval still required.*"
    mr.notes.create({'body': comment})

def main():
    parser = argparse.ArgumentParser()
    parser.add_argument('--project-id', required=True)
    parser.add_argument('--mr-iid', required=True)
    args = parser.parse_args()

    gl = gitlab.Gitlab(
        url=os.environ.get('GITLAB_URL', 'https://gitlab.com'),
        private_token=os.environ['GITLAB_TOKEN']
    )

    try:
        changes, mr = get_mr_changes(
            gl, args.project_id, args.mr_iid
        )
    except Exception as e:
        print(f"Failed to fetch MR: {e}")
        sys.exit(0)  # Exit 0 — don't fail the pipeline

    reviewed = 0

    for change in changes:
        filename = change.get('new_path', '')
        diff = change.get('diff', '')

        # Skip deleted files and trivial changes
        if change.get('deleted_file') or len(diff) < 100:
            continue

        file_type = detect_file_type(filename)

        # Skip binary and unknown file types
        if file_type == 'general':
            continue

        review = review_file_with_claude(filename, diff, file_type)

        if review and 'no significant' not in review.lower():
            post_review_comment(mr, review, filename)
            reviewed += 1

    if reviewed == 0:
        mr.notes.create({
            'body': "## 🤖 AI Code Review\n\n"
                    "✅ No significant issues found in changed files.\n\n"
                    "*Human review still required before merging.*"
        })

    print(f"AI review complete. Reviewed {reviewed} files.")

if __name__ == "__main__":
    main()
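
One note on how the comments land: post_review_comment uses mr.notes.create, which posts a normal per-file comment in the MR's discussion rather than pinning it to a specific changed line. If you want findings anchored to exact diff lines, GitLab's discussions API accepts a position. A rough sketch, assuming you can extract a reliable line number from a finding:

def post_inline_comment(mr, filename, new_line, body):
    """Sketch: anchor a comment to a specific line of the MR diff.
    new_line is a line number in the new version of the file."""
    refs = mr.diff_refs  # base_sha / start_sha / head_sha for this MR
    mr.discussions.create({
        'body': body,
        'position': {
            'base_sha': refs['base_sha'],
            'start_sha': refs['start_sha'],
            'head_sha': refs['head_sha'],
            'position_type': 'text',
            'new_path': filename,
            'new_line': new_line,
        }
    })

Extracting a trustworthy line number from free-text findings is the hard part, which is why the script above settles for per-file comments.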

What Claude Actually Catches

After running this across 100+ MRs per month, the pattern is clear.

The three incident types that originally motivated this have not recurred since deployment. Hardcoded credentials get flagged at CRITICAL severity before any human reviewer sees the MR. Terraform security groups with permissive ingress rules get caught in the diff before they reach production. Tokens in documentation files get flagged immediately.

Beyond those:

  • Missing error handling in async backend functions
  • Unvalidated user input in frontend code
  • Terraform resources missing required tags
  • Debug logging that prints sensitive data
  • IAM policies broader than necessary

What it misses:

Claude sees the diff — not the system. It doesn't know why this code was written this way, what the business logic requires, what the rest of the codebase looks like, or whether this change conflicts with something in a file that wasn't modified. Business logic bugs, architectural issues, and context-dependent decisions still need human eyes.


The Prompt Engineering That Actually Mattered

The first version was nearly useless.

Claude posted 30+ comments on every MR. A mix of CRITICAL security findings and extremely minor style suggestions — all formatted identically. Developers started dismissing everything. Which is worse than having no AI review at all.

Three changes fixed it:

Maximum 10 findings. Hard limit in the system prompt. Forces Claude to prioritize. A review with 5 real issues beats one with 25 where 20 are noise. After this change, developers started reading the reviews instead of closing them.

Severity classification. CRITICAL/HIGH/MEDIUM/LOW on every finding. Developers can triage. CRITICAL gets fixed before merge. LOW is optional. This one change turned AI review from background noise into a signal worth acting on.

File-type-specific prompts. Terraform security checks are different from Python code quality checks. A generic prompt applied to both produces generic results for both. Separate prompts per file type — with specific checks relevant to that language and context — made findings significantly more relevant.

The system prompt is where 80% of the value lives. Spend time there.
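
A side benefit of the severity format is that the reviews become machine-readable. A small sketch, separate from the script above, that counts findings per level so a summary comment or the Jenkins log can call out how many CRITICALs came back:

import re

# Matches the "SEVERITY: [CRITICAL]" lines the system prompt enforces.
SEVERITY_RE = re.compile(r'^SEVERITY:\s*\[?(CRITICAL|HIGH|MEDIUM|LOW)\]?', re.MULTILINE)

def count_severities(review_text):
    """Count findings per severity level in one Claude review."""
    counts = {'CRITICAL': 0, 'HIGH': 0, 'MEDIUM': 0, 'LOW': 0}
    for level in SEVERITY_RE.findall(review_text):
        counts[level] += 1
    return counts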


What Still Requires a Human

Claude reviews the diff. It doesn't understand why the code was written this way. It doesn't know what the product decision was. It can't see the technical debt in the rest of the codebase that this change interacts with. It can't tell whether this feature does what the ticket actually asked for.

Human code review is still required before every merge. Always. Without exception.

What changed is what human review focuses on. Before this system, reviewers spent attention on mechanical checks — did someone remember error handling, are there obvious security gaps, is this formatted correctly. Now Claude handles that pass. Human reviewers focus on business logic, architecture, and the decisions that actually require knowing the system.

That's the right division of labor.


Results After Running This in Production

Security incidents originating from merge requests dropped to zero. The specific things that used to slip through — committed credentials, open security groups, tokens in documentation — haven't made it through since this system went live.

Junior developers are progressing faster. Consistent, detailed, patient feedback on every MR they open. No waiting for a senior engineer's attention. No inconsistency based on who reviews on what day. The AI explains the reasoning behind every suggestion — something a tired senior engineer at the end of a long review queue often skips.

Human reviews are more focused. When the mechanical checks are handled automatically, the conversation in human review shifts to what actually needs human judgment. Less back-and-forth on obvious issues. More time on architecture and business logic.


Setting This Up in Your Environment

What you need:

  • GitLab with MR webhooks configured to trigger Jenkins
  • Jenkins with the GitLab plugin installed
  • Anthropic API key stored as Jenkins credential: anthropic-api-key
  • GitLab personal access token (api scope) stored as: gitlab-api-token
  • Python 3.9+ available on your Jenkins agent

Start with the catchError wrapper in your Jenkinsfile stage and sys.exit(0) in your Python script so a failed AI review never blocks your pipeline. Build the safety net first. Tune the prompts second.

The system prompt is your highest-leverage investment. Write specific checks for your codebase, your language stack, and your team's standards. A generic prompt gives you generic results. The specificity is what makes it useful.
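
As a made-up example of what that specificity looks like, here are a few extra checks that could be appended to the terraform_checks string inside build_system_prompt. The rules themselves are placeholders for whatever your team actually enforces:

# Hypothetical team-specific additions, concatenated onto terraform_checks
# in build_system_prompt(). Replace with your own standards.
extra_terraform_checks = """
- Resources created outside approved regions (example: eu-west-1 only)
- S3 buckets without lifecycle rules for log expiration
- Module sources pinned to a branch instead of a version tag"""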


Credentials committed to repositories, security groups left wide open, tokens hardcoded in documentation — these are mechanical failures. They don't require judgment to catch. They require consistent attention that human reviewers, under pressure, sometimes don't have.

That's what this system does. The judgment calls still belong to the humans.


#devops #jenkins #gitlab #ai #claude #codereview #cicd #python #anthropic
