Paulo Almeida

Posted on May 24 • Edited on Jun 5

I asked Gemma 4 31B to audit SAP code offline—and it argued back about risk calibration

#gemmachallenge #gemma #sap #devchallenge

Gemma 4 Challenge: Build With Gemma 4 Submission

I was watching a debug log scroll by at 2 AM when Gemma 4 31B Dense did something I wasn't expecting.

It was classifying a custom ABAP function called Z_CUSTOM_BUSINESS_RULE from a synthetic legacy codebase. The RAG retriever had pulled three documents from the local corpus — including a Migration Priority Matrix that assigns MEDIUM severity to undocumented Z* functions by default. So I expected MEDIUM.

The model returned HIGH, with this justification (in the original Portuguese, copied verbatim from the run log):

"O objeto é um módulo de função no namespace customizado (Z*), o qual, conforme o Documento 1, deve ser sinalizado para CUSTOM_REVIEW para verificar se ele encapsula APIs internas ou acessa tabelas padrão. A severidade é definida como HIGH devido à ausência de documentação fornecida sobre a lógica interna do objeto."

Translated: the model flagged the function for human review (per the policy document that recommends review for undocumented Z* code), and elevated severity above the Matrix's default — citing the absence of documentation as the reason. The audit cannot confirm whether the function wraps non-published APIs, so the risk is treated as higher than a routine review.

I ran the exact same input on Gemma 4 E4B. Its justification, also from the log:

"O módulo Z* não é automaticamente proibido, mas deve ser sinalizado para revisão (CUSTOM_REVIEW). [...] Como não há informações sobre as dependências ou a lógica de negócio, a revisão é necessária."

Same classification (CUSTOM_REVIEW). Same reasoning about the policy. But E4B stayed at MEDIUM — the Matrix default. It saw the same gap in documentation and treated it as routine.

That difference — same input, same retrieved context, same correct classification, different calibration of how much risk the gap carries — is the most interesting thing I learned from this project, and the reason I'm submitting it to the Gemma 4 Challenge.

The project: SAPMigrate

SAPMigrate is a local-first ABAP audit assistant for SAP API governance and Clean Core migration — built around the SAP API Policy v4/2026 that SAP published in April 2026. It's a working prototype, not a finished product, but the pipeline runs end-to-end and the classifications hold up across reruns.

The full code is open: github.com/PauloAAlmeida/sapmigrate.

What makes the use case interesting isn't the SAP angle (although that's specific). It's that this is one of the few cases where local-first AI isn't a preference — it's an operational requirement.

Why this problem can't go to the cloud

Most "local AI" demos I see in this challenge are about preference: someone prefers offline because of latency, or cost, or privacy as a principle. Those are real, but soft.

ABAP code from real SAP customers is different:

It's intellectual property. Often under NDA.
It encodes pricing logic, customer master data, fiscal rules, integration secrets.
It's frequently subject to GDPR, LGPD, SOX, or internal security policies that explicitly prohibit external AI inference.

Sending it to OpenAI, Anthropic, or Google Cloud isn't a trade-off — it's a contract violation in many enterprise contexts. So if you want AI to help audit this kind of code, the AI has to come to the code. Not the other way around.

That's the design constraint that drove every decision in this project.

💡 A quick note on hardware: I ran the 31B model on a workstation with an RTX 5090 (32GB VRAM) to get the maximum reasoning capability for this audit. But if you're looking at "31B" and worrying about your laptop's VRAM, don't worry. You can clone the repo today and run it on modest hardware using the edge model. Just pass the environment variable: GEMMA_MODEL=gemma4:e4b python app.py. (I actually compare the reasoning between both models below!)

What the v4/2026 policy means for this prototype

In April 2026, SAP published API Policy v.4.2026a — commonly referred to in the community as v4/2026. I am not treating the policy as a legal rule engine. SAPMigrate is an engineering audit assistant, so its job is to flag code paths that deserve architectural review.

For this prototype, the relevant audit signals are:

Published APIs — APIs listed in SAP Business Accelerator Hub or otherwise documented for a product-specific use case — are treated as the safest integration surface.
Non-published or internal SAP interfaces — such as private RFCs, undocumented classes, or interfaces not identified as released — are treated as API-governance risks.
Deprecated APIs with documented replacements become migration candidates.
Direct standard-table access (MARA, BSEG, VBAK, KNA1) is treated as a Clean Core risk because it creates direct dependencies on internal data structures.

Here's the nuance I got wrong twice before getting it right: customer Z* and Y* function modules are not automatically prohibited. They are customer-developed code. But they still deserve review when they wrap non-published SAP APIs, perform direct standard-table access, or bypass documented controls.

This nuance is why my classifier has five output classes, not four:

PUBLISHED — released APIs from the Accelerator Hub
INTERNAL — non-published SAP objects (direct table access, undocumented SAP RFCs)
DEPRECATED — published but obsolete with a clear replacement
CUSTOM_REVIEW — customer Z*/Y* code that needs a human auditor's judgment
UNKNOWN — insufficient context to classify

The first version of the system flagged all Z* code as INTERNAL. That was wrong, and a senior SAP auditor would have caught it immediately. Reading the policy language carefully — especially the distinction between published SAP APIs, non-published SAP interfaces, and customer-developed ABAP — forced me to add CUSTOM_REVIEW as a class.

Why Gemma 4 31B Dense, specifically

I tested the model choice. Specifically, I ran the full demo pipeline with both Gemma 4 31B Dense and Gemma 4 E4B, same RAG index, same prompts, temperature 0.1 in both cases. Raw logs are committed in the repo as demo_runs_v4.log (31B Dense) and demo_runs_v4_e4b.log (E4B).

Here's what the comparison shows.

Classification: 8/8 agreement. Both models correctly assigned PUBLISHED, INTERNAL, DEPRECATED, or CUSTOM_REVIEW for every demo finding. E4B is not pattern-matching on this dataset — it produces structured reasoning over the retrieved policy text and reaches the right category every time.

Severity calibration: 6/8 agreement. The two models diverge on two findings, and the divergence is the more interesting result.

SAP object	31B Dense	E4B
`BAPI_PO_CREATE1`	PUBLISHED, LOW	PUBLISHED, LOW
`MARA`	INTERNAL, CRITICAL	INTERNAL, CRITICAL
`MARC`	INTERNAL, CRITICAL	INTERNAL, CRITICAL
`BAPI_CUSTOMER_CREATEFROMDATA1`	DEPRECATED, MEDIUM	DEPRECATED, HIGH
`VBAK`	INTERNAL, CRITICAL	INTERNAL, CRITICAL
`BAPI_SALESORDER_CREATEFROMDAT2`	PUBLISHED, LOW	PUBLISHED, LOW
`Z_INTERNAL_PRICE_CALC`	CUSTOM_REVIEW, HIGH	CUSTOM_REVIEW, HIGH
(OData wrapper)	no candidates	no candidates
`Z_CUSTOM_BUSINESS_RULE`	CUSTOM_REVIEW, HIGH	CUSTOM_REVIEW, MEDIUM

Look at the two divergent rows:

BAPI_CUSTOMER_CREATEFROMDATA1: 31B Dense → MEDIUM, E4B → HIGH. E4B inflates severity for a deprecated BAPI that has a clean replacement and a multi-year deprecation runway. 31B Dense reads it as "plan migration, not emergency."
Z_CUSTOM_BUSINESS_RULE: 31B Dense → HIGH, E4B → MEDIUM. E4B falls back to the matrix default for undocumented Z*. 31B Dense elevates severity, citing absence of documentation as the reason in the justification.

This is the audit value of the larger model. A senior SAP auditor's job is not to classify "what kind of API is this" — junior auditors can do that. The senior's job is to calibrate how much risk an undocumented or deprecated call carries in context. On this demo set, 31B Dense's severity assignments track that calibration more consistently: conservative on deprecated-with-clear-replacement, aggressive on undocumented-customer-code-with-unknown-dependencies.

E4B is more uniform — closer to the default Migration Priority Matrix value in both directions. For low-stakes scanning or initial triage, E4B is viable and runs comfortably on a laptop GPU. For audit output that an SAP architect will sign off on, 31B Dense is the model I'd ship.

A note on the model's Portuguese

There's a second reason 31B Dense was the choice: native multilingual fluency at the technical depth required.

The UI is in English (for international auditors), but the model's justifications are in Brazilian Portuguese. This is a deliberate showcase of Gemma 4 31B's multilingual capability — the justification field reads like a senior SAP consultant's report in PT-BR.

Here's an actual sample the model produced for a direct SELECT on the MARA table:

"O acesso direto à tabela MARA via SELECT cria uma dependência direta de estrutura interna não publicada e deve ser revisado no contexto de Clean Core, APIs publicadas e documentação específica do produto."

Auditor-grade Portuguese. Not translated, not awkward — generated natively by the local model.

How the system works

Before diving into the steps, it's worth noting how lean the architecture is. The entire app.py is extremely minimal, relying on a lightweight stack: Gradio (UI), ChromaDB (local vector storage), Ollama (model inference), and Pydantic (schema validation). This simplicity is intentional—it proves the LLM is doing the heavy lifting of reasoning, rather than relying on complex, hardcoded Python logic.

Five stages, each one boring on its own:

ABAP parser — regex-based extraction of BAPIs, SELECTs on standard tables, and Z*/Y* function calls. Not an AST parser (that's roadmap). Misses dynamic calls and complex macros, but covers the common static patterns in the demo dataset.
RAG retriever — ChromaDB with intfloat/multilingual-e5-base embeddings. The corpus is a curated synthesis of SAP public documentation (API Policy, Business Accelerator Hub reference, Clean Core principles, deprecated APIs map). For production use, this should be replaced with a snapshot of real SAP Help Portal pages.
Gemma 4 31B Dense via Ollama — classifies each candidate with system prompt + code excerpt + top-3 retrieved chunks. Output format is forced to JSON. Temperature 0.1 for determinism.
Pydantic validation — structured Finding object with classification, severity, recommended alternatives, justification, source location. 8/8 outputs validated without retries in the demo set.
Gradio UI — three tabs: findings table, detailed view per finding, executive report grouped by remediation priority. The justification field is prominently labeled "Justification (PT-BR)" — making the multilingual reasoning visible.

The 31B Dense and 26B MoE variants ship with a 256K context window; the edge models E2B and E4B use 128K. Either is comfortable for typical audit prompts.

What it looks like

Findings table

Finding detail with PT-BR justification

Executive report

Reproducibility, not luck

The demo set is intentionally small (6 ABAP files, 8 audit candidates). It's a smoke test, not a benchmark. But I re-ran the full pipeline from a fresh venv. The classifications matched what's in the README on both models — the meaningful reproducibility test is in the committed logs (demo_runs_v4.log and demo_runs_v4_e4b.log), which any reader can regenerate with python rag/ingest.py && python -m classifier.gemma_classifier demo/*.abap.

Some numbers from the reference run on the 31B Dense:

Metric	Value
JSON output validity (Pydantic)	8/8
Determinism across reruns	Same class/severity; justification wording varies slightly
Average latency per finding	~10–15 s (RTX 5090, Q4_K_M)
Peak VRAM	~22 GB

What I learned about Gemma 4 31B Dense

Five things stood out, in order of how surprised I was:

JSON output is rock-solid. I expected at least some retries. There were none. format="json" with temperature 0.1 produced valid Pydantic outputs every time.
Brazilian Portuguese technical fluency is real. The justifications read like a senior consultant's report. I would not have predicted this from running the model on conversational prompts.
The 31B vs E4B gap is about severity calibration, not classification. Both models classify correctly on the demo set. The difference shows up when assigning how much risk a finding carries: 31B Dense is more conservative on deprecated APIs with clear replacements, and more aggressive on undocumented Z* code with unknown dependencies. E4B is more uniform — closer to the default Migration Priority Matrix value in both directions.
256K context is comfortable. No need for chunk-juggling when combining system prompt + code + 3 RAG snippets + few-shots.
Determinism is sufficient for product use. Re-running the same input twice gave the same classification and severity. The justification text varies slightly, but the decision doesn't.

For chat interfaces, E4B is probably fine and faster. For structured technical audit with retrieved evidence and audit-grade severity calibration, 31B Dense — the "boring" dense model nobody hypes — is the one I'd pick again.

Honest scope

This is a proof of concept. The honest limitations:

The parser is regex-based. It misses dynamic calls (CALL FUNCTION lv_name) and complex macros.
The demo set is 6 small files. Real ABAP codebases have 10× the noise.
The RAG corpus is author-curated synthesis, not redistributed SAP documentation. For real audits, you'd replace it with actual SAP Help Portal snapshots.
It doesn't handle SAP CPI flows or JCo Java integrations yet.
This tool is an engineering audit assistant, not a legal compliance determination tool. Findings should be reviewed by qualified SAP architects, security teams, and legal/compliance stakeholders before being treated as policy violations.

Try it yourself

The full repo is at github.com/PauloAAlmeida/sapmigrate. Setup is straightforward if you have Ollama installed:

git clone https://github.com/PauloAAlmeida/sapmigrate.git
cd sapmigrate
python3 -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt
ollama pull gemma4:31b
python rag/ingest.py
python app.py

If your hardware is more modest (this is tested on RTX 5090 with 32 GB VRAM), you can swap to a smaller variant via environment variable:

GEMMA_MODEL=gemma4:e4b python app.py

You'll get correct classifications but different severity calibration on edge cases. The pipeline still runs.

What I'd build next

If this went beyond a proof of concept:

Full ABAP AST parser (ANTLR-based) instead of regex
Live SAP Notes / Accelerator Hub integration via authenticated APIs
Patch generation for common migration patterns (BAPI swap, OData wrapper)
CI/CD gate mode: block merges on CRITICAL findings
Multilingual justification toggle (English / German / Portuguese)

But the core thesis I wanted to demonstrate is already there: for sensitive enterprise code that can't go to the cloud, Gemma 4 31B Dense — running locally, on the auditor's machine — produces structured, multilingual, evidence-grounded reasoning that holds up to senior-auditor scrutiny on the cases I've tested.

The 31B Dense's edge isn't that it classifies what E4B can't. It's that it calibrates severity the way a senior auditor would — citing the absence of documentation as the reason to elevate risk on a Z* function whose contract is unknown — and explains its reasoning in fluent technical Portuguese while doing it. That's something I genuinely didn't think a 31B-parameter local model would do six months ago.

Submission for the Gemma 4 Challenge. Built in Rio de Janeiro on an RTX 5090. No proprietary ABAP code was harmed in the making of this prototype.

Edit: re-hosted three screenshots that had failed to load from GitHub raw.

Paulo Almeida — Senior Data Scientist & AI Engineer, github.com/PauloAAlmeida.

DEV Community