DEV Community: Mark Thorn

SLMs vs. LLMs: When Smaller Wins

Mark Thorn — Wed, 13 May 2026 11:52:31 +0000

There is a reflex in AI engineering right now: when in doubt, reach for the biggest model you can afford. GPT-4o for the customer support bot. Claude Opus for the internal search tool. A frontier-class model for the document classifier that runs ten thousand times a day.

That reflex is expensive. And in a growing number of production scenarios, it is also wrong.

Small language models are no longer a compromise you accept when you cannot afford the real thing. They are a deliberate architectural choice that, in the right context, beats larger models on latency, cost, privacy, and even accuracy. This post gives you the framework to know when that context applies to your project.

What Makes a Model "Small"?

The working definition across the industry is any language model under ten billion parameters. In practice, most SLMs deployed in production today sit between one and seven billion parameters. Common examples include Microsoft's Phi-4 family, Google's Gemma 3, Meta's Llama 3.2 1B and 3B, Mistral AI's Ministral 3B, and Alibaba's Qwen3 family.

For context: GPT-4 is estimated at over one trillion parameters. DeepSeek R1 runs at 671 billion. The gap in raw scale is enormous. The gap in practical performance on many real tasks is surprisingly narrow, and in some cases it has flipped.

The Case That Changed the Conversation

The most cited evidence for SLMs in 2025 came from Microsoft's Phi-4 line. Phi-4-reasoning-plus, a 14-billion-parameter model (just above the ten-billion line, but still tiny next to frontier models), outperformed DeepSeek-R1-Distill-Llama-70B (a model five times its size) on multiple demanding benchmarks, and approached the performance of the full DeepSeek R1 at 671 billion parameters on the AIME 2025 math exam.

Phi-4-mini-reasoning, with only 3.8 billion parameters, showed comparable results to OpenAI o1-mini on math benchmarks and surpassed it on Math-500 and GPQA Diamond evaluations.

The mechanism behind this is important. Microsoft did not just shrink a large model. They used curated synthetic training data, careful filtering of high-quality organic data, and reinforcement learning to instill strong reasoning without needing massive parameter counts. The insight: better data beats more parameters, at least up to a point.

This is not a one-off result. In healthcare, the domain-specific Diabetica-7B model achieved 87.2% accuracy on diabetes-related queries, surpassing both GPT-4 and Claude 3.5 on that specific task. Mistral 7B has been shown to outperform Meta's LLaMA 2 13B across various benchmarks. The pattern is clear: a well-trained small model that knows your domain deeply will beat a general giant that knows everything shallowly.

The Four Dimensions That Matter in Production

The benchmark headline is useful. The production reality is more nuanced. Here are the four dimensions that actually drive the SLM vs. LLM decision.

1. Cost

This is where SLMs make their most compelling case. Studies report up to 11x cost savings on inference when switching from frontier models to optimized small models. Flagship LLMs charge $2-15 per million tokens depending on input vs. output. Smaller models on the same infrastructure can drop that to fractions of a cent.

The math scales fast. A customer support pipeline handling one million conversations a month at 700 tokens per conversation processes 700 million tokens a month, which is a very different bill at GPT-4o pricing versus a self-hosted 7B model. Training frontier LLMs costs over $100 million, and inference bills grow quickly with volume. Some estimates put the reduction in cost per million queries at over 100x at scale.
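To make the volume arithmetic concrete, here is a back-of-the-envelope sketch in Python. It uses only the figures quoted above (one million conversations, 700 tokens each, $2-15 per million tokens, and the reported 11x saving). The function name and the single flat per-token price are simplifying assumptions; real bills price input and output tokens separately.

```python
def monthly_bill(conversations: int, tokens_per_conversation: int,
                 usd_per_million_tokens: float) -> float:
    """Flat-rate estimate: total tokens times one per-million-token price."""
    total_tokens = conversations * tokens_per_conversation
    return total_tokens / 1_000_000 * usd_per_million_tokens


CONVERSATIONS = 1_000_000
TOKENS_EACH = 700

low = monthly_bill(CONVERSATIONS, TOKENS_EACH, 2.0)    # low end of the flagship range
high = monthly_bill(CONVERSATIONS, TOKENS_EACH, 15.0)  # high end of the flagship range

print(f"Flagship LLM: ${low:,.0f} to ${high:,.0f} per month")
# Apply the reported "up to 11x" saving as an illustrative best case.
print(f"After an 11x saving: ${low / 11:,.0f} to ${high / 11:,.0f} per month")
```

At these assumptions the flagship bill lands between roughly $1,400 and $10,500 a month, before any retries, system prompts, or output-heavy responses are counted.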

Quantization sharpens this further. 4-bit quantization via GPTQ can retain near-full accuracy while cutting operational costs by a reported 60-70%.
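Part of the reason is plain arithmetic on weight storage. The sketch below estimates weight memory for a 7B-parameter model at different bit widths. It is an approximation under stated assumptions: it ignores quantization metadata (scales and zero points), activations, and the KV cache, so read the numbers as a lower bound rather than a sizing guide.

```python
def weight_memory_gb(params: float, bits_per_weight: int) -> float:
    """Approximate weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return params * bits_per_weight / 8 / 1e9


PARAMS_7B = 7e9
for bits in (16, 8, 4):
    print(f"{bits:>2}-bit weights: {weight_memory_gb(PARAMS_7B, bits):.1f} GB")
```

Going from 16-bit to 4-bit weights shrinks storage by a factor of four, which is often the difference between needing a data-center GPU and fitting on a single consumer card.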

2. Latency

Cloud-hosted LLMs introduce round-trip latency in the hundreds of milliseconds. That is acceptable for many applications. It is not acceptable for real-time agents, interactive code completion, industrial robotics requiring 10ms response windows, or any user-facing feature where perceived speed is part of the product.

SLMs serve tokens in tens of milliseconds compared to hundreds for cloud-hosted LLMs. On-device deployment eliminates the round-trip entirely. Speculative decoding, a technique that uses a tiny model to draft tokens which a larger model then verifies, can deliver 2-3x speed improvements in inference pipelines and pairs particularly well with small models.
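The draft-then-verify loop is easier to see in code. This is a minimal sketch of the greedy variant of speculative decoding, with toy functions standing in for real models; `draft`, `target`, and `TEXT` are illustrative names, not a library API. Production implementations verify all drafted positions in a single batched forward pass and use a probabilistic acceptance rule when sampling.

```python
from typing import Callable, List

Model = Callable[[List[str]], str]  # maps a context to its greedy next token


def greedy_decode(model: Model, prompt: List[str], max_new_tokens: int) -> List[str]:
    """Baseline: one call to the (large) model per generated token."""
    out = list(prompt)
    for _ in range(max_new_tokens):
        out.append(model(out))
    return out


def speculative_decode(draft: Model, target: Model, prompt: List[str],
                       max_new_tokens: int, k: int = 4) -> List[str]:
    """Draft k tokens with the small model, keep the prefix the target agrees with."""
    out = list(prompt)
    limit = len(prompt) + max_new_tokens
    while len(out) < limit:
        # 1. The draft model proposes up to k tokens autoregressively.
        proposal: List[str] = []
        for _ in range(min(k, limit - len(out))):
            proposal.append(draft(out + proposal))
        # 2. The target checks each drafted position (called per position
        #    here for clarity; a real system batches this step).
        for i, token in enumerate(proposal):
            expected = target(out + proposal[:i])
            if expected != token:
                # Keep the agreed prefix, then take the target's own token.
                out.extend(proposal[:i])
                out.append(expected)
                break
        else:
            out.extend(proposal)  # every drafted token was accepted
    return out


# Toy models: the draft agrees with the target except at every fifth position.
TEXT = "the quick brown fox jumps over the lazy dog".split()


def target(ctx: List[str]) -> str:
    return TEXT[len(ctx) % len(TEXT)]


def draft(ctx: List[str]) -> str:
    return "cat" if len(ctx) % 5 == 4 else target(ctx)


fast = speculative_decode(draft, target, ["the"], max_new_tokens=8)
slow = greedy_decode(target, ["the"], max_new_tokens=8)
print(fast == slow)
```

The property worth noticing is that the output matches what the target model would have produced on its own. The speedup comes from accepting several drafted tokens per verification step, so it depends on how often the draft model agrees with the target.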

3. Privacy and Data Sovereignty

This is the dimension that closes deals in regulated industries.

Healthcare, finance, and legal sectors face regulations that demand data sovereignty. When you send a query to a cloud LLM API, that data leaves your infrastructure. With a locally deployed SLM, it never does. The privacy guarantee is architectural, not contractual.

Gartner predicts that by 2026, over 55% of deep learning inference will occur at the edge, up from under 10% a few years ago. The driver is not just performance. It is the enterprise demand for "your data never leaves your device" as a hard guarantee rather than a service-level promise.

Research from SandLogic Technologies on their Shakti SLM family demonstrates that compact models, when carefully engineered and fine-tuned, meet and often exceed expectations in healthcare, finance, and legal edge-AI scenarios, domains where sending data to external APIs is frequently impractical or prohibited.

4. Domain Accuracy After Fine-Tuning

This is the most underappreciated advantage. A general LLM is optimized to be decent at everything. A fine-tuned SLM is optimized to be excellent at your thing.

For domain-specific tasks, a well fine-tuned SLM can outperform a much larger general-purpose LLM. Fine-tuning a 7B model requires far less compute than fine-tuning a 70B model, so it is cheaper and faster to iterate on, and it produces a model that deeply internalizes your output formats, terminology, and reasoning patterns. The tradeoff is that it generalizes less well outside that domain, which is usually exactly what you want in production.

Research comparing SLMs and LLMs across NLP, reasoning, and programming tasks found that in four out of six selected tasks, fine-tuned SLMs maintained performance comparable to LLMs with a significant reduction in carbon emissions during inference. The environmental argument is real but secondary. The economic one is primary.

Where LLMs Still Win

Honesty requires naming the cases where SLMs fall short.

Open-ended reasoning and novel problem-solving. When the task is genuinely unpredictable, requires synthesizing information across disparate domains, or demands the kind of long-horizon reasoning that frontier models have been trained to handle, scale still matters. A 7B model will not replace Claude Opus or GPT-4o for complex multi-step agent tasks with ambiguous requirements.

Long context and memory. Frontier reasoning and long conversations still favor the cloud. Mobile NPUs are powerful, but decode-time inference is memory-bandwidth bound: generating each token requires streaming the full model weights from memory. On-device SLMs are excellent for formatting, light Q&A, and summarization. They are not yet the right tool for tasks requiring a 1M-token context window.

Generalization across unfamiliar domains. If your product serves wildly varied queries across different domains and you cannot predict what users will ask, an LLM's broad pretraining gives it resilience that a narrow SLM cannot match without a very expensive fine-tuning pipeline.

Cold start. If you are still validating whether your product is worth building, start with an LLM API. Iteration speed matters more than cost efficiency at the hypothesis stage.

The Architecture Most Teams Are Actually Shipping

The binary choice between SLM and LLM is increasingly a false one. Many teams in 2026 are landing on a hybrid approach: use an LLM for complex, unpredictable queries and route straightforward, high-volume tasks to a specialized SLM.

This is called model routing, and it has become a serious engineering discipline. Model routing can reduce LLM token costs by 20-60% while maintaining output quality. The pattern looks like this:

A lightweight router (itself often a small classifier or a fast SLM) examines each incoming query, estimates its complexity, and sends it to the right model tier. Simple extractive tasks, formatting jobs, classification, and high-confidence template responses go to the SLM. Queries that require nuanced judgment, creative synthesis, or complex reasoning escalate to the LLM.
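A query-level router along these lines can be sketched in a few lines of Python. Everything here is hypothetical: the task labels, the confidence score (assumed to come from an upstream classifier), and the thresholds are placeholders that a real system would tune against its own traffic.

```python
from dataclasses import dataclass

# Hypothetical task labels for work the SLM tier is trusted with.
SLM_TASKS = {"classification", "extraction", "formatting", "template"}


@dataclass(frozen=True)
class Route:
    tier: str    # "slm" or "llm"
    reason: str  # kept for logging and later evaluation of routing decisions


def route(task_type: str, query: str, confidence: float,
          min_confidence: float = 0.8, max_words: int = 200) -> Route:
    """Send narrow, high-confidence work to the SLM; escalate everything else."""
    if task_type not in SLM_TASKS:
        return Route("llm", f"task type '{task_type}' is outside the SLM's scope")
    if confidence < min_confidence:
        return Route("llm", "classifier is not confident the task is simple")
    if len(query.split()) > max_words:
        return Route("llm", "query is too long for the simple tier")
    return Route("slm", "well-defined, high-confidence task")


print(route("classification", "Label this ticket: refund request", confidence=0.95))
print(route("planning", "Draft a migration plan for our billing system", confidence=0.95))
```

Defaulting to the LLM whenever a check fails is a deliberate choice in this sketch: a wrongly escalated query costs money, while a wrongly downgraded one costs quality.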

Research on hybrid inference architectures takes this further, evaluating routing at the token level rather than the query level. The SLM generates tokens, and each token is scored against the LLM's probability distribution. Tokens scoring above a threshold are accepted; those below prompt the LLM to take over. This ensures cloud resources are only used when genuinely necessary.
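A toy version of that token-level scheme might look like the following. It is a sketch under simplifying assumptions, not the method from any specific paper: the models are stand-in functions, the threshold is arbitrary, and the LLM is queried for every token, which a real deployment would avoid by batching or gating the verification step.

```python
from typing import Callable, Dict, List, Tuple

Dist = Dict[str, float]  # token -> probability under the LLM


def hybrid_generate(slm_next: Callable[[List[str]], str],
                    llm_dist: Callable[[List[str]], Dist],
                    prompt: List[str], max_new_tokens: int,
                    threshold: float = 0.3) -> Tuple[List[str], int]:
    """Accept SLM tokens the LLM finds plausible; otherwise let the LLM choose."""
    out = list(prompt)
    llm_takeovers = 0
    for _ in range(max_new_tokens):
        dist = llm_dist(out)
        candidate = slm_next(out)
        if dist.get(candidate, 0.0) >= threshold:
            out.append(candidate)                 # SLM token accepted
        else:
            out.append(max(dist, key=dist.get))   # LLM takes over this token
            llm_takeovers += 1
    return out, llm_takeovers


# Toy stand-ins, indexed by position after a one-token prompt.
LLM_DISTS = [
    {"refund": 0.7, "return": 0.3},
    {"approved": 0.9, "denied": 0.1},
    {".": 1.0},
]
SLM_TOKENS = ["return", "denied", "."]

tokens, takeovers = hybrid_generate(
    slm_next=lambda ctx: SLM_TOKENS[len(ctx) - 1],
    llm_dist=lambda ctx: LLM_DISTS[len(ctx) - 1],
    prompt=["status:"],
    max_new_tokens=3,
)
print(tokens, takeovers)
```

The threshold is the cost/quality dial: raising it sends more tokens to the LLM, lowering it trusts the SLM more.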

As of 2026, most production AI teams route across at least four model providers. Routing is no longer an optimization. It is the default architecture.

A Practical Decision Framework

Use this to make the call on your next project.

Reach for an SLM when:

Your task is well-defined and your training data is clean. A classification pipeline, an extraction task, a structured generation job with a fixed output schema. The narrower the task, the stronger the SLM argument.
Latency below 100ms is a requirement. Real-time agents, edge devices, interactive UI.
Data cannot leave your infrastructure. Healthcare records, legal documents, financial data in regulated environments.
You are operating at scale and inference cost is material. If you are running millions of queries a month, a 10x cost reduction is a meaningful engineering goal.
You have a stable domain and are willing to invest in fine-tuning. The investment pays back faster than most teams expect.

Stay with an LLM when:

You are still in validation mode and need fast iteration. LLM APIs give you a working prototype in hours.
Your queries are diverse, unpredictable, or genuinely require broad general knowledge.
The task demands complex, multi-step reasoning without a well-defined answer format.
Long context is a core requirement (above 32K tokens reliably).

Build a hybrid when:

You have a mix of query types at scale. Route by complexity.
You need both the speed of a local model and the intelligence of a frontier model. Serve simple queries on-device, escalate to the cloud selectively.
Cost and quality are both non-negotiable. The hybrid pattern is the main way teams serve both without compromise.

The Bigger Shift

The industry narrative is moving from "which model is best?" to deliberate model selection by task. Capgemini and Wavestone's 2026 tech trend reports both flag the shift from one LLM for everything toward intentional model tier selection as mainstream engineering practice.

This is a maturity milestone. When teams were first deploying LLMs, using the biggest model available felt safe. Now the discipline has caught up. We know enough about failure modes, cost curves, and domain performance to make principled choices rather than defaulting to scale.

The SLM vs. LLM question is really a resource allocation question. Every query you send to a frontier model that a fine-tuned 3B model would answer just as well is money you did not invest in the parts of your product that actually need it.

Most production AI is not doing the thing that requires a trillion parameters. Figure out what your product actually needs, and size the model accordingly.

What is your current stack? Are you routing between model tiers, or still on a single model for everything? Drop it in the comments.

The Fintech Startups That Actually Take Compliance Seriously

Mark Thorn — Thu, 07 May 2026 09:27:48 +0000

Compliance in fintech has a reputation problem. For most of the last decade, the word meant a checklist that founders grudgingly worked through before launch, a legal cost center, and something you dealt with after you had product-market fit. The pattern played out the same way repeatedly: build fast, grow fast, get regulated, scramble.

That approach is running out of road. In January 2025, state regulators fined Block $80 million for insufficient money laundering controls. Starling Bank paid £28.96 million to the UK's FCA in 2024 for financial crime failings. According to KPMG's Pulse of Fintech H1 2025, global fines for non-compliance in the first half of 2025 totalled $1.23 billion, a 417% increase on the same period a year earlier. Even large, well-funded companies are not immune.

A different generation of fintech startups has drawn a different conclusion from this environment. Instead of treating compliance as a problem to solve later, they have built it into the architecture from the start. The audit trail is not an add-on. The traceability is not a feature. The risk controls are not a wrapper around the product. They are the product.

These are five of those startups, each solving a distinct layer of the compliance problem.

1. Neno — Compliance as a Design Principle

Website: neno.co

Most fintech back offices are a mess of disconnected tools, manual reconciliation, and accountability gaps. Neno was built by people who had lived through that mess at some of the most regulated companies in European fintech and decided to start over.

The team behind Neno includes veterans from Adyen, Plaid, Mollie, Deloitte, BDO, and EY. They are backed by Motive Partners and Firstminute Capital, alongside angels from PayPal, Deel, Coinbase, and Miro. That background shapes the product's operating philosophy in a way that is immediately visible in how Neno approaches the basics.

The core principle is stated plainly in their manifesto: every action and transaction must be logged, traceable, and explainable. Not as a compliance workaround. As the baseline expectation for any system that handles real money.

Neno builds the complete back office for entrepreneurs, covering incorporation, business accounts, invoicing, bookkeeping, payroll, and tax in one connected system. The reason this matters for compliance is fragmentation. When financial data lives across five different tools, lineage breaks, reconciliation becomes manual, and the audit trail becomes reconstruction work rather than a live record. Neno eliminates that fragmentation at the source.

Key features:

B.V. incorporation with compliance documentation handled from day one
Business accounts and cards through Swan, an EU-regulated Electronic Money Institution operating under a French ACPR license and registered with De Nederlandsche Bank
Invoicing connected directly to bookkeeping with no manual reconciliation step
Automated bookkeeping, payroll, and tax with human oversight preserved throughout
Every transaction logged, timestamped, and traceable by design
Enterprise-grade security controls for all automated operations
AI-assisted workflows where humans remain in control of consequential decisions

The compliance architecture here is not just about satisfying a regulator. It is about what happens when your accountant asks a question, when an investor requests a financial report, or when you need to understand why a number changed. The answer is already in the system, traceable back to the original transaction.

Built for: Entrepreneurs and small businesses in the EU who want a back office that runs compliantly without requiring a compliance team to operate it.

2. Salv — Collaborative AML Intelligence

Website: salv.com

The standard model for AML compliance has a fundamental structural flaw. Financial institutions work alone. Each one monitors its own customers in isolation, files suspicious activity reports to regulators, and has no way to know whether the person they just flagged is already being investigated by three other banks.

Salv was founded in 2018 by Taavi Tamkivi, who built the AML, fraud, and KYC teams at Wise and Skype, alongside Jeff McClelland and Sergei Rumjantsev. The founding insight was simple: criminals work in networks. The institutions trying to stop them do not. That asymmetry is exploited continuously.

Salv's answer is a collaborative crime-fighting platform built around two products. The first is an AML platform covering transaction monitoring, customer risk assessment, and screening. The second is Salv Bridge, an encrypted network that allows financial institutions to securely exchange intelligence on bad actors across legal and jurisdictional boundaries, within the bounds of GDPR and EU data protection law.

The Bridge concept was piloted in Estonia with full support from the country's Financial Supervision and Resolution Authority, Data Protection Inspectorate, and Financial Intelligence Unit. All of the largest banks in Estonia participated. The results were concrete: in the early network alone, institutions were preventing financial crime worth €50,000 to €100,000 per week, and the pilot prevented up to €3 million from reaching criminal-controlled accounts.

Key features:

Real-time transaction monitoring with automated alert triage
Customer risk assessment and ongoing monitoring
Sanctions screening across major global watchlists
Salv Bridge: encrypted inter-institutional intelligence sharing network
Privacy Enhancing Technology (PET) enabling secure data sharing without exposing raw customer data
ISO/IEC 27001:2022 certified, SOC 2 Type 2 audited
Modular SaaS pricing, deployable in one to three weeks
Compatible with core banking providers including Mambu, Thought Machine, and Temenos

Salv is currently active across ten European countries and expanding into Germany, the Czech Republic, and Spain. The model matters because it shifts AML from a defensive compliance exercise into an active, networked crime-fighting effort, which is closer to how financial crime actually operates.

Built for: Banks, fintechs, electronic money institutions, and crypto companies operating under EU regulatory frameworks that need collaborative AML intelligence alongside their core monitoring tools.

3. Hummingbird — AML Operations for the Modern Compliance Team

Website: hummingbird.co

AML compliance generates a staggering volume of operational work. Every flagged transaction needs to be investigated. Every investigation needs to be documented. Every Suspicious Activity Report needs to be filed, reviewed, and tracked. For most compliance teams, the tooling to do this work is scattered, outdated, or built for a different era of financial crime.

Hummingbird was founded in 2016 by Joe Robinson and Jesse Reiss, with team backgrounds from Square, Stripe, the US Treasury, and the Office of the Comptroller of the Currency. The thesis from the beginning was that the tools compliance professionals use daily are far behind the tools available to the fraudsters and money launderers they are trying to catch.

The platform covers the full lifecycle of AML compliance work: customer due diligence, transaction and risk monitoring, case management, suspicious activity reporting, and regulatory filing. In September 2025, Hummingbird launched a unified risk and compliance platform bringing all of these capabilities together alongside new customer screening tools covering sanctions, PEP checks, and adverse media monitoring throughout the customer lifecycle.

In 2024, Hummingbird acquired LogicLoop to expand its no-code automation capabilities, allowing compliance teams to build and modify detection rules and workflows without requiring engineering support. The platform has since launched AI Agents and an AI Assistant designed to automate routine casework while keeping investigators focused on decisions that require judgment.

Key features:

Customer due diligence with support for onboarding approvals, periodic monitoring, and enhanced due diligence
Transaction and risk monitoring with customizable detection rules
Case management with collaborative investigation workflows
Automated SAR, STR, and CTR preparation with one-click e-filing
Customer screening for sanctions, PEP exposure, and adverse media
No-code automation builder for compliance workflows
AI Agents for alert handling, case preparation, and activity monitoring
Reported 70 to 90% reduction in time-per-case for customers using automated workflows
Recognized in Forrester's Financial Crime Management Solutions Landscape Q1 2026

Hummingbird has raised $41.2 million in total funding. Its customers include Stripe, Etsy, DraftKings, and FirstBank Puerto Rico, spanning payments platforms, marketplaces, sports betting operators, and traditional banks.

Built for: Banks, fintechs, gaming operators, and crypto companies that need a unified, AI-augmented platform for managing AML investigations and regulatory reporting at scale.

4. Sardine — Fraud and Compliance Unified

Website: sardine.ai

The conventional approach to fraud prevention and AML compliance treats them as separate problems. Separate teams, separate tools, separate data sets. Sardine was built on the observation that this separation is itself a vulnerability.

Founded in 2020 by Soups Ranjan, who previously led data science and risk at Coinbase and headed crypto at Revolut, Sardine combines fraud detection, AML compliance, and identity verification in a single platform. The insight driving the architecture is that 90% of fraud detected on Sardine's customer platforms comes from individuals who have already passed the standard KYC process. Compliance checks at onboarding are not equivalent to fraud prevention. The ongoing behavioral signal matters as much as the initial verification.

The platform uses device intelligence, behavioral biometrics, and machine learning to evaluate risk continuously, not just at onboarding. By February 2025, Sardine had profiled more than 2.2 billion devices and served over 300 enterprise customers including FIS, Deel, GoDaddy, and X, with 130% year-over-year ARR growth in 2024. The company raised a $70 million Series C in February 2025, bringing total funding to $145 million, led by Activant Capital with participation from Andreessen Horowitz, Google Ventures, Moody's Analytics, and Experian Ventures.

Key features:

Device intelligence and behavioral biometrics for real-time fraud detection
KYC and KYB automation with coverage across 150+ countries
Transaction monitoring for money laundering detection and money mule activity
Sanctions screening, PEP monitoring, and adverse media checks
Customer risk rating for ongoing CDD
Agentic AML operations: automated alert review, investigation support, and audit-ready outputs
Sponsor banking controls for embedded finance programs
SardineX: an industry consortium for real-time fraud data sharing across payment rails
Founding members of SardineX include Visa, Chesapeake Bank, Airbase, and Blockchain.com

The FRAML convergence — combining fraud and AML into one workflow — is increasingly where the industry is heading. Sardine has been building toward it since founding.

Built for: Banks, fintechs, payment processors, crypto platforms, and enterprises that need fraud prevention and AML compliance to work from the same data, in real time, rather than in separate silos.

5. Chainalysis — Compliance for the Blockchain Layer

Website: chainalysis.com

Cryptocurrency introduces a compliance problem that traditional financial tools are not designed to solve. Every transaction is public and permanent. The challenge is not access to data. It is making sense of it at scale across hundreds of blockchains, millions of wallets, and transaction volumes that dwarf traditional payment systems.

Chainalysis was founded in 2014 by Michael Gronager, Jan Møller, and Jonathan Levin, and was the first company dedicated specifically to Bitcoin tracing. The core insight was that blockchain is not anonymous. It is pseudonymous. The public ledger contains a permanent record of every transaction. With the right analysis, those records reveal patterns, connections, and ultimately identities.

The platform today covers cryptocurrency compliance and investigation for over 1,500 global institutions including the FBI, DEA, IRS, and international law enforcement counterparts, alongside exchanges like Coinbase and Binance, banks integrating crypto services, and crypto-native businesses. Chainalysis data has been ruled admissible in court and has been used in some of the most significant financial crime cases involving digital assets, including the 2020 seizure of more than $1 billion in bitcoin linked to the Silk Road dark web marketplace and the attribution of seven 2021 cryptocurrency thefts to North Korea's Lazarus Group.

The company's valuation reached over $8 billion in 2025. In 2026, the company launched blockchain intelligence agents, putting the full depth of its data and investigation capabilities into the hands of compliance analysts without requiring specialist blockchain expertise.

Key features:

Know Your Transaction (KYT): real-time screening of crypto transactions against high-risk addresses and known illicit activity
Reactor: transaction visualization and fund tracing across multiple blockchains and bridges
Kryptos: risk profiling for crypto exchanges and counterparty due diligence
Hexagate: real-time hack prevention, which helped protect over $50 billion in funds
Alterya: fraud prevention processing over $23 billion in monthly transactions
Automatic token support covering 260,000+ XRPL tokens and all major token standards
Blockchain intelligence agents for automated investigation and compliance workflows
Court-admissible data with chain-of-custody standards built into the platform
2026 Crypto Crime Report: illicit addresses received at least $154 billion in 2025, a 162% year-over-year increase

The FATF Travel Rule, which requires virtual asset service providers to share originator and beneficiary information on transfers above a threshold, has made Chainalysis's compliance infrastructure increasingly central to any crypto business operating in regulated jurisdictions. MiCA, the EU's crypto regulation framework entering full effect in 2026, adds further obligations that Chainalysis is positioned to support.

Built for: Cryptocurrency exchanges, custodians, banks entering digital assets, DeFi protocols, and government agencies that need court-grade blockchain intelligence for compliance monitoring and financial crime investigation.

The Pattern Across All Five

These startups operate at different layers of the compliance stack. Neno works at the back office and data integrity layer. Salv addresses collaborative AML intelligence between institutions. Hummingbird handles AML investigation operations and reporting. Sardine unifies fraud and compliance into a single real-time signal. Chainalysis brings compliance infrastructure to the blockchain layer.

What they share is an architectural decision made early: compliance is not a layer added on top of a product. It is a property of the data model, the transaction record, and the decision workflow from the first line of code.

The BCBS 239 principles, the Basel Committee's framework for risk data aggregation, define a standard that most large financial institutions have struggled to meet for over a decade. These startups are, in different ways, building toward what BCBS 239 describes as the goal: data that is accurate, complete, timely, and traceable by design rather than by effort.

That shift does not make compliance cheap or easy. But it changes the cost structure substantially. Compliance work that requires manual reconstruction is expensive, error-prone, and difficult to scale. Compliance that is built into the data architecture runs continuously, costs less per transaction at scale, and produces output that regulators can actually use.

The startups that figure this out early have a structural advantage that compounds over time. The ones that do not are building toward a very expensive reckoning.

Migrating Financial Data to the Cloud Without Losing Lineage or Regulators' Trust

Mark Thorn — Tue, 05 May 2026 13:31:18 +0000

When a financial services team decides to move data to the cloud, the conversation usually starts with infrastructure. Which cloud provider. What the cost model looks like. Whether to go lift-and-shift or re-architect from the ground up.

Those are real decisions. But they are not the hard part.

The hard part is walking into a room with your compliance team six months into the migration and being able to answer two questions: Where did this data come from? And how do we prove it?

If you cannot answer both of those confidently, your migration is not done. It might not even be safe.

This post is about what it actually takes to migrate financial data to the cloud while keeping data lineage intact and regulators on your side. Not the theory. The decisions.

Why Financial Data Migration Is a Different Problem

Most cloud migration guides treat data as a technical artifact. Move it, validate it, retire the source. Done. Financial data does not work that way.

In a regulated environment, data carries obligation. Transaction records, loan histories, risk model inputs, audit logs — every one of these has a chain of custody that regulators expect you to maintain and explain. GDPR, SOX, BCBS 239, PCI-DSS: the specific framework depends on your institution, but the underlying requirement is consistent. You must be able to demonstrate that your data is accurate, complete, and traceable from origin to output.

That requirement does not pause while you migrate.

This is the core challenge. A standard migration moves data from point A to point B. A compliant financial data migration moves data from point A to point B while maintaining a documented, auditable record of exactly how it was transformed along the way.

The two things are not the same, and the gap between them is where most migrations get into trouble.

The Lineage Problem Nobody Talks About Before They Start

Data lineage in a modern financial institution is rarely clean. Over decades of mergers, platform changes, and regulatory responses, data flows get layered on top of each other. A customer record might move through a mainframe core system, a middleware ETL job, a risk calculation engine, and a reporting database before it ever surfaces in a dashboard.

Each one of those transitions is a potential lineage gap.

When you migrate to the cloud, you are not just moving data. You are also moving or replacing the pipelines, jobs, and processes that shape that data. If you do not map those dependencies before you start, you will make changes that seem reasonable in isolation but break the lineage chain in ways that are invisible until an auditor asks a question you cannot answer.

This is one of the most underestimated risks in financial cloud migration. It is not a data quality problem. It is an architecture visibility problem.

Before a single record moves, you need to know how data flows through your existing systems at the execution level, not just the schema level. That means understanding which batch jobs transform which fields, which downstream systems consume which outputs, and where business logic is embedded in places that your architecture diagrams do not show.

IN-COM's breakdown of top data modernization tools and strategies makes a useful distinction here: understanding data dependencies and execution paths is a separate capability from data migration itself, and skipping it is one of the main reasons modernization programs introduce inconsistencies they cannot trace later.

Step One: Map Before You Move

The instinct on most migration projects is to start moving things. There is pressure to show progress, hit milestones, and demonstrate value. Moving data feels like progress.

Mapping your data flows first is the opposite of that impulse. It feels slow. It produces documentation rather than deployments. But it is the step that determines whether your migration survives contact with a regulatory examination.

What does a proper pre-migration data map look like in a financial context?

It needs to capture not just where data lives, but how it moves. Which systems write to which databases. What transformations happen at each step. Where derived fields are calculated and from what source values. Which data elements are used as inputs to risk models or regulatory reports.

It also needs to capture the timing and sequencing of data flows. Batch windows, dependency chains, the order in which jobs run and what happens when one fails. This matters because cloud environments often change the execution model, and if your lineage documentation assumes a specific processing order, you need to know before you redesign the pipeline.
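
To make the sequencing point concrete, here is a minimal sketch of recording job dependencies as data and deriving a valid run order from them. The job names and the `job_dependencies` mapping are hypothetical, and a real inventory would be extracted from your scheduler rather than typed by hand.

```python
from graphlib import TopologicalSorter

# Hypothetical batch jobs: each job maps to the set of jobs that must finish first.
job_dependencies = {
    "load_core_extract": set(),
    "etl_normalize": {"load_core_extract"},
    "risk_calc": {"etl_normalize"},
    "regulatory_report": {"risk_calc", "etl_normalize"},
}

def execution_order(dependencies):
    """Return one valid run order; graphlib raises CycleError on a circular chain."""
    return list(TopologicalSorter(dependencies).static_order())

def downstream_of(job, dependencies):
    """Return every job that directly or indirectly depends on `job`."""
    affected = set()
    frontier = [job]
    while frontier:
        current = frontier.pop()
        for name, upstream in dependencies.items():
            if current in upstream and name not in affected:
                affected.add(name)
                frontier.append(name)
    return affected

order = execution_order(job_dependencies)
impacted = downstream_of("etl_normalize", job_dependencies)
print(order)
print(sorted(impacted))
```

Once the dependencies exist as data, "what breaks if we redesign this job" becomes a lookup instead of a meeting.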

This work is not glamorous. But institutions that skip it discover the gap when a regulator requests a data lineage report and the answer involves significant manual reconstruction.

Designing for Lineage From Day One

Once you have mapped your existing data flows, you have a choice about how to carry lineage forward into your cloud architecture.

The wrong approach is to treat lineage as something you will retrofit. Cloud-native data platforms make this tempting because they handle a lot of the infrastructure complexity automatically. It is easy to build pipelines that work without thinking explicitly about how you will explain what happened to each record.

The right approach is to treat lineage as a first-class requirement in your cloud data architecture, with the same priority as performance and availability.

In practice this means a few specific things.

Capture metadata at every transformation step. Every time data moves or changes in your pipeline, the system should record what happened, when, and from what source. This is not the same as logging. It is structured provenance data that describes the lineage of each record.

Use immutable audit tables. Financial data should be appended to, not overwritten. When a value changes, the new value is written alongside the old one with a timestamp and a source. This gives you a complete history of how data has evolved over time, which is exactly what a regulator wants to see.

Separate raw from processed data. In a cloud environment, this typically means maintaining a raw landing zone where data arrives in its original form before any transformation, with a clear boundary between that layer and the processed layers downstream. The raw zone is your ground truth. It is what you point to when someone questions whether a transformation was applied correctly.

Choose tools that expose lineage natively. Many modern cloud data platforms support lineage tracking as a built-in feature. Apache Atlas, for example, integrates with the Hadoop ecosystem to track data lineage across pipelines. AWS Glue Data Catalog stores table schemas and their version history, which covers schema changes but is not record-level lineage on its own. When evaluating platforms for a financial migration, put lineage support on the evaluation criteria list from the start rather than treating it as an afterthought.
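
The first two practices above, structured provenance and append-only history, can be sketched together in a few lines. This is an illustration using an in-memory SQLite database, not a recommendation for a production store. The `balance_audit` table, its columns, and the helper names are all hypothetical.

```python
import sqlite3
from datetime import datetime, timezone

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE balance_audit (
        audit_id    INTEGER PRIMARY KEY AUTOINCREMENT,
        account_id  TEXT NOT NULL,
        balance     TEXT NOT NULL,   -- kept as text to avoid float rounding
        source      TEXT NOT NULL,   -- upstream system that supplied the value
        transform   TEXT NOT NULL,   -- pipeline step that produced the value
        recorded_at TEXT NOT NULL    -- UTC timestamp, ISO 8601
    )
""")
# Triggers reject UPDATE and DELETE, so history can only grow.
conn.execute("""
    CREATE TRIGGER balance_audit_no_update BEFORE UPDATE ON balance_audit
    BEGIN SELECT RAISE(ABORT, 'balance_audit is append-only'); END
""")
conn.execute("""
    CREATE TRIGGER balance_audit_no_delete BEFORE DELETE ON balance_audit
    BEGIN SELECT RAISE(ABORT, 'balance_audit is append-only'); END
""")

def record_value(account_id, balance, source, transform):
    """Append a new value with its provenance; earlier rows are never touched."""
    conn.execute(
        "INSERT INTO balance_audit (account_id, balance, source, transform, recorded_at) "
        "VALUES (?, ?, ?, ?, ?)",
        (account_id, balance, source, transform, datetime.now(timezone.utc).isoformat()),
    )

def history(account_id):
    """Return every value the account has held, oldest first, with provenance."""
    rows = conn.execute(
        "SELECT balance, source, transform FROM balance_audit "
        "WHERE account_id = ? ORDER BY audit_id",
        (account_id,),
    )
    return rows.fetchall()

record_value("ACC-1", "100.00", "mainframe_core", "initial_load")
record_value("ACC-1", "95.50", "etl_normalize", "fee_adjustment")
print(history("ACC-1"))
```

Enforcing immutability in the store itself, rather than by team convention, is the design choice that matters here. An examiner can see that the history could not have been edited after the fact.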

What Regulators Actually Look For

Regulatory expectations around cloud data migration vary by jurisdiction and framework, but there are consistent themes worth understanding before you design your architecture.

Regulators want to know that you understand where your data is. This sounds obvious, but in complex cloud environments with multiple regions, replication policies, and third-party services, data residency becomes a real governance challenge. Financial institutions operating under GDPR face restrictions on transferring personal data outside the European Economic Area, which constrains where customer data can be stored and processed. You need to be able to answer those questions at the field level, not just at the system level.

They want to know that access is controlled and audited. Cloud environments introduce new identity and access management complexity. Every service account, API key, and IAM role that can touch sensitive financial data is a potential audit finding if it is not properly scoped and logged. Your cloud migration should include an access control model that is at least as strict as what you had on-premises, and probably stricter.

They want to know that your data is accurate and consistent. This is where lineage connects directly to compliance. If an examiner asks how a specific value in a regulatory report was derived, the answer should trace cleanly back through your pipeline to a source record. If it does not, or if it requires manual explanation to reconstruct, that is a finding.

They want to know what your controls are. Migrating to the cloud does not remove the obligation to maintain robust data governance controls. In some cases it adds new ones. Your migration plan should include an explicit mapping of existing controls to their cloud equivalents, with gaps identified and addressed before go-live.

One of the most common compliance failures in cloud migrations is not a technical failure. It is a documentation failure. The systems work correctly, but the organization cannot demonstrate it. Build the documentation into the migration process, not as a post-project cleanup task.

The Cutover Problem

Even with excellent lineage design and regulatory preparation, the cutover moment carries specific risk in financial data migrations.

The period during which data exists in both the legacy and cloud systems simultaneously is when lineage is most fragile. Transactions may be processed in one environment while reference data is still being synchronized from another. Reports may draw from both systems without making that dependency explicit. The source of truth is ambiguous.

This is not a hypothetical. It is one of the most common sources of audit findings in financial cloud migrations, and it is a problem that needs to be solved architecturally before cutover happens, not after.

A few patterns that reduce cutover risk:

Run parallel environments with explicit reconciliation. During the transition period, run your cloud systems in parallel with your legacy systems and implement automated reconciliation that compares outputs at the record level. Any discrepancy should halt the migration, not be flagged for later review.

Define a clear point of record. Before cutover, document explicitly which system is authoritative for which data at each point in time. This documentation becomes part of your audit trail.

Migrate by domain, not by system. Rather than trying to cut over entire systems at once, migrate by data domain, bringing each domain fully into the cloud with complete lineage before moving to the next. This reduces the complexity of the transition period and makes reconciliation tractable.

Treat the cutover log as a compliance artifact. Every decision made during cutover, including any data corrections or exceptions, should be logged with timestamps, rationale, and the identity of who made the decision. This log is not internal project management documentation. It is part of the regulatory record of the migration.
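
The first pattern above, record-level reconciliation that halts on any discrepancy, might look roughly like this. It is a sketch under simple assumptions: both systems can export records as dictionaries sharing an `id` key, and the function and exception names are hypothetical.

```python
class ReconciliationError(Exception):
    """Raised when legacy and cloud outputs disagree."""

def reconcile(legacy_records, cloud_records, key="id"):
    """Compare two record sets by key and return a list of discrepancies."""
    legacy = {r[key]: r for r in legacy_records}
    cloud = {r[key]: r for r in cloud_records}
    discrepancies = []
    for record_id in sorted(legacy.keys() | cloud.keys()):
        old, new = legacy.get(record_id), cloud.get(record_id)
        if old is None:
            discrepancies.append((record_id, "missing in legacy"))
        elif new is None:
            discrepancies.append((record_id, "missing in cloud"))
        elif old != new:
            changed = sorted(f for f in old.keys() | new.keys() if old.get(f) != new.get(f))
            discrepancies.append((record_id, "field mismatch: " + ", ".join(changed)))
    return discrepancies

def require_clean(legacy_records, cloud_records):
    """Halt (raise) on any discrepancy instead of logging it for later review."""
    problems = reconcile(legacy_records, cloud_records)
    if problems:
        raise ReconciliationError(f"{len(problems)} discrepancies: {problems}")

legacy_rows = [{"id": 1, "balance": "100.00"}, {"id": 2, "balance": "50.00"}]
cloud_rows = [
    {"id": 1, "balance": "100.00"},
    {"id": 2, "balance": "50.01"},
    {"id": 3, "balance": "7.00"},
]
found = reconcile(legacy_rows, cloud_rows)
print(found)
```

Calling `require_clean` as a gate in the migration job makes the "halt, do not flag" rule something the pipeline enforces rather than something the team has to remember.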

What a Compliant Cloud Migration Actually Looks Like

Pulling this together, a financial data cloud migration that preserves lineage and satisfies regulators looks like this:

It starts with a complete inventory of existing data flows, including the execution-level dependencies that do not appear in standard architecture documentation. This work typically takes longer than anyone expects and surfaces problems that were invisible in the original scoping.

It moves into cloud architecture design with lineage as a first-class requirement. The design specifies how provenance data will be captured at every transformation step, how raw data will be preserved, and how the access control model maps to regulatory requirements.

It includes a regulatory review at the design stage, before any data moves. Engaging your compliance team as a design partner rather than a gatekeeper at the end of the project is one of the highest-leverage changes a migration team can make.

It runs parallel environments during transition with automated reconciliation and a documented point of record for every data domain.

And it produces, as a deliverable of the migration itself, a lineage architecture document that regulators can examine. Not a summary. A complete, auditable description of how data flows from source to output in the cloud environment.

The Uncomfortable Truth About Timelines

Cloud migrations in financial services take longer than the initial estimates almost every time. The lineage and compliance requirements are usually the reason.

This is not a failure of planning. It is a reflection of the genuine complexity of the problem. Financial data has been accumulating for decades in systems that were never designed with cloud migration in mind. Mapping those flows accurately and designing an architecture that preserves their regulatory integrity is hard work.

The teams that handle this best are the ones that acknowledge this complexity early and build it into their planning rather than treating it as a risk to be managed later. A realistic timeline for a compliant financial data migration includes the pre-migration mapping phase, the compliance review cycle, the parallel run period, and a reconciliation buffer before cutover.

Moving fast is not the goal. Moving without breaking lineage or compliance is.

Wrapping Up

Financial data cloud migration is not primarily a technical problem. The tools exist. Cloud platforms are mature. The patterns for building scalable, reliable data pipelines in the cloud are well understood.

The problem is the regulatory obligation that financial data carries, and the requirement to prove that you have honored that obligation through every step of the migration.

That requires lineage design before you write a single pipeline, compliance engagement before you move a single record, and documentation that treats the audit trail as a deliverable rather than an afterthought.

Get that right, and the technical migration becomes straightforward. Get it wrong, and you will be reconstructing data provenance manually for an examiner who is not interested in your technical architecture.

Start with the map. Build lineage from day one. And do not cut over until reconciliation is clean.

If you have been through a financial data migration and hit the lineage wall, I'd like to hear what you ran into. Drop it in the comments.

RAG vs Fine-Tuning: Which One Should You Actually Use?

Mark Thorn — Wed, 29 Apr 2026 08:56:52 +0000

When you start building something real with LLMs, it takes about five minutes before someone asks the question. Do we RAG this, or do we fine-tune? I have been in that room. And I have watched teams burn weeks choosing the wrong answer, not because they were careless, but because most articles explain what each approach is without telling you when to reach for which one.

This post skips the textbook definitions and goes straight to the decision. By the end, you will have a clear mental model, a practical framework, and enough context to make the call confidently on your next project.

What Is RAG, Really?

RAG, which stands for Retrieval-Augmented Generation, is an architecture that connects a language model to an external knowledge source at query time. Instead of relying on what the model memorized during training, the system retrieves relevant documents from a database, injects them into the prompt as context, and then lets the model generate its answer from that richer input.

Think of it as giving the model an open-book exam. The model's base intelligence stays the same, but it now has access to the right reference material when it needs it.

A typical RAG pipeline works like this:

Your documents get chunked, embedded into vectors, and stored in a vector database (Pinecone, Weaviate, Chroma, or FAISS are common choices)
A user sends a query
The query is embedded and used to retrieve the most relevant document chunks via semantic search
Those chunks are injected into the prompt as context
The LLM generates a response grounded in that retrieved content
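
The steps above can be sketched end to end in plain Python. To keep it self-contained, this toy version swaps the embedding model for simple word counts and the vector database for an in-memory list, so it shows the shape of the pipeline rather than realistic retrieval quality. Every function name here is hypothetical, and the final LLM call is left out.

```python
import math
import re
from collections import Counter

def embed(text):
    """Toy embedding: lowercase word counts. A real system would call an embedding model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def chunk(document, max_words=12):
    """Split a document into chunks of at most `max_words` words."""
    words = document.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]

def build_index(documents):
    """Chunk and embed every document; the list stands in for a vector database."""
    return [(c, embed(c)) for doc in documents for c in chunk(doc)]

def retrieve(query, index, k=2):
    """Return the k chunks most similar to the query."""
    q = embed(query)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]

def build_prompt(query, index, k=2):
    """Inject the retrieved chunks into the prompt as context."""
    context = "\n".join(f"- {c}" for c in retrieve(query, index, k))
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"

docs = [
    "Refunds are processed within five business days of the return being received.",
    "Premium support is available on weekdays from nine to five.",
]
index = build_index(docs)
top = retrieve("How long do refunds take?", index, k=1)
prompt = build_prompt("How long do refunds take?", index, k=1)
print(prompt)
```

The string returned by `build_prompt` is what you would send to the model. Because you hold the retrieved chunks at that point, you can log them alongside the answer.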

What RAG is good at:

Answering questions from frequently updated documents
Citing sources, because you know exactly which chunks informed the response
Keeping sensitive data out of model weights and in a controlled external store
Getting to production fast, often in days or weeks

What RAG struggles with:

Latency, because every query adds retrieval steps
Cost at high query volume, since you are passing hundreds of extra tokens with every request
Tasks that require the model to deeply internalize a specific format, tone, or structured behavior

What Is Fine-Tuning, Really?

Fine-tuning means taking a pretrained model and continuing to train it on your own dataset. The model's weights actually change. You are not just giving it information at query time. You are permanently teaching it something new.

If RAG is an open-book exam, fine-tuning is a specialized education. After training, the model does not need to look anything up. The knowledge, behavior, or style is baked in.

Fine-tuning a model requires:

A labeled training dataset, typically hundreds to thousands of high-quality examples in a structured format (commonly JSONL prompt-completion pairs)
A training run on GPU hardware, which can range from hours to days depending on model size
Evaluation to confirm the fine-tuned model actually performs better on your task
Deployment and ongoing maintenance when your data changes
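
For a sense of what the dataset looks like on disk, here is a small sketch that writes prompt-completion pairs as JSONL and checks them before a training run. The example records and the `prompt` and `completion` field names are assumptions for illustration; check the schema your training provider or framework actually expects.

```python
import json

# Hypothetical examples; the "prompt"/"completion" field names are an assumption.
examples = [
    {"prompt": "Classify the ticket: 'I was charged twice.'", "completion": "billing"},
    {"prompt": "Classify the ticket: 'The app crashes on login.'", "completion": "bug"},
]

def to_jsonl(records):
    """Serialize records as JSONL: one JSON object per line."""
    return "\n".join(json.dumps(r, ensure_ascii=False) for r in records)

def validate_jsonl(text, required=("prompt", "completion")):
    """Parse JSONL and return (records, errors); errors are (line_number, message) pairs."""
    records, errors = [], []
    for number, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # tolerate blank lines
        try:
            record = json.loads(line)
        except json.JSONDecodeError as exc:
            errors.append((number, f"invalid JSON: {exc.msg}"))
            continue
        if not isinstance(record, dict):
            errors.append((number, "not a JSON object"))
            continue
        missing = [f for f in required if not str(record.get(f) or "").strip()]
        if missing:
            errors.append((number, "missing or empty: " + ", ".join(missing)))
        else:
            records.append(record)
    return records, errors

dataset_text = to_jsonl(examples)
valid_records, problems = validate_jsonl(dataset_text)
print(dataset_text)
```

Validating the file before you pay for GPU time is cheap insurance, since a malformed line tends to surface only after the training job has started.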

What fine-tuning is good at:

Teaching the model a specific output format it must follow reliably (like structured JSON, clinical notes, or legal citation styles)
Embedding domain terminology so the model interprets prompts accurately
Reducing inference latency and cost at very high query volumes, since a smaller fine-tuned model can match or outperform a larger general one on a narrow task while running faster
Tasks where the training data is stable and unlikely to change frequently

What fine-tuning struggles with:

Knowledge that changes. Your fine-tuned model is frozen at training time. A software release from last week, a new policy, last month's pricing: none of that is in there unless you retrain.
Auditability. A fine-tuned model cannot tell you where its knowledge came from.
Speed and cost to iterate. A RAG update is as simple as adding a document. A fine-tuning update requires a new training run.

The Core Difference in One Sentence

RAG changes what information the model sees; fine-tuning changes what the model knows how to do.

That single distinction drives almost every decision in the framework below.

A Practical Decision Framework

This is the part most guides skip. Here are the questions you actually need to answer before picking an approach.

Question 1: How often does your knowledge change?

If your information changes weekly or monthly, like product documentation, support tickets, policies, or pricing, RAG wins almost automatically. Updating a vector database is operationally trivial compared to running a new training pipeline.

If your domain knowledge is stable for months at a time, fine-tuning becomes worth evaluating.

Question 2: Do you need to cite sources?

RAG has a natural audit trail. You know exactly which documents were retrieved. For regulated industries, legal tools, healthcare apps, or anything where users need to trust and verify answers, that traceability matters enormously. Fine-tuning offers no equivalent.

Question 3: What does your output need to look like?

If you need the model to always produce a very specific output format, a consistent brand voice, structured data extraction, or domain-specific reasoning that prompt engineering alone cannot reliably produce, fine-tuning is the right tool. It internalizes behavior at the weight level in a way RAG simply cannot.

Question 4: What is your query volume?

RAG adds tokens to every prompt. At low-to-medium volume, this cost is manageable. At very high volume, those extra tokens get expensive fast. A fine-tuned smaller model handling millions of queries per day can become significantly cheaper over time, once the upfront training cost is amortized.
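
A back-of-the-envelope calculation makes the amortization argument easier to test against your own numbers. Every figure below is a made-up assumption (token counts, per-token prices, training cost), the sketch counts input tokens only, and it ignores hosting and retraining costs. Prices are in whole micro-dollars so the arithmetic stays exact.

```python
def cost_per_query(prompt_tokens, context_tokens, micro_usd_per_token):
    """Input cost of one query in micro-dollars (1 USD = 1,000,000 micro-dollars)."""
    return (prompt_tokens + context_tokens) * micro_usd_per_token

def break_even_queries(training_cost_micro_usd, rag_per_query, finetuned_per_query):
    """Queries needed before fine-tuning is cheaper overall; None if it never is."""
    saving = rag_per_query - finetuned_per_query
    if saving <= 0:
        return None
    return -(-training_cost_micro_usd // saving)  # integer ceiling division

# Hypothetical inputs, not real prices.
rag_query = cost_per_query(prompt_tokens=200, context_tokens=800, micro_usd_per_token=2)
ft_query = cost_per_query(prompt_tokens=200, context_tokens=0, micro_usd_per_token=1)
training_cost = 900 * 1_000_000  # a 900 USD training run, in micro-dollars

break_even = break_even_queries(training_cost, rag_query, ft_query)
print(rag_query, ft_query, break_even)
```

With these invented numbers the fine-tuned path pays for itself after 500,000 queries, which is less than a day at a million queries per day and several years at a few hundred per day. That spread is why query volume deserves its own question.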

Question 5: How fast do you need to ship?

RAG can be production-ready in days. Fine-tuning adds dataset curation, training compute, evaluation, and iteration cycles. If you need to move fast or you are still validating whether the product is worth building, RAG lets you start delivering value immediately.

Side-by-Side Comparison

Criteria	RAG	Fine-Tuning
Knowledge freshness	Current as of the last index update	Frozen at training time
Setup time	Days to weeks	Weeks to months
Upfront cost	Low	Medium to high
Inference cost	Higher per query	Lower per query at scale
Source attribution	Built-in	Not available
Output format control	Limited	Strong
Data privacy	Data stays external	Data baked into weights
Maintenance	Update the docs	Retrain the model
Best for	Dynamic knowledge, fast shipping	Stable tasks, consistent behavior, high volume

The Case for Combining Both

Here is something most comparison posts underplay: the most effective production systems often use both.

A common real-world pattern is to fine-tune a domain-specific model to deeply understand your industry's terminology and reasoning style, then layer RAG on top of it to provide current, specific, and updateable information at query time.

Legal AI tools are a good example. A model fine-tuned on statutory reasoning and citation style is then connected to a RAG system that retrieves the most recent case law. The fine-tuning handles the how of responding; RAG handles the what.

In practice, the decision is less often "RAG or fine-tuning" and more often "which of these do I need first, and do I need the other one later?"

My Default Recommendation

If you are starting a new project and you are not sure which to pick, start with RAG.

Here is why. RAG gets you to a working system faster. You will learn what your users actually need from the product. That feedback will tell you whether fine-tuning is worth the investment, and if so, which specific behaviors to train for.

Fine-tuning is a refinement, not a starting point. The teams that jump to fine-tuning first often discover they spent weeks training for the wrong thing.

The practical hierarchy for most projects looks like this:

Prompt engineering first. Can you get good results with a well-crafted system prompt? This costs nothing and takes hours.
RAG next. Ground the model in your actual data. This works for the vast majority of knowledge-intensive applications.
Fine-tuning selectively. Identify high-volume, stable, format-critical workflows where RAG's limitations genuinely hurt you. Fine-tune for those specific cases.

Wrapping Up

RAG and fine-tuning are not competitors. They solve different problems, and knowing which problem you actually have is the only decision that matters.

Use RAG when your knowledge changes, you need attribution, or you need to move fast. Use fine-tuning when the behavior needs to be deeply consistent, your data is stable, and you have the infrastructure to support a training pipeline. Use both when your product demands it.

What approach have you used in production? Curious whether others have hit the same wall I did when building that first RAG pipeline. Drop it in the comments.