DEV Community: Graham Morley

How AI Memory Actually Works: Context Windows and RAG

Graham Morley — Mon, 20 Apr 2026 19:13:30 +0000

This article was original posted at Morley Media

Every day, millions of people open ChatGPT or Claude and pick up where they left off. They say things like "remember that project we were working on?" or "based on what you know about me, what do you think?" It feels like a conversation with someone who knows you, but it isn't.

Large language models have no memory. None. Every single time you send a message, the model receives a fresh prompt, processes it, and returns a response. It has never seen you before and it will not remember you after. The continuity you experience is an illusion, and understanding how that illusion works will make you a dramatically better user of these tools.

TL;DR - The people who get the most from AI tools are the ones who stop treating them like a colleague with a memory and start treating them like a brilliant stateless function. Talk to them conversationally, but know that when something matters, you restate it rather than assuming it carried over.

The cold start problem

Here is what actually happens when you send a message in a chat interface like Claude or ChatGPT.

The application takes your new message, prepends the entire conversation history above it, and sends the whole bundle to the model as a single request. The model reads everything, top to bottom, as if it were encountering the entire conversation for the first time. It generates a response and then the application appends that response to the history. When you send your next message, the whole process repeats with a slightly longer bundle.

This is not a technical detail, it is the entire architecture. There is no hidden state being maintained between calls. There is no persistent thread running inside the model that tracks who you are or what you said ten minutes ago. The model is a function: text goes in, text comes out, and between calls it retains nothing.

The reason this matters is that it changes how you should think about conversations with AI. A long chat session is not a deepening relationship with an increasingly informed assistant, it is a growing document that gets re-read from scratch every time you hit send.

The context window: your conversation has a size limit

That growing document has a hard ceiling. Every model has a context window, measured in tokens (roughly three-quarters of a word per token), and once the conversation exceeds it, something has to give. You should also be aware that the longer a conversation gets, the more tokens you consume each time you hit send.

Current context windows range from 128K to over a million tokens depending on the model and provider. If you have ever worked through a complex coding session, a long research thread, or an iterative document revision, you may have noticed the model starting to "forget" things you discussed earlier. It didn't forget, per se, but rather the application quietly dropped or compressed older messages to make room for newer ones.

This is called compaction, and in most cases it happens silently (in Claude Code it does tell you it's happening). Usually the platform summarizes your earlier messages into a condensed version, but less sophisticated LLM platforms can simply drop/remove the oldest messages. Either way, you lose fidelity. The model is no longer reading your actual words from earlier in the conversation, but reading a compressed approximation of them, or not seeing them at all.

This is why long conversations tend to drift. The model is literally working with an incomplete version of the conversation you think you're having.

How LLM platforms simulate memory

Platform-level memory features (Claude's memory system, ChatGPT's memory) create the strongest version of the illusion of memory. You tell the model your name, your job, your preferences, and the next day it greets you with all of that context. This isn't memory the way we think of it. When you anthropomorphise the LLM, you have to be careful not to misunderstand the tool you're working with.

What is actually happening: the platform extracted facts from your previous conversations, stored them in a database, and injected them into the system prompt at the start of your new session. The model is reading its own notes. It is doing the equivalent of a doctor glancing at your chart before walking into the exam room. The knowledge is real, but it is external to the model, retrieved and inserted by the application layer.

This is an important distinction because it defines the limits of what these memory systems can do. They store discrete facts, not conversational nuance. They do not know the arc of reasoning you walked through last Tuesday to arrive at a specific decision. The texture is gone. Only the labels survive.

The four flavors of fake memory

Every "memory" system in the current AI landscape is a variation on the same pattern: store information outside the model, retrieve it at call time, and inject it into the prompt. The differences come down to retrieval strategy and transparency.

RAG with vector databases is the most scalable approach. Prior conversations or documents are chunked into pieces, converted into numerical representations (embeddings), and stored in a vector database. When you send a new message, the system searches that database for semantically relevant chunks and injects them into the prompt alongside your message. This is powerful, but it is also lossy. You are at the mercy of a similarity search deciding what is "relevant" to your current query. Context that a human would consider obviously important might not score highly enough to make the cut.

Structured database retrieval is more deterministic. Instead of relying on semantic similarity, you define explicit schemas for what gets stored (user preferences, project metadata, conversation summaries) and retrieve it based on rules or direct lookups. This gives you more control, but it also means you are hand-engineering what the system remembers. In practice, it works reasonably well for structured facts and less well for open-ended conversational context.

Platform memory systems are what Claude and ChatGPT offer to end users. These are essentially a managed version of the approaches above, abstracted behind a clean interface. The platform extracts facts, stores them, and handles retrieval automatically. You do not see the mechanism, which is both the appeal and the risk. You have limited visibility into what was stored, what was missed, and what might be wrong.

Markdown files are the most transparent option. Tools like Claude Code use a CLAUDE.md file that the model reads at the start of every session. There is no magic here. It is a text file. You can open it, read it, edit it, and know exactly what the model will "know" about your project. This lack of sophistication is actually a feature, because you can see and control the seams.

All four approaches converge on the same architectural truth: something outside the model is doing the remembering, and the model is reading it cold every time.

What about models that learn from you?

Cursor, the AI coding editor built by Anysphere, has introduced something genuinely novel with what they call "real-time reinforcement learning." They collect behavioral signals from their user base (whether edits were accepted, whether users sent frustrated follow-ups, whether tool calls broke) and use those signals to retrain their model on a roughly five-hour cycle. The model you use in the afternoon is a different checkpoint than the one you used in the morning.

This is interesting, and it is worth understanding precisely because of what it is not. This is not per-user, per-session learning. The model is not adapting to you during your conversation, it is adapting to the statistical aggregate of all Cursor users and deploying updated weights. Your individual session is still completely stateless, but you are benefiting from a faster version of the traditional train-and-deploy cycle.

Neither Anthropic nor OpenAI has publicly described anything this tight for their general-purpose models. Part of the reason is scale: retraining a frontier model multiple times per day on production traffic is far more tractable when the model is smaller and domain-specific. But it points toward a future where the line between "the model was trained" and "the model is training" gets blurry, even if the individual inference call remains stateless.

What this means for how you use AI

Understanding this architecture changes your behavior in concrete ways.

Start new conversations more often. A fresh conversation is not a loss, it is a clean context window with no compaction artifacts, no summarized-away nuance, and no accumulated confusion. If your thread has gone past 30 or 40 exchanges, you are almost certainly working with a degraded version of your own conversation.

Front-load context. Because every message is a cold start with the conversation history prepended, the information at the top of your chat matters disproportionately. If you are starting a complex task, write a clear, detailed opening message. Do not drip-feed context over a dozen turns and assume the model is building a mental model the way a human collaborator would. It is rereading everything from scratch each time, and if the early messages get compacted, your carefully layered context evaporates.

Do not trust long-session continuity for critical work. If you are making important decisions (architecture choices, legal analysis, financial modeling), do not rely on the model's ability to hold the full context of a session that has gone on for hours. Re-state your constraints. Re-paste your requirements. Redundancy is not waste here, it is insurance against silent context loss.

Treat memory features as a convenience, not a guarantee. Platform memory is useful for casual personalization, but it is not a reliable system of record. If something matters, do not assume the platform stored it correctly, write it down yourself.

Use explicit context files when possible. If your tool supports something like CLAUDE.md or project-level instructions, use them. They are the most reliable and transparent form of "memory" available, precisely because they are not memory at all. They are documentation, and documentation is something engineers already know how to manage.

The honest version

There is nothing wrong with the illusion of memory; it makes these tools more pleasant and more useful. Still, illusions become dangerous when you mistake them for reality and make decisions based on assumptions about what the model knows, what it is tracking, and what it will retain.

The model is not your colleague. It is not building an understanding of your project over time. It is a stateless function that reads a document and generates a continuation. Everything that makes it feel like more than that is happening in the application layer, outside the model, in systems that are useful but imperfect.

Once you understand that, you stop being frustrated when the model "forgets" something. You stop having long, winding conversations when a concise, well-structured prompt would serve you better. You start treating context as a finite resource and managing it deliberately.

The best AI users are not the ones who have the longest conversations. They are the ones who understand the architecture well enough to work with it instead of against it.

One final note

If you aren't familiar with metaprompts, they may help you use AI better. A metaprompt is a structured prompt generated by one AI context window for use in another. For example, you might describe a project to Claude Chat, which has some memory context from your previous conversations, and ask it to generate a detailed prompt that you then pass into a tool like Claude Code, which knows nothing about you or your project beyond what that prompt contains. You review and edit the metaprompt before sending it, because you are the bridge between two stateless systems that have no awareness of each other. It takes a couple of minutes and it eliminates an entire category of "why doesn't the AI understand what I'm working on" frustration. We have a full guide on writing effective metaprompts, and if you've read this far, it's the natural next step.

AI Data Residency: When Cloud APIs Don't Meet Your Compliance Requirements

Graham Morley — Thu, 16 Apr 2026 06:10:00 +0000

This guide covers the distinction between data residency and data sovereignty, the three real infrastructure options for AI compliance, and the operational reality of running self-hosted inference. It is written for engineering leaders and compliance teams evaluating whether cloud AI APIs meet their regulatory requirements, and what the alternatives look like if they do not.

For regulatory detail specific to your jurisdiction, see our regional guides:

Data Residency vs. Data Sovereignty

These terms get used interchangeably. The distinction matters for compliance. The next examples are Canada specific, but the article is for everyone, and not specific to one region.

Data residency means data is stored in a specific geographic location. Your cloud provider has a Canadian region, your database is in ca-central-1, your data physically sits on a server in Montreal.

Data sovereignty means data is subject to the laws of the country where it is stored, and only those laws. This is the harder requirement. A US-headquartered cloud provider operating a Canadian datacenter satisfies data residency. It does not necessarily satisfy data sovereignty, because the provider's parent company may be subject to foreign legal process that can compel disclosure regardless of where the data is physically stored.

Every major regulatory framework that touches AI data handling, including HIPAA, GDPR, GLBA, PIPEDA, and the EU AI Act, imposes requirements that depend on understanding this distinction. The specific requirements vary by jurisdiction (covered in our regional guides linked above), but the structural problem is the same everywhere: storing data in a local datacenter operated by a foreign-headquartered company does not insulate it from that company's home jurisdiction.

The CLOUD Act Problem

The US CLOUD Act (Clarifying Lawful Overseas Use of Data Act) is the specific legal mechanism that makes "data residency" insufficient for many compliance requirements outside the United States.

The CLOUD Act permits US authorities to compel production of data within the "possession, custody or control" of a covered entity, regardless of where that data is physically stored. A US-headquartered company operating a datacenter in Frankfurt, Toronto, or Sydney is still subject to CLOUD Act demands on that data.

This is not theoretical. On June 10, 2025, Microsoft France's Director of Public and Legal Affairs, Anton Carniaux, testified under oath before a French Senate inquiry commission investigating digital sovereignty in public procurement. When asked whether he could guarantee that data belonging to French citizens, hosted under government procurement agreements, would not be transmitted to US authorities without French authorization, his response was "No, I cannot guarantee it." This was a senior legal official under oath in a formal parliamentary proceeding.

Canada's Treasury Board has stated the same conclusion: "as long as a CSP that operates in Canada is subject to the laws of a foreign country, Canada will not have full sovereignty over its data." The Balsillie Papers research published in March 2026 went further, noting that Canadian government data can be compelled by US authorities without Canadian judicial review or governmental notification.

Every major cloud AI service (Azure OpenAI, Amazon Bedrock, Google Vertex AI) and every major AI API provider (OpenAI, Anthropic, Google) is operated by a US-headquartered parent company subject to CLOUD Act jurisdiction. Regional deployments from these providers satisfy data residency. None of them resolve the CLOUD Act jurisdiction question for organizations outside the US.

For US-based organizations, the CLOUD Act is not a problem in the same way: it is US law applied through US legal process to US companies. It becomes relevant when you serve international customers whose regulators care about foreign government access to their data. Our US regulatory guide covers this angle.

Customer-managed encryption keys (CMEK) are often positioned as a mitigation. In theory, if you hold the encryption keys, the provider cannot decrypt your data even under compulsion. In practice, CMEK does not fully protect against CLOUD Act orders in most implementations. The provider still has access to metadata, account information, file names, sharing structures, and activity logs. A CLOUD Act order can compel production of all of this. CMEK is a meaningful layer of defence, but it is not a complete solution to the jurisdictional exposure.

Three Infrastructure Options

Option A: Cloud Provider Residency with Contractual Controls

Select a regional deployment from a hyperscaler, execute the appropriate Data Processing Agreement and Business Associate Agreement (where applicable), implement CMEK where available, and document your risk acceptance.

When this is sufficient: Your regulator accepts contractual controls and documented risk assessments. Your threat model does not include foreign government compulsion as a primary concern. Your data classification does not require the strictest sovereignty controls. You need to move fast and your compliance team has signed off on the residual risk.

What it does not solve: The jurisdictional exposure described above. For many commercial workloads, this risk is accepted with appropriate documentation. For government data, healthcare data in certain jurisdictions, or financial data subject to stricter regulatory interpretation, it may not be.

Cost: Lowest. Standard cloud compute and API pricing with no capital expenditure on hardware.

Option B: Dedicated Single-Tenant Cloud Deployments

Azure OpenAI offers provisioned throughput deployments. AWS Bedrock offers dedicated throughput. These run on reserved capacity, not shared multi-tenant inference endpoints. Depending on configuration, data does not leave the specified region.

When this makes sense: You need guaranteed throughput and latency SLAs. Your compliance team is comfortable with the cloud provider's jurisdiction but wants network isolation from other tenants. You want to use frontier proprietary models that are not available as downloadable weights.

What it does not solve: The underlying jurisdiction question is the same as Option A. You get network isolation and dedicated compute, not jurisdictional independence.

Cost: Significantly higher than on-demand API pricing. Provisioned and dedicated throughput require committed spend, typically thousands to tens of thousands per month depending on model and throughput requirements.

Option C: Self-Hosted Infrastructure

You run the hardware. You run the models. You control the network. The inference endpoint is on infrastructure that is not subject to foreign jurisdiction because the entity that owns and operates it is domestic.

When this is the right answer: Your regulator requires that no foreign-jurisdiction entity can be compelled to disclose your data. Your threat model explicitly includes foreign government compulsion. You are processing data classified at a level that precludes third-party cloud processing. You need to run custom or fine-tuned models. Your inference volume is high enough that the capital expenditure breaks even against API costs within a reasonable timeframe.

What it costs you: Capital expenditure on GPU hardware, colocation or facility costs, power and cooling, a team capable of operating bare-metal infrastructure, and the ongoing operational burden of keeping it running.

The Self-Hosted Decision Matrix

Before committing to self-hosted infrastructure, work through these questions.

1. What is your regulator actually requiring?

There is a meaningful difference between "data must be stored in our country" (residency), "data must not be accessible to foreign governments" (sovereignty), and "data must be processed on infrastructure with no foreign corporate parent" (full-stack sovereignty). Read your specific regulatory guidance. Many organizations over-index on sovereignty requirements that their regulator has not actually imposed, and under-index on requirements like audit logging and access controls that the regulator cares about deeply.

2. Who is the adversary in your threat model?

If you are protecting against commercial data breaches and unauthorized access, cloud providers with SOC 2 Type II and ISO 27001 certifications are likely more secure than anything you will operate yourself. If you are protecting against foreign government compulsion via legal process, cloud provider certifications are irrelevant because the compulsion is lawful within the provider's jurisdiction. The threat model determines the infrastructure choice.

3. What is your inference volume?

Self-hosted GPU infrastructure has high fixed costs and low marginal costs. Cloud API pricing has low fixed costs and high marginal costs. There is a crossover point. For light, intermittent usage, cloud APIs win on total cost. For sustained high-volume inference, self-hosted hardware pays for itself. The break-even depends on your specific model size, throughput requirements, and GPU utilization rate. As a rough framework: if you are spending more than $15,000 to $20,000 per month on cloud AI API costs with consistent utilization, a capital expenditure analysis on self-hosted hardware is worth running.

4. Do you have the team to operate it?

Self-hosted GPU infrastructure is not "set up a server and walk away." It requires ongoing hardware monitoring, firmware and driver updates, model serving software maintenance, network security operations, and capacity planning. If your organization does not have infrastructure operations experience, you either need to hire it, contract it, or accept that you are taking on operational risk.

Need help evaluating your AI data residency requirements? We build and
operate compliant AI infrastructure for regulated industries. We have
multi-year production experience running physical server infrastructure across
Canadian and European datacenters, and we built an AI compliance platform that
achieved SOC 2 Type 1 and ISO 27001 certifications. From initial compliance
assessment through hardware planning, deployment, and ongoing operations, we
handle the full stack. Talk to our team.

Hardware Selection Criteria

The hardware landscape changes faster than blog posts age. We cover selection criteria that remain stable regardless of which specific generation is current, rather than recommending specific models or listing prices that will be outdated in months. We select the best fit for each engagement based on the client's budget, model requirements, deployment location, and throughput needs.

Memory capacity is the primary constraint for inference. The model's parameters must fit in GPU memory (VRAM). A model that does not fit on a single GPU requires tensor parallelism across multiple GPUs, which introduces inter-GPU communication overhead and operational complexity. Quantization (running the model at reduced numerical precision) shrinks memory requirements significantly, but it affects output quality to varying degrees depending on the model and method. The tradeoff between memory capacity, quantization level, and output quality is the first decision point.

Memory bandwidth determines inference throughput. Once the model fits in memory, the speed at which the GPU can read model weights during each forward pass determines tokens-per-second. For autoregressive language models, inference is memory-bandwidth-bound, not compute-bound. A GPU with more memory bandwidth will often outperform a GPU with more raw compute at the same price point for inference workloads specifically.

Power and cooling are non-negotiable constraints. Current-generation datacenter GPUs draw 600W to 1000W+ per unit. An 8-GPU server can draw 10kW or more. This requires appropriate power delivery (typically 208V or 240V three-phase), cooling infrastructure (liquid cooling is increasingly mandatory for the latest generation, not optional), and a facility that can support the power density. Standard office buildings and most commodity colocation cannot accommodate this without modifications.

Colocation vs. owned facility. Unless your organization already operates datacenter space, colocation is the practical choice. You ship your hardware to a facility that provides power, cooling, network connectivity, and physical security. You retain ownership and control of the servers. The colocation provider does not have logical access to your systems. Evaluate colocation providers on: power density per rack (you need more than standard 5-10kW racks), network connectivity (redundant uplinks, low-latency peering), physical security and compliance certifications, and whether they can support liquid cooling if your hardware requires it.

Redundancy planning. GPU hardware fails. Power supplies, fans, memory modules, and the GPUs themselves all have failure rates. Plan for N+1 redundancy at minimum for production workloads: enough spare capacity that losing a single GPU server does not take your inference endpoint offline.

Practical advice: Before purchasing hardware, run your target models on
rented cloud GPU instances to establish baseline performance requirements.
Measure tokens-per-second, latency percentiles, and memory utilization under
realistic load. Use those numbers to spec your purchase, rather than sizing
from spec sheets alone.

Model Selection for On-Premise Inference

The open-weight model ecosystem is mature enough that several model families are viable for production on-premise deployment. We do not recommend specific models here because the landscape shifts every few months, but the selection criteria are stable.

License terms. Some open-weight models are released under permissive licenses (Apache 2.0, MIT) that allow unrestricted commercial use. Others have custom licenses with usage restrictions, such as monthly active user caps or prohibitions on specific use cases. Read the license before committing infrastructure to a model. If you are building a product on top of it, license terms affect your business, not just your engineering.

Model size vs. hardware fit. Match the model's parameter count and precision requirements to your available GPU memory. A model that requires tensor parallelism across four GPUs to serve a single request is operationally more complex and more expensive per token than a model that fits on one or two. Mixture-of-experts (MoE) architectures are relevant here: a model with a large total parameter count but a small number of active parameters per token needs the memory of a large model but the compute of a smaller one.

Quality gap from frontier proprietary models. There is still a gap between the best open-weight models and the best proprietary models on certain tasks, particularly complex multi-step reasoning and nuanced instruction following. For many production use cases (classification, extraction, summarization, structured data generation, customer-facing chat with bounded scope), the quality gap is small enough to be irrelevant. For tasks at the frontier of model capability, it may matter. Evaluate on your actual use case, not on benchmark leaderboards.

Inference serving software. The model needs to be served through an inference engine that handles batching, quantization, and the HTTP/gRPC API layer. As of early 2026, vLLM is the production default for most deployments, using PagedAttention for efficient GPU memory management. SGLang is a strong alternative that outperforms vLLM by roughly 29% on throughput for workloads with shared context (chatbots, RAG, agents) through its RadixAttention caching. Hugging Face's Text Generation Inference (TGI) entered maintenance mode in December 2025, with Hugging Face explicitly recommending vLLM or SGLang for new deployments. llama.cpp remains the standard for running models on consumer hardware or CPU-based inference. The choice of serving software affects performance as much as the choice of GPU.

The honest assessment: If your use case requires the absolute best available model quality and you are not constrained by sovereignty requirements, proprietary cloud APIs will outperform what you can self-host. If your use case requires sovereignty, or if your inference volume makes self-hosting economically attractive, the open-weight ecosystem is good enough for most production workloads and improving rapidly.

Security Architecture for Self-Hosted AI

Running your own inference infrastructure means you own the entire security surface. This section covers the architecture decisions specific to self-hosted AI. For general server hardening, see our security checklist.

Network isolation. The inference cluster should be on an isolated network segment with no direct internet access. Client applications reach the inference API through a reverse proxy or API gateway in a DMZ. The inference servers themselves should not be able to initiate outbound connections. This limits the blast radius of a compromise and prevents data exfiltration through the inference layer.

┌─────────────────┐     ┌─────────────────┐     ┌─────────────────┐
│   Client App    │────>│  API Gateway /  │────>│   Inference     │
│                 │     │  Reverse Proxy  │     │   Cluster       │
│                 │<────│  (DMZ)          │<────│  (Isolated Net) │
└─────────────────┘     └─────────────────┘     └─────────────────┘
                              │                        │
                              ▼                        ▼
                        ┌───────────┐           ┌─────────────┐
                        │  Audit    │           │ No outbound │
                        │  Logging  │           │ access      │
                        └───────────┘           └─────────────┘

Key management. Encryption keys for data at rest (model weights if encrypted, input/output logs, cached embeddings) should be managed through a hardware security module (HSM) or a dedicated key management service. For the highest security requirements, an HSM provides tamper-evident, FIPS 140-2/3 validated key storage. For most production deployments, a software-based KMS (HashiCorp Vault or equivalent) is sufficient if configured with appropriate access controls and audit logging. The critical requirement: key material should not be stored on the inference servers themselves.

Audit logging. Every inference request should be logged with a timestamp, the requesting user or service identity, and input/output token counts. Whether to log input and output content depends on your retention policy and regulatory requirements. If you do log content, encrypt those logs at rest with keys managed separately from the inference infrastructure. Ship logs to a centralized logging system that is not on the same network segment as the inference cluster.

Air-gapped deployment patterns. For the most sensitive workloads, the inference cluster has no network connectivity to the internet at all. Model updates, software patches, and configuration changes are delivered via physical media or a one-way data diode. This is operationally expensive and only justified for classified workloads or environments with the strictest regulatory requirements.

Input/output filtering. Self-hosted models do not come with the same content filtering and safety layers that cloud APIs provide by default. If your use case involves end-user-facing interactions, you need to implement your own input validation, output filtering, and guardrails. This is additional development work that is easy to underestimate.

Operational Reality

This is the section that differentiates "we have read about self-hosted AI" from "we have operated physical infrastructure."

We have run physical servers across Canadian and European datacenters for multiple years, maintaining blockchain node infrastructure that processed real financial transactions with real money at stake. That infrastructure included ErgoPad (a token launchpad that reached over $20M in total value locked), Paideia (a DAO governance platform operating across multiple blockchain networks), and several other production systems. Across all of these deployments, over multiple years of continuous operation, we had zero security exploits. We also built and shipped Crystal aOS, an AI legal compliance platform that achieved SOC 2 Type 1 and ISO/IEC 27001:2022 certifications, with document ingestion pipelines, RAG, and data residency controls built in from the start.

That track record required operational discipline that is directly transferable to AI inference infrastructure, because the failure modes are the same: hardware fails, software needs patching, networks go down, and the systems need to keep running regardless.

Hardware monitoring is continuous, not periodic. GPU temperatures, memory utilization, fan speeds, power draw, and error rates need real-time monitoring. GPU memory errors (ECC corrections and uncorrectable errors) are early indicators of hardware failure. A GPU accumulating ECC errors will eventually fail. You want to replace it during a planned maintenance window, not during a production outage. IPMI/BMC access for out-of-band management is essential: you need to power cycle a server, access its console, and check hardware health without relying on the operating system being functional.

Driver and firmware updates are not optional. GPU drivers, firmware, BIOS updates, and inference serving software all receive regular updates that affect performance, stability, and security. These updates need to be tested in a staging environment before rolling to production, and they occasionally require reboots that take servers offline. Plan for regular maintenance windows.

Capacity planning requires forecasting. GPU procurement lead times can be weeks to months depending on availability. If your inference load is growing, you need to be ordering hardware before you need it, not when you run out of capacity.

Incident response for hardware is different from software. When a cloud VM has a problem, you click a button and get a new one. When a physical GPU server has a problem, someone needs to physically access the machine. If it is in a colocation facility, that means either driving to the datacenter or filing a remote-hands ticket and waiting. Factor this into your SLA calculations. If your colocation is in a different city, remote-hands response time and capability become critical vendor selection criteria.

Backups and disaster recovery are your problem. Model weights can be re-downloaded (assuming you have not fine-tuned them). Your fine-tuned models, your RAG indexes, your configuration, and your audit logs cannot be re-downloaded. Back them up. Test restoring from backups regularly. Have a documented procedure for rebuilding the inference stack from scratch on replacement hardware.

The operational cost that gets underestimated: It is not the hardware
purchase, it is the ongoing human cost of keeping the infrastructure running.
A production inference cluster requires monitoring, maintenance, security
patching, capacity planning, and incident response. If you are budgeting for
self-hosted AI, budget for the people, not just the servers.

When NOT to Self-Host

Self-hosting is the right answer for a specific set of requirements. It is the wrong answer more often than it is the right one.

Your compliance team has signed off on cloud provider residency with contractual controls. If your DPA, BAA, and risk assessment satisfy your regulator, the operational overhead of self-hosting is not justified. Most organizations fall into this category.

Your inference volume is low or intermittent. Self-hosted GPU hardware sits idle when you are not running inference. Cloud API pricing is per-token with no idle cost. If your usage is bursty or low-volume, you will pay more in depreciation and power for idle hardware than you would pay in API fees.

You do not have the team to operate it. Self-hosted infrastructure without operational expertise is a liability, not an asset. A misconfigured, unpatched, unmonitored inference cluster is worse for your security posture than a well-managed cloud deployment.

You need frontier model quality and it materially affects your product. The best proprietary models are available only through cloud APIs. If your use case requires the absolute best available model performance and the quality gap matters for your specific task, cloud APIs are the answer. Evaluate on your actual workload, not on published benchmarks.

You are prototyping or validating a use case. Use cloud APIs to validate that the AI feature works and that users want it. Migrate to self-hosted infrastructure after you have proven the use case and have the volume to justify the capital expenditure. Optimizing infrastructure before validating demand is a common and expensive mistake.

Summary

The decision framework:

Determine what your regulator actually requires. Read the specific guidance for your jurisdiction and sector (US, Canada, UK/EU).
Identify the adversary in your threat model. Commercial breach vs. foreign government compulsion are different problems with different solutions.
If cloud APIs satisfy your compliance requirements, use them. The operational simplicity is worth it.
If you need single-tenant isolation but can accept cloud provider jurisdiction, evaluate dedicated throughput offerings.
If you need full sovereignty, self-hosted on domestically owned and operated infrastructure is the path. Budget for the hardware, the facility, the team, and the ongoing operational cost.
Whichever path you choose, get the security fundamentals right: encryption at rest and in transit, audit logging, access controls, and incident response planning.

A 15-Point Security Checklist That Startups Often Ignore

Graham Morley — Fri, 27 Feb 2026 17:38:14 +0000

Security is the thing every startup founder knows matters but nobody wants to spend time on. You're racing to ship features, close customers, and stay alive. Security feels like friction. So you postpone it, telling yourself you'll "add security later."

The problem is that there is no "adding security later." Security isn't a feature you bolt on. It's a foundation you build on. Every month you wait makes it exponentially more expensive and complex to retrofit.

The numbers back this up. According to IBM's 2024 Cost of a Data Breach Report, the global average cost of a data breach reached $4.88 million, up 10% from the prior year. For companies with fewer than 500 employees, the average is around $2.98 million. That's enough to kill most startups outright. And attackers aren't just going after big targets. 43% of all cyberattacks in 2023 targeted small businesses, according to Verizon's Data Breach Investigations Report. Automated tools don't care how small you are.

Warning: There is no "adding security later". Security isn't a feature you bolt on. It's a foundation you build on. Every day you wait makes it exponentially more expensive and complex to implement properly.

The 15-Point Security Checklist

Here's what you need to implement before you have any real users on your platform.

Authentication & Access Control

1. Authentication: Know Your Options and Their Tradeoffs

Authentication is the most important security decision you'll make early on, and it's worth understanding the landscape before you commit. There are three broad approaches, each with real tradeoffs.

Option A: Full-Service Auth Platforms

Providers like Supabase Auth, Clerk, or Auth0 handle everything: login, MFA, session management, user dashboards, and role management. You write very little auth code. This is the fastest path to a working login system, and it's fine for prototypes and early MVPs.

The catch is that you're deeply coupled to their platform. Your session model is theirs. Your user model is theirs. If you later need to run a separate API service, support a mobile app with different session requirements, or do anything the provider didn't anticipate, you'll hit walls. Migrating off a full-service auth provider mid-growth is painful and expensive. You're also paying per-user at scale.

Option B: OAuth 2.0 via NextAuth (Auth.js)

NextAuth wraps OAuth providers (Google, GitHub, Microsoft) and manages sessions within Next.js. It's more flexible than a full-service platform, gives you direct control over your user records, and has no per-user cost.

But it's tightly coupled to Next.js. If your backend grows beyond a single Next.js app, if you add a NestJS API, a mobile app, or any service that needs to verify auth independently, NextAuth's session model doesn't follow you. You'll end up bolting on your own JWT layer anyway. This makes it a solid choice for simpler applications, but be prepared to replace it as you scale.

Option C: Roll Your Own Sessions, Delegate Login to OAuth 2.0 (Recommended)

This is the approach we recommend for any startup that plans to grow. Let Google, Microsoft, or GitHub handle the login flow and MFA. They spend billions securing credential storage and authentication. You don't need to compete with that.

What you do handle in-house is everything after login: JWT issuance and validation, session management (short-lived access tokens, longer-lived refresh tokens), CSRF token rotation, XSS prevention via HTTP-only cookies, and authorization logic.

This gives you full control over your session architecture from day one. When you add a mobile app, a separate API, or microservices, your auth layer already supports it. You're not locked into any vendor's session model, and the most dangerous part of auth (storing and verifying credentials) is handled by companies that do it better than you ever will.

// Example: After OAuth 2.0 callback, issue your own tokens
import { sign, verify } from "jsonwebtoken";

// Short-lived access token (15 min)
const accessToken = sign(
  { userId: user.id, role: user.role },
  process.env.JWT_SECRET ?? "",
  { expiresIn: "15m", issuer: "your-app", audience: "your-api" },
);

// Longer-lived refresh token (7 days), stored in DB for revocation
const refreshToken = sign(
  { userId: user.id, tokenVersion: user.tokenVersion },
  process.env.REFRESH_SECRET ?? "",
  { expiresIn: "7d" },
);

// Set as HTTP-only cookies (not accessible via JavaScript = XSS resistant)
res.cookie("access_token", accessToken, {
  httpOnly: true,
  secure: true,
  sameSite: "strict",
  maxAge: 15 * 60 * 1000,
});

Tip: Authentication is a deep topic with a lot of nuance. We're working on a dedicated guide covering OAuth 2.0 flows, session strategies, and how to choose the right approach for your stack. Stay tuned.

2. Strong Password Policies (If You Must Self-Host Auth)

If you have a specific reason to manage credentials yourself, such as compliance requirements, offline access, or cost at scale, then at minimum:

Enforce 12+ character passwords
Use a strength estimator like zxcvbn instead of simple regex rules. Users will satisfy [A-Z][a-z][0-9][!@#$] with Password1! every time
Hash with bcrypt or argon2, never SHA-256 or MD5
Implement account lockout after repeated failed attempts
Rotate service account credentials on a schedule

But seriously, just use OAuth 2.0 for login and skip this entire category of problems.

3. JWT Token Security: Bearer Tokens vs. HTTP-Only Cookies

Most tutorials show JWTs stored in localStorage and sent as Authorization: Bearer <token> headers. This works, but it's vulnerable to XSS attacks. Any malicious script running on your page can read localStorage and exfiltrate tokens.

The more secure approach for web applications is HTTP-only cookies. The browser sends them automatically, and JavaScript cannot access them, which eliminates the most common token theft vector. Combined with SameSite: strict and Secure flags, this is significantly harder to exploit.

Note: If your API lives on a different subdomain from your frontend, such as api.yoursite.com, SameSite: strict will block cookies on cross-origin requests. You'll need to use SameSite: lax and set the cookie domain to .yoursite.com so it's shared across subdomains. This is a common stumbling point when separating your API from your frontend.

// WEB: HTTP-only cookie (preferred for browser clients)
res.cookie("access_token", token, {
  httpOnly: true, // JavaScript can't read it
  secure: true, // HTTPS only
  sameSite: "strict", // No cross-site requests
  maxAge: 15 * 60 * 1000,
});

// MOBILE: Bearer token (necessary for native apps)
// Mobile apps don't have cookie jars in the same way,
// so you'll need a /token endpoint that returns JWTs directly.
// Store them in the platform's secure storage:
// iOS: Keychain
// Android: EncryptedSharedPreferences

The reality is that when you ship a mobile app, you'll need to support bearer tokens anyway. Native apps don't share browser cookie jars. The practical approach is to support both: HTTP-only cookies for your web client, and a token endpoint for mobile clients that returns JWTs directly. Your API validates both, checking cookies first, then falling back to the Authorization header.

This is one reason Option C from the auth section matters. If you've built your own session layer, adding a second token delivery mechanism is straightforward. If you're locked into NextAuth's cookie-based sessions, you're in for a rewrite.

Data Protection

4. Encryption: What SSL Handles and What It Doesn't

"End-to-end encryption" gets thrown around a lot, but for most SaaS applications, TLS/SSL already handles encryption in transit. If your app is served over HTTPS (and it should be, always), data between the user's browser and your server is encrypted. You don't need to add a separate encryption layer on top of that for transit.

Where you do need to think about encryption is data at rest, meaning what's stored in your database. Most cloud database providers (RDS, PlanetScale, Supabase) encrypt at rest by default using AES-256. If you're self-hosting, make sure disk encryption is enabled (LUKS on Linux, or your hosting provider's equivalent).

The next level is field-level encryption for particularly sensitive data: API keys, payment tokens, SSNs, or anything subject to specific compliance requirements. Encrypt these at the application layer before they hit the database, so even a database breach doesn't expose plaintext values.

import { createCipheriv, createDecipheriv, randomBytes } from "crypto";

const ALGORITHM = "aes-256-gcm";
const KEY = Buffer.from(process.env.ENCRYPTION_KEY ?? "", "hex"); // 32 bytes

export function encrypt(plaintext: string): string {
  const iv = randomBytes(16);
  const cipher = createCipheriv(ALGORITHM, KEY, iv);
  const encrypted = Buffer.concat([
    cipher.update(plaintext, "utf8"),
    cipher.final(),
  ]);
  const tag = cipher.getAuthTag();
  // Store IV + auth tag + ciphertext together
  return `${iv.toString("hex")}:${tag.toString("hex")}:${encrypted.toString("hex")}`;
}

export function decrypt(payload: string): string {
  const [ivHex, tagHex, encryptedHex] = payload.split(":");
  const decipher = createDecipheriv(
    ALGORITHM,
    KEY,
    Buffer.from(ivHex ?? "", "hex"),
  );
  decipher.setAuthTag(Buffer.from(tagHex ?? "", "hex"));
  return (
    decipher.update(Buffer.from(encryptedHex ?? "", "hex")) +
    decipher.final("utf8")
  );
}

To summarize: TLS handles transit. Your database provider likely handles at-rest encryption for the full disk. You handle field-level encryption for sensitive fields in your application code. Don't over-engineer this, but don't ignore it either.

5. Database Security: Use an ORM

The single most common vulnerability in web applications is SQL injection. An ORM like Prisma, Drizzle, or SQLAlchemy eliminates entire categories of injection attacks because queries are parameterized by default. You never concatenate user input into a raw SQL string.

// BAD: SQL injection waiting to happen
const user = await db.query(`SELECT * FROM users WHERE email = '${email}'`);

// GOOD: Prisma parameterizes automatically
const user = await prisma.user.findUnique({ where: { email } });

Beyond injection prevention, ORMs give you a schema layer that prevents accidental data leaks. Without one, it's easy to write a query that returns every column in a table, including fields you never intended to expose to the client. With Prisma's select or Drizzle's column picking, you explicitly declare what comes back.

ORMs also make database migrations predictable and version-controlled. Instead of running ad-hoc ALTER TABLE statements in production, you have a migration history that can be reviewed, rolled back, and tested in CI.

One more thing: never use root database credentials in your application. Create a limited-privilege user that can only do what your app actually needs.

CREATE USER 'app_user'@'%' IDENTIFIED BY 'strong_password';
GRANT SELECT, INSERT, UPDATE ON app_db.* TO 'app_user'@'%';
-- No DROP, DELETE, or ALTER privileges

If your app doesn't need to delete rows, don't grant DELETE. If it doesn't need to modify schema, don't grant ALTER. The principle of least privilege applies to database access just as much as it does to user permissions.

6. API Security: Rate Limiting, Validation, and Defense in Depth

API security has multiple layers, and they serve different purposes.

Cloudflare-Level Rate Limiting (Edge/Network Layer)

If you're using Cloudflare, AWS WAF, or a similar edge provider, you can set broad rate limits that apply before requests even reach your server. This is your first line of defense against DDoS attacks, brute-force login attempts, and automated scraping. Think of this as protecting your infrastructure from being overwhelmed.

Typical edge rules might be: 1000 requests per minute per IP globally, with tighter limits on specific paths like /api/auth/login.

Application-Level Rate Limiting

Edge rate limiting doesn't know your business logic. Your application needs its own rate limits, and different endpoints need different thresholds.

A search endpoint might allow 30 requests per minute. A login endpoint should allow maybe 5 attempts per 15 minutes per account. A password reset endpoint might be even stricter. An admin API for bulk operations might have a generous limit but require elevated permissions.

// Example using a simple in-memory rate limiter (use Redis in production)
import rateLimit from "express-rate-limit";

// General API: 100 requests per 15 minutes
const generalLimiter = rateLimit({
  windowMs: 15 * 60 * 1000,
  max: 100,
  standardHeaders: true,
});

// Auth endpoints: 5 attempts per 15 minutes
const authLimiter = rateLimit({
  windowMs: 15 * 60 * 1000,
  max: 5,
  message: "Too many login attempts, please try again later",
});

app.use("/api/", generalLimiter);
app.use("/api/auth/", authLimiter);

Input Validation and Sanitization

Validate every input at the API boundary. Use a schema validation library like Zod (TypeScript), Joi, or class-validator (NestJS). Reject malformed requests before they reach your business logic.

import { z } from "zod";

const CreateUserSchema = z.object({
  email: z.string().email().max(255),
  name: z.string().min(1).max(100),
  role: z.enum(["user", "admin"]).default("user"),
});

// In your route handler
const result = CreateUserSchema.safeParse(req.body);
if (!result.success) {
  return res.status(400).json({ errors: result.error.flatten() });
}

API Versioning

Version your API from day one (/api/v1/). When you need to ship security patches that change response shapes or require new fields, you can do so in a new version without breaking existing clients.

Infrastructure Security

7. Server and Database Placement

There are two schools of thought here, and the right answer depends on your stage and budget.

Same server as your app. Your database sits on the same Linux box as your application. The connection is localhost, which means no network exposure, no TLS overhead for database connections, and no extra hosting costs. The downside is that you can't scale your app and database independently. When you need a second app server, you'll need to migrate the database.

Managed cloud database (RDS, PlanetScale, Supabase). Your database runs on a separate managed service with automatic backups, replication, and independent scaling. The downside is cost (managed Postgres starts around $15-25/month and climbs fast) and a network connection you need to secure.

If you go the cloud route, force SSL on database connections:

# Force SSL in your connection string
postgresql://user:pass@db-host:5432/mydb?sslmode=require

Cloud providers solve the network exposure problem with VPCs (Virtual Private Clouds on AWS), VNets (Azure), or their GCP equivalent. Your app and database communicate over an internal network that's never exposed to the public internet. This is the enterprise-grade answer, but the costs add up: NAT gateways, data transfer fees, and load balancers can quietly run $50-100/month before you've served a single user.

Our take: If you're a startup on a budget, keep everything on a well-configured Linux box (Hetzner, OVH, or a decent VPS) until your revenue justifies the cloud bill. A $20/month dedicated server with proper firewall rules, fail2ban, and automated backups provides more practical security than most startups have on a $500/month AWS setup they don't fully understand. Scale when you need to, not when a cloud sales rep tells you to.

If you self-host, harden the box. This is non-negotiable:

Disable password-based SSH. Use key-based authentication only. This single change eliminates the most common attack vector against Linux servers.

# /etc/ssh/sshd_config
PasswordAuthentication no
PermitRootLogin no
PubkeyAuthentication yes
AllowUsers deploy            # Only allow specific users
Port 2222                    # Change default SSH port (reduces noise, not a security measure on its own)

Set up a firewall. Use ufw or iptables to allow only the ports you need. For a typical web server, that's 80 (HTTP), 443 (HTTPS), and your SSH port.

ufw default deny incoming
ufw default allow outgoing
ufw allow 443/tcp
ufw allow 80/tcp
ufw allow 2222/tcp   # Your SSH port
ufw enable

Install fail2ban. It monitors log files and bans IPs that show repeated failed login attempts. The default configuration handles SSH brute force, and you can add jails for your application's auth endpoints.

apt install fail2ban
systemctl enable fail2ban

Keep the system updated. Enable unattended security updates. This is one of the highest-value, lowest-effort security measures available.

apt install unattended-upgrades
dpkg-reconfigure -plow unattended-upgrades

Admin access for teams. If multiple people need server access, give each person their own user account with their own SSH key. Never share a single root or deploy key. Use sudo for privilege escalation, and log who does what. When someone leaves the team, disable their account.

8. Container Security

If you're deploying with Docker, the defaults are insecure. Containers run as root by default, images often include far more than they need, and it's easy to ship known vulnerabilities without realizing it.

Run as a non-root user. This limits the damage if a container is compromised.

FROM node:20-alpine

# Create a non-root user
RUN addgroup -g 1001 -S nodejs && adduser -S appuser -u 1001 -G nodejs

# Set working directory and copy files
WORKDIR /app
COPY --chown=appuser:nodejs . .

# Install dependencies and build
RUN npm ci --only=production

# Switch to non-root user before running
USER appuser

EXPOSE 3000
CMD ["node", "dist/main.js"]

Use multi-stage builds. Your final image should contain only what's needed to run the app, not your build tools, dev dependencies, or source code.

# Build stage
FROM node:20-alpine AS builder
WORKDIR /app
COPY package*.json ./
RUN npm ci
COPY . .
RUN npm run build

# Production stage
FROM node:20-alpine
RUN addgroup -g 1001 -S nodejs && adduser -S appuser -u 1001 -G nodejs
WORKDIR /app
COPY --from=builder --chown=appuser:nodejs /app/dist ./dist
COPY --from=builder --chown=appuser:nodejs /app/node_modules ./node_modules
USER appuser
CMD ["node", "dist/main.js"]

Use minimal base images. node:20-alpine is significantly smaller than node:20 and has a much smaller attack surface. Even better, use distroless images if your stack supports them.

Scan images for vulnerabilities. Tools like docker scout, Snyk, or Trivy can scan your images as part of CI and flag known CVEs in your dependencies or base image.

Don't store secrets in images. Never COPY .env into a Docker image. Use environment variables injected at runtime, or mount secrets from your orchestrator.

9. Environment Variable Security

Never commit secrets to your repository. This sounds obvious, but it's one of the most common security failures in practice. A single .env file pushed to a public repo can expose database credentials, API keys, and signing secrets.

.gitignore is your first line of defense. Make sure .env, .env.local, .env.production, and any other secret files are in your .gitignore before your first commit. Retroactively removing a committed secret doesn't help. It's still in your git history.

# .gitignore
.env
.env.*
!.env.example   # Keep a template with placeholder values

For CI/CD pipelines (GitHub Actions, GitLab CI): Use the platform's built-in secrets management. In GitHub Actions, secrets are stored encrypted and injected as environment variables at runtime. They're masked in logs automatically.

# .github/workflows/deploy.yml
jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: npm run deploy
        env:
          DATABASE_URL: ${{ secrets.DATABASE_URL }}
          JWT_SECRET: ${{ secrets.JWT_SECRET }}

For Vercel, Netlify, or similar platforms: Use their environment variable settings in the dashboard. Set different values for development, preview, and production. Never put production secrets in a .env file that gets committed.

Team workflows. When a new developer joins, they need access to development secrets. Don't email them or drop them in Slack. Use a password manager with shared vaults (1Password, Bitwarden) or a dedicated secrets manager (Doppler, HashiCorp Vault). For smaller teams, a shared 1Password vault for development environment variables works well.

Rotate secrets on a schedule and immediately when someone with access leaves the team. This includes database passwords, API keys, JWT signing secrets, and any third-party service credentials.

Monitoring & Response

10. Security Logging

Logging isn't just for debugging. Your security logs are how you detect breaches, investigate incidents, and prove compliance.

At minimum, log every authentication event (successful and failed logins, token refreshes, password resets), every permission change (role assignments, access grants), every data access pattern that touches sensitive records, and every admin action.

Structure your logs as JSON so they're searchable. Include timestamps, user IDs, IP addresses, and the action taken. Don't log sensitive data like passwords, tokens, or full credit card numbers.

// Structured security log
logger.info({
  event: "auth.login.success",
  userId: user.id,
  ip: req.ip,
  userAgent: req.headers["user-agent"] ?? "unknown",
  timestamp: new Date().toISOString(),
});

logger.warn({
  event: "auth.login.failed",
  email: req.body.email, // Log the attempted email, not the password
  ip: req.ip,
  reason: "invalid_credentials",
  attemptCount: failedAttempts,
});

Ship your logs to a centralized system. Self-hosted options like Grafana + Loki work well on a budget. Managed services like Datadog or New Relic are more convenient but cost more. The important thing is that your logs survive if the server is compromised, meaning they need to be stored somewhere the attacker can't easily delete them.

Set retention policies based on your compliance requirements. SOC 2 typically requires 1 year. GDPR has its own retention rules. At minimum, keep security logs for 90 days.

11. Intrusion Detection and Alerting

Logs are useless if nobody reads them. Set up automated alerts for patterns that indicate an attack or breach in progress.

Start with the high-signal alerts: a spike in failed login attempts from a single IP or against a single account (brute force), login from an unusual location or device for a given user, privilege escalation (a regular user suddenly accessing admin endpoints), unusual data export volumes (a user downloading 10x their normal amount), and repeated 403/401 responses (someone probing for access).

You don't need an expensive SIEM to start. A simple approach is to have your application emit structured log events for security-relevant actions, aggregate them with a tool like Grafana + Loki or even a simple database table, and write alerting rules that notify your team via Slack, PagerDuty, or email when thresholds are crossed.

As you grow, tools like CrowdStrike, Wazuh (open source), or cloud-native options like AWS GuardDuty provide more sophisticated detection. But the most important thing at the start is that someone gets notified when something unusual happens.

12. Automated Security Scanning in CI/CD

Security scanning should run on every pull request, not as a quarterly afterthought. Integrate these checks into your CI/CD pipeline so vulnerabilities are caught before they reach production.

Dependency scanning. Tools like Snyk, npm audit, or GitHub's Dependabot check your dependencies against known vulnerability databases. Run these on every PR and block merges on critical/high severity findings.

# GitHub Actions example
- name: Security audit
  run: npm audit --audit-level=high

- name: Snyk test
  uses: snyk/actions/node@master
  env:
    SNYK_TOKEN: ${{ secrets.SNYK_TOKEN }}

Static analysis. Tools like SonarQube or Semgrep scan your source code for common security anti-patterns: hardcoded secrets, SQL injection risks, insecure crypto usage, and similar issues.

Container scanning. If you're deploying Docker images, scan them for OS-level vulnerabilities with Trivy, Docker Scout, or Snyk Container.

DAST (Dynamic Application Security Testing). Tools like OWASP ZAP can run against a staging environment to test for runtime vulnerabilities like XSS, CSRF, and injection flaws. These are slower and typically run on merges to main rather than on every PR.

The goal is to shift security left, catching issues during development rather than in production.

Compliance & Legal

13. Data Retention Policies

You can't protect data you don't need. Every piece of data you store is a liability, and the simplest way to reduce your attack surface is to stop hoarding data you're not using.

Define retention periods for each category of data you store. User account data might be kept for the life of the account plus a grace period. Payment transaction records might need to be kept for 7 years for tax compliance. Session logs might only need 90 days. Analytics events might be useful for a year.

Implement automatic purging. Don't rely on someone remembering to clean up old data manually. Write scheduled jobs that delete or anonymize data past its retention window.

// Example: Purge expired sessions and old audit logs
async function purgeExpiredData() {
  // Delete sessions older than 30 days
  await prisma.session.deleteMany({
    where: { expiresAt: { lt: new Date() } },
  });

  // Anonymize audit logs older than 1 year
  await prisma.auditLog.updateMany({
    where: { createdAt: { lt: subYears(new Date(), 1) } },
    data: { userId: null, ipAddress: null },
  });
}

GDPR gives users the right to request deletion of their data. Even if you're not legally required to comply with GDPR, building this capability early is much easier than retrofitting it when a large customer or regulation demands it.

14. Incident Response Plan

You need a documented incident response plan before you have an incident. In the middle of a breach is not the time to figure out who does what.

Your plan doesn't need to be a 50-page document. For an early-stage startup, a clear one-pager covering these essentials is enough:

Detection. How do you know a breach has occurred? (Your monitoring and alerting from points 10-11.)

Triage. Who gets notified first? What's the severity classification? A leaked API key is different from a full database dump.

Containment. What are the immediate steps? Revoke compromised credentials, rotate secrets, isolate affected systems, disable compromised accounts.

Communication. Who tells customers? When? Most jurisdictions have mandatory breach notification timelines (72 hours under GDPR, varies by US state). Know yours before you need them.

Recovery. How do you restore service? Where are your backups? Have you tested restoring from them?

Post-mortem. After every incident, document what happened, why, and what you're changing to prevent it from happening again. Blameless post-mortems build a culture where people report issues early instead of hiding them.

Keep the plan somewhere accessible (not only on the server that might be compromised). Review it quarterly. Run a tabletop exercise at least once a year where you walk through a hypothetical scenario.

15. Regular Security Audits

Internal code reviews catch some issues, but you need external eyes on your security posture at least annually. Third-party penetration testing by a qualified security firm will find vulnerabilities your team has blind spots for.

For early-stage startups, a focused pentest on your authentication system and API endpoints is the highest-value engagement. A full infrastructure audit can come later as you grow.

If you're pursuing SOC 2 compliance (and you will if you sell to enterprise customers), start your audit preparation early. SOC 2 requires documented policies, access controls, logging, and incident response, essentially everything in this checklist. Companies that build these practices from the start can achieve SOC 2 readiness in weeks. Companies that retrofit them spend months.

Bug bounty programs are another option once you have the maturity to triage reports. Platforms like HackerOne or Bugcrowd give you access to a large community of security researchers. Start with a private program (invite-only) before opening it up publicly.

The Implementation Reality Check

Looking at this list, you're probably thinking: "This will take months to implement properly." You're right. And that's exactly why most startups skip it.

Here's how it typically plays out:

Week 1: You implement basic login/logout. It works, so you move on to features.

Month 3: A customer asks about MFA. Implementing it properly requires database migrations, UI changes, and user communication. It takes two weeks.

Month 6: A potential enterprise customer asks about SOC 2 compliance. You realize you need logging, access controls, and documentation. It takes two months and delays your product roadmap.

Month 12: A security researcher finds a vulnerability. You scramble to patch it, then realize you need to audit your entire codebase for similar issues.

The Smart Approach: Implement security foundations from day one, or hire a team that already has this expertise. The cost of doing it right initially is a fraction of retrofitting security later.

A Progressive Approach

Instead of trying to do everything at once, use this phased approach:

Phase 1: Foundation (Week 1-2). Secure authentication with OAuth 2.0, HTTP-only cookies, basic input validation with Zod or equivalent, HTTPS everywhere, environment variable hygiene, and SSH hardening if self-hosting.

Phase 2: Monitoring (Week 3-4). Structured security logging, rate limiting at both edge and application level, basic alerting for failed auth attempts and unusual patterns.

Phase 3: Advanced (Month 2-3). Automated security scanning in CI/CD, dependency auditing, incident response plan, data retention policies, and preparation for compliance frameworks.

The Tools We Recommend

Don't reinvent the wheel. Use established tools:

Authentication: Use OAuth 2.0 providers (Google, Microsoft, GitHub) directly. Build your own session layer on top. Use Clerk or Auth0 only if you need to ship in an afternoon and accept the tradeoffs.
ORM: Prisma, Drizzle, or SQLAlchemy. Parameterized queries by default, no SQL injection.
Validation: Zod (TypeScript), Joi, or class-validator (NestJS).
Monitoring: Datadog, New Relic, or self-hosted Grafana + Loki.
Scanning: Snyk, SonarQube, Trivy, OWASP ZAP.
Secrets: Doppler, HashiCorp Vault, or 1Password for team secret sharing.
Edge Security: Cloudflare (free tier is excellent for basic protection and rate limiting).

Start Today

Pick one item from this checklist and implement it this week. Then add one more next week. Small, consistent progress beats trying to do everything at once and getting overwhelmed.

Your future self (and your customers) will thank you.

Need help implementing these security measures? Our team specializes in building secure, compliant applications for startups and enterprises. We've helped companies pass SOC 2 audits and prevent security incidents. Get in touch with Morley Media Group