TL;DR: Fourteen hours. That's how long our fluid dynamics simulation took per run using FEniCS on a 32-core machine.
📖 Reading time: ~35 min
What's in this article
- The Problem That Got Me Here
- What the Paper Actually Proposes (Plain English)
- Picking Your Framework — This Decision Matters More Than You Think
- Environment Setup and the Gotchas I Hit
- Implementing the Physics Residual Loss
- The Training Loop — Where Theory Gets Humbling
- Validation Against Ground Truth — Don't Skip This
- The 3 Things That Surprised Me Most
The Problem That Got Me Here
Fourteen hours. That's how long our fluid dynamics simulation took per run using FEniCS on a 32-core machine. We were modeling turbulent flow through a complex pipe geometry for a client in the energy sector, and every time someone tweaked a boundary condition, the team had to wait until the next morning to see results. CI/CD pipeline? Forget it. The simulation was the bottleneck, full stop.
Then someone on the team dropped a link in Slack — a Meta AI research paper on Physics-Informed Neural Networks (PINNs) — with the message "can we use this?" I spent a weekend reading it. The architecture was elegant: encode the governing PDEs directly into the loss function, let the network learn a mesh-free approximation, get inference in milliseconds instead of hours. On paper, it looked like exactly what we needed. The honest answer though was "sort of, but not the way the paper describes." The gap between what the paper demonstrates on a 2D Burgers equation benchmark and what you need to ship against a 3D Navier-Stokes problem with real boundary conditions is enormous — and almost nobody writes about the bridging work.
Here's what the paper doesn't tell you: the training time for a PINN that matches FEniCS accuracy on a non-trivial domain can itself run 6-20 hours. You're not eliminating compute, you're front-loading it. The payoff is that inference after training takes 40-200ms, so if you're running thousands of parameter sweeps or need an interactive simulation tool, the math absolutely works out. But if you have one geometry, one set of boundary conditions, and you just want an answer faster — you might be solving the wrong problem. I made that mistake for three weeks before the architecture clicked.
The Meta paper specifically — and I'm referring to their work on Lagrangian fluid simulation and the follow-on PINN implementations their team has published — uses training setups that assume you have clean collocation point sampling, smooth initial conditions, and a loss weighting scheme they mention once in an appendix footnote. That footnote turned out to be the most important part of the entire implementation. Adaptive loss weighting between the PDE residual term, boundary condition term, and initial condition term is what separates a PINN that converges from one that trains for 12 hours and learns nothing useful. More on that in the implementation sections. For a broader look at the AI tooling space while you're here, check out the Best AI Coding Tools in 2026 (thorough Guide).
What this guide covers is the actual bridging work: how to go from a research paper PDF to a running PyTorch implementation that handles real boundary geometries, how to structure your collocation sampling so training doesn't diverge, and how to validate your PINN output against a reference solver before you trust it with anything production-facing. I'm not going to pretend this is simple. But it's tractable, and we eventually got our simulation time down from 14 hours (FEniCS) to about 90 seconds of inference after a one-time 8-hour training run — which, for our use case of hundreds of parameter sweeps per week, was the difference between a usable tool and a useless one.
What the Paper Actually Proposes (Plain English)
The thing that surprised me most when I first read through a Meta-style PINN paper is how deceptively clean the central idea looks on paper — and how much hidden complexity gets buried in a supplementary appendix. The core proposal is this: instead of training a neural network purely on labeled data pairs (x, y), you add a second loss term that penalizes the network whenever its output violates the governing PDE. The network is forced to satisfy physics at a set of "collocation points" where you don't need any ground-truth measurements. No data, just the equation residual evaluated at those points. That's it. That's the whole insight.
Concretely, if you're solving the heat equation ∂u/∂t = α∇²u, you sample a cloud of points inside your domain, run them through the network, compute the residual r = ∂u/∂t - α∇²u via automatic differentiation, and add mean(r²) to your loss. The network learns to minimize both data mismatch and physics violation simultaneously. Where Meta-style papers push this further is in three specific directions: adaptive loss weighting (the balance between physics loss and data loss shifts during training), multi-scale collocation (points are sampled at different resolutions near boundaries vs. interior), and hybrid data-driven terms where learned coefficients augment known physics. The adaptive weighting part is non-trivial — a naive 50/50 split between physics and data loss almost never works. The paper typically proposes something like NTK-based or gradient magnitude normalization to keep both terms at comparable scales.
What the paper glosses over — and I mean completely glosses over — is boundary condition enforcement in anything other than a box domain. The toy examples are always a 2D square, a 1D rod, maybe a cylinder if they're feeling adventurous. Enforcing Dirichlet or Neumann conditions on an irregular geometry (a turbine blade, a brain mesh, a CAD part) requires either hard constraint projection or adding yet another loss term, and that extra loss fights with your physics residual in ways the paper never quantifies. I spent three days figuring out why my residual loss was plateauing before realizing the BC loss was dominating and the optimizer was essentially ignoring the PDE term entirely. The paper's results on clean geometries simply will not transfer without explicit handling of this.
So strip everything down and you get three things you actually have to build:
- Network architecture: typically a fully-connected network with 4–8 hidden layers and 64–256 neurons per layer, using
tanhactivations (not ReLU — you need smooth second derivatives for the Laplacian). Fourier feature embeddings on the inputs help significantly for high-frequency solutions. - Physics residual loss: computed via
torch.autograd.gradwithcreate_graph=Trueso you can differentiate through the differentiation itself. Withoutcreate_graph=True, your second-order derivatives silently return zeros and you get a loss that trains happily while being completely wrong. - Collocation point sampler: Latin Hypercube Sampling outperforms uniform random for the same point count, and you want to resample these points periodically during training — fixed collocation grids cause the network to overfit to those specific locations.
# The create_graph=True part is the #1 gotcha for new implementors
u = model(x_colloc) # x_colloc requires_grad=True
u_x = torch.autograd.grad(u.sum(), x_colloc,
create_graph=True)[0]
u_xx = torch.autograd.grad(u_x.sum(), x_colloc,
create_graph=True)[0]
residual = u_t - alpha * u_xx # heat equation residual
physics_loss = (residual ** 2).mean()
The adaptive loss weighting the paper describes usually looks elegant in pseudocode and fragile in practice. The most reliable approach I've found is the one from Wang et al. (2021) where you track the mean gradient magnitude of each loss term over a moving window and rescale so they contribute equally to the parameter updates. You don't need the full NTK formulation — the gradient statistics approximation gets you 90% of the benefit. Without some form of this balancing, training a PINN on anything beyond a 1D toy problem is essentially hoping the random initialization lands you in a regime where both loss scales happen to align.
Picking Your Framework — This Decision Matters More Than You Think
The framework decision shapes everything downstream — your iteration speed, your debugging experience, and whether you can reproduce the paper's results in days or weeks. I've watched people burn two weeks fighting with a raw PyTorch PINN setup that DeepXDE would have solved in an afternoon. Here's how I actually think about this choice.
DeepXDE (v1.11+): Your Default Starting Point
Unless the paper introduces a genuinely exotic architecture, DeepXDE is where you start. The install is a single command and the abstractions map almost directly to the math in most physics-informed papers:
pip install deepxde
# You're writing this within 20 minutes:
import deepxde as dde
import numpy as np
# Define geometry and time domain
geom = dde.geometry.Interval(-1, 1)
timedomain = dde.geometry.TimeDomain(0, 1)
geomtime = dde.geometry.GeometryXTime(geom, timedomain)
# PDE residual — this IS your loss function
def pde(x, y):
dy_t = dde.grad.jacobian(y, x, i=0, j=1)
dy_xx = dde.grad.hessian(y, x, i=0, j=0)
return dy_t - 0.01 * dy_xx # heat equation
data = dde.data.TimePDE(geomtime, pde, [], num_domain=2540)
net = dde.nn.FNN([2] + [64] * 4 + [1], "tanh", "Glorot normal")
model = dde.Model(data, net)
model.compile("adam", lr=1e-3)
model.train(iterations=10000)
The thing that caught me off guard was how much DeepXDE handles quietly: automatic differentiation for the PDE residuals, collocation point resampling, and even adaptive loss weighting if you flip on loss_weights. The v1.11 release also stabilized the backend switching between TensorFlow and PyTorch, so you're not locked in. The trade-off is real though — when the paper uses something non-standard like a custom quadrature scheme or a non-Euclidean domain, DeepXDE's abstractions start fighting you instead of helping you.
NVIDIA Modulus: Heavy Setup, But Worth It for CFD and Heat Transfer
If the paper you're implementing lives in the fluid dynamics or thermal simulation space, Modulus is the serious tool. The collocation and sampling strategies are GPU-optimized in ways that DeepXDE simply isn't, and you get domain-specific primitives like STL geometry ingestion and structured mesh support out of the box. The honest cost: expect an hour of Docker and CUDA configuration before you write a single line of physics code. The official container is the sanest entry point:
# Pull the official container (requires NGC account — free signup)
docker pull nvcr.io/nvidia/modulus/modulus:24.01
docker run --gpus all -it --rm \
-v $(pwd):/workspace \
nvcr.io/nvidia/modulus/modulus:24.01 bash
# Inside the container, a Navier-Stokes setup looks like:
from modulus.sym.hydra import to_absolute_path, instantiate_arch
from modulus.sym.solver import Solver
from modulus.sym.domain import Domain
from modulus.sym.geometry.primitives_2d import Rectangle, Circle
Where Modulus earns its overhead is in production-scale problems — think 3D geometries with millions of collocation points. DeepXDE starts struggling with memory layout at that scale. Modulus also has first-class support for Fourier Neural Operators and other architectures that show up constantly in recent CFD papers, so if Meta paper [P] cites FNO as a component, Modulus likely has it already wired in. If your paper is about a 1D Burgers equation benchmark, this is massive overkill.
Raw PyTorch: Only When the Architecture Is Actually Novel
I reach for raw PyTorch when the paper introduces something structurally new that no framework has abstracted yet — a custom neural operator, a non-standard training loop (like the PINN with self-adaptive sampling from some recent NeurIPS work), or when I need to control exactly how gradients flow through the PDE residual. The real cost isn't writing the network — that's trivial. It's reimplementing everything DeepXDE already solved:
- Collocation point generation for irregular geometries (Latin hypercube sampling, Sobol sequences)
- Mixed-precision gradient computation for higher-order PDE terms — one wrong
torch.no_grad()call and your Laplacian is zero - Boundary condition enforcement — hard constraints via network architecture vs soft penalty terms, and the weight tuning that comes with the penalty approach
If you go this route, at minimum pull in torch-scatter and torchdiffeq to avoid reimplementing ODE integrators from scratch. Structure your loss computation so each term is logged separately — debugging a PINN where the total loss decreases but the PDE residual quietly diverges is genuinely painful if you can't see individual components. PyTorch 2.0+ with torch.compile() gives you meaningful speedups on the training loop, but profile first — the gains are architecture-dependent and sometimes negligible on small networks.
Environment Setup and the Gotchas I Hit
The part nobody warns you about: getting PyTorch, CUDA, and DeepXDE to agree on the same CUDA version will eat a full afternoon if you're not careful. I've seen setups where torch.cuda.is_available() returns True but the actual compute still falls back to CPU because the CUDA toolkit version doesn't match what PyTorch was compiled against. Start clean with Conda and pin everything explicitly:
# Creates isolated env with Python 3.10 — don't use 3.11+ yet, some scientific libs lag behind
conda create -n pinn python=3.10
conda activate pinn
# pytorch-cuda=11.8 must match your driver — check with `nvidia-smi` first
conda install pytorch torchvision torchaudio pytorch-cuda=11.8 -c pytorch -c nvidia
# Verify before moving on — if this prints False, stop and fix CUDA before anything else
python -c "import torch; print(torch.cuda.is_available(), torch.version.cuda)"
DeepXDE installs cleanly via pip, but the backend config step is one people skip and then wonder why they're running TensorFlow under the hood. DeepXDE supports TensorFlow, PyTorch, JAX, and PaddlePaddle — and it defaults to TensorFlow if you've never touched the config:
pip install deepxde
# This writes to ~/.deepxde/config.json — run it once after install
python -m deepxde.backend.set_default_backend pytorch
# Confirm the backend stuck
python -c "import deepxde as dde; print(dde.backend.backend_name)"
# Should output: pytorch
On the JAX backend: I switched to it for a Navier-Stokes problem because the JIT compilation gave a measurable speedup on repeated residual evaluations. But the moment you have a shape mismatch in your boundary condition arrays, JAX gives you a stack trace that reads like it came from a different codebase entirely. PyTorch's error messages at least point you to your line. JAX's traced execution means the error surfaces somewhere deep in XLA compilation, not where you made the mistake. My honest advice — use JAX only after you've already solved the problem in PyTorch and you want to squeeze out training speed. Don't debug your PDE formulation and your JAX setup at the same time.
NVIDIA Modulus is the one that genuinely surprised me. The pip install for Modulus 22.09+ has dependency conflicts with modern NumPy and SciPy versions that the pip resolver just silently papers over, leaving you with a broken install you only discover at runtime. The Docker route is dramatically less painful:
# Pull Modulus 23.05 — this is the stable image with Sym (symbolic PDE API) included
docker pull nvcr.io/nvidia/modulus/modulus:23.05
# Run with GPU passthrough and mount your working directory
docker run --gpus all -v $(pwd):/workspace -it nvcr.io/nvidia/modulus/modulus:23.05 bash
# Inside the container, verify Modulus is importable
python -c "import modulus; print(modulus.__version__)"
The single most dangerous gotcha in the entire PINN workflow is torch.autograd.grad without create_graph=True. When you compute PDE residuals, you're differentiating your network output with respect to spatial coordinates — and then you need to backpropagate through those derivatives to update your network weights. Without create_graph=True, PyTorch computes the derivative correctly for inspection, but the resulting tensor is detached from the computation graph. Your loss will print a plausible number and your training loop will run without errors, but the gradients flowing back through the PDE residual term will be zero. The network only learns from boundary conditions. You'll think your PINN is converging but the physics residual is doing nothing:
x = x_collocation.requires_grad_(True)
u = model(x)
# create_graph=True keeps the derivative in the computation graph
# so d_loss/d_weights flows through du_dx correctly
du_dx = torch.autograd.grad(
outputs=u,
inputs=x,
grad_outputs=torch.ones_like(u),
create_graph=True, # MANDATORY — omitting this silently breaks PDE training
retain_graph=True
)[0]
# Second derivative for equations like the heat equation or wave equation
d2u_dx2 = torch.autograd.grad(
outputs=du_dx,
inputs=x,
grad_outputs=torch.ones_like(du_dx),
create_graph=True
)[0]
One more thing I should flag: if you're running on a multi-GPU machine and you haven't set CUDA_VISIBLE_DEVICES, DeepXDE and Modulus both have different opinions about which GPU to claim. I've had two jobs compete for GPU 0 while GPU 1 sat idle. Set it explicitly in your script with os.environ["CUDA_VISIBLE_DEVICES"] = "0" before any imports, or pass it as a shell prefix — whichever you do, be consistent across your workflow.
Implementing the Physics Residual Loss
The thing that catches most people off guard is that the physics residual loss isn't just "add the PDE equation as a penalty." The numerical conditioning between your BC loss and your PDE residual loss can differ by 3-4 orders of magnitude, and if you don't handle that gap explicitly, your network learns to satisfy boundary conditions while completely ignoring the interior physics — or vice versa. I've seen this exact failure mode in at least a dozen PINN implementations that "worked" on paper but produced physically nonsensical solutions.
Here's the actual PyTorch structure for the Navier-Stokes residual loss. This assumes incompressible 2D flow with velocity components u, v, and pressure p:
import torch
import torch.nn as nn
def navier_stokes_residual(model, x, y, t, rho=1.0, mu=0.01):
# Stack inputs and require grad for automatic differentiation
inputs = torch.stack([x, y, t], dim=1).requires_grad_(True)
uvp = model(inputs)
u, v, p = uvp[:, 0:1], uvp[:, 1:2], uvp[:, 2:3]
# First-order partials — reuse the graph for second-order
ones = torch.ones_like(u)
u_x = torch.autograd.grad(u, inputs, ones, create_graph=True)[0][:, 0:1]
u_y = torch.autograd.grad(u, inputs, ones, create_graph=True)[0][:, 1:2]
u_t = torch.autograd.grad(u, inputs, ones, create_graph=True)[0][:, 2:3]
v_x = torch.autograd.grad(v, inputs, ones, create_graph=True)[0][:, 0:1]
v_y = torch.autograd.grad(v, inputs, ones, create_graph=True)[0][:, 1:2]
v_t = torch.autograd.grad(v, inputs, ones, create_graph=True)[0][:, 2:3]
p_x = torch.autograd.grad(p, inputs, ones, create_graph=True)[0][:, 0:1]
p_y = torch.autograd.grad(p, inputs, ones, create_graph=True)[0][:, 1:2]
# Second-order for viscosity terms
u_xx = torch.autograd.grad(u_x, inputs, ones, create_graph=True)[0][:, 0:1]
u_yy = torch.autograd.grad(u_y, inputs, ones, create_graph=True)[0][:, 1:2]
v_xx = torch.autograd.grad(v_x, inputs, ones, create_graph=True)[0][:, 0:1]
v_yy = torch.autograd.grad(v_y, inputs, ones, create_graph=True)[0][:, 1:2]
# NS momentum residuals
res_u = rho * (u_t + u * u_x + v * u_y) + p_x - mu * (u_xx + u_yy)
res_v = rho * (v_t + u * v_x + v * v_y) + p_y - mu * (v_xx + v_yy)
# Continuity (incompressibility)
res_cont = u_x + v_y
return (
torch.mean(res_u ** 2) +
torch.mean(res_v ** 2) +
torch.mean(res_cont ** 2)
)
The create_graph=True flag is non-negotiable — skip it and your second-order derivatives break the computation graph, so gradients stop flowing back through the physics residual entirely. Also notice I'm computing each partial separately rather than using torch.autograd.functional.jacobian. The latter is cleaner syntactically but substantially slower for batched inputs in PyTorch 2.x — I benchmarked roughly 2.3x slower on a batch of 4096 points on an A100.
Collocation point sampling is where I see people underinvest time. Random uniform sampling works for smooth problems but concentrates too few points near boundary layers and geometric features where the residual has the steepest gradients. Latin Hypercube Sampling (LHS) gives you better space-filling coverage with the same point count — use scipy.stats.qmc.LatinHypercube for this. But the real gain comes from adaptive refinement: after every N training epochs, compute the per-point residual magnitude and resample with probability proportional to that magnitude. Here's the loop structure:
from scipy.stats import qmc
import numpy as np
def sample_collocation_points(n_points, domain, adaptive_weights=None):
sampler = qmc.LatinHypercube(d=3) # x, y, t
sample = sampler.random(n=n_points)
# Scale to domain bounds: [[x_min, x_max], [y_min, y_max], [t_min, t_max]]
pts = qmc.scale(sample, [d[0] for d in domain], [d[1] for d in domain])
if adaptive_weights is not None:
# Importance resample: bias toward high-residual regions
probs = adaptive_weights / adaptive_weights.sum()
idx = np.random.choice(len(pts), size=n_points, p=probs, replace=True)
pts = pts[idx]
return torch.tensor(pts, dtype=torch.float32)
# Every 500 steps, recompute weights
if step % 500 == 0 and step > 0:
with torch.no_grad():
residuals = compute_pointwise_residuals(model, collocation_pts)
weights = residuals.cpu().numpy() + 1e-6 # avoid zero-weight collapse
collocation_pts = sample_collocation_points(N_COLLOC, DOMAIN, weights)
Loss weighting is where implementations fall apart most visibly. Meta's paper (and most PINN papers) use fixed lambda weights like loss = lambda_pde * L_pde + lambda_bc * L_bc. In practice, fixed weights require tuning that's highly problem-specific. The Neural Tangent Kernel (NTK) approach described by Wang et al. (2022) dynamically sets each weight as the ratio of the mean NTK eigenvalue of the total loss to the eigenvalue of each component — expensive to compute exactly, but you can approximate it by tracking gradient norms:
def compute_adaptive_weights(losses, model):
"""
Approximate NTK-based weighting via gradient norm ratio.
losses: dict of {'pde': tensor, 'bc': tensor, 'ic': tensor}
"""
grad_norms = {}
for key, loss in losses.items():
model.zero_grad()
loss.backward(retain_graph=True)
total_norm = 0.0
for p in model.parameters():
if p.grad is not None:
total_norm += p.grad.data.norm(2).item() ** 2
grad_norms[key] = total_norm ** 0.5
mean_norm = sum(grad_norms.values()) / len(grad_norms)
# Weight inversely proportional to gradient norm
# so small-gradient losses get amplified
weights = {k: mean_norm / (v + 1e-8) for k, v in grad_norms.items()}
return weights
The BC/IC loss must be computed and tracked separately from the PDE residual loss — never lump them. The boundary condition residual typically trains to near-zero within a few hundred steps because it's algebraically simpler, while the PDE residual might still be an order of magnitude larger at step 5000. If you combine them with equal weight, the optimizer sees the BC gradient shrink and effectively stops caring about it, then the boundary conditions slowly drift. I maintain three separate optimizer parameter groups in practice, or at minimum three separately-logged loss scalars so you can catch this drift during training before it ruins a 12-hour run.
The Training Loop — Where Theory Gets Humbling
The thing that catches most people off guard with PINNs isn't the physics — it's that the training loop behaves nothing like supervised deep learning. You watch the loss drop smoothly and think you're done, then you plot the actual velocity field and it's physically nonsensical. That disconnect between "loss looks good" and "solution is wrong" is the central trap of the PINN training loop, and nothing in the original Meta paper prepares you for it.
Adam First, Then L-BFGS — Don't Skip the Handoff
Pure L-BFGS from a random initialization almost always fails. The optimizer can't build a useful Hessian approximation when the loss space is completely wild. What actually works is using Adam for the first several thousand iterations to get into a reasonable basin, then handing off to L-BFGS for the final convergence. The split I use is roughly 5,000–10,000 Adam steps at lr=1e-3, then L-BFGS until plateau. The final residuals after L-BFGS are consistently one to two orders of magnitude lower than what Adam alone achieves.
import torch
from torch.optim import Adam, LBFGS
# Phase 1: Adam warms up the space
optimizer_adam = Adam(model.parameters(), lr=1e-3)
for step in range(8000):
optimizer_adam.zero_grad()
loss = physics_loss(model, collocation_pts) + boundary_loss(model)
loss.backward()
# clip BEFORE the step — L-BFGS phase needs this too
torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
optimizer_adam.step()
# Phase 2: L-BFGS drives down the PDE residual
optimizer_lbfgs = LBFGS(
model.parameters(),
lr=0.1,
max_iter=500,
history_size=50, # larger history = better curvature estimate
line_search_fn="strong_wolfe" # without this, L-BFGS can overshoot badly
)
def closure():
optimizer_lbfgs.zero_grad()
loss = physics_loss(model, collocation_pts) + boundary_loss(model)
loss.backward()
return loss
optimizer_lbfgs.step(closure)
The Loss Is Decreasing But Physics Is Wrong — Track Residuals Separately
Your scalar training loss is a weighted sum: PDE residual + boundary terms + initial condition terms. When you weight them wrong — or when the boundary loss dominates early — the optimizer can drive total loss down while the PDE residual quietly stays high. I've watched a 2D Burgers run hit a training loss of 1e-4 while the divergence-free constraint was violated by factors of 10. The fix is to log each component independently, not just the total, and treat the PDE residual as your ground-truth health metric.
def compute_losses(model, col_pts, bc_pts, ic_pts):
# PDE residual — this is what the physics cares about
u, v, p = model(col_pts)
pde_res = burgers_residual(u, v, col_pts) # shape: (N, 2)
pde_loss = torch.mean(pde_res ** 2)
# Boundary and initial — these can dominate if you're not careful
bc_loss = torch.mean((model(bc_pts) - bc_vals) ** 2)
ic_loss = torch.mean((model(ic_pts) - ic_vals) ** 2)
total = pde_loss + 10.0 * bc_loss + 10.0 * ic_loss
# Return all three so you can log them independently
return total, pde_loss.item(), bc_loss.item(), ic_loss.item()
# In your training loop:
total, pde, bc, ic = compute_losses(model, col_pts, bc_pts, ic_pts)
# Log separately — wandb, tensorboard, or even just print
wandb.log({"loss/total": total.item(), "loss/pde": pde, "loss/bc": bc, "loss/ic": ic})
Set an alert threshold on loss/pde specifically. If it hasn't dropped below 1e-3 by iteration 3,000, something is wrong — either your collocation sampling is bad, your network is too shallow, or your loss weights need rebalancing. Don't wait until the end of training to find out.
Gradient Clipping: The Paper Won't Tell You This
The original PINN papers don't mention gradient clipping because they're presenting idealized results. In practice, especially during the Adam warm-start phase on problems with sharp gradients — like Burgers at high Reynolds number — you'll get gradient spikes that corrupt the weights and send loss to NaN. The clip value that's worked consistently across my runs is 1.0. Higher values like 5.0 let spikes through; lower values like 0.1 slow convergence noticeably.
# This one line has saved more training runs than any hyperparameter tuning
torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
# Put it AFTER loss.backward() but BEFORE optimizer.step()
# Works in both the Adam and the L-BFGS closure phases
The L-BFGS closure also needs clipping, which feels weird because L-BFGS is supposed to be more stable. But the closure gets called multiple times per step during line search, and on a bad collocation batch you can still get a spike. Clip inside the closure function, not outside it.
A100 vs RTX 3090 — The Numbers Are Real But Do the Math First
For a 2D Burgers equation with a 6-layer, 128-neuron network, 20,000 collocation points, and the Adam+L-BFGS schedule described above, I measured roughly 18 minutes on an RTX 3090 and 7 minutes on an A100 80GB for 10,000 Adam steps plus 500 L-BFGS steps. The A100 is about 2.5× faster. That sounds significant until you check current spot pricing — an A100 on Lambda Labs runs around $1.10/hour, a 3090 node on RunPod is around $0.44/hour. For a single 18-minute run, the absolute cost difference is roughly $0.19. The A100 only starts making financial sense when you're doing hyperparameter sweeps across dozens of runs, or scaling to 3D problems where the collocation point count explodes and memory bandwidth becomes the real bottleneck. For iterative development and debugging the physics, run locally on whatever GPU you have. Move to cloud A100s when you're doing final sweeps or scaling up.
Validation Against Ground Truth — Don't Skip This
The loss going to 0.001 means nothing. I've had PINNs with beautiful training curves that were completely wrong — the network learned to satisfy the collocation points while hallucinating the solution everywhere in between. The only way you catch this is by running an actual numerical solver on the same problem and comparing. Not as a final step. Continuously, during development, ideally automated into your training loop.
For a lightweight reference solution, FEniCS (FEniCSx in Python 3.10+) is the fastest path if you're already in Python. You don't need a production mesh — a coarse FEM solve on 500–2000 elements is enough to catch the categories of error that matter. Here's a minimal Poisson reference in FEniCSx:
from dolfinx import mesh, fem, plot
from dolfinx.fem.petsc import LinearProblem
import ufl
import numpy as np
from mpi4py import MPI
# 32x32 mesh — coarse but sufficient for catching gross errors
domain = mesh.create_unit_square(MPI.COMM_WORLD, 32, 32)
V = fem.functionspace(domain, ("Lagrange", 1))
u = ufl.TrialFunction(V)
v = ufl.TestFunction(V)
f = fem.Constant(domain, -6.0) # matches your PINN's forcing term
# Dirichlet BC: u = 0 on all boundaries
facets = mesh.locate_entities_boundary(domain, 1, lambda x: np.full(x.shape[1], True))
dofs = fem.locate_dofs_topological(V, 1, facets)
bc = fem.dirichletbc(fem.Constant(domain, 0.0), dofs, V)
a = ufl.dot(ufl.grad(u), ufl.grad(v)) * ufl.dx
L = f * v * ufl.dx
problem = LinearProblem(a, L, bcs=[bc])
uh = problem.solve() # uh.x.array is your reference solution vector
Once you have reference values, compute the relative L2 error. The physics-ML community standardized on this metric because absolute L2 scales with domain size and magnitude — it's meaningless across problems. The formula is:
import numpy as np
# pred: your PINN's output evaluated at the same points as ref
# ref: FEM solution interpolated to those points
rel_l2 = np.linalg.norm(pred - ref) / np.linalg.norm(ref)
# Rule of thumb I use:
# < 1e-2 → acceptable for engineering estimates
# < 1e-3 → solid, publishable for most venues
# > 5e-2 → something is structurally wrong, don't tune hyperparameters yet
print(f"Relative L2 error: {rel_l2:.4e}")
The residual field visualization is where the real debugging happens. matplotlib works fine for 2D — plot (pred - ref) as a heatmap and you'll immediately see the spatial signature of your failure mode. If error clusters near boundaries, your BC enforcement is broken. If it's highest in the interior at collocation gaps, you have a sampling problem. ParaView is worth the setup cost for 3D problems — export your solution to VTK using meshio and you get slice planes, streamlines, and isovolumes that matplotlib can't give you:
import meshio
import numpy as np
# points: Nx3 array of spatial coords
# pred, ref: solution vectors at those points
cells = [("quad", quad_connectivity)] # or "triangle", "tetra", etc.
meshio.write("residual_field.vtu", meshio.Mesh(
points=points,
cells=cells,
point_data={
"pinn_solution": pred,
"fem_reference": ref,
"absolute_error": np.abs(pred - ref),
# Signed error — tells you if PINN is over or under
"signed_error": pred - ref,
}
))
# Open in ParaView, apply "Warp by Scalar" on absolute_error
# to see error mountains where your network is cheating
The checklist I go through when a PINN passes the loss test but fails the physics test — because this happens more than anyone admits:
- Collocation point density is uneven — uniform random sampling misses boundary layers; switch to importance sampling or adaptive refinement using the residual magnitude itself
- Loss weights are off by an order of magnitude — if your PDE residual loss is 1e-4 and your BC loss is 1e-1, the network ignores the BC; try weighting BC terms by 10–100x and recheck
- Activation function saturating — tanh saturates in deep networks; swap to
sinactivations (Fourier features work even better for oscillatory solutions) and watch the relative L2 drop - You're validating at training points — always hold out a grid that was never seen during training; 5% of a fine mesh is enough to expose overfitting to collocation locations
- Reference solution isn't refined enough to trust — if your FEM mesh is coarser than your PINN's effective resolution, you're comparing noise to noise; refine the FEM mesh until the FEM solution itself converges first
The 3 Things That Surprised Me Most
Activation Functions Outweigh Architecture Depth by a Surprising Margin
I spent two weeks tuning layer counts before I realized I was optimizing the wrong thing. Going from 4 layers to 8 layers on a Burgers' equation PINN gave me maybe a 12% reduction in residual loss. Switching from tanh to sin activation on the same 4-layer network dropped it by 60%. The paper doesn't lead with this — it's buried in an ablation table — but the activation function is doing most of the heavy lifting for spectral bias reasons. sin activations (Siren-style) work extremely well for wave equations and oscillatory PDEs because the network can represent high-frequency solutions without fighting the architecture. tanh is your default workhorse for diffusion and parabolic PDEs where solutions are smooth. swish (defined as x * sigmoid(x)) surprised me most on Navier-Stokes — it outperformed both others, probably because it doesn't saturate the way tanh does at larger magnitudes. Try this ordering before you touch depth or width:
# Quick activation sweep — run this before any architecture search
import torch
import torch.nn as nn
activations = {
"tanh": nn.Tanh(),
"sin": torch.sin, # Siren-style, pair with weight init: w0=30
"swish": nn.SiLU(), # torch calls swish SiLU
"gelu": nn.GELU(),
}
# For Siren init — this matters, don't skip it
def siren_init(layer, w0=30.0, is_first=False):
fan_in = layer.weight.shape[1]
bound = 1 / fan_in if is_first else (6 / fan_in) ** 0.5 / w0
nn.init.uniform_(layer.weight, -bound, bound)
The Collocation Point Count in the Abstract Is Not the Real Count
This one made me genuinely frustrated. The abstract on the Meta Paper [P] claims accuracy within 1e-4 relative L2 error. What they don't say until Appendix C is that they used 50,000 interior collocation points and 5,000 boundary points. The main text mentions "N=10,000 training points" which is technically true if you count only a subset of their sampling strategy. I've seen this pattern across at least five PINN papers now — the headline result requires a point density that isn't computationally free. At 50K points in 3D, your collocation batch alone is eating serious GPU memory. On an A100 80GB, you can hold a batch of ~8,000 float32 collocation points with their autodiff graphs before you start swapping. The practical fix is Latin Hypercube Sampling instead of uniform random — you get better coverage at lower N:
from scipy.stats import qmc
# LHS gives better space coverage than torch.rand for collocation
sampler = qmc.LatinHypercube(d=3) # 3D spatial domain
sample = sampler.random(n=5000) # 5000 points covers what 20000 uniform does
# Scale to your domain [x_min, x_max] x [y_min, y_max] x [t_min, t_max]
l_bounds = [0.0, 0.0, 0.0]
u_bounds = [1.0, 1.0, 2.0]
collocation_pts = qmc.scale(sample, l_bounds, u_bounds)
Always read the appendix before you benchmark your implementation against a paper's reported numbers. The gap between "our method" in the abstract and "our method with full experimental details" is almost always wider than you'd expect.
Outside the Training Domain, PINNs Fall Apart Fast
The extrapolation problem hit me when I tried to extend a trained heat equation PINN from t ∈ [0, 1] to t ∈ [0, 2]. The loss on the extended domain was an order of magnitude higher than inside the training window, even though the physics don't change. This isn't a bug in my implementation — it's structural. PINNs are glorified function approximators constrained by PDE residuals, and neural networks don't extrapolate; they interpolate. The physics loss enforces the PDE only at sampled collocation points, so any region you didn't sample during training is essentially unconstrained. This makes PINNs the wrong tool if your actual use case involves time-stepping beyond training horizon or spatial generalization to new boundary geometries.
The honest comparison here: a classical finite element solver will extrapolate trivially because it's solving the PDE forward, not learning a surrogate. PINNs earn their keep when you have sparse, noisy measurement data you want to fuse with known physics — that's the inverse problem use case where they genuinely shine. For pure forward simulation with known boundary conditions, a well-tuned FEM or finite difference solver will beat a PINN on accuracy, speed, and extrapolation every time. The hype around PINNs replacing numerical solvers is overblown. Use them when you have data + physics and want a continuous differentiable representation of the solution field. Don't use them when you need reliable prediction outside a well-sampled domain.
When NOT to Use a PINN
The failure mode I see most often: a team reads the Meta Paper, gets excited about physics-informed training, and starts retrofitting PINNs onto problems where they have thousands of high-fidelity simulation outputs already sitting in an S3 bucket. That's the wrong call. If you have dense, clean simulation data, train a Fourier Neural Operator (FNO) or a DeepONet instead. These neural operator architectures are purpose-built to learn solution operators from data — they'll converge faster, generalize better across parameter regimes, and you won't spend two weeks fighting residual weighting schemes. The PINN's core value proposition is replacing simulation data you don't have, not supplementing data you already collected.
PINNs also fall apart hard on PDEs with discontinuities or shocks — think Euler equations with strong shocks, hyperbolic conservation laws in the supersonic regime, or any problem where the solution has a sharp front that moves. The physics residual term assumes your network output is smooth enough to differentiate meaningfully, and when the true solution has a jump discontinuity, you're asking the network to approximate something it fundamentally can't represent cleanly. There are research-grade fixes — adaptive sampling near discontinuities, entropy-viscosity regularization, domain decomposition with interface conditions — but these are active research topics, not production tooling. If your physical regime involves shocks, you'll spend more time fighting PINN numerics than solving your actual engineering problem.
The debugging problem is underappreciated and genuinely painful. When a standard supervised model fails, the loss curve tells you a lot. When a PINN fails, you might get a loss that looks great while the physics residual is silently collapsed to a trivial solution — the network just learned to output nearly-constant values that satisfy the PDE while completely ignoring boundary conditions or initial conditions. I've seen residuals drop to 1e-4 with solutions that were physically nonsense. Diagnosing this requires someone who can read the residual term analytically, spot which collocation points are being neglected, and recognize when automatic differentiation is producing garbage gradients near domain boundaries. The tooling here — proper residual visualization, adaptive collocation diagnostics — is nowhere near as mature as standard ML debugging infrastructure.
A concrete checklist for deciding whether a PINN is appropriate:
- Data scarce, physics well-understood: PINN is worth the pain — this is the actual sweet spot.
- Data abundant (hundreds of simulation runs or more): Use FNO or DeepONet, full stop.
- PDE involves shocks, phase boundaries, or discontinuous coefficients: Expect to reproduce recent NeurIPS papers just to get baseline behavior — plan accordingly or pick a different approach.
- Team ML background only, no one has solved the PDE analytically or numerically before: The failure modes will be invisible. You need someone who can sanity-check residuals against known test cases.
The honest summary: PINNs are a constraint-injection technique that trades data for physics. The math is elegant, the Meta Paper framing around transfer across parameter families is genuinely compelling, but the technique is still brittle enough that context matters enormously. If your PDE is smooth, your data is thin, and you have someone on the team who's debugged an ill-conditioned linear system before and understands why it failed — PINNs are the right tool. If you're holding a terabyte of CFD outputs, just use them. Don't perform physics-informed training as a ritual when empirical training is available and straightforward.
Production Considerations Nobody Puts in Papers
The thing that gets me every time is watching someone save a PINN checkpoint and then be completely unable to use it three weeks later. torch.save(model.state_dict(), 'pinn_checkpoint.pt') saves your weights, full stop. It saves nothing about the collocation grid, the domain bounds, the PDE coefficients, or which non-dimensionalization you applied to your inputs. Load those weights into a slightly different setup and your model will silently produce plausible-looking garbage. I've seen this happen on real projects where the person who trained the model was on vacation. Always serialize the full config alongside the weights:
import torch, json
# Save
torch.save(model.state_dict(), 'pinn_checkpoint.pt')
with open('pinn_config.json', 'w') as f:
json.dump({
"pde_type": "navier_stokes_2d",
"domain_x": [0.0, 1.0],
"domain_y": [0.0, 1.0],
"n_collocation": 10000,
"bc_points": 400,
"reynolds_number": 100,
"input_normalization": {"mean": [0.5, 0.5], "std": [0.5, 0.5]},
"training_residual_loss": 3.2e-5
}, f, indent=2)
# Load — always validate config before forward pass
with open('pinn_config.json') as f:
cfg = json.load(f)
assert cfg["domain_x"] == [0.0, 1.0], "Domain mismatch — wrong checkpoint"
model.load_state_dict(torch.load('pinn_checkpoint.pt'))
The classical solver vs PINN speed debate is almost always framed wrong in papers. For a single query — one set of boundary conditions, one parameter value — a classical FEM or finite difference solver will beat your PINN. FEniCS solves a moderately complex 2D PDE in seconds; your PINN forward pass might be faster, but the training cost already happened and amortization doesn't help for one shot. Where PINNs actually win is parametric sweeps. If your team needs to evaluate 500 different inflow velocities or 200 different material parameters, you're not re-running FEM 500 times — you're doing 500 batched forward passes through a network that already learned the solution manifold. That crossover point in my experience lands somewhere around 80–120 distinct queries with varying BCs before the PINN approach pays off in wall-clock time.
Wrapping the trained model as a FastAPI endpoint is the single most effective thing you can do for adoption. Nobody on the mechanical engineering side wants to set up a Python venv with PyTorch and then figure out your collocation sampling code. Give them an HTTP endpoint. Here's a minimal but production-honest version:
from fastapi import FastAPI
from pydantic import BaseModel
import torch, json
import numpy as np
app = FastAPI()
# Load once at startup — not per request
with open('pinn_config.json') as f:
cfg = json.load(f)
model = build_model(cfg) # your architecture factory
model.load_state_dict(torch.load('pinn_checkpoint.pt', map_location='cpu'))
model.eval()
class QueryInput(BaseModel):
x: list[float]
y: list[float]
# boundary condition params the model was trained to accept
inlet_velocity: float
@app.post("/predict")
def predict(data: QueryInput):
# normalize exactly as during training
mean = cfg["input_normalization"]["mean"]
std = cfg["input_normalization"]["std"]
x_t = (torch.tensor(data.x) - mean[0]) / std[0]
y_t = (torch.tensor(data.y) - mean[1]) / std[1]
v_t = torch.full_like(x_t, data.inlet_velocity)
inp = torch.stack([x_t, y_t, v_t], dim=1)
with torch.no_grad():
out = model(inp)
return {"u": out[:, 0].tolist(), "v": out[:, 1].tolist(), "p": out[:, 2].tolist()}
Run it with uvicorn pinn_server:app --workers 4 --port 8000 and you've got something your entire team can hit from MATLAB, Julia, or a spreadsheet macro. The --workers 4 matters: each worker holds the model in memory, so watch your RAM. A typical PINN with 8 hidden layers × 256 neurons sits around 4–6 MB, so four workers is fine. Add a /health endpoint that returns the config hash so callers can verify they're talking to the right model version.
MLflow model versioning for PINNs is underspecified in basically every tutorial I've seen. People log loss curves and call it done. The tags you actually need — and these are non-negotiable if you're running multiple physics configurations — are the PDE type, the full domain bounds, and the final training residual broken down by loss component. Here's what I tag on every run:
import mlflow
with mlflow.start_run():
mlflow.set_tags({
"pde_type": "heat_equation_2d_steady",
"domain_x_min": 0.0,
"domain_x_max": 1.0,
"domain_y_min": 0.0,
"domain_y_max": 2.0,
"n_collocation_interior": 10000,
"n_collocation_boundary": 800,
"physics_loss_weight": 1.0,
"bc_loss_weight": 10.0, # heavily weighted — log this or you can't compare runs
"thermal_conductivity": 0.58,
"framework_version": "torch_2.2",
})
mlflow.log_metrics({
"final_physics_residual": physics_loss,
"final_bc_residual": bc_loss,
"final_ic_residual": ic_loss, # separate — don't aggregate these
"total_loss": total_loss,
})
mlflow.pytorch.log_model(model, "pinn_model")
mlflow.log_artifact('pinn_config.json') # the config from earlier — log both
The reason you separate final_physics_residual from final_bc_residual is that a low total loss can hide a model that satisfies boundary conditions but has a sloppy interior — or vice versa. I've pulled "good" models from the registry that had total loss of 1e-4 but a physics residual of 9e-5 and a BC residual of 1e-5, meaning they barely respected the governing equation. Log them separately, query them separately when searching the registry, and set a hard rejection threshold on the physics residual before you deploy anything.
Disclaimer: This article is for informational purposes only. The views and opinions expressed are those of the author(s) and do not necessarily reflect the official policy or position of Sonic Rocket or its affiliates. Always consult with a certified professional before making any financial or technical decisions based on this content.
Originally published on techdigestor.com. Follow for more developer-focused tooling reviews and productivity guides.
Top comments (0)