DEV Community

James Huschle

How I deployed a cross-encoder model at Lambda latency without PyTorch

Most quiz systems check whether you typed the right string. I wanted to build one that evaluates whether you understood the concept.

That's a harder problem. "A contract between services" and "an agreed-upon interface" are the same answer to anyone who understands APIs. To a string matcher, they're completely different. Closing that gap is what led me down the path of deploying a custom ML model on AWS Lambda — and eventually to cutting PyTorch entirely from the container.

Here's how it went.

The problem with string matching
Know-It-All Tutor is a serverless learning system where users define knowledge domains, add terms and definitions, and get quizzed on them. The core evaluation problem: when a user submits an answer, how do you decide if they got it right?

The naive approach — exact string match or even fuzzy string match — fails immediately. Learners paraphrase. They use synonyms. They express the same concept in ten different ways. A system that only accepts the exact definition you typed isn't evaluating understanding, it's evaluating memory.
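
To make the failure concrete, here is a minimal sketch of the naive approach using Python's standard-library difflib. The function name naive_match and the 0.8 threshold are illustrative assumptions, not code from Know-It-All Tutor.

```python
from difflib import SequenceMatcher


def naive_match(answer: str, reference: str, threshold: float = 0.8) -> bool:
    """Accept an answer only if it is character-wise close to the reference."""
    a = answer.strip().lower()
    b = reference.strip().lower()
    if a == b:
        return True
    # ratio() measures shared characters in order, not shared meaning
    return SequenceMatcher(None, a, b).ratio() >= threshold


# Same wording, different capitalization: accepted
print(naive_match("An agreed-upon interface", "an agreed-upon interface"))
# A valid paraphrase: rejected, because the characters barely overlap
print(naive_match("A contract between services", "an agreed-upon interface"))
```

The first call prints True and the second prints False, even though both answers describe the same concept.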

What I needed was semantic similarity — a model that understands meaning, not strings.

Cross-encoder vs. bi-encoder
There are two main architectures for semantic similarity tasks:

Bi-encoders encode each sentence independently into a vector, then compare vectors with cosine similarity. Fast at inference time because you can pre-compute embeddings. Great for retrieval tasks where you're comparing one query against thousands of candidates.

Cross-encoders take both sentences together as a single input and produce a single similarity score. Slower — you can't pre-compute anything — but significantly more accurate because the model can attend to both sentences simultaneously.

For answer evaluation, I'm comparing one user answer against one reference definition. I don't need retrieval speed. I need accuracy. Cross-encoder was the right call.
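
The difference in data flow can be sketched in a few lines. This is only an illustration of what each architecture lets you cache. The encode and score callables stand in for real models and are assumptions, as are the function names.

```python
import math
from typing import Callable, Sequence

Vector = Sequence[float]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def bi_encoder_score(
    encode: Callable[[str], Vector], answer: str, reference_vec: Vector
) -> float:
    # reference_vec was computed ahead of time; only the answer is encoded now
    return cosine(encode(answer), reference_vec)


def cross_encoder_score(
    score: Callable[[str, str], float], answer: str, reference: str
) -> float:
    # Both texts pass through the model together, so nothing can be cached
    return score(answer, reference)
```

With a bi-encoder the reference embedding is reusable across every answer. With a cross-encoder each (answer, reference) pair costs one full forward pass, which is acceptable here because there is exactly one pair per quiz question.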

Fine-tuning the model
I started with a cross-encoder pre-trained on Natural Language Inference (NLI) — a task where models learn to classify sentence pairs as entailment, contradiction, or neutral. NLI pre-training gives the model a strong foundation for understanding semantic relationships.

From there I fine-tuned on the Semantic Textual Similarity Benchmark (STSB), which provides sentence pairs with human-annotated similarity scores from 0 to 5. This shifts the model from classification to regression — outputting a continuous similarity score rather than a category label.
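
The article does not show the data preparation, so the following is a sketch under one common convention (an assumption here): rescale the 0 to 5 human scores to 0 to 1 so they can serve as regression targets for a single-output model. The function name and the sample rows are made up for illustration.

```python
from typing import List, Tuple

Pair = Tuple[str, str, float]


def to_regression_examples(rows: List[Pair], max_score: float = 5.0) -> List[Pair]:
    """Rescale STSB-style similarity scores from [0, max_score] to [0, 1]."""
    examples = []
    for sentence1, sentence2, score in rows:
        if not 0.0 <= score <= max_score:
            raise ValueError(f"score {score} outside [0, {max_score}]")
        examples.append((sentence1, sentence2, score / max_score))
    return examples


# Hypothetical rows in the (sentence1, sentence2, score) shape STSB uses
rows = [
    ("A man is playing a guitar.", "A person plays a guitar.", 5.0),
    ("A man is playing a guitar.", "A chef is cooking pasta.", 0.0),
]
print(to_regression_examples(rows))
```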

The result: a model that understands that "rapid" and "fast" are similar, that "a contract between services" and "an agreed-upon interface" are the same idea, and that "blue" and "democracy" have nothing to do with each other.

The Lambda constraint problem
Here's where it gets interesting.

A standard PyTorch + Transformers deployment would produce a Lambda container in the range of 1.5-2GB. Lambda supports containers up to 10GB, so it fits — but the cold start time scales with container size, and for an interactive quiz application, cold starts are user-visible latency.

Oh, and there is one other thing. I was running this on a personal AWS account with a goal of keeping it as close to free-tier as possible. What can I say? I'm cheap. The obvious Lambda container story is the 10GB image size limit. That wasn't my problem. My problem was storing a container in ECR as close to free as possible.

Lambda container images must come from a private ECR repository. The ECR Public Gallery, and the 50 GB of storage that comes with it, isn't supported. Private ECR's free tier is 500MB of storage for 12 months. After that, you pay per GB-month for storage. That's a real ongoing charge for a personal project that gets intermittent traffic.

I wanted this system to run permanently at near zero infrastructure cost. That meant the container had to fit inside 500MB — ideally with room to spare.
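
The budget arithmetic is simple enough to sketch. The 500MB free tier comes from the text above. The price of $0.10 per GB-month and the 1024MB-per-GB conversion are assumptions for illustration, so check current ECR pricing for your region before relying on the numbers.

```python
def monthly_ecr_storage_cost(
    image_mb: float,
    free_tier_mb: float = 500.0,
    usd_per_gb_month: float = 0.10,  # assumed price, not taken from the article
) -> float:
    """Estimate the monthly storage charge for one image in private ECR."""
    billable_mb = max(0.0, image_mb - free_tier_mb)
    return billable_mb / 1024.0 * usd_per_gb_month


# A 2GB PyTorch-sized image versus an image that fits in the free tier
print(monthly_ecr_storage_cost(2048.0))
print(monthly_ecr_storage_cost(112.0))
```

Anything at or below the free-tier size costs nothing to store, which is why image size, rather than Lambda's own limit, ended up being the binding constraint.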

I needed to get the container small.

The solution: ONNX export + int8 quantization
Export the model to ONNX and drop PyTorch entirely at inference time:

from optimum.onnxruntime import ORTModelForSequenceClassification
from transformers import AutoTokenizer

# export=True converts the fine-tuned PyTorch checkpoint to ONNX while loading
model = ORTModelForSequenceClassification.from_pretrained(
    model_path,
    export=True
)

Then quantize to int8:

from optimum.onnxruntime import ORTQuantizer
from optimum.onnxruntime.configuration import AutoQuantizationConfig

# model_path must point at a directory that contains the exported ONNX model
quantizer = ORTQuantizer.from_pretrained(model_path)
# is_static=False selects dynamic quantization, which needs no calibration dataset
qconfig = AutoQuantizationConfig.avx512_vnni(is_static=False, per_channel=False)
quantizer.quantize(save_dir=quantized_path, quantization_config=qconfig)

The quantized ONNX model is ~79MB. At inference time, the only runtime dependencies are onnxruntime, tokenizers, and numpy: no PyTorch, no Transformers. The final Docker Lambda image comes in at ~112MB, well under the 500MB ECR free tier (and far below Lambda's 10GB limit), and meaningfully faster to cold-start than a 1.5-2GB PyTorch image.
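
For intuition about where the size reduction comes from, here is a simplified sketch of symmetric int8 quantization in plain Python. It is not what onnxruntime does internally (the real scheme also handles zero points, activations, and per-tensor or per-channel scales); the function names are illustrative assumptions.

```python
from typing import List, Tuple


def quantize_int8(weights: List[float]) -> Tuple[List[int], float]:
    """Map floats to integers in [-127, 127] using one shared scale factor."""
    max_abs = max(abs(w) for w in weights) or 1.0
    scale = max_abs / 127.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized: List[int], scale: float) -> List[float]:
    """Recover approximate float weights from the int8 values."""
    return [q * scale for q in quantized]


weights = [0.5, -1.0, 0.25, 0.0]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
# Each weight now needs 1 byte instead of 4, at the cost of a small rounding error
print(q, max(abs(w - r) for w, r in zip(weights, restored)))
```

The rounding error per weight is bounded by half the scale factor, which is the trade the quantized model makes for being roughly a quarter of the float32 size.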

And on top of that, my projected end-of-month AWS bill is $0.48. Stupid incidental costs.

The inference handler
The Lambda handler is straightforward:

import onnxruntime as ort
from tokenizers import Tokenizer
import numpy as np

# Cached at module level so warm invocations skip model loading
_session = None
_tokenizer = None

def _load():
    """Load the ONNX session and tokenizer once per container."""
    global _session, _tokenizer
    if _session is None:
        opts = ort.SessionOptions()
        opts.intra_op_num_threads = 1
        _session = ort.InferenceSession("model/model.onnx", sess_options=opts)
        tok = Tokenizer.from_file("model/tokenizer.json")
        tok.enable_padding()
        tok.enable_truncation(max_length=512)
        _tokenizer = tok

def score_pair(answer: str, reference: str) -> float:
    """Return a calibrated similarity score in [0, 1] for one answer/reference pair."""
    _load()
    # Encode both texts as one sequence pair, which is what a cross-encoder expects
    encoded = _tokenizer.encode(answer, reference)
    inputs = {
        "input_ids": np.array([encoded.ids], dtype=np.int64),
        "attention_mask": np.array([encoded.attention_mask], dtype=np.int64),
    }
    logits = _session.run(None, inputs)[0][0]
    # Sigmoid of the single regression logit, then min-max rescale to the calibration range
    sig = float(1.0 / (1.0 + np.exp(-logits[0])))
    return float(np.clip((sig - 0.0347) / (0.1689 - 0.0347), 0.0, 1.0))

Results
The calibration range [0.0347, 0.1689] used in the handler was derived from a limited sample. With more real student answers accumulated over time, I'd re-derive it and bring the pass threshold back to 0.50. For now, 0.45 reflects where the model actually performs rather than where I'd like it to perform.
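
The calibration step and the threshold can be isolated into a small standard-library sketch. The constants come from the article. The assumption here is that the 0.45 threshold is applied to the rescaled score, and the function names are illustrative.

```python
import math

CAL_LOW, CAL_HIGH = 0.0347, 0.1689  # calibration range from the handler
PASS_THRESHOLD = 0.45               # current pass threshold


def calibrated_score(logit: float) -> float:
    """Sigmoid the raw logit, then min-max rescale and clip to [0, 1]."""
    sig = 1.0 / (1.0 + math.exp(-logit))
    scaled = (sig - CAL_LOW) / (CAL_HIGH - CAL_LOW)
    return min(1.0, max(0.0, scaled))


def is_correct(logit: float, threshold: float = PASS_THRESHOLD) -> bool:
    """Assumed decision rule: pass when the calibrated score meets the threshold."""
    return calibrated_score(logit) >= threshold


# A logit whose sigmoid is about 0.10 lands near the middle of the range
print(calibrated_score(math.log(0.1 / 0.9)))
```

Because the range is narrow, small shifts in the raw sigmoid output move the calibrated score a lot, which is why re-deriving the range from more data matters.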

The model handles paraphrases, synonyms, and conceptual equivalence naturally. It doesn't require exact wording. It doesn't require the learner to have memorized the definition verbatim.

More importantly: it runs on CPU at Lambda latency. No GPU required. No SageMaker endpoint. No inference server to manage. Just a Docker Lambda with a model baked in, invoked on demand, paying only for what it uses.
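
For completeness, here is one way the Lambda entry point around a scoring function might look. The event shape (a JSON body with answer and reference keys), the response format, and the make_handler name are all assumptions for illustration. The article does not show this part.

```python
import json
from typing import Any, Callable, Dict


def make_handler(score_fn: Callable[[str, str], float], threshold: float = 0.45):
    """Build a Lambda-style handler around any (answer, reference) -> score function."""

    def handler(event: Dict[str, Any], context: Any) -> Dict[str, Any]:
        try:
            body = json.loads(event.get("body") or "{}")
            answer = body["answer"]
            reference = body["reference"]
        except (KeyError, TypeError, json.JSONDecodeError):
            error = {"error": "expected JSON with 'answer' and 'reference'"}
            return {"statusCode": 400, "body": json.dumps(error)}
        score = score_fn(answer, reference)
        result = {"score": score, "correct": score >= threshold}
        return {"statusCode": 200, "body": json.dumps(result)}

    return handler


# Stub scorer standing in for the real model, for a local smoke test
handler = make_handler(lambda answer, reference: 0.9)
event = {"body": json.dumps({"answer": "a contract", "reference": "an interface"})}
print(handler(event, None))
```

Injecting the scorer keeps the handler testable without loading the ONNX model. In the deployed function, the injected callable would be the real scoring function.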

The bigger picture
Know-It-All Tutor actually uses two models. The cross-encoder handles answer evaluation — a precision task that needs a small, fast, accurate model. A separate self-hosted open-source LLM handles curriculum generation — decomposing a new knowledge domain into coherent subdomains, searching authoritative sources, and generating definitions precise enough to be evaluated against.

Two very different jobs. Two very different tools. Knowing which problem needs a large model and which one doesn't is how you keep inference costs under control.

The full system is a six-stack CDK deployment on AWS. The architecture and the rest of the project are at syntacticallysugary.dev — a portfolio I'm adding to weekly.

I'm building projects at the intersection of AI, cloud architecture, and the humans who have to live with the systems we build. If that sounds like your world, I'd like to connect.
