Sequential execution was perfect: the same expected answer 100% of the time. It looked like a reliable system. Then we ran the same test with five parallel processes, and the model started disagreeing with itself, returning the expected answer only 87% of the time.
What's actually happening: when requests arrive simultaneously, they get batched together, and the same prompt computed in a batch of one versus a batch of five produces slightly different floating-point results, different enough that decoding picks a different token. Temperature 0.0 doesn't save you; it just makes the drift reproducible.
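To see why a different batch shape can change the numbers at all, here is a toy sketch in plain Python. It assumes the usual explanation, that a different batch size leads the kernel to add the same terms in a different order, and that floating-point addition is not associative. The post doesn't say which kernels are involved, so treat `accumulate`, `greedy_pick`, and `rival_logit` as hypothetical names invented for illustration, not as anything a real inference stack exposes.

```python
def accumulate(terms):
    """Add terms strictly left to right, like a naive reduction loop."""
    total = 0.0
    for t in terms:
        total += t
    return total


def greedy_pick(candidate_logit, rival_logit):
    """Greedy (temperature 0) choice between two tokens; ties go to the rival."""
    return "candidate" if candidate_logit > rival_logit else "rival"


# The same three contributions to one logit, accumulated in two orders.
# Assumption: the order stands in for "batch of one" vs "batch of five".
logit_order_a = accumulate([0.1, 0.2, 0.3])
logit_order_b = accumulate([0.3, 0.2, 0.1])

# Hypothetical competing token whose logit sits right at the boundary.
rival_logit = 0.6

print(logit_order_a == logit_order_b)
print(greedy_pick(logit_order_a, rival_logit))
print(greedy_pick(logit_order_b, rival_logit))
```

With IEEE 754 doubles the two orders give 0.6000000000000001 and 0.6, so the first comparison is false and the greedy choice flips from `candidate` to `rival`. Real logits only flip when two tokens are this close, which is why the effect shows up in some fraction of runs rather than all of them.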
If your application runs agents in parallel (most companies do), the determinism you saw in sequential testing breaks at the GPU scheduling layer, which you normally can't control. The way to catch it is to run a consistency probe before you delegate any real actions to the model.
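A minimal sketch of such a probe, under a few assumptions: `call_model` is a hypothetical function you supply that sends one prompt to your model and returns its answer as a string, and threads stand in for the five parallel processes described above. It only uses the Python standard library.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def consistency_probe(call_model, prompt, expected, workers=5, trials=100):
    """Send the same prompt `trials` times across `workers` threads.

    Returns the fraction of answers equal to `expected`, plus a Counter
    of every distinct answer seen.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        answers = list(pool.map(lambda _: call_model(prompt), range(trials)))
    counts = Counter(answers)
    return counts[expected] / trials, counts


def fake_model(prompt):
    """Stand-in for a real client call; always returns the same answer."""
    return "42"


rate, counts = consistency_probe(fake_model, "What is 6 * 7?", "42", trials=20)
print(rate, dict(counts))
```

Run it once with `workers=1` and once with `workers=5` against your real endpoint and compare the two rates. The stub always agrees with itself, so a gap between the two runs only appears with a real model behind `call_model`.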