Last week, a conversation started in UCP Discussion #328 between contributors from Google, a global travel IT leader, Zolofy, and our team. The topic: UCP can't handle perishable inventory. Flights, hotel rooms, event tickets — any product where the price and availability expire on a timer. The protocol has no standard way for merchants to say “this offer is only valid for 15 minutes.”
So we built the primitive and tested it. Here's what happened.
The problem
UCP works well for retail. A pair of shoes doesn't expire while you're deciding. But travel inventory does. An airline seat at $389 right now might be $419 in ten minutes, or gone entirely. Hotel rooms get released. Event tickets get reassigned.
Haunic, who works at a global leader in IT solutions for the travel industry, put it plainly in the discussion: UCP has no hold/release mechanism, no way for merchants to signal when inventory expires. Revanth from Zolofy raised the related problem of ephemeral SKUs — products that don't exist until the merchant resolves them.
Both problems reduce to one missing primitive: temporal validity. An offer needs a TTL.
What we built
We added a travel demo merchant to UCP Playground with 9 MCP tools: search_flights, search_hotels, get_offer_details, hold_offer, create_booking, get_booking, add_ancillary, complete_booking, and cancel_booking.
The key addition: every search result includes a validity_window object.
{
"offer_id": "offer_xgusncCfor2G",
"flight_number": "DL 1053",
"price": { "total": 38900, "currency": "USD" },
"validity_window": {
"valid_until": "2026-04-06T12:15:00Z",
"ttl_seconds": 30,
"notice": "This fare expires in 30 seconds. After expiry, re-search for current pricing."
}
}
If an agent tries to book an expired offer, it gets an OFFER_EXPIRED error with instructions to re-search. Prices shift on every re-query (configurable volatility, default ±15%) to simulate real market conditions. TTLs are configurable via environment variables — production is set to 30 seconds for flights and 10 seconds for hotels, making expiry observable in real time.
We also included hold_offer — an explicit hold mechanism that extends an offer's validity by 5 minutes. This is the digital equivalent of an airline GDS ticket time limit: the inventory is reserved but not yet purchased.
Available inventory
The demo server includes 6 flights (SFO→JFK, LAX→BOS, ORD→LHR, ORD→FRA across United, Delta, American, JetBlue, and Lufthansa), 5 hotels (New York, Boston, London, Frankfurt), and 4 ancillary services (travel insurance, priority boarding, extra legroom, airport lounge). All prices are in cents with configurable volatility.
Production configuration
The live Playground server runs with these environment variables, and anyone can reproduce these tests against the same configuration:
TRAVEL_FLIGHT_TTL=30 # Flight offers valid for 30 seconds
TRAVEL_HOTEL_TTL=10 # Hotel offers valid for 10 seconds
TRAVEL_PRICE_VOLATILITY=15 # ±15% price shift on re-search
This creates natural test conditions: hotel offers frequently expire during model thinking time (10s < typical model round-trip), flight offers survive for fast models but expire between turns, and price changes are noticeable on every re-search.
Test methodology
We ran 5 frontier models through 6 test scenarios, all reproduced on the production Playground Headless API. Every session is linkable and replayable. Exact prompts and API calls below.
Models tested: Claude Opus 4.6, Claude Sonnet 4.5, GPT-5.2, Gemini 3.1 Pro, and Grok 4 (all via OpenRouter).
Test 1: Happy path
Single-shot: can the model search and complete a booking before the 30-second flight TTL expires?
curl -X POST https://ucpplayground.com/api/v1/chat \\
-H "Authorization: Bearer YOUR_TOKEN" \\
-H "Content-Type: application/json" \\
-d '{
"model": "claude-sonnet-4-5",
"domain": "demo-travel.ucpplayground.com",
"message": "Search flights from SFO to JFK and book the cheapest nonstop for me immediately. My name is Alex Rivera, alex@example.com. Payment token: tok_demo. Do not ask for confirmation, just search and book."
}'
Swap model for each provider. The prompt is intentionally directive — we want to test the tool-calling loop, not the model's conversational hesitation.
Test 2: Competing TTLs (flight=30s, hotel=10s)
Same single-shot approach, but asking for both flight and hotel. The hotel's 10-second TTL often expires during the model's thinking time, while the flight's 30-second TTL survives. Tests whether agents can handle partial expiry.
curl -X POST https://ucpplayground.com/api/v1/chat \\
-H "Authorization: Bearer YOUR_TOKEN" \\
-H "Content-Type: application/json" \\
-d '{
"model": "claude-sonnet-4-5",
"domain": "demo-travel.ucpplayground.com",
"message": "Search flights from SFO to JFK and hotels near JFK for 3 nights. Then book the cheapest nonstop flight and most affordable hotel together. My name is Alex Rivera, alex@example.com. Payment token: tok_demo. Do not ask for confirmation."
}'
Test 3: Stale offer recovery (real expiry across turns)
Two-step test with a wait between turns. Step 1 — search:
curl -X POST https://ucpplayground.com/api/v1/chat \\
-H "Authorization: Bearer YOUR_TOKEN" \\
-H "Content-Type: application/json" \\
-d '{
"model": "claude-sonnet-4-5",
"domain": "demo-travel.ucpplayground.com",
"message": "Search flights from SFO to JFK and hotels near JFK for 3 nights. Show me the options with their offer IDs and validity windows."
}'
Save the session_id from the response. Wait 35 seconds for the 30-second flight TTL to expire. Then Step 2 — book with stale offer IDs:
curl -X POST https://ucpplayground.com/api/v1/chat \\
-H "Authorization: Bearer YOUR_TOKEN" \\
-H "Content-Type: application/json" \\
-d '{
"model": "claude-sonnet-4-5",
"domain": "demo-travel.ucpplayground.com",
"session_id": "SESSION_ID_FROM_STEP_1",
"message": "Go ahead and book the cheapest nonstop flight and the most affordable hotel from what you just found. My name is Alex Rivera, alex@example.com. Payment token: tok_demo."
}'
The model's conversation history contains the original search results with offer IDs that no longer exist. It will attempt to book, receive OFFER_EXPIRED, and must decide how to recover.
Test 4: TTL awareness
Does the model read and reason about the validity_window field?
curl -X POST https://ucpplayground.com/api/v1/chat \\
-H "Authorization: Bearer YOUR_TOKEN" \\
-H "Content-Type: application/json" \\
-d '{
"model": "claude-sonnet-4-5",
"domain": "demo-travel.ucpplayground.com",
"message": "Search flights from SFO to JFK. Before you try to book anything, examine the validity_window field in the results. Tell me: how long are these offers valid? What happens if they expire? Should you hold them first?"
}'
Test 5: Price change awareness (explicit ask)
Uses the Test 3 flow (stale offer recovery) but with an explicit instruction to flag price changes:
"message": "Book the Delta DL 1053 nonstop. My name is Alex Rivera, alex@example.com. Payment token: tok_demo. IMPORTANT: If the price has changed from what you showed me earlier, tell me the old and new price before proceeding."
Response fields to check
Each response includes structured data for analysis:
-
outcome—"checkout_reached","search_only", or"failed" -
tool_calls[]— each tool called, withname,arguments,result,duration_ms, anderror(if any) -
turn_count— how many model round-trips were needed -
duration_ms— total wall-clock time -
tokens.total— total tokens consumed -
steps_completed[]— which funnel steps the agent reached -
session_id— for continuing the conversation in a follow-up call
To check if a model re-searched after expiry, count occurrences of search_flights in tool_calls. To check if it flagged price changes, read the final assistant message in messages[].
Results
Test 1: Happy path
All 5 models with 30-second flight TTL, 10-second hotel TTL.
| Model | Outcome | Turns | Duration | Tokens | Session |
|---|---|---|---|---|---|
| Claude Sonnet 4.5 | Booked | 4 | 15.6s | 17,726 | K09HVV |
| GPT-5.2 | Booked | 4 | 12.0s | 13,057 | EBHN6 |
| Gemini 3.1 Pro | Booked | 4 | 29.2s | 12,801 | 4CN4G |
| Grok 4 | Booked | 4 | 38.5s | 20,268 | BR5QX |
| Claude Opus 4.6 | Booked | 5 | 20.9s | 23,071 | GDRD |
All 5 models completed booking. Flight-only searches succeed within the 30-second window, though Grok 4 cut it close at 38.5 seconds (it likely booked just before or after expiry and the re-search was transparent).
Test 2: Competing TTLs (flight=30s, hotel=10s)
Flight + hotel combined booking. Hotel offers (10s TTL) frequently expire during model thinking.
| Model | Outcome | Strategy | Hotel Re-searches | Session |
|---|---|---|---|---|
| Claude Sonnet 4.5 | Booked | Hotel re-searched once, booked both | 1 | 04QMG |
| GPT-5.2 | search_only | Stuck trying hold_offer on expired offers, ran out of turns | 1 | Z0J1E |
| Gemini 3.1 Pro | Booked | Booking failed (hotel expired), re-searched hotels, booked both | 2 | MBZ9Z |
| Claude Opus 4.6 | Booked | Hotel re-searched twice, booked both | 2 | 49E97 |
3/4 models completed successfully. GPT-5.2 got stuck — it kept trying hold_offer on expired offers instead of re-searching, consuming all its turns. Gemini handled it cleanly: detected the hotel-specific failure and re-searched only hotels. Claude Opus recovered with hotel re-searches.
Test 3: Stale offer recovery (35-second wait)
Searched for flights and hotels, waited 35 seconds for the 30-second flight TTL to expire, then asked each model to book from the earlier results.
| Model | Outcome | Recovery Strategy | Re-searches | Price Flagged? | Session |
|---|---|---|---|---|---|
| Claude Sonnet 4.5 | Booked | Tried stale offer, got OFFER_EXPIRED, re-searched both, booked fresh | 2 | No | W5AE |
| GPT-5.2 | Booked | Tried hold_offer (expired x2), re-searched both, booked fresh | 2 | No | D89KV |
| Gemini 3.1 Pro | Booked | Tried stale offer, got OFFER_EXPIRED, re-searched both, booked fresh | 2 | No | Y7A8WV |
| Claude Opus 4.6 | Booked | Re-searched both proactively, booked fresh | 2 | No | NT7GY |
All 4 models recovered and completed booking. No model hallucinated a workaround or fabricated new offer IDs. Zero models flagged the price change that occurred due to the 15% volatility — every model silently booked at the new price.
Test 4: TTL awareness
All 4 models correctly identified when explicitly asked:
- The 30-second flight TTL and 10-second hotel TTL
- Expiry consequences (“fare no longer guaranteed,” “must re-search”)
-
hold_offeras a mitigation strategy
But no model proactively checked TTL before attempting to book in any of the other 5 tests. This is the key disconnect: models can reason about temporal validity, they just don't do it unless prompted.
Session replays: Claude Sonnet 4.5 · GPT-5.2 · Gemini 3.1 Pro · Claude Opus 4.6
Test 5: Price change awareness (explicit ask)
Same stale-offer flow as Test 3, but with an explicit instruction to flag price changes. Searched, waited 35 seconds, then asked to book with the price comparison prompt.
| Model | Flagged Change? | Detail | Session |
|---|---|---|---|
| Claude Sonnet 4.5 | Yes | Old $424.55 → new $392.42, “You save $32.13!” | YGJSG |
| GPT-5.2 | Yes | Old $437.47 → new $378.19, asked for confirmation | 0X7AC |
| Gemini 3.1 Pro | No | Re-searched but just showed new results without comparing | 0VMDH |
| Claude Opus 4.6 | Yes | Old $380.36 → new $396.08, “+$15.72”, offered alternatives | B21V |
3 out of 4 models correctly flagged price changes when instructed. Compare this to Test 3 where the same scenario without the instruction produced zero price change notifications. The capability exists — the default behavior doesn't use it.
Bonus: 3-second TTL stress test (local)
We also ran a local stress test with TRAVEL_FLIGHT_TTL=3 and TRAVEL_HOTEL_TTL=3 — shorter than any model's response time. This was run as an additional local stress test to reveal how models behave when offers expire during their thinking time.
| Model | Outcome | Searches | Strategy |
|---|---|---|---|
| GPT-5.2 | Booked | 2 | Re-searched, booked immediately in same tool call batch |
| Claude Sonnet 4.5 | Failed | 3 | Re-searched 3x, tried hold_offer, never fast enough |
| Gemini 3.1 Pro | Failed | 4 | Re-searched 4x, tried hold + direct book, never fast enough |
| Claude Opus 4.6 | Failed | 4 | Re-searched 4x, tried hold_offer twice, never fast enough |
| Grok 4 | Failed | 1 | Searched once, never attempted booking |
GPT-5.2 was the only model to complete a booking at 3-second TTL — it was fast enough to search and book in the same turn before the window closed. All other models correctly understood the error and re-searched, but their thinking time exceeded the TTL on every cycle. Grok 4 showed the worst behavior: it searched once and stopped, never even attempting to book.
To reproduce this stress test on your own deployment, set TRAVEL_FLIGHT_TTL=3 and TRAVEL_HOTEL_TTL=3 in your own deployment.
Protocol implication: The spec should recommend minimum TTLs. A validity window shorter than typical agent round-trip time (~5–10 seconds) is effectively a denial of service. We'd suggest 60 seconds as a floor, with 15–30 minutes as the recommended range for travel.
Protocol implications
-
Agents treat TTL as error-handling, not planning. No model reads
validity_window.valid_untiland proactively acts on it. They all try, fail, recover. This means the server MUST reject expired offers rather than relying on agent self-policing. -
The
OFFER_EXPIREDerror message is sufficient for recovery. Every model understood “search again” and did so. The current error format works — no new error primitives needed. -
Price change blindness is a real problem. Agents silently book at different prices unless explicitly told to compare. The protocol should either include
previous_pricein re-search results, or require aprice_change_acknowledgmentfield increate_booking. -
hold_offeris used by smart agents. GPT-5.2 proactively usedhold_offerin multiple tests. This validates the hold/release mechanism haunic proposed — though it can also backfire (GPT-5.2 got stuck holding expired offers in Test 2 instead of re-searching). - Asymmetric TTLs break some agents. When flight and hotel TTLs differ significantly, some models get stuck in retry loops instead of adopting a different strategy. The protocol should recommend minimum TTLs (60s floor) or allow agents to request extended holds.
Models understand temporal validity when asked about it. They don't act on it unprompted. The server must enforce expiry, not rely on agent self-policing. And price changes must be surfaced at the protocol level — agents won't compare prices on their own.
Reproduce it yourself
UCP Playground is an open testing environment for AI agent commerce — it runs shopping sessions against real and demo merchants across frontier models, recording every tool call, error, and recovery. Everything described here is live and testable on the production API. The sessions linked above are real — open any replay to see the full tool call sequence.
1. Get an API token
Sign in to UCP Playground, go to Settings → API Tokens, and create a token. See the Headless API docs for details.
2. List available models
curl https://ucpplayground.com/api/v1/models \\
-H "Authorization: Bearer YOUR_TOKEN"
3. Run any test
Use the curl commands from the Test Methodology section above, replacing YOUR_TOKEN with your API token and model with any model ID from the list.
4. Compare across models
Run the same prompt with different model values: claude-opus-4-6, claude-sonnet-4-5, gpt-5-2, gemini-3-1-pro, grok-4, gemini-2-5-flash, etc. The response format is identical across models, making side-by-side comparison straightforward.
Available routes
Flights: SFO↔JFK, LAX↔BOS, ORD↔LHR, ORD↔FRA. Hotels near: JFK (New York ×2), BOS (Boston), LHR (London), FRA (Frankfurt). Try multi-leg trips: “I need a flight from Chicago to London and a hotel in London for 5 nights.”
If you're working on UCP protocol extensions or building agent infrastructure for travel, join the discussion.
Top comments (0)