Benji Fisher

Posted on Apr 7 • Originally published at ucpchecker.com

We Tested 5 AI Models on Expiring Travel Inventory — Here's How They Failed

#ecommerce #webdev #ai #ucp

Last week, a conversation started in UCP Discussion #328 between contributors from Google, a global travel IT leader, Zolofy, and our team. The topic: UCP can't handle perishable inventory. Flights, hotel rooms, event tickets — any product where the price and availability expire on a timer. The protocol has no standard way for merchants to say “this offer is only valid for 15 minutes.”

So we built the primitive and tested it. Here's what happened.

The problem

UCP works well for retail. A pair of shoes doesn't expire while you're deciding. But travel inventory does. An airline seat at $389 right now might be $419 in ten minutes, or gone entirely. Hotel rooms get released. Event tickets get reassigned.

Haunic, who works at a global leader in IT solutions for the travel industry, put it plainly in the discussion: UCP has no hold/release mechanism, no way for merchants to signal when inventory expires. Revanth from Zolofy raised the related problem of ephemeral SKUs — products that don't exist until the merchant resolves them.

Both problems reduce to one missing primitive: temporal validity. An offer needs a TTL.

What we built

We added a travel demo merchant to UCP Playground with 9 MCP tools: search_flights, search_hotels, get_offer_details, hold_offer, create_booking, get_booking, add_ancillary, complete_booking, and cancel_booking.

The key addition: every search result includes a validity_window object.

{
  "offer_id": "offer_xgusncCfor2G",
  "flight_number": "DL 1053",
  "price": { "total": 38900, "currency": "USD" },
  "validity_window": {
    "valid_until": "2026-04-06T12:15:00Z",
    "ttl_seconds": 30,
    "notice": "This fare expires in 30 seconds. After expiry, re-search for current pricing."
  }
}

If an agent tries to book an expired offer, it gets an OFFER_EXPIRED error with instructions to re-search. Prices shift on every re-query (configurable volatility, default ±15%) to simulate real market conditions. TTLs are configurable via environment variables — production is set to 30 seconds for flights and 10 seconds for hotels, making expiry observable in real time.
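
A client can read the same field before it tries to book. The sketch below is a minimal illustration, assuming only the validity_window shape shown above; the helper names (parse_utc, seconds_remaining, is_expired) are hypothetical and not part of UCP or the Playground.

```python
from datetime import datetime, timezone


def parse_utc(timestamp: str) -> datetime:
    """Parse an ISO 8601 timestamp such as '2026-04-06T12:15:00Z'."""
    # Older Python versions reject a trailing 'Z', so normalise it first.
    return datetime.fromisoformat(timestamp.replace("Z", "+00:00"))


def seconds_remaining(offer: dict, now: datetime) -> float:
    """Seconds until validity_window.valid_until (negative once expired)."""
    valid_until = parse_utc(offer["validity_window"]["valid_until"])
    return (valid_until - now).total_seconds()


def is_expired(offer: dict, now: datetime) -> bool:
    """True when the offer can no longer be booked at the quoted price."""
    return seconds_remaining(offer, now) <= 0


# The example offer from the article, trimmed to the fields used here.
offer = {
    "offer_id": "offer_xgusncCfor2G",
    "price": {"total": 38900, "currency": "USD"},
    "validity_window": {"valid_until": "2026-04-06T12:15:00Z", "ttl_seconds": 30},
}

now = datetime(2026, 4, 6, 12, 14, 45, tzinfo=timezone.utc)
print(seconds_remaining(offer, now))  # 15.0
print(is_expired(offer, now))         # False
```

Passing `now` in explicitly, rather than reading the clock inside the helper, keeps the check easy to test against a fixed timestamp.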

We also included hold_offer — an explicit hold mechanism that extends an offer's validity by 5 minutes. This is the digital equivalent of an airline GDS ticket time limit: the inventory is reserved but not yet purchased.
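
To make the expiry and hold semantics concrete, here is a rough in-memory sketch. It is not the Playground's implementation: the OfferStore class, the OfferExpired exception, and the choice to add the 5 minutes to the existing expiry (rather than to the current time) are all assumptions made for illustration.

```python
from dataclasses import dataclass

HOLD_EXTENSION_SECONDS = 5 * 60  # the 5-minute hold described above


class OfferExpired(Exception):
    """Stand-in for the OFFER_EXPIRED error."""


@dataclass
class Offer:
    offer_id: str
    price_cents: int
    valid_until: float  # epoch seconds


class OfferStore:
    """Hypothetical server-side store that enforces expiry on every call."""

    def __init__(self):
        self._offers = {}

    def add(self, offer: Offer) -> None:
        self._offers[offer.offer_id] = offer

    def _live_offer(self, offer_id: str, now: float) -> Offer:
        offer = self._offers.get(offer_id)
        if offer is None or now >= offer.valid_until:
            raise OfferExpired(f"{offer_id} expired; re-search for current pricing")
        return offer

    def hold_offer(self, offer_id: str, now: float) -> Offer:
        # Only a live offer can be held; an expired one must be re-searched.
        offer = self._live_offer(offer_id, now)
        offer.valid_until += HOLD_EXTENSION_SECONDS
        return offer

    def create_booking(self, offer_id: str, now: float) -> dict:
        offer = self._live_offer(offer_id, now)
        return {"offer_id": offer.offer_id, "price_cents": offer.price_cents}
```

In this sketch a hold on an already expired offer fails the same way a booking does, which is the situation an agent runs into if it keeps calling hold_offer instead of re-searching.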

Available inventory

The demo server includes 6 flights (SFO→JFK, LAX→BOS, ORD→LHR, ORD→FRA across United, Delta, American, JetBlue, and Lufthansa), 5 hotels (New York, Boston, London, Frankfurt), and 4 ancillary services (travel insurance, priority boarding, extra legroom, airport lounge). All prices are in cents with configurable volatility.

Production configuration

The live Playground server runs with these environment variables, and anyone can reproduce these tests against the same configuration:

TRAVEL_FLIGHT_TTL=30      # Flight offers valid for 30 seconds
TRAVEL_HOTEL_TTL=10       # Hotel offers valid for 10 seconds
TRAVEL_PRICE_VOLATILITY=15  # ±15% price shift on re-search

This creates natural test conditions: hotel offers frequently expire during model thinking time (10s < typical model round-trip), flight offers survive for fast models but expire between turns, and price changes are noticeable on every re-search.
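
For readers building their own demo merchant, the configuration could be consumed roughly as follows. This is a sketch under stated assumptions: the load_config and reprice helpers are hypothetical, the TTL fallbacks simply mirror the production values above, and a uniform random shift is only one possible reading of "±15%".

```python
import os
import random


def load_config(env=os.environ) -> dict:
    """Read the three variables, falling back to the production values."""
    return {
        "flight_ttl": int(env.get("TRAVEL_FLIGHT_TTL", "30")),
        "hotel_ttl": int(env.get("TRAVEL_HOTEL_TTL", "10")),
        "volatility_pct": float(env.get("TRAVEL_PRICE_VOLATILITY", "15")),
    }


def reprice(base_cents: int, volatility_pct: float, rng: random.Random) -> int:
    """Shift a price by a uniform random percentage within ±volatility_pct."""
    shift = rng.uniform(-volatility_pct, volatility_pct) / 100
    return round(base_cents * (1 + shift))


config = load_config()
rng = random.Random(42)  # seeded only so repeated runs are comparable
print(reprice(38900, config["volatility_pct"], rng))
```

With a base fare of 38900 cents and 15% volatility, every re-search in this sketch lands between 33065 and 44735 cents.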

Test methodology

We ran 5 frontier models through 6 test scenarios, all reproduced on the production Playground Headless API. Every session is linkable and replayable. Exact prompts and API calls below.

Models tested: Claude Opus 4.6, Claude Sonnet 4.5, GPT-5.2, Gemini 3.1 Pro, and Grok 4 (all via OpenRouter).

Test 1: Happy path

Single-shot: can the model search and complete a booking before the 30-second flight TTL expires?

curl -X POST https://ucpplayground.com/api/v1/chat \
  -H "Authorization: Bearer YOUR_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "claude-sonnet-4-5",
    "domain": "demo-travel.ucpplayground.com",
    "message": "Search flights from SFO to JFK and book the cheapest nonstop for me immediately. My name is Alex Rivera, alex@example.com. Payment token: tok_demo. Do not ask for confirmation, just search and book."
  }'

Swap model for each provider. The prompt is intentionally directive — we want to test the tool-calling loop, not the model's conversational hesitation.

Test 2: Competing TTLs (flight=30s, hotel=10s)

Same single-shot approach, but asking for both flight and hotel. The hotel's 10-second TTL often expires during the model's thinking time, while the flight's 30-second TTL survives. Tests whether agents can handle partial expiry.

curl -X POST https://ucpplayground.com/api/v1/chat \
  -H "Authorization: Bearer YOUR_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "claude-sonnet-4-5",
    "domain": "demo-travel.ucpplayground.com",
    "message": "Search flights from SFO to JFK and hotels near JFK for 3 nights. Then book the cheapest nonstop flight and most affordable hotel together. My name is Alex Rivera, alex@example.com. Payment token: tok_demo. Do not ask for confirmation."
  }'

Test 3: Stale offer recovery (real expiry across turns)

Two-step test with a wait between turns. Step 1 — search:

curl -X POST https://ucpplayground.com/api/v1/chat \
  -H "Authorization: Bearer YOUR_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "claude-sonnet-4-5",
    "domain": "demo-travel.ucpplayground.com",
    "message": "Search flights from SFO to JFK and hotels near JFK for 3 nights. Show me the options with their offer IDs and validity windows."
  }'

Save the session_id from the response. Wait 35 seconds for the 30-second flight TTL to expire. Then Step 2 — book with stale offer IDs:

curl -X POST https://ucpplayground.com/api/v1/chat \
  -H "Authorization: Bearer YOUR_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "claude-sonnet-4-5",
    "domain": "demo-travel.ucpplayground.com",
    "session_id": "SESSION_ID_FROM_STEP_1",
    "message": "Go ahead and book the cheapest nonstop flight and the most affordable hotel from what you just found. My name is Alex Rivera, alex@example.com. Payment token: tok_demo."
  }'

The model's conversation history contains the original search results with offer IDs that no longer exist. It will attempt to book, receive OFFER_EXPIRED, and must decide how to recover.
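
The two-step flow can also be scripted. The sketch below reuses the endpoint, payload fields, and prompts from the curl commands above; the function names are hypothetical, and the HTTP call and the sleep are injected so the flow can be exercised without touching the live API.

```python
import json
import time
import urllib.request

API_URL = "https://ucpplayground.com/api/v1/chat"
DOMAIN = "demo-travel.ucpplayground.com"

SEARCH_PROMPT = (
    "Search flights from SFO to JFK and hotels near JFK for 3 nights. "
    "Show me the options with their offer IDs and validity windows."
)
BOOK_PROMPT = (
    "Go ahead and book the cheapest nonstop flight and the most affordable "
    "hotel from what you just found. My name is Alex Rivera, "
    "alex@example.com. Payment token: tok_demo."
)


def make_http_post(token: str):
    """Return a function that POSTs a JSON payload and decodes the JSON reply."""
    def post(payload: dict) -> dict:
        request = urllib.request.Request(
            API_URL,
            data=json.dumps(payload).encode("utf-8"),
            headers={
                "Authorization": f"Bearer {token}",
                "Content-Type": "application/json",
            },
            method="POST",
        )
        with urllib.request.urlopen(request) as reply:
            return json.load(reply)
    return post


def run_stale_offer_test(post, model: str, wait_seconds: int = 35, sleep=time.sleep):
    """Search, wait past the flight TTL, then ask to book in the same session."""
    search = post({"model": model, "domain": DOMAIN, "message": SEARCH_PROMPT})
    sleep(wait_seconds)  # let the 30-second flight offers expire
    booking = post({
        "model": model,
        "domain": DOMAIN,
        "session_id": search["session_id"],
        "message": BOOK_PROMPT,
    })
    return search, booking
```

A real run would look like `run_stale_offer_test(make_http_post("YOUR_TOKEN"), "claude-sonnet-4-5")`.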

Test 4: TTL awareness

Does the model read and reason about the validity_window field?

curl -X POST https://ucpplayground.com/api/v1/chat \
  -H "Authorization: Bearer YOUR_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "claude-sonnet-4-5",
    "domain": "demo-travel.ucpplayground.com",
    "message": "Search flights from SFO to JFK. Before you try to book anything, examine the validity_window field in the results. Tell me: how long are these offers valid? What happens if they expire? Should you hold them first?"
  }'

Test 5: Price change awareness (explicit ask)

Uses the Test 3 flow (stale offer recovery) but with an explicit instruction to flag price changes:

"message": "Book the Delta DL 1053 nonstop. My name is Alex Rivera, alex@example.com. Payment token: tok_demo. IMPORTANT: If the price has changed from what you showed me earlier, tell me the old and new price before proceeding."

Response fields to check

Each response includes structured data for analysis:

outcome — "checkout_reached", "search_only", or "failed"
tool_calls[] — each tool called, with name, arguments, result, duration_ms, and error (if any)
turn_count — how many model round-trips were needed
duration_ms — total wall-clock time
tokens.total — total tokens consumed
steps_completed[] — which funnel steps the agent reached
session_id — for continuing the conversation in a follow-up call

To check if a model re-searched after expiry, count occurrences of search_flights in tool_calls. To check if it flagged price changes, read the final assistant message in messages[].
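
These checks are easy to automate. The sketch below assumes tool_calls[] is listed in call order and that an expired booking surfaces the string OFFER_EXPIRED somewhere in the call's error field; the exact error shape is not documented here, so treat that match as a guess. The sample response is made up for illustration.

```python
def summarize_recovery(response: dict) -> dict:
    """Count flight searches and how many happened after an OFFER_EXPIRED error."""
    calls = response.get("tool_calls", [])
    expired_errors = 0
    flight_searches = 0
    searches_after_expiry = 0
    for call in calls:
        if "OFFER_EXPIRED" in str(call.get("error") or ""):
            expired_errors += 1
        if call.get("name") == "search_flights":
            flight_searches += 1
            if expired_errors > 0:
                searches_after_expiry += 1
    return {
        "outcome": response.get("outcome"),
        "flight_searches": flight_searches,
        "expired_errors": expired_errors,
        "searches_after_expiry": searches_after_expiry,
    }


# Illustrative response only; real responses carry more fields per call.
sample = {
    "outcome": "checkout_reached",
    "tool_calls": [
        {"name": "create_booking", "error": "OFFER_EXPIRED"},
        {"name": "search_flights", "error": None},
        {"name": "create_booking", "error": None},
    ],
}
print(summarize_recovery(sample))
```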

Results

Test 1: Happy path

All 5 models with 30-second flight TTL, 10-second hotel TTL.

Model	Outcome	Turns	Duration	Tokens	Session
Claude Sonnet 4.5	Booked	4	15.6s	17,726	K09HVV
GPT-5.2	Booked	4	12.0s	13,057	EBHN6
Gemini 3.1 Pro	Booked	4	29.2s	12,801	4CN4G
Grok 4	Booked	4	38.5s	20,268	BR5QX
Claude Opus 4.6	Booked	5	20.9s	23,071	GDRD

Test 1: Five frontier models booking flights with 30-second offer TTL

All 5 models completed booking. Four finished inside the 30-second flight window. Grok 4 took 38.5 seconds in total, longer than the TTL, so it likely booked just before expiry or hit an expiry and re-searched transparently.

Test 2: Competing TTLs (flight=30s, hotel=10s)

Flight + hotel combined booking. Hotel offers (10s TTL) frequently expire during model thinking.

Model	Outcome	Strategy	Hotel Re-searches	Session
Claude Sonnet 4.5	Booked	Hotel re-searched once, booked both	1	04QMG
GPT-5.2	search_only	Stuck trying hold_offer on expired offers, ran out of turns	1	Z0J1E
Gemini 3.1 Pro	Booked	Booking failed (hotel expired), re-searched hotels, booked both	2	MBZ9Z
Claude Opus 4.6	Booked	Hotel re-searched twice, booked both	2	49E97

Test 2: Competing TTLs — hotel offers expire at 10s while flight offers survive at 30s

3/4 models completed successfully. GPT-5.2 got stuck — it kept trying hold_offer on expired offers instead of re-searching, consuming all its turns. Gemini handled it cleanly: detected the hotel-specific failure and re-searched only hotels. Claude Opus recovered with hotel re-searches.

Test 3: Stale offer recovery (35-second wait)

Searched for flights and hotels, waited 35 seconds for the 30-second flight TTL to expire, then asked each model to book from the earlier results.

Model	Outcome	Recovery Strategy	Re-searches	Price Flagged?	Session
Claude Sonnet 4.5	Booked	Tried stale offer, got OFFER_EXPIRED, re-searched both, booked fresh	2	No	W5AE
GPT-5.2	Booked	Tried hold_offer (expired x2), re-searched both, booked fresh	2	No	D89KV
Gemini 3.1 Pro	Booked	Tried stale offer, got OFFER_EXPIRED, re-searched both, booked fresh	2	No	Y7A8WV
Claude Opus 4.6	Booked	Re-searched both proactively, booked fresh	2	No	NT7GY

Test 3: Stale offer recovery — models attempt booking with expired offer IDs after 35-second wait

All 4 models recovered and completed booking. No model hallucinated a workaround or fabricated new offer IDs. But no model flagged the price change caused by the ±15% volatility: every model silently booked at the new price.

Test 4: TTL awareness

When explicitly asked, all 4 models correctly identified:

The 30-second flight TTL and 10-second hotel TTL
Expiry consequences (“fare no longer guaranteed,” “must re-search”)
hold_offer as a mitigation strategy

But no model proactively checked TTL before attempting to book in any of the other 5 tests. This is the key disconnect: models can reason about temporal validity, they just don't do it unless prompted.
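
What "acting on it" could look like is a small planning step before any booking call. The heuristic below is purely illustrative: the function name, the safety margin, and the three action labels are assumptions, and a real agent would need a measured estimate of its own round-trip time.

```python
def plan_next_step(seconds_left: float, est_round_trip_s: float,
                   safety_margin_s: float = 2.0) -> str:
    """Pick an action from the time left on an offer.

    seconds_left      time until validity_window.valid_until
    est_round_trip_s  the agent's own estimate of one model round-trip
    """
    if seconds_left <= 0:
        # Already expired: booking or holding would only return OFFER_EXPIRED.
        return "re_search"
    if seconds_left > est_round_trip_s + safety_margin_s:
        # Enough headroom to book directly.
        return "book"
    # Still live but tight: extend the window before committing.
    return "hold_then_book"


print(plan_next_step(25, est_round_trip_s=8))  # book
print(plan_next_step(5, est_round_trip_s=8))   # hold_then_book
print(plan_next_step(-3, est_round_trip_s=8))  # re_search
```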

Session replays: Claude Sonnet 4.5 · GPT-5.2 · Gemini 3.1 Pro · Claude Opus 4.6

Test 5: Price change awareness (explicit ask)

Same stale-offer flow as Test 3, but with an explicit instruction to flag price changes. Searched, waited 35 seconds, then asked to book with the price comparison prompt.

Model	Flagged Change?	Detail	Session
Claude Sonnet 4.5	Yes	Old $424.55 → new $392.42, “You save $32.13!”	YGJSG
GPT-5.2	Yes	Old $437.47 → new $378.19, asked for confirmation	0X7AC
Gemini 3.1 Pro	No	Re-searched but just showed new results without comparing	0VMDH
Claude Opus 4.6	Yes	Old $380.36 → new $396.08, “+$15.72”, offered alternatives	B21V

Test 5: Price change awareness — models asked to flag price differences after offer expiry

3 out of 4 models correctly flagged price changes when instructed. Compare this to Test 3 where the same scenario without the instruction produced zero price change notifications. The capability exists — the default behavior doesn't use it.
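
The comparison itself is trivial once the old price is kept around. A minimal sketch, assuming prices in cents as in the demo server; the function names are hypothetical and the figures come from the Test 5 table.

```python
def format_cents(cents: int) -> str:
    """Render integer cents as dollars without floating-point rounding."""
    return f"${cents // 100}.{cents % 100:02d}"


def describe_price_change(old_cents: int, new_cents: int):
    """Return a one-line summary of the change, or None if the price held."""
    if old_cents == new_cents:
        return None
    delta = new_cents - old_cents
    sign = "+" if delta > 0 else "-"
    return (f"{format_cents(old_cents)} -> {format_cents(new_cents)} "
            f"({sign}{format_cents(abs(delta))})")


print(describe_price_change(38036, 39608))  # $380.36 -> $396.08 (+$15.72)
print(describe_price_change(42455, 39242))  # $424.55 -> $392.42 (-$32.13)
```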

Bonus: 3-second TTL stress test (local)

We also ran a local stress test with TRAVEL_FLIGHT_TTL=3 and TRAVEL_HOTEL_TTL=3, shorter than any model's response time, to see how models behave when every offer expires during their thinking time.

Model	Outcome	Searches	Strategy
GPT-5.2	Booked	2	Re-searched, booked immediately in same tool call batch
Claude Sonnet 4.5	Failed	3	Re-searched 3x, tried hold_offer, never fast enough
Gemini 3.1 Pro	Failed	4	Re-searched 4x, tried hold + direct book, never fast enough
Claude Opus 4.6	Failed	4	Re-searched 4x, tried hold_offer twice, never fast enough
Grok 4	Failed	1	Searched once, never attempted booking

GPT-5.2 was the only model to complete a booking at 3-second TTL — it was fast enough to search and book in the same turn before the window closed. All other models correctly understood the error and re-searched, but their thinking time exceeded the TTL on every cycle. Grok 4 showed the worst behavior: it searched once and stopped, never even attempting to book.

To reproduce this stress test, set TRAVEL_FLIGHT_TTL=3 and TRAVEL_HOTEL_TTL=3 in your own deployment.

Protocol implication: The spec should recommend minimum TTLs. A validity window shorter than typical agent round-trip time (~5–10 seconds) is effectively a denial of service. We'd suggest 60 seconds as a floor, with 15–30 minutes as the recommended range for travel.

Protocol implications

Agents treat TTL as error-handling, not planning. No model reads validity_window.valid_until and proactively acts on it. They all try, fail, recover. This means the server MUST reject expired offers rather than relying on agent self-policing.
The OFFER_EXPIRED error message is sufficient for recovery. Every model understood “search again” and did so. The current error format works — no new error primitives needed.
Price change blindness is a real problem. Agents silently book at different prices unless explicitly told to compare. The protocol should either include previous_price in re-search results, or require a price_change_acknowledgment field in create_booking.
hold_offer is used by smart agents. GPT-5.2 proactively used hold_offer in multiple tests. This validates the hold/release mechanism Haunic proposed, though it can also backfire (GPT-5.2 got stuck holding expired offers in Test 2 instead of re-searching).
Asymmetric TTLs break some agents. When flight and hotel TTLs differ significantly, some models get stuck in retry loops instead of adopting a different strategy. The protocol should recommend minimum TTLs (60s floor) or allow agents to request extended holds.

Models understand temporal validity when asked about it. They don't act on it unprompted. The server must enforce expiry, not rely on agent self-policing. And price changes must be surfaced at the protocol level — agents won't compare prices on their own.
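
As one possible shape for the price_change_acknowledgment idea, a server could refuse a booking at a changed price until the agent echoes both figures back. Nothing below exists in UCP today: the request layout, the new_price key, and the exception are hypothetical, chosen only to illustrate the proposal.

```python
class PriceChangeNotAcknowledged(Exception):
    """Raised when a booking at a changed price lacks an acknowledgment."""


def require_price_acknowledgment(previous_price: int, current_price: int,
                                 booking_request: dict) -> None:
    """Reject create_booking unless a price change was explicitly acknowledged.

    previous_price  cents quoted on the offer the agent last showed the user
    current_price   cents on the fresh offer being booked
    """
    if previous_price == current_price:
        return  # nothing to acknowledge
    ack = booking_request.get("price_change_acknowledgment") or {}
    if (ack.get("previous_price") != previous_price
            or ack.get("new_price") != current_price):
        raise PriceChangeNotAcknowledged(
            f"price changed from {previous_price} to {current_price} cents; "
            "acknowledgment required"
        )


request = {
    "offer_id": "offer_new",
    "price_change_acknowledgment": {"previous_price": 38036, "new_price": 39608},
}
require_price_acknowledgment(38036, 39608, request)  # passes silently
```

Requiring both figures forces the agent to have read the old and new price, which is the comparison that did not happen unprompted in Test 3.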

Reproduce it yourself

UCP Playground is an open testing environment for AI agent commerce — it runs shopping sessions against real and demo merchants across frontier models, recording every tool call, error, and recovery. Everything described here is live and testable on the production API. The sessions linked above are real — open any replay to see the full tool call sequence.

1. Get an API token

2. List available models

curl https://ucpplayground.com/api/v1/models \
  -H "Authorization: Bearer YOUR_TOKEN"

3. Run any test

Use the curl commands from the Test Methodology section above, replacing YOUR_TOKEN with your API token and model with any model ID from the list.

4. Compare across models

Run the same prompt with different model values: claude-opus-4-6, claude-sonnet-4-5, gpt-5-2, gemini-3-1-pro, grok-4, gemini-2-5-flash, etc. The response format is identical across models, making side-by-side comparison straightforward.
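
Because the response format is shared, a few lines are enough to line results up. The sketch below uses only the response fields listed earlier; the comparison_rows helper is hypothetical, and the sample values are the Test 1 figures for two models, with "Booked" mapped to the checkout_reached outcome as an assumption.

```python
def comparison_rows(responses_by_model: dict) -> list:
    """Build (model, outcome, turns, seconds, tokens) rows, fastest first."""
    rows = []
    for model, response in responses_by_model.items():
        rows.append((
            model,
            response.get("outcome"),
            response.get("turn_count"),
            response.get("duration_ms", 0) / 1000,
            response.get("tokens", {}).get("total"),
        ))
    return sorted(rows, key=lambda row: row[3])


responses = {
    "claude-sonnet-4-5": {"outcome": "checkout_reached", "turn_count": 4,
                          "duration_ms": 15600, "tokens": {"total": 17726}},
    "gpt-5-2": {"outcome": "checkout_reached", "turn_count": 4,
                "duration_ms": 12000, "tokens": {"total": 13057}},
}

for model, outcome, turns, seconds, tokens in comparison_rows(responses):
    print(f"{model}\t{outcome}\t{turns}\t{seconds:.1f}s\t{tokens}")
```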

Available routes

Flights: SFO↔JFK, LAX↔BOS, ORD↔LHR, ORD↔FRA. Hotels near: JFK (New York ×2), BOS (Boston), LHR (London), FRA (Frankfurt). Try multi-leg trips: “I need a flight from Chicago to London and a hotel in London for 5 nights.”

If you're working on UCP protocol extensions or building agent infrastructure for travel, join the discussion.
