I've been working on a side project that aggregates data from a public API. You know the drill: feed in some endpoints, normalize the output, and display it in a dashboard. Sounds simple, right?
Until the API decides to change its field names without notice.
I'm not talking about versioned breaking changes. I'm talking about the same endpoint returning user_name one day and username the next. Sometimes email is present, sometimes it's missing. And nested objects? Don't get me started—a nested address block might show up as a string one time and an object another time. The external service was built by a startup that's still iterating, and they don't have a stable release cycle.
My initial reaction was pure denial: "I'll just write a quick parser and move on."
What I Tried First (And Why It Failed)
My first attempt was a set of manual Python functions that tried to extract fields with fallbacks:
def get_user_email(data):
    if 'email' in data:
        return data['email']
    if 'contact_email' in data:
        return data['contact_email']
    return None
I added more and more of these conditional chains. Soon I had a 300-line file full of if-else blocks that only handled the top-level fields. Every time the API changed, I had to update three or four different places. And nested data? Forget it. A change in the address key cascaded into half a dozen functions.
Then I tried using pydantic models with extra fields set to allow. That worked for missing fields, but it didn't help when the same field changed type or when keys were renamed. I ended up with validation errors that killed the entire parse. Production would crash because somebody at the startup renamed phone to phone_number.
I needed something smarter. Something that could adapt without me rewriting code every week.
The Approach That Actually Worked: Dynamic, Recursive Parsing with Default Strategies
The key insight: I needed a parser that could handle both structural drift (keys moving, being renamed) and type drift (string vs object vs list). Manual schema mapping wasn't scalable. So I built a small recursive parser in Python that uses a "best effort" strategy—it tries multiple known patterns and returns a consistent shape, even if some data is lost.
Here's the core idea: define a target schema for the output you want, then write a function that crawls the raw API response, looking for keys that match your schema's field names. It tries an exact match first, then string similarity (difflib's get_close_matches, which ranks candidates by a SequenceMatcher ratio rather than true Levenshtein distance); a table of common synonyms can be layered on top. For each field, attempt to extract the value; if it's a nested object, recurse with the corresponding nested schema.
from difflib import get_close_matches

def fuzzy_extract(data, schema, fuzzy_threshold=0.8):
    """Extract data from JSON following a target schema, with fallback."""
    result = {}
    for field_name, field_type in schema.items():
        # Try exact match first
        if field_name in data:
            raw = data[field_name]
        else:
            # Try fuzzy match against keys
            keys = list(data.keys())
            matches = get_close_matches(field_name, keys, n=1, cutoff=fuzzy_threshold)
            if matches:
                raw = data[matches[0]]
            else:
                # Nothing close enough: use the type's default() if it has one
                result[field_name] = field_type.default() if hasattr(field_type, 'default') else None
                continue
        # Now attempt conversion based on expected type
        if isinstance(field_type, dict):
            # Nested object: recurse if raw is a dict, keeping the same threshold
            if isinstance(raw, dict):
                result[field_name] = fuzzy_extract(raw, field_type, fuzzy_threshold)
            else:
                result[field_name] = None
        elif field_type is str:
            result[field_name] = str(raw) if raw is not None else None
        elif field_type is int:
            try:
                result[field_name] = int(raw)
            except (ValueError, TypeError):
                result[field_name] = None
        elif field_type is list or isinstance(field_type, list):
            # The schema may say `list` or give an item schema such as [{...}]
            if isinstance(raw, list):
                item_schema = field_type[0] if isinstance(field_type, list) and field_type else None
                if item_schema and isinstance(item_schema, dict):
                    # List of objects with the same schema: recurse on each dict item
                    result[field_name] = [
                        fuzzy_extract(item, item_schema, fuzzy_threshold) if isinstance(item, dict) else None
                        for item in raw
                    ]
                else:
                    result[field_name] = raw
            else:
                result[field_name] = None
        else:
            result[field_name] = raw
    return result
To use it, I define a target schema:
user_schema = {
    'name': str,
    'email': str,
    'phone': str,
    'address': {  # nested object
        'street': str,
        'city': str,
        'zip': str
    },
    'tags': [{}]  # list of objects (tag_name, etc.)
}

# Then one call:
cleaned_user = fuzzy_extract(api_response, user_schema)
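One caveat about the 0.8 cutoff: difflib scores username against user_name at about 0.94, but phone against phone_number at only about 0.59, so fuzzy matching alone misses the longer renames. Below is a minimal sketch of a three-step key lookup (exact, then synonyms, then fuzzy). The ALIASES table and the resolve_key helper are hypothetical names made up for illustration; they are not part of the parser above.

```python
from difflib import get_close_matches

# Hypothetical alias table: target field -> alternate spellings seen upstream.
ALIASES = {
    "name": ["user_name", "username"],
    "email": ["contact_email"],
    "phone": ["phone_number"],
}

def resolve_key(field_name, data, cutoff=0.8):
    """Return the key in `data` that stands for `field_name`, or None."""
    if field_name in data:                      # 1. exact match
        return field_name
    for alias in ALIASES.get(field_name, []):   # 2. known synonym
        if alias in data:
            return alias
    # 3. fuzzy match on whatever keys are present
    matches = get_close_matches(field_name, list(data), n=1, cutoff=cutoff)
    return matches[0] if matches else None

raw = {"user_name": "Ada", "phone_number": "555-0100", "e_mail": "ada@example.com"}

# Fuzzy matching on its own catches the short rename but not the long one.
short_rename = get_close_matches("username", ["user_name"], n=1, cutoff=0.8)
long_rename = get_close_matches("phone", ["phone_number"], n=1, cutoff=0.8)

resolved = {field: resolve_key(field, raw) for field in ("name", "phone", "email", "zip")}
print(short_rename, long_rename)
print(resolved)
```

Checking the synonym table before the fuzzy step keeps the known renames deterministic, and leaves similarity scoring as a last resort for spellings nobody has seen yet.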
Now here's where an AI tool I found came in handy. While crafting this recursive parser, I realized that writing these schemas by hand for dozens of endpoints was tedious. So I experimented with using an AI code generator (specifically one hosted at ai.interwestinfo.com) to produce the initial schema from sample API responses. I fed it a few raw JSON responses, and it returned a starter schema with str, int, dict placeholders. I then manually adjusted the fuzzy threshold and added synonyms.
The AI wasn't perfect; sometimes it mistook an optional field as required, or it ignored type unions. But it saved me hours of typing boilerplate. The real win was the recursive parsing technique itself, which now handles 95% of the weird variations without crashing.
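For comparison, a starter schema can also be derived without any AI. The sketch below is a deterministic baseline and not what the AI tool produced; infer_schema is a hypothetical helper, and it assumes a single representative sample. Because it only sees one response, it has the same blind spots mentioned above: it cannot tell optional fields from required ones, and it cannot express type unions.

```python
def infer_schema(sample):
    """Build a starter schema, in the shape fuzzy_extract expects, from one sample dict."""
    schema = {}
    for key, value in sample.items():
        if isinstance(value, dict):
            # Nested object: describe it with a nested schema
            schema[key] = infer_schema(value)
        elif isinstance(value, list):
            first = value[0] if value else None
            # List of objects gets an item schema; anything else is a plain list
            schema[key] = [infer_schema(first)] if isinstance(first, dict) else list
        elif isinstance(value, (bool, int, float)):
            schema[key] = type(value)
        else:
            # Strings and nulls both default to str (an assumption; adjust by hand)
            schema[key] = str
    return schema

sample = {
    "name": "Ada",
    "age": 36,
    "address": {"city": "London", "zip": "02134"},
    "tags": [{"tag_name": "admin"}],
    "scores": [1, 2],
}
starter = infer_schema(sample)
print(starter)
```

Either way, the output is only a starting point: the synonyms and the fuzzy threshold still have to be tuned by hand.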
What I Learned (And What I'd Do Differently)
Trade-offs:
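- Fuzzy matching is slow on huge datasets: every record costs roughly O(n * m) string comparisons, for n schema fields without an exact match and m keys in the response. For my 10k records it's fine, but for millions, you'd want to precompute a mapping.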
- Losing precision: if the API decides to send zip codes as integers, the leading zeros are already gone before my parser sees them, and converting back with str() won't restore them. I added a special case for strings.
- The AI-generated schema required manual tuning. I wouldn't trust it blindly in production.
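The "precompute a mapping" idea from the first trade-off could look something like the sketch below. It assumes flat records, and key_mapping is a hypothetical helper rather than part of the parser above. The field-to-key mapping is cached per response shape, so the fuzzy search runs once for each distinct set of keys instead of once per record.

```python
from difflib import get_close_matches
from functools import lru_cache

@lru_cache(maxsize=1024)
def key_mapping(schema_fields, response_keys, cutoff=0.8):
    """Map each schema field to a response key (or None) for one response shape.

    Both arguments must be tuples so the call is hashable and cacheable.
    Treat the returned dict as read-only, since the cache hands out the same object.
    """
    mapping = {}
    for field in schema_fields:
        if field in response_keys:
            mapping[field] = field
        else:
            match = get_close_matches(field, response_keys, n=1, cutoff=cutoff)
            mapping[field] = match[0] if match else None
    return mapping

FIELDS = ("username", "email")
records = [
    {"user_name": "Ada", "email": "ada@example.com"},
    {"user_name": "Bob", "email": "bob@example.com"},
]

rows = []
for record in records:
    # Sorting the keys makes records with the same shape share one cache entry
    mapping = key_mapping(FIELDS, tuple(sorted(record)))
    rows.append({field: (record[key] if key else None) for field, key in mapping.items()})

print(rows)
print(key_mapping.cache_info())
```

With two records of the same shape, the second lookup is a cache hit, so the cost of fuzzy matching scales with the number of distinct shapes rather than the number of records.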
What I'd change: Instead of a hand-rolled fuzzy extractor, I'd probably use a library like json-schema-validator combined with json-diff. But for a quick-and-dirty side project, this worked.
Better option for teams: If you have control over the API, enforce an OpenAPI spec with strict validation. This whole mess is a symptom of an unstable API. But sometimes you inherit someone else's chaos.
Final Thoughts
I still hate unpredictable APIs. But now I have a tool (the technique, not just the AI assistant) that lets me sleep better when the upstream team goes wild. The recursive parser with fallback strategies has saved my weekend more than once.
How do you handle ever-changing third-party data? Have you tried a similar approach, or do you just write one massive try-except? Let me know.