OpenAI lets you export your entire ChatGPT conversation history as a zip of JSON shards. Most people download it, look at the zip, and forget it exists. I decided to actually parse mine.
The export is straightforward: `conversations-000.json` through `conversations-NNN.json`, each containing an array of conversation objects. Each conversation has a `title`, `create_time`, `update_time`, and a `mapping` field — a tree of message nodes linked by parent/children.
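A simplified, illustrative shape of one conversation object (field names as above; IDs and values are made up):

```json
{
  "title": "Example conversation",
  "create_time": 1696118400.0,
  "update_time": 1696122000.0,
  "mapping": {
    "root": { "message": null, "parent": null, "children": ["n1"] },
    "n1": {
      "message": {
        "author": { "role": "user" },
        "content": { "content_type": "text", "parts": ["Hello"] }
      },
      "parent": "root",
      "children": []
    }
  }
}
```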
1,058 conversations spanning September 2023 → April 2026. Here's the full pipeline and what it revealed.
The parser
```python
import json
from pathlib import Path
from datetime import datetime, timezone
from collections import Counter

RAW = Path('ChatGPT-Export/raw')
shards = sorted(RAW.glob('conversations-*.json'))

def walk_messages(mapping):
    """Collect user/assistant text along the left-most path of the tree."""
    msgs = []

    def visit(nid):
        node = mapping.get(nid, {})
        msg = node.get('message')
        if msg:
            role = msg.get('author', {}).get('role')
            content = msg.get('content', {})
            if content.get('content_type') == 'text':
                parts = content.get('parts', [])
                text = '\n'.join(p for p in parts if isinstance(p, str))
                if text.strip() and role in ('user', 'assistant'):
                    msgs.append({'role': role, 'text': text})
        children = node.get('children', [])
        if children:
            visit(children[0])  # left-most branch only, matching the UI view

    for nid, node in mapping.items():
        if not node.get('parent'):  # the root node has no parent
            visit(nid)
            break
    return msgs

all_convos = []
for shard in shards:
    with open(shard) as f:
        data = json.load(f)
    convos = data if isinstance(data, list) else data.get('conversations', [])
    all_convos.extend(convos)
```
That's it for extraction. `mapping` is a tree because ChatGPT supports message branching (regenerated responses create sibling children under the same parent). The walk takes the left-most path, which matches what you'd see in the UI.
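To make the branching concrete, here's a toy mapping where one assistant reply was regenerated, plus a minimal left-most walk (IDs and texts are invented for illustration):

```python
# Toy mapping: the user asks once, the reply was regenerated, so two
# assistant nodes share the same parent. IDs are illustrative.
mapping = {
    'root': {'message': None, 'parent': None, 'children': ['q1']},
    'q1': {
        'message': {'author': {'role': 'user'},
                    'content': {'content_type': 'text', 'parts': ['Hi']}},
        'parent': 'root', 'children': ['a1', 'a2'],  # a2 = regenerated reply
    },
    'a1': {
        'message': {'author': {'role': 'assistant'},
                    'content': {'content_type': 'text', 'parts': ['First answer']}},
        'parent': 'q1', 'children': [],
    },
    'a2': {
        'message': {'author': {'role': 'assistant'},
                    'content': {'content_type': 'text', 'parts': ['Regenerated answer']}},
        'parent': 'q1', 'children': [],
    },
}

def leftmost_texts(mapping):
    """Follow children[0] from the root, collecting message texts."""
    root = next(nid for nid, n in mapping.items() if not n.get('parent'))
    texts, nid = [], root
    while nid is not None:
        node = mapping[nid]
        msg = node.get('message')
        if msg:
            texts.append(msg['content']['parts'][0])
        children = node.get('children', [])
        nid = children[0] if children else None
    return texts

print(leftmost_texts(mapping))  # ['Hi', 'First answer'] — a2 is never visited
```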
Topic classification
The interesting part is classification. I built a simple keyword bucket matcher:
```python
TOPIC_KEYWORDS = {
    'self-reflection': ['reflect', 'journaling', 'feeling', 'wondering if'],
    'coding': ['function', 'bug', 'error', 'API', 'debug'],
    'learning': ['explain', 'what is', 'how does', 'tutorial'],
    'writing': ['draft', 'edit', 'rewrite', 'improve'],
    'business-concept': ['startup', 'business idea', 'monetize', 'SaaS'],
    # ... more buckets
}

def classify(conv):
    # extract_first_user_message is a small helper defined elsewhere in the script
    text = (conv['title'] or '') + ' ' + extract_first_user_message(conv)
    text_lower = text.lower()
    # lowercase both sides so keywords like 'API' and 'SaaS' still match
    return [t for t, kws in TOPIC_KEYWORDS.items()
            if any(k.lower() in text_lower for k in kws)]
```
Naive keyword matching is limited — my first pass dumped 651 conversations into the catch-all other bucket because my keywords were too narrow. I expanded the keyword lists and added better patterns, and the other bucket dropped to 171. Good enough for a first pass without needing LLM classification.
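With a classifier in hand, the per-topic counts in the next section are just a `Counter` tally. A self-contained sketch (buckets trimmed to two, titles invented; the real script also feeds in the first user message):

```python
from collections import Counter

TOPIC_KEYWORDS = {
    'coding': ['function', 'bug', 'error', 'debug'],
    'writing': ['draft', 'edit', 'rewrite', 'improve'],
}

def classify_title(title):
    text = title.lower()
    return [t for t, kws in TOPIC_KEYWORDS.items()
            if any(k in text for k in kws)] or ['other']

convos = [
    {'title': 'Fix a bug in my parser'},
    {'title': 'Rewrite and improve this draft'},
    {'title': 'Trip planning'},
]

# Multi-label: one conversation can land in several buckets, so the
# per-topic counts can sum to more than the number of conversations.
counts = Counter(topic for c in convos for topic in classify_title(c['title']))
print(counts.most_common())
```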
The distribution
Here's what the 1,058 conversations looked like by topic (top 15):
self-reflection: 443
coding: 403
learning: 357
content-creation: 350
fitness: 333
writing: 287
finance: 274
grad-school: 188
other: 171
robotics: 156
business-concept: 148
hardware-projects: 138
career: 109
travel: 93
research-strategy: 73
The surprise
Self-reflection is the #1 topic at 443 conversations. I expected coding to dominate — it's usually framed as "the AI developer tool" — and it's a strong #2 at 403. But the honest usage pattern is that ChatGPT has been a thinking partner at least as much as a coding assistant.
If you're building AI products that compete on "we're the best coding copilot" or "the smartest search," you might be missing the actual primary use case: people use these tools to process their own thoughts. Not ask questions. Not get answers. Just think out loud with a responsive interlocutor.
That reframes what "AI app" means. An app that's good at helping you think is a different product from one that's good at giving you correct answers. The first requires patience, memory, and nuance. The second requires accuracy and speed. Most AI tools optimize for the second because it's easier to benchmark — but the first is where durable engagement lives.
The second surprise
By year:
2023: 169 (partial — Sept onwards)
2024: 485 (peak)
2025: 336
2026: 68 (partial — first ~100 days)
2024 was my heaviest usage year by a wide margin. I'd predicted 2025 would be heavier because I had more going on, but the data disagrees. A plausible read: in 2024, I had more unstructured exploration time — figuring out what to work on, processing a life transition. In 2025, I had my direction locked in and used the tool more surgically.
The export reveals your life texture in a way that's genuinely weird to see charted.
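The per-year counts above come straight from the `create_time` epoch timestamps. A minimal sketch, with illustrative timestamps in place of the real export:

```python
from collections import Counter
from datetime import datetime, timezone

# create_time is epoch seconds in the export; these values are illustrative.
convos = [
    {'create_time': 1695600000.0},   # a 2023 conversation
    {'create_time': 1717200000.0},   # a 2024 conversation
    {'create_time': 1748736000.0},   # a 2025 conversation
]

by_year = Counter(
    datetime.fromtimestamp(c['create_time'], tz=timezone.utc).year
    for c in convos
)
print(sorted(by_year.items()))
```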
The business-concept bucket
One specific finding I didn't expect: the business-concept bucket holds 148 conversations. My original narrow-keyword pass caught only 2 — I thought I'd had a handful of "what should I build" brainstorms. Expanding the keywords revealed 148 genuine entrepreneurial ideation conversations spanning the full 2.5 years.
That means I've been thinking about building something ~60 times a year, ~once a week, for 2.5 years. None of them shipped until very recently. The gap between ideation and execution is the real lesson — your ChatGPT export is a very blunt accountability tool.
Full indexing script
The complete script I wrote does more: frontmattered markdown output per conversation, date-sorted indexes, per-topic files, a safe_for_sharing flag that blocks personal content from being pushed to any wider-access store. It's ~400 lines but the core is the parse loop above.
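The per-conversation markdown output can be sketched like this — frontmatter keys, the function name, and the filename scheme are my illustrative assumptions, not the script's exact schema:

```python
from datetime import datetime, timezone
from pathlib import Path

def write_conversation_md(conv, msgs, out_dir):
    """Write one conversation as a frontmattered markdown file.
    Frontmatter keys and filename format here are illustrative."""
    created = datetime.fromtimestamp(conv['create_time'], tz=timezone.utc)
    lines = [
        '---',
        f'title: {conv.get("title") or "Untitled"}',
        f'date: {created.date().isoformat()}',
        '---',
        '',
    ]
    for m in msgs:
        lines.append(f'**{m["role"]}:**\n\n{m["text"]}\n')
    name = f'{created.date().isoformat()}-{(conv.get("title") or "untitled")[:40]}.md'
    out = Path(out_dir) / name
    out.write_text('\n'.join(lines))
    return out
```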
Run it on your own export:
```shell
# Download export from ChatGPT Settings → Data Controls → Export
# Extract the zip, then point the script at the conversations-*.json shards
python index_chatgpt_export.py
```
You'll get a markdown file per conversation (grep-able, editable), topic counts, and a date range. Takes about 10 seconds on 1,000 conversations — the JSON parsing is the bottleneck, not the classification.
The takeaway
The export is free data about your own thinking patterns. It's richer than any journaling app because it's already the actual record. The only thing stopping most people from learning from it is the 30 minutes of Python it takes to turn the JSON into something readable.
If you use ChatGPT more than occasionally, this is worth doing. The distribution you find will probably surprise you the way mine did.