Chunk clean article content for embeddings, summarization, and full-text search—skip nav, clap bars, and scripts.
Extract Plain Text from Medium Posts for RAG and Search Indexes
HTML embeds are for humans; plain text is for chunking, embeddings, and summarization. One call should return body text without nav, clap bars, or script tags.
Tool outcome: ingest-medium-article.ts → chunked documents in your vector DB.
Pipeline
- Discover ids via user feed or search.
- GET /article/{id}/content → plain text.
- Optional: GET /article/{id} for title, tags, author metadata.
- Chunk → embed → upsert vector store.
- Query in your chat UI or internal search.
Ingest script
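const API = 'https://api.zenndra.com';
const headers = { Authorization: `Bearer ${process.env.ZENNDRA_API_KEY}` };

// Fetch JSON and fail loudly on non-2xx responses instead of parsing an error body.
async function getJson(url: string) {
  const res = await fetch(url, { headers });
  if (!res.ok) throw new Error(`GET ${url} failed: ${res.status} ${res.statusText}`);
  return res.json();
}

// Fetch the plain-text body and the metadata for one article in parallel.
export async function fetchArticleText(articleId: string) {
  const [{ content }, meta] = await Promise.all([
    getJson(`${API}/article/${articleId}/content`),
    getJson(`${API}/article/${articleId}`),
  ]);
  return {
    id: articleId,
    title: meta.title,
    tags: meta.tags,
    text: content,
  };
}

// Split text into overlapping word windows; size and overlap are word counts.
export function chunkText(text: string, { size = 800, overlap = 100 } = {}) {
  if (size <= 0 || overlap < 0 || overlap >= size) {
    throw new RangeError('require size > 0 and 0 <= overlap < size');
  }
  const words = text.trim().split(/\s+/).filter(Boolean);
  const chunks: string[] = [];
  for (let i = 0; i < words.length; i += size - overlap) {
    chunks.push(words.slice(i, i + size).join(' '));
    if (i + size >= words.length) break; // this window already reached the end
  }
  return chunks;
}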
Wire chunkText to OpenAI embeddings, Ollama, or your host’s model—swap the vector client, keep the ingest shape.
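One way to keep the ingest shape while swapping providers is to pass the embedder and the vector store in as parameters. The sketch below is a minimal illustration under that assumption: Embedder, VectorStore, VectorRecord, and indexChunks are hypothetical names introduced here, not part of any real SDK, so adapt them to whatever client you actually use.

```typescript
// Hypothetical interfaces: neither comes from a real SDK.
// Adapt them to your embedding provider and vector database client.
type Embedder = (texts: string[]) => Promise<number[][]>;

interface VectorRecord {
  id: string;
  vector: number[];
  metadata: { article_id: string; chunk_index: number; text: string };
}

interface VectorStore {
  upsert(records: VectorRecord[]): Promise<void>;
}

// Embed chunks in batches and upsert one record per chunk.
// Returns the number of records written.
export async function indexChunks(
  articleId: string,
  chunks: string[],
  embed: Embedder,
  store: VectorStore,
  batchSize = 64,
): Promise<number> {
  let written = 0;
  for (let start = 0; start < chunks.length; start += batchSize) {
    const batch = chunks.slice(start, start + batchSize);
    const vectors = await embed(batch);
    if (vectors.length !== batch.length) {
      throw new Error(`Embedder returned ${vectors.length} vectors for ${batch.length} chunks`);
    }
    const records: VectorRecord[] = batch.map((text, i) => ({
      // Deterministic id, so re-ingesting the same article overwrites its old chunks.
      id: `${articleId}:${start + i}`,
      vector: vectors[i],
      metadata: { article_id: articleId, chunk_index: start + i, text },
    }));
    await store.upsert(records);
    written += records.length;
  }
  return written;
}
```

Because the id is derived from the article id and chunk position, an upsert replaces earlier versions of the same chunk rather than duplicating them. If an edited post shrinks, stale trailing chunks would still need a separate delete step.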
Chunking tips
- Include title + tags in the embedding preamble for better retrieval.
- Store article_id and chunk_index in metadata for citations.
- Deduplicate re-ingests with a content hash if posts are rarely edited.
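The tips above could be combined roughly as follows. This is a sketch, not a prescribed format: the preamble layout, the content_hash metadata field, and the helper names (buildChunkRecords, contentHash, needsReingest) are assumptions introduced for illustration. The hash uses SHA-256 from Node's built-in crypto module.

```typescript
import { createHash } from "node:crypto";

interface ArticleMeta {
  id: string;
  title: string;
  tags: string[];
}

interface ChunkRecord {
  text: string; // what gets embedded: preamble + chunk body
  metadata: { article_id: string; chunk_index: number; content_hash: string };
}

// SHA-256 hex digest of the full article text, used to detect edits.
export function contentHash(text: string): string {
  return createHash("sha256").update(text, "utf8").digest("hex");
}

// Prefix every chunk with title and tags, and attach citation metadata.
export function buildChunkRecords(
  meta: ArticleMeta,
  fullText: string,
  chunks: string[],
): ChunkRecord[] {
  const hash = contentHash(fullText);
  const preamble = `Title: ${meta.title}\nTags: ${meta.tags.join(", ")}\n\n`;
  return chunks.map((chunk, i) => ({
    text: preamble + chunk,
    metadata: { article_id: meta.id, chunk_index: i, content_hash: hash },
  }));
}

// True when the article is new or its text changed since the stored hash.
export function needsReingest(previousHash: string | undefined, fullText: string): boolean {
  return previousHash !== contentHash(fullText);
}
```

Checking needsReingest before embedding lets an unchanged post skip the embedding call entirely, which is where most of the cost of a re-ingest usually sits.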
Compliance (non-optional)
- Respect Medium’s Terms of Service and author rights.
- Many teams only index their own posts or licensed partners.
- Do not expose paywalled or member-only content through public bots without permission.
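These rules can be enforced in code as a gate that runs before anything is indexed. The sketch below is one possible shape. The authorId and memberOnly fields are hypothetical and are not documented fields of the API used above, so map them from whatever metadata you actually have.

```typescript
// Hypothetical shape: authorId and memberOnly are assumptions, not documented API fields.
interface ArticleRights {
  id: string;
  authorId: string;
  memberOnly: boolean;
}

interface IndexPolicy {
  allowedAuthorIds: ReadonlySet<string>; // your own authors or licensed partners
  allowMemberOnly: boolean; // set to true only with explicit permission
}

// Decide whether an article may enter the index under the given policy.
export function mayIndex(article: ArticleRights, policy: IndexPolicy): boolean {
  if (!policy.allowedAuthorIds.has(article.authorId)) return false;
  if (article.memberOnly && !policy.allowMemberOnly) return false;
  return true;
}
```

Running this check at ingest time, not at query time, keeps restricted text out of the vector store altogether, so a misconfigured bot cannot surface it later.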
For human-readable syndication, see embed articles—different threat model than LLM training.
Keywords
medium plain text api, medium rag pipeline, medium embeddings, medium article content extraction, llm medium.