DEV Community

Sebastian Casvean

Extract Plain Text from Medium Posts for RAG and Search Indexes

Chunk clean article content for embeddings, summarization, and full-text search—skip nav, clap bars, and scripts.

HTML embeds are for humans; plain text is for chunking, embeddings, and summarization. One call should return body text without nav, clap bars, or script tags.

Tool outcome: ingest-medium-article.ts → chunked documents in your vector DB.


Pipeline

  1. Discover article IDs via a user feed or search.
  2. GET /article/{id}/content → plain text.
  3. Optional: GET /article/{id} for title, tags, author metadata.
  4. Chunk → embed → upsert vector store.
  5. Query in your chat UI or internal search.
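
Step 5 is the one step the ingest script does not touch. As a rough sketch, querying comes down to ranking stored chunk vectors against a query vector. The in-memory version here assumes a hypothetical StoredChunk shape and stands in for whatever search your vector DB provides; topK and cosine are illustrative names, not part of any API.

```typescript
// Hypothetical record shape; the field names are assumptions, not part of any API.
interface StoredChunk {
  articleId: string;
  chunkIndex: number;
  text: string;
  vector: number[];
}

// Cosine similarity of two equal-length vectors; returns 0 if either is all zeros.
function cosine(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  const denom = Math.sqrt(normA) * Math.sqrt(normB);
  return denom === 0 ? 0 : dot / denom;
}

// Return the k stored chunks most similar to the query vector, best match first.
function topK(query: number[], store: StoredChunk[], k = 3): StoredChunk[] {
  return store
    .map((chunk) => ({ chunk, score: cosine(query, chunk.vector) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, k)
    .map((entry) => entry.chunk);
}
```

A real vector store does this ranking for you, usually with an approximate index. The sketch only shows what the query step computes.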

Ingest script

const API = 'https://api.zenndra.com';
const headers = { Authorization: `Bearer ${process.env.ZENNDRA_API_KEY}` };

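// Fetch plain-text content and metadata for one article in parallel.
export async function fetchArticleText(articleId: string) {
  const [contentRes, metaRes] = await Promise.all([
    fetch(`${API}/article/${articleId}/content`, { headers }),
    fetch(`${API}/article/${articleId}`, { headers }),
  ]);

  // Fail loudly on HTTP errors instead of parsing an error body as an article.
  for (const res of [contentRes, metaRes]) {
    if (!res.ok) throw new Error(`Request failed: ${res.status} ${res.url}`);
  }

  const { content } = await contentRes.json();
  const meta = await metaRes.json();

  return {
    id: articleId,
    title: meta.title,
    tags: meta.tags,
    text: content,
  };
}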

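// Split text into chunks of `size` words; consecutive chunks share `overlap` words.
export function chunkText(text: string, { size = 800, overlap = 100 } = {}) {
  // A non-positive step would loop forever, so reject overlap >= size up front.
  if (size <= 0 || overlap < 0 || overlap >= size) {
    throw new RangeError('chunkText requires size > 0 and 0 <= overlap < size');
  }
  // Drop empty tokens caused by leading or trailing whitespace.
  const words = text.split(/\s+/).filter(Boolean);
  const chunks: string[] = [];
  for (let i = 0; i < words.length; i += size - overlap) {
    chunks.push(words.slice(i, i + size).join(' '));
    // Stop once a chunk reaches the end, so the tail is not emitted twice.
    if (i + size >= words.length) break;
  }
  return chunks;
}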

Wire chunkText to OpenAI embeddings, Ollama, or your host’s model—swap the vector client, keep the ingest shape.
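
One way to keep the ingest shape while swapping clients is to pass the embedder and the store in as parameters. The sketch here is an assumption, not part of any SDK: Embedder, VectorStore, VectorRecord, indexChunks, and the in-memory stand-ins are hypothetical names, and a real setup would replace fakeEmbed with a call to your embedding model.

```typescript
// Hypothetical interfaces: adapt them to whichever embedding model and vector DB you use.
type Embedder = (texts: string[]) => Promise<number[][]>;

interface VectorRecord {
  id: string;
  vector: number[];
  text: string;
}

interface VectorStore {
  upsert(records: VectorRecord[]): Promise<void>;
}

// Embed pre-chunked text and upsert one record per chunk; returns the record count.
async function indexChunks(
  articleId: string,
  chunks: string[],
  embed: Embedder,
  store: VectorStore,
): Promise<number> {
  if (chunks.length === 0) return 0;
  const vectors = await embed(chunks);
  if (vectors.length !== chunks.length) {
    throw new Error(`Expected ${chunks.length} vectors, got ${vectors.length}`);
  }
  // Deterministic ids mean a second ingest overwrites records instead of duplicating them.
  const records = chunks.map((text, i) => ({
    id: `${articleId}:${i}`,
    vector: vectors[i],
    text,
  }));
  await store.upsert(records);
  return records.length;
}

// In-memory stand-ins so the sketch runs without any external service.
const fakeEmbed: Embedder = async (texts) => texts.map((t) => [t.length, 1]);

class MemoryStore implements VectorStore {
  readonly records = new Map<string, VectorRecord>();

  async upsert(records: VectorRecord[]): Promise<void> {
    for (const r of records) this.records.set(r.id, r);
  }
}
```

Calling `indexChunks(article.id, chunkText(article.text), embed, store)` with a real embedder and store is the only line that changes between providers under this design.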


Chunking tips

  • Include title + tags in the embedding preamble for better retrieval.
  • Store article_id and chunk_index in metadata for citations.
  • Deduplicate re-ingests with a content hash; posts are rarely edited, so most re-runs can skip embedding.
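
The three tips can be sketched together. The names here (ArticleInput, ChunkRecord, buildChunkRecords, needsReingest) are hypothetical, and the FNV-1a hash is a stand-in chosen to keep the example dependency-free; it is not a claim about how any particular store deduplicates.

```typescript
// Hypothetical shapes for illustration; rename fields to match your vector store.
interface ArticleInput {
  id: string;
  title: string;
  tags: string[];
  text: string;
}

interface ChunkRecord {
  embeddingInput: string; // what gets embedded: preamble + chunk
  metadata: { article_id: string; chunk_index: number; content_hash: string };
}

// FNV-1a over UTF-16 code units: a small non-cryptographic hash, used here only
// to detect changes. A SHA-256 digest is a sturdier choice in production.
function contentHash(text: string): string {
  let hash = 0x811c9dc5;
  for (let i = 0; i < text.length; i++) {
    hash ^= text.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0;
  }
  // Left-pad to a fixed 8 hex digits.
  return ('00000000' + hash.toString(16)).slice(-8);
}

// Build one record per chunk, prefixing title and tags so they influence retrieval.
function buildChunkRecords(article: ArticleInput, chunks: string[]): ChunkRecord[] {
  const preamble = `Title: ${article.title}\nTags: ${article.tags.join(', ')}\n\n`;
  const hash = contentHash(article.text);
  return chunks.map((chunk, index) => ({
    embeddingInput: preamble + chunk,
    metadata: { article_id: article.id, chunk_index: index, content_hash: hash },
  }));
}

// True when the stored hash is missing or differs from the current text.
function needsReingest(article: ArticleInput, storedHash: string | undefined): boolean {
  return storedHash !== contentHash(article.text);
}
```

Hashing the whole article rather than each chunk keeps the check to one lookup per post; per-chunk hashes would allow partial re-embedding at the cost of more bookkeeping.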

Compliance (non-optional)

  • Respect Medium’s Terms of Service and author rights.
  • Many teams only index their own posts or licensed partners.
  • Do not expose paywalled or member-only content through public bots without permission.
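
Some of these rules can be enforced at ingest time rather than left to convention. A minimal sketch, assuming your metadata exposes an author id and a member-only flag; authorId, isMemberOnly, IndexPolicy, and mayIndex are hypothetical names, so check what your metadata source actually returns.

```typescript
// Hypothetical metadata fields; check what your metadata source actually returns.
interface ArticlePolicyInfo {
  authorId: string;
  isMemberOnly: boolean;
}

interface IndexPolicy {
  allowedAuthorIds: string[]; // your own account plus licensed partners
  publicFacing: boolean; // true if answers are exposed through a public bot
}

// Decide whether an article may be ingested under the given policy.
function mayIndex(article: ArticlePolicyInfo, policy: IndexPolicy): boolean {
  // Only index authors you have rights to.
  if (policy.allowedAuthorIds.indexOf(article.authorId) === -1) return false;
  // Keep member-only content out of anything public-facing.
  if (policy.publicFacing && article.isMemberOnly) return false;
  return true;
}
```

Running this check before fetching content keeps disallowed text out of the vector store entirely, which is simpler than filtering it at query time.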

For human-readable syndication, see embed articles; that use case has a different threat model than LLM training.


