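While looking for a way to automatically generate descriptions for pages, I stumbled onto a markdown AST in Python (commonmark). It lets me walk the markdown page and pick out only the paragraph text. This skips headings and code fences. Quoted text is a different matter: a blockquote wraps paragraph nodes of its own, so it takes a parent check to leave those out.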
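import commonmark

parser = commonmark.Parser()
ast = parser.parse(p.content)  # p.content is the page's markdown source

paragraphs = ''
# In current commonmark releases the walker yields (node, entering) pairs,
# and container nodes such as paragraphs are visited on entry and on exit.
for node, entering in ast.walker():
    if node.t == "paragraph" and entering:
        paragraphs += " "
        # literal is None when the first child is not plain text (e.g. a link)
        paragraphs += node.first_child.literal or ""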
It's also super fast. Previously I was rendering to HTML and using BeautifulSoup to pull out only the paragraphs. Using the commonmark AST was about 5x faster on my site.