Prahlad Yeri

Do you use BeautifulSoup or LXML to parse your HTML markup in Python?

BeautifulSoup has been my go-to library for HTML parsing for many years. It's useful for DOM parsing in the Python world (much as jQuery is in the JavaScript world), and it supports multiple HTML parsers such as lxml and html5lib.
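For context, here is a minimal sketch of that jQuery-style workflow with BeautifulSoup; the markup and selector are invented for illustration:

```python
from bs4 import BeautifulSoup

html = "<ul id='langs'><li>Python</li><li>JavaScript</li></ul>"

# The second argument picks the underlying parser: "html.parser" (stdlib),
# "lxml", or "html5lib" (the latter two must be installed separately).
soup = BeautifulSoup(html, "html.parser")

# CSS-style selection, much like jQuery's $("#langs li")
for li in soup.select("#langs li"):
    print(li.get_text())
```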

But I came across an interesting StackOverflow answer today which hints that BeautifulSoup may not be the best choice for performance, and that lxml's own soupparser module is much faster.
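For reference, this is roughly how that soupparser module is invoked; it returns lxml elements (so XPath is available) while reusing BeautifulSoup's lenient parsing underneath, and it needs both lxml and beautifulsoup4 installed. The sloppy markup below is invented for illustration:

```python
from lxml.html import soupparser

# Deliberately unclosed tags, just to show it still produces a usable tree.
root = soupparser.fromstring("<p>Hello <b>world")

print(root.xpath("//b/text()"))  # ['world']
```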

What do you use for HTML parsing, and have you ever come across a performance bottleneck with BeautifulSoup? Me, I haven't.

Top comments (9)

Ben Sinclair

I've used BeautifulSoup in the past, but after a while I realised I didn't have a use case for HTML parsing. Either I was using a scraping library or something like mechanize to hack things about, or it was me generating the HTML, in which case I shouldn't need to re-parse it anyway.

Guy

I've only used BeautifulSoup, and have found it to be fast enough for the simple scripts I need to write. Its documentation is excellent. I'd suggest starting out with it, and if performance does become a hindrance, look elsewhere at that point.

Sm0ke

Hello,
I'm using BS to parse HTML themes, usually with 4-5 pages.
All the related operations (page load, updating properties on nodes, extracting components) execute in a few seconds; I've never felt that performance is an issue with this small amount of input. With a large amount of input data, lxml might perform better.

On top of that, BS supports lxml as a plugin parser (along with html.parser).
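Something like this, roughly; the file name and the attribute rewrites are placeholders, not the actual theme code:

```python
from bs4 import BeautifulSoup

# Load a theme page; "index.html" is a placeholder name.
with open("index.html", encoding="utf-8") as fh:
    soup = BeautifulSoup(fh, "lxml")  # plug in lxml as the parser backend

# Update properties on nodes, e.g. rewrite stylesheet paths.
for link in soup.find_all("link", rel="stylesheet"):
    link["href"] = "/static/" + link["href"]

# Extract a component, e.g. the navigation block.
navbar = soup.find("nav")

with open("index.out.html", "w", encoding="utf-8") as fh:
    fh.write(str(soup))
```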

James McPherson

I'm biased in favour of BeautifulSoup, because the majority of the HTML and XML I've come across is not well-formed. In my experience BeautifulSoup is much more forgiving.
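For illustration, a tiny example of that forgiving behaviour with deliberately broken markup (the snippet itself is invented):

```python
from bs4 import BeautifulSoup

# Unclosed <b>, an implied </p>, and a stray </div>.
broken = "<p>first <b>bold text<p>second paragraph</div>"

# No exception is raised; the tree gets repaired and stays searchable.
soup = BeautifulSoup(broken, "lxml")  # html5lib is even closer to browser behaviour
for p in soup.find_all("p"):
    print(p.get_text())
```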

I've made use of it in

github.com/jmcp/grabbag/blob/maste...
and
github.com/jmcp/grabbag/blob/maste...

and in another minor project I'm working on at the moment, where I'm taking KML and ogr2ogr-converted MapInfo shape files; it will find its way to the grabbag in due course.

James McPherson

I'm not particularly stressed about the performance of the solution, btw; so long as I have written efficient code, I'm not worried about the library I'm using.

rhymes

I'm a fan of lxml but I haven't done any HTML parsing in a while. lxml is written in C and BeautifulSoup in Python IIRC, which tends to be slower than C.

I think your best bet is to write a pet project, feed the same HTML to both, measure performance but also see if they behave the same way. Different parsers sometimes have different behaviors in corner cases or malformed input.
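A rough sketch of such a comparison, with a placeholder file name and repeat count:

```python
import timeit

setup = """
from bs4 import BeautifulSoup
import lxml.html
html = open("sample.html", encoding="utf-8").read()  # placeholder input file
"""

# Parse the same document repeatedly with each library and compare wall time.
print("BeautifulSoup:", timeit.timeit('BeautifulSoup(html, "html.parser")',
                                      setup=setup, number=100))
print("lxml:", timeit.timeit("lxml.html.fromstring(html)",
                             setup=setup, number=100))
```

Diffing the resulting trees on a few malformed pages would cover the behavioural-differences point too.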

Itachi Uchiha

Yes, I do.

I use lxml to parse HTML. :)
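For anyone curious, the basic lxml.html flow looks something like this (the markup is invented):

```python
import lxml.html

doc = lxml.html.fromstring("<div class='post'><h2>Title</h2><p>Body</p></div>")

# Query via XPath or the ElementTree API.
print(doc.xpath("//h2/text()"))  # ['Title']
print(doc.findtext(".//p"))      # 'Body'
```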

Nuga

Ice or Fire

BeautifulSoup