Do you use BeautifulSoup or lxml to parse your HTML markup in Python?


BeautifulSoup has been my go-to library for HTML parsing for many years. It's useful for DOM parsing in the Python world (just as jQuery is in the JavaScript world), and it supports multiple HTML parsers such as lxml and html5lib.

But I came across this interesting StackOverflow answer today which hints that BeautifulSoup may not be the best for performance, and that lxml's own module called soupparser is much faster.

What do you use for HTML parsing? Have you ever come across a performance bottleneck with BeautifulSoup? Me, I haven't.

 

I've only used BeautifulSoup, and have found it to be fast enough for the simple scripts I need to write. Its documentation is excellent. I'd suggest starting out with it and, if performance does become a hindrance, looking elsewhere at that point.

 

I've used BeautifulSoup in the past, but after a while I realised I didn't have a use case for HTML parsing. Either I was using a scraping library or something like mechanize to hack things about, or I was generating the HTML myself, in which case I shouldn't need to re-parse it anyway.

 

I'm biased in favour of BeautifulSoup, because the majority of the HTML and XML I've come across is not well-formed. In my experience BeautifulSoup is much more forgiving.

I've made use of it in

github.com/jmcp/grabbag/blob/maste...
and
github.com/jmcp/grabbag/blob/maste...

and in another minor project I'm working on at the moment, where I'm taking KML and ogr2ogr-converted MapInfo shape files. It will find its way to the grabbag in due course.
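To illustrate the point above about markup that is not well-formed, here is a minimal sketch. The `BROKEN` fragment and the `extract_links` helper are made up for this example and are not taken from the linked grabbag scripts. It assumes the `bs4` package is installed and uses the standard library `html.parser` backend.

```python
try:
    from bs4 import BeautifulSoup
except ImportError:  # bs4 is not installed; the helper then returns nothing
    BeautifulSoup = None

# Hypothetical fragment: unclosed <b> and <p>, unquoted attribute, stray </div>.
BROKEN = "<p>Price: <b>42<p>See <a href=/more>more</a></div>"


def extract_links(markup):
    """Return (text, href) pairs for every <a> tag found in the markup."""
    if BeautifulSoup is None:
        return []
    soup = BeautifulSoup(markup, "html.parser")
    return [(a.get_text(), a.get("href")) for a in soup.find_all("a")]


if __name__ == "__main__":
    print(extract_links(BROKEN))
```

The parse does not raise on the broken input, so the link can still be pulled out without cleaning the markup first.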

 

I'm not particularly stressed about the performance of the solution, btw. So long as I have written efficient code, I'm not worried about the library I'm using.

 

Hello,
I'm using BS to parse HTML themes, usually with 4 or 5 pages.
All related operations (loading pages, updating properties on nodes, extracting components) run in a few seconds, and I never felt that performance was an issue with this small amount of input. With a large amount of input data, maybe lxml can perform better.

On top of that, BS supports lxml as a plug-in parser (along with html.parser).
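As a rough sketch of what that looks like in practice: the parser is chosen by the second argument to `BeautifulSoup`, and the rest of the code stays the same. The `MARKUP` sample and the `titles_by_parser` helper are made up for illustration. The sketch assumes `bs4` is installed and skips any parser library that is missing.

```python
try:
    from bs4 import BeautifulSoup, FeatureNotFound
except ImportError:  # bs4 is not installed; the helper then returns nothing
    BeautifulSoup = None

# Hypothetical sample markup.
MARKUP = "<div><p class='title'>Hello</p><p>World</p></div>"


def titles_by_parser(markup, parser_names=("html.parser", "lxml", "html5lib")):
    """Run the same query with each installed parser backend."""
    found = {}
    if BeautifulSoup is None:
        return found
    for name in parser_names:
        try:
            soup = BeautifulSoup(markup, name)
        except FeatureNotFound:
            # bs4 raises this when the requested parser library is not installed.
            continue
        found[name] = [p.get_text() for p in soup.find_all("p", class_="title")]
    return found


if __name__ == "__main__":
    for name, titles in titles_by_parser(MARKUP).items():
        print(name, titles)
```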

 

I'm a fan of lxml but I haven't done any HTML parsing in a while. IIRC lxml is written in C and BeautifulSoup in Python, and Python tends to be slower than C.

I think your best bet is to write a pet project, feed the same HTML to both, and measure performance, but also check whether they behave the same way. Different parsers sometimes behave differently in corner cases or on malformed input.
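A pet project along those lines could start from something like the sketch below. It is only a starting point under a few assumptions: the sample HTML is made up, the helper names (`build_parsers`, `count_links`, `benchmark`) are hypothetical, and it only times whichever of `bs4` and `lxml` happen to be installed. Counting the links each parser finds is a crude check that they agree on the same input.

```python
import timeit

# Made-up sample markup; swap in the HTML you actually care about.
SAMPLE_HTML = (
    "<html><body>"
    + "<div class='row'><a href='/x'>link</a><p>text</p></div>" * 200
    + "</body></html>"
)


def build_parsers():
    """Return {label: parse_function} for whichever libraries are installed."""
    parsers = {}
    try:
        from bs4 import BeautifulSoup
        parsers["bs4 + html.parser"] = lambda html: BeautifulSoup(html, "html.parser")
    except ImportError:
        BeautifulSoup = None
    try:
        import lxml.html
        parsers["lxml.html"] = lxml.html.fromstring
        if BeautifulSoup is not None:
            parsers["bs4 + lxml"] = lambda html: BeautifulSoup(html, "lxml")
    except ImportError:
        pass
    return parsers


def count_links(tree):
    """Count <a> elements in either a BeautifulSoup tree or an lxml tree."""
    if hasattr(tree, "find_all"):  # BeautifulSoup object
        return len(tree.find_all("a"))
    return len(tree.findall(".//a"))  # lxml element


def benchmark(html, repeat=5):
    """Time each parser on the same input and record how many links it saw."""
    results = {}
    for label, parse in build_parsers().items():
        seconds = min(timeit.repeat(lambda: parse(html), number=1, repeat=repeat))
        results[label] = (seconds, count_links(parse(html)))
    return results


if __name__ == "__main__":
    for label, (seconds, links) in benchmark(SAMPLE_HTML).items():
        print(f"{label:20s} {seconds * 1000:8.2f} ms  links={links}")
```

Timings will vary by machine and by input, so it is worth running this against real pages rather than trusting a synthetic sample.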

 
 
 

Prahlad Yeri
Most programmers like coffee but I'm fond of tea.