DEV Community

Lars Quentin

Use XQuery for HTML (Web Crawling)

This is a very niche post. If you don't already know why this is a pain point, you can safely skip it.

XQuery is a great language for high-level XML processing: a Turing-complete, declarative language built on top of XPath. Unfortunately, it is rarely used.

My personal take is that this is because most HTML out there is not well-formed XML, mostly because of tags that are never closed (such as <link ...> instead of <link .../>). As a result, the XML parser behind Saxon or BaseX will reject the page.
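
To see the failure concretely, here is a minimal sketch using the standard `fn:parse-xml` function and XQuery 3.0 `try`/`catch`. The HTML snippet is made up for illustration, and the exact error message depends on your processor.

```xquery
(: Hypothetical snippet: <link> is never closed, so this is not well-formed XML. :)
let $html := '<head><link rel="stylesheet" href="style.css"></head>'
return
  try {
    (: A strict XML parser refuses the input here. :)
    parse-xml($html)
  } catch err:FODC0006 {
    (: FODC0006 is the standard error code for input that is not well-formed. :)
    'Not well-formed XML: ' || $err:description
  }
```

Writing the tag as `<link rel="stylesheet" href="style.css"/>` makes the same call succeed, which is exactly the difference most real pages don't bother with.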

The solution is TagSoup, a SAX-compliant parser that reads HTML as it is found in the wild and hands it to downstream tools as if it were well-formed XML.

With that, you can now do actual web crawling! The rest is just plain XQuery.
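
As a rough sketch of what this can look like, the query below assumes BaseX with the TagSoup JAR on its classpath, and uses BaseX's `html:parse` and `fetch:binary` functions (these are BaseX-specific, not standard XQuery). The URL is a placeholder, not something from the original setup.

```xquery
(: Assumption: running in BaseX with TagSoup available on the classpath. :)
(: Placeholder URL, replace with the page you want to crawl. :)
declare variable $url := 'https://example.org/';

(: Fetch the raw bytes and let the HTML parser turn them into an XML document node. :)
let $page := html:parse(fetch:binary($url))

(: From here on it is plain XPath/XQuery. The *: wildcard matches <a> elements
   whether or not the parser put them into the XHTML namespace. :)
return distinct-values($page//*:a/@href)
```

In BaseX, `html:parser()` returns the name of the HTML parser in use. An empty string means TagSoup was not found, in which case the input is treated as regular XML and the original problem comes back.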
