DEV Community

drake


Fixing non-standard HTML that XPath cannot parse

When a crawler requests a page whose HTML is non-standard, the usual parsing approaches fail to build a DOM from it. For example:

Examples that fail

  • 1. This approach returns None
from lxml import etree
tree = etree.HTML(res.text)
  • 2. This approach also returns None
from lxml import etree

parser = etree.HTMLParser()
tree = etree.fromstring(res.text, parser)

  • 3. This approach does not give a usable tree either
from bs4 import BeautifulSoup
tree = BeautifulSoup(res.text, 'html.parser')

Examples that work

  • 1. Solution

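This works by repairing broken and non-standard HTML. soupparser parses the page with BeautifulSoup, which must be installed, and returns an lxml tree.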

from lxml.html import soupparser
tree = soupparser.fromstring(res.text)
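
One way to combine the strict parser with this repair step is to try the parsers in order and keep the first tree that is not None. The sketch below is only an illustration: `parse_with_fallbacks` and `lxml_parsers` are hypothetical helper names that do not come from lxml, and it assumes that a failed parse shows up either as None or as an exception.

```python
from typing import Any, Callable, Iterable, Optional


def parse_with_fallbacks(
    text: str, parsers: Iterable[Callable[[str], Any]]
) -> Optional[Any]:
    """Return the first non-None tree produced by the given parsers."""
    for parse in parsers:
        try:
            tree = parse(text)
        except Exception:
            # A parser that raises is treated like one that returned None.
            continue
        if tree is not None:
            return tree
    return None


def lxml_parsers() -> list:
    """Hypothetical helper: strict lxml parser first, soupparser as fallback."""
    # Imported lazily so the generic function above has no hard dependency.
    from lxml import etree
    from lxml.html import soupparser

    return [etree.HTML, soupparser.fromstring]


# Usage (res is the HTTP response from the examples above):
# tree = parse_with_fallbacks(res.text, lxml_parsers())
```

Trying the strict parser first keeps well-formed pages on the fast path, and only pages that fail go through the slower BeautifulSoup-based repair.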

  • 2. Solution

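This works because the HTML text contains non-ASCII characters that cause the parse to fail. Encoding with "xmlcharrefreplace" converts every such character to an ASCII numeric character reference, so the parser receives pure ASCII.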

from lxml import etree
tree = etree.HTML(res.text.encode("ascii", "xmlcharrefreplace").decode("ascii"))
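
To see what the conversion does to the text before it reaches the parser, the following standard-library sketch applies the same encode/decode step to a small sample. The function name `to_ascii_markup` and the sample string are made up for illustration.

```python
import html


def to_ascii_markup(text: str) -> str:
    """Replace every non-ASCII character with a numeric character reference."""
    return text.encode("ascii", "xmlcharrefreplace").decode("ascii")


sample = "<p>价格: 100€</p>"
converted = to_ascii_markup(sample)

print(converted)                           # non-ASCII characters appear as &#NNNN; references
print(converted.isascii())                 # True
print(html.unescape(converted) == sample)  # True: the references decode back to the original text
```

Because the references decode back to the original characters, the text content of the parsed tree should be unchanged. Only the bytes handed to the parser differ.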

