DEV Community

drake


Fixing non-standard HTML that XPath cannot parse

When a crawler requests a page whose HTML is non-standard, the usual parsing approaches fail to turn it into a DOM. For example:

Approaches that fail

  • 1. This approach returns None
from lxml import etree
tree = etree.HTML(res.text)
  • 2. This approach also returns None
from lxml import etree

parser = etree.HTMLParser()
tree = etree.fromstring(res.text, parser)

  • 3. This approach does not produce a usable tree either
from bs4 import BeautifulSoup
tree = BeautifulSoup(res.text, 'html.parser')

Approaches that work

  • 1. First solution

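This works because soupparser parses the text with BeautifulSoup, which repairs broken and non-standard HTML, and then returns an lxml tree that XPath can query. It requires the beautifulsoup4 package to be installed.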

from lxml.html import soupparser
tree = soupparser.fromstring(res.text)
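One way to combine this with the normal parser is to try lxml first and only fall back to soupparser when no tree comes back. The sketch below is an illustration, not the author's code: the helper name parse_html and the sample markup are hypothetical, and it assumes lxml is installed (beautifulsoup4 is only needed if the fallback branch runs).

```python
from lxml import etree


def parse_html(text):
    """Hypothetical helper: parse with lxml, fall back to soupparser.

    Returns an lxml element that supports .xpath().
    """
    try:
        tree = etree.HTML(text)
    except (etree.XMLSyntaxError, ValueError):
        # ValueError is raised, for example, for a str that carries
        # an XML encoding declaration.
        tree = None
    if tree is None:
        # Imported lazily so beautifulsoup4 is only required when needed.
        from lxml.html import soupparser
        tree = soupparser.fromstring(text)
    return tree


# Sample markup with unclosed <p> tags (made up for this example).
broken = "<div><p class='title'>Hello<p>World</div>"
titles = parse_html(broken).xpath("//p[@class='title']/text()")
print(titles)
```

Importing soupparser inside the fallback branch keeps the common path free of the extra dependency; move the import to the top of the file if you prefer to fail early when beautifulsoup4 is missing.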

  • 2. Second solution

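This applies when the HTML text contains non-ASCII characters that break parsing. Every such character is converted to an ASCII numeric character reference (for example, é becomes &#233;), so the parser only ever sees ASCII.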

from lxml import etree
tree = etree.HTML(res.text.encode("ascii", "xmlcharrefreplace").decode("ascii"))
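To see what the encode/decode round trip actually does to the text, here is a small standard-library sketch. The sample string is made up for illustration; html.unescape is used only to show that the character references still stand for the original characters.

```python
import html

# Sample text containing two non-ASCII characters: é (U+00E9) and € (U+20AC).
text = "<p>caf\u00e9 \u20ac5</p>"

# Replace every non-ASCII character with a numeric character reference.
ascii_text = text.encode("ascii", "xmlcharrefreplace").decode("ascii")
print(ascii_text)  # <p>caf&#233; &#8364;5</p>

# The result is pure ASCII, and the references decode back to the original.
print(ascii_text.isascii())
print(html.unescape(ascii_text) == text)
```

Because an HTML parser resolves character references itself, text extracted from the parsed tree should contain the original characters rather than the &#...; forms.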

