DEV Community

drake


How to Scrape WSS Data with a Python Crawler

Target site

https://dexscreener.com

Target data

List page data

Problems

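  • Many posts online recommend the aiowebsocket library. In my testing it could not do the job, and it cost me a full day of debugging. At first I assumed I was failing the site's anti-bot checks, but the problem was the library itself. The project has not been updated in four years and is effectively dead, so I do not recommend it.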
# from aiowebsocket.converses import AioWebSocket
#https://github.com/asyncins/aiowebsocket
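  • The Sec-WebSocket-Key handshake parameter: this value is random, but it must be generated according to a fixed rule (16 random bytes, Base64-encoded). A key that follows the rule is accepted; one that does not will prevent the connection from being established. Code examples: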
function generateWebSocketKey() {
  // Generate a random 16-byte value
  const buffer = new Uint8Array(16);
  window.crypto.getRandomValues(buffer);

  // Base64-encode the random bytes
  const key = btoa(String.fromCharCode.apply(null, buffer));

  return key;
}
import os
import base64

def generate_websocket_key():
    # Generate a random 16-byte value
    buffer = os.urandom(16)

    # Base64-encode the random bytes
    key = base64.b64encode(buffer).decode('utf-8')

    return key

print(generate_websocket_key())
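
For context, the rule behind this key comes from the WebSocket opening handshake in RFC 6455: the client sends a Base64-encoded 16-byte nonce, and the server replies with a Sec-WebSocket-Accept value derived from it. The sketch below illustrates both sides of that rule. The helper names `is_valid_websocket_key` and `expected_accept` are hypothetical and not part of any library; this is an illustration of the standard, not of how dexscreener checks the key.

```python
import base64
import binascii
import hashlib

# Fixed GUID defined by RFC 6455 for the opening handshake.
WS_GUID = "258EAFA5-E914-47DA-95CA-C5AB0DC85B11"


def is_valid_websocket_key(key: str) -> bool:
    """Return True if key is valid Base64 that decodes to exactly 16 bytes."""
    try:
        raw = base64.b64decode(key, validate=True)
    except (binascii.Error, ValueError):
        return False
    return len(raw) == 16


def expected_accept(key: str) -> str:
    """Compute the Sec-WebSocket-Accept value a compliant server should return."""
    digest = hashlib.sha1((key + WS_GUID).encode("ascii")).digest()
    return base64.b64encode(digest).decode("ascii")
```

A key produced by `generate_websocket_key()` above should pass `is_valid_websocket_key`, while an arbitrary string generally will not.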
  • The server validates the User-Agent, so the request headers must include one.

Solution code example

import base64
import json
import logging
import os
import ssl
import time

import websocket  # provided by the websocket-client package

# The original post does not show where `logger` comes from;
# the standard logging module is assumed here.
logger = logging.getLogger(__name__)


class ListPage:
    def __init__(self):
        self.base_url = 'wss://io.dexscreener.com/dex/screener/pairs/h24/{}?rankBy[key]=txns&rankBy[order]=desc&filters[liquidity][min]=100000&filters[marketCap][min]=200000&filters[txns][h24][min]=100'

    def generate_websocket_key(self):
        # Generate a random 16-byte value
        buffer = os.urandom(16)
        # Base64-encode the random bytes
        key = base64.b64encode(buffer).decode('utf-8')
        return key

    def open_connection(self, num):
        """
        Open the connection.
        """
        header = {
            # Identifies this handshake and is part of what the server checks.
            # The value is random but must follow the standard format;
            # a random value outside that format is rejected and no connection is made.
            "Sec-WebSocket-Key": self.generate_websocket_key(),
            # Same purpose as in an HTTP request: identifies the client. This is the main anti-bot check.
            "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/112.0.0.0 Safari/537.36"
        }

        # Note: CERT_NONE disables TLS certificate verification.
        ws = websocket.WebSocket(sslopt={"cert_reqs": ssl.CERT_NONE})
        url = self.base_url.format(num)
        logger.info(f'request {num}: {url}')
        # The headers must be sent. In my experience, the default handshake
        # produced by Python WSS clients is rejected by many servers.
        ws.connect(url, header=header)
        return ws

    def get_data(self, num):
        """
        Send the request and fetch the data.
        """
        retry_num = 0
        while True:
            retry_num += 1
            if retry_num > 3:
                logger.error('Retry limit exceeded!')
                return None
            try:
                ws = self.open_connection(num)
                # In testing the connection did not seem to expire (about 20 hours).
                # This line may raise an exception.
                ws.send('pong')
                break
            except Exception:
                logger.warning('WSS communication failed, retrying...')
                time.sleep(1)
        try:
            response = ws.recv()
        finally:
            # Always release the socket once the message has been read
            ws.close()
        logger.info(f'response: {num}')
        data = json.loads(response)
        return data