Daily Log - 23/08/2024

Overall

I was developing my AI application.

Problems faced

I found that Jinja conflicted with jsonify(), which caused errors when uploading files and chatting with GPT. My mentor helped me solve it.

Learned

I wanted to extract all the company IDs from https://www.ctgoodjobs.hk/, so I tried web crawling. It worked.

import requests
import parsel  # third-party module

def main():

    headers = {
        'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/106.0.0.0 Safari/537.36',
    }
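    # Fetch the top-companies page (an HTML page, not a JSON API)
    url = "https://www.ctgoodjobs.hk/top-companies"
    response = requests.get(url, headers=headers, timeout=10)
    response.raise_for_status()  # stop early on an HTTP error status
    html_data = response.text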
    selector = parsel.Selector(html_data)

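    # Without .get(), .css() returns a SelectorList of every match;
    # .get() returns only the first match as a string (None if nothing matches).
    lists = selector.css('.sub-sec li')
    # ::text selects the text content; without it, .get() returns the element's HTML, tags included.
    extra_data = selector.css('div.sub-sec ul.extra::text').get()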

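    # Split the comma-separated IDs into tokens, dropping empty ones
    # (extra_data is None when the selector matches nothing).
    company_ids = [token.strip() for token in (extra_data or '').split(',') if token.strip()]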

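    for item in lists:  # "item" avoids shadowing the built-in list
        company_id = item.css('a::attr(data-company-id)').get()
        if company_id:  # skip entries without a data-company-id link
            company_ids.append(company_id)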

    print(company_ids)
    print(len(company_ids))


if __name__ == "__main__":
    main()
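
To check the extraction logic without hitting the site, here is a minimal offline sketch of the same idea. It is not the script above: it swaps parsel for the standard library's html.parser so it runs with no extra installs. The sample markup, the CompanyIdParser class, and the extract_company_ids function are all hypothetical names invented for this example, and the real page structure may differ. The duplicate removal at the end is also an addition, assuming the same ID can appear in both places.

```python
from html.parser import HTMLParser

# Hypothetical markup invented for this sketch; the real page may differ.
SAMPLE_HTML = """
<div class="sub-sec">
  <ul>
    <li><a data-company-id="101" href="#">Alpha Ltd</a></li>
    <li><a data-company-id="102" href="#">Beta Ltd</a></li>
    <li><a href="#">No ID here</a></li>
  </ul>
  <ul class="extra">102, 203,204</ul>
</div>
"""


class CompanyIdParser(HTMLParser):
    """Collects data-company-id attributes and the text of <ul class="extra">."""

    def __init__(self):
        super().__init__()
        self.link_ids = []
        self.extra_text = []
        self._in_extra = False

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)
        if tag == "a" and attr_map.get("data-company-id"):
            self.link_ids.append(attr_map["data-company-id"])
        elif tag == "ul" and "extra" in (attr_map.get("class") or "").split():
            self._in_extra = True

    def handle_endtag(self, tag):
        if tag == "ul":
            self._in_extra = False

    def handle_data(self, data):
        if self._in_extra:
            self.extra_text.append(data)


def extract_company_ids(html):
    parser = CompanyIdParser()
    parser.feed(html)
    # Split the comma-separated text into tokens, dropping empty ones
    extra_ids = [t.strip() for t in "".join(parser.extra_text).split(",") if t.strip()]
    # dict.fromkeys drops duplicates while keeping first-seen order
    return list(dict.fromkeys(extra_ids + parser.link_ids))


print(extract_company_ids(SAMPLE_HTML))  # ['102', '203', '204', '101']
```

Keeping the parsing in a function that takes an HTML string makes it easy to test against a saved copy of the page before running it on the live site.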
