Search startup jobs with Python and LLMs

Misha Zanka — Mon, 27 Jan 2025 10:42:00 +0000

Company websites contain many job listings that never make it to popular job boards.
For example, finding remote startup jobs can be challenging, because these companies may not be listed on job boards at all.
To find these jobs you need to:

Find promising companies
Search for their career pages
Analyze available job listings
Manually record job details

Doing this by hand takes a lot of time, so we are going to automate it.

Preparation

We'll use the Parsera library to automate job scraping. Parsera provides two usage options:

Local: Pages are processed on your machine using an LLM of your choice;
API: All processing occurs on Parsera's servers.

In this example we'll go with the Local option, since this is a one-time, small-scale extraction.

To get started, install the required packages:

pip install parsera
playwright install

Since we're running the local setup, an LLM connection is required.
For simplicity, we'll use OpenAI's gpt-4o-mini, which only requires setting an environment variable:

import os
from parsera import Parsera

os.environ["OPENAI_API_KEY"] = "<YOUR_OPENAI_API_KEY_HERE>"

# No model argument is passed, so Parsera falls back to its default OpenAI model
scraper = Parsera()

With everything set up, we're ready to start scraping.

Step 1: Getting a list of fresh Series A startups

First, we need to find a list of companies of interest and their websites.
I've found a list of 100 Series A startups that closed their rounds last month.
Growing companies with fresh rounds seem like a good place to look.

Let's grab the country and website of these companies:

url = "https://growthlist.co/series-a-startups/"
elements = {
    "Website": "Website of the company",
    "Country": "Country of the company",
}
all_startups = await scraper.arun(url=url, elements=elements)
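The `await` calls in this article run as written in a notebook, where an event loop is already active. In a plain script, one option is to wrap the call in a coroutine and hand it to `asyncio.run`. The sketch below assumes only the `arun` call shown above; the `fetch_startups` helper and the `FakeScraper` stand-in are my own names, and the stand-in exists only so the snippet runs without a network connection or API key.

```python
import asyncio


async def fetch_startups(scraper, url, elements):
    # Same call as in the article, wrapped in a coroutine so a script can run it
    return await scraper.arun(url=url, elements=elements)


class FakeScraper:
    """Stand-in for illustration only; pass the real Parsera instance instead."""

    async def arun(self, url, elements):
        return [{"Website": "example.com", "Country": "United States"}]


elements = {
    "Website": "Website of the company",
    "Country": "Country of the company",
}

# asyncio.run creates the event loop, runs the coroutine, and closes the loop
all_startups_demo = asyncio.run(
    fetch_startups(FakeScraper(), "https://growthlist.co/series-a-startups/", elements)
)
print(all_startups_demo)
```

With the real scraper, replace `FakeScraper()` with the `scraper` object created in the Preparation section.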

Now that we have the country, we can filter for the one we are interested in.
Let's narrow down our search to the United States:

us_websites = [
    item["Website"] for item in all_startups if item["Country"] == "United States"
]

Step 2: Finding Careers pages

Now we have a list of websites of new Series A startups from the US.
The next step is to find their careers pages. We'll take the straightforward route and extract the careers link from each home page:

from urllib.parse import urljoin

# Define our target
careers_target = {"url": "Careers page url"}

careers_pages = []
for website in us_websites:
    website = "https://" + website
    result = await scraper.arun(url=website, elements=careers_target)
    if len(result) > 0:
        url = result[0]["url"]
        if url.startswith("/") or url.startswith("./"):
            url = urljoin(website, url)
        careers_pages.append(url)
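The loop above assumes that every website value comes back without a scheme and that relative links start with "/" or "./". Since the values are produced by an LLM, neither is guaranteed. A slightly more defensive sketch, using only the standard library, is shown below; `ensure_scheme` and `resolve_link` are hypothetical helper names, not part of Parsera.

```python
from urllib.parse import urljoin, urlparse


def ensure_scheme(website: str) -> str:
    """Prefix https:// only when the value has no http(s) scheme yet."""
    website = website.strip()
    if urlparse(website).scheme in ("http", "https"):
        return website
    return "https://" + website.lstrip("/")


def resolve_link(base: str, link: str) -> str:
    """Resolve a link against the page it was found on.

    urljoin leaves absolute links untouched and resolves relative ones,
    so no startswith checks are needed.
    """
    return urljoin(base, link.strip())


print(ensure_scheme("example.com"))
print(resolve_link("https://example.com", "/careers"))
```

In the loop, this would mean calling `ensure_scheme(website)` before scraping and `resolve_link(website, result[0]["url"])` before appending.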

Note that this step can also be done with a Search API, swapping the LLM calls for search calls.

Step 3: Scraping open jobs

The last step is to load all open jobs from the careers pages.
Let's say we are looking for a software engineering job. Then we'll extract the job title, location, link, and whether the role is related to software engineering:

jobs_target = {
    "Title": "Title of the job",
    "Location": "Location of the job",
    "Link": "Link to the job post",
    "SE": "True if this is a software engineering job, otherwise False",
}

jobs = []
for page in careers_pages:
    result = await scraper.arun(url=page, elements=jobs_target)
    for row in result:
        # Remember which careers page the job came from
        row["url"] = page
        # Turn relative job links into absolute ones
        row["Link"] = urljoin(page, row["Link"])
    jobs.extend(result)

With all jobs extracted, we can drop everything that isn't software engineering and save the rest to a .csv file:

import csv

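# The flag may come back as a boolean or as a string, so compare its text form
engineering_jobs = [job for job in jobs if str(job["SE"]).strip().lower() == "true"]

# newline="" prevents blank lines between rows on Windows
with open("jobs.csv", "w", newline="") as f:
    writer = csv.writer(f)
    if engineering_jobs:
        writer.writerow(engineering_jobs[0].keys())
    for job in engineering_jobs:
        writer.writerow(job.values())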
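Writing `job.values()` row by row relies on every job having the same keys in the same order. If the LLM omits a field for one listing, the columns shift. One way around this, sketched below with the standard library's `csv.DictWriter`, is to fix the column order up front. The `write_jobs` helper and the sample row are illustrative only, and an in-memory buffer stands in for the file.

```python
import csv
import io

# Column order taken from the jobs_target keys plus the added "url" field
FIELDS = ["Title", "Location", "Link", "SE", "url"]


def write_jobs(rows, out):
    # restval fills missing fields; extrasaction="ignore" drops unexpected keys
    writer = csv.DictWriter(out, fieldnames=FIELDS, restval="", extrasaction="ignore")
    writer.writeheader()
    writer.writerows(rows)


# Sample row with the Link and url fields missing on purpose
sample = [{"Title": "Backend Developer", "Location": "Tel Aviv", "SE": "True"}]

buffer = io.StringIO()
write_jobs(sample, buffer)
print(buffer.getvalue())
```

To write to disk instead, pass a file opened with `open("jobs.csv", "w", newline="")` as `out`.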

In the end, we get a table of jobs that looks like this:

Title	Location	Link	SE	url
AI Tech Lead Manager	Bengaluru	https://job-boards.greenhouse.io/enterpret/jobs/6286095003	True	https://boards.greenhouse.io/enterpret/
Backend Developer	Tel Aviv	https://www.upwind.io/careers/co/tel-aviv/BA.04A/backend-developer/all#jobs	True	https://www.upwind.io/careers
...	...	...	...	...

Conclusion

As a next step, we could repeat the same process on each full job listing to extract more details, such as the tech stack, or to filter for remote startup jobs.
This saves the time otherwise spent manually reviewing every page.
You can try it yourself by iterating over the Link fields and extracting the elements you're interested in.
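A minimal sketch of that follow-up pass might look like the code below. It assumes only the `arun` call used throughout this article. The `enrich_jobs` helper, the `TechStack` and `Remote` field names, and the `FakeScraper` stand-in are hypothetical and exist only to make the snippet runnable without a network connection or API key.

```python
import asyncio

# Hypothetical fields to pull from each full job post
details_target = {
    "TechStack": "Technologies mentioned in the job post",
    "Remote": "True if the job is remote, otherwise False",
}


async def enrich_jobs(scraper, jobs, elements):
    """Visit each job's Link and merge the extracted fields into the row."""
    enriched = []
    for job in jobs:
        result = await scraper.arun(url=job["Link"], elements=elements)
        # Keep the original row unchanged when nothing was extracted
        extra = result[0] if result else {}
        enriched.append({**job, **extra})
    return enriched


class FakeScraper:
    """Stand-in for illustration only; pass the real Parsera instance instead."""

    async def arun(self, url, elements):
        return [{"TechStack": "Python", "Remote": "True"}]


sample_jobs = [{"Title": "Backend Developer", "Link": "https://example.com/jobs/1"}]
enriched_demo = asyncio.run(enrich_jobs(FakeScraper(), sample_jobs, details_target))
print(enriched_demo)
```

Each job page costs one more LLM call, so it makes sense to run this only on the rows that survived the software engineering filter.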

I hope you found this article helpful. If you have any questions, let me know.

Lightweight Python library for scraping with LLMs

Misha Zanka — Tue, 13 Aug 2024 12:17:02 +0000

Hi Everyone,

I want to share my Python library for lazy scraping :)

I’ve been using LLMs to quickly extract structured data from websites without dealing with DOM structure or writing web scrapers. After a few months of experiments, I am sharing my code as an open-source Python library.

Compared to similar open-source libraries, the key benefits are simplicity and a focus on minimal token use, which leads to lower costs and faster processing.

Check out the library on GitHub: https://github.com/raznem/parsera

Happy to hear your feedback!