Alex

Building a smarter Web scraper: Vector embeddings for intelligent content retrieval

Yet Another Scraper? This One's Actually Worth Your Time!

Hey there, fellow code wranglers! 👋 Alex here, and today I'm going to introduce you to my latest Python creation: a full-stack web scraper with built-in vector similarity search (and yes, I'll do a Rust follow-up if this gets enough attention).

Foreword

Whether you're just starting your coding journey or you've got a few years under your belt, this project is designed to be both educational and immediately useful. Entry-level devs will find the codebase approachable, with clear patterns to learn from, while mid-level engineers can appreciate the architecture choices and extend the functionality for their specific needs. It's meant to be a starting point for you to build upon.

What Makes This One Special?

This isn't just any scraper. It's a full-stack solution that combines FastAPI, PostgreSQL with pgvector, and Playwright to create a powerful content scraping and similarity search system.

Think of it as your personal web librarian that not only collects content but also helps you find related information using the magic of vector embeddings.

The stack

  • FastAPI: Because life's too short for slow APIs
  • PostgreSQL + pgvector: Vector similarity search that doesn't make your database cry
  • Playwright: Headless browsing that actually works with modern websites
  • Sentence Transformers: Turning words into math (vectors) so computers can understand content similarity

Show Me The Code Already!

Here's how easy it is to scrape URLs:

curl -X POST http://localhost:8000/v1/scrape/ \
  -H "X-API-Key: your_api_key" \
  -H "Content-Type: application/json" \
  -d '{"urls": ["https://dev.to/alexandrughinea", "https://dev.to/topics/python"]}'
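If you'd rather call the API from Python than curl, here's a rough sketch using only the standard library. The endpoint path and `X-API-Key` header come from the curl example above; the `build_scrape_request` helper is just something I made up for illustration, not part of the project.

```python
# Hypothetical sketch: the same scrape call from Python, stdlib only.
# Endpoint and header names match the curl example; adjust host/key
# for your own setup.
import json
import urllib.request

def build_scrape_request(urls, api_key, base="http://localhost:8000"):
    """Build a POST request for the /v1/scrape/ endpoint."""
    return urllib.request.Request(
        f"{base}/v1/scrape/",
        data=json.dumps({"urls": urls}).encode("utf-8"),
        headers={"X-API-Key": api_key, "Content-Type": "application/json"},
        method="POST",
    )

req = build_scrape_request(
    ["https://dev.to/alexandrughinea", "https://dev.to/topics/python"],
    api_key="your_api_key",
)
# urllib.request.urlopen(req) would actually send it; omitted here so
# the snippet doesn't require a running server.
```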

And searching for similar content is just as simple:

curl -X GET "http://localhost:8000/v1/search?text=fastapi%20vector%20search&limit=5" \
  -H "X-API-Key: your_api_key"

For larger batches, the async batch endpoint provides better performance:

# The batch endpoint processes URLs asynchronously for better performance
curl -X POST http://localhost:8000/v1/scrape/batch/ \
  -H "X-API-Key: your_api_key" \
  -H "Content-Type: application/json" \
  -d '{"urls": ["https://example.com", "https://example.org", "https://example.net"]}' 

And once you've scraped some content, you can query it directly:

# Get all scraped data (paginated)
curl -X GET "http://localhost:8000/v1/data/?limit=10" \
  -H "X-API-Key: your_api_key"

# Get a specific item by ID
curl -X GET "http://localhost:8000/v1/data/42" \
  -H "X-API-Key: your_api_key"

The Secret Sauce: Vector Similarity

The real magic happens in the search functionality. When you scrape content, it gets converted into vector embeddings (fancy math arrays) that represent the semantic meaning of the text. Then when you search, your query text gets converted to the same format, and we find the closest matches using cosine similarity.
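As a toy illustration (not the project's actual code), here's cosine similarity in a few lines of plain Python. In the real system pgvector performs this comparison in-database, and Sentence Transformers embeddings have hundreds of dimensions; three are used here so the numbers stay readable.

```python
# Toy illustration of cosine similarity between embedding vectors.
import math

def cosine_similarity(a, b):
    """1.0 = same direction, 0.0 = unrelated, -1.0 = opposite."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

query_vec = [0.9, 0.1, 0.0]   # e.g. "fastapi vector search"
doc_close = [0.8, 0.2, 0.1]   # a post about FastAPI + pgvector
doc_far   = [0.0, 0.1, 0.9]   # a post about something unrelated

# The semantically closer document scores higher:
assert cosine_similarity(query_vec, doc_close) > cosine_similarity(query_vec, doc_far)
```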

Why I Built This

I got tired of scraping websites and then having to build separate systems for search and content analysis. This project combines everything into one cohesive system that:

  1. Scrapes content efficiently using headless browsers
  2. Stores content and vector embeddings in PostgreSQL
  3. Provides a clean API for searching and retrieving similar content
  4. Handles batch operations for processing multiple URLs (sync and async versions)
  5. Avoids duplicate content through similarity detection
  6. Respects robots.txt rules (configurable via environment variables)
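On that last point, Python's standard library already handles the basic robots.txt check. This is just a minimal sketch of the idea, not the project's implementation (which, as noted, is toggled via environment variables):

```python
# Minimal robots.txt check using the standard library. This sketch
# parses rules that would normally be fetched from the target site.
from urllib.robotparser import RobotFileParser

rules = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

print(parser.can_fetch("*", "https://example.com/public/page"))   # True
print(parser.can_fetch("*", "https://example.com/private/page"))  # False
```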

Getting Started in 30 Seconds

# Clone the repo
git clone https://github.com/alexandrughinea/python-fastapi-postgres-vector-scraper

# Start with Docker
cd python-fastapi-postgres-vector-scraper
docker-compose up -d

# Your API is now running at http://localhost:8000

Final Thoughts

Is this yet another scraper? Technically, yes. But it's a scraper with superpowers. It doesn't just collect data; it understands it, organizes it, and helps you find connections between different pieces of content in your own database.

The next time someone says "we need to scrape some websites," you can smugly pull out this repo instead of cobbling together a BeautifulSoup script that will break the moment the website changes its CSS classes.

Exploring the API with Bruno

To make your life even easier, I've included Bruno API documentation in the /docs folder. Bruno is a great open-source API client that lets you test all the endpoints without writing a single line of code. Just install Bruno, open the collection, and start experimenting!

The complete project is available on GitHub at github.com/alexandrughinea/python-fastapi-postgres-vector-scraper. Star it if you find it useful, and contributions are always welcome!

Give it a try and let me know what you think! And remember, with great scraping power comes great responsibility: don't hammer websites with requests, be respectful of robots.txt, and maybe consider asking for permission first, because being a good web citizen matters. You can toggle robots.txt enforcement with the RESPECT_ROBOTS_TXT environment variable if you really need to.

Happy scraping! 🕸️🚀
