John Rooney for Extract Data

Proxies Alone Aren’t (Always) the Answer

Whenever I asked about scaling up web scraping, or how to avoid being blocked, I always got the same answer: “use good proxies”. This isn’t necessarily a bad answer, it’s just not a complete one. So let’s look at why proxies aren’t a one-stop solution, why you still need them, and how to use them properly.


Why Proxies Don't Work Like They Used To

So why don't proxies work like they used to? The main factor is the huge advancement in the technology websites use to detect and block requests based on the owner's criteria. Rotating IPs used to be a solid strategy, but you can't rely on it alone anymore, because there are now plenty of signals beyond the IP itself that can be used to detect and block you.
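
To see what that classic approach looks like, here's a minimal sketch of naive IP rotation in Python (the proxy URLs and target are placeholders, not a real provider):

```python
import random
import requests

# Placeholder proxy pool - substitute your provider's endpoints
PROXIES = [
    "http://user:pass@proxy1.example.com:8000",
    "http://user:pass@proxy2.example.com:8000",
]

def fetch(url: str) -> requests.Response:
    # Classic rotation: a fresh IP for every request
    proxy = random.choice(PROXIES)
    return requests.get(url, proxies={"http": proxy, "https": proxy}, timeout=30)

response = fetch("https://example.com")
print(response.status_code)
```

This used to be enough on its own; today every one of those requests can still be fingerprinted and blocked, even though the IP changes each time.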

The systems and tech available now to anyone who runs a website are incredible, and the costs for these services keep coming down while their capabilities go up.

The most sophisticated systems use a whole host of techniques, from JavaScript challenges you need to pass to gain access, to advanced fingerprinting that pulls data from your browser, to monitoring of mouse and keyboard movements and actions. And given the rise of AI, quickly and easily profiling and checking that data is more accessible than ever.

It’s insane how much information can be obtained from this alone. One of the traps beginners often fall into is not matching their browser timezone to their IP's location. It's a simple thing that's easy to miss, and it can result in a ban and a headache for you.
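
As a sketch of how you might avoid that particular trap, here's a browser context whose reported timezone is pinned to match the proxy's location. This assumes Playwright and a hypothetical US-based proxy endpoint:

```python
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(
        # Hypothetical US-based proxy - substitute your provider's endpoint
        proxy={"server": "http://us-proxy.example.com:8000"}
    )
    # Match the browser's reported timezone and locale to the IP's location
    context = browser.new_context(
        timezone_id="America/New_York",
        locale="en-US",
    )
    page = context.new_page()
    page.goto("https://example.com")
    browser.close()
```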

And if you’ve ever wondered why your code that uses Python's requests library gets blocked a lot, it’s because requests has a distinctive TLS fingerprint that most people don't know about. This makes it easy to block, regardless of how good your proxy IPs are.
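
One workaround (a sketch using the third-party curl_cffi library, not something specific to requests itself) is to present a real browser's TLS fingerprint instead:

```python
# pip install curl_cffi
from curl_cffi import requests

# Impersonate Chrome's TLS fingerprint instead of the default
# python-requests one (available targets depend on the library version)
response = requests.get("https://example.com", impersonate="chrome")
print(response.status_code)
```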


But You Do NEED Proxies, Right? Yes.

As I stated at the top of this article, having a set of good-quality IPs is now a base requirement rather than a useful perk. Think of it as the foundation to build your house on.

To have the most success, it's important to match the IP to the overall session, meaning all of the cookies and headers from the original successful request are passed along with every subsequent request. What’s key here is getting that initial 200 OK response. Many of the common WAFs set a cookie as a “pass”, clearing that specific session to carry on making requests, and this is what you need to use effectively and efficiently. It sounds simple, but in reality it isn't, and it takes time and effort to set up and manage.
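
Here's a minimal sketch of the idea: one session pinned to one IP, so the "pass" cookie earned by the first 200 OK travels with every follow-up request (the proxy URL and paths are placeholders):

```python
import requests

# Pin the whole session to a single proxy so the IP, cookies, and
# headers stay consistent for its lifetime
proxy = "http://user:pass@proxy1.example.com:8000"

session = requests.Session()
session.proxies = {"http": proxy, "https": proxy}
session.headers.update({
    # Placeholder - send a realistic browser User-Agent here
    "User-Agent": "Mozilla/5.0",
})

# The first successful request earns the WAF's "pass" cookie...
first = session.get("https://example.com")
first.raise_for_status()

# ...and the Session replays it automatically on every later request
second = session.get("https://example.com/page/2")
```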

In fact, this is exactly what we use at Zyte to manage bans and reduce costs. If you’ve ever scraped a lot of data with full browsers, you’ll know they are resource- and data-heavy.


Challenging the Narrative: From Proxies to a Web Scraping API

We want to challenge the narrative. We believe in a shift from just proxies to a web scraping API that can benefit anyone who wants to access web data quickly and easily, while still retaining control over their own developer environment.

The Zyte API is exactly that, a simple but extremely powerful solution that will work for small projects and new developers, right the way up to the pros scraping millions of pages of data. And we’re so confident, it’s what we use here for our data business.
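
As a rough sketch of what that looks like in practice (based on Zyte's public quickstart at the time of writing; check zyte.com for the current API shape), fetching a page is a single authenticated POST:

```python
import base64
import requests

api_response = requests.post(
    "https://api.zyte.com/v1/extract",
    auth=("YOUR_ZYTE_API_KEY", ""),  # API key as the basic-auth username
    json={
        "url": "https://example.com",
        "httpResponseBody": True,  # ask for the raw page body
    },
)
api_response.raise_for_status()

# The body comes back base64-encoded inside the JSON response
html = base64.b64decode(api_response.json()["httpResponseBody"])
print(html[:200])
```

The idea is that bans, retries, fingerprints, and proxy selection are all handled behind that one call.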


Understanding the Total Cost of Ownership

But why is this better than using proxies and doing it all yourself? I want to address this, but first we need to understand the total cost of ownership of a web scraping project. Web scraping is unlike any other software solution. On one hand it comes with a lot of unknowns: site-specific solutions, cutting-edge tooling, highly technical infrastructure, and constant upkeep and maintenance.

Yet on the other hand, in some cases you can scrape data in three lines of code. This massive gap and learning curve means a lot of technical teams can scrape one site successfully, then fall into the trap of thinking it will be easy to manage and maintain many more scrapers with minimal effort, when it simply isn't.
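
To make the "three lines" point concrete (with example.com standing in for a real target), the happy path really is this short:

```python
import requests

html = requests.get("https://example.com").text
print(html[:500])  # no blocks, no JavaScript rendering, no retries... yet
```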

This is the benefit of moving away from a solely proxy-based solution to the Zyte API: we take away all the pain points of managing and scaling a web scraping project, meaning less work and developer time for you, in turn saving costs.

Just like building a successful app, your end users and customers don’t care what language or framework you use, as long as the end result meets their needs.

If a site changes and the datacenter proxies you were using start getting blocked, you need to spend time either finding more that work or switching up to residential proxies at a higher cost.

With us it's a flat cost per request based on the site, not a cost per GB like proxies. Know how many pages you're scraping? Then the pricing is easy to work out and scale.
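
As a back-of-the-envelope comparison (every rate below is a made-up placeholder, not real pricing from either model):

```python
pages = 1_000_000

# Hypothetical flat per-request pricing
cost_per_request = 0.0005          # placeholder rate, in dollars
flat_total = pages * cost_per_request

# Hypothetical per-GB proxy pricing - cost scales with page weight
avg_page_mb = 2.5                  # placeholder average page size
cost_per_gb = 10.0                 # placeholder rate, in dollars
bandwidth_total = pages * avg_page_mb / 1024 * cost_per_gb

print(f"Flat per-request: ${flat_total:,.2f}")
print(f"Per-GB bandwidth: ${bandwidth_total:,.2f}")
```

The numbers don't matter; the shape does. The per-request total depends only on how many pages you scrape, while the per-GB total swings with how heavy each page is.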

In fact, go to zyte.com now and ask the bot what it might cost you to scrape 1 million pages of your target site and see for yourself.
