Faster Than Requests with MultiThread Web Scraper

Juan Carlos

  • Alternative HTTP client, new version 0.9, API for humans.
  • Added a built-in multi-thread web scraper one-liner.
  • Added a built-in multi-thread file downloader one-liner.
  • 1 file, 0 dependencies, ~100 lines of code, Python 2.7 to 3.8, Alpine & ARM.
  • GitHub Actions CI building from scratch.
  • GitHub Actions CI running unit tests from scratch.
  • Examples for the web scraper and the file downloader.
  • Extras for data science, web scraping, and HTTP REST JSON APIs.
  • Examples, Dockerfile, tests, FAQ, CoC, debug helpers, JSON helpers.
  • Docs cover all functions, with detailed arguments and return types.
| Library  | Speed  | Files | LOC  | Dependency    | Devs | Scraper |
|----------|--------|-------|------|---------------|------|---------|
| PyWGET   | 152.39 | 1     | 338  | Wget          | >17  | No      |
| Requests | 15.58  | >20   | 2558 | >=7           | >527 | No      |
| Urllib   | 4.00   | ???   | 1200 | 0 (std lib)   | ???  | No      |
| Urllib3  | 3.55   | >40   | 5242 | >5 (SSL)      | >188 | No      |
| PyCurl   | 0.75   | >15   | 5932 | Curl, LibCurl | >50  | No      |
| FTR      | 0.45   | 1     | 99   | 0             | 1    | Yes, 2  |

Hello World

requests.get("http://httpbin.org/get")
Enter fullscreen mode Exit fullscreen mode
  • GET, POST, PATCH, PUT, DELETE and more.
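
The call returns a response object. Here is a minimal inspection sketch; the dictionary-style access and the "status" key mirror the usage shown in the comment thread below and are an assumption that may differ between versions:

```python
import faster_than_requests as requests

# Assumed response shape: dictionary-style access with a "status" key,
# following the usage in the comments below; keys may vary by version.
response = requests.get("http://httpbin.org/get")
print(response["status"])
```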

Multi-Thread Web Scraper Built-in

```python
requests.scrapper(["http://example.org", "http://example.io"], threads=True)
```
  • There are 2 ready-made web scrapers built in, each an easy-to-use one-liner.

Multi-Thread File Downloader Built-in

```python
requests.download2([("http://example.org/foo.jpg", "output.jpg")], threads=True)
```
  • Pass delay=1000 for a 1 second sleep between downloads, as sketched below.
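
A minimal sketch combining the downloader with the delay option from the bullet above; that delay=1000 means a 1 second pause is taken from the post, but passing it alongside threads=True is an assumption:

```python
import faster_than_requests as requests

# Assumption: delay=1000 sleeps 1 second between downloads, per the
# bullet above; exact keyword behavior may vary by library version.
requests.download2(
    [("http://example.org/foo.jpg", "foo.jpg"),
     ("http://example.org/bar.jpg", "bar.jpg")],
    threads=True,
    delay=1000,
)
```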

Multi-Thread Bulk GET

```python
requests.get2str2(["http://example.org", "http://example.io"], threads=True)
```
  • Pass threads=False to disable multi-threading; a usage sketch follows.
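
A sketch of consuming the bulk results, assuming get2str2 returns one response body string per URL in order (inferred from the function name; not confirmed by the post):

```python
import faster_than_requests as requests

urls = ["http://example.org", "http://example.io"]
# Assumption: get2str2 returns one body string per URL, in input order.
bodies = requests.get2str2(urls, threads=True)
for url, body in zip(urls, bodies):
    print(url, len(body))
```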

GitHub

πŸπŸ˜ΌπŸ‘

Top comments (5)

Josh • Edited on

I keep seeing the double-p in scrapper and think, "Like, a scrappy fighter?" Is it possible you intended to call the method scraper?

kcespedes

I'm passing headers to faster_than_requests but it gives 400 Bad Request.

This is my code sample:

```python
import faster_than_requests as requests

headers = [("Host", "api.sample.com"), ("Connection", "Keep-Alive"), ("Accept-Encoding", "gzip")]

response = requests.post(url=self.offer_url, body=self.offer_data, http_headers=headers, proxy_url=self.proxyURL)

print(response["status"])
```

Any help would be greatly appreciated. I currently use the requests library, but it is too slow.

Lex

Cool, I like the color scheme of your text editor... animus, I'm gonna pay attention to this library... animus

Rohan Sawant

Great post, Juan!

If you think this is interesting, check out Async for HTTP requests; I bet it will blow your mind! 🦄

rhymes

Nice, I see you implemented it in Nim eheh!

Have you tried httpx's async support as well?
