I found this guide a while after I had already worked out a solution myself.
TL;DR
- Use the website's public API, if one exists.
- Read https://website.com/robots.txt first (a rough way to check it is sketched right after this list).
- Rate-limit your requests. Do not DoS (Denial of Service) the website.
- Use a realistic User-Agent by setting the request header, or use a headless browser such as Puppeteer or Selenium.
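For the robots.txt point, here is a very crude sketch of a check, reusing the same placeholder site; a real crawler should use a proper robots.txt parser that understands User-agent groups, which this does not.

```js
import axios from 'axios'

// Crude sketch: fetch robots.txt and test a path against every Disallow
// rule, ignoring User-agent groups. Only good enough for a quick manual check.
async function isDisallowed (baseUrl, path) {
  const { data } = await axios.get(new URL('/robots.txt', baseUrl).href)
  return data
    .split('\n')
    .map(line => line.trim())
    .filter(line => line.toLowerCase().startsWith('disallow:'))
    .map(line => line.slice('disallow:'.length).trim())
    .some(prefix => prefix !== '' && path.startsWith(prefix))
}

// e.g. await isDisallowed('https://website.com', '/search.php')
```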
Here is roughly what the scraping ended up looking like; `vocabs` is the list of search terms I was looking up, and the sleep staggers the requests so the site is not hammered:

```js
import axios from 'axios'
import * as cheerio from 'cheerio'

await Promise.all(vocabs.map(async (v, i) => {
  // Stagger requests: the i-th request waits i seconds (crude rate limiting).
  await new Promise(resolve => setTimeout(resolve, i * 1000))
  const r = await axios.get('https://website.com/search.php', {
    params: {
      q: v
    },
    headers: {
      'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:47.0) Gecko/20100101 Firefox/47.0'
    }
  })
  const $ = cheerio.load(r.data)
  // ... pull what you need out of the parsed HTML
}))
```
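If the results are rendered by JavaScript, the headless-browser route from the list above works too. A minimal Puppeteer sketch for a single query, reusing the same placeholder URL and `q` parameter:

```js
import puppeteer from 'puppeteer'

const browser = await puppeteer.launch()
const page = await browser.newPage()
// Same placeholder search URL as above; Puppeteer drives a real browser,
// so JavaScript-rendered content ends up in the page too.
await page.goto('https://website.com/search.php?q=' + encodeURIComponent('hello'))
const html = await page.content() // full rendered HTML; feed it to cheerio if you like
await browser.close()
```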
Interestingly, CORS cannot prevent scraping from anywhere other than browser contexts like `<script>` tags; a server-side script is never subject to it. Why do we have CORS by default, again?
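To illustrate (same placeholder URL): the restriction only exists inside a browser, so the exact same request is blocked in one place and fine in the other.

```js
// In a browser page on some other origin, this is blocked unless the server
// sends an Access-Control-Allow-Origin header; the request may still go out,
// but the response is withheld from the page's JavaScript:
//   fetch('https://website.com/search.php?q=hello').then(r => r.text())

// From Node there is no CORS check at all; the server simply answers:
import axios from 'axios'
const r = await axios.get('https://website.com/search.php', { params: { q: 'hello' } })
console.log(r.status)
```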
Top comments (1)
I'm not a web dev, so I might be wrong, but I believe CORS is intentionally used to tell browsers that a resource is shareable across other domains.
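A minimal sketch of that opt-in, assuming a plain Node HTTP server on a hypothetical port: the server, not the browser, declares who may read its responses.

```js
import http from 'node:http'

// This header tells browsers the resource is shareable with any origin.
http.createServer((req, res) => {
  res.setHeader('Access-Control-Allow-Origin', '*')
  res.end('{"ok": true}')
}).listen(3000) // hypothetical port
```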