Alright, let's see a simple way of doing web scraping using the browser console. Here we use Chrome, but any browser will do since we are not using anything Chrome-specific.
This article is the follow-up to this video, so consider checking it out, as we go a little more in depth in some parts there.
If you like it, follow for more and consider subscribing to the YT channel ramgendeploy.
I think this is a great video for people starting with JavaScript to learn more about array manipulation and data extraction.
Great! So we are going to use the browser inspector to extract data and put it into useful formats like JSON or CSV files.
Content:
- Document element selection
- Data processing with JavaScript array methods:
  - map
  - reduce
  - filter
- JavaScript optional chaining example
Nice, let's go over some snippets:
First, if you are using Chrome, when you select an element in the Elements panel you can reference that element in the Console tab with $0. This is useful to see its children and figure out a "route" to the data you want.
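For example, with an element selected in the Elements panel, you can poke at it like this:

$0                        // the currently selected element
$0.children               // its child elements
$0.children[0].innerText  // the text of the first child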
There are a handful of methods to select elements; here we use the most general one, querySelectorAll. We store the result in the selEl variable so it's more convenient.
let selEl = document.querySelectorAll('selector')
Selector can be:
- an element name
- a class (prefixed with .)
- an id (prefixed with #)
- CSS syntax like: .container > .btn
querySelectorAll takes any valid CSS selector, but those are the most useful :D
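For instance (the class and id names here are just made up for illustration):

document.querySelectorAll('img')               // every <img> element
document.querySelectorAll('.card')             // every element with class "card"
document.querySelectorAll('#main')             // the element with id "main"
document.querySelectorAll('.container > .btn') // .btn elements directly inside .container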
So with querySelectorAll we can pass a selector, a class for example, and it will give us all the elements that have that class.
Then, after you select all the elements that you need, you are going to have a NodeList, so to use array methods on it you need to convert it to an array.
How do we do this? There are a bunch of ways to convert NodeLists to an array, but here we are going to use the spread operator to create a new array from our NodeList.
let selElArray = [...selEl]
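One of those other ways, if you prefer it over the spread, is Array.from:

let selElArray = Array.from(selEl)  // same result as [...selEl]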
With that, now we can use the array methods and process our data:
let parsedData = selElArray.map(
  (item) => [item.children[0].innerText, item.src, item.innerHTML]
)
Here, for example, we map each element into a new array with the innerText of its first child, the src attribute, and the innerHTML of the element. This is the part where we actually construct the data we need.
So it's up to you; for example, if we are scraping images, the src might be of interest.
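This is also a good spot for the optional chaining mentioned in the contents: if some element has no children, item.children[0].innerText throws an error, while with ?. that slot just becomes undefined instead (parsedDataSafe is an illustrative name):

let parsedDataSafe = selElArray.map(
  (item) => [item.children[0]?.innerText, item.src, item.innerHTML]  // ?. returns undefined instead of throwing
)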
In the video we go more in depth on this part.
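The contents also promised filter; one natural use here is dropping the rows where optional chaining gave us undefined (cleanData is again just an illustrative name):

let cleanData = parsedDataSafe.filter(
  (row) => row[0] !== undefined  // keep only rows whose element actually had a first child
)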
Now, having this object with an array of arrays, it's not enough to do console.log(parsedData) to be able to copy the data and have it elsewhere; sometimes the browser says nope, I won't display 1500 lines.
To solve this we are going to call our friend JSON, and using stringify we convert the object into a string to then display it in the console:
JSON.stringify(parsedData)
You don't need to actually call console.log here; the console prints the value of the last expression implicitly.
Now, with our object as a JSON string, we can grab it and use it anywhere that supports JSON.
But what if you want a CSV file? Well, .reduce to the rescue.
We are going to grab that array and reduce it to a single string in CSV format.
let data_csv = parsedData.reduce(
  (accumulator, current) => {
    return accumulator + `\n${current[0]},${current[1]},${current[2]}`
  },
  'header_1,header_2,header_3')
To explain this a little bit more: reduce takes two parameters, a reducer function that runs for each item in the array, and a starter value. In this case our starter value is the header row of the CSV file.
You can also use a for loop (sketched below), but I think using reduce is neater.
Then, in each iteration, we add a newline escape and our comma-separated values to the string. Notice that we use backtick quotes to get variable interpolation inside the string.
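If you'd rather go the for loop route, the equivalent would look something like this (csv_from_loop is an illustrative name to avoid clashing with data_csv above):

let csv_from_loop = 'header_1,header_2,header_3'
for (const row of parsedData) {
  csv_from_loop += `\n${row[0]},${row[1]},${row[2]}`  // same newline + comma-separated values
}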
Like, follow and stuff.
And consider subscribing to the YT channel ramgendeploy.