DEV Community

K

Crawling Websites in React-Native

Coming from years of web development, React-Native feels like a fresh start to me. You get better access to native functionality AND fewer rules are imposed on your app. For example, you can use fetch() to get any website you want. What this enables is client-side web crawling.

Why

Maybe you need data from a service, but they don't expose an API or the API doesn't give you all the data you need or the API is simply bad. Normally you would have to set up a server that crawls the target website and turns it into an API that you can use, but when you can access all data from all websites inside your client, you can save time.

Let's take the Amazon website as an example. You want to show all products of a page and a way to load the next page, but you want the data in your own data structure, so you can build your own UI around it.

How

  1. Get the HTML from the server
  2. Extract the needed data from the HTML
  3. Reshape the data for our use

1 Get the HTML from the Server

That's the easy part.

async function loadGraphicCards(page = 1) {
  const searchUrl = `https://www.amazon.de/s/?page=${page}&keywords=graphic+card`;
  const response = await fetch(searchUrl);   // fetch page

  const htmlString = await response.text();  // get response text
  ...
}

Fetching a URL with a search pattern returns an HTML page with some items.
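One thing the snippet above leaves out is a check that the request actually succeeded. Here is a minimal sketch, assuming a hypothetical helper named loadHtml with an injectable fetchImpl parameter (both names are illustrative and not part of the original code):

```javascript
// Hypothetical helper: fetches a page and returns its HTML as a string.
// fetchImpl defaults to the global fetch() that React-Native provides;
// passing it in explicitly is only there to make the function easy to test.
async function loadHtml(url, fetchImpl = fetch) {
  const response = await fetchImpl(url);

  // fetch() only rejects on network errors, so HTTP errors need a manual check
  if (!response.ok) {
    throw new Error(`Request for ${url} failed with status ${response.status}`);
  }

  return response.text();
}
```

The caller can then wrap loadHtml in a try/catch and show an error state instead of parsing an error page as if it were search results.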

2 Extract the Needed Data from the HTML

This is a bit trickier. The data is inside the HTML, but it's a string.

The naive approach would be to use a regular expression to parse the string and get the data, but HTML doesn't have a regular grammar so that wouldn't work.
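To see where the regular expression approach breaks down, here is a small sketch. The markup and the function name extractItemsWithRegex are made up for illustration; the point is only that nesting confuses the pattern:

```javascript
// Hypothetical markup: a result list where one item contains a nested list.
const html =
  "<ul id='results'>" +
  "<li>Card A<ul><li>8 GB</li></ul></li>" +
  "<li>Card B</li>" +
  "</ul>";

// Naive attempt: grab the content of every <li> with a non-greedy regex.
function extractItemsWithRegex(source) {
  return Array.from(source.matchAll(/<li>(.*?)<\/li>/g), match => match[1]);
}

const items = extractItemsWithRegex(html);
console.log(items);
// The first entry is "Card A<ul><li>8 GB": the match stops at the inner </li>,
// so the outer item is cut off in the middle of its markup.
```

The regex has no notion of which closing tag belongs to which opening tag, which is exactly what a parser keeps track of.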

The better way is to use an HTML parser and CSS selectors.

Cheerio is this solution. It comes with an HTML parser and a re-implementation of jQuery's core functionality, so you can use it on Node.js.

The problem is that React-Native is missing most of the Node.js packages Cheerio relies on, so it doesn't work out of the box.

I searched for quite some time to find a re-implementation of Cheerio that works on React-Native: cheerio-without-node-native. The naming of the package was a bit strange, haha.

But with this, the extraction of the data is now child's play too.

async function loadGraphicCards(page = 1) {
  const searchUrl = `https://www.amazon.de/s/?page=${page}&keywords=graphic+card`;
  const response = await fetch(searchUrl);      // fetch page 

  const htmlString = await response.text();     // get response text
  const $ = cheerio.load(htmlString);           // parse HTML string

  const liList = $("#s-results-list-atf > li"); // select result <li>s
  ...
}

3 Reshape the Data for further Use

After the data has been extracted from the HTML, we can start to reshape it for our use cases. The line between extraction and reshaping is a bit blurry here: the <li>s we selected are full of markup, and getting the right data out of them is extraction too. Often these two steps go hand in hand.

// Assumes a Cheerio port for React-Native is installed, e.g. cheerio-without-node-native
import cheerio from "cheerio-without-node-native";

async function loadGraphicCards(page = 1) {
  const searchUrl = `https://www.amazon.de/s/?page=${page}&keywords=graphic+card`;
  const response = await fetch(searchUrl);  // fetch page

  const htmlString = await response.text(); // get response text
  const $ = cheerio.load(htmlString);       // parse HTML string

  return $("#s-results-list-atf > li")      // select result <li>s
    .map((_, li) => ({                      // map to a list of objects
      asin: $(li).data("asin"),
      title: $("h2", li).text(),
      price: $("span.a-color-price", li).text(),
      rating: $("span.a-icon-alt", li).text(),
      imageUrl: $("img.s-access-image", li).attr("src") // scope the lookup to this <li>
    }))
    .get();                                 // turn the Cheerio collection into a plain array
}

This is not a robust example, but I think you get the idea. We can now use the new list of objects in our app to make our own UI for the Amazon results.


import React from "react";
import {Image, ScrollView, Text, TouchableOpacity} from "react-native";

class App extends React.Component {
  state = {
    page: 0,
    items: [],
  };

  componentDidMount = () => this.loadNextPage();

  // setState does not accept an async updater, so await the data first
  loadNextPage = async () => {
    const page = this.state.page + 1;
    const items = await loadGraphicCards(page);
    this.setState({items, page});
  };

  render = () => (
    <ScrollView>
      {this.state.items.map(item => <Item {...item} key={item.asin}/>)}
    </ScrollView>
  );
}

const Item = props => (
  <TouchableOpacity onPress={() => alert("ASIN:" + props.asin)}>
    <Text>{props.title}</Text>
    {/* remote images need explicit dimensions; 100x100 is an arbitrary choice */}
    <Image source={{uri: props.imageUrl}} style={{width: 100, height: 100}}/>
    <Text>{props.price}</Text>
    <Text>{props.rating}</Text>
  </TouchableOpacity>
);
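The objects returned by loadGraphicCards still carry price and rating as raw strings. If the app needs to sort or filter, the reshaping step could go a bit further. The sketch below assumes German-formatted strings such as "EUR 1.249,90" and "4,5 von 5 Sternen"; those formats, the sample item, and the helper names are assumptions for illustration, not something taken from the real page:

```javascript
// Hypothetical helper: pulls the first German-formatted number out of a string.
// "." is treated as a thousands separator and "," as the decimal separator.
// Returns null when the text contains no digits.
function parseGermanNumber(text) {
  const match = text.match(/\d[\d.]*(?:,\d+)?/);
  if (!match) return null;
  return parseFloat(match[0].replace(/\./g, "").replace(",", "."));
}

// Hypothetical helper: converts the string fields of one crawled item to numbers.
function normalizeItem(item) {
  return {
    ...item,
    price: parseGermanNumber(item.price),
    rating: parseGermanNumber(item.rating),
  };
}

// Made-up sample item in the shape produced by loadGraphicCards
const raw = {
  asin: "B000000000",
  title: "Example card",
  price: "EUR 1.249,90",
  rating: "4,5 von 5 Sternen",
  imageUrl: "https://example.com/card.jpg",
};

console.log(normalizeItem(raw)); // price becomes 1249.9, rating becomes 4.5
```

Keeping this in its own function means the selectors and the number parsing can change independently when the page layout or the locale changes.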

Conclusion

As with most problems, if you have the right tools, solutions can become simple. Often the problem is more about finding these tools :D

This client-side crawling approach can be used to build quick prototypes without the need for an API. Amazon is nice enough to deliver okay-ish static HTML, so it works rather well on their sites.

Top comments (17)

K

Glad this article is still helpful after all that time :D

Acaraccioli

Hello K, great post, I learned a lot. I didn't know this was possible using fetch. Just one quick question: how would you manage it if you wanted to fetch some quick data in the front end (React) but had to enter information in an input tag and maybe even click a button? I hope you can help me out a bit.

K

Glad you liked it.

I'd use React hooks.

function MyComponent(props) {
  const [info, setInfo] = React.useState("");
  const [remoteData, setRemoteData] = React.useState(
    "No data fetched yet!"
  );

  async function load() {
    const response = await fetch(
      // encode the input so special characters don't break the query string
      "http://example.com?info=" + encodeURIComponent(info)
    );
    const text = await response.text();
    setRemoteData(text);
  }

  return (
    <div>
      <input
        value={info}
        onChange={(e) => setInfo(e.target.value)}
      />
      <button onClick={load}>Fetch</button>
      {/* React expects textarea content via value, not children */}
      <textarea value={remoteData} readOnly />
    </div>
  );
}
Acaraccioli

That is great, I think that might work, thanks a lot! I'm just trying to figure out this error:
Access to fetch at 'MyUrl' from origin 'localhost:8100' has been blocked by CORS policy: No 'Access-Control-Allow-Origin' header is present on the requested resource. If an opaque response serves your needs, set the request's. I found that adding {mode:"no-cors"} to the fetch call gets rid of it, but then the object returns null. Do you know anything about this kind of error?

Hardiansa

Hi K, this is an awesome post.
I already tried it and it's working fine.

But I have a problem:
I'm trying to scrape a streaming service website (anime).

But there is no video tag inside the website.

So I inspected the elements and saw that the website needs a click on the "play" button; only then do I get an iframe with an embedded HTML document that contains the video tag.

So how can I perform a click with Cheerio and then get the embedded video?

Thanks

K

Sorry, I don't know if Cheerio works with JavaScript sites.

One way to solve this would be to check if you could calculate the video URL from the data that is already in the HTML.

Otherwise I don't know.

Bagustyo

Why async? What if I just use fetch?

K

You can use fetch without async/await. React-Native supports async/await, that's why I used it, but it isn't needed, you can use promises directly :)
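For comparison, this is roughly what the promise-only version could look like. It is a sketch: loadText and its fetchImpl parameter are hypothetical names, and fetchImpl simply defaults to the global fetch():

```javascript
// Promise-chain variant of "fetch a page and read its text".
// fetchImpl is a hypothetical parameter; it defaults to the global fetch().
function loadText(url, fetchImpl = fetch) {
  return fetchImpl(url)                  // returns a promise for the response
    .then(response => response.text()); // resolves to the body as a string
}

// Usage (not executed here):
// loadText("https://example.com").then(text => console.log(text.length));
```

Both styles produce a promise for the page text; async/await mostly saves the nesting once more steps are added.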

Bart Karalus

Nice one! I had no clue there was a jquery-like tool for RN. Very useful.

Crawlbase

Thank you! Impressive breakdown of web crawling in React-Native! Your methodical approach and clear explanations make it accessible for anyone diving into this field. Don't forget to streamline your efforts with Crawlbase for enhanced efficiency.

Martin Sone

Hi K, does the crawling above also apply to React.js, or only to React Native?

K

Only React-Native, because you can't access other websites from within a browser, just sites from the same domain or ones that are CORS-enabled.
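To make "same domain" a bit more concrete: browsers compare the origin (scheme, host, and port) of the page with that of the requested URL. Here is a rough sketch using the standard URL API. The function name is hypothetical, and it deliberately ignores the CORS-enabled case, which depends on the response headers of the target server:

```javascript
// Hypothetical helper: true when both URLs share scheme, host, and port.
function isSameOrigin(pageUrl, targetUrl) {
  return new URL(pageUrl).origin === new URL(targetUrl).origin;
}

console.log(isSameOrigin("http://localhost:8100/app", "http://localhost:8100/api/items")); // true
console.log(isSameOrigin("http://localhost:8100/app", "https://www.amazon.de/s/"));        // false
```

A browser app in the second situation gets blocked unless the target opts in via CORS headers, while fetch() in React-Native is not subject to that check.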

ynstl

No, it's not working. ERROR.
ESLint Parsing error: Unexpected token

Prakort Lean

Please go into more depth, I couldn't get cheerio-without-node-native to work.

Richard Joseph

There's also react-native-cheerio. I've not used it myself yet, but I'm researching this as well.

lilraja-x

I'm new to this React Native thing... I've tried your exact method, yet nothing is displayed in my React Native mobile app.
I'm new to this, so your help will matter a lot.

lilraja-x

There's no error but also nothing's displayed.