DEV Community

Thinh Ong

Academic portfolio: scrape publications from your Google Scholar profile with React

"Publish or perish": publication is central to a research career. If you have a personal website, updating your publication list by hand is a pain, so why not scrape your publications from Google Scholar instead? Then you only need to maintain your Google Scholar profile, and whenever a new article is published, it will automatically be updated on your personal website. Here I use React and style it with Chakra UI.

1. Set up a cors-anywhere server

Google Scholar does not send the CORS headers that would let a page on another origin read its responses, so the browser blocks the request and reports a CORS error when you try to fetch data from it directly.
To work around this, we need to set up a proxy server. You can create a Heroku account and deploy a cors-anywhere server (both free at the time of writing) with these simple commands:

git clone https://github.com/Rob--W/cors-anywhere.git
cd cors-anywhere/
npm install
heroku create
git push heroku master

Now you have your own cors-anywhere server with a URL like this: https://safe-mountain-7777.herokuapp.com/.
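
To make the proxy usage concrete: cors-anywhere takes the target URL straight from the request path, so a proxied request is just the proxy URL followed by the full target URL and its query string. Here is a minimal sketch of that composition. The helper name `buildProxiedUrl` is my own invention for illustration and is not part of cors-anywhere or axios.

```javascript
// Sketch: compose a cors-anywhere request URL.
// `buildProxiedUrl` is a hypothetical helper, not a cors-anywhere or axios API.
function buildProxiedUrl(proxyBase, targetUrl, params = {}) {
    // Make sure the proxy base ends with a slash before appending the target.
    const base = proxyBase.endsWith('/') ? proxyBase : proxyBase + '/';
    // URLSearchParams percent-encodes the keys and values.
    const query = new URLSearchParams(params).toString();
    return base + targetUrl + (query ? '?' + query : '');
}

const url = buildProxiedUrl(
    'https://safe-mountain-7777.herokuapp.com/',
    'https://scholar.google.com/citations',
    { user: 'PkfvVs0AAAAJ', hl: 'en' }
);
console.log(url);
// https://safe-mountain-7777.herokuapp.com/https://scholar.google.com/citations?user=PkfvVs0AAAAJ&hl=en
```

Pasting the printed URL into a browser tab is a quick way to check that the proxy is up before writing any React code.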

2. Create react app and install dependencies

This will take some time, so please bear with it. In a terminal:

npx create-react-app scholarscraper

Personally, I use Chakra UI to style my website. We'll use axios to fetch the HTML and cheerio to extract data from it, so let's install them:

cd scholarscraper
npm i @chakra-ui/react
npm i axios
npm i cheerio

3. Edit the App.js file

I'll explain this step by step. The full version of the App.js file is at the end of this section.

First, we import all the libraries:

import axios from 'axios';
import {Text, Link, ChakraProvider, Container} from "@chakra-ui/react";
import {useEffect, useState} from "react";
const cheerio = require('cheerio')

Inside the function App() {}, we do the following:

  • Set PROXY_URL to the cors-anywhere server we deployed previously, and URL to the Google Scholar citations page.
  • Store the articles in the state variable articles, an array initialised with useState([]).
  • Make a GET request to Google Scholar through the proxy by requesting PROXY_URL + URL, passing your user ID in the params. This is the value of the user parameter in your Scholar profile URL.
  • Extract the elements with cheerio. Here I extract the title, authors, journal, number of citations and some links. To extract more data, inspect the Scholar page to find the relevant classes and follow the same syntax.
    const PROXY_URL = 'https://safe-mountain-7777.herokuapp.com/';
    const URL = 'https://scholar.google.com/citations';
    const [articles, setArticles] = useState([]);

    useEffect(() => {
        axios.get(PROXY_URL + URL, {
            params: {
                'user': 'PkfvVs0AAAAJ',
                'hl': 'en'
            }
        })
        .then(res => {
            let $ = cheerio.load(res.data);
            let arrayArticles = [];
            $('#gsc_a_b .gsc_a_t').each((index, element) => {
                const title = $(element).find('.gsc_a_at').text();
                const link = $(element).find('.gsc_a_at').attr('href');
                const author = $(element).find('.gsc_a_at + .gs_gray').text();
                const journal = $(element).find('.gs_gray + .gs_gray').text();
                arrayArticles.push({'title': title, 'link': link, 'author': author, 'journal': journal});
            })
            $('#gsc_a_b .gsc_a_c').each((index, element) => {
                const cited = $(element).find('.gs_ibl').text();
                const citedLink = $(element).find('.gs_ibl').attr('href');
                arrayArticles[index]['cited'] = cited;
                arrayArticles[index]['citedLink'] = citedLink;
            })
            setArticles(arrayArticles);
        })
        .catch(err => console.error(err))
    }, [])
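
One thing to watch in the snippet above: the second loop writes to arrayArticles[index], which assumes both selectors return the same number of rows. If they ever differ, the assignment throws inside the promise chain. Below is a minimal sketch of a more defensive merge that works on plain arrays, so it can be tested without cheerio. The helper `mergeCitations` and the sample data are hypothetical, and I am assuming that an article with no citations yields an empty string for the count.

```javascript
// Sketch: merge citation data into the article list without assuming
// that both lists have the same length.
// `mergeCitations` and the sample data below are hypothetical.
function mergeCitations(articles, citations) {
    return articles.map((article, index) => {
        // Fall back to an empty object when there is no matching citation row.
        const citation = citations[index] || {};
        // Assumption: an uncited article gives an empty string, which parses to NaN.
        const count = Number.parseInt(citation.cited, 10);
        return {
            ...article,
            cited: Number.isNaN(count) ? 0 : count,
            citedLink: citation.citedLink || null,
        };
    });
}

const merged = mergeCitations(
    [{ title: 'Paper A' }, { title: 'Paper B' }],
    [{ cited: '12', citedLink: 'https://example.com/cited-by-a' }]
);
console.log(merged[0].cited, merged[1].cited); // 12 0
```

Storing the count as a number also makes it easy to sort or filter articles by citations later, and a null citedLink tells the render step to skip the "Cited by" link.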

Finally, render the UI:

   return (
        <ChakraProvider>
            <Container maxW={'container.md'}>
                {articles.map(article => {
                    return (
                        <>
                            <Link href={`https://scholar.google.com${article.link}`} isExternal>
                                <Text fontWeight={600} color={'teal.800'}>{article.title}</Text>
                            </Link>
                            <Text color={'gray.600'}>{article.author}</Text>
                            <Text color={'gray.600'}>{article.journal}</Text>
                            <Link href={article.citedLink} isExternal>
                                <Text color={'gray.600'}>Cited by {article.cited}</Text>
                            </Link>
                        </>
                    )
                })}
            </Container>
        </ChakraProvider>
    )
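
The title link scraped from Scholar is relative, which is why the render step prefixes it with https://scholar.google.com, while the "Cited by" link is used as-is. If you would rather not track which link needs the prefix, a small helper built on the standard URL constructor can normalise both. This is only a sketch, and `toAbsoluteUrl` is a name I made up for illustration.

```javascript
// Sketch: normalise scraped hrefs to absolute URLs.
// `toAbsoluteUrl` is a hypothetical helper, not part of React or cheerio.
const SCHOLAR_ORIGIN = 'https://scholar.google.com';

function toAbsoluteUrl(href, origin = SCHOLAR_ORIGIN) {
    // attr('href') returns undefined when the attribute is missing.
    if (!href) return null;
    // new URL(href, base) resolves relative paths against the base
    // and leaves already-absolute URLs untouched.
    return new URL(href, origin).href;
}

console.log(toAbsoluteUrl('/citations?view_op=view_citation&hl=en'));
// https://scholar.google.com/citations?view_op=view_citation&hl=en
```

With this in place, both Link components could use href={toAbsoluteUrl(...)}, and the resulting string is unique enough to double as a React key for each article.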

The full App.js file is here:

import axios from 'axios';
import {Text, Link, ChakraProvider, Container} from "@chakra-ui/react";
import {useEffect, useState} from "react";
const cheerio = require('cheerio')

function App() {
    const PROXY_URL = 'https://safe-mountain-7777.herokuapp.com/';
    const URL = 'https://scholar.google.com/citations';
    const [articles, setArticles] = useState([]);

    useEffect(() => {
        axios.get(PROXY_URL + URL, {
            params: {
                'user': 'PkfvVs0AAAAJ',
                'hl': 'en'
            }
        })
        .then(res => {
            let $ = cheerio.load(res.data);
            let arrayArticles = [];
            $('#gsc_a_b .gsc_a_t').each((index, element) => {
                const title = $(element).find('.gsc_a_at').text();
                const link = $(element).find('.gsc_a_at').attr('href');
                const author = $(element).find('.gsc_a_at + .gs_gray').text();
                const journal = $(element).find('.gs_gray + .gs_gray').text();
                arrayArticles.push({'title': title, 'link': link, 'author': author, 'journal': journal});
            })
            $('#gsc_a_b .gsc_a_c').each((index, element) => {
                const cited = $(element).find('.gs_ibl').text();
                const citedLink = $(element).find('.gs_ibl').attr('href');
                arrayArticles[index]['cited'] = cited;
                arrayArticles[index]['citedLink'] = citedLink;
            })
            setArticles(arrayArticles);
        })
        .catch(err => console.error(err))
    }, [])

    return (
        <ChakraProvider>
            <Container maxW={'container.md'}>
                {articles.map(article => {
                    return (
                        <>
                            <Link href={`https://scholar.google.com${article.link}`} isExternal>
                                <Text fontWeight={600} color={'teal.800'}>{article.title}</Text>
                            </Link>
                            <Text color={'gray.600'}>{article.author}</Text>
                            <Text color={'gray.600'}>{article.journal}</Text>
                            <Link href={article.citedLink} isExternal>
                                <Text color={'gray.600'}>Cited by {article.cited}</Text>
                            </Link>
                        </>
                    )
                })}
            </Container>
        </ChakraProvider>
    )
}

export default App;

Now start the app and enjoy your work:

npm start

The app will look like this:
[Image: demo of the app]

Good luck!
