"Publish or perish": publications matter a great deal in research. If you have a personal website, updating your publication list by hand is a pain, so why not scrape your publications from Google Scholar instead? Then you only need to maintain your Google Scholar profile, and whenever a new article is published it will automatically show up on your personal website. Here I use React and style it with Chakra UI.
1. Set up a cors-anywhere server
Google Scholar does not send the CORS headers that would allow a page on another origin to read its responses, so the browser blocks the request and you'll run into a CORS error when you try to fetch data from it directly.
To work around this, we need to set up a proxy server. You can create a Heroku account and deploy a cors-anywhere server (which is free and open source) with these simple commands:
git clone https://github.com/Rob--W/cors-anywhere.git
cd cors-anywhere/
npm install
heroku create
git push heroku master
Now you have your own cors-anywhere server with a URL like https://safe-mountain-7777.herokuapp.com/.
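As a rough sketch of how the proxy is used: cors-anywhere treats everything after its own origin as the target URL, forwards the request, and adds CORS headers to the response. The helper name buildProxiedUrl and the placeholder id YOUR_SCHOLAR_ID below are made up for illustration, and the host is the example one from above, so substitute your own.

```javascript
// Hypothetical helper: compose the URL that the browser will actually request.
// cors-anywhere reads everything after its own origin as the target URL.
function buildProxiedUrl(proxyUrl, targetUrl, params = {}) {
  // Add a trailing slash to the proxy URL if it is missing.
  const base = proxyUrl.endsWith('/') ? proxyUrl : proxyUrl + '/';
  // Serialize the query parameters that belong to the target URL.
  const query = new URLSearchParams(params).toString();
  return query ? `${base}${targetUrl}?${query}` : `${base}${targetUrl}`;
}

const proxied = buildProxiedUrl(
  'https://safe-mountain-7777.herokuapp.com',
  'https://scholar.google.com/citations',
  { user: 'YOUR_SCHOLAR_ID', hl: 'en' }
);
console.log(proxied);
```

Opening the printed URL in a browser tab is a quick way to check that your proxy is up before wiring it into the app.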
2. Create react app and install dependencies
This will take some time, so please bear with it. In a terminal:
npx create-react-app scholarscraper
Personally, I use Chakra UI to style my website. We'll use axios to fetch the HTML and cheerio to extract data from it, so let's install them:
cd scholarscraper
npm i @chakra-ui/react
npm i axios
npm i cheerio
3. Edit the App.js file
I'll explain this step by step; the full version of the App.js file is at the end of this section.
First, we import all the libraries:
import axios from 'axios';
import {Text, Link, ChakraProvider, Container} from "@chakra-ui/react";
import {useEffect, useState} from "react";
import * as cheerio from 'cheerio';
Inside the function App() {}, basically:
- We set PROXY_URL, which is the cors-anywhere server we deployed previously, and URL, which points to Google Scholar.
- Our articles are stored in the state variable articles, an array initialised with useState([]).
- We make a GET request to Scholar through the proxy, which is as simple as requesting PROXY_URL + URL. We also pass the params, including your user id. This is the id in your Scholar profile URL.
- We extract the elements with cheerio. Here I extract the title, authors, journal, number of citations and some links. If you want to extract more data, inspect the Scholar page to find the relevant classes and follow the same pattern.
const PROXY_URL = 'https://safe-mountain-7777.herokuapp.com/';
const URL = 'https://scholar.google.com/citations';
const [articles, setArticles] = useState([]);

useEffect(() => {
  // Request the Scholar profile page through the cors-anywhere proxy
  axios.get(PROXY_URL + URL, {
    params: {
      'user': 'PkfvVs0AAAAJ',
      'hl': 'en'
    }
  })
    .then(res => {
      const $ = cheerio.load(res.data);
      const arrayArticles = [];
      // Title cell of each row: title, link, authors and journal
      $('#gsc_a_b .gsc_a_t').each((index, element) => {
        const title = $(element).find('.gsc_a_at').text();
        const link = $(element).find('.gsc_a_at').attr('href');
        const author = $(element).find('.gsc_a_at + .gs_gray').text();
        const journal = $(element).find('.gs_gray + .gs_gray').text();
        arrayArticles.push({'title': title, 'link': link, 'author': author, 'journal': journal});
      });
      // Citation cell of each row, matched to its article by row index
      $('#gsc_a_b .gsc_a_c').each((index, element) => {
        const cited = $(element).find('.gs_ibl').text();
        const citedLink = $(element).find('.gs_ibl').attr('href');
        arrayArticles[index]['cited'] = cited;
        arrayArticles[index]['citedLink'] = citedLink;
      });
      setArticles(arrayArticles);
    })
    // Log the actual error instead of an empty message
    .catch(err => console.error(err));
}, []);
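The snippet above pairs the title cells with the citation cells purely by row index, which assumes both selections have the same length and order. As a minimal sketch of that merge step on its own, with a fallback when a citation cell is missing, something like the following could be used. The helper name mergeArticles and the sample data are made up for illustration and are not part of the original code.

```javascript
// Hypothetical helper: combine title rows with their citation cells by position.
function mergeArticles(titleRows, citationCells) {
  return titleRows.map((row, index) => {
    // Fall back to empty values if the citation cell for this row is missing.
    const cell = citationCells[index] || {};
    return {
      ...row,
      cited: cell.cited || '',
      citedLink: cell.citedLink || '',
    };
  });
}

// Made-up sample data: two articles, but only one citation cell.
const merged = mergeArticles(
  [
    { title: 'Paper A', link: '/citations?a' },
    { title: 'Paper B', link: '/citations?b' },
  ],
  [{ cited: '12', citedLink: 'https://scholar.google.com/scholar?cites=1' }]
);
console.log(merged);
```

Returning new objects rather than mutating entries by index means a length mismatch produces empty fields instead of a thrown TypeError.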
Finally, render the UI:
return (
  <ChakraProvider>
    <Container maxW={'container.md'}>
      {articles.map(article => {
        // Each list item needs a unique key; the article link serves as one
        return (
          <div key={article.link}>
            <Link href={`https://scholar.google.com${article.link}`} isExternal>
              <Text fontWeight={600} color={'teal.800'}>{article.title}</Text>
            </Link>
            <Text color={'gray.600'}>{article.author}</Text>
            <Text color={'gray.600'}>{article.journal}</Text>
            <Link href={article.citedLink} isExternal>
              <Text color={'gray.600'}>Cited by {article.cited}</Text>
            </Link>
          </div>
        )
      })}
    </Container>
  </ChakraProvider>
)
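The render code builds the article link by prepending the Scholar origin and always prints "Cited by", even when the scraped citation text is empty. A small sketch of two helpers that make both steps explicit is shown here. The names SCHOLAR_ORIGIN, toAbsoluteUrl and citationLabel are hypothetical, and the assumption that an uncited article yields empty or whitespace-only text has not been verified against Scholar's markup.

```javascript
const SCHOLAR_ORIGIN = 'https://scholar.google.com';

// Resolve a scraped href against the Scholar origin.
// Relative paths get the origin prepended; absolute URLs are returned unchanged.
function toAbsoluteUrl(href) {
  return new URL(href, SCHOLAR_ORIGIN).href;
}

// Assumption: an article without citations yields empty or whitespace-only text.
function citationLabel(cited) {
  const count = (cited || '').trim();
  return count ? `Cited by ${count}` : 'No citations yet';
}

console.log(toAbsoluteUrl('/citations?view_op=view_citation&hl=en'));
console.log(citationLabel('12'));
console.log(citationLabel(''));
```

These helpers rely on the built-in URL constructor, so define them outside App(), or rename the URL constant inside it, to avoid shadowing.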
The full App.js file is here:
import axios from 'axios';
import {Text, Link, ChakraProvider, Container} from "@chakra-ui/react";
import {useEffect, useState} from "react";
import * as cheerio from 'cheerio';

function App() {
  // Replace this with the URL of your own cors-anywhere server
  const PROXY_URL = 'https://safe-mountain-7777.herokuapp.com/';
  const URL = 'https://scholar.google.com/citations';
  const [articles, setArticles] = useState([]);

  useEffect(() => {
    // Request the Scholar profile page through the cors-anywhere proxy
    axios.get(PROXY_URL + URL, {
      params: {
        'user': 'PkfvVs0AAAAJ',
        'hl': 'en'
      }
    })
      .then(res => {
        const $ = cheerio.load(res.data);
        const arrayArticles = [];
        // Title cell of each row: title, link, authors and journal
        $('#gsc_a_b .gsc_a_t').each((index, element) => {
          const title = $(element).find('.gsc_a_at').text();
          const link = $(element).find('.gsc_a_at').attr('href');
          const author = $(element).find('.gsc_a_at + .gs_gray').text();
          const journal = $(element).find('.gs_gray + .gs_gray').text();
          arrayArticles.push({'title': title, 'link': link, 'author': author, 'journal': journal});
        });
        // Citation cell of each row, matched to its article by row index
        $('#gsc_a_b .gsc_a_c').each((index, element) => {
          const cited = $(element).find('.gs_ibl').text();
          const citedLink = $(element).find('.gs_ibl').attr('href');
          arrayArticles[index]['cited'] = cited;
          arrayArticles[index]['citedLink'] = citedLink;
        });
        setArticles(arrayArticles);
      })
      // Log the actual error instead of an empty message
      .catch(err => console.error(err));
  }, []);

  return (
    <ChakraProvider>
      <Container maxW={'container.md'}>
        {articles.map(article => {
          // Each list item needs a unique key; the article link serves as one
          return (
            <div key={article.link}>
              <Link href={`https://scholar.google.com${article.link}`} isExternal>
                <Text fontWeight={600} color={'teal.800'}>{article.title}</Text>
              </Link>
              <Text color={'gray.600'}>{article.author}</Text>
              <Text color={'gray.600'}>{article.journal}</Text>
              <Link href={article.citedLink} isExternal>
                <Text color={'gray.600'}>Cited by {article.cited}</Text>
              </Link>
            </div>
          )
        })}
      </Container>
    </ChakraProvider>
  )
}

export default App;
Now start the app and enjoy your work:
npm start
Good luck!