Rodion Gorkovenko

Detecting DEV peak hours via API (bash study)

It may be interesting to know when more people are reading DEV - for example, perfect timing is important for those who are obsessed with the idea of getting more likes and comments, right? :)

[Figure: distribution of random dev.to posts by hour of creation - an example of the histogram produced by the experiments described below]

Let's do a simple exercise of collecting stats - for example from DEV, though this could be done for most similar sites and forums. I will use bash for amusement, and because it is useful to know more about it - though you can easily do all the steps in another language (Python or PHP would suit best, in my opinion). You may briefly refer to my previous post about coding a Project Euler problem in bash if you are not well acquainted with this popular Linux command-line processor.

The plan is like this:

  • a few words about the DEV API in general
  • getting the max id (or total number) of articles
  • fetching about 100 random articles with ids below this one
  • collecting their publication hours
  • and making a small chart of it

Few words about DEV API

This wonderful site has a REST API - simply URLs which return JSON with data about articles, users, comments etc. A bit of documentation can be found here, but it is quite incomplete. Luckily, we can also refer to the source code, particularly here.

The API root is at dev.to/api. For example, let's look at the endpoint listing articles...

Getting total number of articles

Open the link https://dev.to/api/articles in your browser. You'll see JSON, obviously - an array containing article data.

[
    {
        "type_of":"article",
        "id":261930,
        "title":"An Open..."
        //...
    },
    { /*... another article data*/ },
    //...
]

Note that several of the latest articles are returned, and each has a numeric id - so seemingly DEV now hosts over 200k articles.

Now I want to make a request from the command line which returns just this number. I do requests with the curl tool. Try running the following commands, one by one:

curl https://dev.to/api/articles/

curl https://dev.to/api/articles/ | grep -oP '(?<=\"id\"\:)\d+'

curl https://dev.to/api/articles/ | grep -oP '(?<=\"id\"\:)\d+' | head -n 1

What's happening? The first line is very simple - curl fetches the JSON data from the API endpoint and dumps it to the console. The second line pipes this output to the grep tool, which applies a regular expression to find all "id":1234567 fragments and prints only the numeric ids, one per line (a bit more about it below). The third line applies yet another command to the produced list of ids - head just takes several top lines (one in this case).

What does this regular expression mean? Look at the end: we search for the sequence \d+, which means "digit, repeated 1 or more times". Before this fragment there should be something matching the (?<=...) pattern, called a "look-behind". I.e. we find only those digits which are preceded by \"id\":, though this prefix is not included in the "matching" part.
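
To see the look-behind in action without hitting the network, you can feed grep a hand-written JSON fragment (a standalone illustration, not part of the final script):

echo '{"type_of":"article","id":261930}' | grep -oP '(?<="id":)\d+'
# prints: 261930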

So at last let's put this single top id number into a variable:

total=$(curl https://dev.to/api/articles/ | grep -oP '(?<=\"id\"\:)\d+' | head -n 1)

Commands inside $(...) are executed and their output is stored in the variable total.
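
As a tiny illustration of command substitution (nothing DEV-specific here):

now=$(date +%H)               # run date, capture its stdout
echo "current hour is $now"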

Getting single article by ID

Now we are going to fetch about 100 random articles. For this we take a random number from the $RANDOM variable, subtract it from the max id and fetch the specific article by this id. Note that $RANDOM yields an integer between 0 and 32767, so effectively we sample among the ~32k most recently created ids.
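
A quick illustration of this arithmetic (the max id here is a hypothetical value, just for demonstration):

echo $RANDOM                  # e.g. 14233 - always between 0 and 32767
total=200000                  # hypothetical max id
echo $(( total - RANDOM ))    # a random id among the ~32k most recent articles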

To get an article by id we just append this id to the link used above, e.g.
https://dev.to/api/articles/123456 - note that here we get not an array, but a single object describing the article:

{
    "type_of":"article",
    "id":123456,
    //...
    "created_at":"2019-06-13T14:31:50Z"
}

The object includes the creation timestamp; we'll use it a bit later. For now let's complete the code which fetches 100 articles at random:

i=0
while [[ $i -lt 100 ]] ; do
    random_id=$(($total - $RANDOM))                       # random id among recent articles
    atext=$(curl -sf https://dev.to/api/articles/$random_id)
    if [[ $? -eq 0 ]] ; then                              # curl succeeded (HTTP 2xx)
        i=$(($i + 1))
        echo "$random_id - ok"
    else
        echo "$random_id - FAIL"                          # e.g. 404 for drafts/deleted
    fi
done

We don't use the received text yet - we just print out whether the article could be loaded or not. Some requests fail with 404 (which can be checked manually) - seemingly those articles were saved as drafts and never published, or were deleted afterwards.
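
To perform such a manual check, you can ask curl to print just the HTTP status code (a small sketch; the article id is arbitrary):

curl -s -o /dev/null -w '%{http_code}\n' https://dev.to/api/articles/123456
# prints 200 for a published article, 404 for a draft or deleted one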

If you want to run this snippet right now, I recommend changing 100 to 10 for test purposes - otherwise it may take significant time.

Collecting timestamp data (hour)

So we have the JSON for every article in the atext variable. Let's extract created_at - or, more precisely, only the hour part of this field. Instead of grep let's try sed - another cool default command, just for practice. It works like this; let's pipe the curl output to it:

curl https://dev.to/api/articles/246755 | sed -r 's/.*created_at\"\:\".{11}(..).*/\1/'

Here sed just does substitution by regexp. The regexp is made so that it captures the whole document (thanks to .* at the start and end) with the created_at\"\:\".{11}(..) fragment inside. In this fragment we skip the quotes and colon, then 11 symbols (like 2020-01-02T), and capture two symbols with parentheses. We use the value of the first (and only) captured group via the \1 reference to replace the whole string. So the output is like 20 - i.e. the hour part.
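
You can try the same substitution on a hand-written string to see the capture group at work (a standalone illustration):

echo '"created_at":"2019-06-13T14:31:50Z"' | sed -r 's/.*created_at":".{11}(..).*/\1/'
# prints: 14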

We add such a line to our script (inside the if branch):

hour=$(echo "$atext" | sed -r 's/.*created_at\"\:\".{11}(..).*/\1/')

A kind of histogram

Now we want to collect the results into an array. Let's initialize an array with 24 zeroes - and we'll increment the element corresponding to the given hour for every article found. One caveat: the hour must be stripped of a possible leading zero before being used as an arithmetic index, since bash would treat 08 or 09 as invalid octal numbers.

result=($(for i in {0..23}; do echo 0; done))   # creates 24 zeroes
# ... and below in the loop
result[hour]=$(( result[hour] + 1 ))

I will leave it to you to figure out how to pretty-print the histogram, like the one shown at the beginning of the article. Simplified code, as a shell file, could look like the one below. It simply prints the counters for every hour:

#!/usr/bin/env bash

arts=$(curl https://dev.to/api/articles)
total=$(echo "$arts" | grep -oP '(?<=\"id\"\:)\d+' | head -n 1)

result=($(for i in {0..23}; do echo 0; done))   # 24 zero counters
i=0
while [[ $i -lt 100 ]] ; do
    random_id=$(($total - $RANDOM))
    atext=$(curl -sf https://dev.to/api/articles/$random_id)
    if [[ $? -eq 0 ]] ; then
        i=$(($i + 1))
        hour=$(echo "$atext" | sed -r 's/.*created_at\"\:\".{11}(..).*/\1/')
        hour=$(echo $hour | sed 's/^0//')       # remove leading zero (08 is invalid octal)
        (( result[hour]++ ))
        echo "$random_id - ok, $hour ($i)"
    else
        echo "$random_id - FAIL"
    fi
done

for i in {0..23} ; do
    echo "$i: ${result[i]}"
done
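
For completeness, here is one minimal way the final loop could render bars instead of bare counters (my own sketch, not the fancier chart from the screenshot):

for i in {0..23} ; do
    printf '%02d: ' "$i"                                  # zero-padded hour label
    for (( j = 0; j < result[i]; j++ )); do printf '#'; done
    printf '\n'
done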

Conclusion

This is rather introductory material. You may at once see several things to improve:

  • we should check that the randomly chosen ids don't repeat
  • requests take a few seconds and often fail, so it would be better to learn to run them in parallel (see the sketch after this list)
  • the most active hours are probably not determined by just the hours when articles were created - it may be better to regard only those articles which have enough comments and likes
  • we can also use authorization for the API (I don't know yet if this affects speed).
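
Regarding parallel requests, a rough sketch with GNU xargs could look like this (it assumes total is already set as above, and -P 4 is an arbitrary concurrency level):

for n in {1..10}; do echo $(( total - RANDOM )); done \
    | xargs -P 4 -I{} curl -sf -o /dev/null -w '{} %{http_code}\n' https://dev.to/api/articles/{}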

Thanks for reading this far, and excuse me for the bash - further experiments with the API I'll probably publish using PHP/Python!

Oldest comments (6)

Zen

Wow. I'm motivated to build an app that gets random posts from Dev.

I see in docs.dev.to/api/#operation/getArti... that the JSON only returns 30 posts. Can I get more than 30?

Rodion Gorkovenko • Edited

Hi, I think yes, though it is not documented. The source code contains a per_page parameter which governs the number of articles in the response.

For example:

dev.to/api/articles/?per_page=3

However, it is probably not a good idea to load very large responses (I don't know if there is a limit). Rather, use the page parameter to send several requests for different pages. E.g.

dev.to/api/articles/?page=333

Zen • Edited

Then, how to get 10 posts from Dev randomly? 😂

Oh, I got an idea:

  • Get the total number of posts (e.g. 1000)
  • Get 10 numbers randomly from 0 to 1000

Rodion Gorkovenko

Yep, good idea! Exactly what is described in the article above :)

Zen

😂 thanks

rhymes

As I couldn't use:

curl https://dev.to/api/articles/ | grep -oP '(?<=\"id\"\:)\d+' | head -n 1

as -P doesn't work on macOS, I replaced it with the utility I use to pretty-print JSON, jq:

> curl https://dev.to/api/articles | jq '.[0].id'
266498

or even better, as we don't need the entire page:

> total=$(curl "https://dev.to/api/articles?per_page=1" | jq '.[0].id')
> echo $total
266498

this doesn't work on macOS either:

curl https://dev.to/api/articles/246755 | sed -r 's/.*created_at\"\:\".{11}(..).*/\1/'

I replaced it with:

> curl https://dev.to/api/articles/246755 | jq '.created_at' | sed 's/^.*T\(.*\)\:.*\:.*/\1/'
20

or even better:

> curl https://dev.to/api/articles/246755 | jq '.created_at[11:13]' | sed 's/"//g'
20

jq is truly awesome :D stedolan.github.io/jq/manual/