
Scraping DEV to measure your post quality

Ryan Westlund ・3 min read

I thought it would be interesting to determine the average ratio of likes to views I get. But of course I wasn't going to tally them manually. I'm a coder, don't you know, so I'm going to write a scraper. In the words of Douglas Adams, "I am rarely happier than when spending an entire day programming my computer to perform automatically a task that would otherwise take me a good ten seconds to do by hand."

Wait, what did you say? The DEV dashboard already shows my total reactions and total views? Fuck! I didn't notice that until I was done writing both the code and the post. Oh well! No way is that gonna stop me from posting this. It's not all wasted effort, since mine also counts total comments and gives all three ratios among comments, likes, and views.

The first thing I did was open devtools' network tab, filter for XHR only, and reload my dashboard. I was checking if the posts were fetched from an API endpoint, which would've made this easy since I wouldn't have to traverse HTML. Alas, it looks like they're server-side rendered.

I did notice something odd while examining one of the requests:

[Screenshot: network tab]

The duplicate headers are curious. But that doesn't concern us now.

So the next thing is to inspect the HTML to figure out how to fish the information out of it. I used Firefox's element selector to determine that a selector I can use to match all posts is div.dashboard-story. Since I also want to exclude drafts, I add :not(.story-unpublished). Finally, I'll need to keep fetching the next page of my dashboard until I'm at the beginning, so I note that a selector for the "Next" button is a[rel=next].
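
As a rough sketch of how those selectors behave, here they are run through myhtml against a made-up fragment. The markup below is an assumption, heavily simplified from whatever the real dashboard serves; only the three selectors come from the post.

```crystal
require "myhtml"

# Hypothetical markup for illustration; the real dashboard HTML is more nested.
html = <<-HTML
  <div class="dashboard-story"><a href="/a">Published post</a></div>
  <div class="dashboard-story story-unpublished"><a href="/b">Draft</a></div>
  <a rel="next" href="/dashboard?page=2">Next</a>
  HTML

doc = Myhtml::Parser.new(html)

# Matches published posts only; the draft is excluded by :not(...).
published = doc.css("div.dashboard-story:not(.story-unpublished)").to_a
puts "published posts: #{published.size}"

# The "Next" link, if there is one; nil means this is the last page.
next_href = doc.css("a[rel=next]").first?.try(&.attributes["href"]?)
puts "next page: #{next_href.inspect}"
```

When the "Next" link is missing, `next_href` is nil, which gives a crawler its stopping condition.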

I chose Crystal as my language, because I want to learn it better and because I know it has a better standard library than Rust (which would've been my choice otherwise). Crystal's stdlib includes a decent HTTP client, but I did need to find a good HTML parsing library that supports CSS selectors. I found myhtml.

Writing the code immediately reminded me that I needed to copy the relevant cookies to make DEV know it was me. So I inspected my browser cookies. There were 7, so I took a guess at which one I needed and I was right: _Dev_Community_Session.
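
A minimal sketch of sending that cookie with Crystal's stdlib HTTP client, assuming the value has been copied from the browser into an environment variable. `DEV_SESSION` and `session_headers` are names invented here for illustration; only the cookie name comes from the post.

```crystal
require "http/client"

# Builds request headers carrying the DEV session cookie.
def session_headers(session : String) : HTTP::Headers
  HTTP::Headers{"Cookie" => "_Dev_Community_Session=#{session}"}
end

# DEV_SESSION is a hypothetical variable name, not something DEV defines.
if session = ENV["DEV_SESSION"]?
  response = HTTP::Client.get("https://dev.to/dashboard", headers: session_headers(session))
  puts response.status_code
else
  puts "Set DEV_SESSION to the cookie value copied from your browser."
end
```

Reading the value from the environment keeps the session secret out of the source file, which matters if the script ends up in a public post or repository.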

After getting frustrated with scant documentation, I finished the code:

require "http/client"
require "myhtml"

# Have to do this because Crystal doesn't support global mutable variables,
# but mutable constants (ugh) are a workaround.
class Totals
  property likes : Int32 = 0, comments : Int32 = 0, views : Float64 = 0
end
TOTALS = Totals.new

def scan_page(doc)
  doc.css("div.dashboard-story:not(.story-unpublished)").each do |post|
    # The skipping is because there are empty text nodes between stuff.
    # I used trial and error to find how many I needed to skip.
    # (Myhtml doesn't seem to support the :last-child selector)
    TOTALS.likes += post.css("[title=Reactions]").first.children.skip(2).first.tag_text.to_i
    TOTALS.comments += post.css("[title=Comments]").first.children.skip(3).first.inner_text.to_i
    # DEV doesn't show the actual number if it's less than 25, so assume it's 12.5.
    TOTALS.views += post.css("[title=Views]").first.children.skip(2).first.tag_text.to_f? || 12.5
  end
end

def crawl(url)
  req = HTTP::Request.new("GET", url)
  req.cookies << HTTP::Cookie.new("_Dev_Community_Session", "8a5f1d079594fcb5239c87b78ae7b1d1")
  response = (HTTP::Client.new (URI.parse "https://dev.to")).exec req
  doc = Myhtml::Parser.new response.body
  scan_page doc
  nexturl = doc.css("a[rel=next]").first?.try &.attributes["href"]
  crawl(nexturl) if nexturl
end

crawl "dashboard"

puts "views per like: #{TOTALS.views / TOTALS.likes}"
puts "views per comment: #{TOTALS.views / TOTALS.comments}"
puts "likes per comment: #{TOTALS.likes / TOTALS.comments}"

I would've computed each post's ratios and averaged those, but then I'd have to decide what to do about posts with zero likes or comments, whose ratios come out as 0 or Infinity, so I just went with simple totals.
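
To make the difference concrete, here is a small sketch comparing the two approaches on made-up numbers. The `Post` record and every count in it are hypothetical, and skipping posts with zero likes is just one possible way to handle the Infinity case, not what the script above does.

```crystal
# Hypothetical per-post counts, invented for illustration.
record Post, likes : Int32, comments : Int32, views : Float64

posts = [
  Post.new(likes: 10, comments: 4, views: 200.0),
  Post.new(likes: 0, comments: 0, views: 12.5), # would give Infinity views per like
  Post.new(likes: 5, comments: 1, views: 50.0),
]

# Ratio of totals: sum everything first, then divide once.
ratio_of_totals = posts.sum(&.views) / posts.sum(&.likes)

# Mean of per-post ratios, skipping posts without likes to avoid Infinity.
per_post = posts.reject { |post| post.likes == 0 }.map { |post| post.views / post.likes }
mean_of_ratios = per_post.sum / per_post.size

puts "ratio of totals: #{ratio_of_totals}"
puts "mean of per-post ratios: #{mean_of_ratios}"
```

The two numbers generally differ: the ratio of totals weights each post by its traffic, while the mean of ratios treats every post equally and needs a policy for the zero cases.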

And my results:

views per like: 20.27
views per comment: 56.46
likes per comment: 2.79

What are yours?

Discussion

Maan2003

Why are you using global mutable state? scan_page could just return the likes, comments, and views.

Ryan Westlund Author

Good catch :)

Phippsy

Now do it using the API! ;)

DEV API (Beta)