How I built a web-scraping twitter bot

Lucy Suddenly ・3 min read


Okay, full disclosure: I thought that building a Twitter bot was going to be a LOT of work. Turns out it's really easy! First I'll break down how to use the Chatterbot Ruby gem, then move on to how I used Nokogiri.

Step 1: Install and Configure Chatterbot

Here is the chatterbot guide webpage

Firstly, create a new Twitter account! Easy enough.

Secondly, using that Twitter account, apply for a developer account here.

While you're waiting for approval (should take an hour or two), install the chatterbot Ruby gem by running gem install chatterbot in your terminal. Then maybe start on your bot logic (more on this later).

To get the setup script to run, simply make a new Ruby file named after the username of the Twitter account you'll be using for your bot. Fill it with the following content to start:

require 'rubygems'
require 'chatterbot/dsl'

tweet "Hello World!"

Run your new Ruby file from the terminal and the setup script will run -- it will prompt you for the API keys and secrets that your approved Twitter developer account provides. Be sure to copy your confirmation PIN from your approval and paste it into the proper step of the setup script.

That's it! Replace "Hello World!" with whatever logic you like for how you want your bot to tweet! There are other options for searching, retweeting, and replying -- check the chatterbot guide for a detailed rundown of your options.

For my bot, I wanted to scrape a set of webpages for quotes: here's how I did it.

Step 2: Install Nokogiri

Run gem install nokogiri in your terminal to get Nokogiri installed.

Include the following in the header of your Ruby file:

require 'nokogiri'
require 'open-uri'

I stored an array of URL strings in a URLS constant -- here's how I used Nokogiri to scrape a random selection from the URL array:

doc = Nokogiri::HTML(URI.open(URLS[rand(0..URLS.length - 1)]))

Nokogiri::HTML accesses Nokogiri's HTML parsing methods, URI.open uses open-uri to fetch the webpage (a bare open only works this way on Ruby versions before 3.0), URLS is my URL array, and [rand(0..URLS.length - 1)] is a random index somewhere in the URLS array.
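Ruby's built-in Array#sample makes the same random pick without the index math. A minimal sketch, assuming a URLS array like the one described above (the example.com addresses are placeholders, not the pages I scraped):

```ruby
# Placeholder URLs for illustration only.
URLS = [
  "https://example.com/page-one",
  "https://example.com/page-two",
  "https://example.com/page-three"
].freeze

# Array#sample returns one random element, so no manual index is needed.
random_url = URLS.sample

# The index-based pick, shown for comparison.
same_idea = URLS[rand(0..URLS.length - 1)]

puts random_url
```

Either value can be handed to the page-opening call in place of the indexed lookup.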

For my purposes, I wanted all of the p tags on the page except those with certain classes -- so I asked Nokogiri for them with the .css method and passed in my very lengthy CSS selector:

content = []

doc.css('p:not(.title):not(.toc):not(.index):not(.indentb):not(.quoteb):not(.information):not(.fst):not(.footer):not(.pagenote):not(.quote)').each do |node|
  content << node.text
end

I shoveled in the text of each p node that wasn't a title, index, quote etc. into the content array.
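A selector that long is easy to mistype. One way to keep it readable, sketched here with a hypothetical EXCLUDED_CLASSES constant that is not part of the original script, is to build the string from a list of class names:

```ruby
# Class names to exclude, copied from the selector above.
# EXCLUDED_CLASSES is a hypothetical name used only for this sketch.
EXCLUDED_CLASSES = %w[
  title toc index indentb quoteb information fst footer pagenote quote
].freeze

# Chain one :not(.class) per excluded class onto the p tag.
selector = "p" + EXCLUDED_CLASSES.map { |name| ":not(.#{name})" }.join

puts selector
```

The resulting string can be passed to doc.css in place of the hand-written one.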

Next I use a regular expression to split each paragraph into sentences. Wrapping the pattern in parentheses makes it a capture group, so split keeps each punctuation mark as its own array element instead of discarding it.

new_content = []

  content.each do |c|
    # the capture group keeps each punctuation mark as its own element
    d = c.split(/([\?\!\;\.])/)
    d.each do |e|
      new_content << e
    end
  end
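To see what the parentheses change, here is a small comparison on a made-up paragraph (the sample text is an assumption, not scraped content):

```ruby
paragraph = "First sentence. Is this second? Yes!"

# Without a capture group, the punctuation is thrown away.
without_group = paragraph.split(/[?!;.]/)
# => ["First sentence", " Is this second", " Yes"]

# With a capture group, each delimiter comes back as its own element.
with_group = paragraph.split(/([?!;.])/)
# => ["First sentence", ".", " Is this second", "?", " Yes", "!"]

p without_group
p with_group
```

Because the punctuation survives, joining the pieces back together reproduces the original text.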

Next I deal with setting a random index for this giant array to randomly choose a quote.

  #set random index
  random = rand(0..new_content.length - 6)

  #ensure first element chosen is not punctuation
  while new_content[random].length < 4
    random = rand(0..new_content.length - 6)
  end
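The retry loop assumes the array has at least six elements and at least one long enough piece. A more defensive sketch, using a hypothetical pick_start_index helper that is not in the original script, collects the valid start positions first and returns nil when there are none:

```ruby
# Hypothetical helper: pick a start index whose piece looks like a sentence.
# Returns nil when no piece qualifies, instead of looping forever.
def pick_start_index(pieces, window: 6, min_length: 4, rng: Random.new)
  last_start = pieces.length - window
  return nil if last_start.negative?

  # Only consider starts that leave room for a full window of pieces.
  candidates = (0..last_start).select { |i| pieces[i].length >= min_length }
  candidates.sample(random: rng)
end

pieces = ["One sentence", ".", " Another sentence", "?",
          " A third", "!", " A fourth", "."]
index = pick_start_index(pieces)
puts pieces[index]
```

The window and min_length defaults mirror the 6 and 4 used above; both are easy to adjust.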

On to instantiating the tweet content string and doing some formatting:

  #concatenate the six elements starting at the random index
  tweet_content = new_content[random, 6].join

  tweet_content = tweet_content.gsub("\r\n", " ")
  tweet_content = tweet_content.gsub("\n", " ")
  tweet_content = tweet_content.gsub("   ", " ")
  tweet_content = tweet_content.gsub("i.e.", "that is")
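The three whitespace substitutions can also be folded into one regular expression. This is a sketch with a hypothetical tidy helper, and it assumes you are happy to collapse every run of whitespace (tabs and double spaces included), which is slightly broader than the gsub calls above:

```ruby
# Hypothetical helper: collapse every run of whitespace into a single space
# and trim the ends.
def tidy(text)
  text.gsub(/\s+/, " ").strip
end

puts tidy("First line\r\nsecond line\n   third line")
```

The "i.e." replacement is a separate concern and would still need its own gsub.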

Next we want to check that the sentences chosen are under the tweet character limit:

  #split again to begin checking length of sentences
  tweet_content = tweet_content.split(/([\?\!\;\.])/)

  #decide on complete sentences that are under the character limit
  if tweet_content[3]
    if (tweet_content[0].length + tweet_content[1].length + tweet_content[2].length + tweet_content[3].length) < 240
      #two full sentences fit, so keep both
      tweet_content = tweet_content[0..3].join
    else
      tweet_content = tweet_content[0..1].join
    end
  else
    tweet_content = tweet_content[0..1].join
  end
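The same check can be generalized to any number of sentences. The sketch below uses a hypothetical fit_sentences helper, not something from the original script, and it differs in one respect: it returns an empty string if even the first sentence is too long, rather than tweeting it anyway.

```ruby
# Hypothetical helper: keep whole sentences while the total stays under limit.
# pieces alternates sentence text and punctuation, as produced by
# split(/([?!;.])/).
def fit_sentences(pieces, limit: 240)
  result = ""
  pieces.each_slice(2) do |sentence, punctuation|
    candidate = result + sentence + punctuation.to_s
    break if candidate.length >= limit

    result = candidate
  end
  result
end

pieces = "Short one. Another short one? Done!".split(/([?!;.])/)
puts fit_sentences(pieces, limit: 20)
```

With the default limit of 240 the whole sample text fits, so all three sentences come back.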

Step 3: Celebrate!

tweet_content is now ready to be passed to our tweet method! Instead of tweet "Hello World!" we can write tweet tweet_content.

To automate our bot, we can wrap the contents of our script in a loop do ... end block and call sleep inside the loop to set the interval between tweets.
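A minimal sketch of that loop, assuming a hypothetical run_every helper and using puts as a stand-in for the scrape-and-tweet logic so it can be tried without credentials:

```ruby
# Hypothetical helper: run the block, sleep, and repeat.
# times: nil loops forever; a number is handy for trying things out.
def run_every(interval_seconds, times: nil)
  count = 0
  loop do
    yield
    count += 1
    break if times && count >= times

    sleep interval_seconds
  end
  count
end

# Stand-in for the real bot logic; prints instead of tweeting.
run_every(0.1, times: 3) { puts "tweet goes here" }
```

For a real bot you would drop times: and pass a longer interval, for example 60 * 60 for roughly one tweet an hour.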

And that's it! A quick setup of a bot and whatever logic you choose for the bot to tweet! If you decide to make a bot and have some interesting logic to share, put it in the comments!
