Refactoring Ruby with Regex

#ruby #regex #codenewbie

The project

I am currently working on an exciting project with a friend of mine. The application uses a bot that monitors different social media sites and determines whether a post has relevant information about a specific stock. If it does, the bot persists the post to an API, which exposes different endpoints depending on what the user wants to see on our front end (most-mentioned stocks in the last 24 hours, week, month, or year). One of the most challenging and enjoyable parts so far has been working on the bot's algorithm for determining whether a post mentions a specific stock.

Tickers

Before we get into how to find relevant data, we need to talk about how people normally mention stocks on the internet. Stocks have long names and they also have tickers or symbols that allow people to quickly reference a specific stock.

Company Name	Ticker
GameStop Corp.	GME
AMC Entertainment Holdings Inc	AMC
Rocket Companies Inc	RKT

While some people may say GameStop or Nokia, they will rarely, if ever, use a stock's complete name. Tickers are generally preferred thanks to their brevity and specificity. Another important thing to know is that tickers are often prefixed with a dollar sign, like this: $NIO. This isn't always the case, but it happens often enough to be worth knowing.

check_if_relevant

Without going into the nitty-gritty details, my friend loaded all the relevant stock data (names, tickers, etc.) into a database from CSV (comma-separated values) files. Now that we have our stocks and their tickers, we can set up our check_if_relevant method. We will loop through each stock, grab the stock's ticker, and send it (along with the post) to another method called contains_ticker?.

def check_if_relevant(post)
  @stocks.each do |s|
    if contains_ticker?(post, s["symbol"])
      # Save post here
    end
  end
end
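To try the loop outside the bot, here is a minimal sketch. The Post struct, the SAMPLE_STOCKS array, the placeholder contains_ticker?, and the relevant_tickers method (which returns the matching tickers instead of saving the post) are all assumptions made for illustration; the real bot reads its stocks from the database.

```ruby
# Hypothetical stand-ins for the bot's real data sources.
Post = Struct.new(:title, :selftext)

SAMPLE_STOCKS = [
  { "symbol" => "GME", "name" => "GameStop Corp." },
  { "symbol" => "AMC", "name" => "AMC Entertainment Holdings Inc" },
  { "symbol" => "RKT", "name" => "Rocket Companies Inc" }
].freeze

# Placeholder check: a plain substring match on the title only.
def contains_ticker?(post, ticker)
  post.title.include?(ticker)
end

# Same loop as check_if_relevant, but it collects the matching tickers
# so the result is easy to inspect.
def relevant_tickers(post, stocks = SAMPLE_STOCKS)
  stocks.map { |s| s["symbol"] }.select { |ticker| contains_ticker?(post, ticker) }
end

post = Post.new("GME and AMC are trending", "")
p relevant_tickers(post) # => ["GME", "AMC"]
```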

contains_ticker?

This is where things get interesting. I know I have sung Ruby's praises before, but working with strings is extremely intuitive and fun in this gem of a language (thanks Matz!). Checking if a sentence contains a word is super simple!

"Ruby is fun!!!".include? "Ruby"
=> true

See how easy and painless that was? One line, no loops, it just works. This was my initial solution, but I soon realized the bot was saving a lot of false positives: some of the tickers were common words or only one character long. I needed to add logic. I would revise the ticker by surrounding it with spaces, and if it was too short or a common word I would require a leading $ instead. (In the snippets that follow, the check is named contains_symbol?; it plays the role of contains_ticker? above.)

def get_revised_symbol(symbol)
  if symbol.length <= 2 || COMMON_WORDS.include?(symbol)
    '$' + symbol + ' '
  else
    ' ' + symbol + ' '
  end
end

def contains_symbol?(post, symbol)
  post.include? symbol
end

This approach helped, but I realized I needed a more precise way to describe what may surround a ticker (sometimes people would end the ticker with punctuation rather than a space). Enter regex.
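Both problems are easy to reproduce. The sketch below uses a hypothetical padded_match? helper that mirrors the space-padded check, with made-up sample sentences.

```ruby
# Plain substring matching: a one-letter ticker matches inside any word.
p "Ruby is FUN".include?("F")             # => true (a false positive)

# Hypothetical helper mirroring the space-padded check described above.
def padded_match?(text, ticker)
  text.include?(" #{ticker} ")
end

p padded_match?("I like GME a lot", "GME") # => true
p padded_match?("I bought GME!", "GME")    # => false (punctuation, not a space)
p padded_match?("GME is up today", "GME")  # => false (ticker starts the post)
```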

Regular Expressions

Without going too much into what regex is, I will quote MDN:

Regular expressions are patterns used to match character combinations in strings.

The perfect, elegant solution to my problem! Now I just needed to look up all the cryptic symbols and what each did (I feel like learning a new programming language is 75% this). I had two awesome resources to help accomplish this and I recommend you check them out. Read about using regex in Ruby at rubyguides.com and actually play around with it at Rubular.
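For a first taste of regex in Ruby, here is a small sketch using String#match?, the =~ operator, and String#[] with a capture group. The first_cashtag helper and its pattern ("$" followed by capital letters) are assumptions for illustration, not part of the bot.

```ruby
# Returns the first "$"-prefixed ticker in the text (without the "$"), or nil.
def first_cashtag(text)
  text[/\$([A-Z]+)/, 1]
end

text = "Thinking about buying GME tomorrow."

p text.match?(/GME/)  # => true
p(text =~ /GME/)      # => 22, the index where the match starts

p first_cashtag("Loading up on $NIO today") # => "NIO"
p first_cashtag("No tickers here")          # => nil
```

Note that the dollar sign is escaped as \$ in the pattern: an unescaped $ in a regex is an anchor, not a literal character.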

Onward

Basically, I wanted to allow any punctuation before or after the ticker but still require the $ under the same conditions. I went through several iterations while reading about regex, including this beast:

def contains_symbol?(post, symbol)
  if symbol.length <= 2 || COMMON_WORDS.include?(symbol)
    # \$ matches a literal dollar sign (a bare $ outside brackets is the end-of-line anchor)
    return post.title.match(/[!,.?:;"' #%^&()$]+\$#{symbol}+[!,.?:;"' #%^&()]/) || post.selftext.match(/[$!,.?:;"' #%^&()]+\$#{symbol}+[!,.?:;"' #%^&()]/)
  else
    return post.title.match(/[!,.?:;"' #%^&()$]+#{symbol}+[!,.?:;"' #%^&()]/) || post.selftext.match(/[$!,.?:;"' #%^&()]+#{symbol}+[!,.?:;"' #%^&()]/)
  end
end

Ugly, right? It turns out I needed to practice some patience, read the entire article, and experiment before implementing a solution. Part of what makes regex so useful is its built-in character classes. I found that \w covers letters, digits, and the underscore, so I planned to use it inside a negated class: [^\w]. Then I read on and discovered that \W is equivalent to [^\w]. So here is my final solution:

def contains_symbol?(post, symbol)
  ticker = Regexp.escape(symbol) # match characters such as "." literally
  pattern = if symbol.length <= 2 || COMMON_WORDS.include?(symbol)
              # \$ is a literal dollar sign; a bare $ would be the end-of-line anchor
              /\W\$#{ticker}\W/
            else
              /\W#{ticker}\W/
            end
  post.title.match(pattern) || post.selftext.match(pattern)
end
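One caveat with \W on both sides is that it needs an actual character before and after the ticker, so a ticker at the very start or end of the text is missed. A possible refinement, sketched here under my own assumptions rather than taken from the bot, is to use lookarounds, which check the neighboring characters without consuming them. SAMPLE_COMMON_WORDS, ticker_pattern, and mentions? are hypothetical names.

```ruby
# Hypothetical sample list; the bot's real COMMON_WORDS is defined elsewhere.
SAMPLE_COMMON_WORDS = %w[ALL FOR IT ON].freeze

# Builds a pattern that matches the ticker only when it is not glued to
# other word characters. Short or common tickers require a leading "$".
def ticker_pattern(symbol, common_words = SAMPLE_COMMON_WORDS)
  escaped = Regexp.escape(symbol) # treat "." and friends literally
  needs_dollar = symbol.length <= 2 || common_words.include?(symbol)
  prefix = needs_dollar ? '\$' : ''
  /(?<!\w)#{prefix}#{escaped}(?!\w)/
end

def mentions?(text, symbol)
  ticker_pattern(symbol).match?(text)
end

p mentions?('GME!', 'GME')            # => true (start of text, punctuation after)
p mentions?('I bought $GME.', 'GME')  # => true
p mentions?('GMEX rocks', 'GME')      # => false (part of a longer word)
p mentions?('Ruby is FUN', 'F')       # => false (short ticker needs "$")
p mentions?('Buying $F today', 'F')   # => true

# An unescaped "$" is an end-of-line anchor, so it never matches a dollar sign.
p(/\W$GME\W/.match?(' $GME '))        # => false
p(/\W\$GME\W/.match?(' $GME '))       # => true
```

The trade-off is readability: lookarounds are a little more cryptic than \W, but they handle the edges of the text without special cases.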

Conclusion

In closing, using cool new features in a language can make code cleaner and coding more enjoyable, but it can also be a little painful if you don't fully understand what you're doing! Read the docs!

Top comments (2)

Thorsten Hirsch • Mar 15 '21

If performance doesn't matter - alright, sometimes regular expressions are a nice solution, although it's hard to believe that they contribute to cleaner code. Just be aware that they have a big negative impact on performance.

Samuel Grasse-Haroldsen • Mar 16 '21

That is good to know! I had no idea how much of a negative impact regex had on performance. Here I was thinking it was an optimal solution! Thanks for the heads up!