Ryan Palo

Posted on Oct 17, 2017 • Edited on Jun 8, 2019 • Originally published at assertnotmagic.com

Secret Message

#ruby #puzzle #interview #performance

Today I was reading through an article on Dev.to by Ben Greenberg about an interview coding challenge, and I got hooked and had to try it for myself. I highly recommend you read his original post before you read this so you have some background. That being said, for the lazy ones, let me give you...

Some Background

The challenge is this (copied from the original post):

Sort the characters in the following string:
abcdefghijklmnopqrstuvwxyz_

by the number of times the character appears in the following text (descending):

… [String omitted for brevity]

Now take the sorted string, and drop all the characters after (and including) the _. The remaining word is the answer.

You can find the super long string in the comments section of his post (where I asked for it ðŸ˜) if you want to try it yourself. You should! And share your solution! Anyways, on with the story.

My Solution (Part 1)

If it's not abundantly clear, I'm about to share my solution. If you want to work it out for yourself without spoilers, go ahead and stop reading until you're done. I won't write anymore until you're finished. ðŸ˜„

Finished? Great! ONWARD.

I decided to use Ruby because Ruby is fun and great. Initially, I just wanted to knock out the first solution I could think of -- regardless of how inefficient or slow it was -- and come back to clean it up after.

# I stored the huge string in a separate file called
# `blob.txt` so as to not clutter my main script

blob = File.read('blob.txt')
letters = 'abcdefghijklmnopqrstuvwxyz_'.chars

def naive_message(letters, blob)
  letters
    .sort_by { |letter| blob.count(letter) }
    .reverse  # to get desc. order
    .join
    .split('_')
    .first
end

puts naive_message(letters, blob)

I won't spoil the answer, but let's just say that I "pray, plead, beseech, urge" you to try it on your own. Possibly in the past tense. I think the markdown format of the initial text blob may have skewed the number of "_" characters, causing my answer to have slightly more characters on the end than should actually be there. UPDATE: That's totally what happened. Accurate text blob here.

My Solution (Part 2)

Anyways, I looked back at my solution and thought to myself, "Ryan, that doesn't look very performant. I have to imagine that running blob.count(letter) for each letter is the worst case performance for this scenario (27 'letters' * n chars in the blob, looping through the whole blob for each letter). It seems like it should be more efficient to do it the way Ben did it, which is by looping through the blob once and counting each letter along the way. So I tried that.

def efficient_message(letters, blob)
  # counter = {a: 0, b: 0, c: 0...}
  counter = letters.product([0]).to_h

  # run through blob once only
  blob.each_char { |c| counter[c] += 1 }

  # sort and trim off everything after _
  counter
    .sort_by(&:last)    # sort by the count
    .map(&:first)       # grab just the letter key into an array
    .reverse
    .join
    .split('_')
    .first
end

Not as pretty, in my opinion, but hopefully faster. Ruby, being interpreted, is slower than most compiled langauges, so this should help. (So I thought...)

Comparing Performance

Was this optimization worth it? I needed to find out. Luckily Ruby comes with an awesome Benchmarking library built-in. (Oh Ruby, what is there that you can't do?)

require 'benchmark'

# ... My code above

Benchmark.bmbm do |bm|
  bm.report("Naive: ") { 1000.times { naive_message(letters, contents) } }
  bm.report("Efficient: ") { 1000.times { efficient_message(letters, contents) } }
end

Benchmark has a method called bmbm that runs one trial run and then a second real run. This helps shake out any overhead performance drains from the garbage collector. And to my horror:

~\desktop> ruby .\secret_word.rb
Rehearsal -----------------------------------------------
Naive:        0.047000   0.000000   0.047000 (  0.039974)
Efficient:    0.484000   0.000000   0.484000 (  0.481631)
-------------------------------------- total: 0.531000sec

                  user     system      total        real
Naive:        0.031000   0.000000   0.031000 (  0.038011)
Efficient:    0.483000   0.000000   0.483000 (  0.478715)

The "Efficient" version is ~10x slower than the "Naive" version! Noooo! "But, why?" you ask. "How can this be?" I had the same questions.

Algorithmic Profiling

Ruby has a built-in profiler, but a short Google search told me that there was a better option: ruby-prof. After a quick gem install ruby-prof, I was back at it again with the white vans. (Check out the Ruby-Prof Documentation to learn more).

require 'ruby-prof'

# ... The previous code

RubyProf.start
1000.times { naive_message(letters, contents) }
result = RubyProf.stop

RubyProf::FlatPrinter.new(result).print(STDOUT)

RubyProf.start
1000.times { efficient_message(letters, contents) }
result = RubyProf.stop

RubyProf::FlatPrinter.new(result).print(STDOUT)

I added titles below for clarity.

Naive:
===================
Measure Mode: wall_time
Thread ID: 3259000
Fiber ID: 20752200
Total: 0.066000
Sort by: self_time

 %self      total      self      wait     child     calls  name
 50.00      0.033     0.033     0.000     0.000    27000   String#count
 18.18      0.059     0.012     0.000     0.047     1000   Enumerable#sort_by
 10.61      0.007     0.007     0.000     0.000   136000   Integer#<=>
 10.61      0.040     0.007     0.000     0.033     1000   Array#each
  3.03      0.002     0.002     0.000     0.000     1000   Array#reverse
  3.03      0.066     0.002     0.000     0.064     1000   Object#naive_message
  1.52      0.001     0.001     0.000     0.000     1000   Array#first
  1.52      0.001     0.001     0.000     0.000     1000   Array#join
  1.52      0.001     0.001     0.000     0.000     1000   String#split
  0.00      0.066     0.000     0.000     0.066        1   Global#[No method]
  0.00      0.066     0.000     0.000     0.066        1   Integer#times

* indicates recursively called methods

Efficient:
==============
Measure Mode: wall_time
Thread ID: 3259000
Fiber ID: 20752200
Total: 0.688000
Sort by: self_time

 %self      total      self      wait     child     calls  name
 93.60      0.644     0.644     0.000     0.000     1000   String#each_char
  2.04      0.025     0.014     0.000     0.011     1000   Enumerable#sort_by
  1.16      0.008     0.008     0.000     0.000   136000   Integer#<=>
  0.58      0.005     0.004     0.000     0.001     1000   Array#map
  0.44      0.688     0.003     0.000     0.685     1000   Object#efficient_message
  0.44      0.003     0.003     0.000     0.000     1000   Array#reverse
  0.44      0.003     0.003     0.000     0.000     1000   Array#product
  0.44      0.003     0.003     0.000     0.000     1000   Array#to_h
  0.29      0.003     0.002     0.000     0.001     1000   Hash#each
  0.29      0.002     0.002     0.000     0.000     1000   String#split
  0.15      0.001     0.001     0.000     0.000    27000   Array#last
  0.15      0.001     0.001     0.000     0.000    28000   Array#first
  0.00      0.688     0.000     0.000     0.688        1   Global#[No method]
  0.00      0.000     0.000     0.000     0.000     1000   Array#join
  0.00      0.688     0.000     0.000     0.688        1   Integer#times

* indicates recursively called methods

From what I can tell, the String#count method is super-optimized, and String#each_char is a relatively expensive operation (it has to create an array the length of the blob!). So, in the long run, looping through the blob string a bunch of times using the faster String#count ends up being more performant. So much for going through the trouble to generate an efficient solution.

Wrap Up

Anyways, I hope you're able to get your heart rate back to normal after such a roller-coaster ride. Be sure to share your solution on Ben's post. Also, since this was originally a kind of interview code puzzle, if you're someone that has interviewed people, I would love any feedback on my solution or the surrounding presentation! Is it similar to what you would expect from an applicant, or am I missing something important?

Thanks for reading!

Originally posted on assert_not magic?

Top comments (22)

Talha • Oct 17 '17 • Edited

    def give_me_word(text):
        ans=dict()
        the_text=""
        for i in text:
            if i in ans:
                ans[i]+=1
            else:
                ans[i]=1

        comparison=ans['_']
        sorted_list=sorted(ans.values(),reverse=True)
        rev_dict = dict((v,k) for k,v in ans.iteritems())

        for i in sorted_list:
            if(i>=comparison and rev_dict[i]!='_'):
                the_text=the_text+rev_dict[i]
            elif rev_dict[i]=='_':
                break

        return the_text

    text="the_text"

    print give_me_word(text)

Not fully optimized script to do the same in python

Ryan Palo • Oct 17 '17 • Edited

Nice! Thanks for sharing. :) Have you seen the built-in Counter class? This seems like a perfect use-case for it.

from collections import Counter

blob = "..."
letters = "abcdefghijklmnopqrstuvwxyz_"

def message(letters, blob):
    hist = Counter(blob)    # Counter takes any iterable and makes
                            # a histogram of its items!

    sorted_letters = ''.join(x[0] for x in hist.most_common())
    return sorted_letters.split('_')[0]

print(message(letters, blob))

Ryan Palo • Oct 17 '17

Bonus! On my machine, this version is even 5 times faster than the best Ruby version!

cat test.py | python -m timeit
100000000 loops, best of 3: 0.00844 usec per loop

Talha • Oct 17 '17

This is great. Will use counter effectively. Thanks for showing me counter(). :-)

edh_developer • Oct 18 '17

I wondered if I could do better performance-wise, based on knowledge of the problem that the more general built-in classes couldn't take advantage of. Ended up with this:

import string

longString="..."

def decode(msg):
    # Testing suggests this is a little faster than using defaultdict
    charCountHash = {}.fromkeys(string.ascii_lowercase,0)
    charCountHash['_'] = 0

    for ch in msg:
        charCountHash[ch] += 1

    c = charCountHash['_']
    resultHash = {charCountHash[k]: k for k in charCountHash.keys() \
        if charCountHash[k] > c }

    return ''.join( [ resultHash[index] for index in \
        sorted(resultHash.keys(), reverse=True) ] )

print(decode(longString))

Testing both implementations a few times suggests this way is faster, but only 1% or so.

Might try this in go, just for the practice...

Ryan Palo • Oct 18 '17

Nice! I haven't used fromkeys before. That's a nice alternative to defaultdict (and you don't have to import anything!) if you already know the keys you need.

Ben Greenberg • Oct 17 '17

Python is on my list of things to learn soon and can't help but remark how much it "looks" like Ruby without knowing much about it yet.

Talha • Oct 17 '17

"...without knowing much about it yet." This is great about python. Nice challenge this one. :-)

Andrew Buntine • Oct 17 '17

Nice post. It's an interesting look at how algorithmic complexity can be trumped by real-world details.

But I think you'll find that the performance is not exactly in the each_char (which is probably lazily evalutated) but is caused by:

The setup/teardown (stack frames, etc) for the block that is passed to each_char
The repeated hash lookups (although they are supposed to be O(1))

Ryan Palo • Oct 17 '17

Thanks, that’s super interesting! Good to consider for future block usage. I can’t think of a way to speed up the block setup and tear down, but I wonder if I’d pick any performance up by switching the hash keys to symbols. I’ll have to do some ‘sperimenting 👍🏻

Ben Greenberg • Oct 17 '17

This is great! I can't even begin to describe my love for Ruby. It is such a clean language.

Tobias Salzmann • Oct 19 '17 • Edited

Obvious (!!!) question to ask: What if the text does not fit into memory?

Distributed version in Scala with Spark:

def message(
  distributedBlob: RDD[Char], 
  chars: Set[Char] =  ('a' to 'z').toSet + '_'
)(implicit sc: SparkContext): String = {
  val frequencies = distributedBlob
    .filter(chars.contains)
    .countByValue()
  frequencies
    .toSeq
    .sortBy(- _._2)
    .map(_._1)
    .mkString
    .takeWhile('_' !=)
}

Alex Pliutau • Oct 18 '17

Interesting challenge. I want to see how it can be gracefully implemented in Go, even it will be more code than Ruby, but it can be elegant. github.com/plutov/practice-go/tree...