
Discussion on: Daily Coding Problem #1

 
danstur

Question: What happens if the hash of the values being inserted happens to be the same for all values? (Hint: Remember that you have to first check if the value already exists in the set before you can add it).

Is amortized cost for n inserts still O(n) in that scenario?

Or an even easier "proof": if you're doing open hashing (separate chaining), as most implementations do, a hashset whose values all share the same hash degrades to a simple linked list. How much time does it take to insert n values into a linked list if you have to check for duplicates first?
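To put a number on it, here's a rough sketch (a plain Python list standing in for the degenerate chain, counting the comparisons the duplicate check performs):

```python
def count_comparisons(n):
    chain, comparisons = [], 0          # the single "bucket" everything collides into
    for value in range(n):
        for existing in chain:          # duplicate check scans the whole chain
            comparisons += 1
            if existing == value:
                break
        else:
            chain.append(value)         # value was not found, append it
    return comparisons

print(count_comparisons(1_000))         # 499500 == 1000 * 999 / 2, i.e. ~n^2/2
```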

The amortized performance you learned for hashsets in school only holds under the assumption of a reasonably well-distributed hash function - in the worst case it degrades to O(n) per operation.

Martin Häusler • Edited

If all values deliver the same hash code then I'd argue that you have a pretty useless hash function... as I said, all of the argumentation I provided is based on the assumption of a proper hash function.

No, you don't have to check first whether something is in a hash set before you insert it. Simply look up the bucket and the position where the element would be, then put it there if it's absent, or do nothing if it's already there. If your buckets are of size 0, 1 or 2 (as they should be) then the algorithm is O(1).
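A minimal sketch of that add path, assuming separate chaining and leaving resizing out (the class and its names are made up purely for illustration):

```python
class ChainedHashSet:
    """Toy separate-chaining hash set, for illustration only (no resizing)."""

    def __init__(self, n_buckets=16):
        self._buckets = [[] for _ in range(n_buckets)]

    def add(self, value):
        # Locate the single bucket the value can live in...
        bucket = self._buckets[hash(value) % len(self._buckets)]
        # ...and append only if it isn't already there. With a well-distributed
        # hash the bucket holds 0-2 entries, so this check is effectively O(1).
        if value not in bucket:
            bucket.append(value)

    def __contains__(self, value):
        return value in self._buckets[hash(value) % len(self._buckets)]
```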

Bad hash codes will result in bad hash performance. No rocket science. Objects from a standard library - numbers, strings, you name it - are bound to have proper hash functions with good distributions. If a real-world hash set degenerates into a list (which is possible), I would question the programmer - not the theory.

danstur • Edited

The problem is that even the best hash function will always have pathological cases. In this case with integers the problem isn't the integer's hash function, but the transformation needed to map the hash to a bucket, which always involves some kind of modulo computation.

If hash mod bucket_count is always the same number, you have a problem. If you know the algorithm used to convert the hash to a bucket number, the starting size and the growth strategy of the set, you can always pick your numbers in such a way that you'll get collisions. This has nothing to do with whether the data type's hash algorithm is good or bad (integers have a pretty much perfect naive hash algorithm, after all).
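A minimal sketch of that modulo argument, using an assumed fixed bucket count of 16 (real tables also grow, so an attacker would target every size the table passes through): integer keys that are multiples of the bucket count all land in bucket 0, even though hash(k) == k is about as good as an integer hash gets.

```python
n_buckets = 16                                  # assumed fixed table size for the sketch
keys = [i * n_buckets for i in range(8)]        # 0, 16, 32, ...
print([hash(k) % n_buckets for k in keys])      # -> [0, 0, 0, 0, 0, 0, 0, 0]
```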

"No, you don't have to check first if something is in a hash set before you insert it. Simply look up the bucket and the position where the element would be, then put it there if absent, or don't if it's not there."
Good explanation of how you check whether the element exists - that's exactly how you'd do it. And since we've already established that for certain inputs all values will land in the same bucket, this check degenerates to O(n) in the worst case.

Consequently the worst-case performance here is O(n²), not O(n). That's simply how hashsets work. Is that important? It depends on your use case. If you aren't worried about an attacker, or the input is outside any attacker's control, you're fine. If not, you've just opened yourself up to a classic DoS attack by not paying attention to the asterisks next to the complexity numbers.
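If you'd rather see it than take the arithmetic on faith, here's a quick and admittedly unscientific timing sketch of the degenerate single-chain case - doubling n should roughly quadruple the time:

```python
import time

def worst_case_inserts(n):
    chain = []                              # the single bucket every colliding key lands in
    start = time.perf_counter()
    for value in range(n):
        if value not in chain:              # O(len(chain)) duplicate check
            chain.append(value)
    return time.perf_counter() - start

for n in (2_000, 4_000, 8_000):
    print(n, round(worst_case_inserts(n), 3))   # doubling n roughly quadruples the time
```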

And before you dismiss this as pure theory: this issue has plagued web frameworks and their handling of POST form data, among many other examples I can think of. See bugs.python.org/issue13703 for one classic example.

Martin Häusler

Hold your horses. We went from a discussion with somebody who was clearly unfamiliar with hashing techniques to malicious attack vectors as an argument for why a generally accepted theory is wrong? Good grief, that's a stretch. Sorry, this is getting ridiculous - I just wanted to clarify a misunderstanding, that's all. If we all explained things this way, we would have an army of students running about, firmly believing that the common access complexity for a hash set is O(n). I highly doubt that this is useful to anybody.

danstur • Edited

To each their own. I find "Hashsets have O(1) amortized lookup and insert performance assuming a relatively uniform hash distribution, otherwise O(n)" a perfectly reasonable explanation even if it is marginally more complex than "Hashsets always have O(1) amortized lookup and insert performance". I doubt that most students would fail to understand the first explanation.

It's also valuable to teach students that they have to specify whether they mean best-case, average-case or worst-case performance when analysing algorithms.