A guide to shrinking an elephant hashed and encoded in PHP.
I was recently working on a new bundle for Symfony, and while reviewing a block of code I had written earlier I asked myself: why do the big guys (Symfony and Laravel) encode the binary output of hash_hmac() using base64_encode()?
After some digging around, I discovered the answer. But first, let's take a look at how hash_hmac() works. More specifically, what hash_hmac() returns...
TL;DR
When overhead is a concern, using base64_encode(hash_hmac(..., true))
gives you a hash with a string length of 44 characters vs the 64 characters that hash_hmac(..., false)
provides.
What is this hash hmac and why do I care about it?
From the manual: "hash_hmac - Generate a keyed hash value using the HMAC method"
Translation: this function allows us to take a string of data and potentially create a cryptographically secure hash that can be used to verify the data's integrity. One thing to remember about hash_hmac: it's selfish. hash_hmac takes the data and gives you a string (hash), but it will never give you back the data used to create that hash.
This is sometimes referred to as a one-way hash, meaning the only way we can verify the integrity and authenticity of the data is to create a new hash and then compare the original hashed value against our newly created hash. To do this we use the hash_equals() function.
To put it another way, let's play the game of telephone. You remember that as a kid, right? A large group gets in a circle and then the first kid says to his buddy on the left, "elephants are gray". Then his buddy does the same to the kid on his left, and so on and so on. The point of that game is to see if the phrase is still the same by the time it circles back around to the first kid.
Only in our version of the game, I'm going to start the game and use hash_hmac
to prove to everyone at the end, I'm not fibbing...
$elephantHash = hash_hmac('sha256', 'elephants are gray', 'secret_key');
var_dump($elephantHash);
// string(64) d53c1002ddf2159b926ae4eaf775fb0242282e826070bb79a899bbf1af45642f
So the phrase is "elephants are gray", go tell a friend what the phrase is and have them repeat it to me. Hurry, I'm waiting...
While I'm waiting to hear back from your friend, I'll show you how I'm going to verify the original phrase. First, there are 3 required arguments hash_hmac()
needs. The first is the algorithm, algo
, that will be used to create the hash. sha256
is a good place to start, but be sure to follow best practices when picking which algorithm best fits your use case.
The second argument is the data string. This string is what will actually be hashed. In our case, we are hashing 'elephants are gray'. Lastly, we have to provide the third argument, key, which is used to generate the hash. The key plays a role loosely similar to a salt, but unlike a salt it must be kept secret. If you're using a framework like Symfony, APP_SECRET could be used as your key, so long as it's never exposed.
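As a rough sketch of those three arguments in practice, the snippet below checks that the chosen algorithm is available for HMAC and reads the key from the environment instead of hard-coding it. The getenv('APP_SECRET') lookup and the fallback value are assumptions for illustration only; your framework may expose its secret differently. hash_hmac_algos() needs PHP 7.2 or newer.

```php
<?php
// Assumption: the secret lives in an APP_SECRET environment variable.
// The fallback exists only so this sketch runs on its own.
$key = getenv('APP_SECRET') ?: 'secret_key';

$algo = 'sha256';

// hash_hmac_algos() lists the algorithms that can be used with hash_hmac().
if (!in_array($algo, hash_hmac_algos(), true)) {
    throw new RuntimeException("Algorithm {$algo} is not available for HMAC");
}

$hash = hash_hmac($algo, 'elephants are gray', $key);

var_dump(strlen($hash)); // int(64)
```

Whatever key ends up being used, the hex output for sha256 is always 64 characters long.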
There is a 4th optional argument to hash_hmac()
, raw_output
, but I'll get to that later. Meanwhile, let's see what happens when we create a second hash using 'elephants are gray' as the data string again and compare it to the first.
$secondGrayElephant = hash_hmac('sha256', 'elephants are gray', 'secret_key');

// Compares two hashes with hash_equals() and reports whether they match.
function verifyElephantColor(string $knownElephant, string $suspiciousElephant): string
{
    if (hash_equals($knownElephant, $suspiciousElephant)) {
        return 'We have 2 gray elephants';
    }

    return 'Ooops, one of the elephants changed color';
}

echo verifyElephantColor($elephantHash, $secondGrayElephant);
// We have 2 gray elephants
The verifyElephantColor() function above would return 'We have 2 gray elephants'. This is because $secondGrayElephant is using the same algorithm, data, and key as $elephantHash.
That was quick! Your friend just said to me 'elephants are pink'. Hmm, something seems fishy about this. I don't remember telling you the PHP elephant is pink... Let's find out for sure by calling the verifyElephantColor()
method again.
$pinkElephant = hash_hmac('sha256', 'elephants are pink', 'secret_key');
echo verifyElephantColor($elephantHash, $pinkElephant);
// Ooops, one of the elephants changed color
AH HA! See, I never said elephants are pink! Since the data
supplied to hash_hmac()
differs from the original data used to create $elephantHash
, we are given a different hashed string. The same holds true if a different algo
or key
were used.
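To see that last claim for yourself, here's a small sketch that changes only the algorithm, and then only the key. The 'another_key' value is made up for this example.

```php
<?php
$original = hash_hmac('sha256', 'elephants are gray', 'secret_key');

// Same data and key, different algorithm.
$otherAlgo = hash_hmac('sha512', 'elephants are gray', 'secret_key');

// Same data and algorithm, different (hypothetical) key.
$otherKey = hash_hmac('sha256', 'elephants are gray', 'another_key');

var_dump(hash_equals($original, $otherAlgo)); // bool(false)
var_dump(hash_equals($original, $otherKey));  // bool(false)
```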
How can we leverage PHP to shrink an elephant?
Now that you know how to create a hash with hash_hmac
and verify the hash with another, using hash_equals()
, let's see if we can get that elephant to shed a pound or two. Remember that fourth and optional argument, raw_output
, I mentioned earlier? The documentation states that when raw_output
is set to true
, it will return raw binary data. Let's have a look at that...
Inner workings of hash_hmac:
Diving into the PHP source code, specifically php-src/ext/hash/hash.c
, we eventually find the hash_hmac
declaration and see that it calls the internal php_hash_do_hash_hmac()
function.
PHP_FUNCTION(hash_hmac)
{
    php_hash_do_hash_hmac(INTERNAL_FUNCTION_PARAM_PASSTHRU, 0, 0);
}
Without getting into the nitty gritty of exactly how the php_hash_do_hash_hmac
function generates a hash, we can skip down to the return logic:
// php_hash_do_hash_hmac()
if (raw_output) {
    ZSTR_VAL(digest)[ops->digest_size] = 0;
    RETURN_NEW_STR(digest);
} else {
    zend_string *hex_digest = zend_string_safe_alloc(ops->digest_size, 2, 0, 0);
    php_hash_bin2hex(ZSTR_VAL(hex_digest), (unsigned char *) ZSTR_VAL(digest), ops->digest_size);
    ZSTR_VAL(hex_digest)[2 * ops->digest_size] = 0;
    zend_string_release_ex(digest, 0);
    RETURN_NEW_STR(hex_digest);
}
Stripping away the C code that is irrelevant to this explanation leaves us with:
if (raw_output) {
    RETURN_NEW_STR(digest);
} else {
    php_hash_bin2hex(ZSTR_VAL(hex_digest), (unsigned char *) ZSTR_VAL(digest), ops->digest_size);
    RETURN_NEW_STR(hex_digest);
}
When we call hash_hmac() in PHP, the 4th argument, raw_output, is used in this if/else block. So if raw_output = true, php_hash_do_hash_hmac() returns a nice chunk of raw binary data. If your business logic consumes binary, proceed no further. If binary isn't to your taste, use the default value, false, for raw_output.
When raw_output = false, we can see that the digest (binary representation) is passed to php_hash_bin2hex(), whose output is ultimately returned by php_hash_do_hash_hmac(). Sound familiar? php_hash_bin2hex() performs the same conversion as the bin2hex() function that we use in userland code.
In short, hash_hmac() returns either a binary or a hexadecimal representation of the hash computed from the data passed to the function, depending on the value of raw_output.
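You don't have to take the C code's word for it. A quick userland check, sketched below with the same inputs used throughout this article, shows that the two return values are just two spellings of the same digest.

```php
<?php
// Raw binary digest (raw_output = true).
$raw = hash_hmac('sha256', 'elephants are gray', 'secret_key', true);

// Hex digest (raw_output = false, the default).
$hex = hash_hmac('sha256', 'elephants are gray', 'secret_key');

// Converting one form into the other gives identical strings.
var_dump(bin2hex($raw) === $hex); // bool(true)
var_dump(hex2bin($hex) === $raw); // bool(true)
```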
Ok, I'm not seeing the point yet...
Don't worry, I'm getting to it now. Using the example from above -
$elephantHash = hash_hmac('sha256', 'elephants are gray', 'secret_key');
var_dump($elephantHash);
// string(64) d53c1002ddf2159b926ae4eaf775fb0242282e826070bb79a899bbf1af45642f
We now know that our $elephantHash value is actually a 64-character string: the hex encoding of a 32-byte digest. I'm a huge fan of using the least amount of data possible, especially when I have to store that data in a data store such as MySQL.
Let's try this instead -
$elephantHash = hash_hmac('sha256', 'elephants are gray', 'secret_key', true);
// $elephantHash now holds the raw binary hash of the 'elephants are gray' data string, because we passed true as the raw_output argument.
$skinnyElephant = base64_encode($elephantHash);
var_dump($skinnyElephant);
// string(44) "1TwQAt3yFZuSauTq93X7AkIoLoJgcLt5qJm78a9FZC8="
$skinnyElephant has a string length of 44 characters, while the hex-encoded $elephantHash from the earlier example is 64 characters.
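Here are all three forms side by side, as a minimal sketch. The variable names are mine; the lengths follow directly from sha256 producing a 32-byte digest.

```php
<?php
$hex    = hash_hmac('sha256', 'elephants are gray', 'secret_key');
$raw    = hash_hmac('sha256', 'elephants are gray', 'secret_key', true);
$base64 = base64_encode($raw);

var_dump(strlen($raw));    // int(32)
var_dump(strlen($hex));    // int(64)
var_dump(strlen($base64)); // int(44)

// Nothing is lost along the way: decoding gives back the raw digest.
var_dump(base64_decode($base64, true) === $raw); // bool(true)
```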
Holy Elephant, you've lost 20 characters! How can this be?!
In a brief discussion with Sara Golemon, one of PHP's core developers, she provided 2 key takeaways with respect to encoding the binary result of hash_hmac() using base64_encode() vs hash_hmac(..., false):
Neither version is more or less secure than the other. Each has effectively* 256 bits of entropy.
bin2hex() increases the size of the string it's given by 2:1; base64_encode() increases the size of the input by about 4:3**.
* It's actually slightly lower than 256bit for math reasons, but the actual entropy of both is precisely equal. They're just different representations of the same data.
** Depends on input block alignment which is based on 3 octets. sha256 isn't an exact multiple of blocks, so the expansion is actually 11:8, which is still well below 2:1.
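The arithmetic behind those ratios can be sketched in a few lines. hexLength() and base64Length() are hypothetical helpers written for this example, not PHP built-ins: hex uses two characters per byte, while padded base64 uses four characters for every block of three bytes, rounding the last block up.

```php
<?php
// Hypothetical helper: two hex characters per input byte.
function hexLength(int $bytes): int
{
    return $bytes * 2;
}

// Hypothetical helper: four base64 characters per 3-byte block, padded.
function base64Length(int $bytes): int
{
    return 4 * intdiv($bytes + 2, 3);
}

foreach (['sha1', 'sha256', 'sha512'] as $algo) {
    $bytes = strlen(hash_hmac($algo, 'elephants are gray', 'secret_key', true));
    printf("%s: %d bytes, %d hex, %d base64\n", $algo, $bytes, hexLength($bytes), base64Length($bytes));
}
// sha1: 20 bytes, 40 hex, 28 base64
// sha256: 32 bytes, 64 hex, 44 base64
// sha512: 64 bytes, 128 hex, 88 base64
```

For sha256 that works out to 44 characters for 32 bytes, which is the 11:8 ratio from the footnote.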
Why does any of this matter?
As we can see above, using base64_encode(hash_hmac(..., true)) results in an approximately 31% smaller "footprint" (44 characters instead of 64) compared to using hash_hmac(..., false). If you're developing an application that doesn't care about how much storage space or I/O is being utilized in a data store, then either of the 2 methods will suit your needs perfectly fine.
However, if scalability, storage space, network performance, etc. are of any concern to you, use base64_encode(hash_hmac(..., true)). This will ultimately reduce overhead and use less space in persistence while, theoretically, speeding up network calls and app performance. The smaller the data, the faster the data can be consumed.
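Putting the pieces together, a sign-and-verify pair built on the shorter encoding might look like the sketch below. signElephant() and verifyElephant() are hypothetical names of my own; this is not how Symfony or Laravel implement it internally.

```php
<?php
// Hypothetical helper: returns the base64-encoded HMAC of $data.
function signElephant(string $data, string $key): string
{
    return base64_encode(hash_hmac('sha256', $data, $key, true));
}

// Hypothetical helper: recomputes the signature and compares it
// with hash_equals(), which is safe against timing attacks.
function verifyElephant(string $data, string $signature, string $key): bool
{
    return hash_equals(signElephant($data, $key), $signature);
}

$signature = signElephant('elephants are gray', 'secret_key');

var_dump(verifyElephant('elephants are gray', $signature, 'secret_key')); // bool(true)
var_dump(verifyElephant('elephants are pink', $signature, 'secret_key')); // bool(false)
```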
Footnote
I'm not a security expert... I'm a developer just like you. Use the information I have conveyed in this article with a bit of common sense. I.e. don't go using hash_hmac('sha256'...) to secure the admin panel of your retirement fund... You get the picture.
Lastly I'd like to thank Sara Golemon for taking a few minutes to verify and answer a few thoughts/questions I had that led up to this article. Keep up the great work you do for the PHP community!
As always, any comments, questions, and/or constructive criticism are welcome.
- Jesse Rushlow
jr (at) rushlow (dot) dev
@JesseRushlow