In this post I'll show one of the methods I usually use to generate unique random codes.
First, the function that will return the string code.
function generateCode(int $size = 10)
{
    // Note: every character appears once, so a shuffled substring never repeats a character
    $chars = '0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz';
    $code = substr(str_shuffle($chars), 0, $size);
    return $code;
}
Explanation: a very simple function that accepts an integer argument, which will be the length of our code; if no length is given, 10 is used by default.
Then we define the string of characters we want to be used to build our code.
Finally, the code itself is built by shuffling the string and taking a substring, starting from the first character, as long as the length passed to the function.
For the code to be unique, we can check it against the table we want to interact with:
// Generate + update code
do {
    // Generate a candidate code
    $code = generateCode();
    // Search for an identical code; if found, generate another one
    $stmt = $conn->prepare("SELECT count(id) FROM tablename WHERE code = ?");
    $stmt->bind_param("s", $code);
    $stmt->execute();
    $result = $stmt->get_result();
    $row_code = $result->fetch_row();
} while ($row_code[0] != 0);
// Update record
$stmt = $conn->prepare("UPDATE tablename SET code = ? WHERE id = ?");
$stmt->bind_param("si", $code, $id);
$stmt->execute();
Explanation: with a do-while we call the function and get a first candidate code, then search for it in the table with a SELECT. The while condition checks whether the count is different from zero (meaning the code already exists). If it does, the loop keeps generating codes until a truly brand-new one comes up. Then execution exits the loop and we can do something with this unique code, such as updating an existing record.
Of course, the shorter the codes and the fewer the characters used to build them, the higher the risk of a very long, resource-hungry loop.
If you have any suggestion or correction about this, I'll be glad to hear about it.
I would HIGHLY recommend against using str_shuffle for this purpose. It is not a cryptographically random function. Values over time will not be as unique as you may wish them to be. You'll have collisions in real-world scenarios as things scale up.
Something like random_bytes is more suited to this.
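For instance, a minimal sketch (my own illustration, not from the original comment): picking each character with random_int, which draws from the same CSPRNG that backs random_bytes, keeps the choose-your-own-alphabet property while being cryptographically secure. Unlike str_shuffle, characters may repeat, which also enlarges the code space.

```php
<?php
// Cryptographically secure variant of the generator: each character is
// chosen independently via random_int (a CSPRNG), not via str_shuffle.
function generateSecureCode(int $size = 10): string
{
    $chars = '0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz';
    $max = strlen($chars) - 1;
    $code = '';
    for ($i = 0; $i < $size; $i++) {
        $code .= $chars[random_int(0, $max)];
    }
    return $code;
}
```

With 62 characters and repeats allowed, a 10-character code has 62^10 possible values rather than only the distinct-character permutations str_shuffle can produce.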
Thanks for this comment. As I pointed out in the post, this piece of code is not that efficient in large-scale applications and, you are right, also not for cases where you need a "secure" code, so I agree with you. It is suitable for small applications and for cases where you need/want to choose which characters your code is made of.
Anyway, having said that, uniqueness is guaranteed by the second part of the code I posted, so I don't think this can be an issue in any way, no matter how big things are scaled up. Or am I missing something?
There are no locks on the database, so there is a potential race condition there. Additionally, if this were a multi-primary database (like Galera), those locks cannot be guaranteed anyway. It's usually better to just let existing dedicated unique-ID systems do their thing, like auto-increment primary keys. They have a lot of consideration in their implementation to ensure uniqueness, down to how they handle thread locking and mutexes.
I don't think it's a matter of locks, nor "unique ID systems" here.
Checking the DB for duplicates can be a way to ensure uniqueness, depending on the type of DB and the type of application.
But I appreciate your point.
Thanks.
The locks have to deal with the race condition you've introduced in the example code. What happens when there are two active connections to the database at the same time? For example, two separate web requests coming in at the same time. If they both generate the same "unique" ID at the same time, they'll both check to see if the ID exists, which it doesn't yet for either. Then the first goes to insert, and then the second goes to insert. The database lock would be a single mutex to prevent this very condition from happening. Also, standard unit testing usually doesn't cover these types of scenarios either, since tests are generally single-instanced and single-threaded. It's just something I want readers to be aware of when building systems like this, and a potential issue that will arise over time.
Your point is definitely clear. I agree.
You gave me something to think about...
Let me ask you this: being on a MySQL DB with the field set as unique, instead of looking for the code in the table, what about wrapping the insert in a try-catch and, if the error code refers to a duplicate entry, trying again?
Something like this:
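A sketch of that idea (assuming a UNIQUE index on the code column, mysqli running in exception mode, and MySQL's duplicate-entry error code 1062; `$conn`, `$id`, and `tablename` are as in the post's example):

```php
<?php
// Make mysqli throw exceptions instead of returning false on errors
// (this is the default from PHP 8.1 onward).
mysqli_report(MYSQLI_REPORT_ERROR | MYSQLI_REPORT_STRICT);

do {
    $code = generateCode();
    try {
        // Let the UNIQUE index enforce uniqueness instead of SELECTing first
        $stmt = $conn->prepare("UPDATE tablename SET code = ? WHERE id = ?");
        $stmt->bind_param("si", $code, $id);
        $stmt->execute();
        $done = true;
    } catch (mysqli_sql_exception $e) {
        if ($e->getCode() !== 1062) {
            throw $e; // not a duplicate-key violation: rethrow
        }
        $done = false; // duplicate code: generate another one and retry
    }
} while (!$done);
```

This moves the uniqueness check into the database itself, so two concurrent requests can no longer both pass a pre-insert SELECT with the same code.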
So a few things here to consider.
Firstly, if you're dealing with a single database instance, this will work. If you're dealing with a Galera instance, this should work. If you're dealing with primary-replica style replication, this will also work. However, if you're doing primary-primary replication with writes to multiple nodes, this could still potentially lead to a race condition, so it is best to know what your replication topology is.
Secondly, using random data for keys works in small to medium scale (hundreds, thousands, maybe even a few million rows), but once you get beyond that, range locking within the database engine for the key starts to become a performance concern.
It is also important to take into consideration the data size. Only 10 characters will eventually fill up. And the closer to full you get, the longer it is going to take to find a unique value. Something to keep in mind as services scale upwards in ways not thought about at creation time. For instance, YouTube published articles about increasing their view counter on their videos from a 32-bit integer to a 64-bit integer, because Gangnam Style broke the 2 billion views mark.
str_shuffle is also listed as not cryptographically secure. These types of pseudo random generators are usually designed to be fast, and as such, may not even be able to produce every single possible value in a given set of constraints, so you may be hitting those collisions even sooner.
Interesting points here, thanks!
That's nice, but maybe it would make sense to add a feature to "auto-choose" the length of the code. At first you can start with "0", then "1" ... "a0" ... "ax0" and so on. So code generation will start out fast and get slower over time. By the time the code length is around 10 characters, billions of unique codes will already have been generated.
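One way to sketch that growing-length idea (my own illustration, not from the comment): encode an ever-increasing integer, such as the row's auto-increment id, in base 62, so codes start one character long and only grow as the counter does.

```php
<?php
// Encode a non-negative integer in base 62: 0 -> "0", 61 -> "z",
// 62 -> "10", and so on. Codes grow in length only as the counter grows.
function toBase62(int $n): string
{
    $chars = '0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz';
    if ($n === 0) {
        return '0';
    }
    $out = '';
    while ($n > 0) {
        $out = $chars[$n % 62] . $out; // prepend the digit for this place
        $n = intdiv($n, 62);
    }
    return $out;
}
```

The trade-off: these codes are unique by construction and need no duplicate check, but they are sequential and therefore predictable, so they are unsuitable wherever the code must not be guessable.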
Not a bad idea at all. It could be useful when you don't have to stick to a fixed length for other reasons.
I can't think of any downsides to your suggestion offhand.
Maybe it will work faster than a fixed length at the beginning, with few records, but slower past a certain point?
I don't know which variant is faster overall. Depending on the possibilities and the technical infrastructure, it makes sense to store the used keys in a MongoDB, which would be faster at storing key-value pairs. Or Elasticsearch or any other NoSQL database.
Great! Thank you, very interesting! :)