Our QA team are absolute monsters, they're monsters I tell you!
No matter how much I try to bribe them with drinks at our next company gathering to stop finding bugs in my code, they keep coming up with unreasonable requests and "expected behaviours"...ewwwww, gross!
Don't believe me?
Recently they asked for a list to be ordered alphanumerically.
Can you believe that? Monsters I tell you!
I mean, it isn't like an end user would expect a list to be sorted:
test 1
test 2
test 12
...why can't it just be like MySQL thinks it should be sorted:
test 1
test 12
test 2
Makes perfect sense right? Right?
Anyway lets not get too deep into who is right and who is wrong here (they're right FYI), I thought I better try and fix it.
I mean...how hard can it be? (foreshadowing)
The short answer
I know you might be busy, so the short answer is sort it in JS / PHP / whatever. Don't try and do it with MySQL.
Like seriously, just return your data set and then use Intl.Collator
if you are using Node, or natsort()
in PHP.
And you know what, that would have been fine. It would have solved the problem and I could have moved on with my day. QA would have been happy, list would be sorted, all good!
However, if you are like me then when your CTO reviews your PR and says:
I take that to heart!
I mean, he is so right!
Why should we have to sort things in code, MySQL should be able to do that for us!
What ensued was over 4 hours of fiddling and swearing to try and get MySQL to bend to my will.
Did I manage it? No...but kinda.
What I ended up with works well enough for user input in our specific scenario, and is much better than other solutions I found, but has it's limitations.
But we are getting ahead of ourselves, I want to show you what is it like trying to solve this in MySQL!
The Data for testing
First thing is first, let's start with the test data I used, just so you can see if I missed anything:
The Test Data
Test Data
INSERT INTO test_data (data_value) VALUES
('2 test'),
('12 test'),
('1 test'),
('20 test'),
('10 test'),
('1test'),
('2test'),
('test 1'),
('my data 2020-01-01'),
('my data 2020-02-01'),
('01-02-24 data'),
('12 test'),
('4a test'),
('my 2020-01-01 data 2020-01-01'),
('my 2020-02-01 data 2020-01-01'),
('my 2020-02-01 data 202-01-01'),
('my 2020-02-01 data 20-01-01'),
('my 2020-02-01 data 1-01-01'),
('my 2020-02-01 data 2-01-01'),
('my 2020-02-01 data 12-01-01'),
('my 2020-02-01 data 01-01-01'),
('my 2020-01-01 data 2020-02-01'),
('my 2020-01-01 data 2021-01-01'),
('my 2020-01-01 data 2120-01-01'),
('my 2120-01-01 data 2020-01-01'),
('4b test'),
('my test'),
('my 12 magic test'),
('my magic 12 test'),
('cheese and test 12'),
('42-a-1'),
('40-a-1'),
('40a'),
('FoClSy4727'),
('Pthw068bf'),
('6bfS'),
('HOFAp_Yx7920'),
('25hWTX'),
('dnjLlW1'),
('RHrIt72402eaLr'),
('cIhb42WFNQ'),
('9244uVCpGa'),
('yDKrkCp7960'),
('GeGIrPM-H86'),
('wrOae537LGCT'),
('WffSPaBA318'),
('kQ33596c'),
('3uEKHmHePf'),
('796h-eYWy'),
('833HufIZAS'),
('utjtV03Xns'),
('dlCSh87811'),
('13IUkOxEVl'),
('VHCok55901XYVk'),
('2RnSVwq'),
('AwtwQdn09'),
('gvSV6z'),
('uxWLO039hb'),
('vTg946');
Some nonsense with numbers in various places throughout the strings, seemed reasonable enough to me.
It is probably not a perfect data set, but good enough for what I was after!
So once I had some data I did what every developer does, asked the Google God for an answer:
Googling for natural sort in MySQL
Seems like the logical first step right?
MySQL has been around for 30 years, and alphanumeric sorting is something you would think people need to do often, so it must have some answers out there already right?
Short answer...it was exhausting. Every single article, Stack Overflow post etc. spitting out the same nonsense that doesn't work or is hyper specific to a particular data format.
I wanted something generic, so I could just sort alphanumerically everywhere!
So let me walk you through the pain I endured so you don't have to, and so you can decide if my answer is any better...or if I have just added to the pain!
Let's try some of the generic sort methods people suggested:
Test 1: the "+0" method
SELECT
*
FROM
test_data
ORDER BY
`data_value`+0 ASC
The idea being that it tries to force MySQL to do numeric ordering over lexicographical ordering.
This would work fine if we had numbers or words in each row (the numbers would indeed sort correctly and then the words).
However because we have mixed data, it just spat out a load of nonsense:
Test 1 results
id | data_value |
---|---|
44 | GeGIrPM-H86 |
25 | my 2120-01-01 data 2020-01-01 |
27 | my test |
28 | my 12 magic test |
29 | my magic 12 test |
34 | FoClSy4727 |
Full Results for test 1
id | data_value |
---|---|
44 | GeGIrPM-H86 |
25 | my 2120-01-01 data 2020-01-01 |
27 | my test |
28 | my 12 magic test |
29 | my magic 12 test |
34 | FoClSy4727 |
35 | Pthw068bf |
37 | HOFAp_Yx7920 |
39 | dnjLlW1 |
40 | RHrIt72402eaLr |
41 | cIhb42WFNQ |
43 | yDKrkCp7960 |
30 | cheese and test 12 |
45 | wrOae537LGCT |
46 | WffSPaBA318 |
47 | kQ33596c |
51 | utjtV03Xns |
52 | dlCSh87811 |
54 | VHCok55901XYVk |
56 | AwtwQdn09 |
57 | gvSV6z |
58 | uxWLO039hb |
59 | vTg946 |
17 | my 2020-02-01 data 20-01-01 |
8 | test 1 |
9 | my data 2020-01-01 |
10 | my data 2020-02-01 |
14 | my 2020-01-01 data 2020-01-01 |
15 | my 2020-02-01 data 2020-01-01 |
16 | my 2020-02-01 data 202-01-01 |
18 | my 2020-02-01 data 1-01-01 |
19 | my 2020-02-01 data 2-01-01 |
21 | my 2020-02-01 data 01-01-01 |
23 | my 2020-01-01 data 2021-01-01 |
22 | my 2020-01-01 data 2020-02-01 |
20 | my 2020-02-01 data 12-01-01 |
24 | my 2020-01-01 data 2120-01-01 |
6 | 1test |
11 | 01-02-24 data |
3 | 1 test |
55 | 2RnSVwq |
1 | 2 test |
7 | 2test |
48 | 3uEKHmHePf |
13 | 4a test |
26 | 4b test |
36 | 6bfS |
5 | 10 test |
2 | 12 test |
12 | 12 test |
53 | 13IUkOxEVl |
4 | 20 test |
38 | 25hWTX |
33 | 40a |
32 | 40-a-1 |
31 | 42-a-1 |
49 | 796h-eYWy |
50 | 833HufIZAS |
42 | 9244uVCpGa |
Ok so that didn't work (and I didn't really expect it to).
Let's try something else:
Test 2: The "length" hack
SELECT
alphanumeric, integer
FROM
sorting_test
ORDER BY
LENGTH(alphanumeric), alphanumeric
This is another interesting approach.
obviously "12" is longer than "1" so it would work there.
It also works if you have a standard format for data, like the "test 1, test 12, test 2" I mentioned at the beginning.
However, for our data it returns more nonsense:
Test 2 results
id | data_value |
---|---|
33 | 40a |
36 | 6bfS |
6 | 1test |
7 | 2test |
3 | 1 test |
1 | 2 test |
38 | 25hWTX |
32 | 40-a-1 |
31 | 42-a-1 |
57 | gvSV6z |
8 | test 1 |
59 | vTg946 |
5 | 10 test |
Full Results for test 2
id | data_value |
---|---|
33 | 40a |
36 | 6bfS |
6 | 1test |
7 | 2test |
3 | 1 test |
1 | 2 test |
38 | 25hWTX |
32 | 40-a-1 |
31 | 42-a-1 |
57 | gvSV6z |
8 | test 1 |
59 | vTg946 |
5 | 10 test |
2 | 12 test |
12 | 12 test |
4 | 20 test |
55 | 2RnSVwq |
13 | 4a test |
26 | 4b test |
39 | dnjLlW1 |
27 | my test |
47 | kQ33596c |
49 | 796h-eYWy |
56 | AwtwQdn09 |
35 | Pthw068bf |
53 | 13IUkOxEVl |
48 | 3uEKHmHePf |
50 | 833HufIZAS |
42 | 9244uVCpGa |
41 | cIhb42WFNQ |
52 | dlCSh87811 |
34 | FoClSy4727 |
51 | utjtV03Xns |
58 | uxWLO039hb |
44 | GeGIrPM-H86 |
46 | WffSPaBA318 |
43 | yDKrkCp7960 |
37 | HOFAp_Yx7920 |
45 | wrOae537LGCT |
11 | 01-02-24 data |
40 | RHrIt72402eaLr |
54 | VHCok55901XYVk |
28 | my 12 magic test |
29 | my magic 12 test |
30 | cheese and test 12 |
9 | my data 2020-01-01 |
10 | my data 2020-02-01 |
18 | my 2020-02-01 data 1-01-01 |
19 | my 2020-02-01 data 2-01-01 |
21 | my 2020-02-01 data 01-01-01 |
20 | my 2020-02-01 data 12-01-01 |
17 | my 2020-02-01 data 20-01-01 |
16 | my 2020-02-01 data 202-01-01 |
14 | my 2020-01-01 data 2020-01-01 |
22 | my 2020-01-01 data 2020-02-01 |
23 | my 2020-01-01 data 2021-01-01 |
24 | my 2020-01-01 data 2120-01-01 |
15 | my 2020-02-01 data 2020-01-01 |
25 | my 2120-01-01 data 2020-01-01 |
And that also makes sense, all it is doing is ordering on length first so:
"a1" would come before "1potato" by length.
Test 3: The Hardouken!
Just take a look at this beauty:
SELECT
*
FROM
test_data
ORDER BY
REPLACE(
REPLACE(
REPLACE(
REPLACE(
REPLACE(
REPLACE(
REPLACE(
REPLACE(
REPLACE(
REPLACE(
data_value,
'0', 'Ā'),
'1', 'ā'),
'2', 'Ă'),
'3', 'ă'),
'4', 'Ą'),
'5', 'ą'),
'6', 'Ć'),
'7', 'ć'),
'8', 'Ĉ'),
'9', 'ĉ')
COLLATE utf8_bin;
The meme is real!
But despite that, this one pretty much works (albeit slightly differently to how I would have done it).
This one is smart in that it replaces each number with a character that will be sorted based on their binary representation in Unicode.
Test 3 results
id | data_value |
---|---|
56 | AwtwQdn09 |
34 | FoClSy4727 |
44 | GeGIrPM-H86 |
37 | HOFAp_Yx7920 |
35 | Pthw068bf |
40 | RHrIt72402eaLr |
54 | VHCok55901XYVk |
46 | WffSPaBA318 |
41 | cIhb42WFNQ |
30 | cheese and test 12 |
Full Results for test 3
id | data_value |
---|---|
56 | AwtwQdn09 |
34 | FoClSy4727 |
44 | GeGIrPM-H86 |
37 | HOFAp_Yx7920 |
35 | Pthw068bf |
40 | RHrIt72402eaLr |
54 | VHCok55901XYVk |
46 | WffSPaBA318 |
41 | cIhb42WFNQ |
30 | cheese and test 12 |
52 | dlCSh87811 |
39 | dnjLlW1 |
57 | gvSV6z |
47 | kQ33596c |
9 | my data 2020-01-01 |
10 | my data 2020-02-01 |
29 | my magic 12 test |
27 | my test |
28 | my 12 magic test |
14 | my 2020-01-01 data 2020-01-01 |
22 | my 2020-01-01 data 2020-02-01 |
23 | my 2020-01-01 data 2021-01-01 |
24 | my 2020-01-01 data 2120-01-01 |
21 | my 2020-02-01 data 01-01-01 |
18 | my 2020-02-01 data 1-01-01 |
20 | my 2020-02-01 data 12-01-01 |
19 | my 2020-02-01 data 2-01-01 |
17 | my 2020-02-01 data 20-01-01 |
16 | my 2020-02-01 data 202-01-01 |
15 | my 2020-02-01 data 2020-01-01 |
25 | my 2120-01-01 data 2020-01-01 |
8 | test 1 |
51 | utjtV03Xns |
58 | uxWLO039hb |
59 | vTg946 |
45 | wrOae537LGCT |
43 | yDKrkCp7960 |
11 | 01-02-24 data |
3 | 1 test |
6 | 1test |
5 | 10 test |
2 | 12 test |
12 | 12 test |
53 | 13IUkOxEVl |
1 | 2 test |
55 | 2RnSVwq |
7 | 2test |
4 | 20 test |
38 | 25hWTX |
48 | 3uEKHmHePf |
13 | 4a test |
26 | 4b test |
32 | 40-a-1 |
33 | 40a |
31 | 42-a-1 |
36 | 6bfS |
49 | 796h-eYWy |
50 | 833HufIZAS |
42 | 9244uVCpGa |
You know what, I would use this if it wasn't for one thing. It would mean that we could no longer accept 'Ā' etc as input.
Unfortunately our software will be used internationally, so we can't make that call.
Additionally if you do decide to use this, it is worth noting your database column must use COLLATE 'utf8_bin'
on it as well or you may get errors.
Chat GPT Time
When googling fails, chat GPT can always help right?
I am not going to show all the results, but let's just say the following are just a few of the "did not work" ones it produced!
Test 4: GPT fever dream
SELECT
data_value,
-- Replace numbers with 'a' and letters with 'b' dynamically
(
SELECT GROUP_CONCAT(
CASE
WHEN c REGEXP '[0-9]' THEN 'a' -- Replace digits with 'a'
WHEN c REGEXP '[a-zA-Z]' THEN 'b' -- Replace letters with 'b'
ELSE c -- Leave other characters as-is
END
ORDER BY seq
) SEPARATOR ''
FROM (
SELECT SUBSTRING(data_value, seq, 1) AS c, seq
FROM (
SELECT data_value, seq
FROM test_data
JOIN (
SELECT n AS seq
FROM (SELECT 1 n UNION ALL SELECT 2 UNION ALL SELECT 3 UNION ALL SELECT 4 UNION ALL SELECT 5 UNION ALL SELECT 6 UNION ALL SELECT 7 UNION ALL SELECT 8 UNION ALL SELECT 9 UNION ALL SELECT 10) numbers
WHERE n <= 255 -- Handle up to 255 characters in a string
) seq_table
ON seq <= CHAR_LENGTH(data_value)
) char_table
) char_map
) AS sort_key
FROM test_data
ORDER BY sort_key;
It didn't run, and I couldn't quite work out why, but isn't it impressive looking!
What it seemed to be trying to do was replace all numbers with "a" and all letters with "b" to form strings like "aabaaa, aabba, abbaa" for comparison.
I would have probably spent more time trying to fix this if it wasn't flawed.
"11ab1" would produce "aabba"
"111ab1" would produce "aaaba".
So with the way MySQL sorts, 111ab1 would have been above 11ab1 anyway as "aaa" comes before "aab".
Nice idea, but sadly not quite it.
Test 5: GPT nearly nailed it!
This one surprised me.
I had seen it elsewhere already (as we know GPT is great at just spitting things out that it has already seen and isn't actually creative), but the explanation it gave made more sense!
SELECT
*
FROM
test_data
ORDER BY
CAST(data_value AS UNSIGNED), data_value ASC;
Ahhhh nearly. I really thought this was the answer.
But sadly it fails on "Test 1, Test 12, Test 2". It is good for sorting if the number comes first though!
My Turn
I still couldn't get what I wanted after several attempts. But I was now much closer to an answer after seeing a few techniques.
After a bit of noodle scratching and crying I thought I had solved it!
It works...kinda!
Here is what I came up with:
CREATE TABLE test_data (
id INT AUTO_INCREMENT PRIMARY KEY,
data_value VARCHAR(255) NOT NULL,
transformed_column VARCHAR(255) AS (
CASE
WHEN REGEXP_SUBSTR(data_value, '[0-9]+') IS NOT NULL THEN
REGEXP_REPLACE(
data_value,
'[0-9]+',
LPAD(REGEXP_SUBSTR(data_value, '[0-9]+'), 8, '0')
)
ELSE data_value -- Or another default value, e.g., 'No Numbers'
END
) STORED
);
Then we can query it like so:
SELECT
*
FROM
test_data
ORDER BY
transformed_column, data_value;
And this gives us nearly the results we would expect.
Test 6 results
data_value | transformed_column |
---|---|
1 test | 00000001 test |
01-02-24 data | 00000001-00000001-00000001 data |
1test | 00000001test |
2 test | 00000002 test |
2RnSVwq | 00000002RnSVwq |
2test | 00000002test |
3uEKHmHePf | 00000003uEKHmHePf |
4a test | 00000004a test |
Full Results for test 6
data_value | transformed_column |
---|---|
1 test | 00000001 test |
01-02-24 data | 00000001-00000001-00000001 data |
1test | 00000001test |
2 test | 00000002 test |
2RnSVwq | 00000002RnSVwq |
2test | 00000002test |
3uEKHmHePf | 00000003uEKHmHePf |
4a test | 00000004a test |
4b test | 00000004b test |
6bfS | 00000006bfS |
10 test | 00000010 test |
12 test | 00000012 test |
12 test | 00000012 test |
13IUkOxEVl | 00000013IUkOxEVl |
20 test | 00000020 test |
25hWTX | 00000025hWTX |
40-a-1 | 00000040-a-00000040 |
40a | 00000040a |
42-a-1 | 00000042-a-00000042 |
796h-eYWy | 00000796h-eYWy |
833HufIZAS | 00000833HufIZAS |
9244uVCpGa | 00009244uVCpGa |
AB-34Y67846 | AB-00000034Y00000034 |
AwtwQdn09 | AwtwQdn00000009 |
cheese and test 12 | cheese and test 00000012 |
cIhb42WFNQ | cIhb00000042WFNQ |
dlCSh87811 | dlCSh00087811 |
dnjLlW1 | dnjLlW00000001 |
E5-RMT893Y9 | E00000005-RMT00000005Y00000005 |
EV-489RY3DA | EV-00000489RY00000489DA |
FoClSy4727 | FoClSy00004727 |
GeGIrPM-H86 | GeGIrPM-H00000086 |
gvSV6z | gvSV00000006z |
HOFAp_Yx7920 | HOFAp_Yx00007920 |
kQ33596c | kQ00033596c |
my 12 magic test | my 00000012 magic test |
my 2020-01-01 data 2020-01-01 | my 00002020-00002020-00002020 data 00002020-00002020-00002020 |
my 2020-01-01 data 2020-02-01 | my 00002020-00002020-00002020 data 00002020-00002020-00002020 |
my 2020-01-01 data 2021-01-01 | my 00002020-00002020-00002020 data 00002020-00002020-00002020 |
my 2020-01-01 data 2120-01-01 | my 00002020-00002020-00002020 data 00002020-00002020-00002020 |
my 2020-02-01 data 01-01-01 | my 00002020-00002020-00002020 data 00002020-00002020-00002020 |
my 2020-02-01 data 1-01-01 | my 00002020-00002020-00002020 data 00002020-00002020-00002020 |
my 2020-02-01 data 12-01-01 | my 00002020-00002020-00002020 data 00002020-00002020-00002020 |
my 2020-02-01 data 2-01-01 | my 00002020-00002020-00002020 data 00002020-00002020-00002020 |
my 2020-02-01 data 20-01-01 | my 00002020-00002020-00002020 data 00002020-00002020-00002020 |
my 2020-02-01 data 202-01-01 | my 00002020-00002020-00002020 data 00002020-00002020-00002020 |
my 2020-02-01 data 2020-01-01 | my 00002020-00002020-00002020 data 00002020-00002020-00002020 |
my 2120-01-01 data 2020-01-01 | my 00002120-00002120-00002120 data 00002120-00002120-00002120 |
my data 2020-01-01 | my data 00002020-00002020-00002020 |
my data 2020-02-01 | my data 00002020-00002020-00002020 |
my magic 12 test | my magic 00000012 test |
my test | my test |
Pthw068bf | Pthw00000068bf |
RHrIt72402eaLr | RHrIt00072402eaLr |
test 1 | test 00000001 |
utjtV03Xns | utjtV00000003Xns |
uxWLO039hb | uxWLO00000039hb |
VHCok55901XYVk | VHCok00055901XYVk |
vTg946 | vTg00000946 |
WffSPaBA318 | WffSPaBA00000318 |
wrOae537LGCT | wrOae00000537LGCT |
yDKrkCp7960 | yDKrkCp00007960 |
It breaks on the rows:
data_value | transformed_column |
---|---|
my 2020-02-01 data 1-01-01 | my 00002020-00002020-00002020 data 00002020-00002020-00002020 |
my 2020-02-01 data 12-01-01 | my 00002020-00002020-00002020 data 00002020-00002020-00002020 |
my 2020-02-01 data 2-01-01 | my 00002020-00002020-00002020 data 00002020-00002020-00002020 |
As you can see, but overall I think it will suit our needs for now!
This was based on a LPAD
example I saw. But it had the same flaw that it only worked on Strings with numbers at the beginning.
So I took that principle and applied it to all number series.
Also this now uses a generated column so that it is fast on reads (although we could soon make it part of the query if it didn't hurt read performance too much or we were more bothered about write performance).
The big issue is the fact that this works through "brute force" by working around the fact that MySQL sees "11" as smaller than "2" due to checking one character at a time (the dumbed down version of what happens) by making all numbers 8 digits long.
It also replaces all numbers in the transformed_column
with the first number it finds and as far as I can tell, other than nesting about 20 statements and using coalesce
or something, there is no way to avoid this behaviour I could find.
Conclusion
I am not really sure? MySQL should join the 21st century and provide an alphanumeric sort option like other DBs?
While we wish for things to happen I think the short answer in the scope of this article is that I still haven't solved this.
But with that being said, I feel like test 6 is much closer to the answer than anything I found out in the wild and may be useful to someone someday?
Obviously if you are expecting numbers longer than 8 digits, you may want to adjust the LPAD
value as that is one limitation.
Also it isn't perfect so you can't rely on it for scenarios other than sorting user input so they can find things easily.
I think at this point my best bet is to throw it out into the world and see if some smarter people than me can solve this thing properly! (I am probably needing some much deeper knowledge of sorting in MySQL to really solve it!)
If someone can solve the problem I have where I can't change every number to 8 digits long, then this would work even better! (so "my 2020-02-01 data 2-01-01" would become "my 00002020-00000002-00000001 data 00000002-00000001-00000001").
Or maybe there is a literal one liner that I missed in all my research that solves this? (I will only cry a little bit if there is, I promise!).
Anyway, that is the best I have, but would love to see something better!
Finally here is a DB Fiddle to play with
Here is a DB fiddle of the generated column version to start from if you think you can spot a quick win / improvement:
https://www.db-fiddle.com/f/o2ohcGVAgHZQg4teg1s9jW/1034
Thanks for reading, catch you all soon!
Top comments (9)
Now, that is a dedication to duty. Personally I'd have dropped a support call into Oracle and told the QA team that you are waiting on a reply, pretty sure you could string that along for a fair few months... ;)
Haha, might just do that! 🤣💗
Guarantee someone shows me how to do this in one line and I end up deleting the post or putting "here is the answer, ignore the rest of the article" at the top! haha
You said one line but hadn't specify how long can it be 🌝
Jokes aside it's not that "natural" as you might expect. I mean sire 10 is "greater" than "2" but it's the same order we use when ordering strings. E.g.
Imagine we accept as a society that when you combine
h
withi
it suddenly becomes "greater" in some dimension thanhk
hence you expectweird, isn't it?
I understand your struggle nevertheless and I can think of two ways of solving it; create a view pre-calculating the order (do you can consume it from different places) or just let it be and, as you said, sort it this way in JS/PHP or whatever 😅 I'd probably go for the latter.
Haha exactly, looks like that is the way for now, but I will have a see if I can hack together some impractical monstrosity that does it purely in MySQL and works perfectly...even if it is 200 times lower than JS / PHP / whatever! haha.
Now that you made me think about it... What about PL SQL ordering the substring of numbers dynamically with regex?
I've no computer RN to test
… and for readers who just want to do it in JS:
Exactly, way easier than the way I cam up with to actually do this in MySQL. 💗
Nice work as always Graham!!! Way to follow through while running into some elegant queries to keep in your toolbox!