DEV Community

🚀 Human-Regex: Write Readable Regular Expressions Like English

Ridwan Ajibola on February 03, 2025

Created by Ridwan Ajibola Sick of trying to understand those confusing regex patterns? Let's change that. // Before: Cryptic regex f...

Peter Vivo • Feb 4 '25

Good work, but I'm missing a capitalLetter() function: as it stands, a password can pass the password example without containing a capital letter.
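
To make the gap concrete, here is a minimal plain-RegExp sketch of a password rule that does require a capital letter. The specific rules (at least 8 characters, one digit, one ASCII uppercase letter) are assumptions for illustration and are not taken from the post's example; the sketch uses native syntax only, since a capitalLetter() method does not exist in the library yet.

```javascript
// Assumed rules for illustration: 8+ characters, at least one digit,
// and at least one ASCII uppercase letter (the check the comment says is missing).
const passwordPattern = /^(?=.*\d)(?=.*[A-Z]).{8,}$/;

console.log(passwordPattern.test("password1")); // false: no capital letter
console.log(passwordPattern.test("Password1")); // true
```

Each lookahead adds one independent requirement, so dropping `(?=.*[A-Z])` is exactly what lets an all-lowercase password through.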

If you try a harder regexp example, you are sure to find a few more missing functions.

Check this code: github.com/Pengeszikra/flogon-gala...
It is the markdown parser from this game: dev.to/pengeszikra/javascript-grea... Try to recreate those regexps with your module and you will find what is missing.

Ridwan Ajibola • Feb 4 '25

Good! I will make sure to include the missing methods in the new release.

Thanks

Jonas Scholz • Feb 4 '25

I never really understood what's so hard about regex. I think if you just learn the grammar once, you don't really forget it anymore, and it's perfectly understandable. Nice idea anyway :)

Ridwan Ajibola • Feb 4 '25

Thanks Jonas

Manuchehr • Feb 4 '25

Skill issues to be honest

Ridwan Ajibola • Feb 4 '25

Thanks

Ridwan Ajibola • Feb 4 '25

I’d appreciate it if you could star the repo.

Mohammed Samgan Khan • Feb 5 '25

this is cool man, like really cool...

Ridwan Ajibola • Feb 6 '25

Thanks a lot @msamgan! Your contributions are really appreciated. If you find this project useful, it would mean a lot if you could star the repo; it helps others discover it too!

Ridwan Ajibola • Feb 6 '25

Thanks a lot @msamgan, I’d appreciate it if you could star the repo.

BestCodes • Feb 15 '25

I starred the repo!
Very cool idea. How does it do as far as performance compared to a plain Regex?

Ridwan Ajibola • Feb 17 '25 • Edited

Thanks a lot @best_codes

Ridwan Ajibola • Feb 17 '25 • Edited

Benchmark results seem inconsistent, so I'll use performance.now() for more accurate testing.

Suite1: Human Regex x 0.15 ops/sec ±0.97% (3874 runs sampled)
Suite1: Native RegExp x 0.07 ops/sec ±0.47% (3596 runs sampled)
Suite1 Fastest is Human Regex
Suite2: Native RegExp x 0.05 ops/sec ±0.31% (3346 runs sampled)
Suite2: Human Regex x 0.04 ops/sec ±0.24% (3187 runs sampled)
Suite2 Fastest is Native RegExp
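
A performance.now() comparison along the lines mentioned above could be sketched as follows. This is only a rough timing loop, not the benchmark used for the numbers in this comment. The `timeIt` helper and the email-like pattern are hypothetical, and the "rebuilt" case is just a stand-in for the cost of constructing a pattern on every call; it does not call Human-Regex, whose API is not shown in this thread.

```javascript
// Hypothetical helper: run fn many times and report the elapsed wall-clock time.
// performance is a global in modern Node.js and in browsers.
function timeIt(label, fn, iterations = 100000) {
  const start = performance.now();
  for (let i = 0; i < iterations; i++) fn();
  const elapsedMs = performance.now() - start;
  return { label, iterations, elapsedMs };
}

const input = "user@example.com";
const precompiled = /^[^@\s]+@[^@\s]+\.[a-z]{2,}$/i;

// Case A reuses one compiled pattern.
// Case B rebuilds the pattern on every call, as a stand-in for builder overhead.
const reused = timeIt("reused RegExp", () => precompiled.test(input));
const rebuilt = timeIt("rebuilt RegExp", () =>
  new RegExp("^[^@\\s]+@[^@\\s]+\\.[a-z]{2,}$", "i").test(input)
);

console.log(reused, rebuilt);
```

Separating construction from matching matters here: a builder only adds cost when the pattern is created, so a pattern built once and reused should match at native speed.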

BestCodes • Feb 17 '25

In my tests, Human Regex is several times slower, but the difference is negligible except at large scale or when regexes are built very frequently.

Samuel Munoz • Feb 26 '25

Ported this to C# (It's .NET 8+ exclusive for now). Check it out:

Repo: github.com/SamuelMunoz/human-regex...
Nuget Package: nuget.org/packages/HumanRegexBuilder/

Ridwan Ajibola • Mar 10 '25 • Edited

Good job! I’ve starred the repo

Paweł bbkr Pabian • Feb 4 '25 • Edited

I see a bug / inconsistency:

If .digit() matches \d, then it corresponds to the Decimal_Number Unicode property. For example (I'm not familiar with JS so I'll use Raku):

$ raku -e 'say "1๖" ~~ /\d+/;'   # DIGIT ONE and THAI DIGIT SIX codepoints
｢1๖｣

$ raku -e 'say "1๖" ~~ /<:Decimal_Number>+/;' # same character class
｢1๖｣

$ raku -e 'say "1๖" ~~ /<:digit>+/;' # also the same
｢1๖｣

Then .letter() should, to be consistent, convert to the Letter Unicode property, not a-z. For example:

$ raku -e 'say "aت" ~~ /<:Letter>+/';    #  LATIN SMALL LETTER A and ARABIC LETTER TEH
｢aت｣

You should not make implicit ASCII / non-ASCII assumptions, where one method works differently from its sibling.
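
For readers who want to try the same strings in JavaScript, here is a small sketch using only built-in RegExp syntax. One difference from the Raku runs above: in JavaScript `\d` matches only ASCII 0-9, with or without the u flag, so Unicode-aware classes have to be requested explicitly with `\p{...}`. What .digit() and .letter() actually emit is not shown in this thread, so nothing below calls the library.

```javascript
const digits = "1\u0E56";  // DIGIT ONE + THAI DIGIT SIX
const letters = "a\u062A"; // LATIN SMALL LETTER A + ARABIC LETTER TEH

// \d is ASCII-only in JavaScript, so it stops before the Thai digit.
const asciiDigits = digits.match(/\d+/u)[0];
// \p{Nd} is the Decimal_Number property and needs the u flag.
const unicodeDigits = digits.match(/\p{Nd}+/u)[0];

// [a-z] stops before the Arabic letter; \p{L} is the Letter property.
const asciiLetters = letters.match(/[a-z]+/)[0];
const unicodeLetters = letters.match(/\p{L}+/u)[0];

console.log(asciiDigits.length, unicodeDigits.length);   // 1 2
console.log(asciiLetters.length, unicodeLetters.length); // 1 2
```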

Another bug you have is anchoring:

$ perl -E 'say "match" if "a\n" =~ /a$/' # oops!
match

The $ token means end of the logical string. What you are probably looking for is \z:

$ perl -E 'say "match" if "a" =~ /a\z/'
match

$ perl -E 'say "match" if "a\n" =~ /a\z/'
$  # no match, most likely expected result
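
The Perl runs above can be compared with JavaScript's built-in behavior in a few lines. This sketch assumes nothing about which flags Human-Regex sets, since that is not stated in this thread. JavaScript has no `\z`; instead, `$` without the m flag matches only at the very end of the input, and only the m flag makes it match before a line terminator.

```javascript
// Without the m flag, $ matches only at the very end of the input.
const strictEnd = /a$/.test("a\n");
// With the m flag, $ also matches just before a line terminator.
const multilineEnd = /a$/m.test("a\n");
// A string with no trailing newline matches either way.
const plainEnd = /a$/.test("a");

console.log(strictEnd, multilineEnd, plainEnd); // false true true
```

So the trailing-newline surprise shows up in JavaScript only when the m flag is in play.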

I don't want to discourage you, but I really dislike those "regex to human" modules. They make code crazy error-prone, because - as I have just shown - you don't see explicitly what you are matching. Things get worse when you are working on a multi-language stack and you want to exchange your PCRE regexps with someone using another language. Basically all "Why This Matters" points are just the opposite - new developers will not understand regexes better, there will be more archeology because you will need to decipher an additional layer of abstraction, and collaboration will be more difficult.

My advice would be to stick directly (or at least closely) to Unicode properties. Drop the ambiguous letter() method and add Uppercase_Letter(), mapping directly to the Lu property. Then build modifiers on top of that, like Uppercase_Letter('ascii') or Uppercase_Letter('script'=>'Latin'). Otherwise this will be a false friend: a module that is supposed to make your life easier but introduces weird errors and security risks because it hides too many assumptions under the hood.
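
One way the suggestion above might look in JavaScript is sketched below. The `uppercaseLetter` function and its `ascii` / `script` options are hypothetical; they are not part of Human-Regex and only illustrate a Unicode-first default with explicit narrowing.

```javascript
// Hypothetical helper (not part of Human-Regex): returns a pattern source string
// for one uppercase letter, defaulting to the Unicode Lu property.
function uppercaseLetter(options = {}) {
  if (options.ascii) return "[A-Z]";
  if (options.script) {
    // The lookahead restricts the script, then \p{Lu} consumes the character.
    return `(?=\\p{Script=${options.script}})\\p{Lu}`;
  }
  return "\\p{Lu}";
}

// Property escapes such as \p{Lu} require the u flag.
const anyUpper = new RegExp(`^${uppercaseLetter()}$`, "u");
const asciiUpper = new RegExp(`^${uppercaseLetter({ ascii: true })}$`, "u");
const latinUpper = new RegExp(`^${uppercaseLetter({ script: "Latin" })}$`, "u");

console.log(anyUpper.test("\u0416"));   // true: CYRILLIC CAPITAL LETTER ZHE is Lu
console.log(asciiUpper.test("\u0416")); // false: outside A-Z
console.log(latinUpper.test("\u00C9")); // true: LATIN CAPITAL LETTER E WITH ACUTE
console.log(latinUpper.test("\u0416")); // false: not in the Latin script
```

With this shape the ASCII restriction is visible at the call site instead of being an assumption hidden inside the method name.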

Ridwan Ajibola • Feb 4 '25

This is explicitly for JavaScript and not for any other language.

Paweł bbkr Pabian • Feb 4 '25 • Edited

Sure. I pointed out universal issues. Imagine a developer joining some project that uses this module. If he/she already has regular expression experience, this interface will be confusing, because your assumption of what "letter" or "endAnchor" are is completely different from what those things mean in terms of Unicode properties and the PCRE standard.

Same goes for "tld". Your module does not match TLDs. It only matches what you consider to be a TLD. Exactly 3 items out of 1589 currently known TLDs, so right out of the box it has a 99.81% failure rate.
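
The 99.81% figure follows directly from the two counts quoted above, read as the share of known TLDs that the method does not cover. A quick check of the arithmetic:

```javascript
// Counts taken from the comment above: 3 supported out of 1589 known TLDs.
const supportedTlds = 3;
const knownTlds = 1589;

// Share of known TLDs that are not covered, as a percentage.
const failureRate = (1 - supportedTlds / knownTlds) * 100;

console.log(failureRate.toFixed(2) + "%"); // "99.81%"
```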

I'm not trying to be mean, I'm just saying that pseudo-standards or partially implemented specs are universally bad and sooner or later backfire in every project.