DEV Community

Aniket

Simple Regex in C

Why? Why not learn how something we use in software all the time actually works under the hood?


Yesterday, while I was searching for ideas, I came across an article on the internet about a Regex matcher in the C language. Why did I decide to implement a Regex in C? Because I was curious about how regular expressions work under the hood in any language. And what better language to implement it in than C?

C compiles to fast native code, and it hides very little: every pointer step the matcher takes is visible in the source. That makes it a good language for seeing how the algorithm actually works.


Credits:

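I would like to share the credit with Rob Pike, who wrote the original code, and Brian Kernighan of Princeton University, who wrote the article explaining it.

The article I came across is Brian Kernighan's "A Regular Expression Matcher", which walks through Rob Pike's code.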


If you are new to Regex, just visit this GitHub repo.


To build a Regex matcher, we first need to decide which constructs it has to support:

  c    matches any literal character c
  .    matches any single character
  ^    matches the beginning of the input string
  $    matches the end of the input string
  *    matches zero or more occurrences of the previous character
  \0   the C string terminator; marks the end of the regex and of the text (not a metacharacter)

Let's dive into the code:

Let's say we have two strings: one is the regular expression and the other is the text we want to search.

We can declare them like this:
char *regexp = "[your regex here]";
char *text = "[text here]";

From main, we pass both strings to the match function and store the return value in result. It is 1 (true) if the pattern matches somewhere in the text and 0 (false) otherwise.

int result = match(regexp, text);
if (result)
   printf("Pattern found in the text.\n");
else
   printf("Pattern not found in the text.\n");


Match Function:

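/* Prototypes: match calls matchhere, and matchhere calls matchstar,
   before either is defined. */
int matchhere(char *regexp, char *text);
int matchstar(int c, char *regexp, char *text);

/* match: search for regexp anywhere in text */
int match(char *regexp, char *text) {
    if (regexp[0] == '^')
        return matchhere(regexp+1, text);
    do {    /* must look even if the text is empty */
        if (matchhere(regexp, text))
            return 1;
    } while (*text++ != '\0');
    return 0;
}

/* matchhere: search for regexp at the beginning of text */
int matchhere(char *regexp, char *text) {
    if (regexp[0] == '\0')
        return 1;
    if (regexp[1] == '*')
        return matchstar(regexp[0], regexp+2, text);
    if (regexp[0] == '$' && regexp[1] == '\0')
        return *text == '\0';
    if (*text!='\0' && (regexp[0]=='.' || regexp[0]==*text))
        return matchhere(regexp+1, text+1);
    return 0;
}

/* matchstar: search for c*regexp at the beginning of text
   (from Rob Pike's original code; matchhere cannot link without it) */
int matchstar(int c, char *regexp, char *text) {
    do {    /* a * matches zero or more instances */
        if (matchhere(regexp, text))
            return 1;
    } while (*text != '\0' && (*text++ == c || c == '.'));
    return 0;
}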

What the code does:

^ : When ^ is the first character of the regular expression, the rest of the pattern must match starting at the very first character of the text. match therefore calls matchhere only once, at position 0, instead of retrying at every later position. If the pattern does not match there, the function returns 0 (false).

c: Any character without a special meaning is a literal and matches only itself. For example, the c in abc matches only the character 'c' at the corresponding position in the text. If the two characters differ, matchhere returns 0 for that starting position.

.: The . character is a wildcard that matches any single character in the text. For example, a.b matches 'a', followed by any one character, followed by 'b', so it matches "acb" and "a-b" but not "ab". The . fails only when the text has already ended, because it must consume exactly one character.

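* : The * character means zero or more occurrences of the previous character. In this matcher it applies only to a single character (a literal or .), not to a longer expression, because there is no grouping. For instance, ab* matches 'a' followed by zero or more 'b' characters, so it matches "a", "ab", and "abbb". matchhere hands this case to matchstar, which tries the rest of the pattern after zero occurrences, then after one, and so on. It returns 1 as soon as the rest of the pattern matches and 0 once it runs out of repeats.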

$: The $ character anchors the match to the end of the text. It is special only when it is the last character of the regular expression: matchhere then returns 1 if the text has also reached its end ('\0') and 0 if characters remain. Anywhere else in the pattern, $ is treated as a literal character.


The code for the Regex matcher above is available here: GitHub


You can get the whole Article

Follow and like for more blogs like this.
Connect with me on LinkedIn

Thanks for reading until the end.

