jbxamora

Posted on Feb 24, 2023

URL REGEX

#regex #beginners #javascript #webdev

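This regular expression is designed to match common web URLs: an optional http:// or https:// prefix (with or without www.), a lowercase domain name, an optional port number, and an optional path. It does not match other protocols such as ftp, and it does not match bare IP addresses, because the final label must consist of 2 to 5 letters.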

/^(http:\/\/www\.|https:\/\/www\.|http:\/\/|https:\/\/)?[a-z0-9]+([\-\.]{1}[a-z0-9]+)*\.[a-z]{2,5}(:[0-9]{1,5})?(\/.*)?$/;

Break Down
-Anchors
-Quantifiers
-Grouping Constructs
-Bracket Expressions
-Character Classes
-The OR Operator
-Flags
-Character Escapes
-Regex Components
Language
Use Case
-Performance
Variations
-Modifying Regex
Testing
-Common Mistakes
-Comparing Regex
Author

Break Down

Anchors

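The regex starts with ^, an anchor that asserts the position at the start of the string, and ends with $, an anchor that asserts the position at the end of the string. Together they require the entire string to be a URL, not just part of it.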
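To see what the anchors change, the following sketch compares the pattern with and without them. The unanchored variant and the sample text are assumptions made for illustration; they are not part of the original pattern.

```javascript
// The article's pattern, anchored at both ends.
const anchored =
  /^(http:\/\/www\.|https:\/\/www\.|http:\/\/|https:\/\/)?[a-z0-9]+([\-\.]{1}[a-z0-9]+)*\.[a-z]{2,5}(:[0-9]{1,5})?(\/.*)?$/;

// Same pattern with the leading ^ and trailing $ stripped from its source text.
const unanchored = new RegExp(anchored.source.slice(1, -1));

const text = "visit https://www.example.com today";

console.log(anchored.test(text)); // false: the whole string is not a URL
console.log(unanchored.test(text)); // true: a URL-shaped substring exists
console.log(anchored.test("https://www.example.com")); // true
```

Anchoring is what makes the pattern suitable for validation; searching inside longer text needs the unanchored form.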

Quantifiers

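The ? is a quantifier that makes the preceding element optional. Here it makes the protocol group (http:// or https://, with or without www.), the port group, and the path group optional.

The * is a quantifier that matches the preceding element zero or more times. It repeats the group of additional hyphen- or dot-separated labels in the domain name, and in .* it lets the path contain any characters.

The + is a quantifier that matches the preceding element one or more times. It requires every label of the domain name to contain at least one letter or digit.

The {1,5} is a quantifier that matches the preceding element between 1 and 5 times. It limits the port number to 1 to 5 digits. Likewise, {2,5} limits the top-level domain to 2 to 5 letters. The {1} (exactly once) is redundant, since an element without a quantifier already matches exactly once.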
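The effect of the quantifier limits can be checked with a few inputs. This is a minimal sketch; the sample strings are made up for illustration.

```javascript
const urlPattern =
  /^(http:\/\/www\.|https:\/\/www\.|http:\/\/|https:\/\/)?[a-z0-9]+([\-\.]{1}[a-z0-9]+)*\.[a-z]{2,5}(:[0-9]{1,5})?(\/.*)?$/;

// Each sample probes one quantifier limit.
const samples = [
  "example.com:8080", // port with 4 digits: within {1,5}
  "example.com:123456", // port with 6 digits: exceeds {1,5}
  "example.c", // 1-letter top-level domain: below {2,5}
  "example.museum", // 6-letter top-level domain: exceeds {2,5}
  "example.com", // port and path omitted: allowed by ?
];

const results = Object.fromEntries(
  samples.map((sample) => [sample, urlPattern.test(sample)])
);
console.log(results);
// "example.com:8080" and "example.com" are true; the other three are false
```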

Grouping Constructs

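The () is used as a grouping construct to group together different parts of the regex, so that a quantifier or the OR operator applies to the whole group. Each pair of parentheses in this regex is also a capturing group, so the text it matched is available in the match result.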

Bracket Expressions

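The [] is a bracket expression that matches any single character from the enclosed set. Here, [a-z0-9] matches one lowercase letter or digit of the domain name, [\-\.] matches a hyphen or a dot between labels, and [0-9] matches one digit of the port number.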

Character Classes

The a-z and 0-9 inside the brackets are ranges that match any character between the two endpoints: a lowercase letter and a digit, respectively.

The OR Operator

The | is the OR operator, which matches one of several alternatives. Here it separates the four accepted prefixes: http://www., https://www., http://, and https://.
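The four alternatives can also be expressed with optional parts instead of the OR operator. The compact form below is a hypothetical rewrite for illustration, not the article's pattern; the sketch checks that both forms agree on a few sample inputs.

```javascript
const original =
  /^(http:\/\/www\.|https:\/\/www\.|http:\/\/|https:\/\/)?[a-z0-9]+([\-\.]{1}[a-z0-9]+)*\.[a-z]{2,5}(:[0-9]{1,5})?(\/.*)?$/;

// Hypothetical compact form: the "s" and the "www." are optional instead of listed out.
const compact =
  /^(https?:\/\/(?:www\.)?)?[a-z0-9]+([\-\.]{1}[a-z0-9]+)*\.[a-z]{2,5}(:[0-9]{1,5})?(\/.*)?$/;

const samples = [
  "http://www.example.com",
  "https://www.example.com",
  "http://example.com",
  "https://example.com",
  "example.com",
  "ftp://example.com", // rejected by both: ftp is not one of the alternatives
];

for (const sample of samples) {
  console.log(sample, original.test(sample), compact.test(sample));
}

const agree = samples.every(
  (sample) => original.test(sample) === compact.test(sample)
);
console.log(agree); // true
```

Listing the alternatives explicitly is longer but arguably easier to read for beginners; the compact form is easier to extend.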

Flags

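The regex does not use any flags. As a result it is case-sensitive: uppercase letters in the protocol or domain name, as in HTTPS://EXAMPLE.COM, do not match unless the i flag is added.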
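As a small illustration of what a flag would change, this sketch rebuilds the same pattern with the i (ignore case) flag. The variable names and the sample URL are assumptions for the example.

```javascript
const caseSensitive =
  /^(http:\/\/www\.|https:\/\/www\.|http:\/\/|https:\/\/)?[a-z0-9]+([\-\.]{1}[a-z0-9]+)*\.[a-z]{2,5}(:[0-9]{1,5})?(\/.*)?$/;

// Same source text, compiled with the "i" flag.
const caseInsensitive = new RegExp(caseSensitive.source, "i");

const shouted = "HTTPS://WWW.EXAMPLE.COM";

console.log(caseSensitive.test(shouted)); // false
console.log(caseInsensitive.test(shouted)); // true
```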

Character Escapes

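The \/ is a character escape that matches a literal forward slash, which would otherwise end the regex literal. Similarly, \. matches a literal dot instead of any character, and \- matches a literal hyphen.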
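Escapes behave differently when a pattern is written as a string for the RegExp constructor: the backslash itself has to be doubled, otherwise it is dropped before the regex engine sees it. The short pattern below is a made-up example, not part of the URL regex.

```javascript
// Regex literal: one backslash escapes the dot.
const literal = /^example\.com$/;

// String form: the backslash must be doubled to survive string parsing.
const fromString = new RegExp("^example\\.com$");

// Without an escape, the dot is a wildcard that matches any character.
const unescaped = new RegExp("^example.com$");

console.log(literal.test("exampleXcom")); // false
console.log(fromString.test("exampleXcom")); // false
console.log(unescaped.test("exampleXcom")); // true
```

Forward slashes need no escape in the string form, because there is no closing / delimiter to confuse them with.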

Language

Here are some examples of how the regular expression for matching URLs can be used in different programming languages:

JavaScript:

In this example, the regular expression is defined as a regex literal and passed to the match method of the url string. The match method returns an array containing the entire match followed by one entry per capturing group, or null if no match is found. The pattern has four capturing groups: the protocol prefix (including www. if present), the last hyphen- or dot-separated label matched by the repeated group, the port, and the path. A group that does not take part in the match is reported as undefined.

let regex =
  /^(http:\/\/www\.|https:\/\/www\.|http:\/\/|https:\/\/)?[a-z0-9]+([\-\.]{1}[a-z0-9]+)*\.[a-z]{2,5}(:[0-9]{1,5})?(\/.*)?$/;
let url = "https://www.example.com";
let match = url.match(regex);
console.log(match); // Output: ["https://www.example.com", "https://www.", undefined, undefined, undefined] (plus the index, input, and groups properties)

Python:

In this example, the regular expression is defined as a raw string; the r prefix keeps Python from interpreting the backslashes before they reach the regex engine. It is then passed to the match function of the re module. re.match only looks for a match at the beginning of the string (use re.search to find a match anywhere) and returns a match object, or None if there is no match. The match object has various methods and attributes that can be used to extract information about the match, such as the group() method, which returns the entire match, and the groups() method, which returns a tuple containing all the capturing groups.

import re

regex = r"^(http:\/\/www\.|https:\/\/www\.|http:\/\/|https:\/\/)?[a-z0-9]+([\-\.]{1}[a-z0-9]+)*\.[a-z]{2,5}(:[0-9]{1,5})?(\/.*)?$"
url = "https://www.example.com"
match = re.match(regex, url)
print(match.group())  # Output: https://www.example.com

Java:

In this example, we are using the compile method of the Pattern class to create a pattern object from the regular expression, and then using the matcher method to create a matcher object for the URL. The find() method looks for the next subsequence of the input that matches the pattern; because the pattern is anchored with ^ and $, a successful find() here means the whole URL matches (matcher.matches() would express the same whole-string check directly). If find() returns true, the URL is treated as valid.

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Main {
    public static void main(String[] args) {
        String regex = "^(http://www\\.|https://www\\.|http://|https://)?[a-z0-9]+([\\-.]{1}[a-z0-9]+)*\\.[a-z]{2,5}(:[0-9]{1,5})?(/.*)?$";
        String url = "https://www.example.com";
        Pattern pattern = Pattern.compile(regex);
        Matcher matcher = pattern.matcher(url);
        if(matcher.find()) {
            System.out.println("The URL is valid");
        } else {
            System.out.println("The URL is invalid");
        }
    }
}

Please note that these are just examples, and the actual use of this regular expression may vary depending on the specific use case.

Variations

The variations below are written as strings with doubled backslashes, as they would be passed to Java's Pattern.compile or JavaScript's RegExp constructor.

"^(http://|https://)?[a-z0-9]+([\\-.]{1}[a-z0-9]+)*\\.[a-z]{2,5}(:[0-9]{1,5})?(/.*)?$";

This variation accepts an optional http:// or https:// prefix and treats www. as an ordinary domain label instead of part of the prefix. Because the port group (:[0-9]{1,5})? and the path group (/.*)? are still present, it also matches URLs that include a port number after the domain name, and URLs whose path is followed by a query string or a fragment, since .* accepts any characters.

"^(www\\.)?[a-z0-9]+([\\-.]{1}[a-z0-9]+)*\\.[a-z]{2,5}(:[0-9]{1,5})?(/.*)?$";

This variation matches URLs without a protocol, with or without a leading www..

"^(http://www\\.|https://www\\.|http://|https://)?[a-z0-9]+([\\-.]{1}[a-z0-9]+)*\\.[a-z]{2,5}(/.*)?$";

This variation does not match URLs that include a port number after the domain name.
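A quick way to see how the variations differ is to run them against the same inputs. The sketch below writes three variations as regex literals, assuming each keeps the * quantifiers of the main pattern; the property names and sample URLs are made up for illustration.

```javascript
const variations = {
  // Optional http:// or https://, no special handling of www.
  protocolOnly:
    /^(http:\/\/|https:\/\/)?[a-z0-9]+([\-.]{1}[a-z0-9]+)*\.[a-z]{2,5}(:[0-9]{1,5})?(\/.*)?$/,
  // No protocol allowed, optional leading www.
  wwwOnly:
    /^(www\.)?[a-z0-9]+([\-.]{1}[a-z0-9]+)*\.[a-z]{2,5}(:[0-9]{1,5})?(\/.*)?$/,
  // Main pattern with the port group removed.
  noPort:
    /^(http:\/\/www\.|https:\/\/www\.|http:\/\/|https:\/\/)?[a-z0-9]+([\-.]{1}[a-z0-9]+)*\.[a-z]{2,5}(\/.*)?$/,
};

const samples = [
  "https://example.com:8080/path",
  "www.example.com",
  "example.com/path?x=1#top",
];

const table = {};
for (const [name, pattern] of Object.entries(variations)) {
  table[name] = samples.map((sample) => pattern.test(sample));
}
console.log(table);
// protocolOnly: [true, true, true]
// wwwOnly:      [false, true, true]
// noPort:       [false, true, true]
```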

Modifying Regex

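To match URLs that include a query string directly after the domain name (without a path), you can add an optional group for the query string after the path group. For example, replace the ending (\/.*)?$ with (\/.*)?(\?[^\s]*)?$. If the query string should be captured separately from the path, the path group must also stop at the question mark: (\/[^?\s]*)?(\?[^\s]*)?$

To match URLs that include a username and password, you can add a part that matches the credentials before the domain name. For example, ^(http:\/\/|https:\/\/)?[a-z0-9]+:[a-z0-9]+@[a-z0-9]+([\-\.]{1}[a-z0-9]+)*\.[a-z]{2,5}(:[0-9]{1,5})?(\/.*)?$ requires credentials such as user:pass@ before the host. Wrapping that part in an optional group, ([a-z0-9]+:[a-z0-9]+@)?, accepts URLs with or without credentials.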

Keep in mind that adding more complex matching rules will also increase the complexity of the regular expression and can negatively impact performance.
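As one possible modification, the sketch below splits the tail of the URL into separate path, query, and fragment groups and gives every group a name. The pattern and the group names are assumptions for illustration; it is not the article's pattern, and the compact https? prefix drops the special handling of www..

```javascript
// Hypothetical extended pattern with named capturing groups.
const detailed =
  /^(?<protocol>https?:\/\/)?(?<host>[a-z0-9]+(?:[\-\.][a-z0-9]+)*\.[a-z]{2,5})(?::(?<port>[0-9]{1,5}))?(?<path>\/[^?#\s]*)?(?<query>\?[^#\s]*)?(?<fragment>#\S*)?$/;

const match = "https://www.example.com:8080/a/b?x=1#top".match(detailed);
const { protocol, host, port, path, query, fragment } = match.groups;

console.log({ protocol, host, port, path, query, fragment });
// protocol "https://", host "www.example.com", port "8080",
// path "/a/b", query "?x=1", fragment "#top"

// A query string directly after the host is accepted as well.
console.log(detailed.test("example.com?x=1")); // true
```

Named groups make the match result self-describing, at the cost of a longer pattern.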

Use Case

Here are some examples of use cases where the regular expression for matching URLs can be useful.
Note: the examples in this section are written in JavaScript.

Validation of user input

When a user is asked to enter a URL in a form, you can use this regular expression to validate that the input is in the correct format before submitting the form.

let input = "https://www.example.com";
let regex = new RegExp(
  /^(http:\/\/www\.|https:\/\/www\.|http:\/\/|https:\/\/)?[a-z0-9]+([\-\.]{1}[a-z0-9]+)*\.[a-z]{2,5}(:[0-9]{1,5})?(\/.*)?$/
);
if (regex.test(input)) {
  console.log("valid url");
} else {
  console.log("invalid url");
}
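Form input often arrives with stray whitespace, which the ^ and $ anchors reject. A minimal sketch of a validation helper that trims first is shown here; isValidUrl is a hypothetical name, not an existing API.

```javascript
const urlPattern =
  /^(http:\/\/www\.|https:\/\/www\.|http:\/\/|https:\/\/)?[a-z0-9]+([\-\.]{1}[a-z0-9]+)*\.[a-z]{2,5}(:[0-9]{1,5})?(\/.*)?$/;

// Hypothetical helper: rejects non-strings and trims whitespace before testing.
function isValidUrl(input) {
  return typeof input === "string" && urlPattern.test(input.trim());
}

console.log(urlPattern.test(" https://www.example.com ")); // false: padded input
console.log(isValidUrl(" https://www.example.com ")); // true: trimmed first
console.log(isValidUrl(42)); // false: not a string
```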

Parsing and scraping web pages

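When you want to extract URLs from a web page, you can use an unanchored version of this regular expression with the g flag to find URLs inside the page's HTML code. The original pattern is anchored with ^ and $, so it only matches when the entire string is a URL.

let html = "<a href='https://www.example.com'>Example</a>";
// Unanchored, global variant that requires a protocol and stops the path
// at whitespace, quotes, and angle brackets.
let regex =
  /https?:\/\/[a-z0-9]+(?:[\-\.][a-z0-9]+)*\.[a-z]{2,5}(?::[0-9]{1,5})?(?:\/[^\s'"<>]*)?/g;
let matches = html.match(regex) || []; // match() returns null when nothing is found
console.log(matches); // Output: ["https://www.example.com"]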

Checking links in a website

When you want to check that all the links in a website are working, you can use an unanchored version of this regular expression to extract URLs from the page's HTML code and then check if these URLs are accessible.

let html = "<a href='https://www.example.com'>Example</a>";
// Unanchored, global variant so URLs embedded in the markup are found.
let regex =
  /https?:\/\/[a-z0-9]+(?:[\-\.][a-z0-9]+)*\.[a-z]{2,5}(?::[0-9]{1,5})?(?:\/[^\s'"<>]*)?/g;
let matches = html.match(regex) || []; // match() returns null when nothing is found
for (const url of matches) {
  fetch(url)
    .then((response) => {
      if (response.ok) {
        console.log(url + " is working.");
      } else {
        console.log(url + " is not working.");
      }
    })
    .catch(() => {
      // Network-level failures reject the promise instead of returning a response.
      console.log(url + " could not be reached.");
    });
}

Log analysis

When you want to analyze logs from a web server, you can use an unanchored version of this regular expression to extract URLs from the logs and analyze the data for trends or patterns. The anchored pattern would return null here, because each log line contains more than just the URL.

let logs =
  "https://www.example.com accessed by user 123 at 2021-01-01T00:00:00.000Z";
// Unanchored, global variant that requires a protocol.
let regex =
  /https?:\/\/[a-z0-9]+(?:[\-\.][a-z0-9]+)*\.[a-z]{2,5}(?::[0-9]{1,5})?(?:\/[^\s'"<>]*)?/g;
let matches = logs.match(regex) || [];
console.log(matches); // Output: ["https://www.example.com"]

Searching for specific URLs

When you want to search for specific URLs in a large dataset, you can use an unanchored version of this regular expression to extract all URLs and then keep the ones that contain a search term.

let data =
  "https://www.example.com https://www.example2.com https://www.example3.com";
// Unanchored, global variant that requires a protocol.
let regex =
  /https?:\/\/[a-z0-9]+(?:[\-\.][a-z0-9]+)*\.[a-z]{2,5}(?::[0-9]{1,5})?(?:\/[^\s'"<>]*)?/g;
let searchTerm = "example2";
// Extract every URL first, then filter by the search term.
let matches = (data.match(regex) || []).filter((url) => url.includes(searchTerm));
console.log(matches); // Output: ["https://www.example2.com"]

URL Shortening service

When you want to create a URL shortening service, you can use this regular expression to validate the original URL before shortening it.

let url = "https://www.example.com";
let regex =
  /^(http:\/\/www\.|https:\/\/www\.|http:\/\/|https:\/\/)?[a-z0-9]+([\-\.]{1}[a-z0-9]+)*\.[a-z]{2,5}(:[0-9]{1,5})?(\/.*)?$/;

// In-memory stand-in for a real database (placeholder for illustration).
const database = new Map();

if (regex.test(url)) {
  // Valid URL: shorten it and store the mapping
  let shortenedURL = generateShortenedURL(url);
  storeURL(url, shortenedURL);
  console.log("Shortened URL: " + shortenedURL);
} else {
  console.log("Invalid URL");
}

function createUniqueID() {
  // Placeholder ID generator; a real service would have to guarantee uniqueness.
  return Math.random().toString(36).slice(2, 8);
}

function generateShortenedURL(url) {
  // Create an identifier for the URL
  let uniqueID = createUniqueID();
  // Return the shortened URL
  return "https://short.com/" + uniqueID;
}

function storeURL(url, shortenedURL) {
  // Store the original URL under its shortened counterpart
  database.set(shortenedURL, url);
}

Identifying spammy links

When you want to identify and remove spammy links from a website, you can use an unanchored version of this regular expression to extract URLs from the page's HTML code and check if their domains are on a list of domains associated with spam.

let html = "<a href='https://www.spam.com'>Spam</a>";
// Unanchored, global variant so URLs embedded in the markup are found.
let regex =
  /https?:\/\/[a-z0-9]+(?:[\-\.][a-z0-9]+)*\.[a-z]{2,5}(?::[0-9]{1,5})?(?:\/[^\s'"<>]*)?/g;
let matches = html.match(regex) || []; // match() returns null when nothing is found
let spamDomains = ["spam.com"];
for (const url of matches) {
  // Compare the hostname, minus a leading "www.", against the list.
  let host = new URL(url).hostname.replace(/^www\./, "");
  if (spamDomains.includes(host)) {
    console.log(url + " is a spammy link and should be removed.");
  } else {
    console.log(url + " is not a spammy link.");
  }
}

Performance

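For typical inputs, the matching time of the regular expression grows roughly linearly with the length of the input string, O(n). Each repetition of the label group must start with a hyphen or a dot, so there is only one way to split a domain name into labels and the engine does little backtracking. Note that JavaScript, Python, and Java use backtracking regex engines, which do not guarantee linear time for every pattern.

The memory use of the regular expression is small: apart from the input string, the engine only keeps the capturing groups and its backtracking state.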

The performance of the regular expression should be sufficient for most use cases. However, if you need to match a very large number of URLs, you may want to consider using a more efficient algorithm or a specialized library.
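As one example of an alternative to a hand-written pattern, JavaScript runtimes ship a built-in URL parser. The sketch below wraps it in a hypothetical helper named isHttpUrl. It accepts a different set of inputs than the regex: a scheme is required, so example.com on its own is rejected.

```javascript
// Hypothetical helper built on the standard URL constructor.
function isHttpUrl(input) {
  try {
    const parsed = new URL(input); // throws a TypeError for unparsable input
    return parsed.protocol === "http:" || parsed.protocol === "https:";
  } catch {
    return false;
  }
}

console.log(isHttpUrl("https://www.example.com/path#frag")); // true
console.log(isHttpUrl("example.com")); // false: no scheme, so the constructor throws
console.log(isHttpUrl("ftp://example.com")); // false: parses, but not http(s)
```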

Testing

To test the regular expression, you can use a tool such as Regex101 (https://regex101.com/) or Rubular (https://rubular.com/). Simply copy and paste the regular expression into the tool and test it against a variety of URLs.

Test Cases that should match:

http://www.example.com
https://www.example.com
http://example.com
https://example.com
example.com
example.com/path
example.com:8080/path
www.example.com
example.com/path#fragment

Test Cases that should not match:

example
example.c
example.com#fragment
http://
https://
http://sub.domain.
http://sub.domain
http://sub.
http://sub.#
http://sub.#fragment
http://sub.domain#fragment

Note that www.example.com matches because www is treated as an ordinary label, and example.com/path#fragment matches because the path group (\/.*)? accepts any characters after the first slash. A fragment directly after the domain name, as in example.com#fragment, is rejected.

It's important to note that the regular expression is not perfect and may not match all valid URLs or may match some invalid URLs. It's always a good idea to test the regular expression against a variety of test cases to ensure it behaves as expected.
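Test cases like these can also be checked in code. The following is a minimal table-driven sketch; the variable names are made up, and the two lists contain inputs whose results follow directly from the pattern.

```javascript
const urlPattern =
  /^(http:\/\/www\.|https:\/\/www\.|http:\/\/|https:\/\/)?[a-z0-9]+([\-\.]{1}[a-z0-9]+)*\.[a-z]{2,5}(:[0-9]{1,5})?(\/.*)?$/;

const shouldMatch = [
  "http://www.example.com",
  "https://www.example.com",
  "http://example.com",
  "https://example.com",
  "example.com",
  "example.com/path",
  "example.com:8080/path",
  "www.example.com", // "www" is just another label
  "example.com/path#fragment", // ".*" in the path group accepts "#"
];

const shouldNotMatch = [
  "example",
  "example.c",
  "example.com#fragment", // fragment without a preceding slash
  "http://",
  "https://",
  "http://sub.domain.",
  "http://sub.domain", // "domain" has 6 letters, more than {2,5} allows
  "http://sub.",
];

const failures = [];
for (const url of shouldMatch) {
  if (!urlPattern.test(url)) failures.push("expected a match: " + url);
}
for (const url of shouldNotMatch) {
  if (urlPattern.test(url)) failures.push("expected no match: " + url);
}

console.log(failures.length === 0 ? "all cases passed" : failures);
// prints "all cases passed"
```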

Common Mistakes

One common mistake is not accounting for all possible variations of a URL. For example, this regular expression only accepts a query string or fragment after a path: example.com/#fragment matches, but example.com#fragment and example.com?q=1 do not.
Another common mistake is assuming that all URLs have a www. subdomain. This regular expression accounts for URLs with or without the www. subdomain.
To avoid these mistakes, it is important to clearly define the requirements for the URLs that need to be matched and to test the regular expression against a variety of test cases.

Comparing Regex

This regular expression is a relatively simple and widely used kind of pattern for matching URLs.

Example: https://www.example.com/ or https://subdomain.example.com/path/to/page.html

It has the advantage of being easy to understand and modify.

Example: https?://[a-z0-9\-\.]+(\.[a-z]{2,3}){1,2}(/.*)?
can be modified to match URLs that include specific paths or query parameters, such as https://www.example.com/path/to/page.html?query=parameter.

Other regular expressions for matching URLs can be more complex and harder to understand, but may provide more accurate or comprehensive matching.

Example: https?://([a-z0-9\-\.]+\.)*[a-z0-9\-]+\.[a-z]{2,3}(:[0-9]+)?(/.*)?
is a more complex regular expression that can match URLs with multiple subdomains, port numbers, and various top-level domains.

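It's also important to note that this regular expression does not cover all possible variations of URLs. For example, it does not match a URL whose fragment # directly follows the domain name, and it rejects uppercase letters in the domain name and top-level domains longer than five letters.

Example: https://www.example.com#fragment
is a URL that includes a fragment directly after the domain name, and would not be matched by the provided regular expression. By contrast, https://www.example.com/path/to/page.html#fragment is matched, because the path group accepts any characters after the first slash.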
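The three patterns discussed in this section can be compared side by side. In the sketch below, ^ and $ are added to the two alternative patterns so that all three are compared on whole-string matching; that anchoring, the property names, and the sample URLs are assumptions made for illustration.

```javascript
const patterns = {
  // The pattern this article is about.
  article:
    /^(http:\/\/www\.|https:\/\/www\.|http:\/\/|https:\/\/)?[a-z0-9]+([\-\.]{1}[a-z0-9]+)*\.[a-z]{2,5}(:[0-9]{1,5})?(\/.*)?$/,
  // The simpler alternative, anchored for this comparison.
  simple: /^https?:\/\/[a-z0-9\-\.]+(\.[a-z]{2,3}){1,2}(\/.*)?$/,
  // The more complex alternative, anchored for this comparison.
  complex:
    /^https?:\/\/([a-z0-9\-\.]+\.)*[a-z0-9\-]+\.[a-z]{2,3}(:[0-9]+)?(\/.*)?$/,
};

const samples = [
  "example.com",
  "https://example.com:8080/path",
  "https://www.example.co.uk",
];

const comparison = {};
for (const url of samples) {
  comparison[url] = Object.fromEntries(
    Object.entries(patterns).map(([name, pattern]) => [name, pattern.test(url)])
  );
}
console.log(comparison);
// example.com:                   article true, simple false, complex false
// https://example.com:8080/path: article true, simple false, complex true
// https://www.example.co.uk:     article true, simple true,  complex true
```

Only the article's pattern accepts a URL without a protocol, and the simpler alternative has no port group.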

Summary

This .md provides an overview of regular expressions and their usage in matching URLs. It covers different components of regular expressions such as anchors, quantifiers, grouping constructs, bracket expressions, character classes, the OR operator, flags, and character escapes. It also provides examples of how the regular expression for matching URLs can be used in different programming languages such as JavaScript, Python, and Java. Finally, it lists some use cases where the regular expression for matching URLs can be used such as validating user input of URLs, parsing URLs from text, and more.

Author

A student at eDx FULL-STACK-FLEX program through UC BERKELEY.
Follow my progress on GITHUB https://github.com/jbxamora
Visit my portfolio! https://jbxamora.github.io/get2knowjorge/

I would appreciate any feedback constructive or not. Thank you for reading.
