DEV Community

Michael D. Stemle, Jr.

String Manipulation of URLs is an Anti-Pattern.

Quick note before we get started: the examples in this piece are Node-centric, but this anti-pattern is polyglot. As with most anti-patterns, this isn't about the syntax; it is about the approach.

What's a URL, really?

A URL is a useful thing. It tells both humans and software where to find resources on the internet. There's a lot of information packed into a URL, from the protocol scheme to the document anchor, and when we treat it like a plain string we're steering into danger.

A URL is a packed value. It contains an awful lot of data:

  • Protocol scheme
  • Host name
  • Port number
  • Path
  • File name
  • Search parameters (a.k.a. query string parameters)
  • Anchor (which can also be used for parameters)
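
As a rough illustration of that list, here is how Node's built-in WHATWG URL class unpacks a URL into its parts. The address is a made-up example, not one taken from this article.

```javascript
// Hypothetical URL, chosen so that every component in the list is present.
const parts = new URL('https://example.com:8443/docs/guide.html?lang=en&page=2#install');

console.log(parts.protocol);                 // scheme, including the trailing colon
console.log(parts.hostname);                 // host name
console.log(parts.port);                     // port number, as a string
console.log(parts.pathname);                 // path, including the file name
console.log(parts.searchParams.get('lang')); // one search parameter
console.log(parts.hash);                     // anchor, including the leading '#'
```

Each piece comes back as its own property, so nothing has to be cut out of the string by hand.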

The Problem with String Manipulation

Depending on your needs, a URL may contain several reserved characters, including ?, #, =, &, %, :, and /. This is not an exhaustive list. A reserved character in the wrong place can change how the URL is parsed.

A good implementation should be flexible enough to deal with any reasonable inputs, and capable of failing predictably when inputs are not reasonable. Packed values, like a URL, need to be treated like packed values, and not handled using string manipulation.
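
A minimal sketch of what goes wrong when a URL is pasted into another URL with string concatenation. Both addresses are hypothetical: a search link with q and restrict_sr parameters, and a made-up bookmarking endpoint that takes the link in a url parameter.

```javascript
// Hypothetical values: a search link to bookmark, and a made-up
// bookmarking endpoint that accepts it in a `url` parameter.
const bookmarkUrl = 'https://example.org/search/?q=url&restrict_sr=1';
const naiveUrl = 'https://example.com/bookmark?url=' + bookmarkUrl;

// Parse the result the way a receiving server would.
const naiveParams = new URL(naiveUrl).searchParams;

console.log(naiveParams.get('url'));         // the nested URL, cut off at the first '&'
console.log(naiveParams.get('restrict_sr')); // has leaked out as a top-level parameter
```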

When a URL that carries its own query string is concatenated into another URL as the value of a url parameter, its q parameter is seen as part of the nested URL, but restrict_sr is interpreted as another parameter parallel to url. While it may be tempting to simply use a function to URL-encode the value, I would like to encourage you to reconsider. Many URL encoding functions don't escape every character that could end up in a parameter value, and they tend to make assumptions about their input that won't always hold.
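
To make that concrete, here is a sketch of one way encoding can miss. It assumes the function reached for is JavaScript's built-in encodeURI, which is picked here purely for illustration, and the URLs are hypothetical. encodeURI is meant for whole URLs, so it leaves reserved characters such as ?, = and & untouched.

```javascript
// Hypothetical nested URL and bookmarking endpoint, as an illustration only.
const nestedUrl = 'https://example.org/search/?q=url&restrict_sr=1';

// encodeURI() targets complete URLs, so '?', '=' and '&' pass through unchanged.
const encodedWhole = encodeURI(nestedUrl);
const stillBroken = new URL('https://example.com/bookmark?url=' + encodedWhole);

console.log(encodedWhole === nestedUrl);                  // true: nothing was escaped
console.log(stillBroken.searchParams.get('restrict_sr')); // '1': still a separate parameter
```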

A Better Approach

Encoding the URL didn't solve the problem. Let's try a different approach: let's use the URL API.
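
A minimal sketch of that approach, using the WHATWG URL class that Node exposes as a global. The nested URL and the bookmarking endpoint are hypothetical stand-ins.

```javascript
// Hypothetical nested URL and bookmarking endpoint, as an illustration only.
const targetUrl = 'https://example.org/search/?q=url&restrict_sr=1';

const bookmarkEndpoint = new URL('https://example.com/bookmark');
// searchParams.set() escapes the value, so its '?', '=' and '&' stay inside `url`.
bookmarkEndpoint.searchParams.set('url', targetUrl);

console.log(bookmarkEndpoint.href);
console.log(bookmarkEndpoint.searchParams.get('url') === targetUrl); // true
console.log(bookmarkEndpoint.searchParams.has('restrict_sr'));       // false
```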

By using the URL API here, you can see that the URL which is being used as a parameter is safely tucked away, and you don't have to worry about it being confused.

Why does this matter?

The two primary problems caused by string manipulation of URLs are bugs and URL injection vulnerabilities.

Poorly-encoded URLs make it difficult for web servers and applications to understand the parameters coming to them. If they cannot reliably understand their inputs, there may be unexpected or unwanted behavior.

URLs which are constructed using predictable string manipulation also pose a very real risk of URL injection. URL injection can lead to SQL injection, NoSQL injection, cross-site scripting (XSS), and a whole host of other security holes.
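
A small sketch of the injection risk, under assumed names: the endpoint, the id parameter, and the attacker's input are all hypothetical.

```javascript
// Hypothetical attacker-controlled input and a made-up API endpoint.
const userInput = '42&role=admin';

// String manipulation: the input can add parameters of its own.
const injected = new URL('https://api.example.com/items?id=' + userInput);
console.log(injected.searchParams.get('role')); // 'admin': a parameter the code never set

// URL API: the whole input stays confined to the `id` value.
const contained = new URL('https://api.example.com/items');
contained.searchParams.set('id', userInput);
console.log(contained.searchParams.get('role')); // null
console.log(contained.searchParams.get('id'));   // '42&role=admin'
```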

Conclusion

A URL isn't a string. Much like the packed bit fields of yore, it is a packed value. Don't treat it like a string; treat it like a first-class object or structure. And never write your own URL parser: nearly every mainstream language has a good URL library that you can use.

Discussion (7)

Steve Blundy

This is a very common problem. It’s good of you to call it out here. String URL manipulation is a subset of a broader anti-pattern called “string obsession” or “primitive obsession”. File system paths are another thing you do not want to be manipulating as strings. Any kind of structured or encoded data, really.

hidden_dude

I'd generalize this and say that whenever you have a structured string like:

  • urls
  • emails
  • phone numbers
  • SQL queries
  • Mailing Addresses
  • ???

that you need to extract info from or inject info into (not just store or copy), then you need some sort of Builder pattern or Editor pattern to do so. You can use a third-party library, or build your own if one doesn't exist.

But the advantage of treating it as a library is that it can evolve over time and your code isn't riddled with N half baked implementations.

About 16 years ago I was presented with a situation in which we had a major application that was building SQL statements on the fly. Creating my own SQLBuilder made the code far more maintainable, since so many parts of the code were editing SQL in different ways.

SQL editing has since fallen out of favor, and ORMs basically provide that function now. But the principle remains.

Michael D. Stemle, Jr. Author

That's a good point!

Keff

I really liked this. I have done this many times, but I've recently started using the good approach, not because I knew it was a bad practice or anti-pattern, but because it makes my life a lot easier and the code cleaner, IMO. The hard work is done for you!

Michael D. Stemle, Jr. Author • Edited

Yeah, there are a lot of areas of programming where the problems have already been solved for us; we need only take advantage of those existing solutions.

Other problem domains which fit into this category (not an exhaustive list), IMO, are time zones and localization. I can't tell you how many times I've seen people try to roll their own solutions in those areas when there have been excellent (and portable) solutions in place for decades.

coderdenver9 • Edited

Instead of using a third party library, you could've used encodeURIComponent(bookmark_url), which behaves correctly.

Michael D. Stemle, Jr. Author • Edited

There are two reasons I used the third-party library:

  1. Most folks that I have encountered do use a library.
  2. Even if the encoding function does work in that use case, the anti-pattern remains.

I was trying to keep this article short, but I could have also gone into how often folks will try to do stuff like:

some_url = `${first_url}${(source_url.indexOf('?') > -1) ? '&' : '?'}${param_list.join('&')}`

Those are all really bad practices. The only real solution is to use URL libraries which treat a URL as the packed value that it is.
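
For comparison, a sketch of how the same job (adding parameters to a URL that may or may not already have a query string) might look with Node's built-in URL class. withParams is a hypothetical helper name, and the URLs are made up for the example.

```javascript
// Hypothetical helper: add parameters to a URL whether or not it
// already has a query string. No '?' versus '&' bookkeeping is needed.
function withParams(baseUrl, params) {
  const result = new URL(baseUrl);
  for (const [name, value] of Object.entries(params)) {
    result.searchParams.append(name, value); // values are escaped automatically
  }
  return result.href;
}

console.log(withParams('https://example.com/list?page=2', { sort: 'name', q: 'a&b' }));
console.log(withParams('https://example.com/list', { sort: 'name' }));
```

The URL object decides where the ? and & separators go, and it escapes the & inside the q value so it cannot be mistaken for one.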