Be Careful with String’s Substring Method in Java

By Jeremy Grifski (renegadecoder94). Originally published at therenegadecoder.com.

Every once in a while, I’ll come across a well-established library in a programming language that has its quirks. As an instructor, I have to make sure I’m aware of these quirks when I’m teaching. For instance, last time I talked a bit about the various Scanner input methods and how they don’t all behave the same way. Well, today I want to talk about the substring method from Java’s String library.

Documentation

When using a library for the first time, I find it useful to check out the documentation. But with a library so established, it sometimes feels silly to dig into the documentation. After all, a lot of languages support strings. Personally, all I need to know is the name of the command before I can figure out the rest.

However, every once in a while, I’ll come across a function that is less intuitive than I thought. In this case, I’m talking about Java’s substring method. As you can probably imagine, it grabs a substring from a string and returns it. So, what’s the catch?

Well for starters, the substring method is actually an overloaded method. As a result, there are two different forms of the same method in the documentation. Take a look:

public String substring(int beginIndex)

Returns a new string that is a substring of this string. The substring begins with the character at the specified index and extends to the end of this string.

Java API, 2019

public String substring(int beginIndex, int endIndex)

Returns a new string that is a substring of this string. The substring begins at the specified beginIndex and extends to the character at index endIndex - 1. Thus the length of the substring is endIndex-beginIndex.

Java API, 2019

At this point, don’t fixate too much on their descriptions as we’ll get to those. Just be aware that there are two different versions of the same method.

Usage

At this point, I’d like to take a moment to show how to use the substring method. If this is your first time poking around the Java API, this would be a good time to follow along.

First, notice that the method header does not contain the static keyword. In other words, substring is an instance method, which makes sense. We need an instance of a string in order to get a substring:

String str = "Hello, World!";
String subOne = str.substring(7);
String subTwo = str.substring(0, 5);

In this example, we’ve created two new substrings: one from position 7 to the end and the other from position 0 to position 5. Without looking at the documentation, can you figure out what the resulting strings will be?

Interval Notation

Before I give away the answer, I think it’s important to discuss some terminology from mathematics. In particular, I’d like to talk a bit about interval notation.

In interval notation, the goal is to explicitly state the range of some subset. For instance, we may be interested in all integers greater than 0. In interval notation, that would look something like:

(0, +∞)

In this example, we’ve chosen to exclude the value of 0 from the range using parentheses. We could have just as easily defined the interval starting with 1—pay attention to the brackets:

[1, +∞)

In either case, we’re describing the same set: all integers greater than 0.

So, how does this tie into the substring method? As it turns out, a substring is a subset of a string, so we can use interval notation to define our substring. Why don’t we try a couple examples? Given “Hello, World!”, determine the substring using the following intervals:

  • [0, 2]
  • (0, 5]
  • (1, 3)
  • (-1, 7]

Once you’re done, check out the answers below:

  • “Hel”
  • “ello,”
  • “l”
  • “Hello, W”

We’ll need to keep this idea in the back of our mind moving forward.
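As a rough sketch of how those four intervals pick out characters, here's a small helper that walks the indices one at a time. The class name IntervalDemo and the method slice are made up for illustration; they are not part of the Java API.

```java
// Hypothetical helper (not part of the Java API): selects the characters
// of a string whose indices fall inside a mathematical interval.
class IntervalDemo {
    static String slice(String s, int lo, boolean loInclusive,
                        int hi, boolean hiInclusive) {
        // Convert each bound to the first and last index actually included.
        int first = loInclusive ? lo : lo + 1;
        int last = hiInclusive ? hi : hi - 1;
        StringBuilder result = new StringBuilder();
        for (int i = first; i <= last; i++) {
            result.append(s.charAt(i));
        }
        return result.toString();
    }

    public static void main(String[] args) {
        String str = "Hello, World!";
        System.out.println(slice(str, 0, true, 2, true));    // [0, 2]  prints Hel
        System.out.println(slice(str, 0, false, 5, true));   // (0, 5]  prints ello,
        System.out.println(slice(str, 1, false, 3, false));  // (1, 3)  prints l
        System.out.println(slice(str, -1, false, 7, true));  // (-1, 7] prints Hello, W
    }
}
```

Notice that an open bound simply shifts the first or last included index by one, which is why (-1, 7] and [0, 7] describe the same characters.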

The Truth

The truth of the matter is the substring method is a bit weird. On one hand, we can use a single index to specify the starting point of our new substring. On the other hand, we can use two indices to grab an arbitrary subset of a string.

However, in practice, I find that the second option gives a lot of students trouble, and I don’t blame them. After all, the bounds are deceptive. For example, let’s revisit some code from above:

String str = "Hello, World!";
String subOne = str.substring(7);
String subTwo = str.substring(0, 5);

Here, we can confidently predict that subOne has a value of “World!”, and we’d be right. After all, index 7 is ‘W’, and the method automatically grabs the rest of the string.

As for subTwo, we’d probably guess “Hello,”, and we’d be incorrect. It’s actually “Hello” because the end index is exclusive (i.e. [0, 5) ). In the next section, we’ll take a look at why that is and how I feel about it.
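If you'd like to verify those two results yourself, a minimal sketch like this one should do it. The class name SubstringCheck is just a placeholder; the only library calls are String's own substring and charAt.

```java
// Minimal check of the two substring overloads on "Hello, World!".
class SubstringCheck {
    public static void main(String[] args) {
        String str = "Hello, World!";

        String subOne = str.substring(7);     // [7, length): index 7 through the end
        String subTwo = str.substring(0, 5);  // [0, 5): indices 0 through 4

        System.out.println(subOne);           // prints World!
        System.out.println(subTwo);           // prints Hello

        // The comma lives at index 5, so including it means ending at 6.
        System.out.println(str.charAt(5));        // prints ,
        System.out.println(str.substring(0, 6));  // prints Hello,
    }
}
```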

My Take

From what I understand, the inclusive/exclusive model is the standard for ranges in the Java API. That said, I do occasionally question the design choice.

On one hand, there’s the advantage of being able to use the length of the string as the end point of the substring:

String jokerQuote = "Madness, as you know, is like gravity, all it takes is a little push.";
String newtonTheory = jokerQuote.substring(30, jokerQuote.length());

But, is this really necessary? Java already provides an overload to the substring method which captures exactly this behavior.
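To make that overlap concrete, here's a small sketch comparing the two calls. The class OverloadEquivalence and the method sameSuffix are hypothetical names I'm using for illustration.

```java
// Sketch: the one-argument overload and the two-argument overload with
// length() as the end index should produce equal strings.
class OverloadEquivalence {
    static boolean sameSuffix(String s, int beginIndex) {
        String viaOneArg = s.substring(beginIndex);
        String viaTwoArgs = s.substring(beginIndex, s.length());
        return viaOneArg.equals(viaTwoArgs);
    }

    public static void main(String[] args) {
        String jokerQuote = "Madness, as you know, is like gravity, all it takes is a little push.";
        System.out.println(jokerQuote.substring(30));   // prints gravity, all it takes is a little push.
        System.out.println(sameSuffix(jokerQuote, 30)); // prints true
    }
}
```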

That said, there is a nice mathematical explanation for this notation, and part of it has to do with the difference between the starting and ending points. In particular, we get the length of the new substring:

int length = endIndex - beginIndex;

In addition, this particular notation allows adjacent substrings to share a midpoint:

String s = "Luck is great, but most of life is hard work.";
String whole = s.substring(0, s.length()/2) + s.substring(s.length()/2, s.length());
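Here's a quick sketch that checks both the length rule and the shared split point on that same string. The class HalfOpenProperties and the method splitAt are names I made up for this example.

```java
// Sketch: with a half-open range, lengths are simple differences and
// adjacent pieces reuse the same index as end and start.
class HalfOpenProperties {
    // Splits s into [0, mid) and [mid, length).
    static String[] splitAt(String s, int mid) {
        return new String[] { s.substring(0, mid), s.substring(mid, s.length()) };
    }

    public static void main(String[] args) {
        String s = "Luck is great, but most of life is hard work.";
        int mid = s.length() / 2;
        String[] halves = splitAt(s, mid);

        // Length of each piece is endIndex - beginIndex.
        System.out.println(halves[0].length() == mid - 0);           // prints true
        System.out.println(halves[1].length() == s.length() - mid);  // prints true

        // No character is dropped or duplicated at the split point.
        System.out.println((halves[0] + halves[1]).equals(s));       // prints true
    }
}
```

With an inclusive end index, the second call would need to start at the midpoint plus one to avoid repeating a character.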

Both of these properties are nice, but I think they're likely a byproduct of indexing by zero (perpetuated by Dijkstra), which isn't all that intuitive either. And for those of you who are going to take exception to that comment, be aware that I'm all for indexing by zero and this inclusive/exclusive subset convention.

All I'm trying to say is that I've seen my own students get tripped up over both conventions, so I feel for them in a way. That's why I went to such lengths to write this article in the first place.

Let me know if you feel the same or if I’m totally off base. Otherwise, thanks for taking some time to read my work. I hope you enjoyed it!


Discussion

I find it completely normal. Imagine the following: you want to get 3 characters counting from index 5. That means you want from 5 to 5+3=8.
Also, in many other languages you either specify the length of the substring or follow the rule explained here. Other than that, you usually write for loops as follows:
for (int i=0; i<3; i++), and you already know that i will never be 3.

 

I totally agree! Both of your examples make perfect sense for people who have coded for a bit. After all, we've all agreed that indices start from 0 (perpetuated by Dijkstra), but that's not intuitive for new folks either.

EDIT: I should clarify that we don't all agree on indexing from 0, but I'd argue that all of the currently dominant languages index from 0.

 

We have not agreed to that at all. As in the Dijkstra note you linked, ALGOL and Pascal indices start at 1. This is also the case with XSLT/XPath/XQuery.

I don't understand why you are trying to blame index-0 on Dijkstra. His note was written in 1982, years after languages like C were defined. Dijkstra did not set the rule on where to start. Just because he voiced his reasoned opinion on a subject does not make him to blame for what others did.

Besides, the natural world also starts at 0. When you are born you are in your first year, which is from 0 to 1. This confusing problem is everywhere, not just in programming languages.

“Should array indices start at 0 or 1? My compromise of 0.5 was rejected without, I thought, proper consideration.” - Stan Kelly-Bootle

Again, I agree all the way that this problem is confusing! The entire point of this article is that certain conventions are not always intuitive. That doesn't mean they're bad. It just means there should be a good reason for them.

Also, I'm not saying that Dijkstra is the reason for indexing from 0, but he's clearly made the strongest case for it. There's been less time between the first programming language and what Dijkstra said (24 years) than what he said and today (37 years). He's had an incredible influence on the field in the last 40 years.

 

Starting with index 1 is meaningful only in cases where you're not doing any computation on the index value itself... but then you could start from 14 too without much of a problem.

If the actual numeric value of the index is used for computations, then you will find that in most cases the correct index for the first element is 0 (e.g. polynomial terms where the index is also used as the exponent).

So when it doesn't matter you can use 1 (or 2 or 42), but when it does you really need to use 0. Then why not always use 0?

The number 0 was a great discovery. If you're teaching, please don't cripple your students' brains by making them think in Roman numerals; you're not helping them.

 

Hey Andrea! I appreciate the comment, but I don't find it overly constructive. Is it necessary to put my teaching abilities into question?

I made no mention of a preference for indexing by 1 anywhere in this article or the comments below. All I said was that my students struggle with the substring method in Java, and I tried to put myself in their shoes.

 

I'm not questioning your teaching abilities, only noting that (judging from the tone) you apparently dislike the idea of indexing from 0.

This is unfortunately not uncommon (see the Lua language, for example), but the fact remains that indexing from 1 is a bad idea and EWD was right on this.

Sometimes it helps to think of indexes as being "between" elements...

content     H   e   l   l   o   ,       W   o   r   l   d   .
          |___|___|___|___|___|___|___|___|___|___|___|___|___|
index     0   1   2   3   4   5   6   7   8   9  10  11  12  13

so the interval 3-5 is clearly "lo" and the number of elements included in the interval a, b is b - a.

This way of thinking greatly simplifies reasoning, for example when implementing binary search or raster graphics algorithms.
This mental model is equivalent to [a, b[, but (maybe) easier to understand and remember.

x[i] is just the element between i and i+1.

Some APIs solve the problem of substring-like interfaces by relying on start/size instead of start/end (this is what Qt does in many cases, for example). Unfortunately, the same Qt framework made the wrong choice of using "boundary-inclusive" intervals when implementing the right() and bottom() methods for integer rectangles, a poor choice that makes it hard to write pixel-perfect code and forces a lot of +1 and -1 in code. Floating point rectangles are OK, and the mistake in the integer case is acknowledged in the documentation (but unfortunately cannot be removed for backward compatibility reasons).

I thought you were asking for comments, so I commented.

I think the view that substring in Java has a "quirk" because it uses the [a, b[ convention for intervals is questionable.

That "semi-open interval" is in my opinion the correct approach (maybe on par with, or second only to, a start/size approach). A "boundary included" [a, b] would instead be worse for many reasons.

Java has no "quirk" here: it's the correct thing to do (and please note that I'm surely NOT a Java fan, at all).