DEV Community

SomeB1oody

[Rust Guide] 8.4. String Type Pt.2 - Bytes, Scalar Values, Grapheme Clusters, and String Operations

8.4.0. Chapter Overview

Chapter 8 is mainly about common collections in Rust. Rust provides many collection-like data structures that can hold multiple values. However, the collections covered in Chapter 8 differ from arrays and tuples.

The data in these collections is stored on the heap rather than on the stack. That means their size does not need to be known at compile time; they can grow or shrink dynamically at runtime.

This chapter focuses on three collections: Vector, String (this article), and HashMap.

If you find this helpful, please like, bookmark, and follow. To keep learning along, follow this series.

8.4.1. You Cannot Use Indexing to Access String

Unlike strings in many other languages, a Rust String cannot be indexed with an integer. Example:

fn main() {  
    let s = String::from("6657 up up");  
    let a = s[0];  
}

Output:

error[E0277]: the type `str` cannot be indexed by `{integer}`
 --> src/main.rs:3:15
  |
3 |     let a = s[0];
  |               ^ string indices are ranges of `usize`
  |
  = help: the trait `SliceIndex<str>` is not implemented for `{integer}`, which is required by `String: Index<_>`
  = note: you can use `.chars().nth()` or `.bytes().nth()`
          for more information, see chapter 8 in The Book: <https://doc.rust-lang.org/book/ch08-02-strings.html#indexing-into-strings>
  = help: the trait `SliceIndex<[_]>` is implemented for `usize`
  = help: for that trait implementation, expected `[_]`, found `str`
  = note: required for `String` to implement `Index<{integer}>`

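The error says that the str type cannot be indexed with an integer. Looking further down at the first = help line, we can see that SliceIndex<str> is not implemented for {integer}, which String would need in order to implement Index<{integer}>. The = note line suggests using .chars().nth() or .bytes().nth() instead.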
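
The two alternatives named in the compiler's note can be sketched as follows. This is a minimal illustration, and the helper names `char_at` and `byte_at` are hypothetical, not part of the standard library:

```rust
// Illustrative helpers; the names `char_at` and `byte_at` are hypothetical.

/// Returns the n-th Unicode scalar value, walking the string from the start.
fn char_at(s: &str, n: usize) -> Option<char> {
    s.chars().nth(n)
}

/// Returns the n-th raw byte of the UTF-8 encoding.
fn byte_at(s: &str, n: usize) -> Option<u8> {
    s.bytes().nth(n)
}

fn main() {
    let s = String::from("6657 up up");
    println!("{:?}", char_at(&s, 0));   // Some('6')
    println!("{:?}", byte_at(&s, 0));   // Some(54), the ASCII code of '6'
    println!("{:?}", char_at(&s, 100)); // None: out of range, no panic
}
```

Both calls return an Option, so an out-of-range position yields None instead of a panic.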

8.4.2. Internal Representation of String

String is a wrapper around Vec<u8>, where u8 is a single byte. The len() method on String returns the length of the string in bytes. Example:

fn main() {  
    let len = String::from("Niko").len();  
    println!("{}", len);  
}

Output:

4

This string uses UTF-8 encoding, and len is 4, which means the string occupies 4 bytes. So in this example, each letter takes up one byte.
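
To make the "wrapper around Vec<u8>" idea concrete, here is a small sketch that converts a String into its bytes and back. The helper name `round_trip` is hypothetical and exists only for this illustration:

```rust
/// Hypothetical helper: turns a string into its UTF-8 bytes and back again.
fn round_trip(s: &str) -> (Vec<u8>, String) {
    // `into_bytes` hands over the underlying Vec<u8> without copying.
    let bytes: Vec<u8> = String::from(s).into_bytes();
    // `from_utf8` validates the bytes before wrapping them in a String again.
    let back = String::from_utf8(bytes.clone()).expect("bytes came from a valid String");
    (bytes, back)
}

fn main() {
    let (bytes, back) = round_trip("Niko");
    println!("{:?}", bytes); // [78, 105, 107, 111]
    println!("{}", back);    // Niko
}
```

The four bytes are the ASCII codes of N, i, k, and o, which is why len() reports 4 for this string.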

But that is not always the case. For example, if we change the string to another language (here, Russian written in Cyrillic):

fn main() {  
    let hello = String::from("Здравствуйте");  
    println!("{}", hello.len());  
}

If you count the letters in this string, there are 12, but the output is:

24

That means each letter in this string takes up two bytes in UTF-8 (most common Chinese characters take three bytes each). What we loosely call a “letter” here is, more precisely, a Unicode scalar value, and each Cyrillic scalar value in this string is encoded as two bytes.
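
The gap between byte length and scalar-value count can be checked directly. The following is a minimal sketch; the helper `byte_and_char_counts` and the Chinese sample string are assumptions added for illustration:

```rust
/// Hypothetical helper: returns (length in bytes, number of Unicode scalar values).
fn byte_and_char_counts(s: &str) -> (usize, usize) {
    (s.len(), s.chars().count())
}

fn main() {
    // ASCII letters: one byte each.
    println!("{:?}", byte_and_char_counts("Niko"));         // (4, 4)
    // Cyrillic letters: two bytes each.
    println!("{:?}", byte_and_char_counts("Здравствуйте")); // (24, 12)
    // Two common Chinese characters (sample chosen for this sketch): three bytes each.
    println!("{:?}", byte_and_char_counts("你好"));          // (6, 2)
}
```

Note that len() is O(1) because the byte length is stored, while chars().count() has to decode the whole string.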

From this example, you can see that a numeric index into String would not always correspond to a complete Unicode scalar value: some scalar values occupy more than one byte, while an index into the underlying bytes would address only a single byte.

Another example: the Cyrillic letter З is encoded as two bytes, with values 208 and 151. If numeric indexing were allowed, taking index 0 of Здравствуйте would give you 208, which is meaningless on its own because the second byte needed to form the Unicode scalar value is missing. To avoid this kind of hard-to-notice bug, Rust rejects numeric indexing on String at compile time, so the misunderstanding surfaces early in development.
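
The point about the lone byte 208 can be verified with the standard library. In this sketch, the helper `decode_prefix` is a hypothetical name; it tries to interpret the first n bytes of a string as UTF-8:

```rust
/// Hypothetical helper: tries to read the first `n` bytes of `s` as UTF-8 text.
/// Returns None if `n` is out of range or the bytes are not valid UTF-8.
fn decode_prefix(s: &str, n: usize) -> Option<&str> {
    let bytes = s.as_bytes().get(..n)?;
    std::str::from_utf8(bytes).ok()
}

fn main() {
    let hello = String::from("Здравствуйте");
    let bytes = hello.as_bytes();
    println!("{} {}", bytes[0], bytes[1]);      // 208 151
    // The first byte alone is an incomplete sequence, so decoding fails.
    println!("{:?}", decode_prefix(&hello, 1)); // None
    // Both bytes together form the scalar value З.
    println!("{:?}", decode_prefix(&hello, 2)); // Some("З")
}
```

Indexing the byte slice returned by as_bytes() is allowed because a byte slice makes no promise about characters.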

8.4.3. Bytes, Scalar Values, and Grapheme Clusters

There are three ways to view strings in Rust: bytes, scalar values, and grapheme clusters. Among them, grapheme clusters are the closest to what we usually call “letters.”

1. Bytes

Example:

fn main() {  
    let s = String::from("नमस्ते");  // Hindi written in Devanagari script
    for b in s.bytes() {  
        print!("{} ", b);  
    }  
}

This Devanagari string may look like it contains four letters. The .bytes() method returns the bytes of its UTF-8 encoding. The output is:

224 164 168 224 164 174 224 164 184 224 165 141 224 164 164 224 165 135

These 18 bytes show how the computer stores the string.

2. Scalar Values

Now let’s view it as Unicode scalar values:

fn main() {  
    let s = String::from("नमस्ते");  
    for b in s.chars() {  
        print!("{} ", b);  
    }  
}

Using the .chars() method gives the scalar values corresponding to this string. The output is:

न म स ् त े 

It has 6 scalar values, and two of them (the fourth and the sixth) are combining marks rather than standalone letters. They only make sense when combined with the preceding character.

This also explains why this Devanagari string takes 18 bytes: each of the 6 scalar values takes 3 bytes in UTF-8, and 6 × 3 = 18.
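
The 6 × 3 arithmetic can be reproduced with char::len_utf8, which reports how many bytes one scalar value needs in UTF-8. A minimal sketch, with `utf8_widths` as a hypothetical helper name:

```rust
/// Hypothetical helper: the UTF-8 width in bytes of every scalar value in `s`.
fn utf8_widths(s: &str) -> Vec<usize> {
    s.chars().map(char::len_utf8).collect()
}

fn main() {
    let s = "नमस्ते";
    for c in s.chars() {
        // Print the code point in hex together with its encoded width.
        println!("U+{:04X} takes {} bytes", c as u32, c.len_utf8());
    }
    let widths = utf8_widths(s);
    println!("{:?}", widths);                      // [3, 3, 3, 3, 3, 3]
    println!("{}", widths.iter().sum::<usize>());  // 18
}
```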

3. Grapheme Clusters

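Because obtaining grapheme clusters from a String is complicated, the Rust standard library does not provide this functionality. We will not demonstrate it here, but you can use a third-party crate from crates.io to implement it.

In short, viewed as grapheme clusters, this string splits into the four units a reader would call letters, as The Rust Programming Language book shows for the same word (the exact segmentation depends on the Unicode rules the crate implements):

न म स् ते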

8.4.4. Why String Cannot Be Indexed

  • Numeric indexing may return an incomplete value that cannot form a full Unicode scalar value, leading to bugs that are not immediately visible.
  • Indexing is expected to take constant time, O(1), but String cannot guarantee that: because characters have variable width in UTF-8, Rust would have to walk the contents from the beginning to find the n-th valid character.
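
The second point can be illustrated by writing the lookup by hand. This is only a sketch of the idea, not how the standard library is implemented, and `nth_char_offset` is a hypothetical name:

```rust
/// Hypothetical helper: finds the n-th scalar value and its byte offset
/// by walking the string from the start, which takes O(n) time.
fn nth_char_offset(s: &str, n: usize) -> Option<(usize, char)> {
    let mut offset = 0; // byte position of the current char
    for (i, c) in s.chars().enumerate() {
        if i == n {
            return Some((offset, c));
        }
        // Each char may be 1 to 4 bytes wide, so the offset cannot be
        // computed from `n` alone; every earlier char must be visited.
        offset += c.len_utf8();
    }
    None
}

fn main() {
    let hello = "Здравствуйте";
    println!("{:?}", nth_char_offset(hello, 2));  // Some((4, 'р'))
    println!("{:?}", nth_char_offset(hello, 50)); // None
}
```

The standard library's char_indices().nth(n) gives the same result; the manual loop just makes the linear walk visible.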

8.4.5. Slicing String

You can use [] with a range of byte offsets to create a string slice. For detailed coverage of string slices, see Chapter 4.5, Slices. Example:

fn main() {  
    let hello = String::from("Здравствуйте");  
    let s = &hello[0..4];  
    println!("{}", s);  
}

As mentioned earlier, one Cyrillic letter takes two bytes. This string slice takes the first 4 bytes of the string, which means the first two letters. The output is:

Зд

What if the string slice takes the first three bytes instead? That would mean the slice contains the first letter plus half of the second letter. What happens in that case? Look at the following example:

fn main() {  
    let hello = String::from("Здравствуйте");  
    let s = &hello[0..3];  
    println!("{}", s);  
}

Output:

byte index 3 is not a char boundary; it is inside 'д' (bytes 2..4) of `Здравствуйте`

The program panics at runtime, and the message says that byte index 3 is not a char boundary. In other words, both ends of a slice must fall on char boundaries. For this all-Cyrillic string, that means the offsets must be multiples of two bytes.
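
If the byte offsets are not known to be valid ahead of time, the panic can be avoided with the non-panicking str::get and with str::is_char_boundary. The following is one possible sketch; `safe_prefix` is a hypothetical helper that backs off to the nearest boundary:

```rust
/// Hypothetical helper: returns the longest prefix of `s` that is at most
/// `max_bytes` long and ends on a char boundary.
fn safe_prefix(s: &str, max_bytes: usize) -> &str {
    let mut end = max_bytes.min(s.len());
    // Offset 0 is always a boundary, so this loop terminates.
    while !s.is_char_boundary(end) {
        end -= 1;
    }
    &s[..end]
}

fn main() {
    let hello = String::from("Здравствуйте");
    // `get` returns None instead of panicking on an invalid range.
    println!("{:?}", hello.get(0..4));         // Some("Зд")
    println!("{:?}", hello.get(0..3));         // None
    println!("{}", hello.is_char_boundary(3)); // false
    println!("{}", safe_prefix(&hello, 3));    // З
}
```

Whether to back off silently, as `safe_prefix` does, or to surface the None from get is a design choice that depends on the caller.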

8.4.6. Iterating Over String

  • For scalar values, use the .chars() method. Example:
fn main() {  
    let s = String::from("नमस्ते");  
    for b in s.chars() {  
        print!("{} ", b);  
    }  
}
  • For bytes, use the .bytes() method. Example:
fn main() {  
    let s = String::from("नमस्ते");
    for b in s.bytes() {  
        print!("{} ", b);  
    }  
}
  • For grapheme clusters, the standard library does not provide a method, but you can use a third-party crate.
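
When both views are needed at once, the standard library's char_indices() pairs each scalar value with the byte offset where it starts. A short sketch, with `char_offsets` as a hypothetical helper name:

```rust
/// Hypothetical helper: collects (byte offset, scalar value) pairs.
fn char_offsets(s: &str) -> Vec<(usize, char)> {
    s.char_indices().collect()
}

fn main() {
    let s = String::from("Здр");
    for (offset, c) in s.char_indices() {
        println!("byte {} -> {}", offset, c);
    }
    println!("{:?}", char_offsets(&s)); // [(0, 'З'), (2, 'д'), (4, 'р')]
}
```

The offsets produced this way are always char boundaries, so they are safe to use as slice endpoints.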
