Matt Davies

Posted on Aug 10, 2021

Rust #8: Strings

#rust

Strings, at first glance in Rust, may seem very complicated. First of all, there's the existence of two string types: String and str. Then, there's the fact you can't easily index them. On top of that, as you dig deeper in the language, there are more string types such as CString, OsString, PathBuf, CStr, OsStr and Path. And then, if that wasn't enough, there are different ways to construct them and to convert between them because different APIs use different string types.

No wonder people complain about Rust's learning curve, which I will personally argue is not as steep as C++'s.

But the truth is that once you understand a few concepts like Unicode, UTF-8, ownership and slices, it is not that complicated. In fact, the complexity is necessary, and it can also improve performance.

In this blog article, I will talk about the whys of strings and how to convert between the various types. I will not talk about some of the operations you can do on strings as that is a huge topic. I do recommend you look at the Rust Standard API documentation for more information.

ASCII

In the far annals of time, when computers only talked in English, there needed to be a method to map byte values in computer memory to characters that included digits, letters (upper and lower case), and punctuation. Computers only see bytes; they do not see digits, letters and punctuation. This meant that if you wanted to show text on printed paper or a screen, there needed to be a standard way of saying "whenever you see this byte value, print the letter A".

You could, for example, say that a byte of value 0 means A, a value 1 means B and so on. You could invent anything you wanted as long as your software on the computer and the hardware that printed the letter agreed. They had to agree, because if they did not, sending HELLO could come out as JONNY or $£&&!.

Two major standards appeared in the 1960s: IBM's EBCDIC and the ASCII code developed by an American standards committee. EBCDIC was the abbreviation for Extended Binary Coded Decimal Interchange Code. ASCII stood for American Standard Code for Information Interchange. ASCII won out, probably because it was easier to remember its name!

ASCII uses 7 of the 8 bits in a byte to encode a character. Values 0-31 and 127 were control codes, used to send instructions to printers and terminals: for example, a new line or backspace. Values 32-126 represented the printable characters: the English letters, the digits and many punctuation characters. It had some nice properties. For example, to convert from upper case to lower case you only needed to add 32 or set bit 5.

The unused 8th bit was often used for error detection when sending those bytes over networks. But later, when computers started appearing in non-English-speaking countries, more characters needed to be added and that 8th bit was given up for 128 more characters. Then new competing standards turned up that matched ASCII for the first 128 values, but differed in the second 128 values. You had the concept of code pages that defined how byte values 128-255 were mapped.

But eventually even 256 separate values were not enough to contain all the characters that were printable in the world. Things got complicated very fast. So enter the Unicode Consortium.

Unicode

It turns out there are many, MANY different characters in the world. To support that many characters a new standard had to be created and this was run by the Unicode Consortium. They manage the Unicode standard that, according to Wikipedia, "defines 143,859 characters covering 154 modern and historic scripts, as well as symbols, emojis and non-visual control and formatting codes". But the consortium doesn't just maintain those characters and codes; it also specifies how to display, normalise, decompose, collate and render them. It is a very, very, and dare I add one more 'very', complicated system.

For example, é, the accented letter 'e' that can appear in French and many other languages, can be represented by a single value, or by two values: the value of the letter e and the value meaning put an acute accent on the previous character. Normalisation is the process of making sure all the characters follow one representation or the other. Composition is the process of reducing it to a single value, decomposition is the process of expanding it to the longer representation.
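To make the two representations concrete, here is a small sketch in Rust. The helper name `describe` is invented for this example; the only assumption is that the composed form is U+00E9 and the decomposed form is the letter e followed by U+0301 (combining acute accent).

```rust
/// Hypothetical helper: returns (number of UTF-8 bytes, number of code points).
fn describe(s: &str) -> (usize, usize) {
    (s.len(), s.chars().count())
}

fn main() {
    let composed = "\u{e9}"; // one code point: é
    let decomposed = "e\u{301}"; // two code points: e + combining acute accent

    println!("composed:   {:?}", describe(composed));
    println!("decomposed: {:?}", describe(decomposed));

    // Both render as the same glyph, but they are different sequences of
    // code points, so a plain comparison says they are not equal.
    assert_ne!(composed, decomposed);
}
```

This is why normalisation matters before comparing text. The standard library does not normalise for you; that job is left to external crates.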

Each value, or code point, is represented as a 32-bit value. The first 128 values match ASCII exactly, which means converting ASCII to Unicode is trivial, and converting Unicode to ASCII is trivial as well, as long as all code points lie between the values of 0 and 127. The various values are divided into planes, each covering a range of 65,536 values:

Plane	Value range in hex	Name
0	0000-FFFF	Basic Multilingual Plane
1	10000-1FFFF	Supplementary Multilingual Plane
2	20000-2FFFF	Supplementary Ideographic Plane
3	30000-3FFFF	Tertiary Ideographic Plane
4-13	40000-DFFFF	Unused
14	E0000-EFFFF	Supplementary Special-purpose Plane
15-16	F0000-10FFFF	Supplementary Private Use Area Planes

As you can see from the table above, not all value ranges are used within the 32-bit space. Only 21 bits are used. I guess we need more bits when we discover aliens and learn their languages.

But the downside of this Unicode representation is that all text now takes 4 times as much space as ASCII, regardless of language and how common each symbol is. Thanks to Dave Prosser and Ken Thompson (yes, that Ken Thompson, co-inventor of Unix and inventor of the B programming language, the grep utility and, with others, the modern programming language Go), we have a better representation.

UTF-8

In 1992, Dave Prosser, working for the Unix System Laboratories, submitted a proposal for a new representation of Unicode that mapped 31 bits to 1-5 bytes in such a way that lower Unicode values could be represented with fewer bytes. ASCII itself could be represented as normal single byte ASCII. It worked by using bit 7 to indicate that a multibyte sequence was being used. In the first byte of the sequence, the number of leading 1 bits (counting bit 7 itself) represented how many bytes would follow. The encoding worked like this:

Number of bytes	Value range	Byte 1	Byte 2	Byte 3	Byte 4	Byte 5
1	0000000-0000007F	0xxxxxxx
2	0000080-0000207F	10xxxxxx	1xxxxxxx
3	0002080-0008207F	110xxxxx	1xxxxxxx	1xxxxxxx
4	0082080-0208207F	1110xxxx	1xxxxxxx	1xxxxxxx	1xxxxxxx
5	2082080-7FFFFFFF	11110xxx	1xxxxxxx	1xxxxxxx	1xxxxxxx	1xxxxxxx

Plane 0 of Unicode, which contained the most common characters around the world, could be encoded in 1-3 bytes. All the current Unicode characters could be encoded, at most, within 4 bytes.

Ken Thompson improved it at the cost of adding a 6th byte, but his version allowed a program to tell the difference between an initial byte that started a sequence and an intermediate byte. He did this by reserving the bit prefix 10 for the following (continuation) bytes only:

Number of bytes	Value range	Byte 1	Byte 2	Byte 3	Byte 4	Byte 5	Byte 6
1	0000000-0000007F	0xxxxxxx
2	0000080-000007FF	110xxxxx	10xxxxxx
3	0000800-0000FFFF	1110xxxx	10xxxxxx	10xxxxxx
4	0010000-001FFFFF	11110xxx	10xxxxxx	10xxxxxx	10xxxxxx
5	0200000-03FFFFFF	111110xx	10xxxxxx	10xxxxxx	10xxxxxx	10xxxxxx
6	4000000-7FFFFFFF	1111110x	10xxxxxx	10xxxxxx	10xxxxxx	10xxxxxx	10xxxxxx

Plane 0 can still be encoded in 1-3 bytes, ASCII is still represented and all Unicode characters can still be encoded in up to 4 bytes. UTF-8 is such an amazingly clever standard. Well done David and Ken!
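As a rough sketch of how the first four rows of the table above translate into code, the function below encodes a single code point by hand and compares the result with what Rust's standard library produces. The name `encode_utf8_by_hand` is invented for this example; in real code you would simply use `char::encode_utf8`.

```rust
/// Hypothetical helper: encode one character following the 1-4 byte rows
/// of the table. `char` can never hold a value above 0x10FFFF, so the
/// 5 and 6 byte rows are not needed.
fn encode_utf8_by_hand(c: char) -> Vec<u8> {
    let cp = c as u32;
    match cp {
        // 0xxxxxxx
        0x0000..=0x007F => vec![cp as u8],
        // 110xxxxx 10xxxxxx
        0x0080..=0x07FF => vec![
            0b1100_0000 | (cp >> 6) as u8,
            0b1000_0000 | (cp & 0x3F) as u8,
        ],
        // 1110xxxx 10xxxxxx 10xxxxxx
        0x0800..=0xFFFF => vec![
            0b1110_0000 | (cp >> 12) as u8,
            0b1000_0000 | ((cp >> 6) & 0x3F) as u8,
            0b1000_0000 | (cp & 0x3F) as u8,
        ],
        // 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        _ => vec![
            0b1111_0000 | (cp >> 18) as u8,
            0b1000_0000 | ((cp >> 12) & 0x3F) as u8,
            0b1000_0000 | ((cp >> 6) & 0x3F) as u8,
            0b1000_0000 | (cp & 0x3F) as u8,
        ],
    }
}

fn main() {
    for c in "A\u{e9}\u{20ac}\u{1f600}".chars() {
        let by_hand = encode_utf8_by_hand(c);
        // The standard library's encoder should agree with the table.
        let mut buf = [0u8; 4];
        let by_std = c.encode_utf8(&mut buf).as_bytes().to_vec();
        assert_eq!(by_hand, by_std);
        println!("U+{:04X} -> {:02X?}", c as u32, by_hand);
    }
}
```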

Rust uses the UTF-8 standard to encode its strings in the String struct. So any character in a string will be encoded using between 1 and 4 bytes.
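A quick way to see this in practice is to ask each character how many bytes it needs. The helper `byte_lengths` below is a name made up for this sketch; `char::len_utf8`, `str::len` and `str::chars` are standard library methods.

```rust
/// Hypothetical helper: the UTF-8 length in bytes of every character in `s`.
fn byte_lengths(s: &str) -> Vec<usize> {
    s.chars().map(|c| c.len_utf8()).collect()
}

fn main() {
    // A, é, € and an emoji, written as escapes to avoid any ambiguity.
    let s = "A\u{e9}\u{20ac}\u{1f600}";

    println!("{:?}", byte_lengths(s));

    // len() counts bytes, not characters.
    assert_eq!(s.len(), 10);
    assert_eq!(s.chars().count(), 4);
}
```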

Ownership and slices

To recap, Rust strictly defines the concept of ownership. This is where a variable, parameter etc owns the memory it requires to store its value, and it is the ONLY variable to own that memory. Rust's compiler guarantees that there is only a single owner of a piece of memory at any single time. This also means that when the owner goes out of scope, the compiler can safely destroy the memory and give it back to the operating system.

Secondly, the owner can dish out references to its memory and lend them to other parts of the program. It can do so either mutably or immutably. Rust enforces that you can only give out a single mutable reference (i.e. the reference holder can mutate the memory the owner owns) at any time. The owner itself cannot even mutate itself while there's a mutable reference. But, if there are no mutable references, the owner can give out any number of immutable references, which also means that the owner itself cannot mutate itself while they still exist.

It is also important to note that if there are any references live when the owner goes out of scope, that is a compiler error. Rust will just not allow you to do that. To enforce that, there is a concept of lifetimes, but I do not need to go into that for the topic of strings.

So to summarise, you can only have these 3 states in Rust:

Only an owner. If the variable is marked mut, it can mutate itself.
Only an owner and any number of immutable references. Everything is read-only. The owner cannot mutate itself and neither can the references mutate the owned memory as they are immutable.
Only an owner and a single mutable reference. Only the mutable reference can mutate the owned memory. The owner cannot be used, not even to read, until that reference is no longer in use.
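The three states can be sketched in a few lines. The function name `borrow_states` is invented for this example, and the commented-out line shows the kind of code the compiler would reject.

```rust
/// Hypothetical walk through the three states, returning the final string.
fn borrow_states() -> String {
    // State 1: only an owner. It is marked `mut`, so it can mutate itself.
    let mut owner = String::from("Hello");
    owner.push_str(", world");

    // State 2: an owner and any number of immutable references.
    {
        let r1 = &owner;
        let r2 = &owner;
        // owner.push('!'); // would not compile: r1 and r2 are still used below
        assert_eq!(r1, r2);
    }

    // State 3: an owner and a single mutable reference.
    {
        let m = &mut owner;
        m.push('!');
    }

    // All references are gone, so we are back to state 1.
    owner
}

fn main() {
    println!("{}", borrow_states());
}
```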

One type of reference, mutable or immutable, is called a slice. It is a special reference to an array of memory or a contiguous sequence of values. The array that contains elements of type T is written as [T] and the slice, because it's just a reference to an array, is written as either &[T] or &mut [T]. Simply put, a slice can be thought of as a view into some contiguous memory owned by something else. A slice, due to being a reference, does not own any memory and so therefore requires something else to own that memory to have a view on to it.

Now a slice doesn't have to view all of the owned memory. It can view a subset of it. And Rust guarantees both at compile-time and run-time that a slice cannot view outside the bounds of the owned memory. This characteristic of slices is crucial to having performant operations on strings as you will see.

A slice has two pieces of information. Firstly, a pointer to the start of the memory it views, and secondly, the size of the memory it can view.
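Here is a minimal sketch of those properties using a plain array of integers. The helper `view` is a made-up name; it uses the standard `get` method, which returns `None` instead of panicking when the requested range falls outside the owned memory.

```rust
use std::mem::size_of;

/// Hypothetical helper: a checked view into part of `values`.
fn view(values: &[i32], start: usize, end: usize) -> Option<&[i32]> {
    values.get(start..end)
}

fn main() {
    let numbers = [10, 20, 30, 40, 50]; // the owner

    // A slice can view a subset of the owned memory.
    let middle: &[i32] = &numbers[1..4];
    assert_eq!(middle, &[20, 30, 40][..]);

    // A slice is a pointer plus a length, so it is two machine words wide.
    assert_eq!(size_of::<&[i32]>(), 2 * size_of::<usize>());

    // Viewing outside the bounds is refused at run-time.
    assert_eq!(view(&numbers, 3, 9), None);
    // &numbers[3..9] would panic instead of returning None.
}
```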

And now we can talk about what the string types above are all about. Some are owners: types that own the memory that contains the string's characters. The others are array-like types that become slices when used behind the reference operator &.

Owner type	Slice type
String	&str
OsString	&OsStr
PathBuf	&Path
CString	&CStr

`String`

If we look at the definition of std::string::String we see this:

struct String {
    vec: Vec<u8>
}

This means a String, internally, is an array of bytes. However, there is one important aspect to that array of bytes. Rust guarantees through its standard APIs that the array can only contain a valid sequence of UTF-8 characters. Likewise, &str guarantees that it is a view on the whole or part of a String that only contains UTF-8 characters. The APIs guarantee that the start pointer of a &str and its implied end pointer (the start pointer plus its length) fall exactly on UTF-8 character boundaries.

The fact that the data is UTF-8 means that when you ask for a character from the string, you can get back 1, 2, 3 or even 4 bytes of data due to the encoded nature of UTF-8. This is why when you extract a character from a string you get back a char type, which is 4 bytes in size. Rust's string APIs will convert the UTF-8 sequence of bytes to a single Unicode code point or code points.

It is this very complexity of how characters or codes in UTF-8 can be multiple bytes (or not) that makes these strings cumbersome to process. It is important to Rust that, no matter what you do, the UTF-8 stays valid. For example, slicing a string at a byte offset that falls in the middle of a character will panic.
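The sketch below shows what this looks like in practice, using the word "héllo" where é takes two bytes. The helper `safe_prefix` is a name invented for this example; `get`, `is_char_boundary` and `chars` are standard `str` methods.

```rust
/// Hypothetical helper: the first `n` bytes of `s`, or None if byte `n`
/// is out of range or falls in the middle of a character.
fn safe_prefix(s: &str, n: usize) -> Option<&str> {
    s.get(..n)
}

fn main() {
    let s = "h\u{e9}llo"; // bytes: h | é (2 bytes) | l | l | o

    // Byte 1 starts the é, byte 2 is in the middle of it.
    assert!(s.is_char_boundary(1));
    assert!(!s.is_char_boundary(2));

    // The checked API refuses to cut a character in half.
    assert_eq!(safe_prefix(s, 2), None);
    assert_eq!(safe_prefix(s, 3), Some("h\u{e9}"));
    // &s[..2] would panic at run-time for the same reason.

    // Extracting a character decodes the bytes into a 4-byte char.
    assert_eq!(s.chars().nth(1), Some('\u{e9}'));
    assert_eq!(std::mem::size_of::<char>(), 4);
}
```

Note that `chars().nth(n)` has to walk the string from the start, which is one reason Rust does not offer `s[n]` indexing by character.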

One more aspect to talk about is string literals. For example:

let greeting = "Hello";

All literal strings are embedded in the executable. And because that memory is unmoving and frozen in time, you can only grab an immutable slice to that string. Therefore, all literal strings are of type &str. They can never be mutated, as the memory containing the literals in an executable is typically mapped read-only by the operating system.

Technically speaking, the type is really &'static str because the literal is available for the entire lifetime of the program. If you don't understand lifetimes at time of reading this article then don't worry about this detail right now.

`OsString`

Different operating systems use different techniques for encoding their strings. This is important to know when interacting with the operating system directly because UTF-8 encodings may not cut it. Therefore, Rust introduces std::ffi::OsString to abstract away the differences between operating systems. It is defined in the FFI (Foreign Function Interface) module because it is often required when talking to C code and other languages.

On Unix systems, strings are sequences of non-zero bytes, often in UTF-8 encoding. On Windows, strings can be encoded in either 8-bit ASCII or a 16-bit representation called UTF-16 (another Unicode encoding system using 1 or 2 16-bit values). Rust strings are always UTF-8 that may contain zeroes. So Rust provides a new type with fallible conversion functions.

Like String, OsString has a companion slice type called std::ffi::OsStr.
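As a small illustration of those fallible conversions, the sketch below builds an OsString from a Rust string and converts it back. The helper `os_to_utf8` is a name invented here; `to_str`, `to_string_lossy` and `into_string` are the standard methods.

```rust
use std::ffi::{OsStr, OsString};

/// Hypothetical helper: view an OS string as UTF-8, if it happens to be valid.
fn os_to_utf8(s: &OsStr) -> Option<&str> {
    s.to_str()
}

fn main() {
    // Going from a Rust string to an OS string always works.
    let owned: OsString = OsString::from("config.toml");
    let slice: &OsStr = owned.as_os_str();

    // Going back can fail, so the slice conversion returns an Option...
    assert_eq!(os_to_utf8(slice), Some("config.toml"));

    // ...or you can accept replacement characters for invalid data.
    println!("{}", slice.to_string_lossy());

    // The owned conversion hands the original back on failure.
    let back: Result<String, OsString> = owned.into_string();
    assert_eq!(back, Ok(String::from("config.toml")));
}
```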

`PathBuf`

std::path::PathBuf is an OsString with a particular purpose. It is used to represent filenames. And because filenames are used by the operating system, they are required to be OS-specifically encoded. The definition of PathBuf is:

struct PathBuf {
    inner: OsString,
}

Similar to the other string representations, there is a companion slice type called std::path::Path. Along with this type are many useful methods for manipulating paths, like producing slices on the extension part, or the filename part.
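Here is a short sketch of a few of those methods. The path is made up for the example and does not need to exist, and `extension_of` is a hypothetical helper; `extension`, `file_name` and `parent` are standard `Path` methods that return slices into the original path.

```rust
use std::ffi::OsStr;
use std::path::{Path, PathBuf};

/// Hypothetical helper: the extension as a &str, if there is one and it is UTF-8.
fn extension_of(path: &Path) -> Option<&str> {
    path.extension().and_then(OsStr::to_str)
}

fn main() {
    let owned = PathBuf::from("/home/matt/notes.txt");
    let path: &Path = owned.as_path();

    // Each of these is a view into `owned`; nothing is allocated.
    assert_eq!(path.file_name(), Some(OsStr::new("notes.txt")));
    assert_eq!(path.extension(), Some(OsStr::new("txt")));
    assert_eq!(path.parent(), Some(Path::new("/home/matt")));

    assert_eq!(extension_of(path), Some("txt"));
}
```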

`CString`

Finally, out of the string types that I wish to talk about in this article, there is the std::ffi::CString type. This is used to communicate with C functions because C stores strings very differently to Rust. For starters, there is no length parameter as C uses a terminating zero byte, called the null-terminator, to mark the end of the string. This means that C strings can never contain zero bytes and finding the length requires stepping through all the characters. Also, C strings have no implied encoding. Due to this, CString is usually constructed from a byte array or vector.

Constructing a CString will make sure that there are no zero bytes in the middle of an array and will return an error if it finds them.

A CString is conceptually a vector of c_char, a byte type that matches the char type of the C language (it is signed or unsigned depending on the platform). The companion type of CString is CStr, which is a view of an array of c_chars.

When working with C code, you usually find yourself converting a String to a CString, which can then be passed to the function that takes a const char* using CStr's as_ptr() method.
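The sketch below shows that flow without calling any real C function. Reading the pointer back with `CStr::from_ptr` stands in for the C side, and `to_c_string` is a helper name invented for this example.

```rust
use std::ffi::{CStr, CString};
use std::os::raw::c_char;

/// Hypothetical helper: convert a Rust string to a C string, or None if it
/// contains a zero byte that C would mistake for the terminator.
fn to_c_string(s: &str) -> Option<CString> {
    CString::new(s).ok()
}

fn main() {
    let owned = to_c_string("Hello").expect("no interior zero bytes");

    // The terminating zero byte is added for you.
    assert_eq!(owned.as_bytes_with_nul(), &b"Hello\0"[..]);

    // A zero byte in the middle is rejected.
    assert!(to_c_string("Hel\0lo").is_none());

    // This is the pointer you would pass to a function taking const char*.
    // `owned` must stay alive for as long as the pointer is in use.
    let ptr: *const c_char = owned.as_ptr();

    // Pretend to be the C side: read the string back from the raw pointer.
    let view: &CStr = unsafe { CStr::from_ptr(ptr) };
    assert_eq!(view.to_str().ok(), Some("Hello"));
}
```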

Conversions

Another source of confusion when starting with Rust is how do you convert from one type to another. I have often found myself rushing to Stack Overflow to find code that shows me how. First, I will cover conversion between owner types and their slice types. Then, I will visit each string type and talk about how to convert from any other different string type.

From owned types to slices

Converting from an owned type to a slice type is as simple as putting a reference operator in front:

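use std::ffi::{CStr, CString, OsStr, OsString};
use std::path::{Path, PathBuf};

let s = String::from("Hello");
let os_s = OsString::from("Foo");
let path_s = PathBuf::from("/home/matt/.bashrc");
// CString has no From<&str>; CString::new is fallible because of zero bytes.
let c_s = CString::new("C sucks!").unwrap();

let ref_s: &str = &s;
let ref_os_s: &OsStr = &os_s;
let ref_path_s: &Path = &path_s;
let ref_c_s: &CStr = &c_s;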

This magic works because of the Deref trait. For example, String implements Deref<Target=str> and so gains all of str's methods. This also means that you can pass a String to a function that expects a &str, or a &mut str if the owner is mutable. This is why many functions that accept strings use &str as the parameter type. But when returning a string, functions often return an owned type so its lifetime isn't linked to the input parameters. So, for example, you can often see a function like:

fn concat(s1: &str, s2: &str) -> String {
    // Allocate once, with room for both inputs.
    let mut s = String::with_capacity(s1.len() + s2.len());
    s.push_str(s1);
    s.push_str(s2);
    s
}

...
let greet = String::from("Hello ");
let place = String::from("World");

let my_greeting = concat(&greet, &place);

Because converting an owned type to a slice is just constructing a pointer and a length, there is no allocation. Also, by accepting a &str, the function doesn't care whether you pass a &String or a string literal.

From slices to owned types

This is a little more complicated but not much. To construct an owned type from a slice, you have to allocate memory and copy the characters from the memory the slice is referencing. This is because someone else owns that memory and you can't have two owners. There are multiple ways of doing it.

One way is via the From trait, which declares the from method. All the owned string types described here implement From for their companion slice type. You've already seen its use in the examples above.

Another way is using a method on the slice type. Unfortunately, the different string types use different method names:

Owned type	Method to convert from slice type
`String`	`&str.to_owned()`
`String`	`&str.to_string()`
`OsString`	`&OsStr.to_os_string()`
`PathBuf`	`&Path.to_path_buf()`
`CString`	no dedicated method, but `&CStr.to_owned()` works through the ToOwned trait

You will notice that str has two methods: to_owned() and to_string(). to_string() actually comes from the trait ToString. The default version of this trait uses Rust's formatting functions, which historically made it slower than the to_owned() method (recent versions of the standard library special-case str, so the difference has largely disappeared). I still highly recommend you use to_owned() or the From trait methods.

Unfortunately, none of the other string types implement the ToString trait, because their conversion to a String can fail and to_string() does not return a Result.
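The methods above can be sketched side by side. The function `owned_copies` is a name invented for this example and the values are arbitrary; every call in it allocates and copies, because the new owner needs memory of its own.

```rust
use std::ffi::{CStr, CString, OsStr, OsString};
use std::path::{Path, PathBuf};

/// Hypothetical helper: build one owned value of each type from a slice.
fn owned_copies() -> (String, OsString, PathBuf, CString) {
    let s: String = "Hello".to_owned();
    let os: OsString = OsStr::new("Foo").to_os_string();
    let path: PathBuf = Path::new("/tmp/foo").to_path_buf();

    // A &CStr needs an existing C string to view, so make one first.
    let original = CString::new("bar").expect("no interior zero bytes");
    let c_slice: &CStr = original.as_c_str();
    let c: CString = c_slice.to_owned();

    (s, os, path, c)
}

fn main() {
    let (s, os, path, c) = owned_copies();

    // The From trait and to_string() give the same result for str.
    assert_eq!(s, String::from("Hello"));
    assert_eq!(s, "Hello".to_string());

    println!("{:?} {:?} {:?} {:?}", s, os, path, c);
}
```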

Conversions between string types

There are many different functions with different naming conventions to convert between string types, sometimes the owned type, sometimes the slice type. I thought about how I would represent this information. I considered tables for each string type I was converting to. But sometimes there isn't a single function you can call. In the end I decided on graph representations.

The first graph shows conversions between different types using their methods. Where a ? appears after the method name, it means it is fallible and either returns an Option or Result. You will need to handle that. Red nodes are the owned types.

The second graph shows conversion using the From trait. If an arrow travels from type T to type U, it means that U can be created using the From<T> trait. Again red nodes are owned types.

I hope these graphs help you convert from any string type to another.
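A few of the more common conversions can also be sketched in code. This is only a sample and not the full set; the helper `path_to_string` and the path used are made up for the example. Notice which conversions are infallible (From) and which return an Option or Result.

```rust
use std::ffi::{CString, OsString};
use std::path::{Path, PathBuf};

/// Hypothetical helper: a path as an owned String, if it is valid UTF-8.
fn path_to_string(path: &Path) -> Option<String> {
    path.to_str().map(str::to_owned)
}

fn main() {
    let s = String::from("/home/matt/.bashrc");

    // String -> OsString and String -> PathBuf never fail.
    let os = OsString::from(s.clone());
    let path = PathBuf::from(s.clone());

    // &Path -> &str is fallible and returns an Option.
    assert_eq!(path_to_string(&path), Some(s.clone()));

    // OsString -> String is fallible and returns a Result.
    assert_eq!(os.into_string().ok(), Some(s.clone()));

    // String -> CString fails if there is a zero byte inside.
    let c = CString::new(s.clone()).expect("no interior zero bytes");

    // CString -> String fails if the bytes are not valid UTF-8.
    assert_eq!(c.into_string().ok(), Some(s));
}
```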

Holy Cow!

There is one more string type in Rust I want to discuss and that is the std::borrow::Cow<T> type where T is a slice type that implements the ToOwned trait. All the string slice types mentioned in this article implement the ToOwned trait.

COW is an acronym for Copy On Write. It can either hold an immutable slice or an owned type. It is commonly used for parameters and return types and is constructed from the multiple From traits it implements.

If you construct it from a slice (such as a string literal or &str), it will hold an immutable reference to that slice. If you construct it from an owned type, it will take ownership:

let s = String::from("Hello");
let c1 = Cow::from(s); // c1 = Owned("Hello")
let c2 = Cow::from("World"); // c2 = Borrowed("World")

This makes Cow a useful return type for a function that can return a static string or a generated string:

use std::borrow::Cow;

fn main() {
    let g1 = greet(Some("Matt"));   // g1 = Owned("Hello Matt")
    let g2 = greet(None);           // g2 = Borrowed("Hello human!")
}

fn greet(name: Option<&str>) -> Cow<'static, str> {
    match name {
        Some(name) => Cow::from(format!("Hello {}", name)),
        None => Cow::from("Hello human!"),
    }
}

Notice that the generic parameters passed to Cow are a lifetime parameter and the slice type. The slice type is what Cow holds a reference to when it borrows. The owned type is derived from the slice type's ToOwned trait. The lifetime parameter is required for the case of borrowing a reference. If the Cow owns its data, the lifetime is not needed.

In the example above, g1 owns the data and g2 doesn't because it references a static string literal.

Cow types can be easily cloned too. If it borrows, the clone borrows the same data without allocating. If it owns, the clone is a new owned copy of the data. You can also obtain a mutable reference to the contents by calling to_mut, and that will clone if necessary.
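Here is a small sketch of that behaviour. The function `underscore` is a hypothetical helper that only allocates when it actually has to change something, which is the typical reason to reach for Cow.

```rust
use std::borrow::Cow;

/// Hypothetical helper: replace spaces with underscores, borrowing the
/// input untouched when there is nothing to replace.
fn underscore(input: &str) -> Cow<'_, str> {
    if input.contains(' ') {
        Cow::Owned(input.replace(' ', "_"))
    } else {
        Cow::Borrowed(input)
    }
}

fn main() {
    // No change needed: no allocation, just a borrow.
    let a = underscore("no_spaces");
    assert!(matches!(a, Cow::Borrowed(_)));

    // A change is needed: a new String is allocated and owned.
    let b = underscore("has spaces");
    assert!(matches!(b, Cow::Owned(_)));
    assert_eq!(b, "has_spaces");

    // Cloning an owned Cow gives another owned Cow.
    assert!(matches!(b.clone(), Cow::Owned(_)));

    // to_mut() clones the borrowed data the first time you write to it.
    let mut c = Cow::Borrowed("Hello");
    c.to_mut().push_str(" world");
    assert!(matches!(c, Cow::Owned(_)));

    // into_owned() extracts the owned value, cloning only if still borrowed.
    let s: String = c.into_owned();
    assert_eq!(s, "Hello world");
}
```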

Summary

That concludes my article for this week. I hope you find strings easier to understand now and realise that the reasons they are complicated are justified. We talked about owned types and their associated slice types. We talked about why there are different types due to differences in UTF-8 and operating system representations. We also talked about how you can convert between them using a combination of their methods and From traits. Finally, we discussed the Cow type that can represent either an owned type or a referenced type.

Until next time!

Top comments (2)

DiM • Dec 7 '22

Amazing rust series. I like that extra value (like explaining why UTF-8 was created etc.). Really thanks a lot.

PS: I believe theres a typo in your example. Very last word shall be "World" (you have "Hello").

let s = String::from("Hello");
let c1 = Cow::from(s); // c1 = Owned("Hello")
let c2 = Cow::from("World"); // c2 = Borrowed("World") <-- here

Luciano Mammino • Feb 23 '22 • Edited

This meme seemed appropriate!

PS: Awesome article, thanks a lot :)