Unions: delving into unsafe Rust

#rust #programming #unsafe

I came across unions while looking through the code of the client library for WATS, the new trading platform of the Warsaw Stock Exchange (GPW), which is still under development.

Skimming the code, I noticed a lot of unsafe functions, which were a consequence of using unions instead of structs or enums.

I had never used or even encountered unions before, so I wanted to learn about them. It proved to be an interesting journey, during which I also revisited other Rust topics (like memory safety and ownership).

According to the official documentation, a union is declared with the same syntax as a struct, except that it uses the union keyword.

If I were to describe it unofficially: it's like a child of struct and enum. It looks like a struct but behaves more like an enum, and yet it is neither of them.

struct FooStruct {
   x: i32,
   y: f32,
}

union FooUnion {  // <---- union instead of struct
   x: i32,
   y: f32,
}

In the code snippet above, we declare FooStruct, a structure with 2 fields: x for the integer value and y for the float.
FooUnion appears to have identical fields, but under the hood all of its fields share the same memory, so a union holds only 1 value at a time! Which value? The compiler does not know, which is why reading a field must be enclosed in an unsafe block or function.
Rust is generally strict about how values are accessed, yet writing to a union field is actually safe (for Copy and ManuallyDrop fields). This is because the old bytes are simply overwritten: nothing has to be read or dropped first. As long as the variable is declared mutable, the value can be overwritten.

let mut my_union = FooUnion { y: 23.5 };
my_union.y = 25.7;
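
To make the safe-write / unsafe-read split concrete, here is a minimal sketch. The names IntOrFloat, overwrite and read_float are made up for this illustration and do not come from the WATS code.

```rust
// Illustrative union; the name `IntOrFloat` is an assumption of this sketch.
union IntOrFloat {
    i: i32,
    f: f32,
}

// Writing a `Copy` field needs no `unsafe`: the old bytes are just overwritten.
fn overwrite(u: &mut IntOrFloat, value: f32) {
    u.f = value;
}

// Reading needs `unsafe`: we promise that `f` is the field that was last written.
fn read_float(u: &IntOrFloat) -> f32 {
    unsafe { u.f }
}

fn main() {
    let mut u = IntOrFloat { f: 23.5 };
    overwrite(&mut u, 25.7);
    println!("f = {}", read_float(&u));

    // Switching to the other field is also a safe write...
    u.i = 42;
    // ...but reading it back still has to be marked `unsafe`.
    println!("i = {}", unsafe { u.i });
}
```

Removing either unsafe block makes the compiler reject the read, while the two writes compile as they are.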

With a union, what we're saying to the compiler is: FooUnion will store either an integer or a float in a single piece of memory; look away when we access it, as WE know how to handle it.
Thus, to initialize it, we'll use:

let my_struct = FooStruct { x: 42, y: 23.5 };

// A union initializer sets exactly one of its fields
let my_union = FooUnion { x: 42 };

This instantly reminds me of an enum (minus the unsafe part): only one value is stored at a time.

My natural next thought was: what about pattern matching? The Rust documentation conveniently provides an example with match:

let u = FooUnion { x: 10 };

unsafe {
    match u {
        FooUnion { x: 10 } => println!("Found exactly ten!"),
        FooUnion { y } => println!("Found y = {y} !"),
    }
}

However, if we write the match the way we would for an enum, with one binding arm per field, we get a warning that the second arm is unreachable.

unsafe {
    match u {
        FooUnion { x } => println!("Found x = {x} !"),
        FooUnion { y } => println!("Found y = {y} !"), // <-|
 //                          warning: unreachable pattern --|
    }
}
// Alternatively, we can use std::mem::transmute to transform value
// unsafe{
//     let val = std::mem::transmute::<FooUnion, i32>(my_union);
//     println!("Found value = {val} !");
// }

(｢•-•)｢ ʷʱʸ? Because there is nothing to compare: a union pattern that only binds a field always matches. We don't have several variants to match against; we have 1 union holding 1 value, whichever field we wrote in the initialisation call.

let u = FooUnion { x: 10 };

// OR
let u = FooUnion { y: 23.5 };

The match will print Found x = 10 ! if the value was stored "under" x, or Found x = 1102839808 ! if it was stored under y. That number is the bit pattern of 23.5 as a 32-bit float, reinterpreted as an integer, which probably means nothing at run time.
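
The number is not random: it is the same value that f32::to_bits returns. A small sketch to check this, where the union Bits and the helper float_as_int are hypothetical names, and #[repr(C)] is added only so that both fields are guaranteed to start at offset 0:

```rust
// Hypothetical union for this sketch; `repr(C)` pins both fields to offset 0.
#[repr(C)]
union Bits {
    x: i32,
    y: f32,
}

// Stores a float and reads the same four bytes back as an integer.
fn float_as_int(value: f32) -> i32 {
    let u = Bits { y: value };
    // Sound in this case: every 4-byte pattern is a valid `i32`.
    unsafe { u.x }
}

fn main() {
    let via_union = float_as_int(23.5);
    let via_to_bits = 23.5f32.to_bits() as i32;
    // Both print 1102839808 (0x41BC0000).
    println!("{via_union} {via_to_bits}");
}
```

Outside of experiments like this one, to_bits and from_bits do the same job without any unsafe code.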

Let's add a field that stores its value on the heap. The most obvious one is a String. To include it in a union, we must wrap it in std::mem::ManuallyDrop, because a union never drops its fields automatically. The wrapper is also a neat reminder that we need to drop the value ourselves.

Accessing value (≖_≖ )

How do you know which value is stored? I am accustomed to writing if statements that guarantee the type is known, like if let Variant::A = var { ... }, but I could not figure out how to do it with a union.
Apparently, you cannot: this is the real meaning of unsafe.
The two matches below produce different outputs: the first reads the first four bytes of the String as an integer (here 10, which matches the length of the string, although the order of a String's internal fields is not specified), and the second prints the actual string. The compiler cannot help us write the matching arms here; it expects us to KNOW which type is stored in the variable.

use std::mem::ManuallyDrop;

pub union FooUnionHeap {
    x: i32,
    y: f32,
    z: ManuallyDrop<String>,
}

let s = "HelloWorld".to_string();
let my_union_h = FooUnionHeap { z: ManuallyDrop::new(s) };


unsafe {
    match my_union_h {
        FooUnionHeap { x } => println!("Found x = {x} !"),
    }
}

// Output: `Found x = 10 !`

unsafe {
    match my_union_h {
        FooUnionHeap { z } => println!("Found z = {} !", z.as_str()),
    }
}

// Output: `Found z = HelloWorld !`
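
If you do need to know which field is active, the usual workaround is to store a tag next to the union yourself, which is roughly what an enum does for you. A minimal sketch, assuming made-up names (Kind, Payload, Tagged) that are not part of the article's code or of WATS:

```rust
use std::mem::ManuallyDrop;

// Hypothetical tag that records which union field was written.
#[derive(Clone, Copy, PartialEq)]
enum Kind {
    Int,
    Text,
}

union Payload {
    x: i32,
    z: ManuallyDrop<String>,
}

// The tag and the union travel together, so reads can be checked.
struct Tagged {
    kind: Kind,
    payload: Payload,
}

impl Tagged {
    fn int(x: i32) -> Self {
        Tagged { kind: Kind::Int, payload: Payload { x } }
    }

    fn text(s: &str) -> Self {
        let z = ManuallyDrop::new(s.to_string());
        Tagged { kind: Kind::Text, payload: Payload { z } }
    }

    fn describe(&self) -> String {
        // The tag tells us which field is active, so each read is sound.
        match self.kind {
            Kind::Int => format!("x = {}", unsafe { self.payload.x }),
            Kind::Text => format!("z = {}", unsafe { self.payload.z.as_str() }),
        }
    }
}

impl Drop for Tagged {
    fn drop(&mut self) {
        // Only the String variant owns heap memory that must be released.
        if self.kind == Kind::Text {
            unsafe { ManuallyDrop::drop(&mut self.payload.z) };
        }
    }
}

fn main() {
    println!("{}", Tagged::int(10).describe());
    println!("{}", Tagged::text("HelloWorld").describe());
}
```

The constructors keep the tag and the payload in sync, so the unsafe blocks stay confined to one small type.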

The same goes for just borrowing different fields:

let my_union = FooUnion { y: 25.7 };
unsafe {
    let a = &my_union.x;
    let b = &my_union.y;
    println!("a: {a}, b: {b}");
}
// Output: `a: 1103993242, b: 25.7`.

Size of the type (>.<)人(⸝⸝⸝>﹏<⸝⸝⸝)

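The sizes of a structure and a union differ: a structure's size is roughly the sum of its fields (plus padding for alignment), while a union's size is that of its largest field (rounded up to its alignment). In this example, however, the union has the same size in bytes as the enum, because the compiler can store the enum's discriminant in otherwise unused bit patterns of String; this is not guaranteed for every enum. Even the union with a packed C representation (which fixes the layout and removes padding) has the same size.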

pub struct FooStruct {
    x: i32,
    z: String,
}

pub enum FooEnum {
    X(i32),
    Z(String)
}

#[repr(C, packed)]
pub union FooUnionHeapC {
    pub x: i32,
    pub z: ManuallyDrop<String>,
}

pub union FooUnionHeap {
    x: i32,
    z: ManuallyDrop<String>,
}
println!("Size of heap allocated union: {}",
    std::mem::size_of::<FooUnionHeap>());
// Output: Size of heap allocated union: 24
println!("Size of heap allocated union with C representation: {}",
    std::mem::size_of::<FooUnionHeapC>());
// Output: Size of heap allocated union with C representation: 24
println!("Size of enum: {}", std::mem::size_of::<FooEnum>());
// Output: Size of enum: 24
println!("Size of structure: {}", std::mem::size_of::<FooStruct>());
// Output: Size of structure: 32
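
The comparison can be reproduced with a self-contained sketch like the one below. The short type names (S, E, U, UPacked) are placeholders for this example, and the exact numbers depend on the target and compiler version, so the program prints them instead of assuming them.

```rust
use std::mem::{align_of, size_of, ManuallyDrop};

#[allow(dead_code)]
struct S {
    x: i32,
    z: String,
}

#[allow(dead_code)]
enum E {
    X(i32),
    Z(String),
}

#[allow(dead_code)]
union U {
    x: i32,
    z: ManuallyDrop<String>,
}

#[allow(dead_code)]
#[repr(C, packed)]
union UPacked {
    x: i32,
    z: ManuallyDrop<String>,
}

fn main() {
    // A union is as large as its largest field, here the String.
    println!("String:       {}", size_of::<String>());
    println!("union:        {}", size_of::<U>());
    println!("packed union: {}", size_of::<UPacked>());
    // The enum may hide its discriminant in unused bit patterns of String.
    println!("enum:         {}", size_of::<E>());
    // The struct has to hold both fields at once, plus padding.
    println!("struct:       {}", size_of::<S>());
    // `packed` lowers the alignment to 1 but cannot shrink the largest field.
    println!("align of union / packed union: {} / {}",
        align_of::<U>(), align_of::<UPacked>());
}
```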

Speed 三三ᕕ( ᐛ )ᕗ

Here, we compare only the enum and the union, using the criterion benchmarking crate.

Both creating and accessing the union are faster, probably because creating an enum also has to write the discriminant, which records the variant in use, and accessing it has to check that discriminant.

The DANGER ( ꩜ ᯅ ꩜;)⁭

With great power comes great responsibility - Amazing Fantasy
and, more specifically:
It is the programmer’s responsibility to make sure that the data is valid at the field’s type - The Rust Reference

To see code that compiles fine but fails at runtime, we manually drop the value first and then read it (a use-after-free).

// Drops the `z` field from the union
fn consume(u: &mut FooUnionHeap) {
    unsafe{ ManuallyDrop::drop(&mut u.z); }
}


fn outer() {
    let s = "HelloWorld".to_string();
    // owner of the union
    let mut my_union_h = FooUnionHeap { z: ManuallyDrop::new(s) };

    // union's `z` field is dropped here
    consume(&mut my_union_h);

    // reading the `z` field
    unsafe {
        match &my_union_h {
            FooUnionHeap { z } => { 
                println!("Found z = {} !", z.to_string()) 
            },
        }
    };
}

// The ERROR: unsafe precondition(s) violated:
// ptr::copy_nonoverlapping requires that both pointer arguments are
// aligned and non-null and the specified memory 
// ranges do not overlap

// This indicates a bug in the program. 
// This Undefined Behavior check is optional, and 
// cannot be relied on for safety.

The compiler looked away, and we crashed on a dangling pointer: after the drop, z still points to memory that has already been freed.
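
One way to avoid leaving a dangling String behind is to move the value out and immediately overwrite the union with a plain integer. This is only a sketch under the assumption that the caller knows z is the active field; take_text is a hypothetical helper, not something from the WATS library.

```rust
use std::mem::ManuallyDrop;

union FooUnionHeap {
    x: i32,
    z: ManuallyDrop<String>,
}

// Moves the String out of the union and leaves a valid integer behind.
//
// Safety: the caller must guarantee that `z` is the field currently stored.
unsafe fn take_text(u: &mut FooUnionHeap) -> String {
    // `ManuallyDrop::take` hands ownership of the String to us.
    let s = unsafe { ManuallyDrop::take(&mut u.z) };
    // From now on only `x` may be read; the stale String bytes must not be used.
    u.x = 0;
    s
}

fn main() {
    let mut u = FooUnionHeap {
        z: ManuallyDrop::new("HelloWorld".to_string()),
    };
    let s = unsafe { take_text(&mut u) };
    // The String is now an ordinary owned value and is dropped normally.
    println!("took {s}, left x = {}", unsafe { u.x });
}
```

The responsibility has not gone away: the compiler will still not stop a later read of z. The helper only makes the intended state explicit.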

Why and when to use them? ᕙ(⇀‸↼‶)ᕗ

When unions were first proposed for Rust in 2015 (they were stabilized in Rust 1.19, released in 2017), the goal was to provide native support for C-compatible unions (RFC).

For me, outside of FFI, they are thieves of the joy of developing in Rust: you must know which field to use when, and be sure it holds a valid value. That is manageable in small code bases and a nightmare in large ones.

Why were they chosen in WATS? I think for speed and/or compatibility with some C API. After all, it is a trading platform.

The code is available on GitHub.

What do you think? (☞ ͡° ͜ʖ ͡°)☞