
nabbisen

Posted on • Originally published at scqr.net

matten: Heterogeneous data with `--features dynamic`

This is the third of four short posts about matten. The previous post covered the numeric core. This one covers a separate feature: ingesting messy, real-world data.


The problem

The numeric Tensor in the previous post is clean by construction: every cell is an f64. That is fine when your data is already clean. It is less fine when it arrives from a JSON API or a CSV file that has missing cells, integer values alongside floats, or the occasional boolean flag.

The dynamic feature adds an ingestion-and-cleanup layer for that case. It is not a second compute engine — you cannot do arithmetic on a dynamic tensor directly. The idea is simpler: ingest heterogeneous data, inspect and clean it, then convert explicitly to a numeric tensor when you are confident the data is ready.

Enable it:

matten = { version = "0.28", features = ["dynamic"] }

Ingesting mixed data

from_json_dynamic and from_csv_dynamic accept data with mixed types. Each cell lands in an Element variant: Float, Int, Bool, Text, or None (for JSON null or an empty CSV field).

use matten::Tensor;

// A JSON table with mixed numeric kinds and a missing cell
let json = "[[1, 2.5, null], [4.0, 5, 6]]";
let t = Tensor::from_json_dynamic(json)?;

assert!(t.is_dynamic());
assert_eq!(t.shape(), &[2, 3]);
assert_eq!(t.count_none(), 1);

The same on-ramp works for CSV:

use matten::Tensor;

let csv = "10.0,20.0,30.0\n40.0,,60.0\n70.0,80.0,\n"; // two empty cells
let t = Tensor::from_csv_dynamic(csv)?;

assert_eq!(t.count_none(), 2);

The format differs; the workflow does not.
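
To make the "each cell lands in a variant" idea concrete, here is a stand-alone sketch of how raw CSV fields could be sorted into typed cells. It does not use matten and is not its implementation: the `Cell` enum, `classify_field`, `classify_csv`, and `count_none` are hypothetical names, and the rules (try integer before float, recognise only lowercase `true`/`false`, no quoting support) are assumptions made for illustration.

```rust
/// Hypothetical per-cell value for this sketch; not matten's `Element` type.
#[derive(Debug, Clone, PartialEq)]
pub enum Cell {
    Float(f64),
    Int(i64),
    Bool(bool),
    Text(String),
    None,
}

/// Classify one raw CSV field, trying the narrowest type first.
pub fn classify_field(raw: &str) -> Cell {
    let s = raw.trim();
    if s.is_empty() {
        return Cell::None; // an empty field counts as a missing value
    }
    if let Ok(i) = s.parse::<i64>() {
        return Cell::Int(i);
    }
    if let Ok(f) = s.parse::<f64>() {
        return Cell::Float(f);
    }
    match s {
        "true" => Cell::Bool(true),
        "false" => Cell::Bool(false),
        _ => Cell::Text(s.to_string()),
    }
}

/// Split CSV text into rows of classified cells (no quoting or escaping).
pub fn classify_csv(csv: &str) -> Vec<Vec<Cell>> {
    csv.lines()
        .map(|line| line.split(',').map(classify_field).collect())
        .collect()
}

/// Count missing cells across all rows.
pub fn count_none(rows: &[Vec<Cell>]) -> usize {
    rows.iter().flatten().filter(|c| **c == Cell::None).count()
}
```

Fed the CSV text from the example above, `classify_csv` yields three rows and `count_none` reports the same two gaps. A trailing comma produces a final empty field, which is why the last row ends in a missing cell.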

Inspecting missing values

Before cleaning, you can see where the gaps are:

// none_mask: a numeric tensor of 0.0 / 1.0, one per cell
let mask = t.none_mask();
assert_eq!(mask.get(&[0, 2]), Some(1.0)); // null at [0,2]
assert_eq!(mask.get(&[0, 0]), Some(0.0)); // present

// schema_summary gives a readable type breakdown
println!("{}", t.schema_summary());
// e.g. "Float: 4, Int: 1, None: 1"
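
For intuition about what a 0.0 / 1.0 mask holds, here is a stand-alone sketch that does not use matten. `Option<f64>` stands in for a dynamic cell, and `none_mask` and `offset` are hypothetical helpers written for this illustration; row-major layout is an assumption.

```rust
/// Build a 0.0 / 1.0 mask over row-major cells: 1.0 marks a missing value.
/// `Option<f64>` is a simplified stand-in for a dynamic cell.
pub fn none_mask(cells: &[Option<f64>]) -> Vec<f64> {
    cells
        .iter()
        .map(|c| if c.is_none() { 1.0 } else { 0.0 })
        .collect()
}

/// Row-major flat offset of `[row, col]` in a 2-D table,
/// or `None` when the index is out of bounds.
pub fn offset(shape: [usize; 2], index: [usize; 2]) -> Option<usize> {
    let [rows, cols] = shape;
    let [r, c] = index;
    if r < rows && c < cols {
        Some(r * cols + c)
    } else {
        None
    }
}
```

Because the mask is numeric, summing it gives the number of missing cells, and summing along one axis would give per-row or per-column gap counts.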

Converting to a numeric tensor

The conversion step is explicit by design. try_numeric() is strict and refuses if any None, Bool, or Text values are present:

// This fails — there is a null in the data
assert!(t.try_numeric().is_err());

try_numeric_with(policy) lets you state exactly what to do with each variant:

use matten::NumericPolicy;

// Treat None as 0.0; Int and Float both become f64
let clean = t.try_numeric_with(NumericPolicy::default().none_as(0.0))?;

assert!(!clean.is_dynamic());
assert_eq!(clean.as_slice(), &[1.0, 2.5, 0.0, 4.0, 5.0, 6.0]);

Other policy options:

// none_as_nan: missing → f64::NAN instead of a chosen sentinel
let p = NumericPolicy::default().none_as_nan();

// allow_bool: true → 1.0, false → 0.0
let p = NumericPolicy::default().allow_bool();

// allow_text_parse: try to parse text cells as f64
let p = NumericPolicy::default().allow_text_parse();

// Chain options together
let p = NumericPolicy::default().none_as(0.0).allow_bool();

// Accept all variants permissively
let p = NumericPolicy::permissive();
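
The strict-by-default, opt-in-per-variant idea can be sketched without matten. Everything below is hypothetical and written only to show the pattern: `Cell`, `Policy`, its builder methods, and `to_numeric` are illustrative names, not matten's types, and the `String` error type is a simplification.

```rust
/// Hypothetical cell type for this sketch; not matten's `Element`.
#[derive(Debug, Clone, PartialEq)]
pub enum Cell {
    Float(f64),
    Int(i64),
    Bool(bool),
    Text(String),
    None,
}

/// Hypothetical conversion policy. The default is strict:
/// every optional conversion is switched off.
#[derive(Debug, Clone, Copy, Default)]
pub struct Policy {
    none_value: Option<f64>,
    bool_allowed: bool,
    text_parse_allowed: bool,
}

impl Policy {
    /// Replace missing cells with `v`.
    pub fn none_as(mut self, v: f64) -> Self {
        self.none_value = Some(v);
        self
    }
    /// Map `true` to 1.0 and `false` to 0.0.
    pub fn allow_bool(mut self) -> Self {
        self.bool_allowed = true;
        self
    }
    /// Try to parse text cells as `f64`.
    pub fn allow_text_parse(mut self) -> Self {
        self.text_parse_allowed = true;
        self
    }
}

/// Convert every cell to `f64`, failing on the first cell the policy rejects.
pub fn to_numeric(cells: &[Cell], policy: Policy) -> Result<Vec<f64>, String> {
    cells
        .iter()
        .enumerate()
        .map(|(i, cell)| match cell {
            Cell::Float(f) => Ok(*f),
            Cell::Int(n) => Ok(*n as f64),
            Cell::None => policy
                .none_value
                .ok_or_else(|| format!("cell {i}: missing value")),
            Cell::Bool(b) if policy.bool_allowed => Ok(if *b { 1.0 } else { 0.0 }),
            Cell::Text(s) if policy.text_parse_allowed => s
                .trim()
                .parse::<f64>()
                .map_err(|e| format!("cell {i}: {e}")),
            other => Err(format!("cell {i}: {other:?} not allowed by policy")),
        })
        .collect()
}
```

With the default policy the six cells from the JSON example are rejected because of the missing value; adding `.none_as(0.0)` produces `[1.0, 2.5, 0.0, 4.0, 5.0, 6.0]`. Collecting into a `Result` stops at the first rejected cell, so a failed conversion never returns a partially converted vector.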

Cleaning before converting

You can also fill missing values before the conversion step:

// Replace every None with 0.0, producing a filled tensor
let filled = t.fill_none(0.0);
assert!(filled.is_numeric_convertible());

// Then convert strictly
let numeric = filled.try_numeric()?;

forward_fill_none is also available for time-series-style forward propagation.
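
As a rough picture of what forward propagation means, here is a stand-alone sketch over a 1-D series. It does not use matten: `forward_fill` is a hypothetical helper, `Option<f64>` stands in for a dynamic cell, and leaving leading gaps unfilled is an assumption of this sketch, not a statement about how forward_fill_none behaves.

```rust
/// Forward-fill a 1-D series: each missing value takes the most recent
/// present value before it. Leading gaps stay `None` because there is
/// nothing to carry forward yet.
pub fn forward_fill(series: &[Option<f64>]) -> Vec<Option<f64>> {
    let mut last = None;
    series
        .iter()
        .map(|cell| {
            if cell.is_some() {
                last = *cell; // remember the latest observed value
            }
            last
        })
        .collect()
}
```

A series such as `[None, Some(1.0), None, None, Some(4.0), None]` becomes `[None, Some(1.0), Some(1.0), Some(1.0), Some(4.0), Some(4.0)]`. Any gap that survives at the start still has to be handled before a strict conversion.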

What dynamic is not

A few things that are intentionally absent from the dynamic feature:

  • No arithmetic on dynamic tensors. Convert with try_numeric() or try_numeric_with(policy) first.
  • No dynamic reshape, slice, or reduction.
  • No serde for dynamic tensors.

The point is a clean handoff: messy input → inspect and clean → numeric Tensor → ordinary numeric work. That boundary is deliberate.

Links: crates.io · docs.rs · mdBook · repository
