DEV Community: doma.dev

Type system innovation propagation

doma.dev — Fri, 03 Sep 2021 16:30:00 +0000

TL;DR

Mainstream language designers want to incorporate established approaches from programming language theory.
- The way parametric polymorphism has enabled generics in Java and Go demonstrates this.
- Go with generics has the potential to solve the expression problem.
- C++ got it right straight away, and work has been done to improve parametric polymorphism to allow for ergonomic higher-kinded types (generic types that themselves accept type variables).
Further work is required to improve the expressiveness and ergonomics of languages with type systems.
- Most languages with type systems lack scalable ways to deal with heterogeneous data.
- Structure-aware features and row polymorphism deserve wider adoption than just in PureScript.
- The lack of efficient algorithms for structure-aware features holds back adoption greatly.

Why not settle for naive or simple type systems?

Most language designers agree that type systems should have first-class treatment in programming languages. Almost all programming languages have seen their type systems evolve to incorporate new features. In this post we'll study some such cases and motivate the need for furthering type system R&D beyond what we have at our disposal now.

To do that, we shall look at the history of two mainstream programming languages (Java and Go) through the lens of generic computing in said languages. In this post, when we talk about generic computing, we mean "ways to program in a type-agnostic way" or "writing a program that doesn't just work on one concrete type, but works on some class of types".

Thus, generic computing is instrumental even to the most basic programming. Data structures (trees, arrays, ...) are foundational to the discipline and intrinsically generic. The challenge, then, is to encode them in a type-safe way. A motivating example would be Java's "Hashtable", as seen in version 1.0, dated 7th of January, 1998.

Razor-sharp generic computing

Consider its get function:

public synchronized Object get(Object key) {
    HashtableEntry tab[] = table;
    int hash = key.hashCode();
    int index = (hash & 0x7FFFFFFF) % tab.length;
    for (HashtableEntry e = tab[index] ; e != null ; e = e.next) {
        if ((e.hash == hash) && e.key.equals(key)) {
            return e.value;
        }
    }
    return null;
}

Leaving the billion dollar mistake aside, when we look at the type safety of this snippet, we see that on its third line we call the hashCode() method of an instance of class Object. This approach to "generics" asks engineers to have a single point in a closed type hierarchy that mandates all the methods necessary for generic applications, which is a source of headaches for library implementers. Even if we agree that using interfaces is good enough for implementing generic programs (think of get accepting IHashable instead of Object), problems remain:

Upcasting (also known as generalisation, treatment of a subtype as a supertype) to an interface or to Object results in a return value of a wider-than-needed type, which requires downcasting (also known as specialisation, treatment of a supertype as a subtype) later on, throwing away type guarantees and creating room for errors.
Less significantly, overlapping abstract method names in interfaces, without facilities to resolve them, make generic programming via upcasting less scalable.

The pioneering language of modern type systems engineering, which gave rise to Haskell and OCaml, is called "ML". In the mid-seventies, ML introduced something called "parametric polymorphism", the idea of which is to let programmers have variables for types themselves, in the same way that they have variables for values. Modern Java's Hashtable uses parametric polymorphism and is said to be "polymorphic in key and value types":

public class Hashtable<K,V>
extends Dictionary<K,V>
implements Map<K,V>, Cloneable, Serializable

Case study: type variables for better polymorphism

Generic Java

As we discussed, the initial approach to generic programming in Java was to use Object, the common superclass of every Java class. The Pizza language, made by Odersky (eventually the creator of Scala) and Wadler (co-designer of Haskell) and released one year after Java, was a superset of Java that was a bit more principled. It allowed for type variables that would then be "erased" and translated into the Object class, automating upcasting and downcasting and thus retaining type safety. Erasure also avoids the exponential blow-up of compiled artefacts seen in C++ due to conditional code generation. More on that later.

Type erasure is greatly misunderstood and some shortcomings of the Java type system are misattributed to it, but it's not without its drawbacks. Most notably, type variables are not available at runtime, so a value cannot be checked against them: x instanceof T is rejected, and a cast (T)x, where T is a type variable, compiles only with an unchecked warning because nothing can verify it at runtime. The other drawback of type erasure is that even if a generic data structure or method is parametrised with a primitive type, the overhead of boxing it (turning it into a Java class) is carried along by erasure. Note that none of the drawbacks of type erasure limit type safety, only expressiveness and performance.

Wadler et al., after Pizza was released, made a minimum viable formalisation of Java, which was instrumental for eventual inclusion of generics in Java in version 1.5, in 2004.

Generic Go

Go is infamous for the longest time between the release of an industrial language and it getting generics. Importantly, this gave room for what I call void * polymorphism. In Go circa 2021, it's interface{} polymorphism and, without going into much detail about why it works, we'll present you with real code that makes use of it:

func ToBoolE(i interface{}) (bool, error) {
    i = indirect(i)

    switch b := i.(type) {
    case bool:
        return b, nil
    case nil:
        return false, nil
    case int:
        if i.(int) != 0 {
            return true, nil
        }
        return false, nil
    case string:
        return strconv.ParseBool(i.(string))
    default:
        return false, fmt.Errorf("unable to cast %#v of type %T to bool", i, i)
    }
}

This is clearly problematic, because using the interface{} type poisons programs with runtime switching over type information, moving failure detection from the realm of static analysis to the realm of dynamic monitoring. Furthermore, a slight change in the acceptable types causes a refactoring hell: when you extend the domain of your interface{} function, there is no way to know which other functions need to have their domain extended as well.
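To contrast the two styles, here is a minimal sketch assuming Go 1.18 or later, where type parameters are available. FirstAny and First are hypothetical names introduced only for this illustration; they are not part of any library mentioned in this post.

```go
package main

import "fmt"

// FirstAny is the interface{} flavour: the caller gets back a value of
// unknown type and has to assert its type at runtime.
func FirstAny(xs []interface{}) interface{} {
	if len(xs) == 0 {
		return nil
	}
	return xs[0]
}

// First is the generic flavour: T is a type variable, so the element type
// flows from the argument to the result and no assertion is needed.
func First[T any](xs []T) (T, bool) {
	var zero T
	if len(xs) == 0 {
		return zero, false
	}
	return xs[0], true
}

func main() {
	// Dynamic monitoring: a wrong assumption about the type is only
	// discovered when the assertion runs.
	v := FirstAny([]interface{}{"forty-two"})
	n, ok := v.(int)
	fmt.Println(n, ok) // prints "0 false"

	// Static analysis: s is known to be a string at compile time.
	s, found := First([]string{"forty-two"})
	fmt.Println(s, found) // prints "forty-two true"
}
```

With First, extending the set of accepted element types requires no change at all, and a caller that assumes the wrong element type is rejected by the compiler instead of failing at runtime.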

Similarly to Java, introducing generics to Go involved two stages: formalisation and an implementation proposal. Given the experience of the team behind generics in Go (much of it thanks to having Wadler on board), in Go's case a proper formalisation came first and the implementation followed.

Another reason for starting with formalisation in the case of Go is, perhaps, rooted in the fact that adding parametric polymorphism to Go is harder than doing so in Java. Indeed, one of the great features of the Go language is that its struct-interface supertyping is open.

package s

type Nil struct{}

func (n *Nil) Show() string {
        return "{}"
}

A structure with a method, defined independently in one package, can indeed happen to implement an interface defined in another package:

package main

import (
        "fmt"
        . "doma.dev/s"
)

type Shower interface {
        Show() string
}

func f(a Shower) string {
        return a.Show()
}

func main() {
        var x = Nil{}
        fmt.Println(f(&x))
}

A further complication, which warranted careful planning for this feature, was that the goal was to use code generation (the fancy word for which is "monomorphisation", because poly-morphic things spawn a bunch of mono-morphic things) instead of type erasure, to achieve more versatile generics at the expense of binary size.

Finally, a proposal that adds generics with constraints (which programmers can create and use in their code) was implemented.
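As an illustration of what "generics with constraints" looks like in practice, here is a minimal sketch assuming Go 1.18 or later. The names Number, Sum and Celsius are hypothetical and exist only for this example.

```go
package main

import "fmt"

// Number is a user-defined constraint: any type whose underlying type is
// int or float64 satisfies it.
type Number interface {
	~int | ~float64
}

// Sum is polymorphic in T, but only over types that satisfy Number,
// which is what makes the use of += type-safe.
func Sum[T Number](xs []T) T {
	var total T
	for _, x := range xs {
		total += x
	}
	return total
}

// Celsius has float64 as its underlying type, so it satisfies Number.
type Celsius float64

func main() {
	fmt.Println(Sum([]int{1, 2, 3}))        // prints 6
	fmt.Println(Sum([]Celsius{20.5, 21.5})) // prints 42
	// Sum([]string{"a"}) would be rejected at compile time,
	// because string does not satisfy Number.
}
```

The constraint plays the role that the single point in a closed type hierarchy played in the Java 1.0 Hashtable example, except that it is checked statically and the result keeps its precise type.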

Go and expression problem test

Moreover, Generic Go, as currently implemented, almost passes the expression problem test.

The expression problem, essentially, asks whether a codebase can be extended, without changing the existing source code in modules (except for the integration module) and while preserving type safety, with:

a new type, implementing all existing functions;
a new function over all existing types.

The expression problem test is then formulated as follows:

Work with expressions for a calculator DSL that builds up arithmetic expressions and then evaluates them (hence the name of "expression problem").
Start with an expression type case "constant" which holds a value of some primitive numeric type.
Implement a function "evaluate" that takes an expression and returns the corresponding value of the primitive numeric type.
Implement "evaluate" for "constant".
Encode expression "plus" that denotes adding up two expressions.
Extend "evaluate" to work on it without changing other modules.
Implement "to string" function for both expressions ("plus" and "constant") without changing other modules.
In the integration module, demonstrate that any function is callable over any defined type case.
Erase all code for "plus" and "to string".
Reimplement "to string" first.
Reimplement "plus" second, then extending "evaluate" and "to string".

If generic constraint narrowing were possible in Generic Go as implemented (it was planned to be possible in the original research), we would be able to write the following code to solve the expression problem in Go:

// package A at time 0
type ExprConst[T any] struct {
    UnConst T
}

// Currently impossible because receiver arguments have to have exactly the
// same type signature, including specificity of the type parameters, as their
// struct declarations.
func (e ExprConst[int]) Eval() int {
    return e.UnConst
}
// end of package A at time 0

// package E at time 0
type Evaler interface {
    Eval() int
}
// end of package E at time 0

// package P at time 1
type ExprPlus[L, R any] struct {
    Left L
    Right R
}

// Currently impossible
func (e ExprPlus[Evaler, Evaler]) Eval() int {
    return e.Left.Eval() + e.Right.Eval()
}
// end of package P at time 1

// package E at time 2
type Evaler ...

type Shower interface {
    Show() string
}
// end of package E at time 2

// package A at time 2
type ExprConst...

func ...Eval() int...

func (e ExprConst[int]) Show() string {
    return strconv.Itoa(e.UnConst)
}
// end of package A at time 2

// package P at time 2
type ExprPlus...

func ...Eval() int...

func (e ExprPlus[Shower, Shower]) Show() string {
    return fmt.Sprintf("( %s + %s )", e.Left.Show(), e.Right.Show())
}
// end of package P at time 2

// package main at time 2
type Expr interface {
    Evaler
    Shower
}
func main() {
    // Still hypothetical: this relies on the narrowed methods above.
    var e Expr = ExprPlus[Expr, Expr]{
        ExprPlus[Expr, Expr]{
            ExprConst[int]{30},
            ExprConst[int]{11},
        },
        ExprConst[int]{1},
    }
    fmt.Printf("%d = %s", e.Eval(), e.Show())
}
// end of package main

Then, when one would run this, the output would be 42 = ( ( 30 + 11 ) + 1 ).

Quoting Robert Griesemer, one of the contributors to the FG paper and one of the main implementers of Generic Go:

Even though we can type-check that, we don't know to implement it efficiently in the presence of interfaces (which would also have methods with corresponding type parameters).

Maybe some day...

More evidence of usefulness of R&D in type systems

There are many other examples that demonstrate adoption of programming language theory results in mainstream languages. To name a few:

Rediscovery of higher kinded types in C++ (something very few type systems allow for natively), and a long process of evolution to make them ergonomic.
Design and inclusion of higher kinded types into Scala by Martin Odersky.
Allowing for ergonomic higher order functions in C++ and Java
Function type treatment in mainstream languages, from Golang to Rust.

There is also an innovation that is on the verge of breaking through into mainstream languages.

Structure-aware type systems and row polymorphism

As we discussed, type systems, by definition, limit the expressiveness of languages. And yet, they are well worth it as far as budgets are concerned. Let's start by exploring a classical expressiveness shortcoming of languages with type systems: the problem of operating on heterogeneous data.

Imagine we need to store a hierarchy of countries and cities in the same tree. An untyped approach would be simple: make distinct objects for countries, cities and neighbourhoods, then add a children field to each, putting the necessary objects on lower levels of the hierarchy:

let city1 = {"name": "Riga", "longestStreet": "Brivibas"};
let city2 = {"name": "Zagreb", "longestStreet": "Ilica"};
let country1 = {"name": "Latvia", "ownName": "Latvija", "capital": city1};
let country2 = {"name": "Croatia", "ownName": "Hrvatska", "capital": city2};
let city11 = {"name": "Zilupe", "longestStreet": "Brivibas"};
let city22 = {"name": "Split", "longestStreet": "Domovinskog Rata"};
let world =
  {"name": "Earth",
   "children":
     [{...country1, "children": [city1, city11]},
      {...country2, "children": [city2, city22]}]
  };

Naively, the same can be achieved by having a tree type, parametrised with a union type that encodes either a City or a Country.

data World = World { name :: Text }
data Country = Country { name :: Text, capital :: City }
data City = City { name :: Text, longestStreet :: Text }
data Value = W (World, [Country]) | C (Country, [City]) | T City

However, quite a few problems arise when we want to extend the encoding to also capture streets, for instance. Our union type would have to change along with the type definition for City. This problem is far from trivial to solve in a polymorphic fashion in typed languages. There is modern research that shows that it's doable by introducing "pattern structures" into structure-aware type systems.

Row polymorphism is relevant to the issue of heterogeneity, and also helps with problems such as capability tracking and diverse effect systems. It is another structure-aware approach to polymorphism, which works on types with rows (records) and allows defining functions that are polymorphic in everything except for some rows. In our example, a row-polymorphic function over our structure could ask for any type for which name :: Text is defined, along with, perhaps, some other rows. It would then accept anything in our heterogeneous structure, since everything is named. If it feels to you like this walks like duck typing and quacks like duck typing then yes, you are right. It is exactly a way to formalise duck typing and introduce it into type systems.

It is a common theme, however, that for PLT to be adopted in the industry, systems that implement the theory need to be engineered. But when you introduce one feature to a system, you trade off the ease of introducing other features (this is why we don't have, and never will have, a universal language that is good at everything). In the case of row polymorphism, the challenge is an efficient representation of records. Fortunately, the default implementation of PureScript piggybacks on the efficiency of Node.js. We expect row polymorphism to make its way into functional programming languages from the existing implementations in PureScript and in Ermine, an industrial laboratory language, and eventually to be adopted in mainstream languages.
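Go has no row polymorphism, but its structurally satisfied interfaces give a rough feel for the "anything that is named" idea. The sketch below is an approximation under that assumption: it abstracts over a Name method instead of a name field, and the types City, Country, Named and the function Names are hypothetical names modelled on the example above.

```go
package main

import "fmt"

// Named is satisfied structurally by any type that has a Name method,
// whatever else that type contains.
type Named interface {
	Name() string
}

type City struct {
	name          string
	longestStreet string
}

type Country struct {
	name    string
	ownName string
	capital City
}

func (c City) Name() string    { return c.name }
func (c Country) Name() string { return c.name }

// Names works over a heterogeneous slice: it only relies on the part of
// the structure that every element shares.
func Names(xs []Named) []string {
	out := make([]string, 0, len(xs))
	for _, x := range xs {
		out = append(out, x.Name())
	}
	return out
}

func main() {
	riga := City{name: "Riga", longestStreet: "Brivibas"}
	latvia := Country{name: "Latvia", ownName: "Latvija", capital: riga}
	fmt.Println(Names([]Named{riga, latvia})) // prints [Riga Latvia]
}
```

Unlike true row polymorphism, this loses the concrete type of each element once it is stored as Named, so getting the other fields back requires a runtime type assertion.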

Notable omissions

It is hard to provide a full survey of polymorphism and tangential topics in one little blog post. This is why we had to pick our battles. We have considered, but decided to omit or mention just briefly, the following subjects (with links to introductory posts about them):

Parting words

In most mainstream languages, the existing facilities for boosting the expressiveness of the type system are sufficient in the majority of cases without sacrificing guarantees. If you find yourself needing more, sometimes introducing refactoring loops into your feature implementation process can be wise. In well-typed systems, refactoring is cheap, and introducing such loops is unlikely to be detrimental to time to market compared to using untyped approaches. That said, for the sake of accepting the many potential architectures that would be possible if type systems were richer, we need to press on as a community and create compilers that take novel research ideas, or ideas from other languages, in a continuous struggle to unify them into ergonomic systems. Furthermore, along with regaining expressiveness, this work can often tighten compile-time guarantees. More about it in the upcoming blog post.

All in all, we think that exploring the repeated success of mainstream languages in adopting parametric polymorphism does a good enough job of motivating businesses to look at the proceedings in the field!

Why type systems matter

doma.dev — Thu, 26 Aug 2021 15:00:00 +0000

TL;DR

Computer programmers expect their language environments to reject bad programs. It's largely done with lightweight formal methods.
Runtime monitoring (the thing that tells you that undefined is not a function) is an example of such a formal method.
A type system is also an example of such a formal method.
Type systems target a range of properties deemed as "bad", for which it is guaranteed that programs having those are rejected.
This may and does result in "good" programs having "bad" properties being rejected. We say that it means that type systems limit the expressiveness of a language.
"If it compiles, it works" is a dangerous misconception, but there is a synergy between correctness and passing a type check.
The alternative to using type systems, however, is less feasible for adequately-budgeted developments, because it moves error detection to later stages of a system's lifecycle, resulting in significantly larger costs.
Type systems are also great for rapid prototyping, technical validation and deriving specifications from requirements.
Type systems allow you to test for only truly run-time faults, which are more often than not related to side-effects.
If your organisation can cope with the possible friction that type systems bring, not using type systems is irresponsible.

A fistful of formal methods

We want our programming language environments at large to be able to tell well-behaved programs from those that behave poorly. There are several ways to achieve this.

Runtime monitoring, which considers things like operations on incompatible objects (a la Python and JavaScript) and underappreciated contract programming based on preconditions and postconditions, as well as invariant checking (a la Eiffel and DLang).
Some will remember model-driven engineering with UML modelling (my Vim highlighted "UML" as a non-existent word! It brings me joy): automatically deriving constraints from such models and rejecting models that are self-contradictory or break some constraints (a la EMFtoCSP).
Both digital and analog circuits can be accepted or rejected based on automatically derived finite state machine models and checking for desired properties.
Type systems for rejecting classes of poorly-behaved programs statically, during compilation (a la Java, Haskell).

A rather interesting observation is that these discriminators should be reproducible, which calls for underlying formalisms. Furthermore, it's preferred that domain experts (JavaScript programmers, UML architects, embedded systems engineers) can reap benefits from those. That property is called "lightweight" in culture. When we put these considerations together, we see that all of these things, including JavaScript's runtime monitoring, which many people may deem as basic, are lightweight formal methods! Not scary at all.

Not all formal methods, however, are made for the same reason and not everything achievable with one can be achieved with another. To illustrate that, consider the following use-case: we build up an array of validation functions and then, at the validation site we call them one by one.

01 |    let module1 = {
02 |      defaultValidators: [
03 |        (x) => 2 == x.split(' ').length,
04 |      ],
05 |      validate: (input) => (fs) =>
06 |        fs.reduce((acc, f) => acc && f(input), true),
07 |    };
08 |
09 |    let module2 = {
10 |      alsoCapitalised: [
11 |        (x) =>
12 |           x.split(' ').reduce(
13 |             (acc, x) => acc && (/[A-Z]/.test(x[0])), true
14 |           )
15 |      ] + module1.defaultValidators,
16 |    }
17 |
18 |    let main = {
19 |      main: (input) => {
20 |        let validators = module2.alsoCapitalised;
21 |        if (module1.validate(input)(validators)) {
22 |          console.log("It's time to open the door");
23 |        }
24 |      }
25 |    }
26 |
27 |    main.main("Viktor Tsoi");

When we run this code, the following error will be reported:

Uncaught TypeError: fs.reduce is not a function
    validate debugger eval code:6
    main debugger eval code:21
    <anonymous> debugger eval code:27

The true place where the error happens is line 15. Getting there from lines 6 and 21 would probably require a little bit of debugging, especially in a real project. Indeed, the error is caused by the nonsensical operation + over two arrays, which in JavaScript silently produces a string instead of an array. Let's fix it:

15 | ].concat(module1.defaultValidators),

When we run the code again, we get the expected message in the log:

It's time to open the door

Let's compare runtime monitoring with a type system.

Here's the code with the same error in Haskell.

01 | {-# LANGUAGE OverloadedStrings #-}
02 | import qualified Data.Text as T
03 | import Data.Text( Text )
04 | import Data.Char( isUpper )
05 |
06 | defaultValidators :: [Text -> Bool]
07 | defaultValidators = [\x -> 2 == (length $ T.splitOn " " x)]
08 |
09 | validate :: Text -> [Text -> Bool] -> Bool
10 | validate input fs = foldl (\acc f -> acc && f input) True fs
11 |
12 | alsoCapitalised :: [Text -> Bool]
13 | alsoCapitalised = [\x -> foldl (\acc w -> acc && (isUpper $ T.head w))
14 |                                True
15 |                                (T.splitOn " " x)] + defaultValidators
16 |
17 | main :: IO ()
18 | main =
19 |   case validate input defaultValidators of
20 |     True -> putStrLn "It's time to open the door"
21 |     _    -> putStrLn "Close the door behind me"
22 |   where
23 |     input = "Viktor Tsoi"

When we try to compile it, we get an error that says exactly what is wrong. To be able to apply +, the operands have to be classified as numbers via the typeclass Num. This typeclass doesn't include lists of validator functions. Note that if we wanted to define addition on such values, we could, by providing an appropriate instance of Num. But it's a rather horrible idea, so we'll fix the bug instead.

/tmp/hi.hs:13:19: error:
    No instance for (Num [Text -> Bool]) arising from a use of '+'

With this error, we quickly can replace line 15 with one using ++, the list concatenation operator:

15 | (T.splitOn " " x)] ++ defaultValidators

When we run this one, we get the correct result!

Static checking allows us to catch many classes of errors like this one early, which saves a lot of money, since—as systems engineering teaches us—the cost of fixing a fault in a system grows exponentially as a function of how far it is from the requirement gathering stage in the lifecycle of a system.

Benefits of using type systems

Type systems come in different shapes and sizes: different type systems may be geared to eliminate different classes of incorrect programs. However, most type systems, collaterally, eliminate programmers' errors such as incomplete case analyses, mismatched units, et cetera. For example, if we get rid of line 21 in the Haskell listing entirely, in a well-configured GHC (which treats warnings as errors and warns about everything "suspicious"), we'll get the following error, indicating incomplete case analysis:

/tmp/hi.hs:19:3: error: [-Wincomplete-patterns, -Werror=incomplete-patterns]
    Pattern match(es) are non-exhaustive
    In a case alternative: Patterns not matched: False
   |
19 | case validate input defaultValidators of
   | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^...

Of course, it's impossible to reap this benefit without a certain wit and discipline. I like to call that discipline "tight typing"; many Haskellers call it "having as concrete data structures as possible" (it's an elaboration of the mantra "abstract functions, concrete data"). For instance, the DummyTag type shown later in this post is loose, because it is used in places of the model that accept types with incompatible terms. Quoting my colleague:

Nothing prevents you to subtract height from pressure if both of those are encoded as Float.
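To make the quote concrete, here is a minimal sketch of "tight typing" in Go, where giving each quantity its own defined type turns the mistake into a compile-time error. The names Height, Pressure and Climb, as well as the units in the comments, are assumptions made up for this illustration.

```go
package main

import "fmt"

// Height and Pressure share the representation float64, but they are
// distinct types, so the compiler refuses to mix them.
type Height float64   // metres (assumed unit)
type Pressure float64 // pascals (assumed unit)

// Climb only accepts heights; passing a Pressure here does not compile.
func Climb(h, delta Height) Height {
	return h + delta
}

func main() {
	h := Height(1800)
	p := Pressure(101325)

	fmt.Println(Climb(h, 200))

	// fmt.Println(h - p) is rejected at compile time because the operand
	// types are mismatched. The operation is still possible, but only
	// through explicit conversions that make the intent visible:
	fmt.Println(float64(h) - float64(p))
}
```

Had both values been plain float64, the subtraction would have compiled silently, which is exactly the looseness the quote warns about.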

Type checkers are powerful refactoring tools. Anecdotally, at work, once we had to restructure approximately fifty modules while separating part of those into a library and fusing arguments into structures in another part of those. The whole refactoring was done and released by one person in one working day. It would have been impossible without a type checker to ensure the completeness of said refactoring. Another anecdote comes from my colleague:

When I worked for AlphaSheets.com (a startup "acquihired" by Google), I refactored the whole codebase, threading through an App monad instead of IO. I did this for a week in part due to painful and regular rebasing onto the main branch. But when the thing was compiled, the only tests that failed were the ones covering a place that was stubbed with undefined.

Notorious for a steep learning curve and perceived feature development slow-down, languages with type systems—granted a certain engineering savviness—can serve as amazing facilitators of rapid prototyping. The goal of prototyping is often to figure out the best specification for the requirements at hand. Expressive type systems often allow encoding relationships between domain entities without writing complex business logic. If one views types as claims that something is possible and values of said types as proofs that it indeed is possible, this approach makes a lot of sense. Of course, mileage can vary and it works better the fewer side-effects there are in a system, but remember that you can simulate side effects via type encodings too! For instance, instead of figuring out a proper way to do cryptography while prototyping, we can encode an interface of a crypto-system and populate it with dummy functions, complying with it:

data PKC pass sk pk slip signature sigmsg encrypted plain cipher = PKC
  { -- | Initial key derivation function
    kdf :: pass -> (sk, pk, slip),
    -- | Rederive with kdf
    rekdf :: slip -> pass -> (sk, pk),
    -- | Sign data with key @sk@ and produce a detached @signature@
    -- possibly containing @pk@ for verification.
    sign :: sigmsg -> (sk, pk) -> signature pk,
    -- | Verifies @signature@ container's validity.
    verify :: sigmsg -> signature pk -> Bool,
    -- | Encrypt data of type @plain@ to the key @pk@ and produce @encrypted@
    -- containing @cipher@.
    encrypt :: plain -> pk -> encrypted pk cipher,
    -- | Decrypts @cipher@, contained in an @encrypted@ container into @plain@.
    decrypt :: encrypted pk cipher -> sk -> Maybe plain
  }

This is a model of public key cryptography. Populating this model with functions would, together with tests that verify correct behaviours and error handling of a cryptosystem, serve as proof of the possibility of a correct implementation of public key cryptography in Haskell. Perhaps more importantly, it would also provide an interactive specification for doing so, perhaps even in another language! Let's give an example of some functions slotting into that model.

data DummyTag = DummySK | DummyPK | DummySlip
  deriving (Show, Eq)

-- | Note that we abuse the fact that in dummy implementation
-- secret key and public key are both represented with a
-- DummyTagged type alias to embed secret key together with
-- the message.
newtype DummySigned msg key = DummySigned {sig :: (key, (key, msg))}
  deriving (Show, Eq)

-- ...

type DummyTagged = (DummyTag, ByteString)

-- ...

dummyKdf :: ByteString -> IO (Maybe (DummyTagged, DummyTagged, DummyTagged))
dummyKdf pass = pure $ Just ((DummyPK, pass), (DummySK, pass), (DummySlip, pass))

dummyRekdf :: DummyTagged -> ByteString -> Maybe (DummyTagged, DummyTagged)
dummyRekdf (DummySlip, x) pass =
  go (x == pass)
  where
    go True = Just ((DummyPK, pass), (DummySK, pass))
    go False = Nothing
dummyRekdf _ _ = error "DummySlip expected in the 1st argument"

-- ...

dummySign :: (DummyTagged, DummyTagged) -> msg -> DummySigned msg DummyTagged
dummySign (verificationKey@(DummyPK, _), signingKey@(DummySK, _)) blob =
  DummySigned (verificationKey, (signingKey, blob))
dummySign _ _ = error "The first argument has to be a tuple of DummySK and DummyPK"

-- | Note that we're not comparing public key with secret key
-- but rather compare embedded bytestrings which match if the keys
-- were derived from the same password by kdf
dummyVerify :: Eq a => DummySigned a DummyTagged -> a -> Bool
dummyVerify (DummySigned ((DummyPK, signedAs), ((DummySK, signedWith), signedWhat))) candidate =
  (signedAs == signedWith) && (candidate == signedWhat)

Now to keep prototyping, we just need to instantiate PKC data type with this collection of functions. Later on, when we switch to a real cryptographic system, we will use it to make another value of type PKC. If all the properties of the initial prototype were preserved, it shall serve as proof to the claim that "production public key cryptographic systems exist as modelled". We can and should verify that early, but not too early to slow the prototyping down. Similarly, HTTP client-server interactions can be modelled. After the prototype is completed, one will end up with a runnable, compilable specification for the software they're about to write. What's fairly amazing is that if the company doesn't want to switch from Haskell (or whichever strongly-typed language they were using to model) to another language for the actual product, they can repurpose this prototype for an incremental rewrite into an MVP!

Now to the last, but not the least important benefit of type systems! These days, usage of higher-order and anonymous functions is prominent even in more conservative ecosystems like Java's. Type systems greatly assist in reasoning about the code's overall behaviour. In general, type systems are one of many tools for writing self-documenting code. An illustration for those who (like me) keep forgetting the order of the accumulator and the iterated value in reducers:

Prelude> :t foldl
foldl :: Foldable t => (b -> a -> b) -> b -> t a -> b

Yeah, if the language's ecosystem is tightly integrated with the underlying type system, amazing things are possible, from simple type signature lookups at your fingertips to full-blown type signature search engines like Hoogle and Serokell's hackage-search. Conversely, if the language ecosystem evolved independently from the type system, as with TypeScript layering gradual typing over JavaScript or Dialyzer adding success typings to Erlang, the self-documenting benefits and improved discoverability will likely be far less pronounced, if present at all.
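The same self-documenting effect of the foldl signature above can be sketched in Go with type parameters (Go 1.18 or later assumed). Foldl here is a hypothetical helper written for this post, specialised to slices instead of any Foldable.

```go
package main

import "fmt"

// Foldl mirrors the shape of Haskell's foldl for slices. The signature
// alone documents the argument order of the reducer: the accumulator (B)
// comes first, the iterated value (A) second.
func Foldl[A, B any](f func(acc B, x A) B, z B, xs []A) B {
	acc := z
	for _, x := range xs {
		acc = f(acc, x)
	}
	return acc
}

func main() {
	words := []string{"Viktor", "Tsoi"}

	// Swapping the reducer's parameters would not type-check, because
	// the accumulator is an int and the iterated value is a string.
	total := Foldl(func(acc int, w string) int { return acc + len(w) }, 0, words)
	fmt.Println(total) // prints 10
}
```

When A and B happen to be the same type, the signature stops protecting against swapped arguments, which is one more argument for keeping data types as concrete and distinct as possible.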

Type systems also enable and encourage writing composable code, meaning that the programmers are nudged towards writing well-structured and modular codebases, no matter what is the unit of modularity: a Java class or a Haskell module. Of course, there are ways to write spaghetti code in any language, but it's harder when the whole ecosystem imposes structure.

Drawbacks of type systems

Type systems aren't entirely free. Both the users (programmers) and the computer itself have to do extra work to write a program that would be accepted by a language with a type system. Sometimes, that work takes non-trivial amounts of computational time, but in practice, it's seldom more computationally intense than, say, code generation via templates.

Also, not quite a drawback, but rather something many people don't understand about type systems. There's an "if it compiles, it works" meme, but it's extremely misleading. Type systems, by their static nature, can't prove the presence of features of a program, only absence. But this yields a couple of actual drawbacks:

With type systems, you almost always pick your fights. Languages geared towards eliminating one kind of bad program behaviour won't eliminate another, perhaps side-stepping it altogether by deferring it to the language runtime.
Scrutiny, and thus power, of type systems is always in tension with expressiveness: the more programs that are well-behaved at runtime get rejected by the type checker, the less expressive the language is.
Sometimes complex type checking algorithms rely on heuristics, which limit expressiveness in ways, unexpected by the user.

Bottom line

We'll end this article with a nice heuristic for architects to think about using type systems. Perform the following thought experiment:

Approximate how much would it cost to write, deploy and run a 100%-coverage random testing system, searching for runtime errors (in our JS example, it would be a test making sure that there is no validator crashed by a string and there is no call in the program that crashes it).
Approximate how much would it cost to train developers to use a "tight" type system, which would give the same sort of guarantees "for free" and in static fashion (many different kinds of tests are still needed in this approach, but the tests we've described in step 1 are given to us for free by the type checker).

If cost 2 is even comparable to, let alone lower than, cost 1, going for using type systems to refuse incorrect programs is warranted. Quoting "The Toyota Way":

Focusing on quality actually reduced cost more than focusing only on cost.

Everything you need to know to write safe bash scripts

doma.dev — Fri, 09 Apr 2021 13:32:00 +0000

TL;DR

When to resort to shell scripts
- Portability is important
- Problem at hand is compact
- File system interaction
- Command-line program automation
When to search for an alternative
- Extensibility is required
- Coding mission-critical stuff
Shell scripting is dangerous, use shellcheck and limit yourself in idioms used

Computer consoles are the second most important UX improvement in computing, surpassed only by window managers. Consoles went a long way from allowing the computer user to enter programs to be executed by a single-process operating system to converting them into toolboxes. Thankfully, most of it was happening in Bell Labs under the supervision of the unstoppable innovator Douglas McIlroy.

The Multics "shell", just like inputs on other terminal-enabled computers of that era, was an instrument to accept a program and execute it on the Multics OS, the predecessor of UNIX. The earliest versions of the Multics shell already had input/output redirection. It wasn't until Douglas McIlroy discovered (not invented) command pipelines that the shell became a gold standard across all operating systems.

Pipelining and the UNIX philosophy allow for writing small problem-specific programs that can be composed later. But the biggest reason for the usage of UNIX shell scripts in 2021 is portability. Indeed, every modern system has a UNIX shell readily available. Furthermore, often making a reasonably well-written shell script is "good enough for the job". But how does one do that? Let's explore the answer to this question, assuming that the reader already knows the very basics of shell scripting.

Minimal shell scripts in bash

Let's get the most important consideration out of the way. Writing secure and reliable shell scripts is almost impossible. The least one can do is to use shellcheck, which has integration with VSCode. Furthermore, be very disciplined with user inputs and do your best to quote every argument that has a variable in it.

Alright, with that out of the way, let's talk about step-by-step items one has to do to make a reasonable shell script.

Setting up your shell environment

Before we start programming the shell, we first need to determine which shell scripting language we will use. There are three schools of thought about this:

Use bash by default. Bash is a reasonable middle ground between having a lot of features and being portable since it is shipped with each of the popular OS.
Use sh by default and bash when advanced features are needed. This is a purist approach. It offers the most portability but requires distinguishing between basic features and bash-exclusive features.
Use zsh, fish or some other "hipster" shell for everything. I'm mentioning this school of thought for completeness. Since it breaks portability, people who pick this option may just as well code in Python.

As one can guess, we suggest simply using bash for everything. Of course, Apple seems to be deprecating bash as the default shell, but it's not going anywhere from macOS systems. Conversely, Windows 10 has support for reasonable bash integration with its WSL programme. It requires some setup, but these days WSL2 seems to be becoming the default for Windows development.

Besides, bash scripts have fine-grained built-in support for lowering the impact of the inevitable bugs. While setting up your shell, we suggest you use the following options:


#!/usr/bin/env bash
set -euo pipefail

Optionally add -x for easier debugging.

Command line argument processing

If you, for some reason, want to use shell to write something huge like a full-blown issue tracker with git backend, you'll need to use getopts or make your own despatch system. In this post, however, we shall consider simple argument processing. We heavily advocate for short and concise scripts that do one thing and one thing only, after all.

First things first, let's see how to print help:


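# ${1:-} expands to an empty string when no argument is given, so the check survives set -u
if [[ "${1:-}" == "--help" || "${1:-}" == "-h" ]]; then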
  cat <<EOH
frgtmv: for each file read from STDIN, forget its filename entirely or amend part of it.
...
''frgtmv'' will then ''mv'' each of these files to ''\$(date +'%Y%m%d%H%M%S%N')'', preserving the file extension.
...
EOH
  exit
fi

Key points:

We're using <<EOH / EOH "heredoc" syntax and make sure that we don't indent the lines under it.
We're escaping special characters like $ with a backslash. If we didn't, bash would evaluate the $(...) in a subshell while printing the help.
Don't forget to exit after printing the help!

Now let's use -n, a predicate checking that a string is non-empty, to prepare the variables needed. Often we want to set up defaults at the beginning of the file:


_mode="forget"
_pattern_from=""
_replace_with=""

# ${1:-} and ${2:-} keep the checks safe under set -u when arguments are missing
if [ -n "${1:-}" ]; then
  _mode="amend"
  _pattern_from="$1"
fi

if [ -n "${2:-}" ]; then
  _replace_with="$2"
fi

Sometimes you need to exit if an argument is not supplied. For this, we use -z, which checks that a string is empty:


# Exit if the first argument is missing or empty
if [ -z "${1:-}" ]; then
  echo "We really need the first argument" >&2
  exit 228
fi

STDIN, pipes and GNU parallel

Sometimes you need to receive input from STDIN through a pipe or user input. It's done using read -r:


while read -r _x; do
  mv -v "$_x" "$(date +'%Y%m%d%H%M%S%N').${_x#*.}"
done

If you care about performance more than about portability, use cat - to pass your STDIN to GNU parallel, following this pattern:


function forget() {
  mv -v "$2" "$(date +'%Y%m%d%H%M%S%N').$1.${2#*.}"
}
export -f forget # (A)

if [[ "$_mode" == "forget" ]]; then
  cat - | parallel forget {%} {} # (B)
fi

In (A), the parallel payload is implemented as a bash function. Notice the export statement! We suggest writing payloads that receive two variables: the parallel job ID ({%}) and the currently read out item from the STDIN stream ({}). The payload is called from (B).

Here are some interesting parallel techniques:

--keep-order to guarantee that the order of the input will be kept. Requires several file handles per input, which may turn out to be a bottle-neck.
find . -print0 | parallel -0 f {} to work in null-terminated mode.
parallel 'echo "{%}:{1}:{2}";' ::: 1 2 ::: a b c will parallelise a Cartesian product of input sets {1,2} × {a,b,c}.

Bash parameter expansion

Confusing many, if not all new shell users, "parameter expansion" has a veil of mystery around it. In our humble opinion, there are several causes for this effect:

Parameter expansion is a blanket name that unites accessing values and substituting values.
Within these use-cases, there is a myriad of conditional behaviours and they are decided based on the kind of parameter.
There isn't much expansion going on. The word "expanded" is simply an arcane way to say "reduced to a value".

Let's start from the beginning. A parameter in bash is either a variable (like $HOME), a positional "argument" parameter (like $1), or a special parameter (like $@).

"Expansion" is the process of reduction of parameters to values. Variables expand in the way one would expect from variables. Special parameters, however, can have context-dependent expansions. Expansions have special syntaxes to tack on additional computations like string substitution, length calculation, etc.

Let's use the following variable as an example: x="a.b.c". Here is a list of the most often used parameter expansions, according to us:

Stripping. Use-case: get a file extension or remove a file extension.
- ${x%.*} ≡ a.b
- ${x%%.*} ≡ a
- ${x#*.} ≡ b.c
- ${x##*.} ≡ c
String replacement.
- ${x/./\!} ≡ a!b.c
- ${x//./\!} ≡ a!b!c
Array enumeration with IFS and [@].
- IFS="."; for v in ${x[@]}; do echo -n "($v)"; done ≡ (a)(b)(c)

You can also construct arrays with a "compound assignment":


x=(a b c)
for v in "${x[@]}"; do
    echo -n "($v)"
done

IFS manipulation is not normally needed. If you're changing the splitting context, you should see if there is another way. You might be falling victim to the XY problem.

This tutorial should provide good-enough techniques to quickly and effectively implement shell scripts that do what you want them to. Quote your variables, fail fast, make backups to recover from destructive changes, don't use too many "advanced features" since they are error-prone, and good luck!

We leave you with some shell scripts we wrote that push the boundaries of what shell should be used for:

If there is community interest, we will take some time to cover extensible shell scripting in Haskell with Turtle. As usual, reach out to us on Twitter or in the comments on Dev.to or Medium mirrors.

Parser combinators in Rust

doma.dev — Tue, 30 Mar 2021 20:57:59 +0000

TL;DR

Don't use regular expressions for parsing
Parser combinators are a way to construct composable computations with higher-order functions
- Examples:
- many1(digit1)
- alt((tag("hello"), tag("sveiki")))
- pair(description, preceded(space0, tags))
Parser combinators are easy to use to get results quickly
They are sufficient for 99% of pragmatic uses, falling short only if your library's sole purpose is parsing

Role of parsing in computing

Data processing is a pillar of computing. To run an algorithm, one must first build up some data structures in memory. The way to populate data structures is to get some raw data and load it into memory. Data scientists work with raw data, clean it and create well-formatted data sets. Programming language designers tokenise source code files and then parse those into abstract syntax trees. Web-scraper authors navigate scraped HTML and extract values of interest.

Informally, each of these steps can be called "parsing". This post talks about how to do complete, composable and correct parsing in anger. What do we mean by this?

Parsing in anger considers the problem of data transformation pragmatically. A theoretically optimal solution is not required. Instead, the goal is to write a correct parser as quickly as possible.
Composable parsing means that the resulting parser may consist of "smaller" components. It can itself be later on used as a component in "bigger" parsers.
Complete parsing means that the input shall be consumed entirely. If the input can have any deviations or errors, its author shall encode them in the resulting parser.

So how do we achieve it? Let's first talk about how to not do it.

Forget about regular expressions

Thanks to the popularity of the now perished Perl programming language, a whole generation of computer programmers was making futile attempts to parse non-regular languages with regular expressions. Regular expressions are no more than encodings of finite-state automata.

Items over arrows are characters of {0, 1} alphabet. Circles are states, q1 is "accepting state". Arrows denote state transitions.

Non-deterministic finite-state automata can rather elegantly accept many non-trivial languages, but not all of them. The classical example is that no regular expression exists that accepts exactly the strings of the form "ab", "aabb", "aaabbb", ... Equivalently, one can't solve the matching parentheses problem with a regular expression. The simplest stack machine is needed for that.

A stack automaton can be in several states at once. A state with no transitions "fizzles" on any input. (@* matches character '(' with any stack state. ε@ε matches instantaneously as the automaton gets to state p, but only if the stack is empty. The best introductory book for those interested in formal languages.
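
To make the "simplest stack machine" a bit more tangible, here is a minimal sketch in Rust. It assumes an alphabet of just '(' and ')', and the function name balanced is ours, chosen for illustration. With a single kind of bracket, the stack degenerates into a counter of currently open parentheses:

```rust
/// Accepts exactly the strings made of correctly nested '(' and ')'.
/// The counter plays the role of the stack: it is the one piece of
/// unbounded memory that a finite-state automaton does not have.
fn balanced(input: &str) -> bool {
    let mut depth: usize = 0;
    for c in input.chars() {
        match c {
            // Opening parenthesis: "push" by incrementing the counter.
            '(' => depth += 1,
            // Closing parenthesis: "pop", allowed only if something is open.
            ')' if depth > 0 => depth -= 1,
            // A stray ')' or a character outside of the alphabet.
            _ => return false,
        }
    }
    // Accept only if every opened parenthesis has been closed.
    depth == 0
}
```

The counter can grow without bound, which is exactly what a fixed, finite set of states can't express.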

Thus, regular expressions are nowhere close to providing enough facilities to work with context-free grammars. But they may be sufficiently powerful to clean data or extract some values, so why are we saying you shouldn't use them ever? Practicality reasons!

Let's take an example from some Regex Cookbook post (medium-paywalled link). This way we know it's an actual approach used in the industry. Here is one of the regular expressions the author offers:


^(((h..ps?|f.p):\/\/)?(?:([\w\-\.])+(\[?\.\]?)([\w]){2,4}|(?:(?:25[0-5]|2[0-4]\d|[01]?\d\d?)\[?\.\]?){3}(?:25[0-5]|2[0-4]\d|[01]?\d\d?)))*([\w\/+=%&_\.~?\-]*)$

Many can superficially understand what is going on here. This regex seemingly has something to do with links, but even when we resort to automated explanation, things don't get much clearer. Well, according to the author, this regex is supposed to detect "defanged" URLs. Now let's see all ways in which it and any other sufficiently large regular expression fail.

It is wrong: it doesn't match https://ctflearn.com/ (notice zero-width spaces).
It requires external tokenising, so no plug-and-play: it doesn't match ␣https://ctflearn.com/ (notice leading space).
External tokenisation is specific to this expression: it doesn't match https://ctflearn.com, (notice trailing comma).
It's impossible to fix it: matching optional characters around each printable character would turn it from a large and poorly readable piece of code into a huge and completely unreadable one. Your brain wouldn't even be able to guess h..ps and f.p bits.
It can't be used to extract values. Regexps don't "parse data into data structures". Rather they accept or decline strings. Thus, additional post-processing is required to make use of their output.

Regular expressions have intrinsic problems. To us, it means that only short expressions should be used. The author uses them exclusively with grep, find, and vim.

Thankfully, these days a better parsing methodology is becoming mainstream, with working libraries in all the popular languages. As you can guess from the title, it's called "parser combinators".

Step-by-step guide to composable parsing

In the spirit of our previous blogs, let's solve some practical task. Consider you have to write an interactive TODO application, the pinnacle of practicality. It specifies the following commands:

add ${some word}* ${some #hashtag}* (appends item ID)
done ${some item ID} (marks entry at item ID as resolved)
search ${some word or some #hashtag}+ (searches across entries, returns a list of matching item IDs)

Let's first define how we will represent parsed data, omitting the boring bits:


pub enum Entry {
    Done (Index),
    Add (Description, Vec<Tag>),
    Search (SearchParams),
}

Now let's use the nom library to enjoy expressive and declarative parsing. It has or used to have macro API and function API. Since in v5 of the library macro API was very glitchy, we shall use function API, which we have tested with v6.

We will be parsing the commands line by line. Begin with declaring the top-level parse for a line and meet your first parser combinator: alt.


pub fn command(input: &str) -> IResult<&str, Entry> { /* A */
    alt((done, add, search))(input) /* B */
}

In (A) we declare that our function command is a parser. IResult captures the parsed type (in our case, &str) and the output data structure (in our case, Entry).

In (B) we combine three parsers: add, done, and search with nom::branch::alt combinator. It attempts to apply each of these parsers starting from the leftmost until one succeeds.

Now, let's have a look at the simplest parser out of the three:


fn done(input: &str) -> IResult<&str, Entry> {
    let (rest, value) = preceded( /* A */
        pair(tag("done"), ws), /* B */
        many1(digit1) /* C */
    )(input)?; 
    Ok((
      rest,
      Entry::Done( /* D */
        Index::new( vec_to_u64(value) )
      ) 
    ))
}

The first combinator we see straight away is preceded. It forgets parse (B) and keeps only the output of (C). (B) will still consume input, however! Generally speaking, it combines two computations into a composition that runs both of them, returning what the second one returns. It is not the same as just running them in a sequence because here we build up a computation, but we will run it later on!

Interestingly, if we were writing Haskell we wouldn't find "preceded" combinator in our parser library. The reason is that what we described in the previous paragraph is called "right applicative arrow", or, as was coined during Ben Clifford's wonderful talk "right sparrow":


λ> :t (*>)
(*>) :: Applicative f => f a -> f b -> f b

The other two combinators are pretty self-explanatory. pair combines parsers into a sequence, with the ws parser being a parser that consumes a single whitespace character. Here is a naive definition of ws: one_of(" \t"). many1 repeats a digit1 parse at least one time to succeed. digit1 is implemented in nom itself.

Now let's solidify the understanding of how to make sure that our parsers can be used by others.

We have already discussed that to achieve that, we need to return IResult. Now it's time to remember that it's still a "Result" type, so its constructors are still Err and Ok:

Err variant of Result is constructed via the ? modifier, that passes any potential error arising in parse (A) through.
Ok variant of Result is constructed in (D) by transforming many1 output (which is a vector of digits) into an unsigned 64-bit integer. It's done with vec_to_u64 function, which is omitted for brevity.

The shape of Ok value for IResult<in, out> is Ok((rest: in, value: out)). Here rest is the remaining input to be parsed, and value is the output result of the parser. You can see that preceded parse in (A) followed the very same pattern.
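
To see this shape without pulling in any library, here is a small sketch of hand-rolled combinators written against std only. PResult, tag, number, preceded, alt2, done_index and greeting are our illustrative names. They imitate the look of nom's function API, but this is not how nom is implemented, and errors are plain strings for brevity:

```rust
// Hand-rolled stand-ins written against std only. They mirror the names of
// nom's combinators for readability, but this is NOT nom's implementation.

/// Same shape as described above: Ok((rest, value)) or an error.
type PResult<'a, T> = Result<(&'a str, T), String>;

/// Succeeds when the input starts with `literal`; returns the matched prefix.
fn tag<'a>(literal: &'static str) -> impl Fn(&'a str) -> PResult<'a, &'a str> {
    move |input: &'a str| match input.strip_prefix(literal) {
        Some(rest) => Ok((rest, &input[..literal.len()])),
        None => Err(format!("expected {:?}", literal)),
    }
}

/// Parses one or more leading ASCII digits into a u64.
fn number<'a>(input: &'a str) -> PResult<'a, u64> {
    let end = input
        .find(|c: char| !c.is_ascii_digit())
        .unwrap_or(input.len());
    if end == 0 {
        return Err("expected at least one digit".to_string());
    }
    let value = input[..end].parse::<u64>().map_err(|e| e.to_string())?;
    Ok((&input[end..], value))
}

/// Builds a parser that runs `first`, drops its output, then runs `second`
/// on whatever input `first` left behind.
fn preceded<'a, A, B>(
    first: impl Fn(&'a str) -> PResult<'a, A>,
    second: impl Fn(&'a str) -> PResult<'a, B>,
) -> impl Fn(&'a str) -> PResult<'a, B> {
    move |input: &'a str| {
        // `?` passes the error of `first` through, just like in `done`.
        let (rest, _ignored) = first(input)?;
        second(rest)
    }
}

/// Builds a parser that tries `left` and falls back to `right` on failure.
fn alt2<'a, T>(
    left: impl Fn(&'a str) -> PResult<'a, T>,
    right: impl Fn(&'a str) -> PResult<'a, T>,
) -> impl Fn(&'a str) -> PResult<'a, T> {
    move |input: &'a str| left(input).or_else(|_| right(input))
}

/// A simplified cousin of `done`: "done 42" yields 42.
fn done_index<'a>(input: &'a str) -> PResult<'a, u64> {
    preceded(tag("done "), number)(input)
}

/// The alternative from the TL;DR, built from two tags.
fn greeting<'a>(input: &'a str) -> PResult<'a, &'a str> {
    alt2(tag("hello"), tag("sveiki"))(input)
}
```

With these definitions, done_index("done 42 tail") evaluates to Ok((" tail", 42)): the remaining input comes first, the parsed value second. Because every parser has the same shape, the output of preceded or alt2 can be fed into yet another combinator.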

Here are more advanced parsers, that should solidify your intuition about how to use parser combinators in anger:


fn add(input: &str) -> IResult<&str, Entry> {    
  let (rest, (d, ts)) = preceded( /* B */
    pair(tag("add"), ws),                     
    pair(description, preceded(space0, tags)) /* A */
  )(input)?;
  Ok( (
    rest,
    Entry::Add( Description::new(&d), ts )
  ) )
}

fn search(input: &str) -> IResult<&str, Entry> {
  let (rest, mash) = preceded(
    pair(tag("search"), ws),
    separated_list(
      tag(" "),
      alt((tag_contents, search_word)) /* C */
    )
  )(input)?;
  Ok((rest, mash_to_entry(mash)))
}

fn mash_to_entry(mash: Vec<SearchWordOrTag>) -> Entry /* D */
{ /* ... */ }

Parsing with combinators is so self-descriptive, it's hard to find things that need to be clarified, but here are a couple of highlights:

Repeat preceded to focus on the data you need to parse out, see (A) and binding in (B).
Sometimes, you have to parse heterogeneous lists of things. The best way to do that in our experience is to create a separate data type to enclose this heterogeneity (SearchWordOrTag, in our case) and then use separated_list parser over alt of options, like in (C). Finally, when you have a vector of matches, you can fold it into a neater data structure as needed by using a conversion function (see (D)).
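
The folding step from the last highlight can be sketched with std only. The definitions of SearchWordOrTag and SearchParams are omitted in this post, so the shapes below are assumptions made for illustration, and mash_to_params is a hypothetical helper that stops one step short of wrapping the result into Entry::Search:

```rust
// Hypothetical shapes: the real definitions are omitted in the post.
#[derive(Debug, PartialEq)]
enum SearchWordOrTag {
    Word(String),
    Tag(String),
}

#[derive(Debug, Default, PartialEq)]
struct SearchParams {
    words: Vec<String>,
    tags: Vec<String>,
}

/// Folds the flat, mixed vector of matches into one field per kind of match,
/// preserving the relative order within each kind.
fn mash_to_params(mash: Vec<SearchWordOrTag>) -> SearchParams {
    mash.into_iter()
        .fold(SearchParams::default(), |mut acc, item| {
            match item {
                SearchWordOrTag::Word(w) => acc.words.push(w),
                SearchWordOrTag::Tag(t) => acc.tags.push(t),
            }
            acc
        })
}
```

Keeping the enum as a parsing-only type means the rest of the program never has to deal with the mixed list.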

This should be enough guidance for you to start getting comfortable with this amazing combinator-based parsing methodology. Here are some parting thoughts:

Pay close attention to whitespaces, which can be a little tricky, especially since we're not aware of the automatic tokenisation option in nom.
Take a look at choosing a combinator documentation page for the version of nom you are using (NB! entries in this table are pointing to macro versions of combinators rather than function versions).
If you so choose, you can check out code truly written in anger, which inspired the snippets in this blog post. The code is authored by Chris Höppner and Jonn Mostovoy.

If parsing is not your product or the main purpose of your library, odds are, parser combinators shall be sufficiently expressive and sufficiently performant for your tasks. We hope you liked this post and happy parsing!

If you have any questions, you can reach out to Jonn and Pola directly. Start the conversation in the comments of the mirrors of this article on Dev.to and Medium.

Pattern matching in Rust and other imperative languages

doma.dev — Thu, 18 Mar 2021 07:28:17 +0000

TL;DR

Rust is an imperative language that has the most pattern-related language facilities
- Has both shallow destructuring and deep destructuring
- if let matching form can be utilised to alleviate the lack of multiple-head functions
JavaScript has a lot of pattern-related language features
- Position-based destructuring for arrays and key-based for objects
- Rest parameters, supporting destructuring
- Shallow-copy spread operator
- With support from Microsoft, Facebook and NPM, proper pattern-matching in JS is inevitable
Python has the weakest support for pattern-related facilities
- Language support for pattern-matching is included in alpha (edit thanks to reddit)
- Packing/unpacking
C++ has powerful libraries for pattern matching. Language support is likely in C++23

All the time, ideas and approaches sift into the world of conventional programming languages from programming language theory research and the functional programming world. Even Excel has lambdas now!

In this post, we shall cover pattern matching in various imperative programming languages. We shall help you adopt pattern matching techniques to boost the expressiveness and conciseness of your code.

An example from a C++ evolution proposal.

Pattern matching in Rust

Rust has the most advanced and well-designed pattern system among all imperative languages. Part of it, of course, can be attributed to the fact that developers of Rust had the luxury of building a language from the ground up. But most significantly, it stems from the rigour and culture of design and development.

Pattern matching facilities in Rust language are almost as rich as in its older functional brother Haskell. To learn about them along with us, first, consider the following task (inspired by a real-life use-case):

Explore a non-strictly-structured JSON object where keys are species and values are sets of animals of these species.

If an animal's coat is fur or feathers, it's Cute, otherwise it's Weird. If a species is "aye-aye", it's Endangered. There may be new criteria discovered later on that change categorisation of a particular animal or species.

Categorise animals with distinct names found in the given data set!

So let's start with encoding the categories:


#[derive(Hash, Debug, PartialEq, Eq, PartialOrd, Ord)] /* A */
pub enum Category {
  Cute,
  Weird,
  Endangered,
}

(A) makes sure that Rust will order values from top to bottom, so that Cute < Weird < Endangered. This ordering will be important later on.
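
A quick way to convince yourself of this ordering is the following std-only sketch. The helper highest is our name and not part of the example module, and Clone and Copy are derived here only to keep the sketch short:

```rust
use std::cmp::max;

// Deriving PartialOrd and Ord orders the variants by declaration order.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
enum Category {
    Cute,
    Weird,
    Endangered,
}

/// Picks the higher-priority category of the two.
fn highest(a: Category, b: Category) -> Category {
    max(a, b)
}
```

Reordering the variants in the enum declaration would silently change the priorities, so the declaration order is part of the logic here.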

Now to encode the rules from the task. Since our JSON is unstructured, we can't rely on any property existing, so we can't safely unwrap or reliably coerce JSON to some Rust data structure:


fn cat_species(v: &str) -> Category {
  match v {
    "aye-aye" => Category::Endangered, /* A */
    _ => Category::Cute, /* B */
  }
}

Our first match! How exciting! This match is equivalent to switching over contents of variable v, of course. However, it offers more flexibility later on. With the power of destructuring, we can match complex structures, not just single variables.

(A) shows how to match a literal value, (B) shows the "catch-all" clause. This pattern match reads: species named "aye-aye" is endangered, other species are cute.

Now let's have a look at how to write something more interesting:


fn cat_animal_first_attempt(v: &Value) -> Category {
  match v["coat"].as_str() {
    Some("fur") | Some("feathers") => Category::Cute,
    _ => Category::Weird,
  }
}

The rule of cuteness is satisfied, no unwrapping used. There are also no explicit checks if the value has Some contents or it has None! This listing confidently states: animals with a fur coat or with a feather coat are cute, others are weird.
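
On Rust 1.53 and newer, the alternatives can also be nested inside Some. Here is a sketch, assuming the coat has already been looked up as an Option<&str>; the function cat_coat and the trimmed-down Category are ours, for illustration only:

```rust
#[derive(Debug, PartialEq)]
enum Category {
    Cute,
    Weird,
}

/// Same cuteness rule, with the or-pattern nested inside `Some`.
fn cat_coat(coat: Option<&str>) -> Category {
    match coat {
        Some("fur" | "feathers") => Category::Cute,
        // Any other coat, or no coat information at all.
        Some(_) | None => Category::Weird,
    }
}
```

Spelling out Some(_) | None instead of a bare _ is a matter of taste: it documents that the missing-value case was considered.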

But is this implementation good enough? One can check by considering a rule getting added, just as requirements warned us:

Animals that have the albino mutation are Endangered. Otherwise, previous rules apply.


fn cat_animal_first_attempt_1(v: &Value) -> Category {
  let cat = match v["coat"].as_str() { /* A */
    Some("fur") | Some("feathers") => Category::Cute, /* B */
    _ => Category::Weird,
  };
  match v["mutation"].as_str() {
    Some("albino") => Category::Endangered,
    _ => cat
  }
}

The snippet became bulky and boilerplate-y... We now have to thread some variable like in (A). We have to remember not to short-circuit computation in (B) by adding a return by accident. In case an additional rule pops up, we will need to decide between mutable cat or versioned.

So is this it? Pattern matching collapses the moment we need to capture some heterogeneous set of matches? Not quite. Let us introduce the if let statement, made just for this sort of challenge:


fn cat_animal(v: &Value) -> Category {
  if let Some("albino") = v["mutation"].as_str() {
    Category::Endangered
  } else if let Some("fur")
              | Some("feathers")
              = v["coat"].as_str() {
    Category::Cute
  } else {
    Category::Weird
  }
}

Now that's more like it. But wait, what does it mean? As with other pattern matches, the left hand side is a pattern (for instance, Some("albino")) and the right hand side is a value (for instance, v["mutation"].as_str()). A branch under if will get executed when and only when the LHS pattern matches the RHS value.

Pattern matching with if let syntax makes us start with the most specific clause and fall through to less specific clauses in an unambiguous order, taking away excessive liberty and thus making the code less error-prone.

Putting it all together


pub fn categorise(
  data: HashMap<String, Vec<Value>>,
) -> HashMap<Category, Vec<String>> {
  let mut retval = HashMap::new();
  for (species, animals) in data {
    for animal in animals {

      if let Some(name) = animal["name"].as_str() { /* A */
        retval
          .entry(max(cat_species(species.as_str()),
                     cat_animal(&animal))) /* B */
          .or_insert(Vec::new()) /* C */
          .push(name.to_string())
      }

    }
  }
  retval
}

Now that we have categorisation functions, we can proceed to categorise our data set. If the if let match in (A) fails (the current animal has no name supplied), we'll move to the next iteration. Not all the patterns have to have the catch-all arm.

Otherwise, the variable name will store the current animal's name and we will chain some functions from a handy HashMap API. In (B) we use the Ord instance of Category enum to determine the highest priority category between species-based categorisation and per-animal categorisation with std::cmp::max function.

Then HashMap's entry returns the reference to the value under the category. If there is None, or_insert in (C) inserts an empty vector and returns a reference to it. Finally, we can push the name of the current animal to this vector, and it will appear in our mapping!
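
The entry and or_insert chain works the same way on any HashMap. Here is a std-only sketch that groups words by their length; group_by_len is a made-up helper, unrelated to the example module:

```rust
use std::collections::HashMap;

/// Groups words by their length in bytes, keeping the input order per group.
fn group_by_len<'a>(words: &[&'a str]) -> HashMap<usize, Vec<&'a str>> {
    let mut groups = HashMap::new();
    for &word in words {
        groups
            .entry(word.len())     // look up the slot for this key
            .or_insert(Vec::new()) // create an empty vector if the slot is vacant
            .push(word);           // either way, we get a mutable reference
    }
    groups
}
```

The point of the chain is that the lookup, the optional initialisation and the update touch the map only once, without a separate contains_key check.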

We hope that this guide provides a reasonable introduction to pattern matching in Rust. See the full code of the example module on sourcehut.

Let's finish the post with some information about pattern-related features of other popular imperative languages.

Patterns in modern JavaScript


const foldAndDump = (path, xs, ...cutoffs) => {
  // snip
  for (const c of cutoffs) {
    //snap
  }
}

An old feature of ECMAScript (the JS standard) called "rest parameters": ...cutoffs will collect the arguments of a function beyond the second into an array called cutoffs.


var rs = [];
for (let [printing, info] of
     Object.entries(allPrintingsJson['data']))
{
    rs.push({ ...info, "_pv_set": printing });
}

When the ellipsis isn't in the argument list, it means that we're dealing with a newer feature called "spread syntax". ...info means "include info object as is". Analogously, spread syntax can spread an iterable object across arguments of a function call:


const xs = [1,2,3];
console.log(sum(...xs));

Finally, there is unpacking, which is a pretty standard feature by now:


> [a,b] = [1,2]
[1, 2]
> {x,y} = {y: a, x: b}
{ y: 1, x: 2 }
> {k,l} = {y: a, x: b}
{ y: 1, x: 2 }
> [a,b,x,y,k,l]
[1, 2, 2, 1, undefined, undefined]

Packing and unpacking in Python

In modern Python, any iterable is unpackable:


>>> a, *b, c = {'hello': 'world', 4: 2, 'rest': True, False: False}
>>> a, b, c
('hello', [4, 'rest'], False)

* is analogous to JS's ellipsis (...) operator. It can collect some "the rest of the values", but it can also work as a spread for iterables:


>>> print(*[1, 2, 3])
1 2 3

Conversely, in the spirit of Python, there's a special case operator called the "dictionary unpacking operator". It works very similarly to the spread operator:


>>> print({'x': True, **{'y': False}, **{'x': False, 'z': True}})
{'x': False, 'y': False, 'z': True}

The rightmost spread takes precedence.

Pack your bags: we're going pattern matching

Every single language that is in active development is looking to adopt more and more features from functional languages, and pattern matching is no exception.

We'll conclude this post with a list of languages that will adopt proper pattern matching, ranked by degree of certainty in adoption.

Pattern matching in C++

Pattern matching as seen in this evolution document is likely to land in C++23
While you wait, there's always a library or two that does a reasonable job mimicking the new standard

Pattern matching in JavaScript

Tied for the first place in "the most likely to adopt proper pattern matching", JavaScript's standard called "ECMAScript", has this proposal backed by Microsoft, Facebook and NPM.
The proposal is thoroughly reviewed and was moved to "stage 1", which puts the theoretical release of this feature in the 2023-2025 range.
You can check our maths by inspecting git logs in completed proposals repository.

Pattern matching in Python

There were different proposals throughout the history of Python, but PEP 634 got implemented
Alpha version of Python with "structural pattern matching" is available since March 1st (thanks to reddit for pointing our attention to it)

The idea of pattern matching is to branch code execution based on patterns instead of conditions. Instead of encoding the properties a value must have for a branch to execute, programmers who use pattern matching encode what the value should look like for it to happen. Thus, in imperative languages, pattern matching promises more expressive and declarative code compared to predicate statements such as if and case, bar some corner cases.

It might be a subtle difference, but once you get it, you add a very powerful way of expression to your arsenal.

We find that understanding these concepts is akin to the understanding of declarative vs imperative programming paradigms. To those interested in the philosophy of the matter, we suggest finding a cosy evening to curl up with a cup of steaming drink and watch Kevlin Henney's "declarative thinking, declarative practice" talk:

https://www.youtube-nocookie.com/embed/nrVIlhtoE3Y

Kevlin Henney: Declarative Thinking, Declarative Practice. ACCU 2016. Non-tracking YouTube embed.

The best hex editor that you have never heard of

doma.dev — Wed, 10 Mar 2021 12:57:59 +0000

From CSV to XML to JSON, humans sure love their structured data. Computers like it too. If you think about it, x86 assembly is not much more than a structured data format. The same is true for ELF, DWARF, protobuf…

PNGs, JPEGs and even MySQL database files are all structured binary formats. They can get corrupted or store hidden data, or you might simply need to patch something inside without pulling in heavy tools made for a particular file format.

To explore a binary file at a glance, we suggest using

hexdump -C to get the classic side-by-side view: addresses on the left, the hex representation of bytes in the middle and a best-effort rendering of the bytes as ASCII on the right
xxd -b in case you prefer to look at the binary representation instead of hexadecimal
od -S4 or strings -n4 to automatically search for ASCII strings that are 4 bytes or longer
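If the three-column layout is unfamiliar, the following Python sketch shows roughly how such a view is built. It is a simplified imitation, not the actual hexdump implementation: hexdump_c is a hypothetical helper, and the real tool formats its columns slightly differently.

```python
def hexdump_c(data: bytes, width: int = 16) -> list[str]:
    """Return lines in a simplified imitation of the `hexdump -C` layout."""
    pad = width * 3 - 1  # two hex digits per byte plus separating spaces
    lines = []
    for offset in range(0, len(data), width):
        chunk = data[offset:offset + width]
        hex_part = " ".join(f"{b:02x}" for b in chunk)
        # Printable ASCII is shown as-is, everything else becomes a dot.
        ascii_part = "".join(chr(b) if 32 <= b < 127 else "." for b in chunk)
        lines.append(f"{offset:08x}  {hex_part:<{pad}}  |{ascii_part}|")
    return lines
```

Calling print("\n".join(hexdump_c(open("file.in", "rb").read()))) would print the view for the file used later in this post.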

But what about editing?!

Hex editors are less-than-optimal tools

There is no clear “winner” in terms of hex editors. The options are abundant, yet the most popular hex editors tend to be the worst in terms of stability and handling of large files. Besides, hex editors tend to be extremely byte-oriented, so for proper bitwise processing, you have to look elsewhere.

Our choice is wxHexEditor. Here’s an example of an attempt to unscramble some 7-bit ASCII with it:

As you can see it’s easy to lose track of where we were, especially while trying to edit binary quickly with a hex editor.

That’s why in this post we are providing a step-by-step guide on how to use GNU poke to do binary transformations!

GNU poke, step by step

Step 1: WSL2-compatible Ubuntu LTS 20.04 Installation

Sadly, right now it’s impossible to easily install the main branch of poke without nix. Luckily, the poke authors have recently released v1:

sudo apt install tcl-dev libgc-dev \
 libjson-c-dev libreadline-dev # (1)
wget https://ftp.gnu.org/gnu/poke/poke-1.0.tar.gz
tar xzvf ./poke-1.0.tar.gz
mkdir poke-1.0/build && cd poke-1.0/build
../configure --prefix="$(pwd)" # (2)
make
make install

(1): These are the libraries that were required for no-GUI installation on a reasonably fresh Ubuntu installation.

(2): It’s very important to always override the default prefix with the project directory so that you don’t mess up your system. We symlink the binaries we want to use to ~/.local/bin after installing them in the project directory.

Step 2: Prepare output file

Let’s say you need to work with a file called file.in. If you change something while working in poke, the changes will be written to file.in right away, so you should prepare a sufficiently large file.out:

cat file.in > file.out; cat file.in >> file.out

Step 3: Describe input and output

First check if your file format is already described in the standard library, also known as “pickles”:

$ ls -1 pickles/ | grep pk
argp.pk
bmp.pk
bpf.pk
btf-dump.pk
btf.pk
color.pk
ctf.pk
dwarf-common.pk
dwarf-frame.pk
dwarf-pubnames.pk
dwarf-types.pk
dwarf.pk
elf.pk
id3v1.pk
leb128.pk
mbr.pk
pktest.pk
rgb24.pk
time.pk
ustar.pk

Now, describe the structure of your input and your output (we don’t have a relevant pickle, so we don’t load anything):

type InAtom = struct {
  uint<7> host;
  bit guest;
};
type Input = InAtom[];

type Output = struct {
  byte[] hosts;
  bit[] guests;
};
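For readers who don't know poke's syntax yet, here is a rough Python equivalent of reading the Input type. It assumes that each InAtom occupies exactly one byte, with the 7-bit host in the most significant bits and the guest in the least significant bit; parse_atoms is a hypothetical helper name.

```python
def parse_atoms(data: bytes) -> list[tuple[int, int]]:
    """Split every byte into a (host, guest) pair.

    Assumption: host is the upper 7 bits and guest is the lowest bit,
    which is how we read the InAtom struct above.
    """
    return [(byte >> 1, byte & 1) for byte in data]
```

For instance, the byte 0xC9 would be read as host 100 (the letter "d") with guest 1.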

Step 4: Transform input to output

It is possible to map transformations of data to files straight away, but it’s also possible to do more conventional iterative conversions between Input and Output:

fun solve = (Input xs) Output: {
  var resultHosts = byte[]();
  var resultGuests = bit[]();
  var bitsWrote = 0;
  var bytesWrote = 0;
  for (i in xs) {
    if (bitsWrote % 8 == 0) {
      resultGuests += [(0 as bit), i.guest];
      bitsWrote += 2;
    } else {
      resultGuests += [i.guest];
      bitsWrote += 1;
    }
    resultHosts += [(0 as bit):::i.host ];
    bytesWrote += 1;
  }
  return Output {
    hosts = resultHosts,
    guests = resultGuests
  };
};
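The same loop can be sketched in Python to show what it computes. This is an illustration rather than a translation that poke would accept: atoms are assumed to be (host, guest) tuples, and pack_bits, which pads the last byte with zero bits, is a hypothetical helper added so that the guest bits can be inspected as bytes.

```python
def solve(atoms):
    """Mirror of the poke loop: collect hosts as bytes and guests as bits."""
    hosts = bytearray()
    guest_bits = []
    for host, guest in atoms:
        # At the start of every output byte, prepend a zero bit so that
        # seven guest bits become one 7-bit ASCII character.
        if len(guest_bits) % 8 == 0:
            guest_bits.append(0)
        guest_bits.append(guest)
        # host is a 7-bit value, so its top bit as a byte is already 0.
        hosts.append(host)
    return bytes(hosts), guest_bits


def pack_bits(bits):
    """Pack bits into bytes, most significant bit first (assumed order)."""
    out = bytearray()
    for i in range(0, len(bits), 8):
        group = bits[i:i + 8]
        group = group + [0] * (8 - len(group))  # pad the last byte
        value = 0
        for bit in group:
            value = (value << 1) | bit
        out.append(value)
    return bytes(out)
```

With seven atoms whose guests are 1, 0, 0, 0, 0, 0, 1, the packed guests come out as the single byte 0b01000001, which is the letter "A".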

Step 5: Write output to a file

Now let’s write something like a “main” function, which will be responsible for reading file.in, processing it and producing file.out:

fun writeSolution = (string basename) bit:
{
  var fin = open(basename + ".in", IOS_F_READ | IOS_F_WRITE);
  var fout = open(basename + ".out", IOS_F_READ | IOS_F_WRITE);
  var input = Input @ fin : 0#B;
  var output = solve(input);
  printf("INPUT: %v\nOUTPUT: %v\n", input, output);

  /*** This doesn't work in 1.0 release for some reason: ***/
  /* Output @ fout : 0#B = output; */
  /*********************************************************/

  byte[] @ fout : 0#B = output.hosts;
  bit[] @ fout : (output.hosts'size) = output.guests;
  close(fin);
  close(fout);
  return 0;
};

The interesting bit here is the mapping of variables onto the IO space “fout”. When the mapping instruction is on the left-hand side and a value is on the right-hand side, it means that poke shall unwrap the contents of the value onto the IO space. In the case of files, it means that it shall write the contents of the value immediately.
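A loose analogy in Python is writing into an existing file at explicit offsets without truncating it. The sketch below is only an analogy under that assumption: write_solution and demo are hypothetical names, and offsets here are counted in bytes.

```python
import os
import tempfile


def write_solution(path, hosts: bytes, guests: bytes) -> None:
    # "r+b" updates the existing file in place, like writing to a mapped
    # IO space; it does not truncate the file.
    with open(path, "r+b") as fout:
        fout.seek(0)           # roughly: byte[] @ fout : 0#B = hosts
        fout.write(hosts)
        fout.seek(len(hosts))  # roughly: mapping right after the hosts
        fout.write(guests)


def demo() -> bytes:
    # Prepare a sufficiently large output file first, as in Step 2.
    fd, path = tempfile.mkstemp()
    try:
        with os.fdopen(fd, "wb") as prepared:
            prepared.write(b"x" * 8)
        write_solution(path, b"ab", b"c")
        with open(path, "rb") as result:
            return result.read()
    finally:
        os.remove(path)
```

The bytes that were in the file past the written region stay untouched, which is why leftover data is visible after the hosts and guests in the dump below.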

Step 6: Run it!

$ cat file.in > file.out && cat file.in >> file.out && ./poke/poke
     _____
 ---' __\_______
            ______ ) GNU poke 1.0
            __)
           __)
 ---. _______ )

...

For help, type ".help".
Type ".exit" to leave the program.
(poke) .load solve.poke
(poke) writeSolution ("file");
INPUT: [
  InAtom {host=100U,guest=1U},
  InAtom {host=111U,guest=1U},
  InAtom {host=109U,guest=0U},...,
  InAtom {host=95U,guest=0U},
  InAtom {host=95U,guest=1U}]
OUTPUT: Output {
  hosts=[100UB,111UB,...,95UB,95UB],
  guests=[0U,1U,1U,...,1U,0U,1U]}
(uint<1>) 0
(poke) .file file.out
(poke) dump
76543210 0011 2233 4455 6677 8899 aabb ccdd eeff 0123456789ABCDEF
00000000: 646f 6d61 7b77 3472 6d33 3537 5f77 336c doma{w4rm357_w3l
00000010: 636f 6d33 5f5f 5f74 305f 5f5f 7468 3135 com3___t0___th15
00000020: 5f5f 5f62 6c30 677d ef68 e5db 666b 6fbe ___bl0g }.h..fko.
00000030: ee66 d9c7 deda 66be bfbf e860 bfbf bfe9 .f....f....`....
00000040: d163 6bbf bebf .ck...

And you got the secret message!

Now go poke something!

This post provides sufficient techniques for you to start editing binary data without worrying about your hex editor crashing.

If you want to take a look at a functional programming approach, you’re welcome to read our blog on the doma.dev website, which covers using Erlang as a binary editor to make the same binary transformation.