<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Andrés Alejos</title>
    <description>The latest articles on DEV Community by Andrés Alejos (@acalejos).</description>
    <link>https://dev.to/acalejos</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1121017%2Fa865f94b-152c-4ee3-9e07-68c2c9050d8a.png</url>
      <title>DEV Community: Andrés Alejos</title>
      <link>https://dev.to/acalejos</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/acalejos"/>
    <language>en</language>
    <item>
      <title>Livebook: Elixir's Swiss Army Knife</title>
      <dc:creator>Andrés Alejos</dc:creator>
      <pubDate>Sun, 24 Mar 2024 21:54:25 +0000</pubDate>
      <link>https://dev.to/acalejos/livebook-elixirs-swiss-army-knife-1g14</link>
      <guid>https://dev.to/acalejos/livebook-elixirs-swiss-army-knife-1g14</guid>
      <description>&lt;h2&gt;
  
  
  The Problem of Adoption Friction
&lt;/h2&gt;

&lt;p&gt;Recently &lt;a href="https://youtu.be/XHxs-INUKz4?si=fUSTCtOcE4WlG_2Y&amp;amp;ref=thestackcanary.com"&gt;I gave a talk&lt;/a&gt; about my first year programming with Elixir and my experience contributing to Elixir's open-source ecosystem. In it, I made some points about how a big part of encouraging language adoption is to reduce friction.&lt;/p&gt;

&lt;p&gt;One common point of friction that many languages suffer from is a fractured tooling ecosystem: the language is unopinionated about things such as formatting, linting, testing, its build system, etc. This can be good for advanced users, who have the benefit of experience when deciding which tools to choose, giving them more options depending on their needs. It can, however, just as easily leave beginners wary of even starting. Many people credit Go with popularizing the practice of providing developers with standardized libraries and tooling for all parts of the development lifecycle. Providing code formatters, package managers, package repositories, LSPs, unit-testing libraries, etc. straight from the language's core team can drastically reduce adoption friction by preventing analysis paralysis, where people get so overwhelmed by the bevy of tooling options that they never get around to actually writing code.&lt;/p&gt;

&lt;p&gt;Elixir excels in this respect, providing tools such as ExDoc (for documentation), ExUnit (for testing), Mix (build tool), Hex (package manager), Mix format (for formatting), and more. When writing code, the only decisions left to make (for the most part) are the design and implementation details of your own code, rather than concerns that can be (and have been) standardized.&lt;/p&gt;

&lt;h2&gt;
  
  
  Understanding Your Audience
&lt;/h2&gt;

&lt;p&gt;In that same talk, I also spoke about the idea of "entry-points" to the language and how targeting them can help reduce adoption friction, especially in a world where most people are first introduced to other programming paradigms, such as object-oriented programming, as opposed to a functional language like Elixir. Here are some of the most common points of entry I came up with for Elixir.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--fMMkVRtf--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://www.thestackcanary.com/content/images/2024/03/elixir_points_of_entry.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--fMMkVRtf--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://www.thestackcanary.com/content/images/2024/03/elixir_points_of_entry.png" alt="Livebook: Elixir's Swiss Army Knife" width="792" height="612"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Elixir Points of Entry&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Understanding the perspective of your potential adopters is crucial in focusing where developer efforts should go, and in general a good first impression can do a lot to convince people to stay.&lt;/p&gt;

&lt;p&gt;First impressions can be made in all sorts of ways, from good landing pages, to detailed documentation, to convincing testimonials and showcases, but one common way is the place where many people first interact with the language: its REPL (or interactive session). In Elixir, this is IEx.&lt;/p&gt;

&lt;p&gt;Many people prefer, over reading documentation or source code, getting their hands dirty by immediately interacting with the language. They favor a hands-on approach to learning, which is a large reason why more and more languages are adopting code playgrounds as their getting-started guides. They want immediate feedback and the ability to iterate on quick ideas to gauge whether the language is worth investing more time into. There is only so much you can do from a REPL, though, so when they've reached the limits of exploration by that means, they might start working on a small script; but depending on which entry-point they came from, the next natural step might be quite different.&lt;/p&gt;

&lt;p&gt;Wouldn't it be great if there was a place that served as a nexus, where someone coming from the embedded world could be introduced to the language in much the same way as someone coming from the machine learning world, while still leaning into the very domains that brought them there?&lt;/p&gt;
&lt;h2&gt;
  
  
  Enter Livebook, the One-Stop-Shop
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://livebook.dev/?ref=thestackcanary.com"&gt;Livebook&lt;/a&gt; is an interactive Notebook-style Elixir application that lets you write reproducible, shareable, deployable, extensible, interactive, and integrated Elixir workflows (and even has support for Erlang!). With these characteristics in mind, Livebook is fast becoming the premier gateway application to introduce newcomers to all that the language has to offer. &lt;a href="https://dockyard.com/?ref=thestackcanary.com"&gt;Dockyard&lt;/a&gt; has even gone as far as writing all of the curriculum for &lt;a href="https://github.com/DockYard-Academy/curriculum?ref=thestackcanary.com"&gt;Dockyard Academy&lt;/a&gt; as Elixir Livebooks.&lt;/p&gt;

&lt;p&gt;For the last couple of months, I've been spending much of my time in Livebook, and have found that it really enhances my workflow when working on my libraries. I prepared much of my ElixirConf US 2023 talk as a Livebook (also published as a blog post &lt;a href="https://dev.to/acalejos/serving-spam-detection-with-xgboost-and-elixir-15k5"&gt;here&lt;/a&gt;), and I wrote the &lt;a href="https://www.thestackcanary.com/plotting-xgboost-trees-in-elixir/"&gt;documentation for the EXGBoost Plotting Module&lt;/a&gt; as a Livebook, in addition to working on several interactive libraries for Livebook (which I will touch on later).&lt;/p&gt;

&lt;p&gt;It's very quick to get started with Livebook, as it has standalone Mac and Windows installers that bundle Erlang/OTP and Elixir, so you can get started without any prerequisites. You also have the option to use Livebook Teams for collaboration features, or to run instances of Livebook in the cloud, so you don't have to worry about managing installs!&lt;/p&gt;
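&lt;p&gt;If you already have Elixir on your machine, you can also install and launch Livebook from the command line via an escript (a sketch of the route described in the Livebook README):&lt;/p&gt;

```shell
# Install the livebook escript from Hex, then start the server;
# it prints a localhost URL you can open in your browser
mix escript.install hex livebook
livebook server
```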

&lt;p&gt;Livebook is available in three different runtime options: &lt;strong&gt;standalone&lt;/strong&gt; if you want to use Livebook as its own instance of Elixir, &lt;strong&gt;attached&lt;/strong&gt; if you want Livebook to connect to an existing instance of Elixir, or &lt;strong&gt;embedded&lt;/strong&gt; if you want to install Livebook on an embedded device (see the &lt;a href="https://nerves-project.org/?ref=thestackcanary.com"&gt;Nerves Project&lt;/a&gt;).&lt;/p&gt;

&lt;p&gt;Livebook can be viewed as a next-generation, "supercharged IEx," and is still getting better. Let's go into some more detail regarding some of the earlier points I mentioned to show the power and versatility of Livebook.&lt;/p&gt;

&lt;p&gt;If you want to see some examples of Livebook in action, you can check out some apps I have deployed to HuggingFace Spaces &lt;a href="https://huggingface.co/spaces/acalejos/livebook-apps?ref=thestackcanary.com"&gt;here&lt;/a&gt;.&lt;/p&gt;
&lt;h3&gt;
  
  
  Reproducible
&lt;/h3&gt;

&lt;p&gt;A Livebook fundamentally consists of a series of code cells (or other cells, such as Markdown) which are executed in the order they appear (top-down). Working within the immutable nature of Elixir, each cell receives the output state of the previous cell as its input state and passes its own output state to the next cell, meaning that whenever a cell is run, all previous cells run first, since it depends on their collective state. This ensures reproducibility of workflows. Livebook also has the concepts of "Sections" and "Forks": you can fork from a section, meaning the initial state of the current section relies on the final state of the forked section. Livebook cell execution is therefore represented as a tree, which enables very flexible workflows.&lt;/p&gt;
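&lt;p&gt;As a rough sketch of this top-down state flow (plain Elixir, with comments marking hypothetical cell boundaries):&lt;/p&gt;

```elixir
# Cell 1 — bindings made here become part of the notebook state
x = 2

# Cell 2 — depends on Cell 1's state; re-running this cell causes
# any stale earlier cells to be re-evaluated first, so the result
# always comes from a fresh, top-down execution
y = x * 3

# Cell 3 — there are no hidden mutations: the state a cell sees
# is exactly what the cells above it produced
IO.puts("y = #{y}")
```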

&lt;p&gt;This is fundamentally different from other notebook execution environments such as &lt;a href="https://jupyter.org/?ref=thestackcanary.com"&gt;Jupyter Notebooks&lt;/a&gt;, which do not enforce that cells be executed in order and allow mutable changes, such that executing the same cell multiple times can have compounding effects on the state of the notebook.&lt;/p&gt;

&lt;p&gt;&lt;iframe class="tweet-embed" id="tweet-1759281738662944954-231" src="https://platform.twitter.com/embed/Tweet.html?id=1759281738662944954"&gt;
&lt;/iframe&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Shareable
&lt;/h3&gt;

&lt;p&gt;One of my favorite decisions the Livebook team made was making Livebook's &lt;code&gt;.livemd&lt;/code&gt; filetype perfectly valid Markdown, such that you can share Livebooks anywhere that Markdown can be shared. I've written several blog posts entirely within Livebook and then simply copied the &lt;code&gt;.livemd&lt;/code&gt; contents into a blog post as Markdown (see &lt;a href="https://dev.to/acalejos/serving-spam-detection-with-xgboost-and-elixir-15k5"&gt;here&lt;/a&gt; for an example).&lt;/p&gt;

&lt;p&gt;This is in stark contrast to the format used by Jupyter Notebooks, which uses a custom JSON schema to represent its notebooks, so if you want to share the notebook you either have to export to another format (such as HTML, or PDF) or rely on the hosting service to provide rendering (e.g. GitHub supports rendering &lt;code&gt;.ipynb&lt;/code&gt; files).&lt;/p&gt;

&lt;p&gt;Meta information is stored as Markdown comments, so the Livebook application itself can render &lt;strong&gt;additional details&lt;/strong&gt;. When exporting a Livebook to Markdown, you also have the option of whether to include the output of the cells, which again, is also just valid Markdown!&lt;/p&gt;

&lt;p&gt;Lastly, Livebook offers a "Run in Livebook" &lt;a href="https://livebook.dev/badge/?ref=thestackcanary.com"&gt;badge generator&lt;/a&gt; that you can place in the exported Markdown; the badge allows a one-click import of the Livebook. You can import Livebooks this way, or in many other ways, such as from local or cloud storage, from a URL, or from source.&lt;/p&gt;
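&lt;p&gt;For reference, the generated badge is itself just a Markdown image link, roughly of this shape (the notebook URL here is a placeholder):&lt;/p&gt;

```markdown
[![Run in Livebook](https://livebook.dev/badge/v1/blue.svg)](https://livebook.dev/run?url=https%3A%2F%2Fexample.com%2Fmy_notebook.livemd)
```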

&lt;h3&gt;
  
  
  Extensible
&lt;/h3&gt;

&lt;p&gt;Out of the box, Livebook comes outfitted with several great integrations, such as Smart Cells for Slack, databases (PostgreSQL, MySQL, Microsoft SQL Server, and more), plotting via VegaLite, and more. You can check out the full list of built-in integrations &lt;a href="https://livebook.dev/integrations/?ref=thestackcanary.com"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;What's great about these integrations is that most of them are built on top of the &lt;a href="https://hexdocs.pm/kino/Kino.html?ref=thestackcanary.com"&gt;Kino library&lt;/a&gt;, and are open-source! Anyone can contribute their own extensions using Kino and its various abstractions, such as &lt;a href="https://hexdocs.pm/kino/Kino.JS.html?ref=thestackcanary.com"&gt;Kino.JS&lt;/a&gt;, &lt;a href="https://hexdocs.pm/kino/Kino.JS.Live.html?ref=thestackcanary.com"&gt;Kino.JS.Live&lt;/a&gt;, and &lt;a href="https://hexdocs.pm/kino/Kino.SmartCell.html?ref=thestackcanary.com"&gt;Kino.SmartCell&lt;/a&gt;. These abstractions are built on top of well-established Elixir concepts such as &lt;a href="https://hexdocs.pm/elixir/GenServer.html?ref=thestackcanary.com"&gt;GenServers&lt;/a&gt;, meaning that you will either already be familiar with the abstraction, or learning it will give you experience you can apply in other Elixir projects. Writing custom Kinos is a great way to dip your toes into writing client/server behaviours in Elixir!&lt;/p&gt;
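&lt;p&gt;To give a flavor of the Kino.JS abstraction, here is a minimal custom Kino along the lines of the introductory example in the Kino docs; the module name and rendered text are illustrative:&lt;/p&gt;

```elixir
Mix.install([{:kino, "~> 0.12"}])

defmodule KinoGreeting do
  use Kino.JS

  # The Elixir side just packages data for the client
  def new(text) when is_binary(text) do
    Kino.JS.new(__MODULE__, text)
  end

  # The client side receives that data in init/2 and renders it
  asset "main.js" do
    """
    export function init(ctx, text) {
      ctx.root.textContent = text;
    }
    """
  end
end

# In a Livebook cell, the returned struct renders inline:
KinoGreeting.new("Hello from a custom Kino!")
```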

&lt;p&gt;I won't go into too much detail about the behaviours defined in Kino in this article, but if you're interested in my preliminary thoughts you can refer to this post where I spoke about it a bit:&lt;/p&gt;

&lt;p&gt;&lt;iframe class="tweet-embed" id="tweet-1759334543268430179-126" src="https://platform.twitter.com/embed/Tweet.html?id=1759334543268430179"&gt;
&lt;/iframe&gt;&lt;/p&gt;

&lt;p&gt;One example of this extensibility comes from &lt;a href="https://twitter.com/thmsmlr?ref=thestackcanary.com"&gt;Thomas Millar&lt;/a&gt;, who wanted hot-reloading in Livebook: as soon as a file the Livebook depends on changes, the dependent cell re-executes. So he went ahead and built it and shared it with the community!&lt;/p&gt;

&lt;p&gt;&lt;iframe class="tweet-embed" id="tweet-1762210503810507140-889" src="https://platform.twitter.com/embed/Tweet.html?id=1762210503810507140"&gt;
&lt;/iframe&gt;&lt;/p&gt;

&lt;p&gt;I've also built out a few Kinos for different use cases that I came across and figured I would share.&lt;/p&gt;

&lt;p&gt;Here's a simple Kino I built to embed YouTube videos using the &lt;a href="https://developers.google.com/youtube/player_parameters?ref=thestackcanary.com#Parameters"&gt;YouTube iframe API&lt;/a&gt;:&lt;/p&gt;


&lt;div class="ltag-github-readme-tag"&gt;
  &lt;div class="readme-overview"&gt;
    &lt;h2&gt;
      &lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--A9-wwsHG--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev.to/assets/github-logo-5a155e1f9a670af7944dd5e12375bc76ed542ea80224905ecaf878b9157cdefc.svg" alt="GitHub logo"&gt;
      &lt;a href="https://github.com/acalejos"&gt;
        acalejos
      &lt;/a&gt; / &lt;a href="https://github.com/acalejos/kino_youtube"&gt;
        kino_youtube
      &lt;/a&gt;
    &lt;/h2&gt;
    &lt;h3&gt;
      A simple Kino that wraps the YouTube Embedded iFrame API to render a YouTube player in a Livebook.
    &lt;/h3&gt;
  &lt;/div&gt;
  &lt;div class="ltag-github-body"&gt;
    
&lt;div id="readme" class="md"&gt;
&lt;div class="markdown-heading"&gt;
&lt;h1 class="heading-element"&gt;KinoYoutube&lt;/h1&gt;

&lt;/div&gt;
&lt;p&gt;&lt;a href="https://hex.pm/packages/kino_youtube" rel="nofollow"&gt;&lt;img src="https://camo.githubusercontent.com/37908c08d2618082941d108433c42991135388505faf13c4b6f16f494c6e5b22/68747470733a2f2f696d672e736869656c64732e696f2f686578706d2f762f6b696e6f5f796f75747562652e737667" alt="KinoYouTube version"&gt;&lt;/a&gt;
&lt;a href="https://hexdocs.pm/kino_youtube/" rel="nofollow"&gt;&lt;img src="https://camo.githubusercontent.com/93963020de3ea359d43b5c2fbbb77b70bd1cfe4efdb2ec8855b3284c06cfef96/68747470733a2f2f696d672e736869656c64732e696f2f62616467652f6865782d646f63732d6c69676874677265656e2e737667" alt="Hex Docs"&gt;&lt;/a&gt;
&lt;a href="https://hex.pm/packages/kino_youtube" rel="nofollow"&gt;&lt;img src="https://camo.githubusercontent.com/d59e72c784f1f6a285fbd9927fd387811a9ad3fcc47db923d67b12a8ef7123ac/68747470733a2f2f696d672e736869656c64732e696f2f686578706d2f64742f6b696e6f5f796f7574756265" alt="Hex Downloads"&gt;&lt;/a&gt;
&lt;a href="https://twitter.com/ac_alejos" rel="nofollow"&gt;&lt;img src="https://camo.githubusercontent.com/f04c05abb456693041eedc7374f087e27586e4a5c9efada78c6327de0d4bc9ed/68747470733a2f2f696d672e736869656c64732e696f2f747769747465722f666f6c6c6f772f61635f616c656a6f733f7374796c653d736f6369616c" alt="Twitter Follow"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;This Kino consists of only one function, &lt;code&gt;KinoYouTube.new/2&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;Refer to the &lt;a href="https://developers.google.com/youtube/player_parameters#Parameters" rel="nofollow"&gt;YouTube documentation&lt;/a&gt; for a list of accepted parameters&lt;/p&gt;
&lt;div class="markdown-heading"&gt;
&lt;h2 class="heading-element"&gt;Installation&lt;/h2&gt;

&lt;/div&gt;
&lt;p&gt;The package can be installed
by adding &lt;code&gt;kino_youtube&lt;/code&gt; to your list of dependencies in &lt;code&gt;mix.exs&lt;/code&gt;:&lt;/p&gt;
&lt;div class="highlight highlight-source-elixir notranslate position-relative overflow-auto js-code-highlight"&gt;
&lt;pre&gt;&lt;span class="pl-k"&gt;def&lt;/span&gt; &lt;span class="pl-en"&gt;deps&lt;/span&gt; &lt;span class="pl-k"&gt;do&lt;/span&gt;
  &lt;span class="pl-kos"&gt;[&lt;/span&gt;
    &lt;span class="pl-kos"&gt;{&lt;/span&gt;&lt;span class="pl-pds"&gt;:kino_youtube&lt;/span&gt;&lt;span class="pl-kos"&gt;,&lt;/span&gt; &lt;span class="pl-s"&gt;"~&amp;gt; 0.1"&lt;/span&gt;&lt;span class="pl-kos"&gt;}&lt;/span&gt;
  &lt;span class="pl-kos"&gt;]&lt;/span&gt;
&lt;span class="pl-k"&gt;end&lt;/span&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;p&gt;Documentation can be generated with &lt;a href="https://github.com/elixir-lang/ex_doc"&gt;ExDoc&lt;/a&gt;
and published on &lt;a href="https://hexdocs.pm" rel="nofollow"&gt;HexDocs&lt;/a&gt;. Once published, the docs can
be found at &lt;a href="https://hexdocs.pm/kino_youtube" rel="nofollow"&gt;https://hexdocs.pm/kino_youtube&lt;/a&gt;.&lt;/p&gt;
&lt;/div&gt;



&lt;/div&gt;
&lt;br&gt;
  &lt;div class="gh-btn-container"&gt;&lt;a class="gh-btn" href="https://github.com/acalejos/kino_youtube"&gt;View on GitHub&lt;/a&gt;&lt;/div&gt;
&lt;br&gt;
&lt;/div&gt;
&lt;br&gt;


&lt;p&gt;Kino has a set of built-in components such as &lt;a href="https://hexdocs.pm/kino/Kino.Control.html?ref=thestackcanary.com#button/1"&gt;Kino.Control.Button&lt;/a&gt;, &lt;a href="https://hexdocs.pm/kino/Kino.Control.html?ref=thestackcanary.com#form/2"&gt;Kino.Control.Form&lt;/a&gt;, and even &lt;a href="https://hexdocs.pm/kino/Kino.Input.html?ref=thestackcanary.com#audio/2"&gt;Kino.Input.Audio&lt;/a&gt;. For a project I was working on, I wanted the audio input to emit events live during a recording, but I found that the Kino.Input.Audio component only emits an upload event for the recording as a whole audio file, so you can't receive any audio data until the recording has finished. I decided to write my own Kino for live audio, and published it along with an app showcasing it.&lt;/p&gt;
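&lt;p&gt;The built-in controls follow an event-subscription model. A minimal sketch of how they are wired up (API per the Kino docs; the handler is purely illustrative):&lt;/p&gt;

```elixir
Mix.install([{:kino, "~> 0.12"}])

# Render a button and subscribe to its click events
button = Kino.Control.button("Click me")
Kino.render(button)

Kino.listen(button, fn event ->
  # Each click delivers an event map with origin metadata
  IO.inspect(event, label: "button event")
end)
```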


&lt;div class="ltag-github-readme-tag"&gt;
  &lt;div class="readme-overview"&gt;
    &lt;h2&gt;
      &lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--A9-wwsHG--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev.to/assets/github-logo-5a155e1f9a670af7944dd5e12375bc76ed542ea80224905ecaf878b9157cdefc.svg" alt="GitHub logo"&gt;
      &lt;a href="https://github.com/acalejos"&gt;
        acalejos
      &lt;/a&gt; / &lt;a href="https://github.com/acalejos/kino_live_audio"&gt;
        kino_live_audio
      &lt;/a&gt;
    &lt;/h2&gt;
    &lt;h3&gt;
      A Kino designed to record a raw audio stream (no client-side encoding) and emit events.
    &lt;/h3&gt;
  &lt;/div&gt;
  &lt;div class="ltag-github-body"&gt;
    
&lt;div id="readme" class="md"&gt;
&lt;div class="markdown-heading"&gt;
&lt;h1 class="heading-element"&gt;KinoLiveAudio&lt;/h1&gt;

&lt;/div&gt;
&lt;p&gt;&lt;a href="https://hex.pm/packages/kino_live_audio" rel="nofollow"&gt;&lt;img src="https://camo.githubusercontent.com/73135d380bb5b12c5fbd2f453a29f5dc25631a7567baf7f8e845f6f8305980b0/68747470733a2f2f696d672e736869656c64732e696f2f686578706d2f762f6b696e6f5f6c6976655f617564696f2e737667" alt="KinoLiveAudio version"&gt;&lt;/a&gt;
&lt;a href="https://hexdocs.pm/kino_live_audio/" rel="nofollow"&gt;&lt;img src="https://camo.githubusercontent.com/93963020de3ea359d43b5c2fbbb77b70bd1cfe4efdb2ec8855b3284c06cfef96/68747470733a2f2f696d672e736869656c64732e696f2f62616467652f6865782d646f63732d6c69676874677265656e2e737667" alt="Hex Docs"&gt;&lt;/a&gt;
&lt;a href="https://hex.pm/packages/kino_live_audio" rel="nofollow"&gt;&lt;img src="https://camo.githubusercontent.com/60cb630a85669d28bffaad6c137c169a640135ca18869ef306968e394704d504/68747470733a2f2f696d672e736869656c64732e696f2f686578706d2f64742f6b696e6f5f6c6976655f617564696f" alt="Hex Downloads"&gt;&lt;/a&gt;
&lt;a href="https://twitter.com/ac_alejos" rel="nofollow"&gt;&lt;img src="https://camo.githubusercontent.com/f04c05abb456693041eedc7374f087e27586e4a5c9efada78c6327de0d4bc9ed/68747470733a2f2f696d672e736869656c64732e696f2f747769747465722f666f6c6c6f772f61635f616c656a6f733f7374796c653d736f6369616c" alt="Twitter Follow"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;div class="markdown-heading"&gt;
&lt;h2 class="heading-element"&gt;Installation&lt;/h2&gt;

&lt;/div&gt;
&lt;p&gt;The package can be installed by adding &lt;code&gt;kino_live_audio&lt;/code&gt; to your list of dependencies in &lt;code&gt;mix.exs&lt;/code&gt;:&lt;/p&gt;
&lt;div class="highlight highlight-source-elixir notranslate position-relative overflow-auto js-code-highlight"&gt;
&lt;pre&gt;&lt;span class="pl-k"&gt;def&lt;/span&gt; &lt;span class="pl-en"&gt;deps&lt;/span&gt; &lt;span class="pl-k"&gt;do&lt;/span&gt;
  &lt;span class="pl-kos"&gt;[&lt;/span&gt;
    &lt;span class="pl-kos"&gt;{&lt;/span&gt;&lt;span class="pl-pds"&gt;:kino_live_audio&lt;/span&gt;&lt;span class="pl-kos"&gt;,&lt;/span&gt; &lt;span class="pl-s"&gt;"~&amp;gt; 0.1.0"&lt;/span&gt;&lt;span class="pl-kos"&gt;}&lt;/span&gt;
  &lt;span class="pl-kos"&gt;]&lt;/span&gt;
&lt;span class="pl-k"&gt;end&lt;/span&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;/div&gt;



&lt;/div&gt;
&lt;br&gt;
  &lt;div class="gh-btn-container"&gt;&lt;a class="gh-btn" href="https://github.com/acalejos/kino_live_audio"&gt;View on GitHub&lt;/a&gt;&lt;/div&gt;
&lt;br&gt;
&lt;/div&gt;
&lt;br&gt;


&lt;p&gt;&lt;a href="https://www.thestackcanary.com/#/portal/signup" class="ltag_cta ltag_cta--branded"&gt;I'm working on a 3-part series on custom Kinos in the near future. Subscribe to not miss out!&lt;/a&gt;
&lt;/p&gt;

&lt;h3&gt;
  
  
  Interactive
&lt;/h3&gt;

&lt;p&gt;As I showed above, you can add interactive experiences to Livebook using the Kino library. These interactions can be powered by the three major behaviours provided by the Kino library: Kino.JS, Kino.JS.Live, and Kino.SmartCell.&lt;/p&gt;

&lt;p&gt;Here's the example I mentioned showcasing the &lt;code&gt;kino_live_audio&lt;/code&gt; library I wrote and how to use it to make a live Voice Activity Detection app in Livebook!&lt;/p&gt;

&lt;p&gt;&lt;iframe class="tweet-embed" id="tweet-1761520038165520617-10" src="https://platform.twitter.com/embed/Tweet.html?id=1761520038165520617"&gt;
&lt;/iframe&gt;&lt;/p&gt;

&lt;p&gt;There are built-in interactive Kinos, such as the &lt;a href="https://github.com/livebook-dev/kino_bumblebee?ref=thestackcanary.com"&gt;&lt;code&gt;kino_bumblebee&lt;/code&gt;&lt;/a&gt; library, which provides a GUI for the Elixir &lt;a href="https://github.com/elixir-nx/bumblebee?ref=thestackcanary.com"&gt;Bumblebee&lt;/a&gt; library.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--bs8aR9Mz--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://www.thestackcanary.com/content/images/2024/03/Kino-Bumblebee-Token-Classification.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--bs8aR9Mz--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://www.thestackcanary.com/content/images/2024/03/Kino-Bumblebee-Token-Classification.png" alt="Livebook: Elixir's Swiss Army Knife" width="800" height="348"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;💡&lt;/p&gt;

&lt;p&gt;Want to know more about Bumblebee and other libraries mentioned throughout this article? Then you should read my &lt;a href="https://dev.to/acalejos/understanding-the-elixir-machine-learning-ecosystem-1ao1"&gt;Elixir ML Explained&lt;/a&gt; article.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;There's also support for the &lt;a href="https://vega.github.io/vega-lite/?ref=thestackcanary.com"&gt;VegaLite JavaScript library&lt;/a&gt; (via &lt;a href="https://hexdocs.pm/vega_lite/VegaLite.html?ref=thestackcanary.com"&gt;Elixir bindings&lt;/a&gt;) to build interactive plots and graphics. You can read more about the Vega-Lite capabilities within Livebook &lt;a href="https://github.com/livebook-dev/livebook/blob/main/lib/livebook/notebook/learn/intro_to_vega_lite.livemd?ref=thestackcanary.com"&gt;here&lt;/a&gt; and you can see some plotting with Vega in action in my post about &lt;a href="https://www.thestackcanary.com/plotting-xgboost-trees-in-elixir/"&gt;Plotting in XGBoost&lt;/a&gt;, which itself is just a Livebook with Vega plots of decision trees from XGBoost. You can see the &lt;code&gt;.livemd&lt;/code&gt; file &lt;a href="https://github.com/acalejos/exgboost/blob/main/notebooks/plotting.livemd?ref=thestackcanary.com"&gt;here&lt;/a&gt;.&lt;/p&gt;
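&lt;p&gt;A minimal interactive plot from Elixir looks something like this (a sketch using the vega_lite bindings together with kino_vega_lite for rendering; the sample data is made up):&lt;/p&gt;

```elixir
Mix.install([
  {:vega_lite, "~> 0.1"},
  {:kino_vega_lite, "~> 0.1"}
])

alias VegaLite, as: Vl

# Some sample points to plot
data = Enum.map(1..30, fn x -> %{"x" => x, "y" => :math.sin(x / 3)} end)

# The last expression in a Livebook cell renders the chart inline
Vl.new(width: 400, height: 300)
|> Vl.data_from_values(data)
|> Vl.mark(:line)
|> Vl.encode_field(:x, "x", type: :quantitative)
|> Vl.encode_field(:y, "y", type: :quantitative)
```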

&lt;h3&gt;
  
  
  Integrated
&lt;/h3&gt;

&lt;p&gt;As I mentioned before, Livebook has three runtime options, including an option to attach an instance of Livebook to a running Elixir node. This option provides a ton of flexibility and opportunity to not only use Livebook as the first step in prototyping, but to use it seamlessly during development and even in production.&lt;/p&gt;

&lt;p&gt;For example, you could use Livebook to create dashboards of live production systems, or otherwise introspect the state of a running system.&lt;/p&gt;

&lt;p&gt;Here's an example of Thomas Millar using Livebook to integrate with the &lt;a href="https://github.com/elixir-wallaby/wallaby?ref=thestackcanary.com"&gt;Wallaby&lt;/a&gt; library to ease the process of creating automated browser test code.&lt;/p&gt;

&lt;p&gt;&lt;iframe class="tweet-embed" id="tweet-1757128491034624249-343" src="https://platform.twitter.com/embed/Tweet.html?id=1757128491034624249"&gt;
&lt;/iframe&gt;&lt;/p&gt;

&lt;p&gt;Here's an example of using my &lt;a href="https://github.com/acalejos/merquery?ref=thestackcanary.com"&gt;Merquery&lt;/a&gt; SmartCell to automatically create Postman-like cells to test a Phoenix application's routes with the ability to pre-populate parameter information as well.&lt;/p&gt;

&lt;p&gt;&lt;iframe class="tweet-embed" id="tweet-1756551191079592155-535" src="https://platform.twitter.com/embed/Tweet.html?id=1756551191079592155"&gt;
&lt;/iframe&gt;&lt;/p&gt;

&lt;p&gt;Since Livebook is either attached to a running Elixir node or is a standalone node itself, you can interact with it like you would with any other node in an Elixir cluster!&lt;/p&gt;
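&lt;p&gt;Concretely, that means ordinary distributed-Elixir calls work from a Livebook cell; the node name below is hypothetical, and both nodes would need matching cookies:&lt;/p&gt;

```elixir
# Connect to another node in the cluster
target = :"myapp@127.0.0.1"
Node.connect(target)

# Inspect the cluster, then call code on the remote node
Node.list()
:erpc.call(target, System, :version, [])
```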

&lt;h3&gt;
  
  
  Deployable
&lt;/h3&gt;

&lt;p&gt;Livebook isn't just for prototyping, either! Livebook apps can be deployed either locally or into the cloud (with &lt;a href="https://livebook.dev/integrations/hugging-face/?ref=thestackcanary.com"&gt;first-class support for HuggingFace Spaces&lt;/a&gt;). There are many options when deploying Livebook apps, including support for multi-session apps (apps where each new connection spawns a new instance of the app) and password-protected apps. You can choose whether or not to expose the underlying source code to users, meaning you can hide proprietary implementation details if you want, or let others learn from your implementation.&lt;/p&gt;

&lt;p&gt;You also have the option to only include rich output (e.g. the interactive cells I spoke about earlier) in the deployed app, meaning you can create very rich and complex apps and control what gets shown in the deployed app.&lt;/p&gt;

&lt;p&gt;Here are some examples of deployed Livebook applications, where you can see everything I just discussed in action.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://hugobarauna-livebook.hf.space/apps"&gt;https://hugobarauna-livebook.hf.space/apps&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://acalejos-livebook-apps.hf.space"&gt;https://acalejos-livebook-apps.hf.space&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Bringing It All Together
&lt;/h2&gt;

&lt;p&gt;Let's look back at the common Elixir entrypoints I mentioned at the top of this article:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--fMMkVRtf--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://www.thestackcanary.com/content/images/2024/03/elixir_points_of_entry.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--fMMkVRtf--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://www.thestackcanary.com/content/images/2024/03/elixir_points_of_entry.png" alt="Livebook: Elixir's Swiss Army Knife" width="792" height="612"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I just showed how Livebook has a ton to offer regardless of which of these points of entry you're coming from. You can get exposed to backend development since you can write any Elixir code. Kinos are generally implemented as GenServers, which is a core pattern to become familiar with when writing backend Elixir code. You can also use Livebook as one node in a cluster and practice distributed Elixir.&lt;/p&gt;

&lt;p&gt;Livebook itself is a Phoenix LiveView app, so if nothing else, you can refer to its source code for good Phoenix practices, or write Kinos for frontend practice (albeit not necessarily with Phoenix).&lt;/p&gt;

&lt;p&gt;Livebook is great for machine learning due to the ability to perform exploratory data analysis with libraries such as &lt;a href="https://github.com/elixir-explorer/explorer?ref=thestackcanary.com"&gt;Explorer&lt;/a&gt; / &lt;a href="https://github.com/livebook-dev/kino_explorer?ref=thestackcanary.com"&gt;Kino Explorer&lt;/a&gt;, create new workflows like I showed before with &lt;a href="https://dev.to/acalejos/serving-spam-detection-with-xgboost-and-elixir-15k5"&gt;&lt;em&gt;Serving Spam Detection with XGBoost and Elixir&lt;/em&gt;&lt;/a&gt;&lt;em&gt;,&lt;/em&gt; work with SmartCells (like &lt;code&gt;kino_bumblebee&lt;/code&gt;) to learn and generate code snippets, and much more!&lt;/p&gt;

&lt;p&gt;Lastly, Livebook even works great for folks working in embedded projects since it supports an embedded runtime. In fact, the Nerves Project supports a &lt;a href="https://github.com/nerves-livebook/nerves_livebook/blob/main/README.md?ref=thestackcanary.com"&gt;firmware image&lt;/a&gt; that comes pre-loaded with Livebook.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;I hope I've convinced you not only to give Livebook a try, but that it deserves a permanent spot in your developer toolkit. Whenever someone asks me how to get started with Elixir these days, I point them to Livebook. It has so much to offer and can seamlessly grow with you as a developer: when you're just learning the language it serves as a learning tool, and it can scale with you all the way to production use cases.&lt;/p&gt;

</description>
      <category>elixir</category>
    </item>
    <item>
      <title>Hacking Phoenix LiveUpload</title>
      <dc:creator>Andrés Alejos</dc:creator>
      <pubDate>Mon, 15 Jan 2024 23:44:20 +0000</pubDate>
      <link>https://dev.to/acalejos/hacking-phoenix-liveupload-3do8</link>
      <guid>https://dev.to/acalejos/hacking-phoenix-liveupload-3do8</guid>
<description>&lt;p&gt;The Phoenix Framework is a simply spectacular Server-Side-Rendered (SSR) web-development framework written in Elixir. Phoenix supports reactive functionality through its WebSocket-based library called LiveView, which makes writing dynamic SSR web-apps a breeze. One of the built-in features of LiveView is its support of &lt;a href="https://hexdocs.pm/phoenix_live_view/uploads.html?ref=thestackcanary.com"&gt;Uploads&lt;/a&gt; out of the box. Some of the features, as described in the documentation, are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Accept specification - Define accepted file types, max number of entries, max file size, etc. When the client selects file(s), the file metadata is automatically validated against the specification. See &lt;a href="https://hexdocs.pm/phoenix_live_view/Phoenix.LiveView.html?ref=thestackcanary.com#allow_upload/3"&gt;&lt;code&gt;Phoenix.LiveView.allow_upload/3&lt;/code&gt;&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;Reactive entries - Uploads are populated in an &lt;code&gt;@uploads&lt;/code&gt; assign in the socket. Entries automatically respond to progress, errors, cancellation, etc.&lt;/li&gt;
&lt;li&gt;Drag and drop - Use the &lt;code&gt;phx-drop-target&lt;/code&gt; attribute to enable. See &lt;a href="https://hexdocs.pm/phoenix_live_view/Phoenix.Component.html?ref=thestackcanary.com#live_file_input/1"&gt;&lt;code&gt;Phoenix.Component.live_file_input/1&lt;/code&gt;&lt;/a&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;LiveView even includes an out-of-the-box solution for rendering previews of your uploads using the &lt;a href="https://hexdocs.pm/phoenix_live_view/Phoenix.Component.html?ref=thestackcanary.com#live_img_preview/1"&gt;&lt;code&gt;live_img_preview/1&lt;/code&gt;&lt;/a&gt; component, but notably, these previews only work for image uploads by default since it simply adds the file data as a &lt;code&gt;Blob&lt;/code&gt; to an &lt;code&gt;img&lt;/code&gt; tag.&lt;/p&gt;

&lt;p&gt;In this article, I will walk you through how the Phoenix Live Image Preview works, and show you how to customize it so that you can add a custom rendering solution to render previews for other types of files – in particular I will be showing how to render a PDF preview using Mozilla's &lt;a href="https://github.com/mozilla/pdf.js?ref=thestackcanary.com"&gt;pdf.js&lt;/a&gt; library.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgmq08jhmf9glwbzbcjd1.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgmq08jhmf9glwbzbcjd1.gif" alt="Demo" width="900" height="800"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The code for this project can be found &lt;a href="https://github.com/acalejos/phoenix-upload-pdf-preview?ref=thestackcanary.com"&gt;here&lt;/a&gt; at my GitHub. Feel free to give me a follow on there to see what else I'm up to.&lt;/p&gt;

&lt;h2&gt;
  
  
  How Do Phoenix Uploads Work?
&lt;/h2&gt;

&lt;p&gt;The Phoenix documentation is pretty thorough in describing how uploads are handled, but I will give a brief synopsis here for convenience.&lt;/p&gt;

&lt;p&gt;All uploads held in a session are stored in a reserved assigns variable, &lt;code&gt;@uploads&lt;/code&gt;. If you try to set or update the &lt;code&gt;uploads&lt;/code&gt; assign manually, you will get an error, since Phoenix reserves that particular key in the assigns. You enable uploads for a particular page using &lt;a href="https://hexdocs.pm/phoenix_live_view/Phoenix.LiveView.html?ref=thestackcanary.com#allow_upload/3"&gt;&lt;code&gt;Phoenix.LiveView.allow_upload/3&lt;/code&gt;&lt;/a&gt;, which accepts a specification covering the allowed file types, max number of entries, max file size, and more. The first parameter to &lt;code&gt;allow_upload&lt;/code&gt; is a key identifying a specific upload within the &lt;code&gt;uploads&lt;/code&gt; assign, meaning you can have multiple upload "sessions" at once.&lt;/p&gt;
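
&lt;p&gt;As a quick illustration of that last point (a hypothetical sketch with made-up keys, not code from this project), you can simply call &lt;code&gt;allow_upload/3&lt;/code&gt; once per key:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def mount(_params, _session, socket) do
  socket =
    socket
    # Two independent upload "sessions", keyed by :avatar and :documents
    |&amp;gt; allow_upload(:avatar, accept: ~w(.jpg .png), max_entries: 1)
    |&amp;gt; allow_upload(:documents, accept: ~w(.pdf), max_entries: 3)

  {:ok, socket}
end
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;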

&lt;p&gt;One of the keys you may optionally specify is &lt;code&gt;:writer&lt;/code&gt;, which points to a module implementing the &lt;a href="https://hexdocs.pm/phoenix_live_view/Phoenix.LiveView.UploadWriter.html?ref=thestackcanary.com"&gt;UploadWriter&lt;/a&gt; behaviour and dictates what to do with the chunks of each uploaded file before the uploads are consumed. If you do not specify this option, the default &lt;a href="https://github.com/phoenixframework/phoenix_live_view/blob/v0.20.3/lib/phoenix_live_view/upload_tmp_file_writer.ex?ref=thestackcanary.com"&gt;&lt;code&gt;UploadTmpFileWriter&lt;/code&gt;&lt;/a&gt; will be used.&lt;/p&gt;

&lt;p&gt;You then use the &lt;a href="https://hexdocs.pm/phoenix_live_view/Phoenix.Component.html?ref=thestackcanary.com#live_file_input/1"&gt;&lt;code&gt;Phoenix.Component.live_file_input/1&lt;/code&gt;&lt;/a&gt; component to render the file input. As files are uploaded through it, data about each entry (progress, errors, and so on) is populated into the corresponding upload in the &lt;code&gt;@uploads&lt;/code&gt; assign.&lt;/p&gt;

&lt;p&gt;The writer is the mediator between a file being uploaded into the &lt;code&gt;live_file_input&lt;/code&gt; and it subsequently being consumed using &lt;a href="https://hexdocs.pm/phoenix_live_view/Phoenix.LiveView.html?ref=thestackcanary.com#consume_uploaded_entries/3"&gt;&lt;code&gt;Phoenix.LiveView.consume_uploaded_entries/3&lt;/code&gt;&lt;/a&gt;. How you consume the uploads is dictated by the writer's &lt;code&gt;meta/1&lt;/code&gt; implementation. For example, here is how the default writer works, along with the canonical example of consuming files:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;defmodule Phoenix.LiveView.UploadTmpFileWriter do
  @moduledoc false

  @behaviour Phoenix.LiveView.UploadWriter

  # Other callbacks omitted for brevity

  @impl true
  def meta(state) do
    %{path: state.path}
  end
end

defmodule Test.DemoLive do
  def handle_event("save", _params, socket) do
    uploaded_files =
      consume_uploaded_entries(socket, :avatar, fn %{path: path}, _entry -&amp;gt;
        dest = Path.join("priv/static/uploads", Path.basename(path))
        File.cp!(path, dest)
        {:ok, Routes.static_path(socket, "/uploads/#{Path.basename(dest)}")}
      end)
    {:noreply, update(socket, :uploaded_files, &amp;amp;(&amp;amp;1 ++ uploaded_files))}
  end
end
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Notice that the last parameter of &lt;code&gt;consume_uploaded_entries&lt;/code&gt; is a function whose first argument is the output of the writer's &lt;code&gt;meta&lt;/code&gt; implementation, so if you find yourself implementing your own writer, keep this in mind. In our case, we are not going to change the writer, since the modifications we are making reside mostly on the JavaScript side.&lt;/p&gt;
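
&lt;p&gt;To make that relationship concrete, here is a minimal sketch of a custom writer (my own illustration, not code from LiveView or this project) that buffers chunks in memory and exposes the final binary through &lt;code&gt;meta/1&lt;/code&gt;. The map it returns is exactly what the consume callback receives as its first argument:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;defmodule MyApp.MemoryWriter do
  @behaviour Phoenix.LiveView.UploadWriter

  @impl true
  def init(_opts), do: {:ok, %{chunks: []}}

  @impl true
  def write_chunk(data, state), do: {:ok, %{state | chunks: [data | state.chunks]}}

  # Whatever this returns is passed to the consume_uploaded_entries/3 callback
  @impl true
  def meta(state), do: %{binary: state.chunks |&amp;gt; Enum.reverse() |&amp;gt; IO.iodata_to_binary()}

  @impl true
  def close(state, _reason), do: {:ok, state}
end

# Consuming then matches on the %{binary: ...} meta instead of %{path: ...}:
# consume_uploaded_entries(socket, :demo, fn %{binary: binary}, _entry -&amp;gt;
#   {:ok, byte_size(binary)}
# end)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;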

&lt;h2&gt;
  
  
  Adding Custom Preview Rendering
&lt;/h2&gt;

&lt;p&gt;Now I will discuss the necessary steps to add custom preview rendering to Phoenix Live Uploads. This will assume you have set up your Phoenix project with the appropriate routing and database connection, so we will not be discussing those portions. Refer to the &lt;a href="https://hexdocs.pm/phoenix/up_and_running.html?ref=thestackcanary.com"&gt;getting started guide&lt;/a&gt; if needed.&lt;/p&gt;

&lt;p&gt;The way-too-simplified explanation of what I'm going to show: we render the PDF client-side, convert the result to image data, and store that data as an HTML attribute on the corresponding element. A unique entry reference ID is used to match the correct element and preview data between the server and client, but in order to generate the ID without disrupting other upload types we must go through the existing &lt;code&gt;LiveUploader&lt;/code&gt; instance that LiveView uses, via the &lt;code&gt;UploadEntry&lt;/code&gt; class.&lt;/p&gt;

&lt;p&gt;To adapt this method to other file extensions, just change your rendering method and how you match on file extension.&lt;/p&gt;

&lt;p&gt;The files in the repository that you should be concerned with are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://github.com/acalejos/phoenix-upload-pdf-preview/tree/main/lib/upload_pdf_preview_web/live/demo_live?ref=thestackcanary.com"&gt;&lt;code&gt;index.ex&lt;/code&gt;&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/acalejos/phoenix-upload-pdf-preview/blob/main/assets/js/app.js?ref=thestackcanary.com"&gt;&lt;code&gt;app.js&lt;/code&gt;&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/acalejos/phoenix-upload-pdf-preview/blob/5a96684eae4c89dd8699b8d88961d3076584734b/lib/upload_pdf_preview_web/components/layouts/root.html.heex?ref=thestackcanary.com#L13"&gt;&lt;code&gt;root.html.heex&lt;/code&gt;&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You will have to make one modification to a file that is not included in the GitHub repository. After you run &lt;code&gt;mix deps.get&lt;/code&gt;, go to the file located at &lt;code&gt;deps/phoenix_live_view/priv/static/phoenix_live_view.esm.js&lt;/code&gt;; at the bottom you will see where Phoenix LiveView declares its JavaScript exports. By default it only exports the &lt;code&gt;LiveSocket&lt;/code&gt; class, but you will need to add &lt;code&gt;UploadEntry&lt;/code&gt; as an export. This will be used within &lt;code&gt;app.js&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;//deps/phoenix_live_view/priv/static/phoenix_live_view.esm.js
// Bottom of the file

// Before modifications
export { 
  LiveSocket
};

// After modifications
export { 
  LiveSocket, 
  UploadEntry 
};
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Modifying &lt;code&gt;index.ex&lt;/code&gt;
&lt;/h2&gt;

&lt;p&gt;This is where all of the server-side code for the uploads resides. Let's start by enabling uploads. We need to choose a key under which the uploads for this input batch will be stored. We will use the key &lt;code&gt;:demo&lt;/code&gt; here:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;defmodule UploadPdfPreviewWeb.DemoLive.Index do
  use UploadPdfPreviewWeb, :live_view

  @impl true
  def mount(_params, _session, socket) do
    socket =
      socket
      |&amp;gt; assign(:uploaded_files, [])
      |&amp;gt; allow_upload(
        :demo,
        accept: ~w(.pdf .jpg .jpeg .png .tif .tiff),
        max_entries: 5
      )

    {:ok, socket}
  end
  ...
end
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;So in this example we are accepting files with extensions &lt;code&gt;.pdf&lt;/code&gt;, &lt;code&gt;.jpg&lt;/code&gt;, &lt;code&gt;.jpeg&lt;/code&gt;, &lt;code&gt;.png&lt;/code&gt;, &lt;code&gt;.tif&lt;/code&gt;, and &lt;code&gt;.tiff&lt;/code&gt;. We are limiting an upload batch to at most five entries. We also add another assign separate from the &lt;code&gt;@uploads&lt;/code&gt; assign where we will store uploaded files once they are consumed.&lt;/p&gt;

&lt;p&gt;Most of the modifications we will make are in the client-side hooks that are invoked from the &lt;code&gt;live_file_input&lt;/code&gt; and &lt;code&gt;live_img_preview&lt;/code&gt; components. Since the hooks are hard-coded in the components, I looked at what the components do under the hood and copied the defaults, changing only the hooks. For &lt;code&gt;live_file_input&lt;/code&gt;, we add a custom hook called &lt;code&gt;LiveFileUpload&lt;/code&gt;, and for &lt;code&gt;live_img_preview&lt;/code&gt; we conditionally include a hook &lt;code&gt;LivePdfPreview&lt;/code&gt; only for PDF files, otherwise attaching the default hook.&lt;/p&gt;

&lt;p&gt;So now the custom &lt;code&gt;live_file_input&lt;/code&gt; looks &lt;a href="https://github.com/acalejos/phoenix-upload-pdf-preview/blob/5a96684eae4c89dd8699b8d88961d3076584734b/lib/upload_pdf_preview_web/live/demo_live/index.ex?ref=thestackcanary.com#L53"&gt;like this&lt;/a&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;input
  id={@uploads.demo.ref}
  type="file"
  name={@uploads.demo.name}
  accept={@uploads.demo.accept}
  data-phx-hook="LiveFileUpload"
  data-phx-update="ignore"
  data-phx-upload-ref={@uploads.demo.ref}
  data-phx-active-refs={join_refs(for(entry &amp;lt;- @uploads.demo.entries, do: entry.ref))}
  data-phx-done-refs={
    join_refs(for(entry &amp;lt;- @uploads.demo.entries, entry.done?, do: entry.ref))
  }
  data-phx-preflighted-refs={
    join_refs(for(entry &amp;lt;- @uploads.demo.entries, entry.preflighted?, do: entry.ref))
  }
  data-phx-auto-upload={@uploads.demo.auto_upload?}
  multiple={@uploads.demo.max_entries &amp;gt; 1}
  class="sr-only"
/&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And the custom &lt;code&gt;live_img_preview&lt;/code&gt; looks &lt;a href="https://github.com/acalejos/phoenix-upload-pdf-preview/blob/5a96684eae4c89dd8699b8d88961d3076584734b/lib/upload_pdf_preview_web/live/demo_live/index.ex?ref=thestackcanary.com#L86"&gt;like this&lt;/a&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;lt;img
    id={"phx-preview-#{entry.ref}"}
    data-phx-upload-ref={entry.upload_ref}
    data-phx-entry-ref={entry.ref}
    data-phx-hook={
      if entry.client_type == "application/pdf",
        do: "LivePdfPreview",
        else: "Phoenix.LiveImgPreview"
    }
    data-phx-update="ignore"
/&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We can optionally choose to validate the entries beyond the provided validation:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;  def handle_event("validate_upload", _params, socket) do
    num_remaining_uploads =
      length(socket.assigns.uploaded_files) - socket.assigns.uploads.demo.max_entries

    valid =
      Enum.uniq_by(socket.assigns.uploads.demo.entries, &amp;amp; &amp;amp;1.client_name)
      |&amp;gt; Enum.take(num_remaining_uploads)

    socket =
      Enum.reduce(socket.assigns.uploads.demo.entries, socket, fn entry, socket -&amp;gt;
        if entry in valid do
          socket
        else
          socket
          |&amp;gt; cancel_upload(:demo, entry.ref)
          |&amp;gt; put_flash(
            :error,
            "Uploaded files should be unique and cannot exceed #{socket.assigns.uploads.demo.max_entries} total files."
          )
        end
      end)

    {:noreply, socket}
  end
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here, we filter out any duplicate files, check how many files were already uploaded in previous batches (stored in the &lt;code&gt;uploaded_files&lt;/code&gt; assign), and keep only as many new entries as we have slots remaining. Then we cancel any of the files which didn't pass our validation.&lt;/p&gt;

&lt;p&gt;Lastly, we add a handler for the file submissions, potentially customizing the behavior depending on file type:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt; def handle_event("submit_upload", _params, socket) do
  uploaded_files =
    consume_uploaded_entries(socket, :demo, fn %{path: _path}, entry -&amp;gt;
      case entry.client_type do
        "application/pdf" -&amp;gt;
          # Handle PDFs
          IO.puts("PDF")

        _ -&amp;gt;
          # Handle images
          IO.puts("Image")
      end
    end)

  socket =
    socket
    |&amp;gt; update(:uploaded_files, &amp;amp;(&amp;amp;1 ++ uploaded_files))

  {:noreply, socket}
end
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Modifying &lt;code&gt;app.js&lt;/code&gt;
&lt;/h2&gt;

&lt;p&gt;Next we can modify the client-side hooks, which consists of implementing the two hooks we defined in our server code. The &lt;code&gt;LiveFileUpload&lt;/code&gt; code essentially commandeers the stock implementation of the hook from Phoenix Live View (which you can view &lt;a href="https://github.com/phoenixframework/phoenix_live_view/blob/8a3014465d861a08234577023abb932e9ce1bc6c/assets/js/phoenix_live_view/hooks.js?ref=thestackcanary.com#L12"&gt;here&lt;/a&gt;). It relies on the &lt;code&gt;UploadEntry&lt;/code&gt; class we exported earlier, which communicates with the &lt;a href="https://github.com/phoenixframework/phoenix_live_view/blob/8a3014465d861a08234577023abb932e9ce1bc6c/assets/js/phoenix_live_view/live_uploader.js?ref=thestackcanary.com#L15"&gt;&lt;code&gt;LiveUploader&lt;/code&gt;&lt;/a&gt; singleton class that stores data regarding the current uploads. The custom implementation essentially reimplements the stock version, adding in a check and new behavior for PDF file types. Let's walk through this code:&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;code&gt;LiveFileUpload&lt;/code&gt; Hook
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Hooks.LiveFileUpload = {
  activeRefs() {
    return this.el.getAttribute("data-phx-active-refs");
  },

  preflightedRefs() {
    return this.el.getAttribute("data-phx-preflighted-refs");
  },

  mounted() {
    this.preflightedWas = this.preflightedRefs();
    let pdfjsLib = window["pdfjs-dist/build/pdf"];
    // Ensure pdfjsLib is available globally
    if (typeof pdfjsLib === "undefined") {
      console.error("pdf.js is not loaded");
      return;
    }
    // Use the global `pdfjsLib` to access PDFJS functionalities
    pdfjsLib.GlobalWorkerOptions.workerSrc =
      "https://cdnjs.cloudflare.com/ajax/libs/pdf.js/3.11.174/pdf.worker.min.js";
    this.el.addEventListener("input", (event) =&amp;gt; {
      const files = event.target.files;
      for (const file of files) {
        if (file.type === "application/pdf") {
          const fileReader = new FileReader();
          fileReader.onload = (e) =&amp;gt; {
            const typedarray = new Uint8Array(e.target.result);
            // Load the PDF file
            pdfjsLib.getDocument(typedarray).promise.then((pdf) =&amp;gt; {
              // Assuming you want to preview the first page of each PDF
              pdf.getPage(1).then((page) =&amp;gt; {
                const scale = 1.5;
                const viewport = page.getViewport({ scale: scale });
                const canvas = document.createElement("canvas");
                const context = canvas.getContext("2d");
                canvas.height = viewport.height;
                canvas.width = viewport.width;

                // Render PDF page into canvas context
                const renderContext = {
                  canvasContext: context,
                  viewport: viewport,
                };
                page.render(renderContext).promise.then(() =&amp;gt; {
                  // Convert canvas to image and set as source for the element
                  const imgSrc = canvas.toDataURL("image/png");
                  let upload_entry = new UploadEntry(
                    this.el,
                    file,
                    this.__view
                  );
                  const imgEl = document.getElementById(
                    `phx-preview-${upload_entry.ref}`
                  );
                  if (imgEl) {
                    imgEl.setAttribute("src", imgSrc);
                  } else {
                    this.el.setAttribute(
                      `pdf-preview-${upload_entry.ref}`,
                      imgSrc
                    );
                  }
                });
              });
            });
          };
          fileReader.readAsArrayBuffer(file);
        }
      }
    });
  },
  updated() {
    let newPreflights = this.preflightedRefs();
    if (this.preflightedWas !== newPreflights) {
      this.preflightedWas = newPreflights;
      if (newPreflights === "") {
        this.__view.cancelSubmit(this.el.form);
      }
    }

    if (this.activeRefs() === "") {
      this.el.value = null;
    }
    this.el.dispatchEvent(new CustomEvent("phx:live-file:updated"));
  },
};
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;On mount, we load PDF.js from the Cloudflare CDN (pin whichever version you want to use) and add a listener on the &lt;code&gt;input&lt;/code&gt; event of the input tag. This fires whenever files are uploaded through the input, and the event carries all files in the upload batch (rather than one at a time).&lt;/p&gt;

&lt;p&gt;So we iterate through the uploaded files and only act on PDFs; all other files pass through untouched. For each PDF, we use PDF.js to render the first page (you can extend this to all pages if you like) into a canvas, which we then convert to a base64-encoded PNG data URL via &lt;code&gt;canvas.toDataURL&lt;/code&gt;. We then create a new &lt;code&gt;UploadEntry&lt;/code&gt; for the file, which registers it with the &lt;code&gt;LiveUploader&lt;/code&gt; and, among other things, assigns it a unique ID (&lt;code&gt;ref&lt;/code&gt;). Since the upload now has a unique ID, we can attach the rendered preview to the DOM element under that ID, to be picked up later by the &lt;code&gt;LivePdfPreview&lt;/code&gt; hook.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;code&gt;LivePdfPreview&lt;/code&gt; Hook
&lt;/h3&gt;

&lt;p&gt;We now define the &lt;code&gt;LivePdfPreview&lt;/code&gt; hook which will only be used to preview PDF files. The code here is derivative of the default &lt;a href="https://github.com/phoenixframework/phoenix_live_view/blob/8a3014465d861a08234577023abb932e9ce1bc6c/assets/js/phoenix_live_view/hooks.js?ref=thestackcanary.com#L33"&gt;&lt;code&gt;Phoenix.LiveImgPreview&lt;/code&gt;&lt;/a&gt; handler, and only works given the previous hook we discussed.&lt;/p&gt;

&lt;p&gt;Here, we check to see if the PDF preview has already been rendered. If not, we put a placeholder image as the source, knowing that the first hook will replace the placeholder when it's done rendering. Otherwise we set the element's &lt;code&gt;src&lt;/code&gt; to the data URL produced by the previous hook.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Hooks.LivePdfPreview = {
  mounted() {
    this.ref = this.el.getAttribute("data-phx-entry-ref");
    this.inputEl = document.getElementById(
      this.el.getAttribute("data-phx-upload-ref")
    );
    let src = this.inputEl.getAttribute(`pdf-preview-${this.ref}`);
    if (!src) {
      src = "https://poainc.org/wp-content/uploads/2018/06/pdf-placeholder.png";
    } else {
      this.inputEl.removeAttribute(`pdf-preview-${this.ref}`);
    }
    this.el.src = src;
    this.url = src;
  },
  destroyed() {
    if (this.url) {
      URL.revokeObjectURL(this.url);
    }
  },
};
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We have to add this synchronization since there is no guarantee that the PDF has been rendered by the time the preview appears in the DOM. The preview is conditionally rendered in the LiveView when the entries are populated into the assigns.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;lt;li :for={entry &amp;lt;- @uploads.demo.entries} class="relative"&amp;gt;
  &amp;lt;div class="group aspect-h-7 aspect-w-10 block w-full overflow-hidden rounded-lg bg-gray-100 focus-within:ring-2 focus-within:ring-indigo-500 focus-within:ring-offset-2 focus-within:ring-offset-gray-100"&amp;gt;
    &amp;lt;img
      id={"phx-preview-#{entry.ref}"}
      data-phx-upload-ref={entry.upload_ref}
      data-phx-entry-ref={entry.ref}
      data-phx-hook={
        if entry.client_type == "application/pdf",
          do: "LivePdfPreview",
          else: "Phoenix.LiveImgPreview"
      }
      data-phx-update="ignore"
    /&amp;gt;
...
&amp;lt;/li&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Final Thoughts
&lt;/h2&gt;

&lt;p&gt;It is worth noting that in the provided GitHub repo, there is some additional functionality to store the previews as base64-encoded strings to be used later on. I didn't go over it in this article since it is not core to the task at hand, but I needed it for my use case, so I included it in the code. The relevant code is the &lt;code&gt;GatherPreviews&lt;/code&gt; hook in &lt;code&gt;app.js&lt;/code&gt; and the &lt;code&gt;handle_event("update_preview_srcs")&lt;/code&gt; handler in &lt;code&gt;index.ex&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;I want to emphasize that this very well might not be the best way to achieve this goal, but after reading the docs and source code of Phoenix Live View for quite some time it didn't seem that there was a clean API or support for customizing this behavior. I'd love to hear about any other methods in the comments below!&lt;/p&gt;

</description>
      <category>elixir</category>
      <category>elixirtips</category>
      <category>javascript</category>
    </item>
    <item>
      <title>Elevate Your Elixir With Sigils</title>
      <dc:creator>Andrés Alejos</dc:creator>
      <pubDate>Sat, 06 Jan 2024 21:25:52 +0000</pubDate>
      <link>https://dev.to/acalejos/elevate-your-elixir-with-sigils-1lk4</link>
      <guid>https://dev.to/acalejos/elevate-your-elixir-with-sigils-1lk4</guid>
      <description>&lt;h2&gt;
  
  
  Motivation
&lt;/h2&gt;

&lt;p&gt;In a &lt;a href="https://dev.to/acalejos/leveling-up-your-elixir-option-handling-474l"&gt;previous article&lt;/a&gt; of mine, I wrote about using &lt;a href="https://github.com/dashbitco/nimble_options?ref=thestackcanary.com"&gt;NimbleOptions&lt;/a&gt; to add extremely powerful option handling to your Elixir applications. One of the custom validations I was using was a function &lt;code&gt;in_range&lt;/code&gt; that would check if an option fell within a real-valued interval. This differs from Elixir's built-in &lt;code&gt;Range&lt;/code&gt; in that it needed to be real-valued (rather than discrete integer steps). Additionally, mostly due to aesthetic and personal opinion, I wanted to be able to express the intervals using mathematical notation such as &lt;code&gt;(0,1]&lt;/code&gt; to mean "allow any value greater than 0 and less than or equal to 1". I find Elixir to be such a beautiful language with a unique capacity for extensions that it felt wrong to use a function such as &lt;code&gt;in_range&lt;/code&gt; or &lt;code&gt;in_interval&lt;/code&gt;. Additionally, some implementations I've come across have somewhat unintuitive APIs, such as the following spec:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;@spec in_range(float, float, float, bool, bool) :: bool
@doc """
  * `:value` - Value to test for inclusion
  * `:min` - Minimum value in range
  * `:max` - Maximum value in range
  * `:left` - Whether the left boundary is inclusive (true) or exclusive (false)
  * `:right` - Whether the right boundary is inclusive (true) or exclusive (false)
"""
def in_range(value, min \\ 0, max \\ 1, left \\ true, right \\ true)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;There's nothing expressly wrong with this implementation, but with my use cases and as Elixir is being used more in the domain of Machine Learning which deals with these intervals quite often, I wanted a solution that felt a bit more integrated.&lt;/p&gt;

&lt;h2&gt;
  
  
  Solution
&lt;/h2&gt;

&lt;p&gt;This led me to create a small 1-file library called &lt;a href="https://github.com/acalejos/exterval?ref=thestackcanary.com"&gt;&lt;code&gt;Exterval&lt;/code&gt;&lt;/a&gt; which is available on &lt;a href="https://hex.pm/packages/exterval?ref=thestackcanary.com"&gt;Hex&lt;/a&gt; and can be installed with:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def deps do
[
  {:exterval, "~&amp;gt; 0.1.0"}
]
end
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To make the interval feel more native to Elixir, I implemented it as a sigil that implements the &lt;a href="https://hexdocs.pm/elixir/1.12/Enumerable.html?ref=thestackcanary.com"&gt;&lt;code&gt;Enumerable&lt;/code&gt; Protocol&lt;/a&gt;, which gives you several nice benefits:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Takes advantage of the &lt;code&gt;member?/2&lt;/code&gt; function which means we can use the &lt;code&gt;in&lt;/code&gt; keyword to check for membership&lt;/li&gt;
&lt;li&gt;Allows for checking of sub-interval membership (&lt;a href="https://github.com/acalejos/exterval?tab=readme-ov-file&amp;amp;ref=thestackcanary.com#membership"&gt;with some caveats&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;Implements an optional &lt;code&gt;step&lt;/code&gt; parameter that allows you to iterate/reduce over the interval&lt;/li&gt;
&lt;li&gt;Implements a &lt;code&gt;size&lt;/code&gt; function (remember, &lt;code&gt;size&lt;/code&gt; refers to the ability to count the number of members without reducing over the whole structure, whereas &lt;code&gt;length&lt;/code&gt; implies a need to reduce).&lt;/li&gt;
&lt;li&gt;Allows for &lt;code&gt;:infinity&lt;/code&gt; and &lt;code&gt;:neg_infinity&lt;/code&gt; to be specified in the interval&lt;/li&gt;
&lt;/ul&gt;
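
&lt;p&gt;To see how the first points work mechanically, here is a deliberately simplified sketch of my own (&lt;em&gt;not&lt;/em&gt; Exterval's actual implementation) of an interval struct implementing &lt;code&gt;Enumerable&lt;/code&gt;: &lt;code&gt;x in interval&lt;/code&gt; dispatches to &lt;code&gt;member?/2&lt;/code&gt;, and &lt;code&gt;count/1&lt;/code&gt; computes the size without reducing:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;defmodule MiniInterval do
  defstruct [:min, :max, :step]
end

defimpl Enumerable, for: MiniInterval do
  # `x in interval` compiles down to this callback
  def member?(%{min: min, max: max}, x) when is_number(x),
    do: {:ok, x &amp;gt;= min and x &amp;lt;= max}

  def member?(_interval, _other), do: {:error, __MODULE__}

  # "size": count members without reducing over the whole structure
  def count(%{min: min, max: max, step: step}) when is_number(step) and step &amp;gt; 0,
    do: {:ok, trunc(Float.floor((max - min) / step)) + 1}

  def count(_interval), do: {:error, __MODULE__}

  def reduce(%{min: min, max: max, step: step}, acc, fun) do
    min
    |&amp;gt; Stream.unfold(fn
      x when x &amp;gt; max -&amp;gt; nil
      x -&amp;gt; {x, x + step}
    end)
    |&amp;gt; Enumerable.reduce(acc, fun)
  end

  def slice(_interval), do: {:error, __MODULE__}
end

interval = %MiniInterval{min: 1, max: 10, step: 2}
5 in interval          #=&amp;gt; true
Enum.to_list(interval) #=&amp;gt; [1, 3, 5, 7, 9]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Exterval's real implementation additionally handles open/closed bounds, negative steps, and infinite endpoints, but the protocol hooks are the same.&lt;/p&gt;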

&lt;p&gt;This lets us write more succinct checks like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;iex&amp;gt; import Exterval
iex&amp;gt; ~i&amp;lt;[1, 10)//2&amp;gt;
[1, 10)//2
iex&amp;gt; ~i&amp;lt;[1, 10)//2&amp;gt; |&amp;gt; Enum.to_list()
[1.0, 3.0, 5.0, 7.0, 9.0]
iex&amp;gt; ~i&amp;lt;[1, 10)//2&amp;gt; |&amp;gt; Enum.sum()
25.0
iex&amp;gt; ~i&amp;lt;[-1, 3)//-0.5&amp;gt; |&amp;gt; Enum.to_list()
[2.5, 2.0, 1.5, 1.0, 0.5, 0.0, -0.5, -1.0]
iex&amp;gt; ~i&amp;lt;[1, 10]&amp;gt; |&amp;gt; Enum.count()
:infinity
iex&amp;gt; ~i&amp;lt;[1, 10)//2&amp;gt; |&amp;gt; Enum.count()
4
iex&amp;gt; ~i&amp;lt;[-2,-2]//1.0&amp;gt; |&amp;gt; Enum.count()
1
iex&amp;gt; ~i&amp;lt;[1,2]//0.5&amp;gt; |&amp;gt; Enum.count()
3
iex&amp;gt; ~i&amp;lt;[-2,-1]//0.75&amp;gt; |&amp;gt; Enum.count()
2
iex&amp;gt; 1 in ~i&amp;lt;[1, 10]&amp;gt;
true
iex&amp;gt; 1 in ~i&amp;lt;[1, 10)//2&amp;gt;
true
iex&amp;gt; 3 in ~i&amp;lt;(1, 10)//2&amp;gt;
true
# You can even do variable substitution using string interpolation syntax, since the sigil parameter is just a string
iex&amp;gt; min = 2
iex&amp;gt; 3 in ~i&amp;lt;(#{min + 1}, 10)//2&amp;gt;
false
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Design Details
&lt;/h2&gt;

&lt;p&gt;The decision to implement the interval as a sigil was not as straightforward as it might seem. As I mentioned before, Elixir is an extremely extensible language with excellent support for meta-programming, so implementing this as a macro was my first instinct. I considered commandeering the opening brackets &lt;code&gt;(&lt;/code&gt; and &lt;code&gt;[&lt;/code&gt; to trigger the macro, or something similar with the comma &lt;code&gt;,&lt;/code&gt;, but fortunately I hit a brick wall with that effort. I say fortunately not only because it would have been a bad idea from a design perspective, but also because it would have been a messier implementation that overly complicated the code and made it less clear. I appreciate the usage of the sigil &lt;code&gt;~i&lt;/code&gt; because it makes it clear that the range that follows is not to be confused with the built-in &lt;code&gt;Range&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;💡 You can read more about Elixir sigils &lt;a href="https://hexdocs.pm/elixir/main/sigils.html?ref=thestackcanary.com"&gt;here&lt;/a&gt; and see their syntax reference &lt;a href="https://hexdocs.pm/elixir/main/syntax-reference.html?ref=thestackcanary.com#sigils"&gt;here&lt;/a&gt;. Of note, you can use any of the allowed delimiter pairs that it lists to capture your sigil. I chose &lt;code&gt;&amp;lt;&lt;/code&gt; and &lt;code&gt;&amp;gt;&lt;/code&gt; so as to not conflict with the brackets used in the interval. You could also use something like &lt;code&gt;~i|[0,1)|&lt;/code&gt; if you prefer.&lt;/p&gt;

&lt;p&gt;Once I decided on the usage of the &lt;code&gt;Enumerable&lt;/code&gt; protocol, I knew I wanted to allow an optional step size to be specified so that &lt;code&gt;reduce&lt;/code&gt; could be used on the structure. Elixir sigils allow modifiers to be passed after the closing delimiter, so initially I considered passing the step size as a modifier, but since modifiers may only be ASCII letters and digits, this would prohibit float step sizes. Another constraint to consider is that string interpolation is only allowed within lowercase sigils: sigils start with &lt;code&gt;~&lt;/code&gt; and are followed by either one lowercase letter or one or more uppercase letters, immediately followed by one of the allowed delimiter pairs. Within our use case we could get by without string interpolation when intervals are hardcoded, but the library becomes much more useful if intervals can be defined dynamically, so this constrains the sigil to a lowercase name.&lt;/p&gt;

&lt;p&gt;Another major design decision was how to actually parse the sigil. I ultimately landed on the straightforward answer of just using a regex, but I had a good back-and-forth with my friend &lt;a href="https://twitter.com/polvalente?ref=thestackcanary.com"&gt;Paulo&lt;/a&gt; from the &lt;a href="https://github.com/elixir-nx/nx/tree/main/nx?ref=thestackcanary.com#readme"&gt;Elixir-Nx&lt;/a&gt; core team about other options. He provided some nice proofs of concept using binary pattern matching as well as &lt;a href="https://github.com/dashbitco/nimble_parsec?ref=thestackcanary.com"&gt;NimbleParsec&lt;/a&gt;, but I settled on a regex due to my familiarity with it, the smaller amount of code, and because performance is not a concern for what will typically be short patterns.&lt;/p&gt;

&lt;p&gt;One of the last design details to finalize was how to treat the step size and its effect on item membership. Paulo and I discussed whether to support ranges where the min and max values are not necessarily in order (e.g. &lt;code&gt;~i&amp;lt;1,-1//0.5&amp;gt;&lt;/code&gt;), which would essentially imply that iteration starts at &lt;code&gt;1&lt;/code&gt; in this instance and works towards &lt;code&gt;-1&lt;/code&gt; in steps of &lt;code&gt;0.5&lt;/code&gt;, as can be seen in some implementations in other ecosystems. We decided that the clearest solution, and the one that best fit the spirit of the library, was to enforce that the first value specified be less than or equal to the second, with any desire to iterate starting from the max expressed as a negative step size.&lt;/p&gt;
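&lt;p&gt;Under that rule, a negative step simply iterates from the top of the interval downward, while a reversed bound order is rejected outright. A quick sketch of the resulting behavior:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;iex&amp;gt; import Exterval
iex&amp;gt; ~i&amp;lt;[-1, 3)//-0.5&amp;gt; |&amp;gt; Enum.take(3)
[2.5, 2.0, 1.5]
# writing ~i&amp;lt;[3, -1)//-0.5&amp;gt; instead raises at creation time,
# since the first bound must be less than or equal to the second
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;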

&lt;h2&gt;
  
  
  Implementation Details
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Creation
&lt;/h3&gt;

&lt;p&gt;An interval is stored as a struct with the following fields:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;left&lt;/code&gt; - the left bracket, either &lt;code&gt;[&lt;/code&gt; or &lt;code&gt;(&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;right&lt;/code&gt; - the right bracket, either &lt;code&gt;]&lt;/code&gt; or &lt;code&gt;)&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;min&lt;/code&gt; - the lower bound of the interval. Can be &lt;code&gt;:neg_infinity&lt;/code&gt; or any number.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;max&lt;/code&gt; - the upper bound of the interval. Can be &lt;code&gt;:infinity&lt;/code&gt; or any number.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;step&lt;/code&gt; - the step size of the interval. If &lt;code&gt;nil&lt;/code&gt;, the interval is continuous.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;To define a sigil, you create a function named after the sigil with a &lt;code&gt;sigil_&lt;/code&gt; prefix, so since I wish to invoke this sigil as &lt;code&gt;~i&lt;/code&gt;, I define it as&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def sigil_i(pattern, []) do
end
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The second parameter is the list of modifiers to the sigil that I mentioned earlier. For now, these are unused.&lt;/p&gt;
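&lt;p&gt;For reference, Elixir passes sigil modifiers as a charlist in the second argument, so a hypothetical &lt;code&gt;~i&amp;lt;[1,10]&amp;gt;2&lt;/code&gt; would call &lt;code&gt;sigil_i("[1,10]", ~c"2")&lt;/code&gt;. A minimal sketch of why modifiers cannot carry the step size (&lt;code&gt;parse_interval/1&lt;/code&gt; is a hypothetical helper name):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Modifiers arrive as a charlist of ASCII letters and digits, so a
# modifier could encode an integer step but never a float like 0.5.
# Exterval therefore keeps the step inside the pattern (//step) and
# matches an empty modifier list:
def sigil_i(pattern, []), do: parse_interval(pattern)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;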

&lt;p&gt;I parse the input to the sigil using the following &lt;a href="https://regex101.com/r/MJyFE4/1?ref=thestackcanary.com"&gt;regex&lt;/a&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;  ^(?P&amp;lt;left&amp;gt;\[|\()\s*(?P&amp;lt;min&amp;gt;[-+]?(?:\d+|\d+\.\d+)(?:[eE][-+]?\d+)?|:neg_infinity)\s*,\s*(?P&amp;lt;max&amp;gt;[-+]?(?:\d+|\d+\.\d+)(?:[eE][-+]?\d+)?|:infinity)\s*(?P&amp;lt;right&amp;gt;]|\))(?:\/\/(?P&amp;lt;step&amp;gt;[-+]?(?:[1-9]+|\d+\.\d+)(?:[eE][-+]?\d+)?))?$
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Using the named capture groups, I perform some additional validation, such as ensuring that the interval goes from the minimum value to the maximum.&lt;/p&gt;
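&lt;p&gt;A minimal sketch of wiring those named captures into the sigil (this is not the library's exact code; &lt;code&gt;@interval_regex&lt;/code&gt;, &lt;code&gt;parse_bound/1&lt;/code&gt;, and &lt;code&gt;parse_step/1&lt;/code&gt; are hypothetical names):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def sigil_i(pattern, []) do
  case Regex.named_captures(@interval_regex, pattern) do
    nil -&amp;gt;
      raise ArgumentError, "invalid interval: #{pattern}"

    %{"left" =&amp;gt; left, "min" =&amp;gt; min, "max" =&amp;gt; max, "right" =&amp;gt; right, "step" =&amp;gt; step} -&amp;gt;
      # convert the captured strings to numbers (or :infinity /
      # :neg_infinity), check min &amp;lt;= max, then build the struct
      %Exterval{left: left, right: right, min: parse_bound(min),
                max: parse_bound(max), step: parse_step(step)}
  end
end
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;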

&lt;h3&gt;
  
  
  Enumerable – Size / Count
&lt;/h3&gt;

&lt;p&gt;The first function I need to implement for the protocol is &lt;code&gt;Enumerable.count/1&lt;/code&gt;. Logically, there are three conditions to account for. First are the instances where the size is either zero or infinite. Since &lt;code&gt;Enumerable.count/1&lt;/code&gt; must return a number on success, I choose to return &lt;code&gt;{:error, Infinity}&lt;/code&gt; when I wish to signal &lt;code&gt;:infinity&lt;/code&gt;. The error tuple would normally carry a module that can perform a reduction to compute the count, but if we just make a simple helper module&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;defmodule Infinity do
  @moduledoc false
  def reduce(%Exterval{}, _, _), do: {:halt, :infinity}
end
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now I can get my desired behavior. I implement these cases with the following:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def size(interval)
def size(%__MODULE__{step: nil}), do: {:error, Infinity}
def size(%__MODULE__{max: :neg_infinity}), do: 0
def size(%__MODULE__{min: :infinity}), do: 0

def size(%__MODULE__{min: min, max: max})
    when min in [:infinity, :neg_infinity] or max in [:infinity, :neg_infinity],
    do: {:error, Infinity}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Lastly, I separate the cases where the step size is negative from where it is positive, since the logic is different.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def size(% __MODULE__ {left: left, right: right, min: min, max: max, step: step}) when step &amp;lt; 0 do
  case {left, right} do
    {"[", "]"} -&amp;gt;
      abs(trunc((max - min) / step)) + 1

    {"(", "]"} -&amp;gt;
      abs(trunc((max - (min - step)) / step)) + 1

    {"[", ")"} -&amp;gt;
      abs(trunc((max + step - min) / step)) + 1

    {"(", ")"} -&amp;gt;
      abs(trunc((max + step - (min - step)) / step)) + 1
  end
end

def size(%__MODULE__{left: left, right: right, min: min, max: max, step: step}) when step &amp;gt; 0 do
  case {left, right} do
    {"[", "]"} -&amp;gt;
      abs(trunc((max - min) / step)) + 1

    {"(", "]"} -&amp;gt;
      abs(trunc((max - (min + step)) / step)) + 1

    {"[", ")"} -&amp;gt;
      abs(trunc((max - step - min) / step)) + 1

    {"(", ")"} -&amp;gt;
      abs(trunc((max - step - (min + step)) / step)) + 1
  end
end
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Enumerable – Reduce
&lt;/h3&gt;

&lt;p&gt;The implementation for &lt;code&gt;reduce&lt;/code&gt; is a great example of how Elixir's pattern matching in function headers can reduce visual complexity and even simplify the implementation itself. First, we immediately finish the reduction when &lt;code&gt;step&lt;/code&gt; is &lt;code&gt;nil&lt;/code&gt;, since a continuous interval cannot be iterated.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def reduce(%Exterval{step: nil}, acc, _fun) do
  {:done, acc}
end

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Next, we again have different clauses depending on whether the &lt;code&gt;step&lt;/code&gt; is positive or negative, since that dictates the direction in which the reduction traverses the interval.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def reduce(%Exterval{left: left, right: right, min: min, max: max, step: step}, acc, fun)
    when step &amp;gt; 0 do
  case left do
    "[" -&amp;gt;
      reduce(min, max, right, acc, fun, step)

    "(" -&amp;gt;
      reduce(min + step, max, right, acc, fun, step)
  end
end

def reduce(%Exterval{left: left, right: right, min: min, max: max, step: step}, acc, fun)
    when step &amp;lt; 0 do
  case right do
    "]" -&amp;gt;
      reduce(min, max, left, acc, fun, step)

    ")" -&amp;gt;
      reduce(min, max + step, left, acc, fun, step)
  end
end
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Notice that these clauses of the &lt;code&gt;reduce/3&lt;/code&gt; implementation delegate to a &lt;code&gt;reduce/6&lt;/code&gt; function that is specific to our module.&lt;/p&gt;

&lt;p&gt;Next we handle conditions where the reduction is halted or suspended:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;efp reduce(_min, _max, _closing, {:halt, acc}, _fun, _step) do
  {:halted, acc}
end

defp reduce(min, max, closing, {:suspend, acc}, fun, step) do
  {:suspended, acc, &amp;amp;reduce(min, max, closing, &amp;amp;1, fun, step)}
end
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Next we handle edge cases involving &lt;code&gt;:infinity&lt;/code&gt; and &lt;code&gt;:neg_infinity&lt;/code&gt; where we have no way to begin the reduction since we cannot move &lt;code&gt;step&lt;/code&gt; increments away from either of these when they are our starting point:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;defp reduce(:neg_infinity, _max, _closing, {:cont, acc}, _fun, step) when step &amp;gt; 0 do
  {:done, acc}
end

defp reduce(_min, :infinity, _closing, {:cont, acc}, _fun, step) when step &amp;lt; 0 do
  {:done, acc}
end
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Interestingly, these are cases where the size of the intervals would be &lt;code&gt;:infinity&lt;/code&gt; but we cannot reduce over them at all, as opposed to other infinitely sized intervals where we can begin iteration which will never end, such as &lt;code&gt;~i&amp;lt;[0,:infinity]//1&amp;gt;&lt;/code&gt; which would effectively be an infinite stream starting at &lt;code&gt;0&lt;/code&gt; and incrementing by &lt;code&gt;1&lt;/code&gt;.&lt;/p&gt;
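&lt;p&gt;Because the reduction honors &lt;code&gt;{:halt, acc}&lt;/code&gt;, those never-ending intervals still compose with any &lt;code&gt;Enum&lt;/code&gt; function that stops early:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;iex&amp;gt; import Exterval
# an infinite enumerable starting at 0 and stepping by 1;
# Enum.take/2 halts the reduction after five elements
iex&amp;gt; ~i&amp;lt;[0,:infinity]//1&amp;gt; |&amp;gt; Enum.take(5)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;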

&lt;p&gt;Next we add all of the main logic for the "typical" cases:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;defp reduce(min, max, "]" = closing, {:cont, acc}, fun, step)
     when min &amp;lt;= max do
  reduce(min + step, max, closing, fun.(min, acc), fun, step)
end

defp reduce(min, max, ")" = closing, {:cont, acc}, fun, step)
     when min &amp;lt; max do
  reduce(min + step, max, closing, fun.(min, acc), fun, step)
end

defp reduce(min, max, "[" = closing, {:cont, acc}, fun, step)
     when min &amp;lt;= max do
  reduce(min, max + step, closing, fun.(max, acc), fun, step)
end

defp reduce(min, max, "(" = closing, {:cont, acc}, fun, step)
     when min &amp;lt; max do
  reduce(min, max + step, closing, fun.(max, acc), fun, step)
end
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And lastly we add the final case where the condition that &lt;code&gt;min &amp;lt; max&lt;/code&gt; (or &lt;code&gt;min &amp;lt;= max&lt;/code&gt; depending on the brackets) is no longer met, which means the reduction is complete:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;defp reduce(_, _, _, {:cont, acc}, _fun, _up) do
  {:done, acc}
end
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Just like that, the &lt;code&gt;reduce/3&lt;/code&gt; implementation is complete! As I mentioned before, and as the README notes in more detail, there are some opinions inherent in this implementation having to do with &lt;code&gt;:infinity&lt;/code&gt; and &lt;code&gt;:neg_infinity&lt;/code&gt; bounds as well as empty intervals, but I tried to keep the behavior consistent throughout.&lt;/p&gt;

&lt;h3&gt;
  
  
  Enumerable – Membership
&lt;/h3&gt;

&lt;p&gt;Now on to the part that I was most interested in, which is interval membership. First, let's add support for checking membership between two intervals, which is essentially a check for one interval being a sub-interval of another.&lt;/p&gt;

&lt;p&gt;An interval must satisfy the following to be a sub-interval of another:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The minimum value of the subset must belong to the superset.&lt;/li&gt;
&lt;li&gt;The maximum value of the subset must belong to the superset.&lt;/li&gt;
&lt;li&gt;The step size of the subset must be a multiple of the step size of the superset.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If the superset has no step size, then only the first two conditions must be satisfied.&lt;/p&gt;

&lt;p&gt;If the superset has a step size and the subset does not, then membership is &lt;code&gt;false&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def member?(%Exterval{step: nil} = outer, %Exterval{} = inner) do
  res = inner.max in outer &amp;amp;&amp;amp; inner.min in outer
  {:ok, res}
end

def member?(%Exterval{}, %Exterval{step: nil}) do
  {:ok, false}
end

def member?(%Exterval{} = outer, %Exterval{} = inner) do
  res = inner.max in outer &amp;amp;&amp;amp; inner.min in outer &amp;amp;&amp;amp; :math.fmod(inner.step, outer.step) == 0
  {:ok, res}
end
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
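&lt;p&gt;Given the clauses above, sub-interval checks read naturally with the &lt;code&gt;in&lt;/code&gt; operator (recall that &lt;code&gt;a in b&lt;/code&gt; delegates to &lt;code&gt;Enumerable.member?(b, a)&lt;/code&gt;):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;iex&amp;gt; import Exterval
# a stepped interval inside a continuous superset: only the bounds matter
iex&amp;gt; ~i&amp;lt;[1,5]//2&amp;gt; in ~i&amp;lt;[0,10]&amp;gt;
true
# a continuous interval is never a subset of a stepped one
iex&amp;gt; ~i&amp;lt;[1,5]&amp;gt; in ~i&amp;lt;[0,10]//1&amp;gt;
false
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;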



&lt;p&gt;Then that just leaves the main implementation for membership checks, which is essentially a &lt;code&gt;case&lt;/code&gt; statement whose comparisons change depending on the brackets supplied. Additionally, if the interval has a &lt;code&gt;step&lt;/code&gt;, then the value's offset from &lt;code&gt;min&lt;/code&gt; must be a multiple of the &lt;code&gt;step&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def member?(%Exterval{} = rang, value) when is_number(value) do
  res =
    if Exterval.size(rang) == 0 do
      false
    else
      case {rang.left, rang.min, rang.max, rang.right} do
        {_, :neg_infinity, :infinity, _} -&amp;gt;
          true

        {_, :neg_infinity, max_val, "]"} -&amp;gt;
          value &amp;lt;= max_val

        {_, :neg_infinity, max_val, ")"} -&amp;gt;
          value &amp;lt; max_val

        {"[", min_val, :infinity, _} -&amp;gt;
          value &amp;gt;= min_val

        {"(", min_val, :infinity, _} -&amp;gt;
          value &amp;gt; min_val

        {"[", min_val, max_val, "]"} -&amp;gt;
          value &amp;gt;= min_val and value &amp;lt;= max_val

        {"(", min_val, max_val, "]"} -&amp;gt;
          value &amp;gt; min_val and value &amp;lt;= max_val

        {"[", min_val, max_val, ")"} -&amp;gt;
          value &amp;gt;= min_val and value &amp;lt; max_val

        {"(", min_val, max_val, ")"} -&amp;gt;
          value &amp;gt; min_val and value &amp;lt; max_val

        _ -&amp;gt;
          raise ArgumentError, "Invalid range specification"
      end
    end

  res =
    unless is_nil(rang.step) || rang.min == :neg_infinity || rang.max == :infinity do
      res &amp;amp;&amp;amp; :math.fmod(value - rang.min, rang.step) == 0
    else
      res
    end

  {:ok, res}
end
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Inspect
&lt;/h3&gt;

&lt;p&gt;Lastly, to make the user experience a bit better, it's not too difficult to implement the &lt;a href="https://hexdocs.pm/elixir/1.16.0/Inspect.html?ref=thestackcanary.com"&gt;&lt;code&gt;Inspect&lt;/code&gt;&lt;/a&gt; Protocol to provide a cleaner output:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;defimpl Inspect do
  import Inspect.Algebra
  import Kernel, except: [inspect: 2]

  def inspect(%Exterval{left: left, right: right, min: min, max: max, step: nil}, opts) do
    concat([string(left), to_doc(min, opts), ",", to_doc(max, opts), string(right)])
  end

  def inspect(%Exterval{left: left, right: right, min: min, max: max, step: step}, opts) do
    concat([
      string(left),
      to_doc(min, opts),
      ",",
      to_doc(max, opts),
      string(right),
      "//",
      to_doc(step, opts)
    ])
  end
end
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Future Plans
&lt;/h2&gt;

&lt;p&gt;Currently I am weighing the options between adding more functionality to the library or keeping it as thin as it currently is. The main additions could be more robust set operations on the intervals, but I currently do not have a need for it so it will probably not make it into the library in the near future.&lt;/p&gt;

&lt;p&gt;For now, I hope this provided a detailed look at the process of identifying a problem, and subsequently designing and implementing the solution. I found this to be an elegant solution to the problem, but as I mentioned it was not a straight-line path. I would be interested to hear about any other solutions people have seen!&lt;/p&gt;

</description>
      <category>elixir</category>
    </item>
    <item>
      <title>Python NumPy to Elixir-Nx</title>
      <dc:creator>Andrés Alejos</dc:creator>
      <pubDate>Tue, 05 Dec 2023 01:13:01 +0000</pubDate>
      <link>https://dev.to/acalejos/python-numpy-to-elixir-nx-45j2</link>
      <guid>https://dev.to/acalejos/python-numpy-to-elixir-nx-45j2</guid>
      <description>&lt;p&gt;For my &lt;a href="https://elixirforum.com/t/serving-spam-detection-with-xgboost-and-nx-elixirconfus-2023/58203?ref=thestackcanary.com"&gt;ElixirConfUS talk&lt;/a&gt;, I wanted to demonstrate training a spam detection model with in Elixir. A pre-processing step I needed to perform was &lt;a href="https://en.wikipedia.org/wiki/Tf%E2%80%93idf?ref=thestackcanary.com"&gt;TF-IDF vectorization&lt;/a&gt;, but there were no TF-IDF libraries already written which were built for &lt;a href="https://github.com/elixir-nx/nx/tree/main/nx?ref=thestackcanary.com#readme"&gt;Elixir-Nx&lt;/a&gt;. Seeing as TF-IDF is an extremely common pre-processing step with Decision Trees, since they ingest tabular data, I decided to go ahead and write a full-fledged implementation rather than just writing a minimal implementation that I needed for my example.&lt;/p&gt;

&lt;p&gt;I decided to model my implementation after the &lt;a href="https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html?ref=thestackcanary.com#sklearn.feature_extraction.text.TfidfVectorizer"&gt;TF-IDF Vectorizer&lt;/a&gt; implementation in Python's scikit-learn, and in writing it I learned many lessons about translating Python NumPy code to Elixir-Nx. Since Nx is to Elixir as NumPy is to Python, I thought others might find it useful to see how we can leverage existing code from the Python ecosystem to bring it to Elixir.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The sklearn source code can be found &lt;a href="https://github.com/scikit-learn/scikit-learn/blob/d99b728b3/sklearn/feature_extraction/text.py?ref=thestackcanary.com"&gt;here&lt;/a&gt;, while the Elixir source code can be found &lt;a href="https://github.com/acalejos/mighty/tree/main/lib/preprocessing?ref=thestackcanary.com"&gt;here&lt;/a&gt;.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  API Overview
&lt;/h2&gt;

&lt;p&gt;A primary goal I had while writing my Elixir library was to make the API as similar to the Python API as possible so that it would be easy for people coming from Python, since sklearn is the Machine Learning library that most people likely have experience with. The Elixir version supports most of the same arguments as the sklearn version. Here are some examples of basic usage of the API&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Elixir
TfidfVectorizer.new(
    ngram_range: {1, 3},
    sublinear_tf: true,
    stop_words: english_stop_words,
    max_features: 5000
  )
  |&amp;gt; TfidfVectorizer.fit_transform(X_train)

# Python
TfidfVectorizer(
  sublinear_tf=True, 
  ngram_range=(1, 3), 
  max_features=5000
  ).fit_transform(X_train)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Design Overview
&lt;/h2&gt;

&lt;p&gt;We will start by looking at the sklearn &lt;code&gt;CountVectorizer&lt;/code&gt; class, since &lt;code&gt;TfidfVectorizer&lt;/code&gt; is actually implemented as a subclass of &lt;code&gt;CountVectorizer&lt;/code&gt; that uses the &lt;code&gt;fit&lt;/code&gt; method of the &lt;code&gt;TfidfTransformer&lt;/code&gt; class to transform the output of the &lt;code&gt;CountVectorizer&lt;/code&gt; into its TF-IDF representation. As a result, the bulk of the code is implemented in the &lt;code&gt;CountVectorizer&lt;/code&gt;. In Elixir, I accomplished this design by having a &lt;code&gt;CountVectorizer&lt;/code&gt; module and a &lt;code&gt;TfidfVectorizer&lt;/code&gt; module that has a &lt;code&gt;CountVectorizer&lt;/code&gt; as a struct member.&lt;/p&gt;

&lt;p&gt;The vectorizer works by building a vocabulary from the given corpus (or using a vocabulary you pass it), counting the number of times each word in the vocabulary occurs, and filtering the resulting features. More precisely, the vectorizers work roughly according to these steps:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Either builds a vocabulary from the given corpus or uses a vocabulary supplied to it.
a. Performs preprocessing according to a given function
b. Performs tokenization according to a tokenization function
c. Generates requested ngrams
d. Filters stop words&lt;/li&gt;
&lt;li&gt;Iterates through each document in the corpus, counting each term in the vocabulary&lt;/li&gt;
&lt;li&gt;Limit output features according to parameters
a. max_features - Only consider the top &lt;code&gt;max_features&lt;/code&gt; ordered by term frequency across the corpus.
b. min_df - Ignore terms that have a document frequency strictly lower than the given threshold.
c. max_df - Ignore terms that have a document frequency strictly higher than the given threshold.&lt;/li&gt;
&lt;li&gt;Output &lt;code&gt;CountVectorizer&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;TfidfVectorizer&lt;/code&gt; uses the &lt;code&gt;CountVectorizer&lt;/code&gt; to transform the term-frequency matrix into its TF-IDF matrix.&lt;/li&gt;
&lt;/ol&gt;
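&lt;p&gt;The feature-limiting step maps directly onto the vectorizer options shown in the struct below. A sketch of what that configuration looks like (the thresholds here are illustrative):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# keep at most 1_000 features, dropping terms that appear in fewer
# than 2 documents or in more than 90% of documents
vectorizer =
  Mighty.Preprocessing.CountVectorizer.new(
    max_features: 1_000,
    min_df: 2,
    max_df: 0.9
  )
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;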

&lt;p&gt;As you can see, the bulk of the work is done in the &lt;code&gt;CountVectorizer&lt;/code&gt; module, so that is where we will spend most of our time going forward. Now that you have a general understanding of how the vectorizers work, we will look at a brief survey of different functions to compare the Python and Elixir implementations.&lt;/p&gt;

&lt;h2&gt;
  
  
  Implementation Details
&lt;/h2&gt;

&lt;p&gt;Here is how the vectorizer is initialized in Python. It uses keyword args with default parameters and initializes its class attributes accordingly.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;class CountVectorizer(_VectorizerMixin, BaseEstimator):
  def __init__ (
        self,
        *,
        input="content",
        encoding="utf-8",
        decode_error="strict",
        strip_accents=None,
        lowercase=True,
        preprocessor=None,
        tokenizer=None,
        stop_words=None,
        token_pattern=r"(?u)\b\w\w+\b",
        ngram_range=(1, 1),
        analyzer="word",
        max_df=1.0,
        min_df=1,
        max_features=None,
        vocabulary=None,
        binary=False,
        dtype=np.int64,
    )
    self.input = input
        self.encoding = encoding
        self.decode_error = decode_error
        self.strip_accents = strip_accents
        self.preprocessor = preprocessor
        self.tokenizer = tokenizer
        self.analyzer = analyzer
        self.lowercase = lowercase
        self.token_pattern = token_pattern
        self.stop_words = stop_words
        self.max_df = max_df
        self.min_df = min_df
        self.max_features = max_features
        self.ngram_range = ngram_range
        self.vocabulary = vocabulary
        self.binary = binary
        self.dtype = dtype
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And here is a snippet of the &lt;code&gt;CountVectorizer&lt;/code&gt; struct in Elixir, as well as the &lt;code&gt;new/1&lt;/code&gt; function that is used to initialize the vectorizer. &lt;code&gt;new/1&lt;/code&gt; gives similar behavior to a Python class's &lt;code&gt;init&lt;/code&gt; method. I used &lt;code&gt;NimbleOptions&lt;/code&gt; for parameter validation (it's a great library and you can read more about it &lt;a href="https://dev.to/acalejos/leveling-up-your-elixir-option-handling-474l"&gt;here&lt;/a&gt;), and you can refer to the parameter schema source &lt;a href="https://github.com/acalejos/mighty/blob/main/lib/preprocessing/shared.ex?ref=thestackcanary.com#L3"&gt;here&lt;/a&gt;. &lt;code&gt;validate_shared!&lt;/code&gt; validates the parameters and assigns default values when none are provided.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;defmodule Mighty.Preprocessing.CountVectorizer do
  defstruct vocabulary: nil,
            ngram_range: {1, 1},
            max_features: nil,
            min_df: 1,
            max_df: 1.0,
            stop_words: [],
            binary: false,
            preprocessor: nil,
            tokenizer: nil,
            pruned: nil

  def new(opts \\ []) do
    opts = Mighty.Preprocessing.Shared.validate_shared!(opts)
    struct(__MODULE__, opts)
  end
end
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;All of the operations in our &lt;code&gt;CountVectorizer&lt;/code&gt; module take a &lt;code&gt;CountVectorizer&lt;/code&gt; as their first argument, which allows us to pipe our operations nicely. The Python implementation separates the &lt;code&gt;TfidfTransformer&lt;/code&gt; from the &lt;code&gt;TfidfVectorizer&lt;/code&gt;, where the &lt;code&gt;TfidfVectorizer&lt;/code&gt; inherits from the &lt;code&gt;CountVectorizer&lt;/code&gt;. To achieve similar behavior, our &lt;code&gt;TfidfVectorizer&lt;/code&gt; is its own struct that contains a &lt;code&gt;CountVectorizer&lt;/code&gt; as one of its members. Creating a new &lt;code&gt;TfidfVectorizer&lt;/code&gt; starts with creating a new &lt;code&gt;CountVectorizer&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;defmodule Mighty.Preprocessing.TfidfVectorizer do
  alias Mighty.Preprocessing.CountVectorizer
  alias Mighty.Preprocessing.Shared

  defstruct [
    :count_vectorizer,
    :norm,
    :idf,
    use_idf: true,
    smooth_idf: true,
    sublinear_tf: false
  ]

  @doc """
  Creates a new `TfidfVectorizer` struct with the given options.

  Returns the new vectorizer.
  """
  def new(opts \\ []) do
    {general_opts, tfidf_opts} =
      Keyword.split(opts, Shared.get_vectorizer_schema() |&amp;gt; Keyword.keys())

    count_vectorizer = CountVectorizer.new(general_opts)
    tfidf_opts = Shared.validate_tfidf!(tfidf_opts)

    %__MODULE__{count_vectorizer: count_vectorizer}
    |&amp;gt; struct(tfidf_opts)
  end
end
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now, let's compare three of the main pieces of functionality between the two implementations: &lt;strong&gt;building the term-frequency matrix&lt;/strong&gt; (count matrix), &lt;strong&gt;limiting / pruning features&lt;/strong&gt; from the resulting matrix, and performing the &lt;strong&gt;TFIDF transformation&lt;/strong&gt; on that resulting matrix.&lt;/p&gt;

&lt;h2&gt;
  
  
  Building Term-Frequency Matrix
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;defp _transform(vectorizer = % __MODULE__ {}, corpus, n_doc) do
  if is_nil(vectorizer.vocabulary) do
    raise "CountVectorizer must be fit to a corpus before transforming the corpus. Use CountVectorizer.fit/2 or CountVectorizer.fit_transform/2 to fit the CountVectorizer to a corpus."
  end

  tf = Nx.broadcast(0, {n_doc, Enum.count(vectorizer.vocabulary)})

  corpus
  |&amp;gt; Enum.with_index()
  |&amp;gt; Enum.chunk_every(2000)
  |&amp;gt; Enum.reduce(tf, fn chunk, acc -&amp;gt;
    Task.async_stream(
      chunk,
      fn {doc, doc_idx} -&amp;gt;
        doc
        |&amp;gt; then(&amp;amp;do_process(vectorizer, &amp;amp;1))
        |&amp;gt; Enum.reduce(
          Map.new(vectorizer.vocabulary, fn {k, _} -&amp;gt; {k, 0} end),
          fn token, acc -&amp;gt;
            Map.update(acc, token, 1, &amp;amp;(&amp;amp;1 + 1))
          end
        )
        |&amp;gt; Enum.map(fn {k, v} -&amp;gt;
          case Map.get(vectorizer.vocabulary, k) do
            nil -&amp;gt; nil
            _ when v == 0 -&amp;gt; nil
            idx -&amp;gt; [doc_idx, idx, v]
          end
        end)
      end,
      timeout: :infinity
    )
    |&amp;gt; Enum.reduce({[], []}, fn
      {:ok, iter_result}, acc -&amp;gt;
        Enum.reduce(iter_result, acc, fn
          nil, acc -&amp;gt; acc
          [x, y, z], {idx, upd} -&amp;gt; {[[x, y] | idx], [z | upd]}
        end)
    end)
    |&amp;gt; then(fn {idx, upd} -&amp;gt;
      Nx.indexed_put(acc, Nx.tensor(idx), Nx.tensor(upd))
    end)
  end)
end


def _count_vocab(self, raw_documents, fixed_vocab):
  """Create sparse feature matrix, and vocabulary where fixed_vocab=False"""
  if fixed_vocab:
      vocabulary = self.vocabulary_
  else:
      # Add a new value when a new vocabulary item is seen
      vocabulary = defaultdict()
      vocabulary.default_factory = vocabulary.__len__

  analyze = self.build_analyzer()
  j_indices = []
  indptr = []

  values = _make_int_array()
  indptr.append(0)
  for doc in raw_documents:
      feature_counter = {}
      for feature in analyze(doc):
          try:
              feature_idx = vocabulary[feature]
              if feature_idx not in feature_counter:
                  feature_counter[feature_idx] = 1
              else:
                  feature_counter[feature_idx] += 1
          except KeyError:
              # Ignore out-of-vocabulary items for fixed_vocab=True
              continue

      j_indices.extend(feature_counter.keys())
      values.extend(feature_counter.values())
      indptr.append(len(j_indices))

  if not fixed_vocab:
      # disable defaultdict behaviour
      vocabulary = dict(vocabulary)
      if not vocabulary:
          raise ValueError(
              "empty vocabulary; perhaps the documents only contain stop words"
          )

  if indptr[-1] &amp;gt; np.iinfo(np.int32).max: # = 2**31 - 1
      if _IS_32BIT:
          raise ValueError(
              (
                  "sparse CSR array has {} non-zero "
                  "elements and requires 64 bit indexing, "
                  "which is unsupported with 32 bit Python."
              ).format(indptr[-1])
          )
      indices_dtype = np.int64

  else:
      indices_dtype = np.int32
  j_indices = np.asarray(j_indices, dtype=indices_dtype)
  indptr = np.asarray(indptr, dtype=indices_dtype)
  values = np.frombuffer(values, dtype=np.intc)

  X = sp.csr_matrix(
      (values, j_indices, indptr),
      shape=(len(indptr) - 1, len(vocabulary)),
      dtype=self.dtype,
  )
  X.sort_indices()
  return vocabulary, X

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The most evident difference here is that the Elixir code builds a dense tensor, while the Python code builds a sparse matrix. A sparse representation certainly makes much more sense in the context of these vectorizers, but Nx does not currently support sparse tensors. This is also why we use &lt;code&gt;Task.async_stream&lt;/code&gt; along with &lt;code&gt;Enum.chunk_every&lt;/code&gt;: to reduce memory consumption, since the tensor is dense.&lt;/p&gt;
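&lt;p&gt;To make the memory trade-off concrete, here is a small NumPy/SciPy sketch (my own illustration with made-up sizes, not code from either library) comparing a dense count matrix to its CSR equivalent:&lt;/p&gt;

```python
import numpy as np
import scipy.sparse as sp

# A toy term-frequency matrix: 1000 documents x 5000 vocabulary terms,
# with only 10 non-zero counts per document (typical for text data).
rng = np.random.default_rng(0)
dense = np.zeros((1000, 5000), dtype=np.int64)
for doc_idx in range(1000):
    cols = rng.choice(5000, size=10, replace=False)
    dense[doc_idx, cols] = 1

sparse = sp.csr_matrix(dense)

# The dense array stores all 5,000,000 cells; CSR stores only the
# 10,000 non-zero values plus their index arrays.
print(dense.nbytes)  # 40,000,000 bytes
print(sparse.data.nbytes + sparse.indices.nbytes + sparse.indptr.nbytes)
```

With a dense representation every empty cell still costs memory, which is why the Elixir version processes the corpus in chunks.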

&lt;p&gt;The way we construct the tensor, however, is almost identical. We start by creating a zero-tensor with the shape of the final tensor, then build up mappings of indices to their updates while iterating within the &lt;code&gt;reduce&lt;/code&gt;. After collecting these updates, we apply them to the initial zero-tensor with &lt;code&gt;Nx.indexed_put&lt;/code&gt;, which takes the list of indices being updated along with the new values to put at those indices.&lt;/p&gt;
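&lt;p&gt;The same zero-tensor-then-scatter pattern is easy to sketch in NumPy (a hypothetical toy corpus of my own, not code from either library): collect index/value pairs during iteration, then write them all at once, just as &lt;code&gt;Nx.indexed_put&lt;/code&gt; does.&lt;/p&gt;

```python
import numpy as np

vocabulary = {"cat": 0, "dog": 1, "fish": 2}
corpus = [["cat", "dog", "cat"], ["fish"]]

# Start from a zero matrix shaped (n_documents, n_terms), as the
# Elixir code does with Nx.broadcast(0, {n_doc, vocab_size}).
tf = np.zeros((len(corpus), len(vocabulary)), dtype=np.int64)

indices, updates = [], []
for doc_idx, doc in enumerate(corpus):
    counts = {}
    for token in doc:
        if token in vocabulary:
            counts[token] = counts.get(token, 0) + 1
    for token, count in counts.items():
        indices.append((doc_idx, vocabulary[token]))
        updates.append(count)

# The NumPy analogue of Nx.indexed_put/3: put each update at its index.
rows, cols = zip(*indices)
tf[list(rows), list(cols)] = updates

print(tf)  # [[2 1 0]
           #  [0 0 1]]
```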

&lt;h2&gt;
  
  
  Feature Pruning
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt; defp where_columns(condition = %Nx.Tensor{shape: {_cond_len}}) do
    count = Nx.sum(condition) |&amp;gt; Nx.to_number()
    Nx.argsort(condition, direction: :desc) |&amp;gt; Nx.slice_along_axis(0, count, axis: 0)
  end

  defp limit_features(
         vectorizer = %__MODULE__{},
         tf = %Nx.Tensor{},
         df = %Nx.Tensor{shape: {df_len}},
         high,
         low,
         limit
       ) do
    mask = Nx.broadcast(1, {df_len})
    mask = if high, do: Nx.logical_and(mask, Nx.less_equal(df, high)), else: mask
    mask = if low, do: Nx.logical_and(mask, Nx.greater_equal(df, low)), else: mask

    limit =
      case limit do
        0 -&amp;gt;
          limit

        nil -&amp;gt;
          limit

        _ -&amp;gt;
          limit - 1
      end

    mask =
      if limit &amp;amp;&amp;amp; Nx.to_number(Nx.sum(mask)) &amp;gt; limit do
        tfs = Nx.sum(tf, axes: [0]) |&amp;gt; Nx.flatten()
        orig_mask_inds = where_columns(mask)
        mask_inds = Nx.argsort(Nx.take(tfs, orig_mask_inds) |&amp;gt; Nx.multiply(-1))[0..limit]
        new_mask = Nx.broadcast(0, {df_len})
        new_indices = Nx.take(orig_mask_inds, mask_inds) |&amp;gt; Nx.new_axis(1)
        new_updates = Nx.broadcast(1, {Nx.flat_size(new_indices)})
        new_mask = Nx.indexed_put(new_mask, new_indices, new_updates)

        new_mask
      else
        mask
      end

    new_indices = mask |&amp;gt; Nx.flatten() |&amp;gt; Nx.cumulative_sum() |&amp;gt; Nx.subtract(1)

    {new_vocab, removed_terms} =
      Enum.reduce(vectorizer.vocabulary, {%{}, MapSet.new([])}, fn {term, old_index},
                                                                   {vocab_acc, removed_acc} -&amp;gt;
        case Nx.to_number(mask[old_index]) do
          1 -&amp;gt;
            {Map.put(vocab_acc, term, Nx.to_number(new_indices[old_index])), removed_acc}

          _ -&amp;gt;
            {vocab_acc, MapSet.put(removed_acc, term)}
        end
      end)

    kept_indices = where_columns(mask)

    if Nx.flat_size(kept_indices) == 0 do
      raise "After pruning, no terms remain. Try a lower min_df or a higher max_df."
    end

    tf = Nx.take(tf, kept_indices, axis: 1)
    {tf, new_vocab, removed_terms}
  end


def _limit_features(self, X, vocabulary, high=None, low=None, limit=None):
        if high is None and low is None and limit is None:
            return X, set()

        # Calculate a mask based on document frequencies
        dfs = _document_frequency(X)
        mask = np.ones(len(dfs), dtype=bool)
        if high is not None:
            mask &amp;amp;= dfs &amp;lt;= high if low is not none: mask &amp;amp;="dfs"&amp;gt;= low
        if limit is not None and mask.sum() &amp;gt; limit:
            tfs = np.asarray(X.sum(axis=0)).ravel()
            mask_inds = (-tfs[mask]).argsort()[:limit]
            new_mask = np.zeros(len(dfs), dtype=bool)
            new_mask[np.where(mask)[0][mask_inds]] = True
            mask = new_mask

        new_indices = np.cumsum(mask) - 1 # maps old indices to new
        removed_terms = set()
        for term, old_index in list(vocabulary.items()):
            if mask[old_index]:
                vocabulary[term] = new_indices[old_index]
            else:
                del vocabulary[term]
                removed_terms.add(term)
        kept_indices = np.where(mask)[0]
        if len(kept_indices) == 0:
            raise ValueError(
                "After pruning, no terms remain. Try a lower min_df or a higher max_df."
            )
        return X[:, kept_indices], removed_terms
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;These functions show the starkest differences between the capabilities of NumPy and those of Nx, as well as the syntactic differences between the two. NumPy's syntax is much less verbose than that of Nx (especially considering that we are operating outside of a &lt;code&gt;defn&lt;/code&gt; here, which would inject its own implementation of the &lt;code&gt;Kernel&lt;/code&gt; module to add custom operators), and some capabilities of NumPy are currently not possible in Nx. For example, in Nx you &lt;a href="https://elixirforum.com/t/np-argwhere-operation-equivalent-for-nx-is-this-feature-available-currently-is-there-an-alternative-that-i-m-not-considering/52926/2?u=acalejos&amp;amp;ref=thestackcanary.com"&gt;cannot do dynamic shape modifications&lt;/a&gt; like the one in the Python code &lt;code&gt;new_mask[np.where(mask)[0][mask_inds]] = True&lt;/code&gt;, so I had to find other solutions that achieve the same result. Looking at the Python version, you realize that &lt;code&gt;np.where(mask)[0]&lt;/code&gt; is only concerned with whole columns. This makes sense: each column represents a term in the vocabulary and each row represents a document in the corpus, so each item in a column is the count of that term in that document. We care about whole columns because term frequency is calculated per term, and each term is represented by an entire column. So we can combine our &lt;code&gt;where_columns&lt;/code&gt; helper with Nx functions such as &lt;code&gt;Nx.argsort&lt;/code&gt;, &lt;code&gt;Nx.take&lt;/code&gt;, and &lt;code&gt;Nx.multiply&lt;/code&gt; to sort the columns by our filter conditions and then take only as many as the supplied &lt;code&gt;:limit&lt;/code&gt; allows.&lt;/p&gt;
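&lt;p&gt;To see that the workaround is equivalent, here is a short NumPy sketch (with made-up frequencies of my own) comparing scikit-learn's dynamic boolean indexing to the argsort-and-take route used in the Elixir port:&lt;/p&gt;

```python
import numpy as np

# Column-wise term frequencies and a document-frequency mask.
tfs = np.array([5, 1, 9, 3, 7])
mask = np.array([True, False, True, True, True])
limit = 2

# What scikit-learn does: dynamic boolean indexing to keep the
# `limit` most frequent columns among those that pass the mask.
mask_inds = (-tfs[mask]).argsort()[:limit]
new_mask = np.zeros(len(tfs), dtype=bool)
new_mask[np.where(mask)[0][mask_inds]] = True

# The route used in the Elixir port: sort the surviving column
# indices by negated frequency, then take the first `limit`.
surviving = np.where(mask)[0]                    # like where_columns/1
order = np.argsort(tfs[surviving] * -1)[:limit]  # like Nx.argsort + Nx.multiply
kept = surviving[order]                          # like Nx.take

alt_mask = np.zeros(len(tfs), dtype=bool)
alt_mask[kept] = True  # like Nx.indexed_put with broadcast ones

print(np.array_equal(new_mask, alt_mask))  # True
```

Both routes keep the same columns; the second only ever indexes with tensors of statically known shape, which is what Nx requires.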

&lt;p&gt;It would take entirely too long for me to go over every difference between these two functions, but I implore you to look closely at these two implementations to gain a better understanding of how to convert NumPy code to Elixir Nx.&lt;/p&gt;

&lt;h2&gt;
  
  
  TFIDF Transformation
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def fit(% __MODULE__ {count_vectorizer: count_vectorizer} = vectorizer, corpus) do
    {cv, tf} = CountVectorizer.fit_transform(count_vectorizer, corpus)
    df = Scholar.Preprocessing.binarize(tf) |&amp;gt; Nx.sum(axes: [0])

    idf =
      if vectorizer.use_idf do
        {n_samples, _n_features} = Nx.shape(tf)
        df = Nx.add(df, if(vectorizer.smooth_idf, do: 1, else: 0))
        n_samples = if vectorizer.smooth_idf, do: n_samples + 1, else: n_samples
        Nx.divide(n_samples, df) |&amp;gt; Nx.log() |&amp;gt; Nx.add(1)
      end

    struct(vectorizer, count_vectorizer: cv, idf: idf)
  end

  def transform(%__MODULE__{count_vectorizer: count_vectorizer} = vectorizer, corpus) do
    tf = CountVectorizer.transform(count_vectorizer, corpus)

    tf =
      if vectorizer.sublinear_tf do
        Nx.select(Nx.equal(tf, 0), 0, Nx.log(tf) |&amp;gt; Nx.add(1))
      else
        tf
      end

    tf =
      if vectorizer.use_idf do
        unless vectorizer.idf do
          raise "Vectorizer has not been fitted yet. Please call `fit_transform` or `fit` first."
        end

        Nx.multiply(tf, vectorizer.idf)
      else
        tf
      end

    tf =
      case vectorizer.norm do
        nil -&amp;gt; tf
        norm -&amp;gt; Scholar.Preprocessing.normalize(tf, norm: norm)
      end

    tf
  end

  def fit_transform(%__MODULE__{} = vectorizer, corpus) do
    vectorizer = fit(vectorizer, corpus)
    {vectorizer, transform(vectorizer, corpus)}
  end


class TfidfTransformer(
    OneToOneFeatureMixin, TransformerMixin, BaseEstimator, auto_wrap_output_keys=None
):
    def __init__(self, *, norm="l2", use_idf=True, smooth_idf=True, sublinear_tf=False):
        self.norm = norm
        self.use_idf = use_idf
        self.smooth_idf = smooth_idf
        self.sublinear_tf = sublinear_tf

    def fit(self, X, y=None):
        X = self._validate_data(
            X, accept_sparse=("csr", "csc"), accept_large_sparse=not _IS_32BIT
        )
        if not sp.issparse(X):
            X = sp.csr_matrix(X)
        dtype = X.dtype if X.dtype in FLOAT_DTYPES else np.float64

        if self.use_idf:
            n_samples, n_features = X.shape
            df = _document_frequency(X)
            df = df.astype(dtype, copy=False)

            # perform idf smoothing if required
            df += int(self.smooth_idf)
            n_samples += int(self.smooth_idf)

            # log+1 instead of log makes sure terms with zero idf don't get
            # suppressed entirely.
            idf = np.log(n_samples / df) + 1
            self._idf_diag = sp.diags(
                idf,
                offsets=0,
                shape=(n_features, n_features),
                format="csr",
                dtype=dtype,
            )

        return self

    def transform(self, X, copy=True):
        X = self._validate_data(
            X, accept_sparse="csr", dtype=FLOAT_DTYPES, copy=copy, reset=False
        )
        if not sp.issparse(X):
            X = sp.csr_matrix(X, dtype=np.float64)

        if self.sublinear_tf:
            np.log(X.data, X.data)
            X.data += 1

        if self.use_idf:
            # idf_ being a property, the automatic attributes detection
            # does not work as usual and we need to specify the attribute
            # name:
            check_is_fitted(self, attributes=["idf_"], msg="idf vector is not fitted")

            # *= doesn't work
            X = X * self._idf_diag

        if self.norm is not None:
            X = normalize(X, norm=self.norm, copy=False)

        return X

    @property
    def idf_(self):
        # if _idf_diag is not set, this will raise an attribute error,
        # which means hasattr(self, "idf_") is False
        return np.ravel(self._idf_diag.sum(axis=0))

    @idf_.setter
    def idf_(self, value):
        value = np.asarray(value, dtype=np.float64)
        n_features = value.shape[0]
        self._idf_diag = sp.spdiags(
            value, diags=0, m=n_features, n=n_features, format="csr"
        )

class TfidfVectorizer(CountVectorizer):
    def fit(self, raw_documents, y=None):
        self._check_params()
        self._warn_for_unused_params()
        self._tfidf = TfidfTransformer(
            norm=self.norm,
            use_idf=self.use_idf,
            smooth_idf=self.smooth_idf,
            sublinear_tf=self.sublinear_tf,
        )
        X = super().fit_transform(raw_documents)
        self._tfidf.fit(X)
        return self

    def fit_transform(self, raw_documents, y=None):
        self._check_params()
        self._tfidf = TfidfTransformer(
            norm=self.norm,
            use_idf=self.use_idf,
            smooth_idf=self.smooth_idf,
            sublinear_tf=self.sublinear_tf,
        )
        X = super().fit_transform(raw_documents)
        self._tfidf.fit(X)
        # X is already a transformed view of raw_documents so
        # we set copy to False
        return self._tfidf.transform(X, copy=False)

    def transform(self, raw_documents):
        check_is_fitted(self, msg="The TF-IDF vectorizer is not fitted")

        X = super().transform(raw_documents)
        return self._tfidf.transform(X, copy=False)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I would consider this the most straightforward translation between the two implementations. One noticeable difference is that &lt;code&gt;Nx.log&lt;/code&gt; does not handle zero values the same way NumPy does. While NumPy essentially ignores zeroes, &lt;code&gt;Nx.log&lt;/code&gt; will throw a divide-by-zero error, so I use &lt;code&gt;Nx.select&lt;/code&gt; to skip zero entries and apply &lt;code&gt;Nx.log&lt;/code&gt; only to non-zero values. Additionally, I use &lt;code&gt;Nx.multiply(tf, vectorizer.idf)&lt;/code&gt; to achieve the same thing as &lt;code&gt;X = X * self._idf_diag&lt;/code&gt;; there is no need to construct a diagonal matrix, since &lt;code&gt;Nx.multiply&lt;/code&gt; broadcasts.&lt;/p&gt;
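&lt;p&gt;A quick NumPy check (illustrative data of my own) confirms that the broadcast multiply and the diagonal-matrix product are interchangeable here:&lt;/p&gt;

```python
import numpy as np

tf = np.array([[1.0, 2.0, 0.0],
               [0.0, 3.0, 4.0]])
idf = np.array([1.5, 1.0, 2.0])

# scikit-learn's approach: multiply by an (n_features, n_features)
# diagonal matrix, as in X * self._idf_diag (sparse-friendly).
via_diag = tf @ np.diag(idf)

# The Nx approach: broadcasting scales each column directly,
# as in Nx.multiply(tf, vectorizer.idf).
via_broadcast = tf * idf

print(np.allclose(via_diag, via_broadcast))  # True
```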

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;I would have liked to go into more detail for each example, but I think the code does a good job by itself showing the differences and steps required to translate a NumPy implementation to Nx. I think these examples illustrate how NumPy can obscure what operations are happening in an effort to make a more concise syntax, whereas some might consider Nx overly verbose in comparison. The more you familiarize yourself with both APIs, the better you will be able to identify places where you can do direct translation and places where you might have to be more creative.&lt;/p&gt;

&lt;p&gt;Comment below with your own examples or if you have any other Python snippets you want to see converted to Elixir Nx!&lt;/p&gt;

</description>
      <category>elixir</category>
      <category>python</category>
    </item>
    <item>
      <title>Leveling Up Your Elixir Option Handling</title>
      <dc:creator>Andrés Alejos</dc:creator>
      <pubDate>Sun, 17 Sep 2023 21:55:08 +0000</pubDate>
      <link>https://dev.to/acalejos/leveling-up-your-elixir-option-handling-474l</link>
      <guid>https://dev.to/acalejos/leveling-up-your-elixir-option-handling-474l</guid>
      <description>&lt;p&gt;When writing libraries in any programming language, it is helpful to validate function arguments as early as possible to avoid spending time on work that will eventually fail. This is a significant advantage of type systems, since many of the argument validations you might want to perform amount to ensuring that the correct types are passed to your function. With Elixir being dynamically typed (for now), other language idioms are used to achieve argument validation. For example, you might combine multiclause functions, pattern matching, and function guards to achieve an outcome similar to static type checks.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def negate(int) when is_integer(int) do
  -int
end

def negate(bool) when is_boolean(bool) do
  not bool
end

def negate(_) do
  raise "Type must be an integer or a boolean"
end
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is the example that José Valim uses in his recent ElixirConfUS talk entitled &lt;a href="https://youtu.be/giYbq4HmfGA?si=1eO_LZlwu2cbSOVG&amp;amp;ref=thestackcanary.com"&gt;The Foundations of the Elixir Type System&lt;/a&gt;. As you can see, you can use multiclause functions to separate the concerns for each type the function accepts, and include a final clause that matches every other type and &lt;code&gt;raise&lt;/code&gt;s an error with an appropriate message. This is a perfectly fine solution for rudimentary argument validation, but once you need to validate properties not captured in guards, or to validate keyword arguments, it gets messier.&lt;/p&gt;

&lt;p&gt;You can always elect to do manual keyword validation using &lt;code&gt;Keyword.validate&lt;/code&gt;, but if you have multiple functions that share similar validation then you might find it repetitive, in which case you might extract those validations into a function. Then you might start realizing that you want more powerful validations, and soon enough, you decide to extract that logic into its own module. Well, as it turns out, the folks over at Dashbit have already done that with the &lt;a href="https://github.com/dashbitco/nimble_options?ref=thestackcanary.com"&gt;Nimble Options&lt;/a&gt; package!&lt;/p&gt;

&lt;h2&gt;
  
  
  Intro to NimbleOptions
&lt;/h2&gt;

&lt;p&gt;NimbleOptions is described as "a tiny library for validating and documenting high-level options." It lets you define schemas against which to validate your keyword options and raises an appropriate error when a validation fails. It includes many built-in types for defining schemas, or you can provide custom definitions. It also lets you pair an option's documentation with the option itself and then conveniently generate the documentation for all of your options. You can reuse and compose schema definitions as you would any keyword list (since schemas are defined as keyword lists), and even compile your schemas ahead of time (assuming your definitions contain no runtime-only terms).&lt;/p&gt;

&lt;p&gt;The API is very easy to learn, and the library is small enough that you can feel good incorporating it into your code base. Another benefit of the library is that it lets you transform parameters while validating them, which proved invaluable when I was writing bindings to an external API. If I didn't like a particular API parameter, or if one did not make sense in the context of Elixir, I could change the Elixir-facing API and use NimbleOptions to transform the parameter into the field required by the external API. Let's dive deeper into that real-world use case.&lt;/p&gt;

&lt;h2&gt;
  
  
  Leveling Up Your Validations
&lt;/h2&gt;

&lt;p&gt;As I was writing &lt;a href="https://github.com/acalejos/exgboost?ref=thestackcanary.com"&gt;&lt;code&gt;EXGBoost&lt;/code&gt;&lt;/a&gt;, I found that one of the pain points was finding a good way to do the bevy of parameter validation needed for the library. XGBoost itself has &lt;a href="https://xgboost.readthedocs.io/en/stable/parameter.html?ref=thestackcanary.com"&gt;many different parameters&lt;/a&gt; that it may accept, and the way in which some parameters act can be dependent on other parameters.&lt;/p&gt;

&lt;p&gt;There were several unique considerations I had when writing my parameter validations. I will explain the problem and show how &lt;code&gt;NimbleOptions&lt;/code&gt; helped me solve it.&lt;/p&gt;


&lt;p&gt;The code that I will be referencing is available in its totality &lt;a href="https://github.com/acalejos/exgboost/blob/main/lib/exgboost/parameters.ex?ref=thestackcanary.com"&gt;here&lt;/a&gt;, and you can find the documentation that was generated (including that which was generated with NimbleOptions) &lt;a href="https://hexdocs.pm/exgboost/EXGBoost.Parameters.html?ref=thestackcanary.com"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Custom Validations
&lt;/h3&gt;

&lt;p&gt;As with many machine learning models, there are parameters that must be validated within a real-number range (not an Elixir &lt;code&gt;Range&lt;/code&gt;), so I knew those would also need repeated custom validation. Think parameters such as regularization terms (&lt;code&gt;alpha&lt;/code&gt;, &lt;code&gt;lambda&lt;/code&gt;), learning rates (&lt;code&gt;eta&lt;/code&gt;), etc. One interesting case for XGBoost is the &lt;code&gt;colsample_by*&lt;/code&gt; family of parameters. The XGBoost C API treats each one as a separate parameter, but each shares the same validation. Also, these parameters work cumulatively since they control the tree sampling according to different characteristics, so a valid option could be &lt;code&gt;{'colsample_bytree':0.5, 'colsample_bylevel':0.5, 'colsample_bynode':0.5}&lt;/code&gt;, which would reduce 64 features to 8 features at each split. I wanted to simplify this API a bit to&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;colsample_by: [tree: 0.8, node: 0.8, level: 0.8]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We can do this by taking advantage of a custom &lt;code&gt;type&lt;/code&gt; in the definition schema. First, let's write the definition for this parameter:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;colsample_by: [
    type: {:custom, EXGBoost.Parameters, :validate_colsample, []},
    doc: """
    This is a family of parameters for subsampling of columns.
    All `colsample_by` parameters have a range of `(0, 1]`, the default value of `1`, and specify the fraction of columns to be subsampled.
    `colsample_by` parameters work cumulatively. For instance, the combination
    `colsample_by: [tree: 0.5, level: 0.5, node: 0.5]` with `64` features will leave `8`.
      * `:tree` - The subsample ratio of columns when constructing each tree. Subsampling occurs once for every tree constructed. Valid range is (0, 1]. The default value is `1`.
      * `:level` - The subsample ratio of columns for each level. Subsampling occurs once for every new depth level reached in a tree. Columns are subsampled from the set of columns chosen for the current tree. Valid range is (0, 1]. The default value is `1`.
      * `:node` - The subsample ratio of columns for each node (split). Subsampling occurs once every time a new split is evaluated. Columns are subsampled from the set of columns chosen for the current level. Valid range is (0, 1]. The default value is `1`.
    """
  ]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Next, we write the validator. With a custom type in &lt;code&gt;NimbleOptions&lt;/code&gt;, your validator must return &lt;code&gt;{:error, reason}&lt;/code&gt; or &lt;code&gt;{:ok, options}&lt;/code&gt;, where &lt;code&gt;options&lt;/code&gt; is the validated output.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def validate_colsample(x) do
  unless is_list(x) do
    {:error, "Parameter `colsample` must be a list, got #{inspect(x)}"}
  else
    Enum.reduce_while(x, {:ok, []}, fn x, {_status, acc} -&amp;gt;
      case x do
        {key, value} when key in [:tree, :level, :node] and is_number(value) -&amp;gt;
          if in_range(value, "(0,1]") do
            {:cont, {:ok, [{String.to_atom("colsample_by#{key}"), value} | acc]}}
          else
            {:halt,
             {:error, "Parameter `colsample: #{key}` must be in (0,1], got #{inspect(value)}"}}
          end

        {key, _value} -&amp;gt;
          {:halt,
           {:error,
            "Parameter `colsample` must be in [:tree, :level, :node], got #{inspect(key)}"}}

        _ -&amp;gt;
          {:halt, {:error, "Parameter `colsample` must be a keyword list, got #{inspect(x)}"}}
      end
    end)
  end
end
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And just like that, we can now have all three possible &lt;code&gt;colsample_by*&lt;/code&gt; options succinctly under one key, while still adhering to the XGBoost C API. We will touch more on other transformations later.&lt;/p&gt;

&lt;h3&gt;
  
  
  Overridable Configuration Defaults
&lt;/h3&gt;

&lt;p&gt;There are certain parameters that &lt;strong&gt;must&lt;/strong&gt; be passed to the XGBoost API, but we don't necessarily want the user to have to define them on each API call. So instead we can define a default value that the user can either globally override or override on each call. One case of this is the parameter &lt;code&gt;nthread&lt;/code&gt; for &lt;code&gt;EXGBoost.train&lt;/code&gt; and &lt;code&gt;EXGBoost.predict&lt;/code&gt;. It would be tedious for the user to have to pass the &lt;code&gt;nthread&lt;/code&gt; option for each invocation of these high-level APIs, so instead we use the &lt;code&gt;Application.compile_env/3&lt;/code&gt; function to set a default in our &lt;code&gt;NimbleOptions&lt;/code&gt; schema definitions.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt; nthread: [
      type: :non_neg_integer,
      default: Application.compile_env(:exgboost, :nthread, 0),
      doc: """
      Number of threads to use for training and prediction. If `0`, then the
      number of threads is set to the number of cores. This can be set globally
      using the `:exgboost` application environment variable `:nthread`
      or on a per booster basis. If set globally, the value will be used for
      all boosters unless overridden by a specific booster.
      To set the number of threads globally, add the following to your `config.exs`:
      `config :exgboost, nthread: n`.
      """
    ]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;With this definition, the highest precedence is when a user passes the &lt;code&gt;nthread&lt;/code&gt; option to the API call. If they do not provide that option, then it falls back to the value the user set under the same key in their &lt;code&gt;config.exs&lt;/code&gt; file. If the user did not set the key, then the default of &lt;code&gt;0&lt;/code&gt; is used, which in this case refers to using all available cores. Additionally, since we used &lt;code&gt;Application.compile_env/3&lt;/code&gt;, then this schema still has no runtime-only terms and can thus be compiled (which means that any runtime changes to the environment will not be reflected).&lt;/p&gt;

&lt;h3&gt;
  
  
  Parameter Transformation
&lt;/h3&gt;

&lt;p&gt;The XGBoost C API requires that all parameters be JSON string encoded, which means that many of the parameter names it uses are not valid atoms. For example, all of the &lt;a href="https://xgboost.readthedocs.io/en/stable/parameter.html?ref=thestackcanary.com#learning-task-parameters"&gt;Learning Task Parameters&lt;/a&gt; objectives use colons (&lt;code&gt;:&lt;/code&gt;) to separate parameters used for different types of models (&lt;code&gt;"reg:squarederror"&lt;/code&gt;, &lt;code&gt;"binary:logistic"&lt;/code&gt;, &lt;code&gt;"multi:softmax"&lt;/code&gt;). We could just use those strings directly, but that does not feel very "Elixir" to me. So instead I opted to use atoms and replace the colon with an underscore (&lt;code&gt;:reg_squarederror&lt;/code&gt;, &lt;code&gt;:binary_logistic&lt;/code&gt;, &lt;code&gt;:multi_softmax&lt;/code&gt;), which looks much cleaner within a keyword list. &lt;strong&gt;Atoms convey to the user that there is a limited enumeration of valid options, whereas a string conveys a user-defined option.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;So to achieve this we can define the definition with a custom validation:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;objective: [
      type: {:custom, EXGBoost.Parameters, :validate_objective, []},
      default: :reg_squarederror,
      doc: ~S"""
      Specify the learning task and the corresponding learning objective. The objective options are:
        * `:reg_squarederror` - regression with squared loss.
        * `:reg_squaredlogerror` - regression with squared log loss $\frac{1}{2}[\log (pred + 1) - \log (label + 1)]^2$. All input labels are required to be greater than `-1`. Also, see metric rmsle for possible issue with this objective.
        * `:reg_logistic` - logistic regression.
        * `:reg_pseudohubererror` - regression with Pseudo Huber loss, a twice differentiable alternative to absolute loss.
        * `:reg_absoluteerror` - Regression with `L1` error. When tree model is used, leaf value is refreshed after tree construction. If used in distributed training, the leaf value is calculated as the mean value from all workers, which is not guaranteed to be optimal.
        * `:reg_quantileerror` - Quantile loss, also known as pinball loss. See later sections for its parameter and Quantile Regression for a worked example.
        * `:binary_logistic` - logistic regression for binary classification, output probability
        * `:binary_logitraw` - logistic regression for binary classification, output score before logistic transformation
        * `:binary_hinge` - hinge loss for binary classification. This makes predictions of `0` or `1`, rather than producing probabilities.
        * `:count_poisson` - Poisson regression for count data, output mean of Poisson distribution.
            * `max_delta_step` is set to `0.7` by default in Poisson regression (used to safeguard optimization)
        * `:survival_cox` - Cox regression for right censored survival time data (negative values are considered right censored). Note that predictions are returned on the hazard ratio scale (i.e., as `HR = exp(marginal_prediction)` in the proportional hazard function `h(t) = h0(t) * HR`).
        * `:survival_aft` - Accelerated failure time model for censored survival time data. See [Survival Analysis with Accelerated Failure Time](https://xgboost.readthedocs.io/en/latest/tutorials/aft_survival_analysis.html) for details.
        * `:multi_softmax` - set XGBoost to do multiclass classification using the softmax objective, you also need to set num_class(number of classes)
        * `:multi_softprob` - same as softmax, but output a vector of ndata * nclass, which can be further reshaped to ndata * nclass matrix. The result contains predicted probability of each data point belonging to each class.
        * `:rank_ndcg` - Use LambdaMART to perform pair-wise ranking where Normalized Discounted Cumulative Gain (NDCG) is maximized. This objective supports position debiasing for click data.
        * `:rank_map` - Use LambdaMART to perform pair-wise ranking where Mean Average Precision (MAP) is maximized
        * `:rank_pairwise` - Use LambdaRank to perform pair-wise ranking using the ranknet objective.
        * `:reg_gamma` - gamma regression with log-link. Output is a mean of gamma distribution. It might be useful, e.g., for modeling insurance claims severity, or for any outcome that might be gamma-distributed.
        * `:reg_tweedie` - Tweedie regression with log-link. It might be useful, e.g., for modeling total loss in insurance, or for any outcome that might be Tweedie-distributed.
      """
    ]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then define the validation function itself:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def validate_objective(x) do
  if(
    x in [
      :reg_squarederror,
      :reg_squaredlogerror,
      :reg_logistic,
      :reg_pseudohubererror,
      :reg_absoluteerror,
      :reg_quantileerror,
      :binary_logistic,
      :binary_logitraw,
      :binary_hinge,
      :count_poisson,
      :survival_cox,
      :survival_aft,
      :multi_softmax,
      :multi_softprob,
      :rank_ndcg,
      :rank_map,
      :rank_pairwise,
      :reg_gamma,
      :reg_tweedie
    ],
    do: {:ok, Atom.to_string(x) |&amp;gt; String.replace("_", ":")},
    else:
      {:error,
       "Parameter `objective` must be in [:reg_squarederror, :reg_squaredlogerror, :reg_logistic, :reg_pseudohubererror, :reg_absoluteerror, :reg_quantileerror, :binary_logistic, :binary_logitraw, :binary_hinge, :count_poisson, :survival_cox, :survival_aft, :multi_softmax, :multi_softprob, :rank_ndcg, :rank_map, :rank_pairwise, :reg_gamma, :reg_tweedie], got #{inspect(x)}"}
  )
end
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
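&lt;p&gt;As a sanity check, the transformation simply swaps the underscore in each atom for the colon that XGBoost expects. Below is a standalone sketch of the same logic (the list of valid objectives is truncated for brevity, and the anonymous-function form is just for illustration):&lt;/p&gt;

```elixir
# Standalone sketch of the atom -> string transformation performed by
# validate_objective/1 (valid list truncated for brevity)
validate_objective = fn x ->
  valid = [:reg_squarederror, :binary_logistic, :count_poisson, :rank_ndcg]

  if x in valid do
    {:ok, x |> Atom.to_string() |> String.replace("_", ":")}
  else
    {:error, "Parameter `objective` must be in #{inspect(valid)}, got #{inspect(x)}"}
  end
end

{:ok, "binary:logistic"} = validate_objective.(:binary_logistic)
{:error, _} = validate_objective.(:not_an_objective)
```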



&lt;p&gt;Another situation where these transformations led to a much cleaner, "Elixir"-styled API is with certain evaluation metrics. These were also defined in the XGBoost API as strings: some options can be suffixed with a &lt;code&gt;-&lt;/code&gt; to indicate an inversion of sorts (&lt;code&gt;ndcg&lt;/code&gt; vs &lt;code&gt;ndcg-&lt;/code&gt;), others can be parameterized with a cut-off value (&lt;code&gt;ndcg&lt;/code&gt; vs &lt;code&gt;ndcg@n&lt;/code&gt;, where &lt;code&gt;n&lt;/code&gt; is an integer), and the two modifiers can even be combined (&lt;code&gt;ndcg@n-&lt;/code&gt;). Plain strings would have been perfectly serviceable here, but again they felt inelegant next to the rest of the language. So, to keep the options as atoms while still allowing these modifications, I prepend &lt;code&gt;inv_&lt;/code&gt; to the atom name for metrics that would otherwise append &lt;code&gt;-&lt;/code&gt; to their string, and use a 2-tuple in place of the &lt;code&gt;@&lt;/code&gt; parameterization. This leads to an API such as &lt;code&gt;:ndcg&lt;/code&gt;, &lt;code&gt;{:ndcg, n}&lt;/code&gt;, &lt;code&gt;:inv_ndcg&lt;/code&gt;, and &lt;code&gt;{:inv_ndcg, n}&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def validate_eval_metric(x) do
    x = if is_list(x), do: x, else: [x]

    metrics =
      Enum.map(x, fn y -&amp;gt;
        case y do
          {task, n} when task in [:error, :ndcg, :map, :tweedie_nloglik] and is_number(n) -&amp;gt;
            task = Atom.to_string(task) |&amp;gt; String.replace("_", "-")
            "#{task}@#{n}"

          {task, n} when task in [:inv_ndcg, :inv_map] and is_number(n) -&amp;gt;
            [task | _tail] = task |&amp;gt; Atom.to_string() |&amp;gt; String.split("_") |&amp;gt; Enum.reverse()
            "#{task}@#{n}-"

          task when task in [:inv_ndcg, :inv_map] -&amp;gt;
            [task | _tail] = task |&amp;gt; Atom.to_string() |&amp;gt; String.split("_") |&amp;gt; Enum.reverse()
            "#{task}-"

          task
          when task in [
                 :rmse,
                 :rmsle,
                 :mae,
                 :mape,
                 :mphe,
                 :logloss,
                 :error,
                 :merror,
                 :mlogloss,
                 :auc,
                 :aucpr,
                 :ndcg,
                 :map,
                 :tweedie_nloglik,
                 :poisson_nloglik,
                 :gamma_nloglik,
                 :cox_nloglik,
                 :gamma_deviance,
                 :aft_nloglik,
                 :interval_regression_accuracy
               ] -&amp;gt;
            Atom.to_string(task) |&amp;gt; String.replace("_", "-")

          _ -&amp;gt;
            raise ArgumentError,
                  "Parameter `eval_metric` must be in [:rmse, :rmsle, :mae, :mape, :mphe, :logloss, :error, :merror, :mlogloss, :auc, :aucpr, :ndcg, :map, :tweedie_nloglik, :poisson_nloglik, :gamma_nloglik, :cox_nloglik, :gamma_deviance, :aft_nloglik, :interval_regression_accuracy], got #{inspect(y)}"
        end
      end)

    {:ok, metrics}
  end
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
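&lt;p&gt;To make the convention concrete, here is a standalone sketch of how the atom and tuple forms map onto XGBoost's metric strings (the clause for plain metrics is reduced to a catch-all here; the real function validates against the full list):&lt;/p&gt;

```elixir
# Illustrative sketch of the eval-metric naming convention
# (mirrors validate_eval_metric/1; plain metrics reduced to a catch-all)
to_metric = fn
  {task, n} when task in [:error, :ndcg, :map, :tweedie_nloglik] ->
    "#{task |> Atom.to_string() |> String.replace("_", "-")}@#{n}"

  {task, n} when task in [:inv_ndcg, :inv_map] ->
    "inv_" <> base = Atom.to_string(task)
    "#{base}@#{n}-"

  task when task in [:inv_ndcg, :inv_map] ->
    "inv_" <> base = Atom.to_string(task)
    "#{base}-"

  task when is_atom(task) ->
    task |> Atom.to_string() |> String.replace("_", "-")
end

["ndcg", "ndcg@3", "ndcg-", "ndcg@3-", "tweedie-nloglik@1.5"] =
  Enum.map([:ndcg, {:ndcg, 3}, :inv_ndcg, {:inv_ndcg, 3}, {:tweedie_nloglik, 1.5}], to_metric)
```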



&lt;h3&gt;
  
  
  Composability
&lt;/h3&gt;

&lt;p&gt;Since definition schemas are just normal Keyword Lists, we can compose definitions just as we would any other Keyword List. For example, XGBoost has a class of booster called the Dart Booster, which is just a Tree Booster with dropout. So, assuming we have our tree booster schema defined as &lt;code&gt;@tree_booster_params&lt;/code&gt;, we can define our Dart Booster as:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;@dart_booster_params @tree_booster_params ++
   [
     sample_type: [
       type: {:in, [:uniform, :weighted]},
       default: :uniform,
       doc: """
       Type of sampling algorithm.
          * `:uniform` - Dropped trees are selected uniformly.
          * `:weighted` - Dropped trees are selected in proportion to weight.
       """
     ],
     normalize_type: [
       type: {:in, [:tree, :forest]},
       default: :tree,
       doc: """
       Type of normalization algorithm.
          * `:tree` - New trees have the same weight of each of dropped trees.
              * Weight of new trees are `1 / (k + learning_rate)`.
              * Dropped trees are scaled by a factor of `k / (k + learning_rate)`.
          * `:forest` - New trees have the same weight of sum of dropped trees (forest).
              * Weight of new trees are 1 / (1 + learning_rate).
              * Dropped trees are scaled by a factor of 1 / (1 + learning_rate).
       """
     ],
     rate_drop: [
       type: {:custom, EXGBoost.Parameters, :in_range, ["[0,1]"]},
       default: 0.0,
       doc: """
       Dropout rate (a fraction of previous trees to drop during the dropout). Valid range is [0, 1].
       """
     ],
     one_drop: [
       type: {:in, [0, 1]},
       default: 0,
       doc: """
       When this flag is enabled, at least one tree is always dropped during the dropout (allows Binomial-plus-one or epsilon-dropout from the original DART paper).
       """
     ],
     skip_drop: [
       type: {:custom, EXGBoost.Parameters, :in_range, ["[0,1]"]},
       default: 0.0,
       doc: """
       Probability of skipping the dropout procedure during a boosting iteration. Valid range is [0, 1].
          * If a dropout is skipped, new trees are added in the same manner as gbtree.
          * **Note** that non-zero skip_drop has higher priority than rate_drop or one_drop.
       """
     ]
   ]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
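&lt;p&gt;A note on the semantics of this composition: since the schemas are ordinary keyword lists, &lt;code&gt;++&lt;/code&gt; concatenation is all that's needed, and functions like &lt;code&gt;Keyword.get/2&lt;/code&gt; return the first occurrence of a key, so prepending overrides while appending extends. A minimal sketch (the specs here are illustrative placeholders, not the real &lt;code&gt;EXGBoost&lt;/code&gt; schema):&lt;/p&gt;

```elixir
# Keyword-list schema composition: `++` appends new entries, and
# Keyword.get/2 returns the FIRST occurrence, so prepending a key overrides it.
tree = [max_depth: [type: :non_neg_integer, default: 6]]
dart = tree ++ [rate_drop: [type: :float, default: 0.0]]

[:max_depth, :rate_drop] = Keyword.keys(dart)

# Overriding a default: the prepended entry wins for Keyword.get/2
overridden = [max_depth: [type: :non_neg_integer, default: 3]] ++ tree
[type: :non_neg_integer, default: 3] = Keyword.get(overridden, :max_depth)
```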



&lt;h3&gt;
  
  
  Putting It All Together
&lt;/h3&gt;

&lt;p&gt;XGBoost separates its training and prediction options into four classes of options:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;General parameters&lt;/strong&gt; relate to which booster we are using to do boosting, commonly tree or linear model&lt;br&gt;&lt;br&gt;
&lt;strong&gt;Booster parameters&lt;/strong&gt; depend on which booster you have chosen&lt;br&gt;&lt;br&gt;
&lt;strong&gt;Learning task parameters&lt;/strong&gt; decide on the learning scenario. For example, regression tasks may use different parameters with ranking tasks.&lt;br&gt;&lt;br&gt;
&lt;strong&gt;Command line parameters&lt;/strong&gt; relate to behavior of CLI version of XGBoost.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Additionally, there are global parameters, which we will treat as &lt;code&gt;Application&lt;/code&gt;-level parameters. We can ignore the command line class since it doesn't apply to our use case. Since certain &lt;strong&gt;Learning Task Parameters&lt;/strong&gt; are only valid for certain objectives, we must separate those concerns, similar to how &lt;strong&gt;Booster Parameters&lt;/strong&gt; depend on options from the &lt;strong&gt;General Parameters&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Our final validation flow will look like this:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fefb62s4udqusqzti1b9b.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fefb62s4udqusqzti1b9b.png" alt="Final Validation Flow" width="800" height="571"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We define a separate &lt;code&gt;validate!/1&lt;/code&gt; function which we will invoke to validate this flow when our API is used. So let's first stub out our function:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def validate!(params) when is_list(params) do
end
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now we get and validate &lt;strong&gt;only&lt;/strong&gt; the general parameters from all of the options passed:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def validate!(params) when is_list(params) do
  general_params =
        Keyword.take(params, Keyword.keys(@general_params))
        |&amp;gt; NimbleOptions.validate!(@general_schema)
end

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now, as an escape hatch for users who want to pass parameters directly through to the XGBoost C API, forgoing the niceties we're providing, we also add a &lt;code&gt;:validate_parameters&lt;/code&gt; boolean option, which defaults to &lt;code&gt;true&lt;/code&gt; but can be set to &lt;code&gt;false&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;  if general_params[:validate_parameters] do
    ...
  else
    params
  end
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now, within the &lt;code&gt;true&lt;/code&gt; branch, we gather and validate the booster params:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;booster_params =
  case general_params[:booster] do
    :gbtree -&amp;gt;
      Keyword.take(params, Keyword.keys(@tree_booster_params))
      |&amp;gt; NimbleOptions.validate!(@tree_booster_schema)

    :gblinear -&amp;gt;
      Keyword.take(params, Keyword.keys(@linear_booster_params))
      |&amp;gt; NimbleOptions.validate!(@linear_booster_schema)

    :dart -&amp;gt;
      Keyword.take(params, Keyword.keys(@dart_booster_params))
      |&amp;gt; NimbleOptions.validate!(@dart_booster_schema)
  end
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Next we gather the Learning Task params:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;learning_task_params =
    Keyword.take(params, Keyword.keys(@learning_task_params))
    |&amp;gt; NimbleOptions.validate!(@learning_task_schema)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now we use the selected &lt;code&gt;:objective&lt;/code&gt; option from the &lt;code&gt;learning_task_params&lt;/code&gt; to validate the parameters which are objective-dependent:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;extra_params =
    case learning_task_params[:objective] do
      "reg:tweedie" -&amp;gt;
        Keyword.take(params, Keyword.keys(@tweedie_params))
        |&amp;gt; NimbleOptions.validate!(@tweedie_schema)

      "reg:pseudohubererror" -&amp;gt;
        Keyword.take(params, Keyword.keys(@pseudohubererror_params))
        |&amp;gt; NimbleOptions.validate!(@pseudohubererror_schema)

      "reg:quantileerror" -&amp;gt;
        Keyword.take(params, Keyword.keys(@quantileerror_params))
        |&amp;gt; NimbleOptions.validate!(@quantileerror_schema)

      "survival:aft" -&amp;gt;
        Keyword.take(params, Keyword.keys(@survival_params))
        |&amp;gt; NimbleOptions.validate!(@survival_schema)

      "rank:ndcg" -&amp;gt;
        Keyword.take(params, Keyword.keys(@ranking_params))
        |&amp;gt; NimbleOptions.validate!(@ranking_schema)

      "rank:map" -&amp;gt;
        Keyword.take(params, Keyword.keys(@ranking_params))
        |&amp;gt; NimbleOptions.validate!(@ranking_schema)

      "rank:pairwise" -&amp;gt;
        Keyword.take(params, Keyword.keys(@ranking_params))
        |&amp;gt; NimbleOptions.validate!(@ranking_schema)

      "multi:softmax" -&amp;gt;
        Keyword.take(params, Keyword.keys(@multi_soft_params))
        |&amp;gt; NimbleOptions.validate!(@multi_soft_schema)

      "multi:softprob" -&amp;gt;
        Keyword.take(params, Keyword.keys(@multi_soft_params))
        |&amp;gt; NimbleOptions.validate!(@multi_soft_schema)

      _ -&amp;gt;
        []
    end
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Finally, we return all of the params we gathered:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;general_params ++ booster_params ++ learning_task_params ++ extra_params
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The final validation function looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;@doc """
  Validates the EXGBoost parameters and returns a keyword list of the validated parameters.
  """
@spec validate!(keyword()) :: keyword()
def validate!(params) when is_list(params) do
  # Get some of the params that other params depend on
  general_params =
    Keyword.take(params, Keyword.keys(@general_params))
    |&amp;gt; NimbleOptions.validate!(@general_schema)

  if general_params[:validate_parameters] do
    booster_params =
      case general_params[:booster] do
        :gbtree -&amp;gt;
          Keyword.take(params, Keyword.keys(@tree_booster_params))
          |&amp;gt; NimbleOptions.validate!(@tree_booster_schema)

        :gblinear -&amp;gt;
          Keyword.take(params, Keyword.keys(@linear_booster_params))
          |&amp;gt; NimbleOptions.validate!(@linear_booster_schema)

        :dart -&amp;gt;
          Keyword.take(params, Keyword.keys(@dart_booster_params))
          |&amp;gt; NimbleOptions.validate!(@dart_booster_schema)
      end

    learning_task_params =
      Keyword.take(params, Keyword.keys(@learning_task_params))
      |&amp;gt; NimbleOptions.validate!(@learning_task_schema)

    extra_params =
      case learning_task_params[:objective] do
        "reg:tweedie" -&amp;gt;
          Keyword.take(params, Keyword.keys(@tweedie_params))
          |&amp;gt; NimbleOptions.validate!(@tweedie_schema)

        "reg:pseudohubererror" -&amp;gt;
          Keyword.take(params, Keyword.keys(@pseudohubererror_params))
          |&amp;gt; NimbleOptions.validate!(@pseudohubererror_schema)

        "reg:quantileerror" -&amp;gt;
          Keyword.take(params, Keyword.keys(@quantileerror_params))
          |&amp;gt; NimbleOptions.validate!(@quantileerror_schema)

        "survival:aft" -&amp;gt;
          Keyword.take(params, Keyword.keys(@survival_params))
          |&amp;gt; NimbleOptions.validate!(@survival_schema)

        "rank:ndcg" -&amp;gt;
          Keyword.take(params, Keyword.keys(@ranking_params))
          |&amp;gt; NimbleOptions.validate!(@ranking_schema)

        "rank:map" -&amp;gt;
          Keyword.take(params, Keyword.keys(@ranking_params))
          |&amp;gt; NimbleOptions.validate!(@ranking_schema)

        "rank:pairwise" -&amp;gt;
          Keyword.take(params, Keyword.keys(@ranking_params))
          |&amp;gt; NimbleOptions.validate!(@ranking_schema)

        "multi:softmax" -&amp;gt;
          Keyword.take(params, Keyword.keys(@multi_soft_params))
          |&amp;gt; NimbleOptions.validate!(@multi_soft_schema)

        "multi:softprob" -&amp;gt;
          Keyword.take(params, Keyword.keys(@multi_soft_params))
          |&amp;gt; NimbleOptions.validate!(@multi_soft_schema)

        _ -&amp;gt;
          []
      end

    general_params ++ booster_params ++ learning_task_params ++ extra_params
  else
    params
  end
end
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;I hope I have demonstrated how &lt;code&gt;NimbleOptions&lt;/code&gt; can turn some fairly complex validation logic into manageable, intuitive code. I wanted to share my thoughts on the library since it worked wonders for my particular use case, which I believe showcased a wide range of validation techniques. With this validation flow, I can take a complex set of parameters such as the following:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;params = [
      num_boost_rounds: num_boost_round,
      tree_method: :hist,
      obj: :multi_softprob,
      num_class: num_class,
      eval_metric: [
        :rmse,
        :rmsle,
        :mae,
        :mape,
        :logloss,
        :error,
        :auc,
        :merror,
        :mlogloss,
        :gamma_nloglik,
        :inv_map,
        {:tweedie_nloglik, 1.5},
        {:error, 0.2},
        {:ndcg, 3},
        {:map, 2},
        {:inv_ndcg, 3}
      ],
      max_depth: 3,
      eta: 0.3,
      gamma: 0.1,
      min_child_weight: 1,
      subsample: 0.8,
      colsample_by: [tree: 0.8, node: 0.8, level: 0.8],
      lambda: 1,
      alpha: 0,
      grow_policy: :lossguide,
      max_leaves: 0,
      max_bin: 128,
      predictor: :cpu_predictor,
      num_parallel_tree: 1,
      monotone_constraints: [],
      interaction_constraints: []
    ]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;and completely validate it within Elixir (although the XGBoost C API performs its own validation as well).&lt;/p&gt;

&lt;p&gt;Finally, when you need to produce documentation for your modules, &lt;code&gt;NimbleOptions&lt;/code&gt; makes it incredibly easy. Since all of my documentation was paired with the parameters themselves, my final module documentation just looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;@moduledoc """
  Parameters are used to configure the training process and the booster.

  ## Global Parameters

  You can set the following params either using a global application config (preferred)
  or using the `EXGBoost.set_config/1` function. The global config is set using the `:exgboost` key.
  Note that using the `EXGBoost.set_config/1` function will override the global config for the
  current instance of the application.

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;  config :exgboost,
    verbosity: :info,
    use_rmm: true
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;  #{NimbleOptions.docs(@global_schema)}

  ## General Parameters
  #{NimbleOptions.docs(@general_schema)}

  ## Tree Booster Parameters
  #{NimbleOptions.docs(@tree_booster_schema)}

  ## Linear Booster Parameters
  #{NimbleOptions.docs(@linear_booster_schema)}

  ## Dart Booster Parameters
  #{NimbleOptions.docs(@dart_booster_schema)}

  ## Learning Task Parameters
  #{NimbleOptions.docs(@learning_task_schema)}

  ## Objective-Specific Parameters

  ### Tweedie Regression Parameters
  #{NimbleOptions.docs(@tweedie_schema)}

  ### Pseudo-Huber Error Parameters
  #{NimbleOptions.docs(@pseudohubererror_schema)}

  ### Quantile Error Parameters
  #{NimbleOptions.docs(@quantileerror_schema)}

  ### Survival Analysis Parameters
  #{NimbleOptions.docs(@survival_schema)}

  ### Ranking Parameters
  #{NimbleOptions.docs(@ranking_schema)}

  ### Multi-Class Classification Parameters
  #{NimbleOptions.docs(@multi_soft_schema)}
  """
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Not only does it produce great documentation for users of the library, but the code is now logically organized and easy to navigate for any future developers or contributors. The use of &lt;code&gt;NimbleOptions&lt;/code&gt; makes my code much more self-documenting, which is a great quality to strive for in a code base. I hope I've convinced you to try the library out for yourself. Let me know how it goes!&lt;/p&gt;

&lt;p&gt;&lt;a href="https://thestackcanary.com/#/portal/signup" class="ltag_cta ltag_cta--branded"&gt;Enjoy this article? Sign-up to be notified!&lt;/a&gt;
&lt;/p&gt;

</description>
      <category>elixir</category>
      <category>productivity</category>
    </item>
    <item>
      <title>Serving Spam Detection With XGBoost and Elixir</title>
      <dc:creator>Andrés Alejos</dc:creator>
      <pubDate>Sat, 09 Sep 2023 21:36:59 +0000</pubDate>
      <link>https://dev.to/acalejos/serving-spam-detection-with-xgboost-and-elixir-15k5</link>
      <guid>https://dev.to/acalejos/serving-spam-detection-with-xgboost-and-elixir-15k5</guid>
      <description>&lt;p&gt;&lt;a href="https://livebook.dev/run?url=https%3A%2F%2Fgist.github.com%2Facalejos%2F4598e5e2b2b91e420a4cf609bc2ffc03&amp;amp;ref=thestackcanary.com"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--te6PgPhk--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://livebook.dev/badge/v1/black.svg" alt="Serving Spam Detection With XGBoost and Elixir" width="171" height="40"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h1&gt;
  
  
  Nx-Powered Decision Trees
&lt;/h1&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Mix.install(
  [
    {:exgboost, "~&amp;gt; 0.3.1", override: true},
    {:nx, "~&amp;gt; 0.6"},
    {:exla, "~&amp;gt; 0.5"},
    {:kino, "~&amp;gt; 0.10.0"},
    {:kino_explorer, "~&amp;gt; 0.1.4"},
    {:scidata, "~&amp;gt; 0.1"},
    {:scholar, "~&amp;gt; 0.1"},
    {:tokenizers, "~&amp;gt; 0.3.0"},
    {:explorer, "~&amp;gt; 0.7.0"},
    {:mighty, git: "https://github.com/acalejos/mighty.git"},
    {:mockingjay,
     git: "https://github.com/acalejos/mockingjay.git", branch: "make_tree_travs_jit_compilable"}
  ],
  config: [nx: [default_defn_options: [compiler: EXLA], default_backend: {EXLA.Backend, []}]]
)

alias Mighty.Preprocessing.TfidfVectorizer
data_path = "/{path_to_your_dataset}/Phishing_Email.csv"

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Intro
&lt;/h2&gt;

&lt;p&gt;This notebook was made to accompany my ElixirConfUS 2023 talk entitled &lt;a href="https://speakerdeck.com/acalejos/nx-powered-decision-trees?ref=thestackcanary.com"&gt;&lt;em&gt;Nx-Powered Decision Trees&lt;/em&gt;&lt;/a&gt;. For the best experience, you should launch this in &lt;a href="https://livebook.dev/"&gt;Livebook&lt;/a&gt; by clicking the button above.&lt;/p&gt;

&lt;p&gt;Additionally, the TF-IDF library used here was made for this talk, but I decided to release it since I plan to continue working on an NLTK-like library for Elixir. Consider it a work in progress.&lt;/p&gt;

&lt;p&gt;You can find all of the libraries that I wrote that are used in this notebook at my GitHub at &lt;a href="https://github.com/acalejos?ref=thestackcanary.com"&gt;https://github.com/acalejos&lt;/a&gt;. If you want to follow my projects you can find me at &lt;a href="https://twitter.com/ac_alejos?ref=thestackcanary.com"&gt;https://twitter.com/ac_alejos&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Problem Statement
&lt;/h2&gt;

&lt;p&gt;In this notebook we will be using the &lt;a href="https://www.kaggle.com/datasets/subhajournal/phishingemails?ref=thestackcanary.com"&gt;Phishing Email Dataset&lt;/a&gt; to create a Decision Tree Classifier to determine if an email is fake / a phishing attempt or legitimate.&lt;/p&gt;

&lt;p&gt;This is a binary classification task, meaning that there are only 2 possible outputs from the model: legitimate email or fake email. The dataset we are using includes pairs of email text to the classification label, so we will have to perform preprocessing on the text to generate features conducive to Decision Tree Learning.&lt;/p&gt;

&lt;p&gt;Once we are satisfied with our trained model, we will try it out against some examples from the test set and some user-generated examples.&lt;/p&gt;

&lt;p&gt;This notebook is based on the work done at &lt;a href="https://www.kaggle.com/code/vutronghoa/phishing-email-classification?ref=thestackcanary.com"&gt;https://www.kaggle.com/code/vutronghoa/phishing-email-classification&lt;/a&gt;. This was not meant to show the best fine-tuning practices for XGBoost, but rather to introduce &lt;code&gt;EXGBoost&lt;/code&gt; + &lt;code&gt;Mockingjay&lt;/code&gt; and how they can be used with &lt;code&gt;Nx.Serving&lt;/code&gt; to serve a decision tree model in Elixir.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;By the end, you will have processed a text dataset using TF-IDF, trained an &lt;code&gt;EXGBoost&lt;/code&gt; decision tree model, compiled the model into an &lt;code&gt;Nx&lt;/code&gt; function, and served the model using &lt;code&gt;Nx.Serving&lt;/code&gt;.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Explore the Dataset
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;alias Explorer.DataFrame, as: DF
require Explorer.DataFrame

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Explorer.DataFrame


df = Explorer.DataFrame.from_csv!(data_path, columns: ["Email Text", "Email Type"])

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Let's start by seeing how many &lt;code&gt;nil&lt;/code&gt; values there are in this dataset.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;DF.nil_count(df)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Only 16 &lt;code&gt;nil&lt;/code&gt; values out of 18,650 samples is not bad. We will go ahead and drop any row that contains a &lt;code&gt;nil&lt;/code&gt; value. If these were numerical features, or if a substantial portion of the dataset were &lt;code&gt;nil&lt;/code&gt;, there might be ways to fill in the &lt;code&gt;nil&lt;/code&gt; values, but here we will simply drop them.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;df = Explorer.DataFrame.drop_nil(df)
nil

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now we need to transform the labels from their current text representation to a binary one. We will map "Safe Email" to &lt;code&gt;0&lt;/code&gt; and "Phishing Email" to &lt;code&gt;1&lt;/code&gt;, and any other values to &lt;code&gt;2&lt;/code&gt;, to be filtered out later if needed. We will also add a column representing the text length of each row.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;text_length = Explorer.Series.transform(df["Email Text"], &amp;amp;String.length/1)

text_label =
  Explorer.Series.transform(df["Email Type"], fn
    "Safe Email" -&amp;gt;
      0

    "Phishing Email" -&amp;gt;
      1

    _ -&amp;gt;
      2
  end)

df = Explorer.DataFrame.put(df, "Text Length", text_length)
df = Explorer.DataFrame.put(df, "Email Type", text_label)
nil

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now that we have some numerical columns, we can use &lt;code&gt;Explorer.DataFrame.describe&lt;/code&gt; to get some initial metrics such as &lt;code&gt;mean&lt;/code&gt;, &lt;code&gt;count&lt;/code&gt;, &lt;code&gt;max&lt;/code&gt;, &lt;code&gt;min&lt;/code&gt;, and &lt;code&gt;std&lt;/code&gt;. For the sake of demonstration, we will use a Kino Explorer Smart Data Transformation cell here to showcase some of its features, but note that you could get a similar output using:&lt;/p&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;DF.describe(df) |&amp;gt; DF.discard("Email Text")

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;df
|&amp;gt; DF.to_lazy()
|&amp;gt; DF.summarise(
  "Text Length_min": min(col("Text Length")),
  "Text Length_max": max(col("Text Length")),
  "Text Length_mean": mean(col("Text Length")),
  "Text Length_variance": variance(col("Text Length")),
  "Text Length_standard_deviation": standard_deviation(col("Text Length"))
)
|&amp;gt; DF.collect()

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The max &lt;code&gt;Email Type&lt;/code&gt; value is &lt;code&gt;1&lt;/code&gt;, meaning that we don't have to filter out any that were assigned &lt;code&gt;2&lt;/code&gt; in the previous transform. The max &lt;code&gt;Text Length&lt;/code&gt; value seems like an extreme outlier compared to the other percentiles available. Let's take a look to see how much of the overall corpus the max value makes up.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Explorer.Series.max(df["Text Length"]) / Explorer.Series.sum(df["Text Length"])

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;0.3317832761107029

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;As you can see, the longest entry accounts for roughly 33% of the total text length of the entire ~18,000-document corpus, so we are going to remove it. In fact, for the sake of speed and memory efficiency during TF-IDF vectorization, let's remove any entry whose length is in the top 5% of the corpus.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;df =
  Explorer.DataFrame.filter_with(
    df,
    &amp;amp;Explorer.Series.less(&amp;amp;1["Text Length"], Explorer.Series.quantile(&amp;amp;1["Text Length"], 0.95))
  )

nil

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now that we have a trimmed-down dataset and encoded labels, we can convert this &lt;code&gt;DataFrame&lt;/code&gt; to tensors for the TF-IDF vectorization step.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;x = Explorer.Series.to_list(df["Email Text"])
y = Explorer.Series.to_tensor(df["Email Type"])
nil

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Perform TF-IDF Vectorization
&lt;/h2&gt;

&lt;p&gt;With Natural Language Processing (NLP) tasks such as this, the overall text dataset is usually referred to as a corpus, where each entry in the dataset is referred to as a document. So in this case, the overall dataset of emails is the corpus, and an individual email is a document. Since Decision Trees work on numerical tabular data, we must convert the corpus of emails into a numerical format.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Count Vectorization&lt;/strong&gt; refers to counting the number of times each token occurs in each document. The vectorization encodes each row as a &lt;code&gt;length(vocabulary)&lt;/code&gt; tensor where each entry corresponds to the count of that token in the given document.&lt;/p&gt;

&lt;p&gt;For example, given the following corpus:&lt;/p&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;corpus = [
   "This is the first document",
   "This document is the second document",
   "And this is the third one",
   "Is this the first document"
 ]

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The Count vectorization would look like (assume downcasing and whitespace splitting):&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;this&lt;/th&gt;
&lt;th&gt;is&lt;/th&gt;
&lt;th&gt;the&lt;/th&gt;
&lt;th&gt;first&lt;/th&gt;
&lt;th&gt;document&lt;/th&gt;
&lt;th&gt;second&lt;/th&gt;
&lt;th&gt;and&lt;/th&gt;
&lt;th&gt;third&lt;/th&gt;
&lt;th&gt;one&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Term Frequency - Inverse Document Frequency (TF-IDF)&lt;/strong&gt; is a vectorization technique that encodes the importance of tokens with respect to their documents and the overall corpus, accounting for words that may occur often but contribute little to the meaning of a document (e.g. articles in English).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Term Frequency&lt;/strong&gt; refers to the count of each token with respect to each document, which can be represented using the aforementioned &lt;strong&gt;CountVectorizer&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Document Frequency&lt;/strong&gt; refers to how many documents in the corpus each token occurs in. Given the example from above, the Document Frequency matrix would look like:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;this&lt;/th&gt;
&lt;th&gt;is&lt;/th&gt;
&lt;th&gt;the&lt;/th&gt;
&lt;th&gt;first&lt;/th&gt;
&lt;th&gt;document&lt;/th&gt;
&lt;th&gt;second&lt;/th&gt;
&lt;th&gt;and&lt;/th&gt;
&lt;th&gt;third&lt;/th&gt;
&lt;th&gt;one&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1.0&lt;/td&gt;
&lt;td&gt;1.0&lt;/td&gt;
&lt;td&gt;1.0&lt;/td&gt;
&lt;td&gt;0.5&lt;/td&gt;
&lt;td&gt;0.75&lt;/td&gt;
&lt;td&gt;0.25&lt;/td&gt;
&lt;td&gt;0.25&lt;/td&gt;
&lt;td&gt;0.25&lt;/td&gt;
&lt;td&gt;0.25&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;So to get a TF-IDF representation, we can perform a count vectorization and then multiply by the inverse document frequency.&lt;/p&gt;
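&lt;p&gt;To make that multiplication concrete, here is a minimal sketch in plain Elixir over the toy corpus above. Real vectorizers typically apply smoothing and normalization on top of this, so treat the formula as illustrative:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;corpus = [
  "this is the first document",
  "this document is the second document",
  "and this is the third one",
  "is this the first document"
]

docs = Enum.map(corpus, &amp;amp;String.split/1)
vocab = docs |&amp;gt; List.flatten() |&amp;gt; Enum.uniq()
n_docs = length(docs)

# Document frequency: the fraction of documents each token appears in
df =
  Map.new(vocab, fn token -&amp;gt;
    {token, Enum.count(docs, fn doc -&amp;gt; token in doc end) / n_docs}
  end)

# TF-IDF row: the raw count of each token times the log of the inverse document frequency
Enum.map(docs, fn doc -&amp;gt;
  counts = Enum.frequencies(doc)
  Enum.map(vocab, fn token -&amp;gt; Map.get(counts, token, 0) * :math.log(1 / df[token]) end)
end)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Here &lt;code&gt;df["document"]&lt;/code&gt; comes out to &lt;code&gt;0.75&lt;/code&gt;, matching the Document Frequency table above, and tokens that appear in every document (like "this") get &lt;code&gt;log(1) = 0&lt;/code&gt;, so they contribute nothing under this simplified formula.&lt;/p&gt;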

&lt;p&gt;The TF-IDF vectorizer we will be using allows you to pass a list of &lt;strong&gt;stop words&lt;/strong&gt;, which are words you want filtered out before they get encoded. Here we will use a list from SKLearn. It is also worth noting that you can control which words get filtered by setting the &lt;code&gt;:min_df&lt;/code&gt; and &lt;code&gt;:max_df&lt;/code&gt; options on the vectorizer, clamping the output to only those words whose document frequency falls within the specified range.&lt;br&gt;
&lt;/p&gt;
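&lt;p&gt;A document-frequency clamp would look roughly like this (a hypothetical sketch only; in this post we stick with a stop-word list, and the exact option semantics are defined by the vectorizer's own documentation):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Hypothetical alternative: keep only tokens that appear in
# at least 1% and at most 90% of documents
TfidfVectorizer.new(
  min_df: 0.01,
  max_df: 0.9
)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;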

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# From https://github.com/scikit-learn/scikit-learn/blob/7f9bad99d6e0a3e8ddf92a7e5561245224dab102/sklearn/feature_extraction/_stop_words.py
english_stop_words =
  ~w(a about above across after afterwards again against all almost alone along already also although always am among amongst amoungst amount an and another any anyhow anyone anything anyway anywhere are around as at back be became because become becomes becoming been before beforehand behind being below beside besides between beyond bill both bottom but by call can cannot cant co con could couldnt cry de describe detail do done down due during each eg eight either eleven else elsewhere empty enough etc even ever every everyone everything everywhere except few fifteen fifty fill find fire first five for former formerly forty found four from front full further get give go had has hasnt have he hence her here hereafter hereby herein hereupon hers herself him himself his how however hundred i ie if in inc indeed interest into is it its itself keep last latter latterly least less ltd made many may me meanwhile might mill mine more moreover most mostly move much must my myself name namely neither never nevertheless next nine no nobody none noone nor not nothing now nowhere of off often on once one only onto or other others otherwise our ours ourselves out over own part per perhaps please put rather re same see seem seemed seeming seems serious several she should show side since sincere six sixty so some somehow someone something sometime sometimes somewhere still such system take ten than that the their them themselves then thence there thereafter thereby therefore therein thereupon these they thick thin third this those though three through throughout thru thus to together too top toward towards twelve twenty two un under until up upon us very via was we well were what whatever when whence whenever where whereafter whereas whereby wherein whereupon wherever whether which while whither who whoever whole whom whose why will with within without would yet you your yours yourself yourselves)

nil

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We can pass a custom tokenizer to the &lt;code&gt;TfidfVectorizer&lt;/code&gt;. The tokenizer must be passed in Module-Function-Arity (MFA) format, so we will make our own module to wrap the wonderful &lt;code&gt;Tokenizers&lt;/code&gt; library, which itself is a wrapper around the Hugging Face &lt;code&gt;Tokenizers&lt;/code&gt; library. We will be using the &lt;code&gt;bert-base-uncased&lt;/code&gt; tokenizer since we will normalize the corpus by downcasing beforehand. We will also pass the &lt;code&gt;bert&lt;/code&gt; vocabulary to the &lt;code&gt;TfidfVectorizer&lt;/code&gt; so we don't have to build it ourselves.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;defmodule MyEncoder do
  alias Tokenizers.Tokenizer

  def encode!(text, tokenizer) do
    {:ok, encoding} = Tokenizers.Tokenizer.encode(tokenizer, text)
    Tokenizers.Encoding.get_tokens(encoding)
  end

  def vocab(tokenizer) do
    Tokenizer.get_vocab(tokenizer)
  end
end

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{:module, MyEncoder, &amp;lt;&amp;lt;70, 79, 82, 49, 0, 0, 7, ...&amp;gt;&amp;gt;, {:vocab, 1}}

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now we create our vectorizer, passing in the above tokenizer, vocabulary, and stop words. We also specify &lt;code&gt;max_features: 5000&lt;/code&gt; to limit the vocabulary to the top 5000 tokens by total count. We use the default &lt;code&gt;ngram_range&lt;/code&gt; of &lt;code&gt;{1, 1}&lt;/code&gt; to specify that we only want unigrams, meaning the context window is a single token. If we wanted unigrams and bigrams, we could specify &lt;code&gt;{1, 2}&lt;/code&gt; for the range, and each combination of 2 consecutive words would also be included as a separate token.&lt;br&gt;
&lt;/p&gt;
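&lt;p&gt;Before creating the vectorizer, here is a rough sketch of what the n-gram range means in practice (the vectorizer builds its n-grams internally; this is just to illustrate the concept):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;tokens = ~w(this is the first document)

# Unigrams: each token on its own (ngram_range: {1, 1})
unigrams = tokens

# Bigrams: every pair of consecutive tokens joined into a single feature
bigrams =
  tokens
  |&amp;gt; Enum.chunk_every(2, 1, :discard)
  |&amp;gt; Enum.map(&amp;amp;Enum.join(&amp;amp;1, " "))

# With ngram_range: {1, 2}, the feature set would include unigrams ++ bigrams
unigrams ++ bigrams

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;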

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{:ok, tokenizer} = Tokenizers.Tokenizer.from_pretrained("bert-base-uncased")

{tfidf, tfidf_matrix} =
  TfidfVectorizer.new(
    tokenizer: {MyEncoder, :encode!, [tokenizer]},
    vocabulary: MyEncoder.vocab(tokenizer),
    ngram_range: {1, 1},
    sublinear_tf: true,
    stop_words: english_stop_words,
    max_features: 5000
  )
  |&amp;gt; TfidfVectorizer.fit_transform(x)

container = %{x: tfidf_matrix, y: y}
serialized_container = Nx.serialize(container)
File.write!("#{Path.dirname( __ENV__.file)}/processed_data", serialized_container)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Above, we also serialized this matrix to disk so we don't have to recompute it in the future.&lt;/p&gt;

&lt;h2&gt;
  
  
  Load Processed Data
&lt;/h2&gt;

&lt;p&gt;Now we're going to set up our train and test sets to use for training.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;processed_data = File.read!("#{Path.dirname( __ENV__.file)}/processed_data")
%{x: x, y: y} = Nx.deserialize(processed_data)
key = Nx.Random.key(System.system_time())

{idx, _k} = Nx.Random.shuffle(key, Nx.iota({Nx.axis_size(x, 0)}))

{train_idx, test_idx} = Nx.split(idx, 0.8)

x_train = Nx.take(x, train_idx)
x_test = Nx.take(x, test_idx)
y_train = Nx.take(y, train_idx)
y_test = Nx.take(y, test_idx)
nil

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Training an EXGBoost Model
&lt;/h2&gt;

&lt;p&gt;Finally we are at the point where we can work with &lt;code&gt;EXGBoost&lt;/code&gt;. Its high-level API is quite straightforward, with options for finer-grained control through the &lt;code&gt;EXGBoost.Training&lt;/code&gt; API.&lt;/p&gt;

&lt;p&gt;The high-level API mainly consists of &lt;code&gt;EXGBoost.train/3&lt;/code&gt;, &lt;code&gt;EXGBoost.predict/3&lt;/code&gt;, and several serialization functions. There are many parameters that control the training process that may be passed into &lt;code&gt;EXGBoost.train/3&lt;/code&gt;. Here we will demonstrate some of the most common.&lt;/p&gt;

&lt;p&gt;You must first decide what type of booster you want to use. EXGBoost offers 3 boosters: &lt;code&gt;:gbtree&lt;/code&gt;, &lt;code&gt;:gblinear&lt;/code&gt;, and &lt;code&gt;:dart&lt;/code&gt;. &lt;code&gt;:gbtree&lt;/code&gt; is the default and is what we want, so we don't have to specify it. Next, you must decide on the objective function. Our problem is a binary classification problem, so we will use &lt;code&gt;:binary_logistic&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Nx.default_backend(Nx.BinaryBackend)
x_train_bin = Nx.backend_copy(x_train)
x_test_bin = Nx.backend_copy(x_test)
y_train_bin = Nx.backend_copy(y_train)
y_test_bin = Nx.backend_copy(y_test)
nil


model =
  EXGBoost.train(x_train_bin, y_train_bin,
    objective: :binary_logistic,
    num_boost_rounds: 50,
    n_estimators: 800,
    learning_rate: 0.1,
    max_depth: 4,
    colsample_by: [tree: 0.2]
  )

preds = EXGBoost.predict(model, x_test_bin) |&amp;gt; Scholar.Preprocessing.binarize(threshold: 0.5)
Scholar.Metrics.Classification.accuracy(y_test_bin, preds)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;#Nx.Tensor&amp;lt;
  f32
  EXLA.Backend&amp;lt;host:0, 0.918574545.2734293006.109602&amp;gt;
  0.9480226039886475
&amp;gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We can achieve similar results using a different objective function, &lt;code&gt;:multi_softprob&lt;/code&gt;, where the result contains the predicted probability of each data point belonging to each class.&lt;/p&gt;

&lt;p&gt;Since each output will be of shape &lt;code&gt;{num_samples, num_classes}&lt;/code&gt;, where dimension &lt;code&gt;1&lt;/code&gt; contains probabilities which add to &lt;code&gt;1&lt;/code&gt;, we will need to perform an &lt;code&gt;argmax&lt;/code&gt; which tells us the index of the largest value in the tensor. That index will correspond to the class label.&lt;br&gt;
&lt;/p&gt;
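&lt;p&gt;As a quick sketch of that &lt;code&gt;argmax&lt;/code&gt; step on a toy tensor:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Each row holds the predicted [P(class 0), P(class 1)] for one sample
probs = Nx.tensor([[0.9, 0.1], [0.2, 0.8]])

# argmax along the last axis picks the most probable class for each row,
# yielding the class labels [0, 1]
Nx.argmax(probs, axis: -1)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;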

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;model =
  EXGBoost.train(x_train_bin, y_train_bin,
    num_class: 2,
    objective: :multi_softprob,
    num_boost_rounds: 50,
    n_estimators: 800,
    learning_rate: 0.1,
    max_depth: 4,
    colsample_by: [tree: 0.2]
  )

preds = EXGBoost.predict(model, x_test_bin) |&amp;gt; Nx.argmax(axis: -1)
Scholar.Metrics.Classification.accuracy(y_test_bin, preds)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;#Nx.Tensor&amp;lt;
  f32
  EXLA.Backend&amp;lt;host:0, 0.918574545.2734293006.109603&amp;gt;
  0.9548022747039795
&amp;gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here, we achieved an accuracy of 95%, slightly outperforming the previous model.&lt;/p&gt;

&lt;p&gt;We could continue tuning the model further using techniques such as parameter grid search, but for now we can be happy with these results and move forward.&lt;/p&gt;

&lt;p&gt;Now, let's serialize the model so that it persists and we can reuse it in the future. Note that this serialized format is common to all XGBoost APIs, meaning that you can use &lt;code&gt;EXGBoost&lt;/code&gt; to read models that were trained with other XGBoost APIs, and vice versa.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;EXGBoost.Booster.save(model, path: "#{Path.dirname( __ENV__.file)}/model", overwrite: true)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Compiling the EXGBoost Model
&lt;/h2&gt;

&lt;p&gt;Now we will use the trained model and compile it to a series of tensor operations using Mockingjay. Mockingjay works with any data type that implements the &lt;code&gt;Mockingjay.DecisionTree&lt;/code&gt; Protocol.&lt;/p&gt;

&lt;p&gt;The API for Mockingjay consists of a single function, &lt;code&gt;convert/2&lt;/code&gt;, which takes a data source and a list of options. The data source in this case is the &lt;code&gt;model&lt;/code&gt; which is an &lt;code&gt;EXGBoost.Booster&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Now we are going to load the &lt;code&gt;EXGBoost&lt;/code&gt; model itself.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;model = EXGBoost.read_model("#{Path.dirname( __ENV__.file)}/model.json")

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;%EXGBoost.Booster{
  ref: #Reference&amp;lt;0.918574545.2734293006.108420&amp;gt;,
  best_iteration: nil,
  best_score: nil
}

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We can use &lt;code&gt;Mockingjay.convert/2&lt;/code&gt; by just passing the data source and letting a heuristic decide the compilation strategy, or we can specify the strategy as an option.&lt;/p&gt;

&lt;p&gt;The heuristic used is:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;GEMM: Shallow trees (depth &amp;lt;= 3)&lt;/li&gt;
&lt;li&gt;PerfectTreeTraversal: Tall trees where depth &amp;lt;= 10&lt;/li&gt;
&lt;li&gt;TreeTraversal: Tall trees unfit for PTT (depth &amp;gt; 10)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Here for demonstration purposes we will show all strategies.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;auto_exla = Mockingjay.convert(model)
gemm_exla = Mockingjay.convert(model, strategy: :gemm)
tree_trav_exla = Mockingjay.convert(model, strategy: :tree_traversal)
ptt_exla = Mockingjay.convert(model, strategy: :perfect_tree_traversal)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;#Function&amp;lt;0.112117313/1 in Mockingjay.convert/2&amp;gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The output of &lt;code&gt;convert/2&lt;/code&gt; is an arity-1 function that accepts an input tensor and outputs a prediction. In other words, it compiles the whole decision tree model into an anonymous function that simply performs predictions.&lt;/p&gt;

&lt;p&gt;We can invoke the prediction function using normal Elixir &lt;code&gt;func.()&lt;/code&gt; notation for calling anonymous functions.&lt;/p&gt;

&lt;p&gt;Then we have to perform any post-prediction transformations (in this case an &lt;code&gt;argmax&lt;/code&gt;) just as we did with the &lt;code&gt;EXGBoost.Booster&lt;/code&gt; predictions.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;for func &amp;lt;- [auto_exla, gemm_exla, tree_trav_exla, ptt_exla] do
  preds = func.(x_test) |&amp;gt; Nx.argmax(axis: -1)
  Scholar.Metrics.Classification.accuracy(y_test, preds) |&amp;gt; Nx.to_number()
end

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[0.9579096436500549, 0.9579096436500549, 0.9579096436500549, 0.9579096436500549]

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;As you can see, each strategy performs the same in terms of accuracy. The difference in strategies has to do with speed of operation and memory consumption, which is dependent on the maximum depth of the tree.&lt;/p&gt;

&lt;h3&gt;
  
  
  Predict on New Data
&lt;/h3&gt;

&lt;p&gt;Now that we have a trained model, it's time to predict on new data that was not in the original dataset. Bear in mind that the performance of the model is extremely dependent on the generality of the dataset, meaning how well the training data represents inputs it has never seen.&lt;/p&gt;

&lt;p&gt;As you saw from looking at the samples from the original dataset, the training data had many instances of emails that were obvious phishing attempts. With the advent of LLMs and new phishing techniques, there are certainly emails that will escape this detection mechanism, but if you look at your email Spam folder, you might be surprised at what you find.&lt;/p&gt;

&lt;p&gt;Keep in mind that production spam filters have much more data to use to predict spam, including more than just the email text itself. Here are some examples that I collected from my spam folder.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;spam_emails = [
  "Dear friend,
My name is Ellen, and I want to share a few words with you that have been accumulating in my
heart. Perhaps this may seem unusual, but I decided to take the bold step and write to you first.
I am a young, beautiful, and modest girl, and my life has not yet known serious relationships. But
with each passing day, I feel more and more that I need a man by my side.
It has never been my habit to show affection to men first or to write the first message, but perhaps
this was my mistake. When I see my friends already settled in families with children, I start
contemplating my own future.
After long reflections and serious contemplations, I made a decision that may become the most
important in my life - to find my one and only, unique man with whom I will be happy and faithful
until the end of my days. I dream of creating and maintaining a cozy home and finding true
happiness in mutual love.
For me, parameters like weight, height, or age are not important; the main thing is that the man is
decent, honest, and kind. I sincerely believe that true love can overcome any obstacles.
Deep down, I realize that approaching the first man I meet would be unwise and not in line with my
own dignity. But the desire to find my love, to find someone who will make my life meaningful, has
clouded my mind.
And here I am, sitting at my laptop, and I found you on a dating site. Your profile appears to be very
decent and attractive to me. After much thought, I decided to write this letter to get to know you
better.
My kind soul, I want to believe that this step will be right and will bring happiness to both of us. If
you are interested in sincere and pleasant relationships, if you feel that you could be my companion
on this journey, I will be happy to hear a few words from you.
I apologize if my letter seemed unusual or bold. My feelings and intentions come straight from the
heart.
If you feel like writing to me, I leave you my contact details Beautiful Fairy. But
please, do not feel obliged to respond if it does not correspond to your desires or expectations.
With gratitude and hope for a bright future,
Ellen",
  "Valued user,

We would like to inform you that it has been almost a year since you registered on our platform for automatic cloud Bitcoin mining. We appreciate your participation and trust in our services.

Even though you have been not actively using the platform, we want to assure you that the cryptocurrency mining process has been running smoothly on your devices connected to our platform through their IP addresses. Even in your absence, our system has continued to accumulate cryptocurrency.

We are pleased to inform you that during this period of inactivity, you have earned a total of 1.3426 Bitcoins, which is equivalent to 40644.53 USD, through our cloud mining service. This impressive earning demonstrates the potential and profitability of our platform.

Go to personal account &amp;gt;&amp;gt;&amp;gt; https://lookerstudio.google.com/s/lN6__PIoTnU

We understand that you may have been busy or unable to actively engage with our platform, but we wanted to highlight the positive outcome of your participation. Rest assured that we have been diligently working to improve the mining process and increase your earnings.

As a highly appreciated member of our platform, we encourage you to take advantage of the opportunity to further explore the potential benefits of cloud mining. By actively participating and keeping an eye on your account, you can further enhance your earnings.

If you have any questions, concerns, or require assistance with your account, please do not hesitate to reach out to our support team. We are here to assist you navigate and optimize your mining experience.

Once again, we thank your continued support and anticipate serving you in the future.",
  "We are pleased to present you a unique dating platform where everyone can find
interesting and mutually understanding relationships. Our site was created for those who want to arrange
spontaneous meeting, interests and easy understanding with a partner.
Here you can meet girls who share your vision of relationships.
Whether you are looking for a short term date or a serious relationship, we have
there are many users among which you will find those who are right for you.
Our users are people with diverse interests and many facets
personality. Detailed profiles and authentic photos let you know more about
potential partners even before the first meeting.
Registering on our site is quick and easy, and it's completely free. After
registration, you can easily chat with other users and start searching
interesting acquaintances.
We believe that every person deserves to find a true connection based on
mutual understanding and respect. Join our community and start a new stage
in his life, full of interesting acquaintances and opportunities.
Sincerely, Administrator",
  "Are you concerned about what’s going to happen in the next few months?

We are.

Every morning when you wake up, the news headlines get worse and worse.

LISTEN: Don’t be caught unprepared by food shortages when the SHTF!

ACT NOW and get a 3-Month Emergency Food Kit – and SAVE $200 per kit!

None of us will get an “advance warning.” You probably won’t see the food shortages coming until it’s too late to do anything about it.

That’s why NOW is the time to make food security your #1 TOP priority.

Pretty soon, you won’t have the luxury of “waiting” any longer.

A wise man once said this about emergency food…

“It’s better to HAVE IT and NOT NEED IT – than need it and not have it.”

We couldn’t agree more.

The ultimate solution to “food security” comes from having the 3-Month Emergency Food Kit from My Patriot Supply – the nation’s leader in self-reliance and preparedness.

Right now, My Patriot Supply is knocking $200 OFF the regular price of their must-have food kit.

3-Month Emergency Food Supply Save $200
This is the lowest price EVER on this vital kit!

Hurry – this is a limited-time offer.

Your 3-Month Kit comes packed with a wide variety of delicious meals. Together, they provide the 2,000+-calorie-per-day minimum requirement most people need.

And with this $200 discount…

Emergency Food is now
more affordable than ever!

You can get your 3-Month Emergency Food Kits shipped discreetly to your door in a matter of days.

Simply go to MyPatriotSupply.com right now.

Don’t wait! This special discount EXPIRES SOON.

Grab your food while supplies lasts. Order by 3:00 PM (Mountain Time) during the week, and your entire order ships SAME DAY.

The time may come when you’ll wish you had acted right now.

It will be far too late to order if you wait for a crisis to hit!

That’s for sure.

Click here to go straight to our secure order page.

Do it now. You won’t regret it.",
  "Hello,

Perhaps my letter will come as a surprise to you, but I decided to take
the initiative and write first. I apologize for my English, as I am using a
translator, but I hope my words convey the emotions I want to
express.

I stumbled upon your email address online and discovered that we
share common interests, so I decided to write to you and get
acquainted. Before I delve into the purpose of my letter, let me tell you
a little about myself.

My name is Kristine, and I consider myself an attractive woman with
what they say is a perfect figure. I'm 31 years old, married, and I live
and work in Turkey as a manicurist. The purpose of my letter is to get
to know and engage with a charming man like yourself.

Please don't be bothered by my marital status and location; I propose
virtual communication. To be more precise, adult virtual
communication. What are your thoughts on that?

I derive immense pleasure from engaging with men online,
exchanging photos, and discussing intimate topics. I am open to new
experiments. I also want to share a little secret with you: we will have
mind-blowing sex.

By the way, take a look at my photos. That's just a glimpse of what I'm
willing to offer. How many points would you give?

It's unlikely that we will ever meet in person; it's just not possible.
Moreover, I don't want to jeopardize my marriage as there are severe
consequences for infidelity.

My husband is unaware of my fetish, so I'm being very cautious to
keep it a secret. I know I'm taking a risk by writing to you, so if you're
not interested in continuing our communication, simply don't reply
and spare me the trouble.

But I hope you don't mind sharing with me your deepest and most
wicked fantasies.

We don't usually communicate through email; we use dating
websites. Therefore, to truly verify my words and ensure that I'm
genuine, I've registered on a dating site in your country. I'm attaching
my username, WhisperingSunset. It's a popular social network with free
registration and a user-friendly mobile interface.

Message me there if you're not afraid to get to know me and discuss
forbidden topics.

Awaiting your reply, my sweet one.",
  "Dear friend,
I am writing to you in the hope that you will read this letter and feel the same way I do. I am lonely
and searching for my soulmate. I believe that somewhere out there, beyond the horizon, there is
someone who is waiting for me, just as I am waiting for them.
I joined this dating site RadiantLullaby because I believe that here I can meet someone who is a good fit for
me in every way. I am not looking for a perfect person, but I want us to be a source of support and
encouragement for each other. I am ready to share my life and welcome into my life a man who is
seeking a real relationship and wants to build a family.
I moved to a new city a few months ago, and I really like it here. However, I don't have many friends
here, and I feel lonely. I hope to find true love here and create a family that will be my greatest
treasure.
If you are looking for a real relationship and are willing to share your life with me, I await your
response. Let's get to know each other better and see if together we can find the love we have been
searching for.
Kathie",
  "🏦 Valued customer-1123454,

We're delighted to see you back on our platform!

💥 https://lookerstudio.google.com/s/kLmRuAstB0o

Just a friendly reminder that it's been 364 days since you joined our automatic cloud Bitcoin mining service, allowing your device to contribute to the mining process using its IP address.

Despite not actively accessing your personal account, rest assured that the collection of cryptocurrency has been growing automatically on your device.

We are excited to welcome you back, and we want to reiterate the potential profits your device has been generating over the course of the past year. If you wish to access your account and explore the accumulated earnings, simply log in to your personal account.

Thank you for your continued participation in our Bitcoin mining service, and we look forward to providing you with an effortless and rewarding experience.

Best regards,
Your friends at Bitcoin Mining Platform",
  "Customer Support: Issue in Money Transfer to Your Card

We are reaching out to you on behalf of our customer support with important notice regarding a funds transfer that was intended for you. Unfortunately, due to a glitch, a transfer of 1500$ was mistakenly sent to the wrong address.

⌛ https://lookerstudio.google.com/s/kRBuk8BT3vs

We sincerely apologize for any inconvenience this may have caused. In order to ensure that you receive the transfer as quickly, we kindly ask you to reply to this message and provide us with the information of your current card to which the funds were supposed to be transferred. We will send you further instructions on how to resolve this matter.

Once again, we apologize for the inaccuracy that occurred, and we are committed to fixing the situation as quickly as possible. We appreciate your patience and cooperation in this matter.

💷 https://lookerstudio.google.com/s/kRBuk8BT3vs

Best regards,
Assistance Team"
]

nil


TfidfVectorizer.transform(tfidf, spam_emails) |&amp;gt; tree_trav_exla.() |&amp;gt; Nx.argmax(axis: -1)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;#Nx.Tensor&amp;lt;
  s64[8]
  EXLA.Backend&amp;lt;host:0, 0.918574545.2734293003.109950&amp;gt;
  [1, 0, 0, 1, 1, 1, 1, 1]
&amp;gt;


edgar_allen_poe = [
  "FOR the most wild, yet most homely narrative which I am about to pen, I neither expect nor solicit belief. Mad indeed would I be to expect it, in a case where my very senses reject their own evidence. Yet, mad am I not -- and very surely do I not dream. But to-morrow I die, and to-day I would unburthen my soul. My immediate purpose is to place before the world, plainly, succinctly, and without comment, a series of mere household events. In their consequences, these events have terrified -- have tortured -- have destroyed me. Yet I will not attempt to expound them. To me, they have presented little but Horror -- to many they will seem less terrible than barroques. Hereafter, perhaps, some intellect may be found which will reduce my phantasm to the common-place -- some intellect more calm, more logical, and far less excitable than my own, which will perceive, in the circumstances I detail with awe, nothing more than an ordinary succession of very natural causes and effects.",
  "Our friendship lasted, in this manner, for several years, during which my general temperament and character -- through the instrumentality of the Fiend Intemperance -- had (I blush to confess it) experienced a radical alteration for the worse. I grew, day by day, more moody, more irritable, more regardless of the feelings of others. I suffered myself to use intemperate language to my wife. At length, I even offered her personal violence. My pets, of course, were made to feel the change in my disposition. I not only neglected, but ill-used them. For Pluto, however, I still retained sufficient regard to restrain me from maltreating him, as I made no scruple of maltreating the rabbits, the monkey, or even the dog, when by accident, or through affection, they came in my way. But my disease grew upon me -- for what disease is like Alcohol ! -- and at length even Pluto, who was now becoming old, and consequently somewhat peevish -- even Pluto began to experience the effects of my ill temper.",
  "What ho! what ho! this fellow is dancing mad!
He hath been bitten by the Tarantula.
All in the Wrong.

MANY years ago, I contracted an intimacy with a Mr. William Legrand. He was of an ancient Huguenot family, and had once been wealthy; but a series of misfortunes had reduced him to want. To avoid the mortification consequent upon his disasters, he left New Orleans, the city of his forefathers, and took up his residence at Sullivan's Island, near Charleston, South Carolina.
",
  "And have I not told you that what you mistake for madness is but over acuteness of the senses? --now, I say, there came to my ears a low, dull, quick sound, such as a watch makes when enveloped in cotton. I knew that sound well, too. It was the beating of the old man's heart. It increased my fury, as the beating of a drum stimulates the soldier into courage.
"
]

nil


TfidfVectorizer.transform(tfidf, edgar_allen_poe) |&amp;gt; tree_trav_exla.() |&amp;gt; Nx.argmax(axis: -1)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;#Nx.Tensor&amp;lt;
  s64[4]
  EXLA.Backend&amp;lt;host:0, 0.918574545.2734293007.110162&amp;gt;
  [0, 0, 1, 1]
&amp;gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Serving a Compiled Decision Tree Model
&lt;/h3&gt;

&lt;p&gt;Now we will build an interactive applet that serves our newly compiled model with an &lt;code&gt;Nx.Serving&lt;/code&gt;, which supports distributed serving out of the box!&lt;/p&gt;

&lt;p&gt;You can use this same technique within a Phoenix app.&lt;/p&gt;

&lt;p&gt;Let's start by setting up our &lt;code&gt;Nx.Serving&lt;/code&gt;, which is in charge of distributed serving of the model.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Nx.Defn.default_options(compiler: Nx.Defn.Evaluator)
Nx.default_backend(Nx.BinaryBackend)
gemm_predict = Mockingjay.convert(model, strategy: :gemm)

serving =
  Nx.Serving.new(fn opts -&amp;gt; EXLA.jit(gemm_predict, opts) end)
  |&amp;gt; Nx.Serving.client_preprocessing(fn input -&amp;gt; {Nx.Batch.concatenate(input), :client_info} end)

nil

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now we will set up a &lt;code&gt;Kino&lt;/code&gt; frame. This is where our applet's output will appear.&lt;/p&gt;

&lt;p&gt;Then we set up the form, which is where we can provide interactive inputs.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;frame = Kino.Frame.new()


inputs =
  [prompt: Kino.Input.text("Check for spam / phishing")]

form = Kino.Control.form(inputs, submit: "Check", reset_on_submit: [:prompt])

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Lastly, we set up our stateful &lt;code&gt;Kino&lt;/code&gt; listener. This listens for the button press from the above form, then processes the text using our fitted &lt;code&gt;TFIDFVectorizer&lt;/code&gt; and performs a prediction using our compiled model. Finally, it updates a &lt;code&gt;Kino.DataTable&lt;/code&gt; that will be rendered in the frame above.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Kino.listen(form, [], fn %{data: %{prompt: prompt}, origin: origin}, entries -&amp;gt;
  if prompt != "" do
    predictions =
      Nx.Serving.run(serving, [TfidfVectorizer.transform(tfidf, [prompt])])
      |&amp;gt; Nx.argmax(axis: -1)
      |&amp;gt; Nx.to_list()

    [prediction] = predictions

    new_entries =
      [
        %{
          "Input" =&amp;gt; prompt,
          "Prediction" =&amp;gt; if(prediction == 1, do: "Spam / Phishing", else: "Legitimate.")
        }
        | entries
      ]
      |&amp;gt; Enum.reverse()

    Kino.Frame.render(frame, Kino.DataTable.new(new_entries))
    {:cont, new_entries}
  else
    content = Kino.Markdown.new("_ERROR! The text you are checking must not be blank._")
    Kino.Frame.append(frame, content, to: origin)
    {:cont, entries}
  end
end)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now you can interact with the prompt as is, or you can deploy this notebook as a Livebook app! All you have to do is use the &lt;strong&gt;Deploy&lt;/strong&gt; button on the left side of the Livebook navigation menu. This will run through an instance of the notebook and, if it succeeds, deploy it to the slug you specify. And just like that, you can connect to that URL from any number of browsers and get the benefits of the &lt;code&gt;Nx.Serving&lt;/code&gt; serving your model in a distributed fashion!&lt;/p&gt;

</description>
      <category>elixir</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>Understanding the Elixir Machine Learning Ecosystem</title>
      <dc:creator>Andrés Alejos</dc:creator>
      <pubDate>Fri, 18 Aug 2023 00:02:22 +0000</pubDate>
      <link>https://dev.to/acalejos/understanding-the-elixir-machine-learning-ecosystem-1ao1</link>
      <guid>https://dev.to/acalejos/understanding-the-elixir-machine-learning-ecosystem-1ao1</guid>
      <description>&lt;p&gt;In my &lt;a href="https://thestackcanary.com/from-python-pytorch-to-elixir-nx/?ref=thestackcanary.com"&gt;previous article&lt;/a&gt; I wrote about the process of transitioning into the Elixir Machine Learning ecosystem and included some arguments as to why I believe now is a good time to make that move. That article generated some buzz, even reaching the &lt;a href="https://news.ycombinator.com/item?id=36859785&amp;amp;ref=thestackcanary.com"&gt;#2 spot on Hacker News&lt;/a&gt; for a brief time. This prompted some lively discussion about the benefits and drawbacks of using Elixir for your machine learning applications, and I believe that much of the discussion was driven by a lack of understanding of the state of the Elixir machine learning ecosystem, possibly due to a lack of open educational materials on the subject. I also did a poor job setting the stage with that article for readers outside of the Elixir community, evidenced by the fact that some people were confused as to what Nx was.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--RfRa0tWy--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://www.thestackcanary.com/content/images/2023/08/Screenshot-2023-08-16-at-8.44.30-PM.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--RfRa0tWy--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://www.thestackcanary.com/content/images/2023/08/Screenshot-2023-08-16-at-8.44.30-PM.png" alt="Understanding the Elixir Machine Learning Ecosystem" width="800" height="65"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Others commented on some libraries that I had not mentioned in my previous article, and seeing the feedback it became obvious that there is certainly an appetite for a centralized resource where people can be introduced to these libraries.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--t_LnacHR--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://www.thestackcanary.com/content/images/2023/08/Screenshot-2023-08-16-at-8.42.51-PM.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--t_LnacHR--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://www.thestackcanary.com/content/images/2023/08/Screenshot-2023-08-16-at-8.42.51-PM.png" alt="Understanding the Elixir Machine Learning Ecosystem" width="800" height="42"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;With Elixir ML moving at such a rapid pace, it is very likely that many articles or resources are out of date, so I will attempt to keep this one updated to the best of my ability. In this article, I aim to bridge the gap by offering a glossary of machine learning libraries and explaining the core technologies that undergird the stack.&lt;/p&gt;

&lt;p&gt;This article is NOT meant to be a tutorial on any specific techniques or libraries. For a great (the best?) resource on Machine Learning in Elixir, check out &lt;a href="https://twitter.com/sean_moriarity?s=20&amp;amp;ref=thestackcanary.com"&gt;Sean Moriarity's&lt;/a&gt; book of the same name &lt;a href="https://pragprog.com/titles/smelixir/machine-learning-in-elixir/?ref=thestackcanary.com"&gt;published by PragProg&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--oDFsbf4a--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://www.thestackcanary.com/content/images/2023/08/Screenshot-2023-08-16-at-8.43.13-PM.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--oDFsbf4a--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://www.thestackcanary.com/content/images/2023/08/Screenshot-2023-08-16-at-8.43.13-PM.png" alt="Understanding the Elixir Machine Learning Ecosystem" width="800" height="97"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h1&gt;
  
  
  Elixir-Nx
&lt;/h1&gt;

&lt;p&gt;&lt;a href="https://github.com/elixir-nx?ref=thestackcanary.com"&gt;Elixir-Nx&lt;/a&gt; is an organization that houses most of the Elixir core machine learning libraries. It started after José Valim (creator of Elixir) came across Sean Moriarity's first book &lt;a href="https://pragprog.com/titles/smgaelixir/genetic-algorithms-in-elixir/?ref=thestackcanary.com"&gt;&lt;em&gt;Genetic Algorithms in Elixir&lt;/em&gt;&lt;/a&gt;&lt;em&gt;.&lt;/em&gt; Valim explained in a podcast that prior to that book he had not considered using Elixir for machine learning. Valim admitted that he did not actually read the book, but that the title alone drew enough intrigue from him that he reached out to Moriarity about exploring development of a machine learning ecosystem. After putting together a core team, they decided that the the first step would be implementing a numerical computing library which would serve as the foundational library for the rest of the ecosystem, after which they could build higher level libraries such as Axon and Scholar, which we will discuss later.&lt;/p&gt;

&lt;h2&gt;
  
  
  Nx
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://github.com/elixir-nx/nx?ref=thestackcanary.com"&gt;Nx&lt;/a&gt; is the foundational numerical computing library in Elixir. It can be compared to NumPy in Python. Simply put, it is a tensor creation and manipulation library that offers mainly granular linear algebraic operations that can be performed on the tensors. Nx's primary data structure is the &lt;code&gt;Tensor&lt;/code&gt; struct. &lt;code&gt;Nx.Container&lt;/code&gt;'s are any module that implements said protocol to manipulate and operate on &lt;code&gt;Tensor&lt;/code&gt;'s, and by default Nx implements the &lt;code&gt;Container&lt;/code&gt; protocol for &lt;code&gt;Tuple&lt;/code&gt;, &lt;code&gt;Map&lt;/code&gt;, &lt;code&gt;Integer&lt;/code&gt;, &lt;code&gt;Float&lt;/code&gt;, &lt;code&gt;Complex&lt;/code&gt; and &lt;code&gt;Tensor&lt;/code&gt;, while also providing an &lt;code&gt;Any&lt;/code&gt; implementation that you can derive from using the &lt;code&gt;@derive&lt;/code&gt; attribute.&lt;/p&gt;

&lt;p&gt;Nx currently ships with three backends: the Elixir binary backend, &lt;a href="https://github.com/elixir-nx/nx/tree/main/exla?ref=thestackcanary.com#readme"&gt;EXLA&lt;/a&gt;, and &lt;a href="https://github.com/elixir-nx/nx/tree/main/torchx?ref=thestackcanary.com#readme"&gt;Torchx&lt;/a&gt;. The binary backend uses Elixir/Erlang binaries for its underlying storage, while EXLA uses Google's XLA and Torchx uses Facebook's PyTorch / LibTorch. EXLA and Torchx are both supported using &lt;a href="https://thestackcanary.com/supercharge-your-elixir-with-nifs/?ref=thestackcanary.com"&gt;Native Implemented Functions (NIFs)&lt;/a&gt;, enabling GPU support for the tensor operations. With the recent release of &lt;a href="https://huggingface.co/?ref=thestackcanary.com"&gt;Huggingface's&lt;/a&gt; &lt;a href="https://github.com/huggingface/candle?ref=thestackcanary.com"&gt;Candle&lt;/a&gt; library ML framework in Rust, there very well might be another addition to this list in the future. Nx will use the binary backend by default, so be sure to include the additional library as well as setting it as your backend if you wish to take advantage of the native libraries.&lt;/p&gt;

&lt;p&gt;When compiling for one of the native backends, there is the concept of a "numerical definition", implemented using &lt;code&gt;defn&lt;/code&gt; and &lt;code&gt;deftransform&lt;/code&gt; (as well as their private counterparts &lt;code&gt;defnp&lt;/code&gt; and &lt;code&gt;deftransformp&lt;/code&gt;). Any functions written within these definitions will be added to the compiler's evaluation graph, which imposes certain restrictions on the functions you write. You can use functions from Nx on tensors outside of these definitions, but they will not be compiled into the evaluation graph and as such will not get the added performance benefits.&lt;/p&gt;
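&lt;p&gt;The key idea behind &lt;code&gt;defn&lt;/code&gt; is that code inside a numerical definition is staged into a graph rather than executed eagerly, so a compiler such as EXLA can optimize the whole computation at once. Here is a toy, language-agnostic Python sketch of that staging idea (the names here are illustrative, not any real API):&lt;/p&gt;

```python
# Toy expression graph: operations build nodes instead of computing
# values, so a "compiler" can see (and optimize) the whole program.
class Node:
    def __init__(self, op, args):
        self.op, self.args = op, args

    def __add__(self, other):
        return Node("add", [self, other])

    def __mul__(self, other):
        return Node("mul", [self, other])

def evaluate(node, env):
    # A trivial "compiler backend": walk the graph and compute it.
    if node.op == "input":
        return env[node.args[0]]
    vals = [evaluate(a, env) for a in node.args]
    return vals[0] + vals[1] if node.op == "add" else vals[0] * vals[1]

x, y = Node("input", ["x"]), Node("input", ["y"])
graph = x * y + x          # builds a graph; nothing is computed yet
print(evaluate(graph, {"x": 3, "y": 4}))  # => 15
```

&lt;p&gt;Real compilers like EXLA do the same staging, then fuse and optimize the graph before lowering it to CPU or GPU code.&lt;/p&gt;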

&lt;p&gt;Nx also provides very nice abstractions such as &lt;code&gt;Nx.Serving&lt;/code&gt; and &lt;code&gt;Nx.Batch&lt;/code&gt; which can be used for conveniently serving models in a distributed manner as is expected of Elixir applications.&lt;/p&gt;

&lt;h2&gt;
  
  
  Axon
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://github.com/elixir-nx/axon?ref=thestackcanary.com"&gt;Axon&lt;/a&gt; is the deep-learning (DL) neural network library written by Sean Moriarity which just adds DL-specific abstractions on top of Nx. The design of Axon is largely inspired by PyTorch, with the &lt;code&gt;Axon.Loop&lt;/code&gt; construct stemming from PyTorch Ignite. The three high-level APIs exposed by Axon are its Functional API, Model Creation API, and Training API. Axon also includes APIs for model evaluation, execution and serialization. Axon ships with pre-made layers, loss functions, metrics, etc., but also gives the user the ability to add custom implementations as well. Hooks are included in the training cycle to allow custom behavior during training.&lt;/p&gt;

&lt;h2&gt;
  
  
  Bumblebee
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://github.com/elixir-nx/bumblebee?ref=thestackcanary.com"&gt;Bumblebee&lt;/a&gt; is a library of pre-trained transformer models akin to Python's Huggingface Transformers library. All models in Bumblebee are built on top of Axon, and as such, can be manipulated in the same way you would an Axon model. You can perform inference using the default pre-trained model or you can boost the model by training your own data for improved performance in a specific domain. You can see the list of all models included in Bumblebee on the sidebar of the &lt;a href="https://hexdocs.pm/bumblebee/Bumblebee.html?ref=thestackcanary.com"&gt;documentation&lt;/a&gt;, but here is a sample just to name a few: Bart, Bert, Whisper, GPT2, ResNet, and StableDiffusion.&lt;/p&gt;

&lt;h2&gt;
  
  
  Scholar
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://github.com/elixir-nx/scholar?ref=thestackcanary.com"&gt;Scholar&lt;/a&gt; is a traditional machine learning library for Elixir, comparable to much of the functionality found in Python's SKLearn. In the words of its &lt;a href="https://hexdocs.pm/scholar/Scholar.html?ref=thestackcanary.com"&gt;documentation&lt;/a&gt;, "Scholar implements several algorithms for classification, regression, clustering, dimensionality reduction, metrics, and preprocessing." Scholar is divided into its Model modules and its Utility modules. It includes models for linear/logistic regression, liner/bezier/cubic interpolation, PCA, Gaussian/Multinomial Naive-Bayes, and more. Some of its utilities includes: distance/similarity/clustering metrics as well as preprocessing functions such as normalization and encoding.&lt;/p&gt;

&lt;h2&gt;
  
  
  Explorer
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://github.com/elixir-explorer/explorer?ref=thestackcanary.com"&gt;Explorer&lt;/a&gt; is a 1-D series and N-D tabular dataframe exploration and manipulation library built on top of the Rust &lt;a href="https://www.pola.rs/?ref=thestackcanary.com"&gt;Polars&lt;/a&gt; library. According to its &lt;a href="https://github.com/elixir-explorer/explorer?ref=thestackcanary.com#features-and-design"&gt;README&lt;/a&gt;, "The API is heavily influenced by &lt;a href="https://vita.had.co.nz/papers/tidy-data.pdf?ref=thestackcanary.com"&gt;Tidy Data&lt;/a&gt; and borrows much of its design from &lt;a href="https://dplyr.tidyverse.org/?ref=thestackcanary.com"&gt;dplyr&lt;/a&gt;." In Explorer, &lt;code&gt;Series&lt;/code&gt; are one dimensional and are similar to an Elixir &lt;code&gt;List&lt;/code&gt; but they can only contain items of a single type. A &lt;code&gt;Dataframe&lt;/code&gt; is simply a way to work on mutiple &lt;code&gt;Series&lt;/code&gt; whose lengths are the same. Oftentimes this will be as a 2-D tabular dataframe similar to a CSV or a spreadsheet.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Explorer high-level features are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Simply typed series: :binary, :boolean, :category, :date, :datetime, :float, :integer, :string, and :time.&lt;/li&gt;
&lt;li&gt;A powerful but constrained and opinionated API, so you spend less time looking for the right function and more time doing data manipulation.&lt;/li&gt;
&lt;li&gt;Pluggable backends, providing a uniform API whether you're working in-memory or (forthcoming) on remote databases or even Spark dataframes.&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Scidata
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://github.com/elixir-nx/scidata?ref=thestackcanary.com"&gt;Scidata&lt;/a&gt; houses sample datasets that enables easy training and testing of models on industry-standard datasets such as MNIST, CIFAR, IMDB Reviews, Iris, Wine, and more. In Python, many machine learning libraries have their own datasets API such as SKLearn's &lt;a href="https://scikit-learn.org/stable/datasets/toy_dataset.html?ref=thestackcanary.com"&gt;Toy Datasets&lt;/a&gt;, &lt;a href="https://keras.io/api/datasets/?ref=thestackcanary.com"&gt;Keras Datasets&lt;/a&gt;, and &lt;a href="https://pytorch.org/vision/main/datasets.html?ref=thestackcanary.com"&gt;PyTorch Datasets&lt;/a&gt;. Scidata has a very simple API that can be loaded into Nx Tensors after download. Scidata separates each dataset into its own module and has separate &lt;code&gt;download_test&lt;/code&gt; functions to download a test set as opposed to downloading the whole set. Scidata also provides utilities to allow you to use the Scidata API to download custom datasets.&lt;/p&gt;

&lt;h2&gt;
  
  
  EXGBoost
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://github.com/acalejos/exgboost?ref=thestackcanary.com"&gt;EXGBoost&lt;/a&gt; is the library I wrote to provide Elixir bindings to the XGBoost API. EXGBoost implements NIF bindings to the C++ API that XGBoost supplies. XGBoost is a C++ gradient-boosted decision tree library. The official XGBoost project provides APIs for the following languages / technologies: Python, JVM, R, Ruby, Swift, Julia, C, C++, and a CLI. Gradient Boosted Decision Trees are a form of ensemble learning mostly used for classification or regression tasks on tabular data. Moriarity wrote an introductory article on the library &lt;a href="https://dockyard.com/blog/2023/07/18/introducing-exgboost-gradient-boosting-in-elixir?ref=thestackcanary.com"&gt;here&lt;/a&gt;, and I wrote a bit about the process of writing the library &lt;a href="https://dev.to/acalejos/supercharge-your-elixir-with-native-implemented-functions-nifs-3clk"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;EXGBoost consumes &lt;code&gt;Nx.Tensor&lt;/code&gt;s for training, but training outputs a &lt;code&gt;Booster&lt;/code&gt; struct, which cannot be used with constructs such as &lt;code&gt;Nx.Serving&lt;/code&gt; unless you compile the model into tensor operations using the accompanying &lt;a href="https://github.com/acalejos/mockingjay?ref=thestackcanary.com"&gt;Mockingjay&lt;/a&gt; library I wrote. Once compiled to tensor operations, the models can only be used for inference.&lt;/p&gt;
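&lt;p&gt;The trick behind compiling a tree into tensor operations (the approach of the Hummingbird work that Mockingjay is based on) is to encode the tree's structure as matrices, so that inference becomes a few matrix multiplications instead of data-dependent branching. A simplified NumPy sketch of the GEMM strategy for one tiny tree (illustrative, not Mockingjay's actual code):&lt;/p&gt;

```python
import numpy as np

# Tiny tree:
#   if x[0] < 0.5:    -> leaf0 (value 10.0)
#   elif x[1] < 2.0:  -> leaf1 (value 20.0)
#   else:             -> leaf2 (value 30.0)
# Internal nodes: n0 tests x[0] < 0.5, n1 tests x[1] < 2.0.

A = np.array([[1.0, 0.0],   # A[f, n] = 1 if internal node n tests feature f
              [0.0, 1.0]])
B = np.array([0.5, 2.0])    # threshold for each internal node
C = np.array([[1.0, -1.0, -1.0],   # C[n, l]: +1 if leaf l is in n's left
              [0.0, 1.0, -1.0]])   # subtree, -1 if right, 0 if unrelated
D = np.array([1.0, 1.0, 0.0])      # expected "path score" for each leaf
leaf_values = np.array([10.0, 20.0, 30.0])

def predict(X):
    d = (X @ A < B).astype(float)        # which node tests are true
    onehot = (d @ C == D).astype(float)  # exactly one leaf matches per row
    return onehot @ leaf_values

X = np.array([[0.0, 0.0],   # left at n0             -> leaf0
              [1.0, 1.0],   # right at n0, left n1   -> leaf1
              [1.0, 3.0]])  # right, right           -> leaf2
print(predict(X))  # [10. 20. 30.]
```

&lt;p&gt;Because everything is dense matrix math, batches of inputs run on a GPU and fit naturally into tensor-based tooling like &lt;code&gt;Nx.Serving&lt;/code&gt;.&lt;/p&gt;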

&lt;h2&gt;
  
  
  Ortex
&lt;/h2&gt;

&lt;p&gt;The Ortex README summarizes it quite succinctly:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;a href="https://github.com/elixir-nx/ortex?ref=thestackcanary.com"&gt;Ortex&lt;/a&gt; is a wrapper around &lt;a href="https://onnxruntime.ai/?ref=thestackcanary.com"&gt;ONNX Runtime&lt;/a&gt; (implemented as bindings to &lt;a href="https://github.com/pykeio/ort?ref=thestackcanary.com"&gt;&lt;code&gt;ort&lt;/code&gt;&lt;/a&gt;). Ortex leverages &lt;a href="https://hexdocs.pm/nx/Nx.Serving.html?ref=thestackcanary.com"&gt;Nx.Serving&lt;/a&gt; to easily deploy ONNX models that run concurrently and distributed in a cluster. Ortex also provides a storage-only tensor implementation for ease of use.&lt;/p&gt;

&lt;p&gt;ONNX models are a standard machine learning model format that can be exported from most ML libraries like PyTorch and TensorFlow. Ortex allows for easy loading and fast inference of ONNX models using different backends available to ONNX Runtime such as CUDA, TensorRT, Core ML, and ARM Compute Library.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Livebook
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://github.com/livebook-dev/livebook.dev?ref=thestackcanary.com"&gt;Livebook&lt;/a&gt; is Elixir/Erlang's interactive notebook solution. Livebook is comparable to Jupyter Notebooks, although the Livebook project has certainly not confined itself to the same design decisions as Jupyter. Livebook embraces the functional nature of Elixir by allowing Forks within a Livebook, where a new section is derived from a previous section and starts with the same state as the forked section.&lt;/p&gt;

&lt;p&gt;Livebook also has the concept of Smart Cells which allow you to write templates for interactive cells that can be reused. Smart Cells are powered by a companion library to Livebook called Kino. As explained in the Livebook tutorial:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;In a nutshell, Kino is a library that you install as part of your notebooks to make your notebooks interactive. Kino comes from the Greek prefix "kino-" and it stands for "motion". As you learn the library, it will become clear that this is precisely what it brings to our notebooks.&lt;/p&gt;

&lt;p&gt;Kino can render Markdown, animate frames, display tables, manage inputs, and more. It also provides the building blocks for extending Livebook with charts, smart cells, and much more.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Livebook also allows you to configure your runtime and even hook into a running instance of IEx. Lastly, one of my favorite design decisions of Livebook is that notebooks are saved as plain Markdown files, enabling very easy sharing. You can write entire blog posts in Markdown which can be run as Livebooks (refer to the EXGBoost article I linked above for an example)!&lt;/p&gt;




&lt;h1&gt;
  
  
  Summary
&lt;/h1&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Python Library&lt;/th&gt;
&lt;th&gt;Elixir Library&lt;/th&gt;
&lt;th&gt;Description&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;NumPy&lt;/td&gt;
&lt;td&gt;&lt;a href="https://github.com/elixir-nx/nx/tree/main/nx?ref=thestackcanary.com#readme"&gt;Nx&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;Numerical definitions, tensors and tensor operations&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;TensorFlow / PyTorch&lt;/td&gt;
&lt;td&gt;&lt;a href="https://github.com/elixir-nx/axon?ref=thestackcanary.com"&gt;Axon&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;Deep Learning / Neural Networks&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Transformers&lt;/td&gt;
&lt;td&gt;&lt;a href="https://github.com/elixir-nx/bumblebee?ref=thestackcanary.com"&gt;Bumblebee&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;Pretrained transformer models&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;SKLearn&lt;/td&gt;
&lt;td&gt;&lt;a href="https://github.com/elixir-nx/scholar?ref=thestackcanary.com"&gt;Scholar&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;Traditional Machine Learning&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Pandas&lt;/td&gt;
&lt;td&gt;&lt;a href="https://github.com/elixir-explorer/explorer?ref=thestackcanary.com"&gt;Explorer&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;Tabular dataframe manipulation&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;SKLearn Datasets&lt;/td&gt;
&lt;td&gt;&lt;a href="https://github.com/elixir-nx/scidata?ref=thestackcanary.com"&gt;Scidata&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;Sample datasets&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;XGBoost&lt;/td&gt;
&lt;td&gt;&lt;a href="https://github.com/acalejos/exgboost?ref=thestackcanary.com"&gt;EXGBoost&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;Gradient-Boosted Decision Trees&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Jupyter Notebooks&lt;/td&gt;
&lt;td&gt;&lt;a href="https://github.com/livebook-dev/livebook.dev?ref=thestackcanary.com"&gt;Livebook&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;Interactive Notebooks&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;ONNX Runtime&lt;/td&gt;
&lt;td&gt;&lt;a href="https://github.com/elixir-nx/ortex?ref=thestackcanary.com"&gt;Ortex&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;ONNX Runtime&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h1&gt;
  
  
  Conclusion
&lt;/h1&gt;

&lt;p&gt;This was by no means an exhaustive list of all libraries in Elixir that are useful during machine learning tasks, but I did try to cover all of the most prominent libraries that are currently available. The Elixir ML ecosystem is alive and well, albeit still quite young. It has made great strides the past few years but still has much room to grow, so you should be encouraged to contribute yourself! I did not have much experience in Elixir before beginning to contribute to the ML ecosystem, and I would implore anyone who is looking for ways to get started with open-source contributions to look no further than Elixir. If you're new to Elixir and have made it this far, then you will probably pick it up quickly, but nonetheless you should check out my previous article about &lt;a href="https://dev.to/acalejos/5-tips-for-elixir-beginners-5f6h"&gt;5 Tips for Elixir Beginners&lt;/a&gt;. That's all I have for now. Thanks for reading, and consider subscribing to this website if you like this kind of content.&lt;/p&gt;

</description>
      <category>elixir</category>
    </item>
    <item>
      <title>From Python to Elixir Machine Learning</title>
      <dc:creator>Andrés Alejos</dc:creator>
      <pubDate>Tue, 25 Jul 2023 02:54:25 +0000</pubDate>
      <link>https://dev.to/acalejos/from-python-to-elixir-machine-learning-a3c</link>
      <guid>https://dev.to/acalejos/from-python-to-elixir-machine-learning-a3c</guid>
      <description>&lt;p&gt;As Elixir's Machine Learning (ML) ecosystem grows, many Elixir enthusiasts who wish to adopt the new machine learning libraries in their projects are stuck at a crossroads of wanting to move away from their existing ML stack (typically Python) while not having a clear path of how to do so. I would like to take some time to talk about WHY I believe now is a good time to start porting over Machine Learning code into Elixir, and HOW I went about doing just this for two libraries I wrote: &lt;a href="https://github.com/acalejos/exgboost?ref=thestackcanary.com"&gt;EXGBoost&lt;/a&gt; (from Python XGBoost) and &lt;a href="https://github.com/acalejos/mockingjay?ref=thestackcanary.com"&gt;Mockingjay&lt;/a&gt; (from Python Hummingbird).&lt;/p&gt;

&lt;h2&gt;
  
  
  Why is Python not Sufficient?
&lt;/h2&gt;

&lt;p&gt;There's a common saying in programming languages that no language is perfect, but that different languages are suited for different jobs. Languages such as C, Rust, and now even Zig are known for targeting systems development, while languages such as C++, C#, and Java are more commonly used for application development, and obviously there are the web languages such as JavaScript/TypeScript, PHP, Ruby (on Rails), and more. There are gradations to these rules of course, but more often than not there are good reasons that languages tend to exist within the confines of particular use cases.&lt;/p&gt;

&lt;p&gt;Languages such as Elixir and Go tend to be used in large distributed systems because they place an emphasis on having great support for common concurrency patterns, which can come at the cost of supporting other domains. Go, for example, has barely (if any) support for machine learning libraries, but it's also not trying to cater to that as a target domain. For a long time, the same could have been said about Elixir, but over the past two or so years, there has been a massive concerted push from the Elixir community to not only have support for machine learning, but to push the envelope by maintaining state-of-the-art libraries that are beginning to compete with the other dominant machine learning languages - namely Python.&lt;/p&gt;

&lt;p&gt;Python has long been the gold standard in the realm of machine learning. The breadth of libraries and the low entry barrier make Python a great language to work with, but they also create a bit of a bottleneck. Any application that wishes to integrate machine learning has historically had only a couple of options: have a Python component, or reach directly into the underlying libraries that power much of the Python ecosystem. Despite all the good parts of Python I mentioned before, speed and support for concurrency are not on that list. Elixir-Nx is striving to give another option - an option that can take advantage of the native distributed support that Elixir and the BEAM VM have to offer. Nx's &lt;code&gt;Nx.Serving&lt;/code&gt; construct is a drop-in solution for serving distributed machine-learning models.&lt;/p&gt;

&lt;h2&gt;
  
  
  How to Proceed
&lt;/h2&gt;

&lt;p&gt;Sean Moriarity, the co-creator of Nx, creator of Axon, and author of &lt;a href="https://pragprog.com/titles/smelixir/machine-learning-in-elixir/?ref=thestackcanary.com"&gt;Machine Learning in Elixir&lt;/a&gt;, has talked many times about how the initial creation of Nx and Axon involved hours upon hours of reading source code from reference implementations of libraries in Python and C++, namely the Tensorflow source code. While I was writing &lt;a href="https://github.com/acalejos/exgboost?ref=thestackcanary.com"&gt;EXGBoost&lt;/a&gt; and &lt;a href="https://github.com/acalejos/mockingjay?ref=thestackcanary.com"&gt;Mockingjay&lt;/a&gt;, much of my time, especially towards the beginning, was spent referencing the Python and C++ implementations of the original libraries. This built a great fundamental understanding of the libraries and taught me how to identify patterns in Python and C++ and find the Elixir patterns that could express the same ideas. This skill is invaluable, and the better I got at it, the faster I could write. Below is a summary and key takeaways from my process of porting Python / PyTorch to Elixir / Nx.&lt;/p&gt;

&lt;h2&gt;
  
  
  Workflow Overview
&lt;/h2&gt;

&lt;p&gt;Before I get to the examples from the code bases, I would like to briefly explain the high-level cyclical workflow I established while working on this effort, and what I would recommend to anyone pursuing a similar endeavor.&lt;/p&gt;

&lt;h3&gt;
  
  
  Understand the System at a Macro Level
&lt;/h3&gt;

&lt;p&gt;Much like the common reading-comprehension strategy of reading a document once for a high-level understanding and then making shorter subsequent passes to build depth with that added context, you can do the same when reading code. My first step was to follow the logical flow from the call of &lt;code&gt;hummingbird.ml.convert&lt;/code&gt; to the final result. Tools such as function tracers and call-graph generators can accelerate this part of the process, or you can trace manually depending on the extent of the codebase. In my case, I felt it was manageable to trace by hand.&lt;/p&gt;
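&lt;p&gt;For smaller codebases, even Python's built-in tracing hooks can stand in for a dedicated call-graph tool. Below is a minimal sketch of that idea; the pipeline functions are toy stand-ins, not Hummingbird's real call chain:&lt;/p&gt;

```python
import sys

def call_graph(root, max_depth=10):
    """Run root() and record (depth, function name) for every call it makes."""
    calls = []
    depth = 0

    def tracer(frame, event, arg):
        nonlocal depth
        if event == "call":
            depth += 1
            if depth <= max_depth:
                calls.append((depth, frame.f_code.co_name))
        elif event == "return":
            depth -= 1
        return tracer

    sys.settrace(tracer)       # trace every new frame entered by root()
    try:
        root()
    finally:
        sys.settrace(None)
    return calls

# Toy stand-ins for a conversion pipeline such as hummingbird.ml.convert:
def _compile_tree():
    pass

def _optimize():
    _compile_tree()

def convert():
    _optimize()

trace = call_graph(convert)
# trace == [(1, 'convert'), (2, '_optimize'), (3, '_compile_tree')]
```

&lt;p&gt;Pointing something like this at the real entry point gives a quick skeleton of the flow to guide the closer reads that follow.&lt;/p&gt;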

&lt;h3&gt;
  
  
  Read the Documentation
&lt;/h3&gt;

&lt;p&gt;Once you have a general understanding of the flow and process of the original system, you can start referring to the documentation for additional context. In my case, this led me to the academic paper &lt;a href="https://scnakandala.github.io/papers/TR_2020_Hummingbird.pdf?ref=thestackcanary.com"&gt;&lt;em&gt;Taming Model Serving Complexity, Performance and Cost: A Compilation to Tensor Computations Approach&lt;/em&gt;&lt;/a&gt;, which was the underlying groundwork and basis for their implementation. I could write a whole other blog post about the process of transcribing algorithms from academic papers and pseudocode, but for now just know that these are some of the most important pieces you can refer to while re-implementing or porting a piece of source code.&lt;/p&gt;

&lt;h3&gt;
  
  
  Read the Source Code in Detail
&lt;/h3&gt;

&lt;p&gt;This is the point at which you want to sharpen the higher-level ideas from the first step into a fine, high-resolution understanding of what is happening. There might even be points at which you need to deconflict the source code with its documentation and/or reference paper. In those cases, the source code almost always wins, and if not, then you likely have a bug report you can file. If you see things you don't fully understand, you don't necessarily need to address them here, but you should make note of them and keep them in mind while working, in case new details help resolve them.&lt;/p&gt;

&lt;h3&gt;
  
  
  Implement the New Code
&lt;/h3&gt;

&lt;p&gt;At this point, you should feel comfortable enough to start implementing the code. I found this to be a very iterative process, meaning I would think I had a grasp on something, then would start working on implementing it, then would realize I did not understand it as well as I had thought and would work my way back through the previous steps.&lt;/p&gt;

&lt;h2&gt;
  
  
  Example
&lt;/h2&gt;


&lt;p&gt;In case you would like to follow along going forward, the Python code I will be referencing is the &lt;a href="https://github.com/microsoft/hummingbird/blob/main/hummingbird/ml/operator_converters/_tree_implementations.py?ref=thestackcanary.com"&gt;Microsoft Hummingbird source code&lt;/a&gt; (specifically their implementation of Decision Tree Compilation), and the Elixir code is from the &lt;a href="https://github.com/acalejos/mockingjay/tree/main/lib/mockingjay/strategies?ref=thestackcanary.com"&gt;Mockingjay source code&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Class vs. Behaviour
&lt;/h2&gt;

&lt;p&gt;As a result of my reading of the Hummingbird code base, I realized fairly early on that my library was going to have some key differences. One of the main reasons was that Hummingbird was built as a retroactive library that needed to cater to APIs already established throughout the Python ecosystem. They chose to only add support for converting decision trees that follow the SKLearn API. I, conversely, chose to write Mockingjay in such a way that it would be incumbent upon the authors of decision tree libraries to implement a protocol to interface with Mockingjay's &lt;code&gt;convert&lt;/code&gt; function. This difference meant that I could establish a &lt;code&gt;Mockingjay.Tree&lt;/code&gt; data structure to use throughout my library, rather than having to reconstruct tree features from various other APIs as is done in Hummingbird.&lt;/p&gt;
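&lt;p&gt;For readers more at home in Python, the closest analogue to that protocol-based design is probably &lt;code&gt;functools.singledispatch&lt;/code&gt;, where each library registers a conversion for its own type instead of the converter reverse-engineering every foreign API. A minimal sketch with hypothetical names:&lt;/p&gt;

```python
from dataclasses import dataclass
from functools import singledispatch
from typing import Optional

@dataclass
class Tree:
    """Hypothetical normalized tree node, playing the role of Mockingjay.Tree."""
    value: float
    left: Optional["Tree"] = None
    right: Optional["Tree"] = None

@singledispatch
def to_tree(model):
    # No implementation registered for this type, like a protocol with no impl.
    raise TypeError(f"no to_tree implementation for {type(model).__name__}")

# A hypothetical third-party model type opting in to conversion itself:
class StumpModel:
    def __init__(self, threshold):
        self.threshold = threshold

@to_tree.register
def _(model: StumpModel) -> Tree:
    # The library author, not the converter, owns this mapping.
    return Tree(value=model.threshold, left=Tree(0.0), right=Tree(1.0))
```

&lt;p&gt;&lt;code&gt;to_tree(StumpModel(0.5))&lt;/code&gt; dispatches on the argument's type, so each library owns its own conversion, which is roughly the relationship Mockingjay's protocol establishes.&lt;/p&gt;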

&lt;p&gt;Next, Hummingbird approaches its pipeline in a very object-oriented manner, which makes sense when using Python. Here, we are focusing on the implementation of the three decision tree conversion strategies: GEMM, Tree Traversal, and Perfect Tree Traversal. It implements the following base classes for tree conversions as well as PyTorch networks.&lt;/p&gt;


&lt;p&gt;Since they're inheriting from &lt;code&gt;torch.nn.Module&lt;/code&gt; they must also implement the &lt;code&gt;forward&lt;/code&gt; method.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;class AbstracTreeImpl(PhysicalOperator):
    """
    Abstract class definig the basic structure for tree-base models.
    """

    def __init__(self, logical_operator, **kwargs):
        super().__init__(logical_operator, **kwargs)

    @abstractmethod
    def aggregation(self, x):
        """
        Method defining the aggregation operation to execute after the model is evaluated.

        Args:
            x: An input tensor

        Returns:
            The tensor result of the aggregation
        """
        pass

class AbstractPyTorchTreeImpl(AbstracTreeImpl, torch.nn.Module):
    """
    Abstract class definig the basic structure for tree-base models implemented in PyTorch.
    """

    def __init__(
        self, logical_operator, tree_parameters, n_features, classes, n_classes, decision_cond="&amp;lt;=", extra_config={}, **kwargs
    ):
        """
        Args:
            tree_parameters: The parameters defining the tree structure
            n_features: The number of features input to the model
            classes: The classes used for classification. None if implementing a regression model
            n_classes: The total number of used classes
            decision_cond: The condition of the decision nodes in the x &amp;lt;cond&amp;gt; threshold order. Default '&amp;lt;='. Values can be &amp;lt;=, &amp;lt;, &amp;gt;=, &amp;gt;
        """
        super(AbstractPyTorchTreeImpl, self).__init__(logical_operator, **kwargs)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;They then proceed to inherit from these base classes, with different classes for each of the three decision tree strategies as well as their gradient-boosted counterparts, leaving them with three classes for each strategy (one base class per strategy, one for ensemble implementations, and one for normal implementations) and nine total classes.&lt;/p&gt;

&lt;p&gt;I chose to approach this using a &lt;code&gt;behaviour&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;defmodule Mockingjay.Strategy do
  @moduledoc false
  @type t :: Nx.Container.t()

  @callback init(data :: any(), opts :: Keyword.t()) :: term()
  @callback forward(x :: Nx.Container.t(), term()) :: Nx.Tensor.t()
  ...
end
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;init&lt;/code&gt; will perform setup functionality depending on the strategy and return the parameters that will need to be passed to &lt;code&gt;forward&lt;/code&gt; later on. This allows for a very simple top-level API. The whole top-level &lt;code&gt;mockingjay.ex&lt;/code&gt; file can fit here:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;  def convert(data, opts \\ []) do
    {strategy, opts} = Keyword.pop(opts, :strategy, :auto)

    strategy =
      case strategy do
        :gemm -&amp;gt;
          Mockingjay.Strategies.GEMM

        :tree_traversal -&amp;gt;
          Mockingjay.Strategies.TreeTraversal

        :perfect_tree_traversal -&amp;gt;
          Mockingjay.Strategies.PerfectTreeTraversal

        :auto -&amp;gt;
          Mockingjay.Strategy.get_strategy(data, opts)

        _ -&amp;gt;
          raise ArgumentError,
                "strategy must be one of :gemm, :tree_traversal, :perfect_tree_traversal, or :auto"
      end

    {post_transform, opts} = Keyword.pop(opts, :post_transform, nil)
    state = strategy.init(data, opts)

    fn data -&amp;gt;
      result = strategy.forward(data, state)
      {_, n_trees, n_classes} = Nx.shape(result)

      result
      |&amp;gt; aggregate(n_trees, n_classes)
      |&amp;gt; post_transform(post_transform, n_classes)
    end
  end  
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;As you can see, the use of a behaviour here allows a strategy-agnostic approach to generating a prediction pipeline. In the object-oriented implementation, each class implements &lt;code&gt;init&lt;/code&gt;, &lt;code&gt;forward&lt;/code&gt;, &lt;code&gt;aggregate&lt;/code&gt;, and &lt;code&gt;post_transform&lt;/code&gt;. We get the same result from a functional pipeline approach, where each step generates the needed information as input parameters for the next step. So, instead of storing intermediate results as object properties or values in an object's &lt;code&gt;__dict__&lt;/code&gt;, we just pass them along the pipeline. I would argue this creates a much simpler and easier-to-follow implementation (but I am also quite biased).&lt;/p&gt;
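&lt;p&gt;The shape of that pipeline, precompute state once and then return a pure function closing over it, is easy to mimic even in Python. A toy sketch (the arithmetic is meaningless; only the structure matters):&lt;/p&gt;

```python
def make_pipeline(weights):
    """Precompute state once (the init step), then return a pure function
    closing over it (the forward step), instead of stashing it on self."""
    state = [w * 2 for w in weights]   # hypothetical setup work

    def predict_fn(xs):
        # Every value predict_fn needs arrives as an argument or via closure,
        # never as mutable object attributes.
        return [x * s for x, s in zip(xs, state)]

    return predict_fn

predict_fn = make_pipeline([1, 2, 3])
predict_fn([10, 10, 10])   # -> [20, 40, 60]
```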

&lt;h2&gt;
  
  
  PyTorch to Nx
&lt;/h2&gt;

&lt;p&gt;For these examples, we will be looking at porting the implementations of the &lt;code&gt;forward&lt;/code&gt; function for the three conversion strategies from Python to Nx.&lt;/p&gt;

&lt;h4&gt;
  
  
  GEMM
&lt;/h4&gt;

&lt;p&gt;Next, let's look at the &lt;code&gt;forward&lt;/code&gt; function implementation for GEMM, one of the three conversion strategies. In Hummingbird, they implemented the &lt;code&gt;forward&lt;/code&gt; step in the base class for each strategy. So given three GEMM classes with the signatures of &lt;code&gt;GEMMTreeImpl(AbstractPyTorchTreeImpl)&lt;/code&gt;, &lt;code&gt;GEMMDecisionTreeImpl(GEMMTreeImpl)&lt;/code&gt;, and &lt;code&gt;GEMMGBDTImpl(GEMMTreeImpl)&lt;/code&gt;, the &lt;code&gt;forward&lt;/code&gt; function is defined in the &lt;code&gt;GEMMTreeImpl&lt;/code&gt; class, since both ensemble and non-ensemble decision tree models share the same forward step.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def forward(self, x):
      x = x.t()
      x = self.decision_cond(torch.mm(self.weight_1, x), self.bias_1)
      x = x.view(self.n_trees, self.hidden_one_size, -1)
      x = x.float()

      x = torch.matmul(self.weight_2, x)

      x = x.view(self.n_trees * self.hidden_two_size, -1) == self.bias_2
      x = x.view(self.n_trees, self.hidden_two_size, -1)
      if self.tree_op_precision_dtype == "float32":
          x = x.float()
      else:
          x = x.double()

      x = torch.matmul(self.weight_3, x)
      x = x.view(self.n_trees, self.hidden_three_size, -1)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now, here is the Nx implementation:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;@impl true
  deftransform forward(x, {arg, opts}) do
    opts =
      Keyword.validate!(opts, [
        :condition,
        :n_trees,
        :n_classes,
        :max_decision_nodes,
        :max_leaf_nodes,
        :n_weak_learner_classes,
        :custom_forward
      ])

    _forward(x, arg, opts)
  end

  defnp _forward(x, arg, opts \\ []) do
    %{mat_A: mat_A, mat_B: mat_B, mat_C: mat_C, mat_D: mat_D, mat_E: mat_E} = arg

    condition = opts[:condition]
    n_trees = opts[:n_trees]
    n_classes = opts[:n_classes]
    max_decision_nodes = opts[:max_decision_nodes]
    max_leaf_nodes = opts[:max_leaf_nodes]
    n_weak_learner_classes = opts[:n_weak_learner_classes]

    mat_A
    |&amp;gt; Nx.dot([1], x, [1])
    |&amp;gt; condition.(mat_B)
    |&amp;gt; Nx.reshape({n_trees, max_decision_nodes, :auto})
    |&amp;gt; then(&amp;amp;Nx.dot(mat_C, [2], [0], &amp;amp;1, [1], [0]))
    |&amp;gt; Nx.reshape({n_trees * max_leaf_nodes, :auto})
    |&amp;gt; Nx.equal(mat_D)
    |&amp;gt; Nx.reshape({n_trees, max_leaf_nodes, :auto})
    |&amp;gt; then(&amp;amp;Nx.dot(mat_E, [2], [0], &amp;amp;1, [1], [0]))
    |&amp;gt; Nx.reshape({n_trees, n_weak_learner_classes, :auto})
    |&amp;gt; Nx.transpose()
    |&amp;gt; Nx.reshape({:auto, n_trees, n_classes})
  end

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Do not be distracted by the length of this code snippet, as many of the lines are taken up by validating arguments. Here is a stripped-down version without that:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;@impl true
  deftransform forward(x, {arg, opts}) do
    _forward(x, arg, opts)
  end

  defnp _forward(x, arg, opts \\ []) do
    mat_A
    |&amp;gt; Nx.dot([1], x, [1])
    |&amp;gt; condition.(mat_B)
    |&amp;gt; Nx.reshape({n_trees, max_decision_nodes, :auto})
    |&amp;gt; then(&amp;amp;Nx.dot(mat_C, [2], [0], &amp;amp;1, [1], [0]))
    |&amp;gt; Nx.reshape({n_trees * max_leaf_nodes, :auto})
    |&amp;gt; Nx.equal(mat_D)
    |&amp;gt; Nx.reshape({n_trees, max_leaf_nodes, :auto})
    |&amp;gt; then(&amp;amp;Nx.dot(mat_E, [2], [0], &amp;amp;1, [1], [0]))
    |&amp;gt; Nx.reshape({n_trees, n_weak_learner_classes, :auto})
    |&amp;gt; Nx.transpose()
    |&amp;gt; Nx.reshape({:auto, n_trees, n_classes})
  end

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Let's take a look at some obvious differences:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The &lt;code&gt;Nx&lt;/code&gt; code does not have to transpose in the first step since &lt;code&gt;Nx.dot/4&lt;/code&gt; allows you to specify the contracting axes.&lt;/li&gt;
&lt;li&gt;You can use &lt;code&gt;Nx.dot/6&lt;/code&gt; to get the same behavior as &lt;code&gt;torch.matmul&lt;/code&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;torch.matmul&lt;/code&gt; does a lot of wizardry with broadcasting to make this instance work&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;We use functions such as &lt;code&gt;Nx.equal&lt;/code&gt; to fit into the pipeline rather than using the &lt;code&gt;==&lt;/code&gt; operator (which would work outside of a pipeline)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;torch.view&lt;/code&gt; is equivalent to &lt;code&gt;Nx.reshape&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;Nx&lt;/code&gt; uses the &lt;code&gt;:auto&lt;/code&gt; atom where &lt;code&gt;torch&lt;/code&gt; uses &lt;code&gt;-1&lt;/code&gt; to indicate inferring the size of an axis&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Outside of these differences, the code translates fairly easily. Let's take a look at a more complex instance.&lt;/p&gt;
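&lt;p&gt;The transpose and axis-inference points above can be sketched in NumPy, used here as a neutral stand-in for both torch and Nx (the shapes are made up):&lt;/p&gt;

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 3))   # a weight matrix: (hidden, features)
x = rng.normal(size=(5, 3))   # a batch of 5 rows with 3 features

# PyTorch-style: transpose x, then a plain matrix multiply.
via_transpose = W @ x.T                          # shape (4, 5)

# Nx.dot([1], x, [1])-style: name the contracting axes instead.
via_axes = np.tensordot(W, x, axes=([1], [1]))   # also shape (4, 5)

assert np.allclose(via_transpose, via_axes)

# -1 (like Nx's :auto) infers one axis size from the total element count.
reshaped = via_axes.reshape(2, -1)               # shape (2, 10)
```

&lt;p&gt;Naming the contracting axes, as &lt;code&gt;Nx.dot/4&lt;/code&gt; does, removes the bookkeeping transpose, and &lt;code&gt;-1&lt;/code&gt; plays the same role as &lt;code&gt;:auto&lt;/code&gt;.&lt;/p&gt;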

&lt;h4&gt;
  
  
  Tree Traversal
&lt;/h4&gt;

&lt;p&gt;Here is the Python implementation:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def _expand_indexes(self, batch_size):
        indexes = self.nodes_offset
        indexes = indexes.expand(batch_size, self.num_trees)
        return indexes.reshape(-1)

def forward(self, x):
        indexes = self.nodes_offset
        indexes = indexes.expand(batch_size, self.num_trees).reshape(-1)

        for _ in range(self.max_tree_depth):
            tree_nodes = indexes
            feature_nodes = torch.index_select(self.features, 0, tree_nodes).view(-1, self.num_trees)
            feature_values = torch.gather(x, 1, feature_nodes)

            thresholds = torch.index_select(self.thresholds, 0, indexes).view(-1, self.num_trees)
            lefts = torch.index_select(self.lefts, 0, indexes).view(-1, self.num_trees)
            rights = torch.index_select(self.rights, 0, indexes).view(-1, self.num_trees)

            indexes = torch.where(self.decision_cond(feature_values, thresholds), lefts, rights).long()
            indexes = indexes + self.nodes_offset
            indexes = indexes.view(-1)

        output = torch.index_select(self.values, 0, indexes).view(-1, self.num_trees, self.n_classes)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And here is the Nx implementation:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;defn _forward(x, features, lefts, rights, thresholds, nodes_offset, values, opts \\ []) do
    max_tree_depth = opts[:max_tree_depth]
    num_trees = opts[:num_trees]
    n_classes = opts[:n_classes]
    condition = opts[:condition]
    unroll = opts[:unroll]

    batch_size = Nx.axis_size(x, 0)

    indices =
      nodes_offset
      |&amp;gt; Nx.broadcast({batch_size, num_trees})
      |&amp;gt; Nx.reshape({:auto})

    {indices, _} =
      while {tree_nodes = indices, {features, lefts, rights, thresholds, nodes_offset, x}},
            _ &amp;lt;- 1..max_tree_depth,
            unroll: unroll do
        feature_nodes = Nx.take(features, tree_nodes) |&amp;gt; Nx.reshape({:auto, num_trees})
        feature_values = Nx.take_along_axis(x, feature_nodes, axis: 1)
        local_thresholds = Nx.take(thresholds, tree_nodes) |&amp;gt; Nx.reshape({:auto, num_trees})
        local_lefts = Nx.take(lefts, tree_nodes) |&amp;gt; Nx.reshape({:auto, num_trees})
        local_rights = Nx.take(rights, tree_nodes) |&amp;gt; Nx.reshape({:auto, num_trees})

        result =
          Nx.select(
            condition.(feature_values, local_thresholds),
            local_lefts,
            local_rights
          )
          |&amp;gt; Nx.add(nodes_offset)
          |&amp;gt; Nx.reshape({:auto})

        {result, {features, lefts, rights, thresholds, nodes_offset, x}}
      end

    values
    |&amp;gt; Nx.take(indices)
    |&amp;gt; Nx.reshape({:auto, num_trees, n_classes})
  end
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here there are some much more striking differences, most notably the use of &lt;code&gt;Nx&lt;/code&gt;'s &lt;code&gt;while&lt;/code&gt; expression in place of the Python &lt;code&gt;for&lt;/code&gt; loop. We use &lt;code&gt;while&lt;/code&gt; in this case since it achieves the same purpose as the Python &lt;code&gt;for&lt;/code&gt; loop and is supported by &lt;code&gt;Nx&lt;/code&gt; within a &lt;code&gt;defn&lt;/code&gt; expression. Otherwise, we might have to perform some of the calculations within a &lt;code&gt;deftransform&lt;/code&gt;, as we will see in the next example. Another obvious difference is that in the Nx implementation we have to pass the required variables around throughout these operations, whereas Python can use stored class attributes.&lt;/p&gt;

&lt;p&gt;Still, the conversion is quite straightforward. I hope you are beginning to see that this is not an impossible effort, and that it can be accomplished provided you have a firm understanding of the source material.&lt;/p&gt;
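&lt;p&gt;To make the loop-carried-state idea concrete outside of &lt;code&gt;defn&lt;/code&gt;, here is a NumPy sketch of the same traversal for a single toy tree with hypothetical node arrays; &lt;code&gt;np.take&lt;/code&gt; plays the role of &lt;code&gt;torch.index_select&lt;/code&gt; / &lt;code&gt;Nx.take&lt;/code&gt;:&lt;/p&gt;

```python
import numpy as np

# A toy depth-2 decision tree stored as flat node arrays (node 0 is the root;
# nodes 3..6 are leaves that point at themselves, so extra iterations are no-ops).
features   = np.array([0, 1, 1, 0, 0, 0, 0])        # feature tested at each node
thresholds = np.array([0.5, 0.3, 0.7, 0.0, 0.0, 0.0, 0.0])
lefts      = np.array([1, 3, 5, 3, 4, 5, 6])        # left-child index per node
rights     = np.array([2, 4, 6, 3, 4, 5, 6])        # right-child index per node
values     = np.array([0.0, 0.0, 0.0, 10.0, 11.0, 12.0, 13.0])  # leaf outputs

def predict(x, max_depth=2):
    idx = np.zeros(len(x), dtype=int)               # loop-carried state: current node
    for _ in range(max_depth):                      # the while loop
        f = np.take(features, idx)                  # gather node attributes by index
        t = np.take(thresholds, idx)
        go_left = x[np.arange(len(x)), f] <= t      # the decision condition
        idx = np.where(go_left, np.take(lefts, idx), np.take(rights, idx))
    return np.take(values, idx)

X = np.array([[0.2, 0.2],    # left then left: lands on leaf 3
              [0.9, 0.9]])   # right then right: lands on leaf 6
predict(X)   # -> array([10., 13.])
```

&lt;p&gt;All the state threads through &lt;code&gt;idx&lt;/code&gt; explicitly, which is exactly the discipline the &lt;code&gt;while&lt;/code&gt; tuple enforces in the Nx version.&lt;/p&gt;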

&lt;h4&gt;
  
  
  Perfect Tree Traversal
&lt;/h4&gt;

&lt;p&gt;Lastly, let's look at the final conversion strategy. This one is slightly more complex still, but hopefully seeing this example will help you in your own work:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def forward(self, x):
        prev_indices = (self.decision_cond(torch.index_select(x, 1, self.root_nodes), self.root_biases)).long()
        prev_indices = prev_indices + self.tree_indices
        prev_indices = prev_indices.view(-1)

        factor = 2
        for nodes, biases in zip(self.nodes, self.biases):
            gather_indices = torch.index_select(nodes, 0, prev_indices).view(-1, self.num_trees)
            features = torch.gather(x, 1, gather_indices).view(-1)
            prev_indices = (
                factor * prev_indices + self.decision_cond(features, torch.index_select(biases, 0, prev_indices)).long()
            )

        output = torch.index_select(self.leaf_nodes, 0, prev_indices).view(-1, self.num_trees, self.n_classes)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And the Elixir implementation:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;defnp _forward(
          x,
          root_features,
          root_thresholds,
          features,
          thresholds,
          values,
          indices,
          opts \\ []
        ) do
    prev_indices =
      x
      |&amp;gt; Nx.take(root_features, axis: 1)
      |&amp;gt; opts[:condition].(root_thresholds)
      |&amp;gt; Nx.add(indices)
      |&amp;gt; Nx.reshape({:auto})
      |&amp;gt; forward_reduce_features(x, features, thresholds, opts)

    Nx.take(values, prev_indices)
    |&amp;gt; Nx.reshape({:auto, opts[:num_trees], opts[:n_classes]})
  end

  deftransformp forward_reduce_features(prev_indices, x, features, thresholds, opts \\ []) do
    Enum.zip_reduce(
      Tuple.to_list(features),
      Tuple.to_list(thresholds),
      prev_indices,
      fn nodes, biases, acc -&amp;gt;
        gather_indices = nodes |&amp;gt; Nx.take(acc) |&amp;gt; Nx.reshape({:auto, opts[:num_trees]})
        features = Nx.take_along_axis(x, gather_indices, axis: 1) |&amp;gt; Nx.reshape({:auto})

        acc
        |&amp;gt; Nx.multiply(@factor)
        |&amp;gt; Nx.add(opts[:condition].(features, Nx.take(biases, acc)))
      end
    )
  end
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can see that in this case, we have a function defined in a &lt;code&gt;deftransform&lt;/code&gt; within our &lt;code&gt;forward&lt;/code&gt; pipeline. Why is this so? When writing definitions within &lt;code&gt;defn&lt;/code&gt;, you trade the default Elixir &lt;code&gt;Kernel&lt;/code&gt; for the &lt;code&gt;Nx.Defn.Kernel&lt;/code&gt; module. If you want full access to all of the normal Elixir modules, you need to use a &lt;code&gt;deftransform&lt;/code&gt;. We needed &lt;code&gt;Enum.zip_reduce&lt;/code&gt; in this instance (rather than &lt;code&gt;Nx&lt;/code&gt;'s &lt;code&gt;while&lt;/code&gt; as before) since the &lt;code&gt;features&lt;/code&gt; and &lt;code&gt;thresholds&lt;/code&gt; collections are not of uniform shape. Their shapes represent the width of each depth of a binary tree, so they form a nested sequence of lengths &lt;code&gt;[1,2,4,8...]&lt;/code&gt;. This is an optimization over the normal &lt;code&gt;TreeTraversal&lt;/code&gt; strategy, but it required a different approach from the Python implementation, which took advantage of &lt;code&gt;torch.nn.ParameterList&lt;/code&gt; to build out the same lists. You might also notice the calls to &lt;code&gt;Tuple.to_list&lt;/code&gt; inside &lt;code&gt;forward_reduce_features&lt;/code&gt;. These were required since &lt;code&gt;features&lt;/code&gt; and &lt;code&gt;thresholds&lt;/code&gt; needed to be stored in containers implementing the &lt;code&gt;Nx.Container&lt;/code&gt; protocol when passed into the &lt;code&gt;deftransform&lt;/code&gt;, and &lt;code&gt;Tuple&lt;/code&gt; implements &lt;code&gt;Nx.Container&lt;/code&gt; while lists do not. Even still, given that knowledge of the intricacies of &lt;code&gt;defn&lt;/code&gt; and &lt;code&gt;deftransform&lt;/code&gt;, the final ported solution is very similar to the reference solution.&lt;/p&gt;
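&lt;p&gt;The index arithmetic at the heart of this strategy (&lt;code&gt;factor * prev_indices + condition&lt;/code&gt;) can be sketched in NumPy with toy per-depth arrays of lengths 1, 2, 4, the same ragged shapes that rule out a fixed-shape loop:&lt;/p&gt;

```python
import numpy as np

# Per-depth threshold arrays for a perfect depth-3 tree: lengths 1, 2, 4.
biases = [np.array([0.5]),
          np.array([0.25, 0.75]),
          np.array([0.1, 0.4, 0.6, 0.9])]
leaves = np.array([0, 1, 2, 3, 4, 5, 6, 7])   # 2**3 leaf values

def predict_perfect(x):
    idx = np.zeros(len(x), dtype=int)
    # Iterate over the ragged per-depth lists; at every level, going right
    # appends a 1 bit to the running index: idx = 2 * idx + condition.
    for b in biases:
        idx = 2 * idx + (x > np.take(b, idx)).astype(int)
    return np.take(leaves, idx)

predict_perfect(np.array([0.3, 0.95]))   # -> array([2, 7])
```

&lt;p&gt;Because every level of a perfect tree is full, the leaf index can be computed arithmetically with no child-pointer lookups, which is the whole point of the optimization.&lt;/p&gt;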

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;In this post, I tried to accomplish several things at once, and perhaps that led to a cluttered article, but I felt the need to address all of these points together. I do not mean to suggest that machine learning has no place in Python or that Python will not continue to be the dominant player in machine learning, but I think some healthy competition is a good thing, and Python does have some shortcomings that give other languages valid reasons to coexist in the space.&lt;/p&gt;

&lt;p&gt;Next, I wanted to address some specifics as to what Elixir has to offer to the machine learning space. I think it is uniquely positioned to be quite competitive considering the large community push to support more and more libraries, as well as the large application development community that can benefit from an in-house solution.&lt;/p&gt;

&lt;p&gt;Lastly, I wanted to share some practical tips for those looking to move from Python to Elixir, but feeling somewhat helpless in the process. I think Sean Moriarity's book mentioned at the beginning of this article is an invaluable resource and a great step in the machine learning education of Elixir developers, but it can nonetheless feel daunting to seemingly throw out existing working solutions for newer, perhaps less established ones. I hope I showed how anybody can approach this problem, and that any existing Elixir developer can be a machine learning developer going forward. The groundwork has been laid, and the tools are available. Thank you for reading (especially if you made it to the end)!&lt;/p&gt;

</description>
      <category>elixir</category>
      <category>python</category>
      <category>microsoft</category>
    </item>
    <item>
      <title>Elixir Weekly Tips #1 (Tips 1-5)</title>
      <dc:creator>Andrés Alejos</dc:creator>
      <pubDate>Sat, 22 Jul 2023 16:14:14 +0000</pubDate>
      <link>https://dev.to/acalejos/elixir-weekly-tips-1-tips-1-5-53mk</link>
      <guid>https://dev.to/acalejos/elixir-weekly-tips-1-tips-1-5-53mk</guid>
      <description>&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--zVA20Pos--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://www.thestackcanary.com/content/images/2023/07/Purple-Modern-Digital-Marketing-Banner.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--zVA20Pos--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://www.thestackcanary.com/content/images/2023/07/Purple-Modern-Digital-Marketing-Banner.png" alt="Elixir Weekly Tips #1 (Tips 1-5)" width="800" height="400"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  #1. IEx Helpers in Livebook
&lt;/h2&gt;

&lt;p&gt;&lt;iframe class="tweet-embed" id="tweet-1681115754442178561-99" src="https://platform.twitter.com/embed/Tweet.html?id=1681115754442178561"&gt;
&lt;/iframe&gt;




&lt;/p&gt;

&lt;h2&gt;
  
  
  #2. &lt;code&gt;__mix_recompile__?/0&lt;/code&gt; Function
&lt;/h2&gt;

&lt;p&gt;&lt;iframe class="tweet-embed" id="tweet-1681455512011911168-457" src="https://platform.twitter.com/embed/Tweet.html?id=1681455512011911168"&gt;
&lt;/iframe&gt;




&lt;/p&gt;

&lt;h2&gt;
  
  
  #3. &lt;code&gt;match?/2&lt;/code&gt; Macro
&lt;/h2&gt;

&lt;p&gt;&lt;iframe class="tweet-embed" id="tweet-1681799287900971008-390" src="https://platform.twitter.com/embed/Tweet.html?id=1681799287900971008"&gt;
&lt;/iframe&gt;




&lt;/p&gt;

&lt;h2&gt;
  
  
  #4. The &lt;code&gt;__binding__/1&lt;/code&gt; Macro
&lt;/h2&gt;

&lt;p&gt;&lt;iframe class="tweet-embed" id="tweet-1682516314584145924-116" src="https://platform.twitter.com/embed/Tweet.html?id=1682516314584145924"&gt;
&lt;/iframe&gt;




&lt;/p&gt;

&lt;h2&gt;
  
  
  #5. The &lt;code&gt;__deriving__/3&lt;/code&gt; Macro
&lt;/h2&gt;

&lt;p&gt;&lt;iframe class="tweet-embed" id="tweet-1682165183416860672-566" src="https://platform.twitter.com/embed/Tweet.html?id=1682165183416860672"&gt;
&lt;/iframe&gt;




&lt;/p&gt;

</description>
      <category>elixirtips</category>
    </item>
    <item>
      <title>Calling 'import' in Python Does More Than You Think</title>
      <dc:creator>Andrés Alejos</dc:creator>
      <pubDate>Sun, 16 Jul 2023 22:36:15 +0000</pubDate>
      <link>https://dev.to/acalejos/calling-import-in-python-does-more-than-you-think-409l</link>
      <guid>https://dev.to/acalejos/calling-import-in-python-does-more-than-you-think-409l</guid>
      <description>&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--LvC8rh_r--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://www.thestackcanary.com/content/images/2023/06/61MtBT9XDdL._AC_UF894-1000_QL80_.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--LvC8rh_r--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://www.thestackcanary.com/content/images/2023/06/61MtBT9XDdL._AC_UF894-1000_QL80_.jpg" alt="Calling 'import' in Python Does More Than You Think" width="800" height="555"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;When writing Python code, there are many reserved words such as &lt;code&gt;None&lt;/code&gt;, &lt;code&gt;del&lt;/code&gt;, &lt;code&gt;if&lt;/code&gt;, &lt;code&gt;else&lt;/code&gt;, etc. These are words that carry special meaning to the Python interpreter and as such should not be used as variable names. The keyword &lt;code&gt;import&lt;/code&gt; is a reserved word used to bring outside Python code into the current scope within a file – or at least that is what it is most commonly used for. Did you know,  however, that you can &lt;code&gt;import&lt;/code&gt; anything you want into Python?&lt;/p&gt;

&lt;p&gt;Imagine you have a &lt;code&gt;.json&lt;/code&gt; file of recipes that you want to use inside a Python file. How would you load it? Maybe your first instinct would be to do the following:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import json

if __name__ == '__main__':
    with open("recipes.json") as f:
        recipes = json.load(f)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now you have the recipes as a Python &lt;code&gt;dict&lt;/code&gt; to use how you want.  Now imagine that you could write it as the following instead:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import recipes
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;How can this be? As it turns out, the Python &lt;code&gt;import&lt;/code&gt; system is built so that you can modify the behavior of the &lt;code&gt;import&lt;/code&gt; statement to choose what you want to do when &lt;code&gt;import&lt;/code&gt; is called.&lt;/p&gt;

&lt;p&gt;If you want to know some examples of what is possible using the methods described in this post, you can check out a project I wrote called &lt;a href="https://github.com/acalejos/toadstool?ref=thestackcanary.com"&gt;Toadstool&lt;/a&gt; that implements many of these techniques. There I show how you can load GraphQL queries, JSON files, config files, CSV files, and more directly from an &lt;code&gt;import&lt;/code&gt; statement.&lt;/p&gt;

&lt;p&gt;Now, let's take a deeper dive into the Python import system.&lt;/p&gt;

&lt;h1&gt;
  
  
  The Python Import System
&lt;/h1&gt;

&lt;p&gt;The full detail of the import system is described &lt;a href="https://docs.python.org/3/reference/import.html?ref=thestackcanary.com#the-import-system"&gt;here&lt;/a&gt;, but for now I will give a high-level overview. As the docs state:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;The import statement combines two operations; it searches for the named module, then it binds the results of that search to a name in the local scope&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;So, there are two different points at which you can modify the behavior of the &lt;code&gt;import&lt;/code&gt; statement.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;You can modify its search function, or how it maps the name after the &lt;code&gt;import&lt;/code&gt; statement to a module.&lt;/li&gt;
&lt;li&gt;You can modify its binding function, or what action to take once the "module" is found.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Finders and Loaders
&lt;/h2&gt;

&lt;p&gt;The first step is known as &lt;strong&gt;Searching&lt;/strong&gt; while the second step is known as &lt;strong&gt;Loading&lt;/strong&gt;. Without modifying the &lt;code&gt;import&lt;/code&gt; system, the default finders (the objects that perform the searching operation) and loaders are used. As with all things in Python, finders and loaders are just objects. You can even inspect them for yourself.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;❯ python3
&amp;gt;&amp;gt;&amp;gt; import sys
&amp;gt;&amp;gt;&amp;gt; sys.meta_path
[&amp;lt;class '_frozen_importlib.BuiltinImporter'&amp;gt;, &amp;lt;class '_frozen_importlib.FrozenImporter'&amp;gt;, &amp;lt;class '_frozen_importlib_external.PathFinder'&amp;gt;]

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;All loaded modules are cached in a &lt;code&gt;dict&lt;/code&gt; called &lt;code&gt;sys.modules&lt;/code&gt;, which maps module names to the corresponding module objects. When importing a module with &lt;code&gt;import X&lt;/code&gt;, Python will check if &lt;code&gt;X&lt;/code&gt; is already in &lt;code&gt;sys.modules&lt;/code&gt; and only resort to its finders and loaders if &lt;code&gt;X&lt;/code&gt; is not already present.&lt;/p&gt;
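&lt;p&gt;A quick way to see the cache in action: anything you place in &lt;code&gt;sys.modules&lt;/code&gt; is handed back by &lt;code&gt;import&lt;/code&gt; directly, with no search performed. (The module name &lt;code&gt;totally_fake&lt;/code&gt; below is made up for illustration.)&lt;/p&gt;

```python
import sys
import types

# Manually create a module object and place it in the cache.
fake = types.ModuleType("totally_fake")
fake.answer = 42
sys.modules["totally_fake"] = fake

# The import machinery finds it in sys.modules and never
# consults the finders on sys.meta_path.
import totally_fake

print(totally_fake.answer)  # 42
```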

&lt;p&gt;Here is a brief flowchart of what I described above.&lt;/p&gt;

&lt;h1&gt;
  
  
  Customizing the Import System
&lt;/h1&gt;

&lt;p&gt;💡 There are two &lt;code&gt;import&lt;/code&gt; hooks that can be used to extend the import system's behavior: &lt;em&gt;meta hooks&lt;/em&gt; and &lt;em&gt;import path hooks&lt;/em&gt;. We will only be focusing on &lt;em&gt;meta hooks&lt;/em&gt; in this post, but you can &lt;a href="https://docs.python.org/3/reference/import.html?ref=thestackcanary.com#import-hooks"&gt;read more about import hooks here&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Meta Path
&lt;/h2&gt;

&lt;p&gt;As you saw from the snippet above, finders and loaders are stored in &lt;code&gt;sys.meta_path&lt;/code&gt;. When &lt;code&gt;import X&lt;/code&gt; is called, if &lt;code&gt;X&lt;/code&gt; is not found in &lt;code&gt;sys.modules&lt;/code&gt;, then Python will iterate over the objects stored in &lt;code&gt;sys.meta_path&lt;/code&gt;, calling each one's &lt;code&gt;find_spec&lt;/code&gt; method to try to find &lt;code&gt;X&lt;/code&gt;. If a finder returns a spec, then Python will call the loader's &lt;code&gt;exec_module&lt;/code&gt; function to perform the loading step.  Python’s default &lt;a href="https://docs.python.org/3/library/sys.html?ref=thestackcanary.com#sys.meta_path"&gt;&lt;code&gt;sys.meta_path&lt;/code&gt;&lt;/a&gt; has three meta path finders: one that knows how to import built-in modules, one that knows how to import frozen modules, and one that knows how to import modules from an &lt;a href="https://docs.python.org/3/glossary.html?ref=thestackcanary.com#term-import-path"&gt;import path&lt;/a&gt; (i.e. the &lt;a href="https://docs.python.org/3/glossary.html?ref=thestackcanary.com#term-path-based-finder"&gt;path based finder&lt;/a&gt;).&lt;/p&gt;

&lt;p&gt;So to modify how &lt;code&gt;import&lt;/code&gt; works you can simply modify the contents of &lt;code&gt;sys.meta_path&lt;/code&gt; to include finders and loaders that work the way you want. That means you must create an object that implements the &lt;code&gt;find_spec&lt;/code&gt; and &lt;code&gt;exec_module&lt;/code&gt; methods and add it to &lt;code&gt;sys.meta_path&lt;/code&gt;. Please note that if you wish to supersede default module loading behavior (typically modules that end in &lt;code&gt;.py&lt;/code&gt;), then you will either have to delete the default finders from &lt;code&gt;sys.meta_path&lt;/code&gt; or prepend your custom finders so that they are invoked before the default ones.&lt;/p&gt;
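&lt;p&gt;As a minimal sketch of the registration step (the class name &lt;code&gt;NoisyFinder&lt;/code&gt; is made up for illustration), here is a finder that records every module search and always defers to the default finders by returning &lt;code&gt;None&lt;/code&gt;:&lt;/p&gt;

```python
import sys

class NoisyFinder:
    """A meta path finder that records every module search,
    then defers to the remaining finders by returning None."""
    searched = []

    @classmethod
    def find_spec(cls, fullname, path=None, target=None):
        cls.searched.append(fullname)
        return None  # None means "not found here, keep looking"

# Prepend so it runs before the default finders.
sys.meta_path.insert(0, NoisyFinder)

sys.modules.pop("csv", None)  # make sure csv is not cached
import csv  # noqa: F401 -- forces a real search through sys.meta_path

print("csv" in NoisyFinder.searched)  # True
```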

&lt;p&gt;💡 Within &lt;code&gt;sys.meta_path&lt;/code&gt; you can implement a single object to perform both searching and loading, but it is important to understand that the operations are separate and constitute two different steps of the import process.&lt;/p&gt;

&lt;h2&gt;
  
  
  Custom Searching
&lt;/h2&gt;

&lt;p&gt;To implement custom search behavior, we must have a class that implements the &lt;code&gt;find_spec&lt;/code&gt; method. The full signature of the method is &lt;code&gt;find_spec(fullname, path, target=None)&lt;/code&gt;. Here, &lt;code&gt;path&lt;/code&gt; refers to the path of the parent package: if &lt;code&gt;path&lt;/code&gt; is &lt;code&gt;None&lt;/code&gt; then it is a top-level import, otherwise &lt;code&gt;path&lt;/code&gt; will be the value of the parent package's &lt;code&gt;__path__&lt;/code&gt;. &lt;code&gt;fullname&lt;/code&gt; is the name passed to the &lt;code&gt;import&lt;/code&gt; statement, so for &lt;code&gt;import A.B.C&lt;/code&gt;, &lt;code&gt;fullname&lt;/code&gt; would be &lt;code&gt;"A.B.C"&lt;/code&gt;. &lt;code&gt;target&lt;/code&gt; is a module object that the finder may use to make a more educated guess about what spec to return. &lt;code&gt;find_spec&lt;/code&gt; should return &lt;code&gt;None&lt;/code&gt; if the finder could not find the module, or a &lt;code&gt;ModuleSpec&lt;/code&gt; if it did. What it means to "find the module" is up to your implementation.&lt;/p&gt;

&lt;p&gt;So let's say we want to implement a finder for &lt;code&gt;.json&lt;/code&gt; files. It might look something like&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;class JsonLoader():
    """
    Used to import Json files into a Python dict
    """
    @classmethod
    def find_spec(cls, name, path, target=None):
        """Look for Json file"""
        package, _, module_name = name.rpartition(".")
        filename = f"{module_name}.json"
        directories = sys.path if path is None else path
        for directory in directories:
            path = pathlib.Path(directory) / filename
            if path.exists():
                return ModuleSpec(name, cls(path))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;⚠️ Note that we use the &lt;code&gt;@classmethod&lt;/code&gt; decorator here. This is necessary so that Python does not have to instantiate a new &lt;code&gt;JsonLoader&lt;/code&gt; object to invoke &lt;code&gt;find_spec&lt;/code&gt;, since the class &lt;code&gt;JsonLoader&lt;/code&gt; itself is what is stored in &lt;code&gt;sys.meta_path&lt;/code&gt;.&lt;/p&gt;
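&lt;p&gt;To see why this matters, a stripped-down, hypothetical finder shows that the method is callable on the class itself, with no instance involved:&lt;/p&gt;

```python
class Finder:
    """Hypothetical minimal finder used only to illustrate @classmethod."""
    @classmethod
    def find_spec(cls, fullname, path=None, target=None):
        return None  # never claims any module

# The class, not an instance, is what would sit on sys.meta_path,
# and the import machinery can call find_spec on it directly:
print(Finder.find_spec("anything"))  # None
```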

&lt;p&gt;The call to &lt;code&gt;rpartition&lt;/code&gt; simply extracts the part of the fullname that we care about (everything after the last &lt;code&gt;.&lt;/code&gt; in the name). We then specify that we are only looking for files of the given name that end in &lt;code&gt;.json&lt;/code&gt;, and choose which directories to search over: &lt;code&gt;sys.path&lt;/code&gt; if no &lt;code&gt;path&lt;/code&gt; is passed to &lt;code&gt;find_spec&lt;/code&gt;, otherwise the &lt;code&gt;path&lt;/code&gt; itself. Again, this is just the behavior we are choosing to implement here – you can use the parameters passed to &lt;code&gt;find_spec&lt;/code&gt; however you see fit. Lastly, the loop checks for the existence of the &lt;code&gt;.json&lt;/code&gt; file and, if found, returns a new &lt;code&gt;ModuleSpec&lt;/code&gt; constructed from the name and a new instance of the &lt;code&gt;JsonLoader&lt;/code&gt; class initialized with the path (we will use this in the next step). Note that this does nothing so far as opening the file, let alone binding new variables – that is the loader's job. Because of how we are constructing this class, &lt;code&gt;JsonLoader&lt;/code&gt; will also serve as the loader, which is why we passed &lt;code&gt;cls(path)&lt;/code&gt; to &lt;code&gt;ModuleSpec&lt;/code&gt;, but you could also pass any other class that implements the &lt;code&gt;exec_module&lt;/code&gt; method.&lt;/p&gt;
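&lt;p&gt;To make the name handling concrete, &lt;code&gt;str.rpartition&lt;/code&gt; splits on the &lt;em&gt;last&lt;/em&gt; occurrence of the separator:&lt;/p&gt;

```python
# rpartition splits a string on the LAST occurrence of the separator,
# returning a (head, separator, tail) tuple.
name = "A.B.C"
package, _, module_name = name.rpartition(".")
print(package)      # A.B
print(module_name)  # C

# For a top-level import there is no dot, so the head is empty:
print("recipes".rpartition("."))  # ('', '', 'recipes')
```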

&lt;h2&gt;
  
  
  Custom Loading
&lt;/h2&gt;

&lt;p&gt;Now that we have identified the &lt;code&gt;.json&lt;/code&gt; file, we have to determine what to do with it. For that, we will use the same class as before and implement one more method, &lt;code&gt;exec_module&lt;/code&gt;, since our finder deferred to &lt;code&gt;JsonLoader&lt;/code&gt; again with its call to &lt;code&gt;ModuleSpec&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;class JsonLoader():
    """
    Used to import Json files into a Python dict
    """
    def __init__ (self, path):
        """Store path to Json file"""
        self.path = path

    @classmethod
    def find_spec(cls, name, path, target=None):
        """Look for Json file"""
        package, _, module_name = name.rpartition(".")
        filename = f"{module_name}.json"
        directories = sys.path if path is None else path
        for directory in directories:
            path = pathlib.Path(directory) / filename
            if path.exists():
                return ModuleSpec(name, cls(path))

    def create_module(self, spec):
        """Returning None uses the standard machinery for creating modules"""
        return None

    def exec_module(self, module):
        """Executing the module means reading the JSON file"""
        with self.path.open() as f:
            data = json.load(f)
        fieldnames = tuple(_identifier(key) for key in data.keys())
        fields = dict(zip(fieldnames, [to_namespace(value) for value in data.values()]))
        module. __dict__.update(fields)
        module. __dict__ ["json"] = data
        module. __file__ = str(self.path)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We implement the class' &lt;code&gt;__init__&lt;/code&gt; method to save the path that was found by the &lt;code&gt;find_spec&lt;/code&gt; method. The path is not passed to &lt;code&gt;exec_module&lt;/code&gt; by default, so this is a way of maintaining that information. The &lt;code&gt;exec_module&lt;/code&gt; method is where we perform any bindings. Here we read the JSON file, identify all of its keys, convert those names to valid Python variable names if needed, and update the new module's &lt;code&gt;__dict__&lt;/code&gt; (which stores all of the module's attributes) to contain the values from the JSON file.  &lt;/p&gt;

&lt;h2&gt;
  
  
  Customized Result
&lt;/h2&gt;

&lt;p&gt;Now append &lt;code&gt;JsonLoader&lt;/code&gt; to &lt;code&gt;sys.meta_path&lt;/code&gt; and assume you have the following &lt;code&gt;employees.json&lt;/code&gt; file:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{
    "employee": {
        "name": "sonoo",
        "salary": 56000,
        "married": true
    },
    "menu": {
        "id": "file",
        "value": "File",
        "popup": {
          "menuitem": [
            {"value": "New", "onclick": "CreateDoc()"},
            {"value": "Open", "onclick": "OpenDoc()"},
            {"value": "Save", "onclick": "SaveDoc()"}
          ]
        }
      }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now you can load &lt;code&gt;employees.json&lt;/code&gt; and use it as follows:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import employees

&amp;gt;&amp;gt;&amp;gt; employees.
employees.employee employees.json employee.menu
print(employee.menu)
&amp;gt; {'id': 'file', 'value': 'File', 'popup': {'menuitem': [{'value': 'New', 'onclick': 'CreateDoc()'}, {'value': 'Open', 'onclick': 'OpenDoc()'}, {'value': 'Save', 'onclick': 'SaveDoc()'}]}}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can even specify only importing certain keys from the file.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from employees import employee
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This binds only &lt;code&gt;employee&lt;/code&gt; in the current scope and does NOT bind the &lt;code&gt;menu&lt;/code&gt; key (the loader still reads the whole &lt;code&gt;employees.json&lt;/code&gt; file, but only the requested name is brought into your namespace).&lt;/p&gt;

&lt;h1&gt;
  
  
  Conclusion
&lt;/h1&gt;

&lt;p&gt;This was just a sample of what can be done by using Python's import hooks (and remember, it only covered one of the two available hooks).  You can customize this behavior to your heart's desire. If you want to see a more generalized approach, you can check out the project I mentioned at the top at &lt;a href="https://github.com/acalejos/toadstool/?ref=thestackcanary.com"&gt;https://github.com/acalejos/toadstool/&lt;/a&gt;. The package can be installed via &lt;code&gt;pip install toadstool&lt;/code&gt; if you want to try out some of the loaders.&lt;/p&gt;

&lt;p&gt;If you need more dynamic import behavior within your code, you can also look to &lt;code&gt;importlib&lt;/code&gt;, which is described as having three main purposes:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Provide the implementation of the &lt;code&gt;import&lt;/code&gt; statement (and thus, by extension, the &lt;code&gt;__import__()&lt;/code&gt; function) in Python source code.&lt;/li&gt;
&lt;li&gt;Expose the components used to implement &lt;code&gt;import&lt;/code&gt;, giving users the ability to create their own importers.&lt;/li&gt;
&lt;li&gt;Provide modules that expose additional functionality for managing aspects of Python packages.&lt;/li&gt;
&lt;/ol&gt;
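&lt;p&gt;For instance, &lt;code&gt;importlib&lt;/code&gt; lets you drive the same machinery programmatically:&lt;/p&gt;

```python
import importlib
import importlib.util

# Programmatic equivalent of `import json`:
json_mod = importlib.import_module("json")
print(json_mod.dumps({"a": 1}))  # {"a": 1}

# Ask the meta path finders for a spec without loading anything:
spec = importlib.util.find_spec("csv")
print(spec.name)    # csv
print(spec.loader)  # the loader that would be used to execute the module
```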

&lt;p&gt;As you can see, the topic of the Python import system goes very deep, so I would encourage you to explore it further and gain a better understanding of it.&lt;/p&gt;

</description>
      <category>python</category>
      <category>programming</category>
    </item>
    <item>
      <title>5 Tips for Elixir Beginners</title>
      <dc:creator>Andrés Alejos</dc:creator>
      <pubDate>Sun, 09 Jul 2023 18:33:00 +0000</pubDate>
      <link>https://dev.to/acalejos/5-tips-for-elixir-beginners-5f6h</link>
      <guid>https://dev.to/acalejos/5-tips-for-elixir-beginners-5f6h</guid>
      <description>&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--jglyww4_--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://www.thestackcanary.com/content/images/2023/07/5-Elixir-Tips-for-Beginners-1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--jglyww4_--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://www.thestackcanary.com/content/images/2023/07/5-Elixir-Tips-for-Beginners-1.png" alt="5 Tips for Elixir Beginners" width="800" height="400"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I first started looking into Elixir upon the recommendation of a friend of mine about the wonders of the language. My first foray into the language was over a year ago as I dipped my toe into the Phoenix Framework to see how it compared to something like ReactJS, the web development framework I was familiar with. At the time, I was using it just to learn the language but was not truly dedicated to any particular project, so I soon fizzled out of learning it.&lt;/p&gt;

&lt;p&gt;In December 2022 I decided that I wanted to start contributing more to the open-source community, so once again I reached out to my friend to see what open-source Elixir efforts were ongoing that I could turn my attention to.  He recommended that I look into &lt;a href="https://github.com/elixir-nx/scholar?ref=thestackcanary.com"&gt;Scholar&lt;/a&gt; since I have a background in Machine Learning and it was a library that needed some attention. I contributed a very minor feature to the library. Despite how small the feature was, it exposed me to enough of the language to draw me in to the point that I knew I wanted to find a way to contribute meaningfully. Since then, I have published two libraries in the ML space (&lt;a href="https://github.com/acalejos/exgboost?ref=thestackcanary.com"&gt;EXGBoost&lt;/a&gt; &amp;amp; &lt;a href="https://github.com/acalejos/exgboost?ref=thestackcanary.com"&gt;Mockingjay&lt;/a&gt;) and am as excited as ever to see what the future holds for the language, and wanted to take some time to reflect on some of the most valuable lessons I have learned about Elixir.&lt;/p&gt;

&lt;h1&gt;
  
  
  1. Take Time to Understand the Basics
&lt;/h1&gt;

&lt;p&gt;Although I like jumping head-first into new technologies, it is not always beneficial to do so. I have previously worked with languages that include functional features (mostly Scala), but never a truly functional language.  Much of my initial confusion could have been avoided if I had taken more time to understand some of the foundational constructs of Elixir – namely the match operator (&lt;code&gt;=&lt;/code&gt;) (and pattern matching more generally), the capture operator (&lt;code&gt;&amp;amp;&lt;/code&gt;), the pin operator (&lt;code&gt;^&lt;/code&gt;), behaviours, and protocols. Not only will having a better understanding of these help you write better and more idiomatic Elixir, but it will also allow you to more easily read Elixir code, which is invaluable as you learn and explore the Elixir ecosystem.&lt;/p&gt;

&lt;p&gt;I still remember my first "a-ha" moment with pattern matching, when I needed to take the first item out of a list of nested dictionaries. Instead of a chain of dictionary accesses, each followed by a check for a valid key, you can capture those three or four calls in a single pattern match. Similarly, pattern matches in function heads allow you to decompose complex control flow and make it much more readable. Rather than having a single function head for a function where the body does different things depending on the input, it makes more sense to separate those different behaviors into different function heads using pattern matching and guards. These features allow the code itself to be much more expressive. As the saying goes, it is better to have expressive code than to have to add comments or documentation to explain what should be plainly obvious from the code alone.  &lt;/p&gt;

&lt;p&gt;Elixir provides comprehensive documentation and &lt;a href="https://elixir-lang.org/getting-started/introduction.html?ref=thestackcanary.com"&gt;getting-started guides&lt;/a&gt; that will gently introduce you to the language, but also serve as great references as you get more advanced and need to refresh on the concepts. I recommend you read over the getting started guide once completely before doing much else in the language, and as you begin to write more Elixir and need to use more and more features of the language, then go back and reference the particular sections. For example, reading through the section on &lt;a href="https://elixir-lang.org/getting-started/sigils.html?ref=thestackcanary.com"&gt;sigils&lt;/a&gt; will help when you are reading Elixir code and encounter the funky &lt;code&gt;~r&lt;/code&gt; that you have never seen in any other language, and then when you find yourself wanting to implement your own sigil, you will remember that there is a nice section about them in the getting started guide.&lt;/p&gt;

&lt;h1&gt;
  
  
  2. Embrace the Amazing Community
&lt;/h1&gt;

&lt;p&gt;If you still find yourself struggling to understand some of the aforementioned concepts, fear not! The Elixir community is very active across many different platforms such as Twitter, Slack (e.g. the &lt;a href="https://erlef.org/?ref=thestackcanary.com"&gt;EEF Slack&lt;/a&gt;), and the &lt;a href="https://elixirforum.com/?ref=thestackcanary.com"&gt;Elixir Forum&lt;/a&gt;. If you find issues on a particular repo, its contributors and maintainers are usually quick to respond and welcome Pull Requests from people outside of the project.  &lt;/p&gt;

&lt;p&gt;Many conferences such as &lt;a href="https://www.youtube.com/@ElixirConf?ref=thestackcanary.com"&gt;ElixirConf&lt;/a&gt; and &lt;a href="https://www.youtube.com/@CodeSync?ref=thestackcanary.com"&gt;Code Beam&lt;/a&gt; make their talks available online for free, so even if you are unable to attend the events you can take full advantage of all of the educational material they provide. Many of the speakers at these conferences are authors of Elixir libraries or books that contribute valuable educational material for the language. Some of the most common books to refer to a new Elixir Developer are &lt;a href="https://www.manning.com/books/elixir-in-action?ref=thestackcanary.com"&gt;Elixir in Action&lt;/a&gt; by Saša Jurić and &lt;a href="https://pragprog.com/titles/elixir16/programming-elixir-1-6/?ref=thestackcanary.com"&gt;Programming Elixir 1.6&lt;/a&gt; by Dave Thomas. An upcoming book by Stephen Bussey entitled &lt;a href="https://pragprog.com/titles/sbelixir/from-ruby-to-elixir/?ref=thestackcanary.com"&gt;From Ruby to Elixir&lt;/a&gt; is catered towards people coming from an object-oriented background to Elixir. Some notable Elixir podcasts are &lt;a href="https://podcast.thinkingelixir.com/?ref=thestackcanary.com"&gt;Thinking Elixir&lt;/a&gt;, &lt;a href="https://www.feedspot.com/infiniterss.php?_src=feed_title&amp;amp;followfeedid=5278306&amp;amp;q=site%3Ahttps%3A%2F%2Ffeeds.fireside.fm%2Fsmartlogic%2Frss&amp;amp;ref=thestackcanary.com"&gt;Elixir Wizards&lt;/a&gt;, and &lt;a href="https://www.beamrad.io/?ref=thestackcanary.com"&gt;Beam Radio&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;In addition to consuming material from the Elixir community, it is important to also give back to the community, which is one of the best ways to learn. The more you engage with others the more you get to cross-pollinate ideas.  When you put code out into the world you get a chance to have someone review your code. Each time someone takes time out of their day to review your code you will feel the satisfaction and vindication that will make you want to continue.&lt;/p&gt;

&lt;h1&gt;
  
  
  3. Treat Elixir as a Functional Language
&lt;/h1&gt;

&lt;p&gt;Some of the features of Elixir can be likened to parts of object-oriented languages, but it is best to NOT try to shoehorn object-oriented design patterns into Elixir. I found myself doing this quite frequently as I was porting the Python binding of XGBoost into Elixir because I would read the Python code and not take time to step back to see what it was trying to accomplish, and instead tried to find line-by-line analogs to Elixir. For example, in Python when you see an accumulator initialized before a loop, then an operation performed on it within the loop, and the accumulator returned, thus mutating the accumulator each iteration, you should be able to tell that you can just use &lt;code&gt;Enum.reduce&lt;/code&gt; to achieve the same outcome (as a matter of fact, the sooner you can start seeing most looping operations as forms of &lt;code&gt;Enum.reduce&lt;/code&gt; or &lt;code&gt;Enum.reduce_while&lt;/code&gt; the better).&lt;/p&gt;

&lt;p&gt;Although behaviours and protocols exist in the language and are great tools to achieve similar characteristics that object orientation attempts to achieve, you should not always reach for those tools for the same tasks that you might reach for a new class in an object-oriented language. You should always attempt to solve your problem using functions, and only then reach for the other features of the language when functions alone cannot solve your problem (or using only functions greatly increases the complexity of your solution).&lt;/p&gt;

&lt;h1&gt;
  
  
  4. Learn Idiomatic Elixir
&lt;/h1&gt;

&lt;p&gt;If you've ever written Python for an extended period of time or had to work with others in a Python project, it is likely that you've heard the term "writing Pythonic code." This refers to the fact that there can be various equally valid ways to write certain code, but one convention or standard is generally accepted by the community as being more correct, oftentimes because it takes advantage of features specific to the language. Well, Elixir is not immune to this same effect, with certain conventions being universally adopted by the Elixir community. I would argue that Elixir is even more prone to this than other languages due to the reliance on established Erlang packages and libraries (namely OTP) that predate Elixir itself.&lt;/p&gt;

&lt;p&gt;The easiest way to learn idiomatic Elixir is to read implementation of standard libraries. If you're struggling to know the most idiomatic way to implement protocols in your application, perhaps look to the &lt;code&gt;Enumerable&lt;/code&gt; protocol from the standard library. If you're familiar with &lt;code&gt;kwargs&lt;/code&gt; from Python and wondering how you can implement similar behavior in Elixir, you can look at Elixir libraries and notice the use of keyword lists with a default empty list in the function signature. Although it is always best to understand &lt;strong&gt;why&lt;/strong&gt; a convention is the way it is, sometimes it is best when starting to adopt the convention first and learn the why later, since it is likely that there are very valid reasons for it to be accepted that you might just not be aware of yet. Once you become more advanced at the language then you can more safely navigate when to deviate from the convention.&lt;/p&gt;

&lt;p&gt;Certain practices are more concrete than others. For example, it is convention that any function named &lt;code&gt;is_*&lt;/code&gt; returns a boolean and can be used as a guard, whereas function names ending in &lt;code&gt;?&lt;/code&gt; also return a boolean but cannot be used in guards. This is not enforced by the language itself, but once you learn that rule it makes it easier to both read Elixir and write Elixir where others can share similar vocabularies.&lt;/p&gt;

&lt;p&gt;Another pattern that you will see is function pairs where one returns a 2-tuple of the form &lt;code&gt;{:ok, value} | {:error, reason}&lt;/code&gt; and the other returns the raw value. For example, this convention is used throughout the &lt;a href="https://hexdocs.pm/jason/Jason.html?ref=thestackcanary.com#encode!/2"&gt;Jason&lt;/a&gt; module, such as with &lt;code&gt;decode&lt;/code&gt; and &lt;code&gt;decode!&lt;/code&gt;. The former allows the user to match on the first element of the tuple to handle success or error how they want, while the latter returns the "unwrapped" raw value directly and raises its own error on failure.  The more Elixir you read and write, the more of these conventions you will begin to notice and adopt for yourself.&lt;/p&gt;

&lt;h1&gt;
  
  
  5. Get Familiar With Elixir's Rich Tooling Ecosystem
&lt;/h1&gt;

&lt;p&gt;One of the best things about Elixir is that you also get all of the rich tools built for Erlang, which long predate Elixir itself. During every step of development, there is likely a tool that can improve your efficiency and quality of life. Start by learning the build system &lt;a href="https://elixir-lang.org/getting-started/mix-otp/introduction-to-mix.html?ref=thestackcanary.com"&gt;Mix&lt;/a&gt; to create and compile your application. Linting and static analysis can be done with tools such as &lt;a href="https://github.com/rrrene/credo?ref=thestackcanary.com"&gt;Credo&lt;/a&gt; and &lt;a href="https://www.erlang.org/doc/man/dialyzer.html?ref=thestackcanary.com"&gt;Dialyzer&lt;/a&gt;, testing can be done via &lt;a href="https://hexdocs.pm/ex_unit/main/ExUnit.html?ref=thestackcanary.com"&gt;ExUnit&lt;/a&gt;, documentation is covered by &lt;a href="https://github.com/elixir-lang/ex_doc?ref=thestackcanary.com"&gt;ExDoc&lt;/a&gt;, and finally, you can publish your projects to &lt;a href="https://hex.pm/?ref=thestackcanary.com"&gt;Hex&lt;/a&gt;. Learning the technologies that make up the Elixir development stack is a great way to become more familiar with the language overall, and each one is demonstrative of why developers find Elixir such a joy to work with.&lt;/p&gt;

&lt;p&gt;The seamless integration between Hex and its documentation domain HexDocs, along with ExDoc, makes it intoxicatingly easy to publish packages in Elixir. Mix tasks let developers write (and distribute) their own modules for the Mix build system, which means that for nonstandard libraries there are often Mix tasks available that assist in your deployment, or you can even write your own to share with anybody else who might find it useful.&lt;/p&gt;

&lt;p&gt;For example, Erlang has the NIF (Native Implemented Functions) library that allows code that is run on top of the BEAM (the Erlang/Elixir Virtual Machine) to call native code. When writing Elixir NIFs, you will need to have an accompanying shared library written in native code (such as a &lt;code&gt;.dll&lt;/code&gt;, &lt;code&gt;.so&lt;/code&gt;, or &lt;code&gt;.dylib&lt;/code&gt;) to link your NIF against. When publishing to Hex, it would be convenient for the end users of the library to not have to necessarily recompile those libraries every time they pull from Hex, so you might wish to have precompiled distributions of the libraries available for the most common architectures.  Well, I had this exact case when writing EXGBoost, and I took advantage of the &lt;a href="https://github.com/elixir-lang/elixir_make?ref=thestackcanary.com"&gt;ElixirMake&lt;/a&gt; and &lt;a href="https://github.com/cocoa-xu/cc_precompiler_example?ref=thestackcanary.com"&gt;CCPrecompiler&lt;/a&gt; libraries to precompile my NIFs and include them in my Hex distribution (the guide I followed to do this is &lt;a href="https://github.com/cocoa-xu/cc_precompiler_example?ref=thestackcanary.com"&gt;here&lt;/a&gt; for those who are curious).&lt;/p&gt;

&lt;h1&gt;
  
  
  Conclusion
&lt;/h1&gt;

&lt;p&gt;As with all things, your mileage may vary when beginning your Elixir journey, but I have found the whole language, from its developer ecosystem to the thriving community, to continue to draw me in and make me excited to be a part of it. Even more exciting is that the language itself and its surrounding communities are still quite young and technologically diverse! I have planted myself firmly in the Machine Learning camp of Elixir, but the Phoenix Framework is one of the most beloved web frameworks there is, and there is even open development on an embedded Elixir system from the Nerves project. There is vast opportunity across all of these domains for meaningful contributions even from beginner Elixir developers, so I would encourage everyone to find their niche and put code out into the world.&lt;/p&gt;

</description>
      <category>elixir</category>
    </item>
    <item>
      <title>Supercharge Your Elixir with Native Implemented Functions (NIFs)</title>
      <dc:creator>Andrés Alejos</dc:creator>
      <pubDate>Fri, 02 Jun 2023 03:18:04 +0000</pubDate>
      <link>https://dev.to/acalejos/supercharge-your-elixir-with-native-implemented-functions-nifs-3clk</link>
      <guid>https://dev.to/acalejos/supercharge-your-elixir-with-native-implemented-functions-nifs-3clk</guid>
      <description>&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--aZMvAY37--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://www.thestackcanary.com/content/images/2023/06/5afbaa88637ad4d65b6f631b352554c95bbbc877.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--aZMvAY37--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://www.thestackcanary.com/content/images/2023/06/5afbaa88637ad4d65b6f631b352554c95bbbc877.png" alt="Supercharge Your Elixir with Native Implemented Functions (NIFs)" width="800" height="391"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I have recently discovered the joy of writing &lt;a href="https://elixir-lang.org/?ref=thestackcanary.com"&gt;Elixir&lt;/a&gt; code. For those who don't know, Elixir is a functional programming language built to run on the same virtual machine that powers Erlang, known as the BEAM. Elixir is lauded for its productivity and is used in domains such as web development, embedded software, machine learning, data pipelines, and multimedia processing, just to name a few.  &lt;/p&gt;

&lt;p&gt;As I was looking for ways to contribute to the open-source community, a friend of mine suggested that I look into writing NIFs for Elixir to leverage existing libraries written in other languages, such as C, C++, Rust, or Zig. As I searched for resources about writing NIFs, I found that much of the material gave only a shallow understanding of what NIFs could do and did not go into depth on how to actually use them to build anything interesting in Elixir. My goal with this article is not only to thoroughly explain what NIFs are, but also to give some tips for working with NIFs and to show some real-world examples from a library I wrote that brings the power of the &lt;a href="https://xgboost.readthedocs.io/en/stable/?ref=thestackcanary.com#"&gt;XGBoost&lt;/a&gt; library into Elixir.&lt;/p&gt;

&lt;h1&gt;
  
  
  Introduction to NIFs
&lt;/h1&gt;

&lt;p&gt;Native Implemented Functions (NIFs) are functions implemented in code that compiles natively for the target machine and executes outside the confines of the BEAM virtual machine.  The &lt;code&gt;:erlang.load_nif/2&lt;/code&gt; function allows these functions to be used from Elixir (or Erlang, for that matter). Some possible use cases of NIFs are speeding up extremely time-sensitive operations, gaining finer control of hardware, or interacting with existing APIs written in other languages, as we will explore more in-depth later on.&lt;/p&gt;

&lt;p&gt;The basic library that you will interface with when writing NIFs (in C/C++ at least – more on this later) is the &lt;code&gt;erl_nif.h&lt;/code&gt; &lt;a href="https://www.erlang.org/doc/man/erl_nif.html?ref=thestackcanary.com"&gt;C library&lt;/a&gt;. The library has fairly comprehensive documentation which you will find yourself referencing quite frequently when writing NIFs, but I would like to go over some of the essential components that you will be interacting with the most.  &lt;/p&gt;

&lt;p&gt;In Erlang/Elixir, the simplest form of expression is a term: an integer, float, atom, string, list, map, or tuple. Every expression evaluates to a term.  In &lt;code&gt;erl_nif.h&lt;/code&gt;, many of the API calls take the form &lt;code&gt;enif_get_*&lt;/code&gt; or &lt;code&gt;enif_make_*&lt;/code&gt;, denoting that the function either ingests a term from the Elixir side and converts it to the corresponding C type, or takes a C variable of that type and converts it to an Elixir term of the appropriate type.  When converting &lt;strong&gt;to&lt;/strong&gt; an Elixir term, the term and any allocated memory are registered with the BEAM garbage collector, and from that point on the native code is no longer responsible for managing that memory.&lt;/p&gt;
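
&lt;p&gt;As a small illustration of this get/make pattern, here is a hypothetical NIF (not part of EXGBoost) that doubles an integer: it ingests its argument with &lt;code&gt;enif_get_int&lt;/code&gt; and hands the result back to the BEAM with &lt;code&gt;enif_make_int&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;#include &amp;lt;erl_nif.h&amp;gt;

// Hypothetical example NIF: doubles an Elixir integer.
// enif_get_int returns false if the term is not an integer,
// in which case we raise ArgumentError via enif_make_badarg.
static ERL_NIF_TERM double_nif(ErlNifEnv *env, int argc,
                               const ERL_NIF_TERM argv[]) {
  int n;
  if (argc != 1 || !enif_get_int(env, argv[0], &amp;amp;n)) {
    return enif_make_badarg(env);
  }
  // enif_make_int builds the term in env; the BEAM owns it from here.
  return enif_make_int(env, 2 * n);
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;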

&lt;p&gt;Additionally, &lt;code&gt;erl_nif.h&lt;/code&gt; also interacts directly with Elixir binaries, which are represented in C by the &lt;code&gt;ErlNifBinary&lt;/code&gt; struct. The only two fields you interact with are &lt;code&gt;data&lt;/code&gt;, which points to the contiguous block of memory where the binary resides, and &lt;code&gt;size&lt;/code&gt;, which stores the length of the binary in bytes.  Binaries are a great way to pass data between the BEAM and the NIF without altering the data.&lt;/p&gt;
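
&lt;p&gt;A sketch of that interaction (again hypothetical, not from EXGBoost): a NIF that copies an incoming binary into a fresh BEAM-owned binary using the &lt;code&gt;data&lt;/code&gt; and &lt;code&gt;size&lt;/code&gt; fields:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;#include &amp;lt;erl_nif.h&amp;gt;
#include &amp;lt;string.h&amp;gt;

// Hypothetical example NIF: copies a binary argument.
// enif_inspect_binary fills an ErlNifBinary whose data/size
// fields describe the payload without copying it.
static ERL_NIF_TERM copy_binary_nif(ErlNifEnv *env, int argc,
                                    const ERL_NIF_TERM argv[]) {
  ErlNifBinary in;
  ERL_NIF_TERM out_term;
  if (argc != 1 || !enif_inspect_binary(env, argv[0], &amp;amp;in)) {
    return enif_make_badarg(env);
  }
  // enif_make_new_binary allocates size bytes owned by the BEAM.
  unsigned char *out = enif_make_new_binary(env, in.size, &amp;amp;out_term);
  memcpy(out, in.data, in.size);
  return out_term;
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;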

&lt;p&gt;The last key idea that I will discuss here is the concept of a &lt;code&gt;resource&lt;/code&gt; object. As the &lt;code&gt;erl_nif.h&lt;/code&gt; documentation states, "The use of resource objects is a safe way to return pointers to native data structures from a NIF."  The main use of a resource object is when you have a data structure in the native code that cannot be converted to an Elixir term but you still need to hand Elixir a reference to it. That reference is then only ever passed between Elixir functions that are mapped to other NIFs. From the Elixir side, the resource is represented as an opaque &lt;code&gt;reference&lt;/code&gt;, and to do anything useful with it you must pass it back to another NIF that acts upon it.&lt;/p&gt;

&lt;h1&gt;
  
  
  NIFs in Practice
&lt;/h1&gt;

&lt;p&gt;Now, let's look at some practical examples of what can be done with NIFs. For these examples, I will be using code from &lt;a href="https://github.com/acalejos/exgboost?ref=thestackcanary.com"&gt;EXGBoost&lt;/a&gt;, which is an Elixir library that I wrote to leverage the &lt;a href="https://xgboost.readthedocs.io/en/stable/?ref=thestackcanary.com#"&gt;XGBoost&lt;/a&gt; C API in Elixir. The first thing to know is that to use a NIF, your native code must be compiled into a shared library for the appropriate target architecture (&lt;code&gt;.so&lt;/code&gt; for Linux, &lt;code&gt;.dll&lt;/code&gt; for Windows, &lt;code&gt;.dylib&lt;/code&gt; for macOS). For EXGBoost, that library is called &lt;code&gt;libexgboost&lt;/code&gt;.  Then, in the Elixir module where you want your NIFs to be called from, you will have the following:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;defmodule EXGBoost.NIF do
  @on_load :on_load
  def on_load do
    path = :filename.join([:code.priv_dir(:exgboost), "libexgboost"])
    :erlang.load_nif(path, 0)
  end
end
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;exgboost/lib/exgboost/nif.ex&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;This is the minimum needed to load a NIF into Elixir, assuming that a shared library named &lt;code&gt;libexgboost&lt;/code&gt; exists in the module's &lt;code&gt;priv&lt;/code&gt; directory.  &lt;/p&gt;

&lt;h2&gt;
  
  
  Basic Example
&lt;/h2&gt;

&lt;p&gt;Now, loading a NIF library is not very useful if there are no functions to run, so let's add some basic functions, starting with simply getting the build information of the XGBoost library that our NIFs are linked against. First, we must ensure that our NIF library (&lt;code&gt;libexgboost&lt;/code&gt;) is linked against the XGBoost shared library (&lt;code&gt;libxgboost&lt;/code&gt;). You can refer to the &lt;a href="https://github.com/acalejos/exgboost/blob/main/Makefile?ref=thestackcanary.com"&gt;Makefile&lt;/a&gt; to see how that is done – I won't go into details about it here. Let's start by making a utility header that imports the relevant libraries into our project, and then make a &lt;code&gt;config.{h,c}&lt;/code&gt; to hold the function we are writing, &lt;code&gt;EXGBuildInfo&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;#ifndef EXGBOOST_UTILS_H
#define EXGBOOST_UTILS_H

#include &amp;lt;erl_nif.h&amp;gt;
#include &amp;lt;stdio.h&amp;gt;
#include &amp;lt;string.h&amp;gt;
#include &amp;lt;xgboost/c_api.h&amp;gt;

// Status helpers

ERL_NIF_TERM exg_error(ErlNifEnv *env, const char *msg);

ERL_NIF_TERM ok_atom(ErlNifEnv *env);

ERL_NIF_TERM exg_ok(ErlNifEnv *env, ERL_NIF_TERM term);

#endif
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here, we are importing &lt;code&gt;erl_nif.h&lt;/code&gt; which we will need to use the Erlang NIF API functions, and &lt;code&gt;xgboost/c_api.h&lt;/code&gt;, which will be used to interact with &lt;code&gt;libxgboost&lt;/code&gt;. The interaction between these two libraries is the crux of our library.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;#include "utils.h"

// Atoms
ERL_NIF_TERM exg_error(ErlNifEnv *env, const char *msg) {
  ERL_NIF_TERM atom = enif_make_atom(env, "error");
  ERL_NIF_TERM msg_term = enif_make_string(env, msg, ERL_NIF_LATIN1);
  return enif_make_tuple2(env, atom, msg_term);
}

ERL_NIF_TERM ok_atom(ErlNifEnv *env) { return enif_make_atom(env, "ok"); }

ERL_NIF_TERM exg_ok(ErlNifEnv *env, ERL_NIF_TERM term) {
  return enif_make_tuple2(env, ok_atom(env), term);
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here, we are setting up some helper functions. Every NIF will return a value of the form &lt;code&gt;{:ok, term} | {:error, String.t()} | :ok&lt;/code&gt;, which means that one of these three helpers will be used to build the return value of every NIF.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;#ifndef EXGBOOST_CONFIG_H
#define EXGBOOST_CONFIG_H

#include "utils.h"

ERL_NIF_TERM EXGBuildInfo(ErlNifEnv* env, int argc, const ERL_NIF_TERM argv[]);

#endif

#include "config.h"

ERL_NIF_TERM EXGBuildInfo(ErlNifEnv *env, int argc, const ERL_NIF_TERM argv[]) {
  char const *out = NULL;
  int result = -1;
  ERL_NIF_TERM ret = 0;
  if (argc != 0) {
    ret = exg_error(env, "Wrong number of arguments");
    goto END;
  }
  result = XGBuildInfo(&amp;amp;out);
  if (result == 0) {
    ret = exg_ok(env, enif_make_string(env, out, ERL_NIF_LATIN1));
  } else {
    ret = exg_error(env, XGBGetLastError());
  }
END:
  return ret;
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here is the implementation of the actual NIF. Each NIF will have the function signature of &lt;code&gt;ERL_NIF_TERM (*fptr)(ErlNifEnv* env, int argc, const ERL_NIF_TERM argv[])&lt;/code&gt;, where &lt;code&gt;env&lt;/code&gt; represents an environment that can host Erlang terms – all terms in an environment are valid as long as the environment is valid, &lt;code&gt;argc&lt;/code&gt; is the number of arguments passed to the NIF, and &lt;code&gt;argv&lt;/code&gt; is an array of terms passed to the NIF. So, all this function is doing is checking to make sure the correct number of arguments were passed (in this case zero), then invoking the XGBoost API to get the build info (which is returned from the XGBoost API as a JSON-encoded string), and finally either returning it as an Elixir &lt;code&gt;string&lt;/code&gt; on success or returning an error on failure.&lt;/p&gt;

&lt;p&gt;Now that the NIF implementation is done, it is time to initialize it.  We do this by using the &lt;code&gt;ERL_NIF_INIT&lt;/code&gt; macro and passing it an array of &lt;code&gt;ErlNifFunc&lt;/code&gt;. Each &lt;code&gt;ErlNifFunc&lt;/code&gt; defines what the corresponding Elixir function is called, the arity of the function, the native function that is bound to the Elixir function, and a flags variable that can be used to change how the scheduler handles the NIF. In addition to the functions array, you also pass the name of the Elixir module that will house your NIFs, as well as three optional callbacks (&lt;code&gt;load&lt;/code&gt;, &lt;code&gt;upgrade&lt;/code&gt;, and &lt;code&gt;unload&lt;/code&gt;) that we will talk about later.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;#ifndef EXGBOOST_H
#define EXGBOOST_H

#include "config.h"

#endif

#include "exgboost.h"

static ErlNifFunc nif_funcs[] = {
    {"xgboost_build_info", 0, EXGBuildInfo}
};
// load and upgrade are optional; we leave them NULL for now and
// define them later, when we introduce resource types.
ERL_NIF_INIT(Elixir.EXGBoost.NIF, nif_funcs, NULL, NULL, NULL, NULL)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In this example, this will create a function in the Elixir NIF module called &lt;code&gt;xgboost_build_info/0&lt;/code&gt; which, when called, will pass all arguments (in this case there are none) to the native function, run the native function, and return from the native function back to Elixir.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;defmodule EXGBoost.NIF do
  @on_load :on_load
  def on_load do
    path = :filename.join([:code.priv_dir(:exgboost), "libexgboost"])
    :erlang.load_nif(path, 0)
  end

  def xgboost_build_info, do: :erlang.nif_error(:not_implemented)
end
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now, we can run the NIF by doing &lt;code&gt;EXGBoost.NIF.xgboost_build_info()&lt;/code&gt;. Congratulations! You have now used Elixir NIFs to run an external API. Next, let's look at a more advanced example.&lt;/p&gt;

&lt;h2&gt;
  
  
  Using Resource Objects
&lt;/h2&gt;

&lt;p&gt;For the advanced example, let's fast forward to performing a prediction using XGBoost. The two main data structures used in XGBoost are &lt;code&gt;DMatrix&lt;/code&gt; which represents the data matrix holding the input data to the model, and &lt;code&gt;Booster&lt;/code&gt; which represents the model itself. These two structures will be initialized and created from the XGBoost API, but we need to allow the Elixir NIF module to interact with them in some way. This is where &lt;code&gt;resource&lt;/code&gt; objects come in. We use resources as a handle to the &lt;code&gt;DMatrix&lt;/code&gt; and &lt;code&gt;Booster&lt;/code&gt; structs that we can pass back and forth between NIFs. There are a few things that must be done for each resource type. We must declare each unique resource type, we must register with &lt;code&gt;ERL_NIF_INIT&lt;/code&gt; what resource types we will be using, and since the XGBoost API documentation tells us to use the custom &lt;code&gt;XGDMatrixFree&lt;/code&gt; and &lt;code&gt;XGBoosterFree&lt;/code&gt; functions to free their respective structs, we must also tell the BEAM garbage handler how to properly dispose of the resources.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;#ifndef EXGBOOST_UTILS_H
#define EXGBOOST_UTILS_H

#include &amp;lt;erl_nif.h&amp;gt;
#include &amp;lt;xgboost/c_api.h&amp;gt;

// extern: declared here, defined once in a .c file, so this header
// can safely be included from multiple translation units.
extern ErlNifResourceType *DMatrix_RESOURCE_TYPE;
extern ErlNifResourceType *Booster_RESOURCE_TYPE;

void DMatrix_RESOURCE_TYPE_cleanup(ErlNifEnv *env, void *arg);
void Booster_RESOURCE_TYPE_cleanup(ErlNifEnv *env, void *arg);

#endif
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here we declare both resource types as well as their cleanup functions.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;#include "utils.h"

// Resource type helpers
void DMatrix_RESOURCE_TYPE_cleanup(ErlNifEnv *env, void *arg) {
  DMatrixHandle handle = *((DMatrixHandle *)arg);
  XGDMatrixFree(handle);
}

void Booster_RESOURCE_TYPE_cleanup(ErlNifEnv *env, void *arg) {
  BoosterHandle handle = *((BoosterHandle *)arg);
  XGBoosterFree(handle);
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here we define the cleanup functions to use the custom freeing functions that the XGBoost documentation tells us to use.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;#include "exgboost.h"

static int load(ErlNifEnv *env, void **priv_data, ERL_NIF_TERM load_info) {
  DMatrix_RESOURCE_TYPE = enif_open_resource_type(
      env, NULL, "DMatrix_RESOURCE_TYPE", DMatrix_RESOURCE_TYPE_cleanup,
      (ErlNifResourceFlags)(ERL_NIF_RT_CREATE | ERL_NIF_RT_TAKEOVER), NULL);
  Booster_RESOURCE_TYPE = enif_open_resource_type(
      env, NULL, "Booster_RESOURCE_TYPE", Booster_RESOURCE_TYPE_cleanup,
      (ErlNifResourceFlags)(ERL_NIF_RT_CREATE | ERL_NIF_RT_TAKEOVER), NULL);
  if (DMatrix_RESOURCE_TYPE == NULL || Booster_RESOURCE_TYPE == NULL) {
    return 1;
  }
  return 0;
}

static int upgrade(ErlNifEnv *env, void **priv_data, void** old_priv_data,
                   ERL_NIF_TERM load_info) {
  DMatrix_RESOURCE_TYPE = enif_open_resource_type(
      env, NULL, "DMatrix_RESOURCE_TYPE", DMatrix_RESOURCE_TYPE_cleanup,
      ERL_NIF_RT_TAKEOVER, NULL);
  Booster_RESOURCE_TYPE = enif_open_resource_type(
      env, NULL, "Booster_RESOURCE_TYPE", Booster_RESOURCE_TYPE_cleanup,
      ERL_NIF_RT_TAKEOVER, NULL);
  if (DMatrix_RESOURCE_TYPE == NULL || Booster_RESOURCE_TYPE == NULL) {
    return 1;
  }
  return 0;
}
static ErlNifFunc nif_funcs[] = {
    {"xgboost_build_info", 0, EXGBuildInfo}
};
ERL_NIF_INIT(Elixir.EXGBoost.NIF, nif_funcs, load, NULL, upgrade, NULL)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here is where we register with &lt;code&gt;ERL_NIF_INIT&lt;/code&gt; that we are using these two resource types, and which cleanup function to use for each. One of &lt;code&gt;load&lt;/code&gt; or &lt;code&gt;upgrade&lt;/code&gt; is called to initialize the library, and &lt;code&gt;unload&lt;/code&gt; (which is NULL in this case) is called to release it. By passing the appropriate cleanup function to &lt;code&gt;enif_open_resource_type&lt;/code&gt;, we register which freeing function to use when the corresponding resource is discarded by the garbage collector.&lt;/p&gt;

&lt;p&gt;Now we are ready to use the resource types to make resources from the &lt;code&gt;DMatrix&lt;/code&gt; and &lt;code&gt;Booster&lt;/code&gt; structs. I like to make helper functions that take in the &lt;code&gt;BoosterHandle&lt;/code&gt; and &lt;code&gt;DMatrixHandle&lt;/code&gt; types and return either the resource or an error, so let's go ahead and make those functions:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;#include "booster.h"

static ERL_NIF_TERM make_Booster_resource(ErlNifEnv *env,
                                          BoosterHandle handle) {
  ERL_NIF_TERM ret = -1;
  BoosterHandle **resource =
      enif_alloc_resource(Booster_RESOURCE_TYPE, sizeof(BoosterHandle *));
  if (resource != NULL) {
    *resource = handle;
    ret = exg_ok(env, enif_make_resource(env, resource));
    enif_release_resource(resource);
  } else {
    ret = exg_error(env, "Failed to allocate memory for XGBoost Booster");
  }
  return ret;
}

#include "dmatrix.h"

static ERL_NIF_TERM make_DMatrix_resource(ErlNifEnv *env,
                                          DMatrixHandle handle) {
  ERL_NIF_TERM ret = -1;
  DMatrixHandle **resource =
      enif_alloc_resource(DMatrix_RESOURCE_TYPE, sizeof(DMatrixHandle *));
  if (resource != NULL) {
    *resource = handle;
    ret = exg_ok(env, enif_make_resource(env, resource));
    enif_release_resource(resource);
  } else {
    ret = exg_error(env, "Failed to allocate memory for XGBoost DMatrix");
  }
  return ret;
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Great! Now we can just use either of these functions when we need to return the appropriate resource to the Elixir NIF module. So, let's make the NIF that will allow us to create a new &lt;code&gt;Booster&lt;/code&gt; from Elixir. The NIF will accept an Elixir list of &lt;code&gt;DMatrix&lt;/code&gt; resources to initialize the &lt;code&gt;Booster&lt;/code&gt; from, and if the list is empty it creates a &lt;code&gt;Booster&lt;/code&gt; that's not associated with any &lt;code&gt;DMatrix&lt;/code&gt;.  First, here is a helper function that takes in a list of &lt;code&gt;DMatrix&lt;/code&gt; resource objects that are passed from Elixir and populates the array &lt;code&gt;dmats&lt;/code&gt; with all of the corresponding &lt;code&gt;DMatrixHandle&lt;/code&gt; structs:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;int exg_get_dmatrix_list(ErlNifEnv *env, ERL_NIF_TERM term,
                         DMatrixHandle **dmats, unsigned *len) {
  ERL_NIF_TERM head, tail;
  int i = 0;
  if (!enif_get_list_length(env, term, len)) {
    return 0;
  }
  *dmats = (DMatrixHandle *)enif_alloc(*len * sizeof(DMatrixHandle));
  // Check the allocated buffer, not the out-parameter itself
  // (an empty list legitimately yields a zero-length allocation).
  if (*len != 0 &amp;amp;&amp;amp; NULL == *dmats) {
    return 0;
  }
  while (enif_get_list_cell(env, term, &amp;amp;head, &amp;amp;tail)) {
    DMatrixHandle **resource = NULL;
    if (!enif_get_resource(env, head, DMatrix_RESOURCE_TYPE,
                           (void *)&amp;amp;(resource))) {
      return 0;
    }
    memcpy(&amp;amp;((*dmats)[i]), resource, sizeof(DMatrixHandle));
    term = tail;
    i++;
  }
  return 1;
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Next, we add the NIF implementation:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;#include "booster.h"

ERL_NIF_TERM EXGBoosterCreate(ErlNifEnv *env, int argc,
                              const ERL_NIF_TERM argv[]) {
  DMatrixHandle *dmats = NULL;
  ERL_NIF_TERM ret = -1;
  int result = -1;
  unsigned dmats_len = 0;
  BoosterHandle booster = NULL;
  if (1 != argc) {
    ret = exg_error(env, "Wrong number of arguments");
    goto END;
  }
  if (!exg_get_dmatrix_list(env, argv[0], &amp;amp;dmats, &amp;amp;dmats_len)) {
    ret = exg_error(env, "Invalid list of DMatrix");
    goto END;
  }
  if (0 == dmats_len) {
    result = XGBoosterCreate(NULL, 0, &amp;amp;booster);
    if (result == 0) {
      ret = make_Booster_resource(env, booster);
      goto END;
    } else {
      ret = exg_error(env, "Error making booster");
      goto END;
    }
  }

  result = XGBoosterCreate(dmats, dmats_len, &amp;amp;booster);
  if (result == 0) {
    ret = make_Booster_resource(env, booster);
    goto END;
  } else {
    ret = exg_error(env, XGBGetLastError());
  }
END:
  // Free the temporary handle array; XGBoosterCreate copies what it needs.
  if (dmats != NULL) {
    enif_free(dmats);
  }
  return ret;
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This implementation accepts a list of &lt;code&gt;DMatrix&lt;/code&gt; resources, uses them to create a new &lt;code&gt;BoosterHandle&lt;/code&gt; struct, then creates a resource for the &lt;code&gt;Booster&lt;/code&gt; that is then passed back to the calling Elixir module. Now, we register the new NIF to its corresponding Elixir function:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;static ErlNifFunc nif_funcs[] = {
    {"xgboost_build_info", 0, EXGBuildInfo},
    {"booster_create", 1, EXGBoosterCreate},
};
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Finally, we add the NIF to our Elixir module. Now, we can create a new &lt;code&gt;Booster&lt;/code&gt; from Elixir using &lt;code&gt;EXGBoost.NIF.booster_create/1&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;defmodule EXGBoost.NIF do
  @on_load :on_load
  def on_load do
    path = :filename.join([:code.priv_dir(:exgboost), "libexgboost"])
    :erlang.load_nif(path, 0)
  end

  def xgboost_build_info, do: :erlang.nif_error(:not_implemented)
  def booster_create(_handles), do: :erlang.nif_error(:not_implemented)
end
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Advanced Example
&lt;/h2&gt;

&lt;p&gt;Now that we can pass the &lt;code&gt;DMatrix&lt;/code&gt; and &lt;code&gt;Booster&lt;/code&gt; structs between NIFs, let's pretend we have a trained &lt;code&gt;Booster&lt;/code&gt; and want to make predictions with it by implementing the &lt;code&gt;EXGBoosterPredictFromDMatrix&lt;/code&gt; NIF. This NIF will accept a &lt;code&gt;Booster&lt;/code&gt; resource, a &lt;code&gt;DMatrix&lt;/code&gt; resource, and a JSON-encoded &lt;code&gt;config&lt;/code&gt; string, and return a 2-tuple of the shape of the prediction results and the flat array of the prediction results.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ERL_NIF_TERM EXGBoosterPredictFromDMatrix(ErlNifEnv *env, int argc,
                                          const ERL_NIF_TERM argv[]) {
  BoosterHandle booster;
  BoosterHandle **booster_resource = NULL;
  DMatrixHandle dmatrix;
  DMatrixHandle **dmatrix_resource = NULL;
  char *config = NULL;
  bst_ulong *out_shape = NULL;
  bst_ulong out_dim = 0;
  float *out_result = NULL;

  ERL_NIF_TERM ret = -1;
  int result = -1;
  if (3 != argc) {
    ret = exg_error(env, "Wrong number of arguments");
    goto END;
  }
  if (!enif_get_resource(env, argv[0], Booster_RESOURCE_TYPE,
                         (void *)&amp;amp;(booster_resource))) {
    ret = exg_error(env, "Invalid Booster");
    goto END;
  }
  if (!enif_get_resource(env, argv[1], DMatrix_RESOURCE_TYPE,
                         (void *)&amp;amp;(dmatrix_resource))) {
    ret = exg_error(env, "Invalid DMatrix");
    goto END;
  }
  if (!exg_get_string(env, argv[2], &amp;amp;config)) {
    ret = exg_error(env, "Config must be a JSON-encoded string");
    goto END;
  }
  booster = *booster_resource;
  dmatrix = *dmatrix_resource;
  result = XGBoosterPredictFromDMatrix(booster, dmatrix, config, &amp;amp;out_shape,
                                       &amp;amp;out_dim, &amp;amp;out_result);
  if (result == 0) {
    ret = collect_prediction_results(env, out_shape, out_dim, out_result);
  } else {
    ret = exg_error(env, XGBGetLastError());
  }
END:
  if (config != NULL) {
    enif_free(config);
  }
  return ret;
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here, you can see that we take in the two resources, pass their underlying structs to the XGBoost &lt;code&gt;XGBoosterPredictFromDMatrix&lt;/code&gt; API call, then use &lt;code&gt;collect_prediction_results&lt;/code&gt; to return the desired output, so let's take a look at &lt;code&gt;collect_prediction_results&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;static ERL_NIF_TERM collect_prediction_results(ErlNifEnv *env,
                                               bst_ulong *out_shape,
                                               bst_ulong out_dim,
                                               float *out_result) {
  bst_ulong out_len = 1;
  ERL_NIF_TERM shape_arr[out_dim];
  for (bst_ulong j = 0; j &amp;lt; out_dim; ++j) {
    shape_arr[j] = enif_make_int(env, out_shape[j]);
    out_len *= out_shape[j];
  }
  ERL_NIF_TERM shape = enif_make_tuple_from_array(env, shape_arr, out_dim);
  ERL_NIF_TERM result_arr[out_len];
  for (bst_ulong i = 0; i &amp;lt; out_len; ++i) {
    result_arr[i] = enif_make_double(env, out_result[i]);
  }
  return exg_ok(env, enif_make_tuple2(
                         env, shape,
                         enif_make_list_from_array(env, result_arr, out_len)));
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This function will convert the shape to a tuple, convert the output predictions to a list, and return a 2-tuple containing both.&lt;/p&gt;
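
&lt;p&gt;As a quick sanity check on the shape handling above: the length of the flat result list is the product of the shape dimensions, which is exactly what the &lt;code&gt;out_len&lt;/code&gt; accumulator computes. Pulled out on its own (no &lt;code&gt;erl_nif.h&lt;/code&gt; needed), that calculation looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;// Total element count of a row-major array with the given shape,
// mirroring the out_len accumulation in collect_prediction_results.
// (bst_ulong is XGBoost's unsigned length type from xgboost/c_api.h.)
static bst_ulong flat_length(const bst_ulong *shape, bst_ulong ndim) {
  bst_ulong len = 1;
  for (bst_ulong i = 0; i &amp;lt; ndim; ++i) {
    len *= shape[i];
  }
  return len;
}

// A booster predicting 2 rows with 3 outputs each has shape {2, 3},
// so the flat list returned to Elixir holds flat_length(shape, 2) == 6
// floats, which the Elixir side then reshapes into a tensor.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;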

&lt;p&gt;After registering this function to both the C and Elixir sides like we did before, we can use the function. First, we can use Elixir structs to wrap the resource neatly so that instead of interacting with a &lt;code&gt;reference()&lt;/code&gt; type (which is the Elixir type of the &lt;code&gt;resource&lt;/code&gt;), we can use the &lt;code&gt;Booster&lt;/code&gt; and &lt;code&gt;DMatrix&lt;/code&gt; structs:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;defmodule EXGBoost.Booster do
    @enforce_keys [:ref]
    defstruct [:ref, :best_iteration, :best_score]
end

defmodule EXGBoost.DMatrix do
    @enforce_keys [
    :ref,
    :format
  ]
  defstruct [
    :ref,
    :format,
    :label,
    :weight,
    :base_margin,
    :group,
    :label_upper_bound,
    :label_lower_bound,
    :feature_weights,
    :missing,
    :nthread,
    :feature_names,
    :feature_types
  ]
end
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We can use &lt;code&gt;@enforce_keys&lt;/code&gt; to require that the &lt;code&gt;resource&lt;/code&gt; be passed to the struct. Now, we can use the &lt;code&gt;.ref&lt;/code&gt; key in each struct to pass their corresponding resources:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def predict(%Booster{} = booster, %DMatrix{} = data, opts \\ []) do
    ...
    # Refer to source for full example
    # https://github.com/acalejos/exgboost/blob/main/lib/exgboost/booster.ex
    {shape, preds} =
      EXGBoost.NIF.booster_predict_from_dmatrix(booster.ref, data.ref, Jason.encode!(config))
      |&amp;gt; Internal.unwrap!()

    Nx.tensor(preds) |&amp;gt; Nx.reshape(shape)
  end
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Just like that, you can make predictions on a trained &lt;code&gt;Booster&lt;/code&gt; using XGBoost from Elixir!&lt;/p&gt;

&lt;h1&gt;
  
  
  NIFs in the Wild
&lt;/h1&gt;

&lt;p&gt;As I alluded to before, there are other languages in which you can implement Elixir NIFs. One of the downsides of writing NIFs (which you will become familiar with if you write them, and which is heavily caveated in the &lt;code&gt;erl_nif.h&lt;/code&gt; documentation) is that you open your program up to very unsafe code, where the BEAM cannot protect you. This means &lt;em&gt;gasp&lt;/em&gt; segfaults can happen in your program, or, even more insidiously, memory leaks. It is very important to write your native code diligently and to be cognizant of the risks that NIFs incur, so as to use them only when and where appropriate.&lt;/p&gt;

&lt;p&gt;Fortunately, the open-source community comes to the rescue (as it often does), with projects such as &lt;a href="https://github.com/rusterlium/rustler?ref=thestackcanary.com"&gt;Rustler&lt;/a&gt; and &lt;a href="https://github.com/ityonemo/zigler?ref=thestackcanary.com"&gt;Zigler&lt;/a&gt; aiming to bring the power of NIFs to Elixir using much safer languages (Rust and Zig, respectively).  Although I have not used either of these projects myself, I would encourage you to use them when possible rather than using the &lt;code&gt;erl_nif.h&lt;/code&gt; C API directly.&lt;/p&gt;

&lt;p&gt;If you're looking for some projects in the wild that use NIFs, I think you would be surprised to see how many well-known projects use them under the hood, but here are just a few examples:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Nx - Uses NIFs to implement its backends (both &lt;a href="https://github.com/elixir-nx/nx/tree/main/exla/c_src/exla?ref=thestackcanary.com"&gt;EXLA&lt;/a&gt; and &lt;a href="https://github.com/elixir-nx/nx/tree/main/torchx/c_src?ref=thestackcanary.com"&gt;torchx&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/elixir-nx/explorer/tree/main/native/explorer/src?ref=thestackcanary.com"&gt;Explorer&lt;/a&gt; - Uses the &lt;a href="https://www.pola.rs/?ref=thestackcanary.com"&gt;Polars&lt;/a&gt; DataFrame library&lt;/li&gt;
&lt;/ul&gt;

&lt;h1&gt;
  
  
  Conclusion
&lt;/h1&gt;

&lt;p&gt;NIFs are a great way to implement fast native code in Elixir, and open the world of external APIs to Elixir developers, but as with writing anything in C – with great power comes great responsibility. Some open-source alternatives have sprouted up to help address these problems, and as they become more mature, there might be increasingly fewer reasons to use C NIFs, although I doubt they will ever go extinct.  Comment below with any other interesting projects that use NIFs!&lt;/p&gt;

</description>
      <category>github</category>
      <category>c</category>
      <category>elixir</category>
    </item>
  </channel>
</rss>
