<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Aurelio Buarque</title>
    <description>The latest articles on DEV Community by Aurelio Buarque (@buarki).</description>
    <link>https://dev.to/buarki</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1197876%2Fff6ddec6-5788-4eab-b246-b0a87593580a.png</url>
      <title>DEV Community: Aurelio Buarque</title>
      <link>https://dev.to/buarki</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/buarki"/>
    <language>en</language>
    <item>
      <title>Optimizing Struct Layout and Padding in Practice</title>
      <dc:creator>Aurelio Buarque</dc:creator>
      <pubDate>Tue, 27 May 2025 01:52:07 +0000</pubDate>
      <link>https://dev.to/buarki/optimizing-struct-layout-and-padding-in-practice-23p1</link>
      <guid>https://dev.to/buarki/optimizing-struct-layout-and-padding-in-practice-23p1</guid>
      <description>&lt;h2&gt;
  
  
  What is Struct Layout?
&lt;/h2&gt;

&lt;p&gt;When working with Go, understanding how structs are laid out in memory is crucial for writing efficient code. Struct layout refers to how fields within a struct are arranged in memory, including any padding that the compiler adds to ensure proper alignment.&lt;/p&gt;

&lt;p&gt;In Go, the compiler follows specific rules for struct layout to ensure efficient memory access and proper alignment with the underlying hardware architecture. This process is automatic, but understanding it can help you write more memory-efficient code.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why is Struct Layout Important?
&lt;/h2&gt;

&lt;p&gt;The way structs are laid out in memory can have significant implications for your application's performance and memory usage. Here's why it matters:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Memory Efficiency&lt;/strong&gt;: Poor struct layout can lead to wasted memory due to padding. The compiler adds padding bytes to ensure that fields are properly aligned, which can sometimes result in significant memory overhead.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Cache Utilization&lt;/strong&gt;: Modern CPUs use cache lines (typically 64 bytes) to fetch data from memory. If your struct fields are scattered due to padding, you might need more cache lines to access the same amount of data.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Performance Impact&lt;/strong&gt;: Accessing memory that isn't properly aligned can lead to performance penalties, as the CPU might need to perform multiple memory accesses to read a single value.&lt;/p&gt;

&lt;h2&gt;
  
  
  Understanding Padding
&lt;/h2&gt;

&lt;p&gt;Take a look at this one:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;type&lt;/span&gt; &lt;span class="n"&gt;Example&lt;/span&gt; &lt;span class="k"&gt;struct&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;a&lt;/span&gt; &lt;span class="kt"&gt;bool&lt;/span&gt;    &lt;span class="c"&gt;// 1 byte&lt;/span&gt;
    &lt;span class="n"&gt;b&lt;/span&gt; &lt;span class="kt"&gt;int64&lt;/span&gt;   &lt;span class="c"&gt;// 8 bytes&lt;/span&gt;
    &lt;span class="n"&gt;c&lt;/span&gt; &lt;span class="kt"&gt;bool&lt;/span&gt;    &lt;span class="c"&gt;// 1 byte&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You might expect this struct to take 10 bytes (1 + 8 + 1), but due to alignment requirements it actually takes 24 bytes on a typical 64-bit platform! Here's why:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;The int64 field needs to be aligned on an 8-byte boundary;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The compiler adds 7 bytes of padding after the first bool;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;It also adds 7 bytes of padding after the second bool so that the struct's total size is a multiple of its 8-byte alignment (which matters when structs are stored contiguously, e.g. in arrays);&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;We can optimize this by reordering the fields:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;type&lt;/span&gt; &lt;span class="n"&gt;OptimizedExample&lt;/span&gt; &lt;span class="k"&gt;struct&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;b&lt;/span&gt; &lt;span class="kt"&gt;int64&lt;/span&gt;   &lt;span class="c"&gt;// 8 bytes&lt;/span&gt;
    &lt;span class="n"&gt;a&lt;/span&gt; &lt;span class="kt"&gt;bool&lt;/span&gt;    &lt;span class="c"&gt;// 1 byte&lt;/span&gt;
    &lt;span class="n"&gt;c&lt;/span&gt; &lt;span class="kt"&gt;bool&lt;/span&gt;    &lt;span class="c"&gt;// 1 byte&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This optimized version takes only 16 bytes because the two bool fields now sit next to each other and need only a single run of trailing padding. The rule of thumb is: &lt;strong&gt;place larger fields first&lt;/strong&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Practical Benefits
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;1. Reduced Memory Usage&lt;/strong&gt;: In systems where memory is constrained or when dealing with large numbers of structs, proper layout can significantly reduce memory consumption.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Better Cache Performance&lt;/strong&gt;: When structs are properly laid out, more data can fit into a single cache line, reducing cache misses and improving performance.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Improved Serialization&lt;/strong&gt;: When dumping structs byte-for-byte to the network or to disk (e.g. binary snapshots or memory-mapped files), a tighter layout reduces the amount of data that needs to be transferred or stored. Note that field-by-field encoders such as JSON are unaffected by padding.&lt;/p&gt;

&lt;h2&gt;
  
  
  Best Practices
&lt;/h2&gt;

&lt;p&gt;Here are some best practices for struct layout in Go:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Order fields by size&lt;/strong&gt;: Place larger fields first, followed by smaller ones.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Group related fields&lt;/strong&gt;: Keep related fields together to improve cache locality.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Consider alignment requirements&lt;/strong&gt;: Be aware of the alignment requirements of different types (e.g., int64 needs 8-byte alignment).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. Use tools&lt;/strong&gt;: Leverage tools like viztruct to analyze and optimize your struct layouts.&lt;/p&gt;

&lt;h2&gt;
  
  
  Practical Benchmark
&lt;/h2&gt;

&lt;p&gt;I've created a &lt;a href="https://gist.github.com/buarki/27824ef45d47e2a39795a9d8cfff9bb9" rel="noopener noreferrer"&gt;simple benchmark&lt;/a&gt; to demonstrate the impact of struct layout on performance and memory usage. It compares the following two structs, one with a poor layout and one with an optimized layout:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="c"&gt;// Poor layout - fields ordered by size&lt;/span&gt;
&lt;span class="k"&gt;type&lt;/span&gt; &lt;span class="n"&gt;PoorLayout&lt;/span&gt; &lt;span class="k"&gt;struct&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;a&lt;/span&gt; &lt;span class="kt"&gt;bool&lt;/span&gt;    &lt;span class="c"&gt;// 1 byte&lt;/span&gt;
    &lt;span class="n"&gt;b&lt;/span&gt; &lt;span class="kt"&gt;int64&lt;/span&gt;   &lt;span class="c"&gt;// 8 bytes&lt;/span&gt;
    &lt;span class="n"&gt;c&lt;/span&gt; &lt;span class="kt"&gt;bool&lt;/span&gt;    &lt;span class="c"&gt;// 1 byte&lt;/span&gt;
    &lt;span class="n"&gt;d&lt;/span&gt; &lt;span class="kt"&gt;int32&lt;/span&gt;   &lt;span class="c"&gt;// 4 bytes&lt;/span&gt;
    &lt;span class="n"&gt;e&lt;/span&gt; &lt;span class="kt"&gt;bool&lt;/span&gt;    &lt;span class="c"&gt;// 1 byte&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c"&gt;// Optimized layout - larger fields first&lt;/span&gt;
&lt;span class="k"&gt;type&lt;/span&gt; &lt;span class="n"&gt;OptimizedLayout&lt;/span&gt; &lt;span class="k"&gt;struct&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;b&lt;/span&gt; &lt;span class="kt"&gt;int64&lt;/span&gt;   &lt;span class="c"&gt;// 8 bytes&lt;/span&gt;
    &lt;span class="n"&gt;d&lt;/span&gt; &lt;span class="kt"&gt;int32&lt;/span&gt;   &lt;span class="c"&gt;// 4 bytes&lt;/span&gt;
    &lt;span class="n"&gt;a&lt;/span&gt; &lt;span class="kt"&gt;bool&lt;/span&gt;    &lt;span class="c"&gt;// 1 byte&lt;/span&gt;
    &lt;span class="n"&gt;c&lt;/span&gt; &lt;span class="kt"&gt;bool&lt;/span&gt;    &lt;span class="c"&gt;// 1 byte&lt;/span&gt;
    &lt;span class="n"&gt;e&lt;/span&gt; &lt;span class="kt"&gt;bool&lt;/span&gt;    &lt;span class="c"&gt;// 1 byte&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The output I got from these benchmarks is:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;go &lt;span class="nb"&gt;test&lt;/span&gt; &lt;span class="nt"&gt;-bench&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;.&lt;/span&gt; &lt;span class="nt"&gt;-benchmem&lt;/span&gt;
goos: darwin
goarch: amd64
pkg: github.com/buarki/chvv
cpu: VirtualApple @ 2.50GHz
BenchmarkMemoryAllocation/PoorLayout-12                 1207        843062 ns/op    32006154 B/op          1 allocs/op
BenchmarkMemoryAllocation/OptimizedLayout-12            2422        447033 ns/op    16007174 B/op          1 allocs/op
BenchmarkFieldAccess/PoorLayout-12                      2146        567032 ns/op           0 B/op          0 allocs/op
BenchmarkFieldAccess/OptimizedLayout-12                 4645        257307 ns/op           0 B/op          0 allocs/op
PASS
ok      github.com/buarki/chvv  5.214s
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Note: These benchmarks were run on a VirtualApple CPU (2.50GHz) using Go 1.21 on macOS. The numbers might vary slightly on different architectures, but the relative improvements should be similar.&lt;/p&gt;

&lt;p&gt;The results show some interesting insights:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Memory Usage&lt;/strong&gt;: The PoorLayout struct uses 32 bytes per instance, while the OptimizedLayout uses only 16 bytes. This means we save 16 bytes per struct instance, which is a 50% reduction in memory usage. For 1 million instances, this translates to 32MB vs 16MB - a massive difference that directly affects your application's memory footprint and garbage collection overhead.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Allocation Performance&lt;/strong&gt;: The optimized version is 47% faster in allocation (447,033 ns/op vs 843,062 ns/op). This improvement comes from reduced memory pressure and better cache utilization during allocation. The CPU can process more optimized structs in the same time frame because they fit better in the cache hierarchy.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Field Access Performance&lt;/strong&gt;: The optimized version shows a 55% improvement in field access speed (257,307 ns/op vs 567,032 ns/op). This significant performance gain comes from better cache locality and reduced cache line misses. Modern CPUs have a hierarchy of caches (L1, L2, L3) with different sizes and speeds:&lt;/p&gt;

&lt;p&gt;• L1 Cache: The fastest but smallest (typically 32-64KB per core)&lt;/p&gt;

&lt;p&gt;• L2 Cache: Medium speed and size (typically 256KB-1MB per core)&lt;/p&gt;

&lt;p&gt;• L3 Cache: The largest but slowest (typically 2-32MB shared)&lt;/p&gt;

&lt;p&gt;The &lt;strong&gt;optimized layout allows more structs to fit in these caches&lt;/strong&gt;, reducing the need to fetch data from main memory. When accessing fields in the poor layout, the CPU might need to load multiple cache lines due to padding, while the optimized layout can often fit multiple structs in a single cache line.&lt;/p&gt;

&lt;p&gt;The improvements are particularly significant in scenarios where you're dealing with large numbers of structs or performance-critical code paths. The combination of reduced memory usage and improved cache utilization can lead to substantial performance gains in real applications.&lt;/p&gt;

&lt;p&gt;Ok, but where does this make any difference? Some practical scenarios where these optimizations matter:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Web Servers&lt;/strong&gt;: A typical web server might handle thousands of concurrent requests, each potentially creating multiple structs for request processing, authentication, and response formatting. For example, if your server processes 10,000 requests per second, each creating 100 structs, that's 1 million structs per second - and the memory savings add up quickly.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Data Processing&lt;/strong&gt;: When processing large datasets, you might load thousands of records into memory. For instance, a CSV file with 100,000 rows, each represented by a struct, would save 1.6MB of memory with optimized layout. This becomes even more significant when dealing with multiple concurrent data processing tasks.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Game Development&lt;/strong&gt;: In game engines, you might have thousands of entities (players, NPCs, items) each represented by structs. The memory savings and performance improvements can be crucial for maintaining smooth gameplay, especially on resource-constrained devices.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. IoT Devices&lt;/strong&gt;: On resource-constrained devices, every byte counts. Optimizing struct layouts can help reduce memory usage and improve battery life by reducing the number of memory operations.&lt;/p&gt;

&lt;p&gt;In these scenarios, the cumulative effect of struct layout optimization can lead to:&lt;/p&gt;

&lt;p&gt;• Reduced memory pressure and fewer garbage collection cycles;&lt;/p&gt;

&lt;p&gt;• Better cache utilization and faster processing;&lt;/p&gt;

&lt;p&gt;• Lower resource requirements and better scalability;&lt;/p&gt;

&lt;p&gt;• Improved battery life on mobile and IoT devices;&lt;/p&gt;

&lt;p&gt;You can &lt;a href="https://gist.github.com/buarki/27824ef45d47e2a39795a9d8cfff9bb9" rel="noopener noreferrer"&gt;run the benchmark on your machine&lt;/a&gt; to see the actual numbers. The results might vary depending on your CPU architecture and Go version, but the relative differences should be similar.&lt;/p&gt;

&lt;h2&gt;
  
  
  Using Viztruct to Optimize Struct Layout
&lt;/h2&gt;

&lt;p&gt;While you can manually optimize struct layouts, tools like &lt;a href="https://viztruct.vercel.app/" rel="noopener noreferrer"&gt;viztruct&lt;/a&gt; can help visualize and optimize your struct layouts automatically. These tools can:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Show the actual memory layout of your structs&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Suggest optimal field ordering&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Calculate padding and alignment&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Help identify potential memory waste&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;You can use viztruct through its &lt;a href="https://viztruct.vercel.app/" rel="noopener noreferrer"&gt;web app&lt;/a&gt; or &lt;a href="https://github.com/buarki/viztruct?tab=readme-ov-file#cli-installation" rel="noopener noreferrer"&gt;install it locally&lt;/a&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;go &lt;span class="nb"&gt;install &lt;/span&gt;github.com/buarki/viztruct/cmd/viztruct@latest
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Understanding and optimizing struct layout is an important aspect of writing efficient Go code. While the compiler handles most alignment automatically, being aware of these concepts can help you write more memory-efficient and performant code, especially in resource-constrained environments or when dealing with large numbers of structs.&lt;/p&gt;

&lt;p&gt;Remember that optimization should be guided by profiling and actual performance requirements. Not every struct needs to be perfectly optimized, and there are scenarios where struct padding should be "disabled", like in C networking code where you need precise control over memory layout for protocol headers. However, understanding these concepts will help you make informed decisions when performance matters.&lt;/p&gt;

</description>
      <category>go</category>
      <category>programming</category>
      <category>devops</category>
      <category>opensource</category>
    </item>
    <item>
      <title>Searching Castles Using Go, MongoDB, Github Actions And Web Scraping</title>
      <dc:creator>Aurelio Buarque</dc:creator>
      <pubDate>Sun, 09 Jun 2024 18:42:28 +0000</pubDate>
      <link>https://dev.to/buarki/searching-castles-using-go-mongodb-github-actions-and-web-scraping-4h8j</link>
      <guid>https://dev.to/buarki/searching-castles-using-go-mongodb-github-actions-and-web-scraping-4h8j</guid>
      <description>&lt;h2&gt;
  
  
  It’s Been A While!
&lt;/h2&gt;

&lt;p&gt;It’s been a while since the last one! TL;DR: work, life, study… you know :)&lt;/p&gt;

&lt;h2&gt;
  
  
  A Project Of Open Data Using Go
&lt;/h2&gt;

&lt;p&gt;In March 2024, I created a small project to experiment with using &lt;strong&gt;Server-Sent Events (SSE)&lt;/strong&gt; in a Go web server to continuously send data to a frontend client. It wasn’t anything particularly fancy, but it was still pretty cool :)&lt;/p&gt;

&lt;p&gt;The project involved a small server written in Go that served a minimalist frontend client created with raw HTML, vanilla JavaScript, and Tailwind CSS. Additionally, it provided an endpoint where the client could open an SSE connection. The basic goal was for the frontend to have a button that, once pressed, would trigger a server-side search to collect data about castles. As the castles were found, they would be sent from the server to the frontend in real-time. I focused on castles from the United Kingdom and Portugal, and the project worked nicely as you can see below:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F40rdrtbbrdhcsskz5w07.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F40rdrtbbrdhcsskz5w07.gif" alt="Standalone version of Find castles working" width="600" height="351"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The code of this minimalist project can be found &lt;a href="https://github.com/buarki/find-castles/blob/10d0f8604011f0a98939b3fc4d70ccca4db6f401/cmd/standalone/main.go" rel="noopener noreferrer"&gt;here&lt;/a&gt;, and you can follow the README instructions to run it on your local machine.&lt;/p&gt;

&lt;p&gt;A few days ago, I revisited this project and decided to expand it to include more countries. However, after several hours of searching, I couldn’t find an official consolidated dataset of castles in Europe. I did find a few datasets focused on specific countries, but none that were comprehensive. Therefore, for the sake of having fun with Go and because I have a passion for history, I started the project &lt;strong&gt;Find Castles&lt;/strong&gt;. The &lt;strong&gt;goal of this project is to create a comprehensive dataset of castles by collecting data from available sources, cleaning it, preparing it, and making it available via an API&lt;/strong&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Does Go Really Shine For This Project?
&lt;/h2&gt;

&lt;p&gt;Goroutines and channels! Most of this project's code navigates websites, collecting and processing data before finally saving it to the database. Go makes it easy to implement these complex concurrent operations while keeping the maximum possible amount of hair :)&lt;/p&gt;

&lt;h2&gt;
  
  
  How Does It Work So Far?
&lt;/h2&gt;

&lt;p&gt;So far I have implemented data collectors for only three countries: &lt;strong&gt;Ireland, Portugal, and the United Kingdom&lt;/strong&gt;, because finding good references for these countries required relatively little effort.&lt;/p&gt;

&lt;p&gt;The current implementation has two main stages: inspecting the website for links containing castle data, and the data extraction itself. This process is the same for all countries, so an interface was introduced to establish a stable API for current and future enrichers:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;type&lt;/span&gt; &lt;span class="n"&gt;Enricher&lt;/span&gt; &lt;span class="k"&gt;interface&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="n"&gt;CollectCastlesToEnrich&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Context&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;([]&lt;/span&gt;&lt;span class="n"&gt;castle&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;error&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

  &lt;span class="n"&gt;EnrichCastle&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Context&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;c&lt;/span&gt; &lt;span class="n"&gt;castle&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Model&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;castle&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;error&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If you want to see at least one concrete implementation, &lt;a href="https://github.com/buarki/find-castles/blob/10d0f8604011f0a98939b3fc4d70ccca4db6f401/enricher/ireland.go" rel="noopener noreferrer"&gt;here you can find the enricher for Ireland&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Once we have enrichers able to scrape and extract data from the proper sources, we can actually collect data using the &lt;strong&gt;executor package&lt;/strong&gt;. This package manages the execution of enrichers by leveraging goroutines and channels, distributing the workload among the available CPUs.&lt;/p&gt;

&lt;p&gt;The executor's current definition and constructor can be seen below:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;type&lt;/span&gt; &lt;span class="n"&gt;EnchimentExecutor&lt;/span&gt; &lt;span class="k"&gt;struct&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;enrichers&lt;/span&gt; &lt;span class="k"&gt;map&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;castle&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Country&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="n"&gt;enricher&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Enricher&lt;/span&gt;
    &lt;span class="n"&gt;cpus&lt;/span&gt;      &lt;span class="kt"&gt;int&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="n"&gt;New&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;cpusToUse&lt;/span&gt; &lt;span class="kt"&gt;int&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;httpClient&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;http&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Client&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;enrichers&lt;/span&gt; &lt;span class="k"&gt;map&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;castle&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Country&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="n"&gt;enricher&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Enricher&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;EnchimentExecutor&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;cpus&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;cpusToUse&lt;/span&gt;
    &lt;span class="n"&gt;availableCPUs&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;runtime&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;NumCPU&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;cpusToUse&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;availableCPUs&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;cpus&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;availableCPUs&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;EnchimentExecutor&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;cpus&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;      &lt;span class="n"&gt;cpus&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;enrichers&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="n"&gt;enrichers&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The execution process is basically a &lt;strong&gt;data pipeline&lt;/strong&gt;: the first stage looks for castles to be enriched, the next stage extracts data from the given sources, and the last one persists it to the database.&lt;/p&gt;

&lt;p&gt;The first stage spawns goroutines to find the castles, and as castles are found they are pushed into channels. We then merge those channels into a single one to be consumed by the next stage:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ex&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;EnchimentExecutor&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;collectCastles&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Context&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;-&lt;/span&gt;&lt;span class="k"&gt;chan&lt;/span&gt; &lt;span class="n"&gt;castle&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;-&lt;/span&gt;&lt;span class="k"&gt;chan&lt;/span&gt; &lt;span class="kt"&gt;error&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;var&lt;/span&gt; &lt;span class="n"&gt;collectingChan&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;-&lt;/span&gt;&lt;span class="k"&gt;chan&lt;/span&gt; &lt;span class="n"&gt;castle&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Model&lt;/span&gt;
    &lt;span class="k"&gt;var&lt;/span&gt; &lt;span class="n"&gt;errChan&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;-&lt;/span&gt;&lt;span class="k"&gt;chan&lt;/span&gt; &lt;span class="kt"&gt;error&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;enricher&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="k"&gt;range&lt;/span&gt; &lt;span class="n"&gt;ex&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;enrichers&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;castlesChan&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;castlesErrChan&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;ex&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;toChanel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;enricher&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;collectingChan&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;collectingChan&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;castlesChan&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;errChan&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;errChan&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;castlesErrChan&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;fanin&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Merge&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;collectingChan&lt;/span&gt;&lt;span class="o"&gt;...&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;fanin&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Merge&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;errChan&lt;/span&gt;&lt;span class="o"&gt;...&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ex&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;EnchimentExecutor&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;toChanel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Context&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt; &lt;span class="n"&gt;enricher&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Enricher&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;-&lt;/span&gt;&lt;span class="k"&gt;chan&lt;/span&gt; &lt;span class="n"&gt;castle&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;-&lt;/span&gt;&lt;span class="k"&gt;chan&lt;/span&gt; &lt;span class="kt"&gt;error&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;castlesToEnrich&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="nb"&gt;make&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;chan&lt;/span&gt; &lt;span class="n"&gt;castle&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Model&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;errChan&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="nb"&gt;make&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;chan&lt;/span&gt; &lt;span class="kt"&gt;error&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;go&lt;/span&gt; &lt;span class="k"&gt;func&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;defer&lt;/span&gt; &lt;span class="nb"&gt;close&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;castlesToEnrich&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;defer&lt;/span&gt; &lt;span class="nb"&gt;close&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;errChan&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="n"&gt;englandCastles&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;CollectCastlesToEnrich&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="n"&gt;errChan&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;-&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;c&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="k"&gt;range&lt;/span&gt; &lt;span class="n"&gt;englandCastles&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="n"&gt;castlesToEnrich&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;-&lt;/span&gt; &lt;span class="n"&gt;c&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;}()&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;castlesToEnrich&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;errChan&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The second stage spawns a group of goroutines that listen on the output channel of the previous stage; as castles arrive, each goroutine extracts data by scraping the corresponding HTML page. When extraction finishes, the enriched castles are pushed into another channel.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ex&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;EnchimentExecutor&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;extractData&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Context&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;castlesToEnrich&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;-&lt;/span&gt;&lt;span class="k"&gt;chan&lt;/span&gt; &lt;span class="n"&gt;castle&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Model&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;chan&lt;/span&gt; &lt;span class="n"&gt;castle&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;chan&lt;/span&gt; &lt;span class="kt"&gt;error&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;enrichedCastles&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="nb"&gt;make&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;chan&lt;/span&gt; &lt;span class="n"&gt;castle&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Model&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;errChan&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="nb"&gt;make&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;chan&lt;/span&gt; &lt;span class="kt"&gt;error&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;go&lt;/span&gt; &lt;span class="k"&gt;func&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;defer&lt;/span&gt; &lt;span class="nb"&gt;close&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;enrichedCastles&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;defer&lt;/span&gt; &lt;span class="nb"&gt;close&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;errChan&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="k"&gt;select&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="k"&gt;case&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;-&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Done&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;
                &lt;span class="k"&gt;return&lt;/span&gt;
            &lt;span class="k"&gt;case&lt;/span&gt; &lt;span class="n"&gt;castleToEnrich&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ok&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;-&lt;/span&gt;&lt;span class="n"&gt;castlesToEnrich&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;
                &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;ok&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                    &lt;span class="n"&gt;enricher&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;ex&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;enrichers&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;castleToEnrich&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Country&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
                    &lt;span class="n"&gt;enrichedCastle&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;enricher&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;EnrichCastle&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;castleToEnrich&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                        &lt;span class="n"&gt;errChan&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;-&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt;
                    &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                        &lt;span class="n"&gt;enrichedCastles&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;-&lt;/span&gt; &lt;span class="n"&gt;enrichedCastle&lt;/span&gt;
                    &lt;span class="p"&gt;}&lt;/span&gt;
                &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                    &lt;span class="k"&gt;return&lt;/span&gt;
                &lt;span class="p"&gt;}&lt;/span&gt;
            &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;}()&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;enrichedCastles&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;errChan&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And the executor’s main function, which ties it all together, is the one below:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ex&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;EnchimentExecutor&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;Enrich&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Context&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;-&lt;/span&gt;&lt;span class="k"&gt;chan&lt;/span&gt; &lt;span class="n"&gt;castle&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;-&lt;/span&gt;&lt;span class="k"&gt;chan&lt;/span&gt; &lt;span class="kt"&gt;error&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;castlesToEnrich&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;errChan&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;ex&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;collectCastles&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;enrichedCastlesBuf&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;-&lt;/span&gt;&lt;span class="k"&gt;chan&lt;/span&gt; &lt;span class="n"&gt;castle&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Model&lt;/span&gt;&lt;span class="p"&gt;{}&lt;/span&gt;
    &lt;span class="n"&gt;castlesEnrichmentErr&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;-&lt;/span&gt;&lt;span class="k"&gt;chan&lt;/span&gt; &lt;span class="kt"&gt;error&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;errChan&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="m"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;ex&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;cpus&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="o"&gt;++&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;receivedEnrichedCastlesChan&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;enrichErrs&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;ex&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;extractData&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;castlesToEnrich&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;enrichedCastlesBuf&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;enrichedCastlesBuf&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;receivedEnrichedCastlesChan&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;castlesEnrichmentErr&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;castlesEnrichmentErr&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;enrichErrs&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="n"&gt;enrichedCastles&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;fanin&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Merge&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;enrichedCastlesBuf&lt;/span&gt;&lt;span class="o"&gt;...&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;enrichmentErrs&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;fanin&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Merge&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;castlesEnrichmentErr&lt;/span&gt;&lt;span class="o"&gt;...&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;enrichedCastles&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;enrichmentErrs&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The full current implementation of the executor can be found &lt;a href="https://github.com/buarki/find-castles/blob/10d0f8604011f0a98939b3fc4d70ccca4db6f401/executor/executor.go" rel="noopener noreferrer"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The last stage simply consumes the channel of enriched castles and saves them in bulk into MongoDB:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="n"&gt;castlesChan&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;errChan&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;castlesEnricher&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Enrich&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;var&lt;/span&gt; &lt;span class="n"&gt;buffer&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;&lt;span class="n"&gt;castle&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Model&lt;/span&gt;

&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="k"&gt;select&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="k"&gt;case&lt;/span&gt; &lt;span class="n"&gt;castle&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ok&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;-&lt;/span&gt;&lt;span class="n"&gt;castlesChan&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="n"&gt;ok&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nb"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;buffer&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="m"&gt;0&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;db&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;SaveCastles&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;collection&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;buffer&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
          &lt;span class="n"&gt;log&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Fatal&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;err&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
      &lt;span class="p"&gt;}&lt;/span&gt;
      &lt;span class="k"&gt;return&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="n"&gt;buffer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;buffer&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;castle&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nb"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;buffer&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="n"&gt;bufferSize&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;db&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;SaveCastles&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;collection&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;buffer&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;log&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Fatal&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;err&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
      &lt;span class="p"&gt;}&lt;/span&gt;
      &lt;span class="n"&gt;buffer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;buffer&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="k"&gt;case&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;-&lt;/span&gt;&lt;span class="n"&gt;errChan&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="n"&gt;log&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Printf&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"error enriching castles: %v"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can find the current version of main.go &lt;a href="https://github.com/buarki/find-castles/blob/10d0f8604011f0a98939b3fc4d70ccca4db6f401/cmd/enricher/main.go" rel="noopener noreferrer"&gt;here&lt;/a&gt;. This process runs periodically using a scheduled job &lt;a href="https://github.com/buarki/find-castles/blob/10d0f8604011f0a98939b3fc4d70ccca4db6f401/.github/workflows/collect-and-enrich-castles.yml" rel="noopener noreferrer"&gt;created using GitHub Actions&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Next Steps
&lt;/h2&gt;

&lt;p&gt;This project has a considerable roadmap ahead; the next steps are listed below.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Implement recursive crawling&lt;/strong&gt;: adding more enrichers requires support for recursive crawling of a website, because some sites list a huge number of castles spread across paginated pages.&lt;/p&gt;
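&lt;p&gt;A minimal sketch of how that paginated crawling could look. The &lt;code&gt;fetchPage&lt;/code&gt; function here is a hypothetical stand-in for the real HTTP request plus HTML scraping of one listing page; none of these names come from the actual codebase:&lt;/p&gt;

```go
package main

import "fmt"

// crawlAllPages keeps fetching listing pages until one comes back empty,
// which is taken as the end of the pagination.
func crawlAllPages(fetchPage func(page int) []string) []string {
	var all []string
	for page := 1; ; page++ {
		items := fetchPage(page)
		if len(items) == 0 { // empty page: pagination exhausted
			break
		}
		all = append(all, items...)
	}
	return all
}

func main() {
	// Fake a site whose listing spans three pages of two castles each.
	pages := [][]string{
		{"Windsor", "Leeds"},
		{"Bodiam", "Warwick"},
		{"Dover", "Alnwick"},
	}
	fetch := func(page int) []string {
		if page > len(pages) {
			return nil
		}
		return pages[page-1]
	}
	fmt.Println(crawlAllPages(fetch))
}
```

&lt;p&gt;Stopping on the first empty page keeps the loop source-agnostic: each real enricher would only have to plug its own URL building and scraping into &lt;code&gt;fetchPage&lt;/code&gt;.&lt;/p&gt;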

&lt;p&gt;&lt;strong&gt;2. Support multiple enrichment website sources for the same country&lt;/strong&gt;: from what I could see, some countries have more than one useful website, so the executor must be able to combine several sources for a single country.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Develop an official website&lt;/strong&gt;: in the meantime, an official website for this project must be built to make the collected data available and, of course, to show the progress. Such a site is in progress and you can already visit it here. Due to my lack of design skills the site is ugly as hell, but stay tuned and we’ll get over it :)&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. Integrate machine learning for filling data gaps&lt;/strong&gt;: machine learning will certainly help a lot, especially in complementing data that is hard to find via the regular enrichers: by prompting models for hard-to-find attributes, we can efficiently fill in data gaps and enrich the dataset.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Contributions Are Welcome!&lt;/strong&gt;&lt;br&gt;
This project is open source and all &lt;strong&gt;collaborations are more than welcome&lt;/strong&gt;! Whether you’re interested in backend development, frontend design, or any other aspect of the project, your input is valuable.&lt;/p&gt;

&lt;p&gt;If you find anything you want to contribute — especially with frontend :) — just open an issue on the &lt;a href="https://github.com/buarki/find-castles" rel="noopener noreferrer"&gt;repository&lt;/a&gt; and ask for code review.&lt;/p&gt;

&lt;p&gt;This article was originally posted on my personal site: &lt;a href="https://www.buarki.com/blog/find-castles" rel="noopener noreferrer"&gt;https://www.buarki.com/blog/find-castles&lt;/a&gt;&lt;/p&gt;

</description>
      <category>go</category>
      <category>mongodb</category>
      <category>webscraping</category>
      <category>opendata</category>
    </item>
    <item>
      <title>Hexagonal Architecture/Ports And Adapters: Clarifying Key Concepts Using Go</title>
      <dc:creator>Aurelio Buarque</dc:creator>
      <pubDate>Thu, 21 Mar 2024 14:18:45 +0000</pubDate>
      <link>https://dev.to/buarki/hexagonal-architectureports-and-adapters-clarifying-key-concepts-using-go-14oo</link>
      <guid>https://dev.to/buarki/hexagonal-architectureports-and-adapters-clarifying-key-concepts-using-go-14oo</guid>
      <description>&lt;h2&gt;
  
  
  Disclaimer
&lt;/h2&gt;

&lt;p&gt;Before saying any words on this topic I must highlight some points: (1) the intent of this writing is to provide a concrete, easy-to-understand, practical example of Hexagonal Architecture; for that reason the example is kept really simple, avoiding the cognitive load of peripheral topics. It goes straight to the point. (2) There are certainly several ways to approach the example shown below; this is just one simple, minimalist take meant only to convey the idea, and you are more than welcome to share your thoughts as well :)&lt;/p&gt;

&lt;h2&gt;
  
  
  Motivation To Write this
&lt;/h2&gt;

&lt;p&gt;A few weeks ago, while chatting with a colleague who is a newcomer to software development, I noticed she was struggling with some of the same difficulties I encountered when trying to understand Hexagonal Architecture. Later on, looking at some &lt;a href="https://www.reddit.com/r/programming/comments/ovigjs/how_to_teach_ports_and_adapters/" rel="noopener noreferrer"&gt;discussions on reddit&lt;/a&gt;, it seems this is something everyone faces, like this one:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgfd86jw369djyaqffuev.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgfd86jw369djyaqffuev.png" alt="A print from a reddis forum" width="714" height="106"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Because of that, I decided to write down my two cents on this topic, sharing what I know, what I’ve used and my experience in projects I’ve been involved in.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Is Hexagonal Architecture? Where Does It Come From? And Where Does the Hexagon Fit In?
&lt;/h2&gt;

&lt;p&gt;Being straightforward, &lt;strong&gt;Hexagonal Architecture has nothing to do with hexagons&lt;/strong&gt;. A better name for it is the one that the author, Alistair Cockburn, gave to it &lt;a href="https://alistair.cockburn.us/hexagonal-architecture/" rel="noopener noreferrer"&gt;in his blog post: Ports and Adapters&lt;/a&gt;. The name “Hexagonal Architecture” stuck probably due to the visual representation of the system’s structure that is usually used. So, as it has really nothing to do with hexagons, from now on let’s refer to the topic of this article by its proper name: Ports and Adapters :)&lt;/p&gt;

&lt;p&gt;Historically, Ports and Adapters was born in the context where Dependency Inversion Principle (DIP) was getting hot, back in the beginning of the 2000’s. DIP was getting more present on development day to day, and an example of a framework that was a pioneer in such topic is &lt;a href="https://github.com/google/guice/wiki/Guice10" rel="noopener noreferrer"&gt;Google Guice&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;We can say that one of the first attempts to define a standard for software organization was &lt;a href="https://www.oreilly.com/library/view/software-architecture-patterns/9781491971437/ch01.html" rel="noopener noreferrer"&gt;N-Layered architecture&lt;/a&gt;. The basic idea is: group related things together. In practical terms it usually ends up with three layers: user interface, business logic and data access. The great leap N-Layered Architecture gave us was the &lt;strong&gt;separation of UI and business logic&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxgpmcvzcus2r6v61efjc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxgpmcvzcus2r6v61efjc.png" alt="Image describing n-layered" width="800" height="322"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;It’s also important to point out that in 2003, two years before Alistair published his article, Eric Evans published his famous book &lt;strong&gt;Domain-Driven Design: Tackling Complexity in the Heart of Software&lt;/strong&gt;, introducing DDD to the world. A core concept of DDD, which could be explored in another article, is focusing on the &lt;strong&gt;business&lt;/strong&gt;, which we refer to using the concept of the &lt;strong&gt;domain&lt;/strong&gt;. Again, DDD is a huge topic and I won’t pretend to cover its details in a single paragraph, but the outcome it brings in terms of layers is the domain being the heart of the system, with a few layers around it: the presentation layer, in charge of interacting with the client of the system (a person, another system etc.); the application layer, which coordinates what needs to be done; and the infrastructure layer, which is in charge of DBs, notifications etc. The image below might help to illustrate the idea.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzg58ubnxkhpn736io8ep.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzg58ubnxkhpn736io8ep.png" alt="DDD idea" width="720" height="227"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;A practical description of the image above is as follows: the presentation uses the application, the application uses the domain, and the domain uses the infrastructure. In the past (around the early 2000s), the DIP wasn’t as obvious as it is today (2024); as a result, each layer used to reference the next one downstream. This means that, for instance, the domain layer could have references to database details (I’m not pointing this out as something good or bad).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The idea of Ports and Adapters was introduced to address this: it enforces DIP to isolate the domain from details not directly involved in the business&lt;/strong&gt;, like the application and infrastructure layers. This is done by &lt;strong&gt;reversing the dependency relation&lt;/strong&gt; of the three layers shown above: the domain defines a contract for how it needs things to work, and the peripheral layers take care of honoring this contract. This contract is named a &lt;strong&gt;Port&lt;/strong&gt;, and an implementation of a Port is named an &lt;strong&gt;Adapter&lt;/strong&gt;. Another important name to mention here: the “business part” is called the &lt;strong&gt;Core&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;That way, we mostly get rid of the application and infrastructure layers per se and instead rely on contracts defined by the core, which could be interfaces in the world of Object-Oriented Programming (OOP), and on adapters from the surrounding layers implementing those contracts, like classes in OOP. As you can imagine, some adapters trigger actions on or use the core; these are called &lt;strong&gt;primary adapters&lt;/strong&gt;. The adapters called or triggered by the core are called &lt;strong&gt;secondary adapters&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;To help you visualize it, rather than just showing an image, imagine an application that has a web API and an AMQP event listener that can dispatch business processes, and that, based on some requirements, must persist something in a DB and send an email to the user. In this hypothetical app, the core would need to provide four ports: one for the web API, one for the AMQP listener, one defining how to send emails and one defining how to interact with the DB. Assuming that the web API will use REST, the AMQP listener will be based on RabbitMQ, the emails will be sent using SendGrid and the DB will be MongoDB, we could have the following configuration:&lt;/p&gt;
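&lt;p&gt;To make that configuration concrete, here is a minimal sketch of how the core could declare those ports in Go. All names are illustrative, not taken from any real codebase: the REST and RabbitMQ adapters would call the primary port, while the SendGrid and MongoDB adapters would implement the secondary ones:&lt;/p&gt;

```go
package main

import "fmt"

// Primary port: what the web API and the AMQP listener may call on the core.
type SubscriptionService interface {
	Subscribe(email string) error
}

// Secondary ports: contracts the core defines for the outside world.
type Mailer interface {
	SendConfirmation(email string) error
}

type SubscriberRepository interface {
	Save(email string) error
}

// The core depends only on the ports, never on SendGrid or MongoDB directly.
type service struct {
	mailer Mailer
	repo   SubscriberRepository
}

func NewSubscriptionService(m Mailer, r SubscriberRepository) SubscriptionService {
	return service{mailer: m, repo: r}
}

func (s service) Subscribe(email string) error {
	if err := s.repo.Save(email); err != nil {
		return err
	}
	return s.mailer.SendConfirmation(email)
}

// In-memory secondary adapters standing in for SendGrid and MongoDB.
type logMailer struct{}

func (logMailer) SendConfirmation(email string) error {
	fmt.Println("confirmation sent to", email)
	return nil
}

type memoryRepo struct{}

func (memoryRepo) Save(email string) error {
	fmt.Println("saved subscriber", email)
	return nil
}

func main() {
	svc := NewSubscriptionService(logMailer{}, memoryRepo{})
	if err := svc.Subscribe("reader@example.com"); err != nil {
		fmt.Println("subscribe failed:", err)
	}
}
```

&lt;p&gt;Note the direction of the dependencies: the core only knows the interfaces it owns, so swapping SendGrid for another mailer means writing a new secondary adapter, with zero changes to the core.&lt;/p&gt;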

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpgqvxxz6nqxeswp8afah.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpgqvxxz6nqxeswp8afah.png" alt="Example of Ports and Adapters app" width="720" height="287"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As you can imagine, primary adapters usually take the form of REST controllers, event listeners (such as for Kafka or RabbitMQ), or Command Line Interfaces (CLIs). These are the components that trigger actions or use the core functionalities of the application.&lt;/p&gt;

&lt;p&gt;On the other hand, secondary adapters typically include well-known components like repositories, SMTP clients, and services implementing storage management, such as an AWS S3 client. These adapters are the ones called or triggered by the core, serving as bridges between the application’s core and external systems or data storage.&lt;/p&gt;

&lt;h2&gt;
  
  
  A Short But Effective Example
&lt;/h2&gt;

&lt;p&gt;Let's walk through a concrete example using Go.&lt;/p&gt;

&lt;p&gt;Consider a hypothetical and isolated application feature: the user must opt in to receive news, maybe from a news website, and once they opt in they must receive an email confirming it. Easy and minimalist as it is. Consider that the "service" that executes it, not compliant with Ports and Adapters yet, is the one below, and also that it is triggered by a POST request (let's abstract this part or the example will be too big):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;package&lt;/span&gt; &lt;span class="n"&gt;core&lt;/span&gt;

&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="s"&gt;"errors"&lt;/span&gt;
    &lt;span class="s"&gt;"fmt"&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;type&lt;/span&gt; &lt;span class="n"&gt;NewsSubscriber&lt;/span&gt; &lt;span class="k"&gt;struct&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;emailSender&lt;/span&gt; &lt;span class="n"&gt;sendgrid&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Mailer&lt;/span&gt;
    &lt;span class="n"&gt;newsDB&lt;/span&gt;      &lt;span class="n"&gt;db&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Connection&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="n"&gt;New&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;NewsSubscriber&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;NewsSubscriber&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;emailSender&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="n"&gt;sendgrid&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;New&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
        &lt;span class="n"&gt;newsDB&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;      &lt;span class="n"&gt;postgres&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;NewConnection&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ns&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;NewsSubscriber&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;Subscribe&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;u&lt;/span&gt; &lt;span class="n"&gt;User&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="kt"&gt;error&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;sqlQuery&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;fmt&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Sprintf&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;`
      UPDATE
        users 
      SET
        receive_news = 1
      WHERE
        id = %s`&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;u&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ID&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;ns&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;newsDB&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Execute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sqlQuery&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;errors&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;New&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"failed to subscribe user to updates"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="n"&gt;email&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;sendgrid&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Email&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;Subject&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="s"&gt;"Subscription successfully done!"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;Body&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;    &lt;span class="s"&gt;"Now you are subscribed to receive updates"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;To&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;      &lt;span class="n"&gt;u&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;email&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;ns&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;emailSender&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Send&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;email&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;errors&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;New&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"failed to send email to user"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The code above is kind of a crime, but take a deep breath and we will get over it :)&lt;/p&gt;

&lt;p&gt;In order to make it compliant with Ports and Adapters, we can start by pointing out what is not compliant. If we take a close look at the New() function, we see it is creating the instances of the email service and the Postgres connection client inline. So the first thing we can highlight is a strong coupling between this service and both the provider used to send emails and the Postgres database.&lt;/p&gt;

&lt;p&gt;Another point is that the &lt;em&gt;Subscribe&lt;/em&gt; method is directly building the SQL statement to mark the user's row as "I want to receive news" (and, by interpolating the ID with Sprintf instead of using a parameterized query, it is even open to SQL injection); once again, a feature of the system's core knows too much about an implementation detail.&lt;/p&gt;

&lt;p&gt;With the above points we can see how hard it is to unit test this feature. We also see that if, for instance, the company needs to replace SendGrid with Mailgun, the feature must be directly modified to achieve it. Now imagine that the database also needs to be changed to MongoDB: another gigantic refactoring would be needed. &lt;strong&gt;Again, this is just a hypothetical and drastic scenario to illustrate the idea; for sure such examples don't happen quite often in real life, keep calm :)&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;In order to make it compliant with Ports and Adapters we could introduce two Ports: EmailSender and NewsSubscriptionRegister. Both of them are interfaces describing &lt;strong&gt;what&lt;/strong&gt; our NewsSubscriber service needs; &lt;strong&gt;how&lt;/strong&gt; it is done does not matter. They are:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;type&lt;/span&gt; &lt;span class="n"&gt;EmailParams&lt;/span&gt; &lt;span class="k"&gt;struct&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;Subject&lt;/span&gt; &lt;span class="kt"&gt;string&lt;/span&gt;
    &lt;span class="n"&gt;Body&lt;/span&gt;    &lt;span class="kt"&gt;string&lt;/span&gt;
    &lt;span class="n"&gt;To&lt;/span&gt;      &lt;span class="kt"&gt;string&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c"&gt;// Port defining WHAT the email sender should do&lt;/span&gt;
&lt;span class="c"&gt;// and not HOW.&lt;/span&gt;
&lt;span class="k"&gt;type&lt;/span&gt; &lt;span class="n"&gt;EmailSender&lt;/span&gt; &lt;span class="k"&gt;interface&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;Send&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt; &lt;span class="n"&gt;EmailParams&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="kt"&gt;error&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="c"&gt;// Port defining WHAT the process of saving on&lt;/span&gt;
&lt;span class="c"&gt;// DB that user wants to receive news should do&lt;/span&gt;
&lt;span class="c"&gt;// and not HOW&lt;/span&gt;
&lt;span class="k"&gt;type&lt;/span&gt; &lt;span class="n"&gt;NewsSubscriptionRegister&lt;/span&gt; &lt;span class="k"&gt;interface&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;Register&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;u&lt;/span&gt; &lt;span class="n"&gt;User&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="kt"&gt;error&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We can now adjust the &lt;em&gt;NewsSubscriber&lt;/em&gt; to &lt;strong&gt;depend on such Ports:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;type&lt;/span&gt; &lt;span class="n"&gt;NewsSubscriber&lt;/span&gt; &lt;span class="k"&gt;struct&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;emailSender&lt;/span&gt;               &lt;span class="n"&gt;EmailSender&lt;/span&gt;
    &lt;span class="n"&gt;newsSubscriptionRegister&lt;/span&gt;  &lt;span class="n"&gt;NewsSubscriptionRegister&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;By doing so &lt;strong&gt;we applied the Dependency Inversion Principle (DIP)&lt;/strong&gt;: the service now depends on a contract that fits its needs, without any worries about how it works. Not directly related to it, but we could go even further, enhancing flexibility and modularity by applying the &lt;strong&gt;Inversion of Control (IoC)&lt;/strong&gt; principle in such a scenario through &lt;strong&gt;dependency injection&lt;/strong&gt; in the New constructor:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="n"&gt;New&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="n"&gt;emailSender&lt;/span&gt; &lt;span class="n"&gt;EmailSender&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;newsSubscriptionRegister&lt;/span&gt; &lt;span class="n"&gt;NewsSubscriptionRegister&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;NewsSubscriber&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;NewsSubscriber&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;emailSender&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;               &lt;span class="n"&gt;emailSender&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;newsSubscriptionRegister&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;  &lt;span class="n"&gt;newsSubscriptionRegister&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;With the above modification we removed the coupling that a Core feature had with the details of sending emails and database persistence, by introducing contracts (Ports) that meet our needs and provide a clear separation of concerns. Additionally, we are no longer in charge of instantiating the implementations of those contracts, the Adapters; instead, we delegate it to the client of the &lt;em&gt;NewsSubscriber&lt;/em&gt; service. The new &lt;em&gt;Subscribe&lt;/em&gt; implementation might look like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ns&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;NewsSubscriber&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;Subscribe&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;u&lt;/span&gt; &lt;span class="n"&gt;User&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="kt"&gt;error&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;ns&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;newsSubscriptionRegister&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Register&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;u&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="n"&gt;email&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;EmailParams&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;Subject&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="s"&gt;"Subscription successfully done!"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;Body&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;    &lt;span class="s"&gt;"Now you are subscribed to receive updates"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;To&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;      &lt;span class="n"&gt;u&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;email&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;ns&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;emailSender&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Send&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;email&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In this implementation, the &lt;em&gt;NewsSubscriber&lt;/em&gt; service is no longer directly responsible for the database registration. Instead, it delegates this responsibility to the &lt;em&gt;newsSubscriptionRegister&lt;/em&gt; instance, providing a cleaner and more modular design. The Register method, representing the database registration action, encapsulates the specific logic related to news subscription registration.&lt;/p&gt;

&lt;p&gt;As for the adapters' implementation, we could have:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;type&lt;/span&gt; &lt;span class="n"&gt;SendGridEmailSender&lt;/span&gt; &lt;span class="k"&gt;struct&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="n"&gt;sendgrid&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Mailer&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;SendGridEmailSender&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;Send&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt; &lt;span class="n"&gt;core&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;EmailParams&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="kt"&gt;error&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Send&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;fmt&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Errorf&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"failed to send email to user, got %v"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And also:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;type&lt;/span&gt; &lt;span class="n"&gt;PostgresNewsSubscriptionRegister&lt;/span&gt; &lt;span class="k"&gt;interface&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;connection&lt;/span&gt; &lt;span class="n"&gt;postgres&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Connection&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;p&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;PostgresNewsSubscriptionRegister&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;Register&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;u&lt;/span&gt; &lt;span class="n"&gt;core&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;User&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="kt"&gt;error&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;sqlQuery&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;fmt&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Sprintf&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;`
      UPDATE
        users 
      SET
        receive_news = 1
      WHERE
        id = %s`&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;u&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ID&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;connection&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Execute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sqlQuery&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;fmt&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Errorf&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"failed to subscribe user to updates, got %v"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
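&lt;p&gt;To see where the Adapters finally meet the Ports, it helps to look at the composition root: the place, usually close to &lt;em&gt;main&lt;/em&gt;, where the client wires concrete adapters into the core. The sketch below is self-contained and illustrative only: it re-declares minimal versions of the types shown above and uses stand-in adapters instead of the SendGrid/Postgres ones, since those APIs are hypothetical here:&lt;/p&gt;

```go
package main

import "fmt"

// Minimal re-declarations of the core types so this sketch stands alone.
type EmailParams struct{ Subject, Body, To string }

type User struct {
	ID    string
	Email string
}

type EmailSender interface{ Send(e EmailParams) error }
type NewsSubscriptionRegister interface{ Register(u User) error }

type NewsSubscriber struct {
	emailSender              EmailSender
	newsSubscriptionRegister NewsSubscriptionRegister
}

func New(es EmailSender, nr NewsSubscriptionRegister) *NewsSubscriber {
	return &NewsSubscriber{emailSender: es, newsSubscriptionRegister: nr}
}

func (ns *NewsSubscriber) Subscribe(u User) error {
	if err := ns.newsSubscriptionRegister.Register(u); err != nil {
		return err
	}
	return ns.emailSender.Send(EmailParams{
		Subject: "Subscription successfully done!",
		Body:    "Now you are subscribed to receive updates",
		To:      u.Email,
	})
}

// logEmailSender is a stand-in secondary adapter; in production this
// would be the SendGrid-backed implementation.
type logEmailSender struct{}

func (logEmailSender) Send(e EmailParams) error {
	fmt.Printf("sending %q to %s\n", e.Subject, e.To)
	return nil
}

// memoryRegister is a stand-in for the Postgres-backed adapter.
type memoryRegister struct{ subscribed map[string]bool }

func (m *memoryRegister) Register(u User) error {
	m.subscribed[u.ID] = true
	return nil
}

func main() {
	// The composition root decides WHICH adapters satisfy the ports.
	reg := &memoryRegister{subscribed: map[string]bool{}}
	service := New(logEmailSender{}, reg)
	if err := service.Subscribe(User{ID: "42", Email: "jane@example.com"}); err != nil {
		fmt.Println("subscribe failed:", err)
	}
}
```

&lt;p&gt;The key design point is that &lt;em&gt;main&lt;/em&gt; is the only place that knows which concrete adapters exist; the Core never imports them.&lt;/p&gt;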



&lt;h2&gt;
  
  
  Outcomes Of Using Ports And Adapters
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Note: I didn't use "benefits" or "drawbacks" as the title, and I'll elaborate on the reason soon.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Adopting Ports and Adapters brings considerable &lt;strong&gt;flexibility to replace components and providers&lt;/strong&gt;. As an example, if the service you use to send emails gets expensive and the CTO of your company makes a deal with a cheaper provider, the replacement should not be hard, as the main work would basically be, at least in theory, just implementing a new component respecting the Port contract.&lt;/p&gt;

&lt;p&gt;Another trait of code using Ports and Adapters is its &lt;strong&gt;high level of testability&lt;/strong&gt;: as the Core only deals with contracts, creating unit tests is a rather easy task using test doubles, like mocks.&lt;/p&gt;
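&lt;p&gt;To make the testability claim concrete, here is a minimal sketch of how the two Ports can be satisfied by hand-written fakes in a test. The type names are re-declared locally so the snippet stands alone; they mirror, but are not literally, the article's package layout:&lt;/p&gt;

```go
package main

import (
	"errors"
	"fmt"
)

// Minimal re-declarations of the core types so the sketch is self-contained.
type EmailParams struct{ Subject, Body, To string }

type User struct {
	ID    string
	Email string
}

type EmailSender interface{ Send(e EmailParams) error }
type NewsSubscriptionRegister interface{ Register(u User) error }

type NewsSubscriber struct {
	emailSender EmailSender
	register    NewsSubscriptionRegister
}

func (ns *NewsSubscriber) Subscribe(u User) error {
	if err := ns.register.Register(u); err != nil {
		return err
	}
	return ns.emailSender.Send(EmailParams{Subject: "Subscription successfully done!", To: u.Email})
}

// fakeSender records calls instead of talking to a real provider.
type fakeSender struct{ sent []EmailParams }

func (f *fakeSender) Send(e EmailParams) error { f.sent = append(f.sent, e); return nil }

// failingRegister simulates a DB error; okRegister simulates success.
type failingRegister struct{}

func (failingRegister) Register(u User) error { return errors.New("db down") }

type okRegister struct{}

func (okRegister) Register(u User) error { return nil }

func main() {
	// Happy path: registration succeeds, so exactly one email is sent.
	sender := &fakeSender{}
	ns := &NewsSubscriber{emailSender: sender, register: okRegister{}}
	if err := ns.Subscribe(User{ID: "1", Email: "a@b.c"}); err != nil || len(sender.sent) != 1 {
		panic("expected one email on success")
	}

	// Failure path: when the register port fails, no email is sent.
	sender = &fakeSender{}
	ns = &NewsSubscriber{emailSender: sender, register: failingRegister{}}
	if err := ns.Subscribe(User{ID: "1", Email: "a@b.c"}); err == nil || len(sender.sent) != 0 {
		panic("expected error and no email on DB failure")
	}
	fmt.Println("all checks passed")
}
```

&lt;p&gt;No database, no SMTP server, no network: the fakes implement the Ports and the Core is exercised in isolation.&lt;/p&gt;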

&lt;p&gt;Probably the most widely mentioned trait is the &lt;strong&gt;complete isolation from frameworks and libraries&lt;/strong&gt;. This can be achieved by integrating the framework or library through adapters, ensuring that the Core remains completely agnostic to such details.&lt;/p&gt;

&lt;p&gt;By using Ports and Adapters you also get a &lt;strong&gt;reduced risk of vendor lock-in&lt;/strong&gt;, as the first example above has shown: if a provider is no longer suitable, the Core of the system will not be highly coupled to it, so a replacement should not be complex.&lt;/p&gt;

&lt;p&gt;Something really worth pointing out is the &lt;strong&gt;potential misuse and overhead of layers&lt;/strong&gt;. The adoption of Ports and Adapters introduces the risk of developers getting carried away and creating unnecessary layers, using some of the above points as the argument. Personally, this is something I have seen in the majority of systems I worked on using it, and a clear result of such layer misuse was the Pull Request size for a simple modification, like adding a new method to a Port, spanning multiple files and directories.&lt;/p&gt;

&lt;p&gt;And last, but not least, two things that must be mentioned are the &lt;strong&gt;learning curve and the initial development time&lt;/strong&gt;. Trust me, do not neglect these topics, especially if the team that will be working with the pattern is not familiar with it. It's important to give the team time to get used to it so they can gain traction.&lt;/p&gt;

&lt;p&gt;I called the above points outcomes rather than "benefits and drawbacks" because &lt;strong&gt;they can be labeled as benefits or drawbacks only with context&lt;/strong&gt;. Is having the core of the business 100% decoupled from the framework a must for your business? If so, then go for it. The point is: &lt;strong&gt;don't adopt Ports and Adapters blindly just because other teams or companies are using it&lt;/strong&gt;. Look at your needs, be straight about them, and check whether the intrinsic effort of Ports and Adapters is worth it compared to its outcomes for the project. I have worked on projects where Ports and Adapters was a nice fit (and AFAIK the system is still in use after years), but I have also faced situations where a simple transaction script could do the job but was instead done using a pile of layers, in which the effective code, if placed in a single file, would not have more than 180 lines :)&lt;/p&gt;

&lt;p&gt;Ports and Adapters is a valuable technique, not a strict doctrine. It is up to us to apply our skills, experience, and wisdom to make informed decisions on whether to embrace this architectural pattern based on our project's specific needs.&lt;/p&gt;

&lt;p&gt;If you like this topic, I have also an article comparing Hexagonal Architecture with other famous software design buzz words, like Clean Architecture, you can check it &lt;a href="https://www.buarki.com/blog/onion-cleanarch-hexagonal" rel="noopener noreferrer"&gt;here&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>cleancode</category>
      <category>programming</category>
      <category>architecture</category>
      <category>go</category>
    </item>
    <item>
      <title>Building A File Compressor Using C And Wasm</title>
      <dc:creator>Aurelio Buarque</dc:creator>
      <pubDate>Wed, 21 Feb 2024 01:45:16 +0000</pubDate>
      <link>https://dev.to/buarki/building-a-file-compressor-using-c-and-wasm-289d</link>
      <guid>https://dev.to/buarki/building-a-file-compressor-using-c-and-wasm-289d</guid>
      <description>&lt;h2&gt;
  
  
  Checking Wasm Tools For C
&lt;/h2&gt;

&lt;p&gt;In a &lt;a href="https://www.buarki.com/blog/webassembly" rel="noopener noreferrer"&gt;previous article&lt;/a&gt; I did a small experiment with Go and Wasm to check the state of the tools available for it in 2024. During that endeavour I found a tool called &lt;a href="https://emscripten.org/docs/getting_started/Tutorial.html" rel="noopener noreferrer"&gt;Emscripten&lt;/a&gt; that drew my attention and made me decide to do another experiment, now using C.&lt;/p&gt;

&lt;p&gt;Emscripten is an open source compiler toolchain to WebAssembly. As its docs say, practically any portable C or C++ codebase can be compiled into Wasm using it.&lt;/p&gt;

&lt;p&gt;With the tool defined, the next step was choosing a problem to approach.&lt;/p&gt;

&lt;h2&gt;
  
  
  Problem To Solve: Compress Files Using Huffman Algorithm
&lt;/h2&gt;

&lt;p&gt;I must be honest, C is a boo of mine ❤! Due to that, I really wanted to build something real that would make me do some bit fiddling, and I decided to implement a very interesting compression algorithm called Huffman Coding.&lt;/p&gt;

&lt;p&gt;Huffman Coding was &lt;a href="https://en.wikipedia.org/wiki/Huffman_coding#:~:text=In%20computer%20science%20and%20information,while%20he%20was%20a%20Sc." rel="noopener noreferrer"&gt;created by David A. Huffman back in 1952&lt;/a&gt;. It is a lossless compression method, which means that all bits present in the original file will be carried over to the decompressed one. The basic idea of the algorithm is to collect all the symbols present in a file along with their frequencies to create a binary tree in which all symbols are leaves, and the path from the tree root to a leaf is used to represent that symbol. I know that at first glance it looks like rocket science, but trust me, it is far from that, and here we will see it in baby steps :)&lt;/p&gt;

&lt;p&gt;This project was something really nice to do and I hope you enjoy following along.&lt;/p&gt;

&lt;h2&gt;
  
  
  Non-Goals Of This Project And Its Limitations
&lt;/h2&gt;

&lt;p&gt;This project was not intended to be a production-ready file compressor like WinRAR. It also was not intended to support files bigger than 150MB, to avoid memory issues in the user's browser. For sure it can be adjusted, but I guess this size is fine for a hobby project :)&lt;/p&gt;

&lt;h2&gt;
  
  
  How Can You Use The Program?
&lt;/h2&gt;

&lt;p&gt;To use the program, zipper as I named it, &lt;a href="https://zipper-zeta.vercel.app/" rel="noopener noreferrer"&gt;just access it&lt;/a&gt;, and in case you want to check the details, the full code is &lt;a href="https://github.com/buarki/zipper" rel="noopener noreferrer"&gt;available on my GitHub&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Explaining Huffman Coding In Baby Steps
&lt;/h2&gt;

&lt;p&gt;There’s no better way to understand something than a practical example, so let’s apply the algorithm to the following text: &lt;em&gt;coding is fun and fun is coding&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;First step: collect all the symbols along with their frequencies&lt;/strong&gt;. By doing so we get:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0w6dc86igxqxifluql0v.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0w6dc86igxqxifluql0v.png" alt="All symbols collected" width="237" height="726"&gt;&lt;/a&gt;&lt;/p&gt;
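&lt;p&gt;This first step maps directly to a frequency count. Just as an illustration (the actual project is written in C; this is a quick Go sketch):&lt;/p&gt;

```go
package main

import "fmt"

// countFrequencies collects each symbol of the input along with its frequency.
func countFrequencies(text string) map[rune]int {
	freq := make(map[rune]int)
	for _, r := range text {
		freq[r]++
	}
	return freq
}

func main() {
	freq := countFrequencies("coding is fun and fun is coding")
	// 'n' appears 5 times, 'i' 4 times, and the space 6 times.
	fmt.Println(freq['n'], freq['i'], freq[' '])
}
```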

&lt;p&gt;&lt;strong&gt;Second step: sort the collected symbols by their frequencies&lt;/strong&gt;. The output, in ascending order from top to bottom, is:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fn0xyn0u5wkg0by3bysew.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fn0xyn0u5wkg0by3bysew.png" alt="Symbols sorted" width="282" height="869"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Third step: create a binary tree from this sorted list. Do so by removing the two least frequent symbols and creating a tree node with them, in which the less frequent symbol goes to the left and the more frequent to the right. The “frequency” of this new node is the sum of its subtrees' frequencies. Put this newly created node back into the list and repeat this process until the list has only one item&lt;/strong&gt;. Keep calm! It seems more complex than Portuguese grammar, but trust me, it is not. Let’s just execute this algorithm and you’ll agree with me.&lt;/p&gt;

&lt;p&gt;It says, “remove the two least frequent symbols”. The two least frequent symbols are &lt;em&gt;a&lt;/em&gt; and &lt;em&gt;c&lt;/em&gt;, so let’s remove them.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj11xw0oo8mk1kgidofx3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj11xw0oo8mk1kgidofx3.png" alt="First two nodes" width="334" height="464"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Then it says to “create a tree node where the least frequent symbol goes to the left and the most frequent to the right, with the frequency of this new node being the sum of the subtrees”. To have a way to distinguish the regular file symbols from this “joining” symbol, let’s adopt @ as the joining symbol. The result of this process is:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fw45mcwk42v1ml2uny5on.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fw45mcwk42v1ml2uny5on.png" alt="Building the tree" width="384" height="322"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Once the new node is built we must put it back into the list. &lt;strong&gt;As the list is sorted, we should keep it sorted while adding the node&lt;/strong&gt;. The list items, arranged horizontally to help you visualize them, now look like this:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjzt42oj8t5hh0xk92n01.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjzt42oj8t5hh0xk92n01.png" alt="Building the tree" width="720" height="99"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Now we just repeat the process until only one item remains in the list. The next two items to process are f and g. The new tree node created from them is this one:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Falo794muxakomb8caiqi.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Falo794muxakomb8caiqi.png" alt="Building the tree" width="384" height="323"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Adding it to the list we get:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqkh8li9eeidpvyxv76xi.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqkh8li9eeidpvyxv76xi.png" alt="Building the tree" width="720" height="103"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The next two items to process are o and s. The new tree node created from them is this one:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftpxuisprq2q4y2bddaq5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftpxuisprq2q4y2bddaq5.png" alt="Building the tree" width="384" height="300"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Adding it to the list we get:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fv86oxk9e83fq1kgd46jv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fv86oxk9e83fq1kgd46jv.png" alt="Building the tree" width="720" height="112"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The next two items to process are u and the joining symbol @:3. The new tree node created from them is this one:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8pfb9pxqdklnj7moozfa.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8pfb9pxqdklnj7moozfa.png" alt="Building the tree" width="384" height="341"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Adding it to the list we get:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fh8hb3goyzgr1s4b6yh07.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fh8hb3goyzgr1s4b6yh07.png" alt="Building the tree" width="720" height="191"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The next two items to process are d and a joining symbol @:4. The new tree node created from them is this one:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyqnb2o45rmfdr79njb5l.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyqnb2o45rmfdr79njb5l.png" alt="Building the tree" width="384" height="357"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Adding it to the list we get:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fz92fi73o6zwc7l2z4bs8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fz92fi73o6zwc7l2z4bs8.png" alt="Building the tree" width="720" height="213"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The next two items to process are the joining symbol @:4 and i:4. The new tree node created from them is this one:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fktfsvnqt0fe3nsn8vail.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fktfsvnqt0fe3nsn8vail.png" alt="Building the tree" width="384" height="314"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Adding it to the list we get:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fr7lpl1jsd3av2eflb8m1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fr7lpl1jsd3av2eflb8m1.png" alt="Building the tree" width="720" height="192"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The next two items to process are @:5 and n:5. The new tree node created from them is this one:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frm96xdv91q3hnwd5g2gb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frm96xdv91q3hnwd5g2gb.png" alt="Building the tree" width="384" height="445"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Adding it to the list we get:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F128otsammcvpj29zzka2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F128otsammcvpj29zzka2.png" alt="Building the tree" width="720" height="265"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The next two items to process are &lt;em&gt;space&lt;/em&gt;:6 and @:7. The new tree node created from them is this one:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9d6l4ryvz4xzj8ags8ha.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9d6l4ryvz4xzj8ags8ha.png" alt="Building the tree" width="384" height="318"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Adding it to the list we get:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy2fi2ks710nq97a1m20g.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy2fi2ks710nq97a1m20g.png" alt="Building the tree" width="720" height="256"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The next two items to process are @:8 and @:10. The new tree node created from them is this one:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fahv2jv3t0qtwsanxfc2b.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fahv2jv3t0qtwsanxfc2b.png" alt="Building the tree" width="384" height="290"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Adding it to the list we get:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhwj425645zx1thsn74ps.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhwj425645zx1thsn74ps.png" alt="Building the tree" width="720" height="318"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The last two items to process are @:13 and @:18. &lt;strong&gt;The new tree node created is the complete Huffman tree:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu3xeovjwkxn0pq76bqoy.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu3xeovjwkxn0pq76bqoy.png" alt="Building the tree" width="720" height="430"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Fourth step: find each symbol’s code by walking the tree from the root to the corresponding leaf, using 0 to mark a move to the left and 1 a move to the right&lt;/strong&gt;. Adding the 0s and 1s we get this:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fh6qt5dw0eroaja29egzb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fh6qt5dw0eroaja29egzb.png" alt="Codes" width="720" height="420"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Thus, the corresponding codes are:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9irgvkrbwtgbsy74m9tm.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9irgvkrbwtgbsy74m9tm.png" alt="Table codes" width="263" height="740"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;It’s worth noticing that &lt;strong&gt;the higher the frequency of a symbol, the shorter its corresponding code is&lt;/strong&gt;. Just compare the code lengths of &lt;strong&gt;space&lt;/strong&gt; and a.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Fifth step: replace each symbol with its corresponding code&lt;/strong&gt;. If we rewrite the words using the codes we found, we get:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgxied75ukn98tw74fogu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgxied75ukn98tw74fogu.png" alt="Table codes" width="472" height="634"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Thus, the full text &lt;em&gt;coding is fun and fun is coding&lt;/em&gt; encodes to: &lt;em&gt;11011 0110 010 101 111 1001 00 101 0111 00 1000 1100 111 00 11010 111 010 00 1000 1100 111 00 101 0111 00 11011 0110 010 101 111 1001.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;As &lt;a href="https://www.ascii-code.com/faq#:~:text=ASCII%20was%20originally,character%20encoding%20standard." rel="noopener noreferrer"&gt;ASCII characters use 1 byte each&lt;/a&gt;, the original text &lt;em&gt;coding is fun and fun is coding&lt;/em&gt; requires 31 (length of text) x 1 byte = 31 bytes to be stored. On the other hand, the compressed version requires only 12 full bytes plus 7 bits of one extra byte, so 13 bytes in total. To visualize it, just group the 0s and 1s in chunks of 8 (because 1 byte is a group of 8 bits): &lt;em&gt;11011011 00101011 11100100 10101110 01000110 01110011 01011101 00010001 10011100 10101110 01101101 10010101 1111001.&lt;/em&gt;&lt;/p&gt;
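&lt;p&gt;The byte count above is just an integer ceiling division of the bit length by 8. A minimal sketch (&lt;em&gt;bytes_for_bits&lt;/em&gt; is a hypothetical helper name, not part of the project’s API):&lt;/p&gt;

```c
/* Bytes needed to store a bit stream: integer ceiling of bits / 8.
   Hypothetical helper for illustration only. */
unsigned int bytes_for_bits(unsigned int bits) {
  return (bits + 7) / 8;
}
```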

&lt;p&gt;With simple math we can see that &lt;strong&gt;the compressed content is about 58% smaller than the original one!&lt;/strong&gt; This simple example shows how powerful Huffman Coding is.&lt;/p&gt;

&lt;p&gt;Now that you have become a Huffman Coding master, we can proceed to the next step :)&lt;/p&gt;

&lt;h2&gt;
  
  
  The (Funny) C Part Of Compression
&lt;/h2&gt;

&lt;p&gt;In order to elaborate the details of the compression part we must define the &lt;strong&gt;Compression API&lt;/strong&gt; written in C that will be used on the JavaScript layer through Wasm. &lt;a href="https://github.com/buarki/zipper/blob/master/huffman/compress.h" rel="noopener noreferrer"&gt;The compression header file can be seen here&lt;/a&gt;, and as we can see it expects the full file content to be passed along with its size in bytes. Such an API imposes some limitations, like the size of the file we can handle, because it would not be feasible to load a 2GB file into an unsigned char buffer, but that’s totally ok for this project as &lt;strong&gt;it is not intended to be a production-ready compressor like WinRAR ;)&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;That said, we can go on with the compression details. As shown above, we start by &lt;strong&gt;collecting the symbols&lt;/strong&gt; with their frequencies, and the &lt;a href="https://github.com/buarki/zipper/blob/master/huffman/collect_bytes_frequency.c" rel="noopener noreferrer"&gt;implementation of it can be found here&lt;/a&gt;. The idea is to create an array with 256 buckets, each one representing the frequency of the corresponding ASCII symbol in the file. Thus, the frequency of the symbol ‘a’ would be stored at index 97. It’s worth mentioning that it uses &lt;strong&gt;calloc&lt;/strong&gt; to create such an array, to ensure that all buckets are zeroed once created, &lt;strong&gt;otherwise we could have some “trash” in those buckets, which could lead to bugs during this stage.&lt;/strong&gt;&lt;/p&gt;
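&lt;p&gt;A minimal sketch of this frequency-collection idea (the name &lt;em&gt;collect_frequencies&lt;/em&gt; is hypothetical; the repository’s actual implementation is linked above):&lt;/p&gt;

```c
#include <stdlib.h>

#define SYMBOLS 256

/* Sketch of frequency collection: one bucket per possible byte value.
   calloc zeroes the buckets, so absent symbols have frequency 0.
   Hypothetical name, for illustration only. */
unsigned int *collect_frequencies(const unsigned char *content, size_t size) {
  unsigned int *freq = calloc(SYMBOLS, sizeof(unsigned int));
  if (freq == NULL) {
    return NULL;
  }
  for (size_t i = 0; i < size; i++) {
    freq[content[i]]++;  /* e.g. 'a' (ASCII 97) is counted at index 97 */
  }
  return freq;
}
```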

&lt;p&gt;&lt;strong&gt;With the frequencies of the symbols collected, we can now sort them&lt;/strong&gt;. To do so, I decided to use a &lt;strong&gt;Min Heap data structure&lt;/strong&gt; as it makes it easy to find the lowest value present inside of it. As we are always getting the two lowest values, using such a data structure seems a good design decision. I implemented the Min Heap used on this project, and you can find the &lt;a href="https://github.com/buarki/zipper/blob/master/huffman/min_heap.h" rel="noopener noreferrer"&gt;Abstract Data Type (ADT) of it here.&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Creating the Min Heap from the array with symbols frequency is a straightforward process and the implementation &lt;a href="https://github.com/buarki/zipper/blob/master/huffman/build_min_heap_from_bytes_frequency.c" rel="noopener noreferrer"&gt;can be seen here&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Once we have the Min Heap ready, we can now &lt;strong&gt;build the Huffman tree&lt;/strong&gt;. You can find the &lt;a href="https://github.com/buarki/zipper/blob/master/huffman/build_huffman_tree_from_min_heap.c" rel="noopener noreferrer"&gt;implementation of such process here&lt;/a&gt; and also the &lt;a href="https://github.com/buarki/zipper/blob/master/huffman/huffman_tree.h" rel="noopener noreferrer"&gt;Huffman Tree ADT&lt;/a&gt;.&lt;/p&gt;
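&lt;p&gt;To illustrate the construction loop, here is a hedged sketch: the repository pops the two lowest-frequency nodes from the min heap, while this standalone version scans a plain array for the two minimums instead (same result, simpler to read). &lt;em&gt;Node&lt;/em&gt; and &lt;em&gt;build_tree&lt;/em&gt; are hypothetical names:&lt;/p&gt;

```c
#include <stdlib.h>

/* Sketch of the Huffman tree construction. The least frequent node goes
   to the left, the second least to the right, and the joined node carries
   the sum of their frequencies. Hypothetical names. */
typedef struct Node {
  unsigned char symbol;     /* '@' marks a joining node */
  unsigned int frequency;
  struct Node *left, *right;
} Node;

Node *new_node(unsigned char symbol, unsigned int frequency,
               Node *left, Node *right) {
  Node *n = malloc(sizeof(Node));
  n->symbol = symbol;
  n->frequency = frequency;
  n->left = left;
  n->right = right;
  return n;
}

/* Index of the lowest-frequency node, ignoring index `skip`. */
static int min_index(Node **nodes, int count, int skip) {
  int best = -1;
  for (int i = 0; i < count; i++) {
    if (i != skip && (best == -1 || nodes[i]->frequency < nodes[best]->frequency)) {
      best = i;
    }
  }
  return best;
}

Node *build_tree(Node **nodes, int count) {
  while (count > 1) {
    int a = min_index(nodes, count, -1);  /* least frequent: goes left */
    int b = min_index(nodes, count, a);   /* second least: goes right */
    Node *joined = new_node('@', nodes[a]->frequency + nodes[b]->frequency,
                            nodes[a], nodes[b]);
    /* remove the two picked nodes and append the joined one */
    if (a > b) { int t = a; a = b; b = t; }
    nodes[a] = joined;
    nodes[b] = nodes[count - 1];
    count--;
  }
  return nodes[0];
}
```

&lt;p&gt;Ties between equal frequencies may be broken differently than the repository’s heap does, so the exact tree shape can differ, but the code lengths stay optimal either way.&lt;/p&gt;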

&lt;p&gt;From the Huffman Coding execution we did above, the last missing step is &lt;strong&gt;building the Huffman table with the codes&lt;/strong&gt;. You can find the &lt;a href="https://github.com/buarki/zipper/blob/master/huffman/build_symbol_codes_from_tree.c" rel="noopener noreferrer"&gt;implementation of it here&lt;/a&gt;. The table itself is a 256x256 2D array of type unsigned char, where &lt;strong&gt;table[i]&lt;/strong&gt; holds the code of the symbol with ASCII value i. For instance, if the symbol &lt;strong&gt;‘a’&lt;/strong&gt; is present and its code turns out to be &lt;strong&gt;1101&lt;/strong&gt;, then &lt;strong&gt;table[97]&lt;/strong&gt; will carry the string &lt;strong&gt;‘1’, ‘1’, ‘0’, ‘1’, ‘\0’&lt;/strong&gt;.&lt;/p&gt;
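&lt;p&gt;The walk that fills the table can be sketched like this: append ‘0’ on a left turn and ‘1’ on a right turn, and record the accumulated path when a leaf is reached. Names here are hypothetical, not the repository’s:&lt;/p&gt;

```c
#include <string.h>

#define MAX_CODE_LENGTH 256

/* Minimal tree node for this sketch. */
typedef struct Node {
  unsigned char symbol;
  struct Node *left, *right;
} Node;

/* Fills table[symbol] with the '0'/'1' path from root to that symbol's leaf. */
void build_codes(const Node *node, char table[256][MAX_CODE_LENGTH],
                 char *path, int depth) {
  if (node == NULL) {
    return;
  }
  if (node->left == NULL && node->right == NULL) {  /* leaf: a real symbol */
    path[depth] = '\0';
    strcpy(table[node->symbol], path);
    return;
  }
  path[depth] = '0';                       /* a move to the left is 0... */
  build_codes(node->left, table, path, depth + 1);
  path[depth] = '1';                       /* ...and to the right is 1 */
  build_codes(node->right, table, path, depth + 1);
}
```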

&lt;p&gt;The above is all we need to implement Huffman Coding, but that’s not all yet: in order to decompress the compressed file, the sequence of 0s and 1s, we need the tree. &lt;strong&gt;Thus, to give the compressed file to someone expecting to decompress it, the tree must be sent together with it, and that introduces a nice problem to solve.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The Architecture Of The Compressed File
&lt;/h2&gt;

&lt;p&gt;Let’s assume that, with the tree and the codes table on hand, we get the 0s and 1s of the compressed file. Now we want to give the compressed content to a friend along with the tree so they know how to decompress the file. &lt;strong&gt;How should we pass the tree? With the symbols traversed in some order? Maybe postorder? And should we pass the tree content before or after the 1s and 0s?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;There’s no right or wrong answer to the questions above, but the option picked must be very well implemented. For this project, &lt;strong&gt;I decided to build the “file that we share” with the following structure: the contents of the tree traversed in preorder and, right after it, the 0s and 1s&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Traversing a tree in preorder means, in simple terms, that you collect the symbol of a node, then visit the left subtree and then the right subtree. A concrete example of it, using the tree we created for the text &lt;em&gt;coding is fun and fun is coding&lt;/em&gt;, is: &lt;em&gt;@@_space_@d@os@@@fgi@@u@acn&lt;/em&gt;.&lt;/p&gt;
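&lt;p&gt;A preorder serialization can be sketched as below (hypothetical names; escaping of special symbols is left out of this sketch):&lt;/p&gt;

```c
#include <stddef.h>

/* Minimal tree node for this sketch. */
typedef struct Node {
  unsigned char symbol;
  struct Node *left, *right;
} Node;

/* Emits the current node's symbol, then the left subtree, then the right
   subtree. Returns the next free offset in `out`. Hypothetical name. */
size_t preorder(const Node *node, unsigned char *out, size_t at) {
  if (node == NULL) {
    return at;
  }
  out[at++] = node->symbol;              /* visit the node first... */
  at = preorder(node->left, out, at);    /* ...then the whole left subtree... */
  return preorder(node->right, out, at); /* ...then the right subtree */
}
```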

&lt;p&gt;That said, the “file we share” with the compressed content would be: &lt;em&gt;@@_space_@d@os@@@fgi@@u@acn 11011011 00101011 11100100 10101110 01000110 01110011 01011101 00010001 10011100 10101110 01101101 10010101 1111001&lt;/em&gt;. Ok, now we have defined the basic layout of the file to be sent.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;But how would the one who received the compressed file know where the content of the tree ends?&lt;/strong&gt; Different files have different trees and consequently different series of 0s and 1s, so &lt;strong&gt;we need a strategy to inform the receiver where the tree content and the 0s and 1s start and end&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;For the receiver to know how to properly find the tree and the compressed content we need to add a header to the sent file&lt;/strong&gt;. You can think about the header as some special bytes at the beginning of the file that provide some information about it. For this project, it must inform how to properly find the content of the tree and the compressed content separately. &lt;strong&gt;A header must be well planned because it inevitably increases the final size of the shared file&lt;/strong&gt;. Let’s see how we built the header for this project by addressing the tree content size first.&lt;/p&gt;

&lt;p&gt;Using the above example of a file that could be shared, the one attempting to decompress it must be able to properly find &lt;em&gt;@@_space_@d@os@@@fgi@@u@acn&lt;/em&gt; as the content of the tree. In this example we see that the tree content uses 27 bytes. &lt;strong&gt;It’s important to point out that by tree size I mean the number of nodes in the tree, not its height&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Counting the number of nodes in the tree is a rather easy task, and the &lt;a href="https://github.com/buarki/zipper/blob/master/huffman/huffman_tree.c#L35" rel="noopener noreferrer"&gt;implementation can be seen here&lt;/a&gt;. As the Huffman tree is a binary tree, if a tree uses all 256 ASCII symbols, we can say it would have (2 * 256) - 1 = 511 nodes. But for this project this equation requires one small adjustment. If you remember the execution we did above, you’ll notice that for this project we used the symbol @ as the “joining symbol”. The problem is that a file can also have this symbol in its content, so we need a way to distinguish between the @ of the joining symbol and the @ of the file per se.&lt;/p&gt;

&lt;p&gt;The strategy adopted here was using the symbol \ for “escaping”. But the same problem applies to \ itself, so \ must escape not only @ but also \. The implication is that, if @ or \ are present, we require 2 additional bytes, or “fake nodes”, to accommodate them. Thus, for the worst case scenario, &lt;strong&gt;the maximum number of nodes in the tree would be ((2 * 256) - 1) + 2 = 513 nodes&lt;/strong&gt;. Such a number in binary is 1000000001, which means it requires 1 byte and 2 additional bits to be stored.&lt;/p&gt;
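&lt;p&gt;The escaping rule can be sketched as follows: a literal @ or \ coming from the file is prefixed with \ so the decompressor can tell it apart from the joining symbol (&lt;em&gt;write_symbol&lt;/em&gt; is a hypothetical helper, not the repository’s code):&lt;/p&gt;

```c
#include <stddef.h>

/* Writes `symbol` into `out` at offset `at`, prefixing the special symbols
   '@' and '\' with an escape byte. Returns the new offset. */
size_t write_symbol(unsigned char symbol, unsigned char *out, size_t at) {
  if (symbol == '@' || symbol == '\\') {
    out[at++] = '\\';  /* escape byte: "the next symbol is literal" */
  }
  out[at++] = symbol;
  return at;
}
```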

&lt;p&gt;So far, we know that we need two additional bytes in the “compressed file” to represent the size of the tree content (two because 10 bits is more than 8 bits, one byte, so we need one extra full byte to accommodate it). But, even if we put the tree size information in the header, there is still one more problem to be addressed: &lt;strong&gt;how do we know exactly how many bits the compressed part uses?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;If we look again at the content of the compressed file, 11011011 00101011 11100100 10101110 01000110 01110011 01011101 00010001 10011100 10101110 01101101 10010101 1111001, we can see it uses 12 complete bytes plus one extra byte that uses only 7 bits. Thus, even if we know where the content of the compressed file starts, we must know exactly how many bits to look for, especially in the last byte, to skip the &lt;strong&gt;padding bits.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Padding bits are nothing but bits used to “complete” the byte. They exist because a byte is a sequence of 8 bits, even if we use fewer. Check the image below to visualize it:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2tgm15t8c0e92p3oo9fq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2tgm15t8c0e92p3oo9fq.png" alt="Padding example" width="720" height="341"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The good part is that calculating how many padding bits we have is easy. To do so, we first compute the “length” of the compressed content, in other words, the length of the 0s and 1s. This is done by summing all the products (symbol frequency * respective code length) and you can see the &lt;a href="https://github.com/buarki/zipper/blob/master/huffman/compute_compressed_file_symbols_length.c" rel="noopener noreferrer"&gt;implementation here&lt;/a&gt;. With the length of the compressed content on hand, we take its remainder modulo 8 to know how many bits of the last byte are used, and then we check how many bits are missing to complete 8 bits, in other words: 8 - (compressed content length % 8). You can see the &lt;a href="https://github.com/buarki/zipper/blob/master/huffman/compute_padding_bits_for_compressed_codes.c" rel="noopener noreferrer"&gt;implementation here&lt;/a&gt;. The text &lt;em&gt;coding is fun and fun is coding&lt;/em&gt; is 103 bits long, so 8 - (103 % 8) = 1 padding bit :)&lt;/p&gt;
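&lt;p&gt;Both computations can be sketched like this (hypothetical names; the extra modulo at the end covers the fully-used last byte, where no padding is needed):&lt;/p&gt;

```c
/* Total bit length of the compressed content: sum over every symbol of
   (frequency * code length). Hypothetical names, for illustration only. */
unsigned int compressed_bit_length(const unsigned int frequencies[256],
                                   const unsigned int code_lengths[256]) {
  unsigned int bits = 0;
  for (int i = 0; i < 256; i++) {
    bits += frequencies[i] * code_lengths[i];
  }
  return bits;
}

/* Padding bits in the last byte. The final %8 turns the "8 padding bits"
   result of a fully used last byte into 0. */
unsigned int padding_bits(unsigned int bit_length) {
  return (8 - (bit_length % 8)) % 8;
}
```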

&lt;p&gt;Unless the last byte is fully used, the maximum number of padding bits is 7, which can be represented using 3 bits: 111.&lt;/p&gt;

&lt;p&gt;Thus, we have a puzzle to solve: there are two pieces of information to store in the header, the tree content size, which requires at most 10 bits, and the number of padding bits, which requires at most 3 bits, resulting in 13 bits for the header and, consequently, 2 bytes. Due to the importance of the header in this project I decided to define an &lt;a href="https://github.com/buarki/zipper/blob/master/huffman/header.h" rel="noopener noreferrer"&gt;ADT for it to highlight its API&lt;/a&gt;. The solution adopted for the header puzzle was: &lt;strong&gt;the first 3 bits of the header’s first byte store the padding bits, and the last 2 bits of the first byte plus the entire second byte store the tree content size&lt;/strong&gt;. For the text &lt;em&gt;coding is fun and fun is coding&lt;/em&gt; the padding is 1, thus 001 in binary, and the tree size is 27, or 11011. The image below illustrates how the header bits are used:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fahzw1c3qsmu5uqligquh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fahzw1c3qsmu5uqligquh.png" alt="Header" width="720" height="384"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Building the header like above is pure bit fiddling, and is done like this: we create a byte where the 3 bits of padding are placed at the most significant positions by doing a left shift of 5 bits. As the padding for the text example is 1 (1 in 8 bits is 00000001), 1 &amp;lt;&amp;lt; 5 gives us 00100000. Then, we collect the most significant byte of the tree content size. In our example, the size is 27, or 0000000000011011 in binary using 2 bytes. To collect the most significant byte we do a right shift of 8 bits, 0000000000011011 &amp;gt;&amp;gt; 8, and we get 00000000. We also need to collect the least significant bits of the tree size; we do it by executing a bitwise AND of the tree size with a fully set byte: 0000000000011011 &amp;amp; 11111111 gives us 00011011. The last step is creating a 2-byte buffer to carry the header, where the first byte receives the byte with the 3 bits of padding at the most significant positions bitwise OR the most significant bits of the tree size, in our case 00100000 | 00000000 = 00100000. And the second byte just receives the least significant bits of the tree size, which is 00011011. The final result is 0010000000011011. You can check the &lt;a href="https://github.com/buarki/zipper/blob/master/huffman/header.c#L56" rel="noopener noreferrer"&gt;implementation of this step here&lt;/a&gt;.&lt;/p&gt;
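&lt;p&gt;The packing and its inverse can be sketched as below, following the bit layout described above (&lt;em&gt;pack_header&lt;/em&gt; and &lt;em&gt;unpack_header&lt;/em&gt; are hypothetical names, not the repository’s API):&lt;/p&gt;

```c
/* Header layout sketch: byte 0 = [ppp..tt], byte 1 = [tttttttt], where
   p = padding bits (3 bits) and t = the 10-bit tree content size. */
void pack_header(unsigned int padding, unsigned int tree_size,
                 unsigned char header[2]) {
  header[0] = (unsigned char)((padding << 5) | (tree_size >> 8));
  header[1] = (unsigned char)(tree_size & 0xFF);
}

void unpack_header(const unsigned char header[2],
                   unsigned int *padding, unsigned int *tree_size) {
  *padding = header[0] >> 5;                                  /* top 3 bits */
  *tree_size = ((unsigned int)(header[0] & 0x03) << 8) | header[1];
}
```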

&lt;p&gt;And we finally have the three parts of the file:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Faus1qtykcarctx7w4vd1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Faus1qtykcarctx7w4vd1.png" alt="All content" width="720" height="85"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The full compression implementation &lt;a href="https://github.com/buarki/zipper/blob/master/huffman/compress.c" rel="noopener noreferrer"&gt;can be seen here&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Decompressing And Reconstructing The Original File
&lt;/h2&gt;

&lt;p&gt;I must say that decompression is way easier than compression, and it also requires an &lt;a href="https://github.com/buarki/zipper/blob/master/huffman/decompress.h" rel="noopener noreferrer"&gt;API&lt;/a&gt;, similar to the compression one. The process kicks off by reading the metadata we added during compression (padding bits and tree size) to know where to look for the content of the tree and the compressed content. You can check the &lt;a href="https://github.com/buarki/zipper/blob/master/huffman/header.c#L43" rel="noopener noreferrer"&gt;implementation here&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Once we know how many bytes are used to store the tree content we can properly locate the beginning of the tree and of the compressed file. If the bytes of the “shared file” are in the buffer &lt;em&gt;compressedContent&lt;/em&gt;, then the tree content is placed right after the header, so it can be found at &lt;em&gt;&amp;amp;compressedContent[HEADER_REQUIRED_BYTES]&lt;/em&gt;, while the compressed content comes right after the tree, at &lt;em&gt;&amp;amp;compressedContent[HEADER_REQUIRED_BYTES + treeContentSize]&lt;/em&gt;. It’s worth pointing out that by doing so we use no extra memory, just pointers to the proper places.&lt;/p&gt;
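&lt;p&gt;To make the layout concrete, here is a small JavaScript sketch of reading the header back and pointing at the two regions. The byte values and the helper name are illustrative, not the project’s actual code:&lt;/p&gt;

```javascript
// Reading the layout back: the 2 header bytes give the padding and the
// tree size, and the tree size tells us where each region starts.
// subarray() creates views over the same memory (the JS analogue of
// taking the address of compressedContent[offset] in C), so no extra
// buffers are allocated. The byte values below are made up.
const HEADER_REQUIRED_BYTES = 2;

function parseHeader(content) {
  const padding = content[0] >> 5;                          // top 3 bits
  const treeSize = ((content[0] & 0x1f) << 8) | content[1]; // remaining 13 bits
  return { padding, treeSize };
}

const shared = Uint8Array.from([0b01000000, 3, 10, 20, 30, 40, 50]);
const { padding, treeSize } = parseHeader(shared); // padding = 2, treeSize = 3
const tree = shared.subarray(HEADER_REQUIRED_BYTES, HEADER_REQUIRED_BYTES + treeSize);
const payload = shared.subarray(HEADER_REQUIRED_BYTES + treeSize);
```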

&lt;p&gt;Knowing where to look for things, we start rebuilding the tree from its content, which was attached to the file in preorder. &lt;a href="https://github.com/buarki/zipper/blob/master/huffman/build_huffman_tree_from_compressed_file.c" rel="noopener noreferrer"&gt;The implementation can be seen here&lt;/a&gt;, and it’s worth noticing that it handles the special symbols we have in this project, such as the “@”.&lt;/p&gt;

&lt;p&gt;With the tree rebuilt, we must calculate how many bytes we need for the decompressed (original) file. This is needed because I decided not to include the original file size in the header, as it’d increase the total “shareable” file size. The &lt;a href="https://github.com/buarki/zipper/blob/master/huffman/compute_bytes_required_for_decompressed_file.c" rel="noopener noreferrer"&gt;implementation was a little bit tricky as you can see&lt;/a&gt;, but it seems to work :)&lt;/p&gt;

&lt;p&gt;Last, but certainly not least, we effectively fill the buffer of the “decompressed” file, in other words, we rebuild the original file that was compressed. You can check the &lt;a href="https://github.com/buarki/zipper/blob/master/huffman/build_decompressed_file.c" rel="noopener noreferrer"&gt;implementation here&lt;/a&gt;. And we are almost done with the C part.&lt;/p&gt;

&lt;h2&gt;
  
  
  Building The Wasm Binary
&lt;/h2&gt;

&lt;p&gt;To create a binary able to run the compression and decompression, Emscripten was used. This was done by providing a &lt;a href="https://github.com/buarki/zipper/blob/master/huffman/main.c" rel="noopener noreferrer"&gt;main.c file&lt;/a&gt; exposing the compression and decompression API. To build it we can use the emcc tool, as implemented in the &lt;a href="https://github.com/buarki/zipper/blob/master/Makefile#L3" rel="noopener noreferrer"&gt;project Makefile&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Something very important that must be mentioned is that &lt;strong&gt;Wasm only understands numbers and pointers&lt;/strong&gt;. Due to that, I broke the compression and decompression in the main.c file into two steps: one to execute the operation and return the number of bytes required by the created compressed or decompressed file, and another one to just collect the compressed or decompressed file content.&lt;/p&gt;

&lt;p&gt;To understand it, let’s check the compression flow from JavaScript to C. The &lt;a href="https://github.com/buarki/zipper/blob/master/public/app.js#L24" rel="noopener noreferrer"&gt;compress() function (in the app.js file)&lt;/a&gt; places the content of the uploaded file into a Uint8Array whose length equals the number of bytes the uploaded file has. It then passes that array alongside its size to the c_compress function via the built-in &lt;strong&gt;ccall function&lt;/strong&gt;, getting as output the number of bytes that the compressed file (more precisely, the shareable file, with header, tree, and 0s and 1s) requires. It then uses this number to allocate a buffer of that size in the heap. As a final step, the created buffer, together with the number of bytes required for the compressed file, is passed to the function &lt;em&gt;receiveCompressedContent&lt;/em&gt;, also via &lt;em&gt;ccall&lt;/em&gt;, where each byte of the compressed file is copied into it. Once the content is copied, we do some &lt;strong&gt;pointer math&lt;/strong&gt; to extract it on the JavaScript layer.&lt;/p&gt;
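&lt;p&gt;The two-step flow can be followed in this runnable sketch. The real code calls into the Wasm binary through Emscripten’s ccall; here a stubbed Module object stands in for the runtime, and the function bodies are illustrative stand-ins, not the real Huffman code (only the names c_compress and receiveCompressedContent mirror the article’s API):&lt;/p&gt;

```javascript
// Stubbed Emscripten-like Module: a plain Uint8Array plays the Wasm heap,
// and ccall dispatches to stand-in implementations of the two C steps.
const Module = {
  HEAPU8: new Uint8Array(1024), // stand-in for the Wasm heap
  nextFree: 0,
  lastResult: null,
  _malloc(n) { const p = this.nextFree; this.nextFree += n; return p; },
  _free(ptr) { /* no-op in this stub */ },
  ccall(name, returnType, argTypes, args) {
    if (name === "c_compress") {
      const [bytes, size] = args;
      // stand-in "compression": just reverse the bytes and remember them
      this.lastResult = Uint8Array.from(bytes.slice(0, size)).reverse();
      return this.lastResult.length; // step 1: only a number crosses the boundary
    }
    if (name === "receiveCompressedContent") {
      const [ptr, size] = args;
      this.HEAPU8.set(this.lastResult.subarray(0, size), ptr); // step 2: fill the buffer
      return 0;
    }
  },
};

function compress(fileBytes) {
  // step 1: run the operation, getting back only the output size
  const size = Module.ccall("c_compress", "number", ["array", "number"],
    [fileBytes, fileBytes.length]);
  // step 2: allocate a heap buffer and ask the C side to copy the bytes into it
  const ptr = Module._malloc(size);
  Module.ccall("receiveCompressedContent", "number", ["number", "number"], [ptr, size]);
  const out = Module.HEAPU8.slice(ptr, ptr + size); // copy the bytes out of the heap
  Module._free(ptr); // release the C-side buffer
  return out;
}
```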

&lt;p&gt;As mentioned, the content of the compressed file is copied to a buffer allocated in the heap. &lt;a href="https://emscripten.org/docs/api_reference/preamble.js.html?highlight=heapu8#id8" rel="noopener noreferrer"&gt;Emscripten allows us to access the heap as a typed array through some objects&lt;/a&gt;; as we are dealing with chunks of 8 bits here, we should use &lt;strong&gt;HEAPU8&lt;/strong&gt; (unsigned 8 bits). This object provides access to the heap through the attribute &lt;strong&gt;buffer&lt;/strong&gt;. Thus, if we have access to the heap, we know where the sequence of bytes of the compressed file is (the address of the created buffer), and we know the size of that sequence (how many bytes the compressed file requires), we can extract those bytes by creating a “slice” of the heap from &lt;em&gt;the buffer address&lt;/em&gt; spanning &lt;em&gt;the compressed size in bytes&lt;/em&gt;. If you didn’t get it, keep calm and check the example below.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6532279lur2axdg8gxmi.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6532279lur2axdg8gxmi.png" alt="Heap example" width="720" height="277"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Consider the array &lt;em&gt;unsigned char buffer[] = {‘a’, ‘b’, ‘c’}&lt;/em&gt;. The content of this array is placed in 3 sequential bytes. If you know C a little, you know that buffer is just a pointer to some address. Assume for instance that such an address is 99. Then we can say that buffer points to 99. As it is a contiguous block of memory, we can access the neighboring byte by doing &lt;em&gt;buffer + 1&lt;/em&gt;, which would give us (in this example, for sure) 100. And &lt;em&gt;buffer + 2&lt;/em&gt; would give us 101. Thus, if we collect the content stored at addresses 99, 100 and 101 we properly collect ‘a’, ‘b’ and ‘c’. Actually, in C notation, &lt;em&gt;*buffer&lt;/em&gt;, or &lt;em&gt;*(buffer + 0)&lt;/em&gt;, is equal to ‘a’, &lt;em&gt;*(buffer + 1)&lt;/em&gt; is equal to ‘b’, and &lt;em&gt;*(buffer + 2)&lt;/em&gt; is equal to ‘c’. And in case you didn’t know, yes, &lt;em&gt;buffer[2]&lt;/em&gt; is just syntactic sugar for &lt;em&gt;*(buffer + 2)&lt;/em&gt; :)&lt;/p&gt;
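&lt;p&gt;Translated to the typed-array world Emscripten exposes, an address is just an index into HEAPU8. A tiny stand-alone sketch, using a plain Uint8Array in place of the real heap:&lt;/p&gt;

```javascript
// The pointer math above, expressed as typed-array indexing.
const HEAPU8 = new Uint8Array(128);   // stand-in for Emscripten's heap
const bufferPtr = 99;                 // pretend the C buffer lives at address 99
HEAPU8.set([97, 98, 99], bufferPtr);  // the bytes of 'a', 'b', 'c'

// *(buffer + i) in C corresponds to HEAPU8[bufferPtr + i] here:
const second = String.fromCharCode(HEAPU8[bufferPtr + 1]); // 'b'

// and the "slice of the heap" used to extract the compressed file:
const bytes = HEAPU8.slice(bufferPtr, bufferPtr + 3);
```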

&lt;p&gt;Once we have collected all the bytes of the compressed file into a JavaScript variable/object, we &lt;strong&gt;must deallocate the pointers used to interact with the C binary&lt;/strong&gt; (the one storing the number of bytes needed and the buffer storing the file bytes). This is done by simply passing them to the free() function we exported from C &lt;a href="https://github.com/buarki/zipper/blob/master/Makefile#L6" rel="noopener noreferrer"&gt;during the compilation&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;With the pointers deallocated, the JavaScript promise is resolved, returning an object carrying the size of the original file, the size of the compressed file and the bytes of the compressed file that will be downloaded to the user’s device. The same process goes for decompression.&lt;/p&gt;

&lt;h2&gt;
  
  
  Takeaways And My Conclusion
&lt;/h2&gt;

&lt;p&gt;First of all, I just loved it all!&lt;/p&gt;

&lt;p&gt;I found the Emscripten tool and documentation nice; actually, I found them more practical to “check and use” than the resources I’ve found for Go.&lt;/p&gt;

&lt;p&gt;Something worth pointing out is that sharing content from the C layer with JavaScript using the heap might be a bottleneck depending on the size of the object you are passing. In this project, for instance, to handle such “processing times” I added a spinner during compression and decompression. It doesn’t seem to be a dealbreaker though, because depending on the problem we can use different approaches, like processing the heavy tasks using service workers.&lt;/p&gt;

&lt;p&gt;You are more than welcome to try it out, check code and suggest some improvements. Hope you enjoyed it :)&lt;/p&gt;

&lt;p&gt;Originally published at &lt;a href="https://www.buarki.com/blog/wasm-huffman" rel="noopener noreferrer"&gt;https://www.buarki.com/blog/wasm-huffman&lt;/a&gt;&lt;/p&gt;

</description>
      <category>c</category>
      <category>javascript</category>
      <category>webassembly</category>
      <category>vercel</category>
    </item>
    <item>
      <title>WebAssembly in 2024: Finding Prime Numbers With Go And Wasm</title>
      <dc:creator>Aurelio Buarque</dc:creator>
      <pubDate>Fri, 16 Feb 2024 06:58:09 +0000</pubDate>
      <link>https://dev.to/buarki/webassembly-in-2024-finding-prime-numbers-with-go-and-wasm-f7d</link>
      <guid>https://dev.to/buarki/webassembly-in-2024-finding-prime-numbers-with-go-and-wasm-f7d</guid>
      <description>&lt;h2&gt;
  
  
  What Is WebAssembly?
&lt;/h2&gt;

&lt;p&gt;It’s been a while since the last time I did something with WebAssembly, somewhere back in 2022. If this word is new to you, don’t worry. In simple terms, it is a binary instruction format designed as a portable compilation target for high-level programming languages. It allows code written in languages like &lt;strong&gt;C&lt;/strong&gt;, &lt;strong&gt;C++&lt;/strong&gt;, &lt;strong&gt;Rust&lt;/strong&gt; and &lt;strong&gt;Go&lt;/strong&gt; to be executed in web browsers at near-native speed. WebAssembly is designed to be a low-level, efficient, and secure virtual machine that can be embedded in web pages, providing a platform-independent execution environment for web applications.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Play With It Now?
&lt;/h2&gt;

&lt;p&gt;From time to time I see people claiming that WebAssembly is close to becoming a normal tool in our day-to-day work. With insomnia as my unlikely companion and nothing better to do, I decided to check the state of the tooling specifically for Go, and I intend to show my findings in this one :)&lt;/p&gt;

&lt;h2&gt;
  
  
  A Goal For this Endeavour
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;The goal of this project was building a simple app using Go and Wasm to implement the &lt;a href="https://en.wikipedia.org/wiki/Sieve_of_Eratosthenes" rel="noopener noreferrer"&gt;Sieve of Eratosthenes Algorithm&lt;/a&gt;&lt;/strong&gt;. This algorithm is very useful for finding prime numbers up to a given limit. For instance, if you ask for all the prime numbers up to 12 you’ll have 2, 3, 5, 7 and 11.&lt;/p&gt;

&lt;p&gt;The algorithm’s idea is quite simple:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;we create a list of numbers with the size of the given limit + 1 (just to be able to represent the numbers using indexes);&lt;/li&gt;
&lt;li&gt;we assume that all numbers in such a list are truly prime numbers;&lt;/li&gt;
&lt;li&gt;as we know that 0 and 1 are not prime numbers we mark them as false;&lt;/li&gt;
&lt;li&gt;then, we iterate from 2 up to the square root of the given limit; if the current number is still marked as prime, we mark all of its multiples as false;&lt;/li&gt;
&lt;li&gt;and then we repeat this process.&lt;/li&gt;
&lt;/ul&gt;
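&lt;p&gt;The steps above can be sketched in plain JavaScript (the project’s real implementation is the Go function linked below; this version is only for illustration):&lt;/p&gt;

```javascript
// Sieve of Eratosthenes: find all primes up to the given limit.
function findPrimes(limit) {
  const isPrime = new Array(limit + 1).fill(true); // index i represents number i
  isPrime[0] = false; // 0 is not prime
  isPrime[1] = false; // 1 is not prime
  for (let i = 2; i * i <= limit; i++) {
    if (isPrime[i]) {
      // i is prime, so every multiple of i is not
      for (let j = i * i; j <= limit; j += i) isPrime[j] = false;
    }
  }
  return isPrime.flatMap((prime, n) => (prime ? [n] : []));
}

const primes = findPrimes(12); // → [2, 3, 5, 7, 11]
```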

&lt;p&gt;I chose this objective because it involves not only transferring data from the “JavaScript part” to the “Go part” but also the reverse process.&lt;/p&gt;

&lt;p&gt;The basic flow is: the user provides a number representing the end of the interval to search for prime numbers. This number is then passed as a parameter to a &lt;a href="https://github.com/buarki/primes-finder/blob/master/cmd/wasm/find_primes.go" rel="noopener noreferrer"&gt;Go function&lt;/a&gt; responsible for identifying prime numbers within the specified range. Subsequently, the discovered prime numbers are returned to the UI layer and displayed as a list.&lt;/p&gt;

&lt;h2&gt;
  
  
  Binding Go and JavaScript
&lt;/h2&gt;

&lt;p&gt;As mentioned earlier, this project has two main parts; let’s call them the Go part and the JavaScript part. The Go part is a simple program that makes a function implementing the Sieve of Eratosthenes available to be called by JavaScript when running in a WebAssembly environment. The app’s entry point is the one below:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjpqh8celkuadrnpg1v9f.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjpqh8celkuadrnpg1v9f.png" alt="Go entry point" width="553" height="525"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The line &lt;em&gt;js.Global().Set("findPrimes", findPrimes())&lt;/em&gt; serves as the bridge between Go and JavaScript, allowing the findPrimes function to be called from JavaScript. Additionally, the line &lt;em&gt;&amp;lt;-make(chan bool)&lt;/em&gt; simply blocks on an empty channel to prevent the program from exiting immediately. This is typical in WebAssembly programs, where the Go runtime does not automatically wait for asynchronous tasks.&lt;/p&gt;

&lt;p&gt;To build the go part we can run:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;GOARCH=wasm GOOS=js go build -o public/main.wasm cmd/wasm/*.go&lt;br&gt;
&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;This sets the target architecture for the Go compiler to WebAssembly with &lt;strong&gt;GOARCH=wasm&lt;/strong&gt;, while &lt;strong&gt;GOOS=js&lt;/strong&gt; sets the target operating system to JavaScript. The created file main.wasm will be placed in the public directory.&lt;/p&gt;

&lt;p&gt;And another important file is the &lt;strong&gt;wasm_exec.js&lt;/strong&gt;, already provided by Go, thus we just need to copy it. It can be done by running:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;cp "$(go env GOROOT)/misc/wasm/wasm_exec.js" public/wasm_exec.js&lt;br&gt;
&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;With files wasm_exec.js and main.wasm available we can load them on the JavaScript client to call the function &lt;em&gt;&lt;strong&gt;findPrimes&lt;/strong&gt;&lt;/em&gt;. And that’s all for the Go part :)&lt;/p&gt;

&lt;p&gt;You can see the &lt;a href="https://github.com/buarki/primes-finder/blob/master/public/index.html" rel="noopener noreferrer"&gt;full JavaScript part in the project repository&lt;/a&gt;, but the important chunk is the one below:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1iekex447ge01y2tn5tk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1iekex447ge01y2tn5tk.png" alt="JavaScript wasm code." width="681" height="487"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Besides the input validation, it fetches the WebAssembly binary (main.wasm), instantiates it, and passes the import object from the Go instance; this is done by &lt;em&gt;&lt;strong&gt;WebAssembly.instantiateStreaming(fetch(wasmFile), go.importObject).then((result) =&amp;gt; {});&lt;/strong&gt;&lt;/em&gt;. Then the chunk &lt;em&gt;&lt;strong&gt;go.run(result.instance);&lt;/strong&gt;&lt;/em&gt; executes the Go program inside the WebAssembly instance, making the Go function findPrimes callable.&lt;/p&gt;

&lt;h2&gt;
  
  
  Deploying To Vercel
&lt;/h2&gt;

&lt;p&gt;Vercel now &lt;a href="https://vercel.com/docs/functions/runtimes/go" rel="noopener noreferrer"&gt;supports Go functions&lt;/a&gt;, and due to the easy integration with Github projects I decided to deploy it there; you can see the &lt;a href="https://primes-finder.vercel.app/" rel="noopener noreferrer"&gt;project running live here&lt;/a&gt;. In case you want to check the code and run it locally, you can find the &lt;a href="https://github.com/buarki/primes-finder" rel="noopener noreferrer"&gt;project on my Github&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  My Conclusion So Far
&lt;/h2&gt;

&lt;p&gt;As far as I could see, there are indeed good learning resources out there, &lt;a href="https://github.com/golang/go/wiki/WebAssembly#getting-started" rel="noopener noreferrer"&gt;like this one&lt;/a&gt;, &lt;strong&gt;but they require you to dig deep in order to mine what you need&lt;/strong&gt;. For sure this is true for any dev tool, but, for instance, one can easily find plenty of tutorials on “&lt;em&gt;how to build a project XYZ using NestJS, Spring Boot, Echo, etc.&lt;/em&gt;”; the same is not yet true for WebAssembly.&lt;/p&gt;

&lt;p&gt;But I really enjoyed it, and I won’t stop with this one. I’ll probably experiment with it using C, due to a tool I found during this project called &lt;a href="https://emscripten.org/docs/getting_started/Tutorial.html" rel="noopener noreferrer"&gt;Emscripten&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;This article was originally posted on my personal site: &lt;a href="https://www.buarki.com/blog/webassembly" rel="noopener noreferrer"&gt;https://www.buarki.com/blog/webassembly&lt;/a&gt;&lt;/p&gt;

</description>
      <category>go</category>
      <category>webassembly</category>
      <category>webdev</category>
    </item>
    <item>
      <title>MLOps in practice: building and deploying a machine learning app</title>
      <dc:creator>Aurelio Buarque</dc:creator>
      <pubDate>Thu, 11 Jan 2024 23:16:49 +0000</pubDate>
      <link>https://dev.to/buarki/mlops-in-practice-building-and-deploying-a-machine-learning-app-2n04</link>
      <guid>https://dev.to/buarki/mlops-in-practice-building-and-deploying-a-machine-learning-app-2n04</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzuxa8y366n2ppncegujj.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzuxa8y366n2ppncegujj.gif" alt="A screen of the app running" width="864" height="864"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Foreword
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;This one aims to provide a friendly introduction to Machine Learning Ops (MLOps) in practice by describing how a simple app able to perform simple mathematical operations using images of single-digit numbers was made&lt;/strong&gt;. It is in no way intended to replace a deeper study of the topic; it is just a simple hands-on example.&lt;/p&gt;

&lt;h2&gt;
  
  
  Context
&lt;/h2&gt;

&lt;p&gt;Talking to some colleagues and friends lately while gathering ideas for a nice Machine Learning project to build, I’ve seen that there’s a gap of knowledge in terms of &lt;strong&gt;how exactly does one use a trained Machine Learning model?&lt;/strong&gt; Just imagine yourself building a model to solve some problem: you are probably using &lt;a href="https://jupyter.org/" rel="noopener noreferrer"&gt;Jupyter Notebook&lt;/a&gt; to perform some data cleanup, some normalization and further tests. Then you finally achieve an acceptable accuracy and decide that the model is ready. How will that model end up being used by some API or worker to perform an inference that will be used elsewhere in the company you work for, or by any system?&lt;/p&gt;

&lt;p&gt;Such questions are addressed by the ones involved with &lt;strong&gt;Machine Learning Ops (MLOps)&lt;/strong&gt;, a series of practices and policies that take care of making a trained model available to be used, pretty similar to how we export a software lib, a Docker image tag etc.&lt;/p&gt;

&lt;p&gt;To provide a concrete, simple, and effective example, this article will go through the process of planning an app that requires Machine Learning. We’ll pass through planning and building the needed model, saving the model, checking how it can be exported, and how it can be used in an application.&lt;/p&gt;

&lt;p&gt;All this process was applied on an open project called &lt;strong&gt;snapmath&lt;/strong&gt;, and the very source code is available on &lt;a href="https://github.com/buarki/snapmath" rel="noopener noreferrer"&gt;my Github&lt;/a&gt; and you can also use it as it is deployed &lt;a href="https://snapmath-azure.vercel.app/" rel="noopener noreferrer"&gt;here&lt;/a&gt;. And in case you would like to know the whole design and planning process check the &lt;a href="https://github.com/buarki/snapmath/blob/master/design-doc.md" rel="noopener noreferrer"&gt;project design document&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Problem to be solved
&lt;/h2&gt;

&lt;p&gt;We want to create a simple calculator for single-digit numbers. The app will receive the two numbers to be used as &lt;strong&gt;images&lt;/strong&gt;, besides the operation to be performed: +, -, / and *. An overview of how the app was planned to look at the beginning can be seen in the images below (and please don’t envy my prototyping skills):&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqyrjf6up0pknj48f34q4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqyrjf6up0pknj48f34q4.png" alt="Snapmath before inputing images" width="800" height="435"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;And once the user selects the operation to run and inputs the two images:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6lmhfdm1belmu8cy3dnn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6lmhfdm1belmu8cy3dnn.png" alt="Snapmath after inputing images" width="800" height="434"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  How was the problem solved?
&lt;/h2&gt;

&lt;p&gt;With the problem stated, the main questions that may arise are: how can one properly build the model? Is there a fancy tool available to do so, or should it be done in an &lt;em&gt;ad hoc&lt;/em&gt; way? Once the model is trained, how is it exported? As JSON?&lt;/p&gt;

&lt;p&gt;The tool used to build the model &lt;em&gt;per se&lt;/em&gt; was &lt;a href="https://www.tensorflow.org/" rel="noopener noreferrer"&gt;TensorFlow&lt;/a&gt;, a very powerful, end-to-end open source platform for machine learning with a rich ecosystem of tools. And to create the needed script using TensorFlow, &lt;a href="https://jupyter.org/" rel="noopener noreferrer"&gt;Jupyter Notebook&lt;/a&gt; was used, a web-based interactive computing platform.&lt;/p&gt;

&lt;p&gt;In machine learning projects, the most important part is &lt;strong&gt;data&lt;/strong&gt;. To build a model to solve such a problem we need a dataset that provides images of digits along with the corresponding digit. A nice and available one is &lt;a href="https://www.kaggle.com/datasets/hojjatk/mnist-dataset" rel="noopener noreferrer"&gt;MNIST&lt;/a&gt;. This dataset provides images of 28 pixels width x 28 pixels height in grayscale, along with their corresponding digits. Such data is provided in the form of a CSV file where each line has 785 columns: 784 to represent the pixels (28 x 28) and 1 to represent the expected digit.&lt;/p&gt;
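&lt;p&gt;As a concrete note on the format: each 28 x 28 image flattens to 784 pixel values, and in the common CSV exports the label is the first column of each row. A hypothetical parser for one row (the helper name is mine, not the project’s):&lt;/p&gt;

```javascript
// Splitting one MNIST-style CSV row into its label and its pixel values.
function parseMnistRow(line) {
  const values = line.split(",").map(Number);
  return {
    label: values[0],        // the expected digit
    pixels: values.slice(1), // 784 grayscale values in the range 0-255
  };
}

const row = parseMnistRow("5,0,128,255"); // a shortened row, for illustration
```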

&lt;p&gt;With the model built and trained, the next step was loading it into a React app to run inference on a given image. To load the model we needed to “translate” the TensorFlow model trained in Jupyter Notebook into a form that &lt;a href="https://www.tensorflow.org/js" rel="noopener noreferrer"&gt;TensorFlow.js&lt;/a&gt; is able to understand. To do so we can use &lt;a href="https://huningxin.github.io/tfjs-converter/" rel="noopener noreferrer"&gt;tfjs-converter&lt;/a&gt;, an open source library to load a pre-trained TensorFlow SavedModel, Frozen Model or Session Bundle into the browser and run inference through TensorFlow.js.&lt;/p&gt;

&lt;p&gt;It sounds like a lot of things to do, but calm down, let’s highlight the goals, non-goals and limitations before proceeding :)&lt;/p&gt;

&lt;h2&gt;
  
  
  Goals of such project
&lt;/h2&gt;

&lt;p&gt;The goals of this project were straightforward: &lt;strong&gt;(1)&lt;/strong&gt; understand the process of building a model from scratch using TensorFlow; &lt;strong&gt;(2)&lt;/strong&gt; get acquainted with Jupyter Notebook, as it is a widely used tool in the industry; &lt;strong&gt;(3)&lt;/strong&gt; get acquainted with strategies to deploy Machine Learning models once trained, such as TensorFlow Serving.&lt;/p&gt;

&lt;h2&gt;
  
  
  Non-goals of such project
&lt;/h2&gt;

&lt;p&gt;With this project we do not intend to create a production-ready model with high accuracy. The main focus here is having a model with reasonable accuracy and understanding the process from creation to deployment of a Machine Learning model.&lt;/p&gt;

&lt;h2&gt;
  
  
  The limitations of this project
&lt;/h2&gt;

&lt;p&gt;As the MNIST dataset was used to train our model, the images fed into the model must be 28 pixels wide by 28 pixels high and also be in grayscale. It’s reasonable to think that the majority of people won’t have an image with such traits, thus we’ll need to preprocess it before feeding it into the model. Due to that preprocessing, the image quality might be degraded, leading to a higher chance of errors. Thus, it’s recommended to use small images only.&lt;/p&gt;

&lt;h2&gt;
  
  
  Building the model with Convolutional Neural Networks
&lt;/h2&gt;

&lt;p&gt;Due to the fact that numbers can be drawn in different ways, one relevant aspect when planning the Neural Network is &lt;strong&gt;translation invariance&lt;/strong&gt;, because the app may be fed with plenty of different forms of the number 1, and thus different kinds of variation, like size, perspective and lighting. The images below give an example of this.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5yjlaqfouwhoscqvw2ac.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5yjlaqfouwhoscqvw2ac.png" alt="Different images that represent same number" width="742" height="268"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;To effectively extract features that enable the identification of a “1” regardless of these variations, we employ convolutional neural networks (CNNs). &lt;strong&gt;The convolutional layers in CNNs are adept at scanning local regions of the input images, enabling the network to learn hierarchical features. This is particularly valuable for capturing essential patterns despite variations in appearance.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;And to further make this feature extraction consistent, &lt;strong&gt;we can combine the convolutions with pooling as it helps in creating a more abstract representation of the input, making the model less sensitive to the exact spatial location of features.&lt;/strong&gt; This means that even if the position, size, or lighting conditions of a drawn “1” differ, the CNN’s feature extraction mechanism remains consistent, facilitating accurate identification.&lt;/p&gt;
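&lt;p&gt;As a toy illustration of the pooling idea (not code from the project), a 2x2 max-pool keeps only the strongest response in each window, so a feature detected a pixel away still yields the same pooled output:&lt;/p&gt;

```javascript
// A toy 2x2 max-pooling pass over a small "image" given as rows x columns
// (dimensions assumed even): each output cell keeps the strongest value
// in its window.
function maxPool2x2(img) {
  const out = [];
  for (let r = 0; r < img.length; r += 2) {
    const row = [];
    for (let c = 0; c < img[0].length; c += 2) {
      row.push(Math.max(img[r][c], img[r][c + 1], img[r + 1][c], img[r + 1][c + 1]));
    }
    out.push(row);
  }
  return out;
}

const pooled = maxPool2x2([
  [1, 2, 3, 0],
  [4, 5, 0, 1],
  [0, 1, 2, 2],
  [1, 0, 3, 1],
]); // → [[5, 3], [1, 3]]
```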

&lt;p&gt;Thus, the first part of the Neural Network will be in charge of detecting and collecting the features from the images and placing them on a vector. Such vector then will be fed into a regular fully connected Neural Network to perform the learning.&lt;/p&gt;

&lt;p&gt;Once the features that make a “1” a “1” are found, we can pass them into a fully connected Neural Network and perform the training, adjusting the weights until it starts giving accurate predictions of which number the image corresponds to.&lt;/p&gt;

&lt;p&gt;The full process of building such model in baby steps can be seen on &lt;a href="https://github.com/buarki/snapmath/blob/master/machine-learning/snapmath.ipynb" rel="noopener noreferrer"&gt;this Jupyter Notebook file available on my Github&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Exporting the model
&lt;/h2&gt;

&lt;p&gt;Once the model is properly trained, we can export it in a variety of formats. So far, I’ve only used the &lt;strong&gt;tf_saved_model&lt;/strong&gt; format, and it seems to be the recommended way for now. Once we export it, TensorFlow creates a directory similar to the one below:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8d6l79i7ukyktbak1tlo.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8d6l79i7ukyktbak1tlo.png" alt="Print of directory with an exported model" width="186" height="152"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;It’s also worth pointing out that the model should be named using a timestamp for better version control. In the above case the name is &lt;strong&gt;1703825980&lt;/strong&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Translating the model to TensorFlow.js
&lt;/h2&gt;

&lt;p&gt;As said before, in order to use the built model with TensorFlow.js we must first “translate” it into a compatible format. To do so we used &lt;strong&gt;tfjs-converter&lt;/strong&gt;. By converting the model from the tf_saved_model format we get a directory with the following structure:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuh33tzyh7qaar6wkhdwe.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuh33tzyh7qaar6wkhdwe.png" alt="Print of directory with translated model" width="202" height="75"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Loading the model
&lt;/h2&gt;

&lt;p&gt;TensorFlow.js has a method &lt;strong&gt;loadGraphModel&lt;/strong&gt; which accepts a URL serving the model to be loaded. As this app is built on top of NextJS, the “translated” model was placed under the &lt;strong&gt;public&lt;/strong&gt; directory, and due to that the model import is done as in the image below:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkuzpxqpdusjt0q4t0jls.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkuzpxqpdusjt0q4t0jls.png" alt="Exported model being loaded on client side." width="570" height="167"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;That way, once the app is rendered in the browser the model is loaded and becomes available to run inference on a given image.&lt;/p&gt;

&lt;h2&gt;
  
  
  Preprocessing a given image before running inference
&lt;/h2&gt;

&lt;p&gt;The model was trained on 28x28 grayscale images. Most images fed into the model will probably differ from that: they could be in color, have a shape of 173x100, and so on.&lt;/p&gt;

&lt;p&gt;Moreover, the MNIST dataset has a peculiar way of representing images: the “effective image area” representing the digit itself has values in the range 0–255, while the “empty areas” are 0. A concrete example of the number 3 can be seen below:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftpb6vhhlb7cxi6aqj3lu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftpb6vhhlb7cxi6aqj3lu.png" alt="Number 3 image plotted" width="800" height="556"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Real-world images are represented using a different approach, in which the background is typically lighter than the foreground. &lt;strong&gt;Hence, we need to invert the pixel values.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Because of that, we need to preprocess the given images before running inference. This process involves: (1) converting the image to grayscale (to ensure the input has a single channel); (2) inverting the pixels (so the foreground is lighter than the background, like the images used to train the model); (3) normalizing the image (dividing each pixel value by 255); (4) resizing the image to 28x28 (the shape the model expects).&lt;/p&gt;

&lt;p&gt;Applied to an image of shape (X, Y, W), where X is the width, Y the height and W the number of channels, this process should return an image with shape (28, 28, 1).&lt;/p&gt;

&lt;p&gt;One last step before feeding the image into the model is creating an input tensor of shape (1, 28, 28, 1): a batch containing a single 28x28 image with one channel.&lt;/p&gt;
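&lt;p&gt;To make the steps concrete, below is a minimal sketch of the same pipeline in plain Python, using nested lists and nearest-neighbor resizing for illustration (the actual app performs the equivalent operations with TensorFlow.js tensor ops):&lt;/p&gt;

```python
def preprocess(image, size=28):
    """image: list of rows, each pixel an (r, g, b) tuple in 0-255.
    Returns a (1, size, size, 1) nested list ready to be fed to the model."""
    h, w = len(image), len(image[0])
    # (1) grayscale: average the channels so we end up with a single channel
    gray = [[sum(px) / 3 for px in row] for row in image]
    # (2) invert: make the foreground lighter than the background, as in MNIST
    inverted = [[255 - p for p in row] for row in gray]
    # (3) normalize: element-wise division by 255
    normalized = [[p / 255 for p in row] for row in inverted]
    # (4) resize to size x size with nearest-neighbor sampling
    resized = [
        [normalized[i * h // size][j * w // size] for j in range(size)]
        for i in range(size)
    ]
    # final step: add batch and channel dimensions -> (1, size, size, 1)
    return [[[[p] for p in row] for row in resized]]
```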

&lt;h2&gt;
  
  
  Using the app
&lt;/h2&gt;

&lt;p&gt;If you just want to use the app you can visit &lt;a href="https://snapmath-azure.vercel.app/" rel="noopener noreferrer"&gt;snapmath&lt;/a&gt;. If you want to run it locally, clone the &lt;a href="https://github.com/buarki/snapmath/tree/master/app-js" rel="noopener noreferrer"&gt;project&lt;/a&gt; and follow the instructions. The repo also provides a version &lt;a href="https://github.com/buarki/snapmath/tree/master/app" rel="noopener noreferrer"&gt;using Python and Flask&lt;/a&gt;, with instructions also available.&lt;/p&gt;

&lt;p&gt;The current version of snapmath looks like this:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqnbqlcnmhrk715hfjkac.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqnbqlcnmhrk715hfjkac.png" alt="Overview of the app" width="800" height="406"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This article was originally posted on my personal site: &lt;a href="https://www.buarki.com/blog/mlops-experience" rel="noopener noreferrer"&gt;https://www.buarki.com/blog/mlops-experience&lt;/a&gt;&lt;/p&gt;

</description>
      <category>tensorflow</category>
      <category>mlops</category>
      <category>react</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>Takeaways from 2023, expectations for 2024: linear regression is the new CRUD portfolio project</title>
      <dc:creator>Aurelio Buarque</dc:creator>
      <pubDate>Sat, 23 Dec 2023 20:45:02 +0000</pubDate>
      <link>https://dev.to/buarki/takeaways-from-2023-expectations-for-2024-linear-regression-is-the-new-crud-portfolio-project-5hij</link>
      <guid>https://dev.to/buarki/takeaways-from-2023-expectations-for-2024-linear-regression-is-the-new-crud-portfolio-project-5hij</guid>
      <description>&lt;h2&gt;
  
  
  Foreword
&lt;/h2&gt;

&lt;p&gt;This is a non-technical and non-scientific article. It's simply a mind dump of some key takeaways from my experiences in 2023 and my humble vision for 2024 (something I do every year). Please do NOT make crucial decisions in your life based solely on this :)&lt;/p&gt;

&lt;h2&gt;
  
  
  2023: Highs and Lows
&lt;/h2&gt;

&lt;p&gt;What a year, my friends! The year 2023 will undoubtedly go down in history as a significant milestone, especially with the strides made in the field of Machine Learning that reached the general non-tech public, with GPT-4 taking center stage. Additionally, we cannot overlook the unfortunate wave of massive layoffs that affected many colleagues, some of them close to us. However, my intention here is not to compile software engineering-related news from 2023 but to spotlight key observations I made throughout the year.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Power of Fundamentals: Beyond AI Tools in Software Engineering
&lt;/h2&gt;

&lt;p&gt;I believe it is reasonable to say that many developers are well-equipped with various AI tools, some of which can be integrated into the code editor, enabling AI to effectively pair program with a developer. But even with such powerful tools, one thing I saw throughout this year is that such an arsenal doesn't mean much if fundamental design principles are not well understood. I would even say that &lt;strong&gt;in the age of advanced automation and intelligent code suggestions, a robust foundation in design principles is the bedrock of effective software engineering.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This year, I had the chance to work on two important teams at work, and it was interesting to perform code reviews for colleagues using such tools and realize that, while these assistants are powerful aids, they &lt;strong&gt;are not a silver bullet for ensuring code quality&lt;/strong&gt;. In other words, those AI assistants are there to give you whatever you ask for, and if you don't have the knowledge or experience to evaluate the generated code, then bad code will be produced just the same, only now by a machine :) How can someone evaluate the code given by an AI assistant if they don't know why and how to check the cohesion of the code? If they don't know why and how to check coupling?&lt;/p&gt;

&lt;p&gt;Over my 10 years in this ever-evolving landscape of software development, I have moved between a series of programming languages, such as C, Java, Go and Javascript, with their famous frameworks and libs like Spring Boot, Express, NestJS and so on. I've been around long enough to see monoliths be treated as something good, then as the worst thing ever, and in the past few years come back to being seen as something valuable again. In this ever-shifting terrain, where technologies rise and fall like passing trends, one constant has anchored my proficiency and adaptability: a &lt;strong&gt;commitment to timeless design principles&lt;/strong&gt;. Principles like SOLID, DRY, YAGNI and "Tell, don't ask" have served as my North Star, providing a reliable compass that transcends the specifics of coding languages.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;As the industry embraces the era of artificial intelligence, I've found good design principles to be more critical than ever, especially in the context of code generated by AI&lt;/strong&gt;. While AI tools can swiftly produce code snippets and even entire functions, their effectiveness is greatly enhanced when paired with a solid foundation in design principles.&lt;/p&gt;

&lt;h2&gt;
  
  
  Two things we know for sure about life: the death and the presence of overengineering
&lt;/h2&gt;

&lt;p&gt;Another point that never goes out of fashion is &lt;strong&gt;overengineering&lt;/strong&gt;. From my observations, its main cause is the desire to apply to a project a set of solutions that big companies had success with... but the project itself is orders of magnitude smaller. An interesting aspect here is that, depending on how emotionally attached the solution's owner is, you might be labeled as someone who doesn't follow "good practices". After all, how could someone argue that Domain-Driven Design or Clean Architecture are not suitable for all projects? :)&lt;/p&gt;

&lt;h2&gt;
  
  
  2024: linear regression is the new CRUD portfolio project
&lt;/h2&gt;

&lt;p&gt;Depending on how long you have been working in the software industry, you'll probably agree with me that around 10 years ago a typical portfolio app used to be a &lt;strong&gt;CRUD&lt;/strong&gt; (Create, Read, Update and Delete) using some programming language and web framework, or something like "building a microservice for something". When I look at the current scenario, in which AI and ML are draining attention, and compare it with those past 10 years, I see that &lt;strong&gt;linear regression is the new CRUD&lt;/strong&gt;. I think that because linear regression is a basic step that the majority of developers are able to take, regardless of their preferred stack, and it is an effective entry point into the big world of &lt;strong&gt;Artificial Intelligence/Machine Learning&lt;/strong&gt;. Furthermore, I won't be surprised if during next year the number of GitHub repos with AI/ML topics increases and related topics become more frequently featured on CVs.&lt;/p&gt;

&lt;p&gt;Building upon that, something I also expect for the coming year is &lt;strong&gt;Python&lt;/strong&gt; getting more adoption, not necessarily for web development, or even as a Javascript replacement, but due to the &lt;strong&gt;high demand for AI&lt;/strong&gt; and the rich ecosystem of libraries available for Python.&lt;/p&gt;

&lt;p&gt;And adding my 2 cents on this: for those really interested in stepping into the world of AI/ML, my humble suggestion is to ace the basics of statistics and linear algebra that it requires. Rather than focusing too much on tools such as Tensorflow, the crucial aspect of working with Machine Learning lies in acquiring the right data, cleaning and normalizing it, and then structuring the model to test hypotheses, check accuracy, and, if satisfactory, export the model to be loaded into some API for execution. &lt;strong&gt;Without a solid understanding of the mathematics behind it, one won't be able to assess the effectiveness of their model and won't be able to innovate in the field.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Hoping for the best in 2024
&lt;/h2&gt;

&lt;p&gt;I know that recent and rapid advancements and disruptions bring us some uncertainties about the future, which can potentially make us afraid. But, it's precisely these situations that force us to be more resilient and move forward.&lt;/p&gt;

&lt;p&gt;Let's embrace the future with open minds, let's keep our curiosity fueled, and let's remember that amid all the tech jargon and buzzwords, what truly matters are the fundamentals. Whether you're shifting from one programming language to another or from one AI assistant tool to the next, carry with you the timeless principles that lay the foundation for good software.&lt;/p&gt;

&lt;p&gt;As we step into 2024, let's not just hope for a hot tech industry, but actively contribute to turning up the heat. Let's bring the passion, the creativity, and the collaborative spirit that make our industry not just survive but thrive.&lt;/p&gt;

&lt;p&gt;This article was originally posted on my personal site: &lt;a href="https://www.buarki.com/blog/2023-2024" rel="noopener noreferrer"&gt;https://www.buarki.com/blog/2023-2024&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>webdev</category>
      <category>career</category>
      <category>news</category>
    </item>
    <item>
      <title>Supervised Machine Learning: how to build your own Neural Network from scratch</title>
      <dc:creator>Aurelio Buarque</dc:creator>
      <pubDate>Sun, 03 Dec 2023 21:15:14 +0000</pubDate>
      <link>https://dev.to/buarki/supervised-machine-learning-how-to-build-your-own-neural-network-from-scratch-1ahn</link>
      <guid>https://dev.to/buarki/supervised-machine-learning-how-to-build-your-own-neural-network-from-scratch-1ahn</guid>
      <description>&lt;h2&gt;
  
  
  Foreword
&lt;/h2&gt;

&lt;p&gt;This article aims to provide a simple, minimalist, baby-steps introduction to how &lt;strong&gt;Supervised Machine Learning&lt;/strong&gt; works. It will explain the underlying idea, the mechanics, and the reasons behind its functionality. Additionally, a concrete and minimalist example with code will be available at the end as a reference. Last but not least, it is not meant to be a replacement for academic articles on this topic, but introductory material for those who love math and coding.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is Machine Learning?
&lt;/h2&gt;

&lt;p&gt;In brief, as you can see &lt;a href="https://mitsloan.mit.edu/ideas-made-to-matter/machine-learning-explained" rel="noopener noreferrer"&gt;in this nice article from MIT&lt;/a&gt;, “Machine learning is a subfield of artificial intelligence, which is broadly defined as the capability of a machine to imitate intelligent human behavior. Artificial intelligence systems are used to perform complex tasks in a way that is similar to how humans solve problems”. We can divide this AI field in three categories: &lt;strong&gt;Supervised Learning (and this article will address it), Unsupervised Learning and Reinforcement Learning.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Supervised Learning: an overview
&lt;/h2&gt;

&lt;p&gt;Supervised learning is a branch of machine learning where computers are trained by being presented with examples of desired behavior. It mimics the process of a teacher guiding a student in learning a subject, with continuous checks on the student’s progress.&lt;/p&gt;

&lt;p&gt;For this type of Machine Learning, the algorithm is trained on a dataset that includes both input data and the corresponding correct outputs, or “answers.”&lt;/p&gt;

&lt;p&gt;A funny example of this is the app &lt;a href="https://www.youtube.com/watch?v=tWwCK95X6go" rel="noopener noreferrer"&gt;“Hot dog and Not hot dog”&lt;/a&gt; from the Silicon Valley TV series, in which the character Jian Yang creates an app able to tell whether or not a picture is of a hot dog. That’s surely not the best use case for supervised machine learning; in our day-to-day life we have much more useful applications, like spam filters, product recommendations, fraud detection and so on.&lt;/p&gt;

&lt;p&gt;Some methods used in supervised learning include &lt;strong&gt;neural networks, naïve Bayes, linear regression, logistic regression, random forest, and support vector machines (SVM)&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;As you can see, even being a subfield of Artificial Intelligence, supervised learning has its own subtopics. In the next section we’ll address how supervised learning works using neural networks.&lt;/p&gt;

&lt;h2&gt;
  
  
  The fundamental pieces of Neural Networks: Neurons and the Universal Approximation Theorem
&lt;/h2&gt;

&lt;p&gt;Dear reader, I must warn you that math is a passion of mine, and I’m thrilled as I write this section. So, proceed at your own risk :)&lt;/p&gt;

&lt;p&gt;As previously discussed, supervised learning works by feeding some data into an algorithm and checking whether the outcome matches the expected result. When implementing supervised learning using Neural Networks, the output computation is performed as the input passes through the &lt;a href="https://en.wikipedia.org/wiki/Artificial_neuron" rel="noopener noreferrer"&gt;“neurons”&lt;/a&gt; of the net. A neuron in this case is essentially a function that receives one or more inputs, such as a vector of values, does some calculation and produces a result. Each input of such a neuron has a weight attached to it, simulating the &lt;a href="https://en.wikipedia.org/wiki/Synapse" rel="noopener noreferrer"&gt;synapse process&lt;/a&gt;. After summing all (inputs * weights) plus a bias value, the neuron applies the result to an activation function, which is in charge of introducing non-linearity into the network, enabling it to learn and approximate complex relationships in the data. To help you visualize it, check the image below:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkqqtgsmf1vzn5hb5orue.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkqqtgsmf1vzn5hb5orue.png" alt="Image took from https://www.researchgate.net/profile/Douw-Boshoff/publication/328733599/figure/fig2/AS:734165188218886@1552050029499/The-structure-of-the-artificial-neuron.png" width="800" height="463"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In the above image each “x” represents an input value fed into the neuron and each “w” is a weight applied to an input. The neuron sums all (x*w) with the bias (the theta letter) and the result is applied to f(x). Mathematically we can describe it as follows:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fc1j0r8wbwyahwbm0te0y.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fc1j0r8wbwyahwbm0te0y.png" alt="Neuron equation" width="385" height="152"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Where yk is the output of the kth neuron and f is an activation function; for instance, it could be a &lt;a href="https://en.wikipedia.org/wiki/Sigmoid_function" rel="noopener noreferrer"&gt;sigmoid function&lt;/a&gt;. The above equation is enough to explain how a neuron works, but it &lt;strong&gt;doesn’t explain why it works.&lt;/strong&gt;&lt;/p&gt;
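&lt;p&gt;The equation above can be sketched directly in Python; the inputs, weights and bias below are arbitrary illustrative values, and sigmoid plays the role of f:&lt;/p&gt;

```python
import math

def sigmoid(z):
    # a common choice for the activation function f
    return 1 / (1 + math.exp(-z))

def neuron(inputs, weights, bias, activation=sigmoid):
    # y_k = f(sum(x_i * w_i) + bias)
    z = sum(x * w for x, w in zip(inputs, weights)) + bias
    return activation(z)

y = neuron([1.0, 0.5], [0.4, -0.2], 0.1)  # a single neuron with two inputs
```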

&lt;p&gt;To grasp &lt;strong&gt;why&lt;/strong&gt; it operates, let’s revisit the fundamental concept of supervised learning: we possess a labeled dataset that serves as the training data for a Neural Network, allowing it to comprehend the underlying patterns. Consider the following example: we aim to create a function, denoted as f(x), which, based on the hours of study by a computer science student, predicts their linear algebra test score. We assume that the student has already taken three tests, and we have their corresponding scores:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpzqt8082t1fu31m518jd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpzqt8082t1fu31m518jd.png" alt="Table where left column shows the hours of study and right column the test score." width="216" height="161"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Generalizing the idea we get:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvbqczrltbvvhiljv2fi7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvbqczrltbvvhiljv2fi7.png" alt="Table generalizing the idea of the presented problem." width="271" height="244"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Thus, our job here is to find a function f(x) so that: f(xi) is very close to yi, where i = 0,1,2…n. To do so we define a family of functions F that could be a good fit for this problem, like &lt;strong&gt;linear regression&lt;/strong&gt;:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fd6abpy8jcrxduyqdceuw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fd6abpy8jcrxduyqdceuw.png" alt="Family of functions we need." width="625" height="57"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If, and only if, there really exists a function that could represent the relation between xi and yi we can take this as an &lt;a href="https://umanitoba.ca/science/research/mathematics/approximation-theory#:~:text=Learn%20more-,Approximation%20theory,-Approximation%20theory%20studies" rel="noopener noreferrer"&gt;approximation problem&lt;/a&gt;: how can we find an element from F as close as possible from our desired function?&lt;/p&gt;

&lt;p&gt;We don’t have corresponding yi for all xi, only a few samples where we know for sure that f(xi) = yi. Thus, the best we can do is find a function that properly fits the relation between xi and yi, generalizes to unknown examples, and is computationally feasible to find. In other words, we want an element of F that minimizes the &lt;a href="https://en.wikipedia.org/wiki/Uniform_norm" rel="noopener noreferrer"&gt;supremum norm&lt;/a&gt;:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9ik699tvcwvcs072oemw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9ik699tvcwvcs072oemw.png" alt="The best approximation." width="480" height="92"&gt;&lt;/a&gt;&lt;br&gt;
The best approximation.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3qi6urpa9snrow8u52yg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3qi6urpa9snrow8u52yg.png" alt="The supremum norm." width="622" height="68"&gt;&lt;/a&gt;&lt;br&gt;
The supremum norm.&lt;/p&gt;

&lt;p&gt;As we want the best possible fit between input data and output, we need to find the minimal ||f − u||. The supremum norm is a mathematical tool that helps us measure the maximum vertical distance between the true function and our approximation across all possible x values. &lt;strong&gt;By minimizing this supremum norm, we seek a function within F that provides the closest possible approximation to our desired function&lt;/strong&gt;.&lt;/p&gt;
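&lt;p&gt;For intuition, over a finite grid of sample points the supremum norm is simply the largest absolute gap between the two functions. A tiny sketch, with arbitrary example functions:&lt;/p&gt;

```python
def sup_norm(f, u, xs):
    # largest vertical distance |f(x) - u(x)| over the sample points
    return max(abs(f(x) - u(x)) for x in xs)

# example: how well does u(x) = x approximate f(x) = x^2 on [0, 1]?
xs = [i / 100 for i in range(101)]
gap = sup_norm(lambda x: x * x, lambda x: x, xs)  # largest gap sits at x = 0.5
```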

&lt;p&gt;A proof that the mathematical neuron definition I’ve shown above is capable of finding the minimal ||f − u|| is given by the &lt;a href="https://www.deep-mind.org/2023/03/26/the-universal-approximation-theorem/#:~:text=provide%20an%20answer.-,Universal%20Approximation%20Theorem" rel="noopener noreferrer"&gt;Universal Approximation Theorem (UAT)&lt;/a&gt;, and this &lt;a href="https://www.youtube.com/watch?v=O45AaRPQhuI" rel="noopener noreferrer"&gt;quick video&lt;/a&gt; shows why Neural Networks are able to learn pretty much any function thanks to the UAT.&lt;/p&gt;

&lt;p&gt;The overall process consists of defining a small ε value that controls how closely the elements of F must fit our model. &lt;strong&gt;The smaller ε is, the more accurate our Neural Network will be&lt;/strong&gt;. Once ε is defined, we start searching for a Neural Network whose outputs are close to our desired f; the image below tries to show it:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7ancv0azpi4h833w5fh5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7ancv0azpi4h833w5fh5.png" alt="Neural Network trying to fit the model." width="720" height="451"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Neural Network trying to fit the model.&lt;/p&gt;

&lt;p&gt;Although the UAT proves that such a function u(x) (an element of F) exists, &lt;strong&gt;it doesn’t tell us how to find it&lt;/strong&gt;, and the &lt;strong&gt;required number of neurons needed to fit &lt;em&gt;f(x)&lt;/em&gt; might be really big&lt;/strong&gt;; because of that, finding an optimal Neural Network in practice is a challenge. Besides that, there are other points to take care of while searching for a Neural Network, such as avoiding &lt;strong&gt;overfitting&lt;/strong&gt; and &lt;strong&gt;underfitting&lt;/strong&gt;, the &lt;strong&gt;quantity&lt;/strong&gt; and &lt;strong&gt;quality&lt;/strong&gt; of the data required, the computational resources available and so on.&lt;/p&gt;

&lt;p&gt;Such challenges will be addressed in the next section by assessing a simple example problem and showing one possible way to find an optimal approximation (a Neural Network that fits the behavior of an f).&lt;/p&gt;

&lt;h2&gt;
  
  
  Defining a concrete problem as example to solve and planning data architecture
&lt;/h2&gt;

&lt;p&gt;We want to create a model that, based on a computer science student’s hours of sleep and hours of meditation, predicts their linear algebra test score. We can also assume that the student has already taken three tests and we have their respective scores:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgar4zbgu0msulda35jbg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgar4zbgu0msulda35jbg.png" alt="Table showing a simple dataset of the problem we are trying to solve in this article. First two columns are, respectively, the hours of sleep and hours of meditation. The third one represents the test score." width="399" height="159"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Table showing a simple dataset of the problem we are trying to solve in this article. First two columns are, respectively, the hours of sleep and hours of meditation. The third one represents the test score.&lt;/p&gt;

&lt;p&gt;If we assume that the model we are trying to create is a function, then the &lt;em&gt;hours of sleep&lt;/em&gt; and &lt;em&gt;hours of meditation&lt;/em&gt; are inputs of that function, while the &lt;em&gt;score&lt;/em&gt; is the output. Let’s define the vector X as the input of our model and Y as its output. To clarify: if &lt;em&gt;f(x)&lt;/em&gt; represents our Neural Net and one possible input is &lt;em&gt;X=[6,5]&lt;/em&gt; (where 6 is the hours slept and 5 the hours of meditation), then &lt;em&gt;f(X)=[8]&lt;/em&gt; (where 8 is the test score). Likewise, if &lt;em&gt;X=[7,4]&lt;/em&gt; then &lt;em&gt;f(X)=[7]&lt;/em&gt;, and if &lt;em&gt;X=[7,5]&lt;/em&gt; then &lt;em&gt;f(X)=[8]&lt;/em&gt;. A clever way to group inputs and outputs is to use matrices of them, giving us the input matrix:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fss6eho9o57x7mwnx4qhu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fss6eho9o57x7mwnx4qhu.png" alt="Matrix with the Neural Network inputs." width="410" height="349"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Matrix with the Neural Network inputs.&lt;/p&gt;

&lt;p&gt;Each line of the above matrix X is a valid input, where the left column represents the hours of sleep and the right column the hours of meditation. Then &lt;em&gt;f(X)&lt;/em&gt; will be:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmvlfworo42wggntxmv1g.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmvlfworo42wggntxmv1g.png" alt="Matrix with the Neural Network outputs.&amp;lt;br&amp;gt;
" width="403" height="346"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Matrix with the Neural Network outputs.&lt;/p&gt;

&lt;p&gt;Each line of the above matrix Y is a valid output of the Neural Network.&lt;/p&gt;

&lt;p&gt;One thing that must be pointed out is that &lt;strong&gt;both matrices’ data must be scaled to ensure that the Neural Network handles standardized units&lt;/strong&gt;. One way to achieve this is using &lt;a href="https://www.oreilly.com/library/view/hands-on-machine-learning/9781788393485/fd5b8a44-e9d3-4c19-bebb-c2fa5a5ebfee.xhtml#:~:text=Min%E2%80%93max%20normalization" rel="noopener noreferrer"&gt;Min-max normalization&lt;/a&gt;.&lt;/p&gt;
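&lt;p&gt;Min-max normalization maps each column of the dataset into the [0, 1] range. A sketch in Python, using the hours-slept values from the example inputs above (each feature is scaled independently):&lt;/p&gt;

```python
def min_max_scale(values):
    # (v - min) / (max - min) maps the values into [0, 1]
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

hours_slept = [6, 7, 7]  # first column of the input matrix X
scaled = min_max_scale(hours_slept)
```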

&lt;p&gt;With the problem stated and the input features known, we can plan the Neural Network structure. As our &lt;strong&gt;input data has two dimensions&lt;/strong&gt; we need an input layer with two inputs, and as &lt;strong&gt;our result is a single number&lt;/strong&gt; our network has only one output. Thus, the basic structure required for our Neural Network is the following:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6n39ekil9yh2xdzui28k.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6n39ekil9yh2xdzui28k.png" alt="The basic structure of our Neural Network." width="720" height="340"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The basic structure of our Neural Network.&lt;/p&gt;

&lt;p&gt;To complete the structure we need to plan its &lt;strong&gt;hidden layers&lt;/strong&gt;. The number of inner layers may vary depending on the problem the Neural Network needs to solve; in fact, real-world examples can have hundreds of hidden layers and millions of parameters. For the sake of simplicity we’ll stick with one hidden layer with only 3 neurons, as the image below shows:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwacv2fkhw04xfhc3iqti.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwacv2fkhw04xfhc3iqti.png" alt="The completed structure of our Neural Network." width="720" height="494"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;An important thing missing from the image above is the &lt;strong&gt;neuron weights&lt;/strong&gt;. We will assume that the weights of the first layer are always 1 and ignore them, so applying weights to the second and third layers we get:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fb2xyj25wusrjn0rv4e5v.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fb2xyj25wusrjn0rv4e5v.png" alt="The completed structure of our Neural Network with Neuron weights." width="720" height="494"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The completed structure of our Neural Network with Neuron weights.&lt;/p&gt;

&lt;p&gt;In case you didn’t get the weight notation, take a look at the image below:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzvmcduzyzws8geeiboam.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzvmcduzyzws8geeiboam.png" alt="The completed structure of our Neural Network with Neuron weights." width="569" height="483"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In the next section we will properly address how the Neural Network model is built based on this proposed structure.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Neural Network model in action: forward process
&lt;/h2&gt;

&lt;p&gt;The process of actually using the Neural Network is called the forward process. Here the Neural Network receives the inputs, passes all of them through the weights, applies the activation function to the weighted sum and proceeds to the next layer.&lt;/p&gt;

&lt;p&gt;Let’s start by checking how the process works in the second layer. Each neuron in the second layer receives the input from every neuron of the first layer and applies a weight to it. Once all inputs are weighted we get the following:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvl83bj340u7yhiib8u6l.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvl83bj340u7yhiib8u6l.png" alt="Neural Network with all second layer’s input weighted.&amp;lt;br&amp;gt;
" width="720" height="535"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Neural Network with all second layer’s input weighted.&lt;/p&gt;

&lt;p&gt;We placed all input values into the 3x2 matrix X; if we also place all weights into a 2x3 matrix, the product of the two will contain the results of all (inputs * weights). Let the weight matrix of the second layer be:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F87pd11fxaphu94x1w6yq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F87pd11fxaphu94x1w6yq.png" alt="Matrix with second layer weights." width="720" height="489"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Then, the product:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1zuy1hxeg53js099huy9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1zuy1hxeg53js099huy9.png" alt="Matrix product." width="192" height="143"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Will be:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fit75muye53r8ubjqbvrj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fit75muye53r8ubjqbvrj.png" alt="Matrix product." width="720" height="344"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We’ll call the above product:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftv8vt4y2jfrd898cy9zf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftv8vt4y2jfrd898cy9zf.png" alt="Matrix product." width="272" height="86"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Even before you ask: we’ll address the bias in upcoming sections.&lt;/p&gt;

&lt;p&gt;Once we have calculated the weighted sum for each neuron we can apply the activation function element-wise to the matrix above to calculate the output of the second layer. For this problem, the activation function used will be the sigmoid:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpeuhjs94cuvu021e5evz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpeuhjs94cuvu021e5evz.png" alt="Sigmoid activation function." width="267" height="94"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Sigmoid activation function.&lt;/p&gt;
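
&lt;p&gt;The sigmoid is simple enough to sketch in a few lines of Python (illustrative only):&lt;/p&gt;

```python
import math

def sigmoid(z):
    # Squashes any real number into the open interval (0, 1).
    return 1.0 / (1.0 + math.exp(-z))

print(sigmoid(0))  # 0.5
```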

&lt;p&gt;Thus, the second layer output can be defined as:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftxsqnfvvm8l0cgtsxbue.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftxsqnfvvm8l0cgtsxbue.png" alt="Matrix containing second layer output.&amp;lt;br&amp;gt;
" width="269" height="106"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Matrix containing second layer output.&lt;/p&gt;

&lt;p&gt;Once the second layer is finished we proceed to the third layer with the exact same process, with the second layer’s output being the third layer’s input. Let’s start by looking at the third layer weights:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy7z1vh2r2i1t1l71002w.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy7z1vh2r2i1t1l71002w.png" alt="Matrix carrying the weights of third layer." width="224" height="302"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Matrix carrying the weights of third layer.&lt;/p&gt;

&lt;p&gt;Multiplying the second layer output by the third layer weights gives us:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ft7gy89pib1dzks0grbcb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ft7gy89pib1dzks0grbcb.png" alt="Matrix values" width="720" height="269"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;And to get the output of the third layer we apply the sigmoid to this product:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbnqehd2zll2v0z5wr1qs.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbnqehd2zll2v0z5wr1qs.png" alt="Matrix containing third layer output." width="256" height="98"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The matrix above has shape 3x1 and each row represents a predicted test score.&lt;/p&gt;
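
&lt;p&gt;The whole forward process described above can be sketched in plain Python. The matrix values and helper names below are hypothetical placeholders, just to make the layer-by-layer flow concrete:&lt;/p&gt;

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def matmul(a, b):
    # Plain-Python product of an (n x m) matrix by an (m x p) matrix.
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]

def forward(x, w2, w3):
    # Weighted sums, then element-wise sigmoid, layer by layer.
    y2 = [[sigmoid(v) for v in row] for row in matmul(x, w2)]
    return [[sigmoid(v) for v in row] for row in matmul(y2, w3)]

# Hypothetical scaled inputs (study hours, meditation hours) and weights.
X = [[0.3, 1.0], [0.5, 0.2], [1.0, 0.4]]   # 3x2 input matrix
W2 = [[0.1, 0.4, 0.7], [0.2, 0.5, 0.8]]    # 2x3 second layer weights
W3 = [[0.3], [0.6], [0.9]]                  # 3x1 third layer weights
print(forward(X, W2, W3))                   # 3x1: one predicted score per row
```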

&lt;p&gt;With the equations discussed in this section we are able to predict the test score based on the number of study and meditation hours. But such predictions will probably be absolutely wrong :). &lt;strong&gt;This is because the weights of our layers are defined randomly when we create the Neural Network&lt;/strong&gt;. Thus, the Neural Network won’t be useful for predicting the test score yet. To adjust the Neural Network we need a way to find proper values for our weights while measuring the accuracy of the outputs. This will be addressed in the next section.&lt;/p&gt;

&lt;h2&gt;
  
  
  Assessing the Neural Network error: error cost
&lt;/h2&gt;

&lt;p&gt;In order to improve the Neural Network accuracy we need to know how wrong the outputs are. To do so we’ll use a &lt;strong&gt;cost function&lt;/strong&gt; (or loss function). &lt;strong&gt;The cost function measures the performance of our machine learning model by quantifying the error between the prediction and the expected value&lt;/strong&gt;. A good choice for the cost function is the &lt;a href="https://www.statisticshowto.com/probability-and-statistics/statistics-definitions/mean-squared-error/" rel="noopener noreferrer"&gt;Mean Squared Error (MSE)&lt;/a&gt; (soon it will be explained why this approach is effective and smart). In mathematical terms we can define the MSE as follows:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkix4v4v1inaof4onhqal.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkix4v4v1inaof4onhqal.png" alt="The error of our Neural Net" width="417" height="134"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The error of our Neural Net&lt;/p&gt;

&lt;p&gt;In the equation above, E is the global error of our Neural Network, &lt;em&gt;y'&lt;/em&gt; is the expected output for one input and y is the actual output for that input. In case you are wondering why we placed the 1/2 term: it is there for convenience when taking derivatives, as it simplifies the computation of gradients and doesn’t alter the fundamental meaning of the Mean Squared Error (MSE), which remains a measure of the average squared difference between predicted and actual values.&lt;/p&gt;
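
&lt;p&gt;This cost is a one-liner in Python; a minimal sketch (the sample values are hypothetical):&lt;/p&gt;

```python
def cost(expected, predicted):
    # E = 1/2 * sum((y' - y)^2), matching the equation above.
    return 0.5 * sum((e - p) ** 2 for e, p in zip(expected, predicted))

print(cost([1.0, 0.5], [0.8, 0.5]))  # roughly 0.02
```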

&lt;p&gt;In short, &lt;strong&gt;training a Neural Network means minimizing its cost function&lt;/strong&gt;. Our cost function is a function of the Neural Network input and its weights; in case this is not clear we can expand E:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyrmzjv2wc1xvn8l3219o.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyrmzjv2wc1xvn8l3219o.png" alt="Expanding the Neural Network error cost." width="553" height="676"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We don’t have control over the X param; only the “client” of the Neural Network does. &lt;strong&gt;Thus, the only thing we can change in order to adjust the cost function is the Neural Network weights&lt;/strong&gt;. But how can we adjust the weights? Brute force?&lt;/p&gt;

&lt;p&gt;Consider for a while a hypothetical Neural Network containing only one weight, whose “proper” value can be found in a set of 10k numbers. A brute force approach would take, in the worst case, 10k “checks” to find the proper value. If we also assume that each “check” takes 2 milliseconds (ms), the overall process would take 20 seconds. Annoying, but still feasible.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frqgg1nh5apjugmateie4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frqgg1nh5apjugmateie4.png" alt="How the error cost would look like for one neuron." width="720" height="724"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;It would take us 10k executions of the forward process, calculating the error cost each time, to finally find the minimal error value. If we expand it to two neurons and assume that this new neuron also has its proper value in a search space of 10k numbers, our search would require 100,000,000 iterations (10k * 10k), and if we keep the check time of 2 ms it would take 200,000 seconds, or approximately 56 hours :)&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fk35myruryhf04dkvl0xn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fk35myruryhf04dkvl0xn.png" alt="How the error cost would look like for two neurons.&amp;lt;br&amp;gt;
" width="720" height="663"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The conclusion here is obvious: &lt;strong&gt;the more neurons the Neural Network has, the more expensive its training will be&lt;/strong&gt;. This phenomenon is called the &lt;a href="https://en.wikipedia.org/wiki/Curse_of_dimensionality" rel="noopener noreferrer"&gt;Curse of dimensionality&lt;/a&gt;, a series of challenges and issues that arise when dealing with data in high-dimensional spaces.&lt;/p&gt;

&lt;p&gt;As we could see, &lt;strong&gt;brute force solutions are not feasible here&lt;/strong&gt;. We need a shortcut through all the available numbers in the search space to the ones that give us the minimum of the error cost function. &lt;strong&gt;We could achieve it by knowing the direction in which E decreases from one inspected point&lt;/strong&gt;. Let’s consider the scenario with one single neuron again to visualize it. Consider the image below, where point P(wi, E(wi)) is a known point in the available search space and we know that moving to the right decreases E while moving to the left increases E. Then, as we want to minimize E, we move to the right.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxf6wo06n8ka7v2rpk94v.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxf6wo06n8ka7v2rpk94v.png" alt="Error and weight" width="720" height="678"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We can extend this idea by asking: how does E behave when the weights change? Fortunately there is an awesome tool to help us find out, called calculus, specifically derivatives. As mentioned above, E is a function of both the Neural Network inputs and the weights, but we control only the weights. &lt;strong&gt;Thus, to know how E behaves as the weights change we need to calculate the derivative of E with respect to the weights&lt;/strong&gt;. If the derivative of E with respect to the weights is positive, E is increasing; otherwise it is decreasing. For our Neural Network, which has more than one layer, we need to take the partial derivative of E with respect to each layer’s weights.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;By knowing the direction in which E decreases we can iteratively look for lower values of E, ignoring points where it increases and optimizing the search time.&lt;/strong&gt; Check the image below to visualize it:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbelqdol93pysi3kbz11o.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbelqdol93pysi3kbz11o.png" alt="A clever way to find lower values of E." width="720" height="684"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The method described above is called &lt;a href="https://towardsdatascience.com/gradient-descent-algorithm-a-deep-dive-cf04e8115f21#:~:text=Gradient%20descent%20(GD)%20is%20an,e.g.%20in%20a%20linear%20regression" rel="noopener noreferrer"&gt;Gradient Descent&lt;/a&gt;, which is a way to minimize the cost function by adjusting the model params so that the cost decreases as fast as possible. &lt;strong&gt;It works by calculating the gradient (slope) of the cost function with respect to the parameters and updating them in the opposite direction of the gradient, moving towards the minimum of the cost function&lt;/strong&gt;. We basically repeat this process until the cost function reaches its minimal value.&lt;/p&gt;
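
&lt;p&gt;Here is a minimal sketch of the update rule on a toy convex cost, assuming a fixed learning rate (both the learning rate and the toy function are illustrative, not part of the article’s network):&lt;/p&gt;

```python
def gradient_descent(grad, w, learning_rate=0.1, steps=100):
    # Repeatedly step against the gradient to walk downhill on the cost.
    for _ in range(steps):
        w = w - learning_rate * grad(w)
    return w

# Toy convex cost E(w) = (w - 3)^2, whose gradient is 2 * (w - 3).
w_min = gradient_descent(lambda w: 2.0 * (w - 3.0), w=10.0)
print(w_min)  # converges very close to 3.0, the minimum of E
```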

&lt;p&gt;An important thing to point out is that &lt;strong&gt;we can use Gradient Descent for our Neural Network because our error cost function has a convex shape&lt;/strong&gt;, which ensures that E always moves in “the same direction”. The convex shape is a consequence of using the MSE, which is a quadratic function :)&lt;/p&gt;

&lt;p&gt;If our cost function didn’t have such a convex shape it could mess up our Gradient Descent execution, and our search could end up stuck in a &lt;strong&gt;local minimum&lt;/strong&gt; rather than the &lt;strong&gt;global minimum&lt;/strong&gt;. Check the image below:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgq8291b60fub0rwr01ix.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgq8291b60fub0rwr01ix.png" alt="Showing the problem of local and global minimum." width="720" height="572"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;It is also worth pointing out that in some scenarios the shape of the cost function doesn’t matter, for instance when using &lt;a href="https://www.analyticsvidhya.com/blog/2020/10/how-does-the-gradient-descent-algorithm-work-in-machine-learning/#:~:text=for%20large%20datasets.-,Stochastic%20Gradient%20Descent,-Stochastic%20gradient%20descent" rel="noopener noreferrer"&gt;Stochastic Gradient Descent&lt;/a&gt;, where you use one input example at a time. In our case we’ll be using &lt;a href="https://www.analyticsvidhya.com/blog/2020/10/how-does-the-gradient-descent-algorithm-work-in-machine-learning/#:~:text=used%20in%20practice.-,Batch%20Gradient%20Descent,-Batch%20gradient%20descent" rel="noopener noreferrer"&gt;Batch Gradient Descent&lt;/a&gt;, where rather than one training example at a time we’ll use a set of them.&lt;/p&gt;

&lt;h2&gt;
  
  
  Backpropagation: adjusting weights using Gradient Descent
&lt;/h2&gt;

&lt;p&gt;To perform the Batch Gradient Descent process that updates the Neural Network weights, we need to calculate the partial derivative of E with respect to the weights of the second and third layers. Let’s start with the third layer.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffc2z3ywaynowusff5b53.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffc2z3ywaynowusff5b53.png" alt="dEdW3" width="591" height="151"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As the &lt;a href="https://www.youtube.com/watch?v=fIOMrpVLtBs" rel="noopener noreferrer"&gt;derivative of a sum is the sum of derivatives&lt;/a&gt; we can move the inner sum to the outside:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0m7lv6lj6mulbehcub3d.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0m7lv6lj6mulbehcub3d.png" alt="Applying derivative rule" width="576" height="125"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;For simplicity, let’s remove the outer sum and focus on the derivative part of an isolated case. That said, if we apply the &lt;a href="https://www.youtube.com/watch?v=H-ybCx8gt-8" rel="noopener noreferrer"&gt;chain rule&lt;/a&gt; we get:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F279lo8vj2wxklvb5pjsf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F279lo8vj2wxklvb5pjsf.png" alt="Chain rule" width="636" height="114"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As the term &lt;em&gt;y’&lt;/em&gt; is the expected output, it is a constant. Solving the equation we get:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9hw9mfhcorngn3yowdv0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9hw9mfhcorngn3yowdv0.png" alt="Solving equation" width="421" height="105"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If we remember that y (the output of the Neural Network) is the result of the activation function applied to &lt;em&gt;v(3)&lt;/em&gt; and apply the chain rule again we get:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F982x8mzeqqhrq4w0yhbz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F982x8mzeqqhrq4w0yhbz.png" alt="Solving equation" width="490" height="122"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The derivative of the second term is the derivative of our activation function S, the sigmoid. The derivative of S is:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Finlimfxwapswk967pm4a.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Finlimfxwapswk967pm4a.png" alt="Sigmoid prime" width="236" height="103"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Applying it we get:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu5888myp4sdr2ypzcc8r.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu5888myp4sdr2ypzcc8r.png" alt="Solving equation" width="511" height="117"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This part is a little bit tricky due to matrix dimensions, but don’t give up :)&lt;/p&gt;

&lt;p&gt;A few steps ago we removed the Σ from our calculations to look at the problem in isolation, and now we can bring it back. Matrices y’ and y have the same dimensionality, so the subtraction between them can occur without any problem. The interesting part is what to do with S’, as it has the same dimensions as the matrix subtraction. If we look closely at how this product would happen if we were applying the Σ summation, we can see that the multiplication would be done element-wise. Thus, to represent it in the form of matrix multiplication we can use the &lt;a href="https://en.wikipedia.org/wiki/Hadamard_product_(matrices)" rel="noopener noreferrer"&gt;Hadamard Product&lt;/a&gt;. Adding the proper notation we get:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj7nae9x98hdlgqv8vp99.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj7nae9x98hdlgqv8vp99.png" alt="Applying notation" width="529" height="117"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The last derivative to solve is the third term, whose result is Y(2). As in this step we are &lt;strong&gt;propagating the error backwards to the weights that generated it&lt;/strong&gt;, we need to multiply the elements of Y(2) by their proper &lt;strong&gt;back propagating error&lt;/strong&gt;. We can do so by multiplying the transpose of Y(2) by the first two terms:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyop8wwa0a7qkcd405qca.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyop8wwa0a7qkcd405qca.png" alt="Solved equation" width="520" height="102"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;And we are done with the third layer! Proceeding to the second layer, the overall process is the same; the difference is that now we’ll take the derivative of E with respect to W(2). We can start with:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6t2t5q7f20hhk7y3en7n.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6t2t5q7f20hhk7y3en7n.png" alt="Solving equation" width="340" height="102"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Solving it we’ll have:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fifcmco4r2dhnecf4t4vp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fifcmco4r2dhnecf4t4vp.png" alt="Solving equation" width="522" height="375"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Applying the chain rule we get:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ff4n12os8uba8gsv1wmit.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ff4n12os8uba8gsv1wmit.png" alt="Solving equation" width="400" height="96"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The first two terms are already known:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Feq6nla50cvy9w37mzqtp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Feq6nla50cvy9w37mzqtp.png" alt="Solving equation" width="437" height="84"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;And if we keep applying the chain rule:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpupmxit9koj3rtob6tit.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpupmxit9koj3rtob6tit.png" alt="Solving equation" width="600" height="488"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If we reorganize some terms in the above equation, and also in the one for the third layer, we’ll see they have a product in common:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo97gmmjhtayj617vpffx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo97gmmjhtayj617vpffx.png" alt="Solved equations" width="584" height="208"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This product is the backpropagating error of the third layer. We can introduce the following notation for it:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpsizied2jf8dlqrhh2iw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpsizied2jf8dlqrhh2iw.png" alt="Solved equation" width="376" height="97"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We can rewrite the equation for the second layer as follows:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6sz8e1buzc0tcheurorn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6sz8e1buzc0tcheurorn.png" alt="Solved equation" width="425" height="129"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If we extend the idea, we can also define the backpropagating error for the second layer, and we’ll have the following:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fagtj066ir99ttikyia3n.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fagtj066ir99ttikyia3n.png" alt="Solved equation" width="388" height="200"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;With the equations for both the second and third layers finished, we are able to adjust the Neural Network weights by subtracting each respective gradient component from them. &lt;strong&gt;We need to subtract because the gradient points in the direction of the steepest increase of the error; to reduce the error we need to move in the opposite direction of the gradient&lt;/strong&gt;. Thus, by subtracting we adjust the weights in the direction that decreases the error.&lt;/p&gt;

&lt;p&gt;We can also multiply our gradient components by a small scalar to control the size of the steps taken during each iteration of the optimization process. &lt;strong&gt;This scalar is useful because it controls the size of the weight adjustments, ensuring that the optimization process is stable and converges to a meaningful solution&lt;/strong&gt;. This scalar is often called the &lt;strong&gt;learning rate&lt;/strong&gt;. Finding a proper learning rate is important because &lt;strong&gt;a too-large learning rate can cause the optimization process to oscillate or diverge, potentially missing the minimum&lt;/strong&gt;. On the other hand, &lt;strong&gt;a too-small learning rate can make convergence very slow, requiring many iterations to reach the minimum&lt;/strong&gt;. A bit of research turns up suggestions on how to choose the learning rate; one common recommendation is &lt;a href="https://www.andreaperlato.com/theorypost/the-learning-rate/#:~:text=The%20range%20of%20values%20to%20consider%20for%20the%20learning%20rate%20is%20less%20than%201.0%20and%20greater%20than%2010%5E%2D6.%20A%20traditional%20default%20value%20for%20the%20learning%20rate%20is%200.1%20or%200.01%2C%20and%20this%20may%20represent%20a%20good%20starting%20point%20on%20your%20problem" rel="noopener noreferrer"&gt;something between 0.01 and 1.0&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Thus, to update weights matrices we do the following:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgp8u079hxgqv25zguhf0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgp8u079hxgqv25zguhf0.png" alt="Solved equations" width="352" height="174"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Where α is the learning rate. One point worth emphasizing about the implementation of this step is to &lt;strong&gt;test it to ensure it was done correctly&lt;/strong&gt;. Any implementation glitch here would lead to wrong Neural Network results and problems that are very hard to debug :)&lt;/p&gt;
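
&lt;p&gt;A common way to perform such a test is a numerical gradient check: perturb each weight by a tiny amount, measure the finite-difference slope of the cost, and compare it with the analytic gradient produced by backpropagation. A minimal sketch (the quadratic cost below is just an illustrative stand-in for the network’s error function):&lt;/p&gt;

```python
import numpy as np

def numerical_gradient(cost, W, eps=1e-5):
    """Central finite differences: (cost(W + eps) - cost(W - eps)) / (2 * eps)."""
    grad = np.zeros_like(W)
    for i in np.ndindex(W.shape):
        orig = W[i]
        W[i] = orig + eps
        up = cost(W)
        W[i] = orig - eps
        down = cost(W)
        W[i] = orig  # restore the weight before moving on
        grad[i] = (up - down) / (2 * eps)
    return grad

# Toy cost with a known gradient: E(W) = 0.5 * sum(W**2), so dE/dW = W.
cost = lambda W: 0.5 * np.sum(W ** 2)
W = np.array([[1.0, -2.0], [3.0, 0.5]])
print(np.allclose(W, numerical_gradient(cost, W), atol=1e-6))  # True
```

If the two gradients diverge, the backpropagation code, rather than the calculus, is usually the culprit.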

&lt;h2&gt;
  
  
  Training the Neural Network
&lt;/h2&gt;

&lt;p&gt;Once we have designed the Neural Network structure and planned its forward and backpropagation processes, we are able to effectively train it. The idea is pretty simple: we define a parameter to control how many “iterations” the training will have, usually called the &lt;a href="https://www.simplilearn.com/tutorials/machine-learning-tutorial/what-is-epoch-in-machine-learning#:~:text=EXPLORE%20PROGRAM-,What%20Is%20Epoch%3F%C2%A0,-An%20epoch%20is" rel="noopener noreferrer"&gt;epoch&lt;/a&gt; count, and also the already mentioned &lt;strong&gt;learning rate&lt;/strong&gt;. The learning rate, the number of epochs and the number of layers are examples of what we call &lt;a href="https://towardsdatascience.com/what-are-hyperparameters-and-how-to-tune-the-hyperparameters-in-a-deep-neural-network-d0604917584a" rel="noopener noreferrer"&gt;hyperparameters&lt;/a&gt;. Once we define these numbers the process goes like this: &lt;strong&gt;for each epoch, execute the forward process, compute the error between the expected and obtained values, calculate the gradient and update the weights&lt;/strong&gt;. A very high-level algorithm would be:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhtol84qpew6ah12p4p8l.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhtol84qpew6ah12p4p8l.png" alt="A simple way to train" width="720" height="124"&gt;&lt;/a&gt;&lt;/p&gt;
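
&lt;p&gt;As a reference, that loop could be sketched like this with NumPy (the network shapes, variable names and toy data here are my own; the delta terms follow the equations derived in the previous sections, without bias or regularization):&lt;/p&gt;

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train(X, y, W2, W3, alpha=0.1, epochs=1000):
    """Minimal batch gradient descent for a network with one hidden layer."""
    for _ in range(epochs):
        # Forward process.
        A2 = sigmoid(X @ W2)      # hidden layer activation
        y_hat = sigmoid(A2 @ W3)  # network output
        # Backpropagation: delta terms as derived earlier.
        delta3 = (y_hat - y) * y_hat * (1.0 - y_hat)
        delta2 = (delta3 @ W3.T) * A2 * (1.0 - A2)
        # Update the weights against the gradient.
        W3 -= alpha * (A2.T @ delta3)
        W2 -= alpha * (X.T @ delta2)
    return W2, W3

# Toy data: two samples, two features, one output each.
rng = np.random.default_rng(0)
X = np.array([[0.3, 0.5], [0.5, 0.1]])
y = np.array([[0.75], [0.82]])
W2, W3 = train(X, y, rng.standard_normal((2, 3)), rng.standard_normal((3, 1)))
print(sigmoid(sigmoid(X @ W2) @ W3))  # predictions approach y as epochs grow
```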

&lt;p&gt;With a reasonable amount of data of proper quality, plus time and computing power, the above process will do the job, but there are some pitfalls along the way, especially for Neural Networks with a much bigger number of layers. For instance, finding a proper learning rate is a non-trivial task, and a bad weight initialization can result in slow convergence or even convergence to suboptimal solutions. There are alternatives that can be explored, such as the &lt;a href="https://medium.com/codechef-vit/why-bfgs-over-gradient-descent-3ecc3e7ffd" rel="noopener noreferrer"&gt;BFGS technique&lt;/a&gt;, as it can converge faster and does not require a manual definition of the learning rate.&lt;/p&gt;

&lt;p&gt;Two points remain to complete our basic knowledge about Neural Networks: bias and regularization. We’ll discuss them in the next sections.&lt;/p&gt;

&lt;h2&gt;
  
  
  Bias: enhancing flexibility
&lt;/h2&gt;

&lt;p&gt;One component we have skipped so far is the bias in Neural Networks. An important job it does is giving the Neural Network the ability to learn a different constant for each neuron, &lt;strong&gt;which is crucial for fitting diverse patterns in the data&lt;/strong&gt;. It also helps Neural Networks model nonlinear relationships. Moreover, it allows neurons to be “activated” even when the weighted sum of inputs is below zero, which is important for modeling situations where certain features may not contribute much to the output individually but still play a role when combined with other features.&lt;/p&gt;

&lt;p&gt;In order to introduce the bias into our model we need to adjust the equations we found a little bit. The first adjustment is in how we calculate the weighted input: we now add a matrix B containing the biases:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsf3gz8cnszyiqzozdpn6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsf3gz8cnszyiqzozdpn6.png" alt="Adding Bias" width="337" height="156"&gt;&lt;/a&gt;&lt;/p&gt;
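
&lt;p&gt;In code, this first adjustment is a one-line change to the forward pass: the weighted input gains a bias term that NumPy broadcasts across all samples (names here are illustrative):&lt;/p&gt;

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward_with_bias(X, W, b):
    """The weighted input gains a bias vector, broadcast across all rows."""
    Z = X @ W + b  # b holds one bias per neuron of the target layer
    return sigmoid(Z)

X = np.array([[3.0, 5.0]])
W = np.zeros((2, 3))            # zero weights isolate the effect of the bias
b = np.array([0.0, 1.0, -1.0])
print(forward_with_bias(X, W, b))  # equals sigmoid(b): bias alone can activate a neuron
```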

&lt;p&gt;We also need to update &lt;em&gt;B(2)&lt;/em&gt; and &lt;em&gt;B(3)&lt;/em&gt; during backpropagation. The idea is the same as the one applied to the weights, so if we start with &lt;em&gt;B(3)&lt;/em&gt;:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqi85jx0heb950u7szft0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqi85jx0heb950u7szft0.png" alt="Solving equations" width="513" height="772"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;And to update B(3) we do:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fk9sn5m2i2i1gdh3n4pnc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fk9sn5m2i2i1gdh3n4pnc.png" alt="Solved bias" width="271" height="47"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;For B(2) we follow the same process, so we can omit some steps:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3fir86l9s9bh998ewkvn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3fir86l9s9bh998ewkvn.png" alt="Solved bias" width="434" height="833"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;And to update it:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9ih0rf6syb32c31dbz7l.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9ih0rf6syb32c31dbz7l.png" alt="Solved bias b2" width="277" height="71"&gt;&lt;/a&gt;&lt;/p&gt;
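
&lt;p&gt;In a NumPy sketch, the bias update mirrors the weight update; since every sample shares the same bias, its gradient is the backpropagated delta summed over the samples (assuming the delta matrix has one row per sample):&lt;/p&gt;

```python
import numpy as np

def update_bias(b, delta, alpha):
    """Each sample contributes the same bias term, so the bias gradient is
    the backpropagated delta summed over the samples (rows)."""
    return b - alpha * delta.sum(axis=0)

# Toy deltas: 2 samples, 3 neurons in the layer.
delta = np.array([[0.1, 0.2, 0.3],
                  [0.3, 0.2, 0.1]])
print(update_bias(np.zeros(3), delta, alpha=0.5))  # [-0.2 -0.2 -0.2]
```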

&lt;p&gt;Now we can finally move to the last step :)&lt;/p&gt;

&lt;h2&gt;
  
  
  Overfitting: we can’t ignore the noise!
&lt;/h2&gt;

&lt;p&gt;Let’s assume that in a few days the same student will have another linear algebra test, and is hoping to pass it this time :). So the student uses our Neural Network to find a proper combination of hours of study and meditation to succeed on the test. Concluding that a score of 7 is enough, and testing with the vector X=[7,4], which is also a record from the previous test, the student decides to study for 7 hours and meditate for 4. When the test score comes out, the student sees a 5 :(&lt;/p&gt;

&lt;p&gt;There can be several reasons for the model to predict such a wrong score, like the student failing to review a topic the test focused on more heavily, the student having a headache during the test, or even bad luck.&lt;/p&gt;

&lt;p&gt;The important point here is that the &lt;strong&gt;data we use to design our model is just a perception of a fact, but between the fact &lt;em&gt;per se&lt;/em&gt; and our observation there is noise. In other words, there are other factors that influence the test score, and some of them we might not even be aware of&lt;/strong&gt;. If a model doesn’t take this into account, the Neural Network can end up having overfitting issues, which is when it performs well on training data but badly on unseen data.&lt;/p&gt;

&lt;p&gt;Overfitting verification is a must-have during Neural Network training, and one interesting way to do it is to split the training data into two parts: a bigger one to train the Neural Network while iterating through the epochs, and a smaller one to evaluate whether the training caused overfitting.&lt;/p&gt;
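
&lt;p&gt;A minimal sketch of such a split (the 80/20 ratio is just a common convention, not a rule):&lt;/p&gt;

```python
import numpy as np

def train_validation_split(X, y, train_fraction=0.8, seed=0):
    """Shuffle the samples, then hold out a slice to evaluate overfitting."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    cut = int(len(X) * train_fraction)
    return X[idx[:cut]], y[idx[:cut]], X[idx[cut:]], y[idx[cut:]]

X = np.arange(20).reshape(10, 2)  # 10 samples, 2 features
y = np.arange(10).reshape(10, 1)
X_tr, y_tr, X_val, y_val = train_validation_split(X, y)
print(len(X_tr), len(X_val))  # 8 2
```

A large gap between the error on the training slice and the error on the held-out slice is the usual symptom of overfitting.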

&lt;p&gt;To avoid such problems we can introduce a form of regularization that discourages overly complex models with excessively large parameter values. Common regularization techniques are &lt;a href="https://www.analyticsvidhya.com/blog/2022/08/regularization-in-machine-learning/#:~:text=overfitting%20or%20underfitting.-,Regularization%20Techniques,-Ridge%20Regularization" rel="noopener noreferrer"&gt;Ridge and Lasso regularization&lt;/a&gt;; in our approach we’ll use the Ridge method. To do so we need to adjust how we calculate the error cost E and the gradient components. For the error cost we can do:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy637ytlk43uwzsapfqie.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy637ytlk43uwzsapfqie.png" alt="Regularization" width="636" height="139"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In the above equation, the added term on the right can be interpreted as a &lt;strong&gt;penalty for complex models&lt;/strong&gt;. β is a regularization hyperparameter that lets us control the relative cost: a higher β implies bigger penalties for higher model complexity, resulting in less overfitting. We also multiply the left term by 1/n, where n is the number of input samples (the number of rows of X). This factor matters because the data term sums errors over every example, so without normalization it would grow with the size of the dataset, while the regularization term, which sums squared weights across all parameters, would not. Dividing by the number of examples keeps both terms on a consistent scale regardless of the dataset size; otherwise the balance between the loss term and the penalty would shift with every change in the amount of training data, making the optimization process overly sensitive to one term or the other.&lt;/p&gt;
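
&lt;p&gt;As an illustration, a regularized cost following that structure could be computed like this (a sketch; the exact constants should match whichever form of the equation you adopt):&lt;/p&gt;

```python
import numpy as np

def ridge_cost(y, y_hat, weights, beta):
    """Normalized squared-error term plus an L2 (Ridge) penalty over all weights."""
    n = len(y)  # number of input samples (rows of X)
    data_term = 0.5 * np.sum((y - y_hat) ** 2) / n
    penalty = (beta / 2.0) * sum(np.sum(W ** 2) for W in weights)
    return data_term + penalty

y = np.array([[1.0], [0.0]])
y_hat = np.array([[0.8], [0.2]])
W2 = np.ones((2, 2))  # 4 weights, squared sum = 4
print(round(ridge_cost(y, y_hat, [W2], beta=0.0), 6))   # 0.02: pure data term
print(round(ridge_cost(y, y_hat, [W2], beta=0.01), 6))  # 0.04: data term + 0.005 * 4
```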

&lt;p&gt;For the gradient components we can do the following:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fguh652ttj1qcjrzypicc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fguh652ttj1qcjrzypicc.png" alt="Solved equation" width="450" height="201"&gt;&lt;/a&gt;&lt;/p&gt;
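
&lt;p&gt;In code, the change to the gradient components amounts to adding the scaled weights (a sketch; the β value below is arbitrary):&lt;/p&gt;

```python
import numpy as np

def regularized_gradient(dJdW, W, beta):
    """Ridge adds beta * W to each gradient component, nudging large
    weights back toward zero on every update."""
    return dJdW + beta * W

dJdW = np.array([[0.1, 0.1]])
W = np.array([[2.0, -2.0]])
print(regularized_gradient(dJdW, W, beta=0.05))
# the large positive weight is penalized harder; the negative one is pulled up
```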

&lt;p&gt;And that’s it :)&lt;/p&gt;

&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;p&gt;We went through a lot of points in deep detail, so let’s wrap it up as a checklist of “what to take care of while creating a neural network to solve a supervised machine learning problem”.&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Understand the problem
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;What is the problem we are trying to solve?&lt;/li&gt;
&lt;li&gt;What are the inputs of the model we need to create?&lt;/li&gt;
&lt;li&gt;What are the outputs of the model we need to create?&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  2. Neural Network topology
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;How many dimensions does our input layer require?&lt;/li&gt;
&lt;li&gt;How many dimensions does our output layer require?&lt;/li&gt;
&lt;li&gt;Which hyperparameters need initialization?&lt;/li&gt;
&lt;li&gt;How do we initialize weights? Random values?&lt;/li&gt;
&lt;li&gt;How do we initialize biases? Random values?&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  3. Training data
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;How much data do we need to train the model?&lt;/li&gt;
&lt;li&gt;Does the training dataset require cleaning?&lt;/li&gt;
&lt;li&gt;How much of the dataset will be used for training and how much to validate the training?&lt;/li&gt;
&lt;li&gt;How can we check if we have overfitting problems?&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Reference Code
&lt;/h2&gt;

&lt;p&gt;I put together a minimalist implementation of a Neural Network that solves the problem of the student trying to pass the linear algebra test; you can see the code in &lt;a href="https://github.com/buarki/supervised-machine-learning" rel="noopener noreferrer"&gt;my GitHub repository&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Key References
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://web.njit.edu/~usman/courses/cs675_fall18/10.1.1.441.7873.pdf" rel="noopener noreferrer"&gt;Approximation by Superpositions of a Sigmoidal Function&lt;/a&gt;; G. Cybenko.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://www.deep-mind.org/2023/03/26/the-universal-approximation-theorem/" rel="noopener noreferrer"&gt;The Universal Approximation Theorem&lt;/a&gt;; Alexander.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://mitsloan.mit.edu/ideas-made-to-matter/machine-learning-explained" rel="noopener noreferrer"&gt;Machine learning, explained&lt;/a&gt;; Sara Brown.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This article was originally posted on my personal site: &lt;a href="https://www.buarki.com/blog/ml-supervised-learning" rel="noopener noreferrer"&gt;https://www.buarki.com/blog/ml-supervised-learning&lt;/a&gt;&lt;/p&gt;

</description>
      <category>go</category>
      <category>tutorial</category>
      <category>ai</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>Snap out of layers: Vertical Slices for the win!</title>
      <dc:creator>Aurelio Buarque</dc:creator>
      <pubDate>Wed, 08 Nov 2023 00:22:47 +0000</pubDate>
      <link>https://dev.to/buarki/snap-out-of-layers-why-slices-are-a-better-way-to-streamline-your-code-organization-1m4f</link>
      <guid>https://dev.to/buarki/snap-out-of-layers-why-slices-are-a-better-way-to-streamline-your-code-organization-1m4f</guid>
      <description>&lt;p&gt;In the following sections of this article, we will explore alternative design approaches that address challenges introduced by using &lt;a href="https://www.oreilly.com/library/view/software-architecture-patterns/9781491971437/ch01.html" rel="noopener noreferrer"&gt;“Layered Architecture”&lt;/a&gt;. These solutions aim to make business intentions more explicit, reduce tight coupling, and improve the overall maintainability of your codebase. By the end, you will have a clearer understanding of how to structure your backend applications for greater efficiency and ease of development.&lt;/p&gt;

&lt;h2&gt;
  
  
  A case study: how is code structured in backend applications most of the time?
&lt;/h2&gt;

&lt;p&gt;I must say that it is very common to open a brand new project and see the following structure:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frt7s7cllks6caxjgqxuu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frt7s7cllks6caxjgqxuu.png" alt="A typical folder structure of a layered architecture" width="408" height="362"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The above image represents a typical application using &lt;strong&gt;Layered Architecture&lt;/strong&gt;. This approach organizes code by &lt;strong&gt;technical concerns&lt;/strong&gt;, such as controllers, services, repositories and so on.&lt;/p&gt;

&lt;p&gt;I have worked on plenty of projects using that exact structure and, I must say, they worked fine. But pretty much every time I picked up a new one like this I asked myself: which features is this project delivering? How is it supporting the business needs?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Organizing code by technical concerns doesn’t make it explicit which business problems the code is trying to address&lt;/strong&gt;. To figure that out you must dig into it, probably starting at the controller, checking which services it calls, which repositories those services use, and so on.&lt;/p&gt;

&lt;h2&gt;
  
  
  Implications of such design
&lt;/h2&gt;

&lt;p&gt;Figuring out what such systems do usually takes some time, as said before, and it will probably require a few code walkthroughs with a coworker who knows the codebase better. As a consequence, this has a direct &lt;strong&gt;impact on the time to deliver new features or fixes&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Such a design also pushes for &lt;strong&gt;tight coupling&lt;/strong&gt; between the system components, because soon a cross-entity feature will pop up, such as “listing all tasks of a user”, and the application will end up in the situation where the “user-service” is calling the “task-repository”. The intention behind this is fair and good: it is called &lt;a href="https://en.wikipedia.org/wiki/Don%27t_repeat_yourself" rel="noopener noreferrer"&gt;Don’t Repeat Yourself (DRY)&lt;/a&gt;. The DRY principle encourages reusing code to reduce duplication. But applying it in certain scenarios, especially in cases of over-reuse, makes things so coupled that it is very hard to modify a piece of code without unintended side effects somewhere else: since the code is shared across different features, no modification stays isolated.&lt;/p&gt;

&lt;p&gt;Furthermore, the tight coupling between components in this traditional layered architecture can also become a significant pain point when multiple developers are collaborating on the same codebase. When different team members are working on different parts of the system that share tightly coupled dependencies, the likelihood of encountering merge conflicts significantly increases. As a consequence, developers find themselves in a race to resolve these conflicts. This race involves inspecting and manually merging conflicting code, which can be a time-consuming and error-prone process. Not only does it slow down the development process, but it also introduces the risk of introducing unintended bugs or breaking other parts of the system during conflict resolution. In essence, the tightly coupled nature of the traditional architecture exacerbates the problem of merge conflicts, making it an issue that can no longer be ignored when multiple team members are actively working on the same codebase. &lt;strong&gt;This issue can result in a domino effect, affecting the development timeline and potentially compromising code stability.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;In the end, the main consequence of such design is that &lt;strong&gt;business intentions gets obfuscated&lt;/strong&gt; and the comprehension of the project as a whole will rely on the deep analysis of code.&lt;/p&gt;

&lt;p&gt;As a concrete example, assume the features that the service from the screenshot must cover are: (1) the user needs to sign up; (2) the user needs to log in; (3) the user can create a task, like on Jira; (4) the user can edit a task description; (5) the user can delete a task; (6) and the user can list their own created tasks. If we break down the layered application it might look like the image below:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fiqjxxc44fbos1tinlum9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fiqjxxc44fbos1tinlum9.png" alt="Typical layers organization of layered architecture" width="720" height="584"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The feature “list all user’s tasks”, implemented by the “user-service”, requires a query implemented by the “task-repository”, which creates a coupling between them. The reason is that both the “user-service” and the “task-service” need to do that listing, so to avoid duplicating code we make both use the same implementation (the DRY principle). Now suppose that the task listing in the “task context” changes and should now be able to list tasks of active users, inactive users, or both. In order to keep the “user-service” using the “task-repository”, this modification will also be propagated to it, and it’s easy to see where this leads us: truly spaghetti code :)&lt;/p&gt;

&lt;h2&gt;
  
  
  A suggestion to improve code organization: Vertical Slice Architecture (VSA)
&lt;/h2&gt;

&lt;p&gt;If you consider the three layers presented as parts of a big cake, we could also start looking at our features as slices of it. That way, all the technical concerns needed for each feature are grouped together, ensuring minimal side effects in case of modifications. Take a look at the pictures below:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4tlh1my9q95ipi0579e1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4tlh1my9q95ipi0579e1.png" alt="A visual representation of vertical slices architecture" width="720" height="634"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4cx6y5kk3ksw569eetjn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4cx6y5kk3ksw569eetjn.png" alt="Folder structure of application using vertical slices" width="347" height="262"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The first notable difference we must point out is that now the &lt;strong&gt;code structure is much closer to the feature requirements&lt;/strong&gt;; actually, we have a 1:1 match with how users in fact use the system, and probably with how the team’s Product Owner lists the available features :)&lt;/p&gt;

&lt;p&gt;By following this approach we &lt;strong&gt;enforce the system features to be treated as independent components&lt;/strong&gt; that can be created and evolved independently. We also push for lower coupling between the system components, making the &lt;strong&gt;slices more cohesive&lt;/strong&gt;. The time to grasp what the system does also decreases, as the top-level navigation maps to the feature requirements per se and one doesn’t need to understand the system as a whole, since each &lt;strong&gt;feature’s code now has clear boundaries&lt;/strong&gt;. And for sure, &lt;strong&gt;adding new features becomes a more straightforward exercise with a much lower risk of unintended side effects&lt;/strong&gt;.&lt;/p&gt;
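
&lt;p&gt;As a toy illustration of the idea (the names below are hypothetical, not taken from the example project), a slice keeps everything its feature needs, validation, business rule and persistence, in one place:&lt;/p&gt;

```python
# slices/create_task.py: everything the "create task" feature needs lives
# together -- request validation, the business rule and the persistence.
from dataclasses import dataclass, field
from typing import Dict

@dataclass
class CreateTaskSlice:
    # The slice owns its own storage access instead of sharing a generic
    # task-repository with unrelated features.
    tasks: Dict[str, str] = field(default_factory=dict)

    def handle(self, task_id: str, description: str) -> str:
        if not description:
            raise ValueError("a task needs a description")
        self.tasks[task_id] = description
        return task_id

slice_ = CreateTaskSlice()
print(slice_.handle("t-1", "write the article"))  # t-1
```

Changing how tasks are listed elsewhere never touches this module, which is the whole point of the slice boundary.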

&lt;h2&gt;
  
  
  Even before you ask: yes, some things for sure will be shared between slices
&lt;/h2&gt;

&lt;p&gt;While reading the general idea of VSA one may think it implies the project having nothing shared at all between slices, but that is not the case. If you find duplicated code between two slices, there’s no problem in extracting it into a shared package, usually called kernel or shared. You’ll also realize this won’t happen so frequently, because for that scenario to occur the piece of code to be shared must perform pretty much the exact same business-related action (something not expected to happen often when organizing code by features).&lt;/p&gt;

&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;There’s no silver bullet when building software&lt;/strong&gt;. From my experience scaling systems using vertical slice architecture, I must say it gave me, and the teams I worked on, the flexibility to ship new features fast, and it made it easy to apply fixes and refactorings in isolation with minimal side effects, &lt;strong&gt;especially during the maintenance phase&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;For sure, newcomers to this design approach will face some initial friction, but after a few iterations the general idea will stick, and in case you need some mentoring on it you can reach out to me ;)&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;This article was originally posted on my personal site: &lt;a href="https://www.buarki.com/blog/clear-code-with-vertical-slices" rel="noopener noreferrer"&gt;https://www.buarki.com/blog/clear-code-with-vertical-slices&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>architecture</category>
      <category>designpatterns</category>
      <category>cleancode</category>
    </item>
    <item>
      <title>Hexagonal Architecture x Onion Architecture x Clean Architecture: their differences</title>
      <dc:creator>Aurelio Buarque</dc:creator>
      <pubDate>Thu, 02 Nov 2023 07:19:05 +0000</pubDate>
      <link>https://dev.to/buarki/hexagonal-architecture-x-onion-architecture-x-clean-architecture-their-differences-4h9j</link>
      <guid>https://dev.to/buarki/hexagonal-architecture-x-onion-architecture-x-clean-architecture-their-differences-4h9j</guid>
      <description>&lt;h2&gt;
  
  
  Hexagonal Architecture
&lt;/h2&gt;

&lt;p&gt;I bet you have heard about this one, for sure. If not, just google something like &lt;strong&gt;how to build an API with hexagonal architecture&lt;/strong&gt; and you’ll see plenty of tutorials, examples and GitHub repositories. And I also bet that the majority of them will not explain why it has such a name.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Hexagonal architecture has nothing to do with hexagons&lt;/strong&gt;. A better name for it would be the one its author gave it: &lt;strong&gt;Ports and Adapters&lt;/strong&gt;. This pattern was proposed by &lt;strong&gt;Alistair Cockburn&lt;/strong&gt;, and the &lt;a href="https://alistair.cockburn.us/hexagonal-architecture/" rel="noopener noreferrer"&gt;original publication is available on his blog.&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Its main idea is to protect the core product source code from changes in external dependencies, such as the database (we can discuss how frequently we actually change databases during projects, but that's for another article later on). The basic idea is: every time we detect that something is not part of the system per se, we add an abstraction for it (&lt;strong&gt;a port&lt;/strong&gt;), which in OOP languages might be an interface, and then we create a concrete implementation of that abstraction, named the &lt;strong&gt;adapter&lt;/strong&gt;, which could be a class that implements the interface and is injected into the code.&lt;/p&gt;

&lt;p&gt;Just to give you a simple and minimalist example to think about, consider a system that needs to perform queries against a database, read and write data on a cache, and fetch some data from a CMS. For each one of these needs, you could create interfaces describing how they should behave. That way, you could forget about database-specific configs, for example, and focus on coding what you need. Suppose you decide to use Postgres for production and SQLite for local development. In that case, you could create one class for each of these databases, implementing the contract established by the interface. This approach allows your application to be more flexible and maintainable in the long run.&lt;/p&gt;
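&lt;p&gt;The example above can be sketched in Go. This is a minimal illustration of the port/adapter split, not code from the pattern's publication: the &lt;code&gt;UserStore&lt;/code&gt; port and the in-memory adapter are hypothetical names standing in for the Postgres and SQLite implementations.&lt;/p&gt;

```go
package main

import "fmt"

// Port: the abstraction the core depends on.
// Postgres, SQLite, or an in-memory fake can all satisfy it.
type UserStore interface {
	FindName(id int) (string, error)
}

// Adapter: an in-memory implementation, standing in for a real database.
type memoryStore struct {
	users map[int]string
}

func (m memoryStore) FindName(id int) (string, error) {
	name, ok := m.users[id]
	if !ok {
		return "", fmt.Errorf("user %d not found", id)
	}
	return name, nil
}

// Core logic knows only the port, never the concrete database.
func greet(store UserStore, id int) string {
	name, err := store.FindName(id)
	if err != nil {
		return "hello, stranger"
	}
	return "hello, " + name
}

func main() {
	store := memoryStore{users: map[int]string{1: "Ada"}}
	fmt.Println(greet(store, 1)) // prints "hello, Ada"
	fmt.Println(greet(store, 2)) // prints "hello, stranger"
}
```

&lt;p&gt;Swapping Postgres for SQLite would then mean writing another type with a &lt;code&gt;FindName&lt;/code&gt; method and injecting it; &lt;code&gt;greet&lt;/code&gt; stays untouched.&lt;/p&gt;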

&lt;h2&gt;
  
  
  Onion Architecture
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Onion Architecture&lt;/strong&gt; was proposed by &lt;strong&gt;Jeffrey Palermo&lt;/strong&gt;, and you can check the &lt;a href="https://jeffreypalermo.com/2008/07/the-onion-architecture-part-1/" rel="noopener noreferrer"&gt;original article on his blog&lt;/a&gt;. It extends the Hexagonal Architecture's idea of protecting the system's core by introducing layers surrounding it. This core is named the &lt;strong&gt;domain model&lt;/strong&gt;, and it must define business rules and invariants.&lt;/p&gt;

&lt;p&gt;On top of the core we create the &lt;strong&gt;domain services&lt;/strong&gt; layer, which adds objects that support enterprise services and business rules that don't fit well into the domain model, such as validations that require some database interaction.&lt;/p&gt;

&lt;p&gt;Wrapping the domain services layer we have the &lt;strong&gt;application services&lt;/strong&gt; layer, which coordinates execution flows, like a request hitting a controller that needs to call a domain service to perform some action.&lt;/p&gt;
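&lt;p&gt;To make the layering concrete, here is a hedged Go sketch under assumed names (&lt;code&gt;Order&lt;/code&gt;, &lt;code&gt;applyBulkDiscount&lt;/code&gt;, and &lt;code&gt;checkout&lt;/code&gt; are illustrative, not from Palermo's article): the domain model holds an invariant, the domain service holds a rule that spans it, and the application service coordinates the flow a controller would trigger.&lt;/p&gt;

```go
package main

import "fmt"

// Domain model (the core): business rules and invariants only.
type Order struct {
	Items int
	Total float64
}

// Invariant: an order must contain at least one item.
func (o Order) Valid() bool {
	return o.Items != 0
}

// Domain service: a business rule that doesn't belong to a single entity.
func applyBulkDiscount(o Order) Order {
	if o.Items >= 10 {
		o.Total = o.Total * 0.9 // 10% off bulk orders
	}
	return o
}

// Application service: coordinates the flow a controller would call.
// Note the direction of dependencies: it calls inward, never outward.
func checkout(o Order) (Order, error) {
	if !o.Valid() {
		return o, fmt.Errorf("invalid order")
	}
	return applyBulkDiscount(o), nil
}

func main() {
	o, err := checkout(Order{Items: 12, Total: 100})
	fmt.Println(o.Total, err) // prints 90 for the discounted order
}
```

&lt;p&gt;The point of the sketch is the dependency direction: outer layers depend on inner ones, and the domain model depends on nothing.&lt;/p&gt;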

&lt;h2&gt;
  
  
  Clean Architecture
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Clean Architecture&lt;/strong&gt; was proposed by our dear &lt;strong&gt;Uncle Bob&lt;/strong&gt;, and you can check the &lt;a href="https://blog.cleancoder.com/uncle-bob/2012/08/13/the-clean-architecture.html" rel="noopener noreferrer"&gt;original article on his blog&lt;/a&gt;. In short, it extends the Onion Architecture; the key difference is that the domain model is here called an &lt;strong&gt;entity&lt;/strong&gt; and the application service is called a &lt;strong&gt;use case&lt;/strong&gt;, which is good because it gives more visibility into what the application actually does.&lt;/p&gt;

&lt;h2&gt;
  
  
  Summarizing
&lt;/h2&gt;

&lt;p&gt;All three ideas above propose the same thing: decouple the business rules from framework- and infrastructure-specific code, arguing that this makes the application flexible and maintainable in the long run.&lt;/p&gt;

</description>
      <category>cleanarch</category>
      <category>cleancode</category>
      <category>onionarchitecture</category>
      <category>hexagonalarchitectu</category>
    </item>
  </channel>
</rss>
