<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Ahmad Ganjtabesh</title>
    <description>The latest articles on DEV Community by Ahmad Ganjtabesh (@agtabesh).</description>
    <link>https://dev.to/agtabesh</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1321924%2F0eedd9fd-da7f-4367-bec9-45f1fa094c06.png</url>
      <title>DEV Community: Ahmad Ganjtabesh</title>
      <link>https://dev.to/agtabesh</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/agtabesh"/>
    <language>en</language>
    <item>
      <title>The Primary Responsibility of a Software Engineer: Delaying Decisions</title>
      <dc:creator>Ahmad Ganjtabesh</dc:creator>
      <pubDate>Thu, 10 Oct 2024 09:19:10 +0000</pubDate>
      <link>https://dev.to/agtabesh/the-primary-responsibility-of-a-software-engineer-delaying-decisions-4pj6</link>
      <guid>https://dev.to/agtabesh/the-primary-responsibility-of-a-software-engineer-delaying-decisions-4pj6</guid>
      <description>&lt;p&gt;At the heart of software engineering, our most visible task is building the features that businesses rely on. We work closely with stakeholders to develop the functionality that drives the business forward, whether it’s handling payments, managing customer data, or delivering personalized experiences to users. &lt;strong&gt;Delivering these features is critical&lt;/strong&gt;, but it’s not the most important part of the job.&lt;/p&gt;

&lt;p&gt;I believe that the &lt;strong&gt;primary responsibility&lt;/strong&gt; of a software engineer is to &lt;strong&gt;design systems that give the team flexibility to delay certain key decisions until later&lt;/strong&gt;, when there's more information available. Delaying decisions may sound counterintuitive, but it's a strategy that can significantly reduce risk, increase adaptability, and ultimately lead to more robust systems.&lt;/p&gt;

&lt;p&gt;Software development is full of uncertainties: business needs evolve, user behavior changes, data volume grows, and new technologies emerge. If we make key decisions too early - such as selecting a database, cloud provider, or framework - we may lock ourselves into choices that limit flexibility in the future. What might work well for a small project or MVP (Minimum Viable Product) could break down as the system scales or as business requirements change.&lt;/p&gt;

&lt;h4&gt;The Importance of Delaying Decisions&lt;/h4&gt;

&lt;p&gt;Making decisions with incomplete information can lead to technical debt. When we choose technology or architecture early on, it's often based on assumptions about how the system will behave in the future. But reality rarely matches those assumptions. By postponing decisions, we reduce the risk of locking ourselves into choices that don't hold up as the project evolves.&lt;/p&gt;

&lt;p&gt;Take, for example, the decision of &lt;strong&gt;choosing a database&lt;/strong&gt;. At the start of a project, you might not have a clear picture of the types of queries you'll need, the volume of data you'll store, or the complexity of relationships in the data. Committing to a specific database early could cause issues if the project later requires more complex querying, higher scalability, or better indexing. By delaying this decision - perhaps starting with an abstracted data access layer - you give yourself the flexibility to switch to a different database later without reworking large parts of the codebase.&lt;/p&gt;
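&lt;p&gt;As a rough sketch of what such an abstracted data access layer can look like in Go (the &lt;code&gt;UserStore&lt;/code&gt; interface and its in-memory implementation below are illustrative, not a prescribed design):&lt;/p&gt;

```go
package main

import "fmt"

// User is a hypothetical domain type used for illustration.
type User struct {
	ID   string
	Name string
}

// UserStore is the abstracted data access layer: business logic
// depends only on this interface, never on a concrete driver, so
// the database choice can be made (or changed) later.
type UserStore interface {
	Save(u User) error
	FindByID(id string) (User, error)
}

// MemoryStore is a placeholder implementation, good enough to
// ship features with while the real database decision waits.
type MemoryStore struct {
	users map[string]User
}

func NewMemoryStore() MemoryStore {
	return MemoryStore{users: make(map[string]User)}
}

func (s MemoryStore) Save(u User) error {
	s.users[u.ID] = u
	return nil
}

func (s MemoryStore) FindByID(id string) (User, error) {
	u, ok := s.users[id]
	if !ok {
		return User{}, fmt.Errorf("user %q not found", id)
	}
	return u, nil
}

func main() {
	var store UserStore = NewMemoryStore()
	store.Save(User{ID: "42", Name: "Ada"})
	u, err := store.FindByID("42")
	fmt.Println(u.Name, err)
}
```

&lt;p&gt;When the real database is finally chosen, only a new &lt;code&gt;UserStore&lt;/code&gt; implementation gets written; the business logic calling it stays untouched.&lt;/p&gt;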

&lt;p&gt;Delaying decisions isn't about avoiding responsibility or kicking the can down the road. It's about &lt;strong&gt;waiting for the right time to make decisions&lt;/strong&gt; - when you have more clarity on the system's real needs. This leads to better, more informed choices that are sustainable in the long term.&lt;/p&gt;

&lt;h4&gt;How to Delay Decisions: Abstraction and Isolation&lt;/h4&gt;

&lt;p&gt;The key to effectively delaying decisions is through &lt;strong&gt;abstraction&lt;/strong&gt; and &lt;strong&gt;isolation&lt;/strong&gt;. By isolating different layers of the system and abstracting them from one another, you can minimize the impact of changes to one part of the system when it's time to make a decision.&lt;/p&gt;

&lt;p&gt;For instance, in the case of choosing a &lt;strong&gt;cloud provider&lt;/strong&gt;, many companies start by tightly integrating their systems with specific services offered by platforms like AWS, Google Cloud, or Azure. But what happens if the business later wants to migrate to a different provider for cost reasons or regulatory compliance? If the system is tightly coupled to a single provider's APIs, switching becomes a painful and costly process. However, by abstracting cloud-specific logic behind interfaces or adopting cloud-agnostic solutions like Kubernetes, you delay the decision of committing fully to one provider. This gives you the flexibility to migrate or expand to multiple providers when the need arises, without significant rework.&lt;/p&gt;

&lt;p&gt;Abstraction can also apply to &lt;strong&gt;logging&lt;/strong&gt;, &lt;strong&gt;monitoring&lt;/strong&gt;, or even &lt;strong&gt;authentication&lt;/strong&gt;. Instead of choosing a logging framework early on and embedding it deep in your business logic, you can abstract logging behind a common interface. This allows you to start with a basic logging mechanism during development and later switch to a more advanced solution (like the ELK stack or Datadog) as the project scales - without modifying the core of your application.&lt;/p&gt;
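&lt;p&gt;For example, a logging abstraction along these lines (the &lt;code&gt;Logger&lt;/code&gt; interface and the &lt;code&gt;processPayment&lt;/code&gt; function are hypothetical):&lt;/p&gt;

```go
package main

import "log"

// Logger is the common interface the application codes against;
// the backend (the standard library now, Datadog or the ELK
// stack later) stays swappable without touching business logic.
type Logger interface {
	Info(msg string)
	Error(msg string)
}

// stdLogger adapts the standard library's log package.
type stdLogger struct{}

func (stdLogger) Info(msg string)  { log.Println("INFO:", msg) }
func (stdLogger) Error(msg string) { log.Println("ERROR:", msg) }

// processPayment is hypothetical business logic that only ever
// sees the Logger interface. It reports whether the payment was
// accepted.
func processPayment(l Logger, amount int) bool {
	l.Info("processing payment")
	if amount > 0 {
		return true
	}
	l.Error("non-positive amount rejected")
	return false
}

func main() {
	processPayment(stdLogger{}, 100)
}
```

&lt;p&gt;Swapping the backend later means writing one new &lt;code&gt;Logger&lt;/code&gt; implementation; &lt;code&gt;processPayment&lt;/code&gt; never changes.&lt;/p&gt;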

&lt;h4&gt;Balancing with the Secondary Responsibility: Delivering Features&lt;/h4&gt;

&lt;p&gt;While the primary responsibility is about creating flexibility and future-proofing systems, the &lt;strong&gt;secondary responsibility&lt;/strong&gt; of a software engineer is just as important: &lt;strong&gt;delivering features&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;The business requires functionality to meet immediate goals - whether it's releasing a new product feature, enabling a payment system, or building an internal tool for operations. These features directly contribute to the value of the business, and it's our job to implement them efficiently and with quality. After all, the system won't be useful if it doesn't deliver the functionality that users and stakeholders need.&lt;/p&gt;

&lt;p&gt;However, delivering features can sometimes be at odds with the goal of delaying decisions. There is often pressure to deliver quickly, and it can be tempting to hardcode specific choices or skip the abstraction process to speed things up. For example, an engineer might choose a specific &lt;strong&gt;third-party authentication provider&lt;/strong&gt; and tightly integrate it into the code, believing that this will get the feature done faster. And in the short term, it might. But if the business later decides to switch providers or add support for multiple authentication methods, the tightly coupled implementation will lead to costly refactoring.&lt;/p&gt;

&lt;p&gt;This is where balancing both responsibilities becomes critical. While it's necessary to deliver features to meet business needs, engineers must ensure that those features are delivered in a way that keeps the system flexible. Abstracting key parts of the system and keeping components loosely coupled allows features to be built without locking the system into decisions that may no longer make sense in the future.&lt;/p&gt;

&lt;p&gt;For example, while delivering a feature that involves &lt;strong&gt;user data storage&lt;/strong&gt;, we might opt to use an abstracted data layer that allows us to switch databases later, rather than committing to one specific database right away. This way, we deliver the feature while still keeping the system adaptable for future changes.&lt;/p&gt;

&lt;h4&gt;Why Delaying Decisions and Delivering Features Go Hand-in-Hand&lt;/h4&gt;

&lt;p&gt;Ultimately, the two responsibilities - delaying decisions and delivering features - are not in conflict; they complement each other. By building systems that allow decisions to be postponed, engineers reduce the risk of technical debt and future-proof the system. And by delivering features thoughtfully, with abstraction and flexibility in mind, we ensure that business needs are met without compromising long-term adaptability.&lt;/p&gt;

&lt;p&gt;When both responsibilities are managed effectively, we create systems that &lt;strong&gt;deliver immediate business value through features that work well today&lt;/strong&gt;. At the same time, these systems &lt;strong&gt;remain flexible and adaptable to future changes&lt;/strong&gt;, allowing key decisions to be made when the right time comes.&lt;/p&gt;

&lt;p&gt;A strong software engineer understands that delivering features isn't just about solving today's problems - it's about ensuring that the system can grow and evolve smoothly in the future. This mindset leads to more maintainable, scalable, and resilient systems.&lt;/p&gt;

&lt;h4&gt;Conclusion: A Balance for Sustainable Software Development&lt;/h4&gt;

&lt;p&gt;In summary, while delivering features is critical and serves the immediate needs of the business, the &lt;strong&gt;primary responsibility&lt;/strong&gt; of a software engineer is to design systems that give the team the flexibility to make better decisions in the future. By abstracting key components, decoupling systems, and postponing decisions like choosing a database or cloud provider, we can create adaptable systems that evolve smoothly as requirements change.&lt;/p&gt;

&lt;p&gt;Balancing both responsibilities ensures that we're not only meeting the business's current needs but also protecting the long-term health and scalability of the system. In doing so, we set our teams up for success, now and in the future.&lt;/p&gt;

&lt;p&gt;Thank you for reading! Feel free to leave your comments below, and let’s stay connected on &lt;a href="https://www.linkedin.com/in/agtabesh" rel="noopener noreferrer"&gt;LinkedIn&lt;/a&gt; or follow me on &lt;a href="https://x.com/agtabesh1" rel="noopener noreferrer"&gt;Twitter&lt;/a&gt; or &lt;a href="https://github.com/agtabesh" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt; for more insights and updates.&lt;/p&gt;

</description>
      <category>softwareengineering</category>
      <category>softwaredevelopment</category>
      <category>architecture</category>
      <category>productivity</category>
    </item>
    <item>
      <title>Speeding Up Go Concurrency: Unlocking the Power of Array Segmentation</title>
      <dc:creator>Ahmad Ganjtabesh</dc:creator>
      <pubDate>Mon, 01 Jul 2024 09:36:45 +0000</pubDate>
      <link>https://dev.to/agtabesh/speeding-up-go-concurrency-unlocking-the-power-of-array-segmentation-3dl2</link>
      <guid>https://dev.to/agtabesh/speeding-up-go-concurrency-unlocking-the-power-of-array-segmentation-3dl2</guid>
      <description>&lt;p&gt;Hey there, fellow Gophers! 🌟&lt;/p&gt;

&lt;p&gt;If you’ve been playing around with Go, you know that goroutines are the bread and butter of concurrency. But here’s the thing: when multiple goroutines try to access the same array, things can get messy real quick. You end up having to use mutex locks to avoid race conditions, and that can slow things down big time. But don’t worry, I’ve got a cool trick up my sleeve – segmenting the array to let goroutines work independently without stepping on each other’s toes. Let’s dive in!&lt;/p&gt;

&lt;h4&gt;Use Case: Crunching Big Data&lt;/h4&gt;

&lt;p&gt;Imagine you’re building a web scraper that collects tons of data from different sources and dumps it all into a big array. You’ve got multiple goroutines fetching and processing data, and you want them to work concurrently to speed things up. But how do you keep them from getting in each other’s way?&lt;/p&gt;

&lt;h4&gt;The Mutex Approach&lt;/h4&gt;

&lt;p&gt;First up, let’s see how we usually do it with mutex locks:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;package&lt;/span&gt; &lt;span class="n"&gt;main&lt;/span&gt;

&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="s"&gt;"fmt"&lt;/span&gt;
    &lt;span class="s"&gt;"sync"&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="n"&gt;main&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;const&lt;/span&gt; &lt;span class="n"&gt;size&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="m"&gt;1000000&lt;/span&gt;
    &lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="nb"&gt;make&lt;/span&gt;&lt;span class="p"&gt;([]&lt;/span&gt;&lt;span class="kt"&gt;int&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;size&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;var&lt;/span&gt; &lt;span class="n"&gt;mu&lt;/span&gt; &lt;span class="n"&gt;sync&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Mutex&lt;/span&gt;
    &lt;span class="k"&gt;var&lt;/span&gt; &lt;span class="n"&gt;wg&lt;/span&gt; &lt;span class="n"&gt;sync&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;WaitGroup&lt;/span&gt;

    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="m"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="m"&gt;10&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="o"&gt;++&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;wg&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Add&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;go&lt;/span&gt; &lt;span class="k"&gt;func&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;start&lt;/span&gt; &lt;span class="kt"&gt;int&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="k"&gt;defer&lt;/span&gt; &lt;span class="n"&gt;wg&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Done&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
            &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;j&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;start&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;j&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;start&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="n"&gt;size&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="m"&gt;10&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;j&lt;/span&gt;&lt;span class="o"&gt;++&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                &lt;span class="n"&gt;mu&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Lock&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
                &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;j&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;j&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="m"&gt;2&lt;/span&gt; &lt;span class="c"&gt;// Example operation&lt;/span&gt;
                &lt;span class="n"&gt;mu&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Unlock&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
            &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="p"&gt;}(&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;size&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="m"&gt;10&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="n"&gt;wg&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Wait&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;fmt&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Println&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Processing complete"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In this setup, every goroutine acquires the mutex for every single write, so the writes end up serialized even though each goroutine already targets its own range of indices. It’s safe, but man, is it slow!&lt;/p&gt;

&lt;h4&gt;The Segmentation Magic&lt;/h4&gt;

&lt;p&gt;Now, let’s try something cooler. We’ll segment the array so each goroutine works on a different part without any locks:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;package&lt;/span&gt; &lt;span class="n"&gt;main&lt;/span&gt;

&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="s"&gt;"fmt"&lt;/span&gt;
    &lt;span class="s"&gt;"sync"&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="n"&gt;main&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;const&lt;/span&gt; &lt;span class="n"&gt;size&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="m"&gt;1000000&lt;/span&gt;
    &lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="nb"&gt;make&lt;/span&gt;&lt;span class="p"&gt;([]&lt;/span&gt;&lt;span class="kt"&gt;int&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;size&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;var&lt;/span&gt; &lt;span class="n"&gt;wg&lt;/span&gt; &lt;span class="n"&gt;sync&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;WaitGroup&lt;/span&gt;

    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="m"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="m"&gt;10&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="o"&gt;++&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;wg&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Add&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;go&lt;/span&gt; &lt;span class="k"&gt;func&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;start&lt;/span&gt; &lt;span class="kt"&gt;int&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="k"&gt;defer&lt;/span&gt; &lt;span class="n"&gt;wg&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Done&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
            &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;j&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;start&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;j&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;start&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="n"&gt;size&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="m"&gt;10&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;j&lt;/span&gt;&lt;span class="o"&gt;++&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;j&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;j&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="m"&gt;2&lt;/span&gt; &lt;span class="c"&gt;// Example operation&lt;/span&gt;
            &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="p"&gt;}(&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;size&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="m"&gt;10&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="n"&gt;wg&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Wait&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;fmt&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Println&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Processing complete"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here, each goroutine handles a specific segment of the array, and guess what – no locks needed! 🚀&lt;/p&gt;

&lt;h4&gt;Showdown: Benchmarking Both Approaches&lt;/h4&gt;

&lt;p&gt;Alright, time to pit these two against each other and see which one’s faster. Here’s the benchmark code:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;package&lt;/span&gt; &lt;span class="n"&gt;main&lt;/span&gt;

&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="s"&gt;"sync"&lt;/span&gt;
    &lt;span class="s"&gt;"testing"&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;const&lt;/span&gt; &lt;span class="n"&gt;size&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="m"&gt;1000000&lt;/span&gt;

&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="n"&gt;ProcessWithMutex&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;&lt;span class="kt"&gt;int&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;wg&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;sync&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;WaitGroup&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;size&lt;/span&gt; &lt;span class="kt"&gt;int&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;var&lt;/span&gt; &lt;span class="n"&gt;mu&lt;/span&gt; &lt;span class="n"&gt;sync&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Mutex&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="m"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="m"&gt;10&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="o"&gt;++&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;wg&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Add&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;go&lt;/span&gt; &lt;span class="k"&gt;func&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;start&lt;/span&gt; &lt;span class="kt"&gt;int&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="k"&gt;defer&lt;/span&gt; &lt;span class="n"&gt;wg&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Done&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
            &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;j&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;start&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;j&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;start&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="n"&gt;size&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="m"&gt;10&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;j&lt;/span&gt;&lt;span class="o"&gt;++&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                &lt;span class="n"&gt;mu&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Lock&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
                &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;j&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;j&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="m"&gt;2&lt;/span&gt;
                &lt;span class="n"&gt;mu&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Unlock&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
            &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="p"&gt;}(&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;size&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="m"&gt;10&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="n"&gt;ProcessWithSegmentation&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;&lt;span class="kt"&gt;int&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;wg&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;sync&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;WaitGroup&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;size&lt;/span&gt; &lt;span class="kt"&gt;int&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="m"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="m"&gt;10&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="o"&gt;++&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;wg&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Add&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;go&lt;/span&gt; &lt;span class="k"&gt;func&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;start&lt;/span&gt; &lt;span class="kt"&gt;int&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="k"&gt;defer&lt;/span&gt; &lt;span class="n"&gt;wg&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Done&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
            &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;j&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;start&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;j&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;start&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="n"&gt;size&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="m"&gt;10&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;j&lt;/span&gt;&lt;span class="o"&gt;++&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;j&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;j&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="m"&gt;2&lt;/span&gt;
            &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="p"&gt;}(&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;size&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="m"&gt;10&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="n"&gt;BenchmarkWithMutex&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;b&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;testing&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;B&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="nb"&gt;make&lt;/span&gt;&lt;span class="p"&gt;([]&lt;/span&gt;&lt;span class="kt"&gt;int&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;size&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;var&lt;/span&gt; &lt;span class="n"&gt;wg&lt;/span&gt; &lt;span class="n"&gt;sync&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;WaitGroup&lt;/span&gt;

    &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ResetTimer&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="m"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;N&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="o"&gt;++&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;ProcessWithMutex&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;wg&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;size&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;wg&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Wait&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="n"&gt;BenchmarkWithSegmentation&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;b&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;testing&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;B&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="nb"&gt;make&lt;/span&gt;&lt;span class="p"&gt;([]&lt;/span&gt;&lt;span class="kt"&gt;int&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;size&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;var&lt;/span&gt; &lt;span class="n"&gt;wg&lt;/span&gt; &lt;span class="n"&gt;sync&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;WaitGroup&lt;/span&gt;

    &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ResetTimer&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="m"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;N&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="o"&gt;++&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;ProcessWithSegmentation&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;wg&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;size&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;wg&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Wait&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Run these benchmarks using &lt;code&gt;go test -bench .&lt;/code&gt; and let’s see the results!&lt;/p&gt;

&lt;p&gt;Benchmark Results:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;With Mutex:

&lt;ul&gt;
&lt;li&gt;Operations: 14&lt;/li&gt;
&lt;li&gt;Average Time Per Operation: ~82.48 ms (82,482,426 ns)&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;With Segmentation:

&lt;ul&gt;
&lt;li&gt;Operations: 4948&lt;/li&gt;
&lt;li&gt;Average Time Per Operation: ~0.23 ms (231,511 ns)&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;p&gt;Analysis:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;Performance with Mutex:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Only 14 operations in the testing period.&lt;/li&gt;
&lt;li&gt;Each operation took around 82.48 milliseconds.&lt;/li&gt;
&lt;li&gt;Mutexes keep the updates safe, but every goroutine has to queue for the same lock, so contention dominates the runtime.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;p&gt;Performance with Segmentation:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A whopping 4948 operations in the same period.&lt;/li&gt;
&lt;li&gt;Each operation took just 0.23 milliseconds.&lt;/li&gt;
&lt;li&gt;By letting goroutines work on separate parts of the array, we cut out the need for locks and speed things up massively.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;p&gt;Conclusion:&lt;br&gt;
Segmentation blows mutexes out of the water: roughly 356 times faster in this benchmark (82,482,426 ns vs 231,511 ns per operation). When your data can be split into independent chunks, it’s a fantastic way to boost performance under high concurrency.&lt;/p&gt;

&lt;p&gt;So next time you’re juggling goroutines, think about segmenting your data. It’s a game-changer!&lt;/p&gt;
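&lt;p&gt;To make the pattern concrete outside the benchmark harness, here's a minimal, self-contained sketch of segmentation. The &lt;code&gt;ProcessSegment&lt;/code&gt; and &lt;code&gt;SquareInParallel&lt;/code&gt; helpers are illustrative, not the exact functions benchmarked above:&lt;/p&gt;

```go
package main

import (
	"fmt"
	"sync"
)

// ProcessSegment squares every element of its segment in place.
// No mutex is required: each goroutine owns a disjoint sub-slice,
// so there is no shared state to protect.
func ProcessSegment(seg []int, wg *sync.WaitGroup) {
	defer wg.Done()
	for i := range seg {
		seg[i] = seg[i] * seg[i]
	}
}

// SquareInParallel splits data into one contiguous segment per worker
// and processes the segments concurrently, without any locking.
func SquareInParallel(data []int, workers int) {
	chunk := len(data) / workers
	wg := new(sync.WaitGroup)
	for w := 0; w != workers; w++ {
		start := w * chunk
		end := start + chunk
		if w == workers-1 {
			end = len(data) // last worker picks up any remainder
		}
		wg.Add(1)
		go ProcessSegment(data[start:end], wg)
	}
	wg.Wait()
}

func main() {
	data := []int{1, 2, 3, 4, 5, 6, 7, 8}
	SquareInParallel(data, 4)
	fmt.Println(data) // [1 4 9 16 25 36 49 64]
}
```

&lt;p&gt;Because every goroutine writes to a disjoint sub-slice, there is no data race to guard against and the work proceeds fully in parallel.&lt;/p&gt;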

&lt;p&gt;Happy coding, and feel free to share your thoughts and results in the comments below. Let’s make Go concurrency even more awesome! 🚀✨&lt;/p&gt;




&lt;p&gt;P.S.: Thank you for reading! Feel free to leave your comments below, and let's stay connected on &lt;a href="https://linkedin.com/in/agtabesh" rel="noopener noreferrer"&gt;LinkedIn&lt;/a&gt; or follow me on &lt;a href="https://twitter.com/agtabesh1" rel="noopener noreferrer"&gt;Twitter&lt;/a&gt; or &lt;a href="https://github.com/agtabesh" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt; for more insights and updates in data mining and analytics.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Utilizing Locality-Sensitive Hashing (LSH) for Market Basket Analysis</title>
      <dc:creator>Ahmad Ganjtabesh</dc:creator>
      <pubDate>Thu, 07 Mar 2024 12:51:15 +0000</pubDate>
      <link>https://dev.to/agtabesh/utilizing-locality-sensitive-hashing-lsh-for-market-basket-analysis-59ph</link>
      <guid>https://dev.to/agtabesh/utilizing-locality-sensitive-hashing-lsh-for-market-basket-analysis-59ph</guid>
      <description>&lt;p&gt;Market Basket Analysis is a powerful technique used in retail and e-commerce to uncover associations between products frequently purchased together. By analyzing transactional data, businesses can identify patterns and make informed decisions to improve sales strategies, product placements, and customer experiences.&lt;/p&gt;

&lt;h2&gt;
  
  
  Introduction to Market Basket Analysis
&lt;/h2&gt;

&lt;p&gt;Market Basket Analysis operates on the principle of examining transactional data to uncover associations between products. The fundamental metric used in this analysis is "support," which measures the frequency of occurrence of item combinations in transactions. Common metrics derived from support include "confidence" and "lift," which provide insights into the strength and significance of associations between products.&lt;/p&gt;
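&lt;p&gt;As a concrete illustration of these metrics, here is a small Go sketch that computes support, confidence, and lift for one item pair over a toy set of transactions. The data and the &lt;code&gt;countContaining&lt;/code&gt; helper are made up for the example:&lt;/p&gt;

```go
package main

import "fmt"

// countContaining returns how many transactions contain every item in items.
func countContaining(transactions [][]string, items []string) int {
	count := 0
	for _, t := range transactions {
		present := make(map[string]bool)
		for _, it := range t {
			present[it] = true
		}
		all := true
		for _, it := range items {
			if !present[it] {
				all = false
			}
		}
		if all {
			count++
		}
	}
	return count
}

func main() {
	transactions := [][]string{
		{"bread", "butter"},
		{"bread", "milk"},
		{"bread", "butter", "milk"},
		{"milk"},
	}
	n := float64(len(transactions))

	// support(X) = fraction of transactions containing X
	supBread := float64(countContaining(transactions, []string{"bread"})) / n
	supButter := float64(countContaining(transactions, []string{"butter"})) / n
	supBoth := float64(countContaining(transactions, []string{"bread", "butter"})) / n

	// confidence(bread => butter) = support(both) / support(bread)
	conf := supBoth / supBread
	// lift = confidence / support(butter)
	lift := conf / supButter

	fmt.Printf("support=%.2f confidence=%.2f lift=%.2f\n", supBoth, conf, lift)
	// support=0.50 confidence=0.67 lift=1.33
}
```

&lt;p&gt;A lift above 1 indicates the two items appear together more often than independence would predict, i.e. a positive association.&lt;/p&gt;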

&lt;p&gt;The &lt;a href="https://towardsdatascience.com/apriori-association-rule-mining-explanation-and-python-implementation-290b42afdfc6" rel="noopener noreferrer"&gt;Apriori&lt;/a&gt; algorithm is a fundamental technique in data mining for discovering frequent itemsets and deriving association rules from transactional data. It works by iteratively generating candidate itemsets, counting their support, and identifying frequent itemsets above a specified threshold.&lt;br&gt;
Despite its usefulness, the Apriori algorithm has limitations:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Computational Complexity:&lt;/strong&gt; It can be computationally expensive, especially for large datasets, due to multiple passes over the data and generation of numerous candidate itemsets.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Memory Usage:&lt;/strong&gt; The algorithm requires significant memory to store candidate itemsets and support counts, which can be challenging for systems with limited resources.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Inefficient for Sparse Data:&lt;/strong&gt; In datasets with high sparsity, the algorithm may produce many low-support itemsets, leading to inefficient pruning and reduced effectiveness.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Apriori Property Limitation:&lt;/strong&gt; Premature pruning based on the Apriori property may miss potentially interesting association rules if infrequent itemsets are pruned too aggressively.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;For scenarios where the Apriori algorithm's limitations are prohibitive, consider exploring alternative approaches such as Locality-Sensitive Hashing (LSH) for scalable and efficient market basket analysis.&lt;/p&gt;
&lt;h2&gt;
  
  
  Introduction to Locality-Sensitive Hashing (LSH)
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://towardsdatascience.com/similarity-search-part-5-locality-sensitive-hashing-lsh-76ae4b388203" rel="noopener noreferrer"&gt;Locality-Sensitive Hashing (LSH)&lt;/a&gt; is a technique used in data mining and similarity search to efficiently approximate similarity between data points in high-dimensional spaces. It is particularly useful for applications involving large datasets where traditional similarity search methods, such as exhaustive pairwise comparisons, become computationally expensive.&lt;/p&gt;

&lt;p&gt;The core idea behind LSH is to hash data points into buckets in such a way that similar data points are more likely to be hashed into the same bucket, while dissimilar points are likely to be hashed into different buckets. By organizing data into these buckets, LSH enables approximate nearest neighbor search, where similar data points can be efficiently retrieved by querying the corresponding buckets.&lt;/p&gt;

&lt;p&gt;LSH achieves this goal by employing hash functions that satisfy the locality-sensitive property, meaning that the probability of collision (i.e., two data points being hashed into the same bucket) increases with the similarity of the points and falls off as they grow apart. LSH techniques are designed to balance the trade-off between precision (retrieving only similar data points) and recall (retrieving all similar data points) based on application requirements.&lt;/p&gt;

&lt;p&gt;One of the key advantages of LSH is its ability to scale to large datasets with high-dimensional data spaces, such as text documents, images, and genetic sequences. By partitioning the data space into hash buckets, LSH reduces the search space for similarity queries, leading to significant improvements in computational efficiency.&lt;/p&gt;
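&lt;p&gt;To see the locality-sensitive property in action, here is a toy MinHash sketch, a classic LSH family for set similarity. The seeding scheme below is deliberately simplified for illustration and is not how the library implements its hash families:&lt;/p&gt;

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// minHash computes a MinHash signature: for each of k seeded hash
// functions, it keeps the minimum hash value over the set's members.
// Similar sets share many minima, so their signatures largely agree.
func minHash(set []string, k int) []uint64 {
	sig := make([]uint64, k)
	for i := range sig {
		best := uint64(0)
		first := true
		for _, s := range set {
			h := fnv.New64a()
			h.Write([]byte{byte(i)}) // crude per-function seed
			h.Write([]byte(s))
			v := h.Sum64()
			if first || best > v {
				best = v
				first = false
			}
		}
		sig[i] = best
	}
	return sig
}

// agreement returns the fraction of signature positions that match,
// which approximates the Jaccard similarity of the underlying sets.
func agreement(a, b []uint64) float64 {
	same := 0
	for i := range a {
		if a[i] == b[i] {
			same++
		}
	}
	return float64(same) / float64(len(a))
}

func main() {
	basketA := []string{"bread", "butter", "milk", "eggs"}
	basketB := []string{"bread", "butter", "milk", "cheese"} // similar to A
	basketC := []string{"soap", "shampoo"}                   // dissimilar

	sigA := minHash(basketA, 64)
	fmt.Printf("A vs B agreement: %.2f\n", agreement(sigA, minHash(basketB, 64)))
	fmt.Printf("A vs C agreement: %.2f\n", agreement(sigA, minHash(basketC, 64)))
}
```

&lt;p&gt;The similar baskets collide in far more signature positions than the dissimilar ones, which is exactly the property LSH exploits to prune the search space.&lt;/p&gt;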

&lt;p&gt;For this demonstration, we'll be utilizing the Groceries dataset, a commonly used benchmark dataset in Market Basket Analysis. This dataset comprises transactional records from a grocery store, with each transaction representing a customer's basket containing an assortment of items. We employ a Go library, &lt;a href="https://github.com/agtabesh/lsh" rel="noopener noreferrer"&gt;github.com/agtabesh/lsh&lt;/a&gt;, for implementing Locality-Sensitive Hashing (LSH) in our analysis.&lt;/p&gt;

&lt;p&gt;Let's go through the Go code step by step (you can find the source code &lt;a href="https://github.com/agtabesh/lsh-basket-analysis" rel="noopener noreferrer"&gt;here&lt;/a&gt;):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="s"&gt;"context"&lt;/span&gt;
    &lt;span class="s"&gt;"encoding/csv"&lt;/span&gt;
    &lt;span class="s"&gt;"fmt"&lt;/span&gt;
    &lt;span class="s"&gt;"os"&lt;/span&gt;
    &lt;span class="s"&gt;"sort"&lt;/span&gt;

    &lt;span class="s"&gt;"github.com/agtabesh/lsh"&lt;/span&gt;
    &lt;span class="s"&gt;"github.com/agtabesh/lsh/types"&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The code starts by importing the necessary packages. &lt;code&gt;context&lt;/code&gt;, &lt;code&gt;encoding/csv&lt;/code&gt;, &lt;code&gt;fmt&lt;/code&gt;, &lt;code&gt;os&lt;/code&gt;, and &lt;code&gt;sort&lt;/code&gt; are standard library packages. &lt;code&gt;github.com/agtabesh/lsh&lt;/code&gt; provides the Locality-Sensitive Hashing (LSH) implementation, and &lt;code&gt;github.com/agtabesh/lsh/types&lt;/code&gt; contains the custom types it uses.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="n"&gt;datasetMap&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;readVectorsFromFile&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Groceries_dataset.csv"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The program starts by reading vectors (representing transactional data) from a CSV file named "Groceries_dataset.csv" using the &lt;code&gt;readVectorsFromFile&lt;/code&gt; function, returning early if the file cannot be read or parsed.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="n"&gt;config&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;lsh&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;LSHConfig&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;SignatureSize&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="m"&gt;128&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Next, the code defines the configuration for the Locality-Sensitive Hashing (LSH) algorithm. The &lt;code&gt;SignatureSize&lt;/code&gt; parameter determines the length of the hash signature: longer signatures approximate similarity more accurately but cost more to compute and store.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="n"&gt;hashFamily&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;lsh&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;NewXXHASH64HashFamily&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;SignatureSize&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;similarityMeasure&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;lsh&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;NewHammingSimilarity&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;store&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;lsh&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;NewInMemoryStore&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The code initializes the components required for LSH: the hash family (&lt;code&gt;hashFamily&lt;/code&gt;), the similarity measure (&lt;code&gt;similarityMeasure&lt;/code&gt;), and the storage mechanism (&lt;code&gt;store&lt;/code&gt;). In this case, &lt;code&gt;XXHASH64HashFamily&lt;/code&gt; is used for hashing, &lt;code&gt;HammingSimilarity&lt;/code&gt; for comparing signatures, and an in-memory store for holding them.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="n"&gt;instance&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;lsh&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;NewLSH&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;hashFamily&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;similarityMeasure&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;store&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;An LSH instance is created using the previously defined configuration, hash family, similarity measure, and store.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="n"&gt;ctx&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Background&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;vector&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="k"&gt;range&lt;/span&gt; &lt;span class="n"&gt;datasetMap&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;vectorID&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;types&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;VectorID&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;instance&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Add&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;vectorID&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;vector&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The code iterates over the dataset map obtained from the CSV file and adds each vector, keyed by its &lt;code&gt;VectorID&lt;/code&gt;, to the LSH instance.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="n"&gt;vector&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;types&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Vector&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="s"&gt;"white bread"&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="n"&gt;count&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="m"&gt;100&lt;/span&gt;
&lt;span class="n"&gt;similarVectorsID&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;instance&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;QueryByVector&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;vector&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;count&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A sample vector is defined, representing a product ("white bread"), and the &lt;code&gt;QueryByVector&lt;/code&gt; method is called to find similar vectors in the LSH instance. The number of similar vectors to retrieve is specified by the &lt;code&gt;count&lt;/code&gt; variable.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="n"&gt;items&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="nb"&gt;make&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Items&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;vectorID&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="k"&gt;range&lt;/span&gt; &lt;span class="n"&gt;similarVectorsID&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;item&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="k"&gt;range&lt;/span&gt; &lt;span class="n"&gt;datasetMap&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;vectorID&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;String&lt;/span&gt;&lt;span class="p"&gt;()]&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;items&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;item&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;String&lt;/span&gt;&lt;span class="p"&gt;()]&lt;/span&gt;&lt;span class="o"&gt;++&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="n"&gt;n&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="m"&gt;10&lt;/span&gt;
&lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;items&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Top&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;fmt&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Println&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Result:"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The code iterates over the IDs of similar vectors retrieved and aggregates the associated items from the dataset map, counting their occurrences. The top 10 associated items are extracted from the aggregated data, and the result is printed to the console.&lt;/p&gt;
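&lt;p&gt;The &lt;code&gt;Items&lt;/code&gt; type and its &lt;code&gt;Top&lt;/code&gt; method live in the linked repository and aren't shown in the excerpt; one plausible implementation, inferred from how they are used above, is:&lt;/p&gt;

```go
package main

import (
	"fmt"
	"sort"
)

// Items counts how often each item co-occurs with the query basket.
// This is a sketch of the helper used in the post, not the repo's code.
type Items map[string]int

// Top returns the n item names with the highest counts.
func (items Items) Top(n int) []string {
	names := make([]string, 0, len(items))
	for name := range items {
		names = append(names, name)
	}
	// Sort descending by count.
	sort.Slice(names, func(i, j int) bool {
		return items[names[i]] > items[names[j]]
	})
	if n > len(names) {
		n = len(names)
	}
	return names[:n]
}

func main() {
	items := Items{"whole milk": 7, "rolls/buns": 4, "yogurt": 5}
	fmt.Println(items.Top(2)) // [whole milk yogurt]
}
```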

&lt;p&gt;You can find the source code &lt;a href="https://github.com/agtabesh/lsh-basket-analysis" rel="noopener noreferrer"&gt;here&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;In conclusion, Market Basket Analysis combined with Locality-Sensitive Hashing provides a scalable and efficient solution for uncovering associations between products in large transactional datasets. By leveraging this approach, businesses can gain valuable insights to optimize their sales strategies and enhance customer experiences.&lt;/p&gt;




&lt;p&gt;P.S.: Thank you for reading! Feel free to leave your comments below, and let's stay connected on &lt;a href="https://linkedin.com/in/agtabesh" rel="noopener noreferrer"&gt;LinkedIn&lt;/a&gt; or follow me on &lt;a href="https://twitter.com/agtabesh1" rel="noopener noreferrer"&gt;Twitter&lt;/a&gt; or &lt;a href="https://github.com/agtabesh" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt; for more insights and updates in data mining and analytics.&lt;/p&gt;

</description>
      <category>datascience</category>
      <category>go</category>
      <category>computerscience</category>
      <category>machinelearning</category>
    </item>
  </channel>
</rss>
