<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Pedro H Goncalves</title>
    <description>The latest articles on DEV Community by Pedro H Goncalves (@pedrohgoncalves).</description>
    <link>https://dev.to/pedrohgoncalves</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F970850%2F82d5bcb0-acb4-4a55-a46e-ca987c4131cf.png</url>
      <title>DEV Community: Pedro H Goncalves</title>
      <link>https://dev.to/pedrohgoncalves</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/pedrohgoncalves"/>
    <language>en</language>
    <item>
      <title>Diagnosing and fixing critical PostgreSQL performance issues: A deep dive</title>
      <dc:creator>Pedro H Goncalves</dc:creator>
      <pubDate>Mon, 16 Jun 2025 20:17:22 +0000</pubDate>
      <link>https://dev.to/pedrohgoncalves/diagnosing-and-fixing-critical-postgresql-performance-issues-a-deep-dive-3jj</link>
      <guid>https://dev.to/pedrohgoncalves/diagnosing-and-fixing-critical-postgresql-performance-issues-a-deep-dive-3jj</guid>
      <description>&lt;p&gt;I recently worked on optimizing a PostgreSQL database that was facing serious performance issues. Some of the main complaints were:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Queries were slow to return results, even when using indexed columns.&lt;/li&gt;
&lt;li&gt;Inserts, deletes, and updates were painfully slow.&lt;/li&gt;
&lt;li&gt;Maintenance tasks like reindexing, vacuuming, analyzing, and the like were nearly impossible to run.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In this article, I’m going to break down what was behind the poor performance. We’ll cover things like over-indexing, bloated tables, fragmented indexes, basic maintenance tasks, some modeling tips, the roadmap I used to diagnose the problem (which you can adapt to your own case), the solution I came up with, and what I’d do differently to prevent this kind of situation from happening again.&lt;/p&gt;

&lt;p&gt;From time to time, you’ll see text formatted like this:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;em&gt;Example of text.&lt;/em&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These are just technical side notes for certain technologies or features I mention—feel free to skip them.&lt;/p&gt;

&lt;p&gt;Text formatted like this:&lt;br&gt;
&lt;em&gt;Example of text.&lt;/em&gt;&lt;br&gt;
means it’s a side comment.&lt;/p&gt;

&lt;p&gt;And text like this:&lt;br&gt;
— &lt;em&gt;"Example of text."&lt;/em&gt;&lt;br&gt;
simulates a question the reader might be asking.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;&lt;strong&gt;Note 1:&lt;/strong&gt; While I do explain many of the terms, causes, and consequences of most of the issues here, it’s best if you have at least a bit of background in databases beyond basic SELECTs.&lt;/em&gt;&lt;br&gt;
&lt;em&gt;&lt;strong&gt;Note 2:&lt;/strong&gt; This article isn’t meant to showcase clean or optimized code examples. The SQL and Scala snippets could definitely be improved. Think of it as a mental exercise—how would you improve their readability and performance?&lt;/em&gt;&lt;/p&gt;
&lt;h1&gt;
  
  
  Table of Contents
&lt;/h1&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;Diagnosing the Problem&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Possible causes&lt;/li&gt;
&lt;li&gt;Hardware resources and parameter tweaks&lt;/li&gt;
&lt;li&gt;Computational resources&lt;/li&gt;
&lt;li&gt;Locks&lt;/li&gt;
&lt;li&gt;Table locks&lt;/li&gt;
&lt;li&gt;Database and Tables&lt;/li&gt;
&lt;li&gt;Large volumes of data&lt;/li&gt;
&lt;li&gt;Bloated Tables &amp;amp; Fragmented Indexes&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Fixing the problem&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Rebuilding the table&lt;/li&gt;
&lt;li&gt;Reloading the data&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;How can this be avoided?&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Conclusion&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h1&gt;
  
  
  Diagnosing the problem.
&lt;/h1&gt;

&lt;p&gt;To kick things off, it’s important to mention that we didn’t have query metadata or anything similar—&lt;code&gt;pg_stat_statements&lt;/code&gt; wasn’t enabled (nor any managed monitoring service like RDS Performance Insights), and we had very little visibility into query history to identify areas for improvement.&lt;/p&gt;
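
&lt;p&gt;As a side note—this is a suggestion, not something that was in place at the time—&lt;code&gt;pg_stat_statements&lt;/code&gt; ships with PostgreSQL and is cheap to enable; it would have given us exactly the query history we were missing:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- In postgresql.conf (restart required):
--   shared_preload_libraries = 'pg_stat_statements'

CREATE EXTENSION IF NOT EXISTS pg_stat_statements;

-- From then on, the most expensive queries are one SELECT away
-- (on PostgreSQL versions before 13, the column is mean_time):
SELECT query, calls, mean_exec_time
FROM pg_stat_statements
ORDER BY mean_exec_time DESC
LIMIT 10;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;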
&lt;h2&gt;
  
  
  Possible causes.
&lt;/h2&gt;

&lt;p&gt;To make our “investigation” a bit easier, I laid out a few areas to check. We’ll break each one down below in this order, but feel free to jump ahead—reading them in order isn’t necessary:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Lack of computational resources&lt;/li&gt;
&lt;li&gt;Table locks&lt;/li&gt;
&lt;li&gt;Too much data&lt;/li&gt;
&lt;li&gt;Bloated tables&lt;/li&gt;
&lt;li&gt;Fragmented tables&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;
  
  
  Hardware resources and parameter tweaks.
&lt;/h2&gt;

&lt;p&gt;I started with a general check of the system. Others had already said the server wasn’t maxed out in any way, but I like to confirm things for myself. At the time, the server was showing:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;~25% average CPU usage&lt;/li&gt;
&lt;li&gt;~65% average RAM usage&lt;/li&gt;
&lt;li&gt;~4.3k ops/s (NVMe SSD)&lt;/li&gt;
&lt;li&gt;Disk ~90% full&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;At this point, you might be thinking:&lt;/p&gt;

&lt;p&gt;— &lt;em&gt;“Only 10% free disk? That’s risky—ideally it should be at least 20%.”&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;And you’d be right. I agree with that take. But in this case, that alone didn’t explain the massive performance drop. With all the other metrics well below danger zones, we ruled out resource bottlenecks.&lt;/p&gt;
&lt;h3&gt;
  
  
  &lt;strong&gt;&lt;del&gt;Computational resources.&lt;/del&gt;&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Even though resource limitations weren’t the issue, I still suggested tweaking some PostgreSQL parameters. The config had never been touched, and many default settings are made for local setups where resources are shared—unlike our case, where the server was dedicated to the database.&lt;/p&gt;

&lt;p&gt;Some of the parameters we updated were:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;shared_buffers&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;work_mem&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;effective_cache_size&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;effective_io_concurrency&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;maintenance_work_mem&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These changes made better use of the available hardware and did improve things like ORDER BY queries and some maintenance tasks (on smaller tables), but this clearly wasn’t the main issue.&lt;/p&gt;
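
&lt;p&gt;For illustration, here’s how those changes can be applied with &lt;code&gt;ALTER SYSTEM&lt;/code&gt;. The values below are placeholders for a dedicated database server—not the ones we actually used—so size them to your own hardware:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;ALTER SYSTEM SET shared_buffers = '8GB';          -- restart required
ALTER SYSTEM SET effective_cache_size = '24GB';   -- planner hint, not an allocation
ALTER SYSTEM SET work_mem = '64MB';               -- per sort/hash operation, so be careful
ALTER SYSTEM SET maintenance_work_mem = '2GB';    -- helps VACUUM, CREATE INDEX, etc.
ALTER SYSTEM SET effective_io_concurrency = 200;  -- reasonable for NVMe SSDs

SELECT pg_reload_conf();  -- applies everything except shared_buffers
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;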
&lt;h2&gt;
  
  
  Locks.
&lt;/h2&gt;

&lt;p&gt;Certain types of long-lasting or widespread locks can definitely wreak havoc on read/write performance and even block maintenance tasks. As I said before, we didn’t have historic query or lock data, but we could monitor the locks currently active.&lt;/p&gt;

&lt;p&gt;During peak hours, I ran this query:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; 
    &lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;schemaname&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;relname&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;l&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;locktype&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;l&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="k"&gt;mode&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;l&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="k"&gt;granted&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;usename&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;query_start&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;age&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;now&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;query_start&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;query_duration&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;pg_locks&lt;/span&gt; &lt;span class="n"&gt;l&lt;/span&gt;
&lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;pg_stat_all_tables&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;l&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;relation&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;relid&lt;/span&gt;
&lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;pg_stat_activity&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;l&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;pid&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;pid&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;relname&lt;/span&gt; &lt;span class="k"&gt;IS&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;relname&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;l&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="k"&gt;mode&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And here’s what we got:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;pid&lt;/th&gt;
&lt;th&gt;table&lt;/th&gt;
&lt;th&gt;locktype&lt;/th&gt;
&lt;th&gt;mode&lt;/th&gt;
&lt;th&gt;granted&lt;/th&gt;
&lt;th&gt;usename&lt;/th&gt;
&lt;th&gt;query&lt;/th&gt;
&lt;th&gt;query_start&lt;/th&gt;
&lt;th&gt;duration&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;22956&lt;/td&gt;
&lt;td&gt;table_d&lt;/td&gt;
&lt;td&gt;relation&lt;/td&gt;
&lt;td&gt;AccessShareLock&lt;/td&gt;
&lt;td&gt;TRUE&lt;/td&gt;
&lt;td&gt;postgres&lt;/td&gt;
&lt;td&gt;-&lt;/td&gt;
&lt;td&gt;2025-06-16 13:00:31.543569+00&lt;/td&gt;
&lt;td&gt;00:00.3&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;24810&lt;/td&gt;
&lt;td&gt;table_e&lt;/td&gt;
&lt;td&gt;relation&lt;/td&gt;
&lt;td&gt;AccessShareLock&lt;/td&gt;
&lt;td&gt;TRUE&lt;/td&gt;
&lt;td&gt;postgres&lt;/td&gt;
&lt;td&gt;-&lt;/td&gt;
&lt;td&gt;2025-06-16 11:39:29.805778+00&lt;/td&gt;
&lt;td&gt;21:02.0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;24985&lt;/td&gt;
&lt;td&gt;table_e&lt;/td&gt;
&lt;td&gt;relation&lt;/td&gt;
&lt;td&gt;ShareUpdateExclusiveLock&lt;/td&gt;
&lt;td&gt;TRUE&lt;/td&gt;
&lt;td&gt;NULL&lt;/td&gt;
&lt;td&gt;autovacuum: VACUUM ANALYZE public.table_e (to prevent wraparound)&lt;/td&gt;
&lt;td&gt;2025-06-16 11:39:32.468211+00&lt;/td&gt;
&lt;td&gt;20:59.3&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;25102&lt;/td&gt;
&lt;td&gt;table_f&lt;/td&gt;
&lt;td&gt;relation&lt;/td&gt;
&lt;td&gt;AccessShareLock&lt;/td&gt;
&lt;td&gt;TRUE&lt;/td&gt;
&lt;td&gt;postgres&lt;/td&gt;
&lt;td&gt;-&lt;/td&gt;
&lt;td&gt;2025-06-16 11:39:29.805778+00&lt;/td&gt;
&lt;td&gt;21:02.0&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;em&gt;There were dozens more rows like these showing AccessShareLocks on other tables.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Now maybe you’re thinking:&lt;/p&gt;

&lt;p&gt;— &lt;em&gt;“Aha! It’s the locks! That’s what’s killing performance!”&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Sorry to disappoint—&lt;code&gt;AccessShareLock&lt;/code&gt; is the most permissive lock mode there is. It conflicts only with &lt;code&gt;AccessExclusiveLock&lt;/code&gt; (taken by DROP TABLE, CLUSTER, REINDEX, VACUUM FULL, most table-altering commands, etc.). So they’re not the problem.&lt;/p&gt;

&lt;p&gt;But then you ask:&lt;/p&gt;

&lt;p&gt;— &lt;em&gt;“What about that ShareUpdateExclusiveLock?”&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Good catch. That one is more restrictive—it conflicts with itself and with the locks taken by maintenance tasks and table structure changes. So maybe this is the bad guy preventing maintenance from running?&lt;/p&gt;

&lt;p&gt;Not really. That lock was taken by an autovacuum trying to run &lt;code&gt;VACUUM ANALYZE&lt;/code&gt;. In reality, this process probably never finishes and ends up just hanging there. Our move here was to disable autovacuum temporarily and kill the zombie process.&lt;/p&gt;

&lt;p&gt;You can cancel it like this (replace &lt;code&gt;24985&lt;/code&gt; with your PID):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;pg_cancel_backend&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;24985&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If that doesn’t work:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;pg_terminate_backend&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;24985&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then, we disable autovacuum for that table:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;ALTER&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;table_e&lt;/span&gt; &lt;span class="k"&gt;SET&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;autovacuum_enabled&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;false&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;table_e&lt;/code&gt; wasn’t super critical—only a few queries hit it—so this helped a bit but wasn’t a game changer.&lt;/p&gt;
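
&lt;p&gt;Just don’t forget the flip side: once the underlying problem is fixed, autovacuum has to be turned back on, or bloat (and eventually transaction ID wraparound) will only get worse:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Restore the default (autovacuum enabled) for the table:
ALTER TABLE table_e RESET (autovacuum_enabled);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;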

&lt;h3&gt;
  
  
  &lt;strong&gt;&lt;del&gt;Table locks.&lt;/del&gt;&lt;/strong&gt;
&lt;/h3&gt;

&lt;h2&gt;
  
  
  Database and tables.
&lt;/h2&gt;

&lt;p&gt;At this point, the best move was to stop looking at things too broadly and zoom in on something more specific—maybe a slow query or a concrete complaint like:&lt;/p&gt;

&lt;p&gt;— &lt;em&gt;“The database is really slow.”&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;That kind of vague complaint doesn’t help much. Ideally, you want to narrow things down until you find the actual pain point. Is it a specific query? Then start with the &lt;code&gt;EXPLAIN&lt;/code&gt; plan. That’s SQL tuning 101. But as I mentioned earlier, we didn’t have access to historical queries, and there wasn’t enough logging to get insights. That made the challenge more… let’s say, “fun.”&lt;/p&gt;

&lt;p&gt;We knew all operations—reads, writes, maintenance—were way too slow. So the problem wasn’t a lack of indexes (too many indexes usually slow down writes, but reads get faster), and we couldn’t blame bad query design either—we simply didn’t have the history or even access to the codebase using the database.&lt;/p&gt;

&lt;p&gt;So next step: maybe we’re dealing with too much data?&lt;/p&gt;

&lt;h3&gt;
  
  
  Checking the data volume.
&lt;/h3&gt;

&lt;p&gt;Let’s start with the obvious: how many rows do these tables actually have?&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="k"&gt;COUNT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;my_table&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Yeah… that didn’t go well. It ran for 10 minutes before I killed it. Not helpful.&lt;/p&gt;

&lt;p&gt;So, plan B: table statistics.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; 
    &lt;span class="n"&gt;schemaname&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
    &lt;span class="n"&gt;relname&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;tablename&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
    &lt;span class="n"&gt;n_live_tup&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;last_analyze&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;last_autoanalyze&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;pg_stat_user_tables&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;n_live_tup&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This gives an estimate of live rows based on the last time PostgreSQL collected stats, along with timestamps for the last manual &lt;code&gt;ANALYZE&lt;/code&gt; and the last autoanalyze.&lt;/p&gt;

&lt;p&gt;Results (abridged):&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;schema&lt;/th&gt;
&lt;th&gt;table&lt;/th&gt;
&lt;th&gt;live_tuples&lt;/th&gt;
&lt;th&gt;last_analyze&lt;/th&gt;
&lt;th&gt;last_autoanalyze&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;public&lt;/td&gt;
&lt;td&gt;table_a&lt;/td&gt;
&lt;td&gt;1.3 billion&lt;/td&gt;
&lt;td&gt;null&lt;/td&gt;
&lt;td&gt;2025-03-20&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;public&lt;/td&gt;
&lt;td&gt;table_b&lt;/td&gt;
&lt;td&gt;500 million&lt;/td&gt;
&lt;td&gt;null&lt;/td&gt;
&lt;td&gt;2025-01-03&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;public&lt;/td&gt;
&lt;td&gt;table_c&lt;/td&gt;
&lt;td&gt;200 million&lt;/td&gt;
&lt;td&gt;2025-03-15&lt;/td&gt;
&lt;td&gt;2025-03-16&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;No surprise here—the stats were outdated by months. These numbers were basically useless. But we already had a clue: missing &lt;code&gt;ANALYZE&lt;/code&gt; runs.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;em&gt;&lt;code&gt;ANALYZE&lt;/code&gt; updates internal stats used by the query planner to figure out the best way to run a query. It’s critical maintenance.&lt;/em&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If these stats aren’t being updated—and it’s not because nobody tried—it’s probably a symptom of a deeper issue.&lt;/p&gt;

&lt;h3&gt;
  
  
  Estimating row count without stats.
&lt;/h3&gt;

&lt;p&gt;Still, I needed a ballpark number. Since every table had a sequential &lt;code&gt;id&lt;/code&gt;, I ran:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="k"&gt;MAX&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;table_a&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;&lt;em&gt;Getting the MAX of a sequential, indexed column like &lt;code&gt;id&lt;/code&gt; is usually very fast thanks to how b-tree indexes work—the DB can just peek at the last page.&lt;/em&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Result:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;max(id)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;9,846,091,813&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;So yeah… nearly 10 billion inserts at some point. That doesn’t mean the table has 10 billion rows now—just that it’s had a lot of activity.&lt;/p&gt;

&lt;p&gt;To estimate how many rows were still there, I used external tooling. Since the DB couldn't handle a simple count or run &lt;code&gt;ANALYZE&lt;/code&gt;, I turned to Spark.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight scala"&gt;&lt;code&gt;&lt;span class="k"&gt;val&lt;/span&gt; &lt;span class="nv"&gt;query&lt;/span&gt; &lt;span class="k"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"(SELECT * FROM table_a WHERE id &amp;lt; 1000000000) AS subquery"&lt;/span&gt;

&lt;span class="k"&gt;val&lt;/span&gt; &lt;span class="nv"&gt;df&lt;/span&gt; &lt;span class="k"&gt;=&lt;/span&gt; &lt;span class="nv"&gt;spark&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;read&lt;/span&gt;
  &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;jdbc&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;jdbcUrl&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;connectionProps&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;

&lt;span class="nf"&gt;println&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;count&lt;/span&gt;&lt;span class="o"&gt;())&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The idea: If IDs under 1 billion return ~1 billion rows, we might really have close to 10 billion rows. If not, we extrapolate based on what we get.&lt;/p&gt;

&lt;p&gt;— &lt;em&gt;“But why use Spark to count instead of waiting it out in the DB?”&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Good question. When I tried counting in Postgres, it took over 40 minutes and didn’t finish. Selecting a sample was way faster.&lt;/p&gt;

&lt;p&gt;After 30 minutes of Spark chugging away, I got the answer: &lt;strong&gt;14 million rows&lt;/strong&gt; for IDs under 1 billion. Extrapolated across the full ~9.8 billion ID range, that suggests roughly 140 million total rows (the distribution isn’t uniform, but it’s close enough).&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;&lt;del&gt;Too much data.&lt;/del&gt;&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Next question: what if we had massive delete or update operations in the past? That would create tons of dead tuples—especially if &lt;code&gt;VACUUM&lt;/code&gt; or &lt;code&gt;VACUUM FULL&lt;/code&gt; hadn’t been running.&lt;/p&gt;

&lt;p&gt;Even though we know PostgreSQL’s stats were way off, I still checked them just to confirm our suspicions.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; 
    &lt;span class="n"&gt;n_tup_ins&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;inserts&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;n_tup_upd&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;updates&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;n_tup_del&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;deletes&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;last_vacuum&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;last_autovacuum&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;pg_stat_user_tables&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;relname&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'table_a'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Result:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;inserts&lt;/th&gt;
&lt;th&gt;updates&lt;/th&gt;
&lt;th&gt;deletes&lt;/th&gt;
&lt;th&gt;last_vacuum&lt;/th&gt;
&lt;th&gt;last_autovacuum&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;117 million&lt;/td&gt;
&lt;td&gt;659&lt;/td&gt;
&lt;td&gt;3.25 billion&lt;/td&gt;
&lt;td&gt;null&lt;/td&gt;
&lt;td&gt;2025-01-30&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Three. Billion. Deletes. That’s… a lot.&lt;/p&gt;

&lt;p&gt;Now let’s see how bloated the table is:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; 
    &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;schemaname&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
    &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;relname&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;tablename&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
    &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;n_live_tup&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;live_tuples&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;n_dead_tup&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;dead_tuples&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;CASE&lt;/span&gt; 
        &lt;span class="k"&gt;WHEN&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;n_live_tup&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt; &lt;span class="k"&gt;THEN&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt; 
        &lt;span class="k"&gt;ELSE&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;n_dead_tup&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nb"&gt;numeric&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="k"&gt;NULLIF&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;n_live_tup&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)::&lt;/span&gt;&lt;span class="nb"&gt;numeric&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt; 
    &lt;span class="k"&gt;END&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;dead_tuples_pct&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;pg_size_pretty&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pg_total_relation_size&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;oid&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;total_size&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;pg_stat_user_tables&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;
&lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;pg_class&lt;/span&gt; &lt;span class="k"&gt;c&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;relid&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;oid&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;n_live_tup&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Result (simplified):&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;table&lt;/th&gt;
&lt;th&gt;live_tuples&lt;/th&gt;
&lt;th&gt;dead_tuples&lt;/th&gt;
&lt;th&gt;dead_tuples_pct&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;table_a&lt;/td&gt;
&lt;td&gt;1.38B&lt;/td&gt;
&lt;td&gt;612M&lt;/td&gt;
&lt;td&gt;44.2%&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;em&gt;Similar stats for other large tables too.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;This confirmed two things:&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Bloated tables&lt;/strong&gt; &lt;em&gt;and probably&lt;/em&gt; &lt;strong&gt;Fragmented tables.&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Even though PostgreSQL’s stats were outdated and some operations were failing, we could already see the pattern. Tons of deletes happened, and no follow-up &lt;code&gt;VACUUM&lt;/code&gt; to clean things up. The tables were bloated, and reads/writes were getting slower by the day.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;em&gt;PostgreSQL doesn’t immediately remove deleted rows. Because of MVCC (Multi-Version Concurrency Control), it needs to keep old versions around for open transactions.&lt;/em&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;So you may be wondering:&lt;/p&gt;

&lt;p&gt;— &lt;em&gt;“Okay, but what makes a table fragmented?”&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Or:&lt;/p&gt;

&lt;p&gt;— &lt;em&gt;“Aren’t bloated and fragmented tables the same thing?”&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Not quite. Bloated tables are usually fragmented too, but not always the other way around. Rebuild operations—&lt;code&gt;VACUUM FULL&lt;/code&gt; or &lt;code&gt;CLUSTER&lt;/code&gt; for the table itself, &lt;code&gt;REINDEX&lt;/code&gt; for its indexes—can fix both, but only when they can actually run. In our case, they were completely unusable.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;em&gt;Fragmented tables have rows scattered across disk in a messy way. That kills sequential scans.&lt;/em&gt;&lt;/li&gt;
&lt;li&gt;&lt;em&gt;Bloated tables are ones where dead tuples and overhead take up a huge chunk of storage—sometimes 30–40% or more.&lt;/em&gt;&lt;/li&gt;
&lt;/ul&gt;
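
&lt;p&gt;&lt;em&gt;If you can afford a full table scan, the &lt;code&gt;pgstattuple&lt;/code&gt; extension (bundled with PostgreSQL) reports exact bloat numbers instead of estimates—though on tables this size, even that may be too slow:&lt;/em&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;CREATE EXTENSION IF NOT EXISTS pgstattuple;

-- Exact dead-tuple and free-space percentages (full table scan):
SELECT * FROM pgstattuple('public.table_a');

-- Cheaper, visibility-map-based approximation:
SELECT * FROM pgstattuple_approx('public.table_a');
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;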

&lt;h1&gt;
  
  
  Fixing the problem.
&lt;/h1&gt;

&lt;p&gt;Alright, so we’ve (probably) identified the culprits: the most heavily used tables are bloated and fragmented beyond hope. We also know that basic maintenance operations like &lt;code&gt;VACUUM&lt;/code&gt;, &lt;code&gt;REINDEX&lt;/code&gt;, or even &lt;code&gt;ANALYZE&lt;/code&gt; don’t complete anymore. So, what options are left?&lt;/p&gt;

&lt;p&gt;The most practical and efficient solution? Rebuild the table outside the database.&lt;/p&gt;

&lt;p&gt;The idea is:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Extract the entire table’s data.&lt;/li&gt;
&lt;li&gt;Apply any necessary cleanup or transformation.&lt;/li&gt;
&lt;li&gt;Import it back into the database in a fresh, clean table.&lt;/li&gt;
&lt;li&gt;Swap the old and new tables.&lt;/li&gt;
&lt;/ol&gt;
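
&lt;p&gt;Steps 3 and 4 can be sketched in SQL like this—the table names are illustrative, and the rename swap takes a brief &lt;code&gt;AccessExclusiveLock&lt;/code&gt;, so run it in a quiet window:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Fresh table with the same columns, defaults, constraints, and indexes:
CREATE TABLE table_a_new (LIKE table_a INCLUDING ALL);

-- (bulk load the cleaned data into table_a_new here)

BEGIN;
ALTER TABLE table_a RENAME TO table_a_old;
ALTER TABLE table_a_new RENAME TO table_a;
COMMIT;

-- Once everything checks out:
DROP TABLE table_a_old;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;em&gt;In practice you’d probably create the indexes after the bulk load rather than via INCLUDING ALL, since loading into an unindexed table is much faster.&lt;/em&gt;&lt;/p&gt;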

&lt;p&gt;You might be wondering:&lt;/p&gt;

&lt;p&gt;— &lt;em&gt;“But if the table has around 140 million rows, won’t the extract/load process eat up a ton of resources?”&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Not if you use Spark. Of course, tools like DuckDB or Polars might be even faster, but I used Spark because I already had the environment and some code ready to go.&lt;/p&gt;

&lt;p&gt;To make life easier, I split the export into chunks (batches). This way I could:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Get feedback during the process&lt;/li&gt;
&lt;li&gt;Resume from checkpoints in case anything failed&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Here’s the Scala code I used to extract in 1-billion-row batches based on the primary key (a serial ID):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight scala"&gt;&lt;code&gt;&lt;span class="k"&gt;val&lt;/span&gt; &lt;span class="nv"&gt;jdbcUrl&lt;/span&gt; &lt;span class="k"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"jdbc:postgresql://hostname:5432/database_name"&lt;/span&gt;
&lt;span class="k"&gt;val&lt;/span&gt; &lt;span class="nv"&gt;jdbcUser&lt;/span&gt; &lt;span class="k"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"admin"&lt;/span&gt;
&lt;span class="k"&gt;val&lt;/span&gt; &lt;span class="nv"&gt;jdbcPassword&lt;/span&gt; &lt;span class="k"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"admin"&lt;/span&gt;
&lt;span class="k"&gt;val&lt;/span&gt; &lt;span class="nv"&gt;tableName&lt;/span&gt; &lt;span class="k"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"table_a"&lt;/span&gt;

&lt;span class="k"&gt;val&lt;/span&gt; &lt;span class="nv"&gt;connectionProperties&lt;/span&gt; &lt;span class="k"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nv"&gt;java&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;util&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;Properties&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt;
&lt;span class="nv"&gt;connectionProperties&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;setProperty&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"user"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;jdbcUser&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
&lt;span class="nv"&gt;connectionProperties&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;setProperty&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"password"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;jdbcPassword&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
&lt;span class="nv"&gt;connectionProperties&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;setProperty&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"fetchsize"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"1000000"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;val&lt;/span&gt; &lt;span class="nv"&gt;batchSize&lt;/span&gt; &lt;span class="k"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1000000000L&lt;/span&gt; &lt;span class="c1"&gt;// 1 billion&lt;/span&gt;
&lt;span class="k"&gt;val&lt;/span&gt; &lt;span class="nv"&gt;totalRecords&lt;/span&gt; &lt;span class="k"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;10000000000L&lt;/span&gt; &lt;span class="c1"&gt;// 10 billion&lt;/span&gt;
&lt;span class="k"&gt;val&lt;/span&gt; &lt;span class="nv"&gt;numBatches&lt;/span&gt; &lt;span class="k"&gt;=&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;totalRecords&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;batchSize&lt;/span&gt;&lt;span class="o"&gt;).&lt;/span&gt;&lt;span class="py"&gt;toInt&lt;/span&gt;

&lt;span class="k"&gt;val&lt;/span&gt; &lt;span class="nv"&gt;outputBasePath&lt;/span&gt; &lt;span class="k"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"data/path"&lt;/span&gt;

&lt;span class="nf"&gt;for&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;batchIndex&lt;/span&gt; &lt;span class="k"&gt;&amp;lt;-&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt; &lt;span class="n"&gt;until&lt;/span&gt; &lt;span class="n"&gt;numBatches&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
  &lt;span class="k"&gt;val&lt;/span&gt; &lt;span class="nv"&gt;startId&lt;/span&gt; &lt;span class="k"&gt;=&lt;/span&gt; &lt;span class="n"&gt;batchIndex&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;batchSize&lt;/span&gt;
  &lt;span class="k"&gt;val&lt;/span&gt; &lt;span class="nv"&gt;endId&lt;/span&gt; &lt;span class="k"&gt;=&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;batchIndex&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;batchSize&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;

  &lt;span class="k"&gt;val&lt;/span&gt; &lt;span class="nv"&gt;batchDF&lt;/span&gt; &lt;span class="k"&gt;=&lt;/span&gt; &lt;span class="nv"&gt;spark&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;read&lt;/span&gt;
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;jdbc&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;
      &lt;span class="n"&gt;jdbcUrl&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt;
      &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="s"&gt;"(SELECT * FROM $tableName WHERE id BETWEEN $startId AND $endId) AS tmp"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt;
      &lt;span class="n"&gt;connectionProperties&lt;/span&gt;
    &lt;span class="o"&gt;)&lt;/span&gt;

  &lt;span class="k"&gt;val&lt;/span&gt; &lt;span class="nv"&gt;batchOutputPath&lt;/span&gt; &lt;span class="k"&gt;=&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="s"&gt;"${outputBasePath}batch_${startId}_to_${endId}.parquet"&lt;/span&gt;
  &lt;span class="nf"&gt;println&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Working in export..."&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;

  &lt;span class="nv"&gt;batchDF&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;write&lt;/span&gt;
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;mode&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;SaveMode&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;Overwrite&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;csv&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;batchOutputPath&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Note: This snippet isn’t complete—it’s missing Spark setup and imports, but you get the idea.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Export time averaged around 1 hour and 20 minutes per batch, plus 40 minutes to write each file as CSV. Interestingly, export times increased with each batch, likely due to DB caching. Restarting the DB between exports might’ve helped.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Oh, and fun fact: we ended up with &lt;strong&gt;250 million rows&lt;/strong&gt;, which was 110 million more than estimated.&lt;/em&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Rebuilding the table.
&lt;/h3&gt;

&lt;p&gt;Once export was done, we needed to recreate the table structure with all its indexes and constraints. Fortunately, PostgreSQL makes that easy:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;new_table_a&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;LIKE&lt;/span&gt; &lt;span class="n"&gt;table_a&lt;/span&gt; &lt;span class="k"&gt;INCLUDING&lt;/span&gt; &lt;span class="k"&gt;ALL&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This clones the table structure: columns, types, defaults, indexes, and check constraints. It does not copy the data, and &lt;code&gt;LIKE&lt;/code&gt; doesn’t copy foreign key constraints either, so keep those in mind. Now we were ready to reload the data.&lt;/p&gt;
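
&lt;p&gt;If you’re paranoid (I was), a quick catalog query confirms the indexes really came along with the clone:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- sanity check: indexes on the new table (default schema assumed)
SELECT indexname, indexdef
FROM pg_indexes
WHERE tablename = 'new_table_a';
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;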

&lt;h3&gt;
  
  
  Reloading the data.
&lt;/h3&gt;

&lt;p&gt;In our case, no data transformation was needed, so we went straight to reimporting.&lt;/p&gt;

&lt;p&gt;I benchmarked two methods:&lt;/p&gt;

&lt;h4&gt;
  
  
  1. Batch inserts (Spark)
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight scala"&gt;&lt;code&gt;&lt;span class="nf"&gt;for&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;batchIndex&lt;/span&gt; &lt;span class="k"&gt;&amp;lt;-&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt; &lt;span class="n"&gt;until&lt;/span&gt; &lt;span class="n"&gt;numBatches&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
  &lt;span class="k"&gt;val&lt;/span&gt; &lt;span class="nv"&gt;startId&lt;/span&gt; &lt;span class="k"&gt;=&lt;/span&gt; &lt;span class="n"&gt;batchIndex&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;batchSize&lt;/span&gt;
  &lt;span class="k"&gt;val&lt;/span&gt; &lt;span class="nv"&gt;endId&lt;/span&gt; &lt;span class="k"&gt;=&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;batchIndex&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;batchSize&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
  &lt;span class="k"&gt;val&lt;/span&gt; &lt;span class="nv"&gt;csvPath&lt;/span&gt; &lt;span class="k"&gt;=&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="s"&gt;"${inputBasePath}batch_${startId}_to_${endId}.csv"&lt;/span&gt;

  &lt;span class="k"&gt;val&lt;/span&gt; &lt;span class="nv"&gt;df&lt;/span&gt; &lt;span class="k"&gt;=&lt;/span&gt; &lt;span class="nv"&gt;spark&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;read&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;csv&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;csvPath&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
  &lt;span class="k"&gt;val&lt;/span&gt; &lt;span class="nv"&gt;dfToWrite&lt;/span&gt; &lt;span class="k"&gt;=&lt;/span&gt; &lt;span class="nv"&gt;df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;repartition&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;

  &lt;span class="nv"&gt;dfToWrite&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;write&lt;/span&gt;
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;mode&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;SaveMode&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;Append&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;option&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"isolationLevel"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"NONE"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;jdbc&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;jdbcUrl&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;jdbcTable&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;connectionProperties&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Using this method, we managed to import &lt;strong&gt;250 million rows in about 40 minutes&lt;/strong&gt;.&lt;/p&gt;

&lt;h4&gt;
  
  
  2. COPY command
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight scala"&gt;&lt;code&gt;&lt;span class="k"&gt;val&lt;/span&gt; &lt;span class="nv"&gt;numExecutors&lt;/span&gt; &lt;span class="k"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;

&lt;span class="k"&gt;val&lt;/span&gt; &lt;span class="nv"&gt;fs&lt;/span&gt; &lt;span class="k"&gt;=&lt;/span&gt; &lt;span class="nv"&gt;FileSystem&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;get&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;spark&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;sparkContext&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;hadoopConfiguration&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;val&lt;/span&gt; &lt;span class="nv"&gt;baseDir&lt;/span&gt; &lt;span class="k"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Path&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;inputBasePath&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;val&lt;/span&gt; &lt;span class="nv"&gt;csvFiles&lt;/span&gt; &lt;span class="k"&gt;=&lt;/span&gt; &lt;span class="nv"&gt;fs&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;listStatus&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;baseDir&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
  &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;filter&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;_&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;getPath&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;getName&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;endsWith&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;".csv"&lt;/span&gt;&lt;span class="o"&gt;))&lt;/span&gt;
  &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;map&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;_&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;getPath&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;toString&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
  &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;sorted&lt;/span&gt;

&lt;span class="k"&gt;val&lt;/span&gt; &lt;span class="nv"&gt;executor&lt;/span&gt; &lt;span class="k"&gt;=&lt;/span&gt; &lt;span class="nv"&gt;Executors&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;newFixedThreadPool&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;numExecutors&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;

&lt;span class="nv"&gt;csvFiles&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;foreach&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt; &lt;span class="n"&gt;csvPath&lt;/span&gt; &lt;span class="k"&gt;=&amp;gt;&lt;/span&gt;
  &lt;span class="nv"&gt;executor&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;submit&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Runnable&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;override&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt;&lt;span class="k"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;Unit&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
      &lt;span class="k"&gt;var&lt;/span&gt; &lt;span class="n"&gt;connection&lt;/span&gt;&lt;span class="k"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;Connection&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;null&lt;/span&gt;
      &lt;span class="k"&gt;var&lt;/span&gt; &lt;span class="n"&gt;reader&lt;/span&gt;&lt;span class="k"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;BufferedReader&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;null&lt;/span&gt;

      &lt;span class="k"&gt;try&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;connection&lt;/span&gt; &lt;span class="k"&gt;=&lt;/span&gt; &lt;span class="nv"&gt;DriverManager&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;getConnection&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;jdbcUrl&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;jdbcUser&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;jdbcPassword&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
        &lt;span class="nv"&gt;connection&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;setAutoCommit&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="kc"&gt;false&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;

        &lt;span class="k"&gt;val&lt;/span&gt; &lt;span class="nv"&gt;hadoopPath&lt;/span&gt; &lt;span class="k"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Path&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;csvPath&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;

        &lt;span class="k"&gt;val&lt;/span&gt; &lt;span class="nv"&gt;localPath&lt;/span&gt; &lt;span class="k"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;if&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;csvPath&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;startsWith&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"file:"&lt;/span&gt;&lt;span class="o"&gt;))&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
          &lt;span class="nv"&gt;csvPath&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;substring&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
        &lt;span class="o"&gt;}&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
          &lt;span class="k"&gt;val&lt;/span&gt; &lt;span class="nv"&gt;tmpDir&lt;/span&gt; &lt;span class="k"&gt;=&lt;/span&gt; &lt;span class="nv"&gt;System&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;getProperty&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"java.io.tmpdir"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
          &lt;span class="k"&gt;val&lt;/span&gt; &lt;span class="nv"&gt;localFile&lt;/span&gt; &lt;span class="k"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;File&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tmpDir&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="s"&gt;"import_${System.currentTimeMillis()}_${hadoopPath.getName}"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
          &lt;span class="nv"&gt;fs&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;copyToLocalFile&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;hadoopPath&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Path&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;localFile&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;getAbsolutePath&lt;/span&gt;&lt;span class="o"&gt;))&lt;/span&gt;
          &lt;span class="nv"&gt;localFile&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;getAbsolutePath&lt;/span&gt;
        &lt;span class="o"&gt;}&lt;/span&gt;

        &lt;span class="k"&gt;val&lt;/span&gt; &lt;span class="nv"&gt;copyManager&lt;/span&gt; &lt;span class="k"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;CopyManager&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;connection&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;asInstanceOf&lt;/span&gt;&lt;span class="o"&gt;[&lt;/span&gt;&lt;span class="kt"&gt;BaseConnection&lt;/span&gt;&lt;span class="o"&gt;])&lt;/span&gt;
        &lt;span class="k"&gt;val&lt;/span&gt; &lt;span class="nv"&gt;copySQL&lt;/span&gt; &lt;span class="k"&gt;=&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="s"&gt;"""COPY $jdbcTable FROM STDIN WITH (FORMAT CSV, DELIMITER '$delimiter', HEADER)"""&lt;/span&gt;

        &lt;span class="n"&gt;reader&lt;/span&gt; &lt;span class="k"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;BufferedReader&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;FileReader&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;localPath&lt;/span&gt;&lt;span class="o"&gt;))&lt;/span&gt;
        &lt;span class="k"&gt;val&lt;/span&gt; &lt;span class="nv"&gt;rowsCopied&lt;/span&gt; &lt;span class="k"&gt;=&lt;/span&gt; &lt;span class="nv"&gt;copyManager&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;copyIn&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;copySQL&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;reader&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;

        &lt;span class="nv"&gt;connection&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;commit&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt;
        &lt;span class="nf"&gt;println&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="s"&gt;"[${Thread.currentThread().getName}] Imported: $rowsCopied rows."&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;

      &lt;span class="o"&gt;}&lt;/span&gt; &lt;span class="k"&gt;catch&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;case&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="k"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;Exception&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt;
          &lt;span class="nf"&gt;if&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;connection&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="kc"&gt;null&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="nc"&gt;Try&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;connection&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;rollback&lt;/span&gt;&lt;span class="o"&gt;())&lt;/span&gt;
          &lt;span class="nf"&gt;println&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="s"&gt;"[${Thread.currentThread().getName}] ERROR: ${e.getMessage}"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
          &lt;span class="nv"&gt;e&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;printStackTrace&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt;
      &lt;span class="o"&gt;}&lt;/span&gt; &lt;span class="k"&gt;finally&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
        &lt;span class="nf"&gt;if&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;reader&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="kc"&gt;null&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="nc"&gt;Try&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;reader&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;close&lt;/span&gt;&lt;span class="o"&gt;())&lt;/span&gt;
        &lt;span class="nf"&gt;if&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;connection&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="kc"&gt;null&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="nc"&gt;Try&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;connection&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;close&lt;/span&gt;&lt;span class="o"&gt;())&lt;/span&gt;
      &lt;span class="o"&gt;}&lt;/span&gt;
    &lt;span class="o"&gt;}&lt;/span&gt;
  &lt;span class="o"&gt;})&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Yes, I know both methods should be parallelized for a fairer comparison—but even running single-threaded, &lt;code&gt;COPY&lt;/code&gt; was much faster.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;If you’re on AWS RDS, you can still use &lt;code&gt;COPY&lt;/code&gt;, but the files go to an S3 bucket instead of the DB server. Check the AWS docs for more.&lt;/p&gt;
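
&lt;p&gt;As a rough sketch of what that looks like on RDS (the bucket, key, and region below are placeholders, and the &lt;code&gt;aws_s3&lt;/code&gt; extension must be installed; check the AWS docs for the exact setup):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;CREATE EXTENSION IF NOT EXISTS aws_s3 CASCADE;

SELECT aws_s3.table_import_from_s3(
  'new_table_a',            -- target table
  '',                       -- column list ('' = all columns)
  '(FORMAT CSV, HEADER)',   -- COPY options
  aws_commons.create_s3_uri('my-bucket', 'exports/batch_0.csv', 'us-east-1')
);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;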

&lt;p&gt;With &lt;code&gt;COPY&lt;/code&gt;, we loaded &lt;strong&gt;250 million rows in just over 15 minutes&lt;/strong&gt;.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;em&gt;The &lt;code&gt;COPY&lt;/code&gt; command is inherently faster because it streams the whole file through a single command, skipping the per-statement parsing, planning, and round-trip overhead of individual inserts. One downside: it’s all-or-nothing per file; if something breaks midway, nothing from that file is committed and you can’t resume halfway. If you want a deeper dive into &lt;code&gt;COPY&lt;/code&gt;, check out this great article by a friend of mine: &lt;a href="https://dev.to/josethz00/speed-up-your-postgresql-bulk-inserts-with-copy-40pk"&gt;Speed up your PostgreSQL bulk inserts with COPY&lt;/a&gt;&lt;/em&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Finally, to complete the switch:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;BEGIN&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;DROP&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;table_a&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;ALTER&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;new_table_a&lt;/span&gt; &lt;span class="k"&gt;RENAME&lt;/span&gt; &lt;span class="k"&gt;TO&lt;/span&gt; &lt;span class="n"&gt;table_a&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;COMMIT&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Note: If any foreign keys referenced the original table, you’ll need to drop and recreate them manually. PostgreSQL doesn’t have a built-in way to do this automatically.&lt;/em&gt;&lt;/p&gt;
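
&lt;p&gt;You can at least find every foreign key pointing at the old table with a catalog query like this (a sketch; double-check it against your schema before dropping anything):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- foreign keys in other tables that reference table_a
SELECT conname,
       conrelid::regclass AS referencing_table,
       pg_get_constraintdef(oid) AS definition
FROM pg_constraint
WHERE contype = 'f'
  AND confrelid = 'table_a'::regclass;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;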

&lt;p&gt;The rebuild process was done during a service freeze—no new data was being written or queried during the whole extraction/import. Some smaller tables were fixed with just &lt;code&gt;VACUUM FULL&lt;/code&gt; and &lt;code&gt;REINDEX&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;The result?&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Disk usage dropped by over &lt;strong&gt;60%&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;More than &lt;strong&gt;1.5 TB&lt;/strong&gt; of space was freed.&lt;/li&gt;
&lt;li&gt;Massive performance gains across the board.&lt;/li&gt;
&lt;/ul&gt;

&lt;h1&gt;
  
  
  How can this be avoided?
&lt;/h1&gt;

&lt;p&gt;After solving the issue, I went digging to figure out what might’ve caused those massive delete operations that ultimately blocked PostgreSQL from running &lt;code&gt;VACUUM&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;I talked to a few devs who had direct access to the DB, and they told me a service was being called dozens (maybe hundreds) of times a day. And every time it ran, it inserted the incoming data into the database.&lt;/p&gt;

&lt;p&gt;Here’s the catch: those inserts were often unnecessary. Most of the time, the service was receiving duplicated data. But instead of checking for duplicates, the application just blindly inserted everything. There were no proper constraints on the table either. So... boom: we ended up with a mountain of redundant records.&lt;/p&gt;
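
&lt;p&gt;A constraint at the database level would have stopped this at the source. As a sketch (here &lt;code&gt;name&lt;/code&gt; and &lt;code&gt;age&lt;/code&gt; stand in for whatever actually defines a duplicate):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- enforce uniqueness once...
ALTER TABLE table_a
  ADD CONSTRAINT table_a_name_age_key UNIQUE (name, age);

-- ...and make the service's insert an idempotent no-op on duplicates
INSERT INTO table_a (name, age)
VALUES ('Ada', 36)
ON CONFLICT (name, age) DO NOTHING;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;em&gt;Adding the constraint also builds its supporting index, which isn’t free on a huge table, but it’s far cheaper than cleaning up millions of duplicates later.&lt;/em&gt;&lt;/p&gt;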

&lt;p&gt;Eventually, someone realized the performance was tanking and decided to clean up the duplicated data. That’s where our villain enters the story: the way those deletes were done.&lt;/p&gt;

&lt;p&gt;Here’s the actual code that was used:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;DO&lt;/span&gt; &lt;span class="err"&gt;$$&lt;/span&gt; &lt;span class="k"&gt;DECLARE&lt;/span&gt;
  &lt;span class="n"&gt;total_count&lt;/span&gt; &lt;span class="nb"&gt;INTEGER&lt;/span&gt; &lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;BEGIN&lt;/span&gt;
  &lt;span class="n"&gt;LOOP&lt;/span&gt;
    &lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TEMP&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;table_a_duplicates&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; 
      &lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt; 
      &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;table_a&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt; 
      &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;age&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;IN&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;age&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;table_a&lt;/span&gt; 
        &lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;age&lt;/span&gt; &lt;span class="k"&gt;HAVING&lt;/span&gt; &lt;span class="k"&gt;COUNT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
      &lt;span class="p"&gt;)&lt;/span&gt; 
      &lt;span class="k"&gt;LIMIT&lt;/span&gt; &lt;span class="mi"&gt;100000&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

    &lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="k"&gt;count&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;INTO&lt;/span&gt; &lt;span class="n"&gt;total_count&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;table_a_duplicates&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

    &lt;span class="k"&gt;DELETE&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;table_a&lt;/span&gt; 
    &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="k"&gt;IN&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;table_a_duplicates&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

    &lt;span class="k"&gt;DROP&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;table_a_duplicates&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

    &lt;span class="k"&gt;COMMIT&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

    &lt;span class="n"&gt;EXIT&lt;/span&gt; &lt;span class="k"&gt;WHEN&lt;/span&gt; &lt;span class="n"&gt;total_count&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

    &lt;span class="n"&gt;PERFORM&lt;/span&gt; &lt;span class="n"&gt;pg_sleep&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="k"&gt;END&lt;/span&gt; &lt;span class="n"&gt;LOOP&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;END&lt;/span&gt; &lt;span class="err"&gt;$$&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It’s a pretty straightforward loop:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Find duplicate rows based on name and age&lt;/li&gt;
&lt;li&gt;Store their &lt;code&gt;id&lt;/code&gt;s in a temp table (in batches of 100k)&lt;/li&gt;
&lt;li&gt;Delete them from the main table&lt;/li&gt;
&lt;li&gt;Repeat until there are no more duplicates&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Performance-wise, this approach isn’t too bad. The real problem? No &lt;code&gt;VACUUM&lt;/code&gt; was run during or after the process. This table had &lt;strong&gt;billions&lt;/strong&gt; of rows. The loop ran for days, and eventually the table became so bloated that &lt;code&gt;VACUUM&lt;/code&gt; couldn’t finish anymore. That’s what really tanked the DB.&lt;/p&gt;
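&lt;p&gt;If you want to watch this kind of bloat build up while a long batch job runs, a quick (approximate) check is the dead-tuple counters in &lt;code&gt;pg_stat_user_tables&lt;/code&gt;. A sketch, using &lt;code&gt;table_a&lt;/code&gt; from the example:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- n_dead_tup climbing while last_autovacuum stays old is the warning sign
SELECT relname,
       n_live_tup,
       n_dead_tup,
       last_vacuum,
       last_autovacuum
FROM pg_stat_user_tables
WHERE relname = 'table_a';
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The counters are statistics, not exact numbers, but they are more than enough to tell whether autovacuum is keeping up with a mass delete.&lt;/p&gt;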

&lt;p&gt;Now, what if they had just added these three lines?&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="n"&gt;PERFORM&lt;/span&gt; &lt;span class="n"&gt;pg_catalog&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;pg_advisory_lock&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;12345&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="k"&gt;EXECUTE&lt;/span&gt; &lt;span class="s1"&gt;'VACUUM (ANALYZE) table_a'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="n"&gt;PERFORM&lt;/span&gt; &lt;span class="n"&gt;pg_catalog&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;pg_advisory_unlock&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;12345&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt; 
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;With those lines, the cleanup block would’ve looked like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;DO&lt;/span&gt; &lt;span class="err"&gt;$$&lt;/span&gt; &lt;span class="k"&gt;DECLARE&lt;/span&gt;
  &lt;span class="n"&gt;total_count&lt;/span&gt; &lt;span class="nb"&gt;INTEGER&lt;/span&gt; &lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;BEGIN&lt;/span&gt;
  &lt;span class="n"&gt;LOOP&lt;/span&gt;
    &lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TEMP&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;table_a_duplicates&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; 
      &lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt; 
      &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;table_a&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt; 
      &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;age&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;IN&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;age&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;table_a&lt;/span&gt; 
        &lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;age&lt;/span&gt; &lt;span class="k"&gt;HAVING&lt;/span&gt; &lt;span class="k"&gt;COUNT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
      &lt;span class="p"&gt;)&lt;/span&gt; 
      &lt;span class="k"&gt;LIMIT&lt;/span&gt; &lt;span class="mi"&gt;100000&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

    &lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="k"&gt;count&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;INTO&lt;/span&gt; &lt;span class="n"&gt;total_count&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;table_a_duplicates&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

    &lt;span class="k"&gt;DELETE&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;table_a&lt;/span&gt; 
    &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="k"&gt;IN&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;table_a_duplicates&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

    &lt;span class="k"&gt;DROP&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;table_a_duplicates&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

    &lt;span class="k"&gt;COMMIT&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

    &lt;span class="n"&gt;EXIT&lt;/span&gt; &lt;span class="k"&gt;WHEN&lt;/span&gt; &lt;span class="n"&gt;total_count&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

    &lt;span class="n"&gt;PERFORM&lt;/span&gt; &lt;span class="n"&gt;pg_catalog&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;pg_advisory_lock&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;12345&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="k"&gt;EXECUTE&lt;/span&gt; &lt;span class="s1"&gt;'VACUUM (ANALYZE) table_a'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="n"&gt;PERFORM&lt;/span&gt; &lt;span class="n"&gt;pg_catalog&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;pg_advisory_unlock&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;12345&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

    &lt;span class="n"&gt;PERFORM&lt;/span&gt; &lt;span class="n"&gt;pg_sleep&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="k"&gt;END&lt;/span&gt; &lt;span class="n"&gt;LOOP&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;END&lt;/span&gt; &lt;span class="err"&gt;$$&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This entire problem could’ve been avoided by having the right &lt;strong&gt;constraints&lt;/strong&gt; in place to begin with. The issue here was the lack of a unique constraint to prevent duplicate rows. The app should’ve either rejected duplicates or used an upsert (&lt;code&gt;INSERT ... ON CONFLICT&lt;/code&gt; in PostgreSQL).&lt;/p&gt;
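&lt;p&gt;A minimal sketch of that fix, assuming the dedup key really is &lt;code&gt;(name, age)&lt;/code&gt; as in the example (constraint name and values are illustrative):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Enforce uniqueness so duplicates can never accumulate again
ALTER TABLE table_a
  ADD CONSTRAINT table_a_name_age_key UNIQUE (name, age);

-- The application can then upsert instead of blindly inserting
INSERT INTO table_a (name, age)
VALUES ('Pedro', 30)
ON CONFLICT (name, age) DO NOTHING;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;code&gt;ON CONFLICT ... DO UPDATE&lt;/code&gt; works the same way when the new row should win instead of being ignored.&lt;/p&gt;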

&lt;p&gt;This is why it's crucial to test against real-world scenarios and take time to understand your schema. Well-designed constraints can save your DB from future disaster.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;For more on designing scalable, robust PostgreSQL databases, I wrote a follow-up piece: &lt;a href="https://dev.to/pedrohgoncalves/designing-robust-and-scalable-relational-databases-a-series-of-best-practices-1i20"&gt;Designing robust and scalable relational databases: A series of best practices&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;After the rebuild, performance improved massively—insert, update, delete, read, maintenance, everything. I didn’t have any performance-monitoring tools enabled at the time (which is on me), so I couldn’t get hard numbers, but based on sample queries, we saw &lt;strong&gt;an average 250% improvement&lt;/strong&gt; in response time. Oh, and we also freed up more than &lt;strong&gt;1.5 TB&lt;/strong&gt; of disk space.&lt;/p&gt;

&lt;h1&gt;
  
  
  Conclusion
&lt;/h1&gt;

&lt;p&gt;PostgreSQL has its quirks—but with solid modeling, those quirks rarely become a problem. Of course, surprises and edge cases are unavoidable, but the goal is to reduce how often they happen.&lt;/p&gt;

&lt;p&gt;Sometimes, a single operation executed without proper thought—especially in large-scale environments—can lead to catastrophic outcomes, potentially costing thousands of dollars in downtime, troubleshooting, and emergency fixes.&lt;/p&gt;

&lt;p&gt;Before you run anything that touches a large amount of data, take a moment to ask yourself:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;“How will the database handle this volume of information?”&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;And study the cost of each operation—some are cheap, others will absolutely wreck your performance if used carelessly.&lt;/p&gt;

&lt;p&gt;Also, don’t forget: optimization should always align with how your services interact with the database. That’s where real gains happen. Having monitoring in place—like &lt;strong&gt;AWS Performance Insights&lt;/strong&gt; or enabling &lt;strong&gt;pg_stat_statements&lt;/strong&gt;—gives you visibility into where the real bottlenecks are, so you’re not left guessing.&lt;/p&gt;
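&lt;p&gt;For example, once &lt;code&gt;pg_stat_statements&lt;/code&gt; is loaded (it must be listed in &lt;code&gt;shared_preload_libraries&lt;/code&gt;), a query like the following surfaces the heaviest statements. The column names below are the PostgreSQL 13+ ones; older versions use &lt;code&gt;total_time&lt;/code&gt; and &lt;code&gt;mean_time&lt;/code&gt;:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;CREATE EXTENSION IF NOT EXISTS pg_stat_statements;

-- Top 10 statements by total execution time
SELECT query,
       calls,
       total_exec_time,
       mean_exec_time,
       rows
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 10;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;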

&lt;p&gt;(I'm planning to write a guide on using &lt;code&gt;pg_stat_statements&lt;/code&gt; soon, by the way.)&lt;/p&gt;




&lt;p&gt;I hope this article helped clarify the root of the problem and gave some insight into how we approached the fix.&lt;/p&gt;

&lt;p&gt;Thanks a lot for reading 🙌&lt;/p&gt;

</description>
      <category>database</category>
      <category>postgres</category>
      <category>advanced</category>
      <category>performance</category>
    </item>
    <item>
      <title>Designing robust and scalable relational databases: A series of best practices.</title>
      <dc:creator>Pedro H Goncalves</dc:creator>
      <pubDate>Tue, 19 Nov 2024 18:10:58 +0000</pubDate>
      <link>https://dev.to/pedrohgoncalves/designing-robust-and-scalable-relational-databases-a-series-of-best-practices-1i20</link>
      <guid>https://dev.to/pedrohgoncalves/designing-robust-and-scalable-relational-databases-a-series-of-best-practices-1i20</guid>
      <description>&lt;p&gt;Throughout my experience as a data engineer and software developer, I've had the pleasure (and displeasure) of creating and maintaining many databases. In this article, I'll list some best practices that I consider essential for relational databases to become truly scalable.&lt;/p&gt;

&lt;p&gt;Some practices, such as adding indexes, table partitioning, and studying normalization and denormalization applications, will require a slightly deeper level of knowledge to be used correctly and not generate future or immediate problems. I'll describe some of these points, but it's highly recommended that you study them more in-depth. Additionally, some of the practices I'll list are not possible to implement in existing, consolidated systems that primarily have many interface points (services that consume from them); changes in these cases can mean a lot of headaches. In any case, it's highly recommended that you use these practices in your new projects, even if you're the only developer. Your future self will thank your past self for implementing them.&lt;/p&gt;

&lt;p&gt;Although I'll list some practices, concepts, and tools, I don't intend to delve deeply into them. My aim is to provide a basic explanation of each topic so that you can analyze and see if it makes sense to apply them to your projects. Therefore, for some of the topics, I'll leave a link to an article/post/answer that I believe has good quality regarding that theme and goes into more depth.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;While this article covers several crucial aspects of database management and design, some advanced topics deserve a more in-depth exploration. Subjects like MPP, sharding, distributed processing and storage are complex enough to warrant their own dedicated article.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;I plan to address these topics in a future piece where I can give them the careful consideration and detailed explanation they merit.&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  What we will see:
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Monitoring your database.&lt;/li&gt;
&lt;li&gt;Low-level configuration files and variables.&lt;/li&gt;
&lt;li&gt;Naming Convention Standardization.&lt;/li&gt;
&lt;li&gt;Correct Use of Data Types.&lt;/li&gt;
&lt;li&gt;Normalizing and Denormalizing: How Far to Go.&lt;/li&gt;
&lt;li&gt;Choosing a Tool for Versioning and Testing.&lt;/li&gt;
&lt;li&gt;Documenting Your Entities and Fields.&lt;/li&gt;
&lt;li&gt;Applying Indexes Correctly.&lt;/li&gt;
&lt;li&gt;Modern Backup and Restoration Pipelines.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Monitoring your database
&lt;/h2&gt;

&lt;p&gt;Monitoring your database is one of the tasks that should start in the early stages of your application. It's completely understandable that part of the product/service development team's focus is on making things work, but monitoring your database and its entities can deliver important insights about your application and infrastructure. Good database monitoring answers questions such as:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Average query response time.&lt;/li&gt;
&lt;li&gt;Throughput (number of transactions per second).&lt;/li&gt;
&lt;li&gt;CPU and memory utilization.&lt;/li&gt;
&lt;li&gt;Disk I/O (reads/writes per second).&lt;/li&gt;
&lt;li&gt;Uptime (available time within a range).&lt;/li&gt;
&lt;li&gt;Error rates.&lt;/li&gt;
&lt;li&gt;Deadlocks.&lt;/li&gt;
&lt;li&gt;Locks.&lt;/li&gt;
&lt;li&gt;Failed login attempts.&lt;/li&gt;
&lt;li&gt;Backup metrics.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Some of these metrics don't apply at the table level, but the deeper you can drill down, the easier future investigations become. By drilling down, I mean going from a database-wide figure like average query execution time, down to per-table averages, and finally down to individual queries, so you can see which kinds of queries take longest against a given table.&lt;/p&gt;

&lt;p&gt;As a recommendation, I suggest a very famous monitoring stack: &lt;a href="https://prometheus.io/" rel="noopener noreferrer"&gt;Prometheus&lt;/a&gt;, a monitoring system/time-series database, with &lt;a href="https://grafana.com/" rel="noopener noreferrer"&gt;Grafana&lt;/a&gt;, a tool for visualizing metrics through graphs. If for some reason you can't or don't want to use these tools, answering all of these questions with a single tool (especially an OSS one) will be difficult. In that case, the ideal approach is to evaluate several tools and divide responsibilities clearly between them so they don't overlap; overlapping tools can report conflicting numbers, which is a big headache to reconcile.&lt;/p&gt;




&lt;h2&gt;
  
  
  Low-level configuration files and variables
&lt;/h2&gt;

&lt;p&gt;I believe that, like me, many people are curious about what configuration files like &lt;code&gt;postgresql.conf&lt;/code&gt; or &lt;code&gt;my.cnf&lt;/code&gt; can actually change in the database, whether in performance, authorization, transactions, or anything else with a visible effect. To answer the question right away: yes, configuration files can and do dictate much of the DBMS's behavior and how it manages transactions and manipulates data. Systems like PostgreSQL and MySQL ship with default values for essentially every tunable setting, and those defaults are conservative, chosen to work in modest local environments (these DBMSs are widely used in test environments and small to medium-sized projects). This means that in environments with large data volumes and a dedicated server with high processing capacity, you can (and should) adjust these settings so the hardware is actually put to use.&lt;/p&gt;

&lt;p&gt;Using some PostgreSQL configurations as a basis, I'll mention some important variables that can be altered. Ideally, you should read the documentation about these variables, study your needs, and test the changes to understand how they affect performance and whether they create problems elsewhere.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;shared_buffers&lt;/strong&gt;: This is the shared memory for data caching. Increasing the memory that the DBMS can consume significantly increases the query speed of frequently accessed data.&lt;/p&gt;

&lt;p&gt;Values:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;OLTP environments: 25 - 40% RAM&lt;/li&gt;
&lt;li&gt;OLAP environments: 50 - 70% RAM&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;p&gt;&lt;strong&gt;effective_cache_size&lt;/strong&gt;: An estimate of how much disk cache is available between the operating system and PostgreSQL. It doesn't allocate any memory itself; it's a hint that helps the query planner decide, for example, whether an index is likely to fit entirely in cache.&lt;/p&gt;

&lt;p&gt;Values:  &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;OLTP environments: 50 - 70% RAM&lt;/li&gt;
&lt;li&gt;OLAP environments: 80 - 90% RAM&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;p&gt;&lt;strong&gt;work_mem&lt;/strong&gt;: The amount of memory PostgreSQL may use for each sort operation (&lt;code&gt;ORDER BY&lt;/code&gt;) and each hash operation, such as aggregations and joins. Note that it applies per operation, not per connection, so a single complex query can consume several multiples of this value.&lt;/p&gt;

&lt;p&gt;Values:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;OLTP environments: 4MB - 64MB&lt;/li&gt;
&lt;li&gt;OLAP environments: 64MB - 1GB&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;p&gt;&lt;strong&gt;max_connections&lt;/strong&gt;: This is the maximum number of simultaneous connections. It's a great starting point for controlling concurrency; too many connections can overload the server.&lt;/p&gt;

&lt;p&gt;Values:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;OLTP environments: 200 - 500&lt;/li&gt;
&lt;li&gt;OLAP environments: 20 - 60&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;p&gt;&lt;strong&gt;max_wal_size&lt;/strong&gt;: The Write-Ahead-Log (WAL) is the mechanism that ensures data integrity by recording changes before applying them to the database. A larger value decreases the frequency of checkpoints, which increases write performance but also increases recovery times after failure.&lt;/p&gt;

&lt;p&gt;Values:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;OLTP environments: 1GB - 4GB&lt;/li&gt;
&lt;li&gt;OLAP environments: 4GB - 16GB&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;p&gt;There are numerous other configurations that will make a difference across all aspects of your database, including resource consumption, authentication, encryption, connections, and other different areas. I strongly recommend reading the documentation for your DBMS and adapting to your needs, always making a prior backup to be able to return in case of unexpected behaviors and analyzing your database performance through monitoring.&lt;/p&gt;
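&lt;p&gt;As a rough illustration only (assuming a hypothetical dedicated OLTP server with 32 GB of RAM; always validate against your own workload and documentation), the ranges above could translate into a &lt;code&gt;postgresql.conf&lt;/code&gt; fragment like:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# postgresql.conf - illustrative values for a 32 GB OLTP server
shared_buffers = 8GB          # ~25% of RAM
effective_cache_size = 20GB   # ~60% of RAM (planner hint, not an allocation)
work_mem = 16MB               # per sort/hash operation, not per connection
max_connections = 300
max_wal_size = 2GB
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;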




&lt;h2&gt;
  
  
  Naming Convention Standardization
&lt;/h2&gt;

&lt;p&gt;Naming convention standardization is one of the most powerful practices, but it's probably the most difficult to implement. It's easy to implement during the initial database modeling, but it's very easy to "lose" throughout development, and in many cases, it's impossible to implement in existing systems.&lt;/p&gt;

&lt;p&gt;With naming convention standardization, we can list a few points:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Reduction of redundant names, such as the field &lt;code&gt;client_name&lt;/code&gt; in the &lt;code&gt;client&lt;/code&gt; table where it could simply be &lt;code&gt;name&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Standardization of abbreviations. In one table, you might have the field &lt;code&gt;sale_value&lt;/code&gt;, and in another, you have &lt;code&gt;vl_service&lt;/code&gt;, where the first could be &lt;code&gt;vl_sale&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Singular names for tables and fields. Instead of naming a table &lt;code&gt;clients&lt;/code&gt;, you would name it &lt;code&gt;client&lt;/code&gt;, and the same rule applies to fields, with the exception of fields that truly represent more than one value. An example could be a &lt;code&gt;tags&lt;/code&gt; field of type ARRAY of STRINGS.&lt;/li&gt;
&lt;li&gt;Schema naming. Not every database has the possibility of dividing entities by schemas, but the most famous one currently (PostgreSQL) does. One of the recommendations is to work with short names, such as the schema that organizes finance-related tables being abbreviated to &lt;code&gt;fin&lt;/code&gt;, &lt;code&gt;hr&lt;/code&gt; for human resources, or &lt;code&gt;mkt&lt;/code&gt; for marketing. &lt;em&gt;You might wonder why short names for schemas and not for tables and fields? Schemas are the most comprehensive hierarchical class; in most cases, the tables already explain what that schema means, so there's no real need for a long description.&lt;/em&gt;
&lt;/li&gt;
&lt;/ul&gt;
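&lt;p&gt;A small hypothetical DDL fragment applying these conventions (schema &lt;code&gt;fin&lt;/code&gt;, singular table names, consistent &lt;code&gt;vl_&lt;/code&gt; abbreviation; all names are illustrative):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;CREATE SCHEMA fin;

CREATE TABLE fin.client (
    id   BIGINT GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    name VARCHAR(120) NOT NULL,  -- not client_name: the table gives the context
    tags TEXT[]                  -- plural only because it truly holds many values
);

CREATE TABLE fin.sale (
    id        BIGINT GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    client_id BIGINT NOT NULL REFERENCES fin.client (id),
    vl_sale   NUMERIC(12,2) NOT NULL  -- same vl_ abbreviation everywhere
);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;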

&lt;p&gt;&lt;a href="https://www.javatpoint.com/data-dictionary-storage" rel="noopener noreferrer"&gt;Data Dictionary in DBMS - javatpoint&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Correct Use of Data Types
&lt;/h2&gt;

&lt;p&gt;Using the correct data types seems too obvious to mention, and you've surely read or heard about it in more than one course or article. The important point here is: if you have the opportunity to study the data beforehand, take the time to define exactly how it should be stored. Beyond making inserts and reads faster, the correct data types act as a safety layer, rejecting values that don't belong.&lt;/p&gt;

&lt;p&gt;It's also important to understand the difference between data types that &lt;strong&gt;seem&lt;/strong&gt; interchangeable, like &lt;em&gt;varchar&lt;/em&gt; and &lt;em&gt;char&lt;/em&gt;, &lt;em&gt;tinyint&lt;/em&gt; and &lt;em&gt;bigint&lt;/em&gt;, &lt;em&gt;varchar&lt;/em&gt; and &lt;em&gt;text&lt;/em&gt;, &lt;em&gt;text&lt;/em&gt; and &lt;em&gt;blob&lt;/em&gt;, or &lt;em&gt;numeric&lt;/em&gt;/&lt;em&gt;decimal&lt;/em&gt; and &lt;em&gt;float&lt;/em&gt;. The choice between them makes no noticeable difference at small data volumes, but with billions of rows, many columns, and several indexes, picking the right type can bring a valuable performance difference to the application.&lt;/p&gt;

&lt;p&gt;I'll mention some examples of use cases of data types that &lt;em&gt;seemed&lt;/em&gt; correct but actually generated future problems.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;The use of &lt;strong&gt;FLOAT&lt;/strong&gt; to store transaction values. Why is this a problem? Float (like double precision) uses &lt;strong&gt;floating point&lt;/strong&gt;, a storage and representation technique that, roughly speaking, accepts rounding in exchange for faster reads and writes, unlike types such as &lt;strong&gt;DECIMAL&lt;/strong&gt; and &lt;strong&gt;NUMERIC&lt;/strong&gt;, which use the &lt;strong&gt;fixed point&lt;/strong&gt; technique and store values exactly, without rounding. In financial transactions rounding is unacceptable: at some point the accumulated rounding errors become significant, and we all know how that story ends.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The use of &lt;strong&gt;CHAR(355)&lt;/strong&gt; to store product descriptions. The char type has a fixed size, which means that if a product description is shorter than 355 characters, the database pads the rest with spaces. As the number of stored products grew, the storage difference between CHAR and &lt;strong&gt;VARCHAR&lt;/strong&gt; became substantial, and the padding also hurt the performance of indexes such as FULLTEXT (we'll talk about indexes later).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The use of &lt;strong&gt;VARCHAR&lt;/strong&gt; to store negotiation statuses. The field would have (or at least should have) only three states: &lt;em&gt;approved&lt;/em&gt;, &lt;em&gt;pending&lt;/em&gt;, and &lt;em&gt;denied&lt;/em&gt;. Here, the correct type would be &lt;strong&gt;ENUM&lt;/strong&gt;, which rejects any other value, such as misspellings or unmapped states. You might argue that only your application interfaces with the database for write operations and that this validation lives in code, but data constraints are never wasted: you can't know whether other applications, beyond your control, will end up interfacing with the database in the future.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
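&lt;p&gt;Putting the three fixes together in one hypothetical table (all names are illustrative):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;CREATE TYPE negotiation_status AS ENUM ('approved', 'pending', 'denied');

CREATE TABLE negotiation (
    id          BIGINT GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    amount      NUMERIC(12,2) NOT NULL,  -- fixed point: no rounding of money
    description VARCHAR(355),            -- no space wasted on padding
    status      negotiation_status NOT NULL  -- only the three mapped states
);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;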

&lt;p&gt;&lt;em&gt;I strongly recommend that you study the differences between data types and how the database handles them in insertions, selections, updates, and deletions. You will find pertinent information in the documentation of the RDBMS you are using.&lt;/em&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Normalizing and Denormalizing: How Far to Go
&lt;/h2&gt;

&lt;p&gt;Briefly contextualizing: normalization is a set of rules to follow when modeling entities, aimed at reducing redundancy and the room for inconsistencies.&lt;/p&gt;

&lt;p&gt;Getting ahead of much of the topic, and being quite generalist: denormalization only makes sense in analytical environments, such as data warehouses and OLAP workloads. You might ask, &lt;em&gt;"but joins are very costly, wouldn't it be better to reduce the number of joins in the most common queries?"&lt;/em&gt; Actually, no. Relational databases are famous precisely because joins are performant; there are some best practices that help joins execute as efficiently as possible, and I'll list them below.&lt;/p&gt;

&lt;p&gt;It's important to remember that with normalization/denormalization, you're not just gaining/losing performance, but also giving up integrity and consistency. With normalization, you ensure referential integrity, and this prevents some inconsistent data insertions where to compensate, you end up opting to write triggers or the like, which exponentially increases the complexity of your entities.&lt;/p&gt;

&lt;p&gt;Does this mean analytical environments have been wrong to denormalize all along? &lt;strong&gt;No&lt;/strong&gt;. Analytical environments are designed with the initial demands already known: which information will be needed, and which data is and isn't useful. In those cases, many relationship keys make no sense, and often only a little information from another entity is used, so denormalization happens naturally. Another point is that analytical environments are sets of entities designed for reading; removing join operations really does make a difference at large scale, but it's not the decisive factor compared to other practices common in analytical environments.&lt;/p&gt;

&lt;p&gt;Situations that hurt your join operations (or remove the need for them):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Primary and foreign keys with different types or representations. If the database has to apply a cast or an expression like &lt;code&gt;lower()&lt;/code&gt; before it can compare the keys, it can't use the index: it must scan the table, apply the expression to every row, and only then do the comparisons.&lt;/li&gt;
&lt;li&gt;Very small tables (a few dozen rows). In these cases a sequential scan will probably outperform an index lookup, so the index buys you little.&lt;/li&gt;
&lt;li&gt;Primary and/or foreign keys that are not indexed, forcing the database to fully scan the table, which reduces performance &lt;strong&gt;significantly&lt;/strong&gt;.&lt;/li&gt;
&lt;/ul&gt;
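&lt;p&gt;A sketch of the happy path for those points, with hypothetical customer/order tables: the keys share the same type, and the foreign key is indexed so the join never needs a full scan:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;CREATE TABLE customer (
    id   BIGINT GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    name VARCHAR(120) NOT NULL
);

CREATE TABLE customer_order (
    id          BIGINT GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    customer_id BIGINT NOT NULL REFERENCES customer (id)  -- same type as customer.id
);

-- Foreign keys are NOT indexed automatically in PostgreSQL
CREATE INDEX idx_customer_order_customer_id ON customer_order (customer_id);

-- Inspect the plan: with the index in place, expect index scans, not seq scans
EXPLAIN SELECT c.name, o.id
FROM customer c
JOIN customer_order o ON o.customer_id = c.id
WHERE c.id = 42;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;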




&lt;h2&gt;
  
  
  Choosing a Tool for Versioning and Testing
&lt;/h2&gt;

&lt;p&gt;Versioning and testing the creation and modification of entities is not very common, but it makes a &lt;strong&gt;HUGE&lt;/strong&gt; difference in maintaining and understanding a system as a whole. It's rare to find codebases where you can see the entire history of the database, with fields being altered, added, and removed, as an E-Commerce or ERP system evolves. The easiest way to version your database is with migration scripts. Tools like &lt;a href="https://documentation.red-gate.com/flyway" rel="noopener noreferrer"&gt;&lt;em&gt;Flyway&lt;/em&gt;&lt;/a&gt; for the JVM do this with .sql scripts such as &lt;code&gt;V0__create_customer_table.sql&lt;/code&gt;, while &lt;a href="https://ollycope.com/software/yoyo/latest/" rel="noopener noreferrer"&gt;&lt;em&gt;YoYo Migrations&lt;/em&gt;&lt;/a&gt; for Python lets you write migrations as Python code or .sql scripts.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Don't forget to create rollback scripts!&lt;/strong&gt;&lt;/p&gt;
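&lt;p&gt;A minimal migration/rollback pair might look like this, following the Flyway-style file naming mentioned above (the table is hypothetical; note that in Flyway itself, U-prefixed undo migrations are a paid feature, so on the free tier you may run the rollback manually):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- V1__create_customer_table.sql
CREATE TABLE customer (
    id         BIGINT GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    name       VARCHAR(120) NOT NULL,
    created_at TIMESTAMPTZ NOT NULL DEFAULT now()
);

-- U1__create_customer_table.sql (the rollback script)
DROP TABLE customer;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;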

&lt;p&gt;Migration management tools enable you to write migration tests that are extremely important for several reasons:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;No data is lost.&lt;/li&gt;
&lt;li&gt;No inconsistencies are created between data.&lt;/li&gt;
&lt;li&gt;There are no unnecessary duplications.&lt;/li&gt;
&lt;li&gt;Prevention of failures due to logical or syntax errors.&lt;/li&gt;
&lt;li&gt;Allows performance evaluation.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Before choosing a migration tool, it's important to understand a few aspects of it. The most important one I usually consider is whether more than one application will manipulate entities and run migrations against my database. Why does this matter? Tools like &lt;em&gt;Alembic&lt;/em&gt;, also from the Python ecosystem, do very simple metadata management, which is perfect for small, single-application projects: you won't have problems as long as only one project performs migrations. The scenario changes when other applications also run migrations, because it becomes very easy to lose the order and continuity of the migration history. Tools like &lt;em&gt;Flyway&lt;/em&gt; and &lt;em&gt;YoYo&lt;/em&gt; do more complex but also more complete metadata management and usually serve these cases better.&lt;/p&gt;

&lt;p&gt;It's important to create rollback scripts so that if any of the migrations are performed successfully but haven't generated the expected result (bug), it can easily be undone to return to the previous state of the database. It's understandable that not all migrations have a rollback script, especially those that affect large quantities of unmapped tuples. In these cases, it's extremely important to have a development database that copies the production database in smaller proportions.&lt;/p&gt;

&lt;p&gt;Additionally, migration tools are easily integrated into CI/CD environments, which facilitates bringing entity changes to production environments, unifying and exposing the results.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.liquibase.com/resources/guides/database-schema-migration" rel="noopener noreferrer"&gt;Database Schema Migration: Understand, Optimize, Automate&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Documenting Your Entities and Fields
&lt;/h2&gt;

&lt;p&gt;This topic builds on something the &lt;strong&gt;Naming Convention Standardization&lt;/strong&gt; section introduced: a data dictionary that defines the meaning of fields, relationships, and functions, giving meaning to elements that might otherwise seem ambiguous.&lt;/p&gt;

&lt;p&gt;The creation of metadata (data about data) is extremely important, especially for consumption by other teams. It's a practice that comes from data engineering teams but is also valuable for development teams. First, it's important to briefly describe the cardinality and relationship of the tables. There are numerous tools that allow this type of work, but the most common ones are &lt;a href="https://www.lucidchart.com/" rel="noopener noreferrer"&gt;&lt;em&gt;Lucidchart&lt;/em&gt;&lt;/a&gt; and the new &lt;a href="https://github.com/drawdb-io/drawdb" rel="noopener noreferrer"&gt;&lt;em&gt;drawdb&lt;/em&gt;&lt;/a&gt;. It's a simple task that is usually done even before the materialization of the tables; you can feel free to use any tool you want. There are some tools that generate documentation in HTML/PDF format like &lt;em&gt;&lt;a href="https://dataedo.com/" rel="noopener noreferrer"&gt;Dataedo&lt;/a&gt;&lt;/em&gt; which I also recommend. The next tool I'll list is a bit more comprehensive and solves the problem more satisfactorily.&lt;/p&gt;

&lt;p&gt;Data engineering teams have increasingly been using &lt;em&gt;&lt;a href="https://open-metadata.org/" rel="noopener noreferrer"&gt;Open Metadata&lt;/a&gt;&lt;/em&gt;, an OSS tool for creating metadata. Software engineering professionals might think it's overkill for documentation, but tools like this fit companies of every size and should be brought into the culture as early as possible. Briefly, here is how Open Metadata works: when plugged into the database, the tool "scans" it for all entities, functions, triggers, and so on, and brings this information into a very user-friendly Web UI. With this recognition step done, you can attach metadata to fields describing how they are calculated, which system they originate in, and which services use them. Combined with a very useful text search, this gives you a complete tool that describes your database end to end and provides crucial information about the entities and their respective fields.&lt;/p&gt;

&lt;p&gt;In general, there are many ways to document your database entities. What matters is how you make the documentation available to stakeholders, how much work it takes to create and update, and what granularity of information it provides.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.red-gate.com/simple-talk/databases/sql-server/database-administration-sql-server/database-documentation-lands-of-trolls-why-and-how/" rel="noopener noreferrer"&gt;Database Documentation - Lands of Trolls: Why and How? - Simple Talk&lt;/a&gt; &lt;a href="https://atlan.com/what-is-metadata/" rel="noopener noreferrer"&gt;Metadata: Definition, Examples, Benefits &amp;amp; Use Cases&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Applying Indexes Correctly
&lt;/h2&gt;

&lt;p&gt;In your queries, regardless of the data they retrieve or how you execute them, a universal best practice is to minimize the number of I/O operations the database has to perform. The most common way to do this is with indexes. Indexes, in a quick and rough explanation, are metadata about the data stored in the database that point to where tuples with certain values are located. A quick practical example: you run &lt;code&gt;select * from orders where order_dt = '2024-11-05'&lt;/code&gt;, and the &lt;code&gt;order_dt&lt;/code&gt; column has a B+ tree index. Instead of traversing every data page, the query planner goes directly to the pages containing the value &lt;code&gt;2024-11-05&lt;/code&gt;, because the index metadata tells it where they are. In general you'll prefer &lt;em&gt;Index Scan&lt;/em&gt; queries, where the planner uses the index to reach the desired values, over &lt;em&gt;Table Scan&lt;/em&gt; queries, which read the entire table looking for them. Indexes have a property called selectivity, which measures how well an index filters out a large portion of a table's data. The calculation is:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;Selectivity = (Number of distinct values) / (Total number of records)&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;The closer the result is to 1, the more selective the index, and the more selective the index, the smaller the percentage of records a query on it returns. It may sound counterintuitive, but highly selective indexes are usually more effective, because the database can narrow the search down to a handful of tuples very quickly.&lt;/p&gt;
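&lt;p&gt;The formula above is easy to turn into code. A minimal Python sketch, with made-up column values for illustration:&lt;/p&gt;

```python
# Selectivity = distinct values / total rows, as defined above.
def selectivity(values: list) -> float:
    return len(set(values)) / len(values)

# A fully distinct column (good index candidate) vs. a low-cardinality one.
order_ids = list(range(1000))        # 1000 distinct values -> selectivity 1.0
statuses = ["new", "paid"] * 500     # 2 distinct values    -> selectivity 0.002
```

&lt;p&gt;An index on &lt;code&gt;order_ids&lt;/code&gt; pinpoints a single tuple; an index on &lt;code&gt;statuses&lt;/code&gt; still leaves half the table to read, so the planner may ignore it entirely.&lt;/p&gt;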

&lt;p&gt;After many insertions, deletions, and updates, indexes can and probably will become fragmented: their physical order (where they are stored) no longer matches their logical order, which degrades search performance. Unfortunately, there's no magic here; indexes come with a cost, paid mainly in storage (the metadata has to be stored somewhere) and in the performance of data manipulation operations (INSERT, UPDATE, DELETE), since the database must update the indexes on every change. This is a clear warning that you can't simply fill your database with indexes. Each one should be studied so that it isn't more harmful than beneficial. The same applies to analytical environments, where many complex queries run and the database will naturally fall back to table scans, making index design even more critical.&lt;/p&gt;

&lt;p&gt;It's important to note that there are different types of indexes such as &lt;strong&gt;hash&lt;/strong&gt;, which is excellent for equality searches but not so good for range searches, or &lt;strong&gt;fulltext&lt;/strong&gt;, which is optimized for large volumes of text and supports keyword and phrase searches. If you want to optimize queries on your database entities, I strongly recommend that you deepen your studies on indexes.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.databasestar.com/sql-indexes/" rel="noopener noreferrer"&gt;SQL Indexes - The Definitive Guide - Database Star&lt;/a&gt; &lt;br&gt;
&lt;a href="https://www.elitmus.com/blog/technology/an-in-depth-look-at-database-indexing/" rel="noopener noreferrer"&gt;An in-depth look at Database Indexing – eLitmus Blog&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Modern Backup and Restoration Pipelines
&lt;/h2&gt;

&lt;p&gt;Those who think backups exist only to restore data after a natural catastrophe or an intrusion and data hijacking are mistaken. Those scenarios matter, of course, and should be considered when planning your backups, but there are other motivations for making your backup processes more efficient and better documented. Add to the list of reasons to have modern backup pipelines the ease of testing new features and integrations that make heavy use of the database (high volumes of insertions, data manipulation, or new data types). Having a backup of the database state from a previous version of the application can save you (and probably will) in countless scenarios where things didn't go as planned. Another clear use is compliance: in cases of cybercrime or legal disputes, it's extremely important to have the full history of data movement available for investigations.&lt;/p&gt;

&lt;p&gt;It's extremely common to see database backups done as plain-text SQL dumps or as binary formats. SQL dumps are popular mostly because many DBMSs generate them automatically with minimal configuration, but they are far from ideal once you have a medium or large amount of data. The format is inefficient in storage (it takes up a lot of space), slow to restore (medium/large volumes can take hours, if not days), slow to write, and poorly integrated with other tools for reading (analysis/querying). For these reasons, we can rule out SQL dumps as a backup option for robust databases. Binary formats solve part of these problems, notably storage and write time, which are good reasons to use them, even though their interoperability is worse than SQL dumps, since binary formats are usually specific to a particular DBMS. Binary backups are harder to manage and demand a bit more technical knowledge, but the gains are noticeable and usually worth it. Next, though, we'll talk about another kind of architecture.&lt;/p&gt;
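&lt;p&gt;For PostgreSQL specifically, the binary alternative to a plain SQL dump is &lt;code&gt;pg_dump&lt;/code&gt;'s compressed "custom" format, restored with &lt;code&gt;pg_restore&lt;/code&gt;. A small sketch that only builds the command without executing it; the database and file names are hypothetical:&lt;/p&gt;

```python
# Building (not executing) a pg_dump invocation for the compressed binary
# "custom" format, which is smaller and restores faster than a plain SQL dump.
# Database name and output file are hypothetical.
def dump_command(dbname: str, outfile: str) -> list[str]:
    return [
        "pg_dump",
        "--format=custom",    # binary, compressed; restored with pg_restore
        f"--file={outfile}",
        dbname,
    ]

cmd = dump_command("shop", "shop_2024_11_05.dump")
# To actually run it: subprocess.run(cmd, check=True)
```

&lt;p&gt;Building the argument list in code like this also makes it easy to schedule and log the backups from your own tooling.&lt;/p&gt;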

&lt;p&gt;With the modernization of areas such as data engineering, machine learning engineering, and data science, new file formats and new tools for manipulating them have emerged. Although they target analytical environments rather than backups, some of these tools can be applied to this context to great effect. One thing analytical environments and backups have in common is few or no data-change operations, only large volumes of insertion. We can therefore use columnar file formats, whose advanced compression techniques significantly reduce both storage size and the time needed to read the data and reinsert it into a transactional environment. &lt;br&gt;
&lt;em&gt;Read my article &lt;a href="https://dev.to/pedrohgoncalves/different-file-formats-a-benchmark-doing-basic-operations-jfj"&gt;Different file formats, a benchmark doing basic operations&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The Parquet format is especially storage-efficient for transactional data with many duplicate values (Parquet uses compression techniques that encode away duplication, shrinking the file). Writing Parquet files is also not time-consuming; write times are usually close to those of binary formats, though they require a bit more CPU. Parquet files can be queried without a full restoration, which helps in scenarios where, with a binary backup, you would have to restore a complete version just to look up one specific piece of information. Although Parquet stores the schema in its metadata, that metadata is not compatible with most DBMSs, which is a problem if your backups are not synchronized with your migrations. This style of architecture, called the "modern data stack" in data engineering, demands more specific technical knowledge of certain tools and concepts, but it will likely bring large savings in the resources your backups consume and improve the quality of your processes. In any case, changes that touch something as complex and delicate as compliance should be widely discussed and studied before being adopted, no matter how promising the numbers look.&lt;/p&gt;




&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Many of these topics could easily fill an article of this size or larger on their own, but the idea here is not to explain them in depth or dissect their anatomy in micro detail; it's to give a reliable and solid introduction. The deeper knowledge needed for future implementations or changes can come from the documentation of your DBMS and the tools you adopt, from books on software architecture and application design, or from more specific articles.&lt;/p&gt;

&lt;p&gt;If you want to know more about databases, I recommend reading my other article about the ACID principle. &lt;a href="https://dev.to/pedrohgoncalves/transactions-and-the-acid-principle-going-a-little-deeper-32ie"&gt;Transactions and the ACID principle, going a little deeper.&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Thank you for your attention.&lt;/p&gt;

</description>
      <category>database</category>
      <category>advanced</category>
      <category>optimization</category>
      <category>dataengineering</category>
    </item>
    <item>
      <title>Improving your Python code, an initial series of best practices.</title>
      <dc:creator>Pedro H Goncalves</dc:creator>
      <pubDate>Thu, 21 Mar 2024 01:38:20 +0000</pubDate>
      <link>https://dev.to/pedrohgoncalves/improving-your-python-code-an-initial-series-of-best-practices-150j</link>
      <guid>https://dev.to/pedrohgoncalves/improving-your-python-code-an-initial-series-of-best-practices-150j</guid>
<description>&lt;p&gt;This article gathers some of the practices I would like to see in the codebases I maintain, and that I believe are not difficult to adopt. Obviously, it's impossible to demand every good code practice of developers, or of yourself. You can start with this initial set and gradually improve.&lt;/p&gt;

&lt;p&gt;If you are looking for ways to optimize your code in terms of asymptotic complexity or anything related, this article will be of little use to you. However, if you want to improve the readability and organization of your Python projects, I strongly recommend reading it. You can adopt these practices in any type of project, not just in a specific niche like web development (which has some other good practices), data science, or any other area. It is important to consider that these practices should not override the code standards that your company/project already has, whether it's naming conventions, indentation, comments, or anything else. The overall goal is to provide you with ideas on how to keep your code readable and organized; you can modify the practices to your liking. Ultimately, what matters are the standards you adopt.&lt;/p&gt;

&lt;h2&gt;
  
  
  Exception Handling
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;This best practice might sound like a rule imposition, but it's really important that you take it into consideration.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Why did I include exception handling as a topic in good Python coding practices when there are many others? Well, exception handling is particularly dangerous in many applications, and if done incorrectly it will likely introduce unwanted behavior into your application. In short, it's much easier to mess up exception handling than the other practices here.&lt;/p&gt;

&lt;p&gt;It's quite common to find the following code snippet in various Python projects:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;something&lt;/span&gt;
&lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="nb"&gt;Exception&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;log&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;register&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;critical&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;I was quite generous with the part of log.register; you're more likely to find a print(e).&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;This is extremely problematic for two reasons:&lt;/p&gt;

&lt;p&gt;Firstly, it catches every exception that may occur in that block of code, and you probably don't want that, because debugging errors becomes infinitely harder. Moreover, it allows your code to do whatever it wants and still keep executing, and it tells the reader nothing about what might actually go wrong. When you wrote that block, you had a specific potential error in mind, such as a UniqueViolation from psycopg2 or a TypeError, so you should handle only the errors you actually want to handle.&lt;/p&gt;

&lt;p&gt;Secondly, it does nothing after catching the exception. In this case, it still logs an error (which is far from ideal), but in many applications with this type of snippet, it serves only to "not break" the code. However, it's always good for the rest of your application to know the result of that operation. Propagating the exception to higher layers or changing the behavior after a failure is a good practice.&lt;/p&gt;

&lt;p&gt;To refactor this exception, we could do something like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;something&lt;/span&gt;
&lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="n"&gt;psycopg2&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;errors&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;UniqueViolation&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;duplicate_error&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;log&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;register&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;critical&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;duplicate_error&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="nc"&gt;ModifiedExceptionDatabase&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Value exists in the database.&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="nb"&gt;TypeError&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;type_error&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;log&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;register&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;critical&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;type_error&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="c1"&gt;# do something
&lt;/span&gt;    &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="nc"&gt;ModifiedExceptionType&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Type mismatch.&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;finally&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;connection&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;close&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It's also important to remember that it's not good to have large code blocks or a very extensive flow inside try-except blocks. The larger the block, the more errors it catches, and the less visibility you have of your application.&lt;/p&gt;

&lt;p&gt;In summary, you will want to avoid catching generic &lt;code&gt;Exception&lt;/code&gt; and instead handle only specific exceptions. After handling them, share the result of the operation with the rest of your application so that something can be done in higher layers.&lt;/p&gt;
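&lt;p&gt;A runnable sketch of that pattern, using a plain dict and &lt;code&gt;KeyError&lt;/code&gt; as a stand-in for a driver-specific error such as UniqueViolation; the exception and function names here are made up:&lt;/p&gt;

```python
import logging

class DuplicateValueError(Exception):
    """Domain exception propagated to higher layers."""

def insert_user(db: dict, user_id: int, name: str) -> None:
    if user_id in db:
        raise KeyError(user_id)      # stand-in for e.g. UniqueViolation
    db[user_id] = name

def register_user(db: dict, user_id: int, name: str) -> None:
    try:
        insert_user(db, user_id, name)
    except KeyError as duplicate:    # specific, not a bare Exception
        logging.critical("duplicate user: %s", duplicate)
        # re-raise as a domain error so callers know what happened
        raise DuplicateValueError(f"user {user_id} already exists") from duplicate
```

&lt;p&gt;Higher layers can now catch &lt;code&gt;DuplicateValueError&lt;/code&gt; and react, instead of silently continuing after a swallowed exception.&lt;/p&gt;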

&lt;p&gt;To learn more about error handling: &lt;a href="https://peps.python.org/pep-0463/" rel="noopener noreferrer"&gt;PEP 463 – Exception-catching expressions | peps.python.org&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Code Styling and Standards
&lt;/h2&gt;

&lt;p&gt;For code standards, there are several points you can address, such as naming conventions, abbreviations, common methods, among others. Speaking of naming styles, Python's recommended styles are widely disseminated; it's almost common knowledge that snake case is used for function, method, and variable names, and Pascal case for classes. Although these conventions are widely adopted by the community, they are far from irreplaceable: if your team is accustomed to writing differently, that's completely acceptable and shouldn't be a problem for your codebase. However, if your team doesn't follow any standardization, whether in styling, naming, or tool usage, it's important that you adopt one. I strongly recommend writing a document, easily accessible to your team, containing the code standards you choose to adopt. To help, here are some points you could cover, with examples:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Default Python version: 3.10&lt;/li&gt;
&lt;li&gt;Abbreviations. Ex: dt = date, nb = number&lt;/li&gt;
&lt;li&gt;Functions: Snake Case&lt;/li&gt;
&lt;li&gt;Methods: Snake Case&lt;/li&gt;
&lt;li&gt;Variables: Snake Case&lt;/li&gt;
&lt;li&gt;Classes: Pascal Case&lt;/li&gt;
&lt;li&gt;Modules: Snake Case (all lowercase; hyphenated names can't be imported)&lt;/li&gt;
&lt;li&gt;Spacing and indentation: Refer to PEP 8 (or any other)&lt;/li&gt;
&lt;li&gt;Database connection: ...&lt;/li&gt;
&lt;li&gt;Database operations: ...&lt;/li&gt;
&lt;li&gt;Library for migrations: ...&lt;/li&gt;
&lt;li&gt;Credential management: ...&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;There are many other points you can address with a code standard to facilitate maintenance, reduce errors, and increase the testability of your code. Obviously, it's impossible to standardize all your code, and I dare say that would be harmful. However, mapping the most repeated points and addressing them in a standardized way is a good software development practice.&lt;/p&gt;

&lt;h2&gt;
  
  
  Type Hints
&lt;/h2&gt;

&lt;p&gt;Python is a dynamically typed language, which means types are checked at runtime; you don't have to declare types for variables, class parameters, or the parameters and returns of methods and functions. Fortunately, it has type hints (available since Python 3.5), which are literally what the name suggests: hints about types. Type hints do not interfere with code execution, but they aid readability and maintenance. Tools like PyCharm and VSCode extensions highlight places where a value's type doesn't match its type hint. To hint a variable or parameter as &lt;code&gt;int&lt;/code&gt;, you put a colon &lt;code&gt;:&lt;/code&gt; after its name, followed by the type, as in &lt;code&gt;int&lt;/code&gt;. For function returns, you use an arrow &lt;code&gt;-&amp;gt;&lt;/code&gt; followed by the return type, such as &lt;code&gt;str&lt;/code&gt;. Let's illustrate with code:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;persons&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;joe&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;78&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt; 
&lt;span class="n"&gt;age&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;persons&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;joe&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;is_legal_age&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;age&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;bool&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;age&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="mi"&gt;18&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In our code, we have a dictionary with a string key and an integer value, and the type hint &lt;code&gt;dict[str, int]&lt;/code&gt; exactly conveys this. You might think that in this case, a type hint could be more of a hindrance than a help since the variable's value is explicit in the code, and you are correct. But in cases where we can't see what value the variable assumes after code execution, type hints are extremely important to give context to your application.&lt;/p&gt;

&lt;p&gt;In ideal applications, all variables would have type hints regardless of how the value is assigned to them. However, it's understandable that updating them whenever there's a change in the code or importing and inferring their types could be complicated. Therefore, &lt;strong&gt;I&lt;/strong&gt; usually put type hints only on class constructors' parameters, functions, methods, and their returns.&lt;/p&gt;

&lt;p&gt;It's important to remember that there are tools like MyPy, a static type checker: it uses type hints to detect type errors before the code runs, and you can wire it into your workflow (for example, in CI) so nothing ships until the errors are fixed. It's not universally adopted, but if you run into many type problems in your codebase, it's worth considering.&lt;/p&gt;
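&lt;p&gt;Returning to the &lt;code&gt;persons.get&lt;/code&gt; example above: &lt;code&gt;dict.get&lt;/code&gt; returns &lt;code&gt;None&lt;/code&gt; for missing keys, so the honest hint is &lt;code&gt;Optional[int]&lt;/code&gt;, and annotating it as plain &lt;code&gt;int&lt;/code&gt; is exactly the kind of mismatch a checker like MyPy reports. A small sketch:&lt;/p&gt;

```python
from typing import Optional

persons: dict[str, int] = {"joe": 78}

# dict.get returns None when the key is missing, so Optional[int] is the
# accurate hint here; a plain `int` annotation would be flagged by mypy.
age: Optional[int] = persons.get("ann")

def describe(age: Optional[int]) -> str:
    # narrowing the Optional before using it satisfies both mypy and readers
    return "unknown" if age is None else ("adult" if age >= 18 else "minor")
```

&lt;p&gt;The hints cost nothing at runtime, but they document the &lt;code&gt;None&lt;/code&gt; case that would otherwise surprise a caller.&lt;/p&gt;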

&lt;p&gt;To learn more about type hints: &lt;a href="https://peps.python.org/pep-0484/" rel="noopener noreferrer"&gt;PEP 484 – Type Hints | peps.python.org&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Docstrings
&lt;/h2&gt;

&lt;p&gt;Not only in Python but in other languages too, it's very common to document functions, classes, and methods in the code itself, as if they were comments. In Python this documentation is called a "docstring", and docstrings are present everywhere from built-in functions to community-made frameworks. A docstring is a string placed just below the declaration of your function, class, or method that succinctly explains the fundamentals: what it does, its parameters and their types, what it returns, which exceptions it raises and how it handles them, among other points. Let's visualize with a code example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;one&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;two&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;A function that sums two values

        Parameters:
            one: First value to sum
            two: Second value to sum

        Raises:
            TypeError: if one of the two values is not an integer

        Returns:
            The sum of the two values
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;one&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;two&lt;/span&gt;
    &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="nb"&gt;TypeError&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="nc"&gt;TypeError&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Both values must be integers&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The function in the example sums two integers. The docstring explains what the function does, the parameters it receives and what they do, explains which exception is handled, in which scenario, and summarizes what it returns. Of course, you can write the docstring to your liking; some people might disagree with how this one is written, but in the end, it fulfills its purpose of explaining the function's operation and also meets standardization criteria.&lt;/p&gt;

&lt;p&gt;If you want to follow this pattern, write your docstring between triple double quotes: three at the beginning, a line break, and three at the end. Explain briefly what the function does in the first line or sentence, list the parameters separated by line breaks, include considerations you find important such as exception handling, and finally mention the return value, separating all topics with blank lines. If the function has no parameters, return value, or exceptions, it's fine to omit that information. Another consideration: it's crucial to keep the docstring updated. If you modify the code and don't update the docstring, it's preferable to delete it rather than leave misleading information behind.&lt;/p&gt;

&lt;p&gt;Docstrings are stored in the &lt;code&gt;__doc__&lt;/code&gt; attribute: typing &lt;code&gt;print(Class.__doc__)&lt;/code&gt; prints the class's docstring in the terminal. For functions and methods, docstrings also power the built-in help, so typing &lt;code&gt;help(sum)&lt;/code&gt; will print the docstring as well. Additionally, docstrings feed automatic documentation generators such as Sphinx, which relies on them to build documentation.&lt;/p&gt;
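&lt;p&gt;A quick runnable check of where the docstring ends up (the function name is made up to avoid shadowing the built-in &lt;code&gt;sum&lt;/code&gt;):&lt;/p&gt;

```python
def sum_two(one: int, two: int) -> int:
    """Return the sum of two integers."""
    return one + two

# The docstring lives on the __doc__ attribute; help() formats and pages it.
print(sum_two.__doc__)  # -> Return the sum of two integers.
```
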

&lt;p&gt;Although it's a common consensus in the community that docstrings should be written for all classes, methods, and functions, it's understandable that this may not be possible due to a lack of practice leading to difficulty in creating readable documentation or due to lack of time to create and modify it. Therefore, something &lt;strong&gt;I&lt;/strong&gt; have adopted is to write docstrings for functions that I consider too complex and would take too long to understand with just the code reading. But of course, this is something you should think about and consider before imposing it on your projects or team.&lt;/p&gt;

&lt;p&gt;Our example uses an extremely simple function; in a real project, a single-line docstring would be ideal for such an obvious case. I exemplified with a more complex docstring because I believe it covers most real-world cases. In any case, docstrings are a fairly deep subject; they could easily fill an entire article on their own.&lt;/p&gt;

&lt;p&gt;To learn more about docstrings: &lt;a href="https://peps.python.org/pep-0257/" rel="noopener noreferrer"&gt;PEP 257 – Docstring Conventions | peps.python.org&lt;/a&gt;&lt;/p&gt;

&lt;h1&gt;
  
  
  Conclusion
&lt;/h1&gt;

&lt;p&gt;Perhaps, like me, a good portion of the scripts you write are not influenced by others, and you may judge it unimportant to adopt these practices. However, in the future, when you write code for others, they will likely be very grateful to see these practices materialized in your code. Furthermore, the you of the future will thank the you of the past for writing readable code and adopting code standards. It's important to exercise these skills to become almost like muscle memory and be able to develop readable code more quickly. If you can apply this, you'll become a much better developer.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Note&lt;/em&gt;: I didn't mention code linters because I consider it a step further. They are great and almost mandatory in serious Python projects, but they require external configuration and this can be a problem for some people.&lt;/p&gt;

</description>
      <category>python</category>
      <category>programming</category>
      <category>goodpractices</category>
      <category>improving</category>
    </item>
    <item>
      <title>Different file formats, a benchmark doing basic operations</title>
      <dc:creator>Pedro H Goncalves</dc:creator>
      <pubDate>Sun, 10 Mar 2024 20:35:14 +0000</pubDate>
      <link>https://dev.to/pedrohgoncalves/different-file-formats-a-benchmark-doing-basic-operations-jfj</link>
      <guid>https://dev.to/pedrohgoncalves/different-file-formats-a-benchmark-doing-basic-operations-jfj</guid>
      <description>&lt;p&gt;Recently, I've been designing a data lake to store different types of data from various sources, catering to diverse demands across different areas and levels. To determine the best file type for storing this data, I compiled points of interest, considering the needs and demands of different areas. These points include:&lt;/p&gt;

&lt;h3&gt;
  
  
  Tool Compatibility
&lt;/h3&gt;

&lt;p&gt;Tool compatibility refers to which tools can write and read a specific file type. No/low code tools are crucial, especially when tools like Excel/LibreOffice play a significant role in operational layers where collaborators may have less technical knowledge to use other tools.&lt;/p&gt;

&lt;h3&gt;
  
  
  Storage
&lt;/h3&gt;

&lt;p&gt;How much extra or less space will a particular file type cost in the data lake? While non-volatile memory is relatively cheap nowadays, both on-premise and in the cloud, with a large volume of data, any savings and storage optimization can make a difference in the final balance.&lt;/p&gt;

&lt;h3&gt;
  
  
  Reading
&lt;/h3&gt;

&lt;p&gt;How long do the tools that will consume the data take to open and read the file? In applications where reading seconds matter, sacrificing compatibility and storage for gains in processing time becomes crucial in the data pipeline architecture planning.&lt;/p&gt;

&lt;h3&gt;
  
  
  Writing
&lt;/h3&gt;

&lt;p&gt;How long will the tools used by our data team take to generate the file in the data lake? If immediate file availability is a priority, this is an attribute we would like to minimize as much as possible.&lt;/p&gt;

&lt;h3&gt;
  
  
  Query
&lt;/h3&gt;

&lt;p&gt;Some services will directly consume data from the file and perform grouping and filtering functions. Therefore, it's essential to consider how much time these operations will take to make the correct choice in our data solution.&lt;/p&gt;

&lt;h2&gt;
  
  
  Benchmark
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Files
&lt;/h3&gt;

&lt;h5&gt;
  
  
  1 - &lt;a href="https://www.kaggle.com/datasets/ealtman2019/ibm-transactions-for-anti-money-laundering-aml?rvi=1" rel="noopener noreferrer"&gt;IBM Transactions for Anti Money Laundering (AML)&lt;/a&gt;
&lt;/h5&gt;

&lt;p&gt;Rows: 31 million &lt;br&gt;
Columns: 11 &lt;br&gt;
Types: Timestamp, String, Integer, Decimal, Boolean&lt;/p&gt;

&lt;h5&gt;
  
  
  2 - &lt;a href="https://www.kaggle.com/datasets/agungpambudi/network-malware-detection-connection-analysis?select=CTU-IoT-Malware-Capture-9-1conn.log.labeled.csv" rel="noopener noreferrer"&gt;Malware Detection in Network Traffic Data&lt;/a&gt;
&lt;/h5&gt;

&lt;p&gt;Rows: 6 million &lt;br&gt;
Columns: 23 &lt;br&gt;
Types: String, Integer&lt;/p&gt;

&lt;h3&gt;
  
  
  Number of Tests
&lt;/h3&gt;

&lt;p&gt;15 tests were conducted for each operation on each file, and the results in the graphs represent the average of each test iteration's results. The only variable unaffected by the number of tests is the file size, which remains the same regardless of how many times it is written.&lt;/p&gt;
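&lt;p&gt;As a minimal sketch of this methodology (the real operations live in the benchmark script linked in this article; here a stand-in function replaces them), averaging 15 timed runs needs nothing beyond the Python standard library:&lt;/p&gt;

```python
import statistics
import timeit

def operation():
    # stand-in for one benchmarked operation (e.g. writing a dataset to disk)
    sum(range(100_000))

# 15 timed iterations per operation, then the average, mirroring the text above
runs = timeit.repeat(operation, number=1, repeat=15)
average = statistics.mean(runs)
print(f"average of {len(runs)} runs: {average:.6f}s")
```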

&lt;h3&gt;
  
  
  Why 2 datasets?
&lt;/h3&gt;

&lt;p&gt;I chose two completely different datasets. The first is significantly larger than the second, has few columns with little data variability, and contains more complex types such as timestamps. The second, in contrast, has many more columns, many null values represented by "-", and many columns with duplicate values where the distinction is low. These characteristics highlight the distinctions, strengths, and weaknesses of each format.&lt;/p&gt;

&lt;h3&gt;
  
  
  Script
&lt;/h3&gt;

&lt;p&gt;The script used for benchmarking is open on GitHub for anyone who wants to check it or run their own benchmarks with their own files, which I strongly recommend.&lt;/p&gt;

&lt;h4&gt;
  
  
  &lt;a href="https://github.com/pedrohgoncalvess/file-format-benchmark" rel="noopener noreferrer"&gt;file-format-benchmark: benchmark script of key operations between different file formats&lt;/a&gt;
&lt;/h4&gt;

&lt;h3&gt;
  
  
  Tools
&lt;/h3&gt;

&lt;p&gt;I used Python with Spark for the benchmark. Spark can natively read, write, and query all of these file types, unlike Pandas, which requires an extra library to achieve this. Additionally, Spark is more performant on larger datasets, and the datasets used in this benchmark are relatively large; Pandas struggled with them.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Env:&lt;/strong&gt; &lt;br&gt;
Python version: 3.11.7 &lt;br&gt;
Spark version: 3.5.0 &lt;br&gt;
Hadoop version: 3.4.1&lt;/p&gt;

&lt;h2&gt;
  
  
  Benchmark Results
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Tool Compatibility
&lt;/h3&gt;

&lt;p&gt;Although I wanted to measure tool compatibility, I couldn't find a way to quantify it, so I'll share &lt;strong&gt;my opinion&lt;/strong&gt;. For pipelines whose downstream stakeholders have more technical knowledge (data scientists, machine learning engineers, etc.), the file format matters little: with a library or framework in any programming language, you can manipulate data in any of these formats.&lt;/p&gt;

&lt;p&gt;However, for non-technical stakeholders like business analysts, C-level executives, or other collaborators who work directly with product/service operations, the scenario changes. These people often use tools like Excel, LibreOffice, Power BI, or Tableau (which, despite having more native readers, do not support Avro or ORC). In cases where files are consumed "manually" by people, you will almost always opt for CSV or JSON. Being plain text, these formats can be opened, read, and understood in any text editor, and virtually every tool can read structured data from them. Parquet retains some compatibility, being the columnar format with the most support and attention from the community. ORC and Avro, on the other hand, have very little support, and it can be challenging to find parsers and serializers for them in non-Apache tools.&lt;/p&gt;

&lt;p&gt;In summary, CSV and JSON have a significant advantage over the others, and you will likely choose them when your stakeholders are directly handling the files and lack technical knowledge.&lt;/p&gt;

&lt;h3&gt;
  
  
  Storage
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Dataset 1:&lt;/strong&gt;&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fls7ih24fhy7thg0l0vek.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fls7ih24fhy7thg0l0vek.png" alt="Storage results graph" width="800" height="275"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Dataset 2:&lt;/strong&gt;&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2qfhrzidjqpqa7p86pev.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2qfhrzidjqpqa7p86pev.png" alt="Storage results graph" width="800" height="269"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;To calculate storage, we loaded the dataset in CSV format, rewrote it in all formats (including CSV itself), and listed the amount of space they occupy.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The graphs show a significant disadvantage for JSON, which was three times larger than the second-largest file (CSV) in both cases. The difference is so pronounced because of the way JSON is written: a list of objects with key-value pairs, where the key is the column name and the value is that row's value for the column. This creates unnecessary schema redundancy, repeating every column name in every record. Being plain text without any compression, JSON and CSV show the two worst results in terms of storage. Parquet, ORC, and Avro had very similar results, highlighting their storage efficiency compared to the more common formats. The key reasons for this advantage are that all three are binary formats, and Parquet and ORC are additionally columnar, which significantly reduces data redundancy, avoiding waste and optimizing space. All three formats also have highly efficient compression methods.&lt;/p&gt;
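&lt;p&gt;A minimal illustration of that schema redundancy, using only the Python standard library and a made-up three-column table: the same rows serialized as a JSON list of objects versus CSV.&lt;/p&gt;

```python
import csv
import io
import json

# a toy table: 1,000 rows with three columns
rows = [{"sku": i, "quant": i % 7, "price": i % 13} for i in range(1000)]

# JSON as a list of objects repeats every column name in every record
json_bytes = json.dumps(rows).encode()

# CSV writes the header once, then only the values
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["sku", "quant", "price"])
writer.writeheader()
writer.writerows(rows)
csv_bytes = buf.getvalue().encode()

print(len(json_bytes), len(csv_bytes))  # JSON is several times larger
```

The gap only widens once a binary, compressed columnar format replaces CSV.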

&lt;p&gt;In summary, CSV and JSON are by no means the best choices for storage optimization, especially in cases like storing logs or data that has no immediate importance but cannot be discarded.&lt;/p&gt;

&lt;h3&gt;
  
  
  Reading
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Dataset 1:&lt;/strong&gt;&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcrz05ctz20a2xigwqafc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcrz05ctz20a2xigwqafc.png" alt="Reading results graph" width="800" height="276"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Dataset 2:&lt;/strong&gt;&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdmv6s0p39jtm5gdhznuy.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdmv6s0p39jtm5gdhznuy.png" alt="Reading results graph" width="800" height="280"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;In the reading operation, we timed the dataset loading and printed the first 5 rows.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;In reading, there is a peculiar result: despite the large differences in file size (3x), the only format with a visible and relevant difference was JSON. This occurs solely because of the way JSON is written; parsing that amount of redundant metadata (the repeated schema) is costly for Spark, and reading time grows steeply with file size. As for why CSV performed as well as ORC and Parquet: CSV is extremely simple, lacking metadata like a typed schema or field names, so it is quick for the Spark parser to read, split, and infer the column types. ORC and, especially, Parquet carry a large amount of metadata that pays off for files with more fields, complex types, and a larger amount of data. The difference between Avro, Parquet, and ORC is minimal and varies with the state of the cluster/machine, simultaneous tasks, and the data layout. With these datasets, the difference in the reading operation is hard to evaluate; it becomes more evident when scaling to files several times larger than the ones we are working with.&lt;/p&gt;

&lt;p&gt;In summary, CSV, Parquet, ORC, and Avro had almost no difference in reading performance, while JSON cannot be considered as an option in cases where fast data reading is required. Few cases prioritize reading alone; it is generally evaluated along with another task like a query. If you are looking for the most performant file type for this operation, you should consider conducting your own tests.&lt;/p&gt;

&lt;h3&gt;
  
  
  Writing
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Dataset 1:&lt;/strong&gt;&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjtn3hn87b1fmkkz6ptuk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjtn3hn87b1fmkkz6ptuk.png" alt="Write results graph" width="800" height="275"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Dataset 2:&lt;/strong&gt;&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftdl4aexrgqmc6rzjiozs.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftdl4aexrgqmc6rzjiozs.png" alt="Write results graph" width="800" height="272"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;In the writing operation, we read a .csv file and rewrote it in the respective format, only counting the writing time.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;In writing, there was a surprise: JSON was not the slowest to be written in the first dataset; it was actually ORC. However, in the second dataset, JSON took the longest. This discrepancy is due to the second dataset having more columns, meaning more metadata to be written. While ORC is a binary file with static typing of data, similar to Parquet, the difference is that ORC applies "better" optimization and compression techniques, requiring more processing power and time. This justifies the query time (which we will see next) and the generated file size, which is smaller in almost all cases than Parquet files. CSV had good performance because it is a very simple format, lacking additional metadata such as schema and types or redundant metadata like JSON. On a larger scale, more complex files would have better performance than CSV. Avro also has its benefits and had a very positive result in dataset 1, outperforming Parquet and ORC with a significant advantage. This &lt;strong&gt;probably&lt;/strong&gt; happened due to the data layout favoring Avro's optimizations, which differ from Parquet and ORC.&lt;/p&gt;

&lt;p&gt;In summary, Avro, despite not being a format with much fame or community support, is a good choice in situations where you want your files available quickly for stakeholders to consume. The difference starts to matter when scaling to several GBs of data, where the gap grows from 30-40 seconds to 20-30 minutes.&lt;/p&gt;

&lt;h3&gt;
  
  
  Query
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Dataset 1:&lt;/strong&gt;&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6iujc2kk1bo698nilkit.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6iujc2kk1bo698nilkit.png" alt="Query results graph" width="800" height="274"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Dataset 2:&lt;/strong&gt;&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3xv3vklkl28wl4d3btdk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3xv3vklkl28wl4d3btdk.png" alt="Query results graph" width="800" height="274"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;In the query operation, the dataset was loaded, and a query with only one WHERE clause filtering a unique value was performed, followed by printing the respective tuple.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;In the first file, all formats had good performance, and the graph scale gives the impression that Parquet did poorly; in reality, the differences are minimal. Since dataset 2 is much smaller, its query results are very susceptible to external factors, so we will focus on explaining the results for the first dataset. As mentioned earlier, ORC performs well, even compared to Avro, which had excellent performance in other operations. Still, Parquet leads this ranking with the fastest query result. Why? Parquet being Spark's default format says a lot about how well the framework works with it. It incorporates various query optimization techniques, many of them consolidated in DBMSs. One of the most famous is &lt;strong&gt;predicate pushdown&lt;/strong&gt;, which pushes WHERE clauses down to the scan step of the execution plan so that less data is read and examined from disk.&lt;/p&gt;

&lt;p&gt;Why do CSV and JSON lag so far behind? In this case, CSV and JSON are not the problem; the truth is that Parquet and ORC are very well optimized. All the benefits mentioned earlier, such as schema metadata, binary encoding, and columnar layout, give them a significant advantage. And where does Avro fit in, since it shares many of these benefits? In terms of query optimization, Avro lags far behind ORC and Parquet. One example is &lt;strong&gt;column projection&lt;/strong&gt;, which reads only the specific columns used in the query rather than the entire dataset; it is present in ORC and Parquet but not in Avro. This is of course not the only difference, but overall, Avro falls far behind in query optimization.&lt;/p&gt;
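&lt;p&gt;As a toy sketch of what column projection buys (plain Python lists stand in for real columnar storage; the data is made up): summing one column in a columnar layout reads a single contiguous array, while the row layout drags every field of every record along.&lt;/p&gt;

```python
# Row-oriented: each record stores every field together
rows = [{"sku": i, "quant": i % 7, "price": i % 13} for i in range(10_000)]

# Column-oriented: one array per field
columns = {
    "sku":   [r["sku"] for r in rows],
    "quant": [r["quant"] for r in rows],
    "price": [r["price"] for r in rows],
}

# SELECT SUM(price): the row layout must visit every record (and all its
# fields), while the columnar layout scans only the "price" array
total_rowwise = sum(r["price"] for r in rows)
total_columnar = sum(columns["price"])
print(total_rowwise, total_columnar)
```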

&lt;p&gt;In summary, when working with large files, with both simple and complex queries, you will want to work with Parquet or ORC. Both have many query optimizations that will deliver results much faster compared to other formats. This difference is already evident in files slightly smaller than dataset 1 and becomes even more apparent in larger files.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;In a data engineering environment where you need to serve various stakeholders, consume from various sources, and store data in different storage systems, operations such as reading, writing, and querying are widely affected by the file format. Here we saw the issues certain formats can have, evaluated against the main criteria raised when building data environments.&lt;/p&gt;

&lt;p&gt;Even though Parquet is the "darling" of the community, we were able to highlight some of its strengths, such as query performance, but also show that there are better options for certain scenarios, such as ORC in storage optimization.&lt;/p&gt;

&lt;p&gt;The performance of these operations for each format also depends heavily on the tool you are using and how you are using it (environment and available resources). The results from Spark &lt;strong&gt;probably&lt;/strong&gt; will not differ much from other robust frameworks like DuckDB or Flink, but we recommend that you conduct your own tests before making any decision that will have a significant impact on other areas of the business.&lt;/p&gt;

</description>
      <category>dataengineering</category>
      <category>spark</category>
      <category>benchmark</category>
      <category>datascience</category>
    </item>
    <item>
      <title>Transactions and the ACID principle, going a little deeper.</title>
      <dc:creator>Pedro H Goncalves</dc:creator>
      <pubDate>Mon, 04 Sep 2023 21:59:46 +0000</pubDate>
      <link>https://dev.to/pedrohgoncalves/transactions-and-the-acid-principle-going-a-little-deeper-32ie</link>
      <guid>https://dev.to/pedrohgoncalves/transactions-and-the-acid-principle-going-a-little-deeper-32ie</guid>
      <description>&lt;p&gt;Recently, I began to study how databases work under the hood, and this is the result of a small part of those studies. Today, I will share what I have learned about transactions and how Database Management Systems (DBMS) minimize errors, both in terms of data inconsistency and general errors that can affect the availability and performance of the database. I emphasize to everyone to read the documentation of the DBMS you are using to make better use of the tools it provides and to improve the performance of your applications. This article is a generalization of how DBMS implement these properties, and only by reading the documentation will you have concrete answers.&lt;/p&gt;

&lt;h1&gt;
  
  
  Transactions
&lt;/h1&gt;

&lt;p&gt;We can say that a transaction is a unit of work. Sometimes, it is difficult to perform all the data manipulation you want to do with just one query; it may even be impossible. That's where a transaction comes in – to execute a sequence of queries.&lt;/p&gt;

&lt;p&gt;Transactions are generally used for data manipulation (which does not prevent or prohibit read-only queries). For this purpose, they have a set of properties that make data manipulation and querying safer in terms of data quality. This set of properties is formed by the concepts of Atomicity, Consistency, Isolation, and Durability, which form the acronym ACID. Later, we will delve deeper into how they work and what they represent.&lt;/p&gt;

&lt;p&gt;Returning to transactions: various DBMSs address the topic in their own way, some prioritizing performance in certain operations and others elsewhere, but in general they work quite similarly. We will start with the commands that initiate and terminate transactions, and what exactly they do.&lt;/p&gt;

&lt;h2&gt;
  
  
  Transaction Handling
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;BEGIN&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;When we give the command &lt;strong&gt;BEGIN&lt;/strong&gt; to the database, we are asking it to start a new "branch." In this context, a "branch" is something like a checkpoint in a game. We start our branch here, can perform our SQL manipulation or queries, and have some "privileges," such as the ability to roll back without the database applying the changes we instructed within the branch.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;COMMIT&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;As I mentioned earlier, starting a transaction or, as we call it, a "branch" gives us some privileges, such as the ability to revert to the database's state at the point where we began that transaction. But how do we confirm these changes (understand, with changes, I refer to DML operations executed within the same branch)? With the &lt;strong&gt;COMMIT&lt;/strong&gt; command, we ask the database to save the changes we made. In other words, all DMLs we performed in the transaction will be permanently applied to the database and cannot be reverted to the previous state without another DML command.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;ROLLBACK&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The &lt;strong&gt;ROLLBACK&lt;/strong&gt; command tells the database that we do not want any of the queries executed "within" that transaction to be applied. In other words, everything executed in the transaction will be undone. Some databases allow the use of the &lt;strong&gt;ROLLBACK&lt;/strong&gt; command on DDL operations (e.g., create table) and the like, while others do not. This depends on the DBMS you are using. I recommend reading the documentation of the DBMS you will use in your project.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;ERRORS&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;As I mentioned earlier, DBMSs implement transactions in their own way, and error behavior can vary among them. But it is almost a standard that the database performs a &lt;strong&gt;ROLLBACK&lt;/strong&gt; to preserve data reliability whenever an error occurs. This connects to the concept of atomicity, which we will see next.&lt;/p&gt;
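&lt;p&gt;As a hands-on sketch of these commands, here is Python's built-in sqlite3 in autocommit mode, so we issue BEGIN/COMMIT/ROLLBACK ourselves (the table and values are made up):&lt;/p&gt;

```python
import sqlite3

# autocommit mode: we manage the transaction boundaries explicitly
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute("CREATE TABLE accounts (name TEXT, balance INTEGER)")
conn.execute("INSERT INTO accounts VALUES ('JOAO', 100)")

conn.execute("BEGIN")  # start the "branch"
conn.execute("UPDATE accounts SET balance = 0 WHERE name = 'JOAO'")
conn.execute("ROLLBACK")  # undo everything since BEGIN
after_rollback = conn.execute("SELECT balance FROM accounts").fetchone()[0]

conn.execute("BEGIN")
conn.execute("UPDATE accounts SET balance = 150 WHERE name = 'JOAO'")
conn.execute("COMMIT")  # changes are now permanently applied
after_commit = conn.execute("SELECT balance FROM accounts").fetchone()[0]

print(after_rollback, after_commit)  # 100 150
```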

&lt;h1&gt;
  
  
  ACID
&lt;/h1&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Atomicity&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Atomicity is one of the four concepts that dictate the behavior of a transaction. It mainly deals with cases of failure. The core idea of atomicity is that a transaction is a unit of work in which all queries must be executed successfully, or none of them is executed. This applies while the transaction is "open" and you are performing queries or manipulations: if there is a query error, a constraint violation, a database crash for any reason, or even an external force like a power outage, the database will perform a &lt;strong&gt;ROLLBACK&lt;/strong&gt;. Even in more extraordinary cases, such as a crash while the database is performing a &lt;strong&gt;COMMIT&lt;/strong&gt; (writing changes to disk), it will undo all the queries that succeeded and return to the state before the transaction started.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;&lt;strong&gt;Example:&lt;/strong&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Let's perform a bank transfer between João and Maria with two queries: one to withdraw money from João's account and another to deposit it into Maria's account.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Withdrawing from João's account&lt;/em&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;UPDATE BANK_ACCOUNTS SET BALANCE - 100 WHERE NAME = 'JOAO'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Depositing into Maria's account&lt;/em&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;UPDATE&lt;/span&gt; &lt;span class="n"&gt;BANK_ACCOUNTS&lt;/span&gt; &lt;span class="k"&gt;SET&lt;/span&gt; &lt;span class="n"&gt;BALANCE&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt; &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;NAME&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'MARIA'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Committing changes&lt;/em&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;COMMIT&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;[x]  João's Query: &lt;strong&gt;ERROR IN THE DATABASE&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;[ ]  Maria's Query&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In our example, the database encountered a memory error and couldn't write both changes to disk. So, instead of deducting the amount from João's account without crediting it to Maria's (which would leave a hole of 100 dollars in the system, thus serving inconsistent data), the database wrote neither change and reverted to the state before the transaction started.&lt;/p&gt;
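&lt;p&gt;Here is a runnable sketch of this scenario with Python's sqlite3. Since we can't crash the database on demand, a contrived CHECK constraint makes the second UPDATE fail, and the whole unit of work is rolled back:&lt;/p&gt;

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# the CHECK constraint is contrived, purely to force a failure mid-transaction
conn.execute("""CREATE TABLE bank_accounts (
    name TEXT PRIMARY KEY,
    balance INTEGER CHECK (balance BETWEEN 0 AND 250))""")
conn.executemany("INSERT INTO bank_accounts VALUES (?, ?)",
                 [("JOAO", 150), ("MARIA", 200)])
conn.commit()

try:
    with conn:  # one transaction: commit on success, rollback on any error
        conn.execute("UPDATE bank_accounts SET balance = balance - 100 "
                     "WHERE name = 'JOAO'")   # 150 -> 50, succeeds
        conn.execute("UPDATE bank_accounts SET balance = balance + 100 "
                     "WHERE name = 'MARIA'")  # 200 -> 300, violates the CHECK
except sqlite3.IntegrityError:
    pass  # the whole unit is rolled back, including João's successful UPDATE

balances = dict(conn.execute("SELECT name, balance FROM bank_accounts"))
print(balances)  # {'JOAO': 150, 'MARIA': 200} -- neither change was applied
```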




&lt;h2&gt;
  
  
  &lt;strong&gt;Isolation&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;As most database management systems (DBMS) allow multiple connections from different sources, each executing various queries, there is a need for transactions executed by these different connections to be isolated, meaning they do not share their operations and changes until they are committed. The principle of isolation ensures that transactions within a database are executed separately, even when occurring concurrently.&lt;/p&gt;

&lt;p&gt;Within isolation, there is something we call &lt;strong&gt;read phenomena&lt;/strong&gt;, which demonstrates errors that can occur when operations are executed in different transactions that are happening simultaneously. There are four phenomena in total:&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Dirty read&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;A dirty read occurs when a query is executed within an isolation system that shares transaction states. When this happens, there is a possibility that the change will not be committed, and the query will return inconsistent data. To illustrate this, let's use the following table:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;sku&lt;/th&gt;
&lt;th&gt;quant&lt;/th&gt;
&lt;th&gt;price&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;20&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;15&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;We initiate the following query in Transaction 1:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;select&lt;/span&gt; &lt;span class="n"&gt;sku&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;quant&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;price&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="n"&gt;orders&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The result will be as follows:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;sku&lt;/th&gt;
&lt;th&gt;quantXprice&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;100&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;150&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Next, Transaction 2 begins in another database connection and performs the following manipulation:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;update&lt;/span&gt; &lt;span class="n"&gt;orders&lt;/span&gt; &lt;span class="k"&gt;set&lt;/span&gt; &lt;span class="n"&gt;quant&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt; &lt;span class="k"&gt;where&lt;/span&gt; &lt;span class="n"&gt;sku&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;orders&lt;/code&gt; table will now look like this:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;sku&lt;/th&gt;
&lt;th&gt;quant&lt;/th&gt;
&lt;th&gt;price&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;20&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;15&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Transaction 1 then performs another query:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;select&lt;/span&gt; &lt;span class="k"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;quant&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;price&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="n"&gt;orders&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The query result will be:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;sum(quantXprice)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;350&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Then Transaction 2 executes the &lt;strong&gt;rollback&lt;/strong&gt; command, undoing all the changes it made, and the &lt;code&gt;orders&lt;/code&gt; table returns to its initial state. Consequently, the result of the &lt;code&gt;sum(quant X price)&lt;/code&gt; operation is not 350 but rather 250.&lt;/p&gt;

&lt;p&gt;This example illustrates the concept of a "dirty read" and how it can lead to inconsistent data when transactions are not isolated: one transaction sees data that has been modified by another transaction that has not yet committed its changes.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Non-repeatable Read&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;A non-repeatable read is quite similar to a dirty read, with the difference being that Transaction 2 commits its change before Transaction 1 performs the second query. This leads to data inconsistency. You might think, "But if the change has been committed, shouldn't my report have the most up-to-date number?" However, this contradicts the principle of isolation, which aims to isolate the states and changes of each transaction.&lt;/p&gt;

&lt;p&gt;In a large system where the database experiences a high volume of changes, this can be problematic. Since transactions can observe changes made by other transactions, the results can be inconsistent or unexpected.&lt;/p&gt;

&lt;p&gt;To illustrate, let's use the same table from the dirty read example:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;sku&lt;/th&gt;
&lt;th&gt;quant&lt;/th&gt;
&lt;th&gt;price&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;20&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;15&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Transaction 1 executes a query:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;select&lt;/span&gt; &lt;span class="n"&gt;sku&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;quant&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;price&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="n"&gt;orders&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The result will be as follows:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;sku&lt;/th&gt;
&lt;th&gt;quantXprice&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;100&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;150&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Then Transaction 2 begins in another database connection and performs the following manipulation, followed by a commit:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;update&lt;/span&gt; &lt;span class="n"&gt;orders&lt;/span&gt; &lt;span class="k"&gt;set&lt;/span&gt; &lt;span class="n"&gt;quant&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt; &lt;span class="k"&gt;where&lt;/span&gt; &lt;span class="n"&gt;sku&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;  &lt;span class="k"&gt;commit&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;orders&lt;/code&gt; table will now permanently look like this:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;sku&lt;/th&gt;
&lt;th&gt;quant&lt;/th&gt;
&lt;th&gt;price&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;20&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;15&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Transaction 1 then performs another query:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;select&lt;/span&gt; &lt;span class="k"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;quant&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;price&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="n"&gt;orders&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The query result will be:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;sum(quantXprice)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;350&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The two phenomena are closely related. The key difference is that a dirty read exposes changes that were never committed, whereas a non-repeatable read exposes changes that were committed after the transaction's first read.&lt;/p&gt;
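&lt;p&gt;Here is a sketch of the same interleaving using Python's built-in &lt;code&gt;sqlite3&lt;/code&gt; (my own illustration, not the article's). Transaction 1's two reads are deliberately left as separate autocommit statements, which is precisely what makes the read non-repeatable once Transaction 2 commits in between them.&lt;/p&gt;

```python
import os
import sqlite3
import tempfile

# Illustration (not from the article): two reads that are not wrapped in a
# single snapshot observe different results after a concurrent commit.
path = os.path.join(tempfile.mkdtemp(), "orders.db")

t1 = sqlite3.connect(path, isolation_level=None)
t1.execute("CREATE TABLE orders (sku INTEGER, quant INTEGER, price INTEGER)")
t1.executemany("INSERT INTO orders VALUES (?, ?, ?)", [(1, 5, 20), (2, 10, 15)])

t2 = sqlite3.connect(path, isolation_level=None)

first = t1.execute("SELECT SUM(quant * price) FROM orders").fetchone()[0]

# Transaction 2 updates the row and commits between Transaction 1's reads.
t2.execute("BEGIN")
t2.execute("UPDATE orders SET quant = quant + 5 WHERE sku = 1")
t2.execute("COMMIT")

second = t1.execute("SELECT SUM(quant * price) FROM orders").fetchone()[0]
print(first, second)  # 250 350 -- the same query no longer repeats its result
```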

&lt;h3&gt;
  
  
  &lt;strong&gt;Phantom Read&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;A phantom read causes the same underlying problem as the other two phenomena: data inconsistency. The difference is that it involves rows the transaction never read in the first place. Instead of another transaction changing a row or set of rows that was already read, it inserts rows that match the first transaction's query, so they affect the second result without having appeared in the first. They are called "phantom" rows because they alter the result despite never having been read: the change happened after the transaction's initial query.&lt;/p&gt;

&lt;p&gt;Let's illustrate with the same table example:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;sku&lt;/th&gt;
&lt;th&gt;quant&lt;/th&gt;
&lt;th&gt;price&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;20&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;15&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Transaction 1 performs the following query:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;select&lt;/span&gt; &lt;span class="n"&gt;sku&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;quant&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;price&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="n"&gt;orders&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The result will be as follows:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;sku&lt;/th&gt;
&lt;th&gt;quantXprice&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;100&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;150&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Then Transaction 2 begins in another database connection and performs the following manipulation, followed by a commit:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;insert&lt;/span&gt; &lt;span class="k"&gt;into&lt;/span&gt; &lt;span class="n"&gt;orders&lt;/span&gt; &lt;span class="k"&gt;values&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;15&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="k"&gt;commit&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The table will now look like this:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;sku&lt;/th&gt;
&lt;th&gt;quant&lt;/th&gt;
&lt;th&gt;price&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;20&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;15&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;15&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Now, Transaction 1 performs another query:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;select&lt;/span&gt; &lt;span class="k"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;quant&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;price&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="n"&gt;orders&lt;/span&gt;&lt;span class="nv"&gt;` 
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The query result will be:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;sum(quantXprice)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;280&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;In this case, the row with SKU 3 was added in a separate transaction but affected the consistency of Transaction 1's queries, adding a "phantom row."&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Lost Updates&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Lost updates aren't strictly a read phenomenon, but they belong in this discussion. A lost update occurs when two transactions modify the same row and one ends up overwriting the change made by the other. Because transactions don't share state (what they read or changed, when they finished, and so on), a write-write conflict arises in which one transaction's change is silently discarded once the other transaction commits.&lt;/p&gt;

&lt;p&gt;Let's illustrate this with the same table used in the previous examples:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;sku&lt;/th&gt;
&lt;th&gt;quant&lt;/th&gt;
&lt;th&gt;price&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;20&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;15&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Transaction 1 executes the following query:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;update&lt;/span&gt; &lt;span class="n"&gt;orders&lt;/span&gt; &lt;span class="k"&gt;set&lt;/span&gt; &lt;span class="n"&gt;quant&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt; &lt;span class="k"&gt;where&lt;/span&gt; &lt;span class="n"&gt;sku&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The table will look like this (important to note that no commit has occurred):&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;sku&lt;/th&gt;
&lt;th&gt;quant&lt;/th&gt;
&lt;th&gt;price&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;20&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;15&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Afterward, Transaction 2 begins, where it also executes a modification query and commits:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;update&lt;/span&gt; &lt;span class="n"&gt;orders&lt;/span&gt; &lt;span class="k"&gt;set&lt;/span&gt; &lt;span class="n"&gt;quant&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt; &lt;span class="k"&gt;where&lt;/span&gt; &lt;span class="n"&gt;sku&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;  &lt;span class="k"&gt;commit&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The table is now permanently updated as follows:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;sku&lt;/th&gt;
&lt;th&gt;quant&lt;/th&gt;
&lt;th&gt;price&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;15&lt;/td&gt;
&lt;td&gt;20&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;15&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Right after that, Transaction 1 performs the query:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;select&lt;/span&gt; &lt;span class="n"&gt;sku&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;quant&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;price&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="n"&gt;orders&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The expected result was:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;sku&lt;/th&gt;
&lt;th&gt;quantXprice&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;200&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;150&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;However, Transaction 1's uncommitted change to the same row was overwritten by Transaction 2's committed update, so the query actually returns:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;sku&lt;/th&gt;
&lt;th&gt;quantXprice&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;300&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;150&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;This means the uncommitted change made by Transaction 1 was discarded without an explicit command, resulting in an unexpected outcome and data inconsistency.&lt;/p&gt;
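&lt;p&gt;The classic way to hit a lost update in application code is a read-modify-write cycle. The sketch below (mine, not the article's; it uses a single &lt;code&gt;sqlite3&lt;/code&gt; connection to keep the interleaving explicit) shows the overwrite, and the usual fix of doing the arithmetic inside the &lt;code&gt;UPDATE&lt;/code&gt; itself.&lt;/p&gt;

```python
import sqlite3

# Illustration (not from the article): two read-modify-write cycles on the
# same row, where the second write silently discards the first increment.
db = sqlite3.connect(":memory:", isolation_level=None)
db.execute("CREATE TABLE orders (sku INTEGER, quant INTEGER, price INTEGER)")
db.execute("INSERT INTO orders VALUES (1, 5, 20)")

def quant():
    return db.execute("SELECT quant FROM orders WHERE sku = 1").fetchone()[0]

seen_by_t1 = quant()  # both "transactions" read 5...
seen_by_t2 = quant()
db.execute("UPDATE orders SET quant = ? WHERE sku = 1", (seen_by_t1 + 5,))
db.execute("UPDATE orders SET quant = ? WHERE sku = 1", (seen_by_t2 + 10,))
lost = quant()
print(lost)  # 15 -- T1's +5 was overwritten; the combined result should be 20

# Doing the read-modify-write atomically inside the statement avoids the loss.
db.execute("UPDATE orders SET quant = 5 WHERE sku = 1")          # reset
db.execute("UPDATE orders SET quant = quant + 5 WHERE sku = 1")  # T1
db.execute("UPDATE orders SET quant = quant + 10 WHERE sku = 1") # T2
fixed = quant()
print(fixed)  # 20
```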

&lt;p&gt;These are the phenomena that the concept of database isolation aims to eliminate. So, what are isolation levels exactly?&lt;/p&gt;

&lt;h3&gt;
  
  
  Isolation Levels
&lt;/h3&gt;

&lt;p&gt;Isolation is divided into levels because each level addresses a specific set of problems (the read phenomena). Each DBMS implements its isolation model based on the problems and demands it seeks to address, and in some DBMSs you can choose the isolation level per transaction. The isolation levels are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Read Uncommitted&lt;/strong&gt;: No isolation; any change can be observed from outside the transaction, whether committed or not.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Read Committed&lt;/strong&gt;: Minimal isolation; any change made and committed within a transaction can be observed by another transaction.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Repeatable Read&lt;/strong&gt;: Medium isolation; the transaction ensures that when a query reads a row, that row remains unchanged throughout the transaction's execution.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Snapshot&lt;/strong&gt;: Strong isolation. Each query in the transaction runs against the tuples that were committed before the transaction started. It's like taking a "snapshot" of the database's current state and executing every query against it, shielding the transaction from changes made concurrently by other transactions. (It can still allow a subtler anomaly known as write skew.)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Serializable&lt;/strong&gt;: The strictest level. Serializable isolation ensures that even if multiple transactions run simultaneously, the final result is the same as if they had executed sequentially, one after the other, keeping the data consistent under concurrent execution.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The table below illustrates which problems each isolation level addresses.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Isolation Level&lt;/th&gt;
&lt;th&gt;Dirty Read&lt;/th&gt;
&lt;th&gt;Non-Repeatable Read&lt;/th&gt;
&lt;th&gt;Phantom Read&lt;/th&gt;
&lt;th&gt;Lost Updates&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Read Uncommitted&lt;/td&gt;
&lt;td&gt;Possible&lt;/td&gt;
&lt;td&gt;Possible&lt;/td&gt;
&lt;td&gt;Possible&lt;/td&gt;
&lt;td&gt;Possible&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Read Committed&lt;/td&gt;
&lt;td&gt;Prevented&lt;/td&gt;
&lt;td&gt;Possible&lt;/td&gt;
&lt;td&gt;Possible&lt;/td&gt;
&lt;td&gt;Possible&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Repeatable Read&lt;/td&gt;
&lt;td&gt;Prevented&lt;/td&gt;
&lt;td&gt;Prevented&lt;/td&gt;
&lt;td&gt;Possible&lt;/td&gt;
&lt;td&gt;Possible&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Snapshot&lt;/td&gt;
&lt;td&gt;Prevented&lt;/td&gt;
&lt;td&gt;Prevented&lt;/td&gt;
&lt;td&gt;Prevented&lt;/td&gt;
&lt;td&gt;Prevented&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Serializable&lt;/td&gt;
&lt;td&gt;Prevented&lt;/td&gt;
&lt;td&gt;Prevented&lt;/td&gt;
&lt;td&gt;Prevented&lt;/td&gt;
&lt;td&gt;Prevented&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;These isolation levels offer a range of trade-offs between data consistency and performance, allowing you to choose the level that best suits your application's requirements.&lt;/p&gt;

&lt;h2&gt;
  
  
  Consistency
&lt;/h2&gt;

&lt;p&gt;Throughout the text, I have used the term "data inconsistency": a state in which data does not hold the value it should after manipulation commands are executed, and so no longer matches real-world events. Here we will look at data inconsistencies within the database itself.&lt;/p&gt;

&lt;p&gt;To illustrate this, let's use two example tables:&lt;/p&gt;

&lt;p&gt;Table "orders":&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;sku&lt;/th&gt;
&lt;th&gt;quant&lt;/th&gt;
&lt;th&gt;price&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Table "relat_orders":&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;sku&lt;/th&gt;
&lt;th&gt;amount&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;In this example, we have a relationship inconsistency. The "relat_orders" table has the "amount" column, which is the result of quant x price from the "orders" table. SKU 1 should have an "amount" of 6, but it actually has 5 because the tables do not have the proper relationship. To facilitate table modeling within the database, DBMSs have constraints, which are rules applied to columns to standardize behavior or relationships between them. Examples of constraints include unique, not null, and foreign key constraints.&lt;/p&gt;

&lt;p&gt;The concept of consistency in ACID refers to a database that maintains the integrity of all constraints. The term "integrity" refers to the accuracy or correctness of the data's state within a database, where the database cannot transition from a consistent state to an inconsistent state. Let's work with some scripts to better understand this concept.&lt;/p&gt;

&lt;p&gt;Table creation script (PostgreSQL):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;orders&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="nb"&gt;serial&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="n"&gt;sku&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="n"&gt;status&lt;/span&gt; &lt;span class="nb"&gt;varchar&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Tuple insertion script (PostgreSQL):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;INSERT&lt;/span&gt; &lt;span class="k"&gt;INTO&lt;/span&gt; &lt;span class="n"&gt;orders&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sku&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;VALUES&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The database cannot accept the second script because the "status" column must have a value in all tuples that are inserted (due to the "not null" constraint). If the insertion occurs, the database will transition from a consistent state to an inconsistent one because it violated a constraint, consequently violating the principle of consistency.&lt;/p&gt;
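&lt;p&gt;A quick sketch with Python's built-in &lt;code&gt;sqlite3&lt;/code&gt; (an illustration of mine, not from the article) shows the constraint doing its job: the insert is rejected and no row is stored, so the database never leaves its consistent state.&lt;/p&gt;

```python
import sqlite3

# Illustration (not from the article): a NOT NULL constraint keeps the
# database consistent by rejecting the violating insert outright.
db = sqlite3.connect(":memory:")
db.execute("""
    CREATE TABLE orders (
        id     INTEGER PRIMARY KEY,
        sku    INTEGER,
        status TEXT NOT NULL
    )
""")

caught = None
try:
    db.execute("INSERT INTO orders (sku) VALUES (1)")
except sqlite3.IntegrityError as exc:
    caught = str(exc)

print(caught)  # NOT NULL constraint failed: orders.status
rows = db.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
print(rows)    # 0 -- the violating row was never stored
```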

&lt;h2&gt;
  
  
  &lt;strong&gt;Durability&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;This is the last property in the set that makes the database so dependable throughout the process of creating, managing, and storing data. Durability ensures that all changes made within a transaction are written to non-volatile storage at commit and can only be altered by a subsequent transaction. In other words, committed data survives power outages, errors, or any kind of failure that takes the database offline for a period, and remains available for querying and manipulation afterward.&lt;/p&gt;
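&lt;p&gt;A minimal sketch with Python's &lt;code&gt;sqlite3&lt;/code&gt; (my illustration; closing the connection merely stands in for the process going away, it is not a real crash test, which would also involve the engine's fsync settings): once a transaction commits, the data can be read back even after every connection to the file is gone.&lt;/p&gt;

```python
import os
import sqlite3
import tempfile

# Illustration (not from the article): committed data survives the writing
# process disappearing, because commit flushed it to the on-disk file.
path = os.path.join(tempfile.mkdtemp(), "orders.db")

conn = sqlite3.connect(path, isolation_level=None)
conn.execute("CREATE TABLE orders (sku INTEGER, quant INTEGER, price INTEGER)")
conn.execute("BEGIN")
conn.execute("INSERT INTO orders VALUES (1, 5, 20)")
conn.execute("COMMIT")  # at commit the change reaches non-volatile storage
conn.close()            # stand-in for the process (or server) going away

reopened = sqlite3.connect(path)
survived = reopened.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
print(survived)  # 1 -- the committed row is still there
```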

&lt;p&gt;The trade-off of durability is quite clear. To establish the durability property, data must be written to non-volatile memory as mentioned earlier (usually a hard drive), and non-volatile memories are naturally slow (I will write about how databases store data on disk and why they are slow in certain operations in the future).&lt;/p&gt;

&lt;p&gt;In-memory databases like Redis, which keep data in RAM (volatile memory), trade away part of the ACID guarantees in favor of performance for both write and read operations. In terms of durability the trade-off is clear: low-latency access to data comes with the risk that an unexpected event (a power outage, a crash) shuts down the service and loses any data that was never written to non-volatile storage. Redis implements its own versions of atomicity, consistency, isolation, and durability, tailored to very specific demands, and those demands are quite different from the ones that disk-based databases aim to solve.&lt;/p&gt;

&lt;h1&gt;
  
  
  Conclusion
&lt;/h1&gt;

&lt;p&gt;With this, we have been able to understand how a database with ACID properties minimizes errors for both read and write operations. It's important to read about how the DBMS you are using in your project implements these "concepts." What I presented here is a generalization of how databases approach transaction management and data storage.&lt;/p&gt;

&lt;p&gt;Finally, I recommend the "Fundamentals of Database Engineering" course available on Udemy (&lt;em&gt;I'm not someone who usually pays for courses, but this one is really worth it&lt;/em&gt;). I'll leave the link below.&lt;/p&gt;

&lt;p&gt;Course: &lt;a href="https://www.udemy.com/course/database-engines-crash-course/" rel="noopener noreferrer"&gt;Fundamentals of Database Engineering&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Thank you very much to everyone who read this!&lt;/p&gt;

</description>
      <category>database</category>
      <category>beginners</category>
      <category>acid</category>
      <category>dataengineering</category>
    </item>
    <item>
      <title>What is data engineering and a B.I architecture</title>
      <dc:creator>Pedro H Goncalves</dc:creator>
      <pubDate>Fri, 30 Jun 2023 20:28:02 +0000</pubDate>
      <link>https://dev.to/pedrohgoncalves/what-is-data-engineering-and-a-bi-architecture-kg4</link>
      <guid>https://dev.to/pedrohgoncalves/what-is-data-engineering-and-a-bi-architecture-kg4</guid>
      <description>&lt;p&gt;In this article, we will talk about data engineers, what they manage, what they develop, and other activities they perform. We will also explore what a BI architecture is and what role data engineers play in this vast world of data (which is still relatively new to many people).&lt;/p&gt;

&lt;p&gt;We will also discuss the skills typically possessed by data engineers. To avoid confusion regarding the positions we'll be referring to or if you're unsure about how a BI team is usually structured, I recommend reading my other article that explains the differences in skills and activities performed in various positions. You can find the link below.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://dev.to/pedrohgoncalves/data-roles-in-data-teams-and-your-skill-set-using-math-1332"&gt;&lt;em&gt;&lt;strong&gt;Data roles in data teams and your skill set. Using math.&lt;/strong&gt;&lt;/em&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h1&gt;
  
  
  Data Engineer
&lt;/h1&gt;

&lt;p&gt;A data engineer is responsible for designing, building, and managing data infrastructures that handle the processing, transformation, and storage of large volumes of data from various sources. There are different "types" of data engineers, such as those working with streaming data, messaging systems, big data engineers dealing with distributed systems, and many others. In this article, I will focus on those who work in BI teams.&lt;/p&gt;

&lt;p&gt;A data engineer (in BI teams) is the one who equips BI analysts/data analysts, data scientists, ML models, data products, managers, and the entire company with reliable data. They achieve this by using tools for large-scale data processing, creating, managing, and monitoring routines. They develop tooling such as APIs and applications that abstract user activities, making it easier for all departments to leverage the data and ensuring transparent and auditable processes. They employ techniques like data modeling, following all normalization rules, and utilize current tools to develop scalable, performant data warehouses, data lakes, and data lakehouses that use minimal storage and processing resources.&lt;/p&gt;

&lt;h1&gt;
  
  
  Outdated Processes
&lt;/h1&gt;

&lt;p&gt;In the vast majority of companies, data analysis is carried out using Excel and Google Sheets. Typically, it's a repetitive task that consumes one's time, which could be spent on other tasks. Moreover, it has various weaknesses, such as the lack of visualization with charts, making it difficult for individuals to grasp the magnitude of data and make more informed decisions quickly. Given that Excel is prone to human errors, using it as the primary method for data analysis is a significant disadvantage.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;I'm not against companies using Excel for their analysis; I'm against companies that have valuable data, which could be used as a growth pillar, but still treat it as a mere consequence of events.&lt;/em&gt;&lt;/p&gt;

&lt;h1&gt;
  
  
  When do I know I need a data engineer?
&lt;/h1&gt;

&lt;p&gt;Speaking from my experience, it became evident that we needed one or more data engineers when the BI team's constructions (metrics tracking routines) started competing with production in terms of processing power, leading to exponential resource consumption. The routines we created started affecting overall sales processing, negatively impacting the user experience in our e-commerce platform. Additionally, we were faced with the daunting task of managing the costs of machines, processing, and storage, which were incurred without proper planning.&lt;/p&gt;

&lt;p&gt;To address these issues, we decided to separate the production environments from the analytics team almost entirely. We improved storage by adopting datalakehouse principles and compressing files, which significantly reduced our space requirements. By using incremental data updates instead of full processing of all data, we eliminated processing bottlenecks and improved delivery speed for analytics and data science teams. With many abstractions in place, the processes became transparent, and most team members understood how KPIs were calculated. This transparency encouraged the company to become even more data-driven.&lt;/p&gt;

&lt;p&gt;In general, you will want a data engineer when your operation starts to grow and the data-driven culture is well-established, or when you want to turn your data into a product, commercialize it, create data products from it, or simply need better performance and cost-effectiveness. A data engineer can address many of these challenges and provide guidance on optimizing the utilization of your resources for data processing, handling, storage, and management. They can help you make the most out of the data your company possesses.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;In an ideal world, the data engineer is the pioneer of the entire data movement, but in reality, this role is relatively new in the market and might not always be the case.&lt;/em&gt;&lt;/p&gt;

&lt;h1&gt;
  
  
  Architecture Planning
&lt;/h1&gt;

&lt;p&gt;As companies, teams, and operations grow, it's natural for these outdated processes to fall behind, and a BI team starts to be structured (or at least it should). At this stage, we enter the realm of data engineering, starting with studying the type of architecture to be used. Will we implement a data warehouse, or perhaps a data lakehouse? Will we use a cloud or an on-premise solution? What does our budget allow us to develop? Which tools will we use for daily KPI monitoring, PowerBI, Tableau, or another option? All of these questions, among others, are answered in collaboration with other departments, taking into account the company's current state, historical and cultural context, and the skills possessed by the people directly involved. These are some of the "obvious" variables that must be considered in the planning process of a data engineering center of excellence.&lt;/p&gt;

&lt;h2&gt;
  
  
  Development and Tooling
&lt;/h2&gt;

&lt;p&gt;Once these pertinent questions are answered, the subsequent stages should be solid, metric-driven, well-documented, and well-architected to ensure reliable ETL/ELT processes. The development of pipelines mainly involves moving information from one system to another. You can perform an ETL (extract-transform-load) directly into a data warehouse on your RDS, or an ELT (extract-load-transform) process on your data lake. You may be aggregating data from your production database or consuming an FTP from a partner company to enrich your database. Developing APIs for other services to consume your transformed data, or for data scientists to access it, is quite common and not exclusive to back-end developers.&lt;/p&gt;

&lt;p&gt;The choice of tools can be made during the development stage, but it generally aligns with common practices. For pipeline orchestration, a strong candidate is Airflow, a Python framework for managing routines. For distributed processing, you have PySpark and Spark at your disposal. For an on-premise data lake, you can use MinIO. For your data warehouse, PostgreSQL with star-schema modeling is a common choice; if you scale up with many fact tables and numerous dimensions, making a star schema impractical, you can opt for snowflake modeling.&lt;/p&gt;

&lt;p&gt;If you're performing data scraping to enrich your data, you can use low-code software like IBM RPA, or, if you prefer to stay in Python, scrapy, an excellent framework for web crawling.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;I will write more articles in the future about MinIO and data lakes, Airflow and task orchestration, and distributed processing with Spark.&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Metadata
&lt;/h2&gt;

&lt;p&gt;When developing data systems, it is essential to have documentation, not just for the code but also for metadata, which is data about data. A pipeline that consumes information from a daily API performs various transformations before storing it alongside data from other systems. But what transformations does it perform and why? In BI teams, datasets are often prepared daily with numerous transformations, aggregations, and abstractions. How are these aggregations done? How is the &lt;code&gt;gross_revenue&lt;/code&gt; column calculated? Why do many columns from the production table not appear in this dataset? These are common questions that analysts and data scientists will ask, highlighting the need for a robust knowledge base with this metadata.&lt;/p&gt;
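&lt;p&gt;Such a knowledge base can start very small. Here is a hedged sketch of what one metadata record for the &lt;code&gt;gross_revenue&lt;/code&gt; column might look like; the field names and values are illustrative assumptions, not a standard.&lt;/p&gt;

```python
# One metadata record: data about the gross_revenue column itself.
# Every field here is an illustrative example, not a prescribed schema.
gross_revenue_meta = {
    "column": "gross_revenue",
    "source": "orders.amount on the production replica",
    "formula": "sum(amount) per customer per day, before refunds and taxes",
    "transformations": ["currency converted to BRL", "rounded to 2 decimals"],
    "owner": "bi-team",
    "updated_daily": True,
}
```

Records like this, kept next to the pipeline code, answer the "how is this column calculated?" questions before they are asked.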

&lt;h2&gt;
  
  
  Data management
&lt;/h2&gt;

&lt;p&gt;Data management is one of the tasks data engineers handle, and it shares similarities with the work of DBAs. Applying privacy guidelines to your data, granting the right access to the right users, and managing it continuously is labor-intensive, even though many DBMSs and S3 storage services have integrated permission controls. You also need robust logging and metrics systems to monitor the daily health of the data and pipelines, producing reports on routines that failed, ran incompletely, or hit any other kind of inconsistency. Data reliability and the margin of error must be measured and communicated relentlessly: reliability is usually tied to the company's external data sources, while the margin of error comes from rounding and from updates made in the production environment, which directly affect OLAP systems.&lt;/p&gt;
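&lt;p&gt;A minimal sketch of such a daily run report is shown below; the function and field names are illustrative assumptions, not from any specific monitoring tool.&lt;/p&gt;

```python
from datetime import date

def run_report(pipeline, rows_expected, rows_loaded, errors):
    # Reliability here is the share of expected rows that actually arrived;
    # real systems would also track freshness, schema drift, etc.
    reliability = rows_loaded / rows_expected if rows_expected else 0.0
    return {
        "pipeline": pipeline,
        "date": date.today().isoformat(),
        "rows_loaded": rows_loaded,
        "errors": errors,
        "reliability_pct": round(100 * reliability, 2),
        # A run is "ok" only if nothing failed and every expected row landed.
        "status": "ok" if not errors and reliability == 1.0 else "incomplete",
    }
```

Publishing a record like this after every run is what turns "the pipeline probably worked" into a measurable margin of error.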

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;In conclusion, we have explored what a data engineer does, how they can begin their journey, and when their efforts are needed. It is important to note that these insights are based on my experiences working across the three main data fronts. If you have any remaining questions about the positions within a data team or if you would like to learn more about the skill set required for a data engineer, I encourage you to read my article on the composition of a BI team and the skills typically sought after in this role.&lt;/p&gt;

&lt;p&gt;Thank you very much for reading.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://dev.to/pedrohgoncalves/data-roles-in-data-teams-and-your-skill-set-using-math-1332"&gt;&lt;em&gt;&lt;strong&gt;Data roles in data teams and your skill set. Using math.&lt;/strong&gt;&lt;/em&gt;&lt;/a&gt;&lt;/p&gt;

</description>
      <category>dataengineering</category>
      <category>datascience</category>
      <category>beginners</category>
      <category>architecture</category>
    </item>
    <item>
      <title>Unsupervised Clustering with K-Means</title>
      <dc:creator>Pedro H Goncalves</dc:creator>
      <pubDate>Mon, 26 Jun 2023 01:30:43 +0000</pubDate>
      <link>https://dev.to/pedrohgoncalves/unsupervised-clustering-with-k-means-5h26</link>
      <guid>https://dev.to/pedrohgoncalves/unsupervised-clustering-with-k-means-5h26</guid>
      <description>&lt;p&gt;In the past few weeks, I have been studying clustering and some of its models to apply in a project at the company I work for. When you study clustering, you quickly come across the centroid-based model, which leads you to K-Means, the best-known algorithm for this type of clustering. We will use K-Means for our activity.&lt;/p&gt;

&lt;p&gt;As for how we will carry out the activity: we will use a dataset containing sales information from an unnamed company, and we will cluster its customers based on some of their behaviors in the store. For this purpose, we will also use the RFM concept (recency, frequency, monetary), which is widely used by marketing teams.&lt;/p&gt;

&lt;p&gt;It's important to remember that there are various other types of clustering and centroid-based clustering algorithms. This article specifically focuses on K-Means and a practical application of its algorithm.&lt;/p&gt;

&lt;p&gt;In this article, we will not discuss data transformation or data visualization. If you would like to provide feedback on my code, you can visit the repository where the code used in this article is located, as there are some visualization and transformation aspects not shown here.&lt;/p&gt;

&lt;h2&gt;
  
  
  K-Means
&lt;/h2&gt;

&lt;p&gt;K-Means is an unsupervised algorithm, which means it does not require "labels" on the events, unlike supervised algorithms that need labels for training. Unsupervised algorithms are designed to learn from the data itself by autonomously identifying patterns (often not visible to the naked eye).&lt;/p&gt;

&lt;p&gt;The goal of the algorithm is to generate K clusters (where K is defined by the scientist), minimizing the variance within each cluster and thereby increasing the similarity among points assigned to the same cluster.&lt;/p&gt;

&lt;h3&gt;
  
  
  How it works
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;The algorithm randomly places K centroids in the feature space.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;It iterates over each point and each centroid, computing the Euclidean distance between them, and assigns each point to the centroid with the shortest distance.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbfhygmkgedlr5wmxzum9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbfhygmkgedlr5wmxzum9.png" alt="Image description" width="385" height="195"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ol start="3"&gt;
&lt;li&gt;&lt;p&gt;Recalculate the position of the clusters based on the mean coordinates of each point assigned to the same cluster.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Repeat steps 2 and 3 until the position of the clusters no longer undergoes significant changes or until a certain number of iterations is reached.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
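&lt;p&gt;The four steps above can be sketched in pure Python for a single feature. This is only an illustrative toy implementation (the function name is mine, and the initialization is made deterministic instead of random so the example is reproducible); in practice you would use scikit-learn's KMeans as we do below.&lt;/p&gt;

```python
def kmeans_1d(points, k, max_iter=100):
    # Step 1: place k initial centroids. Real K-Means picks them at random;
    # here we take evenly spaced values from the sorted data so the sketch
    # is deterministic.
    points = sorted(points)
    centroids = [points[i * (len(points) - 1) // (k - 1)] for i in range(k)]
    for _ in range(max_iter):
        # Step 2: assign each point to its nearest centroid
        # (in one dimension the Euclidean distance is just abs()).
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: abs(p - centroids[c]))
            clusters[nearest].append(p)
        # Step 3: move each centroid to the mean of its assigned points.
        moved = [sum(c) / len(c) if c else centroids[i]
                 for i, c in enumerate(clusters)]
        # Step 4: stop once the centroids no longer change.
        if moved == centroids:
            break
        centroids = moved
    return centroids
```

For example, with two obvious groups of values, `kmeans_1d([12, 1, 11, 2, 10, 3], 2)` converges to the centroids 2.0 and 11.0.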

&lt;h2&gt;
  
  
  Determining the ideal number of K
&lt;/h2&gt;

&lt;p&gt;To determine K (the number of clusters), we will use the elbow method, the most commonly used technique for this task. We will also use the point-to-line distance calculation to refine and confirm our choice of the number of clusters.&lt;/p&gt;

&lt;p&gt;The elbow method computes, for each candidate K, the sum of squared distances between each point and the centroid of its cluster: the total inertia (variability of the points) of the clustering. The formula for this calculation is as follows:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwzjvvomv9dxc3u2z8i7p.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwzjvvomv9dxc3u2z8i7p.png" alt="Image description" width="250" height="87"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Where K is the number of clusters, x is a point assigned to a cluster, and μ is the centroid (mean point) of that cluster.&lt;/p&gt;
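&lt;p&gt;As a minimal illustration of this formula, the total inertia can be computed directly. This is a sketch of mine: the input shape, a list of (centroid, points) pairs, is assumed purely for the example.&lt;/p&gt;

```python
def wcss(clusters):
    # Total inertia: for every cluster, sum the squared distance between
    # each point x and the cluster centroid mu, then add the per-cluster
    # sums together.
    total = 0.0
    for mu, points in clusters:
        total += sum((x - mu) ** 2 for x in points)
    return total
```

For instance, two clusters with centroids 2.0 and 11.0 over the points [1, 2, 3] and [10, 11, 12] give a total inertia of 4.0.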

&lt;p&gt;The point-to-line distance is the perpendicular distance from each point on the curve to the line defined by two endpoints. It is used to find the elbow: the K that balances the greatest homogeneity within clusters against the greatest difference between clusters.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5xit2p05567yevsv1v3c.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5xit2p05567yevsv1v3c.png" alt="Image description" width="659" height="80"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;P0 and P1 are our first point (P0) and our last point (P1). y1 is the y-value (on a Cartesian plane) of the last point (P1), and y0 that of the first point; the same logic applies to x0 and x1. Since we iterate over each candidate number of clusters, x and y in the equation represent the coordinates of the cluster count currently being evaluated.&lt;/p&gt;

&lt;p&gt;We will start by defining two functions. &lt;code&gt;calculateWcss&lt;/code&gt; iterates from 1 to 9 clusters (we don't want too many customer clusters in our dataset; the range is generally determined and tested with the data and business teams). It fits K-Means for each number of clusters and returns a list with the total inertia of each one.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;calculateWcss&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  
    &lt;span class="n"&gt;wcss&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;  
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;k&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;  
        &lt;span class="n"&gt;kmeans&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;KMeans&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n_clusters&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  
        &lt;span class="n"&gt;kmeans&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;fit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  
        &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;clusters&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;kmeans&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;labels_&lt;/span&gt;  
        &lt;span class="n"&gt;wcss&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;kmeans&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;inertia_&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;wcss&lt;/span&gt;  

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;plotFigure&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;quadraticSum&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;figsize&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="nb"&gt;tuple&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;]):&lt;/span&gt;  
    &lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;figure&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;figsize&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;figsize&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  
    &lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;plot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;quadraticSum&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  
    &lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;xlabel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Clusters&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  
    &lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;show&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;  

&lt;span class="n"&gt;dfRecencyModel&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;dfOrderCostumer&lt;/span&gt;&lt;span class="p"&gt;[[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;recency&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]]&lt;/span&gt;  
&lt;span class="n"&gt;quadraticSum&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;calculateWcss&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dfRecencyModel&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  
&lt;span class="nf"&gt;plotFigure&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;quadraticSum&lt;/span&gt;&lt;span class="p"&gt;,(&lt;/span&gt;&lt;span class="mi"&gt;13&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;6&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Calling the calculateWcss function using the column in the dataset that represents the number of days since the last purchase and plotting it in the plotFigure function, we get the following result:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmpzet6yq6kztbeo371at.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmpzet6yq6kztbeo371at.png" alt="Image description" width="800" height="394"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Interpreting this graph, we might think, "&lt;b&gt;Well, the number 8 of clusters is the best because it has the lowest inertia.&lt;/b&gt;" It's not entirely incorrect, but not entirely correct either. As mentioned earlier, we don't want too many clusters, so we're looking for the point where the inertia doesn't decrease drastically, always aiming to have the fewest clusters.&lt;/p&gt;

&lt;p&gt;Upon reevaluation, we could say that 2 and 3 are strong candidates. However, we will use the point-to-line distance calculation to confirm the number of clusters we will apply.&lt;/p&gt;

&lt;p&gt;Let's define the &lt;code&gt;distancePointLine&lt;/code&gt; function in code. For each candidate number of clusters, it calculates the perpendicular distance to the line through P0 and P1, which are clusters 1 and 9 (the range defined in &lt;code&gt;calculateWcss&lt;/code&gt;). It returns the ideal number of clusters: the one with the greatest perpendicular distance to that line.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;distancePointLine&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;wcss&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  
    &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;math&lt;/span&gt;  
    &lt;span class="n"&gt;x1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y1&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;wcss&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;  
    &lt;span class="n"&gt;x2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;y2&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;wcss&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;wcss&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;  

    &lt;span class="n"&gt;distance&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;  
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;wcss&lt;/span&gt;&lt;span class="p"&gt;)):&lt;/span&gt;  
        &lt;span class="n"&gt;x0&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;  
        &lt;span class="n"&gt;y0&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;wcss&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;  
        &lt;span class="n"&gt;numerator&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;abs&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="n"&gt;y2&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;y1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;x0&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x2&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;x1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;y0&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;x2&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;y1&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;y2&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;x1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  
        &lt;span class="n"&gt;denominator&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;math&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sqrt&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="n"&gt;y2&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;y1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;**&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x2&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;x1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;**&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  
        &lt;span class="n"&gt;distance&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;numerator&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;denominator&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;distance&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;index&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;max&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;distance&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Calling the function with the return value of &lt;code&gt;calculateWcss&lt;/code&gt;, we get 4 as the ideal number of clusters, and we will use it for the rest of the tasks.&lt;/p&gt;

&lt;h2&gt;
  
  
  Clustering our dataset
&lt;/h2&gt;

&lt;p&gt;In our dataset, we have information such as recency (which we used to determine the ideal number of clusters and represents the number of days since the last purchase), frequency (which represents the number of times a particular customer has made purchases in our store), and monetary value (representing the amount the customer spent in our store). Typically, people would cluster using all the features (columns) together. However, we will perform separate clustering for each feature, specifically four clusters for each feature.&lt;/p&gt;

&lt;p&gt;Let's start by defining a function that takes as parameters: the name of the new cluster column, the name of the feature used as the basis for clustering, the DataFrame slice containing that feature, the DataFrame to which the clustering will be added, and whether the rating (the cluster a customer belongs to) should be in ascending or descending order.&lt;/p&gt;

&lt;p&gt;We will use the cluster as the rating. Since the clusters go from 0 to 3, the cluster rated 0 will represent the customers who have spent the least money or have been inactive on the platform the longest.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;orderCluster&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;clusterName&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;target_name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;featureColumn&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="n"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;dfAppend&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;ascending&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="nb"&gt;bool&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  
    &lt;span class="n"&gt;kmeans&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;KMeans&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n_clusters&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;nmrCluster&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  

    &lt;span class="n"&gt;dfUse&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;dfAppend&lt;/span&gt;  
    &lt;span class="n"&gt;dfUse&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;clusterName&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;kmeans&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;fit_predict&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;featureColumn&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  

    &lt;span class="n"&gt;groupbyCluster&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;dfUse&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;groupby&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;clusterName&lt;/span&gt;&lt;span class="p"&gt;)[&lt;/span&gt;&lt;span class="n"&gt;target_name&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;mean&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;reset_index&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;  
    &lt;span class="n"&gt;groupbyCluster&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;groupbyCluster&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sort_values&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;by&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;target_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;ascending&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;ascending&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;reset_index&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;drop&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  
    &lt;span class="n"&gt;groupbyCluster&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;index&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;groupbyCluster&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;index&lt;/span&gt;  
    &lt;span class="n"&gt;groupbyCluster&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;drop&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;columns&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;target_name&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="n"&gt;inplace&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  
    &lt;span class="n"&gt;dfUsageMerged&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;merge&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dfUse&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;groupbyCluster&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;on&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;clusterName&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  
    &lt;span class="n"&gt;dfUsageMerged&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;drop&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;columns&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;clusterName&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="n"&gt;inplace&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  
    &lt;span class="n"&gt;dfUsageMerged&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;rename&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;columns&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;index&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="n"&gt;clusterName&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="n"&gt;inplace&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;dfUsageMerged&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now we will call the &lt;code&gt;orderCluster&lt;/code&gt; function for each feature and merge the results into &lt;code&gt;dfMain&lt;/code&gt; (the DataFrame on which we performed some transformations after reading the .csv file).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;finalDataframe&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;dfMain&lt;/span&gt;&lt;span class="p"&gt;[[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;id_unique_costumer&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;recency&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;recency_cluster&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;order_approved&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;frequency_cluster&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;agg_value&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;revenue_cluster&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]]&lt;/span&gt;  
&lt;span class="n"&gt;finalDataframe&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;pontuation&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;finalDataframe&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;recency_cluster&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;finalDataframe&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;frequency_cluster&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;finalDataframe&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;revenue_cluster&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="n"&gt;finalDataframe&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;segmentation&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Inactive&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;

&lt;span class="n"&gt;finalDataframe&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;loc&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;finalDataframe&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;pontuation&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;segmentation&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Business&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;  
&lt;span class="n"&gt;finalDataframe&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;loc&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;finalDataframe&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;pontuation&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;segmentation&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Master&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;  
&lt;span class="n"&gt;finalDataframe&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;loc&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;finalDataframe&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;pontuation&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;segmentation&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Premium&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We can then plot a graph to visualize the distribution of each segment, using the agg_value (total amount spent) and recency (days since the last purchase) features.&lt;/p&gt;

&lt;p&gt;Here's the function to plot the graph:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;plot_segment&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;  
    &lt;span class="n"&gt;sns&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;palette&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;muted&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;color_codes&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;style&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;whitegrid&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;    
    &lt;span class="n"&gt;sns&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;scatterplot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;hue&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;segmentation&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;sizes&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;50&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;150&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;&lt;span class="n"&gt;size_order&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Premium&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Master&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Business&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Inativo&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;  
    &lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;show&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="nf"&gt;plot_segment&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;recency&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;agg_value&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;finalDataframe&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;   
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The plotted graph:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnrbqniso7oblszqlrrfc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnrbqniso7oblszqlrrfc.png" alt="Image description" width="581" height="420"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;With this graph, it becomes clear that our customers classified as Premium (few, as expected) have spent more than average and made recent purchases, while the inactive ones have spent little and haven't purchased for some time. Based on this, our company can communicate in a more targeted way, offering customized services to Premium customers and discount coupons to encourage inactive customers to return and spend in our store.&lt;/p&gt;
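&lt;p&gt;To size those campaigns, a quick per-segment summary helps. Below is a minimal sketch, assuming a finalDataframe with the segmentation, agg_value, and recency columns built above; the miniature data is made up purely for illustration:&lt;/p&gt;

```python
import pandas as pd

# Hypothetical miniature of finalDataframe, for illustration only
finalDataframe = pd.DataFrame({
    'segmentation': ['Premium', 'Inactive', 'Business', 'Inactive', 'Master'],
    'agg_value': [950.0, 20.0, 310.0, 15.0, 620.0],
    'recency': [3, 480, 45, 510, 20],
})

# Customer count, average spend, and average recency per segment
summary = (finalDataframe
           .groupby('segmentation')
           .agg(customers=('agg_value', 'count'),
                avg_spend=('agg_value', 'mean'),
                avg_recency=('recency', 'mean'))
           .reset_index())
print(summary)
```

&lt;p&gt;A table like this makes it easy to see how many customers each campaign would reach and roughly what they are worth.&lt;/p&gt;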

&lt;h2&gt;
  
  
  Digging deeper into RFM.
&lt;/h2&gt;

&lt;p&gt;Let's further analyze our recency cluster with the following code.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;finalDataframe&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;groupby&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;recency_cluster&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;recency&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;describe&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;reset_index&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftdgjnmjosa6psdfw3oe8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftdgjnmjosa6psdfw3oe8.png" alt="Image description" width="800" height="115"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Now we know that customers in cluster 0 have gone an average of 490 days since their last purchase, making it the least recent cluster.&lt;/p&gt;

&lt;p&gt;The RFM concept builds further customer groups from the clusters each customer belongs to, taking recency, frequency, and monetary value into account together. For example, a customer in cluster 0 for recency, cluster 1 for frequency, and cluster 3 for monetary value hasn't purchased recently, has made a reasonable number of purchases in our store, and has spent a high amount of money. An RFM analysis would allocate this customer to a "dormant" or "hibernating" segment. We could implement this classification in our algorithm, but I present it here as a challenge. I recommend reading more about RFM and how to apply it in your business alongside unsupervised clustering.&lt;/p&gt;
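&lt;p&gt;As a starting point for that challenge, here is a minimal sketch of combining the three cluster columns into named RFM segments. The rules and segment names below are illustrative assumptions, not a definitive mapping:&lt;/p&gt;

```python
import pandas as pd

# Hypothetical sample with the three cluster columns (0 = worst, 3 = best),
# mirroring the ordered clusters produced earlier
finalDataframe = pd.DataFrame({
    'recency_cluster':   [0, 3, 1, 2],
    'frequency_cluster': [1, 3, 0, 2],
    'revenue_cluster':   [3, 3, 0, 1],
})

def rfm_segment(row):
    # Illustrative rules: big spender who hasn't bought in a long time
    # is "hibernating"; strong on all three axes is a "champion"
    if row['recency_cluster'] == 0 and row['revenue_cluster'] >= 2:
        return 'Hibernating'
    if min(row) >= 2:
        return 'Champion'
    if max(row) == 0:
        return 'Lost'
    return 'Regular'

finalDataframe['rfm_segment'] = finalDataframe.apply(rfm_segment, axis=1)
print(finalDataframe)
```

&lt;p&gt;In practice you would tune these rules to your own cluster ordering and business definitions.&lt;/p&gt;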

&lt;h2&gt;
  
  
  Conclusion.
&lt;/h2&gt;

&lt;p&gt;In this article, we learned how to determine the ideal number of clusters and how to segment our customers based on the widely used RFM concept. If you would like to explore other models, data visualization, or data transformation, check out my GitHub repository, where I frequently work on data engineering projects and related topics.&lt;/p&gt;

&lt;p&gt;Thank you very much for reading.&lt;/p&gt;

&lt;p&gt;Repository: &lt;a href="https://github.com/pedrohgoncalvess/k-means-clustering" rel="noopener noreferrer"&gt;https://github.com/pedrohgoncalvess/k-means-clustering&lt;/a&gt;&lt;/p&gt;

</description>
      <category>beginners</category>
      <category>datascience</category>
      <category>machinelearning</category>
      <category>kmeans</category>
    </item>
    <item>
      <title>Data roles in data teams and your skill set. Using math</title>
      <dc:creator>Pedro H Goncalves</dc:creator>
      <pubDate>Mon, 12 Jun 2023 22:08:08 +0000</pubDate>
      <link>https://dev.to/pedrohgoncalves/data-roles-in-data-teams-and-your-skill-set-using-math-1332</link>
      <guid>https://dev.to/pedrohgoncalves/data-roles-in-data-teams-and-your-skill-set-using-math-1332</guid>
      <description>&lt;h1&gt;Data roles and skill set&lt;/h1&gt;

&lt;p&gt;
With GPT, some data positions have become quite famous, especially data scientist, but others deserve the same attention. In this article I will show what they are, what their main activities are, and the skill sets they exercise.
The article is open for discussion, and I'll try to keep it updated as my opinions change (which they certainly will).
&lt;/p&gt;

&lt;h2&gt;
About the "method" I chose.
&lt;/h2&gt;

&lt;p&gt;
I listed the most commonly used skills in the data field and consolidated them until each was shared by at least two positions.
It is important to point out that I did not focus only on hard skills; soft skills are just as important.
I assigned grades from 1 to 10 (based on my experience) according to how much each position uses the skill and the level of knowledge it requires.
&lt;/p&gt;




&lt;h2&gt;Skill explanations&lt;/h2&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Applied math.&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;
Applied mathematics is the area of mathematics that looks for ways to solve business problems, especially in finance, growth, logistics, and marketing. It is widely used (a mandatory skill for people who deal with data) in data-driven companies.
The grades increase as the position gets closer to business indicators such as revenue, operating costs, churn, retention, etc.
&lt;/p&gt;




&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Advanced math.&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;
First, by advanced mathematics focused on data I mean areas such as calculus, probability, advanced statistics, and linear regression, among others. It differs from applied mathematics in that its functions are not constantly reshaped by business rules, whereas applied math is essentially modeled by them.
&lt;/p&gt;




&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;SQL&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;
Although it is obvious that data roles require SQL knowledge, the required level differs between positions. SQL subsets account for most of this differentiation, since many positions do not necessarily need to know DCL, DDL, or DML commands.
The levels serve precisely to separate the subsets; following this line of reasoning, I created a rough progression of knowledge:
&lt;/p&gt;

&lt;p&gt;&lt;b&gt;DQL &amp;gt; DTL &amp;gt; DQL ADVANCED &amp;gt; DML &amp;gt; DDL &amp;gt; DCL&lt;/b&gt;&lt;br&gt;
&lt;/p&gt;


&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Programming&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;
Programming is very present in data roles, whether in languages such as Python, Scala, or PL/SQL for building routines and pipelines, or M and DAX for manipulating data in specific software.
The grades increase with how much code the job requires you to write thoughtfully, along with the languages and frameworks you will use day to day.
&lt;/p&gt;




&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Business&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;
Under business I took the liberty of adding some soft skills, like communication and general market knowledge, to the (imaginary) equation.
Knowledge of the business your company operates in is basic for any position, whether development, commercial, or marketing. But how much business understanding do you need to perform minimally in each role? I didn't use any mathematical formula or quantitative driver to decide this, just my experience and knowledge of BI functions.
&lt;/p&gt;




&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Infra&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;
The infra skill refers to knowledge of deployment, monitoring pipelines, and identifying bottlenecks and needs. It centers on software and frameworks such as Docker, Terraform, cloud architecture, and metadata, and closely resembles the DevOps practice set.
In my experience it tends to be an all-or-nothing skill: positions either barely touch it or depend on it heavily.
&lt;/p&gt;




&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;No/low code softwares&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;
Some positions require you to use no/low-code software for developing dashboards, pipelines, documentation, deployment, etc. The levels reflect how much knowledge you need of these tools and how much you will use them in your day-to-day work.
&lt;/p&gt;


&lt;/blockquote&gt;




&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Weighting&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;
&lt;b&gt;A skill weighs more the harder its basics are to learn.&lt;/b&gt;

Weighting is an attempt to express how complex a position is, given the activities it performs among those listed in the method. The formula is:

((Applied math * 1) + (Advanced math * 4) + (SQL * 2) + (Programming * 3) + (Business * 1) + (Infra * 3) + (No/low code * 1)) / 15

The weights were based on my experience learning things related to each skill and on informal surveys of some developers and data professionals.

The higher the grade, the greater the difficulty. &lt;strong&gt;DON'T&lt;/strong&gt; take it as gospel.
&lt;/p&gt;
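&lt;p&gt;The weighting above can be sketched in code. The grades below are made-up placeholders, not the article's actual scores:&lt;/p&gt;

```python
# Weights from the method: harder-to-learn skills weigh more
WEIGHTS = {
    'applied_math': 1, 'advanced_math': 4, 'sql': 2,
    'programming': 3, 'business': 1, 'infra': 3, 'no_low_code': 1,
}

def complexity(grades):
    # Weighted average of the 1-10 grades; the weights sum to 15
    total_weight = sum(WEIGHTS.values())
    return sum(grades[skill] * w for skill, w in WEIGHTS.items()) / total_weight

# Illustrative grades for a hypothetical engineering-leaning position
example = {
    'applied_math': 5, 'advanced_math': 2, 'sql': 10,
    'programming': 10, 'business': 4, 'infra': 9, 'no_low_code': 3,
}
print(round(complexity(example), 2))  # prints 6.47
```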





&lt;h2&gt;Positions&lt;/h2&gt;
&lt;h3&gt;Data analytics.&lt;/h3&gt;
&lt;p&gt;
The data analyst works in the opposite direction from the data scientist, focusing on explaining how and why things happened rather than hypothesizing about what might happen.
&lt;/p&gt;

&lt;p&gt;
This position usually participates more actively in meetings and creates dashboards that the company's operational areas can consume, or that help in decision making, either by making analyses more accurate or by delivering data in the most digestible form possible through visualization tools.
I gave good scores to no/low-code software because tools like Power BI, Tableau, Excel, and Pentaho are in high demand in job postings and take up a good share of working hours. Applied mathematics and business follow the same path: generating insights from data and extracting information that creates value for the business requires knowledge of both.
&lt;/p&gt;

&lt;p&gt;
For SQL and advanced math you don't need in-depth knowledge; you often won't need to write complex queries or run a linear regression.
&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;h4&gt;Data analytics skills set points.&lt;/h4&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F25grqrmo1uz4yec5pcu8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F25grqrmo1uz4yec5pcu8.png" alt="data-analytics" width="611" height="596"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;Data engineer.&lt;/h3&gt;

&lt;p&gt;
Quoting a good article by Maxime Beauchemin, former data engineer at Facebook (now Meta): data engineers are much closer to their older brother, the software engineer, than to their younger brother, the data scientist. That's why, instead of creating machine learning models, deep learning, or data visualizations, a data engineer first builds the tooling needed to abstract away as much as possible of the technical work of modeling, mining, and manipulating data: creating pipelines, data lakes, and data warehouses.
&lt;/p&gt;

&lt;p&gt;
Data engineers are responsible for database modeling, data mining, and data manipulation, which explains the 10 I gave in SQL, along with a 10 in programming for things like Python routines with Airflow or pipelines in Scala to handle large volumes of data.
In the ideal day to day of a data engineer, no reports or visualizations are created for commercial use, nor machine learning or deep learning models, hence the low grades in applied and advanced mathematics. The high score in infra comes from data engineers being largely responsible for deployment environments, so knowledge of Docker, Terraform, cloud, etc. is in great demand in job postings and occupies a good share of working hours.
&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;h4&gt;Data engineers skills set points.&lt;/h4&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqldddo3p2jqaezebm32m.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqldddo3p2jqaezebm32m.png" alt="data-eng" width="606" height="596"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;Data ops.&lt;/h3&gt;

&lt;p&gt;
Yes, I know that DataOps is not a person occupying a role but a culture, like DevOps. Still, the culture has materialized into a person leading its fronts, as happened with DevOps; in fact, other positions were born as derivatives of DataOps, such as MLOps, which is basically DataOps focused only on machine learning. So I think it's fair to personify the culture in this text.&lt;/p&gt;
&lt;p&gt;
A DataOps professional mainly monitors pipelines and studies business needs, new possibilities, and improvements. They ensure data security, reliability, and quality, but also participate in budget meetings and architecture definitions, taking that responsibility off data engineers; hence the full marks in infra and in business. The high score in programming comes from configuring environments and knowing a wide range of software aimed at deployment and the software life cycle. Even though it may seem counterintuitive to ensure data quality without high marks in SQL, I don't think advanced SQL knowledge is needed for this. The grade in advanced mathematics is self-explanatory, and the reasonable grade in applied mathematics comes from participating in financial matters tied to the operational cost of keeping applications available at a satisfactory level of use.
&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;h4&gt;Data ops skills set points.&lt;/h4&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F63tfwrrz6851g6wzd1fw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F63tfwrrz6851g6wzd1fw.png" alt="data-ops" width="603" height="591"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;Data scientist.&lt;/h3&gt;

&lt;p&gt;
Data scientist is probably the most famous data position of recent times. Its reputation is matched by the difficulty of becoming one, because the activities a data scientist performs are extremely complex, even with tools abstracting away much of the hardest work, such as complex calculations and collecting and cleaning data.
&lt;/p&gt;

&lt;p&gt;
The high marks in both math skills are self-explanatory: you need to know both very well to develop AI models that make sense for your business, which ties directly into business knowledge, also needed but not at a very high level. In an ideal world, the data scientist is abstracted away from infra work and even from data collection and cleaning, so minimal knowledge there is enough. With the popularization of AI, several no/low-code tools are emerging that abstract most of the complex tasks in a data scientist's day to day, but programming knowledge is still necessary, mainly in languages such as R and Python and their reference frameworks, hence the medium score in programming.
&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;h4&gt;Data scientist skills set points.&lt;/h4&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fanphxlc0qgu4ecckrakd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fanphxlc0qgu4ecckrakd.png" alt="data-sci" width="606" height="593"&gt;&lt;/a&gt;&lt;/p&gt;


&lt;br&gt;


&lt;h3&gt; Some considerations &lt;/h3&gt;

&lt;p&gt;
Perhaps you have heard of other roles such as ML engineer or BI specialist. These roles exist, but they depend on the business model in which they appear and are usually ramifications of the four described above.
&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Repository with csv and code for images: &lt;a href="https://github.com/pedrohgoncalvess/dataroles-skill-set" rel="noopener noreferrer"&gt;https://github.com/pedrohgoncalvess/dataroles-skill-set&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Thank you very much for reading&lt;/em&gt;&lt;/p&gt;

</description>
      <category>datascience</category>
      <category>beginners</category>
      <category>management</category>
      <category>data</category>
    </item>
    <item>
      <title>A little about the Scala Language</title>
      <dc:creator>Pedro H Goncalves</dc:creator>
      <pubDate>Sat, 10 Jun 2023 18:03:20 +0000</pubDate>
      <link>https://dev.to/pedrohgoncalves/a-little-about-the-scala-language-5cm0</link>
      <guid>https://dev.to/pedrohgoncalves/a-little-about-the-scala-language-5cm0</guid>
      <description>&lt;p&gt;&lt;strong&gt;What is Scala?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Scala is a compiled multiparadigm language that runs on the JVM. It was inspired by Java, Erlang, and Haskell. It has a static type system with first-class type inference, making it one of the "modern" languages with a sophisticated type system. By saying that Scala is multiparadigm, we emphasize that the language embraces different programming styles. Unlike other languages that tend to favor a specific paradigm, Scala gives developers the freedom to choose the programming style that best suits the problem at hand. Additionally, the language is committed to code conciseness and expressiveness. Surprisingly, Scala manages to deliver all these features. In this article, I will explain how.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Functional Programming in Scala&lt;/strong&gt;&lt;br&gt;
Scala has many functional programming features, such as first-class functions. First-class functions allow functions to be treated as values, assignable to variables, returned by other functions, and passed as parameters. This provides a high level of abstraction and modularity, especially for data manipulation. Functional programming in a static type system can be challenging, but Scala offers features that facilitate this task. Additionally, Scala supports lambda expressions, allowing the creation of anonymous functions in a concise and expressive way. Immutability is encouraged through the val and var keywords for declaring immutable and mutable variables, respectively. Immutability is useful for asynchronous and concurrent programming, topics that we will address later.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Object-Oriented Programming and JVM Interoperability&lt;/strong&gt;&lt;br&gt;
As mentioned earlier, Scala runs on the JVM and has seamless integration with Java as one of its main philosophies. It is possible to use Java libraries in Scala and vice versa. Interoperability between the two languages is simpler when it comes to basic data structures but can become more complex when dealing with more advanced data structures due to subtle differences in Java and Scala syntax. Scala incorporates important concepts of object-oriented programming, such as class and object creation, encapsulation, and useful abstractions like abstract, case, and final class, which simplify the declaration of operations. Inheritance and polymorphism are also supported by the language, allowing for more flexible manipulation of types according to the application context.&lt;/p&gt;

&lt;p&gt;All these data manipulation and new type creation features allow developers, even with a static type system, to have the freedom to define the desired behavior for their solution, avoiding type errors and facilitating the creation of comprehensive tests. Additionally, first-class type inference reduces the need for explicit type declarations, giving the language an appearance of dynamic typing while maintaining the safety of the static type system.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Scalability&lt;/strong&gt;&lt;br&gt;
Building upon what was mentioned earlier, the concept of immutability present in the language enables concurrent programming using multiple threads, resulting in better performance and the construction of more robust and efficient solutions. Alongside immutability and concurrent programming, Scala offers support for asynchronous tasks that execute work differently but leverage many of the benefits of concurrent programming. Furthermore, Scala is known as the foundation of two famous frameworks: Spark and Akka.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;&lt;strong&gt;Spark&lt;/strong&gt;&lt;/em&gt;&lt;br&gt;
Spark is widely used for distributed processing and data analysis. It has APIs in multiple languages such as Python, R, and Java, and is widely adopted in both academic and commercial environments. Its MLlib library is dedicated to machine learning.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;&lt;strong&gt;Akka&lt;/strong&gt;&lt;/em&gt;&lt;br&gt;
Akka is a library used for creating distributed and concurrent systems. It implements the actor model, an abstraction for concurrent programming based on message passing. The Akka ecosystem includes Akka HTTP for web application development, Akka Streams for stream processing, and Akka Cluster for handling actor clustering in distributed environments.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;&lt;strong&gt;Web Development&lt;/strong&gt;&lt;/em&gt;&lt;br&gt;
For web development in Scala, there are two popular options. The first one is the Play Framework, which follows the MVC (Model-View-Controller) pattern and is similar to other popular frameworks such as Rails, Django, and Laravel. The Play Framework provides development abstractions that make web application creation quite enjoyable. The second option is Akka HTTP, which offers greater scalability but requires knowledge of actors and asynchronous message passing. Although it may require a bit more experience, Akka HTTP is powerful and flexible. Additionally, there is Slick, which is a commonly used ORM/FRM in Scala web development. Slick provides efficient abstractions over JDBC, including support for result streaming. While Slick doesn't have built-in object migrations, it can be combined with tools like Flyway, which integrates well with Slick and the Play Framework.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Weaknesses&lt;/strong&gt;&lt;br&gt;
Initially, Scala was developed with an academic focus, and this emphasis can be observed in some parts of the documentation, where complex concepts are presented without proper prior explanation. An example of this is the creation of the Cats library, which provides abstractions of the functional paradigm to fill some gaps in the official documentation. This approach ended up alienating some people who preferred to start their studies in another language. Additionally, the Scala community has faced some issues that negatively affected the perception of the language. Controversies involving creators and maintainers of famous libraries are an example of this. We can also attribute some blame to the creation of Scala 3, which became a "distraction" since Scala 2 still receives constant updates and many people are not focused or at least interested in making a (often laborious) migration from Scala 2 to Scala 3. The already "small" community around the language, which has experienced reductions, has become "elitist," where only big tech companies (LinkedIn, Netflix, Twitter, Airbnb, Spotify) use its technologies since skilled professionals and experienced architects of distributed systems (its main application) have become increasingly expensive. As a consequence of the "elitization" of the language, job opportunities have been consistently scarce, which doesn't usually attract new learners. All these events have led to the loss of popularity of the language, but it will undoubtedly continue to be widely used in commercial settings.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Conclusion&lt;/strong&gt;&lt;br&gt;
Despite these weaknesses, Scala remains a powerful and widely used language in many environments. Its combination of object-oriented and functional programming, interoperability with Java, and scalability features make it a solid choice for a variety of applications.&lt;/p&gt;

&lt;p&gt;After discussing Scala and its (dis)success to some extent, I encourage everyone to write a few lines in the language. Despite criticism of the documentation, there is a significant amount of introductory content available on the internet, and I recommend the Rock The JVM channel for those interested in the language.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Thank you very much for reading.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>scala</category>
      <category>distributedsystems</category>
      <category>beginners</category>
      <category>scalability</category>
    </item>
  </channel>
</rss>
