<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Jakub Stanisławczyk</title>
    <description>The latest articles on DEV Community by Jakub Stanisławczyk (@jakubstanislawczyk).</description>
    <link>https://dev.to/jakubstanislawczyk</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3192967%2F6a46d8d8-9130-477f-a2ee-14a0eec93a95.jpg</url>
      <title>DEV Community: Jakub Stanisławczyk</title>
      <link>https://dev.to/jakubstanislawczyk</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/jakubstanislawczyk"/>
    <language>en</language>
    <item>
      <title>DynamoDB Professional - part 2 - sparse index</title>
      <dc:creator>Jakub Stanisławczyk</dc:creator>
      <pubDate>Tue, 26 Aug 2025 09:25:29 +0000</pubDate>
      <link>https://dev.to/jakubstanislawczyk/dynamodb-professional-part-2-sparse-index-536i</link>
      <guid>https://dev.to/jakubstanislawczyk/dynamodb-professional-part-2-sparse-index-536i</guid>
      <description>&lt;p&gt;In the previous part, I talked about indexing mechanisms in DynamoDB. I mentioned that there are &lt;strong&gt;two types&lt;/strong&gt; of indexes: &lt;strong&gt;Local Secondary Index&lt;/strong&gt; (LSI) and &lt;strong&gt;Global Secondary Index&lt;/strong&gt; (GSI). I lied to you a little, because there are actually &lt;strong&gt;three types&lt;/strong&gt;. But wait, how is that possible? Is the AWS documentation wrong?&lt;/p&gt;

&lt;h2&gt;
  
  
  How does a GSI work?
&lt;/h2&gt;

&lt;p&gt;Before I answer that question, let's first go back to the basics. Let's recall how a GSI works. I covered its mechanism in the first part of this series. In short, this index operates under the hood like a separate table. Data is copied into it only when needed. And it is precisely this mechanism that forms the foundation of what can be called a &lt;strong&gt;Sparse Index&lt;/strong&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Sparse Index
&lt;/h2&gt;

&lt;p&gt;A &lt;strong&gt;sparse index&lt;/strong&gt; is a type of database index where we don't create an entry for every single record in the table - only for the ones we care about. In DynamoDB, this isn't something that's directly supported. But, using the mechanisms we've already talked about, we can simulate it pretty easily.&lt;/p&gt;

&lt;h3&gt;
  
  
  Standard GSI Index example
&lt;/h3&gt;

&lt;p&gt;Let's take an example. Imagine we're building an app that processes events from temperature sensors. Say each sensor emits an event every second. Each event has an &lt;code&gt;ID&lt;/code&gt;, a &lt;code&gt;sensor ID&lt;/code&gt;, a &lt;code&gt;createdAt&lt;/code&gt;, a &lt;code&gt;value&lt;/code&gt;, and a &lt;code&gt;status&lt;/code&gt;. The status is just a simple enum: &lt;code&gt;OK&lt;/code&gt; if the value is within the expected range, or &lt;code&gt;ALARM&lt;/code&gt; if it's outside.&lt;/p&gt;

&lt;p&gt;Now, suppose we want to add a section to our dashboard that shows only the alarms for a given sensor. It's easy, we create a new GSI called &lt;code&gt;SENSOR_STATUS_GSI&lt;/code&gt;, where the partition key is &lt;code&gt;{sensorId}#{status}&lt;/code&gt; and the sort key is &lt;code&gt;{createdAt}&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fp1fgjqqjguud8a0gjxzb.PNG" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fp1fgjqqjguud8a0gjxzb.PNG" alt="Sensor Status GSI console" width="759" height="185"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This is how it looks in the code. We create a separate &lt;code&gt;SENSOR_STATUS_GSI_PK&lt;/code&gt; property.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fib2jadzmh3i1h6dmkndo.PNG" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fib2jadzmh3i1h6dmkndo.PNG" alt="Sensor event create" width="486" height="208"&gt;&lt;/a&gt;&lt;/p&gt;
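&lt;p&gt;If the screenshot is hard to read, here is a minimal sketch of that write in plain JavaScript. The helper name and the exact item shape are my assumptions, based on the key design described above:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;// Builds the SENSOR_STATUS_GSI partition key: "{sensorId}#{status}"
function buildSensorStatusPk(sensorId, status) {
  return sensorId + "#" + status;
}

const event = {
  id: "evt-1",
  sensorId: "sensor-1",
  createdAt: "2025-08-26T09:00:00Z",
  value: 72.4,
  status: "ALARM",
  // Written on every event, OK and ALARM alike
  SENSOR_STATUS_GSI_PK: buildSensorStatusPk("sensor-1", "ALARM"),
};
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;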

&lt;p&gt;With this setup, querying events by status becomes super straightforward. When we scan the index, we can see all the events. In this example, I created 10 events for 2 sensors (5 events per sensor).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkf6c01t9gx7hqvj80xv7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkf6c01t9gx7hqvj80xv7.png" alt="Sensor status GSI scan" width="800" height="477"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;When we want to fetch all the alarms for a sensor, we only need to provide the expected GSI PK.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2gsopho35rfi5pb19mf7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2gsopho35rfi5pb19mf7.png" alt="Query index" width="800" height="295"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;But this isn't really the best solution. After all, we only ever need to query the events that are alarms, yet the index stores every event. In a system that might generate millions of events, this would quickly increase the cost of our DynamoDB table.&lt;/p&gt;

&lt;p&gt;Instead, we can tweak our GSI a little bit.&lt;/p&gt;

&lt;h3&gt;
  
  
  Sparse Index example
&lt;/h3&gt;

&lt;p&gt;Let's create a new &lt;code&gt;ALARM_GSI&lt;/code&gt; index. This one uses &lt;code&gt;{sensorId}&lt;/code&gt; as PK and &lt;code&gt;{createdAt}&lt;/code&gt; as SK.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fh9gnug8j7utqo70645pp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fh9gnug8j7utqo70645pp.png" alt="Console GSI alarm" width="767" height="244"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I also updated the code with the new GSI PK.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9yeerndf4uttj698z17j.PNG" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9yeerndf4uttj698z17j.PNG" alt="Alarm save" width="482" height="236"&gt;&lt;/a&gt;&lt;/p&gt;
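&lt;p&gt;The core of the trick is that the GSI key attributes are written only for alarms, so events with an &lt;code&gt;OK&lt;/code&gt; status never enter the index at all. A minimal sketch - attribute names like &lt;code&gt;ALARM_GSI_PK&lt;/code&gt; are assumptions for illustration:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;function toSensorEventItem(event) {
  const item = {
    id: event.id,
    sensorId: event.sensorId,
    createdAt: event.createdAt,
    value: event.value,
    status: event.status,
  };

  if (event.status === "ALARM") {
    // Only alarms get the index key attributes.
    // Items without them are simply absent from ALARM_GSI -
    // this is what makes the index sparse.
    item.ALARM_GSI_PK = event.sensorId;
    item.ALARM_GSI_SK = event.createdAt;
  }

  return item;
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;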

&lt;p&gt;When we query our index by sensor ID, we'll only get the corresponding alarms.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8tr8nzqzlgdkgada9exl.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8tr8nzqzlgdkgada9exl.png" alt="Alarm GSI alarms" width="800" height="170"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The magic really starts to happen when we scan our index.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fa7hc19zd7in8x00gkrtl.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fa7hc19zd7in8x00gkrtl.png" alt="Alarm GSI scan" width="800" height="267"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The difference is immediate. The index now stores only the alarms - no events with an &lt;code&gt;OK&lt;/code&gt; status in sight. The index already contains only the data we want.&lt;/p&gt;

&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;p&gt;A Sparse Index in DynamoDB is basically a clever hack. Even though DynamoDB doesn't give us native support for this feature, with a bit of creativity we can simulate it using GSIs.&lt;/p&gt;

&lt;p&gt;This solution is very useful for large data sets that need to be filtered by a binary condition. For example:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A job queue where we need to search for jobs with a status of &lt;code&gt;PENDING&lt;/code&gt; to process&lt;/li&gt;
&lt;li&gt;Searching only for available products&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The downside of this solution is that future access patterns must be anticipated. If the business also wants to display recent events with a status of &lt;code&gt;OK&lt;/code&gt;, our sparse index won't be able to handle it. This will require either adding another index or modifying the current index and migrating the data.&lt;/p&gt;

&lt;p&gt;Sparse indexes are a neat DynamoDB trick. They're not always the right fit, but in scenarios where you only care about a subset of your data, they can save you money and speed up your queries.&lt;/p&gt;




&lt;p&gt;Links&lt;/p&gt;

&lt;p&gt;👋 &lt;a href="//www.linkedin.com/in/jakub-stanis%C5%82awczyk-33128b142"&gt;LinkedIn&lt;/a&gt;&lt;br&gt;
💻 &lt;a href="https://github.com/jstanislawczyk/dynamodb-professional/tree/master/sparse-index" rel="noopener noreferrer"&gt;Github&lt;/a&gt;&lt;/p&gt;

</description>
      <category>aws</category>
      <category>node</category>
      <category>programming</category>
      <category>database</category>
    </item>
    <item>
      <title>DynamoDB Professional - part 1 - indexing</title>
      <dc:creator>Jakub Stanisławczyk</dc:creator>
      <pubDate>Mon, 07 Jul 2025 13:24:43 +0000</pubDate>
      <link>https://dev.to/jakubstanislawczyk/dynamodb-professional-part-1-indexing-11gb</link>
      <guid>https://dev.to/jakubstanislawczyk/dynamodb-professional-part-1-indexing-11gb</guid>
      <description>&lt;p&gt;Amazon DynamoDB is a fully managed, NoSQL database designed for high performance and scalability. With the ability to handle massive numbers of requests per second and automatic scaling, DynamoDB is widely used in systems that require reliability and low latency - from mobile applications to IoT systems and microservices.&lt;/p&gt;

&lt;p&gt;However, despite its many advantages, DynamoDB has one key limitation - it doesn't forgive mistakes made during the design of your solutions. Unlike relational databases or document databases like MongoDB, here the structure of data and queries must be carefully planned. DynamoDB requires not only knowledge of its indexing mechanisms, but more importantly, a deliberate approach to data modeling and an understanding of the application query patterns. This is crucial, as some decisions cannot be easily reversed without creating a new table and migrating data.&lt;/p&gt;

&lt;p&gt;In this article, I will explain how the DynamoDB indexing mechanism works and show you how to plan your data structure to fully harness the potential of this powerful yet demanding technology.&lt;/p&gt;

&lt;h2&gt;
  
  
  A bit of theory
&lt;/h2&gt;

&lt;p&gt;Before we dive into the internal mechanisms of DynamoDB, we need to start with the basics. It is the understanding of general database techniques that will help us achieve maximum performance. One of the fundamental topics in any distributed database is data consistency. In the case of DynamoDB, there are two available read modes: &lt;strong&gt;Eventually Consistent Read&lt;/strong&gt; and &lt;strong&gt;Strongly Consistent Read&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;What's the difference between them? With Strong Consistency, you are always guaranteed to receive the most up-to-date data. On the other hand, if you choose Eventual Consistency, there's a chance that the returned data may be outdated. The value will eventually converge to the expected state, but this happens with some delay.&lt;/p&gt;

&lt;p&gt;The read mode is selected at the query level. Let's see how this works with an example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;export const getUserById = async (id, consistentRead = true) =&amp;gt; {
    const ddbClient = initDDBClient();
    const command = new GetCommand({
        TableName: usersTable,
        Key: {
            id,
        },
        ConsistentRead: consistentRead, // Set true/false
    });

    try {
        const response = await ddbClient.send(command);
        return response.Item;
    } catch (error) {
        console.error("Error getting item:", error);
        throw error;
    }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now we can use this function to fetch our data.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;const user = {
    id: crypto.randomUUID(),
    name: "John Doe",
    email: "jdoe@examplemail.com",
}

await saveUser(user);

const [eventualUser, strongUser] = await Promise.all([
    // Read with eventual consistency.
    getUserById(user.id, false),
    // Read with strong consistency
    getUserById(user.id, true)
]);

console.log("Read with eventual consistency:", eventualUser);
console.log("Read with strong consistency:", strongUser);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;At this point, &lt;code&gt;Read with strong consistency&lt;/code&gt; will always return correct data. &lt;code&gt;Read with eventual consistency&lt;/code&gt; may return an empty result.&lt;/p&gt;

&lt;p&gt;Now you might be thinking, &lt;em&gt;'So the solution to all problems is to always choose Strong Consistency by default?'&lt;/em&gt;. Not quite. Unfortunately, such reads consume &lt;strong&gt;twice as many Read Capacity Units&lt;/strong&gt; (RCUs). What does that mean? If you're using the On-Demand mode, it means you'll pay twice as much for your queries. If you're using the Provisioned mode, you'll reach the RCU limit of your table much more easily.&lt;/p&gt;
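&lt;p&gt;To make the cost difference concrete, here is a back-of-envelope sketch. It assumes DynamoDB's documented unit sizes: one RCU covers a strongly consistent read of an item up to 4 KB per second, and an eventually consistent read costs half a unit:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;// One RCU: a strongly consistent read of up to 4 KB per second.
// An eventually consistent read of the same item costs half.
function readCapacityUnits(itemSizeKb, stronglyConsistent) {
  const units = Math.ceil(itemSizeKb / 4);
  return stronglyConsistent ? units : units / 2;
}

// A 4 KB item: 1 RCU strongly consistent, 0.5 RCU eventually consistent.
// A 9 KB item: 3 RCUs strongly consistent, 1.5 RCUs eventually consistent.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;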

&lt;p&gt;So, if your application can tolerate slightly outdated data, it's worth considering &lt;strong&gt;Eventual Consistency to optimize costs&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Another important point is that &lt;strong&gt;not all queries can support Strong Consistency&lt;/strong&gt;. But more on that in a moment :)&lt;/p&gt;

&lt;h2&gt;
  
  
  What can a record consist of?
&lt;/h2&gt;

&lt;p&gt;First and foremost: the &lt;strong&gt;primary key&lt;/strong&gt;. This is what uniquely identifies each entity in our table. It can consist of two components:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Partition Key&lt;/strong&gt; (always required) - Its value determines which partition a given record will be placed into. This works through a hashing function that distributes entities across physical nodes. As a result, the load is evenly balanced and data access is faster. Each partition, in addition to its size limit (&lt;strong&gt;10 GB&lt;/strong&gt;), also has a maximum of &lt;strong&gt;3000 Read Capacity Units&lt;/strong&gt; and &lt;strong&gt;1000 Write Capacity Units&lt;/strong&gt;. If your table grows beyond that, DynamoDB automatically adds more partitions. However, if too much data is stored under a single partition key, &lt;strong&gt;it cannot be split&lt;/strong&gt;. That's why it's crucial for the partition key (PK) to be as granular as possible. Otherwise, we might end up with all records landing in a single partition - a situation known as a &lt;strong&gt;hot partition&lt;/strong&gt;, which significantly slows down queries. Let's visualize it. We have users, each identified by an ID. The ID is hashed using an internal algorithm, which determines which partition the item will be stored in.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fr17nqz4b6x9kkjb0pos8.PNG" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fr17nqz4b6x9kkjb0pos8.PNG" alt="Partitions" width="497" height="528"&gt;&lt;/a&gt;&lt;/p&gt;
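&lt;p&gt;The hashing step can be sketched with a toy function. DynamoDB's real hash function is internal and different - this only illustrates the idea that the partition key alone determines the target partition:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;// Toy illustration only: hash the partition key, then use the
// hash to pick one of the physical partitions. The same key always
// lands in the same partition.
function pickPartition(partitionKey, partitionCount) {
  let hash = 0;
  for (const ch of partitionKey) {
    hash = (hash * 31 + ch.charCodeAt(0)) % 1000000007;
  }
  return hash % partitionCount;
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;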

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Sort Key&lt;/strong&gt; (optional) - Adds powerful capabilities to our table:

&lt;ul&gt;
&lt;li&gt;Enables ordering of data within a single partition, so we can retrieve records in a specific sequence - ascending or descending.&lt;/li&gt;
&lt;li&gt;Allows us to associate multiple records with a single partition key. A good example would be documents with multiple versions.
&lt;strong&gt;Caution⚠️&lt;/strong&gt;: If we associate too many items with a single partition key, it can still result in a hot partition.&lt;/li&gt;
&lt;li&gt;It also enables more advanced querying capabilities. With only a partition key, we can retrieve data by exact match. But once we introduce a sort key, we can perform range queries, search by prefix, and many more.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5iac14cldrh3mzx4p39m.PNG" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5iac14cldrh3mzx4p39m.PNG" alt="Sort key query" width="193" height="254"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Let's see how it looks in practice. A common use case for sort keys is storing versioned data. For example, imagine a system that manages files. Each file has a unique identifier which we use as the partition key. Files can have multiple versions, so we use the version number as the sort key.&lt;/p&gt;

&lt;p&gt;This way, all versions of a given file are stored under the same partition, and the sort key allows us to query them efficiently - whether we want to fetch all versions or only a specific one.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fldxymd7vsdq0bw68dg2l.PNG" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fldxymd7vsdq0bw68dg2l.PNG" alt="Sort Key" width="599" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If you use only the partition key in your query, DynamoDB will return all items that share that partition key (e.g. all versions of a specific file).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Query document by PK = "abc123"
[
    { id: "abc123", version: 1, ...},
    { id: "abc123", version: 2, ...},
    { id: "abc123", version: 3, ...},
]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If you provide both the partition key and the sort key (PK + SK), DynamoDB will return at most one item - the one that matches both values. This is ideal when you want to fetch a specific version of a file.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Query document by PK = "abc123" and SK = "1"
[
    { id: "abc123", version: 1, ...},
]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
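&lt;p&gt;Both query shapes can be produced by a single input builder. This sketch uses the AWS SDK v3 (lib-dynamodb) input format; the table name and placeholder names are illustrative:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;// Builds a Query input: PK only returns all versions of the file,
// PK plus SK narrows the result to a single version.
function buildDocumentQuery(fileId, version) {
  const input = {
    TableName: "documents",
    KeyConditionExpression: "id = :id",
    ExpressionAttributeValues: { ":id": fileId },
  };

  if (version !== undefined) {
    input.KeyConditionExpression = "id = :id AND version = :version";
    input.ExpressionAttributeValues[":version"] = version;
  }

  return input;
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;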



&lt;h2&gt;
  
  
  Types of Operations
&lt;/h2&gt;

&lt;p&gt;Now that we understand what our data might look like, let’s take a look at the different ways we can retrieve it:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;GetItem&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;BatchGetItem&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Query&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Scan&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;ParallelScan&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;TransactGetItems&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Whoa - that's quite a few options just for reading data! So, which one is the best?&lt;br&gt;
Well, it depends on what you're trying to retrieve and how your table is structured. Let's briefly go over each one:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;GetItem&lt;/strong&gt; - The most basic read operation. It retrieves a single item based on the partition key. &lt;strong&gt;Note&lt;/strong&gt;: If your table has a sort key defined, you must provide it as well.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;BatchGetItem&lt;/strong&gt; - Retrieves multiple items in a single batch (up to 100 items or 16MB per request). This is much more efficient than fetching them one at a time.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Query&lt;/strong&gt; - Offers more advanced search capabilities (e.g. sorting, filtering, range queries). It can also be used to query secondary indexes, which I'll cover shortly.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Scan&lt;/strong&gt; - Reads every item in the table. Because of its brute-force nature, this operation is slow and expensive. You can apply filters to reduce the number of returned results - but keep in mind that RCU (Read Capacity Unit) usage remains the same, because filtering happens after the read. It only reduces the size of the returned payload.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;ParallelScan&lt;/strong&gt; - Works like Scan, but splits the work across multiple parallel workers. It's faster, but can quickly consume your table's available RCU.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;TransactGetItems&lt;/strong&gt; - Reads up to 100 items across one or more tables, with all-or-nothing guarantees. If one item can't be retrieved, none will be returned.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;As you can see, the most efficient and preferred options for retrieving data are &lt;strong&gt;GetItem&lt;/strong&gt;, &lt;strong&gt;BatchGetItem&lt;/strong&gt;, and &lt;strong&gt;Query&lt;/strong&gt;.&lt;br&gt;
Scan operations should be your last resort - only use them when none of the other methods fit your use case.&lt;/p&gt;
&lt;h2&gt;
  
  
  Indexes
&lt;/h2&gt;

&lt;p&gt;A primary key alone often isn't enough - we frequently need more flexibility when querying our data than just using an ID.&lt;br&gt;
That's where &lt;strong&gt;indexes&lt;/strong&gt; come in. Just like in other database engines, indexes in DynamoDB are used to &lt;strong&gt;optimize and speed up queries&lt;/strong&gt; on your tables.&lt;br&gt;
DynamoDB supports two types of indexes. We can visualize them as follows:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fggfdtqzb7zk08nqj8vus.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fggfdtqzb7zk08nqj8vus.png" alt="Indexes" width="337" height="239"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  Local Secondary Index (LSI)
&lt;/h3&gt;

&lt;p&gt;An LSI provides an &lt;strong&gt;alternative sort key&lt;/strong&gt; while still using the same partition key. Since it's stored within the same partition as the base table, it supports both eventual and strongly consistent reads. However, LSIs come with several limitations that are important to consider during the design phase:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You can define up to &lt;strong&gt;5 LSIs&lt;/strong&gt; per table, which can limit your query flexibility.&lt;/li&gt;
&lt;li&gt;LSIs share the table's provisioned throughput and storage, so heavy usage of the index may impact performance of other operations.&lt;/li&gt;
&lt;li&gt;They must be created at the time the table is created - &lt;strong&gt;you can't add LSIs to an existing table&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;You'll incur additional costs for storing data in the index.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These constraints mean that you need a clear understanding of your query patterns before designing your schema.&lt;/p&gt;

&lt;p&gt;Earlier, we discussed a document metadata table that stores versioned file metadata, where the partition key is the file ID and the sort key is the version number. While this model is effective for accessing different versions of a file, you might encounter situations where you want to query versions of a file by their name - for example, to find all versions with a specific name pattern or to retrieve a file's rename history.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;This is where a Local Secondary Index (LSI) becomes useful&lt;/strong&gt;. By defining an LSI on the &lt;code&gt;name&lt;/code&gt; attribute, you can query the same set of items (same partition key) but sort and filter them based on their name instead of version number.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fiyos4ml5v7ecvf11ov7i.PNG" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fiyos4ml5v7ecvf11ov7i.PNG" alt="LSI" width="617" height="396"&gt;&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Query LSI "name" index by PK = "abc123" sorted in reverse order
[
    { id: "abc123", version: 2, name: "myfile.txt", ... },
    { id: "abc123", version: 3, name: "final.txt", ... },
    { id: "abc123", version: 1, name: "draft.txt", ... },
]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Global Secondary Index (GSI)
&lt;/h3&gt;

&lt;p&gt;A GSI allows you to define a completely different partition key and sort key, giving you far greater flexibility in querying your data. DynamoDB supports up to &lt;strong&gt;20 GSIs&lt;/strong&gt; per table, making them a powerful tool for advanced query requirements.&lt;/p&gt;

&lt;p&gt;Since GSIs are stored in a separate partition space, they:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Have independent read/write capacity limits&lt;/li&gt;
&lt;li&gt;Can be added after the table is created&lt;/li&gt;
&lt;li&gt;Support &lt;strong&gt;only eventual consistency&lt;/strong&gt; for reads&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;As with LSIs, storing data in a GSI incurs additional storage and throughput costs. To control these costs, you can configure attribute projections, which determine what data is copied to the index:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;KEYS_ONLY&lt;/strong&gt; - Only the primary key and index keys are projected&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;INCLUDE&lt;/strong&gt; - Keys and selected non-key attributes&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;ALL&lt;/strong&gt; - All attributes from the base table&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Let's say we have a table that stores user data. Each user has a unique id, which we use as the primary partition key. However, we also want to be able to query all users who belong to a specific team - something that isn't possible with the base table's key structure alone.&lt;/p&gt;

&lt;p&gt;To support this access pattern, we can define a GSI with &lt;code&gt;teamId&lt;/code&gt; as the partition key. What's important to understand is that a GSI behaves like a separate table under the hood. It has its own partition key and (optionally) sort key, independent from the base table. In the case of our user table, this means that &lt;code&gt;teamId&lt;/code&gt; becomes the primary key of the index, allowing us to efficiently query all users assigned to a specific team.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcdd3g975e9ciotmz2d2i.PNG" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcdd3g975e9ciotmz2d2i.PNG" alt="GSI" width="800" height="321"&gt;&lt;/a&gt;&lt;/p&gt;
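&lt;p&gt;A sketch of the corresponding query input - the table and index names are assumptions. Note that &lt;code&gt;IndexName&lt;/code&gt; must be set explicitly, and since GSI reads only support eventual consistency, &lt;code&gt;ConsistentRead&lt;/code&gt; stays at its default of &lt;code&gt;false&lt;/code&gt;:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;// Query input for fetching all users of a team through the GSI.
function buildUsersByTeamQuery(teamId) {
  return {
    TableName: "users",
    IndexName: "TEAM_ID_GSI",
    KeyConditionExpression: "teamId = :teamId",
    ExpressionAttributeValues: { ":teamId": teamId },
  };
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;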

&lt;h2&gt;
  
  
  Examples
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Testing Eventually Consistent Reads
&lt;/h3&gt;

&lt;p&gt;A common challenge is properly testing code that uses Eventually Consistent Reads. What does a typical integration test verifying database reads look like?&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Insert test data&lt;/li&gt;
&lt;li&gt;Read the data&lt;/li&gt;
&lt;li&gt;Validate its correctness&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This ensures each test is independent and can be run in isolation. Unfortunately, as mentioned earlier, Eventually Consistent data can have a delay, which may result in inconsistency during read operations. This, in turn, can lead to failed assertions.&lt;/p&gt;

&lt;p&gt;Let's imagine our &lt;code&gt;TEAM_ID&lt;/code&gt; GSI in our &lt;code&gt;users&lt;/code&gt; table.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fr1oo2jhaftmunwqabydz.PNG" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fr1oo2jhaftmunwqabydz.PNG" alt="Test GSI" width="581" height="296"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The first idea might be to wait and poll periodically until the data is updated.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;...Other tests

it('should read users by team', async () =&amp;gt; {
    // Arrange
    await saveUser(user);

    // Act
    let attempts = 15;
    let usersByTeam = [];

    for (let i = 0; i &amp;lt; attempts; i++) {
        usersByTeam = await getUsersByTeam(teamId);

        if (usersByTeam.length &amp;gt; 0) {
            break;
        }

        await setTimeout(300); // Wait for eventual consistency
    }

    // Assert
    expect(usersByTeam.length).toBe(1);
});

...Other tests
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;However, this results in significantly &lt;strong&gt;longer test execution times&lt;/strong&gt;.&lt;br&gt;
A better solution might be to insert the data before all tests and run the eventual consistency tests at the very end. We may still need to poll until the data has become consistent, but part of that waiting time will be offset by the other tests executing in the meantime.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;beforeAll(async () =&amp;gt; {
    // Init GSI data before all tests
    await saveUser(user);
})

... Other tests

// GSI tests below
it('should read users by team', async () =&amp;gt; {
    // Act
    let attempts = 15;
    let usersByTeam = [];

    for (let i = 0; i &amp;lt; attempts; i++) {
        usersByTeam = await getUsersByTeam(teamId);

        if (usersByTeam.length &amp;gt; 0) {
            break;
        }

        await setTimeout(300); // Wait for eventual consistency
    }

    // Assert
    expect(usersByTeam.length).toBe(1);
});
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
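The polling loop above can also be extracted into a reusable helper, so each test stays focused on its assertion. Here's a minimal sketch (the helper name and defaults are illustrative, not part of the original code):

```typescript
import { setTimeout } from 'node:timers/promises';

// Polls `supplier` until `predicate` accepts its result or attempts run out.
// Returns the last result either way, so a failing assertion still reports
// what was actually read.
async function eventually<T>(
  supplier: () => Promise<T>,
  predicate: (value: T) => boolean,
  attempts = 15,
  delayMs = 300,
): Promise<T> {
  let result = await supplier();

  for (let i = 1; i < attempts && !predicate(result); i++) {
    await setTimeout(delayMs); // wait for eventual consistency
    result = await supplier();
  }

  return result;
}
```

A test then shrinks to something like `const usersByTeam = await eventually(() => getUsersByTeam(teamId), (users) => users.length > 0);` followed by the assertion.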



&lt;h3&gt;
  
  
  Prefixing
&lt;/h3&gt;

&lt;p&gt;Prefixing is a common technique used in DynamoDB to structure sort keys in a way that enables more flexible and efficient query patterns.&lt;/p&gt;

&lt;p&gt;Instead of using a raw value as a sort key (e.g. just a timestamp), you prefix the sort key with a constant label or category to create a composite value. This allows you to distinguish between different types of data stored under the same partition key and query them accordingly.&lt;/p&gt;

&lt;p&gt;Let's consider the following structure:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A company has multiple teams&lt;/li&gt;
&lt;li&gt;Each team has multiple users&lt;/li&gt;
&lt;li&gt;Each user can upload multiple documents&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We can store all of this in a single DynamoDB table using the following pattern:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;PK&lt;/strong&gt; - company ID&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;SK&lt;/strong&gt; - &lt;code&gt;TEAM#{teamId}#USER#{userId}#DOCUMENT#{documentId}&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Why does this work?&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You can query all teams of a company: &lt;code&gt;begins_with(SK, 'TEAM#')&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;You can query all users in a team: &lt;code&gt;begins_with(SK, 'TEAM#teamA#USER#')&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;You can query all documents for a user: &lt;code&gt;begins_with(SK, 'TEAM#teamA#USER#user1#DOCUMENT#')&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;However, it's important to note that parsing the composite sort key is our responsibility. Since DynamoDB stores the sort key as a simple string, it's up to the application logic to split and interpret its components (e.g. extracting &lt;code&gt;teamId&lt;/code&gt;, &lt;code&gt;userId&lt;/code&gt;, or &lt;code&gt;documentId&lt;/code&gt; from a key like &lt;code&gt;TEAM#teamA#USER#user1#DOCUMENT#doc1&lt;/code&gt;).&lt;/p&gt;
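That build/parse responsibility can be sketched as a small helper pair (a sketch; the function names and error handling are illustrative):

```typescript
interface DocumentKey {
  teamId: string;
  userId: string;
  documentId: string;
}

// Builds the composite sort key described above.
function buildSortKey({ teamId, userId, documentId }: DocumentKey): string {
  return `TEAM#${teamId}#USER#${userId}#DOCUMENT#${documentId}`;
}

// Parses it back. DynamoDB only sees a plain string, so validating the
// structure is entirely up to the application.
function parseSortKey(sk: string): DocumentKey {
  const [teamLabel, teamId, userLabel, userId, docLabel, documentId] = sk.split('#');

  if (teamLabel !== 'TEAM' || userLabel !== 'USER' || docLabel !== 'DOCUMENT') {
    throw new Error(`Unexpected sort key format: ${sk}`);
  }

  return { teamId, userId, documentId };
}
```

Note that this scheme breaks down if a raw ID can itself contain the `#` separator, so the separator choice belongs to the data model.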

&lt;p&gt;This design also introduces a potential risk of hot partitions, since all nested data (teams, users, documents) lives under a single partition key (the company ID). Let's see this in the next example.&lt;/p&gt;

&lt;h3&gt;
  
  
  Hot Partition
&lt;/h3&gt;

&lt;p&gt;We're building software for factories that collects sensor events. Since the system is multi-tenant, we decided to model multi-tenancy directly at the DynamoDB level. Specifically, the partition key is the company ID, and the sort key is a composite value in the format &lt;code&gt;{sensorId}#{eventId}&lt;/code&gt;. This design allows us to efficiently query for all events from a specific sensor or retrieve a specific event by ID.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgsvue4npm4sags279l6n.PNG" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgsvue4npm4sags279l6n.PNG" alt="Hot partition" width="453" height="248"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;However, we've encountered a scaling issue: one of our clients has thousands of active sensors, each sending an event once per second. Since all this data shares the same partition key (their company ID), DynamoDB directs all writes to a single partition, which quickly becomes a hot partition. This causes problems such as throttled writes due to exceeding partition throughput limits and poor horizontal scalability.&lt;/p&gt;

&lt;p&gt;This example highlights how seemingly clean and logical data models can run into physical limitations in high-throughput environments, especially when many high-frequency entities (like IoT sensors) share the same partition key. &lt;/p&gt;

&lt;p&gt;There are several techniques to mitigate this problem, but those are more advanced strategies that deserve a dedicated article :)&lt;/p&gt;
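For a quick taste, one common mitigation is write sharding: spreading one logical partition across several physical ones by appending a calculated suffix to the partition key. A minimal sketch (the shard count and hash function are illustrative; reads must then fan out over all shards and merge the results):

```typescript
const SHARD_COUNT = 10; // illustrative; tune to the required write throughput

// Deterministically maps a sensor to one of N shards of its company
// partition, e.g. 'company1#7', so writes spread across partitions.
function shardedPartitionKey(companyId: string, sensorId: string): string {
  let hash = 0;
  for (const char of sensorId) {
    hash = (hash * 31 + char.charCodeAt(0)) >>> 0; // simple deterministic hash
  }
  return `${companyId}#${hash % SHARD_COUNT}`;
}

// Queries must cover every shard of the company and merge the results.
function allShardKeys(companyId: string): string[] {
  return Array.from({ length: SHARD_COUNT }, (_, shard) => `${companyId}#${shard}`);
}
```

The trade-off is clear: writes scale out, but every read of "all events for a company" becomes `SHARD_COUNT` queries instead of one.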

&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;p&gt;As we've seen through the examples, DynamoDB offers immense power, but only when used with a clear understanding of how it works under the hood.&lt;/p&gt;

&lt;p&gt;One of the most critical aspects of working with DynamoDB is data modeling, and at the heart of that lies indexing. Choosing the right partition key, designing efficient sort keys, and leveraging Global or Local Secondary Indexes are the keys to an efficient database.&lt;/p&gt;

&lt;p&gt;DynamoDB can be incredibly fast, scalable, and reliable - but only if you know exactly what you're doing. That's why investing time upfront in understanding your access patterns and indexing strategy is not just helpful - it's mandatory.&lt;/p&gt;

&lt;p&gt;Getting indexing right is a foundational step - but truly mastering DynamoDB requires diving much deeper. We'll explore those advanced techniques in the next parts of this guide. Stay tuned 🎉.&lt;/p&gt;

&lt;h3&gt;
  
  
  Links
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;👋 &lt;a href="https://www.linkedin.com/in/jakub-stanis%C5%82awczyk-33128b142/" rel="noopener noreferrer"&gt;LinkedIn&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;💻 &lt;a href="https://github.com/jstanislawczyk/dynamodb-professional" rel="noopener noreferrer"&gt;Github&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>aws</category>
      <category>node</category>
      <category>programming</category>
      <category>database</category>
    </item>
    <item>
      <title>Become the Serverless DJ. How to process audio using AWS?</title>
      <dc:creator>Jakub Stanisławczyk</dc:creator>
      <pubDate>Mon, 02 Jun 2025 16:55:56 +0000</pubDate>
      <link>https://dev.to/jakubstanislawczyk/become-the-serverless-dj-how-to-process-audio-using-aws-11ng</link>
      <guid>https://dev.to/jakubstanislawczyk/become-the-serverless-dj-how-to-process-audio-using-aws-11ng</guid>
      <description>&lt;h2&gt;
  
  
  Intro
&lt;/h2&gt;

&lt;p&gt;As software developers, we work with different types of files. These are often formats like JSON, XML, or CSV. Data engineers, on the other hand, use more specialized tools such as Parquet. Beyond text files, we also process images - resizing them, adjusting colors, or altering shapes.&lt;/p&gt;

&lt;p&gt;However, there is one type of medium that seems to be somewhat overlooked, despite surrounding us everywhere. After all, who doesn't enjoy listening to music to relax or using it as background sound while working?&lt;/p&gt;

&lt;p&gt;But how do we work with audio files? How can we process them in the AWS cloud? What aspects should we consider to ensure our architecture is both scalable and cost-effective? I'll answer these and many other questions in this article.&lt;/p&gt;

&lt;p&gt;Full code can be found on my &lt;a href="https://github.com/jstanislawczyk/aws-audio-transformer" rel="noopener noreferrer"&gt;Github&lt;/a&gt;. I used &lt;strong&gt;Terraform&lt;/strong&gt; to describe all AWS resources. This will help you set up this project without having to configure it manually. I decided not to use any architectures or design patterns because I wanted to keep things as simple as possible. Feel free to extend it using layers, interfaces, hexagonal architecture etc.&lt;/p&gt;

&lt;h2&gt;
  
  
  Technologies used
&lt;/h2&gt;

&lt;p&gt;In the world of modern cloud solutions, we are constantly looking for ways to increase efficiency, reduce costs and eliminate unnecessary infrastructure. That is why choosing serverless technology seemed like a natural step. Thanks to the model in which I do not have to worry about managing servers, I can focus on what is most important - processing audio in a fast, scalable and cost-optimized way. The following services will help me with this:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Amazon S3&lt;/strong&gt; - The storage where we will save our files. It scales very well and offers very high durability (99.999999999%, i.e. eleven nines). Mechanisms such as &lt;strong&gt;Presigned URL&lt;/strong&gt; or &lt;strong&gt;S3 Events&lt;/strong&gt; will be an important part of our architecture.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqsryu93rr7e9jhhk9rii.PNG" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqsryu93rr7e9jhhk9rii.PNG" alt="S3 icon" width="144" height="140"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;AWS Lambda&lt;/strong&gt; - This will be our main working tool that allows us to run our Node.js code. The main advantage is that we only pay for the time it takes to run, making it ideal for reacting to events. Cons? Maximum 15 minutes of runtime and 10GB of RAM.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fafof6xd8xn0pp9hdg94m.PNG" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fafof6xd8xn0pp9hdg94m.PNG" alt="Lambda icon" width="143" height="144"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;AWS API Gateway&lt;/strong&gt; - This service allows us to expose our Lambdas as REST API endpoints.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5hxrzraupgjlndy8q4gf.PNG" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5hxrzraupgjlndy8q4gf.PNG" alt="API GW icon" width="143" height="142"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;DynamoDB&lt;/strong&gt; - A NoSQL database that provides very good performance and scalability. To be able to talk about its capabilities would require a separate (and not so short) article. TLDR: DDB is for you if you need an efficient and scalable database, and at the same time you know exactly what the query patterns will be.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fje8zvxrocyvvncrip5y9.PNG" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fje8zvxrocyvvncrip5y9.PNG" alt="DynamoDB icon" width="142" height="145"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Amazon SQS&lt;/strong&gt; - A simple AWS queue that allows us to send events in two modes:

&lt;ul&gt;
&lt;li&gt;Standard - where we have virtually unlimited scaling, but duplicate messages are possible.&lt;/li&gt;
&lt;li&gt;FIFO - where duplicates are automatically removed, but with a limit of 300 messages per second (3,000 with batching).&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2rja2nqlrgqpcs4s0rab.PNG" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2rja2nqlrgqpcs4s0rab.PNG" alt="SQS icon" width="143" height="143"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;FFmpeg&lt;/strong&gt; - This tool allows us to customize and modify the audio file to suit our needs. It is a CLI tool designed for multimedia processing and consists of two subtools:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;FFmpeg – enables conversion to different formats, trimming and merging files, changing the sampling rate… and that’s just the tip of the iceberg.&lt;/li&gt;
&lt;li&gt;FFprobe – allows us to analyze the file, including checking its size, format, and other attributes.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Well, a CLI tool. Won't that be a problem in the case of Lambda? After all, it is a serverless solution, which is very high-level. Fortunately, there is a way to solve this problem, but more on that later.&lt;/p&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;h2&gt;
  
  
  Architecture
&lt;/h2&gt;

&lt;p&gt;When developing our solution, we need to cover three fundamental aspects:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Upload&lt;/strong&gt; – How can we efficiently deliver new files?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Audio Processing&lt;/strong&gt; – Similar to text files, audio files come in a wide range of formats. Additionally, each file can have different sampling rates and channel configurations. Standardizing them will simplify further processing.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Metadata&lt;/strong&gt; – It’s essential to ensure that files can be easily searched and sorted later.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Here's what the final process looks like.&lt;/p&gt;

&lt;h3&gt;
  
  
  Upload
&lt;/h3&gt;

&lt;p&gt;Uploading a file seems like the least of your problems. After all, it's just sending a file to our backend and throwing it into an S3 bucket, right?&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fax015p0uvwexvou9isop.PNG" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fax015p0uvwexvou9isop.PNG" alt="Uploading file" width="598" height="147"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Well, not really. Of course, it can be done this way, but it will be inefficient. After all, in such an architecture, our Lambda will work a bit like a shovel that has to move a large amount of data and that is its only task. What if our client could upload a file directly to the Bucket?&lt;/p&gt;

&lt;p&gt;Fortunately, AWS provides us with a &lt;strong&gt;Presigned URL&lt;/strong&gt; mechanism that allows for direct upload to the S3 Bucket. It's very simple:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;In the first step, we ask the bucket to generate a URL that allows us to upload the file directly.&lt;/li&gt;
&lt;li&gt;After receiving the URL in the response, we send a second request to it, placing our file in the body of that request.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdkvoyqw2r2eqosmmneat.PNG" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdkvoyqw2r2eqosmmneat.PNG" alt="Uploading file" width="592" height="375"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Another factor to consider is the file size - larger files take longer to upload and are more susceptible to network errors. The solution is to use &lt;strong&gt;Multipart Upload&lt;/strong&gt;. It allows parallelization of requests and increases resilience, for example by enabling re-sending of only the failed parts. For which files should it be used?&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&amp;gt;100MB&lt;/strong&gt; - you should consider using this mechanism.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&amp;gt;5GB&lt;/strong&gt; - AWS requires the use of Multipart Upload for files larger than 5GB.&lt;/li&gt;
&lt;/ul&gt;
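The rule of thumb above can be captured in a tiny helper (a sketch; the names are illustrative, the thresholds follow the S3 limits just mentioned):

```typescript
const MULTIPART_RECOMMENDED_BYTES = 100 * 1024 * 1024; // ~100MB - worth considering multipart
const PUT_OBJECT_MAX_BYTES = 5 * 1024 * 1024 * 1024; // 5GB - hard limit for a single PUT

type UploadStrategy = 'put-object' | 'multipart-recommended' | 'multipart-required';

// Picks an upload strategy based on the file size.
function chooseUploadStrategy(fileSizeBytes: number): UploadStrategy {
  if (fileSizeBytes > PUT_OBJECT_MAX_BYTES) {
    return 'multipart-required'; // AWS rejects single PUTs above 5GB
  }
  if (fileSizeBytes > MULTIPART_RECOMMENDED_BYTES) {
    return 'multipart-recommended';
  }
  return 'put-object';
}
```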

&lt;p&gt;In this example I will stick to &lt;code&gt;PutObject&lt;/code&gt; for simplicity.&lt;/p&gt;

&lt;h3&gt;
  
  
  Audio Processing
&lt;/h3&gt;

&lt;p&gt;With the raw audio file already saved, the next step is to pass it to Lambda to initiate processing. The easiest way is to use the &lt;strong&gt;S3 Events&lt;/strong&gt; mechanism. It allows us to listen for changes to objects in the bucket. We specify the event type, prefix, suffix, and the destination service for the notification. From now on, each time a file matching these rules is added, an event is triggered that starts our Lambda. Of course, it doesn't contain the file itself - only the metadata needed to download it from our S3 bucket.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fszjbeneusb1d4p8ganui.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fszjbeneusb1d4p8ganui.png" alt="S3 events" width="591" height="201"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Alternatively, you can use EventBridge, which is a more general solution that supports a much wider range of services and events.&lt;/p&gt;

&lt;p&gt;As you can see, this is a very simple architecture and unfortunately not fully functional. There is one detail that may be problematic: the &lt;strong&gt;at-least-once delivery&lt;/strong&gt; of S3 events. This means that duplicates may occur, which would cause us to process the same file twice. The simplest fix is to set up an additional SQS FIFO queue, which will automatically reject duplicates and save us some computing resources. Note that this will only work if we fit into the SQS FIFO limit (300 messages per second). To achieve higher throughput, we can use a solution like DynamoDB to track whether a given event has already been processed.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjca3l1q4pu0vwz4sto0l.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjca3l1q4pu0vwz4sto0l.png" alt="Deduplicating" width="800" height="200"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Okay, but what should our actual audio processing look like? How do we use FFmpeg in Node.js? There are two ways:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You can directly call CLI commands using Node &lt;code&gt;child_process&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import { exec } from 'child_process';
import { promisify } from 'util';

...

const execAsync = promisify(exec);

try {
  // Note: interpolated values come from our own code here, but anything
  // user-controlled should be validated before being passed to the shell
  const { stderr } = await execAsync(`ffmpeg -i ${audioFilePath} -b:a ${bitrate} ${transformedAudioPath}.${format}`);

  if (stderr) {
    console.warn('stderr:', stderr);
  }
} catch (err) {
  console.error('Error:', err.message);

  if (err.stderr) {
    console.error('stderr:', err.stderr);
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;You can also use &lt;a href="https://www.npmjs.com/package/fluent-ffmpeg" rel="noopener noreferrer"&gt;fluent-ffmpeg&lt;/a&gt;. It's an NPM package that wraps ugly CLI commands in a beautiful chain of functions. I know it's deprecated, but it can still be useful for most operations. Here's how we use it:
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;await new Promise((resolve, reject) =&amp;gt; {
  ffmpeg(audioFilePath)
    .toFormat(format) // Change format
    .audioBitrate(bitrate) // Change bitrate
    .save(transformedAudioPath)
    .on('end', () =&amp;gt; {
      console.log('File has been transformed successfully');
      return resolve(transformedAudioPath);
    })
    .on('error', (error: Error) =&amp;gt; {
      console.log('Failed to transform audio file: ', error.message);
      return reject(error);
    });
});
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Simple and easy to use. No matter what the input parameters are, we still get a unified and predictable output. This helps us in further processing: e.g. we no longer need to worry whether a given format is supported by the browser. We can also degrade the audio quality to save some space on S3.&lt;/p&gt;

&lt;p&gt;"But wait. You mentioned that FFmpeg is a CLI tool. Can we just install the library and expect it to work?" Well, unfortunately, it is not that easy. We still need to have FFmpeg installed on our system. But how can we do this? Do we need to put it in a ZIP with the Lambda code? This is where &lt;strong&gt;Lambda Layers&lt;/strong&gt; comes in handy. This mechanism allows us to pack our dependencies into archives, which can then be used in our functions. We can include predefined dependencies, as well as include external tools. In our case, FFmpeg and FFprobe will be packaged in this way. We only need to zip the binaries, create the new layers and attach them to our Lambda function. We also need to remember to set the appropriate &lt;code&gt;FFMPEG_PATH&lt;/code&gt; and &lt;code&gt;FFPROBE_PATH&lt;/code&gt; values using Lambda environment variables.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjebdwiqae8vvvoxch62r.PNG" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjebdwiqae8vvvoxch62r.PNG" alt="FFmpeg Paths" width="254" height="43"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhwvfyfjd4nqsidb59y5l.PNG" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhwvfyfjd4nqsidb59y5l.PNG" alt="Lambda layers" width="800" height="177"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;From now on we can use FFmpeg CLI commands.&lt;/p&gt;

&lt;h3&gt;
  
  
  Metadata
&lt;/h3&gt;

&lt;p&gt;The last element is the metadata needed for later filtering and searching of audio files, e.g. for frontend purposes. Here, I will use &lt;strong&gt;DynamoDB&lt;/strong&gt; as a database, which provides very good scalability and on-demand pricing (we pay only for the resources used). During the entire flow, we will update the current state of file processing, which looks as follows:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Famyqo0wtqx0rv2pumf3j.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Famyqo0wtqx0rv2pumf3j.png" alt="Metadata" width="800" height="400"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;And this is how it looks like in the code:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Create new metadata record
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;export const createAudioMetadataRecord = async (audioMetadata: AudioMetadata): Promise&amp;lt;void&amp;gt; =&amp;gt; {
  const documentClient = initDocumentClient();
  const putCommand = new PutCommand({
    TableName: process.env.AUDIO_TABLE_NAME,
    Item: audioMetadata,
  });

  await documentClient.send(putCommand);
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;Update record with new status
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;  const updateCommand = new UpdateCommand({
    TableName: process.env.AUDIO_TABLE_NAME,
    Key: {
      id: audioId,
    },
    UpdateExpression: 'SET #status = :status',
    ExpressionAttributeNames: {
      '#status': 'status',
    },
    ExpressionAttributeValues: {
      ':status': 'UPLOADED' satisfies FileStatus,
    },
  });

  const documentClient = initDocumentClient();
  await documentClient.send(updateCommand);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;p&gt;The final architecture looks as follows:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fket78kj7mglrf6g9o921.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fket78kj7mglrf6g9o921.png" alt="Final architecture" width="800" height="349"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Let's test it!&lt;br&gt;
First, we need to generate a new Presigned URL. We get it in the response body of our &lt;code&gt;POST /api/files&lt;/code&gt; endpoint.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fv2r04nbnpj2bw5wnbeox.PNG" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fv2r04nbnpj2bw5wnbeox.PNG" alt="Generate presigned" width="621" height="219"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Then we use it in a PUT request. We can also leverage a redirect status code such as HTTP 307 (which, unlike 301, preserves the request method and body) to automatically redirect after a successful response. This starts the whole processing flow. We can get the updated metadata with the &lt;code&gt;GET /api/files&lt;/code&gt; endpoint, which lists all the uploaded files.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbsjd3b01p0lhirwampsz.PNG" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbsjd3b01p0lhirwampsz.PNG" alt="List audio files" width="509" height="350"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As you can see, even such a simple task should be well planned, with attention to details such as scaling, duplicates, or the operating cost of our solution. Of course, such an architecture is only a base for more complex business cases, so I encourage you to experiment.&lt;/p&gt;




&lt;h3&gt;
  
  
  Links
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;👋 &lt;a href="https://www.linkedin.com/in/jakub-stanis%C5%82awczyk-33128b142/" rel="noopener noreferrer"&gt;LinkedIn&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;💻 &lt;a href="https://github.com/jstanislawczyk/aws-audio-transformer" rel="noopener noreferrer"&gt;Github&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>aws</category>
      <category>node</category>
      <category>programming</category>
      <category>dataengineering</category>
    </item>
  </channel>
</rss>
