DEV Community: Bartosz Górski

Checking object existence in large AWS S3 buckets using Python and PySpark

Bartosz Górski — Fri, 12 Sep 2025 06:59:48 +0000

Introduction

In a recent project, I needed to check whether data from a third-party database corresponded to the documents in an S3 bucket. This might seem like a straightforward task, but the dataset was massive: up to 10 million objects in a single bucket. Iterating over the full object list, or issuing a HEAD request for every file I was looking for, would take forever. Instead, I took a few interesting steps, using Python and PySpark to search through potentially large datasets efficiently.

Here's a detailed breakdown of my process.

Listing S3 Bucket Contents and Saving Directory Names

The first step was to list the contents of the S3 bucket and save the names of the subdirectories to a text file. For this, I used the Boto3 library in Python, which is a powerful interface for interacting with Amazon Web Services (AWS).

Here's a snippet of the code used to accomplish this task:

import os
import sys
import boto3

__doc__ = """
Usage: python get_objects.py <bucket_name> <output_file> [prefix]
Example: python get_objects.py my_bucket objects.txt prefix
"""

if __name__ == "__main__":
    if len(sys.argv) < 3:
        print(__doc__)
        sys.exit(1)

    bucket_name = sys.argv[1]
    output_file = sys.argv[2]
    prefix = sys.argv[3] if len(sys.argv) > 3 else ""
    s3 = boto3.client("s3")
    try:
        os.remove(output_file)
    except OSError:
        pass
    continuation_token = None
    while True:
        if continuation_token:
            response = s3.list_objects_v2(Bucket=bucket_name, Prefix=prefix, ContinuationToken=continuation_token, Delimiter='/')
        else:
            response = s3.list_objects_v2(Bucket=bucket_name, Prefix=prefix, Delimiter='/')
        objects = response.get("CommonPrefixes", [])
        continuation_token = response.get("NextContinuationToken")
        with open(output_file, "a") as f:
            for obj in objects:
                f.write(obj["Prefix"] + "\n")
        print(f"{len(objects)} Objects in {bucket_name} are listed in {output_file}")
        if not continuation_token:
            break

Let's break it down.

os.remove(output_file): every run should start with an empty file, so the output file is removed if it already exists.
s3.list_objects_v2(Bucket=bucket_name, Prefix=prefix, ContinuationToken=continuation_token, Delimiter='/'): gets up to 1,000 entries from the bucket bucket_name under the "sub-directory" prefix, using the delimiter / to group keys into directories, which are returned as CommonPrefixes. Since we only care whether a directory exists, not how many files it holds, we can skip the non-directory entries. ContinuationToken lets the loop retrieve all elements, because a single call returns at most 1,000 of them.
The listed "directory" entries are not real objects and have no content, so we save the value of each entry's Prefix property.
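The pagination logic described above can also be written without the duplicated if/else branch. The following is only a sketch, not the script from the repository: the helper name iter_common_prefixes is hypothetical, and it assumes it is handed a Boto3 S3 client (or any object exposing the same list_objects_v2 method).

```python
def iter_common_prefixes(client, bucket, prefix=""):
    """Yield every "sub-directory" prefix found directly under `prefix`.

    `client` is assumed to be a Boto3 S3 client, e.g. boto3.client("s3").
    """
    kwargs = {"Bucket": bucket, "Prefix": prefix, "Delimiter": "/"}
    while True:
        response = client.list_objects_v2(**kwargs)
        # With a delimiter set, "directories" are reported as CommonPrefixes.
        for entry in response.get("CommonPrefixes", []):
            yield entry["Prefix"]
        token = response.get("NextContinuationToken")
        if not token:
            break
        # Ask for the next page on the following iteration.
        kwargs["ContinuationToken"] = token
```

One thing to watch for: with Delimiter='/', a prefix passed without a trailing slash (for example docs instead of docs/) is itself grouped into the single common prefix docs/, so the prefix usually needs to end with a slash to list its children.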

Using PySpark to Search for the Selected Directory

With the directory names saved in a text file, the next step was to leverage PySpark for efficient searching. PySpark's DataFrame API provides a powerful way to handle large datasets.

Here's an example of how I used PySpark to search for a selected directory:

import os
import sys
import time
from pyspark.sql import SparkSession


def load_input_data(spark):
    if os.path.exists("out.parquet"):
        df = spark.read.load("out.parquet")
        print(f"Loaded {df.count()} records from file out.parquet")
    else:
        df = spark.read.text("out.txt")
        print(f"Loaded {df.count()} records from file out.txt")
        df.write.save("out.parquet", format="parquet")
        print(f"Saved dataframe to out.parquet")
    return df


def find_entry(id, df):
    return df.filter(df.value == id).count() > 0


def find_in_s3(id_to_find):
    # Local timer, so the function does not rely on a global set in __main__.
    start_time = time.time()
    spark = SparkSession.builder.appName("S3Find").getOrCreate()
    df = load_input_data(spark)
    print("Loading input data finished after %s seconds ---" % (time.time() - start_time))
    found = find_entry(id_to_find, df)
    print(f"Found entry {id_to_find}" if found else f"Entry {id_to_find} not found")
    spark.stop()


if __name__ == "__main__":
    if len(sys.argv) < 2:
        sys.exit(1)
    id_to_find = sys.argv[1]
    start_time = time.time()
    find_in_s3(id_to_find)
    print("--- %s seconds ---" % (time.time() - start_time))

load_input_data looks for previously saved Parquet data; if none is available, it loads the text file into a DataFrame and saves it as Parquet for later runs.
find_entry filters the DataFrame to check whether the selected element exists.
find_in_s3 creates a new Spark session, loads the DataFrame, and performs the search.
In the main block, I added simple execution-time measurement.
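Because find_entry compares whole values for equality, the searched value has to look exactly like a line written by the listing script, which stores the full Prefix (parent prefix included, with a trailing slash). A minimal sketch of building such a value follows; the helper name to_lookup_key and the way record IDs map to directory names are assumptions made for illustration.

```python
def to_lookup_key(record_id, parent_prefix=""):
    """Build the string that a listing line would contain for `record_id`.

    Assumes each record maps to a directory named after its ID,
    located directly under `parent_prefix`.
    """
    parent = parent_prefix.strip("/")
    name = record_id.strip("/")
    return f"{parent}/{name}/" if parent else f"{name}/"
```

For example, a record abc stored under docs/ would be searched for as docs/abc/.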

Benchmarks

Using a simple test generator, I created three collections (100k, 1M, and 10M elements) of random UUIDs, plus test cases with 10, 100, and 1,000 items, some of them with randomly added suffixes (to exercise searches for non-existent entries). The results shocked me a bit. It was blazing fast (time in seconds):

As we can see, search time tracked the number of searched elements almost exactly, not the collection size.
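The test generator itself is not shown in this post. A minimal sketch of what such a generator might look like is given below; the function name, the "-missing" suffix, and the miss_ratio parameter are all assumptions, not the code used for the measurements.

```python
import random
import uuid


def generate_dataset(collection_size, num_cases, miss_ratio=0.5, seed=0):
    """Return (collection, cases): random UUID strings plus lookup cases.

    Roughly `miss_ratio` of the cases get a suffix appended, so they
    cannot match any element of the collection.
    """
    rng = random.Random(seed)  # seeded, so runs are repeatable
    collection = [
        str(uuid.UUID(int=rng.getrandbits(128), version=4))
        for _ in range(collection_size)
    ]
    cases = []
    for _ in range(num_cases):
        candidate = rng.choice(collection)
        if rng.random() < miss_ratio:
            # A suffixed value is longer than any UUID, so it is a guaranteed miss.
            candidate += "-missing"
        cases.append(candidate)
    return collection, cases
```

Writing collection and cases to text files, one value per line, gives inputs in the shape the scripts in this post read.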

Grep

The Internet says:

Grep is always faster for searching in files.

OK, let's try. Here is the test code that measures the run time of grep and Spark on the same datasets and test UUIDs:

import os
import subprocess
import sys
import time
from pyspark.sql import SparkSession


def find_entry(id, df):
    return df.filter(df.value == id).count() > 0


def spark_find(test_file):
    start_time = time.time()
    spark = SparkSession.builder.appName("S3Find").getOrCreate()
    df = spark.read.text("test_output")
    with open(test_file, "r") as f:
        for line in f:
            id_to_find = line.rstrip()
            find_entry(id_to_find, df)
    spark.stop()
    print("SPARK: --- %s seconds ---" % (time.time() - start_time))

def grep_find(test_file):
    start_time = time.time()
    with open(test_file, "r") as f:
        for line in f:
            id_to_find = line.rstrip()
            subprocess.call(['/usr/bin/grep', '-q', id_to_find, 'test_output'])
    print("GREP: --- %s seconds ---" % (time.time() - start_time))

if __name__ == "__main__":

    numOfRecords = str(sys.argv[1])
    spark_find("test_outputtest"+numOfRecords)
    grep_find("test_outputtest"+numOfRecords)

And the results for the grep:

For relatively small collections grep was faster, but once the collection reaches millions of entries, the Spark solution overtakes good ol' grep.
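One detail worth keeping in mind when reading these numbers: grep -q treats its argument as a pattern and matches it anywhere in a line, while the Spark filter requires the whole value to be equal. If you want grep to apply the same exact-match rule, a sketch along these lines could be used. The helper name grep_exact is hypothetical, it assumes grep is available on the PATH, and it was not part of the measured benchmark.

```python
import subprocess


def grep_exact(needle, path):
    """Return True if some line of `path` is exactly equal to `needle`."""
    result = subprocess.run(
        # -q: no output, -x: match the whole line, -F: fixed string (no regex),
        # -e: pass the needle explicitly, even if it starts with a dash.
        ["grep", "-q", "-x", "-F", "-e", needle, path],
        check=False,
    )
    return result.returncode == 0
```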

A direct comparison of Spark vs Grep:

Summary

To understand why PySpark outperformed grep, let's delve into the differences:

Grep:

Traditional command-line utility for searching plain-text data.
Efficient for small to medium-sized text files.
Performance drops significantly with larger datasets due to linear search.

PySpark:

Distributed computing framework, ideal for large-scale data processing.
It uses in-memory computations, which speed up the search process.
Capable of handling much larger datasets efficiently.

In summary, while grep is a fantastic tool for quick searches on smaller datasets, PySpark shines when dealing with larger datasets, offering significant performance improvements due to its distributed nature.

By leveraging Python and PySpark, I was able to efficiently determine the existence of directories in an S3 bucket, saving time and computational resources. This method showcases the power of modern data processing tools and their application in real-world scenarios.

All the mentioned code can be found in my Repo

Feel free to share your thoughts or ask questions in the comments below!

Note: The performance results may vary based on the specific configuration and resources of your environment.

Key Takeaways from AWS Community Day Baltic 2025: Attendee Insights

Bartosz Górski — Fri, 12 Sep 2025 06:52:33 +0000

Attending AWS Community Day Baltic 2025 on September 10th was a delightful experience. This community‑driven, full‑day event featured practical sessions, real‑world use cases, and numerous hallway conversations, allowing builders from across the region to share their experiences. As a community conference, AWS Community Day exists to amplify practitioner knowledge, foster peer learning, and turn ideas into production‑grade decisions - prioritizing substance over show.

Model Context Protocol (MCP): From Zero to Hero

The first session I attended was “Model Context Protocol (MCP): From Zero to Hero,” which covered the essentials of building and running agents with MCP. MCP is an open standard that defines how AI systems connect to tools, data, and environments, simplifying integration and reducing glue code. The speaker, Viktor Vedmich, Senior Developer Advocate at AWS, delivered real‑life examples and quick demos that fit naturally into daily workflows. A memorable framing was: MCP is to LLMs what REST is to the Web. The session also highlighted AWS’s role in MCP tool chaining, emphasizing Strands Agents and Bedrock AgentCore as emerging building blocks for agentic systems at scale. For deeper dives, see the links below.

Key takeaways

MCP standardizes tool access for agents, enabling safer, composable integrations.
Strands Agents and Bedrock AgentCore accelerate agent orchestration in AWS environments.

https://aws.amazon.com/bedrock/agentcore/

https://strandsagents.com/latest/

DynamoDB Demystified – Core Features & the Single‑Table Design

To pivot from AI orchestration to data fundamentals, I next attended “DynamoDB Demystified – Core Features & the Single‑Table Design,” led by Asia Chojnacka, Senior Software Engineer & Cloud Community Leader at Capgemini (AWS Community Builder), and Szymon Sołtys, Senior DevOps Engineer at Transition Technologies (AWS Community Builder). Although they noted it was their first AWS event talk, the delivery was confident and well‑structured. The session was split into DynamoDB essentials and an approachable walkthrough of Single‑Table Design (a modeling strategy that consolidates multiple entities into one table using partition/sort keys and access patterns). With prior DynamoDB experience, the second part was especially valuable: practical partition/sort key strategies, a disciplined query‑over‑scan mindset, and techniques for managing schema evolution without sacrificing read efficiency. The most important takeaway: if Single‑Table Design is not well understood, resist adopting it prematurely despite its potential benefits.

Key takeaways

Model around access patterns first; design keys to answer the most frequent queries.
Single‑Table Design pays off only when the team is proficient with its trade‑offs.
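To make the idea of "multiple entities in one table" more concrete, here is a toy in-memory sketch. It is not from the talk and does not use DynamoDB; the class, the key formats (CUSTOMER#, ORDER#, PROFILE), and the attribute names are all made up for illustration. It mimics how a partition key groups related items and how a sort-key prefix (similar to a begins_with key condition in a DynamoDB Query) selects one entity type.

```python
from collections import defaultdict


class SingleTable:
    """Toy in-memory stand-in for one DynamoDB table (illustration only)."""

    def __init__(self):
        # partition key -> {sort key -> item attributes}
        self._partitions = defaultdict(dict)

    def put(self, pk, sk, **attributes):
        self._partitions[pk][sk] = attributes

    def query(self, pk, sk_prefix=""):
        """Return (sort key, item) pairs from one partition, ordered by sort key."""
        items = self._partitions.get(pk, {})
        return [(sk, items[sk]) for sk in sorted(items) if sk.startswith(sk_prefix)]


table = SingleTable()
# Two entity types (profile and orders) share one partition per customer.
table.put("CUSTOMER#42", "PROFILE", name="Alice")
table.put("CUSTOMER#42", "ORDER#2025-09-10", total=30)
table.put("CUSTOMER#42", "ORDER#2025-08-01", total=12)
table.put("CUSTOMER#7", "PROFILE", name="Bob")

# Access pattern "all orders of a customer, by date": one partition, one prefix.
orders = table.query("CUSTOMER#42", sk_prefix="ORDER#")
```

The access pattern drives the key design here: because order sort keys start with ORDER# followed by an ISO date, a single query on one partition returns a customer's orders already sorted by date, with no scan involved.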

Build → Ship → Scale: From Your First AI Agent to Multi‑Agent Systems on AWS

Building on the agentic foundation from the morning, I attended “Build → Ship → Scale,” presented by Viktor Vedmich, a Senior Developer Advocate. The session focused on moving from a first working agent to orchestrated multi-agent systems with guardrails and observability on AWS. It naturally continued themes from the MCP session, showing how Strands Agents and Bedrock AgentCore support real production workflows. Multi‑agent infrastructure is rapidly becoming a standard approach for enterprise LLM usage.

Self‑Healing Data Platforms – 3A’s: Automation, AI, and Alignment

After a lunch break, I attended “Self‑Healing Data Platforms – 3A’s: Automation, AI, and Alignment,” presented by Przemysław Mikulski, the Co-Founder and Cloud & Data Architect at Kodlot ApS. The session emphasized using automation to reduce MTTR, AI-assisted anomaly detection, and cross-team alignment. Real-life examples showing a significant reduction in incident volume provided impressive evidence of AI's importance in Data Platform monitoring. The session was highly expert-level, so domain experts could truly benefit from the insights shared.

There Are No Magic Tricks in Cloud Cost Optimization

Continuing into operations, I attended “There Are No Magic Tricks in Cloud Cost Optimization,” led by Martin Hauskrecht, Head of Engineering at Labyrinth Labs. The session emphasized building lasting FinOps habits rather than relying on quick fixes. It highlighted design-for-cost, unit economics, strict tagging, and continuous measurement as the foundation for sustainable savings. Unfortunately, the session was quite short and focused only on Kubernetes and database cost optimization. I was hoping for more insights into serverless. The key lesson learned: To reduce costs, you first need accurate information obtained through observability.

Workshop – Agentic AI with MCP and Strands SDK

To translate concepts into practice, I participated in the workshop “Agentic AI with MCP and Strands SDK,” led by Szymon Kochański, a Senior Solutions Architect at AWS. The session covered everything from MCP basics to building Strands-based agents using tools, memory, and simple orchestration patterns. The hands-on format and step-by-step labs made it easy to continue the work after the event. It was my favorite session at the conference. I was impressed by how easy it was to use Strands Agents and by their powerful usage patterns. The most impressive example? A multi-agent app with multiple tools that makes HTTP requests, saves data to OpenSearch, commits changes in the source code, and even creates new tools from just a user prompt! Want to try it? Here’s the link:
https://catalog.workshops.aws/strands/en-US

The engineering mindset in the age of AI: Beyond vibe coding

The last session I attended was “The Engineering Mindset in the Age of AI: Beyond Vibe Coding.” This talk contrasted prompt-first “vibe coding” with an engineering mindset focused on invariants, guarantees, and clear system design around probabilistic components. The session advocated for clear interfaces, considering failure modes, and measurable outcomes—treating AI assistance as a component to be constrained and evaluated, not a replacement for engineering rigor. It also outlined practical principles for senior developers to balance speed with safety, moving from exploratory generation to intentionally designed, testable systems. I was personally amazed by the talk. Gunnar Grosch—Principal Developer Advocate at AWS—perfectly explained all the benefits and challenges of current AI-powered development work. I now believe that AI is like the steam engine in the late 19th century. It won't reduce developers' work; it will help us in our daily tasks with a brand-new toolset. Have you heard of Jevons Paradox?
https://simple.wikipedia.org/wiki/Jevons_paradox

Conclusion

AWS Community Day Baltic 2025 delivered a balanced mix of foundational concepts and advanced practices across agentic AI, data platforms, and cloud operations. The through‑line across sessions was disciplined engineering: adopt open standards for agents, model data around access patterns, automate detection and recovery, and embed cost visibility from day one. The community format—talks, workshops, and hallway exchanges—made the insights immediately actionable and reinforced that continuous learning is essential in today’s rapidly evolving landscape. Most of all, the event demonstrated that combining principled engineering with modern AI tools yields systems that are not only more capable but also more reliable and economically sound.

Glossary

MCP: an open protocol that standardizes how AI agents access tools, data, and environments.

Single‑Table Design: modeling multiple entities in one DynamoDB table using partition/sort keys and access‑pattern‑driven schema.

FinOps: a cross‑functional practice aligning engineering and finance to measure, manage, and optimize cloud spend.