DEV Community: Alessandra Bilardi

When boto3 doesn't have it (yet), you write it: a realtime speech-to-speech story in Python

Alessandra Bilardi — Wed, 20 May 2026 22:00:07 +0000

At a meetup's networking session, someone dropped: "the new speech-to-speech feature in Teams is really cool". Microsoft Teams added the interpreter agent with realtime AI-powered speech-to-speech translation during calls. So the natural question: how hard is it to build one on AWS? And how well does it perform?

Meanwhile, for PyCon IT 2026, with an inclusivity goal, the plan was already to use bilardi/realtime-transcription with a monitor in the room showing the talk transcript. But wouldn't it be handier if each attendee had the translated transcript directly on their own mobile, and maybe the audio in their own language too, naturally without installing anything ?

And so bilardi/realtime-speech-to-speech was born, ready to use, for any conference or meetup. Under the hood there are three AWS services chained together: Transcribe Streaming for Automatic Speech Recognition (ASR) from audio to text, Translate for the translation, Polly bidirectional streaming for Text-to-Speech (TTS) from text to audio. Architecture, costs and usage live in the repo: here, instead, I cover the choices I made and what went sideways along the way.

A stage PoC for multilingual meetups

There were three initial alternatives, from the simplest to the most complex.

Option	When it makes sense	Effort
One-way PoC, 1 speaker language → 1 listener language	Minimal validation of the AWS pipeline	Headphones to keep the mic from recapturing the TTS
Bidirectional 1:1 conversation	International meeting between two people	Two symmetric pipelines + a second device to test
1-to-many conference (fan-out), multilingual	Talks and meetups with international audience	Browser audio playback + N parallel pipelines under contention

I started from the 1:1 one-way PoC to validate the AWS pipeline and the new piece (Polly bidirectional streaming plus the browser audio playback), and from there moved on to 1-to-many, which is the real scenario for a conference or a meetup. Direction and language pair stay as two environment variables: changing scenario becomes editing two lines in .env, no refactor.

Listener client: the browser. Mobile has it without installing anything, and opening a URL is the simplest UX for the "PC speaks, mobile listens" test. A native app isn't worth it even in production for this use case, let alone for a PoC: targets to maintain, stores to publish to, zero advantages over a page opened from a QR code.

Why not Nova 2 Sonic ?

AWS recently announced Amazon Nova 2 Sonic: an end-to-end speech-to-speech model, ASR plus LLM plus TTS in a single bidirectional connection. Obligatory question: why not Nova Sonic, then ?

Nova Sonic is designed to respond to audio input: conversational assistant, human-AI dialogue, turn-taking, managed interruptions. The use case here is the opposite: a transmission to multiple listeners, a different language for each (multilingual broadcast), with faithful translation. For example, Italian audio as input, the same sentence as audio in N different languages as output, across N parallel channels. They are two different products: the fact that both go by "speech-to-speech" is a marketing collision.

Mapping the current three stages against Nova Sonic:

Current stage	Function	Nova 2 Sonic covers ?	Same guarantee ?
Transcribe Streaming	ASR audio to text	Yes, integrated	Plausible, but I haven't tested
Translate	Deterministic Neural Machine Translation (NMT)	Yes, via prompting	No, not deterministic
Polly Generative	TTS reading quality	Yes, conversational voices	No, dialogue intonation

The three critical points, from most to least blocking:

Translate: an NMT trained for faithful, deterministic translation. Nova Sonic would do translation via LLM prompting: more fluent but not deterministic, may paraphrase or add conversational fillers. Unacceptable for a broadcast where the audience expects exactly what the speaker says
Polly Generative: voices optimized for reading a given text. Nova Sonic has voices optimized for dialogue, intonation that adapts to the user's voice input. For reading a translation it's the wrong voice
Transcribe: replaceable in principle, but Nova Sonic doesn't expose ASR as a standalone service billed separately

Operational constraints independent of quality: 8-minute connection limit against Transcribe Streaming's 4 hours, and Nova requires a separate session per target language (the current pipeline calls Transcribe once for N languages).

Decision: pipeline with three specialized services. Nova 2 Sonic stays the natural candidate for a different scenario, where the listener asks the AI a question and the AI answers, not for a meetup with a human speaker and a passive audience.

Here is the stack

As a lazy developer, the first thing I looked for is reusable pieces. realtime-transcription already has the audio_client/ module to capture Pulse-Code Modulation (PCM) audio from a device and the FastAPI WebSocket scaffold: cherry-pick roughly 140 lines and you're off. The browser display, instead, is from scratch, because audio playback is a different beast from text display.

The server-side pipeline is simple and linear: Transcribe streaming → Translate one-shot → Polly bidirectional. Transcribe can deliver a partial text (is_partial=True) sooner, but a partial may be wrong and later be cancelled and rewritten. Since the goal is validating the chain end to end, not shaving milliseconds of latency, partials are ignored. Everything therefore starts from Transcribe once it has recognized a complete sentence (is_partial=False): at that point Translate fires with a single call per sentence, and the translated text goes to Polly bidirectional, which begins returning audio while it's still generating the rest.
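The sentence-bounded trigger can be sketched in a few lines. This is a minimal illustration, not the repo's code: TranscriptResult, run_pipeline and the two injected callables are hypothetical names, and the fake services only show the control flow.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Iterator


@dataclass
class TranscriptResult:
    """Hypothetical stand-in for one Transcribe streaming result."""
    text: str
    is_partial: bool


def run_pipeline(
    results: Iterable[TranscriptResult],
    translate: Callable[[str], str],
    synthesize: Callable[[str], bytes],
) -> Iterator[tuple[str, bytes]]:
    """Yield (translated_text, audio) once per finalized sentence."""
    for result in results:
        if result.is_partial:
            # Partials may still be rewritten by Transcribe: ignore them.
            continue
        translated = translate(result.text)  # one Translate call per sentence
        yield translated, synthesize(translated)


# Usage with fake services, just to show which results trigger the chain.
stream = [
    TranscriptResult("ciao a", True),
    TranscriptResult("ciao a tutti", False),
]
out = list(run_pipeline(stream, lambda t: t.upper(), lambda t: t.encode()))
```

Only the finalized result reaches Translate and Polly; the partial is dropped without side effects.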

For the audio format the options were compressed MP3 and raw PCM. MP3 uses ~4 times less bandwidth, but the browser has to decode it asynchronously for each chunk (decodeAudioData), breaking the continuity of the playback queue. PCM (16-bit signed LE, 16 kHz mono) weighs more on bandwidth but the browser writes it straight into a Web Audio API AudioBuffer: no intermediate decoding, linear queue. On LAN or local WiFi bandwidth isn't the constraint, latency is: I picked PCM. On top of that, 16 kHz mono matches the sample rate of the microphone and of Transcribe: no format conversion in the middle of the pipeline. In the cloud, where the audio going out from the server to each listener is data transfer out (AWS egress, billed), PCM might blow past the 100GB / month free tier, which is ~35h with 25 listeners.
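The ~35h figure follows directly from the PCM format. A quick check, assuming decimal gigabytes and ignoring WebSocket framing overhead (the helper name is made up for the example):

```python
SAMPLE_RATE_HZ = 16_000
BYTES_PER_SAMPLE = 2   # 16-bit signed
CHANNELS = 1           # mono

# Raw PCM throughput towards a single listener.
bytes_per_second = SAMPLE_RATE_HZ * BYTES_PER_SAMPLE * CHANNELS
gb_per_listener_hour = bytes_per_second * 3600 / 1e9


def hours_within(free_tier_gb: float, listeners: int) -> float:
    """Hours of broadcast that fit in the egress allowance."""
    return free_tier_gb / (gb_per_listener_hour * listeners)


hours = hours_within(100, 25)
```

That is 32 kB/s per listener, about 0.115 GB per listener-hour, so 25 listeners consume 100 GB in a little under 35 hours.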

To pick a Polly voice in the target language, there were two paths. The first is a hardcoded (language) → (voice id) map: simple, but it breaks every time AWS publishes new voices. The second calls DescribeVoices at server boot and discovers dynamically what's available, with an in-memory cache. I picked the second: one API call at startup, zero maintenance when AWS adds voices. To stay compatible with bidirectional streaming I filtered by LanguageCode (the target language) and kept only the voices that support bidirectional streaming: the feature is recent (2026) and not every language covers it, so without the filter synthesis would fail at the first start_speech_synthesis_stream.
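A sketch of the selection step over a DescribeVoices-shaped list. The field names (Id, LanguageCode, SupportedEngines) are the ones boto3's describe_voices returns; pick_voice, the sample voices, and the use of the generative engine as a proxy for streaming support are assumptions for illustration only.

```python
from typing import Callable, Optional


def pick_voice(
    voices: list[dict],
    language_code: str,
    supports_streaming: Callable[[dict], bool],
) -> Optional[str]:
    """Return the first voice Id for the language that passes the filter."""
    for voice in voices:
        if voice["LanguageCode"] == language_code and supports_streaming(voice):
            return voice["Id"]
    return None


# Same shape as response["Voices"]; the values are made up for the example.
sample = [
    {"Id": "VoiceA", "LanguageCode": "de-DE", "SupportedEngines": ["standard"]},
    {"Id": "VoiceB", "LanguageCode": "de-DE", "SupportedEngines": ["generative"]},
    {"Id": "VoiceC", "LanguageCode": "en-US", "SupportedEngines": ["generative"]},
]

# Assumption: generative engine used as a stand-in for "supports streaming".
is_generative = lambda v: "generative" in v["SupportedEngines"]
chosen = pick_voice(sample, "de-DE", is_generative)
```

At boot the list would come from a single boto3.client("polly").describe_voices() call (following NextToken if present) and be kept in memory for the lifetime of the server.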

The truly new piece is precisely StartSpeechSynthesisStream, the Amazon Polly bidirectional API. Announced in March 2026, exposed in the Java SDK, and missing in boto3. The feature shows up in the Java SDK because its code generator reads service-2.json and supports the HTTP/2 bidirectional event-stream protocol. Under boto3 there's botocore, and even botocore doesn't have that infrastructure: the operation stays declared in the service model but the Python client doesn't expose it. Same scenario for aioboto3, the asynchronous version of boto3, which reuses the same service models. Verified on boto3 1.43.9.

So, what paths are available ?

Path	Pros	Cons
`synthesize_speech` sync	Already in the SDK, 5 lines	No fast first-byte: waits until Polly has generated all the audio before returning any byte
HTTP/2 raw + SigV4 + event-stream parser	Real bidirectional, first audio chunk arriving while Polly is still generating	Not in Python: needs to be written from scratch

Decision: the sync one first to validate the pipeline, then the bidirectional one.
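For reference, the sync path looks roughly like this. synthesize_speech and its parameters are the real boto3 API; the wrapper name and the default engine are assumptions, and the client is passed in so the function can be exercised with a fake.

```python
def synthesize_pcm(polly_client, text: str, voice_id: str,
                   engine: str = "generative") -> bytes:
    """Blocking synthesis: the caller gets audio only after the full read."""
    response = polly_client.synthesize_speech(
        Text=text,
        VoiceId=voice_id,
        Engine=engine,          # assumption: the voice supports this engine
        OutputFormat="pcm",     # raw 16-bit signed little-endian samples
        SampleRate="16000",     # matches the microphone and Transcribe
    )
    return response["AudioStream"].read()


# Usage against the real service (needs AWS credentials):
#   import boto3
#   audio = synthesize_pcm(boto3.client("polly"), "Hello", "<voice id>")
```

Good enough to validate Transcribe → Translate → Polly end to end before investing in the bidirectional client.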

And here begins the piece that became a package of its own: amazon-polly-streaming. A PR to boto3 would have been the first reflex, but boto3 doesn't have the HTTP/2 bidirectional event-stream infrastructure. For Transcribe streaming AWS kept it out of boto3 in a separate package under awslabs: first in amazon-transcribe-streaming-sdk (deprecated today) that delegates the HTTP/2 transport to awscrt, then in aws-sdk-transcribe-streaming (the successor) that delegates the event-stream too to smithy_aws_core. For Polly bidirectional an official equivalent doesn't exist yet (verified in May 2026, neither on awslabs nor on PyPI), so amazon-polly-streaming is the first public Python implementation of the feature.

The public API is PollyStreamingClient.start_speech_synthesis_stream(), a mirror of TranscribeStreamingClient.start_stream_transcription() from aws-sdk-transcribe-streaming. Same pattern as the official AWS package for Transcribe: a convention that lets future adoption by awslabs happen without redesigning the API. Same for exceptions: a separate module that mirrors the types Polly exposes in StartSpeechSynthesisStream.

And why not delegate the HTTP/2 bidirectional event-stream to smithy_aws_core[eventstream], the way aws-sdk-transcribe-streaming does ? The bulk of the package would stay uncovered: AWS hasn't published a smithy client for Polly bidirectional. Since that client doesn't exist, it's simpler to keep the protocol in-house too: one fewer dependency, and no need to sync amazon-polly-streaming's cycles with those of an external lib under active development.

The stories the README doesn't tell

That `ServiceFailureException` that says nothing

I started from the AWS documentation for StartSpeechSynthesisStream: it lists the parameters (Engine, LanguageCode, VoiceId, OutputFormat, ..) and the event types (TextEvent, CloseStreamEvent, AudioEvent), but doesn't explain how to package the bidirectional event-stream body. The first attempt was therefore naive: I built a single event-stream body with TextEvent followed by CloseStreamEvent, signed it with SigV4 in its standard form (HTTP_REQUEST_HEADERS headers and EMPTY_SHA256 payload), and sent it in one shot. AWS Polly's response:

ServiceFailureException: Service is unavailable

That's it. No "this header is missing", no "the body isn't the type I expect", nothing that lets you figure out what's wrong. Always the same response across every combination I tried. Pushing harder on the Polly endpoint by tweaking parameters was therefore pointless: the contract had to be found elsewhere.

I checked botocore's service-2.json file (the same file is in the Java SDK, but only the latter implements it in a client): it's the canonical declaration of the AWS contract, committed to the repos as input for the code generator. For Polly it declares protocol: "rest-json" with protocolSettings: { h2: "eventstream" } and an ActionStream payload of type eventstream. It's the same protocol Transcribe Streaming uses for start-stream-transcription, and for Transcribe a public Python implementation already exists: amazon-transcribe-streaming-sdk (Apache 2.0, awslabs). I read the transcribe-sdk and ported its signing logic to amazon-polly-streaming, adapting it to Polly.
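To give an idea of what "packaging the event-stream body" means, here is a sketch of the generic AWS event-stream framing: a 12-byte prelude (total length, headers length, CRC32 of those 8 bytes), the headers, the payload, and a trailing CRC32 of everything before it. Only string headers are handled, the JSON payload shape is hypothetical, and the per-event signing layer used on bidirectional streams is left out.

```python
import struct
import zlib


def encode_header(name: str, value: str) -> bytes:
    """One string header: name length, name, type 7, value length, value."""
    n, v = name.encode(), value.encode()
    return struct.pack("!B", len(n)) + n + struct.pack("!BH", 7, len(v)) + v


def encode_message(headers: dict[str, str], payload: bytes) -> bytes:
    """Frame headers and payload as a single event-stream message."""
    header_bytes = b"".join(encode_header(k, v) for k, v in headers.items())
    # 12 bytes of prelude (incl. its CRC) + headers + payload + 4 bytes CRC.
    total_length = 12 + len(header_bytes) + len(payload) + 4
    prelude = struct.pack("!II", total_length, len(header_bytes))
    prelude += struct.pack("!I", zlib.crc32(prelude))
    body = prelude + header_bytes + payload
    return body + struct.pack("!I", zlib.crc32(body))


msg = encode_message(
    {
        ":message-type": "event",
        ":event-type": "TextEvent",
        ":content-type": "application/json",
    },
    b'{"Text": "hello"}',  # hypothetical payload shape, for illustration
)
```

Each TextEvent and the final CloseStreamEvent travel as separate messages of this kind on the open HTTP/2 stream, rather than as one pre-built body sent in one shot.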

What I learned (the hard way):

AWS errors like ServiceFailureException don't say what went wrong: a design choice. For AWS services not yet in boto3, you have to go straight to the service-2.json file (in botocore or in the Java SDK, they are identical): faster than debugging parameter by parameter
smithy_aws_core[eventstream] is today the most complete Python reference for the generic part of the AWS HTTP/2 bidirectional event-stream; the event types (for Polly: TextEvent, CloseStreamEvent, AudioEvent) aren't there, whoever builds the client writes them (in this case the Polly client)
the Java SDK v2 client code is generated automatically at build time from service-2.json, it isn't committed in the repo: searching the method name (e.g. startSpeechSynthesisStream) in the source returns only changelogs and the service model, not the real signatures. For the protocol contract, service-2.json stays the canonical source (both in the Java SDK and in botocore)

That pool that worked solo

An HTTPS call to AWS has a cost: before exchanging the first byte of data, there's the TLS handshake and the HTTP/2 setup. A connection pool removes that cost for every call after the first: open once, reuse N times. On a pipeline that calls Polly bidirectional once per finalized sentence it's an immediate win: ~50 ms less median per call, from the second call onwards.

I added the HTTP/2 pool in amazon-polly-streaming v0.2.0 with use_pool=True as the default, and on a single listener it worked fine ..

Then I implemented the multilingual broadcast fan-out: 1 speaker to N listeners, each with its own target language. The test used 2 listeners (en-US and de-DE) and 5 sentences for each of the 2 target languages, so I expected 10 calls to Polly. Instead, half of the calls emitted no audio. The pattern alternated: within one execution one language always "won" and the other always "lost", but across different executions the roles flipped. So it wasn't language-specific: it was specific to the second parallel task of the fan-out iteration (a for target in targets: over an unordered set).

Diagnosis: the v0.2.0 pool kept one and only one HttpClientConnection per (host, port) pair. Under fan-out, two near-simultaneous calls asked the pool for a connection to Polly: the first opened one from scratch, the second received the same connection already open. Both opened a new HTTP/2 stream on the same connection. But Polly bidirectional enforces "1 stream = 1 sentence" and the Polly endpoint accepts only one active bidirectional stream at a time: what I observed was that awscrt queued the second stream until the first one closed. Under fan-out the queue never drained: before the first one finished, the next sentence arrived. From here two moves: one immediate and one structural.

As a lazy developer, the workaround first: POLLY_USE_POOL=false, so every call opened a fresh connection and every call produced audio. Cost: the ~50 ms gained earlier from the pool were lost on every call. The structural fix was a refactor of _ConnectionPool with lease semantics: amazon-polly-streaming v0.3.0 keeps a list of connections per (host, port) instead of a single one, so every fan-out task leases a distinct connection (opened cold the first time, reused after that).
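The lease idea can be sketched with plain asyncio. This is not the package's _ConnectionPool: LeasePool, the fake connect function and the host name are hypothetical, and a real pool would also discard connections that failed instead of handing them back.

```python
import asyncio
from contextlib import asynccontextmanager
from typing import Awaitable, Callable


class LeasePool:
    """Idle connections per (host, port); a leased connection is exclusive."""

    def __init__(self, connect: Callable[[str, int], Awaitable[object]]):
        self._connect = connect
        self._idle: dict[tuple[str, int], list[object]] = {}

    @asynccontextmanager
    async def lease(self, host: str, port: int):
        idle = self._idle.setdefault((host, port), [])
        # Reuse a warm connection if one is free, otherwise open a cold one.
        conn = idle.pop() if idle else await self._connect(host, port)
        try:
            yield conn
        finally:
            idle.append(conn)  # hand it back for the next sentence


async def demo() -> tuple[int, int]:
    opened = 0

    async def fake_connect(host: str, port: int) -> object:
        nonlocal opened
        opened += 1
        return object()

    pool = LeasePool(fake_connect)

    async def speak() -> int:
        async with pool.lease("polly.example", 443) as conn:
            await asyncio.sleep(0.01)  # the stream is active on this connection
            return id(conn)

    first = await asyncio.gather(speak(), speak())  # fan-out: two cold opens
    await asyncio.gather(speak(), speak())          # next sentence: reuse both
    return opened, len(set(first))


opened, distinct = asyncio.run(demo())
```

Two parallel tasks get two distinct connections, and the second sentence reuses them without opening a third.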

Improvement table across iterations. The polly_first_byte_ms metric measures the time between when Translate returns the translated text and the first audio byte arriving from Polly: TLS plus HTTP/2 setup plus Polly's start-up latency. It's not the end-to-end latency perceived by the listener (which also includes server-to-browser forwarding).

Scenario	Median `polly_first_byte_ms` warm
Single listener, no pool	~370 ms
Single listener, with pool (v0.2.0)	~331 ms
Fan-out 1-to-2, workaround without pool (v0.2.0)	~373 ms
Fan-out 1-to-2, with fixed pool (v0.3.0)	~306 ms

The fixed pool in v0.3.0 beats every previous measurement: ~25 ms less on the median than the single-listener pool in v0.2.0. This extra delta comes from pipeline optimizations accumulated across iterations: they are orthogonal to the pool, but they show up in the final result.

That WAF that, thankfully, isn't needed

At the first deploy on EC2 via aws-docker-host, with a public ALB at https://sts.workshop.pandle.net, uvicorn's logs filled up within minutes of the apply with:

POST /hello.world?%ADd+allow_url_include%3d1+%ADd+auto_prepend_file%3dphp://input   404
GET  /vendor/phpunit/phpunit/src/Util/PHP/eval-stdin.php                            404
GET  /vendor/phpunit/Util/PHP/eval-stdin.php                                        404
GET  /phpunit/phpunit/src/Util/PHP/eval-stdin.php                                   404

Tens of requests per minute: no problem, FastAPI answers 404 to all of them. The real risk is different: a targeted bot connects to /ws/speak or /ws/listen and fires Transcribe plus Translate plus Polly at the expense of the AWS account owner. The figure is low per single call but scales linearly with the number of malicious connections.

So, how do you defend yourself ?

Option	Pros	Cons
IP allowlist on the ALB security group	Granular	The audience IPs at the talk are not known in advance
AWS WAF with rules on scanner patterns	Blocks the known noise (scanner UA, PHP paths)	Doesn't block "competent" abuse (bot with browser UA, correct path), and costs 5-10 € / month
Single shared token	Simple to implement	The QR code reaches tens of people, to be treated as a secret
Double token per role	Exposure asymmetry	15 extra lines of code

Decision: double token. SPEAKER_TOKEN protects /ws/speak (the cost driver: Transcribe plus Translate plus Polly for N languages), LISTENER_TOKEN protects /ws/listen (the distribution path via QR code). Independent: the listener token doesn't work for the speaker, and vice versa. If the QR code leaks (photos on social, screenshots, shares), the damage is limited to "anyone can listen", not "anyone can spend the AWS owner's money". The SPEAKER_TOKEN stays in the shell history and in the .env of the deploy.

The design stays minimal at every level. Locally, with no tokens set, authentication is off and nothing changes. The architecture adds no complications: no cookies, no login form, no OAuth, just a string comparison at each connection. And the code fits in a few lines on the server, a flag on the audio client, a URL parameter for the browsers, a few sample environment variables. Good enough for a PoC with frequent token rotation between events.
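A sketch of the per-role check. SPEAKER_TOKEN and LISTENER_TOKEN are the variables named above; is_authorized, the role keys and the "deny when only the other role's token is set" behaviour are assumptions. It also swaps the plain string comparison for hmac.compare_digest, which compares in constant time.

```python
import hmac
from typing import Mapping, Optional

ROLE_ENV = {"speak": "SPEAKER_TOKEN", "listen": "LISTENER_TOKEN"}


def is_authorized(role: str, presented: Optional[str],
                  env: Mapping[str, str]) -> bool:
    """Check the token for one role; auth is off when no token is configured."""
    if not any(env.get(name) for name in ROLE_ENV.values()):
        return True  # local run: no tokens set, nothing to check
    expected = env.get(ROLE_ENV[role], "")
    if not expected or presented is None:
        return False
    # Constant-time comparison avoids leaking the token through timing.
    return hmac.compare_digest(presented.encode(), expected.encode())


env = {"SPEAKER_TOKEN": "speaker-secret", "LISTENER_TOKEN": "qr-code-token"}
listener_ok = is_authorized("listen", "qr-code-token", env)
leak_blocked = not is_authorized("speak", "qr-code-token", env)
```

In a WebSocket handler the presented token would come from the URL parameter, and a False result would close the connection before any AWS call is made.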

What else could be added ?

Signed JWTs in place of static tokens: for prolonged use (always-on service, multiple events) JWTs with TTL per role. If the internet exposure becomes continuous, manually rotating the two static tokens gets tiring.

Subtitles sync: the translated text arrives at the browser as a JSON message before the audio, so it's already on screen when the audio starts. A precise text-to-audio sync (word-by-word highlight) is the next step for accessibility. Polly exposes speech marks exactly for this in the synchronous synthesize_speech; whether the bidirectional API supports them still needs to be checked in service-2.json.

Pause-based hybrid transcription pipeline: to cut the perceived latency between "I'm done speaking" and "first audio byte", the pipeline needs to fire even when a Transcribe partial has been still for N milliseconds, not only when is_partial=False arrives. Worth it only if you really want to optimize timing to the millisecond: the current sentence-bounded handling is enough, and implementing it requires a cancellation logic that's anything but trivial, because when Transcribe corrects a partial, the pipeline may already have fired translation and synthesis: you have to decide whether to let them finish, cancel them, or replace them.

Adoption of amazon-polly-streaming by awslabs: today it's the first public Python implementation of Polly bidirectional. The concrete path is a PR to aws-sdk-python to publish aws-sdk-polly-streaming (sibling of aws-sdk-transcribe-streaming), built on top of the generic primitives of smithy_aws_core[eventstream]. When that client exists, amazon-polly-streaming can be considered deprecated.

When does Iceberg beat Parquet+projection on AWS Glue, and when doesn't it?

Alessandra Bilardi — Sun, 10 May 2026 20:39:08 +0000

Why this project

I built this repo because I didn't have one of this kind yet and, having worked on data ingestion with Glue for a while, I wanted to gather in one place three things: how to structure code so it stays testable, which Firehose and Glue features to use and on what criteria, and a few Docker and Terraform gems I'd always promised myself to slot in somewhere.

Plus, I had never set up Glue streaming from scratch, and for a personal project I needed a test bed to compare Iceberg and Parquet + partition projection on the same data flow and under the same Athena queries, to figure out when one solution wins over the other and why.

This project mixes a lot of the experience I've gathered over the years with a couple of curiosities I hadn't had a chance to test. So there are no real challenges here: I already took those hits long ago. What I'm sharing is deliberate choices, driven by knowing these services inside out.

The architecture in the image describes exactly this project: a Python producer simulating stock tickers, a Kinesis Data Stream as the single entry point, two Firehose streams persisting the same flow in two different formats (Iceberg and Parquet), two Glue jobs that write to both formats (one batch for OHLC computation on 1m and 5m, one streaming for anomaly detection via z-score on a sliding window), and Athena querying both databases.

The choices and why

The goal was to compare Glue batch and Athena on top of an Iceberg-based database and a Parquet + partition projection one.

Choice	Why (less effort)	Discarded alternative (more effort)
Python producer with `boto3.put_records`	Original code, controllable scenarios (`stable`, `trend`, `spike`, `mixed`), pytest tests	Kinesis Data Generator: webapp with Cognito, poorly maintained
Parquet	Partitioned with projection ready to use	The alternative forces you to run a Crawler or schedule MSCK REPAIR TABLE
`--LOAD_DATA_MODE` (`parquet`, `spark`, `iceberg`)	One parameter exposes three read strategies you can compare in the same deploy	Three separate Glue jobs
Wheel + `--additional-python-modules`	Explicit `pip install` at worker boot, `pip install -e .` locally: same import semantics	`--extra-py-files` with zip or wheel: less deterministic across Glue versions
3-line wrapper in `src/glue_jobs/`	3 lines that call `run()` from the wheel: all logic testable in pytest	All code in `script_location`: no pytest on the main scripts

The record schema the producer writes (ticker_symbol, sector, price, change, event_timestamp) isn't something I made up: it's the one from the official AWS Firehose demo. That demo configures a single Firehose; this PoC configures two in parallel, one for Iceberg and one for Parquet+projection, to compare both storages on top of the same source. The Kinesis Data Generator is the tool the demo uses to produce the dataset, but rewriting it as a Python producer with boto3 gave me control over the scenarios (stable, trend, spike, mixed) and made them testable in pytest. The scenarios feed Glue streaming, which handles anomaly detection: spike injects controlled price spikes to validate z-score detection on anomalies, stable and trend act as baseline to avoid false positives.
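The detection idea behind the spike scenario, in plain Python rather than Spark: compare each price with the mean and standard deviation of the previous window. The function name, window size, threshold and sample prices are all hypothetical; the real job does this on DataFrames.

```python
from collections import deque
from statistics import mean, pstdev
from typing import Iterable, Iterator


def zscore_anomalies(
    prices: Iterable[float], window: int = 5, threshold: float = 3.0
) -> Iterator[tuple[int, float]]:
    """Yield (index, z) for prices far from the mean of the previous window."""
    history: deque = deque(maxlen=window)
    for i, price in enumerate(prices):
        if len(history) == window:
            sigma = pstdev(history)
            if sigma > 0:  # a flat window gives no usable scale
                z = (price - mean(history)) / sigma
                if abs(z) > threshold:
                    yield i, z
        history.append(price)


stable = [100.0, 100.2, 99.9, 100.1, 100.0, 100.1]
spike = stable + [130.0]  # one injected price spike at index 6

stable_hits = list(zscore_anomalies(stable))
spike_hits = [i for i, _ in zscore_anomalies(spike)]
```

The stable series produces no hits, while the injected spike is flagged at its index: the same baseline-versus-spike contrast the producer scenarios are meant to exercise.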

As a lazy developer, the criterion is always the same: less effort, in terms of time, code or cost. Two rows of the table deserve a deeper look: --LOAD_DATA_MODE raises the question of read modes, the 3-line wrapper carries the code organization that makes TDD possible. I'll cover them one at a time, starting with reading.

Performance and read modes

To understand why the three LOAD_DATA_MODE values exist, you have to start from the choice of partition projection as the partitioning strategy. The alternative would have been registering the partitions in Glue Catalog via Crawler or MSCK REPAIR TABLE, which lets you read them from Glue with from_catalog and leverage the push-down predicate, up to 5x faster than post-read filtering. On the other hand, GetPartitions can hit API rate limits, whereas S3 LIST scales because it's paginated. Projection skips the registration (the table above reminds you why: less effort), but comes with a constraint:

Partition projection is usable only when the table is queried through Athena. If the same table is read through another service such as Amazon Redshift Spectrum, Athena for Spark, or Amazon EMR, the standard partition metadata is used.

So a Glue job reading the Parquet+projection database via from_catalog would fall back to standard partition metadata, which for a projection table isn't registered in the Catalog: with no partition info available on the Glue side, the result is a full scan that goes nowhere, a dead end. You have to go straight to S3 with spark.read.parquet, leaving Spark to handle partition discovery via LIST of the prefixes. Projection only matters when you query the same table from Athena, where it does its job: no GetPartitions calls to the Catalog, partitions computed in memory from the template.
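"Partitions computed in memory from the template" can be pictured like this: given a location template and a date range, the engine derives the candidate prefixes without asking the Catalog. A toy version, assuming a single date-typed partition key named dt and a hypothetical bucket layout:

```python
from datetime import date, timedelta
from string import Template


def projected_locations(template: str, start: date, end: date) -> list[str]:
    """Expand a ${dt} location template for every day in [start, end]."""
    days = (end - start).days + 1
    return [
        Template(template).substitute(
            dt=(start + timedelta(days=n)).isoformat()
        )
        for n in range(days)
    ]


paths = projected_locations(
    "s3://bucket/parquet_projection/raw/dt=${dt}/",  # hypothetical layout
    date(2026, 4, 23),
    date(2026, 4, 25),
)
```

No GetPartitions call is involved: the list depends only on the template and on the range in the query. A Glue job has no such shortcut and relies on Spark listing the prefixes.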

From here, the three modes of LOAD_DATA_MODE exposed by the Glue batch job:

Mode	What it returns	Extra cost vs `spark`	When it makes sense
`parquet`	Glue DynamicFrame (`from_options(connection_type="s3", format="parquet")`)	Schema discovery on-the-fly + ResolveChoice (explicit encoding of columns with inconsistent types as "choice"); wrapper memory overhead	Raw "messy" data or unstable schema, where the DynamicFrame's flexibility helps
`spark`	Plain DataFrame (`spark.read.parquet(path)`)	No extra overhead: schema is what it is	Parquet data with stable schema, like Firehose-generated. The most direct path
`iceberg`	DynamicFrame from `from_catalog`, but the read goes through Iceberg metadata (manifest list, column statistics)	Reading the manifest list (small fixed cost); in exchange you get file skipping on non-partition filters	Data managed as Iceberg tables with MERGE/UPSERT, and when typical filters are on columns with useful statistics (timestamp, ticker, etc.)

The DynamicFrame's traits are described in the Glue documentation:

A DynamicFrame is similar to a DataFrame, except that each record is self-describing, so no schema is required initially. Instead, AWS Glue computes a schema on-the-fly when required, and explicitly encodes schema inconsistencies using a choice (or union) type.

The access pattern shifts the balance between spark/parquet and iceberg as volume grows:

Access pattern	Small volumes (~1 GB)	Large volumes (50-100 GB, many files)
Full read, no filter	`iceberg` slightly penalized by the fixed cost of the manifest read	`iceberg` comparable: the manifest cost dilutes against total I/O
Filter on partition column	comparable: both do basic pruning	`iceberg` wins: the manifest list is O(1) over partition count, S3 list grows with O(n)
Filter on non-partition column	`iceberg` wins via column statistics in the manifests: skips entire files without opening them	`iceberg` wins clearly: `parquet`/`spark` have to read and filter at runtime

In practice, on large volumes Iceberg wins because it keeps, for each Parquet file, the min and max value of every column. When a query filters (say ticker_symbol = 'AMZN'), the query engine looks at those min/max and immediately knows which files might hold the data and which can't; the discarded files don't even get opened.
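The min/max pruning can be shown with a toy manifest. FileStats and files_to_open are illustrative names and the bounds are made up; real Iceberg manifests store per-column lower and upper bounds in a binary form, but the decision rule is the same.

```python
from dataclasses import dataclass


@dataclass
class FileStats:
    """Per-file column bounds, as kept in Iceberg manifests (simplified)."""
    path: str
    lower: dict[str, str]
    upper: dict[str, str]


def files_to_open(files: list[FileStats], column: str, value: str) -> list[str]:
    """Keep only files whose [min, max] range for the column can hold value."""
    return [
        f.path for f in files
        if f.lower[column] <= value <= f.upper[column]
    ]


manifest = [
    FileStats("a.parquet", {"ticker_symbol": "AAPL"}, {"ticker_symbol": "AMD"}),
    FileStats("b.parquet", {"ticker_symbol": "AMZN"}, {"ticker_symbol": "GOOG"}),
    FileStats("c.parquet", {"ticker_symbol": "MSFT"}, {"ticker_symbol": "TSLA"}),
]
candidates = files_to_open(manifest, "ticker_symbol", "AMZN")
```

Only b.parquet can contain AMZN, so the other two files are never opened. The check says "might contain", not "contains": the engine still filters rows inside the surviving files.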

As a lazy developer I preferred reading the documentation rather than running a generic benchmark, because the access pattern is already clear. Then, case by case, the choice depends on the kind of data access required.

Three-layer TDD on Glue jobs

Glue jobs are notoriously hard to test: you need GlueContext, you need a real Iceberg MERGE INTO, you need Spark configured the way it runs on the worker. I don't give up TDD here either: I split the code into three layers with clear boundaries.

Pure Python logic (argument parsing, naming derivation, producer scenarios): direct pytest, zero AWS or Spark dependencies
Spark core transformations (the OhlcAggregator, ZScoreDetector classes): SparkSession.builder.master("local[1]") as fixture, DataFrames built from literals. The classes are DataFrame-in / DataFrame-out, fully isolated
Orchestrator run(): takes args, spark, glue_context, read_*_fn, write_fn as parameters. Tests pass a mocked GlueContext and test source/sink functions. The principle is "the job builds, the classes consume": all Glue knowledge lives in _cli_entrypoint, which instantiates source and sink before calling run()
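The third layer in miniature: an orchestrator that receives its source and sink as functions, so a test can pass fakes. This is a simplified sketch, not the repo's run(): plain lists of dicts stand in for DataFrames, spark and glue_context are omitted, and every name is hypothetical.

```python
from typing import Callable

Rows = list[dict]


def run(
    args: dict,
    read_fn: Callable[[str], Rows],
    transform_fn: Callable[[Rows], Rows],
    write_fn: Callable[[str, Rows], None],
) -> int:
    """Orchestrate read -> transform -> write; knows nothing about Glue."""
    rows = read_fn(args["source"])
    result = transform_fn(rows)
    write_fn(args["target"], result)
    return len(result)


# A pytest-style usage with in-memory fakes for source and sink.
written: dict[str, Rows] = {}
count = run(
    {"source": "raw", "target": "aggregated_1m"},
    read_fn=lambda name: [{"price": 1.0}, {"price": 3.0}],
    transform_fn=lambda rows: [{"high": max(r["price"] for r in rows)}],
    write_fn=lambda name, rows: written.update({name: rows}),
)
```

In the real job the entrypoint would build the Glue-backed read and write functions and pass them in, which keeps all Glue knowledge out of the code under test.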

What stays out of pytest is just the real integration (Glue Data Catalog, Iceberg MERGE INTO, Kinesis Stream): covered by the JSON files in tests/integration/, which run both locally via docker compose and on AWS via aws glue start-job-run. The same file drives both: no duplication between AWS config and local test scripts.

Alongside, docker-compose.yaml exposes two profiles pointing to the official AWS images, glue4 (Spark 3.3, Python 3.10) and glue5 (Spark 3.5, Python 3.11, Iceberg built-in): make test-integration-local PROFILE=glue5 (default) or PROFILE=glue4. The mount paths differ between the two images (/home/glue_user/ vs /home/hadoop/), but local_test.sh uses relative paths so the same JSON works on both. It's the shortcut to validate the same script on two Glue versions before bumping glue_version.

The Python developer in me is now very satisfied.

What I learned (the hard way)

Firehose with format conversion: 64 MB minimum and cached schemas

Firehose accumulates records in a buffer before writing them to S3, and flushes in two cases: when the buffer reaches a certain size (buffering_size, in MB) or when a certain time passes (buffering_interval, in seconds).

For a while now, the minimum values for these buffers have been lowered: buffering_size starts at 1 MB and buffering_interval at 0 seconds.

For a PoC with small volumes I wanted a quick flush: I set buffering_size = 1 MB and buffering_interval = 60s, counting on the time-based flush to fire before the size threshold was reached.

On the Iceberg Firehose it went smoothly. On the Parquet+projection Firehose, no:

Error: InvalidArgumentException: BufferingHints.SizeInMBs must be at least 64

When a Firehose has format conversion enabled (data_format_conversion_configuration, which converts the incoming JSON to Parquet before writing it to S3), AWS imposes buffering_size >= 64 MB. On the Iceberg Firehose there's no conversion (Iceberg leans on its own native format), so 1 MB is accepted. On Parquet+projection I bumped the value to 64 MB and that was that: the flush stays governed by buffering_interval = 60s, and at PoC volumes the 64 MB never get saturated. Perceived latency unchanged.

Same Parquet+projection Firehose, second round: after apply, records were ending up in s3://bucket/parquet_projection/_firehose_errors/format-conversion-failed/ instead of raw/. Cause: the producer writes event_timestamp as ISO 8601 with T and timezone ("2026-04-23T20:48:32+00:00"), but the OpenXJsonSerDe used by Firehose accepts as Hive timestamp only yyyy-MM-dd HH:mm:ss[.fff]. The Iceberg Firehose accepts ISO 8601 natively, the Parquet+projection one doesn't. Three options:

change the producer to write epoch millis: that was the cleanest, but assuming you can't touch the producer, where would it make sense to handle the conversion downstream ?
add a Lambda processor in Firehose to reformat the timestamp: such a simple operation, repeated on every record, was it really worth bringing in a Lambda ?
type event_timestamp as string in the Glue raw tables, and cast it in Spark via F.to_timestamp("event_timestamp") when needed: when Spark has all the data in hand, it can handle the typing with O(n) complexity but parallelized

Picked the third. The "natural" type lives in the layer where the data is born (raw populated by Firehose, string for portability), the timestamp type appears in aggregated_* and anomalies where DataFrames are already in Spark's hands.
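To make the format mismatch concrete, this is the reshaping that the first two options would have had to perform on every record, shown in plain Python. The function name is made up; the third option avoids it by keeping the string as-is and letting Spark parse it.

```python
from datetime import datetime, timezone


def iso_to_hive(ts: str) -> str:
    """Convert an ISO 8601 timestamp with offset to 'yyyy-MM-dd HH:mm:ss' UTC."""
    parsed = datetime.fromisoformat(ts)
    # Note: a timestamp without offset would be treated as local time here.
    return parsed.astimezone(timezone.utc).strftime("%Y-%m-%d %H:%M:%S")


hive_ts = iso_to_hive("2026-04-23T20:48:32+00:00")
```

The result is "2026-04-23 20:48:32": no T separator and no offset, which is the shape a Hive timestamp column expects.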

After applying the fix, I updated the Glue raw table schema, changing the type of event_timestamp from timestamp to string. terraform apply went through fine, but for the next ~5 minutes the records kept landing in _firehose_errors/. Cause: Firehose caches the schema_configuration of the Glue table to avoid querying the Catalog on every record. AWS documents "up to 15 minutes" of cache; in my tests about 5 minutes were enough before seeing records arrive cleanly in raw/. To skip the wait, terraform apply -replace="aws_kinesis_firehose_delivery_stream.parquet_projection[0]" recreates the delivery stream and clears the cache. For a PoC the wait is fine; in a real case the replace (or aws firehose update-destination directly) is the faster path.

The wheel filename: a story unto itself

In the distant past, before I had local test management, I had the bad idea of providing the Glue job with the wheel renamed to dist/glue_common.whl, so I wouldn't have to touch any configuration on each new upload to S3.

But Glue throws a fit:

LAUNCH ERROR | Installation of Additional Python Modules failed:
ERROR: glue_common.whl is not a valid wheel filename

pip install requires the PEP 427 form: {name}-{version}-{python}-{abi}-{platform}.whl. The unversioned alias doesn't pass validation outside the PyPI context.

So as a lazy developer, what's the best way to do everything automatically without forgetting to upload the new wheel ?

Terraform reads the version dynamically from src/glue_common/__init__.py via regex(), builds the PEP 427 name and uses it as S3 key and source path
on make patch the filename changes, Terraform sees the new file and re-uploads it to S3 by itself

Another satisfying win.
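The same logic, sketched in Python instead of Terraform's regex(): read the version from the package source, build the PEP 427 name, and check a filename against the pattern. The function names are hypothetical, and the py3-none-any tags are my assumption for a pure-Python wheel.

```python
import re

# PEP 427: {name}-{version}[-{build}]-{python}-{abi}-{platform}.whl
WHEEL_RE = re.compile(
    r"^(?P<name>[^-]+)-(?P<version>[^-]+)(-(?P<build>\d[^-]*))?"
    r"-(?P<python>[^-]+)-(?P<abi>[^-]+)-(?P<platform>[^-]+)\.whl$"
)
VERSION_RE = re.compile(r'__version__\s*=\s*["\']([^"\']+)["\']')


def read_version(init_source: str) -> str:
    """Extract the version string from the text of an __init__.py."""
    match = VERSION_RE.search(init_source)
    if match is None:
        raise ValueError("no __version__ found")
    return match.group(1)


def wheel_filename(name: str, version: str) -> str:
    # Assumption: pure-Python wheel, hence the py3-none-any tags
    return f"{name}-{version}-py3-none-any.whl"


def is_valid_wheel_filename(filename: str) -> bool:
    return WHEEL_RE.match(filename) is not None


version = read_version('__version__ = "0.1.0"\n')
print(wheel_filename("glue_common", version))
print(is_valid_wheel_filename("glue_common.whl"))
```

The unversioned glue_common.whl fails the check because it has none of the dash-separated fields, which is the same reason pip rejects it.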

Iceberg on Glue 5.0: two ways to register the catalog

After the wheel fix, the batch job stopped on:

AnalysisException: [TABLE_OR_VIEW_NOT_FOUND]
The table or view 'etl_prototype_demo_iceberg.aggregated_1m' cannot be found

The tables were in the Glue Data Catalog (Terraform had created them, I could see them via aws glue get-tables). What was missing was the bridge between Spark and the Catalog: the keys spark.sql.extensions, spark.sql.catalog.glue_catalog.* and spark.sql.defaultCatalog that tell Spark "for the glue_catalog catalog, use the Iceberg implementation that leans on the Glue Data Catalog".

It's a technical constraint: these keys must be applied before the SparkSession is initialized. Once GlueContext(sc) has created the SparkSession, a runtime spark.conf.set("spark.sql.catalog.glue_catalog", "...") is accepted syntactically, but has no effect: the catalog doesn't get registered and the job answers "Catalog 'glue_catalog' plugin class not found". That was exactly my first attempt long ago, before I diligently read the documentation ..

The Glue documentation for Iceberg lists two equivalent ways to apply the conf in the right place:

Create a key named --conf for your AWS Glue job, and set it to the following value. Alternatively, you can set the following configuration using SparkConf in your script.

Under the hood, the two configurations achieve the same result:

SparkConf in Python code:

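  from awsglue.context import GlueContext
  from pyspark.context import SparkContext

  # Start from the conf of the context Glue provides
  sc = SparkContext()
  conf = sc.getConf()
  conf.set("spark.sql.extensions", "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
  conf.set("spark.sql.catalog.glue_catalog", "org.apache.iceberg.spark.SparkCatalog")
  # ... other conf ...
  # Stop and recreate the context so the new conf is in place before the session exists
  sc.stop()
  sc = SparkContext.getOrCreate(conf=conf)
  glueContext = GlueContext(sc)  # the SparkSession is born here with the right conf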

The configuration lives in the code. The sc.stop() + recreation of the SparkContext is when the configuration gets "injected" before SparkSession init.

--conf in Terraform's default_arguments:

  locals {
    iceberg_spark_conf = join(" --conf ", [
      "spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions",
      "spark.sql.catalog.glue_catalog=org.apache.iceberg.spark.SparkCatalog",
      "spark.sql.catalog.glue_catalog.catalog-impl=org.apache.iceberg.aws.glue.GlueCatalog",
      "spark.sql.catalog.glue_catalog.io-impl=org.apache.iceberg.aws.s3.S3FileIO",
      "spark.sql.catalog.glue_catalog.warehouse=s3://${data.aws_s3_bucket.main.id}/iceberg/",
      "spark.sql.defaultCatalog=glue_catalog",
    ])
  }

Glue parses the concatenated string, applies the configurations at SparkSession boot, and then hands control to the Python script.
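The shape of that concatenated string is easy to get wrong, so here is a small Python sketch that mirrors Terraform's join(" --conf ", [...]) and splits the result back. The two helper names are hypothetical, and the split is only my illustration of the format, not how Glue parses it internally.

```python
ICEBERG_CONF = {
    "spark.sql.extensions": "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions",
    "spark.sql.catalog.glue_catalog": "org.apache.iceberg.spark.SparkCatalog",
    "spark.sql.defaultCatalog": "glue_catalog",
}


def glue_conf_argument(conf: dict[str, str]) -> str:
    """Build the value of the --conf job argument.

    The first pair has no prefix because the argument key is already --conf;
    every following pair brings its own ' --conf ' separator.
    """
    return " --conf ".join(f"{key}={value}" for key, value in conf.items())


def parse_conf_argument(value: str) -> dict[str, str]:
    """Split the concatenated string back into key/value pairs."""
    return dict(pair.split("=", 1) for pair in value.split(" --conf "))


argument = glue_conf_argument(ICEBERG_CONF)
print(argument)
```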

I chose to configure the PoC via Terraform: why ? Three reasons:

a single source of truth: the iceberg_spark_conf local is defined once in Terraform and reused by both the Glue batch and the streaming via --conf = local.iceberg_spark_conf in their respective default_arguments. No per-job duplication, and if I add a third Glue job tomorrow I reuse the same local with a single line
separation of configuration and code: the catalog setup lives in Terraform alongside --datalake-formats=iceberg; the Python code of the jobs doesn't know an Iceberg catalog exists, it imports glue_common, takes spark and glue_context as parameters and runs
low-cost configuration changes: a different warehouse, catalog implementation or IO is touched only in Terraform, with no need to rebuild and re-upload the wheel

The configuration in code, on the other hand, stays handier when the catalog config depends on arguments the job receives at runtime (for instance a warehouse derived from the input bucket name passed as --ARG): in that case the conf is built naturally in the code, since you already have the resolved arguments there. In this PoC the warehouse is fixed per environment, so the configuration in Terraform wins on simplicity.

What else is there to add ?

Once the PoC has been signed off, you start to get serious: there's what was simulated to integrate, and other services and approaches to evaluate:

Real APIs: replace the simulated scenario with a real ingestion. It changes the producer's nature, not the architecture
Apache Flink as an alternative to Glue streaming: it makes sense when you need stricter guarantees on how many times an event is processed (Flink natively supports exactly-once, i.e. each event processed exactly once; Glue streaming is at-least-once and duplicates are handled at the application layer), or when the required latency is sub-second (Glue streaming, working in micro-batches, typically lands in the 5-10 second range; Flink drops to hundreds of milliseconds)
Multi-environment deploy: in a PoC, a single environment is enough. In production you need to separate so you can test feature rollouts without touching live data. So you introduce Terraform Workspaces or per-env modules, with all the implications for account management
CI/CD: in a PoC, manual make test and terraform apply are enough. Working in a team or on mission-critical pipelines you need automation (lint, test, build wheel, terraform plan automatic on every PR) to catch regressions before merge
Cross-account Data Catalog sharing: Lake Formation + RAM + KMS + assume_role. When the data lake aggregates flows from branches, departments, partners, the centralized schema changes everything
Data Management: the evolution of centralized Data Catalog sharing is DataZone or SageMaker Unified Studio, with lineage, asset-level permissions and per-asset documentation
Extra time frames in the batch as roll-up from 5m (1h, 1d), not from raw: each level computes on top of the previous level's output, hence on less data. It's a classic approach (cascade ETL) and works when the higher-level aggregate can be recomputed from the lower level (the high of one hour is the max of the highs of the 5 minutes). It doesn't work if the calculation needs to go back to the original values, like medians or exact distinct counts
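The last point about cascade ETL is easier to see with numbers. A tiny sketch with made-up values (three hypothetical 5-minute buckets): the max survives the roll-up, the median doesn't.

```python
from statistics import median

# Hypothetical values: three 5-minute buckets of raw prices
buckets = [[1, 2, 9], [3, 4, 5], [6, 7, 8]]
raw = [value for bucket in buckets for value in bucket]

# max composes: the max of the bucket maxes is the max of the raw values
high_from_rollup = max(max(bucket) for bucket in buckets)
high_from_raw = max(raw)

# median does not: the median of the bucket medians is not the raw median
median_from_rollup = median(median(bucket) for bucket in buckets)
median_from_raw = median(raw)

print(high_from_rollup, high_from_raw)      # computed two ways, same result
print(median_from_rollup, median_from_raw)  # computed two ways, different results
```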

The lazy developer's code quality

Alessandra Bilardi — Thu, 30 Apr 2026 09:25:08 +0000

A repo to refresh, several rabbit holes to dive into

A while ago, at PyCon IT, I attended a talk that opened my eyes on pytest:

simpler test management, especially for mocks
parametrizable fixtures instead of the setUp / tearDown ritual
bare assert instead of a thousand self.assertEqual

I'd like my repo python-prototype, born for educational purposes, to also be a bit of a template I can pull off the shelf for the next projects.

So, with the excuse of refreshing the testing system with pytest and the packaging with pyproject, I started thinking about adding more.

I had been using black and pylint for a long time, so my first thought was: ok, let's bring in formatting and linting too. But I asked myself: isn't there something better that maintains style (PEP 8), docstrings (PEP 257) and type hints (PEP 484) automatically ?

And the environment, can it be modernized too ? With what ? Well, just like there are two schools, emacs and vi, there are also two schools, poetry and uv .. without even mentioning all the others.

What I needed was something to cover code quality, formatting, packaging and beyond: fewer tasks left to memory or to reading the holy README, more chances they actually get done.

Since there's no "all-inclusive package", the plan was to test what was maintained and maintainable, and find the one most suited to my needs.

Today's chosen stack

Four tools, not ten:

uv: the env manager. One Rust binary in place of pip, venv, pyenv and pipx. With poetry, the last two aren't covered and need to be installed separately: fewer satellite tools around.
ruff: formatting and linting. Replaces black, isort, flake8 and most of pylint. Another Rust binary.
pyright: the type checker. Skipping mypy, pyrefly and ty. For now.
pre-commit: a git-hook that runs ruff and pytest automatically before every commit. Just .. remember to set it up at the start of the project !

The single criterion that drove all these choices is least total effort. Fewer tools = less config = less maintenance. The lazy developer wants the toolchain to break before the commit, in case some step gets forgotten. But without overdoing it: just enough to produce quality code.

Stories from the field

Pylint and the 4.35/10 grade

The first run of pylint on simple-sample stings: 4.35/10. A high school grade, not a teaching repo's. I sit down to fix my JavaScript hangover: myClass becomes my_class (PEP 8 naming), foo and bar and foobar become get_param_processing, get_boolean, get_reverse_protected_param (names that say what they do). Up to 9.41/10.

But before claiming victory, three warnings need a decision:

W0223: abstract method not implemented in a subclass. Pylint flags it as a bug to fix. In my case it MUST fail: it's part of the educational example. I keep it.
C0301: line too long. I look: it's an HTTP link in a docstring, can't be broken. I ignore it.
C0104: names like "foo" and "bar" are disallowed. I could disable the rule globally, but here I prefer having spent the hour of restructuring: variables and methods should be expressive.

Each of these decisions is a "the tool is right about the code but not about the context". And here is where pylint's limit shows up: it tells you what it found, not whether it really needs fixing. The case-by-case judgement stays with you: it doesn't change anything by itself.

Pylint doesn't understand pytest

I go looking for trouble, and run pylint on the test suite: a new warning shows up, W0621 redefining-outer-name, on the fixtures:

@pytest.fixture
def mci():
    return MyClassInterface()

def test_mci_creation(mci):
    assert isinstance(mci, MyClassInterface)

Pylint says "you're redefining mci from the outer scope". But this pattern is the way fixtures work: it's not redefinition, it's parameter injection. Pylint reads the code statically, as plain Python, and doesn't know how pytest runs it.

False positive. The workaround exists:

@pytest.fixture(name="mci")
def mci_fixture():
    return MyClassInterface()

def test_mci_creation(mci):
    assert isinstance(mci, MyClassInterface)

But it's there to silence pylint, not to improve the code. I don't add it. And here I start thinking that pylint is old for pytest, and it's time to switch tool.

Ruff arrives and takes black's place

I try ruff check and ruff format. It covers practically everything black did for formatting, and a good chunk of what pylint did for linting. One binary. Config in pyproject.toml: a single section instead of two. Execution time: milliseconds.

Ruff openly states the trade-off: it's AST-based and works on a single file at a time, it doesn't "read" the class hierarchy across files. So the abstract method not overridden, which I do need to see, doesn't get flagged. Ruff is a fast surface linter, not a deep analyst.

Ok. Ruff takes black's place and covers most of pylint. For what's missing (abstract method, type consistency across files) I need another tool: a type checker.

The type checker tour

Pylint flagged both typing and scoping errors (W0621 is a style check, not a type one). Choosing a type checker, I focus on the typing front: the scoping front stays out of this tour.

I add type hints everywhere, otherwise the type checkers would throw a sea of red (with nothing to check): the signature def get_param_processing(self, param): becomes def get_param_processing(self, param: bool) -> bool:.

Then I run mypy, pyrefly, ty, pyright on the same code to see who flags what.

Tool	Abstract method not implemented	Return None where type hint says bool	Other
mypy	yes	yes	historical, slow
pyrefly	in a different form	yes	lightning fast, young
ty	yes (interface only)	yes	lightning fast, young
pyright	yes	yes	also flags a third error: the method is used in MyClass

Pyright finds more and has a mature ecosystem: Microsoft maintains it actively, and Pylance (the Python extension for VS Code) is built on top of pyright. Pyright wins. Pyrefly and ty are under active development: I'll come back to them later.
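For reference, a minimal sketch of the two error classes in the table. Only the names MyClassInterface and get_param_processing come from the repo; the class bodies here are my assumption, written just to trigger the two cases. What each checker reports, and on which line, differs from tool to tool.

```python
from abc import ABC, abstractmethod


class MyClassInterface(ABC):
    @abstractmethod
    def get_param_processing(self, param: bool) -> bool:
        """Subclasses must implement this."""


class Incomplete(MyClassInterface):
    """Case 1: the abstract method is not implemented."""


class WrongReturn(MyClassInterface):
    def get_param_processing(self, param: bool) -> bool:
        if param:
            return True
        # Case 2: falls through and returns None, despite the '-> bool' hint


def try_instantiate() -> str:
    try:
        Incomplete()  # a type checker can flag this; at runtime it raises
    except TypeError as error:
        return str(error)
    return "no error"


print(try_instantiate())
print(WrongReturn().get_param_processing(False))
```

Python itself only complains about case 1, and only when the class is instantiated. Case 2 runs without a murmur, which is exactly why a type checker earns its place.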

The workflow breaking at the first `make patch`

Setup done. Ruff passes clean. Pyright passes clean. Pre-commit stops me if I forget something. I run make patch for the first "real" release .. and:

make[1]: bump-my-version: No such file or directory

The Makefile was calling bump-my-version directly, and the project's dev-deps were in tests/requirements-test.txt, not in pyproject.toml. So whoever cloned the repo had to know to do a pip install -r tests/requirements-test.txt on top of uv sync, and the release workflow assumed the venv was activated. Too much implicit knowledge, too much hassle.

I'm so used to using uv run that I don't run source .venv/bin/activate anymore, so I tripped over something that would never have happened "the old-fashioned way".

What did it take to truly hand the environment over to uv ? Well, all I needed was to add every dependency in pyproject.toml with:

uv add --dev -r tests/requirements-test.txt

A single command. uv reads the requirements file, writes everything in [dependency-groups].dev of pyproject.toml (the standard introduced by PEP 735 for dev-deps), updates uv.lock, and installs. The tests/requirements-test.txt file becomes redundant: one less file to handle.

And then in the Makefile I added uv run in front of every Python command:

release:
    uv run bump-my-version bump $(PART)
    $(MAKE) changelog
    git tag -f v$$(uv run python -c "from simple_sample import __version__; print(__version__)")
    git push && git push --tags --force

Now make patch works even from a fresh shell, no activation needed. The venv is no longer tribal knowledge, it's implicit in every command.

Seven sections in `pyproject.toml`, one per tool

pyproject.toml was born for packaging, and from there it picked up the config sections of the project's tools: seven in total.

ruff starts from select = ["ALL"]: I enable every available rule and use ignore for the ones I find too much. Philosophy "everything by default, exclude by name": as ruff adds new rules, I get them automatically. And the "ALL" bundle isn't just style + lint: it includes naming (PEP 8), docstring (PEP 257), type annotations (PEP 484, with flake8-annotations), cyclomatic complexity (mccabe), basic security (bandit-base), import order (isort). Ruff isn't "just" a formatter + linter, it's the umbrella under which black + isort + flake8 + parts of pylint, pydocstyle and bandit live.

pyright in typeCheckingMode = "strict": the default mode (standard) lets a lot slide, strict requires complete type hints and explicit returns. It's the mode that surfaces those errors the type checker tour had revealed (and that mypy / pyrefly / ty in default config would have missed).

pytest: minimal config, asyncio_mode = "auto" and testpaths = ["tests"]. The rest lives in the tests themselves.

[dependency-groups].dev: the list of dev-deps with version constraints (PEP 735). uv reads this section for uv sync --group dev.

packaging ([build-system], [project], [tool.setuptools]), bumpversion, git-cliff: handle the release pipeline (metadata + runtime dependencies + wheel and sdist build + versioning + CHANGELOG from conventional commits). A different topic from code quality, but necessary for the modernization and automation goal.

pre-commit lives in .pre-commit-config.yaml (outside pyproject.toml): it points to the official astral-sh/ruff-pre-commit repo for the two ruff hooks (check + format) and keeps a local hook running uv run pytest for the tests. So pre-commit also leans on uv to access the project's venv, just like the Makefile targets.

Plus

The lazy developer adds tools when they're really needed, when it's time to handle some other aspect automatically.

Still on the code quality front, what could be added and when ?

vulture and radon: project-level dead code and complexity reports. When a map of the codebase is needed, for instance before a major refactor: ruff sees the single file, vulture and radon see the whole.
bandit (SAST), pip-audit (SCA) and detect-secrets: if the package becomes an API or handles sensitive data, but here a whole new world opens up ..
mypy in strict mode: a second pass on top of pyright. Today I don't have an example that would push me to add it, pyright strict covers well.
pyrefly and ty: worth re-evaluating especially for projects with many files. They're fast but young.
pre-commit.ci: a service that runs the same hooks in CI on every PR too. For a personal one-maintainer project it's overhead, for a shared repo it would make sense.

Realtime transcription: choices and stories for PyCon IT

Alessandra Bilardi — Mon, 20 Apr 2026 21:23:48 +0000

Why all this interest in realtime transcription

It all started with the collaboration with PyCon IT. At PyCon IT 2025 they set up live transcription with local Whisper on a Graphics Processing Unit (GPU), based on the repo realtime-transcription-fastrtc. With the YouTube videos used as tests, all good. With the real audio of a conference room, Whisper started hallucinating: a generative model, if you give it a signal it doesn't recognize, doesn't leave a blank, it writes something anyway.

For PyCon IT 2026 a different path was needed, on a non-negotiable anchor: no hallucinations. If the model doesn't hear, ok, skip a word. If it hears badly, ok, transcribe badly. But it must not write sentences I didn't say.

Fixing Whisper's hallucinations directly (Voice Activity Detection, tuning decoding parameters, logprob filters, fine-tuning, ..) would have been a separate effort: I didn't have the time, with everything else to build. I haven't tested a bigger Whisper, nor other paid generative Speech To Text (STT) services: they stay in the same category of a model that produces text token after token, so the structural risk of invention stays. To get out of the category, a managed service based on acoustic decoding was needed. And since it's PyCon, let's also grab the bonus of decoupling the pieces and writing it in a testable way.

A model that gets it wrong but doesn't make it up

Let's start with the engine. Then with what's around it.

STT: who gets it wrong, who makes it up

I didn't run empirical benchmarks on the three. The choice played out on two axes: model structure (generative or not) and delivery (self-hosted or managed). The properties in the table come from product documentation and from direct observation of Whisper at PyCon IT 2025, not from A/B tests.

Criterion	Whisper local	Amazon Transcribe Streaming	Paid generative STT
Architecture	generative (autoregressive)	non-generative (acoustic decoding)	generative
Hallucinations structurally possible	yes	no	yes
Delivery	self-hosted	managed	managed
Setup	GPU + model	AWS credentials	credentials
Network dependency	no	yes	yes
Cost	on-site hardware	$0.024/min	variable
Declared latency	1-15s end of segment	~300ms partial	depends

The most important criterion is architecture. A non-generative model cannot, by construction, add words it didn't hear: at worst it skips or gets it wrong. A generative model can. The other criteria (network, cost, latency) are secondary trade-offs, all acceptable for a conference context: there's internet, a 30-minute talk costs ~$0.72, partial results arrive in ~300ms.
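The cost figure is plain arithmetic on the per-minute rate in the table. A sketch to redo it for other talk lengths or more rooms; the function name is hypothetical and the rate is the one quoted here, so check current pricing before budgeting.

```python
TRANSCRIBE_USD_PER_MINUTE = 0.024  # rate quoted in the table above


def transcription_cost(minutes: float, rooms: int = 1) -> float:
    """Estimated Amazon Transcribe Streaming cost in USD, rounded to cents."""
    return round(minutes * TRANSCRIBE_USD_PER_MINUTE * rooms, 2)


print(transcription_cost(30))           # one 30-minute talk
print(transcription_cost(45, rooms=3))  # three rooms, 45 minutes each
```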

Choice: Amazon Transcribe Streaming. Not because it's "the best" in absolute terms, but because it sits in the category that rules out at the root the problem we're here for. I wrote the repo video-to-text specifically to test Transcribe as an alternative to Whisper.

New repo or fork of the old one ?

The other big choice: fork of realtime-transcription-fastrtc (the one already used at PyCon IT 2025), or a new repo that takes only the good pieces from the two predecessors (realtime-transcription-fastrtc and video-to-text) ?

Criterion	Fork	New repo
Initial effort	low	medium
Fragile dependencies inherited	FastRTC v0.0.26	none
Architecture	monolithic to dismantle	designed for the use case
Testability	inherits the existing scope	every component in isolation

Choice: new repo. As a lazy developer one would be tempted to fork, but when a dependency is fragile (FastRTC v0.0.26 isn't a stable standard), a fork could cost more than a targeted rewrite.

From realtime-transcription-fastrtc I keep the screen layout (black background, large text) and the auto-scroll logic of the frontend. From video-to-text I take the transcribe_service.py module and the async pattern with asyncio.Queue + asyncio.gather(). The rest gets dropped.
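For those who haven't met the asyncio.Queue + asyncio.gather() pattern, a minimal sketch of its shape. This is not the code of transcribe_service.py: fake_transcribe is a hypothetical stand-in for the Amazon Transcribe stream, and the sentinel convention is my assumption.

```python
import asyncio


async def audio_source(queue: asyncio.Queue, chunks: list[bytes]) -> None:
    """Producer: push audio chunks, then a sentinel to signal the end."""
    for chunk in chunks:
        await queue.put(chunk)
    await queue.put(None)  # sentinel: no more audio


async def fake_transcribe(queue: asyncio.Queue, transcript: list[str]) -> None:
    """Consumer: hypothetical stand-in for the real Transcribe stream."""
    while (chunk := await queue.get()) is not None:
        transcript.append(f"{len(chunk)} bytes")


async def run_pipeline(chunks: list[bytes]) -> list[str]:
    queue: asyncio.Queue = asyncio.Queue(maxsize=10)
    transcript: list[str] = []
    # Producer and consumer run concurrently, decoupled by the queue
    await asyncio.gather(audio_source(queue, chunks), fake_transcribe(queue, transcript))
    return transcript


print(asyncio.run(run_pipeline([b"\x00" * 3200, b"\x00" * 1600])))
```

The bounded queue is what keeps the two sides honest: if the consumer slows down, the producer waits on put() instead of filling memory.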

Architecture: monolithic or decoupled ?

As a lazy developer, I don't want to redo everything moving from Proof of Concept (PoC) to Minimum Viable Product (MVP). The two predecessors already have pieces that work (the screen layout of realtime-transcription-fastrtc, the transcribe_service of video-to-text), but they're pieces from different repos, made for different purposes. To recycle them, the modules need clear boundaries.

A decoupled architecture here means having three components as three separate processes that talk to each other over the network:

the audio client, which captures audio from the system device and sends it to the server
the server, which receives audio, manages the stream toward Amazon Transcribe, and publishes the text
the display client, which receives the text from the server and shows it on the dedicated monitor

The alternative architecture is a single process (a single running program) that captures, transcribes, displays.

Criterion	Monolithic	Decoupled
Deploy	a single binary	three components
Distribution across multiple computers	no	yes (native)
Testability	internal dependencies	each component in isolation
Communication overhead	none	network calls

Choice: decoupled. It works both in development with everything on one computer (localhost), and at the conference with three separate computers: audio client in the control room near the mixer, server on any computer connected to the network, and display client on the computer that drives the monitor. The monolithic instead locks everything onto a single computer, and the code couples the components: tests and replacements require more work. With more rooms the bill gets worse: you'd need a full copy of the system per room (audio, server, display for each), whereas the decoupled shares a single server across all rooms, and each room only adds an audio-and-display client on the same computer, or, to avoid running a long cable across the room, a second display client near the monitor.

Audio client: browser or standalone ?

The audio to transcribe has different sources depending on the context: laptop microphone in local tests, Universal Serial Bus (USB) or analog mixer in the room, browser loopback for live apps like StreamYard. Who picks up this flow and sends it to the server ?

Two candidates: the browser app with getUserMedia (realtime-transcription-fastrtc's path), or a standalone Python script launched from the audio computer.

Criterion	In the browser	Standalone Python script
System devices (mixer)	limited	full access
Browser dependency	yes	no
Testability	medium	high

Choice: standalone Python with sounddevice. At a conference, audio doesn't come from the speaker's laptop microphone, but from a room mixer or a dedicated microphone connected via USB. The browser's Web Audio APIs don't expose virtual sinks and USB mixers as separate devices. Instead, a Python script with sounddevice sees all the devices the operating system exposes, loopback and mixer included.

Protocol between audio client and server

realtime-transcription-fastrtc used Web Real-Time Communication (WebRTC); video-to-text instead WebSocket (WS). Which makes sense here ?

Criterion	WebRTC	WS
Bidirectionality	required	not needed
Network setup	Network Address Translation (NAT), Traversal Using Relays around NAT (TURN), Interactive Connectivity Establishment (ICE)	none
Reliability	path-dependent	persistent connection
Complexity	high	low

Choice: WS. The audio client sends, the server receives. Bidirectionality isn't needed, so WebRTC is overkill. Persistence, on the other hand, is: a talk lasts tens of minutes, audio goes in chunks every 100ms, and on the server the same pipe keeps the Amazon Transcribe stream open for the whole session. WS covers both without the WebRTC layers.
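To get a feel for what "chunks every 100ms" means on the wire, a sketch of the arithmetic. The 100ms comes from the text; the 16 kHz sample rate and 16-bit mono PCM are my assumptions, not values taken from the repo.

```python
SAMPLE_RATE_HZ = 16_000  # assumption, not stated in the article
BYTES_PER_SAMPLE = 2     # assumption: 16-bit mono PCM
CHUNK_MS = 100           # from the text: one chunk every 100ms


def chunk_size_bytes() -> int:
    """Bytes of audio carried by one 100ms message."""
    return SAMPLE_RATE_HZ * BYTES_PER_SAMPLE * CHUNK_MS // 1000


def split_chunks(pcm: bytes) -> list[bytes]:
    """Cut a PCM buffer into messages; the last one may be shorter."""
    size = chunk_size_bytes()
    return [pcm[start:start + size] for start in range(0, len(pcm), size)]


one_second = bytes(SAMPLE_RATE_HZ * BYTES_PER_SAMPLE)
print(chunk_size_bytes(), len(split_chunks(one_second)))
```

At ten messages per second, a 30-minute talk is 18,000 small messages on one connection, which is the kind of traffic a persistent WS handles without ceremony.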

Transcript channel between server and display

realtime-transcription-fastrtc used Server-Sent Events (SSE); video-to-text WS. Which here ?

Criterion	SSE	WS
Fits the case	yes	yes
Tech already in use	no	yes (for audio)
Duplicate code	a second handler	same stack

Choice: WS. SSE would technically be enough (unidirectional server -> client, fine for the transcript). But WS is already in the house for the audio channel: keeping a single technology means a single stack of handlers server-side and a single client-side library, instead of two.

Partial results vs final

Amazon Transcribe sends both partials (text that changes until the segment is stable) and finals (stable). To compare the two delivery modes in the field, the display supports both via the ?partial=true|false flag: picked at runtime, not at build.

Criterion	Partial on by default	Partial off by default
Readability on the monitor	low (changing text)	high
Perceived latency	good	medium

Choice: off by default. A dedicated monitor with text that writes, erases and rewrites is unpleasant to look at. Partials can be turned on via ?partial=true on the display if, in a specific room, the delay of finals becomes bothersome.
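The difference between the two modes, sketched in Python for brevity (the real display is JS in the browser). The message shape, a dict with text and is_partial, is my assumption for illustration, not the actual protocol of the repo.

```python
def texts_to_display(messages: list[dict], show_partial: bool = False) -> list[str]:
    """Return the lines on screen after all messages have been handled."""
    lines: list[str] = []
    pending_partial = False
    for message in messages:
        if message["is_partial"] and not show_partial:
            continue  # default: wait for the stable text
        if pending_partial:
            lines[-1] = message["text"]  # overwrite the line still being revised
        else:
            lines.append(message["text"])
        pending_partial = message["is_partial"]
    return lines


messages = [
    {"text": "hello", "is_partial": True},
    {"text": "hello wor", "is_partial": True},
    {"text": "hello world", "is_partial": False},
    {"text": "next sentence", "is_partial": False},
]
print(texts_to_display(messages))
print(texts_to_display(messages, show_partial=True))
```

Both modes end on the same lines: what changes is how many times the last one gets rewritten on the way there, and that rewriting is what the audience sees.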

Language: zero restart between talks

Amazon Transcribe wants the language when opening the stream (language_code="it-IT" or "en-US"). At PyCon, rooms have consecutive talks in different languages: Italian, English. Two paths: language as a global server configuration, or as a parameter per connection of the audio client.

Criterion	Global in the server	Per-room parameter
Language change between talks	server restart	zero restart
Scalability to multiple rooms in parallel	all same language	each room its own

Choice: per-room parameter. With the global version, a restart would be needed at every language change (or a proxy that discriminates per path, complicating things). With the per-room parameter, the server stays up for the whole day, and the audio client reopens at the next talk with the right language (?lang=it-IT or ?lang=en-US). And it also works with multiple rooms in parallel: each room has its own language, independent of the others.

Concretely: every WS connection is an independent handler on FastAPI, and each opens its own Amazon Transcribe stream with its own language. There's no shared state between different streams, so the language of one room cannot affect another.
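The per-connection parameter boils down to reading ?lang= when the connection opens. A stdlib sketch of that step, outside FastAPI: the function name, the example URL path and the en-US default are my assumptions; only the two language codes come from the text.

```python
from urllib.parse import parse_qs, urlparse

SUPPORTED_LANGUAGES = {"it-IT", "en-US"}  # the two languages named in the text


def language_from_url(url: str, default: str = "en-US") -> str:
    """Read ?lang=... from a connection URL, falling back to a default."""
    query = parse_qs(urlparse(url).query)
    language = query.get("lang", [default])[0]
    if language not in SUPPORTED_LANGUAGES:
        raise ValueError(f"unsupported language: {language}")
    return language


# Hypothetical endpoint path, for illustration only
print(language_from_url("ws://localhost:8000/ws/audio?lang=it-IT"))
```

Rejecting an unknown code at connection time is cheaper than discovering it when the Transcribe stream refuses to open.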

Display: dynamic app or static HTML ?

In this case, the display is what the audience looks at: a dedicated monitor with text scrolling as it arrives. It must update in real time receiving messages from the server, but does nothing else: no forms, no interaction.

Two paths: a dynamic app (React, Vue or similar, with build and state management), or a static HTML page with a bit of JS that opens a WS and appends text.

Criterion	Dynamic app	Static HTML + JS
Client-side state	possible	only via WS
Deploy	requires build	file served by the server
Reuse from `realtime-transcription-fastrtc`	no	yes (CSS + JS)

Choice: static HTML. No client-side state needed: the browser opens the page, receives text via WS, shows it. No build. And the CSS of realtime-transcription-fastrtc's screen mode gets reused as is.

Choices at a glance

The realtime-transcription choices don't come out of nowhere: some are new decisions for the live use case, others are pieces lifted from the two predecessors. Here they are in a row, with the source of inspiration. For the sequence diagram with WS endpoints and message flow, see the README of the repo.

Choice	Winning option	Criterion	Source
STT	Amazon Transcribe Streaming	no hallucinations	`video-to-text` (transcribe_service)
Repo	new	less tech debt	new
Architecture	decoupled (3 components)	reuse from predecessors, deploy flexibility	new
Audio client	standalone Python	full access to system devices	new
Audio protocol	WS	persistent connection, minimal network setup	new
Transcript channel	WS	single stack server + client	`video-to-text`
Partial vs final	flag `?partial=true|false`	readability on the monitor
Language	per room	zero restart between talks, scales to more rooms	new
Display	static HTML	no build, reuse of existing work	`realtime-transcription-fastrtc` (CSS + JS `screen` mode)

The stories you only find when you plug things in

The real fun starts when you stop drawing and turn on the machines.

The device number on Fedora

The first time I ran uv run python -m audio_client --list-devices I found myself facing a long list with the same hardware (my headphones in the docking station jack) showing up multiple times, with similar names and different IDs. On Linux several audio layers coexist (ALSA at the kernel, JACK for pro audio, PipeWire as a modern sound server) and sounddevice lists them all: each exposes the same device, each is a candidate on paper.

Backend	Device ID	Outcome
ALSA	1	doesn't work as one might expect
JACK	25	doesn't work as one might expect
PipeWire (system default)	20	works (it's the active routing of the system)

There's no logic that helps you pick a priori: it depends on what the system uses as default routing. On Fedora 41 it's PipeWire, so the "right" ID was 20. I tried all three before figuring out the logic.

Rule of thumb: if the audio doesn't get where it should, try all the candidates before touching the code.
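To shorten the list of candidates before trying them one by one, a small filter helps. A sketch on a hypothetical device list shaped like the dicts that sounddevice.query_devices() returns (name and max_input_channels are real keys; the sample entries are made up, and input_candidates is not part of audio_client).

```python
def input_candidates(devices: list[dict], keyword: str = "") -> list[tuple[int, str]]:
    """List (device ID, name) for devices that can capture audio."""
    return [
        (device_id, device["name"])
        for device_id, device in enumerate(devices)
        if device["max_input_channels"] > 0
        and keyword.lower() in device["name"].lower()
    ]


# Hypothetical list, shaped like the output of sounddevice.query_devices()
devices = [
    {"name": "HDA Intel PCH: ALC (hw:0,0)", "max_input_channels": 2},
    {"name": "HDMI output", "max_input_channels": 0},
    {"name": "pipewire", "max_input_channels": 64},
]
for device_id, name in input_candidates(devices):
    print(device_id, name)
```

It doesn't tell you which ID is the right one, it only removes the entries that can't capture at all: the trial round stays, just shorter.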

The browser loopback

One of the audio sources to transcribe is StreamYard, which is a browser app: the speaker's audio goes out of the browser to the system's default sink. audio_client with sounddevice can capture from system devices (microphone, USB mixer), but can't read directly from an app's output. A bridge is needed: a virtual sink the browser writes to, and whose monitor audio_client reads from.

On Linux with PipeWire (or PulseAudio) the bridge is module-null-sink. You load a sink called loopback, you move the browser's stream onto it, you point audio_client at the null-sink's monitor. It works on the first try, but there's a side effect: while the browser's stream is on the null-sink, I can't hear it on my headphones anymore. In the room it's not a problem (audio comes from the physical mixer, not from the laptop browser). In development, yes: I can't verify what I'm transcribing.

I tried three paths: two deaf, one hearing clearly.

Approach	audio_client hears	Headphones hear	Notes
`module-null-sink` + move browser	yes	no	base setup, muted on the laptop
`module-combine-sink` with slaves	no	yes	failed
`module-null-sink` + `module-loopback` as a parallel branch	yes	yes (+~50ms)	adopted solution

The path that works is module-loopback as a parallel branch. The null-sink loopback stays the source for audio_client; on top of it you load a module-loopback that reads from the null-sink's monitor and writes to the default sink. Two independent consumers on the same monitor, neither blocks the other.

The ~50ms is module-loopback's buffer. For the transcription nothing changes: the audio_client branch stays instant. The 50ms is only the delay between what leaves the browser and what I hear in headphones.
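A minimal sketch of the three steps above, assuming pactl is available (PulseAudio, or PipeWire through its PulseAudio compatibility layer). It is not the repo's Makefile: the function name, the stream index and latency_msec=50 are assumptions for illustration, and only the sink name loopback comes from the text. The function builds the commands and executes nothing.

```python
from typing import List


def loopback_commands(stream_index: int, monitor: bool = False,
                      sink_name: str = "loopback",
                      latency_msec: int = 50) -> List[List[str]]:
    """Build the pactl calls for the null-sink bridge (nothing is executed)."""
    commands = [
        # 1. virtual sink the browser will write to
        ["pactl", "load-module", "module-null-sink", f"sink_name={sink_name}"],
        # 2. move the browser's stream (a sink input) onto the virtual sink
        ["pactl", "move-sink-input", str(stream_index), sink_name],
    ]
    if monitor:
        # 3. parallel branch: null-sink monitor -> default sink
        #    (assumption: with no sink= argument the default sink is used)
        commands.append([
            "pactl", "load-module", "module-loopback",
            f"source={sink_name}.monitor",
            f"latency_msec={latency_msec}",
        ])
    return commands


if __name__ == "__main__":
    # 42 is a made-up stream index
    for argv in loopback_commands(stream_index=42, monitor=True):
        print(" ".join(argv))
```

To actually apply them, each list can be passed to subprocess.run(argv, check=True); the browser's stream index is the first column of pactl list short sink-inputs.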

Everything is wrapped in two make commands: make loopback_redirect APP=firefox (which also accepts MONITOR=1 for the listening branch to headphones) and make loopback_clean that cleans up.

Practical choice: default MONITOR=0. At the conference audio comes from the mixer, not the laptop, so hearing it locally isn't needed. MONITOR=1 is a development luxury.

How much hardware do you need ?

I haven't benchmarked the system on specific hardware yet, so I'm basing this on typical sizes of similar Python applications. Better to oversize than to pick the bare minimum: on a real deploy you want margin, not to crash on the first spike.

Component	RAM (estimate)	Recommended example	Notes
Audio client	~50-100MB	Pi 4 2GB with USB mic	Pi 3 technically enough but tight
Server	~100-200MB base + ~30-50MB per room	EC2 t4g.small (2GB, ARM) or Pi 4 4-8GB	Pi 4 handles 1-2 rooms; EC2 for more
Display client	~200-300MB for Chromium	Pi 4 4GB	Pi 4 2GB technically enough but tight
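To make the server row concrete, here is the arithmetic as a tiny sketch. The constants are the estimates from the table above, not measurements, and the function name is just for illustration.

```python
from typing import Tuple

# Estimates from the table above (MB), not benchmarks.
SERVER_BASE_MB = (100, 200)
SERVER_PER_ROOM_MB = (30, 50)


def server_ram_mb(rooms: int) -> Tuple[int, int]:
    """Return the (low, high) RAM estimate in MB for a server with `rooms` rooms."""
    if rooms < 0:
        raise ValueError("rooms must be >= 0")
    low = SERVER_BASE_MB[0] + rooms * SERVER_PER_ROOM_MB[0]
    high = SERVER_BASE_MB[1] + rooms * SERVER_PER_ROOM_MB[1]
    return low, high


for rooms in (1, 2, 3):
    low, high = server_ram_mb(rooms)
    print(f"{rooms} room(s): ~{low}-{high} MB")
# 1 room(s): ~130-250 MB
# 2 room(s): ~160-300 MB
# 3 room(s): ~190-350 MB
```

Even at the high end the numbers stay well under 2GB, which is why the recommendation is about margin rather than about the bare minimum.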

Three deploy scenarios:

Scenario	Recommended device	When and why
All separate	Pi 4 2GB (audio) + EC2 t4g.small (server) + Pi 4 4GB (display)	Multi-room conference; server in cloud for sharing
All together	A laptop with 8GB, or a Pi 4 8GB with USB mic	Development, local demo
Audio + server together, display separate	Pi 4 8GB (audio+server) + Pi 4 4GB (display)	A single room, zero cloud; the audio Pi also hosts the server

For one room, two Pis are enough. With a Pi 5 (server) you can push to 2-3 rooms; beyond that, EC2 is the way. EC2 or a more powerful laptop are natural upgrades anywhere, if you want more margin.

Anything else to add ?

What's there today is good enough for one room, with any computer connected to the network. But the design can grow beyond that, when it's worth it.

More rooms, same setup

If many rooms in parallel are needed, the infrastructure can be handled with aws-docker-host, which spins up an Elastic Compute Cloud (EC2) instance with Docker ready to use. The realtime-transcription server already ships with docker compose, and the opening image describes exactly this scenario.

When one EC2 isn't enough: ECS Fargate

If there are many rooms and the load varies, a single static EC2 becomes tight. Fargate (part of Elastic Container Service, ECS) spins up tasks on demand and shuts them down when they're no longer needed. But live transcription lives on long-lived WebSocket (WS) connections, and according to the AWS documentation there are some points to configure with care (I haven't tested them on the project):

Sticky sessions: a one-hour WS connection must stay on the same Fargate task. The Application Load Balancer (ALB) supports WS and, per its documentation, WS connections are inherently sticky: the task that accepts the upgrade keeps the connection until it closes. No per-packet round-robin.
Idle timeout: the ALB default is 60 seconds of inactivity (it's a load balancer attribute, not a target group one). A 20-second pause between sentences isn't inactivity (the client sends silence every 100ms), but it's worth raising the timeout to a few minutes for safety.
Graceful shutdown: during a deploy or a scale-in, the task that's closing must let open Transcribe streams finish, not cut off mid-talk. The container must handle SIGTERM and close the WSs gracefully, giving the client time to reconnect to a different task.
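The graceful shutdown point can be sketched with plain asyncio. This is a sketch under assumptions, not the project's code: ClosableWS stands for whatever WS object the server framework provides (anything with an async close(code, reason)), and drain and serve_until_sigterm are hypothetical names. Close code 1001 is the standard "going away" code, which tells the client to reconnect elsewhere.

```python
import asyncio
import signal
from typing import Iterable, Protocol


class ClosableWS(Protocol):
    """Anything with an async close(code, reason), as most WS libraries offer."""

    async def close(self, code: int = 1000, reason: str = "") -> None: ...


async def drain(connections: Iterable[ClosableWS], timeout: float = 10.0) -> int:
    """Close every open WS with 1001; return how many closed within the timeout."""
    tasks = [
        asyncio.ensure_future(ws.close(code=1001, reason="server restarting"))
        for ws in connections
    ]
    if not tasks:
        return 0
    done, pending = await asyncio.wait(tasks, timeout=timeout)
    for task in pending:
        task.cancel()  # too slow: give up on this connection
    return sum(1 for task in done if task.exception() is None)


async def serve_until_sigterm(connections: Iterable[ClosableWS]) -> int:
    """Wait for SIGTERM (Unix only), then drain the open connections."""
    stop = asyncio.Event()
    asyncio.get_running_loop().add_signal_handler(signal.SIGTERM, stop.set)
    await stop.wait()
    return await drain(connections)
```

On ECS the container gets SIGTERM first and SIGKILL once the task's stopTimeout expires, so the drain timeout should stay below that value.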

Authentication on the WebSockets

Today the WSs are open: anyone who knows /ws/audio/{sala} can inject audio, anyone who knows /ws/transcript/{sala} can listen. For a deploy in a Local Area Network (LAN) or a private cloud on a Virtual Private Network (VPN) it's perfectly fine. On the public internet you need at least:

a token in the path or query (e.g. ?token=...), validated at connect
rate limit per Internet Protocol (IP) on the audio channel
permission separation: whoever can write on room X may not necessarily be allowed to read it

These are the minimum requirements to expose the endpoints on the public internet.
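A minimal sketch of the first and third points (token at connect, read/write separation), using only the standard library. The token table, its values and the authorize function are hypothetical; in a real deploy the tokens would come from configuration or a secret store. Rate limiting is left out.

```python
import hmac
from typing import Dict, Optional, Set, Tuple
from urllib.parse import parse_qs, urlsplit

# Hypothetical token table: token -> (room, allowed actions).
TOKENS: Dict[str, Tuple[str, Set[str]]] = {
    "speaker-secret": ("room-x", {"write"}),
    "display-secret": ("room-x", {"read"}),
}


def authorize(url: str, room: str, action: str,
              tokens: Dict[str, Tuple[str, Set[str]]] = TOKENS) -> bool:
    """Check ?token=... at connect time: right room and right action."""
    query = parse_qs(urlsplit(url).query)
    supplied: Optional[str] = query.get("token", [None])[0]
    if supplied is None:
        return False
    for known, (allowed_room, actions) in tokens.items():
        # constant-time comparison, so timing doesn't leak the token
        if hmac.compare_digest(supplied.encode(), known.encode()):
            return allowed_room == room and action in actions
    return False
```

With this table, the speaker's token can write on room-x but can't read its transcript, and a connection without a token is refused before any audio flows.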

Docker on EC2 with Terraform

Alessandra Bilardi — Fri, 10 Apr 2026 22:25:12 +0000

Why this project

I was preparing a workshop and needed to expose a URL with a specific interface, sparing participants from installing Docker or anything else on their machines.

I built the workshop locally with docker compose, which is one of the ways to develop and test locally: it works, it's fast, it's reproducible. And then?

Then you need to move everything to the cloud. And as a lazy developer, why not use that same docker compose?

The point isn't running Docker in the cloud - it's everything around it: HTTPS, custom domain, machine access, data backups, and the ability to rebuild or tear it all down with one command.

With Infrastructure as Code (IaC) you can manage HTTPS, custom domain, backups, access and cleanup smoothly: everything in one place, versioned, reproducible. Without IaC, you start from scratch every time.

The usual options:

Manual EC2 setup: SSH in, install Docker, configure nginx, certbot, and pray. Slow, fragile, and hard to reproduce.
ECS/Fargate: task definition, service discovery, cluster .. for what ? Using Fargate for a single container is like hiring a moving truck to carry your groceries home.
Docker on EC2 with Terraform: one terraform apply to spin up, one bash scripts/destroy.sh to tear down. Backups included.

The third option is what I chose because it has the simplest architecture .. and the most complex part depends on your user data !

The architecture in the image above is generated directly from the Terraform code (spoiler) in the repo, where you can find the README.md and all the details to use it.

But let's take it step by step. The third option can be implemented in 1024 different ways: which IaC tool ? How do you handle HTTPS ? How do you access the machine ? Where do you store backups ? How do you manage DNS ? Which AMI ? It depends. The point is asking the right questions.

As a lazy developer, every choice follows one criterion: less effort, in terms of time, cost, or both. And when less effort isn't enough to decide, the cleanest path is a minimal system: you know what's there, you know what's missing, no surprises.

Why Terraform and not CDK

	Terraform	CDK
Language	HCL: declarative, simple	TypeScript/Python: powerful but verbose for simple infra
State	Local file, zero dependencies	Requires CloudFormation stack, S3 bucket for assets
Bootstrap	`terraform init`	`cdk bootstrap` already creates resources in your AWS account
Learning curve	Low for simple infra	Need to know both CDK and CloudFormation .. and their quirks
Destruction	`terraform destroy`: clean, predictable	`cdk destroy`, which sometimes leaves orphaned resources

For an ephemeral workshop run by one person, Terraform with local state is the minimum effort. CDK makes sense when the infra grows, you need complex logic, or there's a team involved.

The choices and why

Choice	Why (less effort)	The discarded alternative (more effort)
ALB + ACM	Free HTTPS certificate, auto-renewal, no certbot/nginx	Let's Encrypt on EC2: port 80 open, cron for renewal, more moving parts
SSM instead of SSH	No keys, no port 22, audit trail on CloudTrail	SSH key pair, SG rules, bastion if private subnet
S3 for backups	Costs next to nothing, survives the EC2, simple CLI	EBS snapshot: tied to instance lifecycle, harder to restore
Route 53 hosted zone	DNS validation for ACM, alias record for ALB, all managed by Terraform	External DNS only: manual certificate validation or HTTP challenge
Amazon Linux 2023 minimal	Clean AMI, you install only what you need	AL2023 standard: doesn't have Docker anyway, but has hundreds of extra packages you don't need
`docker compose up --build`	Works with both `build` and `image`	Separate logic for build vs pull: pointless complexity
Local state	The workshop is ephemeral, one operator, no team	Remote state (S3 + DynamoDB): cost and setup for zero benefit
Conditional VPC	Three modes: use an existing VPC, find the default, or create a new one	Always new VPC: waste for a workshop running in the default VPC
Conditional S3 bucket	Pass one and it uses it. Don't, and it creates one named after the domain	Always new bucket: waste for someone running many workshops and just managing backups

What I learned (the hard way)

The right AMI and how much disk

As a lazy developer, instead of reading the documentation, I ran one command to see what's out there:

aws ec2 describe-images \
  --filters "Name=name,Values=al2023-ami-*-x86_64" \
  --owners amazon \
  --query 'reverse(sort_by(Images, &CreationDate))[:10].[Name, BlockDeviceMappings[0].Ebs.VolumeSize]' \
  --output table

Three variants: minimal (2 GB), standard (8 GB), ECS-optimized (30 GB). The ECS one comes with Docker but is meant to run in an ECS cluster, not on a standalone EC2. Standard and minimal don't have Docker: you need to install it either way.
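For anyone who prefers Python to JMESPath, the same selection can be sketched on the Images list that describe-images returns (for example the "Images" key of boto3's describe_images response). The function name is hypothetical; sorting the CreationDate strings works because they are ISO 8601 timestamps.

```python
from typing import Any, Dict, List, Tuple


def newest_images(images: List[Dict[str, Any]],
                  limit: int = 10) -> List[Tuple[str, int]]:
    """Newest-first (Name, root volume size in GB), like the --query above."""
    ordered = sorted(images, key=lambda image: image["CreationDate"], reverse=True)
    return [
        (image["Name"], image["BlockDeviceMappings"][0]["Ebs"]["VolumeSize"])
        for image in ordered[:limit]
    ]
```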

At that point, what does the standard have that minimal doesn't ? SSM agent and a few hundred packages you don't need. The package comparison page confirms it: no Docker, no buildx, nothing that changes the picture.

Minimal is the cleanest choice: install Docker, SSM agent and buildx in the user data, and you know exactly what's on the machine. One thing to watch: the 2 GB disk isn't enough, set volume_size = 20 and move on.

ssm-user is not root

When you connect with aws ssm start-session, you're ssm-user. You don't have access to the Docker socket. Everything needs sudo. Commands sent with aws ssm send-command run as root though, so no sudo is needed there.

buildx: no buildx, no build

From Docker Compose v2.17+ the --build flag requires buildx >= 0.17.0. The minimal AMI doesn't have it. Without buildx, docker compose up --build fails even if no service uses build: install it in the user data and forget about it.

That damn cache

After a destroy + redeploy, the new Route 53 hosted zone gets different nameservers. You update the NS records on the DNS provider, everything looks fine. But the browser says no.

dig @8.8.8.8 tells you it's all good. But your local resolver disagrees.

What happens: your ISP's resolver has the old SERVFAIL cached, and until it expires, that domain doesn't exist as far as it's concerned.

The fix: temporarily switch your local DNS to Google (8.8.8.8) and wait for your provider's cache to expire: they say 5-10 minutes, but sometimes (way) longer.

Anything else to add ?

When it's not a workshop of a few hours but something that lasts weeks or months, it's worth investing extra effort to make the system hold up over time. But remember, it's always a temporary solution !

More subdomains: more applications on the same ALB, with routing rules, separate target groups, and potentially more containers on the same EC2 or, if needed, dedicated EC2s per service
Tactical scheduling: start/stop the EC2 to save money off-hours, periodic backups with EventBridge + SSM, not just at destroy
CloudWatch alarms: basic monitoring (CPU, disk, health check) with SNS notifications
Auto-recovery: ASG with min=max=1 to replace dying instances (user data restores everything from S3)
Spot instances: for workshops that tolerate interruptions, ~70% cost reduction
