Integrated Biological Data Collection Platform: An Architecture for Automated Curation of Public Repositories

#architecture #automation #dataengineering #science

Introduction

The volume of biological data deposited in public repositories is growing rapidly. The Gene Expression Omnibus (GEO), NCBI Gene, PubMed, and UniProt accumulate thousands of new records daily, including sequences, expression profiles, scientific articles, and functional annotations. This is a unique opportunity for biomedical research, but the diversity of data formats, access protocols, and metadata models creates a significant barrier: each source requires a specific collector, distinct rate-limiting strategies, and its own validation logic. The lack of standardization in how the data is stored also compromises the reproducibility of scientific studies. In practice, ad hoc solutions such as isolated scripts for individual repositories duplicate work and are hard to maintain. What is needed is an architecture that treats data collection as a service rather than a collection of scattered artifacts.

This work presents Project 1 of the Integrated Bioinformatics Platform: a containerized Biomedical Data Collector coupled with a Data Lake. Its objective is to provide a REST API capable of triggering asynchronous data collections from the four aforementioned sources, storing immutable raw data in MinIO, and persisting metadata in PostgreSQL, all while ensuring traceability and resilience.

Development

The system architecture is divided into three main layers. The first is the API and orchestration layer, implemented using FastAPI. Its five endpoints (POST /collections, GET /collections, GET /collections/{id}, GET /collections/{id}/download/{dataset_id}, and GET /health) expose a clean interface for initiating and monitoring collection processes. The second layer is the collector engine, composed of abstract classes and concrete implementations for GEO, NCBI Gene, PubMed, and UniProt. Each collector follows the same lifecycle: fetch, validate, upload, and generate metadata. The third layer is the Data Lake, built on MinIO for object storage and PostgreSQL for relational metadata management. Docker Compose orchestrates the three primary services (API, PostgreSQL, and MinIO) as well as a one-shot job responsible for creating the raw-data bucket.
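The shared collector lifecycle can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the names BaseCollector, FakeUniProtCollector, collect, and filename are hypothetical, a plain dict stands in for MinIO, and the fake collector returns a hard-coded payload instead of calling UniProt.

```python
import json
from abc import ABC, abstractmethod


class BaseCollector(ABC):
    """Hypothetical base class: fetch -> validate -> upload -> metadata."""

    source: str    # e.g. "geo", "ncbi_gene", "pubmed", "uniprot"
    filename: str  # name of the raw file written for each collection

    def __init__(self, store: dict) -> None:
        # Assumption: a dict keyed by object path stands in for MinIO.
        self.store = store

    @abstractmethod
    def fetch(self, external_id: str) -> bytes:
        """Download the raw payload from the external repository."""

    @abstractmethod
    def validate(self, payload: bytes) -> None:
        """Raise ValueError if the payload is malformed."""

    def upload(self, external_id: str, filename: str, payload: bytes) -> str:
        key = f"raw/{self.source}/{external_id}/{filename}"
        self.store[key] = payload
        return key

    def collect(self, external_id: str) -> dict:
        # The four lifecycle steps, in order.
        payload = self.fetch(external_id)
        self.validate(payload)
        data_key = self.upload(external_id, self.filename, payload)
        metadata = {
            "source": self.source,
            "external_id": external_id,
            "files": [data_key],
        }
        self.upload(external_id, "metadata.json", json.dumps(metadata).encode())
        return metadata


class FakeUniProtCollector(BaseCollector):
    """Stand-in collector that performs no network access."""

    source = "uniprot"
    filename = "entry.fasta"

    def fetch(self, external_id: str) -> bytes:
        return f">{external_id}\nMKT\n".encode()

    def validate(self, payload: bytes) -> None:
        if not payload.startswith(b">"):
            raise ValueError("payload is not FASTA")


store: dict = {}
meta = FakeUniProtCollector(store).collect("P12345")
```

Adding a new source under this sketch means writing only fetch and validate; the upload path and metadata step are inherited.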

The collection workflow was designed with resilience in mind. When a user submits a POST request containing a source and an external_id, the system creates a Collection(status=pending) record in PostgreSQL and launches an asynchronous background task. This task updates the status to running, downloads data from the external source using configurable retries with exponential backoff, validates both format and content, uploads the data to MinIO under the path raw/{source}/{external_id}/, and generates a metadata.json file alongside the collected data. Finally, it inserts the corresponding Dataset records and marks the collection as completed. If any step fails, the status is changed to failed and the error message is preserved for later inspection.
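The retry behaviour in the download step could look something like the sketch below. The function name retry_with_backoff and its parameters (max_attempts, base_delay, max_delay, retry_on) are assumptions made for illustration, not the project's configuration options. The delay doubles after each failed attempt, is capped at max_delay, and gets a small random jitter.

```python
import asyncio
import random
from typing import Awaitable, Callable, Tuple, Type, TypeVar

T = TypeVar("T")


async def retry_with_backoff(
    operation: Callable[[], Awaitable[T]],
    *,
    max_attempts: int = 4,
    base_delay: float = 0.5,
    max_delay: float = 8.0,
    retry_on: Tuple[Type[BaseException], ...] = (ConnectionError, TimeoutError),
    sleep: Callable[[float], Awaitable[None]] = asyncio.sleep,
) -> T:
    """Await operation(), retrying on the listed exceptions.

    Delays grow as base_delay * 2**(attempt - 1), capped at max_delay,
    plus up to 10% random jitter.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return await operation()
        except retry_on:
            if attempt == max_attempts:
                raise  # out of attempts: let the caller mark the collection failed
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            await sleep(delay + random.uniform(0, delay * 0.1))
    raise RuntimeError("unreachable")


# Usage: an operation that fails twice before succeeding.
calls = {"n": 0}
delays = []


async def flaky_download() -> str:
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("temporary network error")
    return "ok"


async def record_sleep(seconds: float) -> None:
    delays.append(seconds)  # record the delay instead of actually waiting


result = asyncio.run(
    retry_with_backoff(flaky_download, base_delay=1.0, sleep=record_sleep)
)
```

Passing sleep as a parameter keeps the example fast to run and makes the backoff schedule easy to inspect in tests.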

The technology stack was chosen with speed, reproducibility, and testability in mind. The uv dependency manager was chosen for its speed and reproducible lockfile mechanism. SQLAlchemy 2.0 in asynchronous mode, combined with Alembic migrations, provides a robust persistence layer. Pydantic v2 integrates seamlessly with FastAPI, allowing the same schemas to be reused for validation and automatic documentation. The test suite uses pytest and pytest-asyncio, with respx mocking the HTTP requests made through httpx so that tests do not depend on external repositories. Meanwhile, ruff unifies linting and formatting, and mypy provides static type checking across the codebase.

The project directory structure reflects a clear separation of responsibilities. The app/ directory contains the submodules models/, schemas/, api/, collectors/, storage/, and utils/. Each collector resides in its own package (geo/, ncbi_gene/, pubmed/, and uniprot/), which makes it easier to add new data sources in the future. The test suite mirrors the same organization, with dedicated modules for collectors, API functionality, and storage components.

Several design decisions are particularly important for scientific reproducibility. First, raw data immutability is enforced: once stored in MinIO, files are never modified. Every collection operation generates new files, even when the same external_id has already been collected. Likewise, the accompanying metadata.json file contains fields such as source, date, version, and parameters, so the provenance of each file can be traced. Additionally, fixing the random seed with random.seed(42) makes any sampling procedure that relies on Python's random module repeatable. Because the upstream repositories themselves change over time, a later collection may return different content; what the platform ensures is that each stored snapshot stays unchanged and can be traced back to the parameters and date that produced it.
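A provenance record along these lines might be assembled as in the sketch below. The article names the fields source, date, version, and parameters; everything else here is an assumption for illustration, including the function name build_metadata, the external_id and files entries, the default version string, and the placeholder accession "GSE0000".

```python
import hashlib
import json
from datetime import datetime, timezone


def build_metadata(
    source: str,
    external_id: str,
    files: dict,
    parameters: dict,
    collector_version: str = "0.1.0",  # hypothetical version string
) -> str:
    """Return the text of a metadata.json describing one collection.

    files maps a file name to its raw bytes; each entry is recorded with
    its size and SHA-256 checksum so later reads can be verified.
    """
    record = {
        "source": source,
        "external_id": external_id,
        "date": datetime.now(timezone.utc).isoformat(),
        "version": collector_version,
        "parameters": parameters,
        "files": [
            {
                "name": name,
                "size": len(data),
                "sha256": hashlib.sha256(data).hexdigest(),
            }
            for name, data in sorted(files.items())
        ],
    }
    # sort_keys keeps the serialized layout stable between runs.
    return json.dumps(record, indent=2, sort_keys=True)


metadata_text = build_metadata(
    source="geo",
    external_id="GSE0000",  # placeholder accession, not a real series
    files={"series.soft": b"^SERIES = GSE0000\n"},
    parameters={"format": "soft"},
)
```

Storing the checksum next to the parameters lets a downstream project confirm that the bytes it reads from the Data Lake are the bytes that were originally collected.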

The data model itself is concise and functional. The Collection table stores a UUID identifier, source (geo, ncbi_gene, pubmed, or uniprot), external ID, status (pending, running, completed, or failed), MinIO storage path, JSONB metadata, error messages, and creation/update timestamps. The Dataset table, linked through a foreign key to Collection, records each individual file, including its name, format, size, SHA-256 checksum, and full MinIO path.
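To make the two tables concrete, here is a rough sketch using plain dataclasses rather than the project's SQLAlchemy models. Attribute names such as minio_path and error_message, and the transition helper with its allowed-status table, are assumptions for illustration; only the columns and status values listed above come from the description.

```python
import uuid
from dataclasses import dataclass, field
from datetime import datetime, timezone
from enum import Enum
from typing import List, Optional


class Source(str, Enum):
    GEO = "geo"
    NCBI_GENE = "ncbi_gene"
    PUBMED = "pubmed"
    UNIPROT = "uniprot"


class Status(str, Enum):
    PENDING = "pending"
    RUNNING = "running"
    COMPLETED = "completed"
    FAILED = "failed"


# Assumed lifecycle: pending -> running -> completed or failed.
_ALLOWED = {
    Status.PENDING: {Status.RUNNING},
    Status.RUNNING: {Status.COMPLETED, Status.FAILED},
    Status.COMPLETED: set(),
    Status.FAILED: set(),
}


def _now() -> datetime:
    return datetime.now(timezone.utc)


@dataclass
class Dataset:
    collection_id: uuid.UUID  # plays the role of the foreign key
    name: str
    format: str
    size: int
    sha256: str
    minio_path: str


@dataclass
class Collection:
    source: Source
    external_id: str
    id: uuid.UUID = field(default_factory=uuid.uuid4)
    status: Status = Status.PENDING
    minio_path: Optional[str] = None
    metadata: dict = field(default_factory=dict)  # JSONB in PostgreSQL
    error_message: Optional[str] = None
    created_at: datetime = field(default_factory=_now)
    updated_at: datetime = field(default_factory=_now)
    datasets: List[Dataset] = field(default_factory=list)

    def transition(self, new_status: Status, error: Optional[str] = None) -> None:
        """Move to new_status, rejecting transitions outside the lifecycle."""
        if new_status not in _ALLOWED[self.status]:
            raise ValueError(f"cannot go from {self.status.value} to {new_status.value}")
        self.status = new_status
        self.error_message = error
        self.updated_at = _now()


collection = Collection(source=Source.PUBMED, external_id="12345")
collection.transition(Status.RUNNING)
collection.transition(Status.FAILED, error="upstream timeout")
```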

Researchers can track the progress of any collection using simple HTTP requests. A Bash script using curl and jq can initiate a collection, poll for completion, and list the generated datasets in just a few lines. Both synchronous and asynchronous Python clients are also supported, as demonstrated in the usage documentation.
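The polling pattern can also be expressed as a small synchronous Python helper. This is a sketch under assumptions, not the project's client: the helper names, the "status" field in the JSON response, and any base URL passed to fetch_collection are hypothetical. Only the GET /collections/{id} route and the four status values come from the description above.

```python
import json
import time
import urllib.request
from typing import Callable

TERMINAL_STATUSES = {"completed", "failed"}


def fetch_collection(base_url: str, collection_id: str) -> dict:
    """GET /collections/{id} and decode the JSON body (assumed shape)."""
    url = f"{base_url}/collections/{collection_id}"
    with urllib.request.urlopen(url, timeout=10) as response:
        return json.load(response)


def wait_for_collection(
    collection_id: str,
    get: Callable[[str], dict],
    *,
    interval: float = 2.0,
    timeout: float = 300.0,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Poll get(collection_id) until the status is completed or failed."""
    deadline = time.monotonic() + timeout
    while True:
        record = get(collection_id)
        if record.get("status") in TERMINAL_STATUSES:
            return record
        if time.monotonic() >= deadline:
            raise TimeoutError(f"collection {collection_id} did not finish in time")
        sleep(interval)


# Offline demonstration: a canned sequence of responses replaces the API.
_responses = iter(
    [{"status": "pending"}, {"status": "running"}, {"status": "completed"}]
)
final = wait_for_collection(
    "demo-id",
    get=lambda _id: next(_responses),
    sleep=lambda _seconds: None,  # skip real waiting in the demonstration
)
```

Against a running instance, the get argument could be something like `lambda cid: fetch_collection("http://localhost:8000", cid)`, where the address is a placeholder for wherever the API is exposed.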

Conclusion

Project 1 establishes the foundation upon which the remaining fifteen projects of the platform will be built. It addresses the challenge of fragmented biological data sources by providing a unified API, immutable storage, and standardized metadata management. Future projects focused on genomics, transcriptomics, proteomics, biomarker discovery, biological networks, molecular docking, and related domains will be able to consume curated and traceable datasets directly from the Data Lake. This reduces the operational bottleneck of writing a collector for every analysis and protects the integrity, documentation, and reproducibility of raw data, the cornerstone of any scientific analysis. Project 1 is therefore more than a data collector; it serves as the backbone of the entire Integrated Bioinformatics Platform.
