DEV Community

Wilians Conde

Processing High-Frequency Solar Data Without HPC: Real Constraints and Design Decisions in MackSun

Solar activity directly impacts Earth, from GPS accuracy to power systems.

MackSun was designed to process billions of high-frequency solar data points under strict hardware constraints, without relying on HPC infrastructure.

The platform is available at:
https://www.macksun.org

The problem

Instruments such as POEMAS (https://www.macksun.org/pages/wiki/arquivos-telescopios.html) operate with acquisition intervals of around 10 milliseconds, roughly 100 samples per second per channel. This enables detailed analysis of solar activity, but also produces a continuous, high-volume stream of data.

This creates a set of concrete challenges:

  • continuous ingestion under load
  • long-term storage of billions of records
  • memory and I/O limitations
  • processing under constant pressure

In most scenarios, this would require distributed systems or HPC clusters. Here, the system had to work without that.
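To get a feel for the scale, here is a back-of-the-envelope sketch of the data rate implied by a 10 ms cadence. The figures are illustrative only; real instruments have observation windows, duty cycles, and gaps, which is why the effective daily volume is lower:

```python
# Back-of-the-envelope: data volume at a 10 ms acquisition cadence.
# Illustrative only; real acquisition has gaps and limited windows.
SAMPLE_INTERVAL_S = 0.01   # ~10 ms between samples
SECONDS_PER_DAY = 86_400

samples_per_second = 1 / SAMPLE_INTERVAL_S                  # 100 Hz
samples_per_day = int(samples_per_second * SECONDS_PER_DAY)

print(f"{samples_per_second:.0f} samples/s per channel")    # 100 samples/s
print(f"{samples_per_day:,} samples/day per channel")       # 8,640,000 samples/day
# An uninterrupted day at this cadence would yield ~8.6 million points
# per channel; real observation windows bring the effective figure down.
```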

Data origin

The data used in MackSun is not synthetic.

It comes from real solar observation instruments located in South America, operated at the CASLEO observatory in Argentina.

These instruments are managed by CRAAM, part of Mackenzie Presbyterian University in Brazil.

This matters because:

  • data is generated under real observational conditions
  • acquisition is continuous and subject to physical constraints
  • system behavior is influenced by real hardware

This is not a controlled environment. It is a live acquisition scenario.

Infrastructure limits

The system runs under a constrained but well defined setup:

  • single Linux server
  • 16 vCPU
  • 32 GB of RAM in total
  • 4 GB reserved for the operating system
  • 16 GB allocated to MongoDB running in sharded mode
  • 12 GB allocated to the ingestion pipeline container

The MongoDB allocation is not arbitrary. It was defined based on limits observed during experimental validation.

Even on a single machine, MongoDB showed better performance in sharded mode. This is not an assumption; it was experimentally validated, and the results were published in Astronomy and Computing:

https://www.sciencedirect.com/science/article/pii/S221313372500126X

These limits are enforced. The system is designed to operate within them.

Data scale

The current volume is around:

  • 3 billion data points
  • continuous ingestion from solar instruments
  • original data retained at its native high frequency

At this scale, uncontrolled growth leads to instability.

The system must control:

  • memory usage
  • write patterns
  • data organization
  • query behavior

Partitioning strategy

The system enforces a strict limit:

about 150 million data points per collection

Beyond this:

  • performance degrades
  • queries slow down
  • memory pressure increases

Data is therefore split across multiple collections.

This is required for stability.
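One simple way to enforce such a cap is to route each data point to a collection derived from a running index. This is only a sketch; the naming scheme and routing logic below are hypothetical, not MackSun's actual implementation:

```python
# Sketch: cap each collection at ~150 million documents by routing
# writes to partitioned collection names. Names and routing are
# illustrative, not MackSun's actual scheme.
MAX_DOCS_PER_COLLECTION = 150_000_000

def collection_for(instrument: str, global_doc_index: int) -> str:
    """Map a data point's running index to a bounded collection name."""
    partition = global_doc_index // MAX_DOCS_PER_COLLECTION
    return f"{instrument}_part_{partition:04d}"

print(collection_for("poemas", 0))             # poemas_part_0000
print(collection_for("poemas", 149_999_999))   # poemas_part_0000
print(collection_for("poemas", 150_000_000))   # poemas_part_0001
```

Queries then target only the partitions that overlap the requested time or index range, keeping working sets small.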

Ingestion model

The ingestion process is not real-time.

It runs as a sequential pipeline with five stages, executed once per day.

This approach:

  • avoids continuous load pressure
  • keeps resource usage predictable
  • simplifies failure handling

We chose batch processing over real-time ingestion. This sacrifices low-latency availability, but guarantees stability.
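The daily batch run can be sketched as a plain sequential pipeline: each stage runs to completion before the next starts, so peak resource usage is bounded by one stage at a time. The five stage names below are hypothetical; the article only states that five stages execute in order, once per day:

```python
# Sketch of a sequential daily batch pipeline. Stage names and logic
# are hypothetical stand-ins for MackSun's five stages.
from datetime import date

def fetch_raw(day):
    return {"day": day, "raw": [1.0, 2.0, 3.0]}   # stand-in for instrument files

def validate(batch):
    assert batch["raw"], "empty batch"            # fail fast on bad input
    return batch

def transform(batch):
    batch["scaled"] = [x * 10 for x in batch["raw"]]
    return batch

def load(batch):
    batch["loaded"] = True                        # stand-in for the MongoDB write
    return batch

def consolidate(batch):
    batch["summary"] = sum(batch["scaled"]) / len(batch["scaled"])
    return batch

STAGES = [fetch_raw, validate, transform, load, consolidate]

def run_daily(day):
    result = day
    for stage in STAGES:
        result = stage(result)   # one stage at a time, never concurrently
    return result

out = run_daily(date(2024, 1, 1))
print(out["summary"])   # 20.0
```

Because the stages are strictly sequential, a failure in any stage stops the run cleanly and the whole day can simply be re-run.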

Precomputed datasets

On-demand processing is not viable under these constraints.

One day of observation generates around:

  • 5 million data points

Processing this during a request would:

  • increase latency
  • consume too much memory
  • destabilize the system

The system generates daily datasets in advance.

Each dataset is:

  • processed
  • consolidated
  • stored in a ready-to-serve format

Datasets are available at:
https://www.macksun.org

Structure and format are documented here:
https://www.macksun.org/pages/wiki/arquivos-telescopios.html

We chose precomputed datasets instead of on-demand processing. This reduces flexibility, but ensures consistent performance.
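The precomputation idea can be illustrated with a minimal sketch: collapse raw high-frequency samples into consolidated records ahead of time, so a request only reads the finished result. The window size and record shape here are assumptions, not MackSun's actual dataset format:

```python
# Sketch: precompute a daily dataset by aggregating raw samples into
# fixed windows. Window size and record shape are illustrative.
from statistics import mean

def consolidate_day(samples, window=100):
    """Aggregate raw samples into per-window means (ready to serve)."""
    return [
        {"window": i // window, "mean": mean(samples[i:i + window])}
        for i in range(0, len(samples), window)
    ]

raw = [float(i % 10) for i in range(300)]   # 300 fake 10 ms samples
daily = consolidate_day(raw)
print(len(daily))         # 3 consolidated records instead of 300
print(daily[0]["mean"])   # 4.5
```

At request time the server only reads the small consolidated dataset; the expensive aggregation over millions of raw points happened once, during the daily batch run.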

Trade-offs

This architecture makes explicit decisions:

  • Real-time vs stability: no real-time processing, in exchange for predictable execution
  • Flexibility vs predictability: no arbitrary queries over raw data; structured access through prepared datasets instead
  • Infrastructure vs engineering: no hardware scaling; more control over data and processing instead

We chose sharding on a single server. This is not the typical approach, but it was experimentally validated.

We chose precomputation instead of real time processing. This reduces flexibility, but guarantees stability.

Why this works

The system works because it enforces limits.

  • collections are bounded
  • memory usage is controlled
  • ingestion and access are separated
  • heavy processing is done in advance

Instead of relying on infrastructure scaling, the system relies on controlled behavior.

Final thoughts

MackSun shows that it is possible to process billions of records without HPC, but only if constraints are treated as part of the design.

This requires:

  • strict partitioning
  • controlled ingestion
  • precomputed outputs
  • disciplined resource usage

Explore the datasets and see how MackSun handles billions of records under constrained hardware:

https://www.macksun.org
