DEV Community

Wilians Conde

Processing High-Frequency Solar Data Without HPC: Real Constraints and Design Decisions in MackSun

Solar activity directly impacts Earth, from GPS accuracy to power systems.

MackSun was designed to process billions of high-frequency solar data points under strict hardware constraints, without relying on HPC infrastructure.

The platform is available at:
https://www.macksun.org

The problem

Instruments such as POEMAS (https://www.macksun.org/pages/wiki/arquivos-telescopios.html) operate with acquisition intervals of around 10 milliseconds, roughly 100 samples per second per channel. This enables detailed analysis of solar activity, but also produces a continuous, high-volume stream of data.

This creates a set of concrete challenges:

  • continuous ingestion under load
  • long-term storage of billions of records
  • memory and I/O limitations
  • processing under constant pressure

In most scenarios, this would require distributed systems or HPC clusters. Here, the system had to work without that.
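To get a feel for the scale, here is a back-of-the-envelope sketch of the data rate implied by a 10 ms cadence. The figures are illustrative only; real instruments have observation windows, duty cycles, and gaps, which is why the effective daily volume is lower:

```python
# Back-of-the-envelope: data volume at a 10 ms acquisition cadence.
# Illustrative only; real acquisition has gaps and limited windows.
SAMPLE_INTERVAL_S = 0.01   # ~10 ms between samples
SECONDS_PER_DAY = 86_400

samples_per_second = 1 / SAMPLE_INTERVAL_S                  # 100 Hz
samples_per_day = int(samples_per_second * SECONDS_PER_DAY)

print(f"{samples_per_second:.0f} samples/s per channel")    # 100 samples/s
print(f"{samples_per_day:,} samples/day per channel")       # 8,640,000 samples/day
# An uninterrupted day at this cadence would yield ~8.6 million points
# per channel; real observation windows bring the effective figure down.
```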

Data origin

The data used in MackSun is not synthetic.

It comes from real solar observation instruments located in South America, operated at the CASLEO observatory in Argentina.

These instruments are managed by CRAAM, part of Mackenzie Presbyterian University in Brazil.

This matters because:

  • data is generated under real observational conditions
  • acquisition is continuous and subject to physical constraints
  • system behavior is influenced by real hardware

This is not a controlled environment. It is a live acquisition scenario.

Infrastructure limits

The system runs under a constrained but well defined setup:

  • single Linux server
  • 16 vCPU
  • 32 GB of RAM in total
  • 4 GB reserved for the operating system
  • 16 GB allocated to MongoDB running in sharded mode
  • 12 GB allocated to the ingestion pipeline container

The MongoDB allocation is not arbitrary. It was defined based on limits observed during experimental validation.

Even on a single machine, MongoDB showed better performance in sharded mode. This is not an assumption; it was experimentally validated, and the results were published in Astronomy and Computing:

https://www.sciencedirect.com/science/article/pii/S221313372500126X

These limits are enforced. The system is designed to operate within them.

Data scale

The current volume is around:

  • 3 billion data points
  • continuous ingestion from solar instruments
  • original data retained at its native high frequency

At this scale, uncontrolled growth leads to instability.

The system must control:

  • memory usage
  • write patterns
  • data organization
  • query behavior

Partitioning strategy

The system enforces a strict limit:

about 150 million data points per collection

Beyond this:

  • performance degrades
  • queries slow down
  • memory pressure increases

Data is therefore split across multiple collections.

This is required for stability.
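One simple way to enforce such a cap is to route each data point to a collection derived from a running index. This is only a sketch; the naming scheme and routing logic below are hypothetical, not MackSun's actual implementation:

```python
# Sketch: cap each collection at ~150 million documents by routing
# writes to partitioned collection names. Names and routing are
# illustrative, not MackSun's actual scheme.
MAX_DOCS_PER_COLLECTION = 150_000_000

def collection_for(instrument: str, global_doc_index: int) -> str:
    """Map a data point's running index to a bounded collection name."""
    partition = global_doc_index // MAX_DOCS_PER_COLLECTION
    return f"{instrument}_part_{partition:04d}"

print(collection_for("poemas", 0))             # poemas_part_0000
print(collection_for("poemas", 149_999_999))   # poemas_part_0000
print(collection_for("poemas", 150_000_000))   # poemas_part_0001
```

Queries then target only the partitions that overlap the requested time or index range, keeping working sets small.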

Ingestion model

The ingestion process is not real-time.

It runs as a sequential pipeline with five stages, executed once per day.

This approach:

  • avoids continuous load pressure
  • keeps resource usage predictable
  • simplifies failure handling

We chose batch processing over real-time ingestion. This sacrifices low-latency availability, but guarantees stability.
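The daily batch run can be sketched as a plain sequential pipeline: each stage runs to completion before the next starts, so peak resource usage is bounded by one stage at a time. The five stage names below are hypothetical; the article only states that five stages execute in order, once per day:

```python
# Sketch of a sequential daily batch pipeline. Stage names and logic
# are hypothetical stand-ins for MackSun's five stages.
from datetime import date

def fetch_raw(day):
    return {"day": day, "raw": [1.0, 2.0, 3.0]}   # stand-in for instrument files

def validate(batch):
    assert batch["raw"], "empty batch"            # fail fast on bad input
    return batch

def transform(batch):
    batch["scaled"] = [x * 10 for x in batch["raw"]]
    return batch

def load(batch):
    batch["loaded"] = True                        # stand-in for the MongoDB write
    return batch

def consolidate(batch):
    batch["summary"] = sum(batch["scaled"]) / len(batch["scaled"])
    return batch

STAGES = [fetch_raw, validate, transform, load, consolidate]

def run_daily(day):
    result = day
    for stage in STAGES:
        result = stage(result)   # one stage at a time, never concurrently
    return result

out = run_daily(date(2024, 1, 1))
print(out["summary"])   # 20.0
```

Because the stages are strictly sequential, a failure in any stage stops the run cleanly and the whole day can simply be re-run.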

Precomputed datasets

On-demand processing is not viable under these constraints.

One day of observation generates around:

  • 5 million data points

Processing this during a request would:

  • increase latency
  • consume too much memory
  • destabilize the system

The system generates daily datasets in advance.

Each dataset is:

  • processed
  • consolidated
  • stored in a ready-to-serve format

Datasets are available at:
https://www.macksun.org

Structure and format are documented here:
https://www.macksun.org/pages/wiki/arquivos-telescopios.html

We chose precomputed datasets instead of on-demand processing. This reduces flexibility, but ensures consistent performance.
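The precomputation idea can be illustrated with a minimal sketch: collapse raw high-frequency samples into consolidated records ahead of time, so a request only reads the finished result. The window size and record shape here are assumptions, not MackSun's actual dataset format:

```python
# Sketch: precompute a daily dataset by aggregating raw samples into
# fixed windows. Window size and record shape are illustrative.
from statistics import mean

def consolidate_day(samples, window=100):
    """Aggregate raw samples into per-window means (ready to serve)."""
    return [
        {"window": i // window, "mean": mean(samples[i:i + window])}
        for i in range(0, len(samples), window)
    ]

raw = [float(i % 10) for i in range(300)]   # 300 fake 10 ms samples
daily = consolidate_day(raw)
print(len(daily))         # 3 consolidated records instead of 300
print(daily[0]["mean"])   # 4.5
```

At request time the server only reads the small consolidated dataset; the expensive aggregation over millions of raw points happened once, during the daily batch run.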

Trade-offs

This architecture makes explicit decisions:

  • Real-time vs stability: no real-time processing, in exchange for predictable execution
  • Flexibility vs predictability: no arbitrary queries over raw data; structured access through prepared datasets instead
  • Infrastructure vs engineering: no hardware scaling; more control over data and processing instead

We chose sharding on a single server. This is not the typical approach, but it was experimentally validated.

We chose precomputation instead of real time processing. This reduces flexibility, but guarantees stability.

Why this works

The system works because it enforces limits.

  • collections are bounded
  • memory usage is controlled
  • ingestion and access are separated
  • heavy processing is done in advance

Instead of relying on infrastructure scaling, the system relies on controlled behavior.

Final thoughts

MackSun shows that it is possible to process billions of records without HPC, but only if constraints are treated as part of the design.

This requires:

  • strict partitioning
  • controlled ingestion
  • precomputed outputs
  • disciplined resource usage

Explore the datasets and see how MackSun handles billions of records under constrained hardware:

https://www.macksun.org
