<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: TANIYAMA Ryoji</title>
    <description>The latest articles on DEV Community by TANIYAMA Ryoji (@taniyama).</description>
    <link>https://dev.to/taniyama</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3592105%2F5a70b64b-94b1-45da-8b6d-60b5e82c6913.webp</url>
      <title>DEV Community: TANIYAMA Ryoji</title>
      <link>https://dev.to/taniyama</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/taniyama"/>
    <language>en</language>
    <item>
      <title>The CloudAIoT 3-Layer Reference Architecture</title>
      <dc:creator>TANIYAMA Ryoji</dc:creator>
      <pubDate>Sat, 29 Nov 2025 05:38:07 +0000</pubDate>
      <link>https://dev.to/taniyama/the-cloudaiot-3-layer-reference-architecture-3kg4</link>
      <guid>https://dev.to/taniyama/the-cloudaiot-3-layer-reference-architecture-3kg4</guid>
      <description>&lt;p&gt;&lt;em&gt;A Separation-of-Concerns Approach for Real-Time IoT and Robotics Systems&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Ryoji Taniyama&lt;/strong&gt;&lt;br&gt;
CEO &amp;amp; Founder, Takumi Labs Inc.&lt;br&gt;
38 years in network engineering | RISS Certified&lt;/p&gt;


&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;When a single device is responsible for sensing, control, networking, and cloud synchronization, failure is not a possibility; it is an inevitability. And the worst part is not any one problem, but the impossibility of determining which subsystem actually failed.&lt;/p&gt;

&lt;p&gt;After deploying IoT nodes to enterprise customers for years, I reached an uncomfortable conclusion: the conventional approach to IoT architecture is fundamentally flawed.&lt;/p&gt;

&lt;p&gt;The industry has embraced single-board computers like Raspberry Pi as the default solution for edge computing. Tutorials, courses, and countless GitHub repositories reinforce this pattern. Yet in production environments—where systems must run 24/7 for years without intervention—this approach consistently fails.&lt;/p&gt;

&lt;p&gt;This article presents the &lt;strong&gt;CloudAIoT 3-Layer Reference Architecture&lt;/strong&gt;, born from real-world failures and iterative refinement. It is not theoretical. Every principle described here emerged from deploying, failing, replacing, and ultimately succeeding in customer environments.&lt;/p&gt;

&lt;p&gt;The core insight is simple: &lt;strong&gt;real-time control, edge processing, and cloud connectivity must be strictly separated&lt;/strong&gt;. When they are mixed on a single device, stability becomes impossible to guarantee.&lt;/p&gt;


&lt;h2&gt;
  
  
  Why We Abandoned Raspberry Pi
&lt;/h2&gt;

&lt;p&gt;We initially deployed Raspberry Pi units as edge nodes for enterprise customers. The appeal was obvious: full Linux environment, GPIO access, strong community support, abundant documentation.&lt;/p&gt;

&lt;p&gt;Within months, the problems began.&lt;/p&gt;
&lt;h3&gt;
  
  
  Thermal Instability
&lt;/h3&gt;

&lt;p&gt;Raspberry Pi units in enclosed spaces—server rooms, factory floors, retail back offices—experienced thermal throttling under sustained load. CPU temperatures regularly exceeded safe thresholds. Passive cooling was insufficient; active cooling introduced noise, moving parts, and additional failure points.&lt;/p&gt;

&lt;p&gt;We tried heat sinks. We tried cases with ventilation. We tried duty cycle management. None provided the reliability our customers required.&lt;/p&gt;
&lt;h3&gt;
  
  
  Non-Deterministic Behavior
&lt;/h3&gt;

&lt;p&gt;Worse than thermal issues was unpredictability. A Pi running motor control, sensor polling, network communication, and data logging simultaneously exhibited random latency spikes. OS updates could stall real-time loops. Wi-Fi reconnection attempts blocked critical code paths.&lt;/p&gt;

&lt;p&gt;When a failure occurred, diagnosis was difficult. Was it thermal? Network? SD card corruption? Kernel scheduling? The monolithic design made root cause analysis nearly impossible.&lt;/p&gt;
&lt;h3&gt;
  
  
  The Decision
&lt;/h3&gt;

&lt;p&gt;We replaced every Raspberry Pi node deployed to customers. This was expensive and embarrassing, but necessary. The architecture itself was the problem, not the implementation details.&lt;/p&gt;


&lt;h2&gt;
  
  
  Why GL-INET Was Also Retired
&lt;/h2&gt;

&lt;p&gt;After abandoning Raspberry Pi, we evaluated GL-INET routers as edge nodes. Running OpenWrt, they offered network-centric functionality with lower power consumption and better thermal characteristics.&lt;/p&gt;

&lt;p&gt;For a time, this worked. But the solution was still unsatisfying.&lt;/p&gt;
&lt;h3&gt;
  
  
  Limitations of OpenWrt
&lt;/h3&gt;

&lt;p&gt;GL-INET devices run a constrained Linux environment. Package availability is limited. Development and debugging are cumbersome. Integration with modern tooling requires workarounds.&lt;/p&gt;

&lt;p&gt;More fundamentally, GL-INET occupies an awkward middle ground: more capable than a microcontroller, less capable than a full Linux system.&lt;/p&gt;
&lt;h3&gt;
  
  
  The Economic Shift
&lt;/h3&gt;

&lt;p&gt;The decisive factor was market evolution. By 2024, MiniPCs with full x86 Linux environments became available at price points comparable to GL-INET devices. Quad-core processors, 8GB RAM, NVMe storage, multiple USB ports, Gigabit Ethernet—all in a compact, passively cooled form factor.&lt;/p&gt;

&lt;p&gt;The question became: why accept OpenWrt limitations when a full Ubuntu or Debian system costs the same?&lt;/p&gt;

&lt;p&gt;We retained GL-INET devices only for specialized use cases—specifically, wireless network probing where their radio characteristics are advantageous. For general edge computing, MiniPC is now the standard.&lt;/p&gt;


&lt;h2&gt;
  
  
  The 3-Layer Architecture
&lt;/h2&gt;

&lt;p&gt;The &lt;strong&gt;CloudAIoT 3-Layer Reference Architecture&lt;/strong&gt; is built on a strict separation of concerns across three layers:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;┌─────────────────────────────────────────────────────────────┐
│                    CLOUD / UPSTREAM LAYER                    │
│         Aggregation, Analytics, Dashboards, AI, Storage      │
│                   (Completely non-real-time)                 │
└─────────────────────────────────────────────────────────────┘
                              ▲
                              │ MQTT / HTTP / WebSocket
                              │
┌─────────────────────────────────────────────────────────────┐
│                 EDGE / NEAR-REAL-TIME LAYER                  │
│      Data preprocessing, Buffering, Network communication    │
│                    (Tolerates mild latency)                  │
│                                                              │
│                  [ Linux MiniPC / Uno Q CPU ]                │
└─────────────────────────────────────────────────────────────┘
                              ▲
                              │ USB (Primary) / Serial
                              │
┌─────────────────────────────────────────────────────────────┐
│                     REAL-TIME LAYER                          │
│            Motor/Servo/PWM control, Safety loops             │
│               (Must NEVER depend on networking)              │
│                                                              │
│              [ Arduino MCU: Uno Q, XIAO, etc. ]              │
└─────────────────────────────────────────────────────────────┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Layer 1: Real-Time (MCU)
&lt;/h3&gt;

&lt;p&gt;The bottom layer handles safety-critical, timing-sensitive operations: motor control, servo positioning, PWM generation, current sensing, emergency stops.&lt;/p&gt;

&lt;p&gt;This layer runs on bare-metal microcontrollers—Arduino Uno Q, Seeed XIAO, or similar. There is no operating system to interrupt execution. There is no network stack to block on. There is no filesystem to corrupt.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Critical principle: This layer must operate correctly even if all network connectivity is lost.&lt;/strong&gt; A robot arm must not swing wildly because Wi-Fi dropped. A motor must not overheat because the cloud is unreachable.&lt;/p&gt;

&lt;h3&gt;
  
  
  Layer 2: Edge / Near-Real-Time (Linux)
&lt;/h3&gt;

&lt;p&gt;The middle layer handles data preprocessing, buffering, protocol translation, local decision-making, and upstream communication.&lt;/p&gt;

&lt;p&gt;This layer runs on a MiniPC with full Linux (Ubuntu/Debian). It communicates with MCU nodes via USB. It communicates upstream via MQTT, HTTP, or WebSocket.&lt;/p&gt;

&lt;p&gt;If network connectivity is lost, this layer continues operating. It buffers data locally. It maintains MCU communication. When connectivity returns, it synchronizes with upstream.&lt;/p&gt;
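&lt;p&gt;The buffer-then-synchronize behavior described above can be sketched as a small store-and-forward queue on the edge device. This is an illustrative sketch, not our production code; the &lt;code&gt;publish&lt;/code&gt; callback and the queue bound are assumptions for the example.&lt;/p&gt;

```python
import collections
import json
import time

class StoreAndForward:
    """Buffer MCU readings locally; flush upstream when connectivity returns."""

    def __init__(self, publish, max_items=10_000):
        # publish(payload) -> bool: True only on successful upstream delivery.
        self.publish = publish
        self.queue = collections.deque(maxlen=max_items)  # oldest entries dropped first

    def record(self, reading):
        # Always buffer first, so a failed publish loses nothing.
        self.queue.append({"ts": time.time(), **reading})
        self.flush()

    def flush(self):
        # Drain in arrival order; stop at the first failure (still offline).
        while self.queue:
            payload = json.dumps(self.queue[0])
            if not self.publish(payload):
                return False  # keep buffering until connectivity returns
            self.queue.popleft()
        return True
```

&lt;p&gt;While the network is down, &lt;code&gt;publish&lt;/code&gt; keeps failing and readings simply accumulate; once it succeeds again, the backlog drains in order and normal operation resumes without losing MCU data.&lt;/p&gt;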

&lt;h3&gt;
  
  
  Layer 3: Cloud / Upstream
&lt;/h3&gt;

&lt;p&gt;The top layer handles aggregation across multiple edge nodes, long-term storage, analytics, dashboards, AI inference, and alerting.&lt;/p&gt;

&lt;p&gt;This layer has no real-time requirements. Latency of seconds or even minutes is acceptable. It can run on VPS, LXC containers, or public cloud services.&lt;/p&gt;




&lt;h2&gt;
  
  
  Implementation Mapping
&lt;/h2&gt;

&lt;p&gt;The 3-layer architecture maps cleanly to specific hardware:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Layer&lt;/th&gt;
&lt;th&gt;Hardware&lt;/th&gt;
&lt;th&gt;Role&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Real-Time&lt;/td&gt;
&lt;td&gt;Arduino Uno Q (MCU side), XIAO&lt;/td&gt;
&lt;td&gt;Motor control, PWM, safety loops&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Edge&lt;/td&gt;
&lt;td&gt;MiniPC, Uno Q (CPU side)&lt;/td&gt;
&lt;td&gt;Data collection, preprocessing, network&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Upstream&lt;/td&gt;
&lt;td&gt;VPS, LXC, Cloud&lt;/td&gt;
&lt;td&gt;Storage, analytics, dashboards&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Why Arduino Uno Q
&lt;/h3&gt;

&lt;p&gt;The Arduino Uno Q is particularly well-suited to this architecture. It combines an MCU (for real-time control) and a Linux-capable CPU (for edge processing) in a single board. The two processors communicate via internal serial, but can operate independently.&lt;/p&gt;

&lt;p&gt;This means a single Uno Q can serve as both real-time and edge layers for simple deployments. For complex systems, dedicated MCU nodes connect to a central MiniPC.&lt;/p&gt;
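&lt;p&gt;The serial link between the MCU and CPU sides needs only a simple framing convention. As a hedged illustration (newline-delimited JSON and these field names are my assumptions for the example, not the Uno Q's actual internal protocol), the Linux side might frame and parse messages like this:&lt;/p&gt;

```python
import json

def encode_frame(node_id: str, kind: str, value: float) -> bytes:
    """One newline-delimited JSON frame per message (MCU -> CPU direction)."""
    return (json.dumps({"node": node_id, "kind": kind, "value": value}) + "\n").encode()

def decode_frames(buffer: bytes):
    """Split a raw serial read buffer into complete frames.

    Returns (frames, remainder): a partial trailing frame is kept as the
    remainder and prepended to the next read, so torn reads are harmless.
    """
    frames = []
    *lines, rest = buffer.split(b"\n")
    for line in lines:
        if line.strip():
            frames.append(json.loads(line))
    return frames, rest
```

&lt;p&gt;Keeping the framing this dumb is deliberate: the MCU side stays trivially implementable in a few lines of firmware, and the Linux side never blocks on a half-received message.&lt;/p&gt;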

&lt;h3&gt;
  
  
  USB as Primary Interconnect
&lt;/h3&gt;

&lt;p&gt;We use USB rather than Wi-Fi for MCU-to-Edge communication. This provides:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Reliability&lt;/strong&gt;: No wireless interference, no reconnection delays&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Power&lt;/strong&gt;: MCU nodes can be bus-powered&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Simplicity&lt;/strong&gt;: Standard serial-over-USB, no network configuration&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Scalability&lt;/strong&gt;: USB hubs allow many nodes per edge device
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;        Edge MiniPC
            │
    ┌───────┼───────┬───────┬───────┬───────┐
    │       │       │       │       │       │
   USB     USB     USB     USB     USB     USB
    │       │       │       │       │       │
┌───┴───┐┌──┴──┐┌───┴───┐┌──┴──┐┌───┴───┐┌──┴──┐
│ Motor ││Sensor││Current││Relay││  PWM  ││ ... │
│Control││ Node ││Monitor││Node ││ Node  ││     │
└───────┘└─────┘└───────┘└─────┘└───────┘└─────┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each node has a single responsibility. Adding capacity means adding nodes. The architecture scales horizontally without redesign.&lt;/p&gt;

&lt;p&gt;Wi-Fi is reserved for upstream communication only, where its unreliability is tolerable.&lt;/p&gt;




&lt;h2&gt;
  
  
  Design Principles
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Principle 1: Network Independence for Safety
&lt;/h3&gt;

&lt;p&gt;Real-time control must never depend on network availability. If the network fails, safety-critical operations continue unchanged. This is not a nice-to-have; it is a fundamental requirement.&lt;/p&gt;

&lt;h3&gt;
  
  
  Principle 2: Hot-Swappable Nodes
&lt;/h3&gt;

&lt;p&gt;Any MCU node should be replaceable within 4 minutes without configuration. This requires:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Standardized hardware (Grove connectors, common pinouts)&lt;/li&gt;
&lt;li&gt;Configuration stored at edge layer, not on MCU&lt;/li&gt;
&lt;li&gt;Automatic node discovery on USB connection&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Field technicians should not need programming skills to replace a failed node.&lt;/p&gt;
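&lt;p&gt;Storing configuration at the edge layer can be as simple as a registry keyed by a node ID that each MCU announces on connect. A minimal sketch of the idea (the &lt;code&gt;HELLO&lt;/code&gt; announcement format and the registry contents are hypothetical, chosen for illustration):&lt;/p&gt;

```python
# Edge-side registry: configuration lives here, never on the MCU itself.
NODE_CONFIG = {
    "pwm-a3f2": {"role": "pwm", "frequency_hz": 20_000, "max_duty": 0.85},
    "amp-77c1": {"role": "current", "sample_hz": 100, "trip_amps": 4.0},
}

def configure_node(announcement: str) -> dict:
    """Map a node's hello line (e.g. 'HELLO pwm-a3f2') to its stored config.

    A replacement node announcing the same ID receives the identical
    configuration, so swapping hardware requires no reprogramming.
    """
    parts = announcement.strip().split()
    if len(parts) != 2 or parts[0] != "HELLO":
        raise ValueError(f"unrecognized announcement: {announcement!r}")
    node_id = parts[1]
    if node_id not in NODE_CONFIG:
        raise KeyError(f"unknown node id: {node_id}")
    return NODE_CONFIG[node_id]
```

&lt;p&gt;This is what makes the hot-swap guarantee hold in practice: a field technician plugs in a fresh node flashed with generic firmware, it announces its ID, and the edge layer pushes down everything the old node knew.&lt;/p&gt;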

&lt;h3&gt;
  
  
  Principle 3: Single Responsibility per Node
&lt;/h3&gt;

&lt;p&gt;Each MCU node handles one clear function: PWM control for motors, current sensing, environmental monitoring, actuator control. Combining multiple responsibilities on one node reintroduces the complexity we are trying to eliminate.&lt;/p&gt;

&lt;h3&gt;
  
  
  Principle 4: Horizontal Scalability
&lt;/h3&gt;

&lt;p&gt;Adding capacity means adding nodes, not replacing hardware. A system with 4 motors and 8 sensors uses 12 MCU nodes. A system with 40 motors and 80 sensors uses 120 nodes connected to multiple edge devices. The architecture is the same; only quantity changes.&lt;/p&gt;




&lt;h2&gt;
  
  
  When Do You Need This Architecture?
&lt;/h2&gt;

&lt;p&gt;If your system meets any of these criteria, a single-board computer approach will fail:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;[ ] Must survive network outages without disruption&lt;/li&gt;
&lt;li&gt;[ ] Must run 24/7 unattended for months or years&lt;/li&gt;
&lt;li&gt;[ ] Requires hot-swappable nodes without reconfiguration&lt;/li&gt;
&lt;li&gt;[ ] Controls motors, relays, servos, or actuators&lt;/li&gt;
&lt;li&gt;[ ] Is deployed at remote customer sites&lt;/li&gt;
&lt;li&gt;[ ] Must be maintainable by non-programmers&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you checked even one box, Raspberry Pi is not suitable. The 3-layer architecture is required.&lt;/p&gt;




&lt;h2&gt;
  
  
  Deployment Patterns
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Small Systems
&lt;/h3&gt;

&lt;p&gt;For simple deployments, a single Arduino Uno Q is sufficient:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;MCU side: Real-time control&lt;/li&gt;
&lt;li&gt;CPU side: Edge processing and upstream communication&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;No additional hardware required.&lt;/p&gt;

&lt;h3&gt;
  
  
  Medium Systems
&lt;/h3&gt;

&lt;p&gt;For moderate complexity:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Multiple Arduino/XIAO MCU nodes&lt;/li&gt;
&lt;li&gt;One MiniPC as edge aggregator&lt;/li&gt;
&lt;li&gt;Upstream to VPS or cloud&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Large Systems
&lt;/h3&gt;

&lt;p&gt;For enterprise scale:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Many MCU nodes organized by function&lt;/li&gt;
&lt;li&gt;Multiple MiniPCs as regional aggregators&lt;/li&gt;
&lt;li&gt;Hierarchical upstream to cloud infrastructure&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The architecture scales without fundamental changes.&lt;/p&gt;




&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;The &lt;strong&gt;CloudAIoT 3-Layer Reference Architecture&lt;/strong&gt; is not a product. It is a set of principles for designing IoT and robotics systems that are safe, scalable, and maintainable.&lt;/p&gt;

&lt;p&gt;The core insight is separation of concerns:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Real-time control&lt;/strong&gt; belongs on dedicated MCUs with no network dependency&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Edge processing&lt;/strong&gt; belongs on Linux systems that can tolerate network interruption&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cloud integration&lt;/strong&gt; belongs upstream where latency is irrelevant&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We arrived at this architecture through failure. Raspberry Pi taught us that mixing concerns creates instability. GL-INET taught us that half-measures satisfy no one. Production deployments taught us that elegant theory means nothing if systems fail at customer sites.&lt;/p&gt;

&lt;p&gt;The market has finally provided hardware that makes this architecture economically viable. MiniPCs are cheap. Arduino-compatible MCUs are ubiquitous. USB hubs cost almost nothing.&lt;/p&gt;

&lt;p&gt;There is no longer any excuse for deploying IoT systems that fail when the network hiccups, overheat in summer, or require expert intervention to replace a sensor node.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;CloudAIoT Re-Defined&lt;/strong&gt; is our contribution to the industry: a proven architecture, freely shared, for building systems that actually work.&lt;/p&gt;




&lt;h2&gt;
  
  
  About the Author
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Ryoji Taniyama&lt;/strong&gt; is the CEO of Takumi Labs Inc., with 38 years of network engineering experience spanning AI research, router development, and large-scale retail deployments.&lt;/p&gt;

&lt;p&gt;Contact: &lt;a href="https://cloudaiot.tech" rel="noopener noreferrer"&gt;cloudaiot.tech&lt;/a&gt;&lt;/p&gt;




&lt;p&gt;&lt;em&gt;© 2025 Takumi Labs Inc. This document is released under CC BY 4.0. Attribution required for derivative works.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>iot</category>
      <category>architecture</category>
      <category>embedded</category>
      <category>robotics</category>
    </item>
    <item>
      <title>📌 What 24 Hours of RTT Monitoring Reveals: Comparing 6 Public DNS Providers Using Multi-Target Correlation (2025-10-20)</title>
      <dc:creator>TANIYAMA Ryoji</dc:creator>
      <pubDate>Sat, 01 Nov 2025 19:21:23 +0000</pubDate>
      <link>https://dev.to/taniyama/what-a-day-of-rtt-monitoring-reveals-comparing-6-public-dns-services-2025-10-20-a-multi-target-5428</link>
      <guid>https://dev.to/taniyama/what-a-day-of-rtt-monitoring-reveals-comparing-6-public-dns-services-2025-10-20-a-multi-target-5428</guid>
      <description>&lt;p&gt;Ideal for SREs, network engineers, and anyone tuning DNS for production workloads.&lt;/p&gt;

&lt;p&gt;🚀 TL;DR&lt;/p&gt;

&lt;p&gt;Most DNS benchmark articles run a one-time lookup test — which often misleads.&lt;br&gt;
This study performs continuous RTT monitoring over 24 hours across six DNS services simultaneously, enabling multi-target correlation to separate DNS performance from ISP or routing effects.&lt;/p&gt;

&lt;p&gt;Key takeaway: Google (8.8.8.8 / 8.8.4.4) and Cloudflare (1.0.0.1) offer the flattest day-long stability. Cloudflare 1.1.1.1 shows occasional Anycast routing-driven latency spikes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Unlike typical one-time DNS speed comparisons, this analysis uses 24-hour monitoring across 6 targets simultaneously to distinguish network issues from DNS provider performance.&lt;/strong&gt;&lt;br&gt;
&lt;strong&gt;A practical guide to multi-target DNS stability monitoring&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1rqng713ubco6jymjqwz.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1rqng713ubco6jymjqwz.webp" alt="2025-10-20 : Public DNS RTT" width="800" height="400"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The most stable baseline:&lt;/strong&gt; &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;8.8.8.8&lt;/strong&gt; (consistently 6.0–6.3 ms, minimal jitter).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;1.0.0.1 / 8.8.4.4 / 9.9.9.9&lt;/strong&gt; cluster around 6.3–6.8 ms with flat trends.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;149.112.112.112&lt;/strong&gt; (Quad9) consistently runs +0.8–1.2 ms higher — a clear "step up."&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;1.1.1.1&lt;/strong&gt; alone showed isolated 9–11 ms spikes several times. Minimal correlation with other targets suggests Anycast/routing-side transient events.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Evening through night:&lt;/strong&gt;&lt;br&gt;
A +0.2–0.3 ms baseline lift across all targets, consistent with ordinary evening traffic load.&lt;br&gt;
&lt;strong&gt;Conclusion:&lt;/strong&gt;&lt;br&gt;
Stable operation throughout the day with negligible business impact. Multi-target correlation successfully separated ISP-side factors from destination-side factors.&lt;/p&gt;




&lt;h2&gt;
  
  
  Purpose of This Analysis
&lt;/h2&gt;

&lt;p&gt;The chart above (2025-10-20 JST) tracks ICMP probes sent at regular intervals from a single probe to six destinations, recording average RTT over time:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;1.0.0.1&lt;/strong&gt; (Cloudflare)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;1.1.1.1&lt;/strong&gt; (Cloudflare)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;149.112.112.112&lt;/strong&gt; (Quad9)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;8.8.4.4&lt;/strong&gt; (Google)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;8.8.8.8&lt;/strong&gt; (Google)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;9.9.9.9&lt;/strong&gt; (Quad9)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Goal&lt;/strong&gt;: Avoid false conclusions from single-target anomalies by reading network state through multi-target correlation. Anycast DNS services are particularly sensitive to time-of-day and routing convergence, making multi-target observation a diagnostic fundamental.&lt;/p&gt;




&lt;h2&gt;
  
  
  How to Read the Chart (Quick Guide)
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Average RTT&lt;/strong&gt;: Round-trip delay average. Line "height" indicates the baseline floor.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Jitter (oscillation)&lt;/strong&gt;: Vertical amplitude. Lower = more stable user experience.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Simultaneous spikes&lt;/strong&gt;: Likely local/ISP congestion.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Target-specific spikes&lt;/strong&gt;: Possible Anycast node shift or route-specific transient.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Observations
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1) Stability and Baselines
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;8.8.8.8&lt;/strong&gt;: Lowest latency with minimal jitter. Maintains 6.0–6.3 ms.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;1.0.0.1 / 8.8.4.4 / 9.9.9.9&lt;/strong&gt;: Semi-stable cluster at 6.3–6.8 ms.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;149.112.112.112&lt;/strong&gt;: Consistently higher at ~7.2–7.6 ms, suggesting longer path length or different Anycast placement.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  2) Spikes (Transient Outliers)
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;1.1.1.1&lt;/strong&gt; showed 9–11 ms spikes during late night and around 23:00 JST.

&lt;ul&gt;
&lt;li&gt;No concurrent spikes on other targets → Strong indication of routing/Anycast-side factors, not ISP-wide issues.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;A small peak around 09:00 JST appeared across multiple targets → short-lived congestion on the near-side network.&lt;/li&gt;

&lt;/ul&gt;

&lt;h3&gt;
  
  
  3) Diurnal Variation
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Evening through night: &lt;strong&gt;+0.2–0.3 ms&lt;/strong&gt; baseline lift across all targets.&lt;/li&gt;
&lt;li&gt;This falls within typical traffic increase range — no operational concern.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Interpretation &amp;amp; Implications
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Multi-Target Observation Prevents Misdiagnosis
&lt;/h3&gt;

&lt;p&gt;Monitoring a single public DNS alone risks false conclusions like "high latency = ISP degradation."&lt;/p&gt;

&lt;p&gt;In this case, 1.1.1.1's isolated spikes were distinguished from ISP issues because other targets remained stable.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;→ Operational best practice&lt;/strong&gt;: Use correlation across all targets as primary indicator.&lt;/p&gt;

&lt;h3&gt;
  
  
  Selecting Benchmark "Rulers"
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;8.8.8.8&lt;/strong&gt; is ideal for health benchmarking due to its low, stable baseline.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Quad9 (149.112.112.112)&lt;/strong&gt; consistently runs higher — useful for observing regional/path differences through baseline comparison.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Alert Design Considerations
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Threshold&lt;/strong&gt;: Set per-target at "baseline mean + 3σ" (respecting each target's natural baseline).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;All targets exceed threshold simultaneously&lt;/strong&gt;: ISP or near-side network event.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Only specific target exceeds&lt;/strong&gt;: Route reconvergence, Anycast shift, or AS-level congestion.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Pro tip&lt;/strong&gt;: Run concurrent traceroute snapshots during events for easier post-mortem analysis.&lt;/li&gt;
&lt;/ul&gt;
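&lt;p&gt;The baseline-mean-plus-3σ threshold and the simultaneous-vs-isolated distinction above can be expressed in a few lines. A sketch of the classification logic only (the sample RTT values in the usage are illustrative, not taken from this run):&lt;/p&gt;

```python
import statistics

def threshold(samples: list[float]) -> float:
    """Per-target alert threshold: baseline mean + 3 standard deviations."""
    return statistics.mean(samples) + 3 * statistics.stdev(samples)

def classify(current: dict[str, float], thresholds: dict[str, float]) -> str:
    """Correlate exceedances across targets to localize the fault."""
    over = [t for t, rtt in current.items() if rtt > thresholds[t]]
    if not over:
        return "normal"
    if len(over) == len(current):
        # Every target degraded at once: the problem is on our side.
        return "near-side: ISP or local network event"
    # Only some targets degraded: routing/Anycast on the far side.
    return f"far-side: routing/Anycast event on {', '.join(sorted(over))}"
```

&lt;p&gt;The key design choice is that each target keeps its own threshold: Quad9's naturally higher baseline never trips an alert sized for Google's, and the correlation step, not any single threshold, decides whether to page.&lt;/p&gt;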




&lt;h2&gt;
  
  
  Summary (Today's Findings)
&lt;/h2&gt;

&lt;p&gt;Overall stability maintained with slight evening baseline lift. 1.1.1.1 exhibited brief spikes but no sustained ISP-side degradation detected. Multi-target correlation enabled rapid fault isolation.&lt;/p&gt;




&lt;h2&gt;
  
  
  Appendix: Quick Terminology
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;RTT (Round-Trip Time)&lt;/strong&gt;: Time for a packet to travel to its destination and back.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Jitter&lt;/strong&gt;: Variation in RTT over time. Directly impacts VoIP and real-time communication quality.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Anycast DNS&lt;/strong&gt;: Same IP advertised from multiple locations. Actual destination varies by proximity, routing, and load.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;3σ (three sigma)&lt;/strong&gt;: Outlier-detection threshold assuming an approximately normal distribution; values more than three standard deviations from the baseline mean are flagged.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  About the Monitoring Setup
&lt;/h2&gt;

&lt;p&gt;This analysis was conducted using a &lt;strong&gt;custom-built monitoring script running on Linux&lt;/strong&gt;. The setup continuously pings multiple public DNS endpoints and logs RTT data for time-series analysis.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Tech Stack:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Linux-based probe&lt;/li&gt;
&lt;li&gt;ICMP ping utilities&lt;/li&gt;
&lt;li&gt;Custom scripts for data collection and logging&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Graph generation: GeneralLLM/ChatGPT&lt;/strong&gt; (visualizing raw data into time-series charts)&lt;/li&gt;
&lt;li&gt;Time-series data processing and analysis&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The multi-target approach allows for rapid fault isolation by correlating latency patterns across different Anycast networks simultaneously.&lt;/p&gt;
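&lt;p&gt;The probe itself needs little more than &lt;code&gt;ping&lt;/code&gt; in a loop plus a parser for the reported time. A hedged sketch of that core (this is not the actual script; &lt;code&gt;ping&lt;/code&gt; output format varies by platform, and this targets the common Linux iputils format):&lt;/p&gt;

```python
import re
import subprocess

# Matches e.g. "64 bytes from 8.8.8.8: icmp_seq=1 ttl=117 time=6.12 ms"
RTT_RE = re.compile(r"time=([\d.]+) ms")

def parse_rtt(ping_line: str):
    """Extract the RTT in milliseconds from one line of ping output, or None."""
    m = RTT_RE.search(ping_line)
    return float(m.group(1)) if m else None

def probe_once(target: str):
    """Send one ICMP echo and return its RTT in ms, or None on loss/timeout."""
    out = subprocess.run(
        ["ping", "-c", "1", "-W", "2", target],  # one echo, 2 s timeout
        capture_output=True, text=True,
    ).stdout
    for line in out.splitlines():
        rtt = parse_rtt(line)
        if rtt is not None:
            return rtt
    return None
```

&lt;p&gt;Calling &lt;code&gt;probe_once&lt;/code&gt; for each of the six targets on a fixed interval and logging the results with a timestamp yields exactly the kind of time series charted above.&lt;/p&gt;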




&lt;h2&gt;
  
  
  About the Author
&lt;/h2&gt;

&lt;p&gt;Network engineer and observability enthusiast based in Kawasaki, Japan. I focus on practical network monitoring, latency analysis, and building custom diagnostic tools to understand real-world internet infrastructure behavior. &lt;/p&gt;

&lt;p&gt;I believe in multi-dimensional observability — never trust a single data point when you can correlate across multiple sources.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Interested in network monitoring and performance analysis?&lt;/strong&gt; &lt;br&gt;
Follow me for more insights on DNS infrastructure, latency optimization, and home-grown monitoring solutions.&lt;/p&gt;

&lt;p&gt;If you also run DNS monitoring, which metrics do you care about most — latency, packet loss,&lt;br&gt;
regional POP consistency, or DoH/DoT performance?&lt;br&gt;
I’d love to compare approaches in the comments.&lt;/p&gt;

</description>
      <category>networking</category>
      <category>performance</category>
      <category>cloud</category>
      <category>ai</category>
    </item>
  </channel>
</rss>
