While contributing to Sugar Labs, I encountered a runtime reliability issue related to D-Bus service recovery.
In long-running desktop sessions, background services can crash or restart. Applications interacting with these services must handle such events gracefully. In this case, Sugar failed to recover when the sugar-datastore process was restarted.
This post explains the issue, its root cause, and the recovery mechanism I implemented.
The Problem
During an active Sugar session, if the sugar-datastore process crashed or was manually terminated:
Sugar continued using a cached D-Bus proxy
Subsequent datastore calls failed with:
org.freedesktop.DBus.Error.ServiceUnknown
The session did not recover automatically
Users had to restart the entire environment
This was a reliability gap in crash recovery.
Root Cause
Sugar cached the D-Bus proxy to the datastore interface.
When the service disappeared and was later reactivated via D-Bus activation:
The cached proxy remained stale
No revalidation occurred
All further calls failed
The system assumed the service lifecycle was static.
It wasn’t.
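The failure mode described above can be reproduced with a toy model. This is only a sketch: `ToyBus`, `ToyProxy`, and `get_datastore` are hypothetical names, not Sugar or dbus-python APIs, and it assumes a proxy stays tied to the service instance that existed when the proxy was created.

```python
class ServiceUnknown(Exception):
    """Stand-in for org.freedesktop.DBus.Error.ServiceUnknown."""


class ToyBus:
    """Toy bus that hands out proxies bound to one service instance."""

    def __init__(self):
        self.generation = 1  # bumped each time the service restarts

    def restart_service(self):
        self.generation += 1

    def get_proxy(self):
        return ToyProxy(self, self.generation)


class ToyProxy:
    def __init__(self, bus, generation):
        self._bus = bus
        self._generation = generation  # instance this proxy was created for

    def ping(self):
        if self._generation != self._bus.generation:
            raise ServiceUnknown("proxy points at a service instance that is gone")
        return "ok"


_cached = None  # module-level cache, never revalidated


def get_datastore(bus):
    """Return the cached proxy, creating it only on first use."""
    global _cached
    if _cached is None:
        _cached = bus.get_proxy()
    return _cached


bus = ToyBus()
first = get_datastore(bus).ping()   # works: the service is still alive

bus.restart_service()               # simulate a crash followed by reactivation
try:
    get_datastore(bus).ping()       # the cache returns the stale proxy
    stale_failed = False
except ServiceUnknown:
    stale_failed = True

fresh = bus.get_proxy().ping()      # a newly created proxy works again
```

The service itself is healthy after the restart; only the cached handle is broken, which is why resetting the cache is enough to recover.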
The Fix
The solution was intentionally minimal and scoped.
Datastore D-Bus calls were wrapped to catch:
ServiceUnknown
NoReply
Disconnected
On detecting one of these failures:
Clear the cached datastore proxy
Recreate a fresh D-Bus interface
Retry the operation once
If the retry failed:
Raise the original exception
No infinite retries
No UI side effects
The recovery is silent and bounded.
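The steps above can be sketched roughly as follows. This is not the merged Sugar code: `DatastoreClient`, `connect`, and `FakeDBusError` are hypothetical names chosen for illustration. `FakeDBusError` stands in for `dbus.exceptions.DBusException` so the example runs without dbus-python installed; the real exception class also provides `get_dbus_name()`.

```python
class FakeDBusError(Exception):
    """Stand-in for dbus.exceptions.DBusException (assumption for this sketch)."""

    def __init__(self, name):
        super().__init__(name)
        self._name = name

    def get_dbus_name(self):
        return self._name


# D-Bus error names treated as "the service went away".
RECOVERABLE = frozenset({
    "org.freedesktop.DBus.Error.ServiceUnknown",
    "org.freedesktop.DBus.Error.NoReply",
    "org.freedesktop.DBus.Error.Disconnected",
})


class DatastoreClient:
    """Hypothetical wrapper that caches a proxy and recovers once on failure."""

    def __init__(self, connect):
        self._connect = connect  # callable returning a fresh proxy
        self._proxy = None       # cached proxy, created lazily

    def _get_proxy(self):
        if self._proxy is None:
            self._proxy = self._connect()
        return self._proxy

    def call(self, method, *args):
        try:
            return getattr(self._get_proxy(), method)(*args)
        except FakeDBusError as err:
            if err.get_dbus_name() not in RECOVERABLE:
                raise  # unrelated errors propagate untouched
            original = err
        # Recovery path: clear the stale proxy and retry exactly once.
        self._proxy = None
        try:
            return getattr(self._get_proxy(), method)(*args)
        except FakeDBusError:
            raise original  # bounded: surface the first failure, no loop


class StaleProxy:
    def find(self, query):
        raise FakeDBusError("org.freedesktop.DBus.Error.ServiceUnknown")


class FreshProxy:
    def find(self, query):
        return ["entry for " + query]


# First connect() yields a stale proxy, the second a working one.
proxies = iter([StaleProxy(), FreshProxy()])
client = DatastoreClient(lambda: next(proxies))
result = client.call("find", "notes")
```

Retrying only once keeps the worst case predictable: a caller sees at most one extra round trip before the original error surfaces.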
Manual Testing
To validate the fix:
Started Sugar normally
Verified that sugar-datastore was running
Killed the process manually
Triggered a Journal operation
Observed initial failure
Verified proxy reset
Confirmed successful retry
The session recovered without restart.
Maintainer feedback:
“Tested, works as expected.”
The PR was merged into master.
Why This Matters
This contribution improves:
Fault tolerance
Runtime resilience
Crash recovery behavior
Stability in long-running sessions
Instead of requiring a full session restart, the system now self-heals.
This is not a UI change or feature addition.
It is a robustness fix in core service interaction logic.
What I Would Improve
If revisiting this today, I would:
Add automated integration tests simulating D-Bus restart
Add lightweight logging hooks for recovery events
Document the recovery contract in developer docs
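As a rough idea of what the first two items could look like together, here is a sketch of a test that simulates a service restart and checks that a recovery event is logged. Everything in it is an assumption made for illustration: `call_with_recovery`, `FakeService`, `ServiceGone`, and the logger name `datastore.recovery` are hypothetical, and a real integration test would restart the actual sugar-datastore process rather than a fake.

```python
import logging
import unittest

logger = logging.getLogger("datastore.recovery")  # hypothetical logger name


class ServiceGone(Exception):
    """Stand-in for a recoverable D-Bus error such as ServiceUnknown."""


def call_with_recovery(connect, cache, method):
    """Call `method` on the cached proxy, reconnecting and retrying once."""
    if cache.get("proxy") is None:
        cache["proxy"] = connect()
    try:
        return getattr(cache["proxy"], method)()
    except ServiceGone:
        # Logging hook: record the recovery without touching the UI.
        logger.warning("datastore proxy stale, reconnecting before %s", method)
        cache["proxy"] = connect()
        return getattr(cache["proxy"], method)()


class FakeService:
    """Simulates a service whose old proxies break after a restart."""

    def __init__(self):
        self.generation = 0

    def restart(self):
        self.generation += 1

    def connect(self):
        service, born = self, self.generation

        class Proxy:
            def ping(self):
                if born != service.generation:
                    raise ServiceGone()
                return "pong"

        return Proxy()


class RecoveryTest(unittest.TestCase):
    def test_recovers_after_restart_and_logs(self):
        service, cache = FakeService(), {}
        self.assertEqual(call_with_recovery(service.connect, cache, "ping"), "pong")

        service.restart()  # simulated crash and reactivation

        with self.assertLogs("datastore.recovery", level="WARNING") as logs:
            self.assertEqual(call_with_recovery(service.connect, cache, "ping"), "pong")
        self.assertEqual(len(logs.output), 1)
```

Saved as a module, this can be run with `python -m unittest`. Keeping the fake service in-process makes the test fast, at the cost of not exercising real D-Bus activation.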
Conclusion
Service lifecycle management is often overlooked in desktop applications.
Caching service proxies without revalidation creates hidden failure modes.
This fix reinforced the importance of defensive recovery logic in local multi-process systems such as D-Bus-based architectures.
This contribution was merged into the Sugar repository:
PR
It resolves issue #870:
Issue link