TechLogStack

Posted on May 20 • Originally published at techlogstack.com on May 17

Slack Rewrote Its Core Architecture for Enterprise — Because the Old One Was a Lie

#architecture #distributedsystems #devops #webdev

2 years development time — from first internal prototype to full customer rollout
Workspace-centric → org-wide — every DM, notification, search result, and unread count
Thousands of APIs refactored to support org-level data access
Rollout Sep 2023 → Mar 2024 — six-month controlled rollout after two years of development
3 features unlocked that were architecturally impossible before: unified DMs, org-wide Activity, Save it for Later
Built within the existing Hack/PHP monolith: no microservices extraction required

Slack was built for teams in single workspaces. Enterprise customers were using it across dozens of workspaces simultaneously — and the architecture had never been designed for that. Every major enterprise feature was a workaround on top of a foundation that assumed one workspace per person. Slack spent two years rebuilding the foundation.

The Story

All software is built atop a core set of assumptions. As new code is added and new use-cases emerge, software can become unmoored from those assumptions. When this happens, a fundamental tension arises between revisiting those foundational assumptions — which usually entails a lot of work — or trying to support new behavior atop the existing architecture.

— Slack Engineering, via 'Unified Grid: How We Re-Architected Slack for Our Largest Customers'

Slack launched in 2013 with a beautifully simple data model: users belong to workspaces, workspaces contain channels, channels contain messages. For small teams using a single workspace, this model was perfect. For large enterprises that had grown to 50, 100, or 200 workspaces across departments, geographies, and business units, it was a prison. Every DM, every notification, every unread count, every search result was siloed by workspace. A VP with access to 80 workspaces had to remember which workspace a conversation was in, switch to it, check notifications, switch back, and repeat, dozens of times per day.

Slack's team had been papering over the workspace-centric limitation for years with increasingly complex workarounds. Slack Connect (the feature that lets users in different Slack organisations message each other across workspace boundaries, built as an overlay on the workspace model), multi-workspace management tools, and org-wide settings were all workarounds that added complexity without fixing the fundamental architecture. The CTO and engineering leadership faced a classic build-it-now-or-keep-patching decision. They chose to build. The project was called Unified Grid, and it would require rebuilding the core data model, refactoring thousands of APIs, and redesigning both the backend and every client application simultaneously.

The Workspace-Centric Assumption That Had to Die

In Slack's original architecture, almost all data was particular to a single workspace: messages, channels, DMs, notification preferences, user profiles, unread counts. This assumption was baked into thousands of database queries, API responses, and client rendering paths. To build Unified Grid, every piece of data that needed to be visible across workspaces had to be lifted out of the workspace silo — a change that touched nearly every system in the stack.
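To make the scoping change concrete, here is a minimal sketch using unread counts. The `UnreadCount` record and both helper functions are hypothetical names chosen for illustration; nothing here reflects Slack's actual schema or code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UnreadCount:
    """One unread counter, keyed the way a workspace-centric model keys it."""
    workspace_id: str
    channel_id: str
    user_id: str
    count: int


def unread_for_workspace(rows, user_id, workspace_id):
    """Workspace-scoped view: the caller must already know which workspace to ask about."""
    return sum(
        r.count
        for r in rows
        if r.user_id == user_id and r.workspace_id == workspace_id
    )


def unread_for_org(rows, user_id, org_workspace_ids):
    """Org-scoped view: one total across every workspace in the org."""
    return sum(
        r.count
        for r in rows
        if r.user_id == user_id and r.workspace_id in org_workspace_ids
    )


rows = [
    UnreadCount("w1", "c1", "u1", 3),
    UnreadCount("w2", "c2", "u1", 4),
    UnreadCount("w3", "c9", "u1", 5),  # a workspace outside the org
    UnreadCount("w1", "c1", "u2", 7),  # a different user
]
workspace_total = unread_for_workspace(rows, "u1", "w1")
org_total = unread_for_org(rows, "u1", {"w1", "w2"})
```

The workspace-scoped function cannot answer "how many unreads do I have?" without being called once per workspace, which is the context-switching problem in miniature. The org-scoped function answers it in one call, but only once the data is reachable from an org-level key.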

Problem

Enterprise Users Drowning in Workspace Context-Switches

Slack's workspace-centric model forced enterprise users to manually navigate between dozens of workspaces to find conversations and check notifications. Key features like a unified DM inbox, an org-wide activity feed, and cross-workspace search were impossible within the existing architecture. These were not merely missing features; the architecture blocked them.

Cause

The Foundation Assumption Was Wrong for Enterprise

Slack's data model had been built on the assumption that almost all user data is particular to a single workspace. Ten years of feature development had embedded this assumption deep into database schema, API contracts, and client rendering logic. Supporting org-wide views required either a rewrite or an ever-growing layer of workarounds.

Solution

Prototype the Path: Build Incrementally, Prove Out

Rather than committing immediately to a full rewrite, Slack's team built a proof of concept of Unified Grid and ran it internally, with Slack's own employees using it daily. Only after the POC validated the architecture and revealed what work was required did the team commit to a full rollout.

Result

Shipped After 2 Years: Rollout Sep 2023 → Mar 2024

Unified Grid rolled out to customers starting Fall 2023 and completed in March 2024. Features like the unified DMs tab, org-wide Activity tab, and Save it for Later became possible on the new foundation; none of them could have been built on the workspace-centric model.

The Fix

The Technical Work: Thousands of APIs, One New Foundation

Slack's codebase contained thousands of API endpoints, database queries, and client rendering paths that assumed workspace-scoped data. Each had to be evaluated: does it need to be org-aware? If so, what's the migration path? In many cases, a query that fetched a user's DMs from a single workspace had to be replaced with a query that could aggregate DMs from all of the user's workspaces efficiently.

2 years — development duration from first prototype to full customer rollout
1000s — APIs, database queries, and permission checks updated to support org-wide data access
Mar 2024 — full rollout completion date
3 features — unified DMs tab, org-wide Activity tab, Save it for Later — all architecturally impossible on the old model

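# Simplified conceptual example of workspace-centric vs org-wide data access.
# Real Slack uses Hack/PHP and complex distributed data systems; the SQLite
# schema and table names below are illustrative assumptions, not Slack's.
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE workspace_members (org_id TEXT, workspace_id TEXT, user_id TEXT);
    CREATE TABLE direct_messages (
        workspace_id TEXT, user_id TEXT, body TEXT, sent_at INTEGER
    );
""")

# OLD: workspace-centric DM fetch.
# The caller must name a workspace; data is completely siloed.
def get_dms_old(user_id: str, workspace_id: str) -> list:
    return db.execute(
        "SELECT * FROM direct_messages "
        "WHERE workspace_id = ? AND user_id = ?",
        (workspace_id, user_id),  # workspace_id required: siloed
    ).fetchall()

# NEW: org-aware DM fetch (Unified Grid).
# Returns DMs across all workspaces the user belongs to in the org.
def get_dms_unified(user_id: str, org_id: str) -> list:
    # Look up every workspace the user belongs to in this org
    workspace_ids = [
        row[0]
        for row in db.execute(
            "SELECT workspace_id FROM workspace_members "
            "WHERE org_id = ? AND user_id = ?",
            (org_id, user_id),
        )
    ]
    if not workspace_ids:
        return []

    # Aggregate DMs across those workspaces into one inbox, sorted by
    # recency rather than by workspace (the user-facing change)
    placeholders = ", ".join("?" for _ in workspace_ids)
    return db.execute(
        "SELECT * FROM direct_messages "
        f"WHERE user_id = ? AND workspace_id IN ({placeholders}) "
        "ORDER BY sent_at DESC",
        (user_id, *workspace_ids),
    ).fetchall()

# Permission checks also needed org-level understanding:
# Old: can_access(user, workspace, resource)
# New: can_access(user, org, workspace, resource), with layered org context
# Every permission check in the codebase required evaluation and update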

Prototyping the Path: How Slack De-Risked the Rewrite

Slack's most important process decision for Unified Grid was building a working prototype used internally before committing to full scope. This is "prototyping the path" — not a throwaway prototype, but a real functioning implementation used by real users on real data. The feedback from internal use surfaced problems that would have been catastrophic if discovered post-rollout. It also gave leadership evidence rather than speculation to justify two years of engineering investment.

The executive concern: is it worth the cost?

The Unified Grid blog post is unusually candid about the organisational challenge: executives and engineering leadership were genuinely concerned about the cost. Was rebuilding the core architecture worth potentially thousands of engineer-weeks of effort? The team's answer was to build the proof of concept first, use internal data to demonstrate the benefits, and then make the case for full investment, rather than asking for two years of resources upfront on a bet.

The monolith as change vehicle

The backend rewrite was implemented within Slack's existing monolith (written in Hack/PHP, not Rails) rather than as a separate service. This made incremental deployment easier: changes could be gated behind feature flags, rolled back quickly, and deployed through the existing CI/CD pipeline. The Unified Grid project is evidence that a monolith can accommodate fundamental architectural evolution without requiring a microservices extraction.
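As an illustration of gating a new code path behind a flag, the sketch below routes a request to either the legacy or the unified fetch depending on a percentage rollout. Hashing the org ID into a bucket is an assumption made for this example, as are the names `in_rollout` and `list_dms`; the source does not describe how Slack's flags are implemented.

```python
import hashlib


def in_rollout(flag: str, org_id: str, percent: int) -> bool:
    """Deterministically place an org inside or outside a percentage rollout."""
    digest = hashlib.sha256(f"{flag}:{org_id}".encode()).hexdigest()
    bucket = int(digest[:8], 16) % 100  # stable bucket in the range 0..99
    return bucket < percent


def list_dms(user_id, org_id, workspace_id, rollout_percent,
             legacy_fetch, unified_fetch):
    """Route to the new org-wide path only for orgs inside the rollout."""
    if in_rollout("unified_grid", org_id, rollout_percent):
        return unified_fetch(user_id, org_id)
    return legacy_fetch(user_id, workspace_id)


# Stand-in fetchers so the routing can be exercised on its own.
def legacy(user_id, workspace_id):
    return ("legacy", user_id, workspace_id)


def unified(user_id, org_id):
    return ("unified", user_id, org_id)


at_zero = list_dms("u1", "org1", "w1", 0, legacy, unified)
at_full = list_dms("u1", "org1", "w1", 100, legacy, unified)
```

Because the bucket is derived from the org ID, a given org stays on the same side of the flag between requests, and setting the percentage back to 0 sends every org down the legacy path again. That is the quick rollback the paragraph above refers to.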

The migration cost that can't be avoided

Unified Grid required updating existing customers' Slack configurations, data migrations for org-level constructs, and client-side state invalidation when users upgraded. Some features required users to re-learn workflows they had developed over years with the old model. There is no such thing as a transparent foundational architecture change at production scale — some user-visible change is inevitable, and Slack had to manage customer communication throughout the rollout.

Architecture

Unified Grid's architecture changes span three layers of Slack's stack. The backend required new data models for org-level concepts, updated APIs with org-level context, and new query patterns that aggregate across workspaces. The desktop and mobile clients required redesigned rendering architectures that could display org-wide views alongside workspace-specific ones. The permission system required new layering to support org-level access controls on top of existing workspace-level controls. All three layers had to change together and stay in sync throughout the two-year project, including the six-month customer rollout.

Before Unified Grid: Workspace-Centric Data Silos

[Diagram: workspace-centric data silos before Unified Grid. The interactive version is available on TechLogStack.]

After Unified Grid: Org-Wide Views Across Workspaces

[Diagram: org-wide views across workspaces after Unified Grid. The interactive version is available on TechLogStack.]

The Permission System: The Hardest Architectural Change

Old permissions: can this user access this resource in this workspace? New permissions: can this user access this resource in this workspace within this org? Org-level admin controls needed to cascade down to workspace-level controls, override in some cases, and defer in others. Building a correct, auditable, performant permission system that understood both levels required careful design — org-level permission bugs in a product used by enterprises have serious security implications.
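The cascade, override, and defer behaviour can be sketched as a small decision function. This is a minimal illustration under stated assumptions: the `OrgPolicy` enum, its three values, and the `can_access` signature are hypothetical, and a real system would also involve resources, roles, and auditing.

```python
from enum import Enum


class OrgPolicy(Enum):
    ALLOW = "allow"  # org overrides: access is on regardless of the workspace
    DENY = "deny"    # org overrides: access is off regardless of the workspace
    DEFER = "defer"  # org has no opinion: the workspace setting decides


def can_access(is_org_member: bool, org_policy: OrgPolicy,
               workspace_allows: bool) -> bool:
    """Evaluate the org layer first, then fall through to the workspace layer."""
    if not is_org_member:
        return False  # outside the org, nothing below this line applies
    if org_policy is OrgPolicy.DENY:
        return False
    if org_policy is OrgPolicy.ALLOW:
        return True
    return workspace_allows  # DEFER: the old workspace-level answer stands


denied_by_org = can_access(True, OrgPolicy.DENY, workspace_allows=True)
allowed_by_org = can_access(True, OrgPolicy.ALLOW, workspace_allows=False)
deferred = can_access(True, OrgPolicy.DEFER, workspace_allows=True)
```

Checking the org layer before the workspace layer keeps the ordering explicit. A workspace-level grant can never widen access that the org has denied, which is the class of bug the paragraph above warns about.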

Lessons

The "avoid rewrites" truism is a default, not a law. When your architecture's foundational assumptions have drifted far enough from actual usage that every new feature requires a workaround, the accumulated technical debt of workarounds may exceed the cost of rebuilding the foundation. Evaluate honestly. Don't use "rewrites are bad" as a reason to avoid a decision that actually needs to be made.
Prototyping the path (building a working but incomplete implementation of a major change, using it internally to validate the direction before committing to full scope) is the engineering equivalent of a staged rollout for architectural decisions. You don't commit the full budget until you have production evidence that the direction is right. Slack's internal dogfooding gave leadership evidence rather than speculation.
Permission systems need to evolve in lockstep with data models. Org-level access controls cannot be bolted onto workspace-level permission systems. When your user model gains a new organisational layer, your permission model must gain it too. This work is unglamorous, invisible to users, and absolutely required for enterprise security.
Client and backend architecture must change together. You cannot ship an org-wide backend while keeping workspace-centric clients. The full change is end-to-end: data model, API contracts, permission systems, desktop client, mobile client, web client. Planning the delivery sequence for a change this wide is as important as designing the architecture itself.
When the architecture prevents the product from serving its largest customers, the rewrite decision has already been made by the market. Unified DMs, org-wide Activity, cross-workspace search — these were features enterprise contracts were being written around. The business case for the rewrite was not abstract technical cleanliness; it was that the features could not exist without it.

Engineering Glossary

Org-wide — a data scope in Slack's Unified Grid architecture where data is visible across all workspaces within an organisation, rather than being scoped to a single workspace. Enabled by lifting data out of the workspace silo and adding an org-level layer to the data model, permission system, and client rendering paths.

Prototyping the path — Slack's term for building a working but incomplete implementation of a major architectural change, deploying it to internal users for daily use, and letting real usage surface gaps before committing to full scope. Contrasted with designing the complete architecture first and then building it.

Unified Grid — Slack's internal name for the two-year project that rebuilt the core data model, APIs, permission system, and client architecture to support org-wide data access across all of an enterprise customer's workspaces simultaneously.

Workspace-centric model — Slack's original data model where almost all user data — messages, channels, DMs, notification preferences, user profiles, unread counts — was scoped to a single workspace. Baked into thousands of database queries and API responses. The foundational assumption that Unified Grid replaced.

This case is a plain-English retelling of publicly available engineering material.
