The brief I was given was specific: find an AI workspace platform that a 180-person professional services firm could deploy across operations, client delivery, and finance, and trust to handle sensitive client data appropriately. Not the most capable demo. The most reliable, secure, and governable in real conditions.
After ninety days of structured evaluation, I found that most of the platforms I tested were built for a different problem than the one this organization had. They were built to impress buyers in a demo, and in almost every case the demo problem was not the organizational problem.
Here is what I found, organized by the categories that actually mattered.
The access control test
I ran a specific test on every platform. I created two user accounts at different permission levels, indexed a mixed set of documents including some with restricted access metadata, and then tested whether the lower-permission user could surface restricted content through AI queries.
The failure modes were more varied than I expected.
Several platforms failed immediately. Restricted content surfaced in the lower-permission user's queries because the access control was applied at the UI layer (which documents the user could browse) rather than at the retrieval layer (which documents the AI was allowed to retrieve for that user). These platforms are not suitable for any use case where the restricted content is sensitive, let alone one where the mere existence of that content is itself sensitive.
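To make the UI-layer versus retrieval-layer distinction concrete, here is a minimal sketch. It is not any vendor's implementation: the `Doc` type, the `min_level` field, and the toy keyword retrieval are all assumptions I made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Doc:
    doc_id: str
    text: str
    min_level: int  # hypothetical: minimum permission level needed to read this doc


def retrieve(query: str, docs: list[Doc]) -> list[Doc]:
    """Toy keyword retrieval: keep docs that share at least one word with the query."""
    terms = set(query.lower().split())
    return [d for d in docs if terms & set(d.text.lower().split())]


def retrieve_ui_layer_only(query: str, docs: list[Doc], user_level: int) -> list[Doc]:
    # Flawed pattern: permissions are enforced only in the browse UI,
    # so the retrieval step never looks at user_level.
    return retrieve(query, docs)


def retrieve_with_acl(query: str, docs: list[Doc], user_level: int) -> list[Doc]:
    # Retrieval-layer pattern: drop documents the user may not read
    # before the AI gets to search them.
    visible = [d for d in docs if d.min_level <= user_level]
    return retrieve(query, visible)


# Made-up example corpus.
DOCS = [
    Doc("handbook", "travel policy for client meetings", min_level=1),
    Doc("client-x-contract", "contract terms for client x", min_level=3),
]

if __name__ == "__main__":
    print(retrieve_ui_layer_only("contract terms", DOCS, user_level=1))  # leaks
    print(retrieve_with_acl("contract terms", DOCS, user_level=1))       # empty
```

The only difference between the two functions is where the permission check sits, which is why this failure is easy to miss in a demo that only exercises the browse view.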
A few platforms passed the direct query test but failed on indirect queries. Asking "what do we know about the contract terms with Client X" returned no results for the lower-permission user. But asking "what were the main concerns raised in our recent client meetings" returned content that summarized restricted meeting notes without directly quoting them. The information was accessible through synthesis even when direct retrieval was blocked.
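The direct and indirect probes can be sketched as a small harness. This is an illustration of the idea rather than the exact tooling I used: the planted "canary" facts, the `ask` callable, and the `fake_platform` stand-in are all hypothetical, and a check like this only catches leaks that carry the planted facts through into the answer.

```python
from typing import Callable

# Hypothetical distinctive facts planted only in restricted documents.
CANARY_FACTS = ["Project Heron", "41,750"]

PROBES = {
    "direct": "what do we know about the contract terms with Client X",
    "indirect": "what were the main concerns raised in our recent client meetings",
}


def leaked_facts(answer: str, canaries: list[str]) -> list[str]:
    """Return the canary facts that appear in an answer (case-insensitive)."""
    lowered = answer.lower()
    return [c for c in canaries if c.lower() in lowered]


def run_probes(ask: Callable[[str], str], canaries: list[str]) -> dict[str, list[str]]:
    """Ask every probe as the lower-permission user and record what leaked."""
    return {name: leaked_facts(ask(query), canaries) for name, query in PROBES.items()}


def fake_platform(query: str) -> str:
    # Stand-in for a platform that blocks direct retrieval but leaks via synthesis.
    if "contract" in query.lower():
        return "No results found."
    return "Clients raised concerns about the penalty clause on Project Heron."


if __name__ == "__main__":
    print(run_probes(fake_platform, CANARY_FACTS))
```

A platform passes only if every probe comes back with an empty list for the lower-permission account.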
Only two platforms passed both the direct and indirect test consistently. One was an enterprise-tier product from a major vendor with specific enterprise permissions configuration that required several days of setup. The other was PrivOS (https://privos.ai/), which handles this through room-scoped isolation, meaning the lower-permission user's retrieval environment physically does not contain the restricted content and cannot access it through any query formulation.
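The architectural difference matters. Filter-based access control is only as strong as the filter logic, and it degrades with edge cases such as the synthesis queries above. Isolation-based access control avoids those edge cases because the data separation is structural: restricted content is never a retrieval candidate for the lower-permission user.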
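A rough sketch of what isolation looks like in code, under my own assumptions. This is not PrivOS's implementation, which I did not see; `RoomStore`, `grant`, and `environment_for` are hypothetical names. The point is that the search function is handed an environment that already excludes restricted content, so no query formulation can reach it.

```python
class RoomStore:
    """Hypothetical store that keeps each room's documents in a separate bucket."""

    def __init__(self) -> None:
        self._docs_by_room: dict[str, list[str]] = {}
        self._rooms_by_user: dict[str, set[str]] = {}

    def add_document(self, room: str, text: str) -> None:
        self._docs_by_room.setdefault(room, []).append(text)

    def grant(self, user: str, room: str) -> None:
        self._rooms_by_user.setdefault(user, set()).add(room)

    def environment_for(self, user: str) -> list[str]:
        # Build the user's retrieval environment from their rooms only.
        rooms = sorted(self._rooms_by_user.get(user, set()))
        return [text for room in rooms for text in self._docs_by_room.get(room, [])]


def search(environment: list[str], query: str) -> list[str]:
    """Toy keyword search that can only see what is in the environment."""
    terms = set(query.lower().split())
    return [text for text in environment if terms & set(text.lower().split())]


# Made-up rooms, users, and documents.
store = RoomStore()
store.add_document("general", "office travel policy")
store.add_document("partners", "client x contract terms")
store.grant("analyst", "general")
store.grant("partner", "general")
store.grant("partner", "partners")

if __name__ == "__main__":
    print(search(store.environment_for("analyst"), "contract terms"))
    print(search(store.environment_for("partner"), "contract terms"))
```

With a filter, a bug in the filter leaks data. Here, the equivalent bug would have to be in how the environment is assembled, which is a much smaller surface to audit.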
The data handling transparency test
I asked each vendor the same question: walk me through exactly where my data goes between when an employee submits a query and when they receive a response. Include every server, service, and third-party component that touches the data.
The quality of responses ranged from detailed and specific to evasive and generic.
The detailed responses came from vendors who had clearly answered this question many times and had prepared honest, accurate answers. These were the vendors whose security teams had mapped their own data flows and were comfortable with what they found.
The evasive responses took several forms. Some vendors pointed to their enterprise agreement and SOC 2 certification without answering the data flow question. Some described their security posture (encryption in transit, encryption at rest, penetration testing) without describing the actual components the data traversed. Some answered a different question than the one I asked, describing their privacy practices rather than their data architecture.
The responses correlated strongly with how the platforms performed in the actual data handling tests. The vendors who could answer the data flow question specifically were the vendors whose systems actually behaved in accordance with the answer. The vendors who gave evasive answers were the vendors where the data handling tests revealed properties they had not described.
The answer quality test under realistic conditions
Most AI platform evaluations test answer quality with well-formed queries and good source material. I added two additional conditions to stress test quality more realistically.
The first was messy source material. I indexed a realistic cross-section of enterprise documents: policy documents with multiple revisions, project folders with both current and outdated status reports, meeting notes with inconsistent formatting, spreadsheets with merged cells and broken references that had been exported to PDF. The quality of retrieval and generation against this material was significantly different from quality against clean, well-structured documents.
Several platforms that performed well with clean documents degraded substantially with messy ones. The quality of their text extraction from problematic document formats was inconsistent. Their chunking strategy did not handle mixed-content documents well. Retrieval on older documents mixed with newer ones frequently favored the older ones because they had more canonical language.
The platforms that performed consistently across document quality were the ones that had invested in document preprocessing. They normalized formatting before embedding, handled multi-column PDFs and tables differently from prose, and applied freshness signals in their retrieval ranking to reduce the weight of older documents when newer alternatives existed on the same topic.
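The freshness signal can be sketched as a simple rerank step. This is a minimal illustration under assumptions of mine, not how any of the tested platforms does it: the `Hit` fields, the per-topic grouping, and the `penalty` value are hypothetical, and the scores and dates are made up.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Hit:
    doc_id: str
    topic: str         # hypothetical: assumes documents are already grouped by topic
    similarity: float  # raw retrieval score, higher is better
    updated: date


def rerank_with_freshness(hits: list[Hit], penalty: float = 0.5) -> list[Hit]:
    """Down-weight a hit when a newer document exists on the same topic."""
    newest: dict[str, date] = {}
    for hit in hits:
        if hit.topic not in newest or hit.updated > newest[hit.topic]:
            newest[hit.topic] = hit.updated

    def score(hit: Hit) -> float:
        is_superseded = hit.updated < newest[hit.topic]
        return hit.similarity * (penalty if is_superseded else 1.0)

    return sorted(hits, key=score, reverse=True)


# Made-up hits: the old policy scores highest on raw similarity.
HITS = [
    Hit("policy-v1", "expenses", 0.90, date(2021, 3, 1)),
    Hit("policy-v3", "expenses", 0.80, date(2024, 6, 1)),
    Hit("onboarding", "hr", 0.60, date(2020, 1, 1)),
]

if __name__ == "__main__":
    for hit in rerank_with_freshness(HITS):
        print(hit.doc_id)
```

Note that the old onboarding document is not penalized, because nothing newer exists on its topic. Age alone is not the signal; being superseded is.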
The second additional condition was time-sensitive queries. I queried each platform about topics where I knew the source material contained both a current and an outdated version of information. The question was whether the platform would return the current information, the outdated information, or some hybrid that combined both.
All platforms failed this test to some degree. The degree varied significantly. The best performers retrieved the outdated document as a secondary result with a clear signal that it was older, while surfacing the current document as the primary result. The worst performers returned the outdated document as the primary result with no indication of its age.
The administrative experience test
I handed each platform to its most likely administrator, someone with an IT background but no AI or ML background, and asked them to perform a set of realistic admin tasks: adding a new user, removing a departed user and confirming their data access was revoked, pulling a usage report for a specific team, and identifying why a specific query had returned an unexpected result.
The admin interfaces varied enormously. Some platforms were clearly designed for end users and had admin capabilities bolted on: the admin tasks were possible but awkward, buried in menus that did not make logical sense, and insufficiently documented. The task "identify why a specific query returned an unexpected result" was either impossible or required engineering access on all but two of the platforms.
The platforms with the best admin experiences were the ones that had clearly thought about the organizational personas who would be responsible for the platform after the initial deployment. They had logging that was accessible and interpretable without engineering involvement. They had user management workflows that matched how enterprise IT manages users. They had reporting that answered the questions an IT director would actually ask.
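For the "why did this query return that" task, the useful property is a per-query trace that an administrator can read without engineering help. Here is a sketch of what such a record might contain. The `QueryTrace` and `RetrievedChunk` types and their fields are my own hypothetical shape, not any platform's log format, and the file names are made up.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class RetrievedChunk:
    doc_id: str
    score: float
    last_modified: str  # ISO date string


@dataclass
class QueryTrace:
    user: str
    query: str
    chunks: list[RetrievedChunk] = field(default_factory=list)

    def explain(self) -> str:
        """Human-readable summary: which documents fed the answer, best first."""
        lines = [f"query by {self.user}: {self.query!r}"]
        ranked = sorted(self.chunks, key=lambda c: c.score, reverse=True)
        for rank, chunk in enumerate(ranked, start=1):
            lines.append(
                f"{rank}. {chunk.doc_id} score={chunk.score:.2f} "
                f"modified={chunk.last_modified}"
            )
        return "\n".join(lines)

    def to_json(self) -> str:
        """Machine-readable form for a log pipeline."""
        return json.dumps(asdict(self))


# Made-up trace: an outdated status report outranked the current one.
trace = QueryTrace(
    user="j.doe",
    query="project status",
    chunks=[
        RetrievedChunk("status-2024.pdf", 0.74, "2024-05-02"),
        RetrievedChunk("status-2022.pdf", 0.91, "2022-01-10"),
    ],
)

if __name__ == "__main__":
    print(trace.explain())
```

With a record like this, the unexpected answer explains itself: the older report simply scored higher, which an IT administrator can see without reading any retrieval code.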
The honest scorecard
| Platform | Access control at retrieval layer | Data handling transparency | Answer quality on messy data | Admin experience | Overall for sensitive enterprise use |
|---|---|---|---|---|---|
| Platform A (major productivity suite) | Failed indirect test | Evasive | Good on clean, poor on messy | Excellent | Not recommended |
| Platform B (enterprise search vendor) | Passed both tests | Detailed | Consistent | Good | Recommended with caveats |
| Platform C (AI-first startup) | Failed direct test | Evasive | Excellent on clean, poor on messy | Poor | Not recommended |
| Platform D (collaboration suite AI) | Passed direct, failed indirect | Generic | Good | Good | Requires careful configuration |
| PrivOS | Passed both tests (isolation-based) | Detailed and honest | Good on both clean and messy | Functional | Recommended for data-sensitive use cases |
| Platform F (vertical-specific) | Passed direct, failed indirect | Partial | Excellent in domain, poor outside | Average | Domain-specific use cases only |
What I would do differently if I ran this evaluation again
I would spend more time on the messy document test and less time on the clean document demo. The capability gap between platforms on clean, well-formed documents is small enough to be immaterial. The gap on messy, realistic enterprise documents is significant and is the gap that determines production performance.
I would run the access control tests earlier and treat them as a gate rather than a scored criterion. Platforms that fail retrieval-layer access control tests should not proceed to the other evaluation stages for data-sensitive use cases. The time spent evaluating answer quality and admin experience on a platform with fundamental access control weaknesses is not well spent.
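The difference between a gate and a scored criterion is easy to state in code. This is a sketch with made-up field names, weights, and scores, not the rubric I actually used.

```python
from typing import Optional

# Hypothetical integer weights for the criteria scored after the gate.
WEIGHTS = {"answer_quality": 3, "admin_experience": 2}


def evaluate(platform: dict, weights: dict[str, int] = WEIGHTS) -> Optional[int]:
    """Return a weighted score, or None if the platform fails the access control gate."""
    if not (platform["passed_direct"] and platform["passed_indirect"]):
        return None  # gated out: the remaining criteria are never scored
    return sum(weight * platform["scores"][name] for name, weight in weights.items())


# Made-up platforms with scores on a 1-5 scale.
gated_out = {
    "passed_direct": True,
    "passed_indirect": False,
    "scores": {"answer_quality": 5, "admin_experience": 5},
}
scored = {
    "passed_direct": True,
    "passed_indirect": True,
    "scores": {"answer_quality": 4, "admin_experience": 3},
}

if __name__ == "__main__":
    print(evaluate(gated_out))
    print(evaluate(scored))
```

Under a scored rubric, the first platform's perfect marks elsewhere could outweigh its access control failure. Under a gate, they cannot.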
I would include a longitudinal component. Ninety days of structured testing is meaningful, but some of the most important properties of an AI platform (query consistency over time, response to data quality changes, vendor relationship quality after the initial sale) are only visible over a longer horizon. I would run a six-month parallel pilot before making a final recommendation on any significant enterprise deployment.
The correct answer for this organization was a self-hosted deployment with retrieval-layer access control and infrastructure entirely under the organization's control. The decision between building that from scratch and deploying a platform that packages those properties is primarily a question of engineering capacity and time-to-production requirements. Both paths are viable. Neither is as simple as the demo made it look.