DEV Community

N8sGit

NEMO: A New Take On Data Discovery

Facebook Engineering recently published a blog post about NEMO, an in-house data discovery engine that combines some compelling techniques and ideas. The post is unfortunately sparse on technical details, and it doesn’t look like they intend to open-source the software. Still, it hints at some best-in-class data management practices and leans on three key technologies (graphs, machine learning, and search) to create a data workflow that scales to billions of users and thousands of employees. While the original post is a bit of a tease, it’s still worth asking: what can we learn from it?

Before reviewing the contents of the blog post, let’s first review the history of search and appreciate its fundamental role in modern computing. The undisputed emperor of search is of course Google, whose ranking algorithm knitted the whole internet together and made a vast array of resources discoverable to the average web surfer. The beauty of search is that it finds the happy middle ground between human and computer. People are naturally question askers, so typing a query into a simple input field that machines can parse and respond to demands a minimum of technical investment from the user while leaving maximum leverage to the technology. There are no clever settings to configure or special codes to write: you just type what you’re looking for in your natural language of choice and get what you want.

A discerning techie may have noticed that over the years search functionality has spread to all kinds of software domains. Search is now a standard feature of operating systems, letting a user quickly search their entire file system, phone settings, and so on, rather than rummaging through clunky GUIs full of nested folders. In the realm of databases, retrieving data has traditionally meant using query languages such as SQL, which require precise syntax and semantics to return the desired results. Databases that can be searched with free-text queries are now increasingly common. As data sets grow to humanly unimaginable sizes, search stands out as the go-to solution for finding the needle in the haystack.

The implementation of search of course has to accommodate the specifics of the data set it processes and the indexable fields that data exposes. The algorithm must work around the constraints the data imposes. Web search, for example, indexes web pages by crawling through hyperlinks, and this method clearly cannot be applied to a data format that lacks hyperlinks. The problem is compounded when you must search over varied, inconsistent data types. Web search is conveniently supplied with uniform data: the internet all runs on the same core standards, so the need to normalize data for format variance is limited. Facebook, however, mentions that they have over a dozen different data types in their internal databases. Any complex infrastructure is bound to store data in formats that have different read methods, so the first improvement NEMO achieves is to flatten the data so that it is uniformly readable, searchable, and indexable.
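To make the idea of flattening concrete, here is a minimal sketch in Python. It is not how NEMO does it (the post doesn’t say). The `SearchDoc` record, the two artifact kinds, and their field names are all hypothetical, chosen only to show varied formats being normalized into one searchable shape.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SearchDoc:
    """Uniform record that every artifact type is flattened into (hypothetical)."""
    doc_id: str
    kind: str
    title: str
    text: str


def from_table(raw: dict) -> SearchDoc:
    # A table is described by its name and its column names.
    return SearchDoc(
        doc_id=f"table:{raw['name']}",
        kind="table",
        title=raw["name"],
        text=" ".join(raw.get("columns", [])),
    )


def from_dashboard(raw: dict) -> SearchDoc:
    # A dashboard is described by its title and its widget labels.
    return SearchDoc(
        doc_id=f"dashboard:{raw['id']}",
        kind="dashboard",
        title=raw["title"],
        text=" ".join(raw.get("widgets", [])),
    )


# One normalizer per artifact kind; supporting a new kind means adding one entry.
NORMALIZERS = {"table": from_table, "dashboard": from_dashboard}


def normalize(kind: str, raw: dict) -> SearchDoc:
    return NORMALIZERS[kind](raw)


def search(docs, query: str):
    """Return docs whose title or text contains every query term (case-insensitive)."""
    terms = query.lower().split()
    return [
        doc for doc in docs
        if all(term in f"{doc.title} {doc.text}".lower() for term in terms)
    ]


docs = [
    normalize("table", {"name": "user_sessions", "columns": ["user_id", "started_at"]}),
    normalize("dashboard", {"id": 7, "title": "Weekly Sessions", "widgets": ["sessions by country"]}),
]
```

Once everything is a `SearchDoc`, the search code no longer needs to know whether a hit started life as a table or a dashboard: `search(docs, "sessions")` matches both artifacts, while `search(docs, "country")` matches only the dashboard.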

A key to data management is centralization. Data is at its most useful when it is pooled, organized, interpreted in the light of other data, and accessible through a single set of standardized procedures. If data is siloed and compartmentalized in different places, finding what you need is harder. Furthermore, the discovery burden is placed on people rather than infrastructure: you need to ask a person, usually a more senior team member, where that data is located. This is less than ideal, since it is slow and works only as long as the right person is around to ask.

NEMO is built on top of the graph search indexing system Unicorn. As the name implies, graph search fuses the data structure of a directed graph with search algorithms. Graphs are a particularly flexible, robust data structure for representing entities (nodes) and relationships (edges). They are powerful because they allow entities to be modeled using free-form, open-ended association rules. This is achieved by constructing adjacency lists. In the most basic terms, an adjacency list is a data pair consisting of an id and hits. An adjacency list, in other words, gives you all the nodes on the graph connected by a specific edge-type relation to a specific node. As FB puts it in their white paper, “We can model other real-world structures as edges in Unicorn’s graph. For example, we can imagine that there is a node representing the concept female, and we can connect all female users to this node via a gender edge. Assuming the identifier for this female node is 1, then we can find all friends of Jon Jones who are female by intersecting the sets of result-ids returned by friend:5 and gender:1.” (Here 5 is presumably Jon Jones’s own id.) Unicorn supports chaining these sorts of relational queries using standard logical operators such as AND and OR, in addition to a DIFFERENCE operator.
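The quoted example can be sketched in a few lines of Python. This is a toy model, not Unicorn’s actual implementation or API: the `GraphIndex` class, its methods, and the user ids are made up for illustration, and plain Python sets stand in for the result-id lists.

```python
from collections import defaultdict


class GraphIndex:
    """Toy index mapping a term like 'friend:5' to a set of result ids (hypothetical)."""

    def __init__(self):
        self._postings = defaultdict(set)

    def add_edge(self, edge_type, node_id, result_id):
        # Record that result_id is connected to node_id by an edge of edge_type.
        self._postings[f"{edge_type}:{node_id}"].add(result_id)

    def term(self, edge_type, node_id):
        # Return a copy so callers cannot mutate the index by accident.
        return set(self._postings.get(f"{edge_type}:{node_id}", ()))


def and_(*result_sets):
    """Ids present in every input set."""
    return set.intersection(*result_sets) if result_sets else set()


def or_(*result_sets):
    """Ids present in at least one input set."""
    return set().union(*result_sets)


def difference(base, *excluded):
    """Ids in base that appear in none of the excluded sets."""
    return base.difference(*excluded)


index = GraphIndex()
for friend_id in (10, 11, 12):   # made-up friends of user 5
    index.add_edge("friend", 5, friend_id)
for user_id in (10, 12, 13):     # made-up users linked to the "female" node, id 1
    index.add_edge("gender", 1, user_id)

female_friends = and_(index.term("friend", 5), index.term("gender", 1))
other_friends = difference(index.term("friend", 5), index.term("gender", 1))
print(sorted(female_friends))  # [10, 12]
print(sorted(other_friends))   # [11]
```

The point of the design is that every relationship, whatever it means in the real world, becomes the same kind of lookup, so arbitrary queries reduce to set operations over result ids.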

At a high level, the ingenuity of representing data like this lies in how open-ended it is. The logical connections between data are already represented in the graph’s structure. No matter the variance in data format, nearly all data exists to represent real-world relationships between entities, and so nearly all data can be made to fit a graph. Such a graph representation allows for rich, multi-dimensional, and overlapping data queries that treat all types of data the same. NEMO goes one step further by layering the latest and greatest NLP functionality over Unicorn’s graph technology, and that NLP layer provides an intuitive and natural way to access the data.

By the way, Unicorn is also built on an in-memory processing architecture, meaning that there are no slow reads and writes to disk. Rather, the operations happen in the servers’ RAM in real time. This is something you would want when working with highly volatile, fast-changing data sets.

Sophisticated modern search engines do more to close the gap between the searcher and the result. They use, for instance, personal data about the user’s search history to suggest hits or predict results. NEMO incorporates similar personalized usage information so that the data one tends to use can be anticipated and put in front. As the blog post puts it, “Nemo signals vary widely, from simple textual ones (degree of overlap between artifact name and query text) to content-aware ones (how many widgets appear in this dashboard) to highly personalized ones (how many people with your role have accessed this table recently). Nemo also computes a trust score for artifacts, indicating how likely they are to be a reliable source of data. This score is independent of the specific query and focuses on usage and freshness signals, using manual heuristics. When evaluating result quality for training, Nemo counts not just clicks but also other actions taken by the user. For instance, if an artifact was shown to the user and then they accessed it later that day, that is generally a good indication that they found it useful.” This smart ranking system turns the data graph into a responsive, dynamic system rather than a static table that has to be worked around when it doesn’t comply with user needs.
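Here is a rough Python sketch of how signals like these could be combined into a single ranking score. To be clear about what is assumed: the post says NEMO uses a trained model, whereas this sketch swaps in a hand-weighted sum. The `Artifact` fields, the weights, and the cap of 100 accesses are all invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class Artifact:
    name: str
    trust: float        # query-independent trust score in [0, 1] (hypothetical)
    role_accesses: int  # recent accesses by people with the searcher's role (hypothetical)


def text_overlap(query: str, name: str) -> float:
    """Fraction of query words that also appear in the artifact name."""
    query_words = set(query.lower().split())
    name_words = set(name.lower().replace("_", " ").split())
    if not query_words:
        return 0.0
    return len(query_words & name_words) / len(query_words)


def score(query: str, artifact: Artifact, weights=(0.6, 0.3, 0.1)) -> float:
    """Hand-weighted sum of a textual, a trust, and a personalized signal."""
    w_text, w_trust, w_personal = weights
    # Cap the access count so one very popular artifact cannot dominate.
    personal = min(artifact.role_accesses, 100) / 100
    return (
        w_text * text_overlap(query, artifact.name)
        + w_trust * artifact.trust
        + w_personal * personal
    )


def rank(query: str, artifacts):
    """Return artifacts ordered from most to least relevant."""
    return sorted(artifacts, key=lambda a: score(query, a), reverse=True)


artifacts = [
    Artifact("ad_revenue_dashboard", trust=0.8, role_accesses=50),
    Artifact("daily_users_backup", trust=0.2, role_accesses=1),
    Artifact("daily_active_users", trust=0.9, role_accesses=80),
]
ranked = rank("daily active users", artifacts)
print([a.name for a in ranked])
```

With these made-up numbers, the well-trusted, frequently used `daily_active_users` table comes first, the stale backup with a partial name match comes second, and the unrelated dashboard comes last despite its high trust score.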

Crucially, and to my knowledge this is fairly unusual among database systems, NEMO supports natural language queries. There is no need for a ponderous SQL-like query language to specify the request. You can just type a sentence and the NLP engine will interpret it. This tendency to “naturalize” search queries removes some of the technical cost of executing queries and opens up the system not only to engineers but to the whole organization.

The use of post-ranking machine learning relevancy signals means that search over massive data sets can adapt to how much and how often each piece of data is actually used. NEMO is of great interest for how it merges several great ideas to make data discovery more scalable, intuitive, and painless. If I had to call it, I'd say this approach might well be how big data is handled further down the line.
