<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: N8sGit</title>
    <description>The latest articles on DEV Community by N8sGit (@n8sgit).</description>
    <link>https://dev.to/n8sgit</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F400666%2F7dd9eb75-bac7-4683-811f-63821b2008ca.png</url>
      <title>DEV Community: N8sGit</title>
      <link>https://dev.to/n8sgit</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/n8sgit"/>
    <language>en</language>
    <item>
      <title>NEMO: A New Take On Data Discovery</title>
      <dc:creator>N8sGit</dc:creator>
      <pubDate>Fri, 16 Oct 2020 21:06:55 +0000</pubDate>
      <link>https://dev.to/n8sgit/nemo-a-new-take-on-data-discovery-372m</link>
      <guid>https://dev.to/n8sgit/nemo-a-new-take-on-data-discovery-372m</guid>
      <description>&lt;p&gt;Recently Facebook Engineering posted on their &lt;a href="https://engineering.fb.com/data-infrastructure/nemo/"&gt;blog&lt;/a&gt; about NEMO, an in-house data discovery engine that combines some compelling techniques and ideas. The post is unfortunately sparse on technical details, and it doesn’t look like Facebook intends to open-source the software. It does, however, hint at some best-in-class data management practices, combining key technologies (graphs, machine learning, and search) to create a data workflow that scales to billions of users and thousands of employees. While the original post is a bit of a tease, it’s still worth asking: what can we learn from it?&lt;/p&gt;

&lt;p&gt;Before reviewing the contents of the blog post, let’s first review the history of search and appreciate its fundamental role in modern computing. The undisputed emperor of search is of course Google, whose ranking algorithm knitted the whole internet together and made a vast array of resources discoverable to the average web surfer. The beauty of search is that it finds the happy middle ground between human and computer: people are natural question askers, so typing a query into a simple input field that machines can parse and respond to demands minimal technical investment from the user while leaving the technology maximal room for leverage. There are no clever settings or special codes to write; you just type what you’re looking for in your natural language of choice and get what you want.&lt;/p&gt;

&lt;p&gt;A discerning techie may have noticed that over the years search functionality has spread to all kinds of software domains. Search is now a standard feature of operating systems, allowing a user to quickly search their entire file system, phone settings, and so on, rather than rummaging through clunky GUIs full of nested folders. In the realm of databases, the ability to retrieve data has traditionally been restricted to query languages such as SQL, which require precise syntax and semantics to retrieve the desired results. Searchable databases are now increasingly common. As data sets grow to humanly unimaginable sizes, search stands out as the uncontested go-to solution for finding the needle in the haystack.&lt;/p&gt;

&lt;p&gt;The implementation of search of course has to accommodate the specifics of the data set it processes and the indexable fields that data surfaces. The algorithm must work around the constraints the data imposes. Web search, for example, indexes web pages by crawling through hyperlinks, but clearly this method cannot be applied to a data format that lacks hyperlinks. The problem is compounded when you must search over varied, inconsistent data types. Web search is conveniently supplied uniform data: the internet all runs on the same core standards, so the need to normalize data for format variance is limited. Facebook, however, mentions that they have over a dozen different data types in their internal databases. Any complex infrastructure is bound to store data in formats with different read methods, meaning that the first improvement NEMO makes is to flatten the data so that it is uniformly readable, searchable, and indexable.&lt;/p&gt;

&lt;p&gt;A key to data management is centralization. Data is at its most useful when it is pooled, organized, interpreted in the light of other data, and accessible through a single set of standardized procedures. If you clumsily silo and compartmentalize data in different places, then finding what you need is harder. Furthermore, the discovery burden falls on people rather than infrastructure: you have to ask a person, usually a more senior team member, where the data lives. This is less than ideal for obvious reasons.&lt;/p&gt;

&lt;p&gt;NEMO is built on top of the graph search indexing system Unicorn. As the name implies, graph search fuses the data structure of a directed graph with search algorithms. Graphs are a particularly flexible, robust data structure for representing entities (nodes) and relationships (edges), because they allow entities to be modeled using free-form, open-ended association rules. This is achieved by constructing adjacency lists. In the most basic terms, an adjacency list maps a key (an edge type paired with a node’s id) to its hits: the ids of all nodes connected to that node by that edge-type relation. As FB puts it in their &lt;a href="https://research.fb.com/wp-content/uploads/2013/08/unicorn-a-system-for-searching-the-social-graph.pdf"&gt;white paper&lt;/a&gt;, “We can model other real-world structures as edges in Unicorn’s graph. For example, we can imagine that there is a node representing the concept female, and we can connect all female users to this node via a gender edge. Assuming the identifier for this female node is 1, then we can find all friends of Jon Jones who are female by intersecting the sets of result-ids returned by friend:5 and gender:1.” Unicorn supports chaining these sorts of relational queries using standard logical operators such as AND/OR, in addition to more sophisticated DIFFERENCE operators.&lt;/p&gt;
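&lt;p&gt;To make the intersection idea concrete, here is a toy sketch in JavaScript of an adjacency-list index and an AND query over it. This is only an illustration of the concept, not Unicorn’s actual implementation, and the ids below are invented:&lt;/p&gt;

```javascript
// Toy adjacency-list index: each key is an edge-type:id term, and each
// value is the sorted list of node ids connected by that edge.
const index = {
  "friend:5": [1, 3, 8, 9],   // friends of user 5 (Jon Jones in the example)
  "gender:1": [3, 4, 9, 12]   // users connected to the "female" concept node
};

// An AND query is a set intersection over the two hit lists.
function and(termA, termB) {
  const hits = new Set(index[termB]);
  return index[termA].filter(id => hits.has(id));
}

console.log(and("friend:5", "gender:1")); // [3, 9] -- Jon's female friends
```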

&lt;p&gt;At a high level, the ingeniousness of representing data like this is apparent in how open-ended it is. The logical connections between data are already represented in the graph’s structure. No matter the variance in data format, nearly all data exists to represent real-world relationships between entities, and so nearly all data is conformable to a graph. NEMO goes one step further by layering the latest and greatest NLP functionality over Unicorn’s graph technology. Such a graph representation allows for rich, multi-dimensional and overlapping data queries that treat all types of data the same. Meanwhile the NLP layer provides an intuitive and natural way to access the data.&lt;/p&gt;

&lt;p&gt;Notably, Unicorn is also built on an in-memory processing architecture, meaning there are no slow reads and writes to disk. Rather, the operations happen in the servers’ RAM in real time. This is something you want when working with highly volatile, fast-changing data sets.&lt;/p&gt;

&lt;p&gt;Sophisticated modern search engines do more to close the gap between the searcher and the result. They use, for instance, personal data about the user’s search history to suggest hits or predict results. NEMO incorporates similar personalized usage information so that the data one tends to use can be anticipated and surfaced first. As the blog post puts it, “Nemo signals vary widely, from simple textual ones (degree of overlap between artifact name and query text) to content-aware ones (how many widgets appear in this dashboard) to highly personalized ones (how many people with your role have accessed this table recently). Nemo also computes a trust score for artifacts, indicating how likely they are to be a reliable source of data. This score is independent of the specific query and focuses on usage and freshness signals, using manual heuristics. When evaluating result quality for training, Nemo counts not just clicks but also other actions taken by the user. For instance, if an artifact was shown to the user and then they accessed it later that day, that is generally a good indication that they found it useful.” This smart ranking system turns the data graph into a responsive, dynamic system rather than a static table that has to be worked around when it doesn’t comply with user needs.&lt;/p&gt;
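&lt;p&gt;Facebook doesn’t disclose how these signals are weighted. Purely as a hypothetical illustration, a ranker in this style might fold per-artifact signals into a single score and sort on it; the signal names, weights, and artifacts below are all invented, not NEMO’s:&lt;/p&gt;

```javascript
// Hypothetical illustration only: combine a few relevance signals into one
// score. The signal names and weights are invented, not NEMO's.
function rankScore(signals) {
  const { textOverlap, personalUsage, trustScore } = signals;
  return 0.5 * textOverlap + 0.3 * personalUsage + 0.2 * trustScore;
}

const artifacts = [
  { name: "daily_users_dashboard", signals: { textOverlap: 0.9, personalUsage: 0.2, trustScore: 0.8 } },
  { name: "users_table",           signals: { textOverlap: 0.7, personalUsage: 0.9, trustScore: 0.9 } }
];

// Sort descending by score, as a search engine would when ranking hits.
artifacts.sort((a, b) => rankScore(b.signals) - rankScore(a.signals));
console.log(artifacts.map(a => a.name)); // ["users_table", "daily_users_dashboard"]
```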

&lt;p&gt;Crucially, and to my knowledge this is still rare among database systems, NEMO supports natural language queries. There is no need for a ponderous SQL-like query language to specify the request. You can just type a sentence and the NLP engine will interpret it. This “naturalization” of search queries removes some of the technical cost of executing queries and opens up the system not only to engineers but organization-wide.&lt;/p&gt;

&lt;p&gt;The use of post-ranking machine learning relevancy signals means that massive data sets can respond to how much they are used and how often they are needed. NEMO is of great interest for how it merges several great ideas to make data discovery more scalable, intuitive, and painless. If I had to call it, I’d say this approach may well point the way toward how big data is handled further down the line.&lt;/p&gt;

</description>
      <category>database</category>
    </item>
    <item>
      <title>Recoil.js: Reactified State Management</title>
      <dc:creator>N8sGit</dc:creator>
      <pubDate>Tue, 15 Sep 2020 15:11:08 +0000</pubDate>
      <link>https://dev.to/n8sgit/recoil-js-reactified-state-management-58oe</link>
      <guid>https://dev.to/n8sgit/recoil-js-reactified-state-management-58oe</guid>
      <description>&lt;p&gt;As browser-based clients have grown in complexity in recent years, they have become a far cry from the simple static HTML skeletons of old. To accommodate the increasing data demands, sophisticated operations, and interactivity of modern UIs, many crafty frontend libraries have emerged in the past decade. Among the most popular of these is react.js. As the complexity of UIs has grown, efficient application state management to deal with all the data changes has become a crucial feature of scalable frontend infrastructure. Several popular state management libraries have come to the forefront, such as redux and mobx. While these libraries have various advantages and drawbacks, they are marred by a certain lack of parallelism with the UI libraries they interface with. As anyone who has worked with redux can attest, as useful as it is, it sticks out like a sore thumb next to the rest of the tooling and demands much tedious configuration and boilerplate to extend it even marginally. Perhaps what we need is &lt;em&gt;manageable&lt;/em&gt; state management.&lt;/p&gt;

&lt;p&gt;Happily, Facebook Engineering has recently released recoil.js, a state management module that leverages react-like concepts that mesh with the overall design philosophy of react. This strikes me as a very cool idea. So let’s learn something about it!&lt;/p&gt;

&lt;p&gt;First and foremost, what problems does recoil address to justify yet another state management tool? In short it provides a clean and intuitive interface for shared state between components, derived data and queries, and observation. We’ll address these in turn and then take a dive into the main concepts and syntax of recoil. &lt;/p&gt;

&lt;p&gt;Any state management library obviously wants to solve the problem of sharing state application-wide. The cool thing about recoil is that it allows components to tap into a store without much boilerplate and without imposing unreact-like concepts onto your components.&lt;/p&gt;

&lt;p&gt;Derived data and queries are of great use when you want components to tap into certain regular computations or API requests. Clearly if many components are going to be doing the same thing, it makes sense to host this functionality outside the component and to provide a subscribable set of functions called selectors to handle this need.&lt;/p&gt;

&lt;p&gt;Observation is a useful feature to have when dealing with application state. In essence, observation allows a component to watch everything that’s happening in the app. This is useful for debugging, logging, persistence and keeping components’ state synchronized. &lt;/p&gt;

&lt;p&gt;One of the attractive aspects of recoil is its comparative simplicity. There are really only two main concepts to it: atoms and selectors. Let’s go over the basics.&lt;/p&gt;

&lt;p&gt;Atoms are the changeable pieces of application state that various components throughout the app can subscribe to. They account for the “single source of truth” principle of state management. When an atom updates, every component subscribed to it re-renders and syncs with the current state of the atom. Creating an atom is easy:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import { atom } from 'recoil';
const counterState = atom({
key: ‘counterState’,
default: 0
});
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;That’s really all there is to it. You define a variable using the atom() function, and pass it an object with a key and a default state. Then it’s just a matter of subscribing the desired component to that atom, which can be achieved with precisely zero configuration. Using hooks, doing so looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;const App = () =&amp;gt; {
const [count, setCount] = useRecoilState(counterState);
const loading = useRecoilValue(counterState);
...
}
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;Do the same for each component you wish to connect to the same piece of state and each will consistently sync up with it and reflect its updates accordingly. useRecoilState returns the current value along with a setter function, so the state can be updated from within the component. useRecoilValue returns just the current state of the atom, for display or general use within the component.&lt;/p&gt;

&lt;p&gt;Aside from a few minor details, that’s essentially all there is to atoms. The naming is apt; atoms are meant to be the most elementary pieces of state, with little baggage besides the minimum definitional properties needed to specify them.&lt;/p&gt;

&lt;p&gt;Next come selectors. Selectors are a bit more complicated: they handle derived state in recoil. They accept atoms or other selectors as input. You define a selector in a similar way to an atom (note that every atom and selector needs its own unique key):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import { selector } from 'recoil';
const checkCounterState = selector({
key: ‘counterState’,
get: ({ get } ) =&amp;gt; {
const count = get(counterState)
function isPrime(num) {
  for(var i = 2; i &amp;lt; num; i++)
    if(num % i === 0) return false;
  return num &amp;gt; 1;
}
return isPrime(count);
})
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;This selector tells you whether the current state of the counter is a prime number. You can subscribe to the selector within any component and run this computation wherever needed. Selectors provide a consistent app-wide API for calculating derived state. Selectors can also be writable, meaning you can update state through them. Recoil also supports asynchronous selectors without the need for any external libraries, so selectors can return promises and be used for server queries.&lt;/p&gt;
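&lt;p&gt;To see why this pattern is attractive, here is a framework-free toy model of the subscribe-and-derive idea in plain JavaScript. This is a conceptual sketch only; it is not how recoil is implemented, and all names here are invented:&lt;/p&gt;

```javascript
// Toy model of the atom/selector idea -- a sketch of the concept, not
// recoil's implementation. An atom holds state and notifies subscribers;
// a selector derives a value from it on demand.
function createAtom(defaultValue) {
  let value = defaultValue;
  const subscribers = [];
  return {
    get: () => value,
    set: (next) => { value = next; subscribers.forEach(fn => fn(value)); },
    subscribe: (fn) => subscribers.push(fn)
  };
}

function createSelector(atom, derive) {
  return { get: () => derive(atom.get()) };
}

const counter = createAtom(0);
const doubled = createSelector(counter, n => n * 2);

// A subscribed "component" reacts to every update, like a re-render.
counter.subscribe(n => console.log("re-render with", n));
counter.set(21);
console.log(doubled.get()); // 42
```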

&lt;p&gt;While there is more depth to recoil, what you see here is the core of the library. Recoil is in its infancy, and is even considered merely “experimental” by its developers. Few however can deny the appeal of its clean and simple interface. Recoil is certainly a piece of state you will want to subscribe to as it matures! &lt;/p&gt;

</description>
      <category>react</category>
      <category>recoil</category>
      <category>frontend</category>
    </item>
    <item>
      <title>Dunno about Deno? A Primer on the New JS Runtime from the Creator of Node</title>
      <dc:creator>N8sGit</dc:creator>
      <pubDate>Wed, 03 Jun 2020 19:49:43 +0000</pubDate>
      <link>https://dev.to/n8sgit/dunno-about-deno-a-primer-on-the-new-js-runtime-from-the-creator-of-node-1g4c</link>
      <guid>https://dev.to/n8sgit/dunno-about-deno-a-primer-on-the-new-js-runtime-from-the-creator-of-node-1g4c</guid>
      <description>&lt;p&gt;In 2009 Node debuted. As a runtime environment that supported server-side JavaScript, it was a bit of an odd specimen, but it quickly generated buzz and widespread adoption. Node took JS out of the browser and used it to power a runtime process. There are several advantages to this approach, particularly from the view of web development. One of the clearest benefits is uniformity across a web app’s implementation. Having the same language run on both the browser and the server eliminates assumptions and improves module cohesion. A programming language may or may not work well with another, but it always works well with itself. It also makes sense to model a web server on the single-threaded, event-driven concurrency model that runs on browser engines; Node uses the same V8 engine that runs in Chrome. Using the same language on the frontend and backend also clips the learning cost of developing full-stack web apps, making Node a good choice for someone who wants to get set up and going fast.&lt;/p&gt;

&lt;p&gt;There are, however, some shortcomings to Node. JavaScript was not intended to be a server-side language; it had to be taken out of its natural habitat in the browser and modified to fit that role. As a dynamically typed language with built-in garbage collection and memory management, JS forces onto the server certain behaviors that a server author might prefer to control. In particular, dynamic typing introduces noise into server design: if a number unexpectedly gets cast into a string somewhere in a complex backend process, that is almost sure to break something at some point. Generally you want to explicitly declare variable types and control memory allocation on the backend, concerns that JS is highly opinionated about or automates away.&lt;/p&gt;
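&lt;p&gt;To make the hazard concrete: JS will happily coerce a number into a string mid-computation, a class of bug that static typing rules out. The variable names below are invented for illustration:&lt;/p&gt;

```javascript
// The kind of silent coercion dynamic typing permits: if one value in a
// numeric pipeline arrives as a string (say, from a query parameter),
// addition quietly becomes string concatenation.
const fromDatabase = 10;
const fromRequest = "5";                          // should have been parsed to a number

const wrong = fromDatabase + fromRequest;         // "105" -- concatenation
const right = fromDatabase + Number(fromRequest); // 15

console.log(wrong, right); // "105" 15
```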

&lt;p&gt;Another issue with Node is that JS is a rapidly evolving language and was a different animal over a decade ago. At the time, JS had no built-in way of expressing asynchronous I/O, no promises and no async/await, and without asynchronous I/O you effectively can’t do what servers are supposed to do. So Node, whose core is written largely in C and C++, had to layer its own abstractions, such as the EventEmitter API and callback conventions, on top, and it accumulated technical debt as it was jerry-rigged to accommodate later advances in the language.&lt;/p&gt;

&lt;p&gt;Deno, spearheaded by the creator of Node, Ryan Dahl, is a response to these problems. It isn’t a fork of the Node source code but an entirely new project that attempts to reimplement some of the needs addressed by Node while casting it in a new and improved mould. Here we’ll go into some detail on what Deno is about and how it could be a fresh and invigorating take on server-side JS.&lt;/p&gt;

&lt;p&gt;One big difference with Deno is first-class TypeScript support. For the uninitiated, TypeScript is an extension of JS that allows optional strict typing of values. The result is a more predictable, tightly controlled context. Adding the typing facilities of TS lets you start with quick, hacky implementations and then scale up to more rigorously typed code without having to fundamentally alter the code structure.&lt;/p&gt;

&lt;p&gt;Node was developed before ES6 introduced the now indispensable &lt;code&gt;Promise&lt;/code&gt; object. Deno is designed with promises in mind, streamlining callback handling. Deno is built around ES modules rather than the CommonJS specification. It also supports the handy async/await syntax, which has made life much easier for developers using JS. In general, Deno is designed to be more consilient with the browser and web APIs. For example, the fetch API, which browsers use to handle HTTP resource transactions, is part of Deno’s repertoire.&lt;/p&gt;

&lt;p&gt;Unlike Node, which allows open access by default, Deno has a secure permissions policy. Any access to the OS layer, file system, network, or environment must be explicitly granted via command-line flags such as --allow-read, --allow-net, and --allow-env. Your linter should not have access to your whole computer unless you want it to for some reason. Deno is sandboxed by default.&lt;/p&gt;

&lt;p&gt;Deno works out of the box as a single executable. It also comes with built-in code formatting, unit testing, and CLI tools. Deno does not use npm to install dependencies. Instead, it references modules directly by URL or file path. The result is a leaner, more compact runtime!&lt;/p&gt;

&lt;p&gt;Reliance on URLs for module imports has the advantage of perfect specificity. A URL by definition is a unique reference to a resource location. In contrast, linking to a package by bare name depends on a namespace resolution algorithm. In Node, &lt;code&gt;node_modules&lt;/code&gt;, the massive directory holding a project’s resources, contains a reference to a module, and so does &lt;code&gt;package.json&lt;/code&gt;. This is needlessly confusing. By the way, &lt;code&gt;package.json&lt;/code&gt; is going the way of the Deno and will not be used with it. The concept that modules should be managed as a directory of files is not native to the web, where URLs prevail. &lt;code&gt;package.json&lt;/code&gt; demands that module versions be tracked in a dependencies list. When libraries are instead linked by URLs, the URL’s path itself encodes the version. It’s a simplified approach to dependency linking. Deno also caches each dependency the first time it builds, so you need not worry about URL instability or a URL pointing to an outdated resource unless you want to update it with the --reload flag. Because of this you can also run the app offline.&lt;/p&gt;
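&lt;p&gt;As a concrete illustration of version-in-the-path: the URL below follows deno.land’s real std@version convention, and the pinned version is recoverable from the import URL itself, with no separate manifest:&lt;/p&gt;

```javascript
// With URL imports, the version is encoded in the path itself, so no
// separate dependencies manifest is needed to pin it.
const moduleUrl = "https://deno.land/std@0.50.0/http/server.ts";
const version = moduleUrl.match(/@([\d.]+)\//)[1];
console.log(version); // "0.50.0"
```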

&lt;p&gt;In Node, &lt;code&gt;node_modules&lt;/code&gt; is installed locally in every project, greatly increasing each project’s size. The inefficient module resolution algorithm that traverses the &lt;code&gt;node_modules&lt;/code&gt; file tree can be dispensed with when a more direct URL pathing method is used.&lt;/p&gt;

&lt;p&gt;There's more to Deno, but this gives you an overview of some of its motivating ideas and how it differs from its predecessor. So will Deno replace Node? Time will tell. The first production-ready version of Deno was released in early May 2020, and at this writing it is very much in its infancy. Node has robust support and widespread representation in countless production builds at established companies. Deno may indeed spell a categorical improvement over Node, but that does not mean it will become the industry standard quite yet. Nor does it mean that Node is on the way out. That said, Deno embodies years of thinking about how to improve Node, using concepts from a much-evolved JavaScript language and carefully considered design decisions. It will be interesting to see what becomes of this new technology in the near future.&lt;/p&gt;

</description>
      <category>deno</category>
      <category>javascript</category>
      <category>node</category>
    </item>
  </channel>
</rss>
