P2P For Web Devs, Part 1: Networking

#p2p #node #distributedsystems #javascript

Previous article in this series: P2P For Web Devs, Prologue: Why should I care?

Hyperswarm

When you start with the Pear runtime, one of the first things you encounter is Hyperswarm. It describes itself as:

A high-level API for finding and connecting to peers who are interested in a "topic."

Let's dissect this seemingly simple description.

Peers are processes that communicate over a network, with one key distinction: they both make requests and receive them. You will often see the term "node" too. The two are used almost interchangeably in a P2P context, as all peers are nodes.

Swarm in Hyperswarm is a group of peers that are all connected to each other. E.g. a group call with 4 people in it is a swarm of 4 peers.

Finding means solving the equivalent of a DNS lookup: we need the IP addresses of the peers we want to connect to.

Connecting is the step after "finding". When you get the IP address of a server, you usually connect with an HTTP request on a predefined port number. Here it is more complex, because the peer we are connecting to is not conveniently listening on port 443.

Topic is a way for peers to find each other. E.g. in a phone call of 4 people, a topic would be a call id. Someone else could join this call without knowing the other 4 people, they just need the call id.

A key point to understand is that a peer is not the same thing as an IP address. Your phone is a peer. But a phone is a mobile device that keeps changing IPs as you move between networks. So a topic matches you to a peer, but translating a peer to an IP address is a separate problem.
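To make the topic and peer distinction more concrete, here is a minimal sketch using only Node's standard library. It is not Hyperswarm code: `ToyDiscovery`, `topicFor`, the peer names and the addresses are all made up for illustration. The only thing borrowed from Hyperswarm is the idea that a topic is a 32-byte key, which is why the call id is hashed first.

```javascript
const crypto = require('crypto')

// Derive a fixed-size 32-byte topic key from a human-readable id.
function topicFor (callId) {
  return crypto.createHash('sha256').update(callId).digest()
}

// Hypothetical stand-in for the discovery layer:
// topic -> (peer id -> current address).
class ToyDiscovery {
  constructor () {
    this.topics = new Map()
  }

  // Announce (or re-announce) a peer on a topic with its current address.
  announce (topic, peerId, address) {
    const key = topic.toString('hex')
    if (!this.topics.has(key)) this.topics.set(key, new Map())
    this.topics.get(key).set(peerId, address)
  }

  // Return everyone currently announced on a topic.
  lookup (topic) {
    const peers = this.topics.get(topic.toString('hex')) || new Map()
    return [...peers].map(([peerId, address]) => ({ peerId, address }))
  }
}

const discovery = new ToyDiscovery()
const topic = topicFor('call-1234')

discovery.announce(topic, 'alice-phone', '203.0.113.7:49737')
discovery.announce(topic, 'bob-laptop', '198.51.100.20:51000')

// Alice switches from Wi-Fi to mobile data: same peer, new address.
discovery.announce(topic, 'alice-phone', '192.0.2.55:40112')

const found = discovery.lookup(topic)
console.log(found)
```

The topic stays the same and the peer stays the same, only the address entry changes. In a real network this table cannot live in one process, which is the problem the rest of this article is about.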

Now that's a lot of information packed into that short sentence describing Hyperswarm.

Let's reduce scope, and go deeper.

Hyperswarm is built on top of HyperDHT.

HyperDHT

The Hyperswarm DHT uses a series of holepunching techniques to make sure connectivity works on most networks, and is mainly used to facilitate finding and connecting to peers using end to end encrypted Noise streams.

We went a layer deeper, picked up new terms, and the description is twice as long now.

DHT - Distributed Hash Table. That's a hash table that, instead of living in the memory of a single machine, is spread across many different machines. Each machine holds a range of the table's data, and all of them together form the full table. E.g. Cassandra DB distributes its data across nodes in a DHT-like way.
In our context, the DHT is what we can use to implement the "finding" part. That's where we can keep information about existing peers, their IP addresses and their topics. It's the metadata layer that is accessed by the entire network, without a central server.
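As a rough sketch of the "many machines, one table" idea, here is a toy version where each "machine" is just a `Map` in the same process. `ToyDHT` and `hashKey` are names I made up, and placing keys with a simple modulo is the naive approach, not what HyperDHT does. It breaks down as soon as machines join or leave, because most keys would then map to a different machine.

```javascript
const crypto = require('crypto')

// Hash a key to an unsigned 32-bit integer.
function hashKey (key) {
  return crypto.createHash('sha256').update(key).digest().readUInt32BE(0)
}

class ToyDHT {
  constructor (nodeCount) {
    // Each "machine" is simulated by its own Map.
    this.nodes = Array.from({ length: nodeCount }, () => new Map())
  }

  // Pick the machine responsible for a key (naive modulo placement).
  nodeFor (key) {
    return this.nodes[hashKey(key) % this.nodes.length]
  }

  put (key, value) {
    this.nodeFor(key).set(key, value)
  }

  get (key) {
    return this.nodeFor(key).get(key)
  }
}

const table = new ToyDHT(4)
table.put('alice-phone', '192.0.2.55:40112')
table.put('bob-laptop', '198.51.100.20:51000')
table.put('carol-desktop', '203.0.113.99:33000')

// No single Map is required to hold everything; together they hold all entries.
const total = table.nodes.reduce((sum, node) => sum + node.size, 0)
console.log(table.get('bob-laptop'), total)
```

The caller never has to know which machine holds a key. The hash decides, and every participant can compute the same answer on its own.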

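holepunching - that's the connection part between peers. Most peers sit behind a NAT or a firewall and cannot simply accept incoming connections, and holepunching is a set of techniques for establishing a direct connection anyway. It solves the previously mentioned issue of "not conveniently listening on port 443".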

end to end encrypted Noise streams - that's the equivalent of TLS (the successor to SSL) on the web. The Noise protocol is a lot more lightweight than TLS and does not require a certificate authority, which makes it a good fit for embedded systems and for P2P systems that do not want to depend on a central certificate issuer.
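Noise handshakes are built on Diffie-Hellman key agreement, so the "no certificate authority" part can be illustrated with Node's built-in `crypto` module. This is only the underlying idea, not the Noise protocol and not what HyperDHT runs internally. A real handshake also authenticates the peers and derives proper session keys from the shared secret.

```javascript
const crypto = require('crypto')

// Each peer generates its own key pair locally. Nobody issues or signs it.
const alice = crypto.generateKeyPairSync('x25519')
const bob = crypto.generateKeyPairSync('x25519')

// Each side combines its own private key with the other side's public key.
const aliceSecret = crypto.diffieHellman({
  privateKey: alice.privateKey,
  publicKey: bob.publicKey
})
const bobSecret = crypto.diffieHellman({
  privateKey: bob.privateKey,
  publicKey: alice.publicKey
})

// Both peers end up with the same secret without ever sending it over the wire.
const sameSecret = aliceSecret.equals(bobSecret)
console.log(sameSecret)
```

Only public keys travel over the network. This is also why a public key works well as a peer identity: it stays the same no matter which IP address the peer currently has.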

So this module's description gives us terms not only about what is being done, but also about how it is being done.

Another line in the description of HyperDHT is:

Built on top of dht-rpc.

dht-rpc

Make RPC calls over a Kademlia based DHT.

Remote IP / firewall detection

Easily add any command to your DHT

Streaming queries and updates

I think we are deep enough! This is a core module that focuses on the DHT itself.

Kademlia based DHT - DHTs can be implemented in different ways. E.g. systems made of controlled and trusted nodes will usually implement a DHT with a hash ring. In a P2P network, where peers connect and disconnect all the time, are untrusted, and can easily number in the millions, Kademlia is the standard. These approaches differ in how the node holding the data you need is actually found.
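The core trick in Kademlia is that node ids and data keys live in the same id space, and "closeness" is the XOR of two ids read as a number. Here is a minimal sketch of that metric. The helper names (`idFor`, `xorDistance`, `closest`) and the node names are made up, and the function cheats by seeing every node at once. In a real Kademlia network each node only knows part of the network, and the lookup walks closer to the target step by step.

```javascript
const crypto = require('crypto')

// Ids for nodes and keys come from the same hash, so they share one id space.
function idFor (name) {
  return crypto.createHash('sha256').update(name).digest()
}

// XOR distance between two equal-length ids, returned as a Buffer.
function xorDistance (a, b) {
  const out = Buffer.alloc(a.length)
  for (let i = 0; i < a.length; i++) out[i] = a[i] ^ b[i]
  return out
}

// Return the k nodes whose ids are closest to the target id.
function closest (nodes, target, k) {
  return nodes
    .map(node => ({ node, distance: xorDistance(node.id, target) }))
    // Byte-wise comparison equals numeric comparison for equal-length ids.
    .sort((x, y) => Buffer.compare(x.distance, y.distance))
    .slice(0, k)
    .map(entry => entry.node)
}

const nodes = ['node-a', 'node-b', 'node-c', 'node-d', 'node-e']
  .map(name => ({ name, id: idFor(name) }))

const target = idFor('call-1234')
const nearest = closest(nodes, target, 2)
console.log(nearest.map(node => node.name))
```

The nodes closest to a key are the ones responsible for storing it, so anyone who knows the key can work out where to look.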

I mentioned that we use the DHT here for the metadata, for resolving the peer-to-IP mappings. But that's just the use case in the Hyperswarm module. The dht-rpc module is a generic implementation, giving you the ability to add your own commands and query your DHT for whatever custom data you need.
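To give a feel for what "add any command" means, here is a toy node with pluggable commands. This is explicitly not the dht-rpc API: `ToyRpcNode`, `command`, `request` and the `put`/`get` commands are hypothetical names, and everything runs in one process with no networking.

```javascript
// Hypothetical toy node: commands are just named handler functions.
class ToyRpcNode {
  constructor () {
    this.handlers = new Map()
    this.store = new Map()
  }

  // Register a custom command on this node.
  command (name, handler) {
    this.handlers.set(name, handler)
  }

  // Handle an incoming request by dispatching to the matching command.
  request ({ command, target, value }) {
    const handler = this.handlers.get(command)
    if (!handler) return { error: 'UNKNOWN_COMMAND' }
    return { value: handler(this.store, target, value) }
  }
}

const node = new ToyRpcNode()

// The node itself knows nothing about peers or topics, only about commands.
node.command('put', (store, target, value) => {
  store.set(target, value)
  return true
})
node.command('get', (store, target) => store.get(target) ?? null)

node.request({ command: 'put', target: 'high-score', value: 9001 })
const reply = node.request({ command: 'get', target: 'high-score' })
const unknown = node.request({ command: 'delete', target: 'high-score' })
console.log(reply, unknown)
```

Seen this way, peer discovery is just one set of commands on top of a generic layer, and your own application data could be another.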

Now that we have the vocabulary and understand the problem scope, we can start looking at how these problems are actually solved.

In the next articles I will go through the implementation details of each of these modules, building up from dht-rpc to Hyperswarm.