<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Volodymyr Pavlyshyn</title>
    <description>The latest articles on DEV Community by Volodymyr Pavlyshyn (@volland).</description>
    <link>https://dev.to/volland</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F377537%2F203e8e29-233b-4230-9f97-a861ae4ad5b8.jpeg</url>
      <title>DEV Community: Volodymyr Pavlyshyn</title>
      <link>https://dev.to/volland</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/volland"/>
    <language>en</language>
    <item>
      <title>Pre and Post Filtering in Vector Search with Metadata and RAG Pipelines</title>
      <dc:creator>Volodymyr Pavlyshyn</dc:creator>
      <pubDate>Mon, 30 Sep 2024 14:50:05 +0000</pubDate>
      <link>https://dev.to/volland/pre-and-post-filtering-in-vector-search-with-metadata-and-rag-pipelines-2hji</link>
      <guid>https://dev.to/volland/pre-and-post-filtering-in-vector-search-with-metadata-and-rag-pipelines-2hji</guid>
      <description>&lt;p&gt;In the modern world of AI, managing vast amounts of data while keeping it relevant and accessible is a significant challenge, mainly when dealing with large language models (LLMs) and vector databases. One approach that has gained prominence in recent years is integrating vector search with metadata, especially in retrieval-augmented generation (RAG) pipelines. Vector search and metadata enable faster and more accurate data retrieval. However, the process of pre- and post-search filtering results plays a crucial role in ensuring data relevance.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Vector Search and Metadata Challenge
&lt;/h2&gt;

&lt;p&gt;In a typical vector search, you create embeddings from chunks of text, such as a PDF document. These embeddings allow the system to search for similar items and retrieve them based on relevance. The challenge, however, arises when you need to combine vector search results with structured metadata. For example, you may have timestamped text-based content and want to retrieve the most relevant content within a specific date range. This is where metadata becomes critical in refining search results.&lt;/p&gt;
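&lt;p&gt;For intuition, the chunking step described above can be sketched as follows. This is my own illustration, not the pipeline from the article; the chunk size and overlap values are arbitrary assumptions.&lt;/p&gt;

```typescript
// Split a document into overlapping chunks before computing embeddings.
// chunkSize and overlap are illustrative assumptions, not recommendations.
function chunkText(text: string, chunkSize = 200, overlap = 50): string[] {
  const chunks: string[] = [];
  for (let start = 0; ; start += chunkSize - overlap) {
    const chunk = text.slice(start, start + chunkSize);
    if (chunk.length === 0) {
      break; // nothing left to chunk
    }
    chunks.push(chunk);
    if (start + chunkSize >= text.length) {
      break; // this chunk already reached the end of the text
    }
  }
  return chunks;
}
```

&lt;p&gt;Each chunk would then be passed to an embedding model, and the resulting vector stored alongside the chunk's metadata.&lt;/p&gt;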

&lt;p&gt;Unfortunately, most vector databases treat metadata as a secondary feature, isolating it from the primary vector search process. As a result, handling queries that combine vectors and metadata can become a challenge, particularly when the search needs to account for a dynamic range of filters, such as dates or other structured data.&lt;/p&gt;

&lt;h2&gt;
  
  
  LibSQL and vector search metadata
&lt;/h2&gt;

&lt;p&gt;LibSQL is a general-purpose SQLite-based database that adds vector capabilities to regular data. Vectors are stored as blob columns of regular tables. This makes vector embeddings and metadata first-class citizens and naturally enables deep integration between these data points.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;  create table if not exists conversation (
    id varchar(36) primary key not null,
    startDate real,
    endDate real,
    summary text,
    vectorSummary F32_BLOB(512)
   );
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It solves the challenge of metadata and vector search, eliminating the impedance mismatch between vector data and regular structured data points in the same storage.&lt;/p&gt;

&lt;p&gt;As you can see, you can access vector data and the start date in the same query.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;select c.id ,c.startDate, c.endDate, c.summary, vector_distance_cos(c.vectorSummary, vector(${vector})) distance
      from conversation
      where
      ${startDate ? `and c.startDate &amp;gt;= ${startDate.getTime()}` : ''}
      ${endDate ? `and c.endDate &amp;lt;= ${endDate.getTime()}` : ''}
      ${distance ? `and distance &amp;lt;= ${distance}` : ''}
      order by distance
      limit ${top};
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;vector_distance_cos&lt;/strong&gt;, computed as distance, gives us a primitive vector search that performs a full scan and calculates the distance for every row. We could optimize it with a CTE and limit the search and distance calculations to a much smaller subset of the data.&lt;/p&gt;

&lt;p&gt;This approach is computation-intensive and can fail on large amounts of data.&lt;/p&gt;
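&lt;p&gt;For intuition, the distance this full scan computes per row is the cosine distance. A minimal sketch of the math (my own illustration, not libSQL's implementation):&lt;/p&gt;

```typescript
// Cosine distance between two equal-length vectors:
// distance = 1 - (a . b) / (|a| * |b|)
function cosineDistance(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  a.forEach((ai, i) => {
    dot += ai * b[i];
    normA += ai * ai;
    normB += b[i] * b[i];
  });
  return 1 - dot / (Math.sqrt(normA) * Math.sqrt(normB));
}
```

&lt;p&gt;Running this once per row is exactly why a full scan becomes expensive as the table grows.&lt;/p&gt;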

&lt;p&gt;LibSQL offers a far more efficient vector search based on DiskANN vector indexes.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;vector_top_k('idx_conversation_vectorSummary', ${vector} , ${top}) i
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;vector_top_k&lt;/strong&gt; is a table function that searches the newly created vector index for the top matches. Note that only the vector can be passed as a function parameter; other columns have to be handled outside the table function. So, to use a vector index together with other columns, we need to apply some strategies.&lt;/p&gt;

&lt;p&gt;Now we face the classic problem of integrating vector search results with metadata queries.&lt;/p&gt;

&lt;h2&gt;
  
  
  Post-Filtering: A Common Approach
&lt;/h2&gt;

&lt;p&gt;The most widely adopted method in these pipelines is &lt;strong&gt;post-filtering&lt;/strong&gt;. In this approach, the system first retrieves data based on vector similarities and then applies metadata filters. For example, imagine you’re conducting a vector search to retrieve conversations relevant to a specific question. Still, you also want to ensure these conversations occurred in the past week.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgma4m3hwvav6x5ypufz0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgma4m3hwvav6x5ypufz0.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Post-filtering allows the system to retrieve the most relevant vector-based results and subsequently filter out any that don’t meet the metadata criteria, such as date range. This method is efficient when vector similarity is the primary factor driving the search, and metadata is only applied as a secondary filter.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;    const sqlQuery = `
      select c.id ,c.startDate, c.endDate, c.summary, vector_distance_cos(c.vectorSummary, vector(${vector})) distance
      from  vector_top_k('idx_conversation_vectorSummary', ${vector} , ${top}) i
      inner join conversation c on i.id = c.rowid
      where 1=1
      ${startDate ? `and c.startDate &amp;gt;= ${startDate.getTime()}` : ''}
      ${endDate ? `and c.endDate &amp;lt;= ${endDate.getTime()}` : ''}
      ${distance ? `and distance &amp;lt;= ${distance}` : ''}
      order by distance
      limit ${top};
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;However, there are some limitations. The initial vector search may omit relevant rows before the metadata filter is applied, so if the search window is narrow enough, this can lead to incomplete results.&lt;/p&gt;

&lt;p&gt;One working strategy is to make the top value passed to vector_top_k much bigger. Be careful, though, as the function's default maximum number of results is around 200 rows.&lt;/p&gt;
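&lt;p&gt;A minimal sketch of that strategy. The over-fetch factor and the cap of 200 are tunable assumptions, not libSQL constants:&lt;/p&gt;

```typescript
// Over-fetch from the vector index so that post-filtering still leaves
// enough rows. Both the multiplier and the hard cap are tunable assumptions.
const VECTOR_TOP_K_CAP = 200; // observed practical cap; verify for your build

function effectiveTopK(requestedTop: number, overFetchFactor = 5): number {
  // Ask the index for several times more candidates than the caller wants,
  // but never exceed the practical result cap of the table function.
  return Math.min(requestedTop * overFetchFactor, VECTOR_TOP_K_CAP);
}
```

&lt;p&gt;You would pass effectiveTopK(top) to vector_top_k while keeping the original top in the final LIMIT clause.&lt;/p&gt;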

&lt;h2&gt;
  
  
  Pre-Filtering: A More Complex Approach
&lt;/h2&gt;

&lt;p&gt;Pre-filtering is a more intricate approach but can be more effective in some instances. In pre-filtering, metadata is used as the primary filter before vector search takes place. This means that only data that meets the metadata criteria is passed into the vector search process, limiting the scope of the search right from the beginning.&lt;/p&gt;

&lt;p&gt;While this approach can significantly reduce the amount of irrelevant data in the final results, it comes with its own challenges. For example, pre-filtering requires a deeper understanding of the data structure and may necessitate denormalizing the data or creating separate pre-filtered tables. This can be resource-intensive and, in some cases, impractical for dynamic metadata like date ranges.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo0eqa75mcr71gel0gwjj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo0eqa75mcr71gel0gwjj.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In certain use cases, pre-filtering might outperform post-filtering. For instance, when the metadata (e.g., specific date ranges) is the most important filter, pre-filtering ensures the search is conducted only on the most relevant data.&lt;/p&gt;

&lt;h2&gt;
  
  
  Pre-filtering with distance-based filtering
&lt;/h2&gt;

&lt;p&gt;So we are back to an old concept: we pre-filter the data instead of using a vector index.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;WITH FilteredDates AS (
    SELECT 
        c.id, 
        c.startDate, 
        c.endDate, 
        c.summary, 
        c.vectorSummary
    FROM 
        YourTable c
    WHERE 
        1=1
        ${startDate ? `AND c.startDate &amp;gt;= ${startDate.getTime()}` : ''}
        ${endDate ? `AND c.endDate &amp;lt;= ${endDate.getTime()}` : ''}
),
DistanceCalculation AS (
    SELECT 
        fd.id, 
        fd.startDate, 
        fd.endDate, 
        fd.summary, 
        fd.vectorSummary,
        vector_distance_cos(fd.vectorSummary, vector(${vector})) AS distance
    FROM 
        FilteredDates fd
)
SELECT 
    dc.id, 
    dc.startDate, 
    dc.endDate, 
    dc.summary, 
    dc.distance
FROM 
    DistanceCalculation dc
WHERE 
    1=1
    ${distance ? `AND dc.distance &amp;lt;= ${distance}` : ''}
ORDER BY 
    dc.distance
LIMIT ${top};
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This makes sense when the filter produces a small result set, so the distance calculation runs over far fewer rows.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2e6imnyo5g8qa32vov0r.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2e6imnyo5g8qa32vov0r.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The advantage of this approach is that you have full control over the data and get all matching results, without the omissions that are typical of approximate index searches.&lt;/p&gt;

&lt;h2&gt;
  
  
  Choosing Between Pre and Post-Filtering
&lt;/h2&gt;

&lt;p&gt;Both pre-filtering and post-filtering have their advantages and disadvantages. Post-filtering is easier to implement, especially when vector similarity is the primary search factor, but it can lead to incomplete results. Pre-filtering, on the other hand, can yield more accurate results but requires more complex data handling and optimization.&lt;/p&gt;

&lt;p&gt;In practice, many systems combine both strategies, depending on the query. For example, they might start with a broad pre-filtering based on metadata (like date ranges) and then apply a more targeted vector search with post-filtering to refine the results further.&lt;/p&gt;
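&lt;p&gt;One way to sketch that decision is to estimate the selectivity of the metadata filter first and then pick a strategy. The 5% threshold below is an assumption you would tune per workload:&lt;/p&gt;

```typescript
// Pick a filtering strategy from the estimated fraction of rows that
// survive the metadata filter. The 0.05 threshold is an assumption.
type Strategy = 'pre-filter' | 'post-filter';

function chooseStrategy(matchingRows: number, totalRows: number): Strategy {
  const selectivity = matchingRows / totalRows;
  // Highly selective metadata (few rows pass): scan those rows directly.
  // Broad metadata: let the vector index narrow the candidates first.
  return selectivity >= 0.05 ? 'post-filter' : 'pre-filter';
}
```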

&lt;h2&gt;
  
  
  &lt;strong&gt;Conclusion&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Vector search with metadata filtering offers a powerful approach for handling large-scale data retrieval in LLMs and RAG pipelines. Whether you choose pre-filtering or post-filtering—or a combination of both—depends on your application's specific requirements. As vector databases continue to evolve, future innovations that combine these two approaches more seamlessly will help improve data relevance and retrieval efficiency further.&lt;/p&gt;

</description>
      <category>rag</category>
      <category>vectorsearch</category>
      <category>libsql</category>
    </item>
    <item>
      <title>Personal Knowledge Graphs in AI RAG on user phone</title>
      <dc:creator>Volodymyr Pavlyshyn</dc:creator>
      <pubDate>Mon, 30 Sep 2024 14:46:58 +0000</pubDate>
      <link>https://dev.to/volland/personal-knowledge-graphs-in-ai-rag-on-user-phone-kj7</link>
      <guid>https://dev.to/volland/personal-knowledge-graphs-in-ai-rag-on-user-phone-kj7</guid>
      <description>&lt;p&gt;Graphs and vector search are potent tandems for AI-powered applications, which are booming nowadays. Personal knowledge graphs are the core of semantic memory for many agentic AI applications.&lt;/p&gt;

&lt;p&gt;At Mykin, we craft AI agentic architecture with a complex memory model directly on the user's device.&lt;/p&gt;

&lt;p&gt;Mykin is a privacy-focused AI agent on top of sovereign data owned by users.&lt;/p&gt;

&lt;p&gt;Kin. A personal AI for your work life&lt;br&gt;
Get inspired, talk things through, navigate situations or get personalized guidance with Kin. Built for privacy…&lt;br&gt;
mykin.ai&lt;/p&gt;

&lt;h2&gt;
  
  
  Our North Stars
&lt;/h2&gt;

&lt;p&gt;The technical North Stars of Mykin are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Privacy by design — a guide for keeping the architecture secure and private&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;SSI principles — a focus on user data ownership and sovereignty&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Local-first architecture — give users the instruments to own their data&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Data Ownership
&lt;/h2&gt;

&lt;p&gt;All these North Stars have one aspect in common — data ownership. The user has full control and ownership of the data. This means we shift from a classical all-in-cloud centralized model to a local-first architecture, where data is stored and processed on a mesh of user devices, with some cloud services or capabilities potentially involved.&lt;/p&gt;

&lt;p&gt;So we need to run complex RAG, vector search, and graph clustering primarily on the user's device.&lt;/p&gt;

&lt;h2&gt;
  
  
  Expectations for database capabilities
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;general queries on structured data (regular application data) like messages, conversations, settings, etc.&lt;/li&gt;
&lt;li&gt;vector search and similarity search capabilities for RAG pipelines and other LLM- and ML-powered flows&lt;/li&gt;
&lt;li&gt;graph and graph search capabilities (ML and semantic memory)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Since we work on mobile, we have a few technical requirements, too:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;embeddable with good support for mobile bindings&lt;/li&gt;
&lt;li&gt;single file database that simplifies a backup&lt;/li&gt;
&lt;li&gt;portable&lt;/li&gt;
&lt;li&gt;battery friendly&lt;/li&gt;
&lt;li&gt;fast and nonblocking io as much as possible&lt;/li&gt;
&lt;li&gt;wide community support&lt;/li&gt;
&lt;li&gt;reliability&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  LibSQL
&lt;/h2&gt;

&lt;p&gt;If you follow my articles, you already know the answer — LibSQL.&lt;/p&gt;

&lt;p&gt;I described the full journey of vector search and graphs on top of relational models in my articles:&lt;/p&gt;

&lt;p&gt;Personal Knowledge Graphs in AI RAG-powered Applications with libSQL&lt;br&gt;
I spend a long time working on privacy first personal ai assistant&lt;br&gt;
ai.plainenglish.io&lt;/p&gt;

&lt;p&gt;We have one question left — how do we run LibSQL on a user device?&lt;/p&gt;

&lt;p&gt;We are using React Native, so the library should have React Native bindings.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;LibSQL on React Native&lt;/strong&gt;&lt;br&gt;
There are plenty of libraries for React Native that run SQLite, but not LibSQL.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;react-native-sqlite-storage&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Widely used, with support for transactions and raw SQL queries.&lt;/li&gt;
&lt;li&gt;Supports both Android and iOS.&lt;/li&gt;
&lt;li&gt;Provides a promise-based API.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;react-native-sqlite-2&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A lightweight alternative.&lt;/li&gt;
&lt;li&gt;Based on a WebSQL API.&lt;/li&gt;
&lt;li&gt;Works well for simple databases but has limited features compared to react-native-sqlite-storage.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;react-native-sqlite&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Similar to react-native-sqlite-storage, but more minimalistic.&lt;/li&gt;
&lt;li&gt;Might require manual linking.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;watermelondb&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Built on top of SQLite but offers a more modern approach.&lt;/li&gt;
&lt;li&gt;Designed for highly scalable databases in React Native.&lt;/li&gt;
&lt;li&gt;Provides an ORM-like interface and works with large datasets efficiently.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;expo-sqlite&lt;/strong&gt; (if using Expo)&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Built-in SQLite support for Expo apps.&lt;/li&gt;
&lt;li&gt;Lightweight and easy to use, but has fewer advanced features than other libraries.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;expo-sqlite is now the de facto library for SQLite in the Expo ecosystem, and my first idea was to convince the community to add libsql as an engine, or to fork it and use it for our internal needs.&lt;/p&gt;

&lt;p&gt;It was much more challenging than I expected. Sometimes, a big open-source project is a closed door for new ideas and improvements — a door that is hard to knock on.&lt;/p&gt;

&lt;p&gt;OP-SQLite&lt;br&gt;
OP SQLite Documentation | Notion&lt;br&gt;
Built with Notion, the all-in-one connected workspace with publishing capabilities.&lt;br&gt;
ospfranco.notion.site&lt;/p&gt;

&lt;p&gt;“The fastest SQLite library for React Native” by Ospfranco is what I read the first time I found op-sqlite on GitHub. And it is.&lt;/p&gt;

&lt;p&gt;It has a few interesting features for a React Native app.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Async operations&lt;/strong&gt;&lt;br&gt;
The default query runs synchronously on the JS thread. There are async versions of some of the operations. They offload the SQLite processing to a different thread and prevent UI blocking. It is also real multi-concurrency, so it won’t bog down the event loop.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Raw execution&lt;/strong&gt;&lt;br&gt;
If you don’t care about the keys, you can use a simplified execution that returns an array of results.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Hooks&lt;/strong&gt;&lt;br&gt;
You can subscribe to changes in your database by using an update hook that gives you the full row:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;// Bear in mind: rowId is not your table primary key but the internal rowid sqlite uses
// to keep track of the table rows
db.updateHook(({ rowId, table, operation, row = {} }) =&amp;gt; {
  console.warn(`Hook has been called, rowId: ${rowId}, ${table}, ${operation}`);
  // Will contain the entire row that changed
  // only on UPDATE and INSERT operations
  console.warn(JSON.stringify(row, null, 2));
});
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;db.execute('INSERT INTO "User" (id, name, age, networth) VALUES(?, ?, ?, ?)', [
  id,
  name,
  age,
  networth,
]);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Extension Load&lt;/strong&gt;&lt;br&gt;
It was the first library that allowed me to load an extension myself, and even more: Oskar added the CR-SQL extension as an option to the library so that it works out of the box!&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Open to Cooperation&lt;/strong&gt;&lt;br&gt;
One of LibSQL's mottos is to be open to contributions. Oskar was very open to contributions, saw the amazing benefits of libsql, and added it as an option to op-sqlite.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Little How-To&lt;/strong&gt;&lt;br&gt;
So, how do you build a vector-search-aware personal knowledge graph on a user device?&lt;/p&gt;

&lt;p&gt;I expect that you already have a React Native or Expo project. You need to add op-sqlite (version 7.3.0+):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;yarn add @op-engineering/op-sqlite
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Now let's configure LibSQL. You need to add this section to your package.json:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;"op-sqlite": {
  "libsql": true
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Since we build a polymorphic library that runs not only on the device but also on Node.js, I made an abstraction that allows me to swap libsql implementations.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;// @ts-nocheck
import {
  open as openLibsql,
  OPSQLiteConnection,
  QueryResult,
  Transaction,
} from '@op-engineering/op-sqlite';
import {
  BatchQueryOptions,
  DataQuery,
  DataQueryResult,
  IDataStore,
  UpdateCallbackParams,
  StoreOptions,
} from '@mykin-ai/kin-core';
import { documentDirectory } from 'expo-file-system';

export class DataStoreService implements IDataStore {
  private _db: OPSQLiteConnection | undefined;
  private _isOpen = false;
  public _name: string;
  private _location: string;
  public useCrSql = true;
  private _options: StoreOptions;

  constructor(
    name = ':memory:',
    location = documentDirectory,
    options: StoreOptions = {
      vectorDimension: 512,
      vectorType: 'F32',
      vectorNeighborsCompression: 'float8',
      vectorMaxNeighbors: 20,
      dataAutoSync: false,
      failOnErrors: false,
      reportErrors: true,
    },
  ) {
    this._name = name;
    this._options = options;
    if (location?.startsWith('file://')) {
      this._location = location.split('file://')[1];
    } else {
      this._location = location;
    }
    if (this._location.endsWith('/')) {
      this._location = this._location.slice(0, -1);
    }
  }

  getVectorOption() {
    return {
      dimension: this._options.vectorDimension,
      type: this._options.vectorType,
      compression: this._options.vectorNeighborsCompression,
      maxNeighbors: this._options.vectorMaxNeighbors,
    };
  }

  async query(query: string, params?: any[] | undefined): Promise&amp;lt;DataQueryResult&amp;gt; {
    try {
      await this.open(this._name);
      const paramsWithCorrectTypes = params?.map((param) =&amp;gt; {
        if (param === undefined || param === null) {
          return null;
        }
        if (param === true) {
          return 1;
        }
        if (param === false) {
          return 0;
        }
        return param;
      });
      const data = await this._db.executeRawAsync(query, paramsWithCorrectTypes);
      return {
        isOk: true,
        data,
      };
    } catch (e) {
      console.error(e.code, e.message);
      return {
        isOk: false,
        data: [],
        errorCode: e.code || 'N/A',
        error: e.message,
      };
    }
  }

  async execute(query: string, params?: any[] | undefined): Promise&amp;lt;DataQueryResult&amp;gt; {
    try {
      await this.open(this._name);
      const paramsWithCorrectTypes = params?.map((param) =&amp;gt; {
        if (param === undefined || param === null) {
          return null;
        }
        if (param === true) {
          return 1;
        }
        if (param === false) {
          return 0;
        }
        return param;
      });
      const data = await this._db.executeAsync(query, paramsWithCorrectTypes);
      return {
        isOk: true,
        data: data.rows?._array ?? [],
      };
    } catch (e) {
      console.error(e);
      return {
        isOk: false,
        data: [],
        errorCode: e.code || 'N/A',
        error: e.message,
      };
    }
  }

  async open(name: string): Promise&amp;lt;boolean&amp;gt; {
    try {
      if (this._isOpen &amp;amp;&amp;amp; name === this._name) {
        return true;
      }
      if (this._isOpen &amp;amp;&amp;amp; name !== this._name) {
        await this.close();
        this._isOpen = false;
      }
      this._name = name;
      this._db = openLibsql({
        name: this._name,
        location: this._location,
      });
      console.log('Opened db');
      this._isOpen = true;
      return true;
    } catch (e) {
      // eslint-disable-next-line no-console
      console.error("couldn't open db", e);
      return false;
    }
  }

  async isOpen(): Promise&amp;lt;boolean&amp;gt; {
    return Promise.resolve(this._isOpen);
  }

  async close(): Promise&amp;lt;boolean&amp;gt; {
    if (this.useCrSql) {
      this._db.execute(`select crsql_finalize();`);
    }
    this._db.close();
    this._isOpen = false;
    return Promise.resolve(true);
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Now we are ready to create the graph tables and indexes. I'll skip the entire class, as it is too long, and give only the essential parts.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;const vectorOptions = this._store.getVectorOption()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;This gives us the vector configuration, such as the type of the vector values and the dimension of the embeddings, which are also the vector index params.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;const createR = await this._store.execute(`
  create table if not exists edge (
    id varchar(36) primary key not null,
    fromId varchar(36) not null default '',
    toId varchar(36) not null default '',
    label varchar not null default '',
    displayLabel varchar not null default '',
    vectorTriple ${vectorOptions.type}_BLOB(${vectorOptions.dimension}),
    createdAt real,
    updatedAt real,
    source varchar(36) default 'N/A',
    type varchar default 'edge',
    meta text default '{}'
  );
`)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Now we have a triple store that holds references to nodes.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;const createR = await this._store.execute(`
  create table if not exists node (
    id varchar(36) primary key not null,
    label varchar not null default '',
    vectorLabel ${vectorOptions.type}_BLOB(${vectorOptions.dimension}),
    displayLabel varchar not null default '',
    createdAt real,
    updatedAt real,
    source varchar(36) default 'N/A',
    type varchar default 'node',
    entity text default '{}',
    meta text default '{}'
  );
`)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;If you want to know how to model graphs in a relational DB, read my article:&lt;/p&gt;

&lt;p&gt;Personal Knowledge Graphs. Semantic Entity Persistence in Relational Model&lt;br&gt;
In my last two articles, we modeled different kinds of graphs in a portable relational model.&lt;br&gt;
blog.stackademic.com&lt;/p&gt;

&lt;p&gt;Time to create an index&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;const createIndex = await this._store.execute(`
  CREATE INDEX IF NOT EXISTS idx_edge_vectorTriple
  ON edge (libsql_vector_idx(vectorTriple${vectorOptions.compression !== 'none' ? `, 'compress_neighbors=${vectorOptions.compression}'` : ''}${vectorOptions.maxNeighbors ? `, 'max_neighbors=${vectorOptions.maxNeighbors}'` : ''}));
`)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;We configure compress_neighbors and max_neighbors to get the best storage space footprint. If you want to learn more about space complexity, read this article:&lt;/p&gt;

&lt;p&gt;The space complexity of vector search indexes in LibSQL&lt;br&gt;
Hey, so I continue my adventure in vector search and Graph clustering at&lt;br&gt;
ai.plainenglish.io&lt;/p&gt;

&lt;p&gt;Now we can create a triple.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;const createOp = await this._store.execute(
  `
  insert into edge (id, fromId, toId , label, vectorTriple, displayLabel, createdAt, updatedAt)
    values (?, ? , ? , ? , vector(${this._store.toVector(
      await this.embeddingsService.embedDocument(`${fromNode.label} ${normalizedLabel} ${toNode.label}`)
    )}) , ? , ?, ?);
`,
  [
    this._getUuid(),
    fromNode.id,
    toNode.id,
    normalizedLabel,
    label,
    Date.now(),
    Date.now(),
  ]
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Unfortunately, op-sql does not support float32array as a parameter as libsql does. To make a workaround, we need to use a bit of dynamic SQL and create a serialized vector as part of queries. My toVector method does a stringify of float32array and cares about quotes. Please note that we pass a serialized array to a vector function in SQL. I hope that the next version of op-SQL will support float32arrays&lt;/p&gt;
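&lt;p&gt;For illustration, a toVector helper along those lines could look like this. This is a hypothetical sketch of the idea, not Mykin's actual implementation:&lt;/p&gt;

```typescript
// Hypothetical sketch: serialize an embedding into a quoted string literal
// that can be spliced into SQL and passed to libSQL's vector() function.
function toVector(embedding: Float32Array): string {
  // Produce a literal like '[1,2,3]' wrapped in single quotes,
  // since vector() expects a string representation of the array.
  const values = Array.from(embedding).join(',');
  return `'[${values}]'`;
}
```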

&lt;p&gt;Time to query!&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;const _top = top ?? 10
const vector = this._store.toVector(await this.embeddingsService.embedQuery(query))

const querySql = `
select  e.id, e.label, e.displayLabel, e.createdAt, e.updatedAt, e.source, e.type , e.meta , fn.label, fn.displayLabel, tn.label, tn.displayLabel, vector_distance_cos(e.vectorTriple , ${vector}) distance
from vector_top_k('idx_edge_vectorTriple', ${vector} , ${_top}) as i
inner join edge as e on i.id = e.rowid
inner join node as fn on e.fromId = fn.id
inner join node as tn on e.toId = tn.id
where 1=1 ${maxDistance ? `and  distance &amp;lt;= ${maxDistance}` : ''}
order by distance
limit ${_top};
`
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;const edgeData = await this._store.query(querySql)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;A few notes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;By default, the vector index returns rowid, so be careful with the joins.&lt;/li&gt;
&lt;li&gt;The index does not return the distance, but you can still calculate it if you need it.&lt;/li&gt;
&lt;li&gt;vector_top_k expects a top parameter and returns the top N items. If you have complex filtering or external top limitations, remember to set a much bigger top N to make the search possible. In our case, it is not an issue.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Issues and challenges&lt;/strong&gt;&lt;br&gt;
I faced a few challenges in React Native, mainly on iOS. They are related to how native modules are compiled and linked on iOS.&lt;/p&gt;

&lt;p&gt;One quite unpleasant issue: if you have another library that uses another version of SQLite, it can unpredictably override the linking and break libsql completely.&lt;/p&gt;

&lt;h2&gt;
  
  
  Compilation Clashes
&lt;/h2&gt;

&lt;p&gt;If you have other packages that depend on SQLite (especially if they compile it from source), you will have issues.&lt;/p&gt;

&lt;p&gt;Some of the known offenders are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;expo-updates&lt;/li&gt;
&lt;li&gt;expo-sqlite&lt;/li&gt;
&lt;li&gt;cozodb&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You will face duplicated symbols and/or header definitions, since each of these packages tries to compile SQLite from source. Even if they manage to compile, they might use different compilation flags, and you might face threading errors.&lt;/p&gt;

&lt;p&gt;Unfortunately, there is no easy solution. You need to get rid of the double compilation by hand, either by patching the compilation of each package so that it still builds or by removing the dependency on that package.&lt;/p&gt;

&lt;p&gt;On Android, you might be able to get away with just using a pickFirst strategy (there is an article on how to do that). On iOS, depending on the build system, you might be able to patch it via a post-build hook, something like:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pre_install do |installer|
  installer.pod_targets.each do |pod|
    if pod.name.eql?('expo-updates')
      # Modify the configuration of the pod so it doesn't depend on the sqlite pod
    end
  end
end
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Follow the op-sqlite docs to get an updated list of libs.&lt;/p&gt;

&lt;p&gt;See the “Gotchas” page on ospfranco.notion.site for details.&lt;/p&gt;

&lt;h2&gt;
  
  
  RNRestart crash
&lt;/h2&gt;

&lt;p&gt;One more iOS issue.&lt;/p&gt;

&lt;p&gt;If you for some reason need to restart the app and use react-native-restart, you need to make sure that you close all connections first:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import { closeAllConnections } from '@storage/data-store-factory';
import RNRestart from 'react-native-restart';

export const restartApplication = async (): Promise&amp;lt;void&amp;gt; =&amp;gt; {
  await closeAllConnections();
  RNRestart.restart();
};
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Now you can also build a personal knowledge graph with vector search on a user device!&lt;/p&gt;

&lt;p&gt;I want to say thanks to Oskar and the Turso team for their amazing work.&lt;/p&gt;

</description>
      <category>libsql</category>
      <category>reactnative</category>
      <category>sql</category>
      <category>vectorsearch</category>
    </item>
    <item>
      <title>Fastest way to count in sql</title>
      <dc:creator>Volodymyr Pavlyshyn</dc:creator>
      <pubDate>Wed, 11 Sep 2024 15:47:46 +0000</pubDate>
      <link>https://dev.to/volland/fastest-way-to-count-in-sql-23m</link>
      <guid>https://dev.to/volland/fastest-way-to-count-in-sql-23m</guid>
      <description>&lt;p&gt;We all know that stars in a select statement are a terrible idea&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;select * from message ;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;It can give unpredictable results over time as the schema evolves, and it leads to unoptimized queries, so good practice is to select only what you need!&lt;/p&gt;

&lt;h2&gt;
  
  
  A good star in SQL
&lt;/h2&gt;

&lt;p&gt;Well, only some stars are good. One particular star is a good one!&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;count(*)&lt;/strong&gt; tells your database to count the rows of a table as fast as possible. It is a bit counterintuitive, but let's examine it further.&lt;/p&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
sql
select count(id) from message ;
┌───────────┐
│ count(id) │
├───────────┤
│ 1091      │
└───────────┘
Run Time: real 0.001 user 0.000170 
As you see on timing, it is fast, but we have a quicker result possible with

libsql&amp;gt; select count(*) from message ;
┌──────────┐
│ count(*) │
├──────────┤
│ 1091     │
└──────────┘
Run Time: real 0.000 user 0.000093 
How is it possible?

Let's ask explain

libsql&amp;gt; explain query plan select count(*) from message ;
QUERY PLAN
`--SCAN message USING COVERING INDEX idx_message_conversation
As we can see, it uses a secondary index much smaller than a clustering index that keeps a row of data. So, if you have any secondary indexes, the majority of query planers will use it for a fast count.

So even if it is counter-intuitive not all stars are bed in SQL


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
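&lt;p&gt;The same covering-index trick can be reproduced with plain SQLite (sketched here in Python with the built-in sqlite3 module for illustration; table and index names follow the example above):&lt;/p&gt;

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE message (id INTEGER PRIMARY KEY, conversation INTEGER, body TEXT)")
db.execute("CREATE INDEX idx_message_conversation ON message (conversation)")
db.executemany("INSERT INTO message (conversation, body) VALUES (?, ?)",
               [(i % 10, "hi") for i in range(1091)])

count = db.execute("SELECT count(*) FROM message").fetchone()[0]

# The query plan shows the scan of the smaller secondary index,
# e.g. 'SCAN message USING COVERING INDEX idx_message_conversation'
plan = db.execute("EXPLAIN QUERY PLAN SELECT count(*) FROM message").fetchall()
```

&lt;p&gt;Drop the index, and the plan falls back to scanning the table itself.&lt;/p&gt;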

</description>
      <category>sql</category>
    </item>
    <item>
      <title>Personal Knowledge Graphs in AI RAG-powered Applications with libSQL</title>
      <dc:creator>Volodymyr Pavlyshyn</dc:creator>
      <pubDate>Sun, 18 Aug 2024 09:02:32 +0000</pubDate>
      <link>https://dev.to/volland/personal-knowledge-graphs-in-ai-rag-powered-applications-with-libsql-m3d</link>
      <guid>https://dev.to/volland/personal-knowledge-graphs-in-ai-rag-powered-applications-with-libsql-m3d</guid>
      <description>&lt;p&gt;I spend a long time working on privacy first personal ai assistant&lt;/p&gt;

&lt;p&gt;Our application is local-first and focused on sovereign data ownership. So, one of the challenges was finding the proper storage for the data.&lt;/p&gt;

&lt;h2&gt;
  
  
  Device friendly
&lt;/h2&gt;

&lt;p&gt;I partially described several options for embeddable and device-friendly databases in my previous article.&lt;/p&gt;

&lt;p&gt;AI-powered apps, especially the semantic memory part, set a few expectations for database capabilities:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  general queries on structured data (regular application data) like messages, conversations, settings, etc.&lt;/li&gt;
&lt;li&gt;  vector search and similarity search capabilities for RAG pipelines and various LLM- and ML-powered flows&lt;/li&gt;
&lt;li&gt;  graph and graph search capabilities (ML and semantic memory)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;As far as we work on mobile, we have a few technical requirements, too:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  embeddable with good support for mobile bindings&lt;/li&gt;
&lt;li&gt;  single file database that simplifies a backup&lt;/li&gt;
&lt;li&gt;  portable&lt;/li&gt;
&lt;li&gt;  battery friendly&lt;/li&gt;
&lt;li&gt;  fast and nonblocking io as much as possible&lt;/li&gt;
&lt;li&gt;  wide community support&lt;/li&gt;
&lt;li&gt;  reliability&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Vector and graph capabilities in embeddable databases are a relatively new challenge for modern databases. The closest competitor was Postgres, with extensions that add vector search (pgvector) and graph support (Apache AGE).&lt;/p&gt;

&lt;p&gt;So, we needed a similar but embeddable setup.&lt;/p&gt;

&lt;h2&gt;
  
  
  Graphs
&lt;/h2&gt;

&lt;p&gt;Currently, there are practically no graph-oriented databases that are portable and embeddable with a mobile- or small-device-friendly setup.&lt;/p&gt;

&lt;p&gt;I have written a few articles that model and show how to use relational databases for graph and hypergraph capabilities. It is a broad area with a lot of exciting research topics. You can find more about it in my articles.&lt;/p&gt;

&lt;h2&gt;
  
  
  Vectors
&lt;/h2&gt;

&lt;p&gt;There is a wide variety of vector databases and libraries on the market. Some of them are libraries, like faiss.&lt;/p&gt;

&lt;p&gt;faiss is top-performing and even has some capability to persist vectors to a file. So, it was a good start, but...&lt;/p&gt;

&lt;p&gt;It is a big challenge to keep heterogeneous data stores in sync, and the sync process itself consumes time, battery, and CPU resources, occupying the app's main thread and making the app less and less user-friendly.&lt;/p&gt;

&lt;p&gt;For me, it was clear — we need something integrated into a database.&lt;/p&gt;

&lt;p&gt;After weeks of research and a prototype, we stopped at the SQLite ecosystem. SQLite has been the most popular and reliable database for mobile devices for decades.&lt;/p&gt;

&lt;p&gt;But what about vectors?&lt;/p&gt;

&lt;p&gt;SQLite has an extendable architecture that allows native modules to extend the database's capabilities. I found a project that brings faiss to SQLite.&lt;/p&gt;

&lt;p&gt;Unfortunately, it was not reliable and had a few major bugs and issues, so I almost gave up.&lt;/p&gt;

&lt;p&gt;We were lucky to find a better answer to our question.&lt;/p&gt;

&lt;p&gt;libSQL is an open-source fork of SQLite, open to contribution, that brings many features and performance optimizations to the table. Some of these features deserve a separate article, like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  an ALTER TABLE extension that makes migrations easy&lt;/li&gt;
&lt;li&gt;  WebAssembly-defined functions!&lt;/li&gt;
&lt;li&gt;  a virtual WAL interface&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;and much more&lt;/p&gt;

&lt;p&gt;but the most critical one is that it extends SQLite with vector search capabilities.&lt;/p&gt;

&lt;p&gt;It is built smartly with minimal database changes, so it is easy to migrate and still compatible with SQLite.&lt;/p&gt;

&lt;p&gt;There is no separate vector type; let's say it is an alias on top of BLOB:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;CREATE TABLE node (
          id varchar(36) primary key not null,
          label varchar not null default '',
          vectorLabel F32_BLOB(512) ,
          displayLabel varchar not null default '',
          createdAt real,
          updatedAt real
         );
CREATE TABLE edge (
          id varchar(36) primary key not null,
          fromId varchar(36) not null default '',
          toId varchar(36) not null default '',
          label varchar not null default '',
          displayLabel varchar not null default '',
          vectorTriple F32_BLOB(512) ,
          createdAt real,
          updatedAt real
         );
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;So &lt;strong&gt;F32_BLOB(512)&lt;/strong&gt; specifies metadata about a vector: the value type (32-bit float) and the dimension of the array.&lt;/p&gt;

&lt;p&gt;This type is more of an alias on top of BLOB, but it gives the database the possibility to validate a vector's shape and data type.&lt;/p&gt;

&lt;p&gt;Now we have the ability to use vector search for clustering the graph and to use it in LLM-powered pipelines.&lt;/p&gt;

&lt;p&gt;On the edge, we store some metadata about the triple:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  displayLabel: the edge label in its unnormalized form&lt;/li&gt;
&lt;li&gt;  label: the normalized label&lt;/li&gt;
&lt;li&gt;  vectorTriple: the most interesting part. We normalize the node labels and the edge label and concatenate them together. This allows us to create an embedding from the triple and make edges searchable by vector search.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Our vectorTriple column adds a vector search capability to our personal knowledge graph.&lt;/p&gt;

&lt;p&gt;To insert data, we can use the vector function, which accepts an embedding as:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  float32 array&lt;/li&gt;
&lt;li&gt;  a blob of serialized float32 array&lt;/li&gt;
&lt;li&gt;  string representation like ‘[0.5432635, 0.3333 ….]’
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;insert into edge (rowid, id, fromId, toId , label, vectorTriple, displayLabel, createdAt, updatedAt)
  values (? , ?, ? , ? , ? , vector(${this._store.toVector(
          await this.embeddingsService.embedDocument(`${fromNode.label} ${normalizedLabel} ${toNode.label}`)
        )}) , ? , ?, ?);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;With the &lt;strong&gt;vector_distance_cos&lt;/strong&gt; function, we can already do distance calculations and queries:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;select  e.id, e.label, vector_distance_cos(e.vectorTriple , ${vector}) distance
from edge e
where distance &amp;lt; 0.15
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It is fantastic but very slow, inefficient, and CPU-intensive, as you need to scan the full table and calculate all the distances. So, we need a vector index.&lt;/p&gt;

&lt;p&gt;We are lucky: we can create an index on embedding columns!&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;CREATE INDEX idx_edge_vectorTriple ON edge (libsql_vector_idx(vectorTriple));
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now, instead of a full scan, we can search in the index:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;select  e.id, e.label, vector_distance_cos(e.vectorTriple , vector('[0.32323, 0.525, ....]')) distance
    from vector_top_k('idx_edge_vectorTriple', vector('[0.32323, 0.525, ....]'), ${_top}) as i
    inner join edge as e on i.id = e.rowid
    inner join node as fn on e.fromId = fn.id
    inner join node as tn on e.toId = tn.id
    where distance &amp;lt;= 0.15
    order by distance
    limit 20;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;So let's go through it step by step.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;vector_top_k('idx_edge_vectorTriple', '[0.32323, 0.525, ....]', 20) as i
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This will give us back the &lt;strong&gt;rowid&lt;/strong&gt;s of the most similar entries from the index &lt;strong&gt;idx_edge_vectorTriple&lt;/strong&gt;, which is built on the &lt;strong&gt;vectorTriple&lt;/strong&gt; column.&lt;/p&gt;

&lt;p&gt;You should be careful: by default, it uses rowid. So, to combine the result of the vector search with any data and tables in your DB, you need a simple join statement. Everything is an integral part of the database and queryable:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;inner join edge as e on i.id = e.rowid
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The index does not return distances, so you still need to calculate them yourself, but now it happens on a much smaller dataset:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;select  e.id, e.label, vector_distance_cos(e.vectorTriple , vector('[0.32323, 0.525, ....]')) distance
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The vector function is a smart move: it converts a string like ‘[1, 32, 2, ….]’ into the blob type that is stored in the database.&lt;/p&gt;

&lt;p&gt;Now you can add an extra filter on distance if you need to find closely related triples:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;where distance &amp;lt;= 0.15
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
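&lt;p&gt;For intuition about thresholds like 0.15: cosine distance is commonly defined as 1 minus cosine similarity, so 0 means the same direction and small values mean closely related embeddings. A quick sketch in Python (assuming vector_distance_cos follows this standard definition):&lt;/p&gt;

```python
import math

def cosine_distance(a, b):
    # 1 - cosine similarity: 0.0 for identical direction, 1.0 for
    # orthogonal vectors, 2.0 for opposite direction.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)
```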



&lt;p&gt;Now we have vector search on top of a personal knowledge graph.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Personal Knowledge Graphs can be modeled in a relational model and are usually of moderate size, so they rarely lead to performance issues with SQL queries. With libSQL, we have native, low-level support for vectors. It allows us to build graph clustering and RAG pipelines on user devices if needed. This feature is still under active development and may change over time, but from my experience, it is stable now.&lt;/p&gt;

</description>
      <category>libsql</category>
      <category>rag</category>
      <category>vectorsearch</category>
    </item>
    <item>
      <title>Multimillion Common Programming Language Mistakes and Better Approaches</title>
      <dc:creator>Volodymyr Pavlyshyn</dc:creator>
      <pubDate>Sun, 18 Aug 2024 09:00:45 +0000</pubDate>
      <link>https://dev.to/volland/multimillion-common-programming-language-mistakes-and-better-approaches-37fa</link>
      <guid>https://dev.to/volland/multimillion-common-programming-language-mistakes-and-better-approaches-37fa</guid>
      <description>&lt;p&gt;Modern programming languages have evolved considerably, yet certain design flaws and pitfalls remain pervasive, causing significant challenges and costs in software development. This article explores some of these common mistakes and presents alternative approaches to mitigate their impact, as discussed in a recent insightful video.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Null Problem
&lt;/h2&gt;

&lt;p&gt;One of the most notorious issues in programming is the concept of null references. Tony Hoare, the inventor of the null reference, even referred to it as his “billion-dollar mistake” due to the numerous errors and system crashes it has caused. To avoid null-related issues, several strategies can be employed:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Option and Maybe Types:&lt;/strong&gt; Languages like Java and C# offer options or monads that encapsulate values that may be null, providing a safer way to handle optional values.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Dependent Types&lt;/strong&gt;: These types allow more logic to be moved to the type system, ensuring null-free code.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Libraries and Optional Chains&lt;/strong&gt;: JavaScript has adopted optional chaining, which, while not perfect, significantly reduces null-related errors.&lt;/li&gt;
&lt;/ul&gt;
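&lt;p&gt;As a toy illustration of the Option/Maybe idea (not tied to any particular language's standard library), sketched in Python:&lt;/p&gt;

```python
from dataclasses import dataclass
from typing import Callable, Generic, Optional, TypeVar

T = TypeVar("T")
U = TypeVar("U")

@dataclass
class Maybe(Generic[T]):
    # A minimal Maybe: callers go through map/get_or, so absence is handled
    # explicitly instead of a raw null leaking through the code.
    _value: Optional[T]

    def map(self, fn: Callable[[T], U]) -> "Maybe[U]":
        # Absence short-circuits; fn is never called on a missing value.
        return Maybe(fn(self._value)) if self._value is not None else Maybe(None)

    def get_or(self, default: T) -> T:
        return self._value if self._value is not None else default

price = Maybe(10).map(lambda p: p * 2).get_or(0)      # 20
missing = Maybe(None).map(lambda p: p * 2).get_or(0)  # 0
```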

&lt;h2&gt;
  
  
  Handling Errors
&lt;/h2&gt;

&lt;p&gt;Exception handling in programming languages has often been criticized for its complexity and propensity to introduce bugs. Traditional try-catch blocks can lead to cumbersome and error-prone code. Alternative approaches include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Monadic Interfaces:&lt;/strong&gt; Using monads to handle errors as values can improve code readability and safety. Languages like Haskell and Scala employ this method effectively.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Algebraic Effects:&lt;/strong&gt; This advanced technique separates error handling from the main code logic, allowing for more flexible and maintainable error management.&lt;/li&gt;
&lt;/ul&gt;
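&lt;p&gt;The simplest form of errors-as-values is a Go-style pair return; a full monadic Result adds chaining on top, but the principle is the same: failures travel through return values, not exceptions. A Python sketch:&lt;/p&gt;

```python
def parse_port(text):
    # Return (value, error) instead of raising, so the failure path is an
    # explicit part of the function's result shape.
    try:
        port = int(text)
    except ValueError:
        return None, f"not a number: {text!r}"
    if not 0 < port < 65536:
        return None, f"out of range: {port}"
    return port, None

port, err = parse_port("8080")   # (8080, None)
bad, err2 = parse_port("http")   # (None, "not a number: 'http'")
```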

&lt;h2&gt;
  
  
  Coloring Functions
&lt;/h2&gt;

&lt;p&gt;The concept of “coloring” functions refers to the practice of marking functions as asynchronous (async). This can lead to a cascade effect where many functions must be marked async, complicating codebases. Algebraic effects offer a solution:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Algebraic Effects and Handlers&lt;/strong&gt;: By using algebraic effects, asynchronous operations and error handling can be integrated more seamlessly, reducing the need for widespread async function declarations.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Concurrent Programming and Locks
&lt;/h2&gt;

&lt;p&gt;Concurrency introduces significant complexity, often leading to difficult-to-diagnose bugs. Traditional methods like locks and threads can be error-prone. Better alternatives include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Actor Model&lt;/strong&gt;: This model, used in languages like Erlang and frameworks like Akka, encapsulates state and behavior within actors, making concurrency more manageable.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Go-like Channels&lt;/strong&gt;: Go uses channels to facilitate communication between goroutines, simplifying concurrent programming by abstracting the complexities of thread management.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Fragile Classes and Overuse of Inheritance
&lt;/h2&gt;

&lt;p&gt;Object-oriented programming (OOP) and inheritance can lead to fragile base class problems and excessive complexity. Alternatives to traditional class-based inheritance include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Prototypal Inheritance:&lt;/strong&gt; JavaScript utilizes prototypes, where objects inherit directly from other objects. This approach can be more flexible and less error-prone than classical inheritance.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Data-Oriented Programming:&lt;/strong&gt; Focusing on data and its transformations rather than the objects can lead to more maintainable and understandable code.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;By acknowledging and addressing these common pitfalls in programming languages, developers can write more robust, maintainable, and error-free code. Embracing advanced techniques such as monads, algebraic effects, actor models, and prototypal inheritance can significantly enhance the quality and reliability of software systems. As the field of software development continues to evolve, it’s crucial to stay informed about these approaches and incorporate them into your programming practices.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>HyperGraphs In Relation Model</title>
      <dc:creator>Volodymyr Pavlyshyn</dc:creator>
      <pubDate>Mon, 08 Apr 2024 12:37:07 +0000</pubDate>
      <link>https://dev.to/volland/hypergraphs-in-relation-model-1djn</link>
      <guid>https://dev.to/volland/hypergraphs-in-relation-model-1djn</guid>
      <description>&lt;p&gt;In my last article, we model different kinds of graphs in abstract relational databases.&lt;/p&gt;

&lt;p&gt;We talk about hypergraphs and even model Undirected hypergraphs. Let's recap our undirected models.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--KS9KkdZq--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://miro.medium.com/v2/resize:fit:518/1%2ApU41iOpJLu7FaY0ohPm9mQ.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--KS9KkdZq--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://miro.medium.com/v2/resize:fit:518/1%2ApU41iOpJLu7FaY0ohPm9mQ.png" alt="" width="259" height="323"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Hypergraph&lt;/p&gt;

&lt;h2&gt;
  
  
  Undirected Hypergraph
&lt;/h2&gt;

&lt;p&gt;A hypergraph is a mathematical generalization of graphs where a hyperedge could connect multiple or no nodes. So, you have a set of nodes instead of a pair of nodes. Hypergraphs are an emerging domain for modeling complex and dynamic systems and are widely used for temporal and event-dependent graphs. We will model undirected hypergraphs.&lt;br&gt;&lt;br&gt;
Usually, a hypergraph is drawn as sets that overlap, or as Venn diagrams.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--LyZj4yHc--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://miro.medium.com/v2/resize:fit:1400/0%2A31ovMcVCT9Md696T.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--LyZj4yHc--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://miro.medium.com/v2/resize:fit:1400/0%2A31ovMcVCT9Md696T.png" alt="" width="786" height="806"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As edges now have a many-to-many relationship with nodes, we just need a junction table.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--_522smY---/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://miro.medium.com/v2/resize:fit:1400/0%2A2RamC6jTMrfb-hIj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--_522smY---/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://miro.medium.com/v2/resize:fit:1400/0%2A2RamC6jTMrfb-hIj.png" alt="" width="772" height="624"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Nodes&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--WGbeJIGM--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://miro.medium.com/v2/resize:fit:780/0%2AAj6NeN-EtFWyO6bd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--WGbeJIGM--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://miro.medium.com/v2/resize:fit:780/0%2AAj6NeN-EtFWyO6bd.png" alt="" width="390" height="592"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Edges&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--DvM2jT1M--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://miro.medium.com/v2/resize:fit:844/0%2APU17vqkrwNHcgUWT.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--DvM2jT1M--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://miro.medium.com/v2/resize:fit:844/0%2APU17vqkrwNHcgUWT.png" alt="" width="422" height="644"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Edge to nodes&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--MkqTQE3n--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://miro.medium.com/v2/resize:fit:992/0%2AxJOyXHtpFjcQ4Ggd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--MkqTQE3n--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://miro.medium.com/v2/resize:fit:992/0%2AxJOyXHtpFjcQ4Ggd.png" alt="" width="496" height="1114"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This table can grow large if you have a big edge that connects a comprehensive set of nodes.&lt;/p&gt;
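&lt;p&gt;The undirected model above can be sketched with plain SQL (in Python with the built-in sqlite3 module; the table and column names here are assumptions, since the schema is shown as diagrams):&lt;/p&gt;

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE node (id INTEGER PRIMARY KEY, label TEXT);
CREATE TABLE edge (id INTEGER PRIMARY KEY, label TEXT);
-- the junction table: one row per (hyperedge, node) membership
CREATE TABLE edge_node (
  edge_id INTEGER REFERENCES edge(id),
  node_id INTEGER REFERENCES node(id),
  PRIMARY KEY (edge_id, node_id)
);
INSERT INTO node VALUES (1, 'a'), (2, 'b'), (3, 'c');
INSERT INTO edge VALUES (10, 'meeting');
INSERT INTO edge_node VALUES (10, 1), (10, 2), (10, 3);
""")

# All nodes connected by one hyperedge:
members = [row[0] for row in db.execute(
    "SELECT n.label FROM edge_node en JOIN node n ON n.id = en.node_id "
    "WHERE en.edge_id = 10 ORDER BY n.label")]
```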

&lt;h2&gt;
  
  
  Directed HyperGraph
&lt;/h2&gt;

&lt;p&gt;In a directed hypergraph, we divide an edge's nodes into a subset of in-nodes and a subset of out-nodes. So, we have a directed edge from one subset to the other.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--r-aEubYB--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://miro.medium.com/v2/resize:fit:1400/1%2A9MbGKWa7rpxobEVsaRZ2Eg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--r-aEubYB--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://miro.medium.com/v2/resize:fit:1400/1%2A9MbGKWa7rpxobEVsaRZ2Eg.png" alt="" width="800" height="796"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We have multiple options for modeling this structure.&lt;/p&gt;

&lt;h2&gt;
  
  
  As directed, edge-node relations
&lt;/h2&gt;

&lt;p&gt;The most straightforward way is to add a direction attribute to edge-node pairs.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--GHEV0QZg--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://miro.medium.com/v2/resize:fit:1400/1%2AL7lLXRiTBjKhIN4jzgpGlA.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--GHEV0QZg--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://miro.medium.com/v2/resize:fit:1400/1%2AL7lLXRiTBjKhIN4jzgpGlA.png" alt="" width="800" height="647"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Now, nodes and edges stay the same as in the previous example.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--JHVBG0aD--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://miro.medium.com/v2/resize:fit:1196/1%2A_Syo-G7Tqoy88sTI504sHw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--JHVBG0aD--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://miro.medium.com/v2/resize:fit:1196/1%2A_Syo-G7Tqoy88sTI504sHw.png" alt="" width="598" height="1238"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;But the relations now have a direction.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--NdJdHzPA--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://miro.medium.com/v2/resize:fit:1400/1%2A-kmlkaKP2e1HReK9s4fTew.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--NdJdHzPA--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://miro.medium.com/v2/resize:fit:1400/1%2A-kmlkaKP2e1HReK9s4fTew.png" alt="" width="734" height="1126"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pros of this model&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  simplicity&lt;/li&gt;
&lt;li&gt;  ability to create mixed graphs — directed and undirected&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Cons of this model&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  it could be hard to build queries&lt;/li&gt;
&lt;li&gt;  the model can encode invalid states: nothing prevents us from having the same node on the same edge in both directions&lt;/li&gt;
&lt;li&gt;  for a strictly directed graph, we need extra application-level constraints&lt;/li&gt;
&lt;/ul&gt;
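&lt;p&gt;A sketch of the direction-attribute model (Python with the built-in sqlite3 module; names are assumptions). Note that a composite primary key on (edge_id, node_id) also closes the invalid-state gap mentioned above: one node cannot appear on the same edge in two directions:&lt;/p&gt;

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE edge_node (
  edge_id INTEGER,
  node_id INTEGER,
  direction TEXT CHECK (direction IN ('in', 'out')),
  PRIMARY KEY (edge_id, node_id)  -- one membership per node per edge
);
INSERT INTO edge_node VALUES (1, 100, 'in'), (1, 200, 'out');
""")

out_nodes = [r[0] for r in db.execute(
    "SELECT node_id FROM edge_node WHERE edge_id = 1 AND direction = 'out'")]

# Re-adding node 100 to edge 1 in the other direction is rejected:
try:
    db.execute("INSERT INTO edge_node VALUES (1, 100, 'out')")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True
```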

&lt;h2&gt;
  
  
  As directed Graph of nodesets
&lt;/h2&gt;

&lt;p&gt;We could be more explicit and extend the directed graph model to operate on node sets.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--dqqj8dFw--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://miro.medium.com/v2/resize:fit:1400/1%2A8kKo9STHX0OyQS9KiuB0Hw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--dqqj8dFw--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://miro.medium.com/v2/resize:fit:1400/1%2A8kKo9STHX0OyQS9KiuB0Hw.png" alt="" width="800" height="926"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Now, the source and object of the edge point to a node-set relation. For simplicity, we expect that, in the case of individual nodes, we create a separate node set for each such node.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pros of this model&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  more explicit&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Cons of this model&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  requires creating a node set even for single nodes&lt;/li&gt;
&lt;li&gt;  more complex&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>hypergraphs</category>
      <category>db</category>
      <category>ai</category>
      <category>knowledgegraphs</category>
    </item>
    <item>
      <title>Personal Knowledge Graphs in Relational Model</title>
      <dc:creator>Volodymyr Pavlyshyn</dc:creator>
      <pubDate>Mon, 08 Apr 2024 12:35:21 +0000</pubDate>
      <link>https://dev.to/volland/personal-knowledge-graphs-in-relational-model-59e9</link>
      <guid>https://dev.to/volland/personal-knowledge-graphs-in-relational-model-59e9</guid>
      <description>&lt;p&gt;Various graph databases offer functionalities with a wide range of graph-oriented query languages, from Cypher to graphQL and custom ones. Graph databases could be optimized for storing and processing big graphs but require time to master and learn, and sometimes, it has quite a step in the learning curve. Personal Knowledge Graphs usually have a much smaller scale and are part of user applications or personal knowledge systems.&lt;/p&gt;

&lt;p&gt;In a classical application, only a part of the data has a graph nature, and we could have a mixed setup of regular relational data and graphs.&lt;br&gt;&lt;br&gt;
In AI-powered applications, we have a mixed case of&lt;br&gt;&lt;br&gt;
- graph data&lt;br&gt;&lt;br&gt;
- vectors and vector indexes&lt;br&gt;&lt;br&gt;
- regular documents&lt;/p&gt;

&lt;p&gt;It is hard to find a database that satisfies all these conditions.&lt;/p&gt;

&lt;p&gt;I have been happy with CozoDB for a long time. You could combine pgvector and Apache AGE for Postgres and, together with Postgres's document-like features, build a lot. Sometimes we need embeddable databases, and here the leader is SQLite. We are still waiting for a PGlite implementation that brings Postgres to the edge.&lt;br&gt;&lt;br&gt;
We will avoid discussing the scalability of relational databases and leave it as a topic for a separate article. Still, relational structures are widespread and well-known and offer many tools. They have a good developer experience.&lt;/p&gt;

&lt;p&gt;Graphs are not relational structures, but we could try to adopt relations to achieve a good representation and performance.&lt;/p&gt;

&lt;p&gt;If you have small and fixed graphs, you could represent them as an adjacency matrix, but this model does not scale. Any model should be optimized for your queries and needs; the models presented here are subjective.&lt;/p&gt;
&lt;h2&gt;
  
  
  Directed Graph
&lt;/h2&gt;

&lt;p&gt;The simplest and most common type of graph is a directed graph, where each edge connects a pair of nodes.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--DFJ7Balb--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://miro.medium.com/v2/resize:fit:1400/1%2A0JPGzFQQ0-GKxtEE1mFj-A.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--DFJ7Balb--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://miro.medium.com/v2/resize:fit:1400/1%2A0JPGzFQQ0-GKxtEE1mFj-A.png" alt="" width="800" height="251"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;It is easy to model as a relational structure.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--UUfW1RNl--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://miro.medium.com/v2/resize:fit:684/1%2AMcvh9xCIlsWmulw4RRxarw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--UUfW1RNl--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://miro.medium.com/v2/resize:fit:684/1%2AMcvh9xCIlsWmulw4RRxarw.png" alt="" width="342" height="636"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Some sample data:&lt;/p&gt;

&lt;p&gt;Nodes&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--em-hgyat--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://miro.medium.com/v2/resize:fit:664/1%2AR8coJTKVZEJlJpbuAgPHIQ.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--em-hgyat--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://miro.medium.com/v2/resize:fit:664/1%2AR8coJTKVZEJlJpbuAgPHIQ.png" alt="" width="332" height="398"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Edges&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Z7PwnTcJ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://miro.medium.com/v2/resize:fit:1148/1%2ABjOHpVd1YTEUGSCdYbsBRQ.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Z7PwnTcJ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://miro.medium.com/v2/resize:fit:1148/1%2ABjOHpVd1YTEUGSCdYbsBRQ.png" alt="" width="574" height="468"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  RDF Like Graphs
&lt;/h2&gt;

&lt;p&gt;In the Resource Description Framework (RDF), nodes and edges are not directly differentiated, and you could use the same resource as an edge or a node in different contexts if needed. All data is stored as triples of resources. Sometimes, modeling graphs closer to RDF is helpful, but reasoning and building queries in this model are hard. I prefer separate relations for nodes and edges.&lt;br&gt;&lt;br&gt;
Classical RDF does not have a label concept and would model labels as additional triples, but since labels are so common, I add the label as a column.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--7DBFI3t_--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://miro.medium.com/v2/resize:fit:788/1%2A0jb6zigXL5rKCh06cj_Qog.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--7DBFI3t_--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://miro.medium.com/v2/resize:fit:788/1%2A0jb6zigXL5rKCh06cj_Qog.png" alt="" width="394" height="638"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Resource&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--P9KAqTZC--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://miro.medium.com/v2/resize:fit:708/1%2AtAkNdwhg9aWJG4TnZ_sAWg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--P9KAqTZC--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://miro.medium.com/v2/resize:fit:708/1%2AtAkNdwhg9aWJG4TnZ_sAWg.png" alt="" width="354" height="556"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Triple&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--TqeAJQgO--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://miro.medium.com/v2/resize:fit:1368/1%2Az7YiBwgrHmL7gHQGzTCEVw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--TqeAJQgO--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://miro.medium.com/v2/resize:fit:1368/1%2Az7YiBwgrHmL7gHQGzTCEVw.png" alt="" width="684" height="446"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  Named Graphs and Graph of Graphs
&lt;/h2&gt;

&lt;p&gt;The concept of the named graph came from the RDF community, which needed to group some sets of triples. In this way, you form subgraphs inside an existing graph. You could refer to the subgraph as a regular node. This setup simplifies complex graphs, introduces hierarchies, and even adds features and properties of hypergraphs while keeping a directed nature.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--MYKklFdX--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://miro.medium.com/v2/resize:fit:1400/1%2AxYPdyW2iEveDffqbFRGR8A.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--MYKklFdX--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://miro.medium.com/v2/resize:fit:1400/1%2AxYPdyW2iEveDffqbFRGR8A.png" alt="" width="800" height="633"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;It looks complex, but it is easy to model by slightly modifying the directed graph.&lt;br&gt;&lt;br&gt;
A node could host a graph inside, so let's reflect this fact with a location column on the node. If a node belongs to the main graph, we could set the location to null or introduce a dedicated main node; it is up to you.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--J9Solevk--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://miro.medium.com/v2/resize:fit:1104/1%2A5nEswOiVoj6yj4IuIRPQhA.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--J9Solevk--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://miro.medium.com/v2/resize:fit:1104/1%2A5nEswOiVoj6yj4IuIRPQhA.png" alt="" width="552" height="734"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Nodes could have edges to nodes in different subgraphs. This structure allows arbitrarily nested graphs. Edges stay location-free.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--6QMUWTtq--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://miro.medium.com/v2/resize:fit:1268/1%2AY-PT_yCCBuLBtaQJHSbY2w.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--6QMUWTtq--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://miro.medium.com/v2/resize:fit:1268/1%2AY-PT_yCCBuLBtaQJHSbY2w.png" alt="" width="634" height="1140"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  Hypergraph
&lt;/h2&gt;

&lt;p&gt;A hypergraph is a mathematical generalization of graphs in which a hyperedge could connect multiple nodes or none at all. So, an edge holds a set of nodes instead of a pair of nodes. Hypergraphs are an emerging domain for modeling complex and dynamic systems and are widely used for temporal and event-dependent graphs. We will model undirected hypergraphs.&lt;br&gt;&lt;br&gt;
Usually, a hypergraph is drawn as overlapping sets, like a Venn diagram.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--AmovcSrJ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://miro.medium.com/v2/resize:fit:1400/1%2AyQvk0BKCLPeaux4-nTMt4w.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--AmovcSrJ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://miro.medium.com/v2/resize:fit:1400/1%2AyQvk0BKCLPeaux4-nTMt4w.png" alt="" width="786" height="806"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Since an edge now has a many-to-many relation with nodes, we just need a join table.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--ceH9mwwG--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://miro.medium.com/v2/resize:fit:1400/1%2Ao7n2JYVeANQ2mNclwJXr9Q.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--ceH9mwwG--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://miro.medium.com/v2/resize:fit:1400/1%2Ao7n2JYVeANQ2mNclwJXr9Q.png" alt="" width="772" height="624"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Nodes&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--RuEzvCPj--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://miro.medium.com/v2/resize:fit:780/1%2AUha9P07UxzvVyg1lNtFK2A.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--RuEzvCPj--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://miro.medium.com/v2/resize:fit:780/1%2AUha9P07UxzvVyg1lNtFK2A.png" alt="" width="390" height="592"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Edges&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--9nqZYpQG--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://miro.medium.com/v2/resize:fit:844/1%2A3SnU3SB56jiKlN2T9NcAbQ.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--9nqZYpQG--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://miro.medium.com/v2/resize:fit:844/1%2A3SnU3SB56jiKlN2T9NcAbQ.png" alt="" width="422" height="644"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Edge to nodes&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--hUCWbn9v--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://miro.medium.com/v2/resize:fit:992/1%2A7NZbUtqISj-TrVEBPiJ0RA.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--hUCWbn9v--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://miro.medium.com/v2/resize:fit:992/1%2A7NZbUtqISj-TrVEBPiJ0RA.png" alt="" width="496" height="1114"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This table could grow quickly if you have big edges with wide sets of nodes.&lt;/p&gt;
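&lt;p&gt;The hypergraph join table can be sketched like this (table names and sample labels are illustrative):&lt;/p&gt;

```python
import sqlite3

# Undirected hypergraph: a hyperedge holds a whole set of nodes, so the
# edge-node relation becomes a many-to-many join table.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE nodes (id INTEGER PRIMARY KEY, label TEXT NOT NULL);
CREATE TABLE edges (id INTEGER PRIMARY KEY, label TEXT NOT NULL);
CREATE TABLE edge_nodes (
    edge_id INTEGER NOT NULL REFERENCES edges(id),
    node_id INTEGER NOT NULL REFERENCES nodes(id),
    PRIMARY KEY (edge_id, node_id)
);
""")
con.executemany("INSERT INTO nodes VALUES (?, ?)",
                [(1, "Me"), (2, "Alice"), (3, "Bob")])
con.execute("INSERT INTO edges VALUES (1, 'meeting')")
# One hyperedge connecting three nodes at once -- a set, not a pair.
con.executemany("INSERT INTO edge_nodes VALUES (1, ?)", [(1,), (2,), (3,)])

participants = [r[0] for r in con.execute("""
    SELECT n.label
    FROM edge_nodes en JOIN nodes n ON n.id = en.node_id
    WHERE en.edge_id = 1
    ORDER BY n.label
""")]
```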
&lt;h2&gt;
  
  
  HyperGraph with Edges as Nodes
&lt;/h2&gt;

&lt;p&gt;As you noticed, we could point the Me node at KGraph because KGraph is now an edge. So, in a hypergraph, edges are sets of nodes. If we want a graph-of-graphs-like setup, we need the ability to use edges as nodes, the same way the RDF framework does with resources.&lt;br&gt;&lt;br&gt;
We could simplify a lot of relations and create more complex structures.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--eANJh4C2--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://miro.medium.com/v2/resize:fit:1400/1%2ArT7-VNHZ0-T99NtUKBpE8A.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--eANJh4C2--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://miro.medium.com/v2/resize:fit:1400/1%2ArT7-VNHZ0-T99NtUKBpE8A.png" alt="" width="760" height="824"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;To achieve this, we could combine an RDF-like schema with a hypergraph schema. The model would still remain relatively simple.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--k5H0MUl5--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://miro.medium.com/v2/resize:fit:728/1%2AKGCP_kBiJVFCA_5ntcXHmQ.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--k5H0MUl5--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://miro.medium.com/v2/resize:fit:728/1%2AKGCP_kBiJVFCA_5ntcXHmQ.png" alt="" width="364" height="582"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We couldn't reuse edges as we did in RDF because each edge contains a different set of resources.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--VJO-ZDC---/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://miro.medium.com/v2/resize:fit:840/1%2AMfXqGOzMFjU_8h5FVoYmPQ.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--VJO-ZDC---/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://miro.medium.com/v2/resize:fit:840/1%2AMfXqGOzMFjU_8h5FVoYmPQ.png" alt="" width="420" height="1022"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--cL5lv1Ey--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://miro.medium.com/v2/resize:fit:1092/1%2A364KXnC8LJ6RX7w3w0t5dA.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--cL5lv1Ey--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://miro.medium.com/v2/resize:fit:1092/1%2A364KXnC8LJ6RX7w3w0t5dA.png" alt="" width="546" height="1172"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We could deduce pure nodes and edges from the relations. Unfortunately, as long as you allow empty edges, there is no way to differentiate an empty edge from a lone node. Hypergraphs could model named graphs and graphs of graphs, but in my experience, named graphs with a location column on nodes are more convenient.&lt;/p&gt;
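&lt;p&gt;A sketch of the edges-as-nodes variant, combining the RDF-like resource table with the hypergraph membership table; the labels are illustrative:&lt;/p&gt;

```python
import sqlite3

# Hypergraph with edges as nodes: a single resource table, where a resource
# acts as an edge whenever it has members -- and a member may itself be an edge.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE resources (id INTEGER PRIMARY KEY, label TEXT NOT NULL);
CREATE TABLE edge_members (
    edge_id   INTEGER NOT NULL REFERENCES resources(id),
    member_id INTEGER NOT NULL REFERENCES resources(id),
    PRIMARY KEY (edge_id, member_id)
);
""")
con.executemany("INSERT INTO resources VALUES (?, ?)", [
    (1, "Me"), (2, "Alice"), (3, "KGraph"), (4, "follows"),
])
# KGraph is a hyperedge over {Me, Alice} ...
con.executemany("INSERT INTO edge_members VALUES (3, ?)", [(1,), (2,)])
# ... and 'follows' reuses the KGraph edge itself as one of its members.
con.executemany("INSERT INTO edge_members VALUES (4, ?)", [(1,), (3,)])

# Deduce pure nodes: resources that never appear as an edge. Note that an
# empty edge would be indistinguishable from a lone node in this query.
pure_nodes = sorted(r[0] for r in con.execute("""
    SELECT label FROM resources
    WHERE id NOT IN (SELECT edge_id FROM edge_members)
"""))
```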
&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Relational and embeddable databases can be a good choice for small-scale and personal knowledge graphs. I have had a lot of positive experience with graph structures on a relational model queried with Datalog, and any Datalog database with persistence could give you good results. Most static, fact-based semantic graphs work well with a simple directed graph. I am increasingly working with AI applications in which complex ideas, conversations, or events could contain subgraphs and multiple entities. In these cases, a directed graph or a simple triple is not enough. For me, graphs of graphs and named graphs give good results for this task and still stay close to what SPARQL 1.1 and TriG/Turtle can model.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;lt;span id="3b27" data-selectable-paragraph=""&amp;gt;# N-Graphs&amp;lt;br&amp;gt;&amp;amp;lt;http://example.org/alice/foaf.rdf&amp;amp;gt; {&amp;lt;br&amp;gt;  &amp;amp;lt;http://example.org/alice/foaf.rdf#me&amp;amp;gt; &amp;amp;lt;http://www.w3.org/1999/02/22-rdf-syntax-ns#type&amp;amp;gt; &amp;amp;lt;http://xmlns.com/foaf/0.1/Person&amp;amp;gt; .&amp;lt;br&amp;gt;  &amp;amp;lt;http://example.org/alice/foaf.rdf#me&amp;amp;gt; &amp;amp;lt;http://xmlns.com/foaf/0.1/name&amp;amp;gt;                  "Alice"  .&amp;lt;br&amp;gt;}&amp;lt;br&amp;gt;&amp;amp;lt;http://example.org/bob/foaf.rdf&amp;amp;gt; {&amp;lt;br&amp;gt;  &amp;amp;lt;http://example.org/bob/foaf.rdf#me&amp;amp;gt;   &amp;amp;lt;http://www.w3.org/1999/02/22-rdf-syntax-ns#type&amp;amp;gt; &amp;amp;lt;http://xmlns.com/foaf/0.1/Person&amp;amp;gt; .&amp;lt;br&amp;gt;  &amp;amp;lt;http://example.org/bob/foaf.rdf#me&amp;amp;gt;   &amp;amp;lt;http://xmlns.com/foaf/0.1/name&amp;amp;gt;                  "Bob" .&amp;lt;br&amp;gt;}&amp;lt;/span&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Hypergraphs are a newer, more robust tool well suited to complex, dynamic, and temporally aware systems, although few industry-standard tools work with hypergraphs yet. Hypergraphs where edges can be used as nodes deviate from the classical mathematical model but give the most flexible platform for modeling, and this model can sometimes reduce the number of edges. More general models are simpler to store but more complex to reason about and query, so you need to find the balance yourself.&lt;/p&gt;

</description>
      <category>sql</category>
      <category>graphs</category>
      <category>hypergraph</category>
      <category>ai</category>
    </item>
    <item>
      <title>Don't Sell SSI … build on top of it. Outlook of 2023 in SSI journey</title>
      <dc:creator>Volodymyr Pavlyshyn</dc:creator>
      <pubDate>Sun, 31 Dec 2023 10:40:53 +0000</pubDate>
      <link>https://dev.to/volland/dont-sell-ssi-build-on-top-of-it-outlook-of-2023-in-ssi-journey-1jii</link>
      <guid>https://dev.to/volland/dont-sell-ssi-build-on-top-of-it-outlook-of-2023-in-ssi-journey-1jii</guid>
      <description>&lt;ul&gt;
&lt;li&gt;  don't sell SSI build on top&lt;/li&gt;
&lt;li&gt;  we need agents, not wallets&lt;/li&gt;
&lt;li&gt;  we don't need VC for everything&lt;/li&gt;
&lt;li&gt;  DWNs as SSI on steroids&lt;/li&gt;
&lt;li&gt;  AI needs SSI for the future&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;As we enter a new year, it’s a time for reflection and forward-thinking, especially in technology and data privacy. My journey with Self-Sovereign Identity (SSI) and Sovereign data has been a significant part of my professional life, and I’d like to share some insights and realizations from this journey.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Shift in Focus: From Selling SSI to Building Value-Driven Products
&lt;/h2&gt;

&lt;p&gt;Over the past year, there has been a significant shift in my approach to SSI. Like many in the field, I was initially captivated by the potential of SSI technology, believing it could transform society and a wide range of projects. However, the reality proved to be different. Simply selling or packaging the technology as a shiny new solution isn’t enough. This approach led to a lack of adoption and understanding, as people found it complex and challenging.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Realization: Technology Must Solve Real-Life Problems
&lt;/h2&gt;

&lt;p&gt;The key realization was that technology, including SSI, must be more than an innovative concept. It must address real-life problems and offer tangible value to users. This year, I’ve aligned with a team that shares my values and focuses on creating products that enhance privacy and security. Our goal is to build tools that respect user data privacy and offer practical solutions to everyday challenges.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Product: A Privacy-First Personal Assistant
&lt;/h2&gt;

&lt;p&gt;One of our exciting projects is a privacy-first personal assistant. This tool is designed to keep user data secure and private, operating with a local-first and, eventually, offline-first approach. This presents unique challenges, as it involves running applications directly on devices, a path less traveled in the tech world.&lt;/p&gt;

&lt;h2&gt;
  
  
  Beyond Technology: Creating a Network of Networks
&lt;/h2&gt;

&lt;p&gt;Looking ahead, the potential of SSI extends beyond individual applications. We envision a future where personal knowledge graphs and semantic memories, already available in some of our tools, can connect to broader social structures. This could lead to the development of artificial identity agents that understand and interact with our complex social networks, reflecting our multiple identities and the groups we associate with.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Challenge with Wallets and the Need for Agents
&lt;/h2&gt;

&lt;p&gt;Another area of exploration is the concept of digital wallets. While they are often touted as revolutionary, I see them as a barrier to adoption due to their complexity and maintenance requirements. The future might lie in developing agents that can act on behalf of users, balancing security, sovereignty, and usability.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Trust is not verifiable, and data is not only VCs&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;We are moving away from the concept of Verifiable data for everything. Current versions of verifiable data simply have too many security and trust challenges.&lt;/p&gt;

&lt;h2&gt;
  
  
  DWN, Web5, and SSI on steroids
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--bQGj_oVl--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://miro.medium.com/v2/resize:fit:1400/1%2ArgrvlHO665x_tH44_3Kehw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--bQGj_oVl--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://miro.medium.com/v2/resize:fit:1400/1%2ArgrvlHO665x_tH44_3Kehw.png" alt="" width="800" height="590"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;DWN took a unique path in decentralized storage, offering a relay-based topology with permissions, encryption, and synchronization capabilities. This approach particularly benefits DApps and AI agents requiring extensive data storage. DWNs balance local and networked data storage, providing privacy, efficiency, and scalability. DWNs give up on global consensus and global state.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  DWN offers a permission layer out of the box. The user has complete control of access rights.&lt;/li&gt;
&lt;li&gt;  DWN is network agnostic, so you don’t need to trust any network or organization that stands behind a network or protocol&lt;/li&gt;
&lt;li&gt;  DWN is transport agnostic and could be used in a heterogeneous setup&lt;/li&gt;
&lt;li&gt;  DWN has self-hosted and embeddable setups, so your decentralized app could start at zero cost&lt;/li&gt;
&lt;li&gt;  DWN still has an incentives challenge, one of the topics that should be solved by the community and the TBD folks. I believe there will be a network of hosted DWNs in the future&lt;/li&gt;
&lt;li&gt;  Most importantly, DWN is a protocol-based solution open to extension; on top of it, you can build flexible and secure data protocols that unlock user data and make it interoperable&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Building for the Future
&lt;/h2&gt;

&lt;p&gt;Moving forward, we should focus on creating products that genuinely enhance people’s lives while respecting their privacy and data sovereignty. It’s not just about selling a technology; it’s about integrating it meaningfully into the fabric of daily life. This approach will drive actual adoption and appreciation of SSI and related technologies.&lt;/p&gt;

&lt;p&gt;Join us on this journey, try our products, and be part of a community that values privacy, security, and practical innovation. Here’s to a year of meaningful technological advancements and a future where technology serves humanity profoundly and respectfully.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>DWNs as a data game changer for sovereign data</title>
      <dc:creator>Volodymyr Pavlyshyn</dc:creator>
      <pubDate>Sun, 05 Nov 2023 19:46:51 +0000</pubDate>
      <link>https://dev.to/volland/dwns-as-a-data-game-changer-for-sovereign-data-3iec</link>
      <guid>https://dev.to/volland/dwns-as-a-data-game-changer-for-sovereign-data-3iec</guid>
      <description>&lt;p&gt;&lt;iframe width="710" height="399" src="https://www.youtube.com/embed/OvpXde1SGIk"&gt;
&lt;/iframe&gt;
&lt;/p&gt;

&lt;h2&gt;
  
  
  Early days of SSI. Is everything a VC?
&lt;/h2&gt;

&lt;p&gt;There was a boom of self-sovereign identity projects around 2018-2020. All of them had one common bias: everything was viewed as a Verifiable Credential. So, every single user data point should be represented as a W3C Verifiable Credential or similar technology. If you have a hammer, everything looks like a nail, and that didn't work well. &lt;br&gt;
The majority of user data is not VC material. VC-fication makes sense for data points with an attestation component that needs to be verifiable by a third party, which creates a need for the classical trust-triangle interaction. &lt;br&gt;
So, we must admit that sovereign data is much broader than VCs. We still need to share user data in line with the SSI principles of user-centricity and consent. &lt;br&gt;
Now we need a persistence and application layer that enables this polyglot data setup. &lt;/p&gt;

&lt;h2&gt;
  
  
  Missed persistence layer
&lt;/h2&gt;

&lt;p&gt;Self-sovereign identity has the bold goal of creating the missing identity layer for the Internet. But identity is not the web's only problem. The Internet was designed on top of stateless protocols, and all these protocols lack decentralized persistence. So even for Verifiable Credentials and identity data like DID documents, you need persistence, and to satisfy the SSI idea, you probably need it in a decentralized manner. &lt;/p&gt;

&lt;h2&gt;
  
  
  IPFS is not a rescue
&lt;/h2&gt;

&lt;p&gt;IPFS was a pioneer of decentralized data storage outside of blockchain. IPFS focuses on self-addressable data, shapes decentralized applications, and enables a lot of DApps. IPFS struggles in a few areas:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;privacy&lt;/strong&gt; : all data is open and accessible to everyone, as on most blockchains. So the only way to be private is full encryption, better with post-quantum encryption, since IPFS promises immutable data storage where data stays forever&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;incentives&lt;/strong&gt; - a much bigger problem is how to convince node maintainers to store and replicate your data. If you have no nodes, your data is gone. Blockchains have an implicit economic model; IPFS does not &lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;heavy&lt;/strong&gt; - it is almost impossible to run IPFS on a local device. 
&amp;gt; For a general IPFS system, not accounting for specific workload requirements or use cases, high core count processors and a minimum of 32GB of memory is recommended. A tiered storage system using NVMe, SSD, and HDD devices is ideal for data storage.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;global&lt;/strong&gt; - IPFS is designed as decentralized global storage &lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;peer-to-peer&lt;/strong&gt; nature forces users to install and use a client for data access, limiting adoption and setting a relatively high technical barrier.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Ceramic like services
&lt;/h2&gt;

&lt;p&gt;Ceramic takes IPFS to the next level: it certainly fixes the privacy concern with the Lit Protocol and solves the incentive problem to some degree. However, the network is private, permission-based, and controlled by a private company. So we are still not there. &lt;/p&gt;

&lt;h2&gt;
  
  
  What kind of storage do we need for sovereign data?
&lt;/h2&gt;

&lt;p&gt;So, what does an excellent persistent layer for sovereign data look like? &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Content polyglot layer&lt;/strong&gt;: To empower and enable wallets, it should be capable of storing VCs and other structured data as well as files, blobs, and other media.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;local first&lt;/strong&gt; and &lt;strong&gt;offline first&lt;/strong&gt; - The user should be able to self-host and own the data; in a perfect setup, the system should be capable of running on user hardware &lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;device and mobile friendly&lt;/strong&gt; - a continuation of the previous requirement. More and more users are mobile-first, or even mobile-only &lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;extendable in a decentralized manner&lt;/strong&gt; - it is only a question of time until the capacity of a mobile device is not enough to handle all the data the user wants to take along. We wish to extend our local-first model with decentralized and online capabilities &lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Secure&lt;/strong&gt; - we want edge encryption and optional data synchronization. The user should decide what data to share and replicate with decentralized nodes and other network users, ideally with client-side end-to-end encryption &lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Interoperable and open&lt;/strong&gt; - storage should work over an open, standard data protocol and offer transparent, user-centric, protocol-based data exchange&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;reactive and proactive&lt;/strong&gt; - in a decentralized, data-intensive exchange, it is essential to have a reactive, streaming approach to receive data changes and keep devices in sync &lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;findable&lt;/strong&gt; - according to the FAIR principles, data should not only be accessible and reusable but also friendly to discovery and recall&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;queryable&lt;/strong&gt; - there should be a way to implement intelligent queries across structured data &lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  How does DWN fit into the game?
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Content polyglot layer&lt;/strong&gt; - DWN is suitable for storing structured and binary data and, potentially, files. The system expects self-describing data: you provide a data schema together with the data, and the schema is one of the record keys &lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;local first&lt;/strong&gt; and &lt;strong&gt;offline first&lt;/strong&gt; - the DWN reference implementation is built on top of LevelDB, and an SQL-based setup is possible. A user could run a DWN server or use an app with an embedded DWN built with the DWN SDK&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;extendable in a decentralized manner&lt;/strong&gt; - DWN is built as a relay-based decentralized solution, the sweet spot between a centralized server and a peer-to-peer setup. On top of relays, you can construct flexible topologies, from self-hosted solo servers and in-memory or even browser-based agents to a full-scale synced data mesh&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Secure&lt;/strong&gt; - DWN offers a DID Auth layer and full-scale permission and data access management out of the box. End-to-end, client-side edge encryption is also available out of the box &lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Interoperable and open&lt;/strong&gt; - DWN is an open protocol under the DIF foundation umbrella. DWN is fully open source and driven by a proactive and open community. DWN encourages protocol- and data-protocol-driven development &lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;reactive and proactive&lt;/strong&gt; - DWN has a sync mechanism as the propagation mechanism, currently under active development. The DWN server offers a WebSocket interface. A prominent feature is hooks, which offer transport-agnostic subscriptions to data changes and events &lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;findable&lt;/strong&gt; - currently, discovery of DWNs depends heavily on DID documents, quite similar to the DIDComm v2 mechanisms &lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;queryable&lt;/strong&gt; - DWN queries are quite limited and currently require external indexing and query solutions &lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>dwn</category>
      <category>web5</category>
      <category>tbd</category>
      <category>ssi</category>
    </item>
    <item>
      <title>What is WEB5 about, and why does it matter in the post-AI and post-blockchain world?</title>
      <dc:creator>Volodymyr Pavlyshyn</dc:creator>
      <pubDate>Sat, 09 Sep 2023 15:14:03 +0000</pubDate>
      <link>https://dev.to/volland/what-is-web5-about-and-why-does-it-matter-in-the-post-ai-and-post-blockchain-world-3a5n</link>
      <guid>https://dev.to/volland/what-is-web5-about-and-why-does-it-matter-in-the-post-ai-and-post-blockchain-world-3a5n</guid>
      <description>&lt;h2&gt;
  
  
  Ultimate challenges of modern WEB
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;the Internet was born without an identity layer, so you don't know with whom you are talking &lt;/li&gt;
&lt;li&gt;the Internet was born stateless; this allows it to scale massively, but how can users keep their data?&lt;/li&gt;
&lt;li&gt;how do you prove data ownership and the authenticity of data?&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Why does it matter now in the AI world?
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://yakihonne.com/article/naddr1qq25cmm9g5c4qvjkxedrwertvsek7wz0ffg8xq3qu6qhg5ucu3xza4nlz94q90y720tr6l09avnq8y3yfp5qrv9v8susxpqqqp65wehdayu"&gt;AI future needs a data-driven pseudo-anonymous Identity.&lt;/a&gt;. AI will reshape a landscape completelly. We will need a way to identify original content form a generated one. Data is a main fuel of AI models. Research show that evem smaller models perform better on quality data. it is opens topic of data economy and data ownership. Even more now we extending owr self with a agent - now we need to manage agent identities and payments for interconected agents network. All this require new ways of managing identity and data ownership&lt;/p&gt;

&lt;h2&gt;
  
  
  WEB3 gives you assets to own, but ...
&lt;/h2&gt;

&lt;p&gt;We all know about web3 and blockchain, together with a programmable economy and the tokenisation of assets. Web3 was invented as a concept by Dr. Gavin Wood of the Ethereum blockchain to promote the idea of smart contracts and the Ethereum network. The biggest value of this movement was ownership: for the first time we got a mechanism to give users digitalised assets they could own. Together with it, we were faced with the problem of identity in a decentralized world.&lt;br&gt;
For a deeper historical and technological view, read my article &lt;a href="https://yakihonne.com/article/naddr1qq2kwu2jdfux5s6nfa3hjajev43x5h65d9jrjq3qu6qhg5ucu3xza4nlz94q90y720tr6l09avnq8y3yfp5qrv9v8susxpqqqp65wt32put"&gt;Pre Web , Web1, Web2 , web3 , web5 , web7 and all hundreds of future web X explained in 12 Toots&lt;/a&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  Challenges of WEB3
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Everybody has their own view of the web3 future and their own set of challenges &lt;/li&gt;
&lt;li&gt;Privacy is broken, and now we have Tornado and all the Layer 2 / Layer 3 patches that break the ledger concept &lt;/li&gt;
&lt;li&gt;Users are locked into a particular blockchain and practically have no way of interacting outside the network &lt;/li&gt;
&lt;li&gt;Networks of networks: only God knows how many blockchains and cryptocurrency projects we have nowadays &lt;/li&gt;
&lt;li&gt;Data persistence is limited and expensive &lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;
  
  
  Web3 and second-system syndrome
&lt;/h3&gt;

&lt;p&gt;We have so many requirements and features for the next web version that we failed to deliver it and got lost.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Internet for machines and semantic web&lt;/li&gt;
&lt;li&gt;internet of connected data&lt;/li&gt;
&lt;li&gt;internet of IoT devices and things&lt;/li&gt;
&lt;li&gt;Internet of identity and ownership, and the most recent challenge, the Internet of AI-powered agents. The Satoshi paper and Vitalik Buterin's idea of intelligent contracts bring a new view of ownership and economy but heavily ignore privacy, ownership, and data.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;
  
  
  Web3 locks the user in a new glass cage
&lt;/h3&gt;

&lt;p&gt;Blockchain, the promise of WEB3 and a new era of the internet, locks users into even more restricted and isolated networks that are forced to be self-contained and accumulate mainly public data, with extremely high storage costs and a cost attached to every interaction. Another challenge is transaction speed: blockchains are slow. So blockchain is great for assets and a new economy but fails to create an identity and a new social space for extended humans and machines.&lt;/p&gt;
&lt;h2&gt;
  
  
  WEB5 builds on top of Self-Sovereign Identity
&lt;/h2&gt;

&lt;p&gt;For SSI, read my article &lt;a href="https://yakihonne.com/article/naddr1qq25cuj8wpn97j22f9nhgampverrge6wx9f9yq3qu6qhg5ucu3xza4nlz94q90y720tr6l09avnq8y3yfp5qrv9v8susxpqqqp65wee5tj6"&gt;Self Sovereign Identity in 7 Toots&lt;/a&gt;. The idea is simple: we build the &lt;strong&gt;missing identity layer&lt;/strong&gt; that treats humans as something more than a private key and gives us tools to create data points about ourselves and others and freely exchange this data via protocols. We unlock the user from the glass cage of blockchain and from the fragmented nightmare of web2 platforms and apps, where you slice yourself into hundreds of yous. Now the holistic you is open to the world.&lt;/p&gt;
&lt;h3&gt;
  
  
  Building Blocks of WEB5
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Identity layer on top of SSI tools: DIDs&lt;/li&gt;
&lt;li&gt;The missing decentralized persistence layer, with permissions and synchronisation, on top of DWNs&lt;/li&gt;
&lt;li&gt;Verifiable and owned data with Verifiable Credentials&lt;/li&gt;
&lt;li&gt;DApps on top of DWNs&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Old WEB3 &amp;amp; SSI Tools
&lt;/h3&gt;

&lt;h4&gt;
  
  
  Keys
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;Asymmetric keys + signatures&lt;/strong&gt; are the backbone of decentralized identity in #web3 and #web5, giving identity its algorithmic, cryptographic basis. But since we are no longer locked into a network that creates an addressable space, we face a new challenge: how to distribute and manage public keys. Hence the need for DPKI, a decentralized Public Key Infrastructure.&lt;/p&gt;
&lt;h4&gt;
  
  
  DIDs
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;DIDs provide a decentralized Public Key Infrastructure&lt;/strong&gt; that distributes your public keys and service endpoints to a broader audience. &lt;/p&gt;

&lt;p&gt;The DID is the core of SSI: a cryptographically verifiable, decentralized, resolvable identifier. A Decentralized Identifier (DID) is a unique, persistent, and cryptographically verifiable identifier that allows individuals, organizations, or devices to establish and manage their digital identities independently. DIDs are used in decentralized identity systems, enabling users to have control over their data and interact securely without relying on a centralized authority.&lt;/p&gt;
&lt;h5&gt;
  
  
  DID Identifier
&lt;/h5&gt;

&lt;p&gt;The representation of the DID itself; it is part of the DID URI.&lt;/p&gt;

&lt;p&gt;did:key identifier&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--0UxbSxzu--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://miro.medium.com/v2/resize:fit:1400/format:webp/1%2AQ-yQXBFuuFIEZIjADkinVw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--0UxbSxzu--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://miro.medium.com/v2/resize:fit:1400/format:webp/1%2AQ-yQXBFuuFIEZIjADkinVw.png" alt="" width="800" height="138"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h5&gt;
  
  
  DID Document
&lt;/h5&gt;

&lt;p&gt;A DID (Decentralized Identifier) document is a structured, machine-readable JSON or JSON-LD document that contains essential information associated with a specific DID. It serves as a "public profile" for a decentralized identity, providing the necessary details for verifying signatures, encrypting/decrypting messages, and interacting with the identity's associated services.&lt;/p&gt;

&lt;p&gt;The DID document typically includes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;The DID itself: A unique identifier that represents the decentralized identity.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Public keys: One or more public keys associated with the DID, used for cryptographic operations such as verifying signatures and encrypting messages.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Authentication methods: Mechanisms to prove control of the DID, which typically involve the use of public keys.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Service endpoints: URLs or other identifiers of services related to the DID, such as profile information, communication channels, or data repositories.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Other metadata: Additional information related to the DID, such as timestamps, controller information, or specific DID method details.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The DID document’s information allows other parties to trust and interact securely with the associated DID without relying on a centralized authority. DID documents are created, updated, and deactivated according to the rules and processes defined by the DID method associated with the DID. They are usually stored on distributed ledgers, blockchains, or other decentralized networks, making them globally resolvable and cryptographically verifiable.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{
  "@context": "https://w3id.org/security/v2",
  "publicKey": [
    {
      "id": "did:elem:EiBa0KyUWgvMdkt_ywullSPac2kyOkRP5JRtHSeICQ1t6Q#primary",
      "usage": "signing",
      "type": "Secp256k1VerificationKey2018",
      "publicKeyHex": "022ca63fffbd8b6dd7e54fa88b76d5245700ac81657fd59a03b73e4325ba1e19ba"
    },
    {
      "id": "did:elem:EiBa0KyUWgvMdkt_ywullSPac2kyOkRP5JRtHSeICQ1t6Q#recovery",
      "usage": "recovery",
      "type": "Secp256k1VerificationKey2018",
      "publicKeyHex": "0390d67bfbfc80d00edc7080a4c91f1c844208fabd03e158a5910f5d1601e69eb5"
    }
  ],
  "authentication": [
    "did:elem:EiBa0KyUWgvMdkt_ywullSPac2kyOkRP5JRtHSeICQ1t6Q#primary"
  ],
  "assertionMethod": [
    "did:elem:EiBa0KyUWgvMdkt_ywullSPac2kyOkRP5JRtHSeICQ1t6Q#primary"
  ],
  "id": "did:elem:EiBa0KyUWgvMdkt_ywullSPac2kyOkRP5JRtHSeICQ1t6Q"
}

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  DID Actions
&lt;/h4&gt;

&lt;p&gt;There are four possible DID actions.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Creating a DID&lt;/strong&gt;: Generating a new identifier and associating it with a DID document, which contains public key material and service endpoints for the identity. For DIDs on a blockchain, you may see the term &lt;strong&gt;anchoring&lt;/strong&gt;. DID anchoring refers to the process of registering a Decentralized Identifier (DID) and its associated DID document on a distributed ledger or blockchain. Anchoring provides a secure, tamper-proof, and verifiable record of the DID’s existence and its associated information, making it an essential component of decentralized identity systems.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Resolving a DID&lt;/strong&gt;: Retrieving the DID document associated with a specific DID, which is essential for verifying signatures, encrypting/decrypting messages, and interacting with the identity’s associated services.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Updating a DID&lt;/strong&gt;: Modifying the DID document, such as adding or updating public keys, service endpoints, or other metadata. This action typically requires authorization from the DID controller.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Deactivating a DID&lt;/strong&gt;: Marking a DID as inactive, rendering it unusable for future interactions. This action also usually requires authorization from the DID controller. DID resolution is the main and mandatory operation; every DID method defines a create procedure.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
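&lt;p&gt;The four actions can be sketched against a purely in-memory registry. This is an illustrative toy, not any real DID method's API; production methods anchor these operations on a ledger or network and enforce controller authorization:&lt;/p&gt;

```python
# Toy sketch of the four DID actions against an in-memory "registry".
# Illustrative only: a real DID method anchors these operations on a ledger
# or network and requires authorization from the DID controller.

registry = {}

def create(did, document):
    registry[did] = {"doc": document, "active": True}

def resolve(did):
    entry = registry.get(did)
    return entry["doc"] if entry and entry["active"] else None

def update(did, changes):
    registry[did]["doc"].update(changes)  # real methods check controller auth here

def deactivate(did):
    registry[did]["active"] = False  # the DID becomes unresolvable

create("did:example:123", {"id": "did:example:123", "publicKey": []})
update("did:example:123", {"service": [{"serviceEndpoint": "https://example.com/dwn"}]})
print(resolve("did:example:123"))  # the updated DID document
deactivate("did:example:123")
print(resolve("did:example:123"))  # None
```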

&lt;h4&gt;
  
  
  DID method
&lt;/h4&gt;

&lt;p&gt;A method is a concrete implementation that defines the rules and processes for DID actions on a particular distributed ledger, blockchain, or other decentralized network. DID methods provide a standardized way to manage DIDs and their associated DID documents, enabling interoperability between different decentralized identity systems. Each DID method is identified by a unique method name, which appears in the DID itself. For example, a DID with the method name “example” would look like “did:&lt;strong&gt;example&lt;/strong&gt;:123456789abcdefghi”.&lt;/p&gt;

&lt;p&gt;A DID could change its owner and be sold or reassigned.&lt;/p&gt;

&lt;h4&gt;
  
  
  DID Relations demystified
&lt;/h4&gt;

&lt;p&gt;The relations between all parts of a DID can be illustrated in the following diagram. The DID method dictates how the DID identifier gets created, updated, deactivated, and resolved.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--YR8tp28P--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://miro.medium.com/v2/resize:fit:1064/1%2AccyyKkG4pIOB47Mi6BNGYw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--YR8tp28P--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://miro.medium.com/v2/resize:fit:1064/1%2AccyyKkG4pIOB47Mi6BNGYw.png" alt="" width="532" height="553"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://yakihonne.com/article/naddr1qq2k7mmpgvm8vu3dt9e4qdj9fe3xc32nfdd8gq3qu6qhg5ucu3xza4nlz94q90y720tr6l09avnq8y3yfp5qrv9v8susxpqqqp65wt8g4j3"&gt;Read More&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  DWN: a game changer in data
&lt;/h3&gt;

&lt;p&gt;DWN = secured storage + message relay. &lt;br&gt;
Yes, it is somewhat similar to Nostr relays, but focused not on social media but on data and data exchange. It is more complex because it has to be more generic and allow you to describe complex data and data interactions. That is why DWN as a protocol allows you to describe your own protocols around data. &lt;/p&gt;

&lt;p&gt;The protocol is based on messages, but it is not about messaging at all. Messages only carry data about Records, Permissions, Hooks, and Protocols.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--J9LDnb1E--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://miro.medium.com/v2/resize:fit:1400/1%2AO8JDV7lX8IyrQ6MeoDaiwA.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--J9LDnb1E--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://miro.medium.com/v2/resize:fit:1400/1%2AO8JDV7lX8IyrQ6MeoDaiwA.png" alt="" width="800" height="426"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;DWN is a stack of Access and data Protocols.&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--hjd6I_VL--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://miro.medium.com/v2/resize:fit:1400/1%2A9jisklXWpJdo2wMyP9FY6g.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--hjd6I_VL--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://miro.medium.com/v2/resize:fit:1400/1%2A9jisklXWpJdo2wMyP9FY6g.png" alt="" width="800" height="414"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Key Concepts and Interfaces&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--31hc8Npo--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://miro.medium.com/v2/resize:fit:1392/1%2AorZx2RFt--uqOVW-DQh7Kw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--31hc8Npo--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://miro.medium.com/v2/resize:fit:1392/1%2AorZx2RFt--uqOVW-DQh7Kw.png" alt="" width="696" height="528"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Schema&lt;/strong&gt; — the core of interoperability defines a data context and meaning.&lt;br&gt;
&lt;strong&gt;Records&lt;/strong&gt; - the interface of Decentralized Web Nodes provides a mechanism to store data relative to shared schemas.&lt;br&gt;
&lt;strong&gt;Permissions&lt;/strong&gt; — provides a mechanism for external entities to request access to various data and functionality. It employs a capabilities-based architecture that allows for DID-based authorization and delegation of authorized capabilities to others if permitted by the owner of a Decentralized Web Node.&lt;br&gt;
&lt;strong&gt;Protocols&lt;/strong&gt; — introduces a mechanism for declaratively encoding an app or service’s underlying protocol rules, including segmentation of records, relationships between records, data-level requirements, and constraints on how participants interact with a protocol. With the DWeb Node Protocols mechanism, one can model the underpinning protocols for a vast array of use cases in a way that enables interop-by-default between app implementations that ride on top of them.&lt;br&gt;
&lt;strong&gt;Hooks&lt;/strong&gt; — aim to not only allow permissioned subscribers to be notified of new data but also optionally respond to the entity’s request that triggers their invocation. This allows a subscribed entity to process the data and react to the entity waiting on results.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://yakihonne.com/article/naddr1qqyrvdecxsmxvvtpqgswdqt52wvwgnpw6el3z6szhj09843a0hj7kfsrjgjys6qpkzkrcwgrqsqqqa280fvggc"&gt;Read in more datail&lt;/a&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  Verifiable Data
&lt;/h3&gt;

&lt;p&gt;Verifiable credentials allow you to share information about yourself and others in a tamper-proof and end-verifiable way. A VC contains the signature of the issuer, the person or organization that creates a data statement. The critical part: it is based on &lt;a href="https://www.w3.org/TR/vc-data-model-2.0/"&gt;open standards&lt;/a&gt; &lt;/p&gt;


&lt;h4&gt;
  
  
  Anatomy Of VC
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;machine-readable data context for the semantic web and AI&lt;/li&gt;
&lt;li&gt;data &lt;/li&gt;
&lt;li&gt;optional schema &lt;/li&gt;
&lt;li&gt;optional revocation data &lt;/li&gt;
&lt;li&gt;optional expiration date &lt;/li&gt;
&lt;li&gt;signature &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;VC = data + metadata + signature &lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--0knERrZk--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://www.w3.org/TR/vc-data-model-2.0/diagrams/vc.svg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--0knERrZk--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://www.w3.org/TR/vc-data-model-2.0/diagrams/vc.svg" alt="" width="325" height="240"&gt;&lt;/a&gt;&lt;br&gt;
Example of Revocable VC&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{
  '@context': [
    'https://www.w3.org/2018/credentials/v1',
    'https://schema.affinidi.com/ContentLikeV1-0.jsonld',
    'https://w3id.org/vc-revocation-list-2020/v1'
  ],
  id: 'claimId:i2wgld5x7b',
  type: [ 'VerifiableCredential', 'ContentLike' ],
  holder: {
    id: 'did:elem:EiAs9VqvNcEMkm9OfMdseWR0jMIltWHuUd5tCK_f17M6jA;elem:initial-state=eyJwcm90ZWN0ZWQiOiJleUp2Y0dWeVlYUnBiMjRpT2lKamNtVmhkR1VpTENKcmFXUWlPaUlqY0hKcGJXRnllU0lzSW1Gc1p5STZJa1ZUTWpVMlN5SjkiLCJwYXlsb2FkIjoiZXlKQVkyOXVkR1Y0ZENJNkltaDBkSEJ6T2k4dmR6TnBaQzV2Y21jdmMyVmpkWEpwZEhrdmRqSWlMQ0p3ZFdKc2FXTkxaWGtpT2x0N0ltbGtJam9pSTNCeWFXMWhjbmtpTENKMWMyRm5aU0k2SW5OcFoyNXBibWNpTENKMGVYQmxJam9pVTJWamNESTFObXN4Vm1WeWFXWnBZMkYwYVc5dVMyVjVNakF4T0NJc0luQjFZbXhwWTB0bGVVaGxlQ0k2SWpBek5UUXhZMk01T1RabU56VmxaR1U1WkRnd00yVXlOVE5oTm1FNU5UWXdOekF5TWprMk1EaGhNemM0WVRWbE56RmlaV1V4WldGaE1EQXpObU0zTkdJME1DSjlMSHNpYVdRaU9pSWpjbVZqYjNabGNua2lMQ0oxYzJGblpTSTZJbkpsWTI5MlpYSjVJaXdpZEhsd1pTSTZJbE5sWTNBeU5UWnJNVlpsY21sbWFXTmhkR2x2Ymt0bGVUSXdNVGdpTENKd2RXSnNhV05MWlhsSVpYZ2lPaUl3TTJOaU1qZzFPVGRrWkRjM016bG1OREl3WTJaaVpEUXdOekZtTUdNNU5Ua3dPRFZtWVRBNVlqSXlOR1l4Tm1ZeE1UbGlOelV6WVRZeVpXVTJaalJqT1RRaWZWMHNJbUYxZEdobGJuUnBZMkYwYVc5dUlqcGJJaU53Y21sdFlYSjVJbDBzSW1GemMyVnlkR2x2YmsxbGRHaHZaQ0k2V3lJamNISnBiV0Z5ZVNKZGZRIiwic2lnbmF0dXJlIjoiOXg1UVpYS0h4OEFCSmd2cmhqVFhhR2NGUC1TSVdoYVJCeW1Vbm9vOGk2dGdMaDhWSnlWWGxnbS0xaTZqSXROTW1NZXEwX2t1SUZRZnVNelVNdVNMbXcifQ'
  },
  credentialSubject: {
    data: {
      '@type': [Array],
      url: 'https://www.youtube.com/watch?v=owbkzvLhblk',
      date: '2022-09-09T13:22:20.668Z',
      like: true,
      score: 10
    }
  },
  credentialSchema: {
    id: 'https://schema.affinidi.com/ContentLikeV1-0.json',
    type: 'JsonSchemaValidator2018'
  },
  issuanceDate: '2022-09-09T13:22:20.668Z',
  expirationDate: '2065-09-10T00:00:00.000Z',
  credentialStatus: {
    id: 'https://revocation-api.prod.affinity-project.org/api/v1/revocation/revocation-list-2020-credentials/did:elem:EiBIkVawTQOfOCYp2xSITNKKePuELFTj3oc1ITnxk2uehw/20551#1',
    type: 'RevocationList2020Status',
    revocationListIndex: '1',
    revocationListCredential: 'https://revocation-api.prod.affinity-project.org/api/v1/revocation/revocation-list-2020-credentials/did:elem:EiBIkVawTQOfOCYp2xSITNKKePuELFTj3oc1ITnxk2uehw/20551'
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
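&lt;p&gt;The "data + metadata + signature" anatomy can be sketched in Python. Real VCs use asymmetric signatures (the issuer signs with a private key; anyone can verify with the public key from the issuer's DID document); the HMAC below is a deliberately simplified stand-in that still demonstrates the tamper-evidence property:&lt;/p&gt;

```python
# Sketch of the "VC = data + metadata + signature" anatomy. Real VCs use
# asymmetric signatures over the credential; the HMAC here is a simplified
# stand-in that still demonstrates tamper-evidence.
import hashlib
import hmac
import json

ISSUER_KEY = b"issuer-secret"  # stand-in for the issuer's signing key

def sign_vc(data, metadata):
    payload = json.dumps({"data": data, "metadata": metadata}, sort_keys=True)
    sig = hmac.new(ISSUER_KEY, payload.encode(), hashlib.sha256).hexdigest()
    return {"data": data, "metadata": metadata, "signature": sig}

def verify_vc(vc):
    payload = json.dumps({"data": vc["data"], "metadata": vc["metadata"]},
                         sort_keys=True)
    expected = hmac.new(ISSUER_KEY, payload.encode(), hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, vc["signature"])

vc = sign_vc({"like": True, "score": 10}, {"issuanceDate": "2022-09-09"})
print(verify_vc(vc))       # True
vc["data"]["score"] = 100  # any tampering breaks verification
print(verify_vc(vc))       # False
```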



&lt;p&gt;&lt;a href="https://www.w3.org/TR/vc-data-model-2.0/"&gt;Spec&lt;/a&gt;&lt;br&gt;
NOSTR events are verifiable data too, as far as they are signed &lt;/p&gt;

&lt;h3&gt;
  
  
  DApps: how it all works together
&lt;/h3&gt;

&lt;p&gt;It is similar to a WEB3 DApp, but instead of a blockchain it connects to several DWNs that manage and store the data of the application or of a particular user, represented by DIDs. Web5 DApps are designed more as human-interaction gateways that give you a UI. For agent-to-agent or wallet-to-wallet interactions, DWNs and message interfaces are used together with application-level protocols.&lt;/p&gt;

&lt;h3&gt;
  
  
  To summarize
&lt;/h3&gt;

&lt;p&gt;Web5 provides the identity and persistence layers the web has been missing, in a network-agnostic manner.&lt;br&gt;
It could work on top of regular network protocols or locally.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;DIDs give you distribution of public identities that are network independent &lt;/li&gt;
&lt;li&gt;DWNs give a persistence and interaction layer &lt;/li&gt;
&lt;li&gt;DWN protocols allow you to define DApp data-exchange logic &lt;/li&gt;
&lt;li&gt;VCs add ownership and authenticity to data and let you build a data and agent economy on top of it &lt;/li&gt;
&lt;li&gt;DApps use all these tools to interact with a human in the loop: they serve a UI and interact with DWNs and agents
&lt;/li&gt;
&lt;/ul&gt;

</description>
    </item>
    <item>
      <title>Targeted Personal Knowledge Graphs in Professional Networks. From Hunters to DAO and beyond</title>
      <dc:creator>Volodymyr Pavlyshyn</dc:creator>
      <pubDate>Fri, 01 Sep 2023 11:48:31 +0000</pubDate>
      <link>https://dev.to/volland/targeted-personal-knowledge-graphs-in-professional-networks-from-hunters-to-dao-and-beyond-47nl</link>
      <guid>https://dev.to/volland/targeted-personal-knowledge-graphs-in-professional-networks-from-hunters-to-dao-and-beyond-47nl</guid>
      <description>&lt;p&gt;Nostr is about Social Networks. What came first to your mind when we mention social Network? A lot of folks compare NOSTR-based apps with Twitter, but Twitter is a far away from a social movement.&lt;br&gt;
The biggest social network of mine is not Twitter or Mastodon. nop even not facebook. It is LinkedIN&lt;/p&gt;

&lt;h2&gt;
  
  
  It is not about Twitter
&lt;/h2&gt;

&lt;p&gt;Twitter was launched around 2006, but before it there were:&lt;br&gt;
SixDegrees.com (1997): As mentioned, it was the first to allow users to create profiles and friend lists.&lt;/p&gt;

&lt;p&gt;LiveJournal (1999): Allowed users to keep a blog, journal, or diary and also make connections.&lt;/p&gt;

&lt;p&gt;Friendster (2002): One of the first to use the term "social network," it allowed users to connect with friends, post pictures, and share content.&lt;/p&gt;

&lt;p&gt;Hi5 (2003): Popular in Latin America and Southeast Asia, it allowed users to create profiles, connect with friends, and share photos and videos.&lt;/p&gt;

&lt;p&gt;LinkedIn (2003): Focused on professional networking, it allowed users to connect with colleagues and other professionals.&lt;/p&gt;

&lt;p&gt;MySpace (2003): Allowed users to create profiles, have a list of friends, and share music and other media. It was particularly popular among musicians and artists.&lt;/p&gt;

&lt;p&gt;Orkut (2004): Developed by Google, it was popular in Brazil and India.&lt;/p&gt;

&lt;p&gt;Facebook (2004): Initially limited to Harvard students, it quickly expanded and became the dominant social network.&lt;/p&gt;

&lt;p&gt;I was an early adopter of LiveJournal and was always into long-form content.&lt;/p&gt;

&lt;h2&gt;
  
  
  Future of targeted social networks
&lt;/h2&gt;

&lt;p&gt;By targeted, I mean a social network of people with common interests and goals, with some passion and topics that unite them; or maybe it is a place where you can find folks who were extremely hard to find among Facebook-like posts with food, cats, and boobs.&lt;br&gt;
The first successful and long-lived targeted social network for me was LinkedIn.&lt;/p&gt;

&lt;p&gt;From hunters and Middle Ages guilds to LinkedIn and DAOs.&lt;br&gt;
People have gathered around profession and mastery from the earliest beginnings; I think hunters were the first tribal guild. It is not only about a job search, it is also about mastery of skills, pride, status, and showing who you are.&lt;br&gt;
I know a lot of folks have LinkedIn, but the root of the problems there is not the idea of social media; it is more the corporate culture that heavily shaped this social medium. It should be different.&lt;/p&gt;

&lt;h2&gt;
  
  
  LinkedIn got evil?!
&lt;/h2&gt;

&lt;p&gt;It is a classical story. The project started with good intentions and goals, but it is somebody else's business that is eager to make money.&lt;br&gt;
In general, there are two simple strategies:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;lock users' data in the platform&lt;/li&gt;
&lt;li&gt;limit access to this data and make money on top of it. Now millions of profiles are locked in a prison with limits on connecting and searching. Recruiters struggle the most: if LinkedIn kills your profile, it is practically the end of your entire recruiter career.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  I am not looking for a job
&lt;/h2&gt;

&lt;p&gt;Nowadays, we have industries and professions that depend completely on platforms like LinkedIn, because we haven't invented a Google for people, or let's say a Google for professionals; the best we have, LinkedIn, Xing, Monster, and Stepstone, focuses on the broken idea of a job search. Your job is a subset of your activities, skills, and interests. I am a YouTuber and Nostr enthusiast keen on decentralized tech, open technology, and open source. I am a tea lover keen on cognitive science, psychology, and philosophy, and all these skills and interests contribute to my professional profile and form my bubbles of people.&lt;br&gt;
I am looking for&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;people&lt;/li&gt;
&lt;li&gt;interesting challenges&lt;/li&gt;
&lt;li&gt;Confirmation of my skill&lt;/li&gt;
&lt;li&gt;confirmation of my knowledge and achievements from a network&lt;/li&gt;
&lt;li&gt;verifiable skills.
I do not want to go to yet another interview in my life. The big dream: I give you my profile based on verified data points, and we chat about our common values. I have tried to prove that I know JS or design patterns over and over and over again.
It is not about content; it is about relations!
We don't see the forest for the trees...&lt;/li&gt;
&lt;/ul&gt;

&lt;h1&gt;
  
  
  Targeted Social Graph
&lt;/h1&gt;

&lt;p&gt;We over-focused on content, but the real asset of social media, often misunderstood, is the social graph. Your connections matter a lot; your connections and audience can say more about you than your content. We simply do not have tools today that allow us to express our connections in a meaningful way. On LinkedIn, you have the option to connect or follow, but human relations are far more complex, even in professional areas. It is all about portable social graphs and portable audiences. Work relations or skill relations are more about what you do for others.&lt;br&gt;
Interoperability is a cornerstone of any protocol, so we need to find a way to describe our relations in an interoperable way. The semantic web made a few attempts to do so.&lt;/p&gt;

&lt;h2&gt;
  
  
  "Friend of Friends" (FoF)
&lt;/h2&gt;

&lt;p&gt;The first attempt to make a portable social graph was made by the semantic web folks. In the semantic web, we have the concept of an ontology.&lt;br&gt;
An ontology is essentially a formal specification of a conceptualization. In simpler terms, it's a way to define the types, properties, and interrelationships of the entities that exist for a particular domain. In the context of the semantic web, ontologies are often expressed in languages like RDF (Resource Description Framework), OWL (Web Ontology Language), or Turtle.&lt;br&gt;
The concept of a "Friend of Friends" (FoF) ontology is often used in the context of semantic web technologies, social networking, and data modeling. The idea is to create a structured representation of social networks, where relationships like "friendship" can be formally defined and queried. This is particularly useful for applications that require a deep understanding of social connections, such as recommendation systems, targeted advertising, or social analytics.&lt;br&gt;
The problem here: we have limited relation descriptions, so you simply mimic an "I know this person" link.&lt;/p&gt;
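&lt;p&gt;The FoF idea can be illustrated with a toy graph in Python: the kind of "friend of a friend" traversal that ontologies like FOAF let you express declaratively over RDF data:&lt;/p&gt;

```python
# Toy sketch of the FoF idea: a social graph as an adjacency map, plus the
# "friend of a friend" traversal that ontologies like FOAF let you express
# declaratively over RDF data.

graph = {
    "alice": {"bob", "carol"},
    "bob": {"alice", "dave"},
    "carol": {"alice"},
    "dave": {"bob"},
}

def friends_of_friends(person):
    direct = graph.get(person, set())
    fof = set()
    for friend in direct:
        fof |= graph.get(friend, set())
    # exclude the person and their direct friends
    return fof - direct - {person}

print(friends_of_friends("alice"))  # {'dave'}
```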

&lt;h2&gt;
  
  
  "Description of a Career" (DOAC) and Resume RDF
&lt;/h2&gt;

&lt;p&gt;"Description of a Career" (DOAC) ontology, the focus would be on capturing various aspects of an individual's professional life. This could include:&lt;/p&gt;

&lt;p&gt;Basic Information: Name, contact details, and other personal identifiers.&lt;br&gt;
Skills: A list of skills the individual possesses, possibly categorized by domain (e.g., programming, management).&lt;br&gt;
Qualifications: Academic background, certifications, and other formal qualifications.&lt;br&gt;
Work Experience: Past roles, responsibilities, and achievements.&lt;br&gt;
Projects: Specific projects worked on, along with the role played and technologies used.&lt;br&gt;
Endorsements: Recommendations or endorsements from colleagues, supervisors, or other professional contacts.&lt;br&gt;
Goals: Career objectives and future plans.&lt;br&gt;
We are still not there.&lt;br&gt;
ResumeRDF does a similar job but still does a poor job in the space of relations. Some ontologies define your work relations in terms of organizational structure. We need relations in the context of skills and of what a person does: "we made a web3 app together", "we ran a cool event", something that gives an understanding of what you do for other people, with other people.&lt;/p&gt;

&lt;h2&gt;
  
  
  NOSTR Context
&lt;/h2&gt;

&lt;p&gt;Nostr has one big superpower: all events are signed. This means they are end-verifiable, immutable, and tamper-proof. We are almost there: we already have a backbone of social features that could be enriched by more powerful relations.&lt;/p&gt;

&lt;h3&gt;
  
  
  NOSTR's missing social graph features
&lt;/h3&gt;

&lt;p&gt;Now you see a problem. A list of followers is not enough. We need to find together a better way to describe an interaction experience with other people in the context of skills, knowledge, interests, and what they do for the community and us.&lt;br&gt;
Don't be a stranger! You could tell much more about your relation to a person.&lt;/p&gt;

&lt;h2&gt;
  
  
  NOSTR's missing events
&lt;/h2&gt;

&lt;p&gt;As the NOSTR protocol goes with the idea of kinds of events, it has taken the hard path of extending the protocol with particular human activities.&lt;br&gt;
One area is people's lifetime events:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;job or project&lt;/li&gt;
&lt;li&gt;achievement &lt;/li&gt;
&lt;li&gt;Completion of task for DAO&lt;/li&gt;
&lt;li&gt;skill and skill recommendation &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;One more potential feature - maybe I want to follow selective events from a person only related to some types of activities&lt;/p&gt;

&lt;h2&gt;
  
  
  The DAO-friendly future
&lt;/h2&gt;

&lt;p&gt;DAOs will change the way we work in the near future. For some of my friends, it is already a reality. It is hard to build a reputation in a highly anonymous environment. I see big undiscovered potential in targeted social graphs powered by NOSTR for talent and partner search. But that will be the topic of the next article.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Holistic Identity, Digital Twins, and Autonomous Agents: A New Era of Self-Sovereign Identity</title>
      <dc:creator>Volodymyr Pavlyshyn</dc:creator>
      <pubDate>Sun, 30 Jul 2023 09:53:42 +0000</pubDate>
      <link>https://dev.to/volland/holistic-identity-digital-twins-and-autonomous-agents-a-new-era-of-self-sovereign-identity-ee6</link>
      <guid>https://dev.to/volland/holistic-identity-digital-twins-and-autonomous-agents-a-new-era-of-self-sovereign-identity-ee6</guid>
      <description>&lt;p&gt;&lt;iframe width="710" height="399" src="https://www.youtube.com/embed/oqDxoAbethk"&gt;
&lt;/iframe&gt;
&lt;/p&gt;

&lt;p&gt;In the digital age, our identities are fragmented across various platforms, each holding a piece of our data. This fragmentation poses a significant challenge, as it prevents us from having a complete, unified view of our own digital identities. However, the concept of a holistic identity, digital twins, and autonomous agents can offer a solution to this problem, providing a more comprehensive and self-sovereign approach to digital identity.&lt;/p&gt;

&lt;h2&gt;
  
  
  Holistic Identity: A Unified View of Self
&lt;/h2&gt;

&lt;p&gt;Holistic identity is not just a technical term; it's a philosophical concept that aims to solve the problem of data fragmentation. Unlike traditional identity systems that are authoritative and siloed, a holistic identity provides a unified view of an individual's data across various platforms. &lt;/p&gt;

&lt;p&gt;In essence, a holistic identity is a snapshot of all data points about you, including your behavior, activities, and posts. It's not just about an identifier or login password; it's about aggregating all the data points that identify you, providing a more comprehensive view of your digital self.&lt;/p&gt;

&lt;h2&gt;
  
  
  Digital Twins: Your Digital Copy
&lt;/h2&gt;

&lt;p&gt;A digital twin is a digital copy of you and all your data that you control. It's a continuation of your holistic identity, aggregating all the data points about you and all the data produced by you. &lt;/p&gt;

&lt;p&gt;The concept of a digital twin goes beyond just storing information; it's about getting benefits out of it. With a digital twin, you can interact with your data, gain insights, and even sell your data. It opens up a world of possibilities, from personalization to automation.&lt;/p&gt;

&lt;h2&gt;
  
  
  Autonomous Agents: Your Digital Assistants
&lt;/h2&gt;

&lt;p&gt;Autonomous agents are the next step from digital twins. They are essentially digital assistants that can perform tasks on your behalf. These agents can have access to a portion of your data and can perform various operations, from booking tables and buying tickets to trading operations and data trading on data exchanges.&lt;/p&gt;

&lt;p&gt;Autonomous agents can analyze data from your digital twin and perform actions based on it. They can optimize routine tasks, cooperate with each other, create trust networks, and even make micro-payments. &lt;/p&gt;

&lt;h2&gt;
  
  
  The Future of Self-Sovereign Identity
&lt;/h2&gt;

&lt;p&gt;The concepts of holistic identity, digital twins, and autonomous agents are interlinked and form the cornerstone of a self-sovereign identity. They provide a way to have a sovereign persona and proof that the data belongs to you.&lt;/p&gt;

&lt;p&gt;These concepts are not just theoretical; they have practical applications that can revolutionize various domains, from healthcare and finance to personalization and automation. They represent the future of digital identity, a future where we have more control over our data and where our digital identities are unified, comprehensive, and self-sovereign.&lt;/p&gt;

&lt;p&gt;In conclusion, the era of self-sovereign identity is upon us: an era where we control our data, gain insights from it, and use it to our advantage; where our digital identities are no longer fragmented across platforms but unified and comprehensive; and where digital twins and autonomous agents perform tasks on our behalf.&lt;/p&gt;

</description>
    </item>
  </channel>
</rss>
