Introduction
Reading the Bible, everyone struggles to remember relationships between persons and where to find references to them in the Bible.
That keeps us from being immersed in the story as it's common to find oneself guessing who is who to whom.
Recently, we've discovered this gem of a dataset and wanted to explore it.
Memgraph turned out to be a great fit for analysing the data.
In the first part of this article, we'll answer some commonly asked questions using a graph database.
In the second part, you can find the step by step process of how to explore your own datasets; from idea to results.
Graph Data model
The data collected by Brady Stephenson contains information about persons in the Bible and the relationships between them.
Warning: the dataset is incomplete!
At the time of writing, the data is collected up to the second book of Chronicles, verse 20.
If you want to know more about the format of input data and how we imported it into Memgraph, hop over to the second part of this article.
Person:
- id ["Adam_1", "Jotham_2", "Jotham_3", ...]
- name ["Adam", "Eve", "Jacob", ...]
- notes
- sex ["male", "female"]
- tribe ["Judah", "Levi", "Manasseh", "Ammonite", ...]
RELATIONSHIP:
- type ["father", "half-brother", "killer", "killed by", ...]
- category ["logical", "explicit", "implicit"]
- reference ["GEN 4:19", "2KI 15:13", ...]
Frequently asked questions
Every question in this section will be followed by:
- a short description
- an image from MemgraphLab (Memgraph's default data browser)
- a Cypher query (SQL-like declarative query language for graph databases)
What is the longest patrilineal descent from Adam?
In the Bible, family and tribal membership appears to be transmitted through the father, which is why this seems like a sane question to ask.
The image shows the longest agnatic kinship path from Adam (created by G-d) all the way to Jotham, the son of Azariah.
Here the incompleteness of our dataset becomes obvious since the family tree of Uzziah, son of Amaziah, should be longer than of his brother Azariah (all the way to Yeshua).
The Amaziah -> Uzziah relationship is first noted in the second book of Chronicles 26:1, which is why it is not included.
This is one of the charms of dealing with live data.
MATCH path = ({name: "Adam"})
-[* {type: "father"} ]->
()
RETURN path
ORDER BY size(path) DESC
LIMIT 1
What is the family tree of Adam excluding Israel (Jacob)?
Drawing the whole family tree from the Bible would be a hefty job.
Since only the tribe of Israel is covered so far, we can make a cut-off there and see what the family tree looked like in the book of Genesis.
The image shows Adam's family tree up to Jacob, the son of Isaac, grandson of Abram.
Do note that apart from the families of Jacob; Moabites, Ammonites (sons of Lot), and other tribes were also present in later parts of the old testament.
Female nodes in the graph are colored orange, and male nodes are colored red.
MATCH path = ({id: "Adam_1"})-[* bfs (
e, v | v.id != "Jacob_1" AND e.type = "father"
)]->()
RETURN path
Which tribe of Israel had the largest family?
Jacob had twelve sons, all of whom became the heads of their own family groups, later known as the Twelve Tribes of Israel.
The tribe of Joseph split into two half-tribes: the tribe of Manasseh and the tribe of Ephraim.
In the following table, we can see the size of each tribe up to the second book of Chronicles.
size_by_blood
is calculated by counting all the nodes we can reach from the sons of Jacob using only the father relationships.
size_by_marriage
is calculated by counting all the instances in the Bible where a person is explicitly noted to be a part of a certain tribe.
tribe | size_by_blood | size_by_marriage |
---|---|---|
Judah | 163 | 294 |
Levi | 110 | 323 |
Benjamin | 101 | 189 |
Joseph | 49 | 95 |
Asher | 41 | 47 |
Issachar | 17 | 26 |
Simeon | 15 | 38 |
Reuben | 9 | 34 |
Gad | 8 | 44 |
Naphtali | 5 | 13 |
Zebulun | 4 | 13 |
Dan | 2 | 14 |
MATCH ({id: "Jacob_1"})<-[{type: "son"}]-(son)
MATCH path = (son)-[* bfs (e, v | e.type = "father")]->(a)
WITH son.name as tribe, count(a) + 1 AS size_by_blood
OPTIONAL MATCH (person)
WHERE person.tribe IN
CASE tribe
WHEN "Joseph" THEN ["Manasseh", "Ephraim"]
ELSE [tribe]
END
RETURN tribe, size_by_blood, count(person) AS size_by_marriage
ORDER BY size_by_blood DESC, size_by_marriage DESC
The image shows the spear side of the Twelve Tribes of Israel up to 2CH 20:1.
Can you guess which family tree belongs to whom?
Help yourself with the table provided above!
MATCH ({id: "Jacob_1"})<-[{type: "son"}]-(son)
MATCH path = (son)-[* bfs (e, v | e.type = "father")]->(a)
RETURN path
From Idea to Results
Inserting data into a graph database
Each line in the raw dataset by Brady Stephenson contains information about the relationship between two persons and the reference to the Bible verse. Here we show a cleaned dataset excerpt:
person_id_1,relationship,person_id_2,reference
G-d_1,Creator,Adam_1,GEN 2:7
Adam_1,husband,Eve_1,GEN 3:6
Eve_1,wife,Adam_1,GEN 2:25
Adam_1,father,Cain_1,GEN 4:1
Cain_1,son,Adam_1,GEN 4:1
We want to turn that data into a graph, where nodes of the graph represent persons from the Bible and edges represent relationships between them. In the image below, we can see the graph of the excerpt given above:
To do that, we need to translate each line in the file into a Cypher query. In pseudo-code, that would be:
create PERSON "G-d_1" if it doesn't exist
create PERSON "Adam_1" if it doesn't exist
create RELATIONSHIP "Creator" between them
In Cypher that looks like this:
MERGE (a: Person {id: "G-d_1"})
MERGE (b: Person {id: "Adam_1"})
CREATE (a)-[r: RELATIONSHIP {type: "Creator"}]->(b)
Real data
In reality, the data is a little more complex. Which is a good thing, since we can gather more information out of it. Here is an excerpt of the original data:
person_relationship_id,person_relationship_sequence,person_id_1,relationship_type,person_id_2,relationship_category,reference_id,relationship_notes
G-d_1:Adam_1:1,1,G-d_1,Creator,Adam_1,explicit,GEN 2:7,
Adam_1:Eve_1:2,2,Adam_1,husband,Eve_1,explicit,GEN 3:6,
Eve_1:Adam_1:3,3,Eve_1,wife,Adam_1,explicit,GEN 2:25,
Adam_1:Cain_1:4,4,Adam_1,father,Cain_1,explicit,GEN 4:1,
Cain_1:Adam_1:5,5,Cain_1,son,Adam_1,explicit,GEN 4:1,
The preferred way of loading such data into Memgraph is using a LOAD CSV
expression.
LOAD CSV FROM '/relationship.csv' WITH HEADER AS row
MERGE (a: Person {id: row.person_id_1})
MERGE (b: Person {id: row.person_id_2})
CREATE (a)-[:RELATIONSHIP {
type: row.relationship_type,
category: row.relationship_category,
notes: row.relationship_notes,
reference: row.reference_id
}]->(b)
Additionally, we can add name, sex, tribe, and other metadata to each person using the person's dataset.
LOAD CSV FROM "/person.csv" WITH HEADER AS row
MERGE (node: Person {id: row.person_id})
SET node += {
name: row.person_name,
sex: row.sex,
tribe: row.tribe,
notes: row.person_notes
}
This brings us to the data model described at the beginning of the article.
A step beyond would be modeling the data into nodes of type Person, Reference, and Relationship with appropriate edges between them.
That way we could add cross-reference relationships and possibly have a better chance of extracting implicit relationships programmatically.
Testing data integrity
Without domain-specific knowledge, we can notice that every outgoing edge has to be paired with a matching incoming edge.
We can check if every father and mother relationship has a son or daughter matching relationship.
We can do the same for wives and husbands, brothers and sisters, etc.
The following query returns all Persons that have an unequal amount of incoming versus outgoing edges:
MATCH (person)
OPTIONAL MATCH ()-[incoming]->(person)
WITH person, count(incoming) as ins
OPTIONAL MATCH (person)-[outgoing]->()
WITH person, ins, count(outgoing) as outs
WHERE outs != ins
RETURN person.name
Result:
G-d_1, Naamah_16, Naamah_1, Eve_3, Eve_1, Adam_1
Upon closer inspection, we can see that Adam_1 is missing a created by edge towards G-d_1, and that Naamah_16 and Naamah_1 are the same person (same goes for Eve's).
We've reported the errors to the author, but in the meantime, we can make these changes ourselves.
We can do that with Cypher or change the data directly in the CSV file.
MATCH (g {id: "G-d_1"})-[relationship]->(adam {id: "Adam_1"})
CREATE (adam)-[new_relationship: RELATIONSHIP]->(g)
SET new_relationship=relationship
SET new_relationship.type="created by";
On the other side, we can analyze the data using domain-specific knowledge.
If you have read the Bible some things might come out to you as odd straight away.
For example, when we were executing the queries in the first part of the article, we noticed that Benjamin was the son of Joseph and Rachel, but that can't be right since Joseph was the brother of Benjamin.
Also, we have noticed that Nahshon_1 and Nahshon_2 must be the same person as there was no clear bloodline from Adam to David!
In general, for these kinds of observations, you will usually need someone who is an expert in the field.
These methods won't work well if the number of entries is too big.
In those cases, you'll have to accept small discrepancies between the data and the truth.
Conclusion
This has been a concise how-to guide on exploring real-world data with all the charms that go with it.
We hope you enjoyed reading it as much as we enjoyed writing it.
The dataset will continue to be expanded.
Download the latest version, and you might get better results.
Head over to the download page to start exploring your own data or play with a plethora of datasets we provide.
Good luck and stay sharp!
Top comments (0)