Filling out profiles is important, but unsexy

Ivy Green ・9 min read

Our algo-engineer Sven has written an article for entwickler.de about Neo4j as an alternative to a recommendation engine. We are so excited about it that we had to translate it into English to share it with everyone. Enjoy!

Modern IT recruiting is all about speed. Smooth operation and fast feedback are expected, yet finding the best match between job-seeking developers and interesting dev jobs is like searching for needles in a haystack. The really interesting matches are often the ones where developers want to move into new technologies or industries, and for those it is essential to have a well-filled-out profile at hand, one that also includes skills that have only been learned on the side.

In the registration process, the technical skills must be queried in as much detail as possible since everything from everyday tools to niche technology is relevant to the matcher. But who wants to scroll through a list of all known (at least known to matched.io) tech skills to find the few skills that didn't come to mind immediately?
Live search with autocomplete already provides some relief here, but something smarter was needed for a really good user experience.
If we could make good predictions that identify which skills out of the heap of technical skills the developer might also know and then offer those as relevant suggestions, that would be an improvement on both ends.
But how do you continuously produce relevant personalized skill guesses?

In this article we describe an approach that uses already filled skill sets, a graph theory model and Neo4j to generate such suggestions as accurately as possible but also as quickly as necessary.

Devs who know these skills also know…

Amazon's "Customers who bought this item also bought…" feature provides a good template here. We already have many completed profiles where we can calculate common skill frequencies and then derive suggestions from them.

However, a static list of "popular" skills has no personal reference and would deliver only the most generic suggestions. Ideally, it should be possible to recognize the type of developer from the skills already entered, so that a skill set of ['CSS', 'HTML'], for example, leads to a suggestion of JavaScript rather than Kafka.

To build a heuristic for "Devs who know these skills also know…", it makes sense to take the frequency of skill combinations as the basis for suggestions.

Modeling with Graph Sets

So we have a big dataset of filled developer profiles with a set of mastered skills per profile. Using this data and given a new set of partially filled skills we want to make a reasonable suggestion as to which skill is often mastered in combination with these skills. One way to model this question is a graph approach. You build a graph from your dataset in which all developers and all skills are represented as nodes. Then you connect each dev-node via relationships to exactly those skill-nodes that are in the skillset of the developer.
For individual skills, an intuitively sensible suggestion can already be read here. For example, a developer who knows HTML would first be suggested Angular, since 2 profiles contain both Angular and HTML, and Java and PHP would be considered as further suggestions. In essence we calculate the number of paths [Skill1]-[Dev]-[Skill2] that occur in the dataset and store this value as the frequency of a skill pair, to use it for individual skill suggestions.
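The path counting described above can also be sketched outside the database. The following Go snippet is only an illustration, not the implementation used at matched.io: the names `Pair`, `NewPair`, `CountPairs` and `exampleProfiles` are hypothetical, the profiles are the four example developers of this article, and it assumes that each profile lists a skill at most once.

```go
package main

import "fmt"

// Pair is an unordered skill pair. NewPair stores the two names in sorted
// order so that (html, angular) and (angular, html) map to the same key.
type Pair struct{ A, B string }

func NewPair(x, y string) Pair {
	if x > y {
		x, y = y, x
	}
	return Pair{x, y}
}

// CountPairs returns, for every skill pair, the number of profiles that
// contain both skills. This equals the number of [Skill1]-[Dev]-[Skill2]
// paths in the graph. Assumes a profile lists each skill at most once.
func CountPairs(profiles map[string][]string) map[Pair]int {
	freq := make(map[Pair]int)
	for _, skills := range profiles {
		for i := 0; i < len(skills); i++ {
			for j := i + 1; j < len(skills); j++ {
				freq[NewPair(skills[i], skills[j])]++
			}
		}
	}
	return freq
}

// exampleProfiles mirrors the four-developer example graph of this article.
func exampleProfiles() map[string][]string {
	return map[string][]string{
		"dev1": {"c", "c++", "java"},
		"dev2": {"java", "angular"},
		"dev3": {"java", "angular", "html"},
		"dev4": {"angular", "html", "php"},
	}
}

func main() {
	freq := CountPairs(exampleProfiles())
	// Print how often html occurs together with a few other skills.
	for _, other := range []string{"angular", "java", "php", "c"} {
		fmt.Printf("html + %s: %d\n", other, freq[NewPair("html", other)])
	}
}
```

For HTML this yields a frequency of 2 with Angular and 1 each with Java and PHP, which matches the ranking read off the graph.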
If you receive a skillset with several skills for which suggestions are to be created, you must ask yourself how important it is to you that every skill of the skillset appears in a developer profile before you derive suggestions from it. For example, for a developer who has already specified C++ and Java, you could suggest C before Angular, reasoning that the profile of developer 1 really contains both skills C++ and Java.

Instead, we decided to look only at skill pairs to be able to present meaningful suggestions even to fullstack developers and other profiles that don't easily fit into prescribed boxes. So we will calculate the value of a suggested skill by summing the frequency of all skill pairs (suggested skill, skillset skill) for all skills in the skill set.
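As a rough sketch of this scoring rule, the Go snippet below sums the pair frequencies for every candidate skill outside the skillset. It is an illustration under assumptions, not the production code: `Suggestion`, `Suggest` and `exampleWeights` are hypothetical names, the weights were derived by hand from the four example profiles of this article, and breaking ties alphabetically is an arbitrary choice made here only to get a stable order.

```go
package main

import (
	"fmt"
	"sort"
)

// Suggestion pairs a candidate skill with its summed pair frequency ("fit").
type Suggestion struct {
	Skill string
	Fit   int
}

// exampleWeights holds the pair frequencies of the example graph as a
// symmetric adjacency map (hand-derived from the four example profiles).
func exampleWeights() map[string]map[string]int {
	w := make(map[string]map[string]int)
	add := func(a, b string, n int) {
		for _, p := range [][2]string{{a, b}, {b, a}} {
			if w[p[0]] == nil {
				w[p[0]] = make(map[string]int)
			}
			w[p[0]][p[1]] = n
		}
	}
	add("c", "c++", 1)
	add("c", "java", 1)
	add("c++", "java", 1)
	add("java", "angular", 2)
	add("java", "html", 1)
	add("angular", "html", 2)
	add("angular", "php", 1)
	add("html", "php", 1)
	return w
}

// Suggest sums, for each skill outside the skillset, the weights of all
// pairs (candidate, skillset skill). Results are sorted by fit, descending;
// ties are broken alphabetically (an assumption of this sketch).
func Suggest(weights map[string]map[string]int, skillset []string) []Suggestion {
	known := make(map[string]bool, len(skillset))
	for _, s := range skillset {
		known[s] = true
	}
	fit := make(map[string]int)
	for s := range known {
		for candidate, w := range weights[s] {
			if !known[candidate] {
				fit[candidate] += w
			}
		}
	}
	out := make([]Suggestion, 0, len(fit))
	for skill, f := range fit {
		out = append(out, Suggestion{skill, f})
	}
	sort.Slice(out, func(i, j int) bool {
		if out[i].Fit != out[j].Fit {
			return out[i].Fit > out[j].Fit
		}
		return out[i].Skill < out[j].Skill
	})
	return out
}

func main() {
	for _, s := range Suggest(exampleWeights(), []string{"c++", "java"}) {
		fmt.Printf("%s %d\n", s.Skill, s.Fit)
	}
}
```

For the skillset C++ and Java, Angular and C both reach a fit of 2 under this rule (Angular through its two shared profiles with Java, C through one shared profile with each skill), followed by HTML with 1. The pair-based score thus no longer prefers C just because one profile contains the whole skillset.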
With this model in mind, we now need a good tool to implement our graph and quickly calculate suggestions.

Create graph in Neo4j

The screenshots and title buzzword drop might have already given away that we will build our graph in Neo4j. Not only is this clearer during development, since Neo4j as a graph database provides visual representation tools to help us understand our data and find errors, but we will also see that the already introduced frequency calculations can be greatly simplified in Neo4j.

First, though, let's create our profile data in Neo4j as a graph. To do this, we create the developer nodes and skill nodes via cypher calls in Neo4j:

//create developer nodes
CREATE (d1:Dev{name: 'dev1'})
CREATE (d2:Dev{name: 'dev2'})
CREATE (d3:Dev{name: 'dev3'})
CREATE (d4:Dev{name: 'dev4'})
//create skill nodes
CREATE (s1:Skill{name:'c'})
CREATE (s2:Skill{name:'c++'})
CREATE (s3:Skill{name:'java'})
CREATE (s4:Skill{name:'angular'})
CREATE (s5:Skill{name:'html'})
CREATE (s6:Skill{name:'php'})

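After that, we add the relationships that represent which developer has which skills. These CREATE clauses must run in the same statement as the node creation above, because the variables d1 to d4 and s1 to s6 are only bound within a single query: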

//create knowledge relationships
CREATE (d1)-[:KNOWS]->(s1)
CREATE (d1)-[:KNOWS]->(s2)
CREATE (d1)-[:KNOWS]->(s3)
CREATE (d2)-[:KNOWS]->(s3)
CREATE (d2)-[:KNOWS]->(s4)
CREATE (d3)-[:KNOWS]->(s3)
CREATE (d3)-[:KNOWS]->(s4)
CREATE (d3)-[:KNOWS]->(s5)
CREATE (d4)-[:KNOWS]->(s4)
CREATE (d4)-[:KNOWS]->(s5)
CREATE (d4)-[:KNOWS]->(s6)

With a simple query we already reach the example image we have seen before, only this time not arranged so nicely (the author might have cheated here and arranged the earlier images by hand…):

//Show graph
MATCH (n) return n;


Preprocess for the overview

Now we are actually ready to formulate a cypher query that outputs the suggested skills for a skillset sorted by the frequency of the skill pairs. However, we would always have to go through the dev nodes and count the number of paths from skill 1 via a dev to skill 2 in the query. It is more convenient to calculate the frequency of a skill pair in the whole dataset in a preprocessing step and then work with these calculated frequencies in the following queries. Preprocessing is worthwhile, because it makes the following queries faster, especially if you want to calculate several suggestions from the same dataset.

So our preprocess step needs to calculate for each pair of skills in the graph how many dev-nodes are connected to both skill-nodes, i.e. the number of paths from skill 1 via a dev to skill 2. In Neo4j syntax you write such a path as (s1:Skill)-[:KNOWS]-(:Dev)-[:KNOWS]-(s2:Skill) and a cypher query that counts these paths and stores the result in a relationship between Skill 1 and Skill 2 as weight looks like this:

//create skill-pair edges from dev paths
MATCH path=((s1:Skill)-[:KNOWS]-(:Dev)-[:KNOWS]-(s2:Skill))
WITH s1, s2, Count(path) as amount
MERGE (s1)-[p:SKILL_PAIR]-(s2)
SET p.weight = amount

From now on, we don't need to pay attention to the Dev-nodes and KNOWS-relationships anymore and can calculate suggestions with Skill-nodes and SKILL_PAIR-relationships only. Our graph has become clearer, faster and more intuitive:

//Show skillgraph
MATCH (n:Skill) return n;

Note that the directions of the SKILL_PAIR-relationships have no meaning. With the KNOWS-relationships the arrows still made sense (even if we didn't need the direction), but here with SKILL_PAIR the direction is arbitrarily set by the order of the skills in our query.
Neo4j unfortunately doesn't allow true undirected relationships, so as a developer you have to keep in mind which directions you are interested in and which have no value. We will ignore all relationship directions in the match queries in this article, using (:Dev)-[:KNOWS]-(s2:Skill) as above instead of (:Dev)-[:KNOWS]->(s2:Skill). Without an arrow indicating the direction, Neo4j matches the relationship regardless of its direction.

Closest Skill Query

Before we generate suggestions for skill sets in the skillgraph, let's look at a simpler query that returns suggestions for a single skill. For this we use MATCH to search all SKILL_PAIR-relationships that connect our start skill s1 with a suggestion skill s2 and then ORDER the suggested skills according to the weight of the relationship:

//search closest single
MATCH (s1:Skill)-[r:SKILL_PAIR]-(s2:Skill)
WHERE s1.name='html'
RETURN s2.name, r.weight
ORDER BY r.weight DESC

This looks promising and very clear so far.
For the next step we need the concept of a skillset. In this example we will input a list of skill-names and then search the corresponding skill-nodes to use in later MATCH calls. To collect the skill-nodes in a usable skillset we use MATCH again and save the result for further use in the query via "as skill_nodes":

WITH ['c', 'java'] as skills
MATCH (st:Skill) WHERE st.name IN skills
WITH collect(st) as skill_nodes

We will use a lot of "as xyz" to hand down values to the next query or just give them more readable names. With the relevant skill_nodes in hand we can now continue the query and use MATCH to find exactly the SKILL_PAIR-relationships that connect a skill in our skillset with a skill from outside it (we don't want to suggest skills that the developer has already entered):

MATCH (s1:Skill)-[r:SKILL_PAIR]-(s2:Skill)
WHERE s1 IN skill_nodes AND NOT s2 IN skill_nodes

Overall, the cypher query providing our suggestions looks like this:

//search closest
WITH ['c', 'java'] as skills
MATCH (st:Skill) WHERE st.name IN skills
WITH collect(st) as skill_nodes
MATCH (s1:Skill)-[r:SKILL_PAIR]-(s2:Skill)
WHERE s1 IN skill_nodes AND NOT s2 IN skill_nodes
RETURN s2.name as suggested_skill, SUM(r.weight) as fit
ORDER BY fit DESC


Request results from a microservice

Obviously a few nice queries and a manually requested solution does not a production release make. Both the creation of the frequency graph during preprocessing and the suggestion query need to be done frequently, automatically and, in the case of the suggestion query, even quasi-live. Fortunately, Neo4j provides a cypher API, which is accessible from quite a few languages via provided packages. For our problem we expect many tiny queries (after each selected skill for the skillset a new suggestion set should be computed) and therefore decided to use a Go microservice.

Since we expect a lot of very time-critical small requests for suggestions, we separated the computation of the frequency graph as preprocessing from the suggestion query. This frequency graph forms a best estimate of how often skill pairs are known together based on our profile data. It is recalculated regularly to capture changes and new profiles, but since skill pair frequencies change rather slowly and take almost an hour to calculate given all the collected data, we only update once a week.

We might want to keep in mind that we have a bias in our suggestions due to the restriction to already known profiles and therefore only suggest what we already know as a combination. If everyone only used the skill suggestions in the profile creation and thus forgot their rare skill combinations, a too successful suggestion feature could negatively influence the profiles. Ideas for solutions are touched upon in the conclusion.

But for now, we have a stored frequency graph that contains our best estimate of reality. It resides in the Neo4j server in the cloud and is ready to be used for queries. In Go you address Neo4j via the neo4j-go-driver and work quite directly with Cypher queries as formatted strings. If we want to call our suggestion query with a skillset as argument, it looks something like this:

(these code examples are just snippets to show the workflow).

// define the query with the skillset as a parameter
query := `WITH $array AS skills
MATCH (st:Skill) WHERE st.id IN skills
WITH collect(st) AS skill_nodes
MATCH (s1:Skill)-[r:SKILL_PAIR]-(s2:Skill)
WHERE s1 IN skill_nodes AND NOT s2 IN skill_nodes
RETURN s2.name AS name, s2.id AS id, SUM(r.weight) AS fit
ORDER BY fit DESC
LIMIT 50`
// map the skillset to the $array parameter of the query
mappings := map[string]interface{}{"array": request.Skills}
// run the query in a read transaction on an open neo4j session;
// the records are collected inside the transaction function, because
// the result can no longer be consumed once the transaction has ended
skills, err = session.ReadTransaction(func(tx neo4j.Transaction) (interface{}, error) {
	return neo4j.Collect(tx.Run(query, mappings))
})

A relatively direct implementation of the API interface, but just right for our purposes, since we had already worked out the Cypher queries. Built into a Go microservice, this approach allows us to process suggestion requests quickly, with sufficiently up-to-date data and, above all, in parallel.
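To give an idea of how such a query could be exposed by a small service, here is a minimal sketch of an HTTP handler using only Go's standard library. Everything in it is an assumption made for illustration: the `Suggester` interface stands in for the Neo4j query, and the JSON shape, the `/suggestions` path and the names `SuggestHandler`, `staticSuggester` and `callOnce` are hypothetical and not taken from the actual matched.io service.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"net/http/httptest"
	"strings"
)

// Suggestion is one row of a suggestion result (hypothetical shape).
type Suggestion struct {
	Name string `json:"name"`
	Fit  int    `json:"fit"`
}

// Suggester is a hypothetical abstraction over the Neo4j suggestion query.
type Suggester interface {
	Suggest(skills []string) ([]Suggestion, error)
}

// suggestRequest is the assumed JSON request body: {"skills": [...]}.
type suggestRequest struct {
	Skills []string `json:"skills"`
}

// SuggestHandler decodes a skillset, asks the Suggester and writes JSON.
func SuggestHandler(s Suggester) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		var req suggestRequest
		if err := json.NewDecoder(r.Body).Decode(&req); err != nil {
			http.Error(w, "invalid request body", http.StatusBadRequest)
			return
		}
		suggestions, err := s.Suggest(req.Skills)
		if err != nil {
			http.Error(w, "suggestion query failed", http.StatusInternalServerError)
			return
		}
		w.Header().Set("Content-Type", "application/json")
		json.NewEncoder(w).Encode(suggestions)
	})
}

// staticSuggester is a stand-in that always returns the same suggestions,
// so the sketch runs without a database.
type staticSuggester []Suggestion

func (s staticSuggester) Suggest([]string) ([]Suggestion, error) { return s, nil }

// callOnce sends one in-process POST request to the handler and returns
// the status code and response body.
func callOnce(h http.Handler, body string) (int, string) {
	rec := httptest.NewRecorder()
	req := httptest.NewRequest(http.MethodPost, "/suggestions", strings.NewReader(body))
	h.ServeHTTP(rec, req)
	return rec.Code, rec.Body.String()
}

func main() {
	h := SuggestHandler(staticSuggester{{Name: "c++", Fit: 2}})
	code, body := callOnce(h, `{"skills":["c","java"]}`)
	fmt.Println(code, body)
}
```

In a real service the handler would be registered with `http.Handle` and served with `http.ListenAndServe`. Go's `net/http` server handles each incoming connection in its own goroutine, so many small suggestion requests can be served concurrently without extra code in the handler.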

Conclusion

Graph visualizations are a useful tool to make problems visually more understandable and with the help of Neo4j queries you can quickly get results without having to implement graph algorithms. This allows you to pragmatically access information from your own data without having to roll out black box technologies such as Deep Learning, which is often overkill for a simple suggestion engine use case, or to adapt your requirements to the structure of existing recommendation systems as known from the eCommerce sector.

For further gimmicks like the automatic detection of the "types" of developers mentioned in the beginning, Neo4j offers implementations of community detection algorithms like the Louvain method, which could be used to identify the closely related communities in our skill graph and then for example offer rare skill combinations from different communities. This would be a strategy to break out of the bias of our own data, but to steal an infuriating quote from many a textbook: "This is left as an exercise for the reader".
