Index my bio

#softwareengineering #elasticsearch #python

I want to make life easier for job changers and for the people who research them. I want an easily searchable view of projects. I want it on Elasticsearch because I want to prove that you can easily do something small with it, so it’s not only for very BIG data projects.

The following code is available on GitHub.

Run on your Linux machine

It’s a piece of cake. You only need Docker and its younger brother, docker-compose. Then this little file lets me run the whole setup.

version: "2"
services:
  kibana:
    image: kibana
    ports:
    - 5601:5601
  elasticsearch:
    image: elasticsearch

Write in YAML, index to ES

Thinking about the domain now: what do I need to describe my bio? I will simplify it to a project object. This could be an example bio of someone:

Projects:
  Commercial:
    Truck browser:
      started at: 2013-01
      finished at: 2013-06
      tasks:
      - writing code
      - testing my code
      learned:
      - JDK8
      - Junit Test version 5 is cool
      challenges:
      - dealing with frontside, CSS :(
      - hard to deploy on prod server
      technologies:
      - Spring 3.2
      - JDK8
      - handmade JS and CSS
      measure of success:
      - 90% of test coverage
      - 2 hundred of happy users in Polish workshops
  Private:
    Mini REST service for my CD Collection:
      started at: 2013-02
# still working on it
      tasks:
      - care about whole app
      learned:
      - NodeJS + Express = fast web or Rest app written in JS!
      challenges:
      - which library on npm to choose?!
      technologies:
      - NodeJS
      - Express
      - javascript
      measure of success:
      - REST app in 6 days

How to index it? First, make a JSON out of it.

Python for data wrangling and ES I/O

Data wrangling, or conversion, is simple. Let’s make JSON out of the YAML notation here. With the yaml library it takes only a few lines.

import yaml

fname = "projects.yml"
with open(fname) as f:
    # safe_load parses plain data without constructing arbitrary Python objects
    doc = yaml.safe_load(f)
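The parsed result is a nested dict, so "making JSON out of it" mostly means flattening it into one document per project. Here is a minimal sketch of that step. The field names `name` and `category` and the helper `flatten_projects` are my own assumptions for illustration, not something taken from the repository, and `sample` is a trimmed stand-in for what the loader returns for projects.yml.

```python
import json


def flatten_projects(doc):
    """Turn the nested Projects mapping into a flat list of project dicts."""
    docs = []
    for category, projects in doc["Projects"].items():
        for name, details in projects.items():
            # 'name' and 'category' are hypothetical field names
            project = {"name": name, "category": category}
            project.update(details)
            docs.append(project)
    return docs


# Trimmed stand-in for the parsed projects.yml
sample = {
    "Projects": {
        "Commercial": {
            "Truck browser": {
                "started at": "2013-01",
                "technologies": ["Spring 3.2", "JDK8"],
            }
        },
        "Private": {
            "Mini REST service for my CD Collection": {
                "started at": "2013-02",
                "technologies": ["NodeJS", "Express", "javascript"],
            }
        },
    }
}

projects = flatten_projects(sample)
print(json.dumps(projects[0], indent=2))
```

Keeping the category as a plain field means both commercial and private projects can live in the same index and still be filtered apart later.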

Then let’s prove we can connect to Elasticsearch. After importing the official library and running the script, we can see this output:

luk@luk-UX410UAK:~/prj/searchmybio$ python indexmyprojects.py
--Listing commercial projects--
Truck browser
--Listing side-projects--
Mini REST service for my CD Collection
{'name': 'HmNOtOU', 'cluster_name': 'elasticsearch', 'cluster_uuid': 'tRdhrV3gR0OnUfZMkPrpqQ', 'version': {'number': '5.5.0', 'build_hash': '260387d', 'build_date': '2017-06-30T23:16:05.735Z', 'build_snapshot': False, 'lucene_version': '6.6.0'}, 'tagline': 'You Know, for Search'}

Good, the connection works!
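The dict in the last line of that output looks like what the official client's `info()` call returns. A minimal sketch of such a connection check follows. The helper `cluster_summary` is hypothetical, and the address `http://localhost:9200` is an assumption: point it at wherever your Elasticsearch container is actually reachable.

```python
def cluster_summary(info):
    """Build a short one-line summary from the dict returned by es.info()."""
    return "{} running Elasticsearch {}".format(
        info["cluster_name"], info["version"]["number"]
    )


if __name__ == "__main__":
    try:
        # Official client: pip install elasticsearch
        from elasticsearch import Elasticsearch

        # Assumed address, adjust to your own setup
        es = Elasticsearch(["http://localhost:9200"])
        print(cluster_summary(es.info()))
    except Exception as exc:
        # Covers a missing library as well as an unreachable cluster
        print("Could not reach Elasticsearch:", exc)
```

Printing only the cluster name and version keeps the check readable, while the full dict is still there if you need the build details.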

Index time!

Before we start indexing docs, we have to arrange a place for them. Any schema? No, Elasticsearch can work it out for us. We just have to create the index, which is easy with a small Python script:

res = es.indices.create(index="searchmybio_luke")

After that we can query ES from the Kibana tools. So let’s fill it! After a few refactors we are ready to insert, or PUT, docs into their place. Indexing from within a script comes down to a one-liner. All the docs are there. Any proof? Query it:

GET searchmybio_luke/project/_search
{
  "query": {
    "match": {
      "technologies": "nodejs"  
    }
  }
}
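For the indexing step itself, here is a minimal sketch of how the docs could be sent, assuming each project is already a plain dict. The helper `index_projects` and the `RecordingClient` stand-in are hypothetical and only there so the snippet runs without a cluster. With a real client you would pass the `Elasticsearch` instance instead. The `doc_type` argument matches the 5.x setup used in this post, and newer Elasticsearch versions drop mapping types.

```python
def index_projects(es, docs, index_name="searchmybio_luke", doc_type="project"):
    """Send each project dict to Elasticsearch and return how many were sent."""
    count = 0
    for doc in docs:
        # One index() call per project; Elasticsearch generates the document id
        es.index(index=index_name, doc_type=doc_type, body=doc)
        count += 1
    return count


class RecordingClient:
    """Stand-in for the real client: it only records the calls it receives."""

    def __init__(self):
        self.calls = []

    def index(self, **kwargs):
        self.calls.append(kwargs)


client = RecordingClient()
sent = index_projects(
    client,
    [{"name": "Truck browser", "technologies": ["Spring 3.2", "JDK8"]}],
)
print(sent, client.calls[0]["index"])
```

For a handful of projects, one call per document is fine. For larger batches the client's bulk helpers would be the more usual choice.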

You can even write a test and run it against this index.

query_find_nodejs = {
    "query": {
        "match": {
            "technologies": "nodejs"
        }
    }
}
res = self.es.search(index='searchmybio_luke', doc_type='project', body=query_find_nodejs)
hits = res['hits']['hits']
self.assertEqual(len(hits), 1)

That’s all. In less than one hour we were able to run Elasticsearch on our machine and index docs from YAML straight into an index.

Here is a screenshot from my Kibana