<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Kimaru Thagana</title>
    <description>The latest articles on DEV Community by Kimaru Thagana (@kimaruthagna).</description>
    <link>https://dev.to/kimaruthagna</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F575898%2F795075d0-7d9c-4054-92e7-733c359242aa.jpeg</url>
      <title>DEV Community: Kimaru Thagana</title>
      <link>https://dev.to/kimaruthagna</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/kimaruthagna"/>
    <language>en</language>
    <item>
      <title>Kaplan Meier Estimator in Python</title>
      <dc:creator>Kimaru Thagana</dc:creator>
      <pubDate>Wed, 12 Oct 2022 18:47:38 +0000</pubDate>
      <link>https://dev.to/kimaruthagna/kaplan-meier-estimator-in-python-k64</link>
      <guid>https://dev.to/kimaruthagna/kaplan-meier-estimator-in-python-k64</guid>
      <description>&lt;p&gt;In the fast paced world of business, decision makers are usually interested in interpreting customer behavior in order to understand them better. With a better understanding, the business can then deploy marketing or operational actions that either prevent the customer from leaving (&lt;strong&gt;churn&lt;/strong&gt;) or keeping the customer happy such that they are more likely to stay(&lt;strong&gt;retention&lt;/strong&gt;)&lt;/p&gt;

&lt;p&gt;In either case, the business needs to find the optimal point at which the customer is most likely to leave and then make its move. Move too early and resources are wasted; move too late and the customer is lost. Is there a scientific, repeatable way to know this &lt;strong&gt;“golden point”&lt;/strong&gt; in time, the time it takes for the event to happen with a certain probability? As it turns out, researchers in the field of medicine have had a go at this and came up with an estimator. The resulting value is generally used to interpret the survival of a subject from a certain &lt;strong&gt;“event”&lt;/strong&gt;. &lt;/p&gt;

&lt;p&gt;Generally, this estimator falls into the broad category of &lt;em&gt;survival curves&lt;/em&gt; (named for the general shape of the result on a line graph); the method is referred to as the Kaplan-Meier estimator. Read more about it &lt;a href="https://www.karger.com/Article/Fulltext/324758" rel="noopener noreferrer"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;In this tutorial, we will consider the scenario of an IoT company that sells a remote sensing device. The sample data can be found in this &lt;a href="https://github.com/KimaruThagna/CohortAnalysis/tree/master" rel="noopener noreferrer"&gt;repository&lt;/a&gt;, along with the code used to generate it. The device sends a daily update that is one of two types: &lt;strong&gt;ROUTINE CHECK&lt;/strong&gt; or &lt;strong&gt;SYSTEM ERROR&lt;/strong&gt;. &lt;/p&gt;

&lt;p&gt;As a business, we take the error message as our event, and the time between the date of purchase and the date of the event as the time to event in days. The graph is interpreted as &lt;strong&gt;“what is the probability that the population (user devices) survives the system-error event up to day X?”&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;In performing survival analysis, the first quantity that needs to be available is the time to event: the time, in whatever unit (days, weeks, minutes, years), from the beginning of observation to when the event occurs. In our case, the beginning of the observation is the date of sale of the device; the end is when the first error event is sent, per user. &lt;/p&gt;

&lt;p&gt;In the CSV linked &lt;a href="https://github.com/KimaruThagna/CohortAnalysis/tree/master/data" rel="noopener noreferrer"&gt;here&lt;/a&gt;, this computation has already been done and the result stored in the &lt;strong&gt;days_to_event&lt;/strong&gt; column. With most datasets, you will either have a duration-to-event column or be able to derive it easily. However, if your use case is similar to ours, where the data lives in two different datasets, consider performing this step in SQL; it can also easily be translated to Python. Follow &lt;a href="https://github.com/KimaruThagna/CohortAnalysis/blob/master/sql/duration_table.sql" rel="noopener noreferrer"&gt;this link&lt;/a&gt; for the SQL version.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="n"&gt;UserId&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;RegistrationDate&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;EventType&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;EventDate&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;days_to_event&lt;/span&gt;
&lt;span class="mi"&gt;85&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;2021&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;03&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;19&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="k"&gt;SYSTEM&lt;/span&gt; &lt;span class="n"&gt;ERROR&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;2021&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;08&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;29&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;163&lt;/span&gt;
&lt;span class="mi"&gt;25&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;2021&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;05&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="k"&gt;SYSTEM&lt;/span&gt; &lt;span class="n"&gt;ERROR&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;2021&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;05&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;23&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;13&lt;/span&gt;
&lt;span class="mi"&gt;15&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;2021&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;06&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;12&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="k"&gt;SYSTEM&lt;/span&gt; &lt;span class="n"&gt;ERROR&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;2021&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;07&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;31&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;49&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
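&lt;p&gt;If, as in our case, the registration and event records live in two separate datasets, the same derivation can be sketched in pandas. The two frames below are hypothetical in-memory stand-ins for the source tables; the column names follow the sample above.&lt;/p&gt;

```python
import pandas as pd

# Hypothetical stand-ins for the two source datasets.
registrations = pd.DataFrame({
    "UserId": [85, 25],
    "RegistrationDate": ["2021-03-19", "2021-05-10"],
})
events = pd.DataFrame({
    "UserId": [85, 85, 25],
    "EventType": ["ROUTINE CHECK", "SYSTEM ERROR", "SYSTEM ERROR"],
    "EventDate": ["2021-04-01", "2021-08-29", "2021-05-23"],
})

# Keep only error events, take the first one per user, join back to
# registrations, then compute the day difference.
errors = events[events["EventType"] == "SYSTEM ERROR"]
first_error = errors.sort_values("EventDate").groupby("UserId", as_index=False).first()
df = registrations.merge(first_error, on="UserId")
df["days_to_event"] = (
    pd.to_datetime(df["EventDate"]) - pd.to_datetime(df["RegistrationDate"])
).dt.days
print(df[["UserId", "days_to_event"]])
```

&lt;p&gt;This mirrors the join-and-date-difference logic of the linked SQL script, though the exact SQL may differ.&lt;/p&gt;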



&lt;p&gt;After you have your time-to-event data, you can apply the Kaplan-Meier estimator to it. This generates a new dataset commonly known as the &lt;strong&gt;survival table&lt;/strong&gt;. &lt;/p&gt;

&lt;p&gt;The survival table is what is plotted to give the survival curves many are familiar with. To apply the estimator, one needs a bit of an understanding of the math behind it.&lt;/p&gt;

&lt;p&gt;The estimator itself is S(t) = ∏ (1 - d&lt;sub&gt;i&lt;/sub&gt;/n&lt;sub&gt;i&lt;/sub&gt;), with the product taken over all event times t&lt;sub&gt;i&lt;/sub&gt; up to t (&lt;a href="https://en.wikipedia.org/wiki/Kaplan%E2%80%93Meier_estimator" rel="noopener noreferrer"&gt;formula reference&lt;/a&gt;).&lt;/p&gt;

&lt;p&gt;T is the total observation time and t&lt;sub&gt;i&lt;/sub&gt; is a point in time between 0 and the maximum T. For example, if you are looking at events over one year, your maximum T value would be 365. d&lt;sub&gt;i&lt;/sub&gt; is the number of events that happen at time t&lt;sub&gt;i&lt;/sub&gt;; in our case, this is the number of users whose devices report their first error message at that time. n&lt;sub&gt;i&lt;/sub&gt; is the number of subjects still at risk of experiencing the event just before time t&lt;sub&gt;i&lt;/sub&gt;; in our scenario, the number of users whose devices have not yet failed as at time t&lt;sub&gt;i&lt;/sub&gt;.&lt;/p&gt;

&lt;p&gt;We then subtract the ratio d&lt;sub&gt;i&lt;/sub&gt;/n&lt;sub&gt;i&lt;/sub&gt; from 1 to get the fraction of survivors at each event time, because the end goal is to determine who survives to a specific point in time.&lt;/p&gt;

&lt;p&gt;The ∏ (Pi) symbol denotes a cumulative product. Multiplying these per-interval survival fractions together gives the cumulative survival probability of all the user devices at time t.&lt;/p&gt;
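&lt;p&gt;To make the arithmetic concrete, here is a minimal pure-Python sketch of the estimate on a tiny hypothetical sample; the durations and censoring flags are made up for illustration.&lt;/p&gt;

```python
# Manual Kaplan-Meier estimate on a tiny, made-up sample of devices.
# durations: days_to_event per device; observed: 1 if the SYSTEM ERROR was
# seen, 0 if the device was censored (never reported an error).
durations = [13, 49, 60, 60, 163]
observed = [1, 1, 0, 1, 1]

pairs = sorted(zip(durations, observed))
n_at_risk = len(pairs)   # the risk set starts as the whole population
survival = 1.0
curve = {}
i = 0
while i != len(pairs):
    t = pairs[i][0]
    d = sum(1 for dur, obs in pairs if dur == t and obs == 1)  # events at t
    removed = sum(1 for dur, obs in pairs if dur == t)         # leave the risk set
    if d:
        survival *= 1 - d / n_at_risk   # multiply in the per-interval fraction
        curve[t] = survival
    n_at_risk -= removed
    i += removed
print(curve)
```

&lt;p&gt;Each distinct event time multiplies a factor of (1 - d&lt;sub&gt;i&lt;/sub&gt;/n&lt;sub&gt;i&lt;/sub&gt;) into the running product, which is exactly the cumulative product described above.&lt;/p&gt;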

&lt;p&gt;All of the above would be quite tricky to compute by hand and requires a depth of statistical knowledge that many may not have. This is why the &lt;strong&gt;lifelines&lt;/strong&gt; library exists: it makes all of this easier, reducing the whole computation to a handful of lines of code.&lt;/p&gt;

&lt;p&gt;To use the library, only two items are required: the &lt;strong&gt;time_to_event&lt;/strong&gt; column and the &lt;strong&gt;target/event&lt;/strong&gt; column. Since the target column varies from one use case to another, a simple transformation step is required: convert the column values to boolean, where the target event is &lt;code&gt;TRUE(1)&lt;/code&gt; and the rest are &lt;code&gt;FALSE(0)&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;pandas&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;matplotlib.pyplot&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;plt&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;lifelines&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;KaplanMeierFitter&lt;/span&gt; 

&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;read_csv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;data/duration.csv&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;EventType&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;SYSTEM ERROR&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;EventType&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]]&lt;/span&gt;
&lt;span class="n"&gt;time_to_event&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;days_to_event&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;event&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;EventType&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="n"&gt;kmf&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;KaplanMeierFitter&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;kmf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;fit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;time_to_event&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;event_observed&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;kmf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;plot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;at_risk_counts&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;title&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Kaplan-Meier Curve&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;show&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In the code above, we replace the &lt;strong&gt;EventType&lt;/strong&gt; column with its boolean transformation. Once that is done, all that is required is to fit and plot the &lt;strong&gt;Kaplan-Meier&lt;/strong&gt; estimate. &lt;br&gt;
&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0zyvy34gzexg6t3txpgs.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0zyvy34gzexg6t3txpgs.png" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;From the example above, we can make the following interpretations:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;At day 50, there is a 0.99 probability of a device surviving the SYSTEM ERROR event.&lt;/li&gt;
&lt;li&gt;The lowest survival probability being 0.88 means the business can promise that devices will survive the SYSTEM ERROR event through day 350 (roughly one year) with 88% confidence.&lt;/li&gt;
&lt;li&gt;Equivalently, at day 50 there is a 0.01 probability that a device does not survive the event and hence reports an error.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In conclusion, we have learnt about the Kaplan-Meier estimator and how to employ it to answer vital business questions around survival of events. Some other applications of this technique include determining:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;How many days until a customer cancels their subscription&lt;/li&gt;
&lt;li&gt;How many hours until our IoT device fails&lt;/li&gt;
&lt;li&gt;How many orders do customers make until they cancel their subscription&lt;/li&gt;
&lt;li&gt;How many years until our installed water pump fails&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Knowing all of the above with a good degree of accuracy allows a business not only to operate efficiently but also to stay ahead of the competition by handling its customers and assets better. All using the power of data.&lt;/p&gt;

</description>
      <category>retentionanalytics</category>
      <category>sql</category>
      <category>python</category>
    </item>
    <item>
      <title>The Data Trinity</title>
      <dc:creator>Kimaru Thagana</dc:creator>
      <pubDate>Wed, 04 Aug 2021 11:47:05 +0000</pubDate>
      <link>https://dev.to/kimaruthagna/the-data-trinity-5gj3</link>
      <guid>https://dev.to/kimaruthagna/the-data-trinity-5gj3</guid>
<description>&lt;p&gt;A trinity is a reference to three (3) items working in perfect harmony, seemingly as a single unit, to achieve a defined goal. In the modern world of data engineering, this is a holy grail that many are in search of.&lt;br&gt;
In this case, we are considering the scope from source to business insight using data, and the three most important components along the way. These are:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Extraction and loading tools&lt;/li&gt;
&lt;li&gt;Transform and storage tools&lt;/li&gt;
&lt;li&gt;Visualization and Intelligence tools&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;We will go through the options in each of the &lt;em&gt;trinity&lt;/em&gt; components, covering what to consider and what to weigh.&lt;/p&gt;
&lt;h2&gt;
  
  
  Extraction and Loading Tools
&lt;/h2&gt;

&lt;p&gt;This component of the trinity is mainly involved in taking data from a source (ingestion) and "transporting" it to a destination. A cool feature of most tools is scheduled ingestion: you can choose to ingest your data from source to destination daily, every x minutes or every y hours. This allows you to focus on other, more important tasks.&lt;/p&gt;

&lt;p&gt;Within the data engineering context, &lt;strong&gt;Extraction&lt;/strong&gt; refers to the process of obtaining raw data at the source. The source can be any software artifact that holds or produces data. These include APIs, databases, and software systems such as CRMs and transactional systems, et al.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Loading&lt;/strong&gt; - Dumping data into a destination artifact. The data can be raw data, if one is using the ELT approach or processed data if one is using the traditional ETL approach. The most common destination artifact is a &lt;strong&gt;data warehouse&lt;/strong&gt;.  &lt;/p&gt;
&lt;h3&gt;
  
  
  Considerations
&lt;/h3&gt;

&lt;p&gt;When dealing with this component, the main considerations include data privacy, engineering resource management, costs and the robustness of the tool used. Let us consider a practical example with real-life tools.&lt;/p&gt;

&lt;p&gt;If your company is price-sensitive, you might opt for a free open-source tool such as &lt;a href="https://airbyte.io/"&gt;Airbyte&lt;/a&gt; but absorb the engineering costs of setup and maintenance. If your company wishes to focus its engineering resources on tasks other than extraction and loading, it can use a managed service like &lt;a href="https://fivetran.com"&gt;Fivetran&lt;/a&gt;, where it pays for the service. &lt;/p&gt;

&lt;p&gt;The trade-offs are purely situation-dependent. Maybe you want a free open-source tool, maybe you want a fully managed service, maybe you are bootstrapping and do not have funds, maybe you are a big company and do not mind paying.&lt;/p&gt;

&lt;p&gt;Airbyte and Fivetran are some of the most common tools in this space within data engineering.&lt;/p&gt;
&lt;h2&gt;
  
  
  Transform and Storage Tools
&lt;/h2&gt;

&lt;p&gt;This component of the trinity is mainly involved in processing your data according to your business needs and also storing it.&lt;br&gt;
The transform component is where business logic is domiciled. This is where you perform data transformations to generate business value from data. They could be as simple as filters or as complex as joins, rolling column computations and pivoting. Most transformations will be done in SQL because of its ability to perform data processing and its compatibility with the storage systems: most storage systems are designed to be SQL compatible, and hence SQL is the most common transformation language. A common tool in this space is Data Build Tool (&lt;a href="https://www.getdbt.com/"&gt;DBT&lt;/a&gt;), a powerful tool that supercharges SQL by introducing other programming paradigms such as Jinja templating, source definitions, hooks, variables and sanity checks/tests, among others. The alternative would be stored procedures and SQL scripts, but you would be losing out on all the great features of DBT. &lt;br&gt;
I have had a go at it on my personal GitHub, which you can check out here: &lt;a href="https://github.com/KimaruThagna/DBT-Bon-Voyage"&gt;DBT Bon Voyage&lt;/a&gt;.&lt;br&gt;
Fivetran also offers a SQL-based transformation interface where one can run their SQL scripts.&lt;/p&gt;
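&lt;p&gt;To give a flavour of what DBT adds on top of plain SQL, a model file might look roughly like this; the source and column names here are hypothetical.&lt;/p&gt;

```sql
-- models/first_device_error.sql (hypothetical DBT model)
-- {{ source() }} resolves to the real warehouse table at compile time,
-- and the file name becomes a queryable relation other models can ref().
select
    user_id,
    min(event_date) as first_error_date
from {{ source('iot', 'device_events') }}
where event_type = 'SYSTEM ERROR'
group by user_id
```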

&lt;p&gt;The storage component is simply where the data lives after extraction and loading. The most common artifact is a data warehouse or a data lake, depending on the conceptual design. &lt;br&gt;
The most common services in this category are AWS &lt;a href="https://aws.amazon.com/redshift/"&gt;Redshift&lt;/a&gt;, Google Cloud &lt;a href="https://cloud.google.com/bigquery"&gt;BigQuery&lt;/a&gt; and &lt;a href="https://www.snowflake.com/"&gt;Snowflake&lt;/a&gt;.&lt;/p&gt;
&lt;h2&gt;
  
  
  Visualization and Intelligence Tools
&lt;/h2&gt;

&lt;p&gt;This component of the trinity is mainly involved in synthesizing and displaying business value. This is where the business intelligence analyst, data analyst, data scientist and business executives operate. They use the extracted, loaded and transformed data to answer business questions and produce actionable insights. The software artifacts at this level often include visualization, query and reporting tools.&lt;br&gt;
Common tools at this level are business intelligence software such as &lt;a href="https://looker.com/"&gt;Looker&lt;/a&gt;, Power BI and Tableau, among others. &lt;br&gt;
The best and most robust features in any of these tools are on a paid tier, so costs are definitely a consideration. Whether you want the tool on your private server or on a public cloud that the vendor sets up for you is also a choice, depending on your data policy.&lt;/p&gt;
&lt;h2&gt;
  
  
  Wrapping Up
&lt;/h2&gt;

&lt;p&gt;By now you are aware of the data trinity in terms of tools: a well-oiled machine that is commonly referred to as the modern data stack.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Airbyte/Fivetran --&amp;gt; Snowflake/Redshift/BigQuery [With DBT or custom SQL scripts running on top] --&amp;gt; Looker/PowerBI/Tableau
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A great company that offers data consulting services throughout the whole stack is &lt;a href="https://www.silvercreek.io/"&gt;SILVER CREEK INSIGHTS&lt;/a&gt;. They have direct partnerships with Looker and Fivetran, so your entire data stack needs can be solved under the Silvercreek umbrella.&lt;br&gt;
If you would like to interact further, you can find me on my personal &lt;a href="https://kimaruthagna.github.io/"&gt;site&lt;/a&gt; or &lt;a href="https://www.linkedin.com/in/kimaru-thagana-4920b5181/"&gt;LinkedIn&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>dataengineering</category>
      <category>etl</category>
      <category>elt</category>
      <category>datapipelines</category>
    </item>
    <item>
      <title>Soft Skills in a Hard Skills Industry</title>
      <dc:creator>Kimaru Thagana</dc:creator>
      <pubDate>Thu, 04 Mar 2021 06:25:59 +0000</pubDate>
      <link>https://dev.to/kimaruthagna/soft-skills-in-a-hard-skills-industry-5ib</link>
      <guid>https://dev.to/kimaruthagna/soft-skills-in-a-hard-skills-industry-5ib</guid>
      <description>&lt;h1&gt;
  
  
  Introduction
&lt;/h1&gt;

&lt;p&gt;For most people, a large part of life is spent in the workplace, and your work environment has a direct effect on the quality of life you lead. Since most jobs are not done in solitude but are carried out in teams, your success rests as much on working with other people to achieve common goals as it does on the individual job you do. &lt;br&gt;
Should you lack the skills to navigate the work environment and the people you work with, your day-to-day becomes daunting, which leads to job dissatisfaction and possibly quitting outright. These soft skills can be as important as your technical skills. &lt;/p&gt;

&lt;p&gt;Contrary to popular belief, soft skills can be learned and taught. Some may dismiss them as personality traits, skills you either are born with or not, but soft skills can be taught through structured and unstructured mentorship. &lt;br&gt;
A structured approach involves planning and predefined goals. This is ideal where people endeavour to achieve results within a fixed time frame and are available for the planned activities. Figuratively speaking, unstructured mentorships involve taking a friend's or colleague's hand and walking the journey of learning a new skill together. Mentoring occurs for both people and happens without a fixed timeframe; it could ideally go on forever. &lt;br&gt;
In this article, we will explore the soft skills that most affect tech and software engineering careers. We will discuss how they impact our day-to-day lives, how to gain them, and lastly, how to practice and teach them to peers. &lt;/p&gt;

&lt;h1&gt;
  
  
  What are soft skills?
&lt;/h1&gt;

&lt;p&gt;Soft skills can best be described as social, people and communication skills that are heavily intertwined with an individual's attitude, personal attributes and mindset. These skills directly influence, whether positively or negatively, an individual's level of emotional and social intelligence.&lt;br&gt;
Soft skills are therefore paramount for any member of a society, for they will need to interact on an interpersonal level. A superb grasp of soft skills means that you are emotionally and socially intelligent. In this regard, you are easy to work with and hence desirable, whether as a partner in a social setting or as an employee in a work setting, depending on what you wish to go for.&lt;br&gt;
In the workplace, especially in reference to Information Technology (IT), there are generally seven (7) soft skills that have been singled out as the most important. With these, you are generally a highly desirable candidate for collaboration or work on any project. These are:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Leadership skills - Qualities that make you capable of handling a leadership role and
responsibly managing work under no supervision. This shows a sense of autonomy and trustworthiness. This skill is ideal in gaining promotions to higher ranks at work.
&lt;/li&gt;
&lt;li&gt;Communication skills - Both verbal and written. A vital cornerstone of interpersonal relations: your teammates and managers need to understand what you are saying
in a clear and timely manner. Communication skills encompass facets such as letter and email writing, public speaking and one-on-one conversations, among others.
This skill is vital in improving your desirability at work since, once mastered, people will gravitate towards you. You will be easy to collaborate with.
&lt;/li&gt;
&lt;li&gt;Teamwork and collaboration - The spirit of teamwork and collaboration is paramount in any society or group.
This skill encompasses items such as resource sharing, selflessness, empathy and tolerance. One should be able to consider things from their teammates perspective and get where they are coming from.
Great collaboration skills allow you to be a critical and indispensable member of any team.
You are easy to work with, you engage with other team members actively and therefore increase team morale and productivity.
&lt;/li&gt;
&lt;li&gt;Conflict resolution - This involves being a fair arbiter. In the event that a conflict or misunderstanding arises, this skill lets
you listen to both sides, de-escalate the situation and propose a pragmatic solution that is just and leaves all the conflicting parties satisfied. This ties into leadership, and with this skill you can easily rise into a leadership role.
&lt;/li&gt;
&lt;li&gt;Reliability and dependability - This calls upon your character: how true you are to your word. It is very important to be reliable, since it is impractical for someone to give you work and then keep following up on whether it is done.
Reliability signals autonomy, which puts you on track towards a more senior role since your managers believe you can deliver on a larger scale.
&lt;/li&gt;
&lt;li&gt;Flexibility and adaptability - In the ever-changing world of tech, the most adaptable become the most valuable.
Being rigid and resisting change weighs you down and does the same to your workmates; they may find it very hard to work with you.
Flexibility involves being able to adapt to new situations and make the best of them.
&lt;/li&gt;
&lt;li&gt;Critical Thinking - A vital skill in a tech related or engineering field. This is more of an umbrella skill that encompasses observation, analysis, inference and problem solving. With this skill, you simply get the job done.
Mastering this skill makes you very marketable since you can deliver products and services that people are willing to pay money for. You are able to identify pain points and address them with a practical and suitable solution hence solving the problem.
&lt;/li&gt;
&lt;/ol&gt;

&lt;h1&gt;
  
  
  Importance of Soft Skills
&lt;/h1&gt;

&lt;p&gt;The main importance is smooth interpersonal relations and peace of mind. From the above enumeration, it can be observed that all of these skills gravitate towards making your life working with other people easier.&lt;br&gt;
Soft skills also enrich you and make you more valuable as a team member or an employee. The more valuable you are, the higher the price and the more favourable the terms you attract. In the tech workplace, the main importance of these skills is to keep you in the job or position that your hard skills propelled you to attain, or to propel you to a higher one. These skills can therefore not be ignored if one wishes to progress in their career.&lt;br&gt;
As a personal testament, I was once a junior developer in an agricultural tech startup. I knew very little and was quite intimidated.
To offset this imbalance, I volunteered for every task, communicated any progress or blockers encountered, took correction and constructive criticism positively and ensured I was as reliable as could be. I was able to save my job thanks to these qualities, which the management saw, deciding to give me a chance to learn and catch up. &lt;/p&gt;

&lt;h1&gt;
  
  
  Gaining Soft Skills
&lt;/h1&gt;

&lt;p&gt;As we have established what soft skills are and their importance, the next logical step is to learn how to attain them and improve on them as much as possible. &lt;br&gt;
Since these are people skills, they are best taught, through mentorship, by other people who already have them. There are generally two forms of mentorship. &lt;/p&gt;

&lt;h2&gt;
  
  
  Structured Mentorship and Fellowships
&lt;/h2&gt;

&lt;p&gt;This is a formalized approach that involves structures, timetables, deadlines and expected outcomes within a certain time frame. This mode of mentorship is ideal for campus students and other individuals in an institution of learning who are relatively young in their career, or have not started at all. The structured form capitalizes on the human resource capacity of seasoned professionals to teach, mentor and guide relatively novice individuals. Some arrangements involve paid mentorships, but most offer structured mentorship to the mentees for free.&lt;br&gt;
To sustain the program, partners and sponsors are involved. Some partners offer grants, some offer resources such as venues, and some offer human capital in the form of mentors.&lt;br&gt;
There is also a self-sustaining model where the mentees of the first cohort come back and offer their time as program assistants or mentors in the next cohort. This is based on the assumption that after graduating as a mentee, you are equipped to impart knowledge and wisdom to anyone in the subsequent cohort.&lt;br&gt;
The pros include:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Structured approach- This creates order and a course to follow. With such parameters in place, it is easy to achieve goals within the intended time frame.
&lt;/li&gt;
&lt;li&gt;Easier to follow up- With clear goals to be achieved, monitoring and evaluating progress is easier.
&lt;/li&gt;
&lt;li&gt;Can be easily replicated- Structures allow documentation of procedures, making it easy to replicate the success of a structured program in another setting.
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The main con is that the format is demanding and may therefore tax individuals with prior commitments. There is also a lack of flexibility: the time-bound requirements attached to each goal can put unforeseen pressure on participants, leaving them unable to give their best.&lt;/p&gt;

&lt;h2&gt;
  
  
  Unstructured Mentorships
&lt;/h2&gt;

&lt;p&gt;This form of mentorship, also referred to as the &lt;em&gt;buddy system&lt;/em&gt;, is ideal on a long-term basis between people who are in close proximity career-wise or socially, such as a friend or colleague. With unstructured mentorship there is no predefined course or curriculum to follow, so the participants decide the rules of engagement, what the end goal is and how it can be measured. Though deceptively free and open, this requires a lot of discipline as an individual and as a unit. You are your own boss and your own critic.&lt;br&gt;&lt;br&gt;
It is best reserved for individuals with greater experience in life or career, since they have built their discipline over time and can manage the mentorship activities themselves.&lt;br&gt;&lt;br&gt;
The pros include the freedom and flexibility to choose your own structure and what works best for you, and the absence of time bounds or restrictions, so the mentorship can proceed at a desired pace.&lt;br&gt;&lt;br&gt;
The main cons are tied to that total freedom. With great power (freedom) comes great responsibility. With no order or timetable to follow, a stellar level of discipline and personal responsibility is demanded of the participants for the exercise to bear fruit.  &lt;/p&gt;

&lt;h2&gt;
  
  
  Everyday Practice
&lt;/h2&gt;

&lt;p&gt;To be good at any skill, you have to be willing to put in the hours, and gaining soft skills is no exception. It requires habitual changes that gradually sharpen these skills, and a very strong requirement is the discipline to put in the work every day.&lt;br&gt;&lt;br&gt;
Tips to sharpen your soft skills include:  &lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Proactively work in team settings
&lt;/li&gt;
&lt;li&gt;Seek public speaking opportunities in your settings, whether at work or at home. This builds your confidence in speaking to crowds.
&lt;/li&gt;
&lt;li&gt;Volunteer for leadership roles in your circles, e.g. team lead, master of ceremonies, etc.
&lt;/li&gt;
&lt;li&gt;Offer to help others with writing tasks such as job and scholarship applications.&lt;/li&gt;
&lt;/ol&gt;

&lt;h1&gt;
  
  
  Conclusion
&lt;/h1&gt;

&lt;p&gt;To wrap it all up, it is evident that soft skills are a vital, if not critical, component in the professional life of a practitioner in Computer Science or any engineering field in general. They introduce a much-needed balance to a field commonly mistaken to be all about hard skills such as building and development.&lt;br&gt;&lt;br&gt;
This balance enriches team interactions, mainly via collaboration and communication, which in the long run boosts productivity and output.&lt;br&gt;&lt;br&gt;
Soft skills will sometimes save you from a potential job loss, de-escalate conflicts, open doors in the workplace and improve your communication and presentation skills overall.&lt;br&gt;&lt;br&gt;
The onus is therefore upon us, members of the tech community, to share the importance of these skills and how they complement the commonly valued hard skills. One of the most effective ways of sharing and giving back would be mentorship and accountability groups geared towards the development of soft skills. With all hands on deck, we can work towards dispelling the notion that soft skills are not that important in an engineering or tech job.&lt;br&gt;&lt;br&gt;
The fact is, soft skills will sometimes, if not most of the time, be the more important skills to possess. &lt;br&gt;
The hard skills may get you the job; the soft skills will keep you there.&lt;/p&gt;

</description>
      <category>softskills</category>
    </item>
    <item>
      <title>Python for FaunaDB</title>
      <dc:creator>Kimaru Thagana</dc:creator>
      <pubDate>Mon, 08 Feb 2021 19:40:14 +0000</pubDate>
      <link>https://dev.to/kimaruthagna/python-for-faunadb-1ojn</link>
      <guid>https://dev.to/kimaruthagna/python-for-faunadb-1ojn</guid>
      <description>&lt;h1&gt;
  
  
  Introduction
&lt;/h1&gt;

&lt;p&gt;Cloud native data APIs, serverless services and cloud native services in general are becoming a go-to for companies that no longer wish to manage their own data and infrastructure layers. This use case is becoming more common as startups shift their focus towards their core business and customer satisfaction.   &lt;/p&gt;

&lt;p&gt;Cloud native services ride on the already successful and proven delivery model of cloud compute services. They are reliable, scalable and readily available. Using them, startups quickly bypass infrastructural setups such as network, server and operating system provisioning. The saved time and resources are then channelled towards their core business, keeping them competitive.  &lt;/p&gt;

&lt;p&gt;In this guide, you will explore one such cloud native offering: a data API that leverages cloud compute and serverless architecture to offer reliability, flexibility and cost savings. The API also boasts the ability to handle different data models, from graph to document-based data.  &lt;/p&gt;

&lt;h1&gt;
  
  
  FaunaDB
&lt;/h1&gt;

&lt;p&gt;&lt;a href="https://fauna.com"&gt;Fauna&lt;/a&gt; is a cloud native data API marketed to developers and designed for serverless applications. Outstanding features include a native GraphQL layer, support for different data models (relational, document or graph) and ACID transactions that ensure consistency.  &lt;/p&gt;

&lt;p&gt;It follows a &lt;em&gt;freemium&lt;/em&gt; approach where one can subscribe to the free plan and later move up the tiers if need be. As a true developer centric offering, the service accommodates some popular programming languages by means of drivers.&lt;br&gt;&lt;br&gt;
There are several language drivers available such as Javascript, C#, Go and Python amongst others.    &lt;/p&gt;

&lt;p&gt;To get started, &lt;a href="https://dashboard.fauna.com/accounts/register?utm_source=DevTo&amp;amp;utm_medium=referral&amp;amp;utm_campaign=WritewithFauna_GettingStartedFaunaPython_KThagana"&gt;register&lt;/a&gt; and create a new database where you will receive a database key.    &lt;/p&gt;

&lt;p&gt;With the key, it is then possible to develop your app in your preferred language and use the available drivers to connect to your newly created cloud native database.  &lt;/p&gt;

&lt;p&gt;If you check the corresponding box, the dashboard can load some sample data to help you navigate the user interface and interact with CRUD components.    &lt;/p&gt;
&lt;h1&gt;
  
  
  Scenario
&lt;/h1&gt;

&lt;p&gt;To better understand Fauna and its use, we will build a sample app around the scenario below.  &lt;/p&gt;

&lt;p&gt;Consider a situation where you are an independent software and systems consultant advising a hospital. The client (the hospital) wishes to move its operations to the cloud. Among the requirements, the client wants a data layer service flexible enough to accommodate different data schemas, mainly document, graph and relational, due to the data science and mining operations they intend to perform in the future.&lt;br&gt;&lt;br&gt;
The client also wishes the solution to be Python friendly, as this is the main language they intend to use.  &lt;/p&gt;

&lt;p&gt;Taking all these requirements into consideration, over and above their desire for a reliable and cost effective solution, you choose to suggest &lt;a href="https://fauna.com"&gt;FaunaDB&lt;/a&gt;.  &lt;/p&gt;

&lt;p&gt;To demonstrate its use, you attach a sample application, written with Python drivers, to your proposal for the client's developer team to gain some insights.  &lt;/p&gt;
&lt;h1&gt;
  
  
  Sample App
&lt;/h1&gt;

&lt;p&gt;To demonstrate a hospital's operations, consider the schema below&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight graphql"&gt;&lt;code&gt;&lt;span class="err"&gt;Doctor&lt;/span&gt;&lt;span class="w"&gt;  
  &lt;/span&gt;&lt;span class="err"&gt;last_name&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="err"&gt;first_name&lt;/span&gt;&lt;span class="w"&gt; 
  &lt;/span&gt;&lt;span class="err"&gt;license_number&lt;/span&gt;&lt;span class="w"&gt; 
  &lt;/span&gt;&lt;span class="err"&gt;specialization&lt;/span&gt;&lt;span class="w"&gt; 
  &lt;/span&gt;&lt;span class="err"&gt;staffID&lt;/span&gt;&lt;span class="w"&gt;  

&lt;/span&gt;&lt;span class="err"&gt;Diagnosis&lt;/span&gt;&lt;span class="w"&gt;  
  &lt;/span&gt;&lt;span class="err"&gt;doctor&lt;/span&gt;&lt;span class="w"&gt; 
  &lt;/span&gt;&lt;span class="err"&gt;notes&lt;/span&gt;&lt;span class="w"&gt; 
  &lt;/span&gt;&lt;span class="err"&gt;patient&lt;/span&gt;&lt;span class="w"&gt;  

&lt;/span&gt;&lt;span class="err"&gt;Patient&lt;/span&gt;&lt;span class="w"&gt;  
 &lt;/span&gt;&lt;span class="err"&gt;last_name&lt;/span&gt;&lt;span class="w"&gt; 
 &lt;/span&gt;&lt;span class="err"&gt;first_name&lt;/span&gt;&lt;span class="w"&gt; 
 &lt;/span&gt;&lt;span class="err"&gt;patientID&lt;/span&gt;&lt;span class="w"&gt; 
 &lt;/span&gt;&lt;span class="err"&gt;insurance_policy&lt;/span&gt;&lt;span class="w"&gt;  
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
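&lt;p&gt;Since Fauna stores each document's fields under a single &lt;code&gt;data&lt;/code&gt; key, every row of the schema above maps directly to a plain Python dictionary. The sketch below shows one example document per collection; all field values are illustrative.&lt;/p&gt;

```python
# One example document per collection; every field value here is made up.
doctor = {
    "last_name": "Fauc",
    "first_name": "John",
    "license_number": "AGY5578199O",
    "specialization": "cardiologist",
    "staffID": "AW110",
}

patient = {
    "last_name": "Doe",
    "first_name": "Jane",
    "patientID": "P001",
    "insurance_policy": "INS-778",
}

# A diagnosis links a doctor to a patient; with the driver these two
# fields become q.ref(...) references rather than plain strings.
diagnosis = {
    "doctor": "doctors/181019942046968320",
    "patient": "patients/181019942046968320",
    "notes": "The patient seems to exhibit symptoms of the common flu....",
}
```

&lt;p&gt;With the driver, each of these dictionaries becomes the value of the &lt;code&gt;data&lt;/code&gt; key in a create call.&lt;/p&gt;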



&lt;h2&gt;
  
  
  Python Driver
&lt;/h2&gt;

&lt;p&gt;To install the Python driver, run the command &lt;code&gt;pip install faunadb&lt;/code&gt;.  &lt;/p&gt;

&lt;p&gt;The source code for the driver can be found &lt;a href="https://github.com/fauna/faunadb-python"&gt;here&lt;/a&gt;    &lt;/p&gt;

&lt;p&gt;The codeblock below shows a starter script to get you going.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import os 
from faunadb 
import query as q
from faunadb.client import FaunaClient 

 db_client = FaunaClient(secret=os.environ.get("secret"))  
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It is always best practice to keep keys as environment variables. In this case, the &lt;code&gt;secret&lt;/code&gt; is the key to access your cloud based FaunaDB instance that you created on your dashboard.&lt;br&gt;&lt;br&gt;
With the above code, you are connected and can commence performing database operations such as CRUD.  &lt;/p&gt;
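&lt;p&gt;Because a missing key otherwise only surfaces later as an opaque authentication failure, a small helper that fails loudly when the variable is unset can save debugging time. This is a minimal sketch; the variable name &lt;code&gt;FAUNA_SECRET&lt;/code&gt; is an assumption, while the snippet above reads a variable named &lt;code&gt;secret&lt;/code&gt;.&lt;/p&gt;

```python
import os

def fauna_secret(var_name="FAUNA_SECRET"):
    """Read the database key from the environment, failing loudly if unset."""
    secret = os.environ.get(var_name)
    if not secret:
        raise RuntimeError(var_name + " is not set; export your Fauna key first")
    return secret

# Illustrative value only; in practice the variable is set outside the program.
os.environ["FAUNA_SECRET"] = "fn_example_key"
key = fauna_secret()
```

&lt;p&gt;You would then construct the client with &lt;code&gt;FaunaClient(secret=fauna_secret())&lt;/code&gt;.&lt;/p&gt;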
&lt;h3&gt;
  
  
  Create Collection
&lt;/h3&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;doctor_collection = db_client.query(q.create_collection({"name":"doctors"}))  
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Create the remaining collections, &lt;em&gt;Diagnosis&lt;/em&gt; and &lt;em&gt;Patient&lt;/em&gt;, using the above syntax.  &lt;/p&gt;
&lt;h3&gt;
  
  
  Create Document
&lt;/h3&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;doctor_1 = db_client.query(    
 q.create(doctor_collection["ref"],   
  {"data":{"last_name": "Fauc",    
 "first_name":"John",   
 "license_number": "AGY5578199O",  
"specialization": "cardiologist",  
"staffID":"AW110"  
}})) 
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Create a document with a foreign key reference:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;diagnosis_1 = db_client.query(    
 q.create(diagnosis_collection["ref"],   
  {"data":{"doctor": q.ref(q.collection("doctors"), "181019942046968320"),    
 "patient":q.ref(q.collection("patients"), "181019942046968320"),   
 "notes": "The patient seems to exhibit symptoms of the common flu....",  
}})) 
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Use the above syntax to create several more records in the available collections.  &lt;/p&gt;
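&lt;p&gt;Creating records one at a time quickly gets repetitive, so a helper that maps each record to a create expression keeps things tidy. The &lt;code&gt;FakeClient&lt;/code&gt; below is a hypothetical stand-in used only so this sketch runs offline; with the real driver you would pass your &lt;code&gt;FaunaClient&lt;/code&gt; and a function wrapping &lt;code&gt;q.create&lt;/code&gt; instead.&lt;/p&gt;

```python
class FakeClient:
    """Hypothetical stand-in for FaunaClient, so this sketch runs offline."""
    def query(self, expression):
        # Echo the expression back in place of a real driver response.
        return expression

def bulk_create(client, make_expr, records):
    """Send one create expression per record and collect the responses."""
    return [client.query(make_expr(record)) for record in records]

patients = [
    {"last_name": "Doe", "first_name": "Jane", "patientID": "P001"},
    {"last_name": "Smith", "first_name": "Ali", "patientID": "P002"},
]

# With the real driver, make_expr would be:
#   lambda r: q.create(q.collection("patients"), {"data": r})
responses = bulk_create(FakeClient(), lambda r: {"data": r}, patients)
```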

&lt;h3&gt;
  
  
  Retrieve Records
&lt;/h3&gt;

&lt;p&gt;Demonstrate a &lt;strong&gt;retrieve&lt;/strong&gt; operation by fetching the single doctor document entered into the database by the create function.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;single_doctor = db_client.query(  
 q.get(q.ref(q.collection("doctors"), "181019942046968320")))  
print(single_doctor)  
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Note that the argument after the collection in &lt;code&gt;q.ref&lt;/code&gt; is the reference number identifying a particular document.&lt;/em&gt;  &lt;/p&gt;
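&lt;p&gt;The dictionary returned by a &lt;code&gt;get()&lt;/code&gt; holds the document reference and a timestamp alongside the stored fields, which live under the &lt;code&gt;data&lt;/code&gt; key. A small helper (the response values below are illustrative) keeps that detail out of application code.&lt;/p&gt;

```python
def document_fields(response):
    """Return the stored fields from a Fauna get() response.

    The Python driver returns a dict with 'ref', 'ts' and 'data' keys;
    the fields the application stored live under 'data'.
    """
    return response.get("data", {})

# Shape of a get() response for the doctor created earlier (values illustrative).
sample_response = {
    "ref": "doctors/181019942046968320",
    "ts": 1612812014000000,
    "data": {"last_name": "Fauc", "first_name": "John"},
}
fields = document_fields(sample_response)
```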

&lt;h3&gt;
  
  
  Update Record
&lt;/h3&gt;

&lt;p&gt;To demonstrate an update, assume there is a slight error in the previously created &lt;code&gt;doctor&lt;/code&gt; record.&lt;br&gt;
Below is the update syntax:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;updated_doc = db_client.query(  
 q.update( q.ref(q.collection("doctors"), "181388642312002080"), { "data": { "specialiation": "Cardiology", "license_number": "AGY5578199O-001" } } ))  
print(updated_doc)  
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Delete Records
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;record = db_client.query(  
 q.delete(q.ref(q.collection("doctors"), "182028742581742080")))  

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;With the above code snippets, you have demonstrated how to perform basic CRUD operations using the Python driver for Fauna.  &lt;/p&gt;

&lt;p&gt;Learn more about the various functions in the Fauna Query Language (FQL) &lt;a href="https://docs.fauna.com/fauna/current/api/fql/functions"&gt;here&lt;/a&gt;.  &lt;/p&gt;

&lt;h1&gt;
  
  
  Conclusion
&lt;/h1&gt;

&lt;p&gt;Content on cloud native transactional databases, and data APIs in general, is best consumed by backend and DevOps engineers.  &lt;/p&gt;

&lt;p&gt;It is important for developers to be well acquainted with cloud native tools such as FaunaDB to not only allow them to build and design better applications, but to also help them stay relevant in the fast paced ever changing world of tech.  &lt;/p&gt;

&lt;p&gt;Do not stop at this guide. Build on your knowledge base and further pursue the concepts introduced here, chiefly cloud native data APIs and FaunaDB. Further reading will deepen your understanding and grasp of the subject matter.&lt;br&gt;&lt;br&gt;
Below are some related topics to pursue:  &lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Develop a Flask or Django based app using FaunaDB as your data layer. You can further develop the sample app.
&lt;/li&gt;
&lt;li&gt;The Fauna Query Language (FQL), native to FaunaDB
&lt;/li&gt;
&lt;li&gt;Cloud native architectures and design
&lt;/li&gt;
&lt;li&gt;FaunaDB GraphQL layer for your apps and its pros and cons.
&lt;/li&gt;
&lt;li&gt;Trade-offs between NoSQL and SQL data structures for your data stores&lt;/li&gt;
&lt;/ol&gt;

</description>
      <category>fauna</category>
      <category>python</category>
      <category>database</category>
      <category>cloudnative</category>
    </item>
  </channel>
</rss>
