<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: RGEv1L</title>
    <description>The latest articles on DEV Community by RGEv1L (@akhan).</description>
    <link>https://dev.to/akhan</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F639301%2F851b7686-dc49-4a46-a871-6de9a3a74856.jpeg</url>
      <title>DEV Community: RGEv1L</title>
      <link>https://dev.to/akhan</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/akhan"/>
    <language>en</language>
    <item>
      <title>Credit Card Purchasing Analysis [ML]</title>
      <dc:creator>RGEv1L</dc:creator>
      <pubDate>Wed, 30 Mar 2022 12:23:32 +0000</pubDate>
      <link>https://dev.to/akhan/credit-card-purchasing-analysis-ml-421o</link>
      <guid>https://dev.to/akhan/credit-card-purchasing-analysis-ml-421o</guid>
      <description>&lt;h2&gt;
  
  
  Introduction:
&lt;/h2&gt;

&lt;p&gt;We are going to analyze credit data and the purposes people use credit for, using the German credit card dataset published by UCI. The dataset contains the following fields:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Age (numeric)&lt;/li&gt;
&lt;li&gt;Sex (text: male, female)&lt;/li&gt;
&lt;li&gt;Job (numeric: 0 - unskilled and non-resident, 1 - unskilled and resident, 2 – skilled, 3 - highly skilled)&lt;/li&gt;
&lt;li&gt;Housing (text: own, rent, or free)&lt;/li&gt;
&lt;li&gt;Saving accounts (text - little, moderate, quite rich, rich)&lt;/li&gt;
&lt;li&gt;Checking account (numeric, in DM - Deutsche Mark)&lt;/li&gt;
&lt;li&gt;Credit amount (numeric, in DM)&lt;/li&gt;
&lt;li&gt;Duration (numeric, in months)&lt;/li&gt;
&lt;li&gt;Purpose (text: car, furniture/equipment, radio/TV, domestic appliances, repairs, education, business, vacation/others)&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This problem could be treated as multi-class classification, but we are going to approach it as a clustering problem. Why clustering? Because what we really want is a principled way to divide the underlying customers into groups. Market segmentation relies heavily on grouping customers to better understand each segment, and this dataset has many columns that describe a user's behavior: gender, how much credit the person has used, how much money is in their savings account, and so on. We are interested in drawing boundaries that let us deduce results per group, so that we can understand the general buying trend within each one. Then, when a new user enters the system, even a little bit of history is enough to place them in a group and pitch them a better-fitting loan scheme.&lt;/p&gt;

&lt;p&gt;The dataset looks like this after calling show():&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--E68iS5tl--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/yax69qzp9whnidzq0q1v.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--E68iS5tl--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/yax69qzp9whnidzq0q1v.png" alt="Image description" width="880" height="158"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Let’s check the minimum, maximum, and average “Age” of the users in the dataset.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--5dhsbzk7--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/a3vnb3y9ra5yow30fi8q.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--5dhsbzk7--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/a3vnb3y9ra5yow30fi8q.png" alt="Image description" width="368" height="262"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Let’s check the minimum, maximum, and average “Credit amount” spent by the users in the dataset.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--mmAxpnDN--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/1awuh7bkqz4xulam85oz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--mmAxpnDN--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/1awuh7bkqz4xulam85oz.png" alt="Image description" width="387" height="262"&gt;&lt;/a&gt;&lt;/p&gt;
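&lt;p&gt;(For readers following along: the checks above were done with Spark, but the same aggregation is easy to sketch in pandas as a stand-in. The column names follow the dataset listing; the values below are made up for illustration, not real records.)&lt;/p&gt;

```python
import pandas as pd

# Toy rows standing in for the German credit dataset (illustrative values only).
df = pd.DataFrame({
    "Age": [23, 45, 31, 67, 52],
    "Credit amount": [1200, 3500, 800, 6200, 2400],
})

# The same min/max/average checks as above, in pandas rather than Spark.
stats = df[["Age", "Credit amount"]].agg(["min", "max", "mean"])
print(stats)
```

&lt;p&gt;Spark’s agg() or describe() produces the same summary on the full dataset.&lt;/p&gt;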

&lt;p&gt;Let’s start with the most trivial categorization used in market segmentation, i.e., segmentation by gender, and see if that works.&lt;/p&gt;

&lt;p&gt;To start with, let’s check the average amount of credit spent by gender:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s---RhA5YLe--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/mpevmb44z9bqu3yid0kj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s---RhA5YLe--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/mpevmb44z9bqu3yid0kj.png" alt="Image description" width="381" height="262"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We can see a small difference in average credit spent. Let’s check the buying trends and the total credit spent on each category, for both genders:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--HNI2ukax--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/e0hjy9say5mjy9yvdhqs.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--HNI2ukax--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/e0hjy9say5mjy9yvdhqs.png" alt="Image description" width="469" height="262"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--sPmw3rFd--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/yqli20wstx79f3lli151.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--sPmw3rFd--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/yqli20wstx79f3lli151.png" alt="Image description" width="466" height="262"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We got two types of differentiation. The first is the total credit spent on each category; the second is that the 2nd and 3rd top categories differ between the genders. The first offset may be due to the difference in average credit spent, as seen above, but the second reflects a genuine difference in buying trends: for females the 2nd favorite category is “furniture/equipment”, while for males it is “radio/TV”.&lt;/p&gt;

&lt;p&gt;We got some differentiation, but not enough to base all of our loan recommendations on.&lt;/p&gt;

&lt;h2&gt;
  
  
  Let’s Start Clustering:
&lt;/h2&gt;

&lt;p&gt;We used Apache Spark as the framework and K-means for clustering the data. Since we have many categorical variables, such as Housing and Saving accounts, we first had to index these string columns to numeric classes using Spark’s “StringIndexer” and then transform them into feature vectors. These feature vectors were then fed to the K-means algorithm, which we trained for k = 2 up to k = 6. The error scores for each value of k are given below.&lt;/p&gt;
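&lt;p&gt;(The flow just described can be sketched outside Spark too. This is a scikit-learn/pandas stand-in, not the code the post actually ran: pd.factorize plays the role of StringIndexer, and the loop records the error, i.e. inertia, for k = 2 through 6. The column values are hypothetical.)&lt;/p&gt;

```python
import pandas as pd
from sklearn.cluster import KMeans

# Toy frame with a few of the dataset's columns (values are illustrative).
df = pd.DataFrame({
    "Sex":     ["male", "female", "male", "female", "male", "female"],
    "Housing": ["own", "rent", "free", "own", "rent", "own"],
    "Credit amount": [1200, 3500, 800, 6200, 2400, 900],
})

# Index the string columns to numeric codes, as Spark's StringIndexer does.
for col in ["Sex", "Housing"]:
    df[col] = pd.factorize(df[col])[0]

# Train K-means for k = 2 through 6 and record the error for an elbow plot.
errors = {}
for k in range(2, 7):
    model = KMeans(n_clusters=k, n_init=10, random_state=42).fit(df)
    errors[k] = model.inertia_
print(errors)
```

&lt;p&gt;Picking k where the error stops dropping sharply is exactly the elbow judgment made from the plot below.&lt;/p&gt;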

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--1Dsm79UW--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ee1emapogpql5ztf41gk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--1Dsm79UW--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ee1emapogpql5ztf41gk.png" alt="Image description" width="392" height="262"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We selected k = 3, since it drops the error significantly and further increases in k don’t provide much additional reduction. The results for k = 3 are:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--fuR9WG2Y--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/34ttkh671a1p4tfsvnq7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--fuR9WG2Y--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/34ttkh671a1p4tfsvnq7.png" alt="Image description" width="880" height="369"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Let’s visualize and find trends in Groups 0, 1, and 2.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--fyo4aDUT--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/si4kuuly51t8uaro43f1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--fyo4aDUT--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/si4kuuly51t8uaro43f1.png" alt="Image description" width="468" height="262"&gt;&lt;/a&gt;&lt;br&gt;
 &lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--47_erojV--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/cht0jlbtiwju1w3845ny.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--47_erojV--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/cht0jlbtiwju1w3845ny.png" alt="Image description" width="468" height="262"&gt;&lt;/a&gt;&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--ezG8LO-E--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/x1m5rkf42orshgmozmrr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--ezG8LO-E--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/x1m5rkf42orshgmozmrr.png" alt="Image description" width="468" height="262"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Introducing clustering got us real advantages. The groups successfully eliminated the differentiation by average credit spent: each group now shows its own trends in credit spending. But here’s the catch. I looked up the total user spending per category in each group, and found one group, far larger in number, dominating all three groups in terms of credit spent on “cars”. So let’s explore the average credit spent on each category for each group instead.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--wBczR7z2--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/7fvxavn5sevzn22gtr27.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--wBczR7z2--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/7fvxavn5sevzn22gtr27.png" alt="Image description" width="468" height="262"&gt;&lt;/a&gt;&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Nz2XO58e--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/eqrix8skxya43wnogor3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Nz2XO58e--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/eqrix8skxya43wnogor3.png" alt="Image description" width="467" height="262"&gt;&lt;/a&gt;&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--D0HaUgmG--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/iyqsy8nkm9julh1he36r.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--D0HaUgmG--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/iyqsy8nkm9julh1he36r.png" alt="Image description" width="468" height="262"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Voilà: not only did each group spend differently on each category on average, this demarcation also lets us distinguish the most aggressively bought item in each category.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>GRAPH vs SQL vs NoSQL Part 2: Benchmarks</title>
      <dc:creator>RGEv1L</dc:creator>
      <pubDate>Fri, 21 Jan 2022 21:19:01 +0000</pubDate>
      <link>https://dev.to/akhan/graph-vs-sql-vs-nosql-part-2-benchmarks-k9i</link>
      <guid>https://dev.to/akhan/graph-vs-sql-vs-nosql-part-2-benchmarks-k9i</guid>
      <description>&lt;p&gt;Let’s move things out of theories and designs and check what actually these constructs can perform. So. I chose one database from each category to test out some standard and complex queries. Test databases are as follow:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Neo4j -&amp;gt; Graph&lt;/li&gt;
&lt;li&gt;PostgreSQL -&amp;gt; SQL&lt;/li&gt;
&lt;li&gt;Apache Cassandra -&amp;gt; NoSQL&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Licensing:&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;All three are flagship products in their categories. PostgreSQL and Cassandra are fully open source. Neo4j is open-core and offers a free Community Edition.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Schema:&lt;/strong&gt;
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Neo4j is schema-less, but it is better to define labels to group nodes for faster lookups. &lt;/li&gt;
&lt;li&gt;Cassandra is NoSQL but requires a schema up-front. &lt;/li&gt;
&lt;li&gt;PostgreSQL also requires a schema.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Data/Key Partitioning:&lt;/strong&gt;
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Neo4j doesn't support data partitioning, but labels provide node grouping for faster operations. Neo4j only supports vertical scaling, with read-replica support. &lt;/li&gt;
&lt;li&gt;PostgreSQL doesn’t have a built-in sharding mechanism either, so it offers vertical scaling with read-replica support only.&lt;/li&gt;
&lt;li&gt;Cassandra splits data across the cluster based on its primary or composite key, so both horizontal and vertical scaling are available.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Benchmark:&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Test Setup: t3.small, 2 vCPUs, 2 GB RAM, 10 GB EBS General Purpose SSD&lt;br&gt;
Test Instances: One instance for each DB Engine&lt;br&gt;
Test Dataset: 65.2 MB (Personality 2018)&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Results:&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--i8PeokUe--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ehtcxynal024vnovrflo.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--i8PeokUe--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ehtcxynal024vnovrflo.png" alt="Image description" width="482" height="290"&gt;&lt;/a&gt;&lt;br&gt;
This result shows the combined time to import both files into the respective databases. Cassandra took a long time: a good amount of searching revealed that there is a separate open-source tool for bulk-loading CSVs into Cassandra, but I used the built-in mechanisms for benchmarking purposes.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--dxFWdcbD--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/68okniilch5k0q6s3eyf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--dxFWdcbD--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/68okniilch5k0q6s3eyf.png" alt="Image description" width="482" height="290"&gt;&lt;/a&gt;&lt;br&gt;
This result shows the total time it took each of the three databases to count all available ratings. Neo4j did an exceptional job since the traversal stayed within the “RATINGS” label. Postgres also did great, since the traversal stayed within the “Ratings” table. Cassandra had to open up each range key for the traversal, which slowed it down a bit.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--UCRN9AB1--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/dtx2ovnltpg9na6ym1kz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--UCRN9AB1--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/dtx2ovnltpg9na6ym1kz.png" alt="Image description" width="482" height="290"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This result explores all ratings greater than “4”. Postgres and Neo4j yielded similar results thanks to contained traversal. Cassandra caught up quite well because of its distributed aggregate-function processing.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--5ywBIpsY--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/q7oi7wfpfc72pbzo3nmh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--5ywBIpsY--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/q7oi7wfpfc72pbzo3nmh.png" alt="Image description" width="482" height="290"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This traversal involves a single table access with one foreign-key match for a user in the Postgres “ratings” table. For Cassandra, it was just a composite-key match. Neo4j didn’t perform well, because it had to traverse the nodes and relationships for the first time; the same request, issued again, yielded times even lower than PostgreSQL’s. Neo4j is highly memory-centric and utilizes memory well.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--qigUj11e--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/wf2ycnw4cc7849bchuzi.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--qigUj11e--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/wf2ycnw4cc7849bchuzi.png" alt="Image description" width="482" height="290"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This involves the same access pattern as the example above, but with a heavier statistics function. As expected, the return times are quite similar.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--vsKXG4Mk--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/n6u58xuj78st7m9t3tvg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--vsKXG4Mk--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/n6u58xuj78st7m9t3tvg.png" alt="Image description" width="482" height="290"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This is a complex query with complex access patterns. For PostgreSQL it requires two table accesses: it has to collect the number of users with ratings greater than 3 in the “Ratings” table and agreeableness &amp;gt; 7 in the “Users” table, and it has to perform a join, hence taking more time. For Cassandra, it was two sort-key lookups, then collecting users with rating &amp;gt; 3 and agreeableness &amp;gt; 7 within the same table, hence relatively faster results than the SQL database. Neo4j performed marvelously thanks to its pre-embedding of nodes and relationships: although the query scans two sets of labels, “PERSON” and “RATINGS”, the traversal between these two node sets is pre-embedded by the relationship between them. This relationship embedding not only allows faster complex queries but also enables far more complex operations that neither of the other databases could perform.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Other Capabilities:&lt;/strong&gt;
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Cassandra is a columnar database that distributes its keys among shards to achieve horizontal scaling. Up to the point where your table has millions of rows, SQL solutions can work, but beyond that you will see degraded query performance. Cassandra is built for Big Data: it distributes data horizontally and also supports a huge number of columns.&lt;/li&gt;
&lt;li&gt;Neo4j ships with a graph-visualization browser tool that helps with deep analysis of your loaded data. Apart from that, Neo4j contains a rich library for AI-based graph analysis that helps dig deeper into the data. Neo4j also keeps a subset of warm data in memory for faster access.&lt;/li&gt;
&lt;/ul&gt;

</description>
    </item>
    <item>
      <title>Graph vs SQL vs NoSQL Part 1: Theory</title>
      <dc:creator>RGEv1L</dc:creator>
      <pubDate>Fri, 21 Jan 2022 21:13:54 +0000</pubDate>
      <link>https://dev.to/akhan/graph-vs-sql-vs-nosql-part-1-2cap</link>
      <guid>https://dev.to/akhan/graph-vs-sql-vs-nosql-part-1-2cap</guid>
      <description>&lt;h2&gt;
  
  
  &lt;strong&gt;SQL:&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;The age-old solution for persisting data in a queryable manner has stood the test of time: it started with Web 1.0 and is still serving in Web 2.0. SQL databases store data so that row placement takes precedence over columns, i.e., singular rows are treated as the unit of importance, which makes it possible to enforce foreign-key constraints across tables. This ensures strong consistency of keys within a table and across tables as well. Moreover, SQL provides many precise filters for operating on attributes within a table and complex joins for going beyond it. It also gives rich control over what goes into your table, i.e., you can define constraints other than your keys.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;NoSQL:&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;NoSQL originated with the birth of Web 2.0, which quickly became abundant with data at rest and in transit, and SQL struggled under the bulk of data moving in and out. NoSQL tackled the problem by being able to shard horizontally. To shard horizontally, the shards use hash maps to distribute table keys, and columns take precedence over rows in the access pattern. This had a reason: to bulk-load values, the row-by-row access pattern had to be removed. It solved two problems: bulk loading data by fetching entire columns, and better compression ratios, since column values resemble each other more than row values do. But data embedding and establishing relations across tables became a problem, since there is no foreign key to extend relations through. This was partially solved by the massive reduction in storage prices: NoSQL opted for storing related data within the same table, even if it meant replicating some attributes, to favor speed. NoSQL doesn’t provide constraints on attributes, so you get faster speed with a looser grip on the data than SQL.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Graph:&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Graph databases are a mix of the two. Based on graph theory, they use a linked-list-like data structure that captures qualities of both SQL and NoSQL databases. A graph database gives precedence to row-like operations but distributes entities by hashed node IDs. It provides NoSQL-like columnar distribution, yet has relationships that serve as foreign keys, establishing links between entities without replicating data. Complex join operations requiring foreign keys are replaced by already-established paths between nodes, called relationships. Labels are a columnar grouping of nodes into logical categories that speeds up columnar queries. It provides constraints on attributes as well, and while it is still schema-less like NoSQL, it offers the grip on data of SQL.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Virtual WorkSpaces for Remote Teams: Match your team’s needs without buying hardware</title>
      <dc:creator>RGEv1L</dc:creator>
      <pubDate>Fri, 21 Jan 2022 21:03:59 +0000</pubDate>
      <link>https://dev.to/akhan/virtual-workspaces-for-remote-teams-match-your-teams-needs-without-buying-hardware-228b</link>
      <guid>https://dev.to/akhan/virtual-workspaces-for-remote-teams-match-your-teams-needs-without-buying-hardware-228b</guid>
      <description>&lt;h2&gt;
  
  
  &lt;strong&gt;Intro:&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Building and managing remote teams can be painful if you are not equipped with the right tools. Matching the compute and storage needs of your team members has always been challenging: those needs are constantly changing, and providing IT support for remote laptops that have gone bad is a burden of its own. The big, fat laptops you just bought for the worst-case scenario aren’t going to keep up with resource demand for long, and your IT administrator won’t be happy traveling to remote workers’ homes to fix their laptops. So, how can you avoid buying expensive hardware and still equip your team to handle changing resource demands, with IT able to provide support? Moving to virtual environments is your best option.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Options:&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;All the premier cloud-service providers offer virtual workspaces to help remote teams provision scalable resources, on demand or reserved monthly/yearly. You can spin up a remote Windows or Linux environment in minutes and quickly scale it up and down based on what your team requires. AWS provides Linux-powered WorkSpaces starting from $21/month and Windows-powered WorkSpaces starting from $25/month. The cheapest workable, browsing-capable laptops that won’t drive you up the wall while working start from $250. With WorkSpaces you get a cloud environment that your IT person can manage and scale up and down with your team’s demand. But there is more to it.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Benefits:&lt;/strong&gt;
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Secure authorization&lt;/li&gt;
&lt;li&gt;Managed accounts&lt;/li&gt;
&lt;li&gt;Scale CPU/MEM/STORAGE on demand&lt;/li&gt;
&lt;li&gt;Automated backups&lt;/li&gt;
&lt;li&gt;Return Workspaces if not needed&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Limitations:&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;If your team headcount per region is low, e.g., 3 people joining from the Middle East, 2 from South America, 5 from North America, and 3 from India, spinning up domain controllers in each region will make the price skyrocket. To avoid that:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Go with AWS WorkDocs for directory service wherever it is available.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;If WorkDocs is not available, spin up an Active Directory service if the number of users in that region is greater than 5.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;If neither of the above conditions is met, you are far better off spinning up an EC2 instance.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Deep Dive:&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Managed WorkSpaces from cloud-service providers are a great choice if you don’t want to manage virtual workspaces yourself: everything from increasing CPUs, memory, and storage to automated backups and secure logins is a click away. But if you really want to double down on cost and performance, you can provision your own virtual machine: set up remote access procedures, set up users and their resource quotas, and you are good to go.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Stacked Workspaces vs Single user Workspace:&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Setting up a single workspace for each team member is certainly easier to manage and less of an administrative hurdle. However, it also tends to waste resources. Every individual workspace you provision carries its own unavoidable boilerplate: if you provision 50 managed workspaces, that is 50 cores and ~100 GB of RAM spent just running the operating system on each one. On top of that, a lesser-known fact is that users don’t consume 100% of the provisioned CPU and memory all the time; most of the time, the resources sit idle. It is common practice to provision a 16-core VM and assign 20 users to it. This works fine because 1 core is spared for the OS, and 15 cores are still overkill for 20 users working in Excel or browsing. So, to stack a team of 50, a 48-core VM is more than enough, even building for the worst-case scenario.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Let's discuss requirements:&lt;/strong&gt;
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;An AWS WorkSpace for Linux with 2 cores / 4 GB / 90 GB is about $29/month per user; keep in mind that only about 2.5–3 GB of RAM and 1 core are fully available to the end user. For 50 users, that would be $1,450/month.&lt;/li&gt;
&lt;li&gt;An AWS EC2 instance with 32 cores / 128 GB RAM / 5 TB is ~$1,200/month.&lt;/li&gt;
&lt;li&gt;The most budget-friendly laptops, at $250 apiece, come to $12,500 for 50 users.&lt;/li&gt;
&lt;/ul&gt;
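&lt;p&gt;(The totals above follow from quick arithmetic on the quoted prices; a sketch for checking them:)&lt;/p&gt;

```python
# Monthly cost comparison for a 50-person team, using the post's quoted prices.
team_size = 50

workspace_per_user = 29                             # AWS WorkSpaces Linux bundle, $/month
workspaces_total = workspace_per_user * team_size   # managed option, $/month

ec2_total = 1200                                    # one big shared EC2 instance, $/month

laptop_price = 250                                  # one-time hardware cost per user
laptops_total = laptop_price * team_size            # one-time, not monthly

print(workspaces_total, ec2_total, laptops_total)
```

&lt;p&gt;Note the laptop figure is a one-time cost, while the other two recur monthly, which is exactly the trade-off the rest of this post weighs.&lt;/p&gt;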

&lt;h2&gt;
  
  
  &lt;strong&gt;Additional Options:&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;What about the VM after working hours? Working hours are usually 8–10 out of 24, so a workspace you already paid for sits idle for two-thirds of the day, managed or not. With a single instance with a huge core count, you can put that time to good use: the same VM can run rendering, conversion, compression, ML, DL, and many other jobs that need large amounts of resources stacked in a single machine. You can’t do that with managed workspaces. That is the additional, not-so-talked-about advantage of having your own instance at your disposal.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Conclusion:&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;In the end, it depends on your exact requirements and how much responsibility you want to take on while managing resources for your remote team. Assigning laptops is still fine if you don’t care about budget and your team can take care of their devices. Managed workspaces offer ease of use and flexibility with no up-front cost. Managing your own VM is for when you want to use your money properly and extract almost everything you are paying for. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Additional Life-Saver Notes:&lt;/strong&gt;&lt;br&gt;
- Prepare a baseline for Windows Server and for Linux with a GUI (XFCE4), preinstall common packages like Firefox, 7-Zip, and VS Code, and don’t add any users. Create an AMI from it, and you can take it to whatever region you need for redeployment. Install additional packages later per user requirements.&lt;/p&gt;
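That workflow can be sketched with boto3 (the AWS SDK for Python). The instance ID, image name, and regions below are placeholders; the sketch assumes you have already prepared the baseline on a running template instance.

```python
def baseline_image_params(instance_id, version):
    """Arguments for baking the 'common packages, no users' baseline AMI."""
    return {
        "InstanceId": instance_id,
        "Name": f"workspace-baseline-{version}",
        "Description": "Linux + XFCE4 with firefox, 7zip, vscode; no user accounts",
    }

def bake_and_copy(instance_id, version, dest_region, src_region="us-east-1"):
    """Create the baseline AMI, then copy it to another region for redeployment."""
    import boto3  # imported here so the pure helper above needs no AWS SDK
    ec2 = boto3.client("ec2", region_name=src_region)
    image = ec2.create_image(**baseline_image_params(instance_id, version))
    return boto3.client("ec2", region_name=dest_region).copy_image(
        Name=f"workspace-baseline-{version}",
        SourceImageId=image["ImageId"],
        SourceRegion=src_region,
    )
```

Calling `bake_and_copy` requires AWS credentials; `baseline_image_params` is pure, so the image definition can be reviewed and tested offline.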

</description>
    </item>
    <item>
      <title>Super-Grouping Users with Big Five Personality Clustering</title>
      <dc:creator>RGEv1L</dc:creator>
      <pubDate>Tue, 18 Jan 2022 16:51:10 +0000</pubDate>
      <link>https://dev.to/akhan/super-grouping-users-with-big-five-personality-clustering-58ge</link>
      <guid>https://dev.to/akhan/super-grouping-users-with-big-five-personality-clustering-58ge</guid>
      <description>&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--ERLOQOOL--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/n1xxny6opak7wy32638u.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--ERLOQOOL--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/n1xxny6opak7wy32638u.png" alt="Image description" width="685" height="629"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Introduction&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Tracking human behavior is of tremendous importance, from understanding daily-life patterns to flagging the possibility of disease. Enhancing the user experience based on behavior is currently a hot research topic, and the web needs markers of some sort to suggest content based on a user’s personality type. The Big Five personality model has been the most successful empirical model for analyzing an individual’s personality. We can segregate users by personality type by asking questions curated from years of tested research in psychology. The model identifies five baseline traits and attributes characteristic behaviors to each. The traits are:&lt;br&gt;
    1. Conscientiousness&lt;br&gt;
    2. Openness&lt;br&gt;
    3. Extraversion&lt;br&gt;
    4. Agreeableness&lt;br&gt;
    5. Neuroticism&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Why Super-Grouping?&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Tracking human behavior from patterns found in questionnaires is a tedious job, but years of research in psychology have made it possible to extract those patterns. Grouping the outcomes is another job. Common sense tells us that we are all distinct in our behavior, yet we share many behavioral patterns and tend to focus only on the tiny differences. We can still be grouped on the basis of similar patterns; the Big Five simply provides an empirical scale for mapping them.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Methodology&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;This is a clustering problem: we can group the user base by their scores on each trait. The scale we are going to use is the “International Personality Item Pool – FFM (IPIP-FFM)-50”, which consists of 50 questions, each contributing to one of the five traits; the result is a score per trait. The dataset we are going to use is IPIP-FFM-data-8Nov2018.&lt;/p&gt;

&lt;p&gt;The machine learning approach is unsupervised clustering, which segregates the user base by the Euclidean distance between trait scores. The question at hand is whether we can do this with the K-Means algorithm, where K has to be chosen in advance; the research does not point to a fixed value of K. An alternative is the X-Means algorithm, where the algorithm itself determines the number of clusters based on an appropriate differentiation criterion. The scoring key for the items is from the official ipip.ori.org website &lt;a href="https://ipip.ori.org/newBigFive5broadKey.htm"&gt;https://ipip.ori.org/newBigFive5broadKey.htm&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Jupyter Notebook and Python&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Jupyter makes it easy to break the code into chunks of functional code with meaningful incremental output. The Python libraries used are:&lt;br&gt;
    • scikit-learn for clustering&lt;br&gt;
    • NumPy for linear algebra&lt;br&gt;
    • Matplotlib for plotting 2D graphs&lt;br&gt;
    • Seaborn for high-level visualization&lt;br&gt;
    • os for directory read/write ops&lt;br&gt;
    • Yellowbrick for distortion-score visualization&lt;/p&gt;

&lt;p&gt;The rest of the Python code from the Jupyter notebook is attached with the project files.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Demonstration&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;First, we open the CSV file and remove rows with partial entries, so that we operate on complete records only. After this operation, 1,013,481 participants remain. &lt;/p&gt;
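The cleansing step can be sketched with pandas on a toy frame; the real IPIP file has 50 answer columns, so the columns below are just a stand-in:

```python
import pandas as pd

# Toy stand-in for IPIP-FFM-data-8Nov2018; the real file has 50 answer columns.
raw = pd.DataFrame({
    "EXT1": [4, 2, None, 5],
    "AGR1": [3, None, 1, 2],
    "country": ["US", "GB", "US", "IN"],
})

# Drop participants with partial entries so we operate on complete records only.
complete = raw.dropna()
print(len(complete))  # 2 complete participants remain in this toy frame
```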

&lt;p&gt;Then we plot the number of users per country, with users on the y-axis and countries on the x-axis:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--rfi7sXpJ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/zvpv94lrm991nci187uw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--rfi7sXpJ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/zvpv94lrm991nci187uw.png" alt="Image description" width="691" height="310"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We use scikit-learn to form clusters and Yellowbrick to visualize the cluster distortion for values of K from 2 to 15. As the graph below shows, the best value is K = 5.&lt;/p&gt;
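Yellowbrick's `KElbowVisualizer` essentially wraps the loop below; here is a plain scikit-learn sketch of the same distortion scan, run on synthetic trait scores (on the real data, `scores` would be the 1M-row score matrix):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
scores = rng.normal(size=(200, 5))  # synthetic stand-in for per-trait scores

# Fit K-Means for each K and record the distortion (inertia); the "elbow"
# is where the decrease starts to flatten out.
inertias = {k: KMeans(n_clusters=k, n_init=10, random_state=0).fit(scores).inertia_
            for k in range(2, 16)}
# Inertia always decreases as K grows; pick the K where the drop levels off.
```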

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--mstsgdaB--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/hanaqrv88krub417kt8a.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--mstsgdaB--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/hanaqrv88krub417kt8a.png" alt="Image description" width="692" height="413"&gt;&lt;/a&gt;&lt;br&gt;
Using K = 5 for the personality clusters, we train the model by feeding the dataset to the K-Means clustering algorithm. In the end, each user is assigned to a cluster group.&lt;/p&gt;
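The final fit can be sketched the same way; again, `scores` below is a synthetic stand-in for the participants' five trait scores:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
scores = rng.normal(size=(200, 5))  # stand-in for the five trait scores

km = KMeans(n_clusters=5, n_init=10, random_state=0).fit(scores)
labels = km.labels_            # cluster assignment for each participant
sizes = np.bincount(labels)    # number of unique users in each cluster
centers = km.cluster_centers_  # average trait profile of each cluster
```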

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--BKd7oReJ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/iniituci0j3x4jltsbv1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--BKd7oReJ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/iniituci0j3x4jltsbv1.png" alt="Image description" width="645" height="180"&gt;&lt;/a&gt;&lt;br&gt;
Number of unique users in each cluster.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--9G5d6E6x--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/gdfswla2chfb7fzqk9gk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--9G5d6E6x--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/gdfswla2chfb7fzqk9gk.png" alt="Image description" width="154" height="122"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Average psychometric scores for each cluster&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--IrfHurAv--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/hxlchuolmvey1bnym8w7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--IrfHurAv--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/hxlchuolmvey1bnym8w7.png" alt="Image description" width="687" height="166"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Visual representation of the range of average scores for the five psychometric traits in each cluster.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--bpmiTAmA--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/lmk22pjmsu4w8cvzb504.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--bpmiTAmA--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/lmk22pjmsu4w8cvzb504.png" alt="Image description" width="689" height="259"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Dispersion of trait scores for users in each cluster.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--84Fqem4D--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/2eafnaqfh97ayg6ykfwk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--84Fqem4D--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/2eafnaqfh97ayg6ykfwk.png" alt="Image description" width="691" height="623"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Conclusion&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;These graphs of dispersion and average trait scores show that we can group users by similar behavioral patterns across everyday scenarios. Much of the personality literature treats each psychometric trait as an individual marker for grouping users, but casting it as a clustering problem shows that the combined range of all five trait scores is better suited to grouping a user base: it draws sharper distinctions by mapping a user’s full behavior rather than a single trait. From there, we can segment the user base for purposes such as recommendation systems. Each cluster reflects a particular behavior system with respect to shopping trends, movie preferences, and so on, which is quite a popular research problem nowadays.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Serverless vs Servers</title>
      <dc:creator>RGEv1L</dc:creator>
      <pubDate>Tue, 18 Jan 2022 16:13:08 +0000</pubDate>
      <link>https://dev.to/akhan/serverless-vs-servers-2gna</link>
      <guid>https://dev.to/akhan/serverless-vs-servers-2gna</guid>
      <description>&lt;p&gt;The traditional way of doing computing was, and still is, is by “spinning up a server -&amp;gt; loading the code -&amp;gt; executing the code” still works quite well. But what took the developers and system administrators by surprise was that you can go through these steps quite quickly based on demand.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Let me explain this idea.&lt;/strong&gt;&lt;br&gt;
For server-based deployment, you have to run a VM, configure it, load the application onto it, and scale it up, down, or horizontally based on usage. With serverless, you just hand over your code, specify a memory configuration, and you are good to go; each invocation of the application code is scaled and managed by the service provider. So here is what you need to focus on with each technology:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;For server-based approach:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Size of VM/VMs [CPU, RAM, NETWORK, STORAGE]&lt;/li&gt;
&lt;li&gt;OS&lt;/li&gt;
&lt;li&gt;OS configurations&lt;/li&gt;
&lt;li&gt;Application package installation&lt;/li&gt;
&lt;li&gt;Application code&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;For serverless approach:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Size of Function [Memory, Timeout]&lt;/li&gt;
&lt;li&gt;Application Code&lt;/li&gt;
&lt;/ul&gt;
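The serverless half of that list really is just application code plus a memory/timeout setting. A minimal, illustrative handler (the event shape here is made up, and memory and timeout live in the function's configuration rather than in the code):

```python
def handler(event, context=None):
    """Everything the provider needs; it scales each invocation for you."""
    numbers = event.get("numbers", [])
    return {"statusCode": 200, "body": sum(numbers)}

# Locally this is just a function call; deployed, every invocation is
# provisioned, scaled, and torn down by the provider.
print(handler({"numbers": [1, 2, 3]})["body"])  # 6
```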

&lt;p&gt;You can see the difference in design choice. As the saying goes, “With great power comes great responsibility.” Implicit in that statement is that you have chosen to pick up the ‘great responsibility’. My advice is to at least check, against some metrics, whether you actually want the ‘great power’ or not. Otherwise, it would be a ‘great waste’ of time and resources.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Let's write some metrics:&lt;/strong&gt;&lt;br&gt;
Do you expect your task to complete within the provider’s timeout? [AWS Lambda: 15 min, GCP Cloud Functions: 9 min, Azure Functions: 10 min]&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;If Yes: 1 point&lt;/li&gt;
&lt;li&gt;If No: Can you decouple your application into a queue-based, event-driven system?&lt;/li&gt;
&lt;li&gt;If Yes: 0.25 point&lt;/li&gt;
&lt;li&gt;If No: 0 point&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Is your application’s memory requirement under the service provider’s maximum? [AWS Lambda: 10 GB, Azure Functions: 1.5 GB, GCP Cloud Functions: 8 GB]&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;If Yes: 1 point&lt;/li&gt;
&lt;li&gt;If No: Can you split your job so that another function can pick up where the previous one left off?&lt;/li&gt;
&lt;li&gt;If Yes: 0.25 point &lt;/li&gt;
&lt;li&gt;If No: 0 point&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Is your application stateless?&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;If Yes: 1 point&lt;/li&gt;
&lt;li&gt;If No: Can you rewrite session-based part to a session-less JWT solution?&lt;/li&gt;
&lt;li&gt;If Yes: 0.25 point&lt;/li&gt;
&lt;li&gt;If No: Can you opt-in to store session data on a network file system or on a database?&lt;/li&gt;
&lt;li&gt;If Yes: 0.25 point&lt;/li&gt;
&lt;li&gt;If No: 0 point&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Does your application require another software appliance to operate (DB, SPARK, Compute Nodes etc.) ?&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;If No: 1 point&lt;/li&gt;
&lt;li&gt;If Yes: Can you opt in to a cloud-based solution that provides a similar capability (DynamoDB, EMR, AWS Batch)? &lt;/li&gt;
&lt;li&gt;If Yes: 0.25 point&lt;/li&gt;
&lt;li&gt;If No: 0 point&lt;/li&gt;
&lt;/ul&gt;
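The whole rubric can be folded into a tiny scoring helper; the question keys below are just shorthand labels for the four questions above:

```python
POINTS = {"yes": 1.0, "fallback": 0.25, "no": 0.0}

def serverless_score(answers):
    """Each answer is 'yes', 'fallback' (the 0.25-point workaround), or 'no'."""
    return sum(POINTS[a] for a in answers.values())

score = serverless_score({
    "finishes_under_timeout": "yes",
    "fits_memory_limit": "yes",
    "stateless": "fallback",       # e.g. sessions rewritten to JWT
    "no_extra_appliance": "yes",
})
print(score)  # 3.25: short of a perfect 4, so expect some extra effort
```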

&lt;p&gt;&lt;strong&gt;Results:&lt;/strong&gt;&lt;br&gt;
If you reach a perfect score of 4, go architect a serverless solution and start coding; no need to manage fleets of servers. But if your architecture requirements blocked you from reaching a perfect 4, you can still opt for a mixed solution. The fractional points reflect the extra effort needed to bind together a solution that can still leverage serverless with all its benefits, and how hard a fully serverless architecture will be depends on how far you are from 4. If you fall between 3 and 4, you may need some extra effort to architect and develop the solution: binding other cloud services to your application code, designing queue workflows that push and pull jobs, and so on. There is a bit of a learning curve in adopting the cloud service that is equivalent to, or better than, your existing software appliance. If you score 3 or below, you should definitely opt for the mixed strategy.  &lt;/p&gt;

&lt;p&gt;In the end, run what can be decoupled on serverless technology and provision the rest on servers. That is very handy and cost-saving: tasks with unpredictable workloads and tasks with predictable workloads can be divided quite easily, and you save a lot on provisioned compute, since the serverless fleet only costs money while customers are actually using it. The rest of the provisioned infrastructure can be scaled based on how many serverless functions are executing or how many messages are waiting in a queue.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Virtual Workspaces for Remote Teams: Match your team’s needs without buying hardware</title>
      <dc:creator>RGEv1L</dc:creator>
      <pubDate>Tue, 18 Jan 2022 15:41:17 +0000</pubDate>
      <link>https://dev.to/akhan/virtual-workspaces-for-remote-teams-match-your-teams-needs-without-buying-hardware-59oi</link>
      <guid>https://dev.to/akhan/virtual-workspaces-for-remote-teams-match-your-teams-needs-without-buying-hardware-59oi</guid>
      <description>&lt;h2&gt;
  
  
  &lt;strong&gt;Intro:&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Building and managing remote teams can be painful if you are not equipped with the right tools. Matching the compute and storage needs of your team members has always been challenging: those needs are constantly changing, and providing IT support for remote laptops that have gone bad is a headache of its own. The big, expensive laptops you just bought for the worst-case scenario aren’t going to keep up with resource demand for long, and your IT administrator won’t be happy traveling to remote employees’ homes to fix their laptops. So how can you avoid buying expensive hardware and still equip your team to handle changing resource demands, with IT support to match? Moving to virtual environments is your best option.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Options:&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;All the premier cloud-service providers offer virtual workspaces to help remote teams provision scalable resources, either on demand or reserved monthly/yearly. You can spin up a remote Windows or Linux environment in minutes and quickly scale it up and down based on what your team requires. AWS provides Linux-powered Workspaces starting from $21/month and Windows-powered Workspaces starting from $25/month. By comparison, the cheapest workable, browsing-capable laptops start from $250. With workspaces, you get a cloud environment your IT staff can manage and scale up and down with your team’s demand. However, there is more to it than that.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Benefits:&lt;/strong&gt;
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Secure authorization&lt;/li&gt;
&lt;li&gt;Managed accounts&lt;/li&gt;
&lt;li&gt;Scale CPU/MEM/STORAGE on demand&lt;/li&gt;
&lt;li&gt;Automated backups&lt;/li&gt;
&lt;li&gt;Return Workspaces if not needed&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Limitations:&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;If your head count per region is low (for example, 3 people in the Middle East, 2 in South America, 5 in North America, and 3 in India), spinning up domain controllers in each region will send the price sky-high. To avoid that:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Spin up a managed Active Directory service only if a region has more than 5 users.&lt;/li&gt;
&lt;li&gt;If not, you are far better off spinning up an EC2 instance instead.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Deep Dive:&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Managed workspaces, provided by cloud-service providers, are a great choice if you don’t want to administer virtual workspaces yourself: everything from adding CPUs, memory, and storage to automated backups and secure logins is a click away. But if you really want to double down on cost and performance, you can provision your own virtual machine: set up remote access, create users with resource quotas, and you are good to go.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Stacked Workspaces vs Single user Workspace:&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Setting up a single workspace for each team member is certainly easier to manage and less of an administrative hurdle. However, it also tends to waste resources. Every workspace you provision carries its own unavoidable baseline overhead: provision 50 managed workspaces, and roughly 50 cores and ~100 GB of RAM go to running the operating system on each one. On the other hand, a lesser-known fact is that users rarely use 100% of the provisioned CPU and memory; most of the time, the resources sit idle. It is common practice to provision a 16-core VM and assign 20 users to it. This works because one core is reserved for the OS, and the remaining 15 cores are still overkill for 20 users working in Excel or browsing the web. So, to stack a team of 50, a 48-core VM is more than enough, even when sizing for the worst-case scenario.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Let's discuss requirements:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;AWS Workspace for Linux with 2 cores / 4 GB / 90 GB is about $29 per user per month. Keep in mind that only 2.5–3 GB of RAM and one core are fully available to the end user. For 50 users, that comes to $1,450/month.&lt;/li&gt;
&lt;li&gt;AWS EC2 with 32 cores / 128 GB RAM / 5 TB storage is roughly $1,200/month.&lt;/li&gt;
&lt;li&gt;The most budget-friendly laptop, at a $250 price tag, comes to $12,500 for 50 users.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Additional Options:&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;What about the VM after working hours? Working hours typically cover 8–10 of the 24 hours in a day, so a workspace, managed or unmanaged, sits idle for roughly two-thirds of the day you already paid for. With a single high-core-count instance of your own, you can put that idle time to good use: run rendering, conversion, compression, ML, DL, and other jobs that need a lot of resources stacked in a single VM. You can’t do that with managed workspaces. That is the additional, rarely discussed advantage of having your own instance at your disposal.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Conclusion:&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;In the end, it depends on your exact requirements and how much responsibility you want to take on while managing resources for your remote teams. Assigning laptops is still fine if budget is no concern and your team can be trusted to take care of their devices. Managed workspaces offer ease of use and flexibility with no up-front cost. Managing your own VM is the right choice when you want to use your money properly and extract almost everything you are paying for.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Graph Network Analysis</title>
      <dc:creator>RGEv1L</dc:creator>
      <pubDate>Tue, 11 Jan 2022 14:07:14 +0000</pubDate>
      <link>https://dev.to/akhan/graph-network-analysis-2i7o</link>
      <guid>https://dev.to/akhan/graph-network-analysis-2i7o</guid>
      <description>&lt;h2&gt;
  
  
  &lt;strong&gt;Introduction:&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Market analysis has traditionally been the bedrock of decision-making for companies. It comes into play whenever a company wants to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Launch a new product&lt;/li&gt;
&lt;li&gt;Check performance of an existing product&lt;/li&gt;
&lt;li&gt;Find new markets&lt;/li&gt;
&lt;li&gt;Evaluate existing markets&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Market analysis is an abstraction layer from which you can extract trends in a demographic for a certain index (topic) and correlate them with your own use case. These correlations can be 1:1 if competitors are doing something similar in that market, or if the company already has a stake in it. If that is not the case, you have to extract a trend and judge how well the abstracted trend can predict your own performance in that demographic or market.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Dataset: Nashville Meetup Network&lt;/strong&gt;&lt;br&gt;
meetup.com is a website for people organizing and attending regular or semi-regular events (“meet-ups”). The relationships among users (who goes to which meetups) form a social network, ideal for graph-based analysis.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Problem Confinement:&lt;/strong&gt;&lt;br&gt;
Problems involving many highly connected factors should be handled in a graph. The rationale is that running that many join operations just to find a connection quickly becomes computationally limiting, and it makes little logical sense to reassemble, with many calls, what can easily be preserved directly within the data structure.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Loading Data in Neo4j:&lt;/strong&gt;&lt;br&gt;
Loading cities and states and adding the relationship between them.&lt;/p&gt;
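The load steps are only shown as screenshots, so the Cypher below is a hypothetical reconstruction: the labels, relationship type, and CSV file name are assumptions, but `LOAD CSV` and `MERGE` are standard Cypher, run here through the official Neo4j Python driver.

```python
# Assumed reconstruction of the cities/states load; the real queries are in
# the screenshots, so names here are illustrative.
LOAD_CITIES_AND_STATES = """
LOAD CSV WITH HEADERS FROM 'file:///cities.csv' AS row
MERGE (s:State {name: row.state})
MERGE (c:City {name: row.city})
MERGE (c)-[:IN_STATE]->(s)
"""

def load_cities(uri, auth):
    from neo4j import GraphDatabase  # official driver, optional dependency here
    with GraphDatabase.driver(uri, auth=auth) as driver:
        driver.execute_query(LOAD_CITIES_AND_STATES)
```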

&lt;p&gt;Loading members and adding their relation to the cities they live in&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--7FO52RaP--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/1it9nhy15c9a2cv5zg3f.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--7FO52RaP--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/1it9nhy15c9a2cv5zg3f.png" alt="Image description" width="880" height="171"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Loading Groups&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--soctAaZt--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/6v0linooky4j0rqie9l3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--soctAaZt--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/6v0linooky4j0rqie9l3.png" alt="Image description" width="880" height="112"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Loading Categories of Groups&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--_1EiUdwP--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/gbslezo8mc7lzw3jepca.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--_1EiUdwP--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/gbslezo8mc7lzw3jepca.png" alt="Image description" width="880" height="209"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Adding Group relation with each category&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Ebe6og2A--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ep0uioosrzops1wooiiw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Ebe6og2A--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ep0uioosrzops1wooiiw.png" alt="Image description" width="880" height="170"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Loading Organizing Members and attaching their “Organized” relationship with “Group”.&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--t6lX1Jua--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/lqdv87iqo3fl8o1a9jfz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--t6lX1Jua--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/lqdv87iqo3fl8o1a9jfz.png" alt="Image description" width="880" height="241"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Loading events and adding a “Conducted” relationship from the group that conducted each event.&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--GwoAw_8O--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/xv0pksbdhr5k5nucty0q.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--GwoAw_8O--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/xv0pksbdhr5k5nucty0q.png" alt="Image description" width="880" height="222"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Loading Members that participated in Events in groups&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--RIeBJ7PW--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/1rr3fg2goyuy9ws65lu6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--RIeBJ7PW--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/1rr3fg2goyuy9ws65lu6.png" alt="Image description" width="880" height="261"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Loading Total Number of Members for each Group.&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--MdGto7To--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/c1hwt5e8pcyigkkh0rhh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--MdGto7To--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/c1hwt5e8pcyigkkh0rhh.png" alt="Image description" width="880" height="129"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Visualizing Graph Schema:&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--AOIuKvuY--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/96b2eqvw7u67fxshot30.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--AOIuKvuY--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/96b2eqvw7u67fxshot30.png" alt="Image description" width="880" height="404"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;Note: I have added the member relationship to Group in both directions, because that makes DB calls much more optimized when traversing left or right from members.&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Querying:
&lt;/h2&gt;

&lt;p&gt;Let's check which group category has the most members:&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--5fZMOBh6--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/5ms0tzr9nu6osmx8bvgj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--5fZMOBh6--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/5ms0tzr9nu6osmx8bvgj.png" alt="Image description" width="880" height="327"&gt;&lt;/a&gt;&lt;/p&gt;
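A hypothetical sketch of that query: the relationship type names are assumptions (the actual Cypher is in the screenshot), run with the Neo4j 5.x Python driver's `execute_query`.

```python
# Relationship names below are assumed; adjust to the schema actually loaded.
TOP_CATEGORIES = """
MATCH (m:Member)-[:MEMBER_OF]->(g:Group)-[:BELONGS_TO]->(c:Category)
RETURN c.name AS category, count(DISTINCT m) AS members
ORDER BY members DESC
"""

def top_categories(uri, auth):
    from neo4j import GraphDatabase  # official driver
    with GraphDatabase.driver(uri, auth=auth) as driver:
        records, _, _ = driver.execute_query(TOP_CATEGORIES)
        return [(r["category"], r["members"]) for r in records]
```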

&lt;p&gt;Finding trends for cities (‘Franklin’)&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--H2ir0SgD--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/p4hj9vlqps0wor37qord.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--H2ir0SgD--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/p4hj9vlqps0wor37qord.png" alt="Image description" width="345" height="557"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Let’s find the groups with name ‘Cars &amp;amp; Motorcycles’ across all cities.&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--mkatEc0c--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/55dpglkzhz9flxp5qin5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--mkatEc0c--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/55dpglkzhz9flxp5qin5.png" alt="Image description" width="880" height="431"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Finding the number of events per city:&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--56b9bjYc--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/sj152xhohs3rel6bo74y.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--56b9bjYc--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/sj152xhohs3rel6bo74y.png" alt="Image description" width="333" height="509"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Finding ‘Tech’ events hosted in each of the cities:&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--trBh8I-e--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/sm02htsz2gsl9p6tgl1i.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--trBh8I-e--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/sm02htsz2gsl9p6tgl1i.png" alt="Image description" width="272" height="507"&gt;&lt;/a&gt;&lt;/p&gt;


&lt;h2&gt;
  
  
  Visual Inspection:
&lt;/h2&gt;

&lt;p&gt;Let's explore the group categories Games, Dancing, and Education, and see how many groups each of them has:&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--h9D9kuV5--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ag34dp6sdluvrvhyhaqk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--h9D9kuV5--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ag34dp6sdluvrvhyhaqk.png" alt="Image description" width="880" height="295"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Let’s follow a trail for a particular group like ‘GEEK by AKEIN Engineering’ in Education.&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Fkyt9Oy5--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/tpop45qx1ya01k2munf5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Fkyt9Oy5--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/tpop45qx1ya01k2munf5.png" alt="Image description" width="815" height="370"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We can see the ‘Members’ associated with the group, the Events held for it, and its organizer as well.&lt;/p&gt;

&lt;p&gt;Let’s pick a random person and see which ‘City’ they live in.&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Z3kh19x_--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/msiqf8h1e4ffscnwobiq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Z3kh19x_--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/msiqf8h1e4ffscnwobiq.png" alt="Image description" width="880" height="317"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;So, this person lives in Nashville and shares this location with 3 other group members.&lt;/p&gt;


&lt;h2&gt;
  
  
  Ending note:
&lt;/h2&gt;

&lt;p&gt;With graph technology, we can find trends both visually and programmatically. The bottom line is that preserving the connections within the structure is what matters most: we can run traversal queries, segment our problem into the areas we need, and connect those links when needed for a particular use case.&lt;/p&gt;

</description>
      <category>datascience</category>
    </item>
  </channel>
</rss>
