WhatsApp chats can be a goldmine for NLP projects, especially if you can get the data cleaned and into a format that's easy to work with.
I wanted to pull a few stats from my group and personal chats so I could build a 'user profile' for each of my contacts and learn a bit more about them.
There are 2 parts to this challenge:
- Getting the Data
- Applying NLP techniques
Getting the Data
The easiest way to get data out of WhatsApp is through its export chat feature. It writes the chat to a text log and lets you choose whether or not to include the media.
Analyzing media would mean handling documents, videos, images and so on, so I chose to leave it out for now.
After getting the logs, there are a few key things we need to extract:
- Username - So we can generate user profiles
- Date of message - So we can see on which days the user is most active
- Time of message - So we can see at what time of day the user is most active
- Deleted messages - How secretive is our user?
- Emojis used - People tend to have emojis they are highly biased towards
- Words used and their frequency - So we can generate word clouds
- Total user messages - Does she actually like you, or were you the one pushing the conversations?
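The extraction steps listed above can be sketched in a few lines of Python. This is not the code behind the hosted service, only a minimal illustration. It assumes an Android-style export where each message starts with something like `12/31/20, 9:15 PM - Mike: Hello`; the actual layout varies with platform and locale, so the regular expression, the deleted-message placeholders and the emoji ranges are all assumptions you would need to adjust. The names `parse_chat` and `build_profiles` are made up for this example.

```python
import re
from collections import Counter, defaultdict

# ASSUMPTION: Android-style header, e.g. "12/31/20, 9:15 PM - Mike: Hello".
HEADER_RE = re.compile(
    r"^(\d{1,2}/\d{1,2}/\d{2,4}), (\d{1,2}:\d{2}(?:\s?[APap][Mm])?) - (.*)$"
)
# ASSUMPTION: English placeholder texts left behind by deleted messages.
DELETED_MARKERS = {"This message was deleted", "You deleted this message"}
# Rough emoji ranges only; far from exhaustive.
EMOJI_RE = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF]")
WORD_RE = re.compile(r"[a-z']+")


def parse_chat(text):
    """Return a list of dicts with user, date, time and message."""
    messages = []
    current = None
    for line in text.splitlines():
        match = HEADER_RE.match(line)
        if match:
            date, time, rest = match.groups()
            user, sep, body = rest.partition(": ")
            if not sep:
                # No "user: " part, so treat it as a system notice and skip it.
                current = None
                continue
            current = {"user": user, "date": date, "time": time, "message": body}
            messages.append(current)
        elif current is not None:
            # A line without a header continues the previous message.
            current["message"] += "\n" + line
    return messages


def build_profiles(messages):
    """Aggregate per-user totals, deleted counts, emoji and word frequencies."""
    profiles = defaultdict(
        lambda: {"total": 0, "deleted": 0, "emojis": Counter(), "words": Counter()}
    )
    for msg in messages:
        profile = profiles[msg["user"]]
        profile["total"] += 1
        if msg["message"] in DELETED_MARKERS:
            profile["deleted"] += 1
            continue
        profile["emojis"].update(EMOJI_RE.findall(msg["message"]))
        profile["words"].update(WORD_RE.findall(msg["message"].lower()))
    return dict(profiles)
```

Calling `build_profiles(parse_chat(open("chat.txt", encoding="utf-8").read()))` on an export would give one dictionary per username, which is enough to feed a word cloud or an activity chart.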
My implementation of this is live at whatsapp-analyze.herokuapp.com/data
and can be used by sending a POST request with a txt file in the body, using 'input' as the key.
The code can be hosted locally too: Github
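A request like the one described could look roughly like the sketch below, which uses only the Python standard library. Several things here are assumptions: that the endpoint is reachable over HTTPS, that it expects the file as `multipart/form-data`, and that the filename `chat.txt` is acceptable. The response format is not documented in this post, so the sketch just returns the raw bytes. `build_multipart` and `post_chat` are hypothetical helper names.

```python
import urllib.request
import uuid

# ASSUMPTION: HTTPS scheme; the post only gives the host and path.
API_URL = "https://whatsapp-analyze.herokuapp.com/data"


def build_multipart(field, filename, content):
    """Encode one file as multipart/form-data; returns (body, content_type)."""
    boundary = uuid.uuid4().hex
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{field}"; filename="{filename}"\r\n'
        "Content-Type: text/plain\r\n\r\n"
    ).encode("utf-8")
    tail = f"\r\n--{boundary}--\r\n".encode("utf-8")
    return head + content + tail, f"multipart/form-data; boundary={boundary}"


def post_chat(path, url=API_URL):
    """Upload an exported chat under the 'input' key and return the raw response."""
    with open(path, "rb") as handle:
        # ASSUMPTION: the service accepts a multipart file upload.
        body, content_type = build_multipart("input", "chat.txt", handle.read())
    request = urllib.request.Request(
        url, data=body, headers={"Content-Type": content_type}, method="POST"
    )
    with urllib.request.urlopen(request) as response:
        return response.read()


# Usage (needs network access and an exported chat on disk):
# print(post_chat("WhatsApp Chat with Mike.txt"))
```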
Applying NLP techniques
NLP techniques that can be applied include:
- Sentiment Analysis - Paired with the date and time data, this can produce some interesting results.
- POS Tagging - To extract the most commonly discussed topics.
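To give a feel for how sentiment and time data could be paired, here is a toy sketch that averages a sentiment score per hour of the day. The word lists are a tiny hand-made lexicon invented for this example, and the `"%I:%M %p"` time format assumes 12-hour timestamps such as `9:15 PM`; a real analysis would swap in a proper sentiment model or an established lexicon.

```python
from collections import defaultdict
from datetime import datetime

# ASSUMPTION: tiny hand-made lexicon, for illustration only.
POSITIVE = {"good", "great", "love", "happy", "nice", "thanks"}
NEGATIVE = {"bad", "hate", "sad", "angry", "sorry", "terrible"}


def score(message):
    """Positive minus negative word count; 0 means neutral or unknown."""
    # Splits on whitespace only, so punctuation attached to a word is not stripped.
    words = message.lower().split()
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)


def sentiment_by_hour(messages, time_format="%I:%M %p"):
    """Average score per hour of day (0-23).

    messages: iterable of (time_string, text) pairs.
    """
    scores_by_hour = defaultdict(list)
    for time_str, text in messages:
        hour = datetime.strptime(time_str, time_format).hour
        scores_by_hour[hour].append(score(text))
    return {hour: sum(s) / len(s) for hour, s in scores_by_hour.items()}
```

Plotting the resulting dictionary per user would show, for instance, whether someone tends to sound grumpier in the morning than late at night.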
Current Limitations
- Multilingual chats are hard for NLP applications to handle.
- Text speak: u cnt undrstd wht d usr wntd 2 say.
- The naming convention used by WhatsApp varies a lot and depends on the name you have assigned to the chat. For example, if you have a number (say 12345) saved as 'Mike', it will still show up as '12345' when others mention it in the chat. Recovering this relation is impossible unless someone enters it manually, which can be tiring for a group chat.
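The manual workaround for the naming problem amounts to a hand-maintained lookup table. A minimal sketch, assuming mentions appear in the log as `@` followed by digits (the exact form in a real export may differ) and using a made-up `ALIASES` mapping:

```python
import re

# ASSUMPTION: hand-maintained mapping from phone number to saved contact name.
ALIASES = {"12345": "Mike"}
# ASSUMPTION: mentions appear in the log as "@" followed by digits.
MENTION_RE = re.compile(r"@(\d+)")


def resolve_mentions(message, aliases=ALIASES):
    """Replace @<number> with @<name> when the number is known."""
    return MENTION_RE.sub(
        lambda m: "@" + aliases.get(m.group(1), m.group(1)), message
    )
```

Unknown numbers are left untouched, so the table can be filled in gradually.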
Conclusion
Even with these limitations for NLP applications, there are other statistics that can be extracted, and they will be added to the GitHub code later on, such as:
- Who replies fastest to a user in group chat
- Who receives the most replies
- Who mentions whom the most?
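As a rough idea of how the first of these could be computed, the sketch below uses a simple heuristic that is purely an assumption: a message counts as a reply to the message directly before it whenever the sender is different. Real group chats are messier than that, so treat the numbers as approximate. `average_reply_seconds` is a hypothetical name.

```python
from collections import defaultdict
from datetime import datetime


def average_reply_seconds(messages):
    """Average delay per (replier, original_sender) pair, in seconds.

    messages: chronological list of (datetime, user) tuples.
    HEURISTIC (assumption): a message replies to the one right before it
    if the two senders differ.
    """
    delays = defaultdict(list)
    for (prev_time, prev_user), (time, user) in zip(messages, messages[1:]):
        if user != prev_user:
            delays[(user, prev_user)].append((time - prev_time).total_seconds())
    return {pair: sum(d) / len(d) for pair, d in delays.items()}
```

Sorting the result by value for a given original sender would show who tends to answer that person fastest.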