Credit: ZeeMELT and Kyoorius
TV Serials and family dramas have a special place in every Indian’s heart. Nothing can ever replace the iconic “Dhum Ta Terenana” score that amplifies the tension in the air or the “Saas Bahu” dramatic tropes introduced into the Indian Entertainment Industry by these TV Serials.
From classics like “Kyunki Saas Bhi Kabhi Bahu Thi” and “Sasural Simar Ka” to modern entries like “Shark Tank”, this industry and this culture are ever-evolving and uniquely creative.
It's only fitting then, that when I found a dataset about Hindi TV Serials, I immediately decided to do this analysis and draw some interesting insights from it.
Let us start by looking at the dataset I am going to use for this analysis project. This dataset, titled “Hindi TV Serials”, contains almost 800 entries, each with the name of the serial, its cast, its IMDB rating and an overview.
It contains all the TV Serials aired on the following channels from 1988 to the present day (May 2022):
- Sab TV
- Sony TV
- Colors TV
- Zee TV
Technically, the dataset is distributed as a CSV file (181.76 kB) and has 736 unique values spread over the following columns:
Example Values from the Dataset
|Name|Ratings|genres|overview|Year|Cast|
|Kyunki Saas Bhi Kabhi Bahu Thi|1.6|"Comedy, Drama, Family"|A mother-in-law's struggle to put up with her three bahu's. The three bahu's have grown up sons. The bahu's sons start to get involved with having girlfriends and the bahu's try and break their relationships up.|2000–2008|"Smriti Malhotra-Irani ,Ronit Roy ,Amar Upadhyay ,Sudha Shivpuri"|
|Kahaani Ghar Ghar Kii|2.1|Drama|"The show explored the worlds of its protagonists Parvati Aggarwal and Om Aggarwal, who live in a joint family where by Parvati is an ideal daughter-in-law of Aggarwal family and Om the ideal son."|2000–2008|"Sakshi Tanwar ,Kiran Karmarkar ,Mita Vashisht ,Ali Asgar"|
I will be analyzing the relationships and the insights that each column provides once it is properly cleaned and arranged.
Setting up the Environment
I start by importing the necessary modules for this project. Then the dataset is imported into the environment through the read_csv() function:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

dfmain = pd.read_csv("Hindi TV Serials.csv")
The IMDB ratings
The IMDB ratings are going to be very important throughout this analysis as a way to judge the quality and popularity of a TV Show whenever applicable.
But before we dive into how other parameters relate to and affect the IMDB rating of a show, let us look at these ratings on their own.
Top 5 shows by IMDB ratings
We use the sort_values() function to get an output of the top shows according to their IMDB ratings.
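The exact call is not shown in this write-up, so here is a minimal sketch of what it might look like. It uses a tiny made-up DataFrame in place of dfmain: the column names Name and Ratings follow the dataset, but the rows and the names df_demo and top_shows are purely illustrative.

```python
import pandas as pd

# Made-up stand-in for dfmain; only the column names mirror the real dataset.
df_demo = pd.DataFrame({
    "Name": ["Show A", "Show B", "Show C", "Show D"],
    "Ratings": [7.1, 9.4, None, 8.2],
})

# Sort by rating, highest first. Rows with a missing rating are placed last by default.
top_shows = df_demo.sort_values("Ratings", ascending=False)
print(top_shows.head(3))
```

On the real data the same idea would be applied to dfmain instead of df_demo.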
                                    Name  Ratings  ...       Year                                               Cast
407                  Mitegi Laxman Rekha      9.7  ...       2018  Aayesha Vindhara ,Ankita Goraya ,Rajeev Saxena...
242                    Shobha Somnath Ki      9.4  ...  2011–2012  Ashnoor Kaur ,Tarun Khanna ,Joy Rattan Singh M...
79                        Love U Zindagi      9.4  ...       2011                                                NaN
586                      Wagle Ki Duniya      9.2  ...      2021–  Sumeet Raghavan ,Pariva Pranati ,Sheehan Kapah...
742  Jagannath Aur Purvi Ki Dosti Anokhi      9.2  ...      2022–  Rajendra Gupta ,Sushmita Mukherjee ,Ismeet Koh...
..                                   ...      ...  ...        ...                                                ...
(remaining output omitted due to irrelevancy)
As is clearly discernible, the top 5 shows according to their ratings are:
- Mitegi Laxman Rekha (9.7)
- Shobha Somnath Ki (9.4)
- Love U Zindagi (9.4)
- Wagle Ki Duniya (9.2)
- Jagannath Aur Purvi Ki Dosti Anokhi (9.2)
Well I am not sure I agree with these results but well if you say so IMDB, if you say so...
The Cast and The Artists
Analyzing the cast column can provide some interesting statistics to look at, but there is a serious problem that limits us from using it to any useful extent.
The problem is the format in which these values are stored in the dataset.
For example, take the value of the "Cast" column in the row for Shobha Somnath Ki:
|Ashnoor Kaur ,Tarun Khanna ,Joy Rattan Singh Mathur ,Sandeep Arora|
This value is troublesome as it is stored as a single <str> type object and thus it is not possible to calculate or discern any data for individual cast members.
Cleaning Data: Solving the Cast Problem
Thankfully, as elaborated by Max Hilsdorf in his Medium blog, the string object present in the cell can be converted into a list object, and subsequently into a one-dimensional data type that allows functions like groupby() to work.
But his solution does not apply to our problem without extensive modifications, as the values we wish to convert to a list do not have any pre-existing list-based syntax. Therefore we need to convert each cell in the Cast column into a value based on list syntax, i.e. a value such as
Ashnoor Kaur ,Tarun Khanna ,Joy Rattan Singh Mathur ,Sandeep Arora
has to become
["Ashnoor Kaur","Tarun Khanna","Joy Rattan Singh Mathur","Sandeep Arora"]
We can implement this by writing a function that takes input in the format that we have, adds the square brackets and the quotation marks, and returns it in the format that we need. This is my implementation of such a function:
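def clean_artist_list(list_):
    # Turn 'A ,B ,C' into the list-literal string '["A","B","C"]'.
    if type(list_) is str:
        list_ = "[" + list_ + "]"
        list_ = list_.replace(',', '","')
        list_ = list_.replace('[', '["')
        list_ = list_.replace(']', '"]')
        list_ = list_.replace(' "', '"')  # drop the space left before each comma
        return list_
    else:
        # Non-string cells (e.g. NaN) become an empty list literal.
        # An empty string here would make eval() raise a SyntaxError.
        return "[]"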
This function also takes care to handle and replace any disruptive data. I mainly encountered some float values (missing entries, which pandas reads as NaN) that threw errors because they could not be treated like strings.
After applying this function and the Python eval() function, we have the required list datatypes.
dfmain["Cast"] = dfmain["Cast"].apply(clean_artist_list)
dfmain["Cast"] = dfmain["Cast"].apply(eval)
Before proceeding we also need to create the function needed to convert these 2D lists to 1D. For that we will use:
def to_1D(series):
    return pd.Series([x for _list in series for x in _list])
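As an aside, the same flattening can be sketched without building list syntax by hand or calling eval(), using pandas' own str.split() and explode(). This is an alternative, not the approach used in this project, and the sample values and variable names below are invented for illustration.

```python
import pandas as pd

# Invented sample in the same "A ,B ,C" format as the Cast column.
cast = pd.Series(["Actor One ,Actor Two ,Actor Three", None, "Actor Two ,Actor Four"])

# Split on commas into real Python lists; missing values stay missing.
cast_lists = cast.str.split(",")

# explode() gives one row per list element and keeps the original row index,
# strip() removes the stray spaces, dropna() discards the missing entry.
cast_flat = cast_lists.explode().str.strip().dropna()
print(cast_flat)
```

Because explode() keeps the original row index, each name can still be traced back to the show it came from.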
Top Rated Artist
Now that we can use the Cast data properly, let's find out which artist has the best average IMDB rating for the shows they worked in.
# numeric_only=True keeps only numeric columns such as Ratings (required on recent pandas versions)
df_cast_imdb = dfmain.groupby(to_1D(dfmain["Cast"])).mean(numeric_only=True)
print(df_cast_imdb.sort_values(["Ratings"], ascending=False))
                Ratings
Tusharr Khanna      9.2
Sahil Mehta         9.2
Vrajesh Hirjee      9.2
Gautami Kapoor      9.1
Vaidehi Amrute      9.1
...                 ...
(remaining output omitted due to irrelevancy)
The artist with the best mean IMDB rating for his shows is Tusharr Khanna. He has worked in "Pyaar Tune Kia Kya", "Piyaa Albela" and "Bekaboo".
This however does not necessarily reflect any superiority in acting or talent, but it may show (at least to people who believe in it) some signs of luck an artist brings to a set.
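One thing worth double-checking in this step: to_1D() returns a Series with a fresh 0, 1, 2, ... index, and groupby() aligns a Series key with the DataFrame's own index, so a show may end up paired with an artist from a different row. If that applies here, the per-artist means would change. A hedged sketch of an alternative that keeps every artist attached to their own show uses DataFrame.explode(); the toy data and the names df_toy, exploded and mean_by_artist are made up for illustration.

```python
import pandas as pd

# Toy data: three shows, each with a rating and a list of cast members.
df_toy = pd.DataFrame({
    "Name": ["Show A", "Show B", "Show C"],
    "Ratings": [8.0, 6.0, 9.0],
    "Cast": [["Actor One", "Actor Two"], ["Actor Two"], ["Actor Three", "Actor One"]],
})

# One row per (show, artist) pair; the rating travels with the artist.
exploded = df_toy.explode("Cast")

# Mean rating of the shows each artist appears in, best first.
mean_by_artist = exploded.groupby("Cast")["Ratings"].mean().sort_values(ascending=False)
print(mean_by_artist)
```

Here "Actor One" appears in shows rated 8.0 and 9.0, so the mean is 8.5, which is easy to verify by hand.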
Most Experienced Artist
Now let us move to a more concrete relation and find out which actor has worked in the most TV shows.
It should be noted that this dataset only lists the leading cast members in the Cast column, and thus artists with minor roles are not properly recognized in this analysis.
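The code for this count is not shown in the post. Presumably it flattens the Cast lists with to_1D() and calls value_counts() on the result; a minimal sketch of that idea with made-up data (the names cast_lists and show_counts are illustrative):

```python
import pandas as pd

def to_1D(series):
    # Flatten a Series of lists into a single Series of elements.
    return pd.Series([x for _list in series for x in _list])

# Made-up cast lists, including one empty list for a show with no cast data.
cast_lists = pd.Series([
    ["Actor One", "Actor Two"],
    ["Actor Two"],
    [],
    ["Actor Two", "Actor Three"],
])

# Number of shows each artist appears in, most frequent first.
show_counts = to_1D(cast_lists).value_counts()
print(show_counts.head())
```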
Ronit Roy          9
Jennifer Winget    8
Seema Kapoor       7
Sangeeta Ghosh     7
Shahab Khan        7
..
(remaining output omitted due to irrelevancy)
Ronit Roy, having worked in 9 shows, comes out as the most experienced artist in this dataset. No wonder I see him in every other serious father type role.
It's either comedy (the family kind) or drama (also the family kind) with Indian TV Serials. But don't take my word for it, let us see for ourselves the genre dynamics of Indian TV.
Cleaning Data: Genre
Genres also face the same problem as we faced above with artists. There is a small edit made to handle redundancies due to whitespace characters.
def clean_genre_list(list_):
    if type(list_) is str:
        list_ = "[" + list_ + "]"
        list_ = list_.replace(',', '","')
        list_ = list_.replace('[', '["')
        list_ = list_.replace(']', '"]')
        list_ = list_.replace(' "', '"')
        list_ = list_.replace(" ", "")  # remove the remaining whitespace
        return list_
    else:
        # Empty list literal, so that eval() does not fail on missing values.
        return "[]"
It is then applied in the same way as the Cast solution.
dfmain["genres"] = dfmain["genres"].apply(clean_genre_list)
dfmain["genres"] = dfmain["genres"].apply(eval)
Most Acclaimed Genre
First let's look at which genre claims the best mean IMDB rating and thus garners the best audience response.
df_genre_imdb = dfmain.groupby(to_1D(dfmain["genres"])).mean(numeric_only=True)
print(df_genre_imdb.sort_values(["Ratings"], ascending=False))
            Ratings
War        6.900000
Horror     6.684211
Adventure  6.680000
Biography  6.650000
Sport      6.500000
Family     6.443478
Crime      6.271429
History    6.162500
Action     5.966667
Comedy     5.961644
(remaining output omitted due to irrelevancy)
Humans do love war, huh.
Next let's look at which genre the creators love the most and thus base the most shows around.
df_genre_count = to_1D(dfmain["genres"]).value_counts()
print(df_genre_count)
df_genre_count.plot(kind='bar')
plt.show()
Instead of the text output, a visual representation is more suitable here, so we generate a bar graph using the plot() function.
So THAT is why Indian households end up being so dramatic...
Shows like "Sarabhai vs Sarabhai" were definitely far ahead of their time. But let's look at how time affected the rest of Indian TV.
Cleaning Data: Years
To make use of the data in the Year column, we need to convert it from its original haphazard form into something usable.
I created two new columns based on the Years column:
- First Year: This column tracks the year in which the show started airing.
- Years Run: This column tracks how long a show ran.
These columns were created with the following code:
def findstart(list_):
    # The first four characters of the Year string are the starting year.
    if type(list_) is str:
        list_ = list_[:4]
        return list_
    else:
        return ""

def duration(list_):
    # Only complete ranges such as "2000–2008" (9 characters) give a duration.
    if type(list_) is str:
        if len(list_) == 9 and list_ != "I":
            l1 = int(list_[:4])
            l2 = int(list_[5:])
            return l2 - l1
        else:
            return 0
    else:
        return 0

dfmain["First Year"] = dfmain["Year"].apply(findstart)
dfmain["Years Run"] = dfmain["Year"].apply(duration)
The code was made to handle edge cases like wrong datatype and the weird "I XX" values in the Year column.
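The same two columns can also be sketched with a regular expression, which makes the handled formats explicit. This is an alternative under assumptions, not the code used above: it assumes Year values look like "2000–2008", "2021–", "2018" or "I 2011", and the helper name parse_years is hypothetical. Unlike findstart(), it returns the year as an integer and skips the leading "I".

```python
import re
import pandas as pd

# A four-digit start year, optionally followed by a dash and an optional end year.
YEAR_RE = re.compile(r"(\d{4})(?:\s*[–-]\s*(\d{4})?)?")

def parse_years(value):
    """Return (first_year, years_run); years_run is 0 when no end year is given."""
    if not isinstance(value, str):
        return (None, 0)
    match = YEAR_RE.search(value)
    if match is None:
        return (None, 0)
    start = int(match.group(1))
    end = int(match.group(2)) if match.group(2) else start
    return (start, end - start)

# Assumed sample formats, including a missing value.
years = pd.Series(["2000–2008", "2021–", "2018", "I 2011", None])
parsed = years.apply(parse_years)
print(parsed.tolist())
```

As in the original duration() function, shows that are still running or list a single year get a run length of 0.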
Which year was the busiest for the creators? We can use the following code to visualize the frequency of productions across years.
df_year_count = dfmain["First Year"].value_counts().sort_index()
df_year_count = df_year_count.iloc[:-4]  # removing the weird I values
df_year_count.plot(kind='bar')
plt.show()
2017 brought us shows like "Naagin 2", "Yeh Rishta Kya Kehlata Hai" and "Yeh Hein Mohabbatein". In total it records the production of 59 shows, compared to the runner-up, 2018, with 46 shows.
Longest Running Show
Indian shows like "Sasural Simar Ka" and "Kyunki Saas Bhi Kabhi Bahu Thi" are infamous for running long enough to be part of a late teenager's life since birth. So it is only natural to find out which show actually has the longest runtime.
print(dfmain.sort_values(["Years Run"], ascending=False))
                               Name  Ratings  ... First Year  Years Run
720                          C.I.D.      6.8  ...       1998         20
255                      Hum Paanch      8.2  ...       1995         11
536                        Yes Boss      8.4  ...       1999         10
0    Kyunki Saas Bhi Kabhi Bahu Thi      1.6  ...       2000          8
1             Kahaani Ghar Ghar Kii      2.1  ...       2000          8
..                              ...      ...  ...        ...        ...
(remaining output omitted due to irrelevancy)
"C.I.D." is no doubt part of every Indian's life. With iconic characters like ACP Pradyuman, Abhijit, and Daya, and a premise revolving around crime in India, it's not a surprise that it had a runtime of 20 years.
Analyzing the Overviews
Here comes the part I was most excited for. The written descriptions and overviews of these shows could surely provide me some very interesting insights that could have been the highlights of this project.
Unfortunately, after cleaning the data and writing the code to analyze it, it was shocking to see how useless the ordeal was. The data was neither sufficient nor of good enough quality to let me draw any real conclusions from it.
But I will still show the method I used to clean and try analyzing the data.
Cleaning Data: Description
Similar to the approach I took for the other columns, I decided to convert the string-based values to a list with every word being an element of the list. Additionally, the words were all turned to lowercase and any special characters were removed to minimize redundancy.
def clean_ovw_list(list_):
    if type(list_) is str:
        list_ = "[" + list_ + "]"
        # removing all the special characters
        list_ = list_.replace(',', '')
        list_ = list_.replace('.', '')
        list_ = list_.replace('"', '')
        list_ = list_.replace('(', '')
        list_ = list_.replace(')', '')
        list_ = list_.replace('-', '')
        list_ = list_.replace('»', '')
        # every space becomes a boundary between two list elements
        list_ = list_.replace(' ', '","')
        list_ = list_.replace('[', '["')
        list_ = list_.replace(']', '"]')
        list_ = list_.replace(' "', '"')
        # converting to lower case
        list_ = list_.lower()
        return list_
    else:
        # Empty list literal, so that eval() does not fail on missing values.
        return "[]"
The function was applied:
dfmain["overview"] = dfmain["overview"].apply(clean_ovw_list)
dfmain["overview"] = dfmain["overview"].apply(eval)
Now we have data that we can supposedly work on.
Usage of words over time
I planned to analyze multiple words like "love", "hate", "mother", "mother-in-law", "brother", etc. and their usage over time in the descriptions of TV Serials and even plot graphs showing interesting relations between the trends of different words.
This code gives the count of the words used grouped by years:
df_ovwcount = dfmain.groupby(['First Year',to_1D(dfmain["overview"])]).count().reset_index()
The following code could be used to plot how the occurrence of words varies over time, and also to contrast different words.
# Selecting and plotting the first word
df_selectedword = df_ovwcount[df_ovwcount["level_1"].isin(["First Word"])]
plt.plot(df_selectedword["First Year"], df_selectedword["overview"])

# Selecting and plotting the second word
df_selectedword = df_ovwcount[df_ovwcount["level_1"].isin(["Second Word"])]
plt.plot(df_selectedword["First Year"], df_selectedword["overview"])

plt.xticks(rotation=90)
plt.show()
A visualization generated through this code (provided better data) could have looked like this:
This data could have led to a lot of other interesting analysis too, but unfortunately it was not possible.
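For readers who want to try the per-year word count on better data, here is a hedged sketch of one way to compute it with explode() and groupby().size(). It is not the code used above; the toy DataFrame and the names df_toy, word_counts and family_by_year are made up for illustration.

```python
import pandas as pd

# Toy data: the start year of each show and its overview as a list of words.
df_toy = pd.DataFrame({
    "First Year": ["2000", "2000", "2001"],
    "overview": [["love", "family"], ["love"], ["family", "hate", "family"]],
})

# One row per (year, word) occurrence, then count occurrences per pair.
word_counts = (
    df_toy.explode("overview")
    .groupby(["First Year", "overview"])
    .size()
    .reset_index(name="count")
)

# Usage of a single word over the years.
family_by_year = word_counts[word_counts["overview"] == "family"]
print(family_by_year)
```

The resulting count column can be passed to plt.plot() against First Year in the same way as in the plotting snippet.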
Most Used Word
We can still draw some simple insights from this data. Let us find out the 50 most used words in the descriptions for Indian TV Serials.
df_ovw_count_simple = to_1D(dfmain["overview"]).value_counts()
print(df_ovw_count_simple.head(50))
            1843
a            856
the          848
and          647
of           588
to           394
is           338
her          314
in           302
who          201
with         191
story        185
their        158
his          140
on           129
family       128
love         125
an           125
plot         119
add          118
see          117
full         117
summary      114
for          113
from         111
life         107
she          105
by           103
girl          84
as            79
that          79
two           76
are           73
show          72
they          71
but           71
when          66
young         57
about         57
around        56
this          53
lives         52
it            51
has           49
he            49
married       47
series        47
one           44
other         42
revolves      41
Some significant meaningful words come out to be "family", "love" and "life"... That is some Fast & Furious philosophy it seems.
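Since the top of the list is dominated by filler words and an empty string, a simple stop-word filter would surface the meaningful words faster. A minimal sketch follows; the small hand-picked STOP_WORDS set is an assumption for illustration, not an exhaustive list, and the sample words are made up.

```python
import pandas as pd

# Hand-picked filler words (assumed for this example), plus the empty string.
STOP_WORDS = {"", "a", "the", "and", "of", "to", "is", "her", "in", "who", "with"}

# Made-up flattened word list, in the shape that to_1D() would return.
words = pd.Series(["the", "family", "", "love", "a", "family", "of", "love", "family", "life"])

# Keep only the words that are not in the stop-word set, then count them.
counts = words[~words.isin(STOP_WORDS)].value_counts()
print(counts)
```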
Indian TV is definitely an interesting place to observe and analyze. This project aimed at looking at some of the angles of the vast possibilities that are present with proper datasets.
But the tip of the iceberg that we touched also gave us some interesting results:
- Top 5 Indian TV Shows by IMDB Rating.
- Artists with the best mean IMDB Rating.
- Artists with the most experience.
- Genre with the best mean IMDB Rating.
- Genre with the most available content.
- The release frequency of shows over the years.
- The longest running shows.
- Usage of certain words in the overviews of TV shows over time.
- Most used words in TV Show descriptions.
This project also helped me cement my skills in data analysis, especially in analyzing a varied dataset from multiple angles.
I also gained experience in cleaning data, in handling list-like values stored in cells, and in treating their elements individually.
Thank you to everyone who actually stuck with reading till here, it was very fun for me to work on this project.