<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Magali</title>
    <description>The latest articles on DEV Community by Magali (@magali).</description>
    <link>https://dev.to/magali</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F641792%2F5ef95f21-be31-4ea9-a2fd-6a6d77ea1385.png</url>
      <title>DEV Community: Magali</title>
      <link>https://dev.to/magali</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/magali"/>
    <language>en</language>
    <item>
      <title>Evaluating Time Series Models</title>
      <dc:creator>Magali</dc:creator>
      <pubDate>Wed, 02 Nov 2022 03:00:41 +0000</pubDate>
      <link>https://dev.to/magali/evaluating-time-series-models-3o99</link>
      <guid>https://dev.to/magali/evaluating-time-series-models-3o99</guid>
      <description>&lt;p&gt;Model evaluation is a critical, yet often under looked, step in data science. Much time is dedicated to obtaining, cleaning, and analyzing large amounts of structured and unstructured data, in order to then build models that may generate new insights. In the rush to extract value from data and to put such insights to work in a fast-paced world, the process of evaluating the performance of the model can be dismissed altogether. Recent examples showcase the need for model evaluation. Most notably, Zillow shut down its home buying business after it paid too much for houses by relying on an algorithm that could not accurately forecast prices during a period of rapid price changes. The algorithm did not perform well with new, changing data. &lt;/p&gt;

&lt;p&gt;This blog post will highlight how to incorporate evaluation into modeling time series--such as real estate prices, FX rates, stock prices, retail sales, temperature readings, and numbers of subscribers--in Python.&lt;/p&gt;

&lt;h2&gt;
  
  
  Train-test splits
&lt;/h2&gt;

&lt;p&gt;The first step is to split the data set into a training subset (to train the model) and a testing subset (to test the trained model). This is referred to as a 'train-test split', where the training data typically accounts for 80% of the sample and the test data accounts for 20% of the sample. In non-time series machine learning, the train-test split function partitions the data randomly. &lt;/p&gt;

&lt;p&gt;With time series, however, the data is not split into random subsets. Rather, the testing subset is comprised of the observations towards the end of the dataset. For example, in a dataset with 60 records (let's say these are months), the training set would be comprised of the first 48 months and the test set would be comprised of the last 12 months. The test set is sometimes described as the 'hold-out' set since this data is withheld from the dataset that is used for fitting the model. Importantly, a model should not be trained on test data.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s---X_8-z6c--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/2xzoa22m0hktpkxeebk0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s---X_8-z6c--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/2xzoa22m0hktpkxeebk0.png" alt="Image description" width="507" height="357"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;A function can be defined in Python to perform such a split on multiple time series, and the slicing notation can be used to slice the dataset accordingly.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Define function to apply train_test_split to all selected dfs
def tt_split(df_list, names):
    tts_list = []

    for i, df in enumerate(df_list):
        train_data = df[:-12] #all data except last 12 months
        test_data = df[-12:] #last 12 months of data
        tts_list.extend([train_data, test_data])
        print(names[i],':', df.shape, ' // Train: ', train_data.shape,
             'Test: ', test_data.shape)
    return tts_list    #return list of train, test dfs
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Once the data has been split into training and testing subsets, the developed model will be fit to the training set. Evaluation metrics relevant to the use case can then be assessed. Residuals are calculated on the training set, and forecasts are calculated on the test set.&lt;/p&gt;

&lt;h2&gt;
  
  
  Workflow visualization
&lt;/h2&gt;

&lt;p&gt;A simplified workflow to develop a time series model could look like this, according to &lt;a href="https://developers.google.com/machine-learning/crash-course/validation/another-partition"&gt;Google's Machine Learning&lt;/a&gt; guide:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Zr2-9yKZ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/rh82pdylt9uexz4c1y7b.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Zr2-9yKZ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/rh82pdylt9uexz4c1y7b.png" alt="Image description" width="685" height="341"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Benchmark models
&lt;/h2&gt;

&lt;p&gt;Another key step in the model validation process is to build a baseline (ie, benchmark) model followed by iterative modeling with different assumptions, inputs, hyper-parameters and even with varying models to compare results. At this point, models can be compared on relevant evaluation metrics (such as RMSE and AIC) and how well the model generalizes to new data, which is known as over- and under-fitting. This matters because models are developed on sample data, which is usually not fully representative of the new data that the model will encounter in the real world. The model should ultimately learn well enough that it can be applied to examples it did not see during the training phase.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Incorporating model evaluation practices--through train-test splits, building benchmark models, and iterating on them--is a good practice that should be part of every time series analysis, especially when such models inform decision-making.&lt;/p&gt;

</description>
      <category>datascience</category>
      <category>timeseries</category>
    </item>
    <item>
      <title>How Bias Sneaks into Machine Learning</title>
      <dc:creator>Magali</dc:creator>
      <pubDate>Tue, 20 Sep 2022 22:01:10 +0000</pubDate>
      <link>https://dev.to/magali/how-bias-sneaks-into-machine-learning-5g5e</link>
      <guid>https://dev.to/magali/how-bias-sneaks-into-machine-learning-5g5e</guid>
      <description>&lt;p&gt;Machine learning offers a unique opportunity to transform how credit is analyzed for risk and is allocated by lenders. It can learn patterns in data to efficiently and accurately assess the credit risk of borrowers without reliance on traditional credit reporting systems. This is transformative because it means that machine learning can be applied in a range of environments--from banked to unbanked, underserved segments--to develop new ways to quickly determine borrowers' creditworthiness. &lt;/p&gt;

&lt;p&gt;Despite the promising advantages, the application of machine learning in financial services can also lead to bad outcomes, such as introducing biases and reinforcing discrimination in lending decisions. Bias can sneak into machine learning at various stages of model development, and there are no built-in checks to detect it.&lt;/p&gt;

&lt;h2&gt;
  
  
  Example
&lt;/h2&gt;

&lt;p&gt;What does this look like in practice? Let's walk through a simplified example. In this example, a machine learning model was trained on the &lt;a href="https://archive.ics.uci.edu/ml/datasets/South+German+Credit"&gt;German credit dataset&lt;/a&gt; available on UCI's Machine Learning Repository to classify creditworthy (1) and not-creditworthy (0) borrowers. The dataset has 1000 records, with 21 variables comprised of numerical and categorical data. The model determines that the most important features that determine if a borrower is classified as creditworthy are related to a borrower's checking account, terms of credit (amount and duration), age, and credit history, among other attributes.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--fdXOK50T--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/1k48cdc8mwzsb8q5y6gs.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--fdXOK50T--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/1k48cdc8mwzsb8q5y6gs.png" alt="Image description" width="626" height="329"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;According to the machine learning model, creditworthy borrowers are older (36.2 years old) than non-creditworthy borrowers (33.9 years old) on average. In particular, the average age of creditworthy males is higher than non-creditworthy males. The average age of creditworthy and non-creditworthy females is about the same. And, in general, older males are more likely to be classified as creditworthy and receive higher credit amounts than younger females.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--tdAgMBYo--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/r9tivwew04030xchznsq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--tdAgMBYo--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/r9tivwew04030xchznsq.png" alt="Image description" width="880" height="665"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Types of bias
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;Sample bias&lt;/em&gt; , also known as &lt;em&gt;selection bias&lt;/em&gt;, was unknowingly introduced into the model because the age data provided as an input to train the model is not likely be representative of the data the model will encounter going forward when it is used for the business application. Indeed, with a mean of 35.5 years and positive skew, the age data the model was trained on is not normally distributed. Data encountered in the real world is likely to include more younger and older borrowers.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--oa5UhBZr--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/l8ypxnuib6vu4gv4h64m.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--oa5UhBZr--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/l8ypxnuib6vu4gv4h64m.png" alt="Image description" width="385" height="274"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Moreover, women are notably underrepresented in the dataset.  In other words, the model is trained on a lot more male data and, thus, has better learned male borrowers' credit attributes.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--sVO1i1p6--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/emgwlefi93ngzwv90936.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--sVO1i1p6--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/emgwlefi93ngzwv90936.png" alt="Image description" width="361" height="274"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;There are also several other types of biases that can make their way into machine learning: &lt;em&gt;exclusion bias&lt;/em&gt;, &lt;em&gt;measurement bias&lt;/em&gt;, &lt;em&gt;recall bias&lt;/em&gt;, &lt;em&gt;observer bias&lt;/em&gt;, &lt;em&gt;association bias&lt;/em&gt;, and &lt;em&gt;racial bias&lt;/em&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Reducing Bias
&lt;/h2&gt;

&lt;p&gt;Under-represented data can lead to a machine learning model wrongly inferring attributes about the unrepresented segments, thus leading to unintended consequences such as exacerbating existing biases. Seeing how prevalent biases can be in machine learning, it is crucial for data scientists and analysts to incorporate bias testing to test assumptions, collect and prepare data fairly, and examine modeling decisions throughout the model development cycle.  &lt;/p&gt;

</description>
      <category>datascience</category>
      <category>machinelearning</category>
      <category>bias</category>
    </item>
    <item>
      <title>Making Use of Zip Codes</title>
      <dc:creator>Magali</dc:creator>
      <pubDate>Thu, 28 Jul 2022 20:48:00 +0000</pubDate>
      <link>https://dev.to/magali/making-use-of-zip-codes-1nep</link>
      <guid>https://dev.to/magali/making-use-of-zip-codes-1nep</guid>
      <description>&lt;h2&gt;
  
  
  Overview
&lt;/h2&gt;

&lt;p&gt;I am continuing to learn Python coding and data science as a student in Flatiron School’s Online Data Science Bootcamp. My motivation is to explore how these skills can be applied to make better, more informed, and timely operating and policy decisions in finance and economics. I am already finding value in the application of these skills to real-world questions and challenges.&lt;/p&gt;

&lt;h2&gt;
  
  
  Business Problem
&lt;/h2&gt;

&lt;p&gt;My second project involved working with King County, Washington’s house price dataset. This is a large dataset, with over 21,000 entries. Among these entries are house prices and location-related indicators such as latitude, longitude, and over 70 zip codes in King County. Since location can affect home prices, I was interested in exploring how house prices vary across the county. The challenge is that the dataset does not include addresses nor city names, which makes the zip code data difficult to interpret if one does not know what area the zip code represents. Moreover, because zip codes were created to assist postal workers deliver mail, they do not necessarily cleanly delineate an actual area that one would be able to reference on a map. Indeed, many zip codes have odd boundaries. Could the zip code data be made useful and interpretable in Python for this analysis?&lt;/p&gt;

&lt;h2&gt;
  
  
  Data Exploration
&lt;/h2&gt;

&lt;p&gt;The first step is to confirm that price indeed varies by location. A simple scatterplot of longitude and latitude, with markers weighted by price, indicates that it does. &lt;/p&gt;

&lt;p&gt;Next, I explored the zip code data, which does not have a normal distribution. I then plotted the average price by zip code, which further supports the view that location matters. Homes in some zip codes have higher mean prices than homes in other zip codes. There is no order in the relationship between zip codes and home prices. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--r7SHrr_X--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/kf4xo9kf69ufyd04xcl9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--r7SHrr_X--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/kf4xo9kf69ufyd04xcl9.png" alt="Image description" width="880" height="359"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Bringing in City Data
&lt;/h2&gt;

&lt;p&gt;Since I—and most people—do not know what areas each of these 70+ zip codes represent, I brought in city data obtained from King County GIS Open Data to match with zip codes. I have done similar matching of data previously, but never with Python so this was a test of my new skills. I approached the challenge by creating a dictionary with zip codes as keys and cities as values; calling on the dictionary when applying the pd.replace function to the dataset; and, finally, creating a visualization of home prices by city. Check it out.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;#Load primary dataframe that underlies the King County home price analysis.
&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;read_csv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'data/kingcounty.csv'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;#Load second dataframe with zipcode and city data, obtained from King County GIS Open Data.
&lt;/span&gt;&lt;span class="n"&gt;df_zips&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;read_csv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'data/zipcodes.csv'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;#Using this second dataframe, make a dictionary with key, values represented by zipcode, city.
&lt;/span&gt;&lt;span class="n"&gt;zips_dictionary&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;zip&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df_zips&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;zipcode&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;df_zips&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;city&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="n"&gt;zips_dictionary&lt;/span&gt; 
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;{98001: 'Auburn',&lt;br&gt;
 98002: 'Auburn',&lt;br&gt;
 98003: 'Federal_Way',&lt;br&gt;
 98004: 'Bellevue',&lt;br&gt;
 98005: 'Bellevue',&lt;br&gt;
 98006: 'Bellevue',&lt;br&gt;
 98007: 'Bellevue',&lt;br&gt;
 98008: 'Bellevue',...}&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;#In primary dataframe named data, create new column named “City” that contains zip codes as placeholder values.
&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;'city'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;zipcode&lt;/span&gt;

&lt;span class="c1"&gt;#Call on dictionary to replace zip code values in this new column with city names.
&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;replace&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="s"&gt;'city'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;zips_dictionary&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt; &lt;span class="n"&gt;inplace&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;#Examine results
&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;city&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;value_counts&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Seattle          8777&lt;br&gt;
Renton           1584&lt;br&gt;
Bellevue         1263&lt;br&gt;
Kent             1197&lt;br&gt;
Redmond           960&lt;br&gt;
Kirkland          941&lt;br&gt;
Auburn            907&lt;br&gt;
Sammamish         778&lt;br&gt;
Federal_Way       777&lt;br&gt;
Issaquah          717&lt;/p&gt;

&lt;h2&gt;
  
  
  Creating Visualization of Mean Price by City
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;#Plot mean price by city
&lt;/span&gt;&lt;span class="n"&gt;fig&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ax&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;subplots&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;figsize&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;8&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

&lt;span class="c1"&gt;# Plot mean price by city
&lt;/span&gt;&lt;span class="n"&gt;mean_price_by_city&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;groupby&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'city'&lt;/span&gt;&lt;span class="p"&gt;)[&lt;/span&gt;&lt;span class="s"&gt;'price'&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;mean&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;mean_price_by_city&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sort_values&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ascending&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="n"&gt;plot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;kind&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;'bar'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;color&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;'powderblue'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;label&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;'Mean Price by City'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Plot mean price for King County
&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;'price_mean'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;price&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;mean&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;'price_mean'&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;plot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;kind&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;'line'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;color&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;'mediumblue'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;label&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;'Mean Price of King County: $504,333'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;#Format y-axis
&lt;/span&gt;&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ylabel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'Mean Price ($)'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;12&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;current_values&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;gca&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="n"&gt;get_yticks&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;gca&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="n"&gt;set_yticklabels&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="s"&gt;'{:,.0f}'&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;format&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;current_values&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;span class="n"&gt;ax&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;set_ylim&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1500000&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;

&lt;span class="c1"&gt;#Format x-xis
&lt;/span&gt;&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;xlabel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'City'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;12&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;xticks&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;rotation&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;90&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;#Add legend, title
&lt;/span&gt;&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;legend&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;loc&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;'upper left'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;borderaxespad&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;edgecolor&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;'white'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;fontsize&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;11&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;fig&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;suptitle&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'Home Prices: County and City'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;fontsize&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;15&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;fig&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;subplots_adjust&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;top&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.94&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;#Remove chart borders
&lt;/span&gt;&lt;span class="n"&gt;ax&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;spines&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;'top'&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;set_visible&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;ax&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;spines&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;'right'&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;set_visible&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;show&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--KVfHsELS--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/duhxtuvlajqkzkdru6kz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--KVfHsELS--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/duhxtuvlajqkzkdru6kz.png" alt="Image description" width="535" height="361"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;By bringing in city data to match with each zip codes, the variation in prices across the county can be more easily interpreted. The number of bins were reduced from over seventy to twenty or so, which reduces granularity but results in a grouping that is well known and understandable by the public. This is useful because subsequent analysis—for example, on price per square footage and structural home features—can be analyzed for differences within the county, through the lens of cities. &lt;/p&gt;

</description>
      <category>python</category>
      <category>datascience</category>
    </item>
    <item>
      <title>FS Project 1: Charting with Python</title>
      <dc:creator>Magali</dc:creator>
      <pubDate>Tue, 01 Jun 2021 20:17:02 +0000</pubDate>
      <link>https://dev.to/magali/charting-with-code-250j</link>
      <guid>https://dev.to/magali/charting-with-code-250j</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;I am learning Python coding and data science because I am interested in applying these skills to work with bigger data sets and to analyze such data to make better, more informed, and timely operating and policy decisions in finance and economics.&lt;/p&gt;

&lt;p&gt;As a student enrolled in Flatiron School's Data Science Bootcamp, I am enjoying learning these new skills and am already finding value in their application to solve real-world questions and challenges.&lt;/p&gt;

&lt;h2&gt;
  
  
  Business problem
&lt;/h2&gt;

&lt;p&gt;My first project examines the movie industry for a client that is interested in entering the business. Descriptive data analysis shows that the movie industry is profitable but there is significant variation in performance across films and production studios. The client can use this analysis to understand the key trends in the movie industry, identify its main competitors, and determine the types of films they will be creating. This analysis also serves as a baseline for deeper dives on the movie industry.&lt;/p&gt;

&lt;h2&gt;
  
  
  Data and methodology
&lt;/h2&gt;

&lt;p&gt;The project explored more than ten zipped, movie-related data sets from four sources. Data files provide a wide range of information over the last 20 years about individual movies' box office revenues, budgets, genres, as well as about production studios and associated cast and crew members.&lt;/p&gt;

&lt;p&gt;The project applies exploratory data analysis and examines trends of key metrics over time. This provides an insightful overview of the evolution of the performance of the movie industry. Several data files contain useful information that complements the other files, while some data was duplicated. This required extensive cleaning and joining data sets. I have experience preparing and analyzing data sets, mostly in Excel; but I have never worked with so many large files simultaneously nor done this with Python code. It was fun to tackle this in a new way, and I particularly enjoyed analyzing the data and generating chart output that presents the results.&lt;/p&gt;

&lt;h2&gt;
  
  
  Charting with Python
&lt;/h2&gt;

&lt;h3&gt;
  
  
  How to make chart with double y-axes
&lt;/h3&gt;

&lt;p&gt;One of the first findings is that there are large outliers in the data and there is notable variation across movies for most indicators. Given these characteristics, I wanted to explore the gross revenue and ROI of the top grossing movies (i.e., those in the 99th percentile). A reasonable hypothesis is that the movies with the highest gross revenue would be among the ones with the highest ROI. This is not the case. To show this cleanly, I plotted the top movies' gross revenue and ROI on two different axes. &lt;/p&gt;

&lt;p&gt;The chart shows that top grossing movies are profitable, but these movies do not necessarily have the highest ROI. Blockbuster office performance does not translate to higher ROI, implying that cost control is an important determinant of the bottom line.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;#Create figure
&lt;/span&gt;&lt;span class="n"&gt;fig&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ax1&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;subplots&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;figsize&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;14&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;9&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

&lt;span class="c1"&gt;#Assign chart variables
&lt;/span&gt;&lt;span class="n"&gt;title&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;'title'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;ww_gross&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;'worldwide_gross_m'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;ROIp&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;'ROIpct'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="c1"&gt;#Identify two y-axes using the same x-axis (i.e., the second (left) y-axis will use the same x-axis 
&lt;/span&gt;&lt;span class="n"&gt;ax2&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ax1&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;twinx&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="c1"&gt;#Create standard bar chart of gross revenue on the left y-axis
&lt;/span&gt;&lt;span class="n"&gt;ax1&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;bar&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;title&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ww_gross&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;color&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;'lightsteelblue'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;#Add line plot of ROI to same chart on the right y-axis. Set markers to '.' and remove line.
&lt;/span&gt;&lt;span class="n"&gt;ax2&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;plot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;title&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ROIp&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;marker&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;'.'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;markersize&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;12&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;color&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;'navy'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;linestyle&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;'None'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;#X-axis label formatting: rotate and center
&lt;/span&gt;&lt;span class="n"&gt;ax1&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;set_xticklabels&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;title&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;rotation&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;90&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ha&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;'center'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;#Y-axis label formatting: Set labels and change colors of labels to match chart content
&lt;/span&gt;&lt;span class="n"&gt;ax1&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;set_ylabel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'Gross Revenue (Millions $)'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;color&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;'gray'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;ax2&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;set_ylabel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'ROI (Percent)'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;color&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;'navy'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;#Y-axis tick marks: Set min, max, intervals
&lt;/span&gt;&lt;span class="n"&gt;ax1&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;set_yticks&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;arange&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;3250&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;250&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="n"&gt;ax2&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;set_yticks&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;arange&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;3250&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;250&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;show&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--MiG-_kUj--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/zf0uixqv9qpyk7ypoidx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--MiG-_kUj--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/zf0uixqv9qpyk7ypoidx.png" alt="image"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  How to make stacked percentage bar chart
&lt;/h3&gt;

&lt;p&gt;Another valuable takeaway from the data is the share of movies that are profitable versus unprofitable. I was curious to see what percentage of movies fall within given profitability ranges—for this, I set out to make a stacked percentage chart. This analysis shows that the movie industry is a profitable, but challenging, business. Forty percent of movies generate healthy return on investment exceeding 100%, while 25% generate positive but lower returns below 100%. Notably, 35% of movies lose money. &lt;/p&gt;

&lt;p&gt;After some initial troubleshooting on my code, I approached this as follows:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;#Prepare data for stacked 100% bar chart. Create df grouped by year and count of ROI buckets. Convert count of ROI buckets to percent of total count.
&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="n"&gt;df_roi&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;groupby&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="s"&gt;'year'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;'ROI_buckets'&lt;/span&gt;&lt;span class="p"&gt;])[&lt;/span&gt;&lt;span class="s"&gt;'ROI_buckets'&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;count&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
                       &lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;df_roi&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;groupby&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="s"&gt;'year'&lt;/span&gt;&lt;span class="p"&gt;])[&lt;/span&gt;&lt;span class="s"&gt;'ROI_buckets'&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;count&lt;/span&gt;&lt;span class="p"&gt;()))&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;

&lt;span class="c1"&gt;#Set color map and select number of colors from color map
&lt;/span&gt;&lt;span class="n"&gt;viridis&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;cm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;get_cmap&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'viridis'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;9&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;#Create stacked bar chart
&lt;/span&gt;&lt;span class="n"&gt;ax&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;unstack&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="n"&gt;plot&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;bar&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;stacked&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;figsize&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;14&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;color&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;viridis&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;colors&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;#Set title, x-label, y-label
&lt;/span&gt;&lt;span class="n"&gt;ax&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;set_title&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'ROI - Movies in 90th percentile'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;fontsize&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;18&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;ax&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;set_xlabel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'Year'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;fontsize&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;14&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;ax&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;set_ylabel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'Percent of movies (%)'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;fontsize&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;14&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;#Set y-axis ticks: min, max, interval
&lt;/span&gt;&lt;span class="n"&gt;ax&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;yaxis&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;set_ticks&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;arange&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;110&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;#Set tick marks on right side.
&lt;/span&gt;&lt;span class="n"&gt;ax&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;tick_params&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;labeltop&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;labelright&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;#Reverse legend order and set legend location
&lt;/span&gt;&lt;span class="n"&gt;handles&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;labels&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ax&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;get_legend_handles_labels&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;ax&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;legend&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;reversed&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;handles&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="nb"&gt;reversed&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;labels&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;loc&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;'center left'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;bbox_to_anchor&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;1.05&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mf"&gt;0.5&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;show&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--d0xHAWi8--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/al843tcnm7mo991g3rxr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--d0xHAWi8--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/al843tcnm7mo991g3rxr.png" alt="image"&gt;&lt;/a&gt; &lt;/p&gt;

&lt;h2&gt;
  
  
  Final insights
&lt;/h2&gt;

&lt;p&gt;Looking forward to continuing to learn and share data science insights--more to come!&lt;/p&gt;

</description>
      <category>python</category>
      <category>datascience</category>
    </item>
  </channel>
</rss>
