<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: ernestine-m</title>
    <description>The latest articles on DEV Community by ernestine-m (@ernestinem).</description>
    <link>https://dev.to/ernestinem</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F421695%2F54e2c1d4-3093-4efb-84ab-868ec2a2f504.jpeg</url>
      <title>DEV Community: ernestine-m</title>
      <link>https://dev.to/ernestinem</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/ernestinem"/>
    <language>en</language>
    <item>
      <title>Normalize nested JSON objects with pandas</title>
      <dc:creator>ernestine-m</dc:creator>
      <pubDate>Mon, 03 Aug 2020 13:14:00 +0000</pubDate>
      <link>https://dev.to/ernestinem/normalize-nested-json-objects-with-pandas-1g7m</link>
      <guid>https://dev.to/ernestinem/normalize-nested-json-objects-with-pandas-1g7m</guid>
      <description>&lt;p&gt;Ever since I started my job as a data analyst, I have heard many times from many different people that the most time-consuming task in data science is cleaning the data. After a little more than a month in this new job, I can totally concur. However, Python's pandas library is making it smoother than I expected. &lt;/p&gt;

&lt;h4&gt; A little about pandas &lt;/h4&gt; 

&lt;p&gt;Pandas is an open source data analysis library that allows for intuitive data manipulation. It's based on two primary data structures: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.index.html" rel="noopener noreferrer"&gt;The series&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It's a one-dimensional array capable of holding any type of data or Python objects. I like to think of it as a column in Excel. &lt;br&gt;
Series are indexed with integers by default (0 to n-1), but we can also define our own index.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html" rel="noopener noreferrer"&gt;The dataframe&lt;/a&gt; &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It's a 2-dimensional labeled data structure with columns of potentially different types. I like to think of it as several series put together (or as a spreadsheet in Excel). Dataframes are the most commonly used data structure in pandas. &lt;/p&gt;
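
&lt;p&gt;To make the two structures concrete, here is a minimal sketch. The fruit names and prices are made up for illustration.&lt;/p&gt;

```python
import pandas as pd

# A Series: one-dimensional, with a default integer index (0 to n-1)
prices = pd.Series([1.5, 2.0, 3.25])

# The same values with a custom index instead of the default integers
labelled = pd.Series([1.5, 2.0, 3.25], index=["apple", "pear", "kiwi"])

# A DataFrame: several Series side by side, each column with its own type
fruits = pd.DataFrame({
    "name": ["apple", "pear", "kiwi"],   # strings
    "price": [1.5, 2.0, 3.25],           # floats
    "in_stock": [True, False, True],     # booleans
})

print(list(prices.index))      # [0, 1, 2]
print(labelled["pear"])        # 2.0
print(fruits["price"].mean())  # 2.25
```

&lt;p&gt;Selecting a single column of the dataframe, as in fruits["price"], gives back a series.&lt;/p&gt;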

&lt;p&gt;This &lt;a href="https://pandas.pydata.org/pandas-docs/stable/getting_started/10min.html" rel="noopener noreferrer"&gt;10 minutes to pandas&lt;/a&gt; article in the documentation explains everything you need to know to start with pandas!  &lt;/p&gt;

&lt;h4&gt; Surprise! It's JSON nested objects... &lt;/h4&gt; 

&lt;p&gt;It was &lt;strong&gt;&lt;em&gt;not&lt;/em&gt;&lt;/strong&gt; a good surprise. I had retrieved 178 pages of data from an API (I talk about this &lt;a href="https://dev.to/ernestinem/what-s-an-api-and-how-to-access-it-using-python-2158"&gt;here&lt;/a&gt;) and I thought I would have to write separate code for each nested field I was interested in. &lt;br&gt;
Indeed, my data looked like a shelf of Russian dolls, some of them containing smaller dolls, and some of them not. &lt;br&gt;
&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fupload.wikimedia.org%2Fwikipedia%2Fcommons%2Fthumb%2F2%2F25%2FRussian_Dolls_%25284891096981%2529.jpg%2F1200px-Russian_Dolls_%25284891096981%2529.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fupload.wikimedia.org%2Fwikipedia%2Fcommons%2Fthumb%2F2%2F25%2FRussian_Dolls_%25284891096981%2529.jpg%2F1200px-Russian_Dolls_%25284891096981%2529.jpg" alt="Russian dolls"&gt;&lt;/a&gt; The data &lt;br&gt;
&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmiro.medium.com%2Fmax%2F1050%2F1%2AtTUgbl2uahxCa7441032MQ.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmiro.medium.com%2Fmax%2F1050%2F1%2AtTUgbl2uahxCa7441032MQ.png" alt="Nested JSON object structure"&gt;&lt;/a&gt; Nested JSON object structure &lt;br&gt;
I was only interested in keys that sat at different levels of the JSON. This seemed like long and tedious work. &lt;/p&gt;

&lt;h4&gt; The solution: pandas.json_normalize &lt;/h4&gt;

&lt;p&gt;Pandas offers a function to easily flatten nested JSON objects and select the keys we care about in 3 simple steps: &lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Make a Python list of the keys we care about. We can access nested objects with dot notation.&lt;/li&gt;
&lt;li&gt;Pass the deserialized JSON object to the json_normalize function.&lt;/li&gt;
&lt;li&gt;Filter the resulting dataframe with the list of keys.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;And voilà! &lt;/p&gt;
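
&lt;p&gt;The three steps can be sketched as follows. The records and key names are hypothetical, chosen only to show one object with a nested field and one without.&lt;/p&gt;

```python
import pandas as pd

# Hypothetical records: one has a nested "address", the other does not
records = [
    {"id": 1, "user": {"name": "Ada", "address": {"city": "Paris"}}},
    {"id": 2, "user": {"name": "Linus"}},
]

# Step 1: the keys we care about, using dot notation for nested fields
fields = ["id", "user.name", "user.address.city"]

# Step 2: flatten the nested objects into columns
flat = pd.json_normalize(records)

# Step 3: keep only the columns listed in fields
result = flat[fields]

print(result)
```

&lt;p&gt;A nested field that is missing from a record shows up as NaN in that row. If a key is missing from every record, the selection in step 3 raises a KeyError.&lt;/p&gt;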

&lt;p&gt;Since I had multiple files to clean that way, I wrote a function to automate the process throughout my code:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;pandas&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;

&lt;span class="n"&gt;FIELDS&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;list of keys I care about&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;clean_data&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;frames&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
    &lt;span class="c1"&gt;# Note: iteration starts at data[1], so data[0] is left out&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;json_normalize&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
        &lt;span class="c1"&gt;# Keep only the columns listed in FIELDS&lt;/span&gt;
        &lt;span class="n"&gt;frames&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;FIELDS&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
    &lt;span class="c1"&gt;# Stack the per-element frames (DataFrame.append was removed in pandas 2.0)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;concat&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;frames&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;frames&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This function allowed me to clean the data I had retrieved and prepare clear dataframes for analysis in just a couple of lines of code! 🙌&lt;/p&gt;
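
&lt;p&gt;For readers who want to try the idea end to end, here is a self-contained sketch of the same pattern. It assumes the input is a list of pages, each page being a list of records; the pages, the field names and the pages_to_table helper are all hypothetical.&lt;/p&gt;

```python
import pandas as pd

FIELDS = ["id", "user.name"]

# Hypothetical input: one list of records per page retrieved from the API
pages = [
    [{"id": 1, "user": {"name": "Ada"}}, {"id": 2, "user": {"name": "Linus"}}],
    [{"id": 3, "user": {"name": "Grace"}}],
]

def pages_to_table(pages, fields):
    """Flatten every page, keep the chosen columns, and stack the results."""
    frames = [pd.json_normalize(page)[fields] for page in pages]
    # ignore_index renumbers the rows from 0 across all pages
    return pd.concat(frames, ignore_index=True)

table = pages_to_table(pages, FIELDS)
print(table.shape)  # (3, 2)
```

&lt;p&gt;json_normalize accepts either a single dict or a list of dicts, so a whole page can be flattened in one call.&lt;/p&gt;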

</description>
      <category>codenewbie</category>
      <category>datascience</category>
      <category>pandas</category>
      <category>json</category>
    </item>
    <item>
      <title>What's an API and how to access one using Python?</title>
      <dc:creator>ernestine-m</dc:creator>
      <pubDate>Tue, 21 Jul 2020 14:36:48 +0000</pubDate>
      <link>https://dev.to/ernestinem/what-s-an-api-and-how-to-access-it-using-python-2158</link>
      <guid>https://dev.to/ernestinem/what-s-an-api-and-how-to-access-it-using-python-2158</guid>
      <description>&lt;p&gt;Last month, I was given my very first task at work as a beginner in data science: retrieve data from an API that uses the OAuth2 authorization protocol. With hindsight, that seems like a very basic task, but I had trouble finding a beginner-friendly how-to online. This article is a little breakdown of the steps needed to communicate with an API using Python 3. &lt;/p&gt;

&lt;h3&gt; What is an API ? &lt;/h3&gt;

&lt;p&gt;The textbook definition of an &lt;strong&gt;API&lt;/strong&gt; (or &lt;strong&gt;A&lt;/strong&gt;pplication &lt;strong&gt;P&lt;/strong&gt;rogramming &lt;strong&gt;I&lt;/strong&gt;nterface) is "a set of functions and procedures allowing the creation of applications that access the features or data of an operating system, application, or other service." &lt;/p&gt;

&lt;p&gt;To put it simply, an API is &lt;strong&gt;the messenger&lt;/strong&gt; between a client and a server and allows us to retrieve data. It can be compared to a waiter in a restaurant who takes our order, transmits it to cooks in the kitchen, then delivers our food back to us.  &lt;/p&gt;

&lt;p&gt;&lt;em&gt;A very helpful 3 minute explanation&lt;/em&gt;&lt;iframe width="710" height="399" src="https://www.youtube.com/embed/s7wmiS2mSXY"&gt;
&lt;/iframe&gt;
 &lt;/p&gt;

&lt;p&gt;An API can follow different architectural styles, but the most common one is &lt;strong&gt;representational state transfer (REST)&lt;/strong&gt;, which allows for interoperability between computer systems on the internet. A RESTful API, or REST API, uses existing HTTP methods to communicate: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;GET&lt;/strong&gt; to retrieve a resource/data&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;PUT&lt;/strong&gt; to change the state of a resource or update it &lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;POST&lt;/strong&gt; to create a resource&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;DELETE&lt;/strong&gt; to remove a resource&lt;/li&gt;
&lt;/ul&gt;
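
&lt;p&gt;As a rough illustration of these four methods, the sketch below builds one request per method with the requests library, without sending anything over the network. The URL is a made-up placeholder.&lt;/p&gt;

```python
import requests

# Hypothetical endpoint, used only for illustration
URL = "https://api.example.com/articles/42"

# Build one request per method without sending anything over the network
prepared = {
    method: requests.Request(method, URL).prepare()
    for method in ("GET", "PUT", "POST", "DELETE")
}

for method, request in prepared.items():
    print(request.method, request.url)
```

&lt;p&gt;To actually send such requests, the library offers one helper per method: requests.get, requests.put, requests.post and requests.delete.&lt;/p&gt;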

&lt;h3&gt; What is OAuth2 ? &lt;/h3&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--m0-ilnb2--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://upload.wikimedia.org/wikipedia/commons/thumb/d/d2/Oauth_logo.svg/270px-Oauth_logo.svg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--m0-ilnb2--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://upload.wikimedia.org/wikipedia/commons/thumb/d/d2/Oauth_logo.svg/270px-Oauth_logo.svg.png" alt="OAuth logo"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In order to access an API, you need authorization. The &lt;strong&gt;most common standard&lt;/strong&gt; is called &lt;strong&gt;OAuth&lt;/strong&gt; and is used by most big tech companies. OAuth allows access tokens to be issued to third-party clients by an authorization server, with the approval of the resource owner. The third party then uses the access token to access the protected resources hosted by the resource server.&lt;/p&gt;

&lt;h3&gt; So, how does it work ? &lt;/h3&gt;

&lt;p&gt;The workflow I had to use for this task was client_credentials, which consists of 2 steps: &lt;/p&gt;

&lt;h4&gt; Step 1: Request an access token with the information given by the resource owner &lt;/h4&gt;

&lt;p&gt;In order to communicate with APIs, Python has a very useful HTTP library called &lt;strong&gt;requests&lt;/strong&gt; that allows us to retrieve data in a very simple way. There’s no need to manually add query strings to URLs, or to form-encode POST data.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;The code I wrote&lt;/em&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;requests&lt;/span&gt;

&lt;span class="n"&gt;values&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="s"&gt;"grant_type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s"&gt;"client_credentials"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
   &lt;span class="s"&gt;"client_id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s"&gt;' given by the resource owner'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
   &lt;span class="s"&gt;'client_secret'&lt;/span&gt; &lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s"&gt;'given by the resource owner'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
   &lt;span class="s"&gt;'scope'&lt;/span&gt; &lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s"&gt;'specified in the API documentation'&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="n"&gt;headers&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="s"&gt;'Authorization'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s"&gt;'given by the resource owner'&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="n"&gt;r&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;post&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'host/oauth2/token'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;values&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;This code gives us back an access token that allows us to move to step 2. &lt;/p&gt;

&lt;h4&gt; Step 2 : Retrieve the data by using the access token that's been issued &lt;/h4&gt; 

&lt;p&gt;For this step, I used &lt;a href="https://www.postman.com/"&gt;Postman&lt;/a&gt;, a collaborative platform for API development that also allows us to send requests. This tool is useful for beginners as it auto-generates headers. The only one I had to add was a range header, because the API results were paginated. &lt;/p&gt;
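
&lt;p&gt;The same kind of request can also be prepared from Python. This is only a sketch under assumptions: it supposes the token response has the usual OAuth2 shape (an access_token field) and that the API expects a bearer token. The token value, the endpoint, the build_headers helper and the range format are placeholders, so check the API documentation for the real ones.&lt;/p&gt;

```python
import requests

# Shape of a typical OAuth2 token response (values are made up)
token_response = {
    "access_token": "abc123",
    "token_type": "Bearer",
    "expires_in": 3600,
}

def build_headers(token_response, item_range=None):
    """Build request headers from an OAuth2 token response."""
    headers = {"Authorization": "Bearer " + token_response["access_token"]}
    if item_range is not None:
        # Hypothetical range format; the real one depends on the API
        headers["Range"] = item_range
    return headers

# Prepare (but do not send) the request, so the sketch runs offline
request = requests.Request(
    "GET",
    "https://api.example.com/data",  # hypothetical endpoint
    headers=build_headers(token_response, item_range="items=0-99"),
).prepare()

print(request.headers["Authorization"])  # Bearer abc123
```

&lt;p&gt;With a real token and endpoint, passing the same headers to requests.get would send the request and return the response.&lt;/p&gt;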

&lt;h5&gt; Paginated? &lt;/h5&gt;

&lt;p&gt;Yes, just like books, APIs can be paginated. Since databases can contain millions or billions of records, requesting all of them at once could cause the server to crash. Pagination prevents this by limiting the amount of data returned by each request. There are 3 main types of pagination: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Offset-based pagination&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Keyset pagination&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Seek pagination&lt;/strong&gt; &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://nordicapis.com/everything-you-need-to-know-about-api-pagination/"&gt;This article&lt;/a&gt; goes into greater details about each one of these methods!&lt;/p&gt;

</description>
      <category>python</category>
      <category>datascience</category>
      <category>codenewbie</category>
      <category>beginners</category>
    </item>
  </channel>
</rss>
