<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Hakim</title>
    <description>The latest articles on DEV Community by Hakim (@h4c5).</description>
    <link>https://dev.to/h4c5</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1045061%2F6e223518-8168-4856-9ffd-6506c92b8628.png</url>
      <title>DEV Community: Hakim</title>
      <link>https://dev.to/h4c5</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/h4c5"/>
    <language>en</language>
    <item>
      <title>Scraping de legifrance ? Considérez plutôt l'API</title>
      <dc:creator>Hakim</dc:creator>
      <pubDate>Sun, 17 Mar 2024 23:16:32 +0000</pubDate>
      <link>https://dev.to/h4c5/scraping-de-legifrance-considerez-plutot-lapi-5afi</link>
      <guid>https://dev.to/h4c5/scraping-de-legifrance-considerez-plutot-lapi-5afi</guid>
      <description>&lt;p&gt;Si vous envisagez le scraping pour récupérer des articles sur le site &lt;a href="https://www.legifrance.gouv.fr/" rel="noopener noreferrer"&gt;legifrance&lt;/a&gt; vous devriez considérer l'API publique disponible dans &lt;a href="https://developer.aife.economie.gouv.fr/index.php" rel="noopener noreferrer"&gt;le catalogue d'API Piste&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Obtention des identifiants OAuth
&lt;/h2&gt;

&lt;p&gt;Commencez par créer un compte sur Piste.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzjc4n3s157kmgikeotw1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzjc4n3s157kmgikeotw1.png" alt="Capture d'écran du site de Piste, page de création de compte"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Ensuite, rendez-vous sous "API &amp;gt; Consentement CGU API" pour accépter les CGU de Légifrance&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fi6k3by890zlj8rolu01h.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fi6k3by890zlj8rolu01h.png" alt="Capture d'écran du site de Piste, page de consentement CGU"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Créez une application dans l'onglet application.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fh77m4dwk3rgfvyqm0fkr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fh77m4dwk3rgfvyqm0fkr.png" alt="Capture d'écran du site de Piste, page de création d'une application"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Une fois créée, modifiez votre application pour y ajouter l'API Légifrance&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fseuizirmvwu02tdsotjk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fseuizirmvwu02tdsotjk.png" alt="Capture d'écran du site de Piste, page de mofification des applications"&gt;&lt;/a&gt;&lt;br&gt;
&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fv9snrtnyirizw6v6kq4o.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fv9snrtnyirizw6v6kq4o.png" alt="Capture d'écran du site de Piste, page de de mofification des applications, ajout d'une API"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;De retour dans l'onglet applications, cliquez sur votre application puis récupérer votre client id et votre client secret.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0n1j4wocjc6s2izdmqzo.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0n1j4wocjc6s2izdmqzo.png" alt="Capture d'écran du site de Piste, page de consultation des applications, récupération des identifiants"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Obtention d'un token
&lt;/h2&gt;

&lt;p&gt;L'obtention d'un token se fait via une requête POST sur l'url : &lt;code&gt;https://oauth.piste.gouv.fr/api/oauth/token&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Exemple de requête avec cURL : &lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;

curl  &lt;span class="nt"&gt;-X&lt;/span&gt; POST &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="s1"&gt;'https://oauth.piste.gouv.fr/api/oauth/token'&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--header&lt;/span&gt; &lt;span class="s1"&gt;'Accept-Encoding: gzip,deflate'&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--header&lt;/span&gt; &lt;span class="s1"&gt;'Content-Type: application/x-www-form-urlencoded'&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--header&lt;/span&gt; &lt;span class="s1"&gt;'Content-Length: 140'&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--data-urlencode&lt;/span&gt; &lt;span class="s1"&gt;'grant_type=client_credentials'&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--data-urlencode&lt;/span&gt; &lt;span class="s1"&gt;'client_id=xxx'&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--data-urlencode&lt;/span&gt; &lt;span class="s1"&gt;'client_secret=yyy'&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--data-urlencode&lt;/span&gt; &lt;span class="s1"&gt;'scope=openid'&lt;/span&gt;


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Voici le même exemple sous python : &lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;

&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;requests_oauthlib&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;OAuth2Session&lt;/span&gt;

&lt;span class="n"&gt;client_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;xxx&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="n"&gt;client_secret&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;yyy&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

&lt;span class="n"&gt;res&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;post&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://oauth.piste.gouv.fr/api/oauth/token&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;grant_type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;client_credentials&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;client_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;client_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;client_secret&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;client_secret&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;scope&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;openid&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;res&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="c1"&gt;# {'access_token': 'WxkeizeaK7vqh1A3swGD14RASpM4tTf2PXzC32SV1mzBOebmx1asYA',
#  'token_type': 'Bearer',
#  'expires_in': 3600,
#  'scope': 'openid resource.READ'}
&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h2&gt;
  
  
  Récupération d'un article à l'aide du token
&lt;/h2&gt;

&lt;p&gt;Le jeton obtenu ci-dessus permet de requête l'API Piste Légifrance.&lt;/p&gt;

&lt;p&gt;Voici un exemple avec cURL : &lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;

curl  &lt;span class="nt"&gt;-X&lt;/span&gt; POST &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="s1"&gt;'https://api.piste.gouv.fr/dila/legifrance/lf-engine-app/consult/getArticle'&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--header&lt;/span&gt; &lt;span class="s1"&gt;'Authorization: Bearer k0ESUrVPNCywGOm4Ef9Qehgxj8Ktr4gMhtPNc3DRilz5v11y3JKEDP'&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--header&lt;/span&gt; &lt;span class="s1"&gt;'Content-Type: application/json'&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--data-raw&lt;/span&gt; &lt;span class="s1"&gt;'{
  "id": "LEGIARTI000006307920"
}'&lt;/span&gt;


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Le même exemple en python : &lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;

&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;

&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;post&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://api.piste.gouv.fr/dila/legifrance/lf-engine-app/consult/getArticle&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;LEGIARTI000006307920&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Authorization&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Bearer k0ESUrVPNCywGOP4Ef9Qehgxj8Ktr4gMhtPNc3DRilz5v11y3JKEDP&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Content-Type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;application/json&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="c1"&gt;# {
#   "executionTime": 0,
#   "dereferenced": false,
#   "article": {
#     "id": "LEGIARTI000006307920",
#     "idTexte": null,
#     "type": "AUTONOME",
#     "texte": "L'impôt sur le revenu est établi d'après le montant total du revenu net annuel dont dispose chaque foyer fiscal. ..."
#   }
# }
&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;La documentation de l'API est disponible dans le rubrique "API &amp;gt; Mes API" puis en cliquant sur Légifrance.&lt;/p&gt;

&lt;h2&gt;
  
  
  pylegifrance
&lt;/h2&gt;

&lt;p&gt;Je suis tombé par hasard sur la librairie &lt;a href="https://github.com/rdassignies/pylegifrance" rel="noopener noreferrer"&gt;pylegifrance&lt;/a&gt; dont l'objectif est de faciliter la récupération des articles legifrance via l'API de Piste. &lt;/p&gt;

&lt;p&gt;Il s'agit plus d'un protoype de librairie puisqu'elle n'est pas encore publiée sur pypi mais a l'air prometteuse donc à surveiller.&lt;/p&gt;

</description>
      <category>webscraping</category>
      <category>french</category>
      <category>legifrance</category>
      <category>python</category>
    </item>
    <item>
      <title>Train a Sentence-CamemBERT</title>
      <dc:creator>Hakim</dc:creator>
      <pubDate>Sat, 09 Mar 2024 11:59:30 +0000</pubDate>
      <link>https://dev.to/h4c5/finetuner-a-camembert-model-for-sentence-embedding-j0d</link>
      <guid>https://dev.to/h4c5/finetuner-a-camembert-model-for-sentence-embedding-j0d</guid>
      <description>&lt;p&gt;The &lt;a href="https://camembert-model.fr"&gt;CamemBERT&lt;/a&gt; model is a state-of-the-art language model for modeling the&lt;br&gt;
French language.&lt;/p&gt;

&lt;p&gt;It is a &lt;a href="https://arxiv.org/abs/1907.11692"&gt;RoBERTa&lt;/a&gt; model that has been trained on a large number of French texts and can be easily adapted to a large number of tasks thanks to finetuning.&lt;/p&gt;

&lt;p&gt;Here we're going to to finetune the model for sentence embedding.&lt;/p&gt;
&lt;h2&gt;
  
  
  Sentence-BERT
&lt;/h2&gt;

&lt;p&gt;The output of a BERT model is an embedding vector for each token. To obtain an embedding of the text as a whole, we need to define a transformation strategy to go from individual token embeddings to an embedding vector for the sentence as a whole.&lt;/p&gt;

&lt;p&gt;The simplest and most effective strategy is simply to take the average of the token embeddings. &lt;br&gt;
This strategy is known as &lt;em&gt;mean pooling&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Foc8roxd60aa2iormvea6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Foc8roxd60aa2iormvea6.png" alt="Sentence-BERT diagram : token embeddings are average in a mean pooling layer to get the sentence embedding" width="800" height="693"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If you'd like to find out more about the strategies that have been considered, take a look at this paper: &lt;a href="https://arxiv.org/pdf/1908.10084.pdf"&gt;Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks&lt;/a&gt;.&lt;/p&gt;
&lt;h2&gt;
  
  
  Finetuning a BERT model into a Sentence-BERT model
&lt;/h2&gt;

&lt;p&gt;The authors of the paper mentioned above have built a Python library called &lt;a href="https://www.sbert.net/"&gt;sentence-transformers&lt;/a&gt;, for manipulating Sentence-BERT models.&lt;/p&gt;

&lt;p&gt;We'll use it to obtain a Sentence-CamemBERT model from a CamemBERT model available on &lt;a href="https://huggingface.co/almanach/camembert-base"&gt;huggingface&lt;/a&gt;.&lt;/p&gt;
&lt;h3&gt;
  
  
  Prerequisites
&lt;/h3&gt;

&lt;p&gt;We will be using the following packages:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;datasets
sentence-transformers
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Training data
&lt;/h3&gt;

&lt;p&gt;First of all, we're going to retrieve the training data.&lt;br&gt;
We will use the French part of the dataset &lt;a href="https://huggingface.co/datasets/stsb_multi_mt"&gt;STSb Multi MT&lt;/a&gt;.&lt;br&gt;
This is a dataset containing pairs of sentences and a score between 0 and 5 representing the similarity between the two sentences.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;datasets&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;load_dataset&lt;/span&gt;

&lt;span class="n"&gt;sts_train_dataset&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;load_dataset&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;stsb_multi_mt&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;fr&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;split&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;train&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;sts_dev_dataset&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;load_dataset&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;stsb_multi_mt&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;fr&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;split&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;dev&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;sts_test_dataset&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;load_dataset&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;stsb_multi_mt&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;fr&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;split&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;test&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We'll then convert the retrieved data into InputExample objects that can be used for training.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;typing&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;List&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;sentence_transformers&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;InputExample&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;dataset_to_input_examples&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dataset&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;List&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;InputExample&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="nc"&gt;InputExample&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;texts&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;example&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;sentence1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;example&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;sentence2&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]],&lt;/span&gt;
        &lt;span class="n"&gt;label&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;example&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;similarity_score&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="mf"&gt;5.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;example&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;dataset&lt;/span&gt;
&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="n"&gt;sts_train_examples&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;dataset_to_input_examples&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sts_train_dataset&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;sts_dev_examples&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;dataset_to_input_examples&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sts_dev_dataset&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;sts_test_examples&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;dataset_to_input_examples&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sts_test_dataset&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We will use the CamemBERT model named &lt;a href="https://huggingface.co/almanach/camembert-base"&gt;&lt;code&gt;almanach/camembert-base&lt;/code&gt;&lt;/a&gt; for finetuning:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;sentence_transformers&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;evaluation&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;losses&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;SentenceTransformer&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;torch.utils.data&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;DataLoader&lt;/span&gt;

&lt;span class="n"&gt;batch_size&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;32&lt;/span&gt;

&lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;SentenceTransformer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;almanach/camembert-base&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;train_dataloader&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;DataLoader&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sts_train_examples&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;shuffle&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;batch_size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;batch_size&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;train_loss&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;losses&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;CosineSimilarityLoss&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We use cosine-similarity loss objective to train the model.&lt;/p&gt;

&lt;p&gt;Finally, an evaluator is built to monitor the model's performance on the dev dataset during training.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;sts_dev_evaluator&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;evaluation&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;EmbeddingSimilarityEvaluator&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;from_input_examples&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;sts_dev_examples&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;sts-dev&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We can now start training the model:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;fit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;train_objectives&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[(&lt;/span&gt;&lt;span class="n"&gt;train_dataloader&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;train_loss&lt;/span&gt;&lt;span class="p"&gt;)],&lt;/span&gt;
    &lt;span class="n"&gt;evaluator&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;sts_dev_evaluator&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;epochs&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;warmup_steps&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;500&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;save_best_model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Model evaluation
&lt;/h3&gt;

&lt;p&gt;Once training is complete, you can measure the model's performance on the test data set you've kept&lt;br&gt;
away:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;sts_test_evaluator&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;evaluation&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;EmbeddingSimilarityEvaluator&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;from_input_examples&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;sts_test_examples&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;sts-test&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nf"&gt;sts_test_evaluator&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I get a Pearson correlation of &lt;code&gt;0.837&lt;/code&gt; which is at the same level as the Sentence-CamemBERT I found on huggingface :&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Pearson Correlation&lt;/th&gt;
&lt;th&gt;Parameters&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://huggingface.co/h4c5/sts-camembert-base"&gt;&lt;code&gt;h4c5/sts-camembert-base&lt;/code&gt;&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;0.837&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;110M&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://huggingface.co/Lajavaness/sentence-camembert-base"&gt;&lt;code&gt;Lajavaness/sentence-camembert-base&lt;/code&gt;&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;0.835&lt;/td&gt;
&lt;td&gt;110M&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://huggingface.co/inokufu/flaubert-base-uncased-xnli-sts"&gt;&lt;code&gt;inokufu/flaubert-base-uncased-xnli-sts&lt;/code&gt;&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;0.828&lt;/td&gt;
&lt;td&gt;137M&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://huggingface.co/h4c5/sts-distilcamembert-base"&gt;&lt;code&gt;h4c5/sts-distilcamembert-base&lt;/code&gt;&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;0.817&lt;/td&gt;
&lt;td&gt;68M&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://huggingface.co/sentence-transformers/distiluse-base-multilingual-cased"&gt;&lt;code&gt;sentence-transformers/distiluse-base-multilingual-cased-v2&lt;/code&gt;&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;0.786&lt;/td&gt;
&lt;td&gt;135M&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  Sentence-BERT model distilled
&lt;/h2&gt;

&lt;p&gt;As you may have noticed in the table above, I've also trained a Sentence-CamemBERT model that's about half the size (68M parameters vs. 110M) and yet performs very well: &lt;a href="https://huggingface.co/h4c5/sts-distilcamembert-base"&gt;&lt;code&gt;h4c5/sts-distilcamembert-base&lt;/code&gt;&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;This is in fact a model obtained by following the above procedure but starting from the distilled CamemBERT model: &lt;a href="https://huggingface.co/cmarkea/distilcamembert-base"&gt;cmarkea/distilcamembert-base&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;This so-called "distilled" model was obtained by removing half of the layer of the CamemBERT base model and training it to maintain its performance.&lt;/p&gt;

&lt;p&gt;To find out more about the distillation process, please consult the following papers:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://arxiv.org/abs/1910.01108"&gt;DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://hal.archives-ouvertes.fr/hal-03674695"&gt;DistilCamemBERT: a distilled version of the French CamemBERT&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Et voilà. You can find my two Sentence-CamemBERT models on huggingface :&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://huggingface.co/h4c5/sts-camembert-base"&gt;h4c5/sts-camembert-base&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://huggingface.co/h4c5/sts-distilcamembert-base"&gt;h4c5/sts-distilcamembert-base&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>python</category>
      <category>datascience</category>
      <category>embedding</category>
      <category>nlp</category>
    </item>
    <item>
      <title>An introduction to packaging in Python</title>
      <dc:creator>Hakim</dc:creator>
      <pubDate>Fri, 08 Mar 2024 21:54:43 +0000</pubDate>
      <link>https://dev.to/h4c5/an-introduction-to-packaging-in-python-d5a</link>
      <guid>https://dev.to/h4c5/an-introduction-to-packaging-in-python-d5a</guid>
      <description>&lt;p&gt;Learning how to package your code is very useful for any python developer. It gives you a better understanding of how Python works and, above all, enables you to share your code with others or simply deploy it in a runtime environment.&lt;/p&gt;

&lt;p&gt;So how does it work? Why not just share your code via git? Actually, it's more complex than it sounds. There's even a whole working group, the &lt;a href="https://github.com/pypa"&gt;Python Packaging Authority&lt;/a&gt; (aka pypa), which has been working on this subject since &lt;a href="https://www.pypa.io/en/latest/history/#before-2013"&gt;since 2011&lt;/a&gt; and it's a constantly evolving field.&lt;/p&gt;

&lt;p&gt;In this article, we'll take a quick look at packaging in Python and present a simple method for packaging your code in 2023.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;TL;DR&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Use the &lt;a href="https://setuptools.pypa.io/en/latest/userguide/pyproject_config.html"&gt;pyproject.toml&lt;/a&gt; file following the &lt;a href="https://setuptools.pypa.io/en/latest/userguide/quickstart.html"&gt;setuptools guide&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;You can simply use &lt;code&gt;pip install build &amp;amp;&amp;amp; python -m build&lt;/code&gt; to build your package.&lt;/li&gt;
&lt;li&gt;Adopt src-layout.&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Packages in Python
&lt;/h2&gt;

&lt;p&gt;First, a few definitions:&lt;/p&gt;

&lt;h3&gt;
  
  
  Python module
&lt;/h3&gt;

&lt;p&gt;A &lt;a href="https://docs.python.org/fr/3/tutorial/modules.html"&gt;&lt;em&gt;module&lt;/em&gt;&lt;/a&gt; python un is a text file with the extension &lt;code&gt;.py&lt;/code&gt; containing python code.&lt;/p&gt;

&lt;p&gt;It's called a &lt;em&gt;module&lt;/em&gt; because it "modularizes" the python source code into several &lt;code&gt;.py&lt;/code&gt; files. We then use python's import functionality to import elements from another module.&lt;/p&gt;

&lt;p&gt;For example, if you create a &lt;code&gt;count_lines.py&lt;/code&gt; file containing a &lt;code&gt;count_lines_file&lt;/code&gt; function, you can then import this function:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# count_lines.py
&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;count_lines_file&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;filepath&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Count the number of lines in a file&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;filepath&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;count_lines&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;count_lines_file&lt;/span&gt;
&lt;span class="nf"&gt;count_lines_file&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;count_lines.py&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="mi"&gt;3&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When you execute the instruction &lt;code&gt;from count_lines import count_lines_file&lt;/code&gt;, python looks for the module in the current directory (where python was launched); then in the python installation directory &lt;code&gt;/usr/lib/python3.11&lt;/code&gt;; then in the directory where python installs default packages: &lt;code&gt;/usr/lib/python3.11/site-packages&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;As soon as the module is found, it is executed in the python environment and the elements defined in it become available.&lt;/p&gt;

&lt;p&gt;The list of directories in which python searches for modules is provided by the &lt;code&gt;sys&lt;/code&gt; package:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;sys&lt;/span&gt;
&lt;span class="n"&gt;sys&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;
&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/usr/lib/python3.11/python311.zip&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/usr/lib/python3.11/python311&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/usr/lib/python3.11/site-packages&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Package python
&lt;/h3&gt;

&lt;p&gt;A &lt;a href="https://docs.python.org/fr/3/tutorial/modules.html#packages"&gt;&lt;em&gt;package&lt;/em&gt;&lt;/a&gt; is a folder that groups together a set of python modules and facilitates access to them by creating a namespace: &lt;code&gt;from numpy.linalg import norm&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;To create a package, simply create a &lt;code&gt;__init__.py&lt;/code&gt; file (which can be empty) in a folder.&lt;br&gt;
The folder is then considered by python as a package.&lt;/p&gt;

&lt;p&gt;Let's create a package for the &lt;code&gt;count_lines.py&lt;/code&gt; module:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;count_package
├── __init__.py
└── count_lines.py
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We can now use "pointed" notation to import the &lt;code&gt;count_lines&lt;/code&gt; module:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;count_package.count_lines&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;count_lines_file&lt;/span&gt;
&lt;span class="nf"&gt;count_lines_file&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;count_lines.py&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="mi"&gt;3&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When executing the instruction &lt;code&gt;from count_package.count_lines import count_lines_file&lt;/code&gt;, python looks for a &lt;code&gt;count_package&lt;/code&gt; folder containing a &lt;code&gt;__init__.py&lt;/code&gt; file in the &lt;code&gt;sys.path&lt;/code&gt; directories (the current directory and the default directories seen above).&lt;/p&gt;

&lt;p&gt;If the package is found, the modules present in it can be accessed using the "pointed" notation.&lt;/p&gt;

&lt;h2&gt;
  
  
  Distribute your package
&lt;/h2&gt;

&lt;p&gt;To distribute your package, create a &lt;em&gt;distribution&lt;/em&gt;. This is an archive containing the package to be distributed, which can then be installed using the &lt;em&gt;pip&lt;/em&gt; package manager.&lt;/p&gt;

&lt;p&gt;There are two main distribution formats:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The &lt;a href="https://packaging.python.org/en/latest/glossary/#term-Source-Distribution-or-sdist"&gt;&lt;em&gt;Source Distribution&lt;/em&gt;&lt;/a&gt; (sdist) format: this is an archive containing all source code and metadata.&lt;/li&gt;
&lt;li&gt;The &lt;a href="https://packaging.python.org/en/latest/glossary/#term-Built-Distribution"&gt;&lt;em&gt;Built Distribution&lt;/em&gt;&lt;/a&gt; format: this is a distribution format in which a number of things have been pre-compiled to facilitate installation on other environments. This is particularly useful for modules written in C / C++.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The &lt;code&gt;wheel&lt;/code&gt; format is the reference &lt;em&gt;Built Distribution&lt;/em&gt; format. It is the format developed by the Python Packaging Authority and is widely used to distribute packages.&lt;/p&gt;

&lt;p&gt;There are many tools available for creating distributions, but here we'll focus mainly on the tools created by the Python Packaging Authority, which have become indispensable: &lt;code&gt;setuptools&lt;/code&gt;, &lt;code&gt;build&lt;/code&gt; and &lt;code&gt;twine&lt;/code&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  setuptools
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://setuptools.pypa.io/en/latest/index.html"&gt;&lt;code&gt;setuptools&lt;/code&gt;&lt;/a&gt; is the tool used by the vast majority of projects to build their distributions.&lt;/p&gt;

&lt;p&gt;Let's take our &lt;code&gt;count_package&lt;/code&gt; example from earlier and see how to create a distribution with &lt;code&gt;setuptools&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Our project tree might look something like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;projet_genial
├── count_package
│   ├── __init__.py
│   └── count_lines.py
├── tests
│   └── test_count_lines.py
├── .gitignore
├── LICENSE.md
└── README.md
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;setuptools&lt;/code&gt; needs a configuration file to know what to include in the distribution and all the project metadata.&lt;/p&gt;

&lt;p&gt;Unfortunatly there is three configuration files standards currently used (and you can use them all at the same time...) :&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The &lt;code&gt;setup.py&lt;/code&gt; file, the oldest and most popular
(&lt;a href="https://github.com/search?q=filename%3Asetup.py"&gt;3.7 million results on github&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;The &lt;code&gt;setup.cfg&lt;/code&gt; file, the second most popular file (&lt;a href="https://github.com/search?q=filename%3Asetup.py"&gt;0.4 million hits on github&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;The &lt;code&gt;pyproject.toml&lt;/code&gt; file, the latest arrival and now the official standard for all python python packaging tools. (&lt;a href="https://github.com/search?q=filename%3Asetup.py"&gt;0.2 million hits on github&lt;/a&gt;)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;code&gt;pyproject.toml&lt;/code&gt; is the &lt;a href="https://peps.python.org/pep-0621/"&gt;official standard&lt;/a&gt;, but is still in the minority compared to &lt;code&gt;setup.py&lt;/code&gt;, so we'll be looking at all three file formats.&lt;/p&gt;

&lt;h3&gt;
  
  
  setup.py
&lt;/h3&gt;

&lt;p&gt;The &lt;code&gt;setup.py&lt;/code&gt; file, as its extension suggests, is a python file. It has the following form:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# setup.py
&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;setuptools&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;setup&lt;/span&gt;

&lt;span class="nf"&gt;setup&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;count_package&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;author&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;me&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Package for counting the number of lines in files.&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;
    &lt;span class="n"&gt;version&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;0.0.1&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;python_requires&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;&amp;gt;=3.7, &amp;lt;4&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;install_requires&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;pandas&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;importlib-metadata; python_version &amp;gt;= &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;3.8&lt;/span&gt;&lt;span class="sh"&gt;"'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;],&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The very fact that it's a python file is both its strength and its weakness: it's possible to build the configuration dynamically in the code, but this makes it difficult to parse and interface with other external tools.&lt;/p&gt;

&lt;p&gt;In addition, since this format is specific to &lt;code&gt;setuptools&lt;/code&gt;, distributions of the &lt;em&gt;sdist&lt;/em&gt; type can only be installed if setuptools has been installed on the target environment and in a compatible version.&lt;/p&gt;

&lt;p&gt;Sadly, since the vast majority of projects started using setuptools and &lt;code&gt;setup.py&lt;/code&gt;, it became difficult to propose alternatives, so projects like &lt;a href="https://pypi.org/project/flit/"&gt;flit&lt;/a&gt; had to be built "on top" of &lt;code&gt;setuptools&lt;/code&gt;, so the &lt;code&gt;setup.py&lt;/code&gt; doesn't encourage innovation.&lt;/p&gt;

&lt;p&gt;Moreover, its use is often problematic.&lt;br&gt;
To take the example from &lt;a href="https://bernat.tech/posts/pep-517-518/#packaging-tool-diversity"&gt;this article&lt;/a&gt;, you might be tempted to introduce a &lt;em&gt;if/else&lt;/em&gt; condition in your &lt;code&gt;setup.py&lt;/code&gt; to manage a dependency needed in python 2.7 based on &lt;code&gt;sys.version&lt;/code&gt;, but in doing so you would be introducing a vicious bug: the dependency will be included or not depending on the environment that compiles the distribution and not depending on the environment that is installing it.&lt;/p&gt;

&lt;p&gt;It's also tempting &lt;a href="https://bernat.tech/posts/growing-pain/#importing-the-built-package-from-within-setuppy"&gt;to import your own package from &lt;code&gt;setup.py&lt;/code&gt;&lt;/a&gt; to manage the version. But by doing so, sdist distributions will crash at installation because the package you are trying to import is not yet present in the python environnment.&lt;/p&gt;

&lt;p&gt;In short : &lt;strong&gt;please do not use &lt;code&gt;setup.py&lt;/code&gt; anymore&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;And if you do, use it in a declarative way.&lt;/p&gt;

&lt;p&gt;Using the file as a script: &lt;code&gt;python setup.py&lt;/code&gt; is &lt;strong&gt;depreciated&lt;/strong&gt;, as the documentation clearly explained in the setuptools documentation :&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;It is important to remember, however, that running this file as a script (e.g. python setup.py sdist) is strongly&lt;br&gt;
discouraged, and that the majority of the command line interfaces are (or will be) deprecated (e.g. python setup.py&lt;br&gt;
install, python setup.py bdist_wininst, ...).&lt;/p&gt;

&lt;p&gt;We also recommend users to expose as much as possible configuration in a more declarative way via the pyproject.toml&lt;br&gt;
or setup.cfg, and keep the setup.py minimal with only the dynamic parts (or even omit it completely if applicable).&lt;/p&gt;

&lt;p&gt;See &lt;a href="https://blog.ganssle.io/articles/2021/10/setup-py-deprecated.html"&gt;Why you shouldn't invoke setup.py directly&lt;/a&gt;&lt;br&gt;
for more background.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h3&gt;
  
  
  setup.cfg
&lt;/h3&gt;

&lt;p&gt;To address the issues mentioned above and make configuration more declarative, in 2016 pypa created the &lt;code&gt;setup.cfg&lt;/code&gt; file format.&lt;/p&gt;

&lt;p&gt;The example &lt;code&gt;setup.py&lt;/code&gt; file above is equivalent to the following &lt;code&gt;setup.cfg&lt;/code&gt; file:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight ini"&gt;&lt;code&gt;&lt;span class="c"&gt;# setup.cfg
&lt;/span&gt;&lt;span class="nn"&gt;[metadata]&lt;/span&gt;
&lt;span class="py"&gt;name&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;count_package&lt;/span&gt;
&lt;span class="py"&gt;version&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;0.0.1&lt;/span&gt;
&lt;span class="py"&gt;author&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;me&lt;/span&gt;
&lt;span class="py"&gt;description&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;Package for counting the number of lines in files.&lt;/span&gt;

&lt;span class="nn"&gt;[options]&lt;/span&gt;
&lt;span class="py"&gt;python_requires&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;&amp;gt;=3.7,&amp;lt;4&lt;/span&gt;
&lt;span class="py"&gt;install_requires&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt;
    &lt;span class="err"&gt;pandas&lt;/span&gt;
    &lt;span class="err"&gt;importlib-metadata&lt;/span&gt;&lt;span class="c"&gt;; python_version &amp;gt;= "3.8"
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This format has had many fans but has recently been superseded by the &lt;code&gt;pyproject.toml&lt;/code&gt; format, which is now the official way of declaring python package configuration.&lt;/p&gt;

&lt;h3&gt;
  
  
  pyproject.toml
&lt;/h3&gt;

&lt;p&gt;In addition to adopting the declarative approach of &lt;code&gt;setup.cfg&lt;/code&gt;, the &lt;code&gt;pyproject.toml&lt;/code&gt; format introduces a number of new features.&lt;/p&gt;

&lt;p&gt;It is now possible (and even mandatory) to specify the package builder.&lt;br&gt;
It is also a means of centralizing the configuration of numerous development tools in an agnostic way, rather than multiplying configuration files such as &lt;code&gt;tox.ini&lt;/code&gt;, &lt;code&gt;.coveragerc&lt;/code&gt;, etc.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;pyproject.toml&lt;/code&gt; format includes a mandatory section to define the builder to be used to build the package :&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight toml"&gt;&lt;code&gt;&lt;span class="c"&gt;# pyproject.toml&lt;/span&gt;
&lt;span class="nn"&gt;[build-system]&lt;/span&gt;
&lt;span class="py"&gt;requires&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
  &lt;span class="py"&gt;"setuptools&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;60&lt;/span&gt;&lt;span class="s"&gt;",&lt;/span&gt;&lt;span class="err"&gt;
&lt;/span&gt;  &lt;span class="py"&gt;"wheel&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.30&lt;/span&gt;&lt;span class="err"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="s"&gt;",&lt;/span&gt;&lt;span class="err"&gt;
&lt;/span&gt;  &lt;span class="py"&gt;"cython&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.29&lt;/span&gt;&lt;span class="err"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="s"&gt;",&lt;/span&gt;&lt;span class="err"&gt;
&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="py"&gt;build-backend&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"setuptools.build_meta"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;With &lt;code&gt;pyproject.toml&lt;/code&gt; it is now possible to declare to &lt;code&gt;pip&lt;/code&gt; the dependencies needed for the build!&lt;/p&gt;

&lt;p&gt;It is then perfectly possible to specify the use of a builder other than setuptools, such as flit :&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight toml"&gt;&lt;code&gt;&lt;span class="c"&gt;# pyproject.toml&lt;/span&gt;
&lt;span class="nn"&gt;[build-system]&lt;/span&gt;
&lt;span class="py"&gt;requires&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;"flit"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="py"&gt;build-backend&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"flit.api:main"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This format is gradually becoming the preferred way of centralizing package configuration.&lt;br&gt;
It is &lt;a href="https://setuptools.pypa.io/en/latest/userguide/pyproject_config.html"&gt;the format preferred by setuptools&lt;/a&gt; and a number of third-party tools use it to store their configuration: &lt;a href="https://black.readthedocs.io/en/stable/usage_and_configuration/the_basics.html#what-on-earth-is-a-pyproject-toml-file"&gt;black&lt;/a&gt;, &lt;a href="https://docs.pytest.org/en/7.2.x/reference/customize.html"&gt;pytest&lt;/a&gt;, &lt;a href="https://pycqa.github.io/isort/docs/configuration/config_files.html#pyprojecttoml-preferred-format"&gt;isort&lt;/a&gt;, etc.&lt;/p&gt;

&lt;p&gt;Here's a sample &lt;code&gt;pyproject.toml&lt;/code&gt; file for our &lt;code&gt;count_package&lt;/code&gt; example package:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight toml"&gt;&lt;code&gt;&lt;span class="c"&gt;# pyproject.toml&lt;/span&gt;
&lt;span class="nn"&gt;[build-system]&lt;/span&gt;
&lt;span class="py"&gt;requires&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;"setuptools"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="py"&gt;build-backend&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"setuptools.build_meta"&lt;/span&gt;

&lt;span class="nn"&gt;[project]&lt;/span&gt;
&lt;span class="py"&gt;name&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"count_package"&lt;/span&gt;
&lt;span class="py"&gt;version&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"0.0.1"&lt;/span&gt;
&lt;span class="py"&gt;description&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"Package for counting the number of lines in files."&lt;/span&gt;
&lt;span class="py"&gt;name&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"my_package"&lt;/span&gt;
&lt;span class="py"&gt;authors&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="py"&gt;{name&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"me"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="py"&gt;email&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"email@me.fr"&lt;/span&gt;&lt;span class="err"&gt;}&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="py"&gt;requires-python&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="py"&gt;"&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;3.8&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="err"&gt;&amp;lt;&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="s"&gt;"&lt;/span&gt;&lt;span class="err"&gt;
&lt;/span&gt;&lt;span class="py"&gt;dependencies&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="s"&gt;"pandas"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="s"&gt;'importlib-metadata; python_version &amp;gt;= "3.8"'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="py"&gt;dynamic&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;"version"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;a href="https://setuptools.pypa.io/en/latest/userguide/pyproject_config.html"&gt;setuptools documentation&lt;/a&gt; gives more details on configuration in &lt;code&gt;pyproject.toml&lt;/code&gt; format.&lt;/p&gt;

&lt;h3&gt;
  
  
  Build a distribution of your package
&lt;/h3&gt;

&lt;p&gt;Once you've written your configuration file (preferably &lt;code&gt;pyproject.toml&lt;/code&gt;), all that's left to do is build your package.&lt;/p&gt;

&lt;p&gt;The modern way to do this is to use the &lt;a href="https://pypi.org/project/build"&gt;&lt;code&gt;build&lt;/code&gt;&lt;/a&gt; package developed by pypa:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;--upgrade&lt;/span&gt; build
python &lt;span class="nt"&gt;-m&lt;/span&gt; build
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;build&lt;/code&gt; will first install the builder specified in your &lt;code&gt;pyproject.toml&lt;/code&gt; file, then use it to build an sdist distribution and a wheel.&lt;/p&gt;

&lt;p&gt;You can then use the &lt;a href="https://twine.readthedocs.io/en/stable/"&gt;&lt;code&gt;twine&lt;/code&gt;&lt;/a&gt; package to publish it on the official &lt;a href="https://pypi.org/"&gt;pypi&lt;/a&gt; repository.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;For backward compatibility with older versions of packaging libraries, &lt;a href="https://setuptools.pypa.io/en/latest/userguide/pyproject_config.html"&gt;you can create a minimal &lt;code&gt;setup.py&lt;/code&gt; file&lt;/a&gt; in addition to the &lt;code&gt;pyproject.toml&lt;/code&gt; file:&lt;/p&gt;


&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# setup.py
&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;setuptools&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;setup&lt;/span&gt;
&lt;span class="nf"&gt;setup&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  Version management
&lt;/h3&gt;

&lt;p&gt;In the &lt;code&gt;pyproject.toml&lt;/code&gt; example seen above, the package version is set manually. You therefore need to change it each time you want to publish a new version of your package.&lt;/p&gt;

&lt;p&gt;I find the tool &lt;a href="https://github.com/pypa/setuptools_scm/"&gt;&lt;code&gt;setuptools-scm&lt;/code&gt;&lt;/a&gt; very useful for managing package versions using git or mercurial.&lt;/p&gt;

&lt;p&gt;This is done very simply by adding the &lt;code&gt;pyproject.toml&lt;/code&gt; dependency and specifying that the version is dynamic:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight toml"&gt;&lt;code&gt;&lt;span class="c"&gt;# pyproject.toml&lt;/span&gt;
&lt;span class="nn"&gt;[build-system]&lt;/span&gt;
&lt;span class="py"&gt;requires&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="py"&gt;["setuptools&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;45&lt;/span&gt;&lt;span class="s"&gt;", "&lt;/span&gt;&lt;span class="py"&gt;setuptools_scm[toml]&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;6.2&lt;/span&gt;&lt;span class="s"&gt;"]&lt;/span&gt;&lt;span class="err"&gt;
&lt;/span&gt;
&lt;span class="nn"&gt;[project]&lt;/span&gt;
&lt;span class="c"&gt;# version = "0.0.1"  # Remove any existing version parameter.&lt;/span&gt;
&lt;span class="py"&gt;dynamic&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;"version"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="nn"&gt;[tool.setuptools_scm]&lt;/span&gt;
&lt;span class="py"&gt;write_to&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"src/pkg/_version.py"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When building a package, &lt;code&gt;setuptools-scm&lt;/code&gt; will search for the last tag with a valid version number and then deduce the package version number. By default, the version is built from three elements:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The last tag with a valid version number (example: &lt;code&gt;v1.2.3&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;The distance to this tag (number of revisions since this tag)&lt;/li&gt;
&lt;li&gt;Working directory status (if there are any uncommitted changes)&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Once the version number has been deduced, a &lt;code&gt;_version.py&lt;/code&gt; file will be created inside the distribution at the specified location (e.g. &lt;code&gt;src/pkg/_version.py&lt;/code&gt;), allowing the package version to be known from the distribution without the git history being present on the target environment.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;If you are in the habit of entering the version number of your packages yourself, please note that valid version formats are governed by &lt;a href="https://peps.python.org/pep-0440/#version-scheme"&gt;PEP 440&lt;/a&gt;.&lt;br&gt;
If you don't comply with these specifications, you're likely to run into problems when publishing or installing your packages.&lt;br&gt;
In particular, versions &lt;code&gt;v1.2.3-local&lt;/code&gt; or &lt;code&gt;v1.2.3-dev&lt;/code&gt; are invalid.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  The layout
&lt;/h2&gt;

&lt;p&gt;When configuring your package, regardless of the method used (&lt;code&gt;setup.py&lt;/code&gt;, &lt;code&gt;setup.cfg&lt;/code&gt; or &lt;code&gt;pyproject.toml&lt;/code&gt;), you must specify the packages &lt;strong&gt;and subpackages&lt;/strong&gt; you wish to include in your distribution:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight toml"&gt;&lt;code&gt;&lt;span class="c"&gt;# pyproject.toml&lt;/span&gt;
&lt;span class="nn"&gt;[tool.setuptools]&lt;/span&gt;
&lt;span class="py"&gt;packages&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;"mypkg"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"mypkg.subpkg1"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"mypkg.subpkg2"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Fortunately, &lt;code&gt;setuptools&lt;/code&gt; has an automatic &lt;a href="https://setuptools.pypa.io/en/latest/userguide/package_discovery.html"&gt;discovery feature&lt;/a&gt; for your packages and subpackages. This is compatible with two classic project layouts:&lt;/p&gt;

&lt;p&gt;flat-layout:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;count_package
├── count_package
│   ├── __init__.py
│   └── count_lines.py
├── tests
│   └── test_count_lines.py
├── .gitignore
├── LICENSE.md
└── README.md
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;and layout with a src-layout folder:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;projet_genial
├── src
|   ├── count_package
│   ├── __init__.py
│   └── count_lines.py
├── tests
│   └── test_count_lines.py
├── .gitignore
├── LICENSE.md
└── README.md
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The difference may seem minimal, but personally I have a strong preference for src-layout because it prevents bad habits and forces you to understand how the package import and installation system&lt;br&gt;
system works in Python.&lt;/p&gt;

&lt;p&gt;In fact, when you develop your package, you test the functionalities you add to it as you go along.&lt;/p&gt;

&lt;p&gt;To do this, we are very tempted to simply import our package from our :&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# test_file.py
&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;count_package&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;count_lines&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This will work if you use a flat-layout and run your test module from the directory containing the &lt;code&gt;count_package&lt;/code&gt; folder, because as we saw above python includes the current directory in the list of directories where it searches for modules.&lt;/p&gt;

&lt;p&gt;However, this is a bad habit for two reasons:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Firstly, if you use setup.py, which is located at the root of your project, it is able to import the &lt;code&gt;count_package&lt;/code&gt; package it is supposed to install on a client in sdist mode, which can cause bugs if you're not careful.&lt;/li&gt;
&lt;li&gt;Secondly, &lt;strong&gt;you're not really testing the package as it will be installed on others&lt;/strong&gt;! For example, you may not have thought to include data files in your configuration file, and your tests should crash as a result. But since these files are present in your working directory, you won't notice a thing.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For these reasons, I think it's best to opt for a src-layout and use an &lt;a href="https://setuptools.pypa.io/en/latest/userguide/development_mode.html"&gt;editable installation&lt;/a&gt; with the command: &lt;code&gt;pip install -e .&lt;/code&gt; for local development. This allows you to install the package by making a symbolic link with your code, so that any changes you make are immediately reflected in the installed package.&lt;/p&gt;

&lt;p&gt;For an in-depth analysis of the benefits of src-layout, I refer you to &lt;a href="https://blog.ionelmc.ro/2014/05/25/python-packaging/#the-structure"&gt;this article&lt;/a&gt; (which dates back to 2014).&lt;/p&gt;

&lt;h2&gt;
  
  
  Going further
&lt;/h2&gt;

&lt;p&gt;If you want to learn more about the packaging eco-system I found the resources listed in this article very usefull : &lt;a href="https://dev.to/astrojuanlu/best-resources-on-python-packaging-fea"&gt;🐍 Best resources on Python packaging 📖&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In the meantime here is a non-exhaustive list of alternatives to setuptools you should consider :&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://pipenv.pypa.io/en/latest/"&gt;Pipenv&lt;/a&gt;: allows you to jointly manage your project's virtual environment and its dependencies.
dependencies. Adds a valuable feature: generation of &lt;code&gt;Pipfile.lock&lt;/code&gt; files, which reference
reference exact versions of dependencies to enable identical reproduction of the development environment.
development environment.
&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--mrp67LI0--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_66%2Cw_800/https://gist.githubusercontent.com/jlusk/855d611bbcfa2b159839db73d07f6ce9/raw/7f5743401809f7e630ee8ff458faa980e19924a0/pipenv.gif" alt="Gif montrant les fonctionnalités de pipenv" width="654" height="341"&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://python-poetry.org/"&gt;Poetry&lt;/a&gt;: a powerful tool for managing virtual environments
environments, dependencies (and dependencies on dependencies), generate a &lt;code&gt;poetry.lock&lt;/code&gt; file similar to
&lt;code&gt;Pipfile.lock&lt;/code&gt;, publish your package, etc. However, it does not comply with certain PEP standards.
&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--3Ljd3plv--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_66%2Cw_800/https://raw.githubusercontent.com/python-poetry/poetry/master/assets/install.gif" alt="Gif showing poetry features" width="688" height="370"&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://pdm-project.org/latest/"&gt;PDM&lt;/a&gt;: next-generation package manager for python. Unlike PerformanceEntry, it respects PEP standards.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://hatch.pypa.io/latest/"&gt;Hatch&lt;/a&gt;: Pypa's new tool for managing python projects. It has many interesting features.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/astral-sh/uv"&gt;uv&lt;/a&gt;: pip written in rust to make it 10 to 100 times faster.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Bonus
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Try a next-gen manager &lt;a href="https://hatch.pypa.io/latest/"&gt;Hatch&lt;/a&gt;, &lt;a href="https://pdm-project.org/latest/"&gt;PDM&lt;/a&gt; or &lt;a href="https://python-poetry.org/"&gt;Poetry&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;I'm sharing my simple &lt;a href="https://github.com/h4c5/cookie-python-minimal"&gt;template &lt;code&gt;cookie-cutter&lt;/code&gt;&lt;/a&gt; for your python projects.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Let's be happy
&lt;/h3&gt;

&lt;p&gt;Everything I've told you here may seem like a lot, and yet I've only skimmed the surface. In any case, I think we can count ourselves lucky when we see what the Sam &amp;amp; Max site wrote in 2018:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;First we have &lt;code&gt;distutils&lt;/code&gt;, &lt;code&gt;setuptools&lt;/code&gt;, &lt;code&gt;distribute&lt;/code&gt;, and &lt;code&gt;distribute2&lt;/code&gt; which were all at one time the "standards recommended for packaging a lib.&lt;br&gt;
Then came the days of &lt;code&gt;eggs&lt;/code&gt;, &lt;code&gt;exe&lt;/code&gt;, and other stuff that &lt;code&gt;easy_install&lt;/code&gt; would go and find anywhere in the wild, blindly following links on PyPi. &lt;br&gt;
Not to mention the stuff that had to be compile at every turn.&lt;br&gt;
Besides, nothing was encrypted when downloaded, and &lt;code&gt;pip&lt;/code&gt; wasn't packaged with Python.&lt;br&gt;
Python, it was dying on stupid errors like badly managed encoding...&lt;br&gt;
On top of that, &lt;code&gt;virtualenv&lt;/code&gt; was a separate thing, with lots of competitors, and linked system packages by default. &lt;br&gt;
Not to mention that we didn't have &lt;code&gt;python -m&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;In short, Python packaging was a real mess. Not to mention a shitty documentation.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;It's a lot simpler these days.&lt;/p&gt;

&lt;h2&gt;
  
  
  References
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;pypa's guide to python packaging:
&lt;a href="https://packaging.python.org/en/latest/overview/"&gt;An Overview of Packaging for Python&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;An article on wheels:
&lt;a href="https://realpython.com/python-wheels/#python-packaging-made-better-an-intro-to-python-wheels"&gt;What Are Python Wheels and Why Should You Care?&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;A series of three very enlightening articles on how Python packaging works, written by Bernát Gábor in
2019 :

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://bernat.tech/posts/pep-517-and-python-packaging/"&gt;The state of Python Packaging&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Python Packaging - Past, Present, Future](&lt;a href="https://bernat.tech/posts/pep-517-518/"&gt;https://bernat.tech/posts/pep-517-518/&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;&lt;a href="https://bernat.tech/posts/growing-pain/"&gt;Python packaging - Growing Pains&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;An article in praise of &lt;code&gt;setup.cfg&lt;/code&gt; on the late Sam &amp;amp; Max site:
&lt;a href="https://sametmax2.com/vive-setup-cfg-et-mort-a-pyproject-toml/index.html"&gt;about setup.cfg&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Stackoverflow question:
&lt;a href="https://stackoverflow.com/questions/62983756/what-is-pyproject-toml-file-for"&gt;What is pyproject.toml file for&lt;/a&gt;
&lt;/li&gt;
&lt;/ol&gt;

</description>
      <category>python</category>
      <category>packaging</category>
      <category>learning</category>
      <category>pip</category>
    </item>
    <item>
      <title>JSON in data science projects: tips &amp; tricks</title>
      <dc:creator>Hakim</dc:creator>
      <pubDate>Thu, 07 Mar 2024 00:20:00 +0000</pubDate>
      <link>https://dev.to/h4c5/json-in-data-science-projects-tips-tricks-2n1p</link>
      <guid>https://dev.to/h4c5/json-in-data-science-projects-tips-tricks-2n1p</guid>
      <description>&lt;p&gt;Some useful tips and libraries for manipulating json in your data science projects.&lt;/p&gt;

&lt;h2&gt;
  
  
  The standard &lt;code&gt;json&lt;/code&gt; module
&lt;/h2&gt;

&lt;p&gt;Python has a standard module called &lt;code&gt;json&lt;/code&gt; that lets you quickly manipulate JSON files.&lt;/p&gt;

&lt;h3&gt;
  
  
  Loading
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;

&lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;data/example.json&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;r&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;load&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;data&lt;/span&gt;
&lt;span class="c1"&gt;# [{'id': 0, 'content': [0.0, 0.0, 1.0]}, {'id': 1, 'content': [0.0, 1.0, 0.0]}]
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Backup
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;First tip&lt;/strong&gt;: when working with textual data, the &lt;code&gt;ensure_ascii=False&lt;/code&gt; option is very useful&lt;br&gt;
to preserve, among other things, accents when saving&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;data/example.json&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;w&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dump&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ensure_ascii&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Second tip&lt;/strong&gt;: the &lt;code&gt;indent&lt;/code&gt; option in the &lt;code&gt;dump&lt;/code&gt; method indents the data in the backup file.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;data/example.json&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;w&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dump&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ensure_ascii&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;indent&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Issues related to numpy
&lt;/h2&gt;

&lt;p&gt;Data science projects often use &lt;a href="https://numpy.org/"&gt;&lt;code&gt;numpy&lt;/code&gt;&lt;/a&gt;.&lt;br&gt;
However, &lt;code&gt;numpy&lt;/code&gt; objects are not JSON-serializable and therefore require conversion to standard python objects in order to be saved:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;numpy&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;

&lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;array&lt;/span&gt;&lt;span class="p"&gt;([[&lt;/span&gt;&lt;span class="mf"&gt;0.&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;1.&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mf"&gt;0.&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;1.&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.&lt;/span&gt;&lt;span class="p"&gt;]])&lt;/span&gt;

&lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;data/numpy.json&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;w&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dump&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ensure_ascii&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="o"&gt;---------------------------------------------------------------------------&lt;/span&gt;
&lt;span class="nb"&gt;TypeError&lt;/span&gt; &lt;span class="nc"&gt;Traceback &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;most&lt;/span&gt; &lt;span class="n"&gt;recent&lt;/span&gt; &lt;span class="n"&gt;call&lt;/span&gt; &lt;span class="n"&gt;last&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;Cell&lt;/span&gt; &lt;span class="n"&gt;In&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;12&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;line&lt;/span&gt; &lt;span class="mi"&gt;6&lt;/span&gt;
      &lt;span class="mi"&gt;3&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;array&lt;/span&gt;&lt;span class="p"&gt;([[&lt;/span&gt;&lt;span class="mf"&gt;0.&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;1.&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mf"&gt;0.&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;1.&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.&lt;/span&gt;&lt;span class="p"&gt;]])&lt;/span&gt;
      &lt;span class="mi"&gt;5&lt;/span&gt; &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;data/numpy.json&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;w&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="o"&gt;----&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;6&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dump&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ensure_ascii&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nb"&gt;TypeError&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Object&lt;/span&gt; &lt;span class="n"&gt;of&lt;/span&gt; &lt;span class="nb"&gt;type&lt;/span&gt; &lt;span class="n"&gt;ndarray&lt;/span&gt; &lt;span class="ow"&gt;is&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;JSON&lt;/span&gt; &lt;span class="n"&gt;serializable&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;By converting the array into a list, the &lt;code&gt;data&lt;/code&gt; object can be saved:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;data/numpy.json&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;w&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dump&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;tolist&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ensure_ascii&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;But that's not very practical...&lt;/p&gt;

&lt;p&gt;One solution is to create a custom &lt;a href="https://docs.python.org/fr/3/library/json.html#json.JSONEncoder"&gt;&lt;code&gt;JSONEncoder&lt;/code&gt;&lt;/a&gt;&lt;br&gt;
which converts the &lt;code&gt;numpy.ndarray&lt;/code&gt; using its &lt;code&gt;tolist&lt;/code&gt; method at save time:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;NumpyJSONEncoder&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;JSONEncoder&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;JSONEncoder to store python dict or list containing numpy arrays&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;default&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;obj&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Transform numpy arrays into JSON serializable object such as list
        see : https://docs.python.org/3/library/json.html#json.JSONEncoder.default
        &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;isinstance&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;obj&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ndarray&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;obj&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;tolist&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;JSONEncoder&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;default&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;obj&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;data/numpy.json&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;w&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dump&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ensure_ascii&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;cls&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;NumpyJSONEncoder&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  The &lt;code&gt;orjson&lt;/code&gt; library
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://github.com/ijl/orjson/tree/master"&gt;&lt;code&gt;orjson&lt;/code&gt;&lt;/a&gt; is the fastest JSON library available for python. It natively manages &lt;a href="https://docs.python.org/fr/3/library/dataclasses.html"&gt;dataclass&lt;/a&gt; objects,&lt;br&gt;
&lt;a href="https://docs.python.org/fr/3/library/datetime.html"&gt;datetime&lt;/a&gt;, &lt;a href="https://numpy.org/"&gt;numpy&lt;/a&gt;&lt;br&gt;
and &lt;a href="https://docs.python.org/3/library/uuid.html"&gt;UUID&lt;/a&gt; objects.&lt;/p&gt;

&lt;p&gt;A few things to remember when working with &lt;code&gt;orjson&lt;/code&gt; :&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;There is no &lt;code&gt;load&lt;/code&gt; or &lt;code&gt;dump&lt;/code&gt; method, you have to use &lt;code&gt;loads&lt;/code&gt; and &lt;code&gt;dumps&lt;/code&gt; instead.&lt;/li&gt;
&lt;li&gt;You must use flags to use certain functionalities, such as &lt;code&gt;orjson.OPT_SERIALIZE_NUMPY&lt;/code&gt; to serialize
serialize &lt;code&gt;numpy&lt;/code&gt; objects
&lt;/li&gt;
&lt;/ul&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;orjson&lt;/span&gt;

&lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;data/example.json&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;rb&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;orjson&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;loads&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;read&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;

&lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;data/example.json&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;wb&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;write&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;orjson&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dumps&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;option&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;orjson&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;OPT_SERIALIZE_NUMPY&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Note that the json file is written in binary (hence &lt;code&gt;rb&lt;/code&gt; and &lt;code&gt;wb&lt;/code&gt;).&lt;/p&gt;
&lt;h3&gt;
  
  
  Performance
&lt;/h3&gt;

&lt;p&gt;orjson claims to serialize numpy.ndarray 4 to 12 times faster than the standard library. This can be&lt;br&gt;
by comparing the two methods described above:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;randn&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;)}&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;save_json&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;data/fast.json&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;w&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dump&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ensure_ascii&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;cls&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;NumpyJSONEncoder&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;save_orjson&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;data/orfast.json&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;wb&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;write&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;orjson&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dumps&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;option&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;orjson&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;OPT_SERIALIZE_NUMPY&lt;/span&gt;&lt;span class="o"&gt;|&lt;/span&gt;&lt;span class="n"&gt;orjson&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;OPT_NON_STR_KEYS&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

&lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="n"&gt;timeit&lt;/span&gt; &lt;span class="nf"&gt;save_json&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="n"&gt;timeit&lt;/span&gt; &lt;span class="nf"&gt;save_orjson&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="c1"&gt;# 15.5 ms ± 251 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
# 1.15 ms ± 66.3 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In this example, orjson is more than 10 times faster. Note that the &lt;code&gt;OPT_NON_STR_KEYS&lt;/code&gt; option is used to enable&lt;br&gt;
allow orjson to save non-string keys.&lt;/p&gt;

&lt;p&gt;(Tests run with python 3.11)&lt;/p&gt;
&lt;h3&gt;
  
  
  FastAPI and orjson
&lt;/h3&gt;

&lt;p&gt;The &lt;code&gt;FastAPI&lt;/code&gt; documentation contains a guide to using &lt;code&gt;orjson&lt;/code&gt;](&lt;a href="https://fastapi.tiangolo.com/advanced/custom-response/?h=orjson#use-orjsonresponse"&gt;https://fastapi.tiangolo.com/advanced/custom-response/?h=orjson#use-orjsonresponse&lt;/a&gt;) to serialize JSON responses.&lt;/p&gt;

&lt;p&gt;This is particularly useful for APIs that expose machine learning models whose outputs are often numpy.ndarray&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;numpy&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;

&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;fastapi&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;FastAPI&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;fastapi.responses&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;ORJSONResponse&lt;/span&gt;

&lt;span class="n"&gt;app&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;FastAPI&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="nd"&gt;@app.get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/random-vector&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;response_class&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;ORJSONResponse&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get_random_vector&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nc"&gt;ORJSONResponse&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;randn&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  JSON Lines
&lt;/h2&gt;

&lt;p&gt;When working with the JSON format, it's not uncommon to manipulate collections of objects.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"toto"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"titi"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To be valid, objects must be contained in a JSON list, hence the square brackets around the objects in the collection. However, this is not at all practical for reading large volumes of data, as you have to parse the entire file the entire file and load everything into memory.&lt;/p&gt;

&lt;p&gt;This can be remedied by using the [JSON Lines] format (&lt;a href="https://jsonlines.org/"&gt;https://jsonlines.org/&lt;/a&gt;). This involves nothing more and nothing less than  placing one JSON object per line, so that you can browse the objects without having to parse the entire&lt;br&gt;
collection all at once.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{"id": 0, "name": "toto"}
{"id": 1, "name": "titi"}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The [&lt;code&gt;jsonlines&lt;/code&gt;] library (&lt;a href="https://jsonlines.readthedocs.io/en/latest/index.html"&gt;https://jsonlines.readthedocs.io/en/latest/index.html&lt;/a&gt;) is very useful for manipulating&lt;br&gt;
such files. It can also be combined with &lt;code&gt;orjson&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;jsonlines&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;orjson&lt;/span&gt;

&lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;jsonlines&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;data/many_examples.jsonl&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;r&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;loads&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;orjson&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;loads&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;reader&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;obj&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;reader&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;obj&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# {'id': 0, 'name': 'toto'}
# {'id': 1, 'name': 'titi'}
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  TL;DR
&lt;/h2&gt;

&lt;p&gt;A brief summary of the tips seen here:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Use &lt;code&gt;ensure_ascii=False&lt;/code&gt; when working with the standard &lt;code&gt;json&lt;/code&gt; library.&lt;/li&gt;
&lt;li&gt;Consider the &lt;code&gt;orjson&lt;/code&gt; library for the performance and functionality it offers.&lt;/li&gt;
&lt;li&gt;Consider the &lt;code&gt;JSON Lines&lt;/code&gt; format for collections of JSON objects.&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>python</category>
      <category>tutorial</category>
      <category>json</category>
    </item>
    <item>
      <title>JSON dans les projets data science : Trucs &amp; Astuces</title>
      <dc:creator>Hakim</dc:creator>
      <pubDate>Wed, 06 Mar 2024 23:51:39 +0000</pubDate>
      <link>https://dev.to/h4c5/json-dans-les-projets-data-science-trucs-astuces-4mpf</link>
      <guid>https://dev.to/h4c5/json-dans-les-projets-data-science-trucs-astuces-4mpf</guid>
      <description>&lt;p&gt;Quelques astuces et librairies utiles pour manipuler du json dans vos projets de data science.&lt;/p&gt;

&lt;h2&gt;
  
  
  Le module standard &lt;code&gt;json&lt;/code&gt;
&lt;/h2&gt;

&lt;p&gt;Python possède un module standard nommé &lt;code&gt;json&lt;/code&gt; qui permet de manipuler rapidement des fichiers JSON&lt;/p&gt;

&lt;h3&gt;
  
  
  Chargement
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;

&lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;data/example.json&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;r&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;load&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;data&lt;/span&gt;
&lt;span class="c1"&gt;# [{'id': 0, 'content': [0.0, 0.0, 1.0]}, {'id': 1, 'content': [0.0, 1.0, 0.0]}]
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Sauvegarde
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Première astuce&lt;/strong&gt; : lorsque l'on travaille avec des données textuelles l'option &lt;code&gt;ensure_ascii=False&lt;/code&gt; est très utile &lt;br&gt;
pour préserver, entre autres, les accents lors de la sauvegarde&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;data/example.json&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;w&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dump&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ensure_ascii&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Deuxième astuce&lt;/strong&gt; : l'option &lt;code&gt;indent&lt;/code&gt; de la méthode &lt;code&gt;dump&lt;/code&gt; permet d'indenter les données dans le fichier de sauvegarde&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;data/example.json&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;w&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dump&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ensure_ascii&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;indent&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Les problématiques liées à numpy
&lt;/h2&gt;

&lt;p&gt;On utilise très souvent &lt;a href="https://numpy.org/"&gt;&lt;code&gt;numpy&lt;/code&gt;&lt;/a&gt; dans les projets de data science.&lt;br&gt;
Or les objets &lt;code&gt;numpy&lt;/code&gt; ne sont pas JSON-sérialisable et nécessite donc une conversion en objets standards python pour &lt;br&gt;
être sauvegardés :&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;numpy&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;

&lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;array&lt;/span&gt;&lt;span class="p"&gt;([[&lt;/span&gt;&lt;span class="mf"&gt;0.&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;1.&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mf"&gt;0.&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;1.&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.&lt;/span&gt;&lt;span class="p"&gt;]])&lt;/span&gt;

&lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;data/numpy.json&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;w&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dump&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ensure_ascii&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="o"&gt;---------------------------------------------------------------------------&lt;/span&gt;
&lt;span class="nb"&gt;TypeError&lt;/span&gt;                                 &lt;span class="nc"&gt;Traceback &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;most&lt;/span&gt; &lt;span class="n"&gt;recent&lt;/span&gt; &lt;span class="n"&gt;call&lt;/span&gt; &lt;span class="n"&gt;last&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;Cell&lt;/span&gt; &lt;span class="n"&gt;In&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;12&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;line&lt;/span&gt; &lt;span class="mi"&gt;6&lt;/span&gt;
      &lt;span class="mi"&gt;3&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;array&lt;/span&gt;&lt;span class="p"&gt;([[&lt;/span&gt;&lt;span class="mf"&gt;0.&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;1.&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mf"&gt;0.&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;1.&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.&lt;/span&gt;&lt;span class="p"&gt;]])&lt;/span&gt;
      &lt;span class="mi"&gt;5&lt;/span&gt; &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;data/numpy.json&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;w&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="o"&gt;----&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;6&lt;/span&gt;     &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dump&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ensure_ascii&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nb"&gt;TypeError&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Object&lt;/span&gt; &lt;span class="n"&gt;of&lt;/span&gt; &lt;span class="nb"&gt;type&lt;/span&gt; &lt;span class="n"&gt;ndarray&lt;/span&gt; &lt;span class="ow"&gt;is&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;JSON&lt;/span&gt; &lt;span class="n"&gt;serializable&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;En convertissant le tableau en liste, on parvient à sauvegarder l'objet &lt;code&gt;data&lt;/code&gt; :&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;data/numpy.json&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;w&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dump&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;tolist&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ensure_ascii&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Mais ce n'est pas très pratique...&lt;/p&gt;

&lt;p&gt;Une solution est de créer un &lt;a href="https://docs.python.org/fr/3/library/json.html#json.JSONEncoder"&gt;&lt;code&gt;JSONEncoder&lt;/code&gt;&lt;/a&gt; personnalisé qui convertit les &lt;code&gt;numpy.ndarray&lt;/code&gt; grâce à leur méthode &lt;code&gt;tolist&lt;/code&gt; au moment de la sauvegarde :&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;NumpyJSONEncoder&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;JSONEncoder&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;JSONEncoder to store python dict or list containing numpy arrays&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;default&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;obj&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Transform numpy arrays into JSON serializable object such as list
        see : https://docs.python.org/3/library/json.html#json.JSONEncoder.default
        &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;isinstance&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;obj&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ndarray&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;obj&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;tolist&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;JSONEncoder&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;default&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;obj&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;data/numpy.json&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;w&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dump&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ensure_ascii&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;cls&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;NumpyJSONEncoder&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  La librairie &lt;code&gt;orjson&lt;/code&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://github.com/ijl/orjson/tree/master"&gt;&lt;code&gt;orjson&lt;/code&gt;&lt;/a&gt; est la librairie JSON la plus rapide disponible pour python. De&lt;br&gt;
plus elle gère nativement les objets &lt;a href="https://docs.python.org/fr/3/library/dataclasses.html"&gt;dataclass&lt;/a&gt;, &lt;br&gt;
&lt;a href="https://docs.python.org/fr/3/library/datetime.html"&gt;datetime&lt;/a&gt;, &lt;a href="https://numpy.org/"&gt;numpy&lt;/a&gt; &lt;br&gt;
et &lt;a href="https://docs.python.org/3/library/uuid.html"&gt;UUID&lt;/a&gt; ce qui est très pratique.&lt;/p&gt;

&lt;p&gt;Quelques éléments à retenir lorsque l'on travaille avec &lt;code&gt;orjson&lt;/code&gt; : &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Il n'y a pas de méthode &lt;code&gt;load&lt;/code&gt; ni &lt;code&gt;dump&lt;/code&gt;, il faut utiliser les méthodes &lt;code&gt;loads&lt;/code&gt; et &lt;code&gt;dumps&lt;/code&gt; à la place.&lt;/li&gt;
&lt;li&gt;Il faut utiliser des flags pour utiliser certaines fonctionnalités tels que &lt;code&gt;orjson.OPT_SERIALIZE_NUMPY&lt;/code&gt; pour 
sérialiser les objets &lt;code&gt;numpy&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;orjson&lt;/span&gt;

&lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;data/example.json&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;rb&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;orjson&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;loads&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;read&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;

&lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;data/example.json&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;wb&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;write&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;orjson&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dumps&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;option&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;orjson&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;OPT_SERIALIZE_NUMPY&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;A noter que l'on écrit dans le fichier json en binaire (d'où &lt;code&gt;rb&lt;/code&gt; et &lt;code&gt;wb&lt;/code&gt;).&lt;/p&gt;
&lt;h3&gt;
  
  
  Performances
&lt;/h3&gt;

&lt;p&gt;orjson annonce une sérialisation des &lt;code&gt;numpy.ndarray&lt;/code&gt; 4 à 12 fois plus rapide qu'avec la librairie standard. On peut le&lt;br&gt;
vérifier sur un exemple en comparant les deux méthodes présentées plus haut :&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;randn&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;)}&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;save_json&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;data/fast.json&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;w&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dump&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ensure_ascii&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;cls&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;NumpyJSONEncoder&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;save_orjson&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;data/orfast.json&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;wb&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;write&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;orjson&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dumps&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;option&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;orjson&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;OPT_SERIALIZE_NUMPY&lt;/span&gt;&lt;span class="o"&gt;|&lt;/span&gt;&lt;span class="n"&gt;orjson&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;OPT_NON_STR_KEYS&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

&lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="n"&gt;timeit&lt;/span&gt; &lt;span class="nf"&gt;save_json&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="n"&gt;timeit&lt;/span&gt; &lt;span class="nf"&gt;save_orjson&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;


&lt;span class="c1"&gt;# 15.5 ms ± 251 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
# 1.15 ms ± 66.3 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Dans cet exemple orjson est plus de 10 fois plus rapide. A noter que l'option &lt;code&gt;OPT_NON_STR_KEYS&lt;/code&gt; est utilisé pour &lt;br&gt;
permettre à orjson de sauvegarder des clés qui ne sont pas des chaines de caractères. &lt;/p&gt;

&lt;p&gt;(Tests exécutés avec python 3.11)&lt;/p&gt;
&lt;h3&gt;
  
  
  FastAPI et orjson
&lt;/h3&gt;

&lt;p&gt;La &lt;a href="https://fastapi.tiangolo.com/advanced/custom-response/?h=orjson#use-orjsonresponse"&gt;documentation de &lt;code&gt;FastAPI&lt;/code&gt; contient un guide pour utiliser &lt;code&gt;orjson&lt;/code&gt;&lt;/a&gt; &lt;br&gt;
pour sérialiser les réponses JSON.&lt;/p&gt;

&lt;p&gt;C'est particulièrement pratique pour les API qui exposent des modèles de machine learning dont les sorties sont souvent&lt;br&gt;
des &lt;code&gt;numpy.ndarray&lt;/code&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;numpy&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;

&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;fastapi&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;FastAPI&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;fastapi.responses&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;ORJSONResponse&lt;/span&gt;


&lt;span class="n"&gt;app&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;FastAPI&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;


&lt;span class="nd"&gt;@app.get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/random-vector&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;response_class&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;ORJSONResponse&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get_random_vector&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nc"&gt;ORJSONResponse&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;randn&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  JSON Lines
&lt;/h2&gt;

&lt;p&gt;Il arrive très souvent que l'on manipule des collections d'objets lorsque l'on travaille avec le format JSON.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"toto"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"titi"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Pour être valide, le objets doivent être contenus dans une liste JSON, d'où les crochets encadrant les objets de la&lt;br&gt;
collection. Cependant cela n'est pas du tout pratique pour lire de gros volume de données puisqu'il faut parser &lt;br&gt;
l'intégraliter du fichier et tout charger en mémoire.&lt;/p&gt;

&lt;p&gt;Pour remédier à cela, on peut utilise le format &lt;a href="https://jsonlines.org/"&gt;JSON Lines&lt;/a&gt;. Il ne s'agit ni plus ni moins que &lt;br&gt;
de placer un objet JSON par ligne de façon à pouvoir parcourir les objets sans devoir parser l'intégralité de la &lt;br&gt;
collection d'un seul coup.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{"id": 0, "name": "toto"}
{"id": 1, "name": "titi"}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;La librairie &lt;a href="https://jsonlines.readthedocs.io/en/latest/index.html"&gt;&lt;code&gt;jsonlines&lt;/code&gt;&lt;/a&gt; est très pratique pour manipuler&lt;br&gt;
de tels fichiers. De plus on peut la combiner avec &lt;code&gt;orjson&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;jsonlines&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;orjson&lt;/span&gt;

&lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;jsonlines&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;data/many_examples.jsonl&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;r&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;loads&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;orjson&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;loads&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;reader&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;obj&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;reader&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;obj&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# {'id': 0, 'name': 'toto'}
# {'id': 1, 'name': 'titi'}
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  TL;DR
&lt;/h2&gt;

&lt;p&gt;Petit résumé des astuces vues ici : &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Utilier &lt;code&gt;ensure_ascii=False&lt;/code&gt; lorsqu'on travaille avec la librairie standard &lt;code&gt;json&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Considérer la librairie &lt;code&gt;orjson&lt;/code&gt; pour les performances et les fonctionnalités qu'elle apporte&lt;/li&gt;
&lt;li&gt;Penser au format &lt;code&gt;JSON Lines&lt;/code&gt; pour les collections d'objets JSON&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>python</category>
      <category>tutorial</category>
      <category>french</category>
      <category>json</category>
    </item>
  </channel>
</rss>
