<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Ana Carolina Branco Neumann</title>
    <description>The latest articles on DEV Community by Ana Carolina Branco Neumann (@anacbneumann).</description>
    <link>https://dev.to/anacbneumann</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F833905%2F14552d52-7084-41eb-a2f3-11cc44f2eb72.jpeg</url>
      <title>DEV Community: Ana Carolina Branco Neumann</title>
      <link>https://dev.to/anacbneumann</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/anacbneumann"/>
    <language>en</language>
    <item>
      <title>Churn Prediction - Telco Company</title>
      <dc:creator>Ana Carolina Branco Neumann</dc:creator>
      <pubDate>Tue, 28 Jan 2025 01:31:05 +0000</pubDate>
      <link>https://dev.to/anacbneumann/churn-prediction-telco-company-2i6b</link>
      <guid>https://dev.to/anacbneumann/churn-prediction-telco-company-2i6b</guid>
      <description>&lt;h2&gt;
  
  
  &lt;em&gt;Churn Prediction - Telco Company&lt;/em&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;📌 Dataset Source:&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
&lt;a href="https://www.kaggle.com/datasets/blastchar/telco-customer-churn?resource=download" rel="noopener noreferrer"&gt;Telco Customer Churn - Kaggle&lt;/a&gt;&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;📂 GitHub Repository:&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
&lt;a href="https://github.com/anacbneumann/churn_prediction_telco" rel="noopener noreferrer"&gt;Telco Customer Churn - GitHub&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  &lt;em&gt;About the Project&lt;/em&gt;
&lt;/h2&gt;

&lt;p&gt;This project leverages &lt;em&gt;Machine Learning&lt;/em&gt; to predict customer churn for a telecommunications company. The main goal is to identify patterns indicating the likelihood of cancellation, enabling the company to implement proactive retention strategies before customers discontinue their services.&lt;/p&gt;

&lt;p&gt;The primary focus is on the &lt;em&gt;Recall&lt;/em&gt; metric: capturing as many churners as possible matters more than avoiding false positives, because a preventive retention action aimed at a customer who would have stayed is more advantageous for the business than losing a churner who was never flagged.&lt;/p&gt;




&lt;h2&gt;
  
  
  &lt;em&gt;Exploratory Data Analysis&lt;/em&gt;
&lt;/h2&gt;

&lt;p&gt;During the EDA, patterns in the dataset were explored to understand the factors associated with churn. Key findings include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Monthly vs. Long-Term Contracts:&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Customers with monthly contracts showed a higher likelihood of churn, suggesting that long-term contracts may encourage loyalty.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Additional Services:&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Customers subscribing to additional services, such as online security or technical support, tended to churn less.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Tenure and Monthly Charges:&lt;/strong&gt;  &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Customers with longer tenure (more months with the company) exhibited greater loyalty.
&lt;/li&gt;
&lt;li&gt;Higher &lt;code&gt;MonthlyCharges&lt;/code&gt; correlated positively with churn.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;&lt;p&gt;&lt;strong&gt;Removal of TotalCharges:&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
The &lt;code&gt;TotalCharges&lt;/code&gt; column was removed due to high collinearity with &lt;code&gt;tenure&lt;/code&gt;, which could affect the model's stability.&lt;/p&gt;&lt;/li&gt;

&lt;/ul&gt;
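&lt;p&gt;The collinearity check behind the last finding can be sketched as follows. This is a minimal illustration on synthetic data, not the real Telco dataset: it assumes, as a simplification, that &lt;code&gt;TotalCharges&lt;/code&gt; is roughly &lt;code&gt;tenure&lt;/code&gt; times &lt;code&gt;MonthlyCharges&lt;/code&gt;, and the value ranges are made up.&lt;/p&gt;

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the dataset (assumption: values and ranges are made up).
rng = np.random.default_rng(0)
tenure = rng.integers(1, 73, size=500)        # months with the company
monthly = rng.uniform(20, 120, size=500)      # monthly charge

df = pd.DataFrame({
    "tenure": tenure,
    "MonthlyCharges": monthly,
    # Simplifying assumption: total charges accumulate as tenure * monthly charge.
    "TotalCharges": tenure * monthly,
})

# Pairwise Pearson correlations between the numeric columns.
corr = df.corr()
tenure_total_corr = corr.loc["tenure", "TotalCharges"]
print(corr.round(2))

# Drop the redundant column once the strong correlation is confirmed.
df = df.drop(columns=["TotalCharges"])
```

&lt;p&gt;On the real data the same &lt;code&gt;df.corr()&lt;/code&gt; call would be run after converting &lt;code&gt;TotalCharges&lt;/code&gt; to a numeric type.&lt;/p&gt;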




&lt;h2&gt;
  
  
  &lt;em&gt;Technical Choices&lt;/em&gt;
&lt;/h2&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Why the SVM Algorithm?&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;The &lt;em&gt;Support Vector Machine (SVM)&lt;/em&gt; was chosen for several reasons:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Efficiency with Smaller Datasets:&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
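With approximately 7,000 rows, the dataset is small enough for SVM to train in a reasonable time while still capturing complex patterns; the regularization parameter &lt;code&gt;C&lt;/code&gt; helps limit overfitting.  &lt;/p&gt;&lt;/li&gt;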
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Flexible Kernel Options:&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
GridSearchCV evaluates both the &lt;code&gt;linear&lt;/code&gt; and &lt;code&gt;rbf&lt;/code&gt; kernels and keeps the one that scores best, so the model can fit either a linear or a non-linear decision boundary.  &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Binary Classification:&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
SVM is well-suited for binary problems like this one, where the goal is to predict churn (&lt;em&gt;Yes&lt;/em&gt; or &lt;em&gt;No&lt;/em&gt;).&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Preprocessing Steps:&lt;/strong&gt;
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Scaling (MinMaxScaler):&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Models like SVM are sensitive to differences in scale. Scaling was applied to normalize numeric variables between 0 and 1.  &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Encoding (OneHotEncoder):&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Categorical variables were transformed into dummy variables. This ensures that categories are properly represented in a format the model can understand.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
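&lt;p&gt;The two preprocessing steps above can be sketched with scikit-learn's &lt;code&gt;ColumnTransformer&lt;/code&gt;. This is a minimal sketch on three toy rows, not the project's actual script: the choice of columns and the use of &lt;code&gt;ColumnTransformer&lt;/code&gt; are assumptions made for illustration.&lt;/p&gt;

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

# Toy rows for illustration only; column names follow the Kaggle dataset.
toy = pd.DataFrame({
    "tenure": [1, 24, 72],
    "MonthlyCharges": [29.85, 56.95, 110.0],
    "Contract": ["Month-to-month", "One year", "Two year"],
})

preprocess = ColumnTransformer(
    [
        # Rescale numeric columns to the [0, 1] range.
        ("scale", MinMaxScaler(), ["tenure", "MonthlyCharges"]),
        # Turn each category into its own 0/1 dummy column.
        ("encode", OneHotEncoder(handle_unknown="ignore"), ["Contract"]),
    ],
    sparse_threshold=0,  # always return a dense array, easier to inspect
)

features = preprocess.fit_transform(toy)
print(features.shape)  # 2 scaled columns + 3 one-hot columns
print(features)
```

&lt;p&gt;Fitting the transformer on the training split only, and reusing it on the test split, keeps information from the test data out of the scaling ranges.&lt;/p&gt;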

&lt;h3&gt;
  
  
  &lt;strong&gt;Data Splitting and Validation:&lt;/strong&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;The dataset was split into 70% training and 30% testing sets.
&lt;/li&gt;
&lt;li&gt;Validation was conducted using &lt;em&gt;5-fold cross-validation&lt;/em&gt; to ensure robust results.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  &lt;em&gt;Machine Learning Pipeline&lt;/em&gt;
&lt;/h2&gt;

&lt;p&gt;The implementation followed these steps:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Dataset Splitting:&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
The dependent variable (&lt;code&gt;Churn&lt;/code&gt;) and independent variables were separated, ensuring proper data splitting for training and testing.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Hyperparameter Tuning for SVM:&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Optimization was performed using GridSearchCV, adjusting:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;C:&lt;/strong&gt; Regularization parameter, controlling the trade-off between margin and error.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Kernel:&lt;/strong&gt; Evaluation of &lt;code&gt;linear&lt;/code&gt; and &lt;code&gt;rbf&lt;/code&gt; kernels.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Model Evaluation Metrics:&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
The model was assessed using:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Accuracy:&lt;/strong&gt; Percentage of correct predictions.
&lt;/li&gt;
&lt;li&gt;
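&lt;strong&gt;Recall:&lt;/strong&gt; Share of actual churners that the model identified (true positives over all actual churners).
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Precision:&lt;/strong&gt; Share of predicted churners that actually churned (true positives over all predicted churners).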
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;F1 Score:&lt;/strong&gt; Harmonic mean of precision and recall.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;ROC AUC:&lt;/strong&gt; The model's ability to distinguish between classes.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;
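&lt;p&gt;The steps above can be sketched end to end as follows. This is a hedged illustration, not the code in &lt;code&gt;ML_application.py&lt;/code&gt;: the data is synthetic, the churn rule is invented so that there is some signal, the grid values are hypothetical, and stratifying the split and tuning on &lt;code&gt;scoring="recall"&lt;/code&gt; are assumptions that match the stated focus on Recall.&lt;/p&gt;

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder
from sklearn.svm import SVC

# Synthetic stand-in for the Telco data (assumption: not the real dataset).
rng = np.random.default_rng(0)
n = 300
X = pd.DataFrame({
    "tenure": rng.integers(1, 73, size=n),
    "MonthlyCharges": rng.uniform(20, 120, size=n),
    "Contract": rng.choice(["Month-to-month", "One year", "Two year"], size=n),
})
# Invented rule: monthly contracts with short tenure churn more often.
churn_score = (
    (X["Contract"] == "Month-to-month") * 1.0
    - X["tenure"] / 72
    + rng.normal(0, 0.3, size=n)
)
y = (churn_score > 0.3).astype(int)

# 70% training / 30% testing; stratify keeps the churn rate similar in both.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

pipeline = Pipeline([
    ("preprocess", ColumnTransformer([
        ("scale", MinMaxScaler(), ["tenure", "MonthlyCharges"]),
        ("encode", OneHotEncoder(handle_unknown="ignore"), ["Contract"]),
    ])),
    ("svm", SVC()),
])

# Hypothetical grid; the values actually searched are not given in the post.
param_grid = {"svm__C": [0.1, 1, 10], "svm__kernel": ["linear", "rbf"]}

# 5-fold cross-validation, selecting the combination with the best recall.
search = GridSearchCV(pipeline, param_grid, cv=5, scoring="recall")
search.fit(X_train, y_train)

test_recall = search.score(X_test, y_test)
print(search.best_params_)
print("test recall:", test_recall)
```

&lt;p&gt;Keeping the preprocessing inside the &lt;code&gt;Pipeline&lt;/code&gt; means each cross-validation fold fits its own scaler and encoder, so the validation folds do not leak into the training folds.&lt;/p&gt;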




&lt;h2&gt;
  
  
  &lt;em&gt;Results&lt;/em&gt;
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;strong&gt;Metric&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Value&lt;/strong&gt;&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Accuracy&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;80.81%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Recall&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;56.09%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Precision&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;74.35%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;F1 Score&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;63.95%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;ROC AUC&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;85.42%&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Analysis of Results:&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
While the accuracy is high, the primary focus was on &lt;em&gt;Recall&lt;/em&gt;, which reached 56%. This means that just over half of the customers who actually churned were identified, enabling proactive interventions for them, while roughly 44% of churners were still missed.&lt;/p&gt;
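&lt;p&gt;As a quick sanity check on the table, the F1 score can be recomputed from the reported precision and recall, and the recall can be read as a head count. The snippet below is only arithmetic on the published numbers; the figure of 1,000 churners is a hypothetical used to make the percentage concrete.&lt;/p&gt;

```python
# Values reported in the results table.
precision = 0.7435
recall = 0.5609

# F1 is the harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)
print(f"F1 from precision and recall: {f1:.4f}")

# Recall as a head count: out of 1,000 customers who churn (hypothetical number),
# how many does the model flag and how many does it miss?
churners = 1000
caught = round(recall * churners)
missed = churners - caught
print("flagged:", caught, "missed:", missed)
```

&lt;p&gt;The recomputed F1 agrees with the table up to rounding of the inputs.&lt;/p&gt;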




&lt;h2&gt;
  
  
  &lt;em&gt;Future Improvements&lt;/em&gt;
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Integrating External Data:&lt;/strong&gt;  &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Enrich the dataset with &lt;em&gt;customer satisfaction feedback&lt;/em&gt;, such as NPS or survey responses.
&lt;/li&gt;
&lt;li&gt;Include economic or regional indicators to identify specific patterns.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Experimenting with Models:&lt;/strong&gt;  &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Test models like &lt;em&gt;XGBoost&lt;/em&gt; or &lt;em&gt;LightGBM&lt;/em&gt;, which handle complex interactions well.
&lt;/li&gt;
&lt;li&gt;Perform feature importance analysis to optimize variable selection.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Automation:&lt;/strong&gt;  &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Develop an automated pipeline that retrains the model on periodically refreshed data.
&lt;/li&gt;
&lt;li&gt;Integrate the model into the CRM system for automated retention actions.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Customer Segmentation:&lt;/strong&gt;  &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Focus retention efforts on high-value or high-risk customer segments.
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  &lt;em&gt;Project Files&lt;/em&gt;
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;EDA.ipynb&lt;/code&gt;:&lt;/strong&gt; Exploratory Data Analysis and main insights.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;pre_processing.py&lt;/code&gt;:&lt;/strong&gt; Data preprocessing and transformation script.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;ML_application.py&lt;/code&gt;:&lt;/strong&gt; Machine learning training, validation, and result exportation.
&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;strong&gt;Contact Information:&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
For further inquiries or collaboration opportunities, feel free to reach out via &lt;a href="https://www.linkedin.com/in/anacbneumann/" rel="noopener noreferrer"&gt;LinkedIn&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>churnprediction</category>
      <category>python</category>
    </item>
    <item>
      <title>#02Python - Outliers and their types</title>
      <dc:creator>Ana Carolina Branco Neumann</dc:creator>
      <pubDate>Thu, 31 Aug 2023 00:10:54 +0000</pubDate>
      <link>https://dev.to/anacbneumann/outliers-and-their-types-59dh</link>
      <guid>https://dev.to/anacbneumann/outliers-and-their-types-59dh</guid>
      <description>&lt;p&gt;Outliers are unusual values that stand out from the rest of the data in a set, often because they are at extreme values. They can result from measurement errors, incorrect inputs, rare events, or even represent information about the observed phenomenon.&lt;/p&gt;

&lt;p&gt;Outliers can be categorized into different types based on their nature. Here are some types of outliers:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Univariate Outliers (single variable):&lt;/strong&gt; Refers to a value that stands far apart from the others in a single variable.&lt;br&gt;
An illustrative example occurs in a pizza-eating contest where all competitors eat 3 to 5 slices, but one person devours 20 slices and still asks for more! This person is the univariate outlier of the group.&lt;br&gt;
It can distort summary measures (the mean and standard deviation in particular) and graphs, affecting statistics related to the variable.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Multivariate Outliers (across multiple variables):&lt;/strong&gt; Are values identified when we consider multiple variables at the same time. Detecting multivariate outliers is more complex as it requires considering interactions between variables.&lt;br&gt;
An example happens at a costume party where most people chose common costumes, but someone shows up dressed as a dragon, floating with a jetpack, and carrying a giant violin. This "Space Dragon Musician" is a multivariate outlier.&lt;br&gt;
It can influence analyses involving interactions between variables, such as heatmaps and correlations, leading to wrong conclusions if not properly addressed.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Global Outliers:&lt;/strong&gt; Are values significantly distant from all the other data points in the entire dataset.&lt;br&gt;
For instance, in a physical education class, where the teacher asked everyone to write their heights on a sheet. While most were around 1.50m to 1.70m, the "Super Basketball Player" wrote 2.20m on the paper, making him the global outlier.&lt;br&gt;
This type clearly distorts statistics such as the mean, making them less representative and skewing the overall view of the data.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Contextual Outliers:&lt;/strong&gt; Are observed based on the specific context of the problem.&lt;br&gt;
For example, in a salary study within a company, a value 10 times above the average is unusual and noteworthy. However, upon closer examination, it might correspond to a high-ranking position in the organization. Although much higher than the others, its presence is not an error.&lt;br&gt;
The impact here might be smaller, as its justification is tied to circumstances. It usually doesn't drastically distort aggregate statistics if treated as a special case.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Replicated Data Outliers:&lt;/strong&gt; Are outliers found when the same quantity is measured at different times or locations and the results vary.&lt;br&gt;
They can arise due to temporal or spatial variations, or changes in measurement methodology. They can provide insights into changes in the phenomenon over time or space.&lt;br&gt;
Imagine measuring your mug's height using a ruler on your desk every day. On the first day, you record 12 cm, the next day 13 cm, and the third day 11 cm. This doesn't mean the mug is growing or the ruler is changing, but your measurement is varying (12, 13, 11 centimeters).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Influential Outliers:&lt;/strong&gt; Are values that significantly impact statistical analyses, such as regression model fitting. This type of outlier can affect the slope and fit of the regression line, as well as statistical models, potentially resulting in incorrect conclusions if not properly addressed.&lt;br&gt;
An example is seen in a car dealership that mostly sells popular cars between $30,000 and $50,000. A luxury car was sold for $150,000. This distinct sale had a major impact on sales metrics and the average price of cars sold.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Random Outliers:&lt;/strong&gt; Are caused by measurement errors or natural variations in the data. They occur randomly and don't reflect meaningful patterns.&lt;br&gt;
For instance, in an industry, during a temperature measurement experiment, sensors typically read between 22°C and 25°C. However, in one reading, the sensor indicated 500°C.&lt;br&gt;
This might have been a measurement error and doesn't represent the actual temperature.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
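&lt;p&gt;To make the univariate case concrete, here is a small sketch that flags the pizza-contest outlier. It uses the 1.5 × IQR fences, a common rule of thumb that this post does not prescribe, and the slice counts are illustrative numbers chosen to match the example.&lt;/p&gt;

```python
import statistics

# Slices eaten per competitor in the pizza contest example (illustrative numbers).
slices = [3, 4, 5, 4, 3, 5, 4, 20]

# Quartiles and the 1.5 * IQR fences (a rule of thumb, not taken from the post).
q1, _q2, q3 = statistics.quantiles(slices, n=4)
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Anything outside the fences is flagged as a univariate outlier.
outliers = [x for x in slices if x > upper_fence or lower_fence > x]
print("outliers:", outliers)

# The mean is pulled toward the extreme value; the median barely moves.
print("mean:", statistics.mean(slices), "median:", statistics.median(slices))
```

&lt;p&gt;Whether a flagged value should be removed, transformed, or kept still depends on the context.&lt;/p&gt;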

&lt;p&gt;It's important to note that not all outliers are errors or process failures. Some provide information about the studied phenomenon or indicate special circumstances. When dealing with outliers, it's essential to understand the context and decide whether they should be treated, transformed, or retained.&lt;/p&gt;

</description>
      <category>python</category>
      <category>outliers</category>
      <category>statistics</category>
      <category>programming</category>
    </item>
    <item>
      <title>#02Python - Outliers and their types</title>
      <dc:creator>Ana Carolina Branco Neumann</dc:creator>
      <pubDate>Thu, 31 Aug 2023 00:07:28 +0000</pubDate>
      <link>https://dev.to/anacbneumann/02python-outliers-e-seus-tipos-5dn8</link>
      <guid>https://dev.to/anacbneumann/02python-outliers-e-seus-tipos-5dn8</guid>
      <description>&lt;p&gt;Outliers are atypical values that differ from the rest of the data in a set, usually because they sit at extreme values. They can result from measurement errors, incorrect entries, rare events, or even carry information about the observed fact.&lt;/p&gt;

&lt;p&gt;Outliers can be divided into different types based on their nature. Here are some types of outliers:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Univariate outliers (a single variable):&lt;/strong&gt; a value that deviates from the rest with respect to a single variable.&lt;br&gt;
An illustrative example is a pizza-eating contest where every competitor eats 3 to 5 slices, but one person devours 20 slices and still asks for more! That person is the univariate outlier of the group.&lt;br&gt;&lt;br&gt;
It can distort summary measures (the mean and standard deviation in particular) and charts, mainly affecting the statistics related to that variable.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Multivariate outliers (across several variables):&lt;/strong&gt; values identified when we look at several variables at the same time. Detecting multivariate outliers is more complex because it requires considering interactions between variables.&lt;br&gt;
An example is a costume party where most people chose common outfits, but someone shows up dressed as a dragon, floating with a jetpack and carrying a giant violin. This "Space Dragon Musician" is a multivariate outlier.&lt;br&gt;
It can influence analyses that involve interactions between variables, such as heatmaps and correlations, leading to wrong conclusions if not treated properly.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Global outliers:&lt;/strong&gt; values that lie significantly far from all the other data points in the entire dataset.&lt;br&gt;
For example, in a physical education class the teacher asked everyone to write their height on a sheet. While everyone else was around 1.50m to 1.70m, the "Super Basketball Player" wrote 2.20m on the paper, making him the global outlier.&lt;br&gt;
This type clearly distorts statistics such as the mean, making them less representative of the data, and can skew the overall view of the data.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Contextual outliers:&lt;/strong&gt; values that are identified based on the specific context of the problem.&lt;br&gt;
For example, in a salary study within a company, a value 10 times above the average is unusual and worth investigating. On closer inspection, however, it may correspond to a high-ranking position in the organization. Although it is much higher than the others, its presence is not an error.&lt;br&gt;
The impact in this case may be smaller, because its justification is tied to circumstances. It usually does not noticeably distort aggregate statistics if it is treated as a special case.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Replicated data outliers:&lt;/strong&gt; outliers found when the same quantity is measured at different times or locations and the results vary.&lt;br&gt;
They can arise from temporal or spatial variations, or from changes in how the measurement is taken. They can also provide information about changes in the phenomenon over time or space.&lt;br&gt;
Imagine that every day you sit through a tedious meeting at work and, with a ruler from your desk, measure your mug each time. On the first day you record 12 cm, on the second 13 cm, and on the third 11 cm. This doesn't mean the mug is growing or the ruler is changing, but that your measurement is varying (12, 13, 11 centimeters).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Influential outliers:&lt;/strong&gt; values that significantly impact statistical analyses, such as the fitting of regression models. This type of outlier can affect the slope and fit of the regression line, as well as other statistical models, potentially leading to wrong conclusions if not treated properly.&lt;br&gt;
An example is a car dealership that sells mostly popular cars priced between R$ 30,000 and R$ 50,000. A luxury car was sold for R$ 150,000. This unusual sale had a large impact on the sales metrics and on the average price of the cars sold.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Random outliers:&lt;/strong&gt; values caused by measurement errors or natural variations in the data. They occur at random and don't reflect meaningful patterns.&lt;br&gt;
For example, in an industrial temperature measurement experiment, the sensors usually read between 22°C and 25°C. In one reading, however, the sensor indicated 500°C.&lt;br&gt;
This may have been a measurement error and doesn't indicate the actual temperature.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;It's important to note that not all outliers are errors or process failures. Some carry information about the studied phenomenon or indicate special situations. When dealing with outliers, it's essential to understand the context and decide whether they should be treated, transformed, or kept.&lt;/p&gt;

</description>
      <category>python</category>
      <category>outlier</category>
      <category>programming</category>
      <category>estatistica</category>
    </item>
    <item>
      <title>#01Python - Missing Values (NaN | Null)</title>
      <dc:creator>Ana Carolina Branco Neumann</dc:creator>
      <pubDate>Wed, 16 Aug 2023 16:00:58 +0000</pubDate>
      <link>https://dev.to/anacbneumann/01python-missing-values-nan-null-53pp</link>
      <guid>https://dev.to/anacbneumann/01python-missing-values-nan-null-53pp</guid>
      <description>&lt;h2&gt;
  
  
  &lt;strong&gt;Types of Data Absence&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;There are different patterns of data absence:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;MCAR (Missing Completely At Random)&lt;/strong&gt;: The data's absence is entirely random and not related to any variable in the dataset. This means that the probability of a value being missing is the same for all observations, and it doesn't depend on any observed or unobserved values. In other words, missing data doesn't introduce any systematic bias into the analysis, although it does reduce the amount of data available.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;MAR (Missing At Random)&lt;/strong&gt;: The absence of data might be related to other observed variables but isn't related to the missing value itself. In other words, the probability of a value being missing may depend on the information available in other variables, but it doesn't depend on the actual value that's missing. Because those variables are present in the dataset, the bias they would otherwise cause can be corrected by accounting for them in the analysis or in the imputation.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;MNAR (Missing Not At Random)&lt;/strong&gt;: The absence of data is related to the missing value itself, and this relationship can't be explained by other variables in the dataset. In other words, the probability of a value being missing depends on the actual value that's missing, regardless of other observed variables. Missing data of this kind introduces a systematic bias into the analysis, because the values that are absent differ from the values that were observed.&lt;/li&gt;
&lt;/ol&gt;
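&lt;p&gt;The difference between random and non-random absence can be seen in a small simulation. This is a sketch on made-up numbers: the variable name &lt;code&gt;income&lt;/code&gt;, its distribution, and the missingness rules are all assumptions chosen for illustration.&lt;/p&gt;

```python
import numpy as np

# Made-up variable: 10,000 simulated incomes (assumption, not real data).
rng = np.random.default_rng(42)
income = rng.normal(loc=5000, scale=1000, size=10_000)

# MCAR: every value has the same 30% chance of being missing.
mcar_missing = rng.random(income.size) > 0.7

# MNAR: missingness depends on the value itself.
# Here the highest 30% of incomes are never reported.
mnar_missing = income > np.quantile(income, 0.7)

true_mean = income.mean()
mcar_mean = income[~mcar_missing].mean()   # mean of the values still observed
mnar_mean = income[~mnar_missing].mean()

print(f"true mean:          {true_mean:.0f}")
print(f"mean under MCAR:    {mcar_mean:.0f}")
print(f"mean under MNAR:    {mnar_mean:.0f}")
```

&lt;p&gt;Under MCAR the observed mean stays close to the true mean, while under MNAR it is pulled downward because the missing values are systematically the large ones.&lt;/p&gt;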

&lt;p&gt;It's important to understand the type of data absence when working with a dataset, as each type of absence requires different treatment strategies or data imputation methods to deal with missing values. Knowing the types of data absence can also help in correctly interpreting results and avoiding false or biased conclusions.&lt;/p&gt;

&lt;p&gt;Note: "No systematic bias introduced" means that the absence of data doesn't affect the analysis in a biased or systematic manner. In other words, the missing data doesn't consistently influence the results.&lt;/p&gt;




&lt;h2&gt;
  
  
  &lt;strong&gt;Identifying Null Values&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;To identify missing values, we use the &lt;strong&gt;&lt;code&gt;isnull()&lt;/code&gt;&lt;/strong&gt; method of pandas to check which values are null in the dataframe. For example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;isnull&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This command returns a dataframe with the same format as the original, but with boolean values indicating whether each element is null or not. The &lt;strong&gt;&lt;code&gt;.isna()&lt;/code&gt;&lt;/strong&gt; method can also be used, as described below.&lt;/p&gt;




&lt;h2&gt;
  
  
  &lt;strong&gt;Identifying NaN Values&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;NaN → Not a Number.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;code&gt;isna()&lt;/code&gt;&lt;/strong&gt; is a pandas method that returns a boolean matrix indicating which elements are missing values (NaN) or null. It has the same functionality as the &lt;strong&gt;&lt;code&gt;isnull()&lt;/code&gt;&lt;/strong&gt; method.&lt;/p&gt;

&lt;p&gt;The &lt;strong&gt;&lt;code&gt;isna()&lt;/code&gt;&lt;/strong&gt; function can be applied to an entire dataframe or to a specific series within the dataframe. When used on the entire dataframe, it returns a dataframe with the same shape as the input, where each element is replaced by &lt;strong&gt;&lt;code&gt;True&lt;/code&gt;&lt;/strong&gt; if it's a missing value or &lt;strong&gt;&lt;code&gt;False&lt;/code&gt;&lt;/strong&gt; otherwise.&lt;/p&gt;

&lt;p&gt;Here's an example of using the &lt;strong&gt;&lt;code&gt;isna()&lt;/code&gt;&lt;/strong&gt; method:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;pandas&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;

&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="s"&gt;'A'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="s"&gt;'B'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;]})&lt;/span&gt;
&lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;isna&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  &lt;strong&gt;Counting Missing Values&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;To get an overview of missing values in each column, you can use the &lt;strong&gt;&lt;code&gt;sum()&lt;/code&gt;&lt;/strong&gt; method in combination with the &lt;strong&gt;&lt;code&gt;isnull()&lt;/code&gt;&lt;/strong&gt; method. For example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;isnull&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nb"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This command returns the total number of missing values in each column. The same can be applied to the &lt;strong&gt;&lt;code&gt;isna()&lt;/code&gt;&lt;/strong&gt; command. For example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;isna&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nb"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
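&lt;p&gt;Counts are easier to compare across columns when they are also expressed as percentages. The sketch below relies on the fact that the mean of a boolean column is the fraction of &lt;code&gt;True&lt;/code&gt; values; the column &lt;code&gt;C&lt;/code&gt; is a hypothetical addition to the earlier example so that one column has no missing data.&lt;/p&gt;

```python
import pandas as pd

# Same toy frame as the isna() example, plus a hypothetical complete column "C".
df = pd.DataFrame({"A": [1, 2, None], "B": [3, None, 5], "C": ["x", "y", "z"]})

missing_count = df.isna().sum()        # number of missing values per column
missing_pct = df.isna().mean() * 100   # share of missing values per column, in %

summary = pd.DataFrame({"missing": missing_count, "percent": missing_pct.round(2)})
print(summary)
```

&lt;p&gt;Sorting this summary by &lt;code&gt;percent&lt;/code&gt; is a quick way to see which columns need attention first.&lt;/p&gt;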






&lt;h2&gt;
  
  
  &lt;strong&gt;Visualization of Missing Data [Missingno]&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;The &lt;strong&gt;&lt;code&gt;missingno&lt;/code&gt;&lt;/strong&gt; library is a useful tool for visualizing missing data patterns in a dataframe. It generates charts that help identify patterns of missing values and understand the distribution of these values in a dataframe. Some of its key charts are:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Matrix Plot&lt;/strong&gt;: Shows the presence or absence of values in each cell of the dataframe. Each row represents a sample or record, and each column represents a variable, with empty/blank cells indicating missing values. This makes patterns and correlations between missing values visible.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Bar Chart of Missing Values&lt;/strong&gt;: Displays one bar per variable whose height is the number of non-missing values, so shorter bars indicate more missing data. It also shows that number as a proportion of the total number of rows. It helps identify variables with a significant number of missing values.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Heatmap&lt;/strong&gt;: Uses colors to represent the nullity correlation between pairs of variables, that is, how strongly the presence or absence of one variable is associated with the presence or absence of another. It's useful for spotting variables that tend to be missing together.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The &lt;strong&gt;&lt;code&gt;missingno&lt;/code&gt;&lt;/strong&gt; library also provides a dendrogram (&lt;code&gt;msno.dendrogram&lt;/code&gt;), which groups variables by how similar their patterns of missing values are.&lt;/p&gt;

&lt;p&gt;Here's an example of using the &lt;strong&gt;&lt;code&gt;missingno&lt;/code&gt;&lt;/strong&gt; library:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;missingno&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;msno&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;matplotlib.pyplot&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;plt&lt;/span&gt;

&lt;span class="c1"&gt;# Matrix plot
&lt;/span&gt;&lt;span class="n"&gt;msno&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;matrix&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;show&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="c1"&gt;# Bar chart of missing values
&lt;/span&gt;&lt;span class="n"&gt;msno&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;bar&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;show&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="c1"&gt;# Heatmap
&lt;/span&gt;&lt;span class="n"&gt;msno&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;heatmap&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;show&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  &lt;strong&gt;Setting a Threshold for Missing Values per Column&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;There's no standard or rule to determine the threshold of missing values per column, as it can vary depending on the problem, data nature, and analysis requirements.&lt;/p&gt;

&lt;p&gt;However, there are some approaches to defining this threshold:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Percentage Threshold&lt;/strong&gt;: You can set a percentage threshold, for example, allowing a column to have up to 5% (or any other value) of missing values. If the proportion of missing values in a column exceeds this threshold, actions like value imputation or column deletion can be taken.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Domain Analysis&lt;/strong&gt;: Certain columns might be more critical and require fewer missing values, while others may have more leeway for data absence. For instance, in a medical data dataframe, variables like age or gender might be considered essential, whereas other columns with more specific information might tolerate more missing values.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Impact on Results&lt;/strong&gt;: Consider the impact of missing values on final results. If a column contains critical information for the problem at hand or is necessary for constructing the result, it's advisable to have a lower threshold for missing values.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Documenting the missing data treatment process, especially the decision about the missing values threshold, is important.&lt;/p&gt;
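&lt;p&gt;A percentage threshold can be checked with a few lines of pandas. The sketch below is only an illustration: the dataframe, its column names, and the 30% value in &lt;code&gt;max_missing_share&lt;/code&gt; are assumptions, not part of the original analysis.&lt;/p&gt;

```python
import numpy as np
import pandas as pd

# Hypothetical data: "age" is complete, "income" is 25% missing, "notes" is 75% missing.
df = pd.DataFrame({
    "age": [25, 32, 47, 51],
    "income": [3000.0, np.nan, 4200.0, 5100.0],
    "notes": [None, None, "vip", None],
})

max_missing_share = 0.30  # assumed threshold; choose it per problem

# isna().mean() gives the share of missing values in each column.
missing_share = df.isna().mean()

# Columns whose missing share exceeds the threshold.
cols_over_threshold = missing_share[missing_share.gt(max_missing_share)].index.tolist()

print(missing_share)
print(cols_over_threshold)
```

&lt;p&gt;Columns listed in &lt;code&gt;cols_over_threshold&lt;/code&gt; are the candidates for imputation or removal; the rest can usually be handled by dropping the few affected rows.&lt;/p&gt;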




&lt;h2&gt;
  
  
  &lt;strong&gt;Dealing with Missing Values&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;There are several strategies to handle missing values. Here are some common options:&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Removing Rows or Columns&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;If the missing values are in a small number of rows or columns, you can choose to remove them. Use the &lt;strong&gt;&lt;code&gt;dropna()&lt;/code&gt;&lt;/strong&gt; method to do this. For example, to remove all rows containing at least one null value:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Dropping rows with missing values:
&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;dropna&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To remove columns with at least one null value, specify the &lt;strong&gt;&lt;code&gt;axis=1&lt;/code&gt;&lt;/strong&gt; parameter:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Dropping columns with missing values:
&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;dropna&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;axis&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Below is an example that selects the columns whose number of missing values is at or below a 5% threshold, so that the rows with missing values in those columns can be dropped. Columns above the threshold need another treatment, such as imputation:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Setting threshold:
&lt;/span&gt;&lt;span class="n"&gt;threshold&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="mf"&gt;0.05&lt;/span&gt;

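&lt;span class="c1"&gt;# Columns whose missing-value count is within the threshold (&amp;lt;= 5% of rows):
&lt;/span&gt;&lt;span class="n"&gt;cols_to_drop_na&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;columns&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;isna&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nb"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;=&lt;/span&gt; &lt;span class="n"&gt;threshold&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="c1"&gt;# Drop the rows that have missing values in those columns:
&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;dropna&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;subset&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;cols_to_drop_na&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;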
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h3&gt;
  
  
  &lt;strong&gt;Replacing Missing Values with Statistical Values&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;If you prefer to keep all rows and columns, you can fill the missing values with specific values. Use the &lt;strong&gt;&lt;code&gt;fillna()&lt;/code&gt;&lt;/strong&gt; method to fill null values. In the example below, all null values are filled with '0':&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;fillna&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can also fill with other values, such as column statistics:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Fill missing values with column mean
&lt;/span&gt;&lt;span class="n"&gt;df_filled_mean&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;fillna&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;mean&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;numeric_only&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

&lt;span class="c1"&gt;# Fill missing values with column median
&lt;/span&gt;&lt;span class="n"&gt;df_filled_median&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;fillna&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;median&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;numeric_only&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

&lt;span class="c1"&gt;# Fill missing values with column standard deviation (rarely appropriate: it measures spread, not a typical value)
&lt;/span&gt;&lt;span class="n"&gt;df_filled_std&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;fillna&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;std&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;numeric_only&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

&lt;span class="c1"&gt;# Fill missing values with column mode (most frequent value)
&lt;/span&gt;&lt;span class="n"&gt;df_filled_mode&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;apply&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;lambda&lt;/span&gt; &lt;span class="n"&gt;col&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;col&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;fillna&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;col&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;mode&lt;/span&gt;&lt;span class="p"&gt;()[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;col&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;isna&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nb"&gt;any&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="n"&gt;col&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Or with more advanced statistics, such as a weighted median:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;numpy&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;pandas&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;

&lt;span class="c1"&gt;# Define one weight per row of df (illustrative values; the length must equal len(df))
&lt;/span&gt;&lt;span class="n"&gt;weights&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Series&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Function to calculate weighted median
&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;weighted_median&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;values&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;weights&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;sorted_indices&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;argsort&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;values&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;sorted_values&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;array&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;values&lt;/span&gt;&lt;span class="p"&gt;)[&lt;/span&gt;&lt;span class="n"&gt;sorted_indices&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;sorted_weights&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;array&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;weights&lt;/span&gt;&lt;span class="p"&gt;)[&lt;/span&gt;&lt;span class="n"&gt;sorted_indices&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;cumsum_weights&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;cumsum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sorted_weights&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;total_weight&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;cumsum_weights&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;median_idx&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;argmax&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cumsum_weights&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="n"&gt;total_weight&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;sorted_values&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;median_idx&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="c1"&gt;# Fill missing values with the column weighted median, using only the weights of the non-missing rows
&lt;/span&gt;&lt;span class="n"&gt;df_filled_weighted_median&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;apply&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;lambda&lt;/span&gt; &lt;span class="n"&gt;col&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;col&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;fillna&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;weighted_median&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;col&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;dropna&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="n"&gt;weights&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;col&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;notna&lt;/span&gt;&lt;span class="p"&gt;()]))&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;col&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;isna&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nb"&gt;any&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="n"&gt;col&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



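&lt;p&gt;But be aware that imputation can lead to misleading insights: the more values are filled, the more the column's distribution is distorted. Filling many gaps with a single statistic, for example, reduces the variance of the column and can weaken its relationship with other variables.&lt;/p&gt;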
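&lt;p&gt;One way to see this kind of distortion is to compare a column's spread before and after imputation. The sketch below uses made-up values; it only illustrates that filling gaps with the mean leaves the mean unchanged while lowering the standard deviation.&lt;/p&gt;

```python
import numpy as np
import pandas as pd

# Hypothetical column with two missing values out of six.
values = pd.Series([10.0, 20.0, np.nan, np.nan, 30.0, 40.0])
filled = values.fillna(values.mean())

std_before = values.std()  # computed on the 4 observed values only
std_after = filled.std()   # computed on 6 values, 2 of them equal to the mean

print("mean before/after:", values.mean(), filled.mean())
print("std before/after:", std_before, std_after)
```

&lt;p&gt;The imputed column looks less variable than the observed data, and the effect grows with the share of filled values.&lt;/p&gt;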

&lt;h3&gt;
  
  
  &lt;strong&gt;Statistical Measures by Subgroups&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Filling missing values with statistical measures segmented by subgroups within the dataframe is a useful technique when you want to impute missing values based on specific characteristics of data subsets. This allows considering data heterogeneity and avoiding distortions when filling missing values with general statistical measures.&lt;/p&gt;

&lt;p&gt;Here's an example of how to fill missing values with the mean of a group:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Calculate group mean (for inspection)
&lt;/span&gt;&lt;span class="n"&gt;group_means&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;groupby&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'Group'&lt;/span&gt;&lt;span class="p"&gt;)[&lt;/span&gt;&lt;span class="s"&gt;'Value'&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;mean&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="c1"&gt;# Fill missing values with the mean of each row's own group; transform() keeps the original index
&lt;/span&gt;&lt;span class="n"&gt;df_filled_segmented_mean&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;groupby&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'Group'&lt;/span&gt;&lt;span class="p"&gt;)[&lt;/span&gt;&lt;span class="s"&gt;'Value'&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;transform&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;lambda&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;fillna&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;mean&lt;/span&gt;&lt;span class="p"&gt;()))&lt;/span&gt;

&lt;span class="nb"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Values filled with the mean of each group:&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;df_filled_segmented_mean&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This code can be adapted for other statistical measures, such as the median or the mode. Simply replace the statistical function inside the function passed to &lt;strong&gt;&lt;code&gt;transform()&lt;/code&gt;&lt;/strong&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Interpolating Missing Values&lt;/strong&gt;
&lt;/h3&gt;

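&lt;p&gt;Another option is to use interpolation to fill missing values based on the existing values in each column. By default, the &lt;strong&gt;&lt;code&gt;interpolate()&lt;/code&gt;&lt;/strong&gt; method performs linear interpolation between neighboring values, so it suits numeric columns whose rows have a meaningful order, such as time series. For example:&lt;br&gt;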
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;interpolate&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
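&lt;p&gt;To make the behavior concrete, here is a small sketch on a made-up series. With the default linear method, the two gaps between 1 and 4 are filled with 2 and 3, and the gap between 4 and 6 is filled with 5.&lt;/p&gt;

```python
import numpy as np
import pandas as pd

# Hypothetical ordered measurements with gaps.
s = pd.Series([1.0, np.nan, np.nan, 4.0, np.nan, 6.0])

# The default method is linear: gaps are filled along a straight line
# between the nearest known neighbors.
interpolated = s.interpolate()

print(interpolated.tolist())
```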






&lt;h2&gt;
  
  
  &lt;strong&gt;Check Missing Values Again After Transformations&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;After performing missing value treatment steps, check again if there are any remaining null values in the dataframe to ensure that missing values have been properly handled.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# For null values:
&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;isnull&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nb"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="c1"&gt;# Or:
&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;isna&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nb"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
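&lt;p&gt;In a script or pipeline, this check can be turned into an explicit assertion so that leftover missing values stop the run. A minimal sketch, assuming a small made-up dataframe cleaned with &lt;code&gt;dropna()&lt;/code&gt;:&lt;/p&gt;

```python
import numpy as np
import pandas as pd

# Hypothetical dataframe with one incomplete row.
df = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": ["x", None, "z"]})
df_clean = df.dropna()

# Total number of missing values across all columns.
remaining = int(df_clean.isna().sum().sum())
assert remaining == 0, f"{remaining} missing values remain"
```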



</description>
    </item>
    <item>
      <title>#01Python - Missing Values (NaN | Null)</title>
      <dc:creator>Ana Carolina Branco Neumann</dc:creator>
      <pubDate>Wed, 16 Aug 2023 15:59:48 +0000</pubDate>
      <link>https://dev.to/anacbneumann/valores-ausentes-nan-null-2e81</link>
      <guid>https://dev.to/anacbneumann/valores-ausentes-nan-null-2e81</guid>
      <description>&lt;h2&gt;
  
  
  Types of Missing Data
&lt;/h2&gt;

&lt;p&gt;There are three types of missing-data patterns:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;MCAR (Missing Completely At Random)&lt;/strong&gt;: The absence of data is completely random and unrelated to any variable in the dataset, observed or not. The probability of a value being missing is the same for all observations. As a result, the missing data does not introduce systematic bias into the analysis, although it does reduce the sample size.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;MAR (Missing At Random)&lt;/strong&gt;: The absence of data may be related to other observed variables, but not to the missing value itself. In other words, the probability of a value being missing can depend on the information available in other variables, but not on the actual value that is missing. Even though the missingness is related to other variables, no systematic bias is introduced as long as those variables are present in the dataset and taken into account in the analysis.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;MNAR (Missing Not At Random)&lt;/strong&gt;: The absence of data is related to the missing value itself, and this relationship cannot be explained by other variables in the dataset. That is, the probability of a value being missing depends on the actual value that is missing, even after accounting for the other observed variables. This introduces systematic bias into the analysis, because the missingness is tied to the very quantity being studied.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Understanding the type of missing data in a dataset is important, because each type calls for different treatment strategies or imputation methods. Knowing the type of missingness also helps to interpret results correctly and to avoid false or biased conclusions.&lt;/p&gt;

&lt;p&gt;Note: "no systematic bias is introduced" means that the missing data does not skew the results of the analysis in a consistent direction.&lt;/p&gt;
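&lt;p&gt;The difference between the three mechanisms can be illustrated with a small simulation. Everything in the sketch below is an assumption made for illustration: the synthetic &lt;code&gt;age&lt;/code&gt; and &lt;code&gt;income&lt;/code&gt; columns, the missingness probabilities, and the fixed random seed. It compares the mean of the observed income values under each mechanism with the mean of the complete data.&lt;/p&gt;

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # fixed seed so the sketch is repeatable
n = 1000
df = pd.DataFrame({
    "age": rng.integers(18, 80, size=n),
    "income": rng.normal(5000, 1500, size=n),
})

# MCAR: every row has the same 20% chance of losing its income value.
mcar_mask = rng.binomial(1, 0.20, size=n).astype(bool)

# MAR: missingness depends on an observed variable (age), not on income itself.
p_mar = np.where(df["age"].ge(60), 0.40, 0.05)
mar_mask = rng.binomial(1, p_mar).astype(bool)

# MNAR: missingness depends on the income value that ends up missing.
p_mnar = np.where(df["income"].gt(df["income"].median()), 0.40, 0.05)
mnar_mask = rng.binomial(1, p_mnar).astype(bool)

# Series.mask replaces the flagged entries with NaN; mean() skips NaN.
observed_means = pd.Series({
    "complete": df["income"].mean(),
    "MCAR": df["income"].mask(mcar_mask).mean(),
    "MAR": df["income"].mask(mar_mask).mean(),
    "MNAR": df["income"].mask(mnar_mask).mean(),
})
print(observed_means.round(1))
```

&lt;p&gt;Because high incomes are removed more often in the MNAR case, the observed mean there falls below the mean of the complete data, which is the systematic bias described above.&lt;/p&gt;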




&lt;h2&gt;
  
  
  Identifying Missing Values: Null
&lt;/h2&gt;

&lt;p&gt;To identify missing values, use the pandas &lt;strong&gt;&lt;code&gt;isnull()&lt;/code&gt;&lt;/strong&gt; method to check which values in the dataframe are null. For example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;isnull&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This command returns a dataframe with the same shape as the original, but with boolean values indicating whether each element is null. The &lt;strong&gt;&lt;code&gt;isna()&lt;/code&gt;&lt;/strong&gt; method, described below, can be used as well.&lt;/p&gt;




&lt;h2&gt;
  
  
  Identifying Missing Values: NaN
&lt;/h2&gt;

&lt;p&gt;NaN → Not a Number.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;code&gt;isna()&lt;/code&gt;&lt;/strong&gt; is a pandas method that returns a boolean mask indicating which elements are missing (NaN) or null. It behaves exactly like &lt;strong&gt;&lt;code&gt;isnull()&lt;/code&gt;&lt;/strong&gt;, which is an alias for it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;code&gt;isna()&lt;/code&gt;&lt;/strong&gt; can be applied to an entire dataframe or to a specific series within it. When applied to the whole dataframe, it returns a dataframe of the same shape in which each element is replaced by &lt;strong&gt;&lt;code&gt;True&lt;/code&gt;&lt;/strong&gt; if it is a null value and by &lt;strong&gt;&lt;code&gt;False&lt;/code&gt;&lt;/strong&gt; otherwise.&lt;/p&gt;

&lt;p&gt;Here's an example of using the &lt;strong&gt;&lt;code&gt;isna()&lt;/code&gt;&lt;/strong&gt; method:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;pandas&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;

&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="s"&gt;'A'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="s"&gt;'B'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;]})&lt;/span&gt;
&lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;isna&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Counting Missing Values
&lt;/h2&gt;

&lt;p&gt;To get an overview of the missing values in each column, you can use the &lt;code&gt;sum()&lt;/code&gt; method in combination with the &lt;code&gt;isnull()&lt;/code&gt; method. For example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;isnull&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nb"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This command returns the total number of missing values in each column. The same can be done with &lt;strong&gt;&lt;code&gt;isna()&lt;/code&gt;&lt;/strong&gt;. For example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;isna&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nb"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
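&lt;p&gt;The same idea extends to counting by row and for the whole dataframe. The sketch below uses a small made-up dataframe; the variable names are illustrative.&lt;/p&gt;

```python
import pandas as pd

# Hypothetical dataframe: column A has 1 missing value, column B has 2.
df = pd.DataFrame({"A": [1, 2, None], "B": [3, None, None]})

per_column = df.isna().sum()        # missing values in each column
per_row = df.isna().sum(axis=1)     # missing values in each row
total = int(df.isna().sum().sum())  # missing values in the whole dataframe

print(per_column)
print(per_row)
print(total)
```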






&lt;h2&gt;
  
  
  Visualizing Missing Data [missingno]
&lt;/h2&gt;

&lt;p&gt;The &lt;strong&gt;&lt;code&gt;missingno&lt;/code&gt;&lt;/strong&gt; library is a useful tool for visualizing missing-data patterns in a dataframe. It generates charts that help identify those patterns and understand how missing values are distributed. Some of its main charts are:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Matrix Plot&lt;/strong&gt;: Shows the presence or absence of values in each cell of the dataframe. Each row represents a sample or record and each column represents a variable; blank cells indicate missing values. This makes patterns and correlations among missing values visible.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Bar Chart of Missing Values&lt;/strong&gt;: Displays one bar per variable whose height is the number of non-missing values, so shorter bars mean more missing data. It makes it easy to compare completeness across variables and to identify those with a significant number of missing values.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Heatmap&lt;/strong&gt;: Uses colors to represent the nullity correlation between pairs of variables, that is, how strongly the presence or absence of one variable is associated with the presence or absence of another. It helps reveal variables that tend to be missing together.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The &lt;strong&gt;&lt;code&gt;missingno&lt;/code&gt;&lt;/strong&gt; library also provides other visualizations, such as the dendrogram, which clusters variables by the similarity of their missing-value patterns. The matrix plot can also be used with time-indexed data to follow missing values over time.&lt;/p&gt;

&lt;p&gt;Here's an example of using the &lt;strong&gt;&lt;code&gt;missingno&lt;/code&gt;&lt;/strong&gt; library:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;missingno&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;msno&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;matplotlib.pyplot&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;plt&lt;/span&gt;

&lt;span class="c1"&gt;# Matrix plot
&lt;/span&gt;&lt;span class="n"&gt;msno&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;matrix&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;show&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="c1"&gt;# Bar chart of missing values
&lt;/span&gt;&lt;span class="n"&gt;msno&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;bar&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;show&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="c1"&gt;# Heatmap
&lt;/span&gt;&lt;span class="n"&gt;msno&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;heatmap&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;show&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Setting a Threshold for Missing Values per Column
&lt;/h2&gt;

&lt;p&gt;There's no standard or rule to determine the threshold of missing values per column, as it can vary depending on the problem, the nature of the data, and the analysis requirements.&lt;/p&gt;

&lt;p&gt;However, there are some approaches to defining this threshold:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Percentage Threshold&lt;/strong&gt;: You can set a percentage threshold, for example, allowing a column to have up to 5% (or any other value) of missing values. If the proportion of missing values in a column exceeds this threshold, actions like value imputation or column deletion can be taken.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Domain Analysis&lt;/strong&gt;: Certain columns might be more critical and require fewer missing values, while others may have more leeway for data absence. For instance, in a dataframe of medical data, variables like age or sex might be considered essential, whereas other columns with more specific information might tolerate more missing values.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Impact on Results&lt;/strong&gt;: Consider the impact of missing values on the final results. If a column contains critical information for the problem at hand or is necessary for constructing the result, it's advisable to have a lower threshold for missing values.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Documenting the missing data treatment process, especially the decision about the missing values threshold, is important.&lt;/p&gt;




&lt;h2&gt;
  
  
  Dealing with Missing Values
&lt;/h2&gt;

&lt;p&gt;There are several strategies to handle missing values. Here are some common options:&lt;/p&gt;

&lt;h3&gt;
  
  
  Removing Rows or Columns
&lt;/h3&gt;

&lt;p&gt;If the missing values are in a small number of rows or columns, you can choose to remove them. Use the &lt;strong&gt;&lt;code&gt;dropna()&lt;/code&gt;&lt;/strong&gt; method to do this. For example, to remove all rows containing at least one null value:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Dropping rows with missing values:
&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;dropna&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To remove columns that contain at least one null value, pass the parameter &lt;strong&gt;&lt;code&gt;axis=1&lt;/code&gt;&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Dropping columns with missing values:
&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;dropna&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;axis&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The example below drops the rows with missing values in columns whose missing-value count is within a 5% threshold, then checks which columns still contain missing values:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Defining the threshold (5% of the rows):
&lt;/span&gt;&lt;span class="n"&gt;threshold&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="mf"&gt;0.05&lt;/span&gt;

&lt;span class="c1"&gt;# Columns whose missing-value count is within the threshold (&amp;lt;= 5% of rows):
&lt;/span&gt;&lt;span class="n"&gt;cols_to_drop_na&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;columns&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;isna&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nb"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;=&lt;/span&gt; &lt;span class="n"&gt;threshold&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="c1"&gt;# Dropping the rows with missing values in those columns:
&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;dropna&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;subset&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;cols_to_drop_na&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;inplace&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Checking whether any columns still have missing values:
&lt;/span&gt;&lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;isna&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nb"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;

&lt;span class="c1"&gt;# If so, we record those columns for later treatment:
&lt;/span&gt;&lt;span class="n"&gt;cols_with_missing_values&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;columns&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;isna&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nb"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cols_with_missing_values&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Replacing missing values with statistical values
&lt;/h3&gt;

&lt;p&gt;If you prefer to keep all rows and columns, you can fill the missing values with a specific value. Use the &lt;strong&gt;&lt;code&gt;fillna()&lt;/code&gt;&lt;/strong&gt; method to fill null values. &lt;br&gt;
In the example below, all null values are filled with 0:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;fillna&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can also fill with other values, such as column statistics:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Fill missing values with the column mean (numeric columns only)
&lt;/span&gt;&lt;span class="n"&gt;df_filled_mean&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;fillna&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;mean&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;numeric_only&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

&lt;span class="c1"&gt;# Fill missing values with the column median (numeric columns only)
&lt;/span&gt;&lt;span class="n"&gt;df_filled_median&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;fillna&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;median&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;numeric_only&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

&lt;span class="c1"&gt;# Fill missing values with the column standard deviation
# (a measure of spread, so rarely a sensible fill value)
&lt;/span&gt;&lt;span class="n"&gt;df_filled_std&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;fillna&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;std&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;numeric_only&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

&lt;span class="c1"&gt;# Fill missing values with the column mode (most frequent value)
&lt;/span&gt;&lt;span class="n"&gt;df_filled_mode&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;apply&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;lambda&lt;/span&gt; &lt;span class="n"&gt;col&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;col&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;fillna&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;col&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;mode&lt;/span&gt;&lt;span class="p"&gt;()[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;col&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;isna&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nb"&gt;any&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="n"&gt;col&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Or with more advanced statistics, such as the weighted median:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;numpy&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;

&lt;span class="c1"&gt;# Define the weights for the weighted median: one weight per row of df
# (this example assumes df has exactly 5 rows)
&lt;/span&gt;&lt;span class="n"&gt;weights&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="c1"&gt;# Function to compute the weighted median
&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;weighted_median&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;values&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;weights&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;sorted_indices&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;argsort&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;values&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;sorted_values&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;array&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;values&lt;/span&gt;&lt;span class="p"&gt;)[&lt;/span&gt;&lt;span class="n"&gt;sorted_indices&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;sorted_weights&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;array&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;weights&lt;/span&gt;&lt;span class="p"&gt;)[&lt;/span&gt;&lt;span class="n"&gt;sorted_indices&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;cumsum_weights&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;cumsum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sorted_weights&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;total_weight&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;cumsum_weights&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;median_idx&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;argmax&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cumsum_weights&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="n"&gt;total_weight&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;sorted_values&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;median_idx&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="c1"&gt;# Fill missing values with the column's weighted median,
# keeping only the weights of the rows that are not missing
&lt;/span&gt;&lt;span class="n"&gt;df_filled_weighted_median&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;apply&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;lambda&lt;/span&gt; &lt;span class="n"&gt;col&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;col&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;fillna&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;weighted_median&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;col&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;dropna&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;array&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;weights&lt;/span&gt;&lt;span class="p"&gt;)[&lt;/span&gt;&lt;span class="n"&gt;col&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;notna&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="n"&gt;to_numpy&lt;/span&gt;&lt;span class="p"&gt;()]))&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;col&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;isna&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nb"&gt;any&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="n"&gt;col&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Be aware, however, that filling missing values with substitute values can make analysis charts misleading, depending on how many values were filled in.&lt;/p&gt;
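&lt;p&gt;A small illustration of that distortion, using made-up numbers: when half of a series is filled with its mean, the mean stays the same but the standard deviation shrinks, so a histogram of the filled data would look more concentrated than the real data. The series below is hypothetical:&lt;/p&gt;

```python
import numpy as np
import pandas as pd

# Hypothetical series with 3 of 6 values missing.
s = pd.Series([10.0, np.nan, 30.0, np.nan, 50.0, np.nan])

filled = s.fillna(s.mean())

# The mean is preserved, but the spread shrinks because every
# filled value sits exactly at the mean.
print("mean before/after:", s.mean(), filled.mean())
print("std before/after:", s.std(), filled.std())
```

&lt;p&gt;Comparing summary statistics before and after imputation is a cheap way to spot this effect.&lt;/p&gt;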

&lt;h3&gt;
  
  
  Statistical measures by subgroup
&lt;/h3&gt;

&lt;p&gt;Filling missing values with statistics computed per subgroup of the dataframe is useful when you want the imputed values to reflect the characteristics of specific subsets of your data. &lt;br&gt;
This accounts for heterogeneity in the data and avoids the distortion caused by filling missing values with overall statistics.&lt;/p&gt;

&lt;p&gt;Here is an example of filling missing values with the mean of each group:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Compute the mean per group
&lt;/span&gt;&lt;span class="n"&gt;group_means&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;groupby&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'Grupo'&lt;/span&gt;&lt;span class="p"&gt;)[&lt;/span&gt;&lt;span class="s"&gt;'Valor'&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;mean&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="c1"&gt;# Fill missing values with the mean of each group
# (group_keys=False keeps the original index on the result)
&lt;/span&gt;&lt;span class="n"&gt;df_filled_segmented_mean&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;groupby&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'Grupo'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;group_keys&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;)[&lt;/span&gt;&lt;span class="s"&gt;'Valor'&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nb"&gt;apply&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;lambda&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;fillna&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;mean&lt;/span&gt;&lt;span class="p"&gt;()))&lt;/span&gt;

&lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Values filled with the per-group mean:&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;df_filled_segmented_mean&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This code can be adapted to other statistical measures, such as the median, standard deviation, or mode. Just substitute the appropriate statistical function inside the &lt;strong&gt;&lt;code&gt;apply()&lt;/code&gt;&lt;/strong&gt; function.&lt;/p&gt;
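&lt;p&gt;As one possible variation, the sketch below fills by the group median. It swaps &lt;code&gt;apply()&lt;/code&gt; for &lt;code&gt;transform()&lt;/code&gt;, which returns a result aligned with the original index; that substitution and the sample data are assumptions for illustration, reusing the &lt;code&gt;Grupo&lt;/code&gt; and &lt;code&gt;Valor&lt;/code&gt; column names:&lt;/p&gt;

```python
import numpy as np
import pandas as pd

# Hypothetical data reusing the column names from the group-mean example.
df = pd.DataFrame({
    "Grupo": ["A", "A", "A", "B", "B", "B"],
    "Valor": [1.0, np.nan, 3.0, 10.0, 20.0, np.nan],
})

# transform("median") returns one median per row, aligned with df's index,
# so fillna picks the median of each row's own group.
group_medians = df.groupby("Grupo")["Valor"].transform("median")
df["Valor"] = df["Valor"].fillna(group_medians)

print(df)
```

&lt;p&gt;Because the result keeps the original index, it can be assigned straight back to the column.&lt;/p&gt;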

&lt;h3&gt;
  
  
  Interpolating missing values
&lt;/h3&gt;

&lt;p&gt;Another option is to use interpolation to fill missing values based on the existing values in each column. The &lt;strong&gt;&lt;code&gt;interpolate()&lt;/code&gt;&lt;/strong&gt; method does this, using linear interpolation by default. For example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;interpolate&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
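&lt;p&gt;To make the effect concrete, here is a minimal sketch on a made-up series. With the default linear method, each gap is filled along a straight line between its nearest known neighbours, treating the rows as equally spaced:&lt;/p&gt;

```python
import numpy as np
import pandas as pd

# Hypothetical series with gaps between known values.
s = pd.Series([1.0, np.nan, np.nan, 4.0, np.nan, 6.0])

# Default method is linear interpolation over row positions.
interpolated = s.interpolate()

print(interpolated.tolist())
```

&lt;p&gt;Here the gaps become 2.0, 3.0 and 5.0. This suits ordered data such as time series; for unordered rows, a statistic such as the median may be a safer choice.&lt;/p&gt;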






&lt;h2&gt;
  
  
  Checking for missing values again after transformations
&lt;/h2&gt;

&lt;p&gt;After completing the missing-value treatment steps, check again whether any null values remain in the dataframe, to confirm that the missing values were handled correctly.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# For null values:
&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;isnull&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nb"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="c1"&gt;# Or, equivalently:
&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;isna&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nb"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



</description>
    </item>
    <item>
      <title>#01QuickTips: Python</title>
      <dc:creator>Ana Carolina Branco Neumann</dc:creator>
      <pubDate>Tue, 15 Aug 2023 14:06:51 +0000</pubDate>
      <link>https://dev.to/anacbneumann/01quicktips-python-25nm</link>
      <guid>https://dev.to/anacbneumann/01quicktips-python-25nm</guid>
      <description>&lt;p&gt;In this first post, I'll provide some tips for beginners in the Python world, covering the main useful commands that are simple to use in exploratory analysis.&lt;/p&gt;




&lt;h2&gt;
  
  
  &lt;strong&gt;Comments:&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Use the "#" character to start a comment on a line. Comments are useful for adding explanatory notes to your code and are not executed by the Python interpreter.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight jsx"&gt;&lt;code&gt;&lt;span class="err"&gt;#&lt;/span&gt; &lt;span class="nx"&gt;This&lt;/span&gt; &lt;span class="nx"&gt;is&lt;/span&gt; &lt;span class="nx"&gt;a&lt;/span&gt; &lt;span class="nx"&gt;comment&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  &lt;strong&gt;Line Breaks:&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;To split a statement across multiple lines, you can end each line with a backslash "\" or wrap the expression in parentheses, brackets, or braces.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight jsx"&gt;&lt;code&gt;&lt;span class="err"&gt;#&lt;/span&gt; &lt;span class="nx"&gt;Using&lt;/span&gt; &lt;span class="nx"&gt;backslash&lt;/span&gt;
&lt;span class="nx"&gt;x&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="o"&gt;\&lt;/span&gt;
    &lt;span class="mi"&gt;20&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="o"&gt;\&lt;/span&gt;
    &lt;span class="mi"&gt;30&lt;/span&gt;

&lt;span class="err"&gt;#&lt;/span&gt; &lt;span class="nx"&gt;Using&lt;/span&gt; &lt;span class="nx"&gt;parentheses&lt;/span&gt;
&lt;span class="nx"&gt;y&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt;
     &lt;span class="mi"&gt;20&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt;
     &lt;span class="mi"&gt;30&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="err"&gt;#&lt;/span&gt; &lt;span class="nx"&gt;Using&lt;/span&gt; &lt;span class="nx"&gt;brackets&lt;/span&gt;
&lt;span class="nx"&gt;lista&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
         &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="err"&gt;#&lt;/span&gt; &lt;span class="nx"&gt;Using&lt;/span&gt; &lt;span class="nx"&gt;braces&lt;/span&gt;
&lt;span class="nx"&gt;dicionario&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;a&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
              &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;b&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  &lt;strong&gt;Indentation:&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Python uses indentation to delimit code blocks, instead of using curly braces or special keywords. Make sure to maintain the same indentation within a block to avoid syntax errors.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight jsx"&gt;&lt;code&gt;&lt;span class="err"&gt;#&lt;/span&gt; &lt;span class="nx"&gt;Example&lt;/span&gt; &lt;span class="k"&gt;of&lt;/span&gt; &lt;span class="nx"&gt;indentation&lt;/span&gt;
&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nx"&gt;x&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="nx"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;x is positive&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nx"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Still inside the block&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nx"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Outside the block&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  &lt;strong&gt;Printing to the Screen:&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Use the &lt;strong&gt;&lt;code&gt;print()&lt;/code&gt;&lt;/strong&gt; function to display messages or values on the standard output.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight jsx"&gt;&lt;code&gt;&lt;span class="nx"&gt;name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Ana&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;
&lt;span class="nx"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Hey there,&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;name&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nx"&gt;x&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;42&lt;/span&gt;
&lt;span class="nx"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;The secret value of x is:&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;x&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  &lt;strong&gt;User Input:&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Use the &lt;strong&gt;&lt;code&gt;input()&lt;/code&gt;&lt;/strong&gt; function to receive user input. Remember that the result of &lt;strong&gt;&lt;code&gt;input()&lt;/code&gt;&lt;/strong&gt; is always a string, so you might need to convert it to other types if necessary.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight jsx"&gt;&lt;code&gt;&lt;span class="nx"&gt;name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;input&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;What's your name? &lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nx"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Welcome,&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;name&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
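&lt;p&gt;As a small sketch of the conversion step, the snippet below turns typed text into an integer. The helper name &lt;code&gt;parse_age&lt;/code&gt; is hypothetical, and a fixed string stands in for the &lt;code&gt;input()&lt;/code&gt; call so the example runs without a keyboard:&lt;/p&gt;

```python
def parse_age(raw_text):
    """Convert text typed by the user into an int (hypothetical helper)."""
    return int(raw_text.strip())


# input() always returns a string; this literal stands in for:
# raw_text = input("How old are you? ")
raw_text = " 42 "

age = parse_age(raw_text)
print("Next year you will be", age + 1)
```

&lt;p&gt;If the text is not a valid integer, &lt;code&gt;int()&lt;/code&gt; raises a &lt;code&gt;ValueError&lt;/code&gt;, which a real program would usually catch and report.&lt;/p&gt;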






&lt;h2&gt;
  
  
  &lt;strong&gt;Assignment Operators:&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Python provides several useful assignment operators to perform common operations in a single line.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight jsx"&gt;&lt;code&gt;&lt;span class="nx"&gt;x&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;  &lt;span class="err"&gt;#&lt;/span&gt; &lt;span class="nx"&gt;Simple&lt;/span&gt; &lt;span class="nx"&gt;assignment&lt;/span&gt;
&lt;span class="nx"&gt;x&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;  &lt;span class="err"&gt;#&lt;/span&gt; &lt;span class="nx"&gt;x&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;x&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;
&lt;span class="nx"&gt;x&lt;/span&gt; &lt;span class="o"&gt;-=&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;  &lt;span class="err"&gt;#&lt;/span&gt; &lt;span class="nx"&gt;x&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;x&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;
&lt;span class="nx"&gt;x&lt;/span&gt; &lt;span class="o"&gt;*=&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;  &lt;span class="err"&gt;#&lt;/span&gt; &lt;span class="nx"&gt;x&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;x&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;
&lt;span class="nx"&gt;x&lt;/span&gt; &lt;span class="o"&gt;/=&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;  &lt;span class="err"&gt;#&lt;/span&gt; &lt;span class="nx"&gt;x&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;x&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  &lt;strong&gt;Importing Libraries:&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Use the &lt;strong&gt;&lt;code&gt;import&lt;/code&gt;&lt;/strong&gt; keyword to import libraries and modules into your code. This allows you to access additional resources and functions provided by these libraries.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight jsx"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="nx"&gt;math&lt;/span&gt;

&lt;span class="nx"&gt;x&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;math&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;sqrt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;25&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nx"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;x&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="nx"&gt;datetime&lt;/span&gt; &lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="nx"&gt;datetime&lt;/span&gt;

&lt;span class="nx"&gt;now&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;datetime&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;now&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="nx"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;now&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  &lt;strong&gt;Shape&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;The &lt;strong&gt;&lt;code&gt;shape&lt;/code&gt;&lt;/strong&gt; attribute of a pandas DataFrame returns a tuple with the number of rows followed by the number of columns in the dataset:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight jsx"&gt;&lt;code&gt;&lt;span class="nx"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;shape&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  &lt;strong&gt;DataFrame Columns&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;The &lt;strong&gt;&lt;code&gt;columns&lt;/code&gt;&lt;/strong&gt; attribute returns an Index holding the labels of every column in the dataset. It is very useful for recalling which columns make up the dataset after transformations or during exploratory analysis:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight jsx"&gt;&lt;code&gt;&lt;span class="nx"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;columns&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
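&lt;p&gt;A minimal, self-contained sketch of both attributes on a made-up DataFrame (the data and variable names are assumptions for illustration):&lt;/p&gt;

```python
import pandas as pd

# Hypothetical dataset with 3 rows and 2 columns.
df = pd.DataFrame({"name": ["Ana", "Bia", "Caio"], "age": [30, 25, 41]})

# shape is an attribute, not a method, so there are no parentheses.
n_rows, n_cols = df.shape

# columns is an Index; list() converts it to a plain Python list.
column_names = list(df.columns)

print(n_rows, n_cols)
print(column_names)
```

&lt;p&gt;Unpacking &lt;code&gt;df.shape&lt;/code&gt; into two names is a common way to log the dataset size before and after each transformation.&lt;/p&gt;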



</description>
      <category>python</category>
      <category>tip</category>
      <category>programming</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>#01DicasRápidas: Python</title>
      <dc:creator>Ana Carolina Branco Neumann</dc:creator>
      <pubDate>Tue, 15 Aug 2023 13:36:52 +0000</pubDate>
      <link>https://dev.to/anacbneumann/01dicasrapidas-python-3f50</link>
      <guid>https://dev.to/anacbneumann/01dicasrapidas-python-3f50</guid>
      <description>&lt;p&gt;In this first post, I'll share some tips for beginners in the Python world, covering the main useful commands that are simple to use in exploratory analysis.&lt;/p&gt;




&lt;h2&gt;
  
  
  &lt;strong&gt;Comments&lt;/strong&gt;:
&lt;/h2&gt;

&lt;p&gt;Use the "#" character to start a comment on a line. Comments are useful for adding explanatory notes to your code and are not executed by the Python interpreter.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# This is a comment
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  &lt;strong&gt;Line breaks&lt;/strong&gt;:
&lt;/h2&gt;

&lt;p&gt;To split a statement across multiple lines, you can end each line with a backslash "\" or wrap the expression in parentheses, brackets, or braces.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Using a backslash
&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; \
    &lt;span class="mi"&gt;20&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; \
    &lt;span class="mi"&gt;30&lt;/span&gt;

&lt;span class="c1"&gt;# Using parentheses
&lt;/span&gt;&lt;span class="n"&gt;y&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt;
     &lt;span class="mi"&gt;20&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt;
     &lt;span class="mi"&gt;30&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Usando colchetes
&lt;/span&gt;&lt;span class="n"&gt;lista&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
         &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="c1"&gt;# Usando chaves
&lt;/span&gt;&lt;span class="n"&gt;dicionario&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="s"&gt;'a'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
              &lt;span class="s"&gt;'b'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  &lt;strong&gt;Indentation&lt;/strong&gt;:
&lt;/h2&gt;

&lt;p&gt;Python uses indentation to delimit blocks of code, instead of curly braces or special keywords. Make sure to keep the same indentation within a block to avoid syntax errors.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Exemplo de indentação
&lt;/span&gt;&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"x é positivo"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Ainda dentro do bloco"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Fora do bloco"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  &lt;strong&gt;Printing to the screen&lt;/strong&gt;:
&lt;/h2&gt;

&lt;p&gt;Use the &lt;strong&gt;&lt;code&gt;print()&lt;/code&gt;&lt;/strong&gt; function to display messages or values on standard output.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;nome&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"Ana"&lt;/span&gt;
&lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Fala,"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;nome&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;42&lt;/span&gt;
&lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"O valor secreto de x é:"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
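
&lt;p&gt;As a minimal sketch that goes one step beyond the example above, f-strings (available since Python 3.6) and the &lt;code&gt;sep&lt;/code&gt; argument of &lt;code&gt;print()&lt;/code&gt; give more control over the output. The variable name &lt;code&gt;mensagem&lt;/code&gt; and the sample values are illustrative assumptions, not part of the original post.&lt;/p&gt;

```python
nome = "Ana"
x = 42

# An f-string embeds values directly inside the text
mensagem = f"Hello, {nome}! The secret value of x is {x}."
print(mensagem)

# sep changes the separator placed between the arguments (a space by default)
print("2023", "08", "15", sep="-")
```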






&lt;h2&gt;
  
  
  &lt;strong&gt;User input&lt;/strong&gt;:
&lt;/h2&gt;

&lt;p&gt;Use the &lt;strong&gt;&lt;code&gt;input()&lt;/code&gt;&lt;/strong&gt; function to read input from the user. Remember that &lt;strong&gt;&lt;code&gt;input()&lt;/code&gt;&lt;/strong&gt; always returns a string, so you may need to convert the result to another type.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;nome&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;input&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Qual seu nome? "&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Bem-vindo,"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;nome&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
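
&lt;p&gt;The type conversion mentioned above could look roughly like the sketch below. The helper &lt;code&gt;converter_idade&lt;/code&gt; is a hypothetical name used only for illustration; it receives the text that &lt;code&gt;input()&lt;/code&gt; would return, so the example also runs without a user at the keyboard.&lt;/p&gt;

```python
def converter_idade(texto):
    """Convert the raw text returned by input() into an integer."""
    return int(texto.strip())

# In an interactive session the text would come from input(), for example:
# idade = converter_idade(input("How old are you? "))
idade = converter_idade(" 30 ")

# idade is now an int, so arithmetic works as expected
print(idade + 1)
```

&lt;p&gt;If the text is not a valid whole number, &lt;code&gt;int()&lt;/code&gt; raises a &lt;code&gt;ValueError&lt;/code&gt;, which is worth handling in real programs.&lt;/p&gt;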






&lt;h2&gt;
  
  
  &lt;strong&gt;Assignment operators&lt;/strong&gt;:
&lt;/h2&gt;

&lt;p&gt;Python offers several assignment operators that perform common operations in a single line.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;  &lt;span class="c1"&gt;# Atribuição simples
&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;  &lt;span class="c1"&gt;# x = x + 5
&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;-=&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;  &lt;span class="c1"&gt;# x = x - 3
&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;*=&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;  &lt;span class="c1"&gt;# x = x * 2
&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;/=&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;  &lt;span class="c1"&gt;# x = x / 4
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
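
&lt;p&gt;To make the effect of each operator concrete, here is a small sketch that tracks the values step by step and adds three more operators from the same family (&lt;code&gt;//=&lt;/code&gt;, &lt;code&gt;**=&lt;/code&gt; and &lt;code&gt;%=&lt;/code&gt;). The variable &lt;code&gt;y&lt;/code&gt; and its starting value are assumptions chosen for illustration.&lt;/p&gt;

```python
x = 10
x += 5    # 15
x -= 3    # 12
x *= 2    # 24
x /= 4    # 6.0 (true division always produces a float)

y = 17
y //= 5   # 3 (floor division)
y **= 2   # 9 (exponentiation)
y %= 4    # 1 (remainder of the division)

print(x, y)
```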






&lt;h2&gt;
  
  
  &lt;strong&gt;Importing libraries&lt;/strong&gt;:
&lt;/h2&gt;

&lt;p&gt;Use the &lt;strong&gt;&lt;code&gt;import&lt;/code&gt;&lt;/strong&gt; keyword to bring libraries and modules into your code. This gives you access to the additional features and functions those libraries provide.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;math&lt;/span&gt;

&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;math&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sqrt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;25&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;datetime&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt;

&lt;span class="n"&gt;agora&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;now&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;agora&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
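
&lt;p&gt;Two other common import forms are an alias for the whole module and a selective import of specific names. The sketch below uses only the standard &lt;code&gt;math&lt;/code&gt; module; the alias &lt;code&gt;m&lt;/code&gt; and the variables &lt;code&gt;raio&lt;/code&gt; and &lt;code&gt;area&lt;/code&gt; are illustrative choices.&lt;/p&gt;

```python
import math as m            # alias: the module is now reachable as "m"
from math import pi, sqrt   # import only the names that are needed

raio = 2
area = pi * raio ** 2       # area of a circle with radius 2

print(m.floor(area))        # round the area down to a whole number
print(sqrt(16))
```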






&lt;h2&gt;
  
  
  Shape
&lt;/h2&gt;

&lt;p&gt;Returns a tuple with the number of rows followed by the number of columns of a dataset:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight jsx"&gt;&lt;code&gt;&lt;span class="nx"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;shape&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
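
&lt;p&gt;A minimal sketch of &lt;code&gt;shape&lt;/code&gt; in use, assuming &lt;code&gt;df&lt;/code&gt; is a pandas DataFrame and that pandas is installed. The sample data and the names &lt;code&gt;linhas&lt;/code&gt; and &lt;code&gt;colunas&lt;/code&gt; are made up for illustration.&lt;/p&gt;

```python
import pandas as pd

# Hypothetical data, used only for illustration
df = pd.DataFrame({
    "nome": ["Ana", "Bia", "Caio"],
    "idade": [28, 35, 41],
})

# shape is an attribute (no parentheses) holding a (rows, columns) tuple
linhas, colunas = df.shape
print(df.shape)
print(linhas, colunas)
```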






&lt;h2&gt;
  
  
  Columns of a DataFrame
&lt;/h2&gt;

&lt;p&gt;Returns the labels of all the columns that make up a dataset. Very useful for recalling which columns the dataset contains after transformations, or even in the middle of an exploratory analysis:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight jsx"&gt;&lt;code&gt;&lt;span class="nx"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;columns&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
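
&lt;p&gt;As a rough illustration, again assuming a pandas DataFrame, the sketch below checks the column labels before and after a transformation. The data and the names &lt;code&gt;nomes_colunas&lt;/code&gt; and &lt;code&gt;df_renomeado&lt;/code&gt; are hypothetical.&lt;/p&gt;

```python
import pandas as pd

# Hypothetical data, used only for illustration
df = pd.DataFrame({"nome": ["Ana", "Bia"], "idade": [28, 35]})

print(df.columns)                     # an Index object holding the labels
nomes_colunas = df.columns.tolist()   # the same labels as a plain Python list
print(nomes_colunas)

# rename returns a new DataFrame; the original one is left unchanged
df_renomeado = df.rename(columns={"idade": "idade_anos"})
print(df_renomeado.columns.tolist())
```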



</description>
      <category>python</category>
      <category>tip</category>
      <category>programming</category>
      <category>tutorial</category>
    </item>
  </channel>
</rss>
