DEV Community: Ana Carolina Branco Neumann

Churn Prediction - Telco Company

Ana Carolina Branco Neumann — Tue, 28 Jan 2025 01:31:05 +0000

📌 Dataset Source:

Telco Customer Churn - Kaggle

📂 Github Repository:

Telco Customer Churn - Github

About the Project

This project leverages Machine Learning to predict customer churn for a telecommunications company. The main goal is to identify patterns indicating the likelihood of cancellation, enabling the company to implement proactive retention strategies before customers discontinue their services.

The primary focus is on the Recall metric, which is crucial for capturing most churners, even at the cost of some false positives, as preventive retention actions are more advantageous for the business.

Exploratory Data Analysis

During the EDA, patterns in the dataset were explored to understand the factors associated with churn. Key findings include:

Monthly vs. Long-Term Contracts:

Customers with monthly contracts showed a higher likelihood of churn, suggesting that long-term contracts may encourage loyalty.
Additional Services:

Customers subscribing to additional services, such as online security or technical support, tended to churn less.
Tenure and Monthly Charges:
- Customers with longer tenure (more months with the company) exhibited greater loyalty.
- Higher MonthlyCharges correlated positively with churn.
Removal of TotalCharges:

The TotalCharges column was removed due to high collinearity with tenure, which could affect the model's stability.
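
The collinearity argument can be checked with a correlation matrix. The snippet below is a minimal sketch on synthetic data, not the Kaggle file: it assumes TotalCharges is roughly tenure multiplied by MonthlyCharges plus noise, and the frame name `df_demo` is hypothetical.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data (hypothetical values, not the real dataset):
# TotalCharges is assumed to be roughly tenure * MonthlyCharges.
rng = np.random.default_rng(0)
tenure = rng.integers(1, 73, 1000)
monthly = rng.uniform(20, 120, 1000)
df_demo = pd.DataFrame({
    "tenure": tenure,
    "MonthlyCharges": monthly,
    "TotalCharges": tenure * monthly + rng.normal(0, 50, 1000),
})

# Pearson correlation matrix; values close to 1 flag redundant features.
corr = df_demo.corr()
print(corr.loc["tenure", "TotalCharges"])

# Drop the redundant column before modelling.
df_demo = df_demo.drop(columns=["TotalCharges"])
```

On the real data, the same `corr()` call on the numeric columns would show whether the relationship is strong enough to justify dropping the column.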

Technical Choices

Algorithm: Why SVM?

The Support Vector Machine (SVM) was chosen for several reasons:

Efficiency with Smaller Datasets:

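With approximately 7,000 rows, the dataset is small enough for SVM to train in a reasonable time, and the regularization parameter C helps limit overfitting while still capturing complex patterns.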
Flexible Kernel Options:

GridSearchCV evaluated both the linear and rbf kernels and kept whichever performed best in cross-validation, so the model could capture either linear or non-linear relationships.
Binary Classification:

SVM is well-suited for binary problems like this one, where the goal is to predict churn (Yes or No).

Preprocessing Steps:

Scaling (MinMaxScaler):

Models like SVM are sensitive to differences in scale. Scaling was applied to normalize numeric variables between 0 and 1.
Encoding (OneHotEncoder):

Categorical variables were transformed into dummy variables. This ensures that categories are properly represented in a format the model can understand.

Data Splitting and Validation:

The dataset was split into 70% training and 30% testing sets.
Validation was conducted using 5-fold cross-validation to ensure robust results.

Machine Learning Pipeline

The implementation followed these steps:

Dataset Splitting:

The dependent variable (Churn) and independent variables were separated, ensuring proper data splitting for training and testing.
Hyperparameter Tuning for SVM:

Optimization was performed using GridSearchCV, adjusting:
- C: Regularization parameter, controlling the trade-off between margin and error.
- Kernel: Evaluation of linear and rbf kernels.
Model Evaluation Metrics:

The model was assessed using:
- Accuracy: Percentage of correct predictions.
- Recall: Share of actual churners that the model correctly identified.
- Precision: Share of predicted churners who actually churned.
- F1 Score: Harmonic mean of precision and recall.
- ROC AUC: The model's ability to distinguish between classes.
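
The steps above can be sketched as a single scikit-learn pipeline. This is an illustrative sketch rather than the project's actual code: the data is synthetic, the column subset, the grid values for C, the stratified split, and the use of `scoring="recall"` are all assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder
from sklearn.svm import SVC

# Synthetic stand-in for the Telco data (hypothetical values, not the real dataset).
rng = np.random.default_rng(42)
n = 300
X = pd.DataFrame({
    "tenure": rng.integers(1, 72, n),
    "MonthlyCharges": rng.uniform(20, 120, n),
    "Contract": rng.choice(["Month-to-month", "One year", "Two year"], n),
})
# Toy churn rule: short-tenure, month-to-month customers churn more often.
churn_prob = np.where((X["Contract"] == "Month-to-month") & (X["tenure"] < 24), 0.7, 0.1)
y = (rng.random(n) < churn_prob).astype(int)

# 70/30 split; stratifying (an assumption) keeps the churn rate equal in both sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)

# Scale numeric columns to [0, 1] and one-hot encode categorical ones.
preprocess = ColumnTransformer([
    ("num", MinMaxScaler(), ["tenure", "MonthlyCharges"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["Contract"]),
])
pipeline = Pipeline([("preprocess", preprocess), ("svm", SVC())])

# Each C/kernel combination is scored with 5-fold CV; the best one is refit.
param_grid = {"svm__C": [0.1, 1, 10], "svm__kernel": ["linear", "rbf"]}
search = GridSearchCV(pipeline, param_grid, cv=5, scoring="recall")
search.fit(X_train, y_train)

print(search.best_params_)
print("Test recall:", search.score(X_test, y_test))
```

Putting the scaler and encoder inside the pipeline means they are fitted only on the training folds during cross-validation, which avoids leaking information from the validation folds.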

Results

Metric	Value
Accuracy	80.81%
Recall	56.09%
Precision	74.35%
F1 Score	63.95%
ROC AUC	85.42%

Analysis of Results:

Accuracy is high at about 81%, but the primary focus was Recall, which reached 56%. This means that just over half of the customers who actually churned were identified, while roughly 44% were missed, so proactive interventions would reach only part of the at-risk customers.
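
As a quick consistency check, the F1 score can be recomputed from the reported precision and recall. The snippet below only uses the two values from the results table; the variable names are illustrative.

```python
# Values taken from the results table above.
precision = 0.7435
recall = 0.5609

# F1 is the harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)
print(f"F1 = {f1:.4f}")  # roughly 0.639, in line with the reported 63.95%

# Share of actual churners the model misses.
missed = 1 - recall
print(f"Missed churners: {missed:.2%}")
```

The small difference from the reported F1 comes from rounding precision and recall to two decimal places.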

Future Improvements

Integrating External Data:
- Enrich the dataset with customer satisfaction feedback, such as NPS or survey responses.
- Include economic or regional indicators to identify specific patterns.
Experimenting with Models:
- Test models like XGBoost or LightGBM, which handle complex interactions well.
- Perform feature importance analysis to optimize variable selection.
Automation:
- Develop a real-time pipeline to update the model with periodically refreshed data.
- Integrate the model into the CRM system for automated retention actions.
Customer Segmentation:
- Focus retention efforts on high-value or high-risk customer segments.

Project Files

EDA.ipynb: Exploratory Data Analysis and main insights.
pre_processing.py: Data preprocessing and transformation script.
ML_application.py: Machine learning training, validation, and result exportation.

Contact Information:

For further inquiries or collaboration opportunities, feel free to reach out via LinkedIn.

#02Python - Outliers and their types

Ana Carolina Branco Neumann — Thu, 31 Aug 2023 00:10:54 +0000

Outliers are unusual values that stand out from the rest of the data in a set, often because they are at extreme values. They can result from measurement errors, incorrect inputs, rare events, or even represent information about the observed phenomenon.

Outliers can be categorized into different types based on their nature. Here are some types of outliers:

Univariate Outliers (single variable): Refers to a value that stands far apart from the others in a single variable.
An illustrative example occurs in a pizza-eating contest where all competitors eat 3 to 5 slices, but one person devours 20 slices and still asks for more! This person is the univariate outlier of the group.
It can distort measures such as the mean and standard deviation (the median is far less sensitive) and graphs, affecting statistics related to the variable.
Multivariate Outliers (across multiple variables): Are values identified when we consider multiple variables at the same time. Detecting multivariate outliers is more complex as it requires considering interactions between variables.
An example happens at a costume party where most people chose common costumes, but someone shows up dressed as a dragon, floating with a jetpack, and carrying a giant violin. This "Space Dragon Musician" is a multivariate outlier.
It can influence analyses involving interactions between variables, such as heatmaps and correlations, leading to wrong conclusions if not properly addressed.
Global Outliers: Are values significantly distant from all the other data points in the entire dataset.
For instance, in a physical education class, the teacher asked everyone to write their heights on a sheet. While most were around 1.50m to 1.70m, the "Super Basketball Player" wrote 2.20m on the paper, making him the global outlier.
This specific type clearly distorts analyses like the mean, making it less representative of the data, and can distort the overall view of the data.
Contextual Outliers: Are observed based on the specific context of the problem.
For example, in a salary study within a company, a value 10 times above the average is unusual and noteworthy. However, upon closer examination, it might correspond to a high-ranking position in the organization. Although much higher than the others, its presence is not an error.
The impact here might be smaller, as its justification is tied to circumstances. It usually doesn't drastically distort aggregate statistics if treated as a special case.
Replicated Data Outliers: Are outliers found when the same quantity is measured repeatedly at different times or locations.
They can arise due to temporal or spatial variations, or changes in measurement methodology. They can provide insights into changes in the phenomenon over time or space.
Imagine measuring your mug's height using a ruler on your desk every day. On the first day, you record 12 cm, the next day 13 cm, and the third day 11 cm. This doesn't mean the mug is growing or the ruler is changing, but your measurement is varying (12, 13, 11 centimeters).
Influential Outliers: Are values that significantly impact statistical analyses, such as regression model fitting. This type of outlier can affect the slope and fit of the regression line, as well as statistical models, potentially resulting in incorrect conclusions if not properly addressed.
An example is seen in a car dealership that mostly sells popular cars between $30,000 and $50,000. A luxury car was sold for $150,000. This distinct sale had a major impact on sales metrics and the average price of cars sold.
Random Outliers: Are caused by measurement errors or natural variations in the data. They occur randomly and don't signify significant patterns.
For instance, in an industry, during a temperature measurement experiment, sensors typically read between 22°C and 25°C. However, in one reading, the sensor indicated 500°C.
This might have been a measurement error and doesn't represent the actual temperature.

It's important to note that not all outliers are errors or process failures. Some provide information about the studied phenomenon or indicate special circumstances. When dealing with outliers, it's essential to understand the context and decide whether they should be treated, transformed, or retained.
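
To make the univariate case concrete, here is a small sketch using the pizza contest numbers. The post does not name a detection method; the 1.5 × IQR rule used here is one common convention and is an assumption, as are the individual slice counts.

```python
import numpy as np

# Hypothetical contest results: most people eat 3 to 5 slices, one eats 20.
slices = np.array([3, 4, 5, 4, 3, 5, 4, 20])

# IQR rule: flag values more than 1.5 * IQR outside the quartiles.
q1, q3 = np.percentile(slices, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = slices[(slices < lower) | (slices > upper)]

print("Outliers:", outliers)
print("Mean:", slices.mean(), "Median:", np.median(slices))
```

The single extreme value pulls the mean up to 6 slices while the median stays at 4, which shows why the mean is the more fragile summary when outliers are present.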

#02Python - Outliers e seus tipos

Ana Carolina Branco Neumann — Thu, 31 Aug 2023 00:07:28 +0000

Outliers are atypical values that differ from the rest of the data in a set, usually because they sit at extreme values. They can result from measurement errors, incorrect inputs, rare events, or even represent information about the observed fact.

Outliers can be divided into different types based on their nature. Here are some types of outliers:

Univariate outliers (a single variable): a value that deviates from the others in a single variable.
An illustrative example is a pizza-eating contest where all competitors eat 3 to 5 slices, but one person devours 20 slices and still asks for more! That person is the univariate outlier of the group.
It can distort measures such as the mean and standard deviation, as well as graphs, mainly affecting the statistics related to that variable.
Multivariate outliers (across several variables): values identified when we look at several variables at the same time. Detecting multivariate outliers is more complex because it requires considering interactions between variables.
An example is a costume party where most people chose common outfits, but someone shows up dressed as a dragon, floating with a jetpack, and carrying a giant violin. It seems the "Space Dragon Musician" is a multivariate outlier.
It can influence analyses that involve interactions between variables, such as heatmaps and correlations, leading to wrong conclusions if not properly addressed.
Global outliers: values that are significantly distant from all the other data points in the entire dataset.
For example, in a physical education class, the teacher asked everyone to write their heights on a sheet. While everyone else was around 1.50m to 1.70m, the "Super Basketball Player" wrote 2.20m on the paper; he is the global outlier.
This specific type clearly distorts analyses such as the mean, making it less representative of the data, and can distort the overall view of the data.
Contextual outliers: observed based on the specific context of the problem.
For example, in a salary study within a company, a value 10 times above the average is unusual and worth investigating. On closer examination, though, it may correspond to a high-ranking position in the organization. Although much higher than the others, its presence is not an error.
The impact in this case may be smaller, since its justification is tied to circumstances. It usually doesn't noticeably distort aggregate statistics if treated as a special case.
Replicated data outliers: outliers found when the same quantity is measured repeatedly at different times or locations.
They can arise from temporal or spatial variations, or from changes in the way measurements are taken. They can also provide information about changes in the phenomenon over time or space.
Imagine that every day you sit through a slow meeting at work and, with a ruler from your desk, you measure your mug. On the first day you record 12 cm, on the second 13 cm, and on the third 11 cm. This doesn't mean the mug is growing or the ruler is changing, but your measurement is varying (12, 13, 11 centimeters).
Influential outliers: values that significantly impact statistical analyses, such as regression model fitting. This type of outlier can affect the slope and fit of the regression line, as well as statistical models, potentially resulting in wrong conclusions if not properly addressed.
An example is a car dealership that mostly sells popular cars between R$ 30.000 and R$ 50.000. A luxury car was sold for R$ 150.000. This atypical sale had a major impact on the sales metrics and the average price of cars sold.
Random outliers: caused by measurement errors or natural variations in the data. They occur randomly and don't represent significant patterns.
An example: in an industrial temperature measurement experiment, sensors usually read between 22°C and 25°C. However, in one reading, the sensor indicated 500°C.
This may have been a measurement error and doesn't indicate the actual temperature.

It's important to note that not all outliers are errors or process failures. Some carry information about the studied phenomenon or indicate special situations. When dealing with outliers, it's essential to understand the context and decide whether they should be treated, transformed, or kept.

#01Python - Missing Values (NaN | Null)

Ana Carolina Branco Neumann — Wed, 16 Aug 2023 16:00:58 +0000

Types of Data Absence

There are different patterns of data absence:

MCAR (Missing Completely At Random): The data's absence is entirely random and unrelated to any variable in the dataset. The probability of a value being missing is the same for all observations; it depends neither on observed values nor on the missing values themselves. As a result, analyzing only the complete cases doesn't introduce systematic bias, although it does reduce the sample size.
MAR (Missing At Random): The absence of data might be related to other observed variables but isn't related to the missing value itself. In other words, the probability of a value being missing may depend on the information available in other variables but, once those variables are taken into account, it doesn't depend on the actual value that's missing. As long as these variables are present in the dataset and the analysis or imputation accounts for them, no systematic bias is introduced.
MNAR (Missing Not At Random): The absence of data is related to the missing value itself, and this relationship can't be explained by other variables in the dataset. In other words, the probability of a value being missing depends on the actual value that's missing, regardless of other observed variables. Because missingness is tied to the unobserved values themselves, it introduces a systematic bias that the observed data alone can't correct.

It's important to understand the type of data absence when working with a dataset, as each type of absence requires different treatment strategies or data imputation methods to deal with missing values. Knowing the types of data absence can also help in correctly interpreting results and avoiding false or biased conclusions.

Note: "No systematic bias introduced" means that the absence of data doesn't affect the analysis in a biased or systematic manner. In other words, the missing data doesn't consistently influence the results.
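
The three mechanisms can be illustrated with a small simulation. This is a sketch on made-up data: the columns `age` and `income`, the missingness probabilities, and the cutoff of 60 are all assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 10_000
df_sim = pd.DataFrame({
    "age": rng.integers(18, 80, n),
    "income": rng.normal(50, 10, n),
})

# MCAR: every income value has the same 30% chance of being missing.
mcar = df_sim["income"].mask(rng.random(n) < 0.3)

# MAR: missingness depends on an observed variable (age), not on income itself.
p_missing = np.where(df_sim["age"] > 60, 0.6, 0.1)
mar = df_sim["income"].mask(rng.random(n) < p_missing)

# MNAR: missingness depends on the income value itself (high incomes are hidden).
mnar = df_sim["income"].mask(df_sim["income"] > 60)

print("True mean:", df_sim["income"].mean())
print("MCAR mean:", mcar.mean())
print("MAR mean: ", mar.mean())
print("MNAR mean:", mnar.mean())
```

Under MNAR the observed mean is clearly lower than the true mean, because the hidden values are exactly the high ones. In this toy setup income is generated independently of age, so the MAR mean also stays close to the truth; if income varied with age, the MAR case would need to account for age to avoid bias.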

Identifying Null Values

To identify missing values, we use the isnull() method of pandas to check which values are null in the dataframe. For example:

df.isnull()

This command returns a dataframe with the same format as the original, but with boolean values indicating whether each element is null or not. The .isna() method can also be used, as described below.

Identifying NaN Values

NaN → Not a Number.

isna() is a pandas method that returns a boolean matrix indicating which elements are missing values (NaN) or null. It has the same functionality as the isnull() method.

The isna() function can be applied to an entire dataframe or to a specific series within the dataframe. When used on the entire dataframe, it returns a dataframe with the same shape as the input, where each element is replaced by True if it's a missing value or False otherwise.

Here's an example of using the isna() method:

import pandas as pd

df = pd.DataFrame({'A': [1, 2, None], 'B': [3, None, 5]})
print(df.isna())

Counting Missing Values

To get an overview of missing values in each column, you can use the sum() method in combination with the isnull() method. For example:

df.isnull().sum()

This command returns the total number of missing values in each column. The same can be applied to the isna() command. For example:

df.isna().sum()

Visualization of Missing Data [Missingno]

The missingno library is a useful tool for visualizing missing data patterns in a dataframe. It generates charts that help identify patterns of missing values and understand the distribution of these values in a dataframe. Some of its key charts are:

Matrix Plot: Shows the presence or absence of values in each cell of the dataframe. Each row represents a sample or record, and each column represents a variable, with empty/blank cells indicating missing values. This makes patterns and correlations between missing values visible.
Bar Chart: Displays one bar per variable whose height is the number of non-missing values, so shorter bars indicate more missing data. It helps identify variables with a significant number of missing values.
Heatmap: Shows the nullity correlation between pairs of variables, that is, how strongly the presence or absence of one variable is associated with the presence or absence of another. It's useful for spotting variables that tend to be missing together.

The missingno library also provides a dendrogram, which clusters variables by the similarity of their missing-value patterns.

Here's an example of using the missingno library:

import missingno as msno
import matplotlib.pyplot as plt

# Matrix plot
msno.matrix(df)
plt.show()

# Bar chart of missing values
msno.bar(df)
plt.show()

# Heatmap
msno.heatmap(df)
plt.show()

Setting a Threshold for Missing Values per Column

There's no standard or rule to determine the threshold of missing values per column, as it can vary depending on the problem, data nature, and analysis requirements.

However, there are some approaches to defining this threshold:

Percentage Threshold: You can set a percentage threshold, for example, allowing a column to have up to 5% (or any other value) of missing values. If the proportion of missing values in a column exceeds this threshold, actions like value imputation or column deletion can be taken.
Domain Analysis: Certain columns might be more critical and require fewer missing values, while others may have more leeway for data absence. For instance, in a medical data dataframe, variables like age or gender might be considered essential, whereas other columns with more specific information might tolerate more missing values.
Impact on Results: Consider the impact of missing values on final results. If a column contains critical information for the problem at hand or is necessary for constructing the result, it's advisable to have a lower threshold for missing values.

Documenting the missing data treatment process, especially the decision about the missing values threshold, is important.
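
A percentage threshold can be applied by computing the share of missing values per column. The sketch below uses a small made-up dataframe; the column names and the 5% cutoff are illustrative assumptions.

```python
import pandas as pd

# Hypothetical example data with different amounts of missing values.
df_demo = pd.DataFrame({
    "age": [25, 31, None, 40, 52, 38, 29, 45, 33, 61],
    "plan": ["a", "b", "a", "b", "a", "b", "a", "b", "a", "b"],
    "notes": [None, None, "x", None, "y", None, None, "z", None, None],
})

threshold = 0.05  # accept up to 5% missing values per column

# isna().mean() gives the fraction of missing values in each column.
missing_share = df_demo.isna().mean()
above = missing_share[missing_share > threshold].index.tolist()

print(missing_share)
print("Columns above the threshold:", above)
```

Columns listed in `above` are the ones that need a decision (imputation, deletion, or a documented exception).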

Dealing with Missing Values

There are several strategies to handle missing values. Here are some common options:

Removing Rows or Columns

If the missing values are in a small number of rows or columns, you can choose to remove them. Use the dropna() method to do this. For example, to remove all rows containing at least one null value:

# Dropping rows with missing values:
df = df.dropna()

To remove columns with at least one null value, specify the axis=1 parameter:

# Dropping columns with missing values:
df = df.dropna(axis=1)

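Below is an example that drops rows with missing values in columns where at most 5% of the values are missing, and then checks which columns still have missing values above that threshold:

# Setting threshold (5% of the number of rows):
threshold = len(df) * 0.05

# Columns whose missing-value count is at or below the threshold:
cols_to_drop_na = df.columns[df.isna().sum() <= threshold]

# Drop rows with missing values in those columns:
df = df.dropna(subset=cols_to_drop_na)

# Columns that still have missing values (above the threshold):
print(df.columns[df.isna().sum() > 0])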

Replacing Missing Values with Statistical Values

If you prefer to keep all rows and columns, you can fill the missing values with specific values. Use the fillna() method to fill null values. In the example below, all null values are filled with '0':

df = df.fillna(0)

You can also fill with other values, such as column statistics:

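# Fill missing values with column mean (numeric columns only)
df_filled_mean = df.fillna(df.mean(numeric_only=True))

# Fill missing values with column median (numeric columns only)
df_filled_median = df.fillna(df.median(numeric_only=True))

# Fill missing values with column standard deviation
# (rarely a meaningful imputation value; shown for completeness)
df_filled_std = df.fillna(df.std(numeric_only=True))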

# Fill missing values with column mode (most frequent value)
df_filled_mode = df.apply(lambda col: col.fillna(col.mode()[0]) if col.isna().any() else col)

Or advanced statistics like weighted median:

import numpy as np

# Define one weight per row of df (example values; assumes df has 5 rows)
weights = np.array([1, 2, 1, 3, 1])

# Function to calculate weighted median
def weighted_median(values, weights):
    values = np.asarray(values)
    weights = np.asarray(weights)
    sorted_indices = np.argsort(values)
    sorted_values = values[sorted_indices]
    sorted_weights = weights[sorted_indices]
    cumsum_weights = np.cumsum(sorted_weights)
    total_weight = cumsum_weights[-1]
    # First position where the cumulative weight reaches half of the total
    median_idx = np.argmax(cumsum_weights >= total_weight / 2)
    return sorted_values[median_idx]

# Fill missing values with column weighted median,
# using only the weights of the rows where the column is not missing
df_filled_weighted_median = df.apply(
    lambda col: col.fillna(weighted_median(col.dropna(), weights[col.notna().to_numpy()]))
    if col.isna().any() else col
)

But be aware that filling missing values with other values can lead to misleading analysis insights, depending on the amount of filled values.

Statistical Measures by Subgroups

Filling missing values with statistical measures segmented by subgroups within the dataframe is a useful technique when you want to impute missing values based on specific characteristics of data subsets. This allows considering data heterogeneity and avoiding distortions when filling missing values with general statistical measures.

Here's an example of how to fill missing values with the mean of a group:

# Calculate group mean (useful for inspecting the values that will be imputed)
group_means = df.groupby('Group')['Value'].mean()

# Fill missing values with group mean; transform() keeps the original row index
df_filled_segmented_mean = df.groupby('Group')['Value'].transform(lambda x: x.fillna(x.mean()))

print("Values filled with segmented mean by group:\n", df_filled_segmented_mean)

This code can be adapted for other statistical measures like median, standard deviation, mode, among others. Simply replace the statistical function inside the lambda passed to transform().
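
Here is a self-contained run of group-wise imputation on a tiny made-up dataframe. The data and the name `df_groups` are hypothetical; the column names follow the example above.

```python
import pandas as pd

# Hypothetical data: two groups with very different value ranges.
df_groups = pd.DataFrame({
    "Group": ["A", "A", "A", "B", "B"],
    "Value": [1.0, None, 3.0, 10.0, None],
})

# Each missing value is replaced by the mean of its own group.
df_groups["Value_filled"] = (
    df_groups.groupby("Group")["Value"].transform(lambda x: x.fillna(x.mean()))
)

print(df_groups)
```

The missing value in group A becomes 2.0 and the one in group B becomes 10.0, whereas a single overall mean would have assigned the same value to both rows.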

Interpolating Missing Values

Another option is to use interpolation to fill missing values based on existing values in the columns. The interpolate() method does this; by default it applies linear interpolation to numeric columns, treating the rows as equally spaced. For example:

df = df.interpolate()
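
To see what the default linear interpolation does, here is a minimal sketch on a made-up Series (the values are illustrative only):

```python
import numpy as np
import pandas as pd

# Hypothetical series with gaps between known values.
s = pd.Series([1.0, np.nan, 3.0, np.nan, np.nan, 6.0])

# Default method is linear: gaps are filled evenly between the neighbours.
filled = s.interpolate()
print(filled.tolist())
```

Interpolation makes the most sense for ordered data such as time series, where neighbouring rows are expected to hold similar values.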

Check Missing Values Again After Transformations

After performing missing value treatment steps, check again if there are any remaining null values in the dataframe to ensure that missing values have been properly handled.

# For null values:
df.isnull().sum()

# Or:
df.isna().sum()

#01Python - Valores ausentes (Nan | Null)

Ana Carolina Branco Neumann — Wed, 16 Aug 2023 15:59:48 +0000

Types of Data Absence

There are different patterns of data absence:

MCAR (Missing Completely At Random): The data's absence is entirely random and unrelated to any variable in the dataset. The probability of a value being missing is the same for all observations; it depends neither on observed values nor on the missing values themselves. As a result, analyzing only the complete cases doesn't introduce systematic bias, although it does reduce the sample size.
MAR (Missing At Random): The absence of data might be related to other observed variables but isn't related to the missing value itself. In other words, the probability of a value being missing may depend on the information available in other variables but, once those variables are taken into account, it doesn't depend on the actual value that's missing. As long as these variables are present in the dataset and the analysis or imputation accounts for them, no systematic bias is introduced.
MNAR (Missing Not At Random): The absence of data is related to the missing value itself, and this relationship can't be explained by other variables in the dataset. That is, the probability of a value being missing depends on the actual value that's missing, regardless of the other observed variables. Because missingness is tied to the unobserved values themselves, it introduces a systematic bias that the observed data alone can't correct.

It's important to understand the type of data absence when working with a dataset, as each type requires different treatment strategies or imputation methods to deal with missing values. Knowing the types of data absence can also help in correctly interpreting results and avoiding false or biased conclusions.

Note: "No systematic bias introduced" means that the absence of data doesn't affect the analysis in a biased or systematic manner. That is, the missing data doesn't consistently influence the results.

Identifying Null Values

To identify missing values, we use the isnull() method of pandas to check which values are null in the dataframe. For example:

df.isnull()

This command returns a dataframe with the same format as the original, but with boolean values indicating whether each element is null or not. The .isna() method can also be used, as described below.

Identifying NaN Values

NaN → Not a Number.

isna() is a pandas method that returns a boolean matrix indicating which elements are missing values (NaN) or null. It has the same functionality as the isnull() method.

The isna() function can be applied to an entire dataframe or to a specific series within the dataframe. When used on the entire dataframe, it returns a dataframe with the same shape as the input, where each element is replaced by True if it's a null value or False otherwise.

Here's an example of using the isna() method:

import pandas as pd

df = pd.DataFrame({'A': [1, 2, None], 'B': [3, None, 5]})
print(df.isna())

Counting Missing Values

To get an overview of missing values in each column, you can use the sum() method in combination with the isnull() method. For example:

df.isnull().sum()

This command returns the total number of missing values in each column. The same can be applied to the isna() command. For example:

df.isna().sum()

Visualization of Missing Data [Missingno]

The missingno library is a useful tool for visualizing missing data patterns in a dataframe. It generates charts that help identify patterns of missing values and understand the distribution of these values in a dataframe. Some of its key charts are:

Matrix Plot: Shows the presence or absence of values in each cell of the dataframe. Each row represents a sample or record, and each column represents a variable, with empty/blank cells indicating missing values. This makes patterns and correlations between missing values visible.
Bar Chart: Displays one bar per variable whose height is the number of non-missing values, so shorter bars indicate more missing data. It helps identify which variables have a significant number of missing values.
Heatmap: Shows the nullity correlation between pairs of variables, that is, how strongly the presence or absence of one variable is associated with the presence or absence of another. It's useful for spotting variables that tend to be missing together.

The missingno library also provides a dendrogram, which clusters variables by the similarity of their missing-value patterns.

Here's an example of using the missingno library:

import missingno as msno
import matplotlib.pyplot as plt

# Matrix plot
msno.matrix(df)
plt.show()

# Bar chart of missing values
msno.bar(df)
plt.show()

# Heatmap
msno.heatmap(df)
plt.show()

Setting a Threshold for Missing Values per Column

There's no standard or rule to determine the threshold of missing values per column, as it can vary depending on the problem, the nature of the data, and the analysis requirements.

However, there are some approaches to defining this threshold:

Percentage Threshold: You can set a percentage threshold, for example, allowing a column to have up to 5% (or any other value) of missing values. If the proportion of missing values in a column exceeds this threshold, actions like value imputation or column deletion can be taken.
Domain Analysis: Certain columns might be more critical and require fewer missing values, while others may have more leeway for data absence. For instance, in a dataframe of medical data, variables like age or sex might be considered essential, whereas other columns with more specific information might tolerate more missing values.
Impact on Results: Consider the impact of missing values on the final results. If a column contains critical information for the problem at hand or is necessary for constructing the result, it's advisable to have a lower threshold for missing values.

It's important to document the missing data treatment process, especially the decision about the missing values threshold.

Dealing with Missing Values

There are several strategies to handle missing values. Here are some common options:

Removing Rows or Columns

If the missing values are in a small number of rows or columns, you can choose to remove them. Use the dropna() method to do this. For example, to remove all rows containing at least one null value:

# Dropping rows with missing values:
df = df.dropna()

To remove columns with at least one null value, specify the axis=1 parameter:

# Dropping columns with missing values:
df = df.dropna(axis=1)

Below is an example that drops rows with missing values in columns where at most 5% of the values are missing, and then checks whether any columns still have missing values above that threshold:

# Setting threshold:
threshold = len(df)*0.05

# Columns from which to drop rows with missing values (missing count <= 5% of rows):
cols_to_drop_na = df.columns[df.isna().sum() <= threshold]

# Dropping rows with missing values in columns below the acceptable limit:
df.dropna(subset=cols_to_drop_na, inplace=True)

# Checking whether any columns still have missing values:
print(df.isna().sum())

# If so, we set those columns aside for later treatment:
cols_with_missing_values = df.columns[df.isna().sum() > 0]
print(cols_with_missing_values)

Substituindo valores ausentes por valores estatísticos

Se você preferir manter todas as linhas e colunas, pode preencher os valores ausentes com um valor específico. Use o método fillna() para preencher os valores nulos.
No exemplo abaixo, se preenche todos os valores nulos com ‘0’:

df = df.fillna(0)

You can also fill with other values, such as column statistics:

# Fill missing values with the column mean
# (numeric_only=True restricts the statistic to numeric columns;
# pandas 2.x raises an error otherwise when text columns are present)
df_filled_mean = df.fillna(df.mean(numeric_only=True))

# Fill missing values with the column median
df_filled_median = df.fillna(df.median(numeric_only=True))

# Fill missing values with the column standard deviation
df_filled_std = df.fillna(df.std(numeric_only=True))

# Fill missing values with the column mode (most frequent value)
df_filled_mode = df.apply(lambda col: col.fillna(col.mode().iloc[0]) if col.isna().any() else col)
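
Since the mean and median only make sense for numeric columns, one possible way to combine these fills is to treat numeric and non-numeric columns separately. The sketch below assumes a made-up dataframe (toy, with hypothetical columns age and plan) and is only one of several reasonable arrangements:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one numeric and one text column
toy = pd.DataFrame({
    "age": [20.0, np.nan, 40.0, 60.0],
    "plan": ["basic", "basic", None, "premium"],
})

filled = toy.copy()
numeric_cols = filled.select_dtypes(include="number").columns
other_cols = filled.columns.difference(numeric_cols)

# Numeric columns: fill with the column median
filled[numeric_cols] = filled[numeric_cols].fillna(filled[numeric_cols].median())

# Non-numeric columns: fill with the most frequent value (mode)
for col in other_cols:
    filled[col] = filled[col].fillna(filled[col].mode().iloc[0])

print(filled)
```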

Or more advanced statistics, such as the weighted median:

import numpy as np

# Define the weights for the weighted median: one weight per row of df.
# The values below are only an illustration and assume a 5-row dataframe.
weights = np.array([1, 2, 1, 3, 1])

# Function to compute the weighted median
def weighted_median(values, weights):
    values = np.asarray(values)
    weights = np.asarray(weights)
    sorted_indices = np.argsort(values)
    sorted_values = values[sorted_indices]
    sorted_weights = weights[sorted_indices]
    cumsum_weights = np.cumsum(sorted_weights)
    total_weight = cumsum_weights[-1]
    median_idx = np.argmax(cumsum_weights >= total_weight / 2)
    return sorted_values[median_idx]

# Fill missing values with the column's weighted median,
# using only the weights of the rows where the column is not null
df_filled_weighted_median = df.apply(
    lambda col: col.fillna(weighted_median(col[col.notna()], weights[col.notna().to_numpy()]))
    if col.isna().any() else col
)

Be aware, however, that filling missing values with other values can produce analysis charts that give a misleading picture, depending on how many values were filled.
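
One way to make this risk concrete is to compare the spread of a column before and after filling. In the small sketch below (the data is made up for illustration), half of the values are missing, and filling them with the mean visibly shrinks the standard deviation, which would make a histogram or boxplot look more concentrated than the real data:

```python
import numpy as np
import pandas as pd

# Hypothetical column where half of the values are missing
values = pd.Series([10.0, 20.0, 30.0, 40.0, 50.0] + [np.nan] * 5)

filled = values.fillna(values.mean())

std_before = values.std()  # missing values are ignored by default
std_after = filled.std()   # many identical filled values pull the spread down

print("Std before filling:", round(std_before, 2))
print("Std after filling: ", round(std_after, 2))
```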

Statistical measures by subgroup

Filling missing values with statistical measures segmented by subgroups within the dataframe is a useful technique when you want to impute missing values based on the specific characteristics of subsets of your data.
This takes the heterogeneity of the data into account and avoids the distortions that come from filling missing values with overall statistics.

Here is an example of how to fill missing values with a group mean:

# Compute the mean per group (for inspection)
group_means = df.groupby('Grupo')['Valor'].mean()
print(group_means)

# Fill missing values with the group mean
# (group_keys=False keeps the original row index in the result)
df_filled_segmented_mean = df.groupby('Grupo', group_keys=False)['Valor'].apply(lambda x: x.fillna(x.mean()))

print("Column filled with the mean segmented by group:\n", df_filled_segmented_mean)

This code can be adapted to other statistical measures, such as the median, standard deviation, or mode, among others. Just replace the statistical function used inside apply().
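
As an illustration of that adaptation, the sketch below fills with the group median instead. It uses transform() rather than apply(), which is an alternative that returns a result aligned with the original rows; the toy dataframe and the Valor_filled column name are assumptions made for the example:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with two groups
toy = pd.DataFrame({
    "Grupo": ["A", "A", "A", "B", "B", "B"],
    "Valor": [1.0, np.nan, 3.0, 10.0, 20.0, np.nan],
})

# transform() returns one value per original row: the median of that row's group
group_medians = toy.groupby("Grupo")["Valor"].transform("median")

# Fill each missing value with the median of its own group
toy["Valor_filled"] = toy["Valor"].fillna(group_medians)

print(toy)
```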

Interpolating missing values

Another option is to use interpolation to fill missing values based on the existing values in each column. The interpolate() method does this automatically. For example:

df = df.interpolate()
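
As a small sketch of what this does, the example below applies the default linear interpolation to two made-up series. It also shows a case worth keeping in mind: with the default settings, a missing value at the very start has no earlier point to interpolate from and stays missing:

```python
import numpy as np
import pandas as pd

# Gaps between known values are filled along a straight line
s = pd.Series([1.0, np.nan, np.nan, 4.0])
interpolated = s.interpolate()  # default method is "linear"
print(interpolated.tolist())

# A leading missing value is not filled with the default settings
leading = pd.Series([np.nan, 1.0, np.nan, 3.0])
leading_interpolated = leading.interpolate()
print(leading_interpolated.tolist())
```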

Checking for missing values again after transformations

After carrying out the missing-value treatment steps, check again whether any null values remain in the dataframe, to make sure the missing values were handled correctly.

# For null values:
df.isnull().sum()

# Or:
df.isna().sum()
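
If you would rather have the check fail loudly than read the counts by eye, one option is to reduce them to a single number and assert on it. This is only a sketch on a made-up dataframe (toy is an illustrative name):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing value
toy = pd.DataFrame({"a": [1.0, np.nan], "b": ["x", "y"]})

# Treat the missing value in column "a"
toy = toy.fillna({"a": 0.0})

# Total number of missing values across the whole dataframe
total_missing = int(toy.isna().sum().sum())
assert total_missing == 0, f"{total_missing} missing values remain"
print("Missing values remaining:", total_missing)
```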

#01QuickTips: Python

Ana Carolina Branco Neumann — Tue, 15 Aug 2023 14:06:51 +0000

In this first post, I'll provide some tips for beginners in the Python world, covering the main useful commands that are simple to use in exploratory analysis.

Comments:

Use the "#" character to start a comment on a line. Comments are useful for adding explanatory notes to your code and are not executed by the Python interpreter.

# This is a comment

Line Breaks:

To split code across multiple lines, you can use the backslash "\" at the end of each line or place the expression between parentheses, brackets, or braces.

# Using backslash
x = 10 + \
    20 + \
    30

# Using parentheses
y = (10 +
     20 +
     30)

# Using brackets
lista = [1, 2,
         3, 4]

# Using braces
dicionario = {'a': 1,
              'b': 2}

Indentation:

Python uses indentation to delimit code blocks, instead of using curly braces or special keywords. Make sure to maintain the same indentation within a block to avoid syntax errors.

# Example of indentation
if x > 0:
    print("x is positive")
    print("Still inside the block")
print("Outside the block")

Printing to the Screen:

Use the print() function to display messages or values on the standard output.

name = "Ana"
print("Hey there,", name)

x = 42
print("The secret value of x is:", x)

User Input:

Use the input() function to receive user input. Remember that the result of input() is always a string, so you might need to convert it to other types if necessary.

name = input("What's your name? ")
print("Welcome,", name)
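
Because input() always returns text, a number typed by the user has to be converted before you can do arithmetic with it. The sketch below shows one way to do that safely; parse_age is a hypothetical helper name, and the strings passed to it stand in for what input() would return:

```python
def parse_age(raw_text):
    """Convert text (such as the result of input()) to an int, or None if invalid."""
    try:
        return int(raw_text.strip())
    except ValueError:
        return None

# These strings simulate what a user might type
print(parse_age(" 42 "))
print(parse_age("forty-two"))
```

In a real script you could call it as parse_age(input("How old are you? ")).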

Assignment Operators:

Python provides several useful assignment operators to perform common operations in a single line.

x = 10  # Simple assignment
x += 5  # x = x + 5
x -= 3  # x = x - 3
x *= 2  # x = x * 2
x /= 4  # x = x / 4
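
A few other assignment operators follow the same pattern. The short sketch below also shows a detail that can surprise beginners: "/=" always produces a float, even when the operands are integers.

```python
x = 17
x //= 5  # floor division: x = x // 5
x **= 2  # exponentiation: x = x ** 2
x %= 4   # remainder: x = x % 4
print(x)

y = 10
y /= 4   # true division always yields a float
print(y, type(y))
```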

Importing Libraries:

Use the import keyword to import libraries and modules into your code. This allows you to access additional resources and functions provided by these libraries.

import math

x = math.sqrt(25)
print(x)

from datetime import datetime

now = datetime.now()
print(now)

Shape

Returns a tuple representing the number of rows followed by the number of columns in a dataset:

df.shape
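
Since the result is a tuple, it can be unpacked directly into two variables. A minimal sketch with a made-up dataframe (toy is an illustrative name):

```python
import pandas as pd

# Hypothetical toy data: 3 rows, 2 columns
toy = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})

n_rows, n_cols = toy.shape
print("Rows:", n_rows, "Columns:", n_cols)
```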

DataFrame Columns

Returns an Index object listing the labels of every column in a dataset. Very useful for recalling which columns make up the dataset after transformations, or even during exploratory analysis:

df.columns
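
For example, after renaming a column you can confirm the new layout by converting the result to a plain list. This sketch uses a made-up dataframe and column names:

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({"a": [1], "b": [2]})

# A transformation that changes the column labels
toy = toy.rename(columns={"b": "total"})

column_names = toy.columns.tolist()
print(column_names)
```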

#01DicasRápidas: Python

Ana Carolina Branco Neumann — Tue, 15 Aug 2023 13:36:52 +0000

In this first post, I'll give some tips for beginners in the Python world, covering the main commands that are useful, and very simple to use, in an exploratory analysis.

Comments:

Use the "#" character to start a comment on a line. Comments are useful for adding explanatory notes to your code and are not executed by the Python interpreter.

# This is a comment

Line breaks:

To split code across multiple lines, you can use the backslash "\" at the end of each line or place the expression between parentheses, brackets, or braces.

# Using a backslash
x = 10 + \
    20 + \
    30

# Using parentheses
y = (10 +
     20 +
     30)

# Using brackets
lista = [1, 2,
         3, 4]

# Using braces
dicionario = {'a': 1,
              'b': 2}

Indentation:

Python uses indentation to delimit code blocks, instead of curly braces or special keywords. Make sure to keep the same indentation within a block to avoid syntax errors.

# Example of indentation
if x > 0:
    print("x is positive")
    print("Still inside the block")
print("Outside the block")

Printing to the screen:

Use the print() function to display messages or values on the standard output.

nome = "Ana"
print("Hey,", nome)

x = 42
print("The secret value of x is:", x)

User input:

Use the input() function to receive input from the user. Remember that the result of input() is always a string, so you may need to convert it to other types.

nome = input("What's your name? ")
print("Welcome,", nome)

Assignment operators:

Python offers several useful assignment operators for performing common operations in a single line.

x = 10  # Simple assignment
x += 5  # x = x + 5
x -= 3  # x = x - 3
x *= 2  # x = x * 2
x /= 4  # x = x / 4

Importing libraries:

Use the import keyword to import libraries and modules into your code. This lets you access additional resources and functions provided by those libraries.

import math

x = math.sqrt(25)
print(x)

from datetime import datetime

agora = datetime.now()
print(agora)

Shape

Returns a tuple with the number of rows followed by the number of columns in a dataset:

df.shape

DataFrame columns

Returns the labels of all the columns that make up a dataset. Very useful for recalling which columns the dataset contains after transformations, or even in the middle of an exploratory analysis:

df.columns