<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Kushal</title>
    <description>The latest articles on DEV Community by Kushal (@kushal_).</description>
    <link>https://dev.to/kushal_</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F329513%2Fd0ac187a-26cb-40ca-a2ef-2ee9f4430688.png</url>
      <title>DEV Community: Kushal</title>
      <link>https://dev.to/kushal_</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/kushal_"/>
    <language>en</language>
    <item>
      <title>Prompt Engineering - Part 1</title>
      <dc:creator>Kushal</dc:creator>
      <pubDate>Mon, 05 Jun 2023 10:23:09 +0000</pubDate>
      <link>https://dev.to/kushal_/prompt-engineering-part-1-29i</link>
      <guid>https://dev.to/kushal_/prompt-engineering-part-1-29i</guid>
      <description>&lt;p&gt;In this article, I will provide a comprehensive tutorial on prompt engineering, highlighting how to achieve the best and optimal results from Large Language Models (LLMs) such as OpenAI's ChatGPT. &lt;/p&gt;

&lt;p&gt;Prompt engineering has gained significant popularity and widespread usage since the advent of LLMs, leading to a revolution in the field of Natural Language Processing (NLP). The beauty of prompt engineering lies in its versatility, allowing professionals from diverse backgrounds to effectively utilize it and maximize the potential of LLMs.&lt;/p&gt;

&lt;h2&gt;
  Basic Working of ChatGPT
&lt;/h2&gt;

&lt;p&gt;ChatGPT works with the concept of "assistant" and "user" roles to facilitate interactive conversations. The model operates in a back-and-forth manner, where the user provides input or messages, and the assistant responds accordingly.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhore0g8pozl9y6ct49ja.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhore0g8pozl9y6ct49ja.png" alt="Internal-working-LLM"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The user role represents the individual engaging in the conversation. As a user, you can provide instructions, queries, or any text-based input to the model - which forms the &lt;strong&gt;prompt&lt;/strong&gt; to the model.&lt;/p&gt;

&lt;p&gt;The assistant role refers to the AI language model itself, which is designed to generate responses based on the user's input. &lt;br&gt;
The model processes the &lt;em&gt;conversation history&lt;/em&gt;, including both user and assistant messages, to generate a relevant and coherent response. It takes into account the context and information provided in the conversation history to generate more accurate and appropriate replies.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7vfwbpukkr67mjxqjnkq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7vfwbpukkr67mjxqjnkq.png" alt="user-assistant-roles"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The conversation typically starts with a system message that sets the &lt;em&gt;behavior of the assistant&lt;/em&gt;, followed by alternating user and assistant messages. By maintaining a conversational context, the model can generate more consistent and context-aware responses.&lt;/p&gt;

&lt;p&gt;To maintain the &lt;em&gt;context&lt;/em&gt;, it is important to include the relevant conversation history when interacting with the model. This ensures that the model understands the ongoing conversation and can provide appropriate responses based on the given context.&lt;/p&gt;
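&lt;p&gt;As a minimal sketch (illustrative only, not the article's code), the conversation history is just a growing list of role-tagged messages that is resent with every request:&lt;/p&gt;

```python
# Illustrative sketch: conversation history as a list of role-tagged
# messages. The whole list is sent with each request, which is how the
# model "remembers" earlier turns.
history = [
    {"role": "system", "content": "You are a helpful assistant."},
]

def add_turn(history, role, content):
    """Append one message to the running conversation history."""
    history.append({"role": role, "content": content})
    return history

add_turn(history, "user", "What is prompt engineering?")
add_turn(history, "assistant", "It is the craft of writing effective prompts.")
add_turn(history, "user", "Give me one tip.")  # the model sees all prior turns

roles = [m["role"] for m in history]
print(roles)  # ['system', 'user', 'assistant', 'user']
```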

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftccn3p00h9hizug1d5k4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftccn3p00h9hizug1d5k4.png" alt="context-history"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h2&gt;
  Template for Prompt Usage
&lt;/h2&gt;

&lt;p&gt;In this section, we will write the boilerplate code that will form the basis for all our tasks.&lt;br&gt;
To begin with, we need to generate a &lt;em&gt;secret key&lt;/em&gt; from our OpenAI account.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="err"&gt;!&lt;/span&gt;&lt;span class="n"&gt;pip&lt;/span&gt; &lt;span class="n"&gt;install&lt;/span&gt; &lt;span class="n"&gt;openai&lt;/span&gt;

&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;openai&lt;/span&gt; 

&lt;span class="c1"&gt;# Generating Secret Key from your OpenAI Account
&lt;/span&gt;&lt;span class="n"&gt;openai&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;api_key&lt;/span&gt;  &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;&amp;lt;OPEN-AI-SECRET-KEY&amp;gt;&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;


&lt;span class="c1"&gt;# Template function 
&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get_completion&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gpt-3.5-turbo&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;temperature&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;messages&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;system&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;You are the assistant.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
                &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;}]&lt;/span&gt;
    &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;openai&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ChatCompletion&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;temperature&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;temperature&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="c1"&gt;# this is the degree of randomness of the model's output
&lt;/span&gt;    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;reply&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt;  &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;choices&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;reply&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The function has three inputs:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Prompt&lt;/li&gt;
&lt;li&gt;Model (defaults to &lt;code&gt;gpt-3.5-turbo&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;Temperature &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We have covered the model and the role of the "user." Now, let's move on to the next two inputs: prompt and temperature.&lt;/p&gt;

&lt;p&gt;Prompt refers to the text input provided to the model, which serves as a guiding instruction or query for generating a response.&lt;/p&gt;

&lt;p&gt;Temperature, on the other hand, is a hyperparameter that plays a crucial role in determining the behavior of the model's output. Not to be confused with its real-world connotation, this metric controls the level of randomness in the generated responses. By adjusting the temperature value, we can influence the model's output.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;When the &lt;code&gt;temperature&lt;/code&gt; is set to a higher value, the model produces more diverse and creative responses. Conversely, lower temperature values make the model more focused and deterministic, often resulting in more precise but potentially less varied outputs.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Choosing an appropriate temperature depends on the specific task and desired output.&lt;/p&gt;
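&lt;p&gt;To build intuition, temperature can be thought of as rescaling the model's token scores before sampling. The following plain-Python sketch is illustrative only, not the actual OpenAI implementation:&lt;/p&gt;

```python
import math

def apply_temperature(logits, temperature):
    """Convert raw token scores into a sampling distribution; lower
    temperature sharpens it, higher temperature flattens it."""
    if temperature == 0:
        # temperature 0 is treated as greedy decoding: all mass on the argmax
        probs = [0.0] * len(logits)
        probs[max(range(len(logits)), key=lambda i: logits[i])] = 1.0
        return probs
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]  # numerically stable softmax
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5]
print(apply_temperature(logits, 0))    # greedy: [1.0, 0.0, 0.0]
sharp = apply_temperature(logits, 0.5)
flat = apply_temperature(logits, 2.0)
# the top token's share shrinks as temperature rises
print(sharp[0] > flat[0])              # True
```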

&lt;h2&gt;
  Basics of Prompting
&lt;/h2&gt;

&lt;p&gt;ChatGPT can perform a plethora of tasks, such as &lt;em&gt;Text Summarisation, Information Extraction, Question Answering, Text Classification, Sentiment Analysis, and Code Generation&lt;/em&gt;, to name a few. &lt;br&gt;
Prompts can be designed to undertake single or multiple tasks depending on the use-case.&lt;/p&gt;

&lt;p&gt;In the example below, we showcase a basic text summarisation task performed by GPT.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;prod_desc&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
3 MODES &amp;amp; ROTATABLE NOZZLE DESIGN- This portable oral irrigator comes with Normal, Soft and Pulse modes which are best for professional use. The 360° rotatable jet tip design allows easy cleaning helping prevent tooth decay, dental plaque, dental calculus, gingival bleeding and dental hypersensitivity.
DUAL WATERPROOF DESIGN- The IPX7 waterproof design is adopted both internally and externally to provide dual protection. The intelligent ANTI-LEAK design prevents leakage and allows the dental flosser to be used safely under the running water.
UPGRADED 300 ML LARGE CAPACITY WATER TANK- The new water tank is the largest capacity tank available and provides continuous flossing for an entire session. The removable full-opening design allows thorough cleaning thus preventing formation of bacteria and limescale deposits.
CORDLESS &amp;amp; QUALITY ASSURANCE- Cordless and lightweight power irrigator comes with a powerful battery that lasts upto 14 days on a single charge
RECHARGEABLE &amp;amp; QUALITY ASSURANCE- Cordless and lightweight power irrigator comes with a powerful battery that lasts upto 14 days on a single charge
&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;


&lt;span class="n"&gt;prompt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
Your task is to generate a short summary of a product &lt;/span&gt;&lt;span class="se"&gt;\
&lt;/span&gt;&lt;span class="s"&gt;description from an ecommerce site. 

Summarize the description below, delimited by tags, in at most 50 words. 

Review: &amp;lt;tag&amp;gt;&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;prod_desc&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;&amp;lt;/tag&amp;gt;

Output should be in JSON format with &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;summary&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt; as key.
&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;get_completion&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Output:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;{&lt;br&gt;
    "summary": "This portable oral irrigator has 3 modes and a rotatable nozzle design for easy cleaning. It has a dual waterproof design and a large 300ml water tank. It is cordless, rechargeable, and comes with a powerful battery that lasts up to 14 days on a single charge."&lt;br&gt;
} &lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Some key points to take away from the above code:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;When constructing a prompt, it's important to provide &lt;strong&gt;clear and specific&lt;/strong&gt; instructions to guide the model's behavior. This helps ensure that the generated output aligns with your desired outcome.&lt;/li&gt;
&lt;li&gt;Delimit input data: to differentiate the prompt's instructions from the actual input data, use delimiters. Delimiters can take various forms, such as quotation marks (" "), angle brackets (&amp;lt; &amp;gt;), XML-style tags (&amp;lt;tag&amp;gt;&amp;lt;/tag&amp;gt;), colons (:::), or triple backticks. A delimiter creates a clear boundary that helps the model parse the prompt. &lt;/li&gt;
&lt;li&gt;Request structured output: if your task requires a specific format or structure for the model's response, mention it explicitly in the prompt. This makes the output easy to consume programmatically.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Here is a more detailed breakdown of the above prompt.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Elements of Prompts&lt;/th&gt;
&lt;th&gt;Breakdown&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Instruction&lt;/td&gt;
&lt;td&gt;To generate a short summary of a product description.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Task&lt;/td&gt;
&lt;td&gt;Summarize the description&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Task Constraints&lt;/td&gt;
&lt;td&gt;At most 50 words.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Input Data Delimiter&lt;/td&gt;
&lt;td&gt;&lt;code&gt;&amp;lt;tag&amp;gt; &amp;lt;/tag&amp;gt;&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Output Format&lt;/td&gt;
&lt;td&gt;JSON&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;blockquote&gt;
&lt;p&gt;The key elements of a prompt are: Instruction, Task, Constraints, Output Indicator, and Input Data. &lt;/p&gt;
&lt;/blockquote&gt;
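&lt;p&gt;These elements can be composed into a small helper. The function below is a hypothetical sketch (not from the article) that assembles a prompt from its key elements, using the triple-colon delimiter style that also appears later in this article:&lt;/p&gt;

```python
def build_prompt(instruction, task, constraints, input_data, output_format):
    """Assemble a prompt from its key elements; the input data is delimited
    by triple colons so the model can separate it from the instructions."""
    return (
        f"{instruction}\n\n"
        f"{task}, {constraints}.\n\n"
        f"Return the output as {output_format}.\n\n"
        f"Input: :::{input_data}:::"
    )

p = build_prompt(
    instruction="Your task is to summarize a product description.",
    task="Summarize the description below",
    constraints="in at most 50 words",
    input_data="A cordless water flosser with three modes.",
    output_format="JSON with 'summary' as key",
)
print(p)
```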

&lt;h2&gt;
  Multi-tasking Prompts
&lt;/h2&gt;

&lt;p&gt;Consider a scenario where you are presented with a text and need to perform sentiment analysis, summarize the content, and extract topics from it.&lt;br&gt;
In the pre-LLM era, accomplishing these tasks would typically involve training separate models for each task or relying on pre-trained models. However, with the advent of LLMs like ChatGPT, all of these tasks can now be executed efficiently using a single prompt. This eliminates the need for multiple specialized models and streamlines the workflow.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

review = f""" Writing this review after using it for a couple of months now. It can take some time to get used to since the water jet is quite powerful. It might take you a couple of tries to get comfortable with some modes. Start with the teeth, get comfortable and then move on to the gums. Some folks may experience sensitivity. I experienced it for a day or so and then went away.
It effectively knocks off debris from between the teeth especially the hard to get like the fibrous ones. I haven't seen much difference in the tartar though. Hopefully, with time, it gets rid of it too.
There are 3 modes of usage: normal, soft and pulse. I started with soft then graduated to pulse and now use normal mode. For the ones who are not sure, soft mode is safe as it doesn't hit hard. Once you get used to the technique of holding and using the product, you could start experimenting with the other modes and choose the one that best suits you.
One time usage of the water full of tank should usually be sufficient if your teeth are relatively clean. If, however, you have hard to reach spaces with buildup etc. it might require a refill for a usage.
If you don't refill at all, one time full recharge of the battery in normal mode will last you 4 days with maximum strength of the water jet. If you refill it once, it'll last you 2 days after which the strength of the water jet reduces.
As for folks who are worried about the charging point getting wet, I accidentally used it once without the plug for the charging point and yet it worked fine and had no issues. Ideally keep the charging point covered with the plug provided with the product.
It has 2 jet heads (pink and blue) and hence the product can be used by 2 people as long as it's used hygienically. For charging, it comes with a USB cable without the adapter which shouldn't be an issue as your phone adapter should do the job.
I typically wash the product after every usage as the used water tends to run on the product during usage.
One issue I see is that the clasp for the water tank could break accidentally if not handled properly which will render the tank useless. So ensure to not keep it open unless you are filling the tank.
"""


prompt = f"""
Your task is to provide insights for the product review \
on an e-commerce website, which is delimited by \
triple colons.

Perform the following tasks:
1. Identify the product.
2. Summarize the product review, in up to 50 words.
3. Analyze the sentiment of the review - positive/negative/neutral
4. Extract topics that the user didn't like about the product.
5. Identify the name of the company; if not mentioned, output "not mentioned"

Use the following format:
1. Product - &amp;lt;product&amp;gt;
2. Summary - &amp;lt;summary&amp;gt;
3. Sentiment - &amp;lt;user_sentiment&amp;gt;
4. Topics - &amp;lt;negative_topics&amp;gt;
5. Company - &amp;lt;company&amp;gt;
Use JSON format for the output.

Product review: :::{review}:::
"""


response = get_completion(prompt)
print(response)



&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Output: &lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;{&lt;br&gt;
    "Product": "Water Flosser",&lt;br&gt;
    "Summary": "The water flosser is effective in removing debris from between teeth, but may take some time to get used to. It has 3 modes of usage and a full tank can last for one usage. The charging point should be covered with the provided plug. The clasp for the water tank could break if not handled properly.",&lt;br&gt;
    "Sentiment": "Neutral",&lt;br&gt;
    "Topics": "Difficulty in getting used to the product, sensitivity, no significant difference in tartar removal, clasp for water tank could break",&lt;br&gt;
    "Company": "not mentioned"&lt;br&gt;
}&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;From the aforementioned example, we can infer that by explicitly listing the tasks and providing a structured format, we enable ChatGPT to understand and address each task individually. &lt;/p&gt;

&lt;p&gt;Furthermore, we can enhance the prompt by including specific conditions or instructions for each task. This allows for a more tailored and accurate response from ChatGPT, as it can take into account the unique requirements and constraints of each task.&lt;/p&gt;
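&lt;p&gt;One practical benefit of requesting JSON is that the reply can be consumed directly in code. A sketch, assuming the model returned valid JSON shaped like the output above:&lt;/p&gt;

```python
import json

# Hypothetical model reply, shaped like the output shown above.
reply = '{"Product": "Water Flosser", "Sentiment": "Neutral", "Company": "not mentioned"}'

data = json.loads(reply)  # raises json.JSONDecodeError if the model strayed from JSON
if data["Sentiment"] == "Negative":
    print("route to support team:", data["Product"])
else:
    print("log and move on:", data["Product"])
```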

&lt;h2&gt;
  Iterative Prompt Development
&lt;/h2&gt;

&lt;p&gt;As we reach the final section of the article, it's crucial to acknowledge that designing and crafting prompts is similar to optimizing or selecting ML models. It is an iterative process, although typically simpler and less complex. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjevqlq3kdcrq8qmgantx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjevqlq3kdcrq8qmgantx.png" alt="iterative_prompt"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Creating effective prompts requires experimentation, observation, and continuous refinement. It's important to iterate and fine-tune the prompts based on the desired output required by the use-cases. &lt;/p&gt;

&lt;h2&gt;
  End-notes
&lt;/h2&gt;

&lt;p&gt;In Part 1 of this series, we provided a brief introduction to the foundations of prompt engineering that can get you started on building your own applications. &lt;br&gt;
As we move forward, subsequent parts will delve into various techniques and concepts, including LangChain and chatbots.&lt;/p&gt;

</description>
      <category>openai</category>
      <category>chatgpt</category>
      <category>datascience</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>TensorFlow Model Deployment using FastAPI &amp; Docker</title>
      <dc:creator>Kushal</dc:creator>
      <pubDate>Fri, 02 Apr 2021 17:10:27 +0000</pubDate>
      <link>https://dev.to/kushal_/tensorflow-model-deployment-using-fastapi-docker-4183</link>
      <guid>https://dev.to/kushal_/tensorflow-model-deployment-using-fastapi-docker-4183</guid>
      <description>&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fimg.shields.io%2Fbadge%2FTensorFlow%2520-%2523FF6F00.svg%3F%26style%3Dfor-the-badge%26logo%3DTensorFlow%26logoColor%3Dwhite" class="article-body-image-wrapper"&gt;&lt;img alt="TensorFlow" src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fimg.shields.io%2Fbadge%2FTensorFlow%2520-%2523FF6F00.svg%3F%26style%3Dfor-the-badge%26logo%3DTensorFlow%26logoColor%3Dwhite"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fimg.shields.io%2Fbadge%2Fdocker%2520-%25230db7ed.svg%3F%26style%3Dfor-the-badge%26logo%3Ddocker%26logoColor%3Dwhite" class="article-body-image-wrapper"&gt;&lt;img alt="Docker" src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fimg.shields.io%2Fbadge%2Fdocker%2520-%25230db7ed.svg%3F%26style%3Dfor-the-badge%26logo%3Ddocker%26logoColor%3Dwhite"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fimg.shields.io%2Fbadge%2Fpython%2520-%252314354C.svg%3F%26style%3Dfor-the-badge%26logo%3Dpython%26logoColor%3Dwhite" class="article-body-image-wrapper"&gt;&lt;img alt="Python" src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fimg.shields.io%2Fbadge%2Fpython%2520-%252314354C.svg%3F%26style%3Dfor-the-badge%26logo%3Dpython%26logoColor%3Dwhite"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;TL; DR:&lt;/strong&gt; &lt;br&gt;
In this article, we are going to build a &lt;em&gt;TensorFlow (v2)&lt;/em&gt; model, create REST API calls to predict from it using FastAPI, and finally containerize it using &lt;em&gt;Docker&lt;/em&gt; 😃&lt;/p&gt;

&lt;p&gt;I want to emphasize the usage of &lt;strong&gt;FastAPI&lt;/strong&gt; and how this framework is a game-changer for building simple and much faster API calls for a machine learning pipeline.&lt;br&gt;
Traditionally, we have been using Flask microservices for building REST APIs, but the process involves a fair amount of nitty-gritty to understand and implement the framework.&lt;br&gt;
On the other hand, I found FastAPI to be pretty user-friendly and very easy to pick up and implement.&lt;/p&gt;

&lt;p&gt;And finally, from one game-changer to another: &lt;strong&gt;Docker&lt;/strong&gt;. &lt;br&gt;
As data scientists, our role is vague and keeps changing year in, year out. Some skills get added, others become obsolete, but Docker has made its mark as one of the most important and sought-after skills in the market. &lt;em&gt;Docker&lt;/em&gt; gives us the ability to containerize a solution with all its dependent software and requirements.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Data&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;We have used a text classification problem : &lt;a href="https://www.kaggle.com/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews" rel="noopener noreferrer"&gt;IMDb Dataset&lt;/a&gt; for the purpose of building the model.&lt;/p&gt;

&lt;p&gt;The dataset comprises 50,000 reviews of movies and is a binary classification problem with the target variable being a &lt;em&gt;sentiment&lt;/em&gt;: positive or negative.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Preprocessing&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;We use TensorFlow's &lt;em&gt;TextVectorization&lt;/em&gt; layer, which handles the text preprocessing and outputs a layer that we will use when building the graph of a Sequential or Functional model.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;VOCAB_SIZE&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;5000&lt;/span&gt;
&lt;span class="n"&gt;encoder&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;keras&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;layers&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;experimental&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;preprocessing&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;TextVectorization&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;max_tokens&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;VOCAB_SIZE&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;standardize&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;lower_and_strip_punctuation&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;output_mode&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;int&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;output_sequence_length&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;200&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;encoder&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;adapt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;train_dataset&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;map&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;lambda&lt;/span&gt; &lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;label&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;We can go for &lt;em&gt;custom standardization&lt;/em&gt; by writing a function for our own use case, but there are some bugs in tf:2.4.1 that cause trouble when creating a REST API call for the model.&lt;/p&gt;
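&lt;p&gt;For reference, here is a pure-Python sketch of the kind of custom standardization you might write. In TensorFlow, the same steps would be expressed with &lt;code&gt;tf.strings&lt;/code&gt; ops and passed via the &lt;code&gt;standardize&lt;/code&gt; argument of &lt;em&gt;TextVectorization&lt;/em&gt;; this plain version only illustrates the cleaning logic:&lt;/p&gt;

```python
import re
import string

# "\x3c" and "\x3e" are the angle-bracket characters, written as escapes;
# the pattern matches the HTML line-break tags found in raw IMDb reviews.
BREAK_TAG = re.compile("\x3c" + r"br\s*/?" + "\x3e")
PUNCT = re.compile("[%s]" % re.escape(string.punctuation))

def custom_standardization(text):
    """Lowercase, drop HTML break tags, and strip punctuation."""
    text = text.lower()
    text = BREAK_TAG.sub(" ", text)
    text = PUNCT.sub("", text)
    return text

print(custom_standardization("Great movie!\x3cbr /\x3eWould watch again."))
# great movie would watch again
```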

&lt;p&gt;&lt;strong&gt;Model&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;As we can see below, we are using the &lt;em&gt;encoder&lt;/em&gt; layer on top of an &lt;em&gt;Embedding&lt;/em&gt; layer that outputs a 256-dimension vector.&lt;br&gt;
The rest of the graph is self-explanatory, although we are producing a &lt;em&gt;probabilistic output instead of a 2-class softmax layer&lt;/em&gt;: the closer the probability is to 1, the more positive the sentiment of the review, and vice versa.&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;
&lt;span class="c1"&gt;# Creating the model
&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;keras&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Sequential&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;
    &lt;span class="n"&gt;encoder&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="nc"&gt;Embedding&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;input_dim&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;encoder&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_vocabulary&lt;/span&gt;&lt;span class="p"&gt;()),&lt;/span&gt; &lt;span class="n"&gt;output_dim&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;256&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;mask_zero&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="nc"&gt;GlobalAveragePooling1D&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
    &lt;span class="nc"&gt;Dropout&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;0.2&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="nc"&gt;Dense&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;128&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;activation&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;relu&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="nc"&gt;Dropout&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;0.2&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="nc"&gt;Dense&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;activation&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;sigmoid&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)])&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;After initialising the graph, we compile and fit the model:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Compiling the model
&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;compile&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;optimizer&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;keras&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;optimizers&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Adam&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;learning_rate&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;0.001&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; 
              &lt;span class="n"&gt;loss&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;keras&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;losses&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;BinaryCrossentropy&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;from_logits&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;metrics&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;accuracy&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;


&lt;span class="c1"&gt;# Training the model
&lt;/span&gt;&lt;span class="n"&gt;history&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;fit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;train_dataset&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;epochs&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                    &lt;span class="n"&gt;validation_data&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;test_dataset&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
                    &lt;span class="n"&gt;validation_steps&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;&lt;strong&gt;Evaluation&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;After model training, we evaluate the model on the test dataset and get a reasonably satisfactory test accuracy of 86.2%.&lt;br&gt;
(Our major focus here is the API &amp;amp; Docker, not squeezing the best possible performance out of the model for this scenario.)&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Evaluating the model on test dataset
&lt;/span&gt;&lt;span class="n"&gt;loss&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;accuracy&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;evaluate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;test_dataset&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Loss: &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;loss&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Accuracy: &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;accuracy&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Output:
&lt;/span&gt;&lt;span class="n"&gt;Loss&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  &lt;span class="mf"&gt;0.3457462191581726&lt;/span&gt;
&lt;span class="n"&gt;Accuracy&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  &lt;span class="mf"&gt;0.8626000285148621&lt;/span&gt;


&lt;span class="c1"&gt;# Saving the model
&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;save&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;tf_keras_imdb/&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;In TensorFlow, we can save the model in two formats: the SavedModel ('tf') format or the HDF5 ('.h5') format. Our model cannot be saved in '.h5' format because it uses the &lt;em&gt;TextVectorization&lt;/em&gt; layer, which the HDF5 format does not support.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;FastAPI&lt;/strong&gt;&lt;br&gt;
Before we start creating APIs, we need a particular directory structure that will be utilized for creating a Docker image.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;tf_keras_imdb/&lt;/em&gt; : SavedModel from TensorFlow&lt;br&gt;
main.py : Python file that creates the REST API using the FastAPI framework&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;|
|--- model
|    |______ tf_keras_imdb/
|
|--- app
|    |_______ main.py
|
|--- Dockerfile
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Whenever we build an API using FastAPI, we use &lt;em&gt;pydantic&lt;/em&gt; to declare the type of input our API expects: for example, a list, dictionary, JSON object, string, integer, or float.&lt;/p&gt;

&lt;p&gt;To create such a schema with pydantic, we subclass &lt;em&gt;BaseModel&lt;/em&gt;, which defines the types of our inputs.&lt;/p&gt;
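&lt;p&gt;As a brief illustration (the class and field names here are hypothetical, not taken from the article's repository), the request body for a sentiment API could be declared like this:&lt;/p&gt;

```python
# Hypothetical pydantic schema for the review text our API expects
from pydantic import BaseModel

class Review(BaseModel):
    text: str  # the raw movie review to classify

# FastAPI parses and validates incoming JSON against this schema
review = Review(text="A surprisingly heartfelt film.")
print(review.text)
```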

&lt;p&gt;One of the reasons FastAPI is faster and more efficient is its use of ASGI (Asynchronous Server Gateway Interface) instead of the traditional WSGI (Web Server Gateway Interface) used by Flask and Django.&lt;/p&gt;

&lt;p&gt;A POST request is assigned to our prediction endpoint, since it requires us to &lt;em&gt;post&lt;/em&gt; the data and fetch back the results.&lt;/p&gt;

&lt;p&gt;Uvicorn is a lightning-fast ASGI server implementation; it runs a server on our host machine and serves the API that hosts the model.&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;



&lt;p&gt;We can test our API on SwaggerUI:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2774y8yzfl8yo2eoct8n.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2774y8yzfl8yo2eoct8n.png" alt="SwaggerUI"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Docker&lt;/strong&gt;&lt;br&gt;
Finally, to wrap it all up, we create a &lt;em&gt;Dockerfile&lt;/em&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight docker"&gt;&lt;code&gt;&lt;span class="k"&gt;FROM&lt;/span&gt;&lt;span class="s"&gt; tiangolo/uvicorn-gunicorn-fastapi:python3.7&lt;/span&gt;

&lt;span class="k"&gt;RUN &lt;/span&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;&lt;span class="nv"&gt;tensorflow&lt;/span&gt;&lt;span class="o"&gt;==&lt;/span&gt;2.4.1

&lt;span class="k"&gt;COPY&lt;/span&gt;&lt;span class="s"&gt; ./model /model/&lt;/span&gt;

&lt;span class="k"&gt;COPY&lt;/span&gt;&lt;span class="s"&gt; ./app /app&lt;/span&gt;

&lt;span class="k"&gt;EXPOSE&lt;/span&gt;&lt;span class="s"&gt; 8000&lt;/span&gt;

&lt;span class="k"&gt;CMD&lt;/span&gt;&lt;span class="s"&gt; ["python", "main.py"]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We base our image on &lt;em&gt;tiangolo/uvicorn-gunicorn-fastapi&lt;/em&gt;, a public image on Docker Hub, which makes quick work of creating a Docker image around our own functionality.&lt;/p&gt;

&lt;p&gt;To build the Docker image and run it, we execute the following commands, and voila!&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker build &lt;span class="nt"&gt;-t&lt;/span&gt; api &lt;span class="nb"&gt;.&lt;/span&gt;

docker run &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="nt"&gt;-p&lt;/span&gt; 8000:8000 api
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Conclusion&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;After working through FastAPI and Docker, I feel this skillset is an essential part of a data scientist's toolkit. Building around a model and deploying it has become easier and much more accessible than it was before.&lt;/p&gt;

&lt;p&gt;Github Link: &lt;a href="https://github.com/kushalvala/fastapi-nlp" rel="noopener noreferrer"&gt;https://github.com/kushalvala/fastapi-nlp&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Kushal Vala&lt;br&gt;
Data Scientist&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>deeplearning</category>
      <category>fastapi</category>
      <category>docker</category>
    </item>
    <item>
      <title>Data and Sampling Distributions- II</title>
      <dc:creator>Kushal</dc:creator>
      <pubDate>Thu, 16 Jul 2020 13:32:31 +0000</pubDate>
      <link>https://dev.to/kushal_/data-and-sampling-distributions-ii-32ph</link>
      <guid>https://dev.to/kushal_/data-and-sampling-distributions-ii-32ph</guid>
<description>&lt;p&gt;At the end of Part-I, we talked about how to calculate an estimate for the Standard Error of a Statistic. We will continue that discussion here.&lt;/p&gt;

&lt;h4&gt;
  
  
  The Bootstrap
&lt;/h4&gt;

&lt;p&gt;One easy and effective way to estimate the sampling distribution of a statistic, or of model parameters, is to draw additional samples, &lt;em&gt;with replacement&lt;/em&gt;, from the sample itself and recalculate the statistic or model for each resample. This procedure is called the &lt;em&gt;bootstrap&lt;/em&gt;, and it does not necessarily involve any assumptions about the data or the sample statistic being normally distributed.&lt;/p&gt;

&lt;p&gt;Conceptually, you can imagine the bootstrap as replicating the original sample thousands or millions of times so that you have a &lt;em&gt;hypothetical population&lt;/em&gt; that embodies all the knowledge from your original sample.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--e1IGSCe7--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/i/u5zxww2u3pr4na1muoa0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--e1IGSCe7--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/i/u5zxww2u3pr4na1muoa0.png" alt="Bootstrap" width="800" height="252"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In practice, it is not necessary to actually replicate the sample a huge number of times. We simply replace each observation after each draw, i.e., we &lt;em&gt;sample with replacement&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;The algorithm for bootstrap resampling of the mean for a sample size of &lt;em&gt;n&lt;/em&gt; is as follows:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Draw a sample value, record it, and then replace it.&lt;/li&gt;
&lt;li&gt;Repeat &lt;em&gt;n&lt;/em&gt; times.&lt;/li&gt;
&lt;li&gt;Record the mean of the &lt;em&gt;n&lt;/em&gt; resampled values.&lt;/li&gt;
&lt;li&gt;Repeat steps 1-3 &lt;em&gt;R&lt;/em&gt; times.&lt;/li&gt;
&lt;li&gt;Use the &lt;em&gt;R&lt;/em&gt; results to:

&lt;ol&gt;
&lt;li&gt;Calculate their standard deviation (this estimates the standard error of the sample mean).&lt;/li&gt;
&lt;li&gt;Produce a boxplot or histogram.&lt;/li&gt;
&lt;li&gt;Find a &lt;em&gt;Confidence Interval&lt;/em&gt;.
&lt;/li&gt;
&lt;/ol&gt;


&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The number of bootstrap iterations, &lt;em&gt;R&lt;/em&gt;, is set arbitrarily: the more iterations, the more accurate the estimate of the standard error.&lt;/p&gt;

&lt;p&gt;From the previous dataset of &lt;em&gt;Red Wine Quality Estimation&lt;/em&gt;, we are taking &lt;em&gt;Total Sulfur Dioxide&lt;/em&gt; as a key feature to calculate the &lt;em&gt;bias&lt;/em&gt; and an estimate of &lt;em&gt;standard error&lt;/em&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;sklearn.utils&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;resample&lt;/span&gt;
&lt;span class="n"&gt;boot_sample&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1000&lt;/span&gt;
&lt;span class="n"&gt;results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;nrepeat&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
  &lt;span class="n"&gt;sample&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;resample&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;total sulfur dioxide&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;replace&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;n_samples&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;boot_sample&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sample&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;mean&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;

&lt;span class="n"&gt;results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Series&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Bootstrap Statistics:&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Original Population Size : &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;total sulfur dioxide&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;shape&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Bootstrap Sample Size : &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;boot_sample&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Original: &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;total sulfur dioxide&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;median&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Bias: &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;mean&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;total sulfur dioxide&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;mean&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Standard Error: &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;std&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;

&lt;span class="c1"&gt;#Output:
&lt;/span&gt;&lt;span class="n"&gt;Bootstrap&lt;/span&gt; &lt;span class="n"&gt;Statistics&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="n"&gt;Original&lt;/span&gt; &lt;span class="n"&gt;Population&lt;/span&gt; &lt;span class="n"&gt;Size&lt;/span&gt; &lt;span class="p"&gt;:&lt;/span&gt;  &lt;span class="mi"&gt;1599&lt;/span&gt;
&lt;span class="n"&gt;Bootstrap&lt;/span&gt; &lt;span class="n"&gt;Sample&lt;/span&gt; &lt;span class="n"&gt;Size&lt;/span&gt; &lt;span class="p"&gt;:&lt;/span&gt;  &lt;span class="mi"&gt;1000&lt;/span&gt;
&lt;span class="n"&gt;Original&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  &lt;span class="mf"&gt;38.0&lt;/span&gt;
&lt;span class="n"&gt;Bias&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mf"&gt;0.016345870231326387&lt;/span&gt;
&lt;span class="n"&gt;Standard&lt;/span&gt; &lt;span class="n"&gt;Error&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  &lt;span class="mf"&gt;1.071951943585676&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The bootstrap can be used with &lt;em&gt;multivariate data&lt;/em&gt;, where the rows are sampled as units.&lt;br&gt;
A model might then be run on the bootstrapped data, for example, to estimate the &lt;em&gt;stability (variability)&lt;/em&gt; of model parameters, or to improve predictive power.&lt;br&gt;
With tree-based methods such as CART (as in &lt;em&gt;Random Forest&lt;/em&gt;), running multiple trees on bootstrap samples and then averaging their predictions (or, for classification, taking a majority vote) generally performs better than using a single tree.&lt;/p&gt;

&lt;p&gt;As we can observe, the concept of the &lt;em&gt;Bootstrap&lt;/em&gt; is used extensively in &lt;em&gt;Machine Learning&lt;/em&gt;.&lt;/p&gt;
&lt;h4&gt;
  
  
  Confidence Intervals
&lt;/h4&gt;

&lt;p&gt;The concept of a &lt;em&gt;Confidence Interval&lt;/em&gt; is rooted in the idea of &lt;em&gt;uncertainty&lt;/em&gt;. A &lt;em&gt;point estimate&lt;/em&gt; is a single value; presenting a range of values instead helps communicate the uncertainty in that estimate.&lt;/p&gt;

&lt;p&gt;Confidence intervals always come with a coverage level, expressed as a (high) percentage, say 90% or 95%.&lt;br&gt;
One way to think of a 90% confidence interval is as follows: it is the interval that encloses the central 90% of the bootstrap sampling distribution of a sample statistic. &lt;br&gt;
More generally, an x% confidence interval around a sample estimate should, on average, contain similar sample estimates x% of the time (when a similar sampling procedure is followed).&lt;/p&gt;

&lt;p&gt;Bootstrap is a general tool that can be used to generate confidence intervals for most statistics, or model parameters.&lt;/p&gt;

&lt;p&gt;The percentage associated with the confidence interval is termed the level of confidence. The higher the level of confidence, the wider the interval.&lt;br&gt;
Also, the smaller the sample, the wider the interval (i.e., the greater the uncertainty)&lt;/p&gt;

&lt;p&gt;&lt;em&gt;For a data scientist, a confidence interval is a tool that can be used to get an idea of how variable a sample result might be.&lt;/em&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Creating a dataset from normal distribution
&lt;/span&gt;&lt;span class="n"&gt;dataset&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;0.5&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;rand&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mf"&gt;0.5&lt;/span&gt;
&lt;span class="c1"&gt;# bootstrap
&lt;/span&gt;&lt;span class="n"&gt;scores&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;list&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="c1"&gt;# bootstrap sample
&lt;/span&gt;    &lt;span class="n"&gt;indices&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;randint&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;sample&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;dataset&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;indices&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="c1"&gt;# calculate and store statistic
&lt;/span&gt;    &lt;span class="n"&gt;statistic&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;mean&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sample&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;scores&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;statistic&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;50th percentile (median) = %.3f&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;median&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;scores&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="c1"&gt;# calculate 95% confidence intervals (100 - alpha)
&lt;/span&gt;&lt;span class="n"&gt;alpha&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;5.0&lt;/span&gt;
&lt;span class="c1"&gt;# calculate lower percentile (e.g. 2.5)
&lt;/span&gt;&lt;span class="n"&gt;lower_p&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;alpha&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="mf"&gt;2.0&lt;/span&gt;
&lt;span class="c1"&gt;# retrieve observation at lower percentile
&lt;/span&gt;&lt;span class="n"&gt;lower&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;max&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;0.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;percentile&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;scores&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;lower_p&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;%.1fth percentile = %.3f&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;lower_p&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;lower&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="c1"&gt;# calculate upper percentile (e.g. 97.5)
&lt;/span&gt;&lt;span class="n"&gt;upper_p&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;alpha&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;alpha&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="mf"&gt;2.0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;# retrieve observation at upper percentile
&lt;/span&gt;&lt;span class="n"&gt;upper&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;min&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;1.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;percentile&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;scores&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;upper_p&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;%.1fth percentile = %.3f&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;upper_p&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;upper&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In this article, we covered two major concepts, &lt;em&gt;Confidence Intervals&lt;/em&gt; and the &lt;em&gt;Bootstrap&lt;/em&gt;; these two concepts are widely used in the field of Data Science for various applications.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Fin&lt;/em&gt;&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>datascience</category>
      <category>python</category>
      <category>beginners</category>
    </item>
    <item>
      <title>Data and Sampling Distributions- I</title>
      <dc:creator>Kushal</dc:creator>
      <pubDate>Mon, 13 Jul 2020 07:41:12 +0000</pubDate>
      <link>https://dev.to/kushal_/data-and-sampling-distributions-i-5g0k</link>
      <guid>https://dev.to/kushal_/data-and-sampling-distributions-i-5g0k</guid>
<description>&lt;p&gt;In the previous series, we delved deeply into &lt;em&gt;Exploratory Data Analysis&lt;/em&gt; and the many tools we have at our disposal as &lt;em&gt;Data Scientists&lt;/em&gt; to analyze and synthesize our data.&lt;/p&gt;

&lt;p&gt;A popular misconception in the age of &lt;strong&gt;Big Data&lt;/strong&gt; is that, because of the size and nature of the data, the need for sampling is redundant. On the contrary, because of the varying quality of data, the need for sampling is still prevalent.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--cHhFOTFZ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/i/0i79s1ou9zmk1rd2d2vq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--cHhFOTFZ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/i/0i79s1ou9zmk1rd2d2vq.png" alt="Population-Sample" width="800" height="414"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The left-hand side is the &lt;em&gt;population&lt;/em&gt;, which is assumed to follow an unknown distribution. The right-hand side is the &lt;em&gt;sample&lt;/em&gt;, with an empirical distribution. &lt;br&gt;
The process of drawing data from the left-hand side to form the right-hand side is called &lt;em&gt;sampling&lt;/em&gt;, and it is a major concern in data science.&lt;/p&gt;
&lt;h4&gt;
  
  
  Random Sampling and Sample Bias
&lt;/h4&gt;

&lt;p&gt;A &lt;em&gt;sample&lt;/em&gt; is a subset of data from a larger data set; statisticians call this larger data set the &lt;em&gt;population&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Random sampling&lt;/em&gt; is a process in which each available member of the population being sampled has an equal chance of being chosen for the sample at each draw.&lt;/p&gt;

&lt;p&gt;Sampling can be done &lt;em&gt;with replacement&lt;/em&gt;, in which observations are put back in the population after each draw for possible future reselection. Or it can be done &lt;em&gt;without replacement&lt;/em&gt;, in which case observations, once selected, are unavailable for future draws.&lt;/p&gt;
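&lt;p&gt;A quick sketch of the difference, using NumPy with a tiny made-up population:&lt;/p&gt;

```python
import numpy as np

rng = np.random.default_rng(42)
population = np.arange(10)  # a tiny made-up population: 0..9

# Sampling WITH replacement: the same member can be drawn more than once
with_replacement = rng.choice(population, size=8, replace=True)

# Sampling WITHOUT replacement: each member can be drawn at most once
without_replacement = rng.choice(population, size=8, replace=False)

print(with_replacement)     # duplicates are possible
print(without_replacement)  # all values are distinct
```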

&lt;p&gt;&lt;em&gt;What is Sample Bias?&lt;/em&gt; &lt;/p&gt;

&lt;p&gt;It occurs when a &lt;em&gt;sample&lt;/em&gt; is drawn from the &lt;em&gt;population&lt;/em&gt; in a nonrandom manner, resulting in a distribution that differs from that of the &lt;em&gt;population&lt;/em&gt;.&lt;/p&gt;
&lt;h4&gt;
  
  
  Bias
&lt;/h4&gt;

&lt;p&gt;Statistical bias refers to measurement or sampling errors that are systematic and produced by the measurement or sampling process.&lt;br&gt;
There is a large difference between &lt;em&gt;Error from Bias&lt;/em&gt; and &lt;em&gt;Error due to Random chance&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;How to deal with Bias? - &lt;em&gt;Random Selection&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;There are now a variety of methods to achieve representativeness, but at the heart of all of them lies random sampling.&lt;br&gt;
Random sampling is not always easy. A proper definition of an &lt;em&gt;accessible population&lt;/em&gt; is key.&lt;/p&gt;

&lt;p&gt;In &lt;em&gt;stratified sampling&lt;/em&gt;, the population is divided into &lt;em&gt;strata&lt;/em&gt;, and random samples are taken from each of them.&lt;/p&gt;
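&lt;p&gt;As a hypothetical sketch (column names and data are made up), stratified sampling can be done in pandas by grouping on the stratum column and sampling within each group:&lt;/p&gt;

```python
import pandas as pd

# An imbalanced two-stratum "population"
df = pd.DataFrame({
    'group': ['A'] * 100 + ['B'] * 900,
    'value': range(1000),
})

# Draw the same number of rows at random from each stratum
stratified = df.groupby('group').sample(n=2, random_state=0)
print(stratified)
```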
&lt;h4&gt;
  
  
  Selection Bias
&lt;/h4&gt;

&lt;p&gt;Selection bias refers to the practice of selectively choosing data—consciously or unconsciously—in a way that leads to a conclusion that is misleading or ephemeral.&lt;/p&gt;

&lt;p&gt;Selection bias occurs when you are &lt;em&gt;data snooping&lt;/em&gt;, i.e., extensively hunting for patterns in the data that suit your use case.&lt;/p&gt;

&lt;p&gt;Since the repeated review of large data sets is a key value proposition in data science, selection bias is something to worry about. A form of selection bias that a data scientist has to deal with is called &lt;em&gt;Vast search effect&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;If you repeatedly run different models and ask different questions of a large data set, you are bound to find something interesting. But is the result truly interesting, or is it just a chance outlier?&lt;/p&gt;

&lt;p&gt;How to deal with this effect? The answer is by using a &lt;em&gt;holdout set&lt;/em&gt; and sometimes more than &lt;em&gt;one holdout set&lt;/em&gt; to validate against.&lt;/p&gt;
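&lt;p&gt;A minimal pandas-only sketch of setting aside a holdout set (the fraction and data here are illustrative): lock away a portion of the rows before any exploration, and touch them only for final validation:&lt;/p&gt;

```python
import pandas as pd

df = pd.DataFrame({'x': range(100)})  # stand-in for the full data set

holdout = df.sample(frac=0.2, random_state=42)  # locked away for validation
working = df.drop(holdout.index)                # used for model search

print(len(working), len(holdout))
```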
&lt;h4&gt;
  
  
  Sampling Distribution of Statistic
&lt;/h4&gt;

&lt;p&gt;The term &lt;em&gt;sampling distribution&lt;/em&gt; of a statistic refers to the distribution of a sample statistic over many samples drawn from the same population. &lt;/p&gt;

&lt;p&gt;Much of classical statistics is concerned with making inferences from &lt;em&gt;small&lt;/em&gt; samples to a &lt;em&gt;very large&lt;/em&gt; population.&lt;/p&gt;

&lt;p&gt;Typically, a sample is drawn with the goal of measuring something (with a sample statistic) or modeling something (with a statistical or machine learning model). &lt;br&gt;
Since our estimate or model is based on a &lt;em&gt;sample&lt;/em&gt;, it might be in error: it might differ had we drawn a &lt;em&gt;different sample&lt;/em&gt;. &lt;br&gt;
We are therefore interested in how different it might be; a key concern is &lt;em&gt;sampling variability&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Note&lt;/strong&gt;: It is important to distinguish between the distribution of the individual data points, known as the data distribution, and the distribution of a sample statistic, known as the sampling distribution.&lt;/p&gt;

&lt;p&gt;The distribution of a &lt;em&gt;sample statistic&lt;/em&gt; such as the mean is likely to be more regular and bell-shaped than the distribution of the data itself. The larger the sample the statistic is based on, the more this is true. Also, the larger the sample, the narrower the distribution of the sample statistic.&lt;/p&gt;

&lt;p&gt;From the open-source &lt;em&gt;Wine Quality&lt;/em&gt; dataset, &lt;br&gt;
we take three samples: a sample of 1,000 values, a sample of 1,000 means of 5 values, and a sample of 1,000 means of 20 values.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Taking a Sample Data
&lt;/span&gt;&lt;span class="n"&gt;sample_data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;total sulfur dioxide&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;total sulfur dioxide&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;sample&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Data&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;})&lt;/span&gt;

&lt;span class="c1"&gt;# Taking a mean of statistic for 5 samples
&lt;/span&gt;
&lt;span class="n"&gt;sample_mean_05&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;total sulfur dioxide&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt; &lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;total sulfur dioxide&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;sample&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;mean&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;)],&lt;/span&gt;
    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Mean of 5&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;
&lt;span class="p"&gt;})&lt;/span&gt;

&lt;span class="c1"&gt;# Taking mean of statistic for 20 samples
&lt;/span&gt;
&lt;span class="n"&gt;sample_mean_20&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;total sulfur dioxide&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt; &lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;total sulfur dioxide&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;sample&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;mean&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;)],&lt;/span&gt;
    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Mean of 20&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;
&lt;span class="p"&gt;})&lt;/span&gt;

&lt;span class="n"&gt;results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;concat&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="n"&gt;sample_data&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;sample_mean_05&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;sample_mean_20&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;

&lt;span class="n"&gt;g&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;sns&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;FacetGrid&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;col&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;col_wrap&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;height&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;aspect&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; 
&lt;span class="n"&gt;g&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;map&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;hist&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;total sulfur dioxide&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;range&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;bins&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;40&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;g&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_axis_labels&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;fixed acidity&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Count&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; 
&lt;span class="n"&gt;g&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_titles&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;{col_name}&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The above code produces a FacetGrid of three histograms: the first shows the data distribution, while the second and third show sampling distributions.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--q7yHMiZ_--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/i/hqw5wyc6ttb1bkmypcht.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--q7yHMiZ_--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/i/hqw5wyc6ttb1bkmypcht.png" alt="Distribution" width="294" height="424"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The phenomenon we’ve just described is termed the &lt;strong&gt;Central Limit Theorem&lt;/strong&gt;. It says that the distribution of means drawn from multiple samples will resemble the familiar bell-shaped normal curve. &lt;/p&gt;

&lt;p&gt;The central limit theorem allows normal-approximation formulas like the t-distribution to be used in calculating sampling distributions for inference—that is, confidence intervals and hypothesis tests.&lt;/p&gt;

&lt;h4&gt;
  
  
  Standard Error
&lt;/h4&gt;

&lt;p&gt;The standard error is a single metric that sums up the variability in the sampling distribution for a statistic.&lt;br&gt;
The standard error can be estimated using a statistic based on the standard deviation &lt;em&gt;s&lt;/em&gt; of the sample values, and the sample size &lt;em&gt;n&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--D4w9zfEq--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/i/c8x6mc2j2mh7cdr6wtb6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--D4w9zfEq--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/i/c8x6mc2j2mh7cdr6wtb6.png" alt="Formula-SE" width="740" height="144"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As the sample size increases, the standard error decreases, corresponding to what was observed in the above figure.&lt;/p&gt;

&lt;p&gt;The approach to measuring standard error:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Sample from an accessible population distribution
&lt;/li&gt;
&lt;li&gt;For each sample, calculate the statistic (e.g., the &lt;em&gt;mean&lt;/em&gt;)&lt;/li&gt;
&lt;li&gt;Calculate the standard deviation of the statistics from Step 2; use this as the estimate of the standard error.&lt;/li&gt;
&lt;/ol&gt;
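&lt;p&gt;Although collecting real new samples is rarely feasible, the three steps above can be illustrated with a simulated population (the exponential distribution and sample sizes here are purely illustrative) and compared against the formula SE = s / sqrt(n):&lt;/p&gt;

```python
import numpy as np

rng = np.random.default_rng(0)
population = rng.exponential(scale=10, size=100_000)  # simulated population

n = 50
# Steps 1-2: draw many samples and compute the statistic (the mean) for each
sample_means = [rng.choice(population, size=n).mean() for _ in range(2000)]

# Step 3: the standard deviation of those statistics estimates the standard error
se_empirical = np.std(sample_means)
se_formula = population.std() / np.sqrt(n)  # SE = s / sqrt(n)

print(round(se_empirical, 3), round(se_formula, 3))
```

The two numbers agree closely, which is exactly what the formula promises.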

&lt;p&gt;In practice, this approach of collecting new samples to estimate the standard error is typically not feasible. Fortunately, it turns out that it is not necessary to draw brand new samples; instead, you can use &lt;em&gt;bootstrap resamples&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;In modern statistics, the &lt;em&gt;bootstrap&lt;/em&gt; has become a standard way to estimate the standard error.&lt;/p&gt;
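&lt;p&gt;A minimal bootstrap sketch (on simulated data, assuming NumPy): resample the one sample we actually have, with replacement, and take the standard deviation of the resampled means:&lt;/p&gt;

```python
import numpy as np

rng = np.random.default_rng(1)
sample = rng.normal(loc=5, scale=2, size=200)  # the single sample we have

# Resample the sample itself, with replacement, and collect the means
boot_means = [
    rng.choice(sample, size=len(sample), replace=True).mean()
    for _ in range(1000)
]

se_boot = np.std(boot_means)                            # bootstrap standard error
se_formula = sample.std(ddof=1) / np.sqrt(len(sample))  # classical formula

print(round(se_boot, 3), round(se_formula, 3))
```

No new draws from the population were needed, yet the bootstrap estimate lands very close to the formula-based one.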

&lt;p&gt;So this concludes Part I, where I have covered the sample/population dichotomy, sample bias and other kinds of bias, ways to mitigate bias in our data, the Central Limit Theorem, and the standard error.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Fin&lt;/em&gt;&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>python</category>
      <category>datascience</category>
      <category>beginners</category>
    </item>
    <item>
      <title>Exploratory Data Analysis: Part C</title>
      <dc:creator>Kushal</dc:creator>
      <pubDate>Thu, 09 Jul 2020 13:16:38 +0000</pubDate>
      <link>https://dev.to/kushal_/exploratory-data-analysis-part-c-lcc</link>
      <guid>https://dev.to/kushal_/exploratory-data-analysis-part-c-lcc</guid>
      <description>&lt;p&gt;In this article, we will delve into various aspects of plotting and analyzing numerical and categorical variables in bivariate and a multivariate manner.&lt;/p&gt;

&lt;h3&gt;
  
  
  Correlation
&lt;/h3&gt;

&lt;p&gt;Exploratory data analysis in many modeling projects (whether in data science or in research) involves examining correlation among predictors and between predictors and a target variable.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Variables X and Y (each with measured data) are said to be positively correlated if high values of X go with high values of Y, and low values of X go with low values of Y.&lt;/em&gt; &lt;/p&gt;

&lt;p&gt;We use &lt;em&gt;Pearson's Correlation Coefficient&lt;/em&gt; as a de-facto method for computing correlation among numerical variables.&lt;/p&gt;

&lt;p&gt;Following is the mathematical formula of the same:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fm4wuiox5krlgzg1gfx0c.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fm4wuiox5krlgzg1gfx0c.png" alt="Correlation"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The correlation coefficient always lies between +1 (perfect positive correlation) and –1 (perfect negative correlation); 0 indicates no correlation.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;#Reading the data
&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;read_csv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;winequality-red.csv&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;sep&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;;&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;# pandas dataframe has .corr() method to compute a correlation table
&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;iloc&lt;/span&gt;&lt;span class="p"&gt;[:,&lt;/span&gt; &lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;corr&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The output of the above code is a square matrix with each numerical variable's correlation computed against every other variable in the data.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fhhlevlf1ohj0ebdvd919.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fhhlevlf1ohj0ebdvd919.png" alt="CorrTable"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;For visualization purposes, we use seaborn's &lt;em&gt;heatmap&lt;/em&gt; for better inferences and data storytelling.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;figure&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;figsize&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;7&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="n"&gt;sns&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;heatmap&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;iloc&lt;/span&gt;&lt;span class="p"&gt;[:,&lt;/span&gt; &lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;corr&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="n"&gt;vmin&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;vmax&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;cmap&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;sns&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;diverging_palette&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;220&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;as_cmap&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Flmmro9723p1zlqasvdyu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Flmmro9723p1zlqasvdyu.png" alt="heatmap"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Like the mean and standard deviation, the correlation coefficient is sensitive to outliers in the data.&lt;/p&gt;

&lt;p&gt;Note: There are various other correlation coefficients devised by statisticians: &lt;em&gt;Spearman’s rho&lt;/em&gt; or &lt;em&gt;Kendall’s tau&lt;/em&gt;. &lt;br&gt;
These are correlation coefficients based on the rank of the data. Since they work with ranks rather than values, these estimates are robust to outliers and can handle certain types of nonlinearities.&lt;/p&gt;
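&lt;p&gt;To illustrate with made-up numbers: Spearman's rho is simply Pearson's correlation applied to the ranks of the data, which is precisely what makes it robust to outliers:&lt;/p&gt;

```python
import pandas as pd

x = pd.Series([1, 2, 3, 4, 1000])  # one extreme outlier
y = pd.Series([2, 4, 6, 8, 10])    # perfectly monotone with x

pearson = x.corr(y)                 # dragged down by the outlier
spearman = x.rank().corr(y.rank())  # Pearson on ranks = Spearman's rho

print(round(pearson, 3), round(spearman, 3))
```

The rank-based coefficient reports the perfect monotone relationship, while the outlier pulls Pearson's coefficient well below 1.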
&lt;h4&gt;
  
  
  Scatterplots
&lt;/h4&gt;

&lt;p&gt;The standard way to visualize the relationship between two measured data variables is with a scatterplot. The x-axis represents one variable and the y-axis another, and each point on the graph is a record.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;ax&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;plot&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;scatter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;citric acid&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;fixed acidity&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;figsize&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;6&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="n"&gt;ax&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;axhline&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;color&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;grey&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;lw&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;ax&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;axvline&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;color&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;grey&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;lw&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;ax&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;grid&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fvzvo5vs3yhmxa9g5b2gm.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fvzvo5vs3yhmxa9g5b2gm.png" alt="scatterplot"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The plot shows a fairly strong positive relationship between &lt;em&gt;Citric Acid&lt;/em&gt; and &lt;em&gt;Fixed Acidity&lt;/em&gt;: higher citric acid values tend to go with higher fixed acidity levels. &lt;/p&gt;

&lt;h3&gt;
  
  
  Exploring Two or More Variables
&lt;/h3&gt;

&lt;p&gt;Familiar estimators like mean and variance look at variables one at a time (univariate analysis).&lt;br&gt;
In this section, we look at additional estimates and plots, and at more than two variables (multivariate analysis).&lt;/p&gt;
&lt;h4&gt;
  
  
  Hexagonal Binning and Contours
&lt;/h4&gt;

&lt;p&gt;Scatterplots are fine when there is a relatively small number of data values.&lt;br&gt;
For data sets with hundreds of thousands or millions of records, a scatterplot will be too dense, so we need a different way to visualize the relationship.&lt;br&gt;
Rather than plotting points, which would appear as a monolithic dark cloud, we can group the records into hexagonal bins and plot the hexagons with a color indicating the number of records in each bin. &lt;/p&gt;

&lt;p&gt;In Python, hexagonal binning plots are readily available via the pandas DataFrame plotting method &lt;em&gt;hexbin&lt;/em&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;ax&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;plot&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;hexbin&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;citric acid&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;fixed acidity&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;gridsize&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;40&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;figsize&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;6&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="n"&gt;ax&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_xlabel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Citric Acid&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;ax&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_ylabel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Fixed Acidity&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Ff70t0acy6ezq0foux5xh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Ff70t0acy6ezq0foux5xh.png" alt="Hexplot"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Another method to analyze dense data is to plot density contours. In Python, &lt;em&gt;seaborn&lt;/em&gt; provides the method &lt;em&gt;kdeplot&lt;/em&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;figure&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;figsize&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="n"&gt;sns&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;kdeplot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;citric acid&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;fixed acidity&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fcwv5k0vot11n4zwex5jq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fcwv5k0vot11n4zwex5jq.png" alt="contourplot"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Two Categorical Variables
&lt;/h4&gt;

&lt;p&gt;A useful way to summarize two categorical variables is a contingency table - &lt;em&gt;a table of counts by category.&lt;/em&gt;&lt;br&gt;
Contingency tables can look only at counts, or they can also include column and total percentages. &lt;/p&gt;

&lt;p&gt;In Python, the pivot_table method creates the contingency table; the aggfunc argument lets us compute the counts.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;crosstab&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;adult_data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;pivot_table&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;index&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;education&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;columns&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;sex&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;aggfunc&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;lambda&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;margins&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;crosstab&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;copy&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;loc&lt;/span&gt;&lt;span class="p"&gt;[:,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;workclass&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fcu6wwtbka6ih3z9wybrf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fcu6wwtbka6ih3z9wybrf.png" alt="ContingencyTable"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As we can observe, we have computed a contingency table for the two categorical variables &lt;em&gt;education&lt;/em&gt; and &lt;em&gt;sex&lt;/em&gt;. &lt;br&gt;
Because no &lt;em&gt;values&lt;/em&gt; argument was passed, &lt;em&gt;pivot_table&lt;/em&gt; aggregates every remaining column, so we select the 'workclass' block of the resulting hierarchical columns for the output. &lt;/p&gt;
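&lt;p&gt;As an aside, pandas also offers the &lt;em&gt;crosstab&lt;/em&gt; function, which builds the same kind of count table more directly. The sketch below uses a small made-up sample in place of the adult dataset:&lt;/p&gt;

```python
import pandas as pd

# Tiny made-up sample standing in for the adult dataset
sample = pd.DataFrame({
    'education': ['Bachelors', 'HS-grad', 'Bachelors', 'Masters', 'HS-grad'],
    'sex': ['Male', 'Female', 'Female', 'Male', 'Male'],
})

# pd.crosstab counts co-occurrences directly; margins=True adds totals
table = pd.crosstab(sample['education'], sample['sex'], margins=True)
print(table)
```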
&lt;h4&gt;
  
  
  Categorical and Numerical Variables
&lt;/h4&gt;

&lt;p&gt;&lt;em&gt;Boxplots&lt;/em&gt; are a simple way to visually compare the distributions of a numeric variable grouped according to a categorical variable.&lt;/p&gt;

&lt;p&gt;The &lt;em&gt;pandas&lt;/em&gt; boxplot method takes the &lt;em&gt;by&lt;/em&gt; argument, which splits the data set into groups and creates the individual boxplots.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;ax&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;adult_data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;boxplot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;by&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;race&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;column&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;hours-per-week&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;figsize&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fbxzmwqcafjhb08sbr500.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fbxzmwqcafjhb08sbr500.png" alt="boxpair"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;From the above visualization, we can observe that we have grouped the data by the categorical variable &lt;em&gt;race&lt;/em&gt; and plotted it against &lt;em&gt;hours-per-week&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;A &lt;strong&gt;violin plot&lt;/strong&gt; is an enhancement to the boxplot and plots the density estimate with the density on the y-axis. &lt;br&gt;
The density is mirrored and flipped over, and the resulting shape is filled in, creating an image resembling a violin. &lt;br&gt;
&lt;em&gt;The advantage of a violin plot is that it can show nuances in the distribution that aren’t perceptible in a boxplot.&lt;/em&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;figure&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;figsize&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="n"&gt;sns&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;violinplot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;adult_data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;race&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;adult_data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;hours-per-week&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;inner&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;quartile&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fx3gceax0ms847dr80grj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fx3gceax0ms847dr80grj.png" alt="violenplot"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We created a violin plot with similar features as the aforementioned boxplot.&lt;/p&gt;

&lt;h4&gt;
  
  
  Closing Remarks
&lt;/h4&gt;

&lt;p&gt;&lt;em&gt;Exploratory data analysis (EDA)&lt;/em&gt; sets the foundation for the field of data science. The key idea of EDA is that the first and most important step in any data-driven project is to look at the data. By summarizing and visualizing the data, you can gain valuable intuition and understanding of the project.&lt;/p&gt;

&lt;p&gt;Fin.&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>datascience</category>
      <category>python</category>
      <category>beginners</category>
    </item>
    <item>
      <title>Exploratory Data Analysis: Part B</title>
      <dc:creator>Kushal</dc:creator>
      <pubDate>Wed, 08 Jul 2020 15:46:30 +0000</pubDate>
      <link>https://dev.to/kushal_/exploratory-data-analysis-part-b-13dk</link>
      <guid>https://dev.to/kushal_/exploratory-data-analysis-part-b-13dk</guid>
<description>&lt;p&gt;In &lt;a href="https://dev.to/kushalvala/exploratory-data-analysis-part-a-3j3l"&gt;Part-A&lt;/a&gt;, we delved into the following concepts:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Elements of Structured Data&lt;/li&gt;
&lt;li&gt;Estimate of Location ( &lt;em&gt;Central Tendency&lt;/em&gt; )&lt;/li&gt;
&lt;li&gt;Estimate of Variability&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In this article, we will be moving ahead with the methodologies and techniques for Exploratory Data Analysis.&lt;/p&gt;

&lt;h3&gt;
  
  
  Exploring Data Distribution
&lt;/h3&gt;

&lt;p&gt;Each of the estimates we have covered sums up the data in a single number to describe the location or variability of the data. &lt;em&gt;It is also useful to explore how the data is distributed overall.&lt;/em&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Percentiles and BoxPlots
&lt;/h4&gt;

&lt;p&gt;In Part-A, we saw how percentiles can be used to measure the spread of the data.&lt;br&gt;
Percentiles are also valuable for summarizing the entire distribution, and especially for summarizing the tails (the outer range).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;pandas&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;
&lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;read_csv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;winequality-red.csv&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Calculating Percentiles - 5th, 25th, 50th, 75th, 95th 
&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;total sulfur dioxide&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;quantile&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="mf"&gt;0.05&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mf"&gt;0.25&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mf"&gt;0.5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mf"&gt;0.75&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mf"&gt;0.95&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; 

&lt;span class="c1"&gt;# Output:
&lt;/span&gt;&lt;span class="mf"&gt;0.05&lt;/span&gt;     &lt;span class="mf"&gt;11.0&lt;/span&gt;
&lt;span class="mf"&gt;0.25&lt;/span&gt;     &lt;span class="mf"&gt;22.0&lt;/span&gt;
&lt;span class="mf"&gt;0.50&lt;/span&gt;     &lt;span class="mf"&gt;38.0&lt;/span&gt;
&lt;span class="mf"&gt;0.75&lt;/span&gt;     &lt;span class="mf"&gt;62.0&lt;/span&gt;
&lt;span class="mf"&gt;0.95&lt;/span&gt;    &lt;span class="mf"&gt;112.1&lt;/span&gt;
&lt;span class="n"&gt;Name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;total&lt;/span&gt; &lt;span class="n"&gt;sulfur&lt;/span&gt; &lt;span class="n"&gt;dioxide&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;dtype&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;float64&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;As we can observe from the above output, the &lt;em&gt;median&lt;/em&gt; (50th percentile) is 38.0, and &lt;em&gt;Total Sulfur Dioxide&lt;/em&gt; shows a wide spread, with the 5th percentile at 11.0 and the 95th percentile at 112.1.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Boxplots&lt;/em&gt; are based on percentiles and give a quick way to visualize the distribution of data.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;figure&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;figsize&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;8&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="n"&gt;ax&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;total sulfur dioxide&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;plot&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;box&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;ax&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_ylabel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Concentration of Sulfar Dioxide&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;ax&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;grid&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In this code, we use the &lt;em&gt;pandas&lt;/em&gt; built-in boxplot command, but many data scientists and analysts prefer &lt;em&gt;matplotlib&lt;/em&gt; and &lt;em&gt;seaborn&lt;/em&gt; for their flexibility.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--_de4NPyK--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/i/m7gzb04enis11kupzaf9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--_de4NPyK--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/i/m7gzb04enis11kupzaf9.png" alt="BoxPlot" width="360" height="576"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The top and bottom of the box are the 75th and 25th percentiles, respectively. The median is shown by the horizontal line in the box. The dashed lines referred to as &lt;em&gt;whiskers&lt;/em&gt;, extend from the top and bottom of the box to indicate the range for the bulk of the data.&lt;/p&gt;

&lt;p&gt;Any data outside of the whiskers are plotted as single points or circles (often considered outliers).&lt;/p&gt;
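&lt;p&gt;The conventional whisker rule (the default in matplotlib and pandas) extends each whisker at most 1.5 times the interquartile range beyond the quartiles. A small sketch, using made-up values, of computing those fences and flagging outliers:&lt;/p&gt;

```python
import pandas as pd

values = pd.Series([11, 22, 30, 38, 45, 62, 70, 112, 280])

q1 = values.quantile(0.25)
q3 = values.quantile(0.75)
iqr = q3 - q1  # interquartile range

# Default whisker rule: 1.5 * IQR beyond each quartile
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Points outside the fences are drawn as individual outlier markers
outliers = values[values.lt(lower_fence) | values.gt(upper_fence)]
print(lower_fence, upper_fence, list(outliers))
```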

&lt;h4&gt;
  
  
  Frequency Tables and Histograms
&lt;/h4&gt;

&lt;p&gt;A frequency table of a variable divides up the variable range into equally spaced segments and tells us how many values fall within each segment.&lt;/p&gt;

&lt;p&gt;The function pandas.cut creates a series that maps the values into the segments.&lt;br&gt;
Using the method value_counts, we get the frequency table.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Frequency Table for Sulfur Dioxide Concentration
&lt;/span&gt;
&lt;span class="n"&gt;binnedConct&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;cut&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;total sulfur dioxide&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;binnedConct&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;value_counts&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="c1"&gt;#Output:
&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;5.717&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;34.3&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;     &lt;span class="mi"&gt;730&lt;/span&gt;
&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;34.3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;62.6&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;      &lt;span class="mi"&gt;471&lt;/span&gt;
&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;62.6&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;90.9&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;      &lt;span class="mi"&gt;221&lt;/span&gt;
&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;90.9&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;119.2&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;     &lt;span class="mi"&gt;113&lt;/span&gt;
&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;119.2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;147.5&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;     &lt;span class="mi"&gt;52&lt;/span&gt;
&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;147.5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;175.8&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;     &lt;span class="mi"&gt;10&lt;/span&gt;
&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;260.7&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;289.0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;      &lt;span class="mi"&gt;2&lt;/span&gt;
&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;232.4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;260.7&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;      &lt;span class="mi"&gt;0&lt;/span&gt;
&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;204.1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;232.4&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;      &lt;span class="mi"&gt;0&lt;/span&gt;
&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;175.8&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;204.1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;      &lt;span class="mi"&gt;0&lt;/span&gt;
&lt;span class="n"&gt;Name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;total&lt;/span&gt; &lt;span class="n"&gt;sulfur&lt;/span&gt; &lt;span class="n"&gt;dioxide&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;dtype&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;int64&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It is important to include the empty bins. The fact that there are no values in those bins is useful information. It can also be useful to experiment with different bin sizes.&lt;/p&gt;
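&lt;p&gt;A minimal sketch of that experiment on a made-up series: with more, narrower bins a gap in the data shows up as empty bins, while fewer, wider bins hide it.&lt;/p&gt;

```python
import pandas as pd

values = pd.Series([1, 2, 3, 4, 20])

# Four equal-width bins over the range 1-20; the gap in the data
# leaves the middle bins empty, and value_counts still reports them
bins4 = pd.cut(values, 4)
print(bins4.value_counts().sort_index())

# Two wider bins hide the gap entirely
bins2 = pd.cut(values, 2)
print(bins2.value_counts().sort_index())
```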

&lt;p&gt;A &lt;em&gt;histogram&lt;/em&gt; is a way to visualize a frequency table, with bins on the x-axis and the data count on the y-axis.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;pandas&lt;/em&gt; supports histograms for data frames with the &lt;em&gt;hist&lt;/em&gt; method. Use the keyword argument &lt;em&gt;bins&lt;/em&gt; to define the number of bins.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Plotting Histogram
&lt;/span&gt;&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;figure&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;figsize&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;7&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="n"&gt;ax&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;total sulfur dioxide&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;plot&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;hist&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;bins&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;ax&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_xlabel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Concentration of Sulfar Dioxide&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;ax&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;grid&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--HVSsyt_W--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/i/h73im7pqph2ionz7swtj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--HVSsyt_W--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/i/h73im7pqph2ionz7swtj.png" alt="Histogram" width="504" height="360"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In the above histogram, we see that the distribution is right-skewed. (I will address &lt;em&gt;Skewness&lt;/em&gt; and &lt;em&gt;Kurtosis&lt;/em&gt; in upcoming articles.)&lt;/p&gt;

&lt;h4&gt;
  
  
  Density Plots and Estimates
&lt;/h4&gt;

&lt;p&gt;Related to the histogram is a &lt;em&gt;density plot&lt;/em&gt;, which shows the distribution of data values as a continuous line. A density plot can be thought of as a smoothed histogram, although it is typically computed directly from the data through a &lt;em&gt;kernel density estimate&lt;/em&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;figure&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;figsize&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;7&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="n"&gt;ax&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;total sulfur dioxide&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;plot&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;hist&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;bins&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;density&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;total sulfur dioxide&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;plot&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;density&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ax&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ax&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;ax&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_xlabel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Concentration of Sulfar Dioxide&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;ax&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;grid&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--w3-nYUa_--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/i/dyxk18dyw1d8g58tb4ps.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--w3-nYUa_--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/i/dyxk18dyw1d8g58tb4ps.png" alt="DensityPlot" width="720" height="504"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Note&lt;/strong&gt;: A key distinction from the histogram plotted in is the scale of the y-axis: &lt;em&gt;a density plot&lt;/em&gt; corresponds to plotting the histogram as a proportion rather than counts.&lt;/p&gt;
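&lt;p&gt;We can verify this rescaling numerically: with density scaling, bar height times bin width sums to 1. A quick sketch with &lt;em&gt;numpy&lt;/em&gt; on randomly generated data:&lt;/p&gt;

```python
import numpy as np

rng = np.random.default_rng(0)
samples = rng.normal(loc=50, scale=10, size=1000)

# density=True rescales bar heights so that height * bin-width sums to 1
heights, edges = np.histogram(samples, bins=10, density=True)
widths = np.diff(edges)
area = float((heights * widths).sum())
print(area)  # 1.0 up to floating-point error
```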

&lt;h3&gt;
  
  
  Exploring Binary and Categorical Data
&lt;/h3&gt;

&lt;p&gt;Getting a summary of a binary variable or a categorical variable with a few categories is a fairly easy matter: we just figure out the proportion of 1s or the proportions of the important categories.&lt;/p&gt;
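&lt;p&gt;In pandas, those proportions come straight from &lt;em&gt;value_counts&lt;/em&gt; with &lt;em&gt;normalize=True&lt;/em&gt;; a minimal sketch on a made-up binary series:&lt;/p&gt;

```python
import pandas as pd

# Hypothetical 0/1 flags, e.g. whether a customer clicked
flags = pd.Series([0, 1, 1, 0, 1, 1, 1, 0])

# normalize=True turns raw counts into proportions
proportions = flags.value_counts(normalize=True)
print(proportions)  # 1 -> 0.625, 0 -> 0.375
```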

&lt;p&gt;&lt;em&gt;Bar charts, seen often in the popular press, are a common visual tool for displaying a single categorical variable.&lt;/em&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;adult_data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;read_csv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;adult.data&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;header&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;skip_blank_lines&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Plotting the Categorical Variable: Education
&lt;/span&gt;&lt;span class="n"&gt;adult_data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;value_counts&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="n"&gt;plot&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;bar&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;figsize&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;7&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;grid&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--gP_HMkgY--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/i/dablbamifqwhyv6km83a.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--gP_HMkgY--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/i/dablbamifqwhyv6km83a.png" alt="Alt Text" width="504" height="360"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Note that a bar chart resembles a histogram; in a bar chart the x-axis represents different categories of a factor variable, while in a histogram the x-axis represents values of a single variable on a numeric scale.&lt;/p&gt;

&lt;p&gt;Some more concepts on categorical variables:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Mode: &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The mode is the value—or values in case of a tie—that appears most often in the data. The mode is a simple summary statistic for categorical data, and it is generally not used for numeric data.&lt;/p&gt;
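&lt;p&gt;In pandas, &lt;em&gt;Series.mode&lt;/em&gt; returns every most-frequent value, so a tie yields more than one row; a quick sketch on made-up data:&lt;/p&gt;

```python
import pandas as pd

colors = pd.Series(['red', 'blue', 'red', 'green', 'blue'])

# 'red' and 'blue' each appear twice, so both are returned
modes = colors.mode()
print(list(modes))
```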

&lt;ul&gt;
&lt;li&gt;Expected Value:&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The expected value is calculated as follows:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Multiply each outcome by its probability of occurrence.&lt;/li&gt;
&lt;li&gt;Sum these values.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The expected value is really a form of &lt;em&gt;weighted mean&lt;/em&gt;: it adds the ideas of future expectations and probability weights, often based on subjective judgment.&lt;/p&gt;
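&lt;p&gt;The two steps above amount to a single weighted sum. A sketch with a made-up promotion: suppose 5% of customers buy a 300-dollar service, 15% buy a 50-dollar service, and the rest buy nothing.&lt;/p&gt;

```python
# Hypothetical outcomes (dollars) and their assumed probabilities
outcomes = [300, 50, 0]
probabilities = [0.05, 0.15, 0.80]

# Step 1: multiply each outcome by its probability; Step 2: sum the products
expected_value = sum(o * p for o, p in zip(outcomes, probabilities))
print(expected_value)  # 22.5
```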

&lt;p&gt;That's all Folks for Part-B of this series! &lt;br&gt;
In this article, we covered various plotting paradigms used to analyze numerical and categorical variables, along with their Python code.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Fin&lt;/em&gt;&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>datascience</category>
      <category>python</category>
      <category>beginners</category>
    </item>
    <item>
      <title>Exploratory Data Analysis: Part A</title>
      <dc:creator>Kushal</dc:creator>
      <pubDate>Tue, 07 Jul 2020 13:39:40 +0000</pubDate>
      <link>https://dev.to/kushal_/exploratory-data-analysis-part-a-3j3l</link>
      <guid>https://dev.to/kushal_/exploratory-data-analysis-part-a-3j3l</guid>
<description>&lt;p&gt;In this article, we will explore fundamental ways of doing exploratory data analysis on a dataset. &lt;br&gt;
Earlier, statistical studies were limited to &lt;em&gt;inference&lt;/em&gt;, but John Tukey proposed a new scientific discipline called data analysis that included statistical inference as just one component. &lt;br&gt;
With the ready availability of computing power and expressive data analysis software, exploratory data analysis has evolved well beyond its original scope. &lt;/p&gt;
&lt;h3&gt;
  
  
  Elements of Structured Data
&lt;/h3&gt;

&lt;p&gt;Data comes from many sources: &lt;em&gt;sensor measurements, events, text, images, and videos&lt;/em&gt;. &lt;br&gt;
The &lt;em&gt;Internet of Things (IoT)&lt;/em&gt; is spewing out streams of information. Much of this data is unstructured: Images are a collection of pixels, with each pixel containing RGB (red, green, blue) color information. &lt;br&gt;
Texts are sequences of words and nonword characters, often organized by sections, subsections, and so on.&lt;/p&gt;

&lt;p&gt;To apply statistical concepts, unstructured raw data has to be converted into structured data.&lt;/p&gt;

&lt;p&gt;There are mainly two types of structured data:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Numeric Type

&lt;ul&gt;
&lt;li&gt;
&lt;em&gt;Continuous&lt;/em&gt;: Data that can take on any value in an interval. &lt;/li&gt;
&lt;li&gt;
&lt;em&gt;Discrete&lt;/em&gt;: Data that can take on only integer values, such as counts. &lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Categorical Type

&lt;ul&gt;
&lt;li&gt;
&lt;em&gt;Binary Data&lt;/em&gt; (Special Case): A special case of categorical data with just two categories of values, e.g., 0/1, true/false. &lt;/li&gt;
&lt;li&gt;
&lt;em&gt;Ordinal Data&lt;/em&gt;: Categorical data that has an explicit ordering. (Synonym: ordered factor).&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;
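&lt;p&gt;These types map directly onto pandas dtypes; a minimal sketch with hypothetical columns, including an ordered categorical for ordinal data:&lt;/p&gt;

```python
import pandas as pd

df = pd.DataFrame({
    'temperature': [36.6, 37.2, 38.9],   # continuous (float)
    'visit_count': [1, 3, 2],            # discrete (int)
    'is_member': [True, False, True],    # binary
})

# Ordinal data: a categorical with an explicit ordering
df['size'] = pd.Categorical(['small', 'large', 'medium'],
                            categories=['small', 'medium', 'large'],
                            ordered=True)

print(df.dtypes)
print(df['size'].cat.ordered)  # True
```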
&lt;h3&gt;
  
  
  Rectangular Data
&lt;/h3&gt;

&lt;p&gt;The typical frame of reference for analysis in data science is a &lt;em&gt;rectangular data object&lt;/em&gt;, like a &lt;em&gt;spreadsheet&lt;/em&gt; or &lt;em&gt;database table.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Rectangular data&lt;/em&gt; is the general term for a two-dimensional matrix with rows indicating records (cases) and columns indicating features (variables).&lt;br&gt;
The data frame is the specific format in R and Python.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key Terms for Rectangular Data&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Feature: A column within a table is commonly referred to as a feature. Alias: attribute, predictor, variable&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Records: A row within a data frame. Alias: case, example, instance, observation, etc.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Below is the typical data frame object read by &lt;em&gt;pandas&lt;/em&gt; library in Python. &lt;br&gt;
&lt;em&gt;Dataset: Wine Quality by UCI&lt;/em&gt;&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;
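&lt;p&gt;A minimal sketch of reading such a CSV into a data frame with &lt;em&gt;pandas&lt;/em&gt;, using a tiny inline sample (abridged column names) in place of the full UCI file:&lt;/p&gt;

```python
import io
import pandas as pd

# Inline stand-in for the UCI wine-quality CSV
csv_text = """fixed acidity,volatile acidity,alcohol,quality
7.4,0.70,9.4,5
7.8,0.88,9.8,5
11.2,0.28,10.2,6
"""

data = pd.read_csv(io.StringIO(csv_text))
print(data.shape)   # (3, 4): 3 records (rows), 4 features (columns)
print(data.head())
```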


&lt;p&gt;&lt;strong&gt;Non-Rectangular Data Structure&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;There are data structures other than the rectangular data.&lt;br&gt;
&lt;em&gt;Time series data&lt;/em&gt; records successive measurements of the same variable. It is the raw material for statistical forecasting methods, and it is also a key component of the data produced by devices—the Internet of Things.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Spatial data structures, which are used in mapping and location analytics, are more complex and varied than rectangular data structures.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Graph (or network) data structures are used to represent physical, social, and abstract relationships.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Estimates of Location
&lt;/h3&gt;

&lt;p&gt;Variables with measured or count data (&lt;em&gt;Numerical&lt;/em&gt;) might have thousands of distinct values. &lt;br&gt;
A basic step in exploring your data is getting a “typical value” for each feature (variable): an estimate of where most of the data is located (i.e., its central tendency).&lt;/p&gt;

&lt;p&gt;At first glance, summarizing data might seem fairly trivial: just take the &lt;em&gt;mean of the data&lt;/em&gt;. In fact, while the mean is easy to compute and expedient to use, it may not always be the best measure for a central value.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Mean&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The most basic estimate of location is the mean or average value. The mean is the sum of all values divided by the number of values.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--7tD2NTKH--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/i/etbavb47kt1eldsx1sdj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--7tD2NTKH--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/i/etbavb47kt1eldsx1sdj.png" alt="Alt Text" width="800" height="214"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;N (or n)&lt;/em&gt; refers to the total number of records or observations. In statistics, it is capitalized if it is referring to a population, and lowercase if it refers to a sample from a population.&lt;/p&gt;

&lt;p&gt;A variation of the mean is a &lt;em&gt;trimmed mean&lt;/em&gt;, which you calculate by dropping a fixed number of sorted values at each end and then taking an average of the remaining values. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--B2XUkNaq--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/i/xvexymb5m8gi81o1t99y.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--B2XUkNaq--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/i/xvexymb5m8gi81o1t99y.png" alt="Alt Text" width="800" height="205"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;An advantage of using a trimmed mean is that it removes the influence of extreme values. It is more robust than the regular mean.&lt;/p&gt;

&lt;p&gt;Another type of mean is a weighted mean, which you calculate by multiplying each data value by a user-specified weight and dividing their sum by the sum of the weights.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--QO_b06Ts--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/i/8qu6d9xjaypmcomsjhrq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--QO_b06Ts--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/i/8qu6d9xjaypmcomsjhrq.png" alt="Alt Text" width="692" height="198"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Median and Robust Measures&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The &lt;em&gt;median&lt;/em&gt; is the middle number on a sorted list of the data. If there is an even number of data values, the middle value is one that is not actually in the data set, but rather the average of the two values that divide the sorted data into upper and lower halves.&lt;/p&gt;
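&lt;p&gt;A quick illustration with Python's built-in &lt;code&gt;statistics&lt;/code&gt; module (the numbers are arbitrary):&lt;/p&gt;

```python
import statistics

# Odd number of values: the median is the middle value of the sorted list.
print(statistics.median([5, 1, 3]))      # 3

# Even number of values: the median is the average of the two middle values,
# which need not itself appear in the data set.
print(statistics.median([1, 3, 5, 7]))   # 4.0
```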

&lt;p&gt;Compared to the mean, the median depends only on the central values of the sorted data, which makes it more robust. In many use cases, the median is a better measure of central tendency.&lt;/p&gt;

&lt;p&gt;The median is referred to as a robust estimate of location since it is not influenced by outliers (extreme cases) that could skew the results.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;An outlier is any value that is very distant from the other values in a data set.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;In fact, a trimmed mean is widely used to avoid the influence of outliers. For example, trimming the bottom and top 10% (a common choice) of the data will provide protection against outliers in all but the smallest data sets.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;
&lt;span class="c1"&gt;# Mean, Trimmed Mean and Median of the feature: fixed acidity of wine
&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Mean of Fixed Acidity of Wine:&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;fixed acidity&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;mean&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;

&lt;span class="c1"&gt;# Slicing 10% of left and right most elements
&lt;/span&gt;&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Trimmed Mean of Fixed Acidity of Wine: &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nf"&gt;trim_mean&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;fixed acidity&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="mf"&gt;0.1&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Median of Fixed Acidity of Wine: &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;fixed acidity&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;median&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;

&lt;span class="c1"&gt;#Output
&lt;/span&gt;&lt;span class="n"&gt;Mean&lt;/span&gt; &lt;span class="n"&gt;of&lt;/span&gt; &lt;span class="n"&gt;Fixed&lt;/span&gt; &lt;span class="n"&gt;Acidity&lt;/span&gt; &lt;span class="n"&gt;of&lt;/span&gt; &lt;span class="n"&gt;Wine&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;8.319637273295838&lt;/span&gt;
&lt;span class="n"&gt;Trimmed&lt;/span&gt; &lt;span class="n"&gt;Mean&lt;/span&gt; &lt;span class="n"&gt;of&lt;/span&gt; &lt;span class="n"&gt;Fixed&lt;/span&gt; &lt;span class="n"&gt;Acidity&lt;/span&gt; &lt;span class="n"&gt;of&lt;/span&gt; &lt;span class="n"&gt;Wine&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  &lt;span class="mf"&gt;8.152537080405933&lt;/span&gt;
&lt;span class="n"&gt;Median&lt;/span&gt; &lt;span class="n"&gt;of&lt;/span&gt; &lt;span class="n"&gt;Fixed&lt;/span&gt; &lt;span class="n"&gt;Acidity&lt;/span&gt; &lt;span class="n"&gt;of&lt;/span&gt; &lt;span class="n"&gt;Wine&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  &lt;span class="mf"&gt;7.9&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Estimates of Variability
&lt;/h3&gt;

&lt;p&gt;Location is just one dimension in summarizing a feature. &lt;br&gt;
A second dimension, &lt;em&gt;variability&lt;/em&gt;, also referred to as dispersion, measures whether the data values are tightly clustered or spread out.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Standard Deviation and Related Estimates&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The most widely used estimates of variation are based on the differences, or deviations, between the estimate of location and the observed data.&lt;br&gt;
Simply averaging the deviations themselves tells us nothing, since the negative deviations offset the positive ones: the sum of the deviations from the mean is precisely zero. Instead, a simple approach is to take the average of the absolute values of the deviations from the mean.&lt;br&gt;
This is known as the &lt;em&gt;mean absolute deviation&lt;/em&gt; and is computed with the formula:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--W1CBstpt--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/i/1weawn4wjesyjgkr8uq3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--W1CBstpt--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/i/1weawn4wjesyjgkr8uq3.png" alt="MAD" width="800" height="156"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The best-known estimates of variability are the variance and the standard deviation, which are based on squared deviations.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s---w5UULru--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/i/7roeyi24zvfplvya5d1m.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s---w5UULru--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/i/7roeyi24zvfplvya5d1m.png" alt="Variance/SD" width="800" height="167"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The standard deviation is much easier to interpret than the variance since it is on the same scale as the original data.&lt;br&gt;
The variance and standard deviation are especially sensitive to outliers since they are based on the squared deviations.&lt;/p&gt;

&lt;p&gt;A robust estimate of variability is the &lt;em&gt;median absolute deviation&lt;/em&gt;:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--fUdO6Xte--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/i/jqjw5br8luo0tjcu41lk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--fUdO6Xte--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/i/jqjw5br8luo0tjcu41lk.png" alt="Median Absolute Deviation" width="800" height="69"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Estimates based on Percentiles&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A different approach to estimating dispersion is based on looking at the spread of the sorted data. Statistics based on sorted (ranked) data are referred to as order statistics.&lt;br&gt;
The most basic measure is the &lt;em&gt;range&lt;/em&gt;, but it is sensitive to outliers and not a great measure of dispersion.&lt;/p&gt;

&lt;p&gt;In a data set, the &lt;em&gt;Pth percentile&lt;/em&gt; is a value such that at least P percent of the values take on this value or less, and at least (100 – P) percent of the values take on this value or more.&lt;br&gt;
For example, to find the 80th percentile, sort the data. Then, starting with the smallest value, proceed 80 percent of the way to the largest value.&lt;/p&gt;
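&lt;p&gt;A short sketch using NumPy's &lt;code&gt;percentile&lt;/code&gt; (the data is arbitrary; note that NumPy, by default, linearly interpolates between the two nearest order statistics):&lt;/p&gt;

```python
import numpy as np

# 80th percentile of ten made-up values: NumPy sorts internally and walks
# 80% of the way from the smallest toward the largest value.
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
p80 = np.percentile(x, 80)
print(p80)   # 8.2
```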

&lt;p&gt;A common measurement of variability is the difference between the 25th percentile and the 75th percentile, called the &lt;em&gt;interquartile range (or IQR)&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;For very large data sets, calculating exact percentiles can be computationally very expensive since it requires sorting all the data values.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;
&lt;span class="c1"&gt;# Measures of Variability for Sulfur Dioxide
&lt;/span&gt;
&lt;span class="c1"&gt;# Standard Deviation
&lt;/span&gt;&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Standard Deviation for Sulfur Dioxide in Wine: &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;free sulfur dioxide&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;std&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;

&lt;span class="c1"&gt;# Inter-Quartile Range
&lt;/span&gt;&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;IQR of Sulfar Dioxide: &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;free sulfur dioxide&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;quantile&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;0.75&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;free sulfur dioxide&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;quantile&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;0.25&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

&lt;span class="c1"&gt;# Median Absolute Deviation (a robust measure)
&lt;/span&gt;&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Median Absolute Deviation: &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nf"&gt;mad&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;free sulfur dioxide&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]))&lt;/span&gt;

&lt;span class="c1"&gt;#Output:
&lt;/span&gt;&lt;span class="n"&gt;Standard&lt;/span&gt; &lt;span class="n"&gt;Deviation&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;Sulfur&lt;/span&gt; &lt;span class="n"&gt;Dioxide&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;Wine&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  &lt;span class="mf"&gt;10.46015696980973&lt;/span&gt;
&lt;span class="n"&gt;IQR&lt;/span&gt; &lt;span class="n"&gt;of&lt;/span&gt; &lt;span class="n"&gt;Sulfur&lt;/span&gt; &lt;span class="n"&gt;Dioxide&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  &lt;span class="mf"&gt;14.0&lt;/span&gt;
&lt;span class="n"&gt;Median&lt;/span&gt; &lt;span class="n"&gt;Absolute&lt;/span&gt; &lt;span class="n"&gt;Deviation&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  &lt;span class="mf"&gt;10.378215529539213&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;So in this article, we have explored the basics of the EDA (exploratory data analysis) process: estimates of location (central tendency) and measures of variability.&lt;br&gt;
Part-B will focus on data distributions, exploring categorical variables, and correlations.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Fin&lt;/em&gt;&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>statistics</category>
      <category>python</category>
      <category>datascience</category>
    </item>
    <item>
      <title>Natural Language Processing #1: Traditional Embeddings</title>
      <dc:creator>Kushal</dc:creator>
      <pubDate>Mon, 22 Jun 2020 13:12:50 +0000</pubDate>
      <link>https://dev.to/kushal_/natural-language-processing-1-traditional-embeddings-2pjc</link>
      <guid>https://dev.to/kushal_/natural-language-processing-1-traditional-embeddings-2pjc</guid>
      <description>&lt;p&gt;Hello there! You are about to embark on an exciting journey of Natural language processing, covering the nuances from a programmatic and mathematical standpoint. &lt;/p&gt;

&lt;p&gt;&lt;em&gt;Natural Language Processing&lt;/em&gt; has been at the helm for decades. It is no secret that significant effort went into building chatbots during the 1980s-90s that could communicate with a human and give out a pre-scripted response based on the question asked.&lt;br&gt;
This type of system is usually called a &lt;em&gt;Finite State Machine&lt;/em&gt; (FSM) or &lt;em&gt;Deterministic Finite Automaton&lt;/em&gt; (DFA).&lt;br&gt;
The major drawback of such a system was its rule-based implementation: a hierarchy of if-else conditionals that can become a complex structure to decode and update.&lt;/p&gt;

&lt;p&gt;The field of NLP is built on the foundation of deriving embeddings from text data, and in the process understanding the semantic and syntactic patterns in the data, to carry out various tasks like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Spelling Checker&lt;/li&gt;
&lt;li&gt;Sentence Autocomplete&lt;/li&gt;
&lt;li&gt;Document Summarization&lt;/li&gt;
&lt;li&gt;Question Answering &lt;/li&gt;
&lt;li&gt;Named Entity Recognition&lt;/li&gt;
&lt;li&gt;Machine Translation&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In this article, we will look into some of the most-used frequency-based embedding techniques and also delve into their pros and cons.&lt;/p&gt;

&lt;p&gt;There are two families of methodologies to derive a word embedding :&lt;br&gt;
  &lt;em&gt;1. Frequency-based methods&lt;/em&gt;&lt;br&gt;
  &lt;em&gt;2. Prediction based methods&lt;/em&gt;&lt;/p&gt;
&lt;h4&gt;
  
  
  Frequency-based Methods
&lt;/h4&gt;

&lt;p&gt;In this paradigm, a sentence is first tokenized into words, and then certain techniques are used to compute a weight for each word, in turn giving us a brief idea of its usage.&lt;/p&gt;

&lt;p&gt;Following are the schemes for frequency-based methods:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Count Vector ( Bag of Words Model)&lt;/li&gt;
&lt;li&gt;TF-IDF Vector (Term Frequency - Inverse Document Frequency)&lt;/li&gt;
&lt;li&gt;Co-Occurrence Vector&lt;/li&gt;
&lt;/ul&gt;
&lt;h5&gt;
  
  
  1. Count Vectors
&lt;/h5&gt;

&lt;p&gt;This method, popularly referred to as the Bag of Words model, is the simplest representation of text as numeric data. &lt;/p&gt;

&lt;p&gt;The process is as follows:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;A corpus of unique vocabulary words is built.&lt;/li&gt;
&lt;li&gt;Each word in the corpus is assigned a unique index.&lt;/li&gt;
&lt;li&gt;A count (weight) is assigned to each word in a sentence.&lt;/li&gt;
&lt;li&gt;The vector length of a sentence is equal to the vocabulary size of the corpus. Words that do not appear in the sentence are assigned a weight of &lt;strong&gt;0&lt;/strong&gt;.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;A BoW (&lt;em&gt;Bag of Words&lt;/em&gt;) model can be built using scikit-learn's &lt;em&gt;CountVectorizer&lt;/em&gt;.&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;


&lt;p&gt;This method is not recommended since it fails to learn the &lt;em&gt;semantic&lt;/em&gt; and &lt;em&gt;syntactic&lt;/em&gt; structure of the sentence.&lt;/p&gt;

&lt;p&gt;Additionally, the method also results in a sparse matrix which is difficult to compute and store.&lt;/p&gt;

&lt;h5&gt;
  
  
  2. TF-IDF Vectors
&lt;/h5&gt;

&lt;p&gt;TF-IDF (Term Frequency - Inverse Document Frequency) is a weighting scheme that incorporates two formulas.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Term-Frequency&lt;/strong&gt;: Measure of Occurrence of the word &lt;em&gt;'t'&lt;/em&gt; in the &lt;br&gt;
document &lt;em&gt;'d'&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--sqLUTlCo--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/i/tavayiu5xmpbbdln0w9t.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--sqLUTlCo--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/i/tavayiu5xmpbbdln0w9t.png" alt="TermFrequency" width="800" height="159"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Inverse Document Frequency&lt;/strong&gt;: IDF is a measure of how important a term is, i.e., how rare or frequent its occurrence is across the documents/sentences.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--GpZPNktp--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/i/hr93p8715ufl7pdfa6i4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--GpZPNktp--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/i/hr93p8715ufl7pdfa6i4.png" alt="IDF" width="800" height="130"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Below is the code, using &lt;em&gt;scikit-learn's TfidfVectorizer&lt;/em&gt;&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;
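&lt;p&gt;A minimal sketch with &lt;em&gt;TfidfVectorizer&lt;/em&gt; (the toy corpus is an assumption for illustration):&lt;/p&gt;

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# "the" occurs in every document, so its IDF (and hence its weight)
# is lower than that of a word appearing in only one document.
corpus = ["the cat sat", "the dog sat", "the cat ran away"]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)

vocab = vectorizer.vocabulary_   # word -> column index
idf = vectorizer.idf_
print(idf[vocab["the"]] < idf[vocab["ran"]])  # common words score lower
```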


&lt;p&gt;TF-IDF gives larger values to less frequent words; the score is high when both the TF and IDF values are high, i.e., the word is rare across all the documents combined but frequent within a single document.&lt;/p&gt;

&lt;h5&gt;
  
  
  3. Co-Occurrence Vectors
&lt;/h5&gt;

&lt;p&gt;The big idea – Similar words tend to occur together and will have a similar context.&lt;/p&gt;

&lt;p&gt;There are mainly two concepts to understand for building a co-occurrence matrix:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Co-occurrence&lt;/li&gt;
&lt;li&gt;Context Window&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;em&gt;Co-occurrence&lt;/em&gt; – For a given corpus, the co-occurrence of a pair of words say w1 and w2 is the number of times they have appeared together in a Context Window.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Context Window&lt;/em&gt; – A context window is specified by a number (its size) and a direction; it determines which neighboring words count as the context of a given word.&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;
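&lt;p&gt;A minimal hand-rolled sketch of counting co-occurrences within a symmetric context window (the window size and toy corpus are assumptions for illustration):&lt;/p&gt;

```python
from collections import defaultdict

def co_occurrence(sentences, window=1):
    """Count, for each word pair, how often they appear within
    `window` positions of each other (symmetric context window)."""
    counts = defaultdict(int)
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, word in enumerate(tokens):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[(word, tokens[j])] += 1
    return counts

counts = co_occurrence(["he is happy", "she is happy"])
print(counts[("is", "happy")])   # "is" and "happy" co-occur in both sentences
```

&lt;p&gt;These pairwise counts fill the co-occurrence matrix; factorizing it (e.g., with Truncated SVD) then yields dense vectors.&lt;/p&gt;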


&lt;p&gt;It preserves the semantic relationship between words to some extent. Further down, a co-occurrence matrix can be factorized using a Truncated SVD Transformation for dense vector representations.&lt;/p&gt;

&lt;p&gt;In conclusion, we covered three base methods for frequency-based word embeddings: BoW, TF-IDF, and the co-occurrence matrix.&lt;/p&gt;

</description>
      <category>datascience</category>
      <category>nlp</category>
      <category>machinelearning</category>
      <category>statistics</category>
    </item>
  </channel>
</rss>
