DEV Community

angu10
angu10

Posted on • Edited on

Exploring TAPAS: Analyzing Clinical Trial Data with Transformers

Introduction:

Welcome to the world of Transformers, where cutting-edge natural language processing models are revolutionizing the way I interact with data. In this series of blogs, I will embark on a journey to explore and understand the capabilities of the TAPAS (Tabular Pre-trained Language Model) model, which is designed to extract valuable insights from tabular data. To kick things off, I'll delve into the basics of TAPAS and see it in action on a real-world dataset.

Understanding TAPAS:

TAPAS is a powerful language model developed by Google that specializes in processing tabular data. Unlike traditional models, TAPAS can handle structured data seamlessly, making it a game-changer for tasks involving tables and spreadsheets. With a token size of 512k, TAPAS can process large datasets efficiently, making it a valuable tool for data analysts and scientists.

My Dataset:

For this introductory exploration, I will work with a clinical trial dataset [Clinicaltrails.gov]. To start, I load the dataset and create a data frame containing the "label" column. This column contains information about gender distribution in clinical trials. I'll be using this data to ask questions and obtain insights.

from transformers import pipeline,TapasTokenizer, TapasForQuestionAnswering
import pandas as pd
import datasets

# Load the dataset (only once)
dataset = datasets.load_dataset("Kira-Asimov/gender_clinical_trial")

# Create the clinical_trials_data DataFrame with just the "label" column (only once)
clinical_trials_data = pd.DataFrame({
    "id": dataset["train"]["id"],
    "label": dataset["train"]["label"],
})

clinical_trials_data = clinical_trials_data.head(100)


Enter fullscreen mode Exit fullscreen mode

Asking Questions with TAPAS:

The magic of TAPAS begins when I start asking questions about our data. In this example, I want to know how many records are in the dataset and how many of them are gender-specific (Male and Female). I construct queries like:

"How many records are in total?"
"How many 'Male' only gender studies are in total?"
"How many 'Female' only gender studies are in total?"

Using TAPAS to Answer Questions:

I utilize the "google/tapas-base-finetuned-wtq" model and its associated tokenizer to process our questions and tabular data. TAPAS tokenizes the data, extracts answers, and even performs aggregations when necessary.

counts = {}
answers = []

def TAPAS_model_learning(clinical_trials_data):
    model_name = "google/tapas-base-finetuned-wtq"
    model = TapasForQuestionAnswering.from_pretrained(model_name)
    tokenizer = TapasTokenizer.from_pretrained(model_name)


    queries = [
        "How many records are in total ?",
        "How many 'Male' only gender studies are in total ?",
        "How many 'Female' only gender studies are in total ?",
    ]

    for query in queries:
            model_name = "google/tapas-base-finetuned-wtq"
            model = TapasForQuestionAnswering.from_pretrained(model_name)
            tokenizer = TapasTokenizer.from_pretrained(model_name)
            # Tokenize the query and table
            inputs = tokenizer(table=clinical_trials_data, queries=query, padding="max_length", return_tensors="pt", truncation=True)

            # Get the model's output
            outputs = model(**inputs)
            predicted_answer_coordinates, predicted_aggregation_indices = tokenizer.convert_logits_to_predictions(
                inputs, outputs.logits.detach(), outputs.logits_aggregation.detach()
            )

            # Initialize variables to store answers for the current query
            current_answers = []

            # Count the number of cells in the answer coordinates
            count = 0
            for coordinates in predicted_answer_coordinates:
                count += len(coordinates)
                # Collect the cell values for the current answer
                cell_values = []
                for coordinate in coordinates:
                    cell_values.append(clinical_trials_data.iat[coordinate])

                current_answers.append(", ".join(cell_values))

            # Check if there are no matching cells for the query
            if count == 0:
                current_answers = ["No matching cells"]
            counts[query] = count
            answers.append(current_answers)
    return counts,answers
Enter fullscreen mode Exit fullscreen mode

Evaluating TAPAS Performance:

Now, let's see how well TAPAS performs in answering our questions. I have expected answers for each question variation, and I calculate the error percentage to assess the model's accuracy.

# Prepare your variations of the same question and their expected answers
question_variations = {
    "How many records are in total ?": 100,
    "How many 'Male' only gender studies are in total ?": 3,
    "How many 'Female' only gender studies are in total ?":9,
}



# Use TAPAS to predict the answer based on your tabular data and the question
predicted_count,predicted_answer = TAPAS_model_learning(clinical_trials_data)
print(predicted_count)
# Check if any predicted answer matches the expected answer
for key,value in predicted_count.items():
    error = question_variations[key] - value


    # Calculate the accuracy percentage
    error_percentage = (error / question_variations[key]) * 100

    # Print the results
    print(f"{key}: Model Value: {value}, Excepted Value: {question_variations[key]}, Error Percentage: {error_percentage :.2f}%")

Enter fullscreen mode Exit fullscreen mode

Results and Insights:

The output reveals how TAPAS handled our queries:

For the question "How many records are in total?", TAPAS predicted 69 records, with an error percentage of 31.00% compared to the expected value of 100 records.

For the question "How many 'Male' only gender studies are in total?", TAPAS correctly predicted 3 records, with a perfect match to the expected value.

For the question "How many 'Female' only gender studies are in total?", TAPAS predicted 2 records, with a significant error percentage of 77.78% compared to the expected value of 9 records.

Conclusion and Future Exploration:

In this first blog of our TAPAS exploration series, I introduced you to the model's capabilities and showcased its performance on a real dataset. I observed both accurate and less accurate predictions, highlighting the importance of understanding and fine-tuning the model for specific tasks.

In our future blogs, I will delve deeper into TAPAS, exploring its architecture, fine-tuning techniques, and strategies for improving its accuracy on tabular data. Stay tuned as I unlock the full potential of TAPAS for data analysis and insights.

Top comments (0)