Hey everyone!
Today marks the beginning of my journey to build an AI-powered app that can analyze contracts and help users understand how risky they might be.
I'm a 4th-year apprentice developer, so my main goal with this project is to learn by doing (and hopefully not break too many things in the process). So, let's jump into what I've been up to today!
Choosing the Tech Stack
After a lot of research (and a few cups of coffee), I've finally decided on the tech stack I'll be using for this project:
Backend:
- Python (with Flask) for the API
- Why Flask? It's lightweight, super easy to get started with, and perfect for building quick APIs to serve AI models. Today I learned how to set up a basic Flask API endpoint, and honestly, it was way easier than I expected!
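Here's a minimal sketch of that first endpoint, assuming a JSON POST body containing the contract text (the /analyze route and the response fields are placeholders I picked for now; the model will be plugged in later):

```python
from flask import Flask, request, jsonify

app = Flask(__name__)

# Placeholder endpoint: later this is where the AI model will score the contract.
@app.route('/analyze', methods=['POST'])
def analyze_contract():
    data = request.get_json(force=True)
    contract_text = data.get('text', '')
    return jsonify({
        'length': len(contract_text),
        'risk_score': None,  # to be filled in once the model is wired up
    })

if __name__ == '__main__':
    app.run(debug=True)
```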
Frontend:
- Next.js (React framework) for the UI
- Next.js will allow us to build a fast and SEO-friendly frontend. It's got server-side rendering, which means faster load times and better performance.
AI Model:
- Python (again, because who doesn't love Python, right?)
- The AI will use NLP (Natural Language Processing) models. Specifically, I'm looking into BERT or GPT-like models from the Hugging Face library. These models are like super-smart language nerds that can understand and analyze human text.
Understanding Transformers and NLP Models
What Are Transformers?
Transformers are a type of deep learning model that revolutionized the field of NLP. They were introduced in the landmark 2017 paper "Attention Is All You Need" by Vaswani et al. Unlike older models such as RNNs (Recurrent Neural Networks) or LSTMs (Long Short-Term Memory networks), transformers process long sequences of text in parallel, which makes them much faster to train and better at capturing relationships between distant words.
Key Concept: The Attention Mechanism
The secret sauce behind transformers is the attention mechanism. Attention allows the model to focus on the most relevant parts of the input text, regardless of their position in the sequence. Think of it like reading a contract: instead of reading every single word with equal care, your brain automatically zooms in on the important parts, like "hidden fees" or "data sharing."
Here's how it works in simple terms:
- Understanding Context: Attention helps the model understand the relationship between different words in a sentence, even if they're far apart.
  - Example: In the sentence "The cat, which was very hungry, finally ate the food," the model knows that "ate" relates to "cat" even though there are many words in between.
- Calculating Attention Scores: The model assigns an attention score to each word in the sequence. Higher scores mean more relevance to the task at hand.
  - If you're analyzing a contract for risky clauses, the model might give high scores to words like "penalty," "termination," or "data sharing."
A Quick Peek at the Math
Each word in the input is transformed into a vector (a list of numbers) using something called word embeddings. The model then uses three vectors for each word:
- Query (Q): what the current word is looking for.
- Key (K): what each word offers for other words to match against.
- Value (V): the actual information the word carries.
These vectors are used to calculate the attention scores, which determine how much focus should be given to each word in the sentence.
The formula looks like this:
Attention(Q, K, V) = softmax(QK^T / √d_k) V
Where:
- QK^T is the dot product of the query and key vectors.
- d_k is the dimension of the key vectors (used to scale the scores).
- softmax is a function that converts the scores into probabilities that sum to 1.
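To make that concrete, here's a tiny NumPy sketch of the formula with made-up vectors (a single attention head, no learned weight matrices, just the raw computation):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                        # QK^T, scaled by sqrt(d_k)
    scores = scores - scores.max(axis=-1, keepdims=True)   # stabilize before exponentiating
    weights = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)  # softmax
    return weights @ V                                     # weighted sum of the values

# Three "words", each represented by a made-up 4-dimensional embedding
np.random.seed(42)
Q = K = V = np.random.rand(3, 4)

output = scaled_dot_product_attention(Q, K, V)
print(output.shape)  # (3, 4): one context-aware vector per word
```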
How NLP Models Like BERT and GPT Use Transformers
Transformers are the backbone of popular NLP models like BERT (Bidirectional Encoder Representations from Transformers) and GPT (Generative Pre-trained Transformer). Here's a breakdown of each:
BERT (Bidirectional Encoder Representations from Transformers)
- Purpose: BERT is great for understanding the context of words in both directions (left-to-right and right-to-left). This makes it perfect for tasks like text classification, question answering, and contract analysis.
- Architecture: BERT is made up of encoders only, which means it's focused on understanding the input text deeply.
- Training: It's pre-trained on a huge amount of text data with two tasks (there's a tiny demo of the first one right after this list):
  - Masked Language Modeling: predicting missing words in a sentence.
    - Example: "The [MASK] is in the garden" → "The cat is in the garden."
  - Next Sentence Prediction: determining whether one sentence logically follows another.
    - Example: "He opened the door." → "He walked into the room." (Yes) vs. "She went to the store." (No)
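Here's what masked language modeling looks like with a pre-trained BERT via the Hugging Face pipeline API (this downloads the bert-base-uncased weights the first time you run it):

```python
from transformers import pipeline

# Ask BERT to fill in the [MASK] token and show its top guesses with their scores.
fill_mask = pipeline('fill-mask', model='bert-base-uncased')

for prediction in fill_mask("The [MASK] is in the garden."):
    print(prediction['token_str'], round(prediction['score'], 3))
```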
GPT (Generative Pre-trained Transformer)
- Purpose: GPT is designed for generating text. It's excellent for tasks like text completion, content creation, and even conversational AI.
- Architecture: GPT uses decoders only, which means it's focused on generating new text based on given input.
- Training: GPT is trained on a vast dataset to predict the next word in a sentence.
  - Example: "Once upon a time..." → "Once upon a time, there was a brave knight."
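That next-word-prediction loop is easy to try with the small GPT-2 checkpoint from Hugging Face (the prompt and token count here are just for illustration):

```python
from transformers import pipeline

# GPT-2 keeps predicting the next token until the length limit is reached.
generator = pipeline('text-generation', model='gpt2')

result = generator("Once upon a time", max_new_tokens=15, num_return_sequences=1)
print(result[0]['generated_text'])
```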
Key Difference:
- BERT is bidirectional (understands the full context).
- GPT is unidirectional (predicts the next word based on past context).
How I Plan to Build the AI for Contract Analysis
1. Data Preparation
- Collect and clean a dataset of contracts.
- Label each contract with a risk level (I'm putting together a CSV dataset for this; a quick sanity-check sketch of it is below).
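A small pandas sketch of how I imagine checking and cleaning that CSV. The column names text and label, and the 0-4 risk scale, are assumptions on my side, chosen to match the 5-class model further down:

```python
import pandas as pd

# Assumed layout: one 'text' column with the raw contract and one 'label'
# column with a risk level from 0 (harmless) to 4 (very risky).
df = pd.read_csv('contracts.csv')

print(df.columns.tolist())         # expect something like ['text', 'label']
print(df['label'].value_counts())  # check the risk classes aren't too imbalanced

# Basic cleanup: drop empty rows and normalise whitespace in the contract text.
df = df.dropna(subset=['text', 'label'])
df['text'] = df['text'].str.replace(r'\s+', ' ', regex=True).str.strip()

# Overwrite in place so the training script below can load the cleaned file.
df.to_csv('contracts.csv', index=False)
```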
2. Model Selection
- For contract analysis, BERT or a close variant like RoBERTa could be a good fit because they're great at understanding context.
- Use the Hugging Face Transformers library to access these models.
3. Training the Model
```python
from transformers import (
    BertTokenizer,
    BertForSequenceClassification,
    Trainer,
    TrainingArguments,
)
from datasets import load_dataset

# Load the dataset (expects 'text' and 'label' columns, labels 0-4)
dataset = load_dataset('csv', data_files='contracts.csv')

# load_dataset with a single CSV only creates a 'train' split,
# so carve out a test split ourselves
dataset = dataset['train'].train_test_split(test_size=0.2)

# Tokenize the contract text
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

def tokenize_function(example):
    return tokenizer(example['text'], truncation=True)

tokenized_datasets = dataset.map(tokenize_function, batched=True)

# Load a pre-trained BERT model with a 5-class classification head
model = BertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=5)

# Set up training arguments
training_args = TrainingArguments(
    output_dir='./results',
    evaluation_strategy='epoch',
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    num_train_epochs=3,
)

# Train the model (passing the tokenizer so batches get padded dynamically)
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets['train'],
    eval_dataset=tokenized_datasets['test'],
    tokenizer=tokenizer,
)
trainer.train()
```
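And here's a rough idea of how the fine-tuned model could then score a new clause (the example clause and the 0-4 risk interpretation are placeholders; it reuses the tokenizer and model from the training script above):

```python
import torch

# Score a single clause with the model we just fine-tuned.
clause = "The provider may share user data with third parties without notice."
inputs = tokenizer(clause, truncation=True, return_tensors='pt')

model.eval()
with torch.no_grad():
    logits = model(**inputs).logits

probs = torch.softmax(logits, dim=-1)[0]
# Assumed scale: 0 = harmless ... 4 = very risky
print("Predicted risk level:", int(probs.argmax()), "- confidence:", round(float(probs.max()), 3))
```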
Don't worry, the real code will be better xD
Thank you for reading this article! Don't hesitate to give me some advice, and like this post if you enjoyed it!
0x2e73