Project Journey #2: πŸ› οΈ Coding, Failing, and Learning with AI Law Shield βš–οΈ

Welcome back to my AI journey, where I stumbled, learned, and maybe cried a little! πŸ˜‚

1. Diving into the Code: The Good, The Bad, and The Ugly

This time, I got my hands dirty coding the first version of my AI model. Spoiler alert: it reached an accuracy of just 0.18945 — about 19%, which for a five-class problem is no better than random guessing! 🎯 (Ouch! I guess even my toaster could do better 🤖🍞).

Let's dive into the code and see what went wrong.


# Initializing BERT for sequence classification
from transformers import BertTokenizer, BertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
# num_labels=5 matches the 1-to-5 danger scale used for the contracts
model = BertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=5)


What’s happening here?
I'm using BERT, the superstar transformer model, to classify the danger level of legal contracts on a scale of 1 to 5. πŸ“„

def preprocess_data(dataframe, tokenizer):
    # 'texte' holds the contract text, 'niveau_de_danger' its 1-5 danger level
    texts = dataframe['texte'].tolist()
    # Shift labels from 1-5 to 0-4, since the model expects zero-indexed classes
    labels = [label - 1 for label in dataframe['niveau_de_danger'].tolist()]
    encoded_data = tokenizer(texts, padding=True, truncation=True, return_tensors='pt')
    return encoded_data, labels

Why preprocess data?
I’ve tokenized the contract texts for BERT to digest (like breaking down a complex contract into easier-to-understand clauses). 🍽️
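One piece these snippets don't show is how train_loader gets built from this output. Here's a minimal sketch, assuming the (input_ids, attention_mask, labels) batch layout that the training loop below unpacks — train_df is a placeholder for the training DataFrame:

import torch
from torch.utils.data import TensorDataset, DataLoader

# Minimal sketch: turn preprocess_data's output into the train_loader used
# below. train_df is a hypothetical DataFrame; the tuple order matches how
# train_model unpacks each batch.
encoded_data, labels = preprocess_data(train_df, tokenizer)
dataset = TensorDataset(
    encoded_data['input_ids'],
    encoded_data['attention_mask'],
    torch.tensor(labels),
)
train_loader = DataLoader(dataset, batch_size=16, shuffle=True)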

2. Training My Model: And… It Crashed and Burned πŸ’₯


def train_model(model, train_loader, num_epochs=5):
    device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')
    model.to(device)
    optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

    for epoch in range(num_epochs):
        model.train()
        total_loss, correct, total = 0.0, 0, 0
        for batch in train_loader:
            optimizer.zero_grad()
            input_ids, attention_mask, labels = [b.to(device) for b in batch]
            outputs = model(input_ids, attention_mask=attention_mask, labels=labels)
            loss = outputs.loss
            loss.backward()
            optimizer.step()
            # Accumulate loss and accuracy over the whole epoch -- my first
            # version only reported the metrics of the final batch
            total_loss += loss.item()
            correct += (outputs.logits.argmax(dim=-1) == labels).sum().item()
            total += labels.size(0)
        print(f'Epoch {epoch + 1}/{num_epochs}, '
              f'Loss: {total_loss / len(train_loader):.4f}, '
              f'Accuracy: {correct / total:.4f}')


This function trains BERT to classify contracts, but let’s just say it didn’t pass the bar exam 😬. The low accuracy told me that my model was basically guessing randomly.

3. Evaluating the Model: Reality Check πŸ§‘β€βš–οΈ


from sklearn.metrics import classification_report

def evaluate_model(model, test_loader):
    # Reuse whatever device the model ended up on after training
    device = next(model.parameters()).device
    model.eval()
    predictions = []
    true_labels = []
    with torch.no_grad():
        for batch in test_loader:
            input_ids, attention_mask, labels = [b.to(device) for b in batch]
            outputs = model(input_ids, attention_mask=attention_mask)
            predicted = outputs.logits.argmax(dim=-1)
            predictions.extend(predicted.cpu().tolist())
            true_labels.extend(labels.cpu().tolist())
    return classification_report(true_labels, predictions)


After running this, I got a brutal classification report that screamed, "You need more data, buddy!" πŸ“‰
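In hindsight, a majority-class baseline would have told me the same thing faster: with five classes, random guessing sits around 20%, so my ~19% means the model learned nothing. A quick sketch using scikit-learn's DummyClassifier (train_df and test_df are hypothetical splits of my dataset):

from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Sanity-check sketch: if BERT can't beat a classifier that always predicts
# the most frequent class, it hasn't learned anything. train_df and test_df
# are hypothetical splits of my dataset.
dummy = DummyClassifier(strategy='most_frequent')
dummy.fit(train_df['texte'], train_df['niveau_de_danger'])
baseline_acc = accuracy_score(test_df['niveau_de_danger'],
                              dummy.predict(test_df['texte']))
print(f'Majority-class baseline accuracy: {baseline_acc:.4f}')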

4. The Root Cause: My Dataset Needs a Lawyer-Grade Makeover πŸ“Š

After some reflection, I realized the real issue was my dataset. It’s like trying to learn law from a pamphlet instead of an encyclopedia. πŸ“š

I need to get my hands on a large, reliable, and indexed dataset that can better train the model. If anyone knows where to find high-quality legal datasets, I’m all ears! πŸ‘‚
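Whatever dataset I end up with, it will need to fit the schema the pipeline already expects: a 'texte' column with the contract text and a 'niveau_de_danger' label from 1 to 5. A minimal loading sketch ('contracts.csv' is a placeholder name):

import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical loading step: contracts.csv stands in for whatever dataset
# I find. Stratifying keeps the 1-5 danger levels balanced across splits.
df = pd.read_csv('contracts.csv')
train_df, test_df = train_test_split(
    df, test_size=0.2, stratify=df['niveau_de_danger'], random_state=42
)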

5. Annotating Contracts (A Work in Progress) ✍️


def annotate_contract(model, tokenizer, contract_text):
    inputs = tokenizer(contract_text, padding=True, truncation=True, return_tensors='pt')
    with torch.no_grad():
        outputs = model(**inputs)
        predicted = outputs.logits.argmax(dim=-1)
    # Shift the 0-4 class index back to the original 1-5 danger scale
    danger_level = predicted.item() + 1
    # analyze_problematic_sections is a separate helper (still a work in progress)
    problematic_sections = analyze_problematic_sections(contract_text, danger_level)

    return {
        'danger_level': danger_level,
        'problematic_sections': problematic_sections
    }


This function is supposed to analyze the legal contract and predict the danger level, but as you might guess, it’s not ready to replace your lawyer just yet. 🧐
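For the curious, here's roughly how it would be called — the contract clause below is invented for illustration:

# Hypothetical usage; the clause below is made up for illustration.
sample_contract = (
    "The employee waives all rights to overtime compensation and "
    "agrees to unlimited personal liability for any damages."
)
result = annotate_contract(model, tokenizer, sample_contract)
print(f"Danger level: {result['danger_level']}/5")
for section in result['problematic_sections']:
    print(f"- {section}")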

Next Steps: A Better Dataset and Model Tuning πŸ“ˆ

I’m planning to go on a treasure hunt for a better dataset. Once I have more data, I’ll revisit model training, tweak hyperparameters, and hopefully get a model that can actually understand legal jargon! βš–οΈ
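For the hyperparameter part, I'll probably start from the ranges commonly used when fine-tuning BERT. A sketch of the search space — starting points, not tested results:

from itertools import product

# Hypothetical search space: typical ranges for BERT fine-tuning,
# not results I've validated yet.
search_space = {
    'lr': [2e-5, 3e-5, 5e-5],
    'batch_size': [8, 16, 32],
    'num_epochs': [3, 4, 5],
}

for lr, batch_size, num_epochs in product(*search_space.values()):
    print(f'Would train with lr={lr}, batch_size={batch_size}, epochs={num_epochs}')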

Until next time, may your accuracy be ever in your favor! πŸš€

