Author: Harpreet Sahota (Hacker in Residence at Voxel51)
Welcome to Voxel51's weekly digest of the latest trending AI, machine learning and computer vision news, events and resources! Subscribe to the email version.
📰 The Industry Pulse
🏊🏼‍♀️🏆🤖 How AI is applied at the 2024 Olympic Games in Paris
The 2024 Paris Olympics aren't just a showcase of athletic prowess; they're also a venue for cutting-edge technology. Artificial intelligence is diving into the Games like never before, transforming everything from athlete support to how we watch the events.
1. Athlete Support and Performance
AthleteGPT, an AI chatbot accessible through the Athlete365 mobile app, will provide 24/7 assistance to athletes, answering questions about venues, events, and other aspects of the Games.
Intel's 3D athlete tracking (3DAT) is also used to analyze athletes' biomechanics, potentially leading to improved performance, closer competition, and new records.
2. Refereeing and Real-time Data
While AI is already used in some sports for decision-making, its implementation in Olympic events varies. Sports with larger budgets and more data, like football, are more likely to incorporate AI in refereeing. However, challenges remain for sports with less funding or more complex environments, such as water polo.
3. Enhancing Viewer Experience
Spectators are, by sheer numbers, the biggest participants at any Olympic Games, and AI is changing how they engage with the event. Broadcasters can now display more detailed statistics like acceleration, top speeds, and stride lengths. AI-powered personalized highlights are also being developed, allowing viewers to access the specific moments or performances they're interested in.
👥 OpenAI is nothing without its people...but what is it without its leadership?
If you were following along with the Sam Altman drama on X late last year, you might have come across posts from OpenAI employees stating that OpenAI is nothing without its people.
However, three key figures, including co-founder John Schulman, departed or stepped back from the company last week.
John Schulman, one of OpenAI's co-founders, has left the company after nine years to join rival AI firm Anthropic.
Schulman announced his departure on social media, stating that he decided to focus on "AI alignment" and pursue more hands-on technical work at Anthropic.
Along with Schulman's departure, OpenAI's product manager Peter Deng has also left the company, while president Greg Brockman is taking an extended leave of absence. Schulman emphasized that his decision to leave was not due to a lack of support for alignment research at OpenAI but to gain new perspectives and work alongside people deeply engaged in his areas of interest.
Schulman was vital in developing ChatGPT and led OpenAI's reinforcement learning organization. His departure leaves only three of the original 11 co-founders still at OpenAI: CEO Sam Altman, president Greg Brockman, and language head Wojciech Zaremba.
This news comes amid ongoing controversies surrounding OpenAI, including a recent lawsuit by Elon Musk against the company and Altman.
Schulman is joining Anthropic, a company founded in 2021 by former OpenAI employees and considered a fierce rival of OpenAI.
Déjà Vu or Different This Time? The AI Bubble Debate
The whispers are getting louder. Is the AI boom just another tech bubble ready to burst, leaving a trail of shattered dreams (and investments) like the dot-com crash? Kelvin Mu doesn't shy away from this question. He takes us back to the frenzy of the late 1990s when internet startups with more hype than profit dominated headlines. Remember Pets.com?
Like back then, money is pouring into AI, valuations are skyrocketing, and everyone's scrambling for a piece of the action. We see the parallels: the infrastructure giants (NVIDIA now, Cisco then), the rush of new companies, and the ever-present fear of missing out. Similarities between the cycles include:
- Similar ecosystem structures (infrastructure, enablers, applications)
- Occurring during equity bull markets
- Significant infrastructure investments
- High VC interest and valuations
But Mu doesn't stop at surface similarities. He digs deeper, uncovering crucial differences that suggest this time might be different. AI is already generating real revenue, not just promises, and the underlying technology is far more mature, capable of delivering tangible value from day one. What's different this time:
- AI companies are generating revenue earlier and more sustainably
- The current economic environment is less favourable, leading to more cautious investing
- AI financing comes primarily from private markets and big tech, not public markets
- AI business models are generally more sustainable with more reasonable valuations
However, echoes of the past linger. Overinvestment is almost inevitable with revolutionary technologies. Remember the fibre optic cables laid but left unused during the dot-com boom? Mu argues that even if some AI ventures fail, the overall impact on society will be transformative. So, are we in a bubble? Mu's answer is a cautious "not so fast," based on the following:
- More sustainable revenue profiles for AI companies
- A more cautious investment environment
- Reasonable valuations compared to the dot-com era
- The AI cycle is in its early stages
💎 GitHub Gems
I hope y’all don’t mind a bit of shameless promo this week!
Last week, we announced our Data Centric Visual AI (DCVAI) Competition hosted on Hugging Face.
I created a GitHub repo to support your submission for the competition. It contains an example submission for the challenge and a template for structuring your submissions.
I’m also creating a course for the Coursera platform titled Hands-on Data Centric Visual AI with FiftyOne. During course development, I’ll flesh out my ideas “in public” in this GitHub repo. This repo will have all the coding lessons from the course, which will show you how to apply Data Centric AI techniques for cleaning and curating Visual AI datasets using FiftyOne.
Be sure to ⭐️ Star both repositories and become a watcher so you don’t miss any of the content as it updates!
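As a taste of the kind of workflow the course will walk through, here's a minimal sketch (my own, not lesson code from the repo) that loads FiftyOne's bundled quickstart dataset and surfaces low-confidence predictions for review:

```python
import fiftyone as fo
import fiftyone.zoo as foz
from fiftyone import ViewField as F

# Load the small sample dataset that ships with FiftyOne
dataset = foz.load_zoo_dataset("quickstart")

# A simple data-centric pass: flag predictions the model was unsure about
low_conf_view = dataset.filter_labels("predictions", F("confidence") < 0.3)

# Open the App to visually inspect (and fix) the flagged samples
session = fo.launch_app(low_conf_view)
```

From there, you can tag, correct, or exclude the flagged samples directly in the App before retraining.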
📙 Good Reads
LLMs' "Refusal Button" Found (and It's Hackable)
I came across some work from Neel Nanda’s ML Alignment and Theory Scholars Program, titled Refusal in LLMs is mediated by a single direction, which examines how refusal behaviour is represented in the residual stream of transformer models.
In simple terms, the residual stream is like the main information highway in a transformer model, carrying information from one layer to the next. It represents the input data and gets refined as it moves through the model's layers.
The input text is transformed as it passes through the model's layers, from basic token embeddings to more complex features. In the initial layers, the residual stream mainly depends on recent tokens; in later layers, the broader context becomes more important, and the model can access and manipulate more complex, refined representations.
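To make the "information highway" picture concrete, here's a schematic pre-norm transformer block in PyTorch. This is a generic sketch, not any particular model's implementation; the point is that each sublayer adds its output to the residual stream rather than replacing it:

```python
import torch.nn as nn

class TransformerBlock(nn.Module):
    """Schematic pre-norm transformer block: each sublayer *adds* its
    output to the residual stream rather than replacing it."""

    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ln2 = nn.LayerNorm(d_model)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),
            nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )

    def forward(self, x):
        # x is the residual stream: (batch, seq_len, d_model)
        h = self.ln1(x)
        attn_out, _ = self.attn(h, h, h)
        x = x + attn_out               # attention writes into the stream
        x = x + self.mlp(self.ln2(x))  # the MLP writes into the stream
        return x                       # the refined stream flows onward
```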
Key Findings from the article:
- Single Refusal Direction: The authors found that a single direction in the model's activation space, specifically within the residual stream, mediates refusal across diverse harmful prompts. This "refusal direction" emerges as a key feature within the evolving representation of the input and acts as a bottleneck, determining if the model should refuse a request.
- Manipulating Refusal: In their studies, the authors effectively bypassed the model's safety mechanisms by directly manipulating the residual stream (a minimal sketch follows this list):
  - Ablating the Refusal Direction: Preventing model components from writing to this "refusal direction" within the residual stream hindered the model's ability to refuse harmful requests.
  - Injecting the Refusal Direction: Artificially adding this direction to the residual stream when processing harmless prompts caused the model to refuse them.
- Weight Orthogonalization: The authors propose a method to achieve persistent ablation by modifying the weights of components that write to the residual stream. By orthogonalizing these weights with respect to the "refusal direction," the model is permanently prevented from expressing that feature. Here’s an interesting notebook that shows you exactly how to do that.
- Widespread Phenomenon: This single-direction refusal mechanism was observed across various open-source LLMs, regardless of model size or family.
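Here's what those three interventions look like at the tensor level, as a minimal PyTorch sketch. It assumes you've already extracted a unit-norm refusal direction `r` (the paper derives it from the difference of mean activations on harmful versus harmless prompts); the function names are my own, not the authors':

```python
import torch

def ablate_direction(x: torch.Tensor, r: torch.Tensor) -> torch.Tensor:
    """Remove the component of the residual stream `x` along `r`.
    x: (..., d_model); r: (d_model,) with unit norm."""
    return x - (x @ r)[..., None] * r

def inject_direction(x: torch.Tensor, r: torch.Tensor, alpha: float) -> torch.Tensor:
    """Add the refusal direction with strength `alpha`, inducing
    refusals even on harmless prompts."""
    return x + alpha * r

def orthogonalize_weights(W: torch.Tensor, r: torch.Tensor) -> torch.Tensor:
    """Weight orthogonalization: zero out each column's component along
    `r` so the layer can never write the refusal direction.
    W: (d_model, d_in), a matrix that writes into the residual stream."""
    return W - torch.outer(r, r @ W)
```

Applying `orthogonalize_weights` to every matrix that writes into the residual stream gives the persistent ablation described above, with no inference-time hooks needed.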
Implications:
- Fragile Safety Mechanisms: This article highlights the fragility of current safety fine-tuning methods in open-source LLMs. Even small, targeted interventions within the residual stream can undermine these safeguards.
- Jailbreaking LLMs: The ability to manipulate refusal through weight modification, effectively controlling the presence or absence of this "refusal direction" in the residual stream, offers a simple and effective method for jailbreaking open-source chat models. This bypasses the need for further fine-tuning or complex inference-time interventions.
- Ethical Considerations: While not introducing entirely new risks, this research raises concerns about the accessibility of jailbreaking techniques and their potential misuse.
You can read the full paper on arXiv.
🎙️ Good Listens: How to Approach Working With Datasets
This week's Good Listen features an episode of the Chaos to Clarity podcast with Jason Corso, the Chief Science Officer at Voxel51, as the guest. Corso explores key topics such as the critical importance of tail performance in AI systems, strategies for effective data management, and innovative approaches to tackling high-dimensional data challenges. Here are the key pieces of alpha I took away from the conversation:
- Importance of tail performance: Jason emphasizes that real performance in AI systems comes from doing well at the tails of data distributions, not just the mean plus or minus a couple of standard deviations.
- Data quality discovery: Jason shares an anecdote about a customer who, using just 1% of their data in FiftyOne, found dozens of mistakes in a dataset they had been working with for over a year.
- Data quantity vs. quality trade-off: Jason invites us to run a thought experiment about whether it's better to have a smaller, high-quality dataset or a larger dataset with lesser quality, noting that this is still an open question in the field.
- Best subset selection: He discusses the interesting problem of selecting the best subset of data (e.g., 10,000 samples from 1 million) for optimal model performance, balancing compute budget and data quality (see the sketch after this list).
- Human-in-the-loop systems: Jason suggests the value of having humans involved in inferential loops, especially when the system is less certain about its predictions.
- High-dimensional data challenges: He explains the difficulties in modelling high-dimensional data distributions and mentions research on discovering low-dimensional substructures in high-dimensional spaces.
- Pattern recognition in humans: Jason draws parallels between human intuition (as in sports coaching) and the challenge of understanding complex patterns in high-dimensional data.
- Mixed reality and AI in education: He expresses excitement about the potential of combining mixed reality with generative AI capabilities for educational purposes, envisioning interactive learning experiences.
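To make the best-subset problem a little more concrete, here's a simple sketch that uses FiftyOne's built-in uniqueness scores as the selection heuristic. It's one heuristic among many, not a method Jason prescribes in the episode:

```python
import fiftyone.zoo as foz
import fiftyone.brain as fob

# Stand-in dataset; swap in your own fo.Dataset
dataset = foz.load_zoo_dataset("quickstart")

# Score how unique each sample is relative to the rest of the dataset
fob.compute_uniqueness(dataset)

# Keep the k most unique samples as a candidate training subset
k = 100  # e.g., 10,000 when selecting from 1 million
subset = dataset.sort_by("uniqueness", reverse=True).limit(k)
print(subset)
```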
👩🏽‍🔬 Interesting Research: SAM 2’s Data Engine
Meta released SAM 2, and there’s rightly been a tonne of hype around it.
The 42-page SAM 2 paper offers a masterclass in data engineering for computer vision. Meta shares important insights into what separates state-of-the-art models from the rest. What stood out to me was the data engine they built. The team at Meta designed a system to create a large, high-quality video segmentation dataset.
It works in three main phases, each building upon the previous one:
Phase 1: The Foundation
Tool: SAM (Segment Anything Model) for images
Process:
- Human annotators watch a video frame by frame.
- For each frame, they use SAM to outline objects or parts of objects.
- They can refine these outlines using tools like a digital brush or eraser.
Result: High-quality, precise annotations, but very time-consuming (about 38 seconds per frame).
Think of it like tracing objects in a flipbook, one page at a time.
Phase 2: Adding Memory
Tool: SAM 2 Mask (an early version of SAM 2 that only accepts mask inputs)
Process:
- Annotators still start by outlining objects in the first frame using SAM.
- SAM 2 Mask then tries to "follow" this object through the video.
- Annotators can step in to correct mistakes, then let SAM 2 Mask continue.
Result: Much faster (about 7 seconds per frame) while maintaining good quality.
Imagine having an assistant who can continue your tracing through the flipbook but sometimes needs guidance.
Phase 3: Full Automation
Tool: Complete SAM 2 (accepts various types of prompts like points and masks)
Process:
- SAM 2 now "remembers" objects across frames.
- Annotators mainly provide occasional clicks to refine predictions.
- SAM 2 does most of the work in tracking objects through the video.
Result: Even faster (about 4.5 seconds per frame) and still high quality.
Like having a smart assistant who can trace through the whole flipbook, needing only occasional pointers from you.
The key takeaway? Start with quality, then scale.
Too often, teams rush to amass large datasets, only to find their models choke on edge cases. SAM 2's approach ensures a foundation of high-quality annotations before ramping up speed. It's the CV equivalent of "measure twice, cut once." But here's where it gets fascinating: the model-in-the-loop approach. They've created a virtuous improvement cycle by continuously retraining SAM 2 with newly annotated data. This is the holy grail of machine learning - a system that gets smarter as it works.
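In pseudocode, that cycle looks something like the sketch below. Every function name here (propose_masks, human_review, retrain) is a hypothetical placeholder supplied by the caller, not Meta's actual pipeline:

```python
def run_data_engine(model, videos, propose_masks, human_review, retrain,
                    num_rounds=3):
    """Minimal model-in-the-loop sketch: the model proposes annotations,
    humans correct them, and the corrected data retrains the model."""
    dataset = []
    for _ in range(num_rounds):
        for video in videos:
            # 1. The current model proposes segmentation masks
            proposals = propose_masks(model, video)
            # 2. Humans verify and correct only where the model fails;
            #    this step gets cheaper every round as the model improves
            dataset.extend(human_review(video, proposals))
        # 3. Retrain on the growing corrected dataset so the next round's
        #    proposals are better (the "virtuous cycle")
        model = retrain(model, dataset)
    return model, dataset
```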
The tl;dr for the key lessons I learned:
- Prioritize quality, then scale.
- Create a feedback loop between your model and data collection.
- Focus on challenging cases - they're where the real gains are made.
- Use your model to help generate data, but always with human verification.
- Think carefully about how you split your dataset.
- Design for real-world use cases from the start.
Implement these in your next CV project, and you'll be streets ahead of the competition. The future of computer vision isn't model architecture - it's sophisticated, iterative data engineering. SAM 2 has shown us the way.
🗓️ Upcoming Events
Check out these upcoming AI, machine learning and computer vision events! View the full calendar and register for an event.