<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Josef Albers</title>
    <description>The latest articles on DEV Community by Josef Albers (@josef_albers_fc59b610c5de).</description>
    <link>https://dev.to/josef_albers_fc59b610c5de</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1864842%2F774d37ec-856d-430b-b0d2-967179a9a445.png</url>
      <title>DEV Community: Josef Albers</title>
      <link>https://dev.to/josef_albers_fc59b610c5de</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/josef_albers_fc59b610c5de"/>
    <language>en</language>
    <item>
      <title>Zig's Power in Action: C Integration and WASM Compilation for Terrain Generation</title>
      <dc:creator>Josef Albers</dc:creator>
      <pubDate>Thu, 15 Aug 2024 15:09:36 +0000</pubDate>
      <link>https://dev.to/josef_albers_fc59b610c5de/zigs-power-in-action-c-integration-and-wasm-compilation-for-terrain-generation-12pj</link>
      <guid>https://dev.to/josef_albers_fc59b610c5de/zigs-power-in-action-c-integration-and-wasm-compilation-for-terrain-generation-12pj</guid>
      <description>&lt;p&gt;Zig's ability to directly interface with C libraries and compile to WebAssembly (WASM) opens doors for diverse applications. This post showcases these capabilities through TerrainZigger, a 3D terrain generator project.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key Points:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Seamless C Interoperability:&lt;/strong&gt; Zig's &lt;code&gt;@cImport&lt;/code&gt; lets you import and call C libraries directly, tapping into the rich ecosystem of existing C code. TerrainZigger demonstrates this by integrating Raylib for rendering.&lt;br&gt;
&lt;/p&gt;
&lt;pre class="highlight zig"&gt;&lt;code&gt;&lt;span class="k"&gt;const&lt;/span&gt; &lt;span class="n"&gt;ray&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;@cImport&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="nb"&gt;@cInclude&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"raylib.h"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt; &lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Effortless WASM Compilation:&lt;/strong&gt; Zig's &lt;code&gt;build-exe&lt;/code&gt; command compiles directly to WASM, making Zig code callable from JavaScript and easy to embed in web pages. TerrainZigger exemplifies this with a playable demo on itch.io.&lt;br&gt;
&lt;/p&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;zig build-exe terrain_zigger.zig &lt;span class="nt"&gt;-target&lt;/span&gt; wasm32-freestanding &lt;span class="nt"&gt;-O&lt;/span&gt; ReleaseSmall &lt;span class="nt"&gt;-fno-entry&lt;/span&gt; &lt;span class="nt"&gt;--export&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;generate_terrain_wasm &lt;span class="nt"&gt;--export&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;get_terrain_height_wasm &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; python &lt;span class="nt"&gt;-m&lt;/span&gt; http.server &amp;amp; open http://localhost:8000/
&lt;span class="nb"&gt;kill&lt;/span&gt; &lt;span class="si"&gt;$(&lt;/span&gt;lsof &lt;span class="nt"&gt;-t&lt;/span&gt; &lt;span class="nt"&gt;-i&lt;/span&gt;:8000&lt;span class="si"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Performance and Control:&lt;/strong&gt; Zig's emphasis on low-level control and performance is ideal for computationally demanding tasks such as terrain generation.&lt;br&gt;
&lt;/p&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;zig build-exe walk.zig &lt;span class="nt"&gt;-I&lt;/span&gt;&lt;span class="nb"&gt;.&lt;/span&gt; &lt;span class="nt"&gt;-lc&lt;/span&gt; &lt;span class="si"&gt;$(&lt;/span&gt;pkg-config &lt;span class="nt"&gt;--libs&lt;/span&gt; &lt;span class="nt"&gt;--cflags&lt;/span&gt; raylib&lt;span class="si"&gt;)&lt;/span&gt; &lt;span class="nt"&gt;-O&lt;/span&gt; Debug
leaks &lt;span class="nt"&gt;-atExit&lt;/span&gt; &lt;span class="nt"&gt;--&lt;/span&gt; ./walk
&lt;/code&gt;&lt;/pre&gt;

&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;TerrainZigger&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;GitHub Repo:&lt;/strong&gt; &lt;a href="https://github.com/JosefAlbers/TerrainZigger" rel="noopener noreferrer"&gt;https://github.com/JosefAlbers/TerrainZigger&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Playable Demo:&lt;/strong&gt; &lt;a href="https://albersj66.itch.io/terrainzigger" rel="noopener noreferrer"&gt;https://albersj66.itch.io/terrainzigger&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Conclusion&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Zig's seamless C interoperability and WASM compilation let developers build high-performance applications across platforms, both natively and in web browsers. Whether for games, simulations, or interactive projects, Zig offers the tools to bring ideas to life.&lt;/p&gt;

</description>
      <category>zig</category>
      <category>javascript</category>
      <category>raylib</category>
      <category>discuss</category>
    </item>
    <item>
      <title>Porting Phi-3-Vision to MLX: A Python Hobbyist's Journey</title>
      <dc:creator>Josef Albers</dc:creator>
      <pubDate>Wed, 07 Aug 2024 01:38:29 +0000</pubDate>
      <link>https://dev.to/josef_albers_fc59b610c5de/porting-phi-3-vision-to-mlx-a-python-hobbyists-journey-2l2k</link>
      <guid>https://dev.to/josef_albers_fc59b610c5de/porting-phi-3-vision-to-mlx-a-python-hobbyists-journey-2l2k</guid>
      <description>&lt;p&gt;Hey fellow devs! 👋&lt;/p&gt;

&lt;p&gt;I've just published a comprehensive series on Medium detailing my journey of porting Phi-3-Vision, a powerful vision-language model, from Hugging Face to Apple's MLX framework. As a Python hobbyist, I wanted to share my experience and hopefully inspire others to dive into AI model optimization.&lt;/p&gt;

&lt;h2&gt;
  
  
  📚 Series Overview:
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Basic Implementation&lt;/strong&gt;: Getting Phi-3-Vision up and running in MLX.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Su-scaled Rotary Position Embeddings (SuRoPE)&lt;/strong&gt;: Implementing 128K context support.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Batching&lt;/strong&gt;: Optimizing for multiple inputs.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Caching&lt;/strong&gt;: Speeding up text generation.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Choice Selection&lt;/strong&gt;: Implementing constrained output.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Constrained Decoding&lt;/strong&gt;: Guiding the model's output structure.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;LoRA Training&lt;/strong&gt;: Fine-tuning the model efficiently.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Agent and Toolchain System&lt;/strong&gt;: Building flexible AI workflows.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  🤔 Why This Matters:
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Run advanced AI models efficiently on Apple Silicon&lt;/li&gt;
&lt;li&gt;Learn about model optimization techniques&lt;/li&gt;
&lt;li&gt;Understand the internals of vision-language models&lt;/li&gt;
&lt;li&gt;Explore the capabilities of MLX for AI development&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  🔗 Read the Full Series:
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://medium.com/@albersj66" rel="noopener noreferrer"&gt;https://medium.com/@albersj66&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  💻 GitHub Repository:
&lt;/h2&gt;

&lt;p&gt;I've open-sourced all the code and markdown files used in this series. You can find them in my GitHub repository:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/JosefAlbers/Phi-3-Vision-MLX" rel="noopener noreferrer"&gt;https://github.com/JosefAlbers/Phi-3-Vision-MLX&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Feel free to explore, experiment, and contribute!&lt;/p&gt;

&lt;h2&gt;
  
  
  💬 Let's Discuss:
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Have you worked with MLX or other AI frameworks on Apple Silicon?&lt;/li&gt;
&lt;li&gt;What challenges have you faced in porting or optimizing AI models?&lt;/li&gt;
&lt;li&gt;Any specific parts of the series you'd like to dive deeper into?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I'm excited to hear your thoughts and experiences! Let's learn from each other and push the boundaries of what's possible with AI on consumer hardware.&lt;/p&gt;

</description>
      <category>discuss</category>
      <category>machinelearning</category>
      <category>applesilicon</category>
      <category>mlx</category>
    </item>
    <item>
      <title>Part 2: Implementing Su-scaled Rotary Position Embeddings (RoPE) for Phi-3-Vision</title>
      <dc:creator>Josef Albers</dc:creator>
      <pubDate>Thu, 01 Aug 2024 12:49:51 +0000</pubDate>
      <link>https://dev.to/josef_albers_fc59b610c5de/part-2-implementing-su-scaled-rotary-position-embeddings-rope-for-phi-3-vision-1oh0</link>
      <guid>https://dev.to/josef_albers_fc59b610c5de/part-2-implementing-su-scaled-rotary-position-embeddings-rope-for-phi-3-vision-1oh0</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;Welcome to Part 2 of our Phi-3-Vision porting series. In Part 1, we created a basic implementation of the model in MLX, but noted that it struggles with longer sequences. Today, we'll address this limitation by implementing Su-scaled Rotary Position Embeddings (RoPE), significantly enhancing our model's ability to handle long contexts of up to 128K tokens.&lt;/p&gt;

&lt;p&gt;The full implementation of what we'll cover today is available at &lt;a href="https://github.com/JosefAlbers/Phi-3-Vision-MLX/tree/main/assets/tutorial_2.py" rel="noopener noreferrer"&gt;https://github.com/JosefAlbers/Phi-3-Vision-MLX/tree/main/assets/tutorial_2.py&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  1. Understanding Rotary Position Embeddings (RoPE)
&lt;/h2&gt;

&lt;p&gt;Before we delve into Su-scaled RoPE, let's first understand the basics of Rotary Position Embeddings.&lt;/p&gt;

&lt;p&gt;RoPE is a technique that injects positional information into the model's token representations without adding extra tokens or increasing the model's parameter count. The key idea is to apply a rotation to each token's embedding based on its position in the sequence.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Frequency Calculation&lt;/strong&gt;: For each dimension d in the embedding space, RoPE calculates a frequency:&lt;br&gt;
&lt;/p&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;inv_freq&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;base&lt;/span&gt; &lt;span class="o"&gt;**&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;d&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;dim&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;


&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fraw.githubusercontent.com%2FJosefAlbers%2FPhi-3-Vision-MLX%2Fmain%2Fassets%2Ftutorial_part2_rope_inv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fraw.githubusercontent.com%2FJosefAlbers%2FPhi-3-Vision-MLX%2Fmain%2Fassets%2Ftutorial_part2_rope_inv.png"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Position-Frequency Interaction&lt;/strong&gt;: These frequencies are then multiplied by the token positions to create unique sinusoidal patterns for each position.&lt;br&gt;
&lt;/p&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;freqs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;inv_freq&lt;/span&gt; &lt;span class="o"&gt;@&lt;/span&gt; &lt;span class="n"&gt;position_ids&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;T&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;


&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fraw.githubusercontent.com%2FJosefAlbers%2FPhi-3-Vision-MLX%2Fmain%2Fassets%2Ftutorial_part2_rope_int.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fraw.githubusercontent.com%2FJosefAlbers%2FPhi-3-Vision-MLX%2Fmain%2Fassets%2Ftutorial_part2_rope_int.png"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Rotation Application&lt;/strong&gt;: The resulting patterns are used to rotate the token embeddings in 2D planes.&lt;/p&gt;

&lt;p&gt;For a token at position &lt;code&gt;pos&lt;/code&gt;, RoPE applies the following rotation:&lt;br&gt;
&lt;/p&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;x_rotated&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="nf"&gt;cos&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pos&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;freq&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="nf"&gt;sin&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pos&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;freq&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
             &lt;span class="n"&gt;y&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="nf"&gt;cos&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pos&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;freq&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="nf"&gt;sin&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pos&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;freq&lt;/span&gt;&lt;span class="p"&gt;)]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;


&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fraw.githubusercontent.com%2FJosefAlbers%2FPhi-3-Vision-MLX%2Fmain%2Fassets%2Ftutorial_part2_rope_rot.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fraw.githubusercontent.com%2FJosefAlbers%2FPhi-3-Vision-MLX%2Fmain%2Fassets%2Ftutorial_part2_rope_rot.png"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
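
&lt;p&gt;To see all three steps end to end, here is a tiny self-contained sketch (plain Python, independent of the model code) that rotates a single (x, y) pair of embedding dimensions for a token at a given position:&lt;/p&gt;

&lt;pre class="highlight python"&gt;&lt;code&gt;import math

base, dim = 10000.0, 96   # illustrative values; 96 is Phi-3's head dimension
d, pos = 0, 5             # first dimension pair, token at position 5
x, y = 1.0, 0.0           # toy embedding values for this pair

inv_freq = 1.0 / (base ** (d / dim))                # step 1: frequency
angle = pos * inv_freq                              # step 2: position-frequency interaction
x_rot = x * math.cos(angle) - y * math.sin(angle)   # step 3: 2D rotation
y_rot = y * math.cos(angle) + x * math.sin(angle)
print(x_rot, y_rot)
&lt;/code&gt;&lt;/pre&gt;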

&lt;p&gt;Now that we understand RoPE, let's explore how Su-scaled RoPE builds upon and enhances this concept.&lt;/p&gt;

&lt;h2&gt;
  
  
  2. Understanding Su-RoPE
&lt;/h2&gt;

&lt;p&gt;Su-RoPE extends RoPE by introducing scaling factors for different sequence length ranges.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;freq&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;SU_FACTOR&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;theta&lt;/span&gt; &lt;span class="o"&gt;**&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;d&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;dim&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This allows the model to better generalize to sequences longer than those seen during training.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Short and Long Factors&lt;/strong&gt;: Two sets of scaling factors are used, one for shorter sequences and one for longer sequences.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fraw.githubusercontent.com%2FJosefAlbers%2FPhi-3-Vision-MLX%2Fmain%2Fassets%2Ftutorial_part2_surope_fac.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fraw.githubusercontent.com%2FJosefAlbers%2FPhi-3-Vision-MLX%2Fmain%2Fassets%2Ftutorial_part2_surope_fac.png"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Adaptive Scaling&lt;/strong&gt;: The choice between short and long factors is made based on the sequence length.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fraw.githubusercontent.com%2FJosefAlbers%2FPhi-3-Vision-MLX%2Fmain%2Fassets%2Ftutorial_part2_surope_inv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fraw.githubusercontent.com%2FJosefAlbers%2FPhi-3-Vision-MLX%2Fmain%2Fassets%2Ftutorial_part2_surope_inv.png"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Scaling Factor&lt;/strong&gt;: An additional scaling factor is applied to adjust for the extended maximum position embeddings.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
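
&lt;p&gt;As a concrete check of item 3, plugging in Phi-3-Vision's reported context lengths (128K extended, 4K original) yields a modest attention scaling factor:&lt;/p&gt;

&lt;pre class="highlight python"&gt;&lt;code&gt;import math

max_pos, orig_pos = 131072, 4096  # 128K extended vs. 4K original context
scaling = math.sqrt(1 + math.log(max_pos / orig_pos) / math.log(orig_pos))
print(scaling)  # ~1.19
&lt;/code&gt;&lt;/pre&gt;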

&lt;h2&gt;
  
  
  3. Implementing Su-scaled RoPE
&lt;/h2&gt;

&lt;p&gt;Now that we understand the theory behind Su-scaled RoPE, let's implement it in code. We'll create a &lt;code&gt;SuRoPE&lt;/code&gt; class that encapsulates all the functionality we've discussed:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;mlx.core&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;mx&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;mlx.nn&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;nn&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;math&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;SuRoPE&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;dim&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;hidden_size&lt;/span&gt; &lt;span class="o"&gt;//&lt;/span&gt; &lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;num_attention_heads&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;original_max_position_embeddings&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;original_max_position_embeddings&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;rope_theta&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;rope_theta&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;scaling_factor&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;math&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sqrt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;math&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;max_position_embeddings&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;original_max_position_embeddings&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;math&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;original_max_position_embeddings&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;long_factor&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;rope_scaling&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;long_factor&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;short_factor&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;rope_scaling&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;short_factor&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__call__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;q&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;position_ids&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;position_ids&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;mx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;arange&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;q&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;shape&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;dtype&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;mx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;float32&lt;/span&gt;&lt;span class="p"&gt;)[&lt;/span&gt;&lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;position_ids&lt;/span&gt; &lt;span class="ow"&gt;is&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="n"&gt;position_ids&lt;/span&gt;
        &lt;span class="n"&gt;cos&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;sin&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;_get_cos_sin&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;position_ids&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;q&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;q&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;cos&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;_rotate_half&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;q&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;sin&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;k&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;k&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;cos&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;_rotate_half&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;sin&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;q&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;k&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;_get_cos_sin&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;position_ids&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;su_factor&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;long_factor&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;mx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;max&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;position_ids&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;original_max_position_embeddings&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;short_factor&lt;/span&gt;
        &lt;span class="n"&gt;position_ids_expanded&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;position_ids&lt;/span&gt;&lt;span class="p"&gt;[:,&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;:]&lt;/span&gt;
        &lt;span class="n"&gt;inv_freq&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;1.0&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;mx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;array&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;su_factor&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;dtype&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;mx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;float32&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;rope_theta&lt;/span&gt;&lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;mx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;arange&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;dim&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;dtype&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;mx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;float32&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;dim&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
        &lt;span class="n"&gt;inv_freq_expanded&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;mx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;repeat&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;inv_freq&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;:,&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;position_ids&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;shape&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;axis&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;freqs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;inv_freq_expanded&lt;/span&gt; &lt;span class="o"&gt;@&lt;/span&gt; &lt;span class="n"&gt;position_ids_expanded&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;transpose&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;emb&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;mx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;concatenate&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="n"&gt;freqs&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;freqs&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;axis&lt;/span&gt;&lt;span class="o"&gt;=-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;cos&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;mx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;expand_dims&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;mx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;cos&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;emb&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;scaling_factor&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;axis&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;sin&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;mx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;expand_dims&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;mx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sin&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;emb&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;scaling_factor&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;axis&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;cos&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;sin&lt;/span&gt;

    &lt;span class="nd"&gt;@staticmethod&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;_rotate_half&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;midpoint&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;shape&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;//&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;
        &lt;span class="n"&gt;x1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;x2&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;[...,&lt;/span&gt; &lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="n"&gt;midpoint&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;[...,&lt;/span&gt; &lt;span class="n"&gt;midpoint&lt;/span&gt;&lt;span class="p"&gt;:]&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;mx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;concatenate&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;x2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;x1&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;axis&lt;/span&gt;&lt;span class="o"&gt;=-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
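
&lt;p&gt;As a quick sanity check, here is a minimal usage sketch. The &lt;code&gt;config&lt;/code&gt; below is a stand-in with illustrative values (the real ones come from the model's &lt;code&gt;config.json&lt;/code&gt;); with &lt;code&gt;hidden_size=3072&lt;/code&gt; and 32 heads, the head dimension is 96, so each factor list needs 48 entries:&lt;/p&gt;

&lt;pre class="highlight python"&gt;&lt;code&gt;from types import SimpleNamespace

import mlx.core as mx

# Stand-in config for illustration; real values come from config.json.
config = SimpleNamespace(
    hidden_size=3072,
    num_attention_heads=32,
    max_position_embeddings=131072,
    original_max_position_embeddings=4096,
    rope_theta=10000.0,
    rope_scaling={"short_factor": [1.0] * 48, "long_factor": [1.0] * 48},
)

rope = SuRoPE(config)
q = mx.zeros((1, 32, 10, 96))  # (batch, heads, seq_len, head_dim)
k = mx.zeros((1, 32, 10, 96))
q, k = rope(q, k)
print(q.shape, k.shape)        # (1, 32, 10, 96) (1, 32, 10, 96)
&lt;/code&gt;&lt;/pre&gt;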



&lt;h2&gt;
  
  
  4. Integrating Su-scaled RoPE into Phi-3-Vision
&lt;/h2&gt;

&lt;p&gt;Integrating our Su-scaled RoPE implementation into the Phi-3-Vision model is straightforward. We only need to add two lines to our Phi3Attention module:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;Phi3Attention&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;nn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Module&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="c1"&gt;# ...
&lt;/span&gt;        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;rope&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;SuRoPE&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__call__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="c1"&gt;# ...
&lt;/span&gt;        &lt;span class="n"&gt;q&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;k&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;rope&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;q&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="c1"&gt;# ...
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And now our ported model can handle up to 128K tokens!&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;In this tutorial, we implemented Su-scaled Rotary Position Embeddings (RoPE), enabling our model to handle sequences up to 128K tokens. &lt;/p&gt;

&lt;p&gt;The full implementation is available at &lt;a href="https://github.com/JosefAlbers/Phi-3-Vision-MLX/tree/main/assets/tutorial_2.py" rel="noopener noreferrer"&gt;https://github.com/JosefAlbers/Phi-3-Vision-MLX/tree/main/assets/tutorial_2.py&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Next, we'll explore efficient batching techniques to further optimize our Phi-3-Vision implementation in MLX.&lt;/p&gt;

</description>
      <category>llm</category>
      <category>mlx</category>
      <category>matplotlib</category>
      <category>ai</category>
    </item>
    <item>
      <title>Part 1: Basic Implementation of Phi-3-Vision in MLX</title>
      <dc:creator>Josef Albers</dc:creator>
      <pubDate>Wed, 31 Jul 2024 11:00:40 +0000</pubDate>
      <link>https://dev.to/josef_albers_fc59b610c5de/part-1-porting-phi-3-vision-to-mlx-50gj</link>
      <guid>https://dev.to/josef_albers_fc59b610c5de/part-1-porting-phi-3-vision-to-mlx-50gj</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;Welcome to Part 1 of the tutorial series on porting Phi-3-Vision from PyTorch to Apple's MLX framework. Our goal is to create a minimal, functional implementation of Phi-3-Vision in MLX through:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Analyzing the original PyTorch code&lt;/li&gt;
&lt;li&gt;Translating core components to MLX&lt;/li&gt;
&lt;li&gt;Building a basic MLX implementation&lt;/li&gt;
&lt;li&gt;Loading and running the ported model&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;By the end of this tutorial, we will have a basic working model capable of generating text, setting the stage for further optimizations in subsequent parts of the series.&lt;/p&gt;

&lt;p&gt;The full implementation of what we'll cover today is available at &lt;a href="https://github.com/JosefAlbers/Phi-3-Vision-MLX/tree/main/assets/tutorial_1.py" rel="noopener noreferrer"&gt;https://github.com/JosefAlbers/Phi-3-Vision-MLX/tree/main/assets/tutorial_1.py&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  1. Finding and Understanding the Model Code
&lt;/h2&gt;

&lt;p&gt;Our first task is to locate the source code for the original Phi-3-Vision implementation:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Visit the Hugging Face model hub: &lt;a href="https://huggingface.co/models" rel="noopener noreferrer"&gt;https://huggingface.co/models&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Search for "phi-3-vision"&lt;/li&gt;
&lt;li&gt;Click on the model to access its repository&lt;/li&gt;
&lt;li&gt;Look for a file named &lt;code&gt;modeling_phi3_v.py&lt;/code&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Now let's examine the code:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjzsax6junle1e1gv7b7q.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjzsax6junle1e1gv7b7q.gif" alt="idonteven" width="400" height="225"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Don't panic! We will break this down step by step:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Scroll to the bottom of the file to find the top-level model class (&lt;code&gt;Phi3VForCausalLM&lt;/code&gt; in our case)&lt;/li&gt;
&lt;li&gt;Look for the &lt;code&gt;forward&lt;/code&gt; method in this class&lt;/li&gt;
&lt;li&gt;Trace the flow of data through the model by following method calls&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Through this process, we can identify five key components of the model:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Main model (&lt;code&gt;Phi3VModel&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;Decoder layers (&lt;code&gt;Phi3DecoderLayer&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;Attention mechanism (&lt;code&gt;Phi3Attention&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;Feed-forward network (&lt;code&gt;Phi3MLP&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;Image embedding (&lt;code&gt;Phi3ImageEmbedding&lt;/code&gt;)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;With these components identified, we're ready to begin the translation process to MLX.&lt;/p&gt;

&lt;h2&gt;
  
  
  2. MLX-Specific Considerations
&lt;/h2&gt;

&lt;p&gt;A few differences between PyTorch and MLX are worth noting before we begin porting:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Array Creation&lt;/strong&gt;: MLX doesn't require specifying device location.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Lazy Computation&lt;/strong&gt;: MLX builds computation graphs lazily; arrays are only materialized when &lt;code&gt;mx.eval()&lt;/code&gt; is called or their values are needed.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Model Definition&lt;/strong&gt;: MLX uses &lt;code&gt;__call__&lt;/code&gt; instead of &lt;code&gt;forward&lt;/code&gt; for the model's forward pass.&lt;/li&gt;
&lt;/ol&gt;
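
&lt;p&gt;A quick illustration of all three differences (a minimal standalone sketch, not model code):&lt;/p&gt;

&lt;pre class="highlight python"&gt;&lt;code&gt;import mlx.core as mx
import mlx.nn as nn

a = mx.ones((2, 2))  # no device argument needed
b = a @ a            # builds a lazy computation graph
mx.eval(b)           # materializes the result

class Square(nn.Module):  # MLX modules define __call__, not forward
    def __call__(self, x):
        return x * x

print(Square()(b))
&lt;/code&gt;&lt;/pre&gt;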

&lt;h2&gt;
  
  
  3. Understanding the Model Structure
&lt;/h2&gt;

&lt;p&gt;Let's now examine each key component of Phi-3-Vision, translating them to MLX as we go:&lt;/p&gt;

&lt;h3&gt;
  
  
  3.1 Top-Level Model: Phi3VForCausalLM
&lt;/h3&gt;

&lt;p&gt;This class serves as the main entry point of the model. It encapsulates the core &lt;code&gt;Phi3VModel&lt;/code&gt; and adds a language modeling head.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;Phi3VForCausalLM&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;nn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Module&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="c1"&gt;# ...
&lt;/span&gt;    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__call__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;input_ids&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;pixel_values&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;image_sizes&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;model&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;input_ids&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;pixel_values&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;image_sizes&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;lm_head&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This top-level class serves two main functions:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;It encapsulates the core model (&lt;code&gt;Phi3VModel&lt;/code&gt;), which produces contextualized representations of the input.&lt;/li&gt;
&lt;li&gt;It applies a linear transformation (the "language model head") to these representations, converting them into logits over the entire vocabulary. These logits represent the model's predictions for the next token in the sequence.&lt;/li&gt;
&lt;/ol&gt;
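
&lt;p&gt;Conceptually, the head is just a linear projection from hidden size to vocabulary size. Here is a sketch with illustrative dimensions (Phi-3's reported hidden size and vocabulary), using freshly initialized weights rather than the actual checkpoint:&lt;/p&gt;

&lt;pre class="highlight python"&gt;&lt;code&gt;import mlx.core as mx
import mlx.nn as nn

hidden_size, vocab_size = 3072, 32064
lm_head = nn.Linear(hidden_size, vocab_size, bias=False)

x = mx.zeros((1, 10, hidden_size))  # (batch, seq_len, hidden)
logits = lm_head(x)
print(logits.shape)                 # (1, 10, 32064)
&lt;/code&gt;&lt;/pre&gt;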

&lt;h3&gt;
  
  
  3.2 Core Model: Phi3VModel
&lt;/h3&gt;

&lt;p&gt;The Phi3VModel implements the main transformer architecture.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;Phi3VModel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;nn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Module&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="c1"&gt;# ...
&lt;/span&gt;    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__call__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;input_ids&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;pixel_values&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;image_sizes&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;embed_tokens&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;input_ids&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;vision_embed_tokens&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;pixel_values&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;image_sizes&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;l&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;layers&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;l&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;norm&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This class processes inputs through four stages:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Text Embedding&lt;/strong&gt;: Input tokens are converted to dense vector representations.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Vision Embedding&lt;/strong&gt;: If present, image inputs are processed and integrated with the text embeddings.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Transformer Layers&lt;/strong&gt;: The combined embeddings are then passed through a series of decoder layers.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Normalization&lt;/strong&gt;: The output is normalized before being returned.&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  3.3 Image Embedding: Phi3ImageEmbedding
&lt;/h3&gt;

&lt;p&gt;This component processes image inputs and integrates them with text embeddings.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;Phi3ImageEmbedding&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;nn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Module&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="c1"&gt;# ...
&lt;/span&gt;    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__call__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;txt_embeds&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;img_embeds&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;img_sizes&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;positions&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="c1"&gt;# Process images with CLIP
&lt;/span&gt;        &lt;span class="n"&gt;img_features&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;img_processor&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;vision_model&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;img_embeds&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="c1"&gt;# Reshape and concatenate features
&lt;/span&gt;        &lt;span class="n"&gt;global_features&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;_process_global_features&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;img_features&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;local_features&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;_process_local_features&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;img_features&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;img_sizes&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="c1"&gt;# Apply additional projections
&lt;/span&gt;        &lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;mx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;concatenate&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="n"&gt;local_features&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;global_features&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;axis&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;layer&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;img_projection&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;layer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="c1"&gt;# Integrate with text embeddings
&lt;/span&gt;        &lt;span class="n"&gt;txt_embeds&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;_integrate_features&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;txt_embeds&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;positions&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;txt_embeds&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This class combines a CLIP (Contrastive Language-Image Pre-training) model with custom processing steps:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;CLIP Processing&lt;/strong&gt;: The model uses a pre-trained CLIP vision model to extract initial features from the input images.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Additional Processing&lt;/strong&gt;: After CLIP processing, the model applies additional processing steps:

&lt;ul&gt;
&lt;li&gt;It reshapes and concatenates the features for both global and local (sub-image) representations.&lt;/li&gt;
&lt;li&gt;It applies additional linear projections and non-linear activations (GELU) to further process these features.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Integration with Text Embeddings&lt;/strong&gt;: Finally, the processed image features are integrated with the text embeddings at specific positions in the input sequence.&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  3.4 Decoder Layer: Phi3DecoderLayer
&lt;/h3&gt;

&lt;p&gt;Each decoder layer is a fundamental building block of the transformer architecture.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;Phi3DecoderLayer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;nn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Module&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="c1"&gt;# ...
&lt;/span&gt;    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__call__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;r&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;self_attn&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;input_layernorm&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
        &lt;span class="n"&gt;h&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt;
        &lt;span class="n"&gt;r&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;mlp&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;post_attention_layernorm&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;h&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;h&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The decoder layer applies a series of operations to its input:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Self-Attention&lt;/strong&gt;: This mechanism allows the model to weigh the importance of different parts of the input when processing each element, enabling it to capture long-range dependencies in the sequence.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Feedforward Neural Network (MLP)&lt;/strong&gt;: This subnet processes each position independently, introducing non-linearity and increasing the model's capacity to learn complex functions.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Residual Connections&lt;/strong&gt;: After both the self-attention and MLP operations, the input is added to the output. This technique helps in mitigating the vanishing gradient problem and allows for easier training of deep networks.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Layer Normalization&lt;/strong&gt;: Applied before the self-attention and MLP operations, this normalizes the inputs to each sub-layer, stabilizing the learning process and allowing for deeper networks.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The combination of these components enables each layer to refine and enrich the representations passed through the model.&lt;/p&gt;

&lt;h3&gt;
  
  
  3.5 Attention Mechanism: Phi3Attention
&lt;/h3&gt;

&lt;p&gt;The attention mechanism allows the model to weigh the importance of different parts of the input when processing each element.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;Phi3Attention&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;nn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Module&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="c1"&gt;# ...
&lt;/span&gt;    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__call__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;B&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;L&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;shape&lt;/span&gt;
        &lt;span class="n"&gt;qkv&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;qkv_proj&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;q&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;v&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;mx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;split&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;qkv&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;chop&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;axis&lt;/span&gt;&lt;span class="o"&gt;=-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;q&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;q&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;reshape&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;B&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;L&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;n_heads&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;transpose&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;k&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;reshape&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;B&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;L&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;n_kv_heads&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;transpose&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;v&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;v&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;reshape&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;B&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;L&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;n_kv_heads&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;transpose&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;mask&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;mx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;triu&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;mx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;full&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;shape&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;shape&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]),&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;mx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;inf&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;w&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;q&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;scale&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;@&lt;/span&gt; &lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;transpose&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;w&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="n"&gt;mask&lt;/span&gt;
        &lt;span class="n"&gt;w&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;mx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;softmax&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;w&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;axis&lt;/span&gt;&lt;span class="o"&gt;=-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;o&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;w&lt;/span&gt; &lt;span class="o"&gt;@&lt;/span&gt; &lt;span class="n"&gt;v&lt;/span&gt;
        &lt;span class="n"&gt;o&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;o&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;transpose&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;reshape&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;B&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;L&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;o_proj&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;o&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;astype&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;qkv&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;dtype&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Key aspects of this implementation:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Projection and Splitting&lt;/strong&gt;: The input is first projected into query (q), key (k), and value (v) representations using a single linear projection (&lt;code&gt;qkv_proj&lt;/code&gt;), which is then split (steps 1-3 are illustrated with a toy example after this list).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Multi-head Reshaping&lt;/strong&gt;: The q, k, and v tensors are reshaped to separate the heads and prepare for the attention computation.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Attention Mask&lt;/strong&gt;: A causal mask is created to ensure that each position can only attend to previous positions.&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Scaled Dot-Product Attention&lt;/strong&gt;: The core computation scales the query-key dot products, adds the mask, and softmaxes before weighting the values. Alternatively, you can use the faster, fused version of this operation available in &lt;code&gt;mlx.core.fast&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# This:
&lt;/span&gt;&lt;span class="n"&gt;w&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;q&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;scale&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;@&lt;/span&gt; &lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;transpose&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;w&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="n"&gt;mask&lt;/span&gt;
&lt;span class="n"&gt;w&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;mx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;softmax&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;w&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;axis&lt;/span&gt;&lt;span class="o"&gt;=-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;o&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;w&lt;/span&gt; &lt;span class="o"&gt;@&lt;/span&gt; &lt;span class="n"&gt;v&lt;/span&gt;

&lt;span class="c1"&gt;# Is equivalent to:
&lt;/span&gt;&lt;span class="n"&gt;o&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;mx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;fast&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;scaled_dot_product_attention&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;q&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;v&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;scale&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;scale&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;mask&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;mask&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Output Projection&lt;/strong&gt;: The attention output is reshaped and projected back to the original dimensionality.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
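
&lt;p&gt;To ground steps 1-3, here is a toy walk-through with made-up shapes (the real model uses 32 heads and a much larger hidden size; &lt;code&gt;chop&lt;/code&gt; is assumed to hold the split indices, matching how &lt;code&gt;mx.split&lt;/code&gt; is called above).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import mlx.core as mx

B, L, n_heads, head_dim = 1, 4, 2, 8
qkv = mx.random.normal((B, L, 3 * n_heads * head_dim))  # fused q/k/v projection output
chop = [n_heads * head_dim, 2 * n_heads * head_dim]     # split points for equal q/k/v sizes
q, k, v = mx.split(qkv, chop, axis=-1)
print(q.shape, k.shape, v.shape)                        # (1, 4, 16) each
# Causal mask: -inf above the diagonal, so position i attends only to j &lt;= i
mask = mx.triu(mx.full((L, L), -mx.inf), k=1)
print(mask)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;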

&lt;p&gt;The attention mechanism supports both standard multi-head attention and grouped-query attention by allowing different numbers of heads for queries (&lt;code&gt;n_heads&lt;/code&gt;) versus keys/values (&lt;code&gt;n_kv_heads&lt;/code&gt;). &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fylsr57hd8kpazjoznxac.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fylsr57hd8kpazjoznxac.png" alt="https://arxiv.org/abs/2305.13245v3" width="800" height="387"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In the current configuration, however, these are set to the same value (32), resulting in standard multi-head attention.&lt;/p&gt;
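
&lt;p&gt;When &lt;code&gt;n_kv_heads&lt;/code&gt; is smaller than &lt;code&gt;n_heads&lt;/code&gt;, each key/value head is shared by a group of query heads. Here is a minimal sketch of that idea, using a hypothetical 32/8 split rather than Phi-3's actual setting.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import mlx.core as mx

B, L, head_dim = 1, 8, 96
n_heads, n_kv_heads = 32, 8                         # hypothetical grouped-query setting
q = mx.random.normal((B, n_heads, L, head_dim))
k = mx.random.normal((B, n_kv_heads, L, head_dim))
# Repeat each kv head so every group of 4 query heads shares one kv head:
k = mx.repeat(k, n_heads // n_kv_heads, axis=1)
print(k.shape)                                      # (1, 32, 8, 96)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;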

&lt;h3&gt;
  
  
  3.6 MLP Layer: Phi3MLP
&lt;/h3&gt;

&lt;p&gt;The MLP layer applies non-linear transformations to the attention outputs.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;Phi3MLP&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;nn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Module&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="c1"&gt;# ...
&lt;/span&gt;    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__call__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;gate_up_proj&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;gate&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;mx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;split&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;axis&lt;/span&gt;&lt;span class="o"&gt;=-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;down_proj&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;nn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;silu&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;gate&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This implements a gated feedforward network:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Gated Architecture:

&lt;ul&gt;
&lt;li&gt;The input is first projected into two separate spaces: one for the 'gate' and one for the 'values'.&lt;/li&gt;
&lt;li&gt;This is achieved through a single linear projection followed by a split operation.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Activation Function:

&lt;ul&gt;
&lt;li&gt;The gate portion uses the SiLU (Sigmoid Linear Unit) activation, also known as the swish function.&lt;/li&gt;
&lt;li&gt;SiLU is defined as f(x) = x * sigmoid(x), which has been shown to perform well in deep networks (a quick numerical check follows this list).&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Gating Mechanism:

&lt;ul&gt;
&lt;li&gt;The activated gate is element-wise multiplied with the value portion.&lt;/li&gt;
&lt;li&gt;This allows the network to dynamically control information flow, potentially helping with gradient flow and enabling more complex functions to be learned.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Final Projection:

&lt;ul&gt;
&lt;li&gt;The gated output is then projected back to the model's hidden size through a final linear layer.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;
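
&lt;p&gt;As a quick sanity check, &lt;code&gt;nn.silu&lt;/code&gt; matches the definition above exactly:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import mlx.core as mx
import mlx.nn as nn

x = mx.array([-2.0, -1.0, 0.0, 1.0, 2.0])
print(nn.silu(x))         # built-in SiLU activation
print(x * mx.sigmoid(x))  # f(x) = x * sigmoid(x), same values
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;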

&lt;p&gt;This design combines the benefits of gating mechanisms (often seen in LSTMs and GRUs) with the simplicity and effectiveness of feedforward networks, potentially allowing for more expressive computations within each transformer layer.&lt;/p&gt;

&lt;h2&gt;
  
  
  4. Loading and Using the Model
&lt;/h2&gt;

&lt;p&gt;Now that we've ported our model to MLX, let's load and use it for text generation.&lt;/p&gt;

&lt;p&gt;First, we'll download the model configuration and weights from huggingface:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;model_path&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;snapshot_download&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;microsoft/Phi-3-vision-128k-instruct&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Next, we'll load the model configuration:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;model_path&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;/config.json&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;r&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;config&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;load&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;model_config&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;SimpleNamespace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now, let's load and "sanitize" the model weights:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;model_weight&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[(&lt;/span&gt;&lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;v&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;transpose&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;patch_embedding.weight&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;k&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="n"&gt;v&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; 
                &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;wf&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;glob&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;glob&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;model_path&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;/*.safetensors&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; 
                &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;v&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;mx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;load&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;wf&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;items&lt;/span&gt;&lt;span class="p"&gt;()]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The expression &lt;code&gt;v.transpose(0, 2, 3, 1) if "patch_embedding.weight" in k else v&lt;/code&gt; adapts the patch-embedding convolution weights to MLX's layout: PyTorch stores them as (out_channels, in_channels, height, width), while MLX's convolutions expect the channel axis last, (out_channels, height, width, in_channels). This transposition, often called "weight sanitization", is necessary when porting the model from PyTorch to MLX, as illustrated below.&lt;/p&gt;
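
&lt;p&gt;For example, with a ViT-L/14-style patch embedding (the sizes here are an assumption, for illustration only), the transpose moves the channel axis to the end:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import numpy as np

w_torch = np.zeros((1024, 3, 14, 14))  # PyTorch conv weight: (out_ch, in_ch, kH, kW)
w_mlx = w_torch.transpose(0, 2, 3, 1)  # MLX conv weight:     (out_ch, kH, kW, in_ch)
print(w_mlx.shape)                     # (1024, 14, 14, 3)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;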

&lt;p&gt;With our configuration and weights ready, we can initialize and load our model:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Phi3VForCausalLM&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model_config&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;load_weights&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model_weight&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;mx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;eval&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;parameters&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;eval&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here, &lt;code&gt;mx.eval&lt;/code&gt; forces MLX's lazy computation graph to materialize the weights, and &lt;code&gt;model.eval()&lt;/code&gt; switches the model into inference mode. Now that our model is loaded, let's use it to generate some text. First, we'll load the pretrained processor:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;processor&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;AutoProcessor&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;from_pretrained&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;microsoft/Phi-3-vision-128k-instruct&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;trust_remote_code&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then, we'll process our input text and generate the first token:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;inputs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;processor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Hello world!&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;return_tensors&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;np&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;input_ids&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;mx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;array&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;inputs&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;input_ids&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;span class="n"&gt;logits&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;model&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;input_ids&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;token&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;mx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;argmax&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;logits&lt;/span&gt;&lt;span class="p"&gt;[:,&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;:],&lt;/span&gt; &lt;span class="n"&gt;axis&lt;/span&gt;&lt;span class="o"&gt;=-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;list_tokens&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;token&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;tolist&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This code processes the input text "Hello world!" and generates the first token. We use the &lt;code&gt;AutoProcessor&lt;/code&gt; to tokenize the input, then pass it through the model to get logits. Taking the argmax selects the highest-probability token as the next token, i.e., greedy decoding; a sampling alternative is sketched below.&lt;/p&gt;
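
&lt;p&gt;Continuing from the snippet above, if you'd rather sample than always take the argmax, MLX's &lt;code&gt;mx.random.categorical&lt;/code&gt; draws a token directly from the logits. A minimal variant (the temperature value is an arbitrary choice):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Instead of argmax, sample from temperature-scaled logits:
temp = 0.7
token = mx.random.categorical(logits[:, -1, :] * (1 / temp))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;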

&lt;p&gt;To generate more tokens, we can use a simple loop:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;input_ids&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;mx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;concatenate&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="n"&gt;input_ids&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;token&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;]],&lt;/span&gt; &lt;span class="n"&gt;axis&lt;/span&gt;&lt;span class="o"&gt;=-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;logits&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;model&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;input_ids&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;token&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;mx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;argmax&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;logits&lt;/span&gt;&lt;span class="p"&gt;[:,&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;:],&lt;/span&gt; &lt;span class="n"&gt;axis&lt;/span&gt;&lt;span class="o"&gt;=-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;list_tokens&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="n"&gt;token&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;tolist&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This loop generates five additional tokens by repeatedly feeding our model's output back as input. Note that each iteration re-runs the entire sequence through the model; the KV-cache we'll add later in this series avoids that recomputation. Finally, we decode the generated tokens:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;processor&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;tokenizer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;decode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;list_tokens&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="c1"&gt;# Output: How are you doing?&amp;lt;|end|&amp;gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And there you have it! We've successfully ported Phi-3-Vision to MLX, loaded the model, and generated text. While this implementation is basic, it demonstrates that our port is functional and capable of generating coherent text.&lt;/p&gt;

&lt;h2&gt;
  
  
  5. Initial Results and Limitations
&lt;/h2&gt;

&lt;p&gt;Our minimal implementation works for short sequences, but you'll notice it starts producing gibberish with longer contexts. This is because we haven't yet implemented position encoding, which we'll address in the next part with RoPE (Rotary Position Embedding).&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion:
&lt;/h2&gt;

&lt;p&gt;We've successfully created a barebones implementation of Phi-3-Vision in MLX. While it can't yet handle everything the original model does, it provides a solid foundation for the optimizations we'll explore in upcoming tutorials.&lt;/p&gt;

&lt;h2&gt;
  
  
  Next Steps:
&lt;/h2&gt;

&lt;p&gt;In Part 2, we'll implement Su-scaled Rotary Position Embeddings (Su-RoPE) to enhance our model's ability to handle long sequences.&lt;/p&gt;

</description>
      <category>vlm</category>
      <category>llm</category>
      <category>mlx</category>
      <category>huggingface</category>
    </item>
    <item>
      <title>Porting Phi-3-Vision to MLX: A Python Hobbyist's Journey into Advanced AI on Apple Silicon</title>
      <dc:creator>Josef Albers</dc:creator>
      <pubDate>Wed, 31 Jul 2024 10:55:15 +0000</pubDate>
      <link>https://dev.to/josef_albers_fc59b610c5de/porting-phi-3-vision-to-mlx-a-python-hobbyists-journey-into-advanced-ai-on-apple-silicon-25b2</link>
      <guid>https://dev.to/josef_albers_fc59b610c5de/porting-phi-3-vision-to-mlx-a-python-hobbyists-journey-into-advanced-ai-on-apple-silicon-25b2</guid>
      <description>&lt;h2&gt;
  
  
  Introduction:
&lt;/h2&gt;

&lt;p&gt;Welcome to an exciting series on optimizing cutting-edge AI models for Apple Silicon! Over the next few weeks, we'll dive deep into the process of porting Phi-3-Vision, a powerful and compact vision-language model, from Hugging Face to MLX.&lt;/p&gt;

&lt;p&gt;This series is designed for AI enthusiasts, developers, and researchers interested in running advanced models efficiently on Mac devices. For those eager to get started, you can find the MLX ports of both Phi-3-Mini-128K and Phi-3-Vision in my GitHub repository: &lt;a href="https://github.com/JosefAlbers/Phi-3-Vision-MLX" rel="noopener noreferrer"&gt;https://github.com/JosefAlbers/Phi-3-Vision-MLX&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Phi-3-Vision?
&lt;/h2&gt;

&lt;p&gt;When Microsoft Research released Phi-3, I was immediately intrigued. Despite its relatively small size of 3.8 billion parameters, it was performing on par with or even surpassing models with 7 billion parameters. This efficiency was impressive and hinted at the potential for running sophisticated AI models on consumer-grade hardware.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fg3jt1udxw4ysp51dre4u.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fg3jt1udxw4ysp51dre4u.png" alt="Phi-3-*" width="800" height="341"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The subsequent release of Phi-3-Vision further piqued my interest. As an owner of a Mac Studio and a Python hobbyist, I saw an exciting opportunity to bring this capable vision-language model to Apple Silicon. While llama.cpp was a popular option for running large language models on Mac, its C++ codebase was way beyond my skill level, so I looked for a more accessible option. This led me to MLX, Apple's machine learning framework that not only offered a Python-friendly environment but also promised better performance than llama.cpp on Apple Silicon.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frqlzg7q5dy5ezq8chois.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frqlzg7q5dy5ezq8chois.png" alt="Phi-3-Vision" width="800" height="362"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;What made this journey even more exciting was that it marked my first foray into contributing to open source projects. As I worked on porting Phi-3-Vision, I found myself making my first pull requests to repositories like "mlx-examples" and "mlx-vlm". It was an invaluable learning experience that deepened my understanding of the MLX framework and how to apply it to real-world projects, and it connected me with the broader AI development community.&lt;/p&gt;

&lt;h2&gt;
  
  
  Useful Resources:
&lt;/h2&gt;

&lt;p&gt;Before we dive into the series, I want to highlight some excellent resources that have been invaluable in my journey:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;MLX Examples (&lt;a href="https://github.com/ml-explore/mlx-examples" rel="noopener noreferrer"&gt;https://github.com/ml-explore/mlx-examples&lt;/a&gt;): This official repository from the MLX team at Apple is a treasure trove of examples and tutorials that showcase the capabilities of the MLX framework. With a wide range of standalone examples, from basic MNIST to advanced language models and image generation, this repository is an excellent starting point for anyone looking to learn MLX. The quality and depth of the examples are truly impressive, and they demonstrate the team's commitment to making MLX accessible to developers of all levels.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--fSLA7Vfa--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://raw.githubusercontent.com/JosefAlbers/Phi-3-Vision-MLX/main/assets/tutorial_part0_awni.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--fSLA7Vfa--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://raw.githubusercontent.com/JosefAlbers/Phi-3-Vision-MLX/main/assets/tutorial_part0_awni.png" width="558" height="794"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I also want to give a special shoutout to awni, the repo owner, who was incredibly kind and patient with me when I made my first-ever pull request to this repository. Despite my lack of experience with Git and GitHub, awni guided me through the process and helped me navigate the pre-commit hooks and other nuances of the repository. Their patience and willingness to help a newcomer like me were truly appreciated, and I'm grateful for the opportunity to have contributed. If you're new to MLX or Git, I highly recommend checking out this repository and reaching out to awni - they're a great resource and a pleasure to work with!&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;MLX-VLM (&lt;a href="https://github.com/Blaizzy/mlx-vlm" rel="noopener noreferrer"&gt;https://github.com/Blaizzy/mlx-vlm&lt;/a&gt;): A package specifically for running Vision Language Models on Mac using MLX. This repository was particularly helpful in understanding how to handle multimodal inputs, and I found the well-organized and well-written code to be incredibly valuable in learning not only Vision Language Models (VLMs) but also the MLX framework in general. The codebase is a great example of how to structure and implement complex AI models using MLX, making it an excellent resource for anyone looking to learn from experienced developers and improve their own MLX skills. For those interested in other models, Prince Canuma has an excellent tutorial on running Google's Gemma 2 locally on Mac using MLX:&lt;/p&gt;

&lt;p&gt;&lt;iframe width="710" height="399" src="https://www.youtube.com/embed/CKznaU1HpVQ"&gt;
&lt;/iframe&gt;
&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Hugging Face (&lt;a href="https://huggingface.co/" rel="noopener noreferrer"&gt;https://huggingface.co/&lt;/a&gt;): A popular platform for natural language processing (NLP) and computer vision tasks, providing a vast range of pre-trained models, datasets, and tools. Hugging Face’s Transformers library is particularly useful for working with transformer-based models like Phi-3-Vision. Their documentation and community support are also top-notch, making it an excellent resource for anyone looking to learn more about NLP and computer vision.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;These resources provide a great foundation for anyone looking to explore MLX and run advanced AI models on Apple Silicon.&lt;/p&gt;

&lt;h2&gt;
  
  
  What to Expect in This Series:
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. MLX vs. Hugging Face: A Code Comparison
&lt;/h3&gt;

&lt;p&gt;We'll start by comparing the original Hugging Face implementation with our MLX port, highlighting key differences in syntax and how MLX leverages Apple Silicon's unified memory architecture.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Implementing Su-RoPE for 128K Context
&lt;/h3&gt;

&lt;p&gt;We'll explore the Su-scaled Rotary Position Embedding (Su-RoPE) implementation that enables Phi-3-Vision to handle impressive 128K token contexts.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Optimizing Text Generation in MLX: From Batching to Advanced Techniques
&lt;/h3&gt;

&lt;p&gt;Learn how to implement efficient batch text generation, a crucial feature for many real-world applications. We'll also cover custom KV-Cache implementation, streaming capabilities, and other text generation optimizations.&lt;/p&gt;

&lt;h3&gt;
  
  
  4. LoRA Fine-tuning and Evaluation on MLX
&lt;/h3&gt;

&lt;p&gt;Discover how to perform Low-Rank Adaptation (LoRA) training, enabling efficient fine-tuning of Phi-3-Vision on custom datasets. We'll also develop comprehensive evaluation techniques to ensure our LoRA-adapted model meets or exceeds the original's performance.&lt;/p&gt;

&lt;h3&gt;
  
  
  5. Building a Versatile AI Agent
&lt;/h3&gt;

&lt;p&gt;In our finale, we'll create a multi-modal AI agent showcasing Phi-3-Vision's capabilities in handling both text and visual inputs.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why This Series Matters:
&lt;/h2&gt;

&lt;p&gt;Phi-3-Vision represents a significant advancement in compact, high-performing vision-language models. By porting it to MLX, we're making it more accessible and efficient for a wide range of applications on Apple Silicon devices, demonstrating the potential of running advanced AI models on consumer-grade hardware.&lt;/p&gt;

&lt;h2&gt;
  
  
  Who This Series Is For:
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;AI enthusiasts and hobbyists looking to dive deeper into model optimization&lt;/li&gt;
&lt;li&gt;Researchers exploring efficient AI on consumer hardware&lt;/li&gt;
&lt;li&gt;Mac users eager to leverage their devices for AI tasks&lt;/li&gt;
&lt;li&gt;Anyone curious about the intersection of AI and Apple Silicon&lt;/li&gt;
&lt;li&gt;Beginners interested in contributing to open source AI projects&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Stay Tuned!
&lt;/h2&gt;

&lt;p&gt;Our journey into optimizing Phi-3-Vision for MLX promises to be full of insights, challenges, and exciting breakthroughs. Whether you're a fellow hobbyist looking to run advanced AI models on your Mac or simply curious about the future of AI on consumer devices, this series has something for you.&lt;/p&gt;

&lt;p&gt;Join me on this adventure in AI optimization, and let's unlock the full potential of Phi-3-Vision on Apple Silicon together!&lt;/p&gt;

</description>
      <category>python</category>
      <category>mlx</category>
      <category>vlm</category>
      <category>llm</category>
    </item>
  </channel>
</rss>
