Introduction
There has been a lot of buzz recently about Black Forest Labs. AI researchers involved in the development of image generation models such as Stable Diffusion have launched this new AI development company, and they have announced FLUX.1, an image generation AI model with 12 billion parameters whose weights are openly available.
In this post, I'll walk through the process of running the image generation AI model FLUX.1 on my MacBook (M2).
All the code used here is written in Python.
About FLUX.1
For a detailed explanation, you can check out the official announcement “Announcing Black Forest Labs - Black Forest Labs.” Here’s a quick overview of the three available models:
- FLUX.1 [pro]: A cutting-edge image generation model available via API access.
- FLUX.1 [dev]: A model for non-commercial applications, available on Hugging Face. For commercial use, inquiries are required.
- FLUX.1 [schnell]: A fast model optimized for local development and personal use, released under the Apache 2.0 license. It's available on Hugging Face, and inference code is provided on GitHub and through Hugging Face's Diffusers, with ComfyUI integration supported.
I initially wanted to try FLUX.1 [pro], but its API is currently invite-only for selected partners. It is possible to use it via platforms like Replicate or Fal.ai, but instead I'll be using FLUX.1 [dev], the highest-quality model that supports local generation. Towards the end of this article, I'll also include an example using FLUX.1 [schnell].
For those interested in quickly testing image generation, you can try the following links. Note that some services require payment.
- FLUX.1 [pro]
- FLUX.1 [dev]
- FLUX.1 [schnell]
My Setup
I’m running this on a MacBook Pro (Apple M2 Pro Chip / 16GB RAM), with the operating system being macOS 14 Sonoma.
The necessary packages include diffusers, sentencepiece, t5, torch, and transformers. For diffusers, I followed the official installation guide from GitHub.
Here’s the installation command, with a specific version of torch for reasons I’ll explain later:
$ pip install sentencepiece torch==2.3.1 transformers git+https://github.com/huggingface/diffusers.git
Here are the versions running in my environment:
$ pip list | grep -e diffusers -e sentencepiece -e t5 -e torch -e transformers
diffusers 0.30.0.dev0
sentencepiece 0.2.0
t5 0.9.4
torch 2.3.1
transformers 4.43.3
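Before moving on, it can be worth confirming that the pinned torch version is the one actually in use and that the MPS backend is available. Here's a minimal check (nothing FLUX-specific, just standard torch calls):
import torch

print(torch.__version__)                  # expect 2.3.1 in this setup
print(torch.backends.mps.is_available())  # True if the Apple GPU can be used via MPS
print(torch.backends.mps.is_built())      # True if this torch build includes MPS support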
Although torch 2.4.0 was the latest version as of August 5, 2024, I downgraded to 2.3.1. I initially tried 2.4.0, but the generated images were too noisy. After some research, I found that downgrading to 2.3.1 resolved the issue, though I couldn’t confirm the exact reason.
Here’s an example of a noisy image:
Obtaining Access
For this setup, I accessed the model via Hugging Face, so you’ll need a Hugging Face account.
Once you’ve created an account and logged in, go to the FLUX.1-dev page, where you’ll be asked to agree to the terms.
After reading and agreeing to the terms, click Agree and access repository.
You’ll see a confirmation message like this, indicating that access has been granted:
Creating an Access Token
To authenticate the model download, you’ll need to create an access token on Hugging Face.
Navigate to Settings, then to Access Tokens.
Click Create new Access Token.
Since you only need read permissions, select Read, give it a name, and generate the token.
To avoid hardcoding, store this token as an environment variable.
Open your shell configuration file (in my case, it's .zshrc):
$ open ~/.zshrc
Then add the generated access token to the file as an environment variable. I used HUGGING_FACE_TOKEN as the variable name.
export HUGGING_FACE_TOKEN=your_access_token_here
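After saving the file, reload your shell configuration with source ~/.zshrc (or open a new terminal) so the variable is available. If you want to confirm the token works before downloading anything, one option is to query your account with huggingface_hub, which is installed alongside diffusers (a small sketch; this check isn't required):
import os
from huggingface_hub import whoami

token = os.getenv("HUGGING_FACE_TOKEN")
print(whoami(token=token)["name"])  # prints your Hugging Face username if the token is valid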
Specifying PYTORCH_MPS_HIGH_WATERMARK_RATIO
Since my MacBook has only 16GB of RAM, I ran into a memory shortage when trying to execute the model, resulting in this error:
RuntimeError: MPS backend out of memory
Use PYTORCH_MPS_HIGH_WATERMARK_RATIO=0.0 to disable the upper limit for memory allocations (may cause system failure).
This issue stems from insufficient memory for the MPS GPU on my MacBook.
As suggested, I set the following environment variable to remove the upper limit for MPS usage:
export PYTORCH_MPS_HIGH_WATERMARK_RATIO=0.0
Be cautious, as removing this limit may cause the system to crash due to memory exhaustion.
Initially, I tried using a non-zero value, but that led to the following error, so I set it to 0.0:
RuntimeError: invalid low watermark ratio 1.4
It’s also possible to clear the MPS cache with torch.mps.empty_cache(), which may help avoid memory issues.
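As a sketch, both of these memory-related settings can also be handled from inside the Python script. My assumption here is that PYTORCH_MPS_HIGH_WATERMARK_RATIO needs to be set before torch initializes the MPS allocator, so I set it before importing torch:
import os

# Remove the MPS memory allocation limit (may cause system instability, as noted above).
# Assumption: setting it before importing torch is early enough for the allocator to pick it up.
os.environ["PYTORCH_MPS_HIGH_WATERMARK_RATIO"] = "0.0"

import torch

# ... load the pipeline and generate images here ...

# Free cached MPS memory between generations to reduce memory pressure.
torch.mps.empty_cache()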
Fixing transformer_flux.py
Even after these adjustments, I encountered another error:
scale = torch.arange(0, dim, 2, dtype=torch.float64, device=pos.device) / dim
TypeError: Cannot convert a MPS Tensor to float64 dtype as the MPS framework doesn't support float64. Please use float32 instead.
This error means float32 needs to be used instead of float64. You can fix it by modifying transformer_flux.py like this:
# scale = torch.arange(0, dim, 2, dtype=torch.float64, device=pos.device) / dim
scale = torch.arange(0, dim, 2, dtype=torch.get_default_dtype(), device=pos.device) / dim
This issue is also mentioned in the GitHub issue “flux does not work on MPS devices,” and it might be addressed soon. A similar issue was raised in ComfyUI: “FLUX Issue | MPS framework doesn’t support float64.”
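If you're not sure where transformer_flux.py lives in your installed diffusers package, printing the module path points you at the file to edit. The module location below is what I see in diffusers 0.30; it may move in other versions:
from diffusers.models.transformers import transformer_flux

print(transformer_flux.__file__)  # path to the transformer_flux.py file to patch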
Code
The official sample code doesn’t work directly, so I made the following changes:
- Specifying the access token.
- Setting mps as the device to utilize the MacBook GPU.
- Removing the call to enable_model_cpu_offload.
- This function assumes CUDA, leading to an AssertionError: Torch not compiled with CUDA enabled.
Here’s the modified code, with comments noting the changes from the official sample:
import torch
from diffusers import FluxPipeline
import os  # Added for environment variable access

hf_token = os.getenv("HUGGING_FACE_TOKEN")  # Retrieve the Hugging Face token

pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    torch_dtype=torch.bfloat16,
    token=hf_token  # Specify the Hugging Face token
)
pipe.to(torch.device("mps"))  # Specify MPS as the device
# pipe.enable_model_cpu_offload()  # Removed

prompt = "A cat holding a sign that says hello world"
image = pipe(
    prompt,
    height=1024,
    width=1024,
    guidance_scale=3.5,
    output_type="pil",
    num_inference_steps=50,
    max_sequence_length=512,
    generator=torch.Generator("cpu").manual_seed(0)
).images[0]
image.save("flux-dev.png")
If you’re testing, you can speed things up by reducing num_inference_steps. You can control seed generation with generator. For details on the parameters, refer to the API reference.
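For example, a quicker test run might look like the following. It reuses the pipe object from the script above; the smaller resolution and step count are illustrative values I chose, not recommendations from the official sample:
# Quick test: reuses `pipe` from the script above, trading quality for speed.
test_image = pipe(
    "A cat holding a sign that says hello world",
    height=512,
    width=512,
    guidance_scale=3.5,
    num_inference_steps=20,                            # fewer steps than the 50 used above
    generator=torch.Generator("cpu").manual_seed(42),  # change the seed to vary the result
).images[0]
test_image.save("flux-dev-test.png")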
The model generates an image based on the prompt “A cat holding a sign that says hello world.”
The first time you run the code, the model and dependencies (over 30GB) will be downloaded, which may take some time depending on your network speed.
Results
While there are areas for improvement, the result is quite clean, and the text “Hello World” is clearly legible. The prompt is faithfully represented in the image.
I also tried a different prompt:
prompt = "anime style, Japanese moe heroine"
Here’s the result:
While I won’t go into the nuances of moe, the generated image is a high-quality, accurate representation of the prompt.
For comparison, here’s an image generated with the same prompt using Stable Image Ultra.
Comparing them, I’d say FLUX.1 is closer to the concept of a moe heroine.
Testing FLUX.1 [schnell]
I also tested FLUX.1 [schnell], the faster model that generates images in fewer steps. Using the official sample code with the same modifications as before (access token, MPS device, no CPU offload), I generated the following image:
import torch
from diffusers import FluxPipeline
import os  # Added for environment variable access

hf_token = os.getenv("HUGGING_FACE_TOKEN")  # Retrieve the Hugging Face token

pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-schnell",
    torch_dtype=torch.bfloat16,
    token=hf_token  # Specify the Hugging Face token
)
pipe.to(torch.device("mps"))  # Specify MPS as the device
# pipe.enable_model_cpu_offload()  # Removed

prompt = "A cat holding a sign that says hello world"
image = pipe(
    prompt,
    guidance_scale=0.0,
    output_type="pil",
    num_inference_steps=4,
    max_sequence_length=256,
    generator=torch.Generator("cpu").manual_seed(0)
).images[0]
image.save("flux-schnell.png")
As with FLUX.1 [dev], the first run will involve downloading over 30GB of data, which might take a while.
Here’s the result:
I also tried using the prompt “anime style, Japanese moe heroine”:
While the quality isn’t quite as high as FLUX.1 [dev], both images faithfully reflect the prompt and are visually appealing. With fewer inference steps, FLUX.1 [schnell] generates images much faster than FLUX.1 [dev].
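If you want to quantify the speed difference on your own machine, you can wrap the pipeline call in a simple timer, something like this rough sketch (times will vary with hardware and whether the model is already downloaded):
import time

start = time.perf_counter()
image = pipe(
    prompt,
    guidance_scale=0.0,
    num_inference_steps=4,
    max_sequence_length=256,
    generator=torch.Generator("cpu").manual_seed(0)
).images[0]
print(f"Generation took {time.perf_counter() - start:.1f} seconds")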
Conclusion
I'm eager to try the API when it becomes available. The official announcement also gives a glimpse of what's coming next:
Today we release the FLUX.1 text-to-image model suite. With their strong creative capabilities, these models serve as a powerful foundation for our upcoming suite of competitive generative text-to-video systems. Our video models will unlock precise creation and editing at high definition and unprecedented speed. We are committed to continue pioneering the future of generative media.
I’m really looking forward to seeing what they develop next in the area of video generation.
Japanese Version of the Article
I tried running FLUX.1 ([dev]/[schnell]), the image generation AI model announced by the original Stable Diffusion developers, on a MacBook (M2)