<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: nabata</title>
    <description>The latest articles on DEV Community by nabata (@nabata).</description>
    <link>https://dev.to/nabata</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F2138198%2F052fa389-9016-46f5-9850-03b685dc51e2.png</url>
      <title>DEV Community: nabata</title>
      <link>https://dev.to/nabata</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/nabata"/>
    <language>en</language>
    <item>
      <title>Replacing Only the Background of an Image with AI Generation Using the Stable Diffusion Web API</title>
      <dc:creator>nabata</dc:creator>
      <pubDate>Fri, 08 Nov 2024 15:09:45 +0000</pubDate>
      <link>https://dev.to/nabata/replacing-only-the-background-of-an-image-with-ai-generation-using-the-stable-diffusion-web-api-2oia</link>
      <guid>https://dev.to/nabata/replacing-only-the-background-of-an-image-with-ai-generation-using-the-stable-diffusion-web-api-2oia</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;This guide demonstrates how to replace the background of an image using Python code only, without relying on image editing software like Photoshop. The goal is to keep the subject intact while swapping in an AI-generated background.&lt;/p&gt;

&lt;p&gt;While this approach may not be revolutionary, it addresses a common need, so I hope it will be helpful for those with similar requirements.&lt;/p&gt;

&lt;h2&gt;
  
  
  Input and Output Images
&lt;/h2&gt;

&lt;p&gt;Let’s start with the results.&lt;/p&gt;

&lt;p&gt;The following output image was generated from the input image shown below.&lt;/p&gt;

&lt;h3&gt;
  
  
  Input Image
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fymhmxukn1yl0ujcm7lvb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fymhmxukn1yl0ujcm7lvb.png" alt="input" width="800" height="800"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Output Image
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4u8hf6kbqzgbaa92163g.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4u8hf6kbqzgbaa92163g.png" alt="output" width="800" height="800"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Libraries
&lt;/h2&gt;

&lt;p&gt;Install &lt;a href="https://pypi.org/project/requests/" rel="noopener noreferrer"&gt;requests&lt;/a&gt; to handle the API calls.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="gp"&gt;$&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;requests
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I verified the version as follows:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="gp"&gt;$&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;pip list | &lt;span class="nb"&gt;grep&lt;/span&gt; &lt;span class="nt"&gt;-e&lt;/span&gt; requests
&lt;span class="go"&gt;requests                  2.31.0
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  API Key
&lt;/h2&gt;

&lt;p&gt;For background generation, we’ll use &lt;a href="https://ja.stability.ai/" rel="noopener noreferrer"&gt;Stability AI’s&lt;/a&gt; Web API.&lt;/p&gt;

&lt;p&gt;To access this API, you’ll need to obtain an API Key from their &lt;a href="https://platform.stability.ai/" rel="noopener noreferrer"&gt;Developer Platform&lt;/a&gt;. For pricing, refer to the &lt;a href="https://platform.stability.ai/pricing" rel="noopener noreferrer"&gt;Pricing&lt;/a&gt; page.&lt;/p&gt;

&lt;p&gt;To keep your key secure, save it as an environment variable rather than hardcoding it in your code.&lt;/p&gt;

&lt;p&gt;In my environment, I store it in the &lt;code&gt;~/.zshrc&lt;/code&gt; configuration file.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="gp"&gt;$&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;open ~/.zshrc
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I saved the key under the name &lt;strong&gt;STABILITY_API_KEY&lt;/strong&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="go"&gt;export STABILITY_API_KEY=your_api_key_here
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Code
&lt;/h2&gt;

&lt;p&gt;Here, we use the &lt;a href="https://platform.stability.ai/docs/api-reference#tag/Edit/paths/~1v2beta~1stable-image~1edit~1remove-background/post" rel="noopener noreferrer"&gt;Remove Background API&lt;/a&gt; to isolate the subject. We then pass the extracted image to the &lt;a href="https://platform.stability.ai/docs/api-reference#tag/Edit/paths/~1v2beta~1stable-image~1edit~1inpaint/post" rel="noopener noreferrer"&gt;Inpaint API&lt;/a&gt; to create the new background.&lt;/p&gt;

&lt;p&gt;The prompt used is "&lt;strong&gt;Large glass windows with a view of the metropolis behind&lt;/strong&gt;".&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;

&lt;span class="c1"&gt;# File paths
&lt;/span&gt;&lt;span class="n"&gt;input_path&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;./input.png&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;  &lt;span class="c1"&gt;# Original image
&lt;/span&gt;&lt;span class="n"&gt;mask_path&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;./mask.png&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;  &lt;span class="c1"&gt;# Mask image (temporarily generated)
&lt;/span&gt;&lt;span class="n"&gt;output_path&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;./output.png&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;  &lt;span class="c1"&gt;# Output image
&lt;/span&gt;
&lt;span class="c1"&gt;# Check for API Key
&lt;/span&gt;&lt;span class="n"&gt;api_key&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getenv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;STABILITY_API_KEY&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;api_key&lt;/span&gt; &lt;span class="ow"&gt;is&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="nc"&gt;Exception&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Missing Stability API key.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;headers&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Accept&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;image/*&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Authorization&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Bearer &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;api_key&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;# Call Remove Background API
&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;post&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://api.stability.ai/v2beta/stable-image/edit/remove-background&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;files&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;image&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;input_path&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;rb&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;output_format&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;png&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Save mask image
&lt;/span&gt;&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;status_code&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;200&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;mask_path&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;wb&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="nb"&gt;file&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="nb"&gt;file&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;write&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="nc"&gt;Exception&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;()))&lt;/span&gt;

&lt;span class="c1"&gt;# Call Inpaint API
&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;post&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://api.stability.ai/v2beta/stable-image/edit/inpaint&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;files&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;image&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;mask_path&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;rb&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;prompt&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Large glass windows with a view of the metropolis behind&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;output_format&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;png&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;grow_mask&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;# Disable blurring around the mask
&lt;/span&gt;    &lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Delete mask image
&lt;/span&gt;&lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;remove&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;mask_path&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Save output image
&lt;/span&gt;&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;status_code&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;200&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;output_path&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;wb&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="nb"&gt;file&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="nb"&gt;file&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;write&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="nc"&gt;Exception&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;()))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Using rembg
&lt;/h2&gt;

&lt;p&gt;Another approach to background removal is &lt;a href="https://github.com/danielgatis/rembg" rel="noopener noreferrer"&gt;rembg&lt;/a&gt;. It requires only one paid API call instead of two, making it more cost-effective, though its extraction accuracy may differ from that of the Remove Background API.&lt;/p&gt;

&lt;p&gt;First, install &lt;code&gt;rembg&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="gp"&gt;$&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;rembg
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I verified the version as follows:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="gp"&gt;$&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;pip list | &lt;span class="nb"&gt;grep&lt;/span&gt; &lt;span class="nt"&gt;-e&lt;/span&gt; rembg
&lt;span class="go"&gt;rembg                     2.0.59
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here’s the code for this approach:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;rembg&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;remove&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;

&lt;span class="c1"&gt;# File paths
&lt;/span&gt;&lt;span class="n"&gt;input_path&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;./input.png&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;  &lt;span class="c1"&gt;# Input image path
&lt;/span&gt;&lt;span class="n"&gt;mask_path&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;./mask.png&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;  &lt;span class="c1"&gt;# Mask image path (temporarily generated)
&lt;/span&gt;&lt;span class="n"&gt;output_path&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;./output.png&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;  &lt;span class="c1"&gt;# Output image path
&lt;/span&gt;
&lt;span class="c1"&gt;# Generate mask image with background removed
&lt;/span&gt;&lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;input_path&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;rb&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;mask_path&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;wb&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;o&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;input_image&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;read&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="n"&gt;mask_image&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;remove&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;input_image&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;o&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;write&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;mask_image&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Check for API Key
&lt;/span&gt;&lt;span class="n"&gt;api_key&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getenv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;STABILITY_API_KEY&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;api_key&lt;/span&gt; &lt;span class="ow"&gt;is&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="nc"&gt;Exception&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Missing Stability API key.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Call Inpaint API
&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;post&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://api.stability.ai/v2beta/stable-image/edit/inpaint&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Accept&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;image/*&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Authorization&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Bearer &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;api_key&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="n"&gt;files&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;image&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;mask_path&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;rb&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;prompt&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Large glass windows with a view of the metropolis behind&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;output_format&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;png&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;grow_mask&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Delete mask image
&lt;/span&gt;&lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;remove&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;mask_path&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Save output image
&lt;/span&gt;&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;status_code&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;200&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;output_path&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;wb&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="nb"&gt;file&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="nb"&gt;file&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;write&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="nc"&gt;Exception&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;()))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here’s the output image. In this case, the accuracy of the extraction seems satisfactory.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F34cqijqkuqzckivlv25w.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F34cqijqkuqzckivlv25w.png" alt="output_2" width="800" height="800"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If you set up a local Stable Diffusion environment, you can eliminate API call costs, so feel free to explore that option if it suits your needs.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Being able to achieve this through code alone is highly convenient.&lt;/p&gt;

&lt;p&gt;It’s exciting to witness the ongoing improvements in workflow efficiency.&lt;/p&gt;

&lt;h2&gt;
  
  
  Original Japanese Article
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://qiita.com/nabata/items/6f39d2a56fb4bce02a96" rel="noopener noreferrer"&gt;Stable DiffusionのWeb APIを用いて画像上の人物はそのままに背景だけをAI生成で入れ替えてみた&lt;/a&gt;&lt;/p&gt;

</description>
      <category>stablediffusion</category>
      <category>ai</category>
      <category>python</category>
    </item>
    <item>
      <title>Comparing Prompt Accuracy Across Various Image Generation AIs (Stable Diffusion 3.5, FLUX1.1, Imagen 3, DALL·E 3, Adobe Firefly)</title>
      <dc:creator>nabata</dc:creator>
      <pubDate>Sat, 02 Nov 2024 03:21:14 +0000</pubDate>
      <link>https://dev.to/nabata/comparing-prompt-accuracy-across-various-image-generation-ais-stable-diffusion-35-flux11-imagen-3-dalle-3-adobe-firefly-1jlp</link>
      <guid>https://dev.to/nabata/comparing-prompt-accuracy-across-various-image-generation-ais-stable-diffusion-35-flux11-imagen-3-dalle-3-adobe-firefly-1jlp</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;Recently, &lt;a href="https://stability.ai/" rel="noopener noreferrer"&gt;Stability AI&lt;/a&gt; introduced &lt;strong&gt;Stable Diffusion 3.5&lt;/strong&gt;.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Today we are introducing Stable Diffusion 3.5. This open release includes multiple model variants, including Stable Diffusion 3.5 Large and Stable Diffusion 3.5 Large Turbo. Additionally, Stable Diffusion 3.5 Medium will be released on October 29th.  &lt;/p&gt;

&lt;p&gt;Reference: &lt;a href="https://stability.ai/news/introducing-stable-diffusion-3-5" rel="noopener noreferrer"&gt;Stable Diffusion 3.5 — Stability AI&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The release announcement also states, "&lt;strong&gt;Additionally, our analysis shows that Stable Diffusion 3.5 Large leads the market in prompt adherence and rivals much larger models in image quality&lt;/strong&gt;." This article puts that claim to the test by comparing prompt adherence across several popular image generation models.&lt;/p&gt;

&lt;p&gt;Please note that this evaluation is subjective; treat it as a rough reference for how these models handle straightforward prompts, which may not always yield ideal results.&lt;/p&gt;

&lt;h2&gt;
  
  
  Image Generation AIs Used in This Comparison
&lt;/h2&gt;

&lt;p&gt;The models tested include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://ja.stability.ai/blog/introducing-stable-diffusion-3-5" rel="noopener noreferrer"&gt;Stable Diffusion 3.5 Large&lt;/a&gt; by &lt;a href="https://stability.ai/" rel="noopener noreferrer"&gt;Stability AI&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://blackforestlabs.ai/announcing-flux-1-1-pro-and-the-bfl-api/" rel="noopener noreferrer"&gt;FLUX1.1 [pro]&lt;/a&gt; by &lt;a href="https://blackforestlabs.ai/" rel="noopener noreferrer"&gt;Black Forest Labs&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://deepmind.google/technologies/imagen-3/" rel="noopener noreferrer"&gt;Imagen 3&lt;/a&gt; by &lt;a href="https://www.google.com/" rel="noopener noreferrer"&gt;Google&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://openai.com/index/dall-e-3/" rel="noopener noreferrer"&gt;DALL·E 3&lt;/a&gt; by &lt;a href="https://openai.com/" rel="noopener noreferrer"&gt;OpenAI&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.adobe.com/jp/products/firefly.html" rel="noopener noreferrer"&gt;Adobe Firefly&lt;/a&gt; by &lt;a href="https://www.adobe.com/" rel="noopener noreferrer"&gt;Adobe&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Each model was tested once per prompt. If multiple images were generated simultaneously, I selected the "top-left" result.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Stable Diffusion 3.5 Large&lt;/strong&gt; and &lt;strong&gt;FLUX1.1 [pro]&lt;/strong&gt; images were generated through their Web APIs, while the others were created directly in the browser: &lt;strong&gt;Imagen 3&lt;/strong&gt; was accessed through ImageFX and &lt;strong&gt;DALL·E 3&lt;/strong&gt; through ChatGPT.&lt;/p&gt;

&lt;p&gt;For &lt;strong&gt;Firefly&lt;/strong&gt;, I used &lt;strong&gt;Firefly Image 3&lt;/strong&gt; with &lt;strong&gt;Fast mode&lt;/strong&gt; turned off, then upscaled the images after generation. As a result, &lt;strong&gt;Firefly&lt;/strong&gt; images are 2048x2048, while all other images are 1024x1024.&lt;/p&gt;

&lt;p&gt;The code used to generate &lt;strong&gt;FLUX1.1 [pro]&lt;/strong&gt; images is adapted from the article "&lt;a href="https://dev.to/nabata/using-the-web-api-for-flux-11-pro-the-latest-image-generation-ai-model-by-the-original-team-of-stable-diffusion-29pi"&gt;Using the Web API for FLUX 1.1 [pro]: The Latest Image Generation AI Model by the Original Team of Stable Diffusion&lt;/a&gt;" with the size updated to 1024x1024.&lt;/p&gt;
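&lt;p&gt;For reference, the FLUX generation flow is a submit-then-poll pattern rather than a single request. The sketch below is a minimal version of that flow, assuming the endpoints and response fields described in the linked article; the &lt;strong&gt;BFL_API_KEY&lt;/strong&gt; variable name is my own choice, and the API details may have changed since.&lt;/p&gt;

```python
import os
import time


def build_payload(prompt: str, width: int = 1024, height: int = 1024) -> dict:
    """Request body for a FLUX1.1 [pro] generation (field names assumed)."""
    return {"prompt": prompt, "width": width, "height": height}


def generate_flux(prompt: str) -> bytes:
    """Submit a generation task, poll until ready, and return the image bytes."""
    import requests  # same HTTP library as the Stability example below

    api_key = os.getenv("BFL_API_KEY")  # assumed environment variable name
    if api_key is None:
        raise Exception("Missing Black Forest Labs API key.")

    # Submit the generation task
    task = requests.post(
        "https://api.bfl.ml/v1/flux-pro-1.1",
        headers={"x-key": api_key, "Content-Type": "application/json"},
        json=build_payload(prompt),
    ).json()

    # Poll until the result is ready, then download the image
    while True:
        result = requests.get(
            "https://api.bfl.ml/v1/get_result",
            headers={"x-key": api_key},
            params={"id": task["id"]},
        ).json()
        if result["status"] == "Ready":
            return requests.get(result["result"]["sample"]).content
        time.sleep(1)
```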

&lt;p&gt;Below is the code used for &lt;strong&gt;Stable Diffusion 3.5 Large&lt;/strong&gt;. The &lt;strong&gt;STABILITY_API_KEY&lt;/strong&gt; environment variable stores the API key. For more details, see the &lt;a href="https://platform.stability.ai/docs/api-reference" rel="noopener noreferrer"&gt;API Reference&lt;/a&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;

&lt;span class="n"&gt;api_host&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getenv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;API_HOST&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;https://api.stability.ai&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;api_key&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getenv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;STABILITY_API_KEY&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;prompt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Describe the prompt here&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

&lt;span class="c1"&gt;# Ensure API Key is available
&lt;/span&gt;&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;api_key&lt;/span&gt; &lt;span class="ow"&gt;is&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="nc"&gt;Exception&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Missing Stability API key.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# API call
&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;post&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;api_host&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;/v2beta/stable-image/generate/sd3&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Accept&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;image/*&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Authorization&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Bearer &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;api_key&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="n"&gt;files&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;none&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;''&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;prompt&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;output_format&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;png&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;model&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;sd3.5-large&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Save image with timestamped filename
&lt;/span&gt;&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;status_code&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;200&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;./&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nf"&gt;int&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;time&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;.png&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;wb&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="nb"&gt;file&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="nb"&gt;file&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;write&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="nc"&gt;Exception&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;()))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
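&lt;p&gt;One caveat when running several comparison prompts in a loop: the &lt;strong&gt;int(time.time())&lt;/strong&gt; filename above can collide if two images finish within the same second. A minimal sketch that wraps the same call in a reusable function and uses a nanosecond timestamp instead (the helper names here are my own, not part of the API):&lt;/p&gt;

```python
import os
import time


def output_filename() -> str:
    """Timestamped PNG name; nanoseconds avoid same-second collisions."""
    return f"./{time.time_ns()}.png"


def generate_sd35(prompt: str) -> bytes:
    """Wraps the Stability API call shown above and returns the PNG bytes."""
    import requests

    api_key = os.getenv("STABILITY_API_KEY")
    if api_key is None:
        raise Exception("Missing Stability API key.")
    response = requests.post(
        "https://api.stability.ai/v2beta/stable-image/generate/sd3",
        headers={"Accept": "image/*", "Authorization": f"Bearer {api_key}"},
        files={"none": ""},
        data={"prompt": prompt, "output_format": "png", "model": "sd3.5-large"},
    )
    if response.status_code != 200:
        raise Exception(str(response.json()))
    return response.content


def run_comparison(prompts: list) -> None:
    """Generate one image per prompt, each saved under a unique filename."""
    for prompt in prompts:
        with open(output_filename(), "wb") as f:
            f.write(generate_sd35(prompt))
```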



&lt;p&gt;Now, let’s dive into the results.&lt;/p&gt;

&lt;h2&gt;
  
  
  No.1 - A Single Banana
&lt;/h2&gt;

&lt;p&gt;One known issue in AI image generation is &lt;strong&gt;The Lone Banana Problem&lt;/strong&gt;.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;The bias to two bananas in a picture is, I believe, an example of a subtle bias (OK, it’s not that subtle, but it is more subtle than many of the more concerning news-grabbing biases that we regularly read about). A naïve explanation may be that in the training dataset there have been many pictures of bananas added to Midjourney’s database that have been labelled “banana” but not labelled “two bananas”. It may also be that Midjourney has never seen an individual banana, so it doesn’t know that a single banana is possible.&lt;/p&gt;

&lt;p&gt;Reference: &lt;a href="https://www.digital-science.com/tldr/article/the-lone-banana-problem-or-the-new-programming-speaking-ai/" rel="noopener noreferrer"&gt;The Lone Banana Problem. Or, the new programming: “speaking” AI - TL;DR - Digital Science&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;A similar phenomenon, known as &lt;a href="https://community.openai.com/t/incorrect-count-of-r-characters-in-the-word-strawberry/829618/5" rel="noopener noreferrer"&gt;The Strawberry Problem&lt;/a&gt;, has also recently become a topic of interest.&lt;/p&gt;

&lt;p&gt;To see how each model addresses this issue, I started with the following prompt:&lt;/p&gt;

&lt;h3&gt;
  
  
  Prompt
&lt;/h3&gt;

&lt;p&gt;There is a single banana on the table. There is a single banana hanging from the ceiling. There is a single banana placed on the chair. There is a man with a single banana on his head. There is a woman washing a single banana.&lt;/p&gt;

&lt;h3&gt;
  
  
  Stable Diffusion 3.5 Large
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbh0zhruxo0t283kvwhy3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbh0zhruxo0t283kvwhy3.png" alt="1_sd35" width="800" height="800"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  FLUX1.1 [pro]
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgcmevp7l093odeibjbor.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgcmevp7l093odeibjbor.jpg" alt="1_flux" width="800" height="800"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Imagen 3
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fz65udtyf6xwebgaorlli.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fz65udtyf6xwebgaorlli.jpg" alt="1_fx" width="800" height="800"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  DALL·E 3
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6nxsokcv3mo1ib23ifg1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6nxsokcv3mo1ib23ifg1.png" alt="1_dalle" width="800" height="800"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Adobe Firefly
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fab2nih4dj98i0lpb15vm.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fab2nih4dj98i0lpb15vm.jpg" alt="1_ff" width="800" height="800"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Comments
&lt;/h3&gt;

&lt;p&gt;Unfortunately, none of the models performed well on this prompt.&lt;/p&gt;

&lt;p&gt;It’s possible the prompt was too complex. My apologies.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;SD 3.5 Large&lt;/th&gt;
&lt;th&gt;FLUX1.1 [pro]&lt;/th&gt;
&lt;th&gt;Imagen 3&lt;/th&gt;
&lt;th&gt;DALL·E 3&lt;/th&gt;
&lt;th&gt;Firefly&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;✗&lt;/td&gt;
&lt;td&gt;✗&lt;/td&gt;
&lt;td&gt;✗&lt;/td&gt;
&lt;td&gt;✗&lt;/td&gt;
&lt;td&gt;✗&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  No.2 - Retrying the Single Banana
&lt;/h2&gt;

&lt;p&gt;I simplified the initial prompt and tried again to see if reducing complexity would improve results.&lt;/p&gt;

&lt;h3&gt;
  
  
  Prompt
&lt;/h3&gt;

&lt;p&gt;A single banana placed in the center of a white background. The banana should be ripe, with a bright yellow peel and a few brown spots, indicating its ripeness. The shape of the banana should be curved in a natural way, and it should be clearly identifiable as one piece of fruit without any additional objects or bananas in the image.&lt;/p&gt;

&lt;h3&gt;
  
  
  Stable Diffusion 3.5 Large
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgsgiifk7490ynxxpunjb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgsgiifk7490ynxxpunjb.png" alt="2_sd35" width="800" height="800"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  FLUX1.1 [pro]
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2h4eln3pwzjynl1lt164.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2h4eln3pwzjynl1lt164.jpg" alt="2_flux" width="800" height="800"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Imagen 3
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flnj1hwtq39fln3dhv2nn.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flnj1hwtq39fln3dhv2nn.jpg" alt="2_fx" width="800" height="800"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  DALL·E 3
&lt;/h3&gt;

&lt;p&gt;The following error prevented image generation:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;I couldn't generate the requested image because it didn't align with the content policy. If you have another idea or request, feel free to share, and I'll do my best to create it!&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  Adobe Firefly
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvkvyaztqizi771lgf4fy.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvkvyaztqizi771lgf4fy.jpg" alt="2_ff" width="800" height="800"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Comments
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Stable Diffusion 3.5 Large&lt;/strong&gt; again produced an unusual result, just as in the previous attempt, suggesting it struggles even with this simpler prompt.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Imagen 3&lt;/strong&gt; generated a banana that appears slightly under-ripe, and &lt;strong&gt;Firefly&lt;/strong&gt;’s result has a subtle unnatural quality. However, both images reasonably reflect the prompt’s intent.&lt;/p&gt;

&lt;p&gt;It’s unclear what aspect of the prompt conflicted with &lt;strong&gt;DALL·E 3&lt;/strong&gt;’s content policy.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;SD 3.5 Large&lt;/th&gt;
&lt;th&gt;FLUX1.1 [pro]&lt;/th&gt;
&lt;th&gt;Imagen 3&lt;/th&gt;
&lt;th&gt;DALL·E 3&lt;/th&gt;
&lt;th&gt;Firefly&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;✗&lt;/td&gt;
&lt;td&gt;✓&lt;/td&gt;
&lt;td&gt;✓&lt;/td&gt;
&lt;td&gt;-&lt;/td&gt;
&lt;td&gt;✓&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  No.3 - Space Battle
&lt;/h2&gt;

&lt;p&gt;To explore each AI's handling of more fantastical themes, I introduced a space battle scenario.&lt;/p&gt;

&lt;h3&gt;
  
  
  Prompt
&lt;/h3&gt;

&lt;p&gt;A large-scale space battle between two fleets of futuristic spaceships. Lasers and missiles are being fired, with explosions happening in the background. The scene takes place in deep space, with a distant galaxy visible in the background and some debris floating nearby.&lt;/p&gt;

&lt;h3&gt;
  
  
  Stable Diffusion 3.5 Large
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Foaawscik5ctl1muhsjuu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Foaawscik5ctl1muhsjuu.png" alt="3_sd35" width="800" height="800"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  FLUX1.1 [pro]
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3zpzagpduvmdqrsv3cx6.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3zpzagpduvmdqrsv3cx6.jpg" alt="3_flux" width="800" height="800"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Imagen 3
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fojx8insw7f5meya263m1.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fojx8insw7f5meya263m1.jpg" alt="3_fx" width="800" height="800"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  DALL·E 3
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2o5zwkksv2hbqt6ly6zk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2o5zwkksv2hbqt6ly6zk.png" alt="3_dalle" width="800" height="800"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Adobe Firefly
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgn9rmxti00qz5bnij4sw.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgn9rmxti00qz5bnij4sw.jpg" alt="3_ff" width="800" height="800"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Comments
&lt;/h3&gt;

&lt;p&gt;Direct comparisons were challenging due to varying interpretations by each AI. It’s difficult to identify both lasers and missiles in every image, and some results lack a strong sense of combat.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;SD 3.5 Large&lt;/th&gt;
&lt;th&gt;FLUX1.1 [pro]&lt;/th&gt;
&lt;th&gt;Imagen 3&lt;/th&gt;
&lt;th&gt;DALL·E 3&lt;/th&gt;
&lt;th&gt;Firefly&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;~&lt;/td&gt;
&lt;td&gt;~&lt;/td&gt;
&lt;td&gt;~&lt;/td&gt;
&lt;td&gt;~&lt;/td&gt;
&lt;td&gt;~&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  No.4 - Steampunk Invention
&lt;/h2&gt;

&lt;p&gt;Next, I tried a prompt centered on the steampunk genre to see how well each AI captures this distinct aesthetic.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Steampunk is a subgenre of science fiction that incorporates retrofuturistic technology and aesthetics inspired by, but not limited to, 19th-century industrial steam-powered machinery.&lt;/p&gt;

&lt;p&gt;Reference: &lt;a href="https://en.wikipedia.org/wiki/Steampunk" rel="noopener noreferrer"&gt;Steampunk - Wikipedia&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  Prompt
&lt;/h3&gt;

&lt;p&gt;An intricate steampunk device on a workbench, made of brass, gears, and glass tubes. The device is emitting a faint steam cloud, with tiny dials and gauges displaying various readings. Nearby, a pair of leather gloves and a set of old blueprints are scattered on the wooden table.&lt;/p&gt;

&lt;h3&gt;
  
  
  Stable Diffusion 3.5 Large
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmgkj3qyq0kk7vk01ftr4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmgkj3qyq0kk7vk01ftr4.png" alt="4_sd35" width="800" height="800"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  FLUX1.1 [pro]
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffu6fr5k3ipi7darkejg1.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffu6fr5k3ipi7darkejg1.jpg" alt="4_flux" width="800" height="800"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Imagen 3
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F416xb7d6sle9e5jhb6rn.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F416xb7d6sle9e5jhb6rn.jpg" alt="4_fx" width="800" height="800"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  DALL·E 3
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjjwbvoeqn1z3pmchi3qb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjjwbvoeqn1z3pmchi3qb.png" alt="4_de" width="800" height="800"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Adobe Firefly
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fja4hjuls1q8axofqvm7s.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fja4hjuls1q8axofqvm7s.jpg" alt="4_ff" width="800" height="800"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Comments
&lt;/h3&gt;

&lt;p&gt;Most images represented the prompt well, though &lt;strong&gt;DALL·E 3&lt;/strong&gt; missed the scattered blueprints, and the gloves did not appear as a pair.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Firefly&lt;/strong&gt; also did not include leather gloves.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;SD 3.5 Large&lt;/th&gt;
&lt;th&gt;FLUX1.1 [pro]&lt;/th&gt;
&lt;th&gt;Imagen 3&lt;/th&gt;
&lt;th&gt;DALL·E 3&lt;/th&gt;
&lt;th&gt;Firefly&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;✓&lt;/td&gt;
&lt;td&gt;✓&lt;/td&gt;
&lt;td&gt;✓&lt;/td&gt;
&lt;td&gt;✗&lt;/td&gt;
&lt;td&gt;✗&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  No.5 - Chibi-Style Character
&lt;/h2&gt;

&lt;p&gt;For this test, I prompted each AI to generate a distinctive character in a chibi style.&lt;/p&gt;

&lt;h3&gt;
  
  
  Prompt
&lt;/h3&gt;

&lt;p&gt;A chibi-style character of a smiling young girl with big eyes, short pink hair, and a school uniform. She is holding a small cat in her arms, standing on a grassy hill under a bright blue sky with fluffy clouds.&lt;/p&gt;

&lt;h3&gt;
  
  
  Stable Diffusion 3.5 Large
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj13vbytxlm70ebzzvcdj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj13vbytxlm70ebzzvcdj.png" alt="5_sd35" width="800" height="800"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  FLUX1.1 [pro]
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fv15801nzlv3eq02nu90r.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fv15801nzlv3eq02nu90r.jpg" alt="5_flux" width="800" height="800"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Imagen 3
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdm244oivtlxb8x0350hj.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdm244oivtlxb8x0350hj.jpg" alt="5_fx" width="800" height="800"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  DALL·E 3
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fufagpeqo5pbmhtz04y9y.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fufagpeqo5pbmhtz04y9y.png" alt="5_de" width="800" height="800"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Adobe Firefly
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjlp2j58as7naw03vyh77.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjlp2j58as7naw03vyh77.jpg" alt="5_ff" width="800" height="800"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Comments
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Imagen 3&lt;/strong&gt; did not meet the prompt specification for pink hair, and &lt;strong&gt;DALL·E 3&lt;/strong&gt; omitted the cat the girl was supposed to be holding.&lt;/p&gt;

&lt;p&gt;In &lt;strong&gt;Firefly&lt;/strong&gt;, the character was given cat ears, and both the cat and the girl’s hands are somewhat awkwardly rendered.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Stable Diffusion 3.5 Large&lt;/strong&gt; mostly captures the prompt details, though some aspects, like the cat’s body shape, appear slightly unnatural, so I rated it ~.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;SD 3.5 Large&lt;/th&gt;
&lt;th&gt;FLUX1.1 [pro]&lt;/th&gt;
&lt;th&gt;Imagen 3&lt;/th&gt;
&lt;th&gt;DALL·E 3&lt;/th&gt;
&lt;th&gt;Firefly&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;~&lt;/td&gt;
&lt;td&gt;✓&lt;/td&gt;
&lt;td&gt;✗&lt;/td&gt;
&lt;td&gt;✗&lt;/td&gt;
&lt;td&gt;✗&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  No.6 - Colorful Coral Reef
&lt;/h2&gt;

&lt;p&gt;Next, I asked the AIs to generate a serene and vibrant underwater scene.&lt;/p&gt;

&lt;h3&gt;
  
  
  Prompt
&lt;/h3&gt;

&lt;p&gt;A colorful underwater scene featuring a coral reef filled with vibrant fish, sea turtles, and a few small sharks. Sunlight beams are penetrating through the water's surface, illuminating the sea life and creating a beautiful, serene atmosphere.&lt;/p&gt;

&lt;h3&gt;
  
  
  Stable Diffusion 3.5 Large
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8ky2o0tjbcahffk9kyod.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8ky2o0tjbcahffk9kyod.png" alt="6_sd35" width="800" height="800"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  FLUX1.1 [pro]
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fp4nec2y79ap0ryqrvrdd.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fp4nec2y79ap0ryqrvrdd.jpg" alt="6_flux" width="800" height="800"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Imagen 3
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy25uodlfbfmz7tzaaht5.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy25uodlfbfmz7tzaaht5.jpg" alt="6_fx" width="800" height="800"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  DALL·E 3
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4e6uaxdxgu20z75lf7az.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4e6uaxdxgu20z75lf7az.png" alt="6_de" width="800" height="800"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Adobe Firefly
&lt;/h3&gt;

&lt;p&gt;This prompt returned a processing error, so &lt;strong&gt;Firefly&lt;/strong&gt; could not generate an image.&lt;/p&gt;

&lt;h3&gt;
  
  
  Comments
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;FLUX1.1 [pro]&lt;/strong&gt; was missing sea turtles, &lt;strong&gt;Imagen 3&lt;/strong&gt; lacked multiple turtles and sharks, and &lt;strong&gt;DALL·E 3&lt;/strong&gt; did not include any sharks.&lt;/p&gt;

&lt;p&gt;It’s unclear what caused &lt;strong&gt;Firefly&lt;/strong&gt;’s processing error.&lt;/p&gt;

&lt;p&gt;Incidentally, I’ve noticed that &lt;strong&gt;Imagen 3&lt;/strong&gt; frequently fails to generate images, even with other prompts.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;SD 3.5 Large&lt;/th&gt;
&lt;th&gt;FLUX1.1 [pro]&lt;/th&gt;
&lt;th&gt;Imagen 3&lt;/th&gt;
&lt;th&gt;DALL·E 3&lt;/th&gt;
&lt;th&gt;Firefly&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;✓&lt;/td&gt;
&lt;td&gt;✗&lt;/td&gt;
&lt;td&gt;✗&lt;/td&gt;
&lt;td&gt;✗&lt;/td&gt;
&lt;td&gt;-&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  No.7 - Japanese Tea Ceremony
&lt;/h2&gt;

&lt;p&gt;For the final test, I chose a prompt with a specific cultural theme to see how well each model captures details from a traditional Japanese tea ceremony.&lt;/p&gt;

&lt;h3&gt;
  
  
  Prompt
&lt;/h3&gt;

&lt;p&gt;A traditional Japanese tea ceremony taking place in a tatami room. A woman in a kimono is gracefully preparing tea, while a guest kneels in front of her, observing respectfully. The room is decorated with traditional Japanese art and sliding shoji doors.&lt;/p&gt;

&lt;h3&gt;
  
  
  Stable Diffusion 3.5 Large
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8069strzrbnkh0lwdkp2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8069strzrbnkh0lwdkp2.png" alt="7_sd35" width="800" height="800"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  FLUX1.1 [pro]
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flyvdb8q6orpnbto3jkal.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flyvdb8q6orpnbto3jkal.jpg" alt="7_flux" width="800" height="800"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Imagen 3
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fx254r4r689521wld767l.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fx254r4r689521wld767l.jpg" alt="7_fx" width="800" height="800"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  DALL·E 3
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fr4m9cna7g64rvtxnoary.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fr4m9cna7g64rvtxnoary.png" alt="7_de" width="800" height="800"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Adobe Firefly
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1gmq8z3gm3chmm5lq3vv.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1gmq8z3gm3chmm5lq3vv.jpg" alt="7_ff" width="800" height="800"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Comments
&lt;/h3&gt;

&lt;p&gt;Strict accuracy was not evaluated here, since judging the finer points of tea ceremony protocol would likely disqualify every image.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Stable Diffusion 3.5 Large&lt;/strong&gt; produced a somewhat ambiguous tea preparation scene and used an unconventional shoji door style.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;DALL·E 3&lt;/strong&gt; displayed notably distorted tatami and other room elements.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Firefly&lt;/strong&gt; lacked the observing guest, and its shoji doors and tatami differed from traditional interpretations.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;SD 3.5 Large&lt;/th&gt;
&lt;th&gt;FLUX1.1 [pro]&lt;/th&gt;
&lt;th&gt;Imagen 3&lt;/th&gt;
&lt;th&gt;DALL·E 3&lt;/th&gt;
&lt;th&gt;Firefly&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;✗&lt;/td&gt;
&lt;td&gt;✓&lt;/td&gt;
&lt;td&gt;✓&lt;/td&gt;
&lt;td&gt;✗&lt;/td&gt;
&lt;td&gt;✗&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  In Conclusion
&lt;/h2&gt;

&lt;p&gt;The following table summarizes the results.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;SD 3.5 Large&lt;/th&gt;
&lt;th&gt;FLUX1.1 [pro]&lt;/th&gt;
&lt;th&gt;Imagen 3&lt;/th&gt;
&lt;th&gt;DALL·E 3&lt;/th&gt;
&lt;th&gt;Firefly&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;Single Banana&lt;/td&gt;
&lt;td&gt;✗&lt;/td&gt;
&lt;td&gt;✗&lt;/td&gt;
&lt;td&gt;✗&lt;/td&gt;
&lt;td&gt;✗&lt;/td&gt;
&lt;td&gt;✗&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;Single Banana Retry&lt;/td&gt;
&lt;td&gt;✗&lt;/td&gt;
&lt;td&gt;✓&lt;/td&gt;
&lt;td&gt;✓&lt;/td&gt;
&lt;td&gt;-&lt;/td&gt;
&lt;td&gt;✓&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;Space Battle&lt;/td&gt;
&lt;td&gt;~&lt;/td&gt;
&lt;td&gt;~&lt;/td&gt;
&lt;td&gt;~&lt;/td&gt;
&lt;td&gt;~&lt;/td&gt;
&lt;td&gt;~&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;Steampunk Invention&lt;/td&gt;
&lt;td&gt;✓&lt;/td&gt;
&lt;td&gt;✓&lt;/td&gt;
&lt;td&gt;✓&lt;/td&gt;
&lt;td&gt;✗&lt;/td&gt;
&lt;td&gt;✗&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;Chibi Character&lt;/td&gt;
&lt;td&gt;~&lt;/td&gt;
&lt;td&gt;✓&lt;/td&gt;
&lt;td&gt;✗&lt;/td&gt;
&lt;td&gt;✗&lt;/td&gt;
&lt;td&gt;✗&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;6&lt;/td&gt;
&lt;td&gt;Coral Reef&lt;/td&gt;
&lt;td&gt;✓&lt;/td&gt;
&lt;td&gt;✗&lt;/td&gt;
&lt;td&gt;✗&lt;/td&gt;
&lt;td&gt;✗&lt;/td&gt;
&lt;td&gt;-&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;7&lt;/td&gt;
&lt;td&gt;Tea Ceremony&lt;/td&gt;
&lt;td&gt;✗&lt;/td&gt;
&lt;td&gt;✓&lt;/td&gt;
&lt;td&gt;✓&lt;/td&gt;
&lt;td&gt;✗&lt;/td&gt;
&lt;td&gt;✗&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Overall, this was a subjective review, but based on these results it’s clear that &lt;strong&gt;Stable Diffusion 3.5 Large&lt;/strong&gt; does not definitively outperform the other models.&lt;/p&gt;

&lt;p&gt;Here is a rough grouping by prompt adherence:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Prompt Adherence Level S&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;FLUX1.1 [pro], Imagen 3&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;strong&gt;Prompt Adherence Level A&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;Stable Diffusion 3.5 Large &lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;strong&gt;Prompt Adherence Level B&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;DALL·E 3, Firefly&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;p&gt;One takeaway is that getting an AI to generate an image that perfectly matches your intent remains an extremely challenging task.&lt;/p&gt;

&lt;p&gt;However, the pace of AI’s improvement is undeniably impressive.&lt;/p&gt;

&lt;h2&gt;
  
  
  Original Japanese Article
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://qiita.com/nabata/items/75eaa797d5e943931b93" rel="noopener noreferrer"&gt;プロンプトがどれだけ正確に反映されるのかを様々な画像生成AIで比較してみた（Stable Diffusion 3.5、FLUX1.1、Imagen 3、DALL·E 3、Adobe Firefly）&lt;/a&gt;&lt;/p&gt;

</description>
      <category>stablediffusion</category>
      <category>ai</category>
      <category>fluxai</category>
      <category>openai</category>
    </item>
    <item>
      <title>Using Runway's "Gen-3 Alpha Turbo" API to Generate AI Videos</title>
      <dc:creator>nabata</dc:creator>
      <pubDate>Sat, 26 Oct 2024 03:38:09 +0000</pubDate>
      <link>https://dev.to/nabata/using-runways-gen-3-alpha-turbo-api-to-generate-ai-videos-42gb</link>
      <guid>https://dev.to/nabata/using-runways-gen-3-alpha-turbo-api-to-generate-ai-videos-42gb</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://runwayml.com/" rel="noopener noreferrer"&gt;Runway&lt;/a&gt; is a platform offering the video generation AI &lt;strong&gt;Gen-3-Alpha&lt;/strong&gt;, which I previously covered in an article titled "&lt;a href="https://dev.to/nabata/converting-images-to-video-using-the-video-generation-ai-gen-3-alpha-the-results-were-so-natural-it-was-almost-scary-5c9"&gt;Converting Images to Video Using the Video Generation AI "Gen-3 Alpha" : The Results Were So Natural, It Was Almost Scary&lt;/a&gt;."&lt;/p&gt;

&lt;p&gt;Now, Runway has introduced a more cost-effective version called &lt;strong&gt;Gen-3 Alpha Turbo&lt;/strong&gt;, which is accessible through a web API. In this post, I'll explore how to use it.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;We are excited to announce the launch of our new API, providing developers with access to our Gen-3 Alpha Turbo model for integration into various applications and products. This release represents a significant step forward in making advanced AI capabilities more accessible to a broader range of developers, businesses and creatives.&lt;/p&gt;

&lt;p&gt;Reference：&lt;a href="https://runwayml.com/news/introducing-the-runway-api" rel="noopener noreferrer"&gt;Runway News | Introducing the Runway API for Gen-3 Alpha Turbo&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Waitlist Registration
&lt;/h2&gt;

&lt;p&gt;To access the API, you currently need to register on the waitlist.&lt;/p&gt;

&lt;p&gt;Fill out the &lt;a href="https://forms.gle/eHDQQawT8QGKWHDC7" rel="noopener noreferrer"&gt;Google Form&lt;/a&gt; with your email, name, company, intended use case, and estimated number of videos you'll generate monthly.&lt;/p&gt;

&lt;p&gt;After submitting, wait for authorization—I received access about 10 days after applying. Once approved, you'll be asked to enter your organization name when logging in with your registered email.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu7jbcsgvz37giysxc5su.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu7jbcsgvz37giysxc5su.png" alt="Gen-3 Alpha Turbo API 2" width="589" height="302"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Purchasing Credits
&lt;/h2&gt;

&lt;p&gt;Video generation costs $0.25 for a 5-second video and $0.50 for a 10-second video.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fet7vqcqm9s6urfrdjrsj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fet7vqcqm9s6urfrdjrsj.png" alt="Gen-3 Alpha Turbo API 3" width="800" height="451"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You can purchase credits on the Billing page, with a minimum purchase of $10 (1,000 credits). For $10, you can create either 40 five-second videos or 20 ten-second videos.&lt;/p&gt;

&lt;p&gt;For the latest pricing, check the &lt;a href="https://docs.dev.runwayml.com/#price" rel="noopener noreferrer"&gt;Price&lt;/a&gt; page.&lt;/p&gt;

&lt;h2&gt;
  
  
  Creating an API Key
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flouxauxd18g48q5pfujv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flouxauxd18g48q5pfujv.png" alt="Gen-3 Alpha Turbo API 4" width="800" height="220"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;To create an API key, navigate to the API Keys page.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Securely copy the key above and store it in a safe place. Once you close this modal, the key will not be displayed again.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The key is only visible once, so make sure to save it before closing the modal.&lt;/p&gt;

&lt;h2&gt;
  
  
  Environment
&lt;/h2&gt;

&lt;p&gt;I am using macOS 14 Sonoma for this setup and will use Python for the implementation.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="gp"&gt;$&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;python &lt;span class="nt"&gt;--version&lt;/span&gt;
&lt;span class="go"&gt;Python 3.12.2
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Following the &lt;a href="https://docs.dev.runwayml.com/guides/quickstart/" rel="noopener noreferrer"&gt;quickstart guide&lt;/a&gt;, I installed the SDK:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="gp"&gt;$&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;runwayml
&lt;span class="go"&gt;Collecting runwayml
  Downloading runwayml-2.0.0-py3-none-any.whl (71 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 71.2/71.2 kB 3.5 MB/s eta 0:00:00

&lt;/span&gt;&lt;span class="c"&gt;...
&lt;/span&gt;&lt;span class="go"&gt;
Installing collected packages: runwayml
Successfully installed runwayml-2.0.0
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Next, I saved the API key as an environment variable in my &lt;code&gt;zshrc&lt;/code&gt; file:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="gp"&gt;$&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;open ~/.zshrc
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I set the environment variable as &lt;strong&gt;RUNWAYML_API_SECRET&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="gp"&gt;export RUNWAYML_API_SECRET=&amp;lt;Your API Key Here&amp;gt;&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Be cautious with the variable name—incorrect input will result in this error:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;The api_key client option must be set either by passing api_key to the client or by setting the RUNWAYML_API_SECRET environment variable&lt;/p&gt;
&lt;/blockquote&gt;
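&lt;p&gt;As a quick sanity check (a hypothetical helper of my own, not part of the SDK), you can confirm from Python that the variable is actually visible before calling the API:&lt;/p&gt;

```python
import os

def api_key_is_configured():
    """Return True if the variable the RunwayML SDK reads by default is set."""
    return bool(os.environ.get("RUNWAYML_API_SECRET"))

if api_key_is_configured():
    print("RUNWAYML_API_SECRET is set")
else:
    print("RUNWAYML_API_SECRET is missing; export it in ~/.zshrc and restart the shell")
```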

&lt;h2&gt;
  
  
  Original Image
&lt;/h2&gt;

&lt;p&gt;The video will be based on the following image:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F57u069td4ixuu191ilnc.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F57u069td4ixuu191ilnc.jpg" alt="Original Image" width="800" height="480"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This image was generated using &lt;strong&gt;FLUX 1.1 [pro]&lt;/strong&gt; at 1280x768, matching &lt;strong&gt;Gen-3 Alpha Turbo&lt;/strong&gt;’s default aspect ratio of &lt;strong&gt;16:9&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;For more details about &lt;strong&gt;FLUX 1.1 [pro]&lt;/strong&gt;, check out my previous article, "&lt;a href="https://dev.to/nabata/using-the-web-api-for-flux-11-pro-the-latest-image-generation-ai-model-by-the-original-team-of-stable-diffusion-29pi"&gt;Using the Web API for FLUX 1.1 [pro]: The Latest Image Generation AI Model by the Original Team of Stable Diffusion&lt;/a&gt;".&lt;/p&gt;

&lt;p&gt;Ensure the image is uploaded to an accessible URL for the API call.&lt;/p&gt;
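&lt;p&gt;Since the API fetches the image itself, it can be worth verifying that the URL is publicly reachable first. Here is a minimal sketch (my own helper, not part of the Runway SDK):&lt;/p&gt;

```python
import urllib.request

def url_is_reachable(url, timeout=10):
    """Issue a HEAD request and report whether the URL answers with HTTP 200."""
    try:
        req = urllib.request.Request(url, method="HEAD")
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return resp.status == 200
    except Exception:
        return False
```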

&lt;h2&gt;
  
  
  Video Generation
&lt;/h2&gt;

&lt;p&gt;Let’s generate the video.&lt;/p&gt;

&lt;p&gt;The code below is based on the &lt;a href="https://docs.dev.runwayml.com/guides/quickstart/" rel="noopener noreferrer"&gt;quickstart guide&lt;/a&gt;, using the prompt "&lt;strong&gt;A Japanese woman is smiling happily&lt;/strong&gt;."&lt;/p&gt;

&lt;p&gt;I've added comments for extra parameters.&lt;/p&gt;

&lt;p&gt;For more details, refer to the &lt;a href="https://docs.dev.runwayml.com/api/" rel="noopener noreferrer"&gt;API Reference&lt;/a&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;runwayml&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;RunwayML&lt;/span&gt;

&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;RunwayML&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="c1"&gt;# Create a new image-to-video task using the "gen3a_turbo" model
&lt;/span&gt;&lt;span class="n"&gt;task&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;image_to_video&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;gen3a_turbo&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;prompt_image&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;&amp;lt;Image URL Here&amp;gt;&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;prompt_text&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;A Japanese woman is smiling happily&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;# Must be under 512 characters
&lt;/span&gt;    &lt;span class="c1"&gt;# seed=0,  # Default: random
&lt;/span&gt;    &lt;span class="c1"&gt;# watermark=True,  # Default: False
&lt;/span&gt;    &lt;span class="c1"&gt;# duration=5,  # Default: 10
&lt;/span&gt;    &lt;span class="c1"&gt;# ratio="9:16"  # Default: "16:9"
&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;task_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;task&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;id&lt;/span&gt;

&lt;span class="c1"&gt;# Check for completion
&lt;/span&gt;&lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sleep&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# Wait
&lt;/span&gt;&lt;span class="n"&gt;task&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;tasks&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;retrieve&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;task_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;while&lt;/span&gt; &lt;span class="n"&gt;task&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;status&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;SUCCEEDED&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;FAILED&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
    &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sleep&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# Wait
&lt;/span&gt;    &lt;span class="n"&gt;task&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;tasks&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;retrieve&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;task_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Task complete:&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;task&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Once the process completes, it outputs the video URL.&lt;/p&gt;
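&lt;p&gt;If you want to save the file rather than copy the URL by hand, something like the following works. This is my own sketch: on success the task object exposes the result URLs in &lt;code&gt;task.output&lt;/code&gt;, and as far as I can tell those URLs expire, so download promptly.&lt;/p&gt;

```python
import urllib.request

def save_first_output(task, filename="output.mp4"):
    """Download the first result URL of a SUCCEEDED task; return the path, or None."""
    if task.status == "SUCCEEDED" and task.output:
        urllib.request.urlretrieve(task.output[0], filename)
        return filename
    return None
```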

&lt;p&gt;Here is the result:&lt;/p&gt;

&lt;p&gt;&lt;iframe width="710" height="399" src="https://www.youtube.com/embed/KWMH-NT-mGE"&gt;
&lt;/iframe&gt;
&lt;/p&gt;

&lt;p&gt;The video was generated as expected, although the movement appears slightly uncanny.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;The rapid development and competition in the generative AI space are remarkable.&lt;/p&gt;

&lt;p&gt;How long will this trend continue?&lt;/p&gt;

&lt;h2&gt;
  
  
  Original Japanese Article
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://qiita.com/nabata/items/20337bbb9e08a65bf890" rel="noopener noreferrer"&gt;Runwayの「Gen-3 Alpha Turbo」のAPIを呼んで動画をAI生成してみた&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>aivideo</category>
      <category>gen3</category>
    </item>
    <item>
      <title>Using the Web API for FLUX 1.1 [pro]: The Latest Image Generation AI Model by the Original Team of Stable Diffusion</title>
      <dc:creator>nabata</dc:creator>
      <pubDate>Sun, 20 Oct 2024 03:19:51 +0000</pubDate>
      <link>https://dev.to/nabata/using-the-web-api-for-flux-11-pro-the-latest-image-generation-ai-model-by-the-original-team-of-stable-diffusion-29pi</link>
      <guid>https://dev.to/nabata/using-the-web-api-for-flux-11-pro-the-latest-image-generation-ai-model-by-the-original-team-of-stable-diffusion-29pi</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;Previously, I wrote an article titled “&lt;a href="https://dev.to/nabata/running-the-flux1-image-devschnell-generation-ai-model-by-stable-diffusions-original-developers-on-a-macbook-m2-4ld6"&gt;Running the FLUX.1 Image ([dev]/[schnell]) Generation AI Model by Stable Diffusion’s Original Developers on a MacBook (M2)&lt;/a&gt;.” It demonstrated the &lt;strong&gt;FLUX.1&lt;/strong&gt; image generation model from &lt;a href="https://blackforestlabs.ai/" rel="noopener noreferrer"&gt;Black Forest Labs&lt;/a&gt;, founded by the creators of Stable Diffusion.&lt;/p&gt;

&lt;p&gt;Now, two months later, &lt;strong&gt;FLUX 1.1 [pro]&lt;/strong&gt; (codenamed Blueberry) has been released, along with public access to its web API, though it’s still in beta.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Today, we release FLUX1.1 [pro], our most advanced and efficient model yet, alongside the general availability of the beta BFL API. This release marks a significant step forward in our mission to empower creators, developers, and enterprises with scalable, state-of-the-art generative technology.&lt;/p&gt;

&lt;p&gt;Reference: &lt;a href="https://blackforestlabs.ai/announcing-flux-1-1-pro-and-the-bfl-api/" rel="noopener noreferrer"&gt;Announcing FLUX1.1 [pro] and the BFL API - Black Forest Labs&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;In this post, I will demonstrate how to use the &lt;strong&gt;FLUX 1.1 [pro]&lt;/strong&gt; web API.&lt;/p&gt;

&lt;p&gt;All code examples are written in &lt;strong&gt;Python&lt;/strong&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Creating an Account and an API Key
&lt;/h2&gt;

&lt;p&gt;Start by registering an account on the &lt;a href="https://api.bfl.ml/" rel="noopener noreferrer"&gt;API page&lt;/a&gt; via the &lt;strong&gt;Register&lt;/strong&gt; option, then log in.&lt;/p&gt;

&lt;p&gt;Credits are priced at $0.01 each, and I received 50 credits upon registration (this may vary).&lt;/p&gt;

&lt;p&gt;Based on the &lt;a href="https://docs.bfl.ml/pricing/" rel="noopener noreferrer"&gt;Pricing page&lt;/a&gt;, the model costs are as follows:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;FLUX 1.1 [pro]: $0.04 per image&lt;/li&gt;
&lt;li&gt;FLUX.1 [pro]: $0.05 per image&lt;/li&gt;
&lt;li&gt;FLUX.1 [dev]: $0.025 per image&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Once you’re logged in, generate an API key by selecting &lt;strong&gt;Add Key&lt;/strong&gt; and entering a name of your choice.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fchj8scdkl8ar0naiuche.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fchj8scdkl8ar0naiuche.png" alt="FLB API Image 1" width="699" height="381"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Your key will appear as shown below.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fztmq1jmycwsizo6pnecy.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fztmq1jmycwsizo6pnecy.png" alt="FLB API Image 2" width="688" height="174"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Environment Setup
&lt;/h2&gt;

&lt;p&gt;I'm using macOS 14 Sonoma as my operating system.&lt;/p&gt;

&lt;p&gt;The Python version is:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="gp"&gt;$&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;python &lt;span class="nt"&gt;--version&lt;/span&gt;
&lt;span class="go"&gt;Python 3.12.2
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To run the sample code, I installed &lt;a href="https://pypi.org/project/requests/" rel="noopener noreferrer"&gt;requests&lt;/a&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="gp"&gt;$&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;requests
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I confirmed the installed version:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="gp"&gt;$&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;pip list | &lt;span class="nb"&gt;grep&lt;/span&gt; &lt;span class="nt"&gt;-e&lt;/span&gt; requests 
&lt;span class="go"&gt;requests           2.31.0
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To avoid hardcoding, I saved the API key as an environment variable by editing the &lt;code&gt;zshrc&lt;/code&gt; file.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="gp"&gt;$&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;open ~/.zshrc
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I named the environment variable &lt;strong&gt;BFL_API_KEY&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="gp"&gt;export BFL_API_KEY=&amp;lt;Your API Key Here&amp;gt;&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Example Code
&lt;/h2&gt;

&lt;p&gt;Below is the sample code from the &lt;a href="https://docs.bfl.ml/quick_start/gen_image/" rel="noopener noreferrer"&gt;Getting Started guide&lt;/a&gt;, with some additional comments. Ideally, it would handle errors by checking the &lt;a href="https://api.bfl.ml/scalar#model/resultresponse" rel="noopener noreferrer"&gt;status&lt;/a&gt; field, but I left it unchanged for simplicity.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;

&lt;span class="c1"&gt;# Request
&lt;/span&gt;&lt;span class="n"&gt;request&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;post&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;https://api.bfl.ml/v1/flux-pro-1.1&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;accept&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;application/json&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;x-key&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;environ&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;BFL_API_KEY&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Content-Type&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;application/json&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;prompt&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;A cat on its back legs running like a human is holding a big silver fish with its arms. The cat is running away from the shop owner and has a panicked look on his face. The scene is situated in a crowded market.&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;width&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;1024&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;height&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;768&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;request_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="c1"&gt;# Wait for completion
&lt;/span&gt;&lt;span class="k"&gt;while&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sleep&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;0.5&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;https://api.bfl.ml/v1/get_result&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;accept&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;application/json&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;x-key&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;environ&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;BFL_API_KEY&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="n"&gt;params&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;id&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;request_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;status&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Ready&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Result: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;result&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;sample&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;break&lt;/span&gt;
    &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Status: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;status&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
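&lt;p&gt;For reference, the omitted error handling could be factored into a small polling helper like the sketch below. It assumes the status values I observed (&lt;code&gt;Ready&lt;/code&gt; and &lt;code&gt;Pending&lt;/code&gt;); check the status reference linked above for the full list and adjust accordingly. The &lt;code&gt;poll_until_ready&lt;/code&gt; name is mine:&lt;/p&gt;

```python
import time


def poll_until_ready(fetch, interval=0.5, max_attempts=120):
    """Poll `fetch` (a zero-argument callable returning the get_result JSON)
    until the status is Ready, treating anything other than Ready/Pending
    as a failure. Returns the 'result' payload on success."""
    for _ in range(max_attempts):
        result = fetch()
        status = result.get("status")
        if status == "Ready":
            return result["result"]
        if status != "Pending":
            raise RuntimeError(f"Generation did not complete: {status}")
        time.sleep(interval)
    raise TimeoutError("Gave up waiting for the generation to finish")
```

&lt;p&gt;The request loop from the sample code would then call this with a lambda wrapping the &lt;code&gt;requests.get&lt;/code&gt; call.&lt;/p&gt;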



&lt;p&gt;In this example, the prompt is: &lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;A cat on its back legs running like a human is holding a big silver fish with its arms. The cat is running away from the shop owner and has a panicked look on his face. The scene is situated in a crowded market.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The final &lt;strong&gt;result&lt;/strong&gt; has the following format. The response time was faster than that of other APIs I’ve tested.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;id&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Request ID&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;status&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Ready&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;result&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;sample&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;URL of the generated image&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;prompt&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Specified prompt&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;strong&gt;sample&lt;/strong&gt; contains the URL of the generated image, which was hosted on &lt;strong&gt;bflapistorage.blob.core.windows.net&lt;/strong&gt; when I tested it.&lt;/p&gt;
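&lt;p&gt;Since the delivery URL may not remain available indefinitely, it can be convenient to download the image right away. A small sketch (the helper names &lt;code&gt;sample_path&lt;/code&gt; and &lt;code&gt;save_sample&lt;/code&gt; are mine, and the result shape is the one shown above):&lt;/p&gt;

```python
import os

import requests


def sample_path(result: dict, directory: str = ".") -> str:
    """Build a local filename from the request id in the result payload."""
    return os.path.join(directory, f"{result['id']}.jpg")


def save_sample(result: dict, directory: str = ".") -> str:
    """Download the image at result['result']['sample'] and save it locally."""
    response = requests.get(result["result"]["sample"], timeout=30)
    response.raise_for_status()
    path = sample_path(result, directory)
    with open(path, "wb") as f:
        f.write(response.content)
    return path
```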

&lt;p&gt;Here's the generated image:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmktl8bga9vongfhff4bf.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmktl8bga9vongfhff4bf.jpg" alt="Generated Image" width="800" height="600"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The result closely matches the prompt, capturing the sense of urgency.&lt;/p&gt;

&lt;h2&gt;
  
  
  Experimenting with Alternative Prompts
&lt;/h2&gt;

&lt;p&gt;I tried different prompts to generate varied images.&lt;/p&gt;

&lt;h3&gt;
  
  
  Japanese Moe Heroine
&lt;/h3&gt;

&lt;p&gt;Prompt: "&lt;strong&gt;Japanese moe heroine&lt;/strong&gt;," using &lt;strong&gt;anime style&lt;/strong&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;prompt&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;anime style, Japanese moe heroine&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fajjbn2fmawg0fuijcfrv.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fajjbn2fmawg0fuijcfrv.jpg" alt="Generated Image" width="800" height="600"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Sweets from Popular Japanese Anime
&lt;/h3&gt;

&lt;p&gt;Prompt: "&lt;strong&gt;Sweets that appear in popular Japanese anime&lt;/strong&gt;," using &lt;strong&gt;anime style&lt;/strong&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;prompt&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;anime style, sweets that appear in popular Japanese anime&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo3dst3621sqkgr4qi490.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo3dst3621sqkgr4qi490.jpg" alt="Generated Image" width="800" height="600"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Male High School Student on a School Trip
&lt;/h3&gt;

&lt;p&gt;Prompt: "&lt;strong&gt;Male high school student on a school trip&lt;/strong&gt;," using &lt;strong&gt;anime style&lt;/strong&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;prompt&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;anime style, male high school student on a school trip&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F72tsg8cb6izm3t83r02x.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F72tsg8cb6izm3t83r02x.jpg" alt="Generated Image" width="800" height="600"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  A Princess Playing Guitar
&lt;/h3&gt;

&lt;p&gt;Prompt: "&lt;strong&gt;A princess playing guitar&lt;/strong&gt;," using &lt;strong&gt;fantasy-art style&lt;/strong&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;prompt&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;fantasy-art style, a princess playing guitar&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsdp2fejdt6lq5xksvuqk.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsdp2fejdt6lq5xksvuqk.jpg" alt="Generated Image" width="800" height="600"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  A Cute Fairy on Top of a White Laptop
&lt;/h3&gt;

&lt;p&gt;Prompt: "&lt;strong&gt;A cute fairy on top of a white laptop&lt;/strong&gt;," using &lt;strong&gt;photographic style&lt;/strong&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;prompt&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;photographic style, a cute fairy on top of a white laptop&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fndvpba53rdy4nex02pbb.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fndvpba53rdy4nex02pbb.jpg" alt="Generated Image" width="800" height="600"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  28-Year-Old Japanese Woman with Black Bobbed Hair
&lt;/h3&gt;

&lt;p&gt;Prompt: "&lt;strong&gt;28-year-old Japanese pretty woman with black bobbed hair&lt;/strong&gt;," using &lt;strong&gt;photographic style&lt;/strong&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;prompt&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;photographic style, 28-year-old Japanese pretty woman with black bobbed hair&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu32molmmh7gnbhluqo5f.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu32molmmh7gnbhluqo5f.jpg" alt="Generated Image" width="800" height="600"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Hong Kong Downtown in the 1980s
&lt;/h3&gt;

&lt;p&gt;Prompt: "&lt;strong&gt;Hong Kong downtown in the 1980s&lt;/strong&gt;," using &lt;strong&gt;photographic style&lt;/strong&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;prompt&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;photographic style, Hong Kong downtown in the 1980s&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ft5lwc5ad93sw7d95ltai.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ft5lwc5ad93sw7d95ltai.jpg" alt="Generated Image" width="800" height="600"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Shinjuku Kabukicho in 2020
&lt;/h3&gt;

&lt;p&gt;Prompt: "&lt;strong&gt;Shinjuku Kabukicho in 2020&lt;/strong&gt;," using &lt;strong&gt;photographic style&lt;/strong&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;prompt&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;photographic style, Shinjuku Kabukicho in 2020&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6hg22vhwegcaz1m2amq8.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6hg22vhwegcaz1m2amq8.jpg" alt="Generated Image" width="800" height="600"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;All of the generated images were of exceptional quality.&lt;/p&gt;

&lt;p&gt;After generating so many high-quality AI images, reality almost feels surreal.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Black Forest Labs&lt;/strong&gt; continues to innovate and enhance its AI models.&lt;/p&gt;

&lt;p&gt;I’m looking forward to the future release of video generation capabilities.&lt;/p&gt;

&lt;h2&gt;
  
  
  Original Japanese Article
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://qiita.com/nabata/items/1bea5313df7398041b4b" rel="noopener noreferrer"&gt;Stable Diffusionのオリジナル開発陣による画像生成AIモデル最新版FLUX 1.1 [pro]のWeb APIを呼んでいくつかの画像を生成してみた&lt;/a&gt;&lt;/p&gt;

</description>
      <category>python</category>
      <category>stablediffusion</category>
      <category>ai</category>
      <category>flux1</category>
    </item>
    <item>
      <title>Using the "Dream Machine" Video Generation AI Service via Web API</title>
      <dc:creator>nabata</dc:creator>
      <pubDate>Sun, 06 Oct 2024 13:26:45 +0000</pubDate>
      <link>https://dev.to/nabata/using-the-dream-machine-video-generation-ai-service-via-web-api-ap</link>
      <guid>https://dev.to/nabata/using-the-dream-machine-video-generation-ai-service-via-web-api-ap</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;Recently, the Web API for &lt;a href="https://lumalabs.ai/dream-machine" rel="noopener noreferrer"&gt;Dream Machine&lt;/a&gt; was released. In this article, I’ll walk you through how to use it.&lt;/p&gt;

&lt;h2&gt;
  
  
  Logging In and Purchasing Credits
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F67k5guchdlk9mapwawa5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F67k5guchdlk9mapwawa5.png" alt="Dream Machine 1"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;First, log in with your Google account.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fa61hf4jegg8vnp2c4l5d.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fa61hf4jegg8vnp2c4l5d.png" alt="Dream Machine 2"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You can purchase credits from the &lt;strong&gt;Billing &amp;amp; Credits&lt;/strong&gt; page by selecting &lt;strong&gt;Add More Credits&lt;/strong&gt;. You can specify an amount ranging from $5 to $500. For further details, check out the &lt;a href="https://lumalabs.ai/dream-machine/api/pricing" rel="noopener noreferrer"&gt;Dream Machine API Pricing&lt;/a&gt; page.&lt;/p&gt;

&lt;h2&gt;
  
  
  Creating an API Key
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9dw36504yudw4a035cf6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9dw36504yudw4a035cf6.png" alt="Dream Machine 3"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Next, create an API Key. Note that once it’s generated, you won’t be able to view it again, so make sure to store it securely.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;It is your responsibility to record the key below as you will not be able to view it again.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Setting Up the Environment
&lt;/h2&gt;

&lt;p&gt;I’m running macOS 14 Sonoma, and the version of Python installed on my machine is:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="gp"&gt;$&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;python &lt;span class="nt"&gt;--version&lt;/span&gt;
&lt;span class="go"&gt;Python 3.12.2
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Install the &lt;a href="https://docs.lumalabs.ai/docs/python" rel="noopener noreferrer"&gt;Python SDK&lt;/a&gt; as follows:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="gp"&gt;$&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;lumaai
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You’ll also need the &lt;a href="https://pypi.org/project/requests/" rel="noopener noreferrer"&gt;requests&lt;/a&gt; package for making HTTP requests:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="gp"&gt;$&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;requests
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Check the installed versions:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="gp"&gt;$&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;pip list | &lt;span class="nb"&gt;grep&lt;/span&gt; &lt;span class="nt"&gt;-e&lt;/span&gt; lumaai &lt;span class="nt"&gt;-e&lt;/span&gt; requests 
&lt;span class="go"&gt;lumaai             1.0.2
requests           2.31.0
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To avoid hard-coding the API key, I stored it as an environment variable named &lt;strong&gt;LUMAAI_API_KEY&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="go"&gt;export LUMAAI_API_KEY=your obtained API Key here
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Generating a Video from Text
&lt;/h2&gt;

&lt;p&gt;Now, let’s generate a video from text by referring to the &lt;a href="https://docs.lumalabs.ai/docs/python#text-to-video" rel="noopener noreferrer"&gt;Text to Video section of the Python SDK Guide&lt;/a&gt;. For more details, check the &lt;a href="https://docs.lumalabs.ai/reference/ping-1" rel="noopener noreferrer"&gt;API Reference&lt;/a&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;lumaai&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;LumaAI&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;

&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;LumaAI&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;auth_token&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;environ&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;LUMAAI_API_KEY&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# Get the API Key
&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;generation&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;generations&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;A teddy bear in sunglasses playing electric guitar and dancing&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;aspect_ratio&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;16:9&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;loop&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;  &lt;span class="c1"&gt;# Enables looping
&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# The state can be one of queued, dreaming, completed, or failed
&lt;/span&gt;&lt;span class="k"&gt;while&lt;/span&gt; &lt;span class="n"&gt;generation&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;state&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;failed&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;generation&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;generations&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;generation&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# Re-fetch the status
&lt;/span&gt;    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;generation&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;state&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;completed&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;generation&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;assets&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;video&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;generation&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;id&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;.mp4&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;wb&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="nb"&gt;file&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="nb"&gt;file&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;write&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;break&lt;/span&gt;

&lt;span class="c1"&gt;# You can retrieve a list of all previous requests like this:
# print(client.generations.list())
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For this example, I used the prompt &lt;strong&gt;"A teddy bear in sunglasses playing electric guitar and dancing"&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;I set the aspect ratio to 16:9 and enabled &lt;strong&gt;loop&lt;/strong&gt; to make the video loop smoothly. This ensures that the first and last frames match.&lt;/p&gt;
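Those settings map onto parameters of the SDK's `generations.create` call. The sketch below only builds the request arguments as a plain dict so they are easy to inspect; `aspect_ratio` and `loop` are the parameter names from the Python SDK guide, and the commented-out call assumes the `client` set up earlier.

```python
# A sketch of the same settings as SDK parameters. "aspect_ratio" and
# "loop" follow the Luma Python SDK guide; adjust if your SDK version differs.

def text_to_video_params(prompt: str) -> dict:
    """Build request arguments for a looping 16:9 text-to-video generation."""
    return {
        "prompt": prompt,
        "aspect_ratio": "16:9",  # widescreen output
        "loop": True,            # makes the first and last frames match
    }

params = text_to_video_params(
    "A teddy bear in sunglasses playing electric guitar and dancing"
)
# generation = client.generations.create(**params)  # client from the code above
```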

&lt;p&gt;Here’s the generated video:&lt;/p&gt;

&lt;p&gt;&lt;iframe width="710" height="399" src="https://www.youtube.com/embed/T8jdaAy1FvY"&gt;
&lt;/iframe&gt;
&lt;/p&gt;

&lt;p&gt;The result perfectly matches the prompt. Pretty impressive!&lt;/p&gt;

&lt;h2&gt;
  
  
  Generating a Video from an Image
&lt;/h2&gt;

&lt;p&gt;Next, I tried generating a video from an image, based on the &lt;a href="https://docs.lumalabs.ai/docs/python#image-to-video" rel="noopener noreferrer"&gt;Image to Video section of the Python SDK Guide&lt;/a&gt;.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;You should upload and use your own cdn image urls, currently this is the only way to pass an image&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This means you’ll need to upload your image to a server that can provide a URL.&lt;/p&gt;
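Any publicly reachable URL works, so the hosting choice (S3, a CDN, your own server) is up to you. As a small sanity check before calling the API, you can at least confirm the URL points at a common image type. The helper below only inspects the file extension, and the JPEG/PNG whitelist is my assumption, not a documented list of supported formats.

```python
import mimetypes

def looks_like_image_url(url: str) -> bool:
    """Guess from the extension whether a URL points at a JPEG or PNG."""
    guessed, _ = mimetypes.guess_type(url)
    return guessed in ("image/jpeg", "image/png")

print(looks_like_image_url("https://example.com/photo.png"))  # True
print(looks_like_image_url("https://example.com/clip.mp4"))   # False
```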

&lt;p&gt;For this test, I uploaded the following image to a server:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7z54oatjfta0yd9cx5yw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7z54oatjfta0yd9cx5yw.png" alt="Original Image"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;It’s possible to specify different images for the first and last frames, but I only used an image for the first frame.&lt;/p&gt;

&lt;p&gt;Here’s the code, with the prompt &lt;strong&gt;"Japanese woman smiling"&lt;/strong&gt;. I did not set the loop option this time.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;lumaai&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;LumaAI&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;

&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;LumaAI&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;auth_token&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;environ&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;LUMAAI_API_KEY&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# Get the API Key
&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;generation&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;generations&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Japanese woman smiling&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;keyframes&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;frame0&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;image&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;url&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Specify_image_url_here&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# The state can be one of queued, dreaming, completed, or failed
&lt;/span&gt;&lt;span class="k"&gt;while&lt;/span&gt; &lt;span class="n"&gt;generation&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;state&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;failed&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;generation&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;generations&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;generation&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# Re-fetch the status
&lt;/span&gt;    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;generation&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;state&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;completed&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;generation&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;assets&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;video&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;generation&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;id&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;.mp4&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;wb&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="nb"&gt;file&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="nb"&gt;file&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;write&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;break&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
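For completeness, pinning both ends of the clip could look like the sketch below. `frame1` mirrors the `frame0` shape used above (the naming follows the SDK guide), and both URLs are placeholders. The commented-out polling loop also adds a short pause between status checks.

```python
import time

# Sketch: keyframes for both the first and last frames. "frame1" mirrors
# the "frame0" shape above; both URLs are placeholders.

def keyframes_for(first_url: str, last_url: str) -> dict:
    """Build a keyframes dict pinning the first and last frames."""
    return {
        "frame0": {"type": "image", "url": first_url},
        "frame1": {"type": "image", "url": last_url},
    }

kf = keyframes_for("https://example.com/first.png",
                   "https://example.com/last.png")

# generation = client.generations.create(
#     prompt="Japanese woman smiling", keyframes=kf
# )
# while generation.state not in ("completed", "failed"):
#     time.sleep(3)  # pause between polls instead of hammering the API
#     generation = client.generations.get(generation.id)
```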



&lt;p&gt;Here’s the video that was generated:&lt;/p&gt;

&lt;p&gt;&lt;iframe width="710" height="399" src="https://www.youtube.com/embed/Qf0r_hWTyeQ"&gt;
&lt;/iframe&gt;
&lt;/p&gt;

&lt;p&gt;Once again, the prompt was accurately reflected in the output.&lt;/p&gt;

&lt;p&gt;It’s fascinating to see how far AI technology has come every time I test something like this.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Being able to generate videos purely through code is incredible.&lt;/p&gt;

&lt;p&gt;I’m excited to experiment further and see what else is possible with this tool!&lt;/p&gt;

&lt;h2&gt;
  
  
  Japanese Version of the Article
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://qiita.com/nabata/items/895a6b3b3a2f85d748f5" rel="noopener noreferrer"&gt;動画生成AIサービス「Dream Machine」をWeb APIで呼び出してみた&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>aivideo</category>
      <category>gen3</category>
      <category>dreammachine</category>
    </item>
    <item>
      <title>Converting Images to Video Using the Video Generation AI "Gen-3 Alpha": The Results Were So Natural, It Was Almost Scary</title>
      <dc:creator>nabata</dc:creator>
      <pubDate>Sun, 29 Sep 2024 11:16:40 +0000</pubDate>
      <link>https://dev.to/nabata/converting-images-to-video-using-the-video-generation-ai-gen-3-alpha-the-results-were-so-natural-it-was-almost-scary-5c9</link>
      <guid>https://dev.to/nabata/converting-images-to-video-using-the-video-generation-ai-gen-3-alpha-the-results-were-so-natural-it-was-almost-scary-5c9</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;In this article, I’ll explore the process of converting images into videos using &lt;a href="https://runwayml.com/research/introducing-gen-3-alpha" rel="noopener noreferrer"&gt;Gen-3 Alpha&lt;/a&gt;, a well-known video generation AI similar to &lt;a href="https://lumalabs.ai/dream-machine" rel="noopener noreferrer"&gt;Dream Machine&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Runway
&lt;/h2&gt;

&lt;p&gt;Runway, a platform that shares its name with the company behind it, offers a variety of generative AI tools, one of which is &lt;strong&gt;Gen-3 Alpha&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;It is also available as an &lt;a href="https://apps.apple.com/app/apple-store/id1665024375" rel="noopener noreferrer"&gt;iOS app&lt;/a&gt;, but for this article, I’ll be using the browser version.&lt;/p&gt;

&lt;h2&gt;
  
  
  Account Creation
&lt;/h2&gt;

&lt;p&gt;You can create an account via the &lt;a href="https://app.runwayml.com/signup" rel="noopener noreferrer"&gt;Sign up&lt;/a&gt; page.&lt;/p&gt;

&lt;p&gt;You can register using an email address, Google, or Apple account. The Enterprise plan also supports SSO.&lt;/p&gt;

&lt;h2&gt;
  
  
  Pricing
&lt;/h2&gt;

&lt;p&gt;Runway offers a free plan, but &lt;strong&gt;Gen-3 Alpha&lt;/strong&gt; is currently only accessible with paid plans. For this test, I subscribed to the Standard plan at $15/month, which provides 625 credits per month (&lt;strong&gt;Gen-3 Alpha&lt;/strong&gt; consumes 10 credits per second of video). The free plan gives access to &lt;strong&gt;Gen-3 Alpha Turbo&lt;/strong&gt;, a lower-cost version.&lt;/p&gt;
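In concrete terms, those numbers work out as follows (a quick back-of-the-envelope calculation using the credit figures above):

```python
# Back-of-the-envelope: how much Gen-3 Alpha video the Standard plan buys.
monthly_credits = 625
credits_per_second = 10  # Gen-3 Alpha's consumption rate

seconds_per_month = monthly_credits / credits_per_second
full_ten_second_videos = int(seconds_per_month // 10)

print(seconds_per_month)       # 62.5 seconds of video per month
print(full_ten_second_videos)  # 6 full 10-second videos
```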

&lt;p&gt;For more details about the different plans, see the &lt;a href="https://runwayml.com/pricing/" rel="noopener noreferrer"&gt;pricing&lt;/a&gt; page. If you choose the annual payment option, there’s a 20% discount.&lt;/p&gt;

&lt;h2&gt;
  
  
  Video Generation 1
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgrxyo2fluu0lfsse2puu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgrxyo2fluu0lfsse2puu.png" alt="Gen-3 Alpha 1"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I selected &lt;strong&gt;Gen-3 Alpha&lt;/strong&gt; and began generating a video.&lt;/p&gt;

&lt;p&gt;Since the required size is &lt;strong&gt;1280x768&lt;/strong&gt;, I used the following image, which fits those dimensions.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzsm3u9kobl4mpp57h03v.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzsm3u9kobl4mpp57h03v.png" alt="Original Image 1"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If your image is a different size, you can crop it directly in the browser.&lt;/p&gt;
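If you'd rather prepare the crop yourself before uploading, the box arithmetic is straightforward. The sketch below computes a centered 1280x768 crop box; the resulting `(left, upper, right, lower)` tuple is the format Pillow's `Image.crop()` accepts, though any image library will do. Note that cropping (rather than scaling) discards the edges of the frame.

```python
TARGET_W, TARGET_H = 1280, 768  # required Gen-3 Alpha input size

def center_crop_box(width: int, height: int) -> tuple:
    """Return a centered (left, upper, right, lower) 1280x768 crop box."""
    if width < TARGET_W or height < TARGET_H:
        raise ValueError("image is smaller than the target size")
    left = (width - TARGET_W) // 2
    upper = (height - TARGET_H) // 2
    return (left, upper, left + TARGET_W, upper + TARGET_H)

print(center_crop_box(1920, 1080))  # (320, 156, 1600, 924)
```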

&lt;p&gt;I used the following prompt:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;A Japanese woman is smiling happily&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;You can also generate videos without specifying a prompt. For more details on how to use prompts, refer to the &lt;a href="https://help.runwayml.com/hc/en-us/articles/30586818553107-Gen-3-Alpha-Prompting-Guide" rel="noopener noreferrer"&gt;Gen-3 Alpha Prompting Guide&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvumxf87jfkj79ayq2nt6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvumxf87jfkj79ayq2nt6.png" alt="Gen-3 Alpha 2"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;It’s possible to use different images for the first and last frames, but I used the same image for both in this case.&lt;/p&gt;

&lt;p&gt;Here’s the generated video, which is 10 seconds by default.&lt;/p&gt;

&lt;p&gt;&lt;iframe width="710" height="399" src="https://www.youtube.com/embed/q3ZB4sQclyg"&gt;
&lt;/iframe&gt;
&lt;/p&gt;

&lt;p&gt;The result looks quite natural. The woman in the original image is genuinely smiling. If someone showed me this video without telling me it was AI-generated, I probably wouldn’t have noticed.&lt;/p&gt;

&lt;p&gt;For comparison, I generated a similar video using &lt;strong&gt;Dream Machine&lt;/strong&gt; with the same image and prompt.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fr15qsxv50s14jurb1106.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fr15qsxv50s14jurb1106.png" alt="Gen-3 Alpha 3"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Here’s the 5-second video generated by Dream Machine.&lt;/p&gt;

&lt;p&gt;&lt;iframe width="710" height="399" src="https://www.youtube.com/embed/NKjyD2TQcQU"&gt;
&lt;/iframe&gt;
&lt;/p&gt;

&lt;p&gt;Although there’s significant movement, there is noticeable distortion, especially around the face, creating a sense of unease. This wasn’t as evident in the videos I generated in my previous article, so I thought it was worth mentioning as a reference point.&lt;/p&gt;

&lt;h2&gt;
  
  
  Video Generation 2
&lt;/h2&gt;

&lt;p&gt;For further experimentation, I generated another video using a completely different image.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fz0mphbnyu95fga2olnbe.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fz0mphbnyu95fga2olnbe.png" alt="Original Image 2"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I used the following prompt for this image:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Japanese man dancing&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Here’s the generated video.&lt;/p&gt;

&lt;p&gt;&lt;iframe width="710" height="399" src="https://www.youtube.com/embed/wj8I_Jcmzkc"&gt;
&lt;/iframe&gt;
&lt;/p&gt;

&lt;p&gt;This one also turned out very well.&lt;/p&gt;

&lt;p&gt;There’s a slight awkwardness in certain areas like the hands, but the video maintains consistency over its 10-second duration.&lt;/p&gt;

&lt;p&gt;For comparison, I also generated a video from the same image using &lt;strong&gt;Dream Machine&lt;/strong&gt;. Here’s the result:&lt;/p&gt;

&lt;p&gt;&lt;iframe width="710" height="399" src="https://www.youtube.com/embed/XpyZN2wHlpc"&gt;
&lt;/iframe&gt;
&lt;/p&gt;

&lt;p&gt;I’m not sure if this counts as dancing, but there’s definitely movement, which is a nice touch.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Although my testing was limited and the prompts were simple, I noticed distinct characteristics in the videos generated by both &lt;strong&gt;Gen-3 Alpha&lt;/strong&gt; and &lt;strong&gt;Dream Machine&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;The field of video generation has made incredible advancements, and I’m excited to see where it goes next.&lt;/p&gt;

&lt;p&gt;There have also been some interesting recent developments in the video generation space.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;AI-generated videos aren't just the future: They're here, and they're scary. AI companies are rolling out tech that can produce realistic videos from simple text prompts. Adobe is just the latest, and their AI-generated videos are impressive—even if the demos are brief.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Reference&lt;/strong&gt;: &lt;a href="https://lifehacker.com/tech/adobe-is-also-making-an-ai-video-generator" rel="noopener noreferrer"&gt;Adobe's AI Video Generator Might Be as Good as OpenAI's | Lifehacker&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;I’m looking forward to trying it out myself.&lt;/p&gt;

&lt;h2&gt;
  
  
  Japanese Version of the Article
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://qiita.com/nabata/items/4976abc57c4504edd377" rel="noopener noreferrer"&gt;動画生成AI「Gen-3 Alpha」のImage to Videoで画像を動画に変換してみたらやっぱり自然すぎて恐くなりもした&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>aivideo</category>
      <category>gen3</category>
      <category>dreammachine</category>
    </item>
    <item>
      <title>Running the FLUX.1 Image ([dev]/[schnell]) Generation AI Model by Stable Diffusion's Original Developers on a MacBook (M2)</title>
      <dc:creator>nabata</dc:creator>
      <pubDate>Sun, 29 Sep 2024 01:23:36 +0000</pubDate>
      <link>https://dev.to/nabata/running-the-flux1-image-devschnell-generation-ai-model-by-stable-diffusions-original-developers-on-a-macbook-m2-4ld6</link>
      <guid>https://dev.to/nabata/running-the-flux1-image-devschnell-generation-ai-model-by-stable-diffusions-original-developers-on-a-macbook-m2-4ld6</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;There has been a lot of buzz recently about &lt;a href="https://blackforestlabs.ai/" rel="noopener noreferrer"&gt;Black Forest Labs&lt;/a&gt;.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;AI researchers involved in the development of image generation AI such as 'Stable Diffusion' have launched a new AI development company 'Black Forest Labs'. In addition, Black Forest Labs has also announced 'Flux', an open source image generation AI model with a parameter size of 12 billion.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Reference&lt;/strong&gt;: &lt;a href="https://gigazine.net/gsc_news/en/20240802-black-forest-labs-image-ai-flux/" rel="noopener noreferrer"&gt;Stable Diffusion's original developers launch AI company Black Forest Labs and release their own image generation AI model, Flux - GIGAZINE&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;In this post, I'll walk through the process of running the image generation AI model &lt;strong&gt;FLUX.1&lt;/strong&gt; on my MacBook (M2).&lt;/p&gt;

&lt;p&gt;All the code used here is written in &lt;strong&gt;Python&lt;/strong&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  About FLUX.1
&lt;/h2&gt;

&lt;p&gt;For a detailed explanation, you can check out the official announcement “&lt;a href="https://blackforestlabs.ai/announcing-black-forest-labs/" rel="noopener noreferrer"&gt;Announcing Black Forest Labs - Black Forest Labs&lt;/a&gt;.” Here’s a quick overview of the three available models:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;FLUX.1 [pro]&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;A cutting-edge image generation model available via API access.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;strong&gt;FLUX.1 [dev]&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;A model for non-commercial applications, available on Hugging Face. For commercial use, inquiries are required.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;strong&gt;FLUX.1 [schnell]&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;A fast model optimized for local development and personal use, released under the Apache 2.0 license. It’s available on Hugging Face, and the inference code is found on GitHub and Hugging Face’s Diffusers, supporting ComfyUI integration.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;p&gt;I initially wanted to try &lt;strong&gt;FLUX.1 [pro]&lt;/strong&gt;, but the &lt;a href="https://docs.bfl.ml/" rel="noopener noreferrer"&gt;API&lt;/a&gt; is currently invite-only for selected partners. It’s possible to use it via platforms like &lt;a href="https://replicate.com/" rel="noopener noreferrer"&gt;&lt;strong&gt;Replicate&lt;/strong&gt;&lt;/a&gt; or &lt;a href="https://fal.ai/" rel="noopener noreferrer"&gt;&lt;strong&gt;Fal.ai&lt;/strong&gt;&lt;/a&gt;; instead, I’ll be using &lt;strong&gt;FLUX.1 [dev]&lt;/strong&gt;, the highest-quality model that supports local generation. Towards the end of this article, I’ll also include an example using &lt;strong&gt;FLUX.1 [schnell]&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;For those interested in quickly testing image generation, you can try the following links. Note that some services require payment.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;FLUX.1 [pro]&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://replicate.com/black-forest-labs/flux-pro" rel="noopener noreferrer"&gt;Replicate&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://fal.ai/models/fal-ai/flux-pro" rel="noopener noreferrer"&gt;Fal.ai&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;strong&gt;FLUX.1 [dev]&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://huggingface.co/spaces/black-forest-labs/FLUX.1-dev" rel="noopener noreferrer"&gt;Hugging Face&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://replicate.com/black-forest-labs/flux-dev" rel="noopener noreferrer"&gt;Replicate&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://fal.ai/models/fal-ai/flux/dev" rel="noopener noreferrer"&gt;Fal.ai&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;strong&gt;FLUX.1 [schnell]&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://huggingface.co/spaces/black-forest-labs/FLUX.1-schnell" rel="noopener noreferrer"&gt;Hugging Face&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://replicate.com/black-forest-labs/flux-schnell" rel="noopener noreferrer"&gt;Replicate&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://fal.ai/models/fal-ai/flux/schnell" rel="noopener noreferrer"&gt;Fal.ai&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;h2&gt;
  
  
  My Setup
&lt;/h2&gt;

&lt;p&gt;I’m running this on a &lt;a href="https://support.apple.com/en-us/111340" rel="noopener noreferrer"&gt;MacBook Pro (Apple M2 Pro Chip / 16GB RAM)&lt;/a&gt; running &lt;a href="https://www.apple.com/jp/macos/sonoma/" rel="noopener noreferrer"&gt;macOS 14 Sonoma&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The necessary packages include &lt;a href="https://github.com/huggingface/diffusers" rel="noopener noreferrer"&gt;diffusers&lt;/a&gt;, &lt;a href="https://pypi.org/project/sentencepiece/" rel="noopener noreferrer"&gt;sentencepiece&lt;/a&gt;, &lt;a href="https://huggingface.co/docs/transformers/model_doc/t5" rel="noopener noreferrer"&gt;t5&lt;/a&gt;, &lt;a href="https://pytorch.org/" rel="noopener noreferrer"&gt;torch&lt;/a&gt;, and &lt;a href="https://huggingface.co/docs/transformers/index" rel="noopener noreferrer"&gt;transformers&lt;/a&gt;. For &lt;strong&gt;diffusers&lt;/strong&gt;, I followed the &lt;a href="https://huggingface.co/black-forest-labs/FLUX.1-dev#diffusers" rel="noopener noreferrer"&gt;official installation guide&lt;/a&gt; from GitHub.&lt;/p&gt;

&lt;p&gt;Here’s the installation command, with a specific version of &lt;strong&gt;torch&lt;/strong&gt; for reasons I’ll explain later:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="gp"&gt;$&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;sentencepiece &lt;span class="nv"&gt;torch&lt;/span&gt;&lt;span class="o"&gt;==&lt;/span&gt;2.3.1 transformers git+https://github.com/huggingface/diffusers.git
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here are the versions running in my environment:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="gp"&gt;$&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;pip list | &lt;span class="nb"&gt;grep&lt;/span&gt; &lt;span class="nt"&gt;-e&lt;/span&gt; diffusers &lt;span class="nt"&gt;-e&lt;/span&gt; sentencepiece &lt;span class="nt"&gt;-e&lt;/span&gt; t5 &lt;span class="nt"&gt;-e&lt;/span&gt; torch &lt;span class="nt"&gt;-e&lt;/span&gt; transformers
&lt;span class="go"&gt;diffusers                    0.30.0.dev0
sentencepiece                0.2.0
t5                           0.9.4
torch                        2.3.1
transformers                 4.43.3
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Although &lt;strong&gt;torch 2.4.0&lt;/strong&gt; was the latest version as of August 5, 2024, I downgraded to &lt;strong&gt;2.3.1&lt;/strong&gt;. I initially tried &lt;strong&gt;2.4.0&lt;/strong&gt;, but the generated images were too noisy. After some research, I found that downgrading to &lt;strong&gt;2.3.1&lt;/strong&gt; resolved the issue, though I couldn’t confirm the exact reason.&lt;/p&gt;

&lt;p&gt;Here’s an example of a noisy image:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fexbdbgiy6zeliofa6nxg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fexbdbgiy6zeliofa6nxg.png" alt="flux 4"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Obtaining Access
&lt;/h2&gt;

&lt;p&gt;For this setup, I accessed the model via &lt;a href="https://huggingface.co/" rel="noopener noreferrer"&gt;Hugging Face&lt;/a&gt;, so you’ll need a &lt;strong&gt;Hugging Face&lt;/strong&gt; account.&lt;/p&gt;

&lt;p&gt;Once you’ve created an account and logged in, go to the &lt;a href="https://huggingface.co/black-forest-labs/FLUX.1-dev" rel="noopener noreferrer"&gt;FLUX.1-dev page&lt;/a&gt;, where you’ll be asked to agree to the terms.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpfw06fus6kwdihq1y75r.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpfw06fus6kwdihq1y75r.png" alt="flux 1"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;After reading and agreeing to the terms, click &lt;strong&gt;Agree and access repository&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;You’ll see a confirmation message like this, indicating that access has been granted:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F60wup37b2wpko9vik86c.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F60wup37b2wpko9vik86c.png" alt="flux 2"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Creating an Access Token
&lt;/h2&gt;

&lt;p&gt;To authenticate the model download, you’ll need to create an access token on &lt;strong&gt;Hugging Face&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Navigate to &lt;a href="https://huggingface.co/settings/profile" rel="noopener noreferrer"&gt;Settings&lt;/a&gt;, then to &lt;a href="https://huggingface.co/settings/tokens" rel="noopener noreferrer"&gt;Access Tokens&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Click &lt;strong&gt;Create new Access Token&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Since you only need read permissions, select &lt;strong&gt;Read&lt;/strong&gt;, give it a name, and generate the token.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhmxbdbuq66sxeb7faeo8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhmxbdbuq66sxeb7faeo8.png" alt="flux 3"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;To avoid hardcoding, store this token as an environment variable.&lt;/p&gt;

&lt;p&gt;Open your configuration file (in my case, it’s &lt;strong&gt;zshrc&lt;/strong&gt;):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="gp"&gt;$&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;open ~/.zshrc
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then add the generated access token to the file. I used &lt;strong&gt;HUGGING_FACE_TOKEN&lt;/strong&gt; as the variable name.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="go"&gt;export HUGGING_FACE_TOKEN=your_access_token_here
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Specifying PYTORCH_MPS_HIGH_WATERMARK_RATIO
&lt;/h2&gt;

&lt;p&gt;Since my MacBook has only 16GB of RAM, I ran into a memory shortage when trying to execute the model, resulting in this error:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;RuntimeError: MPS backend out of memory&lt;br&gt;&lt;br&gt;
Use PYTORCH_MPS_HIGH_WATERMARK_RATIO=0.0 to disable the upper limit for memory allocations (may cause system failure).&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This issue stems from insufficient memory for the &lt;strong&gt;MPS&lt;/strong&gt; GPU on my MacBook.&lt;/p&gt;

&lt;p&gt;As suggested, I set the following environment variable to remove the upper limit for &lt;strong&gt;MPS&lt;/strong&gt; usage:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="go"&gt;export PYTORCH_MPS_HIGH_WATERMARK_RATIO=0.0
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Be cautious, as removing this limit may cause the system to crash due to memory exhaustion.&lt;/p&gt;

&lt;p&gt;Initially, I tried using a non-zero value, but that led to the following error, so I set it to 0.0:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;RuntimeError: invalid low watermark ratio 1.4&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;It’s also possible to clear the &lt;strong&gt;MPS&lt;/strong&gt; cache with &lt;a href="https://pytorch.org/docs/stable/generated/torch.mps.empty_cache.html#torch.mps.empty_cache" rel="noopener noreferrer"&gt;torch.mps.empty_cache()&lt;/a&gt;, which may help avoid memory issues.&lt;/p&gt;

&lt;h2&gt;
  
  
  Fixing &lt;code&gt;transformer_flux.py&lt;/code&gt;
&lt;/h2&gt;

&lt;p&gt;Even after these adjustments, I encountered another error:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;scale = torch.arange(0, dim, 2, dtype=torch.float64, device=pos.device) / dim&lt;br&gt;&lt;br&gt;
TypeError: Cannot convert a MPS Tensor to float64 dtype as the MPS framework doesn't support float64. Please use float32 instead.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This error occurs because the &lt;strong&gt;MPS&lt;/strong&gt; backend doesn’t support &lt;strong&gt;float64&lt;/strong&gt;, so &lt;strong&gt;float32&lt;/strong&gt; must be used instead. You can fix it by modifying &lt;a href="https://github.com/huggingface/diffusers/blob/a054c784955023385cf91e792db10970d00ca281/src/diffusers/models/transformers/transformer_flux.py#L41" rel="noopener noreferrer"&gt;transformer_flux.py&lt;/a&gt; like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# scale = torch.arange(0, dim, 2, dtype=torch.float64, device=pos.device) / dim
&lt;/span&gt;&lt;span class="n"&gt;scale&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;arange&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;dim&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;dtype&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_default_dtype&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="n"&gt;device&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;pos&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;device&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;dim&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This issue is also mentioned in the GitHub issue “&lt;a href="https://github.com/huggingface/diffusers/issues/9047" rel="noopener noreferrer"&gt;flux does not work on MPS devices&lt;/a&gt;,” and it might be addressed soon. A similar issue was raised in ComfyUI: “&lt;a href="https://github.com/comfyanonymous/ComfyUI/issues/4165" rel="noopener noreferrer"&gt;FLUX Issue | MPS framework doesn’t support float64&lt;/a&gt;.”&lt;/p&gt;

&lt;h2&gt;
  
  
  Code
&lt;/h2&gt;

&lt;p&gt;The &lt;a href="https://huggingface.co/black-forest-labs/FLUX.1-dev#diffusers" rel="noopener noreferrer"&gt;official sample code&lt;/a&gt; doesn’t work directly, so I made the following changes:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Specifying the access token.&lt;/li&gt;
&lt;li&gt;Setting &lt;strong&gt;mps&lt;/strong&gt; as the &lt;strong&gt;device&lt;/strong&gt; to utilize the &lt;strong&gt;MacBook&lt;/strong&gt; GPU.&lt;/li&gt;
&lt;li&gt;Removing the call to &lt;strong&gt;enable_model_cpu_offload&lt;/strong&gt;.

&lt;ul&gt;
&lt;li&gt;This function assumes &lt;strong&gt;CUDA&lt;/strong&gt;, leading to an &lt;strong&gt;AssertionError: Torch not compiled with CUDA enabled&lt;/strong&gt;.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;
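&lt;p&gt;The device choice in change 2 can be sketched as a simple fallback order. This is my own illustration; in real code, the two flags would come from &lt;code&gt;torch.backends.mps.is_available()&lt;/code&gt; and &lt;code&gt;torch.cuda.is_available()&lt;/code&gt;:&lt;/p&gt;

```python
def pick_device_name(mps_available: bool, cuda_available: bool) -> str:
    # Prefer the MacBook GPU (MPS), then CUDA, then fall back to CPU.
    if mps_available:
        return "mps"
    if cuda_available:
        return "cuda"
    return "cpu"
```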

&lt;p&gt;Here’s the modified code, with comments noting the changes from the official sample:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;diffusers&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;FluxPipeline&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;  &lt;span class="c1"&gt;# Added for environment variable access
&lt;/span&gt;
&lt;span class="n"&gt;hf_token&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getenv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;HUGGING_FACE_TOKEN&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# Retrieve the Hugging Face token
&lt;/span&gt;&lt;span class="n"&gt;pipe&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;FluxPipeline&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;from_pretrained&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;black-forest-labs/FLUX.1-dev&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                                    &lt;span class="n"&gt;torch_dtype&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;bfloat16&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                                    &lt;span class="n"&gt;token&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;hf_token&lt;/span&gt;  &lt;span class="c1"&gt;# Specify the Hugging Face token
&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;pipe&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;to&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;device&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;mps&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;  &lt;span class="c1"&gt;# Specify MPS as the device
# pipe.enable_model_cpu_offload()  # Removed
&lt;/span&gt;
&lt;span class="n"&gt;prompt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;A cat holding a sign that says hello world&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="n"&gt;image&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;pipe&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;height&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1024&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;width&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1024&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;guidance_scale&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;3.5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;output_type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;pil&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;num_inference_steps&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;50&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;max_sequence_length&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;512&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;generator&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Generator&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cpu&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;manual_seed&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="n"&gt;images&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;image&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;save&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;flux-dev.png&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If you’re testing, you can speed things up by reducing &lt;strong&gt;num_inference_steps&lt;/strong&gt;. You can control the random seed with &lt;strong&gt;generator&lt;/strong&gt; to make results reproducible. For details on the parameters, refer to the &lt;a href="https://huggingface.co/docs/diffusers/main/en/api/pipelines/flux#diffusers.FluxPipeline.__call__" rel="noopener noreferrer"&gt;API reference&lt;/a&gt;.&lt;/p&gt;
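&lt;p&gt;For quick experiments, the call parameters can be collected in a small helper that trades quality for speed. The helper name and the reduced step count of 10 are my own choices; the remaining values match the code above:&lt;/p&gt;

```python
def flux_dev_params(quick_test: bool = False) -> dict:
    # Fewer inference steps generate faster but at lower quality.
    return {
        "height": 1024,
        "width": 1024,
        "guidance_scale": 3.5,
        "output_type": "pil",
        "num_inference_steps": 10 if quick_test else 50,
        "max_sequence_length": 512,
    }

# Usage: image = pipe(prompt, generator=generator, **flux_dev_params(quick_test=True)).images[0]
```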

&lt;p&gt;The model generates an image based on the prompt “&lt;strong&gt;A cat holding a sign that says hello world&lt;/strong&gt;.”&lt;/p&gt;

&lt;p&gt;The first time you run the code, the model and dependencies (over 30GB) will be downloaded, which may take some time depending on your network speed.&lt;/p&gt;

&lt;h2&gt;
  
  
  Results
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzd6jol2ouxghdwtal5jv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzd6jol2ouxghdwtal5jv.png" alt="flux 5"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;While there are areas for improvement, the result is quite clean, and the text “&lt;strong&gt;Hello World&lt;/strong&gt;” is clearly legible. The prompt is faithfully represented in the image.&lt;/p&gt;

&lt;p&gt;I also tried a different prompt:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;prompt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;anime style, Japanese moe heroine&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here’s the result:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwqdtyvy1duzyrv6lec9u.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwqdtyvy1duzyrv6lec9u.png" alt="flux 6"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;While I won’t go into the nuances of &lt;strong&gt;moe&lt;/strong&gt;, the generated image is a high-quality, accurate representation of the prompt.&lt;/p&gt;

&lt;p&gt;For comparison, here’s an image generated with the same prompt using &lt;a href="https://platform.stability.ai/docs/api-reference#tag/Generate/paths/~1v2beta~1stable-image~1generate~1ultra/post" rel="noopener noreferrer"&gt;Stable Image Ultra&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fc97n1fo9f0dntgehrmwi.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fc97n1fo9f0dntgehrmwi.png" alt="Stable Diffusion Ultra"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Comparing them, I’d say &lt;strong&gt;FLUX.1&lt;/strong&gt; is closer to the concept of a &lt;strong&gt;moe heroine&lt;/strong&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Testing FLUX.1 [schnell]
&lt;/h2&gt;

&lt;p&gt;I also tested &lt;strong&gt;FLUX.1 [schnell]&lt;/strong&gt;, the faster model that generates images in fewer steps. Using the &lt;a href="https://huggingface.co/black-forest-labs/FLUX.1-schnell#diffusers" rel="noopener noreferrer"&gt;official sample code&lt;/a&gt;, I generated the following image:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;diffusers&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;FluxPipeline&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;  &lt;span class="c1"&gt;# Added for environment variable access
&lt;/span&gt;
&lt;span class="n"&gt;hf_token&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getenv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;HUGGING_FACE_TOKEN&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# Retrieve the Hugging Face token
&lt;/span&gt;&lt;span class="n"&gt;pipe&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;FluxPipeline&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;from_pretrained&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;black-forest-labs/FLUX.1-schnell&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                                    &lt;span class="n"&gt;torch_dtype&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;bfloat16&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                                    &lt;span class="n"&gt;token&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;hf_token&lt;/span&gt;  &lt;span class="c1"&gt;# Specify the Hugging Face token
&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;pipe&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;to&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;device&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;mps&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;  &lt;span class="c1"&gt;# Specify MPS as the device
# pipe.enable_model_cpu_offload()  # Removed
&lt;/span&gt;
&lt;span class="n"&gt;prompt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;A cat holding a sign that says hello world&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="n"&gt;image&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;pipe&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;guidance_scale&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;output_type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;pil&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;num_inference_steps&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;max_sequence_length&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;256&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;generator&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Generator&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cpu&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;manual_seed&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="n"&gt;images&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;image&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;save&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;flux-schnell.png&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;As with &lt;strong&gt;FLUX.1 [dev]&lt;/strong&gt;, the first run will involve downloading over 30GB of data, which might take a while.&lt;/p&gt;

&lt;p&gt;Here’s the result:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ft5uiwdaduf3aaacju7x0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ft5uiwdaduf3aaacju7x0.png" alt="flux 7"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I also tried using the prompt &lt;strong&gt;“anime style, Japanese moe heroine”&lt;/strong&gt;:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F883xjjc71saavz61fvgh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F883xjjc71saavz61fvgh.png" alt="flux 8"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;While the quality isn’t quite as high as &lt;strong&gt;FLUX.1 [dev]&lt;/strong&gt;, both images faithfully reflect the prompt and are visually appealing. With fewer inference steps, &lt;strong&gt;FLUX.1 [schnell]&lt;/strong&gt; generates images much faster than &lt;strong&gt;FLUX.1 [dev]&lt;/strong&gt;.&lt;/p&gt;
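&lt;p&gt;To summarize, the two official samples differ only in a few call parameters, which accounts for the speed gap. Collected here as a plain dictionary (the variable name is mine; the values are from the two code listings above):&lt;/p&gt;

```python
# Parameter differences between the two official samples shown in this article.
FLUX_SAMPLE_PARAMS = {
    "black-forest-labs/FLUX.1-dev": {
        "num_inference_steps": 50,
        "guidance_scale": 3.5,
        "max_sequence_length": 512,
    },
    "black-forest-labs/FLUX.1-schnell": {
        "num_inference_steps": 4,
        "guidance_scale": 0.0,
        "max_sequence_length": 256,
    },
}
```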

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;I’m eager to try the API when it becomes available.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Today we release the FLUX.1 text-to-image model suite. With their strong creative capabilities, these models serve as a powerful foundation for our upcoming suite of competitive generative text-to-video systems. Our video models will unlock precise creation and editing at high definition and unprecedented speed. We are committed to continue pioneering the future of generative media.&lt;/p&gt;

&lt;p&gt;Reference: &lt;a href="https://blackforestlabs.ai/announcing-black-forest-labs/" rel="noopener noreferrer"&gt;Announcing Black Forest Labs - Black Forest Labs&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;I’m really looking forward to seeing what they develop next in the area of video generation.&lt;/p&gt;

&lt;h2&gt;
  
  
  Japanese Version of the Article
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://qiita.com/nabata/items/ed97f7238f13bd4450a3" rel="noopener noreferrer"&gt;Stable Diffusionのオリジナル開発陣が発表した画像生成AIモデルFLUX.1([dev]/[schnell])をMacBook(M2)で動かしてみた&lt;/a&gt;&lt;/p&gt;

</description>
      <category>python</category>
      <category>stablediffusion</category>
      <category>ai</category>
      <category>flux1</category>
    </item>
    <item>
      <title>Using "Hive Moderation AI-GENERATED CONTENT DETECTION" to Identify Images from Tools like DALL·E 3, FLUX.1, and ImageFX</title>
      <dc:creator>nabata</dc:creator>
      <pubDate>Sat, 28 Sep 2024 13:58:00 +0000</pubDate>
      <link>https://dev.to/nabata/using-hive-moderation-ai-generated-content-detection-to-identify-images-from-tools-like-dalle-3-flux1-and-imagefx-3jf7</link>
      <guid>https://dev.to/nabata/using-hive-moderation-ai-generated-content-detection-to-identify-images-from-tools-like-dalle-3-flux1-and-imagefx-3jf7</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;In this article, we will evaluate and identify AI-generated images using &lt;a href="https://hivemoderation.com/ai-generated-content-detection" rel="noopener noreferrer"&gt;Hive Moderation AI-GENERATED CONTENT DETECTION&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The AI image generation tools we’ll focus on are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://openai.com/index/dall-e-3/" rel="noopener noreferrer"&gt;DALL·E 3&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://huggingface.co/black-forest-labs/FLUX.1-dev" rel="noopener noreferrer"&gt;FLUX.1[dev]&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://aitestkitchen.withgoogle.com/tools/image-fx" rel="noopener noreferrer"&gt;ImageFX&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://ja.stability.ai/stable-diffusion" rel="noopener noreferrer"&gt;Stable Diffusion&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.adobe.com/jp/products/firefly.html" rel="noopener noreferrer"&gt;Firefly&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Premise
&lt;/h2&gt;

&lt;p&gt;In the world of generative AI, &lt;a href="https://contentcredentials.org/" rel="noopener noreferrer"&gt;Content Credentials&lt;/a&gt; are becoming increasingly prevalent.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;With the improved quality of content from generative AI models, the need for transparency regarding the origin of AI-generated content has grown. All AI-generated images from Azure OpenAI Service now include Content Credentials—a tamper-evident method to disclose the origin and history of content. Content Credentials are based on an open technical specification from the Coalition for Content Provenance and Authenticity (C2PA), a Joint Development Foundation project.  &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Reference&lt;/strong&gt;: &lt;a href="https://learn.microsoft.com/en-us/azure/ai-services/openai/concepts/content-credentials" rel="noopener noreferrer"&gt;Content Credentials in Azure OpenAI - Microsoft Learn&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Some of the images used in this evaluation include Content Credentials, yet there were many cases where the detection results didn’t align perfectly. For instance, images generated on the web using &lt;strong&gt;Firefly&lt;/strong&gt; by &lt;a href="https://www.adobe.com/" rel="noopener noreferrer"&gt;Adobe&lt;/a&gt; do include Content Credentials but were still identified as having a low probability of being AI-generated.&lt;/p&gt;

&lt;p&gt;Because of this, we will proceed without considering Content Credentials in this article.&lt;/p&gt;

&lt;p&gt;Additionally, while this article only includes two to three images per tool, more were tested in practice. Since the trends were consistent, we are presenting only a selection here.&lt;/p&gt;

&lt;h2&gt;
  
  
  DALL·E 3
&lt;/h2&gt;

&lt;p&gt;We first evaluated images generated by &lt;strong&gt;DALL·E 3&lt;/strong&gt; from &lt;a href="https://openai.com/" rel="noopener noreferrer"&gt;OpenAI&lt;/a&gt;, created by providing prompts through &lt;a href="https://openai.com/chatgpt/" rel="noopener noreferrer"&gt;ChatGPT&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbzzbczce83n2i5s0c40e.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbzzbczce83n2i5s0c40e.png" alt="DALL·E 3 Image 1" width="800" height="561"&gt;&lt;/a&gt;&lt;br&gt;&lt;br&gt;
&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fer1colztl5ar3zwjpbvn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fer1colztl5ar3zwjpbvn.png" alt="DALL·E 3 Image 2" width="800" height="561"&gt;&lt;/a&gt;&lt;br&gt;&lt;br&gt;
&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fro71kwxac03shqrg4tji.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fro71kwxac03shqrg4tji.png" alt="DALL·E 3 Image 3" width="800" height="561"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Although some of the labels were inaccurate, all images were correctly identified as AI-generated with a high probability score of over 99%. That judgment seems fair, as they do look like AI-generated images.&lt;/p&gt;

&lt;p&gt;On a related note, I’ve recently heard people say, "&lt;strong&gt;This image looks AI-generated&lt;/strong&gt;." I wonder what criteria people use to make that judgment.&lt;/p&gt;

&lt;h2&gt;
  
  
  FLUX.1[dev]
&lt;/h2&gt;

&lt;p&gt;Next, we evaluated images generated using &lt;strong&gt;FLUX.1[dev]&lt;/strong&gt; by &lt;a href="https://blackforestlabs.ai/" rel="noopener noreferrer"&gt;Black Forest Labs&lt;/a&gt;. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2bl64dafpr46j3jvb90k.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2bl64dafpr46j3jvb90k.png" alt="FLUX.1 Image 1" width="800" height="561"&gt;&lt;/a&gt;&lt;br&gt;&lt;br&gt;
&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftj2kaozw2q6fvo9slque.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftj2kaozw2q6fvo9slque.png" alt="FLUX.1 Image 2" width="800" height="561"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As expected, these were also identified as AI-generated with a high probability exceeding 99%.&lt;/p&gt;

&lt;h2&gt;
  
  
  ImageFX
&lt;/h2&gt;

&lt;p&gt;Next, we tested images generated by Google’s &lt;strong&gt;ImageFX&lt;/strong&gt;, which has gained attention for its impressive realism.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fezhwcxsb1j4a5k4gfzn1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fezhwcxsb1j4a5k4gfzn1.png" alt="ImageFX Image 1" width="800" height="561"&gt;&lt;/a&gt;&lt;br&gt;&lt;br&gt;
&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fag53xz64n6haupwbgwl3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fag53xz64n6haupwbgwl3.png" alt="ImageFX Image 2" width="800" height="561"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The first image scored slightly below 90%, but it was still detected as AI-generated with a high probability. &lt;/p&gt;
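&lt;p&gt;For summarizing scores like these, it’s convenient to bucket the reported probability. The thresholds below are my own informal choice for this write-up, not part of the Hive service:&lt;/p&gt;

```python
def label_ai_probability(p: float) -> str:
    # Informal buckets for a "likely AI-generated" probability score.
    if not 0.0 <= p <= 1.0:
        raise ValueError("probability must be in [0.0, 1.0]")
    if p >= 0.99:
        return "almost certainly AI-generated"
    if p >= 0.5:
        return "likely AI-generated"
    return "unlikely AI-generated"
```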

&lt;p&gt;On a side note, the quality of the photo-like images, especially of people, is truly impressive and lives up to the hype. It’s becoming harder to believe that these images are AI-generated without prior knowledge.&lt;/p&gt;

&lt;h2&gt;
  
  
  Stable Diffusion
&lt;/h2&gt;

&lt;p&gt;Next, we evaluated images generated by &lt;strong&gt;Stable Diffusion&lt;/strong&gt; from &lt;a href="https://ja.stability.ai/" rel="noopener noreferrer"&gt;Stability AI&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fg6chdqg7ncqse6gdlzff.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fg6chdqg7ncqse6gdlzff.png" alt="Stable Diffusion Image 1" width="800" height="561"&gt;&lt;/a&gt;&lt;br&gt;&lt;br&gt;
&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjbajxzastmwfhj75p6pk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjbajxzastmwfhj75p6pk.png" alt="Stable Diffusion Image 2" width="800" height="564"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Both images were detected as AI-generated, as expected.&lt;/p&gt;

&lt;h3&gt;
  
  
  Testing Image to Image
&lt;/h3&gt;

&lt;p&gt;Since almost all images so far were detected as AI-generated with over 99% probability, I decided to try something different.&lt;/p&gt;

&lt;p&gt;With &lt;strong&gt;Image to Image&lt;/strong&gt; generation, you can specify a base image for the AI to modify. I’ll test how well this detection tool identifies AI modifications.&lt;/p&gt;

&lt;p&gt;Here’s a photo I took at a certain location. I’ve reduced the file size for easier viewing.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F49nbyr8en04nww22dsot.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F49nbyr8en04nww22dsot.jpg" alt="Original Image" width="800" height="600"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Using the same &lt;strong&gt;Stable Diffusion 3 Large&lt;/strong&gt;, I applied the prompt "&lt;strong&gt;Night view of buildings&lt;/strong&gt;" with &lt;strong&gt;Image to Image&lt;/strong&gt; generation. The degree to which the prompt modifies the original image can be adjusted between &lt;strong&gt;0.0 and 1.0&lt;/strong&gt;, with higher values deviating more from the original image. I generated images at various stages of modification and tested them.&lt;/p&gt;
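&lt;p&gt;For reference, here is a minimal Python sketch of an &lt;strong&gt;Image to Image&lt;/strong&gt; call like the one above. The endpoint and form fields follow Stability AI's v2beta REST API as I understand it; treat the exact field names and the &lt;code&gt;sd3-large&lt;/code&gt; model id as assumptions and confirm them against the official documentation.&lt;/p&gt;

```python
# Hypothetical sketch of an image-to-image request to Stable Diffusion 3 Large
# via the Stability AI REST API (v2beta). Endpoint and field names are
# assumptions; verify against the official API reference.
import os

API_URL = "https://api.stability.ai/v2beta/stable-image/generate/sd3"


def build_request(prompt: str, strength: float) -> dict:
    """Form fields for an image-to-image call.

    strength is in [0.0, 1.0]; higher values deviate more
    from the original image.
    """
    if not 0.0 <= strength <= 1.0:
        raise ValueError("strength must be between 0.0 and 1.0")
    return {
        "prompt": prompt,
        "mode": "image-to-image",
        "model": "sd3-large",          # assumed model id
        "strength": str(strength),
        "output_format": "png",
    }


def generate(src_path: str, prompt: str, strength: float, out_path: str) -> None:
    """POST the source image plus form fields and save the returned PNG."""
    import requests  # imported here so the pure helper above has no dependency

    resp = requests.post(
        API_URL,
        headers={
            "Authorization": f"Bearer {os.environ['STABILITY_API_KEY']}",
            "Accept": "image/*",
        },
        files={"image": open(src_path, "rb")},
        data=build_request(prompt, strength),
    )
    resp.raise_for_status()
    with open(out_path, "wb") as f:
        f.write(resp.content)


# Example (requires STABILITY_API_KEY to be set):
# generate("night_view.jpg", "Night view of buildings",
#          strength=0.5, out_path="out_0.5.png")
```

&lt;p&gt;Generating one image per strength value (0.3, 0.5, 0.7, 0.9) then produces the series tested below.&lt;/p&gt;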

&lt;h3&gt;
  
  
  Original Image
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Feg72v74mvc4nees66oux.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Feg72v74mvc4nees66oux.png" alt="Stable Diffusion Original" width="800" height="561"&gt;&lt;/a&gt;  &lt;/p&gt;

&lt;p&gt;Detection result: near 0.&lt;/p&gt;

&lt;h3&gt;
  
  
  strength 0.3 (original image 0.7)
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9zdy28ru5k1cq2hzu90e.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9zdy28ru5k1cq2hzu90e.png" alt="Stable Diffusion 0.3" width="800" height="561"&gt;&lt;/a&gt;  &lt;/p&gt;

&lt;p&gt;Probability increased by only 0.2%, still near 0.&lt;/p&gt;

&lt;h3&gt;
  
  
  strength 0.5 (original image 0.5)
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo6hmpry0wczkhhgqdbme.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo6hmpry0wczkhhgqdbme.png" alt="Stable Diffusion 0.5" width="800" height="561"&gt;&lt;/a&gt;  &lt;/p&gt;

&lt;p&gt;Now entering double digits.&lt;/p&gt;

&lt;h3&gt;
  
  
  strength 0.7 (original image 0.3)
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fikxcqtdgxyqpk6m7dis7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fikxcqtdgxyqpk6m7dis7.png" alt="Stable Diffusion 0.7" width="800" height="561"&gt;&lt;/a&gt;  &lt;/p&gt;

&lt;p&gt;Approaching 50%.&lt;/p&gt;

&lt;h3&gt;
  
  
  strength 0.9 (original image 0.1)
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ft3nv6p1261ij01pi7exo.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ft3nv6p1261ij01pi7exo.png" alt="Stable Diffusion 0.9" width="800" height="561"&gt;&lt;/a&gt;&lt;br&gt;&lt;br&gt;
Surpassed 50%.&lt;/p&gt;

&lt;p&gt;As the degree of modification increases, the probability of AI detection rises with it.&lt;/p&gt;

&lt;h2&gt;
  
  
  Firefly
&lt;/h2&gt;

&lt;p&gt;Lastly, I tested Adobe's &lt;strong&gt;Firefly&lt;/strong&gt;, with a slightly different approach.&lt;/p&gt;

&lt;h3&gt;
  
  
  Modifying the Sky
&lt;/h3&gt;

&lt;p&gt;Using &lt;a href="https://www.adobe.com/jp/creativecloud/roc/products/photoshop/beginner.html" rel="noopener noreferrer"&gt;Photoshop&lt;/a&gt;, I applied AI generation to only the sky in the earlier night view photo. Here’s the result: the upper half of the image is AI-generated, while the lower half remains unchanged from the original.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhclz7vub2osg6i4liwbg.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhclz7vub2osg6i4liwbg.jpg" alt="Firefly Original Image 1" width="800" height="600"&gt;&lt;/a&gt;  &lt;/p&gt;

&lt;p&gt;The detection result is as follows.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjlmmjb9gcrnhvictpx8o.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjlmmjb9gcrnhvictpx8o.png" alt="Firefly 1" width="800" height="561"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Detection result: 0.1%.&lt;/p&gt;

&lt;h3&gt;
  
  
  Modifying the River
&lt;/h3&gt;

&lt;p&gt;Next, I left the upper half as-is and applied AI generation to the lower half of the image.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3ehgpctkdg7xw6gkuud9.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3ehgpctkdg7xw6gkuud9.jpg" alt="Firefly Original Image 2" width="800" height="600"&gt;&lt;/a&gt;  &lt;/p&gt;

&lt;p&gt;The detection result is as follows.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fh6yjzhzrqp84cv691nfc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fh6yjzhzrqp84cv691nfc.png" alt="Firefly 2" width="800" height="561"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Detection result: similarly low probability.&lt;/p&gt;

&lt;h3&gt;
  
  
  Testing 100% AI-Generated Image
&lt;/h3&gt;

&lt;p&gt;Finally, I tested an entirely AI-generated image from &lt;strong&gt;Firefly&lt;/strong&gt; using a prompt. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fixqzppgz992m4evk0irj.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fixqzppgz992m4evk0irj.jpg" alt="Firefly Original Image 3" width="800" height="800"&gt;&lt;/a&gt;  &lt;/p&gt;

&lt;p&gt;The detection result is as follows.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frl2y7sm3uxdhijs9myww.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frl2y7sm3uxdhijs9myww.png" alt="Firefly 3" width="800" height="561"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Detection result: slightly above 1%, but still relatively low.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Firefly&lt;/strong&gt; is listed as a supported tool in &lt;strong&gt;AI-GENERATED CONTENT DETECTION&lt;/strong&gt;, so why the low scores? Perhaps the detector does not yet account for Firefly's latest generation logic, or perhaps &lt;strong&gt;Firefly&lt;/strong&gt; operates on fundamentally different principles.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;While this tool provides a useful benchmark, it remains difficult to accurately detect AI-generated content in every case.&lt;/p&gt;

&lt;p&gt;As I’ve noted previously, with the prevalence of AI in digital tools today, attempting to definitively distinguish between &lt;strong&gt;AI-generated and non-AI&lt;/strong&gt; content may be a futile effort.&lt;/p&gt;

&lt;h2&gt;
  
  
  Japanese Version of the Article
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://qiita.com/nabata/items/ed3f7a6bd0e17677b000" rel="noopener noreferrer"&gt;AI生成かどうかを判定する「Hive Moderation AI-GENERATED CONTENT DETECTION」にDALL·E 3やFLUX.1、ImageFX等の生成画像を判定させてみた&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>flux1</category>
      <category>openai</category>
      <category>firefly</category>
    </item>
  </channel>
</rss>
