Yes, it's true. I created an image file format using only Python. struct is a module that ships with Python, and it was used extensively in this project for packing binary data as well as unpacking it.
Firstly, what is an image file?
It has two primary components:
- Header
- Data

The header holds the metadata of the image, while the data holds the pixel intensity values. Within a given format, the header fields have a fixed layout that never changes position, but the layout varies from format to format. These "fields" are defined by the format's spec sheet. Here is mine:
AVJ File Format Specification
| Field | Size (bytes) | Type | Description |
|---|---|---|---|
| Magic Number | 4 | ASCII (`4s`) | File signature, fixed to "AVJ1" |
| Version | 2 | Unsigned short (`H`) | File format version. Current = 1 |
| Image Height | 4 | Unsigned int (`I`) | Image height in pixels |
| Image Width | 4 | Unsigned int (`I`) | Image width in pixels |
| Color Mode | 1 | Unsigned byte (`B`) | Color representation. 3 = RGB (only mode supported in v1) |
| Alt Text Length | 2 | Unsigned short (`H`) | Length of alt text (UTF-8 encoded) in bytes |
| Mode String Length | 1 | Unsigned byte (`B`) | Length of the Pillow mode string (e.g. "RGB") in bytes |
| Embedding 1 Length | 4 | Unsigned int (`I`) | Length of first embedding vector (bytes) |
| Embedding 2 Length | 4 | Unsigned int (`I`) | Length of second embedding vector (bytes) |
| Header Total | 26 | — | Fixed header size before variable sections |
| Alt Text | Variable | UTF-8 string | Textual description for accessibility / metadata |
| Mode String | Variable | UTF-8 string | Pillow image mode (e.g. "RGB") |
| Embedding Vector 1 | Variable | Raw bytes | Embedding of alt text |
| Embedding Vector 2 | Variable | Raw bytes | Embedding of image pixel data |
| Image Pixel Data | Variable | Raw RGB bytes | Pixel matrix (width × height × 3 bytes) |
The file extension is '.avj'.
It's an uncompressed file format that carries vector embeddings of both the alt text and the image data itself. Tailored for ML use cases!
In the spec sheet you also need to record the type of each field.
You can pick the types from the following:
Commonly Used Struct Format Characters
| Format Character | Python Type | Size (bytes) | Notes |
|---|---|---|---|
| `B` | int | 1 | Unsigned byte (0–255), great for flags like color mode |
| `H` | int | 2 | Unsigned short (0–65,535), good for version numbers |
| `I` | int | 4 | Unsigned int (0–4 billion), perfect for dimensions or lengths |
| `f` | float | 4 | 32-bit float, useful for storing embedding values if needed |
| `d` | float | 8 | 64-bit float, more precise embeddings |
| `s` | bytes | count given | Fixed-length string (e.g. `4s` for "AVJ1") |
Endianness / Byte Order Prefixes
| Prefix | Meaning |
|---|---|
| `<` | Little-endian (recommended for simplicity) |
| `>` | Big-endian |
| `!` | Network order (big-endian) |
Note: if you don't know what endianness is, just choose little-endian.
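To see how the format characters and an endianness prefix fit together, here's a tiny standalone example (the field values are made up purely for illustration):

```python
import struct

# Little-endian: a 4-byte tag, a 2-byte version and a 4-byte width.
packed = struct.pack('<4s H I', b'DEMO', 1, 640)

print(struct.calcsize('<4s H I'))        # 10  -> 4 + 2 + 4 bytes
print(struct.unpack('<4s H I', packed))  # (b'DEMO', 1, 640)
```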
Feel free to include a description for yourself. This makes things easier when coding. After you are done with that, let's get to coding:
So first, let's import the required libraries:
```python
from fastapi import FastAPI, UploadFile, File
from fastapi.responses import StreamingResponse, JSONResponse
import struct
from PIL import Image
import io
import numpy as np
import torch
from transformers import CLIPProcessor, CLIPModel
```
I am building this project so that I can interact with it as an API, which is why I import FastAPI. Pillow handles image processing, NumPy handles array work and a few other tasks, and PyTorch plus the Transformers library handle embedding generation. Since we want to use the CLIP model for the embeddings, we import that too.
```python
app = FastAPI(title=".avj Encoder/Decoder with Embeddings")

# ------------------- AVJ Format -------------------
HEADER_FORMAT = '<4s H I I B H B I I'
HEADER_SIZE = struct.calcsize(HEADER_FORMAT)
```
Here we define the header format of our image file, based on the table we discussed earlier, and compute the header size with struct.calcsize.
In addition to this, I also create the FastAPI app.
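As a quick sanity check (my own, not from the original code), the little-endian format string has no padding, so the fixed header size works out like this:

```python
# 4s + H + I + I + B + H + B + I + I
#  4 + 2 + 4 + 4 + 1 + 2 + 1 + 4 + 4 = 26
print(HEADER_SIZE)  # 26
```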
```python
def image_to_bytes(image_file):
    img = Image.open(image_file).convert("RGB")
    return img.tobytes(), img.width, img.height, img.mode
```
Here we define a function that takes an image file (a path or file-like object) as its argument. It opens the image, converts it to RGB, and uses Pillow's tobytes method to turn it into raw binary. It then returns that binary along with some image metadata (width, height and mode).
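For example (the file name is just a placeholder), a 640×480 RGB image yields 640 × 480 × 3 = 921,600 bytes:

```python
raw, w, h, mode = image_to_bytes("photo.jpg")  # hypothetical input file
print(w, h, mode, len(raw))                    # e.g. 640 480 RGB 921600
```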
```python
def encode_headers_with_embeddings(raw_bytes, h, w, mode, alt_text, alt_emb, img_emb):
    alt_text_encoded = alt_text.encode("utf-8")
    len_alt_text_encoded = len(alt_text_encoded)

    mode_encoded = mode.encode("utf-8")
    len_mode_encoded = len(mode_encoded)

    alt_emb_bytes = np.array(alt_emb, dtype=np.float32).tobytes()
    img_emb_bytes = np.array(img_emb, dtype=np.float32).tobytes()

    header = struct.pack(
        HEADER_FORMAT,
        b'AVJ1',  # magic
        1,        # version
        int(h),
        int(w),
        3,        # channels RGB
        len_alt_text_encoded,
        len_mode_encoded,
        len(alt_emb_bytes),
        len(img_emb_bytes)
    )

    return header + alt_text_encoded + mode_encoded + alt_emb_bytes + img_emb_bytes + raw_bytes
```
Now, this is the encoder function. It packs the metadata, everything we need the header to have, together with the image binary.
The function accepts 7 arguments:
- raw_bytes - the image in bytes
- h - image height
- w - image width
- mode - the Pillow mode string (e.g. "RGB", "RGBA")
- alt_text - well... the alt text!
- alt_emb - embeddings of the alt text
- img_emb - embeddings of the image itself (not including the headers)
So here we first encode the alt text into a standard format, i.e. UTF-8, and do the same for the mode string. Then we convert the embeddings into bytes. We use the pack function from the struct module to "pack" all the header info into a fixed binary layout, then append the variable sections and the raw pixel bytes. And that's the encoder!
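Here's a rough sketch of how you might call it, using zero vectors as stand-in embeddings (the real ones come from CLIP later) and a hypothetical file name:

```python
raw, w, h, mode = image_to_bytes("photo.jpg")  # hypothetical input file
alt_emb = np.zeros(512, dtype=np.float32)      # placeholder embedding
img_emb = np.zeros(512, dtype=np.float32)      # placeholder embedding

avj_bytes = encode_headers_with_embeddings(raw, h, w, mode, "a sample photo", alt_emb, img_emb)
with open("photo.avj", "wb") as f:
    f.write(avj_bytes)
```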
```python
def decode_headers_with_embeddings(encoded_bytes):
    header = encoded_bytes[:HEADER_SIZE]
    magic, version, height, width, channels, alt_text_len, mode_len, alt_emb_len, img_emb_len = struct.unpack(HEADER_FORMAT, header)

    start = HEADER_SIZE
    alt_text = encoded_bytes[start:start+alt_text_len].decode("utf-8")
    start += alt_text_len

    mode = encoded_bytes[start:start+mode_len].decode("utf-8")
    start += mode_len

    alt_emb_bytes = encoded_bytes[start:start+alt_emb_len]
    alt_embedding = np.frombuffer(alt_emb_bytes, dtype=np.float32)
    start += alt_emb_len

    img_emb_bytes = encoded_bytes[start:start+img_emb_len]
    image_embedding = np.frombuffer(img_emb_bytes, dtype=np.float32)
    start += img_emb_len

    image_bytes = encoded_bytes[start:]

    return {
        "magic": magic.decode("utf-8", errors="ignore"),
        "version": version,
        "height": height,
        "width": width,
        "channels": channels,
        "alt_text": alt_text,
        "mode": mode,
        "alt_embedding": alt_embedding.tolist(),
        "image_embedding": image_embedding.tolist(),
        "image_bytes": image_bytes
    }
```
```python
def reconstruct_image(image_bytes, width, height, mode="RGB"):
    return Image.frombytes(mode, (width, height), image_bytes)
```
```python
# ------------------- CLIP Embeddings -------------------
clip_model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
clip_processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
```
The decoder function does the opposite: it takes an .avj file and unpacks it. It extracts the header, grabs the alt text, mode, and embeddings, and then returns all of that in a Python dictionary.
The decoder works by first reading the fixed-size header from the file and unpacking it into fields like magic number, version, dimensions, and the lengths of the variable sections. Using those lengths, it moves through the file step by step: first extracting and decoding the alt text, then the image mode, followed by the two embeddings which are converted into NumPy arrays. Whatever remains after that is the raw pixel data. Finally, all of this information is returned neatly in a dictionary so the image and metadata can be reconstructed.
The restoration function simply takes the raw pixel bytes along with the image’s width, height, and mode, and feeds them into Pillow’s frombytes method. This rebuilds the original image exactly as it was, using the metadata we extracted from the file.
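Putting the two together, a round trip might look like this (file names are placeholders):

```python
with open("photo.avj", "rb") as f:
    decoded = decode_headers_with_embeddings(f.read())

img = reconstruct_image(decoded["image_bytes"], decoded["width"], decoded["height"], decoded["mode"])
img.save("restored.png")
print(decoded["alt_text"], decoded["width"], decoded["height"], len(decoded["alt_embedding"]))
```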
We use the CLIP model and processor loaded above to generate both embeddings: one from the alt text and one from the image pixels.
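A minimal sketch of how those embeddings could be produced with that model (this helper is my own illustration, not the post's exact code):

```python
def make_embeddings(pil_image, alt_text):
    with torch.no_grad():
        # Text embedding from the alt text.
        text_inputs = clip_processor(text=[alt_text], return_tensors="pt", padding=True)
        alt_emb = clip_model.get_text_features(**text_inputs)[0].numpy()

        # Image embedding from the pixel data.
        image_inputs = clip_processor(images=pil_image, return_tensors="pt")
        img_emb = clip_model.get_image_features(**image_inputs)[0].numpy()
    return alt_emb, img_emb
```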
Then we just define the API endpoints and set up the flow. Voilà! Our own image file format!
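For illustration, a pair of endpoints could look something like this (the route names, parameters, and the make_embeddings helper are my assumptions, not the original code):

```python
@app.post("/encode")
async def encode(file: UploadFile = File(...), alt_text: str = ""):
    raw, w, h, mode = image_to_bytes(io.BytesIO(await file.read()))
    alt_emb, img_emb = make_embeddings(Image.frombytes(mode, (w, h), raw), alt_text)
    avj = encode_headers_with_embeddings(raw, h, w, mode, alt_text, alt_emb, img_emb)
    return StreamingResponse(io.BytesIO(avj), media_type="application/octet-stream")

@app.post("/decode")
async def decode(file: UploadFile = File(...)):
    decoded = decode_headers_with_embeddings(await file.read())
    decoded.pop("image_bytes")  # raw pixels aren't JSON-serialisable, so drop them from the response
    return JSONResponse(decoded)
```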