hexfloor

More storage for media : organize files using ChatGPT : part 1, data preparation

Main article

Age of AI

Given the rise of ChatGPT and other generative AI tools I have to reinvent my posts to keep them relevant, so a short introduction is in order.
From now on my posts will be less technical and more functional. In our conversation, ChatGPT (which I will simply call "it" from here on) granted me permission to use our exchange in this blog and to refer to ChatGPT as "it".

Introduction

Let's consider a common problem : we wish to optimize the storage of media files.
Now, let's imagine we have made a dump of photos and videos grouped by year, in various formats, and that we have a magic tool that can convert them to the format of our choice.

However, we first need to do some preparation work : group the files by type instead of keeping the grouping by year. How do we achieve this using ChatGPT ?

First steps

Here is our dump of media :

(screenshot: the exported media, one folder per year named 'Photos from <year>')

Let's get some input from ChatGPT :

windows using git bash and linux commands, I wish to rename the folders in the current dir having name 'Photos from ' simply to year


for dir in Photos\ from\ *; do
    if [[ -d "$dir" ]]; then
        year="${dir##* }"  # Extract the year from the folder name
        mv "$dir" "$year"  # Rename the folder
    fi
done

Result :

(screenshot: the folders now named with bare years)

Great, let's move on!

Metadata : data preparation

The very first thing you should consider is fixing the metadata, which in the narrow sense means that each file should carry a Date taken (for image files) or a Media created (for video files) attribute. Ideally the file name should contain:

  • a formatted timestamp in the form date_time, like 20240101_000000 or even longer
  • an origin discriminator, anything that helps you classify the file; when compressing it could be the original extension
  • a counter with as many digits as necessary to avoid collisions during any compression job

Some modern phones create such a speaking file id, some do not. For old files it may happen that the date is present only in an adjacent json file, whose name usually follows the pattern <filename_with_extension>{<.suffix_of_type_metadata>}.json (a sketch of such a sidecar follows below). Hence, before any processing it is wise to go through all the images and all the videos and update the time attributes, for example with exiftool for images and ffmpeg for videos. It is also a good idea to copy each successfully processed file to a new directory with the formatted date as a filename prefix, and to move the files that could not be processed to another folder to fix up manually.
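
For reference, here is a minimal sketch of such a sidecar (the file name and the field values are illustrative assumptions; the two fields shown are the ones the scripts below actually read):
# Hypothetical sidecar 20240101_000000.jpg.supplemental-metadata.json next to 20240101_000000.jpg:
# {
#   "description": "some user-entered caption",
#   "photoTakenTime": { "timestamp": "1704067200" }
# }
# Peek at the two fields the scripts below rely on:
jq -r '.photoTakenTime.timestamp, .description' "20240101_000000.jpg.supplemental-metadata.json"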

Overall the idea could be the following:

  • list file extensions in all the folders to evaluate the necessary work
ls -R ./ | awk -F. '/\./ {print $NF}' | sort -u
  • prepare the next step, which will be converting all filenames to lower case to simplify further operations; before doing that, we should check for case-insensitive duplicates:
#!/bin/bash

# Find all subdirectories and process files in each subdirectory separately
find . -type d | while read -r dir; do
    # Find files in the current directory
    find "$dir" -maxdepth 1 -type f | \
        # Remove the path, leaving only the filename
        sed 's/.*\///' | \
        # Convert filenames to lowercase for case-insensitive comparison
        tr '[:upper:]' '[:lower:]' | \
        # Sort the filenames
        sort | \
        # Find duplicates in the sorted list
        uniq -d | \
        # Print the duplicates with their directory path
        while read -r filename; do
            echo "Duplicates in '$dir':"
            find "$dir" -maxdepth 1 -type f -iname "$filename"
        done
done

  • convert all the filenames to lower case:
#!/bin/bash

# Find all files recursively
find ./ -type f | while read -r file; do
    dir=$(dirname "$file")
    base=$(basename "$file")

    # Convert the filename to lowercase
    lower_base=$(echo "$base" | tr '[:upper:]' '[:lower:]')

    # If the filename is different (case insensitive), do the two-step rename
    if [[ "$base" != "$lower_base" ]]; then
        # Step 1: Rename to an intermediate name (tmp_<lowercase filename>)
        mv "$file" "$dir/tmp_$lower_base"

        # Step 2: Rename to the final lowercase name (removing tmp_ prefix)
        mv "$dir/tmp_$lower_base" "$dir/$lower_base"
    fi
done
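
A quick way to confirm the rename pass missed nothing (a minimal check; it should print nothing once every filename is lower case):
# List any remaining files whose names still contain upper-case letters
find ./ -type f -name '*[A-Z]*'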

At this point you are ready to start fixing the metadata.

Metadata : updating images using data from jsons

The bare minimum is to ensure that Date taken is set correctly; quite often this date is not stored in the image itself but in a separate json. Here is a script you may use to sort jpg images into two folders: one containing images with accurate metadata and another containing images that need a further look, prefixed with their origin folder.
I will use exiftool for this purpose and add a formatted date prefix to each file that has the relevant metadata stored in its json.

#!/bin/bash

# Hardcoded input directory and output directories
INPUT_DIR="./input"               # Input directory set to ./input
OUTPUT_DIR_JPG="./jpg"           # Directory for modified JPGs
OUTPUT_DIR_JPG_TO_FIX="./jpg_to_fix"  # Directory for JPGs needing fixing

# Create output directories if they don't exist
mkdir -p "$OUTPUT_DIR_JPG"
mkdir -p "$OUTPUT_DIR_JPG_TO_FIX"

# Iterate through all JPG files in the input directory and subdirectories
find "$INPUT_DIR" -type f -iname "*.jpg" | while read -r jpg_file; do
    # Get the base name of the JPG file without the extension
    base_name=$(basename "$jpg_file" .jpg)

    # Look for the corresponding JSON file that starts with base_name and ends with .json
    json_file=$(find "$(dirname "$jpg_file")" -type f -iname "${base_name}.jpg*.json" | head -n 1)

    # Check if the corresponding JSON file exists
    if [[ -f "$json_file" ]]; then
        # Extract the creation time and description from the JSON file
        creation_time=$(jq -r '.photoTakenTime.timestamp' "$json_file")
        description=$(jq -r '.description' "$json_file")

        # Format the creation time for ExifTool (assuming it's in Unix timestamp)
        if [[ "$creation_time" =~ ^-?[0-9]+$ ]]; then
            formatted_date=$(date -d @"$creation_time" +"%Y%m%d_%H%M%S" 2>/dev/null)
            if [[ $? -ne 0 ]]; then
                formatted_date=""
            fi
        else
            formatted_date=""
        fi

        # **Always update** EXIF data based on JSON content (even if they already exist in JPG)
        if [[ -n "$formatted_date" ]]; then
            exiftool -overwrite_original -DateTimeOriginal="$formatted_date" "$jpg_file" >/dev/null 2>&1
        fi

        if [[ -n "$description" ]]; then
            exiftool -overwrite_original -Description="$description" "$jpg_file" >/dev/null 2>&1
        fi

    else
        # If JSON file is not found, notify but continue to check Date taken
        echo "JSON file not found for: $jpg_file"
    fi

    # Now check if Date taken is set in the JPG file
    date_taken=$(exiftool -DateTimeOriginal -s -s -s "$jpg_file")

    # Construct the new filename based on the Date taken
    if [[ -n "$date_taken" ]]; then
        # Format the date_taken string to be filename-safe
        formatted_date=$(echo "$date_taken" | sed -e 's/://g' -e 's/ /_/g') # Remove colons and spaces
        safe_date_taken="${formatted_date:0:8}_${formatted_date:9:6}"  # Separate date and time
        new_filename="${OUTPUT_DIR_JPG}/${safe_date_taken}_$(basename "$jpg_file")" # New filename based on formatted date
        cp "$jpg_file" "$new_filename"
    else
        # Get the last directory name if Date taken is not set
        last_dir=$(basename "$(dirname "$jpg_file")")
        new_filename="${OUTPUT_DIR_JPG_TO_FIX}/${last_dir}_$(basename "$jpg_file")"
        cp "$jpg_file" "$new_filename"
    fi
done

echo "Processing complete."


Feel free to adjust the logic for different file types using ChatGPT or another Generative AI of your choice.
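
Before adjusting anything, it can help to spot-check a few of the processed files; a minimal check, assuming the ./jpg output directory produced by the script above:
# Print the tags the script wrote, a handful of files at a time
exiftool -s -DateTimeOriginal -Description ./jpg | head -n 20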

Metadata : updating videos using data from jsons

The same trick can be used for video files with ffmpeg, but this time it is a bit trickier: ffmpeg does not edit metadata in place, so it has to write a new copy of the file:

#!/bin/bash

# Hardcoded input directory and output directories
INPUT_DIR="./input"               # Input directory set to ./input
OUTPUT_DIR_MP4="./mp4"           # Directory for modified MP4s
OUTPUT_DIR_MP4_TO_FIX="./mp4_to_fix"  # Directory for MP4s needing fixing

# Create output directories if they don't exist
mkdir -p "$OUTPUT_DIR_MP4"
mkdir -p "$OUTPUT_DIR_MP4_TO_FIX"

# Iterate through all MP4 files in the input directory and subdirectories
find "$INPUT_DIR" -type f -iname "*.mp4" | while read -r mp4_file; do
    # Get the base name of the MP4 file without the extension
    base_name=$(basename "$mp4_file" .mp4)

    # Look for the corresponding JSON file that starts with base_name and ends with .json
    json_file=$(find "$(dirname "$mp4_file")" -type f -iname "${base_name}.mp4*.json" | head -n 1)

    # Initialize the new filename variable
    new_filename=""

    # Check if the corresponding JSON file exists
    if [[ -f "$json_file" ]]; then
        echo "JSON file IS found for: $mp4_file"

        # Extract the creation time from the JSON file
        creation_time=$(jq -r '.photoTakenTime.timestamp' "$json_file")

        # Format the creation time for ffmpeg (assuming it's in Unix timestamp)
        if [[ "$creation_time" =~ ^-?[0-9]+$ ]]; then
            # Convert Unix timestamp to "YYYY-MM-DD HH:MM:SS" format
            formatted_date=$(date -d @"$creation_time" +"%Y-%m-%d %H:%M:%S" 2>/dev/null)
            if [[ $? -ne 0 ]]; then
                formatted_date=""
            fi
        else
            formatted_date=""
        fi

        # **Always update** MP4 metadata based on JSON content (creation_time)
        if [[ -n "$formatted_date" ]]; then
            # Build a filename-safe date prefix in the same YYYYMMDD_HHMMSS form used for the images
            filename_date=$(date -d @"$creation_time" +"%Y%m%d_%H%M%S")

            # Construct the new filename based on the formatted date
            new_filename="${OUTPUT_DIR_MP4}/${filename_date}_$(basename "$mp4_file")"  # New filename based on formatted date

            # Debugging: Log the new filename
            echo "New filename (with date): $new_filename"

            # Update the creation time in the MP4 metadata and copy it to the new location in one step
            # (-nostdin keeps ffmpeg from swallowing the file list piped into the while loop)
            ffmpeg -nostdin -i "$mp4_file" -c copy -metadata creation_time="$formatted_date" "$new_filename"
        fi
    fi

    # If we did not successfully update the file, copy it to the 'mp4_to_fix' directory
    if [[ -z "$new_filename" ]]; then
        echo "JSON file not found or timestamp invalid. Moving to mp4_to_fix: $mp4_file"

        # Get the last directory name (in case the file is being moved to mp4_to_fix)
        last_dir=$(basename "$(dirname "$mp4_file")")
        new_filename="${OUTPUT_DIR_MP4_TO_FIX}/${last_dir}_$(basename "$mp4_file")"

        # Debugging: Log the action taken (moving to mp4_to_fix)
        echo "Moving to $OUTPUT_DIR_MP4_TO_FIX: $new_filename"

        # Copy the file to the mp4_to_fix directory
        cp "$mp4_file" "$new_filename"
    fi
done

echo "Processing complete."

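To double-check that the creation time really landed in the container, ffprobe (already needed for the compression step later) can print it back; the file name here is just an example of the prefixed output:
# Show the creation_time tag stored in the MP4 container
ffprobe -v error -show_entries format_tags=creation_time -of default=noprint_wrappers=1 "./mp4/20240101_000000_example.mp4"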

At this point we are done with the metadata and we can move on.

ID : using proper id

It may happen that your files come from different sources and carry completely different id's. Now that each file is prefixed with a date in the format 20000101_000000, you can erase the previous filename and use an identifier of your choice plus an ad-hoc counter. Here is an example that transforms all id's to the format 20000101_000000_jpg_0001.jpg :

#!/bin/bash

# Set the input and output directories
input_dir="./input"
output_dir="./output"

# Make sure the output directory exists, create it if it doesn't
mkdir -p "$output_dir"

# Counter variable, starting at 1
counter=1

# Loop through sorted .jpg files, handling files with spaces correctly
find "$input_dir" -type f -name "*.jpg" | sort | while IFS= read -r file; do
    # Extract the date_time prefix (the first two underscore-separated fields)
    base_name=$(basename "$file")
    prefix=$(echo "$base_name" | cut -d'_' -f1-2)  # Extract "20000101_000000" part

    # Build the new filename with a 4-digit counter and "_jpg_" prefix
    new_filename=$(printf "%s_jpg_%04d.jpg" "$prefix" "$counter")

    # Copy the file to the output directory with the new filename
    cp "$file" "$output_dir/$new_filename"

    # Increment the counter
    ((counter++))
done

echo "Files renamed and copied to $output_dir"


Compress images

Using ImageMagick from part 2, you can convert all the images to the format of your choice:

#!/bin/bash

# Input and Output directories
input_dir="./input"
output_dir="./output"

# Ensure the output directory exists
mkdir -p "$output_dir"

# Loop through all jpg files in the input directory
for img in "$input_dir"/*.jpg; do
    # Get the image dimensions (width x height)
    dimensions=$(identify -format "%wx%h" "$img")
    width=$(echo $dimensions | cut -d'x' -f1)
    height=$(echo $dimensions | cut -d'x' -f2)

    # Check for vertical images (height >= width)
    if [ "$height" -ge "$width" ]; then
        # Vertical and height > 1280, resize to height 1280
        if [ "$height" -gt 1280 ]; then
            output_file="$output_dir/$(basename "$img" .jpg).heic"
            magick "$img" -resize x1280 -quality 80 "$output_file"
            echo "Resized and converted $img to $output_file"
        else
            # Vertical image with height <= 1280, just convert to HEIC
            output_file="$output_dir/$(basename "$img" .jpg).heic"
            magick "$img" "$output_file"
            echo "Converted $img to $output_file"
        fi
    elif [ "$width" -gt "$height" ]; then
        # Landscape and width > 1280, resize to width 1280
        if [ "$width" -gt 1280 ]; then
            output_file="$output_dir/$(basename "$img" .jpg).heic"
            magick "$img" -resize 1280x -quality 80 "$output_file"
            echo "Resized and converted $img to $output_file"
        else
            # Landscape image with width <= 1280, just convert to HEIC
            output_file="$output_dir/$(basename "$img" .jpg).heic"
            magick "$img" "$output_file"
            echo "Converted $img to $output_file"
        fi
    else
        echo "Skipping $img: does not fit any criteria"
    fi
done


Compress videos

Similar logic can be applied to the videos using ffmpeg from part 3:

#!/bin/bash

# Hardcoded input directory and output directory
INPUT_DIR="./input"               # Input directory set to ./input
OUTPUT_DIR="./output"             # Output directory for converted MP4s

# Create the output directory if it doesn't exist
mkdir -p "$OUTPUT_DIR"

# Iterate through all MP4 files in the input directory and subdirectories
find "$INPUT_DIR" -type f -iname "*.mp4" | while read -r mp4_file; do
    # Get the base name of the MP4 file without the extension
    base_name=$(basename "$mp4_file" .mp4)

    # Get video dimensions (width and height) using ffprobe
    dimensions=$(ffprobe -v error -select_streams v:0 -show_entries stream=width,height -of csv=s=x:p=0 "$mp4_file")
    width=$(echo "$dimensions" | cut -d 'x' -f 1)
    height=$(echo "$dimensions" | cut -d 'x' -f 2)

    # Debugging: Log the dimensions
    echo "Dimensions of $mp4_file: Width=$width, Height=$height"

    # Initialize the new filename variable
    new_filename="${OUTPUT_DIR}/${base_name}_converted.mp4"

    # Check if rescaling is needed and apply the appropriate scale
    if [[ "$height" -ge "$width" && "$height" -gt 1280 ]]; then
        # If height >= width and height > 1280, rescale to -2:1280
        scale="-2:1280"
        echo "Rescaling $mp4_file to $scale"
    elif [[ "$width" -gt "$height" && "$width" -gt 1280 ]]; then
        # If width > height and width > 1280, rescale to 1280:-2
        scale="1280:-2"
        echo "Rescaling $mp4_file to $scale"
    else
        # No scaling needed
        scale=""
        echo "No rescaling needed for $mp4_file"
    fi

    # Run ffmpeg with or without scaling, based on the conditions
    # (-nostdin keeps ffmpeg from swallowing the file list piped into the while loop;
    #  the creation_time tag is carried over from the input's global metadata by default)
    ffmpeg -nostdin -i "$mp4_file" \
           -r 30 -c:v libx265 -crf 28 -preset medium \
           -c:a aac -b:a 192k \
           ${scale:+-vf "scale=$scale"} \
           "$new_filename"

done

echo "Processing complete."


Summary

Overall, if some formats have limited support for metadata, png for example, it is a good idea to encode the date into the filename so that it remains available for an eventual conversion tool.
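
As a sketch of that idea (assuming the png files also come with Google-Takeout-style sidecar jsons carrying photoTakenTime, as in the scripts above, and using hypothetical ./input and ./png directories):
#!/bin/bash

# Prefix each PNG with the date from its sidecar json, since the format itself holds little metadata
mkdir -p ./png
find ./input -type f -iname "*.png" | while read -r png_file; do
    # Look for the adjacent json whose name starts with the full file name
    json_file=$(find "$(dirname "$png_file")" -type f -iname "$(basename "$png_file")*.json" | head -n 1)
    [[ -f "$json_file" ]] || continue

    # Pull the timestamp and turn it into the usual YYYYMMDD_HHMMSS prefix
    ts=$(jq -r '.photoTakenTime.timestamp' "$json_file")
    [[ "$ts" =~ ^[0-9]+$ ]] || continue
    prefix=$(date -d @"$ts" +"%Y%m%d_%H%M%S")

    cp "$png_file" "./png/${prefix}_$(basename "$png_file")"
done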
