Mohamed Elgazzar
Build an AI Image Captioner with React Native & Hugging Face + Unit Testing

Hey there! 👋 After a bit of a hiatus, I'm back and ready to dive into some coding fun. In this article, we'll be building something exciting: an AI Image Captioning App using React Native and the Hugging Face Inference API.

Getting Started

Without further delay, let the coding begin! We will be using Expo to bootstrap our React Native project. Expo is a set of tools and services that makes building React Native applications easier and faster.

Before we kick things off, make sure you have Node.js installed on your machine.

To initialize a new project, run the following command:

npx create-expo-app ai-image-captioner && cd ai-image-captioner

After navigating to the project directory, install the following dependencies:

npx expo install expo-camera expo-image-picker expo-font expo-splash-screen

npm install axios

  1. axios: axios is a promise-based HTTP client for the browser and Node.js. It is commonly used for making HTTP requests and handling responses.
  2. expo-camera: expo-camera is a part of the Expo framework, providing a set of components and APIs for integrating camera functionality into React Native applications. It simplifies the process of capturing photos and videos.
  3. expo-image-picker: expo-image-picker is another Expo package that facilitates accessing the device's image and video picker. It allows users to choose images or videos from their device's gallery for use within the application.
  4. expo-font: expo-font is an Expo module that simplifies the process of loading custom fonts in React Native applications. It provides tools to easily incorporate and use custom fonts for styling text elements.
  5. expo-splash-screen: expo-splash-screen is an Expo-specific module designed to manage the splash screen (initial screen displayed while the app is loading) in React Native applications. It offers an easy way to customize and control the splash screen experience.

Hugging Face:

Hugging Face is a machine learning and data science platform and community that helps users build, deploy and train machine learning models.

Hugging Face offers a diverse array of tools, models, and datasets that empower developers and researchers in the field of machine learning. Their platform is home to an extensive library of pre-trained models, facilitating easy integration and experimentation for professionals working in artificial intelligence.

Let's head to the Hugging Face official website and create an account.

Once you have created your account, navigate to the Access Tokens page in your settings and generate one.

Copy it and set it aside. We will use it in a few moments.

Pre-trained models:

A pre-trained model is a machine learning (ML) model that has been trained on a large dataset and can be fine-tuned for a specific task.

The Model Hub allows users to discover, share, and use pre-trained models for various tasks.

There are various ways to integrate pre-trained models. For a quick and straightforward implementation, we are going to use the Inference API. If your project demands a higher level of customization and control, however, you can use the Transformers library instead.

Inference API

When opting for the Inference API, you have two pathways. You can either incorporate the @huggingface/inference package into your project for a streamlined experience, or take advantage of direct API access by defining an endpoint variable such as: ENDPOINT = https://api-inference.huggingface.co/models/<MODEL_ID>
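
If you go the package route, here is a minimal sketch, assuming you have run npm install @huggingface/inference and already have the image available as a Blob (the token placeholder is the same one we use later):

import { HfInference } from "@huggingface/inference";

const hf = new HfInference("HUGGING_FACE_API_KEY"); // your access token

// imageToText calls a hosted image-captioning model and
// resolves to an object like { generated_text: "..." }
const caption = await hf.imageToText({
  model: "nlpconnect/vit-gpt2-image-captioning",
  data: imageBlob, // a Blob (or ArrayBuffer) of the image
});
console.log(caption.generated_text);

In this article, though, we will hit the endpoint directly with axios.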

Our app consists of a few components, but the heart of the operation lies in App.js. Copy the code below into the file:

import React, { useState } from "react";
import axios from "axios";
import ImageForm from "./components/ImageForm";

const App = () => {
  const [caption, setCaption] = useState("");

  const handleImageUrl = async (imageUrl) => {
    try {
      // Fetch the image and read it as a raw binary Blob
      let image = await (await fetch(imageUrl)).blob();
      const HUGGING_FACE_API_KEY = "HUGGING_FACE_API_KEY"; // replace with your token
      const response = await axios.post(
        "https://api-inference.huggingface.co/models/nlpconnect/vit-gpt2-image-captioning",
        image,
        {
          headers: {
            "Content-Type": "application/octet-stream",
            Authorization: `Bearer ${HUGGING_FACE_API_KEY}`,
          },
          // Pass the Blob through untouched instead of JSON-serializing it
          transformRequest: [(data) => data],
        }
      );
      setCaption(response.data[0].generated_text);
    } catch (error) {
      console.error(error);
    }
  };

  return <ImageForm onSubmit={handleImageUrl} caption={caption} />;
};

export default App;


Now let's break down some key parts:

Here, we import React and useState for our component's state management. Additionally, we bring in axios for making HTTP requests and ImageForm, a component we'll create to handle image input. Then we declare the App component and, using the useState hook, set up a caption state variable with its corresponding setter setCaption.

handleImageUrl is where the real action happens. It takes an imageUrl as an argument, fetches the image as a Blob, and makes a POST request to the Inference API using the nlpconnect/vit-gpt2-image-captioning model. The API responds with an array like [{ "generated_text": "..." }], and the first generated caption is stored in the caption state.

Note: transformRequest: [(data) => data] in the axios request disables axios's automatic data serialization, so the image Blob is sent in its raw binary form (which is also why the Content-Type is application/octet-stream rather than JSON).

Finally, in the return statement, we render the ImageForm component, passing down the handleImageUrl function and the caption state as props.

Create a new ImageForm.js file inside a components folder with the following content:

import React, { useState, useEffect, useCallback } from "react";
import {
  View,
  TextInput,
  TouchableOpacity,
  Image,
  Text,
  ActivityIndicator,
  StyleSheet,
  Alert,
} from "react-native";
import { Camera, CameraType } from "expo-camera";
import { useFonts } from "expo-font";
import * as ImagePicker from "expo-image-picker";
import * as SplashScreen from "expo-splash-screen";
import CameraScreen from "./CameraScreen";
import { Preview } from "../assets";

SplashScreen.preventAutoHideAsync();

const ImageForm = ({ onSubmit, caption }) => {
  const [imageUrl, setImageUrl] = useState(null);
  const [selectedImage, setSelectedImage] = useState(null);
  const [loading, setLoading] = useState(false);

  const [toggleCamera, setToggleCamera] = useState(false);
  const [capturedImage, setCapturedImage] = useState(null);
  const [permission, requestPermission] = Camera.useCameraPermissions();
  const [type, setType] = useState(CameraType.back);

  const [fontsLoaded, fontError] = useFonts({
    "BebasNeue-Regular": require("../assets/fonts/BebasNeue-Regular.ttf"),
  });

  useEffect(() => {
    // Request permission to access the photo library
    (async () => {
      const { status } =
        await ImagePicker.requestMediaLibraryPermissionsAsync();
      if (status !== "granted") {
        Alert.alert(
          "Permission denied",
          "You need to grant permission to access the photo library."
        );
      }
    })();
  }, []);

  const pickImage = async () => {
    try {
      const result = await ImagePicker.launchImageLibraryAsync({
        mediaTypes: ImagePicker.MediaTypeOptions.Images,
        allowsEditing: true,
        aspect: [4, 3],
        quality: 1,
      });
      if (!result.canceled) {
        setSelectedImage(result.assets[0].uri);
        setImageUrl(result.assets[0].uri);
      }
    } catch (error) {
      console.error(error);
    }
  };

  const handlePress = async () => {
    setLoading(true);
    await onSubmit(imageUrl);
    setLoading(false);
  };

  const submitImage = () => {
    setToggleCamera(!toggleCamera);
  };
  const cancelCamera = () => {
    setToggleCamera(!toggleCamera);
    setImageUrl("");
    setCapturedImage(null);
    setSelectedImage(null);
  };

  const onLayoutRootView = useCallback(async () => {
    if (fontsLoaded || fontError) {
      await SplashScreen.hideAsync();
    }
  }, [fontsLoaded, fontError]);

  if (!fontsLoaded && !fontError) {
    return null;
  }

  return (
    <View style={styles.wrapper} onLayout={onLayoutRootView}>
      {toggleCamera ? (
        <CameraScreen
          cancelCamera={cancelCamera}
          setImageUrl={setImageUrl}
          submitImage={submitImage}
          capturedImage={capturedImage}
          setCapturedImage={setCapturedImage}
          requestPermission={requestPermission}
          permission={permission}
          type={type}
          setSelectedImage={setSelectedImage}
        />
      ) : (
        <View style={styles.container}>
          <View>
            <Text style={[styles.textStyle, styles.header]}>
              Image Captioner! 🤗
            </Text>
          </View>

          <View style={styles.imagePreviewContainer}>
            <Image
              source={ selectedImage ? { uri: selectedImage } : Preview }
              style={[styles.imagePreview, !selectedImage && { width: 100 }]}
              resizeMode="contain"
            />
          </View>

          <View>
            <Text style={[styles.textStyle, { fontWeight: "bold" }]}>
              {caption !== "" && "🪄 " + caption + " 🪄"}
            </Text>
          </View>

          <View style={styles.inputContainer}>
            <TextInput
              style={styles.inputBox}
              autoCapitalize="none"
              placeholder="Enter Image Link"
              value={imageUrl}
              onChangeText={(url) => setImageUrl(url)}
            />
          </View>

          <View>
            <Text
              style={{
                fontSize: 20,
                fontWeight: "bold",
                textAlign: "center",
                color: "#8C94A5",
                marginVertical: 10,
              }}
            >
              OR
            </Text>
          </View>

          <View style={styles.buttonArea}>
            <TouchableOpacity
              style={[
                styles.button,
                { backgroundColor: "#0166FF" },
              ]}
              onPress={submitImage}
            >
              <Text style={styles.ButtonText}>Take Photo</Text>
            </TouchableOpacity>
          </View>

          <View style={styles.buttonArea}>
            <TouchableOpacity
              style={[styles.button, { backgroundColor: "#0166FF" }]}
              onPress={pickImage}
            >
              <Text style={styles.ButtonText}>Browse Images</Text>
            </TouchableOpacity>
          </View>

          <View style={styles.lineStyle} />

          <View style={[styles.buttonArea, styles.submitBtnArea]}>
            <TouchableOpacity
              style={[styles.button, { backgroundColor: "#212429" }]}
              onPress={handlePress}
            >
              <Text style={styles.ButtonText}>Process</Text>
            </TouchableOpacity>
          </View>

          {loading && (
            <ActivityIndicator
              size="large"
              color="#0000FF"
              style={styles.loading}
            />
          )}
        </View>
      )}
    </View>
  );
};

const styles = StyleSheet.create({
  wrapper: {
    height: "100%",
    paddingHorizontal: 30,
    paddingVertical: 80,
  },
  container: {
    height: "100%",
  },
  header: {
    fontWeight: "700",
    fontSize: 35,
    fontFamily: "BebasNeue-Regular",
    color: "#29323B",
  },
  textStyle: {
    fontSize: 14,
    marginTop: 8,
    marginBottom: 5,
    textAlign: "center",
    color: "#29323B",
  },
  inputContainer: {
    marginTop: 20,
  },
  inputBox: {
    borderColor: "#E1E4EB",
    height: 55,
    width: "100%",
    borderRadius: 10,
    borderWidth: 2,
    padding: 10,
    textAlign: "left",
  },
  button: {
    backgroundColor: "blue",
    height: 40,
    width: "100%",
    borderRadius: 5,
    padding: 10,
  },
  buttonArea: {
    display: "flex",
    alignItems: "center",
    justifyContent: "center",
    marginBottom: 5,
  },
  submitBtnArea: {
    marginVertical: 10,
  },
  ButtonText: {
    color: "white",
    textAlign: "center",
  },
  loading: {
    marginTop: 8,
  },
  imagePreviewContainer: {
    alignItems: "center",
    marginBottom: 16,
    marginTop: 16,
    width: "100%",
    height: 200,
    borderWidth: 2,
    borderStyle: "dashed",
    borderColor: "#7BA7FF",
    borderRadius: 5,
  },
  imagePreview: {
    borderRadius: 5,
    width: "100%",
    height: "100%",
  },
  lineStyle: {
    borderWidth: 0.5,
    borderColor: "#D3D8E3",
    marginBottom: 15,
    marginTop: 10,
  },
});

export default ImageForm;



We first import necessary modules and components for the ImageForm component.

Then we call SplashScreen.preventAutoHideAsync() so the splash screen stays visible until our fonts have loaded.

After that, we initialize state variables using useState for managing the component's state.

The toggleCamera state manages whether the camera is active.
permission and requestPermission manage camera permissions.
type tracks the camera type (front or back).
useFonts loads our custom fonts and reports any loading error.
The useEffect hook requests permission to access the photo library when the component mounts.

We use the useFonts hook from Expo to load the "BebasNeue-Regular" font. The onLayoutRootView callback, memoized with useCallback, hides the splash screen once the fonts have loaded (or have failed to load). If the fonts are not ready yet, the component returns null to prevent rendering.

The pickImage function uses ImagePicker from Expo to launch the device's image library. It configures the picker with options like allowed media types, editing capability, aspect ratio, and quality. If the user selects an image (!result.canceled), it updates the selectedImage and imageUrl state variables with the URI of the selected image.

The handlePress function is triggered when the user presses the Process button. It sets the loading state to true to show an activity indicator, then calls the onSubmit function (provided as a prop) with the imageUrl, which triggers the image processing logic.

The submitImage function toggles the camera screen on and off.

The cancelCamera function runs when the user cancels or exits the camera screen; it hides the camera and clears the image state.

The UI is conditionally rendered based on the toggleCamera state. If toggleCamera is true, it shows the CameraScreen, passing various props for handling the camera functionality; otherwise, it displays the main ImageForm UI.

Finally, create a new CameraScreen.js file in the components folder and copy the following code into it:

import React, { useRef } from "react";
import {
  View,
  Text,
  TouchableOpacity,
  Image,
  StyleSheet,
  Button,
} from "react-native";
import { Camera } from "expo-camera";
import { Capture, Submit, Back, Reset } from "../assets";

const CameraScreen = ({
  cancelCamera,
  submitImage,
  setImageUrl,
  capturedImage,
  setCapturedImage,
  setSelectedImage,
  permission,
  requestPermission,
  type,
}) => {
  // Hold a reference to the mounted Camera instance
  const cameraRef = useRef(null);

  const takePicture = async () => {
    if (cameraRef.current) {
      const photo = await cameraRef.current.takePictureAsync();
      setCapturedImage(photo);
      setImageUrl(photo.uri);
      setSelectedImage(photo.uri);
    }
  };

  const resetImage = () => {
    setCapturedImage(null);
    setImageUrl(null);
    setSelectedImage(null);
  };

  if (!permission) {
    // Camera permissions are still loading
    return <View />;
  }

  if (!permission.granted) {
    // Camera permissions are not granted yet
    return (
      <View style={styles.container}>
        <Text style={{ textAlign: "center" }}>
          We need your permission to show the camera
        </Text>
        <Button onPress={requestPermission} title="grant permission" />
      </View>
    );
  }

  return (
    <View style={styles.container} testID="camera-screen">
      <Camera
        style={styles.camera}
        type={type}
        ref={cameraRef}
      >
        <View style={styles.buttonContainer}>
          <TouchableOpacity
            style={styles.button}
            onPress={takePicture}
            testID="capture-button"
          >
            <Image
              source={Capture}
              style={styles.imageIcon}
              resizeMode="contain"
            />
          </TouchableOpacity>
          <TouchableOpacity style={styles.button} onPress={cancelCamera}>
            <Image
              source={Back}
              style={styles.imageIcon}
              resizeMode="contain"
            />
          </TouchableOpacity>
          {capturedImage && (
            <TouchableOpacity style={styles.button} onPress={resetImage}>
              <Image
                source={Reset}
                style={styles.imageIcon}
                resizeMode="contain"
              />
            </TouchableOpacity>
          )}
          {capturedImage && (
            <TouchableOpacity style={styles.button} onPress={submitImage}>
              <Image
                source={Submit}
                style={styles.imageIcon}
                resizeMode="contain"
              />
            </TouchableOpacity>
          )}
        </View>
      </Camera>

      {capturedImage && (
        <View style={styles.imagePreviewContainer}>
          <Text style={{ textAlign: "center", marginBottom: 15 }}>
            Captured Image Preview
          </Text>
          <Image
            source={{ uri: capturedImage.uri }}
            style={styles.imagePreview}
          />
        </View>
      )}
    </View>
  );
};

const styles = StyleSheet.create({
  container: {
    flex: 1,
  },
  camera: {
    flex: 1,
  },
  buttonContainer: {
    flex: 1,
    flexDirection: "row",
    backgroundColor: "transparent",
    margin: 64,
  },
  button: {
    flex: 1,
    alignSelf: "flex-end",
    alignItems: "center",
  },
  cameraContainer: {
    flex: 1,
    flexDirection: "column",
    justifyContent: "space-between",
    margin: 20,
    width: "100%",
  },
  buttonText: {
    fontSize: 18,
    color: "white",
  },
  imagePreviewContainer: {
    flex: 1,
    justifyContent: "center",
    alignItems: "center",
  },
  imagePreview: {
    width: "80%",
    height: "80%",
    resizeMode: "contain",
  },
  imageIcon: {
    width: 30,
    height: 30,
  },
});

export default CameraScreen;


The takePicture function captures a photo through cameraRef and updates the capturedImage, imageUrl, and selectedImage state variables.
The resetImage function clears the captured image and its associated state.

Before rendering the camera, we check whether camera permissions are still loading or have not been granted, and render a prompt asking the user to grant permission if needed.

Congratulations on making it this far! Now, let's perform some unit testing to ensure our code behaves as expected.

Unit Testing:

Before we dive into the testing arena, let's make sure our environment is prepared. Ensure you have the necessary packages in your toolkit:

npx expo install jest-expo jest

npm install -D @testing-library/react-native

  1. jest: jest is a JavaScript testing framework widely used for testing JavaScript code, including React and React Native applications. It provides a test runner, assertion library, and mocking capabilities.

  2. jest-expo: jest-expo is a Jest preset specifically designed for Expo projects. It configures Jest with settings optimized for Expo and React Native development, making it easier to write and run tests in Expo projects.

  3. @testing-library/react-native: @testing-library/react-native is part of the Testing Library family and provides utilities for testing React Native components. It encourages testing components in a way that simulates user interactions and ensures the application behaves as expected.

Include the Jest configuration in your package.json file like this:

...

"scripts": {
    ...
    "test": "jest"
},
"jest": {
    "preset": "jest-expo",
    "transformIgnorePatterns": [
      "node_modules/(?!((jest-)?react-native|@react-native(-community)?)|expo(nent)?|@expo(nent)?/.*|@expo-google-fonts/.*|react-navigation|@react-navigation/.*|@unimodules/.*|unimodules|sentry-expo|native-base|react-native-svg)"
    ]
  },

...

This configuration ensures that Jest ignores certain modules during the transformation process, preventing potential issues with Expo and related dependencies.
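
With this in place, you can run the whole suite at any time with the script we just added:

npm test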

Writing Our Script

Create an App.test.js file with the following content:

import React from "react";
import axios from "axios";
import { render, waitFor, fireEvent, act } from "@testing-library/react-native";
import { Camera } from "expo-camera";
import ImageForm from "./components/ImageForm";
import App from "./App";

jest.mock("axios");

jest.mock("expo-image-picker", () => ({
  ...jest.requireActual("expo-image-picker"),
  requestMediaLibraryPermissionsAsync: jest.fn(),
}));

describe("App", () => {
  describe("ImageForm Component", () => {
    it("renders correctly", async () => {
      require("expo-image-picker").requestMediaLibraryPermissionsAsync.mockResolvedValue(
        {
          status: "granted",
        }
      );

      const { getByText, getByPlaceholderText } = render(<ImageForm />);

      await waitFor(() => {
        expect(getByText("Image Captioner! 🤗")).toBeTruthy();
      });
      await waitFor(() => {
        expect(getByPlaceholderText("Enter Image Link")).toBeTruthy();
      });
    });

    it("handles submit button press correctly", async () => {
      const mockOnSubmit = jest.fn();
      const { getByPlaceholderText, getByText, getByTestId } = render(
        <ImageForm onSubmit={mockOnSubmit} />
      );

      const input = getByPlaceholderText("Enter Image Link");
      const submitButton = getByText("Process");

      fireEvent.changeText(input, "https://example.com/image.jpg");

      fireEvent.press(submitButton);

      await waitFor(() => {
        expect(mockOnSubmit).toHaveBeenCalledWith(
          "https://example.com/image.jpg"
        );
      });
    });

    it("handles image URL submission and displays the generated caption", async () => {
      const mockCaption = "Mock Caption";
      axios.post.mockResolvedValue({ data: [{ generated_text: mockCaption }] });
      const { getByPlaceholderText, getByText } = render(<App />);
      const input = getByPlaceholderText("Enter Image Link");
      await act(() => {
        fireEvent.changeText(input, "https://example.com/image.jpg");
      });
      await act(() => {
        const submitButton = getByText("Process");
        fireEvent.press(submitButton);
      });
      await waitFor(() =>
        expect(getByText(`🪄 ${mockCaption} 🪄`)).toBeTruthy()
      );
    });
  });

  describe("CameraScreen Component", () => {
    it("shows camera if permissions are granted", async () => {
      jest
        .spyOn(Camera, "useCameraPermissions")
        .mockReturnValue([{ granted: true }, () => Promise.resolve({})]);

      const { getByTestId, getByText } = render(<App />);
      await act(() => {
        fireEvent.press(getByText("Take Photo"));
      });

      const camera = getByTestId("camera-screen");
      expect(camera).toBeTruthy();
    });

    it("takes a picture and displays preview", async () => {
      jest
        .spyOn(Camera, "useCameraPermissions")
        .mockReturnValue([{ granted: true }, () => Promise.resolve({})]);

      jest
        .spyOn(Camera.prototype, "takePictureAsync")
        .mockImplementation(() => {
          return Promise.resolve({
            uri: "file://some-file.jpg",
          });
        });

      const { getByTestId, getByText } = render(<App />);

      fireEvent.press(getByText("Take Photo"));

      const captureButton = getByTestId("capture-button");

      fireEvent.press(captureButton);
      await waitFor(() => {
        expect(getByText("Captured Image Preview")).toBeTruthy();
      });
    });
  });
});


We mock axios and Expo's image picker up front so the tests run without making real network requests or triggering device permission prompts.

"renders correctly":
The test sets up a mock for requestMediaLibraryPermissionsAsync to simulate the permission being granted. Then it renders the ImageForm component and asserts that the component renders the expected text "Image Captioner! 🤗" and includes a placeholder for entering an image link.

"handles submit button press correctly":
It creates a mock function (mockOnSubmit) to stand in for the onSubmit prop, simulates changing the text in the input field to a sample image link, and presses the Process button. It then asserts that onSubmit was called with the expected image URL.

"handles image URL submission and displays the generated caption":
It mocks the axios POST request to simulate generating a caption for an image URL, simulates changing the text in the input field to a sample image link, and presses the Process button. It asserts that the generated caption is displayed in the expected format.

"shows camera if permissions are granted":
It spies on the useCameraPermissions hook to simulate granted camera permissions, renders the App component, simulates pressing the Take Photo button, and finally asserts that the camera screen is displayed.

"takes a picture and displays preview":
Again, it spies on useCameraPermissions to simulate granted camera permissions and mocks takePictureAsync to fake taking a picture. It simulates pressing the Take Photo button and then the capture button, and asserts that the captured image preview is displayed.

These were just a few testing scenarios to get you started, but the possibilities are endless. Feel free to explore and add more scenarios that cover different aspects of your application. Think about a specific scenario or functionality you want to test. It could be related to user interactions, edge cases, or error handling.
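
For example, here is a sketch of one more case you could add to the App describe block, assuming the App.js shown earlier: it mocks a failed captioning request and checks that the error is logged and no caption is rendered.

it("logs an error when the caption request fails", async () => {
  // App.js logs failures via console.error in its catch block
  const consoleSpy = jest
    .spyOn(console, "error")
    .mockImplementation(() => {});
  axios.post.mockRejectedValue(new Error("Network error"));

  const { getByPlaceholderText, getByText, queryByText } = render(<App />);
  fireEvent.changeText(
    getByPlaceholderText("Enter Image Link"),
    "https://example.com/image.jpg"
  );
  await act(async () => {
    fireEvent.press(getByText("Process"));
  });

  expect(consoleSpy).toHaveBeenCalled();
  // No caption should appear on failure
  expect(queryByText(/🪄/)).toBeNull();
  consoleSpy.mockRestore();
});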

The full code is available for reference on GitHub here.

Conclusion:

And there you have it: your very own Image Captioner is ready! But wait, what if we sprinkle a bit more magic? How about adding a language translation feature to those captions? There's always room for more creativity: you could explore more advanced models, tweak parameters, or even deploy your app to the cloud for broader accessibility.
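
As a rough sketch of that translation idea (assuming the same axios setup and token placeholder from App.js, and the public Helsinki-NLP/opus-mt-en-fr English-to-French model), it could look like this:

// Translate a generated caption to French via the hosted
// Helsinki-NLP/opus-mt-en-fr model (same Inference API and auth as before)
const translateCaption = async (text) => {
  const response = await axios.post(
    "https://api-inference.huggingface.co/models/Helsinki-NLP/opus-mt-en-fr",
    { inputs: text },
    {
      headers: {
        Authorization: `Bearer ${HUGGING_FACE_API_KEY}`,
      },
    }
  );
  // Translation models respond with [{ translation_text: "..." }]
  return response.data[0].translation_text;
};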

Catch you in the next one. Take care! 🚀✨
