Introduction to Multimodal Learning Models

Learn how multimodal learning works in this article by Amir Ziai, who is proficient in building machine learning platforms and applications, and Quan Hua, a computer vision and machine learning engineer at BodiData, a data platform for body measurements.


Multimodal learning is not rocket science. Let me start with an interesting example. Suppose you are attending a photo exhibition in town and there are several photographs of architecture, wildlife, and so on from all over the world. Your friend asks you to give a brief description of one of the wildlife photographs, and you don't have a clue what the animal in the picture is, so you can't even begin to describe the photo. How embarrassing! If I gave you an algorithm that could read your input image and automatically generate an image description for you, would that help? Is it even possible for a computer algorithm to tell you what is happening in a picture? The answer is yes: it is possible. We will learn how multimodal learning works in this article.

First, we will write some toy code to see how it is possible to use information from multiple sources to develop a multimodal learning model. Let's open our Python environment and create a Python file named multimodal_toy.py.

We will need the following:

  • At least two information sources
  • An information processing model for each source
  • A learning model to learn combined information

An information processing model, listed in the second point, is actually a feature extractor network. It is responsible for extracting meaningful features from the input data. For example, if the input is an image, the feature extractor will extract features describing the patterns or objects in the picture.

The following diagram shows a typical architecture for multimodal information analysis:

Multimodal Diagram

As you can see in the preceding architecture, we use two different feature extractors, as there is no correlation between the information from the two sources. In most cases, the two sources will come from different domains, in this case visual and textual.

Information Sources

Now, let's try putting these blocks into our Python file. For this, we will generate a toy dataset that will have two information sources in the form of two Gaussian distributions, each with a different mean and variance. Let's start with the information generation. We will consider one distribution as the visual source and the other as the textual source. The code listing is as follows:

# The toy example relies on NumPy, Matplotlib, and TensorFlow 1.x

import numpy as np
import matplotlib.pyplot as plt
import tensorflow as tf

# Number of sample points

n = 400

# Probability of a sample belonging to class 1

p_c = 0.5

# Let's generate a binomial distribution for the class variable
# It will assign roughly 50% of the samples to class 1 and the rest to class 0

c = np.random.binomial(n=1, p=p_c, size=n)

# Our two information sources will be 2 bivariate Gaussian distributions
# So we need to define 2 mean values for each distribution
# Mean values for Visual dataset

mu_v_0 = 1.0
mu_v_1 = 8.0

# Mean values for the textual dataset

mu_t_0 = 13.0
mu_t_1 = 19.0

# Now we will generate the two distributions here

x_v = np.random.randn(n) + np.where(c == 0, mu_v_0, mu_v_1)
x_t = np.random.randn(n) + np.where(c == 0, mu_t_0, mu_t_1)

# Let's center the data by subtracting the mean value

x_v = x_v - x_v.mean()
x_t = x_t - x_t.mean()

# Visualize the two classes with the combined information

plt.scatter(x_v, x_t, c=np.where(c == 0, 'blue', 'red'))
plt.xlabel('visual modality')
plt.ylabel('textual modality')
plt.show()

When we execute the preceding script, the output will look as follows:

Multimodal Output

As you can see in the preceding diagram, we have two class responses, shown in two colors. Both classes are described by a combination of the two modalities: the x-axis shows the visual modality, while the y-axis shows the textual modality. When we pick any point from the diagram, its class is estimated from the combined information of both modalities.

We can now draw samples from the preceding distributions. For the inference task, we will create a grid of input points spanning both distributions. This is done as follows:

# Define the number of points in the sample set

resolution = 1000

# Create a linear sample set from the visual information distribution

vs = np.linspace(x_v.min(), x_v.max(), resolution)

# Create linear sample set from the textual information distribution

ts = np.linspace(x_t.min(), x_t.max(), resolution)

# In the following lines we will build a 2D grid from these sample points
# so that every visual sample is paired with every textual sample

(vs, ts) = np.meshgrid(vs, ts)

# Here we will flatten our arrays

vs = np.ravel(vs)
ts = np.ravel(ts)

The preceding script gives us two flattened one-dimensional arrays covering every pairing of the samples generated from the two sources (Gaussian distributions). Once we have our samples, we are ready to create feature extractors for both information sources.
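
As a quick, optional sanity check, you can confirm that the meshgrid-and-ravel steps produced the expected number of paired points; this snippet is only for verification and is not needed for the rest of the example:

# Optional check: after meshgrid and ravel we expect
# resolution * resolution points from each source
print(vs.shape, ts.shape)  # both should be (1000000,) for resolution=1000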

Feature Extractor

We will now move on to designing the feature extractors that will be used to process this information. We will create two neural networks for this task: one will handle the visual information, and the other will process the textual information. The extracted features will then be combined and passed to a final classifier, which learns to classify the combined information from the training data. The following code snippet creates the variables that will be used later in the architecture:

# Let's start with creating variables
# It will store the visual information

visual = tf.placeholder(tf.float32, shape=[None])

# This will store textual information

textual = tf.placeholder(tf.float32, shape=[None])

# And the final one will be responsible for holding class variable

target = tf.placeholder(tf.int32, shape=[None])

# As we are working with a binary problem

NUM_CLASSES = 2

# We will use a fixed number of neurons for every layer

HIDDEN_LAYER_DIM = 1

As you can see, we first created placeholders to hold the data from each source, as well as one to store the class responses. Since this is a toy dataset, we will keep our network architecture very simple by using a single neuron in each layer. In the following snippet, we will implement our feature extractor for visual information processing:

# This is our Visual feature extractor,
# It will be responsible for the extraction of useful features,
# from visual samples, we will use tanh as the activation function.

h_v = tf.layers.dense(tf.reshape(visual, [-1, 1]), HIDDEN_LAYER_DIM,
                      activation=tf.nn.tanh)

You can think of this network as an image processing network that's used for image classification or object recognition problems. To keep the network simple, we have not used any convolutional layers; we will use a dense layer of neurons with the tanh activation function instead.
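
For reference, a more realistic visual feature extractor would use convolutional layers. The following is only a minimal sketch of what that might look like with the same TensorFlow 1.x layers API; the image placeholder and its 28x28 grayscale shape are hypothetical and are not used anywhere else in this toy example:

# Hypothetical image placeholder (28x28 grayscale); illustration only
image = tf.placeholder(tf.float32, shape=[None, 28, 28, 1])

# A small convolutional stack: conv -> pool -> conv -> pool
conv1 = tf.layers.conv2d(image, filters=16, kernel_size=3,
                         padding='same', activation=tf.nn.relu)
pool1 = tf.layers.max_pooling2d(conv1, pool_size=2, strides=2)
conv2 = tf.layers.conv2d(pool1, filters=32, kernel_size=3,
                         padding='same', activation=tf.nn.relu)
pool2 = tf.layers.max_pooling2d(conv2, pool_size=2, strides=2)

# Flatten and project to a fixed-size visual feature vector
h_image = tf.layers.dense(tf.layers.flatten(pool2), 64,
                          activation=tf.nn.tanh)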

Next, we will implement a network to process our textual information:

# This is our Textual feature extractor,
# It will be responsible for the extraction of useful features,
# from textual samples, we will use tanh as the activation function.

h_t = tf.layers.dense(tf.reshape(textual, [-1, 1]), HIDDEN_LAYER_DIM,
                      activation=tf.nn.tanh)

You can think of this network as a recurrent neural network or an LSTM network, which is one of the most widely used architectures for text processing. Here, we will use an architecture similar to the one in the visual feature extractor.
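
Again for reference only, a more realistic textual feature extractor would run a recurrent network over a sequence of word embeddings. Here is a minimal sketch with the TensorFlow 1.x API; the placeholder shape (20 time steps, 50-dimensional embeddings) is hypothetical and is not used in the rest of this example:

# Hypothetical placeholder for embedded word sequences:
# [batch, time steps, embedding size]; illustration only
word_embeddings = tf.placeholder(tf.float32, shape=[None, 20, 50])

# A single-layer LSTM; its final hidden state serves as the text feature
lstm_cell = tf.nn.rnn_cell.LSTMCell(num_units=64)
outputs, state = tf.nn.dynamic_rnn(lstm_cell, word_embeddings,
                                   dtype=tf.float32)

# state.h is the last hidden state; project it to a feature vector
h_text = tf.layers.dense(state.h, 64, activation=tf.nn.tanh)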

Aggregator Network

Now, we combine the information from both networks using the following code snippet:

# Now as we have features from both the sources,
# we will fuse the information from both the sources,
# by creating a stack, this will be our aggregator network

fuse = tf.layers.dense(tf.stack([h_v, h_t], axis=1), HIDDEN_LAYER_DIM,
                       activation=tf.nn.tanh)

As you can see in the preceding code, we have created a stack of processed information. This is where the information fusion happens. This network is also similar to our previous two; the only difference is the input of the network, which is in the form of stacked information. We are now ready to process this stacked information. The following is also part of the aggregator network:

# Flatten the data here

fuse = tf.layers.flatten(fuse)

# The following layers are part of the same aggregator network

z = tf.layers.dense(fuse, HIDDEN_LAYER_DIM, activation=tf.nn.sigmoid)

# This is our final dense layer, which maps the network output
# to the two classes

logits = tf.layers.dense(z, NUM_CLASSES)

# We want probabilities at the output; sigmoid will help us

prob = tf.nn.sigmoid(logits)

As you can see, we have just added a few more layers. Nothing fancy is happening here. We use the sigmoid to turn the outputs into probabilities, while the loss will be calculated directly from the logits, which come from a plain dense layer with no activation function.
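
As a side note, stacking is only one way to fuse the two feature vectors. A common alternative is to concatenate them along the feature axis; the sketch below is a drop-in variant shown for illustration and is not used in the rest of this example:

# Alternative fusion: concatenate the two feature vectors along the
# feature axis. The result is already 2-D, so no flatten step is needed
fuse_concat = tf.concat([h_v, h_t], axis=1)
z_concat = tf.layers.dense(fuse_concat, HIDDEN_LAYER_DIM,
                           activation=tf.nn.sigmoid)
logits_concat = tf.layers.dense(z_concat, NUM_CLASSES)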

We should put all of these snippets into our multimodal_toy.py file one by one. We now need to measure the performance of our network so that we can train its weights accordingly. To do this, we will define a loss function. We will use sigmoid cross-entropy to calculate the loss made by our network. This can be written as follows:

# We will use Sigmoid cross-entropy as the loss function

loss = tf.losses.sigmoid_cross_entropy(
    multi_class_labels=tf.one_hot(target, depth=2), logits=logits)

To optimize this loss, we will use the Adam optimizer (adaptive moment estimation), a variant of stochastic gradient descent. This optimizer will work to reduce the loss made by the network. The code snippet is as follows:

# Here we optimize the loss

optimizer = tf.train.AdamOptimizer(learning_rate=0.1)
train_op = optimizer.minimize(loss)

We now have everything we need to train our architecture. Let's quickly write a function that takes care of training the network:

def train(train_op, loss, sess):
    # TRAIN_OP: the optimizer's training op
    # LOSS: the loss tensor
    # SESS: the TensorFlow session
    # First initialize all the variables created

    sess.run(tf.global_variables_initializer())

    # We will monitor the loss through each epoch

    losses = []

    # Let's run the optimization for 100 epochs

    for epoch in range(100):
        (_, l) = sess.run([train_op, loss], {visual: x_v, textual: x_t,
                          target: c})
        losses.append(l)

    # Here we will plot the training loss and display it

    plt.plot(losses, label='loss')
    plt.title('loss')
    plt.show()

We will now create a TensorFlow session and pass it, along with the other parameters, to the train function. After training on the samples drawn from the two Gaussian distributions, we will run the session on the grid of points we created earlier to get predictions over the whole input space:

# Create a tensorflow session

sess = tf.Session()

# Start training of the network

train(train_op, loss, sess)

# Run the trained network on the grid of sample points

(zs, probs) = sess.run([z, prob], {visual: vs, textual: ts})

When we run the preceding script, our model starts training to minimize the loss. We store the loss value for each epoch in a list, which helps us monitor whether the model is actually learning.

Let's see how our loss behaves during the training period:

Training Period

On the x-axis, we plot the number of epochs, and on the y-axis, the loss value. As we trained the network for 100 epochs, you can see the loss decreasing steadily, which shows that the model is learning; the falling loss indicates an improvement in the model's performance on the training data.

To visualize this on our distributions, we will write a simple function that draws the network's decision boundary over the two information sources.

Let's see how well our multimodal architecture is working:

def plot_evaluations(evaluation, cmap, title, labels):
    # EVALUATION: probability op output from the network
    # CMAP: colormap options
    # TITLE: plot title
    # LABELS: class labels
    # First we will plot our distributions as we have done previously

    plt.scatter((x_v - x_v.min()) * resolution / (x_v - x_v.min()).max(),
                (x_t - x_t.min()) * resolution / (x_t - x_t.min()).max(),
                c=np.where(c == 0, 'blue', 'red'))

    # Give the plot a title and label the axes

    plt.title(title, fontsize=14)
    plt.xlabel('visual modality')
    plt.ylabel('textual modality')

    # Here we will draw the decision regions as a color map

    plt.imshow(evaluation.reshape([resolution, resolution]),
               origin='lower', cmap=cmap, alpha=0.5)

    # Let's add a color bar to make the plot easier to read

    cbar = plt.colorbar(ticks=[evaluation.min(), evaluation.max()])
    cbar.ax.set_yticklabels(labels)
    cbar.ax.tick_params(labelsize=13)

We are now ready to plot our decision boundaries by calling our function. This is shown in the following code:

# We will plot the probabilities

plot_evaluations(probs[:, 1], cmap='bwr', title='$C$ prediction',
                 labels=['$C=0$', '$C=1$'])

# Show the plots over here

plt.show()

The following is the result of executing the preceding lines, showing the decision boundary:

Decision Boundary

This is just what we want from our network. As you can see, we have processed the information collected from two different sources and classified the samples into the correct classes. This is how multimodal learning works: we gather information from multiple modalities and combine it to get remarkable results.
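
If you would like a numeric check in addition to the plot, you can evaluate the trained network on the original training points and compare its predictions with the class labels c; this short snippet simply reuses the session, placeholders, and prob op defined above:

# Predicted class = index of the larger output probability
train_probs = sess.run(prob, {visual: x_v, textual: x_t})
predictions = np.argmax(train_probs, axis=1)

# Fraction of training points assigned to the correct class
print('training accuracy: %.3f' % np.mean(predictions == c))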


If you found this article interesting, you can explore Hands-On Artificial Intelligence with TensorFlow for useful techniques in machine learning and deep learning for building intelligent applications.
