<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Jon Wood</title>
    <description>The latest articles on DEV Community by Jon Wood (@jwood803).</description>
    <link>https://dev.to/jwood803</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F161201%2Fdf7dcf36-1f9d-49b6-930e-96dffa70d150.jpeg</url>
      <title>DEV Community: Jon Wood</title>
      <link>https://dev.to/jwood803</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/jwood803"/>
    <language>en</language>
    <item>
      <title>Use Bing Image Search to Get Training Image Data</title>
      <dc:creator>Jon Wood</dc:creator>
      <pubDate>Wed, 08 Dec 2021 11:36:06 +0000</pubDate>
      <link>https://dev.to/jwood803/use-bing-image-search-to-get-training-image-data-n60</link>
      <guid>https://dev.to/jwood803/use-bing-image-search-to-get-training-image-data-n60</guid>
      <description>&lt;p&gt;When going through the FastAI book, &lt;a href="https://amzn.to/3FIWUsw"&gt;Deep Learning for Coders&lt;/a&gt;, I noticed that in one of the early chapters they mention using the &lt;a href="https://docs.microsoft.com/en-us/bing/search-apis/bing-image-search/overview"&gt;Bing Image Search API&lt;/a&gt; to retrieve images for training data. While they have a nice wrapper for the API, I thought I'd dive into the API as well and use it to build my own way to download training image data.&lt;/p&gt;

&lt;p&gt;Let's suppose we need to make an image classification model to determine what kind of car is in an image. We'd need quite a bit of different images for this model, so let's use the Bing Image Search to gather images of the Aston-Martin car so we can start getting our data.&lt;/p&gt;

&lt;p&gt;Check out the below for a video version of this post.&lt;/p&gt;

&lt;p&gt;&lt;iframe width="710" height="399" src="https://www.youtube.com/embed/lKCxQ6mxuy0"&gt;
&lt;/iframe&gt;
&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Bing Image Search
&lt;/h2&gt;

&lt;p&gt;Before going into the technical side of Bing Image Search, let's go over why use this in the first place. Why not just download the images ourselves?&lt;/p&gt;

&lt;p&gt;Bing Images Search has a few features in it that we can utilize in our code when getting our images. Some of these features are important to take into account.&lt;/p&gt;

&lt;h3&gt;
  
  
  Automation
&lt;/h3&gt;

&lt;p&gt;I'll be honest, I'm lazy and if I can script something to do a task for me then I'll definitely spend the time to build the script rather than do the task manually. Rather than manually finding images and downloading them, we can use the Bing Image Search API to do this for us.&lt;/p&gt;

&lt;h3&gt;
  
  
  Image License
&lt;/h3&gt;

&lt;p&gt;We can't always just take an image from a website and use it however we want. A lot of images that are online are copyrighted and if we don't have a license to use the copyright we are actually in violation of the creator's copyright. If they find out we use their image without a license or permission then they can, more than likely, take legal action against us.&lt;/p&gt;

&lt;p&gt;However, with Bing Image Search, we have an option to specify what license the images has that get returned to us. We can do this with the &lt;code&gt;licenseType&lt;/code&gt; query parameter in our API call. This utilizes &lt;a href="https://creativecommons.org/"&gt;Creative Commons&lt;/a&gt; licenses. We can specify exactly what type of license our images has. We can specify that want images that are public where the copyright is fully waived, which is what we will do. There are many Creative Commons license types that the Bing Image Search supports and there's a full list &lt;a href="https://docs.microsoft.com/en-us/bing/search-apis/bing-image-search/reference/query-parameters#license"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Image Type
&lt;/h3&gt;

&lt;p&gt;There are quite a few images types that we could download from Bing Image Search. For our purposes, though, we only want photos of Aston Martin cars. Due to that, we can specify the image type in our API calls to just &lt;code&gt;photo&lt;/code&gt;. If we don't specify this we could get back animated GIFs, clip art, or drawings of Aston Martin cars.&lt;/p&gt;

&lt;h3&gt;
  
  
  Safe Content
&lt;/h3&gt;

&lt;p&gt;When downloading images from the internet you never really know what you're going to get. Bing Image Search can help ease that worry by specifying that you want only safe content to be returned.&lt;/p&gt;

&lt;p&gt;Bing can do this filtering for us so we don't have to worry about it when we do our API call. This is one less thing we have to worry about and, because it's the internet, it's definitely something to worry about when download images.&lt;/p&gt;

&lt;h2&gt;
  
  
  Create Azure Resource
&lt;/h2&gt;

&lt;p&gt;Before we can use the Bing Image Search API we need to create the resource for it. In the Azure Portal create a new resource and search for "Bing Search". Then, click on the "Bing Search v7" resource to create it.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--QfrhCLiI--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ju22jlw27njpyneufc2k.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--QfrhCLiI--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ju22jlw27njpyneufc2k.png" alt="Image description" width="159" height="70"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;When creating the resource give it a name, what resource group it will be under, and what subscription it will be under. For the pricing tier, it does have a free tier to allow you to give the service a try for evaluation or for a proof of concept. Once that is complete, click "Create".&lt;/p&gt;

&lt;p&gt;When that completes deployment, we can explore a bit on the resource page. One thing to note is that there are a few things we can look at here. There's a "Try me" tab where we can try the Bing Search API and see what results we get. There is some sample code to see real quick how to use the Bing Search API. And there are a lot of tutorials that we can look at if we want to look at something more specific, such as the image or video search APIs.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--dphufbhg--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/y3hvfzgun1kjxuukrqvz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--dphufbhg--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/y3hvfzgun1kjxuukrqvz.png" alt="Image description" width="880" height="148"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Retrieve Key and Endpoint
&lt;/h2&gt;

&lt;p&gt;To use the API in our code we will need the API key and the endpoint to call. There are a couple of ways we can get to it. First, on the "Overview" page of the resource there's a link that says to "click here to manage keys". Clicking that will take you to another page where you can get the API keys and the endpoint URL.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--08F2cEvf--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/yidouimf2lr6aqemys7f.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--08F2cEvf--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/yidouimf2lr6aqemys7f.png" alt="Image description" width="304" height="83"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You can also click on the "Keys and Endpoint" section on the left navigation.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--0UZ2Olq_--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/w4zqwa1vah2icdgcozzs.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--0UZ2Olq_--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/w4zqwa1vah2icdgcozzs.png" alt="Image description" width="189" height="130"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Now save the API key and the endpoint since we'll need those to access the API in the code.&lt;/p&gt;

&lt;h2&gt;
  
  
  Using the API
&lt;/h2&gt;

&lt;p&gt;Now we get to the fun stuff where we can get into some code. I'll be using Python, but you're very welcome to use the language of your choice since this is a simple API call. I'm also using Azure ML since it's very easy to get a Jupyter Lab instance running plus most machine learning and data science packages already installed.&lt;/p&gt;

&lt;h3&gt;
  
  
  Imports
&lt;/h3&gt;

&lt;p&gt;First, we need to import some modules. We have four that we will need to import.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;JSON&lt;/strong&gt;: This will be used to read in a config file for the API key and endpoint&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Requests&lt;/strong&gt;: Will be used to make the API calls. This is pre-installed in an Azure ML Jupyter instance, so you may need to run &lt;code&gt;pip install requests&lt;/code&gt; if you are using another envrionment.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Time&lt;/strong&gt;: Used to delay API calls so the server doesn't get hit too much by requests.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;OS&lt;/strong&gt;: Used to saved and help clean image data on the local machine.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;PPrint&lt;/strong&gt;: Used to format JSON when printing.
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;json&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;requests&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;time&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;os&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;pprint&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  The API Call
&lt;/h3&gt;

&lt;p&gt;Now, we can start building and making the API call to get the image data. &lt;/p&gt;

&lt;h4&gt;
  
  
  Building the Endpoint
&lt;/h4&gt;

&lt;p&gt;To start building the call, we need to get the API key which is kept in a JSON file for security reasons. We'll use the &lt;code&gt;open&lt;/code&gt; method to open the file to be able to read it and use the &lt;code&gt;json&lt;/code&gt; module to load the JSON file. This creates a dictionary where the JSON keys are the key names of the dictionary where you can get the values.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;config&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;load&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"config.json"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

&lt;span class="n"&gt;api_key&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;"apiKey"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now that we have the API key we can build up the URL to make the API call. We can use the endpoint that we got from the Azure Portal and help build up the URL.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;endpoint&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"https://api.bing.microsoft.com/"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;With the endpoint, we have to add some to it to tell it that we want the Image Search API. To learn more about the exact endpoints we're using here, &lt;a href="https://docs.microsoft.com/en-us/bing/search-apis/bing-image-search/reference/endpoints"&gt;this doc&lt;/a&gt; has a lot of good information.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;url&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="s"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;endpoint&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;v7.0/images/search"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  Building the Headers and Query Parameters
&lt;/h4&gt;

&lt;p&gt;Some more information we need to add to our call are the headers and the query parameters. The headers is where we supply the API key and the query parameters detail what images we want to return.&lt;/p&gt;

&lt;p&gt;Requests makes it easy to specify the headers, which is done as a dictionary. We need to supply the &lt;code&gt;Ocp-Apim-Subscription-Key&lt;/code&gt; header for the API key.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;headers&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="s"&gt;"Ocp-Apim-Subscription-Key"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;api_key&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The query parameters are also done as a dictionary. We'll supply the license, image type, and safe search parameters here. Those are optional parameters, but the &lt;code&gt;q&lt;/code&gt;&lt;br&gt;
parameter is required which is what query we want to use to search for images. For our query here, we'll search for aston martin cars.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;params&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="s"&gt;"q"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s"&gt;"aston martin"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
    &lt;span class="s"&gt;"license"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s"&gt;"public"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
    &lt;span class="s"&gt;"imageType"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s"&gt;"photo"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="s"&gt;"safeSearch"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s"&gt;"Strict"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  Making the API Call
&lt;/h4&gt;

&lt;p&gt;With everything ready, we can now make the API call and get the results. With &lt;code&gt;requests&lt;/code&gt; we can just call the &lt;code&gt;get&lt;/code&gt; method. In there we pass in the URl, the headers, and the parameters. We use the &lt;code&gt;raise_for_status&lt;/code&gt; method to throw an exception if the status code isn't successful. Then, we get the JSON of the response and store that into a variable. Finally, we use the pretty print method to print the JSON response.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;params&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;params&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;raise_for_status&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="n"&gt;pprint&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;pprint&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And here's a snapshot of the response. There's quite a bit here but we'll break it down some later in this post.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{'_type': 'Images',
 'currentOffset': 0,
 'instrumentation': {'_type': 'ResponseInstrumentation'},
 'nextOffset': 38,
 'totalEstimatedMatches': 475,
 'value': [{'accentColor': 'C6A105',
            'contentSize': '1204783 B',
            'contentUrl': '[https://www.publicdomainpictures.net/pictures/380000/velka/aston-martin-car-1609287727yik.jpg](https://www.publicdomainpictures.net/pictures/380000/velka/aston-martin-car-1609287727yik.jpg)',
            'creativeCommons': 'PublicNoRightsReserved',
            'datePublished': '2021-02-06T20:45:00.0000000Z',
            'encodingFormat': 'jpeg',
            'height': 1530,
            'hostPageDiscoveredDate': '2021-01-12T00:00:00.0000000Z',
            'hostPageDisplayUrl': '[https://www.publicdomainpictures.net/view-image.php?image=376994&amp;amp;amp;picture=aston-martin-car](https://www.publicdomainpictures.net/view-image.php?image=376994&amp;amp;amp;picture=aston-martin-car)',
            'hostPageFavIconUrl': '[https://www.bing.com/th?id=ODF.lPqrhQa5EO7xJHf8DMqrJw&amp;amp;amp;pid=Api](https://www.bing.com/th?id=ODF.lPqrhQa5EO7xJHf8DMqrJw&amp;amp;amp;pid=Api)',
            'hostPageUrl': '[https://www.publicdomainpictures.net/view-image.php?image=376994&amp;amp;amp;picture=aston-martin-car](https://www.publicdomainpictures.net/view-image.php?image=376994&amp;amp;amp;picture=aston-martin-car)',
            'imageId': '38DBFEF37523B232A6733D7D9109A21FCAB41582',
            'imageInsightsToken': 'ccid_WTqn9r3a*cp_74D633ADFCF41C86F407DFFCF0DEC38F*mid_38DBFEF37523B232A6733D7D9109A21FCAB41582*simid_608053462467504486*thid_OIP.WTqn9r3aKv5TLZxszieEuQHaF5',
            'insightsMetadata': {'availableSizesCount': 1,
                                 'pagesIncludingCount': 1},
            'isFamilyFriendly': True,
            'name': 'Aston Martin Car Free Stock Photo - Public Domain '
                    'Pictures',
            'thumbnail': {'height': 377, 'width': 474},
            'thumbnailUrl': '[https://tse2.mm.bing.net/th?id=OIP.WTqn9r3aKv5TLZxszieEuQHaF5&amp;amp;amp;pid=Api](https://tse2.mm.bing.net/th?id=OIP.WTqn9r3aKv5TLZxszieEuQHaF5&amp;amp;amp;pid=Api)',
            'webSearchUrl': '[https://www.bing.com/images/search?view=detailv2&amp;amp;amp;FORM=OIIRPO&amp;amp;amp;q=aston+martin&amp;amp;amp;id=38DBFEF37523B232A6733D7D9109A21FCAB41582&amp;amp;amp;simid=608053462467504486](https://www.bing.com/images/search?view=detailv2&amp;amp;amp;FORM=OIIRPO&amp;amp;amp;q=aston+martin&amp;amp;amp;id=38DBFEF37523B232A6733D7D9109A21FCAB41582&amp;amp;amp;simid=608053462467504486)',
            'width': 1920}]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A few things to note from the response:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;nextOffset&lt;/code&gt;: This will help us page items to perform multiple requests.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;value.contentUrl&lt;/code&gt;: This is the actual URL of the image. We will use this URL to download the images.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Paging Through Results
&lt;/h3&gt;

&lt;p&gt;For a single API call we may get around 30 items or so by default. How do we get more images with the API? We page through the results. And the way to do this is to use the &lt;code&gt;nextOffset&lt;/code&gt; item in the API response. We can use this value to pass in another query parameter &lt;code&gt;offset&lt;/code&gt; to give the next page of results.&lt;/p&gt;

&lt;p&gt;So if I only want at most 200 images, I can use the below code to page through the API results.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;new_offset&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;

&lt;span class="k"&gt;while&lt;/span&gt; &lt;span class="n"&gt;new_offset&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;=&lt;/span&gt; &lt;span class="mi"&gt;200&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;new_offset&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;params&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;"offset"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;new_offset&lt;/span&gt;

    &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;params&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;params&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;raise_for_status&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sleep&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;new_offset&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;"nextOffset"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;item&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;"value"&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
        &lt;span class="n"&gt;contentUrls&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;item&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;"contentUrl"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We initialize the offset to 0 so the initial call will give the first page of results. In the &lt;code&gt;while&lt;/code&gt; loop we limit to just 200 images for the offset. Within the loop we set the &lt;code&gt;offset&lt;/code&gt; parameter to the current offset, which will be 0 initially. Then we make the API call, we sleep or wait for one second, and we set the &lt;code&gt;offset&lt;/code&gt; parameter to the &lt;code&gt;nextOffset&lt;/code&gt; from the results and save the &lt;code&gt;contentUrl&lt;/code&gt; items from the results into a list. Then, we do it again until we reach the limit of our offset.&lt;/p&gt;

&lt;h3&gt;
  
  
  Downloading the Images
&lt;/h3&gt;

&lt;p&gt;In the previous API calls all we did was capture the &lt;code&gt;contentUrl&lt;/code&gt; items from each of the images. In order to get the images as training data we need to download them. Before we do that, let's set up our paths to be ready for images to be downloaded to them. First we set the path and then we use the &lt;code&gt;os&lt;/code&gt; module to check if the path exists. If it doesn't, we'll create it.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;dir_path&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"./aston-martin/train/"&lt;/span&gt;

&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;exists&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dir_path&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;makedirs&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dir_path&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Generally, we could just do the below code and loop through all of the content URL items and for each one we create the path with the &lt;code&gt;os.path.join&lt;/code&gt; method to get the correct path for the system we're on, and open the path with the &lt;code&gt;open&lt;/code&gt; method. With that we can use &lt;code&gt;requests&lt;/code&gt; again with the &lt;code&gt;get&lt;/code&gt; method and pass in the URL. Then, with the &lt;code&gt;open&lt;/code&gt; function, we can write to the path from the image contents.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;url&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;contentUrls&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;path&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dir_path&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nb"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"wb"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;image_data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

            &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;write&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;image_data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="nb"&gt;OSError&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;pass&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;However, this is a bit more complicated than we would hope it would be.&lt;/p&gt;

&lt;h3&gt;
  
  
  Cleaning the Image Data
&lt;/h3&gt;

&lt;p&gt;If we print the image URLs for all that we get back it would look something like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;https://www.publicdomainpictures.net/pictures/380000/velka/aston-martin-car-1609287727yik.jpg
https://images.pexels.com/photos/592253/pexels-photo-592253.jpeg?auto=compress&amp;amp;amp;cs=tinysrgb&amp;amp;amp;h=750&amp;amp;amp;w=1260
https://images.pexels.com/photos/2811239/pexels-photo-2811239.jpeg?cs=srgb&amp;amp;amp;dl=pexels-tadas-lisauskas-2811239.jpg&amp;amp;amp;fm=jpg
https://get.pxhere.com/photo/car-vehicle-classic-car-sports-car-vintage-car-coupe-antique-car-land-vehicle-automotive-design-austin-healey-3000-aston-martin-db2-austin-healey-100-69398.jpg
https://get.pxhere.com/photo/car-automobile-vehicle-automotive-sports-car-supercar-luxury-expensive-coupe-v8-martin-vantage-aston-land-vehicle-automotive-design-luxury-vehicle-performance-car-aston-martin-dbs-aston-martin-db9-aston-martin-virage-aston-martin-v8-aston-martin-dbs-v12-aston-martin-vantage-aston-martin-v8-vantage-2005-aston-martin-rapide-865679.jpg
https://c.pxhere.com/photos/5d/f2/car_desert_ferrari_lamborghini-1277324.jpg!d
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Do you notice anything in the URLs? While most of then end in &lt;code&gt;jpeg&lt;/code&gt; there are a few with some extra parameters on the end. If we try to download with those URLs we won't get the image. So we need to do a little bit of data cleaning here.&lt;/p&gt;

&lt;p&gt;Luckily, there are two patterns we can check, if there is a &lt;code&gt;?&lt;/code&gt; in the URL and if there is a &lt;code&gt;!&lt;/code&gt; in the URL. With those patterns we can update our loop to download the images to the below to get the correct URLs for all images.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;url&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;contentUrls&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;split&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;split&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"/"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;last_item&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;split&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

    &lt;span class="n"&gt;second_split&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;last_item&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;split&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"?"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nb"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;second_split&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;last_item&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;second_split&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

    &lt;span class="n"&gt;third_split&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;last_item&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;split&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"!"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nb"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;third_split&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;last_item&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;third_split&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

    &lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;last_item&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;path&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dir_path&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;last_item&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nb"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"wb"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;image_data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="c1"&gt;#image_data.raise_for_status()
&lt;/span&gt;
            &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;write&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;image_data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="nb"&gt;OSError&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;pass&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;With the URLs cleaned up, we can download the full images.&lt;/p&gt;
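&lt;p&gt;As a side note, the same file-name cleaning can be sketched more compactly with Python's standard &lt;code&gt;urllib.parse&lt;/code&gt; module; the helper below is my own illustration, not code from the post.&lt;/p&gt;

```python
from urllib.parse import urlsplit

def clean_filename(url):
    """Derive a local file name from an image URL."""
    # urlsplit drops the '?' query string for us; take the last path segment
    name = urlsplit(url).path.split("/")[-1]
    # some hosts (like pxhere) append a suffix such as "!d" after the extension
    return name.split("!")[0]

print(clean_filename("https://c.pxhere.com/photos/5d/f2/car_desert_ferrari_lamborghini-1277324.jpg!d"))
# prints car_desert_ferrari_lamborghini-1277324.jpg
```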

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--WkypWfpb--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/l1i90gewsndtwmg0rj0u.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--WkypWfpb--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/l1i90gewsndtwmg0rj0u.png" alt="Image description" width="274" height="330"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Conclusion
&lt;/h3&gt;

&lt;p&gt;While this probably isn't as sophisticated as the wrapper that FastAI has, it should help if you need to get training images from Bing Image Search manually, and you can tweak it to fit your own needs.&lt;/p&gt;

&lt;p&gt;Using Bing Image Search is a great way to get quality, license-appropriate images for training data.&lt;/p&gt;

</description>
      <category>cognitiveservices</category>
      <category>azure</category>
      <category>deeplearning</category>
      <category>python</category>
    </item>
    <item>
      <title>How the Machine Learning Process is Like Cooking</title>
      <dc:creator>Jon Wood</dc:creator>
      <pubDate>Tue, 18 May 2021 07:24:14 +0000</pubDate>
      <link>https://dev.to/jwood803/how-the-machine-learning-process-is-like-cooking-3kca</link>
      <guid>https://dev.to/jwood803/how-the-machine-learning-process-is-like-cooking-3kca</guid>
<description>&lt;p&gt;When creating machine learning models, it's important to follow the machine learning process in order to get the best-performing model you can into production and to keep it performing well.&lt;/p&gt;

&lt;p&gt;But why cooking? First, I enjoy cooking. But also, it is something we all do. Now, we don't all make five-course meals every day or aim to be Michelin-star chefs. We all follow a process to make our food, though, even if it's just heating something up in the microwave.&lt;/p&gt;

&lt;p&gt;In this post, I'll go over the machine learning process and how it relates to cooking to give a better understanding of the process and maybe even a way to help remember the steps.&lt;/p&gt;

&lt;p&gt;For the video version, check below:&lt;br&gt;
&lt;iframe width="710" height="399" src="https://www.youtube.com/embed/Hqrkbxd69lM"&gt;
&lt;/iframe&gt;
&lt;/p&gt;

&lt;h1&gt;
  
  
  Machine Learning Process
&lt;/h1&gt;

&lt;p&gt;First, let's briefly go over the machine learning process. Here's a diagram that's known as the cross-industry standard process for data mining, or simply known as &lt;a href="https://en.wikipedia.org/wiki/Cross-industry_standard_process_for_data_mining"&gt;CRISP-DM&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--u7XdsA3p--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/mgz6oijpcieecmoyyfaj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--u7XdsA3p--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/mgz6oijpcieecmoyyfaj.png" alt="CRISP-DM"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Kenneth Jensen - Own work based on:&lt;a href="//ftp://public.dhe.ibm.com/software/analytics/spss/documentation/modeler/18.0/en/ModelerCRISPDM.pdf"&gt;ftp://public.dhe.ibm.com/software/analytics/spss/documentation/modeler/18.0/en/ModelerCRISPDM.pdf&lt;/a&gt; (Figure 1)&lt;/p&gt;

&lt;p&gt;The machine learning process is pretty straightforward when going through the diagram:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  Business understanding - What exactly is the problem we are trying to solve with data&lt;/li&gt;
&lt;li&gt;  Data understanding - What exactly is in our data, such as what does each column mean and how does it relate to the business problem&lt;/li&gt;
&lt;li&gt;  Data prep - Data preprocessing and preparation. This can also include feature engineering&lt;/li&gt;
&lt;li&gt;  Modeling - Getting a model from our data&lt;/li&gt;
&lt;li&gt;  Evaluation - Evaluating the model for performance on generalized data&lt;/li&gt;
&lt;li&gt;  Deployment - Deploying the model for production use&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Note that a couple of items can go back and forth. You may do multiple iterations of getting a better understanding of the business problem and the data, data prep and modeling, and even going back to the business problem when evaluating a model.&lt;/p&gt;

&lt;p&gt;Notice that there's a circle around the whole process which means you may even have to go back to understanding the problem once a model is deployed.&lt;/p&gt;

&lt;p&gt;There are a couple of items I would add to help improve this process, though. First, we need to think about getting our data. Also, I believe we can add a separate item in the process for improving, or optimizing, our model.&lt;/p&gt;

&lt;h2&gt;
  
  
  Getting Data
&lt;/h2&gt;

&lt;p&gt;I would actually add an item before or after defining the business problem, and that's getting data. Sometimes you already have the data and define the business problem around it; other times you have to get the data after defining the problem. Either way, we need good data. You may have heard the old programming saying, "Garbage in, garbage out", and it applies to machine learning as well.&lt;/p&gt;

&lt;p&gt;We can't have a good model unless we give it good data.&lt;/p&gt;

&lt;h2&gt;
  
  
  Improving the Model
&lt;/h2&gt;

&lt;p&gt;Once we have a model we can also spend some time improving it even further. We can apply some techniques that tweak the algorithm to perform better.&lt;/p&gt;

&lt;p&gt;Now that we understand the machine learning process a bit better, let's see how it relates to cooking.&lt;/p&gt;

&lt;h1&gt;
  
  
  Relating the Machine Learning Process to Cooking
&lt;/h1&gt;

&lt;p&gt;At first glance, you may not see how the machine learning process relates to cooking at all. But let's go into more detail of the machine learning process and how each step relates to cooking.&lt;/p&gt;

&lt;h2&gt;
  
  
  Business Understanding
&lt;/h2&gt;

&lt;p&gt;One of the first things to do for the machine learning process is to get a business understanding of the problem.&lt;/p&gt;

&lt;p&gt;For cooking, we know we want to make a dish, but which one? What do we want to accomplish with our dish? Is it for breakfast, lunch, or dinner? Is it for just yourself or do we want to create something for a family of four? Or for a large gathering?&lt;/p&gt;

&lt;p&gt;Knowing these will help us determine what we want to cook.&lt;/p&gt;

&lt;h2&gt;
  
  
  Getting Data
&lt;/h2&gt;

&lt;p&gt;We would need a way to get our data. We may already have the data in a data warehouse or we would need to generate it.&lt;/p&gt;

&lt;p&gt;For cooking, getting data can be related to getting your ingredients. You may already have the ingredients that you need or you would need to go to the grocery store.&lt;/p&gt;

&lt;h3&gt;
  
  
  Getting Good Data
&lt;/h3&gt;

&lt;p&gt;Earlier I mentioned that in order to have the best performing machine learning model we need to give it good data. The same can be said for making the best tasting dish. We would need to give it the best ingredients we can find.&lt;/p&gt;

&lt;p&gt;You wouldn't want to give your model bad data just like you wouldn't want to use spoiled or rotten ingredients when making your dish.&lt;/p&gt;

&lt;h2&gt;
  
  
  Data Processing
&lt;/h2&gt;

&lt;p&gt;Data processing is perhaps the most important step after getting good data. How you process the data will determine how well your model performs.&lt;/p&gt;

&lt;p&gt;For cooking, this is equivalent to preparing your ingredients. This includes chopping any ingredients such as vegetables, but keeping a consistent size when chopping also counts. This helps the pieces cook evenly. If some pieces are smaller they can burn or if some pieces are bigger then they may not be fully cooked.&lt;/p&gt;

&lt;p&gt;Also, just as there are multiple ways to process your data in machine learning, there are different ways to prepare ingredients. In fact, there's a term for processing all of your ingredients before you start cooking - &lt;a href="https://en.wikipedia.org/wiki/Mise_en_place"&gt;mise en place&lt;/a&gt; - which is French for "everything in its place". You see this in cooking shows all the time, where they have everything ready before they start cooking.&lt;/p&gt;

&lt;p&gt;This actually also makes sense for machine learning. We have to have all of our data processing done on the training data before we can give it to the machine learning algorithm.&lt;/p&gt;

&lt;h2&gt;
  
  
  Modeling
&lt;/h2&gt;

&lt;p&gt;Now it's time for the actual modeling part of the process where we give our data to an algorithm.&lt;/p&gt;

&lt;p&gt;In cooking, this is actually where we cook our dish. In fact, we can relate choosing a recipe to choosing a machine learning algorithm. The recipe will take the ingredients and turn out a dish, and the algorithm will take the data and turn out a model.&lt;/p&gt;

&lt;p&gt;Different recipes will turn out different dishes, though. Take a salad, for instance. Depending on the recipe and the ingredients, the salad can turn out to be bright and citrusy like this &lt;a href="https://www.foodnetwork.com/recipes/ree-drummond/kale-citrus-salad-2593844"&gt;kale citrus salad&lt;/a&gt;. Or, it can be warm and savory like this &lt;a href="https://www.foodnetwork.com/recipes/alton-brown/spinach-salad-with-warm-bacon-dressing-recipe-1947599"&gt;spinach salad with bacon dressing&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;They're both salads, but they turn into different kinds of salads because of different ingredients and recipes. In machine learning, you can similarly end up with different models depending on the data and algorithms.&lt;/p&gt;

&lt;p&gt;What if you have the same ingredients? There are definitely different ways to make the same recipe. Hummus is traditionally made with chickpeas, tahini, garlic, and lemon like in &lt;a href="https://www.foodnetwork.com/recipes/katie-lee/classic-hummus-2333947"&gt;this recipe&lt;/a&gt;. But there is also &lt;a href="https://www.foodnetwork.com/recipes/alton-brown/hummus-for-real-recipe-2014722"&gt;this hummus recipe&lt;/a&gt; that has the same ingredients but the recipe is just a bit different.&lt;/p&gt;

&lt;h2&gt;
  
  
  Optimizing the Model
&lt;/h2&gt;

&lt;p&gt;Depending on the algorithm the machine learning model is using, we can give it different parameters that optimize the model for better performance. These parameters are called &lt;a href="https://en.wikipedia.org/wiki/Hyperparameter_(machine_learning)"&gt;hyperparameters&lt;/a&gt;, and they control the learning process as the model fits to your data.&lt;/p&gt;

&lt;p&gt;You can set these manually by choosing values for the hyperparameters yourself, but that can be quite tedious and you never quite know which values to choose. Instead, the search can be automated: give a range of values, run the training multiple times with different values, and use the best-performing model that is found.&lt;/p&gt;
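&lt;p&gt;As a rough sketch of what such an automated search can look like, here's an example using scikit-learn's &lt;code&gt;GridSearchCV&lt;/code&gt; (scikit-learn isn't used in this post, and the value ranges are arbitrary examples):&lt;/p&gt;

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = load_iris(return_X_y=True)

# Ranges of hyperparameter values to try (arbitrary example values)
param_grid = {
    "n_estimators": [25, 50],
    "max_depth": [3, 5],
}

# Trains a model for every combination and keeps the best one found
search = GridSearchCV(RandomForestClassifier(random_state=0), param_grid, cv=3)
search.fit(X, y)

print(search.best_params_)
```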

&lt;p&gt;How do we optimize a dish, though? Perhaps the best way to get the best taste out of your dish, other than using the best ingredients, is to season it. Specifically, seasoning with salt. In &lt;a href="https://www.youtube.com/watch?v=ITI3J5UWiyQ"&gt;this video&lt;/a&gt; by &lt;a href="https://www.ethanchlebowski.com/"&gt;Ethan Chlebowski&lt;/a&gt;, he suggests&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;…home cooks severely under salt the food they are cooking and is often why the food doesn't taste good.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;He even quotes this line from the book &lt;a href="https://amzn.to/2Rslw54"&gt;Ruhlman's Twenty&lt;/a&gt;:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;How to salt food is the most important skill to know in the kitchen.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;I've even experienced this in my own cooking where I don't add enough salt. Once I do, the dish tastes 100 times better.&lt;/p&gt;

&lt;p&gt;Now, adding salt to your dish is the more manual way of optimizing it with seasoning. Is there a way this can be automated? Actually, there is! Instead of using just salt and adding other spices yourself, you can get seasoning blends that have all the spices in them for you!&lt;/p&gt;

&lt;h2&gt;
  
  
  Evaluating the Model
&lt;/h2&gt;

&lt;p&gt;Evaluating the model is one of the most important steps because it tells you how well your model will perform on new data, or rather, data it hasn't seen before. Your model may have good performance during training, but giving it new data may reveal that it actually performs badly.&lt;/p&gt;
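&lt;p&gt;A minimal sketch of evaluating on data the model hasn't seen, assuming scikit-learn and its built-in iris dataset (neither comes from the post itself):&lt;/p&gt;

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# Hold out a portion of the data that the model never sees during training
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# Accuracy on the held-out data estimates performance on new data
print(model.score(X_test, y_test))
```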

&lt;p&gt;Evaluating your cooked dish is a lot more fun, though. This is where you get to eat it! You will determine if it's a good dish by how it tastes. Is it good or bad? If you served it to others, what did they think about it?&lt;/p&gt;

&lt;h2&gt;
  
  
  Iterating on the Model
&lt;/h2&gt;

&lt;p&gt;Iterating on the model is a part of the process that may not seem necessary, but it can be an important one. Your data may change over time, which would make your model stale. That is, the model relies on patterns that held in the data it was trained on, but due to some process change or something similar they no longer hold. And since the underlying data changed, the model won't predict as well as it did.&lt;/p&gt;

&lt;p&gt;Similarly, you may have more or even better data that you can use for training, so you can then retrain the model with that to make better predictions.&lt;/p&gt;

&lt;p&gt;How can you iterate on a dish that you just prepared? First thing is if it was good or bad. If it was bad, then we can revisit the recipe and see if we did anything wrong. Did we overcook it? Did we miss an ingredient? Did we prepare an ingredient incorrectly?&lt;/p&gt;

&lt;p&gt;If it was good, we can still iterate on it. Any ingredients you would like to add or remove to make it more to your liking?&lt;/p&gt;

&lt;p&gt;A lot of chefs and home cooks like to take notes about recipes they've made. They write in some tricks they've learned along the way but also some different paths from the recipe that they either had to take due to a missing ingredient or preferred to take.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Hopefully, this helps you better understand the machine learning process through the eyes of cooking a dish. It may even help you understand the importance of each step because, in cooking, if one step is missed then you probably won't be having a good dinner tonight.&lt;/p&gt;

&lt;p&gt;And if you're wondering where AutoML fits into all of this, you can think of it as the meal delivery kits like Hello Fresh or Blue Apron. They do a lot of the work for you and you just have to put it all together.&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>datascience</category>
    </item>
    <item>
      <title>Participate in the 2020 Virtual ML.NET Hackathon</title>
      <dc:creator>Jon Wood</dc:creator>
      <pubDate>Fri, 30 Oct 2020 09:46:03 +0000</pubDate>
      <link>https://dev.to/jwood803/participate-in-the-2020-virtual-ml-net-hackathon-1kih</link>
      <guid>https://dev.to/jwood803/participate-in-the-2020-virtual-ml-net-hackathon-1kih</guid>
<description>&lt;p&gt;If you want to learn machine learning, join in on the &lt;a href="https://github.com/virtualmlnet/hackathon-2020"&gt;Virtual ML.NET Hackathon&lt;/a&gt;! Here you can create or join a project to have fun, learn ML.NET and machine learning, and help contribute to open source.&lt;/p&gt;

&lt;p&gt;For anyone not familiar with ML.NET there will be a workshop presented on November 13th to go over the basics of ML.NET.&lt;/p&gt;

&lt;p&gt;&lt;iframe width="710" height="399" src="https://www.youtube.com/embed/hmvMRkx04Qs"&gt;
&lt;/iframe&gt;
&lt;/p&gt;

&lt;h2&gt;
  
  
  Schedule
&lt;/h2&gt;

&lt;p&gt;The workshop will be on November 13th and will start at 10AM Pacific or 1PM Eastern. &lt;/p&gt;

&lt;p&gt;The week to officially start hacking on your ML.NET code is from November 13th to November 20th.&lt;/p&gt;

&lt;p&gt;Submissions of your projects or contributions are due on November 18th, and the winners of the hackathon will be announced on the 20th.&lt;/p&gt;

&lt;p&gt;Sign ups have already started and feel free to reference the official &lt;a href="https://github.com/virtualmlnet/hackathon-2020#schedule"&gt;schedule&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  How to Sign Up
&lt;/h2&gt;

&lt;p&gt;To sign up for the workshop and/or the hackathon, fill out &lt;a href="https://aka.ms/mlnet-hack-signup"&gt;this form&lt;/a&gt;. The first 50 to sign up will get a free t-shirt!&lt;/p&gt;

&lt;h2&gt;
  
  
  Creating or Joining a Project
&lt;/h2&gt;

&lt;p&gt;To create a project, simply create an issue in the &lt;a href="https://github.com/virtualmlnet/hackathon-2020/issues/new?assignees=&amp;amp;labels=&amp;amp;template=idea.md&amp;amp;title=ML.NET+Hackathon+Idea"&gt;GitHub repository&lt;/a&gt;. When signing up, feel free to describe the project or contribution you want to submit. You also have the option to specify whether you want others to join your project as part of a team, as well as whether you would like a mentor to help you with your project.&lt;/p&gt;

&lt;p&gt;If you see a project that's already listed as an issue and it specifies they would like others to join their team, simply comment on the issue indicating you would like to join.&lt;/p&gt;

&lt;p&gt;Note that if you are using a dataset for your project, make sure it doesn't have any personal information in it.&lt;/p&gt;

&lt;h2&gt;
  
  
  Submissions
&lt;/h2&gt;

&lt;p&gt;Final submissions are due November 18th. To submit, create a pull request in the repo in the &lt;code&gt;Submissions&lt;/code&gt; folder. Submissions must include a README file indicating what was done for a solution to the project or what was contributed, any source code used for the project, and a 1 to 3 minute video showing or talking about your project or contribution.&lt;/p&gt;

&lt;p&gt;Submissions do not have to be fully complete or runnable to be counted.&lt;/p&gt;

</description>
      <category>dotnet</category>
      <category>machinelearning</category>
      <category>opensource</category>
    </item>
  </channel>
</rss>
