Author: Jason Corso (Professor of Robotics and EECS at University of Michigan | Co-Founder and Chief Science Officer @ Voxel51)
Working with computer systems has been of keen interest to me for perhaps the last 40 years. Growing up with the Atari 2600, the Commodore 64, and the 8-bit Nintendo, I spent my time gaming, learning the basics (pun intended) of programming, making sprites bounce around — you name it. I remember one weekend, which must have been in 1985, when my family was going to visit some friends who had recently gotten a Nintendo. No one “on my block” (I grew up in New York City) owned a Nintendo. I was super excited. At that point, my gaming consisted of what I could do on the Atari and Commodore. I had seen commercials on television depicting a futuristic robot (see Figure 2 below) that would make gaming come to life, transforming it into an interactive, collaborative experience! Anyway, I cannot tell you how disappointed I was when I got to their house and found they only had the main NES, not the Robotic Operating Buddy.
These early dreams of exotic ways to work with computer systems stuck with me for decades. I even wrote a dissertation about it. Yet, I find myself puzzled by the lack of general interest in the possibility of humans and computers — namely, contemporary AI systems — cooperating to solve problems jointly. For example, most of the robotics papers I see, exciting as they are, promise some automation-like agent that will empty my dishwasher or fold my laundry for me, and most of the human-computer interaction (HCI) papers I see focus on the “user experience” more than on any notion of collaboration; the idea that I might be able to collaborate with the computer system to capitalize on what each of us brings to the table is often missing.
More generally, over the last two decades steeped in supervised machine learning, the interaction between humans and computers has been mostly transactional and one-directional: humans provide input when engineering an AI system that then operates without further guidance.
For a given scenario, the engineer translates a problem definition into a protocol for the types of data and labels needed to train a model that would (hopefully) solve the problem. After this protocol is established, other humans generate these labels during a process called annotation. The labels are then transferred back to the engineer, who trains the model, which then runs autonomously. Although the particulars of this “data work” may change over time (some argue annotation is dead or dying), the essence is the same.
Even the recent advances in LLMs do not change the picture; if anything, they compound it in the opposite direction. Here, the burden falls constantly on the human to diagnose whether the output is reasonable and accurate, or a hallucination. Sure, it’s interesting, but not collaborative.
In this blog, I thought I’d summarize a paper I wrote with my former student Stephan Lemmer that directly addresses a key problem in human+AI collaboration: when should the AI system trust the guidance it receives from a human during the course of a collaborative problem-solving effort? This paper, entitled “Ground-truth or DAER: Selective Re-query of Secondary Information,” was published at the IEEE International Conference on Computer Vision in 2021. The original paper is pretty technical; in this post, I target a general audience. If you have questions after reading, please don’t hesitate to ask via a comment or send a message directly to me.
Collaborative Human+AI Inference Problem Settings
The paper focuses on a situation where you have a primary input (e.g., an image or video) and a target output space (e.g., a class label, bounding box coordinates, or keypoints for pose estimation). More importantly, you also have a secondary piece of information, designed to help a machine learning method map the primary input to the target output. When we wrote the paper, we called the secondary input a seed, and hence we call this class of problems “seeded inference” problems. Perhaps the more general term for a seed is now “prompt,” which has been popularized over the last year with the growth of LLMs. The secondary input can take on many forms. In a visual tracking scenario, it’s the initial bounding box of the object to track. In pose estimation, it could be the location of a keypoint on the object, such as the left knee in a human analysis setting. Figure 3 below provides visual examples.
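To make the setting concrete, here is a tiny sketch of the pieces involved. The names below are mine, not from the paper’s code; treat it as an illustration of the interface, nothing more.

```python
from dataclasses import dataclass
from typing import Any

@dataclass
class SeededExample:
    primary: Any   # the primary input, e.g., an image or a video clip
    seed: Any      # the secondary input, e.g., an initial bounding box or a keypoint click
    target: Any    # the desired output, e.g., a class label, box coordinates, or pose keypoints

def task_model(primary, seed):
    """Placeholder for any seeded task model f(primary, seed) -> prediction,
    e.g., a tracker initialized from a box or a pose estimator given a keypoint."""
    raise NotImplementedError
```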
We envision a setup where the secondary information would be provided by a human at the moment of inference. In practice for this paper, the secondary information is part of each dataset we use, without any impact on the meaningfulness of the work; it just made the problem a whole lot easier to study. In this initial work in the human+AI setting, it’s assumed you know what question to ask the user; alternatives could be investigated in future work.
Given a primary input, target output space, and the secondary information from the human, the paper studies the problem of whether or not to reject the secondary information. We call the problem “seed rejection.” This is related to certain other subfields in machine learning, such as selective prediction, but is new in that it only allows the human to interact via the seed — think telling your robot where the dog food is, as opposed to feeding the dog yourself.
If we do not reject it, we can directly run inference on the primary and secondary inputs, such as tracking the object or answering the question. If we do reject it, we ask for a new seed.
Can Human Input Harm Performance?
Before getting to the formulation in the paper, let’s discuss whether or not a seed can even be bad. Although it seems obvious that it can be, let’s be thorough. For example, in a tracking scenario, if the seed — the initial bounding box on the object to be tracked — only partially covers the object and partially covers some background object, it would lead to problems because the visual pattern of the object you care about would be conflated with the background. Or, as Figure 1 at the beginning of the blog shows, in a pose estimation problem on a motorcycle, certain seeds are equivalently good (these are the seeds, or keypoints in the figure, that are in the green region). In contrast, others are bad, such as those in the red region. Notice that some bad seeds are quite close to the actual “gold standard” (best case) seed.
The situation is actually even more complicated. In some cases, the model will yield the “correct” target output independent of the seed. In other cases, the model will yield the “incorrect” target output independent of the seed. What can I say? In practice, machine learning is hard, even when humans are in the loop.
Finding the Bad Seeds
To approach this new problem of seed rejection, the paper proposes a model to determine whether or not the candidate seed will diminish output quality. If it would diminish output quality, then we ask for a new seed.
To train the model that estimates the potential harm the candidate seed would bring, we need to know the gold standard seed. The model learns to predict how much additional error the candidate seed would bring with respect to this gold standard seed during training. At test time, it does so without knowing the gold standard seed. Let’s call this type of model a “rejection model.” As we’ll discuss below, we have implemented different rejection models, for example, based on Vision Transformers or other architectures.
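To make that training target concrete, here is a minimal sketch of the “additional error” quantity described above. The names `task_model` and `error_fn` are my placeholders for whatever task and error metric are in play, and the paper’s exact definition may differ in details (e.g., whether negative values are clamped to zero).

```python
def additional_error(task_model, error_fn, primary, candidate_seed, gold_seed, target):
    """How much worse the task model does with the candidate seed than with
    the gold-standard seed. Zero (or negative) means the candidate is as good
    as the gold standard; large positive values mean a harmful seed."""
    err_candidate = error_fn(task_model(primary, candidate_seed), target)
    err_gold = error_fn(task_model(primary, gold_seed), target)
    return err_candidate - err_gold
```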
The rejection model has two pathways. First, it predicts whether or not the seed is correct (topmost arrow in Figure 4 below). Second, it regresses some expectation of the additional error directly. Separating the two pathways in the model simplifies the learning problem.
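A minimal PyTorch sketch of such a two-pathway model might look like the following. The backbone, feature dimension, and head design are my own placeholders, not the architecture used in the paper.

```python
import torch
import torch.nn as nn

class RejectionModel(nn.Module):
    """Two-pathway rejection model: one head classifies whether the candidate
    seed is acceptable, the other regresses the expected additional error."""

    def __init__(self, backbone: nn.Module, feat_dim: int):
        super().__init__()
        self.backbone = backbone                    # jointly encodes (primary, seed)
        self.correct_head = nn.Linear(feat_dim, 1)  # pathway 1: is the seed correct?
        self.error_head = nn.Linear(feat_dim, 1)    # pathway 2: how much extra error?

    def forward(self, primary, seed):
        feats = self.backbone(primary, seed)
        p_correct = torch.sigmoid(self.correct_head(feats)).squeeze(-1)
        pred_add_err = torch.relu(self.error_head(feats)).squeeze(-1)  # non-negative
        return p_correct, pred_add_err
```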
The two pathways also enable a straightforward training method, as depicted in Figure 4, and sketched in code below. In addition to the rejection model, we need a suitable task model for the problem that can transform the primary input and the seed into a possible target output. Together, the two models directly enable the estimation of both the correctness and the goodness of the seed (i.e., how it performs in the task model). During training, candidate seeds are randomly sampled from the space of possible seeds.
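The sketch below puts the pieces together into a single training step, reusing the `additional_error` helper and `RejectionModel` sketched above. The seed sampler, the particular losses (binary cross-entropy plus smooth L1 here), and the tolerance for calling a seed “correct” are my own choices for illustration; the paper’s exact training recipe differs in its details.

```python
import torch
import torch.nn.functional as F

def sample_candidate_seed(gold_seed, noise_std=0.1):
    # Illustrative sampler for keypoint-style seeds: perturb the gold-standard
    # seed with Gaussian noise. The paper's sampling scheme may differ.
    return gold_seed + noise_std * torch.randn_like(gold_seed)

def training_step(rejection_model, task_model, error_fn, batch, optimizer, tol=0.0):
    primary, gold_seed, target = batch
    candidate_seed = sample_candidate_seed(gold_seed)

    with torch.no_grad():  # the task model is fixed; it only provides training targets
        add_err = additional_error(task_model, error_fn, primary,
                                   candidate_seed, gold_seed, target)

    p_correct, pred_add_err = rejection_model(primary, candidate_seed)
    is_correct = (add_err <= tol).float()              # pathway 1 label: seed correctness
    loss = F.binary_cross_entropy(p_correct, is_correct) \
         + F.smooth_l1_loss(pred_add_err, add_err.clamp(min=0.0))  # pathway 2: regression

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```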
The paper refers to this two-pathway-based training method as Dual-loss Additional Error Regression. After training, the learned rejection model is retained along with the preexisting task model. At execution time, given a primary input and a seed, the rejection model is used to assess whether or not to reject the seed. In practice, this output is continuous on the interval between zero and one; hence, one must threshold it. Furthermore, as Figure 5 below shows, if the candidate seed is rejected, then the replacement seed is fed to the task model after the re-query happens (there are other things that could be done with a sequence or set of seeds, but that is not discussed in this paper).
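At execution time, the flow in Figure 5 can be sketched roughly as follows. The threshold value, the maximum number of re-queries, and the `request_new_seed` callback (standing in for asking the human again) are all illustrative assumptions, and I threshold the correctness pathway for concreteness; the actual rejection score could be built from either pathway.

```python
def infer_with_rejection(rejection_model, task_model, primary, seed,
                         request_new_seed, threshold=0.5, max_requeries=3):
    """Accept or reject the candidate seed; on rejection, re-query the human
    for a replacement before running the task model (single example assumed)."""
    for _ in range(max_requeries):
        p_correct, _ = rejection_model(primary, seed)
        if p_correct.item() >= threshold:   # continuous score in [0, 1], thresholded
            break                           # accept this seed
        seed = request_new_seed(primary)    # reject: ask the human for a new seed
    return task_model(primary, seed)
```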
How Well Does It Work?
Seed rejection is applicable in a variety of settings involving human+AI teams. In the paper, it is tested on viewpoint estimation and hierarchical scene classification.
Viewpoint estimation is the problem of estimating the relative pose of the object in an image given the image and a semantically informative keypoint location. For example, in a vehicle viewpoint estimation problem, this would be a snapshot of the vehicle and, say, a click on the rear-left wheel. My research group originally introduced this problem in an ICCV 2017 paper; the task model is taken directly from this earlier paper. This problem is useful for constructing geometrically augmented datasets from YouTube videos of rare vehicular events, such as crashes or near misses.
On the viewpoint estimation task, the proposed seed rejection model is more than 5x better than a random rejection decision and 3.2x better than using the direct task model output, which is the best-in-class method for selective prediction. The metric used here (and below) is the area under the mean additional error curve, which summarizes how good the additional error estimate is as a function of how many seeds are rejected. If we accept many seeds — a situation the paper calls “high coverage” — while maintaining a low mean additional error, we are in a good situation: the rejection decisions are keeping the good seeds and discarding the harmful ones.
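For readers who want the metric in code, the sketch below is my own reading of the area under the mean additional error curve, not the paper’s reference implementation: rank seeds by the rejection score, sweep the coverage level, compute the mean additional error over the accepted seeds at each level, and integrate.

```python
import numpy as np

def area_under_mean_additional_error(reject_scores, additional_errors):
    """reject_scores: higher means the rejection model would rather reject the seed.
    additional_errors: the true additional error of each seed.
    Lower area is better: harmful seeds get rejected before good ones."""
    order = np.argsort(reject_scores)                     # accept most-trusted seeds first
    errs = np.asarray(additional_errors, dtype=float)[order]
    n = len(errs)
    coverage = np.arange(1, n + 1) / n                    # fraction of seeds accepted
    mean_add_err = np.cumsum(errs) / np.arange(1, n + 1)  # mean error over accepted seeds
    return np.trapz(mean_add_err, coverage)
```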
Figure 6 below shows plots of the ground truth (“geodesic”) error for various keypoint locations (top row) against the predicted error from the trained rejection model (bottom row); upon inspection, the model clearly captures the signal.
The paper also tests this approach on hierarchical scene classification, which seeks to determine a fine-grained classification of a scene given an image and a coarse category. For example, if the image is of a “ballroom” or a “coffee shop,” then the coarse category could be “indoor” or “store,” respectively. The paper uses the well-known SUN397 dataset for this scenario and a straightforward ResNet18-variant task model. As in viewpoint estimation, the performance here is strong, demonstrating a 3.8x improvement over random selection and a 2.7x improvement over softmax on the fine-grained classification model response.
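For context on that softmax comparison, the usual selective-prediction recipe is to reuse the task model’s own confidence as the rejection score; the sketch below reflects my reading of that kind of baseline, not the paper’s exact implementation.

```python
import torch

def softmax_rejection_score(fine_grained_logits: torch.Tensor) -> torch.Tensor:
    """Baseline rejection score from the task model's own output: a low maximum
    softmax probability means low confidence, so the seed is more likely rejected."""
    probs = torch.softmax(fine_grained_logits, dim=-1)
    confidence = probs.max(dim=-1).values   # standard selective-prediction confidence
    return 1.0 - confidence                 # higher score = more likely to reject
```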
Figure 7 above plots the area under the mean additional error curve for the hierarchical scene classification task. Recall from above that lower is better: a lower curve implies the accepted seeds are more valuable, so we can accept more of them.
Closing
Understanding how to effectively model human+AI teams will be increasingly important in the coming years, especially as the capabilities and naturalness of such teams increase. Most work I am aware of in HCI focuses on the user experience, but as interfaces become more natural and capable, the need to study a more general collaboration between human and machine grows. Humans can prompt the AI for information, as we do today with LLMs. The AI can prompt the human for clarification or help — humans are better at adapting to general tasks than AI systems are. Or the human and the AI can truly collaborate to work better together; perhaps, in the spirit of Bo Schembechler, it’s all about the team.
This blog reviewed a paper from my research group that defines the notion of seed rejection. It is relevant in scenarios where we have a specific task model for a certain AI capability, such as visual question answering, that is supplemented by a human-provided “seed” that may or may not be useful. The proposed rejection models provide a principled mechanism for deciding whether to use that seed or to ask the human a second time for a new one.
This is the earliest work I know that considers such a challenging scenario in human+AI teaming. However, it barely scratches the surface of this interesting space. My group has published follow-on work that I’ll report on in future weeks, and I hope to read relevant work from others too. If you know of related works, please mention them in the comments!
Acknowledgements
Thanks to Stephan Lemmer who played a significant role in the primary paper discussed here, and Ryan Szeto who ventured into the earlier work on keypoint-guided viewpoint estimation to begin our journey into seeded inference. And, thank you to my colleagues Harpreet Sahota, Jacob Marks, Dan Gural, and Michelle Brinich for reading early versions of this essay and providing insightful feedback.
Biography
Jason Corso is Professor of Robotics, Electrical Engineering and Computer Science at the University of Michigan and Co-Founder / Chief Science Officer of the AI startup Voxel51. He received his PhD and MSE degrees at Johns Hopkins University in 2005 and 2002, respectively, and a BS Degree with honors from Loyola University Maryland in 2000, all in Computer Science. He is the recipient of the University of Michigan EECS Outstanding Achievement Award 2018, Google Faculty Research Award 2015, Army Research Office Young Investigator Award 2010, National Science Foundation CAREER award 2009, SUNY Buffalo Young Investigator Award 2011, a member of the 2009 DARPA Computer Science Study Group, and a recipient of the Link Foundation Fellowship in Advanced Simulation and Training 2003. Corso has authored more than 150 peer-reviewed papers and hundreds of thousands of lines of open-source code on topics of his interest including computer vision, robotics, data science, and general computing. He is a member of the AAAI, ACM, MAA and a senior member of the IEEE.
Disclaimer
This article is provided for informational purposes only. It is not to be taken as legal or other advice in any way. The views expressed are those of the author only and not his employer or any other institution. The author does not assume and hereby disclaims any liability to any party for any loss, damage, or disruption caused by the content, errors, or omissions, whether such errors or omissions result from accident, negligence, or any other cause.
Copyright 2024 by Jason J. Corso. All Rights Reserved.
No part of this publication may be reproduced, distributed, or transmitted in any form or by any means, including photocopying, recording, or other electronic or mechanical methods, without the prior written permission of the publisher, except in the case of brief quotations embodied in critical reviews and certain other noncommercial uses permitted by copyright law. For permission requests, write to the publisher via direct message on X/Twitter at JasonCorso.