OpenAI’s Q-Star (Q*) technology could be a breakthrough in what an AI system can do and how it learns. Versions of Q* were previously implemented in two very different AI innovations. The first was in 1972, as part of an experimental AI question-answering system for theorem proving that used deduction and inference to generate answers to users’ queries. The second was DeepMind’s 2013 definition of Q* for its Deep Q Network (DQN) system, which introduced trial-and-error into AI systems as a method of learning.
If these two use cases were combined, syntactic and semantic language information would be integrated with trial-and-error learning. That combination opens the door for AI systems to learn to solve complex problems by reducing a large problem to a set of solvable subproblems. The 1973 Q* paper discussed this approach to problem reduction in an Appendix.
Forty years after the first publication about Q-Star, DeepMind published its Q* innovation in the Deep Q Network (DQN). This breakthrough was why Google bought DeepMind in 2014 for $650M. Facebook, now Meta, also wanted the Q* technology but was outbid. Within Google, the DeepMind AI research group was positioned to compete with the existing Google Brain AI research group. At that time, the Google Brain group included Ilya Sutskever, who in 2015 cofounded OpenAI and became its lead researcher.
This created a direct business connection between (1) the DeepMind team who published the 2013 description of its Q* algorithm, and (2) Sutskever's OpenAI team and its Q* project mentioned in the leak. There is a high probability that DeepMind's 2013 paper introducing its Q* is related to the 2023 leak about OpenAI's Q-Star project.
Also, the 1972 Technical Report for the AI system that introduced Q-Star describes the system as "intended for studying deductive search methods. Although the work is oriented towards question-answering, MRPPS provides a general problem solving capability." It is speculative, but OpenAI may have combined the two Q-Star approaches with new ideas and created a question-answering system that learns by trial-and-error.
The 1972 Q* Algorithm
The first version of the Q* search algorithm was part of the Maryland Refutation Proof Procedure System (MRPPS). The 1972 Technical Report “describes the current implementation of MRPPS. It describes each of the components and how they are integrated into what has been termed the Q* algorithm.”
In contrast, when the name of OpenAI’s Q-Star project became public through an information leak, no details came with it. Consequently, I go back to basics and describe Q* from its 1972 introduction in MRPPS and a 2013 paper by DeepMind Technologies.
As a question-answering system, MRPPS used both deduction and inference to generate clauses as answers to queries. A reported example query was "Who is Sally's mother's mother-in-law?" The Q-Star algorithm performs the search, and the inference component "performs the logical manipulations necessary to deduce a clause from one or two other clauses."
The numerous quotes in this section are from the 1972 technical report about MRPPS and a 1973 paper, "The Q* Algorithm---A Search Strategy for a Deductive Question-Answering System". The authors of both documents are Minker, Fishman, and McSkimin.
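To make the deductive style concrete, here is a minimal sketch of chaining two inference steps to answer a query like the mother-in-law example above. The facts, names, and rule are hypothetical illustrations of the general idea, not MRPPS's actual clause representation or inference machinery.

```python
# Hypothetical family facts, chosen only to mirror the reported example query
# "Who is Sally's mother's mother-in-law?"
mother  = {"Sally": "Ann", "Bob": "Edna"}   # mother[child]  -> child's mother
husband = {"Ann": "Bob"}                    # husband[wife] -> wife's husband

def mother_in_law(person):
    """Rule: a person's mother-in-law is the mother of that person's spouse."""
    spouse = husband.get(person)
    return mother.get(spouse) if spouse else None

def answer_query(person):
    # Two chained deduction steps: first resolve the mother, then her mother-in-law.
    m = mother.get(person)
    return mother_in_law(m) if m else None

print(answer_query("Sally"))  # -> "Edna" under the hypothetical facts above
```

A real deductive system holds such relationships as clauses and combines them by inference rules rather than dictionary lookups, but the chaining of intermediate conclusions is the same idea.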
The Q-Star (Q*) algorithm was developed because of limitations in the A* and ∑* search strategies. MRPPS's main components are "a set of inference rules, a deduction strategy, and a base clause selection strategy. The last two together comprise the Q-Star search algorithm ... MRPPS is controlled by an executive that communicates with the user as well as with the deduction strategy routines and data base creation and maintenance routines."
The system was "designed to be a flexible as well as computationally efficient system." A user could choose from several inference systems, used alone or in user-defined combinations. The "paramodulation" substitution rule for equalities could also be used to infer new clauses, and there were "various heuristic measures for use during search."
"The Q* algorithm generates nodes in the search space, applying semantic and syntactic information to direct and limit the search." However, as of 1973 the implementation of "Q* does not yet include the semantic filtering of candidates, and no mechanism has been included to take semantic actions." Semantic filtering was evaluated manually, and it eliminated the generation of all but one unnecessary result.
The Appendix of the paper suggests an approach to general problem solving. It applies what had been learned to a reduction process that "is recursively applied to each generated subproblem until the original problem has been reduced to a set of primitive subproblems which are trivial to solve."
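A compact sketch of that recursive reduction, with every helper left as a hypothetical placeholder, could look like this:

```python
def solve(problem, is_primitive, reduce_to_subproblems, solve_primitive, combine):
    """Recursive problem reduction: split a problem until the subproblems are
    trivial, solve the primitives, and combine their solutions back up the tree.
    All four helper callables are hypothetical placeholders."""
    if is_primitive(problem):
        return solve_primitive(problem)
    subsolutions = [
        solve(sub, is_primitive, reduce_to_subproblems, solve_primitive, combine)
        for sub in reduce_to_subproblems(problem)
    ]
    return combine(problem, subsolutions)
```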
DeepMind's Q* Breakthrough
Consider the imaginary world of a simple Atari video game. The computational problem is optimizing the actions of the Agent. The entire world and the number of possible actions at each step are finite and discrete. In probabilistic terms, the Agent is the controller and the world is a controlled probabilistic (Markov) process.
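In code, that controller-and-world framing is just an interaction loop. The `env` and `policy` objects below are hypothetical, with a reset/step interface modeled loosely on the common Gym convention rather than on DeepMind's code.

```python
# Sketch of the agent-environment loop for a finite Markov decision process.
def run_episode(env, policy, max_steps=10_000):
    state = env.reset()
    total_reward = 0.0
    for _ in range(max_steps):
        action = policy(state)                   # the Agent (controller) chooses
        state, reward, done = env.step(action)   # the world (Markov process) responds
        total_reward += reward
        if done:
            break
    return total_reward
```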
In 2012, the DeepMind team built an AI system that learned to play several Atari video games, and to beat a human expert at some of them, solely from the games' raw video frames and scores. The researchers fed in the raw video of game play, and the AI learned to recognize patterns in the game-play pixels. Using this input, the AI learned to control the actions of the Agent within the game to optimize the Agent's rewards.
DeepMind’s DQN system was "the first deep learning model to successfully learn control policies directly from high-dimensional sensory input using reinforcement learning." The AI learned to estimate action values in seven different Atari video games without any change to its model (architecture, learning algorithm, and hyperparameters).
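For reference, a small convolutional Q-network along the lines the 2013 paper reports (stacked 84x84 frames in, one estimated action value per game action out) could be sketched as follows. The exact layer sizes here are a best-effort reconstruction, and PyTorch is used purely for illustration.

```python
import torch.nn as nn

class AtariQNetwork(nn.Module):
    """Small convolutional Q-network roughly following the layer sizes reported
    in DeepMind's 2013 paper; details are a best-effort reconstruction."""
    def __init__(self, num_actions):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(4, 16, kernel_size=8, stride=4), nn.ReLU(),   # 4 stacked 84x84 frames in
            nn.Conv2d(16, 32, kernel_size=4, stride=2), nn.ReLU(),
        )
        self.head = nn.Sequential(
            nn.Flatten(),
            nn.Linear(32 * 9 * 9, 256), nn.ReLU(),
            nn.Linear(256, num_actions),   # one estimated action value per game action
        )

    def forward(self, frames):             # frames: (batch, 4, 84, 84)
        return self.head(self.features(frames))
```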
The researchers identified three major challenges to overcome.
- The time delay between an Agent's actions and the eventual reward makes it difficult to associate the reward with the actions that caused it (which must be remembered) and to assign the reward proportionally across those actions.
- The features are often correlated rather than independent.
- The learning process changes the data distribution continuously during the game.
The DeepMind model was only a convolutional neural net (CNN), and it would presumably become more capable if implemented on a more recent network with more parameters. This Q* is explicitly a variant of the Q-learning algorithm described in detail in 1992 by Watkins and Dayan.
After a series of improvements over several years, the DQN system became Agent57, the first system to outperform the human benchmark (as measured by human-normalized scores) on all 57 Atari games in the benchmark set.
The Original Q-Learning
The original version of Q-learning is reinforcement learning by trial-and-error. It is model-free: the Agent starts with no previous knowledge of its world and never builds an explicit model of it, 'living' in the here-and-now rather than accumulating memories of past episodes. The Agent’s estimates of the values of future choices start as arbitrary values and are then changed by experience. In many Q-learning models, estimates of distant-future values are discounted more than those closer in time to the current state.
From any state (the here-and-now), the Agent moves to a new state by taking an action. The immediate reward from that action and the estimated value of the new state are used to evaluate the goodness of the action. Learning continues until every action in every state has been evaluated by its discounted long-term reward (e.g., winning the game).
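A minimal tabular sketch of that update loop, assuming a hypothetical environment with reset()/step()/actions() methods, looks like this:

```python
import random
from collections import defaultdict

def q_learning(env, episodes=500, alpha=0.1, gamma=0.99, epsilon=0.1):
    """Tabular Q-learning sketch. Value estimates start at zero (effectively
    arbitrary) and are nudged toward the observed reward plus the discounted
    best estimate for the next state. `env` is a hypothetical environment
    exposing reset(), step(action), and actions(state)."""
    Q = defaultdict(float)                          # Q[(state, action)] -> estimated value
    for _ in range(episodes):
        state, done = env.reset(), False
        while not done:
            actions = env.actions(state)
            if random.random() < epsilon:           # explore by trial and error
                action = random.choice(actions)
            else:                                   # exploit the current estimates
                action = max(actions, key=lambda a: Q[(state, a)])
            next_state, reward, done = env.step(action)
            best_next = 0.0 if done else max(Q[(next_state, a)] for a in env.actions(next_state))
            # Q(s,a) <- Q(s,a) + alpha * [r + gamma * max_a' Q(s',a') - Q(s,a)]
            Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])
            state = next_state
    return Q
```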
This primitive form of learning is the basis of much more complex learning, and Watkins and Dayan reference numerous examples with more details. The key point is emphasized in the abstract of their paper: "Q-learning converges to the optimum action-values with probability 1 so long as all actions are repeatedly sampled in all states and the action-values are represented discretely." Once learning has converged, the Agent knows at every decision point which choice will maximize the total reward it receives.
From Q-Learning to Q*
DeepMind created its Q-star (Q*) algorithm as a direct improvement to Q-learning. I have not located any reference linking DeepMind’s Q* work to Minker’s Q* papers and reports from the 1970s. The DQN Q* network was "trained with a variant Q-learning algorithm, with stochastic gradient descent to update the weights." In other words, instead of computing the full expectation in the gradient, the optimization estimates it from individual samples and updates the weights one small step at a time.
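As a sketch, one such update step regresses Q(s, a) toward the target r + gamma * max Q(s', a') on a sampled batch. The network, batch layout, and PyTorch usage below are illustrative assumptions, not DeepMind's implementation.

```python
import torch
import torch.nn.functional as F

def dqn_td_loss(q_net, batch, gamma=0.99):
    """One stochastic-gradient-style loss on a sampled batch: pull Q(s, a)
    toward the target r + gamma * max_a' Q(s', a'). The q_net and batch
    layout are assumptions for illustration."""
    states, actions, rewards, next_states, dones = batch   # actions: int64, dones: 0/1 floats
    q_values = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        next_q = q_net(next_states).max(dim=1).values
        targets = rewards + gamma * next_q * (1.0 - dones)  # no future value at terminal states
    return F.mse_loss(q_values, targets)
```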
DeepMind’s Q* also drew a crucial attribute of its learning model from Lin’s robotics work to create agents that "are adaptive, reactive, and self-supervised.” The addition of “experience replay” allowed the model to learn game strategies from remembered actions: samples of experienced play were stored in a finite memory and then, at a specific point in the algorithm loop, drawn randomly to apply Q-learning updates. After the experience replay step, an epsilon-greedy policy was used to select the Agent’s action.
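A minimal sketch of those two pieces, with assumed capacity and batch sizes, might look like this:

```python
import random
from collections import deque

class ReplayBuffer:
    """Finite memory of experienced transitions; Q-learning updates are applied
    to randomly drawn samples rather than to consecutive (correlated) frames."""
    def __init__(self, capacity=100_000):           # capacity is an assumed value
        self.memory = deque(maxlen=capacity)

    def store(self, state, action, reward, next_state, done):
        self.memory.append((state, action, reward, next_state, done))

    def sample(self, batch_size=32):
        return random.sample(self.memory, min(batch_size, len(self.memory)))

def epsilon_greedy(q_values, epsilon=0.05):
    """With probability epsilon pick a random action; otherwise the best estimate."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])
```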
The importance of Lin’s work in robotics should be considered carefully when discussing Q-learning. Lin wrote: “The aim of this dissertation is to extend the state of the art of reinforcement learning and enable its applications to complex robot-learning problems. In particular, it focuses on two issues."
- How to shorten training time when reinforcement signals are sparse and delayed
- How to make training practical in environments that cannot be assumed to be Markovian
The goal is to shorten the time necessary for training to do realistic tasks. "We can scale up reinforcement learning in various ways, making it possible to build reinforcement learning agents that can effectively acquire complex control policies for realistic robot tasks. … In particular, this dissertation develops various techniques to overcome the slow learning problem, and presents architectures for reinforcement learning in non-Markovian environments."
Summary
I started writing this post expecting to find that Q* was an incremental improvement, such as a combination of Q-learning and the A* search function. What I found changed my mind completely: versions of Q* had already been implemented in two innovative AI systems.
It has been 50 years since Q* was defined for the MRPPS question-answering system, and a decade since a version of Q* was defined and used in DQN’s trial-and-error reinforcement learning. Imagine a brilliant researcher who had 10 years to integrate experience-replay reinforcement learning into training large language models (LLMs) with videos. Now add the new proposal in the sixth chapter of Lin’s dissertation, in which robots abstract hierarchical levels of planning through reinforcement learning, and perhaps integrate the problem-reduction ideas from the 1973 Q* paper.
The result could be a Q* general problem-solving AI system that would be a descendant of MRPPS and Agent57. Greatly increasing its parameters would be feeding it the computer equivalent of steroids. It could efficiently and effectively solve problems that have stumped very intelligent humans in vastly different domains. With a meta-controller as an executive function, we are moving towards an advanced AI that could learn to replicate its base system and improve with each replication.
This view of near-term AI self-improvement could have caused a rift within OpenAI over the safety of a new Q-Star (Q*) technology. Ilya Sutskever has emphasized the alignment of AI with human interests, an approach that sharply contrasts with fast commercialization of AI for profit.