DEV Community



OpenAI Codex - The Model behind GitHub Copilot

OpenAI has released the research paper,

"Evaluating Large Language Models Trained on Code"

It introduces Codex, an autoregressive language model that builds on the third-generation Generative Pre-trained Transformer (GPT-3) and is far better at writing code.

Codex outperforms GPT-3 on coding tasks because it was trained on a dataset with a much larger concentration of public source code from GitHub.

Codex has been fine-tuned on publicly available code from GitHub, and the paper studies its Python code-writing capabilities.

An evaluation harness for the HumanEval problem-solving dataset from the paper is also available in OpenAI's GitHub repository.
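To make the idea of an evaluation harness concrete, here is a minimal sketch of checking a generated completion for functional correctness. It is not OpenAI's harness. The function name `passes_tests` and the toy problem are hypothetical, and the field names (`prompt`, `test`, `entry_point`) are assumptions modeled on the HumanEval problem format. It also runs code with a bare `exec`, which a real harness should never do without sandboxing and timeouts.

```python
def passes_tests(prompt: str, completion: str, test: str, entry_point: str) -> bool:
    """Return True if prompt + completion passes the problem's unit tests.

    Hypothetical helper: assumes `test` defines a function check(candidate)
    that raises (for example via assert) when the candidate is wrong.
    """
    # Stitch the function signature, the model's body, and the tests together.
    program = prompt + completion + "\n\n" + test + "\n\n" + f"check({entry_point})\n"
    namespace = {}
    try:
        # WARNING: executes untrusted code in-process; only for illustration.
        exec(program, namespace)
    except Exception:
        return False
    return True


# Toy problem (made up for this sketch, not taken from HumanEval).
prompt = 'def add(a, b):\n    """Return the sum of a and b."""\n'
test = (
    "def check(candidate):\n"
    "    assert candidate(2, 3) == 5\n"
    "    assert candidate(-1, 1) == 0\n"
)

good_completion = "    return a + b\n"
bad_completion = "    return a - b\n"

print(passes_tests(prompt, good_completion, test, "add"))
print(passes_tests(prompt, bad_completion, test, "add"))
```

The key design point is that a completion is judged by whether it passes tests, not by how closely its text matches a reference solution.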

"A distinct production version of Codex powers GitHub Copilot," says the paper.

This means that a separate production variant of Codex, also trained on code from GitHub repositories, is what runs behind the GitHub Copilot project, rather than the exact models evaluated in the paper.


The paper also describes curating about 10,000 problems from competitive programming sites, along with problems drawn from open source projects that use Continuous Integration, as additional training data for fine-tuning the model.

The paper concludes,

"We found that our models displayed strong performance on a dataset of human-written problems with difficulty level comparable to easy interview problems"

About data collection, the paper says,

"Our training dataset was collected in May 2020 from 54 million public software repositories hosted on GitHub, containing 179 GB of unique Python files under 1 MB. We filtered out files which were likely auto-generated, had average line length greater than 100, had maximum line length greater than 1000, or contained a small percentage of alphanumeric characters. After filtering, our final dataset totaled 159 GB"
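The filtering rules in that quote can be sketched as a small Python predicate. This is only an illustration under stated assumptions, not the authors' pipeline. The function name `keep_file` is hypothetical, the 25% alphanumeric threshold is an assumed value (the quote only says "a small percentage"), 1 MB is taken as 1,000,000 bytes, and the "likely auto-generated" check is left out because the quote does not describe it.

```python
def keep_file(
    source: str,
    max_avg_line_len: int = 100,       # from the quote
    max_line_len: int = 1000,          # from the quote
    min_alnum_fraction: float = 0.25,  # ASSUMPTION: the paper gives no number
    max_bytes: int = 1_000_000,        # ASSUMPTION: "1 MB" read as 10**6 bytes
) -> bool:
    """Return True if a Python file passes the size and line-length heuristics."""
    # Empty files and files of 1 MB or more are dropped.
    if not source or len(source.encode("utf-8")) >= max_bytes:
        return False

    lengths = [len(line) for line in source.splitlines()]
    if not lengths:
        return False

    # Average line length greater than 100 characters.
    if sum(lengths) / len(lengths) > max_avg_line_len:
        return False

    # Any single line longer than 1000 characters.
    if max(lengths) > max_line_len:
        return False

    # Too small a share of alphanumeric characters.
    alnum = sum(ch.isalnum() for ch in source)
    if alnum / len(source) < min_alnum_fraction:
        return False

    return True


print(keep_file("def add(a, b):\n    return a + b\n"))
print(keep_file("x = '" + "a" * 2000 + "'\n"))
```

The first call is ordinary code and passes. The second contains one very long line, which is typical of minified or embedded data, and is rejected.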

The economic impact of Codex will likely show up in programming-related jobs, where it could improve coder productivity, and in the world of competitive programming.

The difficulty of coding questions may increase, or competitive programming may be left to AI entirely, while humans focus on higher-level problem solving instead of writing repetitive code.

Check out the following resources,

Link to the paper :

OpenAI GitHub repository :

Personal Blog @
