Arindam Majumder for Studio1

Scaling Karpathy’s AutoResearch Using Nebius Token Factory

Introduction

We ran about 250 prompt optimization experiments overnight using a small AI agent loop. The idea was simple: let an AI system propose an experiment, run it, evaluate the result, and then try again with a better idea. Instead of manually testing prompts one by one, the system keeps improving its own attempts over multiple iterations.

This idea comes from Andrej Karpathy’s AutoResearch, where an AI agent can automate the typical machine learning research cycle. In a normal workflow, researchers adjust parameters, run experiments, observe the results, and repeat the process many times before reaching a good configuration. AutoResearch shows that this repetitive process can be handled by an intelligent agent.

In this article, we will walk through how we built a cloud-native AutoResearch loop using Nebius Token Factory for LLM inference, allowing the agent to run hundreds of experiments automatically while keeping structured records of every attempt.

Pitfalls in the Original AutoResearch Implementation

Karpathy’s AutoResearch project is a great starting point for understanding how AI agents can run experiments automatically.

The repository mainly consists of simple files such as program.md (which defines the research goal), train.py (which runs the experiment), and a lightweight results log that records experiment outcomes. The agent reads the goal, modifies the experiment code, runs it, and stores the results, demonstrating how an AI system can iterate through experiments automatically.

It was built as a research prototype to demonstrate the concept, not as a full system for running large-scale experiments in real workflows.

As a result, a few limitations arise when we try to scale the idea.

  • Experiment tracking using TSV files: In the original implementation, experiment results are stored in a simple TSV (tab-separated values) file. This file usually contains basic fields like experiment ID, score, and parameters. While this is easy to implement, it becomes difficult to manage as the number of experiments grows.
  • Limited structure for experiment metadata: A flat TSV file does not provide structured storage for experiment details such as prompts, responses, timestamps, or reasoning steps.
  • Local model dependency: The original workflow usually depends on running models locally. This means developers often need access to local GPUs or preconfigured environments to run inference.
  • Difficulty scaling experiment loops: Because of the local infrastructure and simple logging system, running hundreds of experiments in a controlled way becomes harder.

What We Are Going to Build

The use case for this project is prompt optimization. In many real workflows, developers often try multiple prompts manually to get the best response from a language model. This usually involves repeated trial-and-error, where prompts are slightly modified, tested, and evaluated before arriving at a good result. When the number of experiments increases, managing and tracking these attempts becomes difficult.

In this tutorial, we will build a small system that keeps the same AutoResearch idea while addressing a few of those practical limitations.

  • AutoResearch-style experiment loop: We implement the same research cycle where an agent proposes an experiment, runs it, evaluates the result, stores the outcome, and then proposes the next attempt based on previous results.
  • Cloud-based reasoning using Nebius Token Factory: Instead of relying on local models, the agent uses Nebius Token Factory to generate experiment ideas and responses. This removes the dependency on local GPUs and makes the research loop easier to run.
  • Structured experiment tracking using JSON: Instead of logging experiments in a flat TSV file, each experiment is stored as a structured JSON record. This allows us to track prompts, responses, scores, and timestamps more clearly.
  • Prompt optimization as the experiment domain: For this tutorial, the system will try to generate the best explanation for vector databases. Each experiment proposes a different prompt and evaluates how well the response matches our criteria.
  • Running a large experiment loop: The system runs around 250 iterations, allowing the agent to gradually improve prompts by learning from previous experiment results.


Why Nebius Token Factory

To run the AutoResearch loop properly, the agent needs a reliable way to generate experiment ideas and responses. Instead of running models locally, we use Nebius Token Factory as the inference layer.

  • Managed model inference: Nebius Token Factory lets us run open models through an API. We do not need to download models or manage GPUs locally.
  • Access to open models: The platform provides models such as Llama, DeepSeek, and Qwen, which can be used for reasoning, prompt generation, and experiment responses.
  • OpenAI-compatible API: The API follows the same structure used in the OpenAI SDK. This makes integration simple in Python applications.
  • Agent reasoning backend: In our system, the AutoResearch agent calls Nebius Token Factory to analyze previous experiments and propose the next prompt.
  • Supports large experiment loops: Because inference runs in the cloud, we can run hundreds of experiment iterations without worrying about local compute limits.
  • No local GPU requirement: Developers can run the experiment loop directly from their machine while the model inference happens in Nebius infrastructure.

For more demos, refer to the Nebius cookbooks.


Tutorial: Building an AutoResearch-like System with Nebius Token Factory

Step 1 — Clone the Project Repository

  • Clone the repository:

git clone https://github.com/studio1hq/Nebius_autoresearch.git

  • Install the required dependencies:

pip install requests python-dotenv

  • Set up a working environment and review the project structure:
| File | Purpose | What it does |
| --- | --- | --- |
| program.md | Defines the research goal | Contains the task the agent is trying to optimize. In this project, the goal is to generate the best prompt for explaining vector databases. |
| agent.py | Generates the next experiment | Uses Nebius Token Factory to analyze previous experiment results and propose the next prompt to test. |
| experiment.py | Runs the experiment | Sends the generated prompt to Nebius Token Factory and retrieves the model’s response. |
| scorer.py | Evaluates experiment output | Scores the response based on simple rules such as response length and presence of relevant keywords. Returns a numeric score. |
| main.py | Controls the AutoResearch loop | Loads previous experiment history, runs the experiment cycle, logs results, and repeats the process for multiple iterations. |
| results.json | Stores experiment history | Saves structured experiment records including prompt, response, score, and timestamp. Easier to analyze than flat TSV logs. |
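
After cloning, the project layout looks roughly like this:

Nebius_autoresearch/
├── program.md      # research goal
├── agent.py        # proposes the next experiment
├── experiment.py   # runs the experiment
├── scorer.py       # scores the output
├── main.py         # AutoResearch loop
└── results.json    # experiment history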

Step 2 — Configure Nebius Token Factory

Create a Nebius Token Factory API key

Open the Token Factory dashboard, navigate to the API Keys section in the left sidebar, and click Get API key.


Create a new key for your project. This key will be used by the application to authenticate API requests sent to Nebius Token Factory.

Store the API key as an environment variable

Create a .env file in the root of the project and store the key as an environment variable:

NEBIUS_API_KEY=your_api_key
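If you are wiring this up in your own scripts, a minimal sketch of loading the key with python-dotenv (installed in Step 1) looks like this:

import os
from dotenv import load_dotenv

load_dotenv()  # reads the .env file from the project root
api_key = os.environ["NEBIUS_API_KEY"]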


Configure the Token Factory client

Nebius Token Factory provides an OpenAI-compatible API, so we can use the same client structure used for OpenAI integrations.

Initialize the client using the Nebius API base URL and the API key stored in the environment variable.

The request sends the prompt to the selected model (for example, meta-llama/Llama-3.3-70B-Instruct) and receives the generated response.

In this system, these API calls power two parts of the AutoResearch loop:

Agent reasoning: where the agent proposes the next experiment prompt (agent.py)

from typing import List, Dict, Any
from client import generate_response

def propose_experiment(history: List[Dict[str, Any]], goal: str) -> str:
    """
    Proposes a new system prompt based on the research goal and experiment history.

    Args:
        history: List of previous experiment results (dicts with 'prompt' and 'score').
        goal: The research goal string.

    Returns:
        str: The proposed system prompt for the next experiment.
    """
    # Format history for the prompt context
    if not history:
        history_text = "No previous experiments."
    else:
        # Use the last 3 experiments to provide context without overflowing the context window
        recent_history = history[-3:]
        history_text = "\n".join([
            f"Attempt {i+1}:\n"
            f"Prompt: {r.get('prompt', 'Unknown')}\n"
            f"Score: {r.get('score', 0)}/10\n"
            for i, r in enumerate(recent_history)
        ])

    system_instruction = (
        "You are an expert AI researcher optimizing a system prompt to achieve a specific goal.\n"
        "Analyze the previous attempts and their scores. Identify what worked and what didn't.\n"
        "Then, generate a NEW, improved system prompt that is likely to achieve a higher score.\n"
        "Do not repeat previous prompts. Be creative and precise.\n\n"
        "Format your response exactly as follows:\n"
        "THOUGHT: <your analysis and plan>\n"
        "PROMPT: <the actual system prompt text>"
    )

    user_message = (
        f"Research Goal:\n{goal}\n\n"
        f"Experiment History:\n{history_text}\n\n"
        "Based on the above, generate the next system prompt to test.\n"
        "Ensure you include your thought process before the prompt."
    )

    # Send the combined context to Nebius and extract the proposed prompt
    response = generate_response(f"{system_instruction}\n\n{user_message}")
    if "PROMPT:" in response:
        return response.split("PROMPT:", 1)[1].strip()
    return response.strip()
  • The agent builds a context using the goal + past experiments.
  • This prompt is sent to Nebius using generate_response().
  • The model suggests the next experiment (new prompt).
  • This is where the “thinking” of the agent happens.

Experiment execution: where the prompt is sent to the model and the response is generated for scoring (experiment.py)

from client import generate_response

def run_experiment(system_prompt):
    """
    Runs the experiment by using the proposed system prompt to explain vector databases.
    """
    test_prompt = f"{system_prompt}\n\nExplain vector databases. Respond in under 120 words."
    return generate_response(test_prompt)
  • The generated prompt is sent again to Nebius.
  • The model produces the actual output for evaluation.
  • This response is later scored and stored.
  • This is the execution step of the experiment.
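
The evaluation itself lives in scorer.py, which applies simple rules such as response length and keyword presence. A minimal sketch of that kind of scorer is shown below; the function name score_response and the exact rules are assumptions, not necessarily what the repo ships:

def score_response(response: str) -> int:
    """Rule-based scoring sketch: keyword checks plus a length check, capped at 10."""
    score = 0
    # Reward responses that mention relevant vector-database concepts
    keywords = ["embedding", "similarity", "vector", "index"]
    score += sum(2 for kw in keywords if kw in response.lower())
    # Reward concise answers (the experiment asks for under 120 words)
    if response and len(response.split()) <= 120:
        score += 2
    return min(score, 10)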

With this configuration in place, the AutoResearch loop can use Nebius Token Factory as its inference backend.

For example, the raw HTTP version (refer to client.py). The snippet in the repo is excerpted from a function; a self-contained sketch of generate_response built around it looks like this:

import os
import requests

def generate_response(prompt: str) -> str:
    api_key = os.environ.get("NEBIUS_API_KEY")
    if not api_key:
        raise ValueError("NEBIUS_API_KEY environment variable not set")

    # Nebius Token Factory OpenAI-compatible endpoint
    url = "https://api.tokenfactory.nebius.com/v1/chat/completions"

    headers = {
        "Content-Type": "application/json",
        "Authorization": f"Bearer {api_key}"
    }

    payload = {
        "model": "meta-llama/Llama-3.3-70B-Instruct",
        "messages": [
            {"role": "user", "content": prompt}
        ],
        # Optional parameters for generation control
        "temperature": 0.7,
        "max_tokens": 4096
    }

    # Send the request and return the assistant message text
    response = requests.post(url, headers=headers, json=payload, timeout=120)
    response.raise_for_status()
    return response.json()["choices"][0]["message"]["content"]

Alternatively, you can use the OpenAI SDK directly. Because Nebius Token Factory exposes an OpenAI-compatible API, the same client patterns used in OpenAI-based applications work here with minimal changes, making integration into existing workflows straightforward.

import os
from openai import OpenAI

client = OpenAI(
    base_url="https://api.tokenfactory.nebius.com/v1",
    api_key=os.environ.get("NEBIUS_API_KEY"),  # read the key from the environment, not the literal string
)

completion = client.chat.completions.create(
    model="meta-llama/Llama-3.3-70B-Instruct",
    messages=[{
        "role": "user",
        "content": "What is the answer to all questions?"
    }]
)
print(completion.choices[0].message.content)

Step 3 — Understanding the AutoResearch Loop

The system follows a simple research loop where the agent proposes experiments, evaluates the results, and improves the next attempt based on previous runs.


  1. Load the research goal: The system reads the task defined in program.md. In this case, the goal is to generate the best prompt for explaining vector databases.
  2. Load previous experiment history: The program reads results.json to understand what prompts were already tested and how they performed.
  3. The agent proposes the next experiment: The agent uses Nebius Token Factory to analyze the previous results and suggest a new prompt to try.
  4. Run the experiment: The new prompt is sent to the model through the Token Factory API, and the response is generated.
  5. Score the result: The response is evaluated using the scoring function defined in scorer.py.
  6. Store the experiment record: The prompt, response, score, and timestamp are saved in results.json.
  7. Repeat the loop: The system continues this cycle for around 250 experiment iterations, allowing the agent to gradually improve the prompts.
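
Putting these steps together, a simplified sketch of the loop in main.py could look like the following. The score_response import matches the scorer sketch above and is an assumption, not necessarily the repo's exact API:

import json
import time

from agent import propose_experiment
from experiment import run_experiment
from scorer import score_response  # assumed helper name

# 1. Load the research goal
with open("program.md") as f:
    goal = f.read()

# 2. Load previous experiment history (resume if results.json exists)
try:
    with open("results.json") as f:
        history = json.load(f)
except FileNotFoundError:
    history = []

for i in range(len(history), 250):
    prompt = propose_experiment(history, goal)   # 3. propose the next experiment
    response = run_experiment(prompt)            # 4. run the experiment
    score = score_response(response)             # 5. score the result

    # 6. Store the experiment record
    history.append({
        "iteration": i + 1,
        "prompt": prompt,
        "response": response,
        "score": score,
        "timestamp": time.strftime("%Y-%m-%dT%H:%M:%S"),
    })
    with open("results.json", "w") as f:
        json.dump(history, f, indent=2)

Because the loop starts at len(history), restarting the script resumes from the last completed experiment rather than starting over, which is the resume behavior described in Step 4 below.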

Step 4 — Run the System

Command:

python main.py


Terminal output shows:

  • the iteration number
  • the proposed prompt
  • the generated response
  • the score

Every 25 experiments, the system prints a short report in the terminal showing the number of experiments completed, the best score so far, and the prompt that produced the best result. This helps track how the agent is improving over time.
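
Inside the loop, that report can be produced with a few lines like these (a sketch; the repo's exact output format may differ):

# Print a short progress report every 25 experiments
if (i + 1) % 25 == 0:
    best = max(history, key=lambda r: r.get("score", 0))
    print(f"--- {i + 1} experiments completed ---")
    print(f"Best score so far: {best['score']}/10")
    print(f"Best prompt: {best['prompt']}")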

Each experiment is stored as a structured record in results.json. The record contains the prompt used, the response generated by the model, the score assigned by the evaluation function, and the timestamp of the run. This structured format makes it easier to inspect and analyze experiment history later.
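
An individual record looks roughly like this (values are illustrative):

{
  "iteration": 42,
  "prompt": "You are a database educator. Use one concrete analogy and plain language.",
  "response": "A vector database stores embeddings: numeric representations of meaning...",
  "score": 7,
  "timestamp": "2025-01-15T02:13:44"
}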


Instead of starting from scratch every time, the system loads the existing results.json file when the program starts. It detects the last completed experiment and continues from the next iteration. This allows the experiment loop to resume without losing previous results.

Key Takeaways

AutoResearch shows how an AI agent can run experiments automatically by proposing ideas, testing them, and learning from the results.

In this tutorial, we extended that idea by using Nebius Token Factory for cloud-based model inference and structured experiment logging for better tracking. This makes the research loop easier to run, easier to observe, and more practical for developers experimenting with AI workflows.

If you want to try similar experimentation workflows, you can start with Nebius Token Factory to run open models through a simple API without managing GPUs. The broader Nebius Cloud platform also provides GPU infrastructure, scalable inference, and tools for building AI applications.

Explore the available services, experiment with different models, and build your own AI-driven systems using the Nebius ecosystem.
