Dmitry Glhf

Automating machine learning with AI agents

When solving competitions on Kaggle, you start to notice a pattern. It's easy to create a baseline: load the data, run CatBoost or LightGBM, and get a baseline metric. That takes half an hour. But to get into the top solutions, you need to try dozens of preprocessing options, hundreds of feature combinations, and thousands of hyperparameter sets.

Existing AutoML systems don't help much. They work according to a fixed scenario: they try a predetermined set of algorithms, select the best one according to the metric, and return the result. AutoGluon trains several models and creates a multi-level ensemble, but each run starts from scratch. TPOT generates a pipeline through a genetic algorithm, but does not learn from the mistakes of previous runs.

The main problem is that these systems do not reason. They do not analyze why a particular approach worked or failed. They do not adapt to the specifics of the task. They do not accumulate experience between runs. Each new task is like the first for them.

Humans work differently. If a data scientist sees imbalanced classes, they immediately know that stratification and threshold selection are needed. If they have seen a similar task before, they apply what worked then. If the first attempt fails, they analyze why and try a different approach.

With the advent of language models, it became possible to create a system that works closer to humans. LLMs can analyze data, reason about method selection, and learn from examples. But one model is not enough. It can miss an obvious mistake or get stuck on the wrong approach. We need an architecture that allows the system to check itself and accumulate experience.

The idea with two agents

The first version was simple: one agent receives the data, trains a model, and returns predictions. The problem quickly became apparent: the LLM sometimes skips data checks or forgets to handle missing values.

The solution came from reinforcement learning. In Actor-Critic methods, one agent acts, while the second evaluates these actions. Why not apply this approach to AutoML?

The Actor receives data and a set of tools for analyzing, processing, and training models. It explores the dataset, decides what steps are needed, and generates a solution. The Critic looks at the result from the outside and checks if everything is done correctly. If the Critic finds problems, the Actor receives feedback and tries again.

This architecture solves a key problem: one agent may make mistakes, but two agents with different roles will catch most of them.

Architecture

The diagram shows how the system works. The Actor has access to five specialized MCP servers with tools for working with data and models. The Critic works without tools, analyzing only the Actor's reports. There is an iterative exchange between them: the Actor's solution goes to the Critic for evaluation, and feedback comes back. After each iteration, the experience is stored in memory, from where it is retrieved when working on similar tasks.

Tools for the agent

An LLM can reason on its own, but it needs tools to work with data. I divided them into several categories: data preview, statistical analysis, processing, and model training.

For example, the preview tool returns structured information:

{
    "shape": (150, 5),
    "columns": ["sepal_length", "sepal_width", "petal_length", "petal_width", "species"],
    "dtypes": {"sepal_length": "Float64", "species": "String"},
    "sample": [{"sepal_length": 5.1, "species": "setosa"}, ...]
}

The agent sees the shape, column types, and first rows. This is enough to decide on the next steps.
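
Roughly, such a preview tool might look like this (a simplified sketch assuming pandas; the function name is illustrative, the real tool lives in the data_preview MCP server):

import pandas as pd

def preview_csv(path: str, n_rows: int = 5) -> dict:
    """Return shape, columns, dtypes, and a few sample rows for a CSV file."""
    df = pd.read_csv(path)
    return {
        "shape": df.shape,
        "columns": list(df.columns),
        "dtypes": {col: str(dtype) for col, dtype in df.dtypes.items()},
        "sample": df.head(n_rows).to_dict(orient="records"),
    }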

An important point for data processing: the agent must apply the same transformations to train and test. If a categorical feature was encoded as {"red": 0, "blue": 1} on the training set, the same encoding must be used on the test set. To do this, mappings are saved to JSON files:

import json
from pathlib import Path

mapping_path = Path(output_dir) / f"{column}_mapping.json"
with open(mapping_path, "w") as f:
    json.dump(mapping, f)

This is critical for categorical classification. If the target variable was encoded as numbers, the model will return numbers, but we need the original class labels.
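
Sketched out, the encode/apply/decode trio might look like this (illustrative helpers, not the library's actual functions):

import json
from pathlib import Path

def encode_column(values: list[str], mapping_path: Path) -> list[int]:
    """Build a label -> integer mapping from the training data and persist it."""
    mapping = {label: i for i, label in enumerate(sorted(set(values)))}
    mapping_path.write_text(json.dumps(mapping))
    return [mapping[v] for v in values]

def apply_mapping(values: list[str], mapping_path: Path) -> list[int]:
    """Reuse the saved mapping so the test set gets identical codes."""
    mapping = json.loads(mapping_path.read_text())
    return [mapping[v] for v in values]

def decode_predictions(codes: list[int], mapping_path: Path) -> list[str]:
    """Map integer predictions back to the original class labels."""
    inverse = {v: k for k, v in json.loads(mapping_path.read_text()).items()}
    return [inverse[int(c)] for c in codes]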

Each training tool returns three things: the path to the model, the path to the predictions, and metrics on the training set. Paths are generated with a timestamp and a UUID, so the agent can experiment with multiple algorithms simultaneously without conflicts.
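
The exact return format is internal to the tools, but the idea is roughly this (a sketch; the helper and field names are illustrative):

import uuid
from datetime import datetime
from pathlib import Path

def unique_path(workspace: Path, prefix: str, suffix: str) -> Path:
    """Build a collision-free file name from a timestamp and a short UUID."""
    stamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    return workspace / f"{prefix}_{stamp}_{uuid.uuid4().hex[:8]}{suffix}"

# A training tool can then return something like:
# {
#     "model_path": unique_path(workspace, "catboost", ".pkl"),
#     "predictions_path": unique_path(workspace, "catboost_preds", ".csv"),
#     "metrics": {"f1": 0.93},
# }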

Model Context Protocol

But what happens when you need to add more tools, for example for a specialized domain? Once the number of tools exceeds ten, supporting, managing, and scaling them becomes a problem.

Model Context Protocol, and in particular the FastMCP framework, solves this problem. MCP allows you to package tools into separate servers that the agent calls as needed.

I created five MCP servers: file_operations for working with files, data_preview for CSV previews, data_analysis for statistics, data_processing for transformations, and machine_learning for training models and ensemble predictions.
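
For a sense of scale, a minimal FastMCP server might look something like this (a simplified sketch; the tool name is illustrative and the real servers expose more tools):

from fastmcp import FastMCP
import pandas as pd

mcp = FastMCP("data_analysis")

@mcp.tool()
def missing_value_summary(path: str) -> dict:
    """Report the share of missing values per column of a CSV file."""
    df = pd.read_csv(path)
    return {col: float(df[col].isna().mean()) for col in df.columns}

if __name__ == "__main__":
    mcp.run()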

Evaluation of decisions

The Actor generates a solution in the form of a structured report with four sections: data analysis, preprocessing, model training, and results. The Critic receives this report and analyzes each section separately.

Instead of one LLM judge, I used four specialized ones. The first checks the quality of the data analysis: whether the agent studied the distributions, checked for missing values, and looked for anomalies. The second looks at preprocessing: whether missing values are handled and categories are encoded correctly, and whether there is any data leakage. The third evaluates the choice of model and hyperparameters. The fourth analyzes the results and the overall methodology.

judges = [
    LLMJudge(rubric="Evaluate data_analysis: Is exploration thorough?"),
    LLMJudge(rubric="Evaluate preprocessing: Are steps appropriate?"),
    LLMJudge(rubric="Evaluate model_training: Is selection justified?"),
    LLMJudge(rubric="Evaluate results: Are metrics calculated correctly?"),
]

Each judge returns a score between 0 and 1 with a justification. The average score is compared to the acceptance threshold (usually 0.75). If it is higher, the solution is accepted. Otherwise, the Critic forms feedback from all the judges' comments and passes it to the Actor for the next iteration.

This approach is more stable than a single judge. A single LLM may be too strict or miss an obvious error. Four specialized judges smooth out subjectivity.
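
The aggregation itself is just a few lines. A sketch, assuming each judge returns a score and a justification (the exact result format is an assumption):

ACCEPTANCE_THRESHOLD = 0.75

def aggregate(judge_results: list[dict]) -> tuple[bool, str]:
    """Average the judges' scores and collect their comments as feedback."""
    avg_score = sum(r["score"] for r in judge_results) / len(judge_results)
    feedback = "\n".join(r["justification"] for r in judge_results)
    return avg_score >= ACCEPTANCE_THRESHOLD, feedback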

Isolated workspace

When an agent works with files, it should not be given access to the entire file system. Isolation is necessary. I created a dedicated directory for each session in ~/.scald/actor/ with three subdirectories: data for copies of the source data, output for intermediate files, and workspace for models and predictions.

The source CSV files are copied into the data directory. All tools work only within these directories. The agent cannot accidentally overwrite important files or read someone else's data.
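
Setting up such a workspace is straightforward; here is a sketch of the layout described above (the function name is illustrative):

import shutil
from pathlib import Path

def create_workspace(session_id: str, train_csv: str, test_csv: str) -> dict:
    """Create an isolated session directory and copy the source CSVs into it."""
    root = Path.home() / ".scald" / "actor" / session_id
    dirs = {name: root / name for name in ("data", "output", "workspace")}
    for d in dirs.values():
        d.mkdir(parents=True, exist_ok=True)
    for src in (train_csv, test_csv):
        shutil.copy(src, dirs["data"])
    return dirs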

After completion, all artifacts are copied to a session directory with a timestamp, and the workspace is cleared. You can open this directory later and see exactly what the agent did: which models it trained (they can be loaded from the .pkl files), what metrics it got, and what steps it performed.

Memory and learning

After each iteration, the system saves the experience. The Actor report and Critic evaluation are recorded in the ChromaDB vector database. When the agent receives a new task, the system searches for similar past solutions based on semantic similarity using the Jina embedding model.

self.mm.save(
    actor_solution=actor_solution,
    critic_evaluation=critic_evaluation,
    task_type=task_type,
    iteration=iteration,
)

# Search for similar
actor_memory, critic_memory = self.mm.retrieve(
    actor_report=actor_solution.report,
    task_type=task_type,
    top_k=5,
)

The solutions found are passed to the agent as context. If the system has previously solved a similar classification task with imbalanced data, it will remember what helped then.

Interestingly, it is not only successful solutions that are useful. When the Critic says “you forgot to handle missing values,” that is valuable information for future tasks. Semantic search finds such cases as well.
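
Under the hood, this kind of memory can be as simple as a ChromaDB collection. A rough sketch (for brevity it relies on ChromaDB's default embedding function rather than the Jina model, and the documents are placeholders):

from pathlib import Path
import chromadb

client = chromadb.PersistentClient(path=str(Path.home() / ".scald" / "memory"))
collection = client.get_or_create_collection("experience")

# Store one iteration: the Actor's report, with the Critic's verdict as metadata.
collection.add(
    ids=["session-1-iteration-1"],
    documents=["Trained CatBoost on imbalanced classes without stratification..."],
    metadatas=[{"task_type": "classification", "accepted": False}],
)

# Later: retrieve the most similar past experience for a new report.
similar = collection.query(
    query_texts=["Binary classification, imbalanced target, many missing values..."],
    n_results=5,
)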

Main cycle

When all the components are ready, all that remains is to put them together. The cycle runs until the maximum number of iterations is reached or the Critic accepts the solution. At each iteration, the Actor solves the problem taking feedback into account, the Critic evaluates the solution, the experience is stored in memory, and relevant context is retrieved for the next attempt.
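
Boiled down, the loop looks roughly like this (a sketch with assumed interfaces; the real class and method names may differ):

def run_loop(actor, critic, memory, task, max_iterations: int = 5):
    """One Actor-Critic cycle: solve, evaluate, remember, feed back."""
    feedback, actor_memory = None, []
    for iteration in range(max_iterations):
        # The Actor solves the task, using previous feedback and retrieved memory.
        solution = actor.solve(task, feedback=feedback, memory=actor_memory)

        # The Critic scores the structured report with its panel of judges.
        evaluation = critic.evaluate(solution.report)

        # Every attempt, accepted or not, goes into memory.
        memory.save(solution, evaluation, iteration=iteration)

        if evaluation.accepted:
            break

        # Carry the judges' comments and similar past cases into the next attempt.
        feedback = evaluation.feedback
        actor_memory, _ = memory.retrieve(solution.report, top_k=5)
    return solution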

It is interesting to observe how the Actor learns from feedback. The first iteration is usually simple: basic preprocessing and one model. The Critic finds problems: “you did not check the class balance” or “you did not do feature engineering.” The second iteration is more accurate: the agent adds the missing steps, tries several models, and creates an ensemble.

Problems during development

Encoding the target variable

The first version failed on categorical classification. The agent encoded the target into numbers, trained the model, but forgot to decode the predictions back. The output was numbers instead of class labels.

The solution required explicit instructions in the system prompt:

If you encode target column, you MUST DECODE predictions before returning.
Use decode_categorical_label with the mapping path from encoding step.

Unique file names

When the agent experimented with multiple models, the files overwrote each other. I tried to delegate naming to the agent via the prompt, but the LLM does not always generate unique names. The right solution turned out to be handling this at the tool level with a timestamp and a UUID.

What was the result?

To fully cover the original goals of deep analysis, preprocessing, and training on data, the system still needs further development: more specialized agents and tools. Nevertheless, it already works: running it on a dataset for several iterations produces predictions, and all intermediate results are saved.

I tested the system on tasks from OpenML. On the christine dataset, it reached an F1-score of 0.743, outperforming Random Forest (0.713) by about 4% and surpassing AutoGluon and FLAML, which failed to cope with this task at all. On cnae-9, the result was 0.980 against 0.945 for the best competitor, FLAML, about 3.5% better.

There were also failures. On the Australian dataset, the system showed 0.836, losing to AutoGluon (0.860) and other baseline methods. Interestingly, on blood-transfusion, the result of 0.756 was better than Random Forest (0.712) and AutoGluon (0.734), but worse than FLAML (0.767).

The cost of running varies from $0.14 to $3.43 depending on the complexity of the task and the number of iterations. The running time is unpredictable: from a minute to half an hour.

In fact, the value of the result is not whether the metric is better or worse than that of a particular AutoML framework. The value is that modern LLMs make it possible to automate routine tasks more intelligently, while the modularity of MCP paves the way for an ecosystem of specialized agents. This addresses the original “fixed scenario” problem of classic AutoML: the system can be adapted to any task by simply plugging in the necessary agents, while keeping a single cycle of iterative improvement and experience accumulation.

The limitations are clear. The system is good for tabular data with gradient boosting algorithms. For deep learning or time series, other tools are needed. Quality depends heavily on the size of the underlying LLM.

Usage

For those who want to try it, installation is simple:

pip install scald

You can use it either from the terminal via CLI:

scald --train data/train.csv --test data/test.csv --target price --task-type regression

Or via Python API:

from scald import Scald

scald = Scald(max_iterations=5)
predictions = await scald.run(
    train="data/train.csv",  # .csv or dataframe
    test="data/test.csv",
    target="target_column",
    task_type="classification",
)

To run it, you need an API key from an OpenAI-compatible provider (for example, OpenRouter). You will also need a Jina key for the embeddings in the memory system (the service gives a generous amount of free tokens upon registration).

All code is packaged in a library and available on GitHub.
