
Sergey Nikolaev

Posted on • Originally published at manticoresearch.com

Monitor Manticore Search in Grafana with One Command

The most annoying kind of incident is when a database doesn’t go down completely: it just gets slower.

Users start noticing it right away. Complaints come in. Everything is technically still running, but clearly something is off.

And that is usually the hardest part: not noticing the problem, but figuring out what is actually happening.

When everything looks fine, but search is still slow

Let’s take a pretty normal scenario.

Search starts slowing down. It is not crashing. It is not returning obvious errors. The service is up. From the outside, nothing looks broken in a dramatic way.

But users can feel it.

So you open your monitoring:

  • CPU looks fine.
  • Average latency does not look too bad.
  • No obvious alerts.

At first glance, nothing really explains the slowdown.

So you keep digging...

You check the queue. Nothing jumps out immediately.
You look at worker utilization. The workers are busy, but not in a way that tells you much on its own.
You check the logs. Still nothing obvious.

And after a while you get to that frustrating point where you realize you have already checked the usual things, and you still do not know where the problem is.

Each metric, by itself, looks more or less okay. But together, the system is clearly degrading.

So now you are no longer following a clear line of investigation. You are just checking everything you can think of and hoping the pattern shows up.
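Before any dashboard exists, that kind of digging usually means querying Manticore directly over its SQL interface. A rough sketch of the loop, assuming a local instance with the SQL protocol on the default port 9306 (the exact counters available depend on your Manticore version):

```shell
# Ad-hoc triage against a local Manticore instance (SQL protocol, default port 9306).
# SHOW STATUS exposes server-wide counters (queue, workers, uptime);
# SHOW THREADS shows what each worker is executing right now, which
# helps spot a single long-running query hogging execution.
mysql -h127.0.0.1 -P9306 -e "SHOW STATUS"
mysql -h127.0.0.1 -P9306 -e "SHOW THREADS"
```

Each command answers one narrow question, which is exactly why the manual approach is slow: connecting the answers into one story is still on you.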

Meanwhile, time is passing.

What was actually going on

A couple of hours later, the picture finally starts to make sense.

It turns out:

  • the request queue has been slowly growing;
  • workers have been sitting near 100% utilization;
  • one heavy query keeps blocking execution from time to time;
  • p99 latency is much worse than the average suggests;
  • and one of the nodes restarted recently.

So the signals were there all along.

The problem was that they were scattered across different places, and it took too long to connect them into one clear story.

The solution: see the whole picture right away

Instead of spending hours piecing all of that together by hand, it is much better to have one place where the important signals are already visible.

That is why we put together a ready-to-use dashboard for Manticore Search that starts with a single Docker command. It comes with Grafana, Prometheus, a preconfigured data source, and built-in alerts.

```shell
docker run -p 3000:3000 manticoresearch/dashboard
```
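Once the container is up, one quick way to confirm Grafana is serving (this uses Grafana's standard health endpoint; the port matches the `-p` mapping above):

```shell
# Grafana's built-in health check; returns a small JSON document
# once the server is ready to accept requests
curl -s http://localhost:3000/api/health
```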

Environment variables

The container supports two environment variables:

  • MANTICORE_TARGETS - comma-separated list of Manticore Search instances (default: localhost:9308)
  • GF_AUTH_ENABLED - set to true to enable Grafana login (by default, anonymous admin access is enabled)

Example:

```shell
docker run -p 3000:3000 \
  -e MANTICORE_TARGETS=your-host:9308 \
  manticoresearch/dashboard
```
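To require a login instead of the default anonymous admin access, combine the second variable with the first:

```shell
docker run -p 3000:3000 \
  -e MANTICORE_TARGETS=your-host:9308 \
  -e GF_AUTH_ENABLED=true \
  manticoresearch/dashboard
```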

If you monitor multiple nodes, pass them as a comma-separated list:

```shell
docker run -p 3000:3000 \
  -e MANTICORE_TARGETS=node1:9308,node2:9308,node3:9308 \
  manticoresearch/dashboard
```
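If the node list lives in a script, the comma-separated value can be built programmatically. A minimal sketch (the hostnames are placeholders; the port assumes the default HTTP API on 9308):

```shell
# Build the MANTICORE_TARGETS value from a whitespace-separated host list:
# append ":9308," to each host, then strip the trailing comma.
hosts="node1 node2 node3"
targets=$(printf '%s:9308,' $hosts | sed 's/,$//')

docker run -p 3000:3000 \
  -e MANTICORE_TARGETS="$targets" \
  manticoresearch/dashboard
```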

If Manticore is running on a remote server

By default, the dashboard expects Manticore at localhost:9308. If your instance is running on a remote machine, the simplest option is SSH port forwarding:

```shell
ssh -L 9308:localhost:9308 user@your-server
```

After that, local connections to localhost:9308 will be forwarded to the remote server, so the dashboard can connect without additional changes.
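One caveat, depending on how your Docker networking is set up: the dashboard itself runs inside a container, so `localhost` inside the container is not the host machine where the SSH tunnel terminates. If the default setup cannot reach the tunnel, one common workaround is to point `MANTICORE_TARGETS` at the host via `host.docker.internal` (resolves out of the box on Docker Desktop; on Linux it needs the extra `--add-host` flag shown below):

```shell
# --add-host maps host.docker.internal to the host gateway on Linux;
# Docker Desktop resolves the name without it.
docker run -p 3000:3000 \
  --add-host=host.docker.internal:host-gateway \
  -e MANTICORE_TARGETS=host.docker.internal:9308 \
  manticoresearch/dashboard
```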

A minute later, you have a usable overview of your system.

Not just a pile of graphs, but a dashboard that helps you quickly answer the questions you actually care about when something feels wrong.

You can see queue growth, worker saturation, latency, process state, and query behavior in one place, instead of bouncing between tools and trying to stitch the story together in your head.

What the dashboard shows

The value here is not that there are a lot of panels. The value is that the panels answer the right questions quickly.

The first place to look is the overall system view:

This gives you the basic picture right away:

  • is the service up;
  • has it restarted recently;
  • is there queue pressure;
  • are workers already under load.

If this row looks healthy, maybe the issue is narrow and local. If it does not, you know right away that the system is under real pressure.

Then you move to load and query behavior:

This is where you can quickly see:

  • whether work is starting to pile up;
  • whether workers are saturated;
  • whether latency is getting worse, especially p95 and p99;
  • whether one slow thread is causing a disproportionate amount of trouble.

And if you need more context, you can drill down into the rest of the dashboard:

  • cluster state;
  • tables and data.

At that point, you are no longer looking at disconnected metrics. You are looking at the system as a whole.

Why this matters

A situation that used to cost you a couple of hours just to understand now usually takes a few minutes to point you in the right direction.

  • You can see that the queue is growing.
  • You can see that workers are pinned.
  • You can see that p99 is climbing.
  • You can see that one node restarted.
  • You can see that one query is probably doing most of the damage.

That does not mean the dashboard magically fixes the issue for you.

What it does do is remove the slowest part of the whole process: figuring out where to look.

And in practice, that is often the difference between spending two hours trying to understand the incident and spending five minutes getting to the real problem.
