Discussion on: ⏳ Timing database calls to 🐘PostgreSQL or 🚀YugabyteDB (strace)

View post

Thanks for the interesting post.

Is it safe to run strace (like in this post) in Production ?

I know it is 'Do At Your Own Risk' in an Oracle production. Does it carry the 'same level' of risk (or) is it 'more safer' in Postgresql / Yugabyte context ?

Franck Pachot YugabyteDB • Jul 20 '22

That's a great question. Everything you do on production, especially on Friday, has a risk- But not fixing a performance problem is also a risk 😉
About strace:

it can slow the process but you run it for a short time, and when you have a problem anyway
the risk to crash the process is low in recent Linux kernels
run it in a prepod environment before, even if you don't reproduce he problem there, at least you see if strace hurts
if you attach to a session backend, the risk is limited to one connection. I would not attach strace PostgreSQL logger or checkpointer in production without good reason
in YugabyteDB even if you crash a server, the application is still available. If you drain tablets (by blacklisting the node) and connections from one node (that you may add for this purpose) you can still strace what happens in YSQL without impacting the others

it is 'Do At Your Own Risk' in an Oracle production. Does it carry the 'same level' of risk

I've used a lot strace on Oracle productions (but not with old linux kernels - there were some bugs). Same remark about the user process vs lgwr or dbwr. And only when I cannot reproduce on another environment and when safer troubleshooting methods are not enough.

@fritshoogland may comment on this as well

Frits Hoogland • Jul 20 '22

I agree with @franckpachot's comments.

A view from my perspective is the following:

The strace utility on linux uses the ptrace facility on linux. You can read the link if you are interested in more details. My experience too is that in general, in situations where there's no excessive usage of any kind on the system it is used, strace does not cause kernel issues.

However, the influence of ptrace on the process or thread from the perspective is significant from a performance perspective. In other words: the process or thread that is accessed via ptrace using strace in most cases has its performance been significantly reduced, because ptrace is controlling execution, and yields execution in order to serve the strace-ing process alias the controlling process data.

When the strace utility is terminated, the ptrace connection is stopped, and the process or thread that was attached to is back to normal execution.

Please mind the wording 'does not cause kernel issues'. If the way a process or thread is changed such as with ptrace, this may cause issues in for the executable that is relying on certain behaviour or timing for this process or thread, which the ptrace control might have an influence on, and thus could cause issues in user space.

This is the reason that people say 'use at your own risk in oracle production': especially if you attach to critical oracle background processes, the change in behaviour might cause issues, and especially if you attach to processes such as the log writer or the database writer for investigating issues of slow IO, this has a high potential to exaggerate the issue, whilst you might think you are just observing.

In this sense using strace with postgres or YugabyteDB is not different, in the sense that is significantly reduces processing speed of the thread or process strace attached to. The difference between using strace with YugabyteDB and Oracle and PostgreSQL is that with YugabyteDB the storage layer is distributed with redundancy, so as Franck pointed out: so if you want to be extremely safe, you can create a tablet server node for the purpose of strace-ing a process or thread only.