Preamble
A PostgreSQL query can take a very long time. Usually it is easy to stop (cancel) such a statement, but sometimes things go wrong and a query cannot be stopped. This article covers the possible causes. In addition, I'll demonstrate a tricky technique for solving the problem (not for the faint of heart!).
How to cancel a running query
The PostgreSQL protocol provides a way to stop a statement in progress: you open a new connection and send a CancelRequest message containing a secret key. The server transmitted that secret key when the original connection was established. Without the key, anybody could cancel your query, which would be an unacceptable security problem.
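To make that concrete, here is a minimal C sketch of the 16-byte CancelRequest packet, following the format in the PostgreSQL frontend/backend protocol documentation (the helper name build_cancel_request is mine; in practice, your driver builds and sends this message for you):

#include <stdint.h>
#include <string.h>
#include <arpa/inet.h>          /* htonl() */

/*
 * Fill buf with a CancelRequest message: four 32-bit big-endian integers.
 * The request code 80877102 is (1234 << 16) | 5678, a value chosen so that
 * it can never collide with a real protocol version number.
 */
static void
build_cancel_request(uint8_t buf[16], uint32_t backend_pid, uint32_t secret_key)
{
    uint32_t fields[4];

    fields[0] = htonl(16);          /* message length, including itself */
    fields[1] = htonl(80877102);    /* CancelRequest code */
    fields[2] = htonl(backend_pid); /* PID of the target backend */
    fields[3] = htonl(secret_key);  /* key received at connection start */

    memcpy(buf, fields, sizeof(fields));
}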
The C client library libpq provides the functions PQgetCancel() and PQcancel() to cancel queries, and other database APIs should have similar provisions. In an interactive psql session, you can simply hit Ctrl+C to send a cancel request, and GUI clients usually have a button for that purpose.
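For illustration, a cancel helper built on libpq's PQgetCancel() and PQcancel() might look like this minimal sketch (the function name cancel_current_query is mine; error handling is kept short):

#include <stdio.h>
#include <libpq-fe.h>

/* Cancel whatever statement is currently running on "conn". */
static void
cancel_current_query(PGconn *conn)
{
    char errbuf[256];
    PGcancel *cancel = PQgetCancel(conn);   /* copies PID and secret key */

    if (cancel == NULL)
    {
        fprintf(stderr, "could not create cancel object\n");
        return;
    }

    /* opens a new connection and sends the CancelRequest message */
    if (!PQcancel(cancel, errbuf, sizeof errbuf))
        fprintf(stderr, "could not send cancel request: %s\n", errbuf);

    PQfreeCancel(cancel);
}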
You can stop another user's query with the database function pg_cancel_backend(). Another choice is pg_terminate_backend(), which goes one step further and also ends the other database session. To use these functions, you must be a superuser, a member of the default role pg_signal_backend (the next section explains the name), or connected as the same database user as the session you want to torpedo (you are always allowed to cancel your own statements).
How the server responds to a cancel request
The inter-process communication of PostgreSQL relies heavily on signals. When the postmaster process receives a CancelRequest, it sends the signal SIGINT to the backend process of the corresponding database session. This is also what the function pg_cancel_backend() does. pg_terminate_backend() sends the signal SIGTERM.
Each PostgreSQL process has a signal handler that processes these signals when they are received. This signal handler does not stop the backend process right away; instead, it sets global variables: SIGINT sets QueryCancelPending, and SIGTERM sets ProcDiePending. These variables act as flags, and it is the responsibility of the backend process to react to them as soon as convenient. This makes sure that no process is stopped at an inconvenient time, which could corrupt shared memory, for example.
Calls to the CHECK_FOR_INTERRUPTS() macro, which invokes the ProcessInterrupts() function, are sprinkled all over the PostgreSQL code in safe places. Depending on which flag was set, this function will then either throw the error that cancels the current statement or end the backend process.
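To see the pattern in isolation, here is a toy model in plain C of how a flag-setting signal handler and safe-point checks work together. This is only an illustrative sketch of the technique, not PostgreSQL's actual implementation:

#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

/* Flags set by the signal handlers, checked at safe points only. */
static volatile sig_atomic_t QueryCancelPending = 0;
static volatile sig_atomic_t ProcDiePending = 0;

static void handle_sigint(int signo)  { QueryCancelPending = 1; }
static void handle_sigterm(int signo) { ProcDiePending = 1; }

/* Stand-in for CHECK_FOR_INTERRUPTS() / ProcessInterrupts(). */
static void
process_interrupts(void)
{
    if (ProcDiePending)
    {
        fprintf(stderr, "terminating process\n");
        exit(1);                    /* a real backend exits cleanly */
    }
    if (QueryCancelPending)
    {
        QueryCancelPending = 0;
        fprintf(stderr, "canceling current statement\n");
        /* a real backend throws an error back to the main loop here */
    }
}

int
main(void)
{
    signal(SIGINT, handle_sigint);
    signal(SIGTERM, handle_sigterm);

    for (;;)
    {
        process_interrupts();       /* safe point */
        sleep(1);                   /* one unit of interruptible work */
    }
}

The important property is that the handlers themselves only set a flag (an async-signal-safe operation); all the real work happens later, at a point the process knows is safe.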
Reasons why canceling a query may not work
There are several possible reasons:
- Execution is stuck in a loop that does not contain CHECK_FOR_INTERRUPTS(). That would be a PostgreSQL bug, and the fix would be to add another call to the macro.
- Execution is stuck in a third-party C function that was called from the SQL statement. In this case, you should report the issue as a bug to the author of that function.
- Execution is stuck in a system call that cannot be interrupted. Since the delivery of signals is delayed while a process runs in kernel space, this indicates a problem on the operating system or hardware level.
Don’t use kill -9 unless you are desperate
Using “plain” kill on a PostgreSQL backend process is fine: it sends the SIGTERM signal, so it is the same as executing pg_terminate_backend() for that backend. If that has no effect, it is tempting to use kill -9, which sends SIGKILL. That signal cannot be intercepted, and the process dies immediately. However, the postmaster detects that one of its child processes ended in an unclean way. It then kills all other PostgreSQL processes and performs crash recovery, during which the database is offline for a period that can range from seconds to minutes.
If you use kill -9 on a backend, you might cause a short outage, but using it on the postmaster process itself is far worse, and you should never do it. It opens a window during which a new postmaster could be started while some children of the old postmaster are still alive, which is likely to corrupt the data on disk. Never, ever use kill -9 on the postmaster process!
Occasionally, even kill -9 will fail to terminate a PostgreSQL backend. That means the backend is stuck in an uninterruptible system call, for example performing I/O on a network-attached storage device that no longer works. If that condition persists, rebooting the operating system is the only way to get rid of the process.
A trick to cancel a stuck query without crashing the server
Sometimes the following procedure can help you avoid crash recovery and the resulting outage. The example uses the GNU debugger on Linux; you may have to adapt it for other environments.
An example of a hanging function
We write this simple C function (source file loop.c):

#include "postgres.h"
#include "fmgr.h"

#include <unistd.h>

PG_MODULE_MAGIC;

PG_FUNCTION_INFO_V1(loop);

Datum loop(PG_FUNCTION_ARGS)
{
    /* an endless loop */
    while (1)
        sleep(2);
}
Create the shared library as follows (modify the include path as necessary):

gcc -I /usr/pgsql-14/include/server -fPIC -shared -o loop.so loop.c

Then copy the file into the PostgreSQL shared library directory (which you can get with "pg_config --pkglibdir").
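Assuming the build above succeeded, that copy step could look like this (a sketch; you may need elevated privileges, and paths vary by installation):

cp loop.so "$(pg_config --pkglibdir)"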
Define and call the function
Define the function in SQL as superuser:
CREATE FUNCTION loop() RETURNS void LANGUAGE c AS 'loop';
Then, as an arbitrary user, call the function:
SELECT loop();
The function call will hang. No matter what you do, the query won't stop running.
Find the hanging backend process and send it a termination signal
Open a new database connection as the same database user. Then find the process ID of the backend process that serves the hanging database session:
SELECT pid, query FROM pg_stat_activity WHERE query LIKE '%loop%';
Once you know the process ID, send the process a SIGTERM:
SELECT pg_terminate_backend(12345);
Of course, the argument should be the process ID you found in the previous step. The function returns TRUE, since the signal was sent, but the query continues to execute regardless.
Attach to the hanging backend with the debugger

Install the GNU debugger gdb if necessary. You should also have the PostgreSQL server's debugging symbols installed; they are not required for the trick I'll show you, but you need them to get a readable stack trace. Log into the database server machine as the operating system user that runs PostgreSQL, commonly called postgres, and call gdb as follows (with the correct path to the postgres executable and the process ID you found above):

gdb /usr/pgsql-14/bin/postgres 12345
As soon as you get the “(gdb)” prompt, generate a stack trace with the command “bt”. It will look similar to the following:
#0  __GI___clock_nanosleep (clock_id=clock_id@entry=0, flags=flags@entry=0,
    req=req@entry=0x7ffdaf61cde0, rem=rem@entry=0x7ffdaf61cde0)
    at ../sysdeps/unix/sysv/linux/clock_nanosleep.c:71
#1  0x00007f113d864897 in __GI___nanosleep (req=req@entry=0x7ffdaf61cde0,
    rem=rem@entry=0x7ffdaf61cde0) at ../sysdeps/unix/sysv/linux/nanosleep.c:25
#2  0x00007f113d8647ce in __sleep (seconds=0) at ../sysdeps/posix/sleep.c:55
#3  0x00007f113e623139 in loop () from /usr/pgsql-14/lib/loop.so
#4  0x00000000006d71fb in ExecInterpExpr (state=0x13837b8, econtext=0x13834e0,
    isnull=<optimized out>) at executor/execExprInterp.c:1260
#5  0x000000000070e391 in ExecEvalExprSwitchContext (isNull=0x7ffdaf61ced7,
    econtext=0x13834e0, state=0x13837b8)
    at executor/../../../src/include/executor/executor.h:339
#6  ExecProject (projInfo=0x13837b0)
    at executor/../../../src/include/executor/executor.h:373
#7  ExecResult (pstate=<optimized out>) at executor/nodeResult.c:136
#8  0x00000000006da8b2 in ExecProcNode (node=0x13833d0)
    at executor/../../../src/include/executor/executor.h:257
#9  ExecutePlan (execute_once=<optimized out>, dest=0x137f4c0,
    direction=<optimized out>, numberTuples=0, sendTuples=<optimized out>,
    operation=CMD_SELECT, use_parallel_mode=<optimized out>,
    planstate=0x13833d0, estate=0x13831a8) at executor/execMain.c:1551
[...]
The stack trace makes it much simpler to identify the problem’s origin. Include it if you’re reporting a bug to PostgreSQL!
If you don’t want to perform the next step, you can chicken out and enter “detach” to detach the debugger from the process and let it continue running.
Cancel execution by making the hanging backend exit cleanly
The stack trace above shows that execution is currently inside a custom function (frame #3: loop () from /usr/pgsql-14/lib/loop.so) rather than inside PostgreSQL code. That means it is generally safe to make the process exit. If execution were inside the PostgreSQL server, there would be a slight chance that PostgreSQL is holding a spinlock or is otherwise modifying shared state; if you know the PostgreSQL source, a look at the call stack lets you assess that risk. Since the earlier pg_terminate_backend() already set ProcDiePending, you can now, if you dare, call ProcessInterrupts() to make the process terminate:
(gdb) print ProcessInterrupts()
[Inferior 1 (process 12345) exited with code 01]
The program being debugged exited while in a function called from GDB.
Evaluation of the expression containing the function
(ProcessInterrupts) will be abandoned.
(gdb) quit
Fix the function so that the user can cancel execution
To do better, the function code has to be changed as follows:
#include "postgres.h"
#include "fmgr.h"
#include "miscadmin.h"

#include <unistd.h>

PG_MODULE_MAGIC;

PG_FUNCTION_INFO_V1(loop);

Datum loop(PG_FUNCTION_ARGS)
{
    /* an endless loop */
    while (1)
    {
        CHECK_FOR_INTERRUPTS();
        sleep(2);
    }
}
With this modification, the function checks for interrupts every two seconds, so that execution can be canceled safely.
Conclusion
Queries are canceled by sending the backend process the SIGINT signal. If a backend is stuck and neither canceling the query nor sending SIGTERM can stop it, you can attach to the process with gdb and call ProcessInterrupts() directly to make it exit.
About Enteros
Enteros offers a patented database performance management SaaS platform. It proactively identifies root causes of complex business-impacting database scalability and performance issues across a growing number of clouds, RDBMS, NoSQL, and machine learning database platforms.
The views expressed on this blog are those of the author and do not necessarily reflect the opinions of Enteros Inc. This blog may contain links to the content of third-party sites. By providing such links, Enteros Inc. does not adopt, guarantee, approve, or endorse the information, views, or products available on such sites.