Preamble
A simple deadlock example
Getting things ready
We will use the following example to evaluate our deadlock debugging strategies:
CREATE TABLE parent (
   pid   bigint PRIMARY KEY,
   pdata text   NOT NULL
);

CREATE TABLE child (
   cid   bigint PRIMARY KEY,
   pid   bigint REFERENCES parent NOT NULL,
   cdata text   NOT NULL
);

INSERT INTO parent VALUES (1, 'no children yet');
The impasse
We simultaneously execute the following transactions to cause the deadlock. Each transaction adds a row to the child table and then tries to modify the corresponding row in parent.
-- session 1
BEGIN;

INSERT INTO child VALUES (100, 1, 'new child');

-- session 2
BEGIN;

INSERT INTO child VALUES (101, 1, 'another child');

SELECT pdata FROM parent WHERE pid = 1 FOR UPDATE;
-- session 2 blocks

-- session 1
SELECT pdata FROM parent WHERE pid = 1 FOR UPDATE;
-- session 1 blocks, but after one second, it receives an error
ERROR:  deadlock detected
DETAIL:  Process 19674 waits for ShareLock on transaction 1109; blocked by process 19718.
Process 19718 waits for ShareLock on transaction 1108; blocked by process 19674.
HINT:  See server log for query details.
CONTEXT:  while locking tuple (0,1) in relation "parent"

-- session 2 is unblocked and receives a result
      pdata
═════════════════
 no children yet
(1 row)
What a deadlock looks like in the log
The PostgreSQL log provides additional information:
ERROR:  deadlock detected
DETAIL:  Process 19674 waits for ShareLock on transaction 1109; blocked by process 19718.
        Process 19718 waits for ShareLock on transaction 1108; blocked by process 19674.
        Process 19674: SELECT pdata FROM parent WHERE pid = 1 FOR UPDATE;
        Process 19718: SELECT pdata FROM parent WHERE pid = 1 FOR UPDATE;
HINT:  See server log for query details.
CONTEXT:  while locking tuple (0,1) in relation "parent"
STATEMENT:  SELECT pdata FROM parent WHERE pid = 1 FOR UPDATE;
Why is debugging deadlocks difficult?
The error message sent to the client contains no pertinent details. That is intentional: the SQL statements from the other sessions involved in the deadlock might contain information that you are not authorized to see. The PostgreSQL log contains more useful information, but it still isn't enough to explain the deadlock. Because locks are always held until a transaction ends, you would need to know every statement from the transactions involved in order to identify the cause. The two statements in the error message could never cause a deadlock on their own, so an earlier statement in these transactions must already have taken locks that contribute to the problem. PostgreSQL does not keep a record of these earlier statements, so we have to find another way to get the complete picture.
Usually, you need the help of the application's developers. If you are lucky, the SQL statements in the error message are enough to identify the relevant database transactions, and it is simple to reconstruct the SQL statements that each transaction issued. By looking at the locks those statements take, you should be able to piece together the events that led to the deadlock.
The cause of the deadlock in our example
Since we know the statements from the transactions in the example above, we can infer the following:
- The INSERT statements take a FOR KEY SHARE lock on the referenced rows in parent. That is necessary so that no concurrent transaction can delete the row referenced by our not yet committed INSERT. As you can see in the compatibility matrix, these locks don't conflict with each other.
- The FOR UPDATE row locks from the SELECT ... FOR UPDATE statements conflict with the FOR KEY SHARE locks from the other transaction, which leads to the deadlock. (You can watch these locks from a third session, as sketched below.)
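If you want to see these locks for yourself, you can query pg_locks from a third session while both transactions are still open. This is only a minimal sketch: granted row-level locks are stored on the rows themselves, so the view mainly shows the table-level locks, plus additional entries once one of the sessions starts waiting.

-- run in a third session while sessions 1 and 2 are still open
SELECT pid, locktype, mode, granted
FROM pg_locks
WHERE relation = 'parent'::regclass
ORDER BY pid;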
Once we know that, the solution is simple: change SELECT ... FOR UPDATE to SELECT ... FOR NO KEY UPDATE, and there will be no more lock conflicts and consequently no deadlock. Note that a FOR UPDATE row lock is only required if you want to DELETE a row or if you intend to update a column with a unique or primary key constraint on it (a key). The lock name FOR UPDATE is counter-intuitive, since it suggests that this is the correct lock for an UPDATE, and it is a common trap for PostgreSQL beginners.
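Applied to our example, the fix is a one-word change to the statement that both sessions run (a sketch of the corrected statement only; FOR NO KEY UPDATE does not conflict with the FOR KEY SHARE locks taken by the foreign key checks):

-- instead of SELECT ... FOR UPDATE, both sessions now run:
SELECT pdata FROM parent WHERE pid = 1 FOR NO KEY UPDATE;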
With our complete knowledge, we could fix the problem. But how do we get the statements from the transactions?
Techniques for debugging deadlocks
There are various methods:
Debugging deadlocks through application-side logging
It is simple to identify the transactions that caused the issue if the application logs information about each error and every database transaction that is started. However, to do that, you either need to have access to the application’s source code or the application must have adequate tracing capabilities. You are not always in that fortunate circumstance.
Annotating queries on the application side to troubleshoot deadlocks
This is another technique that requires modifying the application. The trick is to annotate the queries with a comment like this:
SELECT /* function "add_data", source file "app.c", line 1234 */ pdata FROM parent WHERE pid = 1 FOR UPDATE;
Any information that enables you to identify the transaction to which the statement belongs will do.
Debugging deadlocks through logging on the database side
To be able to trace statements, we have to make sure that the process ID and the transaction number are logged with each statement. With CSV logging, that is always the case, but with stderr logging, you have to change log_line_prefix:
log_line_prefix = '%m [pid=%p] %q[txid=%x] %u@%d '
Also, you have to log all statements:
log_statement = 'all'
For busy applications, the volume and performance impact of logging every statement may be prohibitive. PostgreSQL v12 introduced log_transaction_sample_rate, which allows you to log only a certain percentage of all transactions. However, in many cases that won't be good enough, because you will typically only see the statements from one of the transactions involved.
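As a sketch, the relevant settings in postgresql.conf could look like this (the sample rate of 0.01 is only an illustrative value):

log_statement = 'all'                 # log every statement (expensive on busy systems)
log_transaction_sample_rate = 0.01    # or, from v12 on, log only about 1% of all transactions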
For our example, the log file would appear as follows:
2022-06-20 17:17:50.888 CEST [pid=19718] [txid=0] laurenz@test LOG:  statement: BEGIN;
2022-06-20 17:17:52.988 CEST [pid=19718] [txid=0] laurenz@test LOG:  statement: INSERT INTO child VALUES (100, 1, 'new child');
2022-06-20 17:17:58.215 CEST [pid=19674] [txid=0] laurenz@test LOG:  statement: BEGIN;
2022-06-20 17:18:03.626 CEST [pid=19674] [txid=0] laurenz@test LOG:  statement: INSERT INTO child VALUES (101, 1, 'another child');
2022-06-20 17:18:18.979 CEST [pid=19674] [txid=1108] laurenz@test LOG:  statement: SELECT pdata FROM parent WHERE pid = 1 FOR UPDATE;
2022-06-20 17:18:31.550 CEST [pid=19718] [txid=1109] laurenz@test LOG:  statement: SELECT pdata FROM parent WHERE pid = 1 FOR UPDATE;
2022-06-20 17:18:32.551 CEST [pid=19718] [txid=1109] laurenz@test ERROR:  deadlock detected
2022-06-20 17:18:32.551 CEST [pid=19718] [txid=1109] laurenz@test DETAIL:  Process 19718 waits for ShareLock on transaction 1108; blocked by process 19674.
        Process 19674 waits for ShareLock on transaction 1109; blocked by process 19718.
        Process 19718: SELECT pdata FROM parent WHERE pid = 1 FOR UPDATE;
        Process 19674: SELECT pdata FROM parent WHERE pid = 1 FOR UPDATE;
2022-06-20 17:18:32.551 CEST [pid=19718] [txid=1109] laurenz@test HINT:  See server log for query details.
2022-06-20 17:18:32.551 CEST [pid=19718] [txid=1109] laurenz@test CONTEXT:  while locking tuple (0,1) in relation "parent"
2022-06-20 17:18:32.551 CEST [pid=19718] [txid=1109] laurenz@test STATEMENT:  SELECT pdata FROM parent WHERE pid = 1 FOR UPDATE;
Remember that the transaction ID is logged as 0 until after the transaction's first data-modifying statement, because PostgreSQL only assigns transaction IDs when they are actually needed.
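You can observe this lazy assignment yourself. The following sketch uses txid_current_if_assigned(), which returns NULL as long as no transaction ID has been assigned:

BEGIN;
SELECT txid_current_if_assigned();   -- NULL: no transaction ID assigned yet
INSERT INTO child VALUES (102, 1, 'yet another child');
SELECT txid_current_if_assigned();   -- now returns the assigned transaction ID
ROLLBACK;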
Debugging deadlocks by raising deadlock_timeout
The PostgreSQL deadlock detector doesn't kick in immediately, but waits until deadlock_timeout has passed. This is because the deadlock detector is computationally expensive, so we only want to activate it if there very likely is a problem. By default, the parameter is set to one second, which is far too short to debug the problem. If you can afford to let deadlocked sessions "hang" while you debug them, you can raise the parameter; the maximum allowed value is almost 600 days, which is more time than you will ever need. With proper monitoring in place, you could, for example, set deadlock_timeout to 15 minutes to catch deadlocks in the act.
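That is a single line in postgresql.conf (a sketch; keep in mind that every genuine deadlock will now also stay unresolved for up to 15 minutes):

deadlock_timeout = '15min'   # default is 1s; raise only while actively debugging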
Once you have an active deadlock, you can find client_addr and client_port in pg_stat_activity for the hanging sessions:
SELECT client_addr, client_port, query
FROM pg_stat_activity
WHERE state = 'active'
  AND wait_event_type = 'Lock'
ORDER BY state_change;

  client_addr  │ client_port │                        query
═══════════════╪═════════════╪════════════════════════════════════════════════════
 192.168.2.120 │       49578 │ SELECT pdata FROM parent WHERE pid = 1 FOR UPDATE;
 192.168.2.122 │       47432 │ SELECT pdata FROM parent WHERE pid = 1 FOR UPDATE;
(2 rows)
We sort by state_change, because other sessions might later get stuck behind the sessions that originally caused the deadlock. For that reason, you want to focus on the oldest entries.
Next, log into the client machine and find out which process has a TCP connection on port 49578:
# lsof -P | grep 49578
myapp   3974  laurenz  [...]  TCP localhost:49578->dbserver:5432 (ESTABLISHED)
Now you know that the client process is 3974 and is running the program myapp. You can attach to that process with a debugger and look at the current execution stack, which should let you determine the SQL statements that are executed during that transaction.
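As a minimal sketch, assuming a Linux client machine and that myapp is a native binary with debug symbols, attaching with gdb could look like this:

# attach to the hanging client process and inspect where it is stuck
gdb -p 3974
(gdb) bt        # print the current call stack
(gdb) detach    # let the process continue afterwards
(gdb) quit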
Naturally, this is another technique that needs access to the application code.
Conclusion
We have seen a variety of techniques for debugging deadlocks. Most of them require that you have control over the application code. Not every technique is applicable in every situation; for instance, you might not be able to afford to log all statements or to leave deadlocked sessions hanging for a longer period of time.
About Enteros
Enteros offers a patented database performance management SaaS platform. It proactively identifies root causes of complex business-impacting database scalability and performance issues across a growing number of clouds, RDBMS, NoSQL, and machine learning database platforms.
The views expressed on this blog are those of the author and do not necessarily reflect the opinions of Enteros Inc. This blog may contain links to the content of third-party sites. By providing such links, Enteros Inc. does not adopt, guarantee, approve, or endorse the information, views, or products available on such sites.