Error deadlock detected sqlstate 40p01 - Исправление ошибок и поиск оптимальных решений проблем

Содержание

Deadlock в insert запросе в postgres
Ответы (1 шт):
Как мы ловим Deadlock`и на PostgreSQL и чиним их
Предисловие
Немного о внутреннем устройстве бекенда
Часть 1: Мониторинг
Как проявляется Deadlock
Ручной захват
application_name
Думаем о серверных логах
Часть 2: Как бороться с дедлоками
Несколько практик избегания deadlock`ов
Частый случай №1: Классический дедлок
Частый случай №2: Сам себе злобный буратино (ССЗБ)
Бонус №1
Бонус №2
PostgreSQL: Understanding deadlocks
How does a deadlock happen?
How to fix and avoid deadlocks
Finally …

Deadlock в insert запросе в postgres

Получаю время от времени ошибку

Message: SQLSTATE[40P01]: Deadlock detected: 7 ERROR: deadlock detected

Запрос вида (для примера сокращу, в запросе может быть до 250 строк и много контента):

Так же есть ньюанс что подобные запросы выполняются на 40 серверах к одной базе данных

Как с таким бороться?

Ответы (1 шт):

в запросе может быть до 250 строк

Логично предположить, что конкурентно вставляете одни и те же данные в разном порядке.

Как же ON CONFLICT DO NOTHING ? А что он может?

t1: begin
t2: begin
t1: insert 1 on conflict — никто не мешает, записали
t2: insert 2 on conflict — никто не мешает, записали
t2: insert 1 on conflict — хм, это же значение пишет t1, но мы не знаем можем ли мы сделать do nothing, т.к. если t1 сделает rollback — мы сделаем неправильный выбор. Потому ждём окончание t1
t1: insert 2 on conflict — а это значение хочет t2 писать, подождём её завершения.
t1 или t2 спустя время deadlock_timeout — что-то я долго жду, проверю-ка граф ожиданий. Опа, действительно deadlock, отменяю свою транзакцию. Вторая транзакция соответственно разблокируется.

insert на много строк всё равно транзакция и потому может быть аналогично.

Отсортируйте записываемые множества по какому-либо признаку в идентичном порядке.

Источник

Как мы ловим Deadlock`и на PostgreSQL и чиним их

Предисловие

Ситуация: есть высоконагруженная мета-игра для наших танков под названием Глобальная карта. Эдакая пошаговая настолка для команд, где бои происходят в реальном танковом клиенте. В пиковые часы на карте несколько тысяч руководителей кланов производят игровые действия: атакуют друг друга, перемещают дивизии, покупают, продают, ~~грабят корованы~~. Помимо этого, существует десяток сервисов, которые также могут вносить изменения в игровую ситуацию: подкидывают деньжат, штрафуют, добавляют игроков в клан и прочее.

Всё это неизбежно приводит к дедлокам. Так вот, хочу вам поведать историю о том, как мы эти периодические проблемы держим в допустимых рамках.

Немного о внутреннем устройстве бекенда

Часть 1: Мониторинг

Как проявляется Deadlock

Когда у нас возникает Deadlock, то падает исключение следующего вида:

ERROR: deadlock detected
DETAIL: Process 18293 waits for ShareLock on transaction 639; blocked by process 18254.
Process 18254 waits for ShareLock on transaction 640; blocked by process 18293.
HINT: See server log for query details.
CONTEXT: while updating tuple (0,9) in relation «users»

Первое, на что следует обратить внимание, — это строчка:
HINT: See server log for query details.

Действительно, если мы посмотрим серверные логи, то увидим для этого же места следующее:
ERROR: deadlock detected

И дальше конкретику:

DETAIL: Process 18293 waits for ShareLock on transaction 639; blocked by process 18254.
Process 18254 waits for ShareLock on transaction 640; blocked by process 18293.
Process 18293: update users set balance = balance + 10 where > Process 18254: update users set balance = balance + 10 where >
HINT: See server log for query details.
CONTEXT: while updating tuple (0,9) in relation «users»

И, наконец, запрос, на котором произошла ошибка:

STATEMENT: update users set balance = balance + 10 where >

Логирование запросов при этом не обязано быть включено.

Круто. Но первая глобальная проблема для любого более-менее серьёзного проекта — то, что у вас нет доступа к серверным логам вследствие политики безопасности. Иногда вообще нет никакого доступа. А иногда можно попросить участок, но надо ждать. Иногда это 30 минут, иногда день.

А хотелось бы получать такую информацию сразу. В особенности, если у вас в проекте есть Sentry, и большинство ошибок команда разработки получает сразу.

Как-то подкрутить сервер, чтобы он такую информацию выдавал обычным клиентам — нельзя. Вследствие политики безопасности разработчиков базы. Но, если у вашего пользователя к базе доступ обычный, без всяких там ограничений на выполнения служебных функций и без Row-Level security policies, то организовать себе доступ к подобной информации всё же можно.

Ручной захват

Мы можем преобразовать наши классы так, чтобы вручную получать похожую информацию. И даже больше. Для этого после отлова исключения о дедлоке, нам необходимо:

Временно ничего не откатывать в текущем соединении с базой и вообще ничего не трогать там.
Создать ещё одно соединение и выполнить в нём простейший запрос:

Результаты положить в доступное для просмотра разработчиками хранилище. Например отправить по Sentry как ошибку.

В данном случае у нас есть высокая вероятность того, что мы увидим какой именно запрос сломал нам транзакцию, вычислив его по PID и посмотрев текущий query.

Но бывает и так, что вычислив соединение по PID и посмотрев на query вы можете увидеть совсем не тот query, который устроил нам дедлок, а какой-нибудь следующий за ним по логике. Ведь пока вы ловили исключение и открывали соединение, нужный нам запрос для отлова мог и завершиться. Всё что мы можем здесь сделать — это работать через pgBouncer или его аналоги для минимизации времени установления соединения и использовать application_name.

application_name

Даже если вы получили тот запрос, который вызвал дедлок, у вас всё равно могут возникнуть трудности с пониманием, в каком месте логики он был вызван. И вот здесь на помощь приходит поле application_name. По умолчанию оно инициализируется не сильно полезной информацией, но его можно менять. А что если писать туда то место, откуда мы начинали транзакцию?

Вуаля. Теперь можно быстро открывать файлы в нужных местах и смотреть код.

pid	application_name	state	query
1	8613	deadlock_test.py:10	idle in transaction (aborted)	UPDATE users SET balance = balance + 10 WHERE RETURNING pg_sleep(1);
2	8614	deadlock_test.py:17	active	UPDATE users SET balance = balance + 10 WHERE RETURNING pg_sleep(1);
3	8617	active	SELECT pid, application_name, state, query FROM pg_stat_activity;

Думаем о серверных логах

Вся эта магия, описанная сверху хороша, но теоретически может не сработать. Иногда вам всё же не обойтись без серверных логов, поэтому необходимо сделать ещё два шага:

Обговорить понятную процедуру получения нужных участков серверных логов в разумное время с заинтересованными сторонами.
Делать их в требуемом вами формате, изменив log_line_prefix в postgresql.conf. На машине разработчика например можно так: log_line_prefix = ‘APP:%a PID:%p TR:%x ‘ .

Часть 2: Как бороться с дедлоками

Обычно в таких разделах дают ссылки на документацию об уровнях изоляции транзакций, видах блокировок и предлагают думать и анализировать в разрезе своего приложения. Я тоже так сделаю, но позже. А сначала просто опишу как мы это делаем чаще всего, ибо так уже получилось, что дедлоки у нас очень похожи друг на друга.

Несколько практик избегания deadlock`ов

Частый случай №1: Классический дедлок

Самый наш частый случай следующий:

Происходит игровое действие, затрагивающее сразу несколько игроков.
При этом игроки нередко могут совершать несколько действий в одно и то же время. Например, у нас: у крупных кланов есть много подразделений, участвующих в боях.
При выигрыше или проигрыше начисляются очки клану. И таких начислений параллельно может быть очень много.

То есть имеем ситуацию, когда пришёл пяток результатов боёв, их обработка распараллелилась, при этом в параллельных процессах, бывает, встречаются одни и те же игроки.

Всё что тут можно сделать: или выстраивать начисления в цепочку, но это медленно, или позволять начислениям падать и пробовать начислить чуть позже.

Частый случай №2: Сам себе злобный буратино (ССЗБ)

У нас походовая игра. Раз в ход происходит пересчёт баланса игроков, учитывая большое количество совершённых ими игровых действий. На время изменения баланса мы блокировали другие изменения через SELECT… FOR UPDATE. Хотя мы блокировали не сразу всех, а чанками по 100, всё равно иногда уходили в дедлок с процессом, который начисляет бонусы за бой, который не останавливается на время расчёта хода.

Так вот, оказалось, что мы были неправы. SELECT… FOR UPDATE — слишком мощная блокировка, необходимая только если выполняются 2 условия:

Условный id текущей таблицы используется как внешний ключ в другой.
Этот же условный id может быть изменён/удалён в результате дальнейших действий.

Возьмём пример:

P2 в данной ситуации повиснет, поскольку мы даём СУБД понять, что запись с может перестать существовать. Однако в P1 мы не делаем ничего такого, только хотим защитить баланс клана от изменений. Поэтому, когда мы изменили FOR UPDATE на FOR NO KEY UPDATE, мы перестали ловить дедлоки.

Бонус №1

SELECT… FOR UPDATE в примере выше вызван явно. Но вы получите аналогичный эффект, если затронете своими изменениями уникальный ключ, на который ссылается внешний ключ из других таблиц. А любой UPDATE, который не затрагивает своими изменениями подобные ключи, вызовет блокировку аналогичную SELECT… FOR NO KEY UPDATE. Я вам рекомендую ознакомиться с этими особенностями в статье «Явные блокировки» в списке литературы ниже.

Бонус №2

Вернёмся к ещё одной любопытной детали из первоначальной ошибки:

CONTEXT: while updating tuple (0,9) in relation «users»

Что за тупл спросите вы? Это физический адрес строчки в таблице, из-за которой возник конфликт. Дело в том, что в каждой таблице есть служебные поля, которые запросом SELECT * не выбираются. Однако стоит явно указать к примеру ctid среди полей, как мы увидим этот самый тупл:

Пользы от него немного в случае дедлока, ибо разблокированный процесс скорее всего обновит конфликтную строчку, и у неё изменится этот ctid (поскольку любой UPDATE в PostgreSQL на самом деле INSERT, а старая строчка помечается как невидимая и позже будет удалена автовакуумом). Но знать стоит, вдруг когда-нибудь пригодится.

Источник

PostgreSQL: Understanding deadlocks

Many might have seen PostgreSQL issue the following error message: «ERROR: deadlock detected» . But what does it really mean? How can we prevent a deadlock and how can we reproduce the problem? Let’s dive into PostgreSQL locking and understand what deadlock and deadlock_timeout really mean.

How does a deadlock happen?

Many people approach us because they want to understand what a deadlock is and how it can happen. They also want to understand how a deadlock can be avoided and what software developers can do about it.
If you want to understand how a deadlock occurs, all you need is a table containing two lines. That’s sufficient to explain the basic principle of deadlocks.

Here is some easy-to-use sample data:

The crux is that if data is updated in a different order, transactions might have to wait for one another to be finished. It is perfectly fine if transaction 1 has to wait for transaction 2. But what happens if transaction 1 has to wait for transaction 2 and transaction 2 has to wait for transaction 1? In that case, the system has two choices:

Wait infinitely, or
Abort one transaction and commit the other transaction.

As waiting infinitely is not an option, PostgreSQL will abort one of these transactions after some time (deadlock_timeout). Here is what happens:

Transaction 1	Transaction 2	Comment
BEGIN;	BEGIN;
UPDATE t_data SET data = data * 10 WHERE/> RETURNING *; id \| data —-+—— 1 \| 1000 (1 row)	UPDATE t_data SET data = data * 10 WHERE/> RETURNING *; id \| data —-+—— 2 \| 2000 (1 row)	works perfectly
UPDATE t_data SET data = data * 10 WHERE/> RETURNING *;	has to wait until transaction 2 releases the lock on the row containing >
… waits …	UPDATE t_data SET data = data * 10 WHERE/> RETURNING *;	wants to lock the row locked by transaction id: now both are supposed to wait
… deadlock timeout …	… deadlock timeout …	PostgreSQL waits (deadlock_timeout) and triggers deadlock detection after this timeout (not immediately)
update proceeds: “UPDATE 1”	ERROR: deadlock detected	a transaction has to die
COMMIT;	the rest commits normally

The error message we will see is:

ERROR: deadlock detected
DETAIL: Process 70725 waits for ShareLock on transaction 891717; blocked by process 70713.
Process 70713 waits for ShareLock on transaction 891718; blocked by process 70725.
HINT: See server log for query details.
CONTEXT: while updating tuple (0,1) in relation «t_data»

The reason is that transactions have to wait for one another. If two transactions are in a conflict, PostgreSQL will not resolve the problem immediately, rather it will wait for deadlock_timeout and then trigger the deadlock detection algorithm to resolve the problem.

Why does PostgreSQL wait for some time before it steps in and fixes things? The reason is that deadlock detection is quite expensive, and therefore not immediately triggering it makes sense. The default value here is 1 second, which is high enough to avoid pointless deadlock detection attempts, but is still short enough to fix the problem in a useful and timely manner.

How to fix and avoid deadlocks

The most important thing to know is: There is NO MAGIC CONFIGURATION PARAMETER to fix this problem. The problem does NOT depend on configuration. It depends on the execution order of operations. In other words, you cannot magically fix it without understanding the application and its underlying operations.

The only thing that can fix the problem is to change the execution order, as shown in the next listing:

This is the data you should see after committing the transaction that did not fail before. Thus we can see what happens if two transactions execute in a different order:

Transaction 1	Transaction 2	Comment
BEGIN;
UPDATE t_data SET data = data * 10 WHERE/> RETURNING *; id \| data —-+—— 1 \| 1000 (1 row)	BEGIN;
UPDATE t_data SET data = data * 10 WHERE/> RETURNING *;
UPDATE t_data SET data = data * 10 WHERE/> RETURNING *; id \| data —-+——- 2 \| 20000 (1 row)	… wait …
COMMIT;	… wait …
UPDATE t_data SET data = data * 10 WHERE/> RETURNING *; id \| data —-+——— 1 \| 100000 (1 row)	re-read the value and use the newly committed entries
UPDATE t_data SET data = data * 10 WHERE/> RETURNING *; id \| data —-+——— 2 \| 200000 (1 row)	re-read the value and use the newly committed entries
COMMIT;

In this case, there is no deadlock. However, in a real work scenario it is hardly possible to simply swap the execution order. That’s why this is more of a theoretical solution to the problem than a practical one. However, there are no other options to fix the problem of deadlocks. In the case of deadlocks, being aware of how to prevent them is the best cure.

Finally …

Locking is really important. Deadlocks are not the only concern in this area. Performance might be equally important, therefore it makes sense to deal with performance-related locking effects as well. Stay tuned for more on this topic.

If you want to learn more about important features of PostgreSQL, you might want to check out a blog post about UPDATE which can be found here.

Please leave your comments below. In order to receive regular updates on important changes in PostgreSQL, subscribe to our newsletter , or follow us on Twitter , Facebook , or LinkedIn .

Hans-Jürgen Schönig

Hans-Jürgen Schönig has experience with PostgreSQL since the 90s. He is CEO and technical lead of CYBERTEC, which is one of the market leaders in this field and has served countless customers around the globe since the year 2000.

Источник

I digged this issue yesterday and found out that it is caused by deadlock that appears when we are trying to insert a block validation record.
Here are the logs form RDS database:

2020-10-29 10:12:25 UTC:172.31.27.161(53549):fuse@fusenet_explorer:[16327]:ERROR:  deadlock detected
2020-10-29 10:12:25 UTC:172.31.27.161(53549):fuse@fusenet_explorer:[16327]:DETAIL:  Process 16327 waits for ShareLock on transaction 37769062; blocked by process 16295.
	Process 16295 waits for AccessShareLock on tuple (240712,40) of relation 18711 of database 18697; blocked by process 31566.
	Process 31566 waits for ShareLock on transaction 37769078; blocked by process 16327.
	Process 16327: SELECT b0."hash" FROM "blocks" AS b0 WHERE (b0."number" = ANY($1) AND b0."consensus") ORDER BY b0."hash" FOR UPDATE
	Process 16295: INSERT INTO "block_rewards" AS b0 ("address_hash","address_type","block_hash","inserted_at","reward","updated_at") VALUES ($1,$2,$3,$4,$5,$6),($7,$8,$9,$10,$11,$12),($13,$14,$15,$16,$17,$18) ON CONFLICT ("address_hash","address_type","block_hash") DO UPDATE SET "reward" = EXCLUDED.reward WHERE (EXCLUDED.reward IS DISTINCT FROM b0."reward") RETURNING "updated_at", "inserted_at", "block_hash", "address_hash", "reward", "address_type"
	Process 31566: SELECT b0."hash" FROM "blocks" AS b0 WHERE (b0."number" = ANY($1) AND b0."consensus") ORDER BY b0."hash" FOR UPDATE
2020-10-29 10:12:25 UTC:172.31.27.161(53549):fuse@fusenet_explorer:[16327]:HINT:  See server log for query details.
2020-10-29 10:12:25 UTC:172.31.27.161(53549):fuse@fusenet_explorer:[16327]:CONTEXT:  while locking tuple (240712,41) in relation "blocks"
2020-10-29 10:12:25 UTC:172.31.27.161(53549):fuse@fusenet_explorer:[16327]:STATEMENT:  SELECT b0."hash" FROM "blocks" AS b0 WHERE (b0."number" = ANY($1) AND b0."consensus") ORDER BY b0."hash" FOR UPDATE

From I research I believe these lines of code are affected => https://github.com/poanetwork/blockscout/blob/master/apps/explorer/lib/explorer/chain/import/runner/block/rewards.ex#L46, however I’m not that good in building ECTO queries on Elixir and I don’t think I can fix it fast enough.
@vbaranov Could you look at it, please?

Источник

Many might have seen PostgreSQL issue the following error message: "ERROR: deadlock detected". But what does it really mean? How can we prevent a deadlock and how can we reproduce the problem? Let’s dive into PostgreSQL locking and understand what deadlock and deadlock_timeout really mean.

How does a deadlock happen?

Here is some easy-to-use sample data:

test=# CREATE TABLE t_data (id int, data int);
CREATE TABLE
test=# INSERT INTO t_data VALUES (1, 100), (2, 200);
INSERT 0 2
test=# TABLE t_data;
id  | data
----+------
1   | 100
2   | 200
(2 rows)

Wait infinitely, or
Abort one transaction and commit the other transaction.

As waiting infinitely is not an option, PostgreSQL will abort one of these transactions after some time (deadlock_timeout). Here is what happens:

Transaction 1	Transaction 2	Comment
`BEGIN;`	`BEGIN;`
`UPDATE t_data SET data = data * 10 WHERE id = 1 RETURNING *; id \| data ----+------ 1 \| 1000 (1 row)`	`UPDATE t_data SET data = data * 10 WHERE id = 2 RETURNING *; id \| data ----+------ 2 \| 2000 (1 row)`	works perfectly
`UPDATE t_data SET data = data * 10 WHERE id = 2 RETURNING *;`		has to wait until transaction 2 releases the lock on the row containing id = 2
… waits …	`UPDATE t_data SET data = data * 10 WHERE id = 1 RETURNING *;`	wants to lock the row locked by transaction id: now both are supposed to wait
… deadlock timeout …	… deadlock timeout …	PostgreSQL waits (deadlock_timeout) and triggers deadlock detection after this timeout (not immediately)
update proceeds: “UPDATE 1”	ERROR: deadlock detected	a transaction has to die
COMMIT;		the rest commits normally

The error message we will see is:

ERROR: deadlock detected DETAIL: Process 70725 waits for ShareLock on transaction 891717; blocked by process 70713. Process 70713 waits for ShareLock on transaction 891718; blocked by process 70725. HINT: See server log for query details. CONTEXT: while updating tuple (0,1) in relation "t_data"

How to fix and avoid deadlocks

The only thing that can fix the problem is to change the execution order, as shown in the next listing:

test=# SELECT * FROM t_data ;
id  | data
----+------
1   | 1000
2   | 2000
(2 rows)

This is the data you should see after committing the transaction that did not fail before. Thus we can see what happens if two transactions execute in a different order:

Transaction 1	Transaction 2	Comment
`BEGIN;`
`UPDATE t_data SET data = data * 10 WHERE id = 1 RETURNING *; id \| data ----+------ 1 \| 1000 (1 row)`	`BEGIN;`
	`UPDATE t_data SET data = data * 10 WHERE id = 1 RETURNING *;`
`UPDATE t_data SET data = data * 10 WHERE id = 2 RETURNING *; id \| data ----+------- 2 \| 20000 (1 row)`	… wait …
`COMMIT;`	… wait …
	`UPDATE t_data SET data = data * 10 WHERE id = 1 RETURNING *; id \| data ----+-------- 1 \| 100000 (1 row)`	re-read the value and use the newly committed entries
	`UPDATE t_data SET data = data * 10 WHERE id = 2 RETURNING *; id \| data ----+-------- 2 \| 200000 (1 row)`	re-read the value and use the newly committed entries
	`COMMIT;`

Finally …

If you want to learn more about important features of PostgreSQL, you might want to check out a blog post about UPDATE which can be found here.

Please leave your comments below. In order to receive regular updates on important changes in PostgreSQL, subscribe to our newsletter, or follow us on Twitter, Facebook, or LinkedIn.

Источник

Here is a small demo to show the retry logic that must be implemented to handle transaction conflict or transparent failover errors from a database. The goal is to be able to do some experiments on your laptop, because there are, unfortunately, no magic solutions.

Setup a PostgreSQL database

Start a local PostgreSQL:

docker run --name pg -e POSTGRES_PASSWORD=franck -p 5432:5432 -d postgres

Enter fullscreen mode

Exit fullscreen mode

Create a simple 10 rows table:

PGUSER=postgres PGPASSWORD=franck PGHOST=localhost PGPORT=5432  psql -e <<SQL
create table franck(id int primary key, val int);
insert into franck select generate_series(1,10),0;
SQL

Enter fullscreen mode

Exit fullscreen mode

create table franck(id int primary key, val int)
CREATE TABLE

insert into franck select generate_series(1,10),0;
INSERT 0 10

Enter fullscreen mode

Exit fullscreen mode

Get YBDemo

I use a very simple program to run queries in thread, Download the YBDemo.jar which includes the PostgreSQL and YugabyteDB JDBC drivers and Hikari connection pool:

curl -Ls https://github.com/FranckPachot/ybdemo/releases/download/v0.0.1/YBDemo-0.0.1-SNAPSHOT-jar-with-dependencies.jar > YBDemo.jar
curl -Ls https://jdbc.postgresql.org/download/postgresql-42.3.2.jar > postgresql.jar

Enter fullscreen mode

Exit fullscreen mode

I setup the connection parameters in Hikary properties:

cat > hikari.properties <<'INI'
dataSourceClassName=org.postgresql.ds.PGSimpleDataSource
dataSource.url=jdbc:postgresql://localhost:5432/postgres?user=postgres&password=franck
INI

Enter fullscreen mode

Exit fullscreen mode

Here is how I run 5 threads that update all rows in my demo table:

for i in {1..5} ; do
 echo "update franck set val=val+1 returning val+1"
done | java -jar YBDemo.jar

Enter fullscreen mode

Exit fullscreen mode

This is a very simple example, update franck set val=val+1 on the 10 rows of my table, run from multiple threads, to show transaction conflicts. Of course, if you have this in your application, you have a problem. But the goal of a lab is to analyze simple things in order to get full understanding.

Read committed

The default isolation level in PostgreSQL is Read Committed. You can check quickly that it is the same with my program: java -jar YBDemo.jar <<< "select current_setting('transaction_isolation')"

So, If you have run the simple update franck set val=val+1 returning val+1 above from at least 3 threads you will see some transaction conflict errors:

 Thread-2      5 ms: 17735
 Thread-0      5 ms: 17736
 Thread-1      5 ms: 17737

update franck set val=val+1 returning val+1
Error in thread  Thread-0   1016 ms SQLSTATE(40P01) - retry 0/10
org.postgresql.util.PSQLException: ERROR: deadlock detected
  Detail: Process 180 waits for ShareLock on transaction 20923; blocked by process 182.
Process 182 waits for ShareLock on transaction 20926; blocked by process 180.
  Hint: See server log for query details.
  Where: while rechecking updated tuple (11,5) in relation "franck"
 wait in thread  Thread-0     15 ms after   0 retries

update franck set val=val+1 returning val+1
Error in thread  Thread-1   1005 ms SQLSTATE(40P01) - retry 0/10
org.postgresql.util.PSQLException: ERROR: deadlock detected
  Detail: Process 181 waits for ShareLock on transaction 20989; blocked by process 182.
Process 182 waits for ShareLock on transaction 20988; blocked by process 181.
  Hint: See server log for query details.
  Where: while rechecking updated tuple (9,21) in relation "franck"
 wait in thread  Thread-1     10 ms after   0 retries

Enter fullscreen mode

Exit fullscreen mode

I’ve run this example to make it clear that, even in Read Committed isolation level, with one simple SQL statement, transaction conflict can happen and the application must handle it with a retry logic. Why a retry logic? because you don’t want to stop your application and you don’t want to ignore the unsuccessful transaction. This is, by the way, the main goal of my YBDemo.java.

So, whatever the isolation level is, SQLSTATE 40P01 (deadlock_detected) must be retried by an highly available application.

Serializable

With Serializable isolation level, you can expect more transaction conflicts because the read/write stability is guaranteed for the whole transaction.

I’m adding this line to set the isolation level for each transaction:

cat >> hikari.properties <<'INI'
connectionInitSql= set default_transaction_isolation=serializable;
INI

Enter fullscreen mode

Exit fullscreen mode

and run the same again:

for i in {1..5} ; do
 echo "update franck set val=val+1 returning val+1"
done | java -jar YBDemo.jar

Enter fullscreen mode

Exit fullscreen mode

I can see many errors like:

 Thread-3      2 ms: 153703
 Thread-3      2 ms: 153704
 Thread-3      2 ms: 153705
 wait in thread  Thread-0   2567 ms after   8 retries
 Thread-3      2 ms: 153706

update franck set val=val+1 returning val+1
Error in thread  Thread-0      2 ms SQLSTATE(40001) - retry 9/10
org.postgresql.util.PSQLException: ERROR: could not serialize access due to concurrent update
 Thread-3      3 ms: 153707
 Thread-3      2 ms: 153708
 Thread-3      2 ms: 153709
 Thread-3      2 ms: 153710

Enter fullscreen mode

Exit fullscreen mode

This one was at the limit because my YBDemo.java fails above 10 retries. This happened quickly:

Error in thread  Thread-2      2 ms SQLSTATE(40001) - retry 0/10
 wait in thread  Thread-2     10 ms after   0 retries
Error in thread  Thread-2      2 ms SQLSTATE(40001) - retry 1/10
 wait in thread  Thread-2     29 ms after   1 retries
Error in thread  Thread-2      2 ms SQLSTATE(40001) - retry 2/10
 wait in thread  Thread-2     41 ms after   2 retries
Error in thread  Thread-2      2 ms SQLSTATE(40001) - retry 3/10
 wait in thread  Thread-2     86 ms after   3 retries
Error in thread  Thread-2      2 ms SQLSTATE(40001) - retry 4/10
 wait in thread  Thread-2    164 ms after   4 retries
Error in thread  Thread-2      1 ms SQLSTATE(40001) - retry 5/10
 wait in thread  Thread-2    328 ms after   5 retries
Error in thread  Thread-2      1 ms SQLSTATE(40001) - retry 6/10
 wait in thread  Thread-2    641 ms after   6 retries
Error in thread  Thread-2      3 ms SQLSTATE(40001) - retry 7/10
 wait in thread  Thread-2   1289 ms after   7 retries
Error in thread  Thread-2      4 ms SQLSTATE(40001) - retry 8/10
 wait in thread  Thread-2   2567 ms after   8 retries
Error in thread  Thread-2      3 ms SQLSTATE(40001) - retry 9/10
 wait in thread  Thread-2   5125 ms after   9 retries
Error in thread  Thread-2      2 ms SQLSTATE(40001) - retry 10/10

Enter fullscreen mode

Exit fullscreen mode

I use exponential backoff to retry, you can see how in the code. The last one waited 5 seconds.

YugabyteDB

With a distributed database, optimistic locking is the rule, and retry-able errors can happen for many reasons. I’ve started a cluster on my laptop with 3 tablet servers using a docker-compose generated by yb-lab

I create the demo table (the docker yb-tserver-0 is redirected on localhost port 5433):

psql -p 5433 -U yugabyte -d yugabyte <<SQL
create table franck(id int primary key, val int);
insert into franck select generate_series(1,10),0;
SQL

Enter fullscreen mode

Exit fullscreen mode

and set the connection properties for it:

cat > hikari.properties <<INI
dataSourceClassName=com.yugabyte.ysql.YBClusterAwareDataSource
dataSource.url=jdbc:yugabytedb://localhost:5433/yugabyte?user=yugabyte&load-balance=false
connectionInitSql= set default_transaction_isolation=serializable;
INI

Enter fullscreen mode

Exit fullscreen mode

I see some serialization errors, of course, but never reached the maximum number of retries:

 Thread-3     17 ms: 1474
 Thread-3     17 ms: 1475
 Thread-3     17 ms: 1476
 Thread-3     17 ms: 1477
 Thread-3     18 ms: 1478
 Thread-3     17 ms: 1479
 Thread-3     18 ms: 1480
 Thread-3     17 ms: 1481

update franck set val=val+1 returning val+1
Error in thread  Thread-0   6541 ms SQLSTATE(40001) - retry 1/10
com.yugabyte.util.PSQLException: ERROR: Operation failed. Try again.: [Operation failed. Try again. (yb/tablet/running_transaction.cc:456): Transaction aborted: 9ced2fc8-06d2-490c-b400-732e53013a56 (pgsql error 40001)]
 Thread-3     18 ms: 1482
 wait in thread  Thread-0     21 ms after   1 retries
 Thread-3     18 ms: 1483
 Thread-0     20 ms: 1484
 Thread-0     16 ms: 1485
 Thread-0     18 ms: 1486
 Thread-0     20 ms: 1487
 Thread-0     19 ms: 1488
 Thread-0     19 ms: 1489
 Thread-0     18 ms: 1490
 Thread-0     17 ms: 1491

Enter fullscreen mode

Exit fullscreen mode

After one hour running I got two threads reaching 9 retries.

So, when using Serializable isolation level, SQLSTATE 40001 (serialization_failure) must be retried and the maximum number of retried (and time to wait between retries) should be correctly tuned for your system.

High Availability

While the previous was running, I’ve stopped and started one tablet server with docker-compose restart yb-tserver-2. I had a few more waits for a few seconds (because the tablets that had their leader in the stopped node have to get a follower elected as the new leader). But still showing only SQLSTATE 40001.

However, I was connected only to yb-tserver-0 in the previous test. The load is balanced to the cluster, but the connection happens in one place. When the node I’m connected to is down, the application will continue by connecting to the other nodes (thanks to the connection pool and the cluster-aware JDBC driver). The transient errors may also be handled by the application.

 Thread-0     22 ms: 59648^M
 Thread-0     22 ms: 59649^M
 Thread-0     29 ms: 59650^M
53171 [Thread-3] WARN com.zaxxer.hikari.pool.ProxyConnection - HikariPool-1 - Connection com.yugabyte.jdbc.PgConnection@61336f01 marked as broken because of SQLSTATE(08006), ErrorCode(0)^M
53171 [Thread-0] WARN com.zaxxer.hikari.pool.ProxyConnection - HikariPool-1 - Connection com.yugabyte.jdbc.PgConnection@3b373f9c marked as broken because of SQLSTATE(08006), ErrorCode(0)^M
53171 [Thread-2] WARN com.zaxxer.hikari.pool.ProxyConnection - HikariPool-1 - Connection com.yugabyte.jdbc.PgConnection@2abffd27 marked as broken because of SQLSTATE(08006), ErrorCode(0)^M
53171 [Thread-4] WARN com.zaxxer.hikari.pool.ProxyConnection - HikariPool-1 - Connection com.yugabyte.jdbc.PgConnection@39961dd8 marked as broken because of SQLSTATE(08006), ErrorCode(0)^M
53171 [Thread-1] WARN com.zaxxer.hikari.pool.ProxyConnection - HikariPool-1 - Connection com.yugabyte.jdbc.PgConnection@683b2ab4 marked as broken because of SQLSTATE(08006), ErrorCode(0)^M
com.yugabyte.util.PSQLException: An I/O error occurred while sending to the backend.^M
  at com.yugabyte.core.v3.QueryExecutorImpl.execute(QueryExecutorImpl.java:382)^M
  at com.yugabyte.jdbc.PgStatement.executeInternal(PgStatement.java:490)^M
  at com.yugabyte.jdbc.PgStatement.execute(PgStatement.java:408)^M
  at com.yugabyte.jdbc.PgStatement.executeWithFlags(PgStatement.java:329)^M
  at com.yugabyte.jdbc.PgStatement.executeCachedSql(PgStatement.java:315)^M
  at com.yugabyte.jdbc.PgStatement.executeWithFlags(PgStatement.java:291)^M
  at com.yugabyte.jdbc.PgStatement.executeQuery(PgStatement.java:243)^M
  at com.zaxxer.hikari.pool.ProxyStatement.executeQuery(ProxyStatement.java:110)^M
  at com.zaxxer.hikari.pool.HikariProxyStatement.executeQuery(HikariProxyStatement.java)^M
  at YBDemo.run(YBDemo.java:38)^M

Enter fullscreen mode

Exit fullscreen mode

The SQLSTATE 08006 (connection_failure) should also be retried. There is no need to wait because, in YugabyteDB, other nodes are immediately available.

I also got some:

yb-lab-yb-demo-write-1  | Error in thread  Thread-1   1589ms SQLSTATE(XX000) - retry 0/5
yb-lab-yb-demo-write-1  | com.yugabyte.util.PSQLException: ERROR: no owned sequence found
yb-lab-yb-demo-write-1  |   Where: Catalog Version Mismatch: A DDL occurred while processing this query. Try again.

Enter fullscreen mode

Exit fullscreen mode

when testing different failover situations so I decided to retry ‘XX000’.

And the re-connection may also get a SQLTransientConnectionException which should alo be retried, without waiting because the connection pool has already a timeout setting

It is difficult to provide some generic code to handle this. If it was that easy, this would be automated within the database or the driver. And YugabyteDB has already many situation where conflicts are retried automatically. But some situation must be handled by the application:

did you do some non-transactional operations that risk to be done twice (examples: send e-mail, notifications…)
are you sure that you got the connection error before the changes were committed, or do you need to check the state before retry?

In summary

Any SQL database, because it is shared, and guarantees ACID properties to avoid data corruption, may raise transaction conflicts. Some situations can be retried automatically. Their sole consequence is a higher latency but, given the low probability of it, makes it better than pessimistic locking when you need to scale. Some other situations must be retried with care of the application logic, for which the database doesn’t know, especially if you don’t encapsulate all database actions into stored procedures.

Here is the repo where I have this demo program and docker-compose examples:

🐳 The best use of this is from the docker-compose environment set in ./docker/yb-lab

YBDemo

YBDemo is a simple Java program that creates an HikariCP connection pool from the hikari.properties file in the current directory, and takes SQL statements to execute as lines from stdin. There’s no multi-line statement: each line is a thread, executing the statement in a loop. The goal is to make it easy to run a demo with concurrent threads by providing the set of queries in a simple way, interactive or though file redirection. Only the first column of the first row is displayed, I use row_to_json() or json_agg() to format a larger result into one value. The thread stops if no row is returned, I use RETURNING to get a row from DML.

The goal is to run it on PostgreSQL compatible databases, especially distributed ones like YugabyteDB. The .jar includes the YugabyteDB…

Источник