Error tuple already updated by self

@derekjn

Saw this when starting up multiple topic -> stream consumers:

ERROR:  tuple already updated by self

@usmanm

Vinay @vinaygulani1 19:00
@usmanm thanks. I am getting an error "ERROR: tuple already updated by self
LOG: worker process: [kafka consumer] stream <- topic (PID 4171) exited with exit code 1."

@usmanm

This seems like a bug introduced recently with the consumer group stuff. Hadn’t seen it before.

@vinaygulani1

Any news on this?
@usmanm would you like to have a quick chat on this so I can see whether I can make the necessary changes? Please let me know.

@michaelsauter

@derekjn Any updates on this one? Seeing this pop up randomly after some time, and then it is quite hard to get rid of. Right now, I’m calling pipeline_kafka.consume_end, then removing the rows from pipeline_kafka.consumers and pipeline_kafka.offsets, and finally dropping the stream and view. Only then is it possible to set everything up again.

I only have one stream per topic, and one view per stream. I’m consuming 5-10 topics.
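For reference, the reset sequence described here might look roughly like the following sketch (the topic, stream, and view names are hypothetical, and the exact pipeline_kafka signatures may differ between versions):

-- Stop the Kafka consumer for the affected topic/stream pair (hypothetical names).
SELECT pipeline_kafka.consume_end('my_topic', 'my_stream');

-- Remove the consumer's bookkeeping rows. Shown here clearing everything;
-- in practice you would target only the rows for the affected consumer.
DELETE FROM pipeline_kafka.offsets;
DELETE FROM pipeline_kafka.consumers;

-- Drop the continuous view and its stream, then recreate everything from scratch.
DROP CONTINUOUS VIEW my_view;
DROP STREAM my_stream;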

@michaelsauter

I just noticed that when I look into pipeline_kafka.offsets for the consumers that die, the stored offsets look weird. The messages in Kafka are evenly distributed across the partitions, but the offsets in PipelineDB are not. Some partitions are at the latest offsets, but others are way behind …
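For anyone checking the same thing, the stored per-partition positions can be inspected directly; a sketch follows, with the caveat that the column names are assumptions and may differ between pipeline_kafka versions:

-- Compare stored positions across partitions for each consumer.
-- The consumer_id / partition / "offset" column names are assumed; adjust to
-- whatever \d pipeline_kafka.offsets reports.
SELECT consumer_id, partition, "offset"
FROM pipeline_kafka.offsets
ORDER BY consumer_id, partition;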

@derekjn

@derekjn Any updates on this one?

@michaelsauter I don’t believe we were ever able to repro this, and haven’t seen it since, but we’re more than happy to take a look at it.

Seeing this pop up randomly after some time. Then it is quite hard to get rid of it. Right now, I’m calling pipeline_kafka.consume_end, then removing from pipeline_kafka.consumers and pipeline_kafka.offsets, and finally dropping stream and view. Only then is it possible to setup up everything again.

Ugh, sorry to hear that—that’s not good at all. From what I can tell, the bad behavior you’re seeing looks like this:

  • You see a tuple already updated by self error thrown by a consumer process
  • The processes that throw this error die
  • The offsets are left in an unusable state after that

Please add anything I missed here or correct me if I am mistaken.

One question I have for you at this point: does the tuple already updated by self error appear repeatedly, or just once?

@michaelsauter

When tuple already updated by self appears, the consumer dies, it gets started again, tuple already updated by self appears again, and so on in an endless loop.

The only way to stop it is what I described above — ending consumption, deleting everything and starting again from scratch.

The continuous view I’m filling is an aggregation of the individual messages. The count column increases forever if that problem happens …

UPDATE:
Regarding my last point about the ever-increasing counts, I started the consumer with format := 'json', start_offset := -2 originally, and my assumption is that the consumer reads some messages over and over again because some offsets are not committed properly.
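For context, such a consumer would presumably have been started with a call along these lines (a sketch: the format and start_offset values come from the report above, while the topic and stream names are hypothetical):

-- start_offset := -2 maps to Kafka's OFFSET_BEGINNING, i.e. start from the
-- earliest retained message, so offsets that fail to commit will cause old
-- messages to be consumed again after a restart.
SELECT pipeline_kafka.consume_begin('my_topic', 'my_stream',
                                    format := 'json',
                                    start_offset := -2);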

@derekjn

@michaelsauter got it, this is helpful. We’re looking into this and will report back when we know more.

@michaelsauter

@derekjn Any news on this one?

It hasn’t occurred in the last week or so, but just now it happened again, right when I added a new stream/view/consumer; at that time another consumer, which had run fine for a long time, failed. Then I deleted that old consumer (with stream and view), recreated it, and now all are running again …

@derekjn

@michaelsauter nothing just yet, but that’s actually good information. We will stress test against that scenario and hopefully get to the bottom of this soon.

@derekjn

@michaelsauter can you try running the latest master now? We’ve had some trouble reproducing the exact behavior you’re describing but have a couple of theories as to what might be going wrong. We just pushed a small change that addresses one of these theories.

And if it does happen again, can you share with us anything at all that you see in the logs?

@michaelsauter

@derekjn Sorry for the late reply. Thanks for looking into it!

I haven’t found time yet to update to latest master, but I hope to get to it by the end of next week, and then I’ll report back here.

@michaelsauter

@derekjn I just encountered the issue again, then built a new image, and I’m running latest master now. So far no issues.

@derekjn

@derekjn I just encountered the issue again, then built a new image, and I’m running latest master now. So far no issues.

@michaelsauter just to clarify, did you run into the issue again on an older build, and now you’re running the latest master?

@michaelsauter

@derekjn Sorry for the confusion. I had the issue while still on the old build. Afterwards I made a new image from latest master and am running that now — no issues so far.

@derekjn

@michaelsauter

It was running fine since then. Today I stopped all consumers, dropped all views, and recreated them from the Kafka topics (we keep those messages for a long time). Unfortunately I have now encountered ERROR: tuple concurrently updated. So not the same problem, and the effect is different (the consumer exits but doesn’t go crazy), but there is still some issue …

I wonder if it has to do with load? When I recreate the views, CPU is at 100%, memory is fine.

On Tue, Jun 18, 2019 at 06:12:33PM -0500, Justin Pryzby wrote:
> A customer's DB crashed due to OOM. While investigating the issue in our
> report, I created MV stats, which causes this error:
>
> ts=# CREATE STATISTICS sectors_stats (dependencies) ON site_id,sect_id FROM sectors;
> CREATE STATISTICS
> ts=# ANALYZE sectors;
> ERROR: XX000: tuple already updated by self
> LOCATION: simple_heap_update, heapam.c:4613

> I’m guessing the issue is with pg_statistic_ext, which I haven’t touched.
>
> Next step seems to be to truncate pg_statistic{,ext} and re-analyze the DB.

Confirmed the issue is there.

ts=# analyze sectors;
ERROR: tuple already updated by self
ts=# begin; delete from pg_statistic_ext; analyze sectors;
BEGIN
DELETE 87
ANALYZE

On Tue, Jun 18, 2019 at 04:30:33PM -0700, Andres Freund wrote:
> Any chance to get a backtrace for the error?

Sure:

(gdb) bt
#0 errfinish (dummy=0) at elog.c:414
#1 0x000000000085e834 in elog_finish (elevel=<value optimized out>, fmt=<value optimized out>) at elog.c:1376
#2 0x00000000004b93bd in simple_heap_update (relation=0x7fee161700c8, otid=0x1fb7f44, tup=0x1fb7f40) at heapam.c:4613
#3 0x000000000051bdb7 in CatalogTupleUpdate (heapRel=0x7fee161700c8, otid=0x1fb7f44, tup=0x1fb7f40) at indexing.c:234
#4 0x000000000071e5ca in statext_store (onerel=0x7fee16140de8, totalrows=100843, numrows=100843, rows=0x1fd4028, natts=33260176, vacattrstats=0x1fb7ef0) at extended_stats.c:344
#5 BuildRelationExtStatistics (onerel=0x7fee16140de8, totalrows=100843, numrows=100843, rows=0x1fd4028, natts=33260176, vacattrstats=0x1fb7ef0) at extended_stats.c:130
#6 0x0000000000588346 in do_analyze_rel (onerel=0x7fee16140de8, options=2, params=0x7ffe5b6bf8b0, va_cols=0x0, acquirefunc=0x492b4, relpages=36, inh=true, in_outer_xact=false, elevel=13) at analyze.c:627
#7 0x00000000005891e1 in analyze_rel (relid=<value optimized out>, relation=0x1ea22a0, options=2, params=0x7ffe5b6bf8b0, va_cols=0x0, in_outer_xact=false, bstrategy=0x1f38090) at analyze.c:317
#8 0x00000000005fb689 in vacuum (options=2, relations=0x1f381f0, params=0x7ffe5b6bf8b0, bstrategy=<value optimized out>, isTopLevel=<value optimized out>) at vacuum.c:357
#9 0x00000000005fbafe in ExecVacuum (vacstmt=<value optimized out>, isTopLevel=<value optimized out>) at vacuum.c:141
#10 0x0000000000757a30 in standard_ProcessUtility (pstmt=0x1ea2410, queryString=0x1ea18c0 «ANALYZE sectors;», context=PROCESS_UTILITY_TOPLEVEL, params=0x0, queryEnv=0x0, dest=0x1ea26d0, completionTag=0x7ffe5b6bfdf0 «»)
at utility.c:670
#11 0x00007fee163a4344 in pgss_ProcessUtility (pstmt=0x1ea2410, queryString=0x1ea18c0 «ANALYZE sectors;», context=PROCESS_UTILITY_TOPLEVEL, params=0x0, queryEnv=0x0, dest=0x1ea26d0, completionTag=0x7ffe5b6bfdf0 «»)
at pg_stat_statements.c:1005
#12 0x0000000000753779 in PortalRunUtility (portal=0x1f1a8e0, pstmt=0x1ea2410, isTopLevel=<value optimized out>, setHoldSnapshot=<value optimized out>, dest=0x1ea26d0, completionTag=<value optimized out>) at pquery.c:1178
#13 0x000000000075464d in PortalRunMulti (portal=0x1f1a8e0, isTopLevel=true, setHoldSnapshot=false, dest=0x1ea26d0, altdest=0x1ea26d0, completionTag=0x7ffe5b6bfdf0 «») at pquery.c:1331
#14 0x0000000000754de8 in PortalRun (portal=0x1f1a8e0, count=9223372036854775807, isTopLevel=true, run_once=true, dest=0x1ea26d0, altdest=0x1ea26d0, completionTag=0x7ffe5b6bfdf0 «») at pquery.c:799
#15 0x0000000000751987 in exec_simple_query (query_string=0x1ea18c0 «ANALYZE sectors;») at postgres.c:1145
#16 0x0000000000752931 in PostgresMain (argc=<value optimized out>, argv=<value optimized out>, dbname=0x1edbad8 «ts», username=<value optimized out>) at postgres.c:4182
#17 0x00000000006e1ba7 in BackendRun (argc=<value optimized out>, argv=<value optimized out>) at postmaster.c:4358
#18 BackendStartup (argc=<value optimized out>, argv=<value optimized out>) at postmaster.c:4030
#19 ServerLoop (argc=<value optimized out>, argv=<value optimized out>) at postmaster.c:1707
#20 PostmasterMain (argc=<value optimized out>, argv=<value optimized out>) at postmaster.c:1380
#21 0x0000000000656210 in main (argc=3, argv=0x1e9c4d0) at main.c:228

#3 0x000000000051bdb7 in CatalogTupleUpdate (heapRel=0x7fee161700c8, otid=0x1fb7f44, tup=0x1fb7f40) at indexing.c:234
indstate = 0x1fb84a0
#4 0x000000000071e5ca in statext_store (onerel=0x7fee16140de8, totalrows=100843, numrows=100843, rows=0x1fd4028, natts=33260176, vacattrstats=0x1fb7ef0) at extended_stats.c:344
stup = 0x1fb7f40
oldtup = 0x7fee16158530
values = {0, 0, 0, 0, 0, 0, 0, 33260544}
nulls = {true, true, true, true, true, true, true, false}
replaces = {false, false, false, false, false, false, true, true}
#5 BuildRelationExtStatistics (onerel=0x7fee16140de8, totalrows=100843, numrows=100843, rows=0x1fd4028, natts=33260176, vacattrstats=0x1fb7ef0) at extended_stats.c:130
stat = <value optimized out>
stats = <value optimized out>
lc2 = <value optimized out>
ndistinct = <value optimized out>
dependencies = <value optimized out>
pg_stext = 0x7fee161700c8
lc = 0x1fb8290
stats = 0xfb6a172d
cxt = 0x1fb7de0
oldcxt = 0x1f6dd60
__func__ = «BuildRelationExtStatistics»

Ah: the table is an inheritance parent. If I uninherit its child, there's no
error during ANALYZE. MV stats on the child are OK:

ts=# CREATE STATISTICS vzw_sectors_stats (dependencies) ON site_id,sect_id FROM vzw_sectors;
CREATE STATISTICS
ts=# ANALYZE vzw_sectors;
ANALYZE
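For reference, a minimal reproduction along these lines might look like the sketch below (hypothetical table names; on affected builds the final ANALYZE has been reported to raise the error, since ANALYZE on an inheritance parent runs both a plain and an inheritance-tree pass over the same pg_statistic_ext entry):

-- Extended (MV) statistics on an inheritance parent (hypothetical names).
CREATE TABLE sectors_demo (site_id int, sect_id int);
CREATE TABLE sectors_demo_child () INHERITS (sectors_demo);
INSERT INTO sectors_demo       SELECT i % 100, i % 10 FROM generate_series(1, 10000) i;
INSERT INTO sectors_demo_child SELECT i % 100, i % 10 FROM generate_series(1, 10000) i;
CREATE STATISTICS sectors_demo_stats (dependencies) ON site_id, sect_id FROM sectors_demo;
ANALYZE sectors_demo;   -- expected on affected builds: ERROR: tuple already updated by self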

I'm not sure what the behavior is intended to be, and the other parent
tables I've added stats to are probably all relkind=p.

FWIW, we also have some FKs, like:

«sectors_site_id_fkey» FOREIGN KEY (site_id) REFERENCES sites(site_id)

Justin

Our database seems to be broken. Normally it uses about 1-2% of CPU, but if we run some additional backend services making UPDATE and INSERT queries against a 10M-row table (about one query per 3 seconds), everything goes to hell (including CPU usage increasing from 2% to 98%).

We decided to debug what’s going on and ran VACUUM and ANALYZE to learn what’s wrong with the DB, but…

production=# ANALYZE VERBOSE users_user;
INFO:  analyzing "public.users_user"
INFO:  "users_user": scanned 280 of 280 pages, containing 23889 live rows and 57 dead rows; 23889 rows in sample, 23889 estimated total rows
INFO:  analyzing "public.users_user"
INFO:  "users_user": scanned 280 of 280 pages, containing 23889 live rows and 57 dead rows; 23889 rows in sample, 23889 estimated total rows
ERROR:  tuple already updated by self

We are not able to finish ANALYZE on ANY of the tables and could not find any information about this issue. Any suggestions as to what could be wrong?

 PostgreSQL 9.6.8 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 4.8.5 20150623 (Red Hat 4.8.5-16), 64-bit

Additional info as requested in comments:

Maybe you have a corrupted pg_class

SELECT * FROM pg_class WHERE relname = 'users_user';

Output: https://pastebin.com/WhmkH34U

So the first thing to do would be to kick out all other sessions and try again.

There are no additional sessions. We dumped the whole DB onto a new testing server; the issue still occurs, and there are no clients connected to that DB.

I’d recommend starting the server with the following parameters before searching for duplicated rows:

enable_indexscan = off
enable_bitmapscan = off
ignore_system_indexes = on

If your server crashed, the indexes could be out of sync with the table data. This happens when corruption affects transaction visibility (pg_clog), for example.
Then search for duplicated rows in pg_class or pg_statistic, as mentioned in the comments.
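A sketch of that duplicate-row hunt on the two catalogs mentioned, assuming the corruption shows up as more than one visible row for what should be a unique key:

-- pg_statistic should have at most one row per (starelid, staattnum, stainherit).
SELECT starelid::regclass, staattnum, stainherit, count(*)
FROM pg_catalog.pg_statistic
GROUP BY starelid, staattnum, stainherit
HAVING count(*) > 1;

-- pg_class should never show the same oid twice.
SELECT oid, relname, count(*)
FROM pg_catalog.pg_class
GROUP BY oid, relname
HAVING count(*) > 1;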

You could also try to clean up pg_statistic. First, start the server with:

allow_system_table_mods = on

And then issue a TRUNCATE TABLE and ANALYZE afterward:

--Cleaning pg_statistic
TRUNCATE TABLE pg_catalog.pg_statistic;
--Analyze entire database
ANALYZE VERBOSE;

If the problem is in pg_statistic this should be enough.


andrewklau

Redeployment unable to start up again

Updated the resource limits for a postgresql-persistent 9.5 deployment

pg_ctl: another server might be running; trying to start server anyway
waiting for server to start....LOG:  redirecting log output to logging collector process
HINT:  Future log output will appear in directory "pg_log".
 done
server started
ERROR:  tuple already updated by self

It seems the first pod did not shut down cleanly and left its PID in the /var/lib/pgsql/data/userdata/postmaster.pid file on the volume, thus preventing the container from starting up again without manual intervention.

Perhaps an edge case, as this is the first time I have seen this across many other postgresql deployments.

pedro-dlfa

Hello, I’m facing the same problem here: the container is unable to start, with the same output as detailed above. Has anyone found a solution or workaround for this problem? I quickly tried removing the /var/lib/pgsql/data/userdata/postmaster.pid file, but when starting the container I get the same issue.

EDIT: I double checked, and in my case the output is:

 pg_ctl: another server might be running; trying to start server anyway
 waiting for server to start....LOG:  redirecting log output to logging collector process
 HINT:  Future log output will appear in directory "pg_log".
 ... done
 server started
 => sourcing /usr/share/container-scripts/postgresql/start/set_passwords.sh ...
 ERROR:  tuple already updated by self

praiskup

Thanks for the report. Interesting, @pedro-dlfa: so you manually dropped the pid file, and immediately after that the container again refused to start (because of the pid file)? Smells like pg_ctl stop isn’t really doing what we expect.

martin123218

Hello, I am facing the same issue as @pedro-dlfa. I tried to delete the file manually and redeploy the pod, but with no success. My solution is to recreate the pod. I am using postgresql version 9.5.

praiskup

If you are affected by this, can you confirm that the deployment strategy is Recreate?

martin123218

Hi,
My strategy is Rolling, and I was affected again last week.

pkubatrh

Hi @martin123218

The problem with the Rolling strategy is that it tells OpenShift to first create a new pod with the same data volume as the old one, and only shut down the original pod once the new pod is up and running. Since two pods are then accessing (and presumably writing to) the same data volume at the same time, you can run into this issue.

Please use the Recreate strategy instead. There will be some downtime since the new pod is only started after the old pod gets shut down but you should not run into this issue anymore.

bkabrda

I’ve also just run into this issue. Is there a way to make this work with the "Rolling" strategy, to get zero-downtime upgrades?

praiskup

Not with this trivial layout. This problem is equivalent to the non-container scenario where you do dnf update postgresql-server: you have to shut down the old server and start a new one, i.e. you cannot let two servers write into the same data directory.

Btw, the PostgreSQL server has a guard against the "multiple servers writing to the same data directory" situation, but unfortunately in the container scenario it has a deterministic PID (PID=1). So the concurrent PostgreSQL server (in a different container) checks the pid/lock file, compares the PID with its own PID, and assumes "I’m PID 1, so the PID file is some leftover from a previous run." It then removes the PID file and continues modifying the data directory. This has disaster potential.

Our templates only support the Recreate strategy. The fact that Rolling "mostly" works is a matter of luck, in that the old server happens not to be under heavy load.

That said, the zero-downtime problem needs to be solved at a higher logical layer.

bkabrda

Ok, that makes sense, thanks. If I wanted to solve this at a higher logical layer, how would I go about it? Do you have any good pointers?

praiskup

At this point, you’d have to start thinking about pgpool or a similar tool (I’d prefer to have a separate issue for such RFEs, so as not to go off-topic in this bug report).

praiskup

This issue seems to be caused by a concurrent run of multiple postgresql containers against the same data directory (persistent VOLUME), e.g. caused by the Rolling strategy in OpenShift.

I’ve heard an idea that it could also happen if OpenShift happens to be moving the container to an idle state (because the HA proxy decided so) while, during that time, some traffic causes the container to be woken up (i.e. a new container is started even before the old one was successfully idled). Is anyone able to confirm that this could happen?

Anyway, I’d like to hear opinions on how to handle this situation properly, i.e. how to protect against over-mounting the same storage, since detecting this reliably from within the container seems to be close to a hard problem. The only way that comes to my mind is implementing a "best effort" guard via some daemon that implements a "leader election" mechanism. Any links to how others do this?

We might delegate this to OpenShift operators, but I suspect that the templates will have to stay supported anyway, or at least that postgresql-container should also be usable from (some) templates; thus the problem won’t disappear for non-operator use cases, or for plain docker and podman use cases.

flo-ryan

Hi, I’m facing the same issue while using the Recreate strategy. Deleting the postmaster.pid also did not help, as I got the same error at the next pod startup. Any idea how to fix or work around this?

ShaunDave

Had this problem after an issue with the underlying node caused it to terminate very ungracefully. A new pod got spun up (as it is supposed to) on a new node, but the container got stuck in a crash loop with this exact error message. Surely there needs to be an automated way to get around this problem? Especially because only a single replica is supported, there’s not a lot of wiggle room for high availability if the container can’t start.

hhorak

We’ve also had an offline discussion with Daniel Messer from RH, whose team has hit this problem as well. After changing the strategy to Recreate, the problem seems to disappear, but a good point was made about testing the crash scenario in the CI tests (run the OpenShift template, then kill the pod or the postgres daemon directly). This seems like a good addition to our test coverage.
