Failures in test_scrubber_physical_gc_ancestors #10391

Closed

jcsp opened this issue Jan 14, 2025 · 10 comments
Labels: a/test (Area: related to testing), c/storage/scrubber (Component: s3_scrubber), t/bug (Issue Type: Bug)

Comments

jcsp added the a/test, c/storage/scrubber, and t/bug labels on Jan 14, 2025
arpad-m (Member) commented Jan 20, 2025

For the first one, apparently the storage controller believes that the shard is not ready to be promoted from secondary to primary:

2025-01-14T13:51:51.847992Z  INFO request{method=POST path=/debug/v1/reconcile_all request_id=a8d670f7-343f-4df5-9266-15294071178a}: Identified optimization for attachment: ScheduleOptimization { sequence: 1, action: MigrateAttachment(MigrateAttachment { old_attached_node_id: NodeId(1), new_attached_node_id: NodeId(2) }) } tenant_shard_id=9d7bbb52079070cee36d24b69036740f-0104
2025-01-14T13:51:51.849668Z  INFO request{method=POST path=/debug/v1/reconcile_all request_id=a8d670f7-343f-4df5-9266-15294071178a}: Skipping migration of 9d7bbb52079070cee36d24b69036740f-0104 to 2 (localhost) because secondary isn't ready: SecondaryProgress { heatmap_mtime: None, layers_downloaded: 0, layers_total: 0, bytes_downloaded: 0, bytes_total: 0 }

As the shard is then always believed to require reconciling, the result is that the storage controller reconciliation times out.

This makes me wonder about the way SecondaryProgress is calculated: maybe it doesn't recognize layers that were hard-linked in after a shard split?
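A minimal Python sketch of that hypothesis (purely illustrative, not the storage controller's actual code; the field names are copied from the log line above): a readiness check based only on download counters would never consider a secondary ready if its layers were created via hard links rather than downloads.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SecondaryProgress:
    heatmap_mtime: Optional[float]
    layers_downloaded: int
    layers_total: int
    bytes_downloaded: int
    bytes_total: int


def looks_ready(progress: SecondaryProgress) -> bool:
    # If readiness is judged purely by download progress, layers that were
    # hard-linked into place (rather than downloaded) are never counted,
    # so the secondary is reported as "not ready" indefinitely.
    return (
        progress.heatmap_mtime is not None
        and progress.layers_total > 0
        and progress.layers_downloaded == progress.layers_total
    )


# The progress reported in the log line above:
print(looks_ready(SecondaryProgress(None, 0, 0, 0, 0)))  # -> False
```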

arpad-m (Member) commented Jan 20, 2025

Regarding the second one: comparing the logs of a reproducing instance with one that works, GC on the pageservers is apparently keeping some layers:

2025-01-14T15:06:44.983090Z  INFO request{method=PUT path=/v1/tenant/2c39827c0cc56e310d75a0348f2bb628-0104/timeline/a82e75964ffcbfcdff620930c13a2952/do_gc request_id=fc2eae87-aec5-4cd6-8c8a-88bfd365c7fe}:immediate_gc{tenant_id=2c39827c0cc56e310d75a0348f2bb628 shard_id=0104 timeline_id=a82e75964ffcbfcdff620930c13a2952}:gc_timeline{timeline_id=a82e75964ffcbfcdff620930c13a2952 cutoff=0/15C0520}: keeping 000000067F0000000500000DF60000000001-000000067F000000050000403D0100000000__00000000015C0520 because it's newer than space_cutoff 0/15C0520
2025-01-14T15:06:44.983101Z  INFO request{method=PUT path=/v1/tenant/2c39827c0cc56e310d75a0348f2bb628-0104/timeline/a82e75964ffcbfcdff620930c13a2952/do_gc request_id=fc2eae87-aec5-4cd6-8c8a-88bfd365c7fe}:immediate_gc{tenant_id=2c39827c0cc56e310d75a0348f2bb628 shard_id=0104 timeline_id=a82e75964ffcbfcdff620930c13a2952}:gc_timeline{timeline_id=a82e75964ffcbfcdff620930c13a2952 cutoff=0/15C0520}: keeping 000000067F000000050000403D0100000000-030000000000000000000000000000000002__00000000015C0520 because it's newer than space_cutoff 0/15C0520

arpad-m (Member) commented Jan 20, 2025

Ah, so apparently the issue is not that: those "later" layers get shown as "latest" in the non-reproducing case too. The more interesting part is the other image layers, which have earlier LSNs: those get shown as "latest" in the reproducing case where the assertion fails, and get deleted in the non-reproducing case.

github-merge-queue bot pushed a commit that referenced this issue Jan 21, 2025

We currently have some flakiness in
`test_scrubber_physical_gc_ancestors`, see #10391.

The first flakiness kind is about the reconciler not actually becoming idle within the timeout of 30 seconds. We see continuous forward progress, so this is likely not a hang. We also see this happen in parallel with a test failure, so it is likely due to the runners being overloaded. Therefore, we increase the timeout.

The second flakiness kind is an assertion failure. This one is a little trickier, but we saw in the successful run that the LSN advanced between the compaction run (which created the layer files) and the GC run. Apparently GC rejects reducing things down to the single image layer if the cutoff LSN is the same as the LSN of the image layer: it will claim that that layer is newer than the space cutoff and therefore skip it, while considering the old layer (the one we want to delete) to be the latest one (so it isn't deleted either).

We address the second flakiness kind by inserting a tiny amount of WAL between the compaction and the GC. This should hopefully fix things.

Related issue: #10391

(Not closing the issue with the merge of this PR, as we'll need to validate that these changes had the intended effect.)

Thanks to Chi for going over this together with me in a call.
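To illustrate the cutoff comparison described in the commit message above, here is a minimal Python sketch (illustrative only, not the pageserver's GC code; the LSN values are taken from the log lines earlier in the thread):

```python
def gc_decision(layer_end_lsn: int, space_cutoff: int, is_latest_below_cutoff: bool) -> str:
    # A layer whose end LSN is >= the space cutoff counts as "newer than the
    # cutoff" and is kept; it then also doesn't qualify as the latest layer
    # *below* the cutoff, so the older layer is kept as well ("latest").
    if layer_end_lsn >= space_cutoff:
        return "keep (newer than space_cutoff)"
    if is_latest_below_cutoff:
        return "keep (latest layer below cutoff)"
    return "delete"


SPACE_CUTOFF = 0x015C0520  # cutoff 0/15C0520 from the log above

# The freshly created image layer ends exactly at the cutoff LSN: it is kept...
print(gc_decision(0x015C0520, SPACE_CUTOFF, is_latest_below_cutoff=False))
# ...and the old layer we wanted deleted is treated as the "latest" below the cutoff:
print(gc_decision(0x01234567, SPACE_CUTOFF, is_latest_below_cutoff=True))

# Writing a little WAL between compaction and GC moves the cutoff past the
# image layer's LSN, so the image layer becomes the latest below the cutoff
# and the old layer can be deleted.
```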
alexanderlaw commented:
And another kind of test failure:
https://neon-github-public-dev.s3.amazonaws.com/reports/pr-10474/12903916907/index.html#/testresult/406cebf420f09f1c

repo/pageserver_2/pageserver.log:
2025-01-22T08:48:45.650888Z ERROR page_service_conn_main{peer_addr=127.0.0.1:40892}:process_query{tenant_id=e7520b3ff796624e4cb9b3467f97a9c1 timeline_id=99a506f6dbe4c64461b1490fdd1a41e5}:handle_pagerequests:request:handle_get_page_at_lsn_request_batched{req_lsn=FFFFFFFF/FFFFFFFF}: error reading relation or page version: Read error: whole vectored get request failed because one or more of the requested keys were missing: could not find data for key 000000067F0000000500000A650000000000 (shard ShardNumber(1)) at LSN 0/1602CF9, request LSN 0/1602CF8, ancestor 0/0

https://neon-github-public-dev.s3.amazonaws.com/reports/pr-10474/12903916907/index.html#/testresult/19e54fa1e2aba0dd
2025-01-22T10:43:06.149320Z ERROR page_service_conn_main{peer_addr=127.0.0.1:55568}:process_query{tenant_id=6e8b4e9883827dcbf3db75116946f0f5 timeline_id=4ba3ea97649f4ae3f8bc2b50bd3caecf}:handle_pagerequests:request:handle_get_page_at_lsn_request_batched{req_lsn=FFFFFFFF/FFFFFFFF}: error reading relation or page version: Read error: whole vectored get request failed because one or more of the requested keys were missing: could not find data for key 000000067F0000000500000A650000000000 (shard ShardNumber(1)) at LSN 0/1604FB9, request LSN 0/1604FB8, ancestor 0/0

erikgrinaker (Contributor) commented:

Another one here: https://neon-github-public-dev.s3.amazonaws.com/reports/pr-10402/12896916261/index.html#suites/616e84f65c91fe4bc748db7447d35268/c29c08518f82ced5/

ERROR page_service_conn_main{peer_addr=127.0.0.1:40122}:process_query{tenant_id=3af0ded1f49e5727cbf2a7d60306e573 timeline_id=30f2aa67a815353789dccc6c14b53176}:handle_pagerequests:request:handle_get_page_at_lsn_request_batched{req_lsn=FFFFFFFF/FFFFFFFF}: error reading relation or page version: Read error: whole vectored get request failed because one or more of the requested keys were missing: could not find data for key 000000067F0000000500000A650000000000 (shard ShardNumber(1)) at LSN 0/15C73F9, request LSN 0/15C73F8, ancestor 0/0

arpad-m (Member) commented Jan 22, 2025

Hmmm yeah it seems it fails way more often now than it used to.

"could not find data for key" is a bit scary, in theory the key should exist.

[screenshot: flaky test occurrence statistics]

The "fix" in #10457 merged 26 hours ago and there have been 40 instances of flakiness since then, while in the 6 days before that there have only been 34 (admittedly two are workdays but still).

arpad-m (Member) commented Jan 22, 2025

When there is a compute error, I see the following (link) in the pageserver logs:

2025-01-22T13:51:01.892177Z ERROR wal_connection_manager{tenant_id=afd1356626f76a8b1a98f086e8a090e4 shard_id=0104 timeline_id=85ee936bd4de54b3637a148fb1ffd4c9}: wal receiver task finished with an error: walreceiver connection handling failure

Caused by:
    0: could not ingest record at 0/15C94E8
    1: could not find data for key 020000000000000000000000000000000000 (shard ShardNumber(0)) at LSN 0/15C94E9, request LSN 0/15C94E8, ancestor 0/0

Looks like an endpoint making a misdirected request (a read of a shard 0 key sent to a pageserver that holds shard 0104).

alexanderlaw commented Jan 22, 2025

I've reproduced these scary errors with test_pageserver_gc_compaction_smoke too, by running 8 test instances in parallel, even on 4c4cb80 (dated 2024-12-09):
compute.log:

PG:2025-01-22 16:19:17.514 GMT [976688] ERROR:  [NEON_SMGR] [shard 0] could not read block 162 in rel 1663/5/16384.0 from page server at lsn 0/09BB7660
PG:2025-01-22 16:19:17.514 GMT [976688] DETAIL:  page server returned error: Read error

pageserver.log:

2025-01-22T16:18:51.405445Z  INFO wal_connection_manager{tenant_id=584a20da29656b33e10cbb8ce14da0e1 shard_id=0000 timeline_id=5d722464ed806f311b1748d2964f9a0a}:connection{node_id=1}:tokio_epoll_uring_ext::thread_local_system{thread_local=12 attempt_no=0}: successfully launched system
...
2025-01-22T16:19:17.514097Z ERROR page_service_conn_main{peer_addr=[::1]:55676}:process_query{tenant_id=584a20da29656b33e10cbb8ce14da0e1 timeline_id=5d722464ed806f311b1748d2964f9a0a}:handle_pagerequests:request:handle_get_page_at_lsn_request_batched{req_lsn=FFFFFFFF/FFFFFFFF}: error reading relation or page version: Read error: whole vectored get request failed because one or more of the requested keys were missing: could not find data for key 000000067F000000050000400000000000A2 (shard ShardNumber(0)) at LSN 0/9BB7661, request LSN 0/9BB7660, ancestor 0/0
...
2025-01-22T16:19:18.602413Z  INFO wal_connection_manager{tenant_id=584a20da29656b33e10cbb8ce14da0e1 shard_id=0000 timeline_id=5d722464ed806f311b1748d2964f9a0a}:connection{node_id=1}: walreceiver connection handling ended: connection closed

arpad-m (Member) commented Jan 22, 2025

@alexanderlaw that might be a different issue; maybe it makes sense to file a separate issue for it? Edit: done, see #10482.

For the issue in test_scrubber_physical_gc_ancestors, it's not really possible that the timeout increase in #10457 caused it: even if every run that previously hit the timeout now produced these errors instead, the total number of failures wouldn't have gone up, because a run that finished within the old timeout is still within the new, larger one.

So the cause must be the churn_rows call added by #10457; in fact, that's where the backtrace points.

It also makes sense if you think about it: in the loop that we run for each of the (four) shards, we:

  • compact that shard, creating image layers at the latest LSN
  • call churn_rows, advancing the WAL
  • GC all history, including everything before the latest LSN

For the first shard this works. From the second shard onwards, however, we end up in a situation where starting an endpoint at anything but the latest LSN no longer works for the first shard. That is probably why this fails.

The only thing I can't explain is why it's flaky. If my theory is accurate, it should always fail. But no clue.

In any case, I have filed #10481. Maybe this will fix things?

github-merge-queue bot pushed a commit that referenced this issue Jan 22, 2025
## Problem

PR #10457 was supposed to fix the flakiness of `test_scrubber_physical_gc_ancestors`, but instead it made it even more flaky. However, the original error causes disappeared, replaced by "key not found" errors.

See this for a longer explanation:
#10391 (comment)

## Solution

This does one churn_rows call after all compactions and before any of the timeline GCs. That way, older LSNs remain accessible.
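A minimal pseudocode-style Python sketch of the two orderings (the helper names compact_shard, churn_rows, and gc_timeline are hypothetical stand-ins, not the test's real functions):

```python
def flaky_ordering(shards, compact_shard, churn_rows, gc_timeline):
    # Pre-#10481 ordering: each shard churns and GCs right after its own
    # compaction, so later iterations GC away history that the earlier
    # shards would still need to serve reads below the latest LSN.
    for shard in shards:
        compact_shard(shard)   # image layers at the then-latest LSN
        churn_rows()           # advance the WAL past that LSN
        gc_timeline(shard)     # collect history below the new cutoff


def fixed_ordering(shards, compact_shard, churn_rows, gc_timeline):
    # #10481 ordering: compact everything first, churn once, then GC each
    # shard, so all shards keep coverage at LSNs an endpoint may still read.
    for shard in shards:
        compact_shard(shard)
    churn_rows()
    for shard in shards:
        gc_timeline(shard)
```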
arpad-m (Member) commented Jan 23, 2025

[screenshot: flaky test occurrence statistics after the fix]

The error hasn't occurred any more since the PR was merged; I think it's resolved.

For the [None] case, it happened twice, but I'd attribute that to the PR using an outdated git state. In any case, even there it's far fewer occurrences than yesterday. Closing the issue.

@arpad-m arpad-m closed this as completed Jan 23, 2025