Failures in test_scrubber_physical_gc_ancestors #10391

Closed

jcsp opened this issue Jan 14, 2025 · 10 comments
Labels: a/test (Area: related to testing), c/storage/scrubber (Component: s3_scrubber), t/bug (Issue Type: Bug)

Comments

jcsp added the a/test, c/storage/scrubber, and t/bug labels on Jan 14, 2025
arpad-m (Member) commented Jan 20, 2025

For the first one, apparently the storage controller believes that the shard is not ready to be promoted from secondary to primary:

2025-01-14T13:51:51.847992Z  INFO request{method=POST path=/debug/v1/reconcile_all request_id=a8d670f7-343f-4df5-9266-15294071178a}: Identified optimization for attachment: ScheduleOptimization { sequence: 1, action: MigrateAttachment(MigrateAttachment { old_attached_node_id: NodeId(1), new_attached_node_id: NodeId(2) }) } tenant_shard_id=9d7bbb52079070cee36d24b69036740f-0104
2025-01-14T13:51:51.849668Z  INFO request{method=POST path=/debug/v1/reconcile_all request_id=a8d670f7-343f-4df5-9266-15294071178a}: Skipping migration of 9d7bbb52079070cee36d24b69036740f-0104 to 2 (localhost) because secondary isn't ready: SecondaryProgress { heatmap_mtime: None, layers_downloaded: 0, layers_total: 0, bytes_downloaded: 0, bytes_total: 0 }

As the shard is then always believed to require reconciling, the result is that the storage controller reconciliation times out.

This makes me wonder about the way SecondaryProgress is calculated: maybe it doesn't recognize layers that were hard-linked in after a shard split?
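A minimal Python sketch of that hypothesis (purely illustrative, not the storage controller's actual code; the field names are copied from the log line above): a readiness check based only on download counters would never consider a secondary ready if its layers were created via hard links rather than downloads.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SecondaryProgress:
    heatmap_mtime: Optional[float]
    layers_downloaded: int
    layers_total: int
    bytes_downloaded: int
    bytes_total: int


def looks_ready(progress: SecondaryProgress) -> bool:
    # If readiness is judged purely by download progress, layers that were
    # hard-linked into place (rather than downloaded) are never counted,
    # so the secondary is reported as "not ready" indefinitely.
    return (
        progress.heatmap_mtime is not None
        and progress.layers_total > 0
        and progress.layers_downloaded == progress.layers_total
    )


# The progress reported in the log line above:
print(looks_ready(SecondaryProgress(None, 0, 0, 0, 0)))  # -> False
```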

arpad-m (Member) commented Jan 20, 2025

Regarding the second one: comparing the logs of a reproducing instance with one that works, GC on the pageservers is apparently keeping some layers:

2025-01-14T15:06:44.983090Z  INFO request{method=PUT path=/v1/tenant/2c39827c0cc56e310d75a0348f2bb628-0104/timeline/a82e75964ffcbfcdff620930c13a2952/do_gc request_id=fc2eae87-aec5-4cd6-8c8a-88bfd365c7fe}:immediate_gc{tenant_id=2c39827c0cc56e310d75a0348f2bb628 shard_id=0104 timeline_id=a82e75964ffcbfcdff620930c13a2952}:gc_timeline{timeline_id=a82e75964ffcbfcdff620930c13a2952 cutoff=0/15C0520}: keeping 000000067F0000000500000DF60000000001-000000067F000000050000403D0100000000__00000000015C0520 because it's newer than space_cutoff 0/15C0520
2025-01-14T15:06:44.983101Z  INFO request{method=PUT path=/v1/tenant/2c39827c0cc56e310d75a0348f2bb628-0104/timeline/a82e75964ffcbfcdff620930c13a2952/do_gc request_id=fc2eae87-aec5-4cd6-8c8a-88bfd365c7fe}:immediate_gc{tenant_id=2c39827c0cc56e310d75a0348f2bb628 shard_id=0104 timeline_id=a82e75964ffcbfcdff620930c13a2952}:gc_timeline{timeline_id=a82e75964ffcbfcdff620930c13a2952 cutoff=0/15C0520}: keeping 000000067F000000050000403D0100000000-030000000000000000000000000000000002__00000000015C0520 because it's newer than space_cutoff 0/15C0520

arpad-m (Member) commented Jan 20, 2025

Ah, so apparently the issue is not that: those "later" layers get shown as "latest" in the non-reproducing case too. The more interesting part is the other image layers, which have earlier LSNs: those get shown as "latest" in the reproducing case where the assertion fails, and get deleted in the non-reproducing case.

github-merge-queue bot pushed a commit that referenced this issue Jan 21, 2025

We currently have some flakiness in
`test_scrubber_physical_gc_ancestors`, see #10391.

The first flakiness kind is about the reconciler not actually becoming idle within the timeout of 30 seconds. We see continuous forward progress, so this is likely not a hang. We also see this happen in parallel with a test failure, so it is likely due to the runners being overloaded. Therefore, we increase the timeout.

The second flakiness kind is an assertion failure. This one is a little trickier, but we saw in the successful run that the LSN advanced between the compaction run (which created the layer files) and the GC run. Apparently GC rejects reducing things down to the single image layer if the cutoff LSN is the same as the LSN of the image layer: it will claim that that layer is newer than the space cutoff and therefore skip it, while considering the old layer (the one we want to delete) to be the latest one (so it isn't deleted either).

We address the second flakiness kind by inserting a tiny amount of WAL between the compaction and the GC. This should hopefully fix things.

Related issue: #10391

(Not closing the issue with the merge of this PR, as we'll need to validate that these changes had the intended effect.)

Thanks to Chi for going over this together with me in a call.
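To illustrate the cutoff comparison described in the commit message above, here is a minimal Python sketch (illustrative only, not the pageserver's GC code; the LSN values are taken from the log lines earlier in the thread):

```python
def gc_decision(layer_end_lsn: int, space_cutoff: int, is_latest_below_cutoff: bool) -> str:
    # A layer whose end LSN is >= the space cutoff counts as "newer than the
    # cutoff" and is kept; it then also doesn't qualify as the latest layer
    # *below* the cutoff, so the older layer is kept as well ("latest").
    if layer_end_lsn >= space_cutoff:
        return "keep (newer than space_cutoff)"
    if is_latest_below_cutoff:
        return "keep (latest layer below cutoff)"
    return "delete"


SPACE_CUTOFF = 0x015C0520  # cutoff 0/15C0520 from the log above

# The freshly created image layer ends exactly at the cutoff LSN: it is kept...
print(gc_decision(0x015C0520, SPACE_CUTOFF, is_latest_below_cutoff=False))
# ...and the old layer we wanted deleted is treated as the "latest" below the cutoff:
print(gc_decision(0x01234567, SPACE_CUTOFF, is_latest_below_cutoff=True))

# Writing a little WAL between compaction and GC moves the cutoff past the
# image layer's LSN, so the image layer becomes the latest below the cutoff
# and the old layer can be deleted.
```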
alexanderlaw commented:
And another kind of test failure:
https://neon-github-public-dev.s3.amazonaws.com/reports/pr-10474/12903916907/index.html#/testresult/406cebf420f09f1c

repo/pageserver_2/pageserver.log:
2025-01-22T08:48:45.650888Z ERROR page_service_conn_main{peer_addr=127.0.0.1:40892}:process_query{tenant_id=e7520b3ff796624e4cb9b3467f97a9c1 timeline_id=99a506f6dbe4c64461b1490fdd1a41e5}:handle_pagerequests:request:handle_get_page_at_lsn_request_batched{req_lsn=FFFFFFFF/FFFFFFFF}: error reading relation or page version: Read error: whole vectored get request failed because one or more of the requested keys were missing: could not find data for key 000000067F0000000500000A650000000000 (shard ShardNumber(1)) at LSN 0/1602CF9, request LSN 0/1602CF8, ancestor 0/0

https://neon-github-public-dev.s3.amazonaws.com/reports/pr-10474/12903916907/index.html#/testresult/19e54fa1e2aba0dd
2025-01-22T10:43:06.149320Z ERROR page_service_conn_main{peer_addr=127.0.0.1:55568}:process_query{tenant_id=6e8b4e9883827dcbf3db75116946f0f5 timeline_id=4ba3ea97649f4ae3f8bc2b50bd3caecf}:handle_pagerequests:request:handle_get_page_at_lsn_request_batched{req_lsn=FFFFFFFF/FFFFFFFF}: error reading relation or page version: Read error: whole vectored get request failed because one or more of the requested keys were missing: could not find data for key 000000067F0000000500000A650000000000 (shard ShardNumber(1)) at LSN 0/1604FB9, request LSN 0/1604FB8, ancestor 0/0

erikgrinaker (Contributor) commented:

Another one here: https://neon-github-public-dev.s3.amazonaws.com/reports/pr-10402/12896916261/index.html#suites/616e84f65c91fe4bc748db7447d35268/c29c08518f82ced5/

ERROR page_service_conn_main{peer_addr=127.0.0.1:40122}:process_query{tenant_id=3af0ded1f49e5727cbf2a7d60306e573 timeline_id=30f2aa67a815353789dccc6c14b53176}:handle_pagerequests:request:handle_get_page_at_lsn_request_batched{req_lsn=FFFFFFFF/FFFFFFFF}: error reading relation or page version: Read error: whole vectored get request failed because one or more of the requested keys were missing: could not find data for key 000000067F0000000500000A650000000000 (shard ShardNumber(1)) at LSN 0/15C73F9, request LSN 0/15C73F8, ancestor 0/0

arpad-m (Member) commented Jan 22, 2025

Hmmm yeah it seems it fails way more often now than it used to.

"could not find data for key" is a bit scary, in theory the key should exist.

[screenshot: flaky test occurrence statistics]

The "fix" in #10457 merged 26 hours ago and there have been 40 instances of flakiness since then, while in the 6 days before that there have only been 34 (admittedly two are workdays but still).

arpad-m (Member) commented Jan 22, 2025

When there is a compute error, I see the following (link) in the pageserver logs:

2025-01-22T13:51:01.892177Z ERROR wal_connection_manager{tenant_id=afd1356626f76a8b1a98f086e8a090e4 shard_id=0104 timeline_id=85ee936bd4de54b3637a148fb1ffd4c9}: wal receiver task finished with an error: walreceiver connection handling failure

Caused by:
    0: could not ingest record at 0/15C94E8
    1: could not find data for key 020000000000000000000000000000000000 (shard ShardNumber(0)) at LSN 0/15C94E9, request LSN 0/15C94E8, ancestor 0/0

Looks like an endpoint making a misdirected request (a read of a shard 0 key sent to a pageserver that holds shard 0104).

alexanderlaw commented Jan 22, 2025

I've reproduced these scary errors with test_pageserver_gc_compaction_smoke too, by running 8 test instances in parallel, even on 4c4cb80 (dated 2024-12-09):
compute.log:

PG:2025-01-22 16:19:17.514 GMT [976688] ERROR:  [NEON_SMGR] [shard 0] could not read block 162 in rel 1663/5/16384.0 from page server at lsn 0/09BB7660
PG:2025-01-22 16:19:17.514 GMT [976688] DETAIL:  page server returned error: Read error

pageserver.log:

2025-01-22T16:18:51.405445Z  INFO wal_connection_manager{tenant_id=584a20da29656b33e10cbb8ce14da0e1 shard_id=0000 timeline_id=5d722464ed806f311b1748d2964f9a0a}:connection{node_id=1}:tokio_epoll_uring_ext::thread_local_system{thread_local=12 attempt_no=0}: successfully launched system
...
2025-01-22T16:19:17.514097Z ERROR page_service_conn_main{peer_addr=[::1]:55676}:process_query{tenant_id=584a20da29656b33e10cbb8ce14da0e1 timeline_id=5d722464ed806f311b1748d2964f9a0a}:handle_pagerequests:request:handle_get_page_at_lsn_request_batched{req_lsn=FFFFFFFF/FFFFFFFF}: error reading relation or page version: Read error: whole vectored get request failed because one or more of the requested keys were missing: could not find data for key 000000067F000000050000400000000000A2 (shard ShardNumber(0)) at LSN 0/9BB7661, request LSN 0/9BB7660, ancestor 0/0
...
2025-01-22T16:19:18.602413Z  INFO wal_connection_manager{tenant_id=584a20da29656b33e10cbb8ce14da0e1 shard_id=0000 timeline_id=5d722464ed806f311b1748d2964f9a0a}:connection{node_id=1}: walreceiver connection handling ended: connection closed

arpad-m (Member) commented Jan 22, 2025

@alexanderlaw that might be a different issue; maybe it makes sense to file a separate issue for it? Edit: done, see #10482.

For the issue in test_scrubber_physical_gc_ancestors, it's not really possible that the timeout increase in #10457 caused it: even if every run that previously hit the timeout now produced these errors instead, the total number of failures wouldn't have gone up, because a run that finished within the old timeout is still within the new, larger one.

So the cause must be the churn_rows call added by #10457; in fact, that's where the backtrace points.

It also makes sense if you think about it: in the loop that we run for each of the (four) shards, we:

  • compact that shard, creating image layers at the latest LSN
  • call churn_rows, advancing the WAL
  • GC all history, including everything before the latest LSN

For the first shard this works. From the second shard onwards, however, we end up in a situation where starting an endpoint at anything but the latest LSN no longer works for the first shard. That is probably why this fails.

The only thing I can't explain is why it's flaky. If my theory is accurate, it should always fail. But no clue.

In any case, I have filed #10481. Maybe this will fix things?

github-merge-queue bot pushed a commit that referenced this issue Jan 22, 2025
## Problem

PR #10457 was supposed to fix the flakiness of `test_scrubber_physical_gc_ancestors`, but instead it made it even more flaky. However, the original error causes disappeared, replaced by "key not found" errors.

See this for a longer explanation:
#10391 (comment)

## Solution

This does one churn_rows call after all compactions and before any of the timeline GCs. That way, older LSNs remain accessible.
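A minimal pseudocode-style Python sketch of the two orderings (the helper names compact_shard, churn_rows, and gc_timeline are hypothetical stand-ins, not the test's real functions):

```python
def flaky_ordering(shards, compact_shard, churn_rows, gc_timeline):
    # Pre-#10481 ordering: each shard churns and GCs right after its own
    # compaction, so later iterations GC away history that the earlier
    # shards would still need to serve reads below the latest LSN.
    for shard in shards:
        compact_shard(shard)   # image layers at the then-latest LSN
        churn_rows()           # advance the WAL past that LSN
        gc_timeline(shard)     # collect history below the new cutoff


def fixed_ordering(shards, compact_shard, churn_rows, gc_timeline):
    # #10481 ordering: compact everything first, churn once, then GC each
    # shard, so all shards keep coverage at LSNs an endpoint may still read.
    for shard in shards:
        compact_shard(shard)
    churn_rows()
    for shard in shards:
        gc_timeline(shard)
```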
arpad-m (Member) commented Jan 23, 2025

[screenshot: flaky test occurrence statistics after the fix]

The error hasn't occurred any more since the PR was merged; I think it's resolved.

For the [None] case, it happened twice, but I'd attribute that to the PR using an outdated git state. In any case, even there it's far fewer occurrences than yesterday. Closing the issue.

@arpad-m arpad-m closed this as completed Jan 23, 2025