Failures in test_scrubber_physical_gc_ancestors #10391
For the first one, apparently the storage controller believes that the shard is not ready to be promoted from secondary to primary:
As the shard is then always believed to require reconciling, the result is that the storage reconcile times out. This makes me wonder about the way the …
Regarding the second one, comparing the logs of a reproducing instance with one that works, apparently gc on the pageservers is keeping some layers:
Ah, so apparently the issue is not that: those "later" layers get shown as "latest" in the non-reproducing case as well. The more interesting part is the other image layers, the ones with earlier lsns: those get shown as "latest" in the reproducing case where the assertion fails, and get deleted in the non-reproducing case.
We currently have some flakiness in `test_scrubber_physical_gc_ancestors`, see #10391.

The first flakiness kind is the reconciler not actually becoming idle within the timeout of 30 seconds. We see continuous forward progress, so this is likely not a hang. We also see this happen in parallel with a test failure, so it is likely due to runners being overloaded. Therefore, we increase the timeout.

The second flakiness kind is an assertion failure. This one is a little trickier, but we saw in the successful run that the lsn advanced between the compaction run (which created the layer files) and the gc run. Apparently gc rejects reductions to the single image layer setting if the cutoff lsn is the same as the lsn of the image layer: it will claim that that layer is newer than the space cutoff and therefore skip it, while considering the old layer (the one we want to delete) the latest one, so it is not deleted. We address this second flakiness kind by inserting a tiny amount of WAL between the compaction and the gc. This should hopefully fix things.

Related issue: #10391 (not closing it with the merge of this PR, as we'll need to validate that these changes had the intended effect).

Thanks to Chi for going over this together with me in a call.
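To make the keep/delete rule described above concrete, here is a minimal toy sketch (plain Python, not actual pageserver code; `Layer`, `gc_decide`, and the exact comparison are made up for illustration):

```python
from dataclasses import dataclass

@dataclass
class Layer:
    name: str
    lsn: int

def gc_decide(layers: list[Layer], space_cutoff_lsn: int) -> set[str]:
    """Toy model: which layers would gc keep?"""
    keep = set()
    # Layers at or above the cutoff count as "newer than the space cutoff"
    # and are skipped (i.e. kept).
    newer = [l for l in layers if l.lsn >= space_cutoff_lsn]
    older = [l for l in layers if l.lsn < space_cutoff_lsn]
    keep.update(l.name for l in newer)
    # Among the older layers, the latest one is considered still needed.
    if older:
        keep.add(max(older, key=lambda l: l.lsn).name)
    return keep

old_image = Layer("old-image", lsn=90)
new_image = Layer("new-image", lsn=100)

# Failing case: the cutoff lsn equals the new image layer's lsn, so the new
# layer is "newer than the cutoff" and the old layer is kept as the latest.
assert gc_decide([old_image, new_image], space_cutoff_lsn=100) == {"old-image", "new-image"}

# With a tiny amount of WAL inserted between compaction and gc, the cutoff
# moves past the new image layer, and the old layer can be deleted.
assert gc_decide([old_image, new_image], space_cutoff_lsn=101) == {"new-image"}
```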
And another kind of test failure, in repo/pageserver_2/pageserver.log: https://neon-github-public-dev.s3.amazonaws.com/reports/pr-10474/12903916907/index.html#/testresult/19e54fa1e2aba0dd
Another one here: https://neon-github-public-dev.s3.amazonaws.com/reports/pr-10402/12896916261/index.html#suites/616e84f65c91fe4bc748db7447d35268/c29c08518f82ced5/
Hmmm, yeah, it seems it fails way more often now than it used to. "could not find data for key" is a bit scary; in theory the key should exist. The "fix" in #10457 merged 26 hours ago and there have been 40 instances of flakiness since then, while in the 6 days before that there were only 34 (admittedly two are workdays, but still).
When there is a compute error, I see the following (link) in the ps logs:
Looks like an endpoint doing a misdirected request (a key read for shard 0 from a ps that has shard 0104).
I've reproduced these scary errors with test_pageserver_gc_compaction_smoke too, by running 8 test instances in parallel, even on 4c4cb80 (dated 2024-12-09):
pageserver.log:
@alexanderlaw that might be a different issue; maybe it makes sense to file a different thread about it? Edit: done, see #10482.

For the issue in this thread: the cause seems to be the extra WAL churn that #10457 added. It also makes sense if you think about it: in the loop, we do this for multiple shards, namely 4.
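Roughly, I picture the loop like this toy model (illustration only, reconstructed for the sake of argument; `churn_rows`, `timeline_gc` and friends are stand-ins, not the real test helpers or pageserver behavior):

```python
latest_lsn = 100
cutoff = {shard: 0 for shard in range(4)}   # per-shard gc cutoff, 4 shards

def churn_rows(amount: int) -> None:
    """Stand-in: writing rows advances the latest lsn."""
    global latest_lsn
    latest_lsn += amount

def timeline_gc(shard: int) -> None:
    """Stand-in: gc moves this shard's cutoff up to the current latest lsn."""
    cutoff[shard] = latest_lsn

def unreadable_shards(endpoint_lsn: int) -> list[int]:
    """Shards that can no longer serve reads at `endpoint_lsn`."""
    return [s for s, c in cutoff.items() if endpoint_lsn < c]

for shard in range(4):
    # An endpoint for this iteration at anything but the very latest lsn:
    print(shard, unreadable_shards(endpoint_lsn=latest_lsn - 1))
    # prints: 0 []   1 [0]   2 [1]   3 [2]
    churn_rows(10)        # write some rows
    # (compaction of this shard would happen here)
    timeline_gc(shard)    # gc cutoff jumps to the latest lsn
```

From the second iteration on, some shard has already gc'd past the lsn the endpoint sits at, so reads at that lsn come back as "could not find data for key".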
For the first shard, this will work. For the second shard, however, we'll be in the situation that if we want to start an endpoint at anything but the latest lsn, we can't do that any more for the first shard. So probably this is why it fails. The only thing I can't explain is why it's flaky: if my theory is accurate, it should always fail. But no clue. In any case, I have filed #10481. Maybe this will fix things?
## Problem

PR #10457 was supposed to fix the flakiness of `test_scrubber_physical_gc_ancestors`, but instead it made it even more flaky. However, the original error causes disappeared, only to be replaced by "key not found" errors. See this for a longer explanation: #10391 (comment)

## Solution

Do one churn of rows after all compactions, and before we do any timeline gc's. That way, we remain accessible at older lsns.
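A minimal sketch of that ordering (illustration only; the no-op functions below are stand-ins, not the real test helpers):

```python
def timeline_compact(shard): ...   # stand-in: compaction, creates image layers
def churn_rows(): ...              # stand-in: writes a bit of WAL, advancing the lsn
def timeline_gc(shard): ...        # stand-in: per-timeline gc

shards = range(4)

for shard in shards:
    timeline_compact(shard)   # all compactions first

churn_rows()                  # one churn of rows after all compactions...

for shard in shards:
    timeline_gc(shard)        # ...and before any of the timeline gc's
```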