Skip to content

Commit

Permalink
Fix two flakiness sources in test_scrubber_physical_gc_ancestors (#10457
Browse files Browse the repository at this point in the history
)

We currently have some flakiness in
`test_scrubber_physical_gc_ancestors`, see #10391.

The first flakiness kind is about the reconciler not actually becoming
idle within the timeout of 30 seconds. We see continuous forward
progress so this is likely not a hang. We also see this happen in
parallel to a test failure, so is likely due to runners being
overloaded. Therefore, we increase the timeout.

The second flakiness kind is an assertion failure. This one is a little
bit more tricky, but we saw in the successful run that there was some
advance of the lsn between the compaction ran (which created layer
files) and the gc run. Apparently gc rejects reductions to the single
image layer setting if the cutoff lsn is the same as the lsn of the
image layer: it will claim that that layer is newer than the space
cutoff and therefore skip it, while thinking the old layer (that we want
to delete) is the latest one (so it's not deleted).

We address the second flakiness kind by inserting a tiny amount of WAL
between the compaction and gc. This should hopefully fix things.

Related issue: #10391

(not closing it with the merger of the PR as we'll need to validate that
these changes had the intended effect).

Thanks to Chi for going over this together with me in a call.
  • Loading branch information
arpad-m authored Jan 21, 2025
1 parent 624a507 commit 7e4a39e
Showing 1 changed file with 5 additions and 1 deletion.
6 changes: 5 additions & 1 deletion test_runner/regress/test_storage_scrubber.py
Original file line number Diff line number Diff line change
Expand Up @@ -227,7 +227,9 @@ def test_scrubber_physical_gc_ancestors(neon_env_builder: NeonEnvBuilder, shard_
new_shard_count = 4
assert shard_count is None or new_shard_count > shard_count
shards = env.storage_controller.tenant_shard_split(tenant_id, shard_count=new_shard_count)
env.storage_controller.reconcile_until_idle() # Move shards to their final locations immediately
env.storage_controller.reconcile_until_idle(
timeout_secs=120
) # Move shards to their final locations immediately

# Create a timeline after split, to ensure scrubber can handle timelines that exist in child shards but not ancestors
env.storage_controller.pageserver_api().timeline_create(
Expand Down Expand Up @@ -269,6 +271,8 @@ def test_scrubber_physical_gc_ancestors(neon_env_builder: NeonEnvBuilder, shard_
ps.http_client().timeline_compact(
shard, timeline_id, force_image_layer_creation=True, wait_until_uploaded=True
)
# Add some WAL so that we don't gc at the latest remote consistent lsn
workload.churn_rows(1)
ps.http_client().timeline_gc(shard, timeline_id, 0)

# We will use a min_age_secs=1 threshold for deletion, let it pass
Expand Down

1 comment on commit 7e4a39e

@github-actions
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

7510 tests run: 7119 passed, 1 failed, 390 skipped (full report)


Failures on Postgres 16

# Run all failed tests locally:
scripts/pytest -vv -n $(nproc) -k "test_layer_map[release-pg16-github-actions-selfhosted]"
Flaky tests (4)

Postgres 17

Postgres 16

Postgres 15

Postgres 14

Code coverage* (full report)

  • functions: 33.6% (8445 of 25111 functions)
  • lines: 49.2% (70802 of 143967 lines)

* collected from Rust tests only


The comment gets automatically updated with the latest test results
7e4a39e at 2025-01-21T18:03:22.163Z :recycle:

Please sign in to comment.