test_pageserver_gc_compaction_smoke: could not find data for key #10482

Open
arpad-m opened this issue Jan 22, 2025 · 2 comments
Labels: c/storage/pageserver (Component: storage: pageserver), t/bug (Issue Type: Bug)

Comments

arpad-m (Member) commented Jan 22, 2025

Originally reported by @alexanderlaw in: #10391 (comment):

I've reproduced these scary errors with test_pageserver_gc_compaction_smoke too, by running 8 test instances in parallel, even on 4cb80 (dated 2024-12-09).

compute.log:

PG:2025-01-22 16:19:17.514 GMT [976688] ERROR:  [NEON_SMGR] [shard 0] could not read block 162 in rel 1663/5/16384.0 from page server at lsn 0/09BB7660
PG:2025-01-22 16:19:17.514 GMT [976688] DETAIL:  page server returned error: Read error

pageserver.log:

2025-01-22T16:18:51.405445Z  INFO wal_connection_manager{tenant_id=584a20da29656b33e10cbb8ce14da0e1 shard_id=0000 timeline_id=5d722464ed806f311b1748d2964f9a0a}:connection{node_id=1}:tokio_epoll_uring_ext::thread_local_system{thread_local=12 attempt_no=0}: successfully launched system
...
2025-01-22T16:19:17.514097Z ERROR page_service_conn_main{peer_addr=[::1]:55676}:process_query{tenant_id=584a20da29656b33e10cbb8ce14da0e1 timeline_id=5d722464ed806f311b1748d2964f9a0a}:handle_pagerequests:request:handle_get_page_at_lsn_request_batched{req_lsn=FFFFFFFF/FFFFFFFF}: error reading relation or page version: Read error: whole vectored get request failed because one or more of the requested keys were missing: could not find data for key 000000067F000000050000400000000000A2 (shard ShardNumber(0)) at LSN 0/9BB7661, request LSN 0/9BB7660, ancestor 0/0
...
2025-01-22T16:19:18.602413Z  INFO wal_connection_manager{tenant_id=584a20da29656b33e10cbb8ce14da0e1 shard_id=0000 timeline_id=5d722464ed806f311b1748d2964f9a0a}:connection{node_id=1}: walreceiver connection handling ended: connection closed
arpad-m added the c/storage/pageserver and t/bug labels Jan 22, 2025
skyzh (Member) commented Jan 22, 2025

The issue seems to happen only when gc-compaction decides to produce an image that covers the full range 0000..FFFF. In theory this should not happen, because no historic layer should cover that full range.

skyzh (Member) commented Jan 22, 2025

I suppose there are some places in the code that incorrectly assume 000..FFF layers are L0 layers (even though I added the logic to allow 000..FFF image layers).
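
For illustration only, here is a minimal standalone sketch (all names are hypothetical; this is not the actual pageserver code) of how such an assumption could go wrong: a heuristic that treats any layer spanning the full key range as L0 would misclassify a full-range image layer produced by gc-compaction.

```rust
// Hypothetical sketch, not the pageserver's real layer types.
struct LayerDesc {
    key_start: u128,
    key_end: u128,
    is_delta: bool, // image layers have is_delta == false
}

const KEY_MIN: u128 = 0;
const KEY_MAX: u128 = u128::MAX;

// Buggy heuristic: "full key range" implies L0.
fn is_l0_by_key_range(l: &LayerDesc) -> bool {
    l.key_start == KEY_MIN && l.key_end == KEY_MAX
}

// Safer check: a layer can only be L0 if it is a delta layer,
// regardless of its key range.
fn is_l0(l: &LayerDesc) -> bool {
    l.is_delta && is_l0_by_key_range(l)
}

fn main() {
    // A full-range image layer, as gc-compaction may produce.
    let full_range_image = LayerDesc { key_start: KEY_MIN, key_end: KEY_MAX, is_delta: false };
    assert!(is_l0_by_key_range(&full_range_image)); // misclassified as L0
    assert!(!is_l0(&full_range_image));             // correctly excluded
}
```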

github-merge-queue bot pushed a commit that referenced this issue Jan 23, 2025
## Problem

Not really a bug fix, but hopefully it makes #10482 easier to reproduce.

If the layer map does not contain a layer that ends exactly at the end of the compaction job's key range, the current split algorithm produces a last job that ends at the maximum layer key. This patch extends the last job all the way to the compaction job's end key.

For example, the user requests a compaction of 0000..FFFF, but the layer map only contains a layer 0000..3000, so the last split job gets a range of 0000..3000 instead of 0000..FFFF.

This is not a correctness issue, but fixing it gives us consistent job splits.

## Summary of changes

Compaction job split will always cover the full specified key range.

Signed-off-by: Alex Chi Z <[email protected]>
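
As a rough illustration of the fix described above (a standalone sketch with made-up types, not the actual patch), the idea is that the job splitter appends a final sub-job reaching the requested end key whenever the highest layer boundary falls short of it:

```rust
// Standalone sketch of the split behavior described above; types and
// function names are made up and do not match the pageserver code.
#[derive(Debug, Clone, Copy)]
struct KeyRange {
    start: u32,
    end: u32,
}

/// Split `request` into sub-jobs at the given layer end keys.
/// `layer_ends` is assumed to be sorted.
fn split_compaction_job(request: KeyRange, layer_ends: &[u32]) -> Vec<KeyRange> {
    let mut jobs = Vec::new();
    let mut cur = request.start;
    for &end in layer_ends {
        if end > cur && end < request.end {
            jobs.push(KeyRange { start: cur, end });
            cur = end;
        }
    }
    // The fix: extend the last job to the requested end key instead of
    // stopping at the maximum layer key.
    if cur < request.end {
        jobs.push(KeyRange { start: cur, end: request.end });
    }
    jobs
}

fn main() {
    // Request 0x0000..0xFFFF while the layer map only has a layer
    // ending at 0x3000: the last job still reaches 0xFFFF.
    let jobs = split_compaction_job(KeyRange { start: 0x0000, end: 0xFFFF }, &[0x3000]);
    assert_eq!(jobs.len(), 2);
    assert_eq!(jobs.last().unwrap().end, 0xFFFF);
}
```

In this toy model, the request over a layer map whose only boundary is 0x3000 yields two jobs, 0x0000..0x3000 and 0x3000..0xFFFF, so the split always covers the full requested range.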