I've reproduced these scary errors with test_pageserver_gc_compaction_smoke too, by running 8 test instances in parallel, even on 4c4cb80 (dated 2024-12-09).
compute.log:
PG:2025-01-22 16:19:17.514 GMT [976688] ERROR: [NEON_SMGR] [shard 0] could not read block 162 in rel 1663/5/16384.0 from page server at lsn 0/09BB7660
PG:2025-01-22 16:19:17.514 GMT [976688] DETAIL: page server returned error: Read error
pageserver.log:
2025-01-22T16:18:51.405445Z INFO wal_connection_manager{tenant_id=584a20da29656b33e10cbb8ce14da0e1 shard_id=0000 timeline_id=5d722464ed806f311b1748d2964f9a0a}:connection{node_id=1}:tokio_epoll_uring_ext::thread_local_system{thread_local=12 attempt_no=0}: successfully launched system
...
2025-01-22T16:19:17.514097Z ERROR page_service_conn_main{peer_addr=[::1]:55676}:process_query{tenant_id=584a20da29656b33e10cbb8ce14da0e1 timeline_id=5d722464ed806f311b1748d2964f9a0a}:handle_pagerequests:request:handle_get_page_at_lsn_request_batched{req_lsn=FFFFFFFF/FFFFFFFF}: error reading relation or page version: Read error: whole vectored get request failed because one or more of the requested keys were missing: could not find data for key 000000067F000000050000400000000000A2 (shard ShardNumber(0)) at LSN 0/9BB7661, request LSN 0/9BB7660, ancestor 0/0
...
2025-01-22T16:19:18.602413Z INFO wal_connection_manager{tenant_id=584a20da29656b33e10cbb8ce14da0e1 shard_id=0000 timeline_id=5d722464ed806f311b1748d2964f9a0a}:connection{node_id=1}: walreceiver connection handling ended: connection closed
The issue seems to happen only when gc-compaction decides to produce an image layer that covers the full range 0000..FFFF. In theory this should not happen, because no historic layer should cover that full range.
I suppose there are some places in the code that incorrectly assume 0000..FFFF layers are L0 layers (even though I added logic to allow 0000..FFFF image layers).
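To make the suspected misclassification concrete, here is a hypothetical sketch (the types and functions below are illustrative only, not the actual pageserver code or its real L0 check): if "is this an L0 layer?" is answered by looking only at the key range, a full-range image layer produced by gc-compaction gets lumped in with L0 delta layers.

```rust
// Hypothetical sketch: illustrative types, not the actual pageserver code.

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum LayerKind {
    Delta,
    Image,
}

struct LayerDesc {
    kind: LayerKind,
    /// Start (inclusive) and end (exclusive) of the covered key range,
    /// abbreviated to u32 here for illustration.
    key_start: u32,
    key_end: u32,
}

const KEY_MIN: u32 = 0x0000;
const KEY_MAX: u32 = 0xFFFF;

/// Buggy classification: any layer covering the full keyspace is treated as L0.
fn is_l0_buggy(layer: &LayerDesc) -> bool {
    layer.key_start == KEY_MIN && layer.key_end == KEY_MAX
}

/// Safer classification: only *delta* layers covering the full keyspace count
/// as L0, so a full-range image layer from gc-compaction is not swept up by
/// L0-specific logic.
fn is_l0_fixed(layer: &LayerDesc) -> bool {
    layer.kind == LayerKind::Delta
        && layer.key_start == KEY_MIN
        && layer.key_end == KEY_MAX
}

fn main() {
    let full_range_image = LayerDesc {
        kind: LayerKind::Image,
        key_start: KEY_MIN,
        key_end: KEY_MAX,
    };
    assert!(is_l0_buggy(&full_range_image)); // misclassified as L0
    assert!(!is_l0_fixed(&full_range_image)); // correctly excluded
}
```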
## Problem
This is not really a bug fix, but it should hopefully make #10482 easier to reproduce.
If the layer map does not contain a layer that ends exactly at the end of the compaction job's key range, the current split algorithm produces a last job that ends at the maximum layer key instead of at the requested end. This patch extends the last job all the way to the compaction job's end key.
For example, the user requests a compaction of 0000..FFFF, but the layer map only contains a layer 0000..3000; the split then produces a job covering 0000..3000 instead of 0000..FFFF.
This is not a correctness issue, but it is better to fix it so that we get consistent job splits.
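As a rough illustration of the intended behavior, here is a minimal sketch (a hypothetical helper with simplified u32 keys, not the actual compaction code): sub-job boundaries come from the layer map, and the last sub-job is extended to the requested end key rather than stopping at the maximum layer key.

```rust
// Minimal sketch of the job-split behavior described above; hypothetical
// helper, not the actual pageserver compaction code.

/// Split a requested compaction key range [start, end) into sub-jobs at the
/// given layer boundary keys. `boundaries` is assumed to be sorted.
fn split_compaction_jobs(start: u32, end: u32, boundaries: &[u32]) -> Vec<(u32, u32)> {
    let mut jobs = Vec::new();
    let mut cur = start;
    for &b in boundaries {
        if b <= cur || b >= end {
            continue;
        }
        jobs.push((cur, b));
        cur = b;
    }
    // The fix: always cover the full requested range, even if no layer ends
    // exactly at `end`.
    if cur < end {
        jobs.push((cur, end));
    }
    jobs
}

fn main() {
    // Requested range 0x0000..0xFFFF, but the layer map only has a layer
    // ending at 0x3000: the last job now extends all the way to 0xFFFF.
    let jobs = split_compaction_jobs(0x0000, 0xFFFF, &[0x3000]);
    assert_eq!(jobs, vec![(0x0000, 0x3000), (0x3000, 0xFFFF)]);
}
```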
## Summary of changes
Compaction job splits now always cover the full specified key range.
Signed-off-by: Alex Chi Z <[email protected]>
Originally reported by @alexanderlaw in #10391 (comment).