test_pageserver_gc_compaction_smoke: could not find data for key #10482

Open
arpad-m opened this issue Jan 22, 2025 · 2 comments
Labels: c/storage/pageserver (Component: storage: pageserver), t/bug (Issue Type: Bug)

Comments

arpad-m (Member) commented Jan 22, 2025

Originally reported by @alexanderlaw in: #10391 (comment):

I've reproduced these scary errors with test_pageserver_gc_compaction_smoke too, by running 8 test instances in parallel, even on 4cb80 (dated 2024-12-09).

compute.log:

PG:2025-01-22 16:19:17.514 GMT [976688] ERROR:  [NEON_SMGR] [shard 0] could not read block 162 in rel 1663/5/16384.0 from page server at lsn 0/09BB7660
PG:2025-01-22 16:19:17.514 GMT [976688] DETAIL:  page server returned error: Read error

pageserver.log:

2025-01-22T16:18:51.405445Z  INFO wal_connection_manager{tenant_id=584a20da29656b33e10cbb8ce14da0e1 shard_id=0000 timeline_id=5d722464ed806f311b1748d2964f9a0a}:connection{node_id=1}:tokio_epoll_uring_ext::thread_local_system{thread_local=12 attempt_no=0}: successfully launched system
...
2025-01-22T16:19:17.514097Z ERROR page_service_conn_main{peer_addr=[::1]:55676}:process_query{tenant_id=584a20da29656b33e10cbb8ce14da0e1 timeline_id=5d722464ed806f311b1748d2964f9a0a}:handle_pagerequests:request:handle_get_page_at_lsn_request_batched{req_lsn=FFFFFFFF/FFFFFFFF}: error reading relation or page version: Read error: whole vectored get request failed because one or more of the requested keys were missing: could not find data for key 000000067F000000050000400000000000A2 (shard ShardNumber(0)) at LSN 0/9BB7661, request LSN 0/9BB7660, ancestor 0/0
...
2025-01-22T16:19:18.602413Z  INFO wal_connection_manager{tenant_id=584a20da29656b33e10cbb8ce14da0e1 shard_id=0000 timeline_id=5d722464ed806f311b1748d2964f9a0a}:connection{node_id=1}: walreceiver connection handling ended: connection closed
arpad-m added the c/storage/pageserver and t/bug labels Jan 22, 2025
skyzh (Member) commented Jan 22, 2025

The issue seems to happen only when gc-compaction decides to produce an image that covers the full range 0000..FFFF. In theory this should not happen, because no historic layer should cover that full range.

skyzh (Member) commented Jan 22, 2025

I suppose there are some places in the code that incorrectly assume 000..FFF layers are L0 layers (even though I added the logic to allow 000..FFF image layers).
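
For illustration only, here is a minimal standalone sketch (all names are hypothetical; this is not the actual pageserver code) of how such an assumption could go wrong: a heuristic that treats any layer spanning the full key range as L0 would misclassify a full-range image layer produced by gc-compaction.

```rust
// Hypothetical sketch, not the pageserver's real layer types.
struct LayerDesc {
    key_start: u128,
    key_end: u128,
    is_delta: bool, // image layers have is_delta == false
}

const KEY_MIN: u128 = 0;
const KEY_MAX: u128 = u128::MAX;

// Buggy heuristic: "full key range" implies L0.
fn is_l0_by_key_range(l: &LayerDesc) -> bool {
    l.key_start == KEY_MIN && l.key_end == KEY_MAX
}

// Safer check: a layer can only be L0 if it is a delta layer,
// regardless of its key range.
fn is_l0(l: &LayerDesc) -> bool {
    l.is_delta && is_l0_by_key_range(l)
}

fn main() {
    // A full-range image layer, as gc-compaction may produce.
    let full_range_image = LayerDesc { key_start: KEY_MIN, key_end: KEY_MAX, is_delta: false };
    assert!(is_l0_by_key_range(&full_range_image)); // misclassified as L0
    assert!(!is_l0(&full_range_image));             // correctly excluded
}
```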

github-merge-queue bot pushed a commit that referenced this issue Jan 23, 2025
## Problem

Not really a bug fix, but hopefully it makes #10482 easier to reproduce.

If the layer map does not contain a layer that ends exactly at the end of the compaction job's key range, the current split algorithm produces a last job that ends at the maximum layer key. This patch extends the last job all the way to the compaction job's end key.

For example, the user requests a compaction of 0000..FFFF, but the layer map only contains a layer 0000..3000, so the last split job gets a range of 0000..3000 instead of 0000..FFFF.

This is not a correctness issue, but fixing it gives us consistent job splits.

## Summary of changes

Compaction job split will always cover the full specified key range.

Signed-off-by: Alex Chi Z <[email protected]>
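
As a rough illustration of the fix described above (a standalone sketch with made-up types, not the actual patch), the idea is that the job splitter appends a final sub-job reaching the requested end key whenever the highest layer boundary falls short of it:

```rust
// Standalone sketch of the split behavior described above; types and
// function names are made up and do not match the pageserver code.
#[derive(Debug, Clone, Copy)]
struct KeyRange {
    start: u32,
    end: u32,
}

/// Split `request` into sub-jobs at the given layer end keys.
/// `layer_ends` is assumed to be sorted.
fn split_compaction_job(request: KeyRange, layer_ends: &[u32]) -> Vec<KeyRange> {
    let mut jobs = Vec::new();
    let mut cur = request.start;
    for &end in layer_ends {
        if end > cur && end < request.end {
            jobs.push(KeyRange { start: cur, end });
            cur = end;
        }
    }
    // The fix: extend the last job to the requested end key instead of
    // stopping at the maximum layer key.
    if cur < request.end {
        jobs.push(KeyRange { start: cur, end: request.end });
    }
    jobs
}

fn main() {
    // Request 0x0000..0xFFFF while the layer map only has a layer
    // ending at 0x3000: the last job still reaches 0xFFFF.
    let jobs = split_compaction_job(KeyRange { start: 0x0000, end: 0xFFFF }, &[0x3000]);
    assert_eq!(jobs.len(), 2);
    assert_eq!(jobs.last().unwrap().end, 0xFFFF);
}
```

In this toy model, the request over a layer map whose only boundary is 0x3000 yields two jobs, 0x0000..0x3000 and 0x3000..0xFFFF, so the split always covers the full requested range.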