pageserver: WAL receiver did not spawn after shard migration #10351

Open
VladLazar opened this issue Jan 10, 2025 · 1 comment · May be fixed by #10452
Assignees
Labels
c/storage Component: storage t/bug Issue Type: Bug triaged bugs that were already triaged

Comments

@VladLazar
Contributor

Timeline

tenant_id=ded6ee19a4cbad4adb8f5d05640db9d0 shard_id=0408

Storage controller triggered a live migration as an optimisation:

2025-01-10T11:42:08.876125Z  INFO background_reconcile:optimize_attachment{tenant_id=ded6ee19a4cbad4adb8f5d05640db9d0 shard_id=0408}: Identified optimization: migrate attachment 9344->9345 (secondaries [NodeId(9345)])

2025-01-10T11:42:08.889162Z  INFO background_reconcile: ded6ee19a4cbad4adb8f5d05640db9d0-0408 secondary on 9345 (pageserver-30.us-east-2.aws.neon.build) is warm enough for migration: SecondaryProgress { heatmap_mtime: Some(SystemTime(SystemTime { tv_sec: 1736508815, tv_nsec: 0 })), layers_downloaded: 19, layers_total: 19, bytes_downloaded: 1625481216, bytes_total: 1625481216 }

It turned out that the new location is actually stale. That's fine, it downloads as much as it can:

2025-01-10T11:47:09.175791Z  WARN reconciler{seq=3 tenant_id=ded6ee19a4cbad4adb8f5d05640db9d0 shard_id=0408}: Timed out after 300034ms downloading layers to 9345 (pageserver-30.us-east-2.aws.neon.build).  Progress so far: 725/998 layers, 172792389632/209516150784 bytes 

Now it moves on to waiting for the LSN on the new location to catch up. No progress is made here for about 1h:

2025-01-10T11:47:14.021333Z  INFO reconciler{seq=3 tenant_id=ded6ee19a4cbad4adb8f5d05640db9d0 shard_id=0408}: 🕑 LSN origin 30/368A9498 vs destination 30/368A6480 timeline_id=deed3258f6cd7f280909693ca593371c (stuck here until the reconciler was cancelled 1h later)
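
To put a number on the gap in that log line: a Postgres LSN is printed as two hexadecimal halves, `HIGH/LOW`, of a 64-bit byte position. The sketch below parses that format and subtracts the two values. `parse_lsn` and `lsn_lag` are hypothetical helper names used for illustration only; they are not pageserver functions.

```rust
/// Parse an LSN printed as "HIGH/LOW" (both hexadecimal) into a 64-bit byte position.
/// Hypothetical helper for illustration; not the pageserver's own LSN type.
fn parse_lsn(s: &str) -> Option<u64> {
    let (hi, lo) = s.split_once('/')?;
    let hi = u32::from_str_radix(hi, 16).ok()?;
    let lo = u32::from_str_radix(lo, 16).ok()?;
    Some(((hi as u64) << 32) | lo as u64)
}

/// Bytes of WAL the destination still has to ingest.
/// Returns None if either LSN is malformed, and 0 if the destination is ahead.
fn lsn_lag(origin: &str, destination: &str) -> Option<u64> {
    let origin = parse_lsn(origin)?;
    let destination = parse_lsn(destination)?;
    Some(origin.saturating_sub(destination))
}
```

For the values in the log, `lsn_lag("30/368A9498", "30/368A6480")` gives 12312 bytes (0x3018), so the destination was only about 12 KiB behind and still never caught up.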

Why didn't we ingest that delta? The new location, pageserver-30, attached the shard and launched the WAL receiver for it:

2025-01-10T11:47:14.021171Z  INFO attach{tenant_id=ded6ee19a4cbad4adb8f5d05640db9d0 shard_id=0408 gen=00000005}: launching WAL receiver for timeline deed3258f6cd7f280909693ca593371c of tenant ded6ee19a4cbad4adb8f5d05640db9d0-0408

However, it never actually connected to the safekeeper. It did subscribe to the broker for updates, but it never got any:

2025-01-10T11:47:14.021516Z  INFO subscription started id=79325, key=Timeline(TenantTimelineId { tenant_id: ded6ee19a4cbad4adb8f5d05640db9d0, timeline_id: deed3258f6cd7f280909693ca593371c }), addr=10.6.30.32:53422

The broker was available throughout that time. Metrics indicate that the safekeepers (SKs) were publishing updates to the broker
and that pageservers were receiving them. The metrics are not granular enough to investigate a single timeline.

Investigation

Since the WAL receiver connection was never spawned, the code must be stuck somewhere in connection_manager_loop_step.
To start the WAL receiver, the pageserver (PS) must know which SKs are available. This discovery happens through the broker.
There are two ways of getting info from the broker:

  1. The subscription path. The SK publishes an update roughly every 1s for all active timelines.
    The broker propagates it to the PS.
  2. The on-demand broker pull. If there's no active connection and we have a pending
    getpage request for an LSN in the future (or anything else that triggers an LSN wait), then we nudge the broker to tell the PS what it knows.
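
The two paths above can be modelled as a small decision function. This is a sketch of the conditions as described in this issue, not the actual code in connection_manager_loop_step; `DiscoveryState`, `Discovery` and `next_discovery` are hypothetical names.

```rust
/// What the pageserver knows when deciding how to learn about safekeepers.
/// Field names are hypothetical; they only mirror the two conditions described above.
#[derive(Clone, Copy)]
struct DiscoveryState {
    has_active_connection: bool,
    has_pending_lsn_wait: bool,
}

#[derive(Debug, PartialEq)]
enum Discovery {
    /// Keep relying on the periodic updates pushed through the subscription.
    WaitForSubscription,
    /// Ask the broker directly for what it knows.
    OnDemandPull,
}

/// The on-demand pull only fires when there is no connection AND something waits on an LSN.
fn next_discovery(state: DiscoveryState) -> Discovery {
    if !state.has_active_connection && state.has_pending_lsn_wait {
        Discovery::OnDemandPull
    } else {
        Discovery::WaitForSubscription
    }
}
```

With both flags false, which is the state this shard appears to have been in (no connection, and nothing waiting on an LSN), the function returns `WaitForSubscription`. If the subscription then stays silent, nothing else triggers discovery.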

Actions

  1. Improve observability. If we have been waiting for a certain amount of time and have learned nothing from the broker, log.
  2. If the subscription doesn't yield anything, nudge the broker instead of waiting indefinitely (even if we don't have a pending LSN wait).
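
Both actions can be combined in one wait loop. The sketch below uses a standard-library channel as a stand-in for the broker subscription and a closure as a stand-in for the on-demand pull. Every name in it (`wait_for_update`, `quiet_period`, `max_nudges`, `nudge`) is an assumption made for illustration, and the bounded retry count exists only so the example terminates.

```rust
use std::sync::mpsc::{Receiver, RecvTimeoutError};
use std::time::Duration;

/// Wait for a broker update, but never silently: after every `quiet_period`
/// without an update, log and call `nudge` (standing in for an on-demand pull).
/// Returns None if the channel closes, or if it is still silent after `max_nudges` nudges.
/// All names here are hypothetical; this is not the pageserver's API.
fn wait_for_update<T>(
    updates: &Receiver<T>,
    quiet_period: Duration,
    max_nudges: usize,
    mut nudge: impl FnMut(),
) -> Option<T> {
    let mut nudges = 0;
    loop {
        match updates.recv_timeout(quiet_period) {
            Ok(update) => return Some(update),
            Err(RecvTimeoutError::Timeout) => {
                if nudges == max_nudges {
                    return None;
                }
                nudges += 1;
                // Action 1: make the silence visible.
                eprintln!("no broker update for {quiet_period:?}; nudging broker (attempt {nudges})");
                // Action 2: nudge even though no LSN wait is pending.
                nudge();
            }
            Err(RecvTimeoutError::Disconnected) => return None,
        }
    }
}
```

The point of the design is that the nudge depends only on elapsed time, not on a pending LSN wait, so a freshly migrated shard with no readers would still trigger discovery.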
@VladLazar VladLazar added c/storage Component: storage t/bug Issue Type: Bug labels Jan 10, 2025
@jcsp
Collaborator

jcsp commented Jan 14, 2025

Triage notes:

  • Mitigation: We could add a migration API that kicks ingest the same way a getpage request would, rather than just using the timeline info API.
  • wait_lsn API might be useful for some tests too?
