pageserver: WAL receiver did not spawn after shard migration #10351

VladLazar · 2025-01-10T18:30:35Z

Timeline

tenant_id=ded6ee19a4cbad4adb8f5d05640db9d0 shard_id=0408

Storage controller triggered a live migration as an optimisation:

2025-01-10T11:42:08.876125Z  INFO background_reconcile:optimize_attachment{tenant_id=ded6ee19a4cbad4adb8f5d05640db9d0 shard_id=0408}: Identified optimization: migrate attachment 9344->9345 (secondaries [NodeId(9345)])

2025-01-10T11:42:08.889162Z  INFO background_reconcile: ded6ee19a4cbad4adb8f5d05640db9d0-0408 secondary on 9345 (pageserver-30.us-east-2.aws.neon.build) is warm enough for migration: SecondaryProgress { heatmap_mtime: Some(SystemTime(SystemTime { tv_sec: 1736508815, tv_nsec: 0 })), layers_downloaded: 19, layers_total: 19, bytes_downloaded: 1625481216, bytes_total: 1625481216 }

It turned out that the new location is actually stale. That's fine, it downloads as much as it can:

2025-01-10T11:47:09.175791Z  WARN reconciler{seq=3 tenant_id=ded6ee19a4cbad4adb8f5d05640db9d0 shard_id=0408}: Timed out after 300034ms downloading layers to 9345 (pageserver-30.us-east-2.aws.neon.build).  Progress so far: 725/998 layers, 172792389632/209516150784 bytes

Now it moves on to waiting for the LSN on the new location to catch up. No progress is made here for about 1h:

2025-01-10T11:47:14.021333Z  INFO reconciler{seq=3 tenant_id=ded6ee19a4cbad4adb8f5d05640db9d0 shard_id=0408}: 🕑 LSN origin 30/368A9498 vs destination 30/368A6480 timeline_id=deed3258f6cd7f280909693ca593371c (stuck here until the reconciler was cancelled 1h later)
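For scale, the gap between the two LSNs in the log line above can be worked out by hand or with a few lines of code. The following is a minimal sketch, assuming the usual `hi/lo` hexadecimal LSN notation where the value is `(hi << 32) | lo`; the helper names `parse_lsn` and `lsn_lag` are illustrative and are not taken from the pageserver code.

```rust
/// Parse an LSN written as "hi/lo" in hexadecimal (e.g. "30/368A9498")
/// into a single 64-bit byte position.
fn parse_lsn(s: &str) -> Option<u64> {
    let (hi, lo) = s.split_once('/')?;
    let hi = u32::from_str_radix(hi, 16).ok()?;
    let lo = u32::from_str_radix(lo, 16).ok()?;
    Some((u64::from(hi) << 32) | u64::from(lo))
}

/// Number of bytes by which `destination` trails `origin`
/// (0 if the destination has already caught up).
fn lsn_lag(origin: &str, destination: &str) -> Option<u64> {
    let origin = parse_lsn(origin)?;
    let destination = parse_lsn(destination)?;
    Some(origin.saturating_sub(destination))
}

fn main() {
    // Values copied from the reconciler log line above.
    let lag = lsn_lag("30/368A9498", "30/368A6480").expect("valid LSNs");
    println!("destination is {lag} bytes behind origin"); // 12312 bytes
}
```

The destination was therefore only about 12 KiB of WAL behind, which makes an hour without progress point at ingest not running at all rather than at slow ingest.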

Why didn't we ingest that delta? The new location, pageserver-30, attached the shard and spawned the wal receiver for it:

2025-01-10T11:47:14.021171Z  INFO attach{tenant_id=ded6ee19a4cbad4adb8f5d05640db9d0 shard_id=0408 gen=00000005}: launching WAL receiver for timeline deed3258f6cd7f280909693ca593371c of tenant ded6ee19a4cbad4adb8f5d05640db9d0-0408

However, it never actually connected to the safekeeper. It did subscribe to the broker for updates, but it never got any:

2025-01-10T11:47:14.021516Z  INFO subscription started id=79325, key=Timeline(TenantTimelineId { tenant_id: ded6ee19a4cbad4adb8f5d05640db9d0, timeline_id: deed3258f6cd7f280909693ca593371c }), addr=10.6.30.32:53422

The broker was available throughout that period. Metrics indicate that the safekeepers (SK) were publishing updates to the broker
and that pageservers were receiving them. The metrics are not granular enough to investigate a specific timeline.

Investigation

Since the WAL receiver connection was never spawned, the code must be stuck somewhere in connection_manager_loop_step.
To start the WAL receiver connection, the pageserver (PS) must know which SKs are available. This discovery happens through the broker.
There are two ways of getting info from the broker:

The subscription path. SK publishes an update roughly every 1s for all active timelines, and the broker propagates it to the PS.
The on-demand broker pull. If there's no active connection and we have a pending get page for an LSN in the future (or anything else that triggers an LSN wait), then we nudge the broker to tell the PS what it knows.
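The condition for the on-demand pull described above can be sketched as a small predicate. This is an illustrative model only: `ManagerState`, `Discovery`, and `next_discovery_step` are hypothetical names, not the types used in connection_manager_loop_step.

```rust
/// How the pageserver would learn about safekeepers next (hypothetical model).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Discovery {
    /// Keep waiting for the periodic updates relayed by the broker.
    WaitForSubscription,
    /// Actively ask the broker what it knows about the timeline.
    OnDemandPull,
}

/// The two inputs the description above mentions (hypothetical struct).
struct ManagerState {
    has_active_connection: bool,
    pending_lsn_waits: usize,
}

/// On-demand pull only happens with no active connection AND a pending LSN wait.
fn next_discovery_step(state: &ManagerState) -> Discovery {
    if !state.has_active_connection && state.pending_lsn_waits > 0 {
        Discovery::OnDemandPull
    } else {
        Discovery::WaitForSubscription
    }
}

fn main() {
    // A freshly attached shard with nothing waiting on an LSN:
    // only the subscription path can make progress.
    let fresh = ManagerState { has_active_connection: false, pending_lsn_waits: 0 };
    println!("{:?}", next_discovery_step(&fresh));

    // The same shard once a get page request waits for a future LSN.
    let waiting = ManagerState { has_active_connection: false, pending_lsn_waits: 1 };
    println!("{:?}", next_discovery_step(&waiting));
}
```

Under this model, a shard with no connection and no pending LSN wait depends entirely on the subscription delivering an update, which would be consistent with the hang seen during the migration.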

Actions

Improve observability. If we have been waiting for a certain amount of time and have learned nothing from the broker, log.
If the subscription doesn't yield anything, nudge the broker instead of waiting indefinitely (even if we don't have a pending LSN wait).
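The two actions above could look roughly like the loop below. It is a minimal sketch using only the standard library: the subscription is modelled as an `mpsc` channel of strings, and `discover`, `Outcome`, `stall_timeout`, `max_nudges`, and the `nudge_broker` callback are all assumptions made for illustration, not the pageserver's actual API.

```rust
use std::sync::mpsc::{self, Receiver, RecvTimeoutError};
use std::time::{Duration, Instant};

/// Where the safekeeper info came from, if anywhere (hypothetical type).
#[derive(Debug, PartialEq)]
enum Outcome {
    FromSubscription(String),
    FromNudge(String),
    GaveUp,
}

/// Wait on the subscription; whenever it stays silent for `stall_timeout`,
/// log the stall and nudge the broker, even without a pending LSN wait.
fn discover(
    updates: &Receiver<String>,
    stall_timeout: Duration,
    max_nudges: u32,
    nudge_broker: impl Fn() -> Option<String>,
) -> Outcome {
    let started = Instant::now();
    let mut nudges = 0;
    loop {
        match updates.recv_timeout(stall_timeout) {
            Ok(info) => return Outcome::FromSubscription(info),
            Err(RecvTimeoutError::Disconnected) => return Outcome::GaveUp,
            Err(RecvTimeoutError::Timeout) => {
                // Action 1: make the stall visible in the logs.
                eprintln!("no broker update after {:?}", started.elapsed());
                if nudges >= max_nudges {
                    return Outcome::GaveUp;
                }
                nudges += 1;
                // Action 2: pull from the broker instead of waiting indefinitely.
                if let Some(info) = nudge_broker() {
                    return Outcome::FromNudge(info);
                }
            }
        }
    }
}

fn main() {
    // A subscription that never yields anything, as in the incident.
    let (_tx, rx) = mpsc::channel::<String>();
    let outcome = discover(&rx, Duration::from_millis(50), 3, || {
        Some("safekeeper-1".to_string())
    });
    println!("{outcome:?}");
}
```

A real implementation would presumably be async and would keep the subscription open after a successful nudge; the sketch only shows the shape of the timeout, the log, and the fallback.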


jcsp · 2025-01-14T15:09:33Z

Triage notes:

Mitigation: We could add a migration API that kicks ingest the same way a getpage request would, rather than just using the timeline info API.
A wait_lsn API might be useful for some tests too?

VladLazar added the c/storage (Component: storage) and t/bug (Issue Type: Bug) labels Jan 10, 2025

jcsp assigned VladLazar Jan 14, 2025

jcsp added the triaged (bugs that were already triaged) label Jan 14, 2025

This was referenced Jan 20, 2025

storcon: signal LSN wait to pageserver during live migration #10452

Open

pageserver: log on potentially stuck connection manager loop #10453

Closed
