Timeline
Storage controller triggered a live migration as an optimisation:
2025-01-10T11:42:08.876125Z INFO background_reconcile:optimize_attachment{tenant_id=ded6ee19a4cbad4adb8f5d05640db9d0 shard_id=0408}: Identified optimization: migrate attachment 9344->9345 (secondaries [NodeId(9345)])
2025-01-10T11:42:08.889162Z INFO background_reconcile: ded6ee19a4cbad4adb8f5d05640db9d0-0408 secondary on 9345 (pageserver-30.us-east-2.aws.neon.build) is warm enough for migration: SecondaryProgress { heatmap_mtime: Some(SystemTime(SystemTime { tv_sec: 1736508815, tv_nsec: 0 })), layers_downloaded: 19, layers_total: 19, bytes_downloaded: 1625481216, bytes_total: 1625481216 }
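It turned out that the new location was actually stale: its progress report covered only 19 layers, while the download step counts 998. That's fine, it downloads as much as it can within the timeout: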
2025-01-10T11:47:09.175791Z WARN reconciler{seq=3 tenant_id=ded6ee19a4cbad4adb8f5d05640db9d0 shard_id=0408}: Timed out after 300034ms downloading layers to 9345 (pageserver-30.us-east-2.aws.neon.build). Progress so far: 725/998 layers, 172792389632/209516150784 bytes
Now it moves on to waiting for the LSN on the new location to catch up. No progress is made here for about 1h:
2025-01-10T11:47:14.021333Z INFO reconciler{seq=3 tenant_id=ded6ee19a4cbad4adb8f5d05640db9d0 shard_id=0408}: 🕑 LSN origin 30/368A9498 vs destination 30/368A6480 timeline_id=deed3258f6cd7f280909693ca593371c (stuck here until the reconciler was cancelled 1h later)
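For scale, the gap in that log line is small. A minimal sketch of the arithmetic, assuming the usual PostgreSQL `hi/lo` hexadecimal LSN notation; `parse_lsn` and `lsn_lag_bytes` are hypothetical helpers written for this illustration, not the pageserver's own types:

```rust
/// Parse an LSN written as two hex halves, e.g. "30/368A9498".
/// Hypothetical helper for illustration only.
fn parse_lsn(s: &str) -> Option<u64> {
    let (hi, lo) = s.split_once('/')?;
    let hi = u64::from_str_radix(hi, 16).ok()?;
    let lo = u64::from_str_radix(lo, 16).ok()?;
    // Each half is a 32-bit value; reject anything larger.
    if hi > u32::MAX as u64 || lo > u32::MAX as u64 {
        return None;
    }
    Some((hi << 32) | lo)
}

/// Bytes of WAL the destination still has to ingest.
/// Returns None if either LSN is malformed or the destination is ahead.
fn lsn_lag_bytes(origin: &str, destination: &str) -> Option<u64> {
    let origin = parse_lsn(origin)?;
    let destination = parse_lsn(destination)?;
    origin.checked_sub(destination)
}

fn main() {
    // Values taken from the reconciler log line above.
    let lag = lsn_lag_bytes("30/368A9498", "30/368A6480");
    println!("{:?}", lag); // prints Some(12312)
}
```

The destination was only about 12 KiB behind the origin, so the stall was not a matter of ingest volume: no WAL was arriving at all.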
Why didn't we ingest that delta? The new location, pageserver-30, attached the shard and launched the WAL receiver for it:
2025-01-10T11:47:14.021171Z INFO attach{tenant_id=ded6ee19a4cbad4adb8f5d05640db9d0 shard_id=0408 gen=00000005}: launching WAL receiver for timeline deed3258f6cd7f280909693ca593371c of tenant ded6ee19a4cbad4adb8f5d05640db9d0-0408
However, it never actually connected to the safekeeper. It did subscribe to the broker for updates, but it never got any:
2025-01-10T11:47:14.021516Z INFO subscription started id=79325, key=Timeline(TenantTimelineId { tenant_id: ded6ee19a4cbad4adb8f5d05640db9d0, timeline_id: deed3258f6cd7f280909693ca593371c }), addr=10.6.30.32:53422
The broker was available throughout that period. Metrics indicate that SKs were publishing updates to the broker
and pageservers were receiving them. The metrics are not granular enough to investigate a specific timeline.
Investigation
Since the WAL receiver never connected to a safekeeper, the code must be stuck somewhere in connection_manager_loop_step.
To start the wal receiver, PS must know which SKs are available. This discovery happens through the broker.
There are two ways of getting info from the broker:
The subscription path. SK publishes an update roughly every 1s for all active timelines. Broker propagates it to the PS.
The on-demand broker pull. If there's no active connection and we have a pending get page request for an LSN in the future (or anything else that triggers an LSN wait), then we nudge the broker to tell the PS what it knows.
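The interplay of the two paths can be sketched as a small decision function. This is a simplified model for illustration, not the pageserver's actual code: `ManagerState`, its fields, and `NextStep` are hypothetical names, and the incident state shown in `main` is an assumption inferred from the logs above.

```rust
#[derive(Debug, PartialEq)]
enum NextStep {
    /// A safekeeper is known or already connected: WAL can flow.
    ConnectOrKeepStreaming,
    /// Ask the broker for what it currently knows (on-demand pull).
    PullFromBroker,
    /// Keep waiting for the subscription to deliver an update.
    WaitForSubscription,
}

/// Hypothetical, reduced view of what the connection manager knows.
struct ManagerState {
    has_active_connection: bool,
    known_safekeepers: usize,
    pending_lsn_wait: bool,
}

fn next_step(state: &ManagerState) -> NextStep {
    if state.has_active_connection || state.known_safekeepers > 0 {
        NextStep::ConnectOrKeepStreaming
    } else if state.pending_lsn_wait {
        // Only an LSN wait triggers the on-demand pull.
        NextStep::PullFromBroker
    } else {
        // No connection, no candidates, no LSN wait: nothing to do
        // but wait, for as long as the subscription stays silent.
        NextStep::WaitForSubscription
    }
}

fn main() {
    // Assumed incident state: freshly attached location, no safekeeper
    // learned from the broker yet, and no request waiting on an LSN.
    let incident = ManagerState {
        has_active_connection: false,
        known_safekeepers: 0,
        pending_lsn_wait: false,
    };
    println!("{:?}", next_step(&incident)); // prints WaitForSubscription
}
```

In this model, a silent subscription with no pending LSN wait leaves the loop waiting with no bound, which matches the hour-long stall.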
Actions
Improve observability. If we have been waiting for a certain amount of time and have learned nothing from the broker, log.
If the subscription doesn't yield anything, nudge the broker instead of waiting indefinitely (even if we don't have a pending LSN wait).
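Both actions can be sketched together as a timeout on the subscription receive. This is a minimal, synchronous illustration that uses a standard-library channel in place of the broker subscription; the real loop is asynchronous, and `wait_for_update`, `nudge_broker`, `max_nudges`, and the timeout values are all hypothetical.

```rust
use std::sync::mpsc::{self, Receiver, RecvTimeoutError};
use std::time::Duration;

/// Wait for an update on `subscription`. Each time `silence_timeout`
/// passes without one, log the silence and call `nudge_broker`.
/// Gives up (returns None) after `max_nudges` nudges go unanswered,
/// or if the sending side has gone away.
fn wait_for_update<T>(
    subscription: &Receiver<T>,
    silence_timeout: Duration,
    max_nudges: u32,
    mut nudge_broker: impl FnMut(),
) -> Option<T> {
    let mut nudges = 0;
    loop {
        match subscription.recv_timeout(silence_timeout) {
            Ok(update) => return Some(update),
            Err(RecvTimeoutError::Timeout) => {
                if nudges >= max_nudges {
                    return None;
                }
                nudges += 1;
                // Observability: make the silence visible in the logs.
                eprintln!(
                    "no broker update for {:?}, nudging broker (attempt {})",
                    silence_timeout, nudges
                );
                // Do not rely on a pending LSN wait to trigger the pull.
                nudge_broker();
            }
            Err(RecvTimeoutError::Disconnected) => return None,
        }
    }
}

fn main() {
    let (tx, rx) = mpsc::channel::<&str>();
    // Simulated broker: silent on the subscription, answers when nudged.
    let update = wait_for_update(&rx, Duration::from_millis(10), 3, || {
        tx.send("safekeeper info").unwrap();
    });
    println!("{:?}", update); // prints Some("safekeeper info")
}
```

The bounded `max_nudges` only keeps the demo finite; a real loop would more likely keep retrying with backoff.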