Skip to content

feat(core): implement core/scheduler#466

Merged
emlautarom1 merged 58 commits into
mainfrom
emlautarom1/core-scheduler
Jun 16, 2026
Merged

feat(core): implement core/scheduler#466
emlautarom1 merged 58 commits into
mainfrom
emlautarom1/core-scheduler

Conversation

@emlautarom1

@emlautarom1 emlautarom1 commented Jun 1, 2026

Copy link
Copy Markdown
Collaborator

Closes #176


Uses an Actor model to implement the core/scheduler module, solving some concurrency hacks that the original implementation had.

Metrics have been ported using the existing Vise patterns but are untested.

- Add `Scheduler` struct (WIP)
- Add `new_slot_ticker`
- Fix duplicated code
- `valcache` is required by scheduler
- Fixes cyclical dependency issue
- Use `valcache` to get the list of active validators.
- Manages internal state of the scheduler
- Stored in a `Mutex` for thread safety
- TODO's regarding concurrent access while resolving duties
- Try to reuse existing infrastructure
- Preserve only Actor API
@emlautarom1

Copy link
Copy Markdown
Collaborator Author

@claude Re-review this PR. Check for your previous review and recent comments.

@claude

claude Bot commented Jun 11, 2026

Copy link
Copy Markdown

Claude finished @emlautarom1's task in 2m 15s —— View job


Re-review: core/scheduler

Re-checked the branch against my earlier review and the discussion since. Good progress — several findings are now resolved or improved. Below is the delta, grouped by status.


✅ Addressed since last review

  • Lagging subscribers now log (scheduler.rs:132-139, 164-173) — the silent-drop regression is gone; a lag now emits tracing::error! before terminating. The remaining question (return a handle / don't terminate at all) is now a design choice, see below.
  • Reorg handler logs (handle_chain_reorg, scheduler.rs:331-343) — no longer a silent early-return; it logs "Chain reorg event handled, duties trimmed".
  • Ticker send is now cancellable (scheduler.rs:649-652) — the channel send is wrapped in select! against ct.cancelled(). (But see Update README #2 below — the sleep still isn't.)
  • Async callbackssubscribe_slot/subscribe_duty now accept Fn -> Future, matching the discussion. Tradeoff analysis below.
  • Test coverageget_duty_definition is now driven through the SchedulerHandle path (get_duty_success_then_reorg_then_get_duty_fails), and the slot-15 mid-epoch case exercises the last_in_epoch pre-resolution branch.

⚠️ Still open

1. (Major) get_duty_definition: 100 ms total timeout + network call inside the actor. Still unresolved and still the main concern.

  • SchedulerActor::get_duty_definition (scheduler.rs:352) calls self.client.fetch_slots_config().await? on every query, behind the handle's Duration::from_millis(100) (scheduler.rs:269).
  • Crucial parity detail I want to underline: in Charon the 100 * time.Millisecond is a poll interval inside for s.isResolvingEpoch(epoch) { … time.Sleep(100ms) } (scheduler.go:185-191), not a deadline — and FetchSlotsConfig is memoized in eth2wrap, so it isn't a per-call round-trip. Pluto converts a poll interval into a hard total timeout and puts a live beacon call on the hot path. Under a slow beacon node this returns TimeoutError where Charon would wait.
  • Fix together: cache slots_per_epoch on the actor (it's already fetched in new_slot_ticker) so the query is synchronous; then either drop the timeout or let the caller pass one.
  • SchedulerError::EpochNotResolved (scheduler.rs:62-66) still drops the duty field that Charon logs (scheduler.go:194-195).

Fix this →

2. new_slot_ticker inter-slot sleep still ignores cancellation (scheduler.rs:631-636). You wrapped the send but the tokio::time::sleep(wait).await before it is bare, so a shutdown mid-slot still blocks the ticker task for up to a full slot (~12 s). Wrap the sleep in the same select!:

tokio::select! {
    _ = ct.cancelled() => break,
    _ = tokio::time::sleep(wait) => {}
}

Fix this →

3. Reorg featureset gate still punted to the caller. with_chain_reorgs (scheduler.rs:181-184) keeps the // NOTE: The SSE feature check should be done by the caller comment. Charon gates on featureset.Enabled(featureset.SSEReorgDuties) inside the handler and warns when disabled (scheduler.go:142-153). pluto-featureset is already a dependency of this crate — consider matching Charon and checking it in-handler, or at minimum document the requirement loudly on the public method (a doc-comment, not an inline note).

5. handle_chain_reorg is still pub async fn on the private SchedulerActor (scheduler.rs:331). Drop pub.

6. Redundant clone in build (scheduler.rs:208) — new_slot_ticker(&client.clone(), …) still clones then borrows; new_slot_ticker takes &client. Pass &client.

7. validator_status reset is still the O(total-seen) scan (scheduler.rs:690-696) with no TODO. Non-blocking, but worth a note about the HashMap<PubKey, String> last-status optimization.

8. Nits: enum SchedulerMessage is still wedged against the impl Default block with no blank line (scheduler.rs:236-237). fast_backoff still carries the // TODO: Duplicated from crates/p2p/src/bootnode.rs — please file a follow-up to DRY it.

9. Test gap: no slow-path delay_slot_offset test — every test uses test_past_slot, so the deadline has always elapsed and the with_cancellation_token_owned race is never actually exercised with a live wait. One test with a near-future slot + cancellation would close this.


On the open design questions

  • backon dependency (varex83: "move it back to dev-dependencies"): it can't move — fast_backoff/default_backoff and the .retry(...) calls in wait_chain_start/wait_beacon_sync are production paths, not tests. Keeping it in [dependencies] is correct.
  • Callbacks vs returning the Receiver (your comment): I'd lean toward returning the broadcast::Receiver from subscribe_*. It pushes the await-each-vs-spawn-per-callback and cancellation decisions to the caller (per your own analysis), removes the lag-terminates-subscription policy from the scheduler entirely, and sidesteps the "should the actor outlive a slow callback" question. The callback form bakes one policy in for everyone; the channel form is strictly more flexible and is the idiomatic Rust shape here. If you keep callbacks, document the await-each-in-order behavior (and the lag-terminates semantics) on the public methods.
  • await-each vs spawn-per-callback: Charon runs subscribers without ordering guarantees, but I wouldn't treat "Charon does X" as decisive here — ordering-preserving back-pressure (current behavior) is the safer default for duty emission. Returning the Receiver makes this moot.

Summary

Solid iteration — the observability regressions (#3 logging, #4 logging) and the test/handle coverage are meaningfully better. #1 remains the blocker (hard 100 ms timeout + per-call beacon round-trip is a real behavioral divergence from Charon's memoized + poll-loop semantics). #2 (mid-slot sleep cancellation) and the pub/clone/blank-line cleanups are quick wins. Recommend addressing #1 and #2 before merge; the rest can be a cleanup pass.

  • Read current scheduler.rs + metrics.rs
  • Re-check each prior finding
  • Review recent changes (async callbacks, logging, tests)
  • Cross-reference Charon for deltas
  • Post updated review
    • Branch: emlautarom1/core-scheduler

Comment thread crates/core/src/scheduler.rs Outdated
Comment thread crates/core/src/types.rs Outdated
Comment thread crates/core/src/scheduler.rs Outdated
Comment thread crates/core/src/scheduler.rs Outdated
Comment thread crates/core/src/scheduler.rs Outdated
@emlautarom1 emlautarom1 requested a review from iamquang95 June 15, 2026 18:31
@emlautarom1 emlautarom1 merged commit 8b8d317 into main Jun 16, 2026
16 checks passed
@emlautarom1 emlautarom1 deleted the emlautarom1/core-scheduler branch June 16, 2026 15:20
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Implement core/scheduler

4 participants