Make compiler side device-type-agnostic by lockshaw · Pull Request #1651 · flexflow/flexflow-train

lockshaw · 2026-06-02T22:52:48Z

Description of changes:

Remove DeviceType from MachineSpaceCoord
Move task-spec to use device_id_t
Flesh out copy_insertion logic

Related Issues:

Linked Issues:

Issue #

Issues closed by this PR:

Closes #

This change is

…e_id_t, flesh out copy_insertion logic

elliottslaughter

@elliottslaughter partially reviewed 120 files and all commit messages, and made 11 comments.
Reviewable status: all files reviewed, 11 unresolved discussions (waiting on lockshaw).

lib/realm-execution/include/realm-execution/distributed_per_device_op_state_initialization.h line 29 at r1 (raw file):

    OptimizerAttrs const &optimizer_attrs,
    Realm::Event precondition,
    DeviceType device_type);

Maybe I'm misunderstanding how this works, but isn't this included in the mapping by this point? I'd expect this to be a property of the specific node invocations inside of dg not something passed separately.

I think by the time we get into Realm code proper, we generally do not want to be passing around explicit DeviceType because we want Realm to be able to handle the case of mixed mapping (i.e., some operators on CPU and others on GPU, or some on GPU and some on FGPA/TPU/ASIC flavor of the week).

lib/realm-execution/include/realm-execution/instance_allocation.h line 31 at r1 (raw file):

        &preallocated,
    RealmContext &ctx,
    DeviceType device_type);

Same comment here.

lib/realm-execution/include/realm-execution/realm_context.h line 36 at r1 (raw file):

  ///\{
  Realm::Processor map_device_coord_to_processor(device_id_t const &) const;
  device_id_t map_processor_to_device_coord(Realm::Processor) const;

Should we be calling these map_device_id_to_processor and map_processor_to_device_id?

lib/realm-execution/include/realm-execution/realm_context.h line 114 at r1 (raw file):

  void discover_machine_topology();

public:

This is definitely unsafe. Why are we exposing these?

lib/realm-execution/src/realm-execution/distributed_per_device_op_state_initialization.cc line 26 at r1 (raw file):

    OptimizerAttrs const &optimizer_attrs,
    Realm::Event precondition,
    DeviceType device_type) {

Same comment as before.

lib/realm-execution/src/realm-execution/instance_allocation.cc line 42 at r1 (raw file):

        &preallocated,
    RealmContext &ctx,
    DeviceType device_type) {

As noted elsewhere, does not appear to even be used in this function.

lib/realm-execution/src/realm-execution/pcg_instance.cc line 120 at r1 (raw file):

  dg = perform_update_insertion(dg, optimizer_attrs);
  dg = perform_copy_insertion(dg);
  debug_print_dynamic_open_dataflow_graph_as_dot(dg);

Remove before final merge (unless this is something we can put behind a switch).

lib/realm-execution/src/realm-execution/realm_context.cc line 25 at r1 (raw file):

      allocator(get_realm_allocator(
          processor, RealmContext::get_nearest_memory(processor))) {
  if (processor != Realm::Processor::NO_PROC) {

We don't want to do this here because we construct RealmContext in every task. The original design intentionally does this dynamically so that we don't do it eagerly in every context. Unless there's some other design constraint I'm not aware of, I'd like to go back to that.

lib/realm-execution/src/realm-execution/realm_context.cc line 79 at r1 (raw file):

device_id_t RealmContext::get_current_device_idx() const {
  Realm::Processor proc = this->get_current_processor();
  return this->map_processor_to_device_coord(proc);

This is an example of a tradeoff that looks bad, but the replacement is actually worse.

The code you've removed is O(local machine size). The code you've replaced it with is O(global machine size). That means paying to scan 10,000 GPUs when you're scaling to a large HPC machine.

The comment in the old code is accurate, but it's an intentional design tradeoff to avoid building a data structure that fundamentally cannot be efficient unless we make it a singleton, which I have not been willing to do.

Since you might not be aware, a couple of design considerations:

Realm has a limit of 64 processors per address space so the current loop is effectively O(64). Realm itself would need to change to increase this limit because it influences how we bit pack IDs. Therefore, this is a fairly stable assumption that will not change any time soon (if ever).

The Realm machine query infrastructure is actually pretty efficient and employs caches to speed up any expensive queries. Therefore, I'm pretty comfortable in general writing code that does a dynamic query on the machine structure, even in relatively small tasks.

lib/task-spec/src/task-spec/dynamic_graph/dynamic_node_invocation.cc line 57 at r1 (raw file):

                });

  ASSERT(are_disjoint(input_slots, output_slots));

I guess this is fine, but is there no merge_disjoint_sets?

lib/task-spec/src/task-spec/dynamic_graph/dynamic_open_dataflow_graph.cc line 62 at r1 (raw file):

         keys(values_produced_multiple_times));

  labelled_open_kwarg_dataflow_graph_from_dynamic_open_dataflow_graph(g);

Since we're throwing away the value, maybe comment on what property we're checking here?

lockshaw requested a review from elliottslaughter June 2, 2026 22:52

lockshaw added 2 commits June 5, 2026 01:18

Minor include fixes (flexflow#1655)

768c1eb

Remove DeviceType from MachineSpaceCoord, move task-spec to use devic…

bfbb318

…e_id_t, flesh out copy_insertion logic

lockshaw force-pushed the device-type-agnostic-compiler branch from 2695113 to bfbb318 Compare June 5, 2026 10:07

elliottslaughter reviewed Jun 5, 2026

View reviewed changes

lockshaw force-pushed the master branch from 768c1eb to 613a64d Compare June 6, 2026 02:13

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Make compiler side device-type-agnostic#1651

Make compiler side device-type-agnostic#1651
lockshaw wants to merge 2 commits into
flexflow:masterfrom
lockshaw:device-type-agnostic-compiler

lockshaw commented Jun 2, 2026 •

edited by wmdi

Loading

Uh oh!

elliottslaughter left a comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

lockshaw commented Jun 2, 2026 • edited by wmdi Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

elliottslaughter left a comment

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

lockshaw commented Jun 2, 2026 •

edited by wmdi

Loading