Make compiler side device-type-agnostic#1651
Conversation
…e_id_t, flesh out copy_insertion logic
2695113 to
bfbb318
Compare
elliottslaughter
left a comment
There was a problem hiding this comment.
@elliottslaughter partially reviewed 120 files and all commit messages, and made 11 comments.
Reviewable status: all files reviewed, 11 unresolved discussions (waiting on lockshaw).
lib/realm-execution/include/realm-execution/distributed_per_device_op_state_initialization.h line 29 at r1 (raw file):
OptimizerAttrs const &optimizer_attrs, Realm::Event precondition, DeviceType device_type);
Maybe I'm misunderstanding how this works, but isn't this included in the mapping by this point? I'd expect this to be a property of the specific node invocations inside of dg not something passed separately.
I think by the time we get into Realm code proper, we generally do not want to be passing around explicit DeviceType because we want Realm to be able to handle the case of mixed mapping (i.e., some operators on CPU and others on GPU, or some on GPU and some on FGPA/TPU/ASIC flavor of the week).
lib/realm-execution/include/realm-execution/instance_allocation.h line 31 at r1 (raw file):
&preallocated, RealmContext &ctx, DeviceType device_type);
Same comment here.
lib/realm-execution/include/realm-execution/realm_context.h line 36 at r1 (raw file):
///\{ Realm::Processor map_device_coord_to_processor(device_id_t const &) const; device_id_t map_processor_to_device_coord(Realm::Processor) const;
Should we be calling these map_device_id_to_processor and map_processor_to_device_id?
lib/realm-execution/include/realm-execution/realm_context.h line 114 at r1 (raw file):
void discover_machine_topology(); public:
This is definitely unsafe. Why are we exposing these?
lib/realm-execution/src/realm-execution/distributed_per_device_op_state_initialization.cc line 26 at r1 (raw file):
OptimizerAttrs const &optimizer_attrs, Realm::Event precondition, DeviceType device_type) {
Same comment as before.
lib/realm-execution/src/realm-execution/instance_allocation.cc line 42 at r1 (raw file):
&preallocated, RealmContext &ctx, DeviceType device_type) {
As noted elsewhere, does not appear to even be used in this function.
lib/realm-execution/src/realm-execution/pcg_instance.cc line 120 at r1 (raw file):
dg = perform_update_insertion(dg, optimizer_attrs); dg = perform_copy_insertion(dg); debug_print_dynamic_open_dataflow_graph_as_dot(dg);
Remove before final merge (unless this is something we can put behind a switch).
lib/realm-execution/src/realm-execution/realm_context.cc line 25 at r1 (raw file):
allocator(get_realm_allocator( processor, RealmContext::get_nearest_memory(processor))) { if (processor != Realm::Processor::NO_PROC) {
We don't want to do this here because we construct RealmContext in every task. The original design intentionally does this dynamically so that we don't do it eagerly in every context. Unless there's some other design constraint I'm not aware of, I'd like to go back to that.
lib/realm-execution/src/realm-execution/realm_context.cc line 79 at r1 (raw file):
device_id_t RealmContext::get_current_device_idx() const { Realm::Processor proc = this->get_current_processor(); return this->map_processor_to_device_coord(proc);
This is an example of a tradeoff that looks bad, but the replacement is actually worse.
The code you've removed is O(local machine size). The code you've replaced it with is O(global machine size). That means paying to scan 10,000 GPUs when you're scaling to a large HPC machine.
The comment in the old code is accurate, but it's an intentional design tradeoff to avoid building a data structure that fundamentally cannot be efficient unless we make it a singleton, which I have not been willing to do.
Since you might not be aware, a couple of design considerations:
Realm has a limit of 64 processors per address space so the current loop is effectively O(64). Realm itself would need to change to increase this limit because it influences how we bit pack IDs. Therefore, this is a fairly stable assumption that will not change any time soon (if ever).
The Realm machine query infrastructure is actually pretty efficient and employs caches to speed up any expensive queries. Therefore, I'm pretty comfortable in general writing code that does a dynamic query on the machine structure, even in relatively small tasks.
lib/task-spec/src/task-spec/dynamic_graph/dynamic_node_invocation.cc line 57 at r1 (raw file):
}); ASSERT(are_disjoint(input_slots, output_slots));
I guess this is fine, but is there no merge_disjoint_sets?
lib/task-spec/src/task-spec/dynamic_graph/dynamic_open_dataflow_graph.cc line 62 at r1 (raw file):
keys(values_produced_multiple_times)); labelled_open_kwarg_dataflow_graph_from_dynamic_open_dataflow_graph(g);
Since we're throwing away the value, maybe comment on what property we're checking here?
Description of changes:
Related Issues:
Linked Issues:
Issues closed by this PR:
This change is