BTW, I think this should also work, but it contradicts the validation num_process == dp_replicate_size * dp_shard_size * tp_size * cp_size * sp_size. From my understanding, sharding the model with dp_shard_size should never be affected by the non_data_parallel_size. Thanks!
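For concreteness, here is a minimal sketch of the kind of world-size check being discussed. The function name and signature are hypothetical (this is not Accelerate's actual API); it only illustrates the constraint that the parallelism dimensions must exactly tile the process count:

```python
# Hypothetical illustration of the validation referred to above.
# validate_mesh and its parameter names are assumptions, not Accelerate's API.

def validate_mesh(num_processes: int, dp_replicate_size: int, dp_shard_size: int,
                  tp_size: int, cp_size: int, sp_size: int) -> bool:
    """Return True if the parallel dimensions exactly tile the process count."""
    non_data_parallel_size = tp_size * cp_size * sp_size
    return num_processes == (dp_replicate_size * dp_shard_size
                             * non_data_parallel_size)

# 16 GPUs: 2-way replication x 4-way sharding x 2-way TP tiles exactly.
print(validate_mesh(16, 2, 4, 2, 1, 1))   # True
# Same dimensions with 2-way CP would need 32 processes, so 16 fails.
print(validate_mesh(16, 2, 4, 2, 2, 1))   # False
```

Under this check, dp_shard_size does enter the product together with the non-data-parallel sizes, which is the interaction the question is about.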
Ruiyuan Gao
flymin
Commented on the article "Accelerate ND-Parallel: A guide to Efficient Multi-GPU Training"