github.com-microsoft-DeepSpeed
all · 6 devs · built 2026-06-13
Repository snapshot
Monthly reports
Highlights
- Enabled `torch.func` transformations for DeepSpeed engine with *ZeRO stages 0, 1, and 2*, allowing advanced gradient computations via [4370aa50 · Sung Hyun Cho].
- Introduced `engine.coalesce_grad_reduction()` for *ZeRO 1/2/3 multi-backward patterns*, significantly improving efficiency by coalescing gradient reductions via [60b242af · Sung Hyun Cho].
- Added support for *bf16 optimizer states with CPU offload* for *ZeRO stages 1, 2, and 3*, reducing CPU host RAM usage by storing Adam moments in bf16 precision via [3c337b54 · lucaspirola].
- Integrated an *SDMA allgather backend* for *AMD MI300 GPUs* in *DeepSpeed ZeRO-3*, optimizing parameter prefetch and speeding up training by 10-11% for MI300X hardware via [66af8f03 · inkcherry].
- Implemented *automatic Sequence Parallelism (AutoSP) support for multimodal models* (ViT encoders and LLM decoders), reducing memory footprint for long sequence inputs via [4e668fce · nathon].
- Streamlined setup for *DS4Sci EvoformerAttention* by *automating CUTLASS installation path discovery*, eliminating manual configuration via [d5356e07 · Max Tretikov].
Observations
- Total output (Grow + Maintenance) increased 87% compared to the 2-month average (current: 7, average: 4), indicating a highly productive month.
- Maintenance score surged 128% compared to the 2-month average (current: 4, average: 2), reflecting a strong emphasis on improving existing systems, CI stability, and release management.
- Commit volume saw a moderate increase of 22% (36 commits this month vs 30-commit 2-month average).
- Grow and Waste scores remained stable compared to the 2-month average (current Grow: 1 vs average: 1; current Waste: 1 vs average: 1).
- A significant number of bug fixes were implemented, addressing critical issues such as a *ZeRO-3 forward crash* on modules with plain dict `_parameters` [d7a3972f · Sung Hyun Cho], a *critical file descriptor leak* in `FastFileWriter` [b01a0915 · jg-heo], and a *command injection vulnerability* in `data_analyzer.py` [8cdf8651 · OrbisAI Security].
- Multiple compatibility fixes were delivered, including enabling `vmap` for *LinearFunctionForZeroStage3* [ae576f83 · Sung Hyun Cho], fixing *DeepCompile AOT kwargs patching* for PyTorch >= v2.11 [510ebe58 · Masahiro Tanaka], and supporting *flash-attn 2.7.0* in FPDT attention [45429221 · bincheng.xiong].
- CI/CD robustness was a recurring theme, with fixes for *PR-target workflow concurrency* [b7aef4dc · Masahiro Tanaka], *AutoSP compile test sequencing* [2c8a007b · Masahiro Tanaka], and *full CI test isolation* for ZeRO chmod and NVMe quantization [4570c508 · Masahiro Tanaka].
Performance over time
ETV stacked by Growth, Maintenance and Fixes — 90-day moving average, normalized to ETV / month.
Average performance per developer
ETV per active developer per month — 30-day moving average.
Active developers over time
Unique developers committing each day — 90-day moving average.
Knowledge concentration
How dependent is this repo on a small number of contributors? Higher top-1 share = higher key-person risk.
Masahiro Tanaka owns 34.6 % of commits.
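As a rough illustration of the top-1 share metric described above, the sketch below computes the share of commits held by the most frequent author. The author names and counts are hypothetical, and the dashboard's exact counting rules (for example, how co-authors are treated) are not stated in this report.

```python
from collections import Counter


def top1_share(commit_authors):
    """Return (author, percent) for the author with the most commits."""
    counts = Counter(commit_authors)
    author, n_commits = counts.most_common(1)[0]
    return author, 100.0 * n_commits / len(commit_authors)


# Hypothetical commit log: one entry per commit, value = author name.
commit_authors = ["dev_a"] * 9 + ["dev_b"] * 6 + ["dev_c"] * 5
top_author, share = top1_share(commit_authors)
print(f"{top_author} owns {share:.1f} % of commits")
```

A higher percentage means more of the history depends on one person, which is the key-person risk the section refers to.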
Top contributors
Most impactful commits
Top 20 by ETV in the all-time window.
- 1.5ETVSupport custom partitioning patterns for AutoTP (#7806) This PR introduces a flexible, configuration-driven API for AutoTP (Automatic Tensor Parallelism) that allows users to define custom layer partitioning patterns for training. @inkcherry @delock ## Motivation Previously, AutoTP relied on hardcoded layer detection logic that was difficult to customize for new model architectures. This PR enables: 1. **Custom models**: Users can define exact regex patterns to match their model's parameter names 2. **Fused layers**: Support for fused QKV, gate_up_proj, and other packed weight matrices with unequal sub-parameter sizes (e.g., GQA with different Q/K/V dimensions) 3. **Extensibility**: Easy to add new model presets or customize existing ones Here is an example of a config including custom partitioning patterns: ```json { "tensor_parallel": { "autotp_size": 4, "partition_config": { "use_default_specs": false, "layer_specs": [ { "patterns": [".*\\.o_proj\\.weight$", ".*\\.down_proj\\.weight$"], "partition_type": "row" }, { "patterns": [".*\\.[qkv]_proj\\.weight$"], "partition_type": "column" }, { "patterns": [".*\\.gate_up_proj\\.weight$"], "partition_type": "column", "shape": [2, -1], "partition_dim": 0 } ] } } } ``` Refer to the [document](https://github.com/tohtana/DeepSpeed/blob/tohtana/autotp_custom_patterns/docs/code-docs/source/training.rst) for more details (including preset models and how to define partitioning for fused models). We also opened a new [PR](https://github.com/deepspeedai/DeepSpeedExamples/pull/998) to show the usage. ## Simplified initialization step AutoTP previously required calling ``set_autotp_mode(training=True)`` and ``deepspeed.tp_model_init`` before ``deepspeed.initialize``. Now we can include all the necessary configurations in the DeepSpeed config. We still support the traditional initialization path for backward compatibility. When you use both (i.e. 
calling ``set_autotp_mode(training=True)`` and ``deepspeed.tp_model_init`` and passing the config to ``deepspeed.initialize``), we will merge the settings at initialization. When we have conflicting settings, we will error out. --------- Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>Masahiro Tanaka · 6b9cab1d · 2026-01-31
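To make the pattern-based partitioning above more concrete, here is a minimal sketch of how the regex patterns from the example config could be matched against parameter names. The parameter names are hypothetical, and the first-match-wins rule and the use of `re.match` are assumptions of this sketch, not a description of DeepSpeed's actual matching code.

```python
import re

# Layer specs copied from the example config in the commit message.
LAYER_SPECS = [
    {"patterns": [r".*\.o_proj\.weight$", r".*\.down_proj\.weight$"],
     "partition_type": "row"},
    {"patterns": [r".*\.[qkv]_proj\.weight$"],
     "partition_type": "column"},
    {"patterns": [r".*\.gate_up_proj\.weight$"],
     "partition_type": "column", "shape": [2, -1], "partition_dim": 0},
]


def match_partition_type(param_name, layer_specs=LAYER_SPECS):
    """Return the partition_type of the first matching spec, or None.

    Assumption: specs are tried in order and the first match wins.
    """
    for spec in layer_specs:
        for pattern in spec["patterns"]:
            if re.match(pattern, param_name):
                return spec["partition_type"]
    return None


# Hypothetical parameter names in the style of a Llama-like model.
names = [
    "model.layers.0.self_attn.q_proj.weight",
    "model.layers.0.self_attn.o_proj.weight",
    "model.layers.0.mlp.gate_up_proj.weight",
    "model.layers.0.input_layernorm.weight",
]
result = {name: match_partition_type(name) for name in names}
for name, partition_type in result.items():
    print(f"{name}: {partition_type}")
```

Parameters that match no pattern (the layer norm weight here) get `None`, which this sketch treats as "leave unpartitioned".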
- 1.1 ETV PyTorch-compatible backward API (#7665) Currently DeepSpeed's backward API has more constraints compared to PyTorch's normal backward API. Here is the usage as described in the documentation:
  ```python
  loss = model_engine(batch)
  model_engine.backward(loss)
  ```
  In this example,
  1. Only accepts a (scalar) loss value
  2. Need to call engine's backward API

  In contrast, in standard PyTorch, you can do:
  ```python
  output = model(batch)
  output.backward(out_grad)
  ```
  There are several use cases that rely on this flexibility. For example, combining multiple models or using loss functions defined separately from the main model. If you attempt the same pattern with a DeepSpeed engine, some preprocessing and postprocessing steps will be silently skipped, which can lead to incorrect results. The [document](https://deepspeed.readthedocs.io/en/latest/training.html#jointly-training-models-with-shared-loss) explains we can call `_backward_epilogue` manually (possibly `backward_prologue` as well). However, it's easy for users to miss these calls, and passing a non-scalar gradient is still not supported. This PR introduces the same `.backward()` behavior as PyTorch, allowing `.backward()` to be called directly on tensors and supporting non-scalar outputs. To implement post-backward hooks, we had to use some torch internal APIs. See [comments](https://github.com/deepspeedai/DeepSpeed/blob/73f7ff1aab9d1387eb7dd4eca7453a25024533f4/deepspeed/runtime/engine.py#L424) for more details. When the internal APIs are not available, DeepSpeed engine only accepts the traditional way `model_engine.backward(loss)`. --------- Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com> Masahiro Tanaka · 53e91a09 · 2025-11-19
- 1.0ETVUlysses SP for HF Integration (#7268) This is the Deepspeed counterpart of https://github.com/snowflakedb/ArcticTraining/pull/45 - as the new feature(s) require changes on both sides. For PR reviewers: Readiness status: - [x] Code - [x] Tests - [ ] Docs - working on it Features: - [x] add support for delaying grad addition via `param.ds_grad_is_ready` flag (used when performing tiled compute in an autograd function) - [x] add light sp-only mpu version (Jeff Rasley) - [x] improved debug - [x] added `all_gather_object` to `dist` - [x] `UlyssesSPAttentionHF` (port of UlyssesAttention from Megatron-Deepspeed plus modern MHA-variations) - [x] `UlyssesSPDataLoaderAdapter` - DL adapter to shard the normal DL batches to be used by `UlyssesSPAttentionHF` - [x] `SequenceTiledCompute` - generic autograd function to perform compute after tiling on the sequence dimension - [x] `TiledMLP` - a specific autograd function to perform tiled MLP (it's much easier to understand before trying to grok `SequenceTiledCompute`) - [x] added a differentiable `_DimZeroAllToAll` (Samyam Rajbhandari) - [x] torch-dist-check now allows `torch.distributed.nn` (which is needed since deepspeed's dist is not up to date with `torch.distributed.nn`) --------- Signed-off-by: Stas Bekman <stas.bekman@snowflake.com> Signed-off-by: Stas Bekman <stas@stason.org> Co-authored-by: Stas Bekman <stas.bekman@snowflake.com> Co-authored-by: Jeff Rasley <jerasley@microsoft.com> Co-authored-by: Olatunji Ruwase <tunji.ruwase@snowflake.com>Stas Bekman · 4d00b38a · 2025-05-31
- 0.8 ETV CI: prefer bf16 over fp16 (#7304) These days fp16 is barely ever used, so we should be testing bf16 instead of fp16 where possible. Had to fix a bunch of tests to adapt to this change, and a few bugs along the way. --------- Signed-off-by: Stas Bekman <stas.bekman@snowflake.com> Co-authored-by: Olatunji Ruwase <tunji.ruwase@snowflake.com> Co-authored-by: Stas Bekman <stas.bekman@snowflake.com> Stas Bekman · b4cc079e · 2025-05-28
- 0.8 ETV Fix DeepCompile ZeRO-1 grad target lifetime (#8036) DeepCompile ZeRO-1 kept compile-time reduce target buffers alive into the optimizer step, causing backward gradient storage to overlap with optimizer temporaries. This PR fixes the issue by making DeepCompile ZeRO-1 reduce targets follow the normal step-local ZeRO partition gradient-buffer lifetime, instead of preserving cloned target storage from compile setup. The actual code changes are:
  - During compile initialization, register empty DeepCompile ZeRO-1 gradient targets, then bind them to the step-local flat ZeRO partition gradient buffer and per-parameter views when gradients are ready to synchronize.
  - After ZeRO-1 builds the optimizer-facing fp32 gradient partition, release the DeepCompile registry references and clear reduce bucket storage after backward synchronization.

  | | Step 10-30 avg sec | Peak alloc GiB |
  | --- | --- | ---: |
  | Without this PR | 0.858 | 43.594 |
  | With this PR | 0.859 | 39.366 |

  Fine-tuning style training (8xH100, Qwen3-8B random weights, bs/GPU=1, seq=4096, GAS=1) showed only finite loss values for 1000 steps. --------- Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com> Masahiro Tanaka · 11eeb7cd · 2026-06-09
- 0.7 ETV Fix DeepCompile+Z3 on PyTorch v2.9/2.10 (#7951) DeepCompile+Z3 didn't work with PyTorch v2.9/2.10 because:
  - PyTorch v2.9+ started enforcing stricter TorchDynamo parameter tensor-match guards. During DeepCompile tracing, some ZeRO-3 parameters were temporarily all-gathered, so Dynamo recorded full sizes such as 4096.
  - By the time guard evaluation ran, DeepSpeed had already released those params back to the normal ZeRO-3 partitioned representation, where `param.data` is `empty(0)`. That produced guard failures like `expected 4096, actual 0`.

  This PR resolves the issue by:
  - Keeping full-shape dummy tensors for symbolic tracing
  - Overriding guard size/stride metadata for ZeRO-3 params to the stable released representation instead of transient gathered sizes

  This PR also includes fixes for these bugs:
  - For v2.7 and v2.8, the compiled backward graph could hoist `end_backward` ahead of the real `reduce_grad` calls.
  - The selective unsharding pass could overcount the persistence memory budget.

  Note: DeepCompile is still incompatible with v2.11. It will be addressed by another PR. --------- Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com> Masahiro Tanaka · ecb26a51 · 2026-04-11
- 0.7 ETV Update GH org references (#6998) Signed-off-by: Olatunji Ruwase <olruwase@microsoft.com> Signed-off-by: Logan Adams <loadams@microsoft.com> Signed-off-by: Fabien Dupont <fdupont@redhat.com> Co-authored-by: Fabien Dupont <fabiendupont@fabiendupont.fr> Olatunji Ruwase · fd405169 · 2025-02-05
- 0.7 ETV Enable torch.autocast with ZeRO (#6993) DeepSpeed supports mixed precision training, but the behavior is different from `torch.autocast`. DeepSpeed maintains parameters and gradients both in FP32 and a lower precision (FP16/BF16) (NVIDIA Apex AMP style) and computes all modules in the lower precision, while `torch.autocast` maintains parameters in FP32 but computes only certain operators in the lower precision. This leads to differences in:
  - performance: `torch.autocast` needs downcast in forward/backward
  - memory usage: DeepSpeed needs more memory to keep copies of parameters and gradients in lower precision
  - accuracy: `torch.autocast` has a list of modules that can safely be computed in lower precision. Some precision-sensitive operators (e.g. softmax) are computed in FP32.

  To align DeepSpeed's behavior with `torch.autocast` when necessary, this PR adds the integration with `torch.autocast` with ZeRO. Here is an example of the configuration.
  ```json
  "torch_autocast": {
    "enabled": true,
    "dtype": "bfloat16",
    "lower_precision_safe_modules": ["torch.nn.Linear", "torch.nn.Conv2d"]
  }
  ```
  Each configuration works as follows:
  - `enabled`: Enable the integration with `torch.autocast` if this is set to `True`. You don't need to call `torch.autocast` in your code. The grad scaler is also applied in the DeepSpeed optimizer.
  - `dtype`: lower precision dtype passed to `torch.autocast`. Gradients for allreduce (reduce-scatter) and parameters for allgather (only for ZeRO3) of `lower_precision_safe_modules` are also downcast to this dtype.
  - `lower_precision_safe_modules`: Downcast for allreduce (reduce-scatter) and allgather (ZeRO3) is applied only to modules specified in this list. (The precision for PyTorch operators in forward/backward follows `torch.autocast`'s policy, not this list.) You can set names of classes with their packages.
If you don't set this item, DeepSpeed uses the default list: `[torch.nn.Linear, torch.nn.Conv1d, torch.nn.Conv2d, torch.nn.Conv3d]`. Note that we only maintain FP32 parameters with this feature enabled. For consistency, you cannot enable `fp16` or `bf16` in DeepSpeed config. --------- Signed-off-by: Masahiro Tanaka <mtanaka@microsoft.com> Signed-off-by: Fabien Dupont <fdupont@redhat.com> Signed-off-by: Olatunji Ruwase <olruwase@microsoft.com> Signed-off-by: Logan Adams <loadams@microsoft.com> Signed-off-by: inkcherry <mingzhi.liu@intel.com> Signed-off-by: Omar Elayan <oelayan@habana.ai> Signed-off-by: Roman Fitzjalen <romaactor@gmail.com> Signed-off-by: Hongwei <hongweichen@microsoft.com> Signed-off-by: shaomin <wukon1992@gmail.com> Signed-off-by: Stas Bekman <stas@stason.org> Signed-off-by: siqi <siqi@tecorigin.com> Signed-off-by: Wei Wu <wuwei211x@gmail.com> Signed-off-by: ShellyNR <shelly.nahir@live.biu.ac.il> Signed-off-by: Lai, Yejing <yejing.lai@intel.com> Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com> Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com> Co-authored-by: Fabien Dupont <fabiendupont@fabiendupont.fr> Co-authored-by: Liangliang Ma <1906710196@qq.com> Co-authored-by: inkcherry <mingzhi.liu@intel.com> Co-authored-by: Omar Elayan <142979319+oelayan7@users.noreply.github.com> Co-authored-by: Stas Bekman <stas00@users.noreply.github.com> Co-authored-by: Roman Fitzjalen <romaactor@gmail.com> Co-authored-by: Ramya Ramineni <62723901+rraminen@users.noreply.github.com> Co-authored-by: Guanhua Wang <alexwgh333@gmail.com> Co-authored-by: root <root@ftqtmec25000000.taxzvufipdhelhupulxcbvr15f.ux.internal.cloudapp.net> Co-authored-by: Hongwei Chen <33092912+hwchen2017@users.noreply.github.com> Co-authored-by: Joe Mayer <114769929+jomayeri@users.noreply.github.com> Co-authored-by: wukong1992 <wukong1992@users.noreply.github.com> Co-authored-by: shaomin <wukon1992@gmail.com> Co-authored-by: loadams <loadams@users.noreply.github.com> 
Co-authored-by: siqi654321 <siqi202311@163.com> Co-authored-by: siqi <siqi@tecorigin.com> Co-authored-by: Wei Wu <45323446+U-rara@users.noreply.github.com> Co-authored-by: Shelly Nahir <73890534+ShellyNR@users.noreply.github.com> Co-authored-by: snahir <snahir@habana.ai> Co-authored-by: Yejing-Lai <yejing.lai@intel.com> Co-authored-by: Siddharth Singh <siddharth9820@gmail.com> Co-authored-by: Olatunji Ruwase <tjruwase@gmail.com>Masahiro Tanaka · ed5f7375 · 2025-06-19
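The default-list behavior described in the `torch.autocast` commit above can be sketched as follows. This is only an illustration of the documented rule (use the default list when `lower_precision_safe_modules` is not set); the helper name `resolve_safe_modules` is hypothetical and is not part of DeepSpeed.

```python
import json

# Default list as stated in the commit message.
DEFAULT_SAFE_MODULES = [
    "torch.nn.Linear",
    "torch.nn.Conv1d",
    "torch.nn.Conv2d",
    "torch.nn.Conv3d",
]


def resolve_safe_modules(ds_config):
    """Hypothetical helper: modules whose communication would be downcast."""
    section = ds_config.get("torch_autocast", {})
    if not section.get("enabled", False):
        return []  # integration disabled: nothing is downcast by this feature
    return list(section.get("lower_precision_safe_modules", DEFAULT_SAFE_MODULES))


# Config without an explicit list falls back to the defaults.
cfg_default = json.loads('{"torch_autocast": {"enabled": true, "dtype": "bfloat16"}}')
# Config with an explicit list, as in the commit's example.
cfg_custom = json.loads(
    '{"torch_autocast": {"enabled": true, "dtype": "bfloat16", '
    '"lower_precision_safe_modules": ["torch.nn.Linear", "torch.nn.Conv2d"]}}'
)
print(resolve_safe_modules(cfg_default))
print(resolve_safe_modules(cfg_custom))
```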
- 0.6ETVAdd HuggingFace tp_plan support for AutoTP (#7901) ## Summary Adds automatic detection and use of HuggingFace's built-in `base_model_tp_plan` for AutoTP, addressing the HuggingFace tp_plan support item from #7861. Models that ship with a `tp_plan` (e.g. Llama, Qwen, Gemma2) now work with AutoTP out of the box — no `preset_model` or `partition_config` needed, just set `autotp_size`. ## Changes **Runtime** - `engine.py`: Added tp_plan fallback in `_apply_autotp_partitioning`. Priority order: `partition_config` > HF `tp_plan` > AutoTP heuristics. - `config.py`: Added `_get_hf_tp_plan(model)` to extract tp_plan from `model._tp_plan` or `model.config.base_model_tp_plan`. - `tp_plan_converter.py`: New file. `TPPlanConverter` converts HF tp_plan entries (`colwise`/`rowwise`) to DeepSpeed `TPLayerSpec`. Other HF partition types (`colwise_rep`, `local_colwise`, etc.) are not yet supported (documented with TODO). **Tests** (11 files, 17 CPU + 5 GPU tests) - `test_tp_plan_converter.py`: Unit tests for the converter (alternate prefixes, projection names, unsupported types, etc.) - `test_tp_plan_extraction.py`: Unit tests for `_get_hf_tp_plan` with mock models. - `test_tp_plan_e2e.py`: GPU e2e tests with ZeRO 0/1/2 (requires 2 GPUs). - `test_tp_plan_real_models.py`: GPU tests with Qwen2 and custom models (requires 2 GPUs). **Documentation** - Tutorial: New "HuggingFace tp_plan Support" section in `autotp-training.md`. - Config reference: Added tp_plan paragraph in `config-json.md`. - API docs: Added tp_plan subsection in `training.rst`. - Blog: Updated ongoing work in `blogs/huggingface-tp/README.md`. ## Limitations - Only `colwise` and `rowwise` partition types are supported. Extended types (`colwise_rep`, `local_colwise`, `local_rowwise`, `local_packed_rowwise`, `gather`, `sequence_parallel`) are deferred. 
--------- Signed-off-by: Guokai Ma <guokai.ma@intel.com> Signed-off-by: Ma, Guokai <guokai.ma@gmail.com> Co-authored-by: Olatunji Ruwase <tunji.ruwase@snowflake.com> Co-authored-by: Masahiro Tanaka <81312776+tohtana@users.noreply.github.com>Ma, Guokai · a240c4da · 2026-03-25
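The priority order and the `colwise`/`rowwise` limitation described in the tp_plan commit above can be sketched in plain Python. The function names, the example plan, and the mapping to the strings `"column"` and `"row"` are assumptions for illustration; the real `TPPlanConverter` produces `TPLayerSpec` objects whose fields are not shown in this report.

```python
# Assumed mapping from HF tp_plan styles to partition types (illustrative).
HF_TO_PARTITION_TYPE = {"colwise": "column", "rowwise": "row"}


def convert_tp_plan(tp_plan):
    """Split a tp_plan into supported entries and skipped (unsupported) ones."""
    converted, skipped = {}, []
    for pattern, style in tp_plan.items():
        if style in HF_TO_PARTITION_TYPE:
            converted[pattern] = HF_TO_PARTITION_TYPE[style]
        else:
            skipped.append(pattern)  # e.g. colwise_rep, local_colwise
    return converted, skipped


def choose_partition_source(partition_config, hf_tp_plan):
    """Priority from the commit: partition_config > HF tp_plan > heuristics."""
    if partition_config:
        return "partition_config"
    if hf_tp_plan:
        return "hf_tp_plan"
    return "autotp_heuristics"


# Hypothetical plan in the style of a HuggingFace base_model_tp_plan.
plan = {
    "layers.*.self_attn.q_proj": "colwise",
    "layers.*.self_attn.o_proj": "rowwise",
    "layers.*.mlp.gate_proj": "local_colwise",
}
converted, skipped = convert_tp_plan(plan)
print(choose_partition_source(None, plan), converted, skipped)
```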
- 0.5ETVImprove leaf module interface (enable via config, relax matching criteria, add document, etc.) (#7604) This PR improves the usability of the leaf module feature. Here are the changes: - Allow enabling the leaf module via both the DeepSpeed config and APIs. - Relax matching criteria to support class-based matching. - Support multiple ways of specifying the target module: class, class name (with or without package name), module name, or suffix. - Add documentation to the training guide, including config snippets and explanations of default behavior. - Add default classes (e.g., Mixtral, Qwen2/Qwen3) that automatically enable the leaf module feature. (Welcoming requests to add more classes) --------- Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com> Co-authored-by: Olatunji Ruwase <tunji.ruwase@snowflake.com>Masahiro Tanaka · 7d9a2f2b · 2025-10-03
- 0.5ETVFix gradient checkpointing with use_reentrant=True / PyTorch-style backward / ZeRO-3 (#7780) Fixes an issue where ZeRO-3 with gradient checkpointing (`use_reentrant=True`) and non-scalar backward (`tensor.backward(gradient=...)`) would fail on subsequent training iterations with `AttributeError: 'NoneType' object has no attribute 'numel'`. The root cause was stale parameters remaining in `ipg_buckets` between iterations due to the checkpoint's backward re-running forward. The fix clears these buckets in `_pre_step()` before each optimizer step. Added comprehensive tests for gradient checkpointing with ZeRO-3 covering both `use_reentrant=True` and `use_reentrant=False` modes. --------- Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com> Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com> Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>Masahiro Tanaka · 311674ff · 2026-01-17
- 0.5ETVAvoid CUDA reinit error in CI tests (#7977) Currently the CI full test shows a [CUDA reinit error](https://github.com/deepspeedai/DeepSpeed/actions/runs/24444633640/job/71417719445). This PR includes the following fixes: - Fix `compute_capability_args()` in JIT mode to read `TORCH_CUDA_ARCH_LIST` before calling `torch.cuda.get_device_capability()` and restores JIT builder state after `jit_load()`. It also adds regression tests for the explicit-arch, bad-fork, and restore paths. - Delay initialization of CUDA streams in DeepCompile After this fix, the full test [passed](https://github.com/deepspeedai/DeepSpeed/actions/runs/24508304055/job/71632434455) again. --------- Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>Masahiro Tanaka · 83d43c35 · 2026-04-17
- 0.4ETVFix bf16 gradient norm divergence with ZeRO stage 0 (#7839) Fixes: #7837 ZeRO-0 + bf16 has two bugs in `engine.py`: 1. `FP16_UnfusedOptimizer` applies `dynamic_loss_scale` with `cur_scale=65536` but `engine.backward()` never scales the loss, so `step()` divides gradients by 65536 2. `_take_model_step` skips `zero_grad` for bf16 without ZeRO, causing gradient accumulation. Fix: disable loss scaling for bf16 and remove the `zero_optimization()` gate on `zero_grad`. --------- Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>Masahiro Tanaka · 1752c2ab · 2026-02-12
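The first bug in the commit above is easy to see with plain arithmetic. The sketch below is not DeepSpeed code; it only models a backward pass that may or may not scale the loss, followed by an optimizer step that always divides by the loss scale.

```python
CUR_SCALE = 65536.0  # initial dynamic loss scale cited in the commit


def grad_seen_by_optimizer(true_grad, loss_was_scaled, cur_scale=CUR_SCALE):
    """Model of backward + unscale; returns the gradient the update uses."""
    # Backward: the gradient carries the scale only if the loss was scaled.
    raw_grad = true_grad * cur_scale if loss_was_scaled else true_grad
    # Step: the optimizer divides by the scale unconditionally.
    return raw_grad / cur_scale


true_grad = 0.5
consistent = grad_seen_by_optimizer(true_grad, loss_was_scaled=True)
buggy = grad_seen_by_optimizer(true_grad, loss_was_scaled=False)
print(consistent, buggy)
```

When the loss is never scaled, the gradient reaching the update is 65536 times too small, which is why the fix disables loss scaling for bf16 rather than adding scaling to the backward pass.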
- 0.3 ETV Training multiple models (#7018) Support training multiple models, such as in [HF](https://huggingface.co/docs/accelerate/en/usage_guides/deepspeed_multiple_model). Here is an update on supporting multiple DS engines with a single loss.backward(). The main message is that I think we can support this. First, some context. The backward pass in ZeRO is complicated because the optimizations/features require special handling of gradients, such as:
  1. Gradient partitioning
  2. Overlapping backward and reduction
  3. Upcasting for fp32 grad accumulation

  So, we created engine.backward(loss) as a wrapper function to provide us fine-grained control over backward, as below:
  ```python
  def backward(loss):
      backward_prologue()  # setup logic for special gradient handling
      loss.backward()
      backward_epilogue()  # cleanup/teardown logic
  ```
  As demonstrated by @muellerzr, this approach breaks down when the loss originates from multiple DS engines. Our proposed solution is to use backward hooks on the module to launch backward_prologue() and backward_epilogue(). Specifically,
  1. backward pre hook on engine.module to launch backward_prologue() before any module gradient is created.
  2. backward post hook on engine.module to launch backward_epilogue() after all module gradients are created.

  We plan for this solution to preserve BC, i.e., engine.backward() will remain correct for single engine scenarios. The current status is that (1) is completed, while (2) is in progress. To unblock e2e testing for multi-engine scenarios, since there are probably other issues, we have temporarily added engine._backward_prologue(). You can try this out via the following artifacts.
  1. Simple multi-engine test code: https://gist.github.com/tjruwase/f1adccf087b8fa269ffce2ab91c4f1c6#file-multi_engine-py
  2.
DS branch: https://github.com/microsoft/DeepSpeed/tree/olruwase/zero_multi_models --------- Signed-off-by: Olatunji Ruwase <olruwase@microsoft.com> Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com> Co-authored-by: Stas Bekman <stas00@users.noreply.github.com>Olatunji Ruwase · b418cf6c · 2025-03-11
- 0.3 ETV Low-precision master params/grads/optimizer states (#7700) The DeepSpeed optimizer always creates fp32 master params/gradients/optimizer states. However, we sometimes want to keep them in lower precision given [torch.autocast support](https://deepspeed.readthedocs.io/en/latest/training.html#mixed-precision-training). This PR allows lower-precision master params/grads/optimizer states with bf16/fp16 enabled. DeepSpeed currently accepts the `fp16_master_weights_and_grads` option under the `fp16` section (not documented) with ZeRO1/2. This PR extends this to bf16 and also ZeRO3. In the `bf16` section, we can have the new items `bf16_master_weights_and_grads` and `bf16_optimizer_states`. Similarly to `fp16_master_weights_and_grads`, `bf16_master_weights_and_grads` keeps master parameters in bf16. `bf16_optimizer_states` keeps optimizer states also in bf16. Here is an example configuration:
  ```json
  "bf16": {
    "enabled": true,
    "bf16_master_weights_and_grads": true,
    "bf16_optimizer_states": true
  }
  ```
  Note that `bf16_master_weights_and_grads==True` and `bf16_optimizer_states==False` is supported only with CPU offloading. Also, we don't have `fp16_optimizer_states` as it won't be practical. More details are described in [`config-json.md`](https://github.com/tohtana/DeepSpeed/blob/88e0bbdfba89c4712d815980ddb28353d6da5b2e/docs/_pages/config-json.md). Previously, `torch.autocast` support (the `torch_autocast` section in the config) was not compatible with `bf16`/`fp16` enabled, but we now accept the combination. This PR also adds some test cases for the configurations as well as the combination with `torch.autocast`. --------- Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com> Co-authored-by: Olatunji Ruwase <tunji.ruwase@snowflake.com> Masahiro Tanaka · 39a682d7 · 2025-12-04
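The constraint stated in the commit above (bf16 master weights without bf16 optimizer states only with CPU offloading) could be checked along these lines. The function is hypothetical, and reading "CPU offloading" as `zero_optimization.offload_optimizer.device == "cpu"` is an assumption of this sketch; DeepSpeed's own validation may differ.

```python
def check_bf16_section(ds_config):
    """Hypothetical check of the bf16 master-weight/optimizer-state rule."""
    bf16 = ds_config.get("bf16", {})
    master_bf16 = bf16.get("bf16_master_weights_and_grads", False)
    states_bf16 = bf16.get("bf16_optimizer_states", False)
    # Assumption: "CPU offloading" means the optimizer is offloaded to CPU.
    offload = ds_config.get("zero_optimization", {}).get("offload_optimizer", {})
    cpu_offload = offload.get("device") == "cpu"
    if master_bf16 and not states_bf16 and not cpu_offload:
        raise ValueError(
            "bf16_master_weights_and_grads=true with bf16_optimizer_states=false "
            "is only supported with CPU offloading"
        )
    return True


# Matches the example configuration from the commit message.
ok_config = {
    "bf16": {
        "enabled": True,
        "bf16_master_weights_and_grads": True,
        "bf16_optimizer_states": True,
    }
}
# Mixed setting without offload: rejected by this sketch.
bad_config = {"bf16": {"enabled": True, "bf16_master_weights_and_grads": True}}

print(check_bf16_section(ok_config))
try:
    check_bf16_section(bad_config)
    rejected = False
except ValueError:
    rejected = True
print(rejected)
```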
- 0.2ETVFix DeepCompile all-gather scheduler candidate selection (#8033) This PR fixes issues with the heuristic in DeepCompile's scheduler: - Fix a candidate-selection bug in `fast_free_schedule()`: the scheduler computed the zero-`free_acc_mem` candidate subset, but then sorted the full runnable set instead of that subset. - Keep the existing local scheduling heuristic, but rank candidates with graph-local all-gather pressure metrics before release-side cost when a low-live release path is available. - Add deterministic CPU-only FX scheduler regressions for the zero-free filter, pressure ordering, fallback candidate ordering, and single-all-gather ordering. ## Rationale `fast_free_schedule()` is a local heuristic for reducing gathered-parameter live ranges. This patch keeps that model, but fixes a general selection inconsistency: when at least one runnable candidate can reach release without additional all-gathers, the scheduler should choose from that zero-`free_acc_mem` subset. The previous code used the subset only as a branch condition, then ranked all runnable candidates by `free_cost`, so it could select a candidate that still required additional all-gathers before release. After preserving the zero-`free_acc_mem` filter, the ordering uses only workload-independent graph pressure signals already available to the scheduler: scheduled all-gather count, all-gather byte pressure, release-side cost, and a stable node-name tie breaker. In the fallback path, where every candidate still requires additional all-gathers, `free_acc_mem` remains the primary selector and the scheduler preserves the previous boundary of scheduling only through `schedule_until_ag`; this avoids making a memory-budget decision without tracking already-live gathered parameters. ## Testing - `python -m pytest tests/unit/compile/test_list_schedule.py -q` - `pre-commit run --all-files` --------- Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>Masahiro Tanaka · 819af0e5 · 2026-05-30
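The candidate-selection bug described above can be reproduced in a few lines. This is a simplified model, not the code of `fast_free_schedule()`: the `Candidate` type is hypothetical, and the ranking uses only `free_cost` and a name tie breaker, whereas the commit says the fixed scheduler also considers all-gather count and byte pressure.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    name: str
    free_acc_mem: int  # memory of extra all-gathers needed before release
    free_cost: int     # release-side cost


def pick_buggy(candidates):
    zero_free = [c for c in candidates if c.free_acc_mem == 0]
    if zero_free:
        # Bug: the subset is used only as a condition; the full set is ranked.
        return min(candidates, key=lambda c: (c.free_cost, c.name))
    return min(candidates, key=lambda c: (c.free_acc_mem, c.name))


def pick_fixed(candidates):
    zero_free = [c for c in candidates if c.free_acc_mem == 0]
    if zero_free:
        # Fix: rank only candidates that can release without more all-gathers.
        return min(zero_free, key=lambda c: (c.free_cost, c.name))
    # Fallback: every candidate still needs all-gathers.
    return min(candidates, key=lambda c: (c.free_acc_mem, c.name))


runnable = [
    Candidate("a", free_acc_mem=0, free_cost=5),
    Candidate("b", free_acc_mem=1024, free_cost=1),
]
print(pick_buggy(runnable).name, pick_fixed(runnable).name)
```

In this example the buggy version selects `b`, which still needs additional all-gathers before release, while the fixed version selects `a` from the zero-`free_acc_mem` subset.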
- 0.2ETVNormalize ZeRO-3 DeepCompile grad dtype before reduction (#8038) Some backward kernels produce gradients in their computation dtype, not necessarily in the parameter storage dtype. For example, if a backward path accumulates or promotes math in fp32, a parameter stored as bf16 can still receive an fp32 raw gradient from that backward computation. In normal PyTorch execution, that raw gradient reaches the leaf-gradient accumulation step, which stores it according to the tensor's expected grad dtype. ZeRO-3 DeepCompile intercepts the raw compiled-backward gradient before that leaf accumulation boundary. The reducer was assuming the raw gradient dtype was already the expected leaf grad dtype, so it could select an fp32 communication bucket even when the ZeRO grad partition storage was bf16. To address this, this PR changes `dc.reduce_grad`'s behavior to match PyTorch's leaf-gradient dtype contract. ZeRO-3 registration now records the expected grad dtype for each parameter, and `reduce_grad` normalizes raw compiled-backward gradients to that dtype before selecting the communication bucket. This follows the documented `grad_dtype` behavior, including preserving explicit `grad_dtype=None` opt-outs: https://docs.pytorch.org/docs/main/generated/torch.sparse.semi_structured.SparseSemiStructuredTensorCUSPARSELT.html#torch.sparse.semi_structured.SparseSemiStructuredTensorCUSPARSELT.grad_dtype Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>Masahiro Tanaka · 4f5fb834 · 2026-06-09
- 0.2ETVFix ZeRO stage 1 and add stage 2 support with DeepCompile (#7366) This PR fixes the behavior of DeepCompile's ZeRO stage 1 and adds stage 2 support. DeepCompile's ZeRO1 currently performs allreduce at every iteration even when it is not a gradient accumulation boundary. This significantly slows down the performance when gradient accumulation is enabled. This PR fixes this issue by performing allreduce only at the gradient accumulation boundary. As the current behavior is similar to ZeRO2, this PR also adds DeepCompile's ZeRO2 support. We can now set zero stage to 2 with DeepCompile. The loss values, performance, and memory usages were verified using this [verification tool](https://github.com/tohtana/ds_verify_loss) ([results](https://github.com/tohtana/ds_verify_loss/blob/main/results/results_20250617_035117/report.md)). --------- Signed-off-by: Masahiro Tanaka <mtanaka@microsoft.com> Co-authored-by: Stas Bekman <stas00@users.noreply.github.com> Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>Masahiro Tanaka · be8124c8 · 2025-06-27
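The cost of reducing at every micro-step, as described in the commit above, can be illustrated by counting allreduce calls. This is a toy model that assumes a boundary occurs every `gas` micro-steps; it does not reflect DeepCompile's actual graph passes.

```python
def count_allreduces(num_micro_steps, gas, only_at_boundary):
    """Count allreduce calls over a run of micro-steps.

    gas: gradient accumulation steps; a boundary is every gas-th micro-step.
    """
    count = 0
    for step in range(num_micro_steps):
        is_boundary = (step + 1) % gas == 0
        if is_boundary or not only_at_boundary:
            count += 1
    return count


before_fix = count_allreduces(num_micro_steps=8, gas=4, only_at_boundary=False)
after_fix = count_allreduces(num_micro_steps=8, gas=4, only_at_boundary=True)
print(before_fix, after_fix)
```

With 8 micro-steps and 4 accumulation steps, reducing only at the boundary cuts the number of allreduce calls from 8 to 2.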
- 0.2 ETV Restore real inputs for recompilation (#7356) This PR keeps some of the real inputs given to the custom backend for DeepCompile. DeepCompile expects that the custom backend at the TorchFX graph level is always called when recompilation happens. In some cases, however, only the ATen-level backend is called. As the ATen-level backend uses real inputs saved by the TorchFX-level backend, we need to keep the real inputs for recompilation. Currently we discard the real inputs after the ATen-level backend uses them, as the real inputs are often too large to keep in GPU memory. This causes an error in cases where recompilation only calls ATen-level backends, because we don't have a chance to record new real inputs in the TorchFX-level backend. This PR always keeps only tensor metadata and non-tensor data on CPU and materializes the tensors when needed (i.e. when recompilation happens and only ATen-level backends are called without real inputs). As we use dummy data to materialize tensors, this solution might still not work in every case, but it improves the coverage. The new module `InputStorage` keeps tensor metadata and non-tensor data for this purpose and materializes tensors. --------- Signed-off-by: Masahiro Tanaka <mtanaka@microsoft.com> Masahiro Tanaka · 6f1a1c04 · 2025-06-19
- 0.2 ETV Update references to new X/Twitter handle (#7110) As a part of joining the Linux Foundation AI&Data it makes sense to rename the X/Twitter accounts associated with DeepSpeed. --------- Signed-off-by: Logan Adams <loadams@microsoft.com> Logan Adams · c2c81993 · 2025-03-04