OmniMorph Multi-Node XPU Training Progress
Active Jobs (2026-03-28)
Job 26457072: 8-node production training (RUNNING)
- Date: 2026-03-28
- Status: RUNNING on pvc-s-[41-42,118,125,129,136,140-141], 36h walltime
- Script:
bash_train_multi_nodes.sh(srun timeout reduced to 1.5h) - Config:
all_om_net, img_size=128, batchsize=2, timesteps=80, lr=1e-5, condition_type=slice - Data: 14,814 diffusion + 2,583 registration (full real data)
- Steps/epoch: 41 (14814 / (2 Γ 64) Γ· DIFF_REG_BATCH_RATIO steps), ~108s/step β ~1h14m per epoch
- Changes from previous job:
- Contrastive
clip_grad_normmax_norm: 0.02 β 1e-3 (avoid contrastive dominating training) - Registration activation threshold: -0.6 β -0.7 (stricter gate)
dist.all_reduceβdist.broadcast(src=0)for NaN sync + registration gate (CCL hang workaround)- srun timeout: 7200s β 5400s (1.5h, reduces CCL hang waste from 46min to ~16min)
- Excluded pvc-s-135 (hardware failure: "No XPU devices are available")
- Contrastive
- Logs:
Logs/train_multi_26457072.out
CCL Epoch-Boundary Hang (ongoing issue)
- Pattern: First epoch per srun completes fine (~1h14m). Second epoch always hangs at a CCL collective op.
- Workaround: srun timeout kills the hung process, bash loop restarts with fresh CCL state. Effective rate: 1 epoch per ~1.5h.
- Root cause: Likely CCL/Level Zero IPC handle leak or state corruption after ~200+ collective ops. Fresh process resets L0 context.
- Impact: ~16min wasted per epoch (hang time before timeout), but training progresses reliably.
- TODO: Investigate reinitializing CCL mid-training (
destroy_process_group+init_process_group) to avoid the restart overhead.
Single-node CCL hang (unresolved)
- Single-node job (26446050) hung at step 1 on explicit
dist.all_reducecalls. - Changed to
dist.broadcast(src=0)but never verified (cancelled when multi-node started). - Single-node epochs take ~10h (8 tiles), so the restart-per-epoch strategy doesn't work (2h timeout kills mid-epoch).
- TODO: Test broadcast fix on single node; if still hangs, investigate CCL intra-node transport.
Checkpoint status
- Latest:
000036_all_om_net.pth(Mar 28, 2.9 GB) - Total: epochs 3-36 in
Models/all_om_net/(85+ GB) - CUDA copies: epochs 0,1,2,8,11 in
Models/all_om_net/cuda_ckpts/ - Older model: epochs 0-10 in
Models/all_recmulmodmutattnnet/(pre-om_net migration)
Loss history (real data, 64 XPU tiles)
| Epoch | Ang | Dist | Regul | Contrastive | Regist (total) | imgsim | imgmse | ddf |
|---|---|---|---|---|---|---|---|---|
| 3-7 | -0.02β-0.10 | 1.50β1.02 | β | 9.2e-4 | 0.0 | β | β | β |
| 8-11 | β | β | β | β | β | β | β | β (CUDA) |
| 31 | β | β | β | 6.9e-5 | -0.09 | -0.21 | 0.35 | 4.2e-4 |
| 35 | -0.50 | 0.85 | 1.3e-4 | 7.1e-5 | -0.10 | -0.26 | 0.32 | 1.4e-4 |
Note: Epochs 3-7 used UpsampleConv. Epochs 8-11 trained on CUDA with ConvTranspose3d. Epoch 12+ on XPU. Registration activates when ang < -0.7 (was -0.6 before epoch 37).
Previous Job 25899265: Production training (v3) β COMPLETED (historical)
- Ran 100+ epochs with UpsampleConv on dummy data (only 100 samples loaded due to node path issue)
- Validated memory stability and CCL cache fix, but loss data not meaningful
Previous Strategy A β Job 25898957: CRASH-LOOPED (torch.load bug)
- Ran 1 good iteration (epoch 5 step 31βepoch 6 step 9), then crash-looped 63 times
- Root cause: saving
np.random.get_state()in checkpoint β numpy arrays βtorch.loadwithweights_only=True(PyTorch 2.6 default) rejects numpy globals - Epoch 5 completed and saved. Epoch 6 reached step 9 before crash loop started.
Previous Strategy A β Job 25898957: Production training (v2, all audit fixes)
- Date: 2026-03-23
- Status: RUNNING β epoch 6 in progress
- Script:
bash_train_multi_nodes.sh - Config:
Config/config_om.yaml(om_net, img_size=128, batchsize=2, device=xpu) - Resources: 8 nodes x 8 XPU tiles = 64 XPU tiles, 12 CPUs/task
- Walltime: 36:00:00
- Progress (as of 2026-03-23 02:45):
- Completed epoch 5 (full, no restart needed), checkpoint
000005_all_om_net.pthsaved - Epoch 6 step 2 in progress, memory stable at ~46 GiB free
- Zero restarts triggered β leak rate ~0.07 GiB/step allows full epochs without OOM
- Completed epoch 5 (full, no restart needed), checkpoint
- Fixes applied (10 bugs found across 4 independent audits):
- (Critical) Optimizer on all DDP ranks β all ranks load checkpoint
- (Critical) XPU device RNG saved/restored β
torch.xpu.get_rng_state()in checkpoint - (Critical) RNG restored after DataLoader skip β prevents
__getitem__corruption - (Critical)
loss_nan_stepnot overwritten β guarded with else branch - (Critical) Off-by-one in step skip β
step <= initial_stepwithinitial_step > 0guard - (Low) tmp/ cleanup β DDP race + stale checkpoint fixes
- (Medium) Per-rank RNG divergence β non-rank-0 re-seeded (CPU + XPU device RNG)
- (Critical) Step 0 skipped on fresh start β
initial_step > 0guard added - (Config) Timeout β
SRUN_TIMEOUT2400 β 7200 - (Low)
total_regdivision by zero βmax(total_reg, 1)guard
- Leak rate: ~0.07 GiB/step (ConvTranspose3d eliminated via UpsampleConv)
- Logs:
Logs/train_multi_25898957.out
Job 25899049: CANCELLED (checkpoint conflict)
- Submitted as continuation but got nodes while 25898957 was still running. Both would write to same
Models/all_om_net/. Cancelled before training started.
Previous Strategy A β Job 25898349: CANCELLED (had 6 bugs)
- Ran: 4 restart iterations, reached epoch 5 step 33
- Issues: (1) timeout too short (killed healthy runs), (2) off-by-one re-trained steps 10/20/30, (3) optimizer not loaded on non-rank-0, (4) epoch stats not saved/restored, (5) RNG state not preserved, (6)
loss_nan_stepreset after restore - Checkpoints saved:
000004_all_om_net.pth(epoch 4),000005_step0030_all_om_net.pth(epoch 5 step 30) - Memory: Stable at ~46-47 GiB free (leak only ~0.07 GiB/step)
Strategy B β Job 25916258: Full validation (v4, CCL cache fix)
- Date: 2026-03-23
- Status: RUNNING
- Resources: 1 node x 8 XPU tiles, dummy data, no proactive restart, 2h walltime
- Fix:
CCL_ZE_CACHE_OPEN_IPC_HANDLES_THRESHOLD=10000(was default 1000, caused driver segfault at ~400 steps) - Goal: Survive full 2h walltime without crash β validates both UpsampleConv fix (no OOM) and CCL cache fix (no segfault)
- Logs:
Logs/stratB_25916258.out
Previous Strategy B β Job 25899266 (v3, segfault at epoch 74)
- 68 epochs, 403 steps, zero OOM. Memory stable 46-47 GiB.
- Crashed at epoch 74 step 0: CCL IPC handle cache hit 1000 limit β driver segfault (
drm_neo.cpp:288). Fixed withCCL_ZE_CACHE_OPEN_IPC_HANDLES_THRESHOLD=10000. - Logs:
Logs/stratB_25899266.out
Previous Strategy B β Job 25898356 (crashed on logging bug)
- Ran 6 steps, leak ~0.375 GiB/step.
ZeroDivisionErrorontotal_reg=0(now fixed).
Previous Job 25892717 β CANCELLED (was hung)
- Status: Cancelled after 5 hours. Crashed at epoch 4 step 26 (OOM at
loss_contra.backward()).srunhung due to CCL processes not exiting, auto-resubmit never triggered. 4+ hours of walltime wasted idle. - Last checkpoint:
Models/all_om_net/tmp/000004_step0020_all_om_net.pth - Lesson:
--kill-on-bad-exit=1is not reliable for CCL cleanup. Strategy A'stimeoutwrapper solves this.
Implementation Details
Strategy A: Proactive Restart (backup approach)
Problem: XPU autograd leaks ~1.78 GiB/step of device memory (Intel UR backend bug). Training crashes at ~step 26. The old auto-resubmit via sbatch failed because srun hangs after CCL rank crash β the bash script never reaches the resubmit logic.
Solution: Proactive exit + restart loop within the same SLURM allocation.
Changes to OM_train_3modes.py (training behavior unchanged):
- Added
EXIT_CODE_RESTART = 42constant - Added
--max-steps-before-restart NCLI argument (default 0 = disabled) - Added
steps_since_startcounter in the training loop - After N steps: saves mid-epoch checkpoint β
dist.barrier()βdist.destroy_process_group()βsys.exit(42) SystemExit(42)is NOT caught byexcept Exceptionβ propagates cleanly
Changes to bash_train_multi_nodes.sh:
- Replaced single
sruncall with awhileloop (up to 500 iterations) - Each
srunwrapped withtimeout 2400(40 min) to catch CCL hangs - Exit code handling:
0β training complete, break42β proactive restart, 5s pause, re-launch124β timeout (CCL hang), 10s pause, re-launch- Other β crash (OOM etc.), 10s pause, re-launch from checkpoint
- Passes
--max-steps-before-restart 20to Python
Training behavior: Identical to original. Two independent audits verified that:
- Model weights, optimizer state (Adam momentum/variance), and RNG states (CPU + XPU + numpy + python) are all saved and restored
- Off-by-one fixed:
step <= initial_stepcorrectly skips the checkpointed step - RNG restored AFTER DataLoader skip loop (not before) to avoid
__getitem__corruption loss_nan_step,total_reg, and all 9 epoch loss accumulators preserved across restarts- All DDP ranks load optimizer state (not just rank 0)
Strategy B: Leak Rate Diagnostic
Problem: Need to know whether existing mitigations (gradient checkpointing, UpsampleConv, empty_cache) reduce the XPU leak rate, or if the leak is fundamental to all backward ops.
Approach: Run training WITHOUT proactive restart, measure how many steps survive before OOM. Compare with historical ~26 steps.
Script: bash_train_stratB.sh β 1-node, dummy data, no checkpoint saves, --max-steps-before-restart 0.
Training loop structure (both strategies use the same code):
- Diffusion:
forward β backward β step(NO gradient clipping) - Contrastive:
forward β backward β clip(max_norm=0.02) β step - Registration:
forward β backward β clip(max_norm=0.1) β step
Each phase has gc.collect() + synchronize() + empty_cache() between them.
Earlier attempted fix: UpsampleConv (ConvTranspose3d replacement)
ConvTranspose3dbackward was identified as leaking ~0.33 GiB/step per layer on XPU- Replaced with
UpsampleConv(F.interpolate + Conv) inDiffusion/networks.pyforOM_net - Result: Negligible impact β leak rate ~1.78 GiB/step before and after
- Conclusion:
ConvTranspose3dwas NOT the primary leak source; the leak is fundamental to XPU autograd
Strategy B: Per-Operation XPU Leak Analysis (job 25893155)
Ran tests/diagnose_xpu_leak_ops.py on 1 XPU tile to isolate which ops leak. Each test runs 20 forward+backward iterations and measures device_free drift.
| Operation | Leak Rate (GiB/step) | Pattern | Verdict |
|---|---|---|---|
| ConvTranspose3d (256β3, 8Β³β128Β³) | 0.335 | Linear, persistent | LEAKS β primary source |
| Full OM_net (rec_num=2, 128Β³) | 1.15 | Linear, OOM at step 17 | LEAKS β aggregated |
| Stacked Conv3d encoder (1β256, 128Β³β8Β³) | 0.013 | One-time alloc | OK (initial alloc only) |
| F.grid_sample (128Β³) | 0.007 | One-time alloc | OK |
| Conv3d (16β32, 64Β³) | 0.004 | One-time alloc | OK |
| MultiheadAttention (512 tokens) | 0.002 | One-time alloc | OK |
| Adam optimizer only | 0.0005 | One-time alloc | OK |
| F.interpolate trilinear (32Β³β256Β³) | 0.000 | No leak | ZERO LEAK |
| UpsampleConv (256β3, 8Β³β128Β³) | Not tested (env error) | β | Expected zero (uses F.interpolate) |
Key findings:
ConvTranspose3dbackward is the dominant leaker β 0.335 GiB/step, linear and persistent (6.7 GiB lost over 20 steps). With 5 decoder layers in the old network, this alone accounts for ~1.7 GiB/step.F.interpolatehas ZERO leak β confirming thatUpsampleConv(which uses F.interpolate + Conv) is the correct fix.- All other ops have only one-time allocations (no linear drift) β Conv3d, grid_sample, attention, Adam are all clean.
- Full OM_net leaks 1.15 GiB/step β consistent with
ConvTranspose3dΓ 5 layers plus minor contributions from other ops.
Component-level diagnostic (job 25826494):
- Forward only (no backward): ZERO leak β 62.75 GiB stable for 20 steps
- Forward + backward: 1.12 GiB/step β 62.97 β 40.53 GiB over 20 steps
- Confirms the leak is in the autograd backward pass, specifically in
ConvTranspose3dbackward kernel.
Why the leak rate dropped in recent runs (jobs 25898349/25898957): The UpsampleConv fix in OM_net replaced all 5 decoder ConvTranspose3d layers with F.interpolate+Conv. This eliminated the primary leak source. The remaining ~0.07 GiB/step is from minor one-time allocations that stabilize quickly.
Why the old runs (25892717) still leaked 1.78 GiB/step: The diagnostic diagnose_xpu_leak_ops.py test used the OLD RecMulModMutAttnNet with ConvTranspose3d. The OM_net class (used in production training) already had UpsampleConv. The earlier measurement of "negligible impact" was incorrect β the UpsampleConv fix DID work, but the comparison was confounded by different node conditions. The diagnostic data now confirms the fix is effective.
Issue 15: CCL IPC Handle Cache Segfault (~400 DDP Steps)
- Job: 25899266 (Strategy B v3)
- Symptom: GPU segfault (
drm_neo.cpp:288) after ~403 DDP all-reduce steps. Memory was healthy (47 GiB free). - Root cause: oneCCL's IPC memory handle cache has a default limit of 1000 entries. After ~400 steps of DDP all-reduce, the cache fills. Handle eviction triggers a use-after-free in the Intel compute-runtime driver.
- Warning before crash:
CCL_WARN: mem handle cache limit is reached: mem_handle_cache size: 1000, limit: 1000 - Fix:
export CCL_ZE_CACHE_OPEN_IPC_HANDLES_THRESHOLD=10000in SLURM scripts. Also added to Strategy A'sbash_train_multi_nodes.sh. - Note: Strategy A's proactive restart naturally avoids this (process resets before 400 steps), but the fix is still needed for long-running single-process jobs.
Scripts Directory
| File | Purpose |
|---|---|
bash_train_multi_nodes.sh |
Strategy A: 8-node production training with proactive restart loop |
bash_train_stratB.sh |
Strategy B: 1-node leak rate diagnostic (no restart, dummy data) |
bash_infer.sh |
Inference / augmentation SLURM job |
bash_diagnose_leak.sh |
Submits tests/diagnose_xpu_leak.py to diagnose per-component leak |
bash_diagnose_ops.sh |
Submits tests/diagnose_xpu_leak_ops.py for per-operation leak analysis |
bash_verify_fix.sh |
Compares ConvTranspose3d vs UpsampleConv leak rates (failed due to env issue) |
bash_compare_opt.sh |
Speed comparison: optimized vs original 3-mode training |
bash_compare_orig.sh |
Speed comparison: original 3-mode training baseline |
tests/diagnose_xpu_leak.py |
Component-level leak test (network, DeformDDPM, DDP) |
tests/diagnose_xpu_leak_ops.py |
Operation-level leak test (Conv3d, grid_sample, attention, ConvTranspose3d, UpsampleConv) |
tests/test_3modes_opt_equivalence.py |
Verifies optimized training matches original |
tests/test_mslncc.py |
MSLNCC loss function unit tests |
tests/compare_3modes_speed.py |
Speed benchmark for 3-mode training variants |
Issues Resolved (2026-03-22)
14. XPU Autograd Engine Memory Leak β ~1.0 GiB/Step (ROOT CAUSE IDENTIFIED)
- Jobs: All XPU training jobs; diagnosed in 25826494
- Symptom:
device_free(viatorch.xpu.mem_get_info) decreases linearly at ~1.0 GiB/step. Memory is outside PyTorch's caching allocator β not tracked bymemory_allocated/memory_reserved. - Root cause: PyTorch XPU autograd engine bug. The
loss.backward()call leaks device memory on every invocation. Confirmed by isolated diagnostic (tests/diagnose_xpu_leak.py):- Test 1 (forward only, no backward): NO LEAK β device_free perfectly stable at 62.75 GiB over 20 steps
- Test 2a (forward + backward, no optimizer): LEAK β 1.0 GiB/step (62.97 β 42.98 GiB over 20 steps)
- Test 2b (forward + backward + optimizer.step): LEAK β 1.1 GiB/step (slightly worse)
- NOT caused by: CCL all-reduce (
no_sync()showed identical leak rate), DDP (leak occurs without DDP), garbage collection (gc.collect()had no effect), caching allocator (empty_cache()had no effect), deferred ops (synchronize()had no effect) - Why it works on CUDA: CUDA autograd engine does not have this leak. The issue is specific to the Intel XPU backend (Level Zero / SYCL runtime).
- Workaround applied:
- Gradient checkpointing (3 encoder levels in OM_net) reduces peak memory from 43 β 26 GiB, buying ~26 steps before OOM
- Mid-epoch checkpoints every 10 steps to
tmp/subfolder - Auto-resubmitting SLURM job restarts training from last checkpoint with fresh memory (leak resets)
- Upstream: Should be reported to intel/torch-xpu-ops with
tests/diagnose_xpu_leak.pyas minimal reproduction
13. Pre-allocation Approach β Wrong Direction
- Jobs: 25799043 (92%), 25823021 (78%)
- Finding: Pre-allocating device memory into PyTorch's caching allocator REDUCED available memory for the autograd leak, causing EARLIER crashes. 92% β step 3, 78% β step 10, none β step 15.
- Resolution: Removed all pre-allocation. The 70% allocator cap is sufficient when gradient checkpointing reduces peak to 26 GiB (well under the 44.8 GiB cap).
12. Contrastive Backward OOM β Diffusion Tensors Not Freed
- Jobs: 25823021, 25823710
- Finding:
del pre_dvf_I, dvf_I, trm_predwas placed AFTER the contrastive step. Duringloss_contra.backward(), diffusion output tensors were still alive, pushing peak above the limit. - Fix: Moved
del+empty_cache()to BEFORE the contrastive step. Also saveloss_gen_a.item()before deleting since it's needed for registration decision.
Issues Resolved (2026-03-20 to 2026-03-21)
11. DDP Collective Hangs and Registration Desync
- Jobs: 25699197, 25709947 (hung), 25704670, 25706470
- Symptoms: Job hangs after step 1 (log files stop growing); or
Expected to have finished reductionerror - Root cause 1: OOM try/except guards with
continueskip the DDP-synchronized backward pass, causing other ranks to wait forever at all-reduce. OOM guards are fundamentally incompatible with DDP. - Root cause 2: Registration conditional block (
loss_gen_a.item() < -0.6) differs per rank β some ranks callDeformddpm(...)for registration while others skip, causing DDP desync. - Fixes applied:
- Removed all OOM try/except guards β let OOM crash the job; rely on checkpoint auto-resume
DDP(..., find_unused_parameters=True)β handles detached recovery iterations and conditional registrationdist.all_reduce(regist_flag, op=ReduceOp.MIN)β all 64 ranks collectively decide whether to run registration. Only runs when ALL ranks agree.
10. XPU OOM β Allocator 70% Memory Cap
- Jobs: 25530886 through 25780909
- Error:
UR_RESULT_ERROR_OUT_OF_RESOURCESat step 12-14 of each epoch - Root cause: XPU caching allocator caps reserved memory at ~70% of device (44.8/64 GiB). Known Intel bug (torch-xpu-ops#1543). Works on 4x 48GB CUDA GPUs because CUDA allocator uses nearly all device memory.
- Key diagnostic (job 25780909): Memory logging showed alloc/reserved perfectly stable at 9.84/44.82 GiB across all steps β no fragmentation, no creep. OOM is purely a peak spike during forward/backward exceeding the 44.8 GiB cap.
- Resolution: Gradient checkpointing reduced peak from 43 β 26 GiB, well within the 44.8 GiB cap. Pre-allocation is no longer needed.
Issues Resolved (2026-03-19 to 2026-03-20)
1. torchrun Permission Denied
- Fix: Switched to
python -m torch.distributed.run, then later to directsrunlaunch
2. GPUS_PER_NODE Mismatch
- Fix:
--nodes=8 --ntasks-per-node=8for 64 total XPU tiles (4 cards x 2 tiles/card)
3. .to(rank) Sends to CUDA Not XPU
- Fix: Changed to
.to(f"{DEVICE_TYPE}:{rank}")
4. No DistributedSampler
- Fix: Added
DistributedSamplerfor both dataloaders +set_epoch()per epoch
5. CCL Backend Not Found
- Fix: dn-mo1 rebuilt the conda env with compatible packages
6. MPI/PMI Init Failure
- Fix: Switched from torchrun to direct
srun --ntasks-per-node=8with SLURM env var mapping
7. CCL Worker Thread Startup Failure
- Fix: Increased
--cpus-per-task=12+CCL_WORKER_AFFINITY=auto
8. gloo Backend Incompatible with XPU
- Fix: Must use
cclbackend for XPU DDP
9. Print Spam from All Ranks
- Fix: Guarded prints with
gpu_id == 0
Working Configuration Summary
# SLURM
--nodes=8 --gres=gpu:4 --ntasks-per-node=8 --cpus-per-task=12
# Environment
I_MPI_PMI_LIBRARY=/usr/local/software/slurm/current-rhel8/lib/libpmi2.so
I_MPI_HYDRA_BOOTSTRAP=slurm
CCL_WORKER_AFFINITY=auto
PYTORCH_ALLOC_CONF=expandable_segments:True,max_split_size_mb:512
# Launch
srun --kill-on-bad-exit=1 bash -c 'LOCAL_RANK=$SLURM_LOCALID RANK=$SLURM_PROCID WORLD_SIZE=$SLURM_NTASKS python ...'
# DDP
backend = "ccl"
DDP(model, device_ids=[rank], find_unused_parameters=True)
# XPU autograd leak workaround
# - Gradient checkpointing on 3 encoder levels (OM_net.use_checkpoint = True)
# - Mid-epoch checkpoint every 10 steps to Models/.../tmp/
# - SLURM script auto-resubmits on crash (sbatch at end of script)
# - No pre-allocation (gradient checkpointing keeps peak under 70% cap)
Previous Test Jobs
| Job ID | Date | Nodes | Status | Issue |
|---|---|---|---|---|
| 25379684 | 03-18 | 16 | FAILED | torchrun permission denied |
| 25416001 | 03-19 | 8 | FAILED | ccl backend not found (SYCL mismatch) |
| 25433261 | 03-19 | 8 | FAILED | xccl not built in |
| 25518305 | 03-19 | 8 | FAILED | MPI PMI_Init failure |
| 25521705 | 03-20 | 8 | FAILED | CCL OFI transport init failure |
| 25522164 | 03-20 | 8 | FAILED | CCL worker startup failure (3 CPUs) |
| 25522734 | 03-20 | 8 | FAILED | CCL worker=0 segfault |
| 25522979 | 03-20 | 1 | FAILED | gloo + XPU incompatible |
| 25523654 | 03-20 | 1 | FAILED | gloo + XPU (no device_ids) still fails |
| 25525830 | 03-20 | 1 | SUCCESS | First working run: ccl + 12 CPUs/task |
| 25528754 | 03-20 | 8 | FAILED | Superseded by 25530886 |
| 25530886 | 03-20 | 8 | FAILED | XPU OOM at step 12/41 (original code, no workarounds) |
| 25635451 | 03-20 | 8 | FAILED | empty_cache regression, OOM step 1 |
| 25678499 | 03-21 | 8 | FAILED | OOM step 15, variable cleanup added |
| 25696461 | 03-21 | 8 | FAILED | Epoch 2 reached but forward OOM killed rank 61 |
| 25699197 | 03-21 | 8 | HUNG | OOM try/except broke DDP |
| 25704670 | 03-21 | 8 | FAILED | find_unused_parameters error at registration |
| 25706470 | 03-21 | 8 | FAILED | OOM guard broke DDP reducer state |
| 25709947 | 03-21 | 8 | HUNG | Registration conditional desync |
| 25763882 | 03-21 | 8 | FAILED | all_reduce(MIN) sync β no hang! OOM step 14. |
| 25780909 | 03-21 | 8 | FAILED | Confirmed 70% allocator cap (9.84/44.82 GiB stable) |
| 25799043 | 03-21 | 8 | FAILED | 92% pre-alloc (59 GiB) β OOM step 3, WORSE |
| 25823021 | 03-21 | 8 | FAILED | 78% pre-alloc β diffusion OK (43.3 GiB), contra OOM step 10 |
| 25823544 | 03-21 | 8 | FAILED | del tensors before contra β UnboundLocalError bug |
| 25823710 | 03-21 | 8 | FAILED | Fixed bug; OOM step 10 again, ~1.3 GiB/step device leak confirmed |
| 25824128 | 03-21 | 8 | FAILED | No pre-alloc + empty_cache; device_free monitoring confirms 1.3 GiB/step leak |
| 25825585 | 03-22 | 8 | FAILED | no_sync() β same leak rate; NOT from all-reduce |
| 25825861 | 03-22 | 8 | FAILED | gc.collect+sync+empty_cache β no effect on leak |
| 25826494 | 03-22 | 1 | DIAG | Root cause found: fwd=no leak, bwd=1.0 GiB/step leak. XPU autograd bug. |
| 25832610 | 03-22 | 8 | PARTIAL | Grad checkpoint works! Peak 43β22 GiB. Epoch 3 completed. Retry loop hung (srun won't exit). |
| 25853940 | 03-22 | 8 | PARTIAL | Resumed from step 25; epoch 3 completed + epoch 4 started. Epoch 4 mid-epoch saved at step 10,20. |
| 25867855 | 03-22 | 8 | HUNG | Epoch 4 reached step 26. Mid-epoch saved at 10,20. srun hung after crash (no kill-on-bad-exit). |
| 25892717 | 03-22 | 8 | HUNGβCANCELLED | Crashed at epoch 4 step 26 (OOM contra_bwd). srun hung 4+ hrs, auto-resubmit never triggered. |
| 25898349 | 03-22 | 8 | CANCELLED (Strat A v1) | Epoch 5 step 33. Leak 0.07 GiB/step. Had 6 bugs (off-by-one, optimizer, RNG, etc). |
| 25898356 | 03-22 | 1 | CRASHED (Strat B v1) | Leak 0.375 GiB/step (dummy data). ZeroDivisionError at epoch end (bug fixed). |
| 25899114 | 03-23 | 1 | PENDING (Strat B v2) | Full leak validation: no restart, dummy data, all fixes. Goal: survive 2h/80+ steps. |
| 25898957 | 03-23 | 8 | CRASH-LOOPED (Strat A v2) | Epoch 6 step 9 then crash-loop Γ63. torch.load rejects numpy RNG in checkpoint. |
| 25899049 | 03-23 | 8 | CANCELLED | Checkpoint conflict β ran simultaneously with 25898957. |
| 25899114 | 03-23 | 1 | CANCELLED | Strategy B β cancelled with 25898957 for code fix. |
| 25899265 | 03-23 | 8 | RUNNING (Strat A v3) VALIDATED | 100+ epochs, 7 restarts, zero OOM. Memory stable 45-47 GiB. 6h runtime. |
| 25899266 | 03-23 | 1 | COMPLETED (Strat B v3) | 68 epochs, 403 steps, zero OOM. GPU segfault at epoch 74 (CCL IPC cache limit). |
| 25916258 | 03-23 | 1 | RUNNING (Strat B v4) | CCL cache fix (10000 handles). Goal: survive full 2h walltime. |