
OmniMorph Multi-Node XPU Training Progress

Active Jobs (2026-03-28)

Job 26457072: 8-node production training (RUNNING)

  • Date: 2026-03-28
  • Status: RUNNING on pvc-s-[41-42,118,125,129,136,140-141], 36h walltime
  • Script: bash_train_multi_nodes.sh (srun timeout reduced to 1.5h)
  • Config: all_om_net, img_size=128, batchsize=2, timesteps=80, lr=1e-5, condition_type=slice
  • Data: 14,814 diffusion + 2,583 registration (full real data)
  • Steps/epoch: 41 (14,814 samples ÷ (2 × 64) per step, further divided by DIFF_REG_BATCH_RATIO), ~108 s/step → ~1h14m per epoch
  • Changes from previous job:
    • Contrastive clip_grad_norm max_norm: 0.02 → 1e-3 (avoid contrastive dominating training)
    • Registration activation threshold: -0.6 → -0.7 (stricter gate)
    • dist.all_reduce → dist.broadcast(src=0) for NaN sync + registration gate (CCL hang workaround; see the sketch below)
    • srun timeout: 7200s → 5400s (1.5h, cutting per-epoch CCL-hang waste from 46 min to ~16 min)
    • Excluded pvc-s-135 (hardware failure: "No XPU devices are available")
  • Logs: Logs/train_multi_26457072.out
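
A minimal sketch of the broadcast workaround referenced in the changes list (the helper name and call site are illustrative, not the actual code in OM_train_3modes.py). The design trade-off: rank 0's view becomes authoritative, whereas the old all_reduce combined every rank's flag.

```python
import torch
import torch.distributed as dist

# Sketch: rank 0 decides whether to skip a NaN step (or run registration)
# and broadcasts that single decision, replacing the dist.all_reduce that
# was hanging in CCL. NaN flags seen only on non-zero ranks are ignored.
def broadcast_flag(local_flag: bool, device: torch.device) -> bool:
    flag = torch.tensor([int(local_flag)], dtype=torch.int64, device=device)
    dist.broadcast(flag, src=0)  # other ranks receive rank 0's value in place
    return bool(flag.item())
```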

CCL Epoch-Boundary Hang (ongoing issue)

  • Pattern: First epoch per srun completes fine (~1h14m). Second epoch always hangs at a CCL collective op.
  • Workaround: srun timeout kills the hung process, bash loop restarts with fresh CCL state. Effective rate: 1 epoch per ~1.5h.
  • Root cause: Likely CCL/Level Zero IPC handle leak or state corruption after ~200+ collective ops. Fresh process resets L0 context.
  • Impact: ~16min wasted per epoch (hang time before timeout), but training progresses reliably.
  • TODO: Investigate reinitializing CCL mid-training (destroy_process_group + init_process_group) to avoid the restart overhead.
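
One possible shape for that TODO, sketched under the assumption that the rank/world-size environment survives teardown (untested, which is exactly what the TODO proposes to verify):

```python
import datetime
import torch.distributed as dist

def reinit_process_group(rank: int, world_size: int) -> None:
    """Tear down and rebuild the CCL process group at an epoch boundary,
    hoping a fresh group resets the leaked/corrupted collective state
    without paying the full srun relaunch overhead."""
    dist.barrier()  # ensure no collective is in flight on any rank
    dist.destroy_process_group()
    dist.init_process_group(
        backend="ccl",
        rank=rank,
        world_size=world_size,
        timeout=datetime.timedelta(minutes=30),
    )
    # NOTE: an existing DDP wrapper still references the old group, so the
    # model would likely need re-wrapping in DDP after this call.
```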

Single-node CCL hang (unresolved)

  • Single-node job (26446050) hung at step 1 on explicit dist.all_reduce calls.
  • Changed to dist.broadcast(src=0) but never verified (cancelled when multi-node started).
  • Single-node epochs take ~10h (8 tiles), so the restart-per-epoch strategy doesn't work (2h timeout kills mid-epoch).
  • TODO: Test broadcast fix on single node; if still hangs, investigate CCL intra-node transport.

Checkpoint status

  • Latest: 000036_all_om_net.pth (Mar 28, 2.9 GB)
  • Total: epochs 3-36 in Models/all_om_net/ (85+ GB)
  • CUDA copies: epochs 0,1,2,8,11 in Models/all_om_net/cuda_ckpts/
  • Older model: epochs 0-10 in Models/all_recmulmodmutattnnet/ (pre-om_net migration)

Loss history (real data, 64 XPU tiles)

| Epoch | Ang | Dist | Regul | Contrastive | Regist (total) | imgsim | imgmse | ddf |
|-------|-----|------|-------|-------------|----------------|--------|--------|-----|
| 3-7 | -0.02 → -0.10 | 1.50 → 1.02 | – | 9.2e-4 | 0.0 | – | – | – |
| 8-11 (CUDA) | – | – | – | – | – | – | – | – |
| 31 | – | – | – | 6.9e-5 | -0.09 | -0.21 | 0.35 | 4.2e-4 |
| 35 | -0.50 | 0.85 | 1.3e-4 | 7.1e-5 | -0.10 | -0.26 | 0.32 | 1.4e-4 |

Note: Epochs 3-7 used UpsampleConv. Epochs 8-11 trained on CUDA with ConvTranspose3d. Epoch 12+ on XPU. Registration activates when ang < -0.7 (was -0.6 before epoch 37).


Previous Job 25899265: Production training (v3) - COMPLETED (historical)

  • Ran 100+ epochs with UpsampleConv on dummy data (only 100 samples loaded due to a node path issue)
  • Validated memory stability and CCL cache fix, but loss data not meaningful

Previous Strategy A - Job 25898957: CRASH-LOOPED (torch.load bug)

  • Ran 1 good iteration (epoch 5 step 31 → epoch 6 step 9), then crash-looped 63 times
  • Root cause: np.random.get_state() was saved in the checkpoint, putting numpy arrays in the pickle; torch.load with weights_only=True (the PyTorch 2.6 default) rejects numpy globals
  • Epoch 5 completed and saved. Epoch 6 reached step 9 before crash loop started.
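
Two standard ways out of the PyTorch 2.6 weights_only default, sketched with an illustrative checkpoint path (the exact numpy globals to allowlist depend on the numpy version, so treat that list as an assumption):

```python
import numpy as np
import torch

ckpt_path = "Models/all_om_net/000005_all_om_net.pth"  # illustrative path

# Option 1: allowlist the numpy globals that pickling np.random.get_state()
# drags into the checkpoint (exact names may differ across numpy versions).
torch.serialization.add_safe_globals(
    [np.ndarray, np.dtype, np.core.multiarray._reconstruct]
)
ckpt = torch.load(ckpt_path, weights_only=True)

# Option 2: our own checkpoints are trusted, so opt out of the safe loader.
ckpt = torch.load(ckpt_path, weights_only=False)
```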

Previous Strategy A - Job 25898957: Production training (v2, all audit fixes)

  • Date: 2026-03-23
  • Status: RUNNING at the time of this entry; epoch 6 in progress (later crash-looped, see above)
  • Script: bash_train_multi_nodes.sh
  • Config: Config/config_om.yaml (om_net, img_size=128, batchsize=2, device=xpu)
  • Resources: 8 nodes x 8 XPU tiles = 64 XPU tiles, 12 CPUs/task
  • Walltime: 36:00:00
  • Progress (as of 2026-03-23 02:45):
    • Completed epoch 5 (full, no restart needed), checkpoint 000005_all_om_net.pth saved
    • Epoch 6 step 2 in progress, memory stable at ~46 GiB free
    • Zero restarts triggered; the ~0.07 GiB/step leak rate allows full epochs without OOM
  • Fixes applied (10 bugs found across 4 independent audits):
    1. (Critical) Optimizer on all DDP ranks: all ranks load checkpoint
    2. (Critical) XPU device RNG saved/restored: torch.xpu.get_rng_state() in checkpoint
    3. (Critical) RNG restored after DataLoader skip: prevents __getitem__ corruption
    4. (Critical) loss_nan_step not overwritten: guarded with else branch
    5. (Critical) Off-by-one in step skip: step <= initial_step with initial_step > 0 guard
    6. (Low) tmp/ cleanup: DDP race + stale checkpoint fixes
    7. (Medium) Per-rank RNG divergence: non-rank-0 re-seeded (CPU + XPU device RNG)
    8. (Critical) Step 0 skipped on fresh start: initial_step > 0 guard added
    9. (Config) Timeout: SRUN_TIMEOUT 2400 → 7200
    10. (Low) total_reg division by zero: max(total_reg, 1) guard
  • Leak rate: ~0.07 GiB/step (ConvTranspose3d eliminated via UpsampleConv)
  • Logs: Logs/train_multi_25898957.out

Job 25899049: CANCELLED (checkpoint conflict)

  • Submitted as a continuation but was allocated nodes while 25898957 was still running; both would have written to the same Models/all_om_net/. Cancelled before training started.

Previous Strategy A - Job 25898349: CANCELLED (had 6 bugs)

  • Ran: 4 restart iterations, reached epoch 5 step 33
  • Issues: (1) timeout too short (killed healthy runs), (2) off-by-one re-trained steps 10/20/30, (3) optimizer not loaded on non-rank-0, (4) epoch stats not saved/restored, (5) RNG state not preserved, (6) loss_nan_step reset after restore
  • Checkpoints saved: 000004_all_om_net.pth (epoch 4), 000005_step0030_all_om_net.pth (epoch 5 step 30)
  • Memory: Stable at ~46-47 GiB free (leak only ~0.07 GiB/step)

Strategy B - Job 25916258: Full validation (v4, CCL cache fix)

  • Date: 2026-03-23
  • Status: RUNNING
  • Resources: 1 node x 8 XPU tiles, dummy data, no proactive restart, 2h walltime
  • Fix: CCL_ZE_CACHE_OPEN_IPC_HANDLES_THRESHOLD=10000 (the default of 1000 caused a driver segfault at ~400 steps)
  • Goal: Survive the full 2h walltime without a crash, validating both the UpsampleConv fix (no OOM) and the CCL cache fix (no segfault)
  • Logs: Logs/stratB_25916258.out

Previous Strategy B - Job 25899266 (v3, segfault at epoch 74)

  • 68 epochs, 403 steps, zero OOM. Memory stable 46-47 GiB.
  • Crashed at epoch 74 step 0: the CCL IPC handle cache hit its 1000-entry limit → driver segfault (drm_neo.cpp:288). Fixed with CCL_ZE_CACHE_OPEN_IPC_HANDLES_THRESHOLD=10000.
  • Logs: Logs/stratB_25899266.out

Previous Strategy B - Job 25898356 (crashed on logging bug)

  • Ran 6 steps, leak ~0.375 GiB/step. ZeroDivisionError on total_reg=0 (now fixed).

Previous Job 25892717 - CANCELLED (was hung)

  • Status: Cancelled after 5 hours. Crashed at epoch 4 step 26 (OOM at loss_contra.backward()). srun hung due to CCL processes not exiting, auto-resubmit never triggered. 4+ hours of walltime wasted idle.
  • Last checkpoint: Models/all_om_net/tmp/000004_step0020_all_om_net.pth
  • Lesson: --kill-on-bad-exit=1 is not reliable for CCL cleanup. Strategy A's timeout wrapper solves this.

Implementation Details

Strategy A: Proactive Restart (backup approach)

Problem: XPU autograd leaks ~1.78 GiB/step of device memory (Intel UR backend bug). Training crashes at ~step 26. The old auto-resubmit via sbatch failed because srun hangs after a CCL rank crash, so the bash script never reaches the resubmit logic.

Solution: Proactive exit + restart loop within the same SLURM allocation.

Changes to OM_train_3modes.py (training behavior unchanged):

  • Added EXIT_CODE_RESTART = 42 constant
  • Added --max-steps-before-restart N CLI argument (default 0 = disabled)
  • Added steps_since_start counter in the training loop
  • After N steps: saves a mid-epoch checkpoint → dist.barrier() → dist.destroy_process_group() → sys.exit(42)
  • SystemExit(42) is NOT caught by except Exception, so it propagates cleanly
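
A condensed sketch of the restart hook (the checkpoint save is elided; the constant and exit path follow the list above):

```python
import sys
import torch.distributed as dist

EXIT_CODE_RESTART = 42  # the bash loop treats this as "healthy, relaunch"

def maybe_restart(steps_since_start: int, max_steps_before_restart: int) -> None:
    """Exit with the restart code once N steps have run in this process.
    SystemExit subclasses BaseException, not Exception, so a blanket
    `except Exception` around the training loop does not swallow it."""
    if max_steps_before_restart <= 0:  # 0 = feature disabled
        return
    if steps_since_start < max_steps_before_restart:
        return
    # (mid-epoch checkpoint save elided here)
    dist.barrier()                # all ranks reach the exit together
    dist.destroy_process_group()  # clean CCL teardown before exiting
    sys.exit(EXIT_CODE_RESTART)
```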

Changes to bash_train_multi_nodes.sh:

  • Replaced single srun call with a while loop (up to 500 iterations)
  • Each srun wrapped with timeout 2400 (40 min) to catch CCL hangs
  • Exit code handling:
    • 0 → training complete, break
    • 42 → proactive restart, 5s pause, re-launch
    • 124 → timeout (CCL hang), 10s pause, re-launch
    • Other → crash (OOM etc.), 10s pause, re-launch from checkpoint
  • Passes --max-steps-before-restart 20 to Python

Training behavior: Identical to original. Two independent audits verified that:

  • Model weights, optimizer state (Adam momentum/variance), and RNG states (CPU + XPU + numpy + python) are all saved and restored
  • Off-by-one fixed: step <= initial_step correctly skips the checkpointed step
  • RNG restored AFTER DataLoader skip loop (not before) to avoid __getitem__ corruption
  • loss_nan_step, total_reg, and all 9 epoch loss accumulators preserved across restarts
  • All DDP ranks load optimizer state (not just rank 0)
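
How the off-by-one and RNG-ordering points fit together, as an illustrative sketch (the `rng_state` dict and `train_one_step` are hypothetical; the real logic lives in OM_train_3modes.py):

```python
import random
import numpy as np
import torch

def resume_epoch(loader, initial_step: int, rng_state: dict | None) -> None:
    """Illustrative resume loop, not the production code."""
    for step, batch in enumerate(loader):
        # Fixes 5 and 8: skip only when actually resuming (initial_step > 0),
        # and use <= so the checkpointed step itself is not re-trained.
        if initial_step > 0 and step <= initial_step:
            continue  # each skipped __getitem__ may still draw from the RNG
        if rng_state is not None:
            # Fix 3: restore RNG *after* the skip loop; restoring earlier would
            # let the skipped __getitem__ calls advance the RNG past the save.
            torch.set_rng_state(rng_state["cpu"])
            torch.xpu.set_rng_state(rng_state["xpu"])
            np.random.set_state(rng_state["numpy"])
            random.setstate(rng_state["python"])
            rng_state = None
        train_one_step(batch)  # hypothetical helper
```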

Strategy B: Leak Rate Diagnostic

Problem: Need to know whether existing mitigations (gradient checkpointing, UpsampleConv, empty_cache) reduce the XPU leak rate, or if the leak is fundamental to all backward ops.

Approach: Run training WITHOUT proactive restart, measure how many steps survive before OOM. Compare with historical ~26 steps.

Script: bash_train_stratB.sh (1-node, dummy data, no checkpoint saves, --max-steps-before-restart 0).

Training loop structure (both strategies use the same code):

  1. Diffusion: forward → backward → step (NO gradient clipping)
  2. Contrastive: forward → backward → clip(max_norm=0.02) → step
  3. Registration: forward → backward → clip(max_norm=0.1) → step

gc.collect() + synchronize() + empty_cache() run between the phases.
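
The same structure as minimal code (loss and optimizer names are illustrative; whether the three phases share one optimizer is not specified here, and the clip values are the pre-26457072 ones from the list above):

```python
import gc
import torch

def free_xpu() -> None:
    """The between-phase cleanup described above."""
    gc.collect()
    torch.xpu.synchronize()
    torch.xpu.empty_cache()

def three_phase_step(model, opt, loss_diff, loss_contra, loss_reg) -> None:
    # Phase 1: diffusion, deliberately unclipped
    opt.zero_grad()
    loss_diff.backward()
    opt.step()
    free_xpu()

    # Phase 2: contrastive, clipped hard (0.02 here; 1e-3 as of job 26457072)
    opt.zero_grad()
    loss_contra.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=0.02)
    opt.step()
    free_xpu()

    # Phase 3: registration, clipped
    opt.zero_grad()
    loss_reg.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=0.1)
    opt.step()
    free_xpu()
```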

Earlier attempted fix: UpsampleConv (ConvTranspose3d replacement)

  • ConvTranspose3d backward was identified as leaking ~0.33 GiB/step per layer on XPU
  • Replaced with UpsampleConv (F.interpolate + Conv) in Diffusion/networks.py for OM_net
  • Result (as measured at the time): negligible impact; leak rate ~1.78 GiB/step before and after
  • Conclusion (since overturned, see the per-operation analysis below): ConvTranspose3d was NOT the primary leak source; the leak looked fundamental to XPU autograd
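
A minimal version of the replacement idea (the real class in Diffusion/networks.py may differ in kernel size, padding, and interpolation mode; this is a sketch of the F.interpolate + Conv pattern):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class UpsampleConv(nn.Module):
    """Stand-in for ConvTranspose3d(stride=2): upsample with the leak-free
    F.interpolate, then mix channels with a plain Conv3d."""
    def __init__(self, in_ch: int, out_ch: int, scale: int = 2):
        super().__init__()
        self.scale = scale
        self.conv = nn.Conv3d(in_ch, out_ch, kernel_size=3, padding=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = F.interpolate(x, scale_factor=self.scale, mode="trilinear",
                          align_corners=False)
        return self.conv(x)
```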

Strategy B: Per-Operation XPU Leak Analysis (job 25893155)

Ran tests/diagnose_xpu_leak_ops.py on 1 XPU tile to isolate which ops leak. Each test runs 20 forward+backward iterations and measures device_free drift.
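
The measurement idea, roughly (the actual harness is tests/diagnose_xpu_leak_ops.py; this sketch assumes the op under test is wrapped as a module and a single tile is visible):

```python
import torch

def leak_rate_gib_per_step(module: torch.nn.Module, x: torch.Tensor,
                           steps: int = 20) -> float:
    """Run forward+backward `steps` times and report the average
    device_free drift, mirroring the per-op diagnostic."""
    x.requires_grad_(True)
    free_before, _ = torch.xpu.mem_get_info(x.device)
    for _ in range(steps):
        module(x).sum().backward()
        x.grad = None               # drop the gradient each iteration
        torch.xpu.synchronize()     # rule out deferred ops
        torch.xpu.empty_cache()     # rule out the caching allocator
    free_after, _ = torch.xpu.mem_get_info(x.device)
    return (free_before - free_after) / steps / 2**30
```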

| Operation | Leak rate (GiB/step) | Pattern | Verdict |
|-----------|----------------------|---------|---------|
| ConvTranspose3d (256→3, 8³→128³) | 0.335 | Linear, persistent | LEAKS (primary source) |
| Full OM_net (rec_num=2, 128³) | 1.15 | Linear, OOM at step 17 | LEAKS (aggregated) |
| Stacked Conv3d encoder (1→256, 128³→8³) | 0.013 | One-time alloc | OK (initial alloc only) |
| F.grid_sample (128³) | 0.007 | One-time alloc | OK |
| Conv3d (16→32, 64³) | 0.004 | One-time alloc | OK |
| MultiheadAttention (512 tokens) | 0.002 | One-time alloc | OK |
| Adam optimizer only | 0.0005 | One-time alloc | OK |
| F.interpolate trilinear (32³→256³) | 0.000 | No leak | ZERO LEAK |
| UpsampleConv (256→3, 8³→128³) | Not tested (env error) | – | Expected zero (uses F.interpolate) |

Key findings:

  1. ConvTranspose3d backward is the dominant leaker: 0.335 GiB/step, linear and persistent (6.7 GiB lost over 20 steps). With 5 decoder layers in the old network, this alone accounts for ~1.7 GiB/step.
  2. F.interpolate has ZERO leak, confirming that UpsampleConv (which uses F.interpolate + Conv) is the correct fix.
  3. All other ops show only one-time allocations (no linear drift): Conv3d, grid_sample, attention, and Adam are all clean.
  4. Full OM_net leaks 1.15 GiB/step, consistent with ConvTranspose3d × 5 layers plus minor contributions from other ops.

Component-level diagnostic (job 25826494):

  • Forward only (no backward): ZERO leak; 62.75 GiB stable for 20 steps
  • Forward + backward: 1.12 GiB/step; 62.97 → 40.53 GiB over 20 steps
  • Confirms the leak is in the autograd backward pass, specifically the ConvTranspose3d backward kernel.

Why the leak rate dropped in recent runs (jobs 25898349/25898957): The UpsampleConv fix in OM_net replaced all 5 decoder ConvTranspose3d layers with F.interpolate+Conv. This eliminated the primary leak source. The remaining ~0.07 GiB/step is from minor one-time allocations that stabilize quickly.

Why the old runs (25892717) still leaked 1.78 GiB/step: The diagnostic diagnose_xpu_leak_ops.py test used the OLD RecMulModMutAttnNet with ConvTranspose3d. The OM_net class (used in production training) already had UpsampleConv. The earlier measurement of "negligible impact" was incorrect: the UpsampleConv fix DID work, but the comparison was confounded by different node conditions. The diagnostic data now confirms the fix is effective.

Issue 15: CCL IPC Handle Cache Segfault (~400 DDP Steps)

  • Job: 25899266 (Strategy B v3)
  • Symptom: GPU segfault (drm_neo.cpp:288) after ~403 DDP all-reduce steps. Memory was healthy (47 GiB free).
  • Root cause: oneCCL's IPC memory handle cache has a default limit of 1000 entries. After ~400 steps of DDP all-reduce, the cache fills. Handle eviction triggers a use-after-free in the Intel compute-runtime driver.
  • Warning before crash: CCL_WARN: mem handle cache limit is reached: mem_handle_cache size: 1000, limit: 1000
  • Fix: export CCL_ZE_CACHE_OPEN_IPC_HANDLES_THRESHOLD=10000 in SLURM scripts. Also added to Strategy A's bash_train_multi_nodes.sh.
  • Note: Strategy A's proactive restart naturally avoids this (process resets before 400 steps), but the fix is still needed for long-running single-process jobs.
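
Since it is an environment variable, the fix has to be in place before CCL initializes; setting it defensively from Python as well as from the SLURM scripts is a cheap guard (illustrative placement):

```python
import os

# Must be set before dist.init_process_group("ccl") brings CCL up;
# the SLURM scripts export it, and this mirrors that as a safety net.
os.environ.setdefault("CCL_ZE_CACHE_OPEN_IPC_HANDLES_THRESHOLD", "10000")
```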

Scripts Directory

| File | Purpose |
|------|---------|
| bash_train_multi_nodes.sh | Strategy A: 8-node production training with proactive restart loop |
| bash_train_stratB.sh | Strategy B: 1-node leak rate diagnostic (no restart, dummy data) |
| bash_infer.sh | Inference / augmentation SLURM job |
| bash_diagnose_leak.sh | Submits tests/diagnose_xpu_leak.py to diagnose per-component leak |
| bash_diagnose_ops.sh | Submits tests/diagnose_xpu_leak_ops.py for per-operation leak analysis |
| bash_verify_fix.sh | Compares ConvTranspose3d vs UpsampleConv leak rates (failed due to env issue) |
| bash_compare_opt.sh | Speed comparison: optimized vs original 3-mode training |
| bash_compare_orig.sh | Speed comparison: original 3-mode training baseline |
| tests/diagnose_xpu_leak.py | Component-level leak test (network, DeformDDPM, DDP) |
| tests/diagnose_xpu_leak_ops.py | Operation-level leak test (Conv3d, grid_sample, attention, ConvTranspose3d, UpsampleConv) |
| tests/test_3modes_opt_equivalence.py | Verifies optimized training matches original |
| tests/test_mslncc.py | MSLNCC loss function unit tests |
| tests/compare_3modes_speed.py | Speed benchmark for 3-mode training variants |

Issues Resolved (2026-03-22)

14. XPU Autograd Engine Memory Leak - ~1.0 GiB/step (ROOT CAUSE IDENTIFIED)

  • Jobs: All XPU training jobs; diagnosed in 25826494
  • Symptom: device_free (via torch.xpu.mem_get_info) decreases linearly at ~1.0 GiB/step. The memory is outside PyTorch's caching allocator, so it is not tracked by memory_allocated / memory_reserved.
  • Root cause: PyTorch XPU autograd engine bug. The loss.backward() call leaks device memory on every invocation. Confirmed by isolated diagnostic (tests/diagnose_xpu_leak.py):
    • Test 1 (forward only, no backward): NO LEAK; device_free perfectly stable at 62.75 GiB over 20 steps
    • Test 2a (forward + backward, no optimizer): LEAK of 1.0 GiB/step (62.97 → 42.98 GiB over 20 steps)
    • Test 2b (forward + backward + optimizer.step): LEAK of 1.1 GiB/step (slightly worse)
  • NOT caused by: CCL all-reduce (no_sync() showed identical leak rate), DDP (leak occurs without DDP), garbage collection (gc.collect() had no effect), caching allocator (empty_cache() had no effect), deferred ops (synchronize() had no effect)
  • Why it works on CUDA: CUDA autograd engine does not have this leak. The issue is specific to the Intel XPU backend (Level Zero / SYCL runtime).
  • Workaround applied:
    1. Gradient checkpointing (3 encoder levels in OM_net) reduces peak memory from 43 → 26 GiB, buying ~26 steps before OOM
    2. Mid-epoch checkpoints every 10 steps to tmp/ subfolder
    3. Auto-resubmitting SLURM job restarts training from last checkpoint with fresh memory (leak resets)
  • Upstream: Should be reported to intel/torch-xpu-ops with tests/diagnose_xpu_leak.py as minimal reproduction

13. Pre-allocation Approach - Wrong Direction

  • Jobs: 25799043 (92%), 25823021 (78%)
  • Finding: Pre-allocating device memory into PyTorch's caching allocator reduced the headroom left for the autograd leak to consume, causing EARLIER crashes: 92% pre-alloc → OOM at step 3, 78% → step 10, none → step 15.
  • Resolution: Removed all pre-allocation. The 70% allocator cap is sufficient when gradient checkpointing reduces peak to 26 GiB (well under the 44.8 GiB cap).

12. Contrastive Backward OOM - Diffusion Tensors Not Freed

  • Jobs: 25823021, 25823710
  • Finding: del pre_dvf_I, dvf_I, trm_pred was placed AFTER the contrastive step. During loss_contra.backward(), diffusion output tensors were still alive, pushing peak above the limit.
  • Fix: Moved del + empty_cache() to BEFORE the contrastive step. Also save loss_gen_a.item() before deleting, since it is needed for the registration decision (see the sketch below).
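
The reordering in sketch form (tensor names come from the bullet above; the surrounding phases are elided):

```python
import torch

# Before the contrastive backward, not after it:
loss_gen_a_val = loss_gen_a.item()  # keep the scalar for the registration gate
del pre_dvf_I, dvf_I, trm_pred      # free the diffusion outputs first
torch.xpu.empty_cache()

# Now the contrastive backward runs without the diffusion tensors alive,
# keeping the peak under the allocator cap.
loss_contra.backward()
```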

Issues Resolved (2026-03-20 to 2026-03-21)

11. DDP Collective Hangs and Registration Desync

  • Jobs: 25699197, 25709947 (hung), 25704670, 25706470
  • Symptoms: Job hangs after step 1 (log files stop growing), or an "Expected to have finished reduction" error
  • Root cause 1: OOM try/except guards with continue skip the DDP-synchronized backward pass, causing other ranks to wait forever at all-reduce. OOM guards are fundamentally incompatible with DDP.
  • Root cause 2: The registration conditional block (loss_gen_a.item() < -0.6) differs per rank: some ranks call Deformddpm(...) for registration while others skip it, causing DDP desync.
  • Fixes applied:
    1. Removed all OOM try/except guards: let OOM crash the job and rely on checkpoint auto-resume
    2. DDP(..., find_unused_parameters=True): handles detached recovery iterations and conditional registration
    3. dist.all_reduce(regist_flag, op=ReduceOp.MIN): all 64 ranks collectively decide whether to run registration, which only runs when ALL ranks agree (see the sketch below)
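
Fix 3 in sketch form (variable names follow the bullet; -0.6 is the threshold used at the time, and job 26457072 later swapped this all_reduce for a broadcast as a CCL-hang workaround):

```python
import torch
import torch.distributed as dist

# Every rank votes, then MIN makes the decision unanimous: the flag stays 1
# only if all 64 ranks want registration, so no rank enters the Deformddpm
# registration branch alone.
regist_flag = torch.tensor(
    [1 if loss_gen_a.item() < -0.6 else 0], device=device
)
dist.all_reduce(regist_flag, op=dist.ReduceOp.MIN)
if regist_flag.item() == 1:
    run_registration()  # hypothetical helper; all ranks enter together
```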

10. XPU OOM - Allocator 70% Memory Cap

  • Jobs: 25530886 through 25780909
  • Error: UR_RESULT_ERROR_OUT_OF_RESOURCES at step 12-14 of each epoch
  • Root cause: XPU caching allocator caps reserved memory at ~70% of device (44.8/64 GiB). Known Intel bug (torch-xpu-ops#1543). Works on 4x 48GB CUDA GPUs because CUDA allocator uses nearly all device memory.
  • Key diagnostic (job 25780909): Memory logging showed alloc/reserved perfectly stable at 9.84/44.82 GiB across all steps (no fragmentation, no creep). The OOM is purely a peak spike during forward/backward exceeding the 44.8 GiB cap.
  • Resolution: Gradient checkpointing reduced peak from 43 → 26 GiB, well within the 44.8 GiB cap. Pre-allocation is no longer needed.
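
Gradient checkpointing in generic form (which three encoder levels are wrapped is project-specific; the function below is an illustrative pattern, not the OM_net code):

```python
import torch
from torch.utils.checkpoint import checkpoint

def encode_with_checkpointing(levels, x: torch.Tensor) -> torch.Tensor:
    """Recompute each level's activations during backward instead of
    storing them, trading compute for peak memory (43 -> 26 GiB here)."""
    for level in levels:
        x = checkpoint(level, x, use_reentrant=False)
    return x
```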

Issues Resolved (2026-03-19 to 2026-03-20)

1. torchrun Permission Denied

  • Fix: Switched to python -m torch.distributed.run, then later to direct srun launch

2. GPUS_PER_NODE Mismatch

  • Fix: --nodes=8 --ntasks-per-node=8 for 64 total XPU tiles (4 cards x 2 tiles/card)

3. .to(rank) Sends to CUDA Not XPU

  • Fix: Changed to .to(f"{DEVICE_TYPE}:{rank}")

4. No DistributedSampler

  • Fix: Added DistributedSampler for both dataloaders + set_epoch() per epoch
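
The standard pattern behind this fix (dataset and loader names are illustrative):

```python
from torch.utils.data import DataLoader
from torch.utils.data.distributed import DistributedSampler

sampler = DistributedSampler(dataset)  # shards the dataset across all ranks
loader = DataLoader(dataset, batch_size=2, sampler=sampler)

for epoch in range(num_epochs):
    sampler.set_epoch(epoch)  # reshuffles the shard assignment each epoch
    for batch in loader:
        ...
```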

5. CCL Backend Not Found

  • Fix: dn-mo1 rebuilt the conda env with compatible packages

6. MPI/PMI Init Failure

  • Fix: Switched from torchrun to direct srun --ntasks-per-node=8 with SLURM env var mapping

7. CCL Worker Thread Startup Failure

  • Fix: Increased --cpus-per-task=12 + CCL_WORKER_AFFINITY=auto

8. gloo Backend Incompatible with XPU

  • Fix: Must use ccl backend for XPU DDP

9. Print Spam from All Ranks

  • Fix: Guarded prints with gpu_id == 0

Working Configuration Summary

```bash
# SLURM
--nodes=8 --gres=gpu:4 --ntasks-per-node=8 --cpus-per-task=12

# Environment
I_MPI_PMI_LIBRARY=/usr/local/software/slurm/current-rhel8/lib/libpmi2.so
I_MPI_HYDRA_BOOTSTRAP=slurm
CCL_WORKER_AFFINITY=auto
CCL_ZE_CACHE_OPEN_IPC_HANDLES_THRESHOLD=10000   # Issue 15 fix (IPC handle cache)
PYTORCH_ALLOC_CONF=expandable_segments:True,max_split_size_mb:512

# Launch
srun --kill-on-bad-exit=1 bash -c 'LOCAL_RANK=$SLURM_LOCALID RANK=$SLURM_PROCID WORLD_SIZE=$SLURM_NTASKS python ...'

# DDP (Python)
backend = "ccl"
DDP(model, device_ids=[rank], find_unused_parameters=True)

# XPU autograd leak workaround
# - Gradient checkpointing on 3 encoder levels (OM_net.use_checkpoint = True)
# - Mid-epoch checkpoint every 10 steps to Models/.../tmp/
# - SLURM script auto-resubmits on crash (sbatch at end of script)
# - No pre-allocation (gradient checkpointing keeps peak under 70% cap)
```

Previous Test Jobs

| Job ID | Date | Nodes | Status | Issue |
|--------|------|-------|--------|-------|
| 25379684 | 03-18 | 16 | FAILED | torchrun permission denied |
| 25416001 | 03-19 | 8 | FAILED | ccl backend not found (SYCL mismatch) |
| 25433261 | 03-19 | 8 | FAILED | xccl not built in |
| 25518305 | 03-19 | 8 | FAILED | MPI PMI_Init failure |
| 25521705 | 03-20 | 8 | FAILED | CCL OFI transport init failure |
| 25522164 | 03-20 | 8 | FAILED | CCL worker startup failure (3 CPUs) |
| 25522734 | 03-20 | 8 | FAILED | CCL worker=0 segfault |
| 25522979 | 03-20 | 1 | FAILED | gloo + XPU incompatible |
| 25523654 | 03-20 | 1 | FAILED | gloo + XPU (no device_ids) still fails |
| 25525830 | 03-20 | 1 | SUCCESS | First working run: ccl + 12 CPUs/task |
| 25528754 | 03-20 | 8 | FAILED | Superseded by 25530886 |
| 25530886 | 03-20 | 8 | FAILED | XPU OOM at step 12/41 (original code, no workarounds) |
| 25635451 | 03-20 | 8 | FAILED | empty_cache regression, OOM step 1 |
| 25678499 | 03-21 | 8 | FAILED | OOM step 15, variable cleanup added |
| 25696461 | 03-21 | 8 | FAILED | Epoch 2 reached but forward OOM killed rank 61 |
| 25699197 | 03-21 | 8 | HUNG | OOM try/except broke DDP |
| 25704670 | 03-21 | 8 | FAILED | find_unused_parameters error at registration |
| 25706470 | 03-21 | 8 | FAILED | OOM guard broke DDP reducer state |
| 25709947 | 03-21 | 8 | HUNG | Registration conditional desync |
| 25763882 | 03-21 | 8 | FAILED | all_reduce(MIN) sync: no hang! OOM step 14. |
| 25780909 | 03-21 | 8 | FAILED | Confirmed 70% allocator cap (9.84/44.82 GiB stable) |
| 25799043 | 03-21 | 8 | FAILED | 92% pre-alloc (59 GiB): OOM step 3, WORSE |
| 25823021 | 03-21 | 8 | FAILED | 78% pre-alloc: diffusion OK (43.3 GiB), contra OOM step 10 |
| 25823544 | 03-21 | 8 | FAILED | del tensors before contra: UnboundLocalError bug |
| 25823710 | 03-21 | 8 | FAILED | Fixed bug; OOM step 10 again, ~1.3 GiB/step device leak confirmed |
| 25824128 | 03-21 | 8 | FAILED | No pre-alloc + empty_cache; device_free monitoring confirms 1.3 GiB/step leak |
| 25825585 | 03-22 | 8 | FAILED | no_sync(): same leak rate; NOT from all-reduce |
| 25825861 | 03-22 | 8 | FAILED | gc.collect + sync + empty_cache: no effect on leak |
| 25826494 | 03-22 | 1 | DIAG | Root cause found: fwd = no leak, bwd = 1.0 GiB/step leak. XPU autograd bug. |
| 25832610 | 03-22 | 8 | PARTIAL | Grad checkpoint works! Peak 43 → 22 GiB. Epoch 3 completed. Retry loop hung (srun won't exit). |
| 25853940 | 03-22 | 8 | PARTIAL | Resumed from step 25; epoch 3 completed + epoch 4 started. Epoch 4 mid-epoch saved at steps 10, 20. |
| 25867855 | 03-22 | 8 | HUNG | Epoch 4 reached step 26. Mid-epoch saved at 10, 20. srun hung after crash (no kill-on-bad-exit). |
| 25892717 | 03-22 | 8 | HUNG → CANCELLED | Crashed at epoch 4 step 26 (OOM contra_bwd). srun hung 4+ hrs, auto-resubmit never triggered. |
| 25898349 | 03-22 | 8 | CANCELLED | (Strat A v1) Epoch 5 step 33. Leak 0.07 GiB/step. Had 6 bugs (off-by-one, optimizer, RNG, etc.). |
| 25898356 | 03-22 | 1 | CRASHED | (Strat B v1) Leak 0.375 GiB/step (dummy data). ZeroDivisionError at epoch end (bug fixed). |
| 25899114 | 03-23 | 1 | CANCELLED | (Strat B v2) Full leak validation (no restart, dummy data, all fixes; goal: survive 2h/80+ steps). Cancelled with 25898957 for a code fix. |
| 25898957 | 03-23 | 8 | CRASH-LOOPED | (Strat A v2) Epoch 6 step 9, then crash-loop × 63. torch.load rejects numpy RNG in checkpoint. |
| 25899049 | 03-23 | 8 | CANCELLED | Checkpoint conflict: ran simultaneously with 25898957. |
| 25899265 | 03-23 | 8 | COMPLETED | (Strat A v3) VALIDATED: 100+ epochs, 7 restarts, zero OOM. Memory stable 45-47 GiB. 6h runtime. |
| 25899266 | 03-23 | 1 | COMPLETED | (Strat B v3) 68 epochs, 403 steps, zero OOM. GPU segfault at epoch 74 (CCL IPC cache limit). |
| 25916258 | 03-23 | 1 | RUNNING | (Strat B v4) CCL cache fix (10000 handles). Goal: survive full 2h walltime. |