Situation
1. I brought over code that runs fine on another server with 2x H100 GPUs and ran it unchanged, but it fails with RuntimeError: CUDA error: an illegal memory access was encountered.
2. The code trains an LM on the 2 GPUs using DeepSpeed ZeRO-3 (a rough sketch of the setup follows below).
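For reference, a minimal sketch of the kind of ZeRO-3 setup involved, assuming deepspeed.initialize with an inline stage-3 config; the model, batch size, and precision values below are placeholders, not the actual training script:

import torch
import deepspeed

# Illustrative ZeRO-3 config; the real run uses its own values.
ds_config = {
    "train_micro_batch_size_per_gpu": 1,
    "zero_optimization": {"stage": 3},   # ZeRO-3: parameters partitioned across the 2 GPUs
    "bf16": {"enabled": True},
}

model = torch.nn.Linear(4096, 4096)      # stand-in for the actual LM

# deepspeed.initialize wraps the model in a ZeRO-3 engine; parameter partitioning
# (deepspeed/runtime/zero/partition_parameters.py in the traceback below) happens here.
engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config=ds_config,
)

The traceback below points at exactly this partitioning step (partition_param) on rank 1.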
Actions taken
1. Unified the NVIDIA driver (570) and all Python library versions (including torch) across both servers.
2. Set the following environment variables:
# 1. Disable NVLink (P2P) → route GPU-to-GPU communication over PCIe (key step)
export NCCL_P2P_DISABLE=1
# 2. Disable InfiniBand → use TCP/Ethernet instead
export NCCL_IB_DISABLE=1
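For completeness, the same variables can also be pinned inside the training script itself, together with the debug flags the error output suggests, so every rank is guaranteed to see them. A small sketch (not part of the original run); it only helps if it executes before torch is imported and CUDA is initialized:

import os

os.environ.setdefault("NCCL_P2P_DISABLE", "1")      # route GPU<->GPU traffic over PCIe instead of NVLink
os.environ.setdefault("NCCL_IB_DISABLE", "1")       # fall back to TCP/Ethernet instead of InfiniBand
os.environ.setdefault("NCCL_DEBUG", "INFO")         # verbose NCCL logging to see which transport gets picked
os.environ.setdefault("CUDA_LAUNCH_BLOCKING", "1")  # synchronous launches -> stack trace points at the real failing op

import torch  # noqa: E402  (imported after the env vars so the flags take effect)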
Result
The exact same error occurs as before these changes.
On a single GPU it runs fine without any of these changes (a standalone NCCL check is sketched below).
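To separate the training code from the communication layer, a minimal two-GPU NCCL check like the following could be run. This is a sketch added for context, not something from the original report; the file name nccl_sanity.py is made up. Launch with: torchrun --nproc_per_node=2 nccl_sanity.py

import os
import torch
import torch.distributed as dist

def main():
    # torchrun supplies RANK / WORLD_SIZE / LOCAL_RANK via the environment
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # One all_reduce and one broadcast cover the collectives ZeRO-3 leans on
    x = torch.ones(1024, device=f"cuda:{local_rank}") * (dist.get_rank() + 1)
    dist.all_reduce(x)
    dist.broadcast(x, src=0)
    torch.cuda.synchronize()

    print(f"rank {dist.get_rank()}: ok, element value = {x[0].item()}")
    dist.destroy_process_group()

if __name__ == "__main__":
    main()

If this already crashes with the same illegal memory access, the problem likely sits below DeepSpeed (NCCL, driver, or hardware); if it passes, the issue is more likely in the ZeRO-3 path itself.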
Part of the error message:
[rank1]: ret_val = func(*args, **kwargs)
[rank1]: ^^^^^^^^^^^^^^^^^^^^^
[rank1]: File "/home/user/miniconda3/envs/miqa/lib/python3.12/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 1729, in partition_param
[rank1]: param.ds_tensor.copy_(src_tensor)
[rank1]: RuntimeError: CUDA error: an illegal memory access was encountered
[rank1]: CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
[rank1]: For debugging consider passing CUDA_LAUNCH_BLOCKING=1
[rank1]: Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
[rank1]:[E113 16:16:14.842117488 ProcessGroupNCCL.cpp:1595] [PG ID 0 PG GUID 0(default_pg) Rank 1] Process group watchdog thread terminated with exception: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
Exception raised from c10_cuda_check_implementation at ../c10/cuda/CUDAException.cpp:43 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7ec0a9ab9446 in /home/user/miniconda3/envs/miqa/lib/python3.12/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7ec0a9a636e4 in /home/user/miniconda3/envs/miqa/lib/python3.12/site-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x7ec0a9ba5a18 in /home/user/miniconda3/envs/miqa/lib/python3.12/site-packages/torch/lib/libc10_cuda.so)
frame #3: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x56 (0x7ec05f5c7726 in /home/user/miniconda3/envs/miqa/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #4: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0xa0 (0x7ec05f5cc3f0 in /home/user/miniconda3/envs/miqa/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #5: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1da (0x7ec05f5d3b5a in /home/user/miniconda3/envs/miqa/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #6: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x7ec05f5d561d in /home/user/miniconda3/envs/miqa/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #7: + 0x145c0 (0x7ec0a9ef05c0 in /home/user/miniconda3/envs/miqa/lib/python3.12/site-packages/torch/lib/libtorch.so)
frame #8: + 0x94ac3 (0x7ec0aaa94ac3 in /lib/x86_64-linux-gnu/libc.so.6)
frame #9: + 0x1268c0 (0x7ec0aab268c0 in /lib/x86_64-linux-gnu/libc.so.6)
terminate called after throwing an instance of 'c10::DistBackendError'
what(): [PG ID 0 PG GUID 0(default_pg) Rank 1] Process group watchdog thread terminated with exception: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
Exception raised from c10_cuda_check_implementation at ../c10/cuda/CUDAException.cpp:43 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7ec0a9ab9446 in /home/user/miniconda3/envs/miqa/lib/python3.12/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7ec0a9a636e4 in /home/user/miniconda3/envs/miqa/lib/python3.12/site-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x7ec0a9ba5a18 in /home/user/miniconda3/envs/miqa/lib/python3.12/site-packages/torch/lib/libc10_cuda.so)
frame #3: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x56 (0x7ec05f5c7726 in /home/user/miniconda3/envs/miqa/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #4: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0xa0 (0x7ec05f5cc3f0 in /home/user/miniconda3/envs/miqa/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #5: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1da (0x7ec05f5d3b5a in /home/user/miniconda3/envs/miqa/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #6: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x7ec05f5d561d in /home/user/miniconda3/envs/miqa/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #7: + 0x145c0 (0x7ec0a9ef05c0 in /home/user/miniconda3/envs/miqa/lib/python3.12/site-packages/torch/lib/libtorch.so)
frame #8: + 0x94ac3 (0x7ec0aaa94ac3 in /lib/x86_64-linux-gnu/libc.so.6)
frame #9: + 0x1268c0 (0x7ec0aab268c0 in /lib/x86_64-linux-gnu/libc.so.6)
Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1601 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7ec0a9ab9446 in /home/user/miniconda3/envs/miqa/lib/python3.12/site-packages/torch/lib/libc10.so)
frame #1: + 0xe4271b (0x7ec05f24271b in /home/user/miniconda3/envs/miqa/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #2: + 0x145c0 (0x7ec0a9ef05c0 in /home/user/miniconda3/envs/miqa/lib/python3.12/site-packages/torch/lib/libtorch.so)
frame #3: + 0x94ac3 (0x7ec0aaa94ac3 in /lib/x86_64-linux-gnu/libc.so.6)
frame #4: + 0x1268c0 (0x7ec0aab268c0 in /lib/x86_64-linux-gnu/libc.so.6)
W0113 16:16:16.576000 34792 site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 34815 closing signal SIGTERM
E0113 16:16:19.499000 34792 site-packages/torch/distributed/elastic/multiprocessing/api.py:869] failed (exitcode: -6) local_rank: 1 (pid: 34816) of binary: /home/user/miniconda3/envs/miqa/bin/python3.12
Traceback (most recent call last):
File "/home/user/miniconda3/envs/miqa/bin/torchrun", line 7, in <module>
sys.exit(main())
^^^^^^
File "/home/user/miniconda3/envs/miqa/lib/python3.12/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 355, in wrapper
return f(*args, **kwargs)
^^^^^^^^^^^^^^^^^^
File "/home/user/miniconda3/envs/miqa/lib/python3.12/site-packages/torch/distributed/run.py", line 919, in main
run(args)
File "/home/user/miniconda3/envs/miqa/lib/python3.12/site-packages/torch/distributed/run.py", line 910, in run
elastic_launch(
File "/home/user/miniconda3/envs/miqa/lib/python3.12/site-packages/torch/distributed/launcher/api.py", line 138, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/user/miniconda3/envs/miqa/lib/python3.12/site-packages/torch/distributed/launcher/api.py", line 269, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: