12 changes: 10 additions & 2 deletions build_tools/wheel_utils/Dockerfile.aarch
@@ -35,12 +35,20 @@ RUN dnf clean all
RUN dnf -y install glog.aarch64 glog-devel.aarch64
RUN dnf -y install libnccl libnccl-devel libnccl-static

ENV PATH="/usr/local/cuda/bin:${PATH}"
ENV LD_LIBRARY_PATH="/usr/local/cuda/lib64:${LD_LIBRARY_PATH}"
RUN dnf -y install openmpi openmpi-devel && dnf clean all
RUN mkdir -p /opt/mpi && \
ln -s /usr/lib64/openmpi/bin /opt/mpi/bin && \
ln -s /usr/lib64/openmpi/lib /opt/mpi/lib && \
ln -s /usr/include/openmpi-aarch64 /opt/mpi/include && \
echo "/usr/lib64/openmpi/lib" >> /etc/ld.so.conf.d/openmpi-aarch64.conf
Comment on lines +39 to +43

P2 Same as Dockerfile.x86: ldconfig should be called after appending to the /etc/ld.so.conf.d/ file so the dynamic linker cache is updated within the same Docker layer.

Suggested change
-RUN mkdir -p /opt/mpi && \
-ln -s /usr/lib64/openmpi/bin /opt/mpi/bin && \
-ln -s /usr/lib64/openmpi/lib /opt/mpi/lib && \
-ln -s /usr/include/openmpi-aarch64 /opt/mpi/include && \
-echo "/usr/lib64/openmpi/lib" >> /etc/ld.so.conf.d/openmpi-aarch64.conf
+RUN mkdir -p /opt/mpi && \
+ln -s /usr/lib64/openmpi/bin /opt/mpi/bin && \
+ln -s /usr/lib64/openmpi/lib /opt/mpi/lib && \
+ln -s /usr/include/openmpi-aarch64 /opt/mpi/include && \
+echo "/usr/lib64/openmpi/lib" >> /etc/ld.so.conf.d/openmpi-aarch64.conf && \
+ldconfig


ENV PATH="/usr/local/cuda/bin:/opt/mpi/bin:${PATH}"
ENV LD_LIBRARY_PATH="/usr/local/cuda/lib64:/opt/mpi/lib:${LD_LIBRARY_PATH}"
ENV CUDA_HOME=/usr/local/cuda
ENV CUDA_ROOT=/usr/local/cuda
ENV CUDA_PATH=/usr/local/cuda
ENV CUDADIR=/usr/local/cuda
ENV MPI_HOME=/opt/mpi
ENV NVTE_RELEASE_BUILD=1

CMD ["/bin/bash", "-c", "bash /TransformerEngine/build_tools/wheel_utils/build_wheels.sh manylinux_2_28_aarch64 $BUILD_METAPACKAGE $BUILD_COMMON $BUILD_PYTORCH $BUILD_JAX $CUDA_MAJOR"]
12 changes: 10 additions & 2 deletions build_tools/wheel_utils/Dockerfile.x86
@@ -35,12 +35,20 @@ RUN dnf clean all
RUN dnf -y install glog.x86_64 glog-devel.x86_64
RUN dnf -y install libnccl libnccl-devel libnccl-static

ENV PATH="/usr/local/cuda/bin:${PATH}"
ENV LD_LIBRARY_PATH="/usr/local/cuda/lib64:${LD_LIBRARY_PATH}"
RUN dnf -y install openmpi openmpi-devel && dnf clean all
RUN mkdir -p /opt/mpi && \
ln -s /usr/lib64/openmpi/bin /opt/mpi/bin && \
ln -s /usr/lib64/openmpi/lib /opt/mpi/lib && \
ln -s /usr/include/openmpi-x86_64 /opt/mpi/include && \
echo "/usr/lib64/openmpi/lib" >> /etc/ld.so.conf.d/openmpi-x86_64.conf
Comment on lines +39 to +43

P2 After writing to /etc/ld.so.conf.d/openmpi-x86_64.conf, ldconfig should be called in the same RUN layer to update the dynamic linker cache. Without it, tools that depend on the ldconfig cache (rather than LD_LIBRARY_PATH) will not find the OpenMPI libraries at build time inside the container.

Suggested change
-RUN mkdir -p /opt/mpi && \
-ln -s /usr/lib64/openmpi/bin /opt/mpi/bin && \
-ln -s /usr/lib64/openmpi/lib /opt/mpi/lib && \
-ln -s /usr/include/openmpi-x86_64 /opt/mpi/include && \
-echo "/usr/lib64/openmpi/lib" >> /etc/ld.so.conf.d/openmpi-x86_64.conf
+RUN mkdir -p /opt/mpi && \
+ln -s /usr/lib64/openmpi/bin /opt/mpi/bin && \
+ln -s /usr/lib64/openmpi/lib /opt/mpi/lib && \
+ln -s /usr/include/openmpi-x86_64 /opt/mpi/include && \
+echo "/usr/lib64/openmpi/lib" >> /etc/ld.so.conf.d/openmpi-x86_64.conf && \
+ldconfig
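
One way to confirm that the suggested `ldconfig` call took effect is to inspect the linker cache once the layer is built. The sketch below is a hypothetical helper (`cache_has_lib` is not part of this PR) that scans `ldconfig -p` style output supplied on stdin for a library name. Reading from stdin is a choice made here so that the check can be exercised without OpenMPI installed; the library name `libmpi.so` in the usage comment is an assumption about what the `openmpi` package ships.

```shell
# Hypothetical helper, not part of the PR: succeed if the given library name
# appears anywhere in `ldconfig -p` style output read from stdin.
cache_has_lib() {
    # Cache entries have the form:
    #   <tab>libname.so.N (libc6,arch) => /path/to/libname.so.N
    # A fixed-string substring match is enough for a smoke check.
    grep -qF -- "$1"
}

# Possible use inside the container (assumed library name):
#   ldconfig -p | cache_has_lib libmpi.so || echo "libmpi not in linker cache" >&2
```

Because the match is a plain substring test, it only shows that some entry mentions the name; it does not check which directory the entry resolves to.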


ENV PATH="/usr/local/cuda/bin:/opt/mpi/bin:${PATH}"
ENV LD_LIBRARY_PATH="/usr/local/cuda/lib64:/opt/mpi/lib:${LD_LIBRARY_PATH}"
ENV CUDA_HOME=/usr/local/cuda
ENV CUDA_ROOT=/usr/local/cuda
ENV CUDA_PATH=/usr/local/cuda
ENV CUDADIR=/usr/local/cuda
ENV MPI_HOME=/opt/mpi
ENV NVTE_RELEASE_BUILD=1

CMD ["/bin/bash", "-c", "bash /TransformerEngine/build_tools/wheel_utils/build_wheels.sh manylinux_2_28_x86_64 $BUILD_METAPACKAGE $BUILD_COMMON $BUILD_PYTORCH $BUILD_JAX $CUDA_MAJOR"]
29 changes: 29 additions & 0 deletions build_tools/wheel_utils/build_wheels.sh
@@ -24,6 +24,35 @@ git submodule update --init --recursive

# Install deps
/opt/python/cp310-cp310/bin/pip install cmake pybind11[global] ninja setuptools wheel
/opt/python/cp310-cp310/bin/pip install \
"nvidia-cublasmp-cu${CUDA_MAJOR}" \
"nvidia-cusolvermp-cu${CUDA_MAJOR}" \
"nvidia-nvshmem-cu${CUDA_MAJOR}"

SITE_PACKAGES=$(/opt/python/cp310-cp310/bin/python -c "import sysconfig; print(sysconfig.get_paths()['purelib'])")
export CUBLASMP_HOME="${SITE_PACKAGES}/nvidia/cublasmp/cu${CUDA_MAJOR}"
export CUSOLVERMP_HOME="${SITE_PACKAGES}/nvidia/cu${CUDA_MAJOR}"

P1 Likely incorrect CUSOLVERMP_HOME path

The path ${SITE_PACKAGES}/nvidia/cu${CUDA_MAJOR} appears to be missing the package-name segment. The cuBLASMp path set on the line above uses the layout site-packages/nvidia/<package-name>/cu<ver>/ (nvidia-cublasmp-cu12 under nvidia/cublasmp/cu12/), so nvidia-cusolvermp-cu12 would be expected under nvidia/cusolvermp/cu12/. Note that NVSHMEM_HOME below uses nvidia/nvshmem with no cu<ver> segment, so the layout is not uniform across packages and is worth confirming against the installed wheel. If the path is wrong, the .so symlink loop silently skips cuSolverMP's lib/ directory ([ -d "$lib_dir" ] || continue), no unversioned .so symlinks are created, and the linker will not find cuSolverMP at build time even though NVTE_WITH_CUSOLVERMP=1 is exported.
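
Rather than hard-coding one layout, the script could probe for the directory that actually exists and fail loudly if none does. The following is a minimal sketch under stated assumptions: `find_nvidia_home` is a hypothetical helper, and its candidate list contains only the three layouts that appear in this diff, not a documented convention of the NVIDIA wheels.

```shell
# Hypothetical helper, not part of the PR: print the first candidate directory
# under <site-packages>/nvidia that contains a lib/ subdirectory.
# The candidate layouts are assumptions taken from the paths used in this diff.
find_nvidia_home() {
    local site_packages="$1"
    local pkg="$2"
    local cuda_major="$3"
    local candidate
    for candidate in \
        "${site_packages}/nvidia/${pkg}/cu${cuda_major}" \
        "${site_packages}/nvidia/${pkg}" \
        "${site_packages}/nvidia/cu${cuda_major}" ; do
        if [ -d "${candidate}/lib" ]; then
            printf '%s\n' "$candidate"
            return 0
        fi
    done
    # Nothing matched: report it instead of continuing with a bad path.
    echo "no lib/ directory found for ${pkg} under ${site_packages}/nvidia" >&2
    return 1
}

# Possible use in build_wheels.sh. Assign first and export afterwards, because
# `export VAR=$(cmd)` returns the status of export and hides a failure of cmd:
#   CUSOLVERMP_HOME=$(find_nvidia_home "$SITE_PACKAGES" cusolvermp "$CUDA_MAJOR") || exit 1
#   export CUSOLVERMP_HOME
```

With this shape a wrong guess about the wheel layout stops the build at the point of lookup instead of surfacing later as a link error.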

export NVSHMEM_HOME="${SITE_PACKAGES}/nvidia/nvshmem"

# nvidia-cuda-python package compatibility.
for lib_dir in "${CUBLASMP_HOME}/lib" "${CUSOLVERMP_HOME}/lib" "${NVSHMEM_HOME}/lib" ; do
    [ -d "$lib_dir" ] || continue
    for so in "$lib_dir"/lib*.so.* ; do
        [ -e "$so" ] || continue
        base=$(basename "$so")
        unversioned="${base%%.so.*}.so"
        ln -sf "$base" "${lib_dir}/${unversioned}"
    done
done
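
The loop above skips a missing directory without any message, which is how a wrong `*_HOME` path can go unnoticed. As a sketch of an alternative, the hypothetical function below (`link_unversioned` is not part of this PR) performs the same symlink creation for one directory but returns non-zero when the directory does not exist.

```shell
# Hypothetical variant of the loop above, not part of the PR: create the
# unversioned symlinks for one directory and fail if the directory is missing.
link_unversioned() {
    local lib_dir="$1"
    local so base
    if [ ! -d "$lib_dir" ]; then
        echo "missing library directory: $lib_dir" >&2
        return 1
    fi
    for so in "$lib_dir"/lib*.so.* ; do
        # An unmatched glob stays literal, so skip entries that do not exist.
        [ -e "$so" ] || continue
        base=$(basename "$so")
        # %% strips the longest ".so.*" suffix: libfoo.so.1.2.3 -> libfoo,
        # and ".so" is appended again to form the unversioned name.
        ln -sf "$base" "${lib_dir}/${base%%.so.*}.so"
    done
}

# Possible use in build_wheels.sh:
#   link_unversioned "${CUSOLVERMP_HOME}/lib" || exit 1
```

The symlink target is the bare file name, so the link stays valid if the directory is relocated. When several versioned files share one stem, the last one in glob order determines where the unversioned link points.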

# Enable optional build features and expose the runtime libs to the linker.
export NVTE_WITH_CUBLASMP=1
export NVTE_WITH_CUSOLVERMP=1
export NVTE_ENABLE_NVSHMEM=1
export NVTE_UB_WITH_MPI=1
export MPI_HOME="${MPI_HOME:-/opt/mpi}"
export LD_LIBRARY_PATH="${NVSHMEM_HOME}/lib:${CUBLASMP_HOME}/lib:${CUSOLVERMP_HOME}/lib:${MPI_HOME}/lib:${LD_LIBRARY_PATH}"
export PATH="${MPI_HOME}/bin:${PATH}"
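
A side note on the `LD_LIBRARY_PATH` export above: if the variable is empty or unset when the script runs, the expression ends in a trailing colon, and the dynamic linker treats an empty entry as the current working directory. The Dockerfiles in this PR set the variable, so this may never occur in the container; the sketch below only illustrates a defensive form. `prepend_path` is a hypothetical helper, not something the PR defines.

```shell
# Hypothetical helper, not part of the PR: prepend a directory to a
# colon-separated list without leaving an empty entry behind.
prepend_path() {
    local dir="$1"
    local current="$2"
    # ${current:+:${current}} expands to ":<current>" only when current is non-empty.
    printf '%s\n' "${dir}${current:+:${current}}"
}

# Possible use in build_wheels.sh:
#   LD_LIBRARY_PATH=$(prepend_path "${MPI_HOME}/lib" "${LD_LIBRARY_PATH:-}")
#   export LD_LIBRARY_PATH
```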

if $BUILD_METAPACKAGE ; then
cd /TransformerEngine