Some specific environment variables left behind when using EESSI module + CUDA #229

@ocaisa


When using #226 I noticed that I effectively introduced variables that remain set even after the EESSI module is unloaded:

[aocais00@login05 software-layer-scripts]$ srun -p boost_usr_prod --cpus-per-task=10 -N 1 --ntasks-per-node=1 --gres=gpu:1 -J test_eessi --account EUHPC_D30_076 -t 0:10:00 --pty /bin/bash
srun: job 41422642 queued and waiting for resources
srun: job 41422642 has been allocated resources

[aocais00@lrdn0193 software-layer-scripts]$ env | grep EESSI

[aocais00@lrdn0193 software-layer-scripts]$ ~/test/mount_cvmfs.sh
[INFO] Making CVMFS accessible via afuse.
[aocais00@lrdn0193 software-layer-scripts]$ export EESSI_MODULE_STICKY=1
[aocais00@lrdn0193 software-layer-scripts]$ source /cvmfs/software.eessi.io/versions/2023.06/init/lmod/bash
Modules purged before initialising EESSI
Module for EESSI/2023.06 loaded successfully (requires '--force' option to unload or purge)
EESSI has selected x86_64/intel/icelake as the compatible CPU target for EESSI/2023.06
EESSI has selected accel/nvidia/cc80 as the compatible accelerator target for EESSI/2023.06
(for debug information when loading the EESSI module, set the environment variable EESSI_MODULE_DEBUG_INIT)

{EESSI/2023.06} [aocais00@lrdn0193 software-layer-scripts]$ module load OSU-Micro-Benchmarks/7.5-gompi-2023b-CUDA-12.4.0
Lmod has detected the following error:
Your driver CUDA version is 12.2  but the module you want to load requires CUDA 12.4.0. Please update your CUDA driver libraries and then let EESSI know about the
update.
For more information on how to do this, see https://www.eessi.io/docs/site_specific_config/gpu/.

While processing the following module(s):
    Module fullname                                   Module Filename
    ---------------                                   ---------------
    UCX-CUDA/1.15.0-GCCcore-13.2.0-CUDA-12.4.0        /cvmfs/software.eessi.io/versions/2023.06/software/linux/x86_64/intel/icelake/accel/nvidia/cc80/modules/all/UCX-CUDA/1.15.0-GCCcore-13.2.0-CUDA-12.4.0.lua
    NCCL/2.20.5-GCCcore-13.2.0-CUDA-12.4.0            /cvmfs/software.eessi.io/versions/2023.06/software/linux/x86_64/intel/icelake/accel/nvidia/cc80/modules/all/NCCL/2.20.5-GCCcore-13.2.0-CUDA-12.4.0.lua
    OSU-Micro-Benchmarks/7.5-gompi-2023b-CUDA-12.4.0  /cvmfs/software.eessi.io/versions/2023.06/software/linux/x86_64/intel/icelake/accel/nvidia/cc80/modules/all/OSU-Micro-Benchmarks/7.5-gompi-2023b-CUDA-12.4.0.lua

{EESSI/2023.06} [aocais00@lrdn0193 software-layer-scripts]$ export LMOD_PACKAGE_PATH=$PWD/test/.lmod
{EESSI/2023.06} [aocais00@lrdn0193 software-layer-scripts]$ module load OSU-Micro-Benchmarks/7.5-gompi-2023b-CUDA-12.4.0
Lmod Warning:
Your driver CUDA version is 12.2  but the module you want to load requires CUDA 12.4.0. You will therefore be in minor version compatibility mode as described in
https://docs.nvidia.com/deploy/cuda-compatibility/minor-version-compatibility.html .

While processing the following module(s):
    Module fullname                                   Module Filename
    ---------------                                   ---------------
    UCX-CUDA/1.15.0-GCCcore-13.2.0-CUDA-12.4.0        /cvmfs/software.eessi.io/versions/2023.06/software/linux/x86_64/intel/icelake/accel/nvidia/cc80/modules/all/UCX-CUDA/1.15.0-GCCcore-13.2.0-CUDA-12.4.0.lua
    NCCL/2.20.5-GCCcore-13.2.0-CUDA-12.4.0            /cvmfs/software.eessi.io/versions/2023.06/software/linux/x86_64/intel/icelake/accel/nvidia/cc80/modules/all/NCCL/2.20.5-GCCcore-13.2.0-CUDA-12.4.0.lua
    OSU-Micro-Benchmarks/7.5-gompi-2023b-CUDA-12.4.0  /cvmfs/software.eessi.io/versions/2023.06/software/linux/x86_64/intel/icelake/accel/nvidia/cc80/modules/all/OSU-Micro-Benchmarks/7.5-gompi-2023b-CUDA-12.4.0.lua

{EESSI/2023.06} [aocais00@lrdn0193 software-layer-scripts]$ module --force purge
[aocais00@lrdn0193 software-layer-scripts]$ env | grep EESSI
LMOD_SYSTEM_DEFAULT_MODULES=EESSI/2023.06
EESSI_CUDA_DRIVER_VERSION=12.2
EESSI_MODULE_STICKY=1  # Explicitly set, this is ok
EESSI_MODULE_UPDATE_PS1=1  # Set by source, so also ok
__LMOD_STACK_EESSI_CUDA_DRIVER_VERSION_SUPPRESS_WARNING=false
__EESSI_VERSION_USED_FOR_INIT=2023.06
EESSI_CUDA_DRIVER_VERSION_SUPPRESS_WARNING=UCX-CUDA

I don't have a good solution right now. This is a little hard to solve, as you really only want these variables set once per session, so in some sense it is a feature and not a bug.
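To make the leftovers easier to spot, one option is to snapshot the exported variable names before initialising EESSI and compare after the purge. The following is only a rough sketch: `env_names` and `leftover_vars` are hypothetical helper names, not part of EESSI or Lmod, and the approach only sees exported variables.

```shell
# Hypothetical helpers (not part of EESSI or Lmod).

# Print the sorted, unique names of all exported variables.
env_names() {
    env | sed -n 's/^\([A-Za-z_][A-Za-z0-9_]*\)=.*/\1/p' | LC_ALL=C sort -u
}

# Print variable names that exist now but are absent from a snapshot.
#   $1: file written earlier with `env_names > file`
#   $2: extended regex matched against the names (default: EESSI)
leftover_vars() {
    env_names | LC_ALL=C comm -13 "$1" - | grep -E "${2:-EESSI}" || true
}

# Intended use (assumed workflow, mirroring the session above):
#   before=$(mktemp); env_names > "$before"
#   source /cvmfs/software.eessi.io/versions/2023.06/init/lmod/bash
#   module load ...; module --force purge
#   leftover_vars "$before" EESSI
```

Unlike `env | grep EESSI`, this matches on variable names only, so `LMOD_SYSTEM_DEFAULT_MODULES` would need a wider pattern such as `'EESSI|LMOD'`. Variables exported before the snapshot (for example `EESSI_MODULE_STICKY`, if it is set first) are not reported.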
