Using a conda environment

Hi,

I would like to use a custom conda environment instead of LCG for reproducibility. I have installed conda in my CERNbox area and tried to “hack” this by running the following script.

# Start from a clean slate: drop the paths that SWAN sets up for LCG
unset LD_LIBRARY_PATH
unset PYTHONPATH
unset PYTHONHOME

# Activate the conda installation kept in my CERNbox (EOS) area
export CONDA_INSTALL_LOCATION_EOS=/eos/user/u/username/miniconda3
source $CONDA_INSTALL_LOCATION_EOS/etc/profile.d/conda.sh
conda activate base

SWAN accepts this script at first, but when I open a notebook it fails to connect to the kernel and won’t run anything. Is there a “hack” to get this to work?
Are there any plans for SWAN to better incorporate conda?

Thanks!

Non-SWAN expert responding here.

The key step that you are missing is to register your IPython kernel. In the Jupyter architecture, SWAN starts the “notebook interface”, which in turn starts a kernel (Python, or indeed any other language) to do the actual code execution.

I have a prototype which also installs conda (in my case at SWAN start, and into /scratch), in which I register the IPython kernel in a SWAN “Environment script”. It is feasible to replace the existing “python3” kernel, which would effectively replace the default SWAN kernel for the duration of your SWAN session. Personally though, I just created a new kernel name and then, as soon as my notebook starts, switch the kernel that the notebook runs with.

In terms of code, you will need to ensure you’ve installed the IPython kernel (conda-installable as ipykernel), then:

# The kernels on SWAN are installed in the scratch user-site directory.
KERNEL_PREFIX=$SCRATCH_HOME/.local

# Any name will do; using "python3" would shadow the default SWAN kernel.
KERNEL_NAME=conda-env

python -m ipykernel install --prefix="${KERNEL_PREFIX}" --name "${KERNEL_NAME}"
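
You can then confirm the registration from a terminal. The kernel name and path below just follow the example above; the default python3 kernel is also listed but elided here:

$ jupyter kernelspec list
Available kernels:
  conda-env    /scratch/pelson/.local/share/jupyter/kernels/conda-env
  ...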

Dear John, pelson,

Thanks for your messages. We didn’t know that it was possible to install it like that, given that the startup script has a limited time to finish (we’ve increased this time recently, though).

The only thing we can say right now is that, although it is not supported at the moment, we are investigating how to provide simplified access to Conda environments. We had a Google Summer of Code student working on this last year, and EOS is about to push to production a functionality that should allow you to keep the envs stored in EOS and have an acceptable experience. But this is still experimental and we need to see if it gets to a point where we feel comfortable pushing it to our users.
We’re also refreshing the way we manage software in general (LCG, Conda, experiment stacks, etc.), and we will try to bring all of this together.

Cheers,
Diogo

Indeed the timeout was a problem. I tricked the system a little by backgrounding the “conda install” step, and by manually installing the kernel definition (it is simply a JSON file), in order to get past the 60s limit. After that, you just have to wait until the background process has finished before launching a notebook/kernel. Typically, when installing to scratch, I found that even big environments took less than a few minutes (I haven’t yet tried mamba, which could speed the whole thing up considerably).
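
For reference, a minimal sketch of that trick, assuming the Miniconda installer was copied to EOS beforehand; the paths, the environment name (“analysis”) and the package list are just illustrative choices of mine:

# Install conda into /scratch in the background, so the startup script
# itself returns well within the time limit.
(
  bash /eos/user/u/username/Miniconda3-latest-Linux-x86_64.sh -b -p /scratch/$USER/miniconda3
  /scratch/$USER/miniconda3/bin/conda create -y -n analysis python=3.9 ipykernel
) &> /scratch/$USER/conda-install.log &

# Write the kernel definition by hand (it is just a JSON file), so the
# kernel is registered immediately, before the install has even finished.
KERNEL_DIR=$SCRATCH_HOME/.local/share/jupyter/kernels/analysis
mkdir -p "$KERNEL_DIR"
cat > "$KERNEL_DIR/kernel.json" <<EOF
{
  "argv": ["/scratch/$USER/miniconda3/envs/analysis/bin/python",
           "-m", "ipykernel_launcher", "-f", "{connection_file}"],
  "display_name": "analysis (conda)",
  "language": "python"
}
EOF

Tailing the log file tells you when it is safe to open a notebook with that kernel.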

Happy to share/call if you want to see the approach, @dalvesde.

Putting things in the background is one way our users bypass the time limit. But then there’s no way to know whether things have finished or not.
We’ll come back to you in the near future, when we start looking again into the integration of these “pieces”. But it would already be helpful if you shared your startup script with us.
Thanks

Is it necessary to always install things on start-up? For my project, we preinstall packages on EOS and simply link to such a venv through the startup script, along the lines of the sketch below.
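
This is roughly what our startup script does; the EOS project path and the kernel name are specific to our setup and only illustrative here:

# Use a venv that was created once on EOS, instead of installing at startup.
source /eos/project/m/myproject/venvs/analysis/bin/activate

# Register its kernel into scratch so the notebook interface can find it.
python -m ipykernel install --prefix=$SCRATCH_HOME/.local --name analysis-venv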

Here is the default jupyter config for a standard SWAN setup for me:

$ jupyter --paths
config:
    /scratch/pelson/.jupyter
    /cvmfs/sft.cern.ch/lcg/releases/Python/3.7.6-b96a9/x86_64-centos7-gcc8-opt/etc/jupyter
    /usr/local/etc/jupyter
    /etc/jupyter
...
runtime:
    /scratch/pelson/.local/share/jupyter/runtime

There is no directory which is both persistent and writable, therefore there is currently no way (FWICS) to configure Jupyter to pick up anything persistently. So if you want a custom kernel, you have to do something at SWAN startup: either set an environment variable (e.g. JUPYTER_PATH), or install a kernel into $SCRATCH_HOME.
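
The environment-variable route could look like this in the startup script; the EOS directory is just an example of mine, and would need to contain kernel specs under the usual kernels/<name>/kernel.json layout:

# Make Jupyter also search a persistent, user-writable EOS directory
# for kernel specs and other data files.
export JUPYTER_PATH=/eos/user/u/username/.jupyter-data${JUPYTER_PATH:+:$JUPYTER_PATH}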

The alternative approach is to put something into each of your notebooks to do the setup/config - I have an example of doing that to install a virtual environment and set up the Python path here. I personally don’t particularly like this approach, and prefer to set up the environment before any notebooks are loaded/executed, since it feels like an implementation detail of SWAN is bleeding into the data analysis record. To be completely clear though, I don’t think Jupyter has solved the problem of declaring the software environment in which a notebook should run, so it isn’t a SWAN problem per se.