A script to add local installation path and copy notebooks

mmacieje · May 11, 2020, 8:57am

Dear SWAN Team,

I am working on a project developing notebooks for Hardware Commissioning and Operation analyses for the LHC superconducting circuits.
Hardware Commissioning is a process of incremental checks of circuit components prior to its restart at the nominal energy.
These notebooks consist of cells dedicated to the analysis of particular systems. The notebooks are executed by domain experts (about a dozen). To date there are about 25 notebooks, and several dozen more are in development to complete the library for all circuits.

I am writing to ask for your suggestions on improving the user experience.
Currently, a user should:

1. Install four external packages (plotly, tqdm, influxdb, tzlocal)
2. Upgrade our own package lhcsmapi (subject to frequent updates to account for different scenarios)
3. Copy notebooks from the repository (https://gitlab.cern.ch/lhcdata/lhc-sm-hwc)

While the first point has to be done only once, the latter two require asynchronous updates, which poses a risk of incompatible lhcsmapi package and notebook versions being used across the team of experts (some will follow the communication on updates, some might miss it).

Given the criticality of these tests, the current workflow is unsustainable and calls for improvement.

I read your documentation [1, 2] and came across the idea of a bash startup script. Ideally, the script would perform these three steps by taking the packages (points 1 and 2) from an installation in the project EOS space (all experts have read rights) and cloning the repository (or simply overwriting the user's project with a copy from EOS).

Would that be possible with the startup script?

Thank you in advance for your time and consideration.

Cheers, Michał.

PS. I’m aware of the option of adding the packages to CVMFS, however the project is still in an exploratory phase. Once we reach a satisfactory degree of stability, we will ask the librarians to add these packages. Nonetheless, adding the packages to CVMFS solves only half of the problem: the sync of the notebooks still has to be carried out.

[1] https://github.com/swan-cern/help/blob/master/session/select.md
[2] https://github.com/swan-cern/help/blob/master/advanced/install_packages.md

pelson · May 11, 2020, 2:37pm

Hi Michał,

Non-SWAN maintainer here. Just to say that, yes, the environment startup script can literally do anything. If it takes much longer than 60s, you may find that the startup of the SWAN environment becomes unreliable, so in the past I have backgrounded tasks in the script to avoid this.
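
The backgrounding mentioned above could look roughly like the sketch below. It assumes a hypothetical `slow_setup` function standing in for the slow work and an assumed log file location; neither comes from SWAN itself.

```shell
# Hypothetical long-running task; replace the body with the real work
# (for example a large copy or a repository clone).
slow_setup() {
    sleep 1
    echo "setup finished"
}

# Assumed log location, so the output of the background job is not lost.
SETUP_LOG="${TMPDIR:-/tmp}/swan_env_setup.log"

# Run the task in the background so the script itself returns immediately.
slow_setup > "$SETUP_LOG" 2>&1 &
SETUP_PID=$!
```

Anything the session needs immediately (such as exported variables) has to stay in the foreground part of the script, since a background job cannot change the environment of the shell that started it.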

Personally however, if I had the same requirements as you described, I would enforce a single command at the start of every notebook which did the tasks that you require. This way it is very explicit, and there is no risk that the user forgets to include the environment startup script when starting up SWAN (plus you can use a standard SWAN environment without having to restart it in order to do this analysis).

I posted a few weeks back an example of doing precisely this for package installation into a directory that was not .local, in the thread "Avoiding the use of --user and .local for pip installations". The same technique applies to each of the requirements you have listed (pip install stuff, force a pip upgrade of a specific thing, clone a repository).

Anyway, hope this helps. Again, just a reminder - I’m a fellow user, not a developer/maintainer of SWAN, so please don’t take my word that this is the recommended approach.

Cheers,

Phil

etejedor · May 11, 2020, 3:13pm

Dear Michal,

Just to add to the previous answer (thanks Phil!):

> I read your documentation [1, 2] and came across the idea of a bash startup script. Ideally, the script would perform these three steps by taking the packages (points 1 and 2) from an installation in the project EOS space (all experts have read rights) and cloning the repository (or simply overwriting the user's project with a copy from EOS).

Yes, you can do that, since the environment script runs as your user. The only concern here is time: you need to make sure it does not take too long (more than 30 seconds), since that could be misinterpreted as a problem by SWAN. You can easily create an environment script and check whether this is a workable solution.
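
One way to check whether a script stays within that limit is to time it in a SWAN terminal before using it as the environment script. A minimal sketch, assuming a hypothetical helper name `time_env_script` and taking the 30-second figure above as the default limit:

```shell
# Source a script and report how long it took, in whole seconds.
# Returns non-zero if the script took longer than the limit.
time_env_script() {
    local script="$1"
    local limit="${2:-30}"   # seconds; default taken from the figure quoted above
    local start end elapsed
    start="$(date +%s)"
    # Sourced so that exported variables are kept
    # (assumed to mirror how the environment script is run).
    . "$script"
    end="$(date +%s)"
    elapsed=$(( end - start ))
    echo "elapsed=${elapsed}s limit=${limit}s"
    [ "$elapsed" -le "$limit" ]
}

# Usage (the script name is a placeholder):
# time_env_script ./env.sh
```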

If the lhcsmapi package has frequent updates, it is indeed not a good candidate for the LCG releases on CVMFS, but the other four packages might be, and that would save you one step.

On the other hand, the option that Phil mentioned (having a specific cell in the notebook to take care of the 3 steps) is perfectly valid; it really depends on how much it bothers you (or your users) to have that kind of cell repeated in every notebook.

Cheers,

Enric

mmacieje · May 11, 2020, 3:51pm

Dear @pelson and @etejedor,

many thanks for your detailed replies. They go in the right direction:

1. No, I don’t mind having a cell creating a venv at the beginning of each notebook. Actually, I was wondering if one could create and reuse a venv for a project (thanks Phil!). The users shouldn’t mind either, since this removes one annoying step for them. Thanks, I’ll play with that. I’ll also contact the librarians regarding the four external packages.
2. I think that I still need an environment script to update the notebooks. Otherwise, users would need to, e.g., run a copy command from the SWAN terminal before starting to work. So, in the end, there is still one operation that the users have to remember (either providing the environment script, which to me is a bit more natural, or executing a script from the terminal to copy the notebooks).

What do you think is a good solution for keeping the notebooks up to date?
Do you have any examples of environment scripts to draw inspiration from?

Cheers, Michał.

pelson · May 12, 2020, 8:55am

Please feel free to make use of https://gitlab.cern.ch/pelson/swan-run-in-venv. If you need to make adjustments I’m more than happy to discuss integrating them into the repository. If you find it useful we may also wish to move the project out to a common group (perhaps the SWAN github org) so that it isn’t explicitly tied to my user. I was quite opinionated about the venv directory location, and that is certainly something I could see changing (or at least configurable) if desired.

With regard to updating the notebooks: technically you can refresh the page from a code cell, so I could imagine the notebook pulling the latest changes and then refreshing. Unfortunately, you are going to have problems if users have made any changes to the notebooks, as you will get merge conflicts. Notebooks are famous for this problem and there isn’t really a good solution (even with nbdime). My best suggestion is to minimise the amount of work done in the notebook and move it into the library you mentioned instead. Even if all the users were using the same file rather than copies, you would have a synchronisation problem if more than one user made changes at the same time. This is a fundamental problem with notebooks, which tools such as CoCalc and Google Colab have sought to address (with a reasonable amount of success if you buy in to their platform). In your situation, I think I would try to make it clear that the notebooks being distributed are read&execute only: anything users write in the notebook will be lost unless they take a copy (e.g. from the menu in the notebook interface).

etejedor · May 12, 2020, 9:25am

Hi,
No, unfortunately we don’t have examples of environment scripts, but users usually set some environment variables in them (or call other scripts that set them). There is nothing special about it: it is just a script we run as your user before your session starts (and we pick up any environment variable you set in it).

If you use the environment script to pull the notebooks every time a user starts a session, you need to consider where those notebooks will be located. Is the user always going to work with a SWAN project of the same name that contains the notebooks? What if they created another project with them? What if the user modified the notebooks and wanted to keep the modifications? Perhaps making the users fully aware of the update is not that bad; it would give them more control over what and when to update.
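
If overwriting is chosen anyway, the concern about losing user modifications could be softened by moving the existing project aside instead of deleting it. A minimal sketch, assuming a hypothetical `backup_project` helper and a timestamped naming scheme chosen only for illustration:

```shell
# Move an existing project directory aside under a timestamped name
# instead of deleting it. Does nothing if the directory does not exist.
backup_project() {
    local project="$1"
    local stamp
    [ -d "$project" ] || return 0
    stamp="$(date +%Y%m%d-%H%M%S)"
    mv "$project" "${project}.bak-${stamp}"
}

# Usage (the project name is a placeholder):
# backup_project "$CERNBOX_HOME/SWAN_projects/my_project"
```

The backups accumulate over time, so some clean-up convention would still be needed.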

etejedor · May 13, 2020, 8:22am

Hi,

Another possibility would be to have an EOS project where you put the version of the packages you want to provide to your users, and in the environment script you append to PYTHONPATH to point to those packages. That would give you full control over which version of those packages your users use.
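
A minimal sketch of such an append, assuming a hypothetical `append_pythonpath` helper; the directory in the usage line is a placeholder, not a real project path.

```shell
# Append a directory to PYTHONPATH unless it is already listed.
# Existing entries are kept, and no stray leading colon is produced
# when PYTHONPATH starts out empty.
append_pythonpath() {
    local dir="$1"
    case ":${PYTHONPATH}:" in
        *":${dir}:"*) ;;    # already present, nothing to do
        *) PYTHONPATH="${PYTHONPATH:+${PYTHONPATH}:}${dir}" ;;
    esac
    export PYTHONPATH
}

# Usage (placeholder path):
# append_pythonpath /eos/project/x/myproject/packages
```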

You could also place the notebooks in that EOS project and copy them to the user’s CERNBox in the environment script, but again this would require some convention (i.e. which SWAN project they are copied to).

mmacieje · May 13, 2020, 3:40pm

Hi @etejedor,

Indeed, I was hoping to profit from the tight integration of SWAN with EOS. We have an EOS project (lhcsm) and all our users have read rights to the project.
Having a virtual environment installed at the lhcsm project seems like an elegant solution. I installed packages with pip install --target=/eos/project/l/lhcsm/venv/ package_name
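
The install command above can be wrapped so that the same call also handles upgrades of lhcsmapi. A minimal sketch, assuming a hypothetical `install_to_target` helper; the `PIP_CMD` override is an assumption of this sketch, added so the resulting command can be inspected without actually running pip.

```shell
# Install or upgrade packages into a shared directory instead of ~/.local.
# With --target, pip only replaces packages already present in the
# directory when --upgrade is given.
install_to_target() {
    local target="$1"
    shift
    # PIP_CMD is an assumed override; by default pip is run through python3.
    ${PIP_CMD:-python3 -m pip} install --upgrade --target="$target" "$@"
}

# Usage:
# install_to_target /eos/project/l/lhcsm/venv plotly tqdm influxdb tzlocal
# install_to_target /eos/project/l/lhcsm/venv lhcsmapi
```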

Exactly as @pelson pointed out, the notebooks are distributed as read&execute only. In fact, it is imperative that everyone uses the most up-to-date notebook version. For advanced users, the edited notebooks will be synced through git.

The convention is that whoever calls the environment script would have:

1. \\cern.ch\eos\project\l\lhcsm\venv appended to PYTHONPATH.
   Would export PYTHONPATH="\cern.ch\eos\project\l\lhcsm\venv" do the job?
2. The content of \\cern.ch\eos\project\l\lhcsm\hwc\notebooks copied to the hwc local SWAN project (to be deleted and copied again at every log-in).

Could you please kindly help me with these two steps?
Where could I locate the environment script (GitLab, EOS)?

Thank you in advance, Michał.

etejedor · May 13, 2020, 4:20pm

Hi Michal,

> \\cern.ch\eos\project\l\lhcsm\venv appended to PYTHONPATH
> Would export PYTHONPATH="\cern.ch\eos\project\l\lhcsm\venv" do the job?

You will need to append to the PYTHONPATH, otherwise it will mess up SWAN’s environment. So:
export PYTHONPATH=$PYTHONPATH:\cern.ch\eos\project\l\lhcsm\venv

> Where could I locate the environment script (GitLab, EOS)?

The EOS project folder should be fine.

mmacieje · May 18, 2020, 3:18pm

Hi Enric,

thanks for the reply.
I created a script at cern.ch\eos\project\l\lhcsm\public\env.sh with a single line:
export PYTHONPATH=$PYTHONPATH:\cern.ch\eos\project\l\lhcsm\venv
The idea is that all users would refer to the same environment script, which is stored at eos/project/l/lhcsm/public/env.sh.
I tried different path combinations (e.g. /eos/project/l/lhcsm/public/env.sh, /cern.ch/eos/project/l/lhcsm/public/env.sh) while configuring the SWAN environment, however none of them worked. What would be the right path to call the script?

Thank you in advance, Michał.

etejedor · May 19, 2020, 8:46am

Hi Michal,

This path should be fine:

/eos/project/l/lhcsm/public/env.sh

I tried myself with a similar example (in /eos/project/s/swan) and it worked for me.

Just to try, can you place the following content (and only that content) in /eos/project/l/lhcsm/public/env.sh :

export MYVAR=myvar

Then start a SWAN session specifying /eos/project/l/lhcsm/public/env.sh as the environment script. If you now open a terminal, do you see MYVAR defined?
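
The check in the terminal can be done with `echo $MYVAR`, or with a small helper such as the sketch below (the name `check_var` is hypothetical):

```shell
# Report whether an exported environment variable is visible in this shell.
# printenv only sees exported variables, which is what matters here.
check_var() {
    if printenv "$1" > /dev/null; then
        echo "$1 is set to: $(printenv "$1")"
    else
        echo "$1 is not set"
    fi
}

# Usage:
# check_var MYVAR
```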

If that works, you just need to make sure you include a line in the environment script that is:

export PYTHONPATH=$PYTHONPATH:/eos/project/l/lhcsm/directory_where_python_modules_are

Note that /eos/project is mounted on the SWAN nodes, so you can use that mount.

mmacieje · May 19, 2020, 1:16pm

Hi Enric,

thanks, now it does work! I guess I had the path written wrong.
I also added the copy of notebooks.

The final version of the script, in case anyone has a similar question, is:

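# Make the packages installed in the EOS project importable
export PYTHONPATH="$PYTHONPATH:/eos/project/l/lhcsm/venv"
# Replace the local hwc project with a fresh copy of the notebooks from EOS
rm -rf "$CERNBOX_HOME/SWAN_projects/hwc"
cp -r /eos/project/l/lhcsm/hwc/lhc-sm-hwc/ "$CERNBOX_HOME/SWAN_projects/hwc/"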
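
A slightly more defensive variant of the copy step, as a sketch: it only deletes the local project when the EOS source is actually reachable, so a temporarily unavailable mount does not leave the user without notebooks. The `sync_notebooks` helper name is hypothetical; the paths in the usage line are the ones from the script above.

```shell
# Replace dest with a fresh copy of src, but only if src is available.
sync_notebooks() {
    local src="$1"
    local dest="$2"
    if [ -z "$dest" ] || [ ! -d "$src" ]; then
        echo "not syncing: source '$src' missing or destination empty" >&2
        return 1
    fi
    rm -rf "$dest"
    mkdir -p "$(dirname "$dest")"
    cp -r "$src" "$dest"
}

# Usage, with the paths from the script above:
# sync_notebooks /eos/project/l/lhcsm/hwc/lhc-sm-hwc "$CERNBOX_HOME/SWAN_projects/hwc"
```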

Cheers, Michał.