Is there a way to automate running scripts in SWAN?

dguerrei · March 25, 2019, 2:28pm

Hi,

I would like to know if there is, or if there are plans for, a way to run scripts in SWAN periodically without human action (i.e. logging in to SWAN, connecting to Spark and clicking Run), the same way we can do it on Linux with cron.

Cheers
Diogo

etejedor · March 25, 2019, 3:48pm

Hi @dguerrei,

In general, I would say the purpose of SWAN is to provide an interactive service, so you actually need to be there to run stuff (i.e. via Jupyter notebooks). Moreover, if you are inactive for a certain amount of time, SWAN will clean up your session.

What exactly is your use case? Do you want to run just plain shell scripts, or do you want to use Spark?

dguerrei · March 25, 2019, 4:41pm

Hello.
Thank you for your reply.

We would like to perform analysis on production data (in our case, related to cooling and ventilation systems, EN-CV). We have already been able to perform some analysis via SWAN, so the service seems very interesting to us, but it lacks a scheduling feature.

There are certainly alternative ways, but since CERN is investing in this infrastructure for data analytics it made sense that we should give it a try.

As I understand it, there is no scheduling feature at the moment, but would you see that as a possible feature to add in the future?

etejedor · March 25, 2019, 6:56pm

Hi @dguerrei,

We appreciate the suggestion and will discuss this use case in the team. Our current philosophy is not to encourage executing long-running processes in SWAN sessions, scheduled or not. That is why we are integrating the service with external resources, so that people can offload their computations. Spark clusters are one example, and we are now exploring batch systems and GPUs.

On the other hand, we will organise a 1-day SWAN users workshop soon, probably next fall. The objective is precisely to gather feedback from the users about what works / what doesn’t in SWAN and how they think the service should evolve. It would be great if you or some other representative from EN-CV could join and show what you are using SWAN for.

Cheers,
Enric

pmrowczy · March 25, 2019, 10:08pm

Hello @dguerrei. What you are looking for is not related to SWAN. SWAN is used to develop Spark jobs in an interactive way. Once you have your job ready, you can run it in cluster mode.

If you need assistance with moving an interactive Spark job to a production cluster-mode job, please contact us at https://cern.service-now.com/service-portal/function.do?name=Hadoop-Components and submit a request for consulting. Please also explain your use case, and we will make sure to make it work for you.
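
As a rough illustration of the cluster-mode idea only: once a notebook has been turned into a standalone script, a cron entry on some machine with access to the cluster could call `spark-submit` in cluster mode. The sketch below just assembles such a command and crontab line. The `yarn` master, the script name `analysis_job.py`, its `--date` argument and the 06:00 schedule are all made-up placeholders, not the actual setup of the CERN clusters, so please treat it as an assumption-laden sketch rather than the recommended procedure.

```python
import shlex


def build_spark_submit(script, master="yarn", app_args=()):
    """Return the argv list that submits `script` in cluster mode.

    `--master` and `--deploy-mode` are standard spark-submit options;
    the default master "yarn" is an assumption about the target cluster.
    """
    return [
        "spark-submit",
        "--master", master,
        "--deploy-mode", "cluster",
        script,
        *app_args,
    ]


def crontab_line(schedule, argv):
    """Join a five-field cron schedule and a command into one crontab entry."""
    return f"{schedule} {shlex.join(argv)}"


if __name__ == "__main__":
    # Hypothetical job: script name and arguments are placeholders.
    argv = build_spark_submit("analysis_job.py", app_args=["--date", "2019-03-25"])
    # "0 6 * * *" means every day at 06:00.
    print(crontab_line("0 6 * * *", argv))
```

In cluster mode the driver runs on the cluster rather than on the submitting machine, so the cron host only needs to stay up long enough to submit the job.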

dguerrei · March 26, 2019, 7:58am

Dear @pmrowczy and @etejedor,

Thank you for your feedback and for taking note of our interest.
I’ll share this information within our group, including the workshop that is being planned (is it going to be announced via the ai-hadoop-users or it-analytics-wg e-groups?)

Let me clarify that for the moment the interactive mode of SWAN is sufficient, as we’re still exploring use cases of interest and starting to build some algorithms for analytics of production data. Once we have something solid, we’ll come back to you to discuss how the jobs could be scheduled.

Regards
Diogo

etejedor · March 26, 2019, 8:12am

That’s great, thank you. Yes, we will announce via these groups too in due time.

pkothuri · March 26, 2019, 9:10am

Hi @dguerrei

We did have another user (in IT) who wanted this functionality. At that time he couldn’t use batch mode because he would then lose all his visualizations. There are commercial providers who offer this functionality (https://docs.databricks.com/user-guide/jobs.html), so your request is reasonable. We will discuss the required development work internally and decide based on whether there is a critical mass of interested users.

regards,
Prasanth