Elastic applications work on the ideal assumption of infinite resources. While large Public Cloud infrastructures may be a reasonable approximation of this condition, scientific computing centres usually work in a saturated regime, since keeping a fraction of the computing cores idle as headroom is simply not an option. At the INFN Torino Private Cloud, one of the applications (a WLCG Tier-2 Grid site) is much larger than all the others. It receives a continuous stream of jobs, so its worker-nodes never go idle; moreover, a minimum number of pledged running jobs must be guaranteed every year. This application therefore acts like a static concrete wall, preventing smaller applications from scaling freely.
The second-largest application (a DIRAC Tier-2 Grid site) shows peaks of job requests over limited periods, while its resources sit idle most of the time. Finally, our Cloud hosts an elastic facility for interactive data analysis at the LHC and a number of smaller on-demand batch farms for other local scientific use cases (e.g. nuclear plant simulations, theoretical calculations, medical image processing).
We have investigated different solutions for an efficient use of the available resources, tailored to the needs of the above-mentioned applications. We used the OpenNebula OneFlow service to implement the elasticity of the two Grid sites, with elasticity policies that are both scheduled and driven by cluster activity metrics. Virtual worker-node properties (number of cores and lifetime) have been tuned to match the statistical distribution of job durations and to balance the amount of statically allocated resources against the amount left available for competing applications to seize.
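As a rough illustration, the sketch below shows how a Grid worker-node role with combined elasticity policies could be expressed in a OneFlow service template (rendered here as a Python dictionary). The attribute names follow the OneFlow service template schema, but all numeric values are placeholders, and WAITING_JOBS is a hypothetical custom attribute that the worker-nodes would have to report back, for instance through OneGate:

```python
# Sketch of a OneFlow role combining metric-driven and scheduled
# elasticity policies; values are illustrative, not our production ones.
grid_worker_role = {
    "name": "grid-worker",
    "cardinality": 4,          # worker-nodes started at deployment
    "vm_template": 42,         # hypothetical worker-node VM template ID
    "min_vms": 4,              # never shrink below the pledged baseline
    "max_vms": 32,             # hard cap to protect other tenants
    "cooldown": 600,           # seconds to wait between scaling actions
    "elasticity_policies": [
        {
            # Grow by 2 VMs when the (hypothetical) WAITING_JOBS metric
            # stays high for 3 consecutive 60-second monitoring periods.
            "type": "CHANGE",
            "adjust": 2,
            "expression": "WAITING_JOBS > 10",
            "period_number": 3,
            "period": 60,
        },
        {
            # Shrink by 1 VM at a time once the job queue drains.
            "type": "CHANGE",
            "adjust": -1,
            "expression": "WAITING_JOBS = 0",
            "period_number": 5,
            "period": 60,
        },
    ],
    "scheduled_policies": [
        {
            # Scheduled policy: pin the role back to its baseline
            # cardinality every night (recurrence uses cron syntax).
            "type": "CARDINALITY",
            "adjust": 4,
            "recurrence": "0 22 * * *",
        },
    ],
}
```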
For the smaller applications we chose, instead, a purpose-built tool for provisioning elastic clusters. This model relies on virtual routers and elastic IPs for network sandboxing, and on a custom daemon to implement elasticity. Such self-contained elastic batch farms can easily be configured and deployed through a web interface.
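A minimal sketch of the control loop such a daemon could implement is shown below, assuming the farm exposes its queue depth and the cloud API allows worker-nodes to be instantiated and shut down. All function names, thresholds and the ceiling-division sizing rule are illustrative, not the actual tool:

```python
"""Minimal sketch of an elasticity daemon for a self-contained batch farm.

All names are illustrative: queued_jobs() would query the farm's batch
system, while running_nodes(), spawn_node() and drop_node() would wrap
the cloud API calls that list, instantiate or shut down worker-nodes.
"""
import time

MIN_NODES = 1        # keep the farm alive for interactive use
MAX_NODES = 10       # upper bound negotiated with the Cloud operators
JOBS_PER_NODE = 4    # assumed job slots offered by each worker-node
CHECK_PERIOD = 60    # seconds between successive evaluations


def queued_jobs():
    """Return the number of waiting jobs (stub: query the batch system)."""
    raise NotImplementedError


def running_nodes():
    """Return the list of current worker-node IDs (stub: cloud API)."""
    raise NotImplementedError


def spawn_node():
    """Instantiate one worker-node from its template (stub: cloud API)."""
    raise NotImplementedError


def drop_node(node_id):
    """Shut down one worker-node (stub: cloud API)."""
    raise NotImplementedError


def main():
    while True:
        nodes = running_nodes()
        # Ceiling division: enough nodes to absorb the queue, clamped
        # between the farm's minimum and maximum size.
        wanted = max(MIN_NODES,
                     min(MAX_NODES, -(-queued_jobs() // JOBS_PER_NODE)))
        if wanted > len(nodes):
            for _ in range(wanted - len(nodes)):
                spawn_node()
        elif wanted < len(nodes):
            for node_id in nodes[wanted:]:
                drop_node(node_id)   # in practice, only once drained
        time.sleep(CHECK_PERIOD)


if __name__ == "__main__":
    main()
```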
These strategies aim to be non-invasive with respect to our most common class of applications, i.e. Grid jobs, while granting fast and simple allocation of resources to local tenants.
In this presentation we describe the implementation of these models in our infrastructure and our very first operational experiences with them.
Author Biography
I graduated in nuclear physics in 2007 at the University of Torino and obtained a PhD in the same field in 2012 at the University of Heidelberg. Most of my research experience has been devoted to data analysis and detector commissioning for the LHC. For the past two years I have been employed at the INFN computing centre in Torino, where I serve as Cloud manager and developer.