Welcome to CREAM_Guide documentation!¶
CREAM Functional Description¶
The CREAM (Computing Resource Execution And Management) Service is a simple, lightweight service that implements all the operations at the Computing Element (CE) level. Its well-defined WebService-based interface and its implementation as an extension of the Java-Axis servlet (running inside the Apache Tomcat container) provide interoperability with clients written in any programming language and running on any computer platform.
The CREAM interface is described using the Web Service Description Language (WSDL); anyone can build a CREAM client by simply filling in the stub code generated by a WSDL parser (gSOAP for C/C++, Axis for Java, a SOAP module for Perl).
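As an illustration, C/C++ client stubs can be generated with the gSOAP toolchain. This is only a sketch: the WSDL URL below is a placeholder for the endpoint published by an actual CREAM CE:

# Generate a C/C++ header from the service description (URL is a placeholder)
wsdl2h -o cream.h https://myhost.mydomain:8443/ce-cream/services/CREAM2?wsdl
# Generate the client-side C++ stubs from the header
soapcpp2 -C -j cream.h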
CREAM functionality:
- Job Submission
- Possibility of direct staging of input sandbox files
- gLite JDL compliance (with CREAM-specific extensions)
- Support for batch and parallel jobs
- Manual and automatic proxy delegation
- Job Cancellation
- Job Info with configurable level of verbosity and filtering based on submission time and/or job status
- Job List
- Job Suspension and Resume
- Job output retrieval
- Job Purge for terminated jobs
- Possibility (for admin) to disable new submissions
- Self-limiting CREAM behavior: CREAM is able to protect itself if the load, memory usage, etc. grow too high. This is done by disabling new job submissions, while the other commands are still allowed
- ARGUS or gJAF based authorization
- Possibility to forward requirements to the batch system
- Integration with APEL accounting systems
CREAM can be used
- by the Workload Management System (WMS), via the ICE (Interface to CREAM Environment) service
- by a generic client, e.g. an end-user willing to directly submit jobs to a CREAM CE. A C++ command line interface (CLI) is available
The CREAM service accesses and operates local resource management systems. The current version of the application supports the following management systems:
- TORQUE
- LSF
- SLURM
- HTCondor
- Grid Engine (partially)
Authentication in CREAM is managed via the trustmanager. The Trust Manager is the component responsible for carrying out authentication operations. It is an implementation of the J2EE security specifications. Authentication is based on PKI. Each user (and Grid service) wishing to access CREAM is required to present an X.509 format certificate. These certificates are issued by trusted entities, the Certificate Authorities (CA). The role of a CA is to guarantee the identity of a user. This is achieved by issuing an electronic document (the certificate) that contains the information about the user and is digitally signed by the CA with its private key. An authentication manager, such as the Trust Manager, can verify the user identity by decrypting the hash of the certificate with the CA public key. This ensures that the certificate was issued by that specific CA. The Trust Manager can then access the user data contained in the certificate and verify the user identity.
Authorization in the CREAM CE can be implemented in two different ways (the choice is made at configuration time):
- Authorization with ARGUS
- Authorization with gJAF
Argus is a system meant to render consistent authorization decisions for distributed services (e.g. compute elements, portals). In order to achieve this consistency a number of points must be addressed. First, it must be possible to author and maintain consistent authorization policies; this is handled by the Policy Administration Point (PAP) component of the service. Second, authored policies must be evaluated in a consistent manner, a task performed by the Policy Decision Point (PDP). Finally, the data provided for evaluation against policies must be consistent (in form and definition), and this is done by the Policy Enforcement Point (PEP). Argus is also responsible for managing the Grid user - local user mapping.
gJAF (Grid Java Authorization Framework) provides a way to invoke a chain of policy engines and get a decision about the authorization of a user. The policy engines are divided into two types, depending on their functionality, and can be plugged into the framework to form the chain selected by the administrator, allowing a complete authorization system to be set up. A policy engine may be either a PIP or a PDP. PIPs collect and verify assertions and capabilities associated with the user, checking her role, group and VO attributes. PDPs may use the information retrieved by a PIP to decide whether the user is allowed to perform the requested action, whether further evaluation is needed, or whether the evaluation should be interrupted and the user access denied.
The CREAM CE supports VO-based authorization. In this scenario, implemented via the VOMS PDP, the administrator can specify authorization policies based on the VO the jobs' owners belong to (or on particular VO attributes). When gJAF is used as the authorization mechanism, the Grid user - local user mapping is managed via glexec. As far as authorization on job operations is concerned, by default each user can manage (e.g. cancel, suspend, etc.) only her own jobs. However, the CREAM administrator can define specific super-users who are empowered to also manage jobs submitted by other users.
CREAM installation¶
Requirements¶
The requirements for a basic installation of a CREAM site are:
- Scientific Linux 6, CentOS 7 or a compatible distribution.
- The UMD4 extension as described in the UMD guide
Installation¶
The basic installation of a CREAM site consists of the following mandatory packages:
- canl-java-tomcat
- cleanup-grid-accounts
- bdii
- dynsched-generic
- glite-ce-cream
- glite-info-provider-service
- globus-gridftp-server-progs
- globus-proxy-utils
- glue-schema
- kill-stale-ftp
- lcg-expiregridmapdir
- mariadb
- sudo
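For reference, on a host where the UMD repositories are already configured, the mandatory packages can be installed manually in one shot:

yum -y install canl-java-tomcat cleanup-grid-accounts bdii dynsched-generic \
    glite-ce-cream glite-info-provider-service globus-gridftp-server-progs \
    globus-proxy-utils glue-schema kill-stale-ftp lcg-expiregridmapdir \
    mariadb sudo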
If the authorization framework used is based on glexec the following packages are mandatory:
- glexec
- lcas-lcmaps-gt4-interface
- lcas-plugins-basic
- lcas-plugins-check-executable
- lcas-plugins-voms
- lcmaps-plugins-basic
- lcmaps-plugins-verify-proxy
- lcmaps-plugins-voms
If the authorization framework used is based on Argus the following package is mandatory:
- argus-gsi-pep-callout
The following packages are optional:
- apel-parsers : if the support for APEL accounting is enabled
- glite-lb-logger : if the support for the Logging and Bookkeeping is enabled
- info-dynamic-scheduler-lsf : if the batch system used is LSF
- info-dynamic-scheduler-lsf-btools : if the batch system used is LSF
- info-dynamic-scheduler-slurm : if the batch system used is SLURM
- lcg-info-dynamic-scheduler-condor : if the batch system used is HTCondor
- lcg-info-dynamic-scheduler-pbs : if the batch system used is TORQUE
The standard procedure for the installation and configuration of a CREAM site makes use of the puppet framework. The puppet module checks for the presence of the packages described above and, if they are not available on the system, installs them.
Example¶
This section illustrates an example of the installation of a CREAM site by means of a standalone execution of puppet. In the following examples the FQDN of the CREAM CE is myhost.mydomain.
The required steps are:
Installation of the packages for puppet:
yum -y install puppet
Verification of the puppet environment:
facter | grep -E 'hostname|fqdn'
If the FQDN of the host is not correctly resolved it is necessary to specify the parameter
creamce::host
Installation of the CREAM CE module for puppet:
puppet module install infnpd-creamce
Creation of the required directories:
mkdir -p /etc/puppet/manifests /var/lib/hiera/node /etc/grid-security
Creation of the file
/etc/puppet/manifests/site.pp
with the following definition:

node 'myhost.mydomain' { require creamce }
Creation of the file
/etc/hiera.yaml
with the following definitions:

---
:backends:
  - yaml
:hierarchy:
  - "node/%{fqdn}"
:yaml:
  :datadir: /var/lib/hiera
Creation of the symbolic link
ln -s /etc/hiera.yaml /etc/puppet/hiera.yaml
Creation of the CREAM CE description file
/var/lib/hiera/node/myhost.mydomain.yaml
, an example of minimal configuration is:

---
creamce::mysql::root_password : mysqlp@$$w0rd
creamce::creamdb::password : creamp@$$w0rd
creamce::creamdb::minpriv_password : minp@$$w0rd
apel::db::pass : apelp@$$w0rd
creamce::batch_system : pbs
creamce::use_argus : false
creamce::default_pool_size : 10
gridftp::params::certificate : "/etc/grid-security/hostcert.pem"
gridftp::params::key : "/etc/grid-security/hostkey.pem"
gridftp::params::port : 2811
creamce::queues :
  long : { groups : [ dteam, dteamprod ] }
  short : { groups : [ dteamsgm ] }
creamce::vo_table :
  dteam : {
    vo_app_dir : /afs/dteam,
    vo_default_se : storage.pd.infn.it,
    servers : [
      { server : voms.hellasgrid.gr, port : 15004,
        dn : /C=GR/O=HellasGrid/OU=hellasgrid.gr/CN=voms.hellasgrid.gr,
        ca_dn : "/C=GR/O=HellasGrid/OU=Certification Authorities/CN=HellasGrid CA 2016" },
      { server : voms2.hellasgrid.gr, port : 15004,
        dn : /C=GR/O=HellasGrid/OU=hellasgrid.gr/CN=voms2.hellasgrid.gr,
        ca_dn : "/C=GR/O=HellasGrid/OU=Certification Authorities/CN=HellasGrid CA 2016" }
    ],
    groups : {
      dteam : { fqan : [ "/dteam" ], gid : 9000 },
      dteamsgm : { fqan : [ "/dteam/sgm/ROLE=developer" ], gid : 9001, pub_admin : true },
      dteamprod : { fqan : [ "/dteam/prod/ROLE=developer" ], gid : 9002 }
    },
    users : {
      dteamusr : { first_uid : 6000, fqan : [ "/dteam" ],
                   name_pattern : "%<prefix>s%03<index>d" },
      dteamsgmusr : { first_uid : 6100, fqan : [ "/dteam/sgm/ROLE=developer", "/dteam" ],
                      pool_size : 5, name_pattern : "%<prefix>s%02<index>d" },
      dteamprodusr : { first_uid : 6200, fqan : [ "/dteam/prod/ROLE=developer", "/dteam" ],
                       pool_size : 5, name_pattern : "%<prefix>s%02<index>d" }
    }
  }
creamce::hardware_table :
  subcluster001 : {
    ce_cpu_model : XEON, ce_cpu_speed : 2500,
    ce_cpu_vendor : Intel, ce_cpu_version : 5.1,
    ce_physcpu : 2, ce_logcpu : 2,
    ce_minphysmem : 2048, ce_minvirtmem : 4096,
    ce_os_family : "linux", ce_os_name : "CentOS",
    ce_os_arch : "x86_64", ce_os_release : "7.0.1406",
    ce_outboundip : true, ce_inboundip : false,
    ce_runtimeenv : [ "tomcat_6_0", "mysql_5_1" ],
    subcluster_tmpdir : /var/tmp/subcluster001,
    subcluster_wntmdir : /var/glite/subcluster001,
    ce_benchmarks : { specfp2000 : 420, specint2000 : 380, hep-spec06 : 780 },
    nodes : [ "node-01.mydomain", "node-02.mydomain", "node-03.mydomain" ],
    # Experimental support to GPUs
    accelerators : {
      acc_device_001 : { type : GPU, log_acc : 4, phys_acc : 2,
                         vendor : NVidia, model : "Tesla k80", version : 4.0,
                         clock_speed : 3000, memory : 4000 }
    }
  }
creamce::software_table :
  tomcat_6_0 : { name : "tomcat", version : "6.0.24", license : "ASL 2.0",
                 description : "Tomcat is the servlet container" }
  mysql_5_1 : { name : "mysql", version : "5.1.73", license : "GPLv2 with exceptions",
                description : "MySQL is a multi-user, multi-threaded SQL database server" }
creamce::vo_software_dir : /afs
creamce::se_table :
  storage.pd.infn.it : { mount_dir : "/data/mount", export_dir : "/storage/export",
                         type : Storm, default : true }
  cloud.pd.infn.it : { mount_dir : "/data/mount", export_dir : "/storage/export",
                       type : Dcache }
The permissions of the file
/var/lib/hiera/node/myhost.mydomain.yaml
must be set to 600.
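This can be done with:

chmod 600 /var/lib/hiera/node/myhost.mydomain.yaml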
Deployment of the host private key in
/etc/grid-security/hostkey.pem
Deployment of the host certificate in
/etc/grid-security/hostcert.pem
Execution of puppet
puppet apply /etc/puppet/manifests/site.pp
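Optionally, the run can first be simulated without applying any change, using puppet's standard dry-run flag:

puppet apply --noop /etc/puppet/manifests/site.pp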
Puppet Configuration¶
The CREAM CE site can be configured with a puppet module. The module relies on hiera for handling configuration parameters. Parameters are marked by type; the simple types are "string", "integer" and "boolean"; the complex types are "list" and "hash".
Hiera can work with different back-ends, such as YAML and JSON. For the examples of parameter definitions in this document the YAML syntax is used. According to the YAML specification, the complex types can be declared in two different modes:
with indentation:
# Example of list type
blah::shared_directories :
  - "/mnt/afs"
  - "/media/cvmfs"

# Example of hash type, with nested complex definitions
creamce::queues :
  long :
    groups :
      - dteam
      - dteamprod
  short :
    groups :
      - dteamsgm
with block style:
# Example of list type
blah::shared_directories : [ "/mnt/afs", "/media/cvmfs" ]

# Example of hash type, with nested complex definitions
creamce::queues : {
  long : { groups : [ dteam, dteamprod ] },
  short : { groups : [ dteamsgm ] }
}
Mixing the two modes is possible but may be misleading; if needed, the hiera configuration can be tested with tools such as yamllint, available in the EPEL distribution.
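For example, to validate the node description file used in the previous section:

yamllint /var/lib/hiera/node/myhost.mydomain.yaml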
CREAM service parameters¶
These are the basic configuration parameters related to the CREAM service:
- creamce::batch_system (string): The installed batch system, mandatory, one of "pbs", "slurm", "condor", "lsf"
- creamce::host (string): The fully qualified Computing Element host name, default the host name
- creamce::port (integer): The Tomcat listen port, default 8443
- creamce::site::name (string): The human-readable name of the site, default the host name
- creamce::site::email (string): The main email contact for the site, expressed as a comma-separated list of email addresses, default undefined
- creamce::sandbox_path (string): The directory where the sandbox files are staged on the CREAM CE node, default "/var/cream_sandbox"
- creamce::delegation::purge_rate (integer): Specifies in minutes how often the delegation purger has to run, default 10 minutes
- creamce::lease::time (integer): The maximum allowed lease time in seconds; if a client specifies a longer lease time, this value is used instead, default 36000 seconds
- creamce::lease::rate (integer): Specifies in minutes how often the lease purger has to run, default 30 minutes
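A hiera fragment setting some of these parameters might look as follows (all values are illustrative):

---
creamce::batch_system : slurm
creamce::site::name : "MY-SITE"
creamce::site::email : "grid-admin@mydomain"
creamce::sandbox_path : "/var/cream_sandbox"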
The following parameters enable and configure the internal monitoring system. Whenever a resource metric exceeds the related resource limit the CREAM CE suspends the job submission.
- creamce::enable_limiter (boolean): To disable the limiter, set this parameter to false and restart the service, default true
- creamce::limit::load1 (integer): Limiter threshold for the load average (1 minute), default 40
- creamce::limit::load5 (integer): Limiter threshold for the load average (5 minutes), default 40
- creamce::limit::load15 (integer): Limiter threshold for the load average (15 minutes), default 20
- creamce::limit::memusage (integer): Limiter threshold for the memory usage, expressed as a percentage, default 95
- creamce::limit::swapusage (integer): Limiter threshold for the swap usage, expressed as a percentage, default 95
- creamce::limit::fdnum (integer): Limiter threshold for the number of file descriptors, default 500
- creamce::limit::diskusage (integer): Limiter threshold for the disk usage, expressed as a percentage, default 95
- creamce::limit::ftpconn (integer): Limiter threshold for the number of concurrent FTP connections, default 30
- creamce::limit::fdtomcat (integer): Limiter threshold for the number of file descriptors used by Tomcat, default 800
- creamce::limit::activejobs (integer): Limiter threshold for the number of active jobs, default -1 (unlimited)
- creamce::limit::pendjobs (integer): Limiter threshold for the number of pending jobs, default -1 (unlimited)
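For instance, to raise the one-minute load threshold and cap the number of active jobs (illustrative values):

creamce::enable_limiter : true
creamce::limit::load1 : 60
creamce::limit::activejobs : 5000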
The following parameters configure the internal job purger mechanism. Except for the purger rate, all the values are expressed in days:
- creamce::job::purge_rate (integer): Specifies in minutes how often the job purger has to run, default 300 minutes
- creamce::purge::aborted (integer): The age in days beyond which the job purger deletes aborted jobs, default 10 days
- creamce::purge::cancel (integer): The age in days beyond which the job purger deletes cancelled jobs, default 10 days
- creamce::purge::done (integer): The age in days beyond which the job purger deletes executed jobs, default 10 days
- creamce::purge::failed (integer): The age in days beyond which the job purger deletes failed jobs, default 10 days
- creamce::purge::register (integer): The age in days beyond which the job purger deletes registered jobs, default 2 days
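A fragment keeping terminated jobs for two weeks and running the purger every 12 hours (illustrative values):

creamce::job::purge_rate : 720
creamce::purge::done : 14
creamce::purge::failed : 14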
The following parameters configure the job wrapper:
- creamce::jw::proxy_retry_wait (integer): The minimum time interval, expressed in seconds, between the first and the second attempt to retrieve the user delegation proxy, default 60
- creamce::jw::isb::retry_count (integer): The maximum number of ISB file transfers that should be tried, default 2
- creamce::jw::isb::retry_wait (integer): If a failure occurs during an input sandbox file transfer, the JW retries the operation after a while. This parameter sets the initial wait time (i.e. the wait time between the first attempt and the second one), expressed in seconds; on every further attempt the sleep time is doubled. Default 60 seconds
- creamce::jw::osb::retry_count (integer): The maximum number of OSB file transfers that should be tried, default 2
- creamce::jw::osb::retry_wait (integer): If a failure occurs during an output sandbox file transfer, the JW retries the operation after a while. This parameter sets the initial wait time (i.e. the wait time between the first attempt and the second one), expressed in seconds; on every further attempt the sleep time is doubled. Default 300 seconds
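For instance, to allow more input sandbox transfer attempts with a longer initial back-off (illustrative values):

creamce::jw::isb::retry_count : 4
creamce::jw::isb::retry_wait : 120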
CREAM Database¶
The following parameters configure the CREAM back-end:
- creamce::mysql::root_password (string): The root password for the database administrator, mandatory
- creamce::creamdb::password (string): The database user password for the main operator, mandatory
- creamce::creamdb::minpriv_password (string): The database user password for the monitor agent, mandatory
- creamce::mysql::max_active (integer): The maximum number of active database connections that can be allocated from this pool at the same time, or negative for no limit, default 200
- creamce::mysql::min_idle (integer): The minimum number of connections that can remain idle in the pool, without extra ones being created, or zero to create none, default 30
- creamce::mysql::max_wait (integer): The maximum number of milliseconds that the pool will wait for a connection to be returned before throwing an exception, or -1 to wait indefinitely, default 10000
- creamce::mysql::override_options (hash): See the override option defined in https://forge.puppet.com/puppetlabs/mysql, default {'mysqld' => {'bind-address' => '0.0.0.0', 'max_connections' => "450" }}
- creamce::creamdb::name (string): The database name for the CREAM service, default "creamdb"
- creamce::creamdb::user (string): The database user name acting as main operator, default "cream"
- creamce::creamdb::host (string): The fully qualified host name for any CE databases, default the host name
- creamce::creamdb::port (integer): The mysql listen port for any CE databases, default 3306
- creamce::creamdb::minpriv_user (string): The database user name acting as monitor agent, default "minprivuser"
- creamce::delegationdb::name (string): The database name for the Delegation Service, default "delegationcreamdb"
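The three passwords are the only mandatory entries; a minimal fragment (the values are placeholders to be replaced):

creamce::mysql::root_password : "changeme_root"
creamce::creamdb::password : "changeme_cream"
creamce::creamdb::minpriv_password : "changeme_minpriv"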
BLAH¶
These are the basic configuration parameters for BLAH:
- blah::config_file (string): The path of the main BLAH configuration file, default "/etc/blah.config"
- blah::logrotate::interval (integer): The interval in days for log rotation, default 365 days
- blah::logrotate::size (string): The maximum size of a log file, default "10M"
- blah::use_blparser (boolean): If true the BLParser service is enabled, otherwise BUpdater/BNotifier is used, default false
- creamce::blah_timeout (integer): The maximum time interval in seconds accepted by CREAM for the execution of commands by BLAH, default 300 seconds
- creamce::job::prefix (string): The prefix to be used for the BLAH job id, default "cream_"
- blah::shared_directories (list): A list of paths that are shared between the batch system head node and the worker nodes; the default value is the empty list
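For example, on a site where the home directories are exported via NFS to the worker nodes (paths are illustrative):

blah::use_blparser : false
blah::shared_directories : [ "/home", "/mnt/nfs/data" ]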
The following parameters configure BLAH if BNotifier/BUpdater are enabled:
- blah::bupdater::loop_interval (integer): The interval in seconds between two BUpdater sessions, default 30 seconds
- blah::bupdater::notify_port (integer): The service port for the BNotifier, default 56554
- blah::bupdater::logrotate::interval (integer): The interval in days for log rotation, default 50
- blah::bupdater::logrotate::size (string): The maximum size of a log file, default "10M"
The following parameters configure BLAH if the BLParser is enabled:
- blah::blp::host (string): The host name for the primary BLParser, mandatory if the BLParser is used, default undefined
- blah::blp::port (integer): The service port for the primary BLParser, default 33333
- blah::blp::num (integer): The number of enabled BLParser instances, default 1
- blah::blp::host1 (string): The host name for the secondary BLParser, default undefined
- blah::blp::port1 (integer): The service port for the secondary BLParser, default 33334
- creamce::listener_port (integer): The port used by CREAM to receive notifications about job status changes sent by the BLParser/JobWrapper, default 49152
- creamce::blp::retry_delay (integer): The time interval in seconds between two attempts to contact the BLAH parser, default 60 seconds
- creamce::blp::retry_count (integer): The number of attempts to contact the BLAH parser (if it is not reachable) before giving up; if -1 is specified, CREAM never gives up, default 100
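A fragment enabling the legacy BLParser; the host name is a placeholder:

blah::use_blparser : true
blah::blp::host : "blparser.mydomain"
blah::blp::port : 33333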
CREAM information system¶
The following parameters configure the Resource BDII:
- bdii::params::user (string): The local user running the BDII service, default "ldap"
- bdii::params::group (string): The local group running the BDII service, default "ldap"
- bdii::params::port (integer): The BDII service port, default 2170
- creamce::use_locallogger (boolean): True if the local logger service must be installed and configured, default false
- creamce::info::capability (list): The list of capabilities for a CREAM site; it is a list of strings, in general with format "name=value", default empty list
- creamce::vo_software_dir (string): The base directory for the installation of the software used by Virtual Organizations
- creamce::workarea::shared (boolean): True if the working area is shared across different Execution Environment instances, typically via an NFS mount; this attribute applies to single-slot jobs, default false
- creamce::workarea::guaranteed (boolean): True if the job is guaranteed the full extent of the WorkingAreaTotal; this attribute applies to single-slot jobs, default false
- creamce::workarea::total (integer): The total size in GB of the working area available to all single-slot jobs, default 0
- creamce::workarea::free (integer): The amount of free space in GB currently available in the working area to all single-slot jobs, default 0 GB
- creamce::workarea::lifetime (integer): The minimum guaranteed lifetime in seconds of the files created by single-slot jobs in the working area, default 0 seconds
- creamce::workarea::mslot_total (integer): The total size in GB of the working area available to all multi-slot Grid jobs, shared across all the Execution Environments, default 0 GB
- creamce::workarea::mslot_free (integer): The amount of free space in GB currently available in the working area to all multi-slot jobs, shared across all the Execution Environments, default 0 GB
- creamce::workarea::mslot_lifetime (integer): The minimum guaranteed lifetime in seconds of the files created by multi-slot jobs in the working area, default 0 seconds
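For example, to publish a shared working area of 500 GB with a one-day guaranteed file lifetime (illustrative values):

creamce::workarea::shared : true
creamce::workarea::total : 500
creamce::workarea::lifetime : 86400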
Hardware table¶
The hardware table contains the information about the resources of the site; the parameter to be used is creamce::hardware_table. The hardware table is a hash with the following structure:

- the key of an entry in the table is the ID assigned to the homogeneous sub-cluster of machines (see GLUE2 execution environment).
- the value of an entry in the table is a hash containing the definitions for the homogeneous sub-cluster; the supported mandatory keys are:
- ce_cpu_model (string): The name of the physical CPU model, as defined by the vendor, for example "XEON"
- ce_cpu_speed (integer): The nominal clock speed of the physical CPU, expressed in MHz
- ce_cpu_vendor (string): The name of the physical CPU vendor, for example "Intel"
- ce_cpu_version (string): The specific version of the physical CPU model, as defined by the vendor
- ce_physcpu (integer): The number of physical CPUs (sockets) in a worker node of the sub-cluster
- ce_logcpu (integer): The number of logical CPUs (cores) in a worker node of the sub-cluster
- ce_minphysmem (integer): The total amount of physical RAM in a worker node of the sub-cluster, expressed in MB
- ce_os_family (string): The general family of the Operating System installed in a worker node ("linux", "macosx", "solaris", "windows")
- ce_os_name (string): The specific name of the Operating System installed in a worker node, for example "RedHat"
- ce_os_arch (string): The platform type of the worker node, for example "x86_64"
- ce_os_release (string): The version of the Operating System installed in a worker node, as defined by the vendor, for example "7.0.1406"
- nodes (list): The list of the names of the worker nodes of the sub-cluster
the supported optional keys are:
- ce_minvirtmem (integer): The total amount of virtual memory (RAM and swap space) in a worker node of the sub-cluster, expressed in MB
- ce_outboundip (boolean): True if a worker node has out-bound connectivity, false otherwise, default true
- ce_inboundip (boolean): True if a worker node has in-bound connectivity, false otherwise, default false
- ce_runtimeenv (list): The list of tags associated to the software packages installed in the worker node; the definition of each tag is listed in the software table, default empty list
- ce_benchmarks (hash): The hash table containing the values of the standard benchmarks ("specfp2000", "specint2000", "hep-spec06"); each key of the table corresponds to the benchmark name, default empty hash
- subcluster_tmpdir (string): The path of a temporary directory shared across worker nodes (see GLUE 1.3)
- subcluster_wntmdir (string): The path of a temporary directory local to each worker node (see GLUE 1.3)
Software table¶
The software table contains the information about the applications installed in the worker nodes; the parameter to be used is creamce::software_table. The software table is a hash with the following structure:

- the key of an entry in the table is the tag assigned to the software installed on the machines (see GLUE2 application environment); tags are used as a reference (ce_runtimeenv) in the hardware table.
- the value of an entry in the table is a hash containing the definitions for the software installed on the machines; the supported keys are:
- name (string): The name of the software installed, for example the package name, mandatory
- version (string): The version of the software installed, mandatory
- license (string): The license of the software installed, default unpublished
- description (string): The description of the software installed, default unpublished
Queues table¶
The queues table contains the definitions of the batch system queues and the local user groups allowed to use them; the parameter to be declared is creamce::queues. The queues table is a hash with the following structure:

- the key of an entry in the table is the name of the batch system queue/partition.
- the value of an entry in the table is a hash containing the definitions for the related queue/partition; the supported key for the definitions is:
- groups (list): The list of local groups which are allowed to operate the queue/partition; each group MUST BE defined in the VO table.
Storage table¶
The storage table contains the information about the set of storage elements bound to this site. The parameter to be declared is creamce::se_table, the default value is an empty hash. The storage element table is a hash with the following structure:

- the key of an entry in the table is the name of the storage element host.
- the value of an entry in the table is a hash containing the definitions for the related storage element; the supported keys for the definitions are:
- type (string): The name of the application which is installed in the storage element ("Storm", "DCache", etc.)
- mount_dir (string): The local path within the Computing Service which makes it possible to access files in the associated Storage Service (this is typically an NFS mount point)
- export_dir (string): The remote path in the Storage Service which is associated to the local path in the Computing Service (this is typically an NFS exported directory)
- default (boolean): True if the current storage element must be considered the primary SE, default false. Just one item in the storage element table can be marked as primary.
Example¶
---
creamce::queues :
long : { groups : [ dteam, dteamprod ] }
short : { groups : [ dteamsgm ] }
creamce::hardware_table :
subcluster001 : {
ce_cpu_model : XEON,
ce_cpu_speed : 2500,
ce_cpu_vendor : Intel,
ce_cpu_version : 5.1,
ce_physcpu : 2,
ce_logcpu : 2,
ce_minphysmem : 2048,
ce_minvirtmem : 4096,
ce_os_family : "linux",
ce_os_name : "CentOS",
ce_os_arch : "x86_64",
ce_os_release : "7.0.1406",
ce_outboundip : true,
ce_inboundip : false,
ce_runtimeenv : [ "tomcat_6_0", "mysql_5_1" ],
subcluster_tmpdir : /var/tmp/subcluster001,
subcluster_wntmdir : /var/glite/subcluster001,
ce_benchmarks : { specfp2000 : 420, specint2000 : 380, hep-spec06 : 780 },
nodes : [ "node-01.mydomain", "node-02.mydomain", "node-03.mydomain" ],
# Experimental support to GPUs
accelerators : {
acc_device_001 : {
type : GPU,
log_acc : 4,
phys_acc : 2,
vendor : NVidia,
model : "Tesla k80",
version : 4.0,
clock_speed : 3000,
memory : 4000
}
}
}
creamce::software_table :
tomcat_6_0 : {
name : "tomcat",
version : "6.0.24",
license : "ASL 2.0",
description : "Tomcat is the servlet container"
}
mysql_5_1 : {
name : "mysql",
version : "5.1.73",
license : "GPLv2 with exceptions",
description : "MySQL is a multi-user, multi-threaded SQL database server"
}
creamce::vo_software_dir : /afs
creamce::se_table :
storage.pd.infn.it : { mount_dir : "/data/mount", export_dir : "/storage/export",
type : Storm, default : true }
cloud.pd.infn.it : { mount_dir : "/data/mount", export_dir : "/storage/export",
type : Dcache }
CREAM security and accounting¶
The following parameters configure the security layer and the pool account system:
- creamce::host_certificate (string): The complete path of the installed host certificate, default /etc/grid-security/hostcert.pem
- creamce::host_private_key (string): The complete path of the installed host key, default /etc/grid-security/hostkey.pem
- creamce::voms_dir (string): The location for the deployment of the VO description files (LSC), default /etc/grid-security/vomsdir
- creamce::gridmap_dir (string): The location for the pool account files, default /etc/grid-security/gridmapdir
- creamce::gridmap_file (string): The location of the pool account description file, default /etc/grid-security/grid-mapfile
- creamce::gridmap_extras (list): The list of custom entries for the pool account description file, default empty list
- creamce::gridmap_cron_sched (string): The cron time parameters for the pool account cleaner, default "5 * * * *"
- creamce::groupmap_file (string): The path of the groupmap file, default /etc/grid-security/groupmapfile
- creamce::crl_update_time (integer): The CRL refresh time in seconds, default 3600 seconds
- creamce::ban_list_file (string): The path of the ban list file, if gJAF/LCMAPS is used, default /etc/lcas/ban_users.db
- creamce::ban_list (list): The list of banned users, each item being a Distinguished Name in old openssl format; if not defined the list is not managed by puppet
- creamce::use_argus (boolean): True if the Argus authorization framework must be used, false if gJAF must be used, default true
- creamce::argus::service (string): The Argus PEPd service host name, mandatory if creamce::use_argus is set to true
- creamce::argus::port (integer): The Argus PEPd service port, default 8154
- creamce::argus::timeout (integer): The timeout in seconds for the connection to the Argus PEPd server, default 30 seconds
- creamce::argus::resourceid (string): The ID of the CREAM service to be registered in Argus, default https://{ce_host}:{ce_port}/cream
- creamce::admin::list (list): The list of the service administrators' Distinguished Names, default empty list
- creamce::admin::list_file (string): The path of the file containing the service administrators list, default /etc/grid-security/admin-list
- creamce::default_pool_size (integer): The default number of users in a pool account, used if pool_size is not defined for a VO group, default 100
- creamce::create_user (boolean): False if the creation of the users of all the pool accounts is disabled, default true
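For example, to delegate authorization to an Argus server (the host name is a placeholder):

creamce::use_argus : true
creamce::argus::service : "argus.mydomain"
creamce::argus::port : 8154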
VO table¶
The VO table contains the information related to pool accounts, groups and VO data. The parameter to be declared is creamce::vo_table, the default value is an empty hash. The VO table is a hash: the key of an entry in the table is the name or ID of the virtual organization, and the corresponding value is a hash containing the definitions for the virtual organization. The supported keys for the definitions are:
- servers (list): The list of configuration details for the VOMS servers. Each item in the list is a hash; the parameter and all the supported keys of a contained hash are mandatory. The supported keys are:
  - server (string): The VOMS server FQDN
  - port (integer): The VOMS server port
  - dn (string): The distinguished name of the VOMS server, as declared in the VOMS service certificate
  - ca_dn (string): The distinguished name of the issuer of the VOMS service certificate
- groups (hash): The list of local groups and associated FQANs; the parameter is mandatory. Each key of the hash is the group name, each value is a hash with the following keys:
  - gid (string): The unix group id, mandatory
  - fqan (list): The list of VOMS Fully Qualified Attribute Names; the items in the list are used to compose the group map file. The parameter is mandatory
- users (hash): The description of the pool accounts or static users; the parameter is mandatory. Each key of the hash is the pool account prefix, or the user name for a static user; each value is a hash with the following keys:
  - name_pattern (string): The pattern used to create the user names of the pool account; the variables used for the substitutions are prefix (the pool account prefix) and index (a consecutive index described below); the expression is explained in the ruby guide, default value is "%<prefix>s%03<index>d"
  - primary_fqan (list): The list of the primary VOMS Fully Qualified Attribute Names associated with the users of the pool account. The attributes are used to calculate the primary group of the user and to compose the grid-map file. The mapping between FQANs and groups refers to the groups hash for the given VO; for further details about the mapping algorithm refer to the authorization guide. The parameter is mandatory
  - secondary_fqan (list): The list of the secondary VOMS Fully Qualified Attribute Names associated with the users of the pool account; the items in the list are used to calculate the secondary groups of the user. The parameter is optional
  - create_user (boolean): False if the creation of the users is disabled for the current pool account, default true
  - pub_admin (boolean): True if the pool account is the defined administrator account, default false; just one administrator account is supported. The first user in the pool account, or the static user if it is the case, is selected for publishing information about VO tags
  - accounts (list): The list of SLURM accounts associated with this set of users; further details can be found in the SLURM specific section
A pool account can be defined in two different ways. If the user IDs are consecutive the parameters required are:
- first_uid (integer): The initial number for the unix user id of the pool account; the other ids are obtained incrementally, with a step equal to 1
- pool_size (integer): The number of users in the current pool account; the default value is the global definition contained in creamce::default_pool_size. If the value for the pool size is equal to zero, the current definition must be considered a static user.
If the user IDs are not consecutive their values must be specified with the parameter:
- uid_list (list): The list of user IDs; the pool account size is equal to the number of elements of the list.
In any case the user name is created using the pattern specified by the parameter name_pattern, where the index ranges from 1 to the pool account size (included). It is possible to shift the range of the indexes using the parameter creamce::username_offset.

Two further keys of the VO definition describe VO-wide settings:

- vo_app_dir (string): The path of a shared directory available for application data for the current Virtual Organization, as described by Info.ApplicationDir in GLUE 1.3
- vo_default_se (string): The default Storage Element associated with the current Virtual Organization; it must be one of the keys of the storage element table
Example¶
---
creamce::use_argus : false
creamce::default_pool_size : 10
creamce::username_offset : 1
creamce::vo_table :
dteam : {
vo_app_dir : /afs/dteam,
vo_default_se : storage.pd.infn.it,
servers : [
{
server : voms.hellasgrid.gr,
port : 15004,
dn : /C=GR/O=HellasGrid/OU=hellasgrid.gr/CN=voms.hellasgrid.gr,
ca_dn : "/C=GR/O=HellasGrid/OU=Certification Authorities/CN=HellasGrid CA 2016"
},
{
server : voms2.hellasgrid.gr,
port : 15004,
dn : /C=GR/O=HellasGrid/OU=hellasgrid.gr/CN=voms2.hellasgrid.gr,
ca_dn : "/C=GR/O=HellasGrid/OU=Certification Authorities/CN=HellasGrid CA 2016"
}
],
groups : {
dteam : { fqan : [ "/dteam" ], gid : 9000 },
dteamsgm : { fqan : [ "/dteam/sgm/ROLE=developer" ], gid : 9001 },
dteamprod : { fqan : [ "/dteam/prod/ROLE=developer" ], gid : 9002 }
},
users : {
dteamusr : { first_uid : 6000, primary_fqan : [ "/dteam" ],
name_pattern : "%<prefix>s%03<index>d" },
dteamsgmusr : { first_uid : 6100, pool_size : 5,
primary_fqan : [ "/dteam/sgm/ROLE=developer" ],
secondary_fqan : [ "/dteam" ],
name_pattern : "%<prefix>s%02<index>d",
pub_admin : true },
dteamprodusr : { primary_fqan : [ "/dteam/prod/ROLE=developer" ],
secondary_fqan : [ "/dteam" ],
name_pattern : "%<prefix>s%02<index>d",
uid_list : [ 6200, 6202, 6204, 6206, 6208 ] }
}
}
CREAM with TORQUE¶
The TORQUE cluster must be installed before the deployment of CREAM; there is no support in the CREAM CE puppet module for the deployment of TORQUE. Nevertheless the module may be used to configure the TORQUE client on the CREAM CE node, if and only if the node is different from the TORQUE server node. The YAML parameter which enables the TORQUE client configuration is torque::config::client; if it is set to false the configuration is disabled, the default value is true. The CREAM CE puppet module can create queues and pool accounts in TORQUE; the YAML parameter is torque::config::pool, if it is set to false the feature is disabled, the default value is true.
Other configuration parameters for TORQUE are:
- torque::host (string): The TORQUE server host name, default value is the host name
- torque::multiple_staging (boolean): The BLAH parameter for multiple staging, default false
- torque::tracejob_logs (integer): The BLAH parameter for tracejob, default 2
- munge::key_path (string): The location of the munge key. If the TORQUE client configuration is enabled the path is used to retrieve the manually installed key; mandatory if torque::config::client is set to true
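A fragment for a CREAM CE node that is not the TORQUE server (host name and path are placeholders):

torque::host : "torque-server.mydomain"
torque::config::client : true
munge::key_path : "/etc/munge/munge.key"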
CREAM with SLURM¶
The SLURM cluster must be installed before the deployment of CREAM; there is no support in the CREAM CE puppet module for the deployment of SLURM. The module provides an experimental feature for configuring SLURM users and accounts if the accounting subsystem is enabled in SLURM. The YAML parameter which enables the experimental feature is slurm::config_accounting, the default value is false. If it is set to true, each user of the pool account is replicated in the SLURM accounting subsystem. The site administrator can associate one or more SLURM accounts with each replicated user in two different ways:

- Specifying a list of accounts already created in SLURM. The list of SLURM accounts associated to the new user is specified by the parameter accounts of the users definition of the VO table; the parameter is mandatory in this case.
- Delegating the creation of the SLURM accounts to the puppet module. The module creates a SLURM account for each VO and a SLURM sub-account for each group in a given VO. The parameter to be set for enabling the automatic creation of accounts is slurm::standard_accounts, its default value is false.
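For example, enabling the experimental accounting integration with automatically created accounts (illustrative):

creamce::batch_system : slurm
slurm::config_accounting : true
slurm::standard_accounts : true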
CREAM with HTCondor¶
The HTCondor cluster must be installed before the deployment of CREAM; there is no support in the CREAM CE puppet module for the deployment of HTCondor.
The configuration parameters for HTCondor are:
- condor::deployment_mode (string): The queue implementation model used; the value can be "queue_to_schedd" or "queue_to_jobattribute", the default is "queue_to_schedd"
- condor_queue_attr (string): The classad attribute used to identify the queue when the model used is "queue_to_jobattribute"; mandatory if the deployment mode is "queue_to_jobattribute"
- condor_user_history (boolean): True if condor_history should be used to get the final state info about the jobs, the default is false
- condor::config::dir (string): The directory containing the configuration files for the HTCondor installation, the default is /etc/condor/config.d
- condor::command_caching_filter (string): The executable for caching the batch system commands; if not specified the caching mechanism is disabled
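A fragment selecting the job-attribute queue model; the classad attribute name is a placeholder:

creamce::batch_system : condor
condor::deployment_mode : "queue_to_jobattribute"
condor_queue_attr : "CreamQueue"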
Experimental features¶
The features described in this section are subject to frequent changes and must be considered unstable. Use them at your own risk.
GPU support configuration¶
The GPU support in the information system (BDII) can be enabled with the configuration parameter creamce::info::glue21_draft (boolean); the default value is false. The GPU resources must be described in the hardware table, inserting in the related sub-cluster hashes the following parameter:
- accelerators (hash): The hash table containing the definitions for any accelerator device mounted in the sub-cluster. Each item in the table is a key-value pair: the key is the accelerator ID of the device, and the value is a hash with the following mandatory definitions:
  - type (string): The type of the device (GPU, MIC, FPGA)
  - log_acc (integer): The number of logical accelerator units in the sub-cluster
  - phys_acc (integer): The number of physical accelerator devices (cards) in the sub-cluster
  - vendor (string): The vendor ID
  - model (string): The model of the device
  - version (string): The version of the device
  - clock_speed (integer): The clock speed of the device in MHz
  - memory (integer): The amount of memory in the device in MByte
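To publish the accelerator devices defined in the hardware table example above, the GLUE 2.1 draft rendering must be switched on:

creamce::info::glue21_draft : true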
CREAM User’s Guide¶
CREAM Command Line Interface Guide¶
This section briefly explains the sequence of operations to be performed by a user to submit and then manage jobs on CREAM based CEs, referring to the C++ Command Line Interface (CLI).
Before starting: get your user proxy¶
Before using any of the CREAM client commands, it is necessary to have a
valid proxy credential available on the client machine. You can create
it using the voms-proxy-init
command. If you already have a valid
proxy available on your machine just make the X509_USER_PROXY
environment variable point to it.
In order to get a proxy certificate issued by VOMS, you should have in
the directory /etc/vomses
the proper VOMS file containing a line as
follows:
"EGEE" "kuiken.nikhef.nl" "15001" "/O=dutchgrid/O=hosts/OU=nikhef.nl/CN=kuiken.nikhef.nl" "EGEE" "22"
or the corresponding line for your VO. You also need to install the VO
related .lsc
files in the /etc/grid-security/vomsdir/<VO>
directory. In a standard EMI UI installation, these settings should already be there.
Moreover, make sure you have your certificate/key pair in the directory $HOME/.globus, i.e. the following files:
- usercert.pem
- userkey.pem
Note that file permissions are important: the two files must have respectively 0600 and 0400 permissions.
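For example:

chmod 600 $HOME/.globus/usercert.pem
chmod 400 $HOME/.globus/userkey.pem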
Then you can issue the VOMS client command (you will be prompted for the pass-phrase):
$ voms-proxy-init -voms dteam
Enter GRID pass phrase:
Your identity: /C=IT/O=INFN/OU=Personal Certificate/L=Padova/CN=Massimo Sgaravatto
Creating temporary proxy ................................................................................. Done
Contacting voms2.hellasgrid.gr:15004 [/C=GR/O=HellasGrid/OU=hellasgrid.gr/CN=voms2.hellasgrid.gr] "dteam" Done
Creating proxy .............................. Done
Your proxy is valid until Sat Apr 30 05:05:49 2011
$ voms-proxy-info -all
subject : /C=IT/O=INFN/OU=Personal Certificate/L=Padova/CN=Massimo Sgaravatto/CN=proxy
issuer : /C=IT/O=INFN/OU=Personal Certificate/L=Padova/CN=Massimo Sgaravatto
identity : /C=IT/O=INFN/OU=Personal Certificate/L=Padova/CN=Massimo Sgaravatto
type : proxy
strength : 1024 bits
path : /tmp/x509up_u500
timeleft : 11:59:55
key usage : Digital Signature, Key Encipherment, Data Encipherment
=== VO dteam extension information ===
VO : dteam
subject : /C=IT/O=INFN/OU=Personal Certificate/L=Padova/CN=Massimo Sgaravatto
issuer : /C=GR/O=HellasGrid/OU=hellasgrid.gr/CN=voms2.hellasgrid.gr
attribute : /dteam/Role=NULL/Capability=NULL
attribute : /dteam/italy/Role=NULL/Capability=NULL
attribute : /dteam/italy/INFN-PADOVA/Role=NULL/Capability=NULL
timeleft : 11:59:55
uri : voms2.hellasgrid.gr:15004
CREAM CLI commands¶
The most relevant commands to interact with CREAM based CEs are:
glite-ce-job-submit <jdl_file_1> ... <jdl_file_N>
submits N jobs (N must be >=1) to a CREAM based CE. It requires N JDL files as input and returns N CREAM job identifiers.
glite-ce-delegate-proxy <delegation_Id>
allows the user to delegate her proxy credential to the CREAM service. This delegated credential can then be used for job submissions.
glite-ce-job-status <job_Id_1> ... <job_Id_N>
displays information (in particular the states) of N jobs (N must be >=1) previously submitted to CREAM based CEs.
glite-ce-job-list <host[:port]>
lists the identifiers of jobs submitted to a CREAM based CE by the user issuing the command.
glite-ce-job-cancel <job_Id_1> ... <job_Id_N>
cancels N jobs (N must be >=1) previously submitted to CREAM based CEs.
glite-ce-job-suspend <job_Id_1> ... <job_Id_N>
suspends the execution of N jobs (N must be >=1) previously submitted to CREAM based CEs.
glite-ce-job-resume <job_Id_1> ... <job_Id_N>
resumes the execution of N jobs (N must be >=1) which have been previously suspended.
glite-ce-job-output <job_Id_1> ... <job_Id_N>
retrieves the output sandbox files of N jobs (N must be >=1) previously submitted to CREAM based CEs.
glite-ce-job-purge <job_Id_1> ... <job_Id_N>
clears N jobs (N must be >=1) from CREAM based CEs. After this operation the purged jobs can't be managed anymore.
glite-ce-proxy-renew <delegation_Id_1> ... <delegation_Id_N>
renews N delegations (N must be >=1), and therefore refreshes the proxy of the jobs submitted to CREAM based CEs using the considered delegations.
glite-ce-service-info <host[:port]>
returns information about the CREAM service (version, status, etc.).
glite-ce-enable-submission <host[:port]>
(re-)enables job submissions on the specified CREAM CE.
glite-ce-disable-submission <host[:port]>
disables job submissions on the specified CREAM CE.
glite-ce-allowed-submission <host[:port]>
checks if jobs submissions on the specified CREAM CE are allowed or have been disabled.
glite-ce-job-lease <lease_identifier> --endpoint <cream_endpoint> \ --leaseTime <lease_time>
creates a lease identifier in the CREAM server and associates a time duration with it.
Man pages are available for all the CREAM client commands. You can also access information about the usage of each command by issuing:
$ <command> --help
Submitting jobs to CREAM based CEs¶
To submit jobs to CREAM based CEs, the command glite-ce-job-submit
must be used. The glite-ce-job-submit
command requires as input one
or more job description files; each file describes the job
characteristics and requirements through the JDL (Job Description
Language). A typical example of a JDL job description file is:
[
Type = "Job";
JobType = "Normal";
Executable = "myexe";
StdInput = "myinput.txt";
StdOutput = "message.txt";
StdError = "error.txt";
InputSandbox = {"/users/seredova/example/myinput.txt",
"/users/seredova/example/myexe"};
OutputSandbox = {"message.txt", "error.txt"};
OutputSandboxBaseDestUri = "gsiftp://se.pd.infn.it/data/seredova";
]
Such a JDL would make the myexe executable be transferred to the remote CREAM CE and run taking the myinput.txt file (also copied from the client node) as input. The standard streams of the job are redirected to the files message.txt and error.txt, and when the job completes its execution they are automatically uploaded to gsiftp://se.pd.infn.it/data/seredova.
A detailed description of the available JDL attributes and of the rules for building correct JDL files is documented in the CREAM JDL guide.
The jobs submitted to a CREAM based CE are given the delegated credentials of the user who submitted them. These credentials can then be used when operations requiring security support have to be performed by the job.
There are two possible options to deal with proxy delegation:
- asking the automatic delegation of the credentials during the submission operation;
- explicitly delegating credentials, and then asking to rely on these previously delegated credentials on the actual submission operations.
It is highly suggested to rely on the latter mechanism, using the same delegated proxy for multiple job submissions, instead of delegating a new proxy each time: delegating a proxy is an operation that can take a non-negligible amount of time.
The command glite-ce-delegate-proxy is the command to be used to explicitly delegate the user credentials to a CREAM CE. The following shows an example of job submission performed by explicitly delegating credentials. First of all the credentials are delegated to a CREAM based CE, whose endpoint is specified with the option --endpoint (-e):
$ glite-ce-delegate-proxy -e cream-ce-01.pd.infn.it mydelid
2006-02-26 15:03:37,286 NOTICE - Proxy with delegation id [mydelid] successfully
delegated to endpoint [https://cream-ce-01.pd.infn.it:8443//ce-cream/services/CREAMDelegation]
The identifier of the delegation is then specified with the
--delegationId
(-D
) option in the job submit operation:
$ glite-ce-job-submit -D mydelid -r cream-ce-01.pd.infn.it:8443/cream-lsf-grid02 myjob1.jdl myjob2.jdl myjob3.jdl
The option -r (--resource) has been used to specify the identifier of the CREAM CE where the jobs have to be submitted. myjob1.jdl, myjob2.jdl and myjob3.jdl are the 3 JDL files describing the jobs to be submitted.
The command returns the CREAM job identifiers associated with these jobs (e.g. https://cream-ce-01.pd.infn.it:8443/CREAM116j9vgnf), which identify them uniquely and unambiguously across the whole Grid.
In addition the user can associate a lease, previously created with the command glite-ce-job-lease, by means of the option --leaseId <lease_identifier>:
$ glite-ce-job-submit -D mydelid -r cream-ce-01.pd.infn.it:8443/cream-lsf-grid02 --leaseId <my_lease_identifier>
myjob1.jdl myjob2.jdl myjob3.jdl
To create a lease in the CREAM service, with a certain duration of time (expressed in seconds), issue the command:
$ glite-ce-job-lease --endpoint cream-27.pd.infn.it --leaseTime 3600 myLID
You requested lease time [3600] for lease ID [myLID]
CREAM negotiated the lease time to [3600]
The above command has created a lease on cream-27.pd.infn.it
named
“myLID” and lasting 1 hour.
Monitoring jobs¶
Passing the CREAM job identifiers returned by the
glite-ce-job-submit
command to the glite-ce-job-status
command,
it is possible to monitor the submitted jobs. Several pieces of (static and dynamic) information can be shown, depending on the chosen verbosity level. The verbosity level can be 0 (least verbose), 1, or 2 (most verbose). Please note that specifying 0 as the verbosity level results in a faster operation on the CREAM service than using 1 or 2. The most relevant attribute is the job status.
The following is an example of job status operation, specifying 1 as verbosity level:
$ glite-ce-job-status -L 1 https://cream-02.pd.infn.it:8443/CREAM738582717
****** JobID=[https://cream-02.pd.infn.it:8443/CREAM738582717]
Current Status = [DONE-FAILED]
ExitCode = [N/A]
FailureReason = [lsf_reason=256; Cannot move ISB (${globus_transfer_cmd}
gsiftp://cream-02.pd.infn.it//CREAMTests/Exe1/ssh1.sh file:///home/infngrid001/home_cream_738582717/CREAM738582717/ssh1.sh):
error: globus_ftp_client: the server responded with an error 500 500-Command failed. : globus_l_gfs_file_open failed.
500-globus_xio: Unable to open file //CREAMTests/Exe1/ssh1.sh
500-globus_xio: System error in open: No such file or directory
500-globus_xio: A system call failed: No such file or directory 500 End.]
Grid JobID = [N/A]
Job status changes:
-------------------
Status = [REGISTERED] - [Tue 22 Jan 2008 15:55:08] (1201013708)
Status = [PENDING] - [Tue 22 Jan 2008 15:55:08] (1201013708)
Status = [IDLE] - [Tue 22 Jan 2008 15:55:11] (1201013711)
Status = [RUNNING] - [Tue 22 Jan 2008 15:55:18] (1201013718)
Status = [DONE-FAILED] - [Tue 22 Jan 2008 16:03:10] (1201014190)
Issued Commands:
-------------------
*** Command Name = [JOB_REGISTER]
Command Category = [JOB_MANAGEMENT]
Command Status = [SUCCESSFULL]
*** Command Name = [JOB_START]
Command Category = [JOB_MANAGEMENT]
Command Status = [SUCCESSFULL]
In this example it is interesting to note that the job failed (as
reported by the Current Status
field) for the problem reported in
the FailureReason
field: the file to be transferred was not found.
Instead of explicitly specifying the identifiers of the jobs to monitor, the user can also ask to monitor all her jobs, possibly specifying conditions (on the submission date and/or on the job status) that must be met. For example, to monitor all jobs whose status is DONE-OK or DONE-FAILED, submitted to the grid005.pd.infn.it CREAM CE between July 23, 2005 10:00 and July 28, 2005 11:00, the following command must be issued:
$ glite-ce-job-status --all -e grid005.pd.infn.it:8443 --from '2005-07-23 10:00:00' \
    --to '2005-07-28 11:00:00' -s DONE-OK:DONE-FAILED
Retrieving output of jobs¶
The user can choose to save the output sandbox (OSB) files on a remote server, or save them on the CREAM CE node. In the latter case these files can then be retrieved using the glite-ce-job-output command. For example the following command retrieves the output sandbox files of the specified job from the relevant CREAM CE node:
$ glite-ce-job-output https://cream-38.pd.infn.it:8443/CREAM295728364
2011-01-29 10:09:50,394 INFO - For JobID [https://cream-38.pd.infn.it:8443/CREAM295728364]
output will be stored in the dir ./cream-38.pd.infn.it_8443_CREAM295728364
This command can also be used to retrieve the output produced by multiple jobs, by specifying multiple job identifiers as the command's arguments.
Getting job identifiers¶
If a user is interested in getting the identifiers of all her jobs submitted to a specific CREAM CE, she can use the glite-ce-job-list command. For example the following command returns the identifiers of all the jobs submitted to the specified CREAM CE, owned by the user issuing the command:
$ glite-ce-job-list grid005.pd.infn.it:8443
Cancelling jobs¶
In some cases it might be needed to cancel jobs which have been
previously submitted to CREAM based CEs. This can be achieved via the
glite-ce-job-cancel
command. E.g., the command:
$ glite-ce-job-cancel https://grid005.pd.infn.it:8443/CREAM115j5vfnf
cancels the specified job.
Suspending and resuming jobs¶
A running or idle job can be suspended (i.e. its execution will be
stopped), and be resumed (i.e. it will run again) later. This can be
achieved with the glite-ce-job-suspend
and glite-ce-job-resume
commands. The following example shows that, after the glite-ce-job-suspend command has been issued, the job status becomes HELD after a while.
$ glite-ce-job-suspend https://cream-ce-01.pd.infn.it:8443/CREAM11a79tnb2
Are you sure you want to suspend specified job(s) [y/n]: y
$ glite-ce-job-status -L 0 https://cream-ce-01.pd.infn.it:8443/CREAM11a79tnb2
****** JobID=[https://cream-ce-01.pd.infn.it:8443/CREAM11a79tnb2]
Status = [HELD]
Issuing the glite-ce-job-resume command, the job will run (or be idle) again:
$ glite-ce-job-resume https://cream-ce-01.pd.infn.it:8443/CREAM11a79tnb2
Are you sure you want to resume specified job(s) [y/n]: y
$ glite-ce-job-status -L 0 https://cream-ce-01.pd.infn.it:8443/CREAM11a79tnb2
****** JobID=[https://cream-ce-01.pd.infn.it:8443/CREAM11a79tnb2]
Status = [REALLY-RUNNING]
Purging jobs¶
A CREAM job can be monitored (via the glite-ce-job-status command) even after it has completed its execution. A job gets "lost" (i.e. it is not possible to monitor or manage it anymore) only when the user who submitted it decides to explicitly clear it, or when the CREAM system administrator decides to perform this purging operation. A user can purge her own jobs using the glite-ce-job-purge command. E.g., after having issued the command:
$ glite-ce-job-purge https://cream-ce-01.pd.infn.it:8443/CREAM116jbi4o0
the specified job canRSQUOt be managed anymore (e.g. it is not possible to check its status anymore).
Renewing proxies¶
It is possible that long jobs outlive the validity of the initially delegated credentials; if so, the job will die prematurely. To avoid this, it is possible to renew the proxy of jobs submitted to CREAM CEs with the glite-ce-proxy-renew command. E.g. the following command:
$ glite-ce-proxy-renew -e cream-ce-01.pd.infn.it:8443 mydelid
renews the proxy of all the jobs having mydelid as delegation id. It must be stressed that for jobs submitted to CREAM based CEs via the Workload Management System (WMS), proxy renewal is automatically dealt with by the middleware.
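For reference, a possible end-to-end sequence using a named delegation might look as follows; it assumes the -D (--delegationId) option of glite-ce-job-submit to reference the previously delegated proxy, and the CE endpoint and JDL file are illustrative:
$ glite-ce-delegate-proxy -e cream-ce-01.pd.infn.it:8443 mydelid
$ glite-ce-job-submit -D mydelid -r cream-ce-01.pd.infn.it:8443/cream-lsf-grid02 myjob.jdl
$ glite-ce-proxy-renew -e cream-ce-01.pd.infn.it:8443 mydelid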
Handling job identifiers¶
Handling the job identifiers directly quickly becomes tedious. To avoid this, you can make the glite-ce-job-submit and glite-ce-job-list commands append the job Id(s) to a named file using the --output (-o) option. On the other hand, the CREAM client commands which take job identifier(s) as argument also accept the --input (-i) option, which allows the job identifier(s) to be read from a file. The following shows an example:
$ glite-ce-job-submit -a -r cream-ce-01.pd.infn.it:8443/cream-lsf-grid02 -o idfile myjob.jdl
https://cream-ce-01.pd.infn.it:8443/CREAM116jbs5b9
The returned job Id is also appended to the specified file (idfile), which can then be passed via the --input (-i) option, e.g. to the glite-ce-job-status command:
$ glite-ce-job-status -i idfile
****** JobID=[https://cream-ce-01.pd.infn.it:8443/CREAM116jbs5b9]
Status=[REALLY-RUNNING]
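The same file can be fed to any other command accepting job identifiers; for instance, to cancel all the jobs whose identifiers are stored in idfile:
$ glite-ce-job-cancel -i idfile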
Restricting job submissions¶
To prevent a CREAM CE from getting overloaded, the CREAM CE administrator can set a specific policy that disables new job submissions when certain conditions are met. If submissions have been disabled for this reason and a new submission is attempted, the user gets an error message such as:
$ glite-ce-job-submit -a -r cream-38.pd.infn.it:8443/cream-pbs-creamtest1 oo.jdl
MethodName=[jobRegister] ErrorCode=[0] Description=[The CREAM service cannot accept jobs at the moment]
FaultCause=[Threshold for Load Average(1 min): 30 => Detected value for Load Average(1 min): 31.13]
Timestamp=[Sat 29 Jan 2011 11:55:18]
In order to avoid degrading the performance of the system, the specified policy is not evaluated for each job submission; instead it is evaluated and imposed from time to time (so it might happen that for a short time job submissions are allowed even if the specified threshold has been reached). CREAM “super-users” can also disable new job submissions via the command glite-ce-disable-submission. Submissions can then be re-enabled by a CREAM “super-user” via the command glite-ce-enable-submission. To check if job submissions on a specific CREAM CE are allowed, the command glite-ce-allowed-submission can be used.
$ glite-ce-disable-submission grid006.pd.infn.it:8443
Operation for disabling new submissions succeeded
$ glite-ce-allowed-submission grid006.pd.infn.it:8443
Job Submission to this CREAM CE is disabled
$ glite-ce-enable-submission grid006.pd.infn.it:8443
Operation for enabling new submissions succeeded
$ glite-ce-allowed-submission grid006.pd.infn.it:8443
Job Submission to this CREAM CE is enabled
It must be stressed that if job submissions to a specific CREAM CE are disabled, all other operations (job status, job cancellations, etc.) can still be performed.
Getting information about the CREAM service¶
It is possible to get information about the CREAM service (interface and
service version, status, etc) using the glite-ce-service-info
command, e.g.:
$ glite-ce-service-info cream-13.pd.infn.it:8443
Interface Version = [2.1]
Service Version = [1.12]
Description = [CREAM 2]
Started at = [Tue Nov 10 14:42:12 2009]
Submission enabled = [YES]
Status = [RUNNING]
Service Property = [SUBMISSION_THRESHOLD_MESSAGE]->
[Threshold for Load Average(1 min): 10 => Detected value for Load Average(1 min): 0.03
Threshold for Load Average(5 min): 10 => Detected value for Load Average(5 min): 0.03
Threshold for Load Average(15 min): 10 => Detected value for Load Average(15 min): 0.00
Threshold for Memory Usage: 95 => Detected value for Memory Usage: 57.41%
Threshold for Swap Usage: 95 => Detected value for Swap Usage: 2.02%
Threshold for Free FD: 500 => Detected value for Free FD: 204500
Threshold for tomcat FD: 800 => Detected value for Tomcat FD: 107
Threshold for FTP Connection: 30 => Detected value for FTP Connection: 1
Threshold for Number of active jobs: -1 => Detected value for Number of active jobs: 0
Threshold for Number of pending commands: -1 => Detected value for Number of pending commands: 0]
CREAM CLI configuration files¶
The configuration of the CREAM CLI is accomplished via three possible configuration files:
- A general configuration file. This file is looked for in /etc/glite_cream.conf
- A VO specific configuration file. This file is looked for in /etc/<VO>/glite_cream.conf
- A user specific configuration file. This file is looked for in the following order:
  - the file specified with the --conf option of the considered command
  - the file referenced by the $GLITE_CREAM_CLIENT_CONFIG environment variable
  - $HOME/.glite/<VO>/glite_cream.conf (if the VO is defined), or $HOME/.glite/glite_cream.conf otherwise
Each of these files is a classad containing definitions. If the same attribute is defined in more than one configuration file, the definition in the user specific configuration file (if any) is considered. Likewise, the definitions in the VO specific configuration file have higher priority than the ones specified in the general configuration file. It must be noted that one or more (even all) of these three configuration files can be missing.
We list here the possible attributes that can be specified in the configuration files:
Name | Description | Default |
---|---|---|
CREAM_URL_PREFIX | the prefix to the <hostname>:<port> to build the CREAM service endpoint | https:// |
CREAMDELEGATION_URL_PREFIX | the prefix to the <hostname>:<port> to build the CREAM delegation service endpoint | https:// |
DEFAULT_CREAM_TCPPORT | the port to be appended to the hostname (if not specified by the user) to build the CREAM and CREAM delegation service endpoints | 8443 |
CREAM_URL_POSTFIX | the postfix to be appended to the <hostname>:<port> to build the CREAM service endpoint | /ce-cream/services/CREAM2 |
CREAMDELEGATION_URL_POSTFIX | the postfix to be appended to the <hostname>:<port> to build the CREAM delegation service endpoint | /ce-cream/services/gridsite-delegation |
JDL_DEFAULT_ATTRIBUTES | the classad that must be included by default in the user’s JDLs | empty classad |
STATUS_VERBOSITY_LEVEL | the default verbosity level to be used for the glite-ce-job-status command | 0 |
UBERFTP_CLIENT | the pathname of the uberftp client executable | /usr/bin/uberftp |
SUBMIT_LOG_DIR | the directory where by default the log file glite-ce-job-submit_CREAM_<username>_<date>_<time>.log (created when the --debug option is used with the glite-ce-job-submit command) is created | /tmp/glite_cream_cli_logs |
DELEGATE_LOG_DIR | the directory where by default the log file glite-ce-delegate-proxy_CREAM_<username>_<date>_<time>.log (created when the --debug option is used with the glite-ce-delegate-proxy command) is created | /tmp/glite_cream_cli_logs |
STATUS_LOG_DIR | the directory where by default the log file glite-ce-job-status_CREAM_<username>_<date>_<time>.log (created when the --debug option is used with the glite-ce-job-status command) is created | /tmp/glite_cream_cli_logs |
LIST_LOG_DIR | the directory where by default the log file glite-ce-job-list_CREAM_<username>_<date>_<time>.log (created when the --debug option is used with the glite-ce-job-list command) is created | /tmp/glite_cream_cli_logs |
SUSPEND_LOG_DIR | the directory where by default the log file glite-ce-job-suspend_CREAM_<username>_<date>_<time>.log (created when the --debug option is used with the glite-ce-job-suspend command) is created | /tmp/glite_cream_cli_logs |
RESUME_LOG_DIR | the directory where by default the log file glite-ce-job-resume_CREAM_<username>_<date>_<time>.log (created when the --debug option is used with the glite-ce-job-resume command) is created | /tmp/glite_cream_cli_logs |
CANCEL_LOG_DIR | the directory where by default the log file glite-ce-job-cancel_CREAM_<username>_<date>_<time>.log (created when the --debug option is used with the glite-ce-job-cancel command) is created | /tmp/glite_cream_cli_logs |
JOBOUTPUT_LOG_DIR | the directory where by default the log file glite-ce-job-output_CREAM_<username>_<date>_<time>.log (created when the --debug option is used with the glite-ce-job-output command) is created | /tmp/glite_cream_cli_logs |
PURGE_LOG_DIR | the directory where by default the log file glite-ce-job-purge_CREAM_<username>_<date>_<time>.log (created when the --debug option is used with the glite-ce-job-purge command) is created | /tmp/glite_cream_cli_logs |
ALLOWEDSUB_LOG_DIR | the directory where by default the log file glite-ce-allowed-submission_CREAM_<username>_<date>_<time>.log (created when the --debug option is used with the glite-ce-allowed-submission command) is created | /tmp/glite_cream_cli_logs |
ENABLE_LOG_DIR | the directory where by default the log file glite-ce-enable-submission_CREAM_<username>_<date>_<time>.log (created when the --debug option is used with the glite-ce-enable-submission command) is created | /tmp/glite_cream_cli_logs |
DISABLE_LOG_DIR | the directory where by default the log file glite-ce-disable-submission_CREAM_<username>_<date>_<time>.log (created when the --debug option is used with the glite-ce-disable-submission command) is created | /tmp/glite_cream_cli_logs |
PROXYRENEW_LOG_DIR | the directory where by default the log file glite-ce-proxy-renew_CREAM_<username>_<date>_<time>.log (created when the --debug option is used with the glite-ce-proxy-renew command) is created | /tmp/glite_cream_cli_logs |
GETSERVICEINFO_LOG_DIR | the directory where by default the log file glite-ce-service-info_CREAM_<username>_<date>_<time>.log (created when the --debug option is used with the glite-ce-service-info command) is created | /tmp/glite_cream_cli_logs |
LEASE_LOG_DIR | the directory where by default the log file glite-ce-job-lease_CREAM_<username>_<date>_<time>.log (created when the --debug option is used with the glite-ce-job-lease command) is created | /tmp/glite_cream_cli_logs |
As mentioned above, if the same attribute is defined in more than one configuration file, the definition in the user specific configuration file (if any) has higher priority than the definition in the VO specific configuration file (if any), which in turn has higher priority than the definition in the general configuration file. If an attribute is not defined anywhere, the default value is considered.
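As an illustrative sketch (the attribute values are arbitrary), a user specific $HOME/.glite/glite_cream.conf overriding two of the defaults listed above might contain:
[
STATUS_VERBOSITY_LEVEL = 1;
SUBMIT_LOG_DIR = "/home/user/cream_logs";
]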
Use specific functionality of the CREAM CE¶
Forward of requirements to the batch system¶
The CREAM CE allows forwarding requirements to the batch system via the BLAH component. For this purpose the JDL CERequirements attribute, described in the CREAM JDL guide, can be used. For direct submissions to the CREAM CE (e.g. jobs submitted using the CREAM CLI glite-ce-job-submit command) the CeRequirements attribute is supposed to be filled by the end-user. For jobs submitted to the CREAM CE via the WMS, the CeRequirements attribute is instead filled by the WMS, considering the JDL Requirements expression and the value of the CeForwardParameters attribute in the WMS configuration file.
For example, if the user JDL contains:
Requirements= "other.GlueHostMainMemoryRAMSize > 100 && other.GlueCEImplementationName==\"CREAM\"";
and the WMS configuration file contains:
CeForwardParameters = {"GlueHostMainMemoryVirtualSize","GlueHostMainMemoryRAMSize", "GlueCEPolicyMaxCPUTime"};
in the JDL sent by the WMS to CREAM there will be:
CeRequirements= "other.GlueHostMainMemoryRAMSize > 100";
The CERequirements expression received by CREAM is then forwarded to BLAH. Basically BLAH manages the CERequirements expression by setting some environment variables, which are available to and can be properly used by the /usr/libexec/xxx_local_submit_attributes.sh script (e.g. /usr/libexec/pbs_local_submit_attributes.sh for PBS/Torque, /usr/libexec/lsf_local_submit_attributes.sh for LSF). This script must be properly created by the site admin.
For example, considering the following CeRequirements expression:
CeRequirements="other.GlueHostMainMemoryRAMSize > 100 && other.GlueCEStateWaitingJobs <10 && \
other.GlueCEImplementationName==\"CREAM\" && other.GlueHostProcessorClockSpeed >= 2800 && \
(Member(\"FDTD\", other.GlueHostApplicationSoftwareRuntimeEnvironment))";
the following settings will be available in $USR_LOCATION/libexec/xxx_local_submit_attributes.sh:
GlueHostMainMemoryRAMSize_Min='100'
GlueCEStateWaitingJobs_Max='10'
GlueCEImplementationName='CREAM'
GlueHostProcessorClockSpeed_Min='2800'
GlueHostApplicationSoftwareRuntimeEnvironment='"FDTD"'
where the value for $USR_LOCATION in a standard installation of a CREAM CE is “/usr”. What is printed by the /usr/libexec/xxx_local_submit_attributes.sh script is automatically added to the submit command file. For example, if the JDL CeRequirements expression is:
CeRequirements = "(Member(\"FDTD\", other.GlueHostApplicationSoftwareRuntimeEnvironment))";
and /usr/libexec/pbs_local_submit_attributes.sh is:
#!/bin/sh
# The bounds derived from the CeRequirements expression are exposed
# to this script as plain environment variables (see the list above)
if [ "$GlueHostApplicationSoftwareRuntimeEnvironment" = "FDTD" ]; then
    echo "#PBS -l software=FDTD"
fi
then the PBS submit file that will be used will include:
...
...
# PBS directives:
#PBS -S /bin/bash
#PBS -o /dev/null
#PBS -e /dev/null
#PBS -l software=FDTD
....
....
where the line:
#PBS -l software=FDTD
is set via the /usr/libexec/pbs_local_submit_attributes.sh
script.
Please note that there is no difference if in the CeRequirements expression there is e.g.:
CeRequirements = "other.xyz==\"ABC\"";
or:
CeRequirements = "xyz==\"ABC\"";
In both cases the variable xyz will be set in /usr/libexec/xxx_local_submit_attributes.sh. As shown above, having x>a or x>=a doesn’t make any difference in the setting of the environment variable x in the /usr/libexec/xxx_local_submit_attributes.sh script. In both cases it will be:
x_Min='a'
Starting with BLAH v. 1.18 it is possible to forward to the batch system also attributes not included in the CeRequirements JDL attribute. This can be done by adding in /etc/blah.config the line:
blah_pass_all_submit_attributes=yes
In this way the xxx_local_submit_attributes.sh script will see the following environment variables set:
- gridType
- x509UserProxyFQAN
- uniquejobid
- queue
- ceid
- VirtualOrganisation
- ClientJobId
- x509UserProxySubject
It is also possible to specify that only some attributes must be forwarded to the batch system, by setting in blah.config e.g.:
blah_pass_submit_attributes[0]="x509UserProxySubject"
blah_pass_submit_attributes[1]="x509UserProxyFQAN"
CREAM job states¶
Here below is provided a brief description of the meaning of each possible state a CREAM job can enter:
- REGISTERED: the job has been registered but it has not been started yet.
- PENDING: the job has been started, but it has still to be submitted to the LRMS abstraction layer module (i.e. BLAH).
- IDLE: the job is idle in the Local Resource Management System (LRMS).
- RUNNING: the job wrapper, which “encompasses” the user job, is running in the LRMS.
- REALLY-RUNNING: the actual user job (the one specified as Executable in the job JDL) is running in the LRMS.
- HELD: the job is held (suspended) in the LRMS.
- CANCELLED: the job has been cancelled.
- DONE-OK: the job has successfully been executed.
- DONE-FAILED: the job has been executed, but some errors occurred.
- ABORTED: errors occurred during the “management” of the job, e.g. the submission to the LRMS abstraction layer software (BLAH) failed.
- UNKNOWN: the job is in an unknown status.
The following figure shows the possible job state transitions:

CREAM JDL Guide¶
Introduction¶
The Job Description Language (JDL) is a high-level, user-oriented language based on Condor classified advertisements (classads) for describing jobs to be submitted to the CREAM CE service. Since the JDL is an extensible language, the user is allowed to use whatever attribute for the description of a request without incurring errors from the JDL parser. However, only a certain set of attributes, which we will refer to as “supported attributes” from now on, is taken into account by the CREAM CE service.
Some of the attributes in the JDL are mandatory. If the user does not specify them, CREAM cannot handle the request. For the other attributes the system may find a default value if they are necessary for processing the request.
Before starting with the detailed attribute description, please note that a request description is composed of entries that are strings having the format
attribute = expression;
and are terminated by the semicolon character. The whole description has to be included between square brackets, i.e. [ <job descr.> ]. The termination with the semicolon is not mandatory for the last attribute before the closing square bracket ].
Attribute expressions can span several lines provided the semicolon is put only at the end of the whole expression. Comments must have a sharp character (#) or a double slash (//) at the beginning of each line. Comments spanning multiple lines can be specified enclosing the text between “/*” and “*/”.
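For illustration, a minimal request description obeying these rules (the semicolon after the last attribute is omitted on purpose) might look like:
[
# a single-line comment
// another single-line comment
Type = "Job";
JobType = "Normal";
/* a comment spanning
multiple lines */
Executable = "/bin/hostname"
]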
Please note that since CREAM exposes a publicly available WSDL interface, no assumption is made in the document (unless explicitly specified) about the client tool used to submit the JDL description of the job.
Request and Job Types¶
Type¶
This is a string representing the type of the request described by the JDL, e.g.:
Type = "Job";
For the time being the only possible value is Job. The value for this attribute is case insensitive. If this attribute is not specified in the JDL description, the default value (“Job”) will be considered.
Job Attributes Description¶
This section reports the detailed description of the JDL attributes that can be specified for describing Job requests. A sub-section for each attribute is provided.
JobType¶
This is a string representing the type of the job described by the JDL, e.g.:
JobType = "Normal";
At least for the time being the only possible value is Normal. This attribute only makes sense when the Type attribute equals “Job”. The value for this attribute is case insensitive. If not specified in the JDL, it will be set to “Normal”.
Mandatory: No
Default: Normal
Executable¶
This is a string representing the executable/command name. The user can specify an executable that already lies on the remote CE; in this case the absolute path, possibly including environment variables referring to this file, should be specified, e.g.:
Executable = "/usr/local/java/j2sdk1.4.0_01/bin/java";
or:
Executable = "$JAVA_HOME/bin/java";
The other possibility is to provide an executable located either on the client machine or on a gridFTP server accessible by the user (HTTPS servers are also supported, but this requires the GridSite htcp client command to be installed on the WN). In both cases the executable file will be staged from the original location to the Computing Element WN, and only the file name has to be specified as executable. The URI of the executable should then be listed in the InputSandbox attribute expression to make it be transferred. E.g. respectively:
Executable = "cms_sim.exe";
InputSandbox = {"file:///home/edguser/sim/cms_sim.exe", ...};
Or:
Executable = "cms_sim.exe";
InputSandbox = {"gsiftp://neo.datamat.it:5678/tmp/cms_sim.exe", ...};
Also descriptions as follows:
Executable = "cms_sim.exe";
InputSandbox = {"/home/edguser/sim/cms_sim.exe", ... };
are accepted and interpreted as in the first case, i.e. the executable file is available on the local file system.
It is important to remark that if the job needs some command line arguments for its execution, they have to be specified through the Arguments attribute. The Executable attribute is mandatory.
Mandatory: Yes
Arguments¶
This is a string containing all the job command line arguments. E.g. an executable sum that has to be started as:
$ sum N1 N2 -out result.out
is described by:
Executable = "sum";
Arguments = "N1 N2 -out result.out";
If you want to specify a quoted string inside the Arguments then you have to escape quotes with the \ character. E.g. when describing a job like:
$ grep -i "my name" *.txt
you will have to specify:
Executable = "/bin/grep";
Arguments = "-i \"my name\" *.txt";
Analogously, if the job takes as argument a string containing a special character (e.g. the job is the tail command issued on a file whose name contains the ampersand character, say file1&file2), since on the shell line you would have to write:
$ tail -f file1\&file2
in the JDL you’ll have to write:
Executable = "/usr/bin/tail";
Arguments = "-f file1\\\&file2";
i.e. a \ for each special character. In general, special characters such as &, |, >, < are only allowed if specified inside a quoted string or preceded by a triple backslash (\\\). The backquote character ` cannot be specified in the JDL.
Some other guidelines for the management of special characters:
- If the arguments string contains one or more spaces, the string must be specified in single quotes
- If the arguments string contains single quotes, then the string must be specified in double quotes (which must be escaped)
- If the arguments string contains spaces and single quotes, then the string must be specified in double quotes (which must be escaped)
- The number of backquote characters must be an even number
Some examples:
Arguments = "\"Problematic character is: '\" " ;
Arguments = "-o '/DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=romanov/CN=427293/CN=Vladimir Romanovskiy' \"It's sunny \"" ;
Arguments = "-o \"'option'\"";
Mandatory: No
Default: No
StdInput¶
This is a string representing the standard input of the job. This means that the job is run as follows:
$ executable < <standard input file>
It can be an absolute path possibly including environment variables (wild cards are instead not allowed), i.e. it is already available on the CE, e.g.
StdInput = "/var/tpm/jobInput";
or just a file name, e.g.
StdInput = "myjobInput";
and this means that the file needs to be made available on the WN where the job is run. Therefore the standard input file has to be added to the InputSandbox file list so that it will be downloaded to the WN. The same rules described for the Executable attribute apply to StdInput.
Mandatory: No
Default: No
StdOutput¶
This is a string representing the file name where the standard output of the job is saved. The user can specify either a file name or a relative path (with respect to the job working directory on the WN), e.g.:
StdOutput = "myjobOutput";
StdOutput = "event1/myjobOutput";
Wild cards are not allowed. The value specified for StdError can be the same as the one for StdOutput: this means that the two standard streams of the job are saved in the same file. The user can choose to have this file staged automatically on a GridFTP server by specifying a URI for that file in the OutputSandbox attribute expression. E.g.:
StdOutput = "myjobOutput";
OutputSandbox = {
"gsiftp://fox.infn.it:5678/home/gftp/myjobOutput",
...
};
indicates that, when the job has completed its execution, myjobOutput has to be transferred to gsiftp://fox.infn.it:5678 in the /home/gftp directory.
Mandatory: No
Default: No
StdError¶
This is a string representing the file name where the standard error of the job is saved. The user can specify either a file name or a relative path (with respect to the job working directory on the WN), e.g.:
StdError = "myjobError";
StdError = "event1/myjobError";
Wild cards are not allowed. The value specified for StdError can be the same as the one for StdOutput: this means that the two standard streams of the job are saved in the same file. The user can choose to have this file staged automatically on a GridFTP server by specifying a URI for that file in the OutputSandboxDestURI attribute expression. The same rules as for StdOutput apply to StdError.
Mandatory: No
Default: No
InputSandbox¶
This is a string or a list of strings identifying the list of files available on the file system of the client (UI) machine and/or on an accessible gridFTP server (HTTPS servers are also supported but this requires to have the GridSite htcp client command installed on the WN) needed by the job for running. These files hence have to be transferred to the WN before the job is started. Wildcards and environment variables are admitted in the specification of this attribute only if the submission takes place through a client able to resolve them locally before passing the JDL to the CREAM service (e.g. this is the case for the CREAM CLI). Admitted wildcard patterns are the ones supported by the Linux glob function. One can remove the special meaning of the characters: ‘?’, ‘*’, and ‘[‘ by preceding them by a backslash.
File names can be provided as URI on a gridFTP/HTTPS server, simple file names, absolute paths and relative paths with respect to the current UI working directory. The InputSandbox file list cannot contain two or more files having the same name (even if in different paths) as when transferred in the job’s working directory on the WN they would overwrite each other. This attribute can also be used to accomplish executable and standard input staging to the CE where job execution takes place as explained above. The InputSandbox attribute meaning is strictly coupled with the value of the InputSandboxBaseURI defined in the following that specifies a common location on a gridFTP/HTTPS server where files in the InputSandbox not specified as URI are located.
Support for file transfer from gridftp servers running with user credentials instead of host credentials is also provided. In this case the distinguished name of such user credentials must be specified in the URI using:
?DN=<distinguished name>
as shown in the example below.
Here below follows an example of InputSandbox setting:
InputSandbox = {
"/tmp/ns.log",
"mytest.exe",
"myscript.sh",
"data/event1.txt",
"gsiftp://neo.datamat.it:5678/home/fpacini/cms_sim.exe ",
"file:///tmp/myconf",
"gsiftp://lxsgaravatto.pd.infn.it:47320/etc/fstab?DN=/C=IT/O=INFN/OU=Personal Certificate/L=Padova/CN=Massimo Sgaravatto"
};
InputSandboxBaseURI = "gsiftp://matrix.datamat.it:5432/tmp";
It means that:
- /tmp/ns.log is located on the UI machine local file system
- mytest.exe, myscript.sh and data/event1.txt are available on gsiftp://matrix.datamat.it:5432 in the /tmp directory
- /tmp/myconf is located on the user local file system (explicitly specified using the file:// prefix)
- /etc/fstab is available on gsiftp://lxsgaravatto.pd.infn.it:47320, which is a gridftp server running with user credentials (with the specified distinguished name)
If the InputSandboxBaseURI is not specified, then also mytest.exe, myscript.sh and data/event1.txt would be interpreted as located on the user local file system.
Mandatory: No
Default: No
InputSandboxBaseURI¶
This is a string representing the URI on a gridFTP server (HTTPS servers are also supported but this requires to have the GridSite htcp client command installed on the WN) where the InputSandbox files that have been specified as simple file names and absolute/relative paths are available for being transferred on the WN before the job is started. E.g.
InputSandbox = {
...
"data/event1.txt",
...
};
InputSandboxBaseURI = "gsiftp://matrix.datamat.it:5432/tmp";
makes CREAM consider
"gsiftp://matrix.datamat.it:5432/tmp/data/event1.txt"
for the transfer on the WN.
Support for file transfer from gridftp servers running with user credentials instead of host credentials is also provided. In this case the distinguished name of such user credentials must be specified in the URI using:
?DN=<distinguished name>
as shown in the example below. E.g.
InputSandbox = {
...
"data/event2.txt",
...
};
InputSandboxBaseURI = "gsiftp://lxsgaravatto.pd.infn.it:47320/tmp?DN=/C=IT/O=INFN/OU=Personal Certificate/L=Padova/CN=Massimo Sgaravatto";
makes CREAM consider
"gsiftp://lxsgaravatto.pd.infn.it:47320/tmp/data/event2.txt"
for the transfer on the WN, where that gridftp server has been started with user credentials (with the specified distinguished name)
Mandatory: No
Default: No
OutputSandbox¶
This is a string or a list of strings identifying the list of files generated by the job on the WN at runtime, which the user wants to save. This attribute can be combined with the OutputSandboxDestURI or the OutputSandboxBaseDestURI to have, upon job completion, the output directly copied to specified locations running a gridFTP server (HTTPS servers are also supported but this requires to have the GridSite htcp client command installed on the WN). Wildcards are admitted in the specification of this attribute only if the OutputSandboxBaseDestURI attribute is used along with the OutputSandbox attribute. Admitted wildcard patterns are the ones supported by the Linux glob function. One can remove the special meaning of the characters: ‘?’, ‘*’ and ‘[‘ by preceding them by a backslash.
File names can be provided as simple file names or relative paths with respect to the current working directory on the executing WN. The OutputSandbox file list should not contain two or more files having the same name (even if in different paths).
Mandatory: No
Default: No
OutputSandboxDestURI¶
This is a string or a list of strings representing the URI(s) on a gridFTP/HTTPS server where the files listed in the OutputSandbox attribute have to be transferred at job completion. The OutputSandboxDestURI list contains, for each of the files specified in the OutputSandbox list, the URI (including the file name) where it has to be transferred at job completion. Support for file transfer to gridftp servers running with user credentials instead of host credentials is also provided. In this case the distinguished name of such user credentials must be specified in the URI using:
?DN=<distinguished name>
as shown in the example below. E.g.
OutputSandbox = {
"myjobOutput",
"run1/event1",
"run2/event2",
};
OutputSandboxDestURI = {
"gsiftp://matrix.datamat.it:5432/tmp/myjobOutput",
"gsiftp://grid003.ct.infn.it:6789/home/cms/event1",
"gsiftp://lxsgaravatto.pd.infn.it:47320/tmp/event2?DN=/C=IT/O=INFN/OU=Personal Certificate/L=Padova/CN=Massimo Sgaravatto"
};
makes CREAM transfer respectively:
- myjobOutput on matrix.datamat.it in the directory /tmp
- event1 on grid003.ct.infn.it in the directory /home/cms
- event2 on lxsgaravatto.pd.infn.it (gridftp server running with user credentials, with the specified distinguished name) in the directory /tmp
Specifying the URI gsiftp://localhost, the OutputSandbox files are saved on the gridftp server of the CREAM CE, as shown in the following example:
OutputSandbox = {
"file1",
"file2",
};
OutputSandboxDestURI = {
"gsiftp://localhost/file1",
"gsiftp://grid003.ct.infn.it:6789/home/cms/file2"
};
In the above example file1 is saved on the gridftp server of the CREAM CE, while file2 is saved on grid003.ct.infn.it (directory /home/cms). The OutputSandboxDestURI list must have the same cardinality as the OutputSandbox list, otherwise the JDL will be considered as invalid. Note that the file name specified in the OutputSandbox can be different from the corresponding destination file name specified in the OutputSandboxDestURI. The OutputSandboxDestURI attribute and the OutputSandboxBaseDestURI attribute cannot be specified together in the same JDL. One (and only one) among the OutputSandboxDestURI and OutputSandboxBaseDestURI attributes must be specified if the OutputSandbox attribute has been specified.
Mandatory: No
Default: No
OutputSandboxBaseDestURI¶
The OutputSandboxBaseDestURI attribute is a string representing the base URI on a gridFTP server, i.e. a directory on the server, where the files listed in the OutputSandbox attribute have to be transferred at job completion. HTTPS servers are also supported but this requires to have the GridSite htcp client command installed on the WN. All the OutputSandbox files are transferred to the location specified by the URI with the same names (only names in a flat directory) as the ones specified in the OutputSandbox. E.g.:
OutputSandbox = {
"myjobOutput",
"run1/event1",
};
OutputSandboxBaseDestURI = "gsiftp://matrix.datamat.it:5432/tmp";
makes CREAM transfer both files to the /tmp directory of the gridFTP server matrix.datamat.it (note that event1 will go in /tmp and not in /tmp/run1).
Support for file transfer to gridftp servers running with user credentials instead of host credentials is also provided. In this case the distinguished name of such user credentials must be specified in the URI using:
?DN=<distinguished name>
as shown in the example below. E.g.
OutputSandbox = {
"myjobOutput",
"run1/event1",
};
OutputSandboxBaseDestURI = "gsiftp://lxsgaravatto.pd.infn.it:47320/tmp?DN=/C=IT/O=INFN/OU=Personal Certificate/L=Padova/CN=Massimo Sgaravatto";
makes CREAM transfer both files to the /tmp directory of the gridFTP server lxsgaravatto.pd.infn.it (running with user credentials, with the specified distinguished name).
Specifying the URI gsiftp://localhost, the OutputSandbox files are saved on the gridftp server of the CREAM CE, as shown in the following example:
OutputSandbox = {
"file1",
"file2",
};
OutputSandboxBaseDestURI = "gsiftp://localhost";
In the above example file1 and file2 are saved on the gridftp server of the CREAM CE. The OutputSandboxBaseDestURI attribute and the OutputSandboxDestURI attribute cannot be specified together in the same JDL. One (and only one) among the OutputSandboxDestURI and OutputSandboxBaseDestURI attributes must be specified if the OutputSandbox attribute has been specified.
Mandatory: No
Default: No
Prologue¶
The Prologue attribute is a string representing the executable/script name of the prologue. The prologue is an executable run within the CREAM job wrapper before the user job is started. It can be used for purposes ranging from application-specific checks that the job environment has been correctly set on the WN to actions like data transfers, database updates or MPI pre script. If the prologue fails the job wrapper terminates. The rules for specification of the Prologue attributes and its relationship with the InputSandbox attribute are exactly the same already described for the Executable attribute.
Mandatory: No
Default: No
PrologueArguments¶
The PrologueArguments attribute is a string containing all the prologue executable command line arguments. All the rules reported in the description of the Arguments attribute also apply to the PrologueArguments attribute.
Mandatory: No
Default: No
Epilogue¶
The Epilogue attribute is a string representing the executable/script name of the epilogue. The epilogue is an executable/script run within the CREAM job wrapper after the user job completion. It can be used for purposes ranging from application-specific checks that the job performed correctly to actions like data transfers, database updates or MPI post script. The rules for specification of the Epilogue attributes and its relationship with the InputSandbox attribute are exactly the same already described for the Executable attribute.
Mandatory: No
Default: No
EpilogueArguments¶
The EpilogueArguments attribute is a string containing all the epilogue executable command line arguments. All the rules reported in the description of the Arguments attribute also apply to the EpilogueArguments attribute.
Mandatory: No
Default: No
Environment¶
This is a list of strings representing environment settings that have to be performed on the execution machine and that are needed by the job to run properly. The JobWrapper on the Worker Node performs these settings just before the job is started. Each item of the list is an equality 'VAR_NAME=VAR_VALUE'. E.g.:
Environment = {
"JOB_LOG_FILE=/tmp/myjob.log",
"ORACLE_SID=edg_rdbms_1",
"JAVABIN=/usr/local/java"
};
Mandatory: No
Default: No
PerusalFileEnable¶
The PerusalFileEnable attribute is a Boolean attribute that allows enabling the job file perusal support in CREAM. File perusal can be used when jobs are submitted to CREAM by the WMS, but it is possible to use this functionality also for jobs submitted directly to CREAM. When this attribute is set to true, i.e.
PerusalFileEnable = true;
the user can inspect, while the job is running, the files generated by the job on the WN. This is achieved by uploading to a location specified by the attribute PerusalFilesDestURI, at regular time intervals, chunks of the files (specified by the attribute PerusalListFileURI) generated by the job on the WN. The PerusalFileEnable attribute is not mandatory. If not specified in the JDL it is assumed to be set to false.
Mandatory: No
Default: False
PerusalTimeInterval¶
The PerusalTimeInterval attribute is a positive integer representing the difference in seconds between two subsequent saves (and uploads to the location specified by the attribute PerusalFilesDestURI) of the files generated by the job on the WN. Specifying e.g.
PerusalTimeInterval = 10;
makes the CREAM JobWrapper save the job files specified through the attribute PerusalListFileURI every 10 seconds and upload them to the location specified by the attribute PerusalFilesDestURI, so that they can be inspected by the user.
Mandatory: No (unless PerusalFileEnable is true)
Default: No
PerusalFilesDestURI¶
The PerusalFilesDestURI attribute is a string representing the URI of the location on a gridFTP or HTTPS server (HTTPS servers are also supported but this requires to have the GridSite htcp command installed on the WN) where the chunks of files generated by the running job on the WN and specified through the attribute PerusalListFileURI have to be copied. E.g.
PerusalFilesDestURI = "gsiftp://ghemon.cnaf.infn.it/home/glite/peek";
Mandatory: No (unless PerusalFileEnable is true)
Default: No
PerusalListFileURI¶
The PerusalListFileURI attribute is a string representing the URI of a file on a gridFTP server containing the list of files (one per line) that must be saved and uploaded to the location specified by the attribute PerusalFilesDestURI at regular time intervals. E.g.
PerusalListFileURI = "gsiftp://ghemon.cnaf.infn.it/peek/files2peek";
Mandatory: No (unless PerusalFileEnable is true)
Default: No
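Putting the perusal attributes together, a JDL fragment enabling file perusal (with an arbitrary time interval, and hosts/paths taken from the examples above) might read:
PerusalFileEnable = true;
PerusalTimeInterval = 30;
PerusalFilesDestURI = "gsiftp://ghemon.cnaf.infn.it/home/glite/peek";
PerusalListFileURI = "gsiftp://ghemon.cnaf.infn.it/peek/files2peek";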
BatchSystem¶
This is a string representing the Local Resource Management System (LRMS), that is the batch system type (e.g. LSF, PBS, etc.) of the target CREAM CE. Here below follows an example for this attribute:
BatchSystem = "pbs";
This attribute is mandatory, but it can be omitted if the CREAM CLI is used, since the CLI retrieves it from the CEID and automatically fills the JDL.
Mandatory: Yes
Default: No
QueueName¶
This is a string representing the queue name in the Local Resource Management System (LRMS) where the job has to be submitted on the target CREAM CE. Here below follows an example for this attribute:
QueueName = "long";
This attribute is mandatory, but it can be omitted if the CREAM CLI is used, since the CLI retrieves it from the CEID and automatically fills the JDL.
Mandatory: Yes
Default: No
CPUNumber¶
The CpuNumber attribute is an integer greater than 0 specifying the number of CPUs needed. This attribute can be used in particular for MPI jobs. Please note that this attribute only allocates the specified number of CPUs; it is then up to the job to actually use them (e.g. via mpistart). An example of the JDL setting is:
CpuNumber = 5;
Mandatory: No
Default: 1
SMPGranularity¶
The SMPGranularity attribute is an integer greater than 0 specifying the number of cores that any host involved in the allocation has to dedicate to the considered job. This attribute can’t be specified along with the HostNumber attribute when WholeNodes is false.
Mandatory: No
Default: No
GPUNumber¶
The GPUNumber attribute is an integer greater than 0 specifying the number of accelerator devices (CUDA) required for executing the current job.
Mandatory: No
Default: No
GPUModel¶
The GPUModel attribute is a string containing the model of the accelerator device (CUDA) as declared by the vendor.
Mandatory: No
Default: No
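As a sketch (the model string is an arbitrary example of a vendor-declared value), a job asking for one CUDA device of a given model could specify:
GPUNumber = 1;
GPUModel = "Tesla K80";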
WholeNodes¶
The WholeNodes attribute is a boolean that indicates whether whole nodes should be used exclusively or not.
Mandatory: No
Default: False
HostNumber¶
HostNumber is an integer indicating the number of nodes the user wishes to obtain for his job. This attribute can’t be specified along with the SMPGranularity attribute when WholeNodes is false. Please note that HostNumber shouldn’t be greater than CpuNumber.
Mandatory: No
Default: No
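As an illustrative combination of the allocation attributes above (the values are arbitrary), a request for two whole nodes, each dedicating eight cores to the job, might read:
WholeNodes = true;
HostNumber = 2;
SMPGranularity = 8;
CpuNumber = 16;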
CERequirements¶
The CERequirements attribute is a Boolean ClassAd expression that uses C-like operators. It represents job requirements on resources. By properly configuring the BLAH software, it is possible to instruct the Local Resource Management System to select the most appropriate Worker Node to run the considered job. The CERequirements expression can contain attributes that describe the CE, which are prefixed with “other.”. This is an example of CERequirements expression:
CERequirements = "other.GlueCEPolicyMaxCPUTime >= 100 && other.GlueHostMainMemoryRAMSize > 2000";
The CERequirements attribute can be used when jobs are submitted directly to a CREAM based CE, while for jobs submitted to a CREAM based CE via the WMS, the CERequirements expression is automatically filled by the WMS considering the user’s Requirements expression and the value of the CeForwardParameters attribute in the WMS configuration file.
For further details refer to the User guide.
Mandatory: No
Default: No
MWVersion¶
The MWVersion attribute is a string whose value is assigned to the environment variable EDG_MW_VERSION, defined for the job immediately when it arrives on the WN (i.e. it is not defined in the job wrapper). There can be hooks on the WN which detect this (before any /etc/profile.d scripts are run) and set up the appropriate environment for the job.
Mandatory: No
Default: No
OutputData¶
This attribute allows the user to ask for the automatic upload and registration to the Replica Catalog of datasets produced by the job on the WN. Through this attribute it is possible to indicate for each output file the LFN (Logical File Name) to be used for registration and the SE (Storage Element) on which the file has to be uploaded. The OutputData attribute is not mandatory.
OutputData is a list of classads where each classad contains the following three attributes:
- OutputFile
- StorageElement
- LogicalFileName
These three attributes are only admitted as members of one of the classads composing OutputData. They cannot be specified independently in the job JDL. Here below follows an example of the OutputData attribute:
OutputData = {
[
OutputFile = "dataset_1.out";
LogicalFileName = "lfn:/test/result1";
],
[
OutputFile = "dataset_2.out";
StorageElement = "se001.cnaf.infn.it";
],
[
OutputFile = "cms/dataset_3.out";
StorageElement = "se012.to.infn.it";
LogicalFileName = "lfn:/cms/outfile1";
],
[
OutputFile = "dataset_4.out";
]
};
If the attribute OutputData is found in the JDL then the JobWrapper at
the end of the job calls the Data Management service that copies the
file from the WN onto the specified SE and registers it with the given
LFN. If the specified LFN is already in use, the DM service registers
the file with a newly generated identifier GUID (Grid Unique
Identifier). During this process the JobWrapper creates a file (named
DSUpload_<jobid>.out
) with the results of the operation. In case of
submission to CREAM through the WMS this file is put automatically in
the OutputSandbox attribute list by the UI (and can then be retrieved by
the user with the glite-wms-job-output
command). If instead the job
was submitted directly to the CE, the file is put in the OSB directory
of the CE node (and therefore can then be retrieved by the user with the
glite-ce-job-output
command) unless the OutputSandboxBaseDestURI
attribute has been used (in this latter case the specified location is
used to store the DSUpload_<jobid>.out file).
Mandatory: No
Default: No
OutputFile¶
This is a string attribute representing the name of the output file, generated by the job on the WN, which has to be automatically uploaded and registered by the WMS. Wildcards are not admitted in the specification of this attribute. File names can be provided as simple file names, absolute paths or relative paths with respect to the current working directory.
Mandatory: Yes (only if OutputData has been specified)
Default: No
StorageElement¶
This is a string representing the URI of the Storage Element where the output file specified in the corresponding OutputFile attribute has to be uploaded by the WMS.
Mandatory: No
Default: No
LogicalFileName¶
This is a string representing the logical file name (LFN) the user wants to associate to the output file when registering it to the Replica Catalogue. The specified name has to be prefixed by “lfn:” (lowercase). If this attribute is not specified then the corresponding output file is registered with a GUID that is assigned automatically by the Data Management services.
Mandatory: No
Default: No (If not specified a GUID is assigned by DM services)
JDL Examples¶
Simple examples of JDL describing different types of jobs and requests are reported in this section.
Example 1¶
[
Type = "job";
JobType = "normal";
Executable = "/sw/command";
Arguments = "60";
StdOutput = "sim.out";
StdError = "sim.err";
OutputSandbox = { "sim.err", "sim.out" };
OutputSandboxBaseDestURI = "gsiftp://se1.pd.infn.it:5432/tmp";
InputSandbox = {
"file:///home/user/file1",
"gsiftp:///se1.pd.infn.it:1234/data/file2",
"/home/user/file3", "file4"
};
InputSandboxBaseURI = "gsiftp://se2.cern.ch:5678/tmp";
]
With this JDL a “normal” (batch) job will be submitted. Besides the specification of the executable (already available on the file system of the executing node, since it is not listed in the InputSandbox) and of the standard output/error files, it is specified that the files file1, file2, file3 and file4 will have to be staged on the executing node:
- file1 and file3 will be copied from the client (UI) file system
- file2 will be copied from the specified GridFTP server (gsiftp://se1.pd.infn.it:1234/data/file2)
- file4 will be copied from the GridFTP server specified as InputSandboxBaseURI (gsiftp://se2.cern.ch:5678/tmp)
It is also specified that the files sim.err and sim.out (specified as OutputSandbox) must be automatically uploaded to gsiftp://se1.pd.infn.it:5432/tmp when the job completes its execution.
Example 2¶
[
Type = "job";
JobType = "normal";
Executable = "script.sh";
Arguments = "60";
StdOutput = "sim.out";
StdInput = "sim.inp";
StdError = "sim.err";
OutputSandbox = {
"sim.err",
"sim.out"
};
OutputSandboxDestURI = {
"gsiftp://matrix.datamat.it:5432/tmp/sim.err",
"gsiftp://grid003.ct.infn.it:6789/home/cms/sim.out",
};
InputSandbox = {
"file:///home/user/file1",
"gsiftp:///se1.pd.infn.it:1234/data/file2",
"/home/user/file3",
"file4",
"script.sh",
"sim.inp"
};
]
This JDL is very similar to the previous one. The only differences are the following:
- The executable and the standard input files have been included in the InputSandbox, and therefore they will be staged on the executing node
- Instead of specifying a single URL to be used for all the files of the OutputSandbox, a URI is specified (via the attribute OutputSandboxDestURI) for each file that has to be uploaded (the files listed as OutputSandbox)
- The attribute InputSandboxBaseURI hasn’t been specified, so the files file4, script.sh and sim.inp will be copied from the file system of the client (UI) machine
BLAH User’s Guide¶
BLAH Introduction¶
BLAHPD is a lightweight component accepting commands according to the BLAH (Batch Local Ascii Helper) protocol to manage jobs on different Local Resource Management Systems (LRMS). The BLAH service provides a minimal, pragmatically designed common interface for job submission and control to a set of batch systems. Interaction with BLAH is achieved via text commands whose syntax and semantics are described below. Interaction with each one of the various supported batch systems is achieved via customized scripts (the existing scripts are all Bourne shell scripts) that take care of executing the following functions:
- job submit
- job hold
- job resume
- job status
- job cancel
BLAH Portability¶
The purpose of this writeup is to:
- describe the assumptions that are currently made on the minimal functionality that batch systems must provide;
- describe the details of the command line arguments that are passed to each one of the five scripts that implement the functionality described above for a specific batch system;
- provide guidelines for porting of the existing scripts, especially the submit script for which helper shell functions were defined.
As the universe of different batch systems that need to be supported is very small, the most efficient way to get help beyond the information collected in these notes is to interact with the BLAH developers directly via the e-mail address blah-help@mi.infn.it. The string XXX, in the following text, is meant to be replaced with the name of the specific batch system being interfaced (pbs, lsf, condor, etc.).
BLAH assumes that batch systems are able to:
- Identify jobs for subsequent operations via a unique, constant identifier that is returned at the time of job submission.
- Transfer and optionally rename a configurable set of files from a given location on the submit node to the initial working directory on the selected worker node before the job execution starts.
- Transfer and optionally rename a configurable set of files from the initial working directory of job on the worker node that is selected for execution to a given location on the submit node.
- Provide status of currently running jobs via an appropriate command.
- Provide a historical track of jobs that were accepted and ran in the past via one or more log files.
BLAH can use the optional batch system functionality of holding (suspending) and resuming jobs via appropriate commands.
BLAH doesn’t require batch systems to be able to renew X509 proxies and/or transfer files to and from the worker node while the job is being executed. Proxy renewal is taken care of by a proxy renewal daemon that runs at the side of the running job and receives delegations of refreshed proxies during the lifetime of the job.
The supported batch systems¶
At the time of writing BLAH supports the following batch systems: LSF, PBS/Torque, SGE, Condor and, most recently, SLURM. In order to enable BLAH to interact with the selected batch system, it must be configured properly by setting a few parameters in its configuration file, which is located by default at /etc/blah.config. The file contains a list of parameters whose default values are suitable for a standard BLAH setup; they can be customized if needed. For example, to enable SLURM support it is enough to set the parameter slurm_binpath, which tells BLAH where the SLURM executables (i.e. scontrol and sacct) are located.
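For example, assuming the SLURM client tools live under /usr/bin, the relevant line in /etc/blah.config would be:
slurm_binpath=/usr/bin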
XXX_submit.sh script¶
Submit a new job request to the batch system.
NOTE: most of the functionality for parsing and handling the submit script arguments is provided by a set of shell functions that are described further below (see SUBMIT SCRIPT HELPER FUNCTIONS). The argument description is provided for reference only. SYNOPSIS:
XXX_submit.sh -c <command> -q <queue> [-i <stdin_file>]
[-o <stdout_file>] [-e <stderr_file>] [-x <x509userproxy>]
[-v <environment> | -V <environment>] [-s <YES | no>]
[-d <YES | no>] [-w <workdir>] [-n <number_of_MPI_nodes>]
[-r <YES | no>] [-p <proxyrenew_poll_interval>]
[-l <minimum_left_lifetime_proxy>] [-j <jobid_in_caller>]
[-T <location_temp_of_dir>] [-I <list_input_of_additional_files>]
[-O <list_of_additional_files_output>]
[-R <list_of_remap_file_output_rules>] [-C <ce_file_requirements>]
[-- command_arguments]
Any argument after ‘--’ will be passed to the user job (“command”).
The command line switches have the following meaning (switches listed in alphabetical order):
- [-C <ce_file_requirements>]: When this argument is present, the local script XXX_local_submit_attributes.sh is called, and its output is pasted to the batch system submit file. The XXX_local_submit_attributes.sh script is called after sourcing the contents of the ‘CE requirements file’ first. This file is composed by the BLAH main daemon and sets shell variables that specify attribute bounds derived from the CERequirements attribute in the BLAH submit command. Example format for the ‘CE requirements file’: GlueHostMainMemoryRAMSize_Min=1000 GlueCEPolicyMaxCPUTime_Max=30
- [-c <command>]: (shell) command to be executed as a batch job.
- [-d <YES | no>]: Debug option. If set to ‘yes’, it causes the submit file to be sent to /dev/tty, and no actual submission to the batch system.
- [-e <stderr_file>]: Location of the local file that should receive the job STDERR upon job termination.
- [-I <list_input_of_additional_files>]: Points to an optional, temporary file containing a list of additional files (one per line) that need to be transferred from the submission node to the initial working directory on the worker node. This temp file is removed by the submit script.
- [-i <stdin_file>]: Location of a local file to be transferred to the worker node and connected to the STDIN of the job.
- [-j <jobid_in_caller>]: String for unique identification of the job by the calling layer. As this string is unique, it is used to name the submit script, which will be created as ‘cream_jobID’.
- [-l <minimum_left_lifetime_proxy>]: Minimum remaining lifetime of the user proxy (expressed in seconds) before the proxy renewal daemon will kill the user job. Defaults to 180 seconds, or 3 minutes.
- [-n <number_of_MPI_nodes>]: Number of MPI nodes to be reserved for the job, in case this feature is supported by the underlying batch system.
- [-O <list_of_additional_files_output>]: Points to an optional, temporary file containing a list of additional files (one per line) that need to be transferred from the working directory on the worker node back to the submit node. This temp file is removed by the submit script. The -R argument can be used to establish file name remapping.
- [-o <stdout_file>]: Location of the local file that should receive the job STDOUT upon job termination.
- [-p <proxyrenew_poll_interval>]: Time interval (expressed in seconds) between attempts of the proxy renewal daemon to check that the user process is still alive. Defaults to 60 seconds.
- [-q <queue_name>]: Batch system queue to run the job in.
- [-R <list_of_remap_file_output_rules>]: Points to a file containing the list of names under which the output files are transferred from the worker node to the submit (local) node. The list file will be deleted by the submit script.
- [-r <YES | no>]: Disables the entire BLAH proxy renewal machinery when set to ‘no’. Defaults to ‘yes’ if missing.
- [-T <location_temp_of_dir>]: Sets the directory location for storing temporary files.
- [-v <environment> | -V <environment>]: Environment variables to be set for the user job. Two formats are possible:
  - semicolon-separated assignments: -v “ENV1=val1;ENV2=val2;…”
  - space-separated assignments: -V “ENV1=val1 ENV2=val2 …”
- [-x <x509userproxy>]: Location of the initial X509 user proxy to associate with the job. This file is initially transferred with the job, then (optionally, see -r) renewed via proxy delegation.
- [-w <workdir>]: Directory location pre-pended to relative file paths and used as CWD by the submit script.
RETURN VALUE: Termination code is zero on success, nonzero on error. On success the job identifier, prefixed by the identifier string ‘BLAHP_JOBID_PREFIX’, is returned on STDOUT. The job identifier must start with the batch system name followed by a slash (XXX/id).
SUBMIT SCRIPT HELPER FUNCTIONS: A set of shell functions was written to ease the parsing and handling of the submit script options. They allow writing a submit script along the following template:
#!/bin/bash
# 1. Source definition of helper functions.
. `dirname $0`/blah_common_submit_functions.sh
# 2. Parse the submit options, that are used to set
# a list of bls_opt_XYZ shell variables.
bls_parse_submit_options "$@"
# 3. Set up temporary files. Set a variable with the name of
# an environment variable that holds the batch system job ID
# at runtime.
bls_setup_all_files
bls_job_id_for_renewal=XXX_JOBID
# 4. Start writing the submit script to $bls_tmp_file
cat > $bls_tmp_file << end_of_preamble
#!/bin/bash
# PBS job wrapper generated by `basename $0`
# on `/bin/date`
# proxy_string = $bls_opt_proxy_string
# proxy_local_file = $bls_proxy_local_file
# Etc. Etc. Etc.
end_of_preamble
# 5. Handle script local customisations according to -C option
# as appropriate for the batch system at hand.
if [ ! -z "$bls_opt_req_file" ] ; then
echo \#\!/bin/sh >> ${bls_opt_req_file}-temp_req_script
cat $bls_opt_req_file >> ${bls_opt_req_file}-temp_req_script
echo "source ${GLITE_LOCATION:-/opt/glite}/bin/XXX_local_submit_attributes.sh" >> ${bls_opt_req_file}-temp_req_script
chmod +x ${bls_opt_req_file}-temp_req_script
${bls_opt_req_file}-temp_req_script >> $bls_tmp_file 2> /dev/null
rm -f ${bls_opt_req_file}-temp_req_script
rm -f $bls_opt_req_file
fi
# 6. Add specific directives to select queue ($bls_opt_queue) and
# MPI node request ($bls_opt_mpinodes)
# 7. Add directives to transfer and rename input and output files.
# These are stored as
# $bls_inputsand_local_0...$bls_inputsand_local_n-1
# $bls_inputsand_remote_0...$bls_inputsand_remote_n-1
# $bls_outputsand_local_0...$bls_outputsand_local_n-1
# $bls_outputsand_remote_0...$bls_outputsand_remote_n-1
#
# Two shell functions can help here.
# a:
# bls_fl_subst_and_accumulate inputsand "@@F_REMOTE/@@F_LOCAL" "sep"
# bls_fl_subst_and_accumulate outputsand "@@F_REMOTE/@@F_LOCAL" "sep"
# fill $bls_fl_subst_and_accumulate_result with a list of "sep"
# separated strings formatted as shown in the second argument.
# The submit node full file path is substituted to @@F_LOCAL
# and the worker node path relative to the initial working dir
# is substituted to @@F_REMOTE.
# b:
# bls_fl_subst_and_dump inputsand "@@F_LOCAL>@@F_REMOTE" $bls_tmp_file
# bls_fl_subst_and_dump outputsand "@@F_LOCAL<@@F_REMOTE" $bls_tmp_file
# append to $bls_tmp_file a line for each input and output file,
# where @@F_REMOTE and @@F_LOCAL are substituted as above.
# 8. Append job wrapper as a shell script to $bls_tmp_file
bls_add_job_wrapper
# 9. Send the submit file $bls_tmp_file to the batch system and
# try making sure it doesn't get lost.
# 10. Echo to STDOUT the unique job ID to be used by subsequent scripts
# (with BLAHP_JOBID_PREFIX) and wrap up. The job ID must be
# properly understood by subsequent commands.
echo "BLAHP_JOBID_PREFIXXXX?$jobID"
bls_wrap_up_submit
exit $retcode
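To make the substitution semantics of point 7 concrete, here is a sketch of what the two helpers would produce for a sandbox with a single input file (file names invented):
# Assume: bls_inputsand_local_0=/home/user/in.dat
#         bls_inputsand_remote_0=in.dat
#
# bls_fl_subst_and_accumulate inputsand "@@F_REMOTE/@@F_LOCAL" ","
#   -> bls_fl_subst_and_accumulate_result="in.dat//home/user/in.dat"
#
# bls_fl_subst_and_dump inputsand "@@F_LOCAL>@@F_REMOTE" $bls_tmp_file
#   -> appends to $bls_tmp_file the line:
#      /home/user/in.dat>in.dat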
XXX_hold.sh script¶
Suspend the execution of a job.
SYNOPSIS:
XXX_hold.sh <job identifier>
The job identifier must be the same string returned by the submit script (without the leading BLAHP_JOBID_PREFIX). Any leading part up to the first slash ‘/’ will be ignored by the script.
RETURN VALUE: Termination code is zero on success, nonzero on error.
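A sketch of the identifier handling shared by XXX_hold.sh and the other job-control scripts below (variable names are illustrative):
# $1 is the job identifier, e.g. "pbs/20051012/2957.grid001.mi.infn.it"
requested=$1
jobid=${requested#*/}   # drop everything up to and including the first slash
# jobid is now "20051012/2957.grid001.mi.infn.it"; the batch-system-specific
# hold command is then invoked on the appropriate part of this string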
XXX_resume.sh script¶
Resume the execution of a (previously suspended) job.
SYNOPSIS:
XXX_resume.sh <job identifier>
The job identifier must be the same string returned by the submit script (without the leading BLAHP_JOBID_PREFIX). Any leading part up to the first slash ‘/’ will be ignored by the script.
RETURN VALUE: Termination code is zero on success, nonzero on error.
XXX_status.sh script¶
Get current status of a job.
SYNOPSIS:
XXX_status.sh [-w] [-n] <job identifier>
The job identifier must be the same string returned by the submit script (without the leading BLAHP_JOBID_PREFIX). Any leading part up to the first slash ‘/’ will be ignored by the script.
[-n]
: Option specific to the CREAM caller. Return the port used by the BLParser (see description below) to communicate with the CREAM service.
[-w]
: Return the worker node where the job is running as an attribute of the output classad (WorkerNode = "host.domain").
RETURN VALUE: The script must return a string-formatted classad (see http://www.cs.wisc.edu/condor/classad/refman/ for the complete reference on classad syntax) containing at least the following attributes:
BatchjobId = <jobId without the leading batch system name>
JobStatus = <status_code>
possible values for the status code are 1 = IDLE, 2 = RUNNING, 3 = REMOVED, 4 = COMPLETED, 5 = HELD
ExitCode = <code>
only for COMPLETED jobs
Example of status script results:
job queued
[ BatchjobId = "26526.atlfarm006.mi.infn.it"; JobStatus = 1 ]
job completed
[ ExitCode = 0; BatchjobId = "26526.atlfarm006.mi.infn.it"; JobStatus = 4 ]
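As an illustration of how a status script could produce such a classad, here is a minimal PBS-flavoured sketch (the qstat parsing, identifier handling and state mapping are deliberately simplified; real scripts also report ExitCode and support the -w and -n options described above):
#!/bin/bash
# Illustrative only: map a qstat state letter to the BLAH status codes.
id=${1#*/}                 # strip the leading "batchsystem/" part (simplified)
state=$(qstat -f "$id" 2>/dev/null | awk '/job_state/ {print $3}')
case "$state" in
    Q|W) code=1 ;;         # IDLE
    R)   code=2 ;;         # RUNNING
    C|E) code=4 ;;         # COMPLETED
    H)   code=5 ;;         # HELD
    *)   code=3 ;;         # not found: assume REMOVED
esac
echo "[ BatchjobId = \"$id\"; JobStatus = $code ]"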
XXX_cancel.sh script¶
Remove a job from the batch system queue.
SYNOPSIS:
XXX_cancel.sh <job identifier>
The job identifier must be the same string returned by the submit script (without the leading BLAHP_JOBID_PREFIX). Any leading part up to the first slash ‘/’ will be ignored by the script.
RETURN VALUE: Termination code is zero on success, nonzero on error.
BLAH forward requirements to the local batch system¶
The user can set requirements to be forwarded to the local batch system by using the CERequirements attribute, as defined above, in the blah_job_submit command. This can be achieved both with direct submission to the CREAM CE and with submission to the CE via the WMS, as explained in the following:
- direct submission to CREAM -> the attributes to be forwarded are specified in the JDL CERequirements attribute and are the ones of the GlueSchema in use (see the illustrative fragment below)
- submission to a CE via WMS -> the CERequirements attribute for blah_job_submit is filled taking into account the value of the job JDL Requirements expression and what is specified as CeForwardParameters in the WMS configuration file (workloadmanager section). Also in this case the parameters to be forwarded are chosen from the GlueSchema in use
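For instance, with direct submission the JDL could carry the requirements like this (a purely illustrative fragment; the file name and the attribute value are invented, and the usable attributes depend on the GlueSchema in use):
$ cat ceReq.jdl
[
Type="Job";
JobType="Normal";
Executable = "/bin/hostname";
StdOutput = "std.out";
StdError = "std.err";
OutputSandbox = {"std.out","std.err"};
CERequirements = "other.GlueHostMainMemoryRAMSize >= 1000";
]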
BLAH Commands syntax and semantics¶
BLAH Commands¶
The following list of commands represents the set of commands required
for interaction with the BLAHP server, interfacing to a given Local
Resource Management system. This is based on the minimum set of commands
used in the original GAHP (v1.0.0) specification removing commands that
are specific to the operation of the GRAM protocol
(INITIALIZE_FROM_FILE, GASS_SERVER_INIT, GRAM_CALLBACK_ALLOW,
GRAM_JOB_CALLBACK_REGISTER, GRAM_PING). The JOB_SIGNAL command may initially be left unimplemented for some of the batch systems; in that case it will return an "E" error response and will not be listed by COMMANDS.
- BLAH_JOB_CANCEL
- BLAH_JOB_SIGNAL
- BLAH_JOB_HOLD
- BLAH_JOB_REFRESH_PROXY
- BLAH_JOB_RESUME
- BLAH_JOB_STATUS
- BLAH_JOB_STATUS_ALL
- BLAH_JOB_STATUS_SELECT
- BLAH_JOB_SUBMIT
- BLAH_SET_GLEXEC_DN
- BLAH_SET_GLEXEC_OFF
- COMMANDS
- CACHE_PROXY_FROM_FILE
- QUIT
- RESULTS
- USE_CACHED_PROXY
- UNCACHE_PROXY
- VERSION
Optionally, the following two commands may also be implemented:
- ASYNC_MODE_ON
- ASYNC_MODE_OFF
BLAHP Commands structure¶
Conventions and Terms used¶
Below are definitions for the terms used in the sections to follow:
<CRLF>
: The characters carriage return and line feed (in that order), or solely the line feed character.
<SP>
: The space character.
line
: A sequence of ASCII characters ending with a <CRLF>.
Request Line
: A request for action on the part of the BLAHP server.
Return Line
: A line immediately returned by the BLAHP server upon receiving a Request Line.
Result Line
: A line sent by the BLAHP server in response to a RESULTS request, which communicates the results of a previous asynchronous command Request.
S: and R:
: In the Example sections for the commands below, the prefix “S: ” is used to signify what the client sends to the BLAHP server. The prefix “R: ” is used to signify what the client receives from the BLAHP server. Note that the “S: ” or “R: ” should not actually be sent or received.
Commands structure¶
BLAHP commands consist of three parts:
- Request Line
- Return Line
- Result Line
Each of these “Lines” consists of a variable length character string ending with the character sequence <CRLF>.
A Request Line is a request from the client for action on the part of the BLAHP server. Each Request Line consists of a command code followed by argument field(s). Command codes are a string of alphabetic characters. Upper and lower case alphabetic characters are to be treated identically with respect to command codes. Thus, any of the following may represent the blah_job_submit command:
- blah_job_submit
- Blah_Job_Submit
- blAh_joB_suBMit
- BLAH_JOB_SUBMIT
In contrast, the argument fields of a Request Line are case sensitive.
The Return Line is always generated by the server as an immediate response to a Request Line. The first character of a Return Line will contain one of the following characters:
- S - for Success
- F - for Failure
- E - for a syntax or parse Error
Any Request Line which contains an unrecognized or unsupported command, or a command with an insufficient number of arguments, will generate an “E” response.
The Result Line is used to support commands that would otherwise block. Any BLAHP command which may require the implementation to block on network communication requires a “request id” as part of the Request Line. For such commands, the Return Line just communicates whether the request has been successfully parsed and queued for service by the BLAHP server. At this point, the BLAHP server would typically dispatch a new thread to actually service the request. Once the request has completed, the dispatched thread should create a Result Line and enqueue it until the client issues a RESULTS command.
Transparency¶
Arguments on a particular Line (be it Request, Return, or Result) are typically separated by a <SP>. In the event that a string argument needs to contain a <SP> within the string itself, it may be escaped by placing a backslash (“\”) in front of the <SP> character. Thus, the character sequence “\ ” (no quotes) must not be treated as a separator between arguments, but instead as a space character within a string argument.
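A minimal shell sketch of this escaping rule (the helper name is invented):
# Escape embedded spaces before placing a string argument on a Request Line
escape_arg() { printf '%s' "$1" | sed 's/ /\\ /g'; }
# Example: escape_arg "No error"   produces   No\ error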
Sequence of Events¶
Upon startup, the BLAHP server should output to stdout a banner string which is identical to the output from the VERSION command without the beginning “S ” sequence (see example below). Next, the BLAHP server should wait for a complete Request Line from the client (e.g. stdin). The server is to take no action until a Request Line sequence is received.
Example:
R: $GahpVersion: x.y.z Feb 31 2004 INFN\ Blahpd $
S: COMMANDS
R: S COMMANDS BLAH_JOB_CANCEL BLAH_JOB_SIGNAL BLAH_JOB_STATUS BLAH_JOB_SUBMIT COMMANDS QUIT RESULTS VERSION
S: VERSION
R: S $GahpVersion: x.y.z Feb 31 2004 INFN\ Blahpd $
(other commands)
S: QUIT
R: S
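As an illustration, the exchange above can be driven from a shell script. This sketch (assuming a blahpd executable in the PATH) starts the server as a coprocess, reads the banner, and issues COMMANDS and QUIT:
#!/bin/bash
coproc BLAHP { blahpd; }                 # server reads stdin, writes stdout
read -r banner <&"${BLAHP[0]}"           # startup banner ($GahpVersion: ...)
echo "COMMANDS" >&"${BLAHP[1]}"
read -r commands <&"${BLAHP[0]}"         # e.g. "S COMMANDS BLAH_JOB_SUBMIT ..."
echo "QUIT" >&"${BLAHP[1]}"
read -r reply <&"${BLAHP[0]}"            # "S"
echo "Banner: $banner"
echo "Supported: $commands"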
BLAH Commands syntax¶
This section contains the syntax for the Request, Return, and Result line for each of the following commands:
COMMANDS
: List all the commands from this protocol specification which are implemented by this BLAHP server.
Request Line: COMMANDS <CRLF>
Return Line: S <SP> <COMMAND 1> <SP> <COMMAND 2> <SP> … <CRLF>
Result Line: None.
VERSION
: Return the version string for this BLAHP. The version string follows a specified format (see below). Ideally, the entire version string, including the starting and ending dollar sign ($) delimiters, should be a literal string in the text of the BLAHP server executable, so that the Unix/RCS “ident” command can produce the version string. The version returned should correspond to the version of the protocol supported.
Request Line: VERSION <CRLF>
Return Line: S <SP> $GahpVersion: <SP> <major>.<minor>.<subminor> <SP> <build-month> <SP> <build-day-of-month> <SP> <build-year> <SP> <general-descrip> <SP> $ <CRLF>
- major.minor.subminor = for this version of the protocol, use version 1.0.0.
- build-month = string with the month abbreviation when this BLAHP server was built or released. Permitted values are: “Jan”, “Feb”, “Mar”, “Apr”, “May”, “Jun”, “Jul”, “Aug”, “Sep”, “Oct”, “Nov”, and “Dec”.
- build-day-of-month = day of the month when BLAHP server was built or released; an integer between 1 and 31 inclusive.
- build-year = four digit integer specifying the year in which the BLAHP server was built or released.
- general-descrip = a string identifying a particular BLAHP server implementation.
Result Line: None.
Example:
S: VERSION
R: S $GahpVersion: x.y.z Feb 31 2004 INFN\ Blahpd $
QUIT
: Free any/all system resources (close all sockets, etc.) and terminate as quickly as possible.
Request Line: QUIT <CRLF>
Return Line: S <CRLF>
Immediately afterwards, the command pipe should be closed and the BLAHP server should terminate.
Result Line: None.
RESULTS
: Display all of the Result Lines which have been queued since the last RESULTS command was issued. Upon success, the first return line specifies the number of subsequent Result Lines which will be displayed. Then each result line appears (one per line); each starts with the request ID which corresponds to the request ID supplied when the corresponding command was submitted. The exact format of the Result Line varies based upon which Request command was issued.
IMPORTANT: Result Lines must be displayed in the exact order in which they were queued!!! In other words, the Result Lines displayed must be sorted in the order by which they were placed into the BLAHP’s result line queue, from earliest to most recent.
Request Line: RESULTS <crlf>
Return Line(s): S <SP> <num-of-subsequent-result-lines> <CRLF> <reqid> <SP> … <CRLF> <reqid> <SP> … <CRLF> …
reqid = integer Request ID, set to the value specified in the corresponding Request Line.
Result Line: None.
Example:
S: RESULTS
R: S 1
R: 100 0
ASYNC_MODE_ON
: Enable asynchronous notification when the BLAHP server has results pending for a client. This is most useful for clients that do not want to periodically poll the BLAHP server with a RESULTS command. When asynchronous notification mode is active, the GAHP server will print out an “R” (without the quotes) on column one when the RESULTS command would return one or more lines. The “R” is printed only once between successive RESULTS commands. The “R” is also guaranteed to only appear in between atomic return lines; the “R” will not interrupt another command’s output.
If there are already pending results when the asynchronous results available mode is activated, no indication of the presence of those results will be given. A GAHP server is permitted to only consider changes to its result queue for additions after the ASYNC_MODE_ON command has successfully completed. GAHP clients should issue a RESULTS command immediately after enabling asynchronous notification, to ensure that any results that may have been added to the queue during the processing of the ASYNC_MODE_ON command are accounted for.
Request Line: ASYNC_MODE_ON <CRLF>
Return Line: S <CRLF>
Immediately afterwards, the client should be prepared to handle an R <CRLF> appearing in the output of the GAHP server.
Result Line: None.
Example:
S: ASYNC_MODE_ON
R: S
S: BLAH_JOB_CANCEL 00001 123.bbq.mi.infn.it
R: S
S: BLAH_JOB_CANCEL 00002 124.bbq.mi.infn.it
R: S
R: R
S: RESULTS
R: S 2
R: 00001 0
R: 00002 0
Note that you are NOT guaranteed that the “R” will not appear between the dispatching of a command and the return line(s) of that command; the GAHP server only guarantees that the “R” will not interrupt an in-progress return. The following is also a legal example:
S: ASYNC_MODE_ON
R: S
S: BLAH_JOB_CANCEL 00001 123.bbq.mi.infn.it
R: S
S: BLAH_JOB_CANCEL 00002 124.bbq.mi.infn.it
R: R
R: S
S: RESULTS
R: S 2
R: 00001 0
R: 00002 0
ASYNC_MODE_OFF
: Disable asynchronous results-available notification. In this mode, the only way to discover available results is to poll with the RESULTS command. This mode is the default. Asynchronous mode can be enabled with the ASYNC_MODE_ON command.
Request Line: ASYNC_MODE_OFF <CRLF>
Return Line: S <CRLF>
Result Line: None.
Example:
S: ASYNC_MODE_OFF
R: S
BLAH_JOB_SUBMIT¶
Submit a job request to a specified queue (specified in the submit classad). This will cause the job to be submitted to the batch system.
Request Line: BLAH_JOB_SUBMIT <SP> <reqid> <SP> <submit classad> <CRLF>
reqid = non-zero integer Request ID
submit classad = valid submit description for the job, in string representation. See paragraph 3.0 for a description of the format. Here’s a list of supported attributes with a brief description.
“Cmd”: Full path of the executable in the local filesystem
“Args”: List of individual arguments (no ‘/bin/sh’ convention on argument separation, but separate arguments) for the executable
“In”: Full path in the local filesystem where the standard input for the executable is found
“Out”: Full path in the local filesystem where the standard output of the executable will be stored (at job completion).
“Err”: Full path in the local filesystem where the standard error of the executable will be stored (at job completion).
“X509UserProxy”: Full path where the proxy certificate is stored.
“Env”: Semicolon-separated list of environment variables of the form:
<parameter> = <value>
“Stagecmd”: Sets whether the executable of the job must be copied to the WorkerNode; can be “TRUE” or “FALSE”.
“Queue”: Queue in the local batch system where the job must be enqueued.
“Gridtype”: String indicating the underlying local batch system (currently “pbs” and “lsf” supported).
“uniquejobid”: unique name identifier for the final job to be submitted to the local batch system.
“NodeNumber”: number of nodes to be reserved for an MPI job. It gets translated into the qsub option “-l nodes=<n>” for pbs or into “-n <n>” for lsf. For Condor this attribute is not supported and is therefore ignored.
“CERequirements”: string containing the requirements to be forwarded to the local batch system
“TransferInput = file1,file2,file…”: comma-delimited list of all the files to be transferred into the working directory for the job before the job is started. Only files can be transferred; the transfer of subdirectories is not supported.
“TransferOutput = file1,file2,file…”: an explicit list of output files to be transferred back from the temporary working directory on the execute machine to the submit machine (where BLAHPD is running). By default, only the standard output and standard error files are transferred back, if requested.
“TransferOutputRemaps = name1=newname1;name2=newname2;name=newname…”: This attribute specifies the name (and optionally the complete path) to use when downloading output files from the completed job. Note that the mappings are separated by semicolons. Normally, output files are transferred back to the initial working directory with the same name they had in the execution directory. This gives you the option to save them with a different path or name. If you specify a relative path, the final path will be relative to the job’s initial working directory (the directory specified in the ‘Iwd’ attribute, or, if this is missing, the current working directory of BLAHPD)
Return Line:
<result> <CRLF>
- result = the character “S” (no quotes) for successful submission of the request (meaning that the request is now pending), or an “E” for error on the parse of the request or its arguments (e.g. an unrecognized or unsupported command, or for missing or malformed arguments).
Result Lines: <reqid> <sp> <result-code> <sp> <error-string> <sp> <job_local_id> <crlf>
- reqid = integer Request ID, set to the value specified in the corresponding Request Line.
- result-code = integer equal to 0 on success, or an error code
- error-string = description of error
- job_local_id = on success, a string representing a unique
identifier for the job. This identifier must not be bound to this
BLAHP server, but instead must be allowed to be used in subsequent
BLAHP server instantiations. For instance, the job_local_id must be
implemented in such a fashion that the following sequence of events
by the caller must be permissible:
- step a: issue a BLAH_JOB_SUBMIT command
- step b: read the job_local_id in the result line
- step c: store the job_local_id persistently
- step d: subsequently kill and restart the BLAHP server process
- step e: issue a BLAH_JOB_CANCEL command, passing it the stored job_local_id value obtained in step (b).
Example:
S: BLAH_JOB_SUBMIT 2 [\ Cmd\ =\ "/usr/bin/test.sh";\ Args\ =\ "'X=3:Y=2'";
\ Env\ =\ "VAR1=56568";\ In\ =\ "/dev/null";\ Out\ =\ "/home/StdOutput";
\ Err\ =\ "/home/error";\ x509userproxy\ =\ "/home/123.proxy";\ Stagecmd
\ =\ TRUE;\ Queue\ =\ "short";\ GridType\ =\ "pbs";\ ]
R: S
S: RESULTS
R: 2 0 No\ error pbs/20051012/2957
BLAH_JOB_CANCEL¶
This function removes an IDLE job request, or kills all processes associated with a RUNNING job, releasing any associated resources.
Request Line: BLAH_JOB_CANCEL <sp> <reqid> <sp> <job_local_id> <CRLF>
- reqid = non-zero integer Request ID
- job_local_id = job_local_id (as returned from BLAH_JOB_SUBMIT) of the job to be canceled.
Return Line: <result> <crlf>
- result = the character “S” (no quotes) for successful submission of the request (meaning that the request is now pending), or an “E” for error on the parse of the request or its arguments (e.g. an unrecognized or unsupported command, or for missing or malformed arguments).
Result Line: <reqid> <sp> <result-code> <sp> <error-string> <CRLF>
- reqid = integer Request ID, set to the value specified in the corresponding Request Line.
- result-code = integer equal to 0 on success, or an error code
- error-string = description of error
Example:
S: BLAH_JOB_CANCEL 1 pbs/20051012/2957.grid001.mi.infn.it
R: S
R: R
S: RESULTS
R: S 1
R: 1 0 No\ error
BLAH_JOB_STATUS¶
Query and report the current status of a submitted job.
Request Line: BLAH_JOB_STATUS <sp> <reqid> <sp> <job_local_id> <CRLF>
- reqid = non-zero integer Request ID
- job_local_id = job_local_id (as returned from BLAH_JOB_SUBMIT) of the job whose status is desired.
Return Line: <result> <CRLF>
- result = the character “S” (no quotes) for successful submission of the request (meaning that the request is now pending), or an “E” for error on the parse of the request or its arguments (e.g. an unrecognized or unsupported command, or for missing or malformed arguments).
Result Line: <reqid> <sp> <result-code> <sp> <error-string> <sp> <job_status> <sp> <result-classad> <CRLF>
- reqid = integer Request ID, set to the value specified in the corresponding Request Line.
- result-code = integer equal to 0 on success, or an error code
- error-string = description of error
- job_status = if the result_code is 0 (success), then job_status is
set to an integer based upon the status of the job as follows:
- 1 IDLE (job is waiting on the batch system queue)
- 2 RUNNING (job is executing on a worker node)
- 3 REMOVED (job was successfully cancelled)
- 4 COMPLETED (job completed its execution on the batch system)
- 5 HELD (job execution is suspended; job is still in the batch system queue)
- result-classad = Aggregate information about the job status. The
classad format can vary with the local batch system. Typically, the
following attributes are defined:
- JobStatus : job status - same codes as described above
- BatchjobId : Job ID as known to the local batch system
- ExitCode : Termination code - only for finished jobs
- ExitReason : Exit reason, if available - only for finished jobs
- WorkerNode : When available, FQDN of the worker node - only for running jobs
Example:
S: BLAH_JOB_STATUS 1 pbs/20051012/2958.grid001.mi.infn.it
R: S
R: R
S: RESULTS
R: S 1
R: 1 0 No\ Error 2 [\ BatchjobId\ =\ "2958.grid001.mi.infn.it";
\ JobStatus\ =\ 2;\ WorkerNode\ =\ "\ grid001.mi.infn.it"\ ]
BLAH_JOB_STATUS_ALL¶
This command is only available if the BLAH local job registry file is configured in the BLAH config file (job_registry attribute) and supported by the active batch system. Query and report the current status of all jobs managed by the BLAH server.
Request Line: BLAH_JOB_STATUS_ALL <sp> <reqid> <CRLF>
- reqid = non-zero integer Request ID
Return Line: <result> <CRLF>
- result = the character “S” (no quotes) for successful submission of the request (meaning that the request is now pending), or an “E” for error on the parse of the request or its arguments (e.g. an unrecognized or unsupported command, or for missing or malformed arguments).
Result Line: <reqid> <sp> <result-code> <sp> <error-string> <sp> <result-classad> <CRLF>
- reqid = integer Request ID, set to the value specified in the corresponding Request Line.
- result-code = integer equal to 0 on success, or an error code
- error-string = description of error
- result-classad = List of classads containing aggregate information
about the job status. In addition to the attributes defined for the
output of the BLAH_JOB_STATUS command, as the status info comes
from a local cache, the following attributes may also be present:
- BlahJobId: Job ID as known to the BLAH layer. Typically a decorated form of BatchJobId
- CreateTime: Seconds since the Unix epoch at the time the first job info was inserted into the cache.
- ModifiedTime: Seconds since the Unix epoch when the most recent modification of the job info occurred.
BLAH_JOB_STATUS_SELECT¶
This command is only available if the BLAH local job registry file is configured in the BLAH config file (job_registry attribute) and supported by the active batch system. Query and report the current status of the jobs managed by the BLAH server that match a given selection expression.
Request Line: BLAH_JOB_STATUS_SELECT <sp> <reqid> <sp> <selection expression> <CRLF>
- reqid = non-zero integer Request ID
- selection expression = Classad expression defining select requirements on the returned job ads (e.g.: JobStatus ==2).
Return Line: <result> <crlf>
- result = the character “S” (no quotes) for successful submission of the request (meaning that the request is now pending), or an “E” for error on the parse of the request or its arguments (e.g. an unrecognized or unsupported command, or for missing or malformed arguments).
Result Line: <reqid> <sp> <result-code> <sp> <error-string> <sp> <result-classad> <CRLF>
- reqid = integer Request ID, set to the value specified in the corresponding Request Line.
- result-code = integer equal to 0 on success, or an error code
- error-string = description of error
- result-classad = List of classads containing aggregate information about the status of the jobs that satisfy the selection requirement. The format is the same as for the BLAH_JOB_STATUS_ALL command.
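Example (not part of the original specification; the selection expression is escaped according to the transparency rules and the returned ad is invented):
S: BLAH_JOB_STATUS_SELECT 1 JobStatus\ ==\ 2
R: S
R: R
S: RESULTS
R: S 1
R: 1 0 No\ Error [\ BatchjobId\ =\ "2958.grid001.mi.infn.it";\ JobStatus\ =\ 2\ ]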
BLAH_JOB_SIGNAL¶
Send a signal (if possible) to a specified job. The job has to be in the RUNNING status.
Request Line: BLAH_JOB_SIGNAL <sp> <reqid> <sp> <job_local_id> <sp> <signal> <CRLF>
- reqid = non-zero integer Request ID
- job_local_id = job_local_id (as returned from BLAH_JOB_SUBMIT) of the job to which the signal is to be sent.
- signal = an integer with the signal to send
Return Line:<result> <CRLF>
- result = the character “S” (no quotes) for successful submission of the request (meaning that the request is now pending), or an “E” for error on the parse of the request or its arguments (e.g. an unrecognized or unsupported command, or for missing or malformed arguments).
Result Line: <reqid> <sp> <result-code> <sp> <error-string> <sp> <job_status> <CRLF>
- reqid = integer Request ID, set to the value specified in the corresponding Request Line.
- result-code = integer equal to 0 on success, or an error code
- error-string = description of error
- job_status = if the result_code is 0 (success), then job_status is
set to an integer based upon the status of the job as follows
(compare above):
- 1 IDLE
- 2 RUNNING
- 3 REMOVED
- 4 COMPLETED
- 5 HELD
BLAH_JOB_REFRESH_PROXY¶
Renew the proxy of an already submitted job. The job has to be in IDLE, RUNNING or HELD status.
Request Line: BLAH_JOB_REFRESH_PROXY <sp> <reqid> <sp> <job_local_id> <sp> <proxy_file> <CRLF>
- reqid = non-zero integer Request ID
- job_local_id = job_local_id (as returned from BLAH_JOB_SUBMIT) of the job whose proxy has to be renewed.
- proxy_file = path to the fresh proxy file.
Return Line: <result> <CRLF>
- result = the character “S” (no quotes) for successful submission of the request (meaning that the request is now pending), or an “E” for error on the parse of the request or its arguments (e.g. an unrecognized or unsupported command, or for missing or malformed arguments).
Result Line: <reqid> <sp> <result-code> <sp> <error-string> <crlf>
- reqid = integer Request ID, set to the value specified in the corresponding Request Line.
- result-code = integer equal to 0 on success, or an error code
- error-string = description of error
Example:
S: BLAH_JOB_REFRESH_PROXY 1 pbs/20051012/2957.grid001.mi.infn.it 123.proxy
R: S
R: R
S: RESULTS
R: S 1
R: 1 0 No\ Error
BLAH_JOB_HOLD¶
This function always puts an IDLE job request in a HELD status. If the job is already RUNNING, it can be HELD too, depending on whether the underlying batch system supports this feature.
Request Line: BLAH_JOB_HOLD <sp> <reqid> <sp> <job_local_id> <crlf>
- reqid = non-zero integer Request ID
- job_local_id = job_local_id (as returned from BLAH_JOB_SUBMIT) of the job to be held.
Return Line: <result> <crlf>
- result = the character “S” (no quotes) for successful submission of the request (meaning that the request is now pending), or an “E” for error on the parse of the request or its arguments (e.g. an unrecognized or unsupported command, or for missing or malformed arguments).
Result Line: <reqid> <sp> <result-code> <sp> <error-string> <crlf>
- reqid = integer Request ID, set to the value specified in the corresponding Request Line.
- result-code = integer equal to 0 on success, or an error code
- error-string = description of error
Example:
S: BLAH_JOB_HOLD 1 pbs/20051012/2957.grid001.mi.infn.it
R: S
R: R
S: RESULTS
R: S 1
R: 1 0 No\ error
BLAH_JOB_RESUME¶
This function puts a HELD job request back in the status it was in before the holding action.
Request Line: BLAH_JOB_RESUME <sp> <reqid> <sp> <job_local_id> <crlf>
- reqid = non-zero integer Request ID
- job_local_id = job_local_id (as returned from BLAH_JOB_SUBMIT) of the job to be resumed.
Return Line: <result> <crlf>
- result = the character “S” (no quotes) for successful submission of the request (meaning that the request is now pending), or an “E” for error on the parse of the request or its arguments (e.g. an unrecognized or unsupported command, or for missing or malformed arguments).
Result Line: <reqid> <sp> <result-code> <sp> <error-string> <crlf>
- reqid = integer Request ID, set to the value specified in the corresponding Request Line.
- result-code = integer equal to 0 on success, or an error code
- error-string = description of error
Example:
S: BLAH_JOB_RESUME 1 pbs/20051012/2957.grid001.mi.infn.it
R: S
R: R
S: RESULTS
R: S 1
R: 1 0 No\ error
BLAH_SET_GLEXEC_DN¶
This function enables local user mapping via GLEXEC for all subsequent actions. The call is synchronous and the effect is immediate.
Request Line: BLAH_SET_GLEXEC_DN <sp> <user_proxy> <sp> <ssl_client_certificate> <sp> <not_limited> <crlf>
- user_proxy Path to a valid GSI proxy file. This will be optionally (when not_limited == 0) turned into a limited (delegated) proxy and be made available via GLEXEC to the user job. (The delegated proxy path is set in the GLEXEC_SOURCE_PROXY environment variable prior to the GLEXEC call).
- ssl_client_certificate Path to a valid GSI proxy file. Can be the same as user_proxy. This is the client certificate that is used for access to GLEXEC and local user mapping (it is set in the GLEXEC_CLIENT_CERT environment variable prior to the GLEXEC call).
- not_limited If this argument is set to a numeric value not equal to zero, the supplied proxy will not be limited (delegated).
Return Line: one of the following
- S <sp> <action description> <crlf> - in case of success
- F <crlf> - in case of failure
Result Line: None.
Example:
S: BLAH_SET_GLEXEC_DN /path/test.proxy /path/test.proxy 0
R: S Glexec\ mode\ on
BLAH_SET_GLEXEC_OFF¶
This command unconditionally disables user mapping via GLEXEC for all subsequent actions. The call is synchronous and the effect is immediate.
Request Line: BLAH_SET_GLEXEC_OFF <crlf>
Return Line: one of the following
- S <sp> <action-description> <crlf> - in case of success
- F <crlf> - in case of failure
Result Line: None.
Example:
S: BLAH_SET_GLEXEC_OFF
R: S Glexec\ mode\ off
CACHE_PROXY_FROM_FILE¶
This command is only available if enable_glexec_from_condor is set to ‘yes’ in the BLAHPD config file. Read a GSI proxy from a specified file, and store it under the specified symbolic name. Unlike other GAHPs, and because this function is only used to trigger GLEXEC access, which requires access to a proxy file, BLAH does not read and cache the proxy contents but only the file name. Changes to the file will affect the use of the stored proxy. The proxy name caching is lost when the BLAHP server terminates. The intent of this command is to allow the BLAHP server to capture and use different proxy files for different jobs. Other commands (e.g. USE_CACHED_PROXY) may later reference the given cached proxy file by its symbolic name. USE_CACHED_PROXY has the side effect of enabling GLEXEC.
Request Line: CACHE_PROXY_FROM_FILE <sp> <id> <sp> <path-to-local-proxy-file> <crlf>
- <id> = a symbolic name which will be used to reference the given proxy.
- <path-to-local-proxy-file> = a fully-qualified pathname to a file local to the BLAHP server which contains a valid GSI proxy certificate.
Return Line: one of the following:
- S <sp> <action-description> <crlf>
- F <sp> <error-string> <crlf>
Upon success, the “S” form is returned; upon failure, the “F” form.
- error-string = brief string description of the error, appropriate for reporting to a human end-user.
Result Line: None.
Example:
S: CACHE_PROXY_FROM_FILE 1 /path/test.proxy
R: S Proxy\ cached
USE_CACHED_PROXY¶
This command is only available if enable_glexec_from_condor is set to ‘yes’ in the BLAHPD config file. Sets the proxy previously cached under the specified name as the “active” proxy, and enables user mapping via GLEXEC for all subsequent actions. The active proxy will be used for GLEXEC user mapping and will remain active until it is changed via the next invocation of USE_CACHED_PROXY. The proxy must have been previously cached (under the specified name) using the CACHE_PROXY_FROM_FILE command. This command allows the BLAHP server to simultaneously act on behalf of two or more jobs that require different proxies.
Request Line: USE_CACHED_PROXY <sp> <id> <crlf>
- <id> = the symbolic name of the proxy to use
Return Line: One of the following:
- S <sp> <action-description> <crlf>
- F <sp> <error-string> <crlf>
Upon success, the “S” form is returned; upon failure, the “F” form.
- error-string = brief string description of the error, appropriate for reporting to a human end-user.
Result Line: None.
Example:
S: USE_CACHED_PROXY 1
R: S Glexec\ proxy\ set\ to\ /path/test.proxy\ (dir\ IWGRP\ bit\ already\ on)
BLParser¶
To obtain efficient access and prompt update to the current status of BLAH active jobs, a ‘log parser’ daemon was developed for LSF and PBS. The submit and status scripts can make optional use of the batch log parser to increase their efficiency.
Here’s a description of the semantics of the batch log parser (BLParser).
The BLParser listens on a network socket and replies to a set of mandatory and a set of optional queries (optional queries can be used for info and debug).
Here is the list of mandatory queries and their replies:
BLAHJOB/<blahjob-id> => <lrms-jobid>
<date-yyyymmdd>/<lrms-jobid> => a classad like this:
[ BatchJobId=<lrms-jobid> Workernode=<workernode> JobStatus=<jobstatus> LRMSSubmissionTime=<submission time> LRMSStartRunningTime=<start time> LRMSCompletedTime=<finish time> ExitReason=<exitreason> ExitCode=<exit code> ]/pr_removal
where pr_removal is a flag that tells the status script whether the proxy file has to be removed (when the job is killed or finished) or not.
CREAMPORT => the port where CREAM can connect to the parser.
TEST => prints Y<lrmsname> (e.g. YLSF, YPBS). This query is used to find out whether a parser is active, which is useful when more than one parser to try is specified in the config file: if the first one does not reply to the TEST query, the next one is tried.
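For example, the TEST query can be issued by hand with a generic TCP client (the hostname and port below are invented; the actual port is defined in the BLParser configuration):
$ echo "TEST" | nc blparser.example.org 33332
YPBS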
These are optional queries and their replies:
- HELP => print command list
- VERSION => print version
- TOTAL => print total number of jobs cached in the parser
- LISTALL => print jobid of all jobs cached in the parser
- LISTF[/<first-n-jobid>] => print first n jobid
- LISTL[/<last-n-jobid>] => print last n jobid
The BLParser includes an option to notify a client (typically the CREAM service) of every state change in CREAM-managed jobs. CREAM sends to the parser a string like this:
STARTNOTIFY/<date-yyyymmdd>
After that string, the parser sends back to CREAM every state change for CREAM jobs since the date contained in the string, with lines like this:
NTFDATE/<classad>
where <classad> is a classad like the one sent in reply to the <date-yyyymmdd>/<lrms-jobid> query.
After that, the parser will send to CREAM every new state change for CREAM jobs. CREAM can change the default prefix (cream_) used to identify tracked jobs among all other jobs with this command:
CREAMFILTER/<creamfilter>
where <creamfilter> should be a string like crXXX_ (XXX can be any character in [a-z], [A-Z], [0-9]). This command has to be sent before STARTNOTIFY.
There is a startup script for the BLParser that reads all the parameters from a config file. It is possible to set the debug level, the debug log file, the log file location; it is also possible to start different parsers listening on different ports with the same startup script.
New Parser Scenario¶
The old BLParser is now split into two different daemons:
- BUpdater (which is specific to each batch system: there are BUpdaterLSF, BUpdaterPBS and BUpdaterCondor)
- BNotifier (which is the same for all batch systems)
Both daemons are now automatically launched and controlled by the blahpd daemon, so there is no need for a startup script. The main differences between this approach and the old one are that the daemons run as a non-privileged user and that the information about jobs is collected using the batch system native commands (such as bhist, bjobs, qstat, …) and saved in a db file on disk called the job registry. This information is asynchronously notified to CREAM by the BNotifier daemon.
It is possible to configure some parameters in the blah.config file:
- _bupdater_path_ and _bnotifier_path_: if set to the complete path of bupdater (or bnotifier) (e.g. /opt/glite/bin/BUpdaterLSF), blahpd takes care of starting the daemon and periodically checking its status. If not set, the daemons should be started by hand.
- _bupdater_pidfile_ and _bnotifier_pidfile_: these are used by blahpd to check the daemons’ status (e.g. /var/tmp/bupdater.pid).
- _job_registry_: the complete path to the job registry location (e.g. /var/tmp/job_registry.db).
- _bupdater_debug_level_: sets the debug level (1, 2, 3).
- _bupdater_debug_logfile_: sets the location of the debug log file.
- _async_notification_host_: hostname where the notifications should be sent.
- _async_notification_port_: port where to send notifications.
- _purge_interval_: how old (in seconds) a job registry entry has to be before it is purged.
- _finalstate_query_interval_: the query that searches for finished jobs is executed every finalstate_query_interval seconds (default: 30).
- _alldone_interval_: after this interval an unseen job is set as done (status = 4) with exitstatus = -1 (default: 600).
- _bupdater_loop_interval_: the updater main loop is executed every bupdater_loop_interval seconds (default: 5).
- _blah_graceful_kill_mappable_cmd_: command to use when terminating processes via a user mapping tool (glexec or SUDO). Default is /bin/kill.
- _blah_graceful_kill_timeout_: total number of seconds a parent process will wait for a child to terminate after sending it the first termination (SIGTERM) signal. During this time, and until the child process terminates, the parent will send SIGTERM to the child once a second before blah_graceful_kill_timeout/2 seconds have elapsed, then it will send SIGKILL once a second. Defaults to 20 seconds.
- _blah_children_restart_interval_: minimum time in seconds that has to elapse before blahpd tries restarting one of the children processes it controls. Defaults to 150 seconds.
- _blah_require_proxy_on_submit_: Set to boolean ‘true’ to generate an error when the X509Proxy attribute is missing from the job description in the BLAH_JOB_SUBMIT command. Set to boolean ‘false’ if job submission without a valid proxy is allowed. Defaults to ‘false’.
- _blah_enable_glexec_from_condor_: Set to boolean ‘true’ to make blahpd recognize the following three commands: CACHE_PROXY_FROM_FILE, USE_CACHED_PROXY, UNCACHE_PROXY. User mapping via GLEXEC will be automatically enabled when the USE_CACHED_PROXY command is received. This is to allow user switching via glexec based on the active proxy for a given user, as cached by the Condor gridmanager. Defaults to ‘false’.
- _blah_disable_wn_proxy_renewal_: When set to boolean ‘true’, use of the BPRclient/server tool to delegate or send renewed X509 proxies to a job executing node will be entirely disabled. Defaults to ‘false’.
- _blah_delegate_renewed_proxies_: When set to boolean ‘true’, X509 proxies sent to job worker nodes via the BPRclient/server tool will be delegated instead of being copied over the wire. Defaults to ‘false’ as many Globus tools (including gridftp) will reject access when a limited -and- delegated proxy is presented.
- _blah_graceful_kill_glexecable_cmd_format_
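A minimal blah.config fragment enabling the registry-based daemons could look like this (all paths and values are examples only, reusing the defaults quoted above where available):
bupdater_path=/opt/glite/bin/BUpdaterPBS
bnotifier_path=/opt/glite/bin/BNotifier
bupdater_pidfile=/var/tmp/bupdater.pid
bnotifier_pidfile=/var/tmp/bnotifier.pid
job_registry=/var/tmp/job_registry.db
bupdater_debug_level=1
bupdater_debug_logfile=/var/log/bupdater.log
async_notification_host=cream-ce.example.org
async_notification_port=56554
finalstate_query_interval=30
alldone_interval=600
bupdater_loop_interval=5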
BLAH Accounting¶
Log records are written in a daily file named
<prefix>-YYYYMMDD
where <prefix> is specified in blah.config:
BLAHPD_ACCOUNTING_INFO_LOG=/var/log/accounting/blahp.log
The format of the record is:
"<field>=<value>" [<SP> "<field>=<value>" ...] <LF>
thus all values are assumed to contain no double quotes.
The fields are:
timestamp
: record creation date (~ LRMS submission time)
userDN
: user’s certificate distinguished name
userFQAN
: user’s certificate Fully Qualified Attribute Names (i.e. VO and roles), repeated once per FQAN
ceID
: Computing Element’s ID as appearing in the Glue schema
jobID
: the Grid job ID (i.e. the one returned to the user by the CE)
lrmsID
: the ID of the job in the batch system
localUser
: numeric uid of the local account mapped to the remote user
clientId
: a job identifier (arbitrary string) provided by the CE at submission time
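Putting the format and the field list together, a single record would look like this (all values invented for illustration):
"timestamp=2019-12-13 14:02:57" "userDN=/DC=org/DC=example/CN=Some User" "userFQAN=/enmr.eu/Role=NULL/Capability=NULL" "ceID=prod-ce-01.pd.infn.it:8443/cream-lsf-grid" "jobID=https://prod-ce-01.pd.infn.it:8443/CREAM160637101" "lrmsID=2958.grid001.mi.infn.it" "localUser=512" "clientId=cream_123456789"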
Accounting can be disabled in BLAH by commenting out the BLAHPD_ACCOUNTING_INFO_LOG line in blah.config.
CREAM monitoring¶
The CREAM CE site is monitored with the Nagios framework and a set of specific probes. The service can be tested with job submission through the WMS or with direct submission (i.e. using the CREAM CLI). The probes developed for the CREAM service must be installed on a User Interface because they use the cream-cli commands to monitor the CREAM CE.
Requirements¶
The required packages for the CREAM CE probes are:
- python (version 2.7 or greater)
- python-ldap
- python-suds (version 0.3.5 or greater)
- openssl (version 0.9.8e-12 or greater)
- nagios-submit-conf (version 0.2 or greater)
- python-GridMon (version 1.1.10)
The last two RPMs can be installed from the EGI repository.
CREAM-CE direct job submission metrics and WN probes¶
The following metrics are a restructured version of the existing ones and provide a better approach for probing a CREAM CE and its WNs:
- cream_serviceInfo.py: get CREAM CE service info
- cream_allowedSubmission.py: check if the submission to the selected CREAM CE is allowed
- cream_jobSubmit.py: submit a job directly to the selected CREAM CE
- cream_jobOutput.py: submit a job directly to the selected CREAM CE and retrieve the output-sandbox
- WN-softver probe: check the middleware version on the WN (via cream_jobOutput.py)
- WN-csh probe: check if the WN has csh (via cream_jobOutput.py)
All of them have been implemented in python and are based on the cream-cli commands. They share the same logic structure and provide useful information about their version and usage (i.e. help), including the options list and their meaning, according to the Probes Development guide. For example:
$ ./cream_serviceInfo.py
Usage: cream_serviceInfo.py [options]
cream_serviceInfo.py: error: Specify either option -u URL or option -H HOSTNAME (and -p PORT) or read the help (-h)
$ ./cream_serviceInfo.py --help
Usage: cream_serviceInfo.py [options]
Options:
--version show program's version number and exit
-h, --help show this help message and exit
-H HOSTNAME, --hostname=HOSTNAME
The hostname of the CREAM service.
-p PORT, --port=PORT The port of the service. [default: none]
-x PROXY, --proxy=PROXY
The proxy path
-t TIMEOUT, --timeout=TIMEOUT
Probe execution time limit. [default:
120 sec]
-v, --verbose verbose mode [default: False]
-u URL, --url=URL The status endpoint URL of the service. Example:
https://<host>[:<port>]
$ ./cream_serviceInfo.py --version
cream_serviceInfo v.1.1
The interaction with the CREAM CE requires a valid VOMS proxy, specified either via the X509_USER_PROXY env variable or through the --proxy option. All metrics check the existence of the proxy file and calculate the time left. In case of error, the related error message is returned:
$ ./cream_serviceInfo.py --hostname cream-41.pd.infn.it --port 8443 --proxy /tmp/x509up_u0 --verbose
Proxy file not found or not readable
The verbose mode (--verbose) can be enabled for each metric. It provides several details about the probe execution itself by highlighting the internal commands:
$ ./cream_serviceInfo.py --hostname prod-ce-01.pd.infn.it --port 8443 -x /tmp/x509up_u733 --verbose
executing command: /usr/bin/voms-proxy-info -timeleft
invoking service info
executing command: /usr/bin/glite-ce-service-info prod-ce-01.pd.infn.it:8443
CREAM serviceInfo OK: Service Version = [1.16.4 - EMI version: 3.15.0-1.el6]
In case of mistakes in the selected options or in their values, the probe tries to explain what is wrong. For example, cream_serviceInfo does not support the --queue option:
$ ./cream_serviceInfo.py --hostname prod-ce-01.pd.infn.it --port 8443 --queue creamtest1 -x /tmp/x509up_u733 --verbose
Usage: cream_serviceInfo.py [options]
cream_serviceInfo.py: error: no such option: --queue
In case of errors in interacting with the CREAM CE, useful details about the failure will be provided:
$ ./cream_allowedSubmission.py --url https://prod-ce-01.pd.infn.it:8443 -x /tmp/x509up_u733
command '/usr/bin/glite-ce-allowed-submission cream-43.pd.infn.it:8443' failed: return_code=1
details: ['2019-12-13 15:59:57,085 FATAL - Received NULL fault; the error is due to another cause: FaultString=[connection error] - FaultCode=[SOAP-ENV:Client] - FaultSubCode=[SOAP-ENV:Client] - FaultDetail=[Connection refused]\n']
Sources are available on GitHub.
cream_serviceInfo.py¶
The cream_serviceInfo.py probe retrieves information about the status of the CREAM CE. The help shows how the probe must be invoked:
$ ./cream_serviceInfo.py --help
Usage: cream_serviceInfo.py [options]
Options:
--version show program's version number and exit
-h, --help show this help message and exit
-H HOSTNAME, --hostname=HOSTNAME
The hostname of the CREAM service.
-p PORT, --port=PORT The port of the service. [default: none]
-x PROXY, --proxy=PROXY
The proxy path
-t TIMEOUT, --timeout=TIMEOUT
Probe execution time limit. [default:
120 sec]
-v, --verbose verbose mode [default: False]
-u URL, --url=URL The status endpoint URL of the service. Example:
https://<host>[:<port>]
In order to get information about the CREAM service on the host https://prod-ce-01.pd.infn.it:8443, use the following command:
$ ./cream_serviceInfo.py --url https://prod-ce-01.pd.infn.it:8443 -x /tmp/x509up_u733
CREAM serviceInfo OK: Service Version = [1.16.4 - EMI version: 3.15.0-1.el6]
or similarly:
$ ./cream_serviceInfo.py --hostname prod-ce-01.pd.infn.it --port 8443 -x /tmp/x509up_u733
CREAM serviceInfo OK: Service Version = [1.16.4 - EMI version: 3.15.0-1.el6]
cream_allowedSubmission.py¶
This is a simple metric which checks if the submission to the selected CREAM CE is allowed. Its usage is analogous to the above metric:
$ ./cream_allowedSubmission.py --help
Usage: cream_allowedSubmission.py [options]
Options:
--version show program's version number and exit
-h, --help show this help message and exit
-H HOSTNAME, --hostname=HOSTNAME
The hostname of the CREAM service.
-p PORT, --port=PORT The port of the service. [default: none]
-x PROXY, --proxy=PROXY
The proxy path
-t TIMEOUT, --timeout=TIMEOUT
Probe execution time limit. [default:
120 sec]
-v, --verbose verbose mode [default: False]
-u URL, --url=URL The status endpoint URL of the service. Example:
https://<host>[:<port>]
Notice: using the --url option is equivalent to specifying both the --hostname and --port options:
$ ./cream_allowedSubmission.py --hostname prod-ce-01.pd.infn.it --port 8443 -x /tmp/x509up_u733
CREAM allowedSubmission OK: the job submission is ENABLED
$ ./cream_allowedSubmission.py --url https://prod-ce-01.pd.infn.it:8443 -x /tmp/x509up_u733
CREAM allowedSubmission OK: the job submission is ENABLED
The verbose mode highlights the internal commands:
$ ./cream_allowedSubmission.py --url https://prod-ce-01.pd.infn.it:8443 -x /tmp/x509up_u733 --verbose
executing command: /usr/bin/voms-proxy-info -timeleft
invoking allowedSubmission
executing command: /usr/bin/glite-ce-allowed-submission prod-ce-01.pd.infn.it:8443
CREAM allowedSubmission OK: the job submission is ENABLED
cream_jobSubmit.py¶
This metric submits a job directly to the selected CREAM CE and waits until job termination, reporting the final status. Finally the job is purged. This probe does not test the output-sandbox retrieval.
$ ./cream_jobSubmit.py --help
Usage: cream_jobSubmit.py [options]
Options:
--version show program's version number and exit
-h, --help show this help message and exit
-H HOSTNAME, --hostname=HOSTNAME
The hostname of the CREAM service.
-p PORT, --port=PORT The port of the service. [default: none]
-x PROXY, --proxy=PROXY
The proxy path
-t TIMEOUT, --timeout=TIMEOUT
Probe execution time limit. [default:
120 sec]
-v, --verbose verbose mode [default: False]
-u URL, --url=URL The status endpoint URL of the service. Example:
https://<host>[:<port>]/cream-<lrms>-<queue>
-l LRMS, --lrms=LRMS The LRMS name (e.g.: 'lsf', 'pbs' etc)
-q QUEUE, --queue=QUEUE
The queue name (e.g.: 'creamtest')
-j JDL, --jdl=JDL The jdl path
The --url (-u) option must be used to point the probe at a specific CREAM CE, identified by its CREAM CE ID. Alternatively it is possible to specify the CREAM CE identifier by using the --hostname, --port, --lrms and --queue options, which are mutually exclusive with respect to the --url option.
Consider the JDL file hostname.jdl with the following content:
$ cat ./hostname.jdl
[
Type="Job";
JobType="Normal";
Executable = "/bin/hostname";
Arguments = "-s";
StdOutput = "std.out";
StdError = "std.err";
OutputSandbox = {"std.out","std.err"};
OutputSandboxBaseDestUri="gsiftp://localhost";
]
If verbose mode is disabled, the output should look like this:
$ ./cream_jobSubmit.py --url https://prod-ce-01.pd.infn.it:8443/cream-lsf-grid -x /tmp/x509up_u733 --jdl ./hostname.jdl
CREAM JobSubmit OK [DONE-OK]
Notice: using the --url option is equivalent to specifying the --hostname, --port, --lrms and --queue options:
$ ./cream_jobSubmit.py --hostname prod-ce-01.pd.infn.it --port 8443 --lrms lsf --queue grid -x /tmp/x509up_u733 --jdl ./hostname.jdl
CREAM JobSubmit OK [DONE-OK]
If the verbose mode is enabled, the output of the above command should be like this:
$ ./cream_jobSubmit.py --hostname prod-ce-01.pd.infn.it --port 8443 --lrms lsf --queue grid -x /tmp/x509up_u733 --jdl ./hostname.jdl --verbose
executing command: /usr/bin/voms-proxy-info -timeleft
executing command: /usr/bin/glite-ce-job-submit -d -a -r prod-ce-01.pd.infn.it:8443/cream-lsf-grid ./hostname.jdl
['2019-12-13 13:54:33,247 DEBUG - Using certificate proxy file [/tmp/x509up_u733]\n', '2019-12-13 13:54:33,279 DEBUG - VO from certificate=[enmr.eu]\n', '2019-12-13 13:54:33,279 WARN - No configuration file suitable for loading. Using built-in configuration\n', '2019-12-13 13:54:33,279 DEBUG - Logfile is [/tmp/glite_cream_cli_logs/glite-ce-job-submit_CREAM_zangrand_20191213-135433.log]\n', '2019-12-13 13:54:33,282 INFO - certUtil::generateUniqueID() - Generated DelegationID: [12815a52a76431b1712199d87ae5896fd6718b3a]\n', '2019-12-13 13:54:36,175 DEBUG - Registering to [https://prod-ce-01.pd.infn.it:8443/ce-cream/services/CREAM2] JDL=[ StdOutput = "std.out"; BatchSystem = "lsf"; QueueName = "grid"; Executable = "/bin/hostname"; Type = "Job"; Arguments = "-s"; JobType = "Normal"; OutputSandboxBaseDestUri = "gsiftp://localhost"; OutputSandbox = { "std.out","std.err" }; StdError = "std.err" ] - JDL File=[./hostname.jdl]\n', '2019-12-13 13:54:36,634 DEBUG - Will invoke JobStart for JobID [CREAM067861520]\n', 'https://prod-ce-01.pd.infn.it:8443/CREAM067861520\n']
job id: https://prod-ce-01.pd.infn.it:8443/CREAM067861520
invoking jobStatus
executing command: /usr/bin/glite-ce-job-status https://prod-ce-01.pd.infn.it:8443/CREAM067861520
['\n', '****** JobID=[https://prod-ce-01.pd.infn.it:8443/CREAM067861520]\n', '\tStatus = [DONE-OK]\n', '\tExitCode = [0]\n', '\n', '\n']
exitCode= ExitCode = [0]
job status: DONE-OK
invoking jobPurge
executing command: /usr/bin/glite-ce-job-purge --noint https://prod-ce-01.pd.infn.it:8443/CREAM067861520
CREAM JobSubmit OK [DONE-OK]
cream_jobOutput.py¶
This metric extends the cream_jobSubmit.py functionality by retrieving the job’s output-sandbox. Both the stage-in and stage-out phases are performed automatically by the CE. In particular, the stage-out needs OutputSandboxBaseDestUri="gsiftp://localhost" set in the JDL. Finally the job is purged.
$ ./cream_jobOutput.py --help
Usage: cream_jobOutput.py [options]
Options:
--version show program's version number and exit
-h, --help show this help message and exit
-H HOSTNAME, --hostname=HOSTNAME
The hostname of the CREAM service.
-p PORT, --port=PORT The port of the service. [default: none]
-x PROXY, --proxy=PROXY
The proxy path
-t TIMEOUT, --timeout=TIMEOUT
Probe execution time limit. [default:
120 sec]
-v, --verbose verbose mode [default: False]
-u URL, --url=URL The status endpoint URL of the service. Example:
https://<host>[:<port>]/cream-<lrms>-<queue>
-l LRMS, --lrms=LRMS The LRMS name (e.g.: 'lsf', 'pbs' etc)
-q QUEUE, --queue=QUEUE
The queue name (e.g.: 'creamtest')
-j JDL, --jdl=JDL The jdl path
-d DIR, --dir=DIR The output sandbox path
The options are the same as for cream_jobSubmit.py except for --dir. This option allows the user to specify the path where the output-sandbox has to be stored temporarily. The default value is /var/lib/argo-monitoring/eu.egi.CREAMCE.
Consider the JDL file hostname.jdl with the following content:
$ cat ./hostname.jdl
[
Type="Job";
JobType="Normal";
Executable = "/bin/hostname";
Arguments = "-s";
StdOutput = "std.out";
StdError = "std.err";
OutputSandbox = {"std.out","std.err"};
OutputSandboxBaseDestUri="gsiftp://localhost";
]
If verbose mode is disabled, the output should look like this:
$ ./cream_jobOutput.py --hostname prod-ce-01.pd.infn.it --port 8443 --lrms lsf --queue grid -x /tmp/x509up_u733 --dir /tmp --jdl ./hostname.jdl
CREAM JobOutput OK | retrieved outputSandbox: ['std.err', 'std.out']
**** std.err ****
**** std.out ****
prod-wn-038
Notice the use of the --dir option and the output-sandbox content returned in the output message.
If the verbose mode is enabled, the output of the above command should be like this:
$ ./cream_jobOutput.py --hostname prod-ce-01.pd.infn.it --port 8443 --lrms lsf --queue grid -x /tmp/x509up_u733 --dir /tmp --jdl ./hostname.jdl --verbose
executing command: /usr/bin/voms-proxy-info -timeleft
executing command: /usr/bin/glite-ce-job-submit -d -a -r prod-ce-01.pd.infn.it:8443/cream-lsf-grid ./hostname.jdl
['2019-12-13 14:02:55,478 DEBUG - Using certificate proxy file [/tmp/x509up_u733]\n', '2019-12-13 14:02:55,519 DEBUG - VO from certificate=[enmr.eu]\n', '2019-12-13 14:02:55,520 WARN - No configuration file suitable for loading. Using built-in configuration\n', '2019-12-13 14:02:55,520 DEBUG - Logfile is [/tmp/glite_cream_cli_logs/glite-ce-job-submit_CREAM_zangrand_20191213-140255.log]\n', '2019-12-13 14:02:55,523 INFO - certUtil::generateUniqueID() - Generated DelegationID: [b6b895d69f7ef0d438db82930476a2fd149d0501]\n', '2019-12-13 14:02:57,610 DEBUG - Registering to [https://prod-ce-01.pd.infn.it:8443/ce-cream/services/CREAM2] JDL=[ StdOutput = "std.out"; BatchSystem = "lsf"; QueueName = "grid"; Executable = "/bin/hostname"; Type = "Job"; Arguments = "-s"; JobType = "Normal"; OutputSandboxBaseDestUri = "gsiftp://localhost"; OutputSandbox = { "std.out","std.err" }; StdError = "std.err" ] - JDL File=[./hostname.jdl]\n', '2019-12-13 14:02:58,271 DEBUG - Will invoke JobStart for JobID [CREAM160637101]\n', 'https://prod-ce-01.pd.infn.it:8443/CREAM160637101\n']
job id: https://prod-ce-01.pd.infn.it:8443/CREAM160637101
invoking jobStatus
executing command: /usr/bin/glite-ce-job-status https://prod-ce-01.pd.infn.it:8443/CREAM160637101
['\n', '****** JobID=[https://prod-ce-01.pd.infn.it:8443/CREAM160637101]\n', '\tStatus = [IDLE]\n', '\n', '\n']
job status: IDLE
invoking jobStatus
executing command: /usr/bin/glite-ce-job-status https://prod-ce-01.pd.infn.it:8443/CREAM160637101
['\n', '****** JobID=[https://prod-ce-01.pd.infn.it:8443/CREAM160637101]\n', '\tStatus = [DONE-OK]\n', '\tExitCode = [0]\n', '\n', '\n']
exitCode= ExitCode = [0]
job status: DONE-OK
invoking getOutputSandbox
executing command: /usr/bin/glite-ce-job-output --noint --dir /tmp https://prod-ce-01.pd.infn.it:8443/CREAM160637101
output sandbox dir: /tmp/prod-ce-01.pd.infn.it_8443_CREAM160637101
invoking jobPurge
executing command: /usr/bin/glite-ce-job-purge --noint https://prod-ce-01.pd.infn.it:8443/CREAM160637101
CREAM JobOutput OK | retrieved outputSandbox: ['std.err', 'std.out']
**** std.err ****
**** std.out ****
prod-wn-038
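The verbose trace documents the full submit / poll / retrieve / purge cycle. Below is a condensed Python sketch of that sequence, using only the CLI commands and flags logged above (the helper itself is illustrative, not the actual probe source):

import subprocess
import time

def run(cmd):
    # Echo and execute one gLite CLI command, as the verbose trace does.
    print("executing command: " + " ".join(cmd))
    return subprocess.check_output(cmd).decode()

# Submit: the last line printed by glite-ce-job-submit is the job id.
out = run(["/usr/bin/glite-ce-job-submit", "-d", "-a",
           "-r", "prod-ce-01.pd.infn.it:8443/cream-lsf-grid", "./hostname.jdl"])
job_id = out.strip().splitlines()[-1]

# Poll the status until the job reaches a terminal state.
while True:
    status = run(["/usr/bin/glite-ce-job-status", job_id])
    if any(s in status for s in ("DONE-OK", "DONE-FAILED", "ABORTED")):
        break
    time.sleep(10)

# Retrieve the output sandbox, then purge the job on the CE.
run(["/usr/bin/glite-ce-job-output", "--noint", "--dir", "/tmp", job_id])
run(["/usr/bin/glite-ce-job-purge", "--noint", job_id])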
WN-softver probe¶
This probe checks the middleware version on a WN managed by the CREAM-CE. It makes use of cream_jobOutput.py in the following way:
$ ./cream_jobOutput.py --url https://prod-ce-01.pd.infn.it:8443/cream-lsf-grid -x /tmp/x509up_u733 --dir /tmp -j ./WN-softver.jdl
CREAM JobOutput OK | retrieved outputSandbox: ['std.err', 'std.out']
**** std.err ****
**** std.out ****
prod-wn-014 has EMI 3.15.0-1.el6
where
$ cat WN-softver.jdl
[
Type="Job";
JobType="Normal";
Executable = "WN-softver.sh";
#Arguments = "a b c";
StdOutput = "std.out";
StdError = "std.err";
InputSandbox = {"WN-softver.sh"};
OutputSandbox = {"std.out","std.err"};
OutputSandboxBaseDestUri="gsiftp://localhost";
]
and WN-softver.sh is attached.
With the verbose option enabled, the command gives the following output:
$ ./cream_jobOutput.py --url https://prod-ce-01.pd.infn.it:8443/cream-lsf-grid -x /tmp/x509up_u733 --dir /tmp -j ./WN-softver.jdl --verbose
executing command: /usr/bin/voms-proxy-info -timeleft
executing command: /usr/bin/glite-ce-job-submit -d -a -r prod-ce-01.pd.infn.it:8443/cream-lsf-grid ./WN-softver.jdl
['2019-12-13 14:06:25,768 DEBUG - Using certificate proxy file [/tmp/x509up_u733]\n', '2019-12-13 14:06:25,804 DEBUG - VO from certificate=[enmr.eu]\n', '2019-12-13 14:06:25,805 WARN - No configuration file suitable for loading. Using built-in configuration\n', '2019-12-13 14:06:25,805 DEBUG - Logfile is [/tmp/glite_cream_cli_logs/glite-ce-job-submit_CREAM_zangrand_20191213-140625.log]\n', '2019-12-13 14:06:25,805 DEBUG - Processing file [/users/cms/zangrand/cream-nagios-master/src/WN-softver.sh]...\n', '2019-12-13 14:06:25,805 DEBUG - Inserting mangled InputSandbox in JDL: [{"/users/cms/zangrand/cream-nagios-master/src/WN-softver.sh"}]...\n', '2019-12-13 14:06:25,806 INFO - certUtil::generateUniqueID() - Generated DelegationID: [7f0ac5ec8a7deefa01f207c0b341fce1568f5282]\n', '2019-12-13 14:06:27,612 DEBUG - Registering to [https://prod-ce-01.pd.infn.it:8443/ce-cream/services/CREAM2] JDL=[ StdOutput = "std.out"; BatchSystem = "lsf"; QueueName = "grid"; Executable = "WN-softver.sh"; Type = "Job"; JobType = "Normal"; OutputSandboxBaseDestUri = "gsiftp://localhost"; OutputSandbox = { "std.out","std.err" }; InputSandbox = { "/users/cms/zangrand/cream-nagios-master/src/WN-softver.sh" }; StdError = "std.err" ] - JDL File=[./WN-softver.jdl]\n', '2019-12-13 14:06:28,228 DEBUG - JobID=[https://prod-ce-01.pd.infn.it:8443/CREAM608273414]\n', '2019-12-13 14:06:28,228 DEBUG - UploadURL=[gsiftp://prod-ce-01.pd.infn.it/var/cream_sandbox/enmr/CN_Marco_Verlato_verlato_infn_it_O_Istituto_Nazionale_di_Fisica_Nucleare_C_IT_DC_tcs_DC_terena_DC_org_enmr_eu_Role_NULL_Capability_NULL_enmr018/60/CREAM608273414/ISB]\n', '2019-12-13 14:06:28,230 INFO - Sending file [gsiftp://prod-ce-01.pd.infn.it/var/cream_sandbox/enmr/CN_Marco_Verlato_verlato_infn_it_O_Istituto_Nazionale_di_Fisica_Nucleare_C_IT_DC_tcs_DC_terena_DC_org_enmr_eu_Role_NULL_Capability_NULL_enmr018/60/CREAM608273414/ISB/WN-softver.sh]\n', '2019-12-13 14:06:28,482 DEBUG - Will invoke JobStart for JobID [CREAM608273414]\n', 'https://prod-ce-01.pd.infn.it:8443/CREAM608273414\n']
job id: https://prod-ce-01.pd.infn.it:8443/CREAM608273414
invoking jobStatus
executing command: /usr/bin/glite-ce-job-status https://prod-ce-01.pd.infn.it:8443/CREAM608273414
['\n', '****** JobID=[https://prod-ce-01.pd.infn.it:8443/CREAM608273414]\n', '\tStatus = [REALLY-RUNNING]\n', '\n', '\n']
job status: REALLY-RUNNING
invoking jobStatus
executing command: /usr/bin/glite-ce-job-status https://prod-ce-01.pd.infn.it:8443/CREAM608273414
['\n', '****** JobID=[https://prod-ce-01.pd.infn.it:8443/CREAM608273414]\n', '\tStatus = [DONE-OK]\n', '\tExitCode = [0]\n', '\n', '\n']
exitCode= ExitCode = [0]
job status: DONE-OK
invoking getOutputSandbox
executing command: /usr/bin/glite-ce-job-output --noint --dir /tmp https://prod-ce-01.pd.infn.it:8443/CREAM608273414
output sandbox dir: /tmp/prod-ce-01.pd.infn.it_8443_CREAM608273414
invoking jobPurge
executing command: /usr/bin/glite-ce-job-purge --noint https://prod-ce-01.pd.infn.it:8443/CREAM608273414
CREAM JobOutput OK | retrieved outputSandbox: ['std.err', 'std.out']
**** std.err ****
**** std.out ****
prod-wn-014 has EMI 3.15.0-1.el6
WN-csh probe¶
This probe checks that csh is available on the WNs managed by the CREAM-CE. It makes use of cream_jobOutput.py in the following way:
$ ./cream_jobOutput.py --url https://prod-ce-01.pd.infn.it:8443/cream-lsf-grid -x /tmp/x509up_u733 --dir /tmp -j ./WN-csh.jdl
CREAM JobOutput OK | retrieved outputSandbox: ['std.err', 'std.out']
**** std.err ****
**** std.out ****
prod-wn-016 has csh
where
$ cat WN-csh.jdl
[
Type="Job";
JobType="Normal";
Executable = "WN-csh.sh";
#Arguments = "a b c";
StdOutput = "std.out";
StdError = "std.err";
InputSandbox = {"WN-csh.sh"};
OutputSandbox = {"std.out","std.err"};
OutputSandboxBaseDestUri="gsiftp://localhost";
]
and WN-csh.sh is attached.
Deployment example¶
In a testing instance of Nagios server version 3.5.0, we deployed the files needed to execute the probes described above into the following directories:
$ ls -l /usr/libexec/argo-monitoring/probes/eu.egi.CREAMCE/
total 48
-rwxr-xr-x 1 root root 1361 Jan 30 16:58 cream_allowedSubmission.py
-rwxr-xr-x 1 root root 2972 Jan 31 12:42 cream_jobOutput.py
-rwxr-xr-x 1 root root 2972 Jan 31 12:42 cream_jobSubmit.py
-rwxr-xr-x 1 root root 1416 Jan 31 12:42 cream_serviceInfo.py
drwxr-xr-x 2 root root 4096 Jan 31 11:34 cream_cli
and
$ ls -l /etc/nagios/plugins/eu.egi.CREAMCE/
total 16
-rw-r--r-- 1 root root 213 Jan 29 14:26 hostname.jdl
-rw-r--r-- 1 root root 129 Jan 30 16:21 sleep.jdl
-rw-r--r-- 1 root root 292 Jan 31 11:34 WN-csh.jdl
-rwxr-xr-x 1 root root 603 Jan 31 11:34 WN-csh.sh
-rw-r--r-- 1 root root 300 Jan 31 11:34 WN-softver.jdl
-rwxr-xr-x 1 root root 1144 Jan 31 11:34 WN-softver.sh
and defined the new services by adding the following lines to the file /etc/nagios/objects/services.cfg:
define service{
use local-service
host_name prod-ce-01.pd.infn.it
service_description eu.egi.CREAMCE-AllowedSubmission
check_command ncg_check_native!/usr/libexec/argo-monitoring/probes/eu.egi.CREAMCE/cream_allowedSubmission.py!60!-x /tmp/x509up_u733 -p 8443
normal_check_interval 6
retry_check_interval 3
max_check_attempts 2
obsess_over_service 0
}
define service{
use local-service
host_name prod-ce-01.pd.infn.it
service_description eu.egi.CREAMCE-ServiceInfo
check_command ncg_check_native!/usr/libexec/argo-monitoring/probes/eu.egi.CREAMCE/cream_serviceInfo.py!60!-x /tmp/x509up_u733 -p 8443
normal_check_interval 6
retry_check_interval 3
max_check_attempts 2
obsess_over_service 0
}
define service{
use local-service
host_name prod-ce-01.pd.infn.it
service_description eu.egi.CREAMCE-JobSubmit
check_command ncg_check_native!/usr/libexec/argo-monitoring/probes/eu.egi.CREAMCE/cream_jobSubmit.py!60!-x /tmp/x509up_u733 -p 8443 -l lsf -q creamtest1 -j /etc/nagios/plugins/eu.egi.CREAMCE/hostname.jdl
normal_check_interval 6
retry_check_interval 3
max_check_attempts 2
obsess_over_service 0
}
define service{
use local-service
host_name prod-ce-01.pd.infn.it
service_description eu.egi.CREAMCE.WN-Softver
check_command ncg_check_native!/usr/libexec/argo-monitoring/probes/eu.egi.CREAMCE/cream_jobOutput.py!60!-x /tmp/x509up_u733 -p 8443 -l lsf -q creamtest1 -j /etc/nagios/plugins/eu.egi.CREAMCE/WN-softver.jdl
normal_check_interval 6
retry_check_interval 3
max_check_attempts 2
obsess_over_service 0
}
define service{
use local-service
host_name prod-ce-01.pd.infn.it
service_description eu.egi.CREAMCE.WN-Csh
check_command ncg_check_native!/usr/libexec/argo-monitoring/probes/eu.egi.CREAMCE/cream_jobOutput.py!60!-x /tmp/x509up_u733 -p 8443 -l lsf -q creamtest1 -j /etc/nagios/plugins/eu.egi.CREAMCE/WN-csh.jdl
normal_check_interval 6
retry_check_interval 3
max_check_attempts 2
obsess_over_service 0
}
The check_command ncg_check_native was defined in the file /etc/nagios/objects/commands.cfg as follows:
define command{
command_name ncg_check_native
command_line $ARG1$ -H $HOSTNAME$ -t $ARG2$ $ARG3$
}
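With this definition Nagios substitutes $HOSTNAME$ from the host definition and $ARG1$, $ARG2$, $ARG3$ from the arguments of ncg_check_native. Assuming standard Nagios macro expansion, the eu.egi.CREAMCE-JobSubmit check above therefore runs:
/usr/libexec/argo-monitoring/probes/eu.egi.CREAMCE/cream_jobSubmit.py -H prod-ce-01.pd.infn.it -t 60 -x /tmp/x509up_u733 -p 8443 -l lsf -q creamtest1 -j /etc/nagios/plugins/eu.egi.CREAMCE/hostname.jdl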
Releases¶
Release 1.16.7¶
The current release of the CREAM service is fully compatible with the latest versions of BouncyCastle, CAnL and VOMS published in EPEL.
The fixed issues are:
- GGUS-Ticket-ID: #106384
- GGUS-Ticket-ID: #122974
- GGUS-Ticket-ID: #124404
- GGUS-Ticket-ID: #127020
- GGUS-Ticket-ID: #133522
- GGUS-Ticket-ID: #136074
The list of packages for the CREAM services on CentOS 7 is the following:
Package | Version | Arch |
---|---|---|
BLAH | 1.22.3-1 | x86_64 |
BLAH-debuginfo | 1.22.3-1 | x86_64 |
glite-ce-common-java | 1.16.5-2 | noarch |
glite-ce-common-java-doc | 1.16.5-2 | noarch |
glite-ce-cream | 1.16.5-5 | noarch |
glite-ce-cream-api-java | 1.16.6-3 | noarch |
glite-ce-cream-core | 1.16.5-5 | noarch |
glite-ce-cream-utils | 1.3.7-1 | x86_64 |
info-dynamic-scheduler-slurm | 1.0.5-1 | noarch |
The list of packages for the CREAM services on Scientific Linux 6 is the following:
Package | Version | Arch |
---|---|---|
canl-java-tomcat | 0.1.19-2 | noarch |
glite-ce-blahp | 1.22.3-1 | x86_64 |
glite-ce-blahp-debuginfo | 1.22.3-1 | x86_64 |
glite-ce-common-java | 1.16.5-2 | noarch |
glite-ce-common-java-doc | 1.16.5-2 | noarch |
glite-ce-cream | 1.16.5-5 | noarch |
glite-ce-cream-api-java | 1.16.6-3 | noarch |
glite-ce-cream-core | 1.16.5-5 | noarch |
glite-ce-cream-utils | 1.3.7-1 | x86_64 |
info-dynamic-scheduler-lsf | 2.3.8-1 | noarch |
info-dynamic-scheduler-lsf-btools | 2.3.8-1 | noarch |
info-dynamic-scheduler-slurm | 1.0.5-1 | noarch |
lcg-info-dynamic-scheduler-condor | 1.1-1 | noarch |
lcg-info-dynamic-scheduler-pbs | 2.4.6-1 | noarch |
The basic requirement for the installation is UMD 4.7.1. The Java application requires Java SE 8 (OpenJDK 1.8.0 or later). The installation and configuration of the CREAM CE site is certified with the puppet module version 0.1.1 or greater. YAIM is no longer supported on Scientific Linux 6, and its usage must be considered at your own risk.
The update process on Scientific Linux 6 requires the following additional steps:
- After the update it is necessary to remove the packages bouncycastle-mail and bouncycastle (version 1.46).
- Any broken symbolic link in /var/lib/tomcat6/webapps/ce-cream/WEB-INF/lib and /usr/share/tomcat6/lib must be deleted manually.
If YAIM is used for the configuration, it is necessary to create a symbolic link from /usr/share/java/bcprov-1.58.jar to /usr/share/java/bcprov.jar and re-run the configurator.
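For example (a sketch: the find invocation assumes GNU find, whose -xtype l test matches dangling symbolic links):
$ find /var/lib/tomcat6/webapps/ce-cream/WEB-INF/lib /usr/share/tomcat6/lib -xtype l -delete
$ ln -sf /usr/share/java/bcprov-1.58.jar /usr/share/java/bcprov.jar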
Release 1.16.6¶
The changes delivered with this release are:
- Support for CentOS 7
- Support for accelerator devices (NVidia GPUs, MIC)
- Implementation of GLUE 2.1 v.0.4 (Draft)
The list of packages for the CREAM services is the following:
Package | Version | Arch |
---|---|---|
BLAH | 1.22.2-1 | x86_64 |
BLAH-debuginfo | 1.22.2-1 | x86_64 |
axis2 | 1.6.4-2.emi | noarch |
canl-java-tomcat | 0.2.1-1 | noarch |
condor-classads-blah-patch | 0.0.1-1 | x86_64 |
dynsched-generic | 2.5.5-2 | noarch |
glite-ce-common-java | 1.16.5-1 | noarch |
glite-ce-common-java-doc | 1.16.5-1 | noarch |
glite-ce-cream | 1.16.5-3 | noarch |
glite-ce-cream-api-java | 1.16.6-2 | noarch |
glite-ce-cream-core | 1.16.5-3 | noarch |
glite-ce-cream-utils | 1.3.6-2 | x86_64 |
glite-ce-wsdl | 1.15.1-1 | noarch |
glite-info-dynamic-ge | 7.2.0-29.1 | noarch |
glite-jdl-api-java | 3.3.2-3 | noarch |
glite-jdl-api-java-doc | 3.3.2-3 | noarch |
info-dynamic-scheduler-lsf | 2.3.8-3 | noarch |
info-dynamic-scheduler-lsf-btools | 2.3.8-3 | noarch |
info-dynamic-scheduler-slurm | 1.0.4-1 | noarch |
jclassads | 2.4.0-3 | noarch |
kill-stale-ftp | 1.0.1-1 | noarch |
lcg-info-dynamic-scheduler-condor | 1.1-1 | noarch |
lcg-info-dynamic-scheduler-pbs | 2.4.6-1 | noarch |
The list of packages for the CREAM clients is the following:
Package | Version | Arch |
---|---|---|
glite-ce-cream-cli | 1.15.4-1 | x86_64 |
glite-ce-cream-client-api-c | 1.15.5-1 | x86_64 |
glite-ce-cream-client-devel | 1.15.5-1 | x86_64 |
glite-jdl-api-cpp | 3.4.4-1 | x86_64 |
glite-jdl-api-cpp-devel | 3.4.4-1 | x86_64 |
glite-jdl-api-cpp-doc | 3.4.4-1 | x86_64 |
glite-wms-utils-classad | 3.4.3-1 | x86_64 |
glite-wms-utils-classad-devel | 3.4.3-1 | x86_64 |
glite-wms-utils-exception | 3.4.3-1 | x86_64 |
glite-wms-utils-exception-devel | 3.4.3-1 | x86_64 |
log4cpp | 1.0.13-1 | x86_64 |
log4cpp-devel | 1.0.13-1 | x86_64 |
The installation and configuration of the CREAM CE site is certified with the puppet module version 0.0.16 or greater.
The supported batch systems are:
Batch system | Version |
---|---|
TORQUE | 4.2.10 |
SLURM | 16.05.10 |
HTCondor | 8.6.3 |
LSF | 7.0 |
GridEngine | 6.2 |
The fixed issues are:
- GGUS-Ticket-ID: #106384
- GGUS-Ticket-ID: #122974
- GGUS-Ticket-ID: #124034
- GGUS-Ticket-ID: #124404
- GGUS-Ticket-ID: #127020
The known issues are:
- The CREAM UI requires classads libraries up to version 8.4.11; it does not work with versions 8.6.*
- GridEngine is partially supported: the infoprovider does not publish information about accelerator devices.
- The puppet agent may report parsing errors.
- If HTCondor is the batch system adopted, the HTCondor services on the computing element must be restarted after the installation of the CREAM service.