Wisconsin CMS Tier-2 FAQ#

HDFS Data Storage#


How can I get a copy of the data files?#

You must first find out the path to your files in UW-HEP HDFS.#

If you are trying to copy your files to another CMS site (e.g. Fermilab), then see Can I access data files from outside the hep.wisc.edu domain?.#

To do a bulk copy of all of the files in a dataset to a local disk, you can simply use a command such as the following:#

cp -r /hdfs/store/... destination-directory

This should work on any machine with /hdfs mounted (e.g. the login machines).#


Can I access data files from outside the hep.wisc.edu domain?#

Yes. All data files are globally readable through several protocols: davs, xrootd, gsiftp.#

If you want to get a local copy of data files from UW-HEP, you just need a list of filenames. Once you have that, you may copy them.#

Example using xrootd to copy files#

Copy data located at Wisconsin using the local Xrootd redirector (manager) at root://cmsxrootd.hep.wisc.edu:#

 xrdcp root://cmsxrootd.hep.wisc.edu//store/... .

Copy data located at Wisconsin using the global (US) Xrootd redirector (manager) at root://cmsxrootd.fnal.gov:#

 xrdcp root://cmsxrootd.fnal.gov//store/... .

Example using gfal-copy#

 gfal-copy -p -r davs://cmsxrootd.hep.wisc.edu:1094/store/user/<name>/src_file dest_filename

How can I copy files into HDFS from a local disk?#

One way to copy files into HDFS is with gfal-copy. First create a grid proxy with voms-proxy-init and then use gfal-copy to copy the file. Example:#

gfal-copy -p /path/to/src_file davs://cmsxrootd.hep.wisc.edu:1094/store/user/path/to/dest_file
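
The proxy creation step might look like the following; the cms VO name is a typical choice for CMS users and may need to be adjusted for your setup:#

voms-proxy-init --voms cms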

How can I manage my files in HDFS?#

On machines with /hdfs mounted (e.g. login machines), you should be able to manage files in HDFS using standard unix commands such as rm, mv, mkdir, rmdir.#

You should be able to cd to /hdfs/store/user and manage your files. Be very careful when doing recursive rm operations! There is no way to recover if you remove files by mistake.#

File management can also be done via gfal-* commands (details below).#

Using gfal commands (examples)#

File copy

 gfal-copy davs://cmsxrootd.hep.wisc.edu:1094/store/user/<name>/src_file dest_filename
 gfal-copy -p -r /path/to/src_file davs://cmsxrootd.hep.wisc.edu:1094/store/user/path/to/dest_file

File list

 gfal-ls -l davs://cmsxrootd.hep.wisc.edu:1094/store/user/your_dir/your_file

File remove

 gfal-rm davs://cmsxrootd.hep.wisc.edu:1094/store/user/your_dir/your_file

Using xrdfs commands#

File management can also be done via xrdfs. This can be used as a command-line tool, like gfal-ls. It can also be used as an interactive shell.#

File list

 xrdfs cmsxrootd.hep.wisc.edu ls /store/user/example

File remove

 xrdfs cmsxrootd.hep.wisc.edu rm /store/user/example/file

Interactive shell

 xrdfs cmsxrootd.hep.wisc.edu

How can I open HDFS files directly from root?#

If you know the /store path to a file, you can have root read from the file directly by specifying root://cmsxrootd.hep.wisc.edu//store/... as the file name that you give to root.#
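
For example, an illustrative invocation (the file path here is only a placeholder):#

root -l root://cmsxrootd.hep.wisc.edu//store/path/to/file.root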


How can I verify that our copy of a file is valid?#

Calculate the CRC checksum of a file and compare it to that listed in DAS. For example:#

cksum /hdfs/store/data/Run2010B/Electron/AOD/Dec22ReReco_v1/0002/3861EBF4-2C0E-E011-A891-0018F3D0960E.root
2443048860 2448344328 /hdfs/store/data/Run2010B/Electron/AOD/Dec22ReReco_v1/0002/3861EBF4-2C0E-E011-A891-0018F3D0960E.root

The first number is the CRC. Compare this result to that returned by DAS when you click on “show”.#

The “farmout” Scripts#


How can I use farmoutRandomSeedJobs to submit CMSSW MC jobs to Condor?#

farmoutRandomSeedJobs jobName nEvents nEventsPerJob /path/to/CMSSW /path/to/configTemplate cmsRunArg1=value1 cmsRunArg2=value2 ...

Use the -h option to see all of the options.#

Either the configuration file or the cmsRun arguments must contain the macros $randomNumber, $nEventsPerJob, and $outputFileName. These macros are filled in for each job by the farmout script. Additional random numbers may be generated per job by using the macros $randomNumber1, $randomNumber2, and so on.#

The following example assumes a configuration .py file that uses the standard FWCore.ParameterSet.VarParsing module. In addition to the standard options, it assumes an option named randomSeed is used. Since all required macros are specified in the command-line options, no macros are required in the .py file, so it can just be a standard .py file that you could use when running cmsRun interactively.#

farmoutRandomSeedJobs \
  jobName \
  nEvents \
  nEventsPerJob \
  /path/to/CMSSW \
  /path/to/config.py \
  'outputFile=$outputFileName' \
  'maxEvents=$nEventsPerJob' \
  'randomSeed=$randomNumber'

Notice that the $ character must be protected from being interpreted specially by the shell, so in the above example, the options containing macros are surrounded by single quotes.#


How can I use farmoutAnalysisJobs to submit CMSSW analysis jobs to Condor?#

This script will run cmsRun over the root files in a directory or directory tree. By default, it runs on all root files in a directory in your /hdfs area, using the jobName that you specify to find the files. However, you can direct it to an alternate path and tell it to exclude root files with names matching a pattern that you specify.#

For full options to the script, use the -h option. Here is a brief synopsis:#

farmoutAnalysisJobs [options] jobName /path/to/CMSSW /path/to/configTemplate cmsRunArg1=value1 cmsRunArg2=value2 ...

Either the configuration file or the cmsRun arguments must contain the macros $outputFileName and $inputFileNames. These macros are filled in for each job by the farmout script.#

The following example assumes a configuration .py file that uses the standard FWCore.ParameterSet.VarParsing module. Since all required macros are specified in the command-line options, no macros are required in the .py file, so it can just be a standard .py file that you could use when running cmsRun interactively.#

farmoutAnalysisJobs [options] jobName /path/to/CMSSW /path/to/config.py 'outputFile=$outputFileName' 'inputFiles=$inputFileNames'

Notice that the $ character must be protected from being interpreted specially by the shell, so in the above example, the options containing macros are surrounded by single quotes.#


How can I use farmoutAnalysisJobs to merge together analysis output?#

The --merge option to farmoutAnalysisJobs may be used to merge a set of small files into one or more larger files. To use this option, invoke farmoutAnalysisJobs as though you were using the root files to be merged as input files. Example:#

farmoutAnalysisJobs \
   --merge \
   --input-files-per-job=50 \
   --input-dir=/store/user/dan/QCD-Trigger \
   example-merge-job-name \
   ~/CMSSW_4_2_2_patch1

Notice that --input-files-per-job must be specified. Choose a value that will produce reasonably sized merged output files. Aim for merged files of a few gigabytes. If they are larger than 10GB, your merge job may run out of space on the worker node. If there are more input files than the number you specify, then multiple merge jobs will be created, and you will end up with multiple merged files.#

By default, mergeFiles.C is used to merge the files. If you wish to instead use the root hadd utility, specify --use-hadd.#


How can I submit framework-lite or other types of jobs using farmoutAnalysisJobs?#

In addition to submitting standard CMSSW analysis jobs, farmoutAnalysisJobs can submit framework-lite or other jobs. The mechanism for submitting generic jobs (i.e. non-CMSSW jobs) is to use the --fwklite option to farmoutAnalysisJobs. It’s more generic than the name implies: any sort of script can be run through this mechanism, not just framework-lite jobs.#

When you use that option, instead of supplying a cmsRun configuration template to farmoutAnalysisJobs, you supply an executable script. Note that farmout does not replace macros in your script or make any modifications to it whatsoever. Instead, the environment variables INPUT and OUTPUT are used to pass information to the script. INPUT gives the path to a file which contains the list of input files (one per line). OUTPUT specifies the path to the output file that the job is expected to produce. This output file will be stored in HDFS, just like output files from cmsRun jobs. In addition to those two environment variables, command-line options may be passed to the script. The same macro-substitutions are done in the command-line options as for CMSSW analysis jobs.#

As a convenience, if the “script” is a .C or .cc file, it is executed like this:#

root -q -b filename.C

Otherwise, you can provide a shell script or python script or whatever is most convenient to start the job. All that matters is that it can be executed and that it exits with 0 status when all goes well. If the script relies on additional files, you can transfer those using --extra-inputs.#
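
Here is a minimal sketch of such a script; the processing step is a placeholder that you would replace with your own analysis command:#

#!/bin/sh
# INPUT points to a text file listing the input files, one per line.
# OUTPUT is the path of the output file this job is expected to produce.
set -e
while read -r inputfile; do
    echo "Processing $inputfile"
    # ... replace this with the command that processes $inputfile ...
done < "$INPUT"
# Create the output file that farmout expects to store in HDFS.
touch "$OUTPUT"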


How can I test my jobs before submitting them?#

It is important to test your jobs before submitting a large number of them to the cluster; if they fail, many hours of computation can be wasted in a short period of time. Very short and lightweight tests may be run on the login machines. Longer or more resource-intensive tests can be run interactively in the cluster using interactive jobs. This allows you to test your job in an environment identical to our production Condor cluster. To submit an interactive job, use the following command:#

farmoutInteractiveJob

Condor Batch System#


How do I submit a generic job to Condor at the Wisconsin Tier-2?#

See UW-HEP Condor User Info. Also see the farmout --fwklite option.#


Why are my jobs idle?#

Jobs submitted to Condor at the Wisconsin Tier-2 may run on resources distributed across the campus grid. It can take a few minutes for the Condor negotiator to come around to your newly submitted job and try finding a machine to run it on. If no machines are immediately available, the job waits in the idle state (‘I’ in the condor_q output).#

To see how many machines could possibly run your job, you can use the following command:#

condor_q -analyze <jobid>

If your job requirements do not match very many machines, you can try to analyze the requirements:#

condor_q -better-analyze <jobid>

It may happen that your urgent jobs have no problem matching the requirements of lots of machines, but they are still idle due to machines being busy with other jobs. In this case, let us know and we can see if a priority adjustment would help.#

The above condor_q commands only analyze the resources available in the CMS Tier-2. When the Tier-2 resources are all busy, your jobs may run (via Condor flocking) in another pool of resources on campus. These pools can be analyzed using commands such as the following:#

condor_q -pool cm.chtc.wisc.edu -analyze <jobid>

Why do my jobs get held?#

Jobs may enter the ‘held’ state for a number of reasons. To find out why your jobs are held, check them with condor_q:#

condor_q -format '%s\n' HoldReason <jobid>

This will print a message explaining why your job was held. One common reason looks like the following:#

The job attribute PeriodicHold expression 'ImageSize / 1024 > 4.000000 * 900 || DiskUsage / 1024 > 10.000000 * 2000' evaluated to TRUE

This message means that your job’s memory or disk usage exceeded the allowable threshold. To see which attribute may have caused the job to become held, use condor_q again:#

condor_q -format 'Memory:\t%s\n' ImageSize -format 'Disk:\t%s\n' DiskUsage <jobid>

Then, compare the resulting numbers to the thresholds in the PeriodicHold expression above. If you’re running CMSSW jobs and find that they are getting held because their ImageSize has exceeded the threshold, consider reducing the number of events per job or searching your code for possible memory leaks. If your jobs require more memory than is typically available on our cluster, you can explicitly set a higher memory requirement to limit your jobs to machines that have enough RAM, where N is the expected number of megabytes of RAM needed:#

requirements = (TARGET.Memory >= N)

If too few machines have enough memory to run your jobs, we may be able to specially configure some of them so that your jobs can run.#


What operating system will my job run in and how can I control that?#

You can see which operating systems are installed on the cluster by using the following command:#

condor_status -af OpSysAndVer | sort | uniq -c
condor_status -pool cm.chtc.wisc.edu -af OpSysAndVer | sort | uniq -c

If your job runs on an operating system that is different from the one it was compiled for, it may experience problems such as missing libraries.#

You can control which OS your job runs on using the requirements expression in your condor submit file. Example:#

requirements = TARGET.OpSysAndVer == "CentOS7"

or#

requirements = TARGET.OpSysAndVer == "AlmaLinux9" || TARGET.OpSysAndVer == "CentOS9"

If you are using the farmout scripts, use the --opsys command-line option.#

An alternative to restricting the OS is to run the job in a container, so that it sees the same OS environment no matter what the host computer is running. One way to achieve that is to specify a singularity image in the condor submit file. Example:#

container_image = /cvmfs/singularity.opensciencegrid.org/cmssw/cms:rhel7
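
A minimal submit-file sketch using this approach might look like the following; the executable name is a placeholder, and the container universe setting is an assumption that may need adjusting for your workflow:#

universe        = container
container_image = /cvmfs/singularity.opensciencegrid.org/cmssw/cms:rhel7
executable      = run.sh
output          = job.$(Cluster).$(Process).out
error           = job.$(Cluster).$(Process).err
log             = job.log
queue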

Another way is to invoke singularity (aka apptainer) in your job script. Example:#

singularity exec -B /cvmfs -B `pwd` --no-home /cvmfs/singularity.opensciencegrid.org/cmssw/cms:rhel7 COMMAND-TO-EXECUTE

If you are using the farmout scripts, use the --use-singularity command-line option.#


How can I debug my running jobs?#

If you need to interactively debug a running job, you can ssh to it using the following command:#

condor_ssh_to_job jobid

A message will inform you of the PID of the job. That process is likely just a wrapper shell script, not the actual cmsRun process. To find the cmsRun process, you can use the following:#

ps auwwx --forest | less

Find the PID mentioned by condor in the process tree. You should then see cmsRun as a child of that process.#

Another way to see your job’s process tree is with the following command, substituting PID with the PID specified by Condor:#

pstree -pa PID

Once you know the PID of the process you wish to debug, you can attach with a debugger or use gstack to get a stack dump. Example:#

gstack PID
gdb -p PID

What does this WARNING (File /afs/blah/blah.out is not writable by condor) mean?#

WARNING: File /afs/hep.wisc.edu/user/blah/blah.out is not writable by condor.
WARNING: File /afs/hep.wisc.edu/user/blah/blah.error is not writable by condor.

The above indicates that your job is trying to access AFS. Remove all references to AFS files and directories from your job. AFS is not accessible to Condor jobs.#


How do I restrict my job to run on only the machines with /hdfs mounted?#

If you are reading files via xrootd, your job can run anywhere and access the data. If instead you are reading files via /hdfs, be aware that not all machines at UW have /hdfs mounted.#

The requirements expression for restricting your job to run on machines that have /hdfs mounted is this:#

requirements = TARGET.HAS_CMS_HDFS

This is automatically inserted into the job requirements for farmout jobs using the --use-hdfs option.#
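
If you are writing your own submit file, this can be combined with other requirements using the usual ClassAd operators, for example (the OS clause here is only illustrative):#

requirements = TARGET.HAS_CMS_HDFS && (TARGET.OpSysAndVer == "AlmaLinux9")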


Should I use condor file transfer for reading from/writing to HDFS?#

The simple answer is no: we do not recommend using Condor file transfer to read from or write to HDFS. Condor file transfer passes through the machine where the job was submitted, so if many jobs are reading or writing at the same time, the submit machine becomes a bottleneck. Since files can be read from and written to HDFS using highly scalable protocols such as xrootd, there is no need to be limited by this bottleneck.#

The best option is to have the script that runs your job copy all of the output root files into HDFS using gfal-copy. This requires that you submit your job with a VOMS proxy so that it can authenticate to HDFS. Example of how to use gfal-copy: https://www.hep.wisc.edu/cms/comp/faq.html#how-can-i-copy-files#
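
A minimal sketch of what this might look like at the end of your job script; the output file name and destination path are placeholders:#

# copy the job's output into your /store/user area; -p creates missing parent directories
gfal-copy -p output.root davs://cmsxrootd.hep.wisc.edu:1094/store/user/<username>/myjob/output.root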

General#



What can I do on the login.hep.wisc.edu machines?#

We provide a pool of machines for interactive use. You can connect to them directly or via a round-robin address:#

$ ssh login.hep.wisc.edu
$ ssh login01.hep.wisc.edu

On the login servers, you can test and submit Condor jobs manually or using farmout; read from the HDFS cluster mounted at /hdfs; or run short tasks that don’t require much CPU or memory.#

Each login server also has a large local disk partition mounted at /scratch. You can create a directory for yourself (/scratch/<username>) and store short-lived or unimportant data there. /scratch is unique to each login server and not backed up or replicated on other servers, so write important data to HDFS instead.#
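
For example, to create your own scratch directory on the login server you are using:#

$ mkdir -p /scratch/$USER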

How can I protect files stored in AFS?#

Your home directory may contain several sensitive files, including your grid certificate private key or SSH private keys. These files must be protected from unauthorized access, though you may still allow unauthenticated users to view other files in your home directory. For each of the directories that contain sensitive files, create a private subdirectory and apply restrictive AFS ACLs to it. For example, to protect a Globus grid certificate private key:#

$ cd ~/.globus
$ mkdir .private
$ chmod 700 .private
$ fs sa .private $USER rlidkwa -clear

Then, move each sensitive file into the .private directory and create a symlink back:#

$ mv userkey.pem .private
$ ln -s .private/userkey.pem userkey.pem

Files that should be protected in this manner include:#

  • ~/.globus/userkey.pem
  • ~/.ssh/id_rsa
  • ~/.ssh/id_dsa
  • ~/.ssh/identity

In addition, several directories may contain private files and should therefore be restricted in their entirety:#

  • ~/.mozilla

Contacts For Help#

Email: help@hep.wisc.edu#