condor

Sylvia Biscoveanu's first-time condor tutorial.

useful commands for the unix command line

view submitted jobs: condor_q
view my submitted jobs:
# define "qlist" in your .bash_login file
alias qlist='condor_q -global ethrane | sed "s/.*Submitter.*//" | sed "/^$/d"'
kill all condor submissions:
alias qkill='condor_rm ethrane'
submitting a dag file:
#define this command in your .bash_login file:
alias qsub='condor_submit_dag'
# or this one to limit the number of jobs
alias presub='condor_submit_dag -maxjobs 40'
information about your submissions (e.g., to learn why a job is on hold): condor_q -analyze ID
"better" analysis of condor job: condor_q -better-analyze 5381361.0
(Check, e.g., how many machines match the conditions required by your sub file.)
running a detailed condor_q: condor_q -direct schedd -analyze your_user_name
Why is my job on hold? condor_q -hold -long your_user_name | grep '^HoldReason ='
cancel a submission: condor_rm ID
condor_status: lists all the nodes available, their memory, and other useful information
condor_userprio -all: lists condor users and their priority (lower "Effective Priority" numbers are given more resources).
condor_q -long ID gives detailed information about your job, including "RequestMemory" (how much memory you asked for) and "ImageSize" (how much memory you ended up using).
condor_ssh_to_job can be used to run top on the node where your job is actually executing. This is useful for debugging.

sub files

example sub file
# require GPUs (4g memory recommended)
Requirements = TARGET.WantGPU =?= True
+WantGPU = True
# require 8-core CPUs (8g memory recommended)
request_cpus = 8
# require extra memory
# RequestMemory = 4000   (obsolete since Oct 30, 2014; use request_memory instead)
request_memory = 4000
# vetoing a bad node (note: a later Requirements line overrides an earlier one,
# so combine multiple clauses with &&)
Requirements = TARGET.Machine =!= "node14.mit.edu"
Requirements = TARGET.Machine =!= "node501.cluster.ldas.cit"
...or use a regex. Note that you can find the full set of machine ClassAd attributes by running condor_status -long.
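For reference, a complete minimal sub file might look like the following sketch; the executable name, file paths, and job count are placeholders:

```
# hypothetical minimal sub file -- all names and paths are placeholders
universe       = vanilla
executable     = my_job.sh
arguments      = $(Process)
request_cpus   = 1
request_memory = 4000
log            = logs/job.log
output         = logs/job.$(Process).out
error          = logs/job.$(Process).err
queue 10
```

Submit it with condor_submit my_job.sub (dag files go through condor_submit_dag instead).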

dag files

example dag file
perl script to make this dag file
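In outline, a dag file lists the jobs, their sub files, and the parent/child dependencies between them. A minimal sketch (job and sub-file names are placeholders):

```
# hypothetical dag: jobA runs first, then jobB and jobC in parallel
JOB jobA a.sub
JOB jobB b.sub
JOB jobC c.sub
PARENT jobA CHILD jobB jobC
# retry failed jobs up to 2 times
RETRY jobB 2
RETRY jobC 2
```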

debugging

My condor jobs are dying before I can condor_ssh_to_job. Change your executable call to: your_executable || sleep 10000. When the executable dies, the node will proceed to the sleep command, giving you time to ssh in.
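The same trick can be packaged as a small wrapper script that your sub file points at. A minimal sketch, where your_executable is a placeholder and echo stands in for it in the demo call:

```shell
#!/bin/sh
# run the real command; if it exits nonzero, sleep so the node stays
# claimed long enough to condor_ssh_to_job in and poke around
run_with_debug_window() {
    "$@" || sleep 10000
}

# demo: echo stands in for your_executable here
out=$(run_with_debug_window echo "job ran")
echo "$out"
```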

general advice

  • Make sure that the parent job does not fail with an error; if it does, the entire submission will fail.  One way to avoid this fate is to put an if-then statement in your executable source code so that it automatically exits gracefully when jobNumber==0.
  • example bash wrapper for condor submission
  • You can use top to assess your memory needs by watching the "VIRT" column while you run your job on the head node; VIRT is the relevant number for predicting what you will need to request from condor.
  • To make sure your condor job does not exceed its memory request, you can run "ulimit -v <limit_in_kB>" before running your job on the head node; if the job then fails with a memory-allocation error, it needs more than that limit.
  • Running matlab on condor: HTCondor page
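On the ulimit tip above: ulimit -v takes a limit in kilobytes, and it is easiest to set it in a subshell so your login session is not capped afterwards. A minimal sketch, where 4000000 kB matches the request_memory = 4000 (MB) example in the sub-file section:

```shell
#!/bin/sh
# cap virtual memory at ~4 GB inside a subshell (ulimit -v is in kB);
# the limit disappears when the subshell exits, leaving the session alone
cap=$( ( ulimit -v 4000000; ulimit -v ) )
echo "vmem cap: $cap kB"
```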
