Changeset 44

Dec 5, 2011, 10:58:16 PM
  • doc doc and doc!
1 edited


  • trunk/oarutils/oar-parexec

r43 → r44
=head1 NAME
oar-parexec - parallel execution of many small jobs
=head1 SYNOPSIS
 oar-parexec --filecmd filecommand [--logtrace tracefile] [--verbose] [--jobnp integer] \
            [--nodefile filenode] [--masterio basefileio] [--switchio] [--oarsh sssh]
 oar-parexec --help
=head1 DESCRIPTION
C<oar-parexec> can execute lots of small jobs in parallel inside a cluster.
The number of jobs running in parallel at one time cannot exceed the number of cores defined in the node file.
C<oar-parexec> is easier to use inside an OAR job environment,
which automatically defines these strategic parameters...
However, it can be used outside OAR.
Option C<--filecmd> is the only mandatory one.
Small jobs will be launched in the same folder as the master job.
Two environment variables are defined for each small job,
and only in the case of parallel small jobs (option C<--jobnp> > 1).
 OAR_NP        - number of processors allocated
The file defined by OAR_NODE_FILE is created in /tmp
on the node before launching the small job,
and it will be deleted after the job completes.
C<oar-parexec> is a simple script;
OAR_NODE_FILE will not be deleted if the master job crashes.
File which logs and traces running jobs.
When re-running the same master command (after a crash, for example),
only jobs that are not marked as done will be run again.
Be careful: jobs marked as running (started but not finished) will be run again.
This option is very useful in case of crash
=item B<-n|--nodefile filenode>
File name that lists all the nodes where jobs can be launched.
By default, it is defined automatically by OAR via the
environment variable C<OAR_NODE_FILE>.
By default
 oarsh -q -T
Change it to C<ssh> if you are not using an OAR cluster...
=item B<-h|--help>
=head1 EXAMPLE
=head2 Simple list of sequential jobs
Content of the job command file (option C<--filecmd>) could be:
 $HOME/test/
These jobs could be launched by:

 oarsub -n test -l /core=6,walltime=04:00:00 "oar-parexec -f ./subjob.list.txt"
=head2 Parallel jobs
You need to specify the number of cores each small job needs with option C<--jobnp>.
If your jobs are built on OpenMP or MPI,
you can use the OAR_NP and OAR_NODE_FILE variables to configure them.
On an OAR cluster, you need to use C<oarsh> or a wrapper like C<oar-envsh>
for connections between nodes instead of C<ssh>.
Example with parallel small jobs on 2 cores:
 oarsub -n test -l /core=6,walltime=04:00:00 "oar-parexec -j 2 -f ./subjob.list.txt"
=head2 Tracing and master crash
If the master node crashes after hours of computation, is everything lost?
No: with option C<--logtrace>,
it is possible to keep the results already obtained
and not re-run those jobs on subsequent runs.
 oarsub -n test -l /core=6,walltime=04:00:00 "oar-parexec -f ./subjob.list.txt -l ./subjob.list.log"
After a crash or an C<oardel> command,
you can then re-run the same command, which will finish executing the jobs in the list:
 oarsub -n test -l /core=6,walltime=04:00:00 "oar-parexec -f ./subjob.list.txt -l ./subjob.list.log"
C<logtrace> files are just plain text files.
We use the extension '.log' because these files are automatically
excluded from our backup system!
=head2 Checkpointing and Idempotent
C<oar-parexec> is compatible with OAR checkpointing.
If you have 2000 small jobs that need 55h to complete on 6 cores,
you can cut the work into smaller parts.
For this example, we suppose that each small job needs about 10min...
So, we send a checkpoint 12min before the end of the process
to let C<oar-parexec> finish the jobs already started.
After being checkpointed, C<oar-parexec> does not start any new small jobs.
 oarsub -t idempotent -n test -l /core=6,walltime=04:00:00 --checkpoint 720 \
   "oar-parexec -f ./subjob.list.txt -l ./subjob.list.log"
After 3h48min, the OAR job will stop launching new small jobs.
When all running small jobs are finished, it exits.
But as the OAR job is of type C<idempotent>,
OAR will re-submit it as long as some small jobs remain unexecuted...
This way, we give other users a chance to use the cluster!
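The timing above can be checked with a little arithmetic (walltime 04:00:00 minus the 720-second checkpoint margin):

```shell
# When the checkpoint signal fires, counted from job start:
walltime=$(( 4 * 3600 ))   # 04:00:00 in seconds
checkpoint=720             # --checkpoint 720 (12 min before the end)
signal_at=$(( walltime - checkpoint ))
printf '%02dh%02dmin\n' $(( signal_at / 3600 )) $(( signal_at % 3600 / 60 ))
# prints 03h48min
```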
In this last example, we use a moldable OAR job with C<idempotent>
to reserve many cores for a short time or a few cores for a long time:
 oarsub -t idempotent -n test \
   -l /core=50,walltime=01:05:00 \
   -l /core=6,walltime=04:00:00 \
   --checkpoint 720 \
   "oar-parexec -f ./subjob.list.txt -l ./subjob.list.log"
=head1 SEE ALSO
oar-dispatch, mpilauncher,
oarsh, oar-envsh, ssh