Changeset 44


Timestamp:
Dec 5, 2011, 10:58:16 PM (9 years ago)
Author:
g7moreau
Message:
  • doc doc and doc!
File:
1 edited

  • trunk/oarutils/oar-parexec

    r43 r44  
    233233=head1 NAME
    234234
    235 oar-parexec - parallel execute lot of small job
     235oar-parexec - parallel execution of many small jobs
    236236
    237237=head1 SYNOPSIS
    238238
    239  oar-parexec --filecmd filecommand [--logtrace tracefile] [--verbose] [--jobnp integer] [--nodefile filenode] [--masterio basefileio] [--switchio] [--oarsh sssh]
     239 oar-parexec --filecmd filecommand [--logtrace tracefile] [--verbose] [--jobnp integer] \
     240            [--nodefile filenode] [--masterio basefileio] [--switchio] [--oarsh sssh]
    240241 oar-parexec --help
    241242
    242243=head1 DESCRIPTION
    243244
    244 C<oar-parexec> execute lot of small job.in parallel inside a cluster.
    245 Number of parallel job at one time cannot excede core number in the node file.
     245C<oar-parexec> can execute many small jobs in parallel inside a cluster.
     246The number of jobs running at one time cannot exceed the number of cores defined in the node file.
    246247C<oar-parexec> is easier to use inside an OAR job environment
    247 which define automatically theses strategics parameters...
     248which automatically defines these key parameters...
     249However, it can be used outside OAR.
    248250
    249251Option C<--filecmd> is the only mandatory one.
    250252
    251253Small jobs will be launched in the same folder as the master job.
    252 Two environment variable are define for each small job
     254Two environment variables are defined for each small job,
    253255but only for parallel small jobs (option C<--jobnp> > 1).
    254256
     
    256258 OAR_NP        - number of processors assigned
    257259
    258 The file define by OAR_NODE_FILE is created on the node before launching
    259 the small job in /tmp and will be delete after...
     260The file defined by OAR_NODE_FILE is created in /tmp
     261on the node before launching the small job,
     262and it is deleted after the job completes.
    260263Because C<oar-parexec> is a simple script,
    261264OAR_NODE_FILE will not be deleted if the master job crashes.
     
    278281
    279282File that logs and traces the running jobs.
    280 In case of running the same command (after crash for example),
    281 only job that ar not mark as done will be run again.
    282 Be carefful, job mark as running (start but for finish) will be run again.
     283If the same master command is run again (after a crash, for example),
     284only jobs that are not marked as done will be run again.
     285Be careful: jobs marked as running (started but not finished) will be run again.
    283286
    284287This option is very useful in case of crash
     
    294297=item B<-n|--nodefile filenode>
    295298
    296 File name that list all the node to launch job.
     299Name of the file that lists all the nodes where jobs can be launched.
    297300By default, it is defined automatically by OAR via
    298301environment variable C<OAR_NODE_FILE>.
     
    326329By default
    327330
    328         oarsh -q -T
     331 oarsh -q -T
     332
     333Change it to C<ssh> if you are not using an OAR cluster...
    329334
    330335=item B<-h|--help>
     
    334339
    335340=head1 EXAMPLE
     341
     342=head2 Simple list of sequential jobs
    336343
    337344The command file (option C<--filecmd>) could contain:
     
    352359 $HOME/test/subjob40.sh
    353360
    354 These jobs could be launch by
    355 
    356  oarsub -n test -l /core=6,walltime=00:35:00 "oar-parexec -f ./subjob.list.txt"
     361These jobs could be launched by:
     362
     363 oarsub -n test -l /core=6,walltime=04:00:00 "oar-parexec -f ./subjob.list.txt"
     364
     365=head2 Parallel jobs
     366
     367Give the number of cores each small job needs with option C<--jobnp>.
     368If your job is built on OpenMP or MPI,
     369you can use the OAR_NP and OAR_NODE_FILE variables to configure it.
     370On an OAR cluster, you need to use C<oarsh> or a wrapper like C<oar-envsh>
     371for connections between nodes instead of C<ssh>.
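
As a minimal sketch of the OpenMP case, a small job listed in the command file could be a wrapper script that reads OAR_NP before starting the real program. The function name C<configure_openmp> and the commented-out program name are hypothetical; only OAR_NP comes from this documentation, and OMP_NUM_THREADS is the standard OpenMP variable. Because OAR_NP is only defined when C<--jobnp> is greater than 1, the sketch falls back to a single thread.

```shell
#!/bin/sh
# Hypothetical wrapper for one small job (not part of oar-parexec itself).

# Use the number of cores given by oar-parexec as the OpenMP thread count.
# OAR_NP is only set for parallel small jobs, so default to 1 thread.
configure_openmp() {
    OMP_NUM_THREADS=${OAR_NP:-1}
    export OMP_NUM_THREADS
}

configure_openmp
echo "running with $OMP_NUM_THREADS thread(s)"
# ./my_openmp_program    # hypothetical binary, replace with your own
```

An MPI job would use OAR_NODE_FILE in a similar way, as the machine file passed to its launcher.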
     372
     373Example with parallel small jobs on 2 cores each:
     374
     375 oarsub -n test -l /core=6,walltime=04:00:00 "oar-parexec -j 2 -f ./subjob.list.txt"
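
To get a rough idea of how many small jobs can run at once in such a reservation, the arithmetic can be sketched as below. This assumes the node file has one line per core (as OAR writes it) and that the limit is simply cores divided by C<--jobnp>; the function name C<max_parallel_jobs> and the demo node names are hypothetical, and the real script may count differently.

```shell
#!/bin/sh
# max_parallel_jobs NODEFILE JOBNP
# Assumption: one line per core in NODEFILE, JOBNP cores per small job.
max_parallel_jobs() {
    cores=$(wc -l < "$1")
    echo $(( cores / $2 ))
}

# Demo with a made-up node file describing 6 cores on 2 nodes.
demo=$(mktemp)
printf 'node1\nnode1\nnode1\nnode2\nnode2\nnode2\n' > "$demo"
max_parallel_jobs "$demo" 2    # 6 cores / 2 cores per job, prints 3
rm -f "$demo"
```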
     376
     377=head2 Tracing and master crash
     378
     379If the master node crashes after hours of computation, is everything lost?
     380No: with option C<--logtrace>,
     381it is possible to remember earlier results
     382and not re-run those jobs on the second and later runs.
     383
     384 oarsub -n test -l /core=6,walltime=04:00:00 "oar-parexec -f ./subjob.list.txt -l ./subjob.list.log"
     385
     386After a crash or an C<oardel> command,
     387you can re-run the same command, which will finish executing the jobs in the list:
     388
     389 oarsub -n test -l /core=6,walltime=04:00:00 "oar-parexec -f ./subjob.list.txt -l ./subjob.list.log"
     390
     391C<logtrace> files are just plain files.
     392We use the extension '.log' because these files are automatically
     393excluded from our backup system!
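
The idea of skipping finished jobs on a re-run can be illustrated with a small sketch. It does not read the real C<logtrace> format, which is not described here. Instead it assumes a hypothetical file that holds one finished command per line, exactly as written in the command file. The function name C<pending_jobs> and the demo file contents are also assumptions.

```shell
#!/bin/sh
# pending_jobs CMDFILE DONEFILE
# Print the commands of CMDFILE that do not appear in DONEFILE.
# DONEFILE is a hypothetical "one finished command per line" list,
# NOT the actual logtrace file written by oar-parexec.
pending_jobs() {
    # -F fixed strings, -x whole-line match, -v keep the non-matching lines
    grep -v -x -F -f "$2" "$1"
}

# Demo: three commands, the second one already finished.
cmd_file=$(mktemp)
done_file=$(mktemp)
printf '%s\n' './subjob01.sh' './subjob02.sh' './subjob03.sh' > "$cmd_file"
printf '%s\n' './subjob02.sh' > "$done_file"
pending_jobs "$cmd_file" "$done_file"    # lists subjob01.sh and subjob03.sh
rm -f "$cmd_file" "$done_file"
```

Note that C<grep> exits with a non-zero status when nothing is left to run, which a caller could use as the "all done" signal.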
     394
     395=head2 Checkpointing and idempotent jobs
     396
     397C<oar-parexec> is compatible with OAR checkpointing.
     398If you have 2000 small jobs that need 55h to be done on 6 cores,
     399you can cut this into smaller parts.
     400
     401For this example, we suppose that each small job needs about 10min...
     402So, we send a checkpoint 12min before the end of the process
     403to let C<oar-parexec> finish the jobs already started.
     404After being checkpointed, C<oar-parexec> does not start any new small jobs.
     405
     406 oarsub -t idempotent -n test -l /core=6,walltime=04:00:00 --checkpoint 720 \
     407   "oar-parexec -f ./subjob.list.txt -l ./subjob.list.log"
     408
     409After 3h48min, the OAR job will stop launching new small jobs.
     410When all running small jobs are finished, it exits.
     411But as the OAR job is of type C<idempotent>,
     412OAR will re-submit it as long as some small jobs remain to be executed...
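
The 3h48min figure is the walltime minus the C<--checkpoint> delay (720 seconds, that is 12min). A small helper makes the arithmetic explicit when choosing other values; the function name C<checkpoint_offset> is hypothetical and the sketch only handles walltimes written as HH:MM:SS.

```shell
#!/bin/sh
# checkpoint_offset WALLTIME DELAY
# WALLTIME is HH:MM:SS, DELAY is the --checkpoint value in seconds.
# Prints how long after the job start the checkpoint signal is sent.
checkpoint_offset() {
    walltime=$1
    delay=$2
    h=${walltime%%:*}
    rest=${walltime#*:}
    m=${rest%%:*}
    s=${rest#*:}
    # Strip one leading zero so that values like 08 are not read as octal.
    h=${h#0}; m=${m#0}; s=${s#0}
    left=$(( h * 3600 + m * 60 + s - delay ))
    printf '%02d:%02d:%02d\n' $(( left / 3600 )) $(( left % 3600 / 60 )) $(( left % 60 ))
}

checkpoint_offset 04:00:00 720    # prints 03:48:00
checkpoint_offset 01:05:00 720    # prints 00:53:00
```

With small jobs of about 10min, a 12min delay leaves every job that has already started enough time to finish before the walltime.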
     413
     414This way, we give other users a chance to use the cluster!
     415
     416In this last example, we use a moldable OAR job with idempotent
     417to reserve many cores for a short time or a few cores for a long time:
     418
     419 oarsub -t idempotent -n test \
     420   -l /core=50,walltime=01:05:00 \
     421   -l /core=6,walltime=04:00:00 \
     422   --checkpoint 720 \
     423   "oar-parexec -f ./subjob.list.txt -l ./subjob.list.log"
    357424
    358425
    359426=head1 SEE ALSO
    360427
    361 oar-dispatch, mpilauncher
     428oar-dispatch, mpilauncher,
     429oarsh, oar-envsh, ssh
    362430
    363431