Changeset 44 for trunk/oarutils/oar-parexec
- Timestamp: Dec 5, 2011, 10:58:16 PM (13 years ago)
- File: 1 edited
trunk/oarutils/oar-parexec
 r43  r44
 233  233   =head1 NAME
 234  234
 235      - oar-parexec - parallel execut e lot ofsmall job
      235 + oar-parexec - parallel execution of many small job
 236  236
 237  237   =head1 SYNOPSIS
 238  238
 239      - oar-parexec --filecmd filecommand [--logtrace tracefile] [--verbose] [--jobnp integer] [--nodefile filenode] [--masterio basefileio] [--switchio] [--oarsh sssh]
      239 + oar-parexec --filecmd filecommand [--logtrace tracefile] [--verbose] [--jobnp integer] \
      240 + [--nodefile filenode] [--masterio basefileio] [--switchio] [--oarsh sssh]
 240  241   oar-parexec --help
 241  242
 242  243   =head1 DESCRIPTION
 243  244
 244      - C<oar-parexec> execute lot of small job.in parallel inside a cluster.
 245      - Number of parallel job at one time cannot exce de core number in the node file.
      245 + C<oar-parexec> can execute lot of small job in parallel inside a cluster.
      246 + Number of parallel job at one time cannot exceed the number of core define in the node file
 246  247   C<oar-parexec> is easier to use inside an OAR job environment
 247      - which define automatically theses strategics parameters...
      248 + which define automatically these strategics parameters...
      249 + However, it can be used outside OAR.
 248  250
 249  251   Option C<--filecmd> is the only mandatory one.
 250  252
 251  253   Small job will be launch in the same folder as the master job.
 252      - Two environment variable are define for each small job
      254 + Two environment variable are defined for each small job
 253  255   and only in case of parallel small job (option C<--jobnp> > 1).
 254  256
   …    …
 256  258   OAR_NP - number of processor affected
 257  259
 258      - The file define by OAR_NODE_FILE is created on the node before launching
 259      - the small job in /tmp and will be delete after...
      260 + The file define by OAR_NODE_FILE is created in /tmp
      261 + on the node before launching the small job
      262 + and this file will be delete after job complete.
 260  263   C<oar-parexec> is a simple script,
 261  264   OAR_NODE_FILE will not be deleted in case of crash of the master job.
   …    …
 278  281
 279  282   File which log and trace running job.
 280      - In case of running the same command (after crash for example),
 281      - only job that ar not mark as done will be run again.
 282      - Be caref ful, job mark as running (start but forfinish) will be run again.
      283 + In case of running the same master command (after crash for example),
      284 + only job that are not mark as done will be run again.
      285 + Be careful, job mark as running (start but not finish) will be run again.
 283  286
 284  287   This option is very usefull in case of crash
   …    …
 294  297   =item B<-n|--nodefile filenode>
 295  298
 296      - File name that list all the node to launch job.
      299 + File name that list all the node where job could be launch.
 297  300   By defaut, it's define automatically by OAR via
 298  301   environment variable C<OAR_NODE_FILE>.
   …    …
 326  329   By default
 327  330
 328      - oarsh -q -T
      331 + oarsh -q -T
      332 +
      333 + Change it to C<ssh> if you are not using an OAR cluster...
 329  334
 330  335   =item B<-h|--help>
   …    …
 334  339
 335  340   =head1 EXAMPLE
      341 +
      342 + =head2 Simple list of sequential job
 336  343
 337  344   Content for the job file command (option C<--filecmd>) could have:
   …    …
 352  359   $HOME/test/subjob40.sh
 353  360
 354      - These jobs could be launch by
 355      -
 356      - oarsub -n test -l /core=6,walltime=00:35:00 "oar-parexec -f ./subjob.list.txt"
      361 + These jobs could be launch by:
      362 +
      363 + oarsub -n test -l /core=6,walltime=04:00:00 "oar-parexec -f ./subjob.list.txt"
      364 +
      365 + =head2 Parallel job
      366 +
      367 + You need to put the number of core each small job need with option C<--jobnp>.
      368 + If your job is build on OpenMP or MPI,
      369 + you can use OAR_NP and OAR_NODE_FILE variables to configure them.
      370 + On OAR cluster, you need to use C<oarsh> or a wrapper like C<oar-envsh>
      371 + for connexion between node instead of C<ssh>.
      372 +
      373 + Example with parallel small job on 2 core:
      374 +
      375 + oarsub -n test -l /core=6,walltime=04:00:00 "oar-parexec -j 2 -f ./subjob.list.txt"
      376 +
      377 + =head2 Tracing and master crash
      378 +
      379 + If the master node crash after hours of calculus, everything is lost ?
      380 + No, with option C<--logtrace>,
      381 + it's possible to remember older result
      382 + and not re-run these job the second and next time.
      383 +
      384 + oarsub -n test -l /core=6,walltime=04:00:00 "oar-parexec -f ./subjob.list.txt -l ./subjob.list.log"
      385 +
      386 + After a crash or an C<oardel> command,
      387 + you can then re-run the same command that will end to execute the jobs in the list
      388 +
      389 + oarsub -n test -l /core=6,walltime=04:00:00 "oar-parexec -f ./subjob.list.txt -l ./subjob.list.log"
      390 +
      391 + C<logtrace> file are just plain file.
      392 + We use the extension '.log' because these files are automatically
      393 + eliminate from our backup system!
      394 +
      395 + =head2 Checkpointing and Idempotent
      396 +
      397 + C<oar-parexec> is compatible with the OAR checkpointing.
      398 + Il you have 2000 small jobs that need 55h to be done on 6 cores,
      399 + you can cut this in small parts.
      400 +
      401 + For this example, we suppose that each small job need about 10min...
      402 + So, we send a checkpoint 12min before the end of the process
      403 + to let C<oar-parexec> finish the jobs started.
      404 + After being checkpointed, C<oar-parexec> do not start any new small job.
      405 +
      406 + oarsub -t idempotent -n test -l /core=6,walltime=04:00:00 --checkpoint 720 \
      407 + "oar-parexec -f ./subjob.list.txt -l ./subjob.list.log"
      408 +
      409 + After 3h48min, the OAR job will begin to stop launching new small job.
      410 + When all running small job are finished, it's exit.
      411 + But as the OAR job is type C<idempotent>,
      412 + OAR will re-submit it as long as all small job are not executed...
      413 +
      414 + This way, we let other users a chance to use the cluster!
      415 +
      416 + In this last exemple, we use moldable OAR job with idempotent
      417 + to reserve many core for a small time or a few cores for a long time:
      418 +
      419 + oarsub -t idempotent -n test \
      420 + -l /core=50,walltime=01:05:00 \
      421 + -l /core=6,walltime=04:00:00 \
      422 + --checkpoint 720 \
      423 + "oar-parexec -f ./subjob.list.txt -l ./subjob.list.log"
 357  424
 358  425
 359  426   =head1 SEE ALSO
 360  427
 361      - oar-dispatch, mpilauncher
      428 + oar-dispatch, mpilauncher,
      429 + orsh, oar-envsh, ssh
 362  430
 363  431
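
The checkpointing section added in r44 relies on a small piece of arithmetic: with a walltime of 04:00:00 and --checkpoint 720, the checkpoint arrives 720 seconds before the walltime, that is after 3h48min. The sketch below only reproduces that calculation so the delay can be checked for other walltimes. The helper names walltime_to_seconds and checkpoint_time are hypothetical and are not part of oar-parexec or OAR; the sketch assumes the walltime is always written as HH:MM:SS.

```shell
#!/bin/sh
# Hypothetical helpers (not part of oar-parexec or OAR), for illustration only.

# Convert an HH:MM:SS walltime into a number of seconds.
walltime_to_seconds() {
    echo "$1" | awk -F: '{ print $1 * 3600 + $2 * 60 + $3 }'
}

# Print, as HH:MM:SS, the elapsed time at which the checkpoint would arrive,
# given a walltime ($1) and a checkpoint delay in seconds ($2).
checkpoint_time() {
    total=$(walltime_to_seconds "$1")
    at=$((total - $2))
    printf '%02d:%02d:%02d\n' $((at / 3600)) $((at % 3600 / 60)) $((at % 60))
}

checkpoint_time 04:00:00 720   # prints 03:48:00
checkpoint_time 01:05:00 720   # prints 00:53:00
```

The second call corresponds to the 50-core alternative of the moldable example: with a walltime of 01:05:00, the same 720 second delay leaves 53 minutes before new small jobs stop being launched.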