LAM/MPI General User's Mailing List Archives

From: Di Ling (dling_at_[hidden])
Date: 2003-07-11 21:09:33


Hi,

I wrote an MPI program for solving linear equations, and it does give the correct
result, but when I run the program repeatedly, it gets slower and slower. In one
case with two processors, the time for one step grew from 219 seconds for the
first step to 554 seconds for the 10th step, even though the calculations in each
step are exactly the same. It seems that the scheduling among the tasks is unbalanced.
Has anybody had this problem before? Please help me.

The output of my program looks like this:

SHELL% mpirun -ssi rpi sysv myapp

STEP 1:
 Loading matrix from file 'matrix_l'...
 Matrix file 'matrix_l' loaded.
 Starting PCR: _g_size = 46280, _sp_size = 1094564, bandwidth = 23.650908
 Parallel mode: MPI tasks = 2
 Converged in 1468 iterations, e_residue = 4.4007069704E-08, omn count = 1467.
 Process <0> timers:
        Timer 0 for sub: count = 2937, time = 20.886864
        Timer 1 for gather: count = 1468, time = 173.243905
        Timer 2 for reduce: count = 6036, time = 5.449356
        Timer 3 for bcast: count = 0, time = 0.941890
 Process <1> timers:
        Timer 0 for sub: count = 2937, time = 21.310844
        Timer 1 for gather: count = 1468, time = 170.313529
        Timer 2 for reduce: count = 6036, time = 8.222676
        Timer 3 for bcast: count = 0, time = 0.943633
 Total omnodr time = 219.992384

STEP 2:
 Loading matrix from file 'matrix_l'...
 Matrix file 'matrix_l' loaded.
 Starting PCR: _g_size = 46280, _sp_size = 1094564, bandwidth = 23.650908
 Parallel mode: MPI tasks = 2
 Converged in 1468 iterations, e_residue = 4.4007069704E-08, omn count = 1467.
 Process <0> timers:
        Timer 0 for sub: count = 2937, time = 20.848251
        Timer 1 for gather: count = 1468, time = 175.721152
        Timer 2 for reduce: count = 6036, time = 11.647156
        Timer 3 for bcast: count = 0, time = 0.516676
 Process <1> timers:
        Timer 0 for sub: count = 2937, time = 20.769629
        Timer 1 for gather: count = 1468, time = 176.512609
        Timer 2 for reduce: count = 6036, time = 11.098584
        Timer 3 for bcast: count = 0, time = 0.516803
 Total omnodr time = 227.987318

STEP 3:
 Loading matrix from file 'matrix_l'...
 Matrix file 'matrix_l' loaded.
 Starting PCR: _g_size = 46280, _sp_size = 1094564, bandwidth = 23.650908
 Parallel mode: MPI tasks = 2
 Converged in 1468 iterations, e_residue = 4.4007069704E-08, omn count = 1467.
 Process <0> timers:
        Timer 0 for sub: count = 2937, time = 21.012736
        Timer 1 for gather: count = 1468, time = 210.400191
        Timer 2 for reduce: count = 6036, time = 17.359436
        Timer 3 for bcast: count = 0, time = 0.490925
 Process <1> timers:
        Timer 0 for sub: count = 2937, time = 20.384681
        Timer 1 for gather: count = 1468, time = 182.488757
        Timer 2 for reduce: count = 6036, time = 46.003675
        Timer 3 for bcast: count = 0, time = 0.493234
 Total omnodr time = 268.607717

STEP 4:
 Loading matrix from file 'matrix_l'...
 Matrix file 'matrix_l' loaded.
 Starting PCR: _g_size = 46280, _sp_size = 1094564, bandwidth = 23.650908
 Parallel mode: MPI tasks = 2
 Converged in 1468 iterations, e_residue = 4.4007069704E-08, omn count = 1467.
 Process <0> timers:
        Timer 0 for sub: count = 2937, time = 22.160519
        Timer 1 for gather: count = 1468, time = 226.491551
        Timer 2 for reduce: count = 6036, time = 22.529618
        Timer 3 for bcast: count = 0, time = 0.544272
 Process <1> timers:
        Timer 0 for sub: count = 2937, time = 20.554751
        Timer 1 for gather: count = 1468, time = 190.271343
        Timer 2 for reduce: count = 6036, time = 60.288964
        Timer 3 for bcast: count = 0, time = 0.546628
 Total omnodr time = 291.045351

STEP 5:
 Loading matrix from file 'matrix_l'...
 Matrix file 'matrix_l' loaded.
 Starting PCR: _g_size = 46280, _sp_size = 1094564, bandwidth = 23.650908
 Parallel mode: MPI tasks = 2
 Converged in 1468 iterations, e_residue = 4.4007069704E-08, omn count = 1467.
 Process <0> timers:
        Timer 0 for sub: count = 2937, time = 21.615419
        Timer 1 for gather: count = 1468, time = 232.338246
        Timer 2 for reduce: count = 6036, time = 27.635866
        Timer 3 for bcast: count = 0, time = 0.512605
 Process <1> timers:
        Timer 0 for sub: count = 2937, time = 20.785000
        Timer 1 for gather: count = 1468, time = 198.468036
        Timer 2 for reduce: count = 6036, time = 62.600546
        Timer 3 for bcast: count = 0, time = 0.515364
 Total omnodr time = 302.223854

STEP 6:
 Loading matrix from file 'matrix_l'...
 Matrix file 'matrix_l' loaded.
 Starting PCR: _g_size = 46280, _sp_size = 1094564, bandwidth = 23.650908
 Parallel mode: MPI tasks = 2
 Converged in 1468 iterations, e_residue = 4.4007069704E-08, omn count = 1467.
 Process <0> timers:
        Timer 0 for sub: count = 2937, time = 22.103874
        Timer 1 for gather: count = 1468, time = 242.393103
        Timer 2 for reduce: count = 6036, time = 31.386167
        Timer 3 for bcast: count = 0, time = 0.576921
 Process <1> timers:
        Timer 0 for sub: count = 2937, time = 20.571535
        Timer 1 for gather: count = 1468, time = 205.591732
        Timer 2 for reduce: count = 6036, time = 71.268750
        Timer 3 for bcast: count = 0, time = 0.581129
 Total omnodr time = 317.488496

STEP 7:
 Loading matrix from file 'matrix_l'...
 Matrix file 'matrix_l' loaded.
 Starting PCR: _g_size = 46280, _sp_size = 1094564, bandwidth = 23.650908
 Parallel mode: MPI tasks = 2
 Converged in 1468 iterations, e_residue = 4.4007069704E-08, omn count = 1467.
 Process <0> timers:
        Timer 0 for sub: count = 2937, time = 22.947837
        Timer 1 for gather: count = 1468, time = 250.703343
        Timer 2 for reduce: count = 6036, time = 39.540515
        Timer 3 for bcast: count = 0, time = 0.538599
 Process <1> timers:
        Timer 0 for sub: count = 2937, time = 21.716292
        Timer 1 for gather: count = 1468, time = 220.841175
        Timer 2 for reduce: count = 6036, time = 72.211103
        Timer 3 for bcast: count = 0, time = 0.542357
 Total omnodr time = 335.016795

STEP 8:
 Loading matrix from file 'matrix_l'...
 Matrix file 'matrix_l' loaded.
 Starting PCR: _g_size = 46280, _sp_size = 1094564, bandwidth = 23.650908
 Parallel mode: MPI tasks = 2
 Converged in 1468 iterations, e_residue = 4.4007069704E-08, omn count = 1467.
 Process <0> timers:
        Timer 0 for sub: count = 2937, time = 27.126867
        Timer 1 for gather: count = 1468, time = 332.812285
        Timer 2 for reduce: count = 6036, time = 104.659123
        Timer 3 for bcast: count = 0, time = 0.734146
 Process <1> timers:
        Timer 0 for sub: count = 2937, time = 27.840139
        Timer 1 for gather: count = 1468, time = 314.123519
        Timer 2 for reduce: count = 6036, time = 122.871434
        Timer 3 for bcast: count = 0, time = 0.738456
 Total omnodr time = 490.470455

STEP 9:
 Loading matrix from file 'matrix_l'...
 Matrix file 'matrix_l' loaded.
 Starting PCR: _g_size = 46280, _sp_size = 1094564, bandwidth = 23.650908
 Parallel mode: MPI tasks = 2
 Converged in 1468 iterations, e_residue = 4.4007069704E-08, omn count = 1467.
 Process <0> timers:
        Timer 0 for sub: count = 2937, time = 29.285080
        Timer 1 for gather: count = 1468, time = 331.930461
        Timer 2 for reduce: count = 6036, time = 147.903799
        Timer 3 for bcast: count = 0, time = 0.920213
 Process <1> timers:
        Timer 0 for sub: count = 2937, time = 31.194333
        Timer 1 for gather: count = 1468, time = 352.703100
        Timer 2 for reduce: count = 6036, time = 124.809475
        Timer 3 for bcast: count = 0, time = 0.925016
 Total omnodr time = 536.432494

STEP 10:
 Loading matrix from file 'matrix_l'...
 Matrix file 'matrix_l' loaded.
 Starting PCR: _g_size = 46280, _sp_size = 1094564, bandwidth = 23.650908
 Parallel mode: MPI tasks = 2
 Converged in 1468 iterations, e_residue = 4.4007069704E-08, omn count = 1467.
 Process <0> timers:
        Timer 0 for sub: count = 2937, time = 28.739998
        Timer 1 for gather: count = 1468, time = 337.927366
        Timer 2 for reduce: count = 6036, time = 160.852730
        Timer 3 for bcast: count = 0, time = 0.729920
 Process <1> timers:
        Timer 0 for sub: count = 2937, time = 29.843883
        Timer 1 for gather: count = 1468, time = 370.256357
        Timer 2 for reduce: count = 6036, time = 125.977428
        Timer 3 for bcast: count = 0, time = 0.735170
 Total omnodr time = 554.968516
 Master 0 safely stop.
 Slave 1 safely stop.
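
To make the trend in the log above easier to see, here is a minimal sketch in Python (not part of my program) that parses timer lines in this format and reports how much each timer grew between two steps. The names parse_timers, growth, and SAMPLE are only for illustration, and SAMPLE contains just the Process <0> timers of steps 1 and 10 copied from the output above.

```python
import re

# Patterns matching the log format shown in the output above.
STEP_RE = re.compile(r"STEP (\d+):")
PROC_RE = re.compile(r"Process <(\d+)> timers:")
TIMER_RE = re.compile(r"Timer \d+ for (\w+): count = \d+, time = ([\d.]+)")


def parse_timers(log_text):
    """Return {step: {rank: {timer_name: seconds}}} parsed from log text."""
    result = {}
    step = rank = None
    for line in log_text.splitlines():
        m = STEP_RE.search(line)
        if m:
            step = int(m.group(1))
            result[step] = {}
            rank = None
            continue
        m = PROC_RE.search(line)
        if m and step is not None:
            rank = int(m.group(1))
            result[step][rank] = {}
            continue
        m = TIMER_RE.search(line)
        if m and step is not None and rank is not None:
            result[step][rank][m.group(1)] = float(m.group(2))
    return result


def growth(timers, first, last, rank, name):
    """Ratio of one timer's value in step `last` to its value in step `first`."""
    return timers[last][rank][name] / timers[first][rank][name]


# Process <0> timers for steps 1 and 10, copied from the output above.
SAMPLE = """
STEP 1:
 Process <0> timers:
        Timer 0 for sub: count = 2937, time = 20.886864
        Timer 1 for gather: count = 1468, time = 173.243905
        Timer 2 for reduce: count = 6036, time = 5.449356
        Timer 3 for bcast: count = 0, time = 0.941890
STEP 10:
 Process <0> timers:
        Timer 0 for sub: count = 2937, time = 28.739998
        Timer 1 for gather: count = 1468, time = 337.927366
        Timer 2 for reduce: count = 6036, time = 160.852730
        Timer 3 for bcast: count = 0, time = 0.729920
"""

if __name__ == "__main__":
    timers = parse_timers(SAMPLE)
    for name in ("sub", "gather", "reduce", "bcast"):
        ratio = growth(timers, 1, 10, 0, name)
        print(f"{name:>6}: step 10 / step 1 = {ratio:.2f}")
```

For this sample, the ratios show that the sub timer grows far less than the gather and reduce timers, so most of the extra time per step appears in the two communication timers rather than in the local computation.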