LAM/MPI General User's Mailing List Archives


From: Jonathan Herriott (jherriott_at_[hidden])
Date: 2005-02-14 13:17:21


It's a one-node system with two processors (an Xserve), so everything is
running on the same system.

--
Jonathan Herriott
Architecture and Performance Group
Apple Computer, Inc.
On Feb 14, 2005, at 6:24 AM, Jeff Squyres wrote:
> Just to double check -- the MPI_COMM_WORLD rank 0 is running on the 
> same node as mpirun, right?
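>
> As a quick sanity check, something along these lines would show where
> each rank actually lands -- a minimal sketch, not part of LAM itself,
> using only standard MPI calls plus getpid():
>
> -----
> #include <mpi.h>
> #include <stdio.h>
> #include <unistd.h>
>
> int main(int argc, char **argv)
> {
>   int rank, len;
>   char host[MPI_MAX_PROCESSOR_NAME];
>
>   MPI_Init(&argc, &argv);
>   MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>   MPI_Get_processor_name(host, &len);
>
>   /* Report which host and PID each rank is running on */
>   printf("rank %d: host %s, pid %d\n", rank, host, (int) getpid());
>
>   MPI_Finalize();
>   return 0;
> }
> -----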
>
> On Feb 4, 2005, at 4:25 PM, Jonathan Herriott wrote:
>
>> Well, the interesting thing is that it is stopping on my first read
>> from the standard input stream.  A file full of lines of text is
>> being redirected through standard input, and yet the program is
>> getting stuck.
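>>
>> Roughly, the read pattern looks like this (a minimal sketch, not my
>> actual code; the buffer size is arbitrary):
>>
>> -----
>> #include <mpi.h>
>> #include <stdio.h>
>>
>> int main(int argc, char **argv)
>> {
>>   int rank;
>>   char line[256] = "";
>>
>>   MPI_Init(&argc, &argv);
>>   MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>>
>>   if (rank == 0) {
>>     /* mpirun normally forwards redirected stdin only to rank 0,
>>        so only rank 0 reads; this first fgets() is where it hangs */
>>     fgets(line, sizeof(line), stdin);
>>   }
>>
>>   /* Hand the line to the other ranks */
>>   MPI_Bcast(line, sizeof(line), MPI_CHAR, 0, MPI_COMM_WORLD);
>>
>>   MPI_Finalize();
>>   return 0;
>> }
>> -----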
>>
>> --
>> Jonathan Herriott
>> Architecture and Performance Group
>> Apple Computer, Inc.
>>
>> On Feb 3, 2005, at 2:14 PM, Jeff Squyres wrote:
>>
>>> Sorry -- I should have been more clear.
>>>
>>> mpirun does not execute this function -- only the MPI processes 
>>> execute this function.  rpwait() is LAM's internal function for 
>>> "remote process wait" -- it's waiting for the MPI processes to 
>>> complete.
>>>
>>> Can you attach gdb to the running MPI processes and see where they 
>>> are stuck?
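>>>
>>> If attaching by PID is awkward, one common trick is to have each
>>> process print its PID and spin until the debugger arrives -- a
>>> generic sketch, not LAM-specific, and the flag name is made up:
>>>
>>> -----
>>> #include <mpi.h>
>>> #include <stdio.h>
>>> #include <unistd.h>
>>>
>>> /* In gdb, after attaching: "set var holding = 0", then "continue" */
>>> volatile int holding = 1;
>>>
>>> int main(int argc, char **argv)
>>> {
>>>   int rank;
>>>
>>>   MPI_Init(&argc, &argv);
>>>   MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>>>
>>>   printf("rank %d waiting for debugger, pid %d\n",
>>>          rank, (int) getpid());
>>>   fflush(stdout);
>>>   while (holding)
>>>     sleep(1);
>>>
>>>   /* ... rest of the application ... */
>>>
>>>   MPI_Finalize();
>>>   return 0;
>>> }
>>> -----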
>>>
>>>
>>> On Feb 3, 2005, at 5:11 PM, Jonathan Herriott wrote:
>>>
>>>> When I use gdb, it seems to stop on line 823 of mpirun.c.  The
>>>> line reads "if (rpwait(&nodeid, &pid, &status))"
>>>>
>>>> --
>>>> Jonathan Herriott
>>>> Architecture and Performance Group
>>>> Apple Computer, Inc.
>>>>
>>>> On Feb 3, 2005, at 7:18 AM, Jeff Squyres wrote:
>>>>
>>>>> Something sounds quite wrong here -- the lam_tv_load_type_defs() 
>>>>> function is a dummy function that is essentially a no-op, and is 
>>>>> only included so that the linker pulls in relevant symbols.  
>>>>> Indeed, here's the code for that function:
>>>>>
>>>>> -----
>>>>> void *
>>>>> lam_tv_load_type_defs(void)
>>>>> {
>>>>>   static void *dummy[11];
>>>>>
>>>>>   /* Referencing the above variables needed for loading type
>>>>>      definitions in TotalView so that compiler does not optimize 
>>>>> them
>>>>>      out. */
>>>>>
>>>>>   dummy[0] = &dummy_req;
>>>>>   dummy[1] = &dummy_comm;
>>>>>   dummy[2] = &dummy_group;
>>>>>   dummy[3] = &dummy_proc;
>>>>>   dummy[4] = &dummy_gps;
>>>>>   dummy[5] = &dummy_ah_desc;
>>>>>   dummy[6] = &dummy_al_desc;
>>>>>   dummy[7] = &dummy_al_head;
>>>>>   dummy[8] = &dummy_msg;
>>>>>   dummy[9] = &dummy_cid;
>>>>>   dummy[10] = &dummy_envl;
>>>>>
>>>>>   return dummy;
>>>>> }
>>>>> -----
>>>>>
>>>>> All the "dummy" variables are instantiated earlier in the file.
>>>>>
>>>>> So if a thread is blocking in this function, there is something 
>>>>> wrong with the installation.  Can you attach a debugger to see 
>>>>> where exactly it is blocking?
>>>>>
>>>>>
>>>>> On Feb 2, 2005, at 3:42 PM, Jonathan Herriott wrote:
>>>>>
>>>>>> Well, you were right about it being a spinlock issue (95% of the
>>>>>> profile) when running two threads.  The time is being spent in
>>>>>> the function lam_tv_load_type_defs.  I'll include the Shark
>>>>>> profile.  I also tried leaving the program running overnight on
>>>>>> two threads; it should finish in around 430 seconds, but after 17
>>>>>> hours it was still running.  Both processors are being used, but
>>>>>> only one thread is active, and it is being passed back and forth
>>>>>> between the two processors.  The other thread starts up and then
>>>>>> doesn't do anything, so there was no point in trying the run with
>>>>>> one thread, since that thread stays inactive anyway.  On another
>>>>>> note, which version of LAM/MPI uses the mpirun_ssh command, if
>>>>>> any does at all?
>>>>>>
>>>>>> <LAM_Thr2.mshark>
>>>>>>
>>>>>> --
>>>>>> Jonathan Herriott
>>>>>> Architecture and Performance Group
>>>>>> Apple Computer, Inc.
>>>>>> 408-974-5931
>>>>>
>>>>> -- 
>>>>> {+} Jeff Squyres
>>>>> {+} jsquyres_at_[hidden]
>>>>> {+} http://www.lam-mpi.org/
>>>>>
>>>>
>>>
>>> -- 
>>> {+} Jeff Squyres
>>> {+} jsquyres_at_[hidden]
>>> {+} http://www.lam-mpi.org/
>>>
>>
>
> -- 
> {+} Jeff Squyres
> {+} jsquyres_at_[hidden]
> {+} http://www.lam-mpi.org/
>