On Jul 6, 2005, at 3:58 PM, Ross Heikes wrote:
>
> On Jul 6, 2005, at 7:40 AM, Brian Barrett wrote:
>> On Jul 5, 2005, at 12:38 PM, Ross Heikes wrote:
>>
>>> We have an Apple Xserve with 40 nodes. We have made two internal
>>> networks on this cluster.
>>>
>>> Suppose there are two jobs on node 5 in network "A" which communicate
>>> with two jobs on, say, node 8 in network "B".
>>>
>>> My question is this:
>>> Is there a LAM/MPI command that will allow each job on node 5 to
>>> communicate SIMULTANEOUSLY with the corresponding job on node 8?
>>
>> I'm not sure I understand your question. Do you want the processes in
>> one job to talk to the processes in the other job? Or for the
>> processes to be able to communicate within the same job, but at the
>> same time? If the first, you want to look at the MPI-2 connect/accept
>> dynamic process management. If the second, everything should "just
>> work", so you shouldn't have any problems.
>>
>> Of course, if I misunderstood your question completely, please include
>> some more detail.
>
> Each node has 2 NICs. The master node has 3 (one to connect to the
> internet).
> The problem is that no job should wait for another job because they
> are using the same network. They can use different networks.
> For example, if node 5 has a job executing using one NIC (network),
> then -- AT THE SAME TIME -- the other job should communicate
> with node 8 using a different NIC.
>
> The following transcript says that LAM/MPI does not support the rpi tcp
> module for this solution.
>
> Should we consider using Open MPI then?

I believe that LAM should support this, with the appropriate
trickery. LAM will use the hostnames provided in the hostfile passed
to lamboot for all its communication. So let's say each machine had
two names (machine1-1 and machine1-2, for example), one for each
NIC. If you lamboot with the -1 names, it will use those NICs for
communication. If you use the -2 names, it will use the other set of
NICs for communication.
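
As a rough sketch of that naming trick (not anything LAM ships), a few
lines of Python could generate one hostfile per set of NICs. The
machineN-1 / machineN-2 pattern is just the example naming from above;
boot_schema and the node numbers are hypothetical and only there for
illustration. Each generated file would then be passed to lamboot as
its hostfile.

```python
def boot_schema(nodes, nic):
    """Return hostfile text with one per-NIC hostname per line.

    Assumes every machine has a name of the form machine<N>-<nic>,
    which is only the example naming scheme, not a LAM convention.
    """
    return "".join(f"machine{n}-{nic}\n" for n in nodes)


if __name__ == "__main__":
    nodes = [5, 8]  # hypothetical node numbers
    for nic in (1, 2):
        # One hostfile per network; pass the one you want to lamboot.
        with open(f"hostfile-net{nic}", "w") as fh:
            fh.write(boot_schema(nodes, nic))
```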

Now, this doesn't necessarily solve your problem because you still
have to figure out what names to pass to lamboot. This isn't so much
a LAM issue as a resource scheduling issue. High-end machines (like
those using the Quadrics interconnect) solve this with schedulers
that are aware of NIC resources and schedule jobs accordingly.
There's also the big-whiteboard-in-the-machine-room approach, but
that can get ugly if you have more than a few well-behaved users.
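
To make the scheduling point concrete, here is a toy stand-in for the
whiteboard, written as a sketch rather than a real scheduler. NicBoard
and the job names are invented for illustration; all it does is hand
each job a free set of NICs (the -1 or -2 names) and refuse when none
is left.

```python
class NicBoard:
    """Toy bookkeeping for which set of NICs each job is using."""

    def __init__(self, nics=(1, 2)):
        self.free = list(nics)   # NIC sets nobody is using
        self.in_use = {}         # job name -> NIC set

    def acquire(self, job):
        """Give the job a free NIC set, or None if all are taken."""
        if not self.free:
            return None
        nic = self.free.pop(0)
        self.in_use[job] = nic
        return nic

    def release(self, job):
        """Return the job's NIC set to the free list."""
        self.free.append(self.in_use.pop(job))


if __name__ == "__main__":
    board = NicBoard()
    for job in ("jobA", "jobB", "jobC"):  # hypothetical job names
        print(job, "->", board.acquire(job))
```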

Open MPI won't help with this. All we offer in Open MPI over what
LAM does is that in Open MPI, both processes will try to use both
NICs to move data. I suppose in some ways, that's better, but it
doesn't answer the question you asked.

Hope this helps,
Brian
--
Brian Barrett
LAM/MPI developer and all around nice guy
Have a LAM/MPI day: http://www.lam-mpi.org/