On Mon, 3 Jan 2005, Simon wrote:
> Hello Anthony,
>
> Monday, January 3, 2005, 2:00:25 AM, you wrote:
>
> AJC> Hello All,
>
> AJC> I have noticed that the elapsed time is not equal to the sum of the user
> AJC> and system times when using the tcp modules. For example, one might
> AJC> record;
> AJC> user time: 550s
> AJC> sys time : 50s
> AJC> real time: 750s
>
> AJC> I first suspected that the missing time was the time in which the
> AJC> communication actually happened; however, the opposite of that seems to be
> AJC> happening.
>
> AJC> In a job which transfers a total of:
> AJC>          4 gigabytes   1 gigabyte
> AJC> user         580           650
> AJC> sys           60            30
> AJC> real         760           850
> AJC> -----------------------------------
> AJC> diff         120           170
>
>
> AJC> The times are close, even though one program transferred one-fourth of the
> AJC> data of the other. There may have been a difference in the size of the
> AJC> packets, but they should have all been large.
>
>
> AJC> Does anyone know a fairly good network profiler for 2.6.x kernels to look
> AJC> into what's happening? Or does anyone know about this missing time right
> AJC> off?
>
>
>
> AJC> And whatever happened to M-VIA and VIA? They would help reduce processor
> AJC> load some. Is it just that TCP can pump stuff out at near wire-speed as
> AJC> it is, so there is no need for VIA? Has anyone ever thought of using
> AJC> IPX/SPX networks? For clusters, they might be more efficient than TCP/IP.
>
>
> AJC> ------------------------------------------------------------
> AJC> Anthony Ciani (aciani1_at_[hidden])
> AJC> Computational Condensed Matter Physics
> AJC> Department of Physics, University of Illinois, Chicago
> AJC> http://ciani.phy.uic.edu/~tony
> AJC> ------------------------------------------------------------
> AJC> _______________________________________________
> AJC> This list is archived at http://www.lam-mpi.org/MailArchives/lam/
>
> The answer to your question about times is simple: the "real" time isn't
> just the sum of "user" time and "system" time. Real time is the time that
> has passed since your process started, so it is the sum of the system and
> user times of *all* processes on your system, plus time for OS-specific
> functions like memory paging etc., counted from when your LAM
> process (the process that calls times()) started.
Perhaps I should apologize for not making this clear (and this probably
should have gone on the development list), but this is happening on
dedicated systems with very minimal usage outside of the desired program.
Also, this large difference between user+system and real ONLY happens with
the tcp based comms. It does not happen with shared memory or Myrinet.
Also, from what I recall, any time the kernel spends on your program is
counted as system time, including page swapping, disk I/O, and handling
network packets. In this case, the only source for the discrepancy must
be time spent waiting (an idle CPU). My original thought was that this is
time spent waiting for the transfer of the packets over the network,
except that there is much more waiting than could be accounted for by
that. Plus, the less data sent, the longer the time spent waiting!
Therefore, the question was "waiting for what?"
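
To make the accounting concrete, here is a minimal sketch (not taken from
LAM; the function names measure, blocked_wait and unaccounted are made up
for illustration). It assumes that a process blocked in a call such as
recv() behaves like one that is sleeping: the wall clock advances while
neither the user nor the system counter does. time.sleep() stands in for
the blocking network call.

```python
import os
import time


def unaccounted(real, user, system):
    """Wall-clock time not charged to the process as user or system time."""
    return real - (user + system)


def measure(work):
    """Run work() and return (user, system, real, idle) in seconds."""
    cpu_before = os.times()
    wall_before = time.perf_counter()
    work()
    wall_after = time.perf_counter()
    cpu_after = os.times()

    user = cpu_after.user - cpu_before.user
    system = cpu_after.system - cpu_before.system
    real = wall_after - wall_before
    return user, system, real, unaccounted(real, user, system)


def blocked_wait():
    # Hypothetical stand-in for a blocking recv(): the process is off the
    # CPU, so neither user nor system time accumulates while it waits.
    time.sleep(0.5)


if __name__ == "__main__":
    user, system, real, idle = measure(blocked_wait)
    print(f"user {user:.2f}s  sys {system:.2f}s  real {real:.2f}s  idle {idle:.2f}s")

    # The same subtraction applied to the two runs quoted above.
    print(unaccounted(760, 580, 60))   # 4 gigabyte run: 120
    print(unaccounted(850, 650, 30))   # 1 gigabyte run: 170
```

Nearly all of the half second shows up as idle time, which is the same kind
of gap as the 120s and 170s in the table. It says nothing about what the
tcp module is actually waiting for; it only shows that such waiting would
not appear in either the user or the system column.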
------------------------------------------------------------
Anthony Ciani (aciani1_at_[hidden])
Computational Condensed Matter Physics
Department of Physics, University of Illinois, Chicago
http://ciani.phy.uic.edu/~tony
------------------------------------------------------------