LAM/MPI General User's Mailing List Archives

From: USFResearch_at_[hidden]
Date: 2003-12-17 00:59:48


This program works fine on a Sunfire 880. It crashes *sometimes*, though,
on a cluster of P4s. I do not have root access (the admin installed MPI per my
request), and I'm not privy to all the details of installation. I do know
that it is running OpenMosix and Redhat 8 with kernel 2.4.20-openmosix2. Output
from lamboot says "LAM 7.0/MPI 2 C++/ROMIO - Indiana University" so I assume
that is the version being run.

Here is a backtrace from gdb:

#0 0x4011a12c in __pthread_alt_lock () from /lib/i686/libpthread.so.0
#1 0x40116d77 in pthread_mutex_lock () from /lib/i686/libpthread.so.0
#2 0x42075a7a in free () from /lib/i686/libc.so.6
#3 0x400bdde3 in operator delete(void*) (ptr=0x0) at ../../../../libstdc++-v3/libsupc++/del_op.cc:39
#4 0x400bde3f in operator delete[](void*) (ptr=0x0) at ../../../../libstdc++-v3/libsupc++/del_opv.cc:36
#5 0x08051e6c in ComputeBaselineInfo (Dataset=0x855308c) at Gain.cpp:125
#6 0x08053020 in BestC45SplitForAttribute (Dataset=0x855308c, AttNum=17, returned_low=0x0, returned_high=0x0, Options=0x80ce868) at Gain.cpp:504
#7 0x080574bc in ObtainBestSplit (Dataset=0x855308c, returned_low=0xbffff474, returned_high=0xbffff478, Options=0x80ce868) at Split.cpp:145
#8 0x0804ce2f in BuildTree (Dataset=0x855308c, Options=0x80ce868) at BuildTree.cpp:131
#9 0x0804ca51 in BuildTree (Dataset=0x854c394, Options=0x80ce868) at BuildTree.cpp:235
...

Here is the ComputeBaselineInfo function (frame 5):

float ComputeBaselineInfo(_TrainingSet* Dataset)
{
  int x;
  float BaseLineInfo;
  int* ClassCounter = new int[Dataset->NumClasses];
  for(x=0;x<Dataset->NumClasses;x++)
    ClassCounter[x]=0;
  BaseLineInfo=0;
  for(x=0;x<Dataset->NumExamples;x++)
    ClassCounter[Dataset->ClassLabel[x]]++;
  for(x=0;x<Dataset->NumClasses;x++)
    BaseLineInfo += (((float)-ClassCounter[x]/(float)Dataset->NumExamples) *
Log((float)ClassCounter[x]/(float)Dataset->NumExamples));
  delete [] ClassCounter;
  return BaseLineInfo;
}

The value of Dataset->NumClasses is 19, so it clearly allocated some memory,
used it, and now wants to get rid of it. Then bad things happen. I know that
it did not go out of bounds or anything like that. Can anyone explain the
reason for the failure?
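
A crash inside free() is often a symptom of heap corruption that happened
earlier, so the out-of-bounds assumption may be worth testing directly. Below
is a minimal sketch of a range-checked variant of the function. It rests on
several assumptions: TrainingSetSketch is a hypothetical stand-in for
_TrainingSet (the real struct is not shown here), std::log stands in for the
project's Log (whose base and zero handling are unknown), and empty classes
are skipped, which the original code does not do.

```cpp
#include <cassert>
#include <cmath>
#include <stdexcept>
#include <vector>

// Hypothetical stand-in for _TrainingSet: only the fields used in the post.
struct TrainingSetSketch {
    int NumClasses;
    int NumExamples;
    const int* ClassLabel;
};

// Same computation as ComputeBaselineInfo, but every label is range-checked
// before it is used as an index into the counter array.
float ComputeBaselineInfoChecked(const TrainingSetSketch* Dataset)
{
    assert(Dataset != nullptr);

    // std::vector releases its storage automatically, so no delete[] is needed.
    std::vector<int> ClassCounter(Dataset->NumClasses, 0);

    for (int x = 0; x < Dataset->NumExamples; x++) {
        int label = Dataset->ClassLabel[x];
        if (label < 0 || label >= Dataset->NumClasses)
            throw std::out_of_range("class label outside [0, NumClasses)");
        ClassCounter[label]++;
    }

    float BaseLineInfo = 0;
    for (int x = 0; x < Dataset->NumClasses; x++) {
        if (ClassCounter[x] == 0)
            continue;  // assumption: skip empty classes to avoid log(0)
        float p = (float)ClassCounter[x] / (float)Dataset->NumExamples;
        BaseLineInfo += -p * std::log(p);  // assumption: natural log for Log
    }
    return BaseLineInfo;
}
```

If this version throws on the cluster data, a label is outside the range
[0, NumClasses) and the original loop writes past the end of ClassCounter. If
it never throws, the corruption more likely originates somewhere else in the
program.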

Thanks a lot,
Robert