This program works fine on a Sunfire 880. It crashes *sometimes*, though,
on a cluster of P4s. I do not have root access (the admin installed MPI per my
request), and I'm not privy to all the details of installation. I do know
that it is running OpenMosix and Red Hat 8 with kernel 2.4.20-openmosix2. Output
from lamboot says "LAM 7.0/MPI 2 C++/ROMIO - Indiana University" so I assume
that is the version being run.
Here is a backtrace from gdb:
#0 0x4011a12c in __pthread_alt_lock () from /lib/i686/libpthread.so.0
#1 0x40116d77 in pthread_mutex_lock () from /lib/i686/libpthread.so.0
#2 0x42075a7a in free () from /lib/i686/libc.so.6
#3 0x400bdde3 in operator delete(void*) (ptr=0x0) at
../../../../libstdc++-v3/libsupc++/del_op.cc:39
#4 0x400bde3f in operator delete[](void*) (ptr=0x0) at
../../../../libstdc++-v3/libsupc++/del_opv.cc:36
#5 0x08051e6c in ComputeBaselineInfo (Dataset=0x855308c) at Gain.cpp:125
#6 0x08053020 in BestC45SplitForAttribute (Dataset=0x855308c, AttNum=17,
returned_low=0x0, returned_high=0x0, Options=0x80ce868) at Gain.cpp:504
#7 0x080574bc in ObtainBestSplit (Dataset=0x855308c,
returned_low=0xbffff474, returned_high=0xbffff478, Options=0x80ce868) at Split.cpp:145
#8 0x0804ce2f in BuildTree (Dataset=0x855308c, Options=0x80ce868) at
BuildTree.cpp:131
#9 0x0804ca51 in BuildTree (Dataset=0x854c394, Options=0x80ce868) at
BuildTree.cpp:235
...
Here is the ComputeBaselineInfo function (frame 5):
float ComputeBaselineInfo(_TrainingSet* Dataset)
{
    int x;
    float BaseLineInfo;
    int* ClassCounter = new int[Dataset->NumClasses];

    for (x = 0; x < Dataset->NumClasses; x++)
        ClassCounter[x] = 0;
    BaseLineInfo = 0;
    for (x = 0; x < Dataset->NumExamples; x++)
        ClassCounter[Dataset->ClassLabel[x]]++;
    for (x = 0; x < Dataset->NumClasses; x++)
        BaseLineInfo += (((float)-ClassCounter[x] / (float)Dataset->NumExamples) *
                         Log((float)ClassCounter[x] / (float)Dataset->NumExamples));
    delete [] ClassCounter;
    return BaseLineInfo;
}
The value of Dataset->NumClasses is 19, so it clearly allocated some memory,
used it, and now wants to get rid of it. Then bad things happen. I know that
it did not go out of bounds or anything like that. Can anyone explain the
reason for the failure?
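
One way to double-check the out-of-bounds claim would be to scan the labels
just before calling ComputeBaselineInfo. The following is only a sketch:
TrainingSetSketch and FirstOutOfRangeLabel are hypothetical names, not part of
the real program, and the member types are assumed from how
ComputeBaselineInfo uses them.

```cpp
// Hypothetical stand-in for _TrainingSet: only the three members that
// ComputeBaselineInfo touches, with assumed types.
struct TrainingSetSketch {
    int NumClasses;
    int NumExamples;
    const int* ClassLabel;  // assumed: one class label per example
};

// Returns the index of the first example whose label falls outside
// [0, NumClasses), or -1 if every label is in range. A label outside
// that range would make ClassCounter[label]++ write past the array.
int FirstOutOfRangeLabel(const TrainingSetSketch& data)
{
    for (int x = 0; x < data.NumExamples; x++) {
        int label = data.ClassLabel[x];
        if (label < 0 || label >= data.NumClasses)
            return x;
    }
    return -1;
}
```

If this returns -1 for every dataset that reaches frame 5, the counting loop
itself is not writing outside ClassCounter, and the heap damage would have to
come from somewhere earlier in the run.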
Thanks a lot,
Robert