hi developers,
LAM/MPI does checkpointing by using BLCR, which saves the whole process image to a context file. if the process is very large, the context file will be very large too,
and writing it takes a long time.
according to my tests, the context file can grow as large as the memory the process uses.
for example,
int arr[700][700][700];
this array needs more than 1 GB of memory (700 x 700 x 700 ints at 4 bytes each). if the physical memory is less than that, cr_restart fails.
MPI is very popular in the high-performance computing field, and many HPC problems need to load huge data arrays into memory, which is a disaster for checkpointing. that's the point! so in these cases, CR seems useless...
my question is: are there any ideas for improving this? or are any new features for this issue planned in the design of the next version?
happy May 1st!
Lenjoy