Write error crash #73
Comments
Happened again. Will have to implement something from #74 to keep going.
It would be good to know the exact time so we can search the logs. One trick might be to write these files to /var/tmp (SSD) or possibly /tmp (RAM). We could add a parameter to pkio.atomic_write that would do this operation in steps: first write to a tmp_dir, then rename to a local random name, then rename to the target. This ensures integrity while also avoiding write issues. Do you have any idea of the number of files being written per second across all nodes?
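A rough sketch of what that staged write could look like (a hypothetical helper, not the actual pkio.atomic_write signature; the function name and the tmp_dir default are placeholders):

```python
import os
import shutil
import tempfile


def staged_atomic_write(target, data, tmp_dir="/var/tmp"):
    # Hypothetical sketch of the staged write described above; not the
    # real pkio.atomic_write API.
    # Step 1: write the contents to fast local storage (SSD or RAM) first.
    fd, stage_path = tempfile.mkstemp(dir=tmp_dir)
    with os.fdopen(fd, "w") as f:
        f.write(data)
        f.flush()
        os.fsync(f.fileno())
    # Step 2: move the staged file into the target's directory under a
    # temporary name. shutil.move copies across filesystems, so the data
    # lands on the same (possibly NFS) filesystem as the target.
    local_tmp = os.path.join(
        os.path.dirname(target) or ".",
        ".{}.{}.tmp".format(os.path.basename(target), os.getpid()),
    )
    shutil.move(stage_path, local_tmp)
    # Step 3: rename onto the final name. A rename within one filesystem
    # is atomic, so readers never see a partially written file.
    os.rename(local_tmp, target)
```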
This was the file write that caused the crash. The file is there, but it looks corrupted when I try to open it. You can find it at

There are a lot of files being written across all the nodes, mostly due to Genesis, which insists on writing particle and field data. I tried turning that off previously, but the option in the Genesis1.3v2 manual doesn't seem to do anything. elegant writes maybe a dozen files. The workers write their input files directly, so they are only writing a couple of files each, but they aren't (currently) robust to errors the way the simulations are, since those are run through Executors.
Did you try moving the genesis writes to /var/tmp or /tmp? This might make things go faster, and it might avoid some contention in NFS, which could fix the issue.
I implemented a version of libEnsemble and rsopt that allows an Executor to run outside the normal worker directory, in a directory on /tmp.
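For reference, recent stock libEnsemble releases expose libE_specs options aimed at the same thing; a minimal sketch, assuming the ensemble_dir_path / use_worker_dirs options (names may differ from the modified version described above):

```python
# Sketch only: option names follow recent libEnsemble releases and may not
# match the modified libEnsemble/rsopt version described above.
libE_specs = {
    "use_worker_dirs": True,                # one directory per worker
    "ensemble_dir_path": "/tmp/ensemble",   # put worker/sim dirs on local /tmp
    "sim_dirs_make": True,                  # separate directory per simulation
}
```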
Is this due to trying to run rsmpi outside of /home?
I'm guessing that the directory has to be shared across all machines; see pmodels/mpich#1872. I guess genesis must be using an MPI call that requires the files to be shared.
Closing this. An option for better handling this sort of problem should eventually be available once #74 is completed.
Not directly an rsopt issue, but a large run went down last night due to a file write error on one of the workers that killed libEnsemble.
Handling worker errors that should not be fatal for the whole run has been a pain point for a while now. Will have to investigate options for re-starting workers, or just letting them die without taking down the manager.
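One low-tech option, independent of any libEnsemble changes, is to make the worker-side writes themselves non-fatal; a minimal sketch (generic Python, not a libEnsemble or rsopt API):

```python
import logging
import time


def try_write(path, data, retries=3, delay=1.0):
    # Generic sketch: retry a worker's file write and report failure
    # instead of letting the exception propagate and kill the whole run.
    for attempt in range(1, retries + 1):
        try:
            with open(path, "w") as f:
                f.write(data)
            return True
        except OSError as exc:
            logging.warning(
                "write of %s failed (attempt %d/%d): %s",
                path, attempt, retries, exc,
            )
            time.sleep(delay)
    # The caller decides whether a failed write aborts this worker's task
    # or just marks the point as failed and lets the run continue.
    return False
```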