
Write error crash #73

Closed
cchall opened this issue Sep 3, 2021 · 7 comments
cchall commented Sep 3, 2021

Not directly an rsopt issue, but a large run went down last night because a file write error on one of the workers killed libEnsemble.

Handling worker errors that should not be fatal for the whole run has been a pain point for a while now. Will have to investigate options for restarting workers, or just letting them die without taking down the manager.

[0] libensemble.manager (ERROR): ---- Received error message from worker 6 ----
[0] libensemble.manager (ERROR): Message: OSError: [Errno 5] Input/output error: './genesis.in'
[0] libensemble.manager (ERROR): Traceback (most recent call last):
  File "/home/vagrant/.pyenv/versions/3.7.2/envs/py3/lib/python3.7/site-packages/libensemble/worker.py", line 334, in run
    response = self._handle(Work)
  File "/home/vagrant/.pyenv/versions/3.7.2/envs/py3/lib/python3.7/site-packages/libensemble/worker.py", line 294, in _handle
    calc_out, persis_info, calc_status = self._handle_calc(Work, calc_in)
  File "/home/vagrant/.pyenv/versions/3.7.2/envs/py3/lib/python3.7/site-packages/libensemble/worker.py", line 217, in _handle_calc
    out = calc(calc_in, Work['persis_info'], Work['libE_info'])
  File "/home/vagrant/.pyenv/versions/3.7.2/envs/py3/lib/python3.7/site-packages/libensemble/worker.py", line 147, in run_sim
    return sim_f(calc_in, persis_info, sim_specs, libE_info)
  File "/home/vagrant/.pyenv/versions/3.7.2/envs/py3/lib/python3.7/site-packages/rsopt/simulation.py", line 112, in __call__
    job._setup.generate_input_file(kwargs, '.')  # TODO: Worker needs to be in their own directory
  File "/home/vagrant/.pyenv/versions/3.7.2/envs/py3/lib/python3.7/site-packages/rsopt/configuration/setup.py", line 390, in generate_input_file
    model.write_input_file()
  File "/home/vagrant/.pyenv/versions/3.7.2/envs/py3/lib/python3.7/site-packages/genesis/genesis.py", line 279, in write_input_file
    with open(self.input_file, 'w') as f:
OSError: [Errno 5] Input/output error: './genesis.in'
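
A possible direction (just a sketch, not something rsopt or libEnsemble gives us today) would be to catch I/O errors inside the sim function and hand back a failed calc status, so the worker reports the point as failed instead of the exception propagating and killing the run. This assumes libEnsemble's standard sim_f signature and status codes; run_point is a hypothetical stand-in for the input-file write and executor launch:

```python
import numpy as np
from libensemble.message_numbers import WORKER_DONE, TASK_FAILED

def sim_f(H, persis_info, sim_specs, libE_info):
    """Sim function that survives transient file-write failures."""
    out = np.zeros(1, dtype=sim_specs['out'])
    try:
        # run_point is a placeholder for the real work:
        # generate the input file and launch the simulation.
        out['f'] = run_point(H, sim_specs)
        calc_status = WORKER_DONE
    except OSError as err:
        # Report failure to the manager instead of raising, so the
        # worker stays alive and the ensemble keeps running.
        print(f"Point failed with {err}; returning TASK_FAILED")
        out['f'] = np.nan
        calc_status = TASK_FAILED
    return out, persis_info, calc_status
```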

cchall commented Sep 4, 2021

Happened again. Will have to implement something from #74 to keep going.

[0] libensemble.manager (ERROR): ---- Received error message from worker 2 ----
[0] libensemble.manager (ERROR): Message: OSError: [Errno 5] Input/output error: './run_parallel_python.py'
[0] libensemble.manager (ERROR): Traceback (most recent call last):
  File "/home/vagrant/.pyenv/versions/3.7.2/envs/py3/lib/python3.7/site-packages/libensemble/worker.py", line 334, in run
    response = self._handle(Work)
  File "/home/vagrant/.pyenv/versions/3.7.2/envs/py3/lib/python3.7/site-packages/libensemble/worker.py", line 294, in _handle
    calc_out, persis_info, calc_status = self._handle_calc(Work, calc_in)
  File "/home/vagrant/.pyenv/versions/3.7.2/envs/py3/lib/python3.7/site-packages/libensemble/worker.py", line 217, in _handle_calc
    out = calc(calc_in, Work['persis_info'], Work['libE_info'])
  File "/home/vagrant/.pyenv/versions/3.7.2/envs/py3/lib/python3.7/site-packages/libensemble/worker.py", line 147, in run_sim
    return sim_f(calc_in, persis_info, sim_specs, libE_info)
  File "/home/vagrant/.pyenv/versions/3.7.2/envs/py3/lib/python3.7/site-packages/rsopt/simulation.py", line 112, in __call__
    job._setup.generate_input_file(kwargs, '.')  # TODO: Worker needs to be in their own directory
  File "/home/vagrant/.pyenv/versions/3.7.2/envs/py3/lib/python3.7/site-packages/rsopt/configuration/setup.py", line 250, in generate_input_file
    with open(file_path, 'w') as ff:
OSError: [Errno 5] Input/output error: './run_parallel_python.py'


robnagler commented Sep 4, 2021

It would be good to know the exact time so we can search the logs.

One trick might be to write these files to /var/tmp (SSD) or possibly /tmp (RAM). We could add a parameter to pkio.atomic_write that would do this operation in steps: first write to a tmp_dir, then rename to a local random name, then rename to the target. This ensures integrity while also avoiding write issues.

Do you have any idea of the number of files being written per second across all nodes?
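
For reference, a minimal sketch of that staged write (illustrative names only, not pkio's actual API):

```python
import os
import shutil
import tempfile
import uuid

def staged_atomic_write(target_path, text, tmp_dir='/var/tmp'):
    """Write to local tmp_dir, move next to the target under a random
    name, then rename into place, so a half-written file never appears
    at target_path."""
    # 1. Write the full content to fast local storage first.
    fd, tmp_path = tempfile.mkstemp(dir=tmp_dir)
    with os.fdopen(fd, 'w') as f:
        f.write(text)
        f.flush()
        os.fsync(f.fileno())
    # 2. Move it into the target's directory under a random name
    #    (this copy may cross filesystems, e.g. local disk -> NFS).
    staging_path = f"{target_path}.{uuid.uuid4().hex}"
    shutil.move(tmp_path, staging_path)
    # 3. Atomic rename within the target filesystem.
    os.replace(staging_path, target_path)
```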


cchall commented Sep 4, 2021

This is the file write that caused the crash. The file is there, but it looks corrupted if I try to open it.

run_parallel_python.py
  Size: 119887          Blocks: 240        IO Block: 1048576 regular file
Device: 2bh/43d Inode: 4723661098  Links: 1
Access: (0640/-rw-r-----)  Uid: ( 1000/ vagrant)   Gid: ( 1000/ vagrant)
Access: 2021-09-04 20:13:00.267054812 +0000
Modify: 2021-09-04 04:16:33.875033175 +0000
Change: 2021-09-04 04:16:33.875033175 +0000

You can find it at StaffScratch/cchall/fastfelo/ensemble_s2e_aposmm_run2/worker2/sim588/run_parallel_python.py

There are a lot of files being written across all the nodes, mostly due to Genesis, which insists on writing particle and field data. I tried turning that off previously, but the option in the Genesis 1.3 v2 manual doesn't seem to do anything.
When Genesis writes this data in the parallel version, it dumps ~2000 slice files during the simulation and then combines them at the end. This is being done by 20 workers, probably at roughly the same time for all of them. I'm not sure of the timing, but these simulations aren't that long; the Genesis part is probably less than 2 minutes. These files make up about 80-90% of the data written during a job.

elegant writes maybe a dozen files.

The workers directly write the input files, so they are only writing a couple of files each, but they aren't (currently) robust to errors the way the simulations are, since those are run through Executors.
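
As a stopgap (hypothetical, not what rsopt does now), the workers' own input-file writes could be retried a few times before the OSError is allowed to propagate, since these errors look like transient NFS hiccups:

```python
import time

def write_with_retry(path, text, attempts=3, delay=5.0):
    """Retry a small file write a few times before giving up."""
    for attempt in range(1, attempts + 1):
        try:
            with open(path, 'w') as f:
                f.write(text)
            return
        except OSError as err:
            if attempt == attempts:
                raise
            print(f"Write of {path} failed ({err}); retrying in {delay}s")
            time.sleep(delay)
```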

robnagler commented

Did you try moving the Genesis writes to /var/tmp or /tmp? That might make things go faster, and it might avoid some contention on NFS, which may fix the issue.


cchall commented Sep 8, 2021

I implemented a version of libEnsemble and rsopt that allows an Executor to run outside the normal worker directory, in a directory on /tmp.
The executor returns an error:

[[email protected]] control_cb (pm/pmiserv/pmiserv_cb.c:208): assert (!closed) failed
[[email protected]] HYDT_dmxu_poll_wait_for_event (tools/demux/demux_poll.c:77): callback returned error status
[[email protected]] HYD_pmci_wait_for_completion (pm/pmiserv/pmiserv_pmci.c:196): error waiting for event
[[email protected]] main (ui/mpich/mpiexec.c:336): process manager error waiting for completion

Is this due to trying to run rsmpi outside of /home?

robnagler commented

I'm guessing that the directory has to be shared across all machines. See pmodels/mpich#1872

I guess genesis must be using an MPI command that requires the files to be shared.


cchall commented Nov 8, 2021

Closing this. An option for better handling this sort of problem should eventually be available once #74 is completed.

cchall closed this as completed Nov 8, 2021