
Write error crash #73

Closed
cchall opened this issue Sep 3, 2021 · 7 comments
cchall commented Sep 3, 2021

Not directly an rsopt issue, but a large run went down last night because a file write error on one of the workers killed libEnsemble.

Handling worker errors that should not be fatal for the whole run has been a pain point for a while now. Will have to investigate options for restarting workers, or just letting them die without taking down the manager.

[0] libensemble.manager (ERROR): ---- Received error message from worker 6 ----
[0] libensemble.manager (ERROR): Message: OSError: [Errno 5] Input/output error: './genesis.in'
[0] libensemble.manager (ERROR): Traceback (most recent call last):
  File "/home/vagrant/.pyenv/versions/3.7.2/envs/py3/lib/python3.7/site-packages/libensemble/worker.py", line 334, in run
    response = self._handle(Work)
  File "/home/vagrant/.pyenv/versions/3.7.2/envs/py3/lib/python3.7/site-packages/libensemble/worker.py", line 294, in _handle
    calc_out, persis_info, calc_status = self._handle_calc(Work, calc_in)
  File "/home/vagrant/.pyenv/versions/3.7.2/envs/py3/lib/python3.7/site-packages/libensemble/worker.py", line 217, in _handle_calc
    out = calc(calc_in, Work['persis_info'], Work['libE_info'])
  File "/home/vagrant/.pyenv/versions/3.7.2/envs/py3/lib/python3.7/site-packages/libensemble/worker.py", line 147, in run_sim
    return sim_f(calc_in, persis_info, sim_specs, libE_info)
  File "/home/vagrant/.pyenv/versions/3.7.2/envs/py3/lib/python3.7/site-packages/rsopt/simulation.py", line 112, in __call__
    job._setup.generate_input_file(kwargs, '.')  # TODO: Worker needs to be in their own directory
  File "/home/vagrant/.pyenv/versions/3.7.2/envs/py3/lib/python3.7/site-packages/rsopt/configuration/setup.py", line 390, in generate_input_file
    model.write_input_file()
  File "/home/vagrant/.pyenv/versions/3.7.2/envs/py3/lib/python3.7/site-packages/genesis/genesis.py", line 279, in write_input_file
    with open(self.input_file, 'w') as f:
OSError: [Errno 5] Input/output error: './genesis.in'
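
A possible direction (just a sketch, not something rsopt or libEnsemble gives us today) would be to catch I/O errors inside the sim function and hand back a failed calc status, so the worker reports the point as failed instead of the exception propagating and killing the run. This assumes libEnsemble's standard sim_f signature and status codes; run_point is a hypothetical stand-in for the input-file write and executor launch:

```python
import numpy as np
from libensemble.message_numbers import WORKER_DONE, TASK_FAILED

def sim_f(H, persis_info, sim_specs, libE_info):
    """Sim function that survives transient file-write failures."""
    out = np.zeros(1, dtype=sim_specs['out'])
    try:
        # run_point is a placeholder for the real work:
        # generate the input file and launch the simulation.
        out['f'] = run_point(H, sim_specs)
        calc_status = WORKER_DONE
    except OSError as err:
        # Report failure to the manager instead of raising, so the
        # worker stays alive and the ensemble keeps running.
        print(f"Point failed with {err}; returning TASK_FAILED")
        out['f'] = np.nan
        calc_status = TASK_FAILED
    return out, persis_info, calc_status
```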

cchall commented Sep 4, 2021

Happened again. Will have to implement something from #74 to keep going.

[0] libensemble.manager (ERROR): ---- Received error message from worker 2 ----
[0] libensemble.manager (ERROR): Message: OSError: [Errno 5] Input/output error: './run_parallel_python.py'
[0] libensemble.manager (ERROR): Traceback (most recent call last):
  File "/home/vagrant/.pyenv/versions/3.7.2/envs/py3/lib/python3.7/site-packages/libensemble/worker.py", line 334, in run
    response = self._handle(Work)
  File "/home/vagrant/.pyenv/versions/3.7.2/envs/py3/lib/python3.7/site-packages/libensemble/worker.py", line 294, in _handle
    calc_out, persis_info, calc_status = self._handle_calc(Work, calc_in)
  File "/home/vagrant/.pyenv/versions/3.7.2/envs/py3/lib/python3.7/site-packages/libensemble/worker.py", line 217, in _handle_calc
    out = calc(calc_in, Work['persis_info'], Work['libE_info'])
  File "/home/vagrant/.pyenv/versions/3.7.2/envs/py3/lib/python3.7/site-packages/libensemble/worker.py", line 147, in run_sim
    return sim_f(calc_in, persis_info, sim_specs, libE_info)
  File "/home/vagrant/.pyenv/versions/3.7.2/envs/py3/lib/python3.7/site-packages/rsopt/simulation.py", line 112, in __call__
    job._setup.generate_input_file(kwargs, '.')  # TODO: Worker needs to be in their own directory
  File "/home/vagrant/.pyenv/versions/3.7.2/envs/py3/lib/python3.7/site-packages/rsopt/configuration/setup.py", line 250, in generate_input_file
    with open(file_path, 'w') as ff:
OSError: [Errno 5] Input/output error: './run_parallel_python.py'


robnagler commented Sep 4, 2021

It would be good to know the exact time so we can search the logs.

One trick might be to write these files to /var/tmp (SSD) or possibly /tmp (RAM). We could add a parameter to pkio.atomic_write that would do this operation in steps: first write to a tmp_dir, then rename to a local random name, then rename to the target. This ensures integrity while also avoiding write issues.

Do you have any idea of the number of files being written per second across all nodes?
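
For reference, a minimal sketch of that staged write (illustrative names only, not pkio's actual API):

```python
import os
import shutil
import tempfile
import uuid

def staged_atomic_write(target_path, text, tmp_dir='/var/tmp'):
    """Write to local tmp_dir, move next to the target under a random
    name, then rename into place, so a half-written file never appears
    at target_path."""
    # 1. Write the full content to fast local storage first.
    fd, tmp_path = tempfile.mkstemp(dir=tmp_dir)
    with os.fdopen(fd, 'w') as f:
        f.write(text)
        f.flush()
        os.fsync(f.fileno())
    # 2. Move it into the target's directory under a random name
    #    (this copy may cross filesystems, e.g. local disk -> NFS).
    staging_path = f"{target_path}.{uuid.uuid4().hex}"
    shutil.move(tmp_path, staging_path)
    # 3. Atomic rename within the target filesystem.
    os.replace(staging_path, target_path)
```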


cchall commented Sep 4, 2021

This is the file write that caused the crash. The file is there, but it looks corrupted if I try to open it.

run_parallel_python.py
  Size: 119887          Blocks: 240        IO Block: 1048576 regular file
Device: 2bh/43d Inode: 4723661098  Links: 1
Access: (0640/-rw-r-----)  Uid: ( 1000/ vagrant)   Gid: ( 1000/ vagrant)
Access: 2021-09-04 20:13:00.267054812 +0000
Modify: 2021-09-04 04:16:33.875033175 +0000
Change: 2021-09-04 04:16:33.875033175 +0000

You can find it at StaffScratch/cchall/fastfelo/ensemble_s2e_aposmm_run2/worker2/sim588/run_parallel_python.py

There are a lot of files being written across all the nodes, mostly due to Genesis, which insists on writing particle and field data. I tried turning that off previously, but the option in the Genesis 1.3 v2 manual doesn't seem to do anything.
When Genesis writes this data in the parallel version, it dumps ~2000 slice files during the simulation and then combines them at the end. This is being done by 20 workers, probably at roughly the same time for all of them. I'm not sure of the timing, but these simulations aren't that long; the Genesis part is probably less than 2 minutes. These files make up about 80-90% of the data written during a job.

elegant writes maybe a dozen files.

The workers directly write the input files, so they are only writing a couple of files each, but they aren't (currently) robust to errors the way the simulations are, since those are run through Executors.
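
As a stopgap (hypothetical, not what rsopt does now), the workers' own input-file writes could be retried a few times before the OSError is allowed to propagate, since these errors look like transient NFS hiccups:

```python
import time

def write_with_retry(path, text, attempts=3, delay=5.0):
    """Retry a small file write a few times before giving up."""
    for attempt in range(1, attempts + 1):
        try:
            with open(path, 'w') as f:
                f.write(text)
            return
        except OSError as err:
            if attempt == attempts:
                raise
            print(f"Write of {path} failed ({err}); retrying in {delay}s")
            time.sleep(delay)
```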

robnagler commented

Did you try moving the Genesis writes to /var/tmp or /tmp? That might make things go faster, and it might avoid some contention on NFS, which may fix the issue.


cchall commented Sep 8, 2021

I implemented a version of libEnsemble and rsopt that allows an Executor to run outside the normal worker directory, in a directory on /tmp.
The executor returns an error:

[[email protected]] control_cb (pm/pmiserv/pmiserv_cb.c:208): assert (!closed) failed
[[email protected]] HYDT_dmxu_poll_wait_for_event (tools/demux/demux_poll.c:77): callback returned error status
[[email protected]] HYD_pmci_wait_for_completion (pm/pmiserv/pmiserv_pmci.c:196): error waiting for event
[[email protected]] main (ui/mpich/mpiexec.c:336): process manager error waiting for completion

Is this due to trying to run rsmpi outside of /home?

robnagler commented

I'm guessing that the directory has to be shared across all machines. See pmodels/mpich#1872

I guess genesis must be using an MPI command that requires the files to be shared.


cchall commented Nov 8, 2021

Closing this. An option for better handling this sort of problem should eventually be available once #74 is completed.

cchall closed this as completed Nov 8, 2021