rsmpi execution mode failure with many workers #175

Open
ncook882 opened this issue Nov 15, 2024 · 1 comment

@ncook882
Member

Increasing the number of workers (to a number much greater than the number of available hosts) can generate a host unavailable error:

host 3 XXXXXX is unavailable

Your cluster has been reassigned or some nodes are down.
Please contact support for help with this issue.

Reducing the number of workers makes this error significantly less likely. Here's the code used for the config file test_python.yml:

codes:
  - python:
      parameters:
        ind:
          min: 0
          max: 49
          start: 0
          samples: 50
      setup:
        input_file: test_python_script.py
        function: main
        force_executor: True
        execution_type: rsmpi
        cores: 1
options:
  run_dir: ./scan/
  software: mesh_scan
  nworkers: 50
  executor_options:
    hosts: [1, 2, 3, 4, 5]
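
For a sense of scale, here is a rough sketch of how heavily this config oversubscribes the hosts. It assumes workers are spread evenly across the listed hosts, which may not match how rsopt actually places them; the variable names are illustrative only.

```python
import math

# Values copied from test_python.yml
nworkers = 50
hosts = [1, 2, 3, 4, 5]

# Assumption: workers are distributed evenly across the hosts
workers_per_host = math.ceil(nworkers / len(hosts))
print(f"{nworkers} workers over {len(hosts)} hosts: up to {workers_per_host} per host")
```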

And here's the code used for the test_python_script.py:

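# Run a simple Python script
import json
import sys


def main(ind):
    # Create a basic dictionary mapping the integer index to the raw input value
    my_ind = int(ind)
    org = {}
    org[my_ind] = ind

    fn = 'my_ind.json'

    # Write the dictionary to a JSON file in the current working directory
    with open(fn, 'w') as file:
        json.dump(org, file)


if __name__ == "__main__":
    # When run directly, read ind from the first command-line argument
    main(sys.argv[1])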

I ran this with: rsopt sample configuration test_python.yml

@cchall
Member

cchall commented Nov 15, 2024

This looks like it is being caused by a change to libEnsemble job submission: Libensemble/libensemble#1468

Changing wait_on_start from a bool to 8 seconds, the same value used for the rsmpi submission fail_time, allows failed submissions to be caught as expected.
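
To illustrate the idea of watching a submission for a fixed window so that early failures are caught, here is a minimal standalone sketch. It is not the libEnsemble or rsopt implementation: `failed_during_startup`, `fail_time`, and `poll_interval` as used here are hypothetical names for illustration, and a plain subprocess stands in for an rsmpi job.

```python
import subprocess
import sys
import time


def failed_during_startup(cmd, fail_time=8.0, poll_interval=0.05):
    """Launch cmd and watch it for up to fail_time seconds.

    Hypothetical helper. Returns (process, failed), where failed is True
    if the process exited with a nonzero code inside the window.
    """
    proc = subprocess.Popen(cmd)
    deadline = time.monotonic() + fail_time
    while time.monotonic() < deadline:
        code = proc.poll()  # None while the process is still running
        if code is not None:
            return proc, code != 0
        time.sleep(poll_interval)
    # Still running after the window: treat the submission as started
    return proc, False


if __name__ == "__main__":
    # A command that exits immediately with an error, standing in for a failed submission
    bad_cmd = [sys.executable, "-c", "import sys; sys.exit(1)"]
    _, failed = failed_during_startup(bad_cmd)
    print("submission failed during startup:", failed)
```

With a plain boolean "wait until started" check, a job that launches and then dies shortly afterwards can look like a success. Watching for a fixed window trades a short delay per submission for the chance to see that early exit.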

@ncook882 you can install the rsopt/rsmpi_wait_on_start branch to resolve this issue for the time being.


3 participants