UFS_WEATHER_MODEL HR.v4 cannot be run with fully packed nodes on Gaea C5 at C1152 resolution #2540
Comments
The issue can be mitigated by running with traditional threads and 64 or fewer ranks per node. |
@GeorgeVandenberghe-NOAA Have you been able to attempt ESMF-managed threading runs at full core capacity with the custom Verbosity setting as per:
# EARTH #
EARTH_component_list: MED ATM OCN ICE WAV
EARTH_attributes::
Verbosity = 32563
::
This should dump a lot of memory tracing information into the ESMF PET* log files. It might give us a clue as to where/why memory pressure is growing to the point of failure. If you have PET* log files with that extra info, I would like to look at them. Thanks! |
Thanks. I will try that this afternoon.
|
The run's working directory is /gpfs/f5/scratch/gwv/hr4j/da. Output is in oo and errors are in ee. Where are these memory statistics written?
|
What do I toggle to get those PET logs turned on?
|
In the # ESMF # section:
# ESMF #
logKindFlag: ESMF_LOGKIND_MULTI
globalResourceControl: true
|
Done. A run with the PET logs is on /gpfs/f5/scratch/gwv/hr4j/da
|
@GeorgeVandenberghe-NOAA I looked at the memory tracing, and it looks to me that the run dies because of memory pressure on the nodes that run the WAV component. WAV in this run is set up to execute on 998 PETs. Does the WAV configuration work on that number of PETs under traditional threading? |
Yes, but it is run with 64 or 32 PETs per node under traditional threading. I suspect it will fail the same way with one thread on a packed node at 128 ranks per node.
|
I will report the memory pressure to the WAVE people.
|
You could try running WAV with different threading levels under ESMF-managed threading. E.g.
# WAV #
WAV_model: ww3
WAV_petlist_bounds: 8296 9293
WAV_omp_num_threads: 2
WAV_attributes::
Verbosity = 0
OverwriteSlice = false
mesh_wav = mesh.uglo_m1g16.nc
user_sets_restname = false
::
to run 2x threaded, therefore using 64 tasks per node, or with
WAV_omp_num_threads: 4
for 4x threaded, using 32 tasks per node. Still using 998 cores in total for any of those cases, just changing the threading level. Would be curious to see how that changes things. |
Thanks, I'll check it out.
|
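The arithmetic behind the WAV threading suggestion above can be checked with a few lines of Python. This is only a sketch of the bookkeeping: the PET range 8296 to 9293 and the 128 cores per node come from the thread, while the function name `wav_layout` is made up for illustration. How ESMF treats cores that do not divide evenly by the thread count is not stated in the thread, so the remainder is simply reported.

```python
# Sketch only: wav_layout is a hypothetical helper, not part of ESMF or UFS.
CORES_PER_NODE = 128  # Gaea C5 node size quoted in the thread


def wav_layout(petlist_bounds, omp_num_threads, cores_per_node=CORES_PER_NODE):
    """Return core and task counts for a PET range at a given threading level."""
    first, last = petlist_bounds
    cores = last - first + 1  # petlist bounds are inclusive
    tasks, leftover = divmod(cores, omp_num_threads)
    return {
        "cores": cores,
        "tasks": tasks,
        "leftover_cores": leftover,  # handling of these is not covered here
        "tasks_per_node": cores_per_node // omp_num_threads,
    }


for threads in (1, 2, 4):
    print(threads, wav_layout((8296, 9293), threads))
```

The core count stays at 998 in every case; only the number of tasks sharing a node's memory changes.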
Thanks @GeorgeVandenberghe-NOAA and @theurich - I just wanted to acknowledge here that the wave people have seen this. @DeniseWorthen has also observed the wave memory issues and has done some work to address some of the issues, which can be seen in a draft PR here: NOAA-EMC/WW3#1317 |
The other choke point is inline post. I can run this with traditional
threads with 256 ranks per I/O group but it fails with ESMF managed threads
on a packed node with 256 and 512 MPI ranks per I/O group on gaea C5.
This will be a serious issue increasing resolution on WCOSS2 also and I am
puzzled where the memory for THAT is being used. 512 tasks are spread
between four nodes each with 256GB of memory!
|
It doesn't look like the case @GeorgeVandenberghe-NOAA is pointing to (/gpfs/f5/scratch/gwv/hr4j/da) has the PIO in WW3 enabled. Is that intentional? |
I can try to get @GeorgeVandenberghe-NOAA a test case with PIO enabled by the end of the day - with @sbanihash help we almost have a PR ready for g-w to generate a new test case |
How is this enabled?
For this situation inline post failure trumps the wave memory issues anyway
and wave memory issues can be addressed with additional ESMF threads. UFS
post memory issues have persisted with two threads and 256 or 512 ranks per
I/O group, 128 ranks per node.
|
Is there a namelist thing I can toggle?
|
You can toggle inline post off in To toggle PIO for WW3 the model needs to have been compiled w/ PIO in the switch for WW3. I don't know if your case has that or not. |
@DeniseWorthen and @JessicaMeixner-NOAA It's great to know wave people are aware of the memory pressure coming from WW3, and even greater there is already a PR to address it! Do you think George should be testing here with WW3 changes from that PR? @GeorgeVandenberghe-NOAA is the next step to attempt a run on the same layout (as far as tasks and threading are concerned for each component), but with inline post off, and PIO active for WW3? We would expect a successful run. After that, turn inline post back on and observe what happens? |
@theurich There are two things that have been/can be done w/rt WW3 memory pressure. The first was implementing PIO for WW3 restarts. That has been committed, it requires compiling WW3 w/ the PIO ifdef and some additional settings in ufs.configure. It may not yet be in G-W though. The second is a draft PR to eliminate duplicate fields. That has sat in draft because I ran into a test case---Hera+GNU+Release which did not reproduce baselines. All other cases did. I also ran cases on Hercules and Gaea and everything passed. Hera uses a more recent GNU version though. Since the GNU+Debug passed, my supposition is that there is an optimization which is changing answers, but I have not had time to debug. |
To be very frank it's the inline post memory pressure that concerns me more
because I can't fix this toggling ESMF thread count as I can with Wave
and because inline post memory pressure is going to scale with ATM
resolution. Inline post memory pressure also looks like a soon to be
problem for traditionally threaded runs at higher than C1152 resolution too.
|
It sounds right to focus on the inline post memory pressure issue. Do you have memory logging from a run where WAV isn't running out of memory, but where inline post is causing the issue, that I can look at? Thanks. |
I'll make one.
|
I've run with inline post and with WAVE configured with two threads to
survive. It's in the same location
/gpfs/f5/scratch/gwv/hr4j/da
|
@GeorgeVandenberghe-NOAA It looks like the PET* log files under /gpfs/f5/scratch/gwv/hr4j/da contain output from several different runs. That makes it very hard to post-process and analyze. Could you post PET* log files somewhere from just one single run that fails due to inline post? |
Okay. Had to step out of the office for a kid problem. Will resubmit to make new ones.
|
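One way to avoid PET* log files that mix several runs, as described above, is to move the old ones aside before resubmitting. The following is a minimal sketch under stated assumptions: the only thing taken from the thread is the PET* naming; the helper name `archive_old_pet_logs` and the `pet_logs_<timestamp>` directory are made up for illustration.

```python
# Sketch only: archive_old_pet_logs is a hypothetical helper, not an ESMF tool.
from datetime import datetime
from pathlib import Path
import shutil


def archive_old_pet_logs(run_dir, pattern="PET*", stamp=None):
    """Move existing PET* log files into a timestamped subdirectory.

    Returns the archive directory, or None if there was nothing to move.
    """
    run_dir = Path(run_dir)
    old_logs = sorted(p for p in run_dir.glob(pattern) if p.is_file())
    if not old_logs:
        return None
    stamp = stamp or datetime.now().strftime("%Y%m%d_%H%M%S")
    archive_dir = run_dir / f"pet_logs_{stamp}"
    archive_dir.mkdir(exist_ok=True)
    for log in old_logs:
        shutil.move(str(log), str(archive_dir / log.name))
    return archive_dir


# Example call before resubmitting a job (run directory taken from the thread):
# archive_old_pet_logs("/gpfs/f5/scratch/gwv/hr4j/da")
```

Moving rather than deleting keeps the earlier logs available in case they are still needed for comparison.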
I have resubmitted a clean run. It has finished and new PET files were created. Same location.
|
What happens if you increase the threading level for ATM, e.g. to 4x?
# ATM #
ATM_model: fv3
ATM_petlist_bounds: 0 7935
ATM_omp_num_threads: 4
ATM_attributes::
Verbosity = 0
DumpFields = false
ProfileMemory = false
OverwriteSlice = true
::
This change requires you also change model_configure to keep giving the same number of packed cores to the WRT components:
quilting: .true.
quilting_restart: .true.
write_groups: 2
write_tasks_per_group: 128
With this, the FCST component still gets the first 6912 cores (now using them with 1728 tasks 4x threaded). The two WRT comps get each 128x4 = 512 cores as before, now using those cores with 128 tasks 4x threaded. Does that reduce the memory pressure? |
In future runs, could you set |
I will try that
|
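The core counts quoted for the 4x-threaded ATM layout can be reproduced from the settings discussed above. This is a sketch of the arithmetic only: the values (PET range 0 to 7935, 4 threads, 2 write groups of 128 tasks) come from the thread, the function name `atm_layout` is made up, and the sketch assumes the write groups take their cores from the end of the ATM range, which matches the statement that FCST gets the first 6912 cores.

```python
# Sketch only: atm_layout is a hypothetical helper, not part of ESMF or UFS.
def atm_layout(petlist_bounds, omp_num_threads, write_groups, write_tasks_per_group):
    """Split the ATM core range into FCST and WRT shares at a given threading level."""
    first, last = petlist_bounds
    total_cores = last - first + 1  # petlist bounds are inclusive
    wrt_cores_per_group = write_tasks_per_group * omp_num_threads
    wrt_cores = write_groups * wrt_cores_per_group
    fcst_cores = total_cores - wrt_cores
    return {
        "total_cores": total_cores,
        "wrt_cores_per_group": wrt_cores_per_group,
        "fcst_cores": fcst_cores,
        "fcst_tasks": fcst_cores // omp_num_threads,
    }


print(atm_layout((0, 7935), omp_num_threads=4, write_groups=2, write_tasks_per_group=128))
```

With these inputs the sketch gives 7936 cores in total, 512 cores per write group, and 6912 FCST cores running 1728 tasks, in line with the numbers in the comment.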
Adding four ESMF threads rather than two and maintaining 256 ranks per I/O group (and post group) allows it to run through. There are two sources of memory pressure. WAVE (not closely examined but specifying two wave threads rather than one fixes that side of it), and UPP. UPP just plain runs out of memory with only 256 GB available on 128 core nodes. I put in some summary() calls to print out getrusage stats. Post memory usage per rank scales inversely with the number of ranks (good) but total is huge, roughly 10x the size of the 81GB history state it's trying to post process. To post process a C1152 forecast needs about 1.3 TB of total memory spread between the ranks and the sum of all of the ranks on one node can't exceed 250GB or so. This will have to be reexamined with even modest increases in resolution. |
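A rough per-rank memory budget can be worked out from the figures in the comment above. This is a back-of-the-envelope sketch, not a measurement: the 1.3 TB total, the roughly 250 GB usable per node, the 128 cores per node, and the 256 ranks per I/O group are taken from the thread, while the function names and the assumption that memory is spread evenly across ranks are illustrative simplifications.

```python
# Sketch only: names are hypothetical and memory is assumed to spread evenly.
import math

TOTAL_POST_MEMORY_GB = 1300.0  # "about 1.3 TB" for a C1152 forecast
USABLE_GB_PER_NODE = 250.0     # "250GB or so" per node
CORES_PER_NODE = 128


def min_nodes_for_post(total_gb=TOTAL_POST_MEMORY_GB, usable_gb=USABLE_GB_PER_NODE):
    """Smallest node count whose combined usable memory covers the total."""
    return math.ceil(total_gb / usable_gb)


def per_rank_budget(ranks_per_group, threads_per_rank,
                    total_gb=TOTAL_POST_MEMORY_GB,
                    usable_gb=USABLE_GB_PER_NODE,
                    cores_per_node=CORES_PER_NODE):
    """Return (GB needed per rank, GB available per rank) on packed nodes."""
    needed = total_gb / ranks_per_group
    available = threads_per_rank * usable_gb / cores_per_node
    return needed, available


print("minimum nodes:", min_nodes_for_post())
for threads in (1, 2, 4):
    needed, available = per_rank_budget(256, threads)
    print(f"{threads} thread(s): need {needed:.2f} GB/rank, have {available:.2f} GB/rank")
```

Under these assumptions, 256 ranks need about 5.1 GB each, which exceeds the share available with one or two threads per rank and fits with four, consistent with the report that four ESMF threads at 256 ranks per I/O group ran through. Real usage is unlikely to be perfectly even, so this should be read as a lower bound.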
When ufs-weather-model (the version tested is HR.v4) is run at C1152 resolution on Gaea C5 with ESMF-managed threading, it hangs or fails when run with 128 MPI ranks per node. ESMF-managed threading requires 128 ranks per node for full use of the node because it disables traditional threading, so we cannot run C1152 with ESMF-managed threading. It is possible to get full use of the node by running with traditional threading and multiple threads per task (two threads with 64 ranks per node, or four threads with 32 ranks per node), but other components that do not thread well then use their nodes inefficiently. It is hypothesized that the 2 GB/core memory limit is insufficient to run this configuration fully packed at 128 ranks per node, but this raises the question: WHAT is using so much memory even at very high rank counts? It has failed with 256 ranks per I/O task and two ESMF threads, and with 512 ranks per I/O task and two ESMF threads.