Model crash, negative dvice #2562

I am running the SFS configuration of the model, C192mx025, global-workflow and ufs-weather-model hashes as given here: https://docs.google.com/spreadsheets/d/1F0e1wwR04Kirddo2mMd06NUGcgtfzIfFouMsGXKHNUo/edit?usp=sharing
For the most part the runs have been running stably, but I have seen a significant number of crashes with error messages like the following:
PASS: fcstRUN phase 1, n_atmsteps = 10824 time is 2.919200
(shift_ice)shift_ice: negative dvice
(shift_ice)boundary, donor cat: 3 4
(shift_ice)daice = 0.000000000000000E+000
(shift_ice)dvice = -2.944524469334827E-065
(icepack_warnings_setabort) T :file icepack_itd.F90 :line 551
(shift_ice) shift_ice: negative dvice
(icepack_warnings_aborted) ... (shift_ice)
(icepack_warnings_aborted) ... (linear_itd)
(icepack_warnings_aborted) ... (icepack_step_therm2)
(icepack_warnings_aborted) ... (icepack_step_therm2)
@ShanSunNOAA - is this the same issue you have been seeing?

Comments
Hi Ben,
Thank you for the information. Yes, this is exactly the same crash I've been experiencing. For example, here is the error message I got:
0: PASS: fcstRUN phase 2, n_atmsteps = 17929 time is 0.223418
(shift_ice)shift_ice: negative dvice
(shift_ice)boundary, donor cat: 3 4
(shift_ice)daice = 0.000000000000000E+000
(shift_ice)dvice = -5.710222883117139E-067
(icepack_warnings_setabort) T :file icepack_itd.F90 :line 551
(shift_ice) shift_ice: negative dvice
(icepack_warnings_aborted) ... (shift_ice)
(icepack_warnings_aborted) ... (linear_itd)
(icepack_warnings_aborted) ... (icepack_step_therm2)
Thanks,
Shan
Maybe a first thing to check is whether the CICE initial conditions for the crash cases are odd from the start. As mentioned in CICE-Consortium/Icepack#333, all IC ice thicknesses (vicen/aicen) should be within the category bounds of hin_max. This check can be done offline or during initialization. I'm not sure of the origins of these ICs, and there is a permissions issue with the Google Sheets link, but maybe ICs or run directories are available somewhere?
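(As an illustration of such a check, here is a minimal sketch that assumes the category arrays aicen/vicen, the hin_max bounds, and Icepack's puny threshold are available; it is not Icepack code, just the kind of loop one could run offline or at initialization.)

    subroutine check_itd_bounds(nx, ncat, aicen, vicen, hin_max, puny, nbad)
      ! Flag any cell/category whose mean thickness vicen/aicen falls outside
      ! the category bounds [hin_max(n-1), hin_max(n)].
      implicit none
      integer,      intent(in)  :: nx, ncat          ! number of cells, categories
      real(kind=8), intent(in)  :: aicen(nx,ncat)    ! category ice area fraction
      real(kind=8), intent(in)  :: vicen(nx,ncat)    ! category ice volume
      real(kind=8), intent(in)  :: hin_max(0:ncat)   ! category thickness bounds
      real(kind=8), intent(in)  :: puny              ! small number (1e-11 in Icepack)
      integer,      intent(out) :: nbad              ! count of violations
      integer      :: i, n
      real(kind=8) :: hin
      nbad = 0
      do n = 1, ncat
         do i = 1, nx
            if (aicen(i,n) > puny) then
               hin = vicen(i,n) / aicen(i,n)
               if (hin < hin_max(n-1) - puny .or. hin > hin_max(n) + puny) then
                  nbad = nbad + 1
                  write(*,'(a,2i8,3es15.6)') ' ITD bound violation: i, n, hin, lo, hi = ', &
                       i, n, hin, hin_max(n-1), hin_max(n)
               end if
            end if
         end do
      end do
    end subroutine check_itd_bounds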
Hi @NickSzapiro-NOAA - thanks for pointing out that issue. I will definitely do that check, although at first glance there are some differences from that issue. This is not occurring rarely in my case - at last count I had 47 failures of this kind out of 231 runs. Some of the crashes are coming 2+ months into the simulation, which also seems a bit odd for an IC bug. Each ensemble member is also using the same ice initial file, and not all ensemble members are crashing.
Having said all that, here is a link to one of the ice initial files on AWS that is associated with a crash:
The runs are being performed on Frontera, which I could either give you access to, or I could transfer a run directory offsite.
@NeilBarton-NOAA - I remember you saying there was an issue with sea ice ICs in the past that you had developed a workaround for, but looking at the code it seems like that was related to the ice edge and not the thickness bounds Nick mentioned here.
I don't see any problems with the thickness categories in that IC file. These dvice_negative aborts at ~ -1e-65 really seem too small to matter, particularly relative to a_min/m_min/hi_min and zap_small_areas in Icepack. One test is to change the check so that the value only has to be at least -puny instead (along the lines of what Dave Bailey opened in CICE-Consortium/Icepack#338). I also wonder how much residual ice (CICE-Consortium/CICE#645) is just present in these runs. Before really looking into cases or modifications, maybe @DeniseWorthen and @NeilBarton-NOAA have thoughts.
Hi @NickSzapiro-NOAA - some more context for this problem is that (to my knowledge) it does not appear in the case where the atmosphere and ocean are reduced to 1 degree (C96mx100). So far we are only seeing it in the C192mx025 runs, where we are using those IC files I pointed you to directly. One thing I have not yet done is any kind of analysis of the ice in those runs, to see if there is something pathological going on. That's first up on my agenda for today. @ShanSunNOAA do you have any insights from your crashes?
I was testing the addition of the -ftz flag (flush-to-zero), but Denise pointed out that it was already in place. Why didn't the flag work as expected and set the e-67 value to zero?
@NickSzapiro-NOAA - If there was a diagnostic field that would give some clues as to what might be going on here, do you have a sense for what it would be? For example, I see a lot of very small aice_h values (e.g., 8.168064e-11, 7.310505e-14), but I'm not familiar enough with CICE to know what to make of these.
I would say that very sparse ice is more technical than physical, as it may not be moving, melting, or freezing depending on some threshold dynamics and thermodynamics settings (see CICE-Consortium/CICE#645). It seems reasonable that the sparse areas are associated with these crashes, but we can confirm. I'm happy to try to log more information about what's happening... I don't know if you would need to provide a run directory or if this is reproducible via global-workflow or such on RDHPCS. If you're open to experimenting, the first "quick fix" that comes to mind is editing ufs-weather-model/CICE-interface/CICE/icepack/columnphysics/icepack_itd.F90:
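(The edit itself is not reproduced here. As a stand-alone illustration of the idea rather than the Icepack source, the sketch below contrasts a roundoff tolerance scaled by the donor-category state, which is one reading of the current check, with the absolute -puny floor being proposed, using the tiny dvice value from the abort message above.)

    program dvice_tolerance_sketch
      ! Illustrative only: compare a donor-scaled roundoff tolerance with an
      ! absolute -puny floor for a roundoff-level negative dvice.
      implicit none
      integer,  parameter :: dp = kind(1.d0)
      real(dp), parameter :: c0   = 0.0_dp
      real(dp), parameter :: puny = 1.0e-11_dp       ! Icepack-style small number
      real(dp) :: dvice, vicen_d

      dvice   = -5.710222883117139e-67_dp            ! value from the crash log above
      vicen_d = 0.0_dp                               ! essentially empty donor category

      ! Donor-scaled tolerance: with an (almost) empty donor category, even a
      ! -1e-67 transfer is outside the tolerance and the abort path is taken.
      if (dvice < c0) then
         if (dvice > -puny*vicen_d) then
            print *, 'donor-scaled tolerance: zero the transfer'
         else
            print *, 'donor-scaled tolerance: abort (negative dvice)'
         end if
      end if

      ! Absolute floor: anything between -puny and zero is treated as roundoff
      ! and zeroed, consistent with what zap_small_areas clears in cleanup_itd.
      if (dvice < c0) then
         if (dvice > -puny) then
            print *, 'absolute -puny floor: zero the transfer'
         else
            print *, 'absolute -puny floor: abort (negative dvice)'
         end if
      end if
    end program dvice_tolerance_sketch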
The reasoning is that this is more consistent with what is cleared in the zap_small_areas routine in cleanup_itd.
My runs are using global-workflow but on Frontera via containers, so that might not be the easiest test case to give you. @ShanSunNOAA, what system are you making your runs on? I could also globus one of my run directories to Hercules.
Thanks @NickSzapiro-NOAA for a "quick fix"! I saved the crashed ice output under /scratch2/BMC/gsd-fv3-dev/sun/hr4_1013/COMROOT/c192mx025/gefs.20051101/00/mem000/model/ice/history_bad/. I am testing your fix on Hera right now. Will let you know how it turns out later today. Thanks!
Thanks @ShanSunNOAA! It takes a while for jobs to get through the queue on Frontera so I wouldn't be able to test nearly so quickly. @NickSzapiro-NOAA, if this does turn out to fix the problem, would you expect it to change answers more generally? I.e., would the successful runs need to be redone for consistency?
I think this is the smallest change that would help, and it is a localized change near the check being edited. A bigger change would be to more actively remove very sparse ice.
@NickSzapiro-NOAA The model crashed at the same time step as earlier, but with a different error message: now it is vicen, not dvice, that has a value of ~ -1e-65. Should the same treatment be applied to vicen?
Yes, sorry about that, @ShanSunNOAA. Would you mind re-testing a change to all 4 conditions:
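(The list of conditions is not reproduced here. The shared ingredient, sketched below as an illustration and not as the actual patch, is that a value which is negative only at roundoff level, whether a transfer like daice/dvice or a category state like vicen, gets reset to zero instead of triggering an abort.)

    module roundoff_clip
      ! Illustration of the shared idea behind the proposed change.
      implicit none
      integer, parameter :: dp = kind(1.d0)
    contains
      elemental function clip_tiny_negative(x, puny) result(y)
        real(dp), intent(in) :: x, puny
        real(dp) :: y
        if (x < 0.0_dp .and. x > -puny) then
           y = 0.0_dp       ! e.g. vicen ~ -1e-65 is treated as zero
        else
           y = x            ! anything else is left for the existing checks
        end if
      end function clip_tiny_negative
    end module roundoff_clip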
Thank you, @NickSzapiro-NOAA, for your prompt response! I am testing it now.
@NickSzapiro-NOAA - assuming this does fix the problem, how quickly do you think this can get incorporated as a PR? I can't help but notice that CICE-Consortium/CICE#645 has been open since 2021 and has had no activity since May of last year.
@ShanSunNOAA - what is the earliest crash you see? Most of my runs are crashing at around the 50-day / 7-wallclock-hour mark, which makes testing awkward.
@NickSzapiro-NOAA Good news - with your quick fix, my run successfully completed 3 months. Thank you again for your prompt help late last night - it made it possible to run it overnight, as it takes 4-5 hours to reach the crashing point.
@ShanSunNOAA thanks for the quick reply! Do you have a fork of Icepack you can include the fix in, so we can be sure we are working off the same code?
@benjamin-cash I don't have a fork for this. @NickSzapiro-NOAA Are you going to submit a PR?
Good to hear. In UFS, we use an EMC fork of CICE that uses the CICE-Consortium/Icepack submodule directly. Let me follow up at CICE-Consortium.
@ShanSunNOAA - could you confirm which hash of Icepack you are using in your runs? I've created a fork and a branch with the proposed fix, but I want to be sure that I haven't gotten my submodules out of sync.
Just a word of caution: while one previously failing case may now pass, that doesn't preclude some previously passing case now failing. So unless you run the entire set of runs, you don't really know if this is a fix.
@DeniseWorthen that's definitely a concern. Plus it is a change from the C96mx100 baseline configuration. On the other hand, it is a show-stopper bug for the C192mx025 runs, so we definitely need to do something. My thinking at this point is to first rerun one or two of my successful cases and compare the outcomes.
It would also be interesting to know whether there is any seasonal signal in when the failing runs occur. Are they at a time of fast melt, hard freezing, etc.? Could you document the failed run dates and the day on which they fail?
@DeniseWorthen I think I need to chime in. We have many crashed cases like that in the C384 runs, mainly in May/June.
@DeniseWorthen @bingfu-NOAA - I haven't looked exhaustively, but they seem to be in the 2-3 month range (May 01 start). And it is definitely variable. For example, C192mx025_1995050100 mem003 crashed at hour 1627, while mem010 crashed at 1960.
So, just for completeness in helping keep track of this issue, more information here:
@dabail10, tagging you in for awareness.
@NeilBarton-NOAA I will have plots of the ice distribution with and without this change later today.
So what happens if you use an "unmodified" initial file? It would be good to plot vicen, aicen, and vicen/aicen from these files. I think the problem is that you have two categories where vicen/aicen, i.e. the average thickness, is very close.
@dabail10 - all of my runs are with the unmodified IC files, and those are the files I pointed @NickSzapiro-NOAA to on AWS upthread.
I took Ben's run directory and made a few modifications: (a) I copied the sym-linked fix files referenced in input.nml from the glopara directory on Hercules and updated input.nml accordingly. When launched, the job immediately seg-faulted.
To test whether I had goofed up the settings, I then ran w/ ice_ic=default. The model integrated successfully for 6 hours. For those w/ access to Hercules: ice_ic=default:
The model is seg-faulting when multiplying apeffn*aicen here.
The issue appears to be that aicen>puny where kmt=0 (land). This produces a NaN for the variable apeffn. @benjamin-cash Are you sure you're not using some sort of processed IC? I don't believe a "native" CICE IC would produce ice values on land.
I guess I am confused. Is the initial file from exactly the same configuration of a run? If that is the case, then you wouldn't be able to do a continue/restart run? Like @DeniseWorthen just said.
To the best of my knowledge, I am using the files from https://noaa-ufs-gefsv13replay-pds.s3.amazonaws.com without any further modifications. Never say never, but I don't see anywhere in the scripts where those files are being edited after they are downloaded. @ShanSunNOAA and @bingfu-NOAA, is this where you are getting the CICE ICs from as well?
The IC from AWS (for 2004-05-01) does not have ice over land, but the restart in Ben's run directory does (for 2004-06-28-43200?). The CICE_OUTPUT files didn't copy over, but those would be nice to see. I imagine the NaNs are from the ... Shan's result matches the RTs, where differences only occur for situations that currently abort.
@NickSzapiro-NOAA I don't follow. In the rundir I got from Ben, it is:
In that rundir, if I use kmt (from ...) it seems pretty clear to me that aice>0 over land in the initial condition. Secondly, we regularly test RTs in debug mode; our RTs would fail if the NaNs were because of an uninitialized variable w/ our "normal" ICs. Finally, I added this code to CICE:
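(The exact snippet is not reproduced here; a stand-alone sketch of a diagnostic along these lines, with schematic array shapes and names rather than the actual CICE data structures:)

    subroutine report_ice_on_land(nx, ny, kmt, aice, apeffn, puny, nfound)
      ! Print any cell where the land mask says land (kmt = 0) but the ice
      ! concentration is nonzero, together with the pond variable that goes NaN.
      implicit none
      integer,      intent(in)  :: nx, ny
      real(kind=8), intent(in)  :: kmt(nx,ny)      ! land mask, 0 over land
      real(kind=8), intent(in)  :: aice(nx,ny)     ! total ice concentration
      real(kind=8), intent(in)  :: apeffn(nx,ny)   ! effective pond area (one category)
      real(kind=8), intent(in)  :: puny
      integer,      intent(out) :: nfound
      integer :: i, j
      nfound = 0
      do j = 1, ny
         do i = 1, nx
            if (kmt(i,j) < 0.5d0 .and. aice(i,j) > puny) then
               nfound = nfound + 1
               write(*,'(a,2i6,2es15.6)') ' ice on land: i, j, aice, apeffn = ', &
                    i, j, aice(i,j), apeffn(i,j)
            end if
         end do
      end do
    end subroutine report_ice_on_land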
Grepping stdout shows that only when kmt=0 and aice>0 do you get NaN in apeffn:
@DeniseWorthen - if the problem was the presence of aice>0 over land, why wouldn't all members of the ensemble fail? I'm still struggling with that one. I have to wonder if somehow the IC file was overwritten during the run in that run directory, because I'm at a loss otherwise to explain how it could be different from the file on AWS.
The problem is that you don't know what variable array in memory you're corrupting when you (I'm guessing) divide by 0. If you're seg-faulting in debug mode, that is job one to figure out. There's no point (imho) in trying to figure out the behaviour of a model which is showing NaNs in debug mode.
I've also looked at a run directory on Hera that Shan provided. Her IC is clearly different. As I explained to her, 1) there are _FillValue attributes (the _FillValue is NaN); these are not written by CICE in a restart file. And 2) there is an additional array (aicen_orig) which is also not written by CICE. Her IC looks to be "processed" somehow. However, the same "aice>0 on land" problem is still present.
Your initial condition file does not contain either of those, and appears to be an "unprocessed" initial condition. However, it still has aice>0 on land.
Hi @DeniseWorthen - Thanks for the explanation. I downloaded the IC file again from the AWS bucket and confirmed that it is identical to the file in my run directory. It sounds then like the ice IC files on HPSS are not the same as the files on AWS, which, if nothing else comes out of this, was important to uncover. I wonder if the HPSS files have been further processed via @NeilBarton-NOAA's https://github.com/NeilBarton-NOAA/REPLAY_ICS/blob/main/SCRIPTS/CICE_ic_edit.py code? Neil, is that correct? If someone can put one of @ShanSunNOAA's ICs somewhere I can see it, I will run Neil's code against my corresponding file and see if I get the same result.
Oh! Something else that occurs to me - it would be very interesting to analyze the files used for the C96mx100 runs for these same dates, since apparently none of those runs crashed with this issue.
@benjamin-cash You can "fix" your own ICs:
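(The actual fix here used NCO on the IC file. As an illustrative equivalent only, the sketch below uses the netCDF-Fortran library to zero the category state over land; the file names, the variable list, and taking kmt from a separate grid file are assumptions for illustration, not a description of the command that was actually run.)

    program zero_ice_on_land
      ! Sketch: zero the category state in a CICE IC/restart file wherever the
      ! land mask says land.
      use netcdf
      implicit none
      integer :: nc_ic, nc_grid, varid, dimids(3)
      integer :: ni, nj, ncat, n, v
      real(kind=8), allocatable :: kmt(:,:), fld(:,:,:)
      character(len=16), parameter :: vars(3) = &
           [character(len=16) :: 'aicen', 'vicen', 'vsnon']   ! category state to zero
      character(len=*), parameter :: ic_file   = 'cice_ic.nc'     ! hypothetical name
      character(len=*), parameter :: grid_file = 'cice_grid.nc'   ! hypothetical name

      ! Open the IC file for in-place modification and size everything from
      ! aicen (assumed shape ni x nj x ncat).
      call check( nf90_open(ic_file, NF90_WRITE, nc_ic) )
      call check( nf90_inq_varid(nc_ic, 'aicen', varid) )
      call check( nf90_inquire_variable(nc_ic, varid, dimids=dimids) )
      call check( nf90_inquire_dimension(nc_ic, dimids(1), len=ni) )
      call check( nf90_inquire_dimension(nc_ic, dimids(2), len=nj) )
      call check( nf90_inquire_dimension(nc_ic, dimids(3), len=ncat) )
      allocate(kmt(ni,nj), fld(ni,nj,ncat))

      ! Land mask from the grid file: kmt = 0 over land.
      call check( nf90_open(grid_file, NF90_NOWRITE, nc_grid) )
      call check( nf90_inq_varid(nc_grid, 'kmt', varid) )
      call check( nf90_get_var(nc_grid, varid, kmt) )
      call check( nf90_close(nc_grid) )

      ! Zero each category-state variable over land and write it back.
      ! (A complete fix would also treat the remaining tracers consistently.)
      do v = 1, size(vars)
         call check( nf90_inq_varid(nc_ic, trim(vars(v)), varid) )
         call check( nf90_get_var(nc_ic, varid, fld) )
         do n = 1, ncat
            where (kmt < 0.5d0) fld(:,:,n) = 0.0d0
         end do
         call check( nf90_put_var(nc_ic, varid, fld) )
      end do
      call check( nf90_close(nc_ic) )

    contains
      subroutine check(ierr)
        integer, intent(in) :: ierr
        if (ierr /= nf90_noerr) then
           print *, trim(nf90_strerror(ierr))
           stop 1
        end if
      end subroutine check
    end program zero_ice_on_land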
When I did this and used it w/ the model compiled in debug mode, the model completed 6 hours of integration. I can now try to run it out and see what happens. No guarantee this will fix it, but we at least know it's not initializing weird.
@dabail10 @benjamin-cash The ICs on AWS were also modified by SOCA/JEDI DA. The ICs I processed (on HPSS) take the AWS ICs and remove ice because of the dvice issue (obviously that doesn't work for all cases). The C96mx100 CICE ICs are interpolated from the mx025 ICs, which suggests that the interpolation, or running at a lower resolution, removed this issue. @dabail10 I tried removing the ice where two categories of vicen/aicen were very close in the ICs and aice was "small", but I wasn't successful. This was about a year ago and could be examined more. @guillaumevernieres for awareness.
@DeniseWorthen - Good to know! I still would like to make the comparison to the ICs that @ShanSunNOAA used and figure out exactly how and why they are different, because this is a place where I've been worried differences might creep in.
@NeilBarton-NOAA - it would be very interesting to check and see if those lower-resolution IC files do or do not include aicen>0 over land.
@NickSzapiro-NOAA Your quick fix resolved all three of my cases that originally crashed due to negative dvice; all have successfully completed 3 months. Thank you again for efficiently and effectively pinpointing the issue.
@ShanSunNOAA could you put one of your run directories on Hercules so I can compare IC files?
@benjamin-cash Good idea; however, I am battling a Globus error. Will let you know when I am successful.
@benjamin-cash Can you provide the ice history files and ice restarts (the ones in CICE_RESTART) from the run directory you placed on Hercules?
@DeniseWorthen - the restarts are transferring now (to the CICE_RESTART directory properly this time), as well as the history files. The history files are going in a directory labeled ice_history in that same run directory I transferred.
@benjamin-cash I see the CICE restarts, but the history files are still just sym-links to directories on Frontera (?)
@DeniseWorthen are you looking in /work/noaa/nems/cash/ice_fail/fcst.37280/ice_history? Those should be real netCDF files.
OK, now I see them. Thanks.
Has anyone ever noticed that you can't ncrcat these output files because the forecast-hour character string is not consistent: fXX, fXXX, fXXXX?
Yes - definitely annoying. As an update, I was able to confirm just now that the difference between my ice ICs and @ShanSunNOAA's is that her files were further processed with @NeilBarton-NOAA's CICE_ic_edit.py code and mine were not. Since her files were still crashing with this error, it seems like we still probably need @NickSzapiro-NOAA's fix to get past it, especially since it does not seem to be changing the answers otherwise.
@benjamin-cash Shan's IC also had ice on land. It will also probably seg-fault if run in debug mode. My run on Hercules w/ the fixed ICs (i.e., w/ the NCO processing to remove all ice on land) did stop w/ negative dvice at Jun 16. But I have the Jun 15th restart, which means we can now debug the actual issue by restarting and tracking what is happening.
I'm just happy to have identified the source of the differences in our ICs at this point. :) Also very glad to know that you have a case that you can now interrogate in detail!