Model crash, negative dvice #2562

I am running the SFS configuration of the model, C192mx025, global-workflow and ufs-weather-model hashes as given here: https://docs.google.com/spreadsheets/d/1F0e1wwR04Kirddo2mMd06NUGcgtfzIfFouMsGXKHNUo/edit?usp=sharing
For the most part the runs have been running stably, but I have seen a significant number of crashes with error messages like the following:
PASS: fcstRUN phase 1, n_atmsteps = 10824 time is 2.919200
(shift_ice)shift_ice: negative dvice
(shift_ice)boundary, donor cat: 3 4
(shift_ice)daice = 0.000000000000000E+000
(shift_ice)dvice = -2.944524469334827E-065
(icepack_warnings_setabort) T :file icepack_itd.F90 :line 551
(shift_ice) shift_ice: negative dvice
(icepack_warnings_aborted) ... (shift_ice)
(icepack_warnings_aborted) ... (linear_itd)
(icepack_warnings_aborted) ... (icepack_step_therm2)
(icepack_warnings_aborted) ... (icepack_step_therm2)
@ShanSunNOAA - is this the same issue you have been seeing?

Comments
Hi Ben,
Thank you for the information. Yes, this is exactly the same crash I've been experiencing. For example, here is the error message I got:
0: PASS: fcstRUN phase 2, n_atmsteps = 17929 time is 0.223418
(shift_ice)shift_ice: negative dvice
(shift_ice)boundary, donor cat: 3 4
(shift_ice)daice = 0.000000000000000E+000
(shift_ice)dvice = -5.710222883117139E-067
(icepack_warnings_setabort) T :file icepack_itd.F90 :line 551
(shift_ice) shift_ice: negative dvice
(icepack_warnings_aborted) ... (shift_ice)
(icepack_warnings_aborted) ... (linear_itd)
(icepack_warnings_aborted) ... (icepack_step_therm2)
Thanks,
Shan
Maybe a first thing to check is whether the CICE initial conditions for the crash cases are odd from the start. As mentioned in CICE-Consortium/Icepack#333, all IC ice thicknesses (vicen/aicen) should be within the category bounds of hin_max. This check can be done offline or during initialization. I'm not sure of the origins of these ICs, and there is a permissions issue with the Google Sheets link, but maybe ICs or run directories are available somewhere?
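(As an illustration of such a check, here is a minimal sketch that assumes the category arrays aicen/vicen, the hin_max bounds, and Icepack's puny threshold are available; it is not Icepack code, just the kind of loop one could run offline or at initialization.)

    subroutine check_itd_bounds(nx, ncat, aicen, vicen, hin_max, puny, nbad)
      ! Flag any cell/category whose mean thickness vicen/aicen falls outside
      ! the category bounds [hin_max(n-1), hin_max(n)].
      implicit none
      integer,      intent(in)  :: nx, ncat          ! number of cells, categories
      real(kind=8), intent(in)  :: aicen(nx,ncat)    ! category ice area fraction
      real(kind=8), intent(in)  :: vicen(nx,ncat)    ! category ice volume
      real(kind=8), intent(in)  :: hin_max(0:ncat)   ! category thickness bounds
      real(kind=8), intent(in)  :: puny              ! small number (1e-11 in Icepack)
      integer,      intent(out) :: nbad              ! count of violations
      integer      :: i, n
      real(kind=8) :: hin
      nbad = 0
      do n = 1, ncat
         do i = 1, nx
            if (aicen(i,n) > puny) then
               hin = vicen(i,n) / aicen(i,n)
               if (hin < hin_max(n-1) - puny .or. hin > hin_max(n) + puny) then
                  nbad = nbad + 1
                  write(*,'(a,2i8,3es15.6)') ' ITD bound violation: i, n, hin, lo, hi = ', &
                       i, n, hin, hin_max(n-1), hin_max(n)
               end if
            end if
         end do
      end do
    end subroutine check_itd_bounds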
Hi @NickSzapiro-NOAA - thanks for pointing out that issue. I will definitely do that check, although at first glance there are some differences from that issue. This is not occurring rarely in my case - at last count I had 47 failures of this kind out of 231 runs. Some of the crashes are coming 2+ months into the simulation, which also seems a bit odd for an IC bug. Each ensemble member is also using the same ice initial file, and not all ensemble members are crashing.
Having said all that, here is a link to one of the ice initial files on AWS that is associated with a crash:
The runs are being performed on Frontera, which I could either give you access to, or I could transfer a run directory offsite.
@NeilBarton-NOAA - I remember you saying there was an issue with sea ice ICs in the past that you had developed a workaround for, but looking at the code it seems like that was related to the ice edge and not the thickness bounds Nick mentioned here.
I don't see any problems with the thickness categories in that IC file. These dvice_negative aborts at ~ -1e-65 really seem too small to matter, particularly relative to a_min/m_min/hi_min and zap_small_areas in Icepack. One test is to change the check so that the value only has to be at least -puny instead (along the lines of what Dave Bailey opened in CICE-Consortium/Icepack#338). I also wonder how much residual ice (CICE-Consortium/CICE#645) is just present in these runs. Before really looking into cases or modifications, maybe @DeniseWorthen and @NeilBarton-NOAA have thoughts.
Hi @NickSzapiro-NOAA - some more context for this problem is that (to my knowledge) it does not appear in the case where the atmosphere and ocean are reduced to 1 degree (C96mx100). So far we are only seeing it in the C192mx025 runs, where we are using those IC files I pointed you to directly. One thing I have not yet done is any kind of analysis of the ice in those runs, to see if there is something pathological going on. That's first up on my agenda for today. @ShanSunNOAA do you have any insights from your crashes?
I was testing the addition of the -ftz flag (flush-to-zero), but Denise pointed out that it was already in place. Why didn't the flag work as expected and set the e-67 value to zero?
@NickSzapiro-NOAA - If there was a diagnostic field that would give some clues as to what might be going on here, do you have a sense for what it would be? For example, I see a lot of very small aice_h values (e.g., 8.168064e-11, 7.310505e-14), but I'm not familiar enough with CICE to know what to make of these.
I would say that very sparse ice is more technical than physical, as it may not be moving, melting, or freezing depending on some threshold dynamics and thermodynamics settings (see CICE-Consortium/CICE#645). It seems reasonable that the sparse areas are associated with these crashes, but we can confirm. I'm happy to try to log more information about what's happening... I don't know if you would need to provide a run directory or if this is reproducible via global-workflow or such on RDHPCS. If you're open to experimenting, the first "quick fix" that comes to mind is editing ufs-weather-model/CICE-interface/CICE/icepack/columnphysics/icepack_itd.F90:
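(The edit itself is not reproduced here. As a stand-alone illustration of the idea rather than the Icepack source, the sketch below contrasts a roundoff tolerance scaled by the donor-category state, which is one reading of the current check, with the absolute -puny floor being proposed, using the tiny dvice value from the abort message above.)

    program dvice_tolerance_sketch
      ! Illustrative only: compare a donor-scaled roundoff tolerance with an
      ! absolute -puny floor for a roundoff-level negative dvice.
      implicit none
      integer,  parameter :: dp = kind(1.d0)
      real(dp), parameter :: c0   = 0.0_dp
      real(dp), parameter :: puny = 1.0e-11_dp       ! Icepack-style small number
      real(dp) :: dvice, vicen_d

      dvice   = -5.710222883117139e-67_dp            ! value from the crash log above
      vicen_d = 0.0_dp                               ! essentially empty donor category

      ! Donor-scaled tolerance: with an (almost) empty donor category, even a
      ! -1e-67 transfer is outside the tolerance and the abort path is taken.
      if (dvice < c0) then
         if (dvice > -puny*vicen_d) then
            print *, 'donor-scaled tolerance: zero the transfer'
         else
            print *, 'donor-scaled tolerance: abort (negative dvice)'
         end if
      end if

      ! Absolute floor: anything between -puny and zero is treated as roundoff
      ! and zeroed, consistent with what zap_small_areas clears in cleanup_itd.
      if (dvice < c0) then
         if (dvice > -puny) then
            print *, 'absolute -puny floor: zero the transfer'
         else
            print *, 'absolute -puny floor: abort (negative dvice)'
         end if
      end if
    end program dvice_tolerance_sketch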
The reasoning is that this is more consistent with what is cleared in the zap_small_areas routine in cleanup_itd.
My runs are using global-workflow but on Frontera via containers, so that might not be the easiest test case to give you. @ShanSunNOAA, what system are you making your runs on? I could also globus one of my run directories to Hercules.
Thanks @NickSzapiro-NOAA for a "quick fix"! I saved the crashed ice output under /scratch2/BMC/gsd-fv3-dev/sun/hr4_1013/COMROOT/c192mx025/gefs.20051101/00/mem000/model/ice/history_bad/. I am testing your fix on Hera right now. Will let you know how it turns out later today. Thanks!
Thanks @ShanSunNOAA! It takes a while for jobs to get through the queue on Frontera so I wouldn't be able to test nearly so quickly. @NickSzapiro-NOAA, if this does turn out to fix the problem, would you expect it to change answers more generally? I.e., would the successful runs need to be redone for consistency?
I think this is the smallest change that would help, and it is a localized change near the check being edited. A bigger change would be to more actively remove very sparse ice.
@NickSzapiro-NOAA The model crashed at the same time step as earlier, but with a different error message: now it is vicen, not dvice, that has a value of ~ -1e-65. Should the same treatment be applied to vicen?
Yes, sorry about that, @ShanSunNOAA. Would you mind re-testing a change to all 4 conditions:
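(The list of conditions is not reproduced here. The shared ingredient, sketched below as an illustration and not as the actual patch, is that a value which is negative only at roundoff level, whether a transfer like daice/dvice or a category state like vicen, gets reset to zero instead of triggering an abort.)

    module roundoff_clip
      ! Illustration of the shared idea behind the proposed change.
      implicit none
      integer, parameter :: dp = kind(1.d0)
    contains
      elemental function clip_tiny_negative(x, puny) result(y)
        real(dp), intent(in) :: x, puny
        real(dp) :: y
        if (x < 0.0_dp .and. x > -puny) then
           y = 0.0_dp       ! e.g. vicen ~ -1e-65 is treated as zero
        else
           y = x            ! anything else is left for the existing checks
        end if
      end function clip_tiny_negative
    end module roundoff_clip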
Thank you, @NickSzapiro-NOAA, for your prompt response! I am testing it now.
@NickSzapiro-NOAA - assuming this does fix the problem, how quickly do you think this can get incorporated as a PR? I can't help but notice that CICE-Consortium/CICE#645 has been open since 2021 and has had no activity since May of last year.
@ShanSunNOAA - what is the earliest crash you see? Most of my runs are crashing at around the 50-day / 7-wallclock-hour mark, which makes testing awkward.
@NickSzapiro-NOAA Good news - with your quick fix, my run successfully completed 3 months. Thank you again for your prompt help late last night - it made it possible to run it overnight, as it takes 4-5 hours to reach the crashing point.
@ShanSunNOAA thanks for the quick reply! Do you have a fork of Icepack you can include the fix in, so we can be sure we are working off the same code?
@benjamin-cash I don't have a fork for this. @NickSzapiro-NOAA Are you going to submit a PR?
Good to hear. In UFS, we use an EMC fork of CICE that uses the CICE-Consortium/Icepack submodule directly. Let me follow up at CICE-Consortium.
@ShanSunNOAA - could you confirm which hash of Icepack you are using in your runs? I've created a fork and a branch with the proposed fix, but I want to be sure that I haven't gotten my submodules out of sync.
Just a word of caution: while one previously failing case may now pass, that doesn't preclude some previously passing case now failing. So unless you run the entire set of runs, you don't really know if this is a fix.
@DeniseWorthen that's definitely a concern. Plus it is a change from the C96mx100 baseline configuration. On the other hand, it is a show-stopper bug for the C192mx025 runs, so we definitely need to do something. My thinking at this point is to first rerun one or two of my successful cases and compare the outcomes.
It would also be interesting to know whether there is any seasonal signal in when the failing runs occur. Are they at a time of fast melt, hard freezing, etc.? Could you document the failed run dates and the day on which they fail?
@DeniseWorthen I think I need to chime in. We have many crashed cases like that in the C384 runs, mainly in May/June.
@DeniseWorthen @bingfu-NOAA - I haven't looked exhaustively, but they seem to be in the 2-3 month range (May 01 start). And it is definitely variable. For example, C192mx025_1995050100 mem003 crashed at hour 1627, while mem010 crashed at 1960.
So, just for completeness in helping keep track of this issue, more information here:
@dabail10, tagging you in for awareness.
@NeilBarton-NOAA I will have plots of the ice distribution with and without this change later today.
So what happens if you use an "unmodified" initial file? It would be good to plot vicen, aicen, and vicen/aicen from these files. I think the problem is that you have two categories where vicen/aicen, i.e. the average thickness, is very close.
@dabail10 - all of my runs are with the unmodified IC files, and those are the files I pointed @NickSzapiro-NOAA to on AWS upthread.
I took Ben's run directory and made a few modifications: (a) I copied the sym-linked fix files referenced in input.nml from the glopara directory on Hercules and updated input.nml accordingly. When launched, the job immediately seg-faulted.
To test whether I had goofed up the settings, I then ran w/ ice_ic=default. The model integrated successfully for 6 hours. For those w/ access to Hercules: ice_ic=default:
The model is seg-faulting when multiplying apeffn*aicen here.
The issue appears to be that aicen>puny where kmt=0 (land). This produces a NaN for the variable apeffn. @benjamin-cash Are you sure you're not using some sort of processed IC? I don't believe a "native" CICE IC would produce ice values on land.
I guess I am confused. Is the initial file from exactly the same configuration of a run? If that is the case, then you wouldn't be able to do a continue/restart run? Like @DeniseWorthen just said.
To the best of my knowledge, I am using the files from https://noaa-ufs-gefsv13replay-pds.s3.amazonaws.com without any further modifications. Never say never, but I don't see anywhere in the scripts where those files are being edited after they are downloaded. @ShanSunNOAA and @bingfu-NOAA, is this where you are getting the CICE ICs from as well?
The IC from AWS (for 2004-05-01) does not have ice over land, but the restart in Ben's run directory does (for 2004-06-28-43200?). The CICE_OUTPUT files didn't copy over, but those would be nice to see. I imagine the NaNs are from the ... Shan's result matches the RTs, where differences only occur for situations that currently abort.
@NickSzapiro-NOAA I don't follow. In the rundir I got from Ben, it is:
In that rundir, if I use kmt (from ...) it seems pretty clear to me that aice>0 over land in the initial condition. Secondly, we regularly test RTs in debug mode; our RTs would fail if the NaNs were because of an uninitialized variable w/ our "normal" ICs. Finally, I added this code to CICE:
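(The exact snippet is not reproduced here; a stand-alone sketch of a diagnostic along these lines, with schematic array shapes and names rather than the actual CICE data structures:)

    subroutine report_ice_on_land(nx, ny, kmt, aice, apeffn, puny, nfound)
      ! Print any cell where the land mask says land (kmt = 0) but the ice
      ! concentration is nonzero, together with the pond variable that goes NaN.
      implicit none
      integer,      intent(in)  :: nx, ny
      real(kind=8), intent(in)  :: kmt(nx,ny)      ! land mask, 0 over land
      real(kind=8), intent(in)  :: aice(nx,ny)     ! total ice concentration
      real(kind=8), intent(in)  :: apeffn(nx,ny)   ! effective pond area (one category)
      real(kind=8), intent(in)  :: puny
      integer,      intent(out) :: nfound
      integer :: i, j
      nfound = 0
      do j = 1, ny
         do i = 1, nx
            if (kmt(i,j) < 0.5d0 .and. aice(i,j) > puny) then
               nfound = nfound + 1
               write(*,'(a,2i6,2es15.6)') ' ice on land: i, j, aice, apeffn = ', &
                    i, j, aice(i,j), apeffn(i,j)
            end if
         end do
      end do
    end subroutine report_ice_on_land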
Grepping stdout shows that only when kmt=0 and aice>0 do you get NaN in apeffn:
@DeniseWorthen - if the problem was the presence of aice>0 over land, why wouldn't all members of the ensemble fail? I'm still struggling with that one. I have to wonder if somehow the IC file was overwritten during the run in that run directory, because I'm at a loss otherwise to explain how it could be different from the file on AWS.
The problem is that you don't know what variable array in memory you're corrupting when you (I'm guessing) divide by 0. If you're seg-faulting in debug mode, that is job one to figure out. There's no point (imho) in trying to figure out the behaviour of a model which is showing NaNs in debug mode.
I've also looked at a run directory on Hera that Shan provided. Her IC is clearly different. As I explained to her, 1) there are _FillValue attributes (the _FillValue is NaN); these are not written by CICE in a restart file. And 2) there is an additional array (aicen_orig) which is also not written by CICE. Her IC looks to be "processed" somehow. However, the same "aice>0 on land" problem is still present.
Your initial condition file does not contain either of those, and appears to be an "unprocessed" initial condition. However, it still has aice>0 on land.
Hi @DeniseWorthen - Thanks for the explanation. I downloaded the IC file again from the AWS bucket and confirmed that it is identical to the file in my run directory. It sounds then like the ice IC files on HPSS are not the same as the files on AWS, which, if nothing else comes out of this, was important to uncover. I wonder if the HPSS files have been further processed via @NeilBarton-NOAA's https://github.com/NeilBarton-NOAA/REPLAY_ICS/blob/main/SCRIPTS/CICE_ic_edit.py code? Neil, is that correct? If someone can put one of @ShanSunNOAA's ICs somewhere I can see it, I will run Neil's code against my corresponding file and see if I get the same result.
Oh! Something else that occurs to me - it would be very interesting to analyze the files used for the C96mx100 runs for these same dates, since apparently none of those runs crashed with this issue.
@benjamin-cash You can "fix" your own ICs:
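(The actual fix here used NCO on the IC file. As an illustrative equivalent only, the sketch below uses the netCDF-Fortran library to zero the category state over land; the file names, the variable list, and taking kmt from a separate grid file are assumptions for illustration, not a description of the command that was actually run.)

    program zero_ice_on_land
      ! Sketch: zero the category state in a CICE IC/restart file wherever the
      ! land mask says land.
      use netcdf
      implicit none
      integer :: nc_ic, nc_grid, varid, dimids(3)
      integer :: ni, nj, ncat, n, v
      real(kind=8), allocatable :: kmt(:,:), fld(:,:,:)
      character(len=16), parameter :: vars(3) = &
           [character(len=16) :: 'aicen', 'vicen', 'vsnon']   ! category state to zero
      character(len=*), parameter :: ic_file   = 'cice_ic.nc'     ! hypothetical name
      character(len=*), parameter :: grid_file = 'cice_grid.nc'   ! hypothetical name

      ! Open the IC file for in-place modification and size everything from
      ! aicen (assumed shape ni x nj x ncat).
      call check( nf90_open(ic_file, NF90_WRITE, nc_ic) )
      call check( nf90_inq_varid(nc_ic, 'aicen', varid) )
      call check( nf90_inquire_variable(nc_ic, varid, dimids=dimids) )
      call check( nf90_inquire_dimension(nc_ic, dimids(1), len=ni) )
      call check( nf90_inquire_dimension(nc_ic, dimids(2), len=nj) )
      call check( nf90_inquire_dimension(nc_ic, dimids(3), len=ncat) )
      allocate(kmt(ni,nj), fld(ni,nj,ncat))

      ! Land mask from the grid file: kmt = 0 over land.
      call check( nf90_open(grid_file, NF90_NOWRITE, nc_grid) )
      call check( nf90_inq_varid(nc_grid, 'kmt', varid) )
      call check( nf90_get_var(nc_grid, varid, kmt) )
      call check( nf90_close(nc_grid) )

      ! Zero each category-state variable over land and write it back.
      ! (A complete fix would also treat the remaining tracers consistently.)
      do v = 1, size(vars)
         call check( nf90_inq_varid(nc_ic, trim(vars(v)), varid) )
         call check( nf90_get_var(nc_ic, varid, fld) )
         do n = 1, ncat
            where (kmt < 0.5d0) fld(:,:,n) = 0.0d0
         end do
         call check( nf90_put_var(nc_ic, varid, fld) )
      end do
      call check( nf90_close(nc_ic) )

    contains
      subroutine check(ierr)
        integer, intent(in) :: ierr
        if (ierr /= nf90_noerr) then
           print *, trim(nf90_strerror(ierr))
           stop 1
        end if
      end subroutine check
    end program zero_ice_on_land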
When I did this and used it w/ the model compiled in debug mode, the model completed 6 hours of integration. I can now try to run it out and see what happens. No guarantee this will fix it, but we at least know it's not initializing weird.
@dabail10 @benjamin-cash The ICs on AWS were also modified by SOCA/JEDI DA. The ICs I processed (on HPSS) take the AWS ICs and remove ice because of the dvice issue (obviously that doesn't work for all cases). The C96mx100 CICE ICs are interpolated from the mx025 ICs, which suggests that the interpolation, or running at a lower resolution, removed this issue. @dabail10 I tried removing the ice where two categories of vicen/aicen were very close in the ICs and aice was "small", but I wasn't successful. This was about a year ago and could be examined more. @guillaumevernieres for awareness.
@DeniseWorthen - Good to know! I still would like to make the comparison to the ICs that @ShanSunNOAA used and figure out exactly how and why they are different, because this is a place where I've been worried differences might creep in.
@NeilBarton-NOAA - it would be very interesting to check and see if those lower-resolution IC files do or do not include aicen>0 over land.
@NickSzapiro-NOAA Your quick fix resolved all three of my cases that originally crashed due to negative dvice; all have successfully completed 3 months. Thank you again for efficiently and effectively pinpointing the issue.
@ShanSunNOAA could you put one of your run directories on Hercules so I can compare IC files?
@benjamin-cash Good idea; however, I am battling a Globus error. Will let you know when I am successful.
@benjamin-cash Can you provide the ice history files and ice restarts (the ones in CICE_RESTART) from the run directory you placed on Hercules?
@DeniseWorthen - the restarts are transferring now (to the CICE_RESTART directory properly this time), as well as the history files. The history files are going in a directory labeled ice_history in that same run directory I transferred.
@benjamin-cash I see the CICE restarts, but the history files are still just sym-links to directories on Frontera (?)
@DeniseWorthen are you looking in /work/noaa/nems/cash/ice_fail/fcst.37280/ice_history? Those should be real netCDF files.
OK, now I see them. Thanks.
Has anyone ever noticed that you can't ncrcat these output files because the forecast-hour character string is not consistent: fXX, fXXX, fXXXX?
Yes - definitely annoying. As an update, I was able to confirm just now that the difference between my ice ICs and @ShanSunNOAA's is that her files were further processed with @NeilBarton-NOAA's CICE_ic_edit.py code and mine were not. Since her files were still crashing with this error, it seems like we still probably need @NickSzapiro-NOAA's fix to get past it, especially since it does not seem to be changing the answers otherwise.
@benjamin-cash Shan's IC also had ice on land. It will also probably seg-fault if run in debug mode. My run on Hercules w/ the fixed ICs (i.e., w/ the NCO processing to remove all ice on land) did stop w/ negative dvice at Jun 16. But I have the Jun 15th restart, which means we can now debug the actual issue by restarting and tracking what is happening.
I'm just happy to have identified the source of the differences in our ICs at this point. :) Also very glad to know that you have a case that you can now interrogate in detail!