
Model crash, negative dvice #2562

Open
benjamin-cash opened this issue Jan 19, 2025 · 89 comments

@benjamin-cash

I am running the SFS configuration of the model at C192mx025, with the global-workflow and ufs-weather-model hashes as given here.

For the most part the runs have been running stably, but I have seen a significant number of crashes with error messages like the following:

PASS: fcstRUN phase 1, n_atmsteps =            10824 time is         2.919200

  (shift_ice)shift_ice: negative dvice
  (shift_ice)boundary, donor cat:           3           4
  (shift_ice)daice =  0.000000000000000E+000
  (shift_ice)dvice = -2.944524469334827E-065
    (icepack_warnings_setabort) T :file icepack_itd.F90 :line          551
 (shift_ice) shift_ice: negative dvice
 (icepack_warnings_aborted) ... (shift_ice)
 (icepack_warnings_aborted) ... (linear_itd)
 (icepack_warnings_aborted) ... (icepack_step_therm2)
 (icepack_warnings_aborted) ... (icepack_step_therm2)

@ShanSunNOAA - is this the same issue you have been seeing?

@ShanSunNOAA
Collaborator

ShanSunNOAA commented Jan 19, 2025 via email

@NickSzapiro-NOAA
Collaborator

Maybe a first thing to check is whether the CICE initial conditions for the crash cases are odd from the start. As mentioned in CICE-Consortium/Icepack#333, all IC ice thicknesses (vice(n)/aice(n)) should be within the category bounds hin_max. This check can be done offline or during initialization.

I'm not sure of the origins of these ICs, and there is a permissions issue with the Google Sheets link, but maybe the ICs or run directories are available somewhere?
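
For anyone who wants to try that offline check, here is a minimal sketch (not an official tool). It assumes the default kcatbound=0 / kitd=1 category boundaries and the standard CICE restart variable names aicen/vicen; the file name is the IC linked later in this thread, so adjust as needed.

import numpy as np
from math import tanh
from netCDF4 import Dataset

def hin_max_bounds(ncat):
    # Category thickness bounds (m) for the default kcatbound=0 scheme in Icepack.
    cc1, cc2, cc3 = 3.0 / ncat, 45.0 / ncat, 3.0
    bounds = [0.0]
    for n in range(1, ncat + 1):
        x1 = (n - 1) / ncat
        bounds.append(bounds[-1] + cc1 + cc2 * (1.0 + tanh(cc3 * (x1 - 1.0))))
    return np.array(bounds)

puny = 1.0e-11
with Dataset("iced.2012-05-01-10800.nc") as nc:          # IC file linked below
    aicen = np.ma.filled(nc.variables["aicen"][:], 0.0)  # (ncat, nj, ni)
    vicen = np.ma.filled(nc.variables["vicen"][:], 0.0)

bounds = hin_max_bounds(aicen.shape[0])
for n in range(aicen.shape[0]):
    has_ice = aicen[n] > puny
    hbar = np.where(has_ice, vicen[n] / np.where(has_ice, aicen[n], 1.0), 0.0)
    bad = has_ice & ((hbar < bounds[n]) | (hbar > bounds[n + 1]))
    print(f"category {n + 1}: bounds [{bounds[n]:.3f}, {bounds[n + 1]:.3f}] m, "
          f"out-of-bounds cells: {int(bad.sum())}")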

@benjamin-cash
Author

Hi @NickSzapiro-NOAA - thanks for pointing out that issue. I will definitely do that check, although at first glance there are some differences from that issue. This is not occurring rarely in my case: at last count I had 47 failures of this kind out of 231 runs. Some of the crashes are coming 2+ months into the simulation, which also seems a bit odd for an IC bug. Each ensemble member is also using the same ice initial file, and not all ensemble members are crashing.

Having said all that, here is a link to one of the ice initial files on AWS that is associated with a crash:
https://noaa-ufs-gefsv13replay-pds.s3.amazonaws.com/2012/05/2012050106/iced.2012-05-01-10800.nc

The runs are being performed on Frontera, which I could either give you access to or I could transfer a run directory offsite.

@NeilBarton-NOAA - I remember you saying there was an issue with sea ice ICs in the past that you had developed a workaround for, but looking at the code it seems like that was related to the ice edge and not the thickness bounds Nick mentioned here.

@NickSzapiro-NOAA
Collaborator

I don't see any problems with the thickness categories in that IC file.

These negative-dvice aborts at around -1e-65 really seem too small to matter, particularly relative to a_min/m_min/hi_min and zap_small_areas in Icepack. One test is to change the check so that the donor increment only has to be greater than -puny (like Dave Bailey opened in CICE-Consortium/Icepack#338).

I also wonder how much residual ice (CICE-Consortium/CICE#645) is just present in these runs.

Before really looking into cases or modifications, maybe @DeniseWorthen and @NeilBarton-NOAA have thoughts.

@benjamin-cash
Author

Hi @NickSzapiro-NOAA - some more context for this problem is that (to my knowledge) it does not appear in the case where the atmosphere and ocean are reduced to 1 degree (C96mx100). So far we are only seeing it in the C192mx025 runs, where we are using those IC files I pointed you to directly.

One thing I have not yet done is any kind of analysis of the ice in those runs, to see if there is something pathological going on. That's first up on my agenda for today. @ShanSunNOAA do you have any insights from your crashes?

@ShanSunNOAA
Collaborator

I was testing the addition of the -ftz flag (flush-to-zero), but Denise pointed out that it was already in place. Why didn’t the flag work as expected and set the e-67 value to zero?

@benjamin-cash
Author

@NickSzapiro-NOAA - If there was a diagnostic field that would give some clues as to what might be going on here, do you have a sense for what it would be? For example, I see a lot of very small aice_h values (e.g., 8.168064e-11, 7.310505e-14), but I'm not familiar enough with CICE to know what to make of these.
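
For what it's worth, a quick way to quantify how widespread those tiny values are is to count them against Icepack's puny threshold. A minimal sketch, assuming the history variable is named aice_h as quoted above and using a placeholder file name:

import numpy as np
import xarray as xr

puny = 1.0e-11
ds = xr.open_dataset("iceh_example.nc")        # placeholder history file name
aice = ds["aice_h"].squeeze().values

tiny = (aice > 0.0) & (aice < puny)
sparse = (aice > 0.0) & (aice < 0.15)          # below a 15% "ice edge"
print(f"cells with 0 < aice_h < puny: {np.count_nonzero(tiny)}")
print(f"cells with 0 < aice_h < 0.15: {np.count_nonzero(sparse)}")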

@NickSzapiro-NOAA
Collaborator

I would say that very sparse ice is more a technical issue than a physical one, as it may not be moving, melting, or freezing, depending on some threshold dynamics and thermodynamics settings (see CICE-Consortium/CICE#645).

It seems reasonable that the sparse areas are associated with these crashes, but we can confirm. I'm happy to try to log more information about what's happening; I don't know if you would need to provide a run directory or whether this is reproducible via global-workflow or similar on RDHPCS.

If you're open to experimenting, the first "quick fix" that comes to mind is editing ufs-weather-model/CICE-interface/CICE/icepack/columnphysics/icepack_itd.F90

diff --git a/columnphysics/icepack_itd.F90 b/columnphysics/icepack_itd.F90
index 013373a..32debc2 100644
--- a/columnphysics/icepack_itd.F90
+++ b/columnphysics/icepack_itd.F90
@@ -462,7 +462,7 @@ subroutine shift_ice (trcr_depend,           &
                nd = donor(n)

                if (daice(n) < c0) then
-                  if (daice(n) > -puny*aicen(nd)) then
+                  if (daice(n) > -puny) then
                      daice(n) = c0 ! shift no ice
                      dvice(n) = c0
                   else
@@ -471,7 +471,7 @@ subroutine shift_ice (trcr_depend,           &
                endif

                if (dvice(n) < c0) then
-                  if (dvice(n) > -puny*vicen(nd)) then
+                  if (dvice(n) > -puny) then
                      daice(n) = c0 ! shift no ice
                      dvice(n) = c0

The reasoning is that this is more consistent with what is cleared in the zap_small_areas routine in cleanup_itd.
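
To make the reasoning concrete, here is the arithmetic with assumed numbers (the dvice magnitude is taken from the abort message above; the donor volume is an assumed sparse-ice value, not from an actual run):

puny = 1.0e-11
dvice = -2.9e-65         # order of magnitude from the abort message
vicen_nd = 1.0e-60       # assumed donor-category volume in a very sparse cell

original_guard = dvice > -puny * vicen_nd   # -2.9e-65 > -1.0e-71 -> False, so abort
relaxed_guard  = dvice > -puny              # -2.9e-65 > -1.0e-11 -> True, so zeroed
print(original_guard, relaxed_guard)        # False True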

@benjamin-cash
Author

My runs are using global-workflow but on Frontera via containers, so that might not be the easiest test case to give you. @ShanSunNOAA, what system are you making your runs on? I could also globus one of my run directories to Hercules.

@ShanSunNOAA
Collaborator

Thanks @NickSzapiro-NOAA for a "quick fix"! I saved the crashed ice output under /scratch2/BMC/gsd-fv3-dev/sun/hr4_1013/COMROOT/c192mx025/gefs.20051101/00/mem000/model/ice/history_bad/.

I am testing your fix on Hera right now. Will let you know how it turns out later today.

Thanks!

@benjamin-cash
Author

Thanks @ShanSunNOAA ! It takes a while for jobs to get through the queue on Frontera so I wouldn't be able to test nearly so quickly. @NickSzapiro-NOAA , if this does turn out to fix the problem, would you expect it to change answers more generally? I.e., would the successful runs need to be redone for consistency?

@NickSzapiro-NOAA
Collaborator

I think this is the smallest change that would help, and it is a localized change affecting only increments in the range -puny < d{a,v}ice < -puny*{a,v}icen(nd), where puny=1.0e-11_dbl_kind... so the (hand-wavy) expectation is roundoff-level differences to the 32-bit atmosphere.

A bigger change would be to more actively remove very sparse ice.

@ShanSunNOAA
Collaborator

ShanSunNOAA commented Jan 22, 2025

@NickSzapiro-NOAA The model crashed at the same time step as earlier, with a different error message:

1241:
1241: (shift_ice)shift_ice: dvice > vicen
1241: (shift_ice)boundary, donor cat: 3 4
1241: (shift_ice)dvice = 0.000000000000000E+000
1241: (shift_ice)vicen = -1.410289012658147E-065
1241: (icepack_warnings_setabort) T :file icepack_itd.F90 :line 594
1241: (shift_ice) shift_ice: dvice > vicen
1241: (icepack_warnings_aborted) ... (shift_ice)
1241: (icepack_warnings_aborted) ... (linear_itd)
1241: (icepack_warnings_aborted) ... (icepack_step_therm2)

Now it is vicen, not dvice, that has a value of about -1e-65. Should the same treatment be applied to vicen?
Thanks!

@NickSzapiro-NOAA
Collaborator

Yes, sorry about that, @ShanSunNOAA. Would you mind re-testing a change to all 4 conditions:

diff --git a/columnphysics/icepack_itd.F90 b/columnphysics/icepack_itd.F90
index 013373a..5d81bc3 100644
--- a/columnphysics/icepack_itd.F90
+++ b/columnphysics/icepack_itd.F90
@@ -462,7 +462,7 @@ subroutine shift_ice (trcr_depend,           &
                nd = donor(n)

                if (daice(n) < c0) then
-                  if (daice(n) > -puny*aicen(nd)) then
+                  if (daice(n) > -puny) then
                      daice(n) = c0 ! shift no ice
                      dvice(n) = c0
                   else
@@ -471,7 +471,7 @@ subroutine shift_ice (trcr_depend,           &
                endif

                if (dvice(n) < c0) then
-                  if (dvice(n) > -puny*vicen(nd)) then
+                  if (dvice(n) > -puny) then
                      daice(n) = c0 ! shift no ice
                      dvice(n) = c0
                   else
@@ -480,7 +480,7 @@ subroutine shift_ice (trcr_depend,           &
                endif

                if (daice(n) > aicen(nd)*(c1-puny)) then
-                  if (daice(n) < aicen(nd)*(c1+puny)) then
+                  if (daice(n) < aicen(nd)+puny) then
                      daice(n) = aicen(nd)
                      dvice(n) = vicen(nd)
                   else
@@ -489,7 +489,7 @@ subroutine shift_ice (trcr_depend,           &
                endif

                if (dvice(n) > vicen(nd)*(c1-puny)) then
-                  if (dvice(n) < vicen(nd)*(c1+puny)) then
+                  if (dvice(n) < vicen(nd)+puny) then
                      daice(n) = aicen(nd)
                      dvice(n) = vicen(nd)
                   else

@ShanSunNOAA
Collaborator

Thank you, @NickSzapiro-NOAA, for your prompt response! I am testing it now.

@benjamin-cash
Author

@NickSzapiro-NOAA - assuming this does fix the problem, how quickly do you think this can get incorporated as a PR? I can't help but notice that CICE-Consortium/CICE#645 has been open since 2021 and has had no activity since May of last year.

@benjamin-cash
Author

@ShanSunNOAA - what is the earliest crash you see? Most of my runs are crashing at around the 50-day, 7 wallclock hour mark, which makes testing awkward.

@ShanSunNOAA
Collaborator

@NickSzapiro-NOAA Good news - with your quick fix, my run successfully completed 3 months. Thank you again for your prompt help late last night - it made it possible to run it overnight, as it takes 4-5 hours to reach the crashing point.
@benjamin-cash My crashed runs typically occur around days 50–60 as well. In this particular case, it originally crashed on day 54.

@benjamin-cash
Author

@ShanSunNOAA thanks for the quick reply! Do you have a fork of Icepack you can include the fix in, so we can be sure we are working off the same code?

@ShanSunNOAA
Collaborator

@benjamin-cash I don't have a fork for this. @NickSzapiro-NOAA Are you going to submit a PR?

@NickSzapiro-NOAA
Collaborator

Good to hear.

In UFS, we use an EMC fork of CICE that uses the CICE-Consortium/Icepack submodule directly. Let me follow up at CICE-Consortium.

@benjamin-cash
Author

@ShanSunNOAA - could you confirm which hash of Icepack you are using in your runs? I've created a fork and a branch with the proposed fix, but I want to be sure that I haven't gotten my submodules out of sync.

@DeniseWorthen
Collaborator

Just a word of caution: while one previously failing case may now pass, that doesn't preclude some previously passing case from now failing. So unless you run the entire set of runs, you don't really know if this is a fix.

@benjamin-cash
Author

@DeniseWorthen that's definitely a concern. Plus it is a change from the C96mx100 baseline configuration. On the other hand, it is a show-stopper bug for the C192mx025 runs, so we definitely need to do something. My thinking at this point is to first rerun one or two of my successful cases and compare the outcomes.

@DeniseWorthen
Collaborator

What would also be interesting to know is whether there is any seasonal signal in when the failing runs occur. Are they at a time of fast melt, hard freezing, etc.? Could you document the failed run dates and the day on which they fail?
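
If it helps with the bookkeeping, something along these lines could tabulate the failures; the directory layout and log file name are placeholders, and the search string is the abort message at the top of this issue:

import re
from pathlib import Path

for log in sorted(Path("runs").glob("*/model.log")):   # placeholder layout
    last_step = None
    for line in log.read_text(errors="ignore").splitlines():
        m = re.search(r"n_atmsteps =\s+(\d+)", line)
        if m:
            last_step = int(m.group(1))
        if "negative dvice" in line:
            print(f"{log.parent.name}: negative dvice abort after n_atmsteps ~ {last_step}")
            break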

@bingfu-NOAA

@DeniseWorthen I think I need to chime in. We have many crashed cases like that in the C384 runs, mainly in May/June.

@benjamin-cash
Author

@DeniseWorthen @bingfu-NOAA - I haven't looked exhaustively, but they seem to be in the 2-3 month range (May 01 start). And it is definitely variable. For example, C192mx025_1995050100 mem003 crashed at hour 1627, while mem010 crashed at hour 1960.

@DeniseWorthen
Collaborator

So, just for completeness in helping keep track of this issue, more information here:
https://bb.cgd.ucar.edu/cesm/threads/model-abort-due-to-dvice-negative-moved-from-cice-issues.8940/#post-52031

@benjamin-cash
Author

@dabail10, tagging you in for awareness.

@ShanSunNOAA
Collaborator

@NeilBarton-NOAA I will have plots of the ice distribution with and without this change later today

@dabail10

Does anyone have plots of the ice distribution with and without this change?

@NeilBarton-NOAA
Collaborator

@dabail10 this issue is an extension to https://bb.cgd.ucar.edu/cesm/threads/model-abort-due-to-dvice-negative-moved-from-cice-issues.8940/

I wrote a script that removes small ice in areas of small ice values outside the ice edge (defined by 15%) in the ICs. This helped with a lot of cases, but obviously not enough. We have tried to remove ice from more cells, but we were not successful in those runs. The script is at https://github.com/NeilBarton-NOAA/REPLAY_ICS/blob/main/SCRIPTS/CICE_ic_edit.py

Note, the ICs are from a DA run, and the C384 and C192 runs have the same resolution and CICE ICs.

@dabail10

So what happens if you use an "unmodified" initial file? It would be good to plot vicen, aicen, and vicen/aicen from these files. I think the problem is that you have two categories where vicen/aicen, i.e. the average thickness, is very close.
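
A rough sketch of that diagnostic, per category pair, from one of the IC files (the variable names follow the CICE restart convention; the closeness tolerance is arbitrary and only for illustration):

import numpy as np
from netCDF4 import Dataset

puny = 1.0e-11
with Dataset("iced.2012-05-01-10800.nc") as nc:
    aicen = np.ma.filled(nc.variables["aicen"][:], 0.0)   # (ncat, nj, ni)
    vicen = np.ma.filled(nc.variables["vicen"][:], 0.0)

has_ice = aicen > puny
hbar = np.where(has_ice, vicen / np.where(has_ice, aicen, 1.0), np.nan)   # mean thickness per category
for n in range(hbar.shape[0] - 1):
    both = np.isfinite(hbar[n]) & np.isfinite(hbar[n + 1])
    close = both & (np.abs(hbar[n] - hbar[n + 1]) < 1.0e-3)   # illustrative tolerance
    print(f"categories {n + 1}/{n + 2}: cells with nearly equal mean thickness: {int(close.sum())}")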

@benjamin-cash
Author

@dabail10 - all of my runs are with the unmodified IC files, and those are the files I pointed @NickSzapiro-NOAA to on AWS up thread.

@DeniseWorthen
Collaborator

DeniseWorthen commented Jan 23, 2025

I took Ben's run directory and made a few modifications:

a) copied the sym-linked fix files in input.nml from the glopara directory on Hercules and updated the input.nml
b) adjusted the ATM restart frequency and fh_out to be 120 hours and 6 -1, respectively
c) generated a job card on Hercules using the sfs test.
d) compiled S2S in Debug mode (-DAPP=S2S -D32BIT=ON -DHYDRO=ON -DCCPP_SUITES=FV3_GFS_v17_coupled_p8_ugwpv1 -DDEBUG=ON), since neither WAV nor CHM is being used. This was the HR4 tag (fcc9f84).

When launched, the job immediately seg-faulted:

709: [hercules-01-24:3102859:0:3102859] Caught signal 8 (Floating point exception: floating-point invalid operation)
630: ==== backtrace (tid:4124176) ====
630:  0 0x000000000005f14c ucs_callbackq_cleanup()  ???:0
630:  1 0x000000000005f40a ucs_callbackq_cleanup()  ???:0
630:  2 0x0000000000054d90 __GI___sigaction()  :0
630:  3 0x000000000e8ce518 ice_init_column_mp_init_shortwave_()  /work/noaa/nems/dworthen/ufs_hr4/CICE-interface/CICE/cicecore/shared/ice_init_column.F90:436

To test whether I had goofed up the settings, I then ran w/ ice_ic=default. The model integrated successfully for 6 hours.

For those w/ access to Hercules:

ice_ic=default: /work2/noaa/stmp/dworthen/stmp/dworthen/negdvice/fcst.37280
ice_ic=cice_model.res.nc: /work2/noaa/stmp/dworthen/stmp/dworthen/negdvice/test.ciceic.res

@DeniseWorthen
Collaborator

DeniseWorthen commented Jan 23, 2025

The model is seg-faulting when multiplying apeffn*aicen here

                  apeff_ai(i,j,iblk) = apeff_ai(i,j,iblk) &
                       + apeffn(i,j,n,iblk)*aicen(i,j,n,iblk)

The issue appears to be that aicen>puny where kmt=0 (land). This produces a NaN for the variable apeffn.

@benjamin-cash Are you sure you're not using some sort of processed IC? Because I don't believe a "native" CICE IC would produce ice values on land.
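
An offline version of that check is straightforward. A minimal sketch, using the mask file and variable name (kmt) mentioned later in this thread:

import numpy as np
from netCDF4 import Dataset

puny = 1.0e-11
with Dataset("kmtu_cice_NEMS_mx025.nc") as nc:
    kmt = np.squeeze(np.ma.filled(nc.variables["kmt"][:], 0.0))
with Dataset("cice_model.res.nc") as nc:
    aicen = np.ma.filled(nc.variables["aicen"][:], 0.0)    # (ncat, nj, ni)

aice = aicen.sum(axis=0)
on_land = (kmt == 0) & (aice > puny)
print(f"land cells with aice > puny: {np.count_nonzero(on_land)}")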

@dabail10

@dabail10 - all of my runs are with the unmodified IC files, and those are the files I pointed @NickSzapiro-NOAA to on AWS up thread.

I guess I am confused. Is the initial file from exactly the same configuration of a run? If that is the case, then you wouldn't be able to do a continue/restart run? Like @DeniseWorthen just said.

@benjamin-cash
Author

To the best of my knowledge, I am using the files from https://noaa-ufs-gefsv13replay-pds.s3.amazonaws.com without any further modifications. Never say never, but I don't see anywhere in the scripts where those files are being edited after they are downloaded. @ShanSunNOAA and @bingfu-NOAA, is this where you are getting the cice ICs from as well?

@ShanSunNOAA
Collaborator

My ICs are taken directly from HPSS, where Neil placed them, without any modifications.
I am comparing ice extent and thickness between the crashed run (last output at hr1296) and the same forecast from a good run with Nick's quick fix. I found that both the ice extent and thickness are identical (based on results from ncdump) between the two runs.

[attached: plots of ice extent and thickness from the two runs]

@NickSzapiro-NOAA
Collaborator

The IC from AWS (for 2004-05-01) does not have ice over land, but the restart in Ben's run directory does (for 2004-06-28-43200?). The CICE_OUTPUT files didn't copy over, but those would be nice to see. I imagine the NaNs are from the -init=snan,arrays compile option with Intel debug.

Shan's result matches the RTs, where differences only occur for situations that currently abort.

@DeniseWorthen
Collaborator

@NickSzapiro-NOAA I don't follow. In the rundir I got from Ben, it is

    runtype            = 'initial'
    runid               = 'unknown'
    ice_ic              = 'cice_model.res.nc'

In that rundir, if I use kmt (from kmtu_cice_NEMS_mx025.nc) and aicen (summed over ncat) from cice_model.res.nc and plot the ice concentration where kmt=0, I get:

[attached: map of ice concentration where kmt=0]

It seems pretty clear to me that aice>0 over land in the initial condition.

Secondly, we regularly test RTs in debug mode. Our RTs would fail if the NaNs were due to an uninitialized variable with our "normal" ICs.

Finally, I added this code to CICE:

diff --git a/cicecore/shared/ice_init_column.F90 b/cicecore/shared/ice_init_column.F90
index 22cd318..cf24131 100644
--- a/cicecore/shared/ice_init_column.F90
+++ b/cicecore/shared/ice_init_column.F90
@@ -197,6 +197,8 @@ subroutine init_shortwave
       use ice_grid, only: tlat, tlon, tmask
       use ice_restart_shared, only: restart, runtype
       use ice_state, only: aicen, vicen, vsnon, trcrn
+      ! debug
+      use ice_grid, only: kmt

       integer (kind=int_kind) :: &
          i, j , k    , & ! horizontal indices
@@ -240,6 +242,9 @@ subroutine init_shortwave

       character(len=*), parameter :: subname='(init_shortwave)'

+      ! debug
+      integer :: ig, jg
+
       call icepack_query_parameters(puny_out=puny)
       call icepack_query_parameters(shortwave_out=shortwave)
       call icepack_query_parameters(dEdd_algae_out=dEdd_algae)
@@ -406,6 +411,25 @@ subroutine init_shortwave
       ! Match loop order in coupling_prep for same order of operations
       !-----------------------------------------------------------------

+         ! debug
+         this_block = get_block(blocks_ice(iblk),iblk)
+         ilo = this_block%ilo
+         ihi = this_block%ihi
+         jlo = this_block%jlo
+         jhi = this_block%jhi
+         do n = 1,ncat
+            do j = jlo, jhi
+               do i = ilo, ihi
+                  ig = this_block%i_glob(i)
+                  jg = this_block%j_glob(j)
+                  if (aicen(i,j,n,iblk) > puny) then
+                     print '(a,4i8,2g14.7)','XSW ',ig,jg,n,int(kmt(i,j,iblk),4),apeffn(i,j,n,iblk),aicen(i,j,n,iblk)
+                  end if
+               end do
+            end do
+         end do
+

Grepping stdout shows that only when kmt=0 and aice>0 do you get NaN in apeffn:

706: XSW      440     807       1       0           NaN 0.1202263
706: XSW      440     808       1       0           NaN 0.1202263
706: XSW      440     809       1       0           NaN 0.1202263
707: XSW      478     827       1       0           NaN 0.5506212
707: XSW      478     828       1       0           NaN 0.8399250

@benjamin-cash
Author

@DeniseWorthen - if the problem was the presence of aice>0 over land, why wouldn't all members of the ensemble fail? I'm still struggling with that one.

I have to wonder if somehow the IC file was overwritten during the run in that run directory, because I'm at a loss otherwise to explain how it could be different from the file on AWS.

@DeniseWorthen
Collaborator

DeniseWorthen commented Jan 24, 2025

The problem is that you don't know what variable array in memory you're corrupting when you (I'm guessing) divide by 0. If you're seg-faulting in debug mode, that is job one to figure out. There's no point (imho) in trying to figure out the behaviour of a model which is showing NaNs in debug mode.

I've also looked at a run directory on Hera that Shan provided. Her IC is clearly different. As I explained to her, 1) there are _FillValue attributes (the _FillValue is NaN), which are not written by CICE in a restart file, and 2) there is an additional array (aicen_orig) which is also not written by CICE. Her IC looks to be "processed" somehow. However, the same "aice>0 on land" problem is still present.

Your initial condition file does not contain either of those, and appears to be an "unprocessed" initial condition. However, it still has aice>0 on land.

@benjamin-cash
Author

Hi @DeniseWorthen - Thanks for the explanation. I downloaded the IC file again from the AWS bucket and confirmed that it is identical to the file in my run directory. It sounds, then, like the ice IC files on HPSS are not the same as the files on AWS, which, if nothing else comes out of this, was important to uncover.

I wonder if the HPSS files have been further processed via @NeilBarton-NOAA's https://github.com/NeilBarton-NOAA/REPLAY_ICS/blob/main/SCRIPTS/CICE_ic_edit.py code? Neil, is that correct? If someone can put one of @ShanSunNOAA's ICs somewhere I can see it, I will run Neil's code against my corresponding file and see if I get the same result.

@benjamin-cash
Author

Oh! Something else that occurs to me - it would be very interesting to analyze the files used for the C96mx100 runs for these same dates, since apparently none of those runs crashed with this issue.

@DeniseWorthen
Collaborator

@benjamin-cash You can "fix" your own cice_model.res.nc with the following NCO commands (the first just appends the kmt array so you can use it to mask the other variables):

ncks -A kmtu_cice_NEMS_mx025.nc cice_model.res.nc
ncap2 -s 'where(kmt==0) aicen=0.0' cice_model.res.nc
ncap2 -s 'where(kmt==0) vicen=0.0' cice_model.res.nc
ncap2 -s 'where(kmt==0) vsnon=0.0' cice_model.res.nc
ncap2 -s 'where(kmt==0) Tsfcn=0.0' cice_model.res.nc

When I did this, and used it w/ the model compiled in debug mode, the model completed 6 hours of integration. I can now try to run it out and see what happens; no guarantee this will fix it, but we at least know it's not initializing weird.

@NeilBarton-NOAA
Collaborator

@dabail10 @benjamin-cash The ICs on AWS were also modified by the SOCA/JEDI DA. The ICs I processed (on HPSS) take the AWS ICs and remove ice because of the dvice issue (obviously this doesn't work for all cases). The C96mx100 CICE ICs are interpolated from the mx025 ICs, which suggests that the interpolation, or running at a lower resolution, removed this issue.

@dabail10 I tried removing the ice where two categories of vicen / aicen were very close in the ICs and aice was "small", but I wasn't successful. This was about a year ago and could be examined more. @guillaumevernieres for awareness

@benjamin-cash
Author

@DeniseWorthen - Good to know! I still would like to make the comparison to the ICs that @ShanSunNOAA used and figure out exactly how and why they are different, because this is a place where I've been worried differences might creep in.

@benjamin-cash
Author

@NeilBarton-NOAA - it would be very interesting to check and see if those lower-resolution IC files do or do not include aicen>0 over land.

@ShanSunNOAA
Collaborator

@NickSzapiro-NOAA Your quick fix resolved all three of my cases that originally crashed due to negative dvice; all have successfully completed 3 months. Thank you again for efficiently and effectively pinpointing the issue.

@benjamin-cash
Author

@ShanSunNOAA could you put one of your run directories on Hercules so I can compare IC files?

@ShanSunNOAA
Collaborator

@benjamin-cash Good idea; however, I am battling a Globus error. Will let you know when I am successful.

@DeniseWorthen
Collaborator

DeniseWorthen commented Jan 24, 2025

@NickSzapiro-NOAA - If there was a diagnostic field that would give some clues as to what might be going on here, do you have a sense for what it would be? For example, I see a lot of very small aice_h values (e.g., 8.168064e-11, 7.310505e-14), but I'm not familiar enough with CICE to know what to make of these.

@benjamin-cash Can you provide the ice history files and ice restarts (the ones in CICE_RESTART) from the run directory you placed on Hercules?

@benjamin-cash
Author

@DeniseWorthen - the restarts are transferring now (to the CICE_RESTART directory properly this time), as well as the history files. The history files are going in a directory labeled ice_history in that same run directory I transferred.

@DeniseWorthen
Collaborator

@benjamin-cash I see the cice restarts, but the history files are still just sym-links to directories on Frontera (?)

@benjamin-cash
Author

@DeniseWorthen are you looking in /work/noaa/nems/cash/ice_fail/fcst.37280/ice_history? Those should be real netcdf files.

@DeniseWorthen
Collaborator

ok, now I see them. Thanks.

@DeniseWorthen
Collaborator

Has anyone ever noticed that you can't ncrcat these output files because the forecast hour character string is not consistent...fXX, fXXX, fXXXX?
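
One hedged workaround (a sketch, with the file-name pattern guessed rather than taken from the actual output): symlink the history files with a zero-padded forecast-hour token so they sort and ncrcat cleanly.

import re
from pathlib import Path

for f in Path(".").glob("*.f*.nc"):                      # pattern is a guess
    m = re.search(r"\.f(\d+)\.", f.name)
    if m and len(m.group(1)) < 4:
        padded = f.name.replace(f".f{m.group(1)}.", f".f{int(m.group(1)):04d}.", 1)
        if not Path(padded).exists():
            Path(padded).symlink_to(f.name)              # then: ncrcat *.f????.nc out.nc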

@benjamin-cash
Author

Has anyone ever noticed that you can't ncrcat these output files because the forecast hour character string is not consistent...fXX, fXXX, fXXXX?

Yes - definitely annoying.

As an update - I was able to confirm just now that the difference between my ice ICs and @ShanSunNOAA's is that her files were further processed with @NeilBarton-NOAA's CICE_ic_edit.py code and mine were not. Since her files were still crashing with this error, it seems like we still probably need @NickSzapiro-NOAA's fix to get past it, especially since it does not seem to be changing the answers otherwise.

@DeniseWorthen
Collaborator

DeniseWorthen commented Jan 24, 2025

I've also looked at a run directory on Hera that Shan provided. Her IC is clearly different. As I explained to her, 1) there are _FillValue attributes (the _FillValue is NaN), which are not written by CICE in a restart file, and 2) there is an additional array (aicen_orig) which is also not written by CICE. Her IC looks to be "processed" somehow. However, the same "aice>0 on land" problem is still present.

@benjamin-cash Shan's IC also had ice on land. It will also probably seg fault if run in debug mode.

My run on Hercules w/ the fixed ICs (i.e., w/ the NCO processing to remove all ice on land) did stop w/ negative dvice at Jun 16. But I have the Jun 15 restart, which means we can now debug the actual issue by restarting and tracking what is happening.

@benjamin-cash
Author

I'm just happy to have identified the source of the differences in our ICs at this point. :) Also very glad to know that you have a case that you can now interrogate in detail!
