CCLM failed with error code 1004

  @jenniferbrauch in #de763f4

Hello,

I have tried to start my ERA-Interim simulation for the extended EURO-CORDEX area again. But somehow I get this interpolation error when CCLM is running, which I can't place (see below).
So far, I used the same number of levels as before (40), and the ERA-Interim caf files had 60 levels. But as far as I can tell, I use the same script from Manuel as before to create the input files.
Any ideas, anyone?

Cheers, and have a nice weekend!

Jenny

Rank 15 [Thu Nov 21 12:28:47 2019] [c2-1c1s10n1] application called MPI_Abort(comm=0x84000002, 1004) – process 15

*   PROGRAM TERMINATED BECAUSE OF ERRORS DETECTED
*   IN ROUTINE:   p_int
*   ERROR CODE is 1004
*   plev for interpolation above model top!
*------------------------------------------------------------*

  @burkhardtrockel in #1d5e625

Can you provide the files INPUT and OUTPUT of the job, please?

  @jenniferbrauch in #e781e88

Hello,
here are the files from int2lm.

Cheers,
Jenny

  @burkhardtrockel in #432f0da

These files look fine to me. I forgot to ask for the YUSPECIF file of the CCLM simulation. Can you provide that one, too, please?

  @jenniferbrauch in #4543211

Here it is!

:)

  @burkhardtrockel in #7477cb3

Have you checked the output files from INT2LM (las and lbfd) and CCLM (lffd)? Are the numbers OK? If a “pint” error occurs, it is most likely that there are some inconsistencies in the initial/boundary data set that cause CCLM to produce unrealistic values, and then the “pint” causes the crash.

FYI: The following namelist parameters differ between your setup and those I used in a successful simulation (and very likely the external data sets are different as well).

In the INT2LM namelist:

itype_w_so_rel=1
itype_rootdp=4
itype_aerosol=1
lfilter_oro=.FALSE.
l_bicub_spl=.FALSE.
czml_soil_in=0.035, 0.175, 0.64, 1.775,

In the CCLM namelist:

 alphaass=1.
 ldiabf_lh=.TRUE.
 itype_bbc_w=1
 itype_aerosol=1
 itype_evsl=4
 lconf_avg=F

  @deleili in #576c92d

Dear colleague,

I have the same problem when running the CCLM model. I ran a CCLM simulation successfully a couple of months ago, but when I rerun the same simulation now, I get the following error:

*    PROGRAM TERMINATED BECAUSE OF ERRORS DETECTED
*              IN ROUTINE:   p_int

...skipping one line
*    ERROR CODE is         1004
*    plev for interpolation above model top!
*------------------------------------------------------------*
*------------------------------------------------------------*

I noticed that there is also a warning:

!!!!*** WARNING ***!!! CFL-criterion for horizontal advection is violated
!!!! Max ( |cfl_x| + |cfl_y| ) =    72.4515109843900       ( > 0.95 *
1.61000000000000       )

However, my setup for this simulation is the same as for the successful simulation. Does anyone have any idea? Thank you!

Best regards,

Delei

  @rolfzentek in #3d36371

Hey Delei,

Generally I get this error when my model crashes, that is, when there are NaN values in the variables. This seems to cause ERROR 1004 when the output is written on pressure levels.

So whenever this happens, I first check the CFL criterion, and if it is violated in the steps prior to the crash, I assume a normal instability of the simulation and change the timestep “dt” in the namelist (normally lowering it).

If this is not the cause of the problem, I fall back to writing output for single time steps by changing the namelist: put a “!” in front of [ hcomb=…. ] and [ ytunit='d' ] to comment them out,
and add
[ ngrib=0,1,2,3,4,5,10,20,30  ! selected time steps to write output ]
and [ ytunit='f' ] (see the sketch below).
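
Something like this, as a minimal sketch only: the group name GRIBOUT and its layout are assumed from standard COSMO/CCLM namelists, so check your own run script or YUSPECIF for the exact names, and keep your original hcomb values for the commented line.

 &GRIBOUT
   ! hcomb=…,                    ! regular output interval, commented out for debugging
   ! ytunit='d',
   ngrib=0,1,2,3,4,5,10,20,30,   ! write output only at these selected time steps
   ytunit='f',                   ! name the output files by time step
 /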

Normally, from that output I can trace back where the error occurred: in which variable the NaN first appeared, and maybe even where inside the domain, and so on…

-----

Concerning your problem, given that you used the same setup, I have no idea. But if you try to trace back the error, you may find the reason it is not working anymore.

For me it would typically be something like a changed default version of cdo that causes problems in my preprocessing setup and results in a model crash (or a temporary/testing change I forgot to revert and didn't remember…).

Just to make sure, maybe compare the YUSPECIF from an old simulation where it worked with the current one, to verify that the setup is really the same (you can use the diff command on Levante, i.e. “diff old_YUSPECIF new_YUSPECIF”).

-----

good luck debugging
Rolf

  @deleili in #2b24ee9

Dear Rolf,

Thank you very much for your comprehensive comments. I have tried them one by one.

(1) I have reduced dt from 18 seconds to 9 seconds for a resolution of 0.0275 degrees, but the issue remains. According to other (and my previous) successful setups, 18 s should be fine.

(2) I indeed have two versions of cdo installed; however, I tried both versions and the issue remains the same. I am not sure whether there are any other environment settings I have changed that could cause the problem… I need to check further.

(3) I ran “diff old_YUSPECIF new_YUSPECIF”, and the files are the same.

(4) I tried writing more frequent output and noticed that the NaN values arise at a specific location in the variables PS and PMSL and then spread to other variables and regions. My dt is 9 s, and the NaNs occur between 27 s and 36 s. I also checked YUPRHUMI and YUPRMASS; they contain almost only NaN values except at ntstep=0.

It seems that I cannot attach files in the channel.  Thanks in advance for any further comments.

Cheers,

Delei

  @rolfzentek in #0b2d088

Dear Delei,

A colleague of mine also had cases where decreasing the timestep helped...

If you test, for example, dt = 10, 15, 20, 30, you could compare at which time (not time step, but time) the first NaN occurs and whether increasing or decreasing it makes things better or worse.
(Oh, and does the CFL criterion still get violated with other timesteps?)

But overall, since it happens in the first minute, I guess it won't help...

I have had several cases where we had no problem running at 15 and 5 km resolution, but with a finer resolution (nested inside) the simulation sometimes never became stable enough, no matter the timestep. We didn't spend more time on those, but just skipped them.

--------

In your case I guess (4) is the best way to trace back the problem, since you now know a location where it is happening. (How do the other variables look/behave there just before they become NaN?)

I had one case where there was a very steep mountain (>500 m height increase from one grid point to the next) where wind speed and temperature would take on unrealistic values before the crash.

Another case was where single grid points just had very unrealistically cold temperatures, which could not be fixed but normally did not cause a problem. Sometimes, however, the temperature would drop even lower and then NaNs would spread from these grid points.

Quick idea, if the point is not in the middle but near the border of your domain: just reduce the domain size to exclude the point. I think you only have to change [ startlon_tot, startlat_tot, ie_tot, je_tot ], so it should be little effort (see the sketch below).
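
A hedged illustration only: the group name LMGRID is assumed from standard COSMO/CCLM namelists and all numbers are made-up examples, so take the real values from your YUSPECIF and adjust them so that the problematic point falls outside the domain.

 &LMGRID
   startlon_tot = -10.0,   ! example: shifted lower-left corner in rotated coordinates
   startlat_tot =  -8.0,   ! example value
   ie_tot = 200,           ! example: reduced number of grid points in x
   je_tot = 180,           ! example: reduced number of grid points in y
 /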

If it works afterwards, you can be more certain that the problem is really at that location, and not a more general problem that just occurs first at that point but would also happen 1-2 minutes later at other points.

Cheers
Rolf

  @deleili in #6ac5383

Dear Rolf,

Thank you very much. I tried many different setups and values of dt, but all of them failed. Specifically, for dt=18 s I ran the same simulation four times and stored the outputs, and found that the NaN values differ between the simulations: they first appear at different locations, at different times, and in different variables in the four runs. This is a really strange and annoying issue. I am not sure whether it is related to the environment setup or to the supercomputer; I will consult IT about this.

As you said, “we didn't spend more time on it, but just skipped those.” What would you suggest? Should I change the model version, or the model domain? Thank you.

Cheers,

Delei

  @rolfzentek in #da0f3fa

Dear Delei,

I never encountered this kind of variability (4 different outputs) with the same setup on the same computer. I often compare different model versions (when we change something), and from this I know that even after 30 h of simulation the difference between the output fields is zero (at least in ncview) - at least when I ran them on the same computer.

[ if you want to track down the problem ]
Are the non-NaN values of the 4 simulations the same?
-> If no, I guess it is a weird chaos effect that causes the crash to happen differently, and I would try to get a stable setup that no longer produces this chaos effect... (using the same node of the supercomputer, for example, if the nodes have different setups)
-> If yes, I have no idea and would start suspecting faulty hardware causing random NaNs.
-> Using only 1 CPU/core (procx=1, procy=1) may also be a setup you could test; see the sketch below. As far as I know, this should not change the simulation output. But I had a case (with INT2LM) where a bug occurred only if parallel computing was used.
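
A sketch of that single-task test; in standard COSMO/CCLM namelists the timestep and the decomposition both live in the RUNCTL group, where the parameters are typically called nprocx/nprocy (check your YUSPECIF for the exact names used by your version):

 &RUNCTL
   dt = 18.0,     ! timestep in seconds (the value from the failing setup)
   nprocx = 1,    ! one MPI task in x -> no domain decomposition
   nprocy = 1,    ! one MPI task in y
 /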

[ if you want a workaround ]
Do you have any working setup right now, from which you could move slowly towards the setup you need... step by step, always checking when the error occurs?

Cheers
Rolf

  @deleili in #5f39d81

Dear Rolf,

Thank you very much for your helpful comments. I finally figured out the issue: I reduced the number of CPUs for the CCLM simulation from 28*32 to 16*24, and then the simulation ran successfully. I am not completely clear about the underlying reason; it may be related to some unstable information exchange between nodes. Anyway, it has been solved. I greatly appreciate your help!

Best regards,

Delei


  @emresalkım in #bcd4b33

Dear Delei,

I am experiencing the same problem right now, and I am using 5*8 (40 cores) for my gcm2cclm runs. Could it be because of the 5:8 proportion? What would you recommend?

Regards,

Emre

  @deleili in #658c75b

Dear Emre,

My previous issue was due to unstable nodes of the supercomputer. Once the IT staff excluded the unstable nodes, the simulation ran successfully. Is it possibly the same problem in your case? I am not sure. You could reduce your number of cores and see what happens.

Regards,

Delei

  @emresalkım in #bfd7e12

Dear Delei,

Thank you very much for your quick reply. As I was first trying to achieve a 0.11 degree resolution (to further downscale to 0.025), I had changed the DT formula in the cclm.job.sh script (in gcm2cclm, line 147) to have 0.11 in the denominator. I found out that was my fault: when I changed it back to 0.44, the CFL failure disappeared.

I have another question, though, which I hesitate to ask in this thread. My supercomputer won't let me run the model on multiple nodes. When I allocate more CPUs (say 80 cores on 2 nodes), the cclm jobs finish very quickly (and incompletely). Do you have any clue why I face this issue?

Regards,

Emre

  @deleili in #766c96e

Dear Emre,

Nice to know that you solved the problem. I have no idea why the model cannot be run on multiple nodes on your supercomputer. Maybe you can consult your IT people about this?

Regards,

Delei

  @emresalkım in #d98a5bc

Dear Delei,

Thank you for your kind message and suggestion. I will take this issue up with them.

Regards,

Emre

  @emresalkım in #7fd9672

Dear Delei,

Hello again. I've been trying to run the model on multiple nodes using various methods, but I'm failing to do so. Would it be possible for you to share a working int2lm.job.sh, cclm.job.sh and also job_settings as an example? I would like to see how you configured those settings.

Yours,

Emre

  @deleili in #85f921f

Dear Emre,

I have no idea how to upload files in the channel. Please write an email to me at deleili@qdio.ac.cn

I will send them to you by email.

Cheers,

Delei