strange problem in cclm – in #9: CCLM

  @iyabelova in #d729e8b

Dear colleagues,

We have a problem in cclm: the program terminates suddenly for no clear reason.
The int2lm script works correctly and all the output files appear.
We tried installing both the starter package and the normal versions of int2lm and cclm, but the problem is the same.
The tests from the starter package work fine.
We also tried adjusting the parameters in the GRIBIN section, but it didn't help.
I've attached all the files that we think could be useful for understanding the problem.
Could you help us, please?

Kind regards,
Iya Belova

P.S. Due to the old version of the Fortran compiler on our cluster, we had to change lines such as READ (nuin, inictl, IOSTAT=iz_err, IOMSG=iomsg_str) to READ (nuin, inictl, IOSTAT=iz_err) in some of the cclm install files.

  @hans-jürgenpanitz in #205868c

Dear Iya

Actually, I have no explanation for the error.
At first glance, it seems to be a system (MPI?) error.
But who knows.
Nevertheless, here are a few comments on your setup:

1. The timestep dt: you are using dt=120 (sec) together with a spatial resolution of about 12 km.
In my opinion, dt=120 is much too high; there is a large danger of violating the CFL criterion.
I would use dt=75.
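To make the CFL concern concrete, here is a rough back-of-the-envelope check of the advective stability limit. The grid spacing follows the ~12 km mentioned above; the assumed maximum signal speed is purely illustrative and not taken from this simulation's output:

```python
# Rough advective CFL check. dx follows the ~12 km resolution above;
# u_max is an assumed upper bound on the wind speed (jet level), not a
# value from this run.
dx = 12000.0    # grid spacing in metres
u_max = 120.0   # assumed maximum wind speed in m/s

dt_max = dx / u_max   # largest timestep with Courant number <= 1
print(dt_max)         # 100.0
```

With an upper bound like this, dt=120 s exceeds the stable limit while dt=75 s stays well below it, which matches the recommendation above.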

2. If I understand your setup correctly, you want to perform a 30-day simulation, starting 2009120100 and ending 2009123100.
That is a simulation duration of 720 hours (30 days * 24 hours/day), which should be the value of the namelist parameter "hstop".
But you are using "hstop=30*720" (see your cclm setup).
This could be corrected in your setup file by defining
NHOURS=24
instead of
NHOURS=720
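The hstop arithmetic above can be sketched as follows. The names mirror the NHOURS setting discussed above; that the setup script derives hstop as days times NHOURS is an assumption, not verified against the script itself:

```python
# Sketch of the hstop calculation. hstop = ndays * nhours is an
# assumption about how the setup script combines these values.
ndays = 30    # 2009120100 -> 2009123100
nhours = 24   # hours per day (not 720)

hstop = ndays * nhours
print(hstop)  # 720
```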

3. The triple of values for the namelist parameter "nhour_restart" should be
nhour_restart=120,$HSTOP,120
and not
nhour_restart=0,$HSTOP,120
where the values are given in hours.
However, this mistake (the first value of the triple) is corrected by CCLM (see cclm.exe.out).
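Read as (first, last, increment) in hours, the corrected triple implies restart points every 120 hours up to hstop. The expansion below is a sketch of that interpretation, not CCLM's actual code:

```python
# Expand the nhour_restart triple (first, last, increment) into the
# restart hours it implies. The simple range interpretation is an
# assumption for illustration, not taken from the CCLM source.
hstop = 720
first, last, inc = 120, hstop, 120

restart_hours = list(range(first, last + 1, inc))
print(restart_hours)  # [120, 240, 360, 480, 600, 720]
```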

4. Can someone else comment on Iya's choice of tuning parameters (see cclm.exe.out and YUSPECIF)? They seem rather "extreme".

Best regards
Hans-Juergen

  @burkhardtrockel in #7891ee2

Regarding Hans-Jürgen's item 4:
Iya, can you use the tuning parameters from the starter package script and test your job?

  @iyabelova in #65e3df9

Thank you for your answers.
This problem really looks like an MPI error. We had almost the same problem some months ago (you can find that discussion in the Starter Package Support forum thread).
We made all the changes you suggested in your answers, but it didn't help. I've attached the new .out file just in case, but there seem to be no real changes in it.

P.S. In the sp cclm script we had the following tuning parameters:
"wichfakt=0., tur_len=500., v0snow=20., tkhmin=0.35, tkmmin=1., rlam_heat=0.5249, mu_rain=0.5, entr_sc=0.0002, uc1=0.0626, fac_rootdp2=0.9000, soilhyd=1.6200". We also tried starting without any tuning parameters at all, but it gave the same result. Maybe there is some softer regime we could try?

P.P.S. There are multiple lines in the .out file saying "src_input: check completeness of input data". Maybe this says something about the source of the problem?

  @burkhardtrockel in #aed9ce7

Since you wrote that the starter package tests work, you could try making your changes step by step, from the starter package settings to your requested settings.

The multiple lines in the .out file "src_input: check completeness of input data" appear because every processor writes this message. This can be suppressed by changing the line

PRINT *, ' src_input: check completeness of input data'

to

IF (my_cart_id == 0) PRINT *, ' src_input: check completeness of input data'

Then only processor 0 writes the output.

  @hans-jürgenpanitz in #4b7c755

And a further suggestion, to find out whether there really is an MPI problem on your system.

Since the error occurs at the very beginning of your simulation, try running it with only one process: nprocx=1, nprocy=1.

If the error no longer occurs, then I would say it is an MPI/system problem.

Hans-Juergen

  @iyabelova in #207a51c

Dear Hans-Juergen,

I mentioned before that we had this problem in the starter package: we were not able to start the program in uniprocessor mode. When we try to do so, the script fails earlier, while reading the NetCDF files (in both int2lm and cclm).

  @iyabelova in #f7bd60a

Dear colleagues,

Thank you for your advice.
The problem was that our cluster is too weak for the chosen LM grid.
We made a new grid with the following options, and now everything works: startlat_tot=-7.6, startlon_tot=-7.6, pollat=34.3, pollon=-142.5, dlon=0.152, dlat=0.152, ie_tot=100, je_tot=100, ke_tot=40
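For scale, the new grid works out as follows. The point count and extent come directly from the namelist values above; any memory estimate would depend on details (number of prognostic fields, precision) not given here:

```python
# Size of the new grid from the namelist values above.
ie_tot, je_tot, ke_tot = 100, 100, 40
dlon = 0.152  # degrees, rotated grid

npoints = ie_tot * je_tot * ke_tot
extent = (ie_tot - 1) * dlon  # east-west extent in rotated degrees
print(npoints)               # 400000
print(round(extent, 3))      # 15.048
```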

Kind regards,
Iya Belova