CCLM simulations fail on Mistral - floating point exception – in #9: CCLM
Dear colleagues,

I have been trying to run a 1-day test simulation with CCLM cosmo4.8_clm19 on Mistral for the first time since Blizzard was retired.
I am running the CCLM with an almost standard configuration for the 0.0625° horizontal resolution that I have successfully employed in several experiments on Blizzard.

I have modified the batch script as suggested here: http://redc.clm-community.eu/projects/cclmdkrz/wiki/Run-scripts

After a series of minor problems that were solved thanks to the error messages in the .out and .err files, I came to a dead end.
Now when I submit my job with sbatch, the simulation runs for some seconds, produces the lffd1996100400c.nc file and exits without leaving any error message in the .out file. However, in the .err file I get multiple errors of this form:

<pre>
56: [m10393:41443:0] Caught signal 8 (Floating point exception)
56: backtrace
56: 2 0x00000000000548cc mxm_handle_error() /scrap/jenkins/workspace/hpc-power-pack/label/r-vmb-rhel6-u4-x86-64-MOFED-CHECKER/hpcx_root/src/hpcx-v1.2.0-268-gcc-OFED-3.12-redhat6.4/mxm-master/src/mxm/util/debug/debug.c:641
56: 3 0x0000000000054a3c mxm_error_signal_handler() /scrap/jenkins/workspace/hpc-power-pack/label/r-vmb-rhel6-u4-x86-64-MOFED-CHECKER/hpcx_root/src/hpcx-v1.2.0-268-gcc-OFED-3.12-redhat6.4/mxm-master/src/mxm/util/debug/debug.c:616
56: 4 0x00000000000326a0 killpg() ??:0
56: 5 0x00000000002dabd5 pow.L() ??:0
56: 6 0x000000000001ed5d __libc_start_main() ??:0

srun: error: m10393: tasks 40,45-50,52-59: Floating point exception
srun: Terminating job step 2108157.0
00: slurmstepd: *** STEP 2108157.0 ON m10314 CANCELLED AT 2016-03-15T20:09:55 ***
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
</pre>

The model sources have been compiled correctly and have been successfully used by another member of our DKRZ account. It seems that there is a floating point exception that I cannot figure out in any way.

Has any of you ever encountered such problems, or do you have a clue what might be causing all this?

I have attached my batch script and my .err and .out files, along with the YUSPECIF, YUDEBUG and YUCHKDAT.

Your help would be incredibly appreciated.

Best,

Edoardo Mazza
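One hedged way to localize such a crash: the backtrace above shows only the MPI library's signal handler and a libc frame (pow.L), so rebuilding the model with floating-point trapping enabled usually yields a traceback that points into the Fortran source itself. A minimal sketch; the flag names are standard for Intel and GNU Fortran, but which compiler the Mistral build uses and the FFLAGS variable name in the Makefile are assumptions:

<pre>
# Sketch: enable FPE trapping so the abort happens at the offending
# statement in the model source, not in a later library frame.

# Intel ifort: trap invalid, divide-by-zero and overflow; keep symbols.
FFLAGS="$FFLAGS -fpe0 -g -traceback"

# GNU gfortran equivalent:
# FFLAGS="$FFLAGS -ffpe-trap=invalid,zero,overflow -g -fbacktrace"
</pre>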
Dear Edoardo,

having a very first and quick look into your YUCHKDAT, I would say that something is wrong with your forcing data.
Look at your T_SO values in the deeper layers.
They become very small and even negative! The unit for T_SO is Kelvin!
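A quick, hedged way to screen those fields directly, assuming the cdo tools available on Mistral (the laf-file name is only an example, not taken from the attachments):

<pre>
# Sketch: print min/mean/max of the soil temperature field in the
# int2lm output; negative values or values far below ~200 K would
# confirm broken initial data.
cdo infon -selname,T_SO laf1996100400.nc
</pre>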
Furthermore, I saw in your YUSPECIF that you run the model in NWP mode, not in climate mode (lbdclim=.FALSE.).
Is this what you want to do?
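If a climate run is intended, the switch would go into the model namelist; a hedged sketch in run-script style (the RUNCTL group and the INPUT_ORG file name follow common COSMO conventions and are assumptions here, and only the relevant switch is shown — a real INPUT_ORG contains many more parameters):

<pre>
# Sketch: CCLM run scripts usually generate the namelists via
# here-documents; lbdclim=.TRUE. selects climate mode.
cat > INPUT_ORG << EOF
 &RUNCTL
   lbdclim=.TRUE.,
 /
EOF
</pre>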
Hans-Juergen
Dear Hans-Juergen,

Thank you very much for your support, and sorry for the late reply, but it took me a few days to go back to the roots of the problem.
I agree that there’s something wrong with those temperatures, so I went back to the previous downscaling step to see where these weird values came from.

I wanted to repeat the simulation driven with ERA-Interim data obtained from the DKRZ directory /pool/data/CCLM/reanalyses/ERAInterim. I adapted the run_int2lm script for the gcm2cclm case. Again, I wanted to test that it was working fine for 24 hours.

Unfortunately, the situation does not seem to have changed at all. The “floating point exception” error is still causing the program to quit. So it seems that the problem goes beyond the T_SO values. I am really losing focus on what the problem is right now. I have checked and double-checked, but clearly there’s something wrong that I can’t find.

Please find attached the run_int2lm, the .out, YUCHKDAT, INPUT, OUTPUT and YUDEBUG files.

Best wishes,

Edoardo
Dear Edoardo,

did you realize that the ERA-Interim data (caf-files) in /pool/data/CCLM/reanalyses are NetCDF4-compressed?
There is a README in the ERAINT directory that says so.
Perhaps that is your problem.
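If the compression is the culprit, the caf-files could be converted once before running int2lm; a hedged sketch using nccopy from the standard netCDF utilities (the file names are examples):

<pre>
# Sketch: rewrite a NetCDF4/HDF5-compressed caf-file in classic
# (NetCDF3) format, which older tool chains read without problems.
nccopy -k classic caf1996100400.nc caf1996100400_classic.nc
</pre>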
Alternatively, you can try to use the uncompressed caf-files that are available from my workspace:
/work/bb0849/b364034/ERAINT/CCLM_Forcing_Data/
Furthermore, the ERAINT data have T_SKIN, so set “luse_t_skin=.TRUE.”.
Of course, also consider Burkhardt’s suggestion “lprog_qi=.TRUE.”, since QI is also available.
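Both switches belong in the int2lm namelist; a hedged sketch in the same run-script style (the CONTRL group name is the usual int2lm convention and is not taken from the attached run_int2lm; again only the relevant lines are shown):

<pre>
# Sketch: set the two switches where run_int2lm writes its namelist.
cat > INPUT << EOF
 &CONTRL
   luse_t_skin=.TRUE.,   ! ERA-Interim provides skin temperature
   lprog_qi=.TRUE.,      ! cloud ice (QI) is available in the driving data
 /
EOF
</pre>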
Hans-Juergen
Dear all,
I recently encountered a very similar problem. I get the following messages:

<pre>
239: [m10063:37251:0] Caught signal 8 (Floating point exception)
248: backtrace
248: 2 0x000000000005767c mxm_handle_error() /scrap/jenkins/workspace/hpc-power-pack/label/r-vmb-rhel6-u7-x86-64-MOFED-CHECKER/hpcx_root/src/hpcx-v1.9.7-gcc-OFED-3.18-redhat6.7-x86_64/mxm-v3.6/src/mxm/util/debug/debug.c:641
248: 3 0x00000000000577ec mxm_error_signal_handler() /scrap/jenkins/workspace/hpc-power-pack/label/r-vmb-rhel6-u7-x86-64-MOFED-CHECKER/hpcx_root/src/hpcx-v1.9.7-gcc-OFED-3.18-redhat6.7-x86_64/mxm-v3.6/src/mxm/util/debug/debug.c:616
248: 4 0x0000000000032510 killpg() ??:0
248: 5 0x00000000008606ab src_soil_multlay_mp_terra_multlay_() ??:0
248: 6 0x000000000055e4bf organize_physics_() ??:0
248: 7 0x000000000058d900 MAIN__() ??:0
248: 8 0x00000000004052fe main() ??:0
248: 9 0x000000000001ed1d __libc_start_main() ??:0
248: 10 0x00000000004051f9 _start() ??:0
248: ===============
</pre>

I already tried to decompress the ERA-Interim data and I also considered the hints you gave before, but it still doesn’t work.
I would be very grateful for help.
Thank you very much and best regards,
Eva
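A hedged observation: unlike the first backtrace, this one names the failing routine, src_soil_multlay (the TERRA multi-layer soil scheme), which again points toward the soil fields in the input data. One quick screen, assuming the run wrote the usual YUCHKDAT min/mean/max listing:

<pre>
# Sketch: YUCHKDAT tabulates min/mean/max per variable and level;
# scan the soil temperature and moisture entries for outliers.
grep -nE "T_SO|W_SO" YUCHKDAT
</pre>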
Dear all,
by changing the INT2LM I could solve the problem I posted before, but now a new error appears that is also quite similar.
Could anyone please help me with this problem?
Thank you very much and best regards,
Eva Nowatzki