Memory problem
Dear all,
I have been working with the cclm starter package version 1.5, which contains int2lm_131101_2.00_clm2 and cosmo_131108_5.00_clm2, on two different machines (Intel(R) Xeon(R) CPU E7540 @ 2.00GHz and Intel(R) Xeon(R) CPU E5-4620 v2 @ 2.60GHz) without any problems. However, in order to use a larger number of nodes, I decided to use the National High Performance Computing Center of Turkey (UHeM), which runs SLURM. I adapted the run scripts to SLURM batch commands, but now I have trouble running cclm. It runs successfully but stops after simulating several years without giving any error (only the message below). I talked with an administrator at UHeM, and he said that the system kills cclm.exe because it tries to use all of the machine's RAM. I am wondering whether there is a bug related to a memory leak, or whether there is a way to limit the memory usage. Do you have any suggestions for fixing this problem?

P.S. I have already tried the new version of the starter package that is adapted to SLURM, and it did not solve the problem.
Thank you very much.
--------------------------------------------------------------------------
mpirun noticed that process rank 9 with PID 14935 on node m003 exited on signal 9 (Killed).
--------------------------------------------------------------------------
Dear Cemre,
It sounds like strange behaviour of your computer system, since the error occurs after such a long runtime. It is a common phenomenon that the model crashes without a proper error message when the memory limit is exceeded, but I do not expect that you exceed the "physical" memory limit of the system. Could you make sure that you use the compute nodes exclusively (not shared) and that you allocate enough memory in your SLURM settings? It would also help if you could provide some more information on the domain size and your current SLURM settings.
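For illustration, the relevant directives could look roughly like this (only a sketch using standard SLURM options; the exact values depend on the UHeM configuration):

#SBATCH --nodes=4          ### number of whole nodes for the job
#SBATCH --exclusive        ### request exclusive use of the allocated nodes
#SBATCH --mem=0            ### in most SLURM versions, 0 grants all memory of each node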
With best regards,
Markus
Dear Markus, thank you very much. You can find the run script and log files in the attachment; they belong to the big domain with 0.44° resolution. I also tried to run cclm for the nested domains, which have 0.11° and 0.0275° resolution, respectively. The run with 0.11° stops after approximately 6 months, and the run with the finest resolution (0.0275°) stops after 2 months. Maybe I am making a mistake in my SLURM settings.
Best regards,
Cemre
Dear Cemre,
As far as I can see, the run script and log files look okay.
As a first step, I suggest making sure that the job requests enough memory and wall-clock time on your system:
#SBATCH --time=10:00:00    ### wall-clock time limit, HH:MM:SS
#SBATCH --mem=10000        ### memory per node in MByte
In addition, I recommend running cclm with "srun" instead of "mpirun". srun is part of SLURM and may give some more error information.
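The change in the run script could look roughly like this (a sketch; the executable name and task count are placeholders and may differ in your script):

# old launch line:
#   mpirun -np <ntasks> ./cclm.exe
# new launch line (srun takes the task count from the SLURM allocation):
srun ./cclm.exe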
Can you also try to run the model on 4 nodes? The crash should then occur much later in model time.
Best regards
Markus
Dear Markus,
Following your suggestion, I added the "#SBATCH --time=10:00:00" and "#SBATCH --mem=10000" directives to my script (for the 0.44° resolution), but I did not change mpirun to srun. The job was terminated because of the memory limit (cclm.866.err).
In addition, I made another test run (for the 0.0275° resolution) with the SBATCH directives added and mpirun changed to srun, but I got a different error. I do not know whether "srun" needs extra settings; I will check that later, since the memory problem has priority.
Besides, when I increase the number of nodes or use a machine with more memory (256 GB RAM), the time until the crash gets longer as well. However, the job eventually consumes the full 256 GB of RAM plus 13 GB of swap and is then terminated by the OOM killer.
Best regards,
Cemre
Dear Cemre,
Memory usage of the 0.44 deg CCLM job should be < 2 GB, so this is a serious problem.
Probably the memory consumption increases with each time loop. Maybe you can verify this with an interactive job (without sbatch) on 2 CPUs or so; do not forget to give the number of cores in the mpirun call in that case. Monitor the memory use with "top".
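A rough sketch of such a check (standard Linux and Open MPI commands; the core count and executable name are only illustrative):

# short interactive test run on 2 cores
mpirun -np 2 ./cclm.exe &
# watch the resident memory (RES column) of the cclm processes;
# if it grows steadily from one model output step to the next, memory is leaking
top -p $(pgrep -d, -f cclm.exe)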
I will try to reproduce that behaviour with your configuration on the DKRZ machine next week. Maybe you could also try the Intel compiler, if it is available on your system.
In general, for long-term simulations it is better to use the chain mode (http://redc.clm-community.eu/projects/cclm-sp/wiki/SUBCHAIN).
Best regards,
Markus