Long term simulation – in #12: CCLM Starter Package Support
<p>
Hello,
<br/>
I am trying to run a long-term
<span class="caps">
CORDEX
</span>
simulation using 11 nodes with 32 processors per node. The job runs successfully, but periodically stops during the post-processing stage. This may indicate a conflict arising because the
<span class="caps">
CCLM
</span>
job has not yet been released from the queue. However, I suspect the problem is more complex, since I get the same result when I resubmit the post job several hours later. Only after resubmitting several times, without changing anything, am I finally able to continue the run.
<br/>
I have run into this problem several times already. Please let me know what you recommend. I should add that I did not have (or did not notice) this problem in earlier runs that used fewer nodes.
<br/>
Simon
</p>
<p>
Running the post processing before the
<span class="caps">
CCLM
</span>
job has been released can indeed cause problems. I ran into this on Blizzard and therefore added the following line to the post-processing script:
<br/>
<code>
sleep 60 # to avoid conflict if CCLM job is not yet released from the queue (may not be relevant on all systems)
</code>
<br/>
You can find this line in the template scripts.
<br/>
However, if I understand you correctly, you resubmitted the post-processing job on its own hours later and it failed again. This is really strange. It may be caused by a problem in your computing system, which can be hard to track down. A brute-force way to narrow it down to the line in the script where it happens is to insert an
<code>
echo test nn
</code>
after each line.
</p>
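<p>
As a possible alternative to adding an echo after every line, the sketch below uses bash's ERR trap to report the command that failed. It assumes the post-processing script runs under bash; the variable names FAILED_STATUS, FAILED_LINE and FAILED_CMD are made up for this illustration, and <code>false</code> merely stands in for whichever command actually fails.
</p>

```shell
#!/usr/bin/env bash
# On any failing command, record and report its line number and text.
# FAILED_STATUS, FAILED_LINE and FAILED_CMD are hypothetical names for this sketch.
trap 'FAILED_STATUS=$?; FAILED_LINE=${LINENO}; FAILED_CMD=${BASH_COMMAND}
      echo "failed at line ${FAILED_LINE}: ${FAILED_CMD} (exit status ${FAILED_STATUS})" >&2' ERR

echo "step 1"
false            # placeholder for the command that fails
echo "step 2"    # still runs, because this sketch does not use `set -e`
```

<p>
Running the script with <code>bash -x</code> (or adding <code>set -x</code> near the top) additionally prints every command before it is executed, which gives much the same information as the manual echo lines.
</p>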
<p>
I saw the sleep 60 line you mentioned of course.
<br/>
I even changed the value to 240, but this does not help.
<br/>
In my runs, DATE1 is equal to DATE2 whenever the post job stops,
<br/>
and the two values differ in the runs that finish successfully. I do not know yet why this happens,
<br/>
but as a temporary brute-force workaround I have commented out the line
<code>
let "SEC_CHECK=DATE2-DATE1"
</code>
and set
<code>
SEC_CHECK=1
</code>
instead.
<br/>
The job runs now, but the change may cause other problems.
<br/>
If I understand the code correctly, the check is only there to make sure that there are enough data files, so my change should probably work fine on our machine?
</p>
<p>
The line
<code>
let "SEC_CHECK=DATE2-DATE1"
</code>
measures how long the checking process takes. This is for information only and is not needed for post-processing the data, so setting it to
<code>
SEC_CHECK=1
</code>
does not matter.
<br/>
I have occasionally had problems with a similar command in another script. You could try the following instead of the
<code>
let
</code>
command.
<br/>
<code>
SEC_CHECK=$(python -c "print(${DATE2}-${DATE1})")
</code>
<br/>
Anyway, just setting
<code>
SEC_CHECK=1
</code>
is fine if you do not need the timing information.
</p>
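<p>
One possible explanation for the stop when DATE1 equals DATE2: in bash, <code>let</code> returns exit status 1 whenever its expression evaluates to 0. If the post-processing script aborts on non-zero exit codes (for example through <code>set -e</code>, which is only an assumption here and not confirmed in this thread), a zero difference would end the job at exactly that line. The sketch below uses made-up timestamps and the hypothetical helper variables LET_STATUS and EXPANSION_STATUS to show the behaviour, together with an arithmetic expansion that does not have this side effect.
</p>

```shell
#!/usr/bin/env bash
# Made-up timestamps; in the real script they presumably come from something like `date +%s`.
DATE1=1700000000
DATE2=1700000000   # same second as DATE1, so the difference is 0

# `let` returns exit status 1 when the expression evaluates to 0.
if let "SEC_CHECK=DATE2-DATE1"; then
    LET_STATUS=0
else
    LET_STATUS=$?
fi
echo "let: SEC_CHECK=${SEC_CHECK}, exit status ${LET_STATUS}"

# An assignment with arithmetic expansion returns status 0 whatever the value is.
SEC_CHECK=$((DATE2 - DATE1))
EXPANSION_STATUS=$?
echo "arithmetic expansion: SEC_CHECK=${SEC_CHECK}, exit status ${EXPANSION_STATUS}"
```

<p>
If this is indeed the cause, replacing the <code>let</code> line with the arithmetic expansion would keep the timing information without the risk of stopping the job.
</p>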