Long term simulation
Hello,
I am attempting to run a long-term CORDEX simulation using 11 nodes with 32 processors per node. The job runs successfully, but periodically stops during the post-processing stage. This may indicate a conflict caused by the CCLM job not yet being released from the queue. However, I suspect the problem is more complex, since I get the same result when I resubmit the post job several hours later. Still, after repeating this several times without any change, I am eventually able to continue the run.
I have experienced the problem several times already. Please let me know your recommendations. I can add that I did not have (or did not notice) such a problem in my previous runs using a smaller number of nodes.
Simon
Running the post processing before the CCLM job has been released can actually lead to problems. On Blizzard I experienced such a problem and therefore set in the post processing script:
sleep 60 # to avoid conflict if CCLM job is not yet released from the queue (may not be relevant on all systems)
You may find this line when looking at the template scripts.
However, if I understand you correctly, you resubmitted the post-processing job on its own hours later and it happened again. This is really strange. It may be due to a problem in your computing system, which can be hard to track down. A brute-force way to narrow it down to the line in the script where it happens is to insert an
echo test nn
after each line.
I saw the sleep 60 line you mentioned, of course.
I have even changed the value to 240. However, this does not help.
In my runs, when the post job stops, DATE1 is equal to DATE2.
In the runs that end successfully, the two values differ. I do not know yet why this happens,
but as a temporary brute-force workaround I have commented out the line let "SEC_CHECK=DATE2-DATE1" and set SEC_CHECK=1 instead.
The job runs now, but there may be other problems due to the change.
If I understand the code correctly, the check is just to make sure that there are enough data files, so my change will probably work fine on our machine?
The line
let "SEC_CHECK=DATE2-DATE1"
measures how long the checking process takes. This is for information only and is not needed for the post-processing of the data. Therefore setting it to
SEC_CHECK=1
does not matter.
I have sometimes experienced problems with a similar command in another script. You may try the following instead of using the
let
command:
SEC_CHECK=$(python -c "print(${DATE2}-${DATE1})")
Anyway, just setting
SEC_CHECK=1
is fine, if you do not need the time information for some reason.
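One possible explanation for the stop when DATE1 equals DATE2, offered as an assumption rather than a confirmed diagnosis: in bash, let returns exit status 1 whenever its expression evaluates to 0. If the post-processing script runs with errexit (set -e) or otherwise checks the return code, which this thread does not confirm, a check that finishes within the same second would end the job at that line. The minimal sketch below uses made-up timestamp values to show the difference between let and a plain arithmetic assignment.

```shell
#!/bin/bash
# Hypothetical timestamps in seconds; equal values mimic a check that
# starts and finishes within the same second.
DATE1=1700000000
DATE2=1700000000

# 'let' returns a non-zero exit status when the expression evaluates to 0.
# The '||' keeps this demo alive even if errexit is active.
LET_STATUS=0
let "SEC_CHECK=DATE2-DATE1" || LET_STATUS=$?

# A plain assignment with arithmetic expansion returns status 0
# regardless of the computed value.
SEC_CHECK=$((DATE2 - DATE1))
ASSIGN_STATUS=$?

echo "let status: ${LET_STATUS}, assignment status: ${ASSIGN_STATUS}, SEC_CHECK=${SEC_CHECK}"
```

If this turns out to be the cause, SEC_CHECK=$((DATE2 - DATE1)) would keep the timing information without calling Python.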