Restarting finished job
Hi everybody,
I have finished a 5-year simulation, and the output files in the SCRATCH directory have been removed. Now I want to continue the experiment for another year. Please let me know whether this is possible and which steps are required.
Kind regards, Simon
If you use the latest subchain version you can create the directory structure with
otherwise you have to do this by hand. In this case look at the section
# create the job directory structure
in the subchain script where the directories are created.
I assume you want to perform a warm start, i.e. prolonging the run for another year. In this case do not change YDATE_START, but just adapt YDATE_STOP.
Thanks much.
I apparently did something wrong.
I use the (1.3.4) subchain version (is it the latest?).
I submitted ./subchain create and now have the ….chain/scratch/… directory (which is empty).
The date.log, however, contains the date which is equal to YDATE_START.
After that I attempted to submit a restart job ./subchain cclm DATE
where DATE is the one from the original date.log file (i.e. not the one created by ./subchain create), but the job stopped: first cclm, and then prep and int2lm.
The subchain date.log and the log files + the files from jobs directory (tarred) are attached.
Please have a look.
Simon
1.3.4 is the latest released subchain version.
You wrote:
The date.log, however, contains the date which is equal to YDATE_START.
but actually in your subchain YDATE_START=1989010100 and in date.log it is 1994010100, which is OK.
I guess you have to create the input data for cclm first. Please run in your case
./subchain prep 1994010100
If everything goes well, this job should call int2lm and later cclm automatically.
By the way, calling
./subchain cclm
always takes the date from date.log. A second argument will be ignored.
The ./subchain prep 1994010100 job did call int2lm but stopped after it. I attach the resulting log file.
The last thing it did was the creation of two directories 1994_01 and 1994_02 in ..scratch/output/int2lm .
Any hint, please.
Beate just found an error in the subchain script. In case you call
subchain create
the following command at around line 95 should not be called:
This is only for a cold start. Please check that you have not overwritten date.log when you used
subchain create
For a warm start there should be 1994010100 in your case in the date.log file.
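One way to avoid this kind of overwrite, as a minimal sketch only: write the start date to date.log only when the file does not exist yet or is empty. The helper name init_date_log and its arguments are hypothetical and not part of the subchain script; the date.log path is a placeholder to be supplied by the caller.

```shell
#!/bin/sh
# Sketch only: init_date_log is a hypothetical helper, not part of subchain.
# It writes the cold-start date pair only when date.log is missing or empty,
# so a warm-start date that is already recorded there is left untouched.
init_date_log() {
    date_log=$1      # path to date.log (placeholder, supplied by the caller)
    ydate_start=$2   # cold-start date, e.g. 1989010100

    if [ -s "$date_log" ]; then
        # File exists and is non-empty: assume a warm start and keep it.
        echo "keeping existing $date_log: $(cat "$date_log")"
    else
        # No usable date.log yet: initialise it for a cold start.
        echo "$ydate_start $ydate_start" > "$date_log"
    fi
}
```

Called as init_date_log /path/to/date.log 1989010100, this would leave a date.log containing 1994010100 1994010100 as it is.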
Do you mean that in my case (warm start) the line has to be commented out, but must exist in the case of a cold start?
With this correction I tried submitting ./subchain cclm 1994010100 (with 1994010100 1994010100 in date.log; why twice, in fact?), but this didn’t work.
Should I perhaps try submitting subchain prep for an earlier date, like ./subchain prep 1994010100 or ./subchain prep 1993120100?
If date.log contains
1994010100 1994010100
then ./subchain prep 1994010100
should work and start the chain again. Otherwise you mixed something up in the chain.
The two dates in date.log are just for the case of running sub-monthly chunks. This is not the case in your run, so just leave it as it is.
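Before resubmitting, the content of date.log can be checked with a few lines of shell. This is only a sketch: check_date_log is a hypothetical helper name, and it assumes date.log holds exactly two whitespace-separated dates on its first line, as in the example above.

```shell
#!/bin/sh
# Sketch only: check_date_log is a hypothetical helper, not part of subchain.
# Returns 0 if both dates on the first line of date.log equal the expected
# restart date, 1 if they differ, 2 if the file cannot be read.
check_date_log() {
    date_log=$1   # path to date.log
    expected=$2   # expected restart date, e.g. 1994010100

    if [ ! -r "$date_log" ]; then
        echo "cannot read $date_log" >&2
        return 2
    fi

    # Read the two fields of the first line; tolerate a missing final newline.
    read -r first second < "$date_log" || true

    if [ "$first" = "$expected" ] && [ "$second" = "$expected" ]; then
        echo "date.log OK: $first $second"
        return 0
    fi
    echo "date.log holds '$first $second', expected '$expected $expected'" >&2
    return 1
}
```

For example, check_date_log date.log 1994010100 && ./subchain prep 1994010100 would only start the chain when the file matches.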
Thank you. It doesn’t work. It may of course be my mistake, but I do not think I made any change in the scripts except for the one you suggested in the subchain (I commented out the line echo ${YDATE_START} ${YDATE_START} > ${PFDIR}/${EXPID}/date.log). [By the way, I work with cclm-sp_1.4 and not with 1.3.4. Should I try restarting the job using 1.3.4?]
To summarize:
my date.log is as follows: 1994010100 1994010100
the job ./subchain prep 1994010100 starts successfully and calls int2lm but doesn’t call cclm.
I tried submitting ./subchain cclm 1994010100 after that (and also before), but it terminates with ERROR CODE 2014 in ROUTINE organize_input
after attempting to open ncdf file lbff**000000.nc
No such file or directory
================================
However, I have successfully restarted my jobs many times from the last time step reached (i.e. when the experiment was not yet finished, the data in the /scratch directory had not yet been removed, and the most recently created files were still there). Could it be that a restart is possible only from the last time step? Or should one, in principle, be able to restart a job from any point in time (and if so, where are the input data supposed to come from)?
Please kindly clarify.
I just made a test by myself and it worked fine.
Please run ./subchain prep 1994010100 again and, if it does not work, please attach the log files for prep, int2lm and cclm that have been produced by the job.
Please see the log files attached (except for the cclm since it has not started). Also there are my subchain, all jobs and results of ls -l for restarts directory. Many thanks indeed for your help.
Sorry, the subchain is attached here.
The prep and int2lm jobs you provide already created the data for 199402.
Please check if the directory
/Research/CLIMATE/Giora/COSMO-CLM/cclm-sp_1.4/chain/scratch/b3001/output/int2lm/1994_01/
contains the laf1994010100.nc file and all necessary lbfd199401mmddhh.nc files.
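A quick way to do this check from the shell is sketched below. The helper name check_int2lm_output is hypothetical, and the sketch only counts lbfd files for the month; how many are actually necessary depends on the boundary update interval of the experiment, which is not assumed here.

```shell
#!/bin/sh
# Sketch only: check_int2lm_output is a hypothetical helper, not part of
# subchain. It checks for the initial file laf<YYYYMMDDHH>.nc and counts
# the lbfd<YYYYMM>ddhh.nc boundary files found in the given directory.
check_int2lm_output() {
    dir=$1             # int2lm output directory for the month
    ydate=$2           # restart date, e.g. 1994010100
    yyyymm=${ydate%????}   # strip ddhh, e.g. 199401
    status=0

    if [ ! -f "$dir/laf${ydate}.nc" ]; then
        echo "missing initial file laf${ydate}.nc" >&2
        status=1
    fi

    # Count boundary files of the month; grep -c prints 0 if none match.
    n=$(ls "$dir" 2>/dev/null | grep -c "^lbfd${yyyymm}[0-9][0-9][0-9][0-9]\.nc$" || true)
    echo "found $n lbfd boundary files for $yyyymm"
    [ "$n" -gt 0 ] || status=1

    return $status
}
```

The function returns a non-zero status when the laf file is missing or no lbfd file is found, so it can be used as a guard before calling ./subchain cclm.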
If these are available, perform the command
./subchain cclm
and attach the resulting .job and joblog file for this to your reply.
I have submitted the job and it now runs without any problem. So, the problem seems to be solved. I do not really understand how. Thanks much anyway,
I have finally understood how I managed to get my job running: I clearly made a mistake. As I now see, the cclm.job.tmpl file in the /templates directory contains ydirini=@{YDIRINI}/’ and not ydirini=’/Research/CLIMATE/Giora/COSMO-CLM/cclm-sp_1.4/work/b3001/restarts’‘.
This means that by submitting ./subchain cclm 1994010100 I in fact used a cold start and not the warm start I wanted.
Sorry for yesterday's misleading information.
So, my problem apparently remains unsolved. Following your earlier recommendation I have repeated all my previous actions on another job, b2001. Attached please find a tar file with the information on the files in /Research/CLIMATE/Giora/COSMO-CLM/cclm-sp_1.4/chain/scratch/b2001/output/int2lm/1994_01/ and /Research/CLIMATE/Giora/COSMO-CLM/cclm-sp_1.4/chain/scratch/b2001/output/int2lm/1994_02/
as well as the resulting .job and joblog file.
You are still messing up something in your subchain script.
In cclmb2001.job one can read
There is one ' too many in ydirini.
Maybe this causes the error in cclm-b2.o1032872:
Please attach the YUSPECIF and subchain files next time. They help in understanding the problem.
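A stray quote like the one in ydirini can also be spotted mechanically. The sketch below is only an illustration: check_ydirini_quotes is a hypothetical helper, and it simply assumes that an odd number of single quotes on the first ydirini line of a generated job file indicates a problem.

```shell
#!/bin/sh
# Sketch only: check_ydirini_quotes is a hypothetical helper, not part of
# subchain. It counts the single quotes on the first line of a job file
# that mentions ydirini and reports an odd (unbalanced) count.
check_ydirini_quotes() {
    job_file=$1   # generated job file to inspect

    line=$(grep 'ydirini' "$job_file" | head -n 1)
    if [ -z "$line" ]; then
        echo "no ydirini line in $job_file" >&2
        return 2
    fi

    # Keep only the quote characters and count them.
    quotes=$(printf '%s' "$line" | tr -cd "'" | wc -c)

    if [ $((quotes % 2)) -ne 0 ]; then
        echo "unbalanced quotes in: $line" >&2
        return 1
    fi
    echo "ydirini quotes look balanced"
    return 0
}
```

Running it on the generated job file before submission would flag a line ending in two quotes after the restart path.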