Restarting finished job – in #9: CCLM

Restarting finished job

Hi everybody,
I have finished a 5-year simulation, and the output files in the SCRATCH directory have been removed. Now I want to continue the experiment for another year. Please let me know whether this is possible and what actions are required.
Kind regards, Simon

If you use the latest subchain version you can create the directory structure with

subchain create

otherwise you have to do this by hand. In this case look at the section


  # create the job directory structure

in the subchain script where the directories are created.
I assume you want to perform a warm start, i.e. prolong the run for another year. In this case do not change YDATE_START; just adapt YDATE_STOP.
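As a rough illustration of that advice, the two settings might look as follows for this experiment. The start date is the one quoted further down in this thread; the stop date is an assumption (the previous end date 1994010100 plus one year), not a value taken from the actual script.

```shell
# Sketch of the date settings in the subchain script for a warm start.
# Only YDATE_STOP changes; the value below is assumed, not copied from the run.
YDATE_START=1989010100   # unchanged: original start of the experiment
YDATE_STOP=1995010100    # assumed: previous stop date extended by one year
```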

Thanks much.
I apparently did something wrong.

I use subchain version 1.3.4 (is it the latest?).
I ran ./subchain create and now have the ….chain/scratch/… directory (which is empty).
The date.log, however, contains a date equal to YDATE_START.
After that I tried to submit a restart job with ./subchain cclm DATE,
where DATE is taken from the original date.log file (i.e. not the one created by ./subchain create), but the job stopped: first cclm, and then prep and int2lm.

The subchain date.log and the log files + the files from jobs directory (tarred) are attached.
Please have a look.
Simon

1.3.4 is the latest released subchain version.
You wrote:
The date.log however contains the date which is equal to YDATE_START.
But actually in your subchain YDATE_START=1989010100 and in date.log it is 1994010100, which is OK.
I guess you have to create the input data for cclm first. In your case, please run
./subchain prep 1994010100
If everything goes well, this job should call int2lm and later cclm automatically.
By the way, ./subchain cclm always takes the date from date.log. A second argument is ignored.

The ./subchain prep 1994010100 job did call int2lm but stopped after it. I attach the resulting log file.
The last thing it did was create two directories, 1994_01 and 1994_02, in ..scratch/output/int2lm .
Any hints, please?

Beate just found an error in the subchain script. When you call subchain create, the following command at around line 95 should not be executed:

  echo ${YDATE_START} ${YDATE_START} > ${PFDIR}/${EXPID}/date.log

This is only for a cold start. Please check whether you overwrote date.log when you used subchain create.
For a warm start, date.log should contain 1994010100 in your case.
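One way to avoid overwriting date.log on a warm start is to write it only when it does not exist yet. The following is a minimal sketch of that idea, not the actual fix applied in subchain; the function name init_date_log is hypothetical, while the file name and the two-date format are taken from the command above.

```shell
# Hypothetical helper (not part of subchain): initialise date.log only for a
# cold start, and leave an existing file untouched for a warm start.
init_date_log() {
    # $1: experiment directory, e.g. ${PFDIR}/${EXPID}
    # $2: start date, e.g. ${YDATE_START}
    logfile="$1/date.log"
    if [ -f "$logfile" ]; then
        echo "keeping existing $logfile (warm start)"
    else
        echo "$2 $2" > "$logfile"
        echo "created $logfile (cold start)"
    fi
}

# Assumed usage inside the script:
#   init_date_log "${PFDIR}/${EXPID}" "${YDATE_START}"
```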

Do you mean that in my case (warm start) the line has to be commented out, but must exist in the case of a cold start?
With this correction I tried submitting ./subchain cclm 1994010100 (with 1994010100 1994010100 in date.log [why twice, in fact?]), but this didn’t work.
Should I perhaps try submitting subchain prep for an earlier date, like ./subchain prep 1994010100 or ./subchain prep 1993120100?

If date.log contains 1994010100 1994010100 then ./subchain prep 1994010100 should work and start the chain again. Otherwise you mixed something up in the chain.
The two dates in date.log are only relevant when running sub-monthly chunks. This is not the case in your run, so just leave it as it is.

Thank you. It doesn’t work. It may be my mistake of course, but I do not think I made any change in the scripts except for the one you suggested in subchain (commenting out the line echo ${YDATE_START} ${YDATE_START} > ${PFDIR}/${EXPID}/date.log). [By the way, I work with cclm-sp_1.4 and not with 1.3.4. Should I try restarting the job using 1.3.4?]
To summarize:
my date.log reads 1994010100 1994010100;
the job ./subchain prep 1994010100 starts successfully and calls int2lm, but does not call cclm.
I tried submitting ./subchain cclm 1994010100 after that (and also before), but it terminates with ERROR CODE 2014 in ROUTINE organize_input
after attempting to open ncdf file lbff**000000.nc
No such file or directory

================================

However, I have many times successfully restarted my jobs from the most recent point in time (i.e. when the experiment was not yet finished, the data in the /scratch directory had not been removed, and the last created files were still there). Could it be that restarting is possible only from the last point in time? Or should one, in principle, be able to restart a job from any point in time (and if so, where is the input data supposed to come from)?
Please kindly clarify.

I just ran a test myself and it worked fine.
Please run ./subchain prep 1994010100 again, and if it does not work, please attach the log files for prep, int2lm and cclm that the job produced.

Please see the attached log files (except for cclm, since it has not started). Also attached are my subchain script, all jobs, and the output of ls -l for the restarts directory. Many thanks indeed for your help.

Sorry, the subchain is attached here.

The prep and int2lm jobs you provided have already created the data for 199402.
Please check whether the directory
/Research/CLIMATE/Giora/COSMO-CLM/cclm-sp_1.4/chain/scratch/b3001/output/int2lm/1994_01/
contains the laf1994010100.nc file and all necessary lbfd199401mmddhh.nc files.
If these are available, run ./subchain cclm and attach the resulting .job and joblog files to your reply.
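The check described above could be sketched as a small shell function. The function name check_int2lm_output is hypothetical, and the sketch only counts the lbfd files for the month; it cannot tell whether all necessary ones are present, because that depends on the boundary data interval of the run.

```shell
# Hypothetical helper: check an int2lm output directory for the initial file
# (laf<date>.nc) and at least one boundary file (lbfd<yyyymm>*.nc).
check_int2lm_output() {
    # $1: int2lm output directory for the month, $2: start date (YYYYMMDDHH)
    dir="$1"
    date="$2"
    month=$(printf '%s' "$date" | cut -c1-6)
    if [ ! -f "$dir/laf${date}.nc" ]; then
        echo "missing initial file laf${date}.nc"
        return 1
    fi
    count=0
    for f in "$dir"/lbfd"${month}"*.nc; do
        # The pattern stays unexpanded when nothing matches, so test each name.
        if [ -e "$f" ]; then
            count=$((count + 1))
        fi
    done
    if [ "$count" -eq 0 ]; then
        echo "no boundary files lbfd${month}*.nc found"
        return 1
    fi
    echo "found laf${date}.nc and ${count} lbfd${month}*.nc file(s)"
}

# Assumed usage, with the directory named above:
#   check_int2lm_output /Research/CLIMATE/Giora/COSMO-CLM/cclm-sp_1.4/chain/scratch/b3001/output/int2lm/1994_01 1994010100
```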

I have submitted the job and it now runs without any problem. So the problem seems to be solved, although I do not really understand how. Thanks very much anyway.

I finally understood how I managed to get my job running. I clearly made a mistake. As I see now, the cclm.job.tmpl file in the /templates directory contains ydirini=@{YDIRINI}/' and not ydirini='/Research/CLIMATE/Giora/COSMO-CLM/cclm-sp_1.4/work/b3001/restarts'',
This means that by submitting ./subchain cclm 1994010100 I actually performed a cold start and not the warm start I wanted.
Sorry for yesterday's misleading information.

So my problem apparently remains unsolved. Following your earlier recommendation, I have repeated all my previous actions on another job, b2001. Attached please find a tar file with the information on the files in /Research/CLIMATE/Giora/COSMO-CLM/cclm-sp_1.4/chain/scratch/b2001/output/int2lm/1994_01/ and /Research/CLIMATE/Giora/COSMO-CLM/cclm-sp_1.4/chain/scratch/b2001/output/int2lm/1994_02/
as well as the resulting .job and joblog files.

You are still messing up something in your subchain script.
In cclmb2001.job one can read

  ydirini='/Research/CLIMATE/Giora/COSMO-CLM/cclm-sp_1.4/chain/work/b2001/restarts'',
  ydirbd='/Research/CLIMATE/Giora/COSMO-CLM/cclm-sp_1.4/chain/scratch/b2001/input/cclm/1994_01/',

There is one ' too many in ydirini.
Maybe this causes the error in cclm-b2.o1032872:

 OPEN: bina-file: 
 /Research/CLIMATE/Giora/COSMO-CLM/cclm-sp_1.4/chain/work/b2001/restarts/lrfd199
 4010100o
  *** Restart: A default set for refatm parameters is used:            2
 CLOSING bina FILE
 OPEN: ncdf-file: lbff**000000.nc
 No such file or directory

Please attach the YUSPECIF and subchain files next time. They help in understanding the problem.
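A stray quote like the one in ydirini can be spotted by flagging lines that contain an odd number of single quotes. The sketch below is only a heuristic (an apostrophe in a comment would also be flagged), and the function name find_unbalanced_quotes is hypothetical.

```shell
# Hypothetical helper: print every line of a file that contains an odd number
# of single quotes, prefixed with its line number.
find_unbalanced_quotes() {
    # Splitting on the quote character yields (number of quotes + 1) fields,
    # so an even field count means an odd number of quotes on that line.
    awk -F"'" 'NF > 0 && NF % 2 == 0 { print NR ": " $0; bad = 1 }
               END { exit bad + 0 }' "$1"
}

# Assumed usage on the generated job file:
#   find_unbalanced_quotes cclmb2001.job
```

The function returns a non-zero status when at least one suspicious line is found, so it can also be used as a guard before submitting the job.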
