Supplementary Job SJ1.0

This page documents the vn8.4 GA4.0 N216L85 supplementary job SJ1.0 (xjgvb/c on MONSooN and xjgve/? on ARCHER).

This job has been developed by Luke Abraham with help from several others.

It is a continuation of GA4.0 N216 Development - more information can be found on that page.

Base Model

The base atmosphere model used here is the GA4.0 configuration. More information on GA4.0 development can be found on the Global Atmosphere 4.0/Global Land 4.0 documentation pages (password required). A GMD paper documenting this model is also available.

The configuration is based on the Met Office job anenj (via MONSooN job xhmaj) which is derived from amche (the standard GA4.0 N96L85 interactive dust model) via

vn8.4 N216 development
amche (base vn8.4 GA4.0 N96L85 job) owned by Dan Copsey
xipvp (MONSooN) owned by Jeremy Walton
  xjgva (made N216 by comparing aliur to ajthw) owned by Luke Abraham
   xjgvb initialised from the 1999-12-01 dump from xjgva, and made a TS2000 set-up (cf. RJ4.0 settings)
    xjgve ported to ARCHER

reference vn8.0 jobs

  • N96

aliur (base vn8.0 GA4.0 N96L85 job)
xipvo (MONSooN) owned by Jeremy Walton

  • N216

ajthw (base vn8.0 GA4.0 N216L85 job)
xfthi (MONSooN) owned by Oliver Derbyshire
  xfaym owned by Malcolm Roberts
    xfaym owned by Malcolm Roberts
      xhcea owned by Jane Mulcahy

Variations

TS2000 with SPARC Ozone

Job xjgvb (MONSooN) was spun off from xjgva on 1999-12-01 and run for 10 years with TS2000 settings (as per RJ4.0 settings). Job xjgve (ARCHER) is a direct copy of this job. Job xjgvf is a copy of xjgve, but initialised from the 2009-12-01 dump (with fixes) from xjgvb.

TS2000 with UKCA Ozone

Job xjgvc (MONSooN) was spun off from xjgva on 1999-12-01 and run for 10 years with TS2000 settings (as per RJ4.0 settings), using an Ozone climatology created from xkawa. Job xjgv? (ARCHER) is a direct copy of this job. Job xjgv! is a copy of xjgv?, but initialised from the 2009-12-01 dump (with fixes) from xjgvc.

Problems with initial dump

The original xjgva/b runs on MONSooN had problems in their initial dump files: the m01s00i278 (MEAN WATER TABLE DEPTH M) and m01s00i281 (SATURATION FRAC IN DEEP LAYER) fields contained NaNs, which had been present in previous dumps for a while. To remove them it was necessary to extract these fields, set the NaNs to the _FillValue/missing_value, and then build ancillary files from the result. This appears to have fixed the problem.
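The NaN replacement step described above can be sketched as follows. This is a minimal illustration on a synthetic array, not the procedure actually used for these dumps. The function name and the value of FILL_VALUE are assumptions; the real value should be read from the _FillValue/missing_value attribute of the extracted field.

```python
import numpy as np

# Assumed missing-data indicator (hypothetical here): take the real value from
# the _FillValue/missing_value attribute of the extracted field.
FILL_VALUE = -1073741824.0


def replace_nans(field, fill_value=FILL_VALUE):
    """Return a copy of `field` with every NaN replaced by `fill_value`."""
    field = np.asarray(field, dtype=np.float64)
    return np.where(np.isnan(field), fill_value, field)


# Synthetic 2 x 3 stand-in for a field such as m01s00i278.
water_table = np.array([[1.5, np.nan, 2.0],
                        [np.nan, 3.25, 0.5]])
fixed = replace_nans(water_table)

print("NaNs before:", int(np.isnan(water_table).sum()))
print("NaNs after: ", int(np.isnan(fixed).sum()))
```

Valid points are left untouched, so only the NaN locations change between the original and the fixed field.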

More information can be found on NCAS-CMS ticket #1622.

Because of this issue, the restart-dumps located in /work/n02/n02/ukca/ANCILS/ASTART have a corresponding _Fixes.anc ancillary file that needs to be used (as the values of these fields are different between dumps):

xjgvaa.da19991201_00 (SPARC Ozone)
xjgvaa.da19991201_00_Fixes.anc

xjgvba.da20091201_00 (SPARC Ozone)
xjgvba.da20091201_00_Fixes.anc

xjgvca.da20091201_00 (UKCA Ozone)
xjgvca.da20091201_00_Fixes.anc

The user pre-STASHmaster file that should be used to add these is found here:

/home/ukca/userprestash/VN8.4/N216_problem_fields.vn8.4

Checking the xjgvaa.da19991201_00 file gives the following:

/work/n02/n02/hum/vn8.4/cce/utils/cumf xjgvaa.da19991201_00 xjgvaa.da19991201_00

  COMPARE - SUMMARY MODE
 -----------------------
  
Number of fields in file 1 = 11241
Number of fields in file 2 = 11241
Number of fields compared  = 11241
  
FIXED LENGTH HEADER:        Number of differences =       0
INTEGER HEADER:             Number of differences =       0
REAL HEADER:                Number of differences =       0
LEVEL DEPENDENT CONSTANTS:  Number of differences =       0
LOOKUP:                     Number of differences =       0
DATA FIELDS:                Number of fields with differences =       2

Field  3675 : Stash Code   278 : MEAN WATER TABLE DEPTH            M  : Number of differences =       31

Field  3678 : Stash Code   281 : SATURATION FRAC IN DEEP LAYER        : Number of differences =       32
 files DO NOT compare

(cumf-ing a file with itself will highlight any NaNs in the data that will cause problems)
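The reason a self-comparison can flag NaNs is that, under IEEE 754 arithmetic, a NaN is never equal to anything, including itself. The sketch below shows this with NumPy. It assumes that cumf compares data values numerically rather than bit for bit, which is not stated on this page, and the function name is illustrative only.

```python
import numpy as np


def count_self_differences(field):
    """Count points where an array differs from itself, element by element."""
    field = np.asarray(field, dtype=np.float64)
    # NaN != NaN evaluates to True, so only NaN points are counted.
    return int(np.count_nonzero(field != field))


clean = np.array([0.0, 1.5, -2.0])
broken = np.array([0.0, np.nan, -2.0, np.nan])

print(count_self_differences(clean))   # no NaNs, so no differences
print(count_self_differences(broken))  # one difference per NaN
```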

Checking the .astart file produced gives the following:

/work/n02/n02/hum/vn8.4/cce/utils/cumf xjgve.astart xjgve.astart

   COMPARE - SUMMARY MODE
 -----------------------
  
Number of fields in file 1 =  4315
Number of fields in file 2 =  4315
Number of fields compared  =  4315
  
FIXED LENGTH HEADER:        Number of differences =       0
INTEGER HEADER:             Number of differences =       0
REAL HEADER:                Number of differences =       0
LEVEL DEPENDENT CONSTANTS:  Number of differences =       0
LOOKUP:                     Number of differences =       0
DATA FIELDS:                Number of fields with differences =       0
 files compare, ignoring Fixed Length Header

Scaling (ARCHER)

Each compute node contains two 2.7 GHz, 12-core Ivy Bridge processors, which can support 2 hardware threads, and there is 64GB of RAM per node. Due to memory restrictions, the N216L85 configuration is unable to run on fewer than 8 nodes (so the 8-node, 20-minute short queue is available for debugging). All simulations used 2 OpenMP threads with 12 MPI tasks per node (halving the number of MPI tasks per node relative to one task per core).

Scaling tests have been done from 8 to 24 nodes of ARCHER with a series of 2-day runs, with the results presented below.

[Figure: ARCHER timings PPE N216.png]

From these tests, it is recommended to use 12 nodes with a 12EW x 12NS domain decomposition. The suggested job-step is 4 months of model time with an 8-hour wall-clock limit (the step should take around 7 hours to run). The cost for 1 model year is just under 92kAU.
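The arithmetic behind this recommendation can be checked with a short sketch. The node count, tasks per node, thread count, decomposition and job-step length come from the text above. The charging rate KAU_PER_NODE_HOUR is an assumption (0.36 kAU per node-hour) and should be checked against the ARCHER documentation, and the 7-hour run time is the approximate figure quoted above.

```python
NODES = 12
MPI_TASKS_PER_NODE = 12
OMP_THREADS = 2
EW, NS = 12, 12                  # recommended domain decomposition

# The decomposition must supply exactly one subdomain per MPI task.
mpi_tasks = NODES * MPI_TASKS_PER_NODE
assert mpi_tasks == EW * NS

# 144 tasks x 2 threads = 288 cores, i.e. all 24 cores on each of 12 nodes.
cores_used = mpi_tasks * OMP_THREADS

# Assumptions: charging rate and the approximate run time per job-step.
KAU_PER_NODE_HOUR = 0.36
HOURS_PER_STEP = 7.0
STEPS_PER_YEAR = 12 // 4         # 4-month job-steps

cost_kau = NODES * HOURS_PER_STEP * STEPS_PER_YEAR * KAU_PER_NODE_HOUR
print(f"{mpi_tasks} MPI tasks on {cores_used} cores")
print(f"approximate cost per model year: {cost_kau:.1f} kAU")
```

With exactly 7 hours per step this estimate comes out slightly below the quoted figure, which is consistent with each step taking a little over 7 hours.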

cumf tests (ARCHER)

It is useful to perform cumf tests to ensure that the model behaves as expected. The utility can be found at /work/n02/n02/hum/vn8.4/cce/utils/cumf.

Climate meaning was still on for these tests, so the temporary partial sums will still exist in the dumps. These are expected to be different between some of the tests.

It is surprising that a number of tests fail, specifically the NRUN-CRUN test and the change of dump frequency test, as these tests usually pass. It is also surprising that the change of domain decomposition test and CRUN-NRUN test also fail for this AMIP configuration.

However, it is still possible to run this simulation as a single continuous run, and be confident in the results.
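When many comparisons are run, it can help to reduce each cumf summary to a pass or fail. The sketch below assumes the summary text has already been captured as a string (how cumf writes its output is not covered here) and that it contains a "DATA FIELDS" line in the format shown in the listings on this page. The function names are illustrative only.

```python
import re


def fields_with_differences(summary_text):
    """Return the DATA FIELDS difference count from a cumf summary, or None."""
    match = re.search(
        r"DATA FIELDS:\s+Number of fields with differences\s*=\s*(\d+)",
        summary_text,
    )
    return int(match.group(1)) if match else None


def dumps_compare(summary_text):
    """True if the summary reports no data fields with differences."""
    count = fields_with_differences(summary_text)
    if count is None:
        raise ValueError("no DATA FIELDS line found in cumf summary")
    return count == 0


passing = """LOOKUP:                     Number of differences =       0
DATA FIELDS:                Number of fields with differences =       0
 files compare, ignoring Fixed Length Header"""

failing = """LOOKUP:                     Number of differences =       0
DATA FIELDS:                Number of fields with differences =    8274"""

print(dumps_compare(passing))
print(dumps_compare(failing))
```

Differences in the fixed length header are deliberately ignored here, matching the "files compare, ignoring Fixed Length Header" verdict that cumf itself reports.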

NRUN-NRUN tests (ARCHER)

For this test the model is run twice. The 2-day dumps are then compared. For this test it makes no difference if you compare the dumps produced using daily-dumping or 2-day dumping when comparing to the equivalent dump produced using the same dumping frequency. In this case, the dumps compare.

  COMPARE - SUMMARY MODE
 -----------------------
  
Number of fields in file 1 = 10884 
Number of fields in file 2 = 10884
Number of fields compared  = 10884
  
FIXED LENGTH HEADER:        Number of differences =       3
INTEGER HEADER:             Number of differences =       0
REAL HEADER:                Number of differences =       0
LEVEL DEPENDENT CONSTANTS:  Number of differences =       0
LOOKUP:                     Number of differences =       0
DATA FIELDS:                Number of fields with differences =       0
 files compare, ignoring Fixed Length Header

This means that there is NOT a fundamental problem with the model (e.g. uninitialised memory). It will give you the same results if run a second time with everything else the same.

NRUN-CRUN tests (ARCHER)

For this test the 2nd day dump of a daily dumping run (which has been run in a single jobstep) is compared with the 2nd day dump of a run where this was produced on a CRUN step (i.e. where the 1st day dump was produced on the NRUN step). In this case, the dumps DO NOT compare.

  COMPARE - SUMMARY MODE
 -----------------------
  
Number of fields in file 1 = 10884
Number of fields in file 2 = 10884
Number of fields compared  = 10884
  
FIXED LENGTH HEADER:        Number of differences =       3
INTEGER HEADER:             Number of differences =       0
REAL HEADER:                Number of differences =       3
LEVEL DEPENDENT CONSTANTS:  Number of differences =       0
LOOKUP:                     Number of differences =       0
DATA FIELDS:                Number of fields with differences =    8274

Field     1 : Stash Code     2 : U COMPNT OF WIND AFTER TIMESTEP      : Number of differences =   140400
...

This means that it is NOT possible to have job-steps of different lengths with reproducible results.
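The difference count of 140400 reported for field 1 can be put in context with a little arithmetic. The sketch assumes that this field sits on an N216 grid of 2 x 216 = 432 longitudes by 1.5 x 216 + 1 = 325 latitude rows; these dimensions are an assumption and are not stated on this page.

```python
N = 216
N_LON = 2 * N               # assumed number of longitudes: 432
N_LAT = (3 * N) // 2 + 1    # assumed number of latitude rows: 325
REPORTED_DIFFERENCES = 140400

points_per_level = N_LON * N_LAT
fraction_different = REPORTED_DIFFERENCES / points_per_level

print(points_per_level, fraction_different)
```

If the assumption holds, the two dumps differ at every grid point of that field rather than in an isolated region.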

CRUN-CRUN tests (ARCHER)

For this test the model is run a second time as NRUN-CRUN jobsteps. The 2-day dump from the 1st NRUN-CRUN test (which was run in a single step) is then compared with this newly generated 2-day dump from the 2nd CRUN. This test bit-compares.

  COMPARE - SUMMARY MODE
 -----------------------
  
Number of fields in file 1 = 10884
Number of fields in file 2 = 10884
Number of fields compared  = 10884
  
FIXED LENGTH HEADER:        Number of differences =       3
INTEGER HEADER:             Number of differences =       0
REAL HEADER:                Number of differences =       0
LEVEL DEPENDENT CONSTANTS:  Number of differences =       0
LOOKUP:                     Number of differences =       0
DATA FIELDS:                Number of fields with differences =       0
 files compare, ignoring Fixed Length Header

This means that it is possible to re-run a job-step, and assuming that the dump frequency is the same (and that they started from the same dump), then the results will be reproducible.

CRUN-NRUN tests (ARCHER)

For this test the model is run a second time as an NRUN from the 1st dump produced in a 2-day NRUN (or NRUN-CRUN) step, which uses 1-day dumping. The 2nd-day dump from the 1st NRUN-NRUN/CRUN test is then compared with this newly generated 2nd-day dump from the 2nd NRUN. This test DOES NOT bit-compare.

  COMPARE - SUMMARY MODE
 -----------------------
  
Number of fields in file 1 = 10884
Number of fields in file 2 = 10884
Number of fields compared  = 10884
  
FIXED LENGTH HEADER:        Number of differences =       3
INTEGER HEADER:             Number of differences =       0
REAL HEADER:                Number of differences =       0
LEVEL DEPENDENT CONSTANTS:  Number of differences =       0
LOOKUP:                     Number of differences =    1284
DATA FIELDS:                Number of fields with differences =    1238

Field  4316 : Stash Code     2 : U COMPNT OF WIND AFTER TIMESTEP      : Number of differences =   140400
...

This means that it is NOT possible to make a new job, continue it as an NRUN from an existing job, and get the same results.

change of dump frequency test (ARCHER)

In this test a 2-day long run is performed with daily dumping (in a single job-step), and then a 2-day run is performed with 2-day dumping. In this case, these dumps DO NOT bit-compare.

  COMPARE - SUMMARY MODE
 -----------------------
  
Number of fields in file 1 = 10884
Number of fields in file 2 = 10989
Number of fields compared  = 10884
Number of fields from file 2 omitted from comparison =   105
  
FIXED LENGTH HEADER:        Number of differences =       4
INTEGER HEADER:             Number of differences =       0
REAL HEADER:                Number of differences =       3
LEVEL DEPENDENT CONSTANTS:  Number of differences =       0
LOOKUP:                     Number of differences =   17454
DATA FIELDS:                Number of fields with differences =    8438

Field     1 : Stash Code     2 : U COMPNT OF WIND AFTER TIMESTEP      : Number of differences =   140400
...

This means that on ARCHER, if you need to change the dump frequency for some reason at some point during a run, the run will NOT bit-compare to a run where this was not done. You should therefore try to maintain the 10-day dumping frequency.

If the STASH in the model is turned off, then the dumps do bit-compare:

  COMPARE - SUMMARY MODE
 -----------------------
  
Number of fields in file 1 =  4315
Number of fields in file 2 =  4315
Number of fields compared  =  4315
  
FIXED LENGTH HEADER:        Number of differences =       3
INTEGER HEADER:             Number of differences =       0
REAL HEADER:                Number of differences =       0
LEVEL DEPENDENT CONSTANTS:  Number of differences =       0
LOOKUP:                     Number of differences =       0
DATA FIELDS:                Number of fields with differences =       0
 files compare, ignoring Fixed Length Header

For these runs, the NaNs in the initial fields had also been reset by extrapolating over missing data. Even with this change, running with STASH still meant that the model did not bit-compare.

Re-doing the same test with STASH removed, but without this extrapolation (i.e. with the reset fields left as missing data), shows that these dumps also DO NOT bit-compare:

  COMPARE - SUMMARY MODE
 -----------------------
  
Number of fields in file 1 =  4315
Number of fields in file 2 =  4315
Number of fields compared  =  4315
  
FIXED LENGTH HEADER:        Number of differences =       4
INTEGER HEADER:             Number of differences =       0
REAL HEADER:                Number of differences =       3
LEVEL DEPENDENT CONSTANTS:  Number of differences =       0
LOOKUP:                     Number of differences =       0
DATA FIELDS:                Number of fields with differences =    2927

Field     1 : Stash Code     2 : U COMPNT OF WIND AFTER TIMESTEP      : Number of differences =   140400
...

So there are two things going on. The first is that the initial fields which have been reset cannot be set to missing data, but instead need to be set to another value (in this case, extrapolated over missing data). The second is that some of the STASH requests are causing a change to the model evolution.
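One simple way to extrapolate over missing data is to copy the value of the nearest valid point. The sketch below is an illustration only and is not necessarily the method used for these runs. It measures distance in grid-index space, ignores longitude wrap-around and spherical geometry, and uses a brute-force search suited to small arrays. The function name and the MISSING value are hypothetical.

```python
import numpy as np

MISSING = -1.0e30  # hypothetical missing-data value for this example


def fill_from_nearest(field, missing_value=MISSING):
    """Replace NaN or missing points with the nearest valid point's value."""
    field = np.asarray(field, dtype=np.float64)
    missing = np.isnan(field) | (field == missing_value)
    if not missing.any():
        return field.copy()
    valid_idx = np.argwhere(~missing)            # shape (n_valid, 2)
    if valid_idx.size == 0:
        raise ValueError("field has no valid points to extrapolate from")
    filled = field.copy()
    for i, j in np.argwhere(missing):
        # Squared index-space distance to every valid point.
        dist2 = (valid_idx[:, 0] - i) ** 2 + (valid_idx[:, 1] - j) ** 2
        # Ties are resolved in favour of the first valid point found.
        ni, nj = valid_idx[np.argmin(dist2)]
        filled[i, j] = field[ni, nj]
    return filled


sample = np.array([[1.0, 2.0, MISSING, np.nan],
                   [4.0, 5.0, MISSING, MISSING]])
print(fill_from_nearest(sample))
```

Every point in the result holds a physically plausible value, which is the property the reset initial fields need according to the tests above.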

change of domain decomposition test (ARCHER)

In this test a 2-day long run is performed with daily dumping (in a single job-step), and then an equivalent job is performed, but with a different domain decomposition. For this test, a 16EW x 12NS decomposition was compared with a 12EW x 8NS decomposition. In this case, these dumps DO NOT bit-compare.

  COMPARE - SUMMARY MODE
 -----------------------
  
Number of fields in file 1 = 10884
Number of fields in file 2 = 10884
Number of fields compared  = 10884
  
FIXED LENGTH HEADER:        Number of differences =       4
INTEGER HEADER:             Number of differences =       0
REAL HEADER:                Number of differences =       3
LEVEL DEPENDENT CONSTANTS:  Number of differences =       0
LOOKUP:                     Number of differences =       0
DATA FIELDS:                Number of fields with differences =    8409

Field     1 : Stash Code     2 : U COMPNT OF WIND AFTER TIMESTEP      : Number of differences =   140400
...

This means that on ARCHER, if you need to change the domain decomposition for any reason, the jobs will not bit-compare.
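Changing the domain decomposition also changes the number of nodes a job needs. The sketch below relates the decompositions mentioned on this page to node counts, assuming one MPI task per subdomain and the 12 MPI tasks per node described in the Scaling (ARCHER) section. The function name is illustrative only.

```python
MPI_TASKS_PER_NODE = 12  # from the Scaling (ARCHER) section


def nodes_needed(ew, ns, tasks_per_node=MPI_TASKS_PER_NODE):
    """Return (MPI tasks, whole nodes) for an EW x NS decomposition."""
    tasks = ew * ns                       # assumes one MPI task per subdomain
    nodes = -(-tasks // tasks_per_node)   # ceiling division
    return tasks, nodes


for ew, ns in [(16, 12), (12, 8), (12, 12)]:
    tasks, nodes = nodes_needed(ew, ns)
    print(f"{ew}EW x {ns}NS: {tasks} MPI tasks on {nodes} nodes")
```

Under these assumptions the two decompositions compared in this test correspond to 16 and 8 nodes, and the recommended 12EW x 12NS decomposition to 12 nodes.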

Contributions and Acknowledgements

Luke Abraham would like to thank the following people (in no particular order) for their help in creating this job:

  • Jeremy Walton (Met Office)
  • Malcolm Roberts (Met Office)
  • Jane Mulcahy (Met Office)
  • Maria Russo (NCAS, University of Cambridge)
  • Marie-Estelle Demory (NCAS, University of Reading)
  • NCAS-CMS