UKCA & UMUI Tutorial 2
Latest revision as of 10:18, 5 July 2013
Running an existing UKCA job
You will need to change a number of options within the UMUI to allow you to run this job successfully, such as your username, HECToR TIC code (if needed) etc. If you are using the MONSooN job you may also need to change the project group in
Model Selection -> Post Processing -> Main Switch + General Questions
if you want to send output data to the /nerc data disk (this is advisable). The NCAS-CMS UMUI Training Video will give you the minimum information that you need to be able to make these changes.
Task 2.1: Run an existing job
TASK 2.1: Take your copy of the Tutorial Base Job that you copied at the start of the exploring the UMUI tutorial, and make the required changes to allow this job to run. Once you have made these changes you can submit your job. First click Save, then Process, and once this has completed, click Submit. This will then extract the code from the FCM repositories and submit the job to the supercomputer. If you are running on MONSooN you will need to enter your passcode at this stage.
Note: To allow the jobs in this tutorial to run quickly this job is only set to run for 2 days. This means that no climate-mean files will be produced (see the what is STASH? tutorial), as these require run lengths of a month or more.
If you find that you are having problems running your job, you may have accidentally made changes to it when you were doing Task 1.2: Explore your new job. You can see if this is the case by differencing your job with the original a job that you copied (Tutorial: Base UM-UKCA Chemistry Job). To do this you first need to Search → Filter... for both your experiment and the UKCA Tutorial experiment, then go to Job → Difference and select Long. The only differences between these jobs should be in Model Selection → User Information and Submit Method → General details (called personal_gen in the difference window), where you have changed your user-name, email address, and TIC-code.
If you have had problems and have had to revert the job to the original a job, you may find that you need to clear all the directories that have been produced on the supercomputer. On HECToR you may need to remove the
/home/n02/n02/userid/um/jobid
/work/n02/n02/userid/um/jobid
directories, and on MONSooN you may need to remove the
/projects/group/userid/um/jobid
/nerc/group/userid/jobid
directories.
Sample output from this job can be found in
/work/n02/n02/ukca/Tutorial/sample_output/Base/
on HECToR, and in
/projects/ukca/Tutorial/sample_ouput/Base/
on MONSooN.
Checking the progress of a running job
Log in to the supercomputer, and check that your job is running. For HECToR do
qstat -u $USER
and for MONSooN do
llq -u $USER
This should give a list of your running jobs. For example, on HECToR you get output similar to
$ qstat -u $USER

sdb:
                                                       Req'd  Req'd   Elap
Job ID          Username Queue    Jobname    SessID NDS TSK Memory Time  S Time
--------------- -------- -------- ---------- ------ --- --- ------ ----- - -----
1515659.sdb     luke     par:8n_2 xipwx_run    7934   1   1    --  00:20 R 00:07
and on MONSooN you should get something like
$ llq -u $USER
Id                       Owner      Submitted   ST PRI Class        Running On
------------------------ ---------- ----------- -- --- ------------ -----------
mon001.64641.0           nlabra      6/5 12:36  R  50  parallel     c139

1 job step(s) in query, 0 waiting, 0 pending, 1 running, 0 held, 0 preempted
You can also check how far a job has gone while it is running. To do this you will need to cd into the job directory (this will be on your /work space on HECToR or your /projects space on MONSooN). When you do this, you will see something like this
$ ls
baserepos/  umatmos/    xipfa.astart    xipfa.stash    xipfaa.pa1993sep   xipfaa.pd1993sep   xipfaa.pg19930901
bin/        umrecon/    xipfa.list      xipfa.umui.nl  xipfaa.pb19930902  xipfaa.pe1993sep   xipfaa.pi19930901
pe_output/  umscripts/  xipfa.requests  xipfa.xhist    xipfaa.pc19930901  xipfaa.pf19930901  xipfaa.pj19930901
Now cd into the pe_output/ directory and do
$ tail -f jobid.fort6.pe0 | grep Atm_Step
Atm_Step: Timestep 137 Model time: 1993-09-02 21:40:00
Atm_Step: Timestep 138 Model time: 1993-09-02 22:00:00
Atm_Step: Timestep 139 Model time: 1993-09-02 22:20:00
Atm_Step: Timestep 140 Model time: 1993-09-02 22:40:00
Atm_Step: Timestep 141 Model time: 1993-09-02 23:00:00
Atm_Step: Timestep 142 Model time: 1993-09-02 23:20:00
Atm_Step: Timestep 143 Model time: 1993-09-02 23:40:00
Atm_Step: Timestep 144 Model time: 1993-09-03 00:00:00
(changing jobid as appropriate for your job).
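The Atm_Step lines shown above can also be read by a small script to report how far the run has progressed. The following is a minimal Python sketch, not part of the UM or the UMUI: the function names are hypothetical, and the total of 144 timesteps is an assumption based on a 2-day run with the 20-minute timestep seen in the sample output.

```python
import re
from datetime import datetime

# Matches lines such as:
#   Atm_Step: Timestep 144 Model time: 1993-09-03 00:00:00
ATM_STEP = re.compile(
    r"Atm_Step:\s+Timestep\s+(\d+)\s+Model time:\s+"
    r"(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})"
)


def latest_step(lines):
    """Return (timestep, model_time) from the last Atm_Step line, or None."""
    latest = None
    for line in lines:
        match = ATM_STEP.search(line)
        if match:
            model_time = datetime.strptime(match.group(2), "%Y-%m-%d %H:%M:%S")
            latest = (int(match.group(1)), model_time)
    return latest


def fraction_done(timestep, total_timesteps):
    """Fraction of the run completed, given the expected number of timesteps."""
    return timestep / total_timesteps


if __name__ == "__main__":
    # Assumption: run from inside pe_output/, with jobid replaced by your job id.
    with open("jobid.fort6.pe0") as pe0:
        found = latest_step(pe0)
    if found is None:
        print("No Atm_Step lines found yet")
    else:
        step, when = found
        # Assumption: 144 timesteps in total (2 days of 20-minute timesteps).
        print(f"Timestep {step} ({when}): {fraction_done(step, 144):.0%} done")
```

Unlike tail -f, this reads the file once and exits, so it needs to be re-run to see later timesteps.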
Viewing and extracting output
To take a look at the output, you will need to change into your archive directory. Once in this directory, run ls to see the file listing
$ ls
xipwaa.pb19930901
on HECToR, or
$ ls
xipfaa.pb19930901.pp
on MONSooN. The reason for this difference is that on MONSooN the post-processing is able to convert fieldsfiles to 32-bit .pp format, whereas it is not possible to do this on HECToR.
As you can see, there is only one file present, the "pb" file. This file is a daily file that has come from the UPB PP stream (standard PP files will be covered in more detail in the What is STASH? tutorial). To quickly view output you can use Xconv, which provides a simple data viewer. It can also be used to convert the UM format output files to netCDF.
You can open this file by
$ xconv -i xipfaa.pb19930901.pp
which will show the Xconv window, as can be seen in Figure 1. There is only one field present:
0 : 192 145 85 1 Stash code = 34001
This is the UKCA chemical ozone tracer (although it is not labeled as such by default). A full listing of all UKCA fields can be found in the listing of UKCA fields at UM8.2. More information will be given on STASH in the What is STASH? tutorial.
You can use Xconv to view certain fields. For example, you could view the surface ozone concentration by double-clicking on the Stash code = 34001 field and clicking the Plot data button (see Figure 2). While this is good for quickly checking data, the plotting functions are rather limited, as it is not possible to change e.g. the colour-bar or the scale, or to add a map projection. It is advisable to either export fields as netCDF from within Xconv, or to use another program, such as IDL (using the Met Office library) or Python (using either cf-python or Iris), which are able to read the UM PP/FieldsFile format directly.
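As an illustration of the Python route, the sketch below loads the ozone field with Iris and writes it out as netCDF. It is only a sketch under stated assumptions: Iris must be installed, the helper names stash_to_msi and load_stash_field are hypothetical, the input file name is the MONSooN example used above, and the output name ozone.nc is made up. Iris identifies fields by a STASH string of the form m01s34i001 (model 1, section 34, item 001), so the first helper converts the numeric code 34001 into that form.

```python
def stash_to_msi(stash_code, model=1):
    """Convert a numeric STASH code (e.g. 34001) to the 'm01s34i001' form."""
    section, item = divmod(stash_code, 1000)
    return f"m{model:02d}s{section:02d}i{item:03d}"


def load_stash_field(path, stash_code):
    """Load the single field with the given STASH code from a UM file."""
    # Imported here so that stash_to_msi can be used without Iris installed.
    import iris

    constraint = iris.AttributeConstraint(STASH=stash_to_msi(stash_code))
    return iris.load_cube(path, constraint)


if __name__ == "__main__":
    import iris

    # File name taken from the MONSooN example; change it to match your job.
    ozone = load_stash_field("xipfaa.pb19930901.pp", 34001)
    print(ozone.summary(shorten=True))
    # Hypothetical output file name.
    iris.save(ozone, "ozone.nc")
```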
To export fields as netCDF select them using the mouse (they should then highlight blue), enter a name for the netCDF file in the Output file name box (making sure that the Output format is Netcdf) and click the Convert button. The window on the bottom right will show the progress of the conversion. For single fields this is usually quite quick, but it is possible to use Xconv to open multiple files containing a series of times. In this case Xconv will combine all the individual times into a single field, and outputting this can take some time.
One issue you may have is that Xconv uses a quantity called the field code to determine the variable name of each field (the netCDF name attribute). For UKCA tracer fields at UM8.2 this code is the same for every tracer, so all variables will be called field1861. It is possible to change the short field name in Xconv prior to outputting a netCDF file. Select the variable you wish to output and select the Names button on the top-right of the Xconv window. Delete the contents of the short field name box and replace it with what you would like, e.g. for ozone (Stash code 34001) you may wish to use the CF standard name mass_fraction_of_ozone_in_air (as the units of UKCA tracers are kg(species)/kg(air)). Then click Apply and output the field as normal. When running ncdump on the resultant netCDF file you should see something like
float mass_fraction_of_ozone_in_air(t, hybrid_ht, latitude, longitude) ;
    mass_fraction_of_ozone_in_air:source = "Unified Model Output (Vn 8.2):" ;
    mass_fraction_of_ozone_in_air:name = "mass_fraction_of_ozone_in_air" ;
    mass_fraction_of_ozone_in_air:title = "Stash code = 34001" ;
    mass_fraction_of_ozone_in_air:date = "01/09/91" ;
    mass_fraction_of_ozone_in_air:time = "00:00" ;
    mass_fraction_of_ozone_in_air:long_name = "Stash code = 34001" ;
    mass_fraction_of_ozone_in_air:units = " " ;
    mass_fraction_of_ozone_in_air:missing_value = 2.e+20f ;
    mass_fraction_of_ozone_in_air:_FillValue = 2.e+20f ;
    mass_fraction_of_ozone_in_air:valid_min = 5.58807e-09f ;
    mass_fraction_of_ozone_in_air:valid_max = 1.861871e-05f ;
Once you have your data as netCDF it is then possible to use any standard visualisation or processing package to view and manipulate the data.
.leave Files
The text output from any write statements within the code, and the information about compilation, is written to several files with the extension .leave. These will be placed in your $HOME/um/umui_out directory on HECToR, or in your $HOME/output directory on MONSooN.
You will have three .leave files, one for the compilation, one for the reconfiguration step (if run), and one for the UM itself. By default for climate runs these will all have a common format, starting with 4 blocks of letters and numbers, like this:
xipfa000.xipfa.d13163.t120017
where this breaks down to
jobidXXX | e.g. xipfa000 | The jobid of the job, followed by the job-step number. For compilation and reconfiguration jobs, this will be 000, but as the CRUN progresses this number will increment by 1 for each step, and then cycle round back through 000 (if you run more than 999 steps). |
jobid | e.g. xipfa | The jobid of the job as listed in the UMUI. |
dXXXXX | e.g. 13163 | The year (the last two digits, i.e. 2013 is 13) and the day of the year as 3 digits (i.e. 001-366, so this file was created on the 12th June (day 163)). |
tXXXXXX | e.g. 120017 | The time in HHMMSS format, as recorded by the system clock on the supercomputer. |
Using this format, this means that the file was created on the 12th June 2013 at 12:00:17. Note that the timestamp on the file will be later than this, as the name records the time the file was created, not the time that it was last written to.
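The naming scheme described above can be decoded mechanically. The following Python sketch is one possible way to do it, with a hypothetical function name parse_leave_name. It assumes only the four dot-separated blocks described in the table, optionally followed by one of the .leave extensions.

```python
from datetime import datetime

# Maps the trailing extension to the kind of output the file holds.
KINDS = {
    "comp.leave": "compilation",
    "rcf.leave": "reconfiguration",
    "leave": "model",
}


def parse_leave_name(name):
    """Split a .leave file name into jobid, job-step, creation time and kind."""
    parts = name.split(".")
    if len(parts) < 4:
        raise ValueError(f"unexpected .leave file name: {name!r}")
    jobid_step, jobid, date_block, time_block = parts[:4]
    if (not jobid_step.startswith(jobid)
            or not date_block.startswith("d")
            or not time_block.startswith("t")):
        raise ValueError(f"unexpected .leave file name: {name!r}")
    # %y is the two-digit year and %j the three-digit day of the year.
    created = datetime.strptime(
        f"{date_block[1:]} {time_block[1:]}", "%y%j %H%M%S"
    )
    return {
        "jobid": jobid,
        "step": int(jobid_step[len(jobid):]),
        "created": created,
        "kind": KINDS.get(".".join(parts[4:])),  # None if no known extension
    }


if __name__ == "__main__":
    info = parse_leave_name("xipfa000.xipfa.d13163.t120017.comp.leave")
    print(info["jobid"], info["step"], info["created"], info["kind"])
```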
There are then three file extensions: .comp.leave for compilation output, .rcf.leave for reconfiguration output, and .leave for the model output.
It is often easier to list your files in this directory by date, using ls -ltr.
Compilation Output (.comp.leave)
This gives the output from either the XLF compiler on MONSooN or the Cray compiler on HECToR. If the compilation step has an error and the code is not compiled you can find the source of the error by opening this file and searching for failed - this will highlight which routine(s) caused the problem. You may also get more detailed information such as the line number which had the error. In this case you can open the file on the supercomputer and view the line, as the line number given will not match with the line in your working directory on PUMA due to merging source code and the use of include files. Remember to make any required changes to your PUMA source code however!
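The search for failed can also be scripted, which is convenient when the file is long. This is a minimal Python sketch with a hypothetical helper name, find_failures; the file name in the example is the one used in this tutorial with the .comp.leave extension added, and should be replaced with your own.

```python
def find_failures(lines, keyword="failed"):
    """Return (line_number, text) for every line containing the keyword."""
    return [
        (number, line.rstrip("\n"))
        for number, line in enumerate(lines, start=1)
        if keyword in line
    ]


if __name__ == "__main__":
    # Example file name only; use the name of your own .comp.leave file.
    with open("xipfa000.xipfa.d13163.t120017.comp.leave") as leave_file:
        for number, text in find_failures(leave_file):
            print(f"{number}: {text}")
```

The line numbers printed here refer to lines of the .comp.leave file itself, not to lines of the source code.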
Reconfiguration Output (.rcf.leave)
This gives output from the reconfiguration step, if run. At older UM versions, such as UM7.3, this output was found in the model output .leave file.
Model Output (.leave)
This gives output from the code which is generated as it is running, although this file is only updated and closed when the job finishes. To view the output generated as it is running you will need to see the output in the pe_output/ directory mentioned above.
To run efficiently the UM is split into many domains, which communicate with each other using parallel calls during runtime. The exact decomposition is defined in Model Selection → User Information and Submit Method → Job submission method, in the number of processes East-West and North-South boxes. If you have a 16x16 decomposition there will be 256 processes, running on 256 cores of the supercomputer (4 nodes of MONSooN, 8 nodes of HECToR). These processes will be numbered internally from 0 → 255, labelled as PE0 to e.g. PE255. Only the output from PE0 will be sent to the .leave file, with output from the other PEs only held in the pe_output/ directory. Whether or not these files are deleted at the end of a run is set in the UMUI in the Model Selection → Input/Output Control and Resources → Output management panel. If your run fails then these files will not be deleted.
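The arithmetic behind the decomposition is simple enough to sketch in Python. The function names here are hypothetical, and the per-PE file name pattern is an assumption extrapolated from the jobid.fort6.pe0 file used earlier in this tutorial.

```python
def pe_count(east_west, north_south):
    """Number of processes for a given East-West x North-South decomposition."""
    return east_west * north_south


def pe_labels(east_west, north_south):
    """Labels PE0 ... PE(n-1) for every process in the decomposition."""
    return [f"PE{n}" for n in range(pe_count(east_west, north_south))]


def pe_output_file(jobid, pe):
    """Assumed name of the output file for one PE in pe_output/."""
    return f"{jobid}.fort6.pe{pe}"


if __name__ == "__main__":
    labels = pe_labels(16, 16)
    print(pe_count(16, 16), labels[0], labels[-1])
    print(pe_output_file("xipfa", 0))
```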
While there is a lot of information outputted to the .leave file, and you would usually only read it if the job fails, it is worth going through the messages, making a special note of any warnings.
Written by Luke Abraham 2013