Monsoon users MASS data


Management of Monsoon users' MASS data


The UKCA project on the Monsoon HPC system has been active for almost 10 years and, at the last count (Dec 2020), had archived nearly 6 petabytes of data to the MASS tape system. A significant part of this data becomes redundant as time progresses. In some instances the archived data is from test simulations and should have been deleted as soon as its purpose was fulfilled. In other cases the data is no longer relevant because the configuration or settings used are now outdated. In yet other cases any value that could be derived from the data (for PhD theses, publications, projects) has already been obtained, so the data should be safe to delete.
A related issue is that until late 2017 the MASS system was configured to store all data in duplex (two-copy) form by default, possibly to reduce the chance of data loss from physical damage to the tapes or drives. Any (now redundant) data generated before that time therefore occupies double the amount of space.
One limitation for Monsoon has been that any data, once archived to MASS, is owned by the Principal Investigator (PI) on the Met Office side, and the data creators themselves have no permission to delete it. This is not expected to change any time soon, so a central method of data management becomes necessary.

Current status

As of now (Dec 2019), all data generated by UKCA (and the UKCA subprojects) is owned by Colin Johnson as the original PI for the project. Following a general activity to review data held on MASS and optimise the demand for tapes, an attempt was made in January 2018 to attribute at least the datasets resulting from UMUI runs to their original creators (i.e. the owners of these jobs) and to ask them to review the need to hold the data. This was done by cross-referencing the datasets on MASS that are owned by Colin and have an x prefix (the recommended job-id group for Monsoon runs) with the UMUI database on the PUMA system. An e-mail was subsequently sent to the users concerned with the job/data details (approx. 2050 job-ids), requesting that they review the datasets and report back to Colin those that could be deleted. It is not clear whether any data deletion was requested as a result of the e-mail.
One thing observed while using the PUMA-UMUI database to trace the ownership is that a not-insignificant number of UMUI jobs which have archived data to MASS no longer exist in the UMUI database, indicating that the original setup has been deleted by the user.
We now have some ability to query the MOOSE database and correlate it with the UMUI/Rose databases to obtain basic information such as job owner (creator), title, data size and date of last access, to help identify model output that can be safely removed.
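
The correlation step can be illustrated with a minimal Python sketch. It assumes the MASS listing and the UMUI/Rose records have already been exported into simple in-memory records; the class names, field names and job-ids below are hypothetical and do not reflect the actual MOOSE, UMUI or rosie interfaces.

```python
from dataclasses import dataclass
from datetime import date
from typing import Dict, List, Optional


@dataclass
class MassDataset:
    """One archived dataset, as exported from a MASS listing (hypothetical layout)."""
    job_id: str
    size_gb: float
    last_access: date


@dataclass
class JobRecord:
    """One job entry, as exported from the UMUI/Rose database (hypothetical layout)."""
    job_id: str
    owner: str
    title: str


def correlate(datasets: List[MassDataset], jobs: List[JobRecord]) -> List[Dict]:
    """Attach owner and title to each dataset by matching on job-id.

    Datasets whose job no longer exists keep owner/title as None so they
    can be picked out later.
    """
    by_id = {job.job_id: job for job in jobs}
    rows = []
    for ds in datasets:
        job: Optional[JobRecord] = by_id.get(ds.job_id)
        rows.append({
            "job_id": ds.job_id,
            "owner": job.owner if job else None,
            "title": job.title if job else None,
            "size_gb": ds.size_gb,
            "last_access": ds.last_access,
        })
    return rows


# Example with made-up job-ids and users.
datasets = [
    MassDataset("xabcd", 120.0, date(2015, 3, 1)),
    MassDataset("xzzzz", 40.0, date(2012, 6, 1)),
]
jobs = [JobRecord("xabcd", "auser", "Test run")]
rows = correlate(datasets, jobs)
orphans = [row["job_id"] for row in rows if row["owner"] is None]
```

Here `orphans` collects the datasets whose originating job can no longer be traced, which is the situation described above for deleted UMUI jobs.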

Following Colin's retirement from his Met Office role in 2018, the issue of transferring the ownership of this data has arisen. This is also an opportunity to review the data holdings and trim data that is not required.

Proposed Data Management Strategy

  • Ownership
    • Given the legacy MASS/MOOSE requirement that each Monsoon project has a Met Office point-of-contact or owner, the Monsoon-UKCA data on MASS will be split into two categories:
      • data originating from Institutes with a mainly chemistry interest (UKCA-CAM, -ED, -IMP, -LAN, -CSIRO), tentatively owned by Fiona O'Connor;
      • data from aerosol-interest Institutes (UKCA-LEEDS, -OX, -READ, -EX), tentatively owned by Adrian Hill.
    • However, given the wide range of simulations carried out on Monsoon under the UKCA project, for effective data management each Institute will need to nominate one or more point(s)-of-contact who can coordinate between the data creators and the Met Office owners or points-of-contact.
  • Legacy Data
    • A list will be generated for UKCA data on MASS, initially for UMUI-derived datasets, containing information on: Creator | Title/Description | Date of Creation | Date of Last Access
    • The Job-title/Description is expected to help in understanding the original purpose of the simulation and thus in reviewing its necessity.
    • Based on the above information:
      • MASS datasets where the original UMUI job no longer exists will be marked for immediate deletion.
      • Creators of datasets that have not been accessed within the last 5 (TBC) years will be approached to review the holdings, with a response time-limit.
      • Datasets recommended by original creators for deletion, as well as those for which no response is received, will be marked for deletion.
    • After reviewing the UMUI (x-) jobs, Rose suites will be targeted. Here the ownership information is more easily available through the rosie database.
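
The legacy-data rules above can be read as a small decision procedure. The Python sketch below is one possible reading, not an agreed implementation: the function name, the `reply` values and the `deadline_passed` flag are hypothetical, and the 5-year threshold is still marked TBC in the strategy.

```python
from datetime import date, timedelta
from enum import Enum
from typing import Optional


class Action(Enum):
    DELETE = "mark for deletion"
    ASK_CREATOR = "ask creator to review"
    KEEP = "keep"


STALE_YEARS = 5  # TBC in the strategy; treated here as an adjustable assumption


def triage(job_exists: bool,
           last_access: date,
           today: date,
           reply: Optional[str] = None,
           deadline_passed: bool = False) -> Action:
    """Decide what to do with one dataset.

    reply is the creator's answer ("delete" or "keep"), or None if no
    answer has been received yet.
    """
    # Rule 1: the original job no longer exists.
    if not job_exists:
        return Action.DELETE
    # Rule 2: only datasets not accessed recently are reviewed.
    # A year is approximated as 365 days, which is enough for a sketch.
    cutoff = today - timedelta(days=365 * STALE_YEARS)
    if last_access >= cutoff:
        return Action.KEEP
    # Rule 3: follow the creator's recommendation, or delete once the
    # response time-limit has passed without a reply.
    if reply == "delete":
        return Action.DELETE
    if reply == "keep":
        return Action.KEEP
    return Action.DELETE if deadline_passed else Action.ASK_CREATOR


# Example: an old dataset whose creator has not yet been contacted.
today = date(2021, 1, 1)
decision = triage(True, date(2010, 1, 1), today)
```

Keeping the threshold and the reply handling as explicit parameters makes it easy to rerun the triage if the 5-year figure or the response time-limit changes.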

Note: Marking a dataset for deletion triggers automatic e-mails to all users (MetO/non-MetO) who have ever accessed data from that set. Further, there is a gap of 28 days between a MASS dataset being marked for deletion and actually being deleted. So there is time for any users who might have an interest in the particular dataset to respond to the owner and prevent its deletion.
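
As a small worked example of the 28-day gap, the sketch below computes the earliest date on which a marked dataset could actually be removed. The function name and the example date are illustrative only, and the calculation assumes the gap is counted in calendar days from the day of marking.

```python
from datetime import date, timedelta

GRACE_DAYS = 28  # gap between marking and deletion, as stated in the note


def earliest_deletion(marked_on: date) -> date:
    """Return the first date on which a marked dataset could be deleted."""
    return marked_on + timedelta(days=GRACE_DAYS)


# A dataset marked on 4 January 2021 could be deleted from 1 February 2021.
deadline = earliest_deletion(date(2021, 1, 4))
```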