Difference between revisions of "Monsoon users MASS data"

From UKCA

Revision as of 17:08, 13 January 2020

Management of Monsoon users' MASS data

Background

The UKCA project on the Monsoon HPC system has been active for almost 10 years and, at the last count (Jan 2018), had archived nearly 1.5 Petabytes of data to the MASS tape system. A significant part of this data becomes redundant as time progresses. In some instances, the archived data is from test simulations and should have been deleted as soon as its purpose was fulfilled. In other cases the data is no longer relevant because the configuration/settings used are now outdated. In yet other cases, any value that could be derived from the data (for PhD theses, publications, projects) has already been obtained and the data should be safe to delete.
A related issue is that until late 2017 the MASS system was configured to store all data in duplex (two copies) form by default, possibly to reduce the chance of data loss from physical damage to the tapes/drives. This means that any (now redundant) data generated before this time occupies double the amount of space.
One limitation for Monsoon has been that any data, once archived to MASS, is owned by the Principal Investigator (PI) on the Met Office side, and the data creators themselves have no permission to delete it. This is not expected to change any time soon, so a central method of data management becomes necessary.

Current status

As of now (Dec 2019), all data generated by UKCA (and the UKCA- subprojects) is owned by Colin Johnson as the original PI for the project. Following a general activity to review data held on MASS and optimise the demand for tapes, an attempt was made in January 2018 to attribute at least the datasets resulting from UMUI runs to their original creators (i.e. the owners of these jobs) and to ask them to review the need to hold the data. This was done by cross-referencing the datasets on MASS that are owned by Colin and have an x prefix (the recommended job-id group for Monsoon runs) with the UMUI database on the PUMA system. An e-mail was subsequently sent to the users concerned with the job/data details (approx. 2050 job-ids), with a request to review them and report back to Colin the datasets that could be deleted. It is not clear whether any data deletion was requested as a result of the e-mail.
One thing observed while using the PUMA-UMUI database to trace the ownership is that a not-insignificant number of UMUI jobs which have archived data to MASS no longer exist in the UMUI database, indicating that the original setup has been deleted by the user.
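The cross-referencing step described above can be sketched as follows. This is only an illustration: the function name, the plain list of MASS dataset ids and the job-id-to-owner mapping are assumptions standing in for the real MASS listing and PUMA-UMUI database, whose actual query interfaces are not covered here.

```python
def cross_reference(mass_datasets, umui_owners):
    """Attribute x-prefixed MASS datasets to UMUI job owners.

    mass_datasets: iterable of dataset/job ids found on MASS (hypothetical input).
    umui_owners:   dict mapping job id -> owner, exported from the UMUI
                   database (hypothetical input).
    Returns (attributed, orphaned): a dict of owner -> job ids, and a list
    of job ids whose UMUI job no longer exists.
    """
    attributed = {}
    orphaned = []
    for job_id in sorted(mass_datasets):
        # Only the x- job-id group is recommended for Monsoon runs.
        if not job_id.startswith("x"):
            continue
        owner = umui_owners.get(job_id)
        if owner is None:
            # Data exists on MASS but the UMUI job has been deleted.
            orphaned.append(job_id)
        else:
            attributed.setdefault(owner, []).append(job_id)
    return attributed, orphaned
```

The attributed groups correspond to the per-user lists that were e-mailed out, while the orphaned list corresponds to the jobs that could not be traced through the UMUI database.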
During the attribution exercise, the MASS/Data Storage team was also approached to see if 'creator' information could be obtained by querying the database itself. This idea was based on the fact that each user running UM jobs has a unique moose credential (in $HOME/.moosedir). The response received suggested that it could be possible to obtain this information, but that it would need a formal service request with the list of job-ids and, given the number of jobs and at least an initial need for some manual intervention, might take a significant amount of time to report back.

Following the retirement of Colin from his Met Office role in 2018, the issue of transferring the ownership of this data has arisen. This is also an opportunity to review the data holdings and trim data that is not required.

Proposed Data Management Strategy

  • Ownership
    • Given the legacy MASS/MOOSE requirement of each Monsoon project having a Met Office point-of-contact or owner, the Monsoon-UKCA data on MASS will be split into two categories:
      • data originating from Institutes with a mainly chemistry interest (UKCA-CAM, -ED, -IMP, -LAN, -CSIRO), tentatively owned by Fiona O'Connor;
      • data from aerosol-interest Institutes (UKCA-LEEDS, -OX, -READ, -EX), tentatively owned by Adrian Hill.
    • However, given the large variety of simulations carried out on Monsoon under the UKCA project, for effective data management each Institute will need to nominate one or more point(s)-of-contact who can coordinate between the data creators and the Met Office owners.
  • Legacy Data
    • An attempt will be made, initially for UMUI-derived datasets, to obtain information on | Creator | Date of Creation | Date of Last Access | for the Monsoon-originating datasets.
    • Wherever possible, by querying the UMUI database, the Job-title will be appended to these details, for a better understanding of the purpose of the simulation.
    • Based on the above information:
      • MASS datasets where the original UMUI job no longer exists will be marked for immediate deletion.
      • Creators of datasets that have not been accessed within the last 5 (TBC) years will be approached to review the holdings, with a response time-limit.
      • Datasets recommended by original creators for deletion, as well as those for which no response is received, will be marked for deletion.
    • After reviewing the UMUI (x-) jobs, Rose suites will be targeted. Here the ownership information is more easily available through the rosie database.
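The legacy-data rules listed above could be expressed as a small decision function. The sketch below is an assumption-laden illustration: the `Dataset` fields, the `Action` names and the handling of a "keep" reply are hypothetical, the 5-year threshold is still TBC in the strategy, and the threshold is approximated in days.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from enum import Enum
from typing import Optional


class Action(Enum):
    MARK_FOR_DELETION = "mark for deletion"
    ASK_CREATOR = "ask creator to review"
    KEEP = "keep"


@dataclass
class Dataset:
    job_id: str
    last_access: date
    umui_job_exists: bool
    # Creator's reply to the review request: "delete", "keep", or None.
    creator_response: Optional[str] = None


REVIEW_AFTER_YEARS = 5  # TBC in the proposed strategy


def triage(ds, today, response_deadline_passed=False):
    """Apply the proposed legacy-data rules to one dataset."""
    # Rule 1: original UMUI job no longer exists.
    if not ds.umui_job_exists:
        return Action.MARK_FOR_DELETION
    # Rule 2: only datasets not accessed within the threshold are reviewed.
    cutoff = today - timedelta(days=round(365.25 * REVIEW_AFTER_YEARS))
    if ds.last_access >= cutoff:
        return Action.KEEP
    # Rule 3: creator recommends deletion, or no reply within the time-limit.
    if ds.creator_response == "delete":
        return Action.MARK_FOR_DELETION
    if ds.creator_response is None and response_deadline_passed:
        return Action.MARK_FOR_DELETION
    if ds.creator_response == "keep":
        return Action.KEEP
    return Action.ASK_CREATOR
```

The same function could later be reused for Rose suites, with the `umui_job_exists` check replaced by a lookup against the rosie database.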

Note: Marking any dataset for deletion triggers automatic e-mails to all users (MetO/non-MetO) who have ever accessed data from that set. Further, there is a gap of 28 days between a MASS dataset being marked for deletion and actually being deleted. So there is time for any users who might have an interest in a particular dataset to respond to the owner and prevent its deletion.
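As a small worked example of the 28-day gap, the date on which a marked dataset would actually be deleted can be computed as below. The function name is hypothetical, and the sketch assumes the gap is counted in calendar days from the day of marking.

```python
from datetime import date, timedelta

DELETION_GAP_DAYS = 28  # gap between marking and actual deletion


def deletion_date(marked_on):
    """Return the date a dataset marked on `marked_on` would be deleted,
    assuming the 28-day gap is counted in calendar days."""
    return marked_on + timedelta(days=DELETION_GAP_DAYS)


# A dataset marked on 13 January 2020 would be deleted on 10 February 2020.
print(deletion_date(date(2020, 1, 13)))  # 2020-02-10
```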