Difference between revisions of "Monsoon users MASS data"

From UKCA

Revision as of 16:40, 2 August 2019

Management of Monsoon users' MASS data

Background

The UKCA project on the Monsoon HPC system has been active for almost 10 years, and at the last count (Jan 2018) it had archived nearly 1.5 Petabytes of data to the MASS tape system. A significant part of this data becomes redundant as time progresses. In some instances the archived data is from test simulations and should have been deleted as soon as its purpose was fulfilled. In other cases the data is no longer relevant because the configuration/settings used are now outdated, and/or any value that could be derived from it (for PhD theses, publications, projects) has already been obtained, so it should be safe to delete.
A related issue is that until late 2017 the MASS system was configured to store all data in duplex (two copies) form by default, possibly to reduce the chance of data loss from physical damage to the tapes/drives. This means that any (now redundant) data generated before then occupies double the amount of space.
One limitation for Monsoon has been that any data, once archived to MASS, is owned by the Principal Investigator (PI) on the Met Office side, and the data creators themselves have no permission to delete it. This is not expected to change any time soon, so a central method of data management becomes necessary.

Current status

As of now (Aug 2019), all data generated by UKCA (and the UKCA subprojects) is owned by Colin Johnson as the original PI for the project. Following a general activity to review data held on MASS and optimise the demand for tapes, an attempt was made in January 2018 to attribute at least the datasets resulting from UMUI runs to their original creators (i.e. the owners of these jobs) and to ask them to review the need to hold the data. This was done by cross-referencing the datasets on MASS that are owned by Colin and have an x prefix (the recommended job-id group for Monsoon runs) with the UMUI database on the PUMA system. An e-mail was subsequently sent to the users concerned with the job/data details (approx. 2050 job-ids), with a request to review them and report back to Colin the datasets that could be deleted. It is not clear whether any data deletion was requested as a result of the e-mail.
One thing observed while using the PUMA-UMUI database to trace the ownership, and of relevance to the strategy below, is that a significant number of UMUI jobs which have archived data to MASS no longer exist in the UMUI database, indicating that the original setup has been deleted by the user.
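The cross-referencing described above can be illustrated with a minimal Python sketch. It is not the script that was actually used. It assumes that the MASS dataset names are identical to the UMUI job-ids and that the UMUI database has been exported to a simple job-id to owner mapping; the function name and both inputs are hypothetical.

```python
def split_by_umui_presence(mass_datasets, umui_job_owners):
    """Split MASS dataset names into attributed and orphaned groups.

    mass_datasets: iterable of dataset names (assumed equal to UMUI job-ids).
    umui_job_owners: dict of job-id -> owner username (hypothetical export
    of the UMUI database, not a real PUMA interface).
    """
    attributed = {}  # owner -> list of job-ids still present in the UMUI
    orphaned = []    # job-ids with MASS data but no UMUI entry
    for job_id in sorted(mass_datasets):
        if not job_id.startswith("x"):
            continue  # only the x-prefixed Monsoon job-id group is considered
        owner = umui_job_owners.get(job_id)
        if owner is None:
            orphaned.append(job_id)
        else:
            attributed.setdefault(owner, []).append(job_id)
    return attributed, orphaned
```

The attributed group corresponds to the per-user lists that were e-mailed, and the orphaned group to the jobs whose setup has since been deleted from the UMUI database.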
During the attribution exercise, the MASS/Data Storage team was also approached to see if 'creator' information could be obtained by querying the database itself. This idea was based on the fact that each user running UM jobs has a unique moose credential (in $HOME/.moosedir). The response received suggested that it could be possible to obtain this information, but that it would need a formal service request with the list of job-ids. Given the number of jobs, and the need for some manual intervention at least initially, it might take a significant amount of time to report back.

Following Colin's retirement from his Met Office role in 2018, the issue of transferring the ownership of this data has arisen. This is also an opportunity to review the data holdings and trim data that is not required.

Proposed Data Management Strategy

  • The Monsoon-UKCA data ownership on MASS will be shared between Fiona O'Connor (Manager, Atmospheric Composition and Climate team) and Adrian Hill (Manager, Aerosol and Cloud Modeling group).
  • The proposed division of (data from) UKCA subprojects between the two owners will be based on the Institute-wise UKCA subprojects: data originating from Institutes with a mainly chemistry interest (UKCA-CAM, -ED, -IMP, -LAN) will be owned by Fiona, while that from aerosol-interest Institutes (UKCA-LEEDS, -OX, -READ, -EX) will be owned by Adrian.
    However, it is not clear yet whether technical implementation of such division/ attribution will be possible, especially for data generated prior to division of UKCA into sub-projects on Monsoon.
  • An attempt will be made, initially for UMUI-derived datasets, to obtain information on | Creator | Date of Creation | Date of Last Access | for the Monsoon-originating datasets.
  • Wherever possible, by querying the UMUI database, the Job-title will be appended to these details, for a better understanding of the simulation.
  • Based on the above information:
    • MASS datasets where the original UMUI job no longer exists will be marked for immediate deletion.
    • Owners of datasets that have not been accessed within the last X (TBD) years will be approached to review the holdings, with a response time-limit.
    • Datasets recommended by owners for deletion, as well as those for which no response is received, will be marked for deletion.
  • After reviewing the UMUI (x-) jobs, Rose suites will be targeted. Here the ownership information is more easily available through the rosie database.
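The decision rules listed above could be sketched as follows. This is an illustrative outline under stated assumptions, not an agreed implementation: the record fields, the outcome labels and the function name are hypothetical, X is left as a parameter because it is still to be decided, and a year is approximated as 365 days.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class DatasetRecord:
    """Hypothetical per-dataset record built from the details listed above."""
    name: str
    creator: Optional[str]
    created: date
    last_accessed: date
    umui_job_exists: bool
    owner_response: Optional[str] = None  # "keep", "delete", or None (no reply)


def triage(record, today, max_idle_years, deadline_passed=False):
    """Return "mark_for_deletion", "ask_owner" or "retain" for one dataset."""
    # Rule 1: original UMUI job no longer exists -> immediate deletion.
    if not record.umui_job_exists:
        return "mark_for_deletion"
    # Datasets accessed within the last X years are outside the review.
    idle_days = (today - record.last_accessed).days
    if idle_days <= max_idle_years * 365:  # approximation: 1 year = 365 days
        return "retain"
    # Rules 2 and 3: the owner reviews; deletion on request or on no reply.
    if record.owner_response == "keep":
        return "retain"
    if record.owner_response == "delete":
        return "mark_for_deletion"
    return "mark_for_deletion" if deadline_passed else "ask_owner"
```

For example, with a placeholder X of 3 years, a dataset last accessed in 2015 and evaluated in August 2019 would first yield "ask_owner", and "mark_for_deletion" once the response time-limit has passed without a reply.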


Note: Marking any dataset for deletion triggers automatic e-mails to all users (MetO/non-MetO) who have ever accessed data from that set. Further, there is a gap of 28 days between a MASS dataset being marked for deletion and actually being deleted. So there is time for any users who might have an interest in a particular dataset to respond to the owner and prevent its deletion.
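As a small worked example of the 28-day gap, the earliest possible deletion date can be computed from the marking date. This sketch assumes the gap is counted in whole calendar days from the day of marking, which the note does not state explicitly; the function and constant names are hypothetical.

```python
from datetime import date, timedelta

GRACE_PERIOD_DAYS = 28  # gap between marking and deletion, as stated in the note


def earliest_deletion_date(marked_on):
    """Return the first date on which a marked dataset could be deleted.

    Assumes the gap is counted in calendar days from the marking date.
    """
    return marked_on + timedelta(days=GRACE_PERIOD_DAYS)
```

Under that assumption, a dataset marked on 2 August 2019 would not be deleted before 30 August 2019.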