Publishing Large Datasets
General Considerations for Publishing Large Datasets
Whether a dataset is considered "large" depends on the field of research, the computing environment, and the storage solution. There are no universal thresholds for how large is too large, but we can offer some general considerations for the contemporary landscape of data publishing.
- What are the relevant policies governing what you are required to publish (or prohibited from sharing)?
  - What do your funders' data management and sharing policies require of you?
  - What are the open data policies of your journals of choice?
  - Are you subject to export control regulations from the US Government?
- What makes the most sense to publish (or to leave out)?
  - Are all files necessary for replication?
  - How costly would it be for someone else to regenerate the data?
  - How useful do you anticipate the data will be for other researchers?
  - What are the most efficient and durable formats for your field?
- What practical limitations do you anticipate?
  - What are the file size limits of your data repositories of choice?
  - Are you set up to manage the complete, verified transmission of your files? (See the checksum sketch after this list.)
  - What kind of hardware will be required to scan and (de)compress your files?
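On the transmission question, one common way to prepare is to generate a checksum manifest before transfer, so you (or a recipient) can verify that every file arrived intact. Below is a minimal Python sketch; the folder name my-dataset is just an example, not a required layout:

```python
import hashlib
import os

def sha256sum(path, chunk_size=1024 * 1024):
    """Stream a file through SHA-256 without loading it all into memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

# Walk the dataset folder (an example path), printing a checksum manifest
# and tallying the total size along the way.
total_bytes = 0
for dirpath, _, filenames in os.walk("my-dataset"):
    for name in sorted(filenames):
        path = os.path.join(dirpath, name)
        total_bytes += os.path.getsize(path)
        print(sha256sum(path), path)
print(f"Total size: {total_bytes / 1024**3:.2f} GB")
```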
Using Globus for Data Transfer and Sharing
Globus is the ideal tool for transferring and sharing large research datasets (see the Globus How-To page). Developed and operated by the University of Chicago and Argonne National Laboratory, the Globus software-as-a-service uses the GridFTP protocol to transfer files efficiently, without requiring you to monitor or maintain connections during the process. Many research institutions, including Princeton University, have Globus endpoints set up for their high-performance computing systems, giving researchers a convenient and robust transfer tool for both intra- and inter-institutional data sharing. Individual researchers and small labs can also take advantage of the Globus infrastructure with Globus Connect Personal. (A free account is required for anyone wishing to access data via Globus, but it is often managed with single sign-on through the institution, as with Princeton's CAS.)
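By way of illustration, here is a minimal sketch of submitting a transfer with the globus-sdk Python package (pip install globus-sdk). The client ID, endpoint UUIDs, and paths are placeholders you would supply yourself; this is one way to script a transfer, not a required workflow:

```python
import globus_sdk

# Placeholder: register your own native app at developers.globus.org
CLIENT_ID = "YOUR-NATIVE-APP-CLIENT-ID"

# Interactive login flow: prints a URL to visit, then prompts for the code.
auth_client = globus_sdk.NativeAppAuthClient(CLIENT_ID)
auth_client.oauth2_start_flow()
print("Log in at:", auth_client.oauth2_get_authorize_url())
tokens = auth_client.oauth2_exchange_code_for_tokens(input("Auth code: "))
transfer_token = tokens.by_resource_server["transfer.api.globus.org"]["access_token"]

tc = globus_sdk.TransferClient(
    authorizer=globus_sdk.AccessTokenAuthorizer(transfer_token)
)

# Placeholder endpoint UUIDs for the source and destination collections.
SRC = "SOURCE-ENDPOINT-UUID"
DST = "DESTINATION-ENDPOINT-UUID"

# Submit the transfer; Globus manages the connection from here on.
task = globus_sdk.TransferData(tc, SRC, DST, label="example transfer")
task.add_item("/path/on/source/", "/path/on/destination/", recursive=True)
print("Task ID:", tc.submit_transfer(task)["task_id"])
```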
PRDS manages a Globus endpoint specifically for data publication, called Princeton Research Data Repository (UUID: dc43f461-0ca7-4203-848c-33a9fc00a464). Each folder in this endpoint corresponds to a specific item published in DataSpace, and it is on the DataSpace website that datasets can be searched and browsed by title, contributor, description, and other metadata. The Princeton Research Data Repository Globus endpoint is not designed for browsing published datasets directly; it exists to overcome the practical challenges of uploading and downloading the large datasets found in DataSpace.
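For example, once a dataset's DataSpace record tells you the name of its folder on the endpoint, the download can be scripted along these lines (a sketch continuing from the example above; the folder name and personal endpoint UUID are hypothetical, while the repository UUID is the one given here):

```python
import globus_sdk

# Reusing the authenticated TransferClient (tc) from the sketch above.
PRDR = "dc43f461-0ca7-4203-848c-33a9fc00a464"  # Princeton Research Data Repository
MY_ENDPOINT = "YOUR-GLOBUS-CONNECT-PERSONAL-UUID"  # placeholder

# "example-item-folder" is hypothetical; the real name comes from the
# dataset's DataSpace record. List its contents before downloading.
for entry in tc.operation_ls(PRDR, path="/example-item-folder/"):
    print(entry["type"], entry["name"])

# Pull the whole folder down to a local Globus Connect Personal endpoint.
download = globus_sdk.TransferData(tc, PRDR, MY_ENDPOINT, label="DataSpace download")
download.add_item("/example-item-folder/", "/~/dataspace-download/", recursive=True)
print("Task ID:", tc.submit_transfer(download)["task_id"])
```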
Large Datasets in Princeton's DataSpace
DataSpace can accommodate arbitrarily large datasets to meet the data publishing needs of Princeton researchers across all disciplines. But large datasets involve special procedures for submission, curation, transfer, and public dissemination, all of which take extra time and more careful effort than usual. So if you are preparing to publish a large dataset, please reach out to the DataSpace curators for guidance sooner rather than later!
Curation-in-Place Procedures
Staff from PRDS review all submissions to DataSpace with an eye toward discoverability, reusability, and long-term preservation. This requires the curators to have access to the submitted files before publication, which in most cases is handled by an upload page within the submission web portal. However, the standard HTTPS protocol is unreliable for uploading files larger than 250 MB, and because the curation process sometimes involves requests for revisions or additions, transferring datasets larger than 2 GB back and forth is impractical. The curators manage these practical difficulties by requesting direct read access to the folders where the large datasets are stored, so they can curate the data in place. If researchers are unable to grant direct access to their data folders (e.g., due to privacy or security concerns), the DataSpace curators can work with them to transfer large datasets to a staging folder in /tigress using Globus. In either case, submitters of large datasets need only upload a README file through the DataSpace submission portal, and the curators will usher the remaining files through computing systems suited to the task.
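As a rough sketch of what such a staging transfer might look like with the globus-sdk package (again reusing the authenticated client from the earlier example; the HPC endpoint UUID and staging path are placeholders that the curators would confirm with you):

```python
import globus_sdk

# Reusing the authenticated TransferClient (tc) from the earlier sketch.
# Endpoint UUID and staging path below are placeholders, not real values.
MY_ENDPOINT = "YOUR-GLOBUS-CONNECT-PERSONAL-UUID"
HPC_ENDPOINT = "PRINCETON-HPC-ENDPOINT-UUID"
STAGING_PATH = "/tigress/STAGING-FOLDER-ASSIGNED-BY-CURATORS/"

upload = globus_sdk.TransferData(
    tc,
    MY_ENDPOINT,
    HPC_ENDPOINT,
    label="DataSpace staging upload",
    sync_level="checksum",  # lets the same transfer re-run safely after interruptions
)
upload.add_item("/~/my-dataset/", STAGING_PATH, recursive=True)
print("Task ID:", tc.submit_transfer(upload)["task_id"])
```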
Procedures by Data File Size
- For datasets no larger than 250 MB...
  - No special procedures are required
  - Standard HTTPS upload and download are recommended
  - Estimated turnaround for submissions is 5-10 business days
- For datasets with individual files of 250 MB - 2 GB, and/or a total file size of 1 GB - 10 GB...
  - Globus is recommended for both upload and download
  - Curation-in-place procedures may help expedite the submission process
  - Estimated turnaround for submissions is 5-10 business days
- For datasets with individual files larger than 2 GB, and/or a total file size of 10 GB - 1 TB...
  - Globus is required
  - Curation-in-place procedures are required
  - Estimated turnaround for submissions is 5-10 business days, but it can take significantly longer depending on the complexity of the case
- For datasets with a total file size larger than 1 TB...
  - Contact the DataSpace curators in advance and schedule a consultation
  - Turnaround timeframes will vary
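To see where a dataset falls among these tiers, a short script can report its total size and largest individual file. Here is a minimal Python sketch; the thresholds mirror the guidance above, and the folder path is an example:

```python
import os

MB, GB, TB = 1024**2, 1024**3, 1024**4

def size_report(root):
    """Return (total_bytes, largest_file_bytes) for a dataset folder."""
    total = largest = 0
    for dirpath, _, filenames in os.walk(root):
        for name in filenames:
            size = os.path.getsize(os.path.join(dirpath, name))
            total += size
            largest = max(largest, size)
    return total, largest

total, largest = size_report("my-dataset")  # example path
print(f"Total: {total / GB:.2f} GB; largest file: {largest / MB:.1f} MB")

if total > TB:
    print("Over 1 TB: contact the DataSpace curators and schedule a consultation.")
elif largest > 2 * GB or total > 10 * GB:
    print("Globus and curation-in-place procedures are required.")
elif largest > 250 * MB or total > 1 * GB:
    print("Globus is recommended; curation-in-place may expedite submission.")
else:
    print("No special procedures; standard HTTPS upload is fine.")
```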