Publishing Large Datasets
General Considerations for Publishing Large Datasets
Whether a dataset is considered "large" depends on the field of research, the computing environment, and the storage solution. While there are no universal thresholds for how large is too large when it comes to data, we can offer some general considerations for the contemporary landscape of data publishing.
- What are the relevant policies for what you're required to publish (or prohibited from sharing)?
  - What do your funders' data management and sharing policies require of you?
  - What are the open data policies for your journals of choice?
  - Are you subject to export control regulations from the US Government?
- What makes the most sense to publish (or to leave out)?
  - Are all files necessary for replication?
  - How costly would it be for someone else to regenerate the data?
  - How useful do you anticipate the data will be for other researchers?
  - What are the most efficient and durable formats for your field?
- What practical limitations do you anticipate?
  - What are the file size limits of your data repositories of choice?
  - Are you set up to manage the complete transmission of your files?
  - What kind of hardware will be required to scan and (de)compress your files?
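One common way to confirm the complete transmission of your files is to record a checksum for every file before the transfer and compare it with a checksum computed afterwards. The Python sketch below illustrates the idea. The function names (`sha256_of`, `build_manifest`, `find_mismatches`) and the choice of SHA-256 are assumptions made for illustration, not a requirement of any particular repository.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1024 * 1024):
    """Return the SHA-256 hex digest of one file, reading it in chunks
    so that large files never have to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root):
    """Map each file's path (relative to root) to its SHA-256 digest."""
    root = Path(root)
    return {
        p.relative_to(root).as_posix(): sha256_of(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def find_mismatches(sent, received):
    """Return the sorted relative paths that are missing from either
    manifest or whose digests differ."""
    return sorted(
        name
        for name in sent.keys() | received.keys()
        if sent.get(name) != received.get(name)
    )
```

Building one manifest on the source machine and another on the destination, then passing both to `find_mismatches`, should return an empty list when every file arrived intact.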
Large Datasets in Princeton's Data Repository
Princeton Data Commons can accommodate large datasets to meet the data publishing needs of Princeton researchers from all disciplines. However, large datasets require special procedures for submission, curation, transfer, and public dissemination, all of which take more time and care than usual. If you are preparing to publish a large dataset, please reach out to the Princeton Research Data Curators for guidance as early as possible.
Procedures by Data File Size
- For datasets no larger than 100 MB...
  - No special procedures are required
  - Standard HTTPS upload and download is recommended
  - Estimated turn-around for submissions is 5-10 business days
- For datasets with individual files over 100 MB...
  - Globus is used to transfer the files (See Using Globus with Princeton Data Commons)
  - Estimated turn-around for submissions is 5-10 business days
- For datasets with a total file size larger than 1 TB...
  - Contact the Princeton Research Data Curators in advance and schedule a consultation
  - Review costs for publishing datasets above 1 TB (Acceptance and Retention Policy)
  - Turn-around timeframes will vary
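The size tiers above can be sketched as a small Python check that you might run before submitting. This is only an illustration under stated assumptions: the names `Procedure` and `procedure_for` are hypothetical, MB and TB are taken as decimal units (10^6 and 10^12 bytes), and the 1 TB check is applied first. The list does not say what applies to a dataset that totals more than 100 MB but has no single file over 100 MB, so the sketch flags that case rather than guessing.

```python
from enum import Enum

# Assumption: decimal units. The guidance does not say whether MB/TB
# are decimal (10**6) or binary (2**20) multiples.
MB = 10**6
TB = 10**12


class Procedure(Enum):
    STANDARD_HTTPS = "standard HTTPS upload and download"
    GLOBUS = "Globus transfer"
    CONSULTATION = "consult the data curators in advance"
    NOT_COVERED = "not covered by the list; ask the data curators"


def procedure_for(file_sizes):
    """Pick a procedure tier from a list of file sizes in bytes."""
    total = sum(file_sizes)
    if total > 1 * TB:
        # Total size above 1 TB: consultation comes first.
        return Procedure.CONSULTATION
    if any(size > 100 * MB for size in file_sizes):
        # At least one individual file is over 100 MB.
        return Procedure.GLOBUS
    if total <= 100 * MB:
        # Whole dataset is no larger than 100 MB.
        return Procedure.STANDARD_HTTPS
    # Total over 100 MB but every file is 100 MB or smaller.
    return Procedure.NOT_COVERED
```

For example, `procedure_for([500 * MB])` returns `Procedure.GLOBUS`, while two 60 MB files together return `Procedure.NOT_COVERED`, which is a prompt to ask the curators which procedure applies.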
Using Globus for Data Transfer and Sharing
Globus is the ideal tool for transferring and sharing large research datasets (see the Globus How-To page). Developed and operated by the University of Chicago and Argonne National Laboratory, the Globus software-as-a-service uses the GridFTP protocol to transfer files efficiently, so users do not have to monitor or maintain connections during a transfer. Many research institutions, including Princeton University, have Globus endpoints set up for their high-performance computing systems, giving them a convenient and robust transfer tool for both intra- and inter-institutional data sharing. Individual researchers and small labs can also take advantage of the Globus infrastructure with Globus Connect Personal. (A free account is required for anyone wishing to access data via Globus, but it is often managed with single sign-on (SSO) through the institution, as with Princeton's CAS.)