The data output from next-generation sequencing platforms has undergone explosive growth in recent years, thanks to technological advances that have lowered costs (Jiang et al., 2022).
Researchers must store increasingly large amounts of data, but it is not clear how the millions (or billions) of reads from a sequencing experiment relate to gigabytes (GB) of data storage. Data storage costs can be high, so this is an important consideration.
In this article, we bridge the gap between the ‘million reads’ terminology familiar from RNA-seq and the GB used to quantify data storage. We also provide a new, free conversion tool for estimating the GB of data generated by your sequencing experiments.
The exponential rise of sequencing data generation
The most common measure of the output of a sequencing experiment is the number of reads generated, often in ‘million reads’. Sequencing technologies such as the NovaSeq 6000 from Illumina can now produce up to an impressive 20 billion reads per run (Illumina, 2023).
Thanks to these advances, the amount of sequencing data has grown exponentially in recent years (Katz et al., 2022).
For example, in 2021 the Sequence Read Archive (SRA) of the National Center for Biotechnology Information (NCBI) held 16 petabytes of raw sequencing data from around 15 million sequencing runs (Katz et al., 2022). In 2016, the SRA contained only 5 petabytes of data (Katz et al., 2022).
With the increased output of modern sequencing machines, researchers should know how much data they will generate in terms of GB.
Some technical details
Sequencing data formats
Modern sequencing machines from Illumina initially store raw sequencing data as individual binary base call (BCL) files. These BCL files are commonly converted into FASTQ files which are standard for further analyses (Illumina, 2021).
One problem is the sheer size of these files.
A FASTQ file is a text file created for each sample. Each read is reported as an entry in the FASTQ file and consists of four lines:
- A sequence identifier containing information about the sequencing run.
- The read sequence from base calls (A, C, T, G and N).
- A separator (+ sign).
- The base call quality scores.
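To make the four-line layout concrete, here is a minimal Python sketch that splits one FASTQ entry into its parts and counts the bytes it occupies as plain text. The read shown is invented for illustration, and the function name `parse_fastq_record` is our own, not part of any Illumina software.

```python
def parse_fastq_record(lines):
    """Split the four lines of one FASTQ entry into named fields."""
    # Unpacking fails with a ValueError unless exactly four lines are given.
    identifier, sequence, separator, quality = (line.rstrip("\n") for line in lines)
    if not identifier.startswith("@"):
        raise ValueError("identifier line must start with '@'")
    if not separator.startswith("+"):
        raise ValueError("separator line must start with '+'")
    if len(sequence) != len(quality):
        raise ValueError("sequence and quality lines must have the same length")
    return {
        "identifier": identifier[1:],  # drop the leading '@'
        "sequence": sequence,
        "quality": quality,
    }


# A made-up 15-base read (hypothetical identifier and quality values).
record_lines = [
    "@EXAMPLE_READ_1\n",
    "GATTACAGATTACAN\n",
    "+\n",
    "I" * 14 + "#" + "\n",
]

record = parse_fastq_record(record_lines)

# Every character, including line breaks, takes one byte in an ASCII text file.
uncompressed_bytes = sum(len(line.encode("ascii")) for line in record_lines)

print(record["sequence"], uncompressed_bytes)
```

This toy entry occupies 50 bytes uncompressed. Real reads are longer and carry longer identifiers, so each entry takes correspondingly more space.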
Repeated billions of times, these four lines require a lot of storage space. For example, Illumina estimates that 2190 GB of storage is needed for the BCL files from 20 billion reads produced on the NovaSeq 6000 (Illumina, 2022). The size increases upon conversion to FASTQ files, even when they are compressed with common methods such as gzip.
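As a rough illustration, the Illumina figures quoted above can be turned into a per-read footprint. The sketch below assumes decimal gigabytes (1 GB = 10^9 bytes) and simply divides the quoted total by the quoted read count; it is arithmetic on those two numbers, not a statement about how BCL files are laid out internally.

```python
# Figures quoted from Illumina (2022) for a NovaSeq 6000 run.
bcl_storage_gb = 2190
total_reads = 20_000_000_000

# Assumption: decimal gigabytes, i.e. 1 GB = 1e9 bytes.
bytes_per_read = bcl_storage_gb * 1e9 / total_reads
gb_per_million_reads = bcl_storage_gb / (total_reads / 1_000_000)

print(f"{bytes_per_read:.1f} bytes per read")
print(f"{gb_per_million_reads:.4f} GB per million reads")
```

Under these assumptions the BCL output works out to roughly 110 bytes per read, or about 0.11 GB per million reads.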
From million reads terminology to GB
Our million-reads-to-GB conversion tool is simple to use and provides useful estimates of the data storage required for sequencing experiments.
For example, let’s take a single sample sequenced with MERCURIUS™ BRB-seq to a read depth of five million reads. This depth is recommended by Alithea Genomics as it is normally enough to detect the majority of expressed genes (Alpern et al., 2019).
Depending on their experiment, researchers can choose either single-end or paired-end sequencing with various read lengths.
In the MERCURIUS™ BRB-seq example we require paired-end, 75 base pair reads.
Our calculator estimates that around 0.75 GB of data storage is required per MERCURIUS™ BRB-seq sample.
In contrast, an estimated 4.5 GB of data storage is needed for one sample prepared with the Illumina TruSeq Stranded mRNA Library Prep Kit at a recommended read depth of 30 million reads per sample.
This is six times the storage needed for a MERCURIUS™ BRB-seq sample and represents a significant cost burden for large-scale studies with hundreds or thousands of samples.
Therefore, not only does MERCURIUS™ BRB-seq allow reduced sequencing costs thanks to early barcoding and multiplexing of hundreds of samples, it also provides a significant reduction in storage costs per sample.
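The two estimates quoted above can be reproduced with a back-of-the-envelope calculation. The sketch below is not the calculator's actual implementation, which is not described in this article. It rests on three assumptions of ours: each read in a paired-end run contributes two mates, each sequenced base costs about one byte of storage, and 1 GB is 10^9 bytes. The function name `estimate_storage_gb` and its parameters are hypothetical. The TruSeq figure assumes the same paired-end, 75 base pair configuration, which is not stated explicitly for that example.

```python
def estimate_storage_gb(million_reads, read_length_bp, paired_end=True,
                        bytes_per_base=1.0):
    """Rough storage estimate in decimal GB for one sequenced sample.

    Assumptions (ours, for illustration only):
    - a paired-end read contributes two mates of `read_length_bp` bases each;
    - every base costs `bytes_per_base` bytes of storage (default: 1).
    """
    reads = million_reads * 1_000_000
    mates = 2 if paired_end else 1
    total_bases = reads * read_length_bp * mates
    return total_bases * bytes_per_base / 1e9


# MERCURIUS BRB-seq example: 5 million reads, paired-end, 75 bp.
brb_seq_gb = estimate_storage_gb(5, 75)

# TruSeq example: 30 million reads, assumed paired-end, 75 bp.
truseq_gb = estimate_storage_gb(30, 75)

print(brb_seq_gb, truseq_gb, truseq_gb / brb_seq_gb)
```

With these assumptions the sketch returns 0.75 GB and 4.5 GB, matching the figures above. Because read length and layout are held constant, the sixfold difference comes entirely from the difference in read depth.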
We hope this tool helps you plan the data storage needs of your next sequencing experiment.
The tool can be accessed here.
- Alpern, D., Gardeux, V., Russeil, J., Mangeat, B., Meireles-Filho, A.C., Breysse, R., Hacker, D. and Deplancke, B., 2019. BRB-seq: ultra-affordable high-throughput transcriptomics enabled by bulk RNA barcoding and sequencing. Genome biology, 20(1), pp.1-15.
- Jiang, P., Sinha, S., Aldape, K., Hannenhalli, S., Sahinalp, C. and Ruppin, E., 2022. Big data in basic and translational cancer research. Nature Reviews Cancer, 22(11), pp.625-639.
- Katz, K., Shutov, O., Lapoint, R., Kimelman, M., Brister, J.R. and O’Sullivan, C., 2022. The sequence read archive: a decade more of explosive growth. Nucleic acids research, 50(D1), pp.D387-D390.
- Illumina, 2021. FASTQ files explained. Available at: https://support.illumina.com/bulletins/2016/04/fastq-files-explained.html.
- Illumina, 2022. Approximate sizes of sequencing run output folders. Available at: https://support.illumina.com/bulletins/2018/01/approximate-sizes-of-sequencing-run-output-folders.html.
- Illumina, 2023. NovaSeq 6000 System Specifications | Output, run time, and more. Available at: https://www.illumina.com/systems/sequencing-platforms/novaseq/specifications.html.