What do BRB-seq preprocessing results look like?

 In this post, we will briefly see how BRB-seq libraries are processed and which output files the preprocessing pipeline generates.
 

After Illumina sequencing and standard index demultiplexing, each BRB-seq library should consist of a pair of .fastq files (mylibrary_R1.fastq.gz and mylibrary_R2.fastq.gz).

 

It’s worth noting that, even though BRB-seq libraries are sequenced as paired-end, they should not be processed as conventional paired-end libraries. Indeed, the R1 file does not contain fragment sequences to be aligned to a reference genome, but rather the sample barcode and Unique Molecular Identifier (UMI) sequences. RNA fragment sequences are only present in the R2 file. Therefore, a dedicated pipeline (for example, one based on STARsolo) needs to be run to demultiplex all barcodes and thus retrieve the sample information from the multiplexed library.
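To make the role of the R1 read concrete, here is a minimal Python sketch that splits an R1 sequence into its barcode and UMI parts. The layout (barcode first, UMI immediately after) and the two lengths are assumptions chosen for illustration; they depend on the kit version and are not specified in this post. The function name `parse_r1` is hypothetical as well.

```python
from typing import NamedTuple

# Assumed R1 layout: [sample barcode][UMI][rest of read, ignored].
# Both lengths are placeholders; check them against your own kit.
BARCODE_LEN = 14  # assumption, not taken from this post
UMI_LEN = 14      # assumption, not taken from this post


class R1Tag(NamedTuple):
    barcode: str
    umi: str


def parse_r1(sequence: str,
             barcode_len: int = BARCODE_LEN,
             umi_len: int = UMI_LEN) -> R1Tag:
    """Split an R1 sequence into (barcode, UMI) under the assumed layout."""
    needed = barcode_len + umi_len
    if len(sequence) < needed:
        raise ValueError(
            f"R1 read is {len(sequence)} nt long, expected at least {needed}"
        )
    return R1Tag(sequence[:barcode_len], sequence[barcode_len:needed])


# Toy read: 14 x A (barcode), 14 x C (UMI), then 4 extra bases.
tag = parse_r1("A" * 14 + "C" * 14 + "TTTT")
print(tag.barcode, tag.umi)
```

The barcode tells which sample the paired R2 fragment belongs to, and the UMI is what later allows duplicates to be collapsed.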

 

The data preprocessing pipeline takes as input the two R1/R2 files, a reference genome of the species of interest, and the barcode files used for multiplexing the samples. Preprocessing is then performed as consecutive tasks: 1) FastQC, to assess the quality of the sequencing reads (number of duplicates, adapter contamination, repetitive sequence contamination, GC content), 2) alignment to the reference genome, using STARsolo, and 3) counting of reads in exons, also performed by STARsolo. Finally, the output of STARsolo is formatted using custom R scripts to generate the final read and UMI count matrices for each library.
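As an illustration of the alignment and counting step, the sketch below assembles a STARsolo command line in Python. It is not the exact command used by the pipeline described here, whose options are not given in this post. The file names, barcode length and UMI length are placeholders, and the helper `build_starsolo_command` is hypothetical. The one point worth noticing is that the R2 file (RNA fragments) is passed first and the R1 file (barcode and UMI) second.

```python
from typing import List


def build_starsolo_command(r1: str, r2: str, genome_dir: str,
                           barcode_whitelist: str, barcode_len: int,
                           umi_len: int, out_prefix: str) -> List[str]:
    """Return a STAR argument list for a barcode+UMI library (sketch only)."""
    return [
        "STAR",
        "--runMode", "alignReads",
        "--genomeDir", genome_dir,
        # cDNA read first (R2), barcode/UMI read second (R1)
        "--readFilesIn", r2, r1,
        "--readFilesCommand", "zcat",
        "--soloType", "CB_UMI_Simple",
        "--soloCBwhitelist", barcode_whitelist,
        # Assumed layout: barcode at position 1, UMI right after it
        "--soloCBstart", "1",
        "--soloCBlen", str(barcode_len),
        "--soloUMIstart", str(barcode_len + 1),
        "--soloUMIlen", str(umi_len),
        "--soloFeatures", "Gene",
        "--outSAMtype", "BAM", "SortedByCoordinate",
        "--outFileNamePrefix", out_prefix,
    ]


# All paths and lengths below are placeholders for illustration.
cmd = build_starsolo_command(
    r1="mylibrary_R1.fastq.gz",
    r2="mylibrary_R2.fastq.gz",
    genome_dir="star_index/",
    barcode_whitelist="barcodes.txt",
    barcode_len=14,
    umi_len=14,
    out_prefix="mylibrary_",
)
print(" ".join(cmd))
```

The list can be passed to `subprocess.run(cmd, check=True)` once STAR is installed and a genome index has been built.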

 

The final output folder should contain the following files:

 
  1. README file detailing the output

  2. count_matrix/ folder containing the following files:
     - *read.counts files are read count files, with demultiplexed samples in columns and genes in rows. Each cell (i, j) of the matrix contains the number of reads mapping to gene i in sample j.
     - *umi.counts files are UMI count files. They contain the same data matrix as the *read.counts files, except that all duplicates are removed using the UMI (Unique Molecular Identifier) sequence information.
     Note 1: The *detailed files contain an extra column "Unknown" for the reads without a recognizable barcode, and 5 lines at the end of the file containing information on read mapping / assignment to exons (number of reads unmapped, mapped but not annotated as a feature/gene, etc.). This can be used for additional QC on demultiplexing, mapping, or gene annotation, for each sample.
     Note 2: The *sampleIDs files have sample IDs (from the sample input table, if provided) as column names, while the *wells files have well barcode IDs as column names.
     Note 3: None of the count matrices are normalized; they only contain raw read or UMI counts.

  3. bigwigs/ folder containing *.bw files which can be loaded in a genome browser such as IGV to visualize the read coverage along the genome.

  4. fastq/ folder containing the *R1 and *R2 fastq files produced after sequencing. Note: In case of multiple lanes / runs of sequencing, the files are already concatenated into a single pair of R1/R2 files per library.

  5. reports/ folder containing:
     - a standard .pdf and .html report describing the quality of the data, with tables and plots
     - a plots/ subfolder with all the plots used for the reports (and a few extra)
     - a fastqc/ subfolder containing a QC report on the sequencing quality, run on both the R1 and R2 files, as well as on the produced .bam file

  6. *SIT.xlsx file, which is a reformatted sample input table containing information on the libraries and the multiplexed samples that they contain (barcode, name, species, …).
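To illustrate the difference between the read count and UMI count matrices from item 2, here is a small Python sketch on toy data. It assumes each assigned read has already been reduced to a (sample, gene, UMI) triple, and it collapses duplicates by exact UMI match only. This is a simplification: the real pipeline may also merge UMIs that differ by a sequencing error. The function name `count_matrices` and the toy records are made up for the example.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple

Key = Tuple[str, str]  # (gene, sample)


def count_matrices(records: Iterable[Tuple[str, str, str]]):
    """Build sparse read and UMI count tables from (sample, gene, umi) triples."""
    read_counts: Dict[Key, int] = defaultdict(int)
    seen_umis: Dict[Key, Set[str]] = defaultdict(set)
    for sample, gene, umi in records:
        read_counts[(gene, sample)] += 1      # every read counts
        seen_umis[(gene, sample)].add(umi)    # each distinct UMI counts once
    umi_counts = {key: len(umis) for key, umis in seen_umis.items()}
    return dict(read_counts), umi_counts


# Toy data: two reads of sample S1 share the same UMI for GeneA.
records = [
    ("S1", "GeneA", "AAAA"),
    ("S1", "GeneA", "AAAA"),  # duplicate of the read above
    ("S1", "GeneA", "CCCC"),
    ("S2", "GeneA", "AAAA"),
    ("S2", "GeneB", "GGGG"),
]
reads, umis = count_matrices(records)
print(reads[("GeneA", "S1")], umis[("GeneA", "S1")])  # 3 reads, 2 UMIs
```

For GeneA in sample S1, the read count is 3 while the UMI count is 2, because the two reads carrying the same UMI are treated as duplicates of a single original molecule.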

Of note, the output files don’t contain demultiplexed .fastq files, because the STARsolo pipeline generates the final count matrices without demultiplexing the .fastq files. In principle, it is feasible to demultiplex the .fastq files, for example using the BRB-seqTools suite, but this option is no longer recommended, since 1) demultiplexed .fastq files no longer contain the UMI information, and 2) most data repositories (GEO, ArrayExpress) now accept multiplexed RNA-seq libraries (similar to single-cell RNA-seq libraries).

 
 

For more information, do not hesitate to reach out at info@alitheagenomics.com.