16S Pipeline: LogBook

This logbook gives an example of how to build an OTU table from raw data. Please see the detailed version for more explanations.

 

Good practice

After each step of the pipeline you should:

  1. Check the number of reads (from the usearch command or with the grep command)
  2. Check what the output files look like with the « head » command

 

Experience design

The present data come from Dr. Elcia Souza Brito experiment. She did 2 x 300 bp, paired-end sequencing. I thank her to accept the use of her data for this tutorial.

 

Extract archives

Example of files names provided by the company:

  • ES1BI1SS06-704-503_S6_L001_R1_001.fastq.gz
  • ES1BI1SS06-704-503_S6_L001_R2_001.fastq.gz

Change names to something clearer:

mv ES1BI1SS06-704-503_S6_L001_R1_001.fastq S_R1_rawData.fastq.gz
mv ES1BI1SS06-704-503_S6_L001_R2_001.fastq S_R2_rawData.fastq.gz

Extract the archive:

gunzip S_R1_rawData.fastq.gz gunzip S_R2_rawData.fastq.gz

 

Explore sequences files
  • Check the sequences orientation in R1 and R2

Forward primer (F): 5′ CTCCTACGGGAGGCAGCAG 3′

Reverse primer (R): 5′ CTTGTGCGGGCCCCCGTCAATTC 3′

To check where primers are located, use the grep command:

grep 'CTCCTACGGGAGGCAGCAG' S_R1_rawData.fastq

The F primer has to be at the beginning of the R1 sequences as shown below:grep_1

Towards R2 files, it is generally presented with the 5′ and 3′ orientation too, so the R primer has to be at the beginning of the R2 sequences:

grep 'CTTGTGCGGGCCCCCGTCAATTC' S_R2_rawData.fastq

grep2

 

  • Count the reads

With the grep command, the number of reads can be easily calculated as we have fastq formatted files. First, check the run identifier and then, count it:

head S_R1_rawData.fastq

Capture d’écran 2016-04-26 à 15.13.29

grep -c 'M03909' S_R1_rawData.fastq

You can generalize it for all your raw data files with the « * » (star) as shown in the image below:

Capture d’écran 2016-04-26 à 15.17.06
As you can see, the exact same number of reads is expected for the related R1 and R2 files.

 

FastQC analysis

With FastQC, the two important features for us here are the reads quality and size distribution.

  • Quality

S_R1_rawData.fastqCapture d’écran 2016-04-01 à 18.37.21

S_R2_rawData.fastqCapture d’écran 2016-04-01 à 18.37.00

The R1 file shows a great base quality. However, the R2 file does not provide a good information (low quality) for the last 100 nucleotides.

  • Size distribution

Both files show this graph:Capture d’écran 2016-04-26 à 15.51.06

So, we have 50.000 reads of 30-35 nucleotides and 400.000 reads of 300 nucleotides (the population that we excepted).

Here, the R2 file does not give a sufficient information (because of the low quality of 100-120 nucleotides) and that is the reason why only R1 reads will be used (You can see here for more details about R1 and R2 processing).

Barcode and primer removal

The present data were already demultiplexed. That is the reason why, the barcode is missing. To use an appropriate usearch script, we will artificially add a barcode for each sample.

I wrote this python script to do it. You can download it safely. This script will add « AAAAA » at the beginning of each sequences and will create a fasta file needed for the next usearch command. Please refer to the detailed version for more explanations.

Use the add_barcode.py script (requires python2):

python add_barcode.py S_R1_rawData.fastq S
python add_barcode.py PI_R1_rawData.fastq PI
python add_barcode.py PII_R1_rawData.fastq PII

 

The script add_barcode.py will generate two files:

  • barcode_sample.fasta, which contains the barcode sequence and the sample name.
  • sample_barcoded.fastq, which contains the sequences with the barcode, added at the beginning.

Then, the usearch command to remove the barcode and the primer sequence. 2 mismatches are allowed for the primer’s sequence, 0 for the barcode’s sequence. The next command uses a python script that you already had downloaded following the Usearch installation. You need to adapt the command line with your own path.

python ~/Documents/Usearch/python_scripts/fastq_strip_barcode_relabel2.py S_R1_rawData_barcoded.fastq CTCCTACGGGAGGCAGCAG barcode_S.fasta S > S_stripped.fastq

The arguments are:

python usearch_script sample_barcoded.fastq primer_sequence barcode_sample.fasta sample_label > output_file.fastq

We can now count again the total number of reads:

grep -c 'barcodelabel=' S_stripped.fastq

or with the output of the usearch script, with the number of « matched »:Capture d’écran 2016-04-26 à 17.09.40

 

Pool the data

Sequences are now labeled, and we need to pool all of them to continue.

cat *_stripped.fastq > all_reads.fastq

 

Filtering

This step is usually done in order to select the good sized reads with a good quality. In the present case, reads quality is good, and we will discard the little reads of 30 nt:

usearch -fastq_filter all_reads.fastq -fastq_minlen 250 -fastaout reads_trimmed.fasta

 

Dereplication

The dereplication is useful in order to group the duplicates sequences. This is mainly used to reduce the computation time of the next steps algorithms. We use here the « prefix » dereplication, it means when a sequence is a prefix (100% match) of an other one, the longer is conserved weighted by the number of sequences grouped.

usearch -derep_prefix reads_trimmed.fasta -fastaout reads_derep.fasta -sizeout

 

Sorting

As the sequences were grouped (if some identical were found), they have to be sorted by size. This step is also important to remove singletons (unique sequences), which will not be used in the clustering step.

usearch -sortbysize reads_derep.fasta -fastaout reads_sorted.fasta -minsize 2

 

Clustering step

This step will construct a set of representative sequences from our reads. For more details, please refer to the detailed version and/or the Robert Edgar page.

usearch -cluster_otus reads_sorted.fasta -otus otus.fasta -relabel OTU_ -sizeout -uparseout result_clustering_otu.txt

 

Chimera filtering

Removal of all chimeric sequences, which have no biological sense. The « gold.fa » file needs to be downloaded.

usearch -uchime_ref otus.fasta -db ~/Documents/Usearch/DB/gold.fa -strand plus -nonchimeras nochim_otus.fasta

 

Reads mapping

All reads (singletons included) will be mapped on the representative sequences found in the clustering step.

usearch -usearch_global reads_trimmed.fasta -db nochim_otus.fasta -strand plus -id 0.97 -uc map.uc

 

Taxonomy assignment

New feature in the current Usearch version, the taxonomy assignment can be done with a great accuracy, see here for more details. To use this command, reference databases have to be downloaded. See the detailed in the usearch installation page.

usearch -utax nochim_otus.fasta -db ~/Documents/Usearch/DB/utax_refdb.fa -strand both -taxconfs ~/Documents/Usearch/DB/500.tc -utaxout tax.txt

 

OTU table finalization

From the « map.uc » file generated during the reads mapping step, a pyton script will construct the text file with the OTU table.

python ~/Documents/Usearch/python_scripts/uc2otutab.py map.uc > otu_table.txt

 

At this point, three files are important to follow:

  1. nochim_otus.fasta, which contains the sequence of each identified OTU.
  2. tax.txt, which contains the full taxonomy assignment for each OTU.
  3. otu_table.txt, which contains the count of each OTU in the studied samples.

 

IN PROGRESS