Dealing with Single-end and Paired-end Data

Sequencing begins by synthesizing DNA fragments from a template, a step called amplification. The region to amplify is defined by the chosen primers, and it is sequenced as it is synthesized.

To amplify and sequence a specific portion of the DNA, there are two widely used approaches:

  • Single-end (SE) sequencing: The molecule is read from one end only, either 5′ or 3′. Only one sequence (read) is generated per sequenced molecule. Data from single-end sequencing are fairly easy to process, because there is one file per sample.
  • Paired-end (PE) sequencing: The molecule is read once from each end, 5′ and 3′. Two reads are produced for each sequenced molecule, and they are delivered in two separate files.
    1. The first file, usually labeled « R1 », contains the reads sequenced from the 5′ end.
    2. The second file, usually labeled « R2 », contains the reads sequenced from the 3′ end.
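
As a minimal sketch of how the R1/R2 labels can be used in practice, the function below groups file names by sample. The naming pattern (`<sample>_R1.fastq`, `<sample>_R2.fastq`, optionally with `.gz`) and the function name `pair_fastq_files` are assumptions made for illustration; real sequencing runs may name their files differently.

```python
import re
from collections import defaultdict


def pair_fastq_files(filenames):
    """Group FASTQ file names by sample, using the R1/R2 label."""
    # Hypothetical naming convention: <sample>_R1.fastq / <sample>_R2.fastq,
    # optionally followed by ".gz". Adapt the pattern to your own files.
    pattern = re.compile(r"^(?P<sample>.+)_(?P<read>R[12])\.fastq(\.gz)?$")
    samples = defaultdict(dict)
    for name in filenames:
        match = pattern.match(name)
        if match is None:
            continue  # not a recognised FASTQ name, skip it
        samples[match.group("sample")][match.group("read")] = name
    return dict(samples)


files = ["s1_R1.fastq", "s1_R2.fastq", "s2_R1.fastq.gz", "notes.txt"]
pairs = pair_fastq_files(files)
# "s1" has both R1 and R2 (paired-end); "s2" has only R1 (single-end).
```

A sample that ends up with only an R1 entry can be treated as single-end; a sample with both entries is paired-end.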

 

With SE sequencing, you have only one file per sample, and it enters your pipeline as « raw data » (input). With PE sequencing, you have two files per sample, and there are different ways to process them, depending on whether the reads overlap and on their quality.

  • Merging: If the R1 and R2 reads overlap, they can be merged. Merging preserves the most information, because both reads are kept and the overlapping region is sequenced twice.
  • Joining: If the R1 and R2 reads do not overlap, joining is one of the remaining options. The deciding factor is the quality of the R1 and R2 reads. If both are of sufficient quality, both are kept and joined with an « NNNNN » spacer. In general, R2 reads tend to have lower quality than R1 reads.
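
The two options can be sketched on plain strings as follows. This is a simplified illustration, not what dedicated merging programs do: it assumes the R2 read has already been brought into the same orientation as R1 (named `r2_rc` here), it requires the overlap to match exactly, and it ignores quality scores. The function names and the `min_overlap` default are hypothetical.

```python
def find_overlap(r1, r2_rc, min_overlap=10):
    """Length of the longest suffix of r1 that equals a prefix of r2_rc.

    Returns 0 when no exact overlap of at least min_overlap bases exists.
    min_overlap is expected to be at least 1.
    """
    max_len = min(len(r1), len(r2_rc))
    for length in range(max_len, min_overlap - 1, -1):
        if r1[-length:] == r2_rc[:length]:
            return length
    return 0


def merge_reads(r1, r2_rc, min_overlap=10):
    """Merge two overlapping reads, keeping the shared region only once."""
    overlap = find_overlap(r1, r2_rc, min_overlap)
    if overlap == 0:
        return None  # no usable overlap: merging is not possible
    return r1 + r2_rc[overlap:]


def join_reads(r1, r2_rc, spacer="NNNNN"):
    """Join two non-overlapping reads with a spacer of unknown bases."""
    return r1 + spacer + r2_rc


merged = merge_reads("ACGTACGTAAGGCCTT", "AAGGCCTTGGGTTTAC", min_overlap=5)
joined = join_reads("ACGT", "TTGA")
```

In this example the two reads share the eight bases `AAGGCCTT`, so the merged sequence contains them once, while the joined sequence simply places `NNNNN` between the two reads.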

If neither of the two conditions above is met (there is no overlap and the quality of the R2 reads is poor), only the R1 file is used for the analysis, and it is processed as in the SE case.
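
The choice between the three cases can be written as a small decision function. The sketch below applies it to a single read pair for simplicity, whereas the text describes the choice for a whole data set. It assumes quality strings encoded as Phred+33 (the usual FASTQ encoding); the threshold of 20 and the names `mean_phred` and `choose_strategy` are arbitrary choices made for illustration.

```python
def mean_phred(quality_string, offset=33):
    """Mean Phred score of a FASTQ quality string (Phred+33 assumed)."""
    return sum(ord(c) - offset for c in quality_string) / len(quality_string)


def choose_strategy(has_overlap, r2_quality, min_mean_quality=20):
    """Pick 'merge', 'join' or 'r1_only' for a read pair.

    has_overlap: whether the R1 and R2 reads overlap.
    r2_quality: quality string of the R2 read.
    min_mean_quality: hypothetical cut-off for "sufficient" quality.
    """
    if has_overlap:
        return "merge"
    if mean_phred(r2_quality) >= min_mean_quality:
        return "join"
    return "r1_only"  # fall back to processing R1 as single-end data


print(choose_strategy(True, "####"))   # overlap present
print(choose_strategy(False, "IIII"))  # no overlap, high-quality R2
print(choose_strategy(False, "####"))  # no overlap, low-quality R2
```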

One important point can help you understand your data. The reads in the R2 file are written 5′ to 3′ like those in R1, but they come from the opposite strand. This is why the R2 reads have to be reverse-complemented before merging or joining (the programs do this for you).
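
For illustration, reverse-complementing a read takes only a few lines. This sketch handles the bases A, C, G, T and N in upper or lower case; the function name is a hypothetical choice, and merging programs apply the equivalent step internally.

```python
# Translation table mapping each base to its complement.
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


r2 = "AACGTN"
r2_rc = reverse_complement(r2)  # "NACGTT"
```

When a read is reverse-complemented, its quality string has to be reversed as well so that each score stays attached to its base.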