Quality Control and Trimming

Downloading the data

The raw data were deposited at the European Nucleotide Archive (ENA) under the run accession SRR957824. You can go to the ENA website and search for that accession to find the run.

However, these files contain about 3 million reads and are therefore quite big. We are only going to use a subset of the original dataset for this tutorial.
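
For reference, the full run can also be fetched directly from ENA. The commands below are a sketch that assumes ENA's usual FTP layout for nine-character run accessions; the exact URLs are easiest to copy from the run's page on the ENA website.

curl -O ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR957/SRR957824/SRR957824_1.fastq.gz

curl -O ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR957/SRR957824/SRR957824_2.fastq.gz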

First, create a data/ directory in your home folder

mkdir ~/data
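
If you re-run this step and the directory already exists, mkdir will complain; adding the -p flag makes the command safe to run repeatedly.

mkdir -p ~/data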

Now let's download the subset (curl's -L flag follows redirects, and -O -J together save each file under its original name)

cd ~/data

curl -O -J -L https://osf.io/shqpv/download

curl -O -J -L https://osf.io/9m3ch/download

Let’s make sure we downloaded all of our data using md5sum.

md5sum SRR957824_500K_R1.fastq.gz SRR957824_500K_R2.fastq.gz

You should see this

1e8cf249e3217a5a0bcc0d8a654585fb  SRR957824_500K_R1.fastq.gz

70c726a31f05f856fe942d727613adb7  SRR957824_500K_R2.fastq.gz
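
If you prefer an automatic check, md5sum -c compares the files against a list of expected sums (here fed on standard input); each file should come back marked OK.

md5sum -c <<EOF
1e8cf249e3217a5a0bcc0d8a654585fb  SRR957824_500K_R1.fastq.gz
70c726a31f05f856fe942d727613adb7  SRR957824_500K_R2.fastq.gz
EOF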

Now look at the file names and their sizes

ls -lh

total 97M

-rw-r--r-- 1 hadrien 48M Nov 19 18:44 SRR957824_500K_R1.fastq.gz

-rw-r--r-- 1 hadrien 50M Nov 19 18:53 SRR957824_500K_R2.fastq.gz

There are 500,000 paired-end reads, sampled randomly from the original data

One last thing before we get to the quality control: those files are writable. By default, UNIX makes files writable by their owner, which means it is easy to accidentally introduce typos or other errors into the raw data. Let's fix that before going further

chmod u-w *
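
To check that it worked, list the files again: the permissions should now start with -r-- instead of -rw-, and any attempt to modify the files (for example with a redirect) will fail with "Permission denied".

ls -l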

Working Directory

First we make a work directory: a place where we can play around with the data without messing with the original files

mkdir ~/work

cd ~/work

Now we make symbolic links to the data in our working directory

ln -s ~/data/* .
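
You can confirm that these are links rather than copies: each entry in the listing starts with an l and points back to the corresponding file under ~/data (the exact paths depend on your home directory).

ls -l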

The files that we've downloaded are FASTQ files. Take a look at one of them with

zless SRR957824_500K_R1.fastq.gz
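
Each FASTQ record spans four lines (header, sequence, a + separator, and the quality string), so you can peek at the first record and sanity-check the read count by dividing the total number of lines by four; the second command should print 500000.

zcat SRR957824_500K_R1.fastq.gz | head -4

echo $(( $(zcat SRR957824_500K_R1.fastq.gz | wc -l) / 4 ))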