Quality Control and Trimming
The raw data were deposited at the European Nucleotide Archive (ENA) under the accession number SRR957824. You can go to the ENA website and search for this run.
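If you are curious, you can also query the run from the command line through the ENA portal API. To the best of our knowledge this is the right endpoint and field name, but check the ENA documentation if it has changed; the command asks for the FTP locations of the original FASTQ files:
curl 'https://www.ebi.ac.uk/ena/portal/api/filereport?accession=SRR957824&result=read_run&fields=fastq_ftp'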
However, those files contain about 3 million reads and are therefore quite big. We will only use a subset of the original dataset for this tutorial.
First create a data/ directory in your home folder
mkdir ~/data
Now let's download the subset. The curl flags tell it to follow redirects (-L), save the output to a file (-O), and name that file using the filename suggested by the server (-J)
cd ~/data
curl -O -J -L https://osf.io/shqpv/download
curl -O -J -L https://osf.io/9m3ch/download
Let’s make sure we downloaded all of our data using md5sum.
md5sum SRR957824_500K_R1.fastq.gz SRR957824_500K_R2.fastq.gz
You should see this
1e8cf249e3217a5a0bcc0d8a654585fb SRR957824_500K_R1.fastq.gz
70c726a31f05f856fe942d727613adb7 SRR957824_500K_R2.fastq.gz
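Rather than comparing the checksums by eye, you can save the expected values to a file and let md5sum do the check for you (these are the same values as above; note the two spaces between checksum and filename):
echo "1e8cf249e3217a5a0bcc0d8a654585fb  SRR957824_500K_R1.fastq.gz" > checksums.md5
echo "70c726a31f05f856fe942d727613adb7  SRR957824_500K_R2.fastq.gz" >> checksums.md5
md5sum -c checksums.md5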
Now look at the file names and their sizes
ls -l
total 97M
-rw-r--r-- 1 hadrien 48M Nov 19 18:44 SRR957824_500K_R1.fastq.gz
-rw-r--r-- 1 hadrien 50M Nov 19 18:53 SRR957824_500K_R2.fastq.gz
There are 500,000 paired-end reads, taken randomly from the original dataset.
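You can verify that number yourself: each read occupies four lines in a FASTQ file, so dividing the line count by four gives the number of reads. This should print 500000:
echo $(( $(zcat SRR957824_500K_R1.fastq.gz | wc -l) / 4 ))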
One last thing before we get to the quality control: those files are writable. By default, UNIX makes files writable by their owner, which means a stray command or typo could modify or corrupt the raw data. Let's fix that before going further
chmod u-w *
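Here u-w removes (-) the write permission (w) for the file owner (u). You can confirm it worked with ls -l: the owner permissions should now read r-- instead of rw-
ls -l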
First we make a work directory: a place where we can play around with a copy of the data without messing with the original
mkdir ~/work
cd ~/work
Now we make symbolic links to the data in our working directory
ln -s ~/data/* .
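A symbolic link points back to the original file rather than copying it, so it takes essentially no extra disk space and the read-only originals in ~/data stay protected. You can check where a link points with readlink (or with ls -l, which shows an arrow after the link name):
readlink SRR957824_500K_R1.fastq.gz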
The files that we've downloaded are FASTQ files. Take a look at one of them with
zless SRR957824_500K_R1.fastq.gz
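Each read is stored as four lines: a header starting with @, the sequence itself, a separator line starting with +, and the per-base quality scores encoded as ASCII characters (one character per base). A made-up record, just to illustrate the layout, looks like this:
@SRR957824.1 1 length=50
TACGGAGGATGCGAGCGTTATCCGGATTTATTGGGTTTAAAGGGTGCGTA
+
CCCFFFFFHHHHHJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJ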