This more verbose but also more flexible way of generating artificial reads is currently needed if you want to
The output FASTQ file will be constructed by interleaving the contents of three files, containing the read names, the nucleotide strings and the quality strings, respectively. These files will be generated in the following.
Additionally, a table containing the read names with the true read positions will be created, to evaluate later on whether a read was mapped correctly.
The uniform
script can be used to sample reads from a reference genome. The command is used like already explained in the chapter about simplified read generation. The difference is that in this case it is called without the --output-fastq
switch. Instead, the files needed for creating the read FASTQ file are created manually and merged to the FASTQ format using synth_fastq
.
This manual shows an example, arbitrary custom manipulation steps can be performed in between or instead these steps.
Create the subfolder which holds the generated files:
mkdir data/2e
Using uniform
without the --output-fastq
switch produces a text table output with four columns:
Example: Extract 25 reads, with minimum length 20 base pairs, where half of the reads longer than 25 bp. The value 1234 is used to initialize the random number generator. To obtain a reproducible output, this parameter can be used with an arbitrary integer. If omitted, different reads are produced on every call.
We will generate sample reads from our volpertinger
sample genome. Execute the following script to generate random nucleotide sequences:
scripts/uniform input/genome/volpertinger.fasta \
--seed 1234\
25 20 5 \
> data/2e/volpertinger.reads
The resulting file looks like this:
head data/2e/volpertinger.reads | column -t
record start end read
B1 2143 2171 TTCCACAAGATATTAGCCAACCAGATCCA
MT 3402 3421 TCTATTTAATAACTTCTCCC
A3 1413 1438 TTGAACTCTATCTTCCGGGGCTCAAC
A3 5689 5709 TCATGAGGGTTTCTATTTAGA
MT 3280 3304 TACGACCTCGATGTTGGATCAGGAC
A3 4936 4962 ACCCAAGGGCCCCCAGCCAGGTTGCTG
A1 1320 1341 TGGGAATAGCCCTGTGTTTATC
A3 4480 4506 AATAAACTTAAAAAAATTAGTGGCATG
X 2381 2400 ACGATGACCATCTTCTTGCG
TAPAS contains a tool, called indel
, which can insert and delete nucleotides randomly. The tool outputs the mutated nucleotide sequences along with a CIGAR string which describes which operation had been taken. It expects as input the list of nucleotides. Below is an example call with the nucleotides created above.
scripts/indel --in-prob 0.01 --in-exp 0.1 \
--cigar-new --seed 112233 \
--col-seq read --col-cigar cigar \
< data/2e/volpertinger.reads \
> data/2e/volpertinger.reads.indel
head data/2e/volpertinger.reads.indel | column -t
record start end read cigar
B1 2143 2171 TTCCACAATGGATATTAGCCAACCAGATCCA 8M2I21M
MT 3402 3421 TCTATTTAATAACTTCTCCC 20M
A3 1413 1438 TTGAACTCTATCTTCCGGGGCTCAAC 26M
A3 5689 5709 TCATGAGGGTTTCTATTTAGA 21M
MT 3280 3304 TACGAATCCTCGATGTTGGATCAGGAC 5M2I20M
A3 4936 4962 ACCCAAGGGCCCCCAGCCAGGTTGCTG 27M
A1 1320 1341 TGGGAATAGCCCTGTGTTTATC 22M
A3 4480 4506 AATAAACTTAAAAAAATTAGTGGCATG 27M
X 2381 2400 ACGATGACCATCTTCTTGCG 20M
Please note (I): If a deletion occurs at the beginning of the read, it cannot be detected by the mapper which is to be tested. In the optimal case a deletion of \(n\) bp at the beginning of the read will cause the mapper to locate the read \(n\) bp downstream of its true position. For an insertion, the read location reported by the mapper depends on whether the mapper matches the inserted base pairs to the reference or not. Therefore, it might be warranted to allow for some tolerance when determining whether a read is correctly mapped or remove the affected reads from the read set. For example, awk
and its regular expression matching capabilities (... ~ /.../
) can be used to this effect. The following call removes lines (=reads) in the input table (volpertinger.reads.indel
) where the CIGAR string (column 6 = $6
) does not start with a match (^[0-9]+M
) and end with a match ([0-9]+M$
). See tutorials on regular expressions and the SAM Specification (CIGAR strings) for more information.
awk '$6 ~ /^[0-9]+M/ && $6 ~ /[0-9]+M$/' \
data/2e/volpertinger.reads.indel
> data/2e/volpertinger.reads.indel.filtered
Please note (II): In the published pilot study version 1.1 of TAPAS was used, which did not have the indel feature implemented. The current version is not yet been tested on an in vivo dataset. We urge caution with the use of the indel feature. Should erroneous behaviour occur despite our efforts and care to produce correcly working software, a short notice to the authors is much appreciated.
The usage of indel
shown above creates a new file to facilitate the demonstration of how the presented tools change the data. However, it is also possible to directly combine the read generation and indel insertion using the UNIX pipe:
scripts/uniform input/genome/volpertinger.fasta \
--quiet --seed 1234 25 20 5 \
| scripts/indel --in-prob 0.01 --in-exp 0.1 \
--cigar-new --seed 112233 \
--col-seq read --col-cigar cigar \
> data/2e/volpertinger.reads.indel
The result is the same as with the above approach using the intermediate files:
head data/2e/volpertinger.reads.indel | column -t
record start end read cigar
B1 2143 2171 TTCCACAATGGATATTAGCCAACCAGATCCA 8M2I21M
MT 3402 3421 TCTATTTAATAACTTCTCCC 20M
A3 1413 1438 TTGAACTCTATCTTCCGGGGCTCAAC 26M
A3 5689 5709 TCATGAGGGTTTCTATTTAGA 21M
MT 3280 3304 TACGAATCCTCGATGTTGGATCAGGAC 5M2I20M
A3 4936 4962 ACCCAAGGGCCCCCAGCCAGGTTGCTG 27M
A1 1320 1341 TGGGAATAGCCCTGTGTTTATC 22M
A3 4480 4506 AATAAACTTAAAAAAATTAGTGGCATG 27M
X 2381 2400 ACGATGACCATCTTCTTGCG 20M
In this tutorial we generate read names consisting of the organism name (volpertinger) followed by an underscore and a counting number.
There is a script included for adding this kind of column, which is shown in the next code example. You can as well use awk
or whichever tool you like to accomplish this task if you need more sophisticated read names.
scripts/index_column --prefix volpertinger_ \
--colname name \
--inplace data/2e/volpertinger.reads.indel
head data/2e/volpertinger.reads.indel | column -t
name record start end read cigar
volpertinger_0 B1 2143 2171 TTCCACAATGGATATTAGCCAACCAGATCCA 8M2I21M
volpertinger_1 MT 3402 3421 TCTATTTAATAACTTCTCCC 20M
volpertinger_2 A3 1413 1438 TTGAACTCTATCTTCCGGGGCTCAAC 26M
volpertinger_3 A3 5689 5709 TCATGAGGGTTTCTATTTAGA 21M
volpertinger_4 MT 3280 3304 TACGAATCCTCGATGTTGGATCAGGAC 5M2I20M
volpertinger_5 A3 4936 4962 ACCCAAGGGCCCCCAGCCAGGTTGCTG 27M
volpertinger_6 A1 1320 1341 TGGGAATAGCCCTGTGTTTATC 22M
volpertinger_7 A3 4480 4506 AATAAACTTAAAAAAATTAGTGGCATG 27M
volpertinger_8 X 2381 2400 ACGATGACCATCTTCTTGCG 20M
To create a FASTQ file out of the generated reads, three files must be prepared containing:
The first and the second file are generated from the file which contains the read coordinates and sequences (volpertinger.coord.indel
in this example). The third list is generated using standard UNIX tools from the nucleotide strings.
First, the contents of volpertinger.reads.indel
, generated in the previous section, are split into separate files containing the read names and the nucleotide sequences. This step can be performed using the UNIX tool awk
. (NR != 1)
removes the header line and {print $4}
prints the corresponding column. The columns are counted starting with 1.
Generate raw read sequences:
awk '(NR != 1){print $5}' \
< data/2e/volpertinger.reads.indel \
> data/2e/volpertinger.nucl
Generate list of read names:
awk '(NR != 1){print $1}' \
< data/2e/volpertinger.reads.indel \
> data/2e/volpertinger.i
Show the result:
head data/2e/volpertinger.nucl
TTCCACAATGGATATTAGCCAACCAGATCCA
TCTATTTAATAACTTCTCCC
TTGAACTCTATCTTCCGGGGCTCAAC
TCATGAGGGTTTCTATTTAGA
TACGAATCCTCGATGTTGGATCAGGAC
ACCCAAGGGCCCCCAGCCAGGTTGCTG
TGGGAATAGCCCTGTGTTTATC
AATAAACTTAAAAAAATTAGTGGCATG
ACGATGACCATCTTCTTGCG
TATGGTATTTGGTACTATATGTAGTTTGTGTGTGGTG
head data/2e/volpertinger.i
volpertinger_0
volpertinger_1
volpertinger_2
volpertinger_3
volpertinger_4
volpertinger_5
volpertinger_6
volpertinger_7
volpertinger_8
volpertinger_9
The following example generates strings of constant quality score for every read. This can be done by the UNIX sed
tool, which replaces every character by an F:
sed 's/./F/g' \
data/2e/volpertinger.nucl \
> data/2e/volpertinger.q
head data/2e/volpertinger.q
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
FFFFFFFFFFFFFFFFFFFF
FFFFFFFFFFFFFFFFFFFFFFFFFF
FFFFFFFFFFFFFFFFFFFFF
FFFFFFFFFFFFFFFFFFFFFFFFFFF
FFFFFFFFFFFFFFFFFFFFFFFFFFF
FFFFFFFFFFFFFFFFFFFFFF
FFFFFFFFFFFFFFFFFFFFFFFFFFF
FFFFFFFFFFFFFFFFFFFF
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
If you want to generate more elaborate quality strings, you are free to do so using whichever tools you desire. Just generate a list with as many lines as there are nucleotide strings in volpertinger.nucl
to input them into the pipeline.
The synth_fastq
tool creates a FASTQ file from its components, nucleotide string, quality string and read name (ID line). If the file containing the read names is omitted, the reads are numbered sequentially.
scripts/synth_fastq data/2e/volpertinger.nucl \
data/2e/volpertinger.q \
data/2e/volpertinger.i \
> data/2e/volpertinger.fastq
head data/2e/volpertinger.fastq
@volpertinger_0
TTCCACAATGGATATTAGCCAACCAGATCCA
+
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
@volpertinger_1
TCTATTTAATAACTTCTCCC
+
FFFFFFFFFFFFFFFFFFFF
@volpertinger_2
TTGAACTCTATCTTCCGGGGCTCAAC
The following commands generate some reads from a truncated Rhizobium etli genome, to supply reads which are not supposed to map. This way, contamination with non-endogenous reads is simulated.
Note that three kinds of abbreviations are used here:
--output
switch of the tool uniform
.const_{a,b,c}
are expanded by bash
to const_a const_b const_c
and can therefore be used to specify multiple paths which share some parts.volpertinger.q
and volpertinger.i
above) are omitted here by using bash
's process substitution (<(...)
) which uses the output of one argument instead of a file name which the other command expects.As these are only abbreviations and do not change functionality, you can also use the more verbose commands described in the previous sections of this chapter to generate retli.fastq
.
Index genome and sample nucleotide sequences:
samtools faidx input/retli/retli_tr.fasta
scripts/uniform \
input/retli/retli_tr.fasta \
--seed 2345 \
--output data/2e/retli.{coord,nucl} \
25 20 5
Read names:
scripts/index_column --prefix retli_ \
--colname name \
--inplace data/2e/retli.coord
Put the FASTQ together:
sed
awk
scripts/synth_fastq \
data/2e/retli.nucl \
<(sed 's/./F/g' data/2e/retli.nucl) \
<(awk '(NR!=1){print $1}' data/2e/retli.coord) \
> data/2e/retli.fastq
head data/2e/retli.fastq
@retli_0
TCGTTCTTGTAGCCTTCCGGGAAG
+
FFFFFFFFFFFFFFFFFFFFFFFF
@retli_1
ACTGAAGCGCAAGCTCTGAAG
+
FFFFFFFFFFFFFFFFFFFFF
@retli_2
TGGCCGAGGGACGCTTGCGTCGTC