Appendix A2 -- Advanced read generation

This more verbose but also more flexible way of generating artificial reads is currently needed if you want to

Use customised base quality strings in the output FASTQ file,
Apply indels or other alterations to the nucleotide string which also affect the read quality string.

The output FASTQ file will be constructed by interleaving the contents of three files, containing the read names, the nucleotide strings and the quality strings, respectively. These files will be generated in the following.

Additionally, a table containing the read names with the true read positions will be created, to evaluate later on whether a read was mapped correctly.

The uniform script can be used to sample reads from a reference genome. The command is used like already explained in the chapter about simplified read generation. The difference is that in this case it is called without the --output-fastq switch. Instead, the files needed for creating the read FASTQ file are created manually and merged to the FASTQ format using synth_fastq.

This manual shows an example, arbitrary custom manipulation steps can be performed in between or instead these steps.

Create the subfolder which holds the generated files:

mkdir data/2e

Generating artificial reads from the reference

Using uniform without the --output-fastq switch produces a text table output with four columns:

The FASTA record name (e.g. chromosome) where the read originated.
1-based base number of the reads' first base.
1-based base number of the reads' last base.
The nucleotide sequence of the artificial read.

Example: Extract 25 reads, with minimum length 20 base pairs, where half of the reads longer than 25 bp. The value 1234 is used to initialize the random number generator. To obtain a reproducible output, this parameter can be used with an arbitrary integer. If omitted, different reads are produced on every call.

We will generate sample reads from our volpertinger sample genome. Execute the following script to generate random nucleotide sequences:

scripts/uniform input/genome/volpertinger.fasta \
    --seed 1234\
    25 20 5  \
    > data/2e/volpertinger.reads

The resulting file looks like this:

head data/2e/volpertinger.reads | column -t

record  start  end   read
B1      2143   2171  TTCCACAAGATATTAGCCAACCAGATCCA
MT      3402   3421  TCTATTTAATAACTTCTCCC
A3      1413   1438  TTGAACTCTATCTTCCGGGGCTCAAC
A3      5689   5709  TCATGAGGGTTTCTATTTAGA
MT      3280   3304  TACGACCTCGATGTTGGATCAGGAC
A3      4936   4962  ACCCAAGGGCCCCCAGCCAGGTTGCTG
A1      1320   1341  TGGGAATAGCCCTGTGTTTATC
A3      4480   4506  AATAAACTTAAAAAAATTAGTGGCATG
X       2381   2400  ACGATGACCATCTTCTTGCG

Insertions and deletions (indels)

TAPAS contains a tool, called indel, which can insert and delete nucleotides randomly. The tool outputs the mutated nucleotide sequences along with a CIGAR string which describes which operation had been taken. It expects as input the list of nucleotides. Below is an example call with the nucleotides created above.

scripts/indel --in-prob 0.01  --in-exp 0.1 \
              --cigar-new     --seed 112233 \
              --col-seq read --col-cigar cigar \
    < data/2e/volpertinger.reads \
    > data/2e/volpertinger.reads.indel

head data/2e/volpertinger.reads.indel | column -t

record  start  end   read                             cigar
B1      2143   2171  TTCCACAATGGATATTAGCCAACCAGATCCA  8M2I21M
MT      3402   3421  TCTATTTAATAACTTCTCCC             20M
A3      1413   1438  TTGAACTCTATCTTCCGGGGCTCAAC       26M
A3      5689   5709  TCATGAGGGTTTCTATTTAGA            21M
MT      3280   3304  TACGAATCCTCGATGTTGGATCAGGAC      5M2I20M
A3      4936   4962  ACCCAAGGGCCCCCAGCCAGGTTGCTG      27M
A1      1320   1341  TGGGAATAGCCCTGTGTTTATC           22M
A3      4480   4506  AATAAACTTAAAAAAATTAGTGGCATG      27M
X       2381   2400  ACGATGACCATCTTCTTGCG             20M

Please note (I): If a deletion occurs at the beginning of the read, it cannot be detected by the mapper which is to be tested. In the optimal case a deletion of $n$ bp at the beginning of the read will cause the mapper to locate the read $n$ bp downstream of its true position. For an insertion, the read location reported by the mapper depends on whether the mapper matches the inserted base pairs to the reference or not. Therefore, it might be warranted to allow for some tolerance when determining whether a read is correctly mapped or remove the affected reads from the read set. For example, awk and its regular expression matching capabilities (... ~ /.../) can be used to this effect. The following call removes lines (=reads) in the input table (volpertinger.reads.indel) where the CIGAR string (column 6 = $6) does not start with a match (^[0-9]+M) and end with a match ([0-9]+M$). See tutorials on regular expressions and the SAM Specification (CIGAR strings) for more information.

awk '$6 ~ /^[0-9]+M/  &&  $6 ~ /[0-9]+M$/' \
    data/2e/volpertinger.reads.indel
    > data/2e/volpertinger.reads.indel.filtered

Please note (II): In the published pilot study version 1.1 of TAPAS was used, which did not have the indel feature implemented. The current version is not yet been tested on an in vivo dataset. We urge caution with the use of the indel feature. Should erroneous behaviour occur despite our efforts and care to produce correcly working software, a short notice to the authors is much appreciated.

The usage of indel shown above creates a new file to facilitate the demonstration of how the presented tools change the data. However, it is also possible to directly combine the read generation and indel insertion using the UNIX pipe:

scripts/uniform input/genome/volpertinger.fasta \
    --quiet --seed 1234  25 20 5  \
 | scripts/indel --in-prob 0.01 --in-exp 0.1 \
                 --cigar-new --seed 112233 \
                 --col-seq read --col-cigar cigar \
    > data/2e/volpertinger.reads.indel

The result is the same as with the above approach using the intermediate files:

head data/2e/volpertinger.reads.indel | column -t

record  start  end   read                             cigar
B1      2143   2171  TTCCACAATGGATATTAGCCAACCAGATCCA  8M2I21M
MT      3402   3421  TCTATTTAATAACTTCTCCC             20M
A3      1413   1438  TTGAACTCTATCTTCCGGGGCTCAAC       26M
A3      5689   5709  TCATGAGGGTTTCTATTTAGA            21M
MT      3280   3304  TACGAATCCTCGATGTTGGATCAGGAC      5M2I20M
A3      4936   4962  ACCCAAGGGCCCCCAGCCAGGTTGCTG      27M
A1      1320   1341  TGGGAATAGCCCTGTGTTTATC           22M
A3      4480   4506  AATAAACTTAAAAAAATTAGTGGCATG      27M
X       2381   2400  ACGATGACCATCTTCTTGCG             20M

Read names

In this tutorial we generate read names consisting of the organism name (volpertinger) followed by an underscore and a counting number.

There is a script included for adding this kind of column, which is shown in the next code example. You can as well use awk or whichever tool you like to accomplish this task if you need more sophisticated read names.

scripts/index_column  --prefix volpertinger_ \
                      --colname name  \
                      --inplace data/2e/volpertinger.reads.indel

head data/2e/volpertinger.reads.indel | column -t

name            record  start  end   read                             cigar
volpertinger_0  B1      2143   2171  TTCCACAATGGATATTAGCCAACCAGATCCA  8M2I21M
volpertinger_1  MT      3402   3421  TCTATTTAATAACTTCTCCC             20M
volpertinger_2  A3      1413   1438  TTGAACTCTATCTTCCGGGGCTCAAC       26M
volpertinger_3  A3      5689   5709  TCATGAGGGTTTCTATTTAGA            21M
volpertinger_4  MT      3280   3304  TACGAATCCTCGATGTTGGATCAGGAC      5M2I20M
volpertinger_5  A3      4936   4962  ACCCAAGGGCCCCCAGCCAGGTTGCTG      27M
volpertinger_6  A1      1320   1341  TGGGAATAGCCCTGTGTTTATC           22M
volpertinger_7  A3      4480   4506  AATAAACTTAAAAAAATTAGTGGCATG      27M
volpertinger_8  X       2381   2400  ACGATGACCATCTTCTTGCG             20M

Splitting the read table

To create a FASTQ file out of the generated reads, three files must be prepared containing:

The list of read names
The list of nucleotide strings
The list of quality strings

The first and the second file are generated from the file which contains the read coordinates and sequences (volpertinger.coord.indel in this example). The third list is generated using standard UNIX tools from the nucleotide strings.

First, the contents of volpertinger.reads.indel, generated in the previous section, are split into separate files containing the read names and the nucleotide sequences. This step can be performed using the UNIX tool awk. (NR != 1) removes the header line and {print $4} prints the corresponding column. The columns are counted starting with 1.

Generate raw read sequences:

awk '(NR != 1){print $5}' \
    < data/2e/volpertinger.reads.indel \
    > data/2e/volpertinger.nucl

Generate list of read names:

awk '(NR != 1){print $1}' \
    < data/2e/volpertinger.reads.indel \
    > data/2e/volpertinger.i

Show the result:

head data/2e/volpertinger.nucl

TTCCACAATGGATATTAGCCAACCAGATCCA
TCTATTTAATAACTTCTCCC
TTGAACTCTATCTTCCGGGGCTCAAC
TCATGAGGGTTTCTATTTAGA
TACGAATCCTCGATGTTGGATCAGGAC
ACCCAAGGGCCCCCAGCCAGGTTGCTG
TGGGAATAGCCCTGTGTTTATC
AATAAACTTAAAAAAATTAGTGGCATG
ACGATGACCATCTTCTTGCG
TATGGTATTTGGTACTATATGTAGTTTGTGTGTGGTG

head data/2e/volpertinger.i

volpertinger_0
volpertinger_1
volpertinger_2
volpertinger_3
volpertinger_4
volpertinger_5
volpertinger_6
volpertinger_7
volpertinger_8
volpertinger_9

Quality strings

The following example generates strings of constant quality score for every read. This can be done by the UNIX sed tool, which replaces every character by an F:

sed 's/./F/g' \
      data/2e/volpertinger.nucl \
    > data/2e/volpertinger.q

head data/2e/volpertinger.q

FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
FFFFFFFFFFFFFFFFFFFF
FFFFFFFFFFFFFFFFFFFFFFFFFF
FFFFFFFFFFFFFFFFFFFFF
FFFFFFFFFFFFFFFFFFFFFFFFFFF
FFFFFFFFFFFFFFFFFFFFFFFFFFF
FFFFFFFFFFFFFFFFFFFFFF
FFFFFFFFFFFFFFFFFFFFFFFFFFF
FFFFFFFFFFFFFFFFFFFF
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF

If you want to generate more elaborate quality strings, you are free to do so using whichever tools you desire. Just generate a list with as many lines as there are nucleotide strings in volpertinger.nucl to input them into the pipeline.

Putting the FASTQ file together

The synth_fastq tool creates a FASTQ file from its components, nucleotide string, quality string and read name (ID line). If the file containing the read names is omitted, the reads are numbered sequentially.

scripts/synth_fastq data/2e/volpertinger.nucl \
                    data/2e/volpertinger.q    \
                    data/2e/volpertinger.i    \
    > data/2e/volpertinger.fastq

head data/2e/volpertinger.fastq

@volpertinger_0
TTCCACAATGGATATTAGCCAACCAGATCCA
+
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
@volpertinger_1
TCTATTTAATAACTTCTCCC
+
FFFFFFFFFFFFFFFFFFFF
@volpertinger_2
TTGAACTCTATCTTCCGGGGCTCAAC

Repeat the above steps to generate contaminant reads

The following commands generate some reads from a truncated Rhizobium etli genome, to supply reads which are not supposed to map. This way, contamination with non-endogenous reads is simulated.

Note that three kinds of abbreviations are used here:

No indels are applied to the exogenous reads in this example, therefore the process of splitting the tabular output into suitable files for generating the FASTQ file is performed using the --output switch of the tool uniform.
Terms like const_{a,b,c} are expanded by bash to const_a const_b const_c and can therefore be used to specify multiple paths which share some parts.
The temporary files (similar to volpertinger.q and volpertinger.i above) are omitted here by using bash's process substitution (<(...)) which uses the output of one argument instead of a file name which the other command expects.

As these are only abbreviations and do not change functionality, you can also use the more verbose commands described in the previous sections of this chapter to generate retli.fastq.

Index genome and sample nucleotide sequences:

samtools faidx input/retli/retli_tr.fasta

scripts/uniform \
    input/retli/retli_tr.fasta \
    --seed 2345 \
    --output data/2e/retli.{coord,nucl} \
    25 20 5

Read names:

scripts/index_column  --prefix retli_ \
                      --colname name  \
                      --inplace data/2e/retli.coord

Put the FASTQ together:

Quality strings are generated without an intermediate file using sed
Read names are extracted without an intermediate file using awk

scripts/synth_fastq \
    data/2e/retli.nucl \
    <(sed 's/./F/g'           data/2e/retli.nucl) \
    <(awk '(NR!=1){print $1}' data/2e/retli.coord) \
    > data/2e/retli.fastq

head data/2e/retli.fastq

@retli_0
TCGTTCTTGTAGCCTTCCGGGAAG
+
FFFFFFFFFFFFFFFFFFFFFFFF
@retli_1
ACTGAAGCGCAAGCTCTGAAG
+
FFFFFFFFFFFFFFFFFFFFF
@retli_2
TGGCCGAGGGACGCTTGCGTCGTC