With the following tools, SAM files can be parsed to gain information of read names, where they were mapped, which quality score the mapping was assigned and so on.
The procedures in this chapter may vary more than the previous ones, depending on the research question.
In the setting this package was originally designed for, the names of the reads carry the information where the reads actually belong to. This information can subsequently be compared to the actual mapping information obtained from the SAM file.
To use the information of the SAM file, convert it to a tabular form using the tool sam2table
. This generates a table with column names adhering to the names defined in the SAM Specification, which are, amongst others:
*
if not mapped.scripts/sam2table data/4/1.sam > data/5/1.tab
Show beginning and end of the generated file:
head data/5/1.tab | column -t
qname flag rname pos mapq cigar rnext pnext tlen
volpertinger_1 4 * 0 0 * * 0 0
volpertinger_2 0 MT 3402 37 20M * 0 0
volpertinger_3 4 * 0 0 * * 0 0
volpertinger_4 0 A3 5689 37 21M * 0 0
volpertinger_5 0 MT 3280 37 25M * 0 0
volpertinger_6 0 A3 4936 25 27M * 0 0
volpertinger_7 0 A1 1320 25 22M * 0 0
volpertinger_8 4 * 0 0 * * 0 0
volpertinger_9 0 X 2381 25 20M * 0 0
tail data/5/1.tab | column -t
retli_41 0 A1 961 25 20M * 0 0
retli_42 4 * 0 0 * * 0 0
retli_43 4 * 0 0 * * 0 0
retli_44 4 * 0 0 * * 0 0
retli_45 4 * 0 0 * * 0 0
retli_46 4 * 0 0 * * 0 0
retli_47 4 * 0 0 * * 0 0
retli_48 4 * 0 0 * * 0 0
retli_49 4 * 0 0 * * 0 0
retli_50 4 * 0 0 * * 0 0
To determine whether a read was mapped correctly, the mapping information like mapped position of the read in the target reference as well as the true read position must be known. The true read information was generated along with the read FASTQ files and was stored in .coord
files.
The script add_mapped_organisms
reads the mapping information stored in the SAM file (converted to a table) and the true read information and merges the information into one table.
Thereby, all information about one read is stored in the same line.
In this example, the organism volpertinger
provides the endogenous reads as all reads were mapped only to the volpertinger
genome. The retli
reads are therefore exogenous reads, as these are not supposed to map to that reference genome.
To assign the correct organism names to the reads, the script must be provided with
The call is shown below. The script write_later
at the end of the pipe is to prevent merge
from overwriting its own input file too soon, as the output is meant to replace the input file.
scripts/add_mapped_organisms \
--endogenous volpertinger \
input/genome/volpertinger.fasta.fai \
data/2/volpertinger.coord \
--exogenous retli \
input/retli/retli.fasta.fai \
data/2/retli.coord \
data/5/1.tab \
| scripts/write_later data/5/1.tab
head data/5/1.tab | column -t
qname mapped_flag mapped_rname mapped_pos mapped_mapq mapped_cigar mapped_rnext mapped_pnext mapped_tlen mapped_organism true_record true_start true_end true_organism
volpertinger_1 4 * 0 0 * * 0 0 * B1 2143 2171 volpertinger
volpertinger_2 0 MT 3402 37 20M * 0 0 volpertinger MT 3402 3421 volpertinger
volpertinger_3 4 * 0 0 * * 0 0 * A3 1413 1438 volpertinger
volpertinger_4 0 A3 5689 37 21M * 0 0 volpertinger A3 5689 5709 volpertinger
volpertinger_5 0 MT 3280 37 25M * 0 0 volpertinger MT 3280 3304 volpertinger
volpertinger_6 0 A3 4936 25 27M * 0 0 volpertinger A3 4936 4962 volpertinger
volpertinger_7 0 A1 1320 25 22M * 0 0 volpertinger A1 1320 1341 volpertinger
volpertinger_8 4 * 0 0 * * 0 0 * A3 4480 4506 volpertinger
volpertinger_9 0 X 2381 25 20M * 0 0 volpertinger X 2381 2400 volpertinger
Now all the information is present to determine whether a read has been mapped correctly. The last step is writing in a new column whether a read was mapped correctly. This can be archieved using any means you can imagine, for this example we will use R. The pocketR
tool is a thin wrapper which handles reading and writing of data for us. The input data will be available as a data.frame
called input
. The R calls must return a data.frame
, which will be printed as a table.
This step is exposed to the user, because this enables great flexibility in what exactly is considered a "correct" read mapping.
The following command adds a new column to the input data which indicates whether a read was mapped correctly. In this tutorial, "correct" is defined as "the read has been mapped exactly to the position from which it originated".
scripts/pocketR '
within(input, {
correct =
mapped_pos == true_start &
mapped_rname == true_record &
mapped_organism == true_organism })
' data/5/1.tab \
| scripts/write_later data/5/1.tab
head data/5/1.tab | column -t
qname mapped_flag mapped_rname mapped_pos mapped_mapq mapped_cigar mapped_rnext mapped_pnext mapped_tlen mapped_organism true_record true_start true_end true_organism correct
volpertinger_1 4 * 0 0 * * 0 0 * B1 2143 2171 volpertinger FALSE
volpertinger_2 0 MT 3402 37 20M * 0 0 volpertinger MT 3402 3421 volpertinger TRUE
volpertinger_3 4 * 0 0 * * 0 0 * A3 1413 1438 volpertinger FALSE
volpertinger_4 0 A3 5689 37 21M * 0 0 volpertinger A3 5689 5709 volpertinger TRUE
volpertinger_5 0 MT 3280 37 25M * 0 0 volpertinger MT 3280 3304 volpertinger TRUE
volpertinger_6 0 A3 4936 25 27M * 0 0 volpertinger A3 4936 4962 volpertinger TRUE
volpertinger_7 0 A1 1320 25 22M * 0 0 volpertinger A1 1320 1341 volpertinger TRUE
volpertinger_8 4 * 0 0 * * 0 0 * A3 4480 4506 volpertinger FALSE
volpertinger_9 0 X 2381 25 20M * 0 0 volpertinger X 2381 2400 volpertinger TRUE
As another example to perform the same step, below is shown how to allow for 5 bp of tolerance when determining whether the read was mapped correctly:
scripts/pocketR '
within(input, {
correct =
abs(mapped_pos - true_start) < 5 &
mapped_rname == true_record &
mapped_organism == true_organism })
' data/5/1.tab \
| scripts/write_later data/5/1.tab
As next step, the number of reads are counted which belong to certain categories. Here, the categories are:
Again, the R language can be used to express our wishes concisely: Group the reads by all combinations of:
true_organism
mapped_organism
... and count the reads belonging to each category:
the cbind
function is needed in order to rename the column containing the read count. qname
can be substituted here by any valid input column name, as its only used for counting (each column is equal in length).
scripts/pocketR '
aggregate( cbind(count=qname) ~ true_organism + mapped_organism + correct,
FUN=length, data=input)
' data/5/1.tab \
> data/5/1.agg
cat data/5/1.agg | column -t
true_organism mapped_organism correct count
retli * FALSE 49
volpertinger * FALSE 9
retli volpertinger FALSE 1
volpertinger volpertinger TRUE 16
This format may be used to plot the read fate of a single mapper run and to derive the measures sensitivity and specificity:
scripts/plot_read_fate --exogenous retli \
--format png \
true_organism mapped_organism \
correct count \
fig/fate1.png data/5/1.agg
The script produces the following plot:
The plot shows, which fractions of the input species (endogenous: volpertinger, contaminant: R. etli) are mapped and whether the mapping was correct. It can be seen that in this case, almost no contaminant reads were mapped, indicating high mapper specificity. However, only 75% of the endogenous reads were mapped.
If non-endogenous reads were included in the reads, like we did by including the R. etli reads, both measures can be calculated.
The following script needs the same kind of input as the plot-read-fate
script. Additionally, a list of organisms must be specified, whose genomes the mapper used as a reference.
If you specify multiple organisms, separate them by commas and don't include any spaces.
scripts/sensspec --c-morg mapped_organism \
--c-torg true_organism \
data/5/1.agg volpertinger \
> data/5/1.performance
column -t data/5/1.performance
map.true map.actl sensitivity nomap.true nomap.actl specificity bcr
25 16 0.64 50 49 0.98 0.81
The code needed to evaluate the data generated by the mapper might as well be included in the mapping template script introduced in the last section. If this is done, the data evaluation can be as well parallelized as the mapping process.
All scripts used here were already introduced in this section.
The files data/5/all.tab
and data/5/all.recids
must be calculated prior to execution of this script. This has been done in this section as well.
Browse the directory data/5
to see the results.
for sam in data/4/*.sam; do
# Generate output prefix p from input name: `4.sam` -> `4`
bn=$(basename $sam)
p=${bn%.sam}
# Convert SAM to table
scripts/sam2table data/4/${p}.sam > data/5/${p}.tab
# Mark correctly/incorrectly mapped reads
scripts/add_mapped_organisms \
--endogenous volpertinger \
input/genome/volpertinger.fasta.fai \
data/2/volpertinger.coord \
--exogenous retli \
input/retli/retli.fasta.fai \
data/2/retli.coord \
data/5/${p}.tab \
| scripts/write_later data/5/${p}.tab
# Determine whether each read was correctly mapped
scripts/pocketR '
within(input, {
correct =
mapped_pos == true_start &
mapped_rname == true_record &
mapped_organism == true_organism })
' data/5/${p}.tab \
| scripts/write_later data/5/${p}.tab
# Count reads per origin/target organism and mapping status
scripts/pocketR '
aggregate( cbind(count=qname) ~ true_organism + mapped_organism + correct,
FUN=length, data=input) ' \
data/5/${p}.tab \
> data/5/${p}.agg
# Plot mapping targets per origin organism
scripts/plot_read_fate --exogenous retli \
--format png \
true_organism mapped_organism \
correct count \
fig/fate${p}.png data/5/${p}.agg
# Calculate sensitivity, specificity and balanced accuracy
scripts/sensspec --c-morg mapped_organism \
--c-torg true_organism \
data/5/${p}.agg volpertinger \
> data/5/${p}.performance
echo "$sam done. -> Generated data/5/${p}.{tab,agg,pdf,performance}"
done
data/4/0.sam done. -> Generated data/5/0.{tab,agg,pdf,performance}
data/4/1.sam done. -> Generated data/5/1.{tab,agg,pdf,performance}
data/4/2.sam done. -> Generated data/5/2.{tab,agg,pdf,performance}
data/4/3.sam done. -> Generated data/5/3.{tab,agg,pdf,performance}
data/4/4.sam done. -> Generated data/5/4.{tab,agg,pdf,performance}
data/4/5.sam done. -> Generated data/5/5.{tab,agg,pdf,performance}
All generated images:
1.png:
2.png:
3.png:
4.png:
5.png: