Table of contents

A call is an order to the computer to execute a program. It is a string which contains the program name and can include variables which will be set in advance to executing the program and which influence the program.

We will in the following generate calls to start our favourite mapper (BWA in this tutorial) with different combinations of parameter values. The results will be analysed in coming sections to determine the effects of the parameter values on the mapping result.

General procedure

Here are the main steps which will be taken in subsequent subsections:

  1. To generate the calls to the mapper using different combinations of parameters, several files holding the values of the different parameters are first combined to a table holding all possible combinations of them.

  2. Subsequently, in this tutorial every line is given a unique index. Though this is not nessecary, it facilitates tasks like the naming of output files.

  3. Because every mapper is different in how it expects the parameter values, you will be expected to write a small script which feeds the parameter values into the mapping program. This will be explained further below.

How the parameter values are stored

The parameter values are saved in several files, one per parameter. In this example, the BWA parameters n and k are varied which results in two files:

For your project, you will of course create your own files which hold the parameter values you want to compare

Here is a list of the parameter files used in this tutorial:

ls input/mapping/*.par
input/mapping/k.par
input/mapping/n.par

The files can have arbitrary filenames. They are in a tabular format where the column names determine the variable names which will be used in the mapping script which is created later on.

column -t input/mapping/n.par
n
0
4
8

Generate combinations of parameter values

To generate all combinations of parameters, scripts/cross_tab will be used. It expects multiple files and outputs all possible combinations of their lines.

Generate all possible combinations of parameters, retaining 1 header line:

scripts/cross_tab --head 1 input/mapping/*.par > data/4/partab
head data/4/partab | column -t
k   n
2   0
2   4
2   8
10  0
10  4
10  8

Hint: Alternatively, instead of creating the parameter files *.par explicitly, the following command determines the parameter values and converts them to the same output as in the previous paragraph:

scripts/cross_tab --head 1 \
   <(printf "%s\n" k 2 10) \
   <(printf "%s\n" n 0 4 8) \
   > data/4/partab

Mark the parameter combinations with a unique index

As stated above, this is not nessecary, but facilitates the naming of the output text files.

Again, scripts/index_column can be used for this purpose. This script prepends a counting number to each input line. It can be used to generate index columns for text tables.

Add an index column called runidx:

scripts/index_column --colname runidx --inplace data/4/partab
head data/4/partab | column -t
runidx  k   n
0       2   0
1       2   4
2       2   8
3       10  0
4       10  4
5       10  8

The mapping script

Read now the script input/mapping/map-bwa.sh.

#!/bin/bash

## This script performs a mapping using BWA.
## It requires the variables k, n, runidx and fastq be set 
## prior to its execution. That step can be performed by 
## the tool table2calls.

# Fail if any needed variable is not set
set -ueC
# Redirect all output to a log file
exec 2>"${outdir}/${runidx}.log" >&2

bwa aln -n ${n} -k ${k} \
    "${genome}" \
    "${reads}" \
    > "${outdir}/${runidx}.sai"

bwa samse \
      "${genome}" \
      "${output}/${runidx}.sai" \
      "${reads}" \
      > "${outdir}/${runidx}.sam"

As you can see, the script uses several variables, denoted like this: ${variable}. These are not defined inside the script, but will be set from outside at the time this script will be run.

The variable names are exactly equal to the column names of the file data/4/partab and therefore to the column names of all the files which lie in input/mapping/*.par. They are the means how the different parameter values will be fed to the mapper.

You will be required to write a script like this for your mapper. All it must do is to call the mapper, to specify the output file names and to forward the parameters values on to the mapper using the ${variable} syntax.

Generate mapper calls

The script table2calls converts the table with the parameter value combinations and the filename of the mapping script to calls which can be executed on the shell.

# Convert the table into calls that can be executed in the next section
scripts/table2calls --const genome "input/genome/volpertinger" \
                    --const reads  "data/3/all.fastq" \
                    --const outdir "data/4" \
                    data/4/partab \
                    input/mapping/map-bwa.sh \
                  > data/4/calls
cat data/4/calls
reads=data/3/all.fastq outdir=data/4 genome=input/genome/volpertinger k=2 runidx=0 n=0 input/mapping/map-bwa.sh
reads=data/3/all.fastq outdir=data/4 genome=input/genome/volpertinger k=2 runidx=1 n=4 input/mapping/map-bwa.sh
reads=data/3/all.fastq outdir=data/4 genome=input/genome/volpertinger k=2 runidx=2 n=8 input/mapping/map-bwa.sh
reads=data/3/all.fastq outdir=data/4 genome=input/genome/volpertinger k=10 runidx=3 n=0 input/mapping/map-bwa.sh
reads=data/3/all.fastq outdir=data/4 genome=input/genome/volpertinger k=10 runidx=4 n=4 input/mapping/map-bwa.sh
reads=data/3/all.fastq outdir=data/4 genome=input/genome/volpertinger k=10 runidx=5 n=8 input/mapping/map-bwa.sh

Executing multiple mapping runs in parallel

For this task, many programs can be used, from simple shell background spawning using & (in bash) to job managers orchestrating a big network of worker machines. In this package, a simple program is implemented which executes a user-definable number of jobs in parallel and waits with spawning new ones until another of its already started jobs finishes.

Note that some mappers can use more than one processor core themselves. Therefore if you spawn multiple mapper processes where each mapper process utilizes multiple cores, the total number of utilized cores is the number of cores used per mapper multiplied with the number of mapper processes launched in parallel.

Invoke scripts/mcall --help to get more information about this tool.

Example: Run the previously generated mapper calls.

    # Execute calls, at 2 cores
    scripts/mcall -c data/4/calls -t 2 \
                  --status
    # Standard error was piped to log files,
    # Standard output was piped to sam files, as specified in the
    # `tmpl` file.
    head data/4/0.log
    head -n15 data/4/0.sam
input/mapping/map-bwa.sh: line 13: bwa: command not found