CovaRNA is a software tool for detecting long-range nucleotide covariation in genomic alignments.

 

1 INSTALLATION

 

1.1 INSTALLATION (single-threaded version)

 

1.1.1

Unpack the source file:

tar xvfz covarna_v1.16.0.tar.gz

cd covarna_v1.16.0

This directory is referred to below as COVARNA_HOME.

 

1.1.2

Compile sources:

cd $COVARNA_HOME/src

 

Compile single-threaded version with command:

 

make

 

This compiles the source code and creates the covarna and covarnap binaries in the subdirectory src/covarna. You can copy these binaries to a location of your choice, for example to the directory $COVARNA_HOME/bin
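The copy step can be sketched as a small shell helper. The function name install_covarna_binaries is hypothetical and not part of the distribution; it assumes the binaries were built in src/covarna as described above and copies only those that actually exist.

```shell
# Hypothetical helper (not shipped with CovaRNA): copy the built binaries
# from <home>/src/covarna to <home>/bin, skipping any that were not built.
install_covarna_binaries() {
    covarna_home="$1"
    mkdir -p "$covarna_home/bin" || return 1
    for binary in covarna covarnap; do
        if [ -x "$covarna_home/src/covarna/$binary" ]; then
            cp "$covarna_home/src/covarna/$binary" "$covarna_home/bin/"
        fi
    done
}

# Usage, assuming COVARNA_HOME is already set to the unpacked directory:
#   install_covarna_binaries "$COVARNA_HOME"
#   export PATH="$COVARNA_HOME/bin:$PATH"
```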

 

If you are interested in the multi-threaded version, follow the instructions given below; otherwise skip to the USAGE section.

 

1.2 INSTALLATION (optional, multi-threaded version)

 

1.2.1

Unpack the source file:

tar xvfz covarna_v1.16.0.tar.gz

cd covarna_v1.16.0

This directory is referred to below as COVARNA_HOME.

1.2.2 Installation of the Intel Threading Building Blocks library (optional)

Install the Intel Threading Building Blocks (TBB) library: download the appropriate version from http://threadingbuildingblocks.org/ and compile it according to the Intel instructions.

Modify the .cshrc or .bashrc file of your Unix system in order to do the following:

Set the environment variable TBB_HOME to the TBB installation directory.

Set the environment variable TBB_LIB to the directory that contains the compiled shared object libraries. To find the right directory, use the command ls -R $TBB_HOME/lib . In our case this is currently accomplished by the command:

setenv TBB_LIB $TBB_HOME/lib/intel64/cc4.1.0_libc2.4_kernel2.6.16.21 # add to file .cshrc in your home directory; the precise name of the TBB lib directory may be different in your case

 

Set the environment variable LD_LIBRARY_PATH to include TBB_LIB.

This is best accomplished by adding commands to your .cshrc or .bashrc file, depending on whether your default shell is C shell or Bash:

# for C-Shell:

setenv TBB_LIB $TBB_HOME/lib/intel64/cc4.1.0_libc2.4_kernel2.6.16.21 # add to file .cshrc in your home directory; the precise name of the TBB lib directory may be different in your case

setenv LD_LIBRARY_PATH $TBB_LIB

 

or for BASH:

export TBB_LIB=$TBB_HOME/lib/intel64/cc4.1.0_libc2.4_kernel2.6.16.21 # add to file .bashrc in your home directory

export LD_LIBRARY_PATH=$TBB_LIB
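For Bash users, locating the library directory and extending LD_LIBRARY_PATH can be sketched as follows. The helper name find_tbb_lib is hypothetical, and the sketch assumes the TBB shared library is named libtbb.so (possibly with a version suffix). If several architecture directories exist below $TBB_HOME/lib, the first match may not be the right one, so check the result against the output of ls -R $TBB_HOME/lib.

```shell
# Hypothetical helper: print the directory below <tbb_home>/lib that
# contains a libtbb.so* file; fail if none is found.
find_tbb_lib() {
    first_lib="$(find "$1/lib" -name 'libtbb.so*' 2>/dev/null | head -n 1)"
    [ -n "$first_lib" ] || return 1
    dirname "$first_lib"
}

# Usage, assuming TBB_HOME is already set:
if [ -n "${TBB_HOME:-}" ] && TBB_LIB="$(find_tbb_lib "$TBB_HOME")"; then
    export TBB_LIB
    # Prepend rather than overwrite, so existing entries are kept.
    export LD_LIBRARY_PATH="$TBB_LIB${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"
fi
```

Prepending keeps any library directories that were already listed in LD_LIBRARY_PATH.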

 

1.2.3

Compile sources:

 

If you installed the Intel Threading Building Blocks library as indicated above, the multi-threaded version can be compiled with the commands:

make subclean # object files created by "make" and "make concurrent=1" cannot be safely linked together, so first remove the object files and binaries created so far

make concurrent=1

 

This creates the covarna and covarnap binaries in the subdirectory src/covarna. You can copy these binaries to a location of your choice, for example to the directory $COVARNA_HOME/bin

 

2 USAGE

 

2.1 CovaRNA

The input to covarna consists of genomic alignments in University of California, Santa Cruz (UCSC) MAF format (see http://genome.ucsc.edu/goldenPath/help/hgTracksHelp.html for more information).

Genomic region pairs corresponding to detected covariations are written to an output file; the file format is a variant of the UCSC BED format. The default output file name is covarna_clusters.bed.

 

 

# search within one MAF file:

covarna MAFFILE

 

 

To run examples, change the current working directory to $COVARNA_HOME/examples

 

To run the example, issue the command:

../src/covarna/covarna dm3_chrM.maf

 

This search creates a result file covarna_clusters.bed

 

Each row of this file contains a pair of genomic regions with detected covariation, annotated with additional information.
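Since each row describes one pair of regions, a quick sanity check on a result file is to count its data rows. The helper below is a hypothetical sketch; it assumes that lines starting with "#", "track" or "browser" (as in standard UCSC BED files) are not data rows, which has not been verified for the covarna output variant.

```shell
# Hypothetical helper: count data rows in a covarna result file.
# Assumption: empty lines and lines starting with "#", "track" or
# "browser" are not region-pair rows.
count_cluster_rows() {
    awk '!/^(#|track|browser)/ && NF > 0 { n++ } END { print n + 0 }' "$1"
}

# Usage (run after the example search above):
#   count_cluster_rows covarna_clusters.bed
```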

 

 

Genomic positions are reported with respect to a reference genome taken from the genomic alignment. It is recommended to specify this genome explicitly (otherwise the first genome from the first MAF block is used as the reference genome). This is accomplished using the option -a <GENOME_BUILD>, for example:

../src/covarna/covarna dm3_chrM.maf -a dm3

 

 

# multithreaded search (using 8 cores):

covarnap  MAFFILE -t 8

# Example (run from examples directory) with output to file results.bed:

../src/covarna/covarnap  dm3_chrM.maf -t 8 -o results.bed
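The choice between the two binaries can be sketched with a small helper that only prints the command it would run. The name build_covarna_cmd is hypothetical; the sketch uses only the -a and -t options described in this document and selects covarnap whenever more than one thread is requested.

```shell
# Hypothetical helper: print a covarna or covarnap command line.
# Arguments: MAF file, reference assembly, number of threads.
build_covarna_cmd() {
    maf="$1"
    assembly="$2"
    threads="$3"
    if [ "$threads" -gt 1 ]; then
        # Multi-threaded binary with an explicit thread count.
        echo "covarnap $maf -a $assembly -t $threads"
    else
        # Single-threaded binary.
        echo "covarna $maf -a $assembly"
    fi
}

build_covarna_cmd dm3_chrM.maf dm3 8
```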

 

# search for long-range covariation between two MAF files:

covarna MAFFILE1 -e MAFFILE2

# Example (run from examples directory):

../src/covarna/covarna dm3_chrM.maf -e dm3_chr4.maf

 

# search between two MAF files (4 cores):

covarnap MAFFILE1 -e MAFFILE2 -t 4

 

# Example using 4 cores (run from examples directory):

../src/covarna/covarnap dm3_chrM.maf -e dm3_chr4.maf -t 4

 

 

 

2.2 Shuffling of alignments with shufflemaf

 

usage:

shufflemaf h|v MAFFILE ASSEMBLY

 

"vertical" (column-wise) shuffling with dm3 as reference assembly:

../bin/shufflemaf v dm3_chrM.maf dm3

 

"horizontal" (row-wise) shuffling with dm3 as reference assembly:

../src/covarna/shufflemaf h dm3_chrM.maf dm3

 

Version history:

 

0.8.2: fixed issues in Makefiles so that compilation works without first issuing the command "make subinstall"

 

1.16.0: improved documentation and improved help output of covarna.

 

3 Reference of command line options

 

Option "a": specifying the genome assembly

The genome assembly name corresponding to the MAF input files can be specified with the option -a, for example -a dm3 or -a hg18. Although this is optional (by default the assembly name of the first sequence in the first MAF block is used), specifying the reference genome is highly recommended.

-a ASSEMBLY    : specify name of reference assembly.

 

Option "ambiguity": allowing GU matches

The software can allow GU matches in compensatory alignment column pairs. Be aware that allowing GU matches increases run time and memory usage. The default is to not allow GU matches.

--ambiguity 1   : GU matches are not considered (default)

--ambiguity 2   : reverse complement GU matches are being considered

--ambiguity 3   : GU-match mode for forward matches; use this if options -r 0 -m 0 (search for synchronized mutations in forward matches) are specified (usage of this case should be simplified in the future).

 

Option "annotate": annotating specified genomic regions

With the option --annotate it is possible to annotate a set of pre-specified genomic regions. Usage:

--annotate FILENAME : count number of covarying alignment column pairs in existing regions (given in BED file format)

Use this option in conjunction with option --annotate-out.

 

Option "annotate-out": output of covariation in existing alignment column regions

 

--annotate-out FILENAME : output of count of number of covarying alignment column pairs in existing regions (given in BED file format)

 

Option "anti": checking "wrong" diagonals not to be complementary

The option --anti activates an optional filter that discards alignment column pairs with covariation if their adjacent alignment column pairs on the "wrong" diagonal also exhibit covariation. This option is useful for filtering out regions in which covariation may be an artifact arising from low-complexity regions. Usage:

--anti 0   : filter for checking "wrong" diagonals not to be complementary is not active (default).

--anti 1   : filter for checking "wrong" diagonals not to be complementary has medium strictness (one "wrong" diagonal is allowed to exhibit covariation).

--anti 2   : most strict setting of filter for checking "wrong" diagonals not to be complementary (neither of the two "wrong" diagonals is allowed to exhibit covariation).

 

Option "b": restricting search to specified regions

 

The search in genomic alignments can be restricted to specified regions. This is accomplished by specifying a name of a file that contains genomic regions in UCSC BED format.

 

Usage:

-b FILENAME
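A region file for this option can be sketched as follows. The file name, the coordinates and the chromosome name chrM are illustrative assumptions; in particular, whether covarna expects the chromosome name with or without an assembly prefix has not been verified here. The covarna call is only printed, not executed.

```shell
regions_file="example_regions.bed"   # hypothetical file name

# Minimal BED rows: chromosome, 0-based start, end (exclusive), tab-separated.
# The coordinates below are made up for illustration.
printf 'chrM\t1000\t2000\n'  > "$regions_file"
printf 'chrM\t5000\t5600\n' >> "$regions_file"

# Print the matching call (run from the examples directory).
echo "../src/covarna/covarna dm3_chrM.maf -a dm3 -b $regions_file"
```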

 

Option "basepairs": requiring different number of compensatory nucleotides in alignment column pairs

With this option it is possible to consider only covarying alignment column pairs that consist of at least N different types of nucleotide pairs (N being either 2, 3 or 4).

--basepairs N

 

Option "b1": restricting search to specified regions of the first alignment file

When two alignment files are searched (option -e), the search can still be restricted to specified regions; for a search within a single alignment file, use option -b instead. If covarna is called as "covarna MAFFILE1 -e MAFFILE2 --b1 FILTERFILE1", the program restricts the search to regions of MAFFILE1 that intersect with regions specified by FILTERFILE1. FILTERFILE1 must contain genomic regions in UCSC Genome Browser BED format.

 

Option "b2": restricting search to specified regions of the second alignment file

When two alignment files are searched (option -e), the search can still be restricted to specified regions; for a search within a single alignment file, use option -b instead. If covarna is called as "covarna MAFFILE1 -e MAFFILE2 --b2 FILTERFILE2", the program restricts the search to regions of MAFFILE2 that intersect with regions specified by FILTERFILE2. FILTERFILE2 must contain genomic regions in UCSC Genome Browser BED format.

 

Option "block-min": skip leading alignment blocks in input file

The option --block-min makes it possible to read only part of the input genomic alignments. By specifying a number N, the first N-1 alignment blocks are ignored. The default number is 1.

 

Usage:

--block-min NUMBER

 

Option "block-max": skip trailing alignment blocks in input file

The option --block-max makes it possible to read only part of the input genomic alignments. By specifying a number N, only alignment blocks 1 to N are read and used as input. The number 0 (default) indicates that the alignment blocks are read to the end of the input file.

 

Usage:

--block-max NUMBER : last MAF block to read. Default: 0 (read all MAF blocks)
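Combining --block-min and --block-max selects a window of blocks. The hypothetical helper below prints the option pairs needed to cover a file of a given number of blocks in fixed-size windows. The block count must be supplied by the user, since this document does not describe a way to query it. Note that each run would presumably only see covariation between columns inside its own window.

```shell
# Hypothetical helper: print --block-min/--block-max pairs that cover
# blocks 1..total_blocks in windows of chunk_size blocks.
block_ranges() {
    total_blocks="$1"
    chunk_size="$2"
    start=1
    while [ "$start" -le "$total_blocks" ]; do
        end=$((start + chunk_size - 1))
        if [ "$end" -gt "$total_blocks" ]; then
            end="$total_blocks"
        fi
        echo "--block-min $start --block-max $end"
        start=$((end + 1))
    done
}

# Example: 250 blocks in windows of 100 blocks.
block_ranges 250 100
```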

 

Option "block-min2": skip leading alignment blocks in second input file

The option --block-min2 makes it possible to read only part of the second input genomic alignment. By specifying a number N, the first N-1 alignment blocks are ignored. The default number is 1. Use this option in conjunction with option -e, for example: covarna MAFFILE -e MAFFILE2 --block-min2 100

 

Usage:

--block-min2 NUMBER

 

Option "block-max2": skip trailing alignment blocks in second input file

The option --block-max2 makes it possible to read only part of the second input genomic alignment. By specifying a number N, only alignment blocks 1 to N of the second MAF file are read. The number 0 (default) indicates that all alignment blocks are read. Use this option in conjunction with option -e, for example: covarna MAFFILE -e MAFFILE2 --block-max2 100

 

Usage:

--block-max2 NUMBER : last MAF block to read from second MAF file. Default: 0 (read all MAF blocks)

 

Option "c": collapse alignment blocks with respect to a genome

The option -c removes all alignment columns in the input alignment blocks that correspond to a gap in the specified genome. By default, this operation is performed with respect to the reference genome (see option -a). An example would be -c hg18.

 

Usage:

-c ASSEMBLY   

 

Option "cluster": specifying the maximum distance between covarying column pairs

The option --cluster specifies the cluster cutoff distance (using single-linkage clustering) for grouping covarying alignment column pairs into the same covariation cluster.

 

Usage:  --cluster DISTANCE

 

Example: --cluster 40

 

Option "cluster-min": specifying the minimum number of covarying column pairs per covariation cluster

The option --cluster-min specifies the minimum number of covarying alignment column pairs that a covariation cluster must contain. In other words, covariation clusters that contain fewer than this number of covarying alignment column pairs are filtered out.

 

Usage: --cluster-min N  (with N being an integer number that corresponds to the minimum number of column pairs per covariation cluster with compensatory base changes).

 

Example: --cluster-min 4

 

Option "d": specifying the minimum distance between covarying alignment columns

The option -d specifies a minimum distance between covarying alignment columns. This corresponds to an assumed minimum loop length in a hairpin loop. The default value is 3.

 

Usage: -d DISTANCE   

Example: -d 5

 

Option "e": specifying a second alignment file

The search for long-range covariation can be performed between two different genomic alignment files. If specified, only cross-correlations between those two files are reported.

 

Usage: -e FILENAME : Filename of second alignment file (in UCSC MAF format) to be appended.

 

Option "expand-max": filtering out long stretches of perfect reverse-complementarity

 

If a covariation cluster corresponds to two sequence regions with more than the specified number of reverse-complementary nucleotides, it is filtered out.

 

Usage:

--expand-max VALUE : Maximum allowed length of consecutive sequence covariation. Default: 30

 

 

Option "f": specifying a reduced fraction of hash tables that are being generated

By default, hash tables corresponding to all non-conserved possible nucleotide triplets and genome assemblies are generated. For alignments consisting of many genome assemblies, this can lead to the generation of thousands of hash tables. Using the -f option, it is possible to reduce the number of generated hash tables. Note that if not all hash tables are generated, memory is saved, but more alignment column candidates pass the hash-based filtering and have to be checked iteratively, possibly leading to higher run times. In other words, this option corresponds to a trade-off between computer memory and run time.

 

Usage:

-f FRACTION    : Determines fraction of possible hash tables that are actually generated. Values: (0,1]. Higher values mean faster search and more memory consumption. Default: 1.0

 

Option "i": specifying the frequency of progress output

With the option -i it is possible to control the frequency of progress output. The default is 100000; in other words, progress output is generated every 100,000 alignment columns.

Usage:

-i INTERVAL    : User output in search step intervals of this size.

 

Option "ignore": specifying genome assemblies to be ignored

 

With this option, it is possible to specify genome assemblies of the input MAF blocks that should be ignored.

 

Usage:

--ignore ASSEMBLY1,ASSEMBLY2,...  : list of genome assemblies that should be ignored during reading of alignment.

 

Option "m": specifying search for reverse-complementary versus synchronized mutations

This option specifies whether a "match" corresponds to a complementary nucleotide type (default) or an identical nucleotide type. This option goes together with option -r: the option combination -r 1 -m 1 (default) corresponds to a search for reverse-complementary alignment columns; -r 0 -m 0 corresponds to a search for synchronized mutations.

 

Usage:

-m 1|0         : Complement mode. If set to 1 (default), search for complementary columns, not matching ones.

 

 

Option "noself": sanity check of detected stems

When searching for trans-correlations between two input alignments that happen to be identical (not recommended) and searching for synchronized mutations, it is possible to detect the trivial and nonsensical case of self-identity. This option activates a filter for that case; its use, however, is usually not necessary.

 

Usage:

--noself       : filter out stems that have equal start and stop position

 

Option "o": specifying the output file

 

The file name of the output file in BED format is specified with this option.

 

Usage:

-o OUTPUTFILE  : output of cluster intervals in UCSC BED file format.

 

Option "pad": specifying flanking regions to input filter regions

With the options -b, --b1 and --b2, region files in UCSC BED format can be read that act as filters for the input sequences. With the option --pad, these regions can be extended ('padded') by flanking regions.

Usage:

--pad 1|2|...  : Adds flanking regions to the filter intervals read from the BED file. Typical value: 200 for adding 200nt on both sides of each interval.

 

Option "prune": specifying the maximum number of sequences per alignment block

 

With this option it is possible, for memory or efficiency reasons, to limit the maximum number of sequences to be read per alignment block.

 

Usage:

--prune         : If set, read at most this many sequences per MAF block

 

Option "r": whether to search for reverse or forward complementarity

This option is meaningful in combination with option -m.

By default, covarna searches for reverse-complementary alignment column pairs. This corresponds to options -m 1 -r 1. Searching for synchronized mutations (or reverse-complementarity with respect to the opposing strand directionality) can be accomplished by specifying options -m 0 -r 0.

 

Usage:

-r 1|0         : Reverse mode. If set to 1 (default), looking for stretches of reverse (complementary) columns.
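The two documented combinations of -r and -m can be captured in a small lookup. The helper name mode_flags and the mode names passed to it are hypothetical; only the flag values come from the description above.

```shell
# Hypothetical helper: map a search mode name to the documented
# combination of -r and -m options.
mode_flags() {
    case "$1" in
        reverse-complement) echo "-r 1 -m 1" ;;   # default search mode
        synchronized)       echo "-r 0 -m 0" ;;   # synchronized mutations
        *)                  return 1 ;;           # unknown mode name
    esac
}

# Print the call for a synchronized-mutation search.
echo "covarna MAFFILE $(mode_flags synchronized)"
```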

 

Option "require": specify required genome assemblies

This option specifies genome assemblies that each MAF alignment block must contain in order not to be filtered out.

Usage:

--require ASSEMBLY1,ASSEMBLY2,...  : list of genome assemblies that should be required during reading of alignment. All other genome assemblies are ignored.

 

Option "s": specify the minimum number of sequences per MAF block

This option specifies the minimum number of sequences that each used MAF block should contain. MAF blocks with a smaller number of sequences (after applying the --ignore option) are filtered out while reading the genome sequence data.

 

Usage:

-s SEQMIN      : minimum number of sequences for reading of MAF file. Default: 10

 

 

Option "stem": specifying a minimum stem length

With the option --stem it is possible to define a minimum length of reverse-complementary sequence that a covarying alignment column pair should be contained in. The default value is 1; in other words, there is by default no filtering by minimum stem length.

 

Usage:

--stem 1|2|3|4  : minimum stem length.

 

 

Option "t": specify the number of threads

The binary covarnap is a multi-threaded application based on the Intel Threading Building Blocks framework. As described in the installation notes, it requires that the variable LD_LIBRARY_PATH contains a path to the TBB installation directory that was used for compiling the code. The single-threaded version of the program is found in the binary covarna.

The option -t specifies a fixed number of compute cores that are used for executing the algorithm. If the option is not specified, covarnap detects all available cores and starts that many threads. This may or may not be the desired behavior. If you are, for example, running the algorithm on a heterogeneous cluster and the program is allowed to utilize all cores of a node, automatically utilizing all cores of the host computer can be convenient. If, in contrast, the program is run on a "mainframe" type of host and should not start threads on all cores, the number of threads should be specified with the option -t.

Usage:

-t THREADS     : Maximum number of parallel threads (executable covarnap). Value must be greater than zero.
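On a shared host, one way to avoid occupying every core is to cap the thread count explicitly. The sketch below uses a hypothetical helper, capped_threads, and assumes that getconf _NPROCESSORS_ONLN reports the number of online cores on your system; if that call fails, it falls back to a single thread.

```shell
# Hypothetical helper: print the smaller of the detected core count
# and the given upper limit.
capped_threads() {
    max_threads="$1"
    # Assumption: getconf supports _NPROCESSORS_ONLN on this system.
    cores="$(getconf _NPROCESSORS_ONLN 2>/dev/null || echo 1)"
    if [ "$cores" -gt "$max_threads" ]; then
        echo "$max_threads"
    else
        echo "$cores"
    fi
}

# Print a covarnap call that uses at most 4 threads.
echo "covarnap MAFFILE -t $(capped_threads 4)"
```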

 

 

Option "v": specify verbosity of output

 

Usage:

-v 0|1|2|3|4   : set verbose level (0:silent, 1: default)

 

 

 

Advanced options (for developer purposes only)

 

--cluster-filter-on  : if set, single-linkage clustering is performed during the search. The use of this option is discouraged.

 

 

--cluster-filter-off  : if set, no single-linkage clustering is performed during the search.

 

 

--same-chrom    : force to assume that for two given MAF files, the reference genome is from the same chromosomes.

 

 

Deprecated options

 

The main reason for deprecated options is that with the current version, all P-value computations should be handled by the separate CovStat program.

 

-density       : number of expected stems per area (per sites squared). This option is deprecated; densities are currently not being used.

 

--dif filename : density input filename.

 

--dof filename : density output filename.

 

Option "emax": specifying a maximum E-value of listed interaction clusters

--emax VALUE   : maximum E-value of listed interaction clusters. Default: -1. Computation of statistical significance is now performed by the CovStat program.

 

Deprecated: Option "multi" for specifying multiple-testing correction modes

 

 

This option is deprecated, because all P-value computations should currently be handled by CovStat and not CovaRNA.

--multi 0|1|2|3  : multiple-testing correction. 0: no clustering; 1: total area; 2: total area / eff. cluster area (default); 3: total area / cluster area

 

--opposite     : Combination mode: corresponds to --strand 1 --strand2 -1 --noself

 

-p             : If set, compute p-values and E-values of found stems.

 

--search-max NUMBER : ignore columns that would lead to clusters with more than this number of columns.

 

The option "shuffle"

Alignment shuffling is an important approach for generating controls. Alignment shuffling should, however, be performed with the additional binary called "shufflemaf", as described in section 2.2.

--shuffle 0|1|2 : shuffling of MAF alignment blocks. 0: no shuffling; 1: dinucleotide-preserving shuffling.

 

 

--stem-p 0..1 : Limits clusters to have smaller P value for stem-bias

 

--strand 1|-1|0 : strand mode: MAF blocks are converted to plus strand (1) or minus strand (-1) of reference genome. 0: no conversion.

 

 

--strand2 1|-1|0 : strand mode of second MAF alignments: MAF blocks are converted to plus strand (1) or minus strand (-1) of reference genome. 0: no conversion.

 

 

--taboo ASSEMBLY1,ASSEMBLY2,...  : Deprecated; use option --ignore instead. list of genome assemblies that should be ignored during reading of alignment.