CovaRNA is a software tool for detecting long-range nucleotide covariation in genomic alignments.
1 INSTALLATION
1.1 INSTALLATION (single-threaded version)
1.1.1
Unpack the source file:
tar xvfz covarna_v1.16.0.tar.gz
cd covarna_v1.16.0
This directory is referred to below as COVARNA_HOME.
1.1.2
Compile sources:
cd $COVARNA_HOME/src
Compile the single-threaded version with the command:
make
This compiles the source code and creates the covarna and covarnap binaries in the subdirectory src/covarna. You can copy these binaries to a location of your choice, for example to the directory $COVARNA_HOME/bin.
If you are interested in the multi-threaded version, follow the instructions given below; otherwise skip to the USAGE section.
1.2 INSTALLATION (optional, multi-threaded version)
1.2.1
Unpack the source file:
tar xvfz covarna_v1.16.0.tar.gz
cd covarna_v1.16.0
This directory is referred to below as COVARNA_HOME.
1.2.2 Installation of the Intel Threading Building Blocks library (optional)
Install the Intel Threading Building Blocks (TBB) library: download the appropriate version from http://threadingbuildingblocks.org/ and compile it according to Intel's instructions.
Modify the .cshrc or .bashrc file of your Unix system in order to
set the environment variable TBB_HOME to the TBB installation directory
set the environment variable TBB_LIB to the directory that contains the compiled shared object libraries. To find the right directory, use the command ls -R $TBB_HOME/lib . In our case this is currently accomplished by the command:
setenv TBB_LIB $TBB_HOME/lib/intel64/cc4.1.0_libc2.4_kernel2.6.16.21 # add to the file .cshrc in your home directory; the precise name of the TBB lib directory may differ in your case
set the environment variable LD_LIBRARY_PATH to include TBB_LIB
This is best accomplished by adding commands to your .cshrc or .bashrc file, depending on whether your default shell is the C shell or Bash:
# for C-Shell:
setenv TBB_LIB $TBB_HOME/lib/intel64/cc4.1.0_libc2.4_kernel2.6.16.21 # add to the file .cshrc in your home directory; the precise name of the TBB lib directory may differ in your case
setenv LD_LIBRARY_PATH $TBB_LIB
# or for Bash:
export TBB_LIB=$TBB_HOME/lib/intel64/cc4.1.0_libc2.4_kernel2.6.16.21 # add to file .bashrc in your home directory
export LD_LIBRARY_PATH=$TBB_LIB
1.2.3
Compile sources:
If you installed the Intel Threading Building Blocks library as indicated above, the multi-threaded version can be compiled with the commands:
make subclean # object files created by "make" and by "make concurrent=1" cannot be safely linked together, so first remove all object files and binaries created so far
make concurrent=1
This creates the covarna and covarnap binaries in the subdirectory src/covarna. You can copy these binaries to a location of your choice, for example to the directory $COVARNA_HOME/bin.
2 USAGE
2.1 CovaRNA
The input to covarna consists of genomic alignments in University of California, Santa Cruz (UCSC) MAF format (see http://genome.ucsc.edu/goldenPath/help/hgTracksHelp.html for more information).
Genomic region pairs corresponding to detected covariations are written to an output file; the file format is a variant of the UCSC BED format. The default output file name is covarna_clusters.bed
# search within one MAF file:
covarna MAFFILE
To run examples, change the current working directory to $COVARNA_HOME/examples
To run the example, issue the command:
../src/covarna/covarna dm3_chrM.maf
This search creates the result file covarna_clusters.bed.
Each row of this file contains a pair of genomic regions with detected covariation, annotated with additional information.
Genomic positions are reported with respect to a reference genome taken from the genomic alignment. It is recommended to specify this genome explicitly; otherwise the first genome of the first MAF block is used as the reference genome. This is accomplished using the option -a <GENOME_BUILD>, for example:
../src/covarna/covarna dm3_chrM.maf -a dm3
# multithreaded search (using 8 cores):
covarnap MAFFILE -t 8
# Example (run from examples directory) with output to file results.bed:
../src/covarna/covarnap dm3_chrM.maf -t 8 -o results.bed
# search for long-range covariation between two MAF files:
covarna MAFFILE1 -e MAFFILE2
# Example (run from examples directory):
../src/covarna/covarna dm3_chrM.maf -e dm3_chr4.maf
# search between two MAF files (4 cores):
covarnap MAFFILE1 -e MAFFILE2 -t 4
# Example using 4 cores (run from examples directory):
../src/covarna/covarnap dm3_chrM.maf -e dm3_chr4.maf -t 4
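When the searches above are driven from a script, the command lines can be assembled programmatically. The following Python sketch is not part of the CovaRNA distribution: the helper name build_covarna_command, its parameters and the default binary directory are hypothetical, and it emits only the options -e, -a, -t and -o described in this document. It assumes that the covarnap binary is wanted whenever a thread count is given.

```python
def build_covarna_command(maf_file, second_maf=None, assembly=None,
                          threads=None, output=None,
                          bin_dir="../src/covarna"):
    """Return an argument list for covarna, or covarnap if threads is set.

    Hypothetical helper: only options documented in this README are emitted.
    """
    # Assumption: use the multi-threaded binary only when -t is requested.
    binary = "covarnap" if threads is not None else "covarna"
    cmd = [f"{bin_dir}/{binary}", maf_file]
    if second_maf is not None:
        cmd += ["-e", second_maf]      # second alignment file
    if assembly is not None:
        cmd += ["-a", assembly]        # reference genome assembly
    if threads is not None:
        if threads < 1:
            raise ValueError("threads must be greater than zero")
        cmd += ["-t", str(threads)]
    if output is not None:
        cmd += ["-o", output]          # output BED file
    return cmd


if __name__ == "__main__":
    cmd = build_covarna_command("dm3_chrM.maf", assembly="dm3",
                                threads=8, output="results.bed")
    print(" ".join(cmd))
    # To run the search (requires the compiled binaries):
    #   import subprocess; subprocess.run(cmd, check=True)
```

Returning a list rather than a single string keeps file names with spaces intact when the list is passed to subprocess.run.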
2.2 Shuffling of alignments with shufflemaf
usage:
shufflemaf h|v MAFFILE ASSEMBLY
"vertical" (column-wise) shuffling with dm3 as reference assembly:
../bin/shufflemaf v dm3_chrM.maf dm3
"horizontal" (row-wise) shuffling with dm3 as reference assembly:
../src/covarna/shufflemaf h dm3_chrM.maf dm3
Version history:
0.8.2: fixed issues in Makefiles so that compilation works without first issuing the command "make subinstall"
1.16.0: improved documentation and improved help output of covarna.
The genome assembly name corresponding to the MAF input files can be specified with the option -a. An example would be -a dm3 or -a hg18. Even though this is optional (by default the assembly name of the first sequence in the first MAF block is used), specifying the reference genome is highly recommended.
-a ASSEMBLY : specify name of reference assembly.
The software can allow GU matches in compensatory alignment column pairs. Be aware that allowing GU matches leads to increased run time and memory usage. The default is to not allow GU matches.
--ambiguity 1 : GU matches are not considered (default)
--ambiguity 2 : reverse-complement GU matches are considered
--ambiguity 3 : GU-match mode for forward matches; use this if the options -r 1 -m 1 (search for synchronized mutations in forward matches) are specified (usage of this case should be simplified in the future).
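What allowing GU matches means for a single pair of alignment columns can be illustrated with a short Python sketch. This is not CovaRNA's implementation: the function names are hypothetical, and the handling of gaps and the use of T in place of U (MAF files contain DNA) are assumptions made for the illustration.

```python
# Watson-Crick pairs; T stands in for U because MAF files hold DNA.
WATSON_CRICK = {("A", "T"), ("T", "A"), ("G", "C"), ("C", "G")}
# G-U wobble pairs, written with T for the same reason.
WOBBLE = {("G", "T"), ("T", "G")}


def is_complementary(x, y, allow_gu=False):
    """Return True if nucleotides x and y can pair.

    Illustrative only: gaps and ambiguity codes never pair here.
    """
    pair = (x.upper().replace("U", "T"), y.upper().replace("U", "T"))
    if pair in WATSON_CRICK:
        return True
    return allow_gu and pair in WOBBLE


def columns_complementary(col1, col2, allow_gu=False):
    """Check two alignment columns row by row (one character per genome)."""
    return all(is_complementary(a, b, allow_gu) for a, b in zip(col1, col2))


if __name__ == "__main__":
    # The second row (G paired with T) is only accepted with GU matches on.
    print(columns_complementary("GGC", "CTG"))
    print(columns_complementary("GGC", "CTG", allow_gu=True))
```

Because the wobble pairs enlarge the set of accepted column pairs, more candidates survive each check, which is consistent with the higher run time and memory usage noted above.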
With the option "annotate" it is possible to annotate a set of pre-specified genomic regions. Usage:
--annotate FILENAME : count number of covarying alignment column pairs in existing regions (given in BED file format)
Use this option in conjunction with the option "--annotate-out".
--annotate-out FILENAME : output of count of number of covarying alignment column pairs in existing regions (given in BED file format)
The option "anti" optionally activates a filter that discards covarying alignment column pairs if their adjacent alignment column pairs on the "wrong" diagonal also exhibit covariation. This option is useful for filtering out regions in which covariation may be an artifact of low-complexity sequence. Usage:
--anti 0 : filter for checking "wrong" diagonals not to be complementary is not active (default).
--anti 1 : filter for checking "wrong" diagonals not to be complementary has medium strictness (one "wrong" diagonal is allowed to exhibit covariation).
--anti 2 : most strict setting of filter for checking "wrong" diagonals not to be complementary (neither of the two "wrong" diagonals is allowed to exhibit covariation).
The search in genomic alignments can be restricted to specified regions. This is accomplished by specifying a name of a file that contains genomic regions in UCSC BED format.
Usage:
-b FILENAME
With this option it is possible to consider only those covarying alignment column pairs that consist of at least N different types of nucleotide pairs (N being 2, 3 or 4).
--basepairs N
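The quantity that is compared with N can be sketched in Python as follows. The function names are hypothetical and the sketch is not taken from the CovaRNA sources; in particular, skipping rows that contain a gap is an assumption.

```python
def count_pair_types(col1, col2):
    """Count distinct nucleotide pairs formed row by row by two columns.

    Rows with a gap ('-') in either column are skipped (an assumption).
    """
    pairs = {(a.upper(), b.upper())
             for a, b in zip(col1, col2)
             if a != "-" and b != "-"}
    return len(pairs)


def passes_basepairs_filter(col1, col2, n):
    """Mimic the idea of --basepairs N: require at least n pair types."""
    return count_pair_types(col1, col2) >= n


if __name__ == "__main__":
    # Five genomes; the column pair shows the pair types G-C, A-T and C-G.
    col_i, col_j = "GGAAC", "CCTTG"
    print(count_pair_types(col_i, col_j))
    print(passes_basepairs_filter(col_i, col_j, 4))
```

A higher N demands more independent compensatory changes, so fewer column pairs pass the filter.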
The search can also be restricted to specified regions when two alignment files are given (if the search uses only one alignment file, use option -b). For example, if CovaRNA is called with the parameters "covarna MAFFILE1 -e MAFFILE2 --b1 FILTERFILE1", the program restricts the search to regions of MAFFILE1 that intersect with regions specified in FILTERFILE1. The regions must be given in a file in UCSC Genome Browser BED format.
Correspondingly, if CovaRNA is called with the parameters "covarna MAFFILE1 -e MAFFILE2 --b2 FILTERFILE2", the program restricts the search to regions of MAFFILE2 that intersect with regions specified in FILTERFILE2. The regions must again be given in a file in UCSC Genome Browser BED format.
The option "--block-min" makes it possible to read only part of the input genomic alignments. If a number N is specified, the first N-1 alignment blocks are ignored. The default is 1.
Usage:
--block-min NUMBER
The option "--block-max" makes it possible to read only part of the input genomic alignments. If a number N is specified, only alignment blocks 1 to N are read and used as input. The value 0 (default) indicates that the alignment blocks are read to the end of the input file.
Usage:
--block-max NUMBER : last MAF block to read. Default: 0 (read all MAF blocks)
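The block numbering used by --block-min and --block-max (1-based, with 0 meaning "to the end" for the upper limit) can be sketched in Python. The function name select_blocks is hypothetical, and treating both limits as bounds on the same inclusive range is an assumption about how the two options combine.

```python
from itertools import islice


def select_blocks(blocks, block_min=1, block_max=0):
    """Return an iterator over blocks block_min..block_max (1-based, inclusive).

    block_max == 0 means "read to the end", matching the documented default.
    """
    if block_min < 1:
        raise ValueError("block_min must be at least 1")
    stop = None if block_max == 0 else block_max
    # islice uses 0-based, end-exclusive indices.
    return islice(blocks, block_min - 1, stop)


if __name__ == "__main__":
    blocks = ["block1", "block2", "block3", "block4", "block5"]
    print(list(select_blocks(blocks, block_min=2, block_max=4)))
    print(list(select_blocks(blocks, block_min=4)))
```

Because islice consumes its input lazily, the same function works on a generator that reads MAF blocks from a large file without loading the whole alignment.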
The option "--block-min2" makes it possible to read only part of the second input genomic alignment. If a number N is specified, the first N-1 alignment blocks are ignored. The default is 1. Use this option in conjunction with option -e, for example: covarna MAFFILE -e MAFFILE2 --block-min2 100
Usage:
--block-min2 NUMBER
The option "--block-max2" makes it possible to read only part of the second input genomic alignment. If a number N is specified, only alignment blocks 1 to N of the second MAF file are read. The value 0 (default) indicates that all MAF blocks are read. Use this option in conjunction with option -e.
Usage:
--block-max2 NUMBER : last MAF block to read from second MAF file. Default: 0 (read all MAF blocks)
The option "-c" removes all alignment columns in the input alignment blocks that correspond to a gap in the specified genome. By default this operation is performed with respect to the reference genome (see option -a). An example would be "-c hg18".
Usage:
-c ASSEMBLY
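The effect of removing gap columns with respect to one genome can be shown with a small Python sketch. The function name collapse_columns and the dictionary representation of an alignment block are hypothetical choices for this illustration, not CovaRNA's internal data structures.

```python
def collapse_columns(rows, reference):
    """Drop every alignment column where the reference row has a gap.

    rows: dict mapping assembly name -> aligned sequence (equal lengths).
    """
    ref_seq = rows[reference]
    # Indices of columns in which the reference carries a nucleotide.
    keep = [i for i, ch in enumerate(ref_seq) if ch != "-"]
    return {name: "".join(seq[i] for i in keep) for name, seq in rows.items()}


if __name__ == "__main__":
    block = {"hg18": "AC-GT", "mm9": "ACTG-"}
    print(collapse_columns(block, "hg18"))
```

After this step every remaining column corresponds to exactly one position of the chosen genome, which keeps reported coordinates simple; gaps in the other genomes are left in place.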
The option "--cluster" specifies the cluster cutoff distance (using single-linkage clustering) for covarying alignment column pairs that are to be grouped into the same covariation cluster.
Usage: --cluster DISTANCE
Example: --cluster 40
The option "--cluster-min" specifies the minimum number of covarying alignment column pairs that a covariation cluster must contain. In other words, covariation clusters that contain fewer than this number of covarying alignment column pairs are filtered out.
Usage: --cluster-min N (with N being an integer number that corresponds to the minimum number of column pairs per covariation cluster with compensatory base changes).
Example: --cluster-min 4
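How a cutoff distance and a minimum cluster size interact under single-linkage clustering can be sketched in Python. This is an illustration under stated assumptions, not CovaRNA's algorithm: the function name is hypothetical, each covarying column pair is represented as a tuple (i, j) of column positions, and two pairs are linked when both coordinates differ by at most the cutoff (this document does not define the distance measure).

```python
def single_linkage_clusters(pairs, cutoff, cluster_min=1):
    """Group (i, j) column pairs by single-linkage clustering.

    Two pairs are linked when both coordinates differ by at most `cutoff`
    (Chebyshev distance, an assumption). Clusters with fewer than
    `cluster_min` members are discarded.
    """
    parent = list(range(len(pairs)))

    def find(x):
        # Union-find root lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Quadratic all-against-all comparison; fine for a small illustration.
    for a in range(len(pairs)):
        for b in range(a + 1, len(pairs)):
            di = abs(pairs[a][0] - pairs[b][0])
            dj = abs(pairs[a][1] - pairs[b][1])
            if max(di, dj) <= cutoff:
                parent[find(a)] = find(b)

    clusters = {}
    for idx, pair in enumerate(pairs):
        clusters.setdefault(find(idx), []).append(pair)
    return sorted(sorted(c) for c in clusters.values() if len(c) >= cluster_min)


if __name__ == "__main__":
    pairs = [(10, 500), (12, 498), (14, 496), (300, 900)]
    print(single_linkage_clusters(pairs, cutoff=40, cluster_min=2))
```

With single linkage, a chain of pairs ends up in one cluster as long as each consecutive link is within the cutoff, even if the two ends of the chain are farther apart than the cutoff.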
The option "-d" specifies a minimum distance between alignment column pairs with covariation. This corresponds to an assumed minimum loop length in a hairpin loop. The default value is 3.
Usage: -d DISTANCE
Example: -d 5
The search for long-range covariation can be performed between two different genomic alignment files. If a second file is specified, only cross-correlations between those two files are reported.
Usage: -e FILENAME : Filename of second alignment file (in UCSC MAF format) to be appended.
If a covariation cluster corresponds to two sequence regions with more than the specified number of reverse-complementary nucleotides, it is filtered out.
Usage:
--expand-max VALUE : Maximum allowed length of consecutive sequence covariation. Default: 30
By default, hash tables corresponding to all non-conserved possible nucleotide triplets and genome assemblies are generated. For alignments consisting of many genome assemblies, this can lead to the generation of thousands of hash tables. Using the "-f" option, it is possible to reduce the number of generated hash tables. Note that if not all hash tables are generated, memory is saved, but more alignment column candidates pass the hash-based filtering and have to be checked iteratively, possibly leading to longer run times. In other words, this option corresponds to a trade-off between computer memory and run time.
Usage:
-f FRACTION : Determines fraction of possible hash tables that are actually generated. Values: (0,1]. Higher values mean faster search and more memory consumption. Default: 1.0
With the option "-i" it is possible to control the frequency of progress output. Default: 100000; in other words, progress output is generated for the user every 100,000 alignment columns.
Usage:
-i INTERVAL : User output in search step intervals of this size.
With this option, it is possible to specify genome assemblies of the input MAF blocks that should be ignored.
Usage:
--ignore ASSEMBLY1,ASSEMBLY2,... : list of genome assemblies that should be ignored during reading of alignment.
This option specifies whether a "match" corresponds to a complementary nucleotide type (default) or an identical nucleotide type. This option goes together with option -r: the combination -r 1 -m 1 (default) corresponds to a search for reverse-complementary alignment columns; the combination -r 0 -m 0 corresponds to a search for synchronized mutations.
Usage:
-m 1|0 : Complement mode. If set to 1 (default), search for complementary columns, not matching ones.
When searching for trans-correlations between two input alignments that happen to be identical (not recommended) while searching for synchronized mutations, it is possible to detect the trivial and nonsensical case of self-identity. This option activates a filter for that case; its use, however, is usually not necessary.
Usage:
--noself : filter out stems that have equal start and stop position
The file name of the output file in BED format is specified with this option.
Usage:
-o OUTPUTFILE : output of cluster intervals in UCSC BED file format.
With the options -b, --b1 and --b2, region files in UCSC BED format can be read that act as filters for the input sequences. With the option --pad, these regions can be extended ('padded') by flanking regions.
Usage:
--pad 1|2|... : Adds flanking regions to the filter intervals read in BED format. Typical value: 200 for adding 200 nt on both sides of each interval.
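The arithmetic behind padding can be sketched in Python. The function name pad_intervals and the tuple representation (chromosome, start, end) are hypothetical; clamping the start at 0 and leaving overlapping padded intervals unmerged are assumptions, since this document does not describe how CovaRNA handles those cases.

```python
def pad_intervals(intervals, pad):
    """Extend (chrom, start, end) BED intervals by `pad` nt on both sides.

    BED starts are 0-based, so the padded start is clamped at 0. The end is
    not clamped because the chromosome length is not known here.
    """
    return [(chrom, max(0, start - pad), end + pad)
            for chrom, start, end in intervals]


if __name__ == "__main__":
    regions = [("chr4", 100, 250), ("chr4", 1000, 1200)]
    print(pad_intervals(regions, 200))
```

Padding is useful when the filter regions are annotated features and covarying partners are expected in their immediate flanks.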
With this option it is possible, for memory or efficiency reasons, to limit the maximum number of sequences to be read per alignment block.
Usage:
--prune : If set, read at most this many sequences per MAF block
This option is meaningful in combination with option -m.
By default, covarna searches for reverse-complementary alignment column pairs. This corresponds to the options -m 1 -r 1. Searching for synchronized mutations (or reverse-complementarity with respect to the opposing strand directionality) can be accomplished by specifying the options -m 0 -r 0.
Usage:
-r 1|0 : Reverse mode. If set to 1 (default), looking for stretches of reverse (complementary) columns.
This option specifies genome assemblies that each MAF alignment block must contain in order not to be filtered out.
Usage:
--require ASSEMBLY1,ASSEMBLY2,... : list of genome assemblies that should be required during reading of alignment. All other genome assemblies are ignored.
This option specifies the minimum number of sequences that each MAF block must contain in order to be used. MAF blocks with fewer sequences (after applying the --ignore option) are filtered out while the genome sequence data are read.
Usage:
-s SEQMIN : minimum number of sequences for reading of MAF file. Default: 10
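The order of operations described here (first drop ignored assemblies, then apply the minimum sequence count) can be sketched in Python. The sketch is not CovaRNA's reader: the function names are hypothetical, and the parser is deliberately minimal, using only the "a" and "s" lines of the MAF format and assuming source names of the form assembly.chromosome.

```python
def read_maf_blocks(lines):
    """Yield each MAF block as a list of (assembly, text) tuples.

    Minimal parser: only 'a' (block start) and 's' (sequence) lines are used.
    """
    block = None
    for line in lines:
        fields = line.split()
        if not fields:
            continue
        if fields[0] == "a":
            if block:
                yield block
            block = []
        elif fields[0] == "s" and block is not None:
            # 's' line fields: src start size strand srcSize text.
            assembly = fields[1].split(".")[0]
            block.append((assembly, fields[6]))
    if block:
        yield block


def filter_blocks(blocks, seq_min=10, ignore=()):
    """Drop ignored assemblies, then keep blocks with >= seq_min sequences."""
    for block in blocks:
        kept = [(asm, text) for asm, text in block if asm not in ignore]
        if len(kept) >= seq_min:
            yield kept


if __name__ == "__main__":
    # Run from the examples directory; dm3_chrM.maf ships with CovaRNA.
    with open("dm3_chrM.maf") as handle:
        n_blocks = sum(1 for _ in filter_blocks(read_maf_blocks(handle)))
    print(n_blocks, "blocks have at least 10 sequences")
```

Counting the surviving blocks in this way before a long search gives a quick impression of how much of the alignment a given -s value retains.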
With the option "--stem" it is possible to define a minimum length of reverse-complementary sequence in which a covarying alignment column pair must be contained. The default value is "1"; in other words, by default there is no filtering by minimum stem length.
Usage:
--stem 1|2|3|4 : minimum stem length.
The binary covarnap is a multi-threaded application based on the Intel Threading Building Blocks framework. As described in the Installation notes, it requires that the variable LD_LIBRARY_PATH contains a path to the TBB installation directory that was used for compiling the code. The single-threaded version of the program is found in binary covarna.
The option "-t" specifies a fixed number of compute cores that are used for executing the algorithm. If the option is not specified, the program covarnap detects all available cores and starts that many threads. This may or may not be the desired behavior. If, for example, you are running the algorithm on a heterogeneous cluster and your job is allowed to utilize all cores of a node, automatically utilizing all cores of the host computer can be convenient. If, in contrast, the program is run on a "mainframe" type of host and should not start threads on all cores, the number of threads should be specified with the option -t.
Usage:
-t THREADS : Maximum number of parallel threads (executable covarnap). Value must be greater than zero.
Usage:
-v 0|1|2|3|4 : set verbose level (0:silent, 1: default)
--cluster-filter-on : if set, single-linkage clustering is performed during the search. The use of this option is discouraged.
--cluster-filter-off : if set, no single-linkage clustering is performed during the search.
--same-chrom : force to assume that for two given MAF files, the reference genome is from the same chromosomes.
The main reason for deprecated options is that with the current version, all P-value computations should be handled by the separate CovStat program.
-density : number of expected stems per area (per sites squared). This option is deprecated; densities are currently not being used.
--dif filename : density input filename.
--dof filename : density output filename.
Option "emax": Specifying a maximum e-value of
--emax VALUE : maximum e-values of listed interaction clusters. Default: -1. Computation of statistical significance is now performed by the Covstat program.
This option is deprecated, because all P-value computations should currently be handled by CovStat and not CovaRNA.
--multi 0|1|2|3 : multiple-testing correction. 0: no clustering; 1: total area; 2: total area / eff. cluster area (default); 3: total area / cluster area
--opposite : Combination mode: corresponds to --strand 1 --strand2 -1 --noself
-p : If set, compute p-values and E-values of found stems.
--search-max NUMBER : ignore columns that would lead to clusters with more than this number of columns.
Alignment shuffling is an important approach for generating controls. Alignment shuffling should, however, be performed with the additional binary called "shufflemaf", as described in section 2.2.
--shuffle 0|1|2 : shuffling of MAF alignment blocks. 0: no shuffling; 1: dinucleotide-preserving shuffling.
--stem-p 0..1 : Limits clusters to have smaller P value for stem-bias
--strand 1|-1|0 : strand mode: MAF blocks are converted to plus strand (1) or minus strand (-1) of reference genome. 0: no conversion.
--strand2 1|-1|0 : strand mode of second MAF alignments: MAF blocks are converted to plus strand (1) or minus strand (-1) of reference genome. 0: no conversion.
--taboo ASSEMBLY1,ASSEMBLY2,... : Deprecated; use option --ignore instead. list of genome assemblies that should be ignored during reading of alignment.