CentriMo

Usage:

centrimo [options] <sequence file> <motif file>+
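The usage line puts options first, then one sequence file, then one or more motif files. A minimal sketch of assembling such a command in Python is shown here. The helper name `build_centrimo_command` and the file names `peaks.fasta` and `motifs.meme` are hypothetical; the only flags used are ones described in the Options section of this page.

```python
import shlex


def build_centrimo_command(sequence_file, motif_files, out_dir=None,
                           neg_file=None, local=False, ethresh=None):
    """Assemble an argument list: options, then sequence file, then motif files."""
    if not motif_files:
        raise ValueError("at least one motif file is required")
    cmd = ["centrimo"]
    if out_dir is not None:
        cmd += ["--oc", out_dir]            # output folder, overwriting allowed
    if neg_file is not None:
        cmd += ["--neg", neg_file]          # control (negative) sequences
    if local:
        cmd.append("--local")               # test all regions, not only central ones
    if ethresh is not None:
        cmd += ["--ethresh", str(ethresh)]  # E-value reporting threshold
    cmd.append(sequence_file)
    cmd.extend(motif_files)
    return cmd


# Hypothetical input files, for illustration only.
cmd = build_centrimo_command("peaks.fasta", ["motifs.meme"],
                             out_dir="centrimo_out", local=True)
print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run` as-is, which avoids shell quoting problems with unusual file names.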

Description

CentriMo takes a set of motifs and a set of equal-length DNA or RNA sequences and plots the positional distribution of the best match of each motif.

The motifs are typically compendia of DNA- or RNA-binding motifs, and the sequences might be: 500 bp sequences aligned on ChIP-seq peaks or summits; 300 bp sequences centered on sets of transcription start sites or translation start sites; sequences aligned on splice-junctions; etc.

CentriMo also computes the "local enrichment" of each motif by counting the number of times its best match in each sequence occurs in a local region and applying a statistical test to see if the local enrichment is significant. By default, CentriMo examines only regions centered in the input sequences, but it will compute the enrichment of all regions if you specify the --local option. CentriMo uses the binomial test to compute the significance of the number of sequences where the best match occurs in a given region, assuming a uniform prior over best match positions. CentriMo reports the location and significance of the best region for each motif.
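The binomial test described above can be illustrated with a short Python sketch. It is not CentriMo's implementation. It assumes that, under the uniform prior, the chance of a best match falling in a region equals the region width divided by the number of positions where the motif can be placed; the function names and the example numbers are hypothetical.

```python
from math import comb


def binomial_tail(k, n, p):
    """Return P(X >= k) for X ~ Binomial(n, p), summed exactly.

    Adequate for modest n; large n would call for log-space arithmetic.
    """
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))


def region_pvalue(seq_len, motif_width, region_width, hits_in_region, n_sequences):
    """P-value for seeing at least hits_in_region best matches inside the region."""
    # Assumption: every placement of the motif is equally likely a priori.
    n_positions = seq_len - motif_width + 1
    p_region = region_width / n_positions
    return binomial_tail(hits_in_region, n_sequences, p_region)


# Hypothetical example: 20 sequences of length 100, an 11-wide motif,
# and 8 best matches inside a 9-position region.
p = region_pvalue(seq_len=100, motif_width=11, region_width=9,
                  hits_in_region=8, n_sequences=20)
print(p)
```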

CentriMo can also perform comparative enrichment, reporting the relative enrichment of the best region in a second, control set of sequences if you specify the --neg option. CentriMo chooses regions based on their significance in the primary set of sequences, and then uses the Fisher exact test to evaluate the significance of the number of best matches in the region in the primary set compared with the number of best matches in the same region in the control set of sequences.
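As a rough illustration of the comparison above, the following Python sketch computes a one-sided Fisher exact p-value from a 2x2 table (primary versus control set, best match inside versus outside the region). It assumes the totals are the numbers of sequences with a scored best match in each set; the function name and the one-sided form are assumptions, not taken from CentriMo's source.

```python
from math import comb


def fisher_one_sided(pos_in, pos_total, neg_in, neg_total):
    """P-value that the primary set holds at least pos_in of the in-region best matches.

    pos_in / neg_in: best matches inside the region (primary / control set).
    pos_total / neg_total: sequences with a best match in each set.
    """
    in_total = pos_in + neg_in
    n = pos_total + neg_total
    upper = min(pos_total, in_total)
    # Hypergeometric tail: sum over tables at least as extreme as the observed one.
    tail = sum(comb(pos_total, x) * comb(neg_total, in_total - x)
               for x in range(pos_in, upper + 1))
    return tail / comb(n, in_total)


# Hypothetical counts: 30 of 100 primary and 10 of 100 control best matches in the region.
print(fisher_one_sided(30, 100, 10, 100))
```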

Inputs

Motif File

A file containing motifs. Outputs from MEME and DREME are supported, as well as minimal MEME format. You can convert many other motif formats to minimal MEME format using conversion scripts available with the MEME Suite.

Sequence File

A file containing FASTA formatted sequences, ideally all of the same length. The sequences in this file are referred to as the "positive sequences" when a second set of sequences is provided using the --neg option (see below).

Outputs

CentriMo outputs an HTML file in which you can interactively select the motifs whose positional distributions are plotted and control smoothing and other plotting parameters. CentriMo also outputs two text files: centrimo.txt, a tab-delimited version of the results, and site_counts.txt, which lists, for each motif and each offset, the number of sequences where the best match of the motif occurs at that offset.
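To make the idea of per-offset site counts and smoothing concrete, here is a small Python sketch. It does not read site_counts.txt, whose exact layout is not described here. It tallies hypothetical best-match offsets and applies a centered moving average, which stands in for whatever smoothing the HTML output actually uses.

```python
from collections import Counter


def count_sites(best_match_offsets, offsets):
    """Number of sequences whose best match falls at each offset."""
    tally = Counter(best_match_offsets)
    return [tally.get(o, 0) for o in offsets]


def moving_average(values, window):
    """Centered moving average; the window is truncated at the ends."""
    half = window // 2
    smoothed = []
    for i in range(len(values)):
        lo = max(0, i - half)
        hi = min(len(values), i + half + 1)
        smoothed.append(sum(values[lo:hi]) / (hi - lo))
    return smoothed


# Hypothetical data: one best-match offset per sequence, relative to the sequence center.
offsets = list(range(-3, 4))
counts = count_sites([0, 0, 1, -1, 0, 3], offsets)
smoothed = moving_average(counts, window=3)
print(counts)
print(smoothed)
```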

Options

Each option is listed with its parameter (if any), followed by its description and its default behaviour.

Input/Output

--o name
Description: Create a folder called name and write output files in it. This option is not compatible with --oc as only one output folder is allowed.
Default: The program behaves as if --oc centrimo_out had been specified.

--oc name
Description: Create a folder called name but if it already exists allow overwriting the contents. This option is not compatible with --o as only one output folder is allowed.
Default: The program behaves as if --oc centrimo_out had been specified.

--neg fasta file
Description: Plot the motif distributions in this set (the negative sequences) as well. For each enriched region reported, based on enrichment in the positive sequences, the significance of the relative enrichment of that region in the positive sequences versus these negative sequences is evaluated using the Fisher exact test.

--bgfile bg file
Description: Read a zero order background from the specified file. If motif-file is specified in place of a file name, read the background from the motif file.
Default: The program uses the base frequencies in the input sequences.

--motif ID
Description: Select the motif with the given ID for scanning. This option may be repeated to select multiple motifs.
Default: The program scans with all the motifs.

--motif-pseudo pseudocount
Description: Apply this pseudocount to the PWMs before scanning.
Default: The program applies a pseudocount of 0.1.

--seqlen length
Description: Use only sequences of the given length.
Default: Use sequences with the same length as the first sequence, ignoring all other sequences in the input file(s).

Scanning

--score S
Description: The score threshold for PWMs, in bits. Sequences without a match with score ≥ S are ignored.
Default: A score of 5 is used.

--optimize_score
Description: Search for the optimal score above the minimum threshold given by the --score option.
Default: The minimum score threshold is used.

--maxreg maxreg
Description: The maximum region size to consider.
Default: Try all region sizes up to the sequence width.

--minreg minreg
Description: The minimum region size to consider. Must be less than maxreg.
Default: Try regions 1 bp and larger.

--norc
Description: Do not scan with the reverse complement motif.
Default: Scans with the reverse complement motif.

--flip
Description: Reverse complement matches appear 'reflected' around sequence centers.
Default: Do not 'flip' the sequence; use the reverse complement of the motif instead.

--local
Description: Compute enrichment of all regions.
Default: Compute enrichment of central regions.

--disc
Description: Use the Fisher exact test to compute enrichment discriminatively. Requires the comparative sequences to be supplied with the --neg option.
Default: Use the binomial test to compute enrichment.

Output filtering

--ethresh thresh
Description: Limit the results to motifs with an enriched region whose E-value is less than thresh. Enrichment E-values are computed by first adjusting the binomial p-value of a region for the number of regions tested using the Bonferroni correction, and then multiplying the adjusted p-value by the number of motifs in the input to CentriMo.
Default: Include motifs with E-values up to 10.

Miscellaneous

--desc description
Description: Include the text description in the HTML output.
Default: No description in the HTML output.

--dfile desc file
Description: Include the first 500 characters of text from the file desc file in the HTML output.
Default: No description in the HTML output.

--noseq
Description: Do not store sequence IDs in the output of CentriMo.
Default: CentriMo stores a list of the sequence IDs with matches in the best region for each motif. This can potentially make the file size much larger.

-verbosity 1|2|3|4|5
Description: A number that regulates the verbosity level of the output information messages. If set to 1 (quiet), only error messages are output, whereas the other extreme, 5 (dump), outputs a large amount of mostly unneeded information.
Default: The verbosity level is set to 2 (normal).
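The E-value computation described under --ethresh can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than CentriMo's code: the Bonferroni adjustment is taken to be the p-value multiplied by the number of regions tested and capped at 1, and the function names and the treatment of an E-value exactly equal to the threshold are hypothetical.

```python
def enrichment_evalue(p_value, n_regions_tested, n_motifs):
    """Bonferroni-adjust a region p-value, then scale by the number of motifs."""
    # Assumption: Bonferroni correction is p * (number of tests), capped at 1.
    adjusted = min(1.0, p_value * n_regions_tested)
    return adjusted * n_motifs


def passes_ethresh(evalue, thresh=10.0):
    """Keep motifs whose E-value does not exceed the threshold (default 10)."""
    # Assumption: the boundary case evalue == thresh is kept.
    return evalue <= thresh


# Hypothetical numbers: p = 1e-6 for the best of 100 regions, 50 motifs in the input.
e = enrichment_evalue(1e-6, n_regions_tested=100, n_motifs=50)
print(e, passes_ethresh(e))
```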