Usage:

tgene [options] <locus_file> <annotation_file>

Description

Input

<locus_file>

The name of a file containing chromosome locations (loci) of potential regulatory elements in BED format. The genomic coordinates are assumed to be 0-based, half-open as defined in the BED standard. Typically, these would be transcription factor (TF) peaks from a TF ChIP-seq experiment, output by a peak-caller such as MACS.
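As a rough illustration of the coordinate convention, the following sketch converts a BED interval (0-based, half-open) to the 1-based, closed convention used by GTF files. The function names and the sample BED line are hypothetical and are not part of T-Gene.

```python
from typing import Tuple


def parse_bed_line(line: str) -> Tuple[str, int, int]:
    """Return (chrom, chromStart, chromEnd) from the first three BED fields."""
    fields = line.rstrip("\n").split("\t")
    return fields[0], int(fields[1]), int(fields[2])


def bed_to_one_based_closed(chrom_start: int, chrom_end: int) -> Tuple[int, int]:
    """Convert a 0-based, half-open interval to a 1-based, closed one.

    Only the start shifts by one; the half-open end already equals
    the 1-based position of the last base in the interval.
    """
    if chrom_start < 0 or chrom_end <= chrom_start:
        # This sketch rejects empty or negative intervals (an assumption).
        raise ValueError("expected 0 <= chromStart < chromEnd")
    return chrom_start + 1, chrom_end


# Hypothetical one-base locus: BED [11868, 11869) is position 11869 in GTF terms.
chrom, start, end = parse_bed_line("chr1\t11868\t11869\tpeak1")
print(chrom, *bed_to_one_based_closed(start, end))
```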

<annotation_file>

The name of an annotation file containing information on each of the transcription start sites of the genome referenced by the locus file. The annotation file should be in either GenCode GTF format or in Ensembl GTF Format (where the key "transcript_type" is replaced by "transcript_biotype"), and is used to supply transcription start site coordinates as well as other gene and transcript information. The attributes field (column 9) should contain (at least) the key-value pairs for the following keys: gene_id, transcript_id, gene_name and transcript_type. Annotation files for many genomes are available from ftp://ftp.ensembl.org/pub/release-X/gtf (where you must substitute an actual release number for 'X', e.g., ftp://ftp.ensembl.org/pub/release-102/gtf) and ftp://ftp.ensemblgenomes.org/pub/release-X/GROUP/gtf (where you must substitute an actual release number for 'X', and an actual group for 'GROUP', e.g., ftp://ftp.ensemblgenomes.org/pub/release-57/fungi/gtf). The genomic coordinates are assumed to be 1-based, closed as defined in the GTF standard.
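One way to check an annotation file before running T-Gene is to parse the attributes field and confirm that the required keys are present. The sketch below is not T-Gene's own parser; the function names, the regular expression, and the sample attribute strings are assumptions made for illustration.

```python
import re
from typing import Dict, List

# Keys the attributes field (column 9) should contain, per the text above.
REQUIRED_KEYS = ("gene_id", "transcript_id", "gene_name", "transcript_type")

# Matches `key "value";` and also unquoted values such as `level 2;`.
_ATTR_RE = re.compile(r'(\w+)\s+(?:"([^"]*)"|([^;\s]+))\s*;')


def parse_attributes(column9: str) -> Dict[str, str]:
    """Parse GTF column 9 into a dict, keeping the first value of repeated keys."""
    attrs: Dict[str, str] = {}
    for m in _ATTR_RE.finditer(column9):
        value = m.group(2) if m.group(2) is not None else m.group(3)
        attrs.setdefault(m.group(1), value)
    return attrs


def missing_keys(column9: str, ensembl_style: bool = False) -> List[str]:
    """List required keys absent from column 9.

    With ensembl_style=True, "transcript_biotype" is expected in place
    of "transcript_type".
    """
    required = list(REQUIRED_KEYS)
    if ensembl_style:
        required[-1] = "transcript_biotype"
    attrs = parse_attributes(column9)
    return [key for key in required if key not in attrs]


# Illustrative attribute string (not taken from a specific release).
sample = ('gene_id "ENSG00000223972.3"; transcript_id "ENST00000456328.2"; '
          'gene_name "DDX11L1"; transcript_type "processed_transcript"; level 2;')
print(missing_keys(sample))
print(missing_keys('gene_id "g1"; transcript_id "t1";', ensembl_style=True))
```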

Output

T-Gene writes its output to files in a directory named tgene_out, which it creates if necessary. You can change the output directory using the --o or --oc options. The directory will contain:

Note: See this detailed description of the T-Gene output formats for more information.

Options

Option Parameter Description Default Behavior
General Options
--transcript-types ttypes A comma-separated list (no spaces) of RNA transcript types. T-Gene will only output links for transcripts of these types. The value of ttypes is set to 'protein_coding,processed_transcript'.
--max-link-distances mlds A comma-separated list (no spaces) of maximum distances between a potential regulatory element (RE) and its target. By default, T-Gene will evaluate all potential links that satisfy the maximum distance criterion as well as Closest-Locus and Closest-TSS links (see options --no-closest-locus and --no-closest-tss, below). Note: If you provide a tissue panel (see Tissue Panel Options, below), there must be one distance for each histone name in histones (see option --histones, below), and each distance is used with the corresponding histone name. If you do not provide a tissue panel, you may only specify one distance. The value of mlds is set to '500000'.
--max-pvalue mpv Only links whose p-value is less than or equal to mpv will be included in the output of T-Gene. If you provide a tissue panel (see Tissue Panel Options, below), T-Gene will test the CnD (Correlation and Distance) p-value, otherwise it will test the Distance p-value. Note: T-Gene does not apply the maximum p-value threshold to closest-locus and closest-TSS links, which are always included in the output unless you specify the options --no-closest-locus or --no-closest-tss, respectively (see Other Options, below). 0.05
Tissue Panel Options
--tissues tissues A comma-separated list (no spaces) of three (or more) tissue names that are the sources of the histone and expression data. These names must also be the names of the subfolders where the histone and expression data files are to be found by T-Gene. See below under options --histone-root and --expression-root for more information. Note: Because the Pearson correlation coefficient is always 1, 0 or -1 for a pair of points, it does not make sense to use fewer than three tissues. None.
--histone-root hrd The root directory containing the histone modification files. The files must be in ENCODE broadPeak or ENCODE narrowPeak format, but only the first 7 fields are used (or required). The genomic coordinates are assumed to be 0-based, half-open as defined in the above standards. The histone modification files should be in subdirectories under the histone root directory, where each subfolder is named according to the tissue from which the data is taken. (See option --tissues, above.) The subdirectories should be named '<hrd>/<t>', where <t> is one of the tissue names in the comma-separated tissues list. None.
--histones histones A comma-separated list (no spaces) of histone modification names. The histone modification file names must match '<hrd>/<t>/*<hname>*[broad|narrow]Peak', where <t> is one of the tissue names in the comma-separated tissues list, and <hname> is one of the histone names in the comma-separated histones list. None.
--rna-source Cage|LongPap The type of RNA expression data that you are providing. This determines the precise format expected in the expression files in GTF format that you specify. For Cage GTF files, the attributes field (column 9) should contain the key-value pairs for the following keys: gene_id and trlist (which is a comma-separated list of transcript IDs), and one or two keys matching the regular expression "rpm[12]?" whose value is the RNA expression of the transcript. (Note that this is not standard GTF format due to the required transcript_id key-value pair being replaced by the trlist key-value pair.) For LongPap GTF files, the attributes field (column 9) should contain the key-value pairs for the following keys: gene_id, transcript_id and one or two keys matching the regular expression "[RF]PKM[12]?" whose value is the RNA expression of the transcript.
Example lines for each RNA source (GTF fields 1 to 9, shown here separated by spaces):
Cage: chr1 Gencode TSS 11869 11869 0 + . gene_id "ENSG00000223972.3"; trlist "ENST00000456328.2,"; trbiotlist "processed_transcript,"; confidence "not_low"; gene_biotype "pseudogene"; rpm1 "0"; rpm2 "0";
LongPap: chrX EpiRoad gene 99883667 99894988 0 - . gene_id "ENSG00000000003.9"; transcript_id "ENSG00000000003.9"; RPKM "43.985";
None.
--expression-root erd The root directory containing the RNA expression files. The files must be in a flavor of GTF format, as described above under option --rna-source. The genomic coordinates are assumed to be 1-based, closed as defined in the GTF standard. The RNA expression files must be in subdirectories under the expression root directory, where each subfolder is named according to the tissue from which the data is taken. (See option --tissues, above.) The subdirectories should be named '<erd>/<t>', where <t> is one of the tissue names in the comma-separated tissues list. The RNA expression file names must match '<erd>/<t>/*<rna_source>*.gtf'. None.
--use-gene-ids If your expression data files only contain gene ID information (e.g., if the 'transcript_id' fields are not unique or not specified), T-Gene can use the 'gene_id' fields instead for associating entries in the expression files with an entry in the annotation file. T-Gene will use the start and end positions given for each 'gene_id' in the expression files, and, for each 'gene_id' all expression files must agree or the results will be unpredictable. T-Gene uses the 'transcript_id' fields in the annotation and expression files to identify transcripts, and they must be unique within the annotation file. T-Gene uses the start and end positions for each transcript as specified in the annotation file.
--lecat lecat (Low Expression Adjustment Threshold) If the maximum expression of a TSS is < lecat, T-Gene reduces the computed correlation values for all its links. It multiplies the computed correlation for each link by the scale factor max_expr/lecat, where max_expr is the maximum expression of the TSS across the panel of tissues. 0 (No correlations are reduced.)
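The low expression adjustment described for --lecat can be sketched as follows. This is a minimal illustration of the stated rule (scale by max_expr/lecat when max_expr < lecat), not T-Gene's actual code; the function name and the sample values are hypothetical.

```python
from typing import Sequence


def adjust_correlation(correlation: float,
                       tss_expression: Sequence[float],
                       lecat: float = 0.0) -> float:
    """Scale a link's correlation when the TSS is lowly expressed.

    tss_expression holds the expression of one TSS in each tissue of
    the panel. With lecat = 0 (the default) nothing is reduced.
    """
    max_expr = max(tss_expression)
    if lecat > 0 and max_expr < lecat:
        return correlation * (max_expr / lecat)
    return correlation


# Hypothetical TSS with a maximum expression of 2.0 across three tissues.
expr = [0.5, 1.0, 2.0]
print(adjust_correlation(0.8, expr, lecat=4.0))  # scaled by 2.0 / 4.0
print(adjust_correlation(0.8, expr, lecat=1.0))  # unchanged: max_expr >= lecat
print(adjust_correlation(0.8, expr))             # unchanged: default lecat of 0
```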
Other Options
--no-closest-locus T-Gene will not search for closest-locus links that exceed the maximum distance requirement (see option --max-link-distances, above), and it will not output closest-locus links that exceed the maximum p-value requirement (see option --max-pvalue, above). T-Gene includes a link to the closest locus (or loci in case of ties) for each transcript even if the locus does not meet the maximum link distance requirement or the maximum p-value requirement.
--no-closest-tss T-Gene will not search for closest-TSS links that exceed the maximum distance requirement (see option --max-link-distances, above), and it will not output closest-TSS links that exceed the maximum p-value requirement (see option --max-pvalue, above). T-Gene includes a link to the closest transcript (or transcripts in case of ties) for each locus even if the transcript does not meet the maximum link distance requirement or the maximum p-value requirement.
--no-noise T-Gene will not add random Gaussian noise to expression or histone values that are zero. Note: Using this option will make the p-value calculations less accurate. T-Gene adds random Gaussian noise to all zero expression and histone values.
--seed seed Seed for random number generator for generating the null model for correlation p-values and for adding random noise to zero expression and histone values. 0
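Before running T-Gene with a tissue panel, it can help to confirm that the directory layout matches the patterns given under --histones and --expression-root. The sketch below is a hypothetical pre-flight check, not part of T-Gene. It assumes that '[broad|narrow]Peak' means a file name ending in either 'broadPeak' or 'narrowPeak', and the tissue and histone names in the demonstration are made up.

```python
import tempfile
from pathlib import Path
from typing import List, Sequence


def find_histone_files(hrd: Path, tissue: str, hname: str) -> List[Path]:
    """Files matching '<hrd>/<t>/*<hname>*broadPeak' or '...narrowPeak'."""
    tdir = Path(hrd) / tissue
    if not tdir.is_dir():
        return []
    found: List[Path] = []
    for suffix in ("broadPeak", "narrowPeak"):
        found.extend(tdir.glob(f"*{hname}*{suffix}"))
    return sorted(found)


def find_expression_files(erd: Path, tissue: str, rna_source: str) -> List[Path]:
    """Files matching '<erd>/<t>/*<rna_source>*.gtf'."""
    tdir = Path(erd) / tissue
    if not tdir.is_dir():
        return []
    return sorted(tdir.glob(f"*{rna_source}*.gtf"))


def check_panel(hrd: Path, erd: Path, tissues: Sequence[str],
                histones: Sequence[str], rna_source: str) -> List[str]:
    """Return a list of human-readable problems; empty if the layout looks complete."""
    problems: List[str] = []
    if len(tissues) < 3:
        problems.append("fewer than three tissues")
    for t in tissues:
        for h in histones:
            if not find_histone_files(hrd, t, h):
                problems.append(f"no {h} peak file for tissue {t}")
        if not find_expression_files(erd, t, rna_source):
            problems.append(f"no {rna_source} expression file for tissue {t}")
    return problems


# Demonstration on a throwaway layout with made-up tissue and histone names.
with tempfile.TemporaryDirectory() as root:
    hrd, erd = Path(root) / "histone", Path(root) / "expression"
    panel = ["liver", "lung", "heart"]
    for t in panel:
        (hrd / t).mkdir(parents=True)
        (erd / t).mkdir(parents=True)
        (hrd / t / f"{t}.H3K27ac.broadPeak").touch()
        (erd / t / f"{t}.Cage.gtf").touch()
    complete = check_panel(hrd, erd, panel, ["H3K27ac"], "Cage")
    incomplete = check_panel(hrd, erd, panel, ["H3K27ac", "H3K4me3"], "Cage")

print(complete)
print(incomplete)
```

The check only looks at file names; it does not validate the contents of the peak or GTF files.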

Citing