Input File Formats

cMonkey uses a number of input files to support its built in scoring functions.

Note: cMonkey-Python uses a cache directory for data files that are downloaded from the web. You can use this mechanism for specifying your own data, as long as it is in the same format as specified below.

The path is by default cache, you can also specify a different location by providing the --cachedir option on the command line

Gene expressions (mandatory)

The most basic and mandatory input is gene expression data. This should be specified as a tab-delimited text file, with the following specifications for a matrix with n genes and m conditions:

  • first row: 1 dummy label + m condition titles
  • n rows where each row starts with a gene name and m expression ratio values

The expression file can either be uncompressed or in gzip format (in this case, it should have a .gz suffix).

Example:

X<TAB>Cond 1<TAB>Cond 2
Gene1<TAB>1.213<TAB>-1.412
...

RSAT information

The RSAT database is central to cMonkey’s automatic retrieval of organism information. It is used for several purposes:

  • organism classification (eukaryote/prokaryote, NCBI taxonomy mapping)
  • gene synonyms
  • genomic information

RSAT provides organism data on a number of sites, where there is usually a web directory under a URL that looks like <BASE_URL>/data/genomes/

This directory contains many organism directories. The user can specify this RSAT organism explicitly via the --rsat_organism parameter on the command line, otherwise, it will attempt to derive the name through the mandatory KEGG organism code. In addition the base url of the RSAT site to be searched can be defined with the --rsat_base_url parameter.

cMonkey2 expects the following files in an organism directory:

RSAT organism file (“organism.tab”)

The contents of this file is used to look for the occurrence of the word “Eukaryota” and derive the NCBI taxonomy id. If the word “Eukaryota” is found, the organism will be marked as a eukaryote, otherwise, it will be assumed as prokaryote.

Example:

64091<TAB>Archaea; Euryarchaeota; Halobacteria; Halobacteriales; Halobacteriaceae; Halobacterium

RSAT feature name files (“feature_names.tab”)

These files are used by cMonkey to create the synonym table for alternative gene names. Each line has the format

<Accession ID><TAB><name><primary|alternate>

Example:

NP_045946.1<TAB>VNG7001<TAB>primary
NP_045946.1<TAB>1446803<TAB>alternate
NP_045946.1<TAB>10803548<TAB>alternate
NP_045946.1<TAB>NP_045946.1<TAB>alternate
...

RSAT feature files (“feature.tab”)

Each line in a feature file contains a gene location for an organism. The important information is are the id, name, contig, strand, start and end position. For each contig listed in this file, there must exist a corresponding contig file.

Example:

-- id<TAB>type<TAB>name<TAB>contig<TAB>start<TAB>end<TAB>strand<TAB>description<TAB>chrom_pos<TAB>organism<TAB>GeneID
NP_045946.1<TAB>CDS<TAB>VNG7001<TAB>NC_001869.1<TAB>363<TAB>812<TAB>R<TAB>hypothetical protein<TAB>complement(363..812)<TAB>Halobacterium sp. NRC-1<TAB>1446803
NP_045947.1<TAB>CDS<TAB>VNG7002<TAB>NC_001869.1<TAB>834<TAB>1172<TAB>R<TAB>hypothetical protein<TAB>complement(834..1172)<TAB>Halobacterium sp. NRC-1<TAB>1446804
...

RSAT contig files

These files contain the raw genomic sequence in lower case for a specific contig/chromosome that is referenced in the RSAT features file.

Example (file name “NC002607.1.tab”):

ttgacccactgaatcacgtctgaccgcgcgtacgcggtcacttgcggtgccgttttctttgttaccgacgaccgaccagcgacagccaccgcgcgctcactgccaccaaaagagtcatatcacagccgaccagtttctggaacgttcccgatactggaacggtcctaatgcagtatcccaccctccttccatcgacgccagtcgaatcacgccgccagccaccgtccgccagccggccagaataccgatgactcggcggtctcgtgtcggtgccggcctcgcagccattgtactggccctggccgcagtgtcggctgccgctcc

RSAT mockup directories

For a number of organisms, RSAT does not provide the necessary files for a cmonkey run. In this case, the user can provide their own local directory that mirrors the structure, so it contains the files described above and specify the directory using –rsat_dir <your_rsat_directory>

Example:

organism.tab
feature_names.tab
feature.tab
<contig1>.tab
<contig2>.tab
...

STRING protein-protein interactions

STRING is a database of known and predicted protein-protein interactions. cMonkey uses these interactions to improve its clustering results. In order to do so, it builds a network using the interactions for the genes for the current organism. This network can then be used in the network scoring component of cMonkey.

STRING is an enormous database and so it is a good idea to prepare the input to provide only the necessary data for cMonkey’s network scoring algorithm, namely the names of the genes and their score.

STRING files are tab-delimited files containing entries of the form

Gene1<TAB>Gene2<TAB>Normalized Score

cmonkey-python provides the utility extract_string_links.sh to write the interactions for an organism, given the KEGG code and a database file (either gzip’ed or plain).

Protein-protein interaction network files have a very simple structure, each line in the file defines a a network edge with a weight, and each file is compressed with gzip, so the naming scheme is <NCBI code>.gz.

Example (file name “64091.gz”):

VNG0001H<TAB>VNG0002G<TAB>865
VNG0001H<TAB>VNG0003C<TAB>802
VNG0001H<TAB>VNG0005H<TAB>561
...

Microbes Online operon files

These files are typically provided by Microbes Online and define operon relationships of genes. These are only used for prokaryotic organisms. The name scheme that cmonkey uses for such files is gnc<NCBI code>.named. See example for the format of the lines in a file.

Example (file name “gnc64091.named”):

Gene1<TAB>Gene2<TAB>SysName1<TAB>SysName2<TAB>Name1<TAB>Name2<TAB>bOp<TAB>pOp<TAB>Sep<TAB>MOGScore<TAB>GOScore<TAB>COGSim<TAB>ExprSim
68130<TAB>68131<TAB>VNG0001H<TAB>VNG0002G<TAB>VNG0001H<TAB>yvrO<TAB>TRUE<TAB>0.950<TAB>-3.000<TAB>0.000<TAB>NA<TAB>N<TAB>0.791
68131<TAB>68132<TAB>VNG0002G<TAB>VNG0003C<TAB>yvrO<TAB>VNG0003C<TAB>TRUE<TAB>0.920<TAB>30.000<TAB>0.600<TAB>1.000<TAB>Y<TAB>0.525
...