Input File Formats ================== cMonkey uses a number of input files to support its built in scoring functions. **Note:** cMonkey-Python uses a cache directory for data files that are downloaded from the web. You can use this mechanism for specifying your own data, as long as it is in the same format as specified below. The path is by default ``cache``, you can also specify a different location by providing the ``--cachedir`` option on the command line Gene expressions (mandatory) ---------------------------- The most basic and mandatory input is gene expression data. This should be specified as a tab-delimited text file, with the following specifications for a matrix with n genes and m conditions: * first row: 1 dummy label + m condition titles * n rows where each row starts with a gene name and m expression ratio values The expression file can either be uncompressed or in gzip format (in this case, it should have a .gz suffix). **Example:** .. highlight:: none :: XCond 1Cond 2 Gene11.213-1.412 ... RSAT information ---------------- The RSAT database is central to cMonkey's automatic retrieval of organism information. It is used for several purposes: * organism classification (eukaryote/prokaryote, NCBI taxonomy mapping) * gene synonyms * genomic information RSAT provides organism data on a number of sites, where there is usually a web directory under a URL that looks like ``/data/genomes/`` This directory contains many organism directories. The user can specify this RSAT organism explicitly via the ``--rsat_organism`` parameter on the command line, otherwise, it will attempt to derive the name through the mandatory KEGG organism code. In addition the base url of the RSAT site to be searched can be defined with the ``--rsat_base_url`` parameter. cMonkey\ :sub:`2` expects the following files in an organism directory: RSAT organism file ("organism.tab") ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ The contents of this file is used to look for the occurrence of the word "Eukaryota" and derive the NCBI taxonomy id. If the word "Eukaryota" is found, the organism will be marked as a eukaryote, otherwise, it will be assumed as prokaryote. **Example:** .. highlight:: none :: 64091Archaea; Euryarchaeota; Halobacteria; Halobacteriales; Halobacteriaceae; Halobacterium RSAT feature name files ("feature_names.tab") ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ These files are used by cMonkey to create the synonym table for alternative gene names. Each line has the format .. highlight:: none :: **Example:** .. highlight:: none :: NP_045946.1VNG7001primary NP_045946.11446803alternate NP_045946.110803548alternate NP_045946.1NP_045946.1alternate ... RSAT feature files ("feature.tab") ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Each line in a feature file contains a gene location for an organism. The important information is are the id, name, contig, strand, start and end position. For each contig listed in this file, there must exist a corresponding contig file. **Example:** .. highlight:: none :: -- idtypenamecontigstartendstranddescriptionchrom_posorganismGeneID NP_045946.1CDSVNG7001NC_001869.1363812Rhypothetical proteincomplement(363..812)Halobacterium sp. NRC-11446803 NP_045947.1CDSVNG7002NC_001869.18341172Rhypothetical proteincomplement(834..1172)Halobacterium sp. NRC-11446804 ... RSAT contig files ~~~~~~~~~~~~~~~~~ These files contain the raw genomic sequence in lower case for a specific contig/chromosome that is referenced in the RSAT features file. **Example (file name "NC002607.1.tab"):** .. highlight:: none :: ttgacccactgaatcacgtctgaccgcgcgtacgcggtcacttgcggtgccgttttctttgttaccgacgaccgaccagcgacagccaccgcgcgctcactgccaccaaaagagtcatatcacagccgaccagtttctggaacgttcccgatactggaacggtcctaatgcagtatcccaccctccttccatcgacgccagtcgaatcacgccgccagccaccgtccgccagccggccagaataccgatgactcggcggtctcgtgtcggtgccggcctcgcagccattgtactggccctggccgcagtgtcggctgccgctcc RSAT mockup directories ~~~~~~~~~~~~~~~~~~~~~~~ For a number of organisms, RSAT does not provide the necessary files for a cmonkey run. In this case, the user can provide their own local directory that mirrors the structure, so it contains the files described above and specify the directory using --rsat_dir **Example:** .. highlight:: none :: organism.tab feature_names.tab feature.tab .tab .tab ... STRING protein-protein interactions ----------------------------------- STRING is a database of known and predicted protein-protein interactions. cMonkey uses these interactions to improve its clustering results. In order to do so, it builds a network using the interactions for the genes for the current organism. This network can then be used in the network scoring component of cMonkey. STRING is an enormous database and so it is a good idea to prepare the input to provide only the necessary data for cMonkey's network scoring algorithm, namely the names of the genes and their score. STRING files are tab-delimited files containing entries of the form .. highlight:: none :: Gene1Gene2Normalized Score cmonkey-python provides the utility ``extract_string_links.sh`` to write the interactions for an organism, given the KEGG code and a database file (either gzip'ed or plain). Protein-protein interaction network files have a very simple structure, each line in the file defines a a network edge with a weight, and each file is compressed with gzip, so the naming scheme is .gz. **Example (file name "64091.gz"):** .. highlight:: none :: VNG0001HVNG0002G865 VNG0001HVNG0003C802 VNG0001HVNG0005H561 ... Microbes Online operon files ~~~~~~~~~~~~~~~~~~~~~~~~~~~~ These files are typically provided by Microbes Online and define operon relationships of genes. These are only used for prokaryotic organisms. The name scheme that cmonkey uses for such files is gnc.named. See example for the format of the lines in a file. **Example (file name "gnc64091.named"):** .. highlight:: none :: Gene1Gene2SysName1SysName2Name1Name2bOppOpSepMOGScoreGOScoreCOGSimExprSim 6813068131VNG0001HVNG0002GVNG0001HyvrOTRUE0.950-3.0000.000NAN0.791 6813168132VNG0002GVNG0003CyvrOVNG0003CTRUE0.92030.0000.6001.000Y0.525 ...