Walkthrough of a Complete cMonkey2 Run

Setting up for a cMonkey2 run

  • Input required The only required input are
    1. The gene expression matrix file (tab-separated values file)
    2. The KEGG organism code (e.g., eco for E. coli; hal for H. salinarum)
  • Input automatically downloaded Given the KEGG organism code, cMonkey2 downloads the following data from various online databases and caches them locally:
    1. Genome sequences and annotations (currently from RSAT)
    2. Operon membership predictions (for prokaryotes)
    3. STRING functional gene associations
  • Input optional Custom files may be use as additional input (or to replace downloaded files):
    1. Network
    2. Genome
    3. Annotation

Starting the run

cMonkey2 is typically started by running the front-end cmonkey2 with additional parameters.

Example:

cmonkey2 --organism hal --rsat_base_url http://networks.systemsbiology.net/rsat ratios.tsv

The following phases will be executed:

  1. Initialization
  2. Scoring
  3. Post-processing

Initialization steps

Before cMonkey2 runs the scoring iterations, it ensures that the required data is in place via the following steps:

  1. normalize the ratio matrix
  2. retrieve organism information and data from RSAT
  3. build gene synonym lookup table
  4. retrieve gene locations and parse promoter sequences
  5. download and normalize operon network from Microbes Online (if necessary)
  6. download and normalize STRING network (if necessary)
  7. initialize the starting biclusters

What is happening at each iteration

These are the essential steps in a cMonkey2 iteration:

  1. run row scoring functions
  2. run column scoring function
  3. combine the scores via the combiner
  4. use scores to update cluster membership

Each scoring function can be individually configured to follow a specific schedule and the weights which define how much its results contribute to the final combined score can be updated following that schedule as well.

It should be noted that motif scoring has two schedules, one for running the MEME suite pipeline (typically every 100 iterations) and another one to apply the results of the MEME suite pipeline step to the current clusters memberships.

Inside a scoring module

The building blocks of a cMonkey2 run are scoring modules. A scoring module is a Python class that implements the methods

def compute(self, iteration_result, reference_matrix)
def compute_force(self, iteration_result, reference_matrix)

The only difference between these do functions is that compute_force() always performs a computation, while compute() only runs in the iterations it is scheduled for.

The result of these calls is a DataMatrix object, where the columns represent the clusters, and the rows represent the genes or conditions, depending on whether the scoring function is a row or a column scoring function. Each cell in the matrix contains the corresponding score for a gene/condition in a specific cluster.

cMonkey2 comes with a number of built-in scoring modules, which are part of the standard scoring setup. It is possible to create user-defined scoring modules; in this case it is recommended to inherit from the ScoringFunctionBase class, which provides a lot of useful functionality.

Monitoring progress

Throughout the computation, cMonkey2 writes out its progress into its result database. Users can directly query the data contained in the database using regular data extraction tools. cMonkey2 also comes with a web browser-based monitoring application that can be used to view a graphical representation of the run.

Post-processing steps

Results data format

Please see details about the result database schema here

Viewing results

Please see details about how to view results here

Plugging in cMonkey Results for downstream analysis