Assemble Tutorial
=================

**Important!!! This tutorial assumes you have access to a complete cMonkey2 ensemble.**

In a nutshell
-------------

The ASSEMBLE scripts transfer and compile individual cMonkey2 SQLite databases into an integrated MongoDB database.

In addition, they perform several post-processing steps, including: detection of gene regulatory elements (GREs) by comparing individual bicluster motifs with TOMTOM and clustering with MCL, genome-wide scanning of motifs with FIMO, and detection of co-regulated modules or **corems** using link-community detection.

Requirements
------------

  * MongoDB >= 2.4.9
  * compiled C++ scripts for corem detection, available here

IMPORTANT: This tutorial currently assumes that ``TOMTOM``, ``MCL`` and ``FIMO`` have already been run.

A single GRE definition file is read from, eg:

.. highlight:: none

::

   /ensemble-head-dir
       /out.mot_metaclustering.txt.I45.txt

FIMO scans are read from each run sub-directory, eg:

.. highlight:: none

::

   /ensemble-head-dir
       /org-out-xxx
           /fimo-outs
               /fimo-out-xxxx.bz2

Optional:

  * ``row_annot``: tab-delimited row (gene) annotations. Will be downloaded from MicrobesOnline automatically using ``--ncbi_code`` if undefined
  * ``col_annot``: tab-delimited column (condition) annotations.

The format for these files will be described in detail below.

Additionally, the Python modules described on the Home page are required to run these scripts.

Scripts
-------

  * ``egrin2-assemble``: The control function for ASSEMBLE scripts.
  * ``makeCorems.py``: Identifies corems using C++ scripts compiled above
  * ``resample_QSub.py``: Generates QSub script for submission of resamples to cluster
  * ``sql2mongoDB.py``: Merges individual cMonkey SQLite dbs and post-processing data into MongoDB


.. highlight:: none

::

   $ egrin2-assemble -h

    usage: egrin2-assemble [-h] --organism ORGANISM --ratios RATIOS
                           [--targetdir TARGETDIR] [--backbone_pval BACKBONE_PVAL]
                           [--cores CORES] [--link_comm_score LINK_COMM_SCORE]
                           [--link_comm_increment LINK_COMM_INCREMENT]
                           [--link_comm_density_score LINK_COMM_DENSITY_SCORE]
                           [--corem_size_threshold COREM_SIZE_THRESHOLD]
                           [--n_resamples N_RESAMPLES]
                           [--cluster_arch CLUSTER_ARCH] [--sge_user SGE_USER]
                           [--dbengine DBENGINE] [--host HOST] [--port PORT]
                           [--targetdb TARGETDB] [--prefix PREFIX]
                           [--ensembledir ENSEMBLEDIR] [--col_annot COL_ANNOT]
                           [--row_annot ROW_ANNOT]
                           [--row_annot_match_col ROW_ANNOT_MATCH_COL]
                           [--gre2motif GRE2MOTIF] [--genome_annot GENOME_ANNOT]
                           [result_dbs [result_dbs ...]]

    assemble.py - prepare cluster runs

    positional arguments:
      result_dbs

    optional arguments:
      -h, --help            show this help message and exit
      --organism ORGANISM   3 letter organism code
      --ratios RATIOS       Path to ratios file. Should be 'raw' (normalized)
                            ratios, not the standardized ratios used by cMonkey
      --targetdir TARGETDIR
                            Storage path for MongoDB and corem data
      --backbone_pval BACKBONE_PVAL
                            Significance pvalue for gene-gene backbone. Default =
                            0.05.
      --cores CORES         Number local cores to use for corem C++ scripts
      --link_comm_score LINK_COMM_SCORE
                            Scoring metric for link communities
      --link_comm_increment LINK_COMM_INCREMENT
                            Height increment for cutting agglomerative clustering
                            of link communities
      --link_comm_density_score LINK_COMM_DENSITY_SCORE
                            Density score for evaluating link communities
      --corem_size_threshold COREM_SIZE_THRESHOLD
                            Defines minimum corem size. Default = 3.
      --n_resamples N_RESAMPLES
                            Number resamples to compute for corem condition
                            assignment. Default = 10,000
      --cluster_arch CLUSTER_ARCH
                            where to run resampling on
      --sge_user SGE_USER   Cluster user name
      --dbengine DBENGINE   mongodb or sqlite
      --host HOST           MongoDB host. Default 'localhost'
      --port PORT           MongoDB port
      --targetdb TARGETDB   Optional ensemble MongoDB database name
      --prefix PREFIX       Ensemble run prefix. Default: *organism*-out-
      --ensembledir ENSEMBLEDIR
                            Path to ensemble runs. Default: cwd
      --col_annot COL_ANNOT
                            Tab-delimited file with experiment annotations
      --row_annot ROW_ANNOT
                            Optional row (gene) annotation tab-delimited file. If
                            not specified, annotations will be downloaded from
                            MicrobesOnline using --ncbi_code.
      --row_annot_match_col ROW_ANNOT_MATCH_COL
                            Name of column in row_annot that matches row names in
                            ratios file.
      --gre2motif GRE2MOTIF
                            Motif->GRE clustering file
      --genome_annot GENOME_ANNOT
                            Optional genome annotation file. Automatically
                            downloaded from MicrobesOnline using --ncbi_code


ASSEMBLE an EGRIN 2.0 ensemble
------------------------------

In this tutorial we will see how you would ASSEMBLE an *Escherichia coli* EGRIN 2.0 ensemble using several example files and a couple of cMonkey2 runs, which we provide here.

STEP 1: Generate optional input files
-------------------------------------

First, let's explore the optional annotation files. Providing annotations for genes and conditions is a great way to enrich your analysis of the ensemble. You can get a better idea for the utility of these metainformation by following the advanced mining tutorial

``row_annot``
~~~~~~~~~~~~~

As noted above, the ``row_annot`` file will be downloaded automatically from MicrobesOnline if a custom annotation is not provided. If you provide your own row_annot file, however, you will also need to specificy ``--row_annot_matchCol``, which is the name of the column in your annotation file that matches the gene name used by cMonkey2 (i.e. the row names in your ratios file).

The row annotation file should look like the annotation file supplied by MicrobesOnline, where each row specifies a gene and each of the columns specifies some information about that gene. Again, you must ensure that at least one of the columns contains gene names that match the gene names in the ratios file used by cMonkey2, in the case of MicrobesOnline, it is the ``sysName`` column.

Here is an example annotation fil e for *E. coli* direct from MicrobesOnline, the file itself is available here.

.. figure:: _static/assemble/row_annot.png
            :alt: Example row_annot file

``col_annot``
~~~~~~~~~~~~~

The col_annot file provides metainformation about each experiment. Like the row_annot file, these annotations are optional, but they can be valuable for making sense of ensemble predicitions.

Please note that the file format is different here. Each row contains a particular experimental meta-annotation followed by several required descriptions: (1) experiment_name, (2) feature_name, (3) value, (4) feature_units, (5) feature_type. The experiment_name column should match the experiment name in the ratios file.

The `col_annot` file should look like the tab-delimited file depicted below. You can download an *E. coli* `col_annot` file to use as a template here

.. figure:: _static/assemble/col_annot.png
            :alt: Example col_annot file


STEP 2: Run ``egrin2-assemble``
-------------------------------

**Important!!! TOMTOM, MCL, and FIMO should be run prior to assembly. Otherwise the ensemble will not contain GREs or motif scans.**

To run the assembler, you must supply several files. At a minimum, you should supply:

  * ``--organism``: 3 letter organism code
  * ``--ratios``: Path to ratios file. Should be 'raw' (normalized) ratios, not the standardized ratios used by cMonkey
  * ``--targetdir``: Storage path for MongoDB and corem data

*Note: The MongoDB engine has been replaced in favor of storing data in
sqlite because it does not require a running database server. Furthermore
the assembly step does not require a cluster anymore*

.. highlight:: none

::

  $ egrin2-assemble --organism mtu --ratios <ratios file> --targetdir <output directory> --targetdb <output database name>

This will run the entire assembly step