Build Tutorial¶

In a nutshell¶

The BUILD scripts configure cMonkey2 to run on an SGE managed cluster. In the process, they change several cMonkey2 parameters to generate variability in each run. They also choose experiments to include in each run according to user-provided annotations.

Requirements¶

User must supply several files, including:

ratios: a Tab-delimited or Comma-separated file containing a matrix of gene expression values across all conditions.

blocks: a Comma-separated file containing annotations for each condition.

inclusion_blocks: a Comma-separated file containing groups of blocks to be co-included in runs.

exclusion_blocks: a Comma-separated file containing groups of blocks to be co-excluded in runs.

Optionally:

pipeline: a JSON file containing custom scoring pipeline. Currently only set-enrichment pipeline is supported

setenrich_files: Comma-separated files containing set-enrichment sets. Multiple files should be separated by a comma.

The format for each of these files will be described in detail below.

We have provided example files for building an Mycobacterium Tuberculosis ensemble. The data come from the following publications:

EJR Peterson, DJ Reiss, S Turkarslan, KJ Minch, T Rustad, CL Plaisier, WJR Longabaugh, DR Sherman, NS Baliga. (2014). A high-resolution network model for global gene regulation in Mycobacterium tuberculosis. Nucleic Acids Research.

TR Rustad, KJ Minch, S Ma, JK Winkler, S Hobbes, MJ Hickey, W Brabant, S Turkarslan, ND Price, NS Baliga and DR Sherman. (2014). Mapping and manipulating the Mycobacterium tuberculosis transcriptome using a transcription factor overexpression-derived regulatory network. Genome Biology

KJ Minch, TR Rustad, EJR Peterson, J Winkler, DJ Reiss, S Ma, M Hickey, W Brabant, B Morrison, S Turkaslan, C Mawhinney, JE Galagan, ND Price, NS Baliga, DR. Sherman. (2014). The DNA-binding network of Mycobacterium tuberculosis. Nature Communications

Additionally, the Python modules described in this documentation’s dependencies are required.

Scripts¶

egrin2-make_ensemble: The control function for BUILD scripts. Writes QSub script.

cMonkeyIniGen.py: Templating function to generate cMonkey2 initialization (.ini) files.

ensemblePicker.py: Picks experiments to include in a run given user-supplied experimental blocks

BUILD an EGRIN 2.0 ensemble¶

In this tutorial we will BUILD an ensemble for Mycobacterium tuberculosis using several example files, which we provide here

STEP 1: Generate required input files¶

First, let’s explore the required files.

`ratios.csv`¶

A Tab-delimited or Comma-separated file containing a matrix of gene expression values across all conditions. Rows should correspond to genes and columns to individual conditions, as in the example file below.

`blocks.csv`¶

A Tab-delimited or Comma-separated file containing the block membership for each experiment in the dataset as below.

`inclusion_blocks.csv`¶

A Tab-delimited or Comma-separated file containing definitions for groups of blocks to be co-included in a cMonkey2 run, as defined above. These blocks are not strictly co-included in every cMonkey2 run, rather placement of a block into an inclusion block increases the likelihood that a particular block will be selected given that another block from its inclusion block has already been included in a particular cMonkey run.

Each group of blocks should be separated by :::, as indicated in the template below. The names of each block should be the same as those defined in blocks.csv.gz above.

The block.sample.num column is not required, but can be useful for evaluating the blocks.

`exclusion_blocks.csv`¶

A Tab-delimited or Comma-separated file containing definitions for groups of blocks to be co-excluded from a cMonkey2 run. Typically these blocks are defined for testing purposes. If they are not defined, several random exclusion blocks will be defined in order to evaluate model over-fitting (to be implemented)

The format for the exclusion blocks is the same as the inclusion blocks before. Each group of blocks should be separated by :::, as indicated in the template below. The names of each block refer to those defined in blocks.csv.gz above.

Again, the block.sample.num column is not required.

Each of these files need to be supplied by the user.

(Optional) Generate custom scoring pipeline and files¶

cMonkey2 can use a custom scoring pipeline. Currently, this is limited to set-enrichment.

In this example, we will generate an ensemble. We do so by providing several additional options to the egrin2-make_ensemble function, namely:

pipeline: a JSON file containing custom scoring pipeline. Currently only set-enrichment pipeline is supported

setenrich: Name(s) of set enrichment ‘sets’ to include. Names should be comma separated.

setenrich_files: Comma-separated files containing set-enrichment sets. Multiple files should be separated by a comma.

Users interested in building custom scoring pipelines or configuring set-enrichment should consult the cMonkey2 documentation.

STEP 2: Call egrin2-make_ensemble¶

Generating an ensemble configuration entails calling a single Python script. There are a number of required and optional parameters that can be passed to this command line function, which are described below:

$ egrin2-make_ensemble -h

usage: egrin2-make_ensemble [-h] --organism ORGANISM --ratios RATIOS
                            --targetdir TARGETDIR [--numruns NUMRUNS]
                            [--ncbi_code NCBI_CODE] [--mincols MINCOLS]
                            [--num_cores NUM_CORES] [--max_tasks MAX_TASKS]
                            [--user USER] [--csh] [--blocks BLOCKS]
                            [--inclusion INCLUSION] [--exclusion EXCLUSION]
                            [--pipeline PIPELINE] [--setenrich SETENRICH]
                            [--setenrich_files SETENRICH_FILES]
                            [--rsat_base_url RSAT_BASE_URL]

egrin2-make_ensemble - prepare cluster runs for Sun Grid Engine

optional arguments:
  -h, --help            show this help message and exit
  --organism ORGANISM   3 letter organism code
  --ratios RATIOS       Path to ratios file
  --targetdir TARGETDIR
                        Path to output directory
  --numruns NUMRUNS     Number of cMonkey2 runs to configure
  --ncbi_code NCBI_CODE
                        NCBI organism code
  --mincols MINCOLS     Minimum number of experiments to include in a cMonkey2
                        run
  --num_cores NUM_CORES
                        Number of cores on cluster to request
  --max_tasks MAX_TASKS
                        Maximum number of jobs to be sent to the cluster at a
                        time
  --user USER           Cluster user name
  --csh                 Flag to indicate C Shell
  --blocks BLOCKS       Path to block definitions
  --inclusion INCLUSION
                        Path to inclusion block definitions
  --exclusion EXCLUSION
                        Path to exclusion block definitions
  --pipeline PIPELINE   Path to scoring pipeline config file
  --setenrich SETENRICH
                        Name(s) of set enrichment 'sets' to include. Names
                        should be comma separated.
  --setenrich_files SETENRICH_FILES
                        Set enrichment files. File paths should be comma
                        separated.
  --rsat_base_url RSAT_BASE_URL
                        Alternative RSAT base URL.

Here we will concentrate on the required arguments.

Required Arguments¶

organism: 3-letter organism code

ratios: ratios file described above

targetdir: location of a directory in which to configure the cMonkey2 runs

If you do not supply block files as described above, the experiments to include in each run will be choosen randomly. Several random exclusion_blocks will be defined for testing (currently not supported - block files must be supplied / ANB 03042015)

Here we will assume that the required files are in the local working directory. Furthermore, we will assume that the egrin2-make_ensemble is in the working directory and that all of the required modules are in your $PYTHONPATH.

For the following example, we will generate 5 cMonkey2 runs.

On the command line this would be called as follows:

$ egrin2-make_ensemble --organism mtu --ratios ratios.csv --targetdir mtu-ens-2014 --numruns 10 --blocks blocks.csv --inclusion inclusion_blocks.csv --exclusion exclusion_blocks.csv --pipeline setenrich_pipeline.json --setenrich chipseq,tfoe --setenrich_files ChIPSeq.csv,DE.csv --csh

If the script runs successfully, they should print the messages above, populate the mtu-ens-2014 directory with ratios-xxx.tsv files and config-xxx.ini files, as well as generate several report files in the parent directory. The ensemble report files contain information about the run composition, detailed for each report file below: