Build Tutorial

In a nutshell

The BUILD scripts configure cMonkey2 to run on an SGE-managed cluster. In the process, they vary several cMonkey2 parameters to introduce variability between runs. They also choose which experiments to include in each run according to user-provided annotations.

Requirements

The user must supply several files, including:

  • ratios: a Tab-delimited or Comma-separated file containing a matrix of gene expression values across all conditions.
  • blocks: a Comma-separated file containing annotations for each condition.
  • inclusion_blocks: a Comma-separated file containing groups of blocks to be co-included in runs.
  • exclusion_blocks: a Comma-separated file containing groups of blocks to be co-excluded in runs.

Optionally:

  • pipeline: a JSON file containing a custom scoring pipeline. Currently only the set-enrichment pipeline is supported.
  • setenrich_files: Comma-separated files containing set-enrichment sets. Multiple files should be separated by a comma.

The format for each of these files will be described in detail below.

We have provided example files for building a Mycobacterium tuberculosis ensemble. The data come from the following publications:

EJR Peterson, DJ Reiss, S Turkarslan, KJ Minch, T Rustad, CL Plaisier, WJR Longabaugh, DR Sherman, NS Baliga. (2014). A high-resolution network model for global gene regulation in Mycobacterium tuberculosis. Nucleic Acids Research.

TR Rustad, KJ Minch, S Ma, JK Winkler, S Hobbes, MJ Hickey, W Brabant, S Turkarslan, ND Price, NS Baliga, DR Sherman. (2014). Mapping and manipulating the Mycobacterium tuberculosis transcriptome using a transcription factor overexpression-derived regulatory network. Genome Biology.

KJ Minch, TR Rustad, EJR Peterson, J Winkler, DJ Reiss, S Ma, M Hickey, W Brabant, B Morrison, S Turkarslan, C Mawhinney, JE Galagan, ND Price, NS Baliga, DR Sherman. (2014). The DNA-binding network of Mycobacterium tuberculosis. Nature Communications.

Additionally, the Python modules described in this documentation’s dependencies are required.

Scripts

  • egrin2-make_ensemble: The control function for BUILD scripts. Writes QSub script.
  • cMonkeyIniGen.py: Templating function to generate cMonkey2 initialization (.ini) files.
  • ensemblePicker.py: Picks experiments to include in a run given user-supplied experimental blocks

BUILD an EGRIN 2.0 ensemble

In this tutorial we will BUILD an ensemble for Mycobacterium tuberculosis using the example files that we provide.

STEP 1: Generate required input files

First, let’s explore the required files.

ratios.csv

A Tab-delimited or Comma-separated file containing a matrix of gene expression values across all conditions. Rows should correspond to genes and columns to individual conditions, as in the example file below.

Example ratios
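Before building an ensemble it can help to confirm that the ratios file really has the genes-by-conditions layout described above. The following is a minimal sketch, not part of the BUILD scripts. It assumes that the header row starts with a label cell followed by the condition names, that a tab in the header means the file is Tab-delimited, and that empty cells or NA mark missing values. The gene and condition names are made up for illustration.

```python
import csv
import io


def read_ratios(handle):
    """Parse a ratios matrix: rows are genes, columns are conditions."""
    first_line = handle.readline()
    # Assumption: a tab in the header means Tab-delimited, otherwise Comma-separated.
    delimiter = "\t" if "\t" in first_line else ","
    handle.seek(0)
    rows = [row for row in csv.reader(handle, delimiter=delimiter) if row]
    # Assumption: the first header cell is a label for the gene column.
    conditions = rows[0][1:]
    ratios = {}
    for row in rows[1:]:
        gene, values = row[0], row[1:]
        if len(values) != len(conditions):
            raise ValueError(
                f"gene {gene!r} has {len(values)} values, expected {len(conditions)}"
            )
        # Assumption: empty cells and 'NA' denote missing measurements.
        ratios[gene] = [float(v) if v not in ("", "NA") else None for v in values]
    return conditions, ratios


# Hypothetical matrix with two genes and three conditions.
example = io.StringIO(
    "GENE,cond_A,cond_B,cond_C\n"
    "Rv0001,0.12,-0.40,1.05\n"
    "Rv0002,-0.33,NA,0.27\n"
)
conditions, ratios = read_ratios(example)
print(conditions)        # ['cond_A', 'cond_B', 'cond_C']
print(ratios["Rv0002"])  # [-0.33, None, 0.27]
```

For a real file, pass an open file object, for example `read_ratios(open("ratios.csv", newline=""))`.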

blocks.csv

A Tab-delimited or Comma-separated file containing the block membership for each experiment in the dataset as below.

Example blocks
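Since every experiment is expected to have a block membership, a quick consistency check between the ratios columns and blocks.csv can catch annotation gaps early. This is a sketch under assumptions: it supposes that blocks.csv is Comma-separated with a header row containing columns named sample and block (the names used in the ensemble reports), which may differ from your file. The condition and block names are hypothetical.

```python
import csv
import io


def unannotated_conditions(ratio_conditions, blocks_handle):
    """Return the conditions from the ratios file that have no block annotation."""
    # Assumption: blocks.csv has 'sample' and 'block' columns.
    reader = csv.DictReader(blocks_handle)
    sample_to_block = {row["sample"]: row["block"] for row in reader}
    return [c for c in ratio_conditions if c not in sample_to_block]


# Hypothetical annotations: cond_C is missing from the blocks file.
blocks_csv = io.StringIO("sample,block\ncond_A,hypoxia\ncond_B,hypoxia\n")
missing = unannotated_conditions(["cond_A", "cond_B", "cond_C"], blocks_csv)
print(missing)  # ['cond_C']
```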

inclusion_blocks.csv

A Tab-delimited or Comma-separated file containing definitions for groups of blocks to be co-included in a cMonkey2 run. The blocks in a group are not strictly co-included in every cMonkey2 run. Rather, placing a block into an inclusion block increases the likelihood that it will be selected for a run once another block from the same inclusion block has already been included in that run.

Within each group, the block names should be separated by :::, as indicated in the template below. The names of each block should be the same as those defined in blocks.csv above.

The block.sample.num column is not required, but can be useful for evaluating the blocks.

Example inclusion_blocks
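The ::: convention can be checked mechanically. The sketch below splits one group string into its block names and verifies each against the set of blocks defined in the blocks file. The function name and the example block names are hypothetical, and this is not how ensemblePicker.py itself parses the file.

```python
def split_block_group(group, known_blocks):
    """Split a ':::'-separated group and check every name against the known blocks."""
    names = [name.strip() for name in group.split(":::")]
    unknown = [name for name in names if name not in known_blocks]
    if unknown:
        raise ValueError(f"blocks not defined in the blocks file: {unknown}")
    return names


# Hypothetical block names, as they might appear in the blocks file.
known = {"hypoxia", "reaeration", "tfoe"}
print(split_block_group("hypoxia:::reaeration", known))  # ['hypoxia', 'reaeration']
```

The same check applies unchanged to the groups in an exclusion blocks file, since both files share this format.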

exclusion_blocks.csv

A Tab-delimited or Comma-separated file containing definitions for groups of blocks to be co-excluded from a cMonkey2 run. Typically these blocks are defined for testing purposes. If they are not defined, the intent is for several random exclusion blocks to be defined in order to evaluate model over-fitting (this is not yet implemented).

The format for the exclusion blocks is the same as for the inclusion blocks above. Within each group, the block names should be separated by :::, as indicated in the template below. The names of each block refer to those defined in blocks.csv above.

Again, the block.sample.num column is not required.

Example exclusion_blocks

Each of these files needs to be supplied by the user.

(Optional) Generate custom scoring pipeline and files

cMonkey2 can use a custom scoring pipeline. Currently, this is limited to set-enrichment.

In this example, we will generate an ensemble that uses set-enrichment scoring. We do so by providing several additional options to the egrin2-make_ensemble script, namely:

  • pipeline: a JSON file containing a custom scoring pipeline. Currently only the set-enrichment pipeline is supported.
  • setenrich: Name(s) of set enrichment ‘sets’ to include. Names should be comma separated.
  • setenrich_files: Comma-separated files containing set-enrichment sets. Multiple files should be separated by a comma.

Users interested in building custom scoring pipelines or configuring set-enrichment should consult the cMonkey2 documentation.

STEP 2: Call egrin2-make_ensemble

Generating an ensemble configuration entails calling a single Python script. There are a number of required and optional parameters that can be passed to this command line function, which are described below:

$ egrin2-make_ensemble -h

usage: egrin2-make_ensemble [-h] --organism ORGANISM --ratios RATIOS
                            --targetdir TARGETDIR [--numruns NUMRUNS]
                            [--ncbi_code NCBI_CODE] [--mincols MINCOLS]
                            [--num_cores NUM_CORES] [--max_tasks MAX_TASKS]
                            [--user USER] [--csh] [--blocks BLOCKS]
                            [--inclusion INCLUSION] [--exclusion EXCLUSION]
                            [--pipeline PIPELINE] [--setenrich SETENRICH]
                            [--setenrich_files SETENRICH_FILES]
                            [--rsat_base_url RSAT_BASE_URL]

egrin2-make_ensemble - prepare cluster runs for Sun Grid Engine

optional arguments:
  -h, --help            show this help message and exit
  --organism ORGANISM   3 letter organism code
  --ratios RATIOS       Path to ratios file
  --targetdir TARGETDIR
                        Path to output directory
  --numruns NUMRUNS     Number of cMonkey2 runs to configure
  --ncbi_code NCBI_CODE
                        NCBI organism code
  --mincols MINCOLS     Minimum number of experiments to include in a cMonkey2
                        run
  --num_cores NUM_CORES
                        Number of cores on cluster to request
  --max_tasks MAX_TASKS
                        Maximum number of jobs to be sent to the cluster at a
                        time
  --user USER           Cluster user name
  --csh                 Flag to indicate C Shell
  --blocks BLOCKS       Path to block definitions
  --inclusion INCLUSION
                        Path to inclusion block definitions
  --exclusion EXCLUSION
                        Path to exclusion block definitions
  --pipeline PIPELINE   Path to scoring pipeline config file
  --setenrich SETENRICH
                        Name(s) of set enrichment 'sets' to include. Names
                        should be comma separated.
  --setenrich_files SETENRICH_FILES
                        Set enrichment files. File paths should be comma
                        separated.
  --rsat_base_url RSAT_BASE_URL
                        Alternative RSAT base URL.

Here we will concentrate on the required arguments.

Required Arguments

  • organism: 3-letter organism code
  • ratios: ratios file described above
  • targetdir: location of a directory in which to configure the cMonkey2 runs

If you do not supply block files as described above, the experiments to include in each run are meant to be chosen randomly, and several random exclusion_blocks are meant to be defined for testing. This is currently not supported: block files must be supplied (ANB 03042015).

Here we will assume that the required files are in the local working directory. Furthermore, we will assume that egrin2-make_ensemble is in the working directory and that all of the required modules are in your $PYTHONPATH.

For the following example, we will generate 10 cMonkey2 runs.

On the command line this would be called as follows:

$ egrin2-make_ensemble --organism mtu --ratios ratios.csv --targetdir mtu-ens-2014 --numruns 10 --blocks blocks.csv --inclusion inclusion_blocks.csv --exclusion exclusion_blocks.csv --pipeline setenrich_pipeline.json --setenrich chipseq,tfoe --setenrich_files ChIPSeq.csv,DE.csv --csh
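When the call needs to be scripted, for instance to build ensembles for several datasets, the argument list can be assembled in Python. This is only a convenience sketch: build_command is a hypothetical helper, and the flags it emits are the ones listed in the help output above.

```python
import shlex


def build_command(organism, ratios, targetdir, csh=False, **options):
    """Assemble an egrin2-make_ensemble argument list from keyword options."""
    cmd = [
        "egrin2-make_ensemble",
        "--organism", organism,
        "--ratios", ratios,
        "--targetdir", targetdir,
    ]
    # Each keyword becomes '--<name> <value>', e.g. numruns=10 -> '--numruns 10'.
    for name, value in options.items():
        cmd += [f"--{name}", str(value)]
    if csh:
        cmd.append("--csh")
    return cmd


cmd = build_command(
    "mtu", "ratios.csv", "mtu-ens-2014",
    numruns=10,
    blocks="blocks.csv",
    inclusion="inclusion_blocks.csv",
    exclusion="exclusion_blocks.csv",
    csh=True,
)
print(shlex.join(cmd))
# To actually run it: import subprocess; subprocess.run(cmd, check=True)
```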

If the script runs successfully, it should print status messages, populate the mtu-ens-2014 directory with ratios-xxx.tsv files and config-xxx.ini files, and generate several report files in the parent directory. The ensemble report files contain information about the run composition, detailed for each report file below:

STEP 3: Evaluate ensembleReport files

ensembleReport_runs.csv

Global report of the ensemble run compositions.

Example ensemble report runs

The columns of this report file signify:

  • run_num: cMonkey run number
  • ncols: number of experiments included in the run
  • excluded: exclusion block that was excluded from the run (i.e., none of the conditions in this block will be in the run)
  • blocks: blocks that were included in the run
  • cols: names of experiments (from column names of ratios.csv) that were included in the run
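One simple use of this report is to flag runs that contain fewer experiments than intended. The sketch below assumes the report is Comma-separated with a header row using the column names listed above. The demo rows are invented, and the way multiple names are joined inside the cols field of the demo is an assumption.

```python
import csv
import io


def runs_below_mincols(report_handle, mincols):
    """Return the run numbers whose 'ncols' value is smaller than mincols."""
    return [
        row["run_num"]
        for row in csv.DictReader(report_handle)
        if int(row["ncols"]) < mincols
    ]


# Hypothetical report contents; the separator inside 'cols' is made up.
report = io.StringIO(
    "run_num,ncols,excluded,blocks,cols\n"
    "1,3,tfoe,hypoxia,cond_A cond_B cond_C\n"
    "2,1,hypoxia,tfoe,cond_D\n"
)
small_runs = runs_below_mincols(report, mincols=2)
print(small_runs)  # ['2']
```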

ensembleReport_cols.csv

Extends blocks.csv. Reports how often each condition was included in the ensemble.

Example ensemble report cols

The columns of this report file signify:

  • sample: experiment name, from blocks.csv
  • block: block to which experiment belongs, from blocks.csv
  • ensemble_freq: rate of inclusion in the ensemble
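Conditions that never made it into any run are worth a second look, since the ensemble carries no information about them. The following sketch groups such conditions by block. It assumes a Comma-separated report with the three columns listed above and that ensemble_freq is a number where 0 means "never included". The rows shown are hypothetical.

```python
import csv
import io
from collections import defaultdict


def never_included(report_handle):
    """Group the samples whose ensemble_freq is zero by their block."""
    by_block = defaultdict(list)
    for row in csv.DictReader(report_handle):
        if float(row["ensemble_freq"]) == 0.0:
            by_block[row["block"]].append(row["sample"])
    return dict(by_block)


# Hypothetical report contents.
cols_report = io.StringIO(
    "sample,block,ensemble_freq\n"
    "cond_A,hypoxia,0.6\n"
    "cond_B,hypoxia,0.0\n"
    "cond_D,tfoe,0.4\n"
)
unused = never_included(cols_report)
print(unused)  # {'hypoxia': ['cond_B']}
```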

ensembleReport_blocks.csv

Reports how often each block was included in the ensemble.

Example ensemble report blocks

The columns of this report file signify:

  • block: block name
  • block_sample_num: number of experiments annotated by this block
  • ensemble_freq: rate of inclusion in the ensemble

ensembleReport_inclusionBlocks.csv

Reports how often each inclusion block was included in the ensemble.

Example ensemble report inclusion blocks

The columns of this report file signify:

  • inclusion_blocks: inclusion block name, with its blocks separated by :::
  • block_sample_num: number of experiments included in this inclusion block
  • freq_single: rate at which a single block from this inclusion_block is included in the ensemble
  • freq_coinclusion: rate at which at least 2 of the blocks from this inclusion_block are co-included in a single run
  • max_coinclusion_all: rate at which ALL of the blocks from this inclusion_block are co-included in a single run across all ensemble runs
  • max_coinclusion_sub: rate at which ALL of the blocks from this inclusion_block are co-included in a single run across all ensemble runs in which at least one of the blocks occurs
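The three co-inclusion rates are easier to tell apart with a small worked example. The sketch below computes them from a list of runs, each represented as the set of blocks it contains. It reflects one reading of the definitions above and is not taken from the BUILD scripts. freq_single is left out because the description does not say whether it counts runs with exactly one or at least one block from the group. All block names are hypothetical.

```python
def coinclusion_rates(runs, group):
    """Compute co-inclusion rates for one inclusion block over a list of runs."""
    group = set(group)
    # Number of blocks from the group that appear in each run.
    hits = [len(group & set(run)) for run in runs]
    n_runs = len(runs)
    n_any = sum(1 for h in hits if h >= 1)           # runs with at least one block
    n_pair = sum(1 for h in hits if h >= 2)          # runs with at least two blocks
    n_all = sum(1 for h in hits if h == len(group))  # runs with every block
    return {
        "freq_coinclusion": n_pair / n_runs,
        "max_coinclusion_all": n_all / n_runs,
        "max_coinclusion_sub": n_all / n_any if n_any else 0.0,
    }


# Four hypothetical runs and a three-block inclusion block.
runs = [
    {"hypoxia", "reaeration", "tfoe"},
    {"hypoxia", "tfoe"},
    {"chipseq"},
    {"hypoxia", "reaeration"},
]
rates = coinclusion_rates(runs, ["hypoxia", "reaeration", "tfoe"])
print(rates)
```

Here three of the four runs contain at least two blocks from the group (0.75), one run contains all three (0.25 across all runs), and that one run is a third of the three runs in which any block from the group occurs.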

ensembleReport_exclusionBlocks.csv

Reports how often each exclusion block was excluded in the ensemble.

Example ensemble report exclusion blocks

The columns of this report file signify:

  • exclusion_blocks: exclusion block name
  • block_sample_num: number of experiments annotated by this exclusion block
  • excluded_freq: rate of exclusion in the ensemble

STEP 4: Transfer to cluster and run cMonkey2

The entire targetdir (e.g. mtu-ens-2014 in our example) is now ready to be transferred to the cluster, where you can generate the ensemble by running <org>.sh, where <org> is the 3-letter organism code you provided. In our example case this would be mtu.sh.

Example for an SGE submission:

$ cd <target directory>
$ qsub mtu.sh
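Before transferring the directory, it can be worth confirming that it is complete. The following sketch checks that every run has both a ratios-xxx.tsv and a config-xxx.ini file and that the <org>.sh script is present. It assumes that xxx is a run identifier shared by the two files of a run. check_targetdir is a hypothetical helper, and the demo builds a throwaway directory rather than touching real output.

```python
import tempfile
from pathlib import Path


def check_targetdir(targetdir, organism):
    """Return a list of problems found in an ensemble target directory."""
    target = Path(targetdir)
    # Assumption: 'ratios-<id>.tsv' and 'config-<id>.ini' share the same run id.
    ratio_ids = {p.stem.split("-", 1)[1] for p in target.glob("ratios-*.tsv")}
    config_ids = {p.stem.split("-", 1)[1] for p in target.glob("config-*.ini")}
    problems = [
        f"run {run_id}: missing ratios or config file"
        for run_id in sorted(ratio_ids ^ config_ids)
    ]
    if not (target / f"{organism}.sh").is_file():
        problems.append(f"missing submission script {organism}.sh")
    return problems


# Demo on a temporary directory in which run 002 lacks its config file.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("ratios-001.tsv", "config-001.ini", "ratios-002.tsv", "mtu.sh"):
        (Path(tmp) / name).touch()
    problems = check_targetdir(tmp, "mtu")
print(problems)  # ['run 002: missing ratios or config file']
```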