Configuration and Run Options¶
cMonkey2 retrieves its configuration parameters through two mechanisms:
- a hierarchy of configuration files
- command line options
When the application is started, it loads its default configuration file, default.ini. It is recommended to leave the settings in this file unmodified.
Users can override the default settings by supplying one or more custom configuration files with the --config <config file 1> <config file 2>... command-line option. Only the settings that should be changed need to be provided. If multiple configuration files are specified, their settings are merged.
Some parameters can also be provided on the command line. In general, settings in user-provided configuration files override the defaults, and command line parameters override configuration file parameters.
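The precedence order described above (defaults, then user files, then command line) can be illustrated with a minimal Python sketch. It uses the standard configparser module and is not cMonkey2's actual loading code. The file contents, the num_iterations key in [General], and the merge_settings helper are assumptions made for this illustration only.

```python
import configparser

# Stand-ins for default.ini and a user-supplied file.
# The keys and values are invented for this illustration.
DEFAULT_INI = """
[General]
num_iterations = 2000

[MEME]
schedule = 1,100
"""

USER_INI = """
[MEME]
schedule = 300,100
"""


def merge_settings(ini_texts, cli_overrides):
    """Apply INI sources in order, then command-line overrides last."""
    parser = configparser.ConfigParser()
    for text in ini_texts:
        # Keys read from a later source overwrite keys from earlier ones.
        parser.read_string(text)
    for (section, key), value in cli_overrides.items():
        if not parser.has_section(section):
            parser.add_section(section)
        parser.set(section, key, str(value))
    return parser


settings = merge_settings(
    [DEFAULT_INI, USER_INI],
    {("General", "num_iterations"): 500},  # hypothetical command-line value
)
print(settings.get("MEME", "schedule"))           # taken from the user file
print(settings.get("General", "num_iterations"))  # taken from the command line
```

Settings that appear in only one source are carried through unchanged, which is why a user file only needs to list the values it changes.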
Command line options¶
A current list of the available command line options for cmonkey.py and their meaning can be retrieved by entering
cmonkey2 --help
Configuration (.ini) files¶
cmonkey.py uses .ini files as a generic configuration mechanism for global and scoring-specific settings. Global settings are defined in the [General], [Membership] and [Scoring] sections, while each scoring function can implement its own set of specific configuration settings. The section for each scoring function is typically named after that scoring function.
Useful configuration constructs¶
Scheduling¶
Scheduling is a central part of cMonkey2. It defines in which iterations certain parts of the scoring algorithm will be active. For example, MEME is not run in every iteration, because its invocation is very costly. Instead, it is usually defined to run every 100 iterations or so. In order to define MEME to run every 100 iterations starting from iteration 300, we would define it like this:
[MEME]
schedule=300,100
Or the user might only want to call MEME in iteration 42:
[MEME]
schedule=42
We can define even more flexible schedules by combining several sub-schedules within one schedule, separating the sub-schedules with the ':' character.
[MEME]
schedule=15:42:300,100
The above schedule says “run MEME in iterations 15 and 42, and then every 100 iterations starting from iteration 300”. These principles can be applied to any setting that defines a schedule, e.g. network scoring, row scoring etc.
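To make the schedule notation concrete, the following Python sketch interprets a schedule string in the way the text describes: ':' separates sub-schedules, and a sub-schedule with a ',' is read as a start iteration followed by an interval. The function names parse_schedule and is_active are hypothetical and this is not cMonkey2's own parser.

```python
def parse_schedule(spec):
    """Parse a schedule string such as '15:42:300,100' into sub-schedules.

    Each sub-schedule becomes a (start, interval) pair; interval is None
    when the sub-schedule names a single iteration.
    """
    subschedules = []
    for part in spec.split(":"):
        if "," in part:
            start, interval = (int(x) for x in part.split(","))
            subschedules.append((start, interval))
        else:
            subschedules.append((int(part), None))
    return subschedules


def is_active(subschedules, iteration):
    """Return True if any sub-schedule fires in the given iteration."""
    for start, interval in subschedules:
        if interval is None:
            if iteration == start:
                return True
        elif iteration >= start and (iteration - start) % interval == 0:
            return True
    return False


schedule = parse_schedule("15:42:300,100")
active = [i for i in range(1, 601) if is_active(schedule, i)]
print(active)  # [15, 42, 300, 400, 500, 600]
```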
Scaling¶
The user can specify how strongly each scoring module’s scores contribute in each iteration. This is done through the scaling settings. Currently two ways are allowed:
- scaling_const specifies a constant weight, given as a number, that is used in each iteration over the entire run
- scaling_rvec allows the user to specify an R expression that can use the num_iterations variable to generate a vector of weights for each iteration
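The difference between the two scaling modes can be pictured as two ways of producing one weight per iteration. The sketch below is a pure Python analogue for illustration only. cMonkey2 itself takes an R expression for scaling_rvec, and how that expression is evaluated is not shown here. The helpers constant_weights and linear_ramp are hypothetical; linear_ramp mimics what R's seq(start, end, length.out=length) would produce.

```python
def constant_weights(value, num_iterations):
    """One identical weight per iteration, as with a constant scaling value."""
    return [float(value)] * num_iterations


def linear_ramp(start, end, length):
    """Evenly spaced weights from start to end inclusive.

    Mirrors R's seq(start, end, length.out=length) for illustration.
    """
    if length == 1:
        return [float(start)]
    step = (end - start) / (length - 1)
    return [start + i * step for i in range(length)]


num_iterations = 5  # deliberately tiny so the vectors are easy to read
print(constant_weights(6.0, num_iterations))  # [6.0, 6.0, 6.0, 6.0, 6.0]
print(linear_ramp(0.0, 1.0, num_iterations))  # [0.0, 0.25, 0.5, 0.75, 1.0]
```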
Scoring pipeline configuration (.json) files¶
By default, cmonkey uses the default_pipeline.json file to set up a standard scoring pipeline. Users can override the scoring functions used (e.g. to plug in user-defined scoring functions) by providing their own configuration files.
The general structure of a pipeline configuration file is as documented below:
{
  "row-scoring": {
    "id": "combiner",
    "function": { "module": "cmonkey.scoring", "class": "ScoringFunctionCombiner" },
    "args": {
      "functions": [
        { "id": "Rows",
          "function": { "module": "cmonkey.microarray", "class": "RowScoringFunction" }
        },
        ...
      ]
    }
  },
  "column-scoring": {
    "id": "Columns",
    "function": { "module": "cmonkey.scoring", "class": "ColumnScoringFunction" }
  }
}
The current version of the software is tested using a combiner function for row scoring and a single scoring function for column scoring, and this is the recommended setup. It is left to the users to pick the set of scoring functions most suitable to the problem and to provide this in their own .json files. With the --pipeline switch, the path to the user-defined configuration file can be specified on the command line.
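Each "function" entry in the pipeline file names a Python module and a class inside it. One plausible way to turn such an entry into a class object is a dynamic import, sketched below. This is an assumption about the general technique, not a description of cMonkey2's internals. To keep the sketch runnable without cMonkey2 installed, the example entry points at collections.OrderedDict from the standard library instead of a real scoring class, and resolve_function is a hypothetical helper.

```python
import importlib
import json

# A stand-in pipeline entry. A real pipeline would name a scoring
# module and class here; a standard-library class is used so that
# the sketch runs anywhere.
PIPELINE_JSON = """
{
  "column-scoring": {
    "id": "Columns",
    "function": { "module": "collections", "class": "OrderedDict" }
  }
}
"""


def resolve_function(entry):
    """Import the module named in the entry and return the named class."""
    spec = entry["function"]
    module = importlib.import_module(spec["module"])
    return getattr(module, spec["class"])


pipeline = json.loads(PIPELINE_JSON)
column_class = resolve_function(pipeline["column-scoring"])
print(pipeline["column-scoring"]["id"], column_class.__name__)
```

With this approach, a user-defined scoring class only has to be importable under the module name given in the .json file.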
Tricky topics¶
The cluster scores very well, but there are no motifs¶
cmonkey2 will not run MEME if the number of unique sequences for a cluster is below a certain threshold. By default, this value is configured as
[Membership]
...
min_cluster_rows_allowed = 3
If this hides good motifs that otherwise should be there, this number can be set to a lower value to ensure MEME is run on these cluster sequences.