Running SYGNAL

Command line parameters

SYGNAL is written in python and takes command line parameters to adjust run behavior:

usage: sygnal.py [-h] [--config CONFIG] [--out OUT] [--res RES]
              [--cores CORES] [--tmp TMP]
              cm_rundb

–config = SYGNAL configuration file formatted as JSON

–res = overall result file name

–out = output directory

–tmp = temporary file storage directory

–cores = number of cores to be used for

cm_rundb = cMonkey run output SQLite database

Here is an example run training on PITA miRNA-target gene database that follows the directory structure described previously:

python sygnal.py --config sygnal_config.json --res postProcessed_example_pita.csv --out output --tmp tmp --cores 19 ../out/cmonkey_run.db

Input files

The SYGNAL pipeline requires user supplied data to run. This information is used to configure the run.

sygnal_config.json

Example sygnal_config.json:

{
   "max-evalue": 10.0,
   "rho-cut": 0.3,
   "pvalue-cut": 0.05,
   "perc-targets": 0.1,
   "leo-nb-atob": 0.5,
   "mlogp-m-atob": 0.05,

   "run-neo": true,

   "synonyms-file": "../../../commonFiles/synonymsThesaurus.csv.gz",
   "phenotypes-file": "../../../commonFiles/phenotypes_meso.csv",
   "ratios-file": "../../../exprs/<example_cMonkey2_standardized_expression_data>.tsv",
   "all-ratios-file": "../../../exprs_all/<example_all_expression_data>.csv",
   "som-muts-file": "../../../som_muts/<somatic_mutations_matrix.csv",
   "mirna-fasta-file": "miRNA/hsa.mature.fa",
   "rand_pssms_dir": "randPSSMs",
   "promoterSeq": "seqs/promoterSeqs_Homo_sapiens.csv.gz",
   "p3utrSeq": "seqs/p3utrSeqs_Homo_sapiens.csv.gz",
   "gene_conv": "extras/entrez2ucsc.csv",
   "replication-dataset-names": ["<exampleReplication>"],

   "pita": "../../../commonFiles/pita_miRNA_sets_entrez_hsa.json",
   "targetscan": "../../../commonFiles/targetScan_miRNA_sets_entrez_hsa.json",
   "tfbsdb": "../../../commonFiles/tfbsDb_plus_and_minus_5000_entrez.json",

   "meme": {
     "upstream": {
       "nmotifs": 2,
       "motif-widths": [6, 12],
       "revcomp": "True",
       "bgfile": "seqs/bgFile.meme"
     }
   },
   "weeder": {
     "upstream": {
       "bgfile": "HS",
       "size": "small",
       "enriched": "T50",
       "revcomp": "True"
     },
     "3pUTR": {
       "bgfile": "HS3P",
       "size": "small",
       "enriched": "T50",
       "revcomp": "False"
     }
   },
   "tomtom": {
     "upstream": {
       "motif-files": [
       "../../../commonFiles/jasparCoreVertebrata_redundant.json",
       "../../../commonFiles/transfac_2012.1_PSSMs_vertabrate.json",
       "../../../commonFiles/uniprobePSSMsNonRedundant.json",
       "../../../commonFiles/selexPSSMsNonRedundant.json"]
     }
   },
   "tfExpansion": {
     "tfs": "TF/humanTFs_All.csv",
     "tfFamilies": "TF/tfFamilies.csv"
   },
   "mirvestigator": {
     "seedModel": [6,7,8],
     "minor": "True",
     "minor": "True",
     "p5": "True",
     "p3": "True",
     "wobble": "False",
     "wobbleCut": 0.25,
     "species": "hsa"
   },
   "first-principal-comps-result-file": "biclusterFirstPrincComponents.csv"
 }

cMonkey2 output database

cmonkey-python writes the results of its computation to an SQLite database. This choice was made, because SQLite is a free, open source and portable data store which is available on many systems and has programming interfaces to a large number of programming languages. Another important aspect is that the entire database is stored in a single file, which can be easily copied, archived and analyzed. The database structure and its function is explained in further detail in the cMonkey:sub:`2 Wiki <https://github.com/baliga-lab/cmonkey2/wiki/Database-schema>`_.

User supplied cMonkey2 expression matrix

The cMonkey2 expression data matrix is usually filtered based on differential expression, most variant genes, or expression above a threshold. The reason for this step is that genes with little variance or low expression are not likely to yield much information.

User supplied expression data (microarray or RNAseq) as a tab separated value (TSV) file that has been properly normalized. The expression data should be standardized (mean subtracted and divided by the standard deviation to turn each expression value into a Z score). The header for the expression data is expected to be without the leading tab.

Gene Cond_1 Cond_N
Gene_1 Z[1,1] Z[1,N]
Gene_M Z[M,1] Z[M,N-1] Z[M,N]

The expression data matrix is specified in the sygnal_config.json using the "ratios-file" option.

Complete user supplied expression matrix

This is the pre-filtered version of the expression data matrix with all genes. This allows correlation of all putative TF regulators versus bicluster eigengenes even if they were filtered out from the expression data matrix used for cMonkey<sub>2</sub> biclustering.

User supplied expression data (microarray or RNAseq) as a comma separated value (CSV) file that has been properly normalized. The expression data is not standardized. The header for the expression data is not offset .

Gene Cond_1 Cond_N
Gene_1 Z[1,1] Z[1,N]
Gene_M Z[M,1] Z[M,N-1] Z[M,N]

The complete expression data matrix is specified in the sygnal_config.json using the "all-ratios-file" option.

Gene ID thesaurus

The gene IDs used in cMonkey<sub>2</sub> and SYGNAL can be different than the pre-computed databases of gene-gene associations, TF- and miRNA-target gene interactions, etc. This is facilitated through a gene ID thesaurus which maps the gene expression matrix gene IDs to alternative gene IDs. This thesaurus is generated from a comma separated values (CSV) file with the following format:

ExprIDs Mappings
Gene_1 Gene_1;Alt_1;…;Alt_N
…;Alt_1;…;Alt_N
Gene_M Gene_M;Alt_1;…;Alt_N

The thesaurus file is specified in the sygnal_config.json using the "synonyms-file" option.

TF-target gene regulatory interactions

To discover putative TF regulatory interactions requires a database of TF-target gene interactions. We provide a pre-computed version of TF-target gene interactions from the TFBS_DB. For use in SYGNAL the TF-target gene regulatory interactions should be put into the following JSON format:

{"TF_ID_1": ["Gene_1", ..., "Gene_M"], ..., "TF_ID_N": ["Gene_1", ..., "Gene_M"]}

The TF-target gene interaction files are specified in the sygnal_config.json using the "tfbsdb" option.

TF family information

miRNA-target gene regulatory interactions

To discover putative miRNA regulatory interactions requires a database of miRNA-target gene interactions. We utilize two different pre-computed miRNA-target gene interaction databases: PITA and TargetScan. For use in SYGNAL the miNRA-target gene regulatory interactions should be put into the following JSON format:

{"miRNA_ID_1": ["Gene_1", ..., "Gene_M"], ..., "miRNA_ID_N": ["Gene_1", ..., "Gene_M"]}

We have started using miRBase MIMAT miRNA IDs due to the constant flux in miRNA names. The miRNA-target gene interaction files are specified in the sygnal_config.json using the “pita” and “targetscan” options.

Promoter and 3’ UTR sequences for organism of study

In addition to enrichment with pre-computed databases SYGNAL also searches promoter and 3’ UTR sequences for enriched TF or miRNA binding sites. The sequences are stored as comma separated value (CSV) files in the following format:

Gene_1,Seq_1
...,...
Gene_M,Seq_M

The sequence files are specified in the sygnal_config.json using the "promoterSeq" and "p3utrSeq" options.

Background sequence information for motif callers

Typical run of SYGNAL

Typical output from running SYGNAL