Basic Mining Using `agglom`¶

EGRIN2.0 MongoDB query using agglom¶

Set-up¶

Make sure ./egrin-tools/ folder is in your $PYTHONPATH

You can do this on Mac/Linux by adding the path to you Bash Shell Startup Files, e.g. ~/.bashrc or ~/.bash_profile

for example in ~/.bash_profile add the following line:

export PYTHONPATH=$PYTHONPATH:path/to/egrin2-tools/

Load required modules¶

import query.egrin2_query as e2q
import pymongo
client = pymongo.MongoClient(host='baligadev')
db = client['eco_db']

There are several dependencies that need to be satisfied, including:

pymongo

numpy

pandas

joblib

scipy

statsmodels

itertools

`agglom` function¶

The agglom function is the centerpiece for querying information in the ensemble.

Using this function you can co-associate any two bicluster features across the entire ensemble: so - biclusters, genes, conditions, and GREs with biclusters, genes, conditions, or GREs.

If you stack multiple agglom calls you can quickly build very complex queries.

The agglom function has a simple, universal format. Basically it needs an input (x), the type of that input (x_type), the desired output type (y_type), and some infomration about the location of the EGRIN 2.0 MongoDB.

The function will output a table of pvalues that reports the signifigance for each element in your output query type.

Let’s take a look at the function itself, then we will explore serveral examples:

?e2q.agglom

Example 1: find co-regulated genes

In this example, we will mine the ensemble for genes discovered in biclusters with the gene E. coli carbamoyl phosphate synthetase, carA (b0032)

carA_genes = e2q.agglom(db, x="b0032", x_type="genes", y_type="genes")
carA_genes

gene	counts	all_counts	pval	qval_BH	qval_bonferroni
b1062	152	198	0.000000e+00	0.000000e+00	0.000000e+00
b0336	153	198	0.000000e+00	0.000000e+00	0.000000e+00
b0337	154	198	0.000000e+00	0.000000e+00	0.000000e+00
b4244	167	198	0.000000e+00	0.000000e+00	0.000000e+00
b4245	169	198	0.000000e+00	0.000000e+00	0.000000e+00
b0033	173	198	0.000000e+00	0.000000e+00	0.000000e+00
b0032	198	198	0.000000e+00	0.000000e+00	0.000000e+00
b2499	142	198	3.783051e-300	4.073085e-298	3.665776e-297
b2476	142	198	3.783051e-300	4.073085e-298	3.665776e-297
b2313	141	198	5.664712e-297	4.574255e-295	5.489106e-294
b2312	141	198	5.664712e-297	4.574255e-295	5.489106e-294
…	…	…	…	…	…

259 rows × 5 columns

There are several things to note in this query.

First the arguments:

x specfies the query. This can be gene(s), condition(s), GRE(s), or bicluster(s). x can be a single entity or a list of entitites of the same type.

x_type indicates the type of x. This can include gene, condition, gres, and biclusters. Basically: “what is x?” The parameter typing is pretty flexible, so - for example - rows can be used instead of genes.

y_type is the type of. Again, genes, conditions, gres, or biclusters.

host specifies where the MongoDB database is running. In this case it is running on a machine called baligadev. If you are hosting the database locally this would be localhost

db is the name of the database you want to perform the query in. Typically databases are specified by the three letter organism code (e.g., eco) followed by _db. A list of maintained databases is available here.

Also note the output:

This is a table of all genes that are frequently co-discovered in biclusters with b0032 (carA). They are sorted by their relative “enrichment” in the sample x (where lower pval = greater enrichment). This is quantified by the hypergeometric distribution. To account for multiple testing, two corrections are included: Benjamini-Hochberg correction (qval_BH; less stringent) and Bonferrnoni correction (qval_bonferroni; more stringent).

Example 2: what if gene names have different formats ?

We can easily change the format of x, which makes it easy to specify genes in any format, e.g. NCBI annotation or common name. Let’s try to include a list of genes with two different naming formats. This time we will only return the first few elements using .head()

e2q.agglom(db, x=["b0032","pyrL","b0031"], x_type="genes", y_type= "genes").head()

gene	counts	all_counts
b4244	179	198
b0033	180	198
b4245	185	198
b0031	198	198
b4246	198	198

5 rows × 5 columns

Here, two of the genes are Blattner Ids (b****), while the other is a common name. This is no problem for the query.

The lookup table is pretty extensive for genes (it comes from MicrobesOnline) so anything from common name to GI number can be matched. Cool, huh?

Example 3: alternative logic

You can specify the type of logic to use in grouping your query.

For example, do you want do find biclusters where all genes x in your query are present (and logic) or biclusters where at least one of the genes is present (or logic). You can change the logical grouping of your query by specifying the logic parameter.

This parameter can take one of three values: and, or, or nor. By default the command uses or logic.

Here we change the logic employed in the previous query to “and”, meaning that a bicluster must contain all of the query genes. Notice how the values change.

e2q.agglom(db, x=["b0032","pyrL","b0031"], x_type="genes", y_type="genes", logic="and").head()

gene	counts	all_counts	pval	qval_BH	qval_bonferroni
b4246	25	198	0.000000e+00	0.000000e+00	0.000000e+00
b0032	25	198	0.000000e+00	0.000000e+00	0.000000e+00
b0031	25	198	0.000000e+00	0.000000e+00	0.000000e+00
b4006	24	198	5.510260e-57	3.981163e-55	1.592465e-54
b1062	23	198	2.604624e-53	1.075338e-51	7.527363e-51

5 rows × 5 columns

Logical operations can provide a substantial amount of power to your queries. For example, imagine that you not only want to know when b0032, pyrL, and b0031 are co-regulated but also when they are not. This can be done using the logical operator nor.

Example 4: other associations

So far we have mined gene-gene co-associations, but there are other associations in the ensemble that are also available, including conditions, and GREs, and biclusters.

Each of these can be accessed by changing the x_type and y_type parameter.

Conditions

For example to find conditions where these genes are co-regulated, we change y_type to conditions:

e2q.agglom(db, x=["b0032","pyrL","b0031"], x_type="genes", y_type="conditions", logic="or").head()

conditions	counts	all_counts	pval	qval_BH	qval_bonferroni
rb_del_rpoS_biofilm	240	12921	2.849586e-13	7.228851e-11	1.327907e-10
biofilm_24hr_del_yceP	239	12879	3.889790e-13	7.228851e-11	1.812642e-10
ik_H2_T4	232	12380	4.653767e-13	7.228851e-11	2.168655e-10
MG1063_uninduced_t180	232	12543	2.190004e-12	2.551355e-10	1.020542e-09
rb_del_rpoS_exponential	226	12198	5.366065e-12	5.001173e-10	2.500586e-09

5 rows × 5 columns

GREs

To discover GREs located upstream of these genes we change y_type to gre:

e2q.agglom(db, x=["b0032","pyrL","b0031"], x_type="genes", y_type="gre", logic="or").head()

GREs	counts	all_counts	pval	qval_BH	qval_bonferroni
11	187	231	9.726498e-267	5.252309e-265	5.252309e-265
81	42	48	2.936386e-64	7.928242e-63	1.585648e-62
12	41	211	6.248311e-26	1.124696e-24	3.374088e-24
309	14	19	2.195152e-21	2.606271e-20	1.185382e-19
197	16	27	2.413214e-21	2.606271e-20	1.303136e-19

5 rows × 5 columns

Note that GRE names in ensemble are integer values, thus GRE #1 is simply called 1

Biclusters

Finally: to return biclusters we change y_type to bicluster.

e2q.agglom(db, x=["b0032","pyrL","b0031"], x_type="genes", y_type="bicluster", logic="or").head()

WARNING:root:Will return bicluster _id. The results might surprise you...

Bicluster	_id
0	54ee42396a208812a205295a
1	54ee42396a208812a2052982
2	54ee42826a208812a2052b61
3	54ee42836a208812a2052ce0
4	54ee430d6a208812a2053120

5 rows × 1 columns

Please note that this returns MongoDB _ids for the biclusters, which are not by themselves useful or informative. Rather, these _ids can be used in other queries (e.g. fimoFinder) or to query MongoDB directly.

Basic Mining Using `agglom`¶

EGRIN2.0 MongoDB query using agglom¶

Set-up¶

Load required modules¶

`agglom` function¶

egrin2-tools

Navigation

Related Topics

Basic Mining Using agglom¶

EGRIN2.0 MongoDB query using agglom¶

Set-up¶

Load required modules¶

agglom function¶

Basic Mining Using `agglom`¶

`agglom` function¶