Library (API Documentation)¶
This code is intended to be primarily used as a library, and works best when used in an interactive python session (e.g. with jupyter) alongside pandas. Many of the query functions in this library returns pandas dataframes.
Below is a complete documentation of the API for this library. The functions in query will be the most interesting for most users.
grasp.tables¶
GRASP table descriptions in SQLAlchemy ORM.
These tables do not exist in the GRASP data, which is a single flat file. By separating the data into these tables querying is much more efficient.
This submodule should only be used for querying.
-
class
grasp.tables.
SNP
(**kwargs)[source]¶ Bases:
sqlalchemy.ext.declarative.api.Base
An SQLAlchemy Talble for GRASP SNPs.
Study and phenotype information are pushed to other tables to minimize table size and make querying easier.
- Table Name:
- snps
- Columns:
- Described in the columns attribute
-
int
¶ The ID number of the SNP, usually the NHLBIkey
-
str
¶ SNP loction expressed as ‘chr:pos’
-
hvgs_ids
¶ A list of HGVS IDs for this SNP
-
columns
¶ A dictionary of all columns ‘column_name’=>(‘type’, ‘desc’)
-
ConservPredTFBS
¶
-
CreationDate
¶
-
EqtlMethMetabStudy
¶
-
HUPfield
¶
-
HumanEnhancer
¶
-
InGene
¶
-
InLincRNA
¶
-
InMiRNA
¶
-
InMiRNABS
¶
-
LSSNP
¶
-
LastCurationDate
¶
-
NHLBIkey
¶
-
NearestGene
¶
-
ORegAnno
¶
-
PolyPhen2
¶
-
RNAedit
¶
-
SIFT
¶
-
UniProt
¶
-
chrom
¶
-
columns
= OrderedDict([('id', ('BigInteger', 'NHLBIkey')), ('snpid', ('String', 'SNPid')), ('chrom', ('String', 'chr')), ('pos', ('Integer', 'pos')), ('pval', ('Float', 'Pvalue')), ('NHLBIkey', ('String', 'NHLBIkey')), ('HUPfield', ('String', 'HUPfield')), ('LastCurationDate', ('Date', 'LastCurationDate')), ('CreationDate', ('Date', 'CreationDate')), ('population_id', ('Integer', 'Primary')), ('population', ('relationship', 'Link')), ('study_id', ('Integer', 'Primary')), ('study', ('relationship', 'Link')), ('study_snpid', ('String', 'SNPid')), ('paper_loc', ('String', 'LocationWithinPaper')), ('phenotype_desc', ('String', 'Phenotype')), ('phenotype_cats', ('relationship', 'Link')), ('InGene', ('String', 'InGene')), ('NearestGene', ('String', 'NearestGene')), ('InLincRNA', ('String', 'InLincRNA')), ('InMiRNA', ('String', 'InMiRNA')), ('InMiRNABS', ('String', 'InMiRNABS')), ('dbSNPfxn', ('String', 'dbSNPfxn')), ('dbSNPMAF', ('String', 'dbSNPMAF')), ('dbSNPinfo', ('String', 'dbSNPalleles')), ('dbSNPvalidation', ('String', 'dbSNPvalidation')), ('dbSNPClinStatus', ('String', 'dbSNPClinStatus')), ('ORegAnno', ('String', 'ORegAnno')), ('ConservPredTFBS', ('String', 'ConservPredTFBS')), ('HumanEnhancer', ('String', 'HumanEnhancer')), ('RNAedit', ('String', 'RNAedit')), ('PolyPhen2', ('String', 'PolyPhen2')), ('SIFT', ('String', 'SIFT')), ('LSSNP', ('String', 'LS')), ('UniProt', ('String', 'UniProt')), ('EqtlMethMetabStudy', ('String', 'EqtlMethMetabStudy'))]) A description of all columns in this table.
-
dbSNPClinStatus
¶
-
dbSNPMAF
¶
-
dbSNPfxn
¶
-
dbSNPinfo
¶
-
dbSNPvalidation
¶
-
display_columns
(display_as='table', write=False)[source]¶ Return all columns in the table nicely formatted.
- Display choices:
- table: A formatted grid-like table tab: A tab delimited non-formatted version of table list: A string list of column names
Parameters: - display_as – {table,tab,list}
- write – If true, print output to console, otherwise return string.
Returns: A formatted string or None
-
get_columns
(return_as='list')[source]¶ Return all columns in the table nicely formatted.
- Display choices:
- list: A python list of column names dictionary: A python dictionary of name=>desc long_dict: A python dictionary of name=>(type, desc)
Parameters: return_as – {table,tab,list,dictionary,long_dict,id_dict} Returns: A list or dictionary
-
get_variant_info
(fields='dbsnp', pandas=True)[source]¶ Use the myvariant API to get info about this SNP.
Note that this service can be very slow. It will be faster to query multiple SNPs.
Parameters: - fields – Choose fields to display from: docs.myvariant.info/en/latest/doc/data.html#available-fields Good choices are ‘dbsnp’, ‘clinvar’, or ‘gwassnps’ Can also use ‘grasp’ to get a different version of this info.
- pandas – Return a dataframe instead of dictionary.
Returns: A dictionary or a dataframe.
-
hvgs_ids
The HVGS ID from myvariant.
-
id
¶
-
paper_loc
¶
-
phenotype_cats
¶
-
phenotype_desc
¶
-
population
¶
-
population_id
¶
-
pos
¶
-
pval
¶
-
snp_loc
¶ Return a simple string containing the SNP location.
-
snpid
¶
-
study
¶
-
study_id
¶
-
study_snpid
¶
-
class
grasp.tables.
Phenotype
(**kwargs)[source]¶ Bases:
sqlalchemy.ext.declarative.api.Base
An SQLAlchemy table to store the primary phenotype.
- Table Name:
- phenos
- Columns:
- phenotype: The string phenotype from the GRASP DB, unique. alias: A short representation of the phenotype, not unique. studies: A link to the studies table.
-
int
¶ The ID number.
-
str
¶ The name of the phenotype.
-
alias
¶
-
id
¶
-
phenotype
¶
-
studies
¶
-
class
grasp.tables.
PhenoCats
(**kwargs)[source]¶ Bases:
sqlalchemy.ext.declarative.api.Base
An SQLAlchemy table to store the lists of phenotype categories.
- Table Name:
- pheno_cats
- Columns:
- category: The category from the grasp database, unique. alias: An easy to use alias of the category, not unique. snps: A link to all SNPs in this category. studies: A link to all studies in this category.
-
int
¶ The PhenoCat ID
-
str
¶ The category name
-
alias
¶
-
category
¶
-
id
¶
-
snps
¶
-
studies
¶
-
class
grasp.tables.
Platform
(platform)[source]¶ Bases:
sqlalchemy.ext.declarative.api.Base
An SQLAlchemy table to store the platform information.
- Table Name:
- platforms
- Columns:
- platform: The name of the platform from GRASP. studies: A link to all studies using this platform.
-
int
¶ The ID number of this platform
-
str
¶ The name of the platform
-
id
¶
-
platform
¶
-
studies
¶
-
class
grasp.tables.
Population
(population)[source]¶ Bases:
sqlalchemy.ext.declarative.api.Base
An SQLAlchemy table to store the platform information.
- Table Name:
- populations
- Columns:
- population: The name of the population. studies: A link to all studies in this population. snps: A link to all SNPs in this populations.
-
int
¶ Population ID number
-
str
¶ The name of the population
-
id
¶
-
population
¶
grasp.db¶
Functions for managing the GRASP database.
get_session() is used everywhere in the module to create a connection to the database. initialize_database() is used to build the database from the GRASP file. It takes about an hour 90 minutes to run and will overwrite any existing database.
-
grasp.db.
get_session
(echo=False)[source]¶ Return a session and engine, uses config file.
Parameters: echo – Echo all SQL to the console. Returns: - A SQLAlchemy session and engine object corresponding
- to the grasp database for use in querying.
Return type: session, engine
-
grasp.db.
initialize_database
(study_file, grasp_file, commit_every=250000, progress=False)[source]¶ Create the database quickly.
Study_file: Tab delimited GRASP study file, available here: github.com/MikeDacre/grasp/blob/master/grasp_studies.txt Grasp_file: Tab delimited GRASP file. Commit_every: How many rows to go through before commiting to disk. Progress: Display a progress bar (db length hard coded).
grasp.query¶
A mix of functions to make querying the database and analyzing the results faster.
-
grasp.query.
get_studies
(primary_phenotype=None, pheno_cats=None, pheno_cats_alias=None, primary_pop=None, has_disc_pop=None, has_rep_pop=None, only_disc_pop=None, only_rep_pop=None, query=False, count=False, dictionary=False, pandas=False)[source]¶ Return a list of studies filtered by phenotype and population.
There are two ways to query both phenotype and population.
- Phenotype:
GRASP provides a ‘primary phenotype’ for each study, which are fairly poorly curated. They also provide a list of phenotype categories, which are well curated. The problem with the categories is that there are multiple per study and some are to general to be useful. If using categories be sure to post filter the study list.
Note: I have made a list of aliases for the phenotype categories to make them easier to type. Use pheno_cats_alias for that.
- Population:
Each study has a primary population (list available with ‘get_populations’) but some studies also have other populations in the cohort. GRASP indexes all population counts, so those can be used to query also. To query these use has_ or only_ (exclusive) parameters, you can query either discovery populations or replication populations. Note that you cannot provide both has_ and only_ parameters for the same population type.
For doing population specific analyses most of the time you will want the excl_disc_pop query.
- Argument Description:
Phenotype Arguments are ‘primary_phenotype’, ‘pheno_cats’, and ‘pheno_cats_alias’.
Only provide one of pheno_cats or pheno_cats_alias
Population Arguments are primary_pop, has_disc_pop, has_rep_pop, only_disc_pop, only_rep_pop.
primary_pop is a simple argument, the others use bitwise flags for lookup.
The easiest way to use the following parameters is with the _ref.PopFlag object. It uses py-flags. For example:
pops = _ref.PopFlag.eur | _ref.PopFlag.afr
In addition you can provide a list of strings correcponding to PopFlag attributes.
Note: the only_ parameters work as ANDs, not ORs. So only_disc_pop=’eur|afr’ will return those studies that have BOTH european and african discovery populations, but no other discovery populations. On the other hand, has_ works as an OR, and will return any study with any of the spefified populations.
Parameters: - primary_phenotype – Phenotype of interest, string or list of strings.
- pheno_cats – Phenotype category of interest.
- pheno_cats_alias – Phenotype category of interest.
- primary_pop – Query the primary population, string or list of strings.
- has_disc_pop – Return all studies with these discovery populations
- has_rep_pop – Return all studies with these replication populations
- only_disc_pop – Return all studies with ONLY these discovery populations
- only_rep_pop – Return all studies with ONLY these replication populations
- query – Return the query instead of the list of study objects.
- count – Return a count of the number of studies.
- dictionary – Return a dictionary of title->id for filtering.
- pandas – Return a dataframe of study information instead of the list.
Returns: A list of study objects, a query, or a dataframe.
-
grasp.query.
get_snps
(studies, pandas=True)[source]¶ Return a list of SNPs in a single population in a single phenotype.
Studies: A list of studies. Pandas: Return a dataframe instead of a list of SNP objects. Returns: Either a DataFrame or list of SNP objects.
-
grasp.query.
get_variant_info
(snp_list, fields='dbsnp', pandas=True)[source]¶ Get variant info for a list of SNPs.
Parameters: - snp_list – A list of SNP objects or SNP rsIDs
- fields – Choose fields to display from: docs.myvariant.info/en/latest/doc/data.html#available-fields Good choices are ‘dbsnp’, ‘clinvar’, or ‘gwassnps’ Can also use ‘grasp’ to get a different version of this info.
- pandas – Return a dataframe instead of dictionary.
Returns: A dictionary or a dataframe.
-
grasp.query.
find_intersecting_phenotypes
(primary_pops=None, pop_flags=None, check='cat', pop_type='disc', exclusive=False, list_only=False)[source]¶ Return a list of phenotypes that are present in all populations.
Can only provide one of primary_pops or pop_flags. pop_flags does a bitwise lookup, primary_pops quries the primary string only.
By default this function returns a list of phenotype categories, if you want to check primary phenotypes instead, provide check=’primary’.
Parameters: - primary_pops – A string or list of strings corresponing to the tables.Study.phenotype column
- pop_flags – A ref.PopFlag object or list of objects.
- check – cat/primary either check categories or primary phenos.
- pop_type – disc/rep Use with pop_flags only, check either discovery or replication populations.
- exclusive – Use with pop_flags only, do an excusive rather than inclusion population search
- list_only – Return a list of names only, rather than a list of objects
Returns: A list of table.Phenotype or table.PhenoCat objects, or a list of names if list_only is specified.
-
grasp.query.
collapse_dataframe
(df, mechanism='median', pvalue_filter=None, protected_columns=None)[source]¶ Collapse a dataframe by chrom:location from get_snps.
Will use the mechanism defined by ‘mechanism’ to collapse a dataframe to one indexed by ‘chrom:location’ with pvalue and count only.
This function is agnostic to all dataframe columns other than:
['chrom', 'pos', 'snpid', 'pval']
All other columns are collapsed into a comma separated list, a string. ‘chrom’ and ‘pos’ are merged to become the new colon-separated index, snpid is maintained, and pval is merged using the function in ‘mechanism’.
Parameters: - df – A pandas dataframe, must have ‘chrom’, ‘pos’, ‘snpid’, and ‘pval’ columns.
- mechanism – A numpy statistical function to use to collapse the pvalue, median or mean are the common ones.
- pvalue_filter – After collapsing the dataframe, filte to only include pvalues less than this cutoff.
- protected_columns – A list of column names that will be maintened as is, although all duplicates will be dropped (randomly). Only makes sense for columns that are identical for all studies of the same SNP.
Returns: - Indexed by chr:pos, contains flattened pvalue column, and
all original columns as a comma-separated list. Additionally contains a count and stddev (of pvalues) column. stddev is nan if count is 1.
Return type: DataFrame
-
grasp.query.
intersect_overlapping_series
(series1, series2, names=None, stats=True, plot=False)[source]¶ Plot all SNPs that overlap between two pvalue series.
Parameters: - series – A pandas series object
- names – A list of two names to use for the resultant dataframes
- stats – Print some stats on the intersection
- plot – Plot the resulting intersection
Returns: with the two series as columns
Return type: DataFrame
grasp.config¶
Manage a persistent configuration for the database.
-
grasp.config.
config
= <configparser.ConfigParser object>¶ A globally accessible ConfigParger object, initialized with CONFIG_FILE.
-
grasp.config.
CONFIG_FILE
= '/home/docs/.grasp'¶ The PATH to the config file.
-
grasp.config.
init_config
(db_type, db_file='', db_host='', db_user='', db_pass='')[source]¶ Create an initial config file.
Parameters: - db_type – ‘sqlite/mysql/postgresql’
- db_file – PATH to sqlite database file
- db_host – Hostname for mysql or postgresql server
- db_user – Username for mysql or postgresql server
- db_pass – Password for mysql or postgresql server (not secure)
Returns: NoneType
Return type: None
grasp.info¶
Little functions to pretty print column lists and category info.
get_{phenotypes,phenotype_categories,popululations} all display a dump of the whole database.
get_population_flags displays available flags from PopFlag.
display_{study,snp}_columns displays a list of available columns in those two tables as a formatted string.
get_{study,snp}_columns return a list of available columns in those two tables as python objects.
-
grasp.info.
display_snp_columns
(display_as='table', write=False)[source]¶ Return all columns in the SNP table as a string.
- Display choices:
- table: A formatted grid-like table tab: A tab delimited non-formatted version of table list: A string list of column names
Parameters: - display_as – {table,tab,list}
- write – If true, print output to console, otherwise return string.
Returns: A formatted string or None
-
grasp.info.
display_study_columns
(display_as='table', write=False)[source]¶ Return all columns in the Study table as a string.
- Display choices:
- table: A formatted grid-like table tab: A tab delimited non-formatted version of table list: A string list of column names
Parameters: - display_as – {table,tab,list}
- write – If true, print output to console, otherwise return string.
Returns: A formatted string or None
-
grasp.info.
get_phenotype_categories
(list_only=False, dictionary=False, table=False)[source]¶ Return all phenotype categories from the PhenoCats table.
List_only: Return a simple text list instead of a list of Phenotype objects. Dictionary: Return a dictionary of phenotype=>ID Table: Return a pretty table for printing.
-
grasp.info.
get_phenotypes
(list_only=False, dictionary=False, table=False)[source]¶ Return all phenotypes from the Phenotype table.
List_only: Return a simple text list instead of a list of Phenotype objects. Dictionary: Return a dictionary of phenotype=>ID Table: Return a pretty table for printing.
-
grasp.info.
get_population_flags
(list_only=False, dictionary=False, table=False)[source]¶ Return all population flags available in the PopFlags class.
List_only: Return a simple text list instead of a list of Phenotype objects. Dictionary: Return a dictionary of population=>ID Table: Return a pretty table for printing.
-
grasp.info.
get_populations
(list_only=False, dictionary=False, table=False)[source]¶ Return all populatons from the Population table.
List_only: Return a simple text list instead of a list of Phenotype objects. Dictionary: Return a dictionary of population=>ID Table: Return a pretty table for printing.
-
grasp.info.
get_snp_columns
(return_as='list')[source]¶ Return all columns in the SNP table.
- Display choices:
- list: A python list of column names dictionary: A python dictionary of name=>desc long_dict: A python dictionary of name=>(type, desc)
Parameters: return_as – {table,tab,list,dictionary,long_dict,id_dict} Returns: A list or dictionary
-
grasp.info.
get_study_columns
(return_as='list')[source]¶ Return all columns in the SNP table.
- Display choices:
- list: A python list of column names dictionary: A python dictionary of name=>desc long_dict: A python dictionary of name=>(type, desc)
Parameters: return_as – {table,tab,list,dictionary,long_dict,id_dict} Returns: A list or dictionary