G2C::Informatics
Programming
A key aim is to develop software to automate steps in the gene targeting vector design process, to speed up this part of mouse knockout generation: at present, it is a bottleneck in the process of going from gene of interest to transgenic mouse line.
The bioinformatics software package being developed is implemented in Perl, and makes uses of many open source software components (such as MySQL, Bioperl, and Ensembl) that have been developed at the Sanger Institute and elsewhere.
Where the developed methods and software are likely to be of interest and use to the wider scientific community, G2C plan to publish the methods and make the software freely available. See the Software page for details.
Network Biology
Molecular interaction networks are being used to allow the integrated analysis of the diverse biological datasets generated by G2C. These analyses will inform the construction of static and dynamic models at various levels of function (e.g. protein complexes, neuronal networks) and abstraction (from coarse heuristics to detailed biochemical models). The emphasis is on producing models that provide insight into the underlying biology and can be used to direct experimental work.
Current models focus on the NMDA Receptor Complex (NRC/MASC), a major component of the postsynaptic signal transduction machinery at glutamatergic synapses. The NRC/MASC co-ordinates a diverse set of effector pathways underlying neuronal plasticity. Detailed annotation and analysis of the NRC/MASC proteins revealed simple principles underlying the functional organization of the complex. The resulting network model correctly predicts robustness of synaptic plasticity to mutations and drug interference. Existing models of NRC/MASC (and the synapse proteome in general) are now being refined through the analysis of gene/protein expression, phospho-proteomics and other large-scale datasets.
Literature mining
Figure 1 below shows a sample abstract mined from PubMed, with the relevant search terms highlighted.
G2C Bioinformatics have developed literature-mining tools to aid the in-house data curators. The purpose of these tools is to present the curator with ranked lists of results. A typical search that a curator might wish to perform is 'from this list of proteins, show the proteins that interact with each other and are involved in LTP and are disease implicated'.
Without these tools, the curators face huge volumes of text, multiple protein synonyms and the laborious process of identifying protein interactions.
Figure 1 shows a sample abstract mined from PubMed, with the relevant search terms highlighted.
Gene Prioritizing
We are developing a rule-based/machine-learning system to prioritize lists of genes for investigation. The rules used are derived from previous lab investigations, consisting in part of highly informative gene characteristics, such as protein interactions, disease implication and so on.
The machine-learning component is in place to augment the rule-based system, by identifying patterns in highly-ranked results and determining the usefulness of rules inferred from these patterns.
Lab database development
The Bioinformatics team have developed G2C's Database, which collects the vast volumes of information gleaned from literature mining and data curation, and integrates it. The database contains large volumes of data on genetics, synaptic plasticity, human disease and proteomics, all sorted by the gene of interest. The lab generates data from many diverse groups. Various systems are being developed to store this data and allow the retrieval of information relevant across all the areas. In particular, we are developing a system of in-house gene IDs to manage the constantly-updating lists of identifiers produced by the various gene databases.