My team met with the BioQUEST staff about our model yesterday. John made several encouraging remarks and had a variety of suggestions regarding the underlying models being used (diffusion, binding, etc.), how the data could be presented to make them consistent with the way data are presented in the scientific literature (or, at least, in familiar problem sets), and how we could make the interface clear for novice users. We've begun to track things down and have started to build a new version of the model that does the lac operon, which is the archetypal version of these systems. Afterwards, John suggested we think about using Boolean logic as a guide for creating model systems, in particular to create a NAND gate, from which all other Boolean logic systems can be constructed. Herb had demonstrated something like that earlier in the week with signal transduction models: a really fascinating talk that proposed that a lot of the complexity in signal transduction acts like circuitry (like amplifiers and rectifiers) to strengthen the signal, reduce distortion, and bring different parts of the system into alignment with each other with respect to levels of inputs and outputs. Awesome stuff.
Today is mostly a set of presentations about the NCBI molecular biology tools. It should be interesting and, if possible, I'll keep notes here and update them over the course of the day.
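To convince myself of John's point that a NAND gate is enough to build everything else, here is a minimal sketch in Python. The function names (`nand`, `not_`, `and_`, `or_`, `xor_`, `truth_table`) are my own and purely illustrative; nothing here comes from our actual gene expression model.

```python
def nand(a: bool, b: bool) -> bool:
    """The only primitive gate: false only when both inputs are true."""
    return not (a and b)

# Every gate below is built from nand() alone.
def not_(a: bool) -> bool:
    return nand(a, a)

def and_(a: bool, b: bool) -> bool:
    return not_(nand(a, b))

def or_(a: bool, b: bool) -> bool:
    return nand(not_(a), not_(b))

def xor_(a: bool, b: bool) -> bool:
    n = nand(a, b)
    return nand(nand(a, n), nand(b, n))

def truth_table(gate):
    """Outputs for inputs (F,F), (F,T), (T,F), (T,T), in that order."""
    return [gate(a, b) for a in (False, True) for b in (False, True)]

if __name__ == "__main__":
    for gate in (nand, and_, or_, xor_):
        print(gate.__name__, truth_table(gate))
```

In a gene regulation setting, the two inputs would presumably correspond to the presence or absence of two regulators, but that mapping is an assumption on my part and not something we have built yet.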
Exploring 3D molecular structures using NCBI tools: A field guide by Eric Sayer.
Overview; How 3D structures are determined; indexing structural data; finding homologous structures: BLAST: sequence similarity, VAST: structure, RPS-BLAST and CDD: conserved function; finding a structural template for a query protein
Structural informatics: relationships among chemical formula <-> 3D conformation <-> function. Databases -- access through: entrez protein sequence: genpept, swissprot, etc <-> entrez structure and 3D domain: PDB <-> entrez domains MSA, Pfam, SMART, COGs, CDD
Currently 100 times more sequences than structures. Can we predict structure from sequence and/or function?
Solving structures: X-ray crystallography is hard. Resolution is critical to accurately determining the structure of proteins. Many structures may be inaccurate. The temperature factor (B-factor) shows you how variable the positions of the components of the structure are. NMR spectroscopy: builds a constraint list: distances, dihedral angles, orientation. Allows you to build structural models that are consistent with the constraints, but often many parts of the molecule may be poorly constrained. Often, in context, poorly constrained regions are consistent, but it may be difficult to determine the structure in context.
Topoisomerase in Entrez: structure summary with links to other sections. To create page: convert to ASN.1, verify sequences, create "backbone" model, create single-conformer model, annotate secondary structure and chemical bonds. (The argument is that Entrez has more rigorous controls and more consistency than other sources for information).
Structure indexing: search on record IDs and also ligands, experimental methods, PDB, Literature, counters (ligand types, modified amino acids, nucleotides, etc).
Creating sequence records: one record per chain (protein, nucleotide, etc.). Accession numbers use PDB code plus chain identifier.
Creating 3D domains: groups of at least 3 secondary structural elements. Domains appended to accession numbers. Domains not necessarily contiguous by sequence. This can give you inferences about how sequences are folding in space.
3D domain indexing: search on molecular weight, how many helices, how many strands?
Conserved domains: sequences aligned by function. Position Specific Score Matrices (PSSMs) in BLAST show which conserved residues are important. CD: NCBI curated, Pfam: swissprot, SMART: HMM-based models. Protein families: COG based on complete genomes of prokaryotes.
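To make the Position Specific Score Matrix idea concrete for myself, here is a rough sketch of how one could be built from a small alignment. It is a simplified illustration and not how BLAST or CDD actually compute theirs: it assumes an ungapped alignment, a uniform background frequency, and a fixed pseudocount, and the toy sequences and function names (`build_pssm`, `score`) are made up.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def build_pssm(alignment, pseudocount=1.0):
    """Return one dict per alignment column mapping residue -> log2-odds score.

    Assumptions: ungapped alignment, uniform background frequencies,
    and a fixed pseudocount added to every residue in every column.
    """
    background = 1.0 / len(AMINO_ACIDS)
    total = len(alignment) + pseudocount * len(AMINO_ACIDS)
    pssm = []
    for column in zip(*alignment):  # iterate over alignment columns
        counts = Counter(column)
        scores = {}
        for aa in AMINO_ACIDS:
            freq = (counts.get(aa, 0) + pseudocount) / total
            scores[aa] = math.log2(freq / background)
        pssm.append(scores)
    return pssm

def score(pssm, sequence):
    """Sum the per-position scores of a sequence the same length as the PSSM."""
    return sum(column[aa] for column, aa in zip(pssm, sequence))

if __name__ == "__main__":
    toy_alignment = ["HKDG", "HRDA", "HKDS", "HQDG"]  # made-up sequences
    pssm = build_pssm(toy_alignment)
    print(round(pssm[0]["H"], 2), round(pssm[0]["A"], 2))
    print(score(pssm, "HKDG") > score(pssm, "AAAA"))
```

Columns where every sequence has the same residue (H in the first position, D in the third) get the highest scores for that residue, which is the sense in which the matrix shows which conserved residues matter.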
VAST: Enzyme structure more strongly conserved than sequence, but structure evolves too. 3D domains can be used to explore evolution of structure. Can also identify conserved core elements that determine structure of molecule. VAST: create vector elements based on secondary structure and compare mathematically with other proteins. Create vector for each element (respecting topology!), align each along Z, construct midpoints of other elements, project onto a space and calculate angles to midpoints, calculate elevation from each midpoint to all others. Construct graph from N-terminus to C-terminus. (skipping a few steps) Create blocks of aligned sequence -- aligned parts in capital letters. Recognizes conservation at the chemical properties level (hydrophobicity). Curated records match up aligned structures more consistently. CDART: conserved domain architecture retrieval tool.
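The vector idea is easier for me to follow with a toy example. The sketch below is not the VAST algorithm itself (I skipped steps in my notes, and the real method does much more); it only illustrates representing each secondary structure element as a vector from its start to its end and then comparing two elements by the angle between their axes and the distance between their midpoints. The coordinates and function names are hypothetical.

```python
import math

def sub(p, q):
    return tuple(a - b for a, b in zip(p, q))

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def norm(u):
    return math.sqrt(dot(u, u))

def midpoint(p, q):
    return tuple((a + b) / 2 for a, b in zip(p, q))

def angle_deg(u, v):
    """Angle between two vectors in degrees, clamped for float safety."""
    cos = dot(u, v) / (norm(u) * norm(v))
    return math.degrees(math.acos(max(-1.0, min(1.0, cos))))

def pair_descriptor(sse_a, sse_b):
    """Compare two elements, each given as a (start_xyz, end_xyz) pair.

    Returns (angle between the two axes, distance between their midpoints).
    """
    axis_a = sub(sse_a[1], sse_a[0])
    axis_b = sub(sse_b[1], sse_b[0])
    separation = norm(sub(midpoint(*sse_a), midpoint(*sse_b)))
    return angle_deg(axis_a, axis_b), separation

if __name__ == "__main__":
    helix = ((0.0, 0.0, 0.0), (0.0, 0.0, 10.0))   # made-up coordinates
    strand = ((5.0, 0.0, 0.0), (5.0, 10.0, 0.0))  # made-up coordinates
    print(pair_descriptor(helix, strand))
```

Two proteins whose elements produce similar sets of pairwise angles and distances, in the same order along the chain, would be candidates for a structural match, which is roughly the intuition I took from the talk.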
Overall strategy: 1) get a block alignment from curated CD or VAST alignment, 2) align query to template using Cn3D, 3) use Blocker. There is also a Threader. Next we get to play with software to work with structures and then alignments.
It's been a really long day. I've made a lot of progress on the gene expression model. I've got a pretty good simulation of the lac operon working. Now we just need to polish the interface and figure out how to represent the data to make it useful for students to actually ask interesting questions with the simulation.
Since Tom is leaving tomorrow, we went out to Suds for some beer. It was good to catch up with him -- it's been great to see him here and I hope I've prevailed on him to actually get an abstract submitted to ACUBE so we can drive out together and do some bicycling after our talks. Now it's late and past time I should get to bed. I'm absolutely exhausted.