NCBI Molecular Biology Resources

part 2
August 30, 2005
University of Colorado Health Sciences Center
NCBI FieldGuide
A Field Guide
Entrez: text searching
• a GenBank record
• preview/index
BLAST: sequence searching
• pre-computed searches
• algorithms
• what’s new?
VAST: structure searching
Example: mapping oligos to a genome
Part 2
The Flatfile Format
Header
Feature Table
Sequence
GenBank Records
A Typical GenBank Record
LOCUS       NM_019570 4279 bp mRNA linear INV 28-OCT-2004
DEFINITION  Mus musculus REV1-like (S. cerevisiae) (Rev1l), mRNA   (= Title)
ACCESSION   NM_019570
VERSION     NM_019570.3 GI:50811869
KEYWORDS    .
GenBank Record: Feature Table
GenPept identifier
GenBank Record: Feature Table, cont.
skip
GenBank Record: sequence
Field                 Indexed Terms
[primary accession]   NM_001012399 [accn]
[title]               Bos taurus hemochromatosis (hfe), mRNA.
[organism]            Bos taurus [orgn]
[sequence length]     1168
[modification date]   2005/02/19 [mdat]
[properties]          biomol mrna [prop]
                      gbdiv mam
                      srcdb refseq
Indexing for Nucleotide UID 59958365
HFE
Global Entrez Search: HFE
137 records
[Title]
Not
HFE
Entrez Nucleotide: HFE
hfe[title] AND human[orgn]
42 records
Smarter Query
Curated HFE splice variants (11 total)
(cont.)
Primary data
hfe[title] AND human[orgn]
Preview/Index
Preview/Index
Properties
srcdb
Preview/Index: Properties, srcdb
…AND srcdb refseq[Properties]
Preview/Index: Properties, srcdb
…AND srcdb ddbj/embl/genbank[Properties]
Preview/Index: Properties, srcdb
Query                                          Records
#1 hfe                                         137
#2 hfe[title] AND human[orgn]                  42
#3 #2 AND srcdb refseq[prop]                   11
#4 #2 AND srcdb ddbj/embl/genbank[prop]        31
#5 #4 AND gbdiv pri[prop]                      29   (Primate division)
#6 #4 AND gbdiv est[prop]                      2    (EST division)
Database Queries
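The refinements in this history each add one fielded term with AND. A minimal sketch of composing such a query string and a matching E-utilities `esearch` URL follows. The endpoint, the `db=nucleotide` value, and the helper names `and_terms` and `esearch_url` are assumptions that do not appear on the slides, and nothing is sent over the network.

```python
from urllib.parse import urlencode

# Assumed public E-utilities endpoint; it is not shown on the slides.
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def and_terms(*terms):
    """Join fielded Entrez terms with AND (hypothetical helper)."""
    return " AND ".join(terms)


def esearch_url(db, term):
    """Build, but do not send, an esearch request URL (hypothetical helper)."""
    return ESEARCH + "?" + urlencode({"db": db, "term": term})


base = and_terms("hfe[title]", "human[orgn]")       # query #2
refseq = and_terms(base, "srcdb refseq[prop]")      # query #3
primary = and_terms(base, "srcdb ddbj/embl/genbank[prop]", "gbdiv pri[prop]")  # query #5

print(refseq)
print(esearch_url("nucleotide", refseq))
```

Writing each refinement as the previous query plus one term mirrors the `#2 AND ...` history syntax, so the record counts can be compared step by step.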
Query                                  Records
#1 hfe                                 116
#2 hfe[title] AND human[orgn]          42
#3 #2 AND biomol mrna[prop]            29   (cDNA)
#4 #2 AND biomol genomic[prop]         13   (Genomic DNA)
Molecule Queries
Fields are database-specific
Entrez Nucleotide
Reviewed RefSeqs with transcript variants:
srcdb refseq reviewed[prop] AND transcript[title] AND variant[title]
Entrez Gene
Topoisomerase genes from Archaea:
topoisomerase[gene name] AND archaea[organism]
Genes on human chromosome 2 with OMIM links
2[chromosome] AND human[organism] AND “gene omim”[filter]
Membrane proteins linked to cancer:
“integral to plasma membrane”[gene ontology] AND cancer[dis]
More Queries…
UniGene: rat clusters that have at least one mRNA
rat[organism] NOT 0[mrna count]
SNP: uniquely mapped microsatellites on human chr2
microsat[SNP Class] AND 1[Map Weight] AND 2[Chromosome] AND human[orgn]
UniSTS: markers on the Genethon map of human chromosome 12
Genethon[Map Name] AND human[organism] AND 12[chromosome]
Structure: structures of bacterial kinases with resolutions below 2 Å
bacteria[organism] AND kinase AND 000.00:002.00[resolution]
Other Entrez Databases
Basic Local Alignment Search Tool
BLAST Web Searches, 2005
200,000

Nucleotide or protein:
Related Sequences

BLAST link:
BLink

Transcript clusters:
UniGene

Protein homologs:
HomoloGene
Precomputed BLAST Services
Link to Related Sequences
Most similar
Least similar
Related Sequences
BLink (BLAST Link)
Best hits
3D structures
CDD-Search
BLink Output
Seq 1
Seq 2
Global alignment
Seq 1
Seq 2
Local alignment
Global vs Local Alignment
Seq1: WHEREISWALTERNOW (16aa)
Seq2: HEWASHEREBUTNOWISHERE (21aa)
Global
Seq1:  1 W--HEREISWALTERNOW 16
         W  HERE
Seq2:  1 HEWASHEREBUTNOWISHERE 21
Local
Seq1:  1 W--HERE 5
         W  HERE
Seq2:  3 WASHERE 9

Seq1:  1 W--HERE 5
         W  HERE
Seq2: 15 WISHERE 21
Global vs Local Alignment
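A local alignment keeps only the best-scoring matching region, while a global alignment must span both sequences end to end. The sketch below computes the best local score with the Smith-Waterman recurrence. The scoring values (+1 match, -1 mismatch, -1 per gap position) are assumptions chosen for illustration and are not BLAST defaults.

```python
def local_alignment_score(a, b, match=1, mismatch=-1, gap=-1):
    """Best Smith-Waterman local alignment score with a linear gap cost."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Flooring at 0 lets an alignment restart anywhere: that is what makes it local.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


seq1 = "WHEREISWALTERNOW"
seq2 = "HEWASHEREBUTNOWISHERE"
print(local_alignment_score(seq1, seq2))
```

Removing the `0` from the `max` and initializing the borders with gap penalties would turn the same loop into a global (Needleman-Wunsch) score.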
• Standard BLAST
– nucleotide, protein and translations (blastn, blastp,
blastx, tblastn, tblastx)
– traditional “contiguous” word hit
• Megablast
– optimized for large batch searches
– can use discontiguous words
• PSI-BLAST
– constructs PSSMs automatically; uses as query
– very sensitive protein search
• RPS BLAST
– searches a database of PSSMs
– tool for conserved domain searches
The Flavors of BLAST
Fast
- heuristic approach based on Smith-Waterman
Local alignments
Statistical significance
- Expect value
Versatile
- blastn, blastp, blastx, tblastn, tblastx, rps-blast,
psi-blast
- www, standalone, and network clients
Why Is BLAST So Popular?
• Make lookup table of “words” for query
• Scan database for hits
• Ungapped extensions of hits (initial HSPs)
• Gapped extensions (no traceback)
• Gapped extensions (traceback; alignment
details)
How BLAST Works
Query: GTACTGGACATGGACCCTACAGGAA
Make a lookup table of words (11-mers):
GTACTGGACAT
TACTGGACATG
ACTGGACATGG
CTGGACATGGA
TGGACATGGAC
GGACATGGACC
GACATGGACCC
ACATGGACCCT
. . .
Nucleotide Words
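The lookup table is simply every overlapping substring of length W in the query. A minimal sketch, using the slide's query and W = 11; the function name `words` is hypothetical.

```python
def words(seq, w):
    """Return every overlapping word of length w, in query order."""
    return [seq[i:i + w] for i in range(len(seq) - w + 1)]


query = "GTACTGGACATGGACCCTACAGGAA"
table = words(query, 11)
for word in table[:3]:
    print(word)
print(len(table), "words in total")
```

A query of length L yields L - W + 1 words, so a larger word size gives fewer and more specific seeds.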
Query: GTQITVEDLFYNIATRRKALKN
Make a lookup table of words
Word size = 3 (default); word size can only be 2 or 3
GTQ
TQI
QIT
ITV
TVE
VED
EDL
DLF
...
Neighborhood Words: LTV, MTV, ISV, LSV, etc.
[ -f 11 = blastp default ]
Protein Words
• Nucleotide BLAST requires one exact match
  ATCGCCATGCTTAATTGGGCTT
       CATGCTTAATT          (one exact match)
• Protein BLAST requires two neighboring matches within 40 aa
  GTQITVEDLFYNI
  SEI, YYN: neighborhood words (two matches)
  [ -A 40 = blastp default ]
Minimum Requirements for a Hit
Query:
IETVYAAYLPKNTHPFLYLSLEISPQNVDVNVHPTKHEVHFLHEESILEV…
Example query words: YLS, HFL
Neighborhood words, with neighborhood score threshold T (-f) = 11:
  YLS 15   YLT 12   YVS 12   YIT 10   etc …
  HFL 18   HFV 15   HFS 14   HWL 13   NFL 13   DFL 12   HWV 10   etc …
High-scoring pair (HSP)
Query   1 IETVYAAYLPKNTHPFLYLSLEISPQNVDVNVHPTKHEVHFLHEESI 47
          +E  YA YL K    F+YLSL +SP+ +DVNVHP+K  VHFL+++ I
Sbjct 287 LEETYAKYLHKGASYFVYLSLNMSPEQLDVNVHPSKRIVHFLYDQEI 333
Gapped extension with trace back
Query   1 IETVYAAYLPKNTHPFLYLSLEISPQNVDVNVHPTKHEVHFLHEESI-LEV… 50
          +E  YA YL K    F+YLSL +SP+ +DVNVHP+K  VHFL+++ I + +
Sbjct 287 LEETYAKYLHKGASYFVYLSLNMSPEQLDVNVHPSKRIVHFLYDQEIATSI… 337
Final HSP
BLASTP Summary
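Two filters from this summary can be sketched in a few lines: keeping only neighborhood words that reach the threshold T, and requiring two word hits close together before extending. The word scores are copied from the slide. The function names and the exact two-hit condition (two non-overlapping hits on one diagonal no more than 40 residues apart) are simplifying assumptions, not a description of the BLAST source code.

```python
from itertools import combinations

T = 11  # neighborhood score threshold from the slide (-f 11)

# Word scores as listed on the slide for the query words HFL and YLS.
NEIGHBOR_SCORES = {
    "HFL": 18, "HFV": 15, "HFS": 14, "HWL": 13, "NFL": 13, "DFL": 12, "HWV": 10,
    "YLS": 15, "YLT": 12, "YVS": 12, "YIT": 10,
}


def words_above_threshold(scores, threshold=T):
    """Keep neighborhood words whose score is at least the threshold."""
    return sorted(w for w, s in scores.items() if s >= threshold)


def has_two_hits(starts, word=3, window=40):
    """True if two non-overlapping hits on one diagonal start within `window` residues."""
    return any(word <= abs(b - a) <= window for a, b in combinations(starts, 2))


print(words_above_threshold(NEIGHBOR_SCORES))
print(has_two_hits([18, 40]))  # YLS starts at query position 18, HFL at 40
```

With T = 11, the words scoring 10 (HWV and YIT) are dropped from the lookup table; raising T shrinks the table and speeds up the scan at the cost of sensitivity.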
Identity matrix
      A    G    C    T
A    +1   -3   -3   -3
G    -3   +1   -3   -3
C    -3   -3   +1   -3
T    -3   -3   -3   +1

CAGGTAGCAAGCTTGCATGTCA
|| ||||||||||||  |||||
CACGTAGCAAGCTTG-GTGTCA

[ -r 1 -q -3 ]
raw score = 19-9 = 10 (19 matches; 2 mismatches and 1 gap position at -3 each)
Scoring Systems - Nucleotides
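The raw score on this slide can be reproduced by walking the alignment column by column. This sketch assumes the reward and penalty shown (`-r 1 -q -3`) and, following the slide's arithmetic of 19 - 9, charges the single gap position -3 as well. Real gapped BLAST uses separate gap costs, so the `gap` parameter here is an assumption.

```python
def raw_score(query, subject, reward=1, penalty=-3, gap=-3):
    """Sum column scores for two aligned strings of equal length ('-' marks a gap)."""
    if len(query) != len(subject):
        raise ValueError("aligned sequences must have equal length")
    score = 0
    for q, s in zip(query, subject):
        if q == "-" or s == "-":
            score += gap          # assumed: gap position scored like a mismatch
        elif q == s:
            score += reward
        else:
            score += penalty
    return score


q = "CAGGTAGCAAGCTTGCATGTCA"
s = "CACGTAGCAAGCTTG-GTGTCA"
print(raw_score(q, s))
```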
Position Independent Matrices
PAM Matrices (Percent Accepted Mutation)
• Derived from observation; small dataset of
alignments
• Implicit model of evolution
• All calculated from PAM1
• PAM250 widely used
BLOSUM Matrices (BLOck SUbstitution Matrices)
• Derived from observation; large dataset of highly
conserved blocks
• Each matrix derived separately from blocks with a
defined percent identity cutoff
• BLOSUM62 - default matrix for BLAST
Position Specific Score Matrices (PSSMs)
PSI- and RPS-BLAST
Scoring Systems - Proteins
      A   R   N   D   C   Q   E   G   H   I   L   K   M   F   P   S   T   W   Y   V   X
A     4
R    -1   5
N    -2   0   6
D    -2  -2   1   6
C     0  -3  -3  -3   9
Q    -1   1   0   0  -3   5
E    -1   0   0   2  -4   2   5
G     0  -2   0  -1  -3  -2  -2   6
H    -2   0   1  -1  -3   0   0  -2   8
I    -1  -3  -3  -3  -1  -3  -3  -4  -3   4
L    -1  -2  -3  -4  -1  -2  -3  -4  -3   2   4
K    -1   2   0  -1  -3   1   1  -2  -1  -3  -2   5
M    -1  -1  -2  -3  -1   0  -2  -3  -2   1   2  -1   5
F    -2  -3  -3  -3  -2  -3  -3  -3  -1   0   0  -3   0   6
P    -1  -2  -2  -1  -3  -1  -1  -2  -2  -3  -3  -1  -2  -4   7
S     1  -1   1   0  -1   0   0   0  -1  -2  -2   0  -1  -2  -1   4
T     0  -1   0  -1  -1  -1  -1  -2  -2  -1  -1  -1  -1  -2  -1   1   5
W    -3  -3  -4  -4  -2  -2  -3  -2  -2  -3  -2  -3  -1   1  -4  -3  -2  11
Y    -2  -2  -2  -3  -2  -1  -2  -3   2  -1  -1  -2  -1   3  -3  -2  -2   2   7
V     0  -3  -3  -3  -1  -2  -2  -3  -3   3   1  -2   1  -1  -2  -2   0  -3  -1   4
X     0  -1  -1  -1  -2  -1  -1  -1  -1  -1  -1  -1  -1  -1  -2   0   0  -2  -1  -1  -1
Negative for less likely substitutions
Positive for more likely substitutions
BLOSUM62
Serine/Threonine protein kinases
catalytic loop
PSSM scores
DAF-1
1
5
7
4
4
Position-Specific Score Matrix
catalytic loop
Pos Res   A   R   N   D   C   Q   E   G   H   I   L   K   M   F   P   S   T   W   Y   V
435 K    -1   0   0  -1  -2   3   0   3   0  -2  -2   1  -1  -1  -1  -1  -1  -1  -1  -2
436 E     0   1   0   2  -1   0   2  -1   0  -1  -1   0   0   0  -1   0   0  -1  -1  -1
437 S     0   0  -1   0   1   1   0   1   1   0  -1   0   0   0   2   0  -1  -1   0  -1
438 N    -1   0  -1  -1   1   0  -1   3   3  -1  -1   1  -1   0   0  -1  -1   1   1  -1
439 K    -2   1   1  -1  -2   0  -1  -2  -2  -1  -2   5   1  -2  -2  -1  -1  -2  -2  -1
440 P    -2  -2  -2  -2  -3  -2  -2  -2  -2  -1  -2  -1   0  -3   7  -1  -2  -3  -1  -1
441 A     3  -2   1  -2   0  -1   0   1  -2  -2  -2   0  -1  -2   3   1   0  -3  -3   0
442 M    -3  -4  -4  -4  -3  -4  -4  -5  -4   7   0  -4   1   0  -4  -4  -2  -4  -1   2
443 A     4  -4  -4  -4   0  -4  -4  -3  -4   4  -1  -4  -2  -3  -4  -1  -2  -4  -3   4
444 H    -4  -2  -1  -3  -5  -2  -2  -4  10  -6  -5  -3  -4  -3  -2  -3  -4  -5   0  -5
445 R    -4   8  -3  -4   0  -1  -2  -3  -2  -5  -4   0  -3  -2  -4  -3  -3   0  -4  -5
446 D    -4  -4  -1   8  -6  -2   0  -3  -3  -5  -6  -3  -5  -6  -4  -2  -3  -7  -5  -5
447 I    -4  -5  -6  -6  -3  -4  -5  -6  -5   3   5  -5   1   1  -5  -5  -3  -4  -3   1
448 K     0   0   1  -3  -5  -1  -1  -3  -3  -5  -5   7  -4  -5  -3  -1  -2  -5  -4  -4
449 S     0  -3  -2  -3   0  -2  -2  -3  -3  -4  -4  -2  -4  -5   2   6   2  -5  -4  -4
450 K     0   3   0   1  -5   0   0  -4  -1  -4  -3   4  -3  -2   2   1  -1  -5  -4  -4
451 N    -4  -3   8  -1  -5  -2  -2  -3  -1  -6  -6  -2  -4  -5  -4  -1  -2  -6  -4  -5
452 I    -3  -5  -5  -6   0  -5  -5  -6  -5   6   2  -5   2  -2  -5  -4  -3  -5  -3   3
453 M    -4  -4  -6  -6  -3  -4  -5  -6  -5   0   6  -5   1   0  -5  -4  -3  -4  -3   0
454 V    -3  -3  -5  -6  -3  -4  -5  -6  -5   3   3  -4   2  -2  -5  -4  -3  -5  -3   5
455 K    -2   1   1   4  -5   0  -1  -2   1  -4  -2   4  -3  -2  -3   0  -1  -5  -2  -3
456 N     1   1   3   0  -4  -1   1   0  -3  -4  -4   3  -2  -5  -2   2  -2  -5  -4  -4
457 D    -3  -2   5   5  -1  -1   1  -1   0  -5  -4   0  -2  -5  -1   0  -2  -6  -4  -5
458 L    -3  -1   0  -3   0  -3  -2   3  -4  -2   3   0   1   1  -2  -2  -3   5  -1  -3
Position-Specific Score Matrix
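Unlike BLOSUM62, a PSSM scores the same residue differently at each position. The sketch below scores a short segment against three rows of the matrix (positions 444 to 446, residues H, R, D), with values read from the slide's columns. The dictionary layout and the function name `pssm_score` are assumptions for illustration.

```python
AA = "ARNDCQEGHILKMFPSTWYV"  # column order of the matrix on the slide

# Rows for positions 444 (H), 445 (R) and 446 (D), transcribed from the slide.
PSSM = {
    444: [-4, -2, -1, -3, -5, -2, -2, -4, 10, -6, -5, -3, -4, -3, -2, -3, -4, -5, 0, -5],
    445: [-4, 8, -3, -4, 0, -1, -2, -3, -2, -5, -4, 0, -3, -2, -4, -3, -3, 0, -4, -5],
    446: [-4, -4, -1, 8, -6, -2, 0, -3, -3, -5, -6, -3, -5, -6, -4, -2, -3, -7, -5, -5],
}


def pssm_score(segment, start):
    """Sum the position-specific scores of `segment` placed at position `start`."""
    return sum(PSSM[start + i][AA.index(res)] for i, res in enumerate(segment))


print(pssm_score("HRD", 444))  # conserved residues
print(pssm_score("HKD", 444))  # R replaced by K at position 445
```

In this matrix the R to K change at position 445 costs 8 points, which is larger than the corresponding drop in BLOSUM62 (an R/R score of 5 versus an R/K score of 2).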
High scores of local alignments between two random sequences
follow the Extreme Value Distribution
Expect Value
Local Alignment Statistics
[Plot: number of alignments vs. Score (S); annotations: your score, expected number of random hits]
E = number of database hits with score ≥ S that you expect to find by chance
$E = K m n\, e^{-\lambda S}$ or $E = m n\, 2^{-S'}$
K = scale for search space
λ = scale for scoring system
m, n = lengths of the query and the database
S’ = bitscore = $(\lambda S - \ln K)/\ln 2$
(applies to ungapped alignments)
More info: www.ncbi.nlm.nih.gov/BLAST/tutorial/Altschul-1.html
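The two forms of the Expect value are the same quantity, because the bit score absorbs K and λ. A minimal numerical check follows. The values of λ, K, m, n, and S are illustrative assumptions, not parameters taken from the slides.

```python
import math


def bit_score(raw, lam, K):
    """S' = (lambda * S - ln K) / ln 2."""
    return (lam * raw - math.log(K)) / math.log(2)


def expect_from_raw(raw, m, n, lam, K):
    """E = K * m * n * exp(-lambda * S)."""
    return K * m * n * math.exp(-lam * raw)


def expect_from_bits(bits, m, n):
    """E = m * n * 2 ** (-S')."""
    return m * n * 2.0 ** (-bits)


lam, K = 0.318, 0.134          # assumed, illustrative statistical parameters
m, n = 250, 1_000_000_000      # hypothetical query and database lengths
S = 100                        # hypothetical raw score

bits = bit_score(S, lam, K)
print(bits, expect_from_raw(S, m, n, lam, K), expect_from_bits(bits, m, n))
```

Because bit scores are normalized this way, they can be compared across searches that used different scoring systems, while raw scores cannot.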

Gapping provides more biologically realistic
alignments
Gapped Alignments

Gapped BLAST parameters are simulated for
each scoring matrix

Affine gap costs = -(a+bk)
a = gap open penalty b = gap extend penalty
A gap of length 1 receives the score -(a+b)
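The affine cost can be written as a one-line function of the gap length k. The values a = 11 and b = 1 used in the example are assumptions chosen for illustration; the slide does not state which values a given search uses.

```python
def affine_gap_score(k, a, b):
    """Score of a gap of length k: -(a + b*k), where a opens and b extends."""
    if k <= 0:
        return 0  # no gap, no cost
    return -(a + b * k)


# Assumed example costs: a = 11 (open), b = 1 (extend).
for k in (1, 2, 5):
    print(k, affine_gap_score(k, a=11, b=1))
```

One long gap is cheaper than several short ones of the same total length, which is the point of separating the open and extend penalties.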
  1 GAATATATGAAGACCAAGATTGCAGTCCTGCTGGCCTGAACCACGCTATTCTTGCTGTTG
    || | || || || |  || || ||   ||  |  ||| |||||| | | || | ||| |
  1 GAGTGTACGATGAGCCCGAGTGTAGCAGTGAAGATCTGGACCACGGTGTACTCGTTGTCG

 61 GTTACGGAACCGAGAATGGTAAAGACTACTGGATCATTAAGAACTCCTGGGGAGCCAGTT
    | || ||     ||  ||| ||  | |||||| || | ||||||  |||||  |     |
 61 GCTATGGTGTTAAGGGTGGGAAGAAGTACTGGCTCGTCAAGAACAGCTGGGCTGAATCCT

121 GGGGTGAACAAGGTTATTTCAGGCTTGCTCGTGGTAAAAAC
    |||| || ||||| ||  ||    |  | ||||  || |||
121 GGGGAGACCAAGGCTACATCCTTATGTCCCGTGACAACAAC

Reason: no contiguous exact match of 7 bp.
An Alignment BLAST Cannot Make
An Alignment BLAST Can Make
Solution: compare protein sequences; BLASTX
BLAST 2 Sequences (blastx) output:
Score = 290 bits (741), Expect = 7e-77
Identities = 147/331 (44%), Positives = 206/331 (61%), Gaps = 8/331 (2%)
Frame = +3
• Megablast
• Discontiguous Megablast
• PSI-BLAST
• PHI-BLAST
Other BLAST Algorithms
• Long alignments of similar DNA sequences
• Greedy algorithm
• Concatenation of query sequences
• Faster than blastn; less sensitive
Megablast: NCBI’s Genome Annotator
Trade-off: sensitivity vs speed
Too fast for
you?
MegaBLAST & Word Size
Trade-off: sensitivity vs speed
WORD SIZE    default   minimum
blastn       11        7
megablast    28        8
blastp       3         2
MegaBLAST & Word Size
• Uses discontiguous word matches
• Better for cross-species comparisons
Discontiguous Megablast
W     t     type          template
11    16    coding        1101101101101101
11    16    non-coding    1110010110110111
12    16    coding        1111101101101101
12    16    non-coding    1110110110110111
11    18    coding        101101100101101101
11    18    non-coding    111010010110010111
12    18    coding        101101101101101101
12    18    non-coding    111010110010110111
11    21    coding        100101100101100101101
11    21    non-coding    111010010100010010111
12    21    coding        100101101101100101101
12    21    non-coding    111010010110010010111
W = word size; # matches in template
t = template length
Reference: Ma, B, Tromp, J, Li, M. PatternHunter: faster and more sensitive homology
search. Bioinformatics March, 2002; 18(3):440-5
Templates for Discontiguous Words
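A template is applied by sliding a window of length t along the sequence and keeping only the bases under a `1`. The sketch below uses the W = 11, t = 16 coding template from the table; the function name `spaced_word` and the test sequence are hypothetical.

```python
def spaced_word(seq, start, template):
    """Return the bases of seq under the '1' positions of template, from `start`."""
    window = seq[start:start + len(template)]
    if len(window) < len(template):
        raise ValueError("window runs past the end of the sequence")
    return "".join(base for base, bit in zip(window, template) if bit == "1")


TEMPLATE = "1101101101101101"   # W = 11, t = 16, coding
seq = "ACGTACGTACGTACGTAC"      # hypothetical sequence

print(spaced_word(seq, 0, TEMPLATE))
print(TEMPLATE.count("1"), len(TEMPLATE))  # W and t
```

The coding templates repeat the pattern `110`, so mismatches at every third position are tolerated, which suits cross-species comparison of coding sequence where the third codon position varies most.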
Discontiguous (Cross-species) MegaBLAST
Discontiguous Word Options
NM_017460
Homo sapiens cytochrome P450, family 3,
subfamily A, polypeptide 4 (CYP3A4),
transcript variant 1, mRNA (2768 letters)
vs Drosophila
MegaBLAST vs Discontiguous MegaBLAST

MegaBLAST = “No significant similarity found.”

Discontiguous megaBLAST =
MegaBLAST vs Discontiguous MegaBLAST
Query: NM_078651
Drosophila melanogaster CG18582-PA (mbt) mRNA, (3244 bp)
/note= mushroom bodies tiny; synonyms: Pak2, STE20, dPAK2
Database: nr (nt),
Mammalia[orgn]

MegaBLAST = “No significant similarity found.”

Discontiguous megaBLAST = numerous hits . . .
Another Example . . .
Ex: Discontiguous MegaBLAST
Ex: BLASTN
Position-specific Iterated BLAST
Example: Confirming relationships of purine
nucleotide metabolism proteins
PSI-BLAST
>gi|113340|sp|P03958|ADA_MOUSE ADENOSINE DEAMINASE (ADENOSINE
MAQTPAFNKPKVELHVHLDGAIKPETILYFGKKRGIALPADTVEELRNIIGMDKPLSLPGF
VIAGCREAIKRIAYEFVEMKAKEGVVYVEVRYSPHLLANSKVDPMPWNQTEGDVTPDDVVD
EQAFGIKVRSILCCMRHQPSWSLEVLELCKKYNQKTVVAMDLAGDETIEGSSLFPGHVEAY
RTVHAGEVGSPEVVREAVDILKTERVGHGYHTIEDEALYNRLLKENMHFEVCPWSSYLTGA
VRFKNDKANYSLNTDDPLIFKSTLDTDYQMTKKDMGFTEEEFKRLNINAAKSSFLPEEEKK
0.005
E value cutoff for PSSM
PSI-BLAST
Same results as protein-protein BLAST; different format
RESULTS: Initial BLASTP
Other purine nucleotide metabolizing enzymes not found by ordinary BLAST
Results of First PSSM Search
Just below threshold, another
nucleotide metabolism enzyme
Check to add to PSSM
Tenth PSSM Search: Convergence
Reverse PSI-BLAST (RPS)-BLAST
AMP Deaminases
.
.
.
Adenosine/AMP Deaminase Domain
>gi|231729|sp|P30429|CED4_CAEEL CELL DEATH PROTEIN 4
MLCEIECRALSTAHTRLIHDFEPRDALTYLEGKNIFTEDHSELISKMSTRLERIANFLRIYRRQASE
LIDFFNYNNQSHLADFLEDYIDFAINEPDLLRPVVIAPQFSRQMLDRKLLLGNVPKQMTCYIREYHV
IKKLDEMCDLDSFFLFLHGRAGSGKSVIASQALSKSDQLIGINYDSIVWLKDSGTAPKSTFDLFTDI
LKSEDDLLNFPSVEHVTSVVLKRMICNALIDRPNTLFVFDDVVQEETIRWAQELRLRCLVTTRDVEI
ASQTCEFIEVTSLEIDECYDFLEAYGMPMPVGEKEEDVLNKTIELSSGNPATLMMFFKSCEPKTFEK
[GA]xxxxGK[ST]
PHI-BLAST
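The pattern `[GA]xxxxGK[ST]` reads as: G or A, any four residues, G, K, then S or T. The sketch below only shows the pattern-matching step, by translating the pattern to a Python regular expression and locating it in a fragment of the CED4 sequence shown above. It is not an implementation of PHI-BLAST, and the helper name `pattern_to_regex` is hypothetical.

```python
import re


def pattern_to_regex(pattern):
    """Translate the slide's pattern syntax: lowercase 'x' is any residue, [..] is a set."""
    return re.compile(pattern.replace("x", "."))


motif = pattern_to_regex("[GA]xxxxGK[ST]")
fragment = "FLFLHGRAGSGKSVIASQ"   # part of the CED4 sequence on the slide
hit = motif.search(fragment)
print(hit.group(), hit.start())
```

PHI-BLAST then restricts its alignments to database sequences that contain the same pattern, seeded at the matching position.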
Genome BLAST
Genome BLAST via Map Viewer
[Diagram: example search pathways. A nucleotide sequence enters by sequence search (Genome BLAST); the term “hemochromatosis” enters by text search. Resources shown: Gene (HFE), Map Viewer, SNP, OMIM, Protein, Domains.]
Example Search Pathways:
Hemochromatosis
Human
EST
TGCCTCCTTTGGTGAAGGTGACACATCATGTGACCTCTTCAGTGAC
CACTCTACGGTGTCGGGCCTTGAACTACTACCCCCAGAAC
ATCACCATGAAGTGGCTGAAGGATAAGCAGCCAATGGATGCCAAG
GAGTTCGAACCTAAAGACGTATTGCCCAATGGGGATGGGAC
CTACCAGGGCTGGATAACCTTGGCTGTACCCCCTGGGGAAGAGC
Example: Human Genome BLAST
Human Genome BLAST: Results
Entrez Gene
Human Genome BLAST: MapViewer
What’s New?
Nucleotide
• refseq_rna = NM_*, XM_*
• refseq_genomic = NC_*, NG_*
• env_nt
– environmental sample[filter], e.g., 16S
rRNA
Protein
• refseq = NP_*, XP_*
• env_nr
BLAST Databases
Select lower case
New Formatter
Select red
• gray line = same database hit
• hsp’s color-coded independently
New Formatter
low complexity sequence filtered
BLAST Output: Alignments & Filter
Limit to Organism
all[filter] NOT ma
Example Entrez Queries
all[Filter] NOT mammalia[Organism]
ray finned fishes[Organism]
srcdb refseq[Properties]
Nucleotide only:
biomol mrna[Properties]
biomol genomic[Properties]
Other Advanced
-e 10000 -v 2000
-e 10000   expect value
-v 2000    descriptions
-b 2000    alignments
Advanced Options
Why search for similar structures?
• Find homologs with low sequence similarity
• Explore protein evolution: similar protein folds
can support different functions
• Identify conserved core elements to model
related proteins of unknown structure
Searching by
Structure
MMDB (Structure): Molecular Modeling Database
• Import only experimentally determined structures
• Convert to ASN.1
• Create “backbone” model (Cα, P only)
• Verify sequences
• Create single-conformer model
Add secondary structure
Add chemical bonds
id 1 ,
name "helix 1" ,
type helix ,
location
subgraph
residues
interval {
{ molecule-id 1 ,
from 49 ,
to 61 } } } ,
inter-residue-bonds {
{
atom-id-1 {
molecule-id 1 ,
residue-id 1 ,
atom-id 1 } ,
atom-id-2 {
molecule-id 1 ,
residue-id 2 ,
atom-id 9 } } ,
Indexing into MMDB
Structure Summary
Structure Neighbors
Conserved Domains
3D Domain Neighbors
4
3
2
1
3D Domains
SH2
SH3
TyrKc
Conserved Domains
For each protein chain:
• locate SSEs (secondary structure elements),
• represent SSEs as individual vectors,
• align the vectors.
[Figure: Human IL-4 with SSEs numbered 1-6; IL-4 & Leptin]
VAST: Alignment
Taq DNA polymerase
Structure neighbors
VAST
Table view
VAST Results for the Chain
Vector Alignment Search Tool
3D Domain structure neighbors
VAST
Not found with Chain
query!
VAST Results for Domain 1
Best way to convert PDB files to MMDB format for viewing with Cn3D!
submit file to PDB
>forward
CCATGGCGACCCTGGAAAAGC
?
?
>reverse
CAGCAGCGGCTGTGCCTGCGG
?
Example: Mapping Oligos Onto a Genome
>CCATGGCGACCCTGGAAAAGCNNNNNNNNNNCAGCAGCGGCTGTGCCTGCGG
forward primer
-W 7 -e 1000
reverse primer
Map Oligos Onto Genome
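The query on this slide is the two primers joined by a run of Ns, so that one search reports both primer sites. A minimal sketch of building that query follows. The function name `join_primers` is hypothetical, and the default spacer length of ten is an assumption about the N run shown on the slide.

```python
def join_primers(forward, reverse, spacer_len=10):
    """Concatenate two primers separated by a run of N (hypothetical helper)."""
    return forward + "N" * spacer_len + reverse


forward = "CCATGGCGACCCTGGAAAAGC"
reverse = "CAGCAGCGGCTGTGCCTGCGG"
query = join_primers(forward, reverse)
print(query)
print(len(query))
```

The short word size and high Expect threshold on the slide (`-W 7 -e 1000`) matter here because each primer is only 21 bases long, so hits would otherwise be missed or filtered out.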
Genome BLAST Results
Primer Alignments
reverse primer
forward primer
MapViewer
MapViewer
forward
reverse
Sequence View (sv)
• BLAST: [email protected]
• General Help: [email protected]
• Wayne Matten: [email protected]
Service Addresses