ideas / suggestions

Download Report

Transcript ideas / suggestions

Some Ideas on Final Project
Feature extraction
TGGCCGTACGAGTAACGGACTGGCTGTCTTCTCGT
CCGATACCCCCCACGCGAAACCCTACACATCAAAT
AGCTAACTAGAGTCACTCCTTAGGATAGTGAGCGT
AGACAAGAATCAATGCTCGCCCCCGGGTACTGAAT
GTAGGACAACAATATTGGTCCGGTGGTACCGGTAC
ACGCGGGTGGCGGCATGGTGCTCCGAAAGTGTTGT
CTCATATCCTACGCGGCCCCAACTATTAGCTCATG
TGCTCCTTTCGCGGTCCAGCAGGCAAGCGAAAGAC
n
p
n
p
n
n
p
n
AAAA
AAAC
…
AACG
AACT
…
ACGG
…
TTTT
Label
0
0
…
1
0
…
1
…
0
n
K-mer based features
• Exact k-mers: k = 1 to 10
• Larger k
–
–
–
–
More accurate representation: accurate classification
More features: overfitting => lower accuracy
Less efficient (both time and memory)
Feature selection may be useful
• Smaller k
– Less features: less likely to overfit, shorter runing time
– Less accurate representation: inaccurate classification
– More efficient (both time and memory)
• K-mers with degenerate chars (wild cards)
• K-mers with mismatches
• Two-block kmers: e.g., ACGN{1-3}CGT
PWM-based features
• Can represent all forms in the previous
slides in some way
• Key: how to get them?
– Cluster top-ranked k-mers
– Use existing tools (e.g., MEME or alignACE)
• Find motifs from top positive sequences (for
efficiency)
• Use MAST (in MEME) or ScanACE (in alignACE)
to find matches in all sequences.
– HMMs for learning and classification
Submission data format
TGCTCCTTTCGCGGTCCAGCAGGCAAGCGAAAGAC
AGCTAACTAGAGTCACTCCTTAGGATAGTGAGCGT
AGACAAGAATCAATGCTCGCCCCCGGGTACTGAAT
GTAGGACAACAATATTGGTCCGGTGGTACCGGTAC
CTCATATCCTACGCGGCCCCAACTATTAGCTCATG p
n
n
p
n
0.523
0.005
0.345
0.985
0.489
Input format for weka
@relation experiment
@attribute seq string
@attribute AAAA numeric
@attribute AAAC numeric
@attribute AAAG numeric
…
@attribute TTTT numeric
@attribute class {n,p}
@data
TGCTCCTTTCGCGGTCCAGCAGGCAAGCGAAAGAC 0 1 1 0 0 …. p
AGCTAACTAGAGTCACTCCTTAGGATAGTGAGCGT 1 0 0 1 1…. n
Output format from weka
With option –p 0 –distribution
(you could try –
p 1 to get the seq printed)
inst# actual predicted error distribution
1
2:p
1:n + *1,0
2
2:p
1:n + *0.868,0.132
3
1:n
1:n
*1,0
…
12
1:n
1:n
*0.996,0.004
More about “–p 1”
• With the seq as the first attribute, you’ll get an
UnsupportedAttributeTypeException: Cannot handle string attributes
• Solution – use an unsupervised attribute filter to remove it:
java -cp weka.3.6.6.jar weka.classifiers.meta.FilteredClassifier
-F "weka.filters.unsupervised.attribute.RemoveType -T string"
-W weka.classifiers.trees.J48
-t TF_10_data_1.4mer.arff -T TF_10_data_2.4mer.arff -p 1 -distribution
-- -C 0.1
• This has the same effect of calling
java -cp weka.3.6.6.jar weka.classifiers.trees.J48 -C 0.1
-t TF_10_data_1.4mer.arff -T TF_10_data_2.4mer.arff -p 0 -distribution
(but without having the seq as the first attribute in the .arff file. And of
course you won’t have the seq in your output file.)
About MEME
• Input seqs in FASTA format:
> seq1
ATGACGTTACGTAATCCGTGATTATCCCGGCGTAC
> seq2
CAGAATGGAACTTAAGGGAGAATTGTGCAATATTA
> seq3
CCAACCAGGATGTTGTGCAACCGGTTCTTTTTATA
> seq4
CTTGGTTGCGTCATGTCCTGGGATGCATTTACTAT
> seq5
GAGATCTCCCTCTTATGTTGACATCCGTAAGCCCA
> seq6
GATCGTGATGCATTGACTTGCGTAATAGTGTTATG
> seq7
GCTGCAGAGATGGGTCATTATGTAATGTTGCCCCC
Running MEME
<ROOT>/meme/bin/meme tf_1_top_100p.fasta -dna -minw 6 -maxw 15
-minsites 10 -mod zoops -nmotifs 5 -text > meme.out.txt
tf_1_top_100p.fasta: my input was the top 100 positive seqs for TF_1
(you should try something differently: e.g. randomly split the positive
seqs into 10 subsets)
-dna: input is dna seq
-minw 6: minimum motif length = 6
-maxw 6: maximum motif length = 15
-minsites: motif occurs in at least 10 sequences
-mod zoops: each seq contains zero or one copy of the motif
-nmotifs 5: return top 5 motifs
-text: output is in plain text format (as opposed to HTML format)
Output from MEME
********************************************************************************
MOTIF 1
width =
10 sites = 99
llr = 717
E-value = 3.8e-141
********************************************************************************
-------------------------------------------------------------------------------Motif 1 Description
-------------------------------------------------------------------------------Simplified
A 5::2:12991
pos.-specific
C 1:::6:51:3
probability
G 4:16:9:::1
matrix
T :a924:3::5
bits
Information
content
(10.3 bits)
Multilevel
consensus
sequence
2.2
2.0
1.7
1.5
1.3
1.1
0.9
0.7
0.4
0.2
0.0
*
*
*
* **
*
* **
** ** **
** ** **
***** **
*********
**********
---------ATTGCGCAAT
G AT T C
A
Running MAST
<ROOT>/meme/bin/mast meme.out.txt -d TF_1_allseq.fasta -text –norc
–m <k> -ev 100 -mt 0.01 –b [-stdout > mast_out.txt]
• -norc: do not search motif from the reverse-complementary strand
• -m <k>: find matches to the k-th motif present in meme.out.txt (vary k
from 1 to max number of motifs)
– Easier to use a scripting language to loop through k and parse the output
• -ev 100: use an E-value cutoff = 100. this allows sequences with poor
matches to be output. (default is -ev 10)
• -mt 0.01: similar to –ev 100. (default is –mt 0.0001)
• -b: simplify output
• -stdout: redirect the output into a given file instead of the default file
(mast.<motiffilename>.<seqfilename>.<paraValues>)
Output from MAST
• Only need Section 1 in output
SEQUENCE NAME DESCRIPTION
E-VALUE LENGTH
------------------------------ -----seq22
0.11 35
seq13
0.11 35
seq72
0.3 35
seq19
0.3 35
seq29
0.38 35
…
seq446
96 35
seq298
96 35
seq264
96 35
seq251
96 35
• Larger e-values correspond to lower scores (poorer matches).
• Seq not in the output has lowest score.
• Therefore, need to transform e-values to scores.
• E.g.: if seq not in output: score = 0; else score = max_evalue – evalue (max_evalue =
100 was specified by the –ev parameter.)
• Or score = 1 / evalue.
AlignACE and ScanACE
• AlignACE:
AlignACE –i tf_1_top_100p.fasta > tf_1_top_100p.ace
• ScanACE:
ScanACE -i tf_1_top_100p.ace -z tf_1_data_1.fasta -c 1 -o <result>
-c: allows more motif matches to be output
Results saved in:
<result>_n1.scn, <result>_n2.scn, etc.
Higher scores mean better match. No transformation needed.