Data Mining for BioInformatics at Ewha CSE

Dec. 14, 2001
Hwan-Seung Yong
(Gene: ACTGAAAGGGCTCTCAAA)
Dept. of Computer Science & Engineering
Ewha Womans Univ.
BioInformatics and Computer Science
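• Computer: binary system (0/1), designed by humans
• Living things: quaternary system (A/G/C/T), designed by nature
• Advances in computer technology
– Data analysis + databases = data mining (at present)
– High-performance parallel computing technology
– Distributed processing and Web/XML technology
– Emergence of knowledge management technology
For BioInformatics
• Why humans built computers
– To find the secrets of life contained in the quaternary code
– To challenge the realm of God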
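The binary/quaternary comparison above can be made concrete: one base carries two bits. The following is a minimal sketch, not part of the original slides; the bit assignment (A=00, C=01, G=10, T=11) and the function names are arbitrary choices made for illustration.

```python
# Assumed mapping: any one-to-one assignment of the four bases to 2-bit values works.
BASE_TO_BITS = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}
BITS_TO_BASE = {bits: base for base, bits in BASE_TO_BITS.items()}


def encode(seq: str) -> int:
    """Pack a DNA string into an integer, two bits per base."""
    value = 0
    for base in seq.upper():
        value = (value << 2) | BASE_TO_BITS[base]
    return value


def decode(value: int, length: int) -> str:
    """Unpack `length` bases; the length is needed because leading A's are zero bits."""
    bases = []
    for _ in range(length):
        bases.append(BITS_TO_BASE[value & 0b11])
        value >>= 2
    return "".join(reversed(bases))


gene = "ACTGAAAGGGCTCTCAAA"  # the sequence shown on the title slide
packed = encode(gene)
assert decode(packed, len(gene)) == gene
print(len(gene), "bases fit in", 2 * len(gene), "bits")
```

The round trip shows that a DNA string and a bit string are two notations for the same information, which is the point of the analogy.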
BioInformatics and Computer Science
• BioInformatics
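– Development of DNA code readers (biotechnology) and alignment techniques
• The complete genome sequence has only just been produced
– The task now is to find meaning (genes, etc.) in it
– This is like recovering source code from a binary object
• Experts in disassemblers and reverse engineering are needed
– Data mining is an important applicable technology
Computer System: Binary Code → Assembly Code → Source Code
Living Things (Nature): DNA Sequence → Gene → Protein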
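To illustrate the "disassembler" analogy, here is a minimal sketch that scans a raw DNA string for open reading frames (an ATG start codon followed in frame by a stop codon) and translates them with the standard genetic code. It is a toy under stated assumptions: forward strand only, input restricted to A/C/G/T, and the function name find_orfs and its min_codons parameter are hypothetical. Real gene prediction is far more involved.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order; "*" marks a stop codon.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def find_orfs(seq, min_codons=2):
    """Return (start, end, protein) for each ATG...stop stretch on the forward strand.

    Assumes the sequence contains only A, C, G and T.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # first start codon opens a candidate reading frame
            elif start is not None and CODON_TABLE[codon] == "*":
                protein = "".join(
                    CODON_TABLE[seq[j:j + 3]] for j in range(start, i, 3)
                )
                if len(protein) >= min_codons:
                    orfs.append((start, i + 3, protein))
                start = None  # look for the next start codon in this frame
    return orfs


for start, end, protein in find_orfs("CCATGGCTTGGTAACC"):
    print(start, end, protein)
```

Each result plays the role of a "decompiled" unit: the DNA span corresponds to the binary, and the amino-acid string to the recovered higher-level form.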
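Why Ewha CSE is appropriate for BioInformatics
• Recent focus of CSE's Research Area
– As a BK Project Plan: Knowledge Engineering Framework
– Data Warehousing and OLAP
– Data Mining
– XML Technology
– Knowledge Engineering Enabling Technology
– Knowledge Engineering Application
• Electronic Commerce
• BioInformatics
• Related research institutes at Ewha
– Graduate School of Molecular Life Sciences (BK)
– Korea Science and Engineering Foundation SRC (Cell Signal Transduction Center)
– Ministry of Information and Communication Computer Graphics/Virtual Reality Research Center
• Prior related work (direct)
– Development of a gene search and automatic analysis program for the Prosecutors' Office
– Development of a genetic information management system for the National Institute of Scientific Investigation
Automatic gene analysis program
Genetic band recognition and code registration program
DNA Locus Registration Interface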
Data Warehousing, OLAP and Data Mining
• Data Warehousing and OLAP
– ETL Methodology (Extraction, Transformation and Loading)
– Data Warehouse Architecture
– OLAP Server Development
– Multidimensional Data Processing
– Metadata Handling
– Data Quality Control
• Data Mining
– Classification and Analysis of Data Mining Techniques
– Clustering Algorithm
– Association Algorithm
– Classification Algorithm
– CRM Application based on Web Log Mining
– Text Mining for XML Data
XML and Supporting Technology
• XML Related Area
– XML Server Development
• Query Processing and Storage System
– XML document Mining
• Knowledge Enabling Technology
– Multimedia High-Speed Network
– Component-based Software Engineering
– Security
– Multimedia DBMS
– Natural Language Processing
– Computer Graphics and Virtual Reality
Research Requirement for BioInformatics
• Large Volume of Data, including multimedia data
• High-Performance Computing System
– Massively Parallel Processing Hardware and Software
• XML related work is important
– For exchange of bio data
– Gene Annotation
• Web-based collaborative system
– Requires web-based interoperable applications and standards
– Distributed processing technique
• CORBA, SOAP, Microsoft .NET framework
• Data Mining
– For Gene Prediction, Functional Genomics
Bio Data Mining Research
• XML Standard for Bio Data
• Graphical User Interface for XML Data
• Data Converter to XML
– Convert existing bio data to an XML standard
– Convert between XML standards
• Integration Methodology with Existing DB
– SOAP (Simple Object Access Protocol)
– WSDL (Web Services Description Language)
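As an illustration of the "Data Converter to XML" item, the sketch below turns FASTA text into a small XML document using Python's standard library. The element and attribute names (sequences, sequence, id, length, description, residues) are hypothetical placeholders chosen for this example; they do not follow any published bio XML standard, and a real converter would target that standard's schema instead.

```python
import xml.etree.ElementTree as ET


def _record(header, chunks):
    """Split a FASTA header into identifier and description, and join sequence lines."""
    ident, _, desc = header.partition(" ")
    return ident, desc, "".join(chunks)


def parse_fasta(text):
    """Yield (identifier, description, residues) for each FASTA record."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield _record(header, chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield _record(header, chunks)


def fasta_to_xml(text):
    """Serialize FASTA records as XML; the tag names are illustrative only."""
    root = ET.Element("sequences")
    for ident, desc, residues in parse_fasta(text):
        node = ET.SubElement(root, "sequence", id=ident, length=str(len(residues)))
        ET.SubElement(node, "description").text = desc
        ET.SubElement(node, "residues").text = residues
    return ET.tostring(root, encoding="unicode")


print(fasta_to_xml(">seq1 demo gene\nACTGAAAGG\nGCTCTCAAA\n"))
```

Keeping the parser separate from the serializer means the same parsed records could be written out in a different XML vocabulary by swapping only fasta_to_xml.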
XML Standard for Bio Data
• Before
– FASTA format, GenBank format, GFF(General Feature Format)
• XML Format
– AGAVE (Architecture for Genomic Annotation, Visualization and Exchange)
• Developed by DoubleTwist, Inc.
• Released in June 2000
• Open source licence in August 2001
• AGAVE version 3.2 with Prophecy 3.0 in Sept. 2001
• See http://www.agavexml.org
• Genome XML Viewer by Labbook
– BSML
XML standard for Bio Data
• BioXML Standard and GAME
– An open-source/free software organization dedicated to providing a set of standard XML formats for the exchange of biological data
• GAME (Genomic Annotation Markup Language)
– Created at BDGP (Berkeley Drosophila Genome Project)
– Current version 1.1 released in March 2000
– http://www.bioxml.org
– Follows the WikiWeb scheme
• Collaborative web site that can be edited by anyone
• Community documentation system
• Everyone can edit the shared web pages
Computer Theory and Security Laboratory
Whole genome sequence
annotation
Known gene
Unknown gene
• Sequence similarity
• Neural networks
• Hidden Markov models
Unknown gene prediction
Microarray data analysis
Phylogenetic prediction
Phylogeny inference
Phylogenetic analysis
Comparative genomics
Data mining tools
Two samples comparison
Phylogenetic Tree Visualization
• Tree drawing algorithms
• Graph drawing algorithms
Clustering
classification tools
Multiple samples comparison
New algorithm design
• Simulated annealing
• Other optimization techniques
Open Source Project
• Open BioInformatics Foundation
– http://www.open-bio.org
– Umbrella group for the various bio*.org groups
• bioxml.org, bioperl.org, biopython.org, biojava.org, biocorba.org
• biopathways.org
• bio-ensembl.org
– Annotation for human genome
– The First Bioinformatics Open Source Conference (BOSC'2001) was held in August 2001 in San Diego.
– Many Open System Activities
Vision and Future Prediction
• Ewha will
– Contribute to the bio data mining area
– Have a bioinformatics institute or research center
– Have a strong relationship with the bio-industry
• Closing Comment
ATGCCGTCGGGCCCCGGGGC
=> "Thank You" expressed in the quaternary system