NLP and Big Data
Shanxi HPC Research Center
NLP and Big Data
Xiaoge LI
[email protected]
WBDB2013, Xi’an, China
Introduction
The Internet is a big, unstructured knowledge base
NLP & IE “understand” human language
Unstructured data → structured data
Problems
Human language changes
"Let's Google it!"
Net language (LOL, 给力 "awesome")
Compound words (JFK airport)
Domain knowledge
Domain-specific training sets
Chinese tokenization
小菊/nr 的/u 生活/vn 很/d 给/v 力/vg
小菊/nr 的/u 生活/vn 很/d 给力/a
("Xiaoju's life is awesome": the net word 给力 should be one adjective token, not 给/v + 力/vg)
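The effect of a missing lexicon entry can be sketched with dictionary-based forward maximum matching. This is a common baseline segmenter, used here only for illustration; it is not necessarily the method behind the slides. The function name forward_max_match and the toy lexicon are assumptions.

```python
def forward_max_match(text, lexicon, max_len=4):
    """Greedy left-to-right segmentation: take the longest lexicon match."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first; fall back to a single character.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in lexicon:
                tokens.append(candidate)
                i += size
                break
    return tokens


# Toy lexicon (assumption): it does not yet know the net word 给力.
base_lexicon = {"小菊", "的", "生活", "很", "给", "力"}
sentence = "小菊的生活很给力"

before = forward_max_match(sentence, base_lexicon)
after = forward_max_match(sentence, base_lexicon | {"给力"})

print("/".join(before))  # 给力 is split into two tokens
print("/".join(after))   # 给力 is kept as one token
```

Once the new word is in the lexicon, the same sentence segments correctly, which is why new-word acquisition from large corpora matters for tokenization.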
NLP needs big data
Unsupervised (weakly supervised) learning
Knowledge acquisition
Relationships
New words
NE gazetteer
System Architecture
Knowledge acquisition
NLP & IE
HDFS
Information fusion
Entity graph
MapReduce
HBase
Linux cluster
Knowledge acquisition
Large-scale corpus from the Web
Weakly supervised learning
Bootstrapping technique
MapReduce, HBase
Location NE and new words: P = 87.28%, 72.1%
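The bootstrapping idea can be sketched as follows: start from a few seed entities, learn the contexts they appear in, then use those contexts to harvest new entities, and repeat. This is a minimal single-machine sketch on a made-up English corpus; the function name bootstrap, the left-context pattern, and the capitalization check are all assumptions, and the slides do not describe the actual patterns or the MapReduce implementation.

```python
from collections import Counter


def bootstrap(corpus, seeds, rounds=2, min_pattern_count=2):
    """Grow an entity list from seeds using left-context word patterns."""
    known = set(seeds)
    for _ in range(rounds):
        # Step 1: count the words that directly precede known entities.
        contexts = Counter()
        for sentence in corpus:
            words = sentence.split()
            for prev, word in zip(words, words[1:]):
                if word in known:
                    contexts[prev] += 1
        patterns = {c for c, n in contexts.items() if n >= min_pattern_count}

        # Step 2: harvest capitalized words that follow a trusted pattern.
        harvested = set()
        for sentence in corpus:
            words = sentence.split()
            for prev, word in zip(words, words[1:]):
                if prev in patterns and word[0].isupper():
                    harvested.add(word)

        if harvested <= known:
            break  # nothing new was found
        known |= harvested
    return known


# Toy corpus (made up for illustration).
corpus = [
    "He lives in Beijing",
    "She works in Shanghai",
    "They met in Qingdao",
    "He flew to Xian",
    "She flew to Beijing",
    "They drove to Qingdao",
]
locations = bootstrap(corpus, seeds={"Beijing", "Shanghai"})
print(sorted(locations))
```

In the first round only "in" is frequent enough to be trusted, which adds Qingdao; that in turn makes "to" a trusted pattern in the second round, which adds Xian.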
Chinese NLP & IE engine
Pipeline
FST & statistical mixture model
Input: plain text
Output: structured XML
MapReduce
Speed: 500 KB/s on 10 nodes
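The plain-text-in, XML-out pipeline shape can be sketched as below. The stages are stand-ins: whitespace tokenization and a gazetteer lookup replace the FST and statistical models, and the element and attribute names (document, token, ne) are assumptions, since the slides do not show the actual XML schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical gazetteer; the real engine uses FST and statistical models.
GAZETTEER = {"Beijing": "Location", "Alice": "Person"}


def tokenize(text):
    """Stage 1: split plain text into tokens (whitespace only, for brevity)."""
    return text.split()


def tag_entities(tokens):
    """Stage 2: attach a named-entity label where the gazetteer has one."""
    return [(tok, GAZETTEER.get(tok)) for tok in tokens]


def to_xml(tagged):
    """Stage 3: serialize the tagged tokens as structured XML."""
    root = ET.Element("document")
    for tok, label in tagged:
        element = ET.SubElement(root, "token")
        element.text = tok
        if label is not None:
            element.set("ne", label)
    return ET.tostring(root, encoding="unicode")


def run_pipeline(text):
    """Chain the stages: plain text in, XML string out."""
    return to_xml(tag_entities(tokenize(text)))


xml_output = run_pipeline("Alice visits Beijing")
print(xml_output)
```

Because each document is processed independently, a function like run_pipeline is the kind of unit that could run inside a map task, which fits the MapReduce deployment named on the slide.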
Information object
Profile and Event
Information Object
Named Entity: Person, Organization, Location, Product, Time
Event: Pre-defined Event, General Event
Example Profile
In a concept-based profile, the attributes are filled by its participant profiles.
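One possible reading of this structure, sketched as data classes: an entity profile describes a single named entity, and a concept-based profile holds role attributes whose values are other profiles. The class names, field names, and role labels below are assumptions; the slides do not give the actual schema.

```python
from dataclasses import dataclass, field


@dataclass
class Profile:
    """An information object for one named entity."""
    name: str
    ne_type: str  # e.g. "Person", "Organization", "Location"


@dataclass
class EventProfile:
    """A concept-based profile whose attributes point at participant profiles."""
    concept: str
    attributes: dict = field(default_factory=dict)  # role name -> Profile

    def fill(self, role, participant):
        """Fill one attribute slot with a participant profile."""
        self.attributes[role] = participant


# Hypothetical example data.
visitor = Profile("Alice", "Person")
place = Profile("Xi'an", "Location")

visit = EventProfile("visit")
visit.fill("who", visitor)
visit.fill("where", place)

for role, participant in visit.attributes.items():
    print(role, participant.name, participant.ne_type)
```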
Information Network
NLP
• Tokenization
• POS
• Shallow parsing
• Deep parsing
IE
• NE tag
• CE linkage
• NE Profile
• Profile Merge
Cross-document information fusion
• Information Object network
• Vertex: Profile
• Edge: relationship
Cross-document information fusion
Hierarchical clustering
MapReduce, HBase
Half a million profiles
Computing complexity
P = 94.65%, R = 88.24%, F = 91.33%
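Merging profiles across documents by hierarchical clustering can be sketched as a simple agglomerative loop: repeatedly merge the two most similar clusters until no pair is similar enough. The Jaccard similarity over attribute sets, the threshold, and the function names are assumptions chosen for illustration; the slides do not state the similarity measure or linkage actually used.

```python
def jaccard(a, b):
    """Overlap between two feature sets, from 0.0 to 1.0."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def merge_profiles(profiles, threshold=0.5):
    """Agglomerative clustering over per-document profiles.

    Each cluster is (member indices, union of member features).
    """
    clusters = [({i}, set(features)) for i, features in enumerate(profiles)]
    while True:
        best_sim, best_pair = 0.0, None
        # Compare every pair of clusters: quadratic in the cluster count.
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                sim = jaccard(clusters[a][1], clusters[b][1])
                if sim > best_sim:
                    best_sim, best_pair = sim, (a, b)
        if best_pair is None or best_sim < threshold:
            break
        a, b = best_pair
        merged = (clusters[a][0] | clusters[b][0],
                  clusters[a][1] | clusters[b][1])
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)]
        clusters.append(merged)
    return sorted(sorted(members) for members, _ in clusters)


# Toy profiles from four documents (made up); names alone are ambiguous.
profiles = [
    {"Li Ming", "Beijing", "professor"},
    {"Li Ming", "Beijing", "university"},
    {"Li Ming", "Qingdao", "port"},
    {"Wang Fang", "Shanghai"},
]
groups = merge_profiles(profiles)
print(groups)
```

The all-pairs comparison in each pass is presumably the computing-complexity concern at half a million profiles, and a plausible reason for distributing the work with MapReduce and HBase.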
Information graph: multi-dimensional
Orange: Location
Gray: Organization
Blue: Person
Source: People’s Daily, 2012
Query: China Agricultural University
Expand: 1 level
Organization-Organization Network
Query: China Agricultural University, filter: Organization
Location-Person Network
Query: 青岛港 (Port of Qingdao), filter: Location
Person-Location Network
Query: 金日成 (Kim Il-sung)
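The query pattern in these views (start at one profile, expand a number of levels, keep only neighbors of one entity type) can be sketched with an in-memory graph. The class name InformationGraph, its methods, and every relationship in the toy data are assumptions for illustration; the actual system stores the graph in HBase and its query interface is not shown in the slides.

```python
from collections import defaultdict


class InformationGraph:
    """Vertices are profiles (name + entity type); edges are relationships."""

    def __init__(self):
        self.types = {}
        self.edges = defaultdict(set)

    def add_profile(self, name, ne_type):
        self.types[name] = ne_type

    def relate(self, a, b):
        """Add an undirected relationship edge between two profiles."""
        self.edges[a].add(b)
        self.edges[b].add(a)

    def expand(self, query, levels=1, ne_filter=None):
        """Return profiles within `levels` hops, optionally filtered by type."""
        seen = {query}
        frontier = {query}
        for _ in range(levels):
            neighbors = set()
            for vertex in frontier:
                neighbors |= self.edges.get(vertex, set())
            frontier = neighbors - seen
            seen |= frontier
        result = seen - {query}
        if ne_filter is not None:
            result = {n for n in result if self.types[n] == ne_filter}
        return result


# Toy data: these relationships are invented, not taken from the corpus.
graph = InformationGraph()
graph.add_profile("China Agricultural University", "Organization")
graph.add_profile("Ministry of Education", "Organization")
graph.add_profile("Beijing", "Location")
graph.add_profile("Alice", "Person")
graph.relate("China Agricultural University", "Ministry of Education")
graph.relate("China Agricultural University", "Beijing")
graph.relate("Beijing", "Alice")

orgs = graph.expand("China Agricultural University", levels=1,
                    ne_filter="Organization")
print(orgs)
```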
Future Work
Query language
Graph mining
Enhance the NLP engine
Visualization
Questions?
Thank you