NLP and Big Data

Shanxi HPC Research Center
NLP and Big Data
Xiaoge LI
WBDB2013, Xi’an, China
Introduction

The Internet is a big knowledge base
 Unstructured

NLP & IE
 "Understand" human language
 Unstructured data to structured data
Problems

Human language changes
 "Let's Google it!"
 Net language (LOL, 给力 "awesome")
 Compound words (JFK airport)

Domain knowledge
 Domain-specific training sets
Chinese tokenization (example sentence: 小菊的生活很给力, "Xiaoju's life is awesome")
 小菊/nr 的/u 生活/vn 很/d 给/v 力/vg
 小菊/nr 的/u 生活/vn 很/d 给力/a
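The two segmentations above differ only in whether the new word 给力 is known. A minimal sketch of why the lexicon matters, assuming a simple forward maximum matching segmenter (a stand-in chosen for illustration, not the engine described in these slides; the lexicons and the `forward_max_match` helper are hypothetical):

```python
def forward_max_match(text, lexicon, max_len=4):
    """Greedy longest-match segmentation; unknown characters become single tokens."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, then shrink down to one character.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in lexicon:
                tokens.append(candidate)
                i += size
                break
    return tokens


sentence = "小菊的生活很给力"
old_lexicon = {"小菊", "生活"}          # hypothetical lexicon without the new word
new_lexicon = old_lexicon | {"给力"}    # same lexicon after the new word is acquired

print(forward_max_match(sentence, old_lexicon))  # ['小菊', '的', '生活', '很', '给', '力']
print(forward_max_match(sentence, new_lexicon))  # ['小菊', '的', '生活', '很', '给力']
```

Without the entry, 给力 falls apart into two single characters, which is the first segmentation on the slide.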
NLP needs big data

Unsupervised (weakly supervised) learning
 Knowledge acquisition
 Relationships
 New words
 NE gazetteer
System Architecture
[Architecture diagram. Components: Knowledge acquisition, NLP & IE, Information fusion, Entity graph, MapReduce, HDFS, HBase, Linux cluster]
Knowledge acquisition
Large-scale corpus from the Web
 Weakly supervised learning
 Bootstrapping technique
 MapReduce, HBase
 Location NE and new words: P = 87.28%, 72.1%
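The bootstrapping idea can be sketched as a loop that alternates between learning context patterns from known entities and using those patterns to find new ones. This is only an illustration under stated assumptions: the toy corpus, the seed set, the (previous word, next word) pattern shape, and the `bootstrap` function are all hypothetical, and the real system runs on MapReduce and HBase rather than in memory.

```python
from collections import Counter


def bootstrap(sentences, seeds, rounds=2, min_pattern_count=2):
    """Grow a set of entity names from a few seeds using context patterns."""
    known = set(seeds)
    for _ in range(rounds):
        # Step 1: count the (previous word, next word) contexts of known entities.
        patterns = Counter()
        for tokens in sentences:
            for i in range(1, len(tokens) - 1):
                if tokens[i] in known:
                    patterns[(tokens[i - 1], tokens[i + 1])] += 1
        trusted = {p for p, count in patterns.items() if count >= min_pattern_count}

        # Step 2: every token seen inside a trusted context becomes a candidate.
        found = set()
        for tokens in sentences:
            for i in range(1, len(tokens) - 1):
                if (tokens[i - 1], tokens[i + 1]) in trusted:
                    found.add(tokens[i])

        if found <= known:  # nothing new was learned, so stop early
            break
        known |= found
    return known


corpus = [
    "he lives in Xian city".split(),
    "she works in Beijing city".split(),
    "they met in Qingdao city".split(),
    "flights to Beijing airport".split(),
    "flights to Shanghai airport".split(),
]
locations = bootstrap(corpus, seeds={"Xian", "Beijing"})
print(sorted(locations))  # ['Beijing', 'Qingdao', 'Xian']
```

"Shanghai" is not picked up here because the "to ... airport" context is supported by only one known entity, below the `min_pattern_count` threshold. Thresholds of this kind trade recall for precision.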

Chinese NLP & IE engine
Pipeline
FST & statistical mixture model
Input: plain text
Output: structured XML
MapReduce
Speed: 500 KB/s on 10 nodes
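A pipeline that takes plain text and emits structured XML might look roughly like the sketch below. Everything in it is an assumption made for illustration: the whitespace tokenizer, the tiny gazetteer, and the `document`/`token`/`ne` XML names are hypothetical and do not reflect the engine's actual schema or models.

```python
import xml.etree.ElementTree as ET

# Hypothetical gazetteer standing in for the real NE tagging stage.
GAZETTEER = {"Beijing": "Location", "Xiaoge": "Person"}


def tokenize(text):
    """Stage 1: split plain text into tokens (whitespace only, for illustration)."""
    return text.split()


def tag_entities(tokens):
    """Stage 2: attach an entity label to each token, or None if unknown."""
    return [(token, GAZETTEER.get(token)) for token in tokens]


def to_xml(tagged):
    """Stage 3: serialize the tagged tokens as an XML string."""
    doc = ET.Element("document")
    for token, label in tagged:
        node = ET.SubElement(doc, "token")
        node.text = token
        if label is not None:
            node.set("ne", label)
    return ET.tostring(doc, encoding="unicode")


def run_pipeline(text):
    """Chain the stages: plain text in, structured XML out."""
    return to_xml(tag_entities(tokenize(text)))


print(run_pipeline("Xiaoge visited Beijing"))
```

Because each stage is a pure function of one document, a pipeline shaped like this can be applied to documents independently, which is what makes it a natural fit for a MapReduce map phase.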
Information object: Profile and Event
Information Object
 Named Entity: Person, Organization, Location, Product, Time
 Event (事件): Pre-defined Event, General Event
Example Profile
In a concept-based profile, the attributes are filled by its participant profiles.
Information Network
NLP
• Tokenization
• POS
• Shallow parsing
• Deep parsing
IE
• NE tag
• CE linkage
• NE Profile
• Profile Merge
Cross-document information fusion
• Information Object network
• Vertex: Profile
• Edge: relationship
Cross-Document Information Fusion
Hierarchical clustering
 MapReduce, HBase
 Half a million profiles
 Computing complexity
 P = 94.65%, R = 88.24%, F = 91.33%
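As a rough illustration of merging profiles across documents, the sketch below greedily merges the two most similar clusters until no pair is similar enough. It is a simplified agglomerative scheme under stated assumptions: the attribute strings, the Jaccard similarity, the 0.5 threshold, and the `merge_profiles` helper are hypothetical, not the method used in the reported system.

```python
def jaccard(a, b):
    """Overlap between two attribute sets, from 0.0 (disjoint) to 1.0 (identical)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def merge_profiles(profiles, threshold=0.5):
    """Greedy agglomerative clustering over attribute sets.

    Each cluster is represented by the union of its members' attributes.
    """
    clusters = [set(p) for p in profiles]
    while len(clusters) > 1:
        best_score, best_pair = 0.0, None
        # Compare every pair of clusters: quadratic work per merge step.
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                score = jaccard(clusters[i], clusters[j])
                if score > best_score:
                    best_score, best_pair = score, (i, j)
        if best_pair is None or best_score < threshold:
            break
        i, j = best_pair
        clusters[i] |= clusters[j]
        del clusters[j]
    return clusters


# Hypothetical per-document profiles that share a surface name.
profiles = [
    {"name:Li Ming", "org:Org A", "loc:Beijing"},
    {"name:Li Ming", "org:Org A", "title:professor"},
    {"name:Li Ming", "org:Org B", "loc:Qingdao"},
]
merged = merge_profiles(profiles)
print(len(merged))  # 2
```

The all-pairs comparison inside each merge step is what makes the naive version expensive at half a million profiles, and is one plausible reading of the "computing complexity" bullet.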
Information Graph: multi-dimension
Orange: Location
Gray: Organization
Blue: Person
Source: 2012 People’s Daily
Query: China Agricultural University
Expand 1 level
Organization-Organization Network
Query: China Agricultural University, filter: Organization
Location-Person Network
Query: Qingdao Port (青岛港), filter: Location
Person-Location Network
Query: Kim Il-sung (金日成)
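The query views above (start from one profile, expand one level, optionally keep only one entity type) can be sketched as a breadth-first expansion over the information graph. The graph contents, the placeholder node names, and the `expand` function are hypothetical, and applying the filter after expansion rather than during traversal is an assumption.

```python
# Toy graph: vertex = profile, edge = relationship. Neighbors are placeholders.
GRAPH = {
    "China Agricultural University": {
        "type": "Organization",
        "links": ["Org B", "Person A", "City C"],
    },
    "Org B": {"type": "Organization", "links": ["China Agricultural University", "Person D"]},
    "Person A": {"type": "Person", "links": ["China Agricultural University"]},
    "City C": {"type": "Location", "links": ["China Agricultural University"]},
    "Person D": {"type": "Person", "links": ["Org B"]},
}


def expand(graph, query, levels=1, keep_type=None):
    """Return the profiles reachable within `levels` hops of `query`."""
    seen = {query}
    frontier = {query}
    for _ in range(levels):
        next_frontier = set()
        for vertex in frontier:
            for neighbor in graph[vertex]["links"]:
                if neighbor not in seen:
                    seen.add(neighbor)
                    next_frontier.add(neighbor)
        frontier = next_frontier
    result = seen - {query}
    if keep_type is not None:
        # Filter applied after expansion; filtering during traversal is another option.
        result = {v for v in result if graph[v]["type"] == keep_type}
    return result


print(sorted(expand(GRAPH, "China Agricultural University")))
# ['City C', 'Org B', 'Person A']
print(sorted(expand(GRAPH, "China Agricultural University", keep_type="Organization")))
# ['Org B']
```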
Future Work
 Query language
 Graph mining
 Enhance NLP engine
 Visualization
Questions?
Thank you