Transcript Slide 1

The Million Book Digital Library Project:
Research Issues in Data Mining and Text Mining
Jaime Carbonell and Raj Reddy
Carnegie Mellon University
January 12, 2006
Talk presented at International Conf on Data Mining, Nov 28, 2005
and MSR India TechVista Symposium, Jan 12, 2006
Digital Libraries and
Universal Access to Information
Create a Universal Digital Library containing
all the books ever published
  Unfortunately many of the books are in English
  Not readable by over 80% of the population
Information Overload

If we read a book every day
  we can only read, at most, 40,000 books in a lifetime
Having millions of books online and accessible creates an information overload
  “we have a wealth of information and scarcity of (human) attention!”, Herbert Simon
Multilingual search technology can help to reduce the overload
  permits users to search very large databases quickly and reliably
  independent of language and location
Understanding Language

Books in non-native languages remain incomprehensible to most people
Translation and Summarization are essential for worldwide use
  Current translation systems are not yet perfect
  Significant improvements in language understanding systems in the past few decades
    Systems based on statistical and linguistic techniques have shown significant performance improvements
    improve performance using machine learning
Digitization projects will act as a test bed for validating Language Understanding Systems Research
  e.g. The Million Book Digital Library Project
The Million Book Digital Library
Collaborative venture among many countries including USA, China and India
  So far 400,000 books have been scanned in China and 200,000 in India
  Content is made freely available around the globe
  Those wishing to see the Video in the next slide should download from http://www.rr.cs.cmu.edu/MSRI.zip
The Grand Challenge
Create Access to

All published works online

Instantly available

In any language

Anywhere in the world

Searchable, browsable, navigable

By humans and machines
The Challenge:
One Step at a Time…

Million Book DL
  Only about 1% of all the world’s books
Harvard University: 12M
Library of Congress: 30M
OCLC catalog: 42M
All Multilingual Books: ~100M
At the rate of digitization of the last decade it would take 100 years!
Million Book Project: Issues

Time
  At one page per second (20,000 pages per day shift), it will take 100 years (200 working days per year) to scan a million books of 400 pages each
Cost
  100M books at US$100 per book would cost $10B
  Even in India and China the cost will be $1B
  The annual cost is currently expected to be close to $10M per year with support from US,
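The time and cost figures on this slide follow from simple arithmetic. A minimal sketch that reproduces them, assuming the 100-year figure refers to a single scanning station; the variable names and the `scanners_needed` helper are illustrative and not from the talk:

```python
import math

# Figures taken from the slide.
BOOKS = 1_000_000
PAGES_PER_BOOK = 400
PAGES_PER_DAY_SHIFT = 20_000      # roughly one page per second for one shift
WORKING_DAYS_PER_YEAR = 200

total_pages = BOOKS * PAGES_PER_BOOK
# Assumption: one scanning station working one shift per day.
scanner_years = total_pages / PAGES_PER_DAY_SHIFT / WORKING_DAYS_PER_YEAR

ALL_BOOKS = 100_000_000
COST_PER_BOOK_USD = 100
total_cost_usd = ALL_BOOKS * COST_PER_BOOK_USD


def scanners_needed(target_years):
    """Hypothetical helper: stations needed to finish within target_years."""
    return math.ceil(scanner_years / target_years)


print(scanner_years)       # years for one station
print(total_cost_usd)      # US dollars for 100M books
print(scanners_needed(2))  # stations needed for a two-year schedule
```

Under these assumptions the work parallelizes linearly, which is why the project spreads scanning across many centers.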
Million Book Project: Issues (cont)

Logistics
  Each container holds 10,000 to 20,000 books. Shipping and handling costs about $10,000
Meta Data
  Accessing and/or creating Meta data requires professionals trained in Library science
Optical Character Recognition Technology
Million Book Project: Status





21 Centers in India
17 centers in China
1 Center in Egypt
Planned: Australia and Europe
About 600,000 books scanned
About 120,000+ accessible on the web from India
  http://dli.iiit.ac.in/
  Uses 8TB of storage
10 TB server at CMU Library planned for July 2005
1,000,000 books by the end of 2007
Capacity to scan a million pages a day expected to be operational by the end of 2006
Title: Rig Veda
Author: Pandit Sriram Sharma Acharya
Language: Sanskrit
Subject: Philosophy
Publisher: Sanskriti Sansthan Bareli
Year: (not given)
Abstract: Rig Veda is the oldest of the Vedas. The Rig Veda is the oldest book in Sanskrit or any Indo-European language. Many great Yogis and scholars who have understood the astronomical references in the hymns, date the Rig Veda as before 4000 B.C., perhaps as early as 12,000. Modern western scholars date it around 1500 B.C., though recent archaeological finds in India (like Dwaraka) now appear to require a much earlier date
Title: Elementary Treatise on the Wave-Theory of Light
Author: Humphrey Lloyd, D.D., D.C.L.
Language: English
Subject: Physics
Publisher: Longmans, Green & Co
Year: 1873
Abstract: This book deals with the various aspects of the wave theory of light. It is a critical work which contains an analytical discussion of the most recent researches in Optics. It presents a clear and connected view of the subject.
Title: Beauties from Kalidas
Author: Keshav Appa Padhye
Language: Sanskrit
Subject: Poetry
Publisher: (not given)
Year: 1927
Abstract: A collection of some of the best works of Kalidas, ancient India’s most famous Sanskrit poet. Abhignyana Sakuntalam, Kumara Sambhavam, Ritu Samhara are some of the renowned works of Kalidas.
Title: Gems, Jewels, Coins and Medals Ancient & Modern
Author: Archibald Billing
Language: English
Subject: Fine Arts
Publisher: Daldy, Isbister & Co
Year: 1875
Abstract: This volume deals with the detailed description of the varied types of fine arts dealing with precious stones, jewelry and sculpture.
Title: Mudalayiram Mulamum
Author: Periya Jeeyar
Language: Tamil
Subject: Religion
Publisher: Sri Vaishnava Sampirathaya Sanjeevikiri Sabayai
Year: 1909
Abstract: This volume is written in Tamil. It provides a detailed account of the origin of Vaishnava and is written by Periya Jeeyar.
Title: Gulzar-A-Badesha
Author: Khader Badesha
Language: Urdu
Subject: Literature
Publisher: Namipress, Chennai
Year: 1919
Abstract: Literature
Title: Jawahar Ali Joyviyah
Author: Dr. Ilyas lomas
Language: Arabic
Subject: Metrology
Publisher: Bakri and Issa
Year: 1876
Abstract: It is a book on Metrology, a study of measurements
Title: Panchatantramu
Author: Narayana Kavi
Language: Telugu
Subject: Moral Stories
Publisher: Vavilla Ramaswamy and Sons
Year: 1912
Abstract: It is a compilation of stories told by a guru to his royal students, each story teaching a moral. Most of the characters in the stories are animals. The book served as an excellent guide to prospective kings in their everyday life, including their behaviour and their choice of friends. It also is a great asset to parents to teach ethics to their children.
Title: Bharateeya Smritigalu
Author: Vidwan Ragu Sutta
Language: Kannada
Subject: Biographical Notes
Publisher: Hemantha Sahitya
Year: (not given)
Abstract: Compilation of Ancient Memories
Title: The Fauna of British India including Ceylon and Burma
Author: Lt. Col. J. Stephenson
Language: English
Subject: Biology
Publisher: Taylor and Francis
Year: 1929
Abstract: Biological notes on fauna and insects compiled during British India
Title: Harijan: A Journal of Applied Gandhism, 1933-1955
Author: Joan Bondurant (introduction)
Language: English
Subject: Philosophy
Publisher: Garland Publishing Inc.
Year: 1973
Abstract: A journal on practical implementation of Gandhism in everyday life
Title: Structure Des Molecules
Author: Victor Henri
Language: French
Subject: Chemistry
Publisher: Taylor and Francis
Year: 1925
Abstract: This is a unique book that explicates, in detail, the structure of molecules and touches upon certain specific characteristics of molecules with particular reference to Benzene
Million Book Project: Policy Challenges
Compensating for Creative Works
  5% out of copyright
  92% out-of-print and in-copyright
  3% in-print and in-copyright
Options
  Tax Credit
  Usage based Government funded compensation
    Analogous to Public Lending Right in UK and Australia
  Usage charges to the user
  Compulsory Licensing
  Digital Submissions to National Archives of all books that are “born-digital”
Million Book Project: Research Challenges
Providing Access to Billions every day
  Distributed Cached Servers in every country and region
  Easy to use interfaces for Billions
Text Mining Challenges
  Multilingual Information Retrieval
  Summarization
  Text Categorization
  Named-Entity identification
  Novelty Detection
  Translation
What is Text Mining
Search documents, web, news
Categorize by topic, taxonomy
  Enables filtering, routing, multi-text summaries, …
Extract names, relations, …
  Who did what to whom and where?
Summarize text, rules, trends, …
Detect redundancy, novelty, anomalies, …
Predict outcomes, behaviors, trends, …
Data Mining vs. Text Mining
Data Mining
  Data: relational tables
  DM universe: huge
  DM tasks:
    DB “cleanup”
    Taxonomic classification
    Supervised learning with predictive classifiers
    Unsupervised learning: clustering, anomaly detection
    Visualization of results
Text Mining
  Text: HTML, free form
  TM universe: 10^3 X DM
  TM tasks:
    All the DM tasks, plus:
    Extraction of roles, relations and facts
    Machine translation for multi-lingual sources
    Parse NL-query (vs. SQL)
    NL-generation of results
New Bill of Rights
Get the right information
 To the right people
 At the right time
 On the right medium
 In the right language
 With the right level of detail

Relevant Text Mining Technologies
“…right information”      IR (search engines)
“…right people”           Classification, routing
“…right time”             Anticipatory analysis
“…right medium”           Info extraction, speech
“…right language”         Machine translation
“…right level of detail”  Summarization
“…right information”
Information Retrieval
Beyond Pure Relevance in IR
Information Retrieval Maximizes Relevance to Query
  What about information novelty, timeliness, appropriateness, validity, comprehensibility, density, medium, ...?
  Novelty is approximated by non-redundancy!
  We really want to maximize: relevance to the query, given the user profile and interaction history,
    $P(U(f_i, \ldots, f_n) \mid Q \;\&\; \{C\} \;\&\; U \;\&\; H)$
    where Q = query, {C} = collection set, U = user profile, H = interaction history
  ...but we don’t yet know how. Darn.
Maximal Marginal Relevance vs
Standard Information Retrieval
[Figure: documents retrieved around a query, comparing MMR with standard IR]
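Maximal Marginal Relevance balances relevance to the query against redundancy with documents already selected. A minimal sketch, assuming bag-of-words cosine similarity and a greedy selection loop; the function names, the trade-off weight `lam`, and the toy documents are illustrative and not from the talk:

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def mmr_rank(query, docs, lam=0.7, k=3):
    """Greedy MMR: repeatedly pick the document maximizing
    lam * sim(doc, query) - (1 - lam) * max sim(doc, already selected)."""
    q = Counter(query.lower().split())
    vecs = [Counter(d.lower().split()) for d in docs]
    selected, candidates = [], list(range(len(docs)))
    while candidates and len(selected) < k:
        def score(i):
            redundancy = max((cosine(vecs[i], vecs[j]) for j in selected), default=0.0)
            return lam * cosine(vecs[i], q) - (1 - lam) * redundancy
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected


# Toy collection (assumed): documents 0 and 1 are duplicates.
docs = [
    "million book digital library",
    "million book digital library",
    "digital library translation research",
]
print(mmr_rank("million book digital library", docs, lam=0.3, k=2))
print(mmr_rank("million book digital library", docs, lam=1.0, k=2))
```

With `lam=1.0` the ranking reduces to standard relevance-only IR and returns the duplicate; a lower `lam` penalizes the duplicate and surfaces the different document instead.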
“…right information”
Novelty Detection
Detecting Novelty in Streaming Data
Find the first report of a new event
  (Unconditional) Dissimilarity with Past
    Decision threshold on most-similar story
    (Linear) temporal decay
    Length-filter (for teasers)
  Cosine similarity with standard weights:
    $\mathrm{tfidf} = (1 + \log(\mathrm{tf})) \cdot \log(N / \mathrm{df})$
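The ingredients listed on this slide (tf-idf cosine similarity, a decision threshold on the most similar past story, linear temporal decay, and a length filter) can be combined as follows. This is a sketch under stated assumptions: document frequencies are accumulated incrementally from the stream, and the threshold, decay window, and minimum length are arbitrary illustrative values rather than settings from the talk:

```python
import math
from collections import Counter


def tfidf_vector(tokens, df, n_docs):
    """Weight each term as (1 + log(tf)) * log(N / df)."""
    tf = Counter(tokens)
    return {t: (1 + math.log(c)) * math.log(n_docs / df[t]) for t, c in tf.items() if df.get(t)}


def cosine(a, b):
    dot = sum(w * b[t] for t, w in a.items() if t in b)
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def first_stories(stream, threshold=0.2, window=100, min_len=3):
    """Return indices of stories flagged as first reports of a new event."""
    df = Counter()   # document frequencies seen so far (assumed incremental)
    past = []        # (index, tokens) of earlier stories
    novel = []
    for i, text in enumerate(stream):
        tokens = text.lower().split()
        df.update(set(tokens))
        n_docs = i + 1
        if len(tokens) < min_len:   # length filter: ignore teasers
            continue
        vec = tfidf_vector(tokens, df, n_docs)
        best = 0.0
        for j, past_tokens in past:
            decay = max(0.0, 1 - (i - j) / window)   # linear temporal decay
            sim = decay * cosine(vec, tfidf_vector(past_tokens, df, n_docs))
            best = max(best, sim)
        if best < threshold:        # decision threshold on most-similar story
            novel.append(i)
        past.append((i, tokens))
    return novel


stream = [
    "stock markets fall sharply in asia",
    "new library opens in pittsburgh today",
    "plane crash investigation begins near coast",
    "plane crash investigation continues near coast",
]
print(first_stories(stream))
```

In this toy stream the fourth story overlaps heavily with the third, so only the first three are flagged as new events.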
New First Story Detection Directions

Topic-conditional models
  e.g. “airplane,” “investigation,” “FAA,” “FBI,” “casualties,” → topic, not event
  “TWA 800,” “March 12, 1997” → event
  First categorize into topic, then use maximally-discriminative terms within topic
Rely on situated named entities
  e.g. “Arcan as victim,” “Sharon as peacemaker”
Link Detection in Texts
Find text (e.g. news stories) that mention the same underlying events.
  Could be combined with novelty (e.g. something new about an interesting event.)
Techniques: text similarity, NE’s, situated NE’s, relations, topic-conditioned models, …
“…right people”
Text Categorization
Text Categorization
Assign labels to each document or web-page
  Labels may be topics such as Yahoo-categories
    finance, sports, News→World→Asia→Business
  Labels may be genres
    editorials, movie-reviews, news
  Labels may be routing codes
    send to marketing, send to customer service
Text Categorization
Methods
  Manual assignment
    as in Yahoo
  Hand-coded rules
    as in Reuters
  Machine Learning (dominant paradigm)
    Words in text become predictors
    Category labels become “to be predicted”
    Predictor-feature reduction (SVD, χ², …)
    Apply any inductive method: kNN, NB, DT, …
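The machine-learning recipe above (words as predictors, category labels as the target, any inductive method) can be illustrated with naive Bayes, one of the methods the slide names. A minimal sketch with add-one smoothing; the class name and the toy routing data are invented for illustration, and feature reduction is omitted:

```python
import math
from collections import Counter, defaultdict


class NaiveBayesText:
    """Multinomial naive Bayes over whitespace-separated words."""

    def fit(self, docs, labels):
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for text, label in zip(docs, labels):
            self.word_counts[label].update(text.lower().split())
        self.vocab = {w for counts in self.word_counts.values() for w in counts}
        return self

    def predict(self, text):
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n in self.class_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)   # add-one smoothing
            score = math.log(n / total_docs)                 # class prior
            for w in text.lower().split():
                if w in self.vocab:                          # skip unseen words
                    score += math.log((counts[w] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Toy routing-code training set (assumed data).
train_docs = [
    "new product launch campaign",
    "advertising campaign budget",
    "my order arrived broken",
    "refund for broken product",
]
train_labels = ["marketing", "marketing", "customer service", "customer service"]

clf = NaiveBayesText().fit(train_docs, train_labels)
print(clf.predict("broken order refund"))
print(clf.predict("advertising campaign"))
```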
Multi-tier Event Classification
News Event
  Terrorist Event
    Bombing
    Shooting
  Economic disaster
    Asian Crisis
    US tech crisis
“…right medium”
Named-Entity identification
Named-Entity identification
Purpose: to answer questions such as:
  Who is mentioned in these 100 Society articles?
  What locations are listed in these 2000 web pages?
  What companies are mentioned in these patent applications?
  What products were evaluated by Consumer Reports this year?
Named Entity Identification
President Clinton decided to send special trade
envoy Mickey Kantor to the special Asian
economic meeting in Singapore this week. Ms.
Xuemei Peng, trade minister from China, and
Mr. Hideto Suzuki from Japan’s Ministry of
Trade and Industry will also attend. Singapore,
who is hosting the meeting, will probably be
represented by its foreign and economic
ministers. The Australian representative, Mr.
Langford, will not attend, though no reason has
been given. The parties hope to reach a
framework for currency stabilization.
Methods for NE Extraction

Finite-State Transducers w/variables
  Example output:
    FNAME: “Bill” LNAME: “Clinton” TITLE: “President”
  FSTs Learned from labeled data
Statistical learning (also from labeled data)
  Hidden Markov Models (HMMs)
  Exponential (maximum-entropy) models
  Conditional Random Fields [Lafferty et al]
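A regular expression with named groups is a hand-written finite-state pattern whose groups play the role of the variables in the example output. The sketch below is only a stand-in for a real transducer: the title list and the capitalized-word name pattern are assumptions chosen to fit the sample text, not a method described in the talk.

```python
import re

# Assumed title inventory; a real system would use a much larger lexicon.
TITLES = r"(?:President|Mr\.|Ms\.|Dr\.)"

# TITLE, optional FNAME, then LNAME, each a capitalized word.
PERSON = re.compile(
    rf"(?P<TITLE>{TITLES})\s+(?:(?P<FNAME>[A-Z][a-z]+)\s+)?(?P<LNAME>[A-Z][a-z]+)"
)


def extract_people(text):
    """Return one dict of filled variables per person mention."""
    return [
        {name: value for name, value in m.groupdict().items() if value}
        for m in PERSON.finditer(text)
    ]


print(extract_people("President Bill Clinton met Mr. Langford."))
```

Hand-coded patterns like this are brittle, which is the motivation for learning transducers or statistical models from labeled data.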
Named Entity Identification
Extracted Named Entities (NEs)
People               Places
President Clinton    Singapore
Mickey Kantor        Japan
Ms. Xuemei Peng      China
Mr. Hideto Suzuki    Australia
Mr. Langford
Role Situated NE’s
Motivation: It is useful to know roles of NE’s:
 Who participated in the economic meeting?
 Who hosted the economic meeting?
 Who was discussed in the economic meeting?
  Who was absent from the economic meeting?
Emerging Methods
for Extracting Relations

Link Parsers at Clause Level
  Based on dependency grammars
  Probabilistic enhancements [Lafferty, Venable]
Island-Driven Parsers
  GLR* [Lavie], Chart [Nyberg, Placeway], LC-Flex [Rosé]
Tree-bank-trained probabilistic CF parsers [IBM, Collins]
Herald the return of deep(er) NLP techniques.
  Relevant to new Q/A from free-text initiative.
  Too complex for inductive learning (today).
Relational NE Extraction
Example: (Who does What to Whom)
"John Snell reporting for Wall Street. Today
Flexicon Inc. announced a tender offer for
Supplyhouse Ltd. for $30 per share, representing a
30% premium over Friday’s closing price.
Flexicon expects to acquire Supplyhouse by Q4
2001 without problems from federal regulators"
Fact Extraction Application

Useful for relational DB filling, to prepare data
for “standard” DM/machine-learning methods
Acquirer    Acquiree       Sh.price    Year
___________________________________________
Flexicon    Logi-truck     18          1999
Flexicon    Supplyhouse    30          2001
buy.com     reel.com       10          2000
...         ...            ...         ...
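To show how one row of such a table could be filled from free text, here is a pattern-based sketch applied to the tender-offer example. The regular expressions, field names, and company-suffix list are assumptions written for this one sentence pattern; a practical extractor would rely on the parsing or learned methods discussed earlier.

```python
import re

# Assumed pattern for "<Acquirer> announced a tender offer for <Acquiree> for $N per share".
TENDER = re.compile(
    r"(?P<acquirer>[A-Z][\w-]*)(?: (?:Inc|Ltd|Corp)\.)? announced a tender offer for "
    r"(?P<acquiree>[A-Z][\w-]*)(?: (?:Inc|Ltd|Corp)\.)? for \$(?P<price>\d+) per share"
)
YEAR = re.compile(r"\b(?:19|20)\d{2}\b")


def extract_acquisition(text):
    """Return a dict shaped like one table row, or None if nothing matches."""
    m = TENDER.search(text)
    if m is None:
        return None
    year = YEAR.search(text, m.end())   # first year mentioned after the offer
    return {
        "acquirer": m["acquirer"],
        "acquiree": m["acquiree"],
        "share_price": int(m["price"]),
        "year": int(year.group()) if year else None,
    }


report = (
    "John Snell reporting for Wall Street. Today Flexicon Inc. announced a tender "
    "offer for Supplyhouse Ltd. for $30 per share, representing a 30% premium over "
    "Friday’s closing price. Flexicon expects to acquire Supplyhouse by Q4 2001 "
    "without problems from federal regulators"
)
print(extract_acquisition(report))
```

Rows produced this way can be loaded into a relational table and handed to standard data-mining methods.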
“…right language”
Translation
“…in the Right Language”

Knowledge-Engineered MT
  Transfer rule MT (commercial systems)
  High-Accuracy Interlingual MT (domain focused)
Parallel Corpus-Trainable MT
  Statistical MT (noisy channel, exponential models)
  Example-Based MT (generalized G-EBMT)
  Transfer-rule learning MT (corpus & informants)
Multi-Engine MT
  Omnivorous approach: combines the above to maximize coverage & minimize errors
Types of Machine Translation
[Figure: MT pyramid from Source (Arabic) to Target (English). Analysis side: Syntactic Parsing, Semantic Analysis. Generation side: Sentence Planning, Text Generation. Paths across: Direct: EBMT, Transfer Rules, Interlingua.]
EBMT example
English: I would like to meet her.
Mapudungun: Ayükefun trawüael fey engu.
English: The tallest man is my father.
Mapudungun: Chi doy fütra chi wentru fey ta inche ñi chaw.
English: I would like to meet the tallest man
Mapudungun (new): Ayükefun trawüael Chi doy fütra chi wentru
Mapudungun (correct): Ayüken ñi trawüael chi doy fütra wentruengu.
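The "new" translation in this example comes from gluing together fragments of the two stored sentence pairs. A minimal sketch of that recombination step, assuming the fragment alignments suggested by the slide layout and a greedy longest-match cover of the input; the table and function names are illustrative:

```python
# Fragment pairs read off the two example sentences (alignment is an assumption).
FRAGMENTS = {
    "i would like to meet": "Ayükefun trawüael",
    "her": "fey engu",
    "the tallest man": "Chi doy fütra chi wentru",
    "is my father": "fey ta inche ñi chaw",
}


def ebmt_translate(sentence):
    """Cover the input left to right with the longest known fragments."""
    words = sentence.lower().rstrip(".").split()
    out, i = [], 0
    while i < len(words):
        for j in range(len(words), i, -1):      # try the longest span first
            chunk = " ".join(words[i:j])
            if chunk in FRAGMENTS:
                out.append(FRAGMENTS[chunk])
                i = j
                break
        else:
            raise ValueError(f"no example covers: {words[i]!r}")
    return " ".join(out)


print(ebmt_translate("I would like to meet the tallest man"))
```

The output matches the slide's "new" line, not the "correct" one: plain concatenation misses the morphological changes at the fragment boundary, which is the weakness the example is meant to expose.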
Multi-Engine Machine Translation

MT Systems have different strengths
  Rapidly adaptable: Statistical, example-based
  Good grammar: Rule-Based (linguistic) MT
  High precision in narrow domains: KBMT
  Minority Language MT: Learnable from informant
Combine results of parallel-invoked MT
  Select best of multiple translations
  Selection based on optimizing combination of:
    Target language joint-exponential model
    Confidence scores of individual MT engines
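The selection step can be pictured as scoring each candidate with a target-language model and the producing engine's confidence, then keeping the best. In this sketch an add-one-smoothed bigram model stands in for the joint-exponential model named on the slide, and the weights, confidences, and tiny training corpus are all invented for illustration:

```python
import math
from collections import Counter


def train_bigram_lm(sentences):
    """Return a function giving the average log-probability per bigram."""
    unigrams, bigrams = Counter(), Counter()
    for s in sentences:
        toks = ["<s>"] + s.lower().split() + ["</s>"]
        unigrams.update(toks[:-1])
        bigrams.update(zip(toks[:-1], toks[1:]))
    vocab = len(set(unigrams) | {"</s>"})

    def avg_logprob(sentence):
        toks = ["<s>"] + sentence.lower().split() + ["</s>"]
        pairs = list(zip(toks[:-1], toks[1:]))
        total = sum(math.log((bigrams[p] + 1) / (unigrams[p[0]] + vocab)) for p in pairs)
        return total / len(pairs)

    return avg_logprob


def select_translation(candidates, lm, w_lm=1.0, w_conf=1.0):
    """candidates: list of (translation, engine_confidence in (0, 1])."""
    return max(candidates, key=lambda c: w_lm * lm(c[0]) + w_conf * math.log(c[1]))[0]


# Toy target-language corpus and candidates (assumed data).
lm = train_bigram_lm([
    "the drop-off point is at the bridge",
    "the meeting will take place at the bridge",
])
candidates = [
    ("the drop-off point will take place at the bridge", 0.6),
    ("unload of the point will self comply in bridge", 0.6),
]
print(select_translation(candidates, lm))
```

With equal engine confidences the language model decides, favoring the candidate built from word sequences seen in the target language.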
Illustration of Multi-Engine MT
Source (Spanish): El punto de descarge se cumplirá en el puente Agua Fria
Engine outputs:
  The drop-off point will comply with The cold Bridgewater
  The discharge point will self comply in the “Agua Fria” bridge
  Unload of the point will take place at the cold water of bridge
State of the Art in MEMT
for New “Hot” Languages
We can do now:
  Gisting MT for any new language in 2-3 weeks (given parallel text)
  Medium quality MT in 6 months (given more parallel text, informant, bi-lingual dictionary)
  Improve-as-you-go MT
  Field MT system in PCs
We cannot do yet:
  High-accuracy MT for open domains
  Cope with spoken-only languages
  Reliable speech-speech MT (but BABYLON is coming)
  MT on your wristwatch
“…right level of detail”
Summarization
Document Summarization
Types of Summaries
Task                                             Query-relevant (focused)                 Query-free (generic)
INDICATIVE for Filtering (Do I read further?)    Filter search engine results             Short abstracts
CONTENTFUL for reading in lieu of full doc       Solve problems for busy professionals    Executive summaries
Conclusion