The role of computational linguistics in metadata management

Natural Language Processing
for LODLAM
A brief intro to machine
learning & data science
for Libraries
Presented at IGeLU 2014
by Corey A Harper
2014-09-16
Context
Narrative
Story telling
The Library's story,
and the Archives' story,
but also…
Harper – IGeLU – NLP 4 LODLAM – Sept 16, 2014
Users’ stories
Scholars' stories
Adding context through recombinant metadata
Scholars' & Users' Stories – Tim Sherratt (@wragge)
Also: http://discontents.com.au/a-map-and-some-pins-open-data-and-unlimited-horizons/
Library Authority Data
“Include links to other URIs, so that they can
discover more things.”
Short of providing and linking to URIs, this *is*
authority data.
This is what our authority files are for.
Linked data is about context
authorities provide context
and yet our controlled vocabs
are nearly gone
because the interfaces to them
were broken
The Death of Browse
• Next-Gen Discovery Systems don't
make use of Authority Control
• “Browse” was/is broken as a UI Design
• Rich data in Authorities, disconnected
from narrative, context, search
• Richer “Authority” type data outside
libraries...
• “Next Gen Next Gen Discovery…”
FuzzyWuzzy – SeatGeek
FuzzyWuzzy – Awesome Library from SeatGeek
https://github.com/seatgeek/fuzzywuzzy
http://seatgeek.com/blog/dev/fuzzywuzzy-fuzzy-string-matching-in-python
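To give a feel for what fuzzy string matching does with name headings, here is a minimal sketch. It deliberately uses Python's standard-library `difflib` rather than FuzzyWuzzy itself, so the helper names `simple_ratio` and `token_sort_ratio` and the sample strings are assumptions made for illustration, not the library's API.

```python
from difflib import SequenceMatcher


def simple_ratio(a: str, b: str) -> int:
    """Similarity of two strings on a 0-100 scale, ignoring case."""
    return round(100 * SequenceMatcher(None, a.lower(), b.lower()).ratio())


def token_sort_ratio(a: str, b: str) -> int:
    """Compare strings after sorting their words, so word order is ignored."""
    def normalize(s: str) -> str:
        # Replace punctuation with spaces, lowercase, then sort the tokens.
        cleaned = "".join(ch if ch.isalnum() else " " for ch in s.lower())
        return " ".join(sorted(cleaned.split()))
    return round(100 * SequenceMatcher(None, normalize(a), normalize(b)).ratio())


if __name__ == "__main__":
    # Inverted and direct-order forms of the same (illustrative) name.
    print(simple_ratio("Ellington, Duke", "Duke Ellington"))
    print(token_sort_ratio("Ellington, Duke", "Duke Ellington"))
```

Sorting tokens before comparing is one way to make "Surname, Forename" and "Forename Surname" forms line up; whether that is appropriate depends on the data.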
Slide courtesy of Doug Oard, Univ. of Maryland
Tools - Natural Language Processing
• DBpedia Spotlight: https://github.com/dbpedia-spotlight/dbpedia-spotlight/wiki
• Zemanta: http://www.zemanta.com/?wpst=1
• OpenCalais: http://www.opencalais.com/
• OpenRefine: http://openrefine.org/
• DataTXT: https://dandelion.eu/products/datatxt/
• AlchemyAPI: http://www.alchemyapi.com/
• FuzzyWuzzy: https://github.com/seatgeek/fuzzywuzzy
Where does this lead?
We need new interfaces
new tools
for a new kind of cataloger
for knowledge organization experts
Linked Jazz Back End
Primo PNX and Authorities
• Indexing Cross References
• New Browse Functionality
• Authority Control from Aleph / Alma
• What about non-MARC, or non-Aleph data?
• Matching Strings to Authorities
Enter OpenRefine
http://freeyourmetadata.org/
Match strings to vocabularies…
Like LCNAF…
Or Wikipedia
Automated Authority Control?
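One way to picture semi-automated authority control is a reconciliation step that accepts strong matches and routes weaker ones to a person. The sketch below assumes a tiny in-memory authority list; the headings, the `auth:` identifiers, and the two thresholds are all hypothetical, and the scoring uses standard-library `difflib`, not any reconciliation service.

```python
from difflib import SequenceMatcher

# Hypothetical mini authority file: heading -> made-up identifier.
AUTHORITY = {
    "Ellington, Duke, 1899-1974": "auth:0001",
    "Fitzgerald, Ella": "auth:0002",
    "Monk, Thelonious": "auth:0003",
}


def score(a: str, b: str) -> float:
    """Case-insensitive similarity between 0.0 and 1.0."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def reconcile(name, authority=AUTHORITY, auto_accept=0.9, review=0.6):
    """Return (status, heading, identifier, score) for the best candidate.

    Scores at or above `auto_accept` are matched automatically, scores
    between `review` and `auto_accept` are flagged for a cataloger, and
    anything lower is reported as unmatched.
    """
    best = max(authority, key=lambda heading: score(name, heading))
    s = round(score(name, best), 2)
    if s >= auto_accept:
        return ("matched", best, authority[best], s)
    if s >= review:
        return ("needs_review", best, authority[best], s)
    return ("unmatched", None, None, s)


if __name__ == "__main__":
    for raw in ["Monk, Thelonious", "Ella Fitzgerald", "zzzz qqqq"]:
        print(raw, "->", reconcile(raw))
```

The middle band is the point of the design: the machine proposes and the cataloger decides, which keeps the question mark in "automated" honest.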
OpenRefine RDF Skeleton
Proposed System Architecture
Hydra Modeling & Architecture
• Approaches to Provenance
• Prov-O
• Named Graphs
• Named Datastreams
• “n” nyucore “records”
• Same properties defined for each
• Keep data sources separate
• Merge for display in Blacklight & export to Primo
Separate Metadata Datastreams
• source_metadata, enrich_metadata
• Reload one or both without affecting other
or native metadata
• native_metadata
• Edited only through Hydra UI
• Partitioned from external sources
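The partitioning described above can be sketched in a few lines. This is plain Python, not Hydra or Fedora code: the stream names come from the slide, while the `Record` class, the property names, and the merge-by-union policy are assumptions made for illustration.

```python
from dataclasses import dataclass, field

EXTERNAL_STREAMS = ("source_metadata", "enrich_metadata")
ALL_STREAMS = ("native_metadata",) + EXTERNAL_STREAMS


@dataclass
class Record:
    """One object whose metadata lives in three separately managed streams."""
    source_metadata: dict = field(default_factory=dict)  # loaded from the source system
    enrich_metadata: dict = field(default_factory=dict)  # loaded from enrichment tools
    native_metadata: dict = field(default_factory=dict)  # edited by hand in the repository UI

    def reload(self, stream: str, data: dict) -> None:
        """Replace one external stream without touching the other streams."""
        if stream not in EXTERNAL_STREAMS:
            raise ValueError("only external streams can be reloaded")
        setattr(self, stream, dict(data))

    def merged(self) -> dict:
        """Union of all streams for display; every value keeps its provenance."""
        out = {}
        for stream in ALL_STREAMS:
            for prop, values in getattr(self, stream).items():
                for value in values:
                    out.setdefault(prop, []).append((value, stream))
        return out


if __name__ == "__main__":
    rec = Record(
        source_metadata={"creator": ["Ellington, Duke"]},
        native_metadata={"description": ["Note added by a cataloger."]},
    )
    rec.reload("enrich_metadata", {"creator": ["Ellington, Duke, 1899-1974"]})
    print(rec.merged())
```

Because each value carries the name of the stream it came from, the merged view can be rebuilt at any time and a reload of one source never overwrites hand edits.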
Metadata Provenance
Fedora Datastreams
Blacklight User Interface
Where does this lead?
We need new interfaces
new tools
for a new kind of cataloger
for knowledge organization experts
A Role for Ex Libris
• Alma &/or Primo
• Named Entity Recognition
• Vocabulary Reconciliation
• Provenance Management
• Primo Central
• Named Entity Recognition on Full Text
• Auto Classification
A bit louder...
we need new interfaces
we need enterprise tools
Integrated into our metadata
management systems
for a new kind of cataloger
for knowledge organization experts
Simplified Workflow Proposal
More Tools – At Programming Level
• Apache OpenNLP: https://opennlp.apache.org/
• Stanford NLP Group software:
http://nlp.stanford.edu/software/index.shtml
• Python Tools
• scikit-learn, pandas, NLTK, SciPy, NumPy
• https://www.kaggle.com/wiki/GettingStartedWithPythonForDataScience
• http://pandas.pydata.org/
• http://www.nltk.org/
More Data Science-ey Tools
http://www.rexeranalytics.com/Data-Miner-Survey-Results-2013.html
Data Science Techniques
• Feature Extraction / Feature Engineering
• Predictive Modeling
• Probabilistic Classification – Large Multi-Class
Problems
• Text Analytics
• Vectorization
• Bags & Sets of Words
• TF/IDF
• N-Grams
• Sparse Matrices
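The vectorization terms in the list above can be made concrete with a small sketch: text is split into words, extended with n-grams, and stored as sparse rows that record only the non-zero counts. It uses only the standard library; the function names and the sample strings are assumptions, and a real project would more likely reach for a library vectorizer.

```python
from collections import Counter


def ngrams(tokens, n):
    """All contiguous runs of n tokens, joined with spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def vectorize(docs, max_n=2):
    """Return (vocabulary, rows).

    `vocabulary` maps each term to a column index; each row is a dict of
    {column index: count}, i.e. one sparse row of a document-term matrix.
    """
    vocab = {}
    rows = []
    for doc in docs:
        tokens = doc.lower().split()
        terms = []
        for n in range(1, max_n + 1):
            terms.extend(ngrams(tokens, n))
        row = {}
        for term, count in Counter(terms).items():
            col = vocab.setdefault(term, len(vocab))  # new terms get the next column
            row[col] = count
        rows.append(row)
    return vocab, rows


if __name__ == "__main__":
    vocab, rows = vectorize(["jazz oral history", "jazz history"])
    print(vocab)
    print(rows)
```

Storing only non-zero counts is what makes large vocabularies workable: most documents use a tiny fraction of the terms in the collection.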
Simple Example – Predict Yelp Star Ratings
Fitting a Model – Naïve Bayes
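For readers who have not seen a Naïve Bayes text classifier, here is a minimal from-scratch sketch of the multinomial variant with add-one smoothing. It is not the code shown in the talk: the `NaiveBayes` class is hypothetical, and the four short "reviews" with their star ratings are invented toy data, not Yelp data.

```python
import math
from collections import Counter, defaultdict


class NaiveBayes:
    """Multinomial Naive Bayes over word counts with add-one smoothing."""

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)       # documents per class
        self.word_counts = defaultdict(Counter)   # word counts per class
        for text, label in zip(texts, labels):
            self.word_counts[label].update(text.lower().split())
        self.vocab = {w for counts in self.word_counts.values() for w in counts}
        return self

    def predict(self, text):
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n_docs / total_docs)  # log prior for the class
            for word in text.lower().split():
                if word in self.vocab:  # skip words never seen in training
                    score += math.log((counts[word] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


if __name__ == "__main__":
    # Invented toy reviews with 1-star and 5-star labels.
    texts = [
        "great food friendly staff",
        "terrible service cold food",
        "great service great food",
        "cold terrible rude staff",
    ]
    stars = [5, 1, 5, 1]
    model = NaiveBayes().fit(texts, stars)
    print(model.predict("great staff"), model.predict("rude cold"))
```

Working in log probabilities avoids numeric underflow when many word likelihoods are multiplied together.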
Data Science Venn Diagram
http://drewconway.com/zia/2013/3/26/the-data-science-venn-diagram
$$\mathrm{IDF}(\text{Term}) = 1 + \ln\left(\frac{\text{Total Document Count}}{\text{Documents Containing Term}}\right)$$
http://www.amazon.com/Data-Science-Business-data-analytic-thinking/dp/1449361323
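A small worked sketch of the inverse document frequency weighting, 1 + ln(Total Document Count / Documents Containing Term), may help. The function names, the choice of a raw term count for TF, and the three tiny token lists are assumptions made for illustration.

```python
import math


def idf(term, docs):
    """1 + ln(total documents / documents containing the term)."""
    containing = sum(1 for doc in docs if term in doc)
    if containing == 0:
        raise ValueError(f"{term!r} does not occur in any document")
    return 1 + math.log(len(docs) / containing)


def tf_idf(term, doc, docs):
    """Raw count of the term in one document, weighted by the term's IDF."""
    return doc.count(term) * idf(term, docs)


if __name__ == "__main__":
    # Each document is a list of tokens.
    docs = [
        ["jazz", "oral", "history"],
        ["jazz", "archives"],
        ["metadata", "management"],
    ]
    print(idf("jazz", docs))      # appears in 2 of 3 documents
    print(idf("metadata", docs))  # appears in 1 of 3 documents, so it weighs more
    print(tf_idf("jazz", docs[0], docs))
```

A term that appears in every document gets an IDF of exactly 1, so the weighting boosts rare terms without ever zeroing out common ones.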
Where can we go from here?
• NER is just the beginning
• Feature Engineering
• Hiring Statisticians
• Clustering & Classification
• Vocabulary Pruning and Engineering
• Manageable 10-20k Class Text Classification Problems
• Domain Specific
• Ex Libris’ Activity in this space
Thanks!
212.998.2479
@chrpr