Apache Tika: 1 point Oh!
Chris A. Mattmann
NASA JPL/Univ. Southern California/ASF
[email protected] November 9, 2011
And you are?
• Senior Computer Scientist at NASA JPL in Pasadena, CA USA
• Software Architecture/Engineering Prof at Univ. of Southern California
• Apache Member involved in
– OODT (VP, PMC), Tika (VP, PMC), Nutch (PMC), Incubator (PMC), SIS (Mentor), Lucy (Mentor) and Gora (Champion), MRUnit (Mentor), Airavata (Mentor)
Roadmap
• 1st part of the talk
– Why Tika?
– What is Tika?
– What are the current versions of Tika?
– What can it do?
• 2nd part of the talk
– NASA Earth Science Data Systems
– Data System Needs and Requirements
– How does Tika help?
The Information Landscape
Proliferation of content types available
• By some accounts, 16K to 51K content
types*
• What to do with content types?
– Parse them
• How?
• Extract their text and structure
– Index their metadata
• In an indexing technology like Lucene or Solr, or in a Google Search Appliance
– Identify what language they belong to
• Ngrams
*http://filext.com/
Importance of content types
Importance of content type detection
Search Engine Architecture
Goals
• Identify and classify file types
– MIME detection
• Glob pattern
– *.txt
– *.pdf
• URL
– http://…pdf
– ftp://myfile.txt
• Magic bytes
• A combination of the above methods
• Classification means the reaction can be targeted
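A minimal sketch of how glob patterns and magic bytes can be combined, using only the JDK. This is not Tika's implementation: the class name MimeDetectSketch, the two small lookup tables, and the rule "magic bytes win over the name" are assumptions made for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

/** Toy detector combining glob patterns and magic bytes; not Tika's implementation. */
class MimeDetectSketch {
    // Hypothetical glob table (file-name suffix -> MIME type).
    private static final Map<String, String> GLOBS = new LinkedHashMap<>();
    // Hypothetical magic table (MIME type -> leading bytes of the file).
    private static final Map<String, byte[]> MAGIC = new LinkedHashMap<>();

    static {
        GLOBS.put(".txt", "text/plain");
        GLOBS.put(".pdf", "application/pdf");
        MAGIC.put("application/pdf", "%PDF-".getBytes(StandardCharsets.US_ASCII));
        MAGIC.put("image/png", new byte[] {(byte) 0x89, 'P', 'N', 'G'});
    }

    /** Glob/URL check: treats "notes.txt" and "http://host/paper.pdf" the same way. */
    static String byName(String nameOrUrl) {
        String lower = nameOrUrl.toLowerCase(Locale.ROOT);
        for (Map.Entry<String, String> e : GLOBS.entrySet()) {
            if (lower.endsWith(e.getKey())) {
                return e.getValue();
            }
        }
        return null;
    }

    /** Magic check: compares the first bytes of the content with known signatures. */
    static String byMagic(byte[] head) {
        for (Map.Entry<String, byte[]> e : MAGIC.entrySet()) {
            byte[] sig = e.getValue();
            if (head.length < sig.length) {
                continue;
            }
            boolean match = true;
            for (int i = 0; i < sig.length; i++) {
                if (head[i] != sig[i]) {
                    match = false;
                    break;
                }
            }
            if (match) {
                return e.getKey();
            }
        }
        return null;
    }

    /** Combination: trust the content first, then the name, then give up. */
    static String detect(String nameOrUrl, byte[] head) {
        String type = byMagic(head);
        if (type == null) {
            type = byName(nameOrUrl);
        }
        return type != null ? type : "application/octet-stream";
    }
}
```

The ordering matters: a PDF that was saved as mystery.txt is still classified as application/pdf, because the content check runs before the name check.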
Tika is…
• A content analysis and detection toolkit
• A set of Java APIs providing MIME type detection,
language identification, integration of various
parsing libraries
• A rich Metadata API for representing different
Metadata models
• A command line interface to the underlying Java
code
• A GUI interface to the Java code
Tika’s (Brief) History
• Original idea for Tika came from Chris Mattmann
and Jerome Charron in 2006
• Proposed as Lucene sub-project
– Others interested, didn’t gain much traction
• Went the Incubator route in 2007 when Jukka
Zitting found that there was a need for Tika
capabilities in Apache Jackrabbit
– A Content Management System
• Graduated from the Incubator to Lucene subproject in 2008
• Graduated to Apache TLP in April 2010
• 40, 88 and 29 issues resolved in versions 1.0,
0.10, and 0.9
Community
• Mailing lists
– User: 125 peeps, ~70 msg/mo.
– Dev: 210 peeps, ~250 msg/mo.
• Committers/PMC
– 13 peeps
– Large majority of them active
• Releases
– 11 releases so far
– Just pushed out 1 point OH
• http://s.apache.org/N0I
Credit: svnsearch.org
Use in the classroom
• Have used Apache Tika for the past 2 years in
both my Search Engines/Information
Retrieval class and my Software Architecture
class
– Several student final projects have turned into
contributions for the project and merit for the
students
• Define data management projects that
involve the use of OODT, and other
technologies like Solr, Tika, Nutch, Hadoop,
etc.
Some recent 1 point oh press
• GlobeNewswire, Nov. 9, 2011: "The Apache Software Foundation Announces Apache Tika(tm) v1.0"
– http://www.globenewswire.com/newsroom/news.html?d=237692
• CMSWire, Rikki Endsley (@rikkiends), Nov 9, 2011: "Apache Announces Toolkit for Content Detection and Analysis"
– http://www.cmswire.com/cms/information-management/ap
– "The Apache Software Foundation announced Tika v.1, an embeddable toolkit for content detection and analysis five years in the making. What is Tika? The announcement describes it as a one-stop shop for identifying, retrieving and parsing text and metadata from more than 1,200 file formats, such as HTML, PDF, images, OpenOffice, Microsoft Office, email and more"
Getting started rapidly…like now!
• Download Tika from:
– http://tika.apache.org/download.html
• Grab tika-app-1.0.jar
• alias tika "java -jar tika-app-1.0.jar" (csh/tcsh syntax; in bash: alias tika="java -jar tika-app-1.0.jar")
• tika < somefile.doc > extracted-text.xhtml
• tika -m < somefile.doc > extracted.met
• Works on Windows too (alias only on UNIX)
A quick NASA dataset
• Atmospheric Infrared Sounder Mission (AIRS)
– Level 2 Cloud Clear Radiance Product
– Grab it from here:
• ftp://airspar1u.ecs.nasa.gov/ftp/data/s4pa/Aqua_AIRS_Level2/AIRI2CCF.003/2007/005/
– Just grab the first file
• java -jar tika-app-1.0.jar -m < AIRS.2007.01.05.001.L2.CC.v4.0.9.0.G07006021239.hdf
– Hopefully this worked for you, if not, blame..
• Windows
– And Bill Gates
Detecting MIME types from Java
• String type = new Tika().detect(…)
– java.io.InputStream
– java.io.File
– java.net.URL
– java.lang.String
Adding new MIME types
• Got XML?
• Based on freedesktop.org spec (loosely)
Many custom applications and tools
• You need this:
to read this:
Third-party parsing libraries
• Most of the custom applications come with
software libraries and tools to read/write these
files
– Rather than re-invent the wheel, figure out a way to
take advantage of them
• Parsing text and structure is a difficult problem
– Not all libraries parse text in equivalent manners
– Some are faster than others
– Some are more reliable than others
Parsing
• String content = new Tika().parseToString(…)
– InputStream
– File
– URL
Streaming Parsing
• Reader reader = new Tika().parse(…)
– InputStream
– File
– URL
Extraction of Metadata
• Important to follow common Metadata models
– Dublin Core – any electronic resource
– XMP – also general like Dublin Core
– Word Metadata – specific to .doc, .ppt, etc.
– EXIF – image related
• Lots of standards and models out there
– The use and extraction of common models allows for content
intercomparison
– Common models standardize the mechanisms for searching
– You always know that for file type X, field Y is there and is of type String, Int, or Date
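One way to picture the "field Y is always of a known type" guarantee is a model whose fields carry their value type. The sketch below is a JDK-only illustration under assumptions: the class TypedMetSketch and the fields TITLE, PAGE_COUNT, and CREATED are hypothetical and do not come from Dublin Core, XMP, or Tika.

```java
import java.time.LocalDate;
import java.util.EnumMap;
import java.util.Map;

/** Toy metadata model in which every field declares the type of its value. */
class TypedMetSketch {
    /** Hypothetical fields; a real model (e.g. Dublin Core) defines its own. */
    enum Field {
        TITLE(String.class),
        PAGE_COUNT(Integer.class),
        CREATED(LocalDate.class);

        final Class<?> type;

        Field(Class<?> type) {
            this.type = type;
        }
    }

    private final Map<Field, Object> values = new EnumMap<>(Field.class);

    /** Rejects values whose runtime type does not match the field's declared type. */
    void set(Field field, Object value) {
        if (!field.type.isInstance(value)) {
            throw new IllegalArgumentException(
                field + " expects " + field.type.getSimpleName());
        }
        values.put(field, value);
    }

    /** Returns the stored value cast to the requested type, or null if unset. */
    <T> T get(Field field, Class<T> as) {
        return as.cast(values.get(field));
    }
}
```

Because the type check happens on write, a search or comparison layer reading PAGE_COUNT can rely on getting an Integer back for every document.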
Cancer Research Example
Cancer Research Example
Attributes
Credit: A. Hart
Relationships
Tika Sponsoring the Any23 Project
• Tika PMC is sponsoring the Any23 project in
the Incubator (entered: 10/1/2011)
• Any23 = “Anything to Triples”
• Semantic Toolkit for parsing, identification of
all major semantic web content types (RDF,
etc.)
• Related to Apache Jena
• Looking for synergies between the 2 efforts
Metadata
• Metadata met = new Metadata();
// Dublin Core
met.set(Metadata.FORMAT, "text/html");
// multi-valued: add() appends a value, a second set() would overwrite the first
met.add(Metadata.FORMAT, "text/plain");
System.out.println(
java.util.Arrays.toString(met.getValues(Metadata.FORMAT)));
• Other met models supported (HTTP Headers,
Word, Creative Commons, Climate Forecast, etc.)
– Run: tika --list-met-models
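The difference between setting and adding a value on a multi-valued field can be sketched with plain JDK collections. The class MultiValuedMetSketch below is a hypothetical stand-in written for illustration; it mirrors the set/add/getValues method names but is not Tika's Metadata class.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy multi-valued metadata container; a stand-in for illustration only. */
class MultiValuedMetSketch {
    private final Map<String, List<String>> fields = new HashMap<>();

    /** Replaces whatever values the field had with a single value. */
    void set(String name, String value) {
        List<String> single = new ArrayList<>();
        single.add(value);
        fields.put(name, single);
    }

    /** Appends a value, keeping the existing ones. */
    void add(String name, String value) {
        fields.computeIfAbsent(name, k -> new ArrayList<>()).add(value);
    }

    /** Returns all values in insertion order; empty array if the field is unset. */
    String[] getValues(String name) {
        List<String> values = fields.getOrDefault(name, Collections.emptyList());
        return values.toArray(new String[0]);
    }
}
```

Calling set("format", "text/html") and then add("format", "text/plain") leaves two values on the field, while a second set would leave only the last one.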
Methods for language identification
• N-grams
– Model which character, or set of characters, is likely to come next in a sequence
– Useful in determining whether small snippets of text come from a particular language, or character set
• Non-computational approaches
– Tagging
– Looking for common words or characters
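To make the N-gram idea concrete, here is a minimal JDK-only sketch that builds character trigram profiles and picks the language whose profile covers most of a snippet's trigrams. It is an illustration under assumptions, not the algorithm shipped with Tika: the class NgramLangSketch, the overlap score, and the one-sentence training samples are all hypothetical, and real profiles are trained on far more text.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

/** Toy trigram-overlap language guesser; not Tika's LanguageIdentifier. */
class NgramLangSketch {
    private final Map<String, Set<String>> profiles = new LinkedHashMap<>();

    /** Splits lower-cased text into overlapping 3-character windows. */
    static List<String> trigrams(String text) {
        String s = text.toLowerCase(Locale.ROOT);
        List<String> grams = new ArrayList<>();
        for (int i = 0; i + 3 <= s.length(); i++) {
            grams.add(s.substring(i, i + 3));
        }
        return grams;
    }

    /** Stores the set of trigrams seen in the sample as the language profile. */
    void train(String language, String sample) {
        profiles.put(language, new HashSet<>(trigrams(sample)));
    }

    /** Fraction of the text's trigrams that occur in the language profile. */
    double score(String language, String text) {
        List<String> grams = trigrams(text);
        if (grams.isEmpty()) {
            return 0.0;
        }
        Set<String> profile = profiles.get(language);
        int hits = 0;
        for (String gram : grams) {
            if (profile.contains(gram)) {
                hits++;
            }
        }
        return (double) hits / grams.size();
    }

    /** Returns the best-scoring trained language, or null if none is trained. */
    String identify(String text) {
        String best = null;
        double bestScore = -1.0;
        for (String language : profiles.keySet()) {
            double s = score(language, text);
            if (s > bestScore) {
                bestScore = s;
                best = language;
            }
        }
        return best;
    }

    /** Tiny hypothetical training set, kept ASCII-only for simplicity. */
    static NgramLangSketch demo() {
        NgramLangSketch sketch = new NgramLangSketch();
        sketch.train("en", "the quick brown fox jumps over the lazy dog and then the cat sleeps");
        sketch.train("es", "el rapido zorro marron salta sobre el perro perezoso y luego el gato duerme");
        return sketch;
    }
}
```

With these two samples, NgramLangSketch.demo().identify("el perro y el gato") picks "es", since nearly all of the snippet's trigrams appear in the Spanish sample and none appear in the English one.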
Language Detection
• LanguageIdentifier lang =
new LanguageIdentifier(new LanguageProfile(
FileUtils.readFileToString(new
File(filename))));
• System.out.println(lang.getLanguage());
• Uses Ngram analysis included with Tika
– Originating from Nutch
– Can be improved
Running Tika in GUI form
• tika --gui
<html xmlns:html=“…”>
<body>
…
</body>
</html>
Integrating Tika into your App
• Maven
• Ant
• Eclipse
• It’s just a set of jars
– tika-core
– tika-parsers
– tika-app
– tika-bundle
– tika-server
Some really great stuff in 1.0
• Super improved OSGi support (Nick already talked about this!!! Thunder stolen)
– New tika-bundle module
• Improved RTF parsing support, OO support,
and parsing of Outlook email attachments
• Language Detection for Belarusian, Catalan, Esperanto, Galician, Lithuanian, Romanian, Slovak, Slovenian, and Ukrainian
• Improved PDF parsing (extract annotation)
Things to watch out for
• Deprecated APIs -> gone
– Recompile code
• No more JDK 1.4 version of Tika
– Upgrade
Improvements to Tika
• Adding more parsers for content types
• Improve the JAX-RS server support
• Expanding ability to handle random
access file parsing
– Scientific data file formats, some work on this
– Leverage improvements in file representation
TIKA-701, TIKA-654, TIKA-645, TIKA-153
• Geospatial parsing support through GDAL
• Improving language and charset detection
Part 2
Science Data Systems at NASA
Credit: http://www.jpl.nasa.gov/news/news.cfm?release=2011-295
NASA Ground Data Systems
Credit: D. Woollard
Context
• NASA develops science data processing systems
for multiple earth science missions
• These systems convert the instrument telemetry
delivered to earth from space into useful data for
scientific research
• Typical characteristics
– Remote sensing instruments that orbit the Earth multiple
times daily
– Data are acquired constantly
– Complex algorithms convert instrument measurements to
geophysical quantities
The Square Kilometer Array
• 1 sq. km of antennas
• Never-before-seen resolution looking into the sky
• 700 TB
– Per second!
NASA DESDynI Mission
• 16 TB/day
• Geographically distributed
• 10s of 1000s of jobs per day
• Tier 1 Earth Science Decadal Mission
Some Considerations
• Scale
– Data throughput rates
– # of data types
– # of metadata types
– # of users to send the data to
• Federation
– Must leave the data where it is
– Socio/Economic/Political
• Heterogeneity
– Technology, data formats, skills!
Apache OODT
• We’ve got some components to deal with
these issues
How are we building these systems now?
- Allow for push/pull of data over arbitrary protocols
- Ingestion builds std catalog and archive
- Deliver product metadata to search, portal or GIS
- Plug in arbitrary met extractors
How are we building these systems now?
- Separation of file management from workflow management
- Allow for heterogeneous computing resources
- Easily integrate PGEs
- Leverages same ingestion crawler
What does this have to do with Tika?
• MIME identification: TIKA!
• Metadata Ext: TIKA!
Science Data File Formats
• Hierarchical Data Format (HDF)
– http://www.hdfgroup.org
– Versions 4 and 5
– Lots of NASA data is in 4, newer NASA data in 5
– Encapsulates
• Observation (Scalars, Vectors, Matrices, NxMxZ…)
• Metadata (Summary info, date/time ranges, spatial ranges)
– Custom readers/writers/APIs in many languages
• C/C++, Python, Java
Science Data File Formats
• network Common Data Form (netCDF)
– www.unidata.ucar.edu/software/netcdf/
– Versions 3 and 4
– Heavily used in DOE, NOAA, etc.
– Encapsulates
• Observation (Scalars, Vectors, Matrices, NxMxZ…)
• Metadata (Summary info, date/time ranges, spatial ranges)
– Custom readers/writers/APIs in many languages
• C/C++, Python, Java
– Not Hierarchical representation: all flat
So how does it work?
• Ingestion
– Science data files, ancillary information from
other missions, etc., arrive in NetCDF or HDF
format
– Need to extract their met, catalog and archive
them, etc.
• Can now use Tika to do this! TIKA-399 and TIKA-400 added this capability
• Processing
– Processors (PGEs) generate NetCDF and
HDF, must extract met, catalog and archive
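The ingestion flow above (identify the file type, run the matching met extractor, then catalog the result) can be sketched in a few lines of JDK-only Java. Everything here is a hypothetical stand-in rather than OODT or Tika code: the class IngestSketch, the extension-based detect method, the placeholder extractors, and the in-memory catalog list are assumptions for illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.function.Function;

/** Toy ingest pipeline: detect type -> extract met -> catalog. Illustration only. */
class IngestSketch {
    // Hypothetical registry: MIME type -> met extractor taking a file name.
    private final Map<String, Function<String, Map<String, String>>> extractors =
        new HashMap<>();
    // Stand-in for a real catalog: one met record per ingested file.
    final List<Map<String, String>> catalog = new ArrayList<>();

    IngestSketch() {
        extractors.put("application/x-hdf", name -> baseMet(name, "HDF"));
        extractors.put("application/x-netcdf", name -> baseMet(name, "NetCDF"));
    }

    /** Placeholder extractor output; a real one would read the file's headers. */
    private static Map<String, String> baseMet(String name, String format) {
        Map<String, String> met = new LinkedHashMap<>();
        met.put("filename", name);
        met.put("format", format);
        return met;
    }

    /** Extension-only detection, standing in for a full MIME detector. */
    static String detect(String fileName) {
        String lower = fileName.toLowerCase(Locale.ROOT);
        if (lower.endsWith(".hdf")) {
            return "application/x-hdf";
        }
        if (lower.endsWith(".nc")) {
            return "application/x-netcdf";
        }
        return "application/octet-stream";
    }

    /** Returns the cataloged met record, or null when no extractor handles the type. */
    Map<String, String> ingest(String fileName) {
        String mime = detect(fileName);
        Function<String, Map<String, String>> extractor = extractors.get(mime);
        if (extractor == null) {
            return null;
        }
        Map<String, String> met = extractor.apply(fileName);
        met.put("mimeType", mime);
        catalog.add(met);
        return met;
    }
}
```

The registry keyed by MIME type is the design point: supporting a new format means registering one more extractor, while the ingest loop itself stays unchanged.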
Tool support
• Entire stacks of tools written around
these formats
– OPeNDAP, LAS, readers, writers, custom
NASA mission toolkits
– OGC
• WMS, WCS, etc.
– Unique, one-of-a-kind software built around these data file formats
• Apache can contribute strongly in this
area!
Besides processing science files
• …Tika also helps with
• MIME identification
– Useful in remote file acquisition
– Useful in classification (catalog/archive) of existing content
– Useful in crawling (see my Nutch talk last year: http://s.apache.org/UvU)
• Language identification
– Can be useful when data is coming from
around the world, but need to quickly identify
whether or not we can process it
Big Goal
• More closely link OODT and Tika
– Add new parser to Tika
– Easily get OODT met extractor based on it
• Contribute back some features still baking
in OODT
– Configuration aspects of parsing
– File types and extensions for science data files
• Spatial
– Some work done in my CS572 class on spatial
parser for Tika – would be great to integrate
with Tika, OODT, SIS, and Solr
NASA Geo Challenges
• Sometimes the data isn’t annotated with lat and lon
– How to discover this?
• Even when the data is annotated with spatial information, computation of e.g., a bounding box around the poles is difficult
• Efficiency and speed are difficult since data is at scale
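A small worked example of why polar bounding boxes are awkward, as a JDK-only sketch. The class PolarBoxSketch, the {lat, lon} vertex layout, and the rule that the first vertex's hemisphere decides which pole is enclosed are all assumptions for illustration; production geometry libraries handle many more cases.

```java
/** Toy bounding-box helper for footprints given as rings of {lat, lon} degrees. */
class PolarBoxSketch {
    /** Returns {minLat, maxLat, minLon, maxLon} from the raw vertex extremes. */
    static double[] naiveBox(double[][] ring) {
        double minLat = 90, maxLat = -90, minLon = 180, maxLon = -180;
        for (double[] p : ring) {
            minLat = Math.min(minLat, p[0]);
            maxLat = Math.max(maxLat, p[0]);
            minLon = Math.min(minLon, p[1]);
            maxLon = Math.max(maxLon, p[1]);
        }
        return new double[] {minLat, maxLat, minLon, maxLon};
    }

    /** True if the closed ring's longitudes wind once all the way around a pole. */
    static boolean enclosesPole(double[][] ring) {
        double total = 0;
        for (int i = 0; i < ring.length; i++) {
            double step = ring[(i + 1) % ring.length][1] - ring[i][1];
            // Wrap each step into [-180, 180) so a dateline crossing is a small step.
            step = ((step + 540) % 360) - 180;
            total += step;
        }
        return Math.abs(Math.abs(total) - 360) < 1e-6;
    }

    /** Widens the naive box to the pole and to all longitudes when needed. */
    static double[] polarAwareBox(double[][] ring) {
        double[] box = naiveBox(ring);
        if (enclosesPole(ring)) {
            // Assumption: the hemisphere of the first vertex tells us which pole.
            if (ring[0][0] >= 0) {
                box[1] = 90;
            } else {
                box[0] = -90;
            }
            box[2] = -180;
            box[3] = 180;
        }
        return box;
    }
}
```

For a footprint with vertices at latitude 80 and longitudes -135, -45, 45, and 135, the naive box spans latitude 80 to 80 and so excludes the North Pole that the footprint actually covers; the pole-aware version extends it to latitude 90 and the full longitude range.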
Acknowledgements
• Some Tika material inspired by Jukka
Zitting’s talks
– http://www.slideshare.net/jukka/text-and-metadata-extraction-with-apache-tika
– http://www.slideshare.net/jukka/text-and-metadata-extraction-with-apache-tika-4427630
• NASA Jet Propulsion Laboratory
– OODT Team
Book
• Jukka and I have finished
the first definitive guide
on Apache Tika
• Official release date: 11/17
• Early Access available
through MEAP
program
• http://manning.com/mattmann/
Alright, I’ll shut up now
• Any questions?
• THANK YOU!
– [email protected]
– [email protected]
– @chrismattmann on Twitter