Concepts, Semantics and Syntax in E-Discovery David Eichmann

Concepts, Semantics and
Syntax in E-Discovery
David Eichmann
Institute for Clinical and Translational Science
The University of Iowa
Our Approach
 Analyze the human-generated metadata
available for document collections for
organizational and individual interactions
 Explore the syntactic and semantic nature of
document content and the potential for
automatic generation of metadata
 Explore the concept space generated by the
previous step and its correspondence to
Boolean predicate specification in discovery
Our Target Corpus
 The Illinois Institute of Technology Complex
Document Information Processing Test
Collection (IIT CDIP), v. 1.0
 Derived from the tobacco master settlement
agreement
 Comprises 6,910,192 ‘documents’
 Or more properly the OCR output from those
documents
 Two merged XML tag sets of metadata, with
overlapping content
 <A>
 <LTDLWOCR>
Metadata Entity Frequencies
                  Occurrences
Entity           Total    Distinct   Avg/Entity   Avg/Doc.
Bates        9,476,794   8,054,075         1.18       1.40
Category    13,594,494          74   183,709.38       2.00
Doctype     18,359,644       2,501     7,340.92       2.70
Prodbox      6,830,993       6,306     1,083.25       1.01
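The Avg/Entity column appears to be the total occurrence count divided by the number of distinct values. That reading is inferred from the figures, not stated on the slide, and the short Python check below only covers that one column. The helper name `avg_per_entity` is illustrative.

```python
# Totals and distinct counts copied from the table above.
rows = {
    "Bates": (9_476_794, 8_054_075),
    "Category": (13_594_494, 74),
    "Doctype": (18_359_644, 2_501),
    "Prodbox": (6_830_993, 6_306),
}

def avg_per_entity(total, distinct):
    """Average number of occurrences per distinct value, to two decimals."""
    return round(total / distinct, 2)

averages = {name: avg_per_entity(t, d) for name, (t, d) in rows.items()}

for name, value in averages.items():
    print(f"{name:<10}{value:>12,.2f}")
```

A very low distinct count (Category has 74 values) gives a huge average per value, while Bates numbers are close to unique per occurrence.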
Metadata Entity Frequencies
                  Occurrences
Entity           Total    Distinct   Avg/Entity   Avg/Doc.
Attendee    65,691,473      49,375     1,330.46       9.68
Brand       26,498,001     155,350       170.57       3.90
Copied       8,775,307     322,294        27.22       1.29
Metadata Entity Frequencies
                  Occurrences
Org. Entity      Total    Distinct   Avg/Entity   Avg/Doc.
Author       8,742,976     149,641        58.43       1.29
Mentioned   31,406,753     883,285        35.56       4.63
Receiving    8,262,496      63,625       129.86       1.22
Metadata Entity Frequencies
                    Occurrences
Person Entity      Total    Distinct   Avg/Entity   Avg/Doc.
Author        11,128,029     875,292        12.71       1.64
Mentioned     34,683,289   1,938,310        17.89       5.11
Receiving     23,427,415     455,404        51.44       3.45
Database Schema
 We map the XML structure to a set of
relational database tables
 Non-recurring fields are collected in a table
named ‘document’
 docid
 title
 description
 OCR text
 Recurring elements each get a table
 docid
 value
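A minimal sketch of this mapping, using Python's standard `sqlite3` and `xml.etree.ElementTree` modules. The sample record, its tag names (`record`, `ocr_text`, `author`, `mentioned`) and the choice of SQLite are assumptions made for illustration. The actual CDIP tag sets and the database used in this work may differ.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical record: the real CDIP tag names and nesting may differ.
SAMPLE = """<record>
  <docid>doc-0001</docid>
  <title>Example title</title>
  <description>Example description</description>
  <ocr_text>Example OCR text</ocr_text>
  <author>DOE, J</author>
  <author>ROE, R</author>
  <mentioned>DOE, J</mentioned>
</record>"""

SINGLE = ("title", "description", "ocr_text")  # non-recurring fields
RECURRING = ("author", "mentioned")            # one table per recurring element

def create_schema(conn):
    conn.execute(
        "CREATE TABLE document ("
        "docid TEXT PRIMARY KEY, title TEXT, description TEXT, ocr_text TEXT)"
    )
    for name in RECURRING:
        conn.execute(
            f"CREATE TABLE {name} (docid TEXT REFERENCES document(docid), value TEXT)"
        )

def load_record(conn, xml_text):
    root = ET.fromstring(xml_text)
    docid = root.findtext("docid")
    # Non-recurring fields become one row in 'document'.
    conn.execute(
        "INSERT INTO document VALUES (?, ?, ?, ?)",
        (docid, *(root.findtext(field) for field in SINGLE)),
    )
    # Each recurring element becomes one (docid, value) row in its own table.
    for name in RECURRING:
        conn.executemany(
            f"INSERT INTO {name} VALUES (?, ?)",
            [(docid, el.text) for el in root.findall(name)],
        )

conn = sqlite3.connect(":memory:")
create_schema(conn)
load_record(conn, SAMPLE)
print(conn.execute("SELECT value FROM author ORDER BY value").fetchall())
```

Keeping every recurring element in a two-column (docid, value) table lets frequency counts like those in the preceding tables be expressed as simple GROUP BY queries.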
Identifying an Individual
Person
REININGHAUS, W
REININGHAUS
REININGHAUS, B
REININGHAUS, R
# of Occurrences as: Attendee, Author, Receiver, Mention
189,380
23,880
32,764
16,152
7,337
200
1,974
2,837
196
2
17
144
12
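One way to start on this problem is to group surface forms that differ only in case, spacing or punctuation. The sketch below is an illustration under that assumption, not the method used in this work. The function names are hypothetical, and the key treats everything after the comma as initials.

```python
from collections import defaultdict

def name_key(raw):
    """Return (surname, initials), upper-cased, with spaces and dots removed."""
    surname, _, rest = raw.partition(",")
    initials = "".join(ch for ch in rest.upper() if ch.isalpha())
    return surname.strip().upper(), initials

def group_variants(names):
    """Map each normalized key to the set of raw spellings that produced it."""
    groups = defaultdict(set)
    for raw in names:
        groups[name_key(raw)].add(raw)
    return dict(groups)

variants = group_variants([
    "REININGHAUS, W", "Reininghaus,W", "REININGHAUS",
    "REININGHAUS, R", "Reininghaus,R",
])
for key, spellings in sorted(variants.items()):
    print(key, sorted(spellings))
```

Normalization merges spelling variants, but it cannot decide whether the bare surname REININGHAUS refers to W, R, B or someone else. That question needs additional evidence, such as who a name co-occurs with.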
How Many Reininghaus?
 Reininghaus,R
 Reininghaus,W
Co-mention Connections
Reininghaus               Walk
Person          Count     Person           Count
WALK,RA         3,871     REININGHAUS,W    3,871
ROEMER,E        3,716     ROEMER,E         2,883
HAUSSMANN,HJ    3,293     HAUSSMANN,HJ     2,799
TEWES,F         2,784     HACKENBERG,U     2,360
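Co-mention counts of this kind can be derived from a (docid, value) table of mentioned persons by counting, for each pair of names, the documents in which both appear. The sketch below is one possible formulation using toy rows rather than CDIP data. The function names are illustrative.

```python
from collections import Counter
from itertools import combinations

def co_mention_counts(rows):
    """rows: iterable of (docid, person) pairs. Returns document counts per pair."""
    by_doc = {}
    for docid, person in rows:
        by_doc.setdefault(docid, set()).add(person)
    counts = Counter()
    for people in by_doc.values():
        # Sorting makes each unordered pair appear under a single key.
        for a, b in combinations(sorted(people), 2):
            counts[(a, b)] += 1
    return counts

def top_partners(counts, person, n=4):
    """The n names most often co-mentioned with 'person'."""
    partners = Counter()
    for (a, b), c in counts.items():
        if a == person:
            partners[b] += c
        elif b == person:
            partners[a] += c
    return partners.most_common(n)

# Toy rows for illustration only.
rows = [
    ("d1", "REININGHAUS,W"), ("d1", "WALK,RA"), ("d1", "ROEMER,E"),
    ("d2", "REININGHAUS,W"), ("d2", "WALK,RA"),
    ("d3", "ROEMER,E"), ("d3", "WALK,RA"),
]
counts = co_mention_counts(rows)
print(top_partners(counts, "REININGHAUS,W"))
```

Because a pair is counted once per document, the count is symmetric. This matches the table, where Reininghaus and Walk list each other with the same count of 3,871.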
Co-mention Connections
Reininghaus               Roemer
Person          Count     Person           Count
WALK,RA         3,871     REININGHAUS,W    3,716
ROEMER,E        3,716     WALK,RA          2,883
HAUSSMANN,HJ    3,293     HACKENBERG,U     2,623
TEWES,F         2,784     HAUSSMAN,HJ      2,573
Co-mention Connections
Reininghaus               Haussmann
Person          Count     Person           Count
WALK,RA         3,871     REININGHAUS,W    3,293
ROEMER,E        3,716     WALK,RA          2,799
HAUSSMANN,HJ    3,293     ROEMER,E         2,573
TEWES,F         2,784     VONCKEN,P        2,323
Co-mention Affiliations
Person                     Affiliation
Reininghaus, Wolf          Gen. Mgr, Contract Research, INBIFO
Walk, Rudiger-Alexander    Dir. Human Studies, Philip Morris
Roemer, Ewald              INBIFO
Haussmann, Hans-Jurgen     Assoc. Prin. Scientist, Philip Morris
Tewes, F.                  Biologist, INBIFO
Hackenberg, Ulrich         INBIFO
Voncken, P.                Chemist, INBIFO
Semantics and Structure
 Our analysis of content involves the following
phases:
 Lexical analysis
 Sentence boundary detection
 Named entity recognition
 Sentence parsing
 Relationship extraction
 The nature of the OCR data seriously impacts
each of the phases (sometimes in different
ways)
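To illustrate how OCR noise can affect the first two phases differently, the sketch below uses deliberately simple regular-expression rules for tokenization and sentence boundary detection. These rules are assumptions chosen for illustration, not the analyzers used in this work, and the noisy sentence is invented.

```python
import re

TOKEN = re.compile(r"\w+|[^\w\s]")
# Split after ., ! or ? when followed by whitespace and a capital letter.
SENT_END = re.compile(r"(?<=[.!?])\s+(?=[A-Z])")

def split_sentences(text):
    return [s.strip() for s in SENT_END.split(text) if s.strip()]

def tokenize(sentence):
    return TOKEN.findall(sentence)

clean = "The study was completed. Results were sent to Cologne."
noisy = "The s tudy was c0mpleted . results were sent to Cologne"

for label, text in (("clean", clean), ("noisy", noisy)):
    sentences = split_sentences(text)
    print(label, len(sentences), [tokenize(s) for s in sentences])
```

In the noisy string, a stray space splits one word into two tokens (a lexical error), while the lost capital letter hides the sentence boundary so that two sentences reach the parser as one.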
CDIP Parse Tree Complexity
Clean Text Parse Tree Complexity
Next Steps
 Experiment with custom lexical analysis of the
OCR
 Start with simple white space detection
 Construct a lexicon and flag out-of-vocabulary
tokens as OCR error candidates
 Rewrite the analyzer to support OCR error correction
 Run sentence boundary detection and parsing
over the full corpus
 Generate entity relationships using our
question answering framework
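The first two lexical steps could look roughly like the sketch below: split on whitespace, build a lexicon from tokens that recur, and flag everything else as a candidate OCR error. The frequency threshold `min_count` and the toy corpus are assumptions. The slides do not say how the lexicon is to be constructed.

```python
from collections import Counter

def whitespace_tokens(text):
    """Simplest possible lexical analysis: split on runs of whitespace."""
    return text.split()

def build_lexicon(texts, min_count=2):
    """Treat tokens seen at least min_count times as known vocabulary."""
    counts = Counter(
        tok.lower() for text in texts for tok in whitespace_tokens(text)
    )
    return {tok for tok, n in counts.items() if n >= min_count}

def error_candidates(text, lexicon):
    """Tokens not in the lexicon are candidates for OCR errors."""
    return [tok for tok in whitespace_tokens(text) if tok.lower() not in lexicon]

# Toy corpus for illustration; 'srnoke' imitates an OCR misreading of 'smoke'.
corpus = [
    "smoke exposure study",
    "smoke exposure report",
    "srnoke exposure study report",
]
lexicon = build_lexicon(corpus)
print(error_candidates(corpus[2], lexicon))
```

A pure frequency threshold also flags rare but legitimate words such as surnames, so the candidates would still need to be filtered before any correction step.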
And Beyond That…
 Return to the document images and analyze
document layout
 Regenerate OCR to include token coordinates
 Use our PDF structure extraction framework to
generate logical document structure
 Generate a set of document models based upon
similar layout
 Use the document models to map OCR text to
metadata elements
For Example