Concepts, Semantics and Syntax in E-Discovery David Eichmann
Concepts, Semantics and
Syntax in E-Discovery
David Eichmann
Institute for Clinical and Translational Science
The University of Iowa
Our Approach
Analyze the human-generated metadata available for document collections for organizational and individual interactions
Explore the syntactic and semantic nature of document content and the potential for automatic generation of metadata
Explore the concept space generated by the previous step and its correspondence to Boolean predicate specification in discovery
Our Target Corpus
The Illinois Institute of Technology Complex Document Information Processing Test Collection (IIT CDIP), v. 1.0
Derived from the tobacco Master Settlement Agreement
Comprises 6,910,192 ‘documents’, or more properly the OCR output from those documents
Two merged XML tag sets of metadata, with overlapping content:
<A>
<LTDLWOCR>
Metadata Entity Frequencies
Entity      Total occurrences    Distinct     Avg/Entity    Avg/Doc.
Bates               9,476,794   8,054,075           1.18        1.40
Category           13,594,494          74     183,709.38        2.00
Doctype            18,359,644       2,501       7,340.92        2.70
Prodbox             6,830,993       6,306       1,083.25        1.01
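The Avg/Entity column appears to be the total number of occurrences divided by the number of distinct values. That is an inference from the figures rather than something the slides state, and the short sketch below only checks it against the four rows above. The function name `avg_per_entity` is introduced here for illustration. Avg/Doc. is not recomputed, because the slides do not say which document count serves as its denominator.

```python
# Rows copied from the table above: (entity, total occurrences, distinct values).
rows = [
    ("Bates", 9_476_794, 8_054_075),
    ("Category", 13_594_494, 74),
    ("Doctype", 18_359_644, 2_501),
    ("Prodbox", 6_830_993, 6_306),
]


def avg_per_entity(total: int, distinct: int) -> float:
    """Mean occurrences per distinct value, rounded to two decimal places.

    Assumption: this is how the Avg/Entity column was derived.
    """
    return round(total / distinct, 2)


averages = {name: avg_per_entity(total, distinct) for name, total, distinct in rows}

for name, value in averages.items():
    print(f"{name:<10}{value:>14,.2f}")
```

A very high Avg/Entity (Category) indicates a small controlled vocabulary, while a value near 1 (Bates) indicates an identifier that is almost unique per occurrence.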
Metadata Entity Frequencies
Entity      Total occurrences    Distinct     Avg/Entity    Avg/Doc.
Attendee           65,691,473      49,375       1,330.46        9.68
Brand              26,498,001     155,350         170.57        3.90
Copied              8,775,307     322,294          27.22        1.29
Metadata Entity Frequencies
Org.
Entity      Total occurrences    Distinct     Avg/Entity    Avg/Doc.
Author              8,742,976     149,641          58.43        1.29
Mentioned          31,406,753     883,285          35.56        4.63
Receiving           8,262,496      63,625         129.86        1.22
Metadata Entity Frequencies
Person
Entity      Total occurrences    Distinct     Avg/Entity    Avg/Doc.
Author             11,128,029     875,292          12.71        1.64
Mentioned          34,683,289   1,938,310          17.89        5.11
Receiving          23,427,415     455,404          51.44        3.45
Database Schema
We map the XML structure to a set of relational database tables
Non-recurring fields are collected in a table named ‘document’: docid, title, description, OCR text
Recurring elements each get a table: docid, value
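The mapping described above can be sketched with Python's standard `sqlite3` and `xml.etree.ElementTree` modules. Everything specific in this sketch is an assumption made for illustration: the sample record, the tag names (`record`, `ocr`, `author`, `mentioned`) and the choice of SQLite are not taken from the CDIP tag sets or from the authors' implementation.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical record; the tag names are illustrative, not the CDIP tag set.
SAMPLE = """
<record>
  <docid>doc-0001</docid>
  <title>Example memo</title>
  <description>Illustrative record</description>
  <ocr>Body text produced by OCR.</ocr>
  <author>DOE,J</author>
  <mentioned>ROE,R</mentioned>
  <mentioned>POE,E</mentioned>
</record>
"""

SINGLE = ("title", "description", "ocr")  # non-recurring fields
RECURRING = ("author", "mentioned")       # one (docid, value) table per element


def create_schema(conn: sqlite3.Connection) -> None:
    conn.execute(
        "CREATE TABLE document (docid TEXT PRIMARY KEY, title TEXT, "
        "description TEXT, ocr TEXT)"
    )
    for name in RECURRING:
        # Table names come from the fixed tuple above, never from input data.
        conn.execute(f"CREATE TABLE {name} (docid TEXT, value TEXT)")


def load_record(conn: sqlite3.Connection, xml_text: str) -> None:
    rec = ET.fromstring(xml_text.strip())
    docid = rec.findtext("docid")
    conn.execute(
        "INSERT INTO document VALUES (?, ?, ?, ?)",
        (docid, *(rec.findtext(field) for field in SINGLE)),
    )
    for name in RECURRING:
        conn.executemany(
            f"INSERT INTO {name} VALUES (?, ?)",
            [(docid, element.text) for element in rec.findall(name)],
        )


conn = sqlite3.connect(":memory:")
create_schema(conn)
load_record(conn, SAMPLE)
```

Keeping each recurring element in its own two-column table means that a frequency count such as those in the preceding tables reduces to a `COUNT(*)` and a `COUNT(DISTINCT value)` over one table.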
Identifying an Individual
Person
REININGHAUS, W
REININGHAUS
REININGHAUS, B
REININGHAUS, R
# of Occurrences as
Attendee
Author
Receiver Mention
189,380
23,880
32,764
16,152
7,337
200
1,974
2,837
196
2
17
144
12
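One simple way to surface name variants like these is to group the raw metadata strings by surname and collect the initials seen with each. The sketch below does only that. The helper names `parse_name` and `group_by_surname` are hypothetical, and a shared surname is merely a candidate match: deciding whether two variants refer to the same individual needs further evidence.

```python
from collections import defaultdict


def parse_name(raw: str) -> tuple[str, str]:
    """Split 'SURNAME, INITIALS' into (surname, initials); initials may be ''."""
    surname, _, initials = raw.partition(",")
    return surname.strip().upper(), initials.strip().upper().replace(".", "")


def group_by_surname(names: list[str]) -> dict[str, set[str]]:
    """Map each surname to the set of initials observed alongside it."""
    groups: dict[str, set[str]] = defaultdict(set)
    for raw in names:
        surname, initials = parse_name(raw)
        groups[surname].add(initials)
    return groups


variants = [
    "REININGHAUS, W",
    "REININGHAUS",
    "REININGHAUS, B",
    "REININGHAUS, R",
    "Reininghaus,W",
]
groups = group_by_surname(variants)
```

The empty string in a group marks occurrences that carry no initial at all, which are the hardest to attribute to one person.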
How Many Reininghaus?
Reininghaus,R
Reininghaus,W
Co-mention Connections
Reininghaus                Walk
Person           Count     Person           Count
WALK,RA          3,871     REININGHAUS,W    3,871
ROEMER,E         3,716     ROEMER,E         2,883
HAUSSMANN,HJ     3,293     HAUSSMANN,HJ     2,799
TEWES,F          2,784     HACKENBERG,U     2,360
Co-mention Connections
Reininghaus                Roemer
Person           Count     Person           Count
WALK,RA          3,871     REININGHAUS,W    3,716
ROEMER,E         3,716     WALK,RA          2,883
HAUSSMANN,HJ     3,293     HACKENBERG,U     2,623
TEWES,F          2,784     HAUSSMAN,HJ      2,573
Co-mention Connections
Reininghaus                Haussmann
Person           Count     Person           Count
WALK,RA          3,871     REININGHAUS,W    3,293
ROEMER,E         3,716     WALK,RA          2,799
HAUSSMANN,HJ     3,293     ROEMER,E         2,573
TEWES,F          2,784     VONCKEN,P        2,323
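Co-mention counts of this kind can be derived by counting, for every pair of people, the documents in which both are mentioned. The sketch below works on a toy mapping from document id to mentioned names. The function names and the three-document sample are hypothetical, and the slides do not say how the published counts were actually computed.

```python
from collections import Counter
from itertools import combinations


def co_mention_counts(mentions_by_doc: dict[str, list[str]]) -> Counter:
    """Count documents in which each unordered pair of people co-occurs."""
    counts: Counter = Counter()
    for people in mentions_by_doc.values():
        # set() drops repeated mentions within one document;
        # sorted() gives each pair a single canonical order.
        for a, b in combinations(sorted(set(people)), 2):
            counts[(a, b)] += 1
    return counts


def top_partners(counts: Counter, person: str, n: int = 4) -> list[tuple[str, int]]:
    """Return the n people most often co-mentioned with `person`."""
    partners: Counter = Counter()
    for (a, b), count in counts.items():
        if a == person:
            partners[b] += count
        elif b == person:
            partners[a] += count
    return partners.most_common(n)


# Toy data, not drawn from the corpus.
docs = {
    "d1": ["REININGHAUS,W", "WALK,RA", "ROEMER,E"],
    "d2": ["REININGHAUS,W", "WALK,RA"],
    "d3": ["ROEMER,E", "WALK,RA"],
}
counts = co_mention_counts(docs)
partners = top_partners(counts, "REININGHAUS,W")
```

Because pairs are unordered, the count is symmetric, which is consistent with the tables above: Reininghaus lists Walk at 3,871 and Walk lists Reininghaus at 3,871.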
Co-mention Affiliations
Person                     Affiliation
Reininghaus, Wolf          Gen. Mgr, Contract Research, INBIFO
Walk, Rudiger-Alexander    Dir. Human Studies, Philip Morris
Roemer, Ewald              INBIFO
Haussmann, Hans-Jurgen     Assoc. Prin. Scientist, Philip Morris
Tewes, F.                  Biologist, INBIFO
Hackenberg, Ulrich         INBIFO
Voncken, P.                Chemist, INBIFO
Semantics and Structure
Our analysis of content involves the following phases:
Lexical analysis
Sentence boundary detection
Named entity recognition
Sentence parsing
Relationship extraction
The nature of the OCR data seriously impacts each of the phases (sometimes in different ways)
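To illustrate how OCR noise can disturb even the first two phases, the sketch below implements a deliberately naive sentence splitter and tokenizer with regular expressions. Both patterns and both sample strings are assumptions chosen for illustration and are not the authors' analyzer. A sentence boundary is recognized only where terminal punctuation is followed by whitespace and a capital letter.

```python
import re

# A word (optionally with internal hyphens or apostrophes) or one punctuation mark.
TOKEN = re.compile(r"\w+(?:[-']\w+)*|[^\w\s]")
# Boundary: whitespace preceded by . ! or ? and followed by an uppercase letter.
SENT_END = re.compile(r"(?<=[.!?])\s+(?=[A-Z])")


def split_sentences(text: str) -> list[str]:
    return [part.strip() for part in SENT_END.split(text) if part.strip()]


def tokenize(sentence: str) -> list[str]:
    return TOKEN.findall(sentence)


clean = "The study was reviewed. Results were sent to the sponsor."
# Simulated OCR damage: stray space, lost capital, digit substitutions.
noisy = "The study was rev1ewed . results were sent t0 the sponsor"

clean_sentences = split_sentences(clean)
noisy_sentences = split_sentences(noisy)
```

On the clean string the splitter finds two sentences. On the damaged string it finds one, because the lost capital hides the boundary, and every later phase then receives a single over-long sentence to parse.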
CDIP Parse Tree Complexity
Clean Text Parse Tree Complexity
Next Steps
Experiment with custom lexical analysis of the OCR
Start with simple white space detection
Construct a lexicon and look for out-of-band vocabulary as OCR error candidates
Rewrite the analyzer to support OCR error correction
Run sentence boundary detection and parsing over the full corpus
Generate entity relationships using our question answering framework
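The first two steps above can be sketched as whitespace tokenization, a frequency-based lexicon, and a check for tokens that fall outside it. The frequency threshold `min_count`, the punctuation set and the toy texts are assumptions made for illustration. A real lexicon would need a far larger sample and probably an external word list as well.

```python
from collections import Counter

PUNCT = ".,;:!?()\"'"


def whitespace_tokens(text: str) -> list[str]:
    """Split on white space, trim surrounding punctuation, lowercase."""
    tokens = (raw.strip(PUNCT).lower() for raw in text.split())
    return [tok for tok in tokens if tok]


def build_lexicon(texts: list[str], min_count: int = 2) -> set[str]:
    """Keep tokens seen at least min_count times across the sample."""
    counts = Counter(tok for text in texts for tok in whitespace_tokens(text))
    return {tok for tok, count in counts.items() if count >= min_count}


def error_candidates(text: str, lexicon: set[str]) -> list[str]:
    """Tokens absent from the lexicon, in document order."""
    return [tok for tok in whitespace_tokens(text) if tok not in lexicon]


# Toy sample, not drawn from the corpus.
texts = ["The smoke study.", "The study of smoke", "the sm0ke study"]
lexicon = build_lexicon(texts)
candidates = error_candidates(texts[2], lexicon)
```

A pure frequency cut-off also flags legitimate rare words ("of" appears only once in this toy sample), so the output is a list of candidates for review or correction, not a list of confirmed errors.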
And Beyond That…
Return to the document images and analyze document layout
Regenerate OCR to include token coordinates
Use our PDF structure extraction framework to generate logical document structure
Generate a set of document models based upon similar layout
Use the document models to map OCR text to metadata elements
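As a rough picture of the last step, a document model can be thought of as a set of named page regions, with each OCR token assigned to the metadata element whose region contains its coordinates. The sketch below is purely hypothetical: the `Token` and `Region` classes, the coordinate convention (y increasing down the page) and the sample values are all assumptions, and nothing here reflects the authors' PDF structure extraction framework.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    text: str
    x: float  # horizontal position of the token on the page
    y: float  # vertical position, increasing down the page


@dataclass(frozen=True)
class Region:
    field: str  # metadata element this region maps to
    x0: float
    y0: float
    x1: float
    y1: float

    def contains(self, tok: Token) -> bool:
        return self.x0 <= tok.x < self.x1 and self.y0 <= tok.y < self.y1


def map_tokens(tokens: list[Token], model: list[Region]) -> dict[str, str]:
    """Assign each token to the first region containing it, in reading order."""
    fields: dict[str, list[str]] = {}
    for tok in sorted(tokens, key=lambda t: (t.y, t.x)):
        for region in model:
            if region.contains(tok):
                fields.setdefault(region.field, []).append(tok.text)
                break
    return {field: " ".join(words) for field, words in fields.items()}


# Toy layout model and tokens, not derived from any real document.
model = [Region("author", 0, 0, 300, 100), Region("body", 0, 100, 600, 800)]
tokens = [Token("Smith", 120, 40), Token("J.", 80, 40), Token("Dear", 50, 200)]
result = map_tokens(tokens, model)
```

Tokens that fall outside every region are silently dropped in this sketch. A fuller treatment would have to decide whether such tokens signal a different document model or simply OCR noise.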
For Example