Session 3: The New Revolution


+
The New Frontier: Data Analytics and Information Governance
+
The Convergence of IR and IG
• Records have a utility
• Statutory compliance
• Electronic discovery
• Operational business value to the organization
+
Operational Business Value
• Classification
  - Historically obsolete?
  - Questionable ROI
+
Classification Issues
• Manual classification is not feasible
  - Training
  - Compliance
  - Taxonomy
• Unsupervised classification
  - Requires refinement and monitoring
  - Classification specifications
+
An Information Governance Model
+
Information Governance
• Security
• Privacy
• Investigation
• Retrieval/Remediation
+
Retrieval Tools
• Goodbye to binary B00lean Search
+
Boolean Search Issues
• Pros:
  - Simple to understand
  - Indexing is relatively simple
  - Query processing is very quick
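To illustrate why indexing is simple and queries are quick, here is a minimal sketch of a Boolean AND search over an inverted index. The sample documents and the names `build_index` and `boolean_and` are assumptions made for this illustration; they are not from the presentation.

```python
from collections import defaultdict

def build_index(docs):
    """Map each lowercase word to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index

def boolean_and(index, *terms):
    """Return ids of documents containing every term (a set intersection)."""
    postings = [index.get(term.lower(), set()) for term in terms]
    return set.intersection(*postings) if postings else set()

# Hypothetical three-document collection.
docs = {
    1: "the bank approved the loan",
    2: "we walked along the river bank",
    3: "the loan officer called",
}
index = build_index(docs)
print(boolean_and(index, "bank", "loan"))  # ids of documents containing both words
```

Each query is a lookup followed by a set intersection, which is why this style of search is fast. A document either matches or it does not; nothing is ranked.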
+
Boolean Search Negatives
• Problems with precision
  - Bank of a river
  - Bank of money
  - Bank of an airplane
+
Boolean Search Negatives
• Recall: missing lots of things
  - The same things may be denoted by different words: lawyer, attorney, barrister
  - Misspellings: AMOSS misspelled as AMOS, Hamilton as Hamiltin
  - Code words: “The wheels are in!”
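The precision and recall problems can be made concrete with a small sketch. The five sample documents, the relevance judgments, and the function names are hypothetical and chosen only to mirror the "bank" and "lawyer, attorney, barrister" examples.

```python
# Hypothetical collection: three legal documents and two senses of "bank".
docs = {
    "d1": "the attorney filed the motion",
    "d2": "our lawyer reviewed the contract",
    "d3": "the barrister addressed the court",
    "d4": "she deposited cash at the bank",
    "d5": "they fished from the river bank",
}

def boolean_search(term):
    """Return ids of documents that contain the exact word `term`."""
    return {doc_id for doc_id, text in docs.items() if term in text.split()}

def precision_recall(retrieved, relevant):
    """Precision = hits / retrieved; recall = hits / relevant."""
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Recall problem: all three legal documents are relevant, but only one matches "lawyer".
legal = boolean_search("lawyer")
print(precision_recall(legal, {"d1", "d2", "d3"}))

# Precision problem: only the financial sense of "bank" is wanted, but both senses match.
banks = boolean_search("bank")
print(precision_recall(banks, {"d4"}))
```

In this toy collection the "lawyer" query finds one of three relevant documents, and half of what the "bank" query returns is the wrong sense of the word.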
+
Boolean Solutions
• Individual wildcards
• Prefix and suffix wildcard
• Stemming
• Near ranges
• Faceted searching: brown, dress, shoes, laces
• Fuzzy logic: Boat, Goat, Boot
• Dictionaries
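Three of these workarounds (wildcards, stemming, and fuzzy matching) can be sketched with the Python standard library. The vocabulary list, the `crude_stem` helper, and the 0.7 similarity cutoff are assumptions for illustration; a production system would use a real stemmer and a tuned threshold.

```python
import difflib
import fnmatch

# Hypothetical index vocabulary.
vocabulary = ["boat", "goat", "boot", "govern", "governance", "governing", "records"]

def wildcard(pattern):
    """Match vocabulary words against a pattern: * is any run of characters, ? is one character."""
    return [word for word in vocabulary if fnmatch.fnmatchcase(word, pattern)]

def crude_stem(word):
    """Strip a few common suffixes. A toy stand-in for a real stemming algorithm."""
    for suffix in ("ance", "ing", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def fuzzy(term, cutoff=0.7):
    """Return vocabulary words whose similarity ratio to `term` is at least `cutoff`."""
    return difflib.get_close_matches(term, vocabulary, n=5, cutoff=cutoff)

print(wildcard("govern*"))                        # suffix wildcard
print(wildcard("?oat"))                           # individual (single-character) wildcard
print({word: crude_stem(word) for word in vocabulary})
print(fuzzy("boat"))                              # near-spellings of "boat"
```

Each technique widens what a query term can match, which helps recall, but the result is still an unranked set of hits.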
+
Enter the World of Mathematics
• Really, computers are nothing but, well, computers to begin with
+
A Document is not a bag of letters or words!
+
Documents have a mathematical structure!
+
A document collection is a whole bunch of points
+
Then we can draw a line to each point and we have a vector
+
The closer the Vectors, the more similar the documents!
• Term Frequency
  - The greater the term frequency, the more important the term is to the document
• Document Frequency
  - The greater the document frequency, the less important the term
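As a rough sketch of how term frequency and document frequency are counted, and how each document becomes a point (vector) of term counts, consider the toy collection below. The three sentences and all variable names are invented for this example.

```python
from collections import Counter

# Hypothetical three-document collection.
docs = [
    "the werewolf howled and the werewolf ran",
    "the villagers feared the werewolf",
    "the villagers slept",
]
tokenized = [text.split() for text in docs]

# Term frequency: how often each word occurs inside one document.
term_freq = [Counter(tokens) for tokens in tokenized]

# Document frequency: in how many documents each word occurs at least once.
doc_freq = Counter(word for tokens in tokenized for word in set(tokens))

# Each document as a vector of term counts over a shared, sorted vocabulary.
vocab = sorted(doc_freq)
vectors = [[tf[word] for word in vocab] for tf in term_freq]

print(term_freq[0]["werewolf"])   # occurrences of "werewolf" in the first document
print(doc_freq["the"])            # number of documents containing "the"
print(doc_freq["werewolf"])       # number of documents containing "werewolf"
print(vocab)
print(vectors)
```

Here "the" appears in every document, so it says little about any one of them, while "werewolf" is concentrated in fewer documents and is more distinctive.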
+
“Term Frequency” x “inverse document frequency” = “tf-idf”
• Suppose the word “werewolf” appears in a document 100 times. [We use exponents to “mute” the impact: 100 = 10^2, so the term-frequency weight is 2]
• Suppose the word “werewolf” appears in 1,000 documents in a 1,000,000-document collection [1,000 = 10^3 and 1,000,000 = 10^6]
• Then the tf-idf document score is 2 x (6/3) = 4
• The greater the tf-idf score, the more likely the document is important
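The werewolf arithmetic can be checked with a few lines of Python. The first function reproduces the slide's calculation, which divides the base-10 exponents (6/3). The second is a commonly used textbook variant, shown only for comparison, which takes the logarithm of the ratio of collection size to document frequency. Both function names are made up for this sketch.

```python
import math

def slide_tf_idf(term_count, docs_with_term, total_docs):
    """Reproduce the slide's arithmetic: log10(tf) times the ratio of base-10 exponents."""
    tf_weight = math.log10(term_count)                                      # 100 -> 2
    exponent_ratio = math.log10(total_docs) / math.log10(docs_with_term)    # 6 / 3
    return tf_weight * exponent_ratio

def log_ratio_tf_idf(term_count, docs_with_term, total_docs):
    """Common textbook variant (not the slide's): idf = log10(total_docs / docs_with_term)."""
    return math.log10(term_count) * math.log10(total_docs / docs_with_term)

print(slide_tf_idf(100, 1_000, 1_000_000))      # the slide's score, 2 x (6/3)
print(log_ratio_tf_idf(100, 1_000, 1_000_000))  # the variant's score, 2 x log10(1000)
```

With the slide's numbers the first function gives 4 and the variant gives 6. Either way, the score rises with term frequency and falls as the term appears in more documents.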
+
The smaller the angle, the more documents resemble each other.
[Chart: DOCUMENT RESEMBLANCE. Document vectors labeled a, b, and c, with angles marked 20 Degrees and 60 Degrees.]
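The usual way to turn the angle between document vectors into a resemblance score is cosine similarity. In this minimal sketch, the two-dimensional vectors are hypothetical: they are placed so that a and b are 20 degrees apart and b and c are 60 degrees apart, which may not match the layout of the original chart.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)

def angle_degrees(u, v):
    """Angle between two non-zero vectors, in degrees."""
    cos = max(-1.0, min(1.0, cosine_similarity(u, v)))  # guard against rounding drift
    return math.degrees(math.acos(cos))

def unit_vector(degrees):
    """Illustrative 2-D vector of length 1 at the given angle from the x-axis."""
    radians = math.radians(degrees)
    return (math.cos(radians), math.sin(radians))

# Hypothetical placement: a at 80 degrees, b at 60 degrees, c at 0 degrees.
a, b, c = unit_vector(80), unit_vector(60), unit_vector(0)

print(angle_degrees(a, b), cosine_similarity(a, b))  # smaller angle, higher similarity
print(angle_degrees(b, c), cosine_similarity(b, c))  # larger angle, lower similarity
```

A 20 degree angle gives a cosine of about 0.94 and a 60 degree angle gives 0.5, so under this measure a and b resemble each other more than b and c do.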
+
Goodbye Boolean, Hello Ranking
• Information Retrieval makes classification possible
• Information Retrieval makes investigations possible
• Information Retrieval finds the critical documents
• Information Retrieval is the heart of records management