Session 3 the New Revolution
+
The New Frontier: Data Analytics and Information Governance
+
The Convergence of IR and IG
Records have a utility
Statutory compliance
Electronic discovery
Operational business value to the organization
+
Operational Business Value
Classification
Historically obsolete?
Questionable ROI
+
Classification Issues
Manual classification is not feasible
Training
Compliance
Taxonomy
Unsupervised classification
Requires refinement and monitoring
Classification specifications
+
An Information Governance Model
+
Information Governance
Security
Privacy
Investigation
Retrieval/Remediation
+
Retrieval Tools
Goodbye to binary B00lean Search
+
Boolean Search Issues
Pros:
Simple to understand
Indexing is relatively simple
Query processing is very quick
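As a rough illustration of why Boolean indexing and query processing are simple and fast, the sketch below builds an inverted index and answers an AND query with a set intersection. The three sample documents and the function names `build_index` and `boolean_and` are made up for illustration; real search engines add much more on top.

```python
from collections import defaultdict

def build_index(docs):
    """Map each lowercase word to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index

def boolean_and(index, *terms):
    """Return ids of documents that contain every query term."""
    postings = [index.get(term.lower(), set()) for term in terms]
    return set.intersection(*postings) if postings else set()

# Hypothetical sample collection
docs = {
    1: "the bank of the river flooded",
    2: "the bank approved the loan",
    3: "the pilot put the airplane into a steep bank",
}
index = build_index(docs)
print(boolean_and(index, "bank", "river"))  # {1}
print(boolean_and(index, "bank"))           # {1, 2, 3}
```

A document either matches or it does not; nothing in the result says which match is the better one.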
+
Boolean Search Negatives
Problems with precision
Bank of a river
Bank of money
Bank of an airplane
+
Boolean Search Negatives
Recall: missing lots of things
The same things may be denoted by different words: lawyer, attorney, barrister
Misspellings: AMOSS misspelled as AMOS, Hamilton as Hamiltin
Code words: “The wheels are in!”
+
Boolean Solutions
Individual wildcards
Prefix and suffix wildcards
Stemming
Near ranges
Faceted searching: brown, dress, shoes, laces
Fuzzy logic: Boat, Goat, Boot
Dictionaries
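A minimal sketch of three of the workarounds listed on this slide: a prefix wildcard, fuzzy matching by edit distance, and a synonym dictionary. The vocabulary, the `SYNONYMS` table, and all function names are illustrative assumptions, not taken from any particular search product.

```python
def prefix_wildcard(vocabulary, prefix):
    """Prefix wildcard: 'law*' matches every word that starts with 'law'."""
    return sorted(w for w in vocabulary if w.startswith(prefix))

def edit_distance(a, b):
    """Levenshtein distance: fewest single-character edits turning a into b."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # delete ca
                current[j - 1] + 1,            # insert cb
                previous[j - 1] + (ca != cb),  # substitute (free if equal)
            ))
        previous = current
    return previous[-1]

def fuzzy_match(vocabulary, word, max_edits=1):
    """Fuzzy search: words within max_edits edits of the query word."""
    return sorted(w for w in vocabulary if edit_distance(w, word) <= max_edits)

# Hypothetical synonym dictionary used to expand a query term
SYNONYMS = {"lawyer": {"attorney", "barrister"}}

def expand(term):
    """Dictionary lookup: the term plus any listed synonyms."""
    return {term} | SYNONYMS.get(term, set())

vocabulary = {"boat", "goat", "boot", "lawyer", "lawsuit", "attorney"}
print(prefix_wildcard(vocabulary, "law"))  # ['lawsuit', 'lawyer']
print(fuzzy_match(vocabulary, "boat"))     # ['boat', 'boot', 'goat']
print(sorted(expand("lawyer")))            # ['attorney', 'barrister', 'lawyer']
```

Each workaround widens the net, but the result is still an unranked set of matches.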
+
Enter the World of Mathematics
Really, computers are nothing but, well, computers to begin with
+
A Document is not a bag of letters or words!
+
Documents have a mathematical structure!
+
A document collection is a whole bunch of points
+
Then we can draw a line to each point and we have a vector
+
The closer the vectors, the more similar the documents!
Term Frequency: the greater the term frequency, the more important the document
Document Frequency: the greater the document frequency, the less important the term
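To make term frequency and document frequency concrete, the sketch below counts both over a tiny invented collection and turns each document into a vector of term counts. The sample texts and names such as `term_vector` are assumptions for illustration only.

```python
from collections import Counter

# Hypothetical three-document collection
docs = [
    "werewolf sighted near the village",
    "the village council met",
    "werewolf werewolf werewolf",
]

tokenized = [text.split() for text in docs]

# Term frequency: how often each word occurs within one document
term_freqs = [Counter(words) for words in tokenized]

# Document frequency: in how many documents each word occurs at least once
doc_freq = Counter(word for words in tokenized for word in set(words))

# A fixed vocabulary order gives every document a point in the same space
vocabulary = sorted(doc_freq)

def term_vector(tf):
    """Represent one document as a list of counts, one slot per vocabulary word."""
    return [tf[word] for word in vocabulary]

vectors = [term_vector(tf) for tf in term_freqs]

print(vocabulary)
print(vectors)
print(term_freqs[2]["werewolf"], doc_freq["werewolf"], doc_freq["the"])  # 3 2 2
```

Every document gets one coordinate per vocabulary word, which is what lets a collection be treated as points in a shared space.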
+
“Term Frequency” x “inverse document frequency” = “tf-idf”
Suppose the word “werewolf” appears in a document 100 times. [We used exponents to “mute” the impact: 100 = 10^2, so the term frequency counts as 2]
Suppose the word “werewolf” appears in 1,000 documents in a 1,000,000-document collection. [Likewise 1,000 = 10^3 and 1,000,000 = 10^6]
Then the tf-idf document score is 2 x (6/3) = 4
The greater the tf-idf score, the more likely the document is important
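The arithmetic on this slide can be reproduced in a few lines. The first function follows the slide's figures as printed (the exponent of the term count times the ratio of exponents, 6/3). The second shows a weighting often seen in textbooks, log10(N/df), purely for comparison; it is a side note, not the presenter's formula. Both function names are illustrative.

```python
import math

def slide_score(term_count, docs_with_term, total_docs):
    """tf-idf as worked on the slide: log10(tf) x (log10(N) / log10(df))."""
    tf_weight = math.log10(term_count)  # 100 occurrences -> 2
    idf_weight = math.log10(total_docs) / math.log10(docs_with_term)  # 6 / 3
    return tf_weight * idf_weight

def textbook_score(term_count, docs_with_term, total_docs):
    """A common alternative (an assumption here): log10(tf) x log10(N / df)."""
    return math.log10(term_count) * math.log10(total_docs / docs_with_term)

print(round(slide_score(100, 1000, 1_000_000), 6))     # 4.0
print(round(textbook_score(100, 1000, 1_000_000), 6))  # 6.0
```

Either way the score rises with the count inside the document and falls as the word becomes common across the collection. Note that the ratio form divides by zero when a term appears in exactly one document, since log10(1) = 0.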
+
The smaller the angle between them, the more documents resemble each other.
[Chart: Document Resemblance, showing document vectors a, b, and c with angles of 20 degrees and 60 degrees marked]
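One way to measure the angle between two document vectors is the cosine formula, sketched below. The vectors `a`, `b`, and `c` are invented term counts over a two-word vocabulary and are not read off the chart; `angle_degrees` is an illustrative name.

```python
import math

def angle_degrees(u, v):
    """Angle between two vectors, from the cosine formula u.v / (|u| |v|)."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    cosine = dot / (norm_u * norm_v)
    # Clamp to [-1, 1] so rounding error cannot push acos out of its domain
    return math.degrees(math.acos(max(-1.0, min(1.0, cosine))))

# Hypothetical term-count vectors over a two-word vocabulary
a = [10, 2]
b = [9, 4]
c = [1, 12]

print(round(angle_degrees(a, b), 1))  # small angle: a and b use words alike
print(round(angle_degrees(a, c), 1))  # large angle: a and c differ
```

Because only direction matters, a long document and a short one with the same mix of words sit at an angle of about zero.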
+
Goodbye Boolean, Hello Ranking
Information Retrieval makes classification possible
Information Retrieval makes investigations possible
Information Retrieval finds the critical documents
Information Retrieval is the heart of records management