Unstructured Information, Information Audit / Workflow and Discovery Peter Fox Xinformatics 4400/6400 Week 11, April 15, 2014


Unstructured Information,
Information Audit / Workflow
and Discovery
Peter Fox
Xinformatics 4400/6400
Week 11, April 15, 2014
1
Contents
• Information Audit
• Unstructured Information
2
Businessdictionary.com
• Analysis and evaluation of a
firm's information system
(whether manual or
computerized) to detect and
rectify blockages,
duplication, and leakage of
information.
3
Objective?
• The objectives of this audit
are to improve accuracy,
relevance, security, and
timeliness of the recorded
information.
4
What is an information audit?
• An information audit is a process that
effectively determines the current
information environment within an
organization by identifying and mapping:
– What information is currently available?
– Where the information lives?
5
Results/ format (e.g.)
• The results of an information audit are
twofold: there is a detailed report which
includes:
– What information do staff acquire? Where
from? At what cost? How is it used?
– What information do staff create? What
happens to it? Where does it go?
6
Results/ format (e.g.)
– What information is stored and why? What
purpose will it serve?
– What information is passed on or
delivered? To whom? For what purpose? In
what form?
7
Results/ format (e.g.)
– Is there a gap, or a match,
between that which is available
and that which is needed?
– What are the skills and
responsibilities of the people
who carry out these tasks?
– What equipment and tools do
they have available (hardware,
software, filing cabinets, web
sites, etc)?
8
Results/ format (e.g.)
– Are there any control documents, such as policy
statements, guidelines, service level agreements,
procedures, manuals?
– Is any of the information (produced, acquired, processed,
re-delivered, or stored) superfluous to needs?
– Are any of the information-handling activities nonproductive?
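The audit questions above can be captured as one record per information item. The following is a minimal sketch in Python, not a standard audit schema: the class `InfoItem`, its field names, and the sample inventory are all hypothetical and exist only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class InfoItem:
    """One row of a hypothetical information inventory."""
    name: str
    source: str        # where it is acquired or created
    stored_at: str     # where it lives
    used_by: set = field(default_factory=set)     # who uses it today
    needed_by: set = field(default_factory=set)   # who says they need it


def superfluous(items):
    """Names of items that nobody uses and nobody needs."""
    return [i.name for i in items if not i.used_by and not i.needed_by]


def gaps(items):
    """For each item, the people who need it but do not yet use it."""
    return {i.name: sorted(i.needed_by - i.used_by)
            for i in items if i.needed_by - i.used_by}


# Illustrative data only.
inventory = [
    InfoItem("monthly sales report", "finance", "shared drive",
             used_by={"sales"}, needed_by={"sales", "marketing"}),
    InfoItem("old vendor list", "purchasing", "filing cabinet"),
]

print(superfluous(inventory))
print(gaps(inventory))
```

Keeping "used by" and "needed by" as separate sets is what lets the same inventory answer both the gap question and the superfluous-information question.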
9
Results/ format (e.g.)
• There is also a detailed flow chart:
– A visual map that shows the areas, processes,
functions and activities through which information
passes, clarifying gaps or fault-lines that need to
be plugged or bottlenecks and overflows that
need to be unblocked
• Sound familiar?
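The flow chart described above can be treated as a directed graph. The sketch below is one possible way to flag candidate bottlenecks and dead ends, assuming the audit recorded flows as hypothetical (from, to) pairs; the node names and the threshold of 3 are illustrative choices, not part of any audit method.

```python
from collections import defaultdict

# Hypothetical flows recorded during an audit: (from, to) pairs.
flows = [
    ("field staff", "data entry"),
    ("web form", "data entry"),
    ("email", "data entry"),
    ("data entry", "database"),
    ("database", "monthly report"),
    ("survey archive", "filing cabinet"),
]


def degree_tables(edges):
    """Count incoming and outgoing flows for every node."""
    incoming, outgoing = defaultdict(int), defaultdict(int)
    for src, dst in edges:
        outgoing[src] += 1
        incoming[dst] += 1
    nodes = {n for edge in edges for n in edge}
    return nodes, incoming, outgoing


def bottlenecks(edges, threshold=3):
    """Nodes fed by at least `threshold` flows: candidate bottlenecks."""
    nodes, incoming, _ = degree_tables(edges)
    return sorted(n for n in nodes if incoming[n] >= threshold)


def dead_ends(edges):
    """Nodes that receive information but pass nothing on."""
    nodes, incoming, outgoing = degree_tables(edges)
    return sorted(n for n in nodes if incoming[n] > 0 and outgoing[n] == 0)


print(bottlenecks(flows))
print(dead_ends(flows))
```

A dead end is not necessarily a fault (a report may be the intended end point), so the output is a list of places to look, not a verdict.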
10
How to use?
• An information audit can be used as a
baseline for making major improvements to
the business process of an organization.
• It is extremely helpful in identifying,
buying, and implementing enterprise
systems
– finance systems, portfolio management systems,
document management systems, learning and
knowledge management systems, etc.
11
Remember the use case doc?
Resources
• Data (dataset name)
– Type: Remote, In situ, etc.
– Characteristics: e.g. no cloud cover
– Description: Short description of the dataset, possibly including rationale of the usage characteristics
– Owner: USGS, ESA, etc.
– Source System: Name of the participating system which supports discovery and access
• Model (model name)
– Owner: Organization that offers the model
– Description: Short description of the model
– Consumes: List of data consumed
– Frequency: How often the model runs
– Source System: Name of the participating system which offers access to the model
Event/application
• Event (Event name)
– Owner: Organization that offers the event
– Description: Short description of the event
– Relevant subscription: List of subscriptions (and owners)
– Source System: Name of the participating system which offers this event
• Application/ DSS (Application name)
– Owner: Organization that offers the Application
– Description: Short description of the application
– Source System: Name of the participating system which offers this event
Developed for NASA TIWG
Remember
• It never hurts to know what you have
• Build it into the routine and do not leave it as
an after-thought (yep, just like documenting
your code!)
14
15
Sources and uses of
unstructured information
- audio, video,
graphics, social media
messages, etc. – that
which falls outside the
purview of traditional
databases
16
Data<->Information<->Knowledge
• Where is the structure?
Experience
Data
Creation
Gathering
Information
Presentation
Organization
Knowledge
Integration
Conversation
Context
17
Informatics
• Oh, wait – people structure information!
• Cognitive processes
– Semiotics
– Mental representation
– Intuition
– Expertise
• But not in the same way computers can!
18
19
So what happens?
• If a structured representation of
fundamentally unstructured information is
useless?
– Why would it be?
• What role does visual representation play in
structuring information? Hint:
20
More than 10 years ago…
• Unstructured Information Management Architecture
(UIMA) from IBM
– “Unstructured information management (UIM) applications are software
systems that analyze unstructured information (text, audio, video,
images, and so on) to discover, organize, and deliver relevant
knowledge to the user. In analyzing unstructured information, UIM
applications make use of a variety of analysis technologies, including
statistical and rule-based Natural Language Processing (NLP),
Information Retrieval (IR), machine learning, and ontologies.
– IBM's Unstructured Information Management Architecture (UIMA) is an
architectural and software framework that supports creation, discovery,
composition, and deployment of a broad range of analysis capabilities
and the linking of them to structured information services, such as
databases or search engines.
– The UIMA framework provides a run-time environment in which
developers can plug in and run their UIMA component implementations,
along with other independently-developed components, and with which
they can build and deploy UIM applications.”
21
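The plug-in idea in the quote can be illustrated with a very small pipeline. This is a sketch of the general pattern only: it does not use or imitate the UIMA API, and the component names (`tokenizer`, `year_annotator`), the `run_pipeline` helper, and the dictionary used as a shared document are all assumptions made for this example.

```python
import re


def tokenizer(doc):
    """Add a list of word tokens found in the raw text."""
    doc["tokens"] = re.findall(r"[A-Za-z']+", doc["text"])
    return doc


def year_annotator(doc):
    """Add any four-digit years (19xx or 20xx) found in the raw text."""
    doc["years"] = re.findall(r"\b(?:19|20)\d{2}\b", doc["text"])
    return doc


def run_pipeline(text, components):
    """Pass one shared document through each component in order."""
    doc = {"text": text}
    for component in components:
        doc = component(doc)
    return doc


result = run_pipeline("Xinformatics met on April 15, 2014.",
                      [tokenizer, year_annotator])
print(result["tokens"])
print(result["years"])
```

Each component reads the unstructured text and adds structured annotations beside it, so independently written components can be combined without knowing about each other.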
From way back…
22
23
Data<->Information<->Knowledge
• Future?
Experience
Data
Creation
Gathering
Information
Presentation
Organization
Knowledge
Integration
Conversation
Context
24
Reading for this week
• http://en.wikipedia.org/wiki/Information_audit
• http://www.librijournal.org/pdf/2003-1pp2338.pdf
• UIMA http://www.ibm.com/developerworks/data/downloads/uima/
• SPAR http://tw.rpi.edu/web/inside/ideas/SPAREvaluation
25
Logical Collections
• The primary goal of a management system is to
abstract the physical collection into logical
collections. The resulting view is a uniform
homogeneous collection.
• Note the analogy with logical models and
information integration: so EARLY ON
– Identifying naming conventions and organization
– Aligning cataloguing and naming to facilitate
search, access, use (who uses?)
– Provision of **contextual** information
26
Physical Handling
• Map between physical and logical.
• Where and who does it come from?
– Is there a transfer into a physical form?
– Is it backed-up, archived, cached? …
– What formats?
– Naming conventions – do they change?
• Note analogy to physical models
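The map between physical and logical can be as simple as a lookup table. Below is a minimal sketch, assuming a hypothetical in-memory `catalog` whose logical names, file paths, formats, and roles are invented for illustration; a real management system would keep this in a database or registry.

```python
# Hypothetical catalog: logical name -> known physical copies.
catalog = {
    "sst/2014/april": [
        {"location": "/archive/tape07/a91f.nc", "format": "netCDF",
         "role": "archive"},
        {"location": "/cache/sst_201404.nc", "format": "netCDF",
         "role": "cache"},
    ],
}


def resolve(logical_name, prefer=("cache", "archive")):
    """Return the location of the most preferred physical copy."""
    copies = catalog.get(logical_name, [])
    for role in prefer:
        for copy in copies:
            if copy["role"] == role:
                return copy["location"]
    raise KeyError(logical_name)


print(resolve("sst/2014/april"))
print(resolve("sst/2014/april", prefer=("archive",)))
```

Users only ever see the logical name, so physical file names and storage locations can change without breaking anything that refers to the collection.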
27
Interoperability Support
28
Security
• Access authorization and change verification. This
is the basis of trusting your information.
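Both halves of this slide can be sketched in a few lines. The example below assumes a hypothetical permission table (`PERMISSIONS`) and uses a SHA-256 digest from Python's standard `hashlib` module as one common way to verify that content has not changed; the user names and sample data are illustrative.

```python
import hashlib

# Hypothetical access-control list: user -> allowed actions.
PERMISSIONS = {"alice": {"read", "write"}, "bob": {"read"}}


def is_authorized(user, action):
    """Access authorization: is this user allowed to do this action?"""
    return action in PERMISSIONS.get(user, set())


def fingerprint(data: bytes) -> str:
    """SHA-256 digest of the content, recorded when it is stored."""
    return hashlib.sha256(data).hexdigest()


def is_unchanged(data: bytes, recorded: str) -> bool:
    """Change verification: does the content still match its digest?"""
    return fingerprint(data) == recorded


original = b"station,temp\nA1,14.2\n"
recorded = fingerprint(original)

print(is_authorized("bob", "write"))
print(is_unchanged(original, recorded))
print(is_unchanged(b"station,temp\nA1,41.2\n", recorded))
```

The digest only detects change; it says nothing about who made it, which is why it is paired with authorization.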
29
Ownership
• Who is responsible for quality and meaning?
30
Metadata
• Recall metadata are data about data.
• Metainformation?
31
Persistence
• Deployment of mechanisms to counteract
technology obsolescence.
32
Discovery
• Ability to identify useful relations and
information inside the collection
• More on this later in this class
33
Dissemination
• Mechanisms to make interested parties aware
of changes and additions to the collections.
• Do you rely on information retrieval? The Web?
34
Summary of Information Management
• Creation of logical collections
• Physical handling
• Interoperability support
• Security support
• Ownership
• Metadata collection, management and
access.
• Persistence
• Knowledge and information discovery
• Dissemination and publication
35
Note for your project writeup!
• Information management! Cover the 9 areas.
36
Information Workflow
• What is a workflow?
• Why would you use it?
• Key considerations for
information, cf. data
• Some pointers to
workflow systems
37
What is a workflow?
• General definition: “series of tasks performed
to produce a final outcome” (taxes?)
• Information workflow – involves people, but
we potentially want to
– Automate jobs that a person traditionally
performed manually
– Process large volumes of information faster than
one could do by hand
• NB difference from data workflows – it
reaches out to encompass the user (e.g.
‘unrecorded actions’)
38
Background: Business Workflows
• Example: planning a trip
• Need to perform a series of tasks: book a
flight, reserve a hotel room, arrange for a
rental car, etc.
• Each task may depend on outcome of
previous task
– Days you reserve the hotel depend on days of
the flight
– If hotel has shuttle service, may not need to rent
a car
• Prior information, experience, preferences…
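The trip example is a small dependency graph. The sketch below orders the tasks with `graphlib.TopologicalSorter` from the Python standard library (Python 3.9 or later); the task names and the `plan_trip` helper are hypothetical and only mirror the bullets above.

```python
from graphlib import TopologicalSorter

# Each task maps to the set of tasks it depends on.
dependencies = {
    "book flight": set(),
    "reserve hotel": {"book flight"},
    "decide on rental car": {"reserve hotel"},
}

# A valid execution order that respects every dependency.
order = list(TopologicalSorter(dependencies).static_order())


def plan_trip(hotel_has_shuttle):
    """Walk the tasks in order; the car decision depends on the hotel."""
    steps = []
    for task in order:
        if task == "decide on rental car":
            steps.append("skip rental car" if hotel_has_shuttle
                         else "rent car")
        else:
            steps.append(task)
    return steps


print(order)
print(plan_trip(hotel_has_shuttle=True))
```

The ordering captures "each task may depend on the outcome of a previous task"; the shuttle check shows an outcome changing what a later task does.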
39
Tripit.com?
40
What about information workflows?
• Perform a set of transformations/
operations on information source(s)
• Examples
– Generating images from raw data
– Identifying areas of interest from a large
information source (e.g. word cloud)
– Classifying a set of objects
– Querying a web service for more information
on a set of objects
– Many others…
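As one illustration of "a set of transformations on an information source", the sketch below chains three small steps to find the most frequent terms in a text, which is the counting behind a word cloud. The step names, the tiny `STOPWORDS` set, and the sample sentence are assumptions made for this example.

```python
import re
from collections import Counter

# Deliberately tiny stopword list, for illustration only.
STOPWORDS = {"the", "a", "of", "and", "to", "in", "is"}


def tokenize(text):
    """Split lower-cased text into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def drop_stopwords(tokens):
    """Remove very common words that carry little meaning."""
    return [t for t in tokens if t not in STOPWORDS]


def top_terms(tokens, n=3):
    """Return the n most frequent terms with their counts."""
    return Counter(tokens).most_common(n)


def run(source, steps):
    """Feed the output of each step into the next one."""
    value = source
    for step in steps:
        value = step(value)
    return value


text = ("The audit of information flow: information is acquired, "
        "information is stored.")
result = run(text, [tokenize, drop_stopwords, top_terms])
print(result)
```

Because each step is a separate function, a step can be swapped or reused in another workflow without touching the others.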
41
More on Workflows
• Can process many information types:
– Archives
– Web pages
– Streaming/ real time
– Images
– Semiotic systems
• Robust workflows depend on formal
(conceptual and logical) models of the flow of
information among components
• May be simple and linear or very complex
42
Challenges
• Questions:
– What are some challenges for users in
implementing workflows?
– What are some challenges to executing these
workflows?
– What are limitations of writing a program?
• Mastering a programming language
• Visualizing workflow
• Sharing/exchanging workflow
• Formatting issues
• Locating datasets, services, or functions
43
Workflow Management Systems
44
Benefits of Workflows
• Documentation of aspects
of analysis
• Visual communication of
analytical steps
• Ease of testing/debugging
• Reproducibility
• Reuse of part or all of
workflow in a different
project
45
Additional Benefits
• Integration of and between multiple
computing environments
• ‘Automated’ access to distributed resources
via other architectural components, e.g. web
services and Grid technologies
• System functionality to assist
with information integration of
heterogeneous components and
sources
46
Why not just use a script?
• Script does not specify
low-level task scheduling
and communication
• May be platform-dependent
• Can’t be easily reused
• May not have sufficient
documentation to be
adapted for another
purpose
47
Why can a GUI be useful?
• No need to learn a programming language
• Visual representation of what workflow does
• Allows you to monitor workflow execution
• Enables user interaction (though not
necessarily collaboration)
• Facilitates sharing of workflows
48
Some workflow systems
• Kepler
• SCIRun
• Sciflo
• Triana
• Taverna
• Pegasus
• Some commercial tools:
– Windows Workflow Foundation
– Mac OS X Automator
• http://www.isi.edu/~gil/AAAI08TutorialSlides/5-Survey.pdf
• http://www.isi.edu/~gil/AAAI08TutorialSlides/
• See reading for this week
49
Discovery
• How does someone find your information?
• How would you provide discovery of
– collections
– files
– ‘bits’
• How would you find ->
50
Discovery
o Search (Federated Search)
o Helped by
o Folksonomies (user contributed)
o Intelligent Agents
o Search Engines
o Taxonomies
o Find photos of Kim
o Boy or girl?
51
Use cases
• Find a sound recording of a swallow.
• Excuse me?
52
Use cases
• Find a sound recording of an African Swallow
• Find a sound recording of a bird that sounds
like an African Swallow
• Media types – how can you discover them?
53
Use cases
• Find the movie that Jeanne Tripplehorn first
starred in/ that was her most successful/ in which
she was lead actress?
• Has anyone gene sequenced a mouse?
• Find images of primary productivity in the
North Atlantic
• Discovery can often involve information
integration (or is it *almost always*?)
54
Three level ‘metadata’ solution for DATA
(Data Discovery, Data Integration)
• Level 1: Data Registration at the Discovery Level, e.g. Volcano location and activity
• Level 2: Data Registration at the Inventory Level, e.g. list of datasets, times, products
• Level 3: Data Registration at the Item Detail Level, e.g. access to individual quantities
• Ontology based Data Integration using scientific workflows
• Earth Sciences Virtual Database: a Data Warehouse where the schema heterogeneity problem is solved; schema based integration
A.K.Sinha, Virginia Tech, 2006
55
Three level ‘metadata’ solution?
(Information Discovery, Information Integration)
• Level 1: Registration at the Discovery Level, e.g. find the upper level entry point to a source
• Level 2: Registration at the Inventory Level, e.g. list of datasets, using the logical organization
• Level 3: Registration at the Item Detail Level, i.e. annotation, e.g. tagging
• Integration using mapping management
• Catalog/ Index
• Schema based integration
A.K.Sinha, Virginia Tech, 2006
56
Information discovery
• What makes discovery work?
– Metadata
– Logical organization
– Attention to the fact that someone would want to
discover it
– It turns out that file types are a key enabler or
inhibitor to discovery
– Result ranking using *tuned* algorithm
• What does not work?
– Result ranking algorithms that depend on
unconventional information types (icon, index,
symbol)
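One simple reading of "tuned" ranking is a weighted match over metadata fields. The sketch below is an illustration under that assumption, not a description of any particular search engine; the `WEIGHTS` values, the field names, and the sample records are hypothetical.

```python
# Hypothetical tuning: a match in the title counts more than one in
# the keywords, which counts more than one in the description.
WEIGHTS = {"title": 3.0, "keywords": 2.0, "description": 1.0}

records = [
    {"title": "Field notes", "keywords": "swallow",
     "description": "Notes on nesting"},
    {"title": "African swallow recording", "keywords": "audio bird",
     "description": "Field recording"},
    {"title": "Volcano log", "keywords": "volcano",
     "description": "Activity by date"},
]


def score(record, term, weights=WEIGHTS):
    """Sum the weights of every metadata field that contains the term."""
    term = term.lower()
    return sum(w for f, w in weights.items()
               if term in record.get(f, "").lower())


def rank(items, term, weights=WEIGHTS):
    """Titles of matching records, best score first."""
    scored = [(score(r, term, weights), r["title"]) for r in items]
    scored.sort(key=lambda pair: -pair[0])
    return [title for s, title in scored if s > 0]


print(rank(records, "swallow"))
```

Ranking of this kind only works when the metadata fields are filled in, which ties back to the first items in the list above.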
57
Federated search
• “is the simultaneous search of multiple online
databases or web resources and is an emerging
feature of automated, web-based library and
information retrieval systems. It is also often
referred to as a portal or a federated search
engine.” (Wikipedia)
• Libraries have been doing this for a long time
(Z39.50, ISO 23950)
• Key is consistent search metadata fields (keywords)
• E.g. Geospatial One Stop http://www.geodata.gov
58
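The idea of federated search, one query sent to several sources at once and the results merged, can be sketched as follows. This assumes two hypothetical in-memory sources that share the same `keywords` metadata field; it does not implement Z39.50 or any real library protocol.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical sources; every record uses the same `keywords` field.
LIBRARY = [
    {"title": "Bird songs of Africa", "keywords": {"swallow", "audio"}},
]
ARCHIVE = [
    {"title": "Swallow migration maps", "keywords": {"swallow", "map"}},
    {"title": "Volcano activity log", "keywords": {"volcano"}},
]


def search_source(name, source_records, keyword):
    """Search one source and label each hit with the source name."""
    return [(name, r["title"]) for r in source_records
            if keyword in r["keywords"]]


def federated_search(keyword, sources):
    """Query all sources concurrently and merge the hits."""
    with ThreadPoolExecutor() as pool:
        futures = [pool.submit(search_source, name, recs, keyword)
                   for name, recs in sources.items()]
        hits = [hit for f in futures for hit in f.result()]
    return sorted(hits)


sources = {"library": LIBRARY, "archive": ARCHIVE}
print(federated_search("swallow", sources))
```

The merge is trivial here only because both sources use the same field name; when they do not, a mapping step is needed before results can be combined.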
Smart search
• Semantically aware search, e.g.
http://noesis.itsc.uah.edu ,
http://eie.cos.gmu.edu (Water -> Semantic
Search)
• Faceted search, e.g. mspace
(http://mspace.fm ), exhibit (MIT), S2S (RPI;
http://aquarius.tw.rpi.edu/s2s )
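Faceted search rests on two operations: counting the values of a facet and narrowing the collection by the values a user selects. The sketch below shows both, assuming hypothetical records with `media` and `region` facets; it is not how mspace, Exhibit, or S2S are implemented.

```python
from collections import Counter

# Illustrative collection with two facets: media and region.
items = [
    {"title": "Swallow call", "media": "audio", "region": "Africa"},
    {"title": "Swallow in flight", "media": "image", "region": "Europe"},
    {"title": "Weaver call", "media": "audio", "region": "Africa"},
]


def facet_counts(collection, facet):
    """How many records carry each value of the given facet."""
    return Counter(r[facet] for r in collection)


def refine(collection, **selected):
    """Keep only records that match every selected facet value."""
    return [r for r in collection
            if all(r[k] == v for k, v in selected.items())]


print(facet_counts(items, "media"))
audio_africa = refine(items, media="audio", region="Africa")
print([r["title"] for r in audio_africa])
```

The counts are what a faceted interface shows next to each choice, and `refine` is what happens when the user clicks one.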
59
NOESIS
60
Faceted search
61
logd.tw.rpi.edu
Summary - discovery
• Useful to write a few discovery use cases to
drive how your design is developed
• Evolution of your role in facilitating discovery
and what/ how others implement access to
your information
62
Reading for this week
• Is retrospective
63
Check in for Project Assignment
• Analysis of existing information system
content and architecture, critique, redesign
and prototype redeployment
• Or a new use case, development, etc.
64
What is next
• Today – project group meetings/ check in
• April 22 – Information Quality, Uncertainty and
Bias
• April 29 – course summary (written part of
group project due)
• May 6 – final project presentations (BE ON
TIME, i.e. 5-10 mins BEFORE 9AM)
– Be prepared to be asked (and answer) questions
65