MURI — Info. Management Group
Group Co-Leaders:
Jiawei Han (UIUC), Chris Clifton (Purdue), Hillol Kargupta (UMBC)
Core Contributors:
Latifur Khan (UT Dallas), Chengxiang Zhai (UIUC)
Collaborators:
Murat Kantarcioglu (UT Dallas), Shouhuai Xu (UT San Antonio), Ninghui Li (Purdue)
Liaisons:
Ravi Sandhu (UT San Antonio), Anupam Joshi (UMBC)
April 26, 2020
Core Contributors & Current Ph.D. Students
Jiawei Han (UIUC)
Lu An Tang, Zhijun Yin
Chengxiang Zhai (UIUC)
Yuanhua Lv, Hyun Duk Kim
Hillol Kargupta (UMBC)
Kamalika Das
Latifur Khan (UTD)
Mehedy Masud
Chris Clifton (Purdue)
Mummoorthy Murugesan
General Project Goals
Provide information management and analysis support for the project
Major research themes
• Knowledge discovery
• Data integration and fusion
• Measuring and maintaining information quality
• Provenance tracking
• Confidentiality in information management and analysis
Posters Reported in the Kick-Off Meeting
Plausibly Deniable Search
Mummoorthy Murugesan and Chris Clifton
Conforming to Truth with Multiple Conflicting Information Providers on the Web
Jiawei Han, Xiaoxin Yin, and Philip S. Yu
Privacy-preserving Data Mining within Anonymous Credential Systems
Shouhuai Xu
User-Centered Adaptive Information Retrieval
Xuehua Shen, Bin Tan, and ChengXiang Zhai
Privacy Preserving Distributed Data Mining: A Game-Theoretic Approach
Kamalika Das and Hillol Kargupta
Novel Class Detection in Concept-Drifting Data Streams in a Shared Environment
Mohammad M. Masud, Jing Gao, Latifur Khan, Jiawei Han, and Bhavani Thuraisingham
On-Going Research Projects
Novel Class Detection in Concept-Drifting Data Streams in a Shared Environment
Mohammad M. Masud, Jing Gao, Latifur Khan, Jiawei Han, and Bhavani Thuraisingham (UTD/UIUC)
Confidentiality Preserving Data Cubes
Jiawei Han, Lu An Tang and Bolin Ding (UIUC)
Scalable Distributed Privacy-Preserving Local Algorithms for Large Peer-to-Peer Data Mining: A Game Theoretic Approach
Hillol Kargupta and Kamalika Das (UMBC)
Confidential Peer-to-Peer Extension to Personalized Search
Chengxiang Zhai, Chris Clifton, and Mummoorthy Murugesan (UIUC/Purdue)
Information quality: Understanding and identifying provenance
ChengXiang Zhai and Jiawei Han (UIUC)
SPDU: A Secure Provenance Management Framework
Shouhuai Xu and Ravi Sandhu (UTSA)
Discovery in Data Streams for Security Protection
Novel Class Detection in Concept-Drifting Data Streams in a Shared Environment
• Novelty/anomaly detection: a major issue in many applications, especially in a streaming environment
• Goal: Detect new classes in data streams
• Approach: Efficiently handle the novel class detection task in the presence of concept drift and multiple classes
• The approach is non-parametric: it does not assume any underlying distribution of the data
• Comparisons with state-of-the-art stream classification techniques show the superiority of our approach
• The technique can be extended to a distributed environment with multiple sources
Confidentiality-Preserving Data Cubes
Confidentiality-/privacy-/sensitivity-preserving data cubes
• Researchers have been studying confidentiality-preserving database systems (for query processing) and confidentiality-preserving data mining systems
• We propose to investigate confidentiality-preserving data cubes for multidimensional analysis of data warehouses
• Goal: Work out mechanisms that let users access maximal information in data cubes for information understanding while disclosing minimal private information, even under different combinations of OLAP queries
• Extensions: How knowledge discovery can help preserve confidentiality
Data and Information Integration for Security Protection
• Data fusion: Merge/integrate the same objects with different names or identities
• Data distinction: Distinguish different objects with identical names
• Information integration by information network analysis
• Veracity analysis to conform to truth when multiple websites or other information providers supply conflicting information
• Correlation analysis to reduce redundancy and control information disclosure
  – E.g., medical records, patients, medical treatments
Data and Information Access and Management for Security Protection
• Data separation vs. data integration and their role in sensitive information disclosure and correlation discovery
• Privacy-aware indexing to support fast/efficient data access
• Sensitivity-aware query processing and data publishing
• Any other data/information management and analysis issues needed from other groups in the project
Scalable Distributed Local Algorithms for Peer-to-Peer Knowledge Discovery from Sensitive Data
Hillol Kargupta
University of Maryland, Baltimore County
www.cs.umbc.edu/~hillol
www.agnik.com
Acknowledgement: Chengxiang Zhai, Kamalika Das, Kanishka Bhaduri, Kun Liu
Scalable Privacy-Preserving Information Assurance
Challenges in Scalable Knowledge Discovery
• Scaling in large asynchronous distributed environments
• Confidentiality/privacy-preserving data analysis
• Heterogeneous policies and strategies
Applications
• Distributed collaboration
• Distributed search and information retrieval
Motivation: Secure Multi-Party Sum Computation
• Each party has a number
• Compute the sum without divulging the numbers
• Consider a sequence of secure sum operations
Parties 1, 2, and 3 hold the values $v_1$, $v_2$, $v_3$; $R$ is uniformly distributed in $[0, N-1]$:
$z_1 = (R + v_1) \bmod N$
$z_2 = (z_1 + v_2) \bmod N$
$z_3 = (z_2 + v_3) \bmod N$
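The chain of masked partial sums can be sketched in a few lines of Python. This is a minimal single-process simulation, not a networked protocol: the function name `secure_sum` is hypothetical, and the final step in which party 1 subtracts R is the standard way to close the ring but is not shown on the slide. The sketch also assumes the true sum lies in [0, N-1], since otherwise the modular result would not equal the sum.

```python
import secrets


def secure_sum(values, modulus):
    """Simulate ring-based secure sum for the parties' private values.

    Party 1 masks its value with a random R; every later party adds its
    own value mod N; party 1 finally removes R (assumed closing step).
    """
    if not values:
        raise ValueError("need at least one party")
    if any(v < 0 for v in values) or sum(values) >= modulus:
        raise ValueError("true sum must lie in [0, modulus - 1]")

    r = secrets.randbelow(modulus)      # R uniform in [0, N-1], known only to party 1
    z = (r + values[0]) % modulus       # z_1 = (R + v_1) mod N
    for v in values[1:]:
        z = (z + v) % modulus           # z_k = (z_{k-1} + v_k) mod N
    return (z - r) % modulus            # party 1 subtracts R to recover the sum


if __name__ == "__main__":
    print(secure_sum([5, 7, 11], modulus=100))
```

Because R is uniform, each intermediate z_k seen by a single party is itself uniformly distributed and reveals nothing about the earlier values on its own; two colluding neighbors of a party could still recover that party's value, which is one reason to study sequences of such operations.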
Locality Sensitive Distributed Algorithms
• Global algorithms: Communicate with the entire network
  – Every node needs to maintain information about the entire network
  – Maintaining this information is resource intensive for large networks
• Local algorithms: Communicate only with the local neighborhood
Bounded communication local algorithms
Distributed Sum Computation: A Local Approach
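• Each node $i$ has a number $x_i[0]$
• Compute the sum by repeated local updates, where $\Gamma_i$ denotes the neighbors of node $i$:
$x_i[t] = x_i[t-1] + \sum_{j \in \Gamma_i} \left( x_j[t-1] - x_i[t-1] \right)$
• Asymptotically converges to the global sum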
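A small simulation may help make the local update concrete. In the sketch below each node repeatedly moves toward its neighbors' values, x_i[t] = x_i[t-1] + rho * sum over neighbors j of (x_j[t-1] - x_i[t-1]). Two things are assumptions rather than content of the slide: the step size `rho` (chosen below 1 / max degree so the iteration is stable) and the final scaling by the node count n. An update of this form preserves the total and drives every node toward the network average, so a node that knows n can estimate the sum as n times its value. The function names are hypothetical.

```python
def local_average_step(x, neighbors, rho):
    """One synchronous round of the neighbor-difference update.

    x         -- list of current node values
    neighbors -- neighbors[i] is the list of node indices adjacent to i
    rho       -- step size (assumed parameter, not from the slide)
    """
    return [
        xi + rho * sum(x[j] - xi for j in neighbors[i])
        for i, xi in enumerate(x)
    ]


def estimate_sum(values, neighbors, rho, rounds):
    """Run the local update and return each node's estimate of the global sum."""
    x = [float(v) for v in values]
    for _ in range(rounds):
        x = local_average_step(x, neighbors, rho)
    n = len(values)                      # assumes every node knows n
    return [n * xi for xi in x]


if __name__ == "__main__":
    ring = [[1, 3], [0, 2], [1, 3], [2, 0]]   # 4 nodes arranged in a ring
    print(estimate_sum([1, 2, 3, 4], ring, rho=0.25, rounds=50))
```

Each node talks only to its immediate neighbors in every round, which is the bounded-communication property that distinguishes local algorithms from global ones.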
Optimization, Games, and Privacy Preserving Knowledge Discovery
• Multi-party privacy preservation as an optimization problem
• Multi-party, multi-objective optimization
• Blending game theory and mechanism design
• Asynchronous algorithms for achieving equilibrium states
Privacy/Confidentiality Preservation: An Optimization Perspective
• Multi-objective optimization perspective
  – Policies
  – Strategies
  – Performance
• Distributed games for optimizing utility functions
Summary of the Approach
Local Asynchronous Distributed Knowledge Discovery Algorithms that preserve Privacy/Confidentiality
Distributed Search and Information Retrieval Algorithms
Multi-party Optimization Perspective of Privacy/Confidentiality Preservation and Design of Distributed Game Theoretic Mechanisms
Example: Cross-Domain Network Threat Detection
Correlating threats from different network domains
Copyright, Agnik
Motivation: P2P Search Engine
What is the most visited news page in the network today?
Has anybody found a cheap store to buy a digital camera?
What is the best search-key to search for “Child Care”?
Useful Browser Data
• Web-browser history
• Browser cache
• Click-stream data stored at the browser (browsing pattern)
• Search queries typed into the search engine
• User profile
• Bookmarks
Challenges
• Indexing, clustering, and data analysis in a decentralized, asynchronous manner
• Scalability
• Privacy
User-Centered Adaptive Information Retrieval
[Figure: user-centered adaptive IR architecture. Labels: Web, search engines, viewed Web pages, query history, desktop files, email, personalized search agent; example query “java”.]
User-Centered Adaptive IR
• A novel retrieval strategy emphasizing
  – user modeling (“user-centered”)
  – search context modeling (“adaptive”)
  – interactive retrieval
• Implemented as a personalized search agent that
  – sits on the client side (owned by the user)
  – integrates information around a user (1 user vs. N sources as opposed to 1 source vs. N users)
  – collaborates with other agents
  – goes beyond search toward task support
Reranking of Search Results with UCAIR Toolbar
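As a rough illustration of client-side reranking, the sketch below reorders result snippets by their cosine similarity to a term profile built from the user's earlier queries. It is a deliberately simple stand-in, not the UCAIR toolbar's actual scoring method, which these slides do not describe; the function names and the bag-of-words profile are assumptions made for the example.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two term-count vectors (Counters)."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rerank(results, history):
    """Reorder result snippets by similarity to the user's query history.

    Ties keep the search engine's original order.
    """
    profile = Counter(w for q in history for w in q.lower().split())
    scored = [
        (cosine(Counter(r.lower().split()), profile), -i, r)
        for i, r in enumerate(results)
    ]
    return [r for _, _, r in sorted(scored, reverse=True)]


if __name__ == "__main__":
    history = ["python programming tutorial", "compile java code"]
    results = ["java island travel guide", "java programming language tutorial"]
    print(rerank(results, history))
```

For the ambiguous query “java”, a user whose history is about programming sees the programming result promoted, and the history never has to leave the client.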
Research Agenda
Develop a scalable methodology for Knowledge Discovery from Multi-Party Data
Design local asynchronous algorithms with bounded communication
Multi-objective Distributed Optimization, Mechanism Design, and Local Algorithms
Designing the Next Generation of Privacy Preserving Distributed Knowledge Discovery Algorithms
Research Agenda
Privacy-preserving user modeling:
How can we model a user’s information need while preserving privacy?
How can we aggregate user models and information needs to control privacy?
P2P information recommendation
• P2P architecture: flexible information sharing
• What is the right protocol for information recommendation?
• How can collaborative filtering algorithms be extended to protect user privacy?
Collaborative Search
How can we match information needs with information content at different levels of representation?
From Collaborative Query/Filtering to Information Push
Chengxiang Zhai and Chris Clifton (UIUC/Purdue)
Personalized search: profile of information needs
• Profile based on prior search, without requiring explicit definition of the profile
• Assist information sources in identifying the need to share
Challenge: profile / search may be sensitive
• May not be able to reveal it to the information source (unless they have the needed information?)
Research thrusts:
• Turning personalized search into profiles
• Matching information to profiles without disclosing either
SPDU: A Secure Provenance Management Framework
Shouhuai Xu and Ravi Sandhu (UTSA)
• Security of provenance management is critical to many applications, including assured information sharing
• The state of the art is that we know little about the security aspects of provenance management
• We propose investigating a comprehensive framework for secure provenance management, as well as supporting architectures and mechanisms for realizing the framework
SPDU
Shouhuai Xu and Ravi Sandhu
• A comprehensive framework for securing provenance and the corresponding information
  – We cannot talk about provenance without touching on what the provenance is for (i.e., both data and their provenance are targets of protection)
• Supporting architectures and mechanisms for realizing the framework
SPDU framework
• The above challenges call for a novel framework for secure provenance management.
• We propose an SPDU framework for this purpose.
  – S stands for Source trustworthiness management
  – P stands for Processing trustworthiness management
  – D stands for Dissemination management
  – U stands for Usage management
• SPDU is application-neutral: it allows plug-and-play application-specific modules (e.g., semantic similarity between two documents)
• SPDU covers the whole lifecycle of information sharing
[Figure: lifecycle diagram with stages Source, Processing (recursive), Dissemination, Usage; label: Information trustworthiness management]
Eight facets of SPDU
[Figure: secure provenance management at the center, surrounded by eight facets]
• Source accountability
• Source privacy
• Processing accountability
• Processing privacy
• Dissemination accountability
• Dissemination privacy
• Usage accountability
• Usage privacy
Information Quality: Understanding and Identifying Provenance
ChengXiang Zhai and Jiawei Han (UIUC)
• Credibility of information, particularly information presumed to be from multiple sources, is a challenging issue
  – Are multiple reports independent confirmation of the same event? Based on a common report? Reports of different events?
• Propose to use data mining techniques to identify similarities/differences in information that is apparently from different sources, to estimate the likelihood that the data comes from a single source or from independent sources, and whether it concerns the same event or multiple events
• Propose to develop novel text mining algorithms to analyze "information genealogy" in large amounts of text data from multiple sources and to summarize contradictory opinions on a topic
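One very simple signal for whether two reports share a common origin is how much of their wording overlaps. The sketch below computes Jaccard similarity over word trigrams ("shingles"): near-identical wording suggests a common report, while low overlap is at least consistent with independent sources. This is only an illustrative baseline with hypothetical function names, not the data mining or text mining techniques proposed here, and any threshold for "likely copied" would have to be chosen empirically.

```python
def shingles(text, k=3):
    """Return the set of k-word shingles in a lower-cased text."""
    words = text.lower().split()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(report_a, report_b, k=3):
    """Jaccard similarity of the two reports' shingle sets, in [0, 1]."""
    sa, sb = shingles(report_a, k), shingles(report_b, k)
    if not sa and not sb:
        return 0.0                       # both texts too short to compare
    return len(sa & sb) / len(sa | sb)


if __name__ == "__main__":
    r1 = "the bridge was closed after the storm on monday"
    r2 = "the bridge was closed after the storm officials said"
    r3 = "officials opened a new school in the city"
    print(jaccard(r1, r2), jaccard(r1, r3))
```

Overlap in wording says nothing about reports that paraphrase a common source, which is where the proposed analysis of "information genealogy" would go beyond such a baseline.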
Summarizing Contradictory Information
Given a set of text articles from different sources with contradictory information, how can we help analysts digest the information?
• Problem 1: Semantic integration of information from multiple sources
• Problem 2: Detection of contradictory information
• Problem 3: Summarization of contradictory information
Techniques to explore:
• text mining with probabilistic models
• information extraction (e.g., entity/relation extraction)
Questions for YOU!
Other data analysis / global statistical model needs?
Data quality? Lifecycle?
What sort of global statistical models would be of interest to Intelligence Analysts?
Models that transcend data silos
Scenarios for testing
Sample/surrogate data to support scenarios
Thanks and Questions