Text mining

Gergely Kótyuk

Laboratory of Cryptography and System Security (CrySyS) Budapest University of Technology and Economics www.crysys.hu

Introduction

Generic model
  – Document preprocessing
  – Text mining methods

Text Mining Tasks

Classification (supervised learning)
  – Binary classification
  – Single-label (multi-class) classification
  – Multi-label classification
  – Multi-level (hierarchical) classification
Clustering (unsupervised learning)
Summarization
  – Extraction: uses only parts of the original text
  – Abstraction: introduces text that is not included in the original text

Solutions

Classification
  – Decision tree
  – Neural network
  – Bayes network
Clustering
  – k-means

Document preprocessing

Goal: represent any text compactly, with a fixed number of parameters
Representation: vector space model

Vector space model

The text is tokenized into words
The words are reduced to their canonical base forms; we refer to these base words as terms
A dictionary is built: the set of terms that occur in the documents
Each document is represented as a vector: the i-th element of the vector is the number of times the i-th term of the dictionary occurs in the document
The collection of documents is represented by the term-document matrix
Problem: the number of dimensions is too large
Solution: feature selection
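The steps above can be sketched in a few lines of Python. This is a minimal illustration, not the method used in the lecture: the function names are hypothetical, and lowercasing stands in for real canonization (a stemmer or lemmatizer would normally be used).

```python
import re
from collections import Counter


def tokenize(text):
    """Split text into word tokens; lowercasing is a crude stand-in for stemming."""
    return re.findall(r"[a-z]+", text.lower())


def build_dictionary(docs):
    """Return the sorted list of distinct terms over all tokenized documents."""
    return sorted({term for doc in docs for term in doc})


def term_document_matrix(texts):
    """Return (dictionary, matrix); rows are terms, columns are documents."""
    docs = [tokenize(t) for t in texts]
    dictionary = build_dictionary(docs)
    counts = [Counter(doc) for doc in docs]
    matrix = [[c[term] for c in counts] for term in dictionary]
    return dictionary, matrix


texts = ["The cat sat on the mat", "The dog sat"]
dictionary, matrix = term_document_matrix(texts)
for term, row in zip(dictionary, matrix):
    print(term, row)
```

Each column of the matrix is the count vector of one document. Even in this toy case most entries are small or zero, which hints at why the dimensionality becomes a problem on real collections.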

Dimension Reduction

Feature selection: find a subset of the original variables
  – Document frequency thresholding
      • Omit words that occur more often than an upper threshold, because these words are not discriminative
      • Omit words that occur less often than a lower threshold, because these words do not carry much information
  – Information gain based feature selection (information theory)
  – Chi-square based feature selection (statistics)
Feature extraction: transform the data into fewer dimensions
  – Latent Semantic Indexing (LSI)
  – Principal Component Analysis (PCA)
  – Nonlinear methods
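As a rough sketch of document frequency thresholding, the snippet below counts in how many documents each term appears and keeps only the terms inside a band. The function names and the threshold values `min_df` and `max_df` are illustrative assumptions; suitable thresholds depend on the collection.

```python
def document_frequency(docs):
    """Map each term to the number of documents that contain it."""
    df = {}
    for doc in docs:
        for term in set(doc):  # count each term once per document
            df[term] = df.get(term, 0) + 1
    return df


def select_terms(docs, min_df, max_df):
    """Keep terms whose document frequency lies in [min_df, max_df]."""
    df = document_frequency(docs)
    return sorted(t for t, n in df.items() if min_df <= n <= max_df)


docs = [
    ["the", "cat", "sat"],
    ["the", "dog", "sat"],
    ["the", "cat", "ran"],
    ["the", "bird", "flew"],
]
selected = select_terms(docs, min_df=2, max_df=3)
print(selected)
```

Here "the" is dropped because it occurs in every document, and the terms that occur only once are dropped as too rare.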

Latent Semantic Indexing (LSI)

SVD is applied to the term-document matrix
The features belonging to the k largest singular values represent the term-document matrix well; these features are used
LSI regards documents that have many words in common as semantically close
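In formulas, the truncation step can be written as follows. The symbols are assumptions introduced here for illustration: A is the term-document matrix (rows are terms, columns are documents), and the singular values are sorted in decreasing order.

```latex
A = U \Sigma V^{\top}, \qquad
\sigma_1 \ge \sigma_2 \ge \dots \ge 0, \qquad
A_k = U_k \Sigma_k V_k^{\top} = \sum_{i=1}^{k} \sigma_i \, u_i v_i^{\top}
```

Under this convention, column j of \Sigma_k V_k^{\top} gives the k-dimensional representation of document j, and the squared singular values \sigma_i^2 are the eigenvalues of A^{\top} A.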

Principal Component Analysis (PCA)

Also called the Karhunen-Loève transform (KLT)
A linear technique
Maps the data to a lower-dimensional space so that the variance of the low-dimensional representation is maximized
The algorithm
  – The covariance matrix of the data is constructed (the correlation matrix, if the variables are standardized)
  – The eigenvectors and eigenvalues of this matrix are calculated
  – The original space is reduced to the space spanned by the eigenvectors that belong to the largest eigenvalues
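The algorithm can be illustrated on a tiny two-dimensional data set. This is only a sketch under stated assumptions: it uses the covariance matrix of mean-centered data (which coincides with the correlation matrix when every variable is standardized to unit variance), it finds just the first principal component, and it uses power iteration as a simple substitute for a full eigendecomposition. The data values and function names are made up for the example.

```python
import math


def mean_center(data):
    """Subtract the per-column mean from every row."""
    n, dims = len(data), len(data[0])
    means = [sum(row[j] for row in data) / n for j in range(dims)]
    return [[row[j] - means[j] for j in range(dims)] for row in data]


def covariance_matrix(centered):
    """Sample covariance matrix of already mean-centered rows."""
    n, dims = len(centered), len(centered[0])
    return [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(dims)]
            for i in range(dims)]


def leading_eigenpair(matrix, iterations=200):
    """Power iteration: approximate the largest eigenvalue and its unit eigenvector."""
    dims = len(matrix)
    v = [1.0 / math.sqrt(dims)] * dims
    for _ in range(iterations):
        w = [sum(matrix[i][j] * v[j] for j in range(dims)) for i in range(dims)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    # Rayleigh quotient v^T M v gives the eigenvalue estimate for unit v.
    eigenvalue = sum(v[i] * sum(matrix[i][j] * v[j] for j in range(dims))
                     for i in range(dims))
    return eigenvalue, v


data = [[1.0, 1.1], [2.0, 1.9], [3.0, 3.2], [4.0, 3.8], [5.0, 5.0]]
centered = mean_center(data)
cov = covariance_matrix(centered)
eigenvalue, component = leading_eigenpair(cov)

# Project every point onto the first principal component (2-D -> 1-D).
projected = [sum(r[j] * component[j] for j in range(2)) for r in centered]
print(eigenvalue, component)
```

Because the points lie close to the line y = x, the first component points roughly along that diagonal, and the one-dimensional projection retains almost all of the variance.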

Kernel PCA

A nonlinear method
PCA + kernel trick
Kernel trick (generally)
  – We map observations from a general set S into a higher-dimensional space V
  – We hope that a general classification problem in S reduces to a linear classification problem in V
  – The trick lets us avoid explicitly computing the mapping of the observations from S to V
      • We use a learning algorithm that needs only the dot product operation in V
      • We use a mapping for which the dot product in V can be calculated by a kernel function K evaluated in S (the original space)
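A small numerical check may make the kernel trick more concrete. The example below assumes S is the plane and uses the degree-2 polynomial kernel K(x, y) = (x · y)^2, one standard choice among many; the explicit map `phi` into a three-dimensional V is written out only to show that the kernel yields the same dot product without ever computing it.

```python
import math


def dot(a, b):
    """Ordinary dot product of two equal-length sequences."""
    return sum(p * q for p, q in zip(a, b))


def phi(x):
    """Explicit map from S = R^2 into V = R^3 for the degree-2 polynomial kernel."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2.0) * x1 * x2, x2 * x2)


def kernel(x, y):
    """K(x, y) = (x . y)^2, computed entirely in the original space S."""
    return dot(x, y) ** 2


x, y = (1.0, 2.0), (3.0, 4.0)
explicit = dot(phi(x), phi(y))  # maps to V first, then takes the dot product
implicit = kernel(x, y)         # never leaves S
print(explicit, implicit)
```

Both values agree (up to floating-point rounding), so any algorithm that touches the data only through dot products, such as PCA expressed in terms of the Gram matrix, can work in V while evaluating only K in S.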

Manifold learning techniques

They minimize a cost function that retains local properties of the data
Methods
  – Locally Linear Embedding (LLE)
  – Hessian LLE
  – Laplacian Eigenmaps
  – Local tangent space alignment (LTSA)
  – Maximum Variance Unfolding (MVU)

Locally Linear Embedding (LLE)
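For orientation, the two optimization steps of LLE in its standard formulation can be written as below. The notation is an assumption introduced here: x_i are the original points, N(i) is the set of nearest neighbors of x_i, W holds the reconstruction weights, and y_i are the low-dimensional coordinates.

```latex
\begin{aligned}
\text{Step 1:}\quad & \min_{W} \sum_{i} \Bigl\| x_i - \sum_{j \in N(i)} W_{ij}\, x_j \Bigr\|^2
  \quad \text{subject to} \quad \sum_{j \in N(i)} W_{ij} = 1 \ \text{for all } i \\
\text{Step 2:}\quad & \min_{Y} \sum_{i} \Bigl\| y_i - \sum_{j \in N(i)} W_{ij}\, y_j \Bigr\|^2
  \quad \text{subject to} \quad \sum_{i} y_i = 0, \quad \frac{1}{n} \sum_{i} y_i y_i^{\top} = I
\end{aligned}
```

The first step describes each point as a weighted combination of its neighbors; the second looks for low-dimensional points that are reconstructed by the same weights, which is how the local structure is retained.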

Maximum Variance Unfolding (MVU)

Instead of defining a fixed kernel, it tries to learn the kernel using semidefinite programming
Exactly preserves all pairwise distances between nearest neighbors
Maximizes the distances between points that are not nearest neighbors
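The semidefinite program behind these three points is commonly stated as follows. The notation is an assumption for illustration: K is the learned kernel (Gram) matrix of the embedded points, and the distance constraints range over pairs (i, j) that are nearest neighbors.

```latex
\begin{aligned}
\max_{K} \quad & \operatorname{tr}(K) \\
\text{subject to} \quad & K \succeq 0, \\
& \sum_{i,j} K_{ij} = 0, \\
& K_{ii} - 2 K_{ij} + K_{jj} = \lVert x_i - x_j \rVert^2 \quad \text{for all neighbor pairs } (i, j)
\end{aligned}
```

With the centering constraint in place, maximizing the trace of K is equivalent to maximizing the sum of squared pairwise distances of the embedded points, which is what "unfolds" the data. The low-dimensional coordinates are then read off from the leading eigenvectors of the learned K, as in kernel PCA.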
