Znalosti 2007 - KP-Lab - Technical University of Košice

Text Mining Services for
Trialogical Learning
Pavel Smrž1, Ján Paralič2, Peter Smatana2, Karol Furdík2
1: Brno University of Technology, FIT, Božetěchova 2, 612 66 Brno,
University of Economics, Prague, W.Churchill Sq.4, 130 67 Praha, Czech Republic,
[email protected]
2: Technical University of Košice, Centre for Information Technologies,
Letná 9, 040 01 Košice, Slovakia
{Jan.Paralic, Peter.Smatana, Karol.Furdik}@tuke.sk
21. - 23. 2. 2007
VŠB - Technická univerzita Ostrava
Contents
KP-Lab project
Trialogical Learning and Activity Theory
Semantic Web Knowledge Middleware
Text Mining Services
• Pre-processing
• Learning Ontologies
• Classification
Future work
Text Mining Services for Trialogical Learning
Pavel Smrž, Ján Paralič, Peter Smatana, Karol Furdík
#2
21. - 23. 2. 2007, VŠB - Technická univerzita Ostrava
KP-Lab Project
Full title: Knowledge Practices Laboratory
www.kp-lab.org
• Integrated EU funded FP6 IST project No. 27490
• Starting date: February 1st, 2006
• Duration: 5 years
• 22 partners from 14 countries
Main goal: creating a learning system aimed at
facilitating innovative practices of sharing, creating and
working with knowledge in education and workplaces.
Trialogical Learning
Challenge - to capture innovative practices of both learning and working
with knowledge, so-called knowledge practices.
Trialogical Learning focuses on the social processes by which learners
collectively enrich/transform their individual and shared cognition.
Activity theory:
• the object-orientedness of human
activity,
• mediation through cultural-historically developed tools of
intelligent activity,
• contradictions emerging between
the elements of activity systems.
Knowledge Artefacts
KA - a central notion of Trialogical Learning
• Mediators of all activities and tasks among learners;
• Capture and preserve the shared knowledge within a community.
Forms:
• Physical resources / tools (documents, SW code, ...);
• Concept maps, taxonomies, ontologies, domain models;
• Plans, scientific theories, languages.
Goal of the KP-Lab project: to provide a platform (tools &
methodology) for the creation and transformation of
KAs in the trialogical manner.
Scientific Challenges
1. Facilitating knowledge-creating learning beyond knowledge
acquisition and social participation
2. Expanding and elaborating the "trialogical" object of educational
activity
3. Eliciting the development of trialogical agencies
4. Facilitating horizontal and vertical boundary crossing
5. Developing tools for deliberate transformation of knowledge practices
6. Specifying design-principles of trialogical technologies
7. Developing methods regarding research on longitudinal transformation
of knowledge practices
8. Creating an open, developing community of trialogical technologies
Semantic Web Knowledge Middleware
SWKM goal - to facilitate knowledge creation processes by
supporting advanced interactions of collaborating
learners with knowledge artefacts, i.e. discovery, access,
evolution, recommendation, and mining.
Generic modules:
• Knowledge Repository - scalable persistent services for large
volumes of knowledge artefacts' descriptions and ontologies;
• Knowledge Mediator - services for handling the main registry,
discovery, and evolution for KP-Lab knowledge artefacts;
• Knowledge Matchmaker - services supporting interactions of KP-Lab
users with knowledge artefacts employing their semantic descriptions.
SWKM Architecture
Features:
• adopts SOA principles;
• built upon the RDFSuite open-source platform;
• data: RDF, accessed by RQL / RUL.
Text Mining in the KP-Lab
Text mining services - intelligent access to and manipulation
of the knowledge artefacts; they assist users in creating or
updating the semantic descriptions of KP-Lab knowledge
artefacts.
TMS fundamental tasks:
• Ontology learning - extraction of conceptual maps (clustering), i.e. an
automatic extraction of significant terms from KA's textual
descriptions and converting them to a structure of concepts and their
relationships.
• Classification of knowledge artefacts - grouping a given set of
artefacts into predefined or ad hoc categories.
Schema of Text Mining Services
Pre-processing
Preprocessing phase - transforming data into the appropriate form. It
consists of several language-dependent NLP steps that provide
annotations of the plain-text resources.
Unified modules:
• tokenization, stemming (or lemmatization, e.g. for Czech and Slovak), elimination of
stop words, POS (part-of-speech) tagging.
Individual modules: (crucial for some methods of ontology learning)
• chunking, WSD (word-sense disambiguation), full syntactic analysis.
GATE (http://www.gate.ac.uk/) - a platform for NLP, provides:
• an architecture, or organisational structure, for NLP software;
• a framework, or class library, which implements the architecture;
• a development environment built on top of the framework.
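The unified pre-processing steps listed above can be illustrated with a minimal sketch. It does not use GATE; the stop-word list and the suffix-stripping rule are hypothetical stand-ins for real language-specific resources and for a proper stemmer or lemmatizer.

```python
import re

# Tiny illustrative stop-word list (an assumption); a real service would
# load a language-specific resource instead.
STOP_WORDS = {"a", "an", "and", "are", "in", "is", "of", "the", "to", "with"}

# Crude suffix stripping, standing in for a real stemmer or lemmatizer.
SUFFIXES = ("ing", "es", "s")


def tokenize(text):
    """Lowercase the text and split it into runs of (Unicode) letters."""
    return re.findall(r"[^\W\d_]+", text.lower())


def strip_suffix(token):
    """Remove the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text):
    """Tokenize, drop stop words, then apply the toy stemmer."""
    return [strip_suffix(t) for t in tokenize(text) if t not in STOP_WORDS]
```

Calling `preprocess("Learners are sharing the documents")` yields `["learner", "shar", "document"]`, which also shows why inflected languages such as Czech or Slovak need lemmatization rather than this kind of naive stemming.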
Ontology Learning (1)
1. Conversion to a plain text format
• Structural information in the source file is used as meta-information in the next steps.
2. Processing by GATE
• Tokenization, sentence boundaries, POS tagging (Brill's tagger), named
entity recognition, Charniak's syntactic analyser.
3. Significant terms (concepts) identification
• A background domain model, created from additional textual resources.
4. Semantic relations identification
• A set of pre-defined (or automatically identified) patterns and co-occurrence statistics are used.
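Steps 3 and 4 can be sketched as follows, under stated assumptions: term significance is approximated by the ratio of a term's relative frequency in the artefact text to its smoothed relative frequency in the background corpus, and relation candidates are approximated by sentence-level co-occurrence counts. The function names, the ratio measure, and the `min_ratio` threshold are illustrative choices, not the measures used in KP-Lab.

```python
from collections import Counter
from itertools import combinations


def significant_terms(domain_tokens, background_tokens, min_ratio=2.0):
    """Return terms whose relative frequency in the domain text exceeds
    their add-one smoothed relative frequency in the background corpus
    by at least min_ratio."""
    domain = Counter(domain_tokens)
    background = Counter(background_tokens)
    vocab_size = len(set(domain) | set(background))
    scores = {}
    for term, count in domain.items():
        p_domain = count / len(domain_tokens)
        p_background = (background[term] + 1) / (len(background_tokens) + vocab_size)
        ratio = p_domain / p_background
        if ratio >= min_ratio:
            scores[term] = ratio
    return scores


def cooccurrence_counts(sentences, terms):
    """Count how often each pair of selected terms shares a sentence."""
    selected = set(terms)
    pairs = Counter()
    for sentence in sentences:
        present = sorted(set(sentence) & selected)
        pairs.update(combinations(present, 2))
    return pairs
```

Term pairs with high counts are only candidates for a semantic relation; the pattern-based identification mentioned in step 4 would still be needed to name the relation type.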
Ontology Learning (2)
5. Ontology merging
• The extracted structure is combined with the global domain ontology
(stored in KP-Lab knowledge repository). The mechanism of the explicit
uncertain knowledge representation is used in this step.
6. Visualisation
• Combination of the gained qualitative data and the relevance weights.
• The selection of the most suitable visualisation form depends on the
needs of KP-Lab users; a simple graphical view is proposed.
Ontology Learning (3)
7. Export to other formats
• Standard OWL export routines are currently supported. The emerging
BayesOWL and FuzzyOWL formats are under development.
Creation of the training set - background model:
• 2-billion-word GigaCorpus for English;
• 600-million-word corpus for Czech;
• additional relevant documents provided by users.
Data simulation - using Wikiversity & Wikipedia texts.
Scenarios:
1. Collaborative acquisition of knowledge in a company.
2. Description of a field of interest. Creation of an essay on given
topic(s) in an academic environment.
Classification
The task is to automatically organize a set of knowledge artefacts into
predefined or ad hoc categories - existing or new concepts of an
ontology.
Classification is supervised by a model, created from a training set of
semantically annotated artefacts. The model contains a set of
parameters (weights, rules, etc.) created in the process of training and
used in the classification of unknown examples.
Algorithms to be used:
• simple term matching, kNN, SVM, Winnow, Perceptron, Naive Bayes
(multinomial and binomial), boosting, decision rules, and decision
trees (various combinations of growing and pruning methods).
Implementation platform: JBowl library
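To make the train-then-classify cycle concrete, here is a minimal sketch of one of the listed algorithms, multinomial Naive Bayes, written in plain Python. It is not the JBowl implementation or its API; the class and method names are illustrative, and the "parameters" of the model are simply the per-category term counts learned from the annotated training set.

```python
import math
from collections import Counter, defaultdict


class MultinomialNaiveBayes:
    """Multinomial Naive Bayes over token lists, with add-one smoothing."""

    def fit(self, documents, labels):
        self.term_counts = defaultdict(Counter)  # category -> term counts
        self.doc_counts = Counter(labels)        # category -> number of documents
        self.vocabulary = set()
        for tokens, label in zip(documents, labels):
            self.term_counts[label].update(tokens)
            self.vocabulary.update(tokens)
        self.total_docs = len(labels)
        return self

    def log_scores(self, tokens):
        """Return the unnormalised log posterior of every category."""
        scores = {}
        vocab_size = len(self.vocabulary)
        for label, n_docs in self.doc_counts.items():
            counts = self.term_counts[label]
            total_terms = sum(counts.values())
            score = math.log(n_docs / self.total_docs)
            for token in tokens:
                if token in self.vocabulary:  # terms unseen in training are ignored
                    score += math.log((counts[token] + 1) / (total_terms + vocab_size))
            scores[label] = score
        return scores

    def predict(self, tokens):
        """Return the category with the highest log posterior."""
        scores = self.log_scores(tokens)
        return max(scores, key=scores.get)
```

In the KP-Lab setting the categories would be ontology concepts and the token lists would come from pre-processed artefact descriptions.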
JBowl Library
JBowl - Open Source library in Java, provides
support for:
• intelligent information retrieval, summarization, and information
extraction from textual documents;
• text mining, clustering, categorization, classification tasks.
Main characteristics:
• extendable modular architecture;
• platform for pre-processing (incl. NLP methods) and indexing of
large textual collections;
• functions for creation and evaluation of text mining models (for
both supervised and unsupervised algorithms).
Web: http://sourceforge.net/projects/jbowl/
JBowl Library - Architecture
• documents: XML, Lucene index, Thesaurus
• analysis: Tokenization, Sentence chunking, POS tagging, NP chunking
• data: Statistics, TF-IDF, Term selection
• models: categorization, clustering, keyword extraction / summarization, information extraction
• utils: Collections, Matrices, BLAS
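As an illustration of the TF-IDF weighting named in the data module, the following minimal sketch computes one common variant (length-normalised term frequency times log inverse document frequency). It is written independently of JBowl, whose exact weighting scheme is not stated here; the function name is illustrative.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: weight} dict per tokenized document.

    tf is the term count divided by the document length;
    idf is log(N / df), where df is the number of documents containing the term.
    """
    n_docs = len(documents)
    doc_freq = Counter()
    for tokens in documents:
        doc_freq.update(set(tokens))
    weights = []
    for tokens in documents:
        counts = Counter(tokens)
        weights.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights
```

A term that occurs in every document receives weight zero, which is what makes the weights usable for term selection.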
JBowl Library - Usage
JBowl provides:
• A text categorization method for active learning, which reduces
the number of training examples needed.
• A heuristic that selects examples according to the confidence of the
classifier's prediction for the given example. This heuristic does not
require a validation set and can be used effectively to select a
small set of labeled examples.
• Integration of several classification methods, evaluation.
• Tools for NLP (incl. Slovak linguistic resources and tools).
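The confidence-based selection heuristic can be sketched as follows. This assumes the common "least confident first" reading of active learning, which the slide does not spell out, and `predict_proba` is a hypothetical callable rather than a JBowl function.

```python
def select_for_labeling(unlabeled, predict_proba, batch_size=2):
    """Pick the examples whose top predicted category has the lowest probability.

    predict_proba is a hypothetical callable mapping an example to a
    {category: probability} dict.
    """
    def confidence(example):
        return max(predict_proba(example).values())

    # Least confident examples come first; they are sent to a human annotator.
    return sorted(unlabeled, key=confidence)[:batch_size]
```

The selected examples are labeled manually and added to the training set, after which the classifier is retrained and the selection is repeated.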
Scenario for use of classification service:
• Annotation of new or updated artefacts - the system can suggest
suitable concepts from one or more ontologies to be assigned as
metadata or a conceptual description to the artefact.
Future Work
Solving multilinguality - finding a minimal set of NLP resources that is
sufficient for the (basic) functionality of the text-mining services.
Increasing efficiency: meeting the requirement of a synchronous SOA system, e.g. by
using the Extensible Messaging and Presence Protocol (XMPP).
Classification: selecting the most appropriate algorithms for
the automatic annotation of artefacts according to the semantics
codified in several ontologies (with limited availability of training data).
Ontology learning: concentrating on better ways of ontology
merging (incl. the need to combine extracted relations with the ones
from existing domain ontologies).
Implementation of the first prototype of the SWKM (M24), testing and
evaluation.
Further information:
http://www.kp-lab.org
Thank you!
Questions?