Transcript Slide 1
Department of Electronic and Computer Engineering
Automatic Subject Classification of Textual Documents Using Limited or No Training Data
Arash Joorabchi
Supervised by Dr. Abdulhussain E. Mahdi
Submitted for the degree of Doctor of Philosophy
10/11/2010

Outline
Introduction to ATC
Motivation, Aim, and Objectives
Bootstrapping ML-based ATC systems (Ch.3)
Bibliography-Based ATC method (BB-ATC) (Ch.4)
Enhanced BB-ATC for Automatic Classification of Scientific Literature in Digital Libraries (Ch.5)
Citation-Based Keyphrase Extraction (CKE) (Ch.6)
Conclusion & Future Work 2

Introduction
• Automatic Text Classification/Categorization (ATC)
– The automatic assignment of natural language text documents to one or more predefined classes/categories according to their contents.
• Applications include:
– Spam filtering
– Web information retrieval, e.g., filtering, focused crawling, web directories, subject browsing
– Organising digital libraries
• Common methods:
– Rule-based knowledge engineering (until the late 1980s)
– Machine learning (since the 1990s) 3

ML Approach to ATC
• Common ML algorithms used for ATC:
– Naïve Bayes (based on Bayes' theorem)
– k-Nearest Neighbours (k-NN)
– Support Vector Machines (SVM) [Vapnik, V 1995]
• SVM is reported to yield the best prediction accuracy [Joachims, T 1998]. However, the accuracy of ML-based ATC systems depends on many parameters, such as:
– Quantity and quality of training documents
– Document representation models, e.g., bag-of-words vs. bag-of-phrases
– Term weighting mechanisms, e.g., binary vs. multinomial (burstiness phenomenon)
– Feature reduction and selection methods, e.g., document frequency vs. information gain
• Therefore, the choice of the best classification algorithm depends highly on the characteristics of the ATC task at hand [Hand, D. J. 2006].
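The Naïve Bayes algorithm named above recurs throughout the thesis. As a point of reference, the multinomial variant can be sketched in a few lines; the training snippets and the ISCED-style class codes below are purely illustrative, not taken from the actual repository:

```python
import math
from collections import Counter, defaultdict

def train_nb(docs):
    """docs: list of (tokens, label). Returns log-priors, log-likelihoods, vocab."""
    label_counts = Counter(label for _, label in docs)
    word_counts = defaultdict(Counter)          # per-class term frequencies
    vocab = set()
    for tokens, label in docs:
        word_counts[label].update(tokens)
        vocab.update(tokens)
    log_prior = {c: math.log(n / len(docs)) for c, n in label_counts.items()}
    log_like = {}
    for c in label_counts:
        total = sum(word_counts[c].values())
        # Laplace smoothing so unseen terms do not zero out a class
        log_like[c] = {w: math.log((word_counts[c][w] + 1) / (total + len(vocab)))
                       for w in vocab}
    return log_prior, log_like, vocab

def classify(tokens, log_prior, log_like, vocab):
    """Pick the class maximising log P(c) + sum of log P(w|c) over known words."""
    scores = {c: lp + sum(log_like[c][w] for w in tokens if w in vocab)
              for c, lp in log_prior.items()}
    return max(scores, key=scores.get)

train = [("relational database sql query".split(), "482B"),
         ("compiler parsing grammar syntax".split(), "481A")]
model = train_nb(train)
print(classify("sql database index".split(), *model))   # → 482B
```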
4 Motivation, Aim, and Objectives
• What if there is limited or no training data? (e.g., 100 classes & 200 samples per class)
• Our aim was to alleviate this problem by pursuing two lines of research:
i. Investigating bootstrapping methods to automate the process of building labelled corpora for training ML-based ATC systems.
ii. Investigating a new unsupervised ATC algorithm which does not require any training data.
• In order to realise this aim, we have focused on utilising two sources of data whose application in ATC had not been fully explored before:
a) Conventional library organisation resources such as library classification schemes, controlled vocabularies, and catalogues (OPACs).
b) Linkage among documents in the form of citation networks. 5

Development of a National Syllabus Repository for Higher Education in Ireland: An Overview of the Developed Syllabus Repository System
• Goal: Collecting unstructured electronic syllabus documents from participating higher education institutes into a metadata-rich central repository.
• Extended the ISCED scheme:
– 482B - Science, Mathematics and Computing/Computing/Information Systems/Databases
• Naïve Bayes classification algorithm [Tom Mitchell 1997]
• A new web-based bootstrapping method
[System architecture: an FTP server and a Hot-Folder Application feed zip packages into the Repository Database; Pre-processing (Open Office, Xpdf, PDFTK); Information Extractor Program (Thesaurus, Document Segmenter, Segment Headings Module, Syllabus Segmenter, Named Entity Extractor with entity names, Post-Processing); Classifier (Classification Scheme, GATE, Web Search API, Web); Meta-data Generator Module.] 6

Web-based Bootstrapping Process
1. A list of subject fields (leaf nodes) in the classification scheme is compiled.
2. For each subject field in the list, a web search query is created, including the caption of the subject field and the keyword "syllabus", and submitted to the Yahoo search engine using the Yahoo search SDK.
3.
The first hundred URLs in the returned results for each query are passed to the GATE toolkit [Cunningham et al. 2002], which downloads all corresponding files (in HTML, TXT, PDF, or MS Word formats) and extracts and tokenizes their textual contents.
4. The tokenised texts are converted to feature/word vectors, which are then used to train the classifier for classifying syllabus documents at the subject-field level.
5. The subject-field word vectors are also used in a bottom-up fashion to construct word vectors for the fields which belong to the higher levels of the hierarchy (p.52). 7

Evaluation and Experimental Results
• Test dataset contains 100 undergraduate syllabus documents and 100 postgraduate syllabus documents from 5 participating HE institutes in Ireland.
• The micro-average precision achieved by the classifier for undergraduate syllabi is 0.75, compared to 0.60 for postgraduate syllabi.

                    | Micro-avg. Precision | Micro-avg. Recall | Micro-avg. F1
  Named Entities    | 0.94                 | 0.74              | 0.82
  Topical Segments  | 0.84                 | 0.72              | 0.77

• Results published in:
– The proceedings of the 12th European Conference on Research and Advanced Technology for Digital Libraries, ECDL 2008; and
– The Electronic Library, 27, 4 (2009). 8

Bootstrapping ML-based ATC Systems Utilizing Public Library Resources: Overview of Developed ATC System
• A dynamic ML-based ATC system that can be adopted for a wide range of ATC tasks with minimum customization required.
• Data sources: LOC OPAC and the Internet.
• Dewey Decimal Classification (DDC) scheme.
• Small parts of books, such as the back cover and editorial reviews, are used for training.
• Transformed Weight-normalized Complement Naive Bayes (TWCNB) [Rennie et al., 2003].
• A linear SVM classifier called LIBLINEAR [Lin, C.-J et al., 2008].
[System architecture: Z39.50 API and HttpClient API collect unlabeled texts; Training Corpus Builder guided by the Classification Scheme; Training Dataset Builder with NB Corpus Builder and SVM Corpus Builder, using general and specific stop words and GATE; Classifier (TWCNB, LIBLINEAR).] 9

Bootstrapping Module: Data Mining Process
1. Retrieve a list of books from LOC's catalogue which are classified into a given category.
2. Extract a list of ISBNs and use them to retrieve the books' descriptions from Amazon: http://amazon.com/gp/product/ISBN-VALUE 10

Parsed Book Description Text
• Product Description – Editorial Reviews: The Deitels' groundbreaking How to Program series offers unparalleled breadth and depth of object-oriented programming concepts and intermediate-level topics for further study. The Seventh Edition has been extensively fine-tuned and is completely up-to-date with Sun Microsystems, Inc.'s latest Java release — Java Standard Edition 6 ("Mustang") and several Java Enterprise Edition 5 topics. Contains an extensive OOD/UML 2 case study on developing an automated teller machine. Takes a new tools-based approach to Web application development that uses Netbeans 5.5 and Java Studio Creator 2 to create and consume Web Services. Features new AJAX-enabled Web applications built with JavaServer Faces (JSF), Java Studio Creator 2 and the Java Blueprints AJAX Components. Includes new topics throughout, such as JDBC 4, SwingWorker for multithreaded GUIs, GroupLayout, Java Desktop Integration Components (JDIC), and much more. A valuable reference for programmers and anyone interested in learning the Java programming language.
http://amazon.com/gp/product/0132222205 11

Evaluation and Experimental Results
• 20-Newsgroup-18828 dataset - a collection of 18,828 newsgroup articles, partitioned across 20 different newsgroups.
• Eight classes in 20-Newsgroup were mapped to their corresponding classes in the Dewey Decimal Classification scheme.
(The remaining classes were inapplicable, e.g., misc.forsale.)

  Newsgroup              | Dewey number | Dewey caption                 | No. of training texts collected
  sci.space              | 520          | Astronomy and allied sciences | 810
  rec.sport.baseball     | 796.357      | Baseball                      | 997
  rec.autos              | 796.7        | Driving motor vehicles        | 587
  rec.motorcycles        | 796.7        | Driving motor vehicles        | 587
  soc.religion.christian | 230          | Christian theology            | 1043
  sci.electronics        | 537          | Electricity and electronics   | 713
  rec.sport.hockey       | 796.962      | Ice hockey                    | 270
  sci.med                | 610          | Medicine and health           | 1653
12

Evaluation and Experimental Results (Cont.)

  Newsgroup              | Bootstrapped TWCNB Precision% | Standard TWCNB Precision%
  sci.space              | 69.19                         | 94.94
  rec.sport.baseball     | 96.78                         | 93.96
  rec.autos              | 74.74                         | 91.91
  rec.motorcycles        | 71.02                         | 94.97
  soc.religion.christian | 89.36                         | 96.0
  sci.electronics        | 69.92                         | 78.17
  rec.sport.hockey       | 75.77                         | 98.5
  sci.med                | 76.23                         | 96.96
  Avg.                   | 77.87                         | 93.17

• The accuracy of the bootstrapped TWCNB is 15% lower than that of the standard TWCNB.
• The LIBLINEAR classifier, which achieved an average precision of 68%, turned out to be considerably less accurate than TWCNB in this task.
• Results published in:
– The proceedings of the 19th Irish Conference on Artificial Intelligence and Cognitive Science (AICS08). 13

Leveraging the Legacy of Conventional Libraries for Organizing Digital Libraries
Can we utilize the classification metadata of books referenced in a syllabus document to classify it? Tapping into:
– Books that are already classified by expert library cataloguers according to the DDC and LCC classification schemes.
– The intellectual work that has been put into developing and maintaining library classification systems over the last century.
– The intellectual effort of expert cataloguers who have manually classified millions of books and other resources in libraries. 14

Bibliography-Based ATC Method
BB-ATC is based on automating the following processes:
1. Identifying and extracting references in a given document.
2.
Searching the catalogues of physical libraries for the extracted references in order to retrieve their classification metadata.
3. Allocating a class (or classes) to the document based on the retrieved classification categories of the references, with the help of a weighting mechanism.
• Similar to the k-Nearest Neighbour (k-NN) algorithm. 15

Bibliography-Based ATC: Advantages over ML-based ATC Systems
– No training data is needed.
– Library classification schemes are regularly updated and contain thousands of classes in every field of knowledge.
– Performance is not adversely affected by the large number of classes in DDC and LCC.
– New books are catalogued every day, and hence there is no concept drift. 16

BB-ATC Implementation
– The DDC classification scheme was adopted because of its worldwide usage and hierarchical structure.
– Multi-label classification by assigning weights (0 < w ≤ 1) to candidate DDC classes and LCSHs.
– The JZkit Java API is used to communicate with the libraries' OPAC catalogues through the Z39.50 protocol.
– JRegex/JAPE is used for extracting ISBNs/ISSNs.
[Pipeline: Syllabi DB (*.PDF, *.HTM, *.DOC) → Pre-processing (Xpdf, Open Office) → document content in plain text → Information Extractor (GATE) → extracted reference identifiers (ISBNs/ISSNs) → Catalogue Search (LOC Catalogue, BL Catalogue) → references' DDC class numbers and LCSHs → Classifier → weighted list of chosen DDC class(es) & LCSHs.] 17

BB-ATC Evaluation
Test dataset: 100 computer science related syllabus documents.
[Header of the published full-results document:] Leveraging the Legacy of Conventional Libraries for Organizing Digital Libraries. Arash Joorabchi, Abdulhussain E. Mahdi, Department of Electronic and Computer Engineering, University of Limerick, Republic of Ireland. This document contains the full experimental results of our BB-ATC system. Full results available online at: www.csn.ul.ie/~arash/PDFs/1.pdf
The proposed ATC system was used to automatically classify 100 syllabus documents which mainly belong to the field of computer science.
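The three automated steps above amount to a citation-vote classifier. A minimal sketch follows; note that the regex handles only 10-digit ISBNs, and the in-memory CATALOGUE dictionary is a hypothetical stand-in for the Z39.50 queries against the LOC and BL catalogues:

```python
import re
from collections import Counter

# Hypothetical stand-in for a Z39.50 catalogue lookup (LOC/BL):
# maps an ISBN to the DDC class number of the catalogued book.
CATALOGUE = {"0132222205": "005.133", "0123797721": "006.37",
             "1558605835": "006.37", "3540629092": "006.42"}

ISBN10 = re.compile(r"\b(?:ISBN[-: ]*)?(\d{9}[\dXx])\b")

def classify_bb_atc(text):
    """Extract ISBN references from a document and weight the DDC
    classes of the referenced books by their relative frequency."""
    isbns = ISBN10.findall(text)
    classes = [CATALOGUE[i] for i in isbns if i in CATALOGUE]
    if not classes:
        return []
    counts = Counter(classes)
    # weight 0 < w <= 1: the share of retrieved references in each class
    return [(ddc, n / len(classes)) for ddc, n in counts.most_common()]

doc = "References: ISBN 0132222205; ISBN 0123797721; ISBN 1558605835."
print(classify_bb_atc(doc))   # 006.37 outweighs 005.133
```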
The validity and correctness of each assigned DDC class label is examined manually by an expert cataloguer. When necessary, additional notes are provided to help clarify the results. Each time a new class appears in the results, if the caption of the class is not self-explanatory, some additional information about that class is provided in the form of footnotes. The source for these class descriptions is the WebDewey website (http://connexion.oclc.org), which provides access to the latest version of the DDC scheme (DDC22 at the time of creating this document).

Classification results summary (micro-averaged performance measures):

  TP  | FP | FN | Precision | Recall | F1
  210 | 19 | 26 | 0.917     | 0.889  | 0.902

LEGEND
TP True Positive
FP False Positive
FN False Negative
NC Not Catalogued: the referenced item is not catalogued in either the Library of Congress or the British Library catalogues.
CE Cataloguer's Error: the cataloguers in either the Library of Congress or the British Library have classified the item into the wrong class (manual classification error), or have labelled the item with an invalid class number (data entry error).

Corresponding author. Tel.: (+)353-61-213492; Fax: (+)353-61-338176. E-mail addresses: [email protected] (A. Joorabchi), [email protected] (A.E. Mahdi).

BB-ATC Performance Compared to Similar Reported Experiments:

  Author              | Method | Data Set                                                 | Classification Scheme                             | F1
  Pong et al. (2007)  | K-NN   | 505 training & 254 testing documents (web pages)         | 67 classes from LCC                               | 0.80
  Pong et al. (2007)  | NB     | 505 training & 254 testing documents (web pages)         | 67 classes from LCC                               | 0.54
  Chung et al. (2003) | K-NN   | 1889 training & 623 test documents (economics web pages) | 575 subclasses of the DDC main class of economics | 0.92
  BB-ATC              | -      | 100 computer science related syllabi                     | Full DDC scheme                                   | 0.90

Results published in:
• The proceedings of the 13th European Conference on Research and Advanced Technology for Digital Libraries (ECDL 2009).
(Granted the best student paper award.) 18

Enhanced BB-ATC Method for Automatic Classification of Scientific Literature in Digital Libraries
• 1,800 publications a day in biomedical science!
• The CiteSeer digital library is used as the experimental platform (~1 million records), with the records stored in eXist-DB.
• The CiteSeer infrastructure is fully open source and supports OAI-PMH (CiteSeer OAI & BibTeX records).
• The Google Book Search database is used for mining citation networks.
• OCLC's WorldCat, a union catalogue of 70,000 libraries around the world, is used to retrieve classification metadata.
[Pipeline: Pre-processing → chosen document's metadata records and list of references → Data mining (Google Book Search, WorldCat Catalogue) → pool of DDC numbers potentially related to the document → Inferring → probabilistically chosen DDC number for the document.] 19

Data Mining Process
[Diagram: the document's metadata (Title, Authors, Abstract, ..., Reference #1 (R1) ... Reference #n (Rn)) is submitted to Google Book Search, which returns the list of publications citing the document and the list of publications citing each reference; the ISBNs of these citing publications (P1 ... Pn) are then looked up in the WorldCat catalogue to retrieve their DDC numbers.] 20

Sample Data Mining Results (Cont.)
Document's title: Statistical Learning, Localization, and Identification of Objects (has only one reference). This work describes a statistical approach to deal with learning and recognition problems in the field of computer vision.

Citing publications:

  No. | ISBN       | DDC No.
  1.  | 0123797721 | 006.3/7
  2.  | 0123797772 | 006.3/7
  3.  | 0769501648 | Null
  4.  | 0780350987 | 006.3/7
  5.  | 0780399781 | Null
  6.  | 0792378504 | 621.36/7
  7.  | 0818681845 | 621.367
  8.  | 1558605835 | Null
  9.  | 3540250468 | 629.8932
  10. | 3540629092 | 006.4/2
  11. | 3540634606 | 006.4/2
  12. | 3540639314 | 621.36/7
  13. | 3540646132 | 006.3/7
  14. | 3540650806 | 006.3
  15. | 389838019X | 005.1/18

DDC hierarchy for the winning number:
  Level 1: 0 Computer science, information & general works
  Level 2: 00 Computer science, knowledge & systems
  Level 3: 006 Special computer methods
  Level 4: 006.3 Artificial intelligence

Reference's title: Learning Object Recognition Models from Images
Citing publications:
  No. | ISBN       | DDC No.
  1.  | 0120147734 | 537.5/6
  2.  | 0195095227 | 006.3/7
  3.  | 0780399773 | Null
  4.  | 0818638702 | 621.39/9
  5.  | 1586032577 | 006.3
  6.  | 1848002785 | 621.367
  7.  | 3540282262 | 006.3
  8.  | 3540433996 | 629.8/92
  9.  | 3540617507 | 006.3/7
  10. | 3540634606 | 006.4/2
  11. | 3540636366 | 006.7
  12. | 3540667229 | 006.3/7
  13. | 389838019X | 005.1/18
  14. | 3540404988 | 006.3/7

Pool of candidate DDC numbers and their frequencies:

  DDC No. | Freq
  0       | 17
  6       | 7
  00      | 17
  006     | 15
  005     | 2
  006.3   | 11
  006.4   | 3
  006.37  | 8
  621.367 | 4

Continuation of the DDC hierarchy around the winning number:
  Level 4: 006.4 Computer pattern recognition
  Level 5: 006.31 Machine learning; 006.32 Neural nets (neural networks); 006.33 Knowledge-based systems; 006.35 Natural language processing (NLP); 006.37 Computer vision; 006.42 Optical pattern recognition; 006.45 Acoustical pattern recognition 21

Inference & Visualization
The weight of each candidate class DDCi in the pool is computed from the document's m reference lists R1 ... Rm:
  CW(DDCi) = GF(DDCi) × NLF(DDCi) × ULF(DDCi)
where
  NLF(DDCi) = Σ(j=1..m) Freq(DDCi,j) / |Rj|
  GF(DDCi) = |{ Rj : DDCi ∈ Rj }|
  ULF(DDCi) = Σ(j=1..m) Freq(DDCi,j)
and the weight of a child node cn is accumulated down the hierarchy from its parent node pn:
  CW(cn) = D × S(cn) + CW(pn)
The same concept as TF-IDF weighting. 22

Evaluation Results
Test dataset contains 1000 research documents, divided into 5 groups according to their number of references.

  Micro-avg. Pr | Micro-avg. Re | Micro-avg. F1
  0.84          | 0.78          | 0.81

  No. of references | Micro-avg. Pr | Micro-avg. Re | Micro-avg. F1
  0                 | 0.718         | 0.523         | 0.605
  4                 | 0.842         | 0.820         | 0.831
  8                 | 0.843         | 0.829         | 0.836
  16                | 0.880         | 0.860         | 0.870
  32                | 0.891         | 0.880         | 0.886

[Line chart: micro-averaged F1 rising from about 0.6 for documents with 0 references to about 0.89 for documents with 32 references.] 23

Evaluation Results (cont.)
Number of documents classified at each level of the DDC hierarchy, with the corresponding averaged performance measures:

  Level | No. of Docs | % of Docs | Micro-avg. Pr | Micro-avg. Re | Micro-avg. F1
  1     | 1000        | 100%      | 0.94          | 0.89          | 0.91
  2     | 1000        | 100%      | 0.92          | 0.87          | 0.89
  3     | 1000        | 100%      | 0.84          | 0.80          | 0.82
  4     | 1000        | 100%      | 0.81          | 0.77          | 0.79
  5     | 950         | 95%       | 0.75          | 0.66          | 0.70
  6     | 394         | 39.4%     | 0.68          | 0.63          | 0.65
  7     | 50          | 5%        | 0.59          | 0.57          | 0.58
  8     | 20          | 2%        | 0.62          | 0.55          | 0.58
  9     | 4           | 0.4%      | 0.59          | 0.83          | 0.69

[Bar chart: percentage of documents and micro-averaged precision, recall, and F1 per DDC hierarchy level.]
http://www.skynet.ie/~arash/BB-ATC1/HTML/
Article under review in the journal Information Processing & Management (Elsevier). 24

BB-ATC Approach Applied to the Problem of Keyphrase Extraction from Scientific Literature
• Keyphrases (multi-word units) describe the content of research documents and are usually assigned by the authors.
• The task of automatically assigning keyphrases to a document is called keyphrase indexing.
• Considered a form of ATC (ML-based, multi-label) and approached as such.
• Free indexing vs. indexing with a controlled vocabulary (e.g., LCSH, MeSH).
• Extraction indexing vs. assignment indexing. 25

Citation-Based Keyphrase Extraction (CKE)
1. Reference extraction using ParsCit [Councill, I. G, et al. 2008] (CRF, F1 = 0.93).
2. Mining the Google Book Search (GBS) database (>10 million archived items) for candidate terms (i.e., Google word clouds).
3. Term weighting & selection.
[Diagram: the document's metadata (Title: t, Authors, Abstract, Reference #1 (R1) ... Reference #n (Rn)) is submitted to Google Book Search (1); the lists of publications citing the document and citing each reference are returned (2); the word clouds of these citing publications (P1 ... Pn, identified by ISBN) are mined (3) to produce the key terms (4).] 26

Term Weighting and Selection
• Google Word Cloud (GWC): Google uses TF-IDF plus some heuristic rules to emphasize proper nouns (names, locations, etc.).
[Example: the GWC for a book titled "Data mining: practical machine learning tools and techniques".]
• Normalization, including: stopword removal, punctuation removal, abbreviation expansion, case-folding, and stemming (Porter2 [Porter 2002]).
• The keyphraseness score of each candidate term t is measured using:
  K(t) = RF(t) · log2(GF(t) + 1) · log2(LF(t) + 1) · NW(t) · ADI(t) · log2(FO(t) + 1) · log2(NC(t)) 27

Evaluation & Experimental Results
• wiki-20 test dataset [Medelyan et al., 2009]: 20 computer science research papers, each manually indexed by 15 different human teams (teams of 2).
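The normalization step listed under term weighting above can be sketched as follows; the stopword list here is a small illustrative subset, and a crude suffix-stripper stands in for the Porter2 stemmer used in the thesis:

```python
import re

STOPWORDS = {"the", "a", "an", "of", "and", "for", "to", "in"}   # illustrative subset

def porterish_stem(word):
    """Crude stand-in for the Porter2 stemmer: strip a common suffix
    when enough of the word remains to stay recognisable."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def normalise(term):
    """Case-fold, strip punctuation, drop stopwords, and stem a candidate term."""
    tokens = re.findall(r"[a-z0-9]+", term.lower())
    return " ".join(porterish_stem(t) for t in tokens if t not in STOPWORDS)

print(normalise("Practical Machine-Learning Tools"))   # → practical machine learn tool
```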
Rolling's inter-indexer consistency formula was adopted, which is equivalent to the F1 measure:
  Inter-indexer consistency = 2C / (A + B)
where C is the number of index terms the two indexers have in common, and A and B are the numbers of terms assigned by each. 28

Evaluation & Experimental Results (cont.)
Performance of the CKE algorithm compared to human indexers and competitive methods (inter-indexer consistency in %):

  Method                                       | Type         | Keyphrases per document        | Min. | Avg. | Max.
  Human indexing (gold standard)               | Manual       | Varied                         | 21.4 | 30.5 | 37.1
  KEA (Naïve Bayes)                            | Supervised   | Static: 5                      | 15.5 | 22.6 | 27.3
  Maui (Naïve Bayes & all features)            | Supervised   | Static: 5                      | 22.6 | 29.1 | 33.8
  Maui (Bagged Decision Trees & all features)  | Supervised   | Static: 5                      | 25.4 | 30.1 | 38.0
  Maui (Bagged Decision Trees & best features) | Supervised   | Static: 5                      | 23.6 | 31.6 | 37.9
  Grineva et al.                               | Unsupervised | Static: 5                      | 18.2 | 27.3 | 33.0
  CKE (condition A)                            | Unsupervised | Static: 5                      | 22.7 | 30.6 | 38.3
  CKE (condition B)                            | Unsupervised | Static: 6                      | 26.0 | 31.1 | 39.3
  CKE (condition C)                            | Unsupervised | Varied: same as human indexers | 22.0 | 30.5 | 38.7

To appear in the Journal of Information Science, 36, 6 (December 2010); published online before print on November 5, 2010. 29

Conclusion & Future Work
The main contribution of this work is the design, development, and evaluation of an alternative approach to ATC, utilizing two new knowledge/data sources:
i. Conventional library classification schemes.
ii. Citation networks among documents.
The proposed approach addresses two major issues:
a) Lack of a standard and comprehensive classification scheme for ATC.
b) Lack of training data.
Future work includes:
– BB-ATC: mining the citing documents as well as the cited ones; multi-label classification.
– CKE: utilizing the LCSH and user-assigned keyphrases of cited and citing documents.
– Applying the underlying theory of BB-ATC to the ACM DL and ACM's Computing Classification System (ACM-CCS).
– BB-ATC & CKE as an automatic metadata generator plug-in for scientific DLs such as RIAN (Ireland's National Research Portal) and the NDLTD (Networked Digital Library of Theses and Dissertations). 30

Development of a National Syllabus Repository for Higher Education in Ireland
• Goal: Collecting unstructured electronic syllabus documents from participating higher education institutes into a metadata-rich central repository.
• Challenges:
– Information Extraction:
• Syllabus documents have arbitrary sizes, formats, and layouts;
• contain multiple module descriptions (e.g., programme documents);
• contain complex layout features (e.g., hidden/nested tables).
– Automatic Classification:
• Lack of a suitable standard education classification scheme for higher education in Ireland.
• Lack of training data. 31

Classifier
• Classification scheme:
– An enhanced version of the International Standard Classification of Education (ISCED).
– 3 levels of classification: broad field (9), narrow field (25), and detailed field (80), each represented by a digit in a hierarchical fashion.
– We have extended this by adding a fourth level of classification, subject field, represented by a letter in the classification coding system from the Australian Standard Classification of Education (ASCED).
– "482B": Science, Mathematics and Computing/Computing/Information Systems/Databases
• Naïve Bayes classification algorithm [Tom Mitchell 1997]
• A web-based bootstrapping method 32

Programme Document Segmenter (PDS)
Splits a definitive programme document into its constituent module syllabi.
[Example: Definitive Programme Document "MSc in Business Management": Introduction ... Programme Structure ... Module 1: BM3222 Leadership Management ...] 33

Module Syllabus Segmenter (MSS)
Extracting the topical segments of each individual syllabus.
[Example segments: Module Syllabus Header segment, Aims & Objectives segment, Learning Outcomes segment.] 34

Named Entity Extractor (NEE)
• Extracts a set of common named entities/attributes, such as module code, module name, module level, number of credits, pre-requisites, and co-requisites, from the header segment of syllabi.
[Sample header segment:]
  CODE: CE 4701
  MODULE: Computer Software 1
  TYPE: Core
  GRADING TYPE: Normal
  CREDITS: 3
  PRE-REQUISITES: None
  AIMS/OBJECTIVES: To familiarise the student with the use of a computer and typical applications software. To introduce a high-level language, typically Pascal, as a concrete formalism for the representation of algorithms in a machine-readable form. 35

Bootstrapping ML-based ATC Systems Utilizing Public Library Resources
• Developing a dynamic ML-based ATC system that can be adopted for a wide range of ATC tasks with minimum effort required from users.
• Users will select a set of categories from a comprehensive standard classification scheme, and a bootstrapping method is used to automatically build a training dataset accordingly.
• Three main components:
– Universal Classification Scheme
– Training Corpus Builder (bootstrapper)
– ML-based Classification Algorithm 36

ATC System Components
• Universal Classification Scheme
– Acts as a pool of categories/classes that can be selectively adopted by the users to create their own classification scheme.
– The Dewey Decimal Classification (DDC), with thousands of classes, has been used in conventional libraries for over a century to categorize library materials.
– DDC is used in about 80% of libraries around the world and has a fully hierarchical structure (vs. LCC).
• Training Corpus Builder
– Textual items classified according to DDC are not available in an electronic format and/or are copyrighted.
– Alternatively, we use small parts of books, such as the topics covered, the back cover, and editorial reviews, publicly available on booksellers' websites such as Amazon.
– Short texts (~500 words) containing semantically-rich terms used to summarize the book.
• Classification Algorithms
– We implemented an optimized version of NB called Transformed Weight-normalized Complement Naive Bayes (TWCNB) [Rennie et al., 2003].
– A linear SVM classifier called LIBLINEAR [Lin, C.-J et al., 2008], which is an optimized implementation of SVM suitable for large linear classification tasks with thousands of features, such as ATC. 37

BB-ATC Performance Compared to Similar Reported Experiments

  Author              | Method | Data Set                                                 | Classification Scheme                             | F1
  Pong et al. (2007)  | K-NN   | 505 training & 254 testing documents (web pages)         | 67 classes from LCC                               | 0.80
  Pong et al. (2007)  | NB     | 505 training & 254 testing documents (web pages)         | 67 classes from LCC                               | 0.54
  Chung et al. (2003) | K-NN   | 1889 training & 623 test documents (economics web pages) | 575 subclasses of the DDC main class of economics | 0.92
  BB-ATC              | -      | 100 computer science related syllabi                     | Full DDC scheme                                   | 0.90

• Results published in:
– The proceedings of the 13th European Conference on Research and Advanced Technology for Digital Libraries (ECDL 2009). (Granted the best student paper award.) 38

Evaluation & Experimental Results
• wiki-20 test dataset [Medelyan et al., 2009]: 20 computer science research papers, each manually indexed by 15 different human teams (teams of 2).
• Rolling's inter-indexer consistency formula was adopted, which is equivalent to the F1 measure: Inter-indexer consistency = 2C / (A + B).
• The number of extracted references per document ranges between 10 and 79, with an average of 25.9 references per document.
• The number of retrieved GWCs per document ranges between 62 and 766, with an average of 271 GWCs per document.
• In total, the data mining unit retrieved the metadata records of 5,576 publications from GBS which either cite one of the documents in the wiki-20 collection or one of their references; almost all of these records (5,421, 97.14%) contain a word cloud.
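Rolling's measure above is straightforward to compute; the keyphrase sets in this sketch are illustrative, with C the size of the intersection and A, B the sizes of the two indexers' sets:

```python
def rolling_consistency(indexer_a, indexer_b):
    """Rolling's inter-indexer consistency: 2C / (A + B), where C is the
    number of keyphrases the two indexers share and A, B are the sizes
    of their keyphrase sets. Equivalent to the F1 measure."""
    a, b = set(indexer_a), set(indexer_b)
    common = len(a & b)
    return 2 * common / (len(a) + len(b))

team1 = {"machine learning", "data mining", "neural networks", "clustering"}
team2 = {"machine learning", "data mining", "classification", "clustering"}
print(rolling_consistency(team1, team2))   # → 0.75
```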
39

Performance Measures
• Precision:
  Pr(ci) = (number of correctly assigned class labels) / (total assigned) = TPi / (TPi + FPi)
• Recall:
  Re(ci) = (number of correctly assigned class labels) / (total possible correct) = TPi / (TPi + FNi)
[Worked example: a document assigned class 006.3 whose correct class is 006.4, with the resulting TP, FP, and FN counts.]
• F-score: the weighted harmonic mean of precision and recall. The traditional F-measure or balanced F-score is:
  F1(ci) = 2 · Pr(ci) · Re(ci) / (Pr(ci) + Re(ci))
This is also known as the F1 measure, because recall and precision are evenly weighted. The general formula for non-negative real β is:
  Fβ(ci) = (1 + β²) · Pr(ci) · Re(ci) / (β² · Pr(ci) + Re(ci)) 40

[Joachims, 1997] 41

Outline
Introduction to Automatic Text Classification (ATC)
Motivation, Aim, and Objectives
Bootstrapping ML-based ATC systems
Leveraging Conventional Library Resources for Organizing Digital Libraries (BB-ATC)
Enhanced BB-ATC for Automatic Classification of Scientific Literature in Digital Libraries
BB-ATC Approach Applied to the Problem of Keyphrase Extraction from Scientific Literature
Conclusion & Future Work 42
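These definitions can be checked numerically. A short sketch, plugging in the TP/FP/FN totals from the BB-ATC classification results summary (210, 19, 26):

```python
def micro_prf(tp, fp, fn):
    """Micro-averaged precision, recall, and balanced F-score (F1)."""
    precision = tp / (tp + fp)        # correct / total assigned
    recall = tp / (tp + fn)           # correct / total possible correct
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Totals from the BB-ATC classification results summary: TP=210, FP=19, FN=26
p, r, f = micro_prf(210, 19, 26)
print(round(p, 3), round(r, 3), round(f, 3))
# matches the reported 0.917 / 0.889 / 0.902 up to rounding
```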