Transcript INIS DB on Internet Open Access Pilot
International Atomic Energy Agency
International Nuclear Information System (INIS)
CAI, Thesaurus, Subject Categories and Metadata Extraction Tool (MET)
13th Joint INIS/ETDE Technical Committee Meeting 20-22 October 2011, Vienna, Austria
Neviana Rashkova
INIS Subject Specialist
IAEA International Atomic Energy Agency
CONTENT
COMPUTER ASSISTED INDEXING – CAI
INIS/ETDE THESAURUS
SUBJECT CATEGORIES
INIS INPUT QUALITY CONTROL - UPDATE
in co-operation with L. Iliev, Computer Support Group
IAEA
13th INIS/ETDE Joint Technical Committee Meeting, 20-21 October 2011
COMPUTER ASSISTED INDEXING – CAI
• Assists the indexer in choosing the subject category and descriptors, based on text analysis of the abstract and title
• Offers an opportunity for off-line work – batch indexing
• Incorporates the latest version of the INIS Thesaurus
• Uses “hidden terms” pointing to a valid Thesaurus term
• Currently we have:
  • 28 accounts created for Member States
  • 19 countries with access to CAI
  • 6 accounts created for external users
• This year - 53 658 documents indexed - 55% of the input from: Springer, ELSEVIER, ANS, IOPP, IAEA, MemSt, AIP
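The “hidden terms” idea can be illustrated with a minimal Python sketch. It assumes that a hidden term is simply a free-text phrase that points to one valid Thesaurus term. The table entries, the function name `suggest_descriptors` and the plain substring matching are illustrative assumptions, not the actual CAI implementation.

```python
from __future__ import annotations

# Hypothetical hidden-term table: free-text phrase -> valid Thesaurus descriptor.
# The entries are invented for illustration only.
HIDDEN_TERMS = {
    "bubble chamber": "BUBBLE CHAMBERS",
    "crystal growth": "CRYSTAL GROWTH",
    "candu": "CANDU TYPE REACTORS",
}


def suggest_descriptors(title: str, abstract: str) -> list[str]:
    """Return valid descriptors whose hidden terms occur in the title or abstract."""
    text = f"{title} {abstract}".lower()
    found = {descriptor for phrase, descriptor in HIDDEN_TERMS.items() if phrase in text}
    return sorted(found)
```

The indexer would still confirm or reject each suggestion; the sketch only shows how a non-descriptor phrase can lead to a valid term.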
INIS/ETDE THESAURUS
• Thesaurus is “a controlled and dynamic vocabulary of semantically and generically related terms which covers a specific domain of knowledge” (part of UNESCO definition)
• Types of relations for terms:
  • BT (level 1, 2 … 10); NT (1, 2 … 10); RT – related term; UF(+) – used for; SF – seen for
• Contains:
  • 21882 valid terms
  • 8677 forbidden terms
  • 30559 total
INIS/ETDE THESAURUS
• Maintaining the INIS/ETDE Thesaurus
  • Regularly updated simultaneously at INIS and ETDE
  • New terms proposed by Member States
  • Terms revised if needed
  • Discussion Group of experts – for new proposals and updates
• Translations
  • Original - in English
  • Other languages: German, French, Arabic, Russian, Chinese
  INIS Liaison Officers of the respective countries provide translations, with yearly updates for the new terms
USES OF INIS/ETDE THESAURUS
• For indexing
  • WinFibre
  • CAI – hidden terms
  • Independent use
• For retrieval
  • Incorporated in INIS search
  • For independent advanced search
  • For establishing a search strategy
• As a dictionary
USES OF INIS/ETDE THESAURUS
• Other potential applications
  • Retrieval – for navigation search together with subject classification
  • Automation in text analysis – provides a multiple-level taxonomy
  • Learning tool – gives immediate structured information about the terms and their relations

BRUCE-1 REACTOR
Tiverton, Ontario, Canada.
  *BT1 candu type reactors
  *BT1 natural uranium reactors
  *BT1 phwr type reactors
  RT bruce site

BUBBLE CHAMBERS
  *BT1 gas track detectors
  NT1 cryogenic bubble chambers
  NT1 heavy liquid bubble chambers
  NT1 ultrasonic bubble chambers
  RT digitizers
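The two sample entries can be held in a small data structure to show how the relations support retrieval. This is a minimal sketch: the `Term` class, the `THESAURUS` dictionary and the `expand_query` function are hypothetical names, only the first-level relations shown on the slide are included, and the asterisk marker is not modelled.

```python
from dataclasses import dataclass, field


@dataclass
class Term:
    """One thesaurus entry with its first-level relations."""
    name: str
    broader: list = field(default_factory=list)   # BT1: direct broader terms
    narrower: list = field(default_factory=list)  # NT1: direct narrower terms
    related: list = field(default_factory=list)   # RT: related terms


THESAURUS = {
    "BRUCE-1 REACTOR": Term(
        "BRUCE-1 REACTOR",
        broader=["CANDU TYPE REACTORS", "NATURAL URANIUM REACTORS", "PHWR TYPE REACTORS"],
        related=["BRUCE SITE"],
    ),
    "BUBBLE CHAMBERS": Term(
        "BUBBLE CHAMBERS",
        broader=["GAS TRACK DETECTORS"],
        narrower=[
            "CRYOGENIC BUBBLE CHAMBERS",
            "HEAVY LIQUID BUBBLE CHAMBERS",
            "ULTRASONIC BUBBLE CHAMBERS",
        ],
        related=["DIGITIZERS"],
    ),
}


def expand_query(term: str) -> list:
    """Expand a search term with its direct narrower terms."""
    entry = THESAURUS.get(term)
    if entry is None:
        return [term]  # unknown term: search for it unchanged
    return [term] + entry.narrower
```

Searching for BUBBLE CHAMBERS would then also cover the three narrower chamber types, which is one way the Thesaurus can support a search strategy.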
INIS/ETDE SUBJECT CATEGORIES
• INIS/ETDE subject categories update
  • Review the existing subject categories to include newer concepts and/or areas of research and development
  • Make the "ETDE only" categories available for INIS
  • Consider the introduction of new categories
• Four new Subject categories
  • S77 NANOSCIENCE AND NANOTECHNOLOGY
  • S79 ASTROPHYSICS, COSMOLOGY AND ASTRONOMY
  • S96 KNOWLEDGE MANAGEMENT AND PRESERVATION
  • S97 MATHEMATICAL METHODS AND COMPUTING
INIS/ETDE SUBJECT CATEGORIES
• ETDE/INIS Joint Reference Series No. 2 (Rev. 1) INIS Scope Descriptions
• The current categorization scheme contains 49 subject categories, both for INIS and ETDE. The categories have three-character alphanumeric codes
• The document defines the subject categories and provides the scope descriptions
  • Subject Index is included as an aid to subject classifiers
  • Cross references to other categories are provided where appropriate
  • The tool is provided to Member States to assist in subject indexing
INIS INPUT QUALITY CONTROL UPDATE
INTRODUCTION
The general goal of the procedure is to improve the quality of input
• Identifies documents with errors in input and extracts them for manual check by a specialist
• Knowledge Base created using a large number of expert decisions made by human indexers: intellectual choices for usage of a specific subject category/descriptor (SC/D) combination
• Implemented in a computer program, currently in use
• Uses documents from the immediately preceding time period
• At the time of implementation, 75% of identified records proved to be real errors
CURRENT PROCEDURE
• Based on old statistics
  • period 1980-1984
  • 26 000 documents used
• Subject categories changed several times
  • new categories added
  • artificially adjusted values to replace the real statistics
• Thesaurus updated many times
  • new descriptors
  • new concepts
CURRENT PROCEDURE
THE RESULTS FROM THE QA PROCEDURE DO NOT REFLECT THE REAL SITUATION
• Too many false warnings (~50% of all documents)
• More bad records allowed in production
• No longer relevant: no consistent approach for all category/descriptor pairs
THE OLD QA PROCEDURE NEEDS REVISION
UPDATED PROCEDURE
• Based on real statistics using the whole INIS database
• Takes into account all subject categories
• Takes into account the accumulated experience about specific erroneous usage of category/descriptor combinations
• Flexible towards changes of descriptor weights
UPDATED PROCEDURE IS EXPECTED TO IMPROVE QUALITY AND SAVE TIME
PRELIMINARY ANALYSES
• Analysis of the documentation on procedure for category match value (CMV) calculation
An Expert System for Quality Control in Bibliographic Databases*
Claudio Todeschini, International Nuclear Information System, International Atomic Energy Agency, Wagramerstrasse 5, A-1400 Vienna, Austria
Michael P. Farrell, Carbon Dioxide Information Center, Environmental Sciences Division, Oak Ridge National Laboratory, Oak Ridge, Tennessee 37831, U.S.A.
*Based on work performed at Oak Ridge National Laboratory, operated for the U.S. Department of Energy under Contract No. DE-AC05-84OR21400 with Martin Marietta Energy Systems, Inc. Work was partially supported by the Carbon Dioxide Research Division, U.S. Department of Energy.
• Analysis of the program for quality control and testing the formula
• Criteria for category/descriptor combination
WORK DONE
• Conversion of all existing categories to the currently used set of categories
  • Calculation of frequencies – table category/descriptor
• Comparison between two statistics new/all SC
  • Decision about which period to use for the statistics
• Adjustment to avoid expected errors
  • Identification of known combinations giving nearly 100% errors
  • Creating a table for “bad” combinations - assigned different weight (to reach very low CMV)
  • Possibility to manually change weights
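The frequency table and the adjustment for “bad” combinations can be sketched in a few lines of Python. The slides do not reproduce the CMV formula, so the weight used here (the share of a descriptor's occurrences that fall in a given category) is only an illustrative stand-in. The function name `build_weights`, its parameters and the sample records are hypothetical.

```python
from collections import Counter


def build_weights(records, bad_pairs=None, bad_weight=0.0):
    """Build a (category, descriptor) -> weight table from indexed records.

    records: iterable of (category, [descriptors]) tuples.
    The weight is the share of the descriptor's occurrences that fall in the
    category. This is an illustrative choice, not the CMV formula itself.
    """
    pair_counts = Counter()
    descriptor_totals = Counter()
    for category, descriptors in records:
        for d in set(descriptors):  # count each descriptor once per record
            pair_counts[(category, d)] += 1
            descriptor_totals[d] += 1

    weights = {pair: n / descriptor_totals[pair[1]] for pair, n in pair_counts.items()}

    # Manual adjustment: known "bad" combinations get a fixed low weight.
    for pair in bad_pairs or ():
        weights[pair] = bad_weight
    return weights


# Invented records for illustration only.
records = [
    ("S36", ["CRYSTAL GROWTH", "GROWTH"]),
    ("S36", ["CRYSTAL GROWTH"]),
    ("S43", ["GROWTH"]),
]
weights = build_weights(records, bad_pairs=[("S36", "GROWTH")])
```

Keeping the overrides in a separate `bad_pairs` argument mirrors the idea of a manually maintained table that sits on top of the computed statistics.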
FINE TUNING
EXPECTED ERRORS – examples:
Materials science
GROWTH – CRYSTAL GROWTH
Plasma physics
IGNITION – THERMONUCLEAR IGNITION
Physics of Elementary Particles and Fields
PRODUCTION – PARTICLE PRODUCTION
COLOR, FLAVOR, HOLOGRAPHY, TRANSPORT, CAVITIES,…etc.
17 descriptors in 18 subject categories have been adjusted
TOOLS DEVELOPED
Tools were developed to perform the steps:
• Scanning the records from the Reference DB to make full statistics for the subject category-descriptor pairs
• Report to show the difference between the current table and the one to replace it
• A table for manual “tuning” of some pairs
• Unfinished report to show the effect of changing the table on raw (unprocessed) and processed records
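A report comparing the current table with its replacement could look roughly like the following Python sketch. It assumes both tables are dictionaries keyed by (category, descriptor) pairs. The function name `table_diff` and the `min_change` parameter are hypothetical and do not describe the tools actually developed.

```python
def table_diff(old, new, min_change=0.0):
    """List pairs whose weight differs between two tables, largest change first.

    A pair missing from one table is reported with None on that side.
    """
    rows = []
    for pair in set(old) | set(new):
        before, after = old.get(pair), new.get(pair)
        if before == after:
            continue  # unchanged pairs are left out of the report
        change = abs((after or 0.0) - (before or 0.0))
        if change >= min_change:
            rows.append((pair, before, after, change))
    # Sort by size of change (descending), then by pair for a stable order.
    rows.sort(key=lambda row: (-row[3], row[0]))
    return rows
```

Each row holds the pair, the old weight, the new weight and the absolute change, so a reviewer can start with the pairs that moved most.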
COMPARISON WITH IRPS (processed records)
COMPARISON WITH IRPS (unprocessed records)
THRESHOLD DETERMINATION
[Chart: curves over CMV for S12 Management of radioactive wastes..., S21 Specific nuclear reactors and associated plants, and S36 Materials science; horizontal axis CMV from -1 to 5, vertical axis 0 to 180]
THRESHOLD DETERMINATION
Category S21
[Chart: curves over CMV for S21, S12 and S36; horizontal axis CMV from -1 to 5, vertical axis 0 to 140]
THRESHOLD DETERMINATION
Category S36
[Chart: curves over CMV for S36, S43 and S12; horizontal axis CMV from -1 to 5, vertical axis 0 to 250]
THRESHOLD DETERMINATION
Category S43
[Chart: curves over CMV for S43, S36 and S12; horizontal axis CMV from -1 to 5, vertical axis 0 to 250]
DISCUSSION
• First analyses suggest a natural threshold value CMV ∈ (1,2)
• Analysis of the number of documents to be scanned for different threshold CMV values is necessary
• Tests are necessary to assess the errors that result from choosing the threshold value in the different intervals
• Further testing over different sets of records is required before implementation
• Possibility for integration in WinFibre
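The workload analysis mentioned above can be sketched as a simple count per candidate threshold. The sketch assumes that a document is sent for manual check when its CMV falls below the threshold, which is consistent with “bad” combinations being pushed to a very low CMV. The function name `documents_to_scan` and the sample values are invented for illustration.

```python
def documents_to_scan(cmv_by_doc, thresholds):
    """For each candidate threshold, count documents whose CMV falls below it."""
    return {t: sum(1 for cmv in cmv_by_doc.values() if cmv < t) for t in thresholds}


# Invented CMV values for illustration only.
sample = {"doc1": -0.5, "doc2": 0.8, "doc3": 1.4, "doc4": 2.6, "doc5": 3.9}
workload = documents_to_scan(sample, [1.0, 1.5, 2.0])
```

Running the same count over real record sets for thresholds between 1 and 2 would show how quickly the manual checking workload grows as the threshold is raised.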
Thank you!