INIS DB on Internet Open Access Pilot

International Atomic Energy Agency

International Nuclear Information System (INIS)

CAI, Thesaurus, Subject Categories and Metadata Extraction Tool (MET)

13th Joint INIS/ETDE Technical Committee Meeting 20-22 October 2011, Vienna, Austria

Neviana Rashkova

INIS Subject Specialist

CONTENT

COMPUTER ASSISTED INDEXING – CAI

INIS/ETDE THESAURUS

SUBJECT CATEGORIES

INIS INPUT QUALITY CONTROL - UPDATE

in co-operation with L. Iliev, Computer Support Group

IAEA

13th INIS/ETDE Joint Technical Committee Meeting, 20-21 October 2011

COMPUTER ASSISTED INDEXING – CAI

• Assists the indexer in choosing the subject category and descriptors, based on text analysis of the abstract and title
• Offers an opportunity for off-line work – batch indexing
• Incorporates the latest version of the INIS Thesaurus
• Uses “hidden terms” pointing to a valid Thesaurus term
• Currently we have:
  • 28 accounts created for Member States
  • 19 countries with access to CAI
  • 6 accounts created for external users
• This year - 53 658 documents indexed - 55% of the input from: Springer, ELSEVIER, ANS, IOPP, IAEA, MemSt, AIP
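The idea of “hidden terms” pointing to valid Thesaurus terms can be sketched roughly as below. This is an illustration only, not the CAI implementation: the `HIDDEN_TERMS` entries, the function name `suggest_descriptors`, and the simple substring matching are all assumptions.

```python
# Hypothetical hidden terms (keys) pointing to valid Thesaurus terms (values).
# The hidden terms are invented for illustration.
HIDDEN_TERMS = {
    "candu reactor": "CANDU TYPE REACTORS",
    "candu reactors": "CANDU TYPE REACTORS",
    "bubble chamber": "BUBBLE CHAMBERS",
}


def suggest_descriptors(title: str, abstract: str) -> list[str]:
    """Return valid terms whose hidden terms occur in the title or abstract."""
    text = f"{title} {abstract}".lower()
    # Assumed matching rule: plain case-insensitive substring search.
    found = {valid for hidden, valid in HIDDEN_TERMS.items() if hidden in text}
    return sorted(found)


suggestions = suggest_descriptors(
    "Tracks in a bubble chamber", "Data from a CANDU reactor."
)
```

A real system would need tokenization and disambiguation; the sketch only shows the lookup direction, from a hidden term to the valid term offered to the indexer.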

INIS/ETDE THESAURUS

• Thesaurus is “a controlled and dynamic vocabulary of semantically and generically related terms which covers a specific domain of knowledge” (part of UNESCO definition)
• Types of relations for terms:
  • BT – broader term (level 1, 2…10); NT – narrower term (level 1, 2…10); RT – related term; UF(+) – used for; SF – seen for
• Contains:
  • 21882 valid terms
  • 8677 forbidden terms
  • 30559 total

INIS/ETDE THESAURUS

• Maintaining the INIS/ETDE Thesaurus
  • Regularly updated simultaneously at INIS and ETDE
  • New terms proposed by Member States
  • Terms revised if needed
  • Discussion Group of experts – for new proposals and updates
• Translations
  • Original - in English
  • Other languages: German, French, Arabic, Russian, Chinese

INIS Liaison Officers of the respective countries provide translations, with yearly updates for the new terms

USES OF INIS/ETDE THESAURUS

• For indexing
  • WinFibre
  • CAI – hidden terms
  • Independent use
• For retrieval
  • Incorporated in INIS search
  • For independent advanced search
  • For establishing a search strategy
• As a dictionary

USES OF INIS/ETDE THESAURUS

• Other potential applications
  • Retrieval – for navigation search together with subject classification
  • Automation in text analysis – provides multiple level taxonomy
  • Learning tool – gives immediate structured information about the terms and their relations

BRUCE-1 REACTOR
Tiverton, Ontario, Canada.
*BT1 candu type reactors
*BT1 natural uranium reactors
*BT1 phwr type reactors
RT bruce site

BUBBLE CHAMBERS
*BT1 gas track detectors
NT1 cryogenic bubble chambers
NT1 heavy liquid bubble chambers
NT1 ultrasonic bubble chambers
RT digitizers
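To show how entries like the two above give "immediate structured information", the sketch below reads one entry into a small data structure. The `Entry` class and `parse_entry` function are hypothetical names, and the line format is assumed from the two sample entries only. The leading asterisk is stripped because its meaning is not stated here, and lines without a known relation tag (such as the scope note) are ignored.

```python
from dataclasses import dataclass, field


@dataclass
class Entry:
    term: str
    broader: list[str] = field(default_factory=list)   # BT1 lines
    narrower: list[str] = field(default_factory=list)  # NT1 lines
    related: list[str] = field(default_factory=list)   # RT lines


def parse_entry(lines: list[str]) -> Entry:
    """Parse one entry: the first line is the term, the rest are relation lines."""
    entry = Entry(term=lines[0].strip())
    targets = {"BT1": entry.broader, "NT1": entry.narrower, "RT": entry.related}
    for line in lines[1:]:
        # Split "TAG value"; drop the leading asterisk if present.
        tag, _, value = line.strip().lstrip("*").partition(" ")
        if tag in targets:
            targets[tag].append(value.strip())
    return entry


bubble = parse_entry([
    "BUBBLE CHAMBERS",
    "*BT1 gas track detectors",
    "NT1 cryogenic bubble chambers",
    "NT1 heavy liquid bubble chambers",
    "NT1 ultrasonic bubble chambers",
    "RT digitizers",
])
```

Only level-1 relations are handled, because the sample entries show no deeper levels.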

INIS/ETDE SUBJECT CATEGORIES

• INIS/ETDE subject categories update
  • Review the existing subject categories to include newer concepts and/or areas of research and development
  • Make the "ETDE only" categories available for INIS
  • Consider the introduction of new categories
• Four new Subject categories
  • S77 NANOSCIENCE AND NANOTECHNOLOGY
  • S79 ASTROPHYSICS, COSMOLOGY AND ASTRONOMY
  • S96 KNOWLEDGE MANAGEMENT AND PRESERVATION
  • S97 MATHEMATICAL METHODS AND COMPUTING

INIS/ETDE SUBJECT CATEGORIES

• ETDE/INIS Joint Reference Series No. 2 (Rev. 1) INIS Scope Descriptions
• The current categorization scheme contains 49 subject categories, both for INIS and ETDE. The categories have three-character alphanumeric codes
• The document defines the subject categories and provides the scope descriptions
  • Subject Index is included as an aid to subject classifiers
  • Cross references to other categories are provided where appropriate
  • The tool is provided to Member States to assist in subject indexing

INIS INPUT QUALITY CONTROL UPDATE

INTRODUCTION

The general goal of the procedure is to improve the quality of input

• Identifies documents with errors in input and extracts them for manual check by a specialist
• Knowledge Base created using a large number of expert decisions made by human indexers - intellectual choices for usage of a specific subject category/descriptor (SC/D) combination
• Implemented in a computer program, currently in use
• Uses documents from the immediately preceding time period
• At the time of implementation – 75% of identified records proved to be real errors

CURRENT PROCEDURE

• Based on old statistics
  • period 1980-1984
  • 26 000 documents used
• Subject categories changed several times
  • new categories added
  • artificially adjusted values to replace the real statistics
• Thesaurus updated many times
  • new descriptors
  • new concepts

CURRENT PROCEDURE

THE RESULTS FROM THE QA PROCEDURE DO NOT REFLECT THE REAL SITUATION

• Too many false warnings (~50% of all documents)
• More bad records allowed in production
• Not relevant any more - no consistent approach for all category/descriptor pairs

THE OLD QA PROCEDURE NEEDS REVISION

UPDATED PROCEDURE

• Based on real statistics using the whole INIS database
• Takes into account all subject categories
• Takes into account the accumulated experience about specific erroneous usage of category/descriptor combinations
• Flexible towards changes of descriptor weights

UPDATED PROCEDURE IS EXPECTED TO IMPROVE QUALITY AND SAVE TIME

PRELIMINARY ANALYSES

• Analysis of the documentation on procedure for category match value (CMV) calculation

An Expert System for Quality Control in Bibliographic Databases*

Claudio Todeschini, International Nuclear Information System, International Atomic Energy Agency, Wagramerstrasse 5, A-1400 Vienna, Austria
Michael P. Farrell, Carbon Dioxide Information Center, Environmental Sciences Division, Oak Ridge National Laboratory, Oak Ridge, Tennessee 37831, U.S.A.

*Based on work performed at Oak Ridge National Laboratory, operated for the U.S. Department of Energy under Contract No. DE-AC05-84OR21400 with Martin Marietta Energy Systems, Inc. Work was partially supported by the Carbon Dioxide Research Division, U.S. Department of Energy.

• Analysis of the program for quality control and testing the formula
• Criteria for category/descriptor combination

WORK DONE

• Conversion of all existing categories to the currently used set of categories
  • Calculation of frequencies – table category/descriptor
• Comparison between two statistics new/all SC
  • Decision about which period to use for the statistics
• Adjustment to avoid expected errors
  • Identification of known combinations giving nearly 100% errors
  • Creating a table for “bad” combinations - assigned different weight (to reach very low CMV)
  • Possibility to manually change weights
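The frequency calculation and the table of “bad” combinations can be sketched as follows. This is a minimal illustration under stated assumptions, not the program actually used: the record layout, the function names, and the fixed weight of 0.0 are hypothetical. Treating S36/GROWTH as a bad pair is only an assumed reading of the GROWTH and CRYSTAL GROWTH descriptors listed among the expected errors. The CMV formula itself is not reproduced here.

```python
from collections import Counter


def pair_frequencies(records):
    """Count (subject category, descriptor) pairs.

    `records` is an iterable of (category, descriptors) tuples (assumed layout).
    """
    counts = Counter()
    for category, descriptors in records:
        for descriptor in set(descriptors):  # count each pair once per record
            counts[(category, descriptor)] += 1
    return counts


def apply_overrides(weights, bad_pairs, bad_weight=0.0):
    """Return a copy of `weights` where known "bad" pairs get a fixed low weight."""
    adjusted = dict(weights)
    for pair in bad_pairs:
        adjusted[pair] = bad_weight  # hypothetical value chosen to drive the CMV down
    return adjusted


# Invented sample records, for illustration only.
records = [
    ("S36", ["CRYSTAL GROWTH", "GROWTH"]),
    ("S36", ["CRYSTAL GROWTH"]),
    ("S21", ["BRUCE-1 REACTOR"]),
]
freq = pair_frequencies(records)
weights = apply_overrides(freq, bad_pairs=[("S36", "GROWTH")])
```

Keeping the overrides in a separate step leaves the real statistics untouched, so a manually changed weight can be reverted without rescanning the records.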

FINE TUNING

EXPECTED ERRORS – examples:

Materials Science

GROWTH – CRYSTAL GROWTH

Plasma physics

IGNITION – THERMONUCLEAR IGNITION

Physics of Elementary Particles and Fields

PRODUCTION – PARTICLE PRODUCTION
COLOR, FLAVOR, HOLOGRAPHY, TRANSPORT, CAVITIES, etc.

17 descriptors in 18 subject categories have been adjusted

TOOLS DEVELOPED

Tools were developed to perform the steps:

• Scanning the records from the Reference DB to make full statistics for the subject category-descriptor pairs
• Report to show the difference between the current table and the one to replace it
• A table for manual “tuning” of some pairs
• Unfinished report to show the effect of changing the table on raw (unprocessed) and processed records
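A report on the difference between the current table and its replacement could look roughly like the sketch below. The function name `table_diff`, the dictionary layout, and the sample values are assumptions made for illustration; the actual tools are not described in enough detail to reproduce.

```python
def table_diff(old, new):
    """Map each pair whose value differs to (old value, new value).

    A value of None means the pair is absent from that table.
    """
    keys = old.keys() | new.keys()
    return {
        key: (old.get(key), new.get(key))
        for key in sorted(keys)
        if old.get(key) != new.get(key)
    }


# Invented sample tables keyed by (subject category, descriptor).
old_table = {("S36", "GROWTH"): 5, ("S36", "CRYSTAL GROWTH"): 7}
new_table = {
    ("S36", "GROWTH"): 0,
    ("S36", "CRYSTAL GROWTH"): 7,
    ("S21", "BRUCE-1 REACTOR"): 3,
}
diff = table_diff(old_table, new_table)
```

Unchanged pairs are left out, so the report stays short even when the tables are large.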

COMPARISON WITH IRPS (processed records)

COMPARISON WITH IRPS (unprocessed records)

THRESHOLD DETERMINATION

[Figure: chart with CMV on the horizontal axis (-1 to 5) and a vertical scale from 0 to 180. Series: S12 Management of radioactive wastes..., S21 Specific nuclear reactors and associated plants, S36 Materials science]

THRESHOLD DETERMINATION

Category S21

[Figure: chart with CMV on the horizontal axis (-1 to 5) and a vertical scale from 0 to 140. Series: S21, S12, S36]

THRESHOLD DETERMINATION

Category S36

[Figure: chart with CMV on the horizontal axis (-1 to 5) and a vertical scale from 0 to 250. Series: S36, S43, S12]

THRESHOLD DETERMINATION

Category S43

[Figure: chart with CMV on the horizontal axis (-1 to 5) and a vertical scale from 0 to 250. Series: S43, S36, S12]

DISCUSSION

• First analyses suggest a natural threshold value CMV ∈ (1,2)
• Analysis of the number of documents to be scanned for different threshold CMV values is necessary
• Tests are necessary to assess the errors when the threshold value is chosen in different intervals
• Further testing over different sets of records is required before implementation
• Possibility for integration in WinFibre
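The analysis of how many documents would need scanning at different thresholds can be sketched as below. The sketch assumes that records with a CMV below the threshold are the ones sent for manual checking, which is consistent with "bad" combinations being weighted to reach a very low CMV but is not stated explicitly. The CMV values are invented, and `flagged_counts` is a hypothetical name.

```python
def flagged_counts(cmv_values, thresholds):
    """For each candidate threshold, count records whose CMV falls below it."""
    return {t: sum(1 for v in cmv_values if v < t) for t in thresholds}


# Invented CMV values, for illustration only.
sample_cmv = [-0.5, 0.2, 0.8, 1.4, 2.6, 3.1, 4.0]
counts = flagged_counts(sample_cmv, thresholds=[1.0, 1.5, 2.0])
```

Comparing the counts for thresholds inside the interval (1,2) gives a first estimate of the manual checking workload that each choice implies.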

Thank you!
