Developing a Metadata Infrastructure for Information Access: What, Where, When and Who?
Prof. Ray R. Larson
University of California, Berkeley
School of Information

Overview
• Metadata as Infrastructure – What, Where, When and Who?
• What are Entry Vocabulary Indexes?
  – Notion of an EVI
  – How are EVIs Built
• Time Period Directories
  – Mining Metadata for new metadata
• 4W Demo
• New Project: Bringing Lives to Light

Metadata as Infrastructure
The difference between memorization and understanding lies in knowing the context and relationships of whatever is of interest. When setting out to learn about a new topic, a well-tested practice is to follow the traditional “5Ws and the H”: Who?, What?, When?, Where?, Why?, and How?

Metadata as Infrastructure
The reference collections of paper-based libraries provide a structured environment for resources, with encyclopedias and subject catalogs, gazetteers, chronologies, and biographical dictionaries, offering direct support for at least What, Where, When, and Who.
The digital environment does not yet provide an effective, and easily exploited, infrastructure comparable to the traditional reference library.

What?
Searching texts by topic, e.g. Dewey, LCSH, any subject index, or category scheme applied to documents.
Two kinds of mapping in every search:
• Documents are assigned to topic categories, e.g. Dewey
• Queries have to map to topic categories, e.g. Dewey’s Relativ Index from ordinary words/phrases to Decimal Classification numbers.
Also mapping between topic systems, e.g. US Patent classification and International Patent Classification.

‘What’ searches involve mapping to controlled vocabularies
[Diagram: Texts, Thesaurus/Ontology]

Building a Search Term Recommender
Start with a collection of documents.
Classify and index with controlled vocabulary.
Or use a pre-indexed collection.
[Diagram label: Index]

Problem: Controlled index vocabularies can be difficult for people to use.
For: “Wirtschaftspolitik”
In Library of Congress subject headings, use: “Economic Policy”
Index term “pass mtr veh spark ign eng” = “Automobile”
Solution: Entry Level Vocabulary Indexes (EVI).

“What” and Entry Vocabulary Indexes
EVIs are a means of mapping from user’s vocabulary to the controlled vocabulary of a collection of documents…

Building and Searching EVIs
User has a question but is unfamiliar with the domain he wants to search.
1. User selects a subject domain of interest. Domains to select from: Engineering, Medicine, Biology, Social science, etc.
2. Has an Entry Vocabulary Module been built?
   – YES: Use an existing EVI.
   – NO: Build an Entry Vocabulary Module (EVI):
     • Download a set of training data from an Internet DB indexed with a controlled vocabulary.
     • Extract terms (words and noun phrases) from titles and abstracts, using part-of-speech tagging for noun phrases.
     • Build associations between extracted terms & controlled vocabularies.
3. Searching: Map user’s query to a ranked list of controlled vocabulary terms. User selects search terms from the ranked list of terms returned by the EVI.

Technical Details
Building an Entry Vocabulary Module (EVI):
• Download a set of training data from an Internet DB indexed with a controlled vocabulary.
• Extract terms (words and noun phrases) from titles and abstracts, using part-of-speech tagging for noun phrases.
• Build associations between extracted terms & controlled vocabularies.

Association Measure
          t     ¬t
   C      a     c
  ¬C      b     d
Where t is the occurrence of a term and C is the occurrence of a class in the training set. So a counts records with both the term and the class, b records with the term but not the class, c records with the class but not the term, and d records with neither.

Association Measure
Maximum Likelihood ratio (see Dunning):
\[ W(C,t) = 2\,[\log L(p_1, a, a+b) + \log L(p_2, c, c+d) - \log L(p, a, a+b) - \log L(p, c, c+d)] \]
where
\[ \log L(p, k, n) = k \log(p) + (n - k) \log(1 - p) \]
and
\[ p_1 = \frac{a}{a+b}, \qquad p_2 = \frac{c}{c+d}, \qquad p = \frac{a+c}{a+b+c+d} \]

Alternatively
Because the “evidence” terms in EVIs can be considered a document, you can also use IR techniques and use the top-ranked classes for classification or query expansion.
[Diagram: Find “Plutonium” in Arabic, Chinese, Greek, Japanese, Korean, Russian, Tamil digital library resources, via the statistical association W(C,t)]

EVI example
User Query: “Automobile”
– EVI 1: Index term: “pass mtr veh spark ign eng”
– EVI 2: Index term: “automobiles” OR “internal combustion engines”

But why stop there?
[Diagram: several Index/EVI pairs] “Which EVI do I use?”

EVI to EVIs
[Diagram: EVI2 in front of several Index/EVI pairs]

Why not treat language the same way?
[Diagram: Find “Plutonium” in Arabic, Chinese, Greek, Japanese, Korean, Russian, Tamil]

Support for the Learner with a Query
Facet    Vocabulary                        Displays
WHAT     Thesaurus, e.g. LCSH              Cross-references
WHERE    Gazetteer                         Map
WHEN     Period directory                  Timeline
WHO      Biograph. dict., e.g. Who’s Who   Personal relations
Any catalog: Archives, Libraries, Museums, TV, Publishers
Any resource: Audio, Images, Texts, Numeric data, Objects, Virtual reality, Webpages

It is also difficult to move between different media forms
[Diagram: Texts, EVI, Thesaurus/Ontology, Numeric datasets]

Searching across data types
Different media can be linked indirectly via metadata, but often (e.g. for socio-economic numeric data series) you also need to specify WHERE to get correct results.

But texts associated with numeric data can be mapped as well…
[Diagram: Texts, EVI, Thesaurus/Ontology, EVI, captions, Numeric datasets]

But there are also geographic dependencies…
[Diagram: Texts, EVI, Thesaurus/Ontology, EVI, Maps/Geo Data, captions, Numeric datasets]

WHERE: Place names are problematic…
Variant forms: St. Petersburg, Санкт Петербург, Saint-Pétersbourg, . . .
Multiple names: Cluj, in Romania / Roumania / Rumania, is also called Klausenburg and Kolozsvar.
Name changes: Bombay → Mumbai.
Homographs: Vienna, VA, and Vienna, Austria; 50 Springfields.
Anachronisms: No Germany before 1870
Vague, e.g.
Midwest, Silicon Valley
Unstable boundaries: 19th century Poland; Balkans; USSR
Use a gazetteer!

WHERE
Geo-temporal search interface.
– Place names found in documents.
– Gazetteer provided lat. & long.
– Places displayed on map.
– Timebar
– Zoom on map.
– Click on place for a list of records.
– Click on record to display text.

So geographic search becomes part of the infrastructure
[Diagram: Texts, Maps/Geo Data, EVI, Thesaurus/Ontology, Gazetteers, captions, Numeric datasets]

WHEN: Search by time is also weakly supported…
Calendars are the standard for time
But people use the names of events to refer to time periods
Named time periods resemble place names in being:
– Unstable: European War, Great War, First World War
– Multiple: Second World War, Great Patriotic War
– Ambiguous: “Civil war” in different centuries in England, USA, Spain, etc.
Places have temporal aspects & periods have geographical aspects: When the Stone Age was, varies by region

Vocabularies are the key!
Want: Kung-fu movies?
Use LCSH: Hand-to-hand fighting, oriental, in motion pictures.

Linking vocabularies WHAT, WHERE, WHEN
Library subject headings: Topic – Geographic subdivision – Chronological subdivision
Place name gazetteer: Place name – Type – Spatial markers (Lat & long) – When
Time Period Directory: Period name – Type – Time markers (Calendar) – Where

Time period directories link via the place (or time)
[Diagram: Texts, Maps/Geo Data, EVI, Thesaurus/Ontology, Gazetteers, captions, Time Period Directory, Numeric datasets, Time lines, Chronologies]

WHEN: Time Period Directory
[Screenshot: Timeline; Link to Catalog; Link to Wikipedia]

WHO: Biographical Dictionary
Complex relationships

Life events metadata
WHAT (Actions): prisoner
WHERE (Places): Holstein
WHEN (Times): 1261-1262
WHO (People): Margaret Sambiria
Need external links

Any document, object, or performance
Connect it with its context – and other resources.
Facet    Vocabulary                        Displays
WHAT     Thesaurus, e.g. LCSH              Cross-references
WHERE    Gazetteer                         Map
WHEN     Period directory                  Timeline
WHO      Biograph. dict., e.g. Who’s Who   Personal relations
Any catalog: Archives, Libraries, Museums, TV, Publishers
Any resource: Audio, Images, Texts, Numeric data, Objects, Virtual reality, Webpages

Demo of search interface
[Screenshots, with captions:]
– Entry Vocabulary Index suggests correct LCSH with different spelling
– Related places
– Potentially related people
– Potentially related periods
– Mostly in India, 16th–18th century
– Find out more about this area. Different Browsing Options!
– Zooming in to South Asia
– Select
– Restricting time frame
– More information about the country of India…
– Berkeley Natural History Museums, Wikipedia, BBC, CIA Factbook, Ethnologue
– Historical events, linked to Library catalog & Wikipedia: none avail. for this time period

ECAI Cultural Atlases: presenting history in its geographical & chronological contexts
Mongol Empire

Video Demo Interface
http://ecai.berkeley.edu/imls2004/imls4w/

New Project: Bringing Lives to Light: Biography in Context
Ray R. Larson, Michael Buckland, Fredric Gey
University of California, Berkeley

Overview
• Focussing on the Who in Who, What, Where and When
• Types of Biographical Markup

WHEN, WHERE and WHO
Catalog records found from a time period search commonly include names of persons important at that time. Their names can be forwarded to, e.g., biographies in the Wikipedia encyclopedia.
Place and time are broadly important across numerous tools and genres including, e.g., Language atlases, Library catalogs, Biographical dictionaries, Bibliographies, Archival finding aids, Museum records, etc.
Biographical dictionaries are also heavy on place and time:
Emanuel Goldberg, Born Moscow 1881. PhD under Wilhelm Ostwald, Univ. of Leipzig, 1906. Director, Zeiss Ikon, Dresden, 1926-33. Moved to Palestine 1937. Died Tel Aviv, 1970.
Life as a series of episodes involving Activity (WHAT), WHERE, WHEN, and WHO else.
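
The idea of a life as a series of episodes can be sketched as a small data structure. The Python below is only an illustration, not the project's markup or schema: the class name LifeEpisode, its field names, and the two helper functions are hypothetical, and the activity labels paraphrase the Emanuel Goldberg entry quoted above.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class LifeEpisode:
    """One episode of a life: Activity (WHAT), WHERE, WHEN, and WHO else.

    Hypothetical structure for illustration only.
    """
    what: str                       # activity
    where: str                      # place name, as given in the source
    when: Tuple[int, int]           # (first year, last year), inclusive
    who_else: Tuple[str, ...] = ()  # other people involved


# Episodes taken from the Emanuel Goldberg entry in the slide text.
GOLDBERG: List[LifeEpisode] = [
    LifeEpisode("Born", "Moscow", (1881, 1881)),
    LifeEpisode("PhD", "Univ. of Leipzig", (1906, 1906), ("Wilhelm Ostwald",)),
    LifeEpisode("Director, Zeiss Ikon", "Dresden", (1926, 1933)),
    LifeEpisode("Moved", "Palestine", (1937, 1937)),
    LifeEpisode("Died", "Tel Aviv", (1970, 1970)),
]


def episodes_during(episodes, start, end):
    """WHEN search: episodes whose year range overlaps [start, end]."""
    return [e for e in episodes if e.when[0] <= end and e.when[1] >= start]


def episodes_at(episodes, place):
    """WHERE search: episodes recorded at exactly this place name."""
    return [e for e in episodes if e.where == place]


if __name__ == "__main__":
    for episode in episodes_during(GOLDBERG, 1930, 1940):
        print(episode.what, episode.where, episode.when)
```

Matching places by exact string is a simplification. As the WHERE slides note, place names have variant forms, so a real system would resolve them through a gazetteer first.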
A new form of biographical dictionary would link to all
[Diagram: Biographical Dictionary, Texts, Maps/Geo Data, EVI, Thesaurus/Ontology, Gazetteers, captions, Time Period Directory, Numeric datasets, Time lines, Chronologies]

Projected Work
• Develop XML markup for Biographical Events
  – Most likely to be adaptation and extension of existing biographical event markup
  – Example: EAC/EAD
• Harvest biographical resources
  – Wikipedia, etc.
• Integrate as next generation of current interface

EAC/EAD
<bioghist>
  <head>Biographical Note</head>
  <chronlist>
    <chronitem>
      <date>1892, May 7</date>
      <event>Born, <geogname>Glencoe, Ill.</geogname></event>
    </chronitem>
    <chronitem>
      <date>1915</date>
      <event>A.B., <corpname>Yale University, </corpname>New Haven, Conn.</event>
    </chronitem>
    <chronitem>
      <date>1916</date>
      <event>Married <persname>Ada Hitchcock</persname> </event>
    </chronitem>
    <chronitem>
      <date>1917-1919</date>
      <event>Served in <corpname>United States Army</corpname></event>
    </chronitem>
  </chronlist>
</bioghist>

Wikipedia data
Life events metadata
WHAT (Actions): prisoner
WHERE (Places): Holstein
WHEN (Times): 1261-1262
WHO (People): Margaret Sambiria
Need external links

A Metadata Infrastructure
CATALOGS: Archives, Historical Societies, Libraries, Museums, Public Television, Publishers, Booksellers
INTERMEDIA INFRASTRUCTURE:
Facet    Authority Control         Special Display Tools
WHAT     Thesaurus                 Syndetic Structure
WHERE    Gazetteer                 Maps
WHEN     Time Period Directory     Timelines
WHO      Biographical Dictionary   Dossiers
RESOURCES: Audio, Images, Numeric Data, Objects, Texts, Virtual Reality, Webpages
Learners

Acknowledgements
Electronic Cultural Atlas Initiative project
This work is being supported by the Institute of Museum and Library Services through a National Leadership Grant for Libraries
Contact: [email protected]
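
As a supplementary sketch for the EAC/EAD example above: the chronlist markup already carries WHEN in the date element and WHERE and WHO in geogname and persname, so it can be read into simple life-event records. The Python below uses only the standard library's xml.etree.ElementTree. The function name parse_chronlist, the helper names, and the dictionary keys are assumptions made for illustration, not part of EAD or of the project's planned markup.

```python
import xml.etree.ElementTree as ET

# The <bioghist> sample from the slide, reproduced as a string.
BIOGHIST = """<bioghist>
  <head>Biographical Note</head>
  <chronlist>
    <chronitem><date>1892, May 7</date>
      <event>Born, <geogname>Glencoe, Ill.</geogname></event></chronitem>
    <chronitem><date>1915</date>
      <event>A.B., <corpname>Yale University, </corpname>New Haven, Conn.</event></chronitem>
    <chronitem><date>1916</date>
      <event>Married <persname>Ada Hitchcock</persname> </event></chronitem>
    <chronitem><date>1917-1919</date>
      <event>Served in <corpname>United States Army</corpname></event></chronitem>
  </chronlist>
</bioghist>"""


def _clean(text):
    """Collapse runs of whitespace; treat a missing value as empty."""
    return " ".join((text or "").split())


def _names(event, tag):
    """Text of every <tag> inside the event, without trailing commas."""
    return [_clean(el.text).rstrip(",") for el in event.iter(tag)]


def parse_chronlist(xml_text):
    """Turn each <chronitem> into a WHAT/WHERE/WHEN/WHO record (hypothetical keys)."""
    root = ET.fromstring(xml_text)
    episodes = []
    for item in root.iter("chronitem"):
        event = item.find("event")
        if event is None:
            continue
        episodes.append({
            "when": _clean(item.findtext("date")),
            "what": _clean("".join(event.itertext())),  # full event text
            "where": _names(event, "geogname"),
            "who": _names(event, "persname"),
            "corp": _names(event, "corpname"),
        })
    return episodes


if __name__ == "__main__":
    for episode in parse_chronlist(BIOGHIST):
        print(episode)
```

The dates are kept as the strings found in the markup. Turning values such as "1892, May 7" or "1917-1919" into calendar ranges, and resolving place names against a gazetteer, would be separate steps.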