MEBI 591C/598 – Text Mining/NLP Subproblems

Meliha Yetisgen-Yildiz

From last week’s discussion

 

Presentation

  Schedule: http://faculty.washington.edu/melihay/MEBI591C.htm

50 minutes: presentation + discussion + question answering

Content:
• Paper
  – Conference or journal article
• Research/Project Idea
  – Motivation + Problem + Potential Solution
• Survey or literature review
  – A general area
      Text mining: named entity recognition - gene name identification
      Data mining: classification, clustering
• Available resources for a given area
  – Open source libraries
  – Data resources

Preparation:
• Email the plan + reading list at least 3 days prior to class
• GoMap Discussion List

System Design

• Team: Marcin, Wynona, Karl, Stella, Francisco, Jeffry, Safiyyah (not registered)
• Example data released: https://www.i2b2.org/NLP/Relations/Documentation.php

The fourth i2b2 challenge is a three-tiered challenge that studies:
1. extraction of medical problems, tests, and treatments
2. classification of assertions made on medical problems
3. relations of medical problems, tests, and treatments

2010 - i2b2 Challenge

Important Dates:
• March 5th – Registration opens
• April 15th – Commitment to Participate in Challenge & Training Data Release
• July 15th – Test Data Release
• September 1st – Short papers due
• October 1st – Invitations to present at the Workshop
• November, 2010 – Workshop

Preparations:
• Linux server + accounts (meliha)
  – Accounts
  – Dev environment
  – Subversion?

Text Mining/NLP Sub-problems – Part 1

• Sentence Delimiters
• Tokenizers
• Part-of-Speech Tags
• Collocations

   

Sentence Delimiters

Document -> Paragraph -> Sentences

Sentence boundary disambiguation (SBD) is the problem in NLP of deciding where sentences begin and end.

Sentence boundary identification is challenging because punctuation marks are often ambiguous.
• A period may denote:
  – An abbreviation
  – A decimal point
  – Part of an email address
• About 47% of the periods in the Wall Street Journal corpus denote abbreviations.
• Question marks and exclamation marks may appear in embedded quotations, emoticons, computer code, and slang.

Tools:
• OpenNLP has a class for sentence detection
• NaCTeM: http://text0.mib.man.ac.uk:8080/scottpiao/sent_detector
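The ambiguity described above can be illustrated with a minimal rule-based splitter. This is only a sketch under stated assumptions: the abbreviation list is a tiny hypothetical one, and the function name split_sentences is not taken from OpenNLP or any other tool named on the slide.

```python
import re

# Hypothetical, deliberately tiny abbreviation list; a real system needs far more.
ABBREVIATIONS = {"dr", "mr", "mrs", "ms", "prof", "e.g", "i.e", "vs", "inc"}

def split_sentences(text):
    """Split at '.', '!' or '?' followed by whitespace or end of text,
    unless a single period ends a known abbreviation."""
    sentences = []
    start = 0
    # Periods inside "3.50" or "smith@example.org" are followed by a
    # non-space character, so the lookahead never matches them.
    for match in re.finditer(r"[.!?]+(?=\s|$)", text):
        words_so_far = text[start:match.start()].split()
        last_word = words_so_far[-1].lower() if words_so_far else ""
        if match.group() == "." and last_word in ABBREVIATIONS:
            continue  # abbreviation period, not a sentence boundary
        sentences.append(text[start:match.end()].strip())
        start = match.end()
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences

if __name__ == "__main__":
    demo = "Dr. Smith paid 3.50 dollars. Email him at smith@example.org today! Is that OK?"
    for sentence in split_sentences(demo):
        print(sentence)
```

A sentence that genuinely ends in an abbreviation is merged with the next one by this sketch, which is one reason statistical sentence detectors are used in practice.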

 

Tokenization

Document -> Paragraph -> Sentence -> Tokens

Based on white-space characters. In Unicode (Unicode Character Database) the following codepoints are defined as whitespace:
• U+0009–U+000D (control characters, containing Tab, CR and LF)
• U+0020 SPACE
• U+0085 NEL (control character next line)
• U+00A0 NBSP (NO-BREAK SPACE)
• U+1680 OGHAM SPACE MARK
• U+180E MONGOLIAN VOWEL SEPARATOR (no longer classified as whitespace since Unicode 6.3)
• U+2000–U+200A (different sorts of spaces)
• U+2028 LS (LINE SEPARATOR)
• U+2029 PS (PARAGRAPH SEPARATOR)
• U+202F NNBSP (NARROW NO-BREAK SPACE)
• U+205F MMSP (MEDIUM MATHEMATICAL SPACE)
• U+3000 IDEOGRAPHIC SPACE
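A minimal sketch of whitespace-based tokenization in Python follows. The function names are hypothetical, and the punctuation-splitting variant is a common refinement added here for illustration, not something the slide prescribes.

```python
import re

def whitespace_tokenize(sentence):
    """Split on runs of whitespace. With no argument, str.split() also
    treats Unicode spaces such as NBSP (U+00A0) and EM SPACE (U+2003)
    as separators."""
    return sentence.split()

def simple_tokenize(sentence):
    """Also separate punctuation marks from the words they touch."""
    return re.findall(r"\w+|[^\w\s]", sentence)

if __name__ == "__main__":
    print(whitespace_tokenize("the\u00a0girl\tkissed\u2003the boy."))
    print(simple_tokenize("the girl kissed the boy."))
```

Pure whitespace splitting leaves "boy." as one token, which is why most tokenizers add punctuation handling on top of it.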

Part-of-Speech Tagging

“The process of assigning a part-of-speech or other lexical class marker to each word in a corpus” (Jurafsky and Martin)

WORDS   the   girl   kissed   the   boy   on   the   cheek
TAGS    DET   N      V        DET   N     P    DET   N

Penn Treebank POS Tags

1. CC Coordinating conjunction
2. CD Cardinal number
3. DT Determiner
4. EX Existential there
5. FW Foreign word
6. IN Preposition or subordinating conjunction
7. JJ Adjective
8. JJR Adjective, comparative
9. JJS Adjective, superlative
10. LS List item marker
11. MD Modal
12. NN Noun, singular or mass
13. NNS Noun, plural
14. NNP Proper noun, singular
15. NNPS Proper noun, plural
16. PDT Predeterminer
17. POS Possessive ending
18. PRP Personal pronoun
19. PRP$ Possessive pronoun
20. RB Adverb
21. RBR Adverb, comparative
22. RBS Adverb, superlative
23. RP Particle
24. SYM Symbol
25. TO to
26. UH Interjection
27. VB Verb, base form
28. VBD Verb, past tense
29. VBG Verb, gerund or present participle
30. VBN Verb, past participle
31. VBP Verb, non-3rd person singular present
32. VBZ Verb, 3rd person singular present
33. WDT Wh-determiner
34. WP Wh-pronoun
35. WP$ Possessive wh-pronoun
36. WRB Wh-adverb

Applications of Tagging

• Partial parsing: syntactic analysis.
• Information Extraction: tagging and partial parsing help identify useful terms and relationships between them.
• Information Retrieval: noun phrase recognition and query-document matching based on meaningful units rather than individual terms.
• Question Answering: analyzing a query to understand what type of entity the user is looking for and how it is related to other noun phrases mentioned in the question.

Information Sources in Tagging

  

How do we decide the correct POS for a word?

Syntagmatic Information: Look at tags of other words in the context of the word we are interested in.

Lexical Information: Predicting a tag based on the word concerned. Words that can take several parts of speech usually occur predominantly as one particular part of speech.
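Lexical information alone already gives a usable baseline: tag every word with the tag it received most often in training. The sketch below assumes a toy hand-made corpus (TOY_CORPUS is hypothetical, not course data) and defaults unseen words to NN.

```python
from collections import Counter, defaultdict

# Hypothetical toy training data: sentences as lists of (word, tag) pairs.
TOY_CORPUS = [
    [("the", "DT"), ("girl", "NN"), ("kissed", "VBD"), ("the", "DT"), ("boy", "NN")],
    [("the", "DT"), ("play", "NN"), ("ended", "VBD")],
    [("they", "PRP"), ("play", "VBP"), ("chess", "NN")],
    [("the", "DT"), ("play", "NN"), ("was", "VBD"), ("long", "JJ")],
]

def train_most_frequent_tag(tagged_sentences):
    """Map each word to the tag it carries most often in the training data."""
    counts = defaultdict(Counter)
    for sentence in tagged_sentences:
        for word, tag in sentence:
            counts[word.lower()][tag] += 1
    return {word: tags.most_common(1)[0][0] for word, tags in counts.items()}

def tag_words(words, model, default="NN"):
    """Tag each word independently of its context; unseen words get the default."""
    return [(word, model.get(word.lower(), default)) for word in words]

if __name__ == "__main__":
    model = train_most_frequent_tag(TOY_CORPUS)
    print(tag_words(["they", "play"], model))
```

Because "play" is a noun in two of its three training occurrences, the baseline tags it NN even after "they". Fixing that case requires the syntagmatic information described above.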

POS Approaches – Rule-Based

Basic idea:
– Assign all possible tags to words
– Remove tags according to a set of rules of the type: if word+1 is an adjective, adverb, or quantifier, and the word after that is a sentence boundary, and word-1 is not a verb like “consider”, then eliminate non-adverb tags; else eliminate the adverb tag.

Typically more than 1000 hand-written rules, but rules may also be machine learned.
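The two steps above can be sketched for the single rule quoted on the slide, applied to the word "that". Everything concrete here is an assumption made for illustration: the mini-lexicon, the coarse tag names, and the list of verbs like "consider".

```python
# Hypothetical mini-lexicon listing every tag each word may take.
LEXICON = {
    "it": {"PRON"}, "i": {"PRON"}, "is": {"V"}, "consider": {"V"},
    "not": {"ADV"}, "odd": {"ADJ"}, ".": {"SENT"},
    "that": {"ADV", "PRON", "DET", "COMP"},
}
# Hypothetical list of verbs that behave like "consider".
CONSIDER_LIKE = {"consider", "deem", "find"}

def candidate_tags(words):
    """Step 1: assign all possible tags to each word (unknown words get N)."""
    return [set(LEXICON.get(word.lower(), {"N"})) for word in words]

def apply_adverbial_that_rule(words, tags):
    """Step 2: remove tags from 'that' according to the rule on the slide."""
    result = [set(t) for t in tags]
    for i, word in enumerate(words):
        if word.lower() != "that" or "ADV" not in result[i]:
            continue
        next_ok = i + 1 < len(words) and bool(result[i + 1] & {"ADJ", "ADV", "QUANT"})
        # Assumption: running out of input counts as a sentence boundary.
        boundary = i + 2 >= len(words) or "SENT" in result[i + 2]
        prev_ok = i == 0 or words[i - 1].lower() not in CONSIDER_LIKE
        if next_ok and boundary and prev_ok:
            result[i] = {"ADV"}                              # eliminate non-adv
        else:
            result[i] = (result[i] - {"ADV"}) or result[i]   # eliminate adv
    return result

if __name__ == "__main__":
    for text in ("It is not that odd .", "I consider that odd ."):
        words = text.split()
        print(text, "->", apply_adverbial_that_rule(words, candidate_tags(words)))
```

In "It is not that odd ." the rule keeps only the adverb reading; in "I consider that odd ." it removes the adverb reading and leaves the remaining candidates for other rules to resolve.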

POS Approaches – Machine Learning

• Based on the probability of a certain tag occurring given various possibilities
• Requires a training corpus
• Training corpus may be different from test corpus

Examples:
• Hidden Markov Model Taggers
• Transformation Based Taggers
• Maximum Entropy Taggers

Ling572 (Advanced Statistical Methods in NLP): http://courses.washington.edu/ling572/winter10/teaching_slides/new_syllabus.htm
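As one illustration of the first example, here is a minimal bigram HMM tagger with Viterbi decoding. It is a sketch, not any of the downloadable taggers: the toy corpus, the add-k smoothing constant of 0.1, and all function names are assumptions made for this example.

```python
import math
from collections import Counter, defaultdict

# Hypothetical toy training data: sentences as lists of (word, tag) pairs.
TOY_CORPUS = [
    [("the", "DT"), ("girl", "NN"), ("kissed", "VBD"), ("the", "DT"), ("boy", "NN")],
    [("the", "DT"), ("play", "NN"), ("ended", "VBD")],
    [("they", "PRP"), ("play", "VBP"), ("chess", "NN")],
    [("the", "DT"), ("play", "NN"), ("was", "VBD"), ("long", "JJ")],
]
SMOOTHING = 0.1  # add-k smoothing constant (an arbitrary choice)

def train_hmm(tagged_sentences):
    """Count tag-to-tag transitions and tag-to-word emissions."""
    trans = defaultdict(Counter)  # trans[previous_tag][tag]
    emit = defaultdict(Counter)   # emit[tag][word]
    vocab = set()
    for sentence in tagged_sentences:
        prev = "<s>"
        for word, tag in sentence:
            trans[prev][tag] += 1
            emit[tag][word.lower()] += 1
            vocab.add(word.lower())
            prev = tag
    return trans, emit, vocab

def log_prob(counter, key, n_outcomes):
    """Smoothed log probability of key under the counts in counter."""
    total = sum(counter.values())
    return math.log((counter[key] + SMOOTHING) / (total + SMOOTHING * n_outcomes))

def viterbi(words, trans, emit, vocab):
    """Return the most probable tag sequence for words."""
    if not words:
        return []
    tags = list(emit)
    n_tags, n_words = len(tags), len(vocab) + 1  # +1 reserves mass for unseen words
    # best[tag] = (log probability of the best path ending in tag, that path)
    best = {t: (log_prob(trans["<s>"], t, n_tags)
                + log_prob(emit[t], words[0].lower(), n_words), [t]) for t in tags}
    for word in words[1:]:
        new_best = {}
        for t in tags:
            score, path = max(((best[p][0] + log_prob(trans[p], t, n_tags), best[p][1])
                               for p in tags), key=lambda pair: pair[0])
            new_best[t] = (score + log_prob(emit[t], word.lower(), n_words), path + [t])
        best = new_best
    return max(best.values(), key=lambda pair: pair[0])[1]

if __name__ == "__main__":
    trans, emit, vocab = train_hmm(TOY_CORPUS)
    print(viterbi(["they", "play"], trans, emit, vocab))
    print(viterbi(["the", "play"], trans, emit, vocab))
```

On this toy data the tagger assigns VBP to "play" after "they" and NN after "the", because the transition counts favor PRP followed by VBP and DT followed by NN.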

Tagging Accuracy

 

Ranges from 95% to 97%.

Depends on:
• Amount of training data available.
• Difference between the training corpus and dictionary on the one hand and the corpus of application on the other.
• Unknown words in the corpus of application.
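To see how unknown words affect the figure, accuracy can be reported separately for words that did and did not occur in the training data. A minimal sketch with a hypothetical function name and made-up example data:

```python
def tagging_accuracy(gold, predicted, training_vocab):
    """Return accuracy over all tokens, known tokens, and unknown tokens.

    gold and predicted are parallel lists of (word, tag) pairs; a bucket
    with no tokens gets None instead of a ratio."""
    totals = {"all": 0, "known": 0, "unknown": 0}
    correct = {"all": 0, "known": 0, "unknown": 0}
    for (word, gold_tag), (_, predicted_tag) in zip(gold, predicted):
        bucket = "known" if word.lower() in training_vocab else "unknown"
        for key in ("all", bucket):
            totals[key] += 1
            correct[key] += int(gold_tag == predicted_tag)
    return {key: (correct[key] / totals[key] if totals[key] else None) for key in totals}

if __name__ == "__main__":
    gold = [("the", "DT"), ("girl", "NN"), ("googled", "VBD"), ("it", "PRP")]
    predicted = [("the", "DT"), ("girl", "NN"), ("googled", "NN"), ("it", "PRP")]
    print(tagging_accuracy(gold, predicted, {"the", "girl", "it"}))
```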

Tagging Unknown Words

• New words are added to (newspaper) language at a rate of 20+ per month
• Plus many proper names …
• Unknown words increase error rates by 1-2%

• Method 1: assume they are nouns.
• Method 2: assume the unknown words have a tag probability distribution similar to that of words occurring only once in the training set.
• Method 3: use morphological information, e.g., words ending in -ed tend to be tagged VBN.
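The three methods can be sketched as follows. Only the -ed rule comes from the slide; the capitalization check and the other suffixes are assumptions added for illustration, and both function names are hypothetical.

```python
from collections import Counter

def guess_tag(word):
    """Method 3 with a Method 1 fallback."""
    if not word:
        return "NN"
    if word[0].isupper():
        return "NNP"   # assumption: capitalized unknown words are proper names
    if word.endswith("ed"):
        return "VBN"   # the rule given on the slide
    if word.endswith("ing"):
        return "VBG"   # assumption
    if word.endswith("ly"):
        return "RB"    # assumption
    return "NN"        # Method 1: assume a noun

def hapax_tag_distribution(tagged_words):
    """Method 2: tag distribution of words that occur exactly once."""
    word_counts = Counter(word.lower() for word, _ in tagged_words)
    hapax_tags = Counter(tag for word, tag in tagged_words
                         if word_counts[word.lower()] == 1)
    total = sum(hapax_tags.values())
    return {tag: count / total for tag, count in hapax_tags.items()}

if __name__ == "__main__":
    print(guess_tag("phosphorylated"), guess_tag("blog"))
    sample = [("the", "DT"), ("girl", "NN"), ("the", "DT"),
              ("boy", "NN"), ("ran", "VBD"), ("quickly", "RB")]
    print(hapax_tag_distribution(sample))
```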

POS Taggers

Freely downloadable Part of Speech Taggers

• Stanford POS tagger: Loglinear tagger in Java (by Kristina Toutanova).
• hunpos: An HMM tagger with models available for English and Hungarian. A reimplementation of TnT in OCaml. Pre-compiled models. Runs on Linux, Mac OS X, and Windows.
• MBT: Memory-based Tagger. Based on TiMBL.
• TreeTagger: A decision tree based tagger from the University of Stuttgart (Helmut Schmid). It's language independent, but comes complete with parameter files for English, German, Italian, Dutch, French, Old French, Spanish, Bulgarian, and Russian. (Linux, Sparc-Solaris, Windows, and Mac OS X versions. Binary distribution only.) Page has links to sites where you can run it online.
• SVMTool: POS tagger based on SVMs (uses SVMlight). LGPL.
• ACOPOST (formerly ICOPOST): Open source C taggers originally written by Ingo Schröder. Implements maximum entropy, HMM trigram, and transformation-based learning. C source available under the GNU public license.
• MXPOST: Adwait Ratnaparkhi's Maximum Entropy part-of-speech tagger. Java POS tagger. A sentence boundary detector (MXTERMINATOR) is also included. Original version was only JDK1.1; later version worked with JDK1.3+. Class files, not source.
• fnTBL: A fast and flexible implementation of Transformation-Based Learning in C++. Includes a POS tagger, but also NP chunking and general chunking models.
• mu-TBL: An implementation of a Transformation-based Learner (a la Brill), usable for POS tagging and other things, by Torbjörn Lager. Web demo also available. Prolog.
• YamCha: SVM-based NP-chunker, also usable for POS tagging, NER, etc. C/C++ open source. Won CoNLL 2000 shared task. (Less automatic than a specialized POS tagger for an end user.)
• QTAG: An HMM-based Java POS tagger from Birmingham U. (Oliver Mason). English and German parameter files. [Java class files, not source.]

Collocations

A collocation is an expression consisting of two or more words that corresponds to some conventional way of saying things.

Methods:
• Simplest solution: counting
  – Google 5-gram corpus (2006), sample entries with counts:
      ceramics collectables fine    130
      ceramics collected by         52
      ceramics collection ,         144
      ceramics collection .         247
• Use POS tags
• Use noun phrase chunking / parsing
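The first two methods can be sketched in a few lines. The tag patterns used for filtering (adjective-noun and noun-noun) are an assumption chosen for illustration, and the function names are hypothetical.

```python
from collections import Counter

def bigram_counts(tokens):
    """Simplest solution: count every pair of adjacent tokens."""
    return Counter(zip(tokens, tokens[1:]))

def pos_filtered_bigrams(tagged_tokens, patterns=(("JJ", "NN"), ("NN", "NN"))):
    """Count only bigrams whose tag pair matches one of the given patterns."""
    counts = Counter()
    for (word1, tag1), (word2, tag2) in zip(tagged_tokens, tagged_tokens[1:]):
        if (tag1, tag2) in patterns:
            counts[(word1, word2)] += 1
    return counts

if __name__ == "__main__":
    tokens = "new york is a big city and new york never sleeps".split()
    print(bigram_counts(tokens).most_common(3))
    tagged = [("strong", "JJ"), ("tea", "NN"), ("is", "VBZ"),
              ("strong", "JJ"), ("tea", "NN")]
    print(pos_filtered_bigrams(tagged))
```

Raw counting tends to be dominated by pairs of function words on real text, which is the motivation for the POS-based filter.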

NLP/Text Mining POINTERS

NLP Books

• Manning and Schütze, Foundations of Statistical Natural Language Processing (MIT Press, 1999).
• Jurafsky, Daniel, and James H. Martin. 2009. Speech and Language Processing: An Introduction to Natural Language Processing, Speech Recognition, and Computational Linguistics. 2nd edition. Prentice-Hall.

Books on Regular Expressions

• Jeffrey E.F. Friedl, Mastering Regular Expressions, O’Reilly.
• Jan Goyvaerts, Regular Expressions Cookbook, O’Reilly.

NLP Research Groups

• Stanford NLP Group: http://nlp.stanford.edu/
• CMU NLP Group: http://www.cs.cmu.edu/~nasmith/nlp-cl.html
• UPenn NLP Group: http://nlp.cis.upenn.edu/
• NaCTeM – National Centre for Text Mining: http://www.nactem.ac.uk/
• UW – Turing Center: http://turing.cs.washington.edu/

    

NLP Libraries

• List of tools from the Stanford NLP webpage: http://nlp.stanford.edu/links/statnlp.html
• MALLET – Machine Learning for Language Toolkit (UMass): MALLET is a Java-based package for statistical natural language processing, document classification, clustering, topic modeling, information extraction, and other machine learning applications to text. http://mallet.cs.umass.edu/
• MinorThird (CMU): MinorThird is a collection of Java classes for storing text, annotating text, and learning to extract entities and categorize text. http://sourceforge.net/apps/trac/minorthird/wiki
• OpenNLP: OpenNLP hosts a variety of Java-based NLP tools which perform sentence detection, tokenization, POS tagging, chunking and parsing, named-entity detection, and coreference using the OpenNLP Maxent machine learning package. http://opennlp.sourceforge.net/
• GATE: General architecture for NLP tasks. http://gate.ac.uk/

Biomedical NLP and Text Mining Tools

• MetaMap (MMTx) – NLM: http://mmtx.nlm.nih.gov/
• NegEx, ConText – University of Pittsburgh – BluLab: http://www.dbmi.pitt.edu/blulab/index.html
• cTAKES – Mayo Clinic: https://cabig-kc.nci.nih.gov/Vocab/KC/index.php/OHNLP_Documentation_and_Downloads

            

Biomedical Text Mining Tools

Chilibot — A tool for finding relationships between genes or gene products.

EBIMed - EBIMed is a web application that combines Information Retrieval and Extraction from Medline.

FABLE - A gene-centric text-mining search engine for Medline.

GOAnnotator - An online tool that uses semantic similarity for verification of electronic protein annotations using GO terms automatically extracted from literature.

GoPubMed — retrieves Medline abstracts for your search query, then detects ontology terms from the Gene Ontology and Medical Subject Headings in the abstracts and allows the user to browse the search results by exploring the ontologies and displaying only papers mentioning selected terms, their synonyms or descendants.

Information Hyperlinked Over Proteins (iHOP): "A network of concurring genes and proteins extends through the scientific literature touching on phenotypes, pathologies and gene function. iHOP provides this network as a natural way of accessing millions of Medline abstracts. By using genes and proteins as hyperlinks between sentences and abstracts, the information in Medline can be converted into one navigable resource, bringing all advantages of the internet to scientific literature research."

LitInspector - Gene and signal transduction pathway data mining in Medline abstracts.

NextBio - Life sciences search engine with a text mining functionality that utilizes Medline abstracts and clinical trials to return concepts relevant to the query based on a number of heuristics including ontology relationships, journal impact, publication date, and authorship.

PubAnatomy — An interactive visual search engine that provides new ways to explore relationships among Medline literature, text mining results, anatomical structures, gene expression and other background information.

PubGene - Co-occurrence network display of gene and protein symbols as well as MeSH, GO, PubChem and interaction terms (such as "binds" or "induces") as these appear in Medline records (that is, PubMed titles and abstracts).

TexFlame - An online tool that renders a single Medline abstract as a Systems Biology Graphical Notation (SBGN)-like graph. The graph is a complete syntactic-semantic representation of the abstract.

Whatizit - Whatizit is great at identifying molecular biology terms and linking them to publicly available databases.

XTractor - Discovering Newer Scientific Relations Across PubMed Abstracts. A tool to obtain manually annotated, expert-curated relationships for proteins, diseases, drugs and biological processes as they get published in Medline.

Literature-based discovery tools

  

Arrowsmith - UIC-based site for searching links between two literatures within Medline. Also contains the Author-ity tool for disambiguating authors on scientific papers, and the Anne O'Tate tool for summarizing the results of a PubMed query.

BITOLA - Helps biomedical researchers make new discoveries by discovering potentially new relations between biomedical concepts.

Manjal - Another LBD tool, by Padmini Srinivasan.