Transcript Sanchay
Sanchay and other NLP Tools
Himanshu Sharma, Sambhav Jain

Sanchay
• Sanchay ⇔ संचय (http://sanchay.co.in/)
  – A collection of tools and APIs for language processing
  – An open source platform
  – Especially for South Asian languages

Sanchay and NLP Tools 2

Sanchay - Installation
• Platform independent: Windows/Linux
• Prerequisite: Sun (now Oracle) JDK 1.6
• Download the binaries
• Extract the .zip or .tgz
• Go to the extracted directory
• Ready!

Sanchay - Modules
• Editors – text, RTF, HTML
• Tree Creator
• Syntactic Annotation
• Alignment tools
  – Sentence
  – Word

Shallow Parser
• 9 Indian languages
  – Hindi, Kannada, Malayalam, Marathi, Tamil, Telugu, Bengali, Punjabi, Urdu
• Does tokenization + morph analysis + POS tagging + chunking
• Linux platform
• http://ltrc.iiit.ac.in/showfile.php?filename=downloads/shallow_parser.php

Shallow Parser - Installation
• Dependencies
  – ‘dos2unix’ & ‘unix2dos’ must be installed
• Download and extract
• Install
• If libgdbm.so.2 doesn’t exist in /usr/lib/, then
  – sudo cp /usr/lib/libgdbm.so.3 /usr/lib/libgdbm.so.2

TNT POS Tagger
• TNT Tagger [http://www.coli.uni-saarland.de/~thorsten/tnt/]
• Train
  – tnt-para data.txt
  – Generates data.123 & data.lex
• Tag
  – tnt data file
• Evaluate
  – tnt-diff goldfile taggedfile

CRF++ - Chunker
• CRF++ [http://crfpp.googlecode.com/svn/trunk/doc/index.html]
• Separate binaries for Linux as well as Windows
• Installation
  – ./configure
  – make
  – make install

CRF++ - Chunker
• Train
  – ./crf_learn template train_file model
• Tag/Test
  – ./crf_test -m model testfile

Malt Parser (dependency parsing)
• MaltParser [http://www.maltparser.org/]
• Train
  – java -jar malt.jar -c model -i inputfile -m learn
• Test
  – java -jar malt.jar -c model -i testfile -o output -m parse

Other NLP Tools
• Toolkits
  – NLTK (Python) [http://nltk.org/]
  – OpenNLP (Java) [http://opennlp.apache.org/]
  – LingPipe (Java) [http://alias-i.com/lingpipe/]
• Frameworks
  – GATE [http://gate.ac.uk/]
  – Apache UIMA [http://uima.apache.org/]
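
The CRF++ chunker slides give the crf_learn and crf_test invocations but not what the template and train_file arguments contain. The following is a minimal sketch of those two inputs, assuming a three-column layout (word, POS tag, chunk label). The toy sentences, the tag set, the feature template, and the helper names to_crfpp and write_inputs are illustrative assumptions, not taken from the slides. CRF++ itself expects one token per line, whitespace-separated columns with the label last, and an empty line between sentences.

```python
import tempfile
from pathlib import Path
from typing import List, Tuple

# One token = (word, POS tag, chunk label); the label must be the last column.
Token = Tuple[str, str, str]
Sentence = List[Token]

# Made-up toy data (assumption): not taken from any treebank.
TOY_SENTENCES: List[Sentence] = [
    [("He", "PRP", "B-NP"), ("reads", "VBZ", "B-VP"), ("books", "NNS", "B-NP")],
    [("Birds", "NNS", "B-NP"), ("fly", "VBP", "B-VP")],
]

# Illustrative feature template (assumption). %x[row,col] selects a column
# relative to the current token; a lone "B" adds label-bigram features.
TEMPLATE = """\
# Unigram features over words (column 0) and POS tags (column 1)
U00:%x[-1,0]
U01:%x[0,0]
U02:%x[1,0]
U03:%x[0,1]
U04:%x[-1,1]/%x[0,1]

# Bigram features over adjacent output labels
B
"""


def to_crfpp(sentences: List[Sentence]) -> str:
    """Serialize sentences as CRF++ columns, one blank line between sentences."""
    blocks = []
    for sentence in sentences:
        lines = [" ".join(columns) for columns in sentence]
        blocks.append("\n".join(lines))
    return "\n\n".join(blocks) + "\n"


def write_inputs(out_dir: Path) -> None:
    """Write train_file and template, the two inputs that crf_learn reads."""
    (out_dir / "train_file").write_text(to_crfpp(TOY_SENTENCES), encoding="utf-8")
    (out_dir / "template").write_text(TEMPLATE, encoding="utf-8")


if __name__ == "__main__":
    work_dir = Path(tempfile.mkdtemp())
    write_inputs(work_dir)
    print(to_crfpp(TOY_SENTENCES), end="")
    print(f"# inputs written to {work_dir}; then train as on the slide:")
    print("./crf_learn template train_file model")
```

The file passed to crf_test follows the same column layout as train_file. A real chunker would need far more training sentences and most likely a richer template than this sketch.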