
Sanchay and other NLP Tools
Himanshu Sharma, Sambhav Jain
Sanchay
• Sanchay ⇔ संचय (http://sanchay.co.in/)
– A Collection of Tools and APIs for Language Processing
– An open source platform
– Especially South Asian languages
Sanchay and NLP Tools
Sanchay - Installation
• Platform Independent: Windows/Linux
• Pre-requisite: Sun (now Oracle) JDK 1.6
• Download the binaries
• Extract the .zip OR .tgz archive
• Go to the extracted directory
• Ready!
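The extraction step above can be sketched as a small shell helper. This is only an illustration: the function name extract_archive, the archive name, and the target directory are assumptions, not part of the Sanchay distribution.

```shell
# Hypothetical helper: unpack a downloaded archive (.zip or .tgz) into a directory.
extract_archive() {
    archive="$1"
    dest="$2"
    mkdir -p "$dest" || return 1
    case "$archive" in
        *.zip)          unzip -q "$archive" -d "$dest" ;;
        *.tgz|*.tar.gz) tar -xzf "$archive" -C "$dest" ;;
        *)              echo "unsupported archive: $archive" >&2; return 1 ;;
    esac
}

# Typical use (file and directory names are placeholders):
#   java -version                                # confirm a JDK is installed
#   extract_archive Sanchay.tgz "$HOME/sanchay"
#   cd "$HOME/sanchay"
```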
Sanchay - Modules
• Editors: text, RTF, HTML
• Tree Creator
• Syntactic Annotation
• Alignment tools
– Sentence
– Word
Shallow Parser
• 9 Indian Languages
– Hindi, Kannada, Malayalam, Marathi, Tamil, Telugu, Bengali, Punjabi, Urdu
• Performs Tokenization + Morph Analysis + POS Tagging + Chunking
• Linux Platform
• http://ltrc.iiit.ac.in/showfile.php?filename=downloads/shallow_parser.php
Shallow Parser - Installation
• Dependencies
– ‘dos2unix’ & ‘unix2dos’ must be installed
• Download and Extract
• Install
• If libgdbm.so.2 does not exist in /usr/lib/, copy libgdbm.so.3 under that name:
– sudo cp /usr/lib/libgdbm.so.3 /usr/lib/libgdbm.so.2
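A minimal pre-install check for the points above, assuming a POSIX shell. The helper name missing_tools is made up for this sketch. The library workaround is left commented out because it needs root.

```shell
# Print the name of every listed command that is not found on PATH.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
}

# Dependencies named on the slide.
missing=$(missing_tools dos2unix unix2dos)
if [ -n "$missing" ]; then
    echo "Install these first:" $missing
fi

# Workaround from the slide, applied only when libgdbm.so.2 is absent.
# Uncomment to run:
# [ -e /usr/lib/libgdbm.so.2 ] || sudo cp /usr/lib/libgdbm.so.3 /usr/lib/libgdbm.so.2
```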
TNT POS Tagger
• TNT Tagger [http://www.coli.uni-saarland.de/~thorsten/tnt/]
• Train – tnt-para data.txt
– Generates data.123 & data.lex
• Tag – tnt data file (data = model name from training, file = text to tag)
• Evaluate – tnt-diff goldfile taggedfile
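The three steps above can be chained as in the sketch below. It assumes the TnT binaries are on PATH and that tnt writes the tagged text to standard output. The function names and the output file name tagged.out are hypothetical.

```shell
# Derive the model name tnt expects from the training file name (data.txt -> data).
model_name() {
    echo "${1%.*}"
}

# Train, tag, and evaluate in one go. All three file arguments are placeholders.
tnt_cycle() {
    train="$1"; untagged="$2"; gold="$3"
    for tool in tnt-para tnt tnt-diff; do
        command -v "$tool" >/dev/null 2>&1 || { echo "$tool not found" >&2; return 1; }
    done
    model=$(model_name "$train")
    # Training writes the .123 and .lex files.
    tnt-para "$train" || return 1
    # Assumption: tagged output goes to stdout, so it is redirected to a file.
    tnt "$model" "$untagged" > tagged.out || return 1
    tnt-diff "$gold" tagged.out
}

# Example call (file names are placeholders):
#   tnt_cycle data.txt test.txt test.gold
```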
CRF++ - Chunker
• CRF++ [http://crfpp.googlecode.com/svn/trunk/doc/index.html]
• Separate binaries for Linux as well as Windows
• Installation
– ./configure
– make
– make install
CRF++ - Chunker
• Train
– ./crf_learn template train_file model
• Tag/Test
– ./crf_test -m model testfile
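To make the template and train_file arguments concrete, here is a toy sketch. The English sentence, the feature choices, and the temporary directory are illustrative assumptions. The tools are called from PATH here (as after make install) rather than as ./crf_learn, and the run is skipped if they are not installed.

```shell
workdir=$(mktemp -d)

# Training data: one token per line, whitespace-separated columns,
# last column is the chunk label, blank line after each sentence.
cat > "$workdir/train.txt" <<'EOF'
He PRP B-NP
reads VBZ B-VP
books NNS B-NP
. . O

EOF

# Feature template: %x[row,col] is relative to the current token.
cat > "$workdir/template" <<'EOF'
# Unigram features: current word, previous word, next word, current POS tag
U00:%x[0,0]
U01:%x[-1,0]
U02:%x[1,0]
U03:%x[0,1]
# Bigram feature over adjacent output labels
B
EOF

# Train and tag only when CRF++ is available.
if command -v crf_learn >/dev/null 2>&1; then
    crf_learn "$workdir/template" "$workdir/train.txt" "$workdir/model"
    crf_test -m "$workdir/model" "$workdir/train.txt"
fi
```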
Malt Parser (dependency parsing)
• MaltParser – [http://www.maltparser.org/]
• Train
– java -jar malt.jar -c model -i inputfile -m learn
• Test
– java -jar malt.jar -c model -i testfile -o output -m parse
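A small dry-run helper for assembling these command lines. MaltParser's user guide names the training mode learn, and options take plain ASCII hyphens. The helper name malt_cmd and the MALT_JAR variable are assumptions made for this sketch.

```shell
# MALT_JAR is a placeholder for wherever the MaltParser jar lives.
MALT_JAR="${MALT_JAR:-malt.jar}"

# Print the command line for a given mode without running it.
# Arguments: mode (learn or parse), model name, input file, optional output file.
malt_cmd() {
    mode="$1"; model="$2"; input="$3"; output="$4"
    cmd="java -jar $MALT_JAR -c $model -i $input -m $mode"
    if [ -n "$output" ]; then
        cmd="$cmd -o $output"
    fi
    echo "$cmd"
}

# Example calls (file names are placeholders):
#   malt_cmd learn model train.conll
#   malt_cmd parse model test.conll out.conll
```

Printing the command instead of executing it keeps the sketch safe to run on a machine without Java. The printed line can be pasted into a terminal.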
Other NLP Tools
• Toolkits
– NLTK (Python) [http://nltk.org/]
– OpenNLP (Java) [http://opennlp.apache.org/]
– LingPipe (Java) [http://alias-i.com/lingpipe/]
• Frameworks
– GATE [http://gate.ac.uk/]
– Apache UIMA [http://uima.apache.org/]