File - Erkan Karabacak


How to Tag a Corpus Using the Stanford Tagger
Accuracy
• All tokens: 97.32%
• Unknown words: 90.79%
What You Need
JRE:
http://www.java.com/en/download/ie_manual.jsp?locale=en
To make sure that Windows can find the Java compiler and interpreter:
• Select Start -> Computer -> System Properties -> Advanced system settings -> Environment Variables -> System variables -> PATH.
• [ In Vista, select Start -> My Computer -> Properties -> Advanced -> Environment Variables -> System variables -> PATH. ]
• [ In Windows XP, select Start -> Control Panel -> System -> Advanced -> Environment Variables -> System variables -> PATH. ]
• Prepend C:\Program Files\Java\jdk1.6.0_27\bin; to the beginning of the PATH variable. (If your Java is installed in a different folder, use that folder's bin directory instead.)
• Click OK three times.
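The PATH edit above can also be checked or previewed with a few lines of code. The following is only a sketch: `prepend_to_path` is a hypothetical helper written for this illustration, and it builds the new PATH value as a string without changing any Windows setting.

```python
import shutil


def prepend_to_path(path_value, new_dir, sep=";"):
    """Return path_value with new_dir placed first, without listing it twice.

    Hypothetical helper for illustration; it does not modify the system PATH.
    """
    wanted = new_dir.rstrip("\\/").lower()
    # Drop empty entries and any existing copy of new_dir (Windows paths
    # are case-insensitive, so compare in lower case).
    entries = [
        entry for entry in path_value.split(sep)
        if entry and entry.rstrip("\\/").lower() != wanted
    ]
    return sep.join([new_dir] + entries)


if __name__ == "__main__":
    # shutil.which searches the current PATH and returns None when
    # no java executable can be found on it.
    print("java found at:", shutil.which("java"))
```

If `shutil.which("java")` prints `None`, the PATH entry is probably missing or points to the wrong folder.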
Installing Java (JRE) on your computer
• Click Start.
• Type cmd and press Enter.
• This will open the Command Prompt window.
• Type java -version and press Enter.
• You will get a message:
java version "1.7.0" (or maybe an older version)
If you do not get this message, Java was not installed correctly. Ask for help.
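The same check can be scripted. This is a minimal sketch, assuming Python 3.7 or later is available; the function names are my own and are not part of Java or the tagger.

```python
import re
import subprocess


def parse_java_version(output):
    """Extract the quoted version string from `java -version` output, or None."""
    match = re.search(r'version "([^"]+)"', output)
    return match.group(1) if match else None


def installed_java_version():
    """Run `java -version`; return the version string, or None if Java is not found."""
    try:
        result = subprocess.run(
            ["java", "-version"], capture_output=True, text=True
        )
    except FileNotFoundError:
        # No java executable on PATH.
        return None
    # `java -version` writes its banner to stderr, so check both streams.
    return parse_java_version(result.stderr + result.stdout)


if __name__ == "__main__":
    print("Java version:", installed_java_version())
```

A result of `None` corresponds to the "ask for help" case above.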
Install the Stanford POS Tagger
Basic English Stanford Tagger Version 3.1.3:
http://nlp.stanford.edu/software/stanford-postagger-2012-07-09.tgz
Installing Basic English Stanford Tagger Version 3.1.3
• Click the link above and download the archive (a .tgz file).
• Extract the archive to Documents using an archive manager such as WinRAR, 7-Zip, or WinZip.
• You might want to rename the extracted folder to stanTagger. I do this because the original name is too long: stanford-postagger-2012-07-09
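If no archive manager is at hand, the extract-and-rename step can be sketched with Python's standard `tarfile` module. The function name `unpack_tagger` is hypothetical, and the sketch assumes the archive contains a single top-level folder, as the folder name above suggests.

```python
import tarfile
from pathlib import Path


def unpack_tagger(archive, dest_dir, new_name="stanTagger"):
    """Extract a .tgz archive into dest_dir and rename its top-level folder.

    Hypothetical helper; assumes the archive holds one top-level folder.
    """
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive, "r:gz") as tar:
        # Collect the first path component of every member.
        top_level = set()
        for member in tar.getmembers():
            parts = Path(member.name).parts
            if parts:
                top_level.add(parts[0])
        tar.extractall(dest)
    if len(top_level) != 1:
        raise ValueError("expected a single top-level folder in the archive")
    target = dest / new_name
    (dest / top_level.pop()).rename(target)
    return target
```

For example, `unpack_tagger("stanford-postagger-2012-07-09.tgz", "Documents")` would leave the files in `Documents/stanTagger`, provided the download sits in the current folder.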
Create a Corpus Folder
• In the stanTagger folder, create two folders to hold your files.
• I name them myCorpus and myTaggedCorpus.
• Now put some text files (or your corpus) in myCorpus.
• Make sure there are no spaces in your file names. For example, use writtenArgument.txt instead of written Argument.txt.
• Move the stanTagger folder to C:\ so that you can find it easily.
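With a large corpus, renaming files by hand gets tedious. The sketch below creates the two folders and removes spaces from the .txt file names in myCorpus; `prepare_corpus` and `remove_spaces` are names I made up for this example.

```python
from pathlib import Path


def remove_spaces(file_name):
    """Return file_name with all spaces removed."""
    return file_name.replace(" ", "")


def prepare_corpus(tagger_dir):
    """Create myCorpus and myTaggedCorpus, then strip spaces from .txt names.

    Returns a list of (old_name, new_name) pairs for the files it renamed.
    """
    root = Path(tagger_dir)
    corpus = root / "myCorpus"
    corpus.mkdir(parents=True, exist_ok=True)
    (root / "myTaggedCorpus").mkdir(parents=True, exist_ok=True)

    renamed = []
    # sorted() builds the full list first, so renaming during the loop is safe.
    for path in sorted(corpus.glob("*.txt")):
        clean = remove_spaces(path.name)
        if clean == path.name:
            continue
        target = path.with_name(clean)
        if target.exists():
            raise FileExistsError(f"{target} already exists")
        path.rename(target)
        renamed.append((path.name, clean))
    return renamed
```

Calling `prepare_corpus(r"C:\stanTagger")` after copying the corpus in would turn written Argument.txt into writtenArgument.txt.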
Tagging Files
• Open the Command Prompt as described above.
• Go to C:\ by typing the command cd.. twice (this assumes the prompt opens in your user folder, for example C:\Users\YourName).
• Go into stanTagger by typing cd stanTagger
• To run the Stanford Tagger on every file automatically, we need to do some programming.
• We can do this with Perl or other programming languages, such as Java, PHP, Python, and so on.
• However, I found programming the Command Prompt to be the simplest and will share the code I prepared.
• Code to be used in Command Prompt:
• FOR %a IN (C:\stanTagger\myCorpus\*.txt) DO stanford-postagger models\left3words-wsj-0-18.tagger myCorpus\%~nxa >myTaggedCorpus\%~nxa
• You can simply copy the above code and paste
it in the Command Prompt
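To see what the loop does on each pass, it helps to know that `%~nxa` expands to the file name plus extension of the current file, without the folder. The sketch below only prints the command lines the loop would run; it does not call the tagger. `expand_for_loop` is a hypothetical name, and `models\example.tagger` is a placeholder for whichever model file exists in your models folder.

```python
import ntpath


def expand_for_loop(files, model):
    """Return the command line the FOR loop would run for each input file."""
    commands = []
    for full_path in files:
        # ntpath.basename mirrors %~nxa: name and extension, no folder.
        name = ntpath.basename(full_path)
        commands.append(
            f"stanford-postagger {model} myCorpus\\{name} >myTaggedCorpus\\{name}"
        )
    return commands


if __name__ == "__main__":
    example_files = [
        r"C:\stanTagger\myCorpus\writtenArgument.txt",
        r"C:\stanTagger\myCorpus\essay.txt",
    ]
    # Placeholder model name, for illustration only.
    for line in expand_for_loop(example_files, r"models\example.tagger"):
        print(line)
```

Each printed line is one tagger run, with its output redirected to a file of the same name in myTaggedCorpus.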
New Code!
• FOR %a IN (C:\stanTagger\myCorpus\*.txt) DO stanford-postagger models\wsj-0-18-left3words.tagger myCorpus\%~nxa >myTaggedCorpus\%~nxa
Newest Code!
• FOR %a IN (C:\stanTagger\myCorpus\*.txt) DO stanford-postagger models\english-left3words-distsim.tagger myCorpus\%~nxa >myTaggedCorpus\%~nxa
• Each file may take about 2-3 seconds. At the end, you will see that myTaggedCorpus contains the tagged files.
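For readers who prefer Python to the Command Prompt loop, here is a rough equivalent. It is a sketch under several assumptions: it calls Java directly on the `edu.stanford.nlp.tagger.maxent.MaxentTagger` class with `-model` and `-textFile`, which is my understanding of what the stanford-postagger script does; the `-mx300m` memory setting and the jar name `stanford-postagger.jar` are likewise assumed; and `tag_corpus` is a name I chose for this example.

```python
import subprocess
from pathlib import Path


def tag_corpus(tagger_dir, model, run=subprocess.run):
    """Tag every .txt file in myCorpus and write the result to myTaggedCorpus.

    `model` is the file name of a model inside the models folder.
    `run` is replaceable so the loop can be tried without Java installed.
    """
    root = Path(tagger_dir)
    out_dir = root / "myTaggedCorpus"
    out_dir.mkdir(exist_ok=True)

    tagged = []
    for text_file in sorted((root / "myCorpus").glob("*.txt")):
        out_file = out_dir / text_file.name
        # Assumed command line; check it against the script in your download.
        command = [
            "java", "-mx300m", "-cp", "stanford-postagger.jar",
            "edu.stanford.nlp.tagger.maxent.MaxentTagger",
            "-model", str(Path("models") / model),
            "-textFile", str(text_file.relative_to(root)),
        ]
        # The tagger prints tagged text to standard output; capture it in a file.
        with open(out_file, "wb") as out:
            run(command, cwd=root, stdout=out, check=True)
        tagged.append(out_file)
    return tagged
```

Calling `tag_corpus(r"C:\stanTagger", "english-left3words-distsim.tagger")` returns the list of output files, so an empty list means myCorpus contained no .txt files.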