Machine Learning Homework


CRF Homework
Gaining familiarity with a standard
NLP toolkit, and NLP tasks
Goals for this homework
1. Learn how to use Conditional Random Fields, a
standard framework for sequence labeling in NLP
2. Apply CRFs to two standard datasets
3. Learn how to use the CRF++ package and the Stanford
POS tagger
Please read the whole homework before beginning.
Some steps will be really quick, but others (especially Steps 7 and 9) may take a while.
Step 1: Install software
Download and install CRF++ on some machine that you will be able to use for
a while. Lab machines will not have this, but if you don’t have access to
another machine, let me know, and I will try to get this installed for you on a
lab machine.
http://crfpp.googlecode.com/svn-history/r17/trunk/doc/index.html
The CRF++ package comes with two primary executables that you will need to
use from the command line:
crf_learn
crf_test
If you installed a source package, you will need to compile to get these
executables. If you installed a binary, they should have come with the
download.
Also, you may want to add these two executables to your path.
Step 1b: Install perl, if you don’t
already have it
• If you’re on linux, you probably have perl, or you can
use your package manager to get it
• If you’re on windows and don’t have perl installed
already, you can get an MSI for ActivePerl (free open
source) from:
http://www.activestate.com/activeperl
• If you’re on a Mac, sorry, you’re on your own …
You don’t need to develop in Perl for this assignment, but
one part of the software for the assignment (see next
slide) is written in Perl, and needs the Perl interpreter to
be installed in order to run.
Step 1c: Download conlleval.pl
• Create a working directory somewhere, and download the Perl script conlleval.pl to it.
• The script is available at:
http://www.cnts.ua.ac.be/conll2002/ner/bin/conlleval.txt
Rename the file from “conlleval.txt” to “conlleval.pl”.
This script is used only to measure how accurate the
CRF’s predictions are. It requires perl (from previous
step) to be installed on your machine.
Step 1d: Install the Stanford POS
tagger
Download and unzip the Stanford POS tagger:
http://nlp.stanford.edu/software/stanford-postagger-2013-04-04.zip
Also, log on to the blackboard site for the course.
Under “course content”, navigate to the folder called “programming-assignment-crfs”.
Download “TagAndFormatForCRF.java” from there, and place it in the
same directory where the Stanford tagger jar file is (the dir you just
unzipped).
Compile the Java file: javac -cp stanford-postagger.jar TagAndFormatForCRF.java
You won’t use this until step 9, but don’t forget about it!
Step 2: Get the main dataset
1. Log on to the blackboard site for this course.
2. Navigate to the “course content” section, and the folder called “programming-assignment-crfs”.
3. Download the zip file called “conll-2003-dataset.zip” to your working directory.
4. Unzip the archive.
This archive contains three files:
- eng.train
- eng.testa
- eng.testb
These files are already formatted for CRF++ to use.
You will use the eng.train file during training (crf_learn), and eng.testa file to measure
how well your system is doing (crf_test) while you are developing your model.
Finally, once your system is fully developed and debugged, you will test it on eng.testb
(crf_test again).
Step 2b: Get the initial “template” file
• In the same place on blackboard, you should
also see a file called “initial-template.txt”.
• Download this to your working directory.
• This file contains an initial set of “features” for
the CRF model. You will use this as an input to
crf_learn.
Step 3: Run an initial test to verify
everything is working
If you have everything installed properly, then you should now be able to run
crf_learn, crf_test, and conlleval.pl.
From your working directory:
1. <path_to_crf++>/bin/crf_learn <template-file> <train-file> <output-model-file>
2. <path_to_crf++>/bin/crf_test -m <model-file> <test-file> > <output-prediction-file>
3. conlleval.pl -d \t < <output-prediction-file>
On a windows machine, use crf_learn.exe instead of crf_learn, and crf_test.exe
instead of crf_test.
For instance, on my machine:
1. CRF++-0.58\crf_learn.exe initial-template.txt eng.train eng.model
(this takes a little over 2 minutes and 206 iterations on my machine)
2. CRF++-0.58\crf_test.exe -m eng.model eng.testa > eng.testa.predicted
3. conlleval.pl -d \t < eng.testa.predicted
Step 3: Run an initial test to verify
everything is working
If everything is working properly, you should get output like this from
conlleval.pl:
processed 51578 tokens with 5942 phrases; found: 5647 phrases; correct: 4783.
accuracy: 96.83%; precision: 84.70%; recall: 80.49%; FB1: 82.54
LOC: precision: 84.42%; recall: 84.05%; FB1: 84.23 1829
MISC: precision: 91.26%; recall: 71.37%; FB1: 80.10 721
ORG: precision: 81.81%; recall: 74.79%; FB1: 78.15 1226
PER: precision: 84.34%; recall: 85.67%; FB1: 85.00 1871
The “accuracy” number is misleading; ignore it.
This output says that the current system found 84.05% of the true location (LOC)
named entities (the “recall” for LOC), and that 84.42% of the entities the
system predicted to be LOCs were in fact LOCs (the “precision” for LOC). FB1
is a kind of average (called the “harmonic mean”) of the precision and recall
numbers. The FB1 scores for the four categories are the important numbers.
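Concretely, FB1 = 2 × precision × recall / (precision + recall). For the LOC row above, that is 2 × 84.42 × 84.05 / (84.42 + 84.05) ≈ 84.23, matching the reported FB1.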
Step 3: Run an initial test to verify
everything is working
If everything is working properly, the first few lines of eng.testa.predicted should look something like this:
-DOCSTART- -X- O O O

CRICKET NNP I-NP O O
- : O O O
LEICESTERSHIRE NNP I-NP I-ORG I-PER
TAKE NNP I-NP O O
OVER IN I-PP O O
AT NNP I-NP O O
TOP NNP I-NP O O
AFTER NNP I-NP O O
INNINGS NNP I-NP O O
VICTORY NN I-NP O O
. . O O O

LONDON NNP I-NP I-LOC I-LOC
1996-08-30 CD I-NP O O

West NNP I-NP I-MISC I-MISC
Indian NNP I-NP I-MISC I-MISC
all-rounder NN I-NP O O
Phil NNP I-NP I-PER I-PER
Simmons NNP I-NP I-PER I-PER
The last column contains the system’s predictions. The second-to-last column contains the correct answers.
Step 4: Add a feature to the template
In the directory where you placed the CRF++ download, you will find a documentation file at
doc/index.html. Read the section called “Usage” (mainly focus on “Training and Test file formats” and
“Preparing feature templates”).
In your working directory, copy “initial-template.txt” to a new template file called “template1.txt”.
Add a new unigram feature to this template file that includes the previous and current part of speech
(POS) tags. A subscript of [0,0] refers to the current token, 0th column (column 0 contains words,
column 1 contains POS tags in our dataset).
So to add this new feature, we will insert the following line into the template file:
U15:%x[-1,1]/%x[0,1]
The U15 just says this is the 15th unigram feature; the number doesn’t really matter so long as it is
different from all of the other unigram features.
The [-1,1] subscript refers to the previous token and column 1, so the previous POS tag.
The [0,1] subscript refers to the current token and column 1, so the current POS tag.
Save template1.txt with this new line added to it.
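For instance (using the sample output from Step 3): when the current token is OVER, the previous POS tag is NNP (for TAKE) and the current POS tag is IN, so CRF++ expands U15:%x[-1,1]/%x[0,1] into the feature string “U15:NNP/IN”.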
Step 4: Add a feature to the template
Once you’ve created the new template file, re-run crf_learn,
crf_test, and conlleval.pl. Store the results of conlleval.pl in a
file called “results-template1.txt”.
Notice that the results have changed.
By adding more features, you’re allowing the NER system to
search through more information during training, to try to find
helpful patterns for detecting named-entities.
But adding more features can actually hurt performance, if it
tricks the CRF into learning a pattern that it thinks is a good
way to find named-entities, but actually isn’t.
Step 5: Add more features to the
templates
Copy “template1.txt” to a file called “template2.txt”.
Add the following features to “template2.txt” (one possible encoding is sketched after this list):
1. Unigram feature: the POS tag for the current token and the next token.
2. Unigram feature: the POS tag for the previous token, the current token, and the next token.
3. Bigram feature: the POS tag for the current token and the next token.
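One possible CRF++ encoding of these three features (the IDs U16, U17, and B16 are arbitrary names chosen here; they just need to differ from the other feature IDs in the file):

U16:%x[0,1]/%x[1,1]
U17:%x[-1,1]/%x[0,1]/%x[1,1]
B16:%x[0,1]/%x[1,1]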
Re-run training, testing on eng.testa, and evaluation. Store the results
of conlleval.pl in a file called “results-template2.txt”
Note that as we add more features, not only do the accuracies change, but training also takes longer. So we can’t go too crazy with adding features, or training will take too long.
Step 6: Add a new column to the data
So far, we have been changing the predictor by giving it
new template features.
Another (often better) way of improving the predictor is
to give it totally new information, so long as that
information is relevant to the task (in this case, finding
named entities).
To show an example of that, in this step you will add a
new column to the data (all three files, eng.train,
eng.testa, and eng.testb) that indicates information about
capitalization of the word.
Step 6: Add a new column to the data
• In any programming language, write a script that will take one of the data files as input, and output a very similar one, but with one extra column (a sketch is given after this list).
• The new column should be second-to-last: the last column has to be the
labels for the task in order for CRF++ to work properly. And if we keep the
first two columns the same (word and POS tag), then the template files we
used before will still work properly.
• In the new, second-to-last column, put a 3 if the word is all-caps, 2 if it has
a capital letter anywhere besides the first letter (eg “eBay”), 1 if it starts
with a capital letter and has no other capital letters, and 0 if it has no
capital letters.
• Apply this script to eng.train to produce cap.train. Do the same for the
test files to get cap.testa and cap.testb.
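Here is a minimal Python sketch of such a script (the name add_cap_column.py is just a suggestion; it assumes whitespace-separated columns and reads one data file on standard input):

# add_cap_column.py -- add a capitalization code as a new second-to-last column
import sys

def cap_code(word):
    # 3 = all caps; 2 = a capital anywhere besides the first letter (e.g. "eBay");
    # 1 = initial capital only; 0 = no capitals.
    if word.isupper():
        return "3"
    if any(c.isupper() for c in word[1:]):
        return "2"
    if word[0].isupper():
        return "1"
    return "0"

for line in sys.stdin:
    fields = line.split()
    if not fields:                              # keep blank sentence separators
        sys.stdout.write(line)
        continue
    fields.insert(-1, cap_code(fields[0]))      # word is column 0; label stays last
    sys.stdout.write(" ".join(fields) + "\n")

For example: python add_cap_column.py < eng.train > cap.train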
Step 6: Add a new column to the data
Copy “template2.txt” to “template3.txt”.
Add the following features to “template3.txt” (one possible encoding is sketched after this list):
1. Unigram feature: the current capitalization
2. Unigram feature: the previous capitalization and the
current capitalization
3. Bigram feature: the previous capitalization and the
current capitalization
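Following the Step 4 pattern, one possible encoding, assuming the capitalization code ended up in column 2 (word = column 0, POS = column 1, capitalization = column 2, label last; the IDs are arbitrary):

U20:%x[0,2]
U21:%x[-1,2]/%x[0,2]
B20:%x[-1,2]/%x[0,2]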
Re-run training, testing, and evaluation. Store the results
of conlleval.pl in a file called “results-template3.txt”.
Step 7: See how good you can get
Copy “template3.txt” to a file called “template4.txt”.
Add, remove, and edit the template file however you think it will work best. You may
edit as much or as little as you like, but you must make at least one change from
previous templates.
You may also add or modify the columns (except for the labels) in the data as much as
you want. You may use new columns to incorporate external data (e.g., a column that
indicates whether a token matches an external database of city names; a sketch is
given below), or just functions of the existing data. Note: you may not add any new
columns that are functions of the labels!
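As an illustration of an external-data column, here is a minimal Python sketch in the same style as the Step 6 script (cities.txt is hypothetical: a one-name-per-line list of city names that you would supply yourself):

# add_gazetteer_column.py -- mark tokens that appear in an external city list
import sys

# Load the hypothetical gazetteer, one city name per line.
cities = set(line.strip() for line in open("cities.txt"))

for line in sys.stdin:
    fields = line.split()
    if not fields:                              # keep blank sentence separators
        sys.stdout.write(line)
        continue
    # 1 if the word matches the city list, else 0; inserted second-to-last.
    fields.insert(-1, "1" if fields[0] in cities else "0")
    sys.stdout.write(" ".join(fields) + "\n")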
Save your new data files as eng.mod.train, eng.mod.testa, and eng.mod.testb.
Save your results from conlleval.pl in a file called “results-template4.txt”.
The TA will award the student with the best results 4 points of extra credit. Second-best will get 2 points of extra credit.
Step 8: Final evaluation
With any system that involves learning, it is a good idea to keep a final dataset
that is completely separate from the rest of your data, for the final evaluation.
So far, we’ve been using eng.testa for all of our development, debugging, and
testing.
Now, run crf_test and conlleval.pl using eng.mod.testb rather than
eng.mod.testa, to produce the final evaluation.
Store the results in “results-template4-eng.testb.txt”.
Note that the results for final evaluation are typically a bit worse than the
results on eng.testa. This is expected. That’s because we essentially designed
the model to be good on eng.testa, but what works for eng.testa won’t
necessarily generalize to eng.testb.
Step 9: Evaluating on new domains
Your NER system can now detect named entities in text.
You can apply it to any text, and automatically determine
where all the names are.
However, it was trained on news text. One thing people
have noticed is that systems trained on one kind of text
tend not to be quite as accurate on other types of text.
For your final task, you will collect some sentences from documents that are
not news (or news-like), and see how well your NER system works on them.
Step 9: Evaluating on new domains
Create a file called “new-domain.txt”.
Collect a set of approximately 50 English sentences. You may collect these
from any public source (e.g., the Web) that is not a news outlet or provider of
news coverage. For instance, blogs or magazine articles work great, but really
any non-news source that mentions named-entities will do. The less news-like
the better, and the more named-entities, the better.
I do not recommend Twitter for this, since tweets are so different from
English found in news articles that they will be basically impossible for your
system to process.
Save your 50 sentences to your file new-domain.txt. You do not need to
format this file in any special way (i.e., you don’t need to split sentences onto
different lines, or words onto different lines). Just make sure that there aren’t
any really unusual characters in the text.
Step 9: Evaluating on new domains
Next, you will need to format and add POS tags to this
file.
We will use the Stanford POS tagger to do this.
Run the Java file that you compiled in step 1d.
Change to the directory containing TagAndFormatForCRF.class, then run:
java -cp .;stanford-postagger.jar TagAndFormatForCRF <working-dir>/new-domain.txt > <working-dir>/new-domain.tagged.txt
(On Linux or a Mac, use a colon in the classpath instead: -cp .:stanford-postagger.jar)
Note the output redirection (>) after new-domain.txt; it’s easy to miss.
Step 9: Evaluating on new domains
Next, you will need to add NER tags to your file.
Open new-domain.tagged.txt (from the previous step) in a text editor.
For each line, add one of these labels:
I-PER (if it’s a person’s name)
I-LOC (if it’s the name of a location, like a country or river)
I-ORG (if it’s the name of an organization, like “United Nations” or “Red Cross of
America”)
I-MISC (if it’s the name of something besides a person, location, or organization)
O (if it’s not a name at all)
Make sure to separate the labels from the POS tags with a tab character.
If you’re unsure about how to label any of your sentences, you can check the
annotation guidelines at:
http://www.cnts.ua.ac.be/conll2003/ner/annotation.txt
Step 9: Evaluating on new domains
Finally, run your scripts for adding new columns
to this dataset, then run crf_test and
conlleval.pl.
Save the output of crf_test in a file called “new-domain.predicted.txt”.
Save the output of conlleval.pl in a file called
“results-new-domain.txt”.
To turn in
You should turn in a single zip archive called <your-name>.zip. It
should contain:
• template1.txt and results-template1.txt
• template2.txt and results-template2.txt
• template3.txt, cap.train, cap.testa, cap.testb, and results-template3.txt
• template4.txt, eng.mod.train, eng.mod.testa, eng.mod.testb, and results-template4.txt
• results-template4-eng.testb.txt
• The fully annotated version of new-domain.tagged.txt (after adding
labels and all of your new columns, including capitalization).
• new-domain.predicted.txt and results-new-domain.txt
Email or otherwise transfer this zip file to your TA.