Transcript [Task 3]
Download Estimation for KDD Cup 2003
Janez Brank and Jure Leskovec
Jožef Stefan Institute
Ljubljana, Slovenia
Task Description
Inputs:
Approx. 29000 papers from the “high energy
physics – theory” area of arxiv.org
For each paper:
Full text (TeX file, often very messy)
Metadata in a nice, structured file (authors,
title, abstract, journal, subject classes)
The citation graph (excludes citations pointing
outside our dataset)
Task Description
Inputs (continued):
For papers from a 6-month period
(the training set, 1566 papers):
the number of times each paper was downloaded
during its first two months in the archive
Problem:
For papers from a 3-month period (the test set,
678 papers), predict the number of downloads
in their first two months in the archive
Only the 50 most frequently downloaded
papers from each month will be used
for evaluation!
Our Approach
Textual documents have traditionally been treated
as “bags of words”
We extend this to include other items besides
words (“bag of X”)
The number of occurrences of each word matters,
but the order of the words is ignored
Efficiently represented by sparse vectors
Most of our work was spent trying various features
and adjusting their weight (more on that later)
Use support vector regression to train a linear
model, which is then used to predict the download
counts on test papers
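A minimal sketch of this setup in Python (assuming scikit-learn; not the authors' actual code, and the texts and download counts below are illustrative): a sparse bag-of-words representation fed into support vector regression with a linear model.

    # Sketch only: sparse bag-of-words features + linear support vector regression.
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.preprocessing import normalize
    from sklearn.svm import LinearSVR

    texts = ["abstract and title of paper one ...",
             "abstract and title of paper two ...",
             "abstract and title of paper three ..."]
    downloads = [500.0, 250.0, 120.0]            # hypothetical download counts

    X = CountVectorizer().fit_transform(texts)   # sparse word-count vectors
    X = normalize(X)                             # unit Euclidean length per paper

    model = LinearSVR(C=1.0).fit(X, downloads)   # linear model via SV regression
    predictions = model.predict(X)               # predicted download counts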
A Few Initial Observations
Our predictions will be evaluated on
50 most downloaded papers from each
month — about 20% of all papers from
these months
It’s OK to be horribly wrong on other papers
Thus we should be optimistic, treating every
paper as if it were in the top 20%
Maybe we should train the model using only
20% of the most downloaded training papers
Actually, 30% usually works a little better
To evaluate a model, we look at the 20%
most downloaded test papers
Cross-Validation
Labeled papers (1566): split into 10 folds
Train: take 9 folds (approx. 1409 papers),
keep their 30% most frequently downloaded
(approx. 423 papers), and train a model on these
Evaluate: take the remaining fold (approx. 157 papers)
and measure the error on its 20% most frequently
downloaded papers (approx. 31 papers)
Lather, rinse, repeat (10 times); report the average
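A sketch of this protocol in Python (assuming numpy/scikit-learn, and assuming "average error" means mean absolute error; the regressor settings are placeholders):

    import numpy as np
    from sklearn.model_selection import KFold
    from sklearn.svm import LinearSVR

    def top_fraction(X, y, frac):
        """Keep only the `frac` most frequently downloaded papers."""
        k = max(1, int(round(frac * len(y))))
        top = np.argsort(y)[-k:]
        return X[top], y[top]

    def cross_validate(X, y, n_folds=10):
        """X: feature matrix, y: numpy array of download counts."""
        errors = []
        for train_idx, test_idx in KFold(n_splits=n_folds, shuffle=True).split(X):
            X_tr, y_tr = top_fraction(X[train_idx], y[train_idx], 0.30)
            X_te, y_te = top_fraction(X[test_idx], y[test_idx], 0.20)
            model = LinearSVR(C=1.0).fit(X_tr, y_tr)
            errors.append(np.mean(np.abs(model.predict(X_te) - y_te)))
        return float(np.mean(errors))            # report the average over folds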
A Few Initial Observations
We are interested in the downloads within 60
days since inclusion in the archive
Most of the downloads
occur within the first few
days, perhaps a week
Most are probably coming
from the “What’s new”
page, which contains only:
Author names
Institution name (rarely)
Title
Abstract
[Chart: average number of downloads on each day, plotted against the day (0 to 60) since the paper was added to the archive]
Citations probably don’t directly influence
downloads in the first 60 days
But they show which papers are good, and the
readers perhaps sense this in some other way from
the authors / title / abstract
The Rock Bottom
The trivial model: always predict
the average download count
(computed on the training data)
Average download count: 384.2
Average error: 152.5 downloads
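In code, this baseline is two lines (a sketch; y_train and y_test stand for the download counts, shown here with made-up values):

    import numpy as np
    y_train = np.array([400.0, 350.0, 410.0])       # placeholder download counts
    y_test  = np.array([600.0, 150.0])

    baseline = y_train.mean()                        # approx. 384.2 on the real data
    avg_error = np.abs(y_test - baseline).mean()     # approx. 152.5 on the real data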
Abstract
Abstract: use the text of the
abstract and title of the paper in the
traditional bag-of-words style
19912 features
No further feature selection etc.
This part of the vector was normalized
to unit length (Euclidean norm = 1)
Average error: 149.4
Author
One attribute for each possible author
Preprocessing to tidy up the original
metadata:
Y.S. Myung and Gungwon Kang → myung-y, kang-g
Feature x_a is nonzero iff a is one of the authors
of paper x
This part is normalized to unit length
5716 features
Average error: 146.4
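A rough sketch of that name normalization (our guess at the rule: lowercase surname plus first initial; the real preprocessing surely handled more cases):

    import re

    def normalize_author(name):
        """'Y.S. Myung' -> 'myung-y' (surname + first initial, lowercased)."""
        parts = [p for p in re.split(r"[\s.]+", name.strip()) if p]
        return "%s-%s" % (parts[-1].lower(), parts[0][0].lower())

    def author_features(author_field):
        """Split an author list on ',' / 'and' and normalize each name."""
        names = re.split(r",| and ", author_field)
        return [normalize_author(n) for n in names if n.strip()]

    # author_features("Y.S. Myung and Gungwon Kang") == ['myung-y', 'kang-g']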
Address
Intuition: people are more likely to download a
paper if the authors are from a reputable
institution
Words from the address are represented using the
bag-of-words model
Admittedly, the “What’s new” page usually
doesn’t mention the institution
Nor is it provided in the metadata;
we had to extract it from the TeX files (messy!)
But they get their own namespace,
separate from the abstract and title words
This part of the vector is also normalized
to unit length
Average error: 154.0 (worse than useless)
Abstract, Author, Address
[Bar chart: average error on the training and test sets for each combination of the Abstract, Author, and Address feature groups; Author + Abstract gives the lowest test-set error, 135.8]
We used Author + Abstract (“AA” for short)
as the baseline for adding new features
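One way to picture how the feature groups are put together (a sketch, not the actual implementation): each group keeps its own namespace, is normalized to unit length where appropriate, and is scaled by a global weight before the groups are concatenated. The example variable names are hypothetical.

    # Sketch: concatenate per-group sparse blocks, each scaled by a weight.
    # Blocks such as Abstract or Author are assumed already normalized to
    # unit length per row; scalar features such as InDegree are just scaled.
    from scipy.sparse import hstack

    def combine(blocks_and_weights):
        """blocks_and_weights: list of (sparse_block, weight) pairs."""
        return hstack([w * b for b, w in blocks_and_weights]).tocsr()

    # e.g. X = combine([(X_abstract, 1.0), (X_author, 1.0), (X_indegree, 0.005)])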
Using the Citation Graph
InDegree, OutDegree
These are quite large in comparison to the text-based features (average indegree approx. 10)
We must use weighting, otherwise they will appear
too important to the learner
[Chart: average error on the test set as a function of the weight of the InDegree feature (AA + InDegree); the best error is 127.62]
InDegree is useful
OutDegree is largely useless (which is reasonable)
Using the Citation Graph
InLinks = add one feature for each paper i;
it is nonzero in vector x iff paper x
is referenced by paper i
Normalize this part of the vector to unit length
OutLinks = the same, nonzero iff x references i
(results on next slide)
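A sketch of how these link features can be built from the citation graph (Python with scipy; `citations` is a hypothetical list of (citing, cited) paper indices):

    import numpy as np
    from scipy.sparse import csr_matrix
    from sklearn.preprocessing import normalize

    def link_features(citations, n_papers):
        """Return (InLinks, OutLinks) as row-normalized sparse matrices."""
        citing = np.array([c for c, _ in citations])
        cited = np.array([d for _, d in citations])
        ones = np.ones(len(citations))
        # row x, column i nonzero iff paper i cites paper x
        in_links = csr_matrix((ones, (cited, citing)), shape=(n_papers, n_papers))
        # row x, column i nonzero iff paper x cites paper i
        out_links = csr_matrix((ones, (citing, cited)), shape=(n_papers, n_papers))
        return normalize(in_links), normalize(out_links)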
InDegree, InLinks, OutLinks
Average error (training set / test set):
AA                                              37.62 / 135.81
AA + InLinks                                    30.19 / 131.93
AA + OutLinks                                   30.27 / 132.47
AA + InLinks + OutLinks                         26.77 / 131.11
AA + 0.8 InLinks + 0.9 OutLinks                 28.33 / 130.69
AA + 0.004 InDeg + 0.8 InLinks + 0.9 OutLinks   27.95 / 124.35
AA + 0.005 InDeg + 0.5 InLinks + 0.7 OutLinks   30.54 / 123.73
Using the Citation Graph
Use HITS to compute a hub value
and an authority value for each paper
(two new features)
Compute PageRank and add this as a new feature
Bad: all links point backwards in time
(unlike on the web) — PageRank accumulates
in the earlier years
InDegree, Authority, and PageRank
are strongly correlated,
no improvement over previous results
Hub is strongly correlated with OutDegree,
and is just as useless
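For reference, a sketch of computing these scores with networkx on a toy citation graph (the paper ids and edges below are illustrative):

    import networkx as nx

    citations = [(1, 0), (2, 0), (2, 1), (3, 2)]   # hypothetical (citing, cited) pairs
    G = nx.DiGraph(citations)                      # edge u -> v means "u cites v"

    hubs, authorities = nx.hits(G)                 # HITS: dicts of paper id -> score
    pagerank = nx.pagerank(G)                      # PageRank: dict of paper id -> score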
Journal
The “Journal” field in the metadata indicates
that the paper has been (or will be?)
published in a journal
Present in about 77% of the papers
Already in standardized form, e.g. “Phys. Lett.”
(never “Physics Letters”, “Phys. Letters”, etc.)
There are over 50 journals, but only 4 have more
than 100 training papers
Papers from some journals are downloaded
more often than from others (average downloads:
JHEP 248, J. Phys. 104, global average 194)
Introduce one binary feature for each journal
(+ one for “missing”)
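A sketch of these binary journal indicators (the journal list below is just an example):

    # One binary feature per journal name, plus one for "missing".
    journals = ["Phys. Lett.", None, "JHEP", "Phys. Lett."]   # one entry per paper

    names = sorted({j for j in journals if j is not None}) + ["<missing>"]
    index = {name: k for k, name in enumerate(names)}

    def journal_vector(journal):
        vec = [0.0] * len(names)
        vec[index[journal if journal is not None else "<missing>"]] = 1.0
        return vec

    # journal_vector("JHEP") -> [1.0, 0.0, 0.0]  (order depends on `names`)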
Journal
[Chart: average error on the test set vs. the weight of the Journal attribute; AA + Journal bottoms out at 134.95, AA + 0.005 InDeg + 0.5 InLinks + 0.7 OutLinks + Journal at 121.16]
Miscellaneous Statistics
TitleCc, TitleWc: number of
characters/words in the title
The most frequently downloaded
papers have relatively short titles:
The holographic principle (2927 downloads)
Twenty Years of Debate with Stephen (1540)
Brane New World (1351)
A tentative theory of large distance physics (1351)
(De)Constructing Dimensions (1343)
Lectures on supergravity (1308)
A Short Survey of Noncommutative Geometry (1246)
Miscellaneous Statistics
[Chart: average error on the test set as a function of the weight of TitleCc/5]
Average error: 119.561 for weight = 0.02
The model says that the number of downloads decreases by
0.96 for each additional letter in the title :-)
TitleWc is useless
Miscellaneous Statistics
AbstractCc, AbstractWc: number of
characters/words in the abstract
Both useless
Number of authors (useless)
Year (actually Year – 2000)
Almost useless (reduces error from
119.56 to 119.28)
Clustering
Each paper was represented by a sparse vector
(bag-of-words, using the abstract + title)
Use 2-means to split into two clusters, then split
each of them recursively
Stop splitting if one of the two clusters would have
< 600 documents
We ended up with 18 clusters
Hard to say if they’re meaningful (ask a physicist?)
Introduce one binary feature for each cluster
(useless)
Also add a feature (ClusDlAvg) containing the average
number of downloads over all the training documents
from the same cluster
Reduces error from 119.59 to 119.30
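A sketch of the recursive 2-means splitting and the ClusDlAvg feature (assuming scikit-learn; the 600-document threshold is the one quoted above, the rest is a reconstruction, and X_abstract / n_papers are placeholders):

    import numpy as np
    from sklearn.cluster import KMeans

    def bisect(X, indices, min_size=600):
        """Recursively split the documents in `indices` with 2-means."""
        labels = KMeans(n_clusters=2, n_init=10).fit_predict(X[indices])
        left, right = indices[labels == 0], indices[labels == 1]
        if min(len(left), len(right)) < min_size:
            return [indices]                 # one of the halves would be too small
        return bisect(X, left, min_size) + bisect(X, right, min_size)

    # clusters = bisect(X_abstract, np.arange(n_papers))
    # ClusDlAvg of a paper = mean download count of the training papers
    # that fall into the same cluster.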
Tweaking and Tuning
AA + 0.005 InDegree + 0.5 InLinks +
0.7 OutLinks + 0.3 Journal +
0.02 TitleCc/5 + 0.6 (Year – 2000) + 0.15
ClusDlAvg:
Average error: 29.544 (training set) / 119.072 (test set)
The “C” parameter for SVM regression
was fixed at 1 so far
C = 0.7, AA + 0.006 InDegree +
0.7 InLinks + 0.85 OutLinks +
0.35 Journal + 0.03 TitleCc/5 +
0.3 ClusDlAvg:
Average error: 31.805 (training set) / 118.944 (test set)
This is the one we submitted
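Put together, the submitted configuration amounts to the group weights above plus the regression parameter (a sketch reusing the hypothetical combine() helper from earlier; the weight of 1.0 on Abstract and Author is an assumption for the "AA" baseline):

    # The submitted setting, expressed as group weights and the SVR C parameter.
    from sklearn.svm import LinearSVR

    weights = {"Abstract": 1.0, "Author": 1.0,        # the "AA" baseline
               "InDegree": 0.006, "InLinks": 0.7, "OutLinks": 0.85,
               "Journal": 0.35, "TitleCc/5": 0.03, "ClusDlAvg": 0.3}
    model = LinearSVR(C=0.7)
    # model.fit(combine([(blocks[name], w) for name, w in weights.items()]), y_top30)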
A Look Back…
Average error (training set / test set):
Trivial model             150.1 / 152.5
Author + Abstract          37.6 / 135.8
+ InDegree                 36.0 / 127.6
+ InLinks + OutLinks       30.5 / 123.7
+ Journal                  29.9 / 121.2
+ TitleCc/5                29.7 / 119.6
Best model found           31.8 / 118.9
Conclusions
It’s a nasty dataset!
The best model is still disappointingly inaccurate
…and not so much better than the trivial model
Weighting the features is very important
We tried several other features (not mentioned in
this presentation) that were of no use
Whatever you do, there’s still so much variance left
SVM learns well enough here,
but it can’t generalize well
It isn’t the trivial sort of overfitting that could be
removed simply by decreasing the
C parameter in SVM’s optimization problem
Further Work
What is it that influences readers’
decisions to download a paper?
We are mostly using things they can see
directly: author, title, abstract
But readers are also influenced by their
background knowledge:
Is X currently a hot topic within this
community? (Will reading this paper help me
with my own research?)
Is Y a well-known author?
How likely is the paper to be any good?
It isn’t easy to capture these things,
and there is a risk of overfitting