
Opinion Fraud Detection in Online Reviews using Network Effects
Leman Akoglu
Stony Brook University
[email protected]
Rishi Chandy
Carnegie Mellon University
[email protected]
Christos Faloutsos
Carnegie Mellon University
[email protected]
Datasets
I) SWM: all app reviews in the entertainment category (games, news, sports, etc.) from an anonymous online app store database. As of June 2012:
* 1,132,373 reviews
* 966,842 users
* 15,094 software products (apps)
Ratings range from 1 (worst) to 5 (best).
II) Also simulated fake-review data (with ground truth).

Problem Formulation: A Collective Classification Approach
The objective function utilizes pairwise Markov Random Fields (Kindermann & Snell, 1980): node labels are treated as random variables, edges carry the review signs, and the objective combines prior (belief) potentials on the nodes with compatibility potentials between neighboring node labels that depend on the edge sign.
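For concreteness, the pairwise-MRF objective these terms describe can be sketched as below; the notation is a reconstruction consistent with the description above rather than a formula copied from the poster, with $\phi_i$ the prior potential of node $Y_i$ and $\psi^{s}_{ij}$ the compatibility potential for an edge of sign $s \in \{+,-\}$:

\[
P(\mathbf{y}) \;=\; \frac{1}{Z} \prod_{Y_i \in \mathcal{Y}} \phi_i(y_i) \;\prod_{(Y_i, Y_j, s) \in \mathcal{E}} \psi^{s}_{ij}(y_i, y_j)
\]

where $Z$ is the normalization constant and each $y_i$ ranges over the type-specific classes (e.g., honest/fraudster for users, good/bad for products).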
Competitors
Compared against two iterative classifiers, both modified to handle signed edges:
I) Weighted-vote Relational Classifier (wv-RC) (Macskassy&Provost, 2003)
II) HITS (honesty-goodness in mutual recursion) (Kleinberg, 1999)
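As a sketch of what "modified to handle signed edges" could look like for the second baseline, the snippet below runs an honesty-goodness mutual recursion in which a thumbs-down vote flips the sign of the contribution; the update rule and normalization are illustrative assumptions, not the exact adaptation used in the experiments.

# Sketch (assumption): honesty-goodness mutual recursion over signed
# review edges, in the spirit of a HITS adaptation.
# reviews: list of (user, product, sign) with sign in {+1, -1}.

from collections import defaultdict

def honesty_goodness(reviews, iters=50):
    honesty = defaultdict(lambda: 1.0)
    goodness = defaultdict(lambda: 1.0)
    for _ in range(iters):
        # A product looks good when honest users vote it up (or dishonest
        # users vote it down); a user looks honest when their signed votes
        # agree with product goodness.
        g = defaultdict(float)
        for u, p, s in reviews:
            g[p] += s * honesty[u]
        h = defaultdict(float)
        for u, p, s in reviews:
            h[u] += s * g[p]
        # L2-normalize both score vectors to keep the iteration bounded.
        for cur, new in ((goodness, g), (honesty, h)):
            norm = sum(v * v for v in new.values()) ** 0.5 or 1.0
            cur.clear()
            cur.update({k: v / norm for k, v in new.items()})
    return honesty, goodness

The wv-RC baseline can be adapted analogously, with a negative edge flipping the neighbor's class vote before the weighted average is taken.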
Which reviews do/should you trust?

A Fake-Review(er) Detection System
Desired properties that such a system should have:
Property 1: Network effects
Fraudulence of reviews/reviewers is revealed in relation to others, so the review network should be used.
Property 2: Side information
Behavioral clues (e.g., login times) and linguistic clues (e.g., use of capital letters) should be exploited.
Property 3: Un/Semi-supervision
Methods should not expect a fully labeled training set (human labelers are at best close to random).
Property 4: Scalability
Methods should be (sub)linear in data/network size.
Property 5: Incremental
Methods should compute fraudulence scores incrementally with the arrival of data (hourly/daily).

Inference
Finding the best assignments is the inference problem, which is NP-hard for general graphs. We use a computationally tractable (linearly scalable with network size) approximate inference algorithm called Loopy Belief Propagation (LBP) (Pearl, 1982).
signed Inference Algorithm (sIA):
I) Repeat for each node: an iterative process in which neighboring variables "talk" to each other by passing messages: "I (variable x1) believe you (variable x2) belong in these states with various likelihoods..."
II) At convergence (when consensus is reached): calculate each node's belief.
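A minimal sketch of such signed message passing on the bipartite user-product graph is given below; the compatibility tables, uniform priors, and fixed iteration count are illustrative assumptions rather than the parameters used in the paper, and one review per (user, product) pair with non-colliding ids is assumed.

# Minimal sketch of signed loopy belief propagation (in the spirit of sIA)
# on a bipartite user-product review graph. User states: honest/fraud;
# product states: good/bad. All numeric values are illustrative.

from math import prod

def signed_lbp(reviews, iters=20):
    """reviews: list of (user, product, sign) with sign in {+1, -1}."""
    eps = 0.1
    # Sign-dependent compatibility psi[sign][(user_state, product_state)]:
    # a "+" review ties honest users to good products, a "-" review ties
    # honest users to bad products; fraudsters are left uninformative.
    psi = {
        +1: {("honest", "good"): 1 - eps, ("honest", "bad"): eps,
             ("fraud", "good"): 0.5, ("fraud", "bad"): 0.5},
        -1: {("honest", "good"): eps, ("honest", "bad"): 1 - eps,
             ("fraud", "good"): 0.5, ("fraud", "bad"): 0.5},
    }
    u_states, p_states = ("honest", "fraud"), ("good", "bad")
    prior = 0.5  # unsupervised setting: uninformative node priors

    nbrs_u, nbrs_p = {}, {}
    for u, p, s in reviews:
        nbrs_u.setdefault(u, []).append((p, s))
        nbrs_p.setdefault(p, []).append((u, s))

    def normalize(d):
        z = sum(d.values()) or 1.0
        return {k: v / z for k, v in d.items()}

    # msgs[(src, dst)][state_of_dst], initialized uniform
    msgs = {}
    for u, p, _ in reviews:
        msgs[(u, p)] = {st: 0.5 for st in p_states}
        msgs[(p, u)] = {st: 0.5 for st in u_states}

    for _ in range(iters):
        new = {}
        for u, p, s in reviews:
            # user -> product: sum over the user's states, multiplying the
            # prior with all incoming messages except the one from p.
            out = {}
            for yp in p_states:
                out[yp] = sum(
                    prior * psi[s][(yu, yp)] *
                    prod(msgs[(q, u)][yu] for q, _ in nbrs_u[u] if q != p)
                    for yu in u_states)
            new[(u, p)] = normalize(out)
            # product -> user: symmetric update.
            out = {}
            for yu in u_states:
                out[yu] = sum(
                    prior * psi[s][(yu, yp)] *
                    prod(msgs[(v, p)][yp] for v, _ in nbrs_p[p] if v != u)
                    for yp in p_states)
            new[(p, u)] = normalize(out)
        msgs = new

    # At convergence: belief = prior times product of all incoming messages.
    beliefs = {}
    for u, edges in nbrs_u.items():
        beliefs[u] = normalize({yu: prior * prod(msgs[(p, u)][yu] for p, _ in edges)
                                for yu in u_states})
    for p, edges in nbrs_p.items():
        beliefs[p] = normalize({yp: prior * prod(msgs[(u, p)][yp] for u, _ in edges)
                                for yp in p_states})
    return beliefs  # e.g. beliefs[user]["fraud"] acts as a fraudulence score

Sorting users by their "fraud" belief after convergence yields the kind of fraudulence ranking referred to under Scoring below.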
Problem Statement
A network classification problem. Given:
- the user-product review network (bipartite)
- review sentiments (+: thumbs-up, -: thumbs-down)
classify network objects into type-specific classes:
- users: `honest' / `fraudster'
- products: `good' / `bad'
- reviews: `genuine' / `fake'

Performance on simulated data: (figure, from left to right) sIA, wv-RC, HITS.

Real-data Results
Top 100 users and their product votes (figure legend: + = 4-5 rating, o = 1-2 rating): "bot" members?
Scoring: the computed beliefs are used as fraudulence scores to rank users, products, and reviews.
Top-scorers matter: (figure panels: Before, After)

Conclusions
Novel framework that exploits network effects to automatically spot fake review(er)s.
• Problem formulation as collective classification in bipartite networks
• Efficient scoring/inference algorithm to handle signed edges
• Desirable properties: i) general, ii) un/semi-supervised, iii) scalable
• Experiments on real & synthetic data: outperforms competitors and finds real fraudsters.