Email Thread Reassembly Using Similarity Matching

Jen-Yuan Yeh Dept. of Computer Science National Chiao Tung University Hsinchu 30010, TAIWAN [email protected]

Aaron Harnly Dept. of Computer Science Columbia University New York 10027, USA [email protected]

Outline

• Introduction
• Related Work
• Proposed Methods
• Evaluation
• Discussion
• Conclusion

Introduction

• Email thread reassembly task
  – group messages together based on which messages are replies to which others (i.e., parent-child relationships)
• Email thread structure has been profitably employed
  – e.g., in email search, email summarization, email classification, and email visualization
  – however, thread structure is not always available

Related Work

• Zawinski (2002) used RFC 2822 headers
  – “In-Reply-To” contains the Message-ID of the parent
  – “References” contains the parent’s References followed by the parent’s Message-ID
• Wu and Oard (2005) and Zhu et al. (2005) linked messages with identical subject lines (after removal of “re:”, “fw:”, etc.)
• Klimt and Yang (2004) grouped messages that have the same subject and are exchanged among the same users (addresses)
• Lewis and Knowles (1997) applied information retrieval techniques to email threading

Approach 1: Using Microsoft’s Exchange Header “Thread-Index”

Header example:

    …
    content-class: urn:content-classes:message
    Subject: Message from Pug Winokur
    Date: Tue, 27 Mar 2001 09:20:07 -0600
    MIME-Version: 1.0
    Content-Type: application/ms-tnef; name="winmail.dat"
    X-MS-Has-Attach:
    Content-Transfer-Encoding: binary
    Thread-Topic: Message from Pug Winokur
    Thread-Index: AcC20LeUM9ZkNCLDEdWw9ABQi+MJ2Q==
    From: "\"Beth Grizzle\" @ENRON"
    To: "Fastow, Andrew S." , "Buy, Rick" , …

• Thread-Index
  – computed from message references
  – can be used for associating messages into a thread
  – but there is no public information about how it is encoded or how to decode it

Approach 1 (cont’d)

• Observations
  – the initial message has a 32-byte index ending with “==”
  – a child message has an index that starts with the same string as its parent’s, with an additional 4 or 8 bytes appended, and ends with zero or one “=”

    Email          E1        E2           E3           E4           …
    Depth          0         1            2            3            …
    Index length   L1 = 32   L2 = L1 + 4  L3 = L2 + 8  L4 = L3 + 8  … (the 4-8-8 pattern repeats)

Example:

    E1: AcGPKD4/2h3YBL/6R9Cpa1YkzGzkaQ==
    E2: AcGPKD4/2h3YBL/6R9Cpa1YkzGzkaQAkldVU
    E3: AcGPKD4/2h3YBL/6R9Cpa1YkzGzkaQAkldVUAAGA/ME=
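The prefix observation can be turned into a parent lookup. The following is a minimal sketch that takes the observation literally at the string level: it drops the trailing “=” characters and picks the candidate whose index is the longest proper prefix of the child’s index. The function names are hypothetical, and whether the prefix property holds for every Thread-Index value is not established here.

```python
from typing import List, Optional


def strip_padding(index: str) -> str:
    """Drop trailing '=' so that a parent index can be compared as a prefix."""
    return index.rstrip("=")


def find_parent(child: str, candidates: List[str]) -> Optional[str]:
    """Return the candidate whose unpadded index is the longest proper
    prefix of the child's unpadded index, or None if there is none."""
    child_core = strip_padding(child)
    best = None
    for cand in candidates:
        core = strip_padding(cand)
        is_proper_prefix = len(core) < len(child_core) and child_core.startswith(core)
        if is_proper_prefix and (best is None or len(core) > len(strip_padding(best))):
            best = cand
    return best


# The three example indices from the slide.
e1 = "AcGPKD4/2h3YBL/6R9Cpa1YkzGzkaQ=="
e2 = "AcGPKD4/2h3YBL/6R9Cpa1YkzGzkaQAkldVU"
e3 = "AcGPKD4/2h3YBL/6R9Cpa1YkzGzkaQAkldVUAAGA/ME="

print(find_parent(e3, [e1, e2]) == e2)  # E2 is the direct parent of E3
print(len(e2) - len(e1), len(e3) - len(e2))  # length increments per level
```

Choosing the longest matching prefix matters because an ancestor further up the thread (here E1 for E3) also satisfies the prefix test.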

Approach 2: Using Similarity Matching and Heuristics

• Mainly by measuring the content similarity between the quotation of a child and the unquoted part of a parent
• Exploit heuristics to reduce the search scope
  – time window
  – normalized subject line
  – sender/recipient relationships
• Pipeline: Preprocessing → Thread Reassembly → Missing Message Recovery

Preprocessing

• Duplicate message grouping
  – group duplicate messages by looking for the same subject, datetime, message body, and header information
• Datetime normalization
  – convert the timestamp of each message into the corresponding timestamp in a single common time zone
• Subject normalization
  – remove common prefixes, e.g., ‘RE:’, ‘FW:’, ‘FWD:’, etc.
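The datetime and subject normalization steps can be sketched as follows. This is an illustration under assumptions, not the authors’ code: UTC is assumed as the common time zone, the prefix list is limited to the three prefixes named on the slide, and the function names are hypothetical.

```python
import re
from datetime import timezone
from email.utils import parsedate_to_datetime

# Only the prefixes named on the slide; a real list would likely be longer.
PREFIX = re.compile(r"^\s*(?:re|fw|fwd)\s*:\s*", re.IGNORECASE)


def normalize_subject(subject: str) -> str:
    """Strip leading 'RE:', 'FW:', 'FWD:' prefixes, including stacked ones."""
    previous = None
    while previous != subject:
        previous = subject
        subject = PREFIX.sub("", subject)
    return subject.strip()


def normalize_datetime(date_header: str):
    """Parse an RFC 2822 Date header and convert it to UTC (assumed zone)."""
    return parsedate_to_datetime(date_header).astimezone(timezone.utc)


print(normalize_subject("RE: FW: new PO available"))
print(normalize_datetime("Tue, 27 Mar 2001 09:20:07 -0600").isoformat())
```

The loop in `normalize_subject` handles stacked prefixes such as “RE: FW:”, which a single substitution would only partly remove.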

Preprocessing (cont’d)

• Sender/recipient identification and normalization
  – a pair of email addresses is identified as belonging to the same individual if it meets any of the following:
    • in the same email, one address is in the ‘From’ header and the other in the ‘Exchange-From’ header
    • both addresses are in ‘From’ headers of different emails in a ‘Sent Mail’ folder
    • the addresses are labeled with the same name
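The rules above produce pairs of addresses judged to belong to one person. One way to turn such pairs into per-person groups is a union-find merge; the slides do not say how the pairs are combined, so the sketch below is an assumption, and the function name and addresses are hypothetical.

```python
def group_aliases(pairs):
    """Merge address pairs into groups, one group per presumed individual."""
    parent = {}

    def find(addr):
        # Register unseen addresses as their own root, then walk to the root
        # while shortening the path.
        parent.setdefault(addr, addr)
        while parent[addr] != addr:
            parent[addr] = parent[parent[addr]]
            addr = parent[addr]
        return addr

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    groups = {}
    for addr in list(parent):
        groups.setdefault(find(addr), set()).add(addr)
    return list(groups.values())


# Hypothetical pairs, as the three rules might emit them.
pairs = [
    ("a@x.com", "b@x.com"),
    ("b@x.com", "c@y.com"),
    ("d@z.com", "e@z.com"),
]
print(sorted(sorted(group) for group in group_aliases(pairs)))
```

Merging transitively means two addresses end up in one group even when no single rule links them directly.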

Preprocessing (cont’d)

• Reply and quotation extraction
  – based on manually defined splitters (see Table 2 in the paper)
  – cases such as a reply interleaved with quoted material were not handled (they are quite rare in the Enron corpus)
  – no signature identification (a signature is regarded as part of the message)
  – a small experiment showed that 98% of 1,000 randomly selected emails were separated correctly

Example:

    [Reply Part]
    -----Original Message-----                         (splitter)
    From: James Wills [email protected]@Enron
    Sent: Wednesday, November 14, 2001 1:38 PM
    To: [email protected]; [email protected]
    Subject: Re: new PO available
    [Quotation Part]
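Splitter-based extraction can be sketched as below. Only the “Original Message” splitter shown in the example is covered; the full splitter list is in Table 2 of the paper and is not reproduced here. The regular expression, the function name, and the sample body are assumptions made for illustration.

```python
import re

# One assumed splitter pattern: a line of dashes around "Original Message".
SPLITTER = re.compile(
    r"^-{2,}\s*Original Message\s*-{2,}\s*$",
    re.IGNORECASE | re.MULTILINE,
)


def split_message(body: str):
    """Return (reply_part, quotations), with the most recent quotation first."""
    parts = [part.strip() for part in SPLITTER.split(body)]
    return parts[0], parts[1:]


# Hypothetical message body with two nested quotations.
body = (
    "Sounds good.\n"
    "-----Original Message-----\n"
    "Subject: Re: new PO available\n"
    "\n"
    "Please review.\n"
    "-----Original Message-----\n"
    "Subject: new PO available\n"
    "\n"
    "The new PO is available.\n"
)
reply, quotations = split_message(body)
print(reply)
print(len(quotations))
```

The sketch leaves the quoted header lines (such as “Subject:”) inside each quotation; a fuller version would strip them before computing similarity.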

The Algorithm

• The assumptions of FindParent
  – a child message can be either a reply or a forward to at most one parent message in the existing thread
  – missing messages could exist in an email thread

Case I

Message m_j: sender s_j; recipients r_{j,l}; reply part R_j; quotations Q_{j,1}, …, Q_{j,m}
Message m_i: sender s_i; recipients r_{i,k}; reply part R_i; quotations Q_{i,1}, …, Q_{i,n}

Conditions:
  1) s_i = r_{j,l} and s_j = r_{i,k}
  2) sim(Q_{i,1}, R_j) ≥ α

Example: m_i replies to m_j
  m_j: A sends to B
  m_i: B replies to A

Case II

Message m_j: sender s_j; recipients r_{j,l}; reply part R_j; quotations Q_{j,1}, …, Q_{j,m}
Message m_i: sender s_i; recipients r_{i,k}; reply part R_i; quotations Q_{i,1}, …, Q_{i,n}

Conditions:
  1) s_i = r_{j,l}
  2) sim(Q_{i,1}, R_j) ≥ β

Example: m_i is a forward of m_j by B
  m_j: A sends to B
  m_i: B forwards to C

Case III

Message m_j: sender s_j; recipients r_{j,l}; reply part R_j; quotations Q_{j,1}, …, Q_{j,m}
Message m_i: sender s_i; recipients r_{i,k}; reply part R_i; quotations Q_{i,1}, …, Q_{i,n}

Conditions:
  1) s_i = s_j
  2) sim(Q_{i,1}, R_j) ≥ β

Example: m_i is a forward of m_j by A
  m_j: A sends to B
  m_i: A forwards to C

Case IV

Message m_j: sender s_j; recipients r_{j,l}; reply part R_j; quotations Q_{j,1}, …, Q_{j,m}
Missing message(s) between m_j and m_i
Message m_i: sender s_i; recipients r_{i,k}; reply part R_i; quotations Q_{i,1}, …, Q_{i,n}

Conditions:
  1) sim(Q_{i,p}, R_j) ≥ γ or sim(Q_{i,p}, Q_{j,t}) ≥ γ

Example: at least one missing message between m_i and m_j
  m_j: A sends to B
  missing: B replies to A
  m_i: A forwards to C
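The conditions of Cases I to IV can be read as a classifier over a (child, parent) candidate pair. The sketch below is one such reading, not the authors’ FindParent. The similarity function is an assumed stand-in (Jaccard overlap of word sets), since the slides do not specify the measure. The order in which the cases are tried and the `Message` class are also assumptions. The default thresholds of 0.9 follow the settings reported in the results.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set


@dataclass
class Message:
    sender: str
    recipients: Set[str]
    reply: str                                        # unquoted part R
    quotes: List[str] = field(default_factory=list)   # Q_1..Q_n, newest first


def sim(a: str, b: str) -> float:
    """Assumed similarity: Jaccard overlap of lowercased word sets."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def match_case(child: Message, parent: Message,
               alpha: float = 0.9, beta: float = 0.9,
               gamma: float = 0.9) -> Optional[str]:
    """Return 'I', 'II', 'III', or 'IV' if the pair meets that case, else None."""
    first_quote = child.quotes[0] if child.quotes else ""
    direct = sim(first_quote, parent.reply)
    child_got_parent = child.sender in parent.recipients
    if child_got_parent and parent.sender in child.recipients and direct >= alpha:
        return "I"    # reply: senders and recipients are swapped
    if child_got_parent and direct >= beta:
        return "II"   # forward by a recipient of the parent
    if child.sender == parent.sender and direct >= beta:
        return "III"  # forward by the parent's own sender
    for quote in child.quotes:
        if sim(quote, parent.reply) >= gamma or any(
            sim(quote, pq) >= gamma for pq in parent.quotes
        ):
            return "IV"  # linked through a deeper quotation: messages missing
    return None


question = "Will you be at the meeting?"
m_j = Message("A", {"B"}, question)
print(match_case(Message("B", {"A"}, "Yes.", [question]), m_j))
print(match_case(Message("A", {"C"}, "FYI", ["Yes.", question]), m_j))
```

In the second call the child’s first quotation does not match the parent, but a deeper one does, which is the Case IV pattern of an intervening missing message.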

Case V


Missing Message Recovery

Assumptions: parent m_j, child m_i, and n missing messages m_{i+1}, …, m_{i+n}

• If a sequence of quoted text q = {q_1, …, q_{n+1}} can be found in m_i such that q_{n+1} is highly similar to the nonquoted text of m_j,
• then the sequence of quoted text q is assumed to contain a portion of each missing message

Example (n = 2):
  Thread: m_j → missing node m_{i+2} → missing node m_{i+1} → m_i
  m_i consists of R_i, q_1, q_2, q_3
  If q_3 = R_j, then q_1 is regarded as m_{i+1}’s body and q_2 is regarded as m_{i+2}’s body
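The recovery rule can be sketched as a scan over the child’s quotations: the first quotation that closely matches the parent’s unquoted text marks the parent, and every quotation before it is taken as the body of a missing message. The similarity function (Jaccard overlap of word sets), the 0.9 threshold for “highly similar”, and the function names are assumptions; the slides do not specify them.

```python
from typing import List, Optional


def sim(a: str, b: str) -> float:
    """Assumed similarity: Jaccard overlap of lowercased word sets."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def recover_missing(child_quotes: List[str], parent_reply: str,
                    threshold: float = 0.9) -> Optional[List[str]]:
    """Return the bodies of the missing messages between parent and child,
    nearest to the child first, or None if no quotation matches the parent."""
    for position, quote in enumerate(child_quotes):
        if sim(quote, parent_reply) >= threshold:
            # Quotations before the matching one are q_1..q_n.
            return child_quotes[:position]
    return None


# n = 2: q_3 matches the parent's unquoted text R_j.
quotes = ["See you there.", "Yes.", "Will you be at the meeting?"]
print(recover_missing(quotes, "Will you be at the meeting?"))
```

An empty list means the child quotes the parent directly, so nothing is missing between them.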

Missing Message Recovery (cont’d)

When a missing message has multiple children:
• Partial quotation assumption (Carenini et al., 2005)
  – the children are siblings, i.e., children of a single missing message
• Complete quotation assumption (this work)
  – the children are “cousins”, i.e., children of distinct missing messages

[Figure: the two assumptions illustrated on a thread rooted at “Will you be at the meeting?”, containing a missing message and the messages “Yes.”, “No.”, “See you there.”, and “Too bad.”; panels: Partial quotation, Complete quotation]

The Enron Corpus

• Raw data
  – downloaded from the website
  – 1,361,403 messages
  – 158 mailboxes owned by 149 people
• After cleaning
  – 269,257 unique messages
  – on average, 1,704 messages per mailbox (max: 16,727; min: 2)
  – a large number of emails belong to a small group of users: 34.6% (93,187) of the messages belong to the 10 largest mailboxes

Evaluation Metric

• No explicit gold standard thread structure information
  – use threads created by Approach 1 as a gold standard
• Test set: 3,705 threads
• Recall as the metric

Example:
  Gold standard: (A, C), (A, G), (B, C), (B, G), (A, D), (A, E), (B, D), (B, E)
  Similarity matching: (A, C), (B, C), (A, D), (A, E), (B, D), (B, E)
  R = 6/8 = 0.75
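The recall computation in the example can be reproduced directly from the two sets of pairs. This is a small sketch; the function name is hypothetical.

```python
def pair_recall(gold, predicted):
    """Fraction of gold-standard pairs that also appear in the prediction."""
    gold, predicted = set(gold), set(predicted)
    return len(gold & predicted) / len(gold)


gold = [("A", "C"), ("A", "G"), ("B", "C"), ("B", "G"),
        ("A", "D"), ("A", "E"), ("B", "D"), ("B", "E")]
similarity_matching = [("A", "C"), ("B", "C"), ("A", "D"),
                       ("A", "E"), ("B", "D"), ("B", "E")]

print(pair_recall(gold, similarity_matching))  # 6 of 8 gold pairs found: 0.75
```

Recall alone does not penalize extra pairs in the prediction, which fits a setting where the gold standard from Approach 1 may itself be incomplete.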

Results

• Settings for Approach 2
  – time window: 14 days
  – α, β, γ: 0.9

Thread Statistics

• 32,910 email threads, consisting of 95,259 unique messages
• Mean thread size: 3.14
• Median thread size: 2
• Mean thread depth: 1.71

Thread Statistics (cont’d)

• The number of children of a message was only very weakly correlated with the number of recipients (r = 0.0395, p << 0.001)
• 7.3% (8,077/103,183) of thread nodes are missing messages
  – 4,850 messages were recovered
• 7.4% (359/4,850) of nodes contain more than one distinct recovered message
  – generated 430 additional sibling nodes

Discussion: Approach 1

• Advantages
  – simple to implement
  – never makes a “false positive” inference
• Disadvantages
  – doesn’t necessarily reflect the structure of topic relations
  – the Thread-Index header is not always available
  – suffers “false negatives” in a common case: external exchange

Discussion: Approach 2

• Advantages
  – general applicability, even when there is no header
  – capability to recover missing messages
• Disadvantages
  – doesn’t necessarily reflect the structure of topic relations
  – potential for false positives when the parent message is short
  – suffers false negatives if there is no quoted material in the child messages

Approach 1 vs. Approach 2

• Impact of normalized subjects
• Missing messages

Small Manual Evaluation

• 20 randomly selected initial root messages
  – manually constructed 20 threads as a gold standard
• Mean average recall
  – Approach 1: 0.7475
  – Approach 2: 0.9338

Conclusion

• Two methods for email thread reassembly were proposed
  – the first exploits the Microsoft Exchange protocol
  – the second links messages by similarity matching between the quoted material of a child message and the unquoted part of a parent message
• Both approaches aim to reconstruct parent-child relationships formed by replying or forwarding
  – they might not shed adequate light on the topic structure of a thread
  – Approach 2 may be extended to address topic structure by using more sophisticated lexical cohesion measures
• A combination of both approaches is an obvious possibility