Transcript Predicting Market - Artificial Intelligence Laboratory
Predicting Market Movements: From Breaking News to Emerging Social Media
Dr. Hsinchun Chen
Director, Artificial Intelligence Lab, University of Arizona
[email protected]
http://ai.arizona.edu
Acknowledgements: NSF CRI; NSF EXP-LA; DOD DTRA, CTFP, NPS; (ARFL WMD, CIA, FBI)
PREDICTING MARKET MOVEMENTS
Predicting Markets
Markets: international markets, emerging markets, import/export markets, financial market, stock market, commodity market, retail market
Economics (macro), international relations (trade, geopolitics), finance (international/banking/stock), accounting (market return), marketing (sales/retailing)
US (NSF SBE, social behavioral economics; governments, think tanks), Europe/Asia
Business school research is not science (cannot be funded by NSF in US)!
Economics, finance, accounting, political science, social science, marketing, computer science (small, no funding in US!), MIS (business intelligence)
Geopolitical/econ/finance/accounting models/theories, market metrics/parameters, analytical techniques, results interpretations, predicting markets
EMH (efficient market hypothesis), RWT (random walk theory), CAPM (capital asset pricing model), quant/algorithmic trading
Research Opportunities
Sophisticated econ/finance/accounting/marketing models/theories, established analytical techniques and metrics (numeric), abundant structured databases (financial metrics, economic indicators, stock quotes)
New, diverse unstructured (text) web-enabled business data sources, e.g., 10K/10Q SEC reports, mass media news, local news, Internet news, financial blogs, investor forums, tweets…
Topic extraction, named entity recognition, sentiment/affect analysis, multilingual language models, social network analysis, statistical machine learning, temporal data/text mining, time series analysis…
Nerds on Wall Street
“Future technological stars…(1) Advanced electronic market tools; (2) Understanding both quantitative and qualitative information…”
“The Text Frontier, Collective Intelligence, Social Media, and Market Monitors”
“Stocks are stories, bonds are mathematics.”
David Leinweber, 2009
AZ BIZ INTEL:
BUSINESS MASS MEDIA, SOCIAL MEDIA, TEXT ANALYTICS, SENTIMENT ANALYSIS, SPIKE DETECTION, FINANCE/ACCOUNTING/MARKETING MODELING, PREDICTING MARKET MOVEMENTS
Business Intelligence & Analytics
• $3B BI revenue in 2009 (Gartner, 2006)
• The Data Deluge (The Economist, March 2010); internet traffic 667 exabytes by 2013 (Cisco); total amount of information in 2010, 1.2 zettabytes (KB-MB-GB-TB-PB-EB-ZB-YB)
• $9.4B BI software M&A spending in 2010 and $14.1B by 2014 (Forrester)
• IBM spent $14B in BI in five years; $9B BI revenue in 2010 (USA Today, November 2010); 24 acquisitions, 10,000 BI software developers, 8,000 BI consultants, 200 BI mathematicians
Acquired i2/COPLINK in 2011
Business Intelligence & Analytics
• BI: “skills, technologies, applications, and practices used to help an enterprise better understand its business and market.”
• Technologies: data warehousing; Extraction, Transformation, and Load (ETL); Business Performance Management (BPM); visual dashboards; and advanced knowledge discovery using data and text mining
• BI 2.0: web intelligence, web analytics, web 2.0, social media analytics, opinion mining; cloud computing and web services; real-time monitoring and mining; enterprise performance (marketing/accounting/finance/healthcare)
AZ BIZ INTEL
• Mass media, social media contents
• Text & social media analytics techniques
• Finance/accounting/marketing models (Tetlock/Columbia, Antweiler/UBC, Das/Santa Clara)
NYU (Dhar), Arizona (Dhaliwal, Kelly, Jiang, Lusch, Yong), National Taiwan U (Li, Hong, Lu)
• Bag of words, named entities, proper nouns, topics (1-, 2-, 3-grams)
• Sentiment/valence, lexicons, machine learning, stakeholder analysis, EFLS analysis
• Time series models, spike detection, decaying function, trading windows, targeted sentiment
• Econometrics/regression models (R-squared, p-value), 10-fold validation (F, accuracy), simulated trading (cost, frequency, exit)
AZ ONLINE WOM
AZ WOM: events, volume, sentiment
Data Collection
Yahoo! Movie Parsing Messages Sales Data Professional Evaluation Firms Strategy
Data Processing
OpinionFinder SentiWordNet
Measures and Metrics Online WOM measures
Number of messages Number of sentences Valence Subjectivity Number of valence words
New-product performance metrics
Opening-week box office sales Total box office sales Opening strength Longevity Professional evaluation
Statistical Analysis Online WOM evolution
Correlation between different WOM measures Correlation of WOM measure across new product lifecycle
Correlation between online WOM and product performance
Correlation between online WOM measures and new-product performance across the whole new-product lifecycle
Results
Evolution of online WOM through new-product lifecycle
• WOM communication starts early in preproduction, becomes highly active before movie release, then diminishes gradually
• Valence has a clear decreasing trend over time, indicating that WOM becomes more negative after movie release
• Subjectivity, number of sentences, and number of valence words stay stable over time
IT’S THE BUZZ!
AZ STOCK TRACKER I & II
Literature Review: Stock Performance Prediction
Theoretical perspectives on stock behavior
Efficient market hypothesis (Fama 1964)
• Price of a stock reflects all available information
• Market reacts instantaneously; impossible to outperform
Random walk theory (Malkiel 1973)
• Price of a stock varies randomly over time
• Future prediction, outperforming the market is impossible
Pessimistic assessments of the predictability of stock behavior refuted through empirical studies
• Lo and MacKinlay 1988; Jaffe et al. 1989; Pesaran and Timmermann 1995
Literature Review: Stock Performance Prediction
Predominant approaches to stock prediction
Fundamentalists utilize fundamental and financial measures of economy, industry, and firm
• Economy and sector indicators, financial ratios of the firm
• Fama-French three-factor model (Fama and French 1993): market return, market capitalization, book-to-market ratio
• Currency exchange rates, interest rates, dividends
Technicians utilize historical time-series information of the stock and market behavior
• Historical price, volatility, trading volume
Various machine learning models applied
• Regression, ANN, ARIMA, support vector machines
Literature Review: Stock Performance Prediction
In addition to financial and stock variables, researchers have incorporated firm-related news article measures
• Developed trend-based language models for news articles (Lavrenko et al. 2000)
• Categorized press releases as good, bad, or neutral (Mittermayer 2004)
• Examined various textual representations of news articles (Schumaker and Chen 2009a; 2009b)
But few have incorporated firm-related web forums
Thomas and Sycara (2000) utilize text classifications of discussions on Raging Bull to inform stock trading strategies
Literature Review: Firm-Related Web Forums and Stock
Studies relating web forums and stock behavior
Examined firm-related web forums on major web portals
Early studies focused on activity, without content analysis
• Supported market efficiency; only concurrent relationships identified (Wysocki 1998; Tumarkin and Whitelaw 2001)
• Subsequently challenged; forum activity predicted stock behavior (Antweiler and Frank 2002; 2004; Das and Chen 2007)
Analysis advanced to measure opinions in discussions
• ‘Bullishness’ classifiers to distinguish investment positions (Antweiler and Frank 2004; Das and Chen 2007)
• Classified buy, hold, or sell positions with 60–70% accuracy
• Identified predictive relationships between forum discussion sentiment and subsequent stock returns, volatility, trading volume
Shortcomings
• Retrospective analyses, shareholder perspective of major forums
AZ FinText: numbers + text
• Techniques: bag of words, named entities, proper nouns, past stock prices + SVR
• Testbed: S&P 500, 5 weeks, Oct-Nov 2005, 2,809 news articles, 10M stock quotes, GICS industry classification
• Evaluation: return, vs. quant funds; 20-minute prediction
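The "numbers + text" representation listed above can be sketched as follows. This is only an illustration, assuming a plain bag-of-words count vector with the stock price at article release appended; the tokenizer, the vocabulary, and the function names are hypothetical and are not taken from AZ FinText. Vectors of this kind would then be passed to a regressor such as SVR.

```python
import re
from collections import Counter

def bag_of_words(text):
    """Lower-case the text and count alphabetic tokens."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def feature_vector(text, vocabulary, price_at_release):
    """Term counts over a fixed vocabulary, followed by the stock price."""
    counts = bag_of_words(text)
    return [float(counts[term]) for term in vocabulary] + [float(price_at_release)]

# Hypothetical article and vocabulary, for illustration only.
vocab = ["profit", "loss", "guidance"]
x = feature_vector("Profit rises as profit margins widen", vocab, 42.5)
print(x)  # [2.0, 0.0, 0.0, 42.5]
```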
AZ FinText in the news
Thursday, June 10, 2010
AI That Picks Stocks Better Than the Pros
A computer science professor uses textual analysis of articles to beat the market.
WSJ
Technology News and Insights June 21, 2010, 1:45 PM ET
Using Artificial Intelligence to Digest News, Trade Stocks
AZ STOCK TRACKER I: mass, social media, topic, volume, sentiment
Data collection
Online news Web Forums
Spider/ Parser
Database
Topic extraction
Mutual information phrase extractor
Discussion topics
Sentiment identification
Sentiment grader Sentiment aggregator
Message sentiments
Conversation analysis
Topic Traffic dynamics Topic correlation and evolution Sentiment correlation and evolution Active topics and sentiments Market prediction
Message
User-Generated Contents (UGC): Conversations of 30,000 Wal-Mart Constituents and 500,000 Responses
Data sources
Source | Duration | # of Threads | # of Messages | # of Users
Wall Street Journal - WalMart-related News (WSJ) | Aug 1999 - Mar 2007 | N/A | 4,081 | 657
Yahoo! Finance - WalMart Message Board (YAHOO) | Jan 1999 - Jun 2008 | 139,062 | 441,954 | 25,500
Walmart-blows Forum - Employee Department Board (EMP) | Dec 2003 - Oct 2008 | 7,440 | 102,240 | 2,930
Walmart-blows Forum - WalMart Sucks Board (WSB) | Nov 2003 - Nov 2008 | 1,354 | 19,624 | 1,855
Wakeupwalmart Forum - General WalMart Discussion Board (GDB) | Aug 2005 - Nov 2008 | 2,136 | 23,940 | 967
Post Dynamics
[Figure: post volume by year, 1999-2008, for WSJ, YAHOO, EMP, WSB, and GDB; two vertical scales, 0-320 and 0-16,000]
Sentiment Trend
[Figure: sentiment by year, 1999-2008, two panels, for WSJ, YAHOO, EMP, WSB, and GDB; vertical axis from -0.04 to 0.01]
Market Modeling
Correlation Return Volatility Trading Volume Return Volatility Trading Volume Sentiment Disagreement
1 0.0348
0.0338
1 1
Message Volume Message Length Subjectivity
-0.0507
-0.3186
0.0473
-0.03578
0.3131
-0.1840
Sentiment One Day Lag Disagreement One Day Lag Message Volume One Day Lag Message Length One Day Lag Subjectivity One Day Lag
-0.0527
-0.3433
0.0859
-0.0475
0.3026
-0.1795
-0.0425
Correlation coefficients with p<0.10 are shown (two-tailed test)
Correlation
• Sentiment expressed in the forum contemporaneously correlates significantly with stock return
• Disagreement, volume, and length expressed in the forum also hold significant correlations with volatility and trading volume
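The contemporaneous and one-day-lag correlations on this slide can be illustrated with a short sketch. It assumes plain Pearson correlation on daily series; the function names and the toy series are hypothetical, and the significance test (p < 0.10) is not reproduced here.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length series."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("series must have equal length >= 2")
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation undefined for a constant series")
    return sxy / sqrt(sxx * syy)

def lagged_correlation(forum, market, lag=1):
    """Correlate the forum measure on day t - lag with the market measure on day t."""
    if lag == 0:
        return pearson(forum, market)
    return pearson(forum[:-lag], market[lag:])

# Hypothetical daily series, for illustration only.
sentiment = [0.10, -0.20, 0.05, 0.30, -0.10]
log_volume = [1.00, 0.80, 1.40, 0.90, 0.70]
print(lagged_correlation(sentiment, log_volume, lag=0))
print(lagged_correlation(sentiment, log_volume, lag=1))
```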
Market Predictive Results (cont’d)
Overall Forum Return t Volatility t Trading Volume t Market t Sentiment
0.8723*** (31.33) -0.0010
(-0.25) 0.7627*** (15.06) 0.0025
(0.31) 0.0074
(0.47) -0.4275** (-2.06)
t-1 Disagreement t-1 Message Volume t-1
0.0000
(0.04) -0.0023*** (-4.94) 0.0140** (2.29) -0.0007** (-2.29) -0.0122*** (-19.09) 0.1957*** (23.18) Note: *p<0.10;**p<0.05;***p<0.01
Message Length t-1 Subjectivity t-1
0.0002
(1.42) 0.0030*** (7.82) -0.0668*** (-13.24) 0.0015
(1.46) 0.0149*** (7.27) -0.3014*** (-11.11) • •
Predictive regression (t-1)
• The significant measures of forum discussions identified in contemporaneous regressions maintain their significance in the predictive regression models
• Additionally, sentiment expressed in the web forum holds a significant relationship with the trading volume on the following day
• Positive sentiment reduces trading volume; negative sentiment induces trading activity
AZ STOCK TRACKER II: stakeholder analysis
Experimental Design: Description of Prediction Models
Variable | Description
Dependent:
RETURN t | Stock return on day t (log difference of share price)
Fundamental:
FFSIZE | Fama-French firm size (prior year; market capitalization = share price * shares outstanding)
FFBTM | Fama-French book-to-market ratio (prior year; book value / market value of shares)
FFMARKET t-1 | Fama-French market return on day t – 1 (log difference of S&P 500 index price)
FFMARKET t-2 | Fama-French market return on day t – 2 (log difference of S&P 500 index price)
Technical:
RETURN t-1 | Stock return on day t – 1 (log difference of share price)
RETURN t-2 | Stock return on day t – 2 (log difference of share price)
VOLATILITY t-1 | Stock price volatility on day t – 1 (volatility modeled using GARCH(1,1))
VOLATILITY t-2 | Stock price volatility on day t – 2 (volatility modeled using GARCH(1,1))
VOLUME t-1 | Stock trading volume on day t – 1 (in log)
VOLUME t-2 | Stock trading volume on day t – 2 (in log)
DAY d t | Dummy variables for trading day of the week on day t
t = days (t = 1, 2, …, n); day of the week (d = 1, …, 4)
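A minimal sketch of two of the technical variables: log-difference returns and a GARCH(1,1) conditional-variance recursion. The slides state only that volatility was modeled with GARCH(1,1); the parameters `omega`, `alpha`, and `beta` would normally be estimated by maximum likelihood and are simply passed in here. The zero-mean-return simplification and the choice to start the recursion at the sample variance are assumptions.

```python
from math import log

def log_returns(prices):
    """Return on day t as the log difference of consecutive share prices."""
    return [log(p1 / p0) for p0, p1 in zip(prices, prices[1:])]

def garch11_variance(returns, omega, alpha, beta):
    """sigma2[t] = omega + alpha * r[t-1]**2 + beta * sigma2[t-1].

    Assumes zero-mean returns; sigma2[0] is set to the sample variance.
    """
    n = len(returns)
    mean = sum(returns) / n
    sigma2 = [sum((r - mean) ** 2 for r in returns) / n]
    for r in returns[:-1]:
        sigma2.append(omega + alpha * r * r + beta * sigma2[-1])
    return sigma2

# Hypothetical prices and parameters, for illustration only.
prices = [47.10, 47.35, 46.90, 47.05, 47.60]
r = log_returns(prices)
print(garch11_variance(r, omega=1e-6, alpha=0.08, beta=0.90))
```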
Experimental Design: Description of Prediction Models
Variable | Description
Forum:
MESSAGES t-1 | Number of messages posted in the forum on day t – 1 (in log (1 + messages))
LENGTH t-1 | Average length of messages posted in the forum on day t – 1 (in number of sentences)
SENTI t-1 | Average sentiment of messages posted in the forum on day t – 1
VARSENTI t-1 | Variance in sentiment of messages posted in the forum on day t – 1
SUBJ t-1 | Average subjectivity of messages posted in the forum on day t – 1
VARSUBJ t-1 | Variance in subjectivity of messages posted in the forum on day t – 1
Stakeholder:
MESSAGES s t-1 | Number of messages posted by stakeholder cluster s on day t – 1 (in log (1 + messages))
LENGTH s t-1 | Average length of messages posted by stakeholder cluster s on day t – 1 (in number of sentences)
SENTI s t-1 | Average sentiment of messages posted by stakeholder cluster s on day t – 1
VARSENTI s t-1 | Variance in sentiment of messages posted by stakeholder cluster s on day t – 1
SUBJ s t-1 | Average subjectivity of messages posted by stakeholder cluster s on day t – 1
VARSUBJ s t-1 | Variance in subjectivity of messages posted by stakeholder cluster s on day t – 1
t = days (t = 1, 2, …, n); stakeholder clusters (s = 1, 2, …, c)
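The six forum measures can be computed per day (or per stakeholder cluster and day) roughly as below. The input layout, the use of population variance, and the zero defaults for days without messages are assumptions made for this sketch; the slides do not specify them.

```python
from math import log
from statistics import mean, pvariance

def forum_measures(messages):
    """messages: list of (n_sentences, sentiment, subjectivity) tuples for one day."""
    n = len(messages)
    if n == 0:
        # Assumption: a day with no messages gets zeros for every measure.
        return {k: 0.0 for k in
                ("MESSAGES", "LENGTH", "SENTI", "VARSENTI", "SUBJ", "VARSUBJ")}
    lengths = [m[0] for m in messages]
    senti = [m[1] for m in messages]
    subj = [m[2] for m in messages]
    return {
        "MESSAGES": log(1 + n),        # log(1 + number of messages)
        "LENGTH": mean(lengths),       # average length in sentences
        "SENTI": mean(senti),          # average sentiment
        "VARSENTI": pvariance(senti),  # variance in sentiment (population)
        "SUBJ": mean(subj),            # average subjectivity
        "VARSUBJ": pvariance(subj),    # variance in subjectivity (population)
    }

# Hypothetical scored messages for one day, for illustration only.
day = [(3, 0.2, 0.5), (5, -0.2, 0.7)]
print(forum_measures(day))
```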
Experimental Design: Description of Prediction Models
Baseline Model – Baseline-FF
Fundamental variables: Fama-French model
$$\mathrm{RETURN}_t = \beta_0 + \beta_1 \mathrm{FFSIZE} + \beta_2 \mathrm{FFBTM} + \beta_3 \mathrm{FFMARKET}_{t-1} + \beta_4 \mathrm{FFMARKET}_{t-2} + \varepsilon_t$$
Baseline Model – Baseline-Tech
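Technical variables: lagged stock returns, volatility, trading volume, day-of-week dummies
$$\mathrm{RETURN}_t = \beta_0 + \beta_1 \mathrm{RETURN}_{t-1} + \beta_2 \mathrm{RETURN}_{t-2} + \beta_3 \mathrm{VOLATILITY}_{t-1} + \beta_4 \mathrm{VOLATILITY}_{t-2} + \beta_5 \mathrm{VOLUME}_{t-1} + \beta_6 \mathrm{VOLUME}_{t-2} + (\beta_7 \mathrm{DAY}_{1t} + \dots + \beta_{10} \mathrm{DAY}_{4t}) + \varepsilon_t$$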
Baseline Model – Baseline-Comp
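Comprehensive: all fundamental and technical variables
$$\mathrm{RETURN}_t = \beta_0 + \beta_1 \mathrm{FFSIZE} + \beta_2 \mathrm{FFBTM} + \beta_3 \mathrm{FFMARKET}_{t-1} + \beta_4 \mathrm{FFMARKET}_{t-2} + \beta_5 \mathrm{RETURN}_{t-1} + \beta_6 \mathrm{RETURN}_{t-2} + \beta_7 \mathrm{VOLATILITY}_{t-1} + \beta_8 \mathrm{VOLATILITY}_{t-2} + \beta_9 \mathrm{VOLUME}_{t-1} + \beta_{10} \mathrm{VOLUME}_{t-2} + (\beta_{11} \mathrm{DAY}_{1t} + \dots + \beta_{14} \mathrm{DAY}_{4t}) + \varepsilon_t$$
Where t = days (t = 1, 2, …, n); day of the week (d = 1, …, 4)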
Experimental Design: Description of Prediction Models
Forum models
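Comprehensive baseline variables plus forum-level measures
$$\mathrm{RETURN}_t = \beta_0 + \beta_1 \mathrm{FFSIZE} + \beta_2 \mathrm{FFBTM} + \beta_3 \mathrm{FFMARKET}_{t-1} + \beta_4 \mathrm{FFMARKET}_{t-2} + \beta_5 \mathrm{RETURN}_{t-1} + \beta_6 \mathrm{RETURN}_{t-2} + \beta_7 \mathrm{VOLATILITY}_{t-1} + \beta_8 \mathrm{VOLATILITY}_{t-2} + \beta_9 \mathrm{VOLUME}_{t-1} + \beta_{10} \mathrm{VOLUME}_{t-2} + (\beta_{11} \mathrm{DAY}_{1t} + \dots + \beta_{14} \mathrm{DAY}_{4t}) + \beta_{15} \mathrm{MESSAGES}_{t-1} + \beta_{16} \mathrm{LENGTH}_{t-1} + \beta_{17} \mathrm{SENTI}_{t-1} + \beta_{18} \mathrm{VARSENTI}_{t-1} + \beta_{19} \mathrm{SUBJ}_{t-1} + \beta_{20} \mathrm{VARSUBJ}_{t-1} + \varepsilon_t$$
Where t = days (t = 1, 2, …, n); day of the week (d = 1, …, 4); stakeholder clusters (s = 1, 2, …, c)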
Experimental Design: Description of Prediction Models
Stakeholder models
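Comprehensive baseline variables plus stakeholder group level forum measures
$$\mathrm{RETURN}_t = \beta_0 + \beta_1 \mathrm{FFSIZE} + \beta_2 \mathrm{FFBTM} + \beta_3 \mathrm{FFMARKET}_{t-1} + \beta_4 \mathrm{FFMARKET}_{t-2} + \beta_5 \mathrm{RETURN}_{t-1} + \beta_6 \mathrm{RETURN}_{t-2} + \beta_7 \mathrm{VOLATILITY}_{t-1} + \beta_8 \mathrm{VOLATILITY}_{t-2} + \beta_9 \mathrm{VOLUME}_{t-1} + \beta_{10} \mathrm{VOLUME}_{t-2} + (\beta_{11} \mathrm{DAY}_{1t} + \dots + \beta_{14} \mathrm{DAY}_{4t}) + (\beta_{15} \mathrm{MESSAGES}_{1,t-1} + \beta_{16} \mathrm{LENGTH}_{1,t-1} + \beta_{17} \mathrm{SENTI}_{1,t-1} + \beta_{18} \mathrm{VARSENTI}_{1,t-1} + \beta_{19} \mathrm{SUBJ}_{1,t-1} + \beta_{20} \mathrm{VARSUBJ}_{1,t-1} + \dots + \beta_{k} \mathrm{MESSAGES}_{c,t-1} + \beta_{k+1} \mathrm{LENGTH}_{c,t-1} + \beta_{k+2} \mathrm{SENTI}_{c,t-1} + \beta_{k+3} \mathrm{VARSENTI}_{c,t-1} + \beta_{k+4} \mathrm{SUBJ}_{c,t-1} + \beta_{k+5} \mathrm{VARSUBJ}_{c,t-1}) + \varepsilon_t$$
Where t = days (t = 1, 2, …, n); day of the week (d = 1, …, 4); stakeholder clusters (s = 1, 2, …, c); index k = (((c - 1) * 6) + 15)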
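The coefficient indexing of the stakeholder model can be checked mechanically. The sketch below lists the regressors in the order of the equation (14 baseline terms, then 6 forum measures per stakeholder cluster) and confirms that the first measure of the last cluster sits at index k = (c - 1) * 6 + 15. The regressor name strings are illustrative labels only.

```python
BASELINE = [
    "FFSIZE", "FFBTM", "FFMARKET_t-1", "FFMARKET_t-2",
    "RETURN_t-1", "RETURN_t-2", "VOLATILITY_t-1", "VOLATILITY_t-2",
    "VOLUME_t-1", "VOLUME_t-2", "DAY_1", "DAY_2", "DAY_3", "DAY_4",
]
FORUM_MEASURES = ["MESSAGES", "LENGTH", "SENTI", "VARSENTI", "SUBJ", "VARSUBJ"]

def stakeholder_regressors(c):
    """Regressor names in coefficient order; names[i] pairs with beta_(i+1)."""
    names = list(BASELINE)
    for s in range(1, c + 1):
        names += [f"{m}_{s}_t-1" for m in FORUM_MEASURES]
    return names

def first_index_of_last_cluster(c):
    """Index k from the slide: k = ((c - 1) * 6) + 15."""
    return (c - 1) * 6 + 15

c = 3  # hypothetical number of stakeholder clusters
names = stakeholder_regressors(c)
k = first_index_of_last_cluster(c)
print(k, names[k - 1], len(names))  # 27 MESSAGES_3_t-1 32
```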
Experimental Design: Social Media Data
A 17-month period was utilized for analysis and experimentation
• November 1, 2005 to March 31, 2007
First five months were utilized to calibrate the initial stock return prediction models
• November 1, 2005 – March 31, 2006
• Calibrated models applied for prediction during each trading day in the next month
Each subsequent month, new models were calibrated using five previous months of time-series variables, for stock return prediction during the next month of trading
In total, stock return prediction was performed daily for one year (250 trading days)
• April 1, 2006 – March 31, 2007
Forum | Messages | Discussion Threads | Stakeholders | Messages per Thread | Messages per Stakeholder
Yahoo Finance – WMT (finance.yahoo.com) | 134,201 | 40,633 | 5,533 | 3.30 | 24.25
Wal-Mart Blows (www.walmartblows.com) | 55,125 | 3,690 | 1,461 | 14.94 | 37.73
Wakeup Wal-Mart (www.wakeupwalmart.com) | 10,797 | 1,306 | 915 | 8.27 | 11.80
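The rolling calibration scheme on this slide (five months of calibration, then one month of daily prediction, repeated for twelve months) can be sketched as a schedule generator. The function names are hypothetical; only the dates come from the slide.

```python
from datetime import date, timedelta

def add_months(first_of_month, n):
    """First day of the month n months after the given first-of-month date."""
    m = first_of_month.month - 1 + n
    return date(first_of_month.year + m // 12, m % 12 + 1, 1)

def rolling_windows(start, calibration_months=5, prediction_months=12):
    """List of (calibration start, calibration end, prediction start, prediction end)."""
    one_day = timedelta(days=1)
    windows = []
    for i in range(prediction_months):
        calib_start = add_months(start, i)
        pred_start = add_months(start, i + calibration_months)
        pred_end = add_months(pred_start, 1) - one_day
        windows.append((calib_start, pred_start - one_day, pred_start, pred_end))
    return windows

windows = rolling_windows(date(2005, 11, 1))
for w in (windows[0], windows[-1]):
    print(*w)
# 2005-11-01 2006-03-31 2006-04-01 2006-04-30
# 2006-10-01 2007-02-28 2007-03-01 2007-03-31
```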
Results and Discussion
Hypothesis testing results
Hypothesis | Result
H1.1 Baseline-Comp model > Baseline-FF model | Partially supported
H1.2 Baseline-Comp model > Baseline-Tech model | Rejected
H2 Forum-level models > best baseline models | Rejected
H3.1 Stakeholder-level models > best baseline models | Supported
H3.2 Stakeholder-level models > forum-level models | Partially supported
H4.1 Social network > discussion content representation | Partially supported
H4.2 Writing style > discussion content representation | Rejected
H4.3 Social network > writing style representation | Partially supported
H5.1 ANN > OLS | Rejected
H5.2 SVR > OLS | Partially supported
H5.3 SVR > ANN | Partially supported
Results and Discussion
Wal-Mart stock return prediction model results
Baseline models using fundamental and technical variables
• Results across 250 trading days forecasted
• Baselines for simulated trading (initial investment of $10,000): holding Wal-Mart stock for the year results in $10,096; holding S&P 500 for the year results in $11,012
Model | OLS $ | OLS Accuracy | ANN $ | ANN Accuracy | SVR $ | SVR Accuracy
Baseline-FF | $9,787 | 55.20% | $9,998 | 44.40% | $9,408 | 51.20%
Baseline-Tech | $8,799 | 57.20% | $9,702 | 57.60% | $9,503 | 56.40%
Baseline-Comp | $10,763 | 54.40% | $10,418 | 56.80% | $10,645 | 56.80%
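The two evaluation columns (directional accuracy and simulated trading value) can be illustrated as follows. The trading rule is an assumption: the slides do not state it, so this sketch holds the stock on days with a positive predicted return, stays in cash otherwise, and ignores transaction costs. Returns are treated as log returns, matching the variable definitions.

```python
from math import exp

def directional_accuracy(predicted, actual):
    """Share of days on which the predicted and actual return have the same sign."""
    hits = sum((p > 0) == (a > 0) for p, a in zip(predicted, actual))
    return hits / len(actual)

def simulate_trading(predicted, actual, initial=10_000.0):
    """Assumed rule: long for the day if predicted return > 0, else in cash."""
    wealth = initial
    for p, a in zip(predicted, actual):
        if p > 0:
            wealth *= exp(a)  # apply the realised log return
    return wealth

# Hypothetical predictions and realised log returns, for illustration only.
pred = [0.01, -0.01, 0.02]
real = [0.02, -0.03, -0.01]
print(directional_accuracy(pred, real))
print(round(simulate_trading(pred, real), 2))
```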
Results and Discussion
Wal-Mart stock return prediction model results
Incorporating the Wakeup Wal-Mart web forum
Results across 250 trading days forecasted
Model | OLS $ | OLS Accuracy | ANN $ | ANN Accuracy | SVR $ | SVR Accuracy
Best Baseline | $10,763 | 57.20% | $10,418 | 57.60% | $10,645 | 56.80%
Forum | $10,367 | 57.60% | $10,397 | 59.20% | $10,303 | 59.20%
Stakeholder-SN | $9,873 | 55.20% | $10,930 | 57.20% | $10,669 | 59.20%
Stakeholder-Content | $10,689 | 60.40% | $11,595 | 60.40% | $11,976 | 61.20% *
Stakeholder-Style | $10,271 | 56.00% | $9,653 | 56.80% | $9,305 | 56.00%
Stakeholder-SN+Content | $10,384 | 61.60% | $13,066 | 60.80% | $11,866 | 62.80% **
Stakeholder-SN+Style | $10,744 | 60.00% | $10,792 | 60.40% | $11,249 | 57.60%
Stakeholder-Content+Style | $10,696 | 59.20% | $10,590 | 56.40% | $10,603 | 58.80%
Stakeholder-SN+Content+Style | $10,976 | 58.00% | $10,778 | 56.40% | $10,881 | 59.60%
Pair-wise t-test; improvement over best baseline model at * p < 0.10, ** p < 0.05
AZ STOCK TRACKER III
Introduction
Forward-looking statements (FLS) refer to
• Projections, forecasts, or other predictive statements
• Made by firm management
• Section 21E of the Securities Exchange Act (1934)
Extended forward-looking statements (EFLS)
• Statements that may have implications for a firm’s future development
• Similar to FLS, but broader
• Including information from information intermediaries (e.g., newspapers, newswires) and individuals (e.g., blogs)
Recognizing EFLS
EFLS: Extends FLS to include statements about a firm’s future performance from other sources such as financial press, analysts’ reports, and individuals
Goal | Recognition Task | Definition
EFLS Recognition | Future Timing (FT) | Primary content is about future events or states
EFLS Recognition | Explicit Uncertainty (EU) | Explicit accounts of doubt or unreliability
EFLS Recognition | Overall Assessment (ALL) | Affect decision maker’s belief about a firm’s future cash flow
EFLS Sentiment | Positive (POS) | Positive impact on the belief
EFLS Sentiment | Negative (NEG) | Negative impact on the belief
AZ STOCK TRACKER III: EFLS
Summary of Annotation Results
Category | Count | Percent | Agreement | Cohen’s Kappa
ALL | 1157 | 46% | 0.91 (0.88, 0.93) | 0.81 (0.76, 0.86)
POS | 836 | 33% | 0.90 (0.88, 0.93) | 0.79 (0.73, 0.85)
NEG | 904 | 36% | 0.89 (0.86, 0.91) | 0.77 (0.71, 0.82)
Note: 95% CIs in parentheses, from 1,000 bootstrap resamples
• Reference standard dataset: 2,539 sentences in total
• High kappa values (>0.7) support the empirical validity of the coding scheme
• Agreement upper bound: 89% to 91% (for ALL, POS, and NEG)
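Cohen's kappa and a bootstrap confidence interval of the kind reported above can be sketched as below. The percentile method and the resampling of sentence indices are assumptions; the slides say only that the intervals come from 1,000 bootstrap resamples. The label lists are hypothetical.

```python
import random

def cohen_kappa(a, b):
    """Cohen's kappa for two annotators' label lists of equal length."""
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    expected = sum((a.count(l) / n) * (b.count(l) / n) for l in set(a) | set(b))
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label
    return (observed - expected) / (1 - expected)

def bootstrap_ci(a, b, n_boot=1000, level=0.95, seed=0):
    """Percentile bootstrap interval for kappa (assumed method)."""
    rng = random.Random(seed)
    n = len(a)
    stats = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        stats.append(cohen_kappa([a[i] for i in idx], [b[i] for i in idx]))
    stats.sort()
    lo = stats[int((1 - level) / 2 * n_boot)]
    hi = stats[int((1 + level) / 2 * n_boot) - 1]
    return lo, hi

# Hypothetical annotations (1 = EFLS, 0 = not EFLS), for illustration only.
coder1 = [1, 1, 0, 0, 1, 0, 1, 0, 1, 1]
coder2 = [1, 0, 0, 0, 1, 0, 1, 1, 1, 1]
print(cohen_kappa(coder1, coder2), bootstrap_ci(coder1, coder2))
```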
Experiment 1: Sentence-Level Evaluation
Model | Accuracy † | F-Measure ‡ | Recall ‡ | Precision ‡
LASSO | 67.1% | 66.5% | 83.8% | 55.1%
ENET75 | 69.3% | 68.0% | 87.7% | 55.6%
ENET50 | 68.9% | 68.7% | 90.5% | 55.4%
ENET25 | 69.4% | 68.9% | 91.2% | 55.4%
SVM | 69.5% | 70.2% | 83.9% | 60.3%
SVM w/IG | 69.1% | 68.9% | 84.3% | 58.3%
FKC | 64.7% | 50.9% | 69.7% | 40.1%
OF_PN | 54.8% | 27.9% | 19.1% | 51.4%
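The F-measure column is the harmonic mean of precision and recall. The sketch below recomputes it from the reported precision and recall for four of the models; the results agree with the reported F-measures to within rounding. The helper names are illustrative.

```python
def precision(tp, fp):
    """Share of predicted positives that are correct."""
    return tp / (tp + fp)

def recall(tp, fn):
    """Share of actual positives that are recovered."""
    return tp / (tp + fn)

def f_measure(p, r):
    """Harmonic mean of precision and recall."""
    return 2 * p * r / (p + r)

# (precision, recall) pairs as reported in the table above.
reported = {
    "LASSO": (0.551, 0.838),
    "SVM": (0.603, 0.839),
    "FKC": (0.401, 0.697),
    "OF_PN": (0.514, 0.191),
}
for model, (p, r) in reported.items():
    print(model, round(100 * f_measure(p, r), 1))
```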
EFLS Impacts: Hypotheses Development
Theoretical framework (Easley and O’Hara, 2004)
There are $I_k$ signals for stock $k$: $(s_{k1}, s_{k2}, \dots, s_{kI_k})$
$$s_{ki} \sim N\!\left(v_k, \frac{1}{\gamma_k}\right)$$
$$(\underbrace{s_{k1}, s_{k2}, s_{k3}, \dots, s_{k(\alpha_k I_k)}}_{\text{Private Signals}},\ \underbrace{s_{k(\alpha_k I_k + 1)}, \dots, s_{k(I_k - 1)}, s_{kI_k}}_{\text{Public Signals}})$$
$\alpha_k$: The relative amount of private-versus-public information
Hypotheses Development (Cont’d.)
Hypothesis 1: Firms with lower EFLS intensity are associated with higher expected return.
$$\frac{\partial E[v_k - p_k]}{\partial \alpha_k} = \frac{\delta x_k (1 - \mu_k) I_k \gamma_k}{C_k^2 \left(1 + \alpha_k I_k \eta_k \mu_k^2 \gamma_k \sigma^{-2}\right)^2} > 0$$
Hypotheses Development (Cont’d.)
Hypothesis 2: Firms with lower EFLS intensity are associated with higher stock volatility.
The derivative $\partial \mathrm{Var}(v_k - p_k) / \partial \alpha_k$ has a closed form in terms of auxiliary quantities $V_{1,k}$ and $V_{2,k}$; its sign condition is:
$$\text{If } I_k \gamma_k > \rho_k \text{ and } \mu_k > \sqrt{2} - 1, \text{ then } \frac{\partial \mathrm{Var}(v_k - p_k)}{\partial \alpha_k} > 0$$
Intuition: if there are enough signals and the fraction of informed investors is larger than 41%, then firms with lower amounts of EFLS have higher volatility.
Control Variables
Variable | Definition
NewsFreq | Number of news articles mentioning firm i in month t.
LogSize | Logarithm of market value, computed using the closing market price of month t-1.
LogBM | Logarithm of book-to-market ratio, computed following Fama and French (1993).
LogVolume | Log(dollar trading volume of firm i in month t).
LogV | Log(variance); variance of firm i in month t is computed using daily stock returns.
IndvOwn | Proportion of individual ownership of stock i, using the latest available data, computed by aggregating 13f filings (Fang and Peress 2009).
LogAnalyCover | Log(1 + number of analysts covering firm i in month t).
LogAnalySD | Log(1 + standard deviation of analysts’ earnings predictions).
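Three of the log-transformed controls can be sketched directly from their definitions. The use of the sample variance (n - 1 denominator) for daily returns is an assumption, since the slide does not say which estimator was used; the function names are illustrative.

```python
from math import log
from statistics import variance

def log_variance(daily_returns):
    """Log of the variance of a firm's daily returns within one month."""
    return log(variance(daily_returns))  # sample variance (assumed)

def log_analyst_coverage(n_analysts):
    """Log(1 + number of analysts covering the firm in the month)."""
    return log(1 + n_analysts)

def log_analyst_sd(sd_of_forecasts):
    """Log(1 + standard deviation of analysts' earnings predictions)."""
    return log(1 + sd_of_forecasts)

# Hypothetical month of daily returns, for illustration only.
month = [0.004, -0.012, 0.007, 0.001, -0.003]
print(log_variance(month), log_analyst_coverage(12), log_analyst_sd(0.08))
```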
Firm-Level Performance Evaluation (Cont’d.)
Empirical Model 1:
Hypothesis 1 predicts negative $b_1$
$$r_{i,t+1} = a_0 + b_1 \mathrm{ALL\_IN}_{i,t} + c_1 \mathrm{NewsFreq}_{i,t} + c_2 \mathrm{NewsSenti}_{i,t} + d_1 \mathrm{LogSize}_{i,t} + d_2 \mathrm{LogBM}_{i,t} + d_3 r_{i,t} + d_4 \mathrm{LogV}_{i,t} + e_{i,t}$$
Empirical Model 2:
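Hypothesis 2 predicts $b_1 \neq 0$
$$\mathrm{LogV}_{i,t+1} = a_0 + b_1 \mathrm{ALL\_IN}_{i,t} + c_1 \mathrm{NewsFreq}_{i,t} + c_2 \mathrm{NewsSenti}_{i,t} + d_1 \mathrm{LogVolume}_{i,t} + d_2 \mathrm{LogV}_{i,t} + d_3 \mathrm{LogSize}_{i,t} + d_4 \mathrm{LogBM}_{i,t} + d_5 r_{i,t} + d_6 \mathrm{IndvOwn}_{i,t} + d_7 \mathrm{LogAnalyCover}_{i,t} + d_8 \mathrm{LogAnalySD}_{i,t} + e_{i,t}$$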
Experiment Two: Firm-Level Evaluation
Research Testbed: January 1986 to May 2008, 1,134,321 Wall Street Journal news articles
Merged with CRSP, Compustat, and IBES
Stocks with prices lower than $5 at the end of a month were removed (Cohen and Frazzini 2008; Fang and Peress 2009)
1,274,711 firm-months, spanning 269 months
Expected Return and EFLS Intensity
Variable
Value -0.0026
*
Variable
Value -0.0052
**
Control Variables Variable
Value -0.0039
0.00069
*** -0.00081
-0.0019
** 0.0025
*** -0.046
*** 0.00042
Intercept
0.039
***
Intercept
0.00068
-0.0012
-0.0019
0.0025
-0.046
*** 0.00042
0.039
*** *** *** ***
Intercept
0.00067
-0.0015
-0.0019
0.0025
-0.046
*** , ** , * 0.0031
0.0031
0.0031
indicate statistical significance at the 0.01, 0.05, and 0.1 levels, respectively.
*** 0.00042
0.039
*** *** *** ***
Volatility and EFLS Intensity
Variable | Model 2A (ALL_IN) | Model 2B (FT_IN) | Model 2C (EU_IN)
EFLS intensity (ALL_IN / FT_IN / EU_IN) | -0.074 *** | -0.196 *** | -0.254 ***
Control Variables
NewsFreq | 0.012 *** | 0.012 *** | 0.012 ***
NewsSenti | -0.105 *** | -0.103 *** | -0.110 ***
LogVolume | 0.108 *** | 0.108 *** | 0.108 ***
LogV | 0.565 *** | 0.565 *** | 0.565 ***
LogSize | -0.222 *** | -0.222 *** | -0.222 ***
LogBM | -0.066 *** | -0.066 *** | -0.066 ***
r | -0.615 *** | -0.615 *** | -0.616 ***
IndvOwn | 0.071 *** | 0.071 *** | 0.071 ***
LogAnalyCover | 0.016 *** | 0.017 *** | 0.017 ***
LogAnalySD | 0.095 *** | 0.095 *** | 0.095 ***
Intercept | -1.568 *** | -1.566 *** | -1.566 ***
R² | 0.57 | 0.57 | 0.57
*** , ** , * indicate statistical significance at the 0.01, 0.05, and 0.1 levels, respectively.
Take-Away and WIP (20%)
• Mass and social media texts provide additional signals for market prediction (in addition to numbers)
• Message volume is important; aggregate sentiment may not be (EMH)
• Business sentiment processing is difficult; may require additional content pre-processing (stakeholder; EFLS)
• Predicting return is hard; predicting volatility is easier (VIX, Chicago Board)
• Large-scale stock news tracking and text analytics can be automated
Trading windows; decay function; targeted sentiment; extensive trading periods (up/down); industry and news category (oil/banking); firm & index size (Russell/NYSE); emerging markets (China)
All the firms (10K), all the news (1M each), all the time ???
Trading strategy ???
AZ BIZ INTEL System Design
Data Sources for US Public Companies: SEC/Edgar, NYSE.com, Finance.Yahoo.com, NASDAQ.com
Company Information Database: Ticker, CIK, CUSIP, PERMNO, Company Name, Company Keywords
Predefined Data Sources: Yahoo Finance Forums, Twitter, Company Websites, Stock Exchange, WSJ, 10K Report
Dynamic Data Sources: Search Engines, Blogs, News
Transformation/Integration: Performance Indicators, Topics & Sentiments, Time Series / Burst, Risk Model, SNA Data
Analytic Approaches: Single Media Analysis, Cross Media Analysis, Predictive Analysis, Simulated Trading
Visualization