Powerpoint Slides - Inkwell Analytics, LLc

Download Report

Transcript Powerpoint Slides - Inkwell Analytics, LLc

Web-Based Data Collection
and Analysis
Andy Leone
June 2014
Course Overview

Tools for extracting and analyzing text
• SEC Filings
• Websites
• PDF Files

Software
• Perl – ActivePerl
• Perl “Front End” Komodo
• MySQL
Why Perl?
There Are Alternatives to Perl (e.g.,
Python, Ruby, R)
 Perl – Practical Extraction and Reporting
Language


It is a mature language with considerable
support.
• Countless packages (add-ons)
• About the best there is when it comes to “Regular
Expressions.”
The big picture- What I do
WRDS
Edgar
SAS
Perl
SQL
ODS
Proc Export
Excel
SEC FTP

Files can be obtained from the SEC’s FTP site:


Note that the files are not made available on the
FTP site until 24 hours after they are filed.


ftp.sec.gov
It is possible to get real-time feeds.
You are supposed to do large downloads after
9:00PM Eastern.
Index Files

Index files are archived each quarter and
the most recent one us updated every
day:


/edgar/full-index/2006/QTR1/company.zip
The zip file contains a file “company.idx”
Index File, continued
-First 10 lines - description then data-Company Name
-Form Type
-CIK Number
-Filing Date
-Directory and File Name
Description:
Master Index of EDGAR Dissemination Feed by Company Name
Last Data Received: September 6, 2006
Comments:
[email protected]
Anonymous FTP:
ftp://ftp.sec.gov/edgar/
Company Name
Form Type CIK
Date Filed File Name
--------------------------------------------------------------------------------------------------------------------------------------------033 ASSET MANAGEMENT LLC /
13F-HR
1114831 2006-08-11 edgar/data/1114831/0001110550-06-000042.txt
1 800 CONTACTS INC
10-Q
1050122 2006-08-10 edgar/data/1050122/0001104659-06-053544.txt
Data Files
SGML format
 All files contain basic header information
tagged by <SEC-HEADER>
 This is followed by each file submitted
with the filing.
 Example

AMREP Corp. Submitted a 10-K/A (Text
Version)
Challenge
Lack of uniformity
 Another Example:


Smart-tek Solutions Inc. (Text)
Perl

Practical Extracting and Reporting
Language

A powerful scripting language for working
with unformatted text.
Perl

The key feature of Perl is Regular Expressions.


It is relatively easy to search through text files and
match words or word patterns.
For example, to identify firm with an internal control
deficiency when wording differs.
•
•
•
•
•
“We have identified a material weakness”
“A material weakness exists”
“A control deficiency exists”
“We identified a control deficiency”
“No material weakness exists” (Don’t want to match this
one)
Expectations

Using PERL can be very helpful.



More refined measures (e.g., modified audit
opinions).
Large samples relative to hand collection.
But..



Extracting data from free-form text can be messy.
Especially when formatting varies so much from
file to file. A lot of trial and error.
It would take an infinite amount of time to get
success rates up to 99-100%.
Need to accept 90-95% and live with some
“noise”
Example: Audit Opinions

How do you find the opinion?



How do you find the end of the opinion?







Look for something like “We have audited.”
But PWC has to start with “In our opinion.”
Look for a date (e.g., March 15, 2005) a the start of a line
But some dates in the opinion happen to fall at start of line
(e.g., for year ended December 31, 2005)
Then require the date to be followed by spaces.
But some opinions are dated with “except for.”
Foreign auditors might date the report in day- month-year
order (15 March, 2005).
Some firms have multiple opinions.
Sometimes there are typos.
When is PERL most useful?

Extracting specific text and looking for keywords
and phrases





Audit Opinions
Internal Controls
Footnotes
Not great for unstructured financial information.
For example, use of proceeds in IPO
prospectus.
Sometimes you still need to code the data by
hand.
What I do





Download all index files and write them to a
MYSQL database (via PERL).
Download filings I work with fairly often (e.g.,
10-K, 10-Q, 8-K)
Create header header tables in MYSQL for
each filing type.
Create project-specific tables as needed.
Note: MYSQL allows for easy access from SAS,
STATA, excel, etc. via ODBC.
MYSQL, PERL and SAS
DBI Module
 This makes it really easy to work with
data via SQL or statistics software (SAS,
STATA).
 SAS Example:

proc sql;
connect to odbc (datasrc=Compustat user=me password=xxxx);
create table myrestates
as select * from connection to odbc
(select gvkey, file, sic, hlink, company_name,
to_CFO, TO_CFO_Date,cik, irreg,to_CEO, to_CEO_Date
from 8kdata.restate_sample_jul2006 a);
quit;
Software Installation Checklist

ActivePerl 5.10

Let’s go to the Package Manager and install a
couple of modules:
• DBD::mysqlPP
• HTML::Format




Komodo Edit 4.4
MySQL 5.0
MySQL GUI Tools
SAS 9.1.3 (or higher).