Slide 1

GOOGLE N-GRAMS
ON AMAZON WEB SERVICES
PART 2
Thomas Tiahrt, MA, PhD
Computer Science 482 – Introduction to Text Analytics

Google Books N-Grams

n-gram viewer
 - http://books.google.com/ngrams/info

n-gram datasets
 - http://storage.googleapis.com/books/ngrams/books/datasetsv2.html

File Format for Google’s N-Grams

Data is compressed
Fields are separated by tabs ('\t')
One record per line
 - a newline character ('\n') ends each record
The n-gram field holds a 1gram, 2gram, 3gram, 4gram, or 5gram
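
The layout above (compressed data, tab-separated fields, one newline-terminated record per line) can be read with a short loop. This is a minimal sketch, not the course's own code. It assumes gzip compression, since the slide only says the data is compressed, and the function name read_ngram_records and the sample rows are made up for illustration.

```python
import gzip
import io


def read_ngram_records(source):
    """Yield one list of string fields per record.

    `source` is a file path or a binary file object.
    Assumption: the data is gzip-compressed; the slide only says "compressed".
    """
    # newline="\n" keeps '\n' as the only record terminator.
    with gzip.open(source, mode="rt", encoding="utf-8", newline="\n") as handle:
        for line in handle:
            # Drop the terminating newline, then split the fields on tabs.
            yield line.rstrip("\n").split("\t")


# Made-up sample rows, compressed in memory so the sketch runs without a download.
sample = gzip.compress(
    "alpha beta\t1999\t3\t2\ngamma delta\t2000\t5\t4\n".encode("utf-8")
)
records = list(read_ngram_records(io.BytesIO(sample)))
print(records)
```

Streaming line by line keeps memory use flat, which matters because the real files are large. Pass a file path instead of the in-memory buffer to read a downloaded file.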

Version 2

Data created July 2012
Version 2 file format
N-gram \t year \t match_count \t volume_count \n
 - N-gram: the 1gram, 2gram, 3gram, 4gram, or 5gram itself
 - year: publication year
 - match_count: number of times the n-gram occurred in that year
 - volume_count: number of distinct books in which the n-gram occurred in that year
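
As an illustration of the Version 2 record layout, one line can be split into its four fields and the three numeric fields converted to integers. This is a sketch under assumptions: the names NgramRecord, parse_v2_line, and total_matches are hypothetical, and the sample lines are invented values, not real dataset rows.

```python
from typing import NamedTuple


class NgramRecord(NamedTuple):
    ngram: str          # the 1gram to 5gram itself
    year: int           # publication year
    match_count: int    # occurrences in that year
    volume_count: int   # distinct books containing the n-gram in that year


def parse_v2_line(line):
    """Parse 'ngram<TAB>year<TAB>match_count<TAB>volume_count<NEWLINE>'."""
    ngram, year, match_count, volume_count = line.rstrip("\n").split("\t")
    return NgramRecord(ngram, int(year), int(match_count), int(volume_count))


def total_matches(lines, ngram):
    """Sum match_count over all years for one n-gram."""
    return sum(r.match_count for r in map(parse_v2_line, lines) if r.ngram == ngram)


# Invented sample lines for illustration only.
sample_lines = [
    "text analytics\t1999\t3\t2\n",
    "text analytics\t2000\t5\t4\n",
    "text mining\t2000\t7\t6\n",
]
first = parse_v2_line(sample_lines[0])
total = total_matches(sample_lines, "text analytics")
print(first, total)
```

The tuple unpacking raises ValueError on any line that does not have exactly four tab-separated fields, so malformed records fail loudly instead of being silently misread.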

End of Part Two
This is the end of part two. Please proceed to part three.

