Using Corpora for Language
Research
COGS 523-Lecture 6
Web as Corpus
18.07.2015
COGS 523 - Bilge Say
Related Readings
Bernardini, S., M. Baroni and S. Evert. (2006) A WaCky
Introduction. in Working Papers on the Web as Corpus.
http://wacky.sslmit.unibo.it/
M. Baroni, S. Bernardini, A. Ferraresi and E. Zanchetta. To
appear. The WaCky Wide Web: A Collection of Very Large
Linguistically Processed Web-Crawled Corpora. Language
Resources and Evaluation Journal (At the above address)
Sharoff, S. (2006) Open-source Corpora. International Journal
of Corpus Linguistics. 11(4), pp 435-462.
Kilgarriff and Grefenstette (2003). Introduction to the Special Issue
on the Web as Corpus. Computational Linguistics 29(3), 333-348.
Kilgarriff (2007) Googleology is Bad Science. Computational
Linguistics, 33(1).
Web As Corpus Workshops. See proceedings of the 2008 one
in the link below under conferences:
http://webascorpus.sourceforge.net/
Web as Corpus
Attractiveness: Free, immense and
easily available
Issues:
Representativeness and Balance
Legal Issues of Web Corpora
Suitability of Linguistic Queries with
Search Engines
Web as Corpus: many senses?
The Web as a corpus surrogate
The Web as a corpus shop – create your
own corpus from the web
The Web as corpus proper – the web as
representative of web English
The Mega-corpus/Mini-web – a new
object suitable for linguistic study,
combining the above three approaches
Advantages
Size: estimate for the size of text
For Google: 3-4 tera words (T = 10^12),
as of 2003
The 100-million-word BNC is good for the
10,000 core types of English, but
what about the remaining types that
occur 50 times or fewer?
Language Distribution
Estimate by function words, which are
stable across many types of text and so
serve as predictors of corpus size (estimates
on the next slide)
Xu (2000): 71% English, 7% Japanese,
5% German, 2% French, Chinese ...
The proportion of non-English text to English
text is growing
Errors are more frequent than in traditionally
published text but significantly less frequent
than in others.
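The function-word estimate mentioned above can be sketched as follows. This is a minimal illustration, assuming that a function word's relative frequency in a reference corpus of known size carries over to the web. The function names, the use of the median to combine words, and all counts are assumptions made for illustration; they are not the figures or the exact procedure of Kilgarriff & Grefenstette (2003).

```python
from statistics import median


def estimate_corpus_size(web_count, ref_count, ref_size):
    """Estimate the size (in words) of an unknown corpus from one word.

    web_count: occurrences of the word reported for the unknown corpus
    ref_count: occurrences of the same word in the reference corpus
    ref_size:  total number of words in the reference corpus
    """
    relative_freq = ref_count / ref_size   # share of all reference tokens
    return web_count / relative_freq       # tokens needed to yield web_count


def estimate_from_words(web_counts, ref_counts, ref_size):
    """Combine per-word estimates; the median damps outlier words."""
    estimates = [
        estimate_corpus_size(web_counts[w], ref_counts[w], ref_size)
        for w in web_counts
        if ref_counts.get(w, 0) > 0
    ]
    return median(estimates)


# Hypothetical counts, for illustration only.
REF_SIZE = 1_000_000
ref_counts = {"the": 60_000, "of": 30_000, "and": 25_000}
web_counts = {"the": 6_000_000, "of": 3_300_000, "and": 2_250_000}

size = estimate_from_words(web_counts, ref_counts, REF_SIZE)
print(f"estimated size: {size:,.0f} words")
```

Using several function words rather than one makes the estimate less sensitive to a single word that happens to be over- or under-represented on the web.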
Estimate of Web size in words, as indexed by AltaVista,
for various languages (Table 3 of Kilgarriff & Grefenstette, 2003)
Frequencies of English phrases in the BNC and on
AltaVista in 1998 and 2001, and on AlltheWeb in 2003.
The counts for BNC and AltaVista are for individual
occurrences of the phrase. The counts for AlltheWeb are
page counts (the phrase may appear more than once on
any page) (Table 1 of Kilgarriff & Grefenstette, 2003)
AltaVista frequencies for candidate translations of groupe
de travail (Table 4 of Kilgarriff & Grefenstette, 2003)
Natural Language
Processing View
Probabilistic models of language based on very
large quantities of data (even if noisy) are better
than models estimated from small, clean data sets
with sophisticated smoothing techniques
(Kilgarriff and Grefenstette, 2003)
NLP applications using Web as Corpus
Word Sense Disambiguation
Ontology Population
Statistical Machine Translation
More pragmatic corpus definitions:
Is corpus x good for task y?
Less emphasis on construction and design principles
Legal Issues of Web
Corpora
Really different from non-Web
corpora?
You can develop a Web corpus
without copying it.
GNU Free Documentation Licence
(for distributing)
Caches and indices by search
engines are formidable anyhow.
Issues
Representativeness
Try to understand the Web's balance rather
than aim for representativeness
Automatic characterization of text types
from the web
Querying for linguistic analysis with
Search Engines
Not enough context for each instance
Not enough instances
Unreliable frequency statistics (e.g. Hit counts
per page instead of token statistics; titles or
headings promote ranking)
Automated querying is limited
Limited search syntax and annotation (no
lemmas or part of speech tags)
Some exceptions: search engines that treat the web
as a corpus environment:
http://www.kwicfinder.com/KWiCFinder.html
http://www.webcorp.org.uk/
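The difference between page counts and token counts noted above can be made concrete with a small sketch. The three "pages" below are invented for illustration, and simple substring matching stands in for whatever matching a real search engine performs.

```python
def token_count(pages, phrase):
    """Total occurrences of the phrase across all pages."""
    return sum(page.lower().count(phrase.lower()) for page in pages)


def page_count(pages, phrase):
    """Number of pages that contain the phrase at least once."""
    return sum(1 for page in pages if phrase.lower() in page.lower())


# Toy "web" of three pages, invented for illustration.
pages = [
    "Working group report. The working group met twice.",
    "No match here.",
    "A working group was formed.",
]

print("tokens:", token_count(pages, "working group"))
print("pages: ", page_count(pages, "working group"))
```

A hit count reported per page therefore gives only a lower bound on the number of occurrences, which is why it is a poor substitute for corpus frequency.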
Building your own corpora
from web
Interoperable tools to build your own
corpus from the web and use it with a query
engine
BootCat, WebBootCat and SketchEngine
(for lexicographic purposes mostly)
http://www.sketchengine.co.uk/
Free only for a trial; individual academic
licenses cost 50 euros per year
Creating a Corpus from
Web
Crawling:
Selecting “seed” URLs
Harder for the “general” corpus case
• Representative of what? (What if the sampled
web profile for a language is 90%
pornography and dating sites, 9% Linux
how-tos, 1% others?)
Retrieve pages by crawling
• Issues: Efficiency, duplicates, politeness,
traps, file handling
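The crawling issues listed above (duplicates, politeness, traps) can be sketched in a few lines. This is only an outline under stated assumptions: fetch is a hypothetical placeholder for real HTTP retrieval and link extraction, FAKE_WEB is an invented in-memory link graph, and the caps on pages and depth are one simple way to limit crawler traps, not the method of any particular tool.

```python
import time
from collections import deque
from urllib.parse import urldefrag, urlparse


def crawl(seeds, fetch, max_pages=100, max_depth=3, delay=1.0,
          sleep=time.sleep, clock=time.monotonic):
    """Breadth-first crawl starting from seed URLs.

    fetch(url) must return (text, list_of_outgoing_links); it is a
    placeholder for real HTTP retrieval and link extraction.
    """
    frontier = deque((urldefrag(u).url, 0) for u in seeds)
    seen = {url for url, _ in frontier}        # duplicate-URL guard
    last_hit = {}                              # host -> time of last request
    pages = {}

    while frontier and len(pages) < max_pages:  # page cap limits traps
        url, depth = frontier.popleft()
        host = urlparse(url).netloc

        # Politeness: leave at least `delay` seconds between hits on a host.
        if host in last_hit:
            wait = delay - (clock() - last_hit[host])
            if wait > 0:
                sleep(wait)
        last_hit[host] = clock()

        text, links = fetch(url)
        pages[url] = text

        if depth < max_depth:                   # depth cap also limits traps
            for link in links:
                link = urldefrag(link).url      # drop "#fragment" variants
                if link not in seen:
                    seen.add(link)
                    frontier.append((link, depth + 1))
    return pages


# Invented three-page web, for illustration only.
FAKE_WEB = {
    "http://a.example/": ("page a", ["http://a.example/b", "http://b.example/"]),
    "http://a.example/b": ("page b", ["http://a.example/", "http://a.example/b#top"]),
    "http://b.example/": ("page c", []),
}


def fake_fetch(url):
    return FAKE_WEB[url]


pages = crawl(["http://a.example/"], fake_fetch, delay=0.0)
print(sorted(pages))
```

A real crawler would also honour robots.txt (the standard library's urllib.robotparser can parse it) and handle failed requests and non-HTML files, which this sketch leaves out.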
Cleaning Up
Removing HTML tags
Boilerplate stripping: You do not want “Click
here” to be the most frequent phrase in your
corpus
Language/encoding detection
Near-duplicate discovery – same tutorial with
different headers
Specialized community effort: CLEANEVAL
http://cleaneval.sigwac.org.uk
SIGWAC: Special Interest Group on Web as Corpus
of the ACL
http://www.sigwac.org.uk/
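Two of the clean-up steps above, removing HTML tags and discovering near-duplicates, can be sketched with the standard library alone. The word-shingle overlap (Jaccard) measure and the 0.5 threshold are assumptions chosen for illustration, not the method of any CLEANEVAL system, and the two sample pages are invented. Boilerplate stripping and language/encoding detection are not covered here.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping <script> and <style> content."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.parts.append(data.strip())


def strip_html(html):
    """Return the text content of an HTML string, tags removed."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)


def shingles(text, n=3):
    """Set of overlapping n-word sequences in the text."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(max(len(words) - n + 1, 1))}


def jaccard(a, b):
    """Overlap of two sets, from 0.0 (disjoint) to 1.0 (identical)."""
    return len(a & b) / len(a | b) if a | b else 1.0


def near_duplicates(text_a, text_b, threshold=0.5):
    """True if the two texts share at least `threshold` of their shingles."""
    return jaccard(shingles(text_a), shingles(text_b)) >= threshold


# Same tutorial text under two different headers (invented example).
page_a = ("<html><body><h1>Site A</h1><p>Install the package and then "
          "run the setup script as root.</p></body></html>")
page_b = ("<html><body><h1>Mirror B</h1><script>track()</script><p>Install "
          "the package and then run the setup script as root.</p></body></html>")

text_a, text_b = strip_html(page_a), strip_html(page_b)
print(text_a)
print(near_duplicates(text_a, text_b))
```

Comparing every pair of pages this way does not scale to a large crawl; it only shows what "near-duplicate" means in terms of shared word sequences.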
Annotation
Header information (text types):
Newer semiautomatic classification
schemes are being developed
(Mehler and Gleim)
Tokenization, POS annotation,
Lemmatisation
Peculiarities of web language:
neologisms, acronyms, smileys, non-standard spelling
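As a rough illustration of why web text needs its own tokenization rules, the sketch below keeps URLs, simple smileys and dotted acronyms together as single tokens. The pattern is an assumption made for this example and covers only a few cases; it is not the tokenizer of any existing corpus toolkit.

```python
import re

TOKEN_RE = re.compile(
    r"""
    https?://\S+                                  # URLs as single tokens
  | (?<!\w)[:;=][-o*']?[()\[\]dDpP/\\|](?!\w)     # simple sideways smileys
  | (?:[A-Za-z]\.){2,}                            # dotted acronyms, e.g. U.S.A.
  | \w+(?:[-']\w+)*                               # words, with inner - or '
  | [^\w\s]                                       # any other single symbol
    """,
    re.VERBOSE,
)


def tokenize(text):
    """Split text into tokens, trying the more specific patterns first."""
    return TOKEN_RE.findall(text)


sample = "LOL :-) check http://example.com/faq b4 the U.S.A. trip, don't wait!!"
print(tokenize(sample))
```

The order of the alternatives matters: the URL and smiley patterns must come before the generic word and punctuation patterns, otherwise ":-)" would be split into three symbols. A URL directly followed by punctuation would swallow that punctuation, which a fuller tokenizer would need to handle.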
Query
Indexing and searching
Expressiveness
Ease of Use
Performance
Scalability
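Indexing and searching can be illustrated with a positional inverted index and a keyword-in-context (KWIC) display. This is a toy sketch with invented documents and whitespace tokenization; real corpus query engines add annotation layers, richer query languages and on-disk indices to meet the expressiveness and scalability requirements listed above.

```python
from collections import defaultdict


def build_index(docs):
    """Map each lower-cased token to the set of (doc_id, position) pairs."""
    index = defaultdict(set)
    for doc_id, text in enumerate(docs):
        for pos, token in enumerate(text.lower().split()):
            index[token].add((doc_id, pos))
    return index


def phrase_search(index, phrase):
    """Return sorted (doc_id, start_position) pairs where the phrase occurs."""
    words = phrase.lower().split()
    if not words:
        return []
    return sorted(
        (doc_id, pos)
        for doc_id, pos in index.get(words[0], ())
        if all((doc_id, pos + i) in index.get(w, ())
               for i, w in enumerate(words[1:], start=1))
    )


def kwic(docs, hits, length, width=3):
    """Format hits as keyword-in-context lines with `width` words each side."""
    lines = []
    for doc_id, pos in hits:
        tokens = docs[doc_id].split()
        left = " ".join(tokens[max(pos - width, 0):pos])
        key = " ".join(tokens[pos:pos + length])
        right = " ".join(tokens[pos + length:pos + length + width])
        lines.append(f"{left} [{key}] {right}".strip())
    return lines


# Invented documents, for illustration only.
docs = [
    "the working group met in Ankara",
    "a new working party and a working group",
]
index = build_index(docs)
hits = phrase_search(index, "working group")
for line in kwic(docs, hits, length=2):
    print(line)
```

Because the index stores positions, a phrase query only checks the postings of the first word instead of scanning every document, which is the basic idea behind query performance on large corpora.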
Sharoff’s Internet Corpora
Affordable alternatives to BNC-like efforts?
English, Chinese, Romanian, Russian,
Ukrainian, Turkish (under way)
Composition assessment (by text
typology comparison to resources such as the
BNC or the Russian Reference Corpus, or by
comparing frequency lists)
Seed generation: 400-500 most frequent
types from reference corpora
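Seed generation from a frequency list can be sketched as below. Taking the most frequent types follows the slide; combining them into random multi-word queries for a search engine is an assumption added here for illustration, as are the function names, the query length and the toy reference text.

```python
import math
import random
from collections import Counter


def top_types(tokens, n=500):
    """The n most frequent word types in a tokenised reference corpus."""
    return [word for word, _ in Counter(tokens).most_common(n)]


def make_queries(types, n_queries=10, words_per_query=4, seed=0):
    """Draw distinct random word combinations to use as search queries."""
    rng = random.Random(seed)              # fixed seed keeps runs repeatable
    # Never ask for more distinct combinations than actually exist.
    n_queries = min(n_queries, math.comb(len(types), words_per_query))
    queries = set()
    while len(queries) < n_queries:
        queries.add(tuple(sorted(rng.sample(types, words_per_query))))
    return [" ".join(q) for q in sorted(queries)]


# Toy reference "corpus", for illustration only.
reference = "the cat sat on the mat and the dog sat on the log".split()
types = top_types(reference, n=5)
queries = make_queries(types, n_queries=3, words_per_query=2)
print(types)
print(queries)
```

The pages returned for such queries would then serve as the seed URLs for crawling.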
Triangulation for Internet Corpora (Fig. 1 of Sharoff, 2006)
The balance of text types in various corpora (Table 1 of
Sharoff, 2006)
Words less/more frequent in news corpora (part of Table 2
of Sharoff, 2006)
Words less/more frequent in internet corpora (part of
Table 3 of Sharoff, 2006)
The size of Internet corpora (Table 4 of Sharoff, 2006)
Applications from 4th Web as
Corpus (WAC) Workshop (2008)
GReG: Reranking the snippets returned by
Google’s search engine among the top 10
links by introducing linguistic information
(tagging, syntactic constituency, partial
logical form)
GLB (Google for the Linguist on a
Budget): A free, open-source system
for robust web crawling, querying along
multiple dimensions, and load balancing on
many CPUs, especially for testing language
models for NLP on web-based corpora.
Applications from 4th Web as
Corpus (WAC) Workshop (2008)
Victor: A web page cleaning tool
introduced at CLEANEVAL 2007, built
especially with linguistic aims in mind; it uses
machine learning and its own annotation
toolset and evaluation metrics
GlossaNet2: A free online concordancer
service that allows users to search
dynamic web corpora via RSS feeds –
uses features of the Unitex corpus tool
http://glossa.fltr.ucl.ac.be/
WaCky corpora
ukWaC, deWaC, itWaC (details in
Baroni et al.)
ukWaC: a large web-derived corpus of
British English with 2 billion tokens –
freely available – part-of-speech
tagged and lemmatized – wide range
of genres – 30 GB with annotation –
a vocabulary comparison with the BNC
is available
Lecture 7 and 8
Lecture 7: April 14th, your tool
evaluation presentations and reports!
Lecture 8: Statistics: McEnery and
Wilson (2001) Ch 3; McEnery et
al.(2006) Unit A6. Biber et al.
Methodology Boxes.