Webometrics. Director, Office of Library and Information Services: 官大智; Research assistants: 江欣倩, 葉佳璋


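Webometrics
Director, Office of Library and Information Services: 官大智
Presenter: 陳嘉平, Head of the Information Applications Division, Office of Library and Information Services
Research assistants: 江欣倩, 葉佳璋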
History
– Since 2004, the Webometrics ranking has been
published twice a year (January and July).
– The ranking covers more than 16,000 higher
education institutions.
– The most recent ranking is the January 2009
edition.
Methodology
– The unit for analysis is the institutional
domain, so only universities and research
centers with an independent web domain are
considered.
– University activity is multi-dimensional, so
the ranking combines a group of web-presence
indicators that measure these different aspects.
Indicators
– Size: the number of pages in a domain (as
recovered by search engines)
– Visibility: the number of unique external
links received by a domain
– Rich File: the number of files of certain file
types in a domain
– Scholar: the number of papers and citations in
a domain
Metrics
– For each indicator, the universities are ranked.
– Then the ranks of four indicators are combined
according to a formula as follows.
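As a minimal sketch of how four indicator ranks could be combined into one score, the snippet below takes a weighted sum and sorts by it. The weights in WEIGHTS are an assumption for illustration (visibility weighted most heavily), not values taken from this slide, and the function names are hypothetical.

```python
# Assumed weights for illustration only; they sum to 1.0.
WEIGHTS = {"size": 0.20, "visibility": 0.50, "rich_files": 0.15, "scholar": 0.15}


def combined_score(ranks, weights=WEIGHTS):
    """Weighted sum of one institution's four indicator ranks (lower is better)."""
    return sum(weights[name] * ranks[name] for name in weights)


def final_ranking(table, weights=WEIGHTS):
    """Order institutions by combined score, best first.

    table maps an institution name to its dict of indicator ranks.
    Ties are broken alphabetically so the result is deterministic.
    """
    scores = {name: combined_score(ranks, weights) for name, ranks in table.items()}
    return sorted(scores, key=lambda name: (scores[name], name))
```

Because the inputs are ranks rather than raw counts, a lower combined score means a better position.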
Verifiable Data
– The only source for the data of this ranking is a
small set of globally available, free-access
search engines.
– All results can be reproduced by following the
described methodology, allowing for the
explosive growth of web content, its volatility
and the irregular behavior of commercial
search engines.
Bad Practices
– The use of link farms and paid backlinks to
improve an institution's position in this ranking
is not acceptable.
– Institutions involved in such practices have no
place in this ranking and will not be classified
in future editions.
– Random checks are made to ensure the
correctness of the data obtained.
Rankings of Interest
No.  World rank  Institution                             Size   Visibility  Rich Files  Scholar
1.   55          "National_Taiwan_University"            116    87          46          13
2.   179         "National_Chiao_Tung_University"        90     178         171         590
3.   273         "National_Taiwan_Normal_University"     211    270         400         527
4.   274         "National_Cheng_Kung_University"        316    421         235         53
5.   282         "National_Sun_Yat-Sen_University"       333    405         348         28
6.   308         "National_Tsing_Hua_University_Taiwan"  161    502         220         328
7.   370         "National_Central_University"           427    576         355         30
8.   384         "National_Chung_Cheng_University"       321    446         390         644
9.   391         "National_Chengchi_University"          336    414         492         675
10.  491         "Tamkang_University"                    461    563         806         550
11.  529         "I-Shou_University"                     318    855         397         469
12.  564         "National_Chung_Hsing_University"       448    713         529         861
We need to work on size, visibility, and rich files, while keeping our strength in scholar.
Rankings of Interest (continued)
No.  World rank  Institution                             Size   Visibility  Rich Files  Scholar
13.  659         "Providence_University"                 543    1,067       600         280
14.  716         "Fu_Jen_Catholic_University"            622    848         610         1,321
15.  748         "Feng_Chia_University"                  616    1,191       959         156
16.  772         "Yuan_Ze_University"                    409    1,021       574         1,564
17.  836         "NTUST"                                 1,068  1,325       563         120
18.  851         "Shih_Hsin_University"                  617    1,356       948         336
19.  896         "Tunghai_University"                    418    1,235       810         1,470
20.  905         "National_Dong_Hwa_U"                   816    1,544       580         200
21.  914         "Soochow_University_Taiwan"             560    1,393       1,042       665
22.  921         "Chaoyang_University_of_T"              885    1,500       803         174
23.  924         "NYUST"                                 719    1,470       1,212       109
URL Naming
– Each institution should choose a unique institutional
domain that can be used by all the websites of the
institution.
– Avoid changing the institutional domain as it has a
devastating effect on the visibility values.
– The alternative or mirror domains should be
disregarded.
– Use well-known acronyms.
– Consider including a descriptive word, such as the
name of the city, in the domain name.
– Replace IP addresses in URLs with domain names!
Content: Create
– Allow a large proportion of staff, researchers or
graduate students to be potential authors.
• Individual persons or teams should maintain their own
websites.
– Libraries, documentation centers and similar
services can be responsible for large databases,
including bibliographic ones and large repositories
(theses, pre-prints, and reports).
– Hosting external resources can be interesting for third
parties and increase the visibility: Conference
websites, software repositories, scientific societies
and their publications, especially electronic journals.
Content: Convert
– Important resources available in non-electronic
format can be converted to web pages easily.
– Most of the universities have a long record of
activities that can be published in “historical
web sites”.
– Other candidates for conversion include
reports on past activities and picture
collections.
Interlinking
– Measuring and classifying the links from others can
be insightful.
– You should expect links from your “natural” partners:
• locality or region
• similar organizations
• portals covering your topics
• colleagues' or partners' personal pages
– Make an impact in your common language community.
– Check for orphaned pages, i.e. pages that no
other page links to.
– Most popular pages or directories are relevant.
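The orphaned-page check can be sketched as below, assuming the site's internal links have already been crawled into a dictionary mapping each page to the pages it links to. The function name, the `roots` parameter and the paths in the usage note are hypothetical.

```python
def find_orphans(links, roots=("/",)):
    """Return pages that receive no link from any other page.

    links maps each page path to an iterable of page paths it links to.
    Pages listed in roots (e.g. the home page) are entry points and are
    never reported, even if nothing links to them.
    """
    linked = set()
    for source, targets in links.items():
        for target in targets:
            if target != source:  # a self-link does not rescue a page
                linked.add(target)
    return sorted(page for page in links if page not in linked and page not in roots)
```

For example, with `{"/": ["/a"], "/a": ["/"], "/old": ["/"]}` only `/old` is reported, since it links out but nothing links in.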
Language
– The WWW audience is truly global, so one
should not think locally.
– Language versions, especially in English, are
mandatory not only for the main pages but also
for selected sections such as scientific
documents.
Rich Files
– Although HTML is the standard format of web
pages, sometimes it is better to use rich file
formats.
– Provide versions in different formats.
Search Engine Issues
– Search engine friendly design
– Avoid cumbersome navigation menus based on Flash,
Java or JavaScript that can block robot access.
– Deeply nested directories or complex interlinking can
block robots too.
– Databases and even highly dynamic pages can be
invisible for some search engines, so use directories or
static pages instead or as an option.
– Plain is good.
Archiving
– Maintain a copy of old or outdated material on
the site.
– Archive media materials in web repositories.
Collections of videos, interviews,
presentations, animated graphs, and even
digital pictures could be very useful in the
long term.
Standards for Sites
– The use of meaningful titles and descriptive
meta-tags can increase the visibility of the
pages.
– Add authoring info, keywords and other data
about the web sites.
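One way to audit pages against these points is sketched below with Python's standard `html.parser`. It reports which of a page's title and named meta tags are missing. The class and function names are hypothetical, and the default list of required meta names (description, keywords, author) is an assumption chosen to mirror the bullets above.

```python
from html.parser import HTMLParser


class HeadAudit(HTMLParser):
    """Collect the <title> text and <meta name=... content=...> pairs of a page."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.meta = {}
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta" and attributes.get("name") and attributes.get("content"):
            self.meta[attributes["name"].lower()] = attributes["content"]

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def missing_metadata(html, required=("description", "keywords", "author")):
    """Return the names of required items the page lacks ("title" listed first)."""
    audit = HeadAudit()
    audit.feed(html)
    missing = [name for name in required if name not in audit.meta]
    if not audit.title.strip():
        missing.insert(0, "title")
    return missing
```

Running `missing_metadata` over a sample of pages gives a quick list of where titles or descriptive meta tags still need to be added.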
Challenge
– If the web performance of an institution is
below the position expected from its academic
excellence, university authorities should
reconsider their web policy and promote
substantial increases in the volume and quality
of their electronic publications.
– Again, NSYSU needs to improve on size,
visibility, and rich files, while keeping the
strength in scholar.
Experiments
– For each institutional domain, we collect the
data from search engines, following the described
methodology.
– Then we compare our ranking against the
Webometrics ranking.
– We need to verify whether our data agree with
theirs. They may not agree exactly, but we can
evaluate the correlation.
Size
– Number of pages recovered from four engines
• Google, Yahoo, Live Search and Exalead
– For each engine, results are log-normalized to
1 for the highest value.
– For each domain, maximum and minimum
results are excluded.
– An institution is assigned a rank according to
the combined sum.
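The steps above can be sketched as follows. The exact normalization formula is not given here, so `log(1 + count) / log(1 + max count)` is an assumption. The function names are hypothetical, and the sketch expects every engine to report a count for every domain.

```python
import math


def log_normalize(counts):
    """Scale one engine's counts so the largest value maps to 1.0.

    Assumed form: log(1 + count) / log(1 + largest count).
    """
    top = max(counts.values())
    if top <= 0:
        return {domain: 0.0 for domain in counts}
    return {domain: math.log1p(c) / math.log1p(top) for domain, c in counts.items()}


def size_scores(per_engine):
    """per_engine maps engine name -> {domain: page count}.

    For each domain, the highest and lowest normalized values across
    engines are dropped and the remaining values are summed.
    """
    normalized = {engine: log_normalize(c) for engine, c in per_engine.items()}
    domains = set.intersection(*(set(c) for c in per_engine.values()))
    scores = {}
    for domain in domains:
        values = sorted(normalized[engine][domain] for engine in normalized)
        scores[domain] = sum(values[1:-1])  # exclude minimum and maximum
    return scores


def rank(scores):
    """Assign rank 1 to the highest score; ties are broken alphabetically."""
    ordered = sorted(scores, key=lambda d: (-scores[d], d))
    return {domain: position + 1 for position, domain in enumerate(ordered)}
```

With four engines, dropping the extremes leaves two values per domain, which dampens the effect of one engine over- or under-reporting a site.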
Visibility
– The total number of unique external links
received by a site
– Data gathered from Yahoo, Live and Exalead
(Google excluded)
– For each engine, results are log-normalized to 1
for the highest value.
– An institution is assigned a rank according to
the combined sum.
Rich Files
– Four different file formats
• Adobe Acrobat (pdf)
• Adobe PostScript (ps)
• Microsoft Word (doc)
• Microsoft PowerPoint (ppt)
– Data (number of files) are extracted using Google.
– Results for each file type are merged after
log-normalization, in the same way as described above.
Scholar
– Google Scholar provides the number of papers
and citations for each academic domain.
– These results from the Scholar database
represent papers, reports and other academic
items.
Number of Swaps
– For two rankings of domains (institutions), say
r and s, the number of swaps needed to bring
ranking r to s is defined computationally:
• If the top-ranked domain in s, say x, ranks 5th in r,
then 4 swaps are needed to bring x to the top.
• Find the second-ranked domain of s in r and bring it
to second place.
• Continue until the entire order is correct.
• Accumulate the number of swaps, say N.
– Smaller N is better.
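The procedure above can be sketched directly in code: each domain is moved up to its target position, and the number of places it travels is counted as adjacent swaps. The function name is hypothetical, and the sketch assumes both rankings contain the same domains with no duplicates.

```python
def count_swaps(r, s):
    """Number of adjacent swaps needed to reorder ranking r into ranking s."""
    if len(r) != len(s) or set(r) != set(s):
        raise ValueError("both rankings must contain the same domains")
    working = list(r)
    swaps = 0
    for target_pos, domain in enumerate(s):
        # Positions before target_pos are already correct, so search from there.
        current_pos = working.index(domain, target_pos)
        swaps += current_pos - target_pos
        # Moving the domain up shifts the skipped entries down by one place.
        working.insert(target_pos, working.pop(current_pos))
    return swaps
```

For instance, if x is first in s but fifth in r, the first iteration contributes 4 swaps, and a fully reversed ranking of n domains gives n(n-1)/2.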
Test 1: 03/27/2009
– Scholar (n = 23): N = 17
– Size (n = 23): N = 28
– Rich files (n = 23): N = 62
– Scholar (n = 100): N = 555
– Note the worst-case scenario is n(n-1)/2 swaps,
and a random ranking is around n(n-1)/4. For n = 23
that is 253 and about 127 swaps; for n = 100 it is
4,950 and 2,475.