IN3305, Literature Search, 2010

Literature Search
http://www.pds.ewi.tudelft.nl/~iosup/Courses/2011_aiosup_lit_search.ppt
IN 3305
Alexandru Iosup and Tomas Klos.
July 16, 2015
Parallel and Distributed Systems Group
http://www.pds.ewi.tudelft.nl/
Innovation: Vital Competitive Tool
Source: Economist Intelligence Unit, A new ranking of the world’s most innovative countries,
April 2009, http://graphics.eiu.com/PDF/Cisco_Innovation_Complete.pdf
• Innovation = novel application of knowledge
• Innovation favors small (but efficient) countries
• High-tech companies tend to be more innovation-intensive
What is Novel?
The Overwhelming Growth of Knowledge
“When 12 men founded the Royal Society in 1660, it was possible for an educated person to encompass all of scientific knowledge. […] In the last 50 years, such has been the pace of scientific advance that even the best scientists cannot keep up with discoveries at frontiers outside their own field.”
Tony Blair, PM Speech, May 2002
[Figure: number of publications, 1993–1997 and 1997–2001]
Data: King, The scientific impact of nations, Nature’04.
The “Size” of a Research Topic
• Grid Computing
• Billions of $ in research investment
• 2,500 PhDs (my est.)
• Over 15,000 scientific publications (my est.) in 15 years
• Several surveys of 100-200 articles each
• Grid Scheduling
• Conferences: Grid, CCGrid, HPDC, SC, IPDPS, ICDCS, …
• Journals: TPDS, CCPE, FGCS, JoGC, …
• Peer-to-Peer Search Methods
• Survey of over 300 articles after 5 years of research
How to Talk About Books You Haven’t Read
• “There is more than one way not to read”
• Not opening the book
• You cannot read everything
• How many books can you read?
• How many books can a librarian read?
• Librarians can talk about every book in
the library (every book out of millions)
→ There exists a system to (not) read
Literature Surveys:
At the Core of Innovation
Given a problem (topic of interest)
Answer questions about it
• What solutions exist?
• What is the most influential solution?
• What is the rate of innovation in the field?
• Where and how can I innovate?
By surveying (understanding, interpreting, and summarizing)
the body of related (scientific) knowledge.
IN3305’s study goal: “getting acquainted with scientific literature” (“kennismaken met wetenschappelijke literatuur”)
Outline
• From the IN3305 study goals: “getting acquainted with scientific literature”
• To read or not to read?
• What is “scientific literature”?
• Literature is input and output
• Measuring and assessing Quality
• Useful sites and tools
• On gaming the citation indices (unethical)
• Conclusion
Literature = input
• Citations
• Place your work in context
• Give credit to previous work
• Support your arguments
• Show your marginal contribution
• Prevent plagiarism
• Read what you cite! (prevent superfluous citing)
This does NOT mean:
• “You should read everything”
• “You cannot read what you don’t cite”
Quality?
• Reputation: ACM, IEEE, Springer, Elsevier,
MIT/Princeton/Oxford/… University Press
• SCIgen, an automatic CS paper generator: http://pdos.csail.mit.edu/scigen/
• A SCIgen paper was accepted (non-reviewed) for the 2005 World MultiConference on Systemics, Cybernetics and Informatics (another one: an Elsevier journal!)
Sources: peer-reviewed
• Textbook/monograph: for teaching and background
• Complete treatment of a topic
• Cite a textbook? Mention chapter or page number
• Journal article
• More space, more detail, and more thorough than a conference paper
• Sometimes old news at publication date (lag)
• Paper in edited volume:
• Multiple papers, review of state-of-the-art
• Cite individual papers
• Paper in conference proceedings
• Recent results
• Conference quality; publisher of proceedings?
Sources: not peer-reviewed
• Working papers, Preprints
• Up-to-date, spread ideas
• “Open access”
• Computing Research Repository (CoRR)
http://arxiv.org/corr/home
• Websites
• ‘Personal communication’
Literature = output
• Publish to conferences and journals
• Peer-review (for conferences, journals):
• (Double-)blind review
• Outcomes: accept, with or without (major) revisions; or reject
• Acceptance rate, e.g., 25% (not bad)
• (Nature: 10% of articles are reviewed)
• Measuring scientific output: “scientometrics”
Scientometrics
• Scientometrics: “measuring and analyzing science”
• Bibliometrics: “study or measurement of texts and information”
• Citation analysis
• Which papers cite a given paper? Which papers does it cite?
• Authority of countries, research groups, individual authors, journals/conferences, individual papers
Q What is a citation?
• “Publish or perish”: quality vs quantity
• (“80% of all published papers are not cited”)
Q Conference or journal? Which conference or journal?
Comparing Countries
[Figure: comparison of countries. Labels: citation intensity = #citations/GDP; citation rate per paper, normalized]
Data: King, The scientific impact of nations, Nature’04.
Comparing Groups or Individuals [1/3]
• An idea: Google PageRank principle
• Web: network of sites, linking to each other
• Science: network of papers, citing each other
[Figures: World Wide Web’s Links Network; Academic Citations Network (axis: Time)]
Q Problems with this approach?
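The PageRank idea above can be tried on a toy citation network. This is a minimal sketch, not Google's implementation: the function name `pagerank`, the damping factor 0.85, the iteration count, and the four paper names are all assumptions made for illustration.

```python
def pagerank(cites, damping=0.85, iterations=50):
    """Power-iteration PageRank on a citation graph.

    cites maps each paper to the list of papers it cites.
    Returns a dict mapping each paper to its rank (ranks sum to 1).
    """
    papers = set(cites)
    for targets in cites.values():
        papers.update(targets)
    papers = sorted(papers)
    n = len(papers)
    rank = {p: 1.0 / n for p in papers}
    for _ in range(iterations):
        # Every paper gets a small baseline share (the "random jump").
        new_rank = {p: (1.0 - damping) / n for p in papers}
        for p in papers:
            targets = cites.get(p, [])
            if targets:
                # A paper passes its rank on to the papers it cites.
                share = damping * rank[p] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # A paper that cites nothing spreads its rank evenly.
                share = damping * rank[p] / n
                for q in papers:
                    new_rank[q] += share
        rank = new_rank
    return rank


# Hypothetical toy network: newer papers cite older ones.
toy_citations = {
    "survey_2009": ["method_2005", "method_2003", "classic_1998"],
    "method_2005": ["method_2003", "classic_1998"],
    "method_2003": ["classic_1998"],
    "classic_1998": [],
}
ranks = pagerank(toy_citations)
for paper, score in sorted(ranks.items(), key=lambda kv: -kv[1]):
    print(f"{paper}: {score:.3f}")
```

In this toy network citations only point backward in time, so the oldest paper collects the most rank and the newest the least. That is one thing to keep in mind when answering the question on this slide.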
Comparing Groups or Individuals [2/3]
• Journals: Journal Impact Factor
• Personal: h-index (Hirsch, 2005):
A scientist has index h if h of his/her N papers have at
least h citations each, and the other (N − h) papers
have no more than h citations each. Used in practice.
• Extensions: g-index, e-index; group evaluation
Q What about conferences?
Q Really, what is a citation?
Q (unethical) How to abuse citation indices?
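The h-index definition above can be sketched in a few lines. The function name `h_index` and the citation counts are illustrative assumptions; tools such as Publish or Perish compute the index from real citation data.

```python
def h_index(citations):
    """Return the largest h such that h papers have at least h citations each."""
    ranked = sorted(citations, reverse=True)
    h = 0
    for position, count in enumerate(ranked, start=1):
        if count >= position:
            h = position
        else:
            break
    return h


# Hypothetical author with five papers.
example_citations = [10, 8, 5, 4, 3]
print(h_index(example_citations))  # 4: four papers have at least 4 citations each
```

Sorting in descending order lets the loop stop at the first paper whose citation count falls below its rank.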
Citation Databases
• Commercial
• Science Citation Index (Web of Science / Institute for Scientific Information, ISI)
• Scopus (Elsevier)
• Free
• Google Scholar: better coverage than ISI
• CiteSeer (computer science)
• ArNetMiner (computer science)
• RePEc (economics)
• More: en.wikipedia.org/wiki/List_of_academic_databases_and_search_engines
Journal Impact Factor (JIF)
• Many journals have no impact factor
• JIF is the average number of citations in a given year,
to papers in a journal in the 2 previous years.
• For journal x, 2008:
\[
\mathrm{JIF}(x, 2008) = \frac{\text{number of citations in 2008 to papers in journal } x \text{ from the period 2006--2007}}{\text{total number of papers in journal } x \text{ in the period 2006--2007}}
\]
• What does an average value mean?
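A small worked example may help with both the JIF definition and the question about averages. It is a sketch under stated assumptions: the function name `journal_impact_factor` and the per-paper citation counts are invented for illustration and do not describe any real journal.

```python
from statistics import median


def journal_impact_factor(citations_in_year, papers_in_window):
    """Citations received in one year to papers from the two previous years,
    divided by the number of papers published in those two years."""
    if papers_in_window <= 0:
        raise ValueError("the journal must have published at least one paper")
    return citations_in_year / papers_in_window


# Hypothetical journal x: citations received in 2008 by each of the
# ten papers it published in 2006-2007.
citations_per_paper = [0, 0, 0, 0, 1, 1, 2, 3, 5, 48]

jif = journal_impact_factor(sum(citations_per_paper), len(citations_per_paper))
typical = median(citations_per_paper)
print(f"JIF = {jif}, median citations per paper = {typical}")
```

With these invented numbers a single highly cited paper lifts the JIF to 6.0 while the median paper has 1 citation, so the average says little about a typical paper in the journal.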
Journal Impact Factors, 2004
[Figure: 2004 Science Journals Impact Factors (source: ISI). JIF (log scale, 0.001 to 100) against journal rank (0 to 5000). Annotations: highest JIF ~30; very high JIF ≥15; ≥1 citation/publication (last 2 years)]
CS Impact Factors, 2005
[Figure: 2005 Impact Factor CS Journals (source: ISI). JIF (log scale) against journal rank. Annotations: CS: highest JIF ~8, very high JIF ≥2; all journals: highest JIF ~30, very high JIF ≥15]
Comparing Groups or Individuals [3/3]
For Computer Science
• Conference proceedings are to be preferred to journals
• ISI Web of Science and Elsevier Scopus are not good impact indicators: their coverage is poor, albeit improving
• Google Scholar is a better impact indicator than ISI WoS and Elsevier Scopus; ArNetMiner is reasonable
• DBLP is a good, selective source, but has no citation links
• Expert knowledge is required to select the best topical
conferences and journals (regardless of their acceptance
ratios and impact factors)
Q Problems with this approach?
Method To Find Sources
• Browse:
• Google Scholar: http://scholar.google.com/
• DBLP: http://dblp.uni-trier.de/
• Others: TU Delft library tools
• Study author using Publish or Perish
• Look at author homepages
• Follow links and citations (forward and backward)
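Following citations forward and backward can be pictured as a breadth-first walk over a citation graph. The sketch below assumes the graph is already available as a dictionary; the function name `snowball`, the `max_hops` parameter, and the paper labels are hypothetical, and nothing here queries Google Scholar or DBLP.

```python
from collections import deque


def snowball(seed, cites, max_hops=2):
    """Collect papers reachable from seed within max_hops steps, following
    citations backward (its references) and forward (papers citing it).

    cites maps each paper to the list of papers it cites.
    Returns a dict mapping each found paper to its distance from the seed.
    """
    # Invert the graph once so "cited by" lookups are cheap.
    cited_by = {}
    for paper, refs in cites.items():
        for ref in refs:
            cited_by.setdefault(ref, []).append(paper)

    found = {seed: 0}
    queue = deque([seed])
    while queue:
        paper = queue.popleft()
        if found[paper] == max_hops:
            continue
        neighbours = cites.get(paper, []) + cited_by.get(paper, [])
        for other in neighbours:
            if other not in found:
                found[other] = found[paper] + 1
                queue.append(other)
    return found


# Hypothetical citation graph.
toy_cites = {"A": ["B", "C"], "D": ["A"], "E": ["D"], "F": ["E"], "B": ["G"]}
reading_list = snowball("A", toy_cites, max_hops=2)
print(reading_list)
```

Limiting the number of hops keeps the reading list manageable, in line with the point that you cannot read everything.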
Google Scholar
• “cited by”
• Relevant authors
• TU Delft SFX linking
• Import into BibTeX
Google Scholar at Work
From home: use VPN!
DBLP
• “lists more than one million articles” (April 2008)
• Indexes:
• Authors
• Now also “Faceted search”, “CompleteSearch”
• Conferences
• Journals
• Series
• Subjects
DBLP at Work
TU Delft Library
• Search
• http://www.library.tudelft.nl/ws/search/
• e.g. “information by subject” -> computer science
• TUlib
• “how to find and use scientific information”
• http://www.library.tudelft.nl/tulib/
Harzing’s Publish or Perish
• Uses Google Scholar data
• Calculates many indices
• Number of citations (also per year / article / author /…)
• Hirsch’s h-index
• Zhang’s e-index (excess in h-index set)
• Egghe’s g-index
• …
• Similar online tool: ArNetMiner
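The indices listed above can be sketched side by side. The definitions used here are the commonly cited ones, stated as assumptions: the g-index is the largest g such that the top g papers together have at least g² citations, and the e-index is the square root of the citations in the h-core beyond h². The function names and citation counts are illustrative and are not taken from Publish or Perish.

```python
import math


def h_index(citations):
    """Largest h such that h papers have at least h citations each."""
    ranked = sorted(citations, reverse=True)
    h = 0
    for position, count in enumerate(ranked, start=1):
        if count >= position:
            h = position
    return h


def g_index(citations):
    """Largest g such that the top g papers together have at least g*g citations
    (capped at the number of papers)."""
    ranked = sorted(citations, reverse=True)
    total = 0
    g = 0
    for position, count in enumerate(ranked, start=1):
        total += count
        if total >= position * position:
            g = position
    return g


def e_index(citations):
    """Square root of the excess citations in the h-core (beyond h*h)."""
    ranked = sorted(citations, reverse=True)
    h = h_index(ranked)
    excess = sum(ranked[:h]) - h * h
    return math.sqrt(excess)


# Hypothetical author with five papers.
example_citations = [10, 8, 5, 4, 3]
print(h_index(example_citations), g_index(example_citations),
      round(e_index(example_citations), 3))
```

For this invented author the h-index is 4, while the g-index and e-index also give credit for the citations that the h-index ignores.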
Publish or Perish (http://www.harzing.com/pop.htm)
Unethical! How to Game the Citation System?
[Figure: (part of) the collaboration graph]
All authors with Erdős number 1
Note: The h-index was “invented” almost a decade after Erdős’s death.
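An Erdős number is the shortest co-authorship distance to Erdős in the collaboration graph, which a breadth-first search can compute. This is a minimal sketch on invented data: the function names and every author other than Erdős are hypothetical.

```python
from collections import deque


def build_coauthor_graph(papers):
    """papers is a list of author tuples; returns author -> set of co-authors."""
    graph = {}
    for authors in papers:
        for author in authors:
            graph.setdefault(author, set()).update(
                other for other in authors if other != author
            )
    return graph


def collaboration_distances(graph, source):
    """Breadth-first search: shortest co-authorship distance from source.

    Authors not connected to source do not appear in the result.
    """
    distances = {source: 0}
    queue = deque([source])
    while queue:
        author = queue.popleft()
        for other in graph.get(author, ()):
            if other not in distances:
                distances[other] = distances[author] + 1
                queue.append(other)
    return distances


# Hypothetical papers (only the author lists matter here).
toy_papers = [
    ("Erdős", "Alice"),
    ("Alice", "Bob"),
    ("Bob", "Carol", "Dave"),
    ("Eve", "Frank"),
]
erdos_numbers = collaboration_distances(build_coauthor_graph(toy_papers), "Erdős")
print(erdos_numbers)
```

Authors with no path to Erdős, such as Eve and Frank in this toy data, are simply absent from the result; by convention their Erdős number is infinite.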
Collaboration Graph Degree Distribution
[Figure: degree distribution of the collaboration graph; label: Erdős]
Collaboration Graph: Connected Components Distribution
[Figure: connected components distribution; label: Giant Component]
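The degree distribution and the connected components shown on these two slides can both be computed from an adjacency structure. The sketch below uses a small invented graph; the function names and node labels are assumptions, and the graph is expected to be symmetric with every author present as a key.

```python
from collections import Counter


def degree_distribution(graph):
    """Map each degree (number of co-authors) to how many authors have it."""
    return Counter(len(neighbours) for neighbours in graph.values())


def connected_components(graph):
    """Return the connected components as sets, largest first."""
    seen = set()
    components = []
    for start in graph:
        if start in seen:
            continue
        # Depth-first traversal collecting everything reachable from start.
        component = {start}
        seen.add(start)
        stack = [start]
        while stack:
            node = stack.pop()
            for other in graph[node]:
                if other not in seen:
                    seen.add(other)
                    component.add(other)
                    stack.append(other)
        components.append(component)
    return sorted(components, key=len, reverse=True)


# Hypothetical collaboration graph: author -> set of co-authors.
toy_graph = {
    "A": {"B", "C", "D"},
    "B": {"A", "C"},
    "C": {"A", "B"},
    "D": {"A"},
    "E": {"F"},
    "F": {"E"},
    "G": set(),
}
components = connected_components(toy_graph)
print([len(c) for c in components])    # sizes, largest first
print(degree_distribution(toy_graph))  # degree -> number of authors
```

In this toy graph the largest component plays the role of the giant component, and author G, who has no co-authors, forms a component alone.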
Interested?
• Mark Newman analyzes the phenomenon:
“who is the best connected scientist?”
• Other references
• Erdős Number Project
http://www.oakland.edu/enp/
• Kevin Bacon Oracle
http://oracleofbacon.org/
More on the (unethical) Gaming of the Citation Indices
• Self-cite, self-cite, self-cite
• Journals asking for submitters to cite journal’s papers
• Program committee members and reviewers asking for
their own work to be cited (when not necessary)
• Not citing old work because it’s old: “killing” old results now allows you to republish them later
• Work on a popular topic—more people, more citations,
more chances
• (Google Scholar-only) Blog, Tweet, and FB daily about
your papers. Ask your friends to re-post.
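One simple way to see the effect of self-citation is to count citations twice: once in total, and once excluding citations where the citing and cited papers share an author. This is a sketch under that assumed definition of self-citation; the function name, paper identifiers, and author names are hypothetical.

```python
def citation_counts(papers, citations):
    """Count citations per paper, with and without self-citations.

    papers maps a paper id to the set of its authors.
    citations is a list of (citing_paper, cited_paper) pairs.
    A citation counts as a self-citation if the two papers share an author.
    """
    total = {paper: 0 for paper in papers}
    independent = {paper: 0 for paper in papers}
    for citing, cited in citations:
        total[cited] += 1
        if not (papers[citing] & papers[cited]):
            independent[cited] += 1
    return total, independent


# Hypothetical papers and citations.
toy_papers = {
    "p1": {"Ann"},
    "p2": {"Ann", "Ben"},
    "p3": {"Cy"},
    "p4": {"Ann"},
}
toy_citations = [
    ("p2", "p1"), ("p4", "p1"), ("p3", "p1"),
    ("p4", "p2"), ("p3", "p2"),
]
total, independent = citation_counts(toy_papers, toy_citations)
print(total)
print(independent)
```

In this invented data, paper p1 has three citations in total but only one from a paper with no shared author.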
How to Talk About Books You Haven’t Read
There exists a system to (not) read
1. Know where to find sources
• Trustworthy: DBLP, ACM DL, Google Scholar
• Less trustworthy: CoRR, …
2. Know how to find good sources
• Number of citations: Google Scholar + others
• H-index: Publish or Perish (the program)
• Try to avoid or down-weight citation cliques
3. Select from the good sources
Questions?