Computer sciences - University College London
Licensing is Software Too:
Achievements and Challenges
(and how this relates to code provenance)
Massimiliano Di Penta
University of Sannio, Italy
[email protected]
http://www.rcost.unisannio.it/mdipenta
1
Acknowledgements
Daniel M. Germán, Univ. Victoria, Canada
Julius Davies, Univ. Victoria, Canada
Giuliano Antoniol, Ecole Polyt. Montréal, Canada
Yann-Gaël Guéhéneuc, Ecole Polyt. Montréal, Canada
2
Reusing Open Source Software
When developing a software system,
we try (if possible) not to reinvent the wheel
Components, libraries, source
code snippets are out there, ready to be reused
Code search engines are becoming popular
Open source code modification and
redistribution are governed by
Software licenses
Copyright statements
Everything contained in a licensing
block…
3
What does a licensing statement contain?
/* -*- Mode: C++; tab-width: 2; indent-tabs-mode: nil; c-basic-offset: 2 -*- */
/* ***** BEGIN LICENSE BLOCK *****
 * Version: MPL 1.1/GPL 2.0/LGPL 2.1
 *
 * The contents of this file are subject to the Mozilla Public License Version
 * 1.1 (the "License"); you may not use this file except in compliance with
 * the License. You may obtain a copy of the License at
 * http://www.mozilla.org/MPL/
 ….
 * Portions created by the Initial Developer are Copyright (C) 2002
 * the Initial Developer. All Rights Reserved.
 *
 * Contributor(s):
 *   Brian Ryner <[email protected]>
 *
 ….
 * decision by deleting the provisions above and replace them with the notice
 * and other provisions required by the GPL or the LGPL. If you do not delete
 * the provisions above, a recipient may use your version of this file under
 * the terms of any one of the MPL, the GPL or the LGPL.
 * ***** END LICENSE BLOCK ***** */
#include "nsXULAppAPI.h"
#ifdef XP_WIN
#include <windows.h>
Annotated parts: License (MPL+GPL+LGPL), Copyright statement, Copyright year, Contributor
4
Restrictive vs. permissive licenses
Restrictive (aka copyleft or reciprocal)
Changed software must be made available under
similar terms wrt. the original
Example: GPL
Permissive
Modifications/enhancements may remain proprietary
Distribution of source code or binary permitted
– Provided copyright notice and/or liability disclaimers
– Contributor names do not imply endorsement
Examples: Berkeley Software Distribution (BSD),
Apache Software License, MIT
5
FOSS development teams care!
(source: Debian)
I am in the process of trying to prepare 0.8.0 for
Debian GNU/Linux I have started going over the
copyright/license headers. In src/celeste many files are
missing copyright information. Most of these are files
imported with minimal changes from Gabor API
http://www.kung-foo.tv/gaborapi.php or libsvm
http://www.csie.ntu.edu.tw/~cjlin/libsvm/.
The attached patch adds copyright and license statements
to these files.[1]
Please apply and update the headers (adding copyright
holders) if you make substantial changes.
thanks, cu andreas
[1] I have doublechecked with Gabor API's upstream
author Adriaan Tijsseling that files like
ContrastFilter.cpp are Copyright (c) Adriaan Tijsseling
and licensed under GPLv2+, although the original headers
just say:
Original Author:
Yasunobu Honma
Modifications by:
Adriaan Tijsseling (AGT)
6
Conjectures
Since licenses determine the way software can
be composed and re-distributed
They may change/evolve as any other part of the
software
They might be subject to bugs too
– See our ICPC 2010 paper about how to identify licensing
incompatibilities
They might determine the success/failure of a
software project
Code provenance and licenses:
Licenses constrain source code migration between
projects
Code provenance might be useful to determine the
licensing of closed components
7
Licenses influence the software lifetime
OpenBSD founder and project leader Theo de Raadt
removed a security software package called IP-Filter
[written by Darren Reed] after its author changed its
license.
Stephen Shankland, CNET News, 2001/05/30.
Licenses evolve as software does
Failing to account for that would cause copyright infringements
Decisions on license changes have as much impact as other decisions on
software evolution
Little attention so far from the scientific community
Need for methods and tools to audit licensing and
their changes
8
Example: Java
Until November 2006, the license of Java JDK v1.2 said:
“Except as specifically authorized in any
Supplemental License Terms, you may not make
copies of Software, other than a single copy of
Software for archival purposes”
This disallowed the inclusion of Java in Linux distributions
Java 5.0 released under the GPL v2 with the CLASSPATH
exception:
Java could be modified/updated under the GPL v2
Java programs could be released under any license as long as
they satisfy the conditions stated in the CLASSPATH exception
Changing the license of a system can promote and
ease the distribution and reuse of a software
system
9
Example: QT
First released under a non-open source but free license,
called the FreeQT License, and a commercial license
QT became the basis for KDE
QT v2.0 was released under a new license, the Q Public License
incompatible with the GPL
GNOME project started as a QT-free alternative to KDE
Harmony project started as a GPL replacement of QT
Trolltech changed the license of QT v3 to the GPL v2
The Harmony project was abandoned
Changing the license of a FOSS system
towards a more permissive one might cause
the abandonment of a competing system
11
Empirical Study
Goal: analyze licensing evolution
Purpose: investigating how developers
change licensing statements
Context: CVS/SVN repositories of
ArgoUML, Eclipse-JDT, the FreeBSD and the
OpenBSD kernels, Mozilla, Samba
13
Research Questions
RQ1: To what extent are files changing
their licenses?
RQ2: How are copyright years changed in
licensing statements?
RQ3: Who are the contributors of a
software project and how do they change?
14
Licensing Analysis Method –
Extracting Licensing statements
/* -*- Mode: C++; tab-width: 2; indent-tabs-mode: nil; c-basic-offset: 2 -*- */
/* ***** BEGIN LICENSE BLOCK *****
 * Version: MPL 1.1/GPL 2.0/LGPL 2.1
 *
 * The contents of this file are subject to the Mozilla Public License Version
 * 1.1 (the "License"); you may not use this file except in compliance with
 * the License. You may obtain a copy of the License at
 * http://www.mozilla.org/MPL/
 ….
 * Portions created by the Initial Developer are Copyright (C) 2002
 * the Initial Developer. All Rights Reserved.
 *
 * Contributor(s):
 *   Brian Ryner <[email protected]>
 *
 ….
 * decision by deleting the provisions above and replace them with the notice
 * and other provisions required by the GPL or the LGPL. If you do not delete
 * the provisions above, a recipient may use your version of this file under
 * the terms of any one of the MPL, the GPL or the LGPL.
 * ***** END LICENSE BLOCK ***** */
#include "nsXULAppAPI.h"
#ifdef XP_WIN
#include <windows.h>
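As a rough illustration of the extraction step, the sketch below collects the comments that precede the first line of code in a C/C++ file. It is a minimal sketch under stated assumptions: the slides do not say how the extraction was implemented, and the function name `extract_licensing_statement` and the "leading comments only" rule are hypothetical choices made here.

```python
import re

# Matches one /* ... */ block comment, or a run of // line comments,
# at the current position (optionally preceded by whitespace).
_LEADING_COMMENT = re.compile(r"\s*(/\*.*?\*/|(?://[^\n]*\n?)+)", re.DOTALL)


def extract_licensing_statement(source: str) -> str:
    """Return the comments that precede the first line of code.

    Assumption (not from the slides): the licensing statement is
    whatever comment text appears before any code in the file.
    """
    pos, parts = 0, []
    while True:
        match = _LEADING_COMMENT.match(source, pos)
        if match is None:
            break  # first non-comment token reached
        parts.append(match.group(1).strip())
        pos = match.end()
    return "\n".join(parts)
```

Applied to the Mozilla file above, this would keep the mode line and the license block and stop at the first `#include`.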
15
Licensing Analysis Method –
Classifying licenses
FoSSology [Gobeille, MSR 2008]: detects licenses
using the Binary Symbolic Alignment Matrix (bSAM)
Ninka [German et al., ASE 2010]: uses a pattern-matching approach
/* -*- Mode: C++; tab-width: 2; indent-tabs-mode: nil; c-basic-offset: 2 -*- */
/* ***** BEGIN LICENSE BLOCK *****
* Version: MPL 1.1/GPL 2.0/LGPL 2.1
*
* The contents of this file are subject to the Mozilla Public License Version
* 1.1 (the "License"); you may not use this file except in compliance with
* the License. You may obtain a copy of the License at
* http://www.mozilla.org/MPL/
….
* Portions created by the Initial Developer are Copyright (C) 2002
* the Initial Developer. All Rights Reserved.
*
* Contributor(s):
* Brian Ryner <[email protected]>
….
MPL 1.1/GPL 2.0/LGPL 2.1
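To make the idea of pattern-based classification concrete, here is a toy sketch. It is neither FOSSology's bSAM algorithm nor Ninka's actual rule set: the `RULES` table and the `classify` function are hypothetical and cover only the licenses named in the example above.

```python
import re

# Hypothetical, deliberately tiny rule set: (license name, pattern) pairs.
RULES = [
    ("MPL 1.1", re.compile(r"Mozilla Public License\s+Version\s+1\.1", re.I)),
    ("GPL", re.compile(r"\bGPL\b|GNU General Public License", re.I)),
    ("LGPL", re.compile(r"\bLGPL\b|GNU Lesser General Public License", re.I)),
]


def classify(statement: str) -> list[str]:
    """Return the names of all rules that match the licensing statement."""
    # Collapse comment decoration, slashes and line breaks into single
    # spaces so that phrases wrapped across comment lines still match.
    text = re.sub(r"[*/\s]+", " ", statement)
    matches = [name for name, pattern in RULES if pattern.search(text)]
    return matches or ["Unknown"]
```

A real classifier needs far more rules and has to cope with license variants, which is why the study relied on dedicated tools.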
16
Licensing Analysis Method –
Identifying changes in copyright years
Mining references to years in licensing…
/* -*- Mode: C++; tab-width: 2; indent-tabs-mode: nil; c-basic-offset: 2 -*- */
/* ***** BEGIN LICENSE BLOCK *****
* Version: MPL 1.1/GPL 2.0/LGPL 2.1
*
* The contents of this file are subject to the Mozilla Public License Version
* 1.1 (the "License"); you may not use this file except in compliance with
* the License. You may obtain a copy of the License at
* http://www.mozilla.org/MPL/
….
* Portions created by the Initial Developer are Copyright (C) 2002
* the Initial Developer. All Rights Reserved.
*
* Contributor(s):
* Brian Ryner <[email protected]>
….
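Mining year references can be sketched as follows. This is an illustrative sketch, not the study's tooling: the regular expressions, the restriction to lines mentioning a copyright, and the names `copyright_years` and `year_changes` are assumptions made here.

```python
import re

# Only lines that look like copyright statements are mined for years.
COPYRIGHT_LINE = re.compile(r"copyright|\(c\)|©", re.I)
# A single year (1900-2099) or a range such as 2002-2004.
YEAR_OR_RANGE = re.compile(r"\b((?:19|20)\d{2})(?:\s*-\s*((?:19|20)\d{2}))?\b")


def copyright_years(statement: str) -> set[int]:
    """Collect every year covered by the copyright lines of a statement."""
    years: set[int] = set()
    for line in statement.splitlines():
        if not COPYRIGHT_LINE.search(line):
            continue
        for match in YEAR_OR_RANGE.finditer(line):
            first = int(match.group(1))
            last = int(match.group(2)) if match.group(2) else first
            low, high = sorted((first, last))
            years.update(range(low, high + 1))
    return years


def year_changes(old: str, new: str) -> tuple[list[int], list[int]]:
    """Return (years added, years removed) between two revisions."""
    before, after = copyright_years(old), copyright_years(new)
    return sorted(after - before), sorted(before - after)
```

Comparing the statements of two consecutive revisions of a file with `year_changes` then shows whether a commit touched the copyright years.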
17
Licensing Analysis Method –
Identifying contributor names
Mining emails, plus various patterns
Copyright … year name
Contributor(s) …
And mapped to committers, whenever possible
/* -*- Mode: C++; tab-width: 2; indent-tabs-mode: nil; c-basic-offset: 2 -*- */
/* ***** BEGIN LICENSE BLOCK *****
* Version: MPL 1.1/GPL 2.0/LGPL 2.1
*
* The contents of this file are subject to the Mozilla Public License Version
* 1.1 (the "License"); you may not use this file except in compliance with
* the License. You may obtain a copy of the License at
* http://www.mozilla.org/MPL/
….
* Portions created by the Initial Developer are Copyright (C) 2002
* the Initial Developer. All Rights Reserved.
*
* Contributor(s):
* Brian Ryner <[email protected]>
….
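The two patterns named above ("Copyright … year name" and "Contributor(s) …") could be mined roughly as in the sketch below. The regular expressions and the function `contributor_names` are hypothetical; mapping names to committers is not attempted here.

```python
import re

# Leading comment decoration such as "/*", " * ", "*/" or "//".
DECORATION = re.compile(r"^\s*(?:/\*+|\*+/?|//)?\s*")
# An e-mail address in angle brackets after a name.
EMAIL = re.compile(r"\s*<[^>]*>")
# "Copyright (c) 1998 Jane Doe" or "Copyright 1998-2001 by Jane Doe".
COPYRIGHT = re.compile(
    r"copyright\s+(?:\(c\)\s+)?\d{4}(?:\s*[-,]\s*\d{4})*\s+(?:by\s+)?(.+)", re.I
)
TRAILER = re.compile(r"[.,]?\s*all rights reserved\.?\s*$", re.I)


def contributor_names(statement: str) -> list[str]:
    """List names from copyright lines and from a Contributor(s) section."""
    names, in_section = [], False
    for raw in statement.splitlines():
        line = DECORATION.sub("", raw).strip()
        if re.match(r"contributor\(s\)\s*:", line, re.I):
            in_section = True
            continue
        if in_section:
            if not line:
                in_section = False  # a blank comment line ends the section
            else:
                names.append(EMAIL.sub("", line).strip())
            continue
        match = COPYRIGHT.search(line)
        if match:
            names.append(TRAILER.sub("", EMAIL.sub("", match.group(1))).strip())
    return names
```

Note that a holder wrapped onto the next line, as in the Mozilla example ("Copyright (C) 2002" followed by "the Initial Developer"), is not caught by this single-line pattern.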
18
RQ1: Most relevant license changes
Eclipse-JDT
From                         To                                  Type    Count
Common Public License v1.0   Eclipse Public License v1.0         CHANGE  2394
Common Public License v0.5   Common Public License v1.0          UPDATE  808
Mozilla
From                         To                                  Type    Count
NPL                          'NPL v1.1'-style+GPL v2+LGPL v2.1   DUAL    2914
NPL                          'Dual MPL GPL'-style+MPL            DUAL    1274
'Dual MPL GPL'-style+MPL     NPL                                 BUG     1194
Licensing updated as new licenses were developed
Eclipse JDT: CPL 0.5 → CPL 1.0 → EPL 1.0
IBM has relinquished control of licenses to the Eclipse Foundation
Mozilla: NPL → MPL + GPL (+ LGPL)
NPL allowed Netscape 6 to be released as a proprietary system
MPL only allows redistributing the source code under the MPL
Multiple licenses to deal with incompatibilities
Files wrongly changed to NPL (bug #98089)
19
RQ1: Most relevant license changes
FreeBSD
From                               To                                 Type    Count
BSD UCRegents (4-cl BSD)           'BSD UCRegents'-style (4-cl BSD)   UPDATE  491
'BSD UCRegents'-style (4-cl BSD)   'INRIA-OSL'-style (3-cl BSD)       UPDATE  300
OpenBSD
From                               To                                 Type    Count
'BSD UCRegents'-style (4-cl BSD)   'INRIA-OSL'-style (3-cl BSD)       UPDATE  964
BSD UCRegents (4-cl BSD)           'BSD UCRegents'-style (4-cl BSD)   UPDATE  414
FreeBSD and OpenBSD are more eclectic than
other projects
Moving from BSD-4 clauses to the more permissive
BSD-3 and BSD-2
20
RQ1: Most relevant license changes
ArgoUML
From   To                                                                                   Type  Count
None   'Free with copyright clause'-style + 'UC Regents free with copyright clause'-style   ADD   127
Samba
From   To                                                                                   Type  Count
None   GPL v2                                                                               ADD   15
ArgoUML and Samba kept the same licenses
over the analyzed time span
Change is from None to a simple license
Authors realized the importance of including a license
21
RQ2: How and why were
copyright years changed?
Files for which the copyright years were updated
underwent a significantly higher number of
changes than others
When developers perform substantial changes to a file,
they also update copyright years
Required by copyright regulations
Lack of updates with substantial changes would allow
an infringer to claim “innocent infringement”
Commits explicitly targeted to copyright years
“Updated copyrights”
“Updated copyrights to 2004”
22
RQ3: When do contributors change?
Changes where contributor
names are added are significantly
bigger than other changes
Contributors often added
when they make substantial
changes
Contributor names are important
assets in source code
Like the signature on a picture
However…
contributors can change over time
no standard way of reporting them
no clear rule on when one should become a contributor
Their presence can have legal implications
23
Licenses Influence
Code Migration
Free (software) as a bird…
As birds migrate differently
during different seasons….
Code might have a preferential
migration direction
Given two systems
e.g. FreeBSD and Linux
We find the same code in
both systems
Three scenarios:
Migration FreeBSD → Linux
Migration Linux → FreeBSD
Migration third-party → FreeBSD, Linux
25
Sibling(s) Origin
Identify siblings between systems using clone detection
CCFinderX, with >100 tokens as threshold, plus other heuristics
Trace siblings back into the past, following their code fragments within the
same files
Again with clone detection, matching the sibling fragment against previous file revisions
The point where a fragment disappears from the history marks its origin
Take the older of the two origins as the true origin
[Figure: cloned fragments (siblings) shared by Sys 1 – File i and Sys 2 – File j, with the migration direction between them]
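The trace-back described above can be sketched as follows. This is a minimal sketch, assuming each file's history is available as a list of revisions ordered oldest to newest; the `contains` predicate is a hypothetical stand-in for the clone detector (the study used CCFinderX), and `Revision`, `fragment_origin` and `true_origin` are names introduced only for illustration.

```python
from dataclasses import dataclass
from datetime import date
from typing import Callable, Optional, Sequence


@dataclass(frozen=True)
class Revision:
    system: str   # e.g. "FreeBSD" or "Linux"
    when: date    # commit date of this file revision
    text: str     # file contents at this revision


def fragment_origin(
    history: Sequence[Revision], contains: Callable[[str], bool]
) -> Optional[Revision]:
    """Walk back from the newest revision while the fragment is still found.

    `history` is ordered oldest to newest. The last revision reached
    before the fragment disappears is the fragment's origin in this file.
    """
    origin = None
    for revision in reversed(history):
        if not contains(revision.text):
            break
        origin = revision
    return origin


def true_origin(
    history_a: Sequence[Revision],
    history_b: Sequence[Revision],
    contains: Callable[[str], bool],
) -> Optional[Revision]:
    """Take the older of the two per-system origins as the true origin."""
    origins = [
        origin
        for origin in (
            fragment_origin(history_a, contains),
            fragment_origin(history_b, contains),
        )
        if origin is not None
    ]
    return min(origins, key=lambda revision: revision.when) if origins else None
```

The `system` field of the returned revision then suggests the migration direction, for example from FreeBSD towards Linux when the FreeBSD origin is older.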
27
Code Migration and Licenses
FreeBSD     Linux     Files
BSD         GPL       8
BSD         MIT       2
BSD         None      2
Corporate   BSD+GPL   89
GPL         None      1
Phrase      BSD+GPL   1
X.Net+BSD   MIT       1
OpenBSD
BSD
BSD
Almost nothing BSD
after
BSD+GPL
BSD+Phra
se
MIT
Before
Jan 1, 2002
Linux
BSD+GPL
GPL
GPL
GPL
MIT
MIT+GPL
None
Linux
BSD+GPL
MIT
Unknown
GPL
Files
1
2
1
1
Phrase+GPL
GPL
1
23
FreeBSD
Files
Corporate
8
BSD
17
BSD+GPL
1
CPL+BSD+GPL
1
BSD
1
None
2
BSD
1
After Jan 1, 2002
Nothing before
28
Discussion
Siblings have a preferential flow
Initially from BSD(s) to Linux – frequent
Today from Linux to FreeBSD – less frequent
This is due to licenses but also to the systems' level of
development
Companies directly contribute to code in different
kernels – see Intel drivers with dual licenses
In this case, code migrates from a third party towards
Linux and FreeBSD
29
Identifying licenses of jar
archives
Motivations
Very often, Java open source software is
distributed in jar archives
See http://mvnrepository.com/
Problem: the jar might not contain
licensing info
Under what conditions can we integrate the
component?
The jar might not be legally usable
Even if it comes from open source code, we might
not find exactly the same jar
31
Search-driven approach
Extracting info from the class bytecode
Class and package names.. or a fingerprint..
We use the ASM library (http://asm.ow2.org/)
Querying Google Code Search
Using the fully qualified class name
Using the package only
Query performed using the Google Code API
(http://code.google.com/apis/gdata/)
If the same class is not found, its license is inferred
from those of classes belonging to the same package
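The two ends of this pipeline can be sketched as below. Several assumptions apply: class names are read from the jar's entry names with Python's `zipfile` module rather than from the bytecode with ASM, the code-search query is left out entirely (its results are represented by a plain `found` dictionary), and the majority vote within a package is one possible reading of the inference rule, not necessarily the exact heuristic used in the study.

```python
import zipfile
from collections import Counter
from typing import Optional


def class_names(jar) -> list[str]:
    """Fully qualified class names, derived from the jar's entry names."""
    with zipfile.ZipFile(jar) as archive:
        return [
            name[: -len(".class")].replace("/", ".")
            for name in archive.namelist()
            if name.endswith(".class")
        ]


def package_of(class_name: str) -> str:
    """Package part of a fully qualified class name ('' if none)."""
    return class_name.rpartition(".")[0]


def infer_license(class_name: str, found: dict[str, str]) -> Optional[str]:
    """License found for the class itself, else inferred from its package.

    `found` maps class names to the licenses returned by the code search.
    Assumption: the inferred license is the most frequent one among
    classes of the same package.
    """
    if class_name in found:
        return found[class_name]
    votes = Counter(
        license_name
        for other, license_name in found.items()
        if package_of(other) == package_of(class_name)
    )
    return votes.most_common(1)[0][0] if votes else None
```

A class with no search hit and no classified package neighbours stays unclassified (`None`).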
32
Google Code Search Output
33
% of correct classifications
Found license:
Min. 29%
(commons.codec), Avg.
82%, median: 89.5%
Inferred licenses:
Min. 62% (JLayer 1.0),
Avg. 95%, median 100%
The inference heuristic is
significantly better, both
in terms of completeness
and of precision
34
Incorrect classifications
Most of them are between LGPL
and GPL and between BSD and
Apache.
commons-codec: mismatching
between Apache and BSD
files licensed under the Apache v 1.1
derived from the BSD
JLayer: mismatching between GPL
and LGPL
same inferred licenses in both releases
(0.4 and 1.0)
however, JLayer moved from GPL to
LGPL from release 0.4 to release 1.0
35
Conclusions
We proposed a code analysis method to
support lawyers as well as software
engineers
We studied how licensing statements are used and evolve
License type, copyright year, contributors
Main findings:
Licenses influence project outcomes
Licenses influence code migration
Moving towards more permissive licenses
Copyright years and contributor names updated to
preserve rights on new code
36
Licensing and code provenance
Licensing influences the direction in which code
flows from a system towards another one
Often code flows in the direction of more permissive
licenses…
..but there are many other factors influencing how
code flows
Search-driven approaches can be adopted to
determine what code a closed
component comes from
And thus its licensing…
Issues related to the capabilities of the code search
tools
37
Thank you!
38
References
Daniel M. Germán, Jens H. Weber-Jahnke, Massimiliano Di Penta: Lawful
Software Engineering. FoSER 2010: Working Conference on the Future
of Software Engineering Research, November 2010, Santa Fe, USA, ACM
Daniel M. Germán, Massimiliano Di Penta, Julius Davies: Understanding and
Auditing the Licensing of Open Source Software Distributions. ICPC 2010: 84-93
Massimiliano Di Penta, Daniel M. Germán, Yann-Gaël Guéhéneuc, Giuliano
Antoniol: An exploratory study of the evolution of software licensing. ICSE 2010:
145-154
Massimiliano Di Penta, Daniel M. Germán, Giuliano Antoniol: Identifying licensing
of jar archives using a code-search approach. MSR 2010: 151-160
Massimiliano Di Penta, Daniel M. Germán: Who are Source Code Contributors and
How do they Change? WCRE 2009: 11-20
Daniel M. Germán, Massimiliano Di Penta, Yann-Gaël Guéhéneuc, Giuliano
Antoniol: Code siblings: Technical and legal implications of copying code between
applications. MSR 2009: 81-90
Daniel M. Germán, Yuki Manabe, Katsuro Inoue: A sentence-matching method for
automatic license identification of source code files. ASE 2010: 437-446
Daniel M. Germán, Ahmed E. Hassan: License integration patterns: Addressing
license mismatches in component-based development. ICSE 2009: 188-198
Robert Gobeille: The FOSSology project. MSR 2008: 47-50
39