Computer sciences - University College London


Licensing is Software Too:
Achievements and Challenges
(and how this relates to code provenance)
Massimiliano Di Penta
University of Sannio, Italy
[email protected]
http://www.rcost.unisannio.it/mdipenta
1
Acknowledgements
• Daniel M. Germán, Univ. Victoria, Canada
• Julius Davies, Univ. Victoria, Canada
• Giuliano Antoniol, Ecole Polyt. Montréal, Canada
• Yann-Gaël Guéhéneuc, Ecole Polyt. Montréal, Canada
2
Reusing Open Source Software
• When developing a software system, we try (if possible) not to reinvent the wheel
• Components, libraries, and source code snippets are out there, ready to be reused
  - Code search engines are becoming popular
• Open source code modification and redistribution are governed by
  - Software licenses
  - Copyright statements
• Everything is contained in a licensing block…
3
What does a licensing statement contain?
/* -*- Mode: C++; tab-width: 2; indent-tabs-mode: nil; c-basic-offset: 2 -*- */
/* ***** BEGIN LICENSE BLOCK *****
 * Version: MPL 1.1/GPL 2.0/LGPL 2.1
 *
 * The contents of this file are subject to the Mozilla Public License Version
 * 1.1 (the "License"); you may not use this file except in compliance with
 * the License. You may obtain a copy of the License at
 * http://www.mozilla.org/MPL/
 ….
 * Portions created by the Initial Developer are Copyright (C) 2002
 * the Initial Developer. All Rights Reserved.
 *
 * Contributor(s):
 * Brian Ryner <[email protected]>
 *
 ….
 * decision by deleting the provisions above and replace them with the notice
 * and other provisions required by the GPL or the LGPL. If you do not delete
 * the provisions above, a recipient may use your version of this file under
 * the terms of any one of the MPL, the GPL or the LGPL.
 * ***** END LICENSE BLOCK ***** */
#include "nsXULAppAPI.h"
#ifdef XP_WIN
#include <windows.h>
Callouts on the slide: License (MPL+GPL+LGPL), Copyright statement, Copyright year, Contributor
4
Restrictive vs. permissive licenses
• Restrictive (aka copyleft or reciprocal)
  - Changed software must be made available under terms similar to those of the original
  - Example: GPL
• Permissive
  - Modifications/enhancements may remain proprietary
  - Distribution of source code or binary permitted
    - Provided copyright notices and/or liability disclaimers are retained
    - Contributor names must not be used to imply endorsement
  - Examples: Berkeley Software Distribution (BSD), Apache Software License, MIT
5
FOSS development teams care!
(source: Debian)
I am in the process of trying to prepare 0.8.0 for
Debian GNU/Linux I have started going over the
copyright/license headers. In src/celeste many files are
missing copyright information. Most of these are files
imported with minimal changes from Gabor API
http://www.kung-foo.tv/gaborapi.php or libsvm
http://www.csie.ntu.edu.tw/~cjlin/libsvm/.
The attached patch adds copyright and license statements
to these files.[1]
Please apply and update the headers (adding copyright
holders) if you make substantial changes.
thanks, cu andreas
[1] I have doublechecked with Gabor API's upstream
author Adriaan Tijsseling that files like
ContrastFilter.cpp are Copyright (c) Adriaan Tijsseling
and licensed under GPLv2+, although the original headers
just say:
Original Author:
Yasunobu Honma
Modifications by:
Adriaan Tijsseling (AGT)
6
Conjectures
• Since licenses determine the way software can be composed and redistributed
  - They may change/evolve like any other part of the software
  - They might be subject to bugs too
    - See our ICPC 2010 paper on how to identify licensing incompatibilities
  - They might determine the success/failure of a software project
• Code provenance and licenses:
  - Licenses constrain source code migration between projects
  - Code provenance might be useful to determine the licensing of closed components
7
Licenses influence the software lifetime
• OpenBSD founder and project leader Theo de Raadt removed a security software package called IP-Filter [written by Darren Reed] after its author changed its license.
  Stephen Shankland, CNET News, 2001/05/30.
• Licenses evolve as software does
  - Failing to account for that can cause copyright infringement
  - Decisions on license changes have an impact, like other decisions on software evolution
  - Little attention so far from the scientific community
Need for methods and tools to audit licensing and its changes
8
Example: Java
• Until November 2006, the license of Java JDK v1.2 said:
  “Except as specifically authorized in any Supplemental License Terms, you may not make copies of Software, other than a single copy of Software for archival purposes”
  - This disallowed the inclusion of Java in Linux distributions
• Java 5.0 released under the GPL v2 with the CLASSPATH exception:
  - Java could be modified/updated under the GPL v2
  - Java programs could be released under any license as long as they satisfy the conditions stated in the CLASSPATH exception
Changing the license of a software system can promote and ease its distribution and reuse
9
Example: QT
• First released under a non-open source but free license, called the FreeQT License, and a commercial license
• QT became the basis for KDE
• QT v2.0 was released under a new license, the Q Public License
  - incompatible with the GPL
  - GNOME project started as a QT-free alternative to KDE
  - Harmony project started as a GPL replacement of QT
• Trolltech changed the license of QT v3 to the GPL v2
  - The Harmony project was abandoned
Changing the license of a FOSS system towards a more permissive one might cause the abandonment of a competing system
11
Empirical Study
• Goal: analyze licensing evolution
• Purpose: investigate how developers change licensing statements
• Context: CVS/SVN repositories of
  - ArgoUML, Eclipse-JDT, the FreeBSD and OpenBSD kernels, Mozilla, Samba
13
Research Questions
• RQ1: To what extent are files changing their licenses?
• RQ2: How are copyright years changed in licensing statements?
• RQ3: Who are the contributors of a software project and how do they change?
14
Licensing Analysis Method –
Extracting Licensing statements
/* -*- Mode: C++; tab-width: 2; indent-tabs-mode: nil; c-basic-offset: 2 -*- */
/* ***** BEGIN LICENSE BLOCK *****
 * Version: MPL 1.1/GPL 2.0/LGPL 2.1
 *
 * The contents of this file are subject to the Mozilla Public License Version
 * 1.1 (the "License"); you may not use this file except in compliance with
 * the License. You may obtain a copy of the License at
 * http://www.mozilla.org/MPL/
 ….
 * Portions created by the Initial Developer are Copyright (C) 2002
 * the Initial Developer. All Rights Reserved.
 *
 * Contributor(s):
 * Brian Ryner <[email protected]>
 *
 ….
 * decision by deleting the provisions above and replace them with the notice
 * and other provisions required by the GPL or the LGPL. If you do not delete
 * the provisions above, a recipient may use your version of this file under
 * the terms of any one of the MPL, the GPL or the LGPL.
 * ***** END LICENSE BLOCK ***** */
#include "nsXULAppAPI.h"
#ifdef XP_WIN
#include <windows.h>
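A minimal sketch of the extraction step, assuming the licensing statement is the run of comments at the very top of a C, C++ or Java file. The function name `extract_licensing_statement` and the regular expressions are illustrative choices and are not taken from the study's tooling.

```python
import re

# Assumption: the licensing statement is the leading run of /* ... */ or //
# comments. Files using other comment styles would need further patterns.
_LEADING_COMMENTS = re.compile(r"\A(?:\s*(?:/\*.*?\*/|//[^\n]*))+", re.DOTALL)


def extract_licensing_statement(source: str) -> str:
    """Return the text of the leading comments without comment markers."""
    match = _LEADING_COMMENTS.match(source)
    if match is None:
        return ""
    cleaned = []
    for raw in match.group(0).splitlines():
        line = raw.strip()
        # Drop an opening "/*", "//" or decorative "*" at the start of the line.
        line = re.sub(r"^(?:/\*+|//+|\*+/?)\s?", "", line)
        # Drop a closing "*/" at the end of the line.
        line = re.sub(r"\s*\*+/$", "", line)
        line = line.strip()
        if line:
            cleaned.append(line)
    return "\n".join(cleaned)
```

Everything after the first non-comment token (here the `#include` lines) is ignored, so the result can be fed to a license classifier or to year and contributor mining.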
15
Licensing Analysis Method –
Classifying licenses
• FoSSology [Gobeille, MSR 2008]: detects licenses using the Binary Symbolic Alignment Matrix (bSAM)
• Ninka [German et al., ASE 2010]: uses a pattern-matching approach
/* -*- Mode: C++; tab-width: 2; indent-tabs-mode: nil; c-basic-offset: 2 -*- */
/* ***** BEGIN LICENSE BLOCK *****
 * Version: MPL 1.1/GPL 2.0/LGPL 2.1
 *
 * The contents of this file are subject to the Mozilla Public License Version
 * 1.1 (the "License"); you may not use this file except in compliance with
 * the License. You may obtain a copy of the License at
 * http://www.mozilla.org/MPL/
 ….
 * Portions created by the Initial Developer are Copyright (C) 2002
 * the Initial Developer. All Rights Reserved.
 *
 * Contributor(s):
 * Brian Ryner <[email protected]>
 ….
Classification result: MPL 1.1/GPL 2.0/LGPL 2.1
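To illustrate what a pattern-matching classifier does, here is a toy sketch. The rule table `LICENSE_PATTERNS` is hypothetical and far smaller than what a real tool uses; it reproduces neither Ninka's sentence-based rules nor FoSSology's bSAM.

```python
import re
from typing import List

# Hypothetical rules for illustration only; real tools use many more patterns.
LICENSE_PATTERNS = [
    ("MPL 1.1", re.compile(r"Mozilla Public License\s+Version\s+1\.1", re.I)),
    ("GPL", re.compile(r"\bGPL\b|GNU General Public License", re.I)),
    ("LGPL", re.compile(r"\bLGPL\b|GNU Lesser General Public License", re.I)),
    ("Apache", re.compile(r"Apache (?:Software )?License", re.I)),
    ("BSD", re.compile(r"Redistribution and use in source and binary forms", re.I)),
]


def classify_license(statement: str) -> List[str]:
    """Return the labels of all rules matching the licensing statement."""
    # Collapse line breaks so a phrase wrapped across lines still matches.
    text = " ".join(statement.split())
    found = [name for name, pattern in LICENSE_PATTERNS if pattern.search(text)]
    return found or ["None"]
```

A file under several licenses yields several labels, which corresponds to the combined "MPL 1.1/GPL 2.0/LGPL 2.1" classification shown on the slide.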
16
Licensing Analysis Method –
Identifying changes in copyright years
• Mining references to years in licensing statements…
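Mining copyright years can be sketched with a regular expression. The assumptions here are ours: years appear only on lines mentioning "Copyright" or "(C)", and a range such as 1998-2000 covers every year in between. `copyright_years` and `year_changes` are hypothetical helper names.

```python
import re
from typing import Set, Tuple

_COPYRIGHT_LINE = re.compile(r"copyright|\(c\)", re.IGNORECASE)
_YEAR_OR_RANGE = re.compile(
    r"\b((?:19|20)\d{2})(?:\s*-\s*((?:19|20)\d{2}))?\b"
)


def copyright_years(statement: str) -> Set[int]:
    """Collect every year mentioned on a copyright line."""
    years: Set[int] = set()
    for line in statement.splitlines():
        if not _COPYRIGHT_LINE.search(line):
            continue  # skip lines such as "Version: MPL 1.1"
        for start, end in _YEAR_OR_RANGE.findall(line):
            first = int(start)
            last = int(end) if end else first
            years.update(range(min(first, last), max(first, last) + 1))
    return years


def year_changes(old: str, new: str) -> Tuple[Set[int], Set[int]]:
    """Return (years added, years removed) between two file revisions."""
    before, after = copyright_years(old), copyright_years(new)
    return after - before, before - after
```

Running `year_changes` on consecutive revisions of a file flags the commits in which copyright years were updated.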
17
Licensing Analysis Method –
Identifying contributor names
• Mining emails, plus various patterns
  - Copyright … year name
  - Contributor(s) …
• Names are mapped to committers whenever possible
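The two patterns above can be sketched as follows. The regular expressions and the `contributors` function are illustrative assumptions; the name and address in the usage note are made up, and mapping names to committers is not covered.

```python
import re
from typing import Set, Tuple

_EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
# "Copyright (C) <year>[, <year>...] [by] <name>"
_COPYRIGHT_NAME = re.compile(
    r"copyright\s*(?:\(c\))?\s*(?:(?:19|20)\d{2}\s*[-,]?\s*)+(?:by\s+)?(.+)",
    re.IGNORECASE,
)


def _clean(name: str) -> str:
    name = re.sub(r"<[^>]*>", "", name)  # drop "<address>" parts
    name = _EMAIL.sub("", name)          # drop bare addresses
    return name.strip(" .,;")


def contributors(statement: str) -> Tuple[Set[str], Set[str]]:
    """Return (names, emails) mined from a licensing statement."""
    names: Set[str] = set()
    emails: Set[str] = set()
    in_contributor_list = False
    for raw in statement.splitlines():
        line = raw.strip(" */\t")  # remove comment decoration
        emails.update(_EMAIL.findall(line))
        match = _COPYRIGHT_NAME.search(line)
        if match:
            candidate = _clean(match.group(1))
        elif line.lower().startswith("contributor"):
            in_contributor_list = True
            candidate = _clean(line.split(":", 1)[1]) if ":" in line else ""
        elif in_contributor_list and line:
            candidate = _clean(line)
        else:
            in_contributor_list = False  # a blank line ends the list
            candidate = ""
        if candidate:
            names.add(candidate)
    return names, emails
```

For a header containing `Copyright (C) 2002 Jane Doe` and a `Contributor(s):` list with `John Smith <john@example.org>`, the sketch returns both names and the one address.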
18
RQ1: Most relevant license changes
System      | From                       | To                                | Type   | Count
Eclipse-JDT | Common Public License v1.0 | Eclipse Public License v1.0       | CHANGE | 2394
Eclipse-JDT | Common Public License v0.5 | Common Public License v1.0        | UPDATE | 808
Mozilla     | NPL                        | 'NPL v1.1'-style+GPL v2+LGPL v2.1 | DUAL   | 2914
Mozilla     | NPL                        | 'Dual MPL GPL'-style+MPL          | DUAL   | 1274
Mozilla     | 'Dual MPL GPL'-style+MPL   | NPL                               | BUG    | 1194
Licensing updated as new licenses were developed
• Eclipse JDT: CPL 0.5 → CPL 1.0 → EPL 1.0
  - IBM has relinquished control of licenses to the Eclipse Foundation
• Mozilla: NPL → MPL + GPL (+ LGPL)
  - NPL allowed Netscape 6 to be released as a proprietary system
  - MPL only allows the source code to be redistributed under the MPL
  - Multiple licenses to deal with incompatibilities
  - Files wrongly changed to NPL (bug #98089)
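Counts like the ones reported for RQ1 can be obtained by comparing the license detected in consecutive revisions of each file. The sketch below assumes a hypothetical input format: a mapping from file path to its license labels in revision order.

```python
from collections import Counter
from typing import Dict, List


def license_transitions(history: Dict[str, List[str]]) -> Counter:
    """Count (old license, new license) pairs over consecutive revisions."""
    counts: Counter = Counter()
    for labels in history.values():
        for old, new in zip(labels, labels[1:]):
            if old != new:  # revisions that keep the license are not changes
                counts[(old, new)] += 1
    return counts
```

Calling `license_transitions(history).most_common()` then lists the most frequent transitions first.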
19
RQ1: Most relevant license changes
System  | From                             | To                               | Type   | Count
FreeBSD | BSD UCRegents (4-cl BSD)         | 'BSD UCRegents'-style (4-cl BSD) | UPDATE | 491
FreeBSD | 'BSD UCRegents'-style (4-cl BSD) | 'INRIA-OSL'-style (3-cl BSD)     | UPDATE | 300
OpenBSD | 'BSD UCRegents'-style (4-cl BSD) | 'INRIA-OSL'-style (3-cl BSD)     | UPDATE | 964
OpenBSD | BSD UCRegents (4-cl BSD)         | 'BSD UCRegents'-style (4-cl BSD) | UPDATE | 414
• FreeBSD and OpenBSD are more eclectic than other projects
  - Moving from the 4-clause BSD to the more permissive 3-clause and 2-clause BSD
20
RQ1: Most relevant license changes
System  | From | To                                                                              | Type | Count
ArgoUML | None | 'Free with copyright clause'-style +'UC Regents free with copyright clause'-style | ADD  | 127
Samba   | None | GPL v2                                                                          | ADD  | 15
• ArgoUML and Samba kept the same licenses over the analyzed time span
  - Change is from None to a simple license
  - Authors realized the importance of including a license
21
RQ2: How and why were copyright years changed?
• Files for which the copyright years were updated underwent a significantly higher number of changes than others
  - When developers perform substantial changes to a file, they also update copyright years
• Required by copyright regulations
  - Lack of updates with substantial changes would allow an infringer to claim “innocent infringement”
• Commits explicitly targeted to copyright years
  - “Updated copyrights”
  - “Updated copyrights to 2004”
22
RQ3: When do contributors change?
• Changes where contributor names are added are significantly bigger than other changes
  - Contributors are often added when they make substantial changes
• Contributor names are important assets in source code
  - Like the signature on a picture
• However…
  - contributors can change over time
  - there is no standard way of reporting them
  - there is no clear rule on when one should become a contributor
• Their presence can have legal implications
23
Licenses Influence Code Migration
Free (software) as a bird…
• As birds migrate differently during different seasons…
  - Code might have a preferential migration direction
• Given two systems
  - e.g. FreeBSD and Linux
• We find the same code in both systems
• Three scenarios:
  - Migration FreeBSD → Linux
  - Migration Linux → FreeBSD
  - Migration third-party → FreeBSD, Linux
25
Sibling(s) Origin
• Identify siblings between systems using clone detection
  - CCFinderX, with >100 tokens as threshold, plus other heuristics
• Trace siblings back into the past: their code fragments in the same files
  - Again with clone detection, matching the sibling fragment against previous file revisions
  - When they disappear, we have found their origins
• Take the oldest of the two as the true origin
[Figure: cloned fragments (siblings) in Sys 1 – File i and Sys 2 – File j, with the migration direction between them]
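The trace-back step can be sketched as below. Whitespace-insensitive substring matching stands in for clone detection, which is a much weaker check than CCFinderX. The `Revision` type and function names are hypothetical, and the third-party scenario is not detected.

```python
from datetime import date
from typing import List, Optional, Tuple

Revision = Tuple[date, str]  # (commit date, file content), oldest first


def _normalize(code: str) -> str:
    return " ".join(code.split())


def first_appearance(revisions: List[Revision], fragment: str) -> Optional[date]:
    """Walk back from the newest revision until the fragment disappears."""
    needle = _normalize(fragment)
    origin = None
    for when, content in reversed(revisions):
        if needle not in _normalize(content):
            break  # the fragment is gone: the previous hit was its origin
        origin = when
    return origin


def migration_direction(sys1: List[Revision], sys2: List[Revision],
                        fragment: str) -> str:
    """Take the older of the two origins as the true origin."""
    d1 = first_appearance(sys1, fragment)
    d2 = first_appearance(sys2, fragment)
    if d1 is None or d2 is None or d1 == d2:
        return "undetermined"
    return "sys1 -> sys2" if d1 < d2 else "sys2 -> sys1"
```

Each revision list holds the history of the one file that contains the sibling in that system.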
27
Code Migration and Licenses
[Tables: license pairs of sibling files and number of files, for siblings shared between FreeBSD and Linux and between OpenBSD and Linux. Licenses appearing in the tables: BSD, GPL, MIT, BSD+GPL, MIT+GPL, Phrase, Phrase+GPL, BSD+Phrase, CPL+BSD+GPL, X.Net+BSD, Corporate, None, Unknown. Slide callouts: "Before Jan 1, 2002", "Almost nothing after", "After Jan 1, 2002", "Nothing before".]
28
Discussion
• Siblings have a preferential flow
  - Initially from BSD(s) to Linux – frequent
  - Today from Linux to FreeBSD – less frequent
  - This is due to licenses but also to the systems' level of development
• Companies directly contribute code to different kernels – see Intel drivers with dual licenses
  - In this case, code migrates from a third party towards Linux and FreeBSD
29
Identifying licenses of jar archives
Motivations
• Very often, Java open source software is distributed in jar archives
  - See http://mvnrepository.com/
• Problem: the jar might not contain licensing info
  - Under what conditions can we integrate the component?
  - The jar might not be legally usable
• Even if it comes from open source code, we might not find exactly the same jar
31
Search-driven approach
• Extracting info from the class bytecode
  - Class and package names… or a fingerprint…
  - We use the ASM library (http://asm.ow2.org/)
• Querying Google Code Search
  - Using the fully qualified class name
  - Using the package only
  - Queries performed using the Google Code API (http://code.google.com/apis/gdata/)
• If the same class is not found, its license is inferred from those of classes belonging to the same package
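The package fallback can be sketched with an in-memory index standing in for the code search service. The `index` mapping, the majority vote among package members, and the derivation of class names from jar entry paths (instead of reading bytecode with ASM) are all simplifying assumptions.

```python
import zipfile
from collections import Counter
from typing import Dict, List, Optional


def class_names_in_jar(jar_path: str) -> List[str]:
    """List fully qualified class names from the .class entries of a jar."""
    with zipfile.ZipFile(jar_path) as jar:
        return [
            entry[: -len(".class")].replace("/", ".")
            for entry in jar.namelist()
            if entry.endswith(".class")
        ]


def package_of(class_name: str) -> str:
    return class_name.rpartition(".")[0]


def license_for(class_name: str, index: Dict[str, str]) -> Optional[str]:
    """Exact match first, otherwise infer from classes in the same package."""
    if class_name in index:
        return index[class_name]  # "found" license
    package = package_of(class_name)
    votes = Counter(
        lic for name, lic in index.items() if package_of(name) == package
    )
    # Assumption: the most common license in the package wins.
    return votes.most_common(1)[0][0] if votes else None
```

Here `index` maps fully qualified class names with a known source file to the license detected in that file. A class with no exact hit and no known package member stays unclassified (`None`).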
32
Google Code Search Output
33
% of correct classifications
• Found licenses:
  - Min. 29% (commons.codec), avg. 82%, median 89.5%
• Inferred licenses:
  - Min. 62% (JLayer 1.0), avg. 95%, median 100%
• The inferring heuristic is significantly better in terms of both completeness and precision
34
Incorrect classifications
• Most of them are between LGPL and GPL and between BSD and Apache.
• commons-codec: mismatch between Apache and BSD
  - files licensed under the Apache v1.1
  - derived from the BSD
• JLayer: mismatch between GPL and LGPL
  - same inferred licenses in both releases (0.4 and 1.0)
  - however, JLayer moved from GPL to LGPL from release 0.4 to release 1.0
35
Conclusions
• We proposed a code analysis method to support lawyers as well as software engineers
• We studied how licensing statements are used and evolve
  - License type, copyright year, contributors
• Main findings:
  - Licenses influence project outcomes
  - Licenses influence code migration
  - Projects move towards more permissive licenses
  - Copyright years and contributor names are updated to preserve rights on new code
36
Licensing and code provenance
• Licensing influences the direction in which code flows from one system towards another
  - Often code flows in the direction of more permissive licenses…
  - …but there are many other factors influencing how code flows
• Search-driven approaches can be adopted to determine what code a closed component comes from
  - And thus its licensing…
  - Issues related to the capabilities of the code search tools
37
Thank you!
38
References
• Daniel M. Germán, Jens H. Weber-Jahnke, Massimiliano Di Penta: Lawful Software Engineering. Proceedings of FoSER: Working Conference on the Future of Software Engineering Research, Santa Fe, USA, November 2010, ACM
• Daniel M. Germán, Massimiliano Di Penta, Julius Davies: Understanding and Auditing the Licensing of Open Source Software Distributions. ICPC 2010: 84-93
• Massimiliano Di Penta, Daniel M. Germán, Yann-Gaël Guéhéneuc, Giuliano Antoniol: An exploratory study of the evolution of software licensing. ICSE 2010: 145-154
• Massimiliano Di Penta, Daniel M. Germán, Giuliano Antoniol: Identifying licensing of jar archives using a code-search approach. MSR 2010: 151-160
• Massimiliano Di Penta, Daniel M. Germán: Who are Source Code Contributors and How do they Change? WCRE 2009: 11-20
• Daniel M. Germán, Massimiliano Di Penta, Yann-Gaël Guéhéneuc, Giuliano Antoniol: Code siblings: Technical and legal implications of copying code between applications. MSR 2009: 81-90
• Daniel M. Germán, Yuki Manabe, Katsuro Inoue: A sentence-matching method for automatic license identification of source code files. ASE 2010: 437-446
• Daniel M. Germán, Ahmed E. Hassan: License integration patterns: Addressing license mismatches in component-based development. ICSE 2009: 188-198
• Robert Gobeille: The FOSSology project. MSR 2008: 47-50
39