SCAPE
Characterisation - 101
An introduction to the identification and
characterisation of file formats.
Carl Wilson
Open Planets Foundation
SCAPE Training
Guimarães
This work was partially supported by the SCAPE Project.
The SCAPE project is co-funded by the European Union under FP7 ICT-2009.4.1 (Grant Agreement number 270137).
About Us
• Carl Wilson
Open Planets Foundation
http://www.openplanetsfoundation.org
• SCAPE Project
EU funded research project
SCAlable Preservation Environments
http://www.scape-project.eu
About You
• Once Around The Room
• Name
• Where you work
• What you do
• Why you’re here
• DO Ask Questions
• Or tell me to slow down…
• Or ask me to repeat something…
File Formats
• What is a File Format?
• A “standard” method of encoding data for
storage.
• May follow an open specification
• Or a proprietary one (open is preferred)
• Or simply a loosely documented
convention
Who Cares About Formats?
• Operating Systems: in order to open a file with
an application that can interpret or render it.
• Web Servers: to negotiate Content-Type in HTTP
requests
• Memory Institutions: to identify software stacks
that can render or extract meaning from a file,
now or at a later date.
• More Generally: everyone with digital content,
whether they know it or not.
Some Uses of Format Information
• Format Information:
• Associates a file with software that can
interpret and/or render its contents
• Can be used to find documentation /
specifications to help interpret a file’s contents
• Is a first step in preservation planning: knowing
what you have…
File Name Extension
• A file name suffix, separated from the file base
name by a dot “.”
• Examples: .pdf, .txt, .jpg, .doc, .docx
• This has worked for a number of years, BUT:
• Any user with the right permission can change
a file extension
• Bytes aren’t always transferred with a name
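As a minimal sketch of how little an extension really tells us, the Python below reads the extension from a name using only the standard library. The helper name extension_of and the file names are hypothetical, chosen for illustration. Note that nothing here looks at the bytes of the file.

```python
from pathlib import PurePath


def extension_of(file_name: str) -> str:
    """Return the lower-cased extension (including the dot), or '' if there is none."""
    return PurePath(file_name).suffix.lower()


# Hypothetical file names, used only for illustration.
print(extension_of("report.PDF"))   # upper-case extensions are normalised
print(extension_of("report.txt"))   # the same bytes, renamed, would now look like text
print(extension_of("README"))       # no extension at all
```

Renaming report.PDF to report.txt changes the answer without changing a single byte of content, which is why the tools later in this session inspect the data itself.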
Internet Media (MIME) Types
• The format identifiers used by the web
• Examples:
• text/plain
• text/html
• image/jpeg
• Don’t readily hold extra information such as
format version, but may be extended.
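For illustration, Python's standard mimetypes module can guess a media type from a file name. This is a sketch of name-based guessing only: it maps the extension to a type and never opens the file. The helper name guess_media_type and the fallback to application/octet-stream are choices made for this example.

```python
import mimetypes


def guess_media_type(file_name: str) -> str:
    """Guess an Internet media type from the file name alone."""
    media_type, _encoding = mimetypes.guess_type(file_name)
    # Fall back to the generic binary type when the extension is unknown.
    return media_type or "application/octet-stream"


# Hypothetical file names, used only for illustration.
for name in ("notes.txt", "index.html", "photo.jpg", "no_extension"):
    print(name, "->", guess_media_type(name))
```

The result carries no format version, so a PDF 1.1 file and a PDF 1.6 file would both come back as the same media type.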
Apple’s Alternatives
• Pre-OS X versions of Mac OS used Creator and
Type codes
• Creator: The software that created the file
• Type: The type of information, e.g. TEXT
• More flexible than extensions, but no longer
used
• Recent OS X versions also use Uniform Type
Identifiers
PRONOM Unique Identifiers or PUIDs
• PRONOM is a web-based registry of file format
information
• Created in 2002 and hosted by The National Archives of
the UK
• Uses PUIDs to identify file formats:
• fmt/15 == Acrobat PDF 1.1
• fmt/16 == Acrobat PDF 1.2
• fmt/17 == Acrobat PDF 1.3
The Unix File Utility
• A standard Unix program for identifying the data
in a file.
• First released in 1973; written in C, so it must be
compiled for each Operating System
• The open source version used in Linux distributions
was written in 1986
• Identification based upon compiled “magic” files
• Provides text information about files, or MIME
types with the right options
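One way to call the utility from a script is sketched below in Python. It assumes the open source file command is installed and on the PATH, and uses its --brief and --mime-type options. The helper names file_command and identify are hypothetical.

```python
import shutil
import subprocess


def file_command(path: str, mime: bool = False) -> list[str]:
    """Build the argument list for the file utility."""
    args = ["file", "--brief"]       # --brief: leave the file name out of the output
    if mime:
        args.append("--mime-type")   # print a MIME type instead of a text description
    args.append(path)
    return args


def identify(path: str, mime: bool = False) -> str:
    """Run file on the given path and return its one-line answer."""
    if shutil.which("file") is None:
        raise RuntimeError("the file utility is not installed")
    result = subprocess.run(
        file_command(path, mime), capture_output=True, text=True, check=True
    )
    return result.stdout.strip()
```

Calling identify(path) returns the text description, and identify(path, mime=True) returns the MIME type. The exact wording of the description depends on the magic files compiled into the local installation.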
FIDO
• Format Identification of Digital Objects
• Open Source format identification tool
• Based upon the PRONOM signature data
compiled to regular expressions
• Written in Python so can be run on different
Operating Systems
• Richer command line syntax than DROID
Apache Tika
• Open Source toolkit for detecting and extracting
metadata and structured text from files
• Performs Format Identification and deeper
characterisation (more on that later).
• Java-based, so it will run on different platforms.
• Returns MIME types as format identifiers
How Do These Tools Identify Formats?
• They exploit “common features” of the format.
• PDF start of file:
• %PDF-1.1 == PDF Version 1.1
• %PDF-1.2 == PDF Version 1.2
• %PDF-1.6 == PDF Version 1.6
• Tika and File simply look for files starting with the
string %PDF- and return the MIME type
• FIDO, however…
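The prefix check described for Tika and File can be sketched in a few lines of Python. This is an illustration of the idea, not the code either tool uses: the function name sniff_pdf is hypothetical, and reading the version digits from the header is an addition for this example, since the slide says those tools return only the MIME type.

```python
import re

# A PDF file begins with a header such as b"%PDF-1.6".
PDF_HEADER = re.compile(rb"%PDF-(\d)\.(\d)")


def sniff_pdf(data: bytes):
    """Return (mime_type, version) if the data starts with a PDF header, else None."""
    match = PDF_HEADER.match(data)  # match() only succeeds at the very start
    if match is None:
        return None
    version = f"{match.group(1).decode()}.{match.group(2).decode()}"
    return "application/pdf", version


# Made-up byte strings, used only for illustration.
print(sniff_pdf(b"%PDF-1.6\n...rest of file..."))
print(sniff_pdf(b"Just some plain text"))
```

Only the first few bytes are examined, so the check is cheap, but it says nothing about whether the rest of the file is a well-formed PDF.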
FIDO & PDF Identification
• FIDO identifies the different PDF versions, each of
which has its own PUID
• FIDO also looks for an END OF FILE marker for
PDFs: .%%EOF.
• This could be a problem…
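To see how a stricter signature might behave, here is a rough Python sketch that requires both the header and an end-of-file marker before returning a PUID. It is not FIDO's actual signature or code. The three PUIDs are the ones listed on the PRONOM slide, the function name strict_puid is hypothetical, and the rule that the marker must be the last thing in the file (ignoring trailing whitespace) is an assumption made for this example.

```python
import re

# PUIDs from the PRONOM slide; other PDF versions are left out of this sketch.
PUIDS = {"1.1": "fmt/15", "1.2": "fmt/16", "1.3": "fmt/17"}

HEADER = re.compile(rb"%PDF-(\d\.\d)")
EOF_MARKER = b"%%EOF"


def strict_puid(data: bytes):
    """Return a PUID only when both the header and a trailing EOF marker are present."""
    header = HEADER.match(data)
    if header is None:
        return None
    # Assumption: the marker must end the file, ignoring trailing whitespace.
    if not data.rstrip().endswith(EOF_MARKER):
        return None
    return PUIDS.get(header.group(1).decode())


# Made-up byte strings, used only for illustration.
complete = b"%PDF-1.2\n...body...\n%%EOF\n"
truncated = b"%PDF-1.2\n...body..."
print(strict_puid(complete))
print(strict_puid(truncated))
```

Under these assumptions, a truncated file, or one with extra bytes after the marker, gets no PUID at all, even though a simple prefix check would still call it a PDF. That is one possible reading of the problem hinted at above.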