Lecture 5 Transcript

Data Models
• There are 3 parts to a GIS:
– GUI
– Tools
– Data Management System
• May be distributed on separate machines connected by a
network
• We will look today at the different ways in which the data
are stored within a GIS
Levels Of Abstraction
• Can identify four levels of abstraction:
– Reality – i.e. the real world
– Conceptual model - a human-orientated, partially
structured model of selected objects and processes
relevant to a particular problem domain.
– Logical model – an implementation-independent, but
implementation-orientated representation of reality. It is
often represented as a diagram showing the selected
objects and relationships between them.
– Physical model – a physical model describes the exact
files or database tables used to store the data, etc. It is
specific to a particular implementation.
Conceptual Models
• Can identify three conceptualisations of space:
– Field-based – attributes can be thought of as varying
continuously from place to place (e.g. precipitation).
Can be 2-D or 3-D (e.g. air pollution).
– Object-based – features can be thought of as discrete
entities or objects. They can be large or small, physical
or conceptual (e.g. counties), and can contain other objects.
– Networks – object-based, but emphasis is on the
interaction between objects along pathways.
Logical Models
• The term spatial (or geographical) data model is used to
describe how data are organised within a GIS.
• The two main types are:
– Raster. Study area is divided into regular cells (usually
rectangular). Often used to model field data, but the cells
do not actually form a continuous surface – they are sample points.
– Vector. Geometric primitives (i.e. points, lines,
polygons) are used to represent objects.
• Different phenomena are modelled as layers. In a raster
model each layer represents a variable attribute; in a vector
model each layer is usually a particular type of object.
Conceptual-Logical
Relationships
• Field data are normally modelled using a raster, whilst
object-based conceptualisations are normally modelled
using a vector model.
• However, field data can be modelled using a vector model
– e.g. contour lines, or using a triangulated irregular
network (TIN).
• Raster models can be used to model objects by assigning
an object identifier to each cell which can be joined to an
attribute table.
Physical Models
• A physical data model is the specific implementation of a
logical model – i.e. how the data are actually stored within
the computer.
• The term data structure is sometimes used to describe
how the data are organised within the computer.
• Before we look at some specific details, it is useful to look
briefly at some more general considerations of data
storage.
Data Storage Considerations
• The two main considerations relate to:
– Space
– Time
• There is usually a tradeoff between minimising the space
required to store the data and maximising the speed at
which it can be accessed.
Space
• Digital information is stored in a computer as binary digits
(or bits), each of which can have a value of 0 or 1. A byte
is a group of 8 bits. Bytes are sometimes grouped in fours,
referred to as a word.
• Computer storage is usually measured in bytes. A kilobyte
is 1024 (i.e. 2^10, or approximately 10^3) bytes. A megabyte
is 1 million (i.e. 10^6) bytes, a gigabyte is 1 billion (i.e. 10^9)
bytes, and a terabyte is a million million (i.e. 10^12) bytes.
Search Time (1)
• Data on a particular entity (e.g. a person, an area, an
object) are normally stored together to form a record with
a unique identifier. A set of records is usually stored in a
named storage area known as a file.
• The time taken to find a specific record depends upon how
the file is organised.
• Simple sequential files are very inefficient – average of
(n+1)/2 reads.
• Direct access files speed up searches – i.e. can jump
straight to a record if you know its record number.
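• As a rough Python sketch (the record layout and field sizes
below are invented for illustration): a sequential search reads
record after record, averaging (n+1)/2 reads, whereas a direct
access file can seek straight to record k because every record
has the same length.

import struct

RECORD_FMT = "i20s"                        # hypothetical record: integer id + 20-byte name
RECORD_SIZE = struct.calcsize(RECORD_FMT)

def sequential_search(path, key):
    # Read records one after another until the key matches: (n+1)/2 reads on average.
    with open(path, "rb") as f:
        while chunk := f.read(RECORD_SIZE):
            rec_id, name = struct.unpack(RECORD_FMT, chunk)
            if rec_id == key:
                return name.rstrip(b"\x00").decode()
    return None

def direct_access_read(path, record_number):
    # Jump straight to record k: one seek and one read, however large the file is.
    with open(path, "rb") as f:
        f.seek(record_number * RECORD_SIZE)
        rec_id, name = struct.unpack(RECORD_FMT, f.read(RECORD_SIZE))
        return rec_id, name.rstrip(b"\x00").decode()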
Search Time (2)
• There are various ways to identify the record number
corresponding to a given key:
– Binary search. Records must be sequenced by their
key field.
– Hash addressing. An algorithm is used to translate key
field values into record numbers (or ‘buckets’). Not
necessarily a unique bucket for each key.
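• Both can be sketched in a few lines of Python (the key values
and bucket count below are assumptions, not part of the lecture
example):

def binary_search(sorted_keys, key):
    # Records must be sequenced by their key field; halve the search range each step.
    lo, hi = 0, len(sorted_keys) - 1
    while lo <= hi:
        mid = (lo + hi) // 2
        if sorted_keys[mid] == key:
            return mid                     # the record number
        if sorted_keys[mid] < key:
            lo = mid + 1
        else:
            hi = mid - 1
    return None

def hash_address(key, n_buckets=100):
    # Translate a key value into a bucket number; different keys may share a bucket.
    return hash(key) % n_buckets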
Search Time (3)
• Efficiency can be improved using an index file containing
just record numbers and key fields. Further enhancements
include:
– Sparse index – might use every 10th record
– Secondary index – can be used to identify records
according to a second criterion (e.g. area of residence)
• Pointers are a common device in computing. Could, for
example, be used to create a linked list (e.g. of people with
a particular characteristic).
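• A small Python sketch of a sparse index and of pointers forming
a linked list (the every-10th-record spacing follows the bullet
above; the record contents are invented):

# Sparse index: store every 10th key with its record number; to find a key,
# locate the nearest indexed entry below it and scan forward from there.
records = [(i * 3, f"record {i}") for i in range(1000)]          # sorted by key (hypothetical)
sparse_index = [(records[i][0], i) for i in range(0, len(records), 10)]

def find(key):
    start = 0
    for k, recno in sparse_index:
        if k > key:
            break
        start = recno
    for recno in range(start, min(start + 10, len(records))):
        if records[recno][0] == key:
            return recno
    return None

# Pointers chaining records with a shared characteristic into a linked list:
# each record stores the record number of the next member, None marking the end.
people = [
    {"name": "Ann",   "area": "North", "next_in_area": 2},
    {"name": "Bob",   "area": "South", "next_in_area": None},
    {"name": "Carol", "area": "North", "next_in_area": None},
]

def walk_area_list(start_recno):
    recno = start_recno
    while recno is not None:
        yield people[recno]["name"]
        recno = people[recno]["next_in_area"]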
Raster Data Models (1)
• Raster data for several layers could be stored in various
ways:
– By location – i.e. list all the attributes for cell 1, then cell 2, etc.
– By coverage – i.e. all the cells for coverage (or layer) 1, then
coverage 2, etc.
– By binary coverage – all cells having attribute 1 in coverage 1
saved as Boolean 1, then all cells having attribute 2 in coverage 1,
etc., repeated then for coverage 2.
– By data value – location of all cells having attribute 1 in coverage
1 saved as x,y, then attribute 2 coverage 1, etc.
• Example encodings for the two coverages in the accompanying
figure (Landuse and Roads):
– By location: [2,1, 2,0, 2,0, 2,0, 3,0, 3,2, 3,2, 3,2, 2,0, 2,1, 2,0, 1,0, 3,2, 3,0, 3,0, 3,0, …]
– By coverage: [2,2,2,2,3,3,3,3, 2,2,2,1,3,3,3,3, … 3,3,3,3,3,2,2,2] [1,0,0,0,0,2,2,2, 0,1,0,0,2,0,0,0, …]
– By binary coverage: [0,0,0,0,0,0,0,0, 0,0,0,1,0,0,0,0, …] [1,1,1,1,0,0,0,0, 1,1,1,0,0,0,0,0, …] [0,0,0,0,1,1,1,1, 0,0,0,0,1,1,1,1, …] [0,1,1,1,1,0,0,0, 1,0,1,1,0,1,1,1, …] … [… 1,0,0,0,0,0,0,0]
– By data value (c,r): [4,2, 4,3, 5,3, …] [1,1, 2,1, 3,1, …] [5,1, 6,1, 7,1, …] [2,1, 3,1, 4,1, …] [1,1, 2,2, 2,3, …] [6,1, 7,1, 8,1, …]
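• The four orderings can be illustrated with a small Python
sketch. The two tiny 2 x 2 grids below are hypothetical stand-ins
for the Landuse and Roads coverages, not the grids in the figure:

# Two 2 x 2 coverages, stored row by row (hypothetical values).
landuse = [2, 2, 2, 1]
roads   = [1, 0, 0, 1]
coverages = {"landuse": landuse, "roads": roads}

# By location: all attributes for cell 0, then cell 1, ...
by_location = [v for cell in range(4) for v in (landuse[cell], roads[cell])]
# -> [2, 1, 2, 0, 2, 0, 1, 1]

# By coverage: all cells of one coverage, then all cells of the next.
by_coverage = landuse + roads

# By binary coverage: one Boolean grid per attribute value per coverage.
by_binary = {
    (name, value): [1 if v == value else 0 for v in grid]
    for name, grid in coverages.items()
    for value in sorted(set(grid))
}

# By data value: the (column, row) locations of each attribute value.
by_data_value = {
    (name, value): [(cell % 2 + 1, cell // 2 + 1)      # 1-based (c, r) in a 2 x 2 grid
                    for cell, v in enumerate(grid) if v == value]
    for name, grid in coverages.items()
    for value in sorted(set(grid))
}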
Raster Data Models (2)
• Coding method affects:
– Ease of edits.
– Storage space – binary requires more numbers, but may
require less space because each number is only 1 bit –
integers require either 8 bits (if <256) or 32 bits.
– Number of files required.
• Problems:
– Data redundancy
– Storage space excessive
Data Compaction
• Various approaches have been used to reduce storage
requirements:
– Run Length Encoding
– Block Coding
– Chain Coding
– Quadtrees
– Wavelet Compression – e.g. MrSID (Multiresolution
Seamless Image Database). This can reduce the space
required to about 2 per cent of the original. However,
wavelet compression is lossy.
Run Length Encoding
• Example: the binary coverage in the figure (100 cells) can be
stored as 26 numbers (value, run-length pairs):
0,13, 1,5, 0,5, 1,6, 0,5, 1,5, 0,6, 1,3, 0,7, 1,3, 0,7, 1,2, 0,33
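• A minimal Python sketch of run length encoding, following the
value/run-length pairing used in the example above:

def run_length_encode(cells):
    # Encode a sequence of cell values as value, run-length pairs.
    encoded = []
    run_value, run_length = cells[0], 1
    for value in cells[1:]:
        if value == run_value:
            run_length += 1
        else:
            encoded.extend([run_value, run_length])
            run_value, run_length = value, 1
    encoded.extend([run_value, run_length])
    return encoded

def run_length_decode(encoded):
    # Expand value, run-length pairs back into the original cell values.
    cells = []
    for value, length in zip(encoded[::2], encoded[1::2]):
        cells.extend([value] * length)
    return cells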
Block Coding
Chain Coding
Quadtree
• Example from the figure – encoded as: 30, 312
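• A quadtree recursively divides the grid into four quadrants and
stops wherever a quadrant is uniform. A rough Python sketch for a
square binary grid (the grid below is invented; the encoding on
the slide depends on the figure, which is not reproduced here):

def quadtree(grid, x=0, y=0, size=None):
    # Return a single value for a uniform block, otherwise a list of the
    # four quadrants (NW, NE, SW, SE), each encoded recursively.
    if size is None:
        size = len(grid)
    values = {grid[y + r][x + c] for r in range(size) for c in range(size)}
    if len(values) == 1:
        return values.pop()                          # uniform block: store one value
    half = size // 2
    return [quadtree(grid, x,        y,        half),   # NW
            quadtree(grid, x + half, y,        half),   # NE
            quadtree(grid, x,        y + half, half),   # SW
            quadtree(grid, x + half, y + half, half)]   # SE

grid = [[0, 0, 1, 1],
        [0, 0, 1, 1],
        [0, 1, 1, 1],
        [1, 1, 1, 1]]
print(quadtree(grid))    # -> [0, 1, [0, 1, 1, 1], 1]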
Vector Data Models
• Real world objects are modelled in vector mode using
geometric primitives (i.e. points, lines and polygons).
• Field data can also be modelled using isolines or TINs,
but these introduce further issues so we will ignore them
for the present.
• Features that can be modelled as points have very simple
data structures: each record can contain an x and y
coordinate, and multiple attribute fields.
x1   y1   a1   b1   c1
x2   y2   a2   b2   c2
x3   y3   …    …    …
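• In Python terms such a layer is simply a list of fixed-length
records (the coordinate and attribute values below are invented;
a, b and c are the placeholder field names from the slide):

# Each point record holds an x, y coordinate pair plus attribute fields a, b, c.
points = [
    {"x": 531250.0, "y": 181300.0, "a": "oak",  "b": 12.4, "c": 3},
    {"x": 531310.5, "y": 181275.2, "a": "ash",  "b": 9.1,  "c": 1},
    {"x": 531402.0, "y": 181330.8, "a": "lime", "b": 15.0, "c": 2},
]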
Lines And Polygons
• Lines, polylines and polygons are more complex because
each object requires more than one x,y coordinate pair.
• Also, the number of x,y coordinate pairs is variable.
• For polygons, one could check whether an x,y coordinate
pair completes a loop. However, it is safer to use a special
code to mark the end of the spatial definition.
Example record: x1  y1  …  …  xn  yn  -12345  -12345  a  b  c
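• A rough Python sketch of reading such a record, using -12345 as
the end-of-coordinates marker described above (the sample record
is invented):

END_MARKER = -12345

def parse_polygon_record(values):
    # Split a flat record into its coordinate pairs and trailing attribute values.
    coords = []
    i = 0
    while not (values[i] == END_MARKER and values[i + 1] == END_MARKER):
        coords.append((values[i], values[i + 1]))
        i += 2
    attributes = values[i + 2:]            # everything after the marker pair
    return coords, attributes

record = [10.0, 10.0, 20.0, 10.0, 20.0, 20.0, 10.0, 10.0,
          END_MARKER, END_MARKER, "arable", 4.5]
coords, attrs = parse_polygon_record(record)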
Attribute Data
• Attribute data are also more complex for lines and polygons.
• Could record the attributes for each coordinate pair, but
would create a lot of data redundancy.
• Would also be very difficult to edit.
• A common solution is to store the attribute data in a
separate file and link it to the locational data using a
relational join.
• We will explore database structures next time. For the present
we will focus on issues associated with the locational data.
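• A minimal sketch of such a join in Python (the identifiers and
attribute values are invented): each feature carries only an ID,
and its attributes are looked up in a separate table keyed on
that ID.

# Locational data: polygon id plus boundary coordinates only.
polygons = [
    {"id": 1, "coords": [(0, 0), (10, 0), (10, 10), (0, 10)]},
    {"id": 2, "coords": [(10, 0), (20, 0), (20, 10), (10, 10)]},
]

# Attribute data held in a separate table, keyed on the same id.
attributes = {
    1: {"landuse": "arable",  "area_ha": 4.2},
    2: {"landuse": "pasture", "area_ha": 3.7},
}

# Relational join: combine the two on the shared identifier.
joined = [{**poly, **attributes[poly["id"]]} for poly in polygons]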
Spaghetti Data Structures
• The visual appearance of a map could be captured by
digitising lines and polygons in a random sequence without
any additional information about which lines connect to
which, or which polygons share common boundaries.
• This is akin to 'tracing' the lines on the map using a
digitiser until they have all been digitised.
• This information could be used to reconstruct the map as it
might be drawn by a cartographer.
• Although adequate for CAD or CAC, it is inadequate for
most GIS purposes – e.g. polygon features not defined.
• Sometimes used for data distribution.
Arc/Node Structures (1)
• The DIME system developed in the 1960s was a step
forward. It was the first to use an arc/node structure.
• A node is where two or more lines join.
• An arc is a section of line running between nodes.
• Each arc is made up from straight line segments running
between adjoining points (or vertices).
Arc/Node Structures (2)
• Arc/node structures allow the data to be stored
hierarchically.
• Polygons can be defined as a series of arcs.
• Arcs can be defined as a series of segments.
• The different types of data can be stored in separate files,
linked together by pointers.
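• One way to sketch the hierarchy in Python (all identifiers and
coordinates below are invented): polygons point to arcs, arcs
point to points, and each coordinate is stored only once in the
points file.

# Points file: coordinates stored once, referenced elsewhere by point id.
points = {
    "p1": (0.0, 0.0), "p2": (5.0, 0.0), "p3": (5.0, 5.0), "p4": (0.0, 5.0),
}

# Arcs file: each arc is an ordered list of point ids (pointers into the points file).
arcs = {
    "a1": ["p1", "p2", "p3"],
    "a2": ["p3", "p4", "p1"],
}

# Polygons file: each polygon is an ordered list of arc ids.
polygons = {
    "A": ["a1", "a2"],
}

def polygon_coordinates(poly_id):
    # Resolve the pointers to recover the boundary coordinates
    # (in this simplified resolution the shared nodes appear twice).
    return [points[pid] for arc_id in polygons[poly_id] for pid in arcs[arc_id]]

# Editing is simple: moving point p3 only requires changing its entry in the points file.
points["p3"] = (5.0, 6.0)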
Arc/Node Structures (3)
• Arc/node structures provide several advantages:
• Arcs between adjoining polygons only need to be digitised
once.
– Reduces data redundancy
– Eliminates sliver lines
• Editing is simplified
– To move a point we just need to adjust its coordinates
in the points file.
– To delete a point we remove the reference to it in the
arcs file
– To add a point we add its details to the end of the points
file (no resorting) and insert a pointer at the right place
in the arcs file.
Topological Data Structures (1)
• Further refinements came in the 1980s with the
introduction of TIGER files by the US Census.
• These added explicit topological information (e.g. the
polygons on either side of an arc; the beginning and end
nodes of each arc).
Topological Data Structures (2)
• Only an arcs file is required – the polygons can be
reconstructed from the topological information.
Arc   Start   End   Left   Right
1     n1      n2    A      B
2     n2      n1    O      B
3     n1      n2    O      A
• Polygon B is made up from arcs 1 and 2. B is to the right
of both. Nodes n1 and n2 specify the sequence in which
they need to be joined.
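• A rough Python sketch of that reconstruction, assuming the arcs
bounding a polygon form a single closed ring (node coordinates
are not part of the table, so only the node sequence is returned):

# The arc table above: arc id -> (start node, end node, left polygon, right polygon).
arcs = {
    1: ("n1", "n2", "A", "B"),
    2: ("n2", "n1", "O", "B"),
    3: ("n1", "n2", "O", "A"),
}

def polygon_boundary(poly):
    # Collect the arcs with the polygon on either side, then chain them by their nodes.
    remaining = {a: (s, e) for a, (s, e, left, right) in arcs.items()
                 if poly in (left, right)}
    arc_id, (start, end) = remaining.popitem()
    sequence = [start, end]
    while remaining:
        for arc_id, (s, e) in list(remaining.items()):
            if s == sequence[-1]:
                sequence.append(e)
            elif e == sequence[-1]:
                sequence.append(s)
            else:
                continue
            del remaining[arc_id]
            break
    return sequence

print(polygon_boundary("B"))   # -> ['n2', 'n1', 'n2']: arcs 2 and 1 chained via their shared nodes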
Topological Data Structures (3)
• The topological information may be used to make
consistency checks.
• For example, the coordinates of nodes can be compared to
detect unsnapped nodes.
• If two arcs have the same nodes at both ends, the system can
check whether this is because one arc was digitised twice or
because the two arcs together form a polygon.
• Many other checks are possible.
• Data passing the checks are said to be topologically clean.
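• Two of these checks sketched in Python (the node coordinates
and tolerance below are invented): flagging distinct nodes that
are nearly but not exactly coincident, and flagging pairs of arcs
that share both end nodes.

from itertools import combinations
from math import dist

node_coords = {"n1": (0.0, 0.0), "n2": (10.0, 0.0), "n3": (0.001, 0.0)}   # hypothetical
arcs = {1: ("n1", "n2"), 2: ("n2", "n3"), 3: ("n1", "n2")}                # (start, end) nodes

# Unsnapped nodes: distinct nodes closer together than a tolerance.
TOLERANCE = 0.01
unsnapped = [(a, b) for a, b in combinations(node_coords, 2)
             if dist(node_coords[a], node_coords[b]) < TOLERANCE]

# Arcs sharing the same nodes at both ends: either a duplicate arc or a two-arc polygon.
suspect_pairs = [(i, j) for i, j in combinations(arcs, 2)
                 if {arcs[i][0], arcs[i][1]} == {arcs[j][0], arcs[j][1]}]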
Topological Data Structures (4)
• Topological structures facilitate easy editing.
• For example, to merge the two polygons A and B to form a
new polygon C, remove the record for arc 1, and substitute C
for A or B in the other records:
Arc   Start   End   Left   Right
2     n2      n1    O      C
3     n1      n2    O      C
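• The same edit expressed as a Python sketch on the arc table
above:

# Arc table before the merge: arc id -> (start, end, left, right).
arcs = {
    1: ("n1", "n2", "A", "B"),
    2: ("n2", "n1", "O", "B"),
    3: ("n1", "n2", "O", "A"),
}

def merge_polygons(arcs, poly_a, poly_b, new_poly):
    # Drop arcs separating the two polygons and relabel the rest with the new polygon.
    merged = {}
    for arc_id, (start, end, left, right) in arcs.items():
        if {left, right} == {poly_a, poly_b}:
            continue                                 # the shared boundary (arc 1) is removed
        left = new_poly if left in (poly_a, poly_b) else left
        right = new_poly if right in (poly_a, poly_b) else right
        merged[arc_id] = (start, end, left, right)
    return merged

print(merge_polygons(arcs, "A", "B", "C"))
# -> {2: ('n2', 'n1', 'O', 'C'), 3: ('n1', 'n2', 'O', 'C')}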
Space Considerations
• Vector models generally require less space than raster
models, but space may still be a consideration.
• Each X and Y coordinate generally requires 2 bytes (more
if they are larger than 65535).
• Can reduce using relative addressing – i.e. express as offset
from a local origin.
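• A sketch of relative addressing in Python (the coordinates and
byte sizes are illustrative assumptions): store the local origin
in full once, then a 2-byte offset for each subsequent vertex.

import struct

def pack_relative(coords):
    # First vertex stored in full as the local origin (two 8-byte values),
    # then each subsequent vertex as a pair of 2-byte offsets from that origin
    # (assumes the offsets fit in a signed 16-bit integer).
    origin_x, origin_y = coords[0]
    packed = struct.pack("<dd", origin_x, origin_y)
    for x, y in coords[1:]:
        packed += struct.pack("<hh", int(x - origin_x), int(y - origin_y))
    return packed

coords = [(531250, 181300), (531260, 181310), (531275, 181295)]   # hypothetical vertices
packed = pack_relative(coords)
# 16 bytes for the origin plus 4 bytes per additional vertex,
# instead of storing every coordinate pair in full.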