Indexing a Sphere Jim Gray, Microsoft Alex Szalay, Johns Hopkins University + Gyorgy Fekete, Maria Nieto Santisteban (JHU) April 2005

Download Report

Transcript Indexing a Sphere Jim Gray, Microsoft Alex Szalay, Johns Hopkins University + Gyorgy Fekete, Maria Nieto Santisteban (JHU) April 2005

Indexing a Sphere
Jim Gray, Microsoft
Alex Szalay, Johns Hopkins University
+ Gyorgy Fekete, Maria Nieto Santisteban (JHU)
April 2005
3 Ways We Do Spatial?
• Hierarchical mesh (extension to SQL)
– Uses table valued stored procedures
– Acts as a new “spatial access method”
– Ported to Yukon CLR for a 3x speedup.
• Zones: fits SQL well
– Surprisingly simple & good.
• Constraints: a novel idea
– Lets us do algebra on regions.
• Paper:
There Goes the Neighborhood: Relational Algebra for Spatial Data Search
Sky Points and Regions
Resources and Footprints
• Sky coverage of surveys
• Spatial pattern of observations
• Regions with bad data (masks)
Data Management
• Organize objects by sky regions into buckets
• Density of objects highly variable on the sky
– Need a hierarchical partitioning
• Need to map to (btree) index in databases
• Search database by region
Typical Questions
•
•
•
•
•
•
•
•
•
Is this point in the survey?
Give me all objects in this region
What surveys covers this part of the sky?
What is the common area of these surveys?
Give me all “good” objects (not in bad areas)
Cross-match these two catalogs
Give me cumulative counts (Histogram) over area
Compute fast spherical transforms of densities
Interpolate sparsely sample functions
(extinction maps, dust temperature, …)
Typical Calls
-- find objects within 1 arcminute of (60,20)
select objID, ra, dec
from PhotoObj
as p,
Coarse
fHtmCover(60,20,1) as triangle
distanc
where p.htmID between triangle.startHtmID
e test
and triangle.endHtmID
and <geometry test on (ra,dec) – (60,20) < 1 arcmin>
-- or better yet
select objID, ra, dec, distance
from dbo.fGetNearbyObjEq(60,20,1)
careful distance
test rejects
false positives
-- Also, Cross Apply is VERY useful
select objID, count(*)
from PhotoObj p cross apply dbo.fGetNearbyObjEq(p.ra, p.dec, 1)
Preliminaries:
Equations Define Subspaces
y
• For (x,y) above the line
ax+by > c
• Reverse the space by
-ax + -by > -c
• Intersect a 3 volumes:
a1x + b1y > c1
a2x + b2y > c2
a3x + b3y > c3
x=c/a
x
y=c/b
y
x
Preliminaries
• Spherical geometry is really hard
– Distance
acos((sin(radians(lat1))sin(radians(lat2)))+(cos(radians(lat1)))cos(radians(lat2))cos(radians(abs(lon1-lon2)))))
– Area
forget it!
• But, astronomers have it easy
– The geoid is not a sphere!
• A trick: do everything in 3-space with planes
Escape From Surface of Sphere to 3D
Avoids Spherical Geometry
• Circle C defined by plane intersect sphere.
Plane is cx,cy,cz unit vector
& length = cos(radians(radius))
• Point P = (px,py,pz)
Cos(r)
Inside circle C if
P•C > length
no transcendental functions
– just 3 multiplies and a compare.
• Arbitrary polygons
are intersections of these regions.
Project Sphere onto Octahedron
Szalay, Kunszt, Brunner
• Every point
belongs to one of 8 faces
one of 8 triangles
Recursively Divide Triangles
(into 4 sub-triangles)
1,2,2
1,2,1
1,2
1,2,0 1,2,3
1,1,0 1,1,3
1,1
1
1,0
1,3
1,0,2 1,1,1 1,3,2
1,0,1 1,1,2 1,3,1
1,0,0 1,0,3
1,3,0 1,3,3
• This is a space-filling curve
Hausdorf, Hilbert, Peano, Z-transform…
Space-Filling Curve:
Each Triangle Maps to The Real Line [0..1]
[0.12,0.13)
[0.123,0.130) [0.120,0.121) [0.121,0.122) [0.122,0.123) [0.123,0.130)
1,2,2
[0.120,0.121) [0.121,0.122)
1,2,1
So triangles correspond to ranges
All points inside the triangle are inside the range.
1,2,0
1,2,3
1,1,0
1,1,3
1,0,2
1,1,1
1,3,2
1,0,1
1,1,2
1,3,1
1,0,0
1,0,3
1,3,0
1,3,3
How Know if
point is inside triangle?
• Answer:
– Triangle sides are 3 great circles C1, C2, C3
– So (P•C1 > L1) & (P•C2 > L2) & (P•C3 > L3)
Given an area (spherical polygon)
what triangles cover it?
• Answer:
– Recursive descent
– Triangle overlaps if any corner is inside polygon
and/or any triangle edge intersects polygon.
– Return list of start-stop pairs.
HTM Library computes triangles overlapping an area.
Find Points Inside Area
• fGetNearbyObjEq(ra,dec,r)
fGetNearbyObjXyz(x,y,z,r)
fGetNearestObjEq(ra,dec,r)
fGetNearestObjEq(x,y,z,r)
• fGetObjInside(areaSpec) -- string area description
• Alex added Healpix,
Igloo is also a nice iso-area decomposition
Digression: How does Search
Work?
• Coarse filter (index) reduces search space
– Btree, QuadTree, HashTable, Bitmap, R*tree, …
• Careful test discards false positives
true
• Efficiency:
true  false
True + false
Careful
test
True
Digression
What is an Index?
•
•
•
•
New organization of the data
Quick way to go from attribute to values
A [alt key] → [primary key] map
HTM index is
a quick (and clustered) way
to go from HTM triangle ID (range)
to all points in the triangle.
[area] → {points}
Digression
B-Trees are Good at Range Lookups
If you can tell me the triangles
T (start, stop)
covering an area:
select *
from points P join triangles T
on P.HTM_ID between T.start and t.stop
Returns all points inside the triangles
B-Trees are Good at Range Lookups
If you can tell me the triangles
T (start, stop)
covering an area:
select *
from points P join triangles dbo.fHTMcover(@ra,@dec,@r) T
on P.HTM_ID between T.start and t.stop
Returns all points inside the triangles
Triangles
Start Stop
Nested
Loops
Join
HTM
Index
points
We are Done:
using System;
using System.Collections;
using System.Data;
using System.Data.Sql;
using System.Data.SqlTypes;
namespace xphtm{
public partial class HTM {
[SqlFunction(IsDeterministic = true, IsPrecise = true) ]
public static long HTMlookup(int level, double RA, double dec) {
double
x, y, z;
HtmTrixel ht = new HtmTrixel();
SpatialVector.radec2cartesian(RA, dec, out x, out y, out z);
return ht.cartesian2HID(x, y, z, level);
}
[SqlFunctionAttribute(FillRowMethodName = "FillCoverRow"
,TableDefinition = "bigint HTMID_start, bigint HTMID_end") ]
public static IEnumerable fHTMcover(float ra, float dec, float radius) {
ArrayList rowSetAnswer
= new ArrayList();
Int64[,] arrayAnswer = null;
arrayAnswer = HtmCover.Circle(ra,dec,radius);
int i = 0;
for (int i = 0; i < (arrayAnswer.Length / 2); i++) {
CoverRow rp = new CoverRow(arrayAnswer[i,0], arrayAnswer[i, 1]);
rowSetAnswer.Add(rp); }
return (IEnumerable) rowSetAnswer;
}
public static void FillCoverRow(object source,
out long HTMIDstart, out long HTMIDend) {
CoverRow row = (CoverRow)source;
HTMIDstart = row._HTMIDstart;
HTMIDend
= row._HTMIDend;
}
public class CoverRow {
public long _HTMIDstart;
public long _HTMIDend;
public CoverRow(long HTMIDstart, long HTMIDend) {
_HTMIDstart = HTMIDstart;
_HTMIDend
= HTMIDend;
} } } }
Returns HTM ID of a point
Returns table of
HTM triangles
SQL wrapper
History/Performance
• HTM has ~½ false positive on circle (all sizes)
• Pre-Yukon:
– xp wrapper >500 lines C++
– Table linkage expensive (~400 μs), cover ~1.3ms
• Yukon
– Code MUCH simpler.
– Scalar linkage: 18 μs, table linkage 80μs
– HTM: lookup: 30 μs, cover: 500 μs
μs/call
Linkage costs:
(HTM is
400 μs more)
SQL 2k
SQL 2k TSQL
SQL 05 C#
elapsed
cpu
elapsed
cpu
EchoInteger ()
16
15
18
14
EchoStringT()
17
15
18
16
TVF(1)
344
321
82
78
TVF(10)
599
565
91
88
TVF(100)
6,507
2,868
191
181
TVF(1000)
68,678
26,256
1,113
1,076
Poster Child App
• SkyServer lookups
– Points near point
– Points in polygon
– Polygons overlap polygon
• 2,000 lookups/second/cpu
• Can add 10,000 points
/second/cpu
HTM Library
• Current implementation in C#,
public domain.
• automatic code generation for C++ and Java
• A DB2 port (ROE), 2MASS using with Illustra
• Used at over 20 places in the world
• 15 SkyQuery surveys have HTM
• Easy to add HTM to your data
• In the VO, the ADQL SkyNode
supports the spatial language
http://www.sdss.jhu.edu/htm/
Summary of HTM
• Good for Points in area.
• Efficiency comes from going to 3-space.
• Uses table-valued function as an index
generator.
• New approach to bounding boxes.
• Nested-loop join with function on outside.
• Not good for area-overlaps-area
(but can simplify areas and test for empty)
To Repeat (for area algebra)
Equations Define Subspaces
y
• For (x,y) above the line
ax+by > c
• Reverse the space by
-ax + -by > -c
• Intersect a 3 volumes:
a1x + b1y > c1
a2x + b2y > c2
a3x + b3y > c3
x=c/a
x
y=c/b
y
x
Domain is Union of Convex Areas
Not a
convex
• Regions are unions of
convex areas
• Higher order curves also
work
• Complex areas have holes
and their holes have holes.
(that is harder).
Holes are harder
+
It is the union of two convexs
Now in Relational Terms
create table HalfSpace (
domainID
int not null
-- domain name
foreign key references Domain(domainID),
convexID int not null,
-- grouping a set of ½ spaces
halfSpaceID int identity(), -- a particular ½ space
x
float not null,
-- the (a,b,..) parameters
y
float not null,
-- defining the ½ space
z
float not null,
c
float not null,
-- the constant (“c” above)
primary key (domainID, convexID, halfSpaceID)
(x,y,z) inside a convex if it is inside all
lines of the convex
(x,y,z) inside a convex if it is NOT OUTSIDE ANY line of the convex
select convexID
from HalfSpace
where @x * x + @y * y + @x * z <
group by all convexID
having count(*) = 0
-- return the convex
-- from the constraints
c -- point outside the line?
-- consider all the lines of a convexID
-- count outside == 0
The Algebra is Simple (Boolean)
@domainID = spDomainNew (@type varchar(16), @comment varchar(8000))
@convexID = spDomainNewConvex (@domainID int)
@halfSpaceID = spDomainNewConvexConstraint (@domainID int, @convexID int,
@x float, @y float, @z float, @c float)
@returnCode = spDomainDrop(@domainID)
select * from fDomainsContainPoint(@x float, @y float, @z float)
Once constructed they can be manipulated with the Boolean operations.
@domainID = spDomainOr (@domainID1 int, @domainID2 int,
@type varchar(16), @comment varchar(8000))
@domainID = spDomainAnd (@domainID1 int, @domainID2 int,
@type varchar(16), @comment varchar(8000))
@domainID = spDomainNot (@domainID1 int,
@type varchar(16), @comment varchar(8000))
Defining Areas (Regions)
• Circle: Center point + radius
Rectangle: Corner points of square
Polygon: sequence corner points or edges
Region: union of convexes
• As with all spatial, many coordinate systems
(equatorial, Cartesian, galactic, …)
• Have a linear syntax:
– CIRCLE J2000 RA DEC radius
– CONVEX CARTESIAN 1 0 0 0 1 0 0 0 1
…
• Have advanced XML syntax for space-time.
Spec: http://www.ivoa.net/Documents/latest/STC-X.html
Person: Arnold Rots http://hea-www.harvard.edu/~arots/
• Similar to GIS, but better suited for astronomy
String (easy) Areas
(or struct in C# world)
circleSpec
:=
|
rectSpec
:=
polySpec
:=
|
convexSpec :=
regionSpec :=
areaSpec
:=
|
CIRCLE J2000 ra dec radArcMin
CIRCLE CARTESIAN x y z radArcMin
RECT J2000 {ra dec}2
POLY J2000 {ra dec}3+
POLY CARTESIAN { x y z }3+
CONVEX { x y z d}+
REGION { convexSpec }+
circleSpec | rectSpec
polySpec | regionSpec
Regions
• Easy to compare regions.
• Easy to combine regions
(union, difference, intersection, buffer,…)
• Easy to find points in region
• Easy to find regions containing points
• Easy to find regions overlap region.
Some Details
• HTM library used to simplify (detect nulls)
• Compute (in TSQL now, C# later):
– Corner points of region
– Area of region
– Bounding circle of region
• Center point + bounding circle
is spatial index.
• See: “Simplify a Convex” Szalay/Gray.
3
a
b
d
2
1
c
What! No Bounding Box?
• Bounding box limits search.
A subset of the convexs .
• If query runs at 3M halfspace/sec then no
need for bounding box,
unless you have more than 10,000 lines.
• But, if you have a lot of half-spaces
then bounding box is good.
• We have bounding circles.
Things Can Get Complex
A
B
A
B
Green area: A  (B- ε) should find B if it contains an A and not masked
Yellow area: A  (B±ε) is an edge case may find B if it contains an A.
This is convex!
Poster Child App
• Used as footprint
service.
• Take many footprints
• Fuzz them (buffer) to
make coarser footprint
convex hull
• See if two footprints
overlap
• ~20 lines of code
+130 lines of logic/comments
Zones:
Comparing Massive Point Datasets
• Want to “cross match” 109 objects with
another 109 objects
• Using the HTM code, that is a LOT of calls
(~106 seconds (~ a year))
• Want to do it in an hour.
Zone Approach
• Divide space into zones
• Key points by Zone, offset
(on the sphere this need wrap-around margin.)
• Point search
look in a few zones
at a limited offset: ra ± r
a bounding box that has
1-π/4 false positives
• All inside the relational engine
• Avoids “impedance mismatch”
• Can “batch” all-all comparisons
r
• faster and 60,000x parallel
1 hours, not 6 months!
√(r +(ra-zoneMax) )
cos(radians(zoneMax))
• This is Maria
Nieto Santisteban’s PhD thesis
2
ra-zoneMax
x
2
zoneMax
Ra ± x
In SQL: points near point
Ignoring some spherical geometry details, see paper.
select o1.objID
-- find objects
from zone o1
-- in the zoned table
where o1.zoneID between
-- where zone #
floor((@dec-@r)/@zoneHeight) and -- overlaps the circle
ceiling((@dec+@r)/@zoneHeight)
and o1.ra between @ra - @r and @ra + @r -- quick filter on ra
and o1.dec between @dec-@r and @dec+@r -- quick filter on dec
Bounding
box
and ( (sqrt( power(o1.cx-@cx,2)+power(o1.cy-@cy,2)+power(o1.cz-@cz,2))))
< @r
-- careful filter on distance
Eliminates the
~ 21% = 1-π/4
False positives
Competes with HTM (faster because no xp linkage), but ….
What about comparing 1 B points with 1B points, cross match.
Still too slow.
Quantitative Evaluation:
7x faster than external stored proc:
(linkage is expensive)
time vs. radius for
neighbors function
@ various zone heights.
Any small zone height is
adequate.
time vs. best time
@ various radius.
A zoneHeight of 4” is nearoptimal
2.00
1000
1.90
Rows vs elapsed time
fit is 1.46+2.2e-4*r^2 ms/asec
1.80
1.70
time (sec)
10
time vs best
7.5 asec
15 asec
30 asec
1 amin
2 amin
4 amin
64 amin
r^2 fit
100
Relative time vs zone height (asec)
4 minute zone is near optimal
2 & 8 minute are slower
7 asec
15 asec
30 asec
1 amin
2 amin
4 amin
16 amin
1 degree
1.60
1.50
1.40
1.30
1.20
1.10
1.00
1
10
100
r (asec)
1000
10
100
r (asec)
1000
All Neighbors of All points
(can Batch Process the Joins)
• A 5x additional speedup (35x in total)
for @deltaZone in {-1, 0, 1}
example ignores some spherical geometry details in paper
insert neighbors
-- insert one zone's neighbors
select o1.objID as objID,
-- object pairs
o2.objID as NeighborObjID,
.. other fields elided
from zone o1 join zone o2
-- join 2 zones
on o1.zoneID-@deltaZone = o2.zoneID -- using zone number and ra
and o2.ra between o1.ra - @r and o1.ra + @r -- points near ra
where
-- elided margin logic, see paper.
and o2.dec between o1.dec-@r and o1.dec+@r -- quick filter on dec
and sqrt(power(o1.x-o2.x,2)+power(o1.y-o2.y,2)+power(o1.z-o2.z,2))
< @r
-- careful filter on distance
Many other speedups, and then there is parallelism.
Poster Child Apps
• CrossMatch SDSS with itself (250M obj)
Producing neighbors within 30 arcseconds
– Was 4,000 hours (estimated)
– Now 2 hours
• MaxBCG (cross-match SDSS to find clusters)
– 9,167 cpu hours with “files” (normalize to 2,300 hrs?)
– 240 cpu hours, 44 elapsed hours with zones
– See “DB meets Grid” paper
• Crossmatch
– USNO (US Naval Observatory Catalog): 1.2 B obj
– 2MASS (2 meter all sky survey): 0.5B objects
More Poster Child
USNO-2Mass CrossMatch
• Sequential: first try, 22 hours
sequential 22.2 hours
zones,
5.4, 24%
match,
16.8, 76%
• Optimize code: 6.25 hours
s e que nt ia l 6 :17 ho urs
i nde x ,
0:28, 7%
• Note:
– data on separate machine
– part of zone time is “pull” of data
ma t c h,
3:01, 46%
z o n e , 3:05,
47%
Zones allow 60,000 Parallel Jobs
Partition Parallelism: 3.7 hours
Source Tables
2MASS
471 Mrec
140 GB
Zoned Tables
2MASS
471 Mrec
65 GB
2MASS:USNOB
Zone:Zone
Comparison
0:-1
0:0
USNOB
1.1 Brec
233 GB
USNOB
1.1 Brec
106 GB
0:+1
Merge Build Index
Answer
2MASS→USNOB
64 Mrec
2 GB
350 Mrec
12 GB
260 Mrec
9 GB
26 Mrec
1 GB
350 Mrec
12 GB
USNOB→2MASS
2 hours
1.2 hour
.5 hour
Pipeline Parallelism: 2.5 hours
Or… as fast as we can read USNOB + .5 hours
Source Tables
2MASS
471 Mrec
140 GB
Zones
2MASS:USNOB
Zone:Zone
Comparison
0:-1
Merge Build Index
Answer
2MASS→USNOB
64 Mrec
2 GB
350 Mrec
12 GB
Next zone
0:0
USNOB
1.1 Brec
233 GB
260 Mrec
9 GB
Next zone
0:+1
26 Mrec
1 GB
350 Mrec
12 GB
USNOB→2MASS
2 hours
.5 hour
Spatial Stuff Summary
• Easy
– Point in polygon
– Polygons containing points
– (instance and batch)
• Works in higher dimensions
• Side note: Spherical polygons are
– hard in 2-space
– Easy in 3-space
Spatial Stuff Summary
• Constraint databases are “cool”
– Streams (data is query, query is in DB)
– Notification: subscription in DB, data is query
– Spatial: constraints in DB, data is query
• You can express constraints as rows
• Then You
– Can evaluate LOTS of predicates per second
– Can do set algebra on the predicates.
• Benefits from SQL parallelism
• SQL == Prolog // DataLog? 
In SQL
select o1.objID
from zone o1
where o1.zoneID between
floor((@dec-@r)/@zoneHeight) and
floor((@dec+@r)/@zoneHeight)
and o1.ra between @ra - @r and @ra + @r
and o1.dec between @dec-@r and @dec+@r
-- find objects
-- in the zoned table
-- where zone #
-- overlaps the circle
-- quick filter on ra
-- quick filter on dec
Bounding
box
and ( (sqrt( power(o1.cx-@cx,2)+power(o1.cy-@cy,2)+power(o1.cz-@cz,2))))
< @r
-- careful filter on distance
Eliminates the
~ 21% = 1-π/4
False positives
References
\\barcfs\public\Gray\Spatial
There Goes the Neighborhood: Relational Algebra for Spatial Data Search,
J. Gray; A. S. Szalay; G. Fekete; M. A. Nieto-Santisteban; W. O'Mullane; A. R. Thakar; G. Heber; A. H. Rots
http://research.microsoft.com/research/pubs/view.aspx?tr_id=736
Roots of a Convex, Alex Szalay, Jim Gray, March 2005
Representing Polygon Areas and Testing Point-in-Polygon Containment
in a Relational Database Jim Gray, Alex Szalay
spRegion.SQL Jim Gray, Alex Szalay
SQL2005 CTP Table-Valued Function Performance vs SQL2000,
Jim Gray, Peter Kukol
Preliminary Study of SQL2005 C# HTM procedures
Jim Gray, Peter Kukol, Gyorgy Fekete, Alex Szalay
Computing the Match Table Jim Gray, Alex Szalay, Robert Lupton, Jeff Munn
When Database Systems Meet the Grid
M. A. Nieto-Santisteban, J. Gray, A. S. Szalay, J. Annis, A. R. Thakar, W. J. O’Mullane
Creating Sectors for SkyServer A. Szalay, G. Fekete, T. Budavari, J. Gray, A. Pope, A. Thakar
Partitioning the Sky for Spatial Data Tasks Jim Gray, Maria Nieto Santisteban, Alex Szalay
Spatial Clustering of Galaxies in Large Datasets
A. S. Szalay, T. Budavari, A. Connolly, J. Gray, T. Matsubara, A. Pope, I. Szapudi