Parallel Processing, Part 4


Hypercube
Fall 2010
Parallel Processing, Low-Diameter Architectures
Slide 1
About This Presentation
This presentation was originally prepared by Behrooz Parhami for the
textbook Introduction to Parallel Processing: Algorithms and
Architectures (Plenum Press, 1999, ISBN 0-306-45970-1).
Slide 2
Low-Diameter Architectures
Study the hypercube and related interconnection schemes:
• Prime example of low-diameter (logarithmic) networks
• Theoretical properties, realizability, and scalability
• Complete our view of the “sea of interconnection nets”
Slide 3
Hypercubes and Their Algorithms
Study the hypercube and its topological/algorithmic properties:
• Develop simple hypercube algorithms (more in Ch. 14)
• Learn about embeddings and their usefulness
Topics in This Lecture
13.1 Definition and Main Properties
13.2 Embeddings and Their Usefulness
13.3 Embedding of Arrays and Trees
14.5 Broadcasting on a Hypercube
14.6 Adaptive Routing in Hypercubes
15 Other Hypercubic Architectures
Slide 4
13.1 Definition and Main Properties
Begin studying networks that are intermediate between the diameter-1 complete network and the diameter-√p mesh.

Intermediate architectures: logarithmic or sublogarithmic diameter.

Sublogarithmic diameter:   1 (complete network);   2 (PDN);   log n / log log n (star, pancake);   log n (binary tree, hypercube)
Superlogarithmic diameter:   √n (torus);   n/2 (ring);   n – 1 (linear array)
Slide 5
Hypercube and Its History
Binary tree has logarithmic diameter, but small bisection
Hypercube has a much larger bisection
Hypercube is a mesh with the maximum possible number of dimensions
222 ... 2
 q = log2 p 
We saw that increasing the number of dimensions made it harder to
design and visualize algorithms for the mesh
Oddly, at the extreme of log2 p dimensions, things become simple again!
Brief history of the hypercube (binary q-cube) architecture
Concept developed: early 1960s [Squi63]
Direct (single-stage) and indirect (multistage) versions: mid 1970s
Initial proposals [Peas77], [Sull77] included no hardware
Caltech’s 64-node Cosmic Cube: early 1980s [Seit85]
Introduced an elegant solution to routing (wormhole switching)
Several commercial machines: mid to late 1980s
Intel iPSC (Intel Personal SuperComputer), CM-2, nCUBE (Section 22.3)
Slide 6
Basic Definitions
Hypercube is the generic term; 3-cube, 4-cube, . . . , q-cube in specific cases.

Parameters:
p = 2^q
B = p/2 = 2^{q–1}
D = q = log2 p
d = q = log2 p

Fig. 13.1  The recursive structure of binary hypercubes:
(a) Binary 1-cube, built of two binary 0-cubes, labeled 0 and 1
(b) Binary 2-cube, built of two binary 1-cubes, labeled 0 and 1
(c) Binary 3-cube, built of two binary 2-cubes, labeled 0 and 1
(d) Binary 4-cube, built of two binary 3-cubes, labeled 0 and 1
Slide 7
The 64-Node Hypercube
Only sample wraparound links are shown to avoid clutter.
Isomorphic to the 4 × 4 × 4 3D torus (each has 64 × 6/2 links).
Slide 8
Neighbors of a Node in a Hypercube

x_{q–1} x_{q–2} . . . x_2 x_1 x_0      ID of node x
x_{q–1} x_{q–2} . . . x_2 x_1 x_0'     dimension-0 neighbor; N_0(x)
x_{q–1} x_{q–2} . . . x_2 x_1' x_0     dimension-1 neighbor; N_1(x)
  .
  .
x_{q–1}' x_{q–2} . . . x_2 x_1 x_0     dimension-(q – 1) neighbor; N_{q–1}(x)
(A prime marks the single complemented bit: the q neighbors of node x are obtained by complementing one bit of its label, one neighbor per dimension.)

Nodes whose labels differ in k bits (i.e., are at Hamming distance k) are connected by a shortest path of length k.

The hypercube is both node- and edge-symmetric.
Strengths: symmetry, logarithmic diameter, and linear bisection width.
Weakness: poor scalability.
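As a concrete illustration (a minimal sketch, not taken from the slides), with node labels stored as integers the dimension-k neighbor is obtained by XORing the label with 2^k, and the shortest-path length between two nodes is the Hamming distance of their labels:

def neighbors(x: int, q: int) -> list[int]:
    """Return [N_0(x), N_1(x), ..., N_{q-1}(x)] for node x of the q-cube."""
    return [x ^ (1 << k) for k in range(q)]

def hamming_distance(x: int, y: int) -> int:
    """Length of a shortest x-to-y path = number of differing label bits."""
    return bin(x ^ y).count("1")

# Example on a 4-cube: the four neighbors of node 0110, and a distance check.
assert neighbors(0b0110, 4) == [0b0111, 0b0100, 0b0010, 0b1110]
assert hamming_distance(0b0110, 0b1011) == 3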
Slide 9
13.2 Embeddings and Their Usefulness
Fig. 13.2  Embedding a seven-node binary tree into 2D meshes of various sizes. The three example embeddings have:
• Dilation = 1, Congestion = 1, Load factor = 1
• Dilation = 2, Congestion = 2, Load factor = 1
• Dilation = 1, Congestion = 2, Load factor = 2

Expansion: ratio of the number of nodes (9/7, 8/7, and 4/7 here)
Dilation: longest path onto which an edge is mapped (routing slowdown)
Congestion: max number of edges mapped onto one edge (contention slowdown)
Load factor: max number of nodes mapped onto one node (processing slowdown)
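These measures can be computed mechanically once an embedding is fixed. Here is a small Python sketch (the function name and data layout are my own assumptions, not from the text): given the guest-to-host node map and the host path chosen for each guest edge, it reports dilation, congestion, and load factor as defined above.

from collections import Counter

def embedding_metrics(node_map, edge_paths):
    """node_map: guest node -> host node.
       edge_paths: for each guest edge, the list of host nodes forming the
       path onto which that edge is routed."""
    dilation = max(len(path) - 1 for path in edge_paths.values())
    congestion = max(Counter(
        frozenset(pair) for path in edge_paths.values()
        for pair in zip(path, path[1:])
    ).values())
    load_factor = max(Counter(node_map.values()).values())
    return dilation, congestion, load_factor

# Toy example: guest nodes a, b, c mapped onto host nodes 0, 1, 1; guest edges
# (a,b) and (b,c) routed onto host paths [0, 1] and [1].
print(embedding_metrics({"a": 0, "b": 1, "c": 1},
                        {("a", "b"): [0, 1], ("b", "c"): [1]}))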
Slide 10
13.3 Embedding of Arrays and Trees
Fig. 13.3  Hamiltonian cycle in the q-cube: traverse (q – 1)-cube 0 following a (q – 1)-bit Gray code (0 000...000, 0 000...001, 0 000...011, . . . , 0 100...000), cross a dimension-(q – 1) link into (q – 1)-cube 1, traverse it following the (q – 1)-bit Gray code in reverse (1 100...000, . . . , 1 000...011, 1 000...010, 1 000...000), and close the cycle with another dimension-(q – 1) link. (The figure marks a generic edge from x to N_k(x) in cube 0 and its mirror image from N_{q–1}(x) to N_{q–1}(N_k(x)) in cube 1.)

Alternate inductive proof: Hamiltonicity of the q-cube is equivalent to the existence of a q-bit Gray code.
Basis: a q-bit Gray code beginning with the all-0s codeword and ending with 10^{q–1} exists for q = 2: 00, 01, 11, 10
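For reference, a small sketch of the standard construction (not code from the book): the reflected binary Gray code lists all 2^q labels so that consecutive labels, including the wraparound pair, differ in a single bit, which is exactly a Hamiltonian cycle, i.e., a 2^q-node ring, in the q-cube.

def gray_code(q: int) -> list[int]:
    """Reflected binary Gray code on q bits: the i-th codeword is i XOR (i >> 1)."""
    return [i ^ (i >> 1) for i in range(2 ** q)]

cycle = gray_code(3)                       # [0, 1, 3, 2, 6, 7, 5, 4]
# Consecutive codewords (including the wraparound 100 -> 000) differ in 1 bit,
# and the sequence starts at 000 and ends at 100 = 10^{q-1}, as in the proof.
for a, b in zip(cycle, cycle[1:] + cycle[:1]):
    assert bin(a ^ b).count("1") == 1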
Slide 11
Mesh/Torus Embedding in a Hypercube
Fig. 13.5  The 4 × 4 mesh/torus is a subgraph of the 4-cube.

Is a mesh or torus a subgraph of the hypercube of the same size?
We prove this to be the case for a torus (and thus for a mesh).
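For the 4 × 4 case the mapping can be written down directly (a minimal sketch with an assumed helper name): Gray-code the row and column indices, so that every torus edge joins two 4-bit labels at Hamming distance 1 and is therefore also a 4-cube edge.

GRAY2 = [0b00, 0b01, 0b11, 0b10]       # 2-bit reflected Gray code

def torus_to_cube(i: int, j: int) -> int:
    """Map torus node (row i, column j), 0 <= i, j <= 3, to a 4-bit cube label."""
    return (GRAY2[i] << 2) | GRAY2[j]

for i in range(4):
    for j in range(4):
        x = torus_to_cube(i, j)
        for ni, nj in [((i + 1) % 4, j), (i, (j + 1) % 4)]:   # torus neighbors
            y = torus_to_cube(ni, nj)
            assert bin(x ^ y).count("1") == 1                 # also cube neighbors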
Slide 12
Torus is a Subgraph of Same-Size Hypercube
A tool used in our proof, the product graph G1 × G2:
• Has n1 × n2 nodes
• Each node is labeled by a pair of labels, one from each component graph
• Two nodes are connected if their labels agree in one component and the other components were connected in the corresponding component graph

Fig. 13.4  Examples of product graphs; e.g., the product of a 3-node ring (nodes 0, 1, 2) and a single-edge graph (nodes a, b) is the 3-by-2 torus with nodes 0a, 0b, 1a, 1b, 2a, 2b.

The 2^a × 2^b × 2^c × . . . torus is the product of 2^a-, 2^b-, 2^c-, . . . node rings
The (a + b + c + . . . )-cube is the product of the a-cube, the b-cube, the c-cube, . . .
The 2^q-node ring is a subgraph of the q-cube (Hamiltonian cycle, Section 13.3)
If a set of component graphs are subgraphs of another set, the product graphs will have the same relationship; hence the torus above is a subgraph of the same-size hypercube
Slide 13
Embedding Trees in the Hypercube
The (2^q – 1)-node complete binary tree is not a subgraph of the q-cube.

Proof by contradiction, based on the parity of node label weights (number of 1s in the labels): label weights must alternate in parity from one tree level to the next (say, even for the root, odd for its children, even for the grandchildren, and so on), so the tree levels sharing the leaves' parity would need 2^{q–1} + 2^{q–3} + . . . nodes of one parity, more than the 2^{q–1} nodes of that parity available in the q-cube.

The 2^q-node double-rooted complete binary tree is a subgraph of the q-cube.

Fig. 13.6  The 2^q-node double-rooted complete binary tree is a subgraph of the q-cube: it is assembled from a double-rooted tree in (q – 1)-cube 0 and one in (q – 1)-cube 1; the figure labels the nodes involved x, N_a(x), N_b(x), N_c(x) ("new roots") and their counterparts N_b(N_c(x)), N_c(N_b(x)), N_c(N_a(x)), N_a(N_c(x)) in the other subcube.
Slide 14
14.5 Broadcasting on a Hypercube
Flooding: applicable to any network with all-port communication. For a 5-cube flooded from node 00000:
Source node:          00000
Neighbors of source:  00001, 00010, 00100, 01000, 10000
Distance-2 nodes:     00011, 00101, 01001, 10001, 00110, 01010, 10010, 01100, 10100, 11000
Distance-3 nodes:     00111, 01011, 10011, 01101, 10101, 11001, 01110, 10110, 11010, 11100
Distance-4 nodes:     01111, 10111, 11011, 11101, 11110
Distance-5 node:      11111

Binomial broadcast tree with single-port communication:
Fig. 14.9  The binomial broadcast tree for a 5-cube: in each time step, every node that already holds the message forwards it across the next dimension (00000 → 10000 in step 1; 00000 → 01000 and 10000 → 11000 in step 2; and so on), so all 32 nodes are reached in 5 time steps.
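The binomial-tree broadcast is easy to express as recursive doubling. A minimal simulation sketch (dimension order chosen low-to-high here, one of several equivalent choices; the names are mine): in step k every node that already holds the message forwards it across dimension k, so all 2^q nodes are reached after q steps.

def binomial_broadcast(q: int, source: int = 0):
    """Simulate single-port binomial-tree broadcast in a q-cube.
       Returns the list of (step, sender, receiver) transmissions."""
    have = {source}
    schedule = []
    for k in range(q):
        schedule.extend((k, x, x ^ (1 << k)) for x in sorted(have))
        have |= {x ^ (1 << k) for x in have}      # recursive doubling
    assert have == set(range(2 ** q))             # everyone reached in q steps
    return schedule

# For q = 5 this reproduces a binomial broadcast tree like the one in Fig. 14.9.
print(binomial_broadcast(5)[:3])                  # first few sends from node 00000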
Slide 15
Hypercube Broadcasting Algorithms

Fig. 14.10  Three hypercube broadcasting schemes as performed on a 4-cube, for a message divided into pieces A, B, C, D, over time steps 0 through 5:
• Binomial-tree scheme (nonpipelined)
• Pipelined binomial-tree scheme
• Johnsson & Ho's method
(To avoid clutter, only piece A is shown in part of the figure.)
Slide 16
14.6 Adaptive and Fault-Tolerant Routing
There are up to q node-disjoint and edge-disjoint shortest paths between any pair of nodes in a q-cube
Thus, one can route messages around congested or failed nodes/links
A useful notion for designing adaptive wormhole routing algorithms is
that of virtual communication networks
Fig. 14.11  Partitioning a 3-cube into subnetworks for deadlock-free routing (Subnetwork 0 and Subnetwork 1).

Each of the two subnetworks in Fig. 14.11 is acyclic. Hence, any routing scheme that begins by using links in subnet 0, at some point switches the path to subnet 1, and from then on remains in subnet 1, is deadlock-free.
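To make the first point concrete, here is a rough sketch (my own, not the book's algorithm) of greedy adaptive shortest-path routing on a q-cube: at every hop, any one of the bit positions in which the current node still differs from the destination may be fixed, so congested or faulty neighbors can be bypassed as long as some productive dimension remains usable.

def adaptive_route(src: int, dst: int, q: int, faulty=frozenset()):
    """Return a shortest src-to-dst path that avoids the given faulty nodes,
       choosing at each hop any usable dimension in which src and dst still differ."""
    path, cur = [src], src
    while cur != dst:
        for k in range(q):                          # any differing dimension
            if (cur ^ dst) >> k & 1:
                nxt = cur ^ (1 << k)
                if nxt == dst or nxt not in faulty:
                    cur = nxt
                    path.append(cur)
                    break
        else:
            raise RuntimeError("all shortest-path hops blocked by faults")
    return path

# Example on a 4-cube: route 0000 -> 1011 around a faulty node 0001.
print(adaptive_route(0b0000, 0b1011, 4, faulty={0b0001}))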
Slide 17
Robustness of the Hypercube
Rich connectivity provides many alternate paths for message routing.

(Figure: a source node S, three faulty nodes marked X, and a destination node.)

The node that is furthest from S is not its diametrically opposite node in the fault-free hypercube.

The fault diameter of the q-cube is q + 1.
Slide 18
15 Other Hypercubic Architectures
Learn how the hypercube can be generalized or extended:
• Develop algorithms for our derived architectures
• Compare these architectures based on various criteria
Topics in This Chapter
15.1 Modified and Generalized Hypercubes
15.2 Butterfly and Permutation Networks
15.3 Plus-or-Minus-2i Network
15.4 The Cube-Connected Cycles Network
15.5 Shuffle and Shuffle-Exchange Networks
15.6 That’s Not All, Folks!
Slide 19
15.1 Modified and Generalized Hypercubes
Fig. 15.1  Deriving a twisted 3-cube by redirecting two links in a 4-cycle (shown: a 3-cube with a 4-cycle in it, and the resulting twisted 3-cube).

Diameter is one less than that of the original hypercube.
Slide 20
Folded Hypercubes
Fig. 15.2  Deriving a folded 3-cube by adding four diametral links: each node gains a link to the node whose label is the bitwise complement of its own.

Diameter is half that of the original hypercube.

Fig. 15.3  Folded 3-cube viewed as a 3-cube with a redundant dimension: take a diametral path in the 3-cube, remove the dim-0 links of the folded 3-cube, rotate one half by 180 degrees, and after renaming, the diametral links replace the dim-0 links.
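A minimal sketch of this connectivity (assuming integer node labels, as in the earlier examples): a folded q-cube node keeps its q ordinary hypercube links and adds one diametral link to the complement of its label, which brings the diameter down to ceil(q/2).

def folded_cube_neighbors(x: int, q: int) -> list[int]:
    """Neighbors of node x in a folded q-cube: q cube links plus one diametral link."""
    mask = (1 << q) - 1
    return [x ^ (1 << k) for k in range(q)] + [x ^ mask]

# Folded 3-cube: node 000 is adjacent to 001, 010, 100 and (diametrally) 111.
print([format(n, "03b") for n in folded_cube_neighbors(0b000, 3)])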
Slide 21
Generalized Hypercubes
A hypercube is a power or homogeneous product network:
q-cube = (K_2)^q, the q-th power of K_2
Generalized hypercube = q-th power of K_r (node labels are radix-r numbers)
Node x is connected to node y iff x and y differ in exactly one digit
Each node has r – 1 dimension-k links, for every dimension k

Example: radix-4 generalized hypercube. Node labels are radix-4 numbers x_3 x_2 x_1 x_0 (the figure shows the dimension-0 links).
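A small sketch of the neighbor rule just stated (the helper name is mine): in a radix-r generalized hypercube with q-digit labels, each node is adjacent to every label differing from it in exactly one digit, for q(r – 1) neighbors in total.

def gen_hypercube_neighbors(digits: tuple, r: int) -> list[tuple]:
    """digits: tuple of q radix-r digits (most significant first).
       Returns all labels that differ from `digits` in exactly one digit."""
    nbrs = []
    for i, d in enumerate(digits):
        for v in range(r):
            if v != d:
                nbrs.append(digits[:i] + (v,) + digits[i + 1:])
    return nbrs

# Radix-4 example with 2-digit labels: node (2, 1) has 2 * 3 = 6 neighbors.
print(gen_hypercube_neighbors((2, 1), 4))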
Slide 22
Bijective Connection Graphs
Beginning with a c-node seed network, the network size is recursively
doubled in each step by linking nodes in the two halves via an arbitrary
one-to-one mapping. Number of nodes = c × 2^q
Hypercube is a special case, as are many hypercube variant networks
(twisted, crossed, Möbius, . . . , cubes)
Bijective
mapping
Special case of c = 1:
Diameter upper bound is q
Diameter lower bound is an open problem (it is better than q + 1/2)
Slide 23
15.2 Butterfly and Permutation Networks
Fig. 7.4  Butterfly and wrapped butterfly networks. The butterfly has 2^q rows and q + 1 columns (here 8 rows, columns 0 through 3, with dimension-0, dimension-1, and dimension-2 links between successive columns); the wrapped butterfly has 2^q rows and q columns (the last column merged with column 0).
Slide 24
Structure of Butterfly Networks

Fig. 15.5  Butterfly network with permuted dimensions, shown next to the 16-row butterfly network.

Switching the two row pairs indicated in the figure converts the permuted-dimension network back to the original butterfly network. Changing the order of stages in a butterfly is thus equivalent to a relabeling of the rows (in this example, row xyz becomes row xzy).
Slide 25
Fat Trees
Fig. 15.6  Two representations of a fat tree, connecting processors P0 through P7 (compare with an ordinary "skinny" tree): link capacities grow toward the root.

Fig. 15.7  Butterfly network redrawn as a fat tree (front view: binary tree; side view: inverted binary tree).
Slide 26
Butterfly as Multistage Interconnection Network
A butterfly with log2 p columns of 2-by-2 switches can connect p processors (0000 through 1111 in the example) to p memory banks.

Fig. 6.9  Example of a multistage memory access network.
Fig. 15.8  Butterfly network used to connect modules that are on the same side.

Generalization of the butterfly network:
High-radix or m-ary butterfly, built of m × m switches
Has m^q rows and q + 1 columns (q if wrapped)
Slide 27
Beneš Network
Fig. 15.9  Beneš network formed from two back-to-back butterflies: 2 log2 p – 1 columns of 2-by-2 switches connect p processors (000 through 111) to p memory banks.

A 2^q-row Beneš network can route any 2^q × 2^q permutation; it is "rearrangeable".
Slide 28
Routing Paths in a Beneš Network
Fig. 15.10  Another example of a Beneš network: 2^{q+1} inputs and 2^{q+1} outputs, 2^q rows, 2q + 1 columns.

To which memory modules can we connect processor 4 without rearranging the other paths? What about processor 6?
Slide 29
Another View of the CCC Network

Fig. 15.14  Alternate derivation of CCC from a hypercube: an example of hierarchical substitution to derive a lower-cost network from a basis network.

Replacing each node of a high-dimensional q-cube by a cycle of length q is how CCC was originally proposed.
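A minimal sketch of the resulting adjacency (the node naming is mine): processor j on the cycle that replaces hypercube node x, written (x, j), keeps two cycle links and the single dimension-j hypercube link, so every node has degree 3.

def ccc_neighbors(x: int, j: int, q: int):
    """Neighbors of node (x, j) in the CCC derived from a q-cube."""
    return [
        (x, (j - 1) % q),        # previous processor on the cycle
        (x, (j + 1) % q),        # next processor on the cycle
        (x ^ (1 << j), j),       # dimension-j hypercube neighbor
    ]

# Example in the CCC derived from a 3-cube (24 nodes): neighbors of node (5, 1).
print(ccc_neighbors(5, 1, 3))    # [(5, 0), (5, 2), (7, 1)]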
Slide 30
Emulation of a 6-Cube by a 64-Node CCC
With proper node mapping,
dim-0 and dim-1 neighbors
of each node will map onto
the same cycle
Suffices to show how to
communicate along other
dimensions of the 6-cube
Slide 31
Emulation of Hypercube Algorithms by CCC
Fig. 15.15  CCC emulating a normal hypercube algorithm. Ascend, descend, and normal algorithms sweep through the hypercube dimensions 0, 1, 2, 3, . . . , q – 1 (or the reverse) in successive algorithm steps.

A CCC node is identified by a pair: cycle ID x (2^m bits) and processor ID y (m bits).

Node (x, j) communicates along dimension j with (N_j(x), j); after the next rotation of data around the cycle, it will be linked to its dimension-(j + 1) neighbor (N_{j+1}(x), j + 1).
Slide 32
15.5 Shuffle and Shuffle-Exchange Networks
Fig. 15.16  Shuffle, exchange, and shuffle–exchange connectivities (plus unshuffle and an alternate structure), shown for eight nodes 000 through 111.
Slide 33
Shuffle-Exchange Network
Fig. 15.17  Alternate views of an eight-node shuffle–exchange network (S = shuffle link, SE = shuffle–exchange link).
Slide 34
Routing in Shuffle-Exchange Networks
In the 2^q-node shuffle network, node x = x_{q–1} x_{q–2} . . . x_2 x_1 x_0 is connected to x_{q–2} . . . x_2 x_1 x_0 x_{q–1} (cyclic left-shift of x).
In the 2^q-node shuffle-exchange network, node x is additionally connected to x_{q–2} . . . x_2 x_1 x_0 x_{q–1}' (cyclic left-shift with the last bit complemented).

Routing example, from source 01011011 to destination 11010110 (the four bit positions in which they differ are fixed by the four exchange steps):

01011011   Shuffle to 10110110   Exchange to 10110111
10110111   Shuffle to 01101111
01101111   Shuffle to 11011110
11011110   Shuffle to 10111101
10111101   Shuffle to 01111011   Exchange to 01111010
01111010   Shuffle to 11110100   Exchange to 11110101
11110101   Shuffle to 11101011
11101011   Shuffle to 11010111   Exchange to 11010110
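A minimal sketch of the routing idea traced above (the function name is mine): perform q shuffles (cyclic left shifts), and after each shuffle exchange the low-order bit if it does not match the destination bit that must end up in that position.

def shuffle_exchange_route(src: int, dst: int, q: int):
    """Route from src to dst in a 2^q-node shuffle-exchange network.
       Returns the sequence of nodes visited (shuffle and exchange hops)."""
    rotl = lambda x: ((x << 1) | (x >> (q - 1))) & ((1 << q) - 1)
    path, x = [src], src
    for i in range(1, q + 1):
        x = rotl(x)                      # shuffle step
        path.append(x)
        want = (dst >> (q - i)) & 1      # destination bit that position 0 must carry
        if (x & 1) != want:
            x ^= 1                       # exchange step
            path.append(x)
    assert x == dst
    return path

# The example above: 8-bit source 01011011 routed to destination 11010110.
shuffle_exchange_route(0b01011011, 0b11010110, 8)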
Slide 35
Diameter of Shuffle-Exchange Networks
For the 2^q-node shuffle-exchange network: D = q = log2 p, d = 4.
With shuffle and exchange links provided separately, as in Fig. 15.18, the diameter increases to 2q – 1 and the node degree reduces to 3.

Fig. 15.18  Eight-node network with separate shuffle (solid) and exchange (dotted) links.
Slide 36
Multistage Shuffle-Exchange Network

Fig. 15.19  Multistage shuffle–exchange network (omega network) is the same as the butterfly network. The figure includes versions with q + 1 columns and versions with q columns.
Slide 37
15.6 That’s Not All, Folks!
When q is a power of 2, the 2^q q-node cube-connected cycles network derived from the q-cube, by replacing each node with a q-node cycle, is a subgraph of the (q + log2 q)-cube; thus, CCC is a pruned hypercube.

Other pruning strategies are possible, leading to interesting tradeoffs.

Fig. 15.20  Example of a pruned hypercube: all dimension-0 links are kept; even-dimension links are kept in the even subcube, and odd-dimension links in the odd subcube.
Parameters: D = log2 p + 1,  d = (log2 p + 1)/2,  B = p/4
Slide 38
Möbius Cubes
Dimension-i neighbor of x = x_{q–1} x_{q–2} . . . x_{i+1} x_i . . . x_1 x_0 is (a prime marks a complemented bit):
x_{q–1} x_{q–2} . . . x_{i+1} x_i' x_{i–1} . . . x_1 x_0      if x_{i+1} = 0 (x_i complemented, as in the q-cube)
x_{q–1} x_{q–2} . . . x_{i+1} x_i' x_{i–1}' . . . x_1' x_0'   if x_{i+1} = 1 (x_i and all bits to its right complemented)

For dimension q – 1, since there is no x_q, the neighbor can be defined in two possible ways, leading to 0- and 1-Möbius cubes.

A Möbius cube has a diameter of about 1/2, and an average internode distance of about 2/3, of that of a hypercube.

Fig. 15.21  Two 8-node Möbius cubes (the 0-Möbius cube and the 1-Möbius cube).
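A rough sketch of this neighbor rule (the variant argument plays the role of the missing bit x_q that distinguishes the 0- and 1-Möbius cubes; the names are mine):

def mobius_neighbor(x: int, i: int, q: int, variant: int) -> int:
    """Dimension-i neighbor of node x in the `variant`-Mobius cube (variant is 0 or 1)."""
    x_next = variant if i == q - 1 else (x >> (i + 1)) & 1
    if x_next == 0:
        return x ^ (1 << i)              # complement bit i only (as in the q-cube)
    return x ^ ((1 << (i + 1)) - 1)      # complement bits i, i-1, ..., 0

# Example on the 8-node (q = 3) 0-Mobius cube: the dimension-2 neighbor of 011.
print(bin(mobius_neighbor(0b011, 2, 3, variant=0)))   # 0b111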
Slide 39
Seas for Networks
A wide variety of direct
interconnection networks
have been proposed for, or
used in, parallel computers.
They differ in topological,
performance, robustness,
and realizability attributes.
Fig. 4.8 (expanded)
The sea of direct
interconnection networks.
Slide 40
Assignment 2-1
Choose one of the following two exercises:
• Compute the lower bound on congestion when embedding
a hypercube in a ring with the same number of nodes.
• Show that the i-cube contains a 2^j × 2^k mesh as a subgraph,
where i = j + k.
Slide 41