Transcript [PPT]
[Slide 1] A General Constraint-centric Scheduling Framework for Spatial Architectures
Tony Nowatzki, Michael Sartin-Tarm, Lorenzo De Carli, Karthikeyan Sankaralingam (UW-Madison); Cristian Estan (Broadcom Corp.); Behnam Robatmili (Qualcomm Research)

[Slide 2] A spatial architecture is a type of microprocessor hardware in which the organization of the functional units, interconnection network, or storage is exposed to the software.
• Regions of computation are mapped to hardware units
• An alternative paradigm to instruction-wise pipelined execution
• Pioneering examples: RAW, WaveScalar, TRIPS
• Recent proposals: Tartan, CCA, PLUG, FlexCore, SoftHV, MESCAL, SPL, C-Cores, DySER, BERET

[Slide 3] The fundamental scheduling problem in all spatial architectures is the mapping of computation to hardware resources.

[Slides 4-8] Why is this hard?
The running example is a small program region (it could be a basic block, hyperblock, PDG, or other program sub-region):
  LD R1
  LD R2
  MUL R3,R1,R2
  ST R3
(Figure: the slides animate two candidate mappings of these four instructions onto a grid of hardware units, side by side. Placing LD R1, LD R2, and MUL R3,R1,R2 constrains how their operands can be routed; some routes are infeasible, marked "x", and the final placement of ST R3 depends on all the earlier choices.)

[Slide 9] We have contributed a general spatial architecture scheduler that uses a declarative approach.

[Slide 10] Existing heuristic approaches are popular and effective, but inflexible. In general:
• They are architecture-specific.
• They suffer from poor developer/architect productivity.
• They lack insight into optimality.

[Slide 11] Existing declarative approaches do not cover enough responsibilities.
• More on this later…

[Slide 12] Our mathematical scheduling framework resolves these issues.
• It is comprehensive.
• It is usable for a diverse set of spatial architectures.
• It is not complex to implement.

[Slide 13] Our model uses an Integer Linear Programming (ILP) solver.
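The example region above (LD R1; LD R2; MUL R3,R1,R2; ST R3) is exactly the kind of software DAG the scheduler consumes: instructions become vertices and data dependences become edges. A minimal Python sketch of that input follows; the structure comes from the slides, but the encoding and helper names are illustrative, not the framework's actual format.

```python
# Software DAG for the example region: vertices are instructions,
# edges are data dependences (here, through R1, R2, and R3).
vertices = ["LD_R1", "LD_R2", "MUL", "ST_R3"]
edges = [
    ("LD_R1", "MUL"),   # R1 flows into the multiply
    ("LD_R2", "MUL"),   # R2 flows into the multiply
    ("MUL", "ST_R3"),   # R3 flows into the store
]

def successors(v):
    """Vertices that consume v's result."""
    return [dst for (src, dst) in edges if src == v]

def topo_order(vs, es):
    """Topological order of the DAG (Kahn's algorithm)."""
    indeg = {v: 0 for v in vs}
    for _, dst in es:
        indeg[dst] += 1
    order = []
    ready = [v for v in vs if indeg[v] == 0]
    while ready:
        v = ready.pop()
        order.append(v)
        for w in successors(v):
            indeg[w] -= 1
            if indeg[w] == 0:
                ready.append(w)
    return order
```

Any valid schedule must respect this order: both loads before the multiply, and the multiply before the store.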
• A single objective is specified for minimization.
• The objective is a function of a system of linear equations.
• There are guaranteed bounds on optimality.

[Slide 14] Toolchain: Source Code → Compiler → Software DAGs; Software DAGs + Hardware Graph → ILP Scheduler → Assembler → Binary

[Slide 15] Executive Summary
• We have implemented a declarative, general spatial architecture scheduler in ILP.
• It has been applied to 3 diverse spatial architectures.
• It provides performance competitive with native schedulers.
• It is released open-source, and examples are available online: http://research.cs.wisc.edu/vertical/ilp-scheduler
• Our tool is the basis of a formulative scheduling guide: Constraint Centric Scheduling Guide, ACM SIGARCH Computer Architecture News, May 2013.
• An interactive demo will be online as part of a synthesis lecture: Optimization and Mathematical Modeling in Computer Architecture.

[Slide 16] Three target architectures:
• TRIPS (IEEE Computer '04, ASPLOS '09): general-purpose microprocessor
• DySER (HPCA '11, IEEE Micro '12): spatial accelerator
• PLUG (SIGCOMM '09, PACT '10): network-lookup engine

[Slide 17] Sets and mappings:
• Software graph: edges (e ∈ E) connect vertices (v ∈ V).
• Hardware graph: links (l ∈ L) connect nodes (n ∈ N) and routers (r ∈ R).
• Mappings: M_VN(V,N) maps vertices to nodes; M_EL(E,L) maps edges to links.

[Slides 19-21] Our framework consists of 5 main responsibilities:
1. Placement of Computation
2. Routing of Data
3. Latency Management
4. Resource Management
5. Optimization Objective
Key contributions:
• Developing scheduling in terms of these responsibilities
• Encoding the responsibilities in terms of ILP constraints

[Slide 22] 1. Placement of Computation (hardware mapping)
Each vertex v must be mapped to exactly one compatible node n:
  ∀v: ∑_{n | C(v,n)=1} M_VN(v,n) = 1
  ∀v, ∀n | C(v,n)=0: M_VN(v,n) = 0
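The two placement constraints just stated (every vertex on exactly one node, and only on nodes the compatibility matrix C allows) can be illustrated outside an ILP solver by brute-force enumeration. A minimal Python sketch follows; the vertex names, node names, and compatibility entries are assumed for illustration.

```python
from itertools import product

# Hypothetical 4-vertex software graph and 4-node hardware grid.
vertices = ["LD_R1", "LD_R2", "MUL", "ST_R3"]
nodes = ["n0", "n1", "n2", "n3"]

# C[v][n] = 1 when vertex v may execute on node n (assumed values).
C = {v: {n: 1 for n in nodes} for v in vertices}
C["MUL"]["n0"] = 0  # e.g. node n0 has no multiplier

def feasible_placements():
    """Enumerate assignments satisfying the two placement constraints:
    - each vertex maps to exactly one node
      (the sum over n of M_VN(v,n) equals 1), and
    - M_VN(v,n) = 0 wherever C(v,n) = 0 (compatibility)."""
    for choice in product(nodes, repeat=len(vertices)):
        placement = dict(zip(vertices, choice))
        if all(C[v][placement[v]] == 1 for v in vertices):
            yield placement

sols = list(feasible_placements())
```

Each dict in `sols` corresponds to one 0/1 solution of the M_VN variables; an ILP solver searches this same space implicitly rather than by enumeration.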
[Slides 23-26] (Animation of the placement constraint on an example: a "+" vertex may only be placed on an ADD-capable node and a "×" vertex only on a MUL-capable node; incompatible placements are marked "x".)

[Slides 27-30] 2. Routing of Data (network mapping)
• An incoming edge of a vertex must be mapped to an incoming link of its node.
• An outgoing edge of a vertex must be mapped to an outgoing link of its node.
• A router that an edge passes through must have that edge mapped to both an incoming and an outgoing link of the router.
(Figure: the example edges are routed across links and routers between the ADD and MUL nodes.)

[Slides 31-33] 3. Latency Management (timing)
• The arrival delay at a vertex is the maximum delay over its incoming edges; the vertex then adds its own execution delay (+2 for ADD and +4 for MUL in the figure).
• The routing delay of an edge equals the number of links it is mapped to.
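The three routing rules above amount to requiring each edge to follow a contiguous link path from its producer's node to its consumer's node. A minimal Python sketch finds such a path by breadth-first search over a tiny hypothetical hardware graph (the node, router, and link names are assumptions for illustration, not the framework's model).

```python
from collections import deque

# Hypothetical hardware graph: nodes n0, n1, n2 and routers r0, r1,
# with directed links between adjacent elements.
links = [
    ("n0", "r0"), ("r0", "n0"),
    ("r0", "n1"), ("n1", "r0"),
    ("r0", "r1"), ("r1", "r0"),
    ("r1", "n2"), ("n2", "r1"),
]

def route(src, dst):
    """BFS for a shortest link path from node src to node dst.
    By construction the path's first link leaves src, its last link
    enters dst, and every intermediate router has both an incoming
    and an outgoing link on the path -- the three routing rules."""
    queue = deque([(src, [])])
    seen = {src}
    while queue:
        here, path = queue.popleft()
        if here == dst:
            return path
        for (a, b) in links:
            if a == here and b not in seen:
                seen.add(b)
                queue.append((b, path + [(a, b)]))
    return None  # no route exists
```

The length of the returned path is exactly the edge's routing delay from the latency slides: one unit per link traversed.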
(Figure, slides 33-34: the example paths are annotated with cumulative delays; inputs arrive at time 0, each link adds routing delay, ADD adds +2, and MUL adds +4.)
• The overall latency of a mapping is the maximum latency of any vertex.

[Slides 35-38] 4. Resource Management (utilization)
• The utilization of a node is the number of vertices mapped to it.
• The utilization of a link is the number of edges mapped to it.
• The overall utilization of a mapping is the maximum utilization of any node or link.
(Figure: LD R1 and LD R2 share a node, giving it utilization 2; the links their edges share also have utilization 2; the MUL and ST R3 nodes and the remaining links have utilization 1.)

[Slide 39] 5. Optimization Objective (final performance target!)
• The overall latency of a mapping is the maximum latency of any vertex:
  ∀v ∈ V_out: T(v) ≤ LAT
• The overall utilization of a mapping is the maximum utilization of any node or link:
  ∀n ∈ N: U(n) ≤ SVC
  ∀l ∈ L: U(l) ≤ SVC
• The optimization target is a function balancing latency (LAT) and utilization (SVC).

[Slide 40] Results: Feasibility
• Spatial scheduling can be modeled as a system of linear equations.
• The ILP-based approach is concise: our implementation is ~50 lines of GAMS* code for each architecture.
*GAMS is an algebraic modeling language for optimization; it can express ILP among other constraint theories.
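Given a candidate placement and routing, the latency and utilization definitions on these slides can be evaluated directly. The Python sketch below does so for the running LD/LD/MUL/ST example; the execution delays, node assignment, link assignment, and the simple weighted-sum objective are all assumptions for illustration (the paper's actual balancing function may differ).

```python
# Assumed per-vertex execution delays (cycles) and a hypothetical mapping.
exec_delay = {"LD_R1": 0, "LD_R2": 0, "MUL": 4, "ST_R3": 0}
edges = [("LD_R1", "MUL"), ("LD_R2", "MUL"), ("MUL", "ST_R3")]
node_of = {"LD_R1": "n0", "LD_R2": "n0", "MUL": "n1", "ST_R3": "n2"}
# Routing delay of an edge = number of links it is mapped to.
links_of_edge = {
    ("LD_R1", "MUL"): ["l0", "l1"],
    ("LD_R2", "MUL"): ["l0", "l1"],
    ("MUL", "ST_R3"): ["l2"],
}

def latency(v):
    """T(v): the max delay over v's incoming edges (producer latency
    plus routing delay) plus v's own execution delay."""
    incoming = [e for e in edges if e[1] == v]
    t_in = max((latency(e[0]) + len(links_of_edge[e]) for e in incoming),
               default=0)
    return t_in + exec_delay[v]

# Overall latency: maximum latency of any vertex (the LAT bound).
LAT = max(latency(v) for v in node_of)

# Node utilization: number of vertices mapped to each node.
util_node = {}
for v, n in node_of.items():
    util_node[n] = util_node.get(n, 0) + 1

# Link utilization: number of edges mapped to each link.
util_link = {}
for path in links_of_edge.values():
    for link in path:
        util_link[link] = util_link.get(link, 0) + 1

# Overall utilization: maximum over all nodes and links (the SVC bound).
SVC = max(list(util_node.values()) + list(util_link.values()))

# Illustrative objective balancing latency and utilization.
objective = LAT + 2 * SVC
```

In the ILP formulation, LAT and SVC are variables bounded below by every T(v), U(n), and U(l), and the solver minimizes the balancing function over all feasible mappings at once.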
[Slide 41] Results: Performance (TRIPS)
The ILP scheduler is competitive with SPS, the native TRIPS scheduler:
• Better on 22 of 43 benchmarks (geometric mean: +2.9%)
• Worse on 21 of 43 benchmarks (geometric mean: -1.9%)
• Small slowdowns/speedups relative to SPS are primarily due to dynamic events.
• The 3 worst benchmarks are 5.4%, 6.0%, and 13.2% worse, primarily due to the model's lack of cache-bank knowledge.
(Using TRIPS performance benchmarks: microbenchmarks, EEMBC)

[Slide 42] Results: Performance (DySER)
The ILP scheduler outperforms the specialized DySER scheduler on all benchmarks (geometric mean: +9.2%):
• Individual block latencies are reduced by 38% on average.
• Long dependency chains lead to significant speedups under the ILP schedules.
(Using DySER performance benchmarks: Parboil)

[Slide 43] Results: Performance (PLUG)
The ILP scheduler matches or outperforms hand-generated schedules on all benchmarks (geometric mean: +2.7%):
• PLUG DFGs are inherently more complex.
(Using PLUG network-lookup benchmarks)

[Slide 44] Related Work

Year  Technique              Approach
1950  ILP machine scheduler  M-job-DAG to N-resource scheduling
1992  ILP for VLIW           Modulo scheduling
2002  ILP for RAW            M-job-DAG to N-resource scheduling, assuming fixed network delays
2008  SMT for PLA            Strict communication and computation requirements

Prior techniques do not model the routing and utilization responsibilities.

[Slide 45] Executive Summary (recap of slide 15)
• We have implemented a declarative, general spatial architecture scheduler in ILP.
• It has been applied to 3 diverse spatial architectures and provides performance competitive with native schedulers.
• Released open-source with examples online: http://research.cs.wisc.edu/vertical/ilp-scheduler
• Constraint Centric Scheduling Guide, ACM SIGARCH Computer Architecture News, May 2013; an interactive demo is forthcoming as part of the synthesis lecture Optimization and Mathematical Modeling in Computer Architecture.