Mapping Task Graphs to Processors in Large Multiprocessor Systems
Kurt Keutzer and the MESCAL Team, especially Yujia Jin, Kaushik Ravindran, and N. R. Satish
11/6/2015

Design Space Exploration Flow
[Figure: exploration flow. Inputs: an application description (element graph of FromDevice, IPVerify, DecIPTTL, Lookup IPRoute, ToDevice, and Discard elements) and a multiprocessor platform (MicroBlaze soft PEs and co-PEs for hardware acceleration, on-chip BRAM, off-chip SDRAM, OPB/PLB buses, FSL queues, Ethernet peripheral). Flow stages: task graph + profiles; scheduling constraints and platform constraints; allocation/scheduling; HW/SW generation; implementation; performance analysis; performance numbers.]

Investigative Approach
- Demonstrate network applications on FPGA-based soft multiprocessors
  - Tomahawk exploration framework
  - Automated task allocation and scheduling
- Extend framework to large multiprocessor systems
  - 1000’s-10,000’s of tasks
  - 100’s-1000’s of PEs
  - RAMP

What Is an FPGA-based Soft Multiprocessor System
- A network of architecture building blocks on an FPGA
- Multiprocessor architecture customized for target application
  - Number of processors
  - Interconnection network
  - Memory hierarchy
  - Custom co-processors
- Productivity gains due to software abstraction
- Cost reduction by avoiding custom silicon
- Xilinx Virtex-II Pro, Virtex-IV family of FPGAs
Architecture building blocks:
  Processing element: MicroBlaze (soft), PowerPC (hard)
  Co-processor: hash engine, crypto engine
  Bus: OPB, PLB
  Queue: FSL
  Memory: BRAM (on-chip), SDRAM (off-chip)
[Figure: example multiprocessor configuration with MicroBlaze (soft) and PowerPC (hard) PEs, co-PEs for hardware acceleration, on-chip BRAM, off-chip SDRAM, OPB and PLB buses, FSL queues, and an Ethernet peripheral.]

Obstacles to Their Adoption: Hard to design
- Complex micro-architecture design
  space
  - Processor choices
  - Memory hierarchy
  - Communication topology
- Difficult mapping decisions: assigning
  - computation to processing elements
  - data to exposed heterogeneous memories
- To unlock potential of these systems, tools enabling efficiency and productivity are needed

Example: Design Difficulty
Application task graph: R1, L1, T1, R2, L2, T2
Profile: execution time (cycles)
       P1   P2
  R    20   10
  L    20   20
  T    30   10
[Figure: architecture model with processors P1 and P2 connected by a queue (a further value of 10 appears next to the queue). Exploring mappings of the task graph yields Design A, Design B, and an optimal design, annotated with per-processor total times and makespans. The makespan values 70, 60, and 80 appear on the slide; the transcript does not preserve which design each belongs to.]

Tomahawk: Network Applications onto Soft MPs
[Figure: a Click application description (FromDevice, IPVerify, DecIPTTL, Lookup IPRoute, ToDevice, and Discard elements) becomes a task graph (S1, S2, R1, L1, T1, R2, L2, T2). Automated microarchitecture configuration and automated mapping assign the tasks to processors (P1, P2) and memory (M1), producing C programs and a micro-architecture specification for a Xilinx 2VP50 FPGA (MicroBlaze soft PEs, co-PEs for hardware acceleration, on-chip BRAM, off-chip SDRAM, OPB/PLB buses, FSL queues, Ethernet peripheral).]

Possible Approaches for Automated Exploration
- Randomized algorithms: probabilistic bounds, simulated annealing
- Heuristic methods: list scheduling, force directed scheduling
- Exact methods: enumeration and tabu search, branch-and-bound
- Limitations of these approaches
  - Specific implementation constraints are hard to enforce
  - Most approaches require per-instance tuning and are hard to generalize; therefore poor for design space exploration

Constraint Optimization Techniques for Automated Exploration
- Constraint solver technologies
  - Integer linear programming (ILP) solvers
  - 0-1 Boolean reasoning solvers (SAT, PB-SAT)
- Advantages
  - Constraint formulations are a formal, yet natural way to capture a mathematical optimization problem
  - Implementation constraints specific to a problem can be incorporated easily
  - Constraint solvers can exhaustively cover a search space without enumerating all solutions
- Key strategies to improve solver performance:
  - Decomposition methods
  - Variable ordering
  - Improved lower and upper bounds
  - Symmetry representation

ILP Formulation
[Slide content (the formulation itself) is not preserved in the transcript.]

Example Application: IPv4 Packet Forwarding
- Data plane of IPv4 packet forwarding (RFC-1812)
- Campus network router, home router
  - Medium sized route table (5,000 entries or less)
  - Route table small enough to fit in on-chip memory
- Lookup: inspect destination address and find next hop
  - Longest prefix match
  - Implementation determined by route distribution, memory and performance constraints
- Target platform: Xilinx Virtex-II Pro 2VP50 FPGA
- Architecture library: MicroBlazes, PowerPC, on-chip Block RAM, IBM CoreConnect buses, queue
[Figure: packet path from ingress to egress, operating on the header while the payload passes through: receive IPv4 packet; verify version, checksum and TTL; lookup next-hop (prefix match) in the route table; update checksum and TTL; transmit IPv4 packet.]

Hand-tuned Multiprocessor Design for IPv4 Forwarding
[Figure: hand-tuned design. Packets from source MicroBlazes 1 and 2 pass through parallel branches of "Verify ver & ttl checksum", Lookup1, and Lookup2 stages that share route tables, then return to the source MicroBlazes. Key: MicroBlaze, Block RAM, bus, queue.]
- Achieved 1.8 Gbps throughput for header processing using 12 MicroBlaze processors

Improved Design after Automated Exploration
[Figure: improved design. Packets from source MicroBlazes 1 and 2 pass through "Verify ver & ttl", Lookup1, Lookup2, Lookup3, and "Verify checksum" stages, with three route tables.]
[Figure key: MicroBlaze, Block RAM, bus, queue.]
- Resulting design achieved 2.0 Gbps throughput, surpassing the 1.8 Gbps hand-tuned design while using one fewer MicroBlaze processor
- The improvement was due to a less regular configuration and a balanced workload of tasks across the processors

Justifying constraint optimization techniques
- Our constraint optimization method can handle instances of the representative allocation and scheduling problem with up to 100’s of tasks onto 10’s of PEs
- Implementation constraints can be easily incorporated
  - Task groupings
  - Multiprocessor topology restrictions
  - Preferred allocations
  - Memory assignments
  - Mutual exclusion

Following Moore’s Law
- Extend to more complex applications: 1000’s-10,000’s of tasks
- Extend to bigger multiprocessor systems: 100’s-1000’s of PEs
[Figure: PEs, each with a memory (M) and a network interface (NI), connected by an on-chip network.]

What can we do for RAMP?
- Challenges in deploying concurrent applications on a RAMP system
  - Task allocation and scheduling across 100’s-1000’s of PEs
  - Fast mapping step to enable efficient design space exploration
- Our optimization techniques for static task allocation and scheduling are a first step to address these challenges
  - A “compile-time” tool to guide the designer to explore efficient mappings
  - Flexible formulation to target diverse multiprocessors
- Research in progress to extend our techniques to work on problems in the scale of RAMP systems

Backup Slides

Example
- Optimal design found in less than 6 seconds on 400MHz Sparc II
[Figure: an application is explored onto a 2VP50 architecture to find the optimal design. Labels: MicroBlazes, PowerPC, BRAMs, FSLs (communication), bus; node names P11, P21, P12, M1.]

Following Moore’s Law
- Extend to more complex applications: 1000’s-10,000’s of tasks (DSLAM)
- Extend to bigger multiprocessor systems: 100’s-1000’s of PEs (RAMP)

Challenges in Automated Exploration
- Higher exploration complexity
  - Increases by 2 orders of magnitude
- More emphasis on communication
  - Arbitration modeling
  - Routing constraints due to network topology
- Statistical cost model for dynamic behavior

Potential Approaches to Address these Challenges
- Additional constraints can be easily added to incorporate new features
- Constraint solver performance will slow down and thus become the bottleneck
- Some strategies to improve constraint solver performance
  - Task graph based structural decompositions
  - Relaxation heuristics
  - Symmetry representation
  - Cutting planes and valid inequalities

[Figure: processor interconnect. PEs, each with a memory and a network interface, connected by an on-chip network. Key: PE = processor, M = memory, NI = network interface.]
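
The design-difficulty example from earlier in the deck (task graph R1, L1, T1, R2, L2, T2 with the P1/P2 execution-time profile) is small enough to explore exhaustively, which makes the allocation and scheduling problem concrete. The sketch below is illustrative only and is not the Tomahawk implementation or the deck's ILP formulation. It assumes that the tasks form two independent chains R -> L -> T, that passing data through the queue between P1 and P2 costs 10 cycles, and that each processor runs one task at a time. All function and variable names are hypothetical.

```python
from itertools import permutations, product

# Execution-time profile (cycles) from the example slide: task type -> (P1, P2).
PROFILE = {"R": (20, 10), "L": (20, 20), "T": (30, 10)}
TASKS = ["R1", "L1", "T1", "R2", "L2", "T2"]
# Assumption: two independent chains R -> L -> T.
PREDS = {"R1": [], "L1": ["R1"], "T1": ["L1"],
         "R2": [], "L2": ["R2"], "T2": ["L2"]}


def makespan(order, alloc, comm):
    """List-schedule tasks in `order`; alloc maps task -> processor index (0 or 1)."""
    proc_free = [0, 0]  # time at which each processor becomes idle
    finish = {}
    for task in order:
        proc = alloc[task]
        ready = 0
        for pred in PREDS[task]:
            # Assumption: a cross-processor edge pays the queue cost `comm`.
            delay = comm if alloc[pred] != proc else 0
            ready = max(ready, finish[pred] + delay)
        start = max(ready, proc_free[proc])
        finish[task] = start + PROFILE[task[0]][proc]
        proc_free[proc] = finish[task]
    return max(finish.values())


def is_topological(order):
    """True if every task appears after all of its predecessors."""
    pos = {task: i for i, task in enumerate(order)}
    return all(pos[p] < pos[t] for t in TASKS for p in PREDS[t])


def best_mapping(comm=10):
    """Try every allocation and every precedence-respecting task order."""
    orders = [o for o in permutations(TASKS) if is_topological(o)]
    best = None
    for procs in product((0, 1), repeat=len(TASKS)):
        alloc = dict(zip(TASKS, procs))
        for order in orders:
            m = makespan(order, alloc, comm)
            if best is None or m < best[0]:
                best = (m, alloc, order)
    return best


if __name__ == "__main__":
    m, alloc, order = best_mapping(comm=10)
    print("makespan:", m)
    print("allocation:", {t: "P%d" % (p + 1) for t, p in alloc.items()})
```

Under these assumptions the search reports a makespan of 60 cycles, while placing every task on P2 takes 80 cycles. The search space here is 64 allocations times 20 task orders; it grows exponentially with the number of tasks, which is why the deck turns to constraint solvers rather than enumeration for larger instances.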