Baring It All to Software: Raw Machines

E. Waingold, M. Taylor, D. Srikrishna, V. Sarkar,
W. Lee, V. Lee, J. Kim, M. Frank, P. Finch, R.
Barua, J. Babb, S. Amarasinghe, A. Agarwal
(Presented by Linda Deng)
Hitting a wall
• Already in 1997?
• As # of transistors increases, so does wire delay
• New complex hardware → verification costs
• Emerging stream-based multimedia
The radical Raw idea
• Lots of simple interconnected tiles
• Each tile contains:
– Instruction/data memories
– ALU
– Registers
– Configurable logic
– Programmable switch for routing
• Complex operations synthesized into HW
A Raw processor
The programmer’s job
• Software deals with wire delay
• Wire delay = hops in mesh network
• One cycle to move from a tile to its neighbor
• Compiler knows # of cycles needed to move
– Statically schedules operations
• Register renaming, instruction scheduling,
dependency checking…
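The "wire delay = hops" rule above can be made concrete with a small sketch. It assumes a 2D mesh in which a word travels along rows and then columns, so the hop count is the Manhattan distance between tiles; the function names and the (row, column) tile coordinates are illustrative only and do not come from the Raw toolchain.

```python
def hop_latency(src, dst, cycles_per_hop=1):
    """Cycles to move one word from tile src to tile dst.

    Assumption: tiles are (row, col) pairs in a 2D mesh and each hop to a
    neighboring tile costs cycles_per_hop cycles (one, per the slide).
    """
    (src_row, src_col), (dst_row, dst_col) = src, dst
    hops = abs(src_row - dst_row) + abs(src_col - dst_col)
    return cycles_per_hop * hops


def earliest_use_cycle(produce_cycle, src, dst):
    """Earliest cycle at which dst can consume a value produced on src.

    Because the latency is known at compile time, a static scheduler can
    place the consuming instruction at (or after) this cycle.
    """
    return produce_cycle + hop_latency(src, dst)


if __name__ == "__main__":
    print(hop_latency((0, 0), (0, 1)))
    print(earliest_use_cycle(10, (0, 0), (3, 2)))
```

Since every latency here is a compile-time constant, no hardware interlock is needed to discover when the operand arrives, which is the point of the slide.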
What’s the big deal?
• Distributed registers
– Bigger register namespace → higher ILP
• Distributed static RAM
– Shorter memory latency
• No specialized logic structures in HW
– Smaller tiles → more tiles → greater parallelism
– More chip area for memory/logic
– Faster clock
– Less complexity → easier verification
The hard-working compiler
• Parallelism vs. communication/synchronization?
– But the latter’s overhead is low
– So partitioning can be fine-grained
• Tile placement to minimize latency/bandwidth
• Programs for tiles/switches (scheduling/routing)
• Logic synthesis tool for configurable logic
– Pattern-matching algorithms to find candidate insns
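As a rough illustration of the placement step, the sketch below assigns program partitions to tiles so that heavily communicating partitions end up close together. It is a brute-force toy under stated assumptions (Manhattan-distance hop cost, a hypothetical list of weighted communication edges), not the algorithm the Raw compiler uses.

```python
from itertools import permutations


def manhattan(a, b):
    """Hop count between two (row, col) tiles in a 2D mesh."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def placement_cost(placement, edges):
    """Total communication cost: words sent times hops travelled.

    placement maps partition name -> tile; edges is a list of
    (producer, consumer, words) tuples (a hypothetical input format).
    """
    return sum(words * manhattan(placement[u], placement[v])
               for u, v, words in edges)


def best_placement(partitions, edges, tiles):
    """Exhaustively try every assignment; only feasible for tiny inputs."""
    best_cost, best_map = None, None
    for chosen in permutations(tiles, len(partitions)):
        candidate = dict(zip(partitions, chosen))
        cost = placement_cost(candidate, edges)
        if best_cost is None or cost < best_cost:
            best_cost, best_map = cost, candidate
    return best_cost, best_map


if __name__ == "__main__":
    tiles = [(r, c) for r in range(2) for c in range(2)]  # 2x2 mesh
    edges = [("a", "b", 10), ("b", "c", 1)]
    cost, mapping = best_placement(["a", "b", "c"], edges, tiles)
    print(cost, mapping)
```

A real compiler would replace the exhaustive search with a heuristic, since the number of assignments grows factorially with the tile count.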
Some remaining dynamic events…
• What happens when compiler can’t resolve?
• Reserve bandwidth b/w potential communicators
• Conservative estimates for dynamic routing
• Assign dependency checking to tiles
• Predict tile for offset, even though base is unknown
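One way the last point could work is sketched below. It rests on two assumptions that the slides do not state: memory words are interleaved across tiles by low-order address bits, and array bases are aligned to a multiple of the tile count. Under those assumptions the home tile depends only on the offset, so it can be predicted at compile time even when the base is a runtime value. The names and the tile count of 16 are hypothetical.

```python
N_TILES = 16  # hypothetical tile count


def home_tile(word_address, n_tiles=N_TILES):
    """Tile that owns a word under low-order interleaving (assumed scheme)."""
    return word_address % n_tiles


def predicted_tile(offset, n_tiles=N_TILES):
    """Compile-time prediction of the home tile for base + offset.

    Valid only if the (unknown) base is a multiple of n_tiles, in which
    case (base + offset) % n_tiles == offset % n_tiles.
    """
    return offset % n_tiles


if __name__ == "__main__":
    # The prediction matches the real home tile for any aligned base.
    for base in (0, 16, 4096):
        for offset in range(40):
            assert home_tile(base + offset) == predicted_tile(offset)
    print("prediction holds for aligned bases")
```

If the alignment assumption fails, the prediction is wrong, which is why a dynamic fallback such as the reserved bandwidth mentioned above would still be needed.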
Prototype time: RawLogic
• Implemented with FPGAs
• Limited feature support
– Static sequences converted into state machines
– Hardwired into RawLogic
– Inflexible, with amazingly long compilation times
• Framework in C/Verilog for compilation
– Produced binary code for state machines
• But larger benchmarks were emulated
• And Raw machine has faster clock than FPGA
The numbers
Looking ahead
• “In 10 to 15 years, we believe that billion-transistor
chip densities, faster switching speeds, and growing
compiler sophistication will allow a Raw machine’s
performance-to-cost ratio to surpass that of traditional
architectures for future, general-purpose workloads.”
• Agarwal’s Tilera started shipping 64-core
TILE64 in 2007, working on 36- and 120-core?