Options for embedded systems: constraints, challenges, and approaches
HPEC 2001, Lincoln Laboratory, 25 September 2001
Gordon Bell, Bay Area Research Center, Microsoft Corporation
More architecture options: applications, COTS (clusters, computers… chips), custom chips…
– The architecture challenge: "One person's system is another's component." (Alan Perlis)
– Kurzweil predicted that hardware will be compiled, and be as easy to change as software, by 2010.
– COTS: streaming, Beowulf, and WWW relevance?
– Architecture hierarchy:
  – Application
  – Scalable components forming the system
  – Design and test
  – Chips: the raw materials
– Scalability: the fewest, replicable components
– Modularity: finding reusable components

The architecture levels & options
– The apps:
  – Data-types: "signals", "packets", video, voice, RF, etc.
  – Environment: parallelism, power, power, power, speed, … cost
– The material: clock, transistors…
– Performance… it's about parallelism:
  – Program & programming environment
  – Network, e.g. WWW and Grid
  – Clusters
  – Storage, cluster, and network interconnect
  – Multiprocessors
  – Processor and special processing
  – Multi-threading and multiple processors per chip
  – Instruction-level parallelism vs. vector processors

Sony PlayStation export limits
– A problem the Xbox would like to have… but has solved.

Will the PC prevail for the next decade as a/the dominant platform? … or be second to smart, mobile devices?
– Moore's Law increases performance; Bell's Corollary reduces prices for new classes.
– PC server clusters (aka Beowulf) with a low-cost OS kill proprietary switches, smPs, and DSMs.
– Home entertainment & control:
  – Very large disks (1 TB by 2005) to "store everything"
  – Screens to enhance use
– Mobile devices, etc. dominate the WWW after 2003!
– Voice and video become the important apps!
– C = Commercial; C' = Consumer

Where's the action? Problems?
– Constraints from the application: speech, video, mobility, RF, GPS, security…
– Moore's Law, networking, interconnects
– Scalability and high-performance processing:
  – Building them: clusters vs. DSM
  – Structure: where are the processing, memory, and switches (disk and TCP/IP processing)?
  – Micros: getting the most from the nodes
– Not ISAs: change can delay the Moore's Law effect… and wipe out software investment! Please, please, just interpret my object code!

System (on a chip) alternatives… apps drivers
– Data-types (e.g. voice, video, RF), performance, portability/power, and cost
– COTS: anything at the system structure level to use?
– How are the system components, e.g. computers, going to be interconnected?
– What are the components? Linux?
– What is the programming model?
  – Is a plane, CCC, tank, fleet, ship, etc. an Internet?
  – Beowulfs… the next COTS
  – What happened to Ada? Visual Basic? Java?

Computing SNAP built entirely from PCs
(Diagram: a space, time (bandwidth), and generation scalable environment of scalable computers built from PCs: portables and mobile nets on a wide-area global network; wide and local area networks for terminals, PCs, workstations, and servers; person servers (PCs); TC = TV + PC in the home via CATV, ATM, or satellite; legacy mainframe and minicomputer servers and terminals; and centralized and departmental uni- and mP (UNIX & NT) servers built from PCs.)
Five Scalabilities
– Size scalability: designed from a few components, with no bottlenecks.
– Generation scalability: no rewrite/recompile or user effort to run across generations of an architecture.
– Reliability scalability: choose any level.
– Geographic scalability: compute anywhere (e.g. multiple sites or in-situ workstation sites).
– Problem x machine scalability: the ability of an algorithm or program to exist at a range of sizes that run efficiently on a given, scalable computer.
– Problem x machine space => run time: problem scale and machine scale (#p) determine run time, which implies speedup and efficiency.

Why I gave up on large smPs & DSMs
– Economics: perf/cost is lower… unless it is a commodity.
– Economics: longer design time & life; complex => poorer technology tracking & end-of-life performance.
– Economics: higher, uncompetitive costs for processors & switching; sole sourcing of the complete system.
– DSMs… NUMA! Latency matters. Compiler, run-time, and O/S locate the programs anyway.
– They aren't scalable. Reliability requires clusters. Start there.
– They aren't needed for most apps… hence, a small market unless one can find a way to lock in a user base. Important, as in the case of IBM Token Ring vs. Ethernet.

What is the basic structure of these scalable systems?
– Overall disk connection, especially with respect to Fibre Channel SANs, and especially with fast WANs & LANs.

GB plumbing from the baroque: evolving from the two dance-hall SMP & storage model
  Mp — S — Pc, with S also fanning out to S.fc — Ms (Fibre Channel to storage), S.Cluster, and S.WAN
vs.
  MpPcMs — S.Lan/Cluster/Wan

SNAP Architecture

ISTORE Hardware Vision
– System-on-a-chip enables computer and memory without significantly increasing the size of the disk.
– 5-7 year target, MicroDrive (1.7" x 1.4" x 0.2"):
  – 1999: 340 MB, 5400 RPM, 5 MB/s, 15 ms seek
  – 2006: 9 GB, 50 MB/s? (1.6x/yr capacity, 1.4x/yr bandwidth)
– Integrated IRAM processor (2x height), connected via a crossbar switch growing like Moore's Law: 16 Mbytes; 1.6 Gflops; 6.4 Gops.
– 10,000+ nodes in one rack! 100/board = 1 TB; 0.16 Tflops.

The Disk Farm? Or a system on a card?
– The 500 GB disc card: an array of discs on a 14" card.
– Can be used as 100 discs, 1 striped disc, 50 fault-tolerant discs, etc.
– LOTS of accesses per second and LOTS of bandwidth.
– A few disks are replaced by tens of GBytes of RAM and a processor to run apps!

The Promise of SAN/VIA/Infiniband (http://www.ViArch.org/)
– Yesterday:
  – 10 MBps (100 Mbps Ethernet)
  – ~20 MBps TCP/IP saturates 2 CPUs
  – round-trip latency ~250 µs
– Now:
  – Wires are 10x faster: Myrinet, Gbps Ethernet, ServerNet, …
  – Fast user-level communication: TCP/IP ~100 MBps at 10% CPU; round-trip latency is 15 µs; 1.6 Gbps demoed on a WAN.
(Chart: time in µs to send 1 KB, split into transmit, receiver-CPU, and sender-CPU time, for 100 Mbps Ethernet vs. a Gbps SAN.)
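To make the round-trip numbers above concrete, here is a minimal ping-pong microbenchmark of the kind commonly used to measure them, written against MPI (one of the message-passing libraries the Beowulf slides below mention). It is an illustrative sketch, not code from the talk; run it on two nodes and compare the printed time with the ~250 µs (TCP/IP over 100 Mbps Ethernet) and ~15 µs (user-level SAN) figures quoted above.

/* Minimal MPI ping-pong sketch: measures round-trip latency for a 1 KB
 * message, matching the "time to send 1 KB" chart above.  Illustrative
 * only; requires at least two MPI ranks (e.g. mpirun -np 2 ./pingpong). */
#include <mpi.h>
#include <stdio.h>
#include <string.h>

int main(int argc, char **argv)
{
    enum { MSG_BYTES = 1024, REPS = 1000 };
    char buf[MSG_BYTES];
    int rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    memset(buf, 0, sizeof buf);

    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    for (int i = 0; i < REPS; i++) {
        if (rank == 0) {            /* node 0: send, then wait for the echo */
            MPI_Send(buf, MSG_BYTES, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, MSG_BYTES, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        } else if (rank == 1) {     /* node 1: echo the message back */
            MPI_Recv(buf, MSG_BYTES, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Send(buf, MSG_BYTES, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }
    double rtt_us = (MPI_Wtime() - t0) / REPS * 1e6;

    if (rank == 0)
        printf("1 KB round trip: %.1f us (~250 us on 100 Mb Ethernet + TCP/IP, ~15 us on a SAN)\n",
               rtt_us);
    MPI_Finalize();
    return 0;
}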
Top500 taxonomy… everything is a cluster, aka multicomputer
– Clusters are the ONLY scalable structure.
– Cluster: n inter-connected computer nodes operating as one system. Nodes: uni-processor or SMP. Processor types: scalar or vector.
– MPP = miscellaneous, not massive (>1000), SIMD, or something we couldn't name.
– Cluster types (message passing implied):
  – Constellations = clusters of >=16-processor SMPs
  – Commodity clusters of uni- or <=4-processor SMPs
  – DSM: NUMA (and COMA) SMPs and constellations
  – DMA clusters (direct memory access) vs. message passing
  – Uni- and SMP vector clusters: vector clusters and vector constellations
Courtesy of Dr. Thomas Sterling, Caltech

The Virtuous Economic Cycle drives the PC industry… & Beowulf
(Cycle around standards: attracts suppliers → greater availability at lower cost → attracts users → creates apps, tools, training → and back to attracting suppliers.)

BEOWULF-CLASS SYSTEMS
– Cluster of PCs: Intel x86, DEC Alpha, Mac PowerPC
– Pure M2COTS
– Unix-like O/S with source: Linux, BSD, Solaris
– Message-passing programming model: PVM, MPI, BSP, homebrew remedies
– Single-user environments
– Large science and engineering applications

Lessons from Beowulf
– An experiment in parallel computing systems.
– Established a vision: low-cost, high-end computing.
– Demonstrated the effectiveness of PC clusters for some (not all) classes of applications.
– Provided networking software.
– Provided cluster management tools.
– Conveyed findings to the broad community through tutorials and the book.
– Provided a design standard to rally the community!
– Standards beget books, trained people, software… a virtuous cycle that allowed apps to form.
– An industry begins to form beyond a research project.
Courtesy, Thomas Sterling, Caltech

Designs at chip level… any COTS options?
– Substantially more programmability versus factory compilation.
– As systems move onto chips and chip sets become part of larger systems, electronic design must move from RTL to algorithms.
– Verification and design of "GigaScale systems" will be the challenge.

The Productivity Gap
(Chart, source: SEMATECH: logic transistors per chip grow at a 58%/yr compound rate as feature sizes shrink from 2.5 µm toward 0.35 µm and 0.10 µm, while design productivity in transistors per staff-month grows at only a 21%/yr compound rate.)
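To see what those compound rates imply, the short program below (an illustrative sketch; the year-zero starting values are assumptions, not figures from the talk) compounds the slide's 58%/yr complexity growth against its 21%/yr productivity growth and prints the resulting design effort in staff-months.

/* The productivity gap, compounded: complexity (logic transistors per chip)
 * grows ~58%/year while productivity (transistors per staff-month) grows
 * ~21%/year (SEMATECH figures from the slide).  Starting values below are
 * round illustrative assumptions, not data from the talk. */
#include <stdio.h>
#include <math.h>

int main(void)
{
    double transistors = 10e6;      /* assumed chip complexity in year 0 */
    double per_staff_month = 5e3;   /* assumed designer productivity in year 0 */

    for (int year = 0; year <= 10; year += 2) {
        double chip = transistors * pow(1.58, year);
        double prod = per_staff_month * pow(1.21, year);
        printf("year %2d: %8.1f M transistors, %6.0f per staff-month, %7.0f staff-months\n",
               year, chip / 1e6, prod, chip / prod);
    }
    /* The staff-month column grows by roughly 1.58/1.21 = 1.31x per year:
       that ratio is the gap the slide is describing. */
    return 0;
}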
What Is GigaScale?
– Extremely large gate counts: chips & chip sets; systems & multiple systems.
– High complexity: complex data manipulation; complex dataflow.
– Intense pressure for correct, first-time silicon: time-to-market, cost of failure, etc. impact the ability to have a silicon startup.
– Multiple languages and abstraction levels: design, verification, and software.

EDA Evolution: chips to systems
– 1975: IC designer (Calma & CV), physical design.
– 1985: ASIC designer / chip architect (Daisy, Mentor), gates, 10K-gate simulation.
– 1995: SOC designer / system architect (Synopsys & Cadence), RTL at 1M gates, plus testbench automation, emulation, and formal verification.
– 2005: GigaScale architect (e.g. Forte), hierarchical verification plus GigaScale.
Courtesy of Forte Design Systems

If system-on-a-chip is the answer, what is the problem?
– Small, high-volume products: phones, PDAs; toys & games (to sell batteries); cars; home appliances; TV & video.
– Communication infrastructure.
– Plain old computers… and portables.
– Embeddable computers of all types where performance and/or power are the major constraints.

SOC Alternatives… not including C/C++ CAD tools
– The blank sheet of paper: FPGA.
– Auto design of a processor: Tensilica.
– Standardized, committee-designed components*, cells, and custom IP.
– Standard components including more application-specific processors*, IP add-ons, plus custom.
– One chip does it all: SMOP.
*Processors, memory, communication & memory links

Tradeoffs and Reuse Model
(Diagram: implementation options from structured custom and RTL flows through FPGA, FPGA & ASIP, to DSP and GPP platforms, spanning the system, application, architecture, microarchitecture, and silicon-process levels; custom gives the highest MOPS/mW but low programmability and long time to develop/iterate a new application, while the programmable platforms reverse that tradeoff; applications bind to the platform through COM-style interfaces such as IUnknown, IOleObject, IDataObject, IPersistentStorage, and IOleDocument.)

System-on-a-chip alternatives
– FPGA: sea of uncommitted gate arrays; dynamic reconfiguration of the entire chip (Xilinx, Altera; Xilinx: 10M gates, 500M transistors, 0.12 micron).
– Compile a system: a unique processor for every app (Tensilica).
– Systolic array: many pipelined or parallel processors plus custom logic.
– Special-purpose processor cores plus custom: Pc + DSP | VLIW (TI).
– General-purpose cores specialized by I/O, etc.: Pc & Mp, ASICs (IBM, Intel, Lucent).
– Universal multiprocessor array with microprogrammable I/O (Cradle, Intel IXP 1200).

Tensilica Approach: compiled processor plus development tools
– Describe the processor attributes from a browser-like interface (ALU, pipe, I/O, cache, timer, register file, MMU: a tailored HDL µP core).
– Using the processor generator, create a customized compiler, assembler, linker, debugger, and simulator, plus a standard-cell library targeted to the silicon process.
Courtesy of Tensilica, Inc., http://www.tensilica.com
Richard Newton, UC/Berkeley

EEMBC Networking Benchmark
– Benchmarks: OSPF, route lookup, packet flow.
– Xtensa with no optimization is comparable to 64-bit RISCs.
– Xtensa with optimization is comparable to high-end desktop CPUs.
– Xtensa has outstanding efficiency (performance per cycle, per watt, per mm²).
– Xtensa optimizations: custom instructions for route lookup and packet flow.
(Chart: Netmark performance relative to the IDT 32334/100 (MIPS32) and performance per MHz, for Xtensa/200 and Xtensa Optimized/200 against 32-bit RISCs (IDT, NEC, Toshiba), 64-bit RISCs, and desktop x86s (AMD K6 family, AMD ElanSC520). Colors: blue Xtensa, green desktop x86s, maroon 64-bit RISCs, orange 32-bit RISCs.)

EEMBC Consumer Benchmark
– Benchmarks: JPEG, grey-scale filter, color-space conversion.
– Xtensa with no optimization is comparable to 64-bit RISCs.
– Xtensa with optimization beats all processors by 6x (with no JPEG optimization).
– Xtensa has exceptional efficiency (performance per cycle, per watt, per mm²).
– Xtensa optimizations: custom instructions for filters, RGB-YIQ, and RGB-CMYK conversion.
(Chart: Consumermark performance relative to the ST20C2/50 and performance per MHz, for Xtensa/200 and Xtensa Optimized/200 against NEC, National Geode, and AMD parts. Colors: blue Xtensa, green desktop x86s, maroon 64-bit RISCs, orange 32-bit RISCs.)
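The custom instructions credited above are easiest to picture on the color-space kernels. Below is a generic fixed-point RGB-to-luma loop of the sort the consumer suite exercises; the comment marks the weighted sum that a configurable core in the Tensilica style could collapse into a single application-specific instruction. This is an illustration in plain C, not actual Xtensa or TIE source.

/* RGB -> Y (luma) conversion, the inner loop of an RGB-to-YIQ color-space
 * transform.  On a stock RISC this is three multiplies, two adds, and a
 * shift per pixel; a configurable core could add one custom instruction
 * that performs the whole weighted sum.  Generic illustration only. */
#include <stdint.h>
#include <stdio.h>

/* Fixed-point weights: Y = 0.299 R + 0.587 G + 0.114 B, scaled by 2^8. */
enum { WR = 77, WG = 150, WB = 29 };

static void rgb_to_luma(const uint8_t *rgb, uint8_t *y, int pixels)
{
    for (int i = 0; i < pixels; i++) {
        /* Candidate for a single application-specific instruction. */
        y[i] = (uint8_t)((WR * rgb[3*i] + WG * rgb[3*i + 1] + WB * rgb[3*i + 2]) >> 8);
    }
}

int main(void)
{
    const uint8_t rgb[6] = { 255, 0, 0,  0, 255, 0 };   /* one red, one green pixel */
    uint8_t y[2];
    rgb_to_luma(rgb, y, 2);
    printf("luma: %u %u\n", (unsigned)y[0], (unsigned)y[1]);   /* ~76 and ~149 */
    return 0;
}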
Free 32-bit processor core

Complex SOC architecture
Synopsys, via Richard Newton, UC/Berkeley

UMS Architecture
(Diagram: an array of quads of M/S/P elements with per-quad memory, surrounded by programmable I/O blocks, DRAM control, clocks and debug, NVMEM, and DRAM.)
– Memory bandwidth scales with processing.
– Scalable processing, software, and I/O.
– Each app runs on its own pool of processors.
– Enables durable, portable intellectual property.

Cradle UMS Design Goals
– Minimize design time for applications: efficient programming model; high reusability accelerates derivative development.
– Cost/performance: replace ASICs, FPGAs, ASSPs, and DSPs; low power for battery-powered appliances.
– Flexibility: a cost-effective solution to address fragmenting markets; faster return on R&D investments.

Universal Microsystem (UMS)
(Diagram: Quads 1..n plus I/O quads on a global bus, with SDRAM control and a PLA ring.)
– Each quad has 4 RISCs, 8 DSPs, and memory.
– A unique I/O subsystem keeps interfaces soft.

The Universal Micro System (UMS): an off-the-shelf "platform" for product-line solutions
(Diagram: quads of memory and multi-stream processors around a global bus and I/O bus, with DRAM control, clocks, NVMEM, and programmable I/O.)
– Intelligent I/O subsystem: change interfaces without changing chips.
– Multi Stream Processor (MSP): one PE and two DSEs sharing program memory, data memory, and DMA; 750 MIPS/MFLOPS.
– Superior digital signal processing (single-clock FP-MAC).
– Local memory that scales with additional processors.
– Scalable real-time functions in software using small, fast processors (the quad).

VPN Enterprise Gateway
(Diagram: quads assigned to functions: Quad 1: firewall/tunneling, layer-2 switching, IP stack, TCP/IP, IP layer-3 routing, IKE; Quads 2-3: 3DES, IPSec; Quads 4-5: VoIP, LAN telephony; plus the operating system, 10/100 Ethernet MACs and PHYs, and a T1/E1/J1 interface.)
– Single quad: two 10/100 Ethernet ports at wire speed; one T1/E1/J1 interface; handles 250 end users and 100 routes; does key handling for IPSec; delivers 50 Mbps of 3DES.
– Five quads: two 10/100 Ethernet ports at wire speed; one T1/E1/J1 interface; handles 250 end users and 100 routes; does key handling for IPSec; delivers 100 Mbps of 3DES; adds firewall, IP telephony, and an O/S for user interactions.

UMS Application Performance
– Applications: MPEG video decode (720x480, 9 Mbits/s); MPEG video encode (720x480, 15 Mbits/s, 32²/128² search area); AC3 audio decode; modems (V.90, G.Lite ADSL); Ethernet router, layer 3 + QoS (per 100 Mb channel, per Gigabit channel); encryption (3DES at 15 Mb/s, MD5 at 425 Mb/s); 3D geometry, lighting, rendering (1.6M polygons/s); DV encode/decode (camcorder).
– MSPs required per application range from 0.5 to 10-16; the per-row counts on the slide are 4, 6, 10-16, 1, 0.5, 3, 4, 0.5, 4, 1, 1, 4, 8.
– The architecture permits scalable software.
– Supports two Gigabit Ethernets at wire speed; four Fast Ethernets; four T-1s; USB, PCI, 1394, etc.
– An MSP is a logical unit of one PE and two DSEs.
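A quick arithmetic check of the Cradle figures quoted on these slides (750 MFLOPS per MSP, four MSPs per quad of 4 RISCs + 8 DSPs, five quads per part), written as a small C sketch; it reproduces the 3 Gflops/quad and 15 Gflops totals claimed on the next slide.

/* Back-of-the-envelope check of the Cradle UMS throughput figures quoted on
 * these slides: an MSP (one PE + two DSEs) is rated at 750 MFLOPS, a quad
 * holds 4 MSPs (4 RISCs + 8 DSPs), and the part described next has 5 quads. */
#include <stdio.h>

int main(void)
{
    const double mflops_per_msp = 750.0;   /* per Multi Stream Processor */
    const int msps_per_quad = 4;           /* 4 PEs + 8 DSEs = 4 MSPs */
    const int quads = 5;

    double per_quad = mflops_per_msp * msps_per_quad / 1000.0;   /* GFLOPS */
    double total = per_quad * quads;

    printf("%.1f GFLOPS per quad, %.1f GFLOPS for %d quads\n", per_quad, total, quads);
    /* Prints 3.0 GFLOPS per quad and 15.0 GFLOPS total, matching the
       "3 Gflops/quad" and "15 Gflops" figures on the following slide. */
    return 0;
}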
Cradle: Universal Microsystem, trading Verilog & hardware for C/C++
– UMS : VLSI = microprocessor : special systems = software : hardware.
– A single part for all apps; the app is specified at run time using FPGA & ROM.
– 5 quad mPs at 3 Gflops/quad = 15 Gflops; 2.5 Gips; 1 GB/s.
– Single shared memory space, caches.
– Programmable periphery including PCI, 100baseT, FireWire.
– $4 per Gflops; 150 mW/Gflops.

Silicon Landscape 200x
– Increasing cost of fabrication and masks: $7M for a high-end ASSP chip design; over $650K for masks alone and rising; SOC/ASIC companies require a $7-10M business guarantee.
– Physical effects (parasitics, reliability issues, power management) are more significant design issues; these must now be considered explicitly at the circuit level.
– Design complexity and "context complexity" are sufficiently high that design verification is a major limitation on time-to-market.
– Fewer design starts and higher design volume imply more programmable platforms.
Richard Newton, UC/Berkeley

The End

General-Purpose Computing vs. Platform-Based Design
(Diagram: general-purpose computing maps applications onto an instruction set architecture, e.g. 360, SPARC, 3000, over a physical implementation; platform-based design maps applications onto a platform; the ASIC/FPGA flow maps applications through synthesizable RTL and microarchitecture & software (Verilog, VHDL, …) to the physical implementation.)

The Energy-Flexibility Gap
– Dedicated hardware (e.g. MUD): 100-200 MOPS/mW.
– Reconfigurable processor/logic (Pleiades): 10-50 MOPS/mW.
– ASIPs and DSPs (1 V DSP): ~3 MOPS/mW.
– Embedded microprocessors (low-power ARM): 0.5-2 MIPS/mW.
Energy efficiency falls as flexibility (coverage) rises.
Source: Prof. Jan Rabaey, UC Berkeley

Approaches to Reuse
– SOC as the assembly of components? (Alberto Sangiovanni-Vincentelli)
– SOC as a programmable platform? (Kurt Keutzer)

Component-Based Programmable Platform Approach
– Application-Specific Programmable Platforms (ASPPs).
– These platforms will be highly programmable.
– They will implement highly concurrent functionality.
– An intermediate language that exposes the programmability of all aspects of the microarchitecture (an assembly language for the processor).
– Integrate using a programmable approach to on-chip communication.
– Assemble components from a parameterized library.
Richard Newton, UC/Berkeley

Compact Synthesized Processor, Including Software Development Environment
– Uses virtually any standard-cell library with commercial memory generators.
– Base implementation is less than 25K gates (~1.0 mm² in 0.25 µm CMOS).
– Power dissipation in 0.25 µm standard cells is less than 0.5 mW/MHz.
– Small enough to scale onto a typical $10 IC (3-6% of 60 mm²).
Courtesy of Tensilica, Inc., http://www.tensilica.com

Challenges of Programmability for Consumer Applications
– Power, power, power….
– Performance, performance, performance….
– Cost.
– Can we develop approaches to programming silicon and its integration, along with the tools and methodologies to support them, that will allow us to approach the power and performance of a dedicated solution sufficiently closely (~2-4x?) that a programmable platform is the preferred choice?
Richard Newton, UC/Berkeley
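To make the "~2-4x of a dedicated solution" question concrete, the sketch below takes the rough MOPS/mW points from the Energy-Flexibility Gap slide and computes the power each approach would burn on the same workload; the 100-MOPS load is an assumed, illustrative figure, not one from the talk.

/* Power needed for a fixed workload at the efficiency points quoted on the
 * Energy-Flexibility Gap slide (MOPS/mW).  The 100-MOPS workload is an
 * illustrative assumption; the efficiencies are the slide's rough figures. */
#include <stdio.h>

struct point { const char *approach; double mops_per_mw; };

int main(void)
{
    const double workload_mops = 100.0;               /* assumed signal-processing load */
    const struct point pts[] = {
        { "dedicated hardware",           150.0 },    /* 100-200 MOPS/mW */
        { "reconfigurable (Pleiades)",     30.0 },    /* 10-50  MOPS/mW */
        { "DSP (1 V)",                      3.0 },
        { "embedded uP (low-power ARM)",    1.0 },    /* 0.5-2  MIPS/mW */
    };

    for (unsigned i = 0; i < sizeof pts / sizeof pts[0]; i++)
        printf("%-30s %8.1f mW\n", pts[i].approach, workload_mops / pts[i].mops_per_mw);
    /* About two orders of magnitude across these points; the deck's
       "Power as the Driver" chart puts the full span, out to a
       general-purpose Pentium, at four orders of magnitude. */
    return 0;
}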
Bottom Line: Programmable Platforms
– The challenge is finding the right programmer's model and associated family of microarchitectures:
  – Successful platform developers must "own" the software development environment and the associated kernel-level run-time environment.
  – Address a wide enough range of applications efficiently (performance, power, etc.).
– "It's all about concurrency."
– If you could develop a very efficient and reliable reprogrammable logic technology (comparable to ASIC densities), you would eventually own the silicon industry!
Richard Newton, UC/Berkeley

Approaches to Reuse
– SOC as the assembly of components? (Alberto Sangiovanni-Vincentelli)
– SOC as a programmable platform? (Kurt Keutzer)
Richard Newton, UC/Berkeley

A Component-Based Approach…
– Simple Universal Protocol (SUP):
  – Unix pipes (character streams only)
  – TCP/IP (only one type of packet; limited options)
  – RS232, PCI
  – Streaming…
– Single-Owner Protocol (SOP):
  – Visual Basic
  – Unibus, Massbus, Sbus
– Simple Interfaces, Complex Application (SIC): when "the spec is much simpler than the code"* you aren't tempted to rewrite it: SQL, SAP, etc.
– Implies "natural" boundaries to partition IP; successful components will be aligned with those boundaries.
(*suggested by Butler Lampson)

The Key Elements of the SOC
– Applications; RF; MEMS; optical; ASIP.
– What is the platform, aka the programmer's model?
Richard Newton, UC/Berkeley

Power as the Driver (power is still, almost always, the driver!)
(Chart: MIPS/mW spanning four orders of magnitude, from a Pentium in 0.35 µm, to a StrongARM in 0.35 µm, to a TI DSP in 0.25 µm, to dedicated logic in 1 µm. Source: R. Brodersen, UC Berkeley)

Back end

Computer ops/sec x word length / $
(Chart, 1880-2000 on a log scale: the doubling time improves from every 7.5 years, to every 2.3 years, to every 1.0 year; trendline fits shown include y = 1.565^(t-1959.4) and y = 1E-248·e^(0.2918x).)

Microprocessor performance
(Chart, 1970-2010, kilo to 100 giga: Peak Advertised Performance (PAP) versus Real Applied Performance (RAP), growing at about 41% per year alongside Moore's Law.)

GigaScale Evolution
– In 1999, fewer than 3% of engineers were doing designs with more than 10M transistors per chip. (Dataquest)
– By early 2002, 0.1 micron will allow 600M transistors per chip. (Dataquest)
– In 2001, 49% of engineers were at 0.18 micron and 5% at 0.10 micron. (EE Times)
– 54% plan to be at 0.10 micron in 2003. (EE Times)

Challenges of GigaScale
– GigaScale systems are too big to simulate: hierarchical verification; distributed verification.
– They require a higher level of abstraction:
  – Higher abstraction needed for verification: high-level modeling; transaction-based verification.
  – Higher abstraction needed for design: high-level synthesis is required for a productivity breakthrough.
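One way to picture the "higher level of abstraction" argument is a transaction-level model: the testbench below (a minimal, self-contained C sketch with made-up names, not any particular tool's API) exercises a behavioral memory through whole read/write transactions, each a single function call, rather than driving bus signals cycle by cycle as an RTL simulation would.

/* Transaction-level modeling in miniature: the testbench exercises a memory
 * model through whole read/write transactions (one function call each)
 * instead of driving address/data/strobe signals cycle by cycle.
 * Names and structure are illustrative only. */
#include <stdint.h>
#include <stdio.h>
#include <assert.h>

#define MEM_WORDS 1024

static uint32_t mem[MEM_WORDS];                /* behavioral memory model */

/* One write transaction: what RTL would spread over several bus cycles. */
static void bus_write(uint32_t addr, uint32_t data) { mem[addr % MEM_WORDS] = data; }

/* One read transaction. */
static uint32_t bus_read(uint32_t addr) { return mem[addr % MEM_WORDS]; }

int main(void)
{
    /* Testbench: issue transactions and check results at the transaction
       level; millions of these run far faster than cycle-accurate RTL. */
    for (uint32_t a = 0; a < MEM_WORDS; a++)
        bus_write(a, a ^ 0xdeadbeefu);
    for (uint32_t a = 0; a < MEM_WORDS; a++)
        assert(bus_read(a) == (a ^ 0xdeadbeefu));
    puts("transaction-level testbench passed");
    return 0;
}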