Conference Overview: ISCA 2020 (47th)

Starting a new series: skim the abstracts from the major conferences, and read the introductions of the papers that look interesting, as a way to pick up new ideas quickly.

Industry

《Data Compression Accelerator on IBM POWER9 and z15 Processors : Industrial Product》, IBM Research / IBM Systems

  • Hardware compression improves I/O and network performance and reduces storage and memory footprint
  • An on-chip compression accelerator: 0.5% of the die area buys a 388x speedup on the compression work itself (POWER9) and a 23% end-to-end performance gain
  • The emphasis is on the trade-offs
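The win the paper quantifies comes from moving fewer bytes across the I/O path. A toy run-length encoder (my own illustration; the IBM accelerators implement gzip-class Deflate, not RLE) makes the ratio concrete:

```c
#include <assert.h>
#include <stddef.h>

/* Toy run-length encoder: emits (count, byte) pairs. A hardware Deflate
 * engine is far more sophisticated, but the I/O effect is the same:
 * fewer bytes cross the storage/network path, at near-zero CPU cost
 * when the work is offloaded to an on-chip unit. */
size_t rle_encode(const unsigned char *in, size_t n, unsigned char *out) {
    size_t o = 0;
    for (size_t i = 0; i < n; ) {
        size_t run = 1;
        while (i + run < n && in[i + run] == in[i] && run < 255)
            run++;
        out[o++] = (unsigned char)run;  /* run length */
        out[o++] = in[i];               /* repeated byte */
        i += run;
    }
    return o;  /* compressed size in bytes */
}
```

Compressing `aaaabbb` (7 bytes) yields 4 bytes of output; on compressible pages the accelerator banks the same kind of ratio without spending CPU cycles.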

《High-Performance Deep-Learning Coprocessor Integrated into x86 SoC with Server-Class CPUs : Industrial Product》, Centaur Technology (an x86 CPU design company, founded in 1995, based in Texas, USA)

  • A DL coprocessor (NCORE) integrated into an x86 SoC with server-class CPUs
  • int8, uint8, int16, and bf16; 20 Tops/s
  • MLPerf Inference v0.5: 1218 IPS at 1.05 ms latency on ResNet-50 v1.5; 0.329 ms on MobileNet-V1

《The IBM z15 High Frequency Mainframe Branch Predictor : Industrial Product》, IBM Systems group

  • A multi-level look-ahead structure
  • Predicts both branch directions and target addresses; enhanced with multiple auxiliary direction, target, and power predictors
  • Optimized for the specific workloads of enterprise-class systems

《Evolution of the Samsung Exynos CPU Microarchitecture》, authors from SiFive / Centaur / independent consultants / ARM / Texas A&M University / AMD / Nuvia / Goodix

  • Walks through the design changes across Samsung's Exynos family from M1 to M6
  • Covers perceptron-based branch prediction, Spectre v2 security enhancements, micro-operation cache algorithms, prefetcher advancements, and memory latency optimizations

《Xuantie-910: A Commercial Multi-Core 12-Stage Pipeline Out-of-Order 64-bit High Performance RISC-V Processor with Vector Extension : Industrial Product》,Alibaba T-head division

  • Based on the RISC-V RV64GCV ISA, plus custom extensions for arithmetic, bit manipulation, load/store, and TLB and cache operations, and the RISC-V 0.7.1 vector extension
  • Supports multi-core, multi-cluster SMP (symmetric multiprocessing) with cache coherence; 12-stage pipeline, out-of-order, multi-issue superscalar; 2.5 GHz in 12 nm, 0.8 mm^2 per core
  • Software and toolchain co-optimization; the best-performing RISC-V implementation to date, and trades blows with ARM

CPU based

《Divide and Conquer Frontend Bottleneck》, Sharif University of Technology ("the MIT of Iran"), Ali Ansari, Hamid Sarbazi-Azad

  • Frontend stalls caused by instruction-cache and BTB (branch target buffer) misses are substantial, and existing prefetchers fall short
    • The instruction miss penalty is far larger than that of a BTB miss
  • Divides the frontend bottleneck into three classes and conquers each separately:
    • sequential misses: SN4L
    • discontinuity misses: Dis
    • BTB misses: pre-decoding the prefetched blocks
  • ~5% improvement
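The sequential/discontinuity split hinges on one question: did the missing fetch block immediately follow the previously fetched block? A sketch of that classification (block numbering and the enum names are my own illustration, not the paper's implementation):

```c
#include <assert.h>

typedef enum { MISS_SEQUENTIAL, MISS_DISCONTINUITY } miss_kind;

/* A miss on the block right after the previously fetched block is a
 * sequential miss (what SN4L targets); a miss reached by jumping
 * elsewhere -- a taken branch or call landing in a cold region -- is a
 * discontinuity miss (what Dis targets). BTB misses are the third
 * class, which the paper handles by pre-decoding prefetched blocks to
 * recover branch targets before the BTB is consulted. */
miss_kind classify_miss(unsigned long prev_block, unsigned long miss_block) {
    return (miss_block == prev_block + 1) ? MISS_SEQUENTIAL
                                          : MISS_DISCONTINUITY;
}
```

For example, a miss on block 11 after fetching block 10 is sequential, while a miss on block 42 after block 10 is a discontinuity; each class gets its own dedicated prefetcher.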

《Auto-Predication of Critical Branches*》, Intel Labs (Bengaluru, India and Haifa, Israel)

  • Hard-to-predict (H2P) branches and mis-speculation limit the scalability of branch prediction. Predication (fetching both sides of a branch) replaces the control dependence with a data dependence, which mitigates the problem but can reduce instruction-level parallelism.
  • Analyzes the prediction-versus-predication trade-off and proposes ACB, which automatically enables or disables predication depending on whether a branch is critical to performance.
  • Uses sophisticated performance monitoring; ~8% improvement
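ACB does its predication in hardware, but software if-conversion shows the same control-to-data-dependence trade: the predicated loop has no branch to mispredict, yet it pays for both outcomes on every iteration. (The functions below are my own illustration of the general trade-off, not Intel's mechanism.)

```c
#include <assert.h>
#include <stddef.h>

/* Branchy form: a data-dependent, potentially hard-to-predict branch. */
long sum_above_branchy(const int *a, size_t n, int t) {
    long s = 0;
    for (size_t i = 0; i < n; i++) {
        if (a[i] > t)         /* control dependence on the data */
            s += a[i];
    }
    return s;
}

/* If-converted form: the branch becomes a data dependence (a 0/1
 * predicate), so there is nothing to mispredict -- but the multiply
 * and add now sit on the critical path of every iteration, which is
 * why blindly predicating everything can hurt ILP. */
long sum_above_predicated(const int *a, size_t n, int t) {
    long s = 0;
    for (size_t i = 0; i < n; i++) {
        long keep = (a[i] > t);   /* the predicate */
        s += keep * a[i];
    }
    return s;
}
```

Both forms compute the same sum; ACB's contribution is deciding, per branch and at runtime, which form is worth it.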

《Slipstream Processors Revisited: Exploiting Branch Sets》

《Bouquet of Instruction Pointers: Instruction Pointer Classifier-based Spatial Hardware Prefetching》

《Focused Value Prediction*》,Intel Labs(Bengaluru, India/Haifa, Israel)

《Flick: Fast and Lightweight ISA-Crossing Call for Heterogeneous-ISA Environments》

《Efficiently Supporting Dynamic Task Parallelism on Heterogeneous Cache-Coherent Systems》

《T4: Compiling Sequential Code for Effective Speculative Parallelization in Hardware》

Accelerators

《Think Fast: A Tensor Streaming Processor (TSP) for Accelerating Deep Learning Workloads》, Groq Inc. (Mountain View, California)

  • Memory units are interspersed with the vector/matrix compute units to exploit dataflow locality
  • Two observations: (1) the data parallelism in machine learning workloads maps naturally onto hardware tensors; (2) a stream programming model lets software precisely understand and control the hardware units, yielding better performance
  • The TSP exploits parallelism at the instruction level, in memory concurrency, and across data and model parallelism, while guaranteeing determinism by eliminating all reactive hardware elements (arbiters, caches)
  • 20.4K IPS on ResNet-50
  • Functional slicing: local functional homogeneity but chip-wide (global) heterogeneity; each tile implements one specific function, tiles of the same function are stacked into vertical strips, data flows horizontally across them, and instructions are issued to each slice independently
  • Parallel lanes and streams: streams provide a programming abstraction and are the conduit through which data flows between functional slices
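A stream, in this model, is data marching left-to-right past a fixed column of functional slices, each applying its one operation to every lane as the data passes. A toy software analogue (the slice functions and the two-stage pipeline are my own example, not Groq's ISA):

```c
#include <assert.h>
#include <stddef.h>

#define LANES 4   /* parallel lanes: each element is an independent lane */

typedef void (*slice_fn)(int stream[LANES]);

/* Two example "functional slices": each is homogeneous internally
 * (one operation applied to every lane), and the chip-wide mix of
 * different slices is what provides the heterogeneity. */
static void scale_slice(int stream[LANES]) {
    for (int i = 0; i < LANES; i++) stream[i] *= 2;
}
static void bias_slice(int stream[LANES]) {
    for (int i = 0; i < LANES; i++) stream[i] += 1;
}

/* The stream flows horizontally through the slices in a fixed order;
 * with no arbiters or caches, timing is fully determined by the
 * instruction schedule -- the source of the TSP's determinism. */
void run_stream(int stream[LANES], slice_fn *slices, size_t n_slices) {
    for (size_t s = 0; s < n_slices; s++)
        slices[s](stream);
}
```

Running a 4-lane stream {1, 2, 3, 4} through scale then bias yields {3, 5, 7, 9}: every lane sees the same slice sequence, in the same order, every time.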

《Genesis: A Hardware Acceleration Framework for Genomic Data Analysis》

《DSAGEN: Synthesizing Programmable Spatial Accelerators》

《Bonsai: High-Performance Adaptive Merge Tree Sorting》

《Gorgon: Accelerating Machine Learning from Relational Data》


《A Specialized Architecture for Object Serialization with Applications to Big Data Analytics》

《SpinalFlow: An Architecture and Dataflow Tailored for Spiking Neural Networks》, University of Utah, Surya Narayanan, Pierre-Emmanuel Gaillardon

  • An SNN dataflow has to track neuron potentials across multiple ticks, which brings new data structures and new access patterns
  • Proposes SpinalFlow, which consumes and produces compressed, time-stamped, sorted spike sequences; each neuron performs its sequence of computations in one pass, reducing the storage overhead for potentials and improving data reuse
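With a time-sorted, compressed spike list, a neuron can fold its whole input sequence into one running potential and stop at the first threshold crossing, instead of storing a dense per-tick potential grid. A minimal sketch (the weights, threshold, and no-leak integrate-and-fire rule here are simplified assumptions, not SpinalFlow's exact neuron model):

```c
#include <assert.h>
#include <stddef.h>

/* times[]   : spike timestamps, sorted ascending (the compressed input)
 * weights[] : synaptic weight contributed by each input spike
 * Returns the tick of the neuron's first output spike, or -1 if the
 * membrane potential never reaches the threshold. Only one running
 * potential is kept, not one value per tick -- the storage saving the
 * paper is after. */
int first_spike_time(const int *times, const double *weights, size_t n,
                     double threshold) {
    double potential = 0.0;
    for (size_t i = 0; i < n; i++) {
        potential += weights[i];         /* integrate in time order */
        if (potential >= threshold)
            return times[i];             /* fire, and stop early */
    }
    return -1;
}
```

Because the input is sorted by time, the output spike time is known the moment the threshold is crossed, so downstream neurons can also consume a sorted stream.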

《NEBULA: A Neuromorphic Spin-Based Ultra-Low Power Architecture for SNNs and ANNs》

Security

《MuonTrap: Preventing Cross-Domain Spectre-Like Attacks by Capturing Speculative State》

System-Level

《SysScale: Exploiting Multi-domain Dynamic Voltage and Frequency Scaling for Energy Efficient Mobile Processors》

《The NEBULA RPC-Optimized Architecture》

《CryoCore: A Fast and Dense Processor Architecture for Cryogenic Computing》

《Heat to Power: Thermal Energy Harvesting and Recycling for Warm Water-Cooled Datacenters》

Others

《Printed Microprocessors》

《Déjà View: Spatio-Temporal Compute Reuse for Energy-Efficient 360° VR Video Streaming》

《SOFF: An OpenCL High-Level Synthesis Framework for FPGAs》

《Hardware-Software Co-Design for Brain-Computer Interfaces》
