Graduate student Jiangqiu Shen (PhD in Computer Science) will present their dissertation proposal defense on Monday, December 2, at 11 am in Rekhi 101 and via Zoom. The title of Shen’s defense is, “Efficient Processing-in-Memory Accelerator Designs for LLM Generative Inference.”
Defense Abstract
Large Language Models (LLMs) have seen widespread adoption across domains including natural language understanding, content generation, and decision support systems. Models such as GPT-4 and LLaMA demonstrate remarkable capabilities across diverse applications. LLM inference workloads exhibit two distinct computational patterns: compute-intensive GEMM operations for the projections and FFN, and memory-bandwidth-intensive GEMV operations for the attention computation. Executing attention on a GPU/NPU is inefficient due to the limited memory bandwidth available on those devices. To address this issue, Processing-in-Memory (PIM) was recently proposed to offload attention to PIM units. The state-of-the-art NeuPIMs architecture introduces a double row buffer to enable parallel execution of NPU memory accesses and PIM computations.
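The GEMM/GEMV distinction above can be illustrated with a minimal NumPy sketch of one decode step (the shapes `d` and `n` are illustrative assumptions, not values from the abstract): the projection multiplies the token against a large, reusable weight matrix, while attention streams the entire per-request KV cache once per generated token, which is why it is bandwidth-bound.

```python
import numpy as np

# Illustrative shapes (assumptions): hidden size d, context length n,
# a single new token during generative decoding.
d, n = 4096, 2048
x = np.random.randn(1, d)          # the token being generated

# Projection/FFN step: GEMM against a large weight matrix.
# Weights are reused across batched tokens -> compute-intensive.
W_q = np.random.randn(d, d)
q = x @ W_q                        # (1, d)

# Attention step: GEMV against the per-request KV cache.
# Every cache entry is read once per token -> bandwidth-intensive.
K = np.random.randn(n, d)          # key cache
V = np.random.randn(n, d)          # value cache
scores = (K @ q.T) / np.sqrt(d)    # (n, 1) GEMV over the key cache
weights = np.exp(scores - scores.max())
weights /= weights.sum()           # softmax over context positions
out = weights.T @ V                # (1, d) weighted sum over the value cache
```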
First, we identify a limitation of NeuPIMs: it appends one token’s value to the value cache inefficiently, accessing a significant number of DRAM rows. To address this challenge, we propose a row-wise data layout for the value cache that allows appending one token’s value while accessing only one DRAM row. To perform attention on the new data layout, we design a new PIM architecture that leverages dual row buffers and enhanced ALU units.
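The layout argument can be sketched by counting distinct DRAM rows touched when one token’s value vector is appended (all sizes below are assumptions chosen so a token’s vector fits one row, not parameters from the proposal): a dimension-major layout strides the elements far apart, while a row-wise (token-major) layout keeps them contiguous.

```python
# Hedged sketch with assumed geometry: ROW_BYTES, ELEM_BYTES, D, and
# MAX_TOKENS are illustrative, not taken from the proposal.
ROW_BYTES = 1024          # DRAM row size (assumption)
ELEM_BYTES = 2            # fp16 element
D = 512                   # hidden dimension of one token's value vector
MAX_TOKENS = 512          # cache capacity in tokens

def rows_touched(addresses):
    """Count distinct DRAM rows covered by a set of byte addresses."""
    return len({addr // ROW_BYTES for addr in addresses})

def append_dim_major(token_idx):
    # Dimension-major: element i of every token is stored contiguously,
    # so one token's D elements are strided MAX_TOKENS entries apart.
    addrs = [(i * MAX_TOKENS + token_idx) * ELEM_BYTES for i in range(D)]
    return rows_touched(addrs)

def append_token_major(token_idx):
    # Row-wise (token-major): a token's D elements are contiguous,
    # so the append stays inside a single DRAM row here.
    base = token_idx * D * ELEM_BYTES
    addrs = [base + i * ELEM_BYTES for i in range(D)]
    return rows_touched(addrs)

print(append_dim_major(0))    # one row activation per element: 512 rows
print(append_token_major(0))  # the whole vector lands in 1 row
```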
Second, we observe that the NPU’s workload is higher than the PIM’s, and the generation time of one token is dominated by NPU execution time, leading to low utilization of the PIM units. To address this issue, we propose cooperative computing between the NPU and PIM, offloading some NPU workload to the PIM when the PIM is idle. To support NPU workload offloading efficiently, we design a new weight storage scheme.
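One way to picture the cooperative split is a toy cost model (entirely an assumption; the proposal’s actual scheduling policy is not described here) that divides an NPU-side workload between the NPU and an idle PIM so their finish times balance:

```python
# Minimal sketch, assuming a linear cost model: rates are rows per unit
# time, and pim_busy_until is when the PIM becomes idle. Hypothetical
# helper, not the proposal's scheduler.
def split_workload(total_rows, npu_rate, pim_rate, pim_busy_until=0.0):
    """Return (npu_rows, pim_rows) minimizing the later finish time."""
    best = (total_rows, 0)
    best_finish = total_rows / npu_rate
    for pim_rows in range(total_rows + 1):
        npu_rows = total_rows - pim_rows
        finish = max(npu_rows / npu_rate,
                     pim_busy_until + pim_rows / pim_rate)
        if finish < best_finish:
            best_finish = finish
            best = (npu_rows, pim_rows)
    return best

# Example: NPU is 4x faster than PIM and the PIM is idle immediately,
# so offloading 1/5 of the rows equalizes the finish times.
npu_rows, pim_rows = split_workload(100, npu_rate=4.0, pim_rate=1.0)
print(npu_rows, pim_rows)
```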
Third, when the NPU and PIM execute at the same time, they suffer a performance slowdown due to memory access conflicts. To mitigate this contention, we propose a context-aware memory controller that actively monitors the NPU and PIM workloads and issues PIM operations when it predicts that the NPU has no pending memory accesses.
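The controller’s behavior can be sketched as a simple cycle-level model (all class and method names here are hypothetical, and the prediction heuristic is a stand-in assumption): queued PIM operations are issued only when no NPU memory request is in flight and the NPU is predicted to be in a compute-only window.

```python
from collections import deque

class ContextAwareMemoryController:
    """Toy model: issue PIM ops only in predicted NPU-idle memory windows."""

    def __init__(self):
        self.pim_queue = deque()
        self.npu_outstanding = 0      # in-flight NPU memory requests
        self.npu_compute_cycles = 0   # predicted NPU compute-only window

    def npu_issue(self, compute_cycles_after=0):
        # NPU issues a memory request, then computes for a predicted window.
        self.npu_outstanding += 1
        self.npu_compute_cycles = compute_cycles_after

    def npu_complete(self):
        self.npu_outstanding -= 1

    def enqueue_pim(self, op):
        self.pim_queue.append(op)

    def tick(self):
        """Advance one cycle; return the PIM op issued this cycle, if any."""
        npu_off_bus = (self.npu_outstanding == 0
                       and self.npu_compute_cycles > 0)
        issued = None
        if npu_off_bus and self.pim_queue:
            issued = self.pim_queue.popleft()
        if self.npu_compute_cycles > 0:
            self.npu_compute_cycles -= 1
        return issued

# Usage: a PIM op waits while an NPU request is in flight, then issues
# once the NPU request completes and the NPU is busy computing.
mc = ContextAwareMemoryController()
mc.enqueue_pim("attention_gemv_0")
mc.npu_issue(compute_cycles_after=3)
first = mc.tick()    # NPU request still outstanding -> nothing issued
mc.npu_complete()
second = mc.tick()   # NPU off the bus, computing -> PIM op issued
```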