FPGA Co-Design for Efficient LLM Inference with Sparsity and Quantization
Published: Dec 31, 2025 · 1 min read · ArXiv
Analysis
This paper addresses the challenge of deploying large language models (LLMs) in resource-constrained environments by proposing a hardware-software co-design approach on FPGA. The core contribution is an automated framework that combines weight pruning (N:M structured sparsity) with low-bit quantization to shrink the memory footprint and accelerate inference. On 4096×4096 matrices with 2:4 sparsity and 4-bit weights, the authors report up to a 4× reduction in weight storage, a 1.71× matrix-multiplication speedup, and a 1.29× end-to-end latency reduction over dense GPU baselines. The FPGA accelerator also offers the flexibility to support a variety of sparsity patterns.
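To make the two techniques concrete, here is a minimal NumPy sketch of N:M magnitude pruning followed by symmetric 4-bit quantization. This is an illustration of the general recipe, not the paper's implementation; the function names, the per-tensor scale, and the magnitude-based selection rule are all assumptions for the example.

```python
import numpy as np

def prune_n_m(weights: np.ndarray, n: int = 2, m: int = 4) -> np.ndarray:
    """N:M structured sparsity: in every group of m consecutive weights
    along the last axis, keep the n largest-magnitude entries, zero the rest."""
    rows, cols = weights.shape
    assert cols % m == 0, "columns must be divisible by the group size m"
    groups = weights.reshape(rows, cols // m, m)
    # Indices of the (m - n) smallest-magnitude entries in each group.
    drop = np.argsort(np.abs(groups), axis=-1)[..., : m - n]
    mask = np.ones_like(groups, dtype=bool)
    np.put_along_axis(mask, drop, False, axis=-1)
    return (groups * mask).reshape(rows, cols)

def quantize_int4(weights: np.ndarray):
    """Symmetric per-tensor 4-bit quantization: map weights to integers
    in [-8, 7] with a single floating-point scale."""
    scale = np.abs(weights).max() / 7.0
    q = np.clip(np.round(weights / scale), -8, 7).astype(np.int8)
    return q, scale

# Example: a toy 2:4-sparse, 4-bit weight matrix.
w = np.random.randn(8, 16).astype(np.float32)
w_sparse = prune_n_m(w, n=2, m=4)
q, scale = quantize_int4(w_sparse)
w_dequant = q.astype(np.float32) * scale  # values the accelerator would effectively compute with
```

Because N:M sparsity fixes the number of nonzeros per group, a hardware datapath can fetch exactly n values plus small position indices per group, which is what makes it friendlier to FPGA pipelines than unstructured pruning.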
Key Takeaways
- Proposes a hardware-software co-design framework for efficient LLM inference on FPGAs.
- Combines N:M structured sparsity with 4-bit quantization to reduce memory footprint and accelerate computation.
- Reports a 1.71× matrix-multiplication speedup and a 1.29× end-to-end latency reduction over dense GPU baselines.
- Demonstrates the effectiveness of structured sparsity and quantization for LLM inference.
- The FPGA accelerator offers flexibility in supporting various sparsity patterns.
Reference
“Utilizing 2:4 sparsity combined with quantization on $4096 \times 4096$ matrices, our approach achieves a reduction of up to $4\times$ in weight storage and a $1.71\times$ speedup in matrix multiplication, yielding a $1.29\times$ end-to-end latency reduction compared to dense GPU baselines.”
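As a back-of-the-envelope check on the quoted storage figure (my arithmetic, not the paper's): 4-bit values alone account for a $4\times$ reduction against an FP16 dense baseline, and 2:4 sparsity can save further depending on how index metadata is stored. Assuming a common 2-bit position index per kept element (an assumption; the paper may account metadata differently):

```latex
% Storage for a 4096 x 4096 weight matrix under assumed formats:
% FP16 dense baseline, 4-bit values, 2-bit index per kept element.
\begin{align*}
  \text{dense FP16} &: 4096 \times 4096 \times 16\,\text{bits} = 32\,\text{MiB} \\
  \text{dense INT4} &: 4096 \times 4096 \times 4\,\text{bits} = 8\,\text{MiB} \quad (4\times) \\
  \text{2:4 INT4 + indices} &: \tfrac{1}{2}\,(4096 \times 4096)(4 + 2)\,\text{bits} = 6\,\text{MiB}
\end{align*}
```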