FPGA Co-Design for Efficient LLM Inference with Sparsity and Quantization
Analysis
Key Takeaways
- Proposes a hardware-software co-design framework for efficient LLM inference on FPGAs.
- Combines N:M structured sparsity with 4-bit quantization to reduce the weight memory footprint and accelerate computation (a minimal sketch of both techniques follows the quoted result below).
- Achieves a 1.71× matrix-multiplication speedup and a 1.29× end-to-end latency reduction compared to dense GPU baselines.
- Demonstrates the effectiveness of combining structured sparsity and quantization for LLM inference.
- The FPGA accelerator offers flexibility in supporting various sparsity patterns.
“Utilizing 2:4 sparsity combined with quantization on $4096 \times 4096$ matrices, our approach achieves a reduction of up to $4\times$ in weight storage and a $1.71\times$ speedup in matrix multiplication, yielding a $1.29\times$ end-to-end latency reduction compared to dense GPU baselines.”
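To make the quoted numbers concrete, here is a minimal NumPy sketch of the two techniques in isolation. It is an illustration only, not the paper's FPGA implementation: the helper names `prune_2_4` and `quantize_int4`, the per-tensor symmetric quantization scheme, and the magnitude-based pruning criterion are all assumptions for the example.

```python
import numpy as np

def prune_2_4(w: np.ndarray) -> np.ndarray:
    """2:4 structured sparsity: in every contiguous group of four
    weights along the last axis, keep the two largest-magnitude
    values and zero the other two."""
    groups = w.reshape(-1, 4)
    drop = np.argsort(np.abs(groups), axis=1)[:, :2]  # two smallest per group
    pruned = groups.copy()
    np.put_along_axis(pruned, drop, 0.0, axis=1)
    return pruned.reshape(w.shape)

def quantize_int4(w: np.ndarray):
    """Symmetric per-tensor 4-bit quantization to the integer
    range [-8, 7]; returns quantized values and the FP scale."""
    scale = np.abs(w).max() / 7.0
    q = np.clip(np.round(w / scale), -8, 7).astype(np.int8)
    return q, scale

# Toy weight matrix of the size quoted above (4096 x 4096).
w = np.random.randn(4096, 4096).astype(np.float32)
w_sparse = prune_2_4(w)
q, scale = quantize_int4(w_sparse)
w_hat = q.astype(np.float32) * scale  # dequantized approximation

# Storage intuition: FP16 uses 16 bits/weight, so 4-bit quantization
# alone accounts for the quoted "up to 4x" weight-storage reduction;
# the 2:4 pattern additionally lets the hardware skip half the
# multiplies (at the cost of small per-group index metadata).
```

In a deployed accelerator the zeros would not be materialized as above; a compressed format would store only the kept 4-bit values plus their positions within each group, and the matmul units would consume that format directly.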