Search:
Match:
1 results
Paper#llm🔬 ResearchAnalyzed: Jan 3, 2026 06:27

FPGA Co-Design for Efficient LLM Inference with Sparsity and Quantization

Published:Dec 31, 2025 08:27
1 min read
ArXiv

Analysis

This paper addresses the challenge of deploying large language models (LLMs) in resource-constrained environments by proposing a hardware-software co-design approach using FPGA. The core contribution lies in the automation framework that combines weight pruning (N:M sparsity) and low-bit quantization to reduce memory footprint and accelerate inference. The paper demonstrates significant speedups and latency reductions compared to dense GPU baselines, highlighting the effectiveness of the proposed method. The FPGA accelerator provides flexibility in supporting various sparsity patterns.
Reference

Utilizing 2:4 sparsity combined with quantization on $4096 imes 4096$ matrices, our approach achieves a reduction of up to $4\times$ in weight storage and a $1.71\times$ speedup in matrix multiplication, yielding a $1.29\times$ end-to-end latency reduction compared to dense GPU baselines.