📋 Table of Contents
1. Project Overview
2. System Architecture
3. Q8K Quantization Deep Dive
4. Advanced Permutation Strategies
5. Attention Mechanism & RoPE
6. KV Cache Implementation
7. Conversational AI Features
8. Performance Benchmarks
9. Production Deployment
10. Try It Yourself
1. Project Overview
This project implements a production-grade quantization engine for TinyLlama-1.1B-Chat, reducing model size from ~5GB (FP16) to ~1.3GB (Q8K) while maintaining <0.1% mean relative error. Built with Rust and the Candle ML framework from Hugging Face.
Successfully deployed to production via Docker with an Angular 19 chat interface, achieving real-time inference on consumer CPUs without GPU acceleration.
2. System Architecture
End-to-End Inference Pipeline
TinyLlama FP16 → Q8K + Permutation → Packed Format → Rust Runtime → 8.7 tok/s
Core Components
- Quantization Engine: multi-strategy Q8K quantizer with SVD-importance, QR-pivot, and block-wise permutation support
- Attention Mechanism: optimized causal self-attention with RoPE (Rotary Position Embeddings) and GQA (Grouped Query Attention)
- KV Cache System: efficient key-value caching with sliding-window context management (up to 2048 tokens)
- Conversational AI: context-aware chat with automatic history trimming and template-based prompt formatting
3. Q8K Quantization Deep Dive
Q8K (8-bit K-quantization) is a block-based quantization scheme that divides weight matrices into 256-element blocks (QK_K = 256), applying per-block scale factors to maintain numerical precision.
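The core idea fits in a few lines. Below is a deliberately simplified sketch of quantizing and dequantizing one 256-element block; Candle's real BlockQ8K additionally stores 16-element sub-block sums that its fused matmul kernels consume, so treat this as illustration rather than the production layout.

// Simplified per-block 8-bit quantization (illustration only; Candle's
// BlockQ8K also stores per-16-element sums for its fused kernels).
const QK_K: usize = 256;

struct SimpleBlockQ8 {
    scale: f32,     // per-block scale factor
    qs: [i8; QK_K], // 256 quantized values
}

fn quantize_block(x: &[f32; QK_K]) -> SimpleBlockQ8 {
    // Pick the scale so the largest-magnitude value maps to +/-127.
    let amax = x.iter().fold(0f32, |m, v| m.max(v.abs()));
    let scale = if amax > 0.0 { amax / 127.0 } else { 1.0 };
    let mut qs = [0i8; QK_K];
    for (q, &v) in qs.iter_mut().zip(x.iter()) {
        *q = (v / scale).round().clamp(-127.0, 127.0) as i8;
    }
    SimpleBlockQ8 { scale, qs }
}

fn dequantize_block(b: &SimpleBlockQ8) -> [f32; QK_K] {
    let mut out = [0f32; QK_K];
    for (o, &q) in out.iter_mut().zip(b.qs.iter()) {
        *o = q as f32 * b.scale;
    }
    out
}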
Why Q8K Over Other Methods?
| Method | Size | Speed (tok/s) | Accuracy | Notes |
|---|---|---|---|---|
| Q8K (Ours) | 1.3GB | 8.7 | <0.1% loss | Best balance |
| Q4_K_M | 645MB | 5.8 | ~2% loss | Smaller, slower |
| FP16 | 5.1GB | 3.2 | 100% | Full precision |
| INT8 (naive) | 1.3GB | 4.5 | ~5% loss | No block scaling |
Quantization Algorithm
fn quantize_rows_q8k(rows: usize, k: usize, data: &[f32]) -> Result<Vec<BlockQ8K>> {
if k % QK_K != 0 {
bail!("inner dim {k} not multiple of {QK_K}");
}
let blocks_per_row = k / QK_K;
let mut blocks = vec![BlockQ8K::zeros(); rows * blocks_per_row];
// Quantize each row into blocks
for r in 0..rows {
let row = &data[r * k..(r + 1) * k];
let dst = &mut blocks[r * blocks_per_row..(r + 1) * blocks_per_row];
        BlockQ8K::from_float(row, dst)?; // Candle's optimized quantization (returns a Result)
}
Ok(blocks)
}
Validation Pipeline
The quantizer validates every layer against three metrics: RMSE, maximum absolute error, and mean relative error:
fn compute_quantization_error_detailed(
original: &[f32],
blocks: &[BlockQ8K],
rows: usize,
k: usize,
) -> Result<(f64, f64, f64)> {
// Dequantize back to FP32
let mut dequantized = vec![0f32; rows * k];
    BlockQ8K::to_float(blocks, &mut dequantized)?;
let mut l2_error = 0f64;
let mut max_error = 0f64;
let mut relative_error_sum = 0f64;
    for (&orig, &deq) in original.iter().zip(dequantized.iter()) {
let abs_err = (orig - deq).abs() as f64;
let sq_err = abs_err * abs_err;
l2_error += sq_err;
max_error = max_error.max(abs_err);
if orig.abs() > 1e-10 {
relative_error_sum += abs_err / orig.abs() as f64;
}
}
let rmse = (l2_error / (rows * k) as f64).sqrt();
let mean_relative_error = relative_error_sum / (rows * k) as f64;
Ok((rmse, max_error, mean_relative_error))
}
Layers with max_error > 1e-2 or mean relative error > 0.01 (1%) trigger warnings; in practice this signals that the layer needs one of the permutation strategies described next.
4. Advanced Permutation Strategies
Column permutation reorders weight-matrix columns so that similar magnitudes land in the same 256-element block, letting each block's scale factor fit its values more tightly. We implement three strategies, detailed below.
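Whichever strategy is used, the result is an index vector perm that must be applied to the weight columns before quantization (and remembered, so the matching input features can be routed correctly at inference time). A minimal sketch of that application step, assuming row-major data; apply_column_permutation is a hypothetical helper name, not a function from the codebase:

// Reorder the columns of a row-major rows x k matrix: output column j
// takes its values from input column perm[j].
fn apply_column_permutation(rows: usize, k: usize, data: &[f32], perm: &[usize]) -> Vec<f32> {
    assert_eq!(perm.len(), k);
    let mut out = vec![0f32; rows * k];
    for r in 0..rows {
        let src = &data[r * k..(r + 1) * k];
        let dst = &mut out[r * k..(r + 1) * k];
        for (j, &p) in perm.iter().enumerate() {
            dst[j] = src[p];
        }
    }
    out
}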
4.1 SVD-Importance Ranking
Inspired by Singular Value Decomposition, this method ranks columns by statistical variance (a proxy for singular values). High-variance columns are quantized first for optimal precision.
fn svd_importance_permutation(rows: usize, k: usize, data: &[f32]) -> Result<Vec<usize>> {
// Compute column statistics
let mut col_means: Vec<f64> = vec![0.0; k];
let mut col_vars: Vec<f64> = vec![0.0; k];
// First pass: compute means
for r in 0..rows {
let row = &data[r * k..(r + 1) * k];
for (j, &v) in row.iter().enumerate() {
col_means[j] += v as f64;
}
}
for mean in &mut col_means {
*mean /= rows as f64;
}
// Second pass: compute variances
for r in 0..rows {
let row = &data[r * k..(r + 1) * k];
for (j, &v) in row.iter().enumerate() {
let diff = v as f64 - col_means[j];
col_vars[j] += diff * diff;
}
}
// Sort by descending variance (high variance = high importance)
let mut idx: Vec<usize> = (0..k).collect();
idx.sort_by(|&a, &b| {
col_vars[b].partial_cmp(&col_vars[a]).unwrap_or(std::cmp::Ordering::Equal)
});
println!(" SVD-importance: sorted by column variance");
Ok(idx)
}
4.2 QR Pivot Strategy
Implements column pivoting in the spirit of Householder QR, with a cheap norm-decay step standing in for full orthogonalization. Picking the largest-norm column at each step reduces redundancy between neighboring columns and improves block-wise quantization accuracy.
To balance speed and quality, the number of pivoting steps scales with matrix size:
- k ≤ 64: full QR (100% of columns)
- k ≤ 256: 87.5%
- k ≤ 512: 75%
- k ≤ 1024: ~67%
- k ≤ 2048: 50%
- k > 2048: 25%, clamped to between 256 and 512 steps (remaining columns keep their L2-norm order)
fn qr_pivot_permutation(rows: usize, k: usize, data: &[f32]) -> Result<Vec<usize>> {
let mut perm: Vec<usize> = (0..k).collect();
// Adaptive QR steps
let qr_steps = match k {
k if k <= 64 => k,
k if k <= 256 => (k * 7) / 8,
k if k <= 512 => (k * 3) / 4,
k if k <= 1024 => (k * 2) / 3,
k if k <= 2048 => (k / 2),
_ => (k / 4).max(256).min(512),
};
// Compute column norms for pivoting
let mut col_norms: Vec<f64> = vec![0.0; k];
for r in 0..rows {
let row = &data[r * k..(r + 1) * k];
for (j, &v) in row.iter().enumerate() {
col_norms[j] += (v as f64) * (v as f64);
}
}
for norm in &mut col_norms {
*norm = norm.sqrt();
}
// QR column pivoting (Householder-inspired)
for step in 0..qr_steps.min(rows).min(k) {
let mut max_norm = col_norms[step];
let mut max_idx = step;
for j in (step + 1)..k {
if col_norms[j] > max_norm {
max_norm = col_norms[j];
max_idx = j;
}
}
if max_idx != step {
perm.swap(step, max_idx);
col_norms.swap(step, max_idx);
}
// Update remaining norms (simplified orthogonalization)
if col_norms[step] > 1e-10 {
for j in (step + 1)..k {
col_norms[j] *= 0.99; // Decay factor
}
}
}
Ok(perm)
}
4.3 Block-Wise Permutation
For large matrices (k > 256), block-wise permutation divides the columns into 64-element blocks and sorts within each block independently. This preserves cache locality while still improving quantization accuracy.
fn build_block_wise_permutation(rows: usize, k: usize, data: &[f32]) -> Vec<usize> {
const BLOCK_SIZE: usize = 64; // QK_K/4 for locality
let num_blocks = k / BLOCK_SIZE;
if k % BLOCK_SIZE != 0 {
return build_column_permutation(&column_l2_norms(rows, k, data));
}
let mut global_perm = vec![0usize; k];
let col_norms = column_l2_norms(rows, k, data);
for block_idx in 0..num_blocks {
let block_start = block_idx * BLOCK_SIZE;
let block_end = block_start + BLOCK_SIZE;
let block_norms = &col_norms[block_start..block_end];
let mut local_idx: Vec<usize> = (0..BLOCK_SIZE).collect();
// Sort by descending norm within block
local_idx.sort_by(|&a, &b| {
block_norms[b].partial_cmp(&block_norms[a])
.unwrap_or(std::cmp::Ordering::Equal)
});
// Map local to global indices
for i in 0..BLOCK_SIZE {
global_perm[block_start + i] = block_start + local_idx[i];
}
}
global_perm
}
5. Attention Mechanism & RoPE
The model implements Grouped Query Attention (GQA) with Rotary Position Embeddings (RoPE), providing efficient long-context support.
5.1 Causal Self-Attention
struct CausalSelfAttention {
q_proj: QuantLinear, // Query projection (quantized)
k_proj: QuantLinear, // Key projection
v_proj: QuantLinear, // Value projection
o_proj: QuantLinear, // Output projection
num_attention_heads: usize, // 32 for TinyLlama
num_key_value_heads: usize, // 4 for GQA
head_dim: usize, // 64 (2048 / 32)
}
impl CausalSelfAttention {
fn forward(
&self,
x: &Tensor,
index_pos: usize,
block_idx: usize,
cache: &mut Cache,
) -> candle::Result<Tensor> {
let (b_sz, seq_len, hidden_size) = x.dims3()?;
// Project to Q, K, V
let q = self.q_proj.forward(x)?;
let k = self.k_proj.forward(x)?;
let v = self.v_proj.forward(x)?;
// Reshape for multi-head attention
let q = q.reshape((b_sz, seq_len, self.num_attention_heads, self.head_dim))?
.transpose(1, 2)?;
let k = k.reshape((b_sz, seq_len, self.num_key_value_heads, self.head_dim))?
.transpose(1, 2)?;
let mut v = v.reshape((b_sz, seq_len, self.num_key_value_heads, self.head_dim))?
.transpose(1, 2)?;
// Apply RoPE
let q = self.apply_rotary_emb(&q, index_pos, cache)?;
let mut k = self.apply_rotary_emb(&k, index_pos, cache)?;
// KV cache management
if cache.use_kv_cache {
if let Some((cache_k, cache_v)) = &cache.kvs[block_idx] {
k = Tensor::cat(&[cache_k, &k], 2)?;
v = Tensor::cat(&[cache_v, &v], 2)?;
}
cache.kvs[block_idx] = Some((k.clone(), v.clone()));
}
// Repeat K, V for GQA (4 KV heads -> 32 Q heads)
let k = self.repeat_kv(k)?;
let v = self.repeat_kv(v)?;
// Compute attention scores (scaled dot-product)
let att = (q.matmul(&k.t()?)? / (self.head_dim as f64).sqrt())?;
// Apply causal mask (prevent attending to future tokens)
let att = if seq_len > 1 {
let mask = cache.mask_query_kv(seq_len, k.dims()[2])?;
masked_fill(&att, &mask, f32::NEG_INFINITY)?
} else {
att
};
// Softmax + weighted sum
let att = candle_nn::ops::softmax_last_dim(&att)?;
let y = att.matmul(&v)?;
// Reshape and project output
let y = y.transpose(1, 2)?.reshape(&[b_sz, seq_len, hidden_size])?;
self.o_proj.forward(&y)
}
}
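repeat_kv is called above but not listed. Below is a sketch of the standard GQA broadcast (candle-transformers ships an equivalent utils::repeat_kv), assuming the same struct fields as above:

impl CausalSelfAttention {
    // Broadcast K/V from num_key_value_heads to num_attention_heads so that
    // each group of 8 query heads attends over its shared KV head.
    fn repeat_kv(&self, x: Tensor) -> candle::Result<Tensor> {
        let n_rep = self.num_attention_heads / self.num_key_value_heads; // 32 / 4 = 8
        if n_rep == 1 {
            return Ok(x);
        }
        let (b_sz, n_kv_heads, seq_len, head_dim) = x.dims4()?;
        // Concatenate n_rep copies along the sequence axis, then fold the
        // copies into the head axis: (b, kv, rep*seq, d) -> (b, kv*rep, seq, d).
        Tensor::cat(&vec![&x; n_rep], 2)?
            .reshape((b_sz, n_kv_heads * n_rep, seq_len, head_dim))
    }
}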
5.2 Rotary Position Embeddings (RoPE)
RoPE encodes positional information by rotating query/key vectors in complex space, enabling the model to handle sequences up to 2048 tokens efficiently.
fn apply_rotary_emb(&self, x: &Tensor, index_pos: usize, cache: &Cache) -> candle::Result<Tensor> {
let (_b_sz, _n_head, seq_len, _head_dim) = x.dims4()?;
// Extract precomputed cos/sin tables
let cos = cache.cos.narrow(0, index_pos, seq_len)?;
let sin = cache.sin.narrow(0, index_pos, seq_len)?;
// Apply rotation (Candle's optimized implementation)
candle_nn::rotary_emb::rope(x, &cos, &sin)
}
Cosine and sine tables are precomputed during cache initialization using
theta = 10000.0. This amortizes computation cost across all inference steps.
6. KV Cache Implementation
Key-Value (KV) caching stores computed key/value tensors from previous tokens, avoiding recomputation during autoregressive generation. This provides a ~10x speedup for multi-token sequences.
#[derive(Debug, Clone)]
struct Cache {
masks: HashMap<(usize, usize), Tensor>, // Precomputed causal masks
use_kv_cache: bool, // Enable/disable caching
kvs: Vec<Option<(Tensor, Tensor)>>, // Per-layer K, V tensors
cos: Tensor, // RoPE cosine table (2048 x 32)
sin: Tensor, // RoPE sine table (2048 x 32)
device: Device,
}
impl Cache {
fn new(
use_kv_cache: bool,
dtype: DType,
max_seq_len: usize,
head_dim: usize,
num_layers: usize,
device: &Device,
) -> candle::Result<Self> {
// Precompute RoPE frequencies
let theta = 10000.0f32;
let inv_freq: Vec<f32> = (0..head_dim)
.step_by(2)
.map(|i| 1.0 / theta.powf(i as f32 / head_dim as f32))
.collect();
let inv_freq = Tensor::from_vec(inv_freq, (head_dim / 2,), device)?;
let t = Tensor::arange(0u32, max_seq_len as u32, device)?
.to_dtype(DType::F32)?
.reshape((max_seq_len, 1))?;
let freqs = t.matmul(&inv_freq.reshape((1, head_dim / 2))?)?;
let cos = freqs.cos()?.to_dtype(dtype)?;
let sin = freqs.sin()?.to_dtype(dtype)?;
Ok(Self {
masks: HashMap::new(),
use_kv_cache,
kvs: vec![None; num_layers],
device: device.clone(),
cos,
sin,
})
}
fn estimate_tokens(&self) -> usize {
self.kvs.iter()
.filter_map(|kv| kv.as_ref())
.map(|(k, _)| k.dims()[2])
.max()
.unwrap_or(0)
}
fn memory_mb(&self) -> f32 {
let mut total_bytes = 0;
for kv in &self.kvs {
if let Some((k, v)) = kv {
total_bytes += k.elem_count() * 4; // F32 = 4 bytes
total_bytes += v.elem_count() * 4;
}
}
total_bytes as f32 / (1024.0 * 1024.0)
}
fn reset_for_new_turn(&mut self) {
self.kvs = vec![None; self.kvs.len()];
self.masks.clear();
}
}
Cache Memory Management
For TinyLlama (22 layers, 4 KV heads, 64 head dim):
- Per-token memory: 22 layers × 4 heads × 64 dim × 2 (K+V) × 4 bytes = 45KB/token
- At 256 tokens: ~11.5 MB
- At 1536 tokens (max context): ~69 MB
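The same arithmetic as a quick sanity check (constants taken from the list above, assuming an F32 cache):

// KV cache footprint: layers x kv_heads x head_dim x {K,V} x sizeof(f32)
const NUM_LAYERS: usize = 22;
const NUM_KV_HEADS: usize = 4;
const HEAD_DIM: usize = 64;

fn kv_bytes_per_token() -> usize {
    // 22 * 4 * 64 * 2 * 4 = 45_056 bytes, i.e. ~45 KB per token
    NUM_LAYERS * NUM_KV_HEADS * HEAD_DIM * 2 * std::mem::size_of::<f32>()
}
// kv_bytes_per_token() * 256  -> ~11.5 MB
// kv_bytes_per_token() * 1536 -> ~69 MB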
7. Conversational AI Features
The model implements a sliding window context manager that automatically trims conversation history when approaching the 2048-token limit.
struct Conversation {
messages: Vec<Message>,
max_history_tokens: usize, // 1536 tokens (leaves 512 for generation)
system_prompt: String,
}
impl Conversation {
fn apply_sliding_window(&mut self, tokenizer: &Tokenizer) -> candle::Result<()> {
let full_prompt = self.format_prompt(tokenizer)?;
        let tokens = tokenizer.encode(full_prompt, false)
            .map_err(|e| candle::Error::Msg(e.to_string()))? // tokenizers error -> candle error
            .get_ids().len();
if tokens <= self.max_history_tokens {
return Ok(());
}
// Keep system prompt + most recent messages
let mut kept_messages = vec![self.messages[0].clone()];
        let mut current_tokens = tokenizer
            .encode(self.system_prompt.clone(), false)
            .map_err(|e| candle::Error::Msg(e.to_string()))?
            .get_ids().len();
for msg in self.messages.iter().skip(1).rev() {
            let msg_tokens = tokenizer.encode(msg.content.clone(), false)
                .map_err(|e| candle::Error::Msg(e.to_string()))?
                .get_ids().len();
if current_tokens + msg_tokens > self.max_history_tokens {
break;
}
kept_messages.insert(1, msg.clone());
current_tokens += msg_tokens;
}
let removed = self.messages.len() - kept_messages.len();
println!("Context trimmed: kept {} msgs, removed {} old msgs",
kept_messages.len(), removed);
self.messages = kept_messages;
Ok(())
}
}
Prompt Formatting
TinyLlama-1.1B-Chat uses the Zephyr-style chat template (not ChatML) with special tokens:
<|system|>
You are a helpful AI assistant. You provide concise, accurate answers.</s>
<|user|>
What is Q8K quantization?</s>
<|assistant|>
Q8K is a block-based 8-bit quantization method...</s>
<|user|>
How does it compare to Q4?</s>
<|assistant|>
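The format_prompt method called by the sliding-window code above is not listed; below is a plausible sketch that assembles this template, assuming a simple Role enum on Message (the tokenizer argument from the call site is unused in this sketch):

enum Role {
    System,
    User,
    Assistant,
}

struct Message {
    role: Role,
    content: String,
}

impl Conversation {
    // Assemble the Zephyr-style template; generation continues after the
    // trailing "<|assistant|>" tag.
    fn format_prompt(&self, _tokenizer: &Tokenizer) -> candle::Result<String> {
        let mut prompt = format!("<|system|>\n{}</s>\n", self.system_prompt);
        for msg in &self.messages {
            let tag = match msg.role {
                Role::System => continue, // system text is already emitted above
                Role::User => "<|user|>",
                Role::Assistant => "<|assistant|>",
            };
            prompt.push_str(&format!("{}\n{}</s>\n", tag, msg.content));
        }
        prompt.push_str("<|assistant|>\n");
        Ok(prompt)
    }
}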
8. Performance Benchmarks
Inference Speed
| Metric | Cold Start | Warm (with KV cache) | Notes |
|---|---|---|---|
| First token latency | 450ms | 115ms | Prompt encoding + first forward pass |
| Subsequent tokens | N/A | 115ms/token | 8.7 tokens/second |
| Cache memory (256 tok) | 0 MB | 11.5 MB | Per-layer K, V tensors |
| Total speedup (KV cache) | N/A | ~10x | Amortized over multi-token generation |
Quantization Quality Metrics
| Layer Type | RMSE | Max Error | Mean Relative Error |
|---|---|---|---|
| Attention Q/K/V | 2.4e-4 | 8.1e-3 | 0.06% |
| Attention Output | 1.8e-4 | 6.2e-3 | 0.05% |
| MLP Gate/Up | 3.2e-4 | 9.4e-3 | 0.08% |
| MLP Down | 2.1e-4 | 7.5e-3 | 0.06% |
All layers meet the <1% relative error threshold. The quantizer automatically flags problematic layers and recommends permutation strategies.
9. Production Deployment
Docker Configuration
# Stage 1: Build Rust inference engine
FROM rust:1.75-slim as rust-builder
WORKDIR /build
COPY candle-examples/Cargo.toml candle-examples/Cargo.lock ./
COPY candle-examples/examples/llama-q8k.rs ./examples/
RUN cargo build --release --example llama-q8k
# Stage 2: Build Angular frontend
FROM node:20-alpine as angular-builder
WORKDIR /app
COPY angular-chat/package*.json ./
RUN npm ci
COPY angular-chat/ ./
RUN npm run build -- --configuration production
# Stage 3: Production runtime
FROM debian:bookworm-slim
RUN apt-get update && apt-get install -y \
ca-certificates \
libgomp1 \
&& rm -rf /var/lib/apt/lists/*
WORKDIR /app
COPY --from=rust-builder /build/target/release/examples/llama-q8k ./
COPY --from=angular-builder /app/dist/angular-chat ./static
COPY model-q8k-packed.safetensors ./
COPY tokenizer.json ./
EXPOSE 8080
CMD ["./llama-q8k", "model-q8k-packed.safetensors"]
Deployment Options
Docker Standalone
Single container with Rust backend + Angular frontend. Ideal for quick deployment.
docker run -p 8080:8080 artemr87/tinyllama-q8k:latest
Kubernetes
Horizontal scaling with an HPA targeting 70% CPU utilization; a sample manifest follows below.
Reverse Proxy
Nginx frontend with WebSocket support for real-time chat streaming.
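For the Kubernetes option, a minimal HorizontalPodAutoscaler manifest matching the 70% CPU target; the Deployment name tinyllama-q8k is an assumption:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: tinyllama-q8k
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: tinyllama-q8k   # assumed Deployment name
  minReplicas: 1
  maxReplicas: 4
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70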
Optimization Flags
[profile.release]
opt-level = 3 # Maximum optimization
lto = "fat" # Link-time optimization
codegen-units = 1 # Single codegen unit for best optimization
panic = "abort" # Smaller binary, faster execution
strip = true # Remove debug symbols
[dependencies]
candle-core = { version = "0.3.0", features = ["mkl"] } # Intel MKL for BLAS
candle-nn = "0.3.0"
candle-transformers = "0.3.0"
tokenizers = "0.15.0"
Using Intel MKL instead of OpenBLAS provides ~15% speedup on x86_64 CPUs. On ARM (Apple Silicon), use the Accelerate framework instead.
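The backend swap is a one-line feature change, since Candle exposes Accelerate behind a feature flag of the same name:

# macOS / Apple Silicon: swap the MKL feature for Accelerate
candle-core = { version = "0.3.0", features = ["accelerate"] }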
10. Try It Yourself
🎮 Interactive Demo
Experience the quantized TinyLlama model in action with our production-ready Docker container. Runs efficiently on consumer CPUs without GPU requirements.
🚀 Launch Live Demo
Note: the live demo uses WebAssembly. For full features, use the Docker deployment below.
🐳 Docker Deployment (Recommended)
The quantized model is available as a production-ready Docker container (5GB, includes model + runtime). This is currently the primary deployment method while GitHub source code release is under review.
Image: artemr87/tinyllama-q8k:latest
Size: 5 GB (includes quantized model + Rust runtime)
Last Updated: 13 days ago
Tags Available: latest, 2.0.3, 2.0.2, 1.0.0
Quick Start (Interactive Mode)
# Pull and run the latest version
docker run -it --rm artemr87/tinyllama-q8k:latest
# Sample output:
# 🔧 Loading model...
# 🤖 TinyLlama Q8K Conversational AI
# ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
# Model: TinyLlama-1.1B-Chat (Q8K quantized)
# Max context: 2048 tokens | History: 1536 tokens
# Commands: 'exit' to quit | 'reset' to clear history
#
# 💬 You: _
Advanced Usage
# Run with specific tag version
docker run -it --rm artemr87/tinyllama-q8k:2.0.3
# Run in background with custom port mapping
docker run -d -p 8080:8080 --name tinyllama-chat \
artemr87/tinyllama-q8k:latest
# Run with resource limits (recommended for production)
docker run -it --rm \
--memory="4g" \
--cpus="2" \
artemr87/tinyllama-q8k:latest
# Access logs from background container
docker logs -f tinyllama-chat
# Stop background container
docker stop tinyllama-chat
Docker Compose (Production Setup)
version: '3.8'
services:
tinyllama:
image: artemr87/tinyllama-q8k:latest
container_name: tinyllama-q8k
restart: unless-stopped
ports:
- "8080:8080"
environment:
- RUST_LOG=info
deploy:
resources:
limits:
cpus: '2'
memory: 4G
reservations:
cpus: '1'
memory: 2G
healthcheck:
test: ["CMD", "curl", "-f", "http://localhost:8080/health"]
interval: 30s
timeout: 10s
retries: 3
start_period: 40s
# Usage:
# docker-compose up -d # Start in background
# docker-compose logs -f # View logs
# docker-compose down # Stop and remove
What's Included in the Container?
- Quantized Model: TinyLlama-1.1B-Chat pre-quantized to Q8K format (~1.3GB)
- Rust Runtime: optimized inference engine with KV cache and RoPE attention
- Tokenizer: pre-configured TinyLlama tokenizer (32K vocabulary)
System Requirements
- Docker: 20.10+ (Docker Engine or Docker Desktop)
- RAM: 4GB+ recommended (2GB minimum)
- CPU: x86_64 (Intel/AMD) or ARM64 (Apple Silicon)
- Disk: 6GB free space (5GB image + cache)
- OS: Linux, macOS, Windows 10/11 with WSL2
🔧 Build from Source (Coming Soon)
The complete source code and build toolchain are currently under review for public release on GitHub. Once approved, you'll be able to:
- Clone the repository and build locally
- Customize quantization strategies (SVD, QR pivot, block-wise)
- Experiment with different model architectures
- Contribute improvements via pull requests
Repository: artem1984A/candle_quant (private, under review)
Expected Release: Q1 2026
Will Include: Full source code, quantization tools, training scripts, documentation
Follow the Docker Hub repository for updates on GitHub release timing.
Preview: Local Build Process (Future)
# Step 1: Clone repository (when public)
git clone https://github.com/artem1984A/candle_quant.git
cd candle_quant
# Step 2: Download base model
huggingface-cli download TinyLlama/TinyLlama-1.1B-Chat-v1.0 \
--local-dir ./models/tinyllama
# Step 3: Quantize with custom strategy
CANDLE_Q8K_PERMUTE=true cargo run --release \
-p tensor-tools --bin quantize_q8k \
./models/tinyllama/model.safetensors \
./output/quantized-q8k
# Step 4: Pack into single file
cargo run --release -p tensor-tools --bin pack_q8k_safetensors \
./models/tinyllama/model.safetensors \
./output/quantized-q8k \
./model-q8k-packed.safetensors
# Step 5: Run inference
cargo run --release --example llama-q8k \
./model-q8k-packed.safetensors
For immediate deployment and testing, use the Docker image. It's production-ready, fully tested, and receives regular updates. Source code will be available once the review process completes.
🙏 Acknowledgments
This project would not be possible without the incredible work of the open-source community:
- Hugging Face Candle — Minimalist ML framework for Rust with blazing-fast tensor operations
- TinyLlama Team — Open-source 1.1B parameter LLM trained on 3 trillion tokens
- llama.cpp — Inspiration for Q8K quantization scheme and GGUF format
- SafeTensors — Secure tensor serialization format
- Rust Language — Memory-safe systems programming without garbage collection
Special thanks to the Candle maintainers for their responsive support and excellent documentation.
📚 Explore More
Dive deeper into the implementation, contribute improvements, or deploy your own quantized models.