The rapid advancement in Large Language Models (LLMs) has been truly astounding, bringing powerful AI capabilities to the forefront. However, many of these cutting-edge models often demand significant computational resources, particularly modern GPUs with substantial VRAM. This can be a barrier for enthusiasts and developers working with older or more modest hardware.
This post explores the feasibility and approaches to running recent LLMs, specifically Qwen3.5 and Gemma4, on older hardware. While it won’t be a seamless experience akin to state-of-the-art systems, with the right strategies, you can still leverage these models for various tasks.
Older hardware is typically characterized by limited VRAM, an aging CPU, and missing support for newer GPU features (such as Flash Attention), all of which constrain which models can run and how fast they run.
To mitigate these challenges, several techniques can be employed:
Quantization is perhaps the most crucial technique. It involves reducing the precision of the model’s weights (e.g., from FP16/BF16 to INT8, INT4, or even binary). This dramatically reduces the model’s memory footprint and can also speed up inference, sometimes at the cost of a slight reduction in accuracy.
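The savings quantization buys can be estimated with simple arithmetic. This is only a sketch of the weight storage itself: real memory use also includes the KV cache, activations, and per-group scale factors that quantized formats carry, so treat these as lower bounds.

```python
# Rough memory-footprint estimate for model weights at different precisions.
# Illustrative only: actual usage is higher due to KV cache, activations,
# and quantization metadata (scales, zero points).

BYTES_PER_PARAM = {"fp16": 2.0, "int8": 1.0, "int4": 0.5}

def weights_gb(num_params: float, precision: str) -> float:
    """Approximate size of the raw weights in gigabytes."""
    return num_params * BYTES_PER_PARAM[precision] / 1e9

for precision in ("fp16", "int8", "int4"):
    print(f"7B model @ {precision}: ~{weights_gb(7e9, precision):.1f} GB")
```

At 4-bit, a 7B model's weights shrink from roughly 14 GB to about 3.5 GB, which is what makes 8GB-class GPUs viable at all.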
Libraries such as bitsandbytes and quanto are also excellent, and the wider ecosystem supports various quantization formats (e.g., GGUF, AWQ, GPTQ).

Many popular LLMs, including Qwen and Gemma, are released in multiple sizes (e.g., 7B, 2B, etc.). Opting for the smallest variant that still meets your needs is a straightforward way to reduce resource demands.
If your older GPU has insufficient VRAM, running inference entirely on the CPU is an option. While slower, even older CPUs can still handle smaller quantized models. In my case, my old Unraid server has a decade-old CPU: Intel® Xeon® CPU E5-2620 0 @ 2.00GHz.
For specific use cases, additional optimizations can further reduce resource demands.
An interesting challenge arose when attempting to run Qwen 3.5. Unlike previous models, the latest Qwen is a Mixture of Experts (MoE) model. This advanced architecture relies on newer technologies, specifically “Flash Attention,” for efficient operation.
Unfortunately, older GPUs like the AMD RX590 lack hardware support for Flash Attention. This incompatibility leads to a critical failure during inference. When attempting to run Qwen 3.5 on either the RX590 or the CPU-only server, the result was the same: a continuous stream of nonsensical, garbage text that would not stop.

On the other hand, an older Qwen 2.5 model works fine on such an old GPU:

This highlights a key takeaway: as LLM architectures evolve, they may introduce dependencies on specific hardware features, making them incompatible with older systems, even with techniques like quantization.
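If you want to experiment anyway, one pragmatic safeguard against the "endless garbage text" failure mode is a runaway-output guard: stop a streaming generation when the tail of the output is dominated by a single repeated n-gram. The window size and threshold below are arbitrary choices for illustration, not tuned values.

```python
# Sketch of a runaway-output guard: flag a generation as "stuck" when one
# n-gram accounts for most of the recent output. Parameters are arbitrary
# illustrative defaults, not tuned values.
from collections import Counter

def looks_stuck(text: str, n: int = 4, window: int = 200,
                threshold: float = 0.5) -> bool:
    tail = text[-window:]
    if len(tail) < n * 2:
        return False  # not enough text to judge
    ngrams = [tail[i:i + n] for i in range(len(tail) - n + 1)]
    _, top_count = Counter(ngrams).most_common(1)[0]
    # If one n-gram dominates the tail, the model is likely looping.
    return top_count / len(ngrams) > threshold

print(looks_stuck("The capital of France is Paris."))  # varied text
print(looks_stuck("GGGG" * 100))                       # degenerate loop
```

A check like this can be run on each streamed chunk so a broken run is cut off after a couple hundred characters instead of spinning forever.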
Gemma, being a more recent and efficient model family from Google, is a good candidate for older hardware due to its design. Its smaller variants and optimized architecture make it more accessible.
This scenario involves using a mid-range, older AMD GPU. The key considerations here are:
gemma4:e2b likely refers to an optimized 2-billion parameter model. This size is highly suitable for an 8GB VRAM GPU, especially when quantized.
This runs completely on the old GPU:

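Beyond the interactive CLI, Ollama exposes a REST API, so the same model can be queried from a script. The sketch below builds a request for Ollama's `/api/generate` endpoint (assuming the default port 11434 and the `gemma4:e2b` tag from this setup); the actual network call is commented out since it requires a running Ollama server.

```python
# Build a request for Ollama's /api/generate REST endpoint.
# Assumes Ollama's default port (11434) and the model tag used above.
import json

def build_generate_request(model: str, prompt: str) -> tuple[str, bytes]:
    url = "http://localhost:11434/api/generate"
    payload = {"model": model, "prompt": prompt, "stream": False}
    return url, json.dumps(payload).encode("utf-8")

url, body = build_generate_request("gemma4:e2b", "Why is the sky blue?")
print(url)

# To actually send it (requires a running Ollama server):
# import urllib.request
# req = urllib.request.Request(
#     url, data=body, headers={"Content-Type": "application/json"})
# print(urllib.request.urlopen(req).read().decode())
```

Setting `"stream": False` returns the whole completion in one JSON response, which is simpler for scripting than the default streamed chunks.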
This setup emphasizes CPU-only inference with a substantial amount of RAM. The CPU is a decade old: Intel® Xeon® CPU E5-2620 0 @ 2.00GHz.
gemma4:26b indicates a 26-billion parameter model. Running this solely on the CPU is a demanding task.

Here it is running inside the Ollama container on my old Unraid server:

The Unraid console shows the CPU usage and the amount of RAM in use:

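Back-of-the-envelope arithmetic explains why so much RAM is needed for the 26B run. Assuming 4-bit weights plus a flat overhead guess for the KV cache and runtime (the 3 GB figure is an assumption, not a measurement):

```python
# Rough RAM estimate for running a large model on CPU.
# Assumes 4-bit weights; the overhead figure is a guess covering the
# KV cache, activations, and runtime, not a measured value.

def cpu_ram_estimate_gb(num_params: float, bits_per_weight: int = 4,
                        overhead_gb: float = 3.0) -> float:
    weights_gb = num_params * bits_per_weight / 8 / 1e9
    return weights_gb + overhead_gb

print(f"~{cpu_ram_estimate_gb(26e9):.0f} GB RAM for a 26B model at 4-bit")
```

That lands around 16 GB, so a server with 32 GB of RAM has room to spare, while a 16 GB machine would be right at the edge.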
While running Qwen3.5 and Gemma4 on old hardware presents significant challenges, it is not always impossible. By leveraging quantization, selecting smaller model variants, and using a friendly platform like Ollama, you can still experiment with and derive value from these powerful LLMs. The key is to manage expectations about performance and be prepared for a more involved setup process. Happy prompting!