Thanks! I tried Qwen2.5-Coder 7B and CodeLlama 7B, and they're really good for coding, haha.
Minimax, Mistral, Claude Code
Bro, PM me ASAP please.
- VRAM (GPU):
  - 4-bit Quantization (GGUF/EXL2): Minimum 6GB to 8GB VRAM for comfortable use. It can technically run on as little as 4GB VRAM with heavy offloading, but speed will drop significantly.
  - Full Precision (BF16): Approximately 16GB VRAM.
- System RAM:
  - GPU Offloading: 16GB is standard.
  - CPU-only: Minimum 16GB RAM (8GB is possible for highly compressed 4-bit versions, but extremely slow).
- Disk Space: ~5GB to 15GB depending on the quantization level.
To compute RAM requirements for a GGUF model in CPU-only mode, you must account for the model weights (determined by quantization) and the KV Cache (determined by context length).
1. Basic Formula for Model Weights
The RAM required to simply load the model is based on the number of parameters and the quantization level (bits per weight):
RAM (GB) ≈ Parameters (in billions) × Bits per weight ÷ 8 × 1.05
(The 1.05 multiplier accounts for a ~5-10% overhead for non-quantized layers and metadata.)
2. Estimated RAM per Billion Parameters
| Quantization Type | Bits per Weight (Approx.) | RAM per 1B Parameters |
| --- | --- | --- |
| Q8_0 (High quality) | 8.5 bits | ~1.1 GB |
| Q6_K (Excellent balance) | 6.6 bits | ~0.85 GB |
| Q5_K_M (Recommended) | 5.5 bits | ~0.72 GB |
| Q4_K_M (Standard/Fast) | 4.8 bits | ~0.63 GB |
| Q3_K_M (Smallest usable) | 3.5 bits | ~0.46 GB |
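The per-billion figures in the table above can be reproduced with a few lines of Python. This is only a rough sketch: it assumes the weight size is parameters × bits per weight ÷ 8, times the 1.05 overhead multiplier, which matches the table values. The function name `weights_ram_gb` is made up for illustration.

```python
# Approximate bits per weight for common GGUF quantization types
# (values taken from the table in this post).
BITS_PER_WEIGHT = {
    "Q8_0": 8.5,
    "Q6_K": 6.6,
    "Q5_K_M": 5.5,
    "Q4_K_M": 4.8,
    "Q3_K_M": 3.5,
}

OVERHEAD = 1.05  # multiplier for non-quantized layers and metadata


def weights_ram_gb(params_billions: float, quant: str) -> float:
    """Estimate the RAM (in GB) needed just to load the model weights."""
    bits = BITS_PER_WEIGHT[quant]
    # billions of parameters * bits / 8 bits-per-byte = billions of bytes = GB
    return params_billions * bits / 8 * OVERHEAD


if __name__ == "__main__":
    for quant in BITS_PER_WEIGHT:
        print(f"{quant}: ~{weights_ram_gb(1, quant):.2f} GB per 1B parameters")
```

This covers the weights only; the context window (KV cache) comes on top of it.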
3. Adding the KV Cache (Context Window)
When running on CPU, the "Context Window" uses additional RAM. For modern models (like Llama 3 or Mistral), use this rough estimate:
8k Context: Add ~1–2 GB RAM.
32k Context: Add ~4–8 GB RAM.
128k Context: Can exceed 20+ GB just for the cache.
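To see where those context numbers come from, here is a rough sketch of the usual KV cache size calculation. It assumes a 16-bit cache and a Llama 3 8B-style shape (32 layers, 8 KV heads, head dimension 128). Those shape values are my assumption and are not from the guide above, so check your model's config before relying on them.

```python
def kv_cache_gb(n_layers: int, n_kv_heads: int, head_dim: int,
                context_tokens: int, bytes_per_value: int = 2) -> float:
    """Estimate KV cache size in GiB for a given context length."""
    # The leading 2 is for one key tensor plus one value tensor per layer.
    total_bytes = (2 * n_layers * n_kv_heads * head_dim
                   * context_tokens * bytes_per_value)
    return total_bytes / 1024**3


if __name__ == "__main__":
    # Assumed Llama 3 8B-style shape: 32 layers, 8 KV heads, head_dim 128.
    for ctx in (8 * 1024, 32 * 1024, 128 * 1024):
        print(f"{ctx} tokens: ~{kv_cache_gb(32, 8, 128, ctx):.0f} GiB")
```

With this shape the cache lands at the low end of the ranges above. Models with more layers or more KV heads will need more.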
4. Real-World Examples (CPU Mode)
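Llama 3 8B (Q4_K_M): about 8 × 0.63 ≈ 5.0 GB for the weights, plus the KV cache for your context length.
Mistral 7B (Q8_0): about 7 × 1.1 ≈ 7.7 GB for the weights, plus the KV cache for your context length.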
DeepSeek-V3 671B (Q4_K_M): Requires roughly 430–450 GB of RAM.
For local AI, say Qwen2.5-Coder 7B (best choice), these are the requirements (listed above).
Note for those using GPU mode with high system RAM:
If the model is too big for your VRAM, Ollama will let you offload specific layers to the GPU while keeping the rest in system RAM, as a safety measure to avoid crashes. llama.cpp has this too. So it's better to pick a somewhat smaller model (in GB) to avoid split-loading, which slows down processing. Make sure you have at least 1-2GB left in VRAM for the context window, i.e. the memory of your conversations. As a rule of thumb, leave at least 15-20% of your VRAM free. I'm sure the same goes for RAM in CPU mode, or even more.
On my PC, a 3rd-gen i7 with 16GB RAM, that works in CPU-only mode using a quantized Q4 GGUF model, but the responses take a long time. My minimum is 3B parameters for offline Q4 LLMs; I'm hardware-limited and don't want my PC to hang, hehe. Maybe on the latest Intel CPUs it's somewhat faster, but you still need more than 16GB of RAM for acceptable response times. RAM/VRAM is really the bottleneck, so a decent GPU, preferably with 8GB or more and with CUDA and Tensor core support, is still the preference.
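Here is a rough sketch of that rule of thumb in Python, for guessing how many layers can go on the GPU before split-loading kicks in. The function name `layers_that_fit` is made up for illustration. It treats every layer as the same size (a simplification), and it reads the rule as "reserve whichever is larger: a fixed amount of GB for context or a percentage of VRAM", which is my interpretation.

```python
def layers_that_fit(model_gb: float, n_layers: int, vram_gb: float,
                    free_fraction: float = 0.20, min_free_gb: float = 1.0) -> int:
    """Estimate how many model layers fit in VRAM while keeping headroom."""
    # Keep the larger of the two reservations free for the context window.
    reserve_gb = max(min_free_gb, vram_gb * free_fraction)
    usable_gb = vram_gb - reserve_gb
    if usable_gb <= 0:
        return 0
    per_layer_gb = model_gb / n_layers  # assumes equally sized layers
    return min(n_layers, int(usable_gb // per_layer_gb))


if __name__ == "__main__":
    # Hypothetical example: an 8 GB model with 32 layers on a 6 GB card.
    print(layers_that_fit(model_gb=8, n_layers=32, vram_gb=6))
```

If the result is lower than the model's layer count, the rest stays in system RAM and things get slower. In llama.cpp the number of GPU layers is set with the `--n-gpu-layers` option.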
Here are some simple estimation guides for CPU mode, for those interested:
View attachment 4080541
For GPU mode, use these links: [links hidden behind the forum login]
For more info, go here: [links hidden behind the forum login]
For Qwen2.5-Coder-7B, take your pick here. It's easy to search on Hugging Face: [link hidden behind the forum login]
With Ollama installed, it's easy to run the models you want using their guide: [link hidden behind the forum login]
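Once a model is running in Ollama, you can also call it from a script through its local REST API. This is a minimal sketch, assuming Ollama is listening on its default port 11434 and that you have already pulled a model under the tag `qwen2.5-coder:7b` (the tag is an assumption, so use whatever name you pulled). The helper names `build_request` and `generate` are made up for illustration.

```python
import json
import urllib.request

# Ollama's default local endpoint for one-shot text generation.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_request(prompt: str, model: str = "qwen2.5-coder:7b") -> urllib.request.Request:
    """Build the HTTP request for a single, non-streaming completion."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(prompt: str, model: str = "qwen2.5-coder:7b") -> str:
    """Send the prompt to the local Ollama server and return the reply text."""
    with urllib.request.urlopen(build_request(prompt, model)) as resp:
        return json.loads(resp.read())["response"]
```

Calling `generate("Write a Python function that reverses a string.")` should return the model's answer as plain text, as long as the Ollama server is running.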
Those Chinese open-source LLMs are actually free and unlimited online, given how many providers there are. You just need to collect the APIs if there's a need. Local AI is simply hassle-free because there are no limits, especially if you're satisfied with the response times, and it's portable without internet.
That one's nice too, it's like having o3-mini, hehe. The 120b is as good as o4-mini, as they are designed for agentic workflows and high-reasoning tasks, as stated in the specs. My PC can't handle that for now, hehe, but I use it (120b) as an online API, as a backup browser AI (via chatgptbox). It isn't stingy with its answers, probably because of the 130k-token context window. It's an all-rounder. Best choice for 16GB VRAM using Q4 models.
This is just what I ended up using, boss, hahaha.
View attachment 4084002