❓ Help Ollama Local Model

For local AI, say Qwen2.5-Coder 7B (the best choice), these are the requirements:
  • VRAM (GPU):
    • 4-bit Quantization (GGUF/EXL2): Minimum 6GB to 8GB VRAM for comfortable use. It can technically run on as little as 4GB VRAM with heavy offloading, but speed will drop significantly.
    • Full Precision (BF16): Approximately 16GB VRAM.
  • System RAM:
    • GPU Offloading: 16GB is standard.
    • CPU-only: Minimum 16GB RAM (8GB is possible for highly compressed 4-bit versions, but extremely slow).
  • Disk Space: ~5GB to 15GB depending on the quantization level.
Note for those using GPU-mode with high System RAM:
If the model is too big for your VRAM, Ollama will let you offload specific layers to the GPU while keeping the rest in system RAM, as a safety measure to avoid crashes. llama.cpp has the same feature. That is why it is better to pick a somewhat smaller model (in GB), so you avoid split-loading, which slows down processing. Make sure you have at least 1-2GB left in VRAM for the context window, that is, the memory of your conversations. As a rule of thumb, leave at least 15-20% of your VRAM free. I'm sure the same applies to RAM in CPU mode, or even more.
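The headroom rule above can be turned into a quick check. This is only a sketch: the function name `fits_in_vram`, its defaults, and the example file sizes are assumptions for illustration, not anything Ollama or llama.cpp provides.

```python
def fits_in_vram(model_gb: float, vram_gb: float,
                 min_free_fraction: float = 0.15,
                 min_free_gb: float = 1.0) -> bool:
    """Return True if the model still leaves the suggested headroom in VRAM."""
    # Headroom is the larger of the percentage rule (15-20% of VRAM)
    # and the absolute floor (1-2 GB for the context window).
    headroom_gb = max(vram_gb * min_free_fraction, min_free_gb)
    return model_gb + headroom_gb <= vram_gb


if __name__ == "__main__":
    # Hypothetical model file sizes on an 8 GB card.
    print(fits_in_vram(4.7, 8.0))  # a Q4-sized file
    print(fits_in_vram(8.1, 8.0))  # a Q8-sized file
```

If the check fails, the model would have to be split between VRAM and system RAM, which is the slow case described above.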

On my PC, a 3rd-gen i7 with 16GB RAM, it can run in CPU-only mode using a quantized Q4 GGUF model, but the responses take a long time. My minimum is 3B parameters for offline Q4 LLMs; I'm hardware-limited, and that keeps my PC from hanging, he he. It is probably a bit faster on the latest Intel CPUs, but you still need more than 16GB of RAM for acceptable response times. RAM/VRAM is really the bottleneck, so a decent GPU, preferably with 8GB or more and with CUDA and Tensor support, is still the preference.

Here is a simple estimation guide for CPU mode, for those interested:
To compute RAM requirements for a GGUF model in CPU-only mode, you must account for the model weights (determined by quantization) and the KV Cache (determined by context length).

1. Basic Formula for Model Weights
The RAM required to simply load the model is based on the number of parameters and the quantization level (bits per weight):
RAM (GB) ≈ (parameters in billions × bits per weight ÷ 8) × 1.05

(The 1.05 multiplier adds about 5% for non-quantized layers and metadata; actual overhead can approach 10%.)

2. Estimated RAM per Billion Parameters
Quantization Type | Bits per Weight (Approx.) | RAM per 1B Parameters
Q8_0 (High quality) | 8.5 bits | ~1.1 GB
Q6_K (Excellent balance) | 6.6 bits | ~0.85 GB
Q5_K_M (Recommended) | 5.5 bits | ~0.72 GB
Q4_K_M (Standard/Fast) | 4.8 bits | ~0.63 GB
Q3_K_M (Smallest usable) | 3.5 bits | ~0.46 GB
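
The table values can be reproduced with a few lines of Python. This is a sketch that assumes the weights formula is parameters (in billions) × bits per weight ÷ 8 × 1.05, which matches every row of the table; the names `BITS_PER_WEIGHT` and `weights_ram_gb` are made up for this example.

```python
# Approximate bits per weight, copied from the table above.
BITS_PER_WEIGHT = {
    "Q8_0": 8.5,
    "Q6_K": 6.6,
    "Q5_K_M": 5.5,
    "Q4_K_M": 4.8,
    "Q3_K_M": 3.5,
}

OVERHEAD = 1.05  # multiplier for non-quantized layers and metadata


def weights_ram_gb(params_billions: float, quant: str) -> float:
    """Estimate the RAM (GB) needed just to load the model weights."""
    bits = BITS_PER_WEIGHT[quant]
    # Billions of parameters * bits / 8 gives billions of bytes, i.e. GB.
    return params_billions * bits / 8 * OVERHEAD


if __name__ == "__main__":
    for quant in BITS_PER_WEIGHT:
        print(f"{quant}: {weights_ram_gb(1, quant):.2f} GB per 1B, "
              f"{weights_ram_gb(7, quant):.1f} GB for a 7B model")
```

This covers the weights only; the context window is added on top, as in the next section.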

3. Adding the KV Cache (Context Window)
When running on CPU, the "Context Window" uses additional RAM. For modern models (like Llama 3 or Mistral), use this rough estimate:
8k Context: Add ~1–2 GB RAM.
32k Context: Add ~4–8 GB RAM.
128k Context: Can exceed 20+ GB just for the cache.
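
For a closer number than these ranges, the cache size can be computed from the model's shape. The sketch below assumes a 16-bit cache and the Llama 3 8B shape (32 layers, 8 KV heads, head dimension 128); the function name `kv_cache_gib` is hypothetical, and runtimes that quantize the cache will use less.

```python
def kv_cache_gib(context_tokens: int, n_layers: int, n_kv_heads: int,
                 head_dim: int, bytes_per_value: int = 2) -> float:
    """Estimate the KV cache size in GiB for a given context length."""
    # Each token stores one key and one value vector per layer per KV head.
    bytes_per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value
    return context_tokens * bytes_per_token / 1024**3


if __name__ == "__main__":
    # Assumed Llama 3 8B shape: 32 layers, 8 KV heads, head dimension 128.
    for ctx in (8_192, 32_768, 131_072):
        print(ctx, kv_cache_gib(ctx, 32, 8, 128), "GiB")
```

Under these assumptions the cache grows linearly with context length, and models with more layers or more KV heads need proportionally more.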

4. Real-World Examples (CPU Mode)
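Llama 3 8B (Q4_K_M): about 8 × 0.63 ≈ 5.0 GB for the weights, plus ~1–2 GB for an 8k context, so roughly 6–7 GB of RAM.
Mistral 7B (Q8_0): about 7 × 1.1 ≈ 7.7 GB for the weights, plus ~1–2 GB for an 8k context, so roughly 9–10 GB of RAM.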
DeepSeek-V3 671B (Q4_K_M): Requires roughly 430–450 GB of RAM.
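
Going the other way, from available RAM to a quantization level, could look like the sketch below. It uses the per-billion figures from the table in section 2; the function name `best_quant_for`, the 2 GB context allowance, and the 4 GB reserved for the OS and other apps are all assumptions to adjust for your own machine.

```python
# RAM per 1B parameters (GB), copied from the table in section 2,
# ordered from highest to lowest quality.
GB_PER_BILLION = [
    ("Q8_0", 1.1),
    ("Q6_K", 0.85),
    ("Q5_K_M", 0.72),
    ("Q4_K_M", 0.63),
    ("Q3_K_M", 0.46),
]


def best_quant_for(params_billions, ram_gb, context_gb=2.0, reserved_gb=4.0):
    """Return (quant, total_gb) for the highest-quality quant that fits, else None."""
    budget = ram_gb - reserved_gb  # assumed allowance for the OS and other apps
    for quant, gb_per_b in GB_PER_BILLION:
        total = params_billions * gb_per_b + context_gb
        if total <= budget:
            return quant, round(total, 2)
    return None


if __name__ == "__main__":
    print(best_quant_for(7, 16))  # 7B model on a 16 GB machine
    print(best_quant_for(3, 8))   # 3B model on an 8 GB machine
    print(best_quant_for(8, 8))   # 8B model on an 8 GB machine
```

A result of `None` means nothing in the table fits the budget, so a smaller model is the safer pick. Fitting in RAM says nothing about speed, which on CPU can still be slow.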
For GPU mode and for other info, see the links in the original post (they are hidden behind the forum login).
For Qwen2.5-Coder-7B, choose a build on Hugging Face. It is easy to search for there.
With Ollama installed, it's easy to run the models you want by following their guide.
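
Once Ollama is running, a local model can also be queried from a script through its HTTP API. The sketch below assumes the default local endpoint on port 11434 and the model tag `qwen2.5-coder:7b`; the helper names `build_request` and `ask` are made up for this example, and only the standard library is used.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # assumed default local endpoint


def build_request(prompt: str, model: str = "qwen2.5-coder:7b",
                  num_ctx: int = 8192) -> urllib.request.Request:
    """Build a non-streaming generate request for a local Ollama server."""
    payload = {
        "model": model,
        "prompt": prompt,
        "stream": False,  # ask for one JSON object instead of a stream
        "options": {"num_ctx": num_ctx},  # context window size in tokens
    }
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(prompt: str) -> str:
    """Send the prompt to the local server and return the generated text."""
    with urllib.request.urlopen(build_request(prompt)) as resp:
        return json.loads(resp.read())["response"]
```

Calling `ask("Write a function that reverses a string.")` only works while the Ollama server is running and after the model has been downloaded, for example with `ollama pull qwen2.5-coder:7b`. A larger `num_ctx` needs more memory for the context window.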

Those Chinese open-source LLMs are actually free and practically unlimited online, given how many providers there are. You just need to collect the API keys if you need them. Local AI is simply hassle-free because there are no limits, especially if you are satisfied with the response times, and it is portable without internet.
 

This is what I ended up using, boss, hahaha
[Attachment: 1772598826616.webp]
 
That one is good too, it's like having o3-mini, he he. The 120b is as good as o4-mini, since they are designed for agentic workflows and high-reasoning tasks, as stated in the specs. My PC can't handle it for now, he he, but I use it (the 120b) through an online API as a backup browser AI (via chatgptbox). It doesn't skimp on its answers, probably because of the 130k-token context window. It's an all-rounder. Best choice for 16GB VRAM using Q4 models.
 
Read up first on the basics and requirements (the three links posted here are hidden behind the forum login), then try it once you know how it is used.
That is all you need.

If that still doesn't work, try what it is built on first. Ollama is a wrapper around llama.cpp (link hidden behind the forum login).
It's just a one-liner. Then come back to Ollama later.

If that still doesn't work, try this (link hidden behind the forum login).
These are all-in-one packs. Just find one that suits your hardware.

The GGUF model from Llamafile can also be used by the two above, and it gives a good understanding of how these apps work. Just delete it once you move to Ollama and its integration with current apps. The important thing is that you know the limits of your hardware for the models you will run locally.
 
