❓ Help Ollama Local Model

32bit · Feb 24, 2026

What local model do you use for coding? Any suggestions, bosses?

Primordial Ultima · Feb 25, 2026

Minimax, Mistral, Claude code

32bit · Feb 26, 2026

Primordial Ultima said:
Minimax, Mistral, Claude code

Thanks! I tried Qwen2.5-Coder 7B and CodeLlama 7B, and they're really good for coding haha

Goodboywizdom · Mar 1, 2026

32bit said:
Thanks! I tried Qwen2.5-Coder 7B and CodeLlama 7B, and they're really good for coding haha

Bro, PM me ASAP please

alist1986 · Mar 1, 2026

For local AI, say Qwen2.5-Coder 7B (best choice), these are the requirements:

VRAM (GPU):

4-bit Quantization (GGUF/EXL2): Minimum 6GB to 8GB VRAM for comfortable use. It can technically run on as little as 4GB VRAM with heavy offloading, but speed will drop significantly.

Unquantized (BF16): Approximately 16GB VRAM (about 2 bytes per parameter).

System RAM:

GPU Offloading: 16GB is standard.

CPU-only: Minimum 16GB RAM (8GB is possible for highly compressed 4-bit versions, but extremely slow).

Disk Space: ~5GB to 15GB depending on the quantization level.

Note for those using GPU-mode with high System RAM:
If the model is too big for your VRAM, Ollama will offload only some layers to the GPU and keep the rest in system RAM, as a safety measure to avoid crashes. llama.cpp can do that too. So it's better to pick a somewhat smaller model (in GB) to avoid split loading, which slows down processing. Make sure you have at least 1-2GB left over in VRAM for the context window, i.e. the memory of your conversations. As a rule of thumb, leave at least 15-20% of your VRAM free. I'm sure the same goes for RAM in CPU mode, or even more.
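
The headroom rule can be turned into a quick check before downloading a model. This is only a minimal sketch, not something Ollama or llama.cpp exposes: the function names, the 20% default, and the 4.7 GB example file size are my own assumptions for illustration, and the layer share assumes all layers are about the same size.

```python
def fits_in_vram(model_gb: float, vram_gb: float, headroom: float = 0.20) -> bool:
    """Return True if the model file fits while leaving `headroom` of VRAM free.

    headroom=0.20 follows the 15-20% rule of thumb; it is an assumption,
    not a value read from any tool.
    """
    usable_gb = vram_gb * (1.0 - headroom)
    return model_gb <= usable_gb


def gpu_layer_share(model_gb: float, vram_gb: float, headroom: float = 0.20) -> float:
    """Rough fraction of the model that can sit on the GPU (1.0 = all of it).

    Assumes equally sized layers, which is only approximately true.
    """
    usable_gb = vram_gb * (1.0 - headroom)
    return min(1.0, usable_gb / model_gb)
```

For a hypothetical 4.7 GB Q4 file, `fits_in_vram(4.7, 8.0)` is true, while on a 4 GB card `gpu_layer_share(4.7, 4.0)` comes out at roughly two thirds, so the rest would spill into system RAM.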

On my PC, a 3rd-gen i7 with 16GB RAM, it can run in CPU-only mode using a quantized Q4 GGUF model, but the responses take a long time. My minimum is 3B parameters for offline Q4 LLMs; I'm hardware limited, and that keeps my PC from hanging he he. On the latest Intel CPUs it's probably somewhat faster, but you still need more than 16GB of RAM for acceptable response times. RAM/VRAM really is the bottleneck, so a decent GPU, preferably 8GB or more, with CUDA and Tensor Core support is still the preference.

Here's a simple estimation guide for CPU mode, for those interested:

To compute RAM requirements for a GGUF model in CPU-only mode, you must account for the model weights (determined by quantization) and the KV Cache (determined by context length).

1. Basic Formula for Model Weights
The RAM required to simply load the model is based on the number of parameters and the quantization level (bits per weight):

RAM (GB) ≈ Parameters (in billions) × Bits per weight ÷ 8 × 1.05

(The 1.05 multiplier accounts for a ~5-10% overhead for non-quantized layers and metadata).

2. Estimated RAM per Billion Parameters

Quantization Type | Bits per Weight (Approx) | RAM per 1B Parameters
Q8_0 (High quality) | 8.5 bits | ~1.1 GB
Q6_K (Excellent balance) | 6.6 bits | ~0.85 GB
Q5_K_M (Recommended) | 5.5 bits | ~0.72 GB
Q4_K_M (Standard/Fast) | 4.8 bits | ~0.63 GB
Q3_K_M (Smallest usable) | 3.5 bits | ~0.46 GB
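
The per-billion figures can be reproduced with a few lines of Python. This is a rough sketch that assumes weights RAM ≈ parameters × bits per weight ÷ 8 × 1.05, which is how I read the 1.05 multiplier note and which matches the RAM column in the table; the function name `weights_ram_gb` is made up for illustration.

```python
# Approximate bits per weight, copied from the table in the post.
BITS_PER_WEIGHT = {
    "Q8_0": 8.5,
    "Q6_K": 6.6,
    "Q5_K_M": 5.5,
    "Q4_K_M": 4.8,
    "Q3_K_M": 3.5,
}

OVERHEAD = 1.05  # multiplier for non-quantized layers and metadata


def weights_ram_gb(params_billions: float, quant: str) -> float:
    """Estimate the RAM (GB) needed just to load the weights."""
    bits = BITS_PER_WEIGHT[quant]
    # billions of parameters * bits / 8 = billions of bytes, i.e. roughly GB
    return params_billions * bits / 8 * OVERHEAD
```

For example, `weights_ram_gb(1, "Q4_K_M")` gives about 0.63, and `weights_ram_gb(8, "Q4_K_M")` gives about 5.04. This covers the weights only; the context window comes on top.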

3. Adding the KV Cache (Context Window)
When running on CPU, the "Context Window" uses additional RAM. For modern models (like Llama 3 or Mistral), use this rough estimate:
8k Context: Add ~1–2 GB RAM.
32k Context: Add ~4–8 GB RAM.
128k Context: Can exceed 20+ GB just for the cache.
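
For a rough idea of where these numbers come from, the cache for a standard transformer can be estimated from the model shape. This is a sketch under assumptions that are mine, not from the guide: an unquantized fp16 cache (2 bytes per value) and the Llama 3 8B shape of 32 layers, 8 KV heads, and head dimension 128. Real runtimes add compute buffers on top, so treat the results as a lower bound.

```python
def kv_cache_gib(n_layers: int, n_kv_heads: int, head_dim: int,
                 context_tokens: int, bytes_per_value: int = 2) -> float:
    """Estimate KV cache size in GiB.

    The leading 2 covers keys and values; bytes_per_value=2 assumes an
    fp16 cache with no cache quantization.
    """
    total_bytes = (2 * n_layers * n_kv_heads * head_dim
                   * context_tokens * bytes_per_value)
    return total_bytes / 1024**3
```

With the assumed Llama 3 8B shape, `kv_cache_gib(32, 8, 128, 8192)` returns 1.0, `kv_cache_gib(32, 8, 128, 32768)` returns 4.0, and `kv_cache_gib(32, 8, 128, 131072)` returns 16.0, which lands at the low end of the ranges above.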

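4. Real-World Examples (CPU Mode)
Llama 3 8B (Q4_K_M): Roughly 8 × 0.63 ≈ 5 GB for the weights, plus the KV cache for your context length.
Mistral 7B (Q8_0): Roughly 7 × 1.1 ≈ 7.7 GB for the weights, plus the KV cache for your context length.
DeepSeek-V3 671B (Q4_K_M): Requires roughly 430–450 GB of RAM.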

For GPU mode, use these links: [links not available]
For some other info, go here: [links not available]
For Qwen2.5-coder-7b, pick one here; it's easy to search on Hugging Face: [link not available]
With Ollama installed, it's easy to run the models you want using their guide: [link not available]
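
Once Ollama is running and a model has been pulled, you can also call it from a script through its local REST API. This is a minimal sketch, assuming the default endpoint on port 11434 and that the model is available under the tag `qwen2.5-coder:7b`; the helper names `build_request` and `ask` are made up for illustration.

```python
import json
import urllib.request

# Assumed default address of a locally running Ollama server.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_request(prompt: str, model: str = "qwen2.5-coder:7b") -> urllib.request.Request:
    """Build a non-streaming generate request for the local Ollama server."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(prompt: str, model: str = "qwen2.5-coder:7b") -> str:
    """Send the prompt and return the generated text (needs Ollama running)."""
    with urllib.request.urlopen(build_request(prompt, model)) as resp:
        return json.loads(resp.read())["response"]

# Usage, with the server up and the model pulled:
# print(ask("Write a Python function that reverses a string."))
```

Setting `"stream": False` makes the server return one JSON object instead of a stream of chunks, which keeps the example short.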

Those Chinese open-source LLMs are actually free and unlimited online, given how many providers there are. You just need to collect the API keys if you need them. Local AI is simply hassle-free because there are no limits, especially if you're satisfied with the response times, and it's portable without internet.

32bit · Mar 4, 2026

alist1986 said:
For local AI, say Qwen2.5-Coder 7B (best choice), these are the requirements:

This is what I ended up using, boss hahaha

alist1986 · Mar 4, 2026

32bit said:
This is what I ended up using, boss hahaha
View attachment 4084002

That's a nice one too, it's like having o3-mini he he. The 120b is as good as o4-mini, since they're designed for agentic workflows and high-reasoning tasks, as stated in the specs. My PC can't handle that one for now he he, but I use it (120b) through an online API as a backup browser AI (via chatgptbox). It doesn't skimp on its answers, probably because of the 130k-token context window. It's an all-rounder. Best choice for 16GB VRAM using Q4 models.

Protagonist · May 5, 2026

Do you have a tutorial for Ollama?

alist1986 · May 5, 2026

Read these first to learn the basics and requirements, then try it once you know how it's used: [links not available]
That is all you need.

If that's still not enough, try what it came from first. Ollama is a wrapper around llama.cpp: [link not available]
It's just a one-liner. Then come back to Ollama later.

If that still doesn't work, try this: [link not available]
They're all-in-one packs. Just find one that suits your hardware.

The GGUF model from Llamafile can be used by the two above, and it gives you a good understanding of how these apps work. Just delete it once you move on to Ollama and its integration with today's apps. The important thing is that you know your hardware's limits for the models you'll run locally.
