Gemma 4 E2B: Fine-tuned for Caregivers, Q4_K_M GGUF
A fine-tuned derivative of Google Gemma 4 E2B, adapted via LoRA (rank 32)
and quantized to Q4_K_M for on-device inference on mobile devices through
any llama.cpp-compatible runtime.
This model is a Cognitive Decline Caregiver Support Assistant. It is designed to support people caring for a loved one with a neurodegenerative disease, dementia, or severe cognitive decline, helping them navigate the psychological weight of ambiguous loss (mourning someone who is still physically present). It treats a caregiver's dark moments (rage, jealousy, exhaustion, wishing for an end) as biological exhaustion rather than moral failure, and responds with a fixed, gentle four-step rhythm rather than advice or solutions.
Attribution
This model is a fine-tuned derivative of Google Gemma 4 E2B, originally released by Google DeepMind under the Apache 2.0 License.
Lineage:
- Fine-tuned from: `unsloth/gemma-4-E2B-it`, Unsloth's optimized distribution of Google's instruction-tuned Gemma 4 E2B.
- Upstream original: `google/gemma-4-E2B-it` (instruction-tuned), derived from the base model `google/gemma-4-E2B`, by Google DeepMind.
Modifications by Serjio42 (2026):
- Fine-tuned with LoRA (rank 32) for caregiver-focused use cases
- Merged adapter weights with the base model (16-bit)
- Quantized to Q4_K_M via llama.cpp
Files
| File | Size | Purpose |
|---|---|---|
| `gemma4-e2b_r32-q4_k_m.gguf` | ~3.4 GB | Quantized model weights |
| `inference_config.json` | - | Sampling and generation parameters |
| `system_prompt.txt` | - | Default system prompt (use verbatim) |
| `LICENSE` | - | Apache 2.0 license text |
Integrity
`gemma4-e2b_r32-q4_k_m.gguf`: 3,427,863,872 bytes (~3.4 GB).
SHA-256:
`81ce0ae4a3fb37040faf37c6eedc0985f0d7fa291e8d17a9820937ccdab4158b`
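A downloaded copy can be checked against these two values before it is loaded. The following is a minimal sketch in Python using only the standard library; the function names `sha256_of` and `verify_download` are illustrative and not part of this repository.

```python
import hashlib
from pathlib import Path

# Values copied from the Integrity section of this card.
EXPECTED_SIZE = 3_427_863_872
EXPECTED_SHA256 = "81ce0ae4a3fb37040faf37c6eedc0985f0d7fa291e8d17a9820937ccdab4158b"


def sha256_of(path, chunk_size=1 << 20):
    """Return the hex SHA-256 of a file, reading in chunks to bound memory use."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path, expected_size=EXPECTED_SIZE, expected_sha256=EXPECTED_SHA256):
    """Return True only if both the byte size and the SHA-256 digest match."""
    path = Path(path)
    if path.stat().st_size != expected_size:
        return False  # cheap check first; avoids hashing a truncated download
    return sha256_of(path) == expected_sha256.lower()
```

Calling `verify_download("gemma4-e2b_r32-q4_k_m.gguf")` after the first-launch download gives a simple pass/fail signal; a mobile app would do the same with its platform's hashing API.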
Training
- Base model: Google Gemma 4 E2B (instruction-tuned, ~2B effective parameters), accessed via `unsloth/gemma-4-E2B-it`
- Method: LoRA fine-tuning, rank 32, then merged with base weights (16-bit)
- Chat template: Gemma 4 non-thinking template (`gemma-4`)
- Quantization: Q4_K_M via llama.cpp (`convert_hf_to_gguf.py` → `llama-quantize`)
- Training data: Curated private dataset for caregiver-focused instruction following. Dataset access available on request; please open a discussion on this repository.
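The quantization step can be pictured as two llama.cpp commands run in sequence. The sketch below only builds the argument lists and executes nothing. Every path and file name in it is a hypothetical placeholder, and the `f16` intermediate type is an assumption, since this card only says "16-bit".

```python
from pathlib import Path


def quantization_commands(merged_model_dir, work_dir, llama_cpp_dir="llama.cpp"):
    """Build the convert-then-quantize commands as argument lists.

    All paths and file names are hypothetical; nothing is executed here.
    """
    work_dir = Path(work_dir)
    f16_gguf = work_dir / "model-f16.gguf"    # intermediate file (assumed name)
    q4_gguf = work_dir / "model-q4_k_m.gguf"  # final file (assumed name)

    # Step 1: convert the merged Hugging Face checkpoint to a 16-bit GGUF.
    convert = [
        "python", str(Path(llama_cpp_dir) / "convert_hf_to_gguf.py"),
        str(merged_model_dir),
        "--outfile", str(f16_gguf),
        "--outtype", "f16",  # assumption: the card does not name the exact type
    ]
    # Step 2: quantize the 16-bit GGUF down to Q4_K_M.
    quantize = ["llama-quantize", str(f16_gguf), str(q4_gguf), "Q4_K_M"]
    return convert, quantize
```

Each list can be passed to `subprocess.run(cmd, check=True)` once the placeholder paths point at a real merged checkpoint and a llama.cpp checkout.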
Conversation protocol
This model is trained for a fixed four-turn conversation, not free-form chat. Each conversation follows the same rhythm:
- Mirror: reflect the caregiver's moment back so they feel seen.
- Normalize: explain why their reaction is a universal human response.
- Self-compassion: invite one small act of kindness toward themselves.
- Close: a soft landing, no advice, no new task.
Flow:
- The first user message is the caregiver's hard moment: a story, a dark thought, a raw feeling.
- For each of the next three turns, send the literal string Continue.
- The model produces exactly one response per user turn, four responses total.
- Each response is 1-3 sentences (usually two), never more than ~60 words.
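An app may want to confirm that a reply stays within the stated shape before displaying it. Here is a rough sketch, assuming a naive sentence splitter on `.`, `!` and `?`; the helper name `check_reply_shape` is illustrative, and because the card gives the word limit as approximate, 60 is treated as a soft threshold.

```python
import re

MAX_WORDS = 60      # the card says "~60", so this is a soft limit
MAX_SENTENCES = 3


def check_reply_shape(reply, max_words=MAX_WORDS, max_sentences=MAX_SENTENCES):
    """Return (word_count, sentence_count, ok) for one model reply.

    Sentences are counted naively: the text is split on whitespace that
    follows '.', '!' or '?'. Abbreviations will be over-counted.
    """
    words = reply.split()
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", reply.strip()) if s]
    ok = 1 <= len(sentences) <= max_sentences and len(words) <= max_words
    return len(words), len(sentences), ok
```

A failed check is better logged than shown to the user as an error; regenerating the turn is one possible response.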
Stop tokens ([1, 106, 50]) are baked into the GGUF metadata, so no extra
stop-token configuration is needed in the app.
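The flow above can be sketched as a small driver loop. In this sketch `generate` is a hypothetical callable that takes the message list so far and returns the assistant's reply as a string. Passing the system prompt as a `system` role message is an assumption about how the runtime's chat template handles it. The default continue string should be checked character for character against the protocol, including the trailing period.

```python
STEPS = ("mirror", "normalize", "self_compassion", "close")
CONTINUE_PROMPT = "Continue."  # assumption: verify the exact literal string


def run_four_turn_session(generate, system_prompt, hard_moment,
                          continue_prompt=CONTINUE_PROMPT):
    """Drive one fixed four-turn conversation.

    Returns a dict mapping each step name to its reply, plus the full
    message history (system prompt, then alternating user/assistant turns).
    """
    messages = [{"role": "system", "content": system_prompt}]
    # Turn 1 is the caregiver's message; turns 2-4 are the continue string.
    user_turns = [hard_moment] + [continue_prompt] * (len(STEPS) - 1)
    replies = {}
    for step, user_text in zip(STEPS, user_turns):
        messages.append({"role": "user", "content": user_text})
        reply = generate(list(messages))  # pass a copy so history stays ours
        messages.append({"role": "assistant", "content": reply})
        replies[step] = reply
    return replies, messages
```

Keeping the loop independent of any particular runtime means the same driver works with the llama.cpp server, a Python binding, or a test stub.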
Intended use
On-device inference in a mobile application (Android primary, iOS planned),
loaded with any llama.cpp-compatible runtime. Designed for offline,
privacy-preserving text generation after a one-time model download. Target
use case: emotional support for caregivers of people with dementia /
neurodegenerative disease, delivered through the fixed four-step rhythm
described above.
Target devices
- Android phones with 8 GB+ RAM; iPhone 15 Pro / 16 Pro (8 GB RAM) and newer
- ~4 GB free storage for the model and working files
- The GGUF exceeds App Store / Play bundle limits, so distribute via CDN / cloud storage and download on first launch.
Limitations
- Q4_K_M quantization trades some quality for size; expect minor degradation compared to the full-precision model.
- Fine-tune is domain-specific (caregiver emotional support, fixed four-turn protocol); out-of-domain or free-form-chat performance is not guaranteed.
- Inherits biases and limitations of the base Gemma 4 model.
- Not a substitute for professional medical or mental-health advice. Outputs are AI-generated and may contain errors. This model is not a crisis service. For any medical decisions, or in an emergency, consult a qualified healthcare professional or local emergency services.
Usage
Load the GGUF with any llama.cpp-compatible runtime: the llama.cpp
CLI/server, or any binding/wrapper on top of it. Pick whatever fits your
stack; the model imposes no runtime-specific requirements.
Use the system prompt from `system_prompt.txt` verbatim before user
messages; the model was trained on this exact prompt, and any change
degrades behavior. Apply the sampling parameters from
`inference_config.json` (temperature 1.0, top-p 0.95, top-k 64,
repeat-penalty 1.0, max new tokens 300, context size 2048), and follow
the four-turn flow described in Conversation protocol.
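As one possible setup, here is a minimal sketch using the llama-cpp-python binding. The sampling values are copied from the paragraph above and mapped onto that binding's keyword names; the key names inside `inference_config.json` are not shown in this card, so the file itself remains the source of truth. The helper names `load_system_prompt` and `make_generate` are illustrative.

```python
from pathlib import Path

# Values quoted in the Usage section, using llama-cpp-python keyword names.
N_CTX = 2048
SAMPLING = {
    "temperature": 1.0,
    "top_p": 0.95,
    "top_k": 64,
    "repeat_penalty": 1.0,
    "max_tokens": 300,
}


def load_system_prompt(path="system_prompt.txt"):
    """Read the prompt as-is, without stripping or editing it."""
    return Path(path).read_text(encoding="utf-8")


def make_generate(model_path, n_ctx=N_CTX, sampling=SAMPLING):
    """Return a generate(messages) -> str callable backed by llama-cpp-python."""
    # Imported lazily so this module loads without the binding installed.
    from llama_cpp import Llama  # requires: pip install llama-cpp-python

    llm = Llama(model_path=str(model_path), n_ctx=n_ctx, verbose=False)

    def generate(messages):
        result = llm.create_chat_completion(messages=messages, **sampling)
        return result["choices"][0]["message"]["content"]

    return generate
```

The returned callable takes an OpenAI-style list of `{"role": ..., "content": ...}` messages, so it can be plugged into whatever loop sends the caregiver's first message and the three follow-up turns.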
License
Released under the Apache 2.0 License, the same terms as the base Gemma 4
model. See `LICENSE` for the full license text.