Training Generative AI GPT-Neo 125M Model

Fine-Tune GPT-Neo For Custom LLM Training

Training GPT-Neo 125M in a Podman Compose Container Using Open Source Tools and Data

By Edward

If you’re looking to get started with large language models (LLMs) like GPT-Neo 125M, you’re in the right place. In this tutorial, we’ll explore how to train the open source GPT-Neo 125M model from Hugging Face inside a containerized Podman Compose environment.

This setup is perfect for developers who prefer open source tools and want a simple, reproducible training workflow — even on a modest machine.

What Is GPT-Neo 125M?

GPT-Neo 125M is an open source transformer-based language model created by EleutherAI. It’s a great starting point for fine-tuning with small datasets or learning the basics of language model training.
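
To get a feel for its size before committing to a training run, you can load the model and count its parameters. Here's a minimal sketch (the first call downloads the weights from Hugging Face):

from transformers import AutoModelForCausalLM

# Load GPT-Neo 125M and report its parameter count
model = AutoModelForCausalLM.from_pretrained("EleutherAI/gpt-neo-125M")
total = sum(p.numel() for p in model.parameters())
print(f"{total / 1e6:.0f}M parameters")  # prints roughly 125M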

Why Use Podman Compose?

Podman Compose offers a container-based environment similar to Docker Compose — but with a focus on being daemonless and rootless. It’s ideal for developers working on Linux who want an open source alternative to Docker.

With Podman Compose, you can:

  • Reproducibly build and run training environments
  • Mount volumes for data, model caching, and output
  • Avoid polluting the host system

Installing GPT-Neo 125M in Podman Compose

1. Project Structure

gpt-neo-podman-compose/
├── data/                   # Your custom dataset (.jsonl)
├── output/                 # Model outputs (saved model)
├── cache/                  # Model & pip cache
├── Dockerfile
├── podman-compose.yml
├── requirements.txt
└── train.py

2. Training Script train.py

from datasets import load_dataset
from transformers import (
    AutoTokenizer,
    AutoModelForCausalLM,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)
import torch

# Format each record as a question/answer prompt
def format_example(example):
    return {"text": f"### Question: {example['prompt']}\n### Answer: {example['response']}"}

# Load the JSONL dataset, format it, and hold out 10% for evaluation
dataset = load_dataset("json", data_files="data/custom_dataset.jsonl")
dataset = dataset.map(format_example)
dataset = dataset["train"].train_test_split(test_size=0.1)
dataset = dataset.remove_columns(["prompt", "response"])

# Tokenizer & Model
model_name = "EleutherAI/gpt-neo-125M"
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token  # GPT-Neo ships without a pad token
model = AutoModelForCausalLM.from_pretrained(model_name)

def tokenize(example):
    return tokenizer(example["text"], padding="max_length", truncation=True, max_length=512)

tokenized_dataset = dataset.map(tokenize, batched=True)
tokenized_dataset.set_format(type='torch', columns=['input_ids', 'attention_mask'])

# Training
training_args = TrainingArguments(
    output_dir="./output",
    per_device_train_batch_size=2,
    per_device_eval_batch_size=2,
    num_train_epochs=3,
    logging_steps=10,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    logging_dir="./logs",
    save_total_limit=1,
    fp16=torch.cuda.is_available(),
)

# A causal LM needs labels to compute a loss; this collator copies
# input_ids into labels for each batch (mlm=False means plain
# left-to-right language modeling, not masked LM)
data_collator = DataCollatorForLanguageModeling(tokenizer, mlm=False)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset["test"],
    tokenizer=tokenizer,
    data_collator=data_collator,
)

trainer.train()

# Save the final model and tokenizer so the inference example below
# can load them directly from ./output
trainer.save_model("./output")
tokenizer.save_pretrained("./output")
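
Note the data collator above: without it, the Trainer would receive batches with no labels and could not compute a loss. If you want to sanity-check a single batch before committing to a long run, here's a quick sketch that reuses the objects defined in train.py:

from torch.utils.data import DataLoader

# Build one batch exactly as the Trainer would and confirm labels exist
loader = DataLoader(tokenized_dataset["train"], batch_size=2, collate_fn=data_collator)
batch = next(iter(loader))
print(batch["input_ids"].shape, batch["labels"].shape)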

3. Dockerfile

FROM python:3.10

ENV HF_HOME=/cache/huggingface
ENV TRANSFORMERS_CACHE=/cache/huggingface/transformers
ENV HF_DATASETS_CACHE=/cache/huggingface/datasets
ENV PIP_CACHE_DIR=/cache/pip

WORKDIR /app

# Install system packages
RUN apt-get update && apt-get install -y git && rm -rf /var/lib/apt/lists/*

# Pre-install dependencies to enable caching
COPY requirements.txt .
RUN pip install --cache-dir=$PIP_CACHE_DIR -r requirements.txt

COPY . .

CMD ["python", "train.py"]

4. The requirements.txt File

transformers<4.46   # pinned: newer releases renamed evaluation_strategy to eval_strategy
datasets
torch

5. Podman Compose YAML

version: "3"

services:
  gpt-neo-trainer:
    build: .
    volumes:
      - ./data:/app/data
      - ./output:/app/output
      - ./cache:/cache             # Cache for Hugging Face and pip
    command: ["python", "train.py"]

6. Example Open Source Dataset

We’re using a simple JSONL format: one JSON object per line, each with a prompt and a response. Here are two example records from our data/custom_dataset.jsonl file (the second response embeds the PHP snippet shown in the results section below, with newlines escaped as \n):

{"prompt": "Who is the mayor of Toronto?", "response": "As of 2025, the mayor of Toronto is Olivia Chow."}
{"prompt": "I need a PHP code snippet to connect to a MySQL database.", "response": "<?php\n$mysqli = new mysqli(\"localhost\", \"user\", \"password\", \"database\");\nif ($mysqli->connect_error) {\n    die(\"Connection failed: \" . $mysqli->connect_error);\n}\necho \"Connected successfully\";\n?>"}

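Because a single malformed line will fail the whole load_dataset call, it's worth validating the file first. A minimal sketch, assuming the data/custom_dataset.jsonl path from the project layout above:

import json

# Check that every line is valid JSON and carries the expected keys
with open("data/custom_dataset.jsonl") as f:
    for i, line in enumerate(f, 1):
        record = json.loads(line)
        assert {"prompt", "response"} <= record.keys(), f"line {i} is missing keys"
print("dataset looks valid")
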
7. Build and Run with Podman Compose

mkdir -p data output cache
podman-compose build
podman-compose up

This downloads the base GPT-Neo model, loads your dataset, and starts training — all inside a secure, isolated container.

8. Inference Example (Optional)

from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("./output")
model = AutoModelForCausalLM.from_pretrained("./output")

prompt = "Who is the mayor of Toronto?"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
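
By default, generate() decodes greedily, which can produce repetitive text. If you want more varied output, you can enable sampling. This is a possible variation; the parameter values here are illustrative, not tuned:

outputs = model.generate(
    **inputs,
    max_new_tokens=100,
    do_sample=True,    # sample from the distribution instead of greedy decoding
    temperature=0.7,   # lower = more deterministic
    top_p=0.9,         # nucleus sampling cutoff
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))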

Screenshots & Screencast

[Screenshot] Command-line GPT-Neo 125M dataset creation.
[Screenshot] GPT-Neo 125M custom build in a Podman container.
[Screenshot] GPT-Neo 125M custom training dataset in a Podman container.
[Screenshot] GPT-Neo 125M pre-training test.
[Screenshot] GPT-Neo 125M post-training test.
[Video] Training GPT-Neo 125M in a Podman container.

Before and After: Results

Output Before Training (default GPT-Neo)

Prompt: Who is the mayor of Toronto?
Response: I’m not sure.

Prompt: I need a PHP code snippet to connect to MySQL.
Response: (incoherent or missing)

Output After Fine-Tuning

Prompt: Who is the mayor of Toronto?
Response: As of 2025, the mayor of Toronto is Olivia Chow.

Prompt: I need a PHP code snippet to connect to MySQL.
Response:

<?php
$mysqli = new mysqli("localhost", "user", "password", "database");
if ($mysqli->connect_error) {
    die("Connection failed: " . $mysqli->connect_error);
}
echo "Connected successfully";
?>

Additional Resources

If you’re just starting out with Python or want to deepen your understanding, check out my beginner-friendly book and course:

📚 Book:
Learning Python (eBook on Amazon)

🎓 Course:
Learning Python (Ojambo Shop)

Need Help?

I’m available for:

  • Full-stack development
  • Database design
  • 1-on-1 tutoring
  • Consulting sessions

Feel free to reach out and let’s bring AI to your next project!

Conclusion

With open source models like GPT-Neo and tools like Podman Compose, training your own AI assistant is more accessible than ever. Whether you’re creating a chatbot, a code assistant, or a personal knowledge base — it all starts with training.

Have questions or want to share your setup? Drop a comment below or get in touch!

About Edward

Edward is a software engineer, web developer, and author dedicated to helping people achieve their personal and professional goals through actionable advice and real-world tools.

As the author of impactful books including Learning JavaScript, Learning Python, Learning PHP, Mastering Blender Python API, and the fiction title The Algorithmic Serpent, Edward writes with a focus on personal growth, entrepreneurship, and practical success strategies. His work is designed to guide, motivate, and empower.

In addition to writing, Edward offers professional services including full-stack development, database design, 1-on-1 tutoring, and consulting sessions, all tailored to help you take the next step. Whether you are launching a business, developing a brand, or leveling up your mindset, Edward will be there to support you.

Edward also offers online courses designed to deepen your learning and accelerate your progress. Explore the programming courses on languages like JavaScript, Python, and PHP to find the perfect fit for your journey.

📚 Explore His Books – Visit the Book Shop to grab your copies today.
💼 Need Support? – Learn more about Services and the ways to benefit from his expertise.
🎓 Ready to Learn? – Check out his Online Courses to turn your ideas into results.
