Multi-User GPU Container Management for Academic Machine Learning
An open-source system for managing shared GPU resources in teaching and research environments. Built on Docker and AIME ML Containers, designed for small universities and labs.
GitHub: ds01-infra · Documentation Hub
Academic data science labs face a common infrastructure challenge: how do you share expensive GPU hardware across dozens of students and researchers, each with different skill levels, project requirements, and compute needs?
A university acquires a GPU server (perhaps 4× NVIDIA A100s costing upwards of €100,000) that must support dozens of students and researchers simultaneously.
Without proper management, predictable problems emerge:
| Problem | What Happens |
|---|---|
| Resource hogging | One user's runaway process consumes all GPUs for days |
| Environment conflicts | "It works on my machine" becomes "it worked until someone installed TensorFlow 2.x" |
| Beginner confusion | Students spend more time fighting CUDA drivers than learning ML |
| Zombie processes | Forgotten jobs accumulate, wasting resources indefinitely |
| No accountability | When something breaks, no one knows who did what |
Docker solves many of these problems in principle: isolated environments, reproducible builds, clean separation between users. But raw Docker on a shared GPU server introduces its own complexity.
The insight behind DS01: What if we could give users the benefits of containerisation without requiring them to become Docker experts?
DS01 Infrastructure wraps Docker with a user-friendly layer that handles resource allocation, environment setup, and lifecycle management automatically. Students interact with simple commands; the system handles the container orchestration behind the scenes.
Rather than building container management from scratch, DS01 extends AIME ML Containers, a mature framework with 150+ pre-built ML images covering PyTorch, TensorFlow, JAX, and common data science stacks.
This "wrap, don't replace" approach keeps divergence from upstream minimal (an added `--image` flag, a ~2.5% code change).
DS01 adds resource management, user workflows, and automation on top of AIME's container foundation.
DS01 uses a five-layer architecture that separates concerns and provides clear extension points, following a "wrap, don't replace" philosophy.
```
DS01 INFRASTRUCTURE
  Resource Management      User Workflows     Automation
  • Per-user quotas        • Wizards          • Idle cleanup
  • GPU allocation         • Onboarding       • Event logging
  • Priority scheduling    • Project init     • Audit trail
  • MIG partitioning       • Help system      • Cron jobs
        │
        ▼
AIME ML CONTAINERS
  150+ pre-built images • Container lifecycle commands • ~2.5% patched for DS01
        │
        ▼
DOCKER + NVIDIA CONTAINER TOOLKIT
  Container runtime • GPU passthrough • cgroup isolation
```
Each layer depends only on the layer below it, making the system modular and independently testable.
```
COMMAND LAYER MODEL

LAYER 4: Workflow Wizards
  Complete guided experiences for common tasks
  Commands: user-setup, project-init
  Users:    Everyone (students, researchers)
  Purpose:  Zero-to-working in minutes, no Docker knowledge required
        │
        ▼
LAYER 3: Orchestrators
  Multi-step workflows composing atomic commands
  Commands: container-deploy, container-retire, project-launch
  Users:    Power users, researchers
  Purpose:  Common sequences bundled with sensible defaults
        │
        ▼
LAYER 2: Atomic Commands
  Single-purpose utilities for fine-grained control
  Commands: container-create, container-start, container-stop, container-list,
            image-list, image-pull, check-limits, gpu-status
  Users:    Power users, admins
  Purpose:  Building blocks for custom workflows
        │
        ▼
LAYER 1: AIME ML Containers
  Upstream container lifecycle commands (hidden from DS01 users)
  Commands: mlc-create, mlc-start, mlc-stop, mlc-list, mlc-open, mlc-remove, ...
  Users:    DS01 internals only
  Purpose:  Battle-tested container management, 150+ ML images
        │
        ▼
LAYER 0: Docker + NVIDIA Container Toolkit
  Foundational container runtime with GPU support
  Components: dockerd, containerd, nvidia-ctk, systemd cgroups
  Users:      System only
  Purpose:    Industry-standard container execution
```
Why this matters:

- `user-setup` handles SSH keys, VS Code, and first container launch automatically

The GPU allocator implements priority-based scheduling with MIG awareness.
```
GPU ALLOCATION FLOW

User requests GPU(s) via container-create
        │
        ▼
Load user config from resource-limits.yaml
        │
        ▼
Within quota? ──no──▶ Reject request, show quota info
        │
       yes
        ▼
Load GPU state from /var/lib/ds01/gpu-state.json
        │
        ▼
Check MIG config (full GPU vs partition)
        │
        ▼
GPU available? ──no──▶ Queue request, notify user
        │
       yes
        ▼
Acquire file lock → allocate GPU(s) → update gpu-state.json → release lock
        │
        ▼
Set environment: NVIDIA_VISIBLE_DEVICES, CUDA_VISIBLE_DEVICES
        │
        ▼
Launch container → apply cgroup limits → log to events.jsonl
```
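The allocate-under-lock step in this flow can be sketched in Python. This is an illustrative sketch only: the real allocator is `scripts/docker/gpu_allocator.py` and tracks state in `/var/lib/ds01/gpu-state.json`; the path, state schema, and function name used here are stand-ins.

```python
# Sketch: take an exclusive file lock, claim the first free GPU in a JSON
# state file, and return the device environment for the container.
import fcntl
import json
import os

STATE_FILE = "/tmp/ds01-gpu-state-demo.json"  # stand-in for /var/lib/ds01/gpu-state.json

def allocate_gpu(user, total_gpus=4):
    """Claim the first free GPU under an exclusive lock; None if all are busy."""
    if not os.path.exists(STATE_FILE):          # initialise empty state on first use
        with open(STATE_FILE, "w") as f:
            json.dump({}, f)

    with open(STATE_FILE, "r+") as f:
        fcntl.flock(f, fcntl.LOCK_EX)           # serialise concurrent allocators
        try:
            state = json.load(f)
            for gpu_id in map(str, range(total_gpus)):
                if gpu_id not in state:         # free slot found
                    state[gpu_id] = {"user": user}
                    f.seek(0)
                    f.truncate()
                    json.dump(state, f)
                    # The container sees only its assigned device
                    return {
                        "NVIDIA_VISIBLE_DEVICES": gpu_id,
                        "CUDA_VISIBLE_DEVICES": "0",  # renumbered inside the container
                    }
            return None                          # all GPUs busy: caller queues the request
        finally:
            fcntl.flock(f, fcntl.LOCK_UN)
```

The file lock is what makes concurrent `container-create` calls safe: two allocators can never both claim the last free GPU.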
Multiple enforcement mechanisms work together to ensure fair resource sharing.
| | Quota enforcement | Runtime enforcement | Lifecycle enforcement |
|---|---|---|---|
| Mechanism | `resource-limits.yaml`: `gpu_limit`, `cpu_limit`, `memory_limit`, `max_runtime_hrs` | systemd cgroups: CPU shares, memory hard cap, GPU isolation, OOM handling | Idle detection via SSH activity, terminal sessions, process count (default: 48h) |
| Backed by | Priority system: user > group > defaults | Docker wrapper: `--cpus`, `--memory`, `--gpus` flags | Runtime limits: max container lifetime, auto-stop |
| Applied at | Container creation | Container runtime | Cron + systemd timers |
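The runtime-enforcement column can be sketched as a small helper that turns a user's resolved limits into the `docker run` flags named above. The key names mirror `resource-limits.yaml`; the function name is hypothetical, and exact `--gpus` quoting varies by shell.

```python
# Sketch: translate resolved limits into docker run resource flags.
def docker_run_flags(limits, gpu_ids):
    """Build the resource-limiting flags for `docker run`."""
    return [
        f"--cpus={limits['cpu_limit']}",        # CPU cap
        f"--memory={limits['memory_limit']}",   # memory hard cap (OOM kill on breach)
        f"--gpus=device={','.join(gpu_ids)}",   # restrict to the allocated GPUs
    ]

flags = docker_run_flags({"cpu_limit": 8, "memory_limit": "32G"}, ["0"])
# -> ["--cpus=8", "--memory=32G", "--gpus=device=0"]
```

Because the flags are applied at container creation, a user process cannot exceed them later, regardless of what runs inside the container.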
Containers follow a managed lifecycle with automatic cleanup.
```
CONTAINER LIFECYCLE

CREATED ──────▶ RUNNING ──────▶ STOPPED ──────▶ REMOVED
   │               │               │               │
   ▼               ▼               ▼               ▼
• Quota check   • GPU active    • GPU freed     • Logs saved
• GPU alloc     • Cgroups       • Idle timer    • State clean
• State file      enforced        stopped       • Image kept
• Event log     • Monitoring    • Event log       (or pruned)
```

Triggers for RUNNING → STOPPED: user request, idle timeout (48h default), or runtime limit exceeded.
The user-setup wizard automates complete onboarding.
Running `user-setup --guided` walks through five steps:

1. Home dir: create `~/projects/`, `~/data/`, `~/.config/`
2. SSH keys: generate `~/.ssh/id_ed25519`
3. VS Code: configure the Remote-SSH extension
4. Starter container: launch PyTorch + Jupyter with 1 GPU allocated and `~/projects` mounted
5. First steps guide: print the Jupyter URL, SSH command, quick reference, and help resources
Total time: ~10 minutes (no Docker knowledge required)
The four-tier help system supports users from quick lookups to deep learning.
| | `--help` | `--info` | `--concepts` | `--guided` |
|---|---|---|---|---|
| Style | Quick reference | Full docs | Educational | Interactive |
| Covers | Syntax, common flags, exit codes | All options, examples, edge cases | Why (not just how), Docker basics, GPU concepts, best practices | Step-by-step prompts, validation, confirmation |
| Audience | Returning users | Power users | Learners | First-time users |
| Length | ~20 lines | ~100 lines | ~50 lines | Varies |

Detail and interactivity increase from left to right.
The codebase follows a functional organisation matching the layer model.
```
/opt/ds01-infra/
├── scripts/
│   ├── user/                      # Layer 2-4: User-facing commands
│   │   ├── container-create       # L2: Create container with quota enforcement
│   │   ├── container-start        # L2: Start existing container
│   │   ├── container-stop         # L2: Stop container, release GPU
│   │   ├── container-list         # L2: List user's containers
│   │   ├── container-deploy       # L3: Orchestrated create + start + configure
│   │   ├── container-retire       # L3: Orchestrated stop + cleanup
│   │   ├── user-setup             # L4: Complete onboarding wizard
│   │   └── project-init           # L4: Project scaffolding wizard
│   ├── docker/                    # Layer 1: AIME integration + GPU management
│   │   ├── gpu_allocator.py       # GPU state management, MIG support
│   │   ├── mlc-wrapper.sh         # AIME command wrapper with --image patch
│   │   └── quota-check.sh         # Pre-flight resource validation
│   ├── admin/                     # Administrative utilities
│   │   ├── add-user.sh            # Provision new user
│   │   ├── remove-user.sh         # Deprovision user + cleanup
│   │   └── audit-report.sh        # Generate usage reports
│   ├── monitoring/                # Observability
│   │   ├── ds01-dashboard         # Real-time TUI dashboard
│   │   └── prometheus-exporter    # Metrics for Grafana
│   ├── maintenance/               # Automated housekeeping
│   │   ├── idle-cleanup.sh        # Stop idle containers (cron)
│   │   ├── orphan-cleanup.sh      # Remove abandoned resources
│   │   └── log-rotate.sh          # Rotate event logs
│   └── system/                    # One-time setup scripts
│       ├── setup-resource-slices.sh   # Configure systemd cgroups
│       └── deploy-commands.sh         # Symlink commands to PATH
├── config/
│   ├── resource-limits.yaml       # Per-user/group quotas
│   ├── gpu-config.yaml            # MIG partitioning, device mapping
│   └── images.yaml                # Available ML images registry
└── lib/
    ├── common.sh                  # Shared shell functions
    ├── logging.sh                 # Structured event logging
    └── validation.sh              # Input validation helpers
```
Resource limits cascade from defaults through groups to user-specific overrides.
```
CONFIGURATION RESOLUTION

resource-limits.yaml
  ├── defaults:  gpu: 1 · cpu: 8 · memory: 32G · runtime: 48h
  ├── groups:    msc_students (gpu: 1, runtime: 24h)
  │              phd_students (gpu: 2, runtime: 168h)
  └── users:     jane.researcher (gpu: 4, runtime: 336h)

Resolution order:
  1. Check users[username]        → if found, use these values
  2. Check groups[user's groups]  → merge matching group values
  3. Fall back to defaults        → fill any remaining fields

Example: alice (msc_student) resolves to
  gpu_limit: 1   (from group)
  cpu_limit: 8   (from default)
  memory: 32G    (from default)
  runtime: 24h   (from group)
```
All significant actions are logged to a structured event stream for auditing and debugging.
```
EVENT LOGGING FLOW

container-create     container-stop     gpu_allocator     idle-cleanup
(user command)       (user command)     (internal)        (cron job)
       │                   │                 │                 │
       └───────────────────┴────────┬────────┴─────────────────┘
                                    ▼
                 lib/logging.sh :: log_event()
                 fields: timestamp, user, action, resource, details, outcome
                                    │
                                    ▼
                 /var/log/ds01/events.jsonl

  {"ts":"2026-01-14T10:23:45Z","user":"alice","action":"container.create",
   "resource":"pytorch-dev._.alice","gpu":"0","outcome":"success"}
  {"ts":"2026-01-14T10:24:01Z","user":"bob","action":"gpu.allocate",
   "resource":"gpu:1","outcome":"success"}
  {"ts":"2026-01-14T11:00:00Z","user":"system","action":"cleanup.idle",
   "resource":"old-job._.charlie","outcome":"stopped"}

Consumers: admin audit reports · Grafana dashboards · debugging & forensics
```
Log retention: Events are rotated weekly with 4-week retention. Sensitive fields (tokens, passwords) are never logged.
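The `log_event()` helper described above can be sketched in Python for illustration (the real helper lives in `lib/logging.sh`). Field names follow the sample events; the path is a stand-in for `/var/log/ds01/events.jsonl`.

```python
# Sketch: append one structured event per line to a JSONL stream.
import json
from datetime import datetime, timezone

LOG_PATH = "/tmp/ds01-events-demo.jsonl"  # stand-in for /var/log/ds01/events.jsonl

def log_event(user, action, resource, outcome, **details):
    """Append one structured event as a single JSON line."""
    event = {
        "ts": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
        "user": user,
        "action": action,
        "resource": resource,
        **details,                    # optional extras, e.g. gpu="0"
        "outcome": outcome,
    }
    with open(LOG_PATH, "a") as f:    # append-only JSONL stream
        f.write(json.dumps(event) + "\n")

log_event("alice", "container.create", "pytorch-dev._.alice", "success", gpu="0")
```

One JSON object per line keeps the log greppable and trivially parseable by audit scripts and exporters.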
DS01 enforces per-user resource quotas through a centralised YAML configuration:
```yaml
# config/resource-limits.yaml
defaults:
  cpu_limit: 8
  memory_limit: 32G
  gpu_limit: 1
  max_runtime_hours: 48

groups:
  phd_students:
    priority: 80
    gpu_limit: 2
    max_runtime_hours: 168   # 1 week
  msc_students:
    priority: 50
    gpu_limit: 1
    max_runtime_hours: 24

users:
  jane.researcher:
    priority: 100
    gpu_limit: 4
    max_runtime_hours: 336   # 2 weeks
```
Configuration priority: User overrides → Group memberships → Defaults. This allows sensible baseline limits with exceptions for specific needs.
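The resolution order can be sketched as a simple layered merge (keys mirror `resource-limits.yaml`; the function name is illustrative):

```python
# Sketch: user overrides beat group values, which beat defaults.
def resolve_limits(username, user_groups, config):
    """Merge quota values: defaults first, then groups, then user overrides."""
    resolved = dict(config.get("defaults", {}))                 # baseline
    for group in user_groups:                                   # group values
        resolved.update(config.get("groups", {}).get(group, {}))
    resolved.update(config.get("users", {}).get(username, {}))  # user overrides win
    return resolved

config = {
    "defaults": {"cpu_limit": 8, "memory_limit": "32G",
                 "gpu_limit": 1, "max_runtime_hours": 48},
    "groups": {"msc_students": {"gpu_limit": 1, "max_runtime_hours": 24}},
    "users": {},
}
limits = resolve_limits("alice", ["msc_students"], config)
# alice (msc_student): gpu_limit 1 and runtime 24h from the group,
# cpu_limit 8 and memory 32G from the defaults
```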
Enforcement mechanisms:

- GPU allocations are tracked in `/var/lib/ds01/gpu-state.json` to prevent over-allocation

The system is MIG-aware (Multi-Instance GPU), supporting both full GPUs and GPU partitions.
Priority-based GPU scheduling ensures fair access while respecting user quotas.
When a user requests GPUs, the allocator checks the quota and current allocation state, then exposes only the assigned devices to the container via `NVIDIA_VISIBLE_DEVICES`.

Idle containers are automatically stopped, preventing resource waste:
| Mechanism | Default | Purpose |
|---|---|---|
| Idle detection | 48 hours | Stops containers with no SSH/terminal activity |
| Runtime limits | Per-user quota | Enforces maximum container lifetime |
| Scheduled cleanup | Daily cron | Removes orphaned containers and images |
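The idle-detection mechanism in the table can be sketched as a simple timestamp check. This is a hedged sketch: DS01's actual signals (SSH activity, terminal sessions, process count) are abstracted into a single `last_activity` timestamp here.

```python
# Sketch: a container with no recorded activity past the threshold is a
# stop candidate for the cleanup cron job.
from datetime import datetime, timedelta

IDLE_THRESHOLD = timedelta(hours=48)  # DS01's default idle timeout

def is_idle(last_activity, now):
    """True once the container has been inactive past the threshold."""
    return now - last_activity > IDLE_THRESHOLD

now = datetime(2026, 1, 14, 11, 0)
print(is_idle(datetime(2026, 1, 11, 11, 0), now))  # 72 h idle -> True
print(is_idle(datetime(2026, 1, 13, 11, 0), now))  # 24 h idle -> False
```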
All actions are logged to /var/log/ds01/ with user attribution, providing an audit trail.
DS01 prioritises a teaching-oriented UX through a four-tier help system:
| Flag | Purpose | Example |
|---|---|---|
--help |
Quick reference | Command syntax and common options |
--info |
Full documentation | Complete usage guide with examples |
--concepts |
Educational context | Explains why, not just how |
--guided |
Interactive mode | Step-by-step prompts for complex operations |
Example: A student running `container-create --concepts` learns about Docker containers, resource isolation, and GPU allocation (concepts they'll need throughout their ML career) while accomplishing the immediate task.
The user-setup wizard handles complete user onboarding. A new student goes from "I have an account" to "I'm running a Jupyter notebook on a GPU" in under 10 minutes, with no Docker knowledge required.
| Benefit | Without DS01 | With DS01 |
|---|---|---|
| Student onboarding | Hours of setup, CUDA debugging | 10-minute wizard |
| Resource fairness | "First come, first served" chaos | Quota-based allocation |
| Environment management | Conflicting installs, broken dependencies | Isolated containers per user |
| Administrative overhead | Manual monitoring, reactive cleanup | Automated lifecycle management |
| Reproducibility | "Works on the server" ≠ works anywhere | Containerised, portable environments |
DS01 is designed for small university labs and teaching environments. System requirements:
| Component | Requirement |
|---|---|
| Hardware | Linux server with NVIDIA GPUs |
| Software | Docker, NVIDIA Container Toolkit, Python 3.8+ |
| Base framework | AIME ML Containers (installed at /opt/aime-ml-containers) |
| Expertise | Basic Linux sysadmin for initial setup |
DS01 uses a four-phase image building strategy:
```
Framework   ──▶   Jupyter    ──▶   Data Science   ──▶   Use Case
(PyTorch,         (notebook        (pandas,             (project-
 TF, JAX)          server)          sklearn)             specific)
```
Each phase builds on the previous, creating a layered image hierarchy. Users can start from any level depending on their needs, from bare PyTorch to fully-configured project environments.
Containers follow the pattern `{container-name}._.{user-id}`:

```
pytorch-dev._.alice
jupyter-lab._.bob
thesis-experiment._.alice
```

This convention scopes containers per user (`container-list` shows only your containers).

The `ds01-dashboard` command provides real-time visibility:
```
┌──────────────────────────────────────────────────────────────┐
│  DS01 RESOURCE DASHBOARD                      Updated: 14:32 │
├──────────────────────────────────────────────────────────────┤
│  GPU Usage: ███████░░░  3/4 allocated                        │
│  ├─ GPU 0: alice   (pytorch-dev)  - 67% util                 │
│  ├─ GPU 1: bob     (thesis-train) - 94% util                 │
│  ├─ GPU 2: charlie (jupyter)      - 12% util                 │
│  └─ GPU 3: [available]                                       │
│                                                              │
│  Active Containers: 7                                        │
│  Idle Containers:   2 (cleanup in 23h)                       │
│  Pending Requests:  0                                        │
└──────────────────────────────────────────────────────────────┘
```
Administrators can identify underutilised resources, monitor for runaway processes, and plan capacity.
| Solution | Pros | Cons | Best For |
|---|---|---|---|
| Raw Docker | Full control, widely documented | Steep learning curve, no quotas | Expert users |
| Kubernetes | Production-grade orchestration | Massive complexity, overkill for small labs | Large-scale deployments |
| JupyterHub | Familiar notebook interface | Limited to Jupyter, complex GPU config | Teaching-focused, notebook-only |
| Slurm/PBS | HPC-standard, battle-tested | Batch-oriented, dated UX | Traditional HPC workloads |
| DS01 | User-friendly + containerised + fair | Requires AIME base | Small labs, teaching environments |
DS01 occupies a useful middle ground: more structure than raw Docker, less complexity than Kubernetes, and better GPU support than JupyterHub.
```bash
# Clone the repository
git clone https://github.com/hertie-data-science-lab/ds01-infra.git /opt/ds01-infra

# Configure resource limits
vim /opt/ds01-infra/config/resource-limits.yaml

# Set up systemd resource slices
sudo /opt/ds01-infra/scripts/system/setup-resource-slices.sh

# Deploy user-facing commands
sudo /opt/ds01-infra/scripts/system/deploy-commands.sh

# Add a user
sudo /opt/ds01-infra/scripts/admin/add-user.sh alice
```
After admin setup, users run:
```bash
# Complete guided onboarding
user-setup --guided

# Or manually create a container
container-create --name my-project --image pytorch-jupyter
```
Each subsystem has dedicated documentation:
| Path | Contents |
|---|---|
| `scripts/docker/README.md` | GPU allocation algorithm |
| `scripts/user/README.md` | Command layering, UX philosophy |
| `config/README.md` | YAML configuration syntax |
| `scripts/monitoring/README.md` | Dashboard setup |
| `scripts/maintenance/README.md` | Cleanup automation |
Development workflow improvements.

Initial stable release for production use at Hertie School's Data Science Lab:

- `container-deploy` and `container-retire` commands
- Structured event logging to `/var/log/ds01/events.jsonl`

DS01 is under active development, with further features planned.
Contributions welcome; see the GitHub repository for issues and contribution guidelines.
DS01 builds on the excellent AIME ML Containers project, which provides the foundation of pre-built images and container lifecycle management.
Developed for the Hertie School Data Science Lab to support teaching and research in applied machine learning.
Last updated: January 2026