How we optimized NVIDIA H100 GPU usage for LLM workloads

One of the biggest problems when deploying LLM inference systems is not model performance… It's GPU utilization. Even with powerful GPUs like the NVIDIA H100, many teams end up using only 30–40% of the GPU's capacity.

While building an LLM serving system with Ollama, we explored two different GPU sharing strategies:
• MIG (Multi-Instance GPU)
• Time Slicing

And the results were interesting.

Option 1: MIG (Multi-Instance GPU)

MIG allows the GPU to be physically partitioned into smaller GPU instances. Each partition gets:
• Dedicated memory
• Dedicated compute cores
• Dedicated cache

This is great for multi-tenant environments where strict isolation is required. But there is a catch. Once the GPU is partitioned:
• Memory becomes statically allocated
• Idle partitions cannot lend their resources to busy ones
• Large LLM requests may run out of memory inside their partition

For dynamic inference workloads, this can become inefficient.

Option 2: Time Slicing

Instead of splitting the GPU physically, Time Slicing allows multiple processes to share the GPU over time. Each workload gets a small time window to execute, and the GPU scheduler rotates between them.

Advantages:
• The full GPU memory pool is available to the workloads (shared, with no memory isolation between processes)
• Better utilization for burst traffic
• Flexible resource allocation
• Works well for LLM serving workloads

This made a big difference when multiple users were sending requests simultaneously.

Adding an Nginx Layer for Load Handling

To scale the inference system, we placed Nginx in front of the Ollama server. The architecture looked like this:

Client Requests
⬇
Nginx Load Balancer
⬇
Multiple Ollama Workers
⬇
Shared NVIDIA H100 GPU

Nginx helped us:
• Queue incoming requests
• Distribute load across workers
• Prevent server overload
• Improve overall throughput

Where KV Cache Becomes Important

In transformer models, the KV cache stores the attention keys and values of previously processed tokens. This avoids recomputing them for the entire sequence every time a new token is generated.

But here's the challenge: the KV cache consumes a large portion of GPU memory, and it grows with sequence length and with the number of concurrent requests.

With MIG, memory is already partitioned, so each instance gets a limited KV cache space.

With Time Slicing, the GPU memory pool stays flexible, allowing:
• Larger KV cache usage
• Higher token throughput
• More efficient LLM serving

Final Insight

For LLM inference workloads:
🔹 MIG → Better isolation
🔹 Time Slicing → Better utilization

#AIInfrastructure #LLM #NvidiaH100 #GPUOptimization #MLOps #Ollama #DeepLearning #AIEngineering #GPU #AI #Jobs #ML #GenAI #Nvidia #DL #Python #DataScience
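To put rough numbers on the KV cache point, here is a minimal back-of-the-envelope sketch. It uses the standard size estimate for a decoder-only transformer (keys plus values, per layer, per KV head, per token). The model shape (32 layers, 8 KV heads, head dimension 128, fp16 cache), the 5 GiB weight footprint, and the 10 GiB and 80 GiB memory budgets are illustrative assumptions, not measurements from the setup described in the post.

```python
GIB = 1024 ** 3


def kv_cache_bytes(num_layers, num_kv_heads, head_dim, num_tokens, bytes_per_value=2):
    """Approximate bytes needed to cache keys and values for num_tokens tokens."""
    # Factor 2: every layer stores one key tensor and one value tensor.
    return 2 * num_layers * num_kv_heads * head_dim * num_tokens * bytes_per_value


def max_cached_tokens(memory_budget_bytes, weights_bytes, **model_shape):
    """Tokens that fit in the memory left over after loading the model weights."""
    free_bytes = memory_budget_bytes - weights_bytes
    if free_bytes <= 0:
        return 0
    per_token = kv_cache_bytes(num_tokens=1, **model_shape)
    return free_bytes // per_token


# Hypothetical model shape and weight size, chosen only for illustration.
MODEL_SHAPE = {"num_layers": 32, "num_kv_heads": 8, "head_dim": 128}
WEIGHTS_BYTES = 5 * GIB

if __name__ == "__main__":
    # Assumed budgets: a small MIG slice versus the whole (unpartitioned) GPU.
    for label, budget in [("10 GiB MIG slice", 10 * GIB), ("80 GiB full GPU", 80 * GIB)]:
        tokens = max_cached_tokens(budget, WEIGHTS_BYTES, **MODEL_SHAPE)
        print(f"{label}: room for about {tokens} cached tokens")
```

Under these assumptions each cached token costs 128 KiB, so the slice holds far fewer tokens than the unpartitioned GPU. The full-GPU figure is an upper bound: with time slicing the pool is shared by every worker process, and each worker that loads its own copy of the weights reduces the space left for KV cache.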
Virtual Project Management Techniques
-
In our recent work, we introduce a new framework to address the joint optimization problem of minimizing global loss and communication latency in federated learning over wireless networks (FLOWN).

The problem is formulated as a Stackelberg game, where the leader (global model coordinator) aims to minimize the total number of communication rounds required for convergence, and the followers (participating devices) attempt to minimize the latency of each round under energy and bandwidth constraints. Specifically, the leader-level problem focuses on optimizing device selection to improve the convergence rate, while the follower-level problem addresses resource allocation and sub-channel assignment to minimize communication time per round.

The follower-level problem is further decoupled into two sub-problems: a monotonic optimization-based resource allocation problem and a matching-theory-based sub-channel assignment problem. This decomposition enables efficient, iterative solutions that optimize latency while ensuring energy feasibility for each device.

To accelerate convergence, we use the Age of Update (AoU) metric to prioritize the selection of devices with more informative updates. The AoU-based device selection algorithm dynamically ranks devices based on both AoU and data size, ensuring that those with the most significant impact on model convergence are selected in each communication round.

At the follower level, the resource allocation problem is solved using monotonic optimization techniques, which exploit the monotonicity of the non-convex time and energy consumption functions. The sub-channel assignment is tackled using matching theory, where devices are assigned to sub-channels based on incomplete preference lists, ensuring energy-efficient communication under the given resource constraints.

The proposed approach derives an upper bound on the convergence rate, highlighting the trade-off between global loss minimization and latency minimization. The Stackelberg equilibrium is established by iteratively solving the leader and follower problems, ensuring optimal device selection and resource allocation.

Simulation results demonstrate that the AoU-based device selection and optimized resource allocation schemes significantly outperform conventional methods, both in terms of convergence speed and communication efficiency.

Check out the paper at: https://lnkd.in/e6XcuVyq
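The leader-level ranking step can be illustrated with a small sketch. The post says devices are ranked on both AoU and data size but does not give the scoring rule, so the product `AoU * num_samples` used here is an assumption for illustration, as are the `Device` fields and the definition of AoU as the number of rounds since a device was last selected. It is not the paper's algorithm.

```python
from dataclasses import dataclass


@dataclass
class Device:
    device_id: str
    last_selected_round: int  # round in which this device last contributed an update
    num_samples: int          # size of the device's local dataset


def age_of_update(device, current_round):
    """Rounds elapsed since the device last contributed (assumed AoU definition)."""
    return current_round - device.last_selected_round


def select_devices(devices, current_round, k):
    """Pick the k devices with the highest AoU-times-data-size score."""
    ranked = sorted(
        devices,
        key=lambda d: age_of_update(d, current_round) * d.num_samples,
        reverse=True,
    )
    chosen = ranked[:k]
    for device in chosen:
        # Selected devices send a fresh update, so their age resets.
        device.last_selected_round = current_round
    return [device.device_id for device in chosen]


if __name__ == "__main__":
    fleet = [
        Device("a", last_selected_round=0, num_samples=100),
        Device("b", last_selected_round=3, num_samples=500),
        Device("c", last_selected_round=4, num_samples=50),
    ]
    print(select_devices(fleet, current_round=5, k=2))
```

In this toy fleet, device `b` wins on data size and device `a` on staleness, while the recently selected, small device `c` waits for a later round.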
-
Misconfigured Kubernetes resource requests and limits are a primary driver of cloud waste and performance instability. This often leads to either costly over-provisioning or critical application instability due to OOMKills and CPU throttling.

Balancing resource allocation is fundamental for operational efficiency and managing cloud expenditure within any containerized environment. Inadequate resource definitions directly impact node utilization, scheduling efficiency, and application reliability.

Proper definition of resources.requests and resources.limits within your pod specifications is paramount. Requests define guaranteed minimums, influencing pod scheduling. Limits define hard maximums, preventing noisy neighbor issues and resource exhaustion.

# Example: Efficient resource definition for a Kubernetes container
containers:
  - name: my-app
    image: my-repo/my-app:1.0.0
    resources:
      requests:
        cpu: "250m"      # Guaranteed 0.25 CPU core for scheduling
        memory: "512Mi"  # Guaranteed 512 MiB RAM for scheduling
      limits:
        cpu: "500m"      # Capped at 0.5 CPU core (throttled above this), prevents noisy neighbors
        memory: "1Gi"    # Capped at 1 GiB RAM; the container is OOMKilled if it exceeds this

This configuration gives predictable performance while providing headroom, reducing the risk of resource starvation and unnecessary eviction. Over-requesting leads to underutilized nodes; under-requesting leads to unstable applications.

Pro Tip: Do not rely solely on initial estimations. Implement robust monitoring (e.g., Prometheus and Grafana) to track actual pod resource utilization over time. Use this empirical data to continuously fine-tune your requests and limits, applying an iterative, data-driven approach rather than a static "set and forget" strategy. Consider using Kubernetes Vertical Pod Autoscaler (VPA) in recommendation mode to inform these adjustments, but always validate manually for critical workloads.
#DevOps #Kubernetes #CloudNative #ResourceManagement #CostOptimization #PerformanceTuning #CloudArchitecture #SRE #K8s #Containerization #Infrastructure #TechInsight #CloudComputing #FinOps #ReliabilityEngineering #InfrastructureAsCode
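As one way to picture the data-driven tuning loop described in the Pro Tip, here is a minimal sketch that turns observed usage samples into suggested requests and limits. The policy (requests at the 90th percentile, limits at the observed peak plus 20% headroom) and the sample values are assumptions chosen for illustration. This is not how VPA computes its recommendations, and the samples would in practice come from your monitoring system.

```python
def percentile(samples, pct):
    """Nearest-rank percentile of a non-empty list of integer samples."""
    ordered = sorted(samples)
    # Integer ceiling of pct * n / 100, clamped so the rank is at least 1.
    rank = max(1, (pct * len(ordered) + 99) // 100)
    return ordered[rank - 1]


def with_headroom(value, headroom_pct):
    """Value increased by headroom_pct percent, rounded up to an integer."""
    return (value * (100 + headroom_pct) + 99) // 100


def suggest_resources(cpu_millicores, memory_mib, request_pct=90, headroom_pct=20):
    """Suggest requests and limits from usage samples (illustrative policy only)."""
    return {
        "requests": {
            "cpu": f"{percentile(cpu_millicores, request_pct)}m",
            "memory": f"{percentile(memory_mib, request_pct)}Mi",
        },
        "limits": {
            "cpu": f"{with_headroom(max(cpu_millicores), headroom_pct)}m",
            "memory": f"{with_headroom(max(memory_mib), headroom_pct)}Mi",
        },
    }


if __name__ == "__main__":
    # Hypothetical usage samples for one container.
    cpu = [100, 120, 140, 160, 180, 200, 220, 240, 260, 300]
    mem = [300, 320, 340, 360, 380, 400, 420, 440, 460, 512]
    print(suggest_resources(cpu, mem))
```

Treat the output as a starting point for review rather than a value to apply automatically, in line with the advice above to validate manually for critical workloads.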
-
The age-old choice of space or time...

In GPU-as-a-Service (GPUaaS) land, sharing happens in two ways:

1️⃣ Spatial Virtualisation
◆ Multiple users access the GPU simultaneously.
◆ Typically involves fixed GPU slicing (e.g., MIG).
◆ Ideal for inference tasks or shared environments.
Drawback: Rigid partitions, limited flexibility.

2️⃣ Temporal Virtualisation
◆ One user has full GPU access, but only for a set duration.
◆ Ideal for GPU-heavy tasks like training.
◆ Simple and conflict-free.
Drawback: Efficiency drops when demand fluctuates.

» At hosted·ai, our mission is to do both - but better:

✅ Dynamic Spatial Sharing
◆ Fluid, real-time VRAM and TFLOPS allocation.
◆ Burst, scale, pause, and release resources instantly.
◆ No rigid MIG slices or fixed fragments of a vGPU.

✅ Flexible Temporal Sharing
◆ Smooth GPU handoffs between users.
◆ Prioritise VIP users and offer attractive 'spot pricing'.
◆ Optional static sizing for guaranteed resource allocations.

Why does this matter? Customers have diverse needs:
◆ Guaranteed slices.
◆ Flexibility.
◆ Immediate raw GPU power.

Service providers should aim to fully utilise their GPUs to achieve better unit economics. With hosted·ai, you can optimise GPU usage by tetris'ing your cards efficiently, and typically double your margins while halving your prices. It's a win/win.

You shouldn't have to choose - or waste GPU capacity. At hosted·ai, our real-time GPU resource mapping ensures:
» Optimal user experience.
» Maximum profitability.

Just keep your GPUs busy - and profitable.

#HalfPriceDoubleMargin