Posts under the category AI & Machine Learning

I Distilled Myself into a Skill! Open Source Release

Introduction: The Distillation Trend. Hello everyone, I'm programmer Yupi (Fish Pi). Recently, a "distillation" trend has swept through GitHub. No, not distilling alcohol, but distilling people. Colleague.skill, Ex.skill, Nuwa.skill, Boss.skill, Self.skill... All sorts of strange distillation projects are emerging one after another, and everyone is "encapsulating" the people around them into AI skill packages. Some have distilled resigned colleagues, letting AI carry on their work; some have distilled ex-partners, chatting with the AI version to reminisce; ...

Deep Dive into vLLM Weight Loading: From Challenges to Ideal Architecture

Introduction: What Problems Does Weight Loading Solve? Before diving into vLLM's weight loading implementation, it's essential to understand the core challenges it addresses. Large language model weights are typically stored on disk as checkpoint files. The weight loading task seems straightforward: read these files, match tensors by name, and copy data into the model's parameters. However, three critical complexities make this far from simple. Challenge 1: Tensor Sharding and Memory Control in Tensor Parallelism. vLLM supports splitting a model ...
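The naive path the excerpt describes (read checkpoint entries, match by name, copy into parameters) can be sketched in a few lines. This is a toy illustration only, not vLLM's actual loader: the `load_weights` helper, the toy "model" dict, and the list-based "tensors" are all hypothetical stand-ins for real framework objects.

```python
# Minimal sketch of name-based weight loading. Lists stand in for tensors;
# the names "model", "ckpt", and "load_weights" are illustrative, not vLLM API.

def load_weights(model_params, checkpoint):
    """Copy checkpoint entries into model parameters, matching by name."""
    missing, unexpected = [], []
    for name, values in checkpoint.items():
        if name in model_params:
            param = model_params[name]
            if len(param) != len(values):   # shape check before copying
                raise ValueError(f"shape mismatch for {name}")
            param[:] = values               # in-place copy into the parameter
        else:
            unexpected.append(name)         # checkpoint tensor with no home
    for name in model_params:
        if name not in checkpoint:
            missing.append(name)            # parameter never populated
    return missing, unexpected

# Toy model with two "parameters" and a checkpoint that only partially matches
model = {"embed.weight": [0.0, 0.0], "lm_head.weight": [0.0, 0.0]}
ckpt = {"embed.weight": [1.0, 2.0], "extra.bias": [9.9]}

missing, unexpected = load_weights(model, ckpt)
print(missing, unexpected)    # ['lm_head.weight'] ['extra.bias']
print(model["embed.weight"])  # [1.0, 2.0]
```

The complexities the post goes on to cover (tensor-parallel sharding, memory control) arise precisely because this one-to-one name match breaks down once a single checkpoint tensor must be split across devices.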

When AI Agents Extend Call Chains: Latency Becomes a Business

Introduction: The Hidden Cost of Latency. Many teams only truly realize how expensive latency is after their product goes live. A seemingly simple AI Agent request often isn't just a single model call behind the scenes; it's an entire execution chain: the model understands the task, calls tools, reads data, reasons again, calls APIs, and finally generates results. Users only see one answer, but the system may have traveled back and forth between different services a dozen times. If each step adds a little waiting time, what accumulates in the end...
