Building Cross-Project Knowledge Bases for the AI Era with Vault Systems

The Challenge of Modern Learning

The landscape of technical learning is undergoing a profound transformation. Traditional methods—reading books and watching video tutorials—remain valuable, but "project imitation"—deeply studying and replicating excellent open-source projects—has emerged as an increasingly effective approach. Directly running and modifying high-quality open-source code provides the fastest path to understanding real-world engineering practices.

However, this methodology introduces significant challenges that hinder both human learners and AI assistants.

The Fragmentation Problem

Learning materials exist in scattered locations: notes in Obsidian, code repositories across various folders, AI assistant conversation histories in isolated data silos. Each time developers need AI assistance analyzing a project, they must manually copy code snippets and organize context—a tedious and error-prone process.

The Context Discontinuity Issue

AI assistants cannot directly access local learning resources, so background information must be provided again in every conversation. Rapidly evolving codebases make manual synchronization error-prone. Worse still, sharing knowledge across learning projects becomes nearly impossible: design patterns learned in Project A remain unknown to the AI when it analyzes Project B.

These challenges fundamentally stem from "data silos." A unified storage abstraction layer enabling AI assistants to understand and access all learning resources would elegantly solve these problems.

The Vault System Solution

The Vault system architecture, as implemented in the HagiCode project, addresses these pain points through a unified knowledge storage abstraction layer. This design decision transforms the AI learning experience in ways that exceed initial expectations.

About HagiCode

This approach draws from practical experience with HagiCode, an AI code assistant built on the OpenSpec workflow. Its core philosophy enables AI not merely to "speak" but to "act"—directly manipulating code repositories, executing commands, and running tests. The GitHub repository: github.com/HagiCode-org/site

During development, the team recognized that AI assistants require frequent access to various learning resources: code repositories, note documents, configuration files. Requiring manual provision for each interaction creates unacceptable friction, motivating the Vault system design.

Core Architecture

Multi-Type Support

The HagiCode Vault system supports four distinct types, each addressing specific use cases:

Type	Purpose	Typical Scenario
folder	Generic folder type	Temporary learning materials, drafts
coderef	Specialized for code project imitation	Systematic learning of open-source projects
obsidian	Obsidian note software integration	Reusing existing note repositories
system-managed	Managed automatically by the system	Project configurations, prompt templates

The coderef type deserves special attention as the most frequently used type in HagiCode. It provides a standardized directory structure and AI-readable metadata for code project imitation. Why dedicate a specific type? Because imitating an open-source project goes beyond simply downloading code: it involves managing the code itself, learning notes, configuration files, and more. The coderef type standardizes all of these elements.
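As a rough illustration, the four types could be modeled as a TypeScript string union. Only the four type strings come from the table above; the names VAULT_TYPES, VaultType, and isVaultType are hypothetical and not taken from the HagiCode source.

```typescript
// The four vault type strings listed in the table above.
const VAULT_TYPES = ["folder", "coderef", "obsidian", "system-managed"] as const;

// Hypothetical union type derived from the list.
type VaultType = (typeof VAULT_TYPES)[number];

// Hypothetical type guard: narrows an arbitrary string (for example, a value
// read from a registry file) to a known vault type.
function isVaultType(value: string): value is VaultType {
  return (VAULT_TYPES as readonly string[]).includes(value);
}
```

A guard like this lets code that loads untrusted input reject unknown types before doing any type-specific work.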

Persistent Storage Mechanism

The Vault registry persists to the filesystem in JSON format:

_registryFilePath = Path.Combine(absoluteDataDir, "personal-data", "vaults", "registry.json");

This seemingly simple design reflects careful consideration:

Simplicity and Reliability: JSON format remains human-readable, facilitating debugging and manual modification. When systems encounter issues, developers can directly inspect and even manually repair files—particularly valuable during development phases.

Reduced Dependencies: Filesystem storage eliminates database complexity. No additional database services require installation or configuration, reducing system complexity and maintenance overhead.

Concurrency Safety: SemaphoreSlim ensures thread-safe operations. In AI code assistant scenarios, multiple operations may simultaneously access the vault registry, necessitating proper concurrency control.
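The idea behind the concurrency control can be sketched in TypeScript, where a promise chain plays a role similar to a single-slot SemaphoreSlim. This is a minimal sketch under stated assumptions, not the HagiCode implementation: RegistryStore, VaultEntry, SerialQueue, and VaultRegistry are all hypothetical names, and the store is an abstract interface standing in for reads and writes of registry.json.

```typescript
// Hypothetical storage interface; a real one would read and write registry.json.
interface RegistryStore {
  load(): Promise<string>;
  save(json: string): Promise<void>;
}

// Hypothetical registry entry shape.
interface VaultEntry {
  id: string;
  name: string;
}

// Runs async tasks one at a time, similar in spirit to SemaphoreSlim(1, 1).
class SerialQueue {
  private tail: Promise<unknown> = Promise.resolve();

  run<T>(task: () => Promise<T>): Promise<T> {
    const result = this.tail.then(() => task());
    // Swallow failures on the chain itself so one failed task does not block later ones.
    this.tail = result.catch(() => undefined);
    return result;
  }
}

class VaultRegistry {
  private readonly queue = new SerialQueue();
  private readonly store: RegistryStore;

  constructor(store: RegistryStore) {
    this.store = store;
  }

  // The whole read-modify-write cycle runs inside the queue, so concurrent
  // calls cannot overwrite each other's changes.
  add(entry: VaultEntry): Promise<void> {
    return this.queue.run(async () => {
      const entries: VaultEntry[] = JSON.parse((await this.store.load()) || "[]");
      entries.push(entry);
      await this.store.save(JSON.stringify(entries, null, 2));
    });
  }
}
```

Without the queue, two overlapping add calls could both load the same snapshot and the later save would silently drop the earlier entry.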

AI Context Integration

The system's core capability lies in automatically injecting vault information into AI proposal contexts:

export function buildTargetVaultsText(
  vaults: VaultForText[],
  template: VaultPromptTemplate = DEFAULT_VAULT_PROMPT_TEMPLATE,
): string {
  const readOnlyVaults = vaults.filter((vault) => vault.accessType === 'read');
  const editableVaults = vaults.filter((vault) => vault.accessType === 'write');

  const sections = [
    buildVaultSection(readOnlyVaults, template.reference),
    buildVaultSection(editableVaults, template.editable),
  ].filter(Boolean);

  return `\n\n### ${template.heading}\n\n${sections.join('\n')}`;
}

This enables AI assistants to understand the available learning resources without manual context provision. The difference in experience is substantial: when instructed to "analyze React's concurrent rendering," the AI automatically locates the previously registered React learning vault, eliminating repetitive code pasting.
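The excerpt above calls a buildVaultSection helper that is not shown. One plausible shape, offered only as an assumption, is a function that renders a labeled bullet list and returns an empty string for an empty list, which would explain why the caller applies filter(Boolean). The VaultForText fields and the string-typed label below are guesses for illustration; the real HagiCode types may differ.

```typescript
// Assumed shape; the real VaultForText type may carry more or different fields.
interface VaultForText {
  name: string;
  physicalPath: string;
  accessType: "read" | "write";
}

// Hypothetical helper: renders one labeled list of vaults, or "" when the list
// is empty so that a caller's filter(Boolean) drops the section entirely.
function buildVaultSection(vaults: VaultForText[], label: string): string {
  if (vaults.length === 0) return "";
  const lines = vaults.map((vault) => `- ${vault.name}: ${vault.physicalPath}`);
  return [`${label}:`, ...lines].join("\n");
}
```

Under these assumptions, a single read-only vault would produce a two-line section: the label followed by one bullet with the vault name and path.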

Access Control Mechanism

The system categorizes vaults into two access types:

Reference (Read-Only, accessType "read"): The AI uses the content for analysis and understanding only and cannot modify it
Editable (accessType "write"): The AI can modify content based on task requirements

This distinction tells the AI which content is read-only reference material and which is modifiable, preventing accidental modifications. For instance, an open-source project vault registered as learning material should be marked as reference so the AI does not casually modify its code. Conversely, marking a personal project vault as editable lets the AI assist with code modifications.
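The text above describes the access type as information passed to the AI; it does not say whether HagiCode also enforces it in code. If one wanted a programmatic check as well, a guard along these lines could run before any write. The VaultAccess shape and the assertEditable name are hypothetical.

```typescript
type AccessType = "read" | "write";

// Hypothetical minimal view of a vault for access checks.
interface VaultAccess {
  name: string;
  accessType: AccessType;
}

// Hypothetical guard: throws before any write to a read-only (reference) vault.
function assertEditable(vault: VaultAccess): void {
  if (vault.accessType !== "write") {
    throw new Error(`Vault "${vault.name}" is read-only reference material.`);
  }
}
```

Calling the guard at the top of every write path keeps the rule in one place instead of relying on the prompt alone.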

Practical Implementation

Standardized CodeRef Vault Structure

For coderef-type vaults, the system provides a standardized directory structure:

my-coderef-vault/
├── index.yaml          # Vault metadata description
├── AGENTS.md           # AI assistant operation guidelines
├── docs/               # Learning notes and documentation
└── repos/              # Cloned repositories via Git submodules

This structure embodies deliberate design philosophy:

docs/ stores learning notes in Markdown format, recording code understanding, architectural analysis, and troubleshooting experiences. These notes serve both human learners and AI assistants—automatically referenced during related task processing.

repos/ manages imitated repositories through Git submodules rather than direct code copying. This approach offers two advantages: staying synchronized with upstream (git submodule update --remote fetches the latest upstream commits) and keeping the vault's own repository small, because the parent repository records only a pinned commit for each submodule (so different vaults can also pin different versions of the same repository).

index.yaml contains vault metadata, enabling AI assistants to quickly understand purpose and content—essentially providing the vault with a "self-introduction."

AGENTS.md serves as guidelines specifically for AI assistants, explaining how to process vault content. Instructions might include: "Focus on performance optimization-related code when analyzing this project" or "Do not modify test files."
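To make the layout concrete, here is a small sketch that checks whether a vault root contains the four standard entries. The entry names come from the directory tree above; the CODEREF_LAYOUT constant and the missingCodeRefEntries function are hypothetical, and the caller is assumed to supply the names found in the vault root (with a trailing slash on directories).

```typescript
// Entries of the standardized layout shown above (trailing "/" marks a directory).
const CODEREF_LAYOUT = ["index.yaml", "AGENTS.md", "docs/", "repos/"] as const;

// Hypothetical check: given the names found in a vault root, report which
// standard entries are missing.
function missingCodeRefEntries(present: string[]): string[] {
  const found = new Set(present);
  return CODEREF_LAYOUT.filter((entry) => !found.has(entry));
}
```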

Creating and Using Vaults

Creating a CodeRef vault proves straightforward:

const createCodeRefVault = async () => {
  const response = await VaultService.postApiVaults({
    requestBody: {
      name: "React Learning Vault",
      type: "coderef",
      physicalPath: "/Users/developer/vaults/react-learning",
      gitUrl: "https://github.com/facebook/react.git"
    }
  });

  // System automatically:
  // 1. Clones React repository to vault/repos/react
  // 2. Creates docs/ directory for notes
  // 3. Generates index.yaml metadata
  // 4. Creates AGENTS.md guideline file

  return response;
};

Then reference this vault in AI proposals:

const proposal = composeProposalChiefComplaint({
  chiefComplaint: "Help me analyze React's concurrent rendering mechanism",
  repositories: [
    { id: "react", gitUrl: "https://github.com/facebook/react.git" }
  ],
  vaults: [
    {
      id: "react-learning",
      name: "React Learning Vault",
      type: "coderef",
      physicalPath: "/Users/developer/vaults/react-learning",
      accessType: "read" // AI can only read, not modify
    }
  ],
  quickRequestText: "Focus on fiber architecture and scheduler implementation"
});

Typical Usage Scenarios

Scenario 1: Systematic Open-Source Project Learning

Create a CodeRef vault managing target repositories via Git submodules, recording learning notes in the docs/ directory. AI simultaneously accesses code and notes, providing more precise analysis. Notes written while learning specific modules are automatically referenced by AI during subsequent related code analysis—like having an "assistant" remembering previous insights.

Scenario 2: Obsidian Note Repository Reuse

Existing Obsidian users can directly register their vaults with HagiCode. AI gains direct knowledge base access without manual copy-pasting. This functionality proves particularly practical for developers with years of accumulated notes—AI can "read" and understand established knowledge systems.

Scenario 3: Cross-Project Knowledge Reuse

Multiple AI proposals can reference the same vault, enabling knowledge reuse across projects. Creating a "design patterns learning vault" containing notes and code examples for various patterns enables AI to reference this content regardless of which project is being analyzed—knowledge accumulation need not repeat.

Path Security Mechanism

The system strictly validates paths, preventing path traversal attacks:

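private static string ResolveFilePath(string vaultRoot, string relativePath)
{
  // Normalize the root and append a separator so that "/vault" does not
  // match a sibling directory such as "/vault-other".
  var rootPath = EnsureTrailingSeparator(Path.GetFullPath(vaultRoot));
  // GetFullPath collapses "." and ".." segments before the check.
  var combinedPath = Path.GetFullPath(Path.Combine(rootPath, relativePath));
  // Note: OrdinalIgnoreCase suits case-insensitive filesystems; on
  // case-sensitive filesystems, StringComparison.Ordinal is the stricter choice.
  if (!combinedPath.StartsWith(rootPath, StringComparison.OrdinalIgnoreCase))
  {
    throw new BusinessException(VaultRelativePathTraversalCode,
      "Vault file paths must stay inside the registered vault root.");
  }
  return combinedPath;
}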

This ensures all file operations remain within vault root directory boundaries, preventing malicious path access. Security cannot be compromised—AI assistants operating on filesystems require clearly defined boundaries.
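The same normalize-then-prefix-check idea can be sketched in TypeScript. This is an illustrative counterpart for absolute POSIX-style roots, not code from HagiCode: normalizePosix and resolveInsideVault are hypothetical names, the comparison is case-sensitive, and an absolute relativePath is simply treated as relative to the root. A production version would more likely rely on the platform's path library.

```typescript
// Lexically resolves "." and ".." segments of a POSIX-style path.
function normalizePosix(path: string): string {
  const out: string[] = [];
  for (const segment of path.split("/")) {
    if (segment === "" || segment === ".") continue;
    if (segment === "..") out.pop();
    else out.push(segment);
  }
  return "/" + out.join("/");
}

// Hypothetical counterpart of ResolveFilePath: returns the normalized path or
// throws when it would land outside the vault root.
function resolveInsideVault(vaultRoot: string, relativePath: string): string {
  const root = normalizePosix(vaultRoot);
  // The trailing separator stops "/data/vault" from matching "/data/vault-evil".
  const rootWithSep = root === "/" ? "/" : root + "/";
  const combined = normalizePosix(root + "/" + relativePath);
  if (!combined.startsWith(rootWithSep)) {
    throw new Error("Vault file paths must stay inside the registered vault root.");
  }
  return combined;
}
```

A check like this is purely lexical: it does not resolve symbolic links, so a link inside the vault that points elsewhere would need separate handling.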

Important Considerations

When using the HagiCode Vault system, several points require special attention:

Path Security: Ensure custom paths fall within permitted ranges, or the system will reject operations. This prevents accidental operations and potential security risks.

Git Submodule Management: For CodeRef vaults, Git submodules are recommended over direct code copying, for the synchronization and space benefits mentioned earlier. However, submodules have their own workflow, which first-time users will need to learn.

File Preview Limitations: The system limits file previews by size (256KB per file) and quantity (500 files). These limits exist for performance reasons; files that exceed them must be split manually or processed in batches.

Diagnostic Information: Vault creation returns diagnostic information useful for debugging failures. When encountering issues, examine diagnostic information first—most problems reveal clues there.
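For the file preview limits mentioned above, a client could pre-filter candidates before requesting previews. The sketch below is an assumption about how such a filter might look: the two numbers come from the text, but treating 256KB as 256 * 1024 bytes, the FileInfo shape, and the selectPreviewable function are all illustrative choices, and the text does not say over what scope the 500-file limit is counted.

```typescript
const MAX_PREVIEW_BYTES = 256 * 1024; // 256KB per file (assumed to mean 262,144 bytes)
const MAX_PREVIEW_FILES = 500; // file count limit from the text

// Hypothetical file descriptor.
interface FileInfo {
  path: string;
  sizeBytes: number;
}

// Hypothetical selection: skip oversized files and stop accepting once the
// file count limit is reached; everything else is reported as skipped.
function selectPreviewable(files: FileInfo[]): { accepted: FileInfo[]; skipped: FileInfo[] } {
  const accepted: FileInfo[] = [];
  const skipped: FileInfo[] = [];
  for (const file of files) {
    if (file.sizeBytes <= MAX_PREVIEW_BYTES && accepted.length < MAX_PREVIEW_FILES) {
      accepted.push(file);
    } else {
      skipped.push(file);
    }
  }
  return { accepted, skipped };
}
```

The skipped list gives the caller a concrete set of files to split or process in a later batch.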

Summary

The HagiCode Vault system fundamentally addresses a simple yet profound question: How can AI assistants understand and utilize local knowledge resources?

Through unified storage abstraction, standardized directory structures, and automated context injection, the system achieves "register once, reuse everywhere" knowledge management. After creating a vault—whether learning notes, code repositories, or documentation—AI automatically accesses and understands the content.

The experiential improvement proves significant. Manual code snippet copying and repetitive background explanations become unnecessary. AI assistants resemble truly informed colleagues, providing more valuable assistance based on existing knowledge.

The Vault system was refined through actual HagiCode development, in response to practical challenges the team encountered. If the design proves useful to you, HagiCode itself may be worth a look.