Building Cross-Project Knowledge Bases for the AI Era with Vault Systems

In the rapidly evolving landscape of artificial intelligence, the way we learn and absorb new technologies is undergoing a profound transformation. Traditional methods of reading books and watching tutorials remain valuable, but a new approach has emerged as increasingly effective: learning through imitation of real-world projects. This methodology involves deeply studying and analyzing the code, architecture, and design patterns of high-quality open-source projects. By directly running and modifying well-crafted codebases, developers can rapidly understand engineering practices as they exist in production environments.

However, this powerful learning approach introduces significant challenges that must be addressed for AI assistants to reach their full potential.

The Challenge of Fragmented Learning Resources

One of the most pressing issues modern learners face is the dispersion of educational materials across multiple platforms and formats. Notes might live in Obsidian, code repositories sit scattered across various folders, and AI assistant conversation histories exist as isolated data silos. Every time developers need AI assistance to analyze a particular project, they must manually copy code snippets and organize context, a tedious and time-consuming process that interrupts the learning flow.

Furthermore, context frequently breaks down during AI interactions. AI assistants cannot directly access local learning resources, requiring users to repeatedly provide background information with each new conversation. The rapid pace of code repository updates makes manual synchronization error-prone. Perhaps most problematic is the difficulty in sharing knowledge across multiple learning projects—design patterns learned in Project A remain completely unknown to the AI when working on Project B.

The Root Cause: Data Silos

These challenges fundamentally stem from what we call "data silos"—isolated pockets of information that cannot communicate with each other. The solution lies in creating a unified storage abstraction layer that enables AI assistants to understand and access all learning resources seamlessly. This architectural approach eliminates the barriers between different types of learning materials, allowing for true cross-project knowledge reuse.

To address these pain points effectively, the HagiCode project made a critical design decision: building a Vault system that serves as a unified knowledge storage abstraction layer. The impact of this decision extends far beyond initial expectations, fundamentally transforming how AI assistants interact with local knowledge resources.

About HagiCode

The solutions presented in this article draw from practical experience gained during the development of HagiCode, an AI code assistant built on the OpenSpec workflow paradigm. HagiCode's core philosophy centers on enabling AI not just to "speak" but to "act"—directly manipulating code repositories, executing commands, and running tests. The project is available on GitHub at github.com/HagiCode-org/site.

During HagiCode's development, the team discovered that AI assistants needed frequent access to various user learning resources: code repositories, note documents, configuration files, and more. Requiring users to manually provide this information each time resulted in a suboptimal experience. This realization drove the design of the Vault system.

Core Design Principles

Multi-Type Support Architecture

HagiCode's Vault system supports four distinct types, each corresponding to different use cases:

Type	Purpose	Typical Scenario
folder	General-purpose folder type	Temporary learning materials, drafts
coderef	Specialized for code project imitation	Systematic learning of open-source projects
obsidian	Integration with Obsidian note-taking software	Reuse of existing note libraries
system-managed	System automatically managed	Project configurations, prompt templates

Among these, the coderef type is the most frequently used category in HagiCode, providing a standardized directory structure and AI-readable metadata descriptions for code project imitation. The rationale for this specialized type is that imitating an open-source project involves more than simply "downloading code." It requires managing the code itself, learning notes, configuration files, and various other content types together, and the coderef structure standardizes all of these elements.
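The four types in the table can be modeled as a closed set of string literals. The following is a minimal sketch, not HagiCode's actual type definitions: only the four type identifiers come from the table, while VaultTypeInfo, VAULT_TYPES, isVaultType, and the userManaged flag are hypothetical names introduced for illustration.

```typescript
// The four identifiers come from the table above; everything else is illustrative.
type VaultType = "folder" | "coderef" | "obsidian" | "system-managed";

interface VaultTypeInfo {
  purpose: string;
  // Assumption: only system-managed vaults are maintained by the system itself.
  userManaged: boolean;
}

const VAULT_TYPES: Record<VaultType, VaultTypeInfo> = {
  folder: { purpose: "General-purpose folder", userManaged: true },
  coderef: { purpose: "Code project imitation", userManaged: true },
  obsidian: { purpose: "Existing Obsidian note library", userManaged: true },
  "system-managed": {
    purpose: "Project configurations, prompt templates",
    userManaged: false,
  },
};

// Narrow an untrusted string (for example, from a request body) to a VaultType.
function isVaultType(value: string): value is VaultType {
  return Object.prototype.hasOwnProperty.call(VAULT_TYPES, value);
}
```

Using a string-literal union lets the compiler flag any code path that forgets to handle one of the four types.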

Persistent Storage Mechanism

The Vault registry employs JSON format for persistent storage to the filesystem:

_registryFilePath = Path.Combine(absoluteDataDir, "personal-data", "vaults", "registry.json");

This seemingly simple design choice resulted from careful consideration of multiple factors:

Simplicity and Reliability. JSON format offers human readability, facilitating debugging and manual modifications. When system issues arise, developers can directly open the file to examine state or even perform manual repairs—an especially valuable capability during development phases.

Reduced Dependencies. Filesystem storage eliminates database complexity. No additional database service installation or configuration is required, reducing system complexity and maintenance overhead significantly.

Concurrency Safety. A SemaphoreSlim guards registry operations so that they are thread-safe. In AI code assistant scenarios, multiple operations may access the vault registry at the same time, so reads and writes need robust concurrency control.
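To illustrate why a human-readable JSON registry is easy to inspect and repair, here is a minimal TypeScript sketch of serializing and defensively parsing such a file. The real registry.json schema is not shown in this article, so the VaultRecord and VaultRegistry shapes, the version field, and both function names are assumptions. File I/O and locking are left out to keep the sketch self-contained.

```typescript
// Hypothetical registry shape; the real registry.json schema is not shown here.
interface VaultRecord {
  id: string;
  name: string;
  type: string;
  physicalPath: string;
}

interface VaultRegistry {
  version: number;
  vaults: VaultRecord[];
}

// Pretty-print with two-space indentation so the file stays easy to read and hand-edit.
function serializeRegistry(registry: VaultRegistry): string {
  return JSON.stringify(registry, null, 2) + "\n";
}

// Parse registry text defensively, since a hand-edited file may be malformed.
function parseRegistry(text: string): VaultRegistry {
  const data: unknown = JSON.parse(text); // throws SyntaxError on invalid JSON
  if (typeof data !== "object" || data === null) {
    throw new Error("Registry must be a JSON object.");
  }
  const { version, vaults } = data as { version?: unknown; vaults?: unknown };
  if (typeof version !== "number" || !Array.isArray(vaults)) {
    throw new Error("Registry needs a numeric version and a vaults array.");
  }
  const fields = ["id", "name", "type", "physicalPath"] as const;
  const seen = new Set<string>();
  for (const entry of vaults) {
    if (typeof entry !== "object" || entry === null) {
      throw new Error("Each vault entry must be an object.");
    }
    const record = entry as Record<string, unknown>;
    for (const field of fields) {
      if (typeof record[field] !== "string") {
        throw new Error(`Vault entry is missing string field "${field}".`);
      }
    }
    const id = record.id as string;
    if (seen.has(id)) throw new Error(`Duplicate vault id "${id}".`);
    seen.add(id);
  }
  return { version, vaults: vaults as VaultRecord[] };
}
```

Validating on load means a mistake made during a manual repair surfaces as a clear error message rather than as confusing behavior later.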

AI Context Integration

The system's core capability lies in its ability to automatically inject vault information into AI proposal contexts:

export function buildTargetVaultsText(
  vaults: VaultForText[],
  template: VaultPromptTemplate = DEFAULT_VAULT_PROMPT_TEMPLATE,
): string {
  const readOnlyVaults = vaults.filter((vault) => vault.accessType === 'read');
  const editableVaults = vaults.filter((vault) => vault.accessType === 'write');

  const sections = [
    buildVaultSection(readOnlyVaults, template.reference),
    buildVaultSection(editableVaults, template.editable),
  ].filter(Boolean);

  return `\n\n### ${template.heading}\n\n${sections.join('\n')}`;
}

This architectural decision enables AI assistants to automatically understand available learning resources without requiring manual context provision from users each time. The design makes HagiCode's experience remarkably natural—simply telling the AI "help me analyze React's concurrent rendering" allows the AI to automatically locate previously registered React learning vaults rather than requiring repeated code pasting.
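The function above depends on types and a helper that the article does not show. The following self-contained sketch fills them in so the flow can be run end to end. Only accessType, heading, reference, and editable come from the original snippet. The remaining fields, the default template strings, and the body of buildVaultSection are assumptions made for illustration.

```typescript
// Hypothetical shapes: only accessType, heading, reference, and editable
// appear in the article's snippet.
interface VaultForText {
  name: string;
  physicalPath: string;
  accessType: "read" | "write";
}

interface VaultPromptTemplate {
  heading: string;
  reference: string;
  editable: string;
}

// Assumed template text; the real default template is not shown in the article.
const DEFAULT_VAULT_PROMPT_TEMPLATE: VaultPromptTemplate = {
  heading: "Target Vaults",
  reference: "Reference vaults (read-only):",
  editable: "Editable vaults:",
};

// Assumed helper: renders one labelled bullet list, or "" for an empty group
// so that .filter(Boolean) drops the section.
function buildVaultSection(vaults: VaultForText[], label: string): string {
  if (vaults.length === 0) return "";
  const lines = vaults.map((vault) => `- ${vault.name} (${vault.physicalPath})`);
  return [label, ...lines].join("\n");
}

// Same logic as the function shown in the article.
function buildTargetVaultsText(
  vaults: VaultForText[],
  template: VaultPromptTemplate = DEFAULT_VAULT_PROMPT_TEMPLATE,
): string {
  const readOnlyVaults = vaults.filter((vault) => vault.accessType === "read");
  const editableVaults = vaults.filter((vault) => vault.accessType === "write");
  const sections = [
    buildVaultSection(readOnlyVaults, template.reference),
    buildVaultSection(editableVaults, template.editable),
  ].filter(Boolean);
  return `\n\n### ${template.heading}\n\n${sections.join("\n")}`;
}

const text = buildTargetVaultsText([
  { name: "React Learning Vault", physicalPath: "/vaults/react-learning", accessType: "read" },
  { name: "My App", physicalPath: "/projects/my-app", accessType: "write" },
]);
console.log(text);
```

The resulting string is appended to the proposal prompt, which is how the AI learns which vaults exist and how each one may be used.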

Access Control Mechanisms

The system categorizes vaults into two access types:

Reference (Read-Only): The AI uses the content solely for analysis and understanding and cannot modify it
Editable: AI can modify content based on task requirements

This distinction tells the AI which content is read-only reference material and which content it may modify, reducing the risk of accidental changes. For instance, an open-source project registered as learning material should be marked as reference so that the AI does not casually modify its code. Conversely, personal project vaults can be marked as editable, enabling AI assistance with code modifications.
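The article describes the access type as information passed to the AI. Whether HagiCode also enforces it in code is not stated. As a sketch of what such enforcement could look like, the hypothetical guard below rejects modifications to reference vaults before any file is touched. VaultAccess, Operation, isAllowed, and assertAllowed are all illustrative names.

```typescript
// "read" and "write" match the accessType values used elsewhere in the article.
type AccessType = "read" | "write";
type Operation = "read" | "modify";

interface VaultAccess {
  id: string;
  accessType: AccessType;
}

// Reading is always allowed; modifying requires an editable ("write") vault.
function isAllowed(vault: VaultAccess, operation: Operation): boolean {
  return operation === "read" || vault.accessType === "write";
}

// Hypothetical guard to call before any operation that would change vault files.
function assertAllowed(vault: VaultAccess, operation: Operation): void {
  if (!isAllowed(vault, operation)) {
    throw new Error(`Vault "${vault.id}" is reference-only and cannot be modified.`);
  }
}
```

A guard like this gives a second line of defense in case the model ignores the read-only hint in its prompt.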

Practical Implementation Guide

Standardized CodeRef Vault Structure

For coderef-type vaults, the system provides a standardized directory structure:

my-coderef-vault/
├── index.yaml          # Vault metadata description
├── AGENTS.md           # AI assistant operation guidelines
├── docs/               # Learning notes and documentation
└── repos/              # Code repositories managed via Git submodules

This structure embodies several key design principles:

The docs/ directory stores learning notes in Markdown format, recording code understanding, architecture analysis, and troubleshooting experiences. These notes serve both human readers and AI comprehension—automatically referenced when handling related tasks.

The repos/ directory manages imitated repositories through Git submodules rather than direct code copying. This approach offers two significant advantages: maintaining synchronization with upstream repositories (running git submodule update --remote retrieves the latest upstream code) and conserving storage space (multiple vaults can reference different versions of the same repository).

The index.yaml file contains vault metadata, enabling AI assistants to quickly understand purpose and content. Effectively, it provides the vault with a "self-introduction," allowing the AI to immediately grasp its function upon first encounter.

The AGENTS.md file serves as a guide specifically written for AI assistants, explaining how to handle vault content. Developers can specify instructions such as "focus on performance optimization-related code when analyzing this project" or "do not modify test files."
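A tool that registers or repairs a coderef vault might want to check that the four expected entries exist. The sketch below encodes the layout from the directory tree above and reports what is missing from a top-level listing. The entry names come from the article. The checker, its types, and its name are hypothetical.

```typescript
interface LayoutEntry {
  path: string;
  kind: "file" | "directory";
}

// Entries taken from the directory tree above.
const CODEREF_LAYOUT: LayoutEntry[] = [
  { path: "index.yaml", kind: "file" },
  { path: "AGENTS.md", kind: "file" },
  { path: "docs", kind: "directory" },
  { path: "repos", kind: "directory" },
];

// Return the expected entries that are absent from a top-level listing of the vault.
function missingCodeRefEntries(listing: LayoutEntry[]): string[] {
  const present = new Set(listing.map((entry) => `${entry.kind}:${entry.path}`));
  return CODEREF_LAYOUT
    .filter((entry) => !present.has(`${entry.kind}:${entry.path}`))
    .map((entry) => entry.path);
}
```

An empty result means the vault matches the standard layout. Any returned names can be created or reported to the user.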

Creating and Using Vaults

Creating a CodeRef vault follows a straightforward process:

const createCodeRefVault = async () => {
  const response = await VaultService.postApiVaults({
    requestBody: {
      name: "React Learning Vault",
      type: "coderef",
      physicalPath: "/Users/developer/vaults/react-learning",
      gitUrl: "https://github.com/facebook/react.git"
    }
  });

  // The system automatically:
  // 1. Clones the React repository to vault/repos/react
  // 2. Creates docs/ directory for notes
  // 3. Generates index.yaml metadata
  // 4. Creates AGENTS.md guide file

  return response;
};

Subsequently, reference this vault in AI proposals:

const proposal = composeProposalChiefComplaint({
  chiefComplaint: "Help me analyze React's concurrent rendering mechanism",
  repositories: [
    { id: "react", gitUrl: "https://github.com/facebook/react.git" }
  ],
  vaults: [
    {
      id: "react-learning",
      name: "React Learning Vault",
      type: "coderef",
      physicalPath: "/Users/developer/vaults/react-learning",
      accessType: "read" // AI can only read, not modify
    }
  ],
  quickRequestText: "Focus on fiber architecture and scheduler implementation"
});

Typical Usage Scenarios

Scenario One: Systematic Open-Source Project Learning

Create a CodeRef vault managing the target repository through Git submodules, recording learning notes in the docs/ directory. The AI can simultaneously access both code and notes, providing more precise analysis. Notes written while studying specific modules are automatically referenced by the AI during subsequent related code analysis—functioning like an "assistant" that remembers previous thinking processes.

Scenario Two: Obsidian Note Library Reuse

For users already managing notes in Obsidian, directly register existing vaults with HagiCode. The AI can access the knowledge base without manual copy-paste operations. This functionality proves particularly practical for individuals with accumulated note libraries spanning years—once connected, the AI can "read" and understand the entire knowledge system.

Scenario Three: Cross-Project Knowledge Reuse

Multiple AI proposals can reference the same vault, enabling knowledge reuse across projects. For instance, creating a "design patterns learning vault" containing notes and code examples for various design patterns allows the AI to reference this vault's content regardless of which project is being analyzed—eliminating redundant knowledge accumulation.

Path Security Mechanisms

The system implements strict path validation to prevent path traversal attacks:

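private static string ResolveFilePath(string vaultRoot, string relativePath)
{
  // Canonicalize the root; the trailing separator keeps a sibling such as
  // "vault-other" from passing a prefix check against "vault".
  var rootPath = EnsureTrailingSeparator(Path.GetFullPath(vaultRoot));
  // GetFullPath collapses "." and ".." segments before the comparison.
  var combinedPath = Path.GetFullPath(Path.Combine(rootPath, relativePath));
  // OrdinalIgnoreCase suits case-insensitive filesystems; on a case-sensitive
  // filesystem, StringComparison.Ordinal would be the stricter choice.
  if (!combinedPath.StartsWith(rootPath, StringComparison.OrdinalIgnoreCase))
  {
    throw new BusinessException(VaultRelativePathTraversalCode,
      "Vault file paths must stay inside the registered vault root.");
  }
  return combinedPath;
}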

This ensures all file operations remain within the vault's root directory scope, preventing malicious path access. Security considerations cannot be compromised—AI assistants operating on filesystems must have clearly defined boundaries.
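The same containment check can be expressed in TypeScript. The following is a simplified port under stated assumptions, not HagiCode's code. It handles POSIX-style paths only, compares case-sensitively (unlike the C# version), and does not resolve symbolic links. normalizePosix and resolveInsideVault are hypothetical names.

```typescript
// Normalize an absolute POSIX-style path by resolving "." and ".." segments.
function normalizePosix(path: string): string {
  const out: string[] = [];
  for (const segment of path.split("/")) {
    if (segment === "" || segment === ".") continue;
    if (segment === "..") out.pop();
    else out.push(segment);
  }
  return "/" + out.join("/");
}

// Resolve relativePath against vaultRoot and reject results outside the root.
function resolveInsideVault(vaultRoot: string, relativePath: string): string {
  if (!vaultRoot.startsWith("/")) {
    throw new Error("vaultRoot must be an absolute path.");
  }
  const root = normalizePosix(vaultRoot);
  const rootWithSeparator = root === "/" ? "/" : root + "/";
  // A rooted relativePath replaces the root, mirroring Path.Combine in .NET.
  const joined = relativePath.startsWith("/")
    ? relativePath
    : rootWithSeparator + relativePath;
  const combined = normalizePosix(joined);
  // Comparing against the root plus a separator rejects siblings such as
  // "/vaults/react-evil" when the root is "/vaults/react".
  if (combined !== root && !combined.startsWith(rootWithSeparator)) {
    throw new Error("Vault file paths must stay inside the registered vault root.");
  }
  return combined;
}
```

For example, resolveInsideVault("/vaults/react", "docs/notes.md") returns "/vaults/react/docs/notes.md", while "../react-evil/x" and "/etc/passwd" are both rejected.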

Important Considerations

When using the HagiCode Vault system, several points require special attention:

Path Security: Ensure custom paths fall within permitted ranges; otherwise, the system will reject operations. This prevents accidental operations and potential security risks.
Git Submodule Management: For CodeRef vaults, Git submodules are recommended over direct code copying. Benefits include maintained synchronization and conserved space. However, submodules have their own workflow, which may take some familiarization for first-time users.
File Preview Limitations: The system limits previews by file size (256KB) and file count (500 files). These limits exist for performance reasons. Larger sets of files need to be processed in batches, and exceptionally large files need to be split manually or handled another way.
Diagnostic Information: Vault creation returns diagnostic information useful for debugging failures. When encountering issues, examine diagnostic information first; most problems reveal clues through this channel.
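The file preview limits above suggest a simple batching strategy. The sketch below groups files into batches of at most 500 and sets aside files larger than 256KB. It assumes that 256KB means 256 × 1024 bytes, that the size limit applies per file, and that a file exactly at the limit is accepted. None of these details is confirmed by the article, and all names are hypothetical.

```typescript
interface FileInfo {
  path: string;
  sizeBytes: number;
}

// Limits quoted in the article; the constant names are hypothetical.
const MAX_PREVIEW_BYTES = 256 * 1024; // assumption: 256KB means 256 * 1024 bytes
const MAX_PREVIEW_FILES = 500;

interface PreviewPlan {
  batches: FileInfo[][]; // each batch holds at most MAX_PREVIEW_FILES files
  oversized: FileInfo[]; // files above MAX_PREVIEW_BYTES, to be handled separately
}

// Split files into previewable batches and a list of oversized files.
function planPreview(files: FileInfo[]): PreviewPlan {
  const oversized = files.filter((file) => file.sizeBytes > MAX_PREVIEW_BYTES);
  const previewable = files.filter((file) => file.sizeBytes <= MAX_PREVIEW_BYTES);
  const batches: FileInfo[][] = [];
  for (let i = 0; i < previewable.length; i += MAX_PREVIEW_FILES) {
    batches.push(previewable.slice(i, i + MAX_PREVIEW_FILES));
  }
  return { batches, oversized };
}
```

Each batch can then be previewed in turn, while the oversized list tells the user which files need to be split or handled manually.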

Conclusion

The HagiCode Vault system fundamentally addresses a simple yet profound question: how can AI assistants understand and utilize local knowledge resources?

Through a unified storage abstraction layer, standardized directory structures, and automated context injection, the system achieves a "register once, reuse everywhere" knowledge management paradigm. Once a vault is created—whether learning notes, code repositories, or documentation—AI can automatically access and understand the content.

The experiential improvements from this design are significant. Manual copying of code snippets and repeated background explanations become unnecessary—the AI assistant functions like a colleague who genuinely understands project circumstances, providing more valuable assistance based on existing knowledge.

The Vault system described in this article is a solution that was actually developed, tested, and refined against real problems encountered during HagiCode's development. If the design seems valuable, it reflects the engineering behind HagiCode, which makes the project itself worth a look.

For those interested in exploring further, HagiCode offers several installation options, including Docker Compose deployment and a Desktop client for rapid setup. Public beta testing has begun, and readers are welcome to install it and try it out.