Deep Learning Advanced Concepts: Understanding Inductive Bias and Knowledge Distillation for Vision Transformers
Introduction
As deep learning models grow increasingly sophisticated, understanding the theoretical foundations that make them work becomes ever more critical. Two concepts stand out as particularly important for modern architecture design: inductive bias and knowledge distillation. These principles are not merely academic curiosities—they directly impact model performance, training efficiency, and practical deployment success.
This article provides a comprehensive exploration of inductive bias and knowledge distillation, with special focus on their application to Vision Transformers (ViTs). We'll examine theoretical foundations, practical implementations, and real-world considerations for leveraging these concepts in your own deep learning projects.
Part I: Inductive Bias in Deep Learning
What Is Inductive Bias?
Inductive bias refers to the set of assumptions a learning algorithm uses to predict outputs for inputs it has not encountered. In the context of neural networks, inductive bias is "baked into" the architecture itself—determining what patterns the model can easily learn versus what patterns require substantial data and computation to discover.
Formal Definition: Given a hypothesis space H and a learning algorithm A, the inductive bias is the set of assumptions that allows A to generalize beyond its training data.
Without inductive bias, machine learning would be impossible. The "no free lunch" theorem makes this precise: averaged over all possible data distributions, every learning algorithm performs equally well, so an algorithm can only outperform random guessing on unseen data by making assumptions about the distribution it will face.
Why Inductive Bias Matters
Sample Efficiency: Strong inductive bias aligned with the problem domain dramatically reduces the data needed for good performance.
Generalization: Appropriate inductive bias helps models generalize to out-of-distribution examples.
Training Speed: Models with suitable inductive bias converge faster during training.
Interpretability: Understanding inductive bias helps explain why architectures work (or fail) on specific tasks.
Inductive Bias in Classic Architectures
Convolutional Neural Networks (CNNs)
CNNs embed several powerful inductive biases:
Translation Equivariance:
- Assumption: Features are location-independent
- Implementation: Weight sharing across spatial positions
- Benefit: Dramatically reduces parameters, enables learning from limited data
# Convolution operation embodies translation equivariance
# Same filter applied at every spatial location
output[y, x] = sum over (dy, dx) of: filter[dy, dx] * input[y+dy, x+dx]

Locality:
- Assumption: Nearby pixels are more related than distant ones
- Implementation: Small receptive fields in early layers
- Benefit: Captures local patterns (edges, textures) efficiently
Hierarchical Composition:
- Assumption: Complex patterns are composed of simpler ones
- Implementation: Stacking layers with increasing receptive fields
- Benefit: Builds from edges → textures → parts → objects
Spatial Structure:
- Assumption: 2D arrangement of pixels matters
- Implementation: 2D convolution kernels, pooling operations
- Benefit: Preserves spatial relationships
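Translation equivariance in particular can be demonstrated in a few lines of plain Python (no framework needed): because the same filter is applied at every position, the response to a shifted input is the shifted response.

```python
# Minimal sketch showing why weight sharing gives translation
# equivariance: shifting the input shifts the output.

def conv2d_valid(image, kernel):
    """2D cross-correlation with 'valid' padding on nested lists."""
    ih, iw = len(image), len(image[0])
    kh, kw = len(kernel), len(kernel[0])
    out = []
    for y in range(ih - kh + 1):
        row = []
        for x in range(iw - kw + 1):
            row.append(sum(kernel[dy][dx] * image[y + dy][x + dx]
                           for dy in range(kh) for dx in range(kw)))
        out.append(row)
    return out

def shift_right(image, s):
    """Shift the image right by s >= 1 pixels, zero-padding on the left."""
    return [[0] * s + row[:-s] for row in image]

edge = [[1, -1]]  # a simple horizontal edge detector
img = [[0, 0, 1, 1, 0, 0],
       [0, 0, 1, 1, 0, 0]]

out = conv2d_valid(img, edge)
out_shifted = conv2d_valid(shift_right(img, 1), edge)

print(out[0])          # [0, -1, 0, 1, 0]
print(out_shifted[0])  # [0, 0, -1, 0, 1]  (the same response, shifted)
```

The shifted image produces the same edge response one pixel to the right (up to the border), with no retraining and no extra parameters.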
These biases made CNNs dominant in computer vision for over a decade. But they also introduced limitations that Transformers would later address.
Recurrent Neural Networks (RNNs)
RNNs encode different inductive biases:
Sequential Processing:
- Assumption: Order matters, temporal dependencies exist
- Implementation: Hidden state passed through time steps
- Benefit: Natural for language, time series
Positional Sensitivity:
- Assumption: Absolute and relative position carry meaning
- Implementation: Sequential processing order
- Benefit: Captures word order, temporal sequences
Limitation: Sequential processing prevents parallelization, making training slow for long sequences.
Transformers
Transformers introduce a fundamentally different set of biases:
Global Attention:
- Assumption: Any position can relate to any other position
- Implementation: Self-attention mechanism
- Benefit: Captures long-range dependencies directly
Permutation Equivariance (without positional encoding):
- Assumption: Order doesn't matter (initially)
- Implementation: Attention is order-independent
- Benefit: Flexible, but requires positional encoding for ordered data
Minimal Spatial Bias:
- Assumption: No inherent 2D structure assumption
- Implementation: Treats input as sequence of patches
- Benefit: More flexible, but requires more data to learn spatial patterns
This minimal inductive bias is both strength and weakness: Transformers are more flexible but less sample-efficient than CNNs for vision tasks.
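The order-independence claim is easy to verify directly. A pure-Python toy (single head, identity Q/K/V projections, both simplifying assumptions) shows that permuting the input tokens permutes the outputs in exactly the same way:

```python
# Check that self-attention (without positional encoding) is
# permutation-equivariant: permuting the input permutes the output.
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def self_attention(tokens):
    """Toy single-head attention with identity Q/K/V projections."""
    out = []
    for q in tokens:
        # Dot-product scores of this query against every key
        scores = [sum(qi * ki for qi, ki in zip(q, k)) for k in tokens]
        weights = softmax(scores)
        # Weighted sum of values
        out.append([sum(w * v[d] for w, v in zip(weights, tokens))
                    for d in range(len(q))])
    return out

tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
perm = [2, 0, 1]

out = self_attention(tokens)
out_perm = self_attention([tokens[i] for i in perm])

# Output of the permuted input == permutation of the original output
assert all(abs(a - b) < 1e-9
           for j in range(len(perm))
           for a, b in zip(out_perm[j], out[perm[j]]))
```

This is exactly why ordered data needs positional encodings: without them, attention cannot tell two orderings of the same tokens apart.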
Inductive Bias in Vision Transformers
Vision Transformers (ViTs) apply the Transformer architecture to images in five steps:
1. Patchification: Split the image into fixed-size patches (e.g., 16×16 pixels)
2. Linear Projection: Map each patch to a vector embedding
3. Positional Encoding: Add position information to the patch embeddings
4. Transformer Encoder: Process the sequence of patch embeddings through attention layers
5. Classification Head: Use the [CLS] token or global pooling for prediction
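Before looking at a full module, the patch arithmetic is worth checking. For the standard 224×224 RGB input with 16×16 patches, this yields 196 tokens, each a flattened vector of 16·16·3 = 768 values before linear projection:

```python
# Patchification arithmetic for the standard ViT-B/16 setup.

def patchify_shapes(img_size, patch_size, channels=3):
    """Return (sequence length, raw patch dimension) for a square image."""
    assert img_size % patch_size == 0, "image must divide evenly into patches"
    patches_per_side = img_size // patch_size
    num_patches = patches_per_side ** 2
    patch_dim = patch_size * patch_size * channels
    return num_patches, patch_dim

num_patches, patch_dim = patchify_shapes(224, 16)
print(num_patches, patch_dim)  # 196 768
```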
import torch
import torch.nn as nn

# PatchEmbed and TransformerBlock are assumed to be defined elsewhere
class VisionTransformer(nn.Module):
    def __init__(self, img_size=224, patch_size=16, embed_dim=768,
                 depth=12, num_heads=12, num_classes=1000):
        super().__init__()
        # Patchification
        self.patch_embed = PatchEmbed(img_size, patch_size, 3, embed_dim)
        num_patches = self.patch_embed.num_patches
        # Positional encoding (learnable)
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, embed_dim))
        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        # Transformer encoder
        self.blocks = nn.ModuleList([
            TransformerBlock(embed_dim, num_heads) for _ in range(depth)
        ])
        # Classification head
        self.head = nn.Linear(embed_dim, num_classes)

    def forward(self, x):
        # Embed patches
        x = self.patch_embed(x)  # [B, num_patches, embed_dim]
        # Add class token
        cls_tokens = self.cls_token.expand(x.shape[0], -1, -1)
        x = torch.cat((cls_tokens, x), dim=1)
        # Add positional encoding
        x = x + self.pos_embed
        # Process through transformer blocks
        for block in self.blocks:
            x = block(x)
        # Use class token for classification
        return self.head(x[:, 0])

Inductive Bias Analysis of ViT
Weak Spatial Bias:
- Unlike CNNs, ViTs don't assume 2D locality
- Patches are treated as sequence elements
- Spatial relationships must be learned from data
- Consequence: Requires more training data than CNNs
Global Receptive Field from Layer 1:
- Self-attention connects all patches immediately
- No gradual receptive field growth like CNNs
- Benefit: Can capture long-range dependencies early
- Trade-off: May overfit on small datasets
Positional Encoding as Inductive Bias:
- Learnable positional encodings inject position information
- Different from CNN's inherent spatial structure
- Implication: Model must learn what positions mean
No Translation Equivariance:
- Unlike CNNs, ViTs are not translation equivariant by design
- Must learn translation invariance from data
- Impact: Less sample-efficient for tasks where translation matters
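A 1-D toy makes this concrete: shifting the input by a whole patch merely permutes the patch sequence, while a sub-patch shift changes the content of every patch, and nothing in the architecture relates the two cases.

```python
# Toy 1-D illustration of why ViTs lack built-in translation
# equivariance at the patchification step.

def patchify_1d(signal, patch_size):
    """Split a 1-D signal into non-overlapping patches."""
    return [tuple(signal[i:i + patch_size])
            for i in range(0, len(signal), patch_size)]

signal = [1, 2, 3, 4, 5, 6, 7, 8]
print(patchify_1d(signal, 2))  # [(1, 2), (3, 4), (5, 6), (7, 8)]

# Shift by a full patch (2 elements): same patches, just reordered
shifted_full = signal[-2:] + signal[:-2]
print(patchify_1d(shifted_full, 2))  # [(7, 8), (1, 2), (3, 4), (5, 6)]

# Shift by one element (sub-patch): entirely new patch contents
shifted_half = signal[-1:] + signal[:-1]
print(patchify_1d(shifted_half, 2))  # [(8, 1), (2, 3), (4, 5), (6, 7)]
```

A CNN's weight sharing handles both shifts uniformly; a ViT must learn from data that these inputs depict the same signal.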
Enhancing Inductive Bias in Vision Transformers
Recent research has focused on injecting more appropriate inductive bias into ViTs:
1. Convolutional Hybrids
Combining CNN and Transformer strengths:
# embed_dim and num_classes are taken as constructor arguments;
# TransformerEncoder is assumed to be defined elsewhere
class ConvTransformerHybrid(nn.Module):
    def __init__(self, embed_dim=128, num_classes=1000):
        super().__init__()
        # CNN stem for local feature extraction
        self.cnn_stem = nn.Sequential(
            nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3),
            nn.BatchNorm2d(64),
            nn.ReLU(),
            nn.Conv2d(64, embed_dim, kernel_size=3, stride=2, padding=1),
            nn.BatchNorm2d(embed_dim),
            nn.ReLU(),
        )
        # Transformer for global reasoning
        self.transformer = TransformerEncoder(...)
        # Classification head
        self.head = nn.Linear(embed_dim, num_classes)

    def forward(self, x):
        # Extract local features with CNN
        x = self.cnn_stem(x)
        # Flatten spatial dims and process with Transformer
        x = x.flatten(2).transpose(1, 2)  # [B, H*W, embed_dim]
        x = self.transformer(x)
        return self.head(x.mean(dim=1))

Benefits:
- CNN provides locality and translation bias
- Transformer provides global attention
- Better sample efficiency than pure ViT
- Maintains long-range modeling capability
2. Local Attention Mechanisms
Restricting attention to local neighborhoods:
class LocalAttention(nn.Module):
    def __init__(self, embed_dim, num_heads, window_size=7):
        super().__init__()
        self.window_size = window_size
        self.attention = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)

    def forward(self, x, H, W):
        # Reshape to [B*num_windows, window_size*window_size, embed_dim]
        # (window_partition and window_reverse are assumed helper methods)
        x = self.window_partition(x, H, W, self.window_size)
        # Apply attention within windows only
        x, _ = self.attention(x, x, x)
        # Restore original shape
        x = self.window_reverse(x, H, W, self.window_size)
        return x

Benefits:
- Reduces attention cost from O(n²) to O(n·w²), where w is the window side length
- Injects locality bias
- Enables processing higher resolution images
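The saving is easy to quantify by counting scored token pairs. The numbers below assume a Swin-like setting (a 56×56 = 3136-token grid and 7×7 windows) purely for illustration:

```python
# Back-of-the-envelope cost comparison: global attention scores n^2
# token pairs; window attention scores (tokens per window)^2 per window.

def global_attention_pairs(n):
    return n * n

def window_attention_pairs(n, window_tokens):
    assert n % window_tokens == 0
    num_windows = n // window_tokens
    return num_windows * window_tokens ** 2  # == n * window_tokens

n, w = 56 * 56, 7 * 7  # 3136 tokens, 49 tokens per window
print(global_attention_pairs(n))     # 9834496
print(window_attention_pairs(n, w))  # 153664  (64x fewer pairs)
```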
3. Hierarchical Transformers
Building feature hierarchies similar to CNNs:
# TransformerStage and PatchMerging are assumed building blocks;
# global_pool averages over the remaining tokens
class HierarchicalViT(nn.Module):
    def __init__(self):
        super().__init__()
        # Stage 1: Process at full patch resolution
        self.stage1 = TransformerStage(embed_dim=96, depth=2)
        # Stage 2: Downsample and process
        self.downsample1 = PatchMerging(scale=2)
        self.stage2 = TransformerStage(embed_dim=192, depth=2)
        # Stage 3: Further downsample
        self.downsample2 = PatchMerging(scale=2)
        self.stage3 = TransformerStage(embed_dim=384, depth=6)
        # Stage 4: Final processing
        self.downsample3 = PatchMerging(scale=2)
        self.stage4 = TransformerStage(embed_dim=768, depth=2)

    def forward(self, x):
        x = self.stage1(x)
        x = self.downsample1(x)
        x = self.stage2(x)
        x = self.downsample2(x)
        x = self.stage3(x)
        x = self.downsample3(x)
        x = self.stage4(x)
        return global_pool(x)

Benefits:
- Multi-scale feature representation
- Progressive receptive field growth
- More CNN-like inductive bias
- Better for dense prediction tasks
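As a sanity check on the sketch above, the shape progression can be computed directly. The 56×56 starting grid is an assumption (a 224×224 image with 4×4 patches); each PatchMerging halves the spatial grid while the stage width doubles, mirroring a CNN feature pyramid:

```python
# Shape bookkeeping for the hierarchical sketch (Swin-style widths).

def stage_shapes(input_grid=56, dims=(96, 192, 384, 768)):
    """Return (height, width, channels) entering each stage."""
    shapes = []
    grid = input_grid
    for d in dims:
        shapes.append((grid, grid, d))
        grid //= 2  # PatchMerging downsamples 2x between stages
    return shapes

print(stage_shapes())
# [(56, 56, 96), (28, 28, 192), (14, 14, 384), (7, 7, 768)]
```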
4. Convolutional Position Embedding
Replacing learned positional encodings with convolutional ones:
class ConvPositionEmbedding(nn.Module):
    def __init__(self, embed_dim, kernel_size=3):
        super().__init__()
        self.embed_dim = embed_dim
        self.proj = nn.Conv2d(
            embed_dim, embed_dim,
            kernel_size=kernel_size,
            padding=kernel_size // 2,
            groups=embed_dim  # Depth-wise convolution
        )

    def forward(self, x, H, W):
        # Reshape the token sequence to [B, C, H, W]
        x = x.transpose(1, 2).reshape(-1, self.embed_dim, H, W)
        # Apply convolution
        x = self.proj(x)
        # Reshape back to a sequence
        x = x.flatten(2).transpose(1, 2)
        return x

Benefits:
- Translation equivariant positional encoding
- Better generalization to different resolutions
- More parameter-efficient than learned embeddings
Quantifying Inductive Bias
How can we measure and compare inductive biases?
Sample Efficiency Curves
Plot performance vs. training data size:
import matplotlib.pyplot as plt

def compare_sample_efficiency():
    # Illustrative accuracies for demonstration, not measured results
    data_sizes = [1000, 5000, 10000, 50000, 100000, 500000]
    cnn_accuracies = [45, 62, 71, 79, 83, 85]
    vit_accuracies = [38, 52, 64, 75, 81, 85]
    hybrid_accuracies = [48, 65, 73, 80, 84, 86]
    plt.figure(figsize=(10, 6))
    plt.plot(data_sizes, cnn_accuracies, label='CNN (ResNet-50)', marker='o')
    plt.plot(data_sizes, vit_accuracies, label='ViT-B/16', marker='s')
    plt.plot(data_sizes, hybrid_accuracies, label='ConvNeXt', marker='^')
    plt.xscale('log')
    plt.xlabel('Training Samples')
    plt.ylabel('Top-1 Accuracy')
    plt.title('Sample Efficiency Comparison')
    plt.legend()
    plt.grid(True, alpha=0.3)
    plt.show()

Interpretation: Steeper curves indicate stronger (more helpful) inductive bias for that domain.
Out-of-Distribution Generalization
Test performance on distribution shifts:
def test_ood_generalization():
    # Train on ImageNet, then test on various corruptions and distortions
    corruptions = ['gaussian_noise', 'blur', 'contrast', 'translation', 'rotation']
    # Illustrative robustness scores (higher = more robust), not measured results
    cnn_robustness = [0.72, 0.68, 0.75, 0.81, 0.69]
    vit_robustness = [0.65, 0.71, 0.69, 0.73, 0.72]
    # CNNs are typically more robust to translations (built-in equivariance);
    # ViTs may be more robust to some other corruptions

Part II: Knowledge Distillation
What Is Knowledge Distillation?
Knowledge distillation is a technique where a smaller "student" model learns to replicate the behavior of a larger "teacher" model. Rather than learning directly from labels, the student learns from the teacher's output distributions, capturing richer information about the task.
Core Insight: The teacher's soft predictions (probability distributions) contain more information than hard labels—they encode relationships between classes and uncertainty.
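This insight can be made concrete in a few lines of pure Python (the logits and class names here are made up): the entropy of the teacher's distribution grows with temperature, so softened targets carry strictly more inter-class information than a one-hot label.

```python
# Soft teacher targets vs. hard labels: raising the temperature spreads
# probability mass across classes, exposing the teacher's learned
# similarity structure (e.g., "this dog looks a bit like a wolf").
import math

def softmax(logits, T=1.0):
    m = max(logits)
    es = [math.exp((x - m) / T) for x in logits]
    s = sum(es)
    return [e / s for e in es]

def entropy(p):
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

teacher_logits = [9.0, 5.0, 1.0]  # hypothetical scores: dog, wolf, car

hard = [1.0, 0.0, 0.0]            # a one-hot label says nothing else
soft_T1 = softmax(teacher_logits, T=1.0)
soft_T4 = softmax(teacher_logits, T=4.0)

# Entropy grows with temperature: the softened distribution carries
# more information about inter-class relationships.
assert entropy(hard) < entropy(soft_T1) < entropy(soft_T4)
print([round(p, 3) for p in soft_T4])
```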
Why Use Knowledge Distillation?
Model Compression: Deploy large model performance on resource-constrained devices.
Training Acceleration: Student trains faster than training large model from scratch.
Performance Enhancement: Students sometimes exceed teacher performance (especially with ensemble teachers).
Privacy Preservation: Train on model outputs rather than sensitive raw data.
Architecture Transfer: Transfer knowledge across different architecture types (e.g., CNN to Transformer).
The Distillation Process
Standard Knowledge Distillation
import torch
import torch.nn.functional as F

class KnowledgeDistillation:
    def __init__(self, teacher, student, temperature=4.0, alpha=0.7):
        self.teacher = teacher
        self.student = student
        self.temperature = temperature  # Controls softening of distributions
        self.alpha = alpha  # Weight between distillation loss and true-label loss

    def distillation_loss(self, student_logits, teacher_logits, true_labels):
        # Soften the teacher's probability distribution
        teacher_soft = F.softmax(teacher_logits / self.temperature, dim=1)
        # KL divergence between student and teacher distributions; the T^2
        # factor keeps gradient magnitudes comparable across temperatures
        kl_loss = F.kl_div(
            F.log_softmax(student_logits / self.temperature, dim=1),
            teacher_soft,
            reduction='batchmean'
        ) * (self.temperature ** 2)
        # Cross-entropy with true labels
        ce_loss = F.cross_entropy(student_logits, true_labels)
        # Combined loss
        return self.alpha * kl_loss + (1 - self.alpha) * ce_loss

    def train_step(self, images, labels):
        # Get teacher predictions (no gradient)
        with torch.no_grad():
            teacher_logits = self.teacher(images)
        # Get student predictions
        student_logits = self.student(images)
        # Calculate distillation loss and backpropagate
        # (optimizer zero_grad/step calls are assumed to happen outside)
        loss = self.distillation_loss(student_logits, teacher_logits, labels)
        loss.backward()
        return loss.item()

Temperature Scaling Explained
The temperature parameter controls the "softness" of probability distributions:
# Example: Temperature effect on softmax output
logits = torch.tensor([2.0, 1.0, 0.1])

# Standard softmax (T=1)
soft_T1 = F.softmax(logits / 1.0, dim=0)
# Output: [0.659, 0.242, 0.099]

# Softened softmax (T=4)
soft_T4 = F.softmax(logits / 4.0, dim=0)
# Output: [0.417, 0.324, 0.259]

# Higher temperature reveals more information about relative probabilities:
# the student learns not just which class is correct, but how classes relate

Advanced Distillation Techniques
Feature-Based Distillation
Beyond output logits, distill intermediate feature representations:
class FeatureDistillation:
    def __init__(self, teacher, student, feature_layers, temperature=4.0):
        self.teacher = teacher
        self.student = student
        self.feature_layers = feature_layers  # Which layers to distill
        self.temperature = temperature

    def feature_loss(self, student_features, teacher_features):
        # Project student features to match the teacher dimension if needed
        # (adapt_projection is an assumed learned projection layer)
        if student_features.shape != teacher_features.shape:
            student_features = self.adapt_projection(student_features)
        # Mean squared error between features
        return F.mse_loss(student_features, teacher_features)

    def train_step(self, images, labels):
        # Get teacher features and logits (no gradient)
        with torch.no_grad():
            teacher_features = self.teacher.extract_features(images, self.feature_layers)
            teacher_logits = self.teacher(images)
        # Get student features and logits
        student_features, student_logits = self.student.extract_features_and_logits(
            images, self.feature_layers
        )
        # Feature distillation loss
        feat_loss = sum(
            self.feature_loss(sf, tf)
            for sf, tf in zip(student_features, teacher_features)
        )
        # Output distillation loss
        T = self.temperature
        out_loss = F.kl_div(
            F.log_softmax(student_logits / T, dim=1),
            F.softmax(teacher_logits / T, dim=1),
            reduction='batchmean'
        ) * (T ** 2)
        # Combined loss
        return feat_loss + out_loss

Attention-Based Distillation
Transfer attention patterns from teacher to student:
class AttentionDistillation:
def __init__(self, teacher, student):
self.teacher = teacher
self.student = student
def attention_loss(self, student_attention, teacher_attention):
# Match attention maps
# Teacher attention: [B, num_heads, seq_len, seq_len]
# Student attention: [B, num_heads, seq_len, seq_len]
# Option 1: Direct MSE on attention weights
loss = F.mse_loss(student_attention, teacher_attention)
# Option 2: KL divergence on attention distributions
# loss = F.kl_div(F.log_softmax(student_attention, dim=-1),
# F.softmax(teacher_attention, dim=-1))
return loss
def train_step(self, images):
# Extract attention maps from teacher
with torch.no_grad():
teacher_attentions = self.teacher.get_attention_maps(images)
# Get student attention maps
student_attentions = self.student.get_attention_maps(images)
# Calculate attention distillation loss
attn_loss = sum(
self.attention_loss(sa, ta)
for sa, ta in zip(student_attentions, teacher_attentions)
)
        return attn_loss

Relation-Based Distillation
Capture relationships between samples, not just within samples:
class RelationDistillation:
def __init__(self, teacher, student):
self.teacher = teacher
self.student = student
def build_relation_matrix(self, features):
"""Build matrix of pairwise relationships between samples"""
# features: [batch_size, feature_dim]
# Normalize features
features = F.normalize(features, p=2, dim=1)
# Compute similarity matrix
relation_matrix = torch.matmul(features, features.t())
return relation_matrix
def relation_loss(self, student_features, teacher_features):
# Build relation matrices
student_relations = self.build_relation_matrix(student_features)
teacher_relations = self.build_relation_matrix(teacher_features)
# Match relation structures
loss = F.mse_loss(student_relations, teacher_relations)
        return loss

Distillation for Vision Transformers
CNN Teacher → ViT Student
Transferring knowledge from convolutional models to transformers:
class CNNtoViTDistillation:
    def __init__(self, cnn_teacher, vit_student, temperature=4.0):
        self.teacher = cnn_teacher
        self.student = vit_student
        self.temperature = temperature

    def train_step(self, images, labels):
        T = self.temperature
        # CNN teacher forward pass (no gradient)
        with torch.no_grad():
            teacher_logits = self.teacher(images)
            teacher_features = self.teacher.extract_features(images)
        # ViT student forward pass
        student_logits = self.student(images)
        student_patch_features = self.student.extract_patch_features(images)
        # Output distillation
        out_loss = F.kl_div(
            F.log_softmax(student_logits / T, dim=1),
            F.softmax(teacher_logits / T, dim=1),
            reduction='batchmean'
        )
        # Feature distillation (requires spatial alignment):
        # CNN features [B, C, H, W] -> [B, H*W, C] to match
        # ViT patch features [B, num_patches, embed_dim]
        teacher_features_flat = teacher_features.flatten(2).permute(0, 2, 1)
        feat_loss = F.mse_loss(student_patch_features, teacher_features_flat)
        # Label loss
        label_loss = F.cross_entropy(student_logits, labels)
        # Total loss
        return 0.5 * out_loss + 0.3 * feat_loss + 0.2 * label_loss

ViT Teacher → CNN Student
Compressing large transformers into efficient CNNs:
class ViTtoCNNDistillation:
    def __init__(self, vit_teacher, cnn_student, temperature=4.0):
        self.teacher = vit_teacher
        self.student = cnn_student
        self.temperature = temperature

    def train_step(self, images, labels):
        T = self.temperature
        # ViT teacher (expensive, but run without gradients)
        with torch.no_grad():
            teacher_logits = self.teacher(images)
            teacher_attention = self.teacher.get_attention_maps(images)
        # CNN student (fast)
        student_logits = self.student(images)
        # Output distillation
        out_loss = F.kl_div(
            F.log_softmax(student_logits / T, dim=1),
            F.softmax(teacher_logits / T, dim=1),
            reduction='batchmean'
        )
        # Attention guidance: teacher attention guides CNN feature learning,
        # helping the CNN pick up long-range dependencies
        attn_guidance_loss = self.attention_guidance_loss(
            self.student.get_feature_maps(),
            teacher_attention
        )
        # Standard label loss
        label_loss = F.cross_entropy(student_logits, labels)
        # Combined loss
        return 0.6 * out_loss + 0.2 * attn_guidance_loss + 0.2 * label_loss

    def attention_guidance_loss(self, cnn_features, teacher_attention):
        """Use teacher attention to guide CNN feature learning.

        Sketch: pool teacher attention to match the CNN spatial resolution,
        apply it as a spatial weighting on CNN features, and encourage the
        CNN to focus on the same regions as the ViT attention.
        """
        raise NotImplementedError  # left as a sketch

Practical Considerations
Teacher Selection
Choosing the right teacher matters:
def select_teacher(task, constraints):
"""
Guidelines for teacher selection:
1. Performance: Teacher should significantly outperform student capacity
2. Architecture: Similar architectures distill better than very different ones
3. Size: Larger teachers provide richer knowledge but are slower
4. Ensemble: Multiple teachers often better than single teacher
"""
if constraints['deployment'] == 'mobile':
# Use medium teacher, focus on aggressive compression
return 'ResNet-101'
elif constraints['accuracy_priority'] == 'high':
# Use largest available teacher
return 'ViT-L/16'
elif constraints['training_time'] == 'limited':
# Use pre-computed teacher outputs
        return 'precomputed'

Temperature Tuning
Temperature is a critical hyperparameter:
import copy

def tune_temperature(teacher, student, train_loader, val_loader):
    """Find an effective temperature via grid search.

    DistillationTrainer and evaluate are assumed helpers.
    """
    temperatures = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
    best_temp = 4
    best_accuracy = 0
    for T in temperatures:
        # Train a fresh copy of the student with this temperature
        student_copy = copy.deepcopy(student)
        trainer = DistillationTrainer(teacher, student_copy, temperature=T)
        trainer.train(train_loader)
        # Evaluate
        accuracy = evaluate(student_copy, val_loader)
        if accuracy > best_accuracy:
            best_accuracy = accuracy
            best_temp = T
    return best_temp

# Typical findings:
# - T=1: Distributions stay hard; little benefit over standard training
# - T=3-5: Sweet spot for most tasks
# - T>7: Distributions too soft; label information is washed out

Loss Weight Balancing
Balancing distillation loss with label loss:
import math

def adaptive_loss_weighting(epoch, total_epochs, initial_alpha=0.9):
    """
    Adaptively adjust the distillation vs. label loss weighting.

    Strategy: start with a high distillation weight and gradually increase
    the label weight. Rationale: early training benefits from teacher
    guidance; later training should fit the true labels.
    """
    # Linear decay
    alpha = initial_alpha * (1 - epoch / total_epochs)
    # Or cosine annealing:
    # alpha = initial_alpha * (1 + math.cos(math.pi * epoch / total_epochs)) / 2
    return alpha

Distillation Performance Examples
ImageNet Classification
| Model | Teacher | Top-1 Acc | Params | FLOPs |
|---|---|---|---|---|
| ResNet-18 | - | 69.8% | 11.7M | 1.8G |
| ResNet-18 | ResNet-50 | 71.2% | 11.7M | 1.8G |
| ResNet-18 | ResNet-101 | 71.8% | 11.7M | 1.8G |
| MobileNetV2 | - | 72.0% | 3.4M | 0.3G |
| MobileNetV2 | ResNet-50 | 73.5% | 3.4M | 0.3G |
| MobileNetV2 | EfficientNet-B4 | 74.2% | 3.4M | 0.3G |
Observation: Distillation provides 1-2% accuracy gain without increasing student size.
Vision Transformer Compression
| Model | Teacher | Top-1 Acc | Params |
|---|---|---|---|
| ViT-Tiny | - | 72.2% | 5.7M |
| ViT-Tiny | ViT-Base | 75.8% | 5.7M |
| ViT-Tiny | ViT-Large | 76.4% | 5.7M |
| ViT-Small | - | 77.5% | 22.0M |
| ViT-Small | ViT-Base | 78.9% | 22.0M |
Observation: Smaller ViTs benefit significantly from larger ViT teachers.
Conclusion
Inductive bias and knowledge distillation represent two sides of the same coin: incorporating prior knowledge into the learning process. Inductive bias builds assumptions into the architecture itself, while distillation transfers knowledge from a trained model.
For Vision Transformers specifically:
Inductive Bias Insights:
- Pure ViTs have minimal spatial bias, requiring more data
- Hybrid architectures (CNN + Transformer) often provide best of both worlds
- Appropriate inductive bias dramatically improves sample efficiency
- No single bias is universally best—depends on task and data availability
Distillation Insights:
- Enables deploying large model performance on constrained devices
- Works across architecture types (CNN ↔ Transformer)
- Feature-based distillation often more effective than output-only
- Temperature tuning and loss balancing are critical hyperparameters
Practical Recommendations:
- For small datasets: Use architectures with strong inductive bias (CNNs or hybrid models)
- For large datasets: Pure Transformers can learn appropriate biases from data
- For deployment: Always consider distillation for model compression
- For best performance: Combine appropriate architecture with distillation from larger models
Understanding and leveraging these concepts is essential for building efficient, effective deep learning systems. As the field continues to evolve, these principles will remain fundamental to advancing both theory and practice.
Additional Resources
Key Papers
- "Attention Is All You Need" (Vaswani et al., 2017) - Original Transformer
- "An Image is Worth 16x16 Words" (Dosovitskiy et al., 2020) - Vision Transformer
- "Distilling the Knowledge in a Neural Network" (Hinton et al., 2015) - Knowledge Distillation
- "Born Again Neural Networks" (Furlanello et al., 2018) - Self-distillation
- "ConViT: Improving Vision Transformers with Soft Convolutional Inductive Biases" (d'Ascoli et al., 2021)
Code Repositories
- Official ViT implementation: https://github.com/google-research/vision_transformer
- Distillation libraries: Various implementations in PyTorch and TensorFlow
- timm (PyTorch Image Models): https://github.com/rwightman/pytorch-image-models
Further Reading
- "Deep Learning" by Goodfellow, Bengio, and Courville - Theoretical foundations
- Various blog posts and tutorials on inductive bias and distillation
- Conference proceedings from NeurIPS, ICML, CVPR for latest research