In the previous article, we walked through the full logic of the Vision Transformer (ViT): cutting an image into patches, treating those patches as tokens, and feeding them into a Transformer Encoder for global modeling. However, we also noted an unavoidable pain point of ViT: without a sufficiently large dataset, ViT is often difficult to train well. From a paradigm perspective, this is because ViT is essentially a "weak prior, strongly data-driven" modeling approach. Expanding on this further, consider the question: why does ViT require large amounts of data to perform well, while C...