Vision Transformer: Bridging Sequence Modeling and Visual Understanding Through Pure Attention Mechanisms
Introduction: From NLP Breakthrough to Visual Revolution
In our previous exploration, we thoroughly examined the original Transformer architecture and its overall propagation logic. The results speak for themselves: Transformer brought paradigm-shifting breakthroughs to the NLP field by achieving global sequence modeling capabilities through self-attention mechanisms.
However, the original Transformer remained fundamentally a model designed for sequence data. This limitation naturally sparked an important line of thinking within the research community:
If self-attention mechanisms are essentially a global dependency modeling method, are they truly limited to just one data format—"sequences"?
In other words, if we shift our focus beyond text to broader data types, such as images with their inherently two-dimensional structure, could Transformer still prove effective?
This line of questioning gave birth to Vision Transformer (ViT). The central question ViT attempts to answer is profound:
If we discard convolutional structures entirely and rely solely on attention mechanisms, can we successfully accomplish visual task modeling?
This fundamental inquiry serves as our starting point for exploring Vision Transformer logic and understanding its revolutionary approach to computer vision.
Core Philosophy of Vision Transformer
ViT emerged from the seminal 2020 paper titled "An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale," which provided the first proof that pure Transformer architectures (without any convolutional components) could achieve state-of-the-art performance on visual tasks.
You might notice that ViT's introduction came three years after the original Transformer publication. This temporal gap doesn't indicate a stagnation period in Transformer development. Quite the contrary—these three years represented an explosive expansion phase for Transformer architectures, primarily within the NLP domain.
After thorough validation across language tasks, Transformer gradually became recognized as a general-purpose sequence modeling framework. Building upon this foundation, extending the architecture to vision and other modalities became the natural next step in its evolution.
Returning to ViT itself, its core concept proves remarkably straightforward and can be summarized in a single sentence:
Divide images into "patches," then feed them as token sequences into Transformer's encoder.
This elegant simplicity masks the profound implications of treating visual data as sequential information, fundamentally challenging decades of convolution-dominated computer vision approaches.
Data Processing Pipeline: Transforming Images into Sequences
To address computer vision tasks, ViT utilizes only the Transformer encoder logic. Consequently, the primary innovation lies in how data is processed and transformed before entering the model.
Patch Division: Breaking Images into Manageable Chunks
The first critical step involves segmenting the image into equally-sized, non-overlapping blocks called patches:
The original ViT design specifies an input image size of 224 × 224 × 3 (height × width × channels), with each patch measuring 16 × 16 pixels.
This configuration yields the total number of patches:
N = (224 × 224) / (16 × 16) = 14 × 14 = 196 patches

Generalizing this concept for arbitrary input images:
X ∈ R^(H × W × C)

Where H represents height, W represents width, and C represents channels. If we designate each patch size as P × P, the resulting patch count becomes:
N = (H × W) / P²

This patchification process transforms the continuous visual field into discrete, manageable units that can be processed sequentially—a fundamental departure from traditional convolutional approaches that operate on the entire image simultaneously.
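As a quick check, the patch-count formula above can be computed directly. This is a minimal sketch; `num_patches` is an illustrative helper, not part of the original paper's code:

```python
def num_patches(H: int, W: int, P: int) -> int:
    """Number of non-overlapping P x P patches in an H x W image."""
    # ViT assumes the image divides evenly into patches
    assert H % P == 0 and W % P == 0, "image must divide evenly into patches"
    return (H * W) // (P * P)

print(num_patches(224, 224, 16))  # 196, matching the ViT default configuration
```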
Flattening: Converting Spatial Blocks to Vectors
Having decomposed a two-dimensional image into N smaller blocks, we face a critical challenge: Transformer cannot directly process two-dimensional structured data—it can only accept "vector sequences."
Therefore, the next step becomes clear: convert each patch into a vector representation.
The conversion method employs basic flattening operations:
For each patch with original dimensions:
P × P × C

We flatten it channel-wise into a one-dimensional vector:
x_p ∈ R^(P² · C)

Using ViT's standard configuration where P = 16 and C = 3:
x_p ∈ R^(16 × 16 × 3) = R^768

Now, each patch has been transformed into a fixed-length vector representation.
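The split-and-flatten step above can be sketched in NumPy. The `patchify` helper is illustrative (not the paper's code), assuming the ViT defaults of P = 16 and C = 3:

```python
import numpy as np

def patchify(image: np.ndarray, P: int) -> np.ndarray:
    """Split an (H, W, C) image into N = HW/P^2 patches, each flattened to P^2*C."""
    H, W, C = image.shape
    assert H % P == 0 and W % P == 0
    # Reshape to (H//P, P, W//P, P, C), then reorder so each patch is contiguous
    patches = image.reshape(H // P, P, W // P, P, C).transpose(0, 2, 1, 3, 4)
    return patches.reshape(-1, P * P * C)  # (N, P^2 * C)

img = np.random.rand(224, 224, 3)
tokens = patchify(img, 16)
print(tokens.shape)  # (196, 768)
```

Note that the first row of the result is exactly the upper-left 16 × 16 × 3 block flattened row-major, so no pixel values are altered, only rearranged.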
However, we must acknowledge an important limitation: as discussed in our earlier image processing fundamentals, flattening inevitably introduces spatial information loss.
After this transformation, the data can no longer provide spatial relationships like "the nose should be above the mouth." This represents one of the well-known drawbacks of using fully connected networks for image processing, as mentioned in our convolutional network discussions.
Clearly, ViT must employ alternative measures to compensate for this limitation. Let's continue exploring how this challenge is addressed.
Linear Projection: Mapping to Hidden Dimensions
After completing the flattening step, we have obtained sequence representations for each patch. However, a subtle but important detail remains: the flattened vector dimension is P²C, while Transformer's input dimension typically requires a unified hidden dimension D.
Therefore, we apply a linear transformation to map each flattened patch vector to the target dimension:
z^i = x^i · W + b

Where W represents the learnable weight matrix and b represents the bias term.
Through this transformation, each patch becomes a token suitable for Transformer processing.
An obvious question arises: if P²C equals D, isn't this step unnecessary?
The answer is definitively no. Beyond considering dimensional relationships, we must understand the semantic role of the linear layer in this context:
Had we not "serialized" the original two-dimensional information, a typical convolutional network would extract features with convolutional layers, whose computation is itself a weighted summation under learnable parameters.
Now, by converting patches to tokens and inputting them through a linear layer, we're still performing feature extraction. By analogy: patches function like receptive fields, while weights serve as convolutional kernels.
The vectors we ultimately obtain represent completed word embedding processes for patches—transforming raw pixel data into meaningful semantic representations.
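A minimal sketch of this projection, with randomly initialized placeholders standing in for the learnable W and b (here P²C happens to equal D = 768, but the projection applies regardless):

```python
import numpy as np

N, patch_dim, D = 196, 768, 768              # ViT-Base defaults
rng = np.random.default_rng(0)

x = rng.standard_normal((N, patch_dim))      # flattened patches
W = rng.standard_normal((patch_dim, D)) * 0.02  # learnable weight (placeholder init)
b = np.zeros(D)                              # learnable bias

z = x @ W + b                                # (N, D): one embedding per patch
print(z.shape)  # (196, 768)
```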
[CLS] Token: Creating a Global Representation
Before formally entering the model, our input processing isn't quite complete. We've transformed an image into a group of tokens, but a critical question emerges:
In classification tasks, which token should represent the entire image?
Considering various possibilities, approaches might include averaging all tokens or concatenating all tokens followed by fusion operations.
However, ViT provides an elegant solution: introduce a dedicated token for "aggregating global information"—the [CLS] token.
In broader application contexts, this is also referred to as a global token.
To elaborate: for an image originally divided into N tokens:
[z^1, z^2, ..., z^N]

We prepend a [CLS] token:
[x_cls, z^1, z^2, ..., z^N]

Here, x_cls represents a randomly initialized learnable parameter sharing the same dimension as other tokens.
It's crucial to emphasize that [CLS] token serves as a learnable parameter. The "global information aggregation" we attribute to it represents semantics we assign and subsequently realize through the model's operational logic.
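Mechanically, the prepending described above is a single concatenation; a minimal sketch with random placeholders standing in for the learnable parameters:

```python
import numpy as np

N, D = 196, 768
rng = np.random.default_rng(0)

patch_tokens = rng.standard_normal((N, D))   # the N patch embeddings
x_cls = rng.standard_normal((1, D))          # learnable parameter in a real model

tokens = np.concatenate([x_cls, patch_tokens], axis=0)
print(tokens.shape)  # (197, 768): [CLS] sits at position 0
```

In a real model, x_cls would be registered as a trainable parameter and updated by backpropagation along with everything else.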
Let's continue examining the propagation process, where we'll address how this mechanism actually functions.
Position Encoding: Restoring Spatial Relationships
We still face one remaining challenge:
Tokens derived from patch conversion have lost their mutual "positional information."
Unlike CNNs, the model now has no inherent knowledge of which patch occupies the upper-left corner, which resides in the lower-right, or which patches are adjacent to each other.
This limitation proves particularly problematic because position itself constitutes extremely important information in visual tasks.
Therefore, ViT carries over the position encoding (PE) mechanism from NLP:
X_input = E_embedding + E_pos

However, differing from the original Transformer's fixed sinusoidal encoding, ViT employs learnable position encoding.
To elaborate: in ViT, position encoding is modeled as a set of learnable parameter matrices. The model allocates a corresponding vector representation for each token position, with these vectors being optimized alongside model parameters during training.
Assuming the input sequence contains N+1 tokens (including 1 [CLS] token and N patch tokens), the position encoding can be represented as:
E_pos ∈ R^((N+1) × D)

Where each row corresponds to a learnable vector for a specific position, thereby completing PE injection.
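The PE injection above amounts to one element-wise addition; a sketch with a randomly initialized placeholder for the learnable E_pos:

```python
import numpy as np

N, D = 196, 768
rng = np.random.default_rng(0)

tokens = rng.standard_normal((N + 1, D))        # [CLS] + N patch embeddings
E_pos = rng.standard_normal((N + 1, D)) * 0.02  # trainable in a real model

x_input = tokens + E_pos                        # (N+1, D), ready for the encoder
print(x_input.shape)  # (197, 768)
```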
At this point, we've completed all data processing steps. Two questions remain unaddressed in detail:
- How exactly does ViT compensate for spatial information loss?
- How does the [CLS] token "aggregate global information"?
Both questions find their answers within ViT's Transformer logic.
Transformer Logic in ViT: Addressing Key Questions
As initially mentioned, ViT utilizes only the Transformer encoder. Having thoroughly explored this component in previous content, let's examine the complete propagation diagram from the original paper.
The network structure itself proves remarkably clear. Starting from our earlier questions, let's expand on the content requiring understanding in this section.
How ViT Compensates for Spatial Information Loss
Regarding this question, the position encoding discussion may have provided some inspiration:
Since PE is learnable, could spatial information be learned during training and embodied in PE?
Stopping at this conclusion, however, would be one-sided, even incorrect.
Actually, the mechanism truly accomplishing "spatial relationship modeling" is the self-attention mechanism.
To elaborate: we've mentioned the global modeling capability of self-attention mechanisms multiple times. When patch-converted token sequences input into self-attention, each patch token can directly establish connections with all other patches.
Through attention, we can learn correlations between different patches. However, merely knowing this proves insufficient—if vectors lack positional information, we cannot learn positional relationships. Therefore, PE remains indispensable.
Combining both mechanisms, although the model doesn't understand concepts like "upper-left corner," it can learn:
- Which patches frequently appear together (for example: eyes and nose)
- Which patches maintain relative structural relationships (for example: sky typically appears above ground)
Here emerges the key insight: unlike convolution, which treats spatial structure as known information added to training, spatial information in ViT actually represents a statistical learning result from data.
Intuitively, the first limitation of this logic that comes to mind is its high demand for training data.
However, its advantage lies in global modeling capability: CNNs must stack many layers or use large convolutional kernels to obtain a global receptive field, whereas ViT accomplishes this in one step, since the first attention layer already has a global receptive field.
Ultimately, the answer to this question becomes clear:
ViT doesn't "restore" spatial structure but rather relearns spatial structure from data through "position encoding + self-attention" combination.
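The one-step global receptive field described above can be sketched as a single self-attention layer over the 196 patch tokens (single-head and unbatched for clarity; Q, K, V are random placeholders standing in for the projected tokens):

```python
import numpy as np

rng = np.random.default_rng(0)
T, D = 196, 64                                 # 196 patch tokens, head dimension 64

Q = rng.standard_normal((T, D))                # queries (placeholder projections)
K = rng.standard_normal((T, D))                # keys
V = rng.standard_normal((T, D))                # values

scores = Q @ K.T / np.sqrt(D)                  # (T, T): every token vs. every token
alpha = np.exp(scores - scores.max(axis=-1, keepdims=True))
alpha /= alpha.sum(axis=-1, keepdims=True)     # softmax: each row sums to 1
out = alpha @ V                                # each output mixes all T tokens
print(out.shape)  # (196, 64)
```

The (T, T) score matrix is exactly where the quadratic cost in the number of patches comes from, and also why a single layer already connects every patch to every other.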
How [CLS] Token Aggregates Global Information
Next, we examine the second question: Why can a randomly initialized vector ultimately represent an entire image?
The core mechanism remains self-attention. In Transformer, each layer's attention performs one fundamental operation:
Each token collects information from "all tokens" through weighted summation.
Therefore, the [CLS] token executes at each layer:
x_cls^(l+1) = Σ_j α_cls,j · z_j^(l)

At this point, we understand that the [CLS] token progressively refines information layer by layer. However, the question persists:
It's not only the [CLS] token that refines information: all tokens perform global fusion, and each patch token can "see" the whole image, so every token ends up containing some degree of global information. Why, then, should only [CLS] serve as the global representation?
Indeed, from a computational perspective: all tokens perform global information fusion, meaning each patch token contains some global information.
In fact, this is purely a design choice: only the [CLS] token is used as the final output.
In classification tasks, ViT's final output is:
logits = MLP(x_cls^(L))

In other words: only the [CLS] token enters the output layer for classification and directly participates in loss function calculation.
Therefore, from backpropagation perspective: only [CLS] token gets "forced" to learn global information most useful for classification.
This represents the reason it can aggregate global semantics—we manually designed supervision-driven optimization, enabling continuous learning of global semantics through backpropagation.
Providing a more formal answer:
[CLS] token doesn't naturally possess special structures with "global semantics." Rather, because it's the only token directly participating in supervised training, it gets optimized into a representation capable of expressing entire image semantics.
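This design decision can be sketched in a few lines: only the [CLS] row of the encoder output reaches the head, so gradients from the classification loss flow to it directly. A simplified single-linear head stands in for the paper's MLP head, and all weights are random placeholders:

```python
import numpy as np

D, num_classes = 768, 1000
rng = np.random.default_rng(0)

encoder_out = rng.standard_normal((197, D))    # all tokens after the last layer
x_cls = encoder_out[0]                         # [CLS] occupies position 0

W_head = rng.standard_normal((D, num_classes)) * 0.02
logits = x_cls @ W_head                        # patch tokens never reach the head
print(logits.shape)  # (1000,)
```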
Summary and Broader Implications
ViT's core functionality can be summarized as: transforming images into token sequences and completing global modeling within a unified Transformer framework.
Compared to traditional CNNs, its most crucial breakthrough lies in: no longer relying on spatial priors like local receptive fields and translation invariance. Instead, it directly establishes connections between patches across the global scope through self-attention mechanisms. This gives the model a global receptive field from the very beginning, without requiring gradual expansion through multi-layer stacking, and thereby greater representational flexibility.
Simultaneously, ViT doesn't explicitly "restore" image spatial structure but learns spatial relationships from data through the combination of position encoding (PE) and self-attention mechanisms.
Overall, ViT's core significance lies in: it promotes Transformer from a "sequence modeling framework" to a "universal perception modeling framework."
This transformation from "structure prior-driven" to "data-driven modeling" not only redefines visual task modeling approaches but also lays a unified foundation for subsequent multimodal model development.
However, ViT still possesses limitations, specifically requiring ultra-large-scale training data. Naturally, improvements addressing this limitation emerged, which we'll explore in the next article.
Future Directions and Practical Considerations
The Vision Transformer represents a paradigm shift in how we approach visual understanding through machine learning. By demonstrating that attention mechanisms alone can achieve state-of-the-art performance without convolutional inductive biases, ViT opened new research directions in computer vision.
Practitioners should consider several factors when implementing ViT:
Data Requirements: ViT's lack of convolutional inductive biases means it requires substantially more training data than comparable CNN architectures. For applications with limited data, consider hybrid approaches or pre-training strategies.
Computational Efficiency: The quadratic complexity of self-attention with respect to sequence length (number of patches) can become prohibitive for high-resolution images. Recent variants like Swin Transformer address this through hierarchical attention.
Transfer Learning: Pre-trained ViT models have shown excellent transfer learning capabilities across various visual tasks, often outperforming CNN-based alternatives when sufficient pre-training data is available.
The evolution from ViT to its numerous variants continues to shape the landscape of computer vision, demonstrating the power of attention-based architectures in visual understanding tasks.