
BiCLIP: Domain Canonicalization via Structured Geometric Transformation

Pranav Mantini*, Shishir K. Shah+

*University of Houston, +The University of Oklahoma

BiCLIP realigns visual features to the textual manifold.

Abstract



Recent advances in vision-language models (VLMs) have demonstrated remarkable zero-shot capabilities, yet adapting these models to specialized domains remains a significant challenge. Building on recent theoretical insights suggesting that independently trained VLMs are related by a canonical transformation, we extend this understanding to the concept of domains. We hypothesize that image features across disparate domains are related by a canonicalized geometric transformation that can be recovered using a small set of anchors. Few-shot classification provides a natural setting for this alignment, as the limited labeled samples serve as the anchors required to estimate this transformation. Motivated by this hypothesis, we introduce BiCLIP, a framework that applies a targeted transformation to multimodal features to enhance cross-modal alignment. Our approach is characterized by its extreme simplicity and low parameter footprint. Extensive evaluations across 11 standard benchmarks, including EuroSAT, DTD, and FGVCAircraft, demonstrate that BiCLIP consistently achieves state-of-the-art results. Furthermore, we provide empirical verification of existing geometric findings by analyzing the orthogonality and angular distribution of the learned transformations, confirming that structured alignment is the key to robust domain adaptation. Code is available at this https URL

Methodology: BiCLIP



Figure 1. The BiCLIP Adaptation Framework. Unlike standard CLIP which relies on a fixed dot product, BiCLIP introduces a trainable, structured transformation matrix W between the image and text modalities.

As shown in the schematic, prompts and images are projected through their respective encoders. Whereas standard CLIP scores each pair with a direct dot product (producing a static similarity matrix), BiCLIP introduces a learnable structured upper-triangular matrix W.

W is applied between the image features I_i and the text features T_j: the matrix of bilinear products (shown on the right) holds the geometrically realigned similarity scores, with the term I_i W T_j computed for every image-text pair. W is trained with the standard contrastive losses used for CLIP and SigLIP.
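The bilinear scoring above can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's implementation: the function names and the logit scale are our own, and we assume W acts on L2-normalized features as in standard CLIP.

```python
import numpy as np

def make_upper_triangular(params, d):
    """Assemble W (d x d) from its d*(d+1)//2 free upper-triangular entries."""
    W = np.zeros((d, d))
    W[np.triu_indices(d)] = params
    return W

def biclip_logits(img_feats, txt_feats, W, logit_scale=100.0):
    """Bilinear similarity I_i W T_j for every image/text pair.

    Normalizes both modalities, then inserts W between them; with W = I
    this reduces to the standard CLIP cosine-similarity logits.
    """
    I = img_feats / np.linalg.norm(img_feats, axis=1, keepdims=True)
    T = txt_feats / np.linalg.norm(txt_feats, axis=1, keepdims=True)
    return logit_scale * (I @ W @ T.T)
```

The upper-triangular parameterization halves the free parameters of a full d × d map while still allowing rotation-plus-scaling behavior; setting W to the identity recovers the original CLIP similarity.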

Experimental Results


Main Results (16-Shot Performance): Performance comparison of BiCLIP across 11 diverse datasets under 16-shot adaptation. BiCLIP consistently outperforms the zero-shot baselines, with particularly large gains on specialized domains such as EuroSAT and DTD.

Dataset         CLIP Zero-Shot   BiCLIP (Ours)   Δ        SigLIP Zero-Shot   BiSigLIP (Ours)   Δ
ImageNet        68.84            71.69           +2.85    74.89              76.73             +1.83
DTD             42.82            71.86           +29.04   62.23              73.94             +11.70
EuroSAT         48.22            85.13           +36.91   35.35              77.50             +42.15
Flowers102      70.99            94.97           +23.99   81.15              96.11             +14.96
FGVCAircraft    24.60            45.21           +20.61   45.99              49.41             +3.42
OxfordPets      89.04            93.30           +4.24    92.31              92.80             +0.49
Food101         88.73            90.09           +1.36    92.19              92.33             +0.14
Caltech101      89.93            93.97           +4.04    95.23              97.06             +1.83
SUN397          63.50            74.27           +10.77   65.85              74.24             +8.38
UCF101          68.07            82.95           +14.88   71.50              78.85             +7.35
StanfordCars    63.71            82.63           +18.92   88.81              92.12             +3.31
Average         65.31            80.55           +15.24   73.23              81.92             +8.69

Few-Shot Classification Performance


We conduct experiments across the standard 1-, 2-, 4-, 8-, and 16-shot settings. The figure below shows the performance curves of BiCLIP and BiSigLIP against five state-of-the-art baselines: classic Linear Probe adaptation, the prompt-tuning variants CoOp and CoCoOp, and the more recent multimodal prompt-learning methods MaPLe and PromptSRC.

Few-shot performance curves on ImageNet, DTD, EuroSAT, Flowers102, FGVC Aircraft, OxfordPets, Food101, Caltech101, SUN397, UCF101, and StanfordCars.

Angular Distribution Analysis


To better understand the effectiveness of BiCLIP, we analyze the angular distribution between positive and negative image-text pairs. A smaller overlap between these distributions indicates superior alignment and better class discriminability.
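The angular analysis above can be sketched as follows. `pair_angles` is a hypothetical helper (not from the paper) that assumes one unit-normalizable text feature per class and a label index per image; positive pairs are image-text pairs whose labels match.

```python
import numpy as np

def pair_angles(img_feats, txt_feats, labels):
    """Angles (degrees) between image and per-class text features.

    Returns the angles of positive (matching-label) pairs and of all
    remaining negative pairs; less overlap between the two populations
    means better class discriminability.
    """
    I = img_feats / np.linalg.norm(img_feats, axis=1, keepdims=True)
    T = txt_feats / np.linalg.norm(txt_feats, axis=1, keepdims=True)
    cos = np.clip(I @ T.T, -1.0, 1.0)
    ang = np.degrees(np.arccos(cos))
    pos_mask = labels[:, None] == np.arange(T.shape[0])[None, :]
    return ang[pos_mask], ang[~pos_mask]
```

The same computation applies to BiCLIP features after the transformation W has been absorbed into one modality, so the two histograms are directly comparable.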

Per-dataset angular distributions of positive and negative pairs, compared side by side for the baseline CLIP and for BiCLIP (ours).

Geometric Verification: Orthogonality Analysis


Recent research (Gupta et al., 2026) into vision-language models suggests that independently trained modalities are related by a shared orthogonal map. Our analysis confirms that the BiCLIP transformation matrix W preserves this property, maintaining near-orthogonality even after convergence.

Average normalized orthogonality error: 0.022 · ImageNet error: 0.009

We quantify this by computing the normalized Frobenius-norm deviation of W (with dimensions D × D) from orthogonality: ‖WᵀW − I‖_F / D.
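This metric is a one-liner in NumPy (a sketch assuming W is a square array; the function name is ours):

```python
import numpy as np

def orthogonality_error(W):
    """Normalized Frobenius deviation of W^T W from the identity.

    Exactly 0 for any orthogonal W; grows as W departs from a rigid
    rotation/reflection.
    """
    d = W.shape[0]
    return np.linalg.norm(W.T @ W - np.eye(d), ord="fro") / d
```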

  • Preservation of Knowledge: On datasets like ImageNet (0.009 error) and Food101 (0.006), the error is nearly negligible, showing the manifold stays close to its canonical state.
  • Targeted Adaptation: Specialized domains like EuroSAT (0.024) and DTD (0.055) show a slight departure from pure orthogonality, indicating necessary non-rigid "warping" to align the features.
  • Theoretical Alignment: This empirical evidence validates that domain adaptation in VLMs is fundamentally a problem of recovering relative rotation and scaling.

Experimental Findings & Theoretical Validation


๐Ÿš€

State-of-the-Art Adaptation

BiCLIP achieves an average improvement of +15.24% over zero-shot CLIP, and BiSigLIP +8.69% over zero-shot SigLIP. The most significant gains occur in specialized domains, such as +42.15% on EuroSAT, demonstrating robustness to extreme distribution shifts.

๐Ÿ“

Geometric Stability

Our analysis confirms that the learned transformation W remains nearly orthogonal, with an average normalized error of only 0.022. This validates that BiCLIP performs a structured rotation rather than arbitrary warping.

๐Ÿ“‰

Extreme Parameter Efficiency

By learning a single structured matrix, BiCLIP requires fewer parameters than state-of-the-art CLIP adaptation methods.
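As a rough illustration of that footprint (assuming a 512-dimensional joint embedding, the dimension used by CLIP ViT-B backbones; the paper does not state this figure), an upper-triangular W has d(d+1)/2 free parameters:

```python
# Free parameters of an upper-triangular d x d matrix: d*(d+1)//2.
# d = 512 is an assumed embedding dimension, not a number from the paper.
d = 512
n_params = d * (d + 1) // 2
print(n_params)  # 131328, i.e. about 0.13M parameters
```

By comparison, prompt-learning methods typically train prompt vectors plus auxiliary modules across multiple layers, so a single structured matrix of this size is at the low end of adapter footprints.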

โš–๏ธ

Domain Canonicalization

The empirical results support our hypothesis: domain shift in VLMs can be recovered by a canonical geometric transformation. This bridges the gap between pre-trained multimodal spaces and specialized visual manifolds.

Citation


                @misc{mantini2026biclipdomaincanonicalizationstructured,
                  title={BiCLIP: Domain Canonicalization via Structured Geometric Transformation},
                  author={Pranav Mantini and Shishir K. Shah},
                  year={2026},
                  eprint={2603.08942},
                  archivePrefix={arXiv},
                  primaryClass={cs.CV},
                  url={https://arxiv.org/abs/2603.08942}
                }