
BiCLIP: Domain Canonicalization via Structured Geometric Transformation

Pranav Mantini*, Shishir K. Shah+

*University of Houston, +The University of Oklahoma

BiCLIP realigns visual features to the textual manifold.

Abstract



Recent advances in vision-language models (VLMs) have demonstrated remarkable zero-shot capabilities, yet adapting these models to specialized domains remains a significant challenge. Building on recent theoretical insights suggesting that independently trained VLMs are related by a canonical transformation, we extend this understanding to the concept of domains. We hypothesize that image features across disparate domains are related by a canonicalized geometric transformation that can be recovered using a small set of anchors. Few-shot classification provides a natural setting for this alignment, as the limited labeled samples serve as the anchors required to estimate this transformation. Motivated by this hypothesis, we introduce BiCLIP, a framework that applies a targeted transformation to multimodal features to enhance cross-modal alignment. Our approach is characterized by its extreme simplicity and low parameter footprint. Extensive evaluations across 11 standard benchmarks, including EuroSAT, DTD, and FGVCAircraft, demonstrate that BiCLIP consistently achieves state-of-the-art results. Furthermore, we provide empirical verification of existing geometric findings by analyzing the orthogonality and angular distribution of the learned transformations, confirming that structured alignment is the key to robust domain adaptation. Code is available at this https URL

Methodology: BiCLIP



Figure 1. The BiCLIP Adaptation Framework. Unlike standard CLIP which relies on a fixed dot product, BiCLIP introduces a trainable, structured transformation matrix W between the image and text modalities.

As shown in the schematic, prompts and images are projected through their respective encoders. Whereas standard CLIP scores each pair with a direct dot product (producing a static similarity matrix), BiCLIP introduces a learnable structured upper-triangular matrix W.

W is applied between the image features I_i and the text features T_j: the matrix of bilinear products (shown on the right) holds the geometrically realigned similarity scores, with the term I_i W T_j computed for every image-text pair. W is trained with the standard contrastive losses used for CLIP and SigLIP.
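The bilinear scoring above can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's implementation: the function names and the logit scale are our own, and we assume W acts on L2-normalized features as in standard CLIP.

```python
import numpy as np

def make_upper_triangular(params, d):
    """Assemble W (d x d) from its d*(d+1)//2 free upper-triangular entries."""
    W = np.zeros((d, d))
    W[np.triu_indices(d)] = params
    return W

def biclip_logits(img_feats, txt_feats, W, logit_scale=100.0):
    """Bilinear similarity I_i W T_j for every image/text pair.

    Normalizes both modalities, then inserts W between them; with W = I
    this reduces to the standard CLIP cosine-similarity logits.
    """
    I = img_feats / np.linalg.norm(img_feats, axis=1, keepdims=True)
    T = txt_feats / np.linalg.norm(txt_feats, axis=1, keepdims=True)
    return logit_scale * (I @ W @ T.T)
```

The upper-triangular parameterization halves the free parameters of a full d × d map while still allowing rotation-plus-scaling behavior; setting W to the identity recovers the original CLIP similarity.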

Experimental Results


Main Results (16-Shot Performance): Performance comparison of BiCLIP across 11 diverse datasets under 16-shot adaptation. BiCLIP consistently outperforms the zero-shot baselines, with particularly large gains on specialized domains such as EuroSAT and DTD.

Dataset         CLIP Zero-Shot   BiCLIP (Ours)   Δ        SigLIP Zero-Shot   BiSigLIP (Ours)   Δ
ImageNet        68.84            71.69           +2.85    74.89              76.73             +1.83
DTD             42.82            71.86           +29.04   62.23              73.94             +11.70
EuroSAT         48.22            85.13           +36.91   35.35              77.50             +42.15
Flowers102      70.99            94.97           +23.99   81.15              96.11             +14.96
FGVCAircraft    24.60            45.21           +20.61   45.99              49.41             +3.42
OxfordPets      89.04            93.30           +4.24    92.31              92.80             +0.49
Food101         88.73            90.09           +1.36    92.19              92.33             +0.14
Caltech101      89.93            93.97           +4.04    95.23              97.06             +1.83
SUN397          63.50            74.27           +10.77   65.85              74.24             +8.38
UCF101          68.07            82.95           +14.88   71.50              78.85             +7.35
StanfordCars    63.71            82.63           +18.92   88.81              92.12             +3.31
Average         65.31            80.55           +15.24   73.23              81.92             +8.69

Few-Shot Classification Performance


We conduct experiments across the standard 1-, 2-, 4-, 8-, and 16-shot settings. The figure below shows the performance curves of BiCLIP and BiSigLIP against five state-of-the-art baselines: classic Linear Probe adaptation, the prompt-tuning variants CoOp and CoCoOp, and the more recent multimodal prompt-learning methods MaPLe and PromptSRC.

Few-shot performance curves on ImageNet, DTD, EuroSAT, Flowers102, FGVC Aircraft, OxfordPets, Food101, Caltech101, SUN397, UCF101, and StanfordCars.

Angular Distribution Analysis


To better understand the effectiveness of BiCLIP, we analyze the angular distribution between positive and negative image-text pairs. A smaller overlap between these distributions indicates superior alignment and better class discriminability.
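The angular analysis above can be sketched as follows. `pair_angles` is a hypothetical helper (not from the paper) that assumes one unit-normalizable text feature per class and a label index per image; positive pairs are image-text pairs whose labels match.

```python
import numpy as np

def pair_angles(img_feats, txt_feats, labels):
    """Angles (degrees) between image and per-class text features.

    Returns the angles of positive (matching-label) pairs and of all
    remaining negative pairs; less overlap between the two populations
    means better class discriminability.
    """
    I = img_feats / np.linalg.norm(img_feats, axis=1, keepdims=True)
    T = txt_feats / np.linalg.norm(txt_feats, axis=1, keepdims=True)
    cos = np.clip(I @ T.T, -1.0, 1.0)
    ang = np.degrees(np.arccos(cos))
    pos_mask = labels[:, None] == np.arange(T.shape[0])[None, :]
    return ang[pos_mask], ang[~pos_mask]
```

The same computation applies to BiCLIP features after the transformation W has been absorbed into one modality, so the two histograms are directly comparable.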

Per-dataset angular distributions of positive and negative pairs, compared side by side for the baseline CLIP and for BiCLIP (ours).

Geometric Verification: Orthogonality Analysis


Recent research (Gupta et al., 2026) into vision-language models suggests that independently trained modalities are related by a shared orthogonal map. Our analysis confirms that the BiCLIP transformation matrix W preserves this property, maintaining near-orthogonality even after convergence.

Average normalized orthogonality error: 0.022 · ImageNet error: 0.009

We quantify this by computing the normalized Frobenius-norm deviation of W (with dimensions D × D) from orthogonality: ‖WᵀW − I‖_F / D.
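This metric is a one-liner in NumPy (a sketch assuming W is a square array; the function name is ours):

```python
import numpy as np

def orthogonality_error(W):
    """Normalized Frobenius deviation of W^T W from the identity.

    Exactly 0 for any orthogonal W; grows as W departs from a rigid
    rotation/reflection.
    """
    d = W.shape[0]
    return np.linalg.norm(W.T @ W - np.eye(d), ord="fro") / d
```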

  • Preservation of Knowledge: On datasets like ImageNet (0.009 error) and Food101 (0.006), the error is nearly negligible, showing the manifold stays close to its canonical state.
  • Targeted Adaptation: Specialized domains like EuroSAT (0.024) and DTD (0.055) show a slight departure from pure orthogonality, indicating necessary non-rigid "warping" to align the features.
  • Theoretical Alignment: This empirical evidence validates that domain adaptation in VLMs is fundamentally a problem of recovering relative rotation and scaling.

Experimental Findings & Theoretical Validation


๐Ÿš€

State-of-the-Art Adaptation

BiCLIP achieves an average improvement of +15.24% over zero-shot CLIP, and BiSigLIP +8.69% over zero-shot SigLIP. The most significant gains occur in specialized domains, such as +42.15% on EuroSAT, demonstrating robustness to extreme distribution shifts.

๐Ÿ“

Geometric Stability

Our analysis confirms that the learned transformation W remains nearly orthogonal, with an average normalized error of only 0.022. This validates that BiCLIP performs a structured rotation rather than arbitrary warping.

๐Ÿ“‰

Extreme Parameter Efficiency

By learning a single structured matrix, BiCLIP requires fewer parameters than state-of-the-art CLIP adaptation methods.
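As a rough illustration of that footprint (assuming a 512-dimensional joint embedding, the dimension used by CLIP ViT-B backbones; the paper does not state this figure), an upper-triangular W has d(d+1)/2 free parameters:

```python
# Free parameters of an upper-triangular d x d matrix: d*(d+1)//2.
# d = 512 is an assumed embedding dimension, not a number from the paper.
d = 512
n_params = d * (d + 1) // 2
print(n_params)  # 131328, i.e. about 0.13M parameters
```

By comparison, prompt-learning methods typically train prompt vectors plus auxiliary modules across multiple layers, so a single structured matrix of this size is at the low end of adapter footprints.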

โš–๏ธ

Domain Canonicalization

The empirical results support our hypothesis: domain shift in VLMs can be recovered by a canonical geometric transformation. This bridges the gap between pre-trained multimodal spaces and specialized visual manifolds.

Citation


                @misc{mantini2026biclipdomaincanonicalizationstructured,
                  title={BiCLIP: Domain Canonicalization via Structured Geometric Transformation},
                  author={Pranav Mantini and Shishir K. Shah},
                  year={2026},
                  eprint={2603.08942},
                  archivePrefix={arXiv},
                  primaryClass={cs.CV},
                  url={https://arxiv.org/abs/2603.08942}
                }