Welcome to the documentation of ITRA!

ITRA (abbreviation for Image Text Representation Alignment) is a codebase for flexible and efficient vision-language learning. ITRA features a unified interface to easily access state-of-the-art pretrained models, adapters, and loss functions from various sources.

ITRA supports training, evaluation, and benchmarking on a rich variety of tasks, including zero-shot/k-NN/linear classification, retrieval, and word and sentence embedding evaluation. ITRA is also highly modular, extensible, and configurable, facilitating future development and customization.

Important
ITRA is an ongoing project developed by the Artificial Intelligence of Multi-modality Group (AIM Group, https://multimodality.group) at Hohai University, led by Prof. Fan Liu. A temporary repository of the codebase is located at: https://github.com/ChenDelong1999/ITRA

Note
If you find any bugs or have any recommendations for building ITRA, please raise an issue in the repo, thanks~
About This Codebase
ITRA is a codebase for flexible and efficient Image Text Representation Alignment…
Model Builder
TorchHub
ChineseCLIP
…
Training Objectives
CLIP: InfoNCE, ProtoCLIP
Self-supervised KD: RKD, SEED, CompRess, ProtoCPC, SimReg
VICReg, BarlowTwins, DINO
Downstream Evaluation
Image classification: zero-shot, linear/k-NN, and clustering evaluation (AMI, NMI) (from ProtoCLIP)
ELEVATER Image Classification Toolkit on 20 datasets
Image-text retrieval on the MS-COCO dataset
Sentence embeddings (SentEval)
Passage retrieval on MS-MARCO and Wiki Sections
Word embeddings: RG65, Simlex999, WordSim353
Zero-shot VQA (TAP-C) and visual entailment
…
Change Log
V0.0.1
2023.01.xx
Initial internal release.
Install Dependencies
Create a conda environment and install PyTorch:
conda create -n ITRA python=3.10.0
conda activate ITRA
This repo requires PyTorch (1.12) and torchvision (0.13). Please install them via the PyTorch official website.
conda install pytorch==1.12.0 torchvision==0.13.0 torchaudio==0.12.0 cudatoolkit=10.2 -c pytorch
Clone this repo:
# TODO: update repo name
git clone https://github.com/ChenDelong1999/ITRA
cd ITRA
export PYTHONPATH="$PYTHONPATH:$PWD/itra"
Note: If an import error occurs later, run
export PYTHONPATH="$PYTHONPATH:$PWD/itra"
again.
Install additional dependencies:
conda install pillow pandas scikit-learn ftfy tqdm matplotlib
conda install -c huggingface transformers
conda install -c conda-forge sentence-transformers
pip install adapter-transformers open_clip_torch pycocotools wandb timm clip-benchmark pyyaml
# TODO: faiss-gpu does not support Windows OS, maybe use pip install faiss instead?
pip install faiss-gpu
# ELEVATER requirements
pip install yacs git+https://github.com/haotian-liu/CLIP_vlp.git vision-evaluation
# TODO: remove nori dependency
pip install nori2
Prepare Data
Image-text Pairs Dataset from CSV file
This codebase reads a CSV file (separated by \t) with two columns: a path to an image (filepath by default) and a text caption (title by default).
filepath | title |
---|---|
path/to/image.jpg | A very typical bus station |
... | ... |
Specifying --train-data 'path/to/your/csvfile.csv' enables training a model on the dataset, and specifying --retrieval-data 'path/to/your/csvfile.csv' together with --retrieval-frequency > 0 performs retrieval evaluation on the dataset.
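For instance, such a CSV file could be produced with pandas as in the sketch below (the image paths and captions are placeholders):
import pandas as pd

# Hypothetical image paths and captions; replace with your own data.
records = [
    {'filepath': 'images/0001.jpg', 'title': 'A very typical bus station'},
    {'filepath': 'images/0002.jpg', 'title': 'A dog playing in the snow'},
]

# ITRA expects a tab-separated file with 'filepath' and 'title' columns by default.
pd.DataFrame(records).to_csv('my_dataset.csv', sep='\t', index=False)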
The script itra/utils/gather_cc.py will collect the Conceptual Captions (CC3M) dataset. First, download the Conceptual Captions URLs from here, then run the following script:
python3 itra/utils/gather_cc.py path/to/Train_GCC-training.tsv
Note
As noted in our ProtoCLIP paper, the CC3M dataset was made public by Google in 2018, but the number of accessible images keeps dropping due to expired image links, an issue raised by several recent works. In this work, since we could only collect 2,643,718 images (concurrent to our work ProtoCLIP, CyCLIP collected 2,631,703 images), we randomly sample a 2,500,000-image subset (75% of full CC3M) from them to train our ProtoCLIP. Considering the dropping accessibility of image links in Conceptual Captions, we call for the use of this dataset size (2.5M) in future benchmarking for better comparability.
Important
The CC3M validation data required by OpenCLIP is no longer needed in this codebase. To perform retrieval evaluation, please use the --retrieval-data argument instead. The webdataset format is no longer supported in this codebase.
MS COCO Captions dataset
To use the MS COCO 2017 Captions dataset, download it to --datasets-dir and specify --train-data 'mscoco_captions' or --retrieval-data 'mscoco_captions'.
<--datasets-dir>
└── coco2017
    ├── annotations
    ├── train2017
    └── val2017
The dataset contains 118k training images and 5k validation images, and each image has 4-5 captions. When using the training images, the total number of samples per epoch is set to 118k, and one caption is chosen at random in the __getitem__ function.
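Conceptually, the random caption selection works like the sketch below (a simplified illustration, not the actual dataset class in this codebase):
import random
from PIL import Image
from torch.utils.data import Dataset

class CocoCaptionPairs(Dataset):
    """Simplified sketch: one image paired with one randomly chosen caption."""

    def __init__(self, samples, transform=None):
        # samples: list of (image_path, [caption_1, ..., caption_k]) tuples
        self.samples = samples
        self.transform = transform

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, index):
        image_path, captions = self.samples[index]
        image = Image.open(image_path).convert('RGB')
        if self.transform is not None:
            image = self.transform(image)
        # Each image has 4-5 captions; pick one at random every time it is sampled.
        caption = random.choice(captions)
        return image, caption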
Image Classification Dataset
Add your dataset into itra/data/classification_datasets.py and add your dataset name (e.g., 'YourCustomDataset') to AVALIABLE_CLASSIFICATION_DATASETS. Then you can use this dataset via --train-data 'YourCustomDataset'.
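As a rough, hypothetical illustration (the class and the registry call in the comment below are assumptions for illustration, not the codebase's actual interface; see itra/data/classification_datasets.py for the real one):
# Hypothetical sketch only -- check itra/data/classification_datasets.py
# for the actual registry format and expected dataset interface.
from torchvision.datasets import ImageFolder

class YourCustomDataset(ImageFolder):
    """A torchvision-style classification dataset: returns (image, label)
    and exposes .classes, e.g. for building zero-shot text prompts."""
    def __init__(self, root, transform=None):
        super().__init__(root=root, transform=transform)

# Then register the name so --train-data 'YourCustomDataset' can find it, e.g.:
# AVALIABLE_CLASSIFICATION_DATASETS.append('YourCustomDataset')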
SentEval Datasets
The code for SentEval evaluation is modified from SimCSE.
cd <--dataset-dir>
wget https://huggingface.co/datasets/princeton-nlp/datasets-for-simcse/resolve/main/senteval.tar
tar xvf senteval.tar
Todo
MS MARCO
wiki sections
ELEVATER Image Classification Datasets
The ELEVATER Image Classification Toolkit (Elevater_Toolkit_IC) implements standardized evaluations of vision-language models. It covers zero-shot classification, few-/full-shot linear probing, and full fine-tuning on 20 datasets. See the paper "ELEVATER: A Benchmark and Toolkit for Evaluating Language-Augmented Visual Models, NeurIPS 2022 Datasets and Benchmarks Track" for more details.
We have included Elevater_Toolkit_IC in our codebase (in itra/evaluation/vision_benchmark). We have registered new models (clip_zeroshot_eval.py and cls_linear_or_ft_eval.py) following the official instructions. To ensure compatibility, we have made some modifications based on the official Elevater_Toolkit_IC code at commit 9d39620, so DO NOT install Elevater_Toolkit_IC in the environment for this codebase.
To get started, first download all datasets following this repo. The downloaded datasets take about 41 GB of storage, and the folder structure should be:
.../datasets
└── classification
    ├── caltech_101_20211007
    │   ├── labels.txt
    │   ├── test.txt
    │   ├── test.zip
    │   ├── train.txt
    │   └── train.zip
    ├── cifar100_20200721
    │   ├── labels.txt
    │   ├── test_images.txt
    │   ├── test_images.zip
    │   ├── train_images.txt
    │   └── train_images.zip
    ...
    └── voc2007_20211007
        ├── labels.txt
        ├── test_ic.txt
        ├── test.zip
        ├── train_ic.txt
        ├── train.zip
        └── val_ic.txt
21 directories, 115 files
NORI Datasets on OSS (for Megvii Users)
To use Conceptual Captions 3M:
--train-data 's3://chendelonghahab/datasets/ConceptualCaption3M/nori_CC2716261.csv'
# Nori Speed-up Commands
nori speedup 's3://chendelong/datasets/ConceptualCaption3M/CC_3M.nori' --on --replica=2
nori speedup 's3://chendelonghahab/datasets/ConceptualCaption3M/CC2.6M-CC2M.nori/' --on --replica=2
To use YFCCM-14M:
--train-data 's3://chendelonghahab/datasets/YFCC/YFCC_cleaned_nori.csv'
# zsh
# Nori Speed-up Commands
for ((i=0;i<=100;i++)) {
echo 'Processing nori part '$i'/100...'
nori speedup 's3://yzq/mmsl_datasets/YFCC15M/yfcc15m_'$i'.nori' --on --replica=2
}
Load Pretrained Multi-modal Weights
From OpenCLIP
OpenCLIP (v2.0.2) is an open source implementation of OpenAI's CLIP (Contrastive Language-Image Pre-training). To check all supported model architectures and pre-trained weights, run:
import open_clip
open_clip.list_pretrained()
# [('RN50', 'openai'), ('RN50', 'yfcc15m'), ('RN50', 'cc12m'), ('RN50-quickgelu', 'openai'), ('RN50-quickgelu', 'yfcc15m'), ('RN50-quickgelu', 'cc12m'), ('RN101', 'openai'), ('RN101', 'yfcc15m'), ('RN101-quickgelu', 'openai'), ('RN101-quickgelu', 'yfcc15m'), ('RN50x4', 'openai'), ('RN50x16', 'openai'), ('RN50x64', 'openai'), ('ViT-B-32', 'openai'), ('ViT-B-32', 'laion400m_e31'), ('ViT-B-32', 'laion400m_e32'), ('ViT-B-32', 'laion2b_e16'), ('ViT-B-32', 'laion2b_s34b_b79k'), ('ViT-B-32-quickgelu', 'openai'), ('ViT-B-32-quickgelu', 'laion400m_e31'), ('ViT-B-32-quickgelu', 'laion400m_e32'), ('ViT-B-16', 'openai'), ('ViT-B-16', 'laion400m_e31'), ('ViT-B-16', 'laion400m_e32'), ('ViT-B-16-plus-240', 'laion400m_e31'), ('ViT-B-16-plus-240', 'laion400m_e32'), ('ViT-L-14', 'openai'), ('ViT-L-14', 'laion400m_e31'), ('ViT-L-14', 'laion400m_e32'), ('ViT-L-14', 'laion2b_s32b_b82k'), ('ViT-L-14-336', 'openai'), ('ViT-H-14', 'laion2b_s32b_b79k'), ('ViT-g-14', 'laion2b_s12b_b42k'), ('roberta-ViT-B-32', 'laion2b_s12b_b32k'), ('xlm-roberta-base-ViT-B-32', 'laion5b_s13b_b90k'), ('xlm-roberta-large-ViT-H-14', 'frozen_laion5b_s13b_b90k')]
To load the official pretrained CLIP (ResNet-50):
--image-model 'RN50' --image-model-builder 'openclip' \
--text-model 'RN50' --text-model-builder 'openclip' \
--pretrained-image-model --pretrained-text-model \
Optionally, you can load CLIP models pretrained by OpenCLIP instead of OpenAI by specifying --image-model-tag and --text-model-tag. For example, to load the ViT-H-14 pretrained on LAION-2B:
--image-model 'ViT-H-14' --image-model-builder 'openclip' --image-model-tag 'laion2b_s32b_b79k' \
--text-model 'ViT-H-14' --text-model-builder 'openclip' --text-model-tag 'laion2b_s32b_b79k' \
--pretrained-image-model --pretrained-text-model \
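To sanity-check a model/tag combination outside of ITRA, the same weights can be loaded directly through the open_clip API (a minimal sketch):
import open_clip

# Load ViT-H-14 weights pretrained on LAION-2B, together with the matching
# train/val image preprocessing transforms.
model, preprocess_train, preprocess_val = open_clip.create_model_and_transforms(
    'ViT-H-14', pretrained='laion2b_s32b_b79k')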
From ChineseCLIP
ChineseCLIP (v1.4) is the Chinese version of CLIP, trained on a large-scale Chinese image-text pair dataset (~200M) so that users can conveniently perform image representation generation, cross-modal retrieval, and zero-shot image classification on Chinese data. The ChineseCLIP repo is based on the OpenCLIP project.
The ChineseCLIP models are also available on HuggingFace, but here we import the model via the cn_clip package for convenience since its code is similar to OpenCLIP.
To list available models (please see the Model Card provided by ChineseCLIP for more details):
from cn_clip.clip import available_models
available_models()
# ['ViT-B-16', 'ViT-L-14', 'ViT-L-14-336', 'ViT-H-14', 'RN50']
To load a ChineseCLIP with ResNet-50:
--image-model 'RN50' --image-model-builder 'chineseclip' \
--text-model 'RN50' --text-model-builder 'chineseclip' \
--pretrained-image-model --pretrained-text-model \
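For a quick standalone check outside of ITRA, the same weights can also be loaded through the cn_clip package itself (a sketch following the ChineseCLIP README; download_root is a placeholder):
import torch
from cn_clip.clip import load_from_name

device = 'cuda' if torch.cuda.is_available() else 'cpu'
# Download the ChineseCLIP RN50 checkpoint and return the model together with
# its image preprocessing transform.
model, preprocess = load_from_name('RN50', device=device, download_root='./')
model.eval()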
From Taiyi-CLIP
Taiyi-CLIP (太乙) employs chinese-roberta-wwm as the language encoder and the ViT-B-32 from CLIP as the vision encoder. The authors freeze the vision encoder and tune the language encoder to speed up and stabilize the pre-training process. Moreover, they use the Noah-Wukong dataset (100M) and the Zero dataset (23M) as pre-training datasets. See their documentation for details.
There are two CLIP models available via Taiyi-CLIP: Taiyi-CLIP-Roberta-102M-Chinese (doc) and Taiyi-CLIP-Roberta-large-326M-Chinese (doc). These two models are trained by Locked Image Tuning (LiT) on the ViT-B-32 and ViT-L-14 of OpenAI's CLIP, respectively. Therefore, to load these models:
# Taiyi-CLIP-Roberta-102M-Chinese
--image-model 'ViT-B-32' --image-model-builder 'openclip' \
--text-model 'IDEA-CCNL/Taiyi-CLIP-Roberta-102M-Chinese' --text-model-builder 'huggingface' \
--pretrained-image-model --pretrained-text-model \
# Taiyi-CLIP-Roberta-large-326M-Chinese
--image-model 'ViT-L-14' --image-model-builder 'openclip' \
--text-model 'IDEA-CCNL/Taiyi-CLIP-Roberta-large-326M-Chinese' --text-model-builder 'huggingface' \
--pretrained-image-model --pretrained-text-model \
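As a quick sanity check that the text tower resolves on the HuggingFace Hub (a sketch; we assume here that the checkpoint loads through the generic Auto classes):
from transformers import AutoModel, AutoTokenizer

# Resolve the Taiyi text encoder from the HuggingFace Hub. In ITRA, this text
# tower is paired with the pretrained OpenCLIP ViT-B-32 image tower as shown above.
name = 'IDEA-CCNL/Taiyi-CLIP-Roberta-102M-Chinese'
tokenizer = AutoTokenizer.from_pretrained(name)
text_encoder = AutoModel.from_pretrained(name)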
Load Pretrained Uni-modal Weights
Image Backbone
From Torchvision
To check all supported model architectures and pretrained weights, run the following command or see this page (v0.12).
import torchvision
torchvision.models.__dict__.keys()
--image-model-builder 'torchvision' --image-model 'resnet50' \
--image-model-builder 'torchvision' --image-model 'resnet50' --pretrained-image-model \
--image-model-builder 'torchvision' --image-model 'alexnet' \
--image-model-builder 'torchvision' --image-model 'convnext_tiny' \
--image-model-builder 'torchvision' --image-model 'wide_resnet50_2' \
--image-model-builder 'torchvision' --image-model 'vgg11' \
--image-model-builder 'torchvision' --image-model 'squeezenet1_0' \
--image-model-builder 'torchvision' --image-model 'inception_v3' \
--image-model-builder 'torchvision' --image-model 'mobilenet_v3_small' \
--image-model-builder 'torchvision' --image-model 'mnasnet0_5' \
--image-model-builder 'torchvision' --image-model 'shufflenet_v2_x0_5' \
--image-model-builder 'torchvision' --image-model 'efficientnet_b0' \
--image-model-builder 'torchvision' --image-model 'regnet_y_400mf' \
--image-model-builder 'torchvision' --image-model 'vit_b_16' \
From Torch Hub
import torch
for github in ['swav', 'dino', 'vicreg', 'barlowtwins', 'swag', 'deit']:
print(f'{github}:\t', torch.hub.list(f'facebookresearch/{github}'))
--image-model-builder 'torchhub' --image-model 'resnet50' --image-model-tag 'facebookresearch/swav:main' \
--image-model-builder 'torchhub' --image-model 'dino_vits16' --image-model-tag 'facebookresearch/dino:main' \
--image-model-builder 'torchhub' --image-model 'resnet50' --image-model-tag 'facebookresearch/vicreg:main' \
--image-model-builder 'torchhub' --image-model 'resnet50' --image-model-tag 'facebookresearch/barlowtwins:main' \
--image-model-builder 'torchhub' --image-model 'regnety_16gf' --image-model-tag 'facebookresearch/swag:main' \
...
https://github.com/facebookresearch/VICRegL
import torch
model = torch.hub.load('facebookresearch/vicregl:main', 'resnet50_alpha0p9')
model = torch.hub.load('facebookresearch/vicregl:main', 'resnet50_alpha0p75')
model = torch.hub.load('facebookresearch/vicregl:main', 'convnext_small_alpha0p9')
model = torch.hub.load('facebookresearch/vicregl:main', 'convnext_small_alpha0p75')
model = torch.hub.load('facebookresearch/vicregl:main', 'convnext_base_alpha0p9')
model = torch.hub.load('facebookresearch/vicregl:main', 'convnext_base_alpha0p75')
model = torch.hub.load('facebookresearch/vicregl:main', 'convnext_xlarge_alpha0p75')
For more details, see:
https://github.com/facebookresearch/swav
https://github.com/facebookresearch/dino
https://github.com/facebookresearch/vicreg
https://github.com/facebookresearch/barlowtwins
https://github.com/facebookresearch/SWAG
https://github.com/facebookresearch/deit/blob/main/README_deit.md
Text Backbone
From HuggingFace 🤗 Transformers
For more details, see HuggingFace Transformers. Currently, only the "from pretrained" mode is supported (i.e., you cannot train a HuggingFace transformer from scratch for now). Standard models like BERT/RoBERTa are supported, but support for other models has not been verified.
From Sentence Transformers
The Sentence Transformers library provides powerful sentence embeddings. Please see its pretrained models for more details. Loading sentence transformers via huggingface and specifying --text-pooler='mean' is recommended, though loading the model via the sbert builder is also supported:
# recommended:
--text-model-builder 'huggingface' --text-model 'sentence-transformers/all-mpnet-base-v2' --text-pooler='mean'
# not recommended:
--text-model-builder 'sbert' --text-model 'all-mpnet-base-v2'
However, it seems that the word embedding models (GloVe and Komninos) in sentence-transformers cannot be loaded via huggingface.
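The --text-pooler='mean' option corresponds to the usual attention-mask-aware mean pooling over token embeddings, roughly as sketched below (a simplified illustration, not the exact pooler code in this repo):
import torch

def mean_pooling(token_embeddings: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    """Average token embeddings while ignoring padding positions.

    token_embeddings: (batch, seq_len, hidden), attention_mask: (batch, seq_len)
    """
    mask = attention_mask.unsqueeze(-1).type_as(token_embeddings)  # (batch, seq_len, 1)
    summed = (token_embeddings * mask).sum(dim=1)
    counts = mask.sum(dim=1).clamp(min=1e-9)
    return summed / counts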
Custom Training Data
Episodic Training
--dataset-size 14000000 --episode-size 4000000 --train-data 'cache/yfcc_nori.csv' --nori-dataset \
--epochs 28 --save-frequency 28 --batch-size 64 --workers 8 \
Combining Multiple Datasets
…
(weighting strategy…)
Loss Functions
Loss | Original Task | Paper | Source Implementation | Uni-Directional | Need Prototype Layer |
---|---|---|---|---|---|
InfoNCE | Alignment | Learning Transferable Visual Models From Natural Language Supervision | https://github.com/mlfoundations/open_clip/blob/main/src/open_clip/loss.py#L65 | ||
SimReg | KD | SimReg: Regression as a Simple Yet Effective Tool for Self-supervised Knowledge Distillation | https://github.com/UCDvision/simreg/blob/main/simreg.py#L122 | ||
RKD | KD | Relational Knowledge Distillation | https://github.com/lenscloth/RKD/blob/master/metric/loss.py#L136 | ||
CompRess-1q | KD | CompRess: Self-Supervised Learning by Compressing Representations | https://github.com/UMBCvision/CompRess/blob/master/nn/compress_loss.py#L67 | ✔ | |
CompRess-2q | KD | CompRess: Self-Supervised Learning by Compressing Representations | https://github.com/UMBCvision/CompRess/blob/master/nn/compress_loss.py#L89 | ||
SEED | KD | SEED: Self-supervised Distillation For Visual Representation | https://github.com/jacobswan1/SEED/blob/master/tools/utils.py#L188 | ✔ | |
VICReg | SSL | VICReg: Variance-Invariance-Covariance Regularization For Self-Supervised Learning | https://github.com/facebookresearch/vicreg/blob/main/main_vicreg.py#L184 | ||
BarlowTwins | SSL | Barlow Twins: Self-Supervised Learning via Redundancy Reduction | https://github.com/facebookresearch/barlowtwins/blob/main/main.py#L187 | ||
DINO | SSL | Emerging Properties in Self-Supervised Vision Transformers | https://github.com/facebookresearch/dino/blob/main/main_dino.py#L363 | ✔ | ✔ |
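For reference, below is a minimal sketch of the InfoNCE objective used for CLIP-style alignment (simplified from the open_clip implementation linked above; no distributed feature gathering):
import torch
import torch.nn.functional as F

def info_nce(image_features, text_features, logit_scale):
    """Symmetric contrastive loss over one batch of paired image/text features."""
    image_features = F.normalize(image_features, dim=-1)
    text_features = F.normalize(text_features, dim=-1)
    logits_per_image = logit_scale * image_features @ text_features.t()
    logits_per_text = logits_per_image.t()
    # The i-th image is paired with the i-th text, so the targets are the diagonal.
    labels = torch.arange(image_features.size(0), device=image_features.device)
    return (F.cross_entropy(logits_per_image, labels) +
            F.cross_entropy(logits_per_text, labels)) / 2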
Use Adapters
The Adapter-Transformers library enables delta-tuning on popular HuggingFace transformers. See the Model Overview for available adapter methods, and see the Docs and AdapterHub for more details.
We have made the following adapters available in this codebase:
Adapter | args.adapter | Params (M) | Params (%) | STS Benchmark | ImageNet Zero-shot Accuracy | MSCOCO Retrieval Mean Recall |
---|---|---|---|---|---|---|
Compacter | dummy | 0.06 | 0.05% | 0.7474 | 24.48 | 38.73 |
(IA)^3 | ia3_adapter | 0.06 | 0.05% | 0.6576 | 19.23 | 31.90 |
LoRA | lora_adapter | 0.30 | 0.27% | 0.7514 | 25.02 | 40.58 |
Bottleneck adapters | bottleneck_adapter | 1.79 | 1.61% | 0.7449 | 26.15 | 41.85 |
Language Adapters | lang_adapter | 1.19 | 1.08% | 0.7405 | 26.71 | 42.39 |
Prefix Tuning | prefix_tuning | 9.88 | 8.28% | 0.7303 | 26.00 | 41.31 |
UniPELT | unipelt | 11.09 | 9.20% | 0.7441 | 26.89 | 43.45 |
Mix-and-Match Adapters | mam_adapter | 22.50 | 17.05% | 0.7503 | 29.61 | 45.82 |
Projection Head Adapters
Linear projection head
DINO MLP Head (optionally with a prototype layer at the end)
Freeze Model Parameters During Training
# lock the image tower, i.e., Locked Image Tuning (LiT) https://arxiv.org/abs/2111.07991
--lock-image-model \
# lock all parameters named 'weight' in the image tower
--lock-image-partial 'weight' \
# unlock only parameters named 'weight' in the image tower, while all other image-tower params are locked
--lock-image-partial '!weight' --lock-image-model \
# only train the first layer (transformer block) of the image backbone
--lock-image-partial '!resblocks.0' --lock-image-model \
# only unfreeze all bias and norm params, i.e., Bias and Normalization Optimization (BiNor) https://arxiv.org/abs/2203.07190
--lock-image-partial '!bias,!ln,!bn' --lock-text-partial '!bias,!ln' --lock-image-model --lock-text-model \
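Under the hood, this kind of pattern-based locking amounts to toggling requires_grad by parameter name. A simplified sketch is given below (the actual flag parsing in ITRA may differ; patterns prefixed with '!' are exceptions that stay trainable):
def lock_partial(model, patterns, lock_model=False):
    """Freeze parameters by name pattern.

    Patterns without '!' are locked; patterns prefixed with '!' are kept
    trainable when the whole tower is locked via lock_model=True.
    """
    lock_patterns = [p for p in patterns if not p.startswith('!')]
    keep_patterns = [p[1:] for p in patterns if p.startswith('!')]
    for name, param in model.named_parameters():
        if lock_model:
            # Lock everything except parameters matching a '!' pattern.
            param.requires_grad = any(p in name for p in keep_patterns)
        elif any(p in name for p in lock_patterns):
            param.requires_grad = False

# e.g., BiNor on the image tower: lock_partial(image_tower, ['!bias', '!ln', '!bn'], lock_model=True)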
Evaluate Pretrained Models
Zero-shot Image-text Retrieval
CLIP is a strong model for zero-shot image-text retrieval. Since the official paper only reports the performance of the largest CLIP ViT-L-14-336 (the standard 32 epochs plus an additional pretraining epoch at 336x336 resolution), here we present our evaluation of the other CLIP architectures. See the Papers with Code leaderboard for a performance comparison with other zero-shot retrieval methods.
Backbone | # Params all (M) | # Params image (M) | # Params text (M) | I2T R@1 | I2T R@5 | I2T R@10 | T2I R@1 | T2I R@5 | T2I R@10 | Mean Recall |
---|---|---|---|---|---|---|---|---|---|---|
RN50 | 102.01 | 38.32 | 63.69 | 48.06 | 73.88 | 83.02 | 28.31 | 52.96 | 64.10 | **58.39** |
RN101 | 119.69 | 56.26 | 63.43 | 49.80 | 74.42 | 82.72 | 30.18 | 54.15 | 65.28 | **59.43** |
RN50x16 | 290.98 | 167.33 | 123.65 | 55.38 | 78.24 | 86.30 | 35.24 | 59.47 | 69.58 | **64.04** |
ViT-B-32 | 151.28 | 87.85 | 63.43 | 50.02 | 75.00 | 83.24 | 30.36 | 54.77 | 66.09 | **59.91** |
ViT-B-16 | 149.62 | 86.19 | 63.43 | 51.72 | 76.76 | 84.26 | 32.70 | 57.77 | 68.26 | **61.91** |
ViT-L-14 | 427.94 | 304.29 | 123.65 | 56.08 | 79.60 | 86.90 | 35.33 | 59.96 | 70.15 | **64.67** |
ViT-L-14-336 | **427.94** | **304.29** | **123.65** | **57.46** | **80.34** | **87.58** | **36.09** | **60.66** | **70.76** | **65.48** |
ViT-L-14-336 (official) | **427.94** | **304.29** | **123.65** | **58.4** | **81.5** | **88.1** | **37.8** | **62.4** | **72.2** | **66.73** |
For ViT-L-14-336, there is a small gap between our evaluation and the officially reported results. We suspect it is caused by image pre-processing: the above re-implementations use the default Resize transform as implemented in the official CLIP repo, and since COCO images are mostly not square, this creates a small train-test domain gap due to distortion. If we instead use ResizeMaxSize as implemented here, the results surpass the officially reported performance.
Backbone | Pre-process | I2T R@1 | I2T R@5 | I2T R@10 | T2I R@1 | T2I R@5 | T2I R@10 | Mean Recall |
---|---|---|---|---|---|---|---|---|
ViT-L-14-336 | Resize | 57.46 | 80.34 | 87.58 | 36.09 | 60.66 | 70.76 | 65.48 |
ViT-L-14-336 | Official (unknown) | 58.4 | 81.5 | **88.1** | 37.8 | 62.4 | 72.2 | 66.73 |
ViT-L-14-336 | ResizeMaxSize | **59.20** | **81.70** | 87.96 | **39.02** | **63.86** | **73.52** | **67.54** |
Changing Resize into ResizeMaxSize brings a +2.06 improvement for ViT-L-14-336. However, we find that the benefit of this modification is not consistent across different backbones. As shown in the following table, ResizeMaxSize is generally more beneficial for large models, especially models that have been trained to process high-resolution images (e.g., it is quite beneficial for ViT-L-14-336 but not so much for ViT-L-14).
Backbone | RN50 | RN101 | RN50x16 | ViT-B-32 | ViT-B-16 | ViT-L-14 | ViT-L-14-336 |
---|---|---|---|---|---|---|---|
Mean recall improvement by switching to ResizeMaxSize | +0.45 | -0.13 | +0.10 | -0.74 | +0.83 | +0.96 | +2.06 |
Therefore, to keep it simple, we will use the default Resize transform in the following experiments.
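Conceptually, ResizeMaxSize scales the longer image side to the target resolution and pads the shorter side instead of distorting the aspect ratio. A rough sketch is shown below (the actual open_clip implementation differs in details such as the padding value and interpolation mode):
from PIL import Image
import torchvision.transforms.functional as TF

def resize_max_size(image: Image.Image, target: int = 336, fill: int = 0) -> Image.Image:
    """Resize so the longer side equals `target`, then pad to a square without distortion."""
    w, h = image.size
    scale = target / max(w, h)
    image = TF.resize(image, [round(h * scale), round(w * scale)])
    pad_w, pad_h = target - image.size[0], target - image.size[1]
    # Pad symmetrically (left, top, right, bottom) to center the resized image.
    return TF.pad(image, [pad_w // 2, pad_h // 2, pad_w - pad_w // 2, pad_h - pad_h // 2], fill=fill)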
# 1x2080ti machine
python itra/training/main.py \
--linear-frequency 0 --zeroshot-frequency 0 --retrieval-frequency 0 --nlp-eval-frequency 1 --datasets-dir '/data/Datasets' \
--retrieval-data 'mscoco_captions' \
--image-model 'RN50' --image-model-builder 'openclip' \
--text-model 'RN50' --text-model-builder 'openclip' \
--pretrained-image-model --pretrained-text-model \
--logs 'logs/MSCOCO-zeroshot' --name 'RN50x4-openclip-zeroshot-retrieval'
Coming soon…
Zero-shot Image Classification
Coming soon…
Linear Probing and k-NN Classification
Coming soon…
Clustering Evaluation
Coming soon…
Sentence Embedding Evaluation
STS-Benchmark, SICK…
MS MARCO Passage Retrieval…
Word embeddings…
Coming soon…
ELEVATER Image Classification Benchmark
You can perform ELEVATER evaluations of models trained with this codebase by making the necessary modifications and running the following commands:
conda activate vlkd
cd /data/codes/ProtoRKD
export PYTHONPATH="$PWD/src/training/evaluations:$PWD/src"
# zero-shot: model_cfg='clip_zeroshot_eval' mode='zeroshot'\
# few-shot: model_cfg='cls_linear_or_ft_eval' mode='linear_probe' num_shots=5 \
# linear prob: model_cfg='cls_linear_or_ft_eval' mode='linear_probe' num_shots=-1 \
# fine-tune: model_cfg='cls_linear_or_ft_eval' mode='finetune' num_shots=-1 \
for dataset (caltech101 cifar10 cifar100 country211 dtd eurosat-clip fer2013 fgvc-aircraft-2013b flower102 food101 gtsrb hateful-memes kitti-distance mnist oxford-iiit-pets patchcamelyon rendered-sst2 resisc45-clip stanfordcar voc2007classification)
{
#---> REPLACE THIS LINE WITH ONE OF FOUR OPTIONS ABOVE <---#
log_dir=# <YOUR EXPERIMENT DIR> \
ckpt_epoch=# <WHICH EPOCH> \
dataset_root=# <YOUR DATASET DIR> \
dataset=$dataset \
disable_hyperparameter_tuning=True \
bash run_evevater_eval.sh
}
For example:
conda activate vlkd
cd /data/codes/ProtoRKD
export PYTHONPATH="$PWD/src/training/evaluations:$PWD/src"
for dataset (caltech101 cifar10 cifar100 country211 dtd eurosat-clip fer2013 fgvc-aircraft-2013b flower102 food101 gtsrb hateful-memes kitti-distance mnist oxford-iiit-pets patchcamelyon rendered-sst2 resisc45-clip stanfordcar voc2007classification)
{
model_cfg='cls_linear_or_ft_eval' mode='finetune' num_shots=-1 \
log_dir='/data/codes/ProtoRKD/logs/codebase_test/U[mobilenet_v3_large-h2]-L[CLIP-from-RN50]-bs1024-YFCC-56ep-lr1e-5' \
ckpt_epoch=56 \
dataset=$dataset \
disable_hyperparameter_tuning=True \
dataset_root='/data/codes/ProtoRKD/src/training/evaluations/vision_benchmark/outputs/datasets'\
bash run_evevater_eval.sh
}
Then you can generate a submission file for EvalAI. For more details, please see the official instructions.
python src/training/evaluations/vision_benchmark/commands/prepare_submit.py \
--combine_path 'logs/codebase_test/L[mobilenet_v3_small-h2]-L[CLIP-from-RN50]-bs1024-YFCC-8ep/clip_zeroshot_eval/log/predictions/zeroshot_eval_wiki_False_wnh_False_wnd_False_gpt3_Falseagg_WIKI_AND_GPT3_gpt3count_0'
We provide a simple script to summarize the results:
python src/utils/summarize_ELEVATER_results.py
Input your log dir (end with "../ELEVATER_evaluation/<eval_mode>"):
>>> logs/U[mobilenet_v3_large-h2]-L[CLIP-from-RN50]-bs1024-YFCC-56ep-lr1e-5/ELEVATER_evaluation/zeroshot
Dataset zeroshot-accuracy%
0 caltech-101 70.4490
1 cifar-10 72.8000
2 cifar-100 37.1700
3 country211 7.0570
4 dtd 31.5430
5 eurosat_clip 25.3000
6 fer-2013 21.8170
7 fgvc-aircraft-2013b-variants102 5.1620
8 oxford-flower-102 45.4590
9 food-101 40.3290
10 gtsrb 8.8600
11 hateful-memes 52.4110
12 kitti-distance 14.3460
13 mnist 11.0400
14 oxford-iiit-pets 65.2600
15 patch-camelyon 50.7600
16 rendered-sst2 47.8860
17 resisc45_clip 23.2740
18 stanford-cars 5.0990
19 voc-2007-classification 77.5720
20 Average 35.6797
saved to logs/U[mobilenet_v3_large-h2]-L[CLIP-from-RN50]-bs1024-YFCC-56ep-lr1e-5/ELEVATER_evaluation/zeroshot/summary.csv
CLIP Pretraining
First, assume that you have already created an environment with the required dependencies and prepared the data for pre-training and downstream evaluations.
Then you can activate the environment and modify the PYTHONPATH variable, such that modules can be imported successfully.
conda activate ITRA
export PYTHONPATH="$PYTHONPATH:$PWD/src"
Standard Contrastive Language-Image Pretraining From Scratch
Training a CLIP from scratch is the most straightforward usage of ITRA. By specifying --loss 'InfoNCE', the model will contrast image and text samples within a batch.
# Example command for a 8x2080ti machine
torchrun --nproc_per_node 8 -m training.main \
--dataset-size 14000000 --episode-size 14000000 --train-data 'cache/yfcc_nori.csv' --nori-dataset\
--epochs 8 --save-frequency 8 --batch-size 64 --workers 8 \
--lr 5e-4 --warmup 2000 --wd 0.5 --max-grad-norm 5 \
--image-model 'RN50' --image-model-builder 'openclip' --text-model 'RN50' --text-model-builder 'openclip'\
--loss 'InfoNCE' \
--report-to tensorboard --logs 'logs/example-usage/clip-pretraining/YFCC14M-8_epoch-RN50'
Train a Tiny CLIP
AlexNet, MobileNet?
Small SBERT?
GloVe Embeddings?
Fine-tuning CLIP for MS-COCO Retrieval
In this section, we present an example usage and some empirical guidance for fine-tuning CLIP for image-text retrieval. We aim to improve the retrieval performance over CLIP's strong zero-shot retrieval ability (see our evaluation report above) by fine-tuning CLIP on the MS COCO Captions training set (118k images) with the InfoNCE loss. The contents and key findings of this section are as follows:
Fine-tuning CLIP on the MS COCO training set improves the retrieval mean recall by +15% compared to raw zero-shot retrieval.
Proper hyper-parameters bring at least a +1% improvement.
Scaling up the batch size by partially freezing CLIP weights brings a +1% improvement.
Compared to the zero-shot retrieval mean recall of 58.39% for the RN50 CLIP, we ultimately achieve 76.02% mean recall (a +17.63% improvement) by fine-tuning on an 8x2080ti machine.
Getting Started: Naive Fine-tuning Baseline
First, assume that you have already created an environment with the required dependencies and prepared the CSV datasets for pre-training and downstream evaluations. Then you can activate the environment and modify the PYTHONPATH variable, such that modules can be imported successfully.
conda activate ITRA
cd path/to/ITRA/
export PYTHONPATH="$PYTHONPATH:$PWD/itra"
Then we can start to fine-tune a CLIP on the MS-COCO Captions 2017 training set (118k images). The results should be compared with the Papers with Code leaderboard. Our baseline settings are listed as follows. We use a single-node machine with 8 NVIDIA GeForce 2080ti GPUs for training; one training epoch takes about 3.5 minutes.
backbone: ResNet50
batch_size: 32x8=256
dataset_size: 118287
epochs: 10
lr: 1e-05
opt: adamw
use_bn_sync: False
warmup: 100
weight_decay: 0.5
Training Command
torchrun --nproc_per_node 8 -m training.main \
--train-data 'mscoco_captions' --retrieval-data 'mscoco_captions' \
--retrieval-frequency 1 --datasets-dir '/data/Datasets' \
--epochs 10 --save-frequency 0 --batch-size 32 --workers 2 \
--lr 1e-5 --warmup 100 --weight_decay 0.5 --max-grad-norm 5 \
--image-model 'RN50' --image-model-builder 'openclip' --text-model 'RN50' --text-model-builder 'openclip'\
--pretrained-image-model --pretrained-text-model \
--loss 'InfoNCE' \
--report-to tensorboard --logs 'logs/MSCOCO-RN50' --name '10ep-bs256-lr1e-5-wd0.5'
Under this configuration, fine-tuning significantly improves the retrieval performance (58.39 → 73.98, +15.59).
Type | Model | # Params (M) | I2T R@1 | I2T R@5 | I2T R@10 | T2I R@1 | T2I R@5 | T2I R@10 | Mean Recall |
---|---|---|---|---|---|---|---|---|---|
Two-stream | Zero-shot CLIP RN50 | 102.01 | 48.06 | 73.88 | 83.02 | 28.31 | 52.96 | 64.1 | 58.39 |
Two-stream | Fine-tuned CLIP RN50 | 102.01 | 64.84 | 86.62 | 92.3 | 44.99 | 72.76 | 82.34 | 73.98 |
Two-stream | FLIP (ViT-L-14) | 427.94 | 78.9 | 94.4 | 97.4 | 61.2 | 84.3 | 90.6 | 84.5 |
Two-stream | Florence (CoSwin-H) | 637 | 81.8 | 95.2 | | 63.2 | 85.7 | | |
Single-stream | BLIP (large) | 220 | 80.6 | 95.2 | 97.6 | 63.1 | 85.3 | 91.1 | 85.5 |
Single-stream | PTP-BLIP (large) | 220 | 84.2 | 79.3 | 98.8 | 68.8 | 89.5 | 94.2 | 88.8 |
Note
Here Florence and PTP-BLIP are respectively the two-stream and single-stream SoTA retrieval methods on the Papers with Code leaderboard as of 2022.12.
Tuning Hyper-parameters
1. Learning Rate. We vary the learning rate from 5e-6 to 1e-4, and find that 1e-5 and 2e-5 work well for a batch size of 256. This result confirms the observations in this paper, where the authors showed that good ImageNet fine-tuning of CLIP ViT-B-16 needs a quite small learning rate (2e-5 and 3e-5 for a batch size of 2048).
Learning Rate | lr5e-6 | lr1e-5 | lr2e-5 | lr3e-5 | lr5e-5 | lr1e-4 |
---|---|---|---|---|---|---|
Mean Recall | 72.91 | **73.98** | **73.97** | 73.32 | 72.46 | 69.34 |
2. Weight Decay. The authors of the SLIP paper observed that a larger weight decay (0.5) is beneficial for CLIP. Our experiments show that CLIP can also handle a very large weight decay (e.g., 2.50). Here the training data has 118k samples, and we believe this property can further benefit CLIP fine-tuning when data is limited. Our results, shown in the following table, indicate that CLIP is quite robust to weight decay changes: when varying the value from 0.01 to 2.50, the performance changes within a range of only ±0.43.
Weight Decay | 2.50 | 2.25 | 2.00 | 1.75 | 1.50 | 1.25 | 1.00 | 0.75 | 0.50 | 0.10 | 0.05 | 0.01 |
---|---|---|---|---|---|---|---|---|---|---|---|---|
Mean Recall | 74.07 | 73.94 | 73.87 | 73.84 | 73.94 | 73.94 | 74.05 | 73.87 | 73.98 | 73.93 | 73.64 | 73.79 |
3. Training Length. Similar to the experiments in FLIP, our experiments show that scaling up the number of training epochs does not lead to further performance improvement. Training for only 5 or 10 epochs is not sufficient, but 15-20 epochs already seems to reach saturation.
Epochs | 5 | 10 | 15 | 20 | 30 |
---|---|---|---|---|---|
Learning Rate=1e-5 | 72.66 | 73.98 | 74.43 | 74.45 | 73.96 |
Learning Rate=2e-5 | 72.86 | 73.97 | 74.28 | 74.02 | 74.03 |
4. Batch Size. It is well known that batch size has a crucial impact on contrastive learning methods. We confirm this by varying the batch size from 32 to 800 (the maximum batch size for ResNet-50 CLIP on an 8x2080ti machine) while changing the learning rate according to the linear scaling rule (see the sketch after the following table). It shows that scaling down the batch size leads to a significant performance drop:
BatchSize | 800 | 512 | 256 | 128 | 64 | 32 |
---|---|---|---|---|---|---|
Learning Rate | 3.125E-05 | 2.00E-05 | 1.00E-05 | 5.00E-06 | 2.50E-06 | 1.25E-06 |
Mean Recall | 74.89 | 74.85 | 73.98 | 72.14 | 69.24 | 65.04 |
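For reference, the learning rates in the table follow the linear scaling rule relative to the bs=256 / lr=1e-5 baseline, i.e. lr = 1e-5 × batch_size / 256:
base_lr, base_bs = 1e-5, 256
for bs in [800, 512, 256, 128, 64, 32]:
    # 800 -> 3.125e-05, 512 -> 2e-05, 256 -> 1e-05, 128 -> 5e-06, 64 -> 2.5e-06, 32 -> 1.25e-06
    print(bs, base_lr * bs / base_bs)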
5. ✨ Improved Naive Baseline with Better Hyper-parameters. Combining all of the above hyper-parameter sweep observations, we increase the mean recall of the naive fine-tuning baseline from 73.98 to 75.04.
 | Baseline Hyper-parameters | ✨ Improved Hyper-parameters |
---|---|---|
backbone: | ResNet50 | ResNet50 |
batch_size: | 32x8=256 | 100x8=800 |
epochs: | 10 | 15 |
lr: | 1e-05 | 3.125e-05 |
weight_decay: | 0.5 | 1.0 |
Training Command
torchrun --nproc_per_node 8 -m training.main \
--train-data 'mscoco_captions' --retrieval-data 'mscoco_captions' \
--retrieval-frequency 1 --datasets-dir '/data/Datasets' \
--epochs 15 --save-frequency 0 --batch-size 100 --workers 2 \
--lr 3125e-8 --warmup 100 --weight_decay 1.0 --max-grad-norm 5 \
--image-model 'RN50' --image-model-builder 'openclip' --text-model 'RN50' --text-model-builder 'openclip'\
--pretrained-image-model --pretrained-text-model \
--loss 'InfoNCE' \
--report-to tensorboard --logs 'logs/MSCOCO-RN50' --name '15ep-bs800-lr3125e-8-wd1.0'
Results:
Model | I2T R@1 | I2T R@5 | I2T R@10 | T2I R@1 | T2I R@5 | T2I R@10 | Mean Recall |
---|---|---|---|---|---|---|---|
Baseline | 64.84 | 86.62 | 92.30 | 44.99 | 72.76 | 82.34 | 73.98 |
Improved Baseline | 65.34 | 87.44 | 92.84 | 46.70 | 74.45 | 83.47 | 75.04 |
Scaling up Batch Size by Partially Freezing Weights
Fine-tuning Strategy | Image Params | Text Params | Total Trainable Params (M) | % | I2T R@1 | I2T R@5 | I2T R@10 | T2I R@1 | T2I R@5 | T2I R@10 | Mean Recall |
---|---|---|---|---|---|---|---|---|---|---|---|
zero-shot evaluation | - | - | 0 | 0.0% | 48.06 | 73.88 | 83.02 | 28.31 | 52.96 | 64.10 | 58.39 |
lock CLIP and add linear projection heads | linear head | linear head | 2.1 | 2.1% | 47.24 | 75.06 | 84.82 | 32.91 | 61.21 | 72.83 | 62.34 |
lock CLIP and add MLP projection heads | MLP head | MLP head | 16.79 | 16.5% | 53.12 | 79.86 | 87.76 | 37.46 | 65.63 | 76.41 | 66.71 |
lock image tune text | - | All | 63.69 | 62.4% | 62.12 | 85.12 | 91.46 | 42.52 | 70.34 | 80.31 | 71.98 |
lock text tune image | All | - | 38.32 | 37.6% | 59.78 | 84.10 | 90.86 | 43.57 | 71.02 | 80.76 | 71.68 |
naïve fine-tuning (improved baseline) | All | All | 102.01 | 100.0% | 65.34 | 87.44 | 92.84 | 46.70 | 74.45 | 83.47 | 75.04 |
lock image and partially fine-tune text
Text Params | projection+ln_final | 11 | 10,11 | 8~11 | 6~11 | 4~11 | 2~11 | 0~11 | All |
---|---|---|---|---|---|---|---|---|---|
Total Trainable Params (M) | 0.53 | 3.68 | 6.83 | 13.13 | 19.44 | 25.74 | 32.05 | 38.35 | 63.69 |
% | 0.5% | 3.6% | 6.7% | 12.9% | 19.1% | 25.2% | 31.4% | 37.6% | 62.4% |
Mean Recall | 67.23 | 69.15 | 70.27 | 71.36 | 71.79 | 71.96 | 72.00 | 72.18 | 71.98 |
lock text and partially fine-tune image
Image Params | attnpool | attnpool,layer4 | attnpool,layer4,3 | attnpool,layer4,3,2 | attnpool,layer4,3,2,1 | All |
---|---|---|---|---|---|---|
Text Params | - | - | - | - | - | - |
Total Trainable Params (M) | 14.79 | 29.75 | 36.85 | 38.07 | 38.29 | 38.32 |
% | 14.5% | 29.2% | 36.1% | 37.3% | 37.5% | 37.6% |
Mean Recall | 71.33 | 72.49 | 72.23 | 71.89 | 71.82 | 71.68 |
Scale up batch size
Fine-tuning Strategy | Image Params | Text Params | Total Trainable Params (M) | % | I2T R@1 | I2T R@5 | I2T R@10 | T2I R@1 | T2I R@5 | T2I R@10 | Mean Recall |
---|---|---|---|---|---|---|---|---|---|---|---|
naïve fine-tuning (improved baseline) bs-800-lr3.125e-5 | All | All | 102.01 | 100.0% | 65.34 | 87.44 | 92.84 | 46.70 | 74.45 | 83.47 | 75.04 |
bs800-lr3.125e-5 | attnpool,layer4 | 0~11 | 68.11 | 66.8% | 66.10 | 87.60 | 93.56 | 47.61 | 75.17 | 84.18 | 75.70 |
bs1792-lr7e-5 | attnpool,layer4 | 0~11 | 68.11 | 66.8% | 65.95 | 88.30 | 93.66 | 48.08 | 75.71 | 84.42 | 76.02 |
Training Command
torchrun --nproc_per_node 8 -m training.main \
--train-data 'mscoco_captions' --retrieval-data 'mscoco_captions' \
--retrieval-frequency 1 --datasets-dir '/data/Datasets' \
--epochs 15 --save-frequency 15 --batch-size 224 --workers 4 \
--lr 7e-5 --warmup 100 --weight_decay 1.0 --max-grad-norm 5 \
--image-model 'RN50' --image-model-builder 'openclip' --text-model 'RN50' --text-model-builder 'openclip'\
--pretrained-image-model --pretrained-text-model --lock-image-model \
--lock-text-partial 'positional_embedding,token_embedding' \
--lock-image-partial '!attnpool,!layer4' \
--loss 'InfoNCE' \
--report-to tensorboard --logs 'logs/MSCOCO-RN50-partial' --name 'save-lock-image(!attnpool,!layer4)-lock-text(positional_embedding,token_embedding)-bs1792-lr7e-5'
More Tricks for Fine-tuning
Layer-wise Learning Rate Decay (LLDR)
--layer_decay_image 0.9 --layer_decay_text 1 \
for layer_decay_text in 1.0 0.95 0.9 0.85 0.8 0.75 0.7 0.65 0.6;
do
torchrun --nproc_per_node 8 -m training.main \
--train-data 'mscoco_captions' --retrieval-data 'mscoco_captions' \
--retrieval-frequency 1 --datasets-dir '/data/Datasets' \
--epochs 15 --save-frequency 0 --batch-size 224 --workers 2 \
--lr 7e-5 --warmup 100 --weight_decay 1.0 --max-grad-norm 5 \
--image-model 'RN50' --image-model-builder 'openclip' --text-model 'RN50' --text-model-builder 'openclip'\
--pretrained-image-model --pretrained-text-model --lock-image-model \
--lock-text-partial 'positional_embedding,token_embedding' \
--lock-image-partial '!attnpool,!layer4' \
--loss 'InfoNCE' \
--report-to tensorboard --logs 'logs/MSCOCO-RN50-LLDR' --name 'layer_decay_text='$layer_decay_text \
--layer_decay_text $layer_decay_text;
done
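Layer-wise decay assigns each layer a learning rate scaled by decay^(distance from the top layer). A simplified sketch of building such optimizer parameter groups is shown below (the actual layer grouping in ITRA may differ):
def layer_wise_lr_groups(named_layers, base_lr, layer_decay):
    """named_layers: list of (name, params) ordered from the input layer to the output layer.
    The top (output) layer gets base_lr; every earlier layer is scaled down by layer_decay."""
    num_layers = len(named_layers)
    groups = []
    for depth, (name, params) in enumerate(named_layers):
        scale = layer_decay ** (num_layers - 1 - depth)
        groups.append({'params': params, 'lr': base_lr * scale, 'name': name})
    return groups

# e.g., pass the groups to torch.optim.AdamW(layer_wise_lr_groups(layers, 7e-5, 0.9), weight_decay=1.0)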
Exponential Moving Average (EMA)
--model_ema --model_ema_decay 0.998 \
for model_ema_decay in 0.99999 0.9999 0.9995 0.999 0.995 0.99 0.95 0.9 0.8;
do
torchrun --nproc_per_node 8 -m training.main \
--train-data 'mscoco_captions' --retrieval-data 'mscoco_captions' \
--retrieval-frequency 1 --datasets-dir '/data/Datasets' \
--epochs 15 --save-frequency 0 --batch-size 224 --workers 2 \
--lr 7e-5 --warmup 100 --weight_decay 1.0 --max-grad-norm 5 \
--image-model 'RN50' --image-model-builder 'openclip' --text-model 'RN50' --text-model-builder 'openclip'\
--pretrained-image-model --pretrained-text-model --lock-image-model \
--lock-text-partial 'positional_embedding,token_embedding' \
--lock-image-partial '!attnpool,!layer4' \
--loss 'InfoNCE' \
--model_ema --model_ema_decay $model_ema_decay \
--report-to tensorboard --logs 'logs/MSCOCO-RN50-EMA' --name 'model_ema_decay='$model_ema_decay;
done
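Model EMA keeps a shadow copy of the weights that is updated as an exponential moving average after every optimizer step. A minimal sketch is given below (details such as whether buffers are also averaged may differ from ITRA's actual implementation):
import copy
import torch

class ModelEma:
    """Shadow copy of a model, updated as an exponential moving average of its weights."""

    def __init__(self, model, decay=0.998):
        self.ema = copy.deepcopy(model).eval()
        self.decay = decay
        for p in self.ema.parameters():
            p.requires_grad_(False)

    @torch.no_grad()
    def update(self, model):
        # ema = decay * ema + (1 - decay) * current
        for ema_p, p in zip(self.ema.parameters(), model.parameters()):
            ema_p.mul_(self.decay).add_(p, alpha=1.0 - self.decay)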
Wise-FT: Evaluate the Model with a Weight-space Ensemble
--eval-with-wise-ft 0.5 \
for alpha in 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0 ;
do
python itra/training/main.py \
--zeroshot-frequency 1 --retrieval-frequency 1 --retrieval-data 'mscoco_captions' --datasets-dir '/data/Datasets' \
--image-model 'RN50' --image-model-builder 'openclip' \
--text-model 'RN50' --text-model-builder 'openclip' \
--pretrained-image-model --pretrained-text-model \
--resume 'logs/MSCOCO-RN50-partial/save-lock-image(!attnpool,!layer4)-lock-text(positional_embedding,token_embedding)-bs1792-lr7e-5/checkpoints/epoch_15.pt' \
--eval-with-wise-ft $alpha \
--logs 'logs/MSCOCO-RN50-WiseFT' --name 'zs+retrieval-WiseFT='$alpha;
done
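Wise-FT evaluates a linear interpolation between the zero-shot and fine-tuned weights in weight space. A minimal sketch of the interpolation is given below (assuming floating-point parameters and compatible state dicts; the exact meaning of the alpha endpoints in --eval-with-wise-ft follows the codebase):
def wise_ft_state_dict(zeroshot_sd, finetuned_sd, alpha):
    """Interpolate two compatible state dicts in weight space:
    alpha=0 returns the zero-shot weights, alpha=1 the fine-tuned weights."""
    return {key: (1 - alpha) * zeroshot_sd[key] + alpha * finetuned_sd[key]
            for key in zeroshot_sd}

# model.load_state_dict(wise_ft_state_dict(zs_sd, ft_sd, alpha=0.5))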
RSICD Retrieval
# 1x2080ti machine
torchrun --nproc_per_node 8 -m training.main \
--train-data '/data/Datasets/RSICD/csv/rsicd_train.csv' --images-dir '/data/Datasets/RSICD/RSICD_images/RSICD_images' \
--csv-separator '\t' --csv-img-key 'filename' --csv-caption-key 'title' \
--retrieval-data '/data/Datasets/RSICD/csv/rsicd_test.csv' --retrieval-images-dir '/data/Datasets/RSICD/RSICD_images/RSICD_images' \
--retrieval-csv-separator '\t' --retrieval-csv-img-key 'filename' --retrieval-csv-caption-key 'title' \
--retrieval-frequency 1 --datasets-dir '/data/Datasets' \
--epochs 30 --save-frequency 0 --batch-size 16 --workers 2 \
--lr 1e-6 --warmup 100 --weight_decay 0.5 --max-grad-norm 5 \
--image-model 'ViT-L-14-336' --image-model-builder 'openclip' \
--text-model 'ViT-L-14-336' --text-model-builder 'openclip' \
--pretrained-image-model --pretrained-text-model \
--lock-image-model --lock-text-model \
--lock-image-partial '!ln_post,!resblocks.23,!resblocks.22,!resblocks.21,!resblocks.20,!resblocks.19,!resblocks.18' \
--lock-text-partial '!text_projection,!ln_final,!resblocks.11,!resblocks.10,!resblocks.9' \
--loss 'InfoNCE' --layer_decay_image 0.9 --layer_decay_text 0.9 \
--report-to tensorboard --logs 'logs/RSICD-ViT-L-14' --name '30ep-b128-lr1e-5-unlock-image-text-last0.75-lldr0.9'
python itra/training/main.py --config-yaml 'logs/params.yml' --name 'custom-name'
python itra/training/main.py --episode-size 10000 --train-data 'mscoco_captions' --retrieval-data 'mscoco_captions' --retrieval-frequency 1 --datasets-dir '/data/Datasets' --epochs 15 --save-frequency 0 --batch-size 100 --workers 2 --lr 1e-4 --warmup 100 --weight_decay 1.0 --max-grad-norm 5 --image-model 'RN50' --image-model-builder 'openclip' --text-model 'RN50' --text-model-builder 'openclip' --pretrained-image-model --pretrained-text-model --lock-image-model --lock-text-model --loss 'InfoNCE' --prompt --n-prompt 4 --report-to tensorboard --logs 'logs/test' --name 'coco-finetune-nprompt-4'
Image Classification (UniCL)
UniCL: Unified Contrastive Learning in Image-Text-Label Space
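In UniCL, image-text-label contrastive learning treats every pair in the batch that shares a label as a positive. A rough sketch of the objective is shown below (simplified from the UniCL paper; not the exact implementation in this repo):
import torch.nn.functional as F

def unicl_loss(image_features, text_features, labels, logit_scale):
    """Bidirectional contrastive loss where all samples sharing a label are positives."""
    image_features = F.normalize(image_features, dim=-1)
    text_features = F.normalize(text_features, dim=-1)
    logits = logit_scale * image_features @ text_features.t()
    # targets[i, j] = 1 if sample i and j carry the same label (always true on the diagonal)
    targets = (labels.unsqueeze(0) == labels.unsqueeze(1)).float()
    targets = targets / targets.sum(dim=1, keepdim=True)
    loss_i2t = -(targets * F.log_softmax(logits, dim=1)).sum(dim=1).mean()
    loss_t2i = -(targets * F.log_softmax(logits.t(), dim=1)).sum(dim=1).mean()
    return (loss_i2t + loss_t2i) / 2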
Train an Image Classification Model From Scratch
Compare to MMClassification
# Single GPU classification
python itra/training/main.py \
--train-data 'CIFAR10' \
--linear-frequency 20 --zeroshot-frequency 20 --datasets-dir '/data/Datasets' \
--epochs 200 --save-frequency 0 --batch-size 128 --workers 4 \
--opt 'sgd' --lr 0.1 --warmup 100 --weight_decay 0.0001 \
--image-model 'resnet18' --image-model-builder 'torchvision' --image-resolution 32 --image-head-n-layers 1 \
--pretrained-text-model \
--text-model 'RN50' --text-model-builder 'openclip' --lock-text-model --text-head-n-layers 1 \
--loss 'CrossEntropy' --joint-projection-dim 10 \
--report-to tensorboard --logs 'logs/UniCL-Classification' --name 'resnet18(scratch)-CIFAR10-200ep-CrossEntropy+linear_eval'
# Single GPU classification
python itra/training/main.py \
--train-data 'CIFAR10' \
--linear-frequency 5 --zeroshot-frequency 5 --datasets-dir '/data/Datasets' \
--epochs 200 --save-frequency 0 --batch-size 128 --workers 4 \
--opt 'sgd' --lr 0.1 --warmup 100 --weight_decay 0.0001 \
--image-model 'resnet18' --image-model-builder 'torchvision' --image-resolution 32 --image-head-n-layers 1 \
--pretrained-text-model \
--text-model 'RN50' --text-model-builder 'openclip' --lock-text-model --text-head-n-layers 1 \
--loss 'InfoNCE' --joint-projection-dim 1024 \
--report-to tensorboard --logs 'logs/UniCL-Classification' --name 'resnet18(scratch)-CIFAR10-200ep-InfoNCE+linear_eval'
Fine-tuning CLIP for ImageNet Classification
Re-implement this paper.
python itra/training/main.py --train-data 'mscoco_captions' --retrieval-data 'mscoco_captions' --dataset-size 1000 --retrieval-frequency 1 --datasets-dir '/data/Datasets' --epochs 1 --save-frequency 1 --batch-size 32 --workers 2 --lr 1e-5 --warmup 100 --weight_decay 0.5 --max-grad-norm 5 --image-model 'RN50' --image-model-builder 'openclip' --text-model 'RN50' --text-model-builder 'openclip' --pretrained-image-model --pretrained-text-model --loss 'InfoNCE' --report-to tensorboard --logs 'logs/test' --name 'RN'
Language-to-vision Knowledge Distillation
Coming soon…
Vision-to-language Knowledge Distillation
Coming soon…
Todo
New features incoming!
Refactor main.py
Write help messages for arguments
Use YAML
- Project
install as package
Pypi package publishing
- Evaluation reports
zero-shot classification
linear/knn classification
clustering evaluation
SentEval
word embedding
MS Marco retrieval
Chinese CLIPs' Evaluation Reports (ImageNet-CN zero-shot, MS-COCO-CN retrieval)
- Implementations
UniCL-based image classification
Validate loss functions
Validate Adapters
SimCSE and PromptBERT re-implementation
Vision-to-language Knowledge Distillation
Language-to-vision Knowledge Distillation
Teacher selection based on Information Bottleneck Theory