Welcome to the documentation of ITRA! 🎈

ITRA (short for Image Text Representation Alignment) is a codebase for flexible and efficient vision-language learning. ITRA provides a unified interface for easy access to state-of-the-art pretrained models, adapters, and loss functions from various sources.

[Figure: pipeline.png — overview of the ITRA pipeline]

ITRA supports training, evaluation, and benchmarking on a rich variety of tasks, including zero-shot/k-NN/linear classification, retrieval, and word and sentence embedding evaluation. At the same time, ITRA is highly modular, extensible, and configurable, which facilitates future development and customization.

Important

ITRA is an ongoing project developed by the Artificial Intelligence of Multi-modality Group (AIM Group, https://multimodality.group) at Hohai University, led by Prof. Fan Liu. A temporary repository of the codebase is located at: https://github.com/ChenDelong1999/ITRA

[Figure: modular.png — modular design of ITRA]

Note

If you find any bugs or have any recommendations for building ITRA, please raise an issue in the repo, thanks~

About This Codebase

ITRA is a codebase for flexible and efficient Image Text Representation Alignment…

Model Builder

Training Objectives

  • CLIP: InfoNCE, ProtoCLIP

  • Self-supervised KD: RKD, SEED, CompRess, ProtoCPC, SimReg

  • VICReg, BarlowTwins, DINO

Downstream Evaluation

  • Image classification: zero-shot, linear/k-NN, and clustering evaluation (AMI, NMI) (from ProtoCLIP)

  • ELEVATER Image Classification Toolkit on 20 datasets

  • Image-text retrieval on MS-COCO dataset

  • Sentence embeddings (SentEval)

  • Passage retrieval on MS-MARCO and Wiki Sections

  • Word embeddings: RG65, Simlex999, WordSim353

  • Zero-shot VQA (TAP-C) and visual entailment

…

Change Log

V0.0.1

2023.01.xx

Initial internal release.

Install Dependencies

  • Create a conda environment and install PyTorch:

    conda create -n ITRA python=3.10.0
    conda activate ITRA
    

    This repo requires PyTorch (1.12) and torchvision (0.13). Please install them via the PyTorch official website.

    conda install pytorch==1.12.0 torchvision==0.13.0 torchaudio==0.12.0 cudatoolkit=10.2 -c pytorch
    
  • Clone this repo:

    # TODO: update repo name
    git clone https://github.com/ChenDelong1999/ITRA
    cd ITRA
    export PYTHONPATH="$PYTHONPATH:$PWD/itra"
    

    Note: If an import error occurs later, run export PYTHONPATH="$PYTHONPATH:$PWD/itra" again.

  • Install additional dependencies:

    conda install pillow pandas scikit-learn ftfy tqdm matplotlib 
    conda install -c huggingface transformers 
    conda install -c conda-forge sentence-transformers
    pip install adapter-transformers open_clip_torch pycocotools wandb timm clip-benchmark pyyaml
    
    # TODO: faiss-gpu does not support windows OS, maybe use pip install faiss instead?
    pip install faiss-gpu
    
    # ELEVATER requirements
    pip install yacs git+https://github.com/haotian-liu/CLIP_vlp.git vision-evaluation
    
    # TODO: remove nori dependency
    pip install nori2
    

Prepare Data

Image-text Pairs Dataset from CSV file

This codebase reads a CSV file (separated by \t) with two columns: a path to an image (filepath by default), and a text caption (title by default).

filepath             title
path/to/image.jpg    A very typical bus station
...                  ...

Specify --train-data 'path/to/your/csvfile.csv' to train a model on the dataset; specify --retrieval-data 'path/to/your/csvfile.csv' and set --retrieval-frequency > 0 to perform retrieval evaluation on the dataset.
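For reference, such a tab-separated CSV can be produced with pandas; a minimal sketch (the paths and captions below are placeholders):

import pandas as pd

# build a \t-separated CSV with the default column names 'filepath' and 'title'
df = pd.DataFrame({
    'filepath': ['path/to/image_0.jpg', 'path/to/image_1.jpg'],
    'title': ['A very typical bus station', 'Another caption'],
})
df.to_csv('path/to/your/csvfile.csv', sep='\t', index=False)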

The script itra/utils/gather_cc.py will collect the Conceptual Captions (CC3M) dataset. First, download the Conceptual Captions URLs from here, then run the following script:

python3 itra/utils/gather_cc.py path/to/Train_GCC-training.tsv

Note

As mentioned in our ProtoCLIP paper, the CC3M dataset was made public by Google in 2018, and the number of accessible images keeps dropping due to expired image links. This issue has been raised by several recent works. In this work, since we could only collect 2,643,718 images (concurrently with our ProtoCLIP, CyCLIP collected 2,631,703 images), we randomly sample a 2,500,000-image subset (75% of the full CC3M) from them to train our ProtoCLIP. Considering the dropping accessibility of image links in Conceptual Captions, we call for the use of this dataset size (2.5M) in future benchmarking for better comparability.

Important

The CC3M validation data required by OpenCLIP is not needed in this codebase. To perform retrieval evaluation, please use the --retrieval-data argument instead. The webdataset format is no longer supported in this codebase.

MS COCO Captions dataset

To use the MS COCO 2017 Captions dataset, download it to --datasets-dir and specify --train-data 'mscoco_captions' or --retrieval-data 'mscoco_captions'.

<--datasets-dir>
    └──coco2017
        ├── annotations
        ├── train2017 
        └── val2017 

The dataset contains 118k training images and 5k validation images, and each image has 4-5 captions. When using the training images, the total number of samples per epoch is set to 118k, and one caption is chosen at random in the __getitem__ function.
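The random caption selection can be pictured as follows (a minimal sketch with a made-up data structure, not the codebase's actual COCO dataset class; the entries are illustrative):

import random

# each training image maps to its 4-5 COCO captions (illustrative entries)
captions_per_image = {
    'train2017/000000000009.jpg': [
        'A plate of food with broccoli and meat.',
        'Closeup of bins of food that include broccoli and bread.',
    ],
}

def sample_caption(image_path: str) -> str:
    # __getitem__ picks one caption at random every time the image is fetched
    return random.choice(captions_per_image[image_path])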

Image Classification Dataset

Add your dataset to itra/data/classification_datasets.py and add its name (e.g., 'YourCustomDataset') to AVALIABLE_CLASSIFICATION_DATASETS. Then you can use the dataset via --train-data 'YourCustomDataset'. A hypothetical sketch is shown below.
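A hypothetical sketch of what such a registration could look like (the actual structure of classification_datasets.py may differ; check the file itself for the real mechanism):

# hypothetical sketch -- see itra/data/classification_datasets.py for the real registration mechanism
from torchvision.datasets import ImageFolder

class YourCustomDataset(ImageFolder):
    """A simple ImageFolder-style dataset whose folder names serve as class names."""
    def __init__(self, root, transform=None):
        super().__init__(root=root, transform=transform)
        # human-readable class names are useful for zero-shot prompt templates
        self.classes = [c.replace('_', ' ') for c in self.classes]

# AVALIABLE_CLASSIFICATION_DATASETS += ['YourCustomDataset']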

SentEval Datasets

The SentEval evaluation code is adapted from SimCSE.

cd <--dataset-dir>
wget https://huggingface.co/datasets/princeton-nlp/datasets-for-simcse/resolve/main/senteval.tar
tar xvf senteval.tar

Todo

  • MS MARCO

  • wiki sections

ELEVATER Image Classification Datasets

The ELEVATER Image Classification Toolkit (Elevater_Toolkit_IC) implements standardized evaluations of vision-language models. It covers zero-shot classification, few-/full-shot linear probing, and full fine-tuning on 20 datasets. See the paper “ELEVATER: A Benchmark and Toolkit for Evaluating Language-Augmented Visual Models, NeurIPS 2022 Datasets and Benchmarks Track” for more details.

We have included Elevater_Toolkit_IC in our codebase (in itra/evaluation/vision_benchmark) and registered new models (clip_zeroshot_eval.py and cls_linear_or_ft_eval.py) following the official instructions. To ensure compatibility, we have made some modifications to the official Elevater_Toolkit_IC code at commit 9d39620, so DO NOT install Elevater_Toolkit_IC in the environment used for this codebase.

To get started, first download all datasets following this repo. The downloaded datasets take about 41 GB of storage, and the folder structure should be:

.../datasets
└── classification
    ├── caltech_101_20211007
    │   ├── labels.txt
    │   ├── test.txt
    │   ├── test.zip
    │   ├── train.txt
    │   └── train.zip
    ├── cifar100_20200721
    │   ├── labels.txt
    │   ├── test_images.txt
    │   ├── test_images.zip
    │   ├── train_images.txt
    │   └── train_images.zip
    ...
    └── voc2007_20211007
        ├── labels.txt
        ├── test_ic.txt
        ├── test.zip
        ├── train_ic.txt
        ├── train.zip
        └── val_ic.txt

21 directories, 115 files

NORI Datasets on OSS (for Megvii Users)

  • To use Conceptual Captions 3M: --train-data 's3://chendelonghahab/datasets/ConceptualCaption3M/nori_CC2716261.csv'

# Nori Speed-up Commands
nori speedup 's3://chendelong/datasets/ConceptualCaption3M/CC_3M.nori' --on --replica=2
nori speedup 's3://chendelonghahab/datasets/ConceptualCaption3M/CC2.6M-CC2M.nori/' --on --replica=2
  • To use YFCC-14M: --train-data 's3://chendelonghahab/datasets/YFCC/YFCC_cleaned_nori.csv'

# zsh
# Nori Speed-up Commands
for ((i=0;i<=100;i++)) {
    echo 'Processing nori part '$i'/100...'
    nori speedup 's3://yzq/mmsl_datasets/YFCC15M/yfcc15m_'$i'.nori' --on --replica=2
}

Load Pretrained Multi-modal Weights

From OpenCLIP

OpenCLIP (v2.0.2) is an open-source implementation of OpenAI’s CLIP (Contrastive Language-Image Pre-training). To check all supported model architectures and pre-trained weights, run:

import open_clip
open_clip.list_pretrained()
# [('RN50', 'openai'), ('RN50', 'yfcc15m'), ('RN50', 'cc12m'), ('RN50-quickgelu', 'openai'), ('RN50-quickgelu', 'yfcc15m'), ('RN50-quickgelu', 'cc12m'), ('RN101', 'openai'), ('RN101', 'yfcc15m'), ('RN101-quickgelu', 'openai'), ('RN101-quickgelu', 'yfcc15m'), ('RN50x4', 'openai'), ('RN50x16', 'openai'), ('RN50x64', 'openai'), ('ViT-B-32', 'openai'), ('ViT-B-32', 'laion400m_e31'), ('ViT-B-32', 'laion400m_e32'), ('ViT-B-32', 'laion2b_e16'), ('ViT-B-32', 'laion2b_s34b_b79k'), ('ViT-B-32-quickgelu', 'openai'), ('ViT-B-32-quickgelu', 'laion400m_e31'), ('ViT-B-32-quickgelu', 'laion400m_e32'), ('ViT-B-16', 'openai'), ('ViT-B-16', 'laion400m_e31'), ('ViT-B-16', 'laion400m_e32'), ('ViT-B-16-plus-240', 'laion400m_e31'), ('ViT-B-16-plus-240', 'laion400m_e32'), ('ViT-L-14', 'openai'), ('ViT-L-14', 'laion400m_e31'), ('ViT-L-14', 'laion400m_e32'), ('ViT-L-14', 'laion2b_s32b_b82k'), ('ViT-L-14-336', 'openai'), ('ViT-H-14', 'laion2b_s32b_b79k'), ('ViT-g-14', 'laion2b_s12b_b42k'), ('roberta-ViT-B-32', 'laion2b_s12b_b32k'), ('xlm-roberta-base-ViT-B-32', 'laion5b_s13b_b90k'), ('xlm-roberta-large-ViT-H-14', 'frozen_laion5b_s13b_b90k')]

To load the official pretrained CLIP (ResNet-50):

--image-model 'RN50' --image-model-builder 'openclip' \
--text-model 'RN50' --text-model-builder 'openclip' \
--pretrained-image-model --pretrained-text-model \

Optionally, you can load CLIP models pretrained by OpenCLIP instead of OpenAI by specifying --image-model-tag and --text-model-tag. For example, to load the ViT-H-14 pretrained on LAION-2B:

--image-model 'ViT-H-14' --image-model-builder 'openclip' --image-model-tag 'laion2b_s32b_b79k' \
--text-model 'ViT-H-14' --text-model-builder 'openclip'  --text-model-tag 'laion2b_s32b_b79k' \
--pretrained-image-model --pretrained-text-model \

From ChineseCLIP

ChineseCLIP (v1.4) is the Chinese version of CLIP, trained on a large-scale Chinese image-text pair dataset (~200M pairs). It aims to help users conveniently perform image representation generation, cross-modal retrieval, and zero-shot image classification on Chinese data, and it is based on the OpenCLIP project.

The ChineseCLIP models are also available on HuggingFace, but here we import the model via the cn_clip package for convenience, since its code is similar to OpenCLIP’s.

To list available models (please see Model Card provided by ChineseCLIP for more details):

from cn_clip.clip import available_models
available_models() 
# ['ViT-B-16', 'ViT-L-14', 'ViT-L-14-336', 'ViT-H-14', 'RN50']

To load a ChineseCLIP with ResNet-50:

--image-model 'RN50' --image-model-builder 'chineseclip' \
--text-model 'RN50' --text-model-builder 'chineseclip' \
--pretrained-image-model --pretrained-text-model \

From Taiyi-CLIP

Taiyi-CLIP (封神榜-太乙) employs chinese-roberta-wwm as the language encoder and the ViT-B-32 from CLIP as the vision encoder. The authors freeze the vision encoder and tune the language encoder to speed up and stabilize pre-training, and they use the Noah-Wukong dataset (100M) and the Zero dataset (23M) as pre-training data. See their documentation for details.

There are two CLIP models available via Taiyi-CLIP: Taiyi-CLIP-Roberta-102M-Chinese (doc) and Taiyi-CLIP-Roberta-large-326M-Chinese (doc). These two models are trained with Locked-image Tuning (LiT) on the ViT-B-32 and ViT-L-14 of OpenAI’s CLIP, respectively. Therefore, to load these models:

# Taiyi-CLIP-Roberta-102M-Chinese
--image-model 'ViT-B-32' --image-model-builder 'openclip' \
--text-model 'IDEA-CCNL/Taiyi-CLIP-Roberta-102M-Chinese' --text-model-builder 'huggingface' \
--pretrained-image-model --pretrained-text-model \

# Taiyi-CLIP-Roberta-large-326M-Chinese
--image-model 'ViT-L-14' --image-model-builder 'openclip' \
--text-model 'IDEA-CCNL/Taiyi-CLIP-Roberta-large-326M-Chinese' --text-model-builder 'huggingface' \
--pretrained-image-model --pretrained-text-model \

Load Pretrained Uni-modal Weights

Image Backbone

From Torchvision

To check all supported model architectures and pretrained weights, run the following command or see this page (v0.12).

import torchvision
torchvision.models.__dict__.keys()
--image-model-builder 'torchvision' --image-model 'resnet50' \
--image-model-builder 'torchvision' --image-model 'resnet50' --pretrained-image-model \
--image-model-builder 'torchvision' --image-model 'alexnet' \
--image-model-builder 'torchvision' --image-model 'convnext_tiny' \
--image-model-builder 'torchvision' --image-model 'wide_resnet50_2' \
--image-model-builder 'torchvision' --image-model 'vgg11' \
--image-model-builder 'torchvision' --image-model 'squeezenet1_0' \
--image-model-builder 'torchvision' --image-model 'inception_v3' \
--image-model-builder 'torchvision' --image-model 'mobilenet_v3_small' \
--image-model-builder 'torchvision' --image-model 'mnasnet0_5' \
--image-model-builder 'torchvision' --image-model 'shufflenet_v2_x0_5' \
--image-model-builder 'torchvision' --image-model 'efficientnet_b0' \
--image-model-builder 'torchvision' --image-model 'regnet_y_400mf' \
--image-model-builder 'torchvision' --image-model 'vit_b_16' \

From Torch Hub

import torch
for github in ['swav', 'dino', 'vicreg', 'barlowtwins', 'swag', 'deit']:
    print(f'{github}:\t', torch.hub.list(f'facebookresearch/{github}'))
--image-model-builder 'torchhub' --image-model 'resnet50' --image-model-tag 'facebookresearch/swav:main' \
--image-model-builder 'torchhub' --image-model 'dino_vits16' --image-model-tag 'facebookresearch/dino:main' \
--image-model-builder 'torchhub' --image-model 'resnet50' --image-model-tag 'facebookresearch/vicreg:main' \
--image-model-builder 'torchhub' --image-model 'resnet50' --image-model-tag 'facebookresearch/barlowtwins:main' \
--image-model-builder 'torchhub' --image-model 'regnety_16gf' --image-model-tag 'facebookresearch/swag:main' \
...

# https://github.com/facebookresearch/VICRegL
import torch
model = torch.hub.load('facebookresearch/vicregl:main', 'resnet50_alpha0p9')
model = torch.hub.load('facebookresearch/vicregl:main', 'resnet50_alpha0p75')
model = torch.hub.load('facebookresearch/vicregl:main', 'convnext_small_alpha0p9')
model = torch.hub.load('facebookresearch/vicregl:main', 'convnext_small_alpha0p75')
model = torch.hub.load('facebookresearch/vicregl:main', 'convnext_base_alpha0p9')
model = torch.hub.load('facebookresearch/vicregl:main', 'convnext_base_alpha0p75')
model = torch.hub.load('facebookresearch/vicregl:main', 'convnext_xlarge_alpha0p75')

For more details, see:

  • https://github.com/facebookresearch/swav

  • https://github.com/facebookresearch/dino

  • https://github.com/facebookresearch/vicreg

  • https://github.com/facebookresearch/barlowtwins

  • https://github.com/facebookresearch/SWAG

  • https://github.com/facebookresearch/deit/blob/main/README_deit.md


Text Backbone

From HuggingFace🤗Transformers

For more details, see HuggingFace Transformers. Currently, only the ‘from pretrained’ mode is supported (i.e., you cannot train a HuggingFace transformer from scratch for now). Standard models like BERT/RoBERTa are supported, but support for other architectures has not been verified.
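For example, a BERT text backbone could be loaded with flags like the following (illustrative only, following the argument pattern used elsewhere in this documentation; 'bert-base-uncased' is just a common HuggingFace checkpoint):

--text-model-builder 'huggingface' --text-model 'bert-base-uncased' --pretrained-text-model \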

From Sentence Transformers

The Sentence Transformers library provides powerful sentence embeddings; please see its pretrained models page for more details. Loading Sentence Transformers models via the huggingface builder with --text-pooler='mean' is recommended, though loading them via the sbert builder is also supported:

# recommended: 
--text-model-builder 'huggingface'  --text-model 'sentence-transformers/all-mpnet-base-v2' --text-pooler='mean' 
# not recommended:
--text-model-builder 'sbert'  --text-model 'all-mpnet-base-v2' 

However, it seems that word embedding models (GloVe and Komninos) in sentence-transformers cannot be loaded via huggingface.

Custom Training Data

Episodic Training

--dataset-size 14000000 --episode-size 4000000 --train-data 'cache/yfcc_nori.csv' --nori-dataset\
--epochs 28 --save-frequency 28 --batch-size 64 --workers 8 \
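As we read the flags above, episodic training draws a fresh random subset of --episode-size samples from the full --dataset-size pool for every epoch, so each "episode" sees different data. A minimal sketch of that sampling (our own illustration, not the codebase's actual sampler):

import numpy as np

dataset_size = 14_000_000   # --dataset-size: total pairs available in the CSV
episode_size = 4_000_000    # --episode-size: pairs actually visited per epoch

def sample_episode(rng: np.random.Generator) -> np.ndarray:
    # indices of the samples used for one training episode/epoch
    return rng.choice(dataset_size, size=episode_size, replace=False)

rng = np.random.default_rng(seed=0)
epoch_indices = sample_episode(rng)   # e.g. wrap with torch.utils.data.Subset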

Combining Multiple Datasets

…

(weighting strategy…)

Loss Functions

Loss | Original Task | Paper | Source Implementation
InfoNCE | Alignment | Learning Transferable Visual Models From Natural Language Supervision | https://github.com/mlfoundations/open_clip/blob/main/src/open_clip/loss.py#L65
SimReg | KD | SimReg: Regression as a Simple Yet Effective Tool for Self-supervised Knowledge Distillation | https://github.com/UCDvision/simreg/blob/main/simreg.py#L122
RKD | KD | Relational Knowledge Distillation | https://github.com/lenscloth/RKD/blob/master/metric/loss.py#L136
CompRess-1q | KD | CompRess: Self-Supervised Learning by Compressing Representations | https://github.com/UMBCvision/CompRess/blob/master/nn/compress_loss.py#L67
CompRess-2q | KD | CompRess: Self-Supervised Learning by Compressing Representations | https://github.com/UMBCvision/CompRess/blob/master/nn/compress_loss.py#L89
SEED | KD | SEED: Self-supervised Distillation For Visual Representation | https://github.com/jacobswan1/SEED/blob/master/tools/utils.py#L188
VICReg | SSL | VICReg: Variance-Invariance-Covariance Regularization For Self-Supervised Learning | https://github.com/facebookresearch/vicreg/blob/main/main_vicreg.py#L184
BarlowTwins | SSL | Barlow Twins: Self-Supervised Learning via Redundancy Reduction | https://github.com/facebookresearch/barlowtwins/blob/main/main.py#L187
DINO | SSL | Emerging Properties in Self-Supervised Vision Transformers | https://github.com/facebookresearch/dino/blob/main/main_dino.py#L363
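For illustration, the simplest loss in the table above, InfoNCE, can be sketched as follows (a minimal symmetric version; see the linked OpenCLIP source for the reference implementation, including distributed feature gathering):

import torch
import torch.nn.functional as F

def info_nce(image_features: torch.Tensor, text_features: torch.Tensor, logit_scale: float = 100.0):
    # contrast every image with every text in the batch; matching pairs lie on the diagonal
    image_features = F.normalize(image_features, dim=-1)
    text_features = F.normalize(text_features, dim=-1)
    logits_per_image = logit_scale * image_features @ text_features.t()
    logits_per_text = logits_per_image.t()
    labels = torch.arange(image_features.shape[0], device=image_features.device)
    return (F.cross_entropy(logits_per_image, labels) + F.cross_entropy(logits_per_text, labels)) / 2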

Use Adapters

The Adapter-Transformers library enables delta-tuning of popular HuggingFace transformers. See the Model Overview for available adapter methods, and see the Docs and AdapterHub for more details.

We have made the following adapters available in this codebase:

Adapter | args.adapter | Params (M) | Params (%) | STS Benchmark | ImageNet Zero-shot Accuracy | MSCOCO Retrieval Mean Recall
Compacter | dummy | 0.06 | 0.05% | 0.7474 | 24.48 | 38.73
(IA)^3 | ia3_adapter | 0.06 | 0.05% | 0.6576 | 19.23 | 31.90
LoRA | lora_adapter | 0.30 | 0.27% | 0.7514 | 25.02 | 40.58
Bottleneck adapters | bottleneck_adapter | 1.79 | 1.61% | 0.7449 | 26.15 | 41.85
Language Adapters | lang_adapter | 1.19 | 1.08% | 0.7405 | 26.71 | 42.39
Prefix Tuning | prefix_tuning | 9.88 | 8.28% | 0.7303 | 26.00 | 41.31
UniPELT | unipelt | 11.09 | 9.20% | 0.7441 | 26.89 | 43.45
Mix-and-Match Adapters | mam_adapter | 22.50 | 17.05% | 0.7503 | 29.61 | 45.82
  • Projection Head Adapters

    • Linear projection head

    • DINO MLP Head (optionally with a prototype layer in the last)

Freeze Model Parameters During Training

# lock the image tower, i.e., Locked-image Tuning (LiT) https://arxiv.org/abs/2111.07991
--lock-image-model \

# lock all parameters whose name contains 'weight' in the image tower
--lock-image-partial 'weight' \

# unlock only the parameters whose name contains 'weight' in the image tower; all other image-tower parameters stay locked
--lock-image-partial '!weight' --lock-image-model \

# only train the first layer (transformer block) of the image backbone
--lock-image-partial '!resblocks.0'  --lock-image-model \

# only unfreeze bias and normalization parameters, i.e., Bias and Normalization Optimization (BiNor) https://arxiv.org/abs/2203.07190
--lock-image-partial '!bias,!ln,!bn' --lock-text-partial '!bias,!ln' --lock-image-model  --lock-text-model \
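The --lock-*-partial flags above take comma-separated name fragments, with a '!' prefix meaning "keep trainable". A hypothetical sketch of how such name matching could be applied (not the codebase's actual implementation):

import torch.nn as nn

def apply_partial_lock(model: nn.Module, patterns: str = '', lock_model: bool = False):
    fragments = [p for p in patterns.split(',') if p]
    keep_trainable = [p[1:] for p in fragments if p.startswith('!')]
    to_lock = [p for p in fragments if not p.startswith('!')]
    for name, param in model.named_parameters():
        if lock_model:
            # the whole tower is frozen except parameters matching a '!' fragment
            param.requires_grad = any(frag in name for frag in keep_trainable)
        else:
            # only parameters matching a plain fragment are frozen
            if any(frag in name for frag in to_lock):
                param.requires_grad = False

# e.g. BiNor on the image tower: apply_partial_lock(image_tower, '!bias,!ln,!bn', lock_model=True)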

Evaluate Pretrained Models

Zero-shot Image-text Retrieval

CLIP is a strong model for zero-shot image-text retrieval. Since the official paper only reports the performance of the largest CLIP, ViT-L-14-336 (the standard 32-epoch pretraining plus an additional pretraining epoch at 336x336 resolution), we present our evaluation of the other CLIP architectures here. See the Papers with Code leaderboard for comparisons with other zero-shot retrieval methods.

Backbone | # Params all (M) | # Params image (M) | # Params text (M) | I2T R@1 | I2T R@5 | I2T R@10 | T2I R@1 | T2I R@5 | T2I R@10 | Mean Recall
RN50 | 102.01 | 38.32 | 63.69 | 48.06 | 73.88 | 83.02 | 28.31 | 52.96 | 64.10 | 58.39
RN101 | 119.69 | 56.26 | 63.43 | 49.80 | 74.42 | 82.72 | 30.18 | 54.15 | 65.28 | 59.43
RN50x16 | 290.98 | 167.33 | 123.65 | 55.38 | 78.24 | 86.30 | 35.24 | 59.47 | 69.58 | 64.04
ViT-B-32 | 151.28 | 87.85 | 63.43 | 50.02 | 75.00 | 83.24 | 30.36 | 54.77 | 66.09 | 59.91
ViT-B-16 | 149.62 | 86.19 | 63.43 | 51.72 | 76.76 | 84.26 | 32.70 | 57.77 | 68.26 | 61.91
ViT-L-14 | 427.94 | 304.29 | 123.65 | 56.08 | 79.60 | 86.90 | 35.33 | 59.96 | 70.15 | 64.67
ViT-L-14-336 | 427.94 | 304.29 | 123.65 | 57.46 | 80.34 | 87.58 | 36.09 | 60.66 | 70.76 | 65.48
ViT-L-14-336 (official) | 427.94 | 304.29 | 123.65 | 58.4 | 81.5 | 88.1 | 37.8 | 62.4 | 72.2 | 66.73

For ViT-L-14-336, there is a small gap between our evaluation and the officially reported results. We suspect it is caused by image pre-processing: the re-implementations above use the default Resize transform from the official CLIP repo, and since COCO images are mostly not square, this creates a small train-test domain gap due to distortion. If we instead use a ResizeMaxSize as implemented here, the results surpass the officially reported performance.

Backbone | Pre-process | I2T R@1 | I2T R@5 | I2T R@10 | T2I R@1 | T2I R@5 | T2I R@10 | Mean Recall
ViT-L-14-336 | Resize | 57.46 | 80.34 | 87.58 | 36.09 | 60.66 | 70.76 | 65.48
ViT-L-14-336 | Official (unknown) | 58.4 | 81.5 | 88.1 | 37.8 | 62.4 | 72.2 | 66.73
ViT-L-14-336 | ResizeMaxSize | 59.20 | 81.70 | 87.96 | 39.02 | 63.86 | 73.52 | 67.54

Changing Resize to ResizeMaxSize brings a +2.06 improvement for ViT-L-14-336. However, we find that the benefit of this modification is not consistent across backbones. As shown in the following table, ResizeMaxSize is generally more beneficial for larger models, especially models trained to process high-resolution images (e.g., it helps ViT-L-14-336 considerably but ViT-L-14 much less).

Backbone | RN50 | RN101 | RN50x16 | ViT-B-32 | ViT-B-16 | ViT-L-14 | ViT-L-14-336
Mean recall improvement with ResizeMaxSize | +0.45 | -0.13 | +0.10 | -0.74 | +0.83 | +0.96 | +2.06

Therefore, to keep it simple, we will use the default Resize transform in the following experiments.
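If you nevertheless want to experiment with an aspect-ratio-preserving pre-process, a rough approximation of a ResizeMaxSize-style transform looks like this (our own sketch, not the exact open_clip implementation):

from PIL import Image
import torchvision.transforms.functional as TF

def resize_max_size(img: Image.Image, size: int = 336, fill: int = 0) -> Image.Image:
    # scale so the longer side equals `size`, then pad the shorter side to a square
    w, h = img.size
    scale = size / max(w, h)
    new_w, new_h = round(w * scale), round(h * scale)
    img = TF.resize(img, [new_h, new_w])
    pad_w, pad_h = size - new_w, size - new_h
    # padding order: (left, top, right, bottom)
    return TF.pad(img, [pad_w // 2, pad_h // 2, pad_w - pad_w // 2, pad_h - pad_h // 2], fill=fill)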

# 1x2080ti machine
python itra/training/main.py \
    --linear-frequency 0  --zeroshot-frequency 0 --retrieval-frequency 0  --nlp-eval-frequency 1 --datasets-dir '/data/Datasets' \
    --retrieval-data 'mscoco_captions' \
    --image-model 'RN50' --image-model-builder 'openclip'  \
    --text-model 'RN50' --text-model-builder 'openclip'  \
    --pretrained-image-model --pretrained-text-model \
    --logs 'logs/MSCOCO-zeroshot'  --name 'RN50-openclip-zeroshot-retrieval'
    
    

Coming soon…

Zero-shot Image Classification

Coming soon…

Linear Probing and k-NN Classification

Coming soon…

Clustering Evaluation

Coming soon…

Sentence Embedding Evaluation

STS-Benchmark, SICK…

MS MARCO Passage Retrieval…

Word embeddings…

Coming soon…

ELEVATER Image Classification Benchmark

You can perform ELEVATER evaluations of models trained with this codebase by making the necessary modifications and running the following commands:

conda activate vlkd
cd /data/codes/ProtoRKD 
export PYTHONPATH="$PWD/src/training/evaluations:$PWD/src"

# zero-shot:       model_cfg='clip_zeroshot_eval'      mode='zeroshot'\
# few-shot:        model_cfg='cls_linear_or_ft_eval'   mode='linear_probe' num_shots=5 \
# linear prob:     model_cfg='cls_linear_or_ft_eval'   mode='linear_probe' num_shots=-1 \
# fine-tune:       model_cfg='cls_linear_or_ft_eval'   mode='finetune'     num_shots=-1 \

for dataset (caltech101 cifar10 cifar100 country211 dtd eurosat-clip fer2013 fgvc-aircraft-2013b flower102 food101 gtsrb hateful-memes kitti-distance mnist oxford-iiit-pets patchcamelyon rendered-sst2 resisc45-clip stanfordcar voc2007classification)
{       
    #---> REPLACE THIS LINE WITH ONE OF FOUR OPTIONS ABOVE <---#
    log_dir=# <YOUR EXPERIMENT DIR> \
    ckpt_epoch=# <WHICH EPOCH> \
    dataset_root=# <YOUR DATASET DIR> \
    dataset=$dataset \
    disable_hyperparameter_tuning=True \
        bash run_evevater_eval.sh
}

For example:

conda activate vlkd
cd /data/codes/ProtoRKD 
export PYTHONPATH="$PWD/src/training/evaluations:$PWD/src"

for dataset (caltech101 cifar10 cifar100 country211 dtd eurosat-clip fer2013 fgvc-aircraft-2013b flower102 food101 gtsrb hateful-memes kitti-distance mnist oxford-iiit-pets patchcamelyon rendered-sst2 resisc45-clip stanfordcar voc2007classification)
{       
    model_cfg='cls_linear_or_ft_eval'   mode='finetune'     num_shots=-1 \
    log_dir='/data/codes/ProtoRKD/logs/codebase_test/U[mobilenet_v3_large-h2]-L[CLIP-from-RN50]-bs1024-YFCC-56ep-lr1e-5' \
    ckpt_epoch=56 \
    dataset=$dataset \
    disable_hyperparameter_tuning=True \
    dataset_root='/data/codes/ProtoRKD/src/training/evaluations/vision_benchmark/outputs/datasets'\
        bash run_evevater_eval.sh
}

Then you can generate a submission file for EvalAI. For more details, please see the official instructions.

python src/training/evaluations/vision_benchmark/commands/prepare_submit.py \
  --combine_path 'logs/codebase_test/L[mobilenet_v3_small-h2]-L[CLIP-from-RN50]-bs1024-YFCC-8ep/clip_zeroshot_eval/log/predictions/zeroshot_eval_wiki_False_wnh_False_wnd_False_gpt3_Falseagg_WIKI_AND_GPT3_gpt3count_0'

We provide a simple script to summarize the results:

python src/utils/summarize_ELEVATER_results.py
Input your log dir (end with "../ELEVATER_evaluation/<eval_mode>"):
>>> logs/U[mobilenet_v3_large-h2]-L[CLIP-from-RN50]-bs1024-YFCC-56ep-lr1e-5/ELEVATER_evaluation/zeroshot
                           Dsataset  zeroshot-accuracy%
0                       caltech-101             70.4490
1                          cifar-10             72.8000
2                         cifar-100             37.1700
3                        country211              7.0570
4                               dtd             31.5430
5                      eurosat_clip             25.3000
6                          fer-2013             21.8170
7   fgvc-aircraft-2013b-variants102              5.1620
8                 oxford-flower-102             45.4590
9                          food-101             40.3290
10                            gtsrb              8.8600
11                    hateful-memes             52.4110
12                   kitti-distance             14.3460
13                            mnist             11.0400
14                 oxford-iiit-pets             65.2600
15                   patch-camelyon             50.7600
16                    rendered-sst2             47.8860
17                    resisc45_clip             23.2740
18                    stanford-cars              5.0990
19          voc-2007-classification             77.5720
20                          Average             35.6797
saved to logs/U[mobilenet_v3_large-h2]-L[CLIP-from-RN50]-bs1024-YFCC-56ep-lr1e-5/ELEVATER_evaluation/zeroshot/summary.csv

CLIP Pretraining

First, make sure that you have created an environment with the required dependencies and prepared the data for pre-training and downstream evaluations.

Then activate the environment and modify the PYTHONPATH variable so that modules can be imported successfully.

conda activate ITRA
export PYTHONPATH="$PYTHONPATH:$PWD/itra"

Standard Contrastive Language Image Pretraining From Scratch

Training a CLIP from scratch is the most straightforward usage of ITRA. By specifying --loss 'InfoNCE', the model contrasts image and text samples within each batch.

# Example command for a 8x2080ti machine
torchrun --nproc_per_node 8 -m training.main \
    --dataset-size 14000000 --episode-size 14000000 --train-data 'cache/yfcc_nori.csv' --nori-dataset\
    --epochs 8 --save-frequency 8 --batch-size 64 --workers 8 \
    --lr 5e-4 --warmup 2000 --wd 0.5 --max-grad-norm 5 \
    --image-model 'RN50' --image-model-builder 'openclip' --text-model 'RN50' --text-model-builder 'openclip'\
    --loss 'InfoNCE' \
    --report-to tensorboard --logs 'logs/example-usage/clip-pretraining/YFCC14M-8_epoch-RN50'

Train a Tiny CLIP

  • AlexNet, MobileNet?

  • Small SBERT?

  • GloVe Embeddings?

Fine-tuning CLIP for MS-COCO Retrieval

In this section, we present an example and some empirical guidance for fine-tuning CLIP for image-text retrieval. We aim to improve retrieval performance over the strong zero-shot retrieval ability of CLIP (see our evaluation report) by fine-tuning CLIP on the MS COCO Captions training set (118k images) with the InfoNCE loss. The contents and key findings of this section are as follows:

  • Fine-tuning CLIP on MS COCO training set improves the retrieval mean recall by +15% compared to raw zero-shot retrieval.

  • Proper hyper-parameters can bring at least +1% improvement.

  • Scaling up the batch size by partially freezing CLIP weights brings a further +1% improvement.

  • Compared to the 58.39% zero-shot retrieval mean recall of the RN50 CLIP, we finally achieve 76.02% mean recall (a 17.63% improvement) by fine-tuning it on an 8x2080ti machine.

Getting Started: Naive Fine-tuning Baseline

First, make sure that you have created an environment with the required dependencies and prepared the CSV datasets for pre-training and downstream evaluations. Then activate the environment and modify the PYTHONPATH variable so that modules can be imported successfully.

conda activate ITRA
cd path/to/ITRA/
export PYTHONPATH="$PYTHONPATH:$PWD/itra"

Then we can start to fine-tune a CLIP on the MS-COCO Captions 2017 training set (118k images). The results should be compared with the Papers with Code leaderboard. Our baseline settings are listed below; we use a single-node machine with 8 NVIDIA GeForce 2080ti GPUs for training, and one training epoch takes about 3.5 minutes.

  • backbone: ResNet50

  • batch_size: 32x8=256

  • dataset_size: 118287

  • epochs: 10

  • lr: 1e-05

  • opt: adamw

  • use_bn_sync: False

  • warmup: 100

  • weight_decay: 0.5

Training Command
torchrun --nproc_per_node 8 -m training.main \
    --train-data 'mscoco_captions' --retrieval-data 'mscoco_captions' \
    --retrieval-frequency 1 --datasets-dir '/data/Datasets' \
    --epochs 10 --save-frequency 0 --batch-size 32 --workers 2 \
    --lr 1e-5 --warmup 100 --weight_decay 0.5 --max-grad-norm 5 \
    --image-model 'RN50' --image-model-builder 'openclip' --text-model 'RN50' --text-model-builder 'openclip'\
    --pretrained-image-model --pretrained-text-model \
    --loss 'InfoNCE' \
    --report-to tensorboard --logs 'logs/MSCOCO-RN50'  --name '10ep-bs256-lr1e-5-wd0.5'

Under this configuration, fine-tuning significantly improves the retrieval performance (58.39→73.98, +15.59).

Type | Model | # Params (M) | I2T R@1 | I2T R@5 | I2T R@10 | T2I R@1 | T2I R@5 | T2I R@10 | Mean Recall
Two-stream | Zero-shot CLIP RN50 | 102.01 | 48.06 | 73.88 | 83.02 | 28.31 | 52.96 | 64.1 | 58.39
Two-stream | 👉 Fine-tuned CLIP RN50 | 102.01 | 64.84 | 86.62 | 92.3 | 44.99 | 72.76 | 82.34 | 73.98
Two-stream | FLIP (ViT-L-14) | 427.94 | 78.9 | 94.4 | 97.4 | 61.2 | 84.3 | 90.6 | 84.5
Two-stream | Florence (CoSwin-H) | 637 | 81.8 | 95.2 | - | 63.2 | 85.7 | - | -
Single-stream | BLIP (large) | 220 | 80.6 | 95.2 | 97.6 | 63.1 | 85.3 | 91.1 | 85.5
Single-stream | PTP-BLIP (large) | 220 | 84.2 | 79.3 | 98.8 | 68.8 | 89.5 | 94.2 | 88.8

Note

👆 Here Florence and PTP-BLIP are the two-stream and single-stream SoTA retrieval methods, respectively, on the Papers with Code leaderboard as of 2022.12.


Tuning Hyper-parameters

1. Learning Rate. We vary the learning rate from 5e-6 to 1e-4 and find that 1e-5 and 2e-5 work well for a batch size of 256. These results confirm the observations in this paper, where the authors showed that good ImageNet fine-tuning of CLIP ViT-B-16 needs a fairly small learning rate (2e-5 and 3e-5 for a batch size of 2048).

Learning Rate | 5e-6 | 1e-5 | 2e-5 | 3e-5 | 5e-5 | 1e-4
Mean Recall | 72.91 | 73.98 | 73.97 | 73.32 | 72.46 | 69.34

2. Weight Decay. The authors of the SLIP paper observed that a larger weight decay (0.5) is beneficial for CLIP. Our experiments show that CLIP can also handle a very large weight decay (e.g., 2.50). Here the training data has only 118k samples, and we believe this property can further benefit CLIP fine-tuning when data is limited. As shown in the following table, CLIP is quite robust to weight decay: when varying the value from 0.01 to 2.50, the performance changes within a range of only ±0.43.

Weight Decay | 2.50 | 2.25 | 2.00 | 1.75 | 1.50 | 1.25 | 1.00 | 0.75 | 0.50 | 0.10 | 0.05 | 0.01
Mean Recall | 74.07 | 73.94 | 73.87 | 73.84 | 73.94 | 73.94 | 74.05 | 73.87 | 73.98 | 73.93 | 73.64 | 73.79

3. Training Length. Similar to the experiments in FLIP, our experiments show that simply scaling up the number of training epochs does not lead to further improvement. Training for only 5 or 10 epochs is not sufficient, but 15-20 epochs already reaches saturation.

Epochs | 5 | 10 | 15 | 20 | 30
Learning Rate=1e-5 | 72.66 | 73.98 | 74.43 | 74.45 | 73.96
Learning Rate=2e-5 | 72.86 | 73.97 | 74.28 | 74.02 | 74.03

4. Batch Size. It is well known that batch size has a crucial impact on contrastive learning methods. We confirm this point by varying the batch size from 32 to 800 (the maximum batch size for a ResNet-50 CLIP on an 8x2080ti machine) while adjusting the learning rate according to the linear scaling rule (a small example follows the table below). Scaling down the batch size leads to a significant performance drop:

Batch Size | 800 | 512 | 256 | 128 | 64 | 32
Learning Rate | 3.125e-05 | 2.00e-05 | 1.00e-05 | 5.00e-06 | 2.50e-06 | 1.25e-06
Mean Recall | 74.89 | 74.85 | 73.98 | 72.14 | 69.24 | 65.04
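The linear scaling rule used here simply scales the learning rate in proportion to the batch size, anchored at the (batch size 256, lr 1e-5) baseline pair:

def scaled_lr(batch_size: int, base_lr: float = 1e-5, base_batch_size: int = 256) -> float:
    # linear scaling rule: lr grows proportionally with the global batch size
    return base_lr * batch_size / base_batch_size

print(scaled_lr(800))   # 3.125e-05, matching the table above
print(scaled_lr(32))    # 1.25e-06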

5. ✨ Improved Naive Baseline with Better Hyper-parameters. Combining all of the hyper-parameter sweep observations above, we increase the mean recall of the naive fine-tuning baseline from 73.98 to 75.04.

Hyper-parameter | Baseline | ✨ Improved
backbone | ResNet50 | ResNet50
batch_size | 32x8=256 | 100x8=800
epochs | 10 | 15
lr | 1e-05 | 3.125e-05
weight_decay | 0.5 | 1.0
Training Command
torchrun --nproc_per_node 8 -m training.main \
    --train-data 'mscoco_captions' --retrieval-data 'mscoco_captions' \
    --retrieval-frequency 1 --datasets-dir '/data/Datasets' \
    --epochs 15 --save-frequency 0 --batch-size 100 --workers 2 \
    --lr 3125e-8 --warmup 100 --weight_decay 1.0 --max-grad-norm 5 \
    --image-model 'RN50' --image-model-builder 'openclip' --text-model 'RN50' --text-model-builder 'openclip'\
    --pretrained-image-model --pretrained-text-model \
    --loss 'InfoNCE' \
    --report-to tensorboard --logs 'logs/MSCOCO-RN50'  --name '15ep-bs800-lr3125e-8-wd1.0'

Results:

Model | I2T R@1 | I2T R@5 | I2T R@10 | T2I R@1 | T2I R@5 | T2I R@10 | Mean Recall
Baseline | 64.84 | 86.62 | 92.30 | 44.99 | 72.76 | 82.34 | 73.98
Improved Baseline | 65.34 | 87.44 | 92.84 | 46.70 | 74.45 | 83.47 | 75.04

Scaling up Batch Size by Partially Freezing Weights

Fine-tuning Strategy | Image Params | Text Params | Total Trainable Params (M) | % | I2T R@1 | I2T R@5 | I2T R@10 | T2I R@1 | T2I R@5 | T2I R@10 | Mean Recall
zero-shot evaluation | - | - | 0 | 0.0% | 48.06 | 73.88 | 83.02 | 28.31 | 52.96 | 64.10 | 58.39
lock CLIP, add linear projection heads | linear head | linear head | 2.1 | 2.1% | 47.24 | 75.06 | 84.82 | 32.91 | 61.21 | 72.83 | 62.34
lock CLIP, add MLP projection heads | MLP head | MLP head | 16.79 | 16.5% | 53.12 | 79.86 | 87.76 | 37.46 | 65.63 | 76.41 | 66.71
lock image, tune text | - | All | 63.69 | 62.4% | 62.12 | 85.12 | 91.46 | 42.52 | 70.34 | 80.31 | 71.98
lock text, tune image | All | - | 38.32 | 37.6% | 59.78 | 84.10 | 90.86 | 43.57 | 71.02 | 80.76 | 71.68
naïve fine-tuning (improved baseline) | All | All | 102.01 | 100.0% | 65.34 | 87.44 | 92.84 | 46.70 | 74.45 | 83.47 | 75.04
  • lock image and partially fine-tune text

Text Params | projection+ln_final | 11 | 10,11 | 8~11 | 6~11 | 4~11 | 2~11 | 0~11 | All
Total Trainable Params (M) | 0.53 | 3.68 | 6.83 | 13.13 | 19.44 | 25.74 | 32.05 | 38.35 | 63.69
% | 0.5% | 3.6% | 6.7% | 12.9% | 19.1% | 25.2% | 31.4% | 37.6% | 62.4%
Mean Recall | 67.23 | 69.15 | 70.27 | 71.36 | 71.79 | 71.96 | 72.00 | 72.18 | 71.98
  • lock text and partially fine-tune image

Image Params | attnpool | attnpool,layer4 | attnpool,layer4,3 | attnpool,layer4,3,2 | attnpool,layer4,3,2,1 | All
Text Params | - | - | - | - | - | -
Total Trainable Params (M) | 14.79 | 29.75 | 36.85 | 38.07 | 38.29 | 38.32
% | 14.5% | 29.2% | 36.1% | 37.3% | 37.5% | 37.6%
Mean Recall | 71.33 | 72.49 | 72.23 | 71.89 | 71.82 | 71.68
  • Scale up batch size

Fine-tuning Strategy | Image Params | Text Params | Total Trainable Params (M) | % | I2T R@1 | I2T R@5 | I2T R@10 | T2I R@1 | T2I R@5 | T2I R@10 | Mean Recall
naïve fine-tuning (improved baseline), bs800-lr3.125e-5 | All | All | 102.01 | 100.0% | 65.34 | 87.44 | 92.84 | 46.70 | 74.45 | 83.47 | 75.04
bs800-lr3.125e-5 | attnpool,layer4 | 0~11 | 68.11 | 66.8% | 66.10 | 87.60 | 93.56 | 47.61 | 75.17 | 84.18 | 75.70
bs1792-lr7e-5 | attnpool,layer4 | 0~11 | 68.11 | 66.8% | 65.95 | 88.30 | 93.66 | 48.08 | 75.71 | 84.42 | 76.02
Training Command
torchrun --nproc_per_node 8 -m training.main \
    --train-data 'mscoco_captions' --retrieval-data 'mscoco_captions' \
    --retrieval-frequency 1 --datasets-dir '/data/Datasets' \
    --epochs 15 --save-frequency 15 --batch-size 224 --workers 4 \
    --lr 7e-5 --warmup 100 --weight_decay 1.0 --max-grad-norm 5 \
    --image-model 'RN50' --image-model-builder 'openclip' --text-model 'RN50' --text-model-builder 'openclip'\
    --pretrained-image-model --pretrained-text-model --lock-image-model \
    --lock-text-partial 'positional_embedding,token_embedding' \
    --lock-image-partial '!attnpool,!layer4' \
    --loss 'InfoNCE' \
    --report-to tensorboard --logs 'logs/MSCOCO-RN50-partial'  --name 'save-lock-image(!attnpool,!layer4)-lock-text(positional_embedding,token_embedding)-bs1792-lr7e-5'

More Tricks for Fine-tuning

Layer-wise Learning Rate Decay (LLDR)

--layer_decay_image 0.9 --layer_decay_text 1 \
for layer_decay_text in 1.0 0.95 0.9 0.85 0.8 0.75 0.7 0.65 0.6;
do
torchrun --nproc_per_node 8 -m training.main \
    --train-data 'mscoco_captions' --retrieval-data 'mscoco_captions' \
    --retrieval-frequency 1 --datasets-dir '/data/Datasets' \
    --epochs 15 --save-frequency 0 --batch-size 224 --workers 2 \
    --lr 7e-5 --warmup 100 --weight_decay 1.0 --max-grad-norm 5 \
    --image-model 'RN50' --image-model-builder 'openclip' --text-model 'RN50' --text-model-builder 'openclip'\
    --pretrained-image-model --pretrained-text-model --lock-image-model \
    --lock-text-partial 'positional_embedding,token_embedding' \
    --lock-image-partial '!attnpool,!layer4' \
    --loss 'InfoNCE' \
    --report-to tensorboard --logs 'logs/MSCOCO-RN50-LLDR'  --name 'layer_decay_text='$layer_decay_text \
    --layer_decay_text $layer_decay_text; 
done
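Layer-wise learning-rate decay assigns smaller learning rates to earlier layers, scaling each layer's lr by decay^(depth from the top). A hypothetical sketch of building such optimizer parameter groups (the layer-id parsing below assumes CLIP-style 'resblocks.<i>' names and may differ from the codebase's --layer_decay_* implementation):

import torch

def lldr_param_groups(model: torch.nn.Module, base_lr: float, layer_decay: float, num_layers: int):
    groups = []
    for name, param in model.named_parameters():
        if not param.requires_grad:
            continue
        # parameters outside the transformer blocks are treated as the top layer
        layer_id = num_layers
        if 'resblocks.' in name:
            layer_id = int(name.split('resblocks.')[1].split('.')[0])
        scale = layer_decay ** (num_layers - layer_id)   # later layers keep ~base_lr, earlier layers get less
        groups.append({'params': [param], 'lr': base_lr * scale})
    return groups

# optimizer = torch.optim.AdamW(lldr_param_groups(text_tower, base_lr=7e-5, layer_decay=0.9, num_layers=12), weight_decay=1.0)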

Exponential Moving Average (EMA)

--model_ema --model_ema_decay 0.998 \
for model_ema_decay in 0.99999 0.9999 0.9995 0.999 0.995 0.99 0.95 0.9 0.8;
do
torchrun --nproc_per_node 8 -m training.main \
    --train-data 'mscoco_captions' --retrieval-data 'mscoco_captions' \
    --retrieval-frequency 1 --datasets-dir '/data/Datasets' \
    --epochs 15 --save-frequency 0 --batch-size 224 --workers 2 \
    --lr 7e-5 --warmup 100 --weight_decay 1.0 --max-grad-norm 5 \
    --image-model 'RN50' --image-model-builder 'openclip' --text-model 'RN50' --text-model-builder 'openclip'\
    --pretrained-image-model --pretrained-text-model --lock-image-model \
    --lock-text-partial 'positional_embedding,token_embedding' \
    --lock-image-partial '!attnpool,!layer4' \
    --loss 'InfoNCE' \
    --model_ema --model_ema_decay $model_ema_decay \
    --report-to tensorboard --logs 'logs/MSCOCO-RN50-EMA'  --name 'model_ema_decay='$model_ema_decay;
done
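Exponential moving average keeps a shadow copy of the weights that is updated after every optimizer step, and evaluation then uses the smoothed copy. A minimal sketch (the codebase's --model_ema implementation may differ in details such as which tensors get averaged):

import copy
import torch

class ModelEma:
    def __init__(self, model: torch.nn.Module, decay: float = 0.998):
        # shadow copy of the model, updated as an exponential moving average
        self.ema = copy.deepcopy(model).eval()
        self.decay = decay
        for p in self.ema.parameters():
            p.requires_grad_(False)

    @torch.no_grad()
    def update(self, model: torch.nn.Module):
        for ema_p, p in zip(self.ema.parameters(), model.parameters()):
            ema_p.mul_(self.decay).add_(p, alpha=1.0 - self.decay)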

Wise-FT: Evaluate the Model with a Weight-Space Ensemble

--eval-with-wise-ft 0.5 \
for alpha in 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0 ;
do
python itra/training/main.py \
    --zeroshot-frequency 1 --retrieval-frequency 1 --retrieval-data 'mscoco_captions' --datasets-dir '/data/Datasets' \
    --image-model 'RN50' --image-model-builder 'openclip'  \
    --text-model 'RN50' --text-model-builder 'openclip'  \
    --pretrained-image-model --pretrained-text-model \
    --resume 'logs/MSCOCO-RN50-partial/save-lock-image(!attnpool,!layer4)-lock-text(positional_embedding,token_embedding)-bs1792-lr7e-5/checkpoints/epoch_15.pt' \
    --eval-with-wise-ft $alpha \
    --logs 'logs/MSCOCO-RN50-WiseFT'  --name 'zs+retrieval-WiseFT='$alpha;
done
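As we understand --eval-with-wise-ft, it follows the Wise-FT recipe: the zero-shot and fine-tuned checkpoints are interpolated in weight space before evaluation. A minimal sketch (hypothetical helper, not the codebase's internal function):

import torch

def wise_ft_state_dict(zeroshot_sd: dict, finetuned_sd: dict, alpha: float) -> dict:
    # alpha = 0 keeps the zero-shot weights, alpha = 1 keeps the fine-tuned weights;
    # non-float entries (e.g. integer buffers) are copied from the zero-shot checkpoint as-is
    return {k: zeroshot_sd[k] if not zeroshot_sd[k].is_floating_point()
            else (1.0 - alpha) * zeroshot_sd[k] + alpha * finetuned_sd[k]
            for k in zeroshot_sd}

# model.load_state_dict(wise_ft_state_dict(zs_model.state_dict(), ft_model.state_dict(), alpha=0.5))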

RSICD Retrieval

# 2080ti machine, RSICD retrieval (cross-check against reference results)
torchrun --nproc_per_node 8 -m training.main \
    --train-data '/data/Datasets/RSICD/csv/rsicd_train.csv' --images-dir '/data/Datasets/RSICD/RSICD_images/RSICD_images' \
    --csv-separator '\t' --csv-img-key 'filename' --csv-caption-key 'title' \
    --retrieval-data '/data/Datasets/RSICD/csv/rsicd_test.csv' --retrieval-images-dir '/data/Datasets/RSICD/RSICD_images/RSICD_images' \
    --retrieval-csv-separator '\t' --retrieval-csv-img-key 'filename' --retrieval-csv-caption-key 'title' \
    --retrieval-frequency 1  --datasets-dir '/data/Datasets' \
    --epochs 30 --save-frequency 0 --batch-size 16 --workers 2 \
    --lr 1e-6 --warmup 100 --weight_decay 0.5 --max-grad-norm 5 \
    --image-model 'ViT-L-14-336' --image-model-builder 'openclip' \
    --text-model 'ViT-L-14-336' --text-model-builder 'openclip' \
    --pretrained-image-model --pretrained-text-model \
    --lock-image-model --lock-text-model \
    --lock-image-partial '!ln_post,!resblocks.23,!resblocks.22,!resblocks.21,!resblocks.20,!resblocks.19,!resblocks.18' \
    --lock-text-partial '!text_projection,!ln_final,!resblocks.11,!resblocks.10,!resblocks.9' \
    --loss 'InfoNCE' --layer_decay_image 0.9 --layer_decay_text 0.9 \
    --report-to tensorboard --logs 'logs/RSICD-ViT-L-14'  --name '30ep-b128-lr1e-5-unlock-image-text-last0.75-lldr0.9'

python itra/training/main.py --config-yaml 'logs/params.yml' --name 'custom-name'

python itra/training/main.py \
    --episode-size 10000 --train-data 'mscoco_captions' --retrieval-data 'mscoco_captions' \
    --retrieval-frequency 1 --datasets-dir '/data/Datasets' \
    --epochs 15 --save-frequency 0 --batch-size 100 --workers 2 \
    --lr 1e-4 --warmup 100 --weight_decay 1.0 --max-grad-norm 5 \
    --image-model 'RN50' --image-model-builder 'openclip' --text-model 'RN50' --text-model-builder 'openclip' \
    --pretrained-image-model --pretrained-text-model --lock-image-model --lock-text-model \
    --loss 'InfoNCE' --prompt --n-prompt 4 \
    --report-to tensorboard --logs 'logs/test' --name 'coco-finetune-nprompt-4'

Image Classification (UniCL)

UniCL: Unified Contrastive Learning in Image-Text-Label Space

Train an Image Classification Model From scratch

Compare to MMClassification

# Single GPU classification
python itra/training/main.py \
    --train-data 'CIFAR10' \
    --linear-frequency 20  --zeroshot-frequency 20 --datasets-dir '/data/Datasets' \
    --epochs 200 --save-frequency 0 --batch-size 128 --workers 4 \
    --opt 'sgd' --lr 0.1 --warmup 100 --weight_decay 0.0001 \
    --image-model 'resnet18' --image-model-builder 'torchvision' --image-resolution 32  --image-head-n-layers 1 \
    --pretrained-text-model \
    --text-model 'RN50' --text-model-builder 'openclip' --lock-text-model --text-head-n-layers 1  \
    --loss 'CrossEntropy' --joint-projection-dim 10 \
    --report-to tensorboard --logs 'logs/UniCL-Classification'  --name 'resnet18(scratch)-CIFAR10-200ep-CrossEntropy+linear_eval'
    
# Single GPU classification
python itra/training/main.py \
    --train-data 'CIFAR10' \
    --linear-frequency 5 --zeroshot-frequency 5 --datasets-dir '/data/Datasets' \
    --epochs 200 --save-frequency 0 --batch-size 128 --workers 4 \
    --opt 'sgd' --lr 0.1 --warmup 100 --weight_decay 0.0001 \
    --image-model 'resnet18' --image-model-builder 'torchvision' --image-resolution 32  --image-head-n-layers 1 \
    --pretrained-text-model \
    --text-model 'RN50' --text-model-builder 'openclip' --lock-text-model --text-head-n-layers 1  \
    --loss 'InfoNCE' --joint-projection-dim 1024 \
    --report-to tensorboard --logs 'logs/UniCL-Classification'  --name 'resnet18(scratch)-CIFAR10-200ep-InfoNCE+linear_eval'

Fine-tuning CLIP for ImageNet Classification

Re-implement this paper.

python itra/training/main.py \
    --train-data 'mscoco_captions' --retrieval-data 'mscoco_captions' --dataset-size 1000 \
    --retrieval-frequency 1 --datasets-dir '/data/Datasets' \
    --epochs 1 --save-frequency 1 --batch-size 32 --workers 2 \
    --lr 1e-5 --warmup 100 --weight_decay 0.5 --max-grad-norm 5 \
    --image-model 'RN50' --image-model-builder 'openclip' --text-model 'RN50' --text-model-builder 'openclip' \
    --pretrained-image-model --pretrained-text-model \
    --loss 'InfoNCE' \
    --report-to tensorboard --logs 'logs/test' --name 'RN'

Language-to-vision Knowledge Distillation

Coming soon…

Vision-to-language Knowledge Distillation

Coming soon…

Todo

New features incoming👇

  • Refactor main.py

  • Write help messages for arguments

  • Use YAML

  • Project
    • install as package

    • PyPI package publishing

  • Evaluation reports
    • zero-shot classification

    • linear/knn classification

    • clustering evaluation

    • SentEval

    • word embedding

    • MS Marco retrieval

    • Chinese CLIPs’ Evaluation Reports (ImageNet-CN zero-shot, MS-COCO-CN retrieval)

  • Implementations
    • UniCL-based image classification

    • Validate loss functions

    • Validate Adapters

    • SimCSE and PromptBERT re-implementation

    • Vision-to-language Knowledge Distillation

    • Language-to-vision Knowledge Distillation

    • Teacher selection based on Information Bottleneck Theory