Reproduce the results for training with DataCompDR_12M
Hi, thanks for the great work!
I'm trying to reproduce the results using your training section on GitHub. I'm running configs/run_datacompdr12m.sh to replicate the results from the paper for ViT-B16 with the DataCompDR-12M dataset.
Currently, my best result is 56.7% accuracy on ImageNet, while the paper reports 61.7%.
The only difference in my setup is that about 30% of the image links in the DataCompDR-12M dataset couldn't be downloaded. To reach 12M samples, I added around 3.5M samples from the DataCompDR-1B dataset.
Thanks
Thanks @Amitshomer for your interest. The gap you are observing seems too large. There are two differences between the 12M and 1B sets that may have an impact.
- First, the `text_emb` in the 12M set has an additional repeated row: rows 0 and 1 are the same, and both are the embedding of the ground-truth caption (see the sketch after this list). Depending on how you have combined the data, this may or may not have an impact.
- The other difference is the number of image augmentations per sample (10 for DataCompDR-1B and 30 for DataCompDR-12M). But this one should account for less than a 1% difference, as Table 4.a suggests.
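To make the first point concrete, here is a purely illustrative sketch of the per-sample layout (random vectors stand in for the stored embeddings; each row is 1536-d, i.e. two 768-d teacher embeddings concatenated):

```python
import numpy as np

# Stand-in for one sample's stacked text embeddings in the original DataCompDR-12M release.
# Rows 0 and 1 are both the ground-truth caption embedding; the remaining rows are the
# synthetic-caption embeddings.
text_emb_all = np.random.randn(7, 1536).astype(np.float32)
text_emb_all[1] = text_emb_all[0]  # the duplicated ground-truth row

gt_emb = text_emb_all[0]     # ground-truth caption embedding
syn_embs = text_emb_all[2:]  # synthetic-caption embeddings (row 1 is not one of them)
```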
If investigating these issues doesn't turn up anything, I'd suggest repeating the ablations of Table 2 to see where the drop appears. Specifically, first try disabling all DR losses and synthetic captions and train with the CLIP loss alone, then add synthetic captions, and finally add the KD loss with lambda=1.
Hi, thanks for the quick reply.
About the first point: as you said, the 12M set includes a repeated row. Correct me if I'm wrong, but it looks like there's a bug in the dataloader when handling DataCompDR-12M.
Here, len(texts) == 1, and the lines below show a mismatch: the selected syn_text and its syn_text_emb don't align.
syn_text = sample["syn.json"]["syn_text"][scapi]
syn_text_emb = text_emb_all[len(texts) + scapi]
For example, when scapi == 0, the syn_text_emb that gets pulled is actually the GT caption embedding (rows 0 and 1 are both the GT caption embedding).
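Concretely, with toy indices (assuming the original 12M release, where rows 0 and 1 of text_emb_all are both the ground-truth embedding and the synthetic captions start at row 2):

```python
texts = ["ground-truth caption"]  # len(texts) == 1 for the 12M set
scapi = 0                         # index of the selected synthetic caption

wrong_idx = len(texts) + scapi        # = 1 -> the duplicated ground-truth row
correct_idx = len(texts) + 1 + scapi  # = 2 -> embedding of syn_text[0]
```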
To clarify, only one text embedding is repeated, so a hacky fix is to replace that line with the following:
# The original 12M shards have 7 rows of text embeddings; row 1 duplicates the GT row.
is_duplicate = (len(text_emb_all) == 7)
# Skip the duplicate so the selected embedding matches syn_text[scapi].
syn_text_emb = text_emb_all[len(texts) + is_duplicate + scapi]
This skips the duplicate text embedding. We will fix the dataset soon and remove the redundant text embedding, but you can use the hack above in the meantime.
This issue was also reported here: https://huggingface.co/datasets/apple/DataCompDR-12M/discussions/4
One more note: please make sure to download the original ground-truth captions for the 12M set, as discussed here:
https://huggingface.co/datasets/apple/DataCompDR-12M/discussions/5
https://github.com/apple/ml-mobileclip/tree/main/training
Hi @Amitshomer,
We have uploaded a revision of the dataset where the duplicate features are removed. The text embeddings per sample are now 6x1536. Please let us know if you test and observe any issues.
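If it's useful, a quick sanity check for telling the two revisions apart could look like the sketch below (my own snippet, assuming the per-sample embeddings are already stacked into a text_emb_all array as in the loader snippet above):

```python
import numpy as np

def check_revision(text_emb_all: np.ndarray) -> str:
    """Distinguish the original 12M shards (duplicated GT row) from the fixed revision."""
    assert text_emb_all.shape[1] == 1536, "each row should be two 768-d teacher embeddings concatenated"
    if text_emb_all.shape[0] == 7 and np.allclose(text_emb_all[0], text_emb_all[1]):
        return "original revision: rows 0 and 1 are both the ground-truth embedding"
    if text_emb_all.shape[0] == 6:
        return "fixed revision: row 0 is the ground truth, rows 1-5 are synthetic captions"
    return "unexpected layout"
```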
Thanks
Hi, thanks again for everything.
I haven’t switched to the new dataset revision yet. I’m still using the hacky fix in the code. While accuracy has improved, I still can’t fully reproduce the original results.
Right now, I get 59.5% accuracy using KD loss with lambda=1.
I repeated some intermediate experiments from Table 4:
- CLIP loss without synthetic captions: 44%
- CLIP loss with synthetic captions: 51.9%
The 59.5% result matches what’s shown in Table 4 (row 4): lambda=1, synthetic captions and augmentation are used, but the teacher is not an ensemble.
Could it be that the dataset contains embeddings from a single model, while the paper used an ensemble?
If not, do you have any other ideas?
Thanks
Hi @Amitshomer,
Thanks for reproducing the ablations of Table 4. The text and image embeddings are 1536-dimensional vectors, i.e., the concatenation of two 768-dimensional embeddings from two teachers. Can you make sure all these flags are set correctly (specifically, the last two)?
--dataset-reinforcement \
--dataset-reinforcement-config configs/datacompdr12m.json \
--distill-logit-scale 100 \
--distill-loss-weights 0.0 1.0 \
--distill-teacher-dimension 768 768 \
--distill-average-after-softmax
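For reference, here is a rough sketch (my own simplification, not the actual training code) of what the teacher-dimension and average-after-softmax flags imply, together with the logit scale: the 1536-d teacher embeddings are split into two 768-d parts, one per teacher, each teacher's image-text similarity logits are scaled by 100 and softmaxed separately, and the two distributions are averaged before computing the KD loss.

```python
import torch
import torch.nn.functional as F

def teacher_target_probs(img_emb, txt_emb, dims=(768, 768), logit_scale=100.0):
    """img_emb, txt_emb: (batch, 1536) tensors, two teachers' embeddings concatenated."""
    probs, start = [], 0
    for d in dims:  # split as in --distill-teacher-dimension 768 768
        ie = F.normalize(img_emb[:, start:start + d], dim=-1)
        te = F.normalize(txt_emb[:, start:start + d], dim=-1)
        logits = logit_scale * ie @ te.t()     # --distill-logit-scale 100
        probs.append(logits.softmax(dim=-1))   # per-teacher softmax
        start += d
    return torch.stack(probs).mean(dim=0)      # --distill-average-after-softmax
```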
All these flags are set the same.
The 2.2% gap is significant, but some of it could be attributed to the fact that DataCompDR-1B contains only 10 augmentations per sample. Furthermore, the variance across runs might be high depending on which subset was selected to replace the missing 30%. Could you try replacing the inaccessible portion of the 12M dataset with a different random subset of DataCompDR-1B and, if possible, compute the standard deviation over a few runs?
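For the subset replacement, something as simple as the following sketch (with placeholder UID lists and counts) would make the draw reproducible per seed:

```python
import random

# Placeholder list standing in for the DataCompDR-1B sample UIDs not already in the 12M subset.
candidate_uids = [f"uid_{i:09d}" for i in range(10_000)]
n_missing = 3_000  # in practice, roughly the 3.5M samples that could not be downloaded

random.seed(0)  # use a different seed per run to draw a different replacement subset
replacement_uids = random.sample(candidate_uids, k=n_missing)
```

Repeating the run with a couple of seeds would then give a rough standard deviation for the effect of the replaced subset.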
Hi,
I found a bug that I had accidentally introduced during other experiments.
Now I'm getting 61.1%, which looks very reasonable.
Thanks again for all the support.
Glad it worked out. Thanks for reporting.