Reproduce the results for training with DataCompDR_12M
Hi, thanks for the great work!
I'm trying to reproduce the results using your training section on GitHub. I'm running configs/run_datacompdr12m.sh to replicate the results from the paper for ViT-B16 with the DataCompDR-12M dataset.
Currently, my best result is 56.7% accuracy on ImageNet, while the paper reports 61.7%.
The only difference in my setup is that about 30% of the image links in the DataCompDR-12M dataset couldn't be downloaded. To reach 12M samples, I added around 3.5M samples from the DataCompDR-1B dataset.
Thanks
Thanks @Amitshomer for your interest. The gap you are observing seems too large. There are two differences between the 12M and 1B sets that may have an impact.
- First, the `text_emb` in the 12M set has an additional repeated row: rows 0 and 1 are the same, and both are the embedding of the ground-truth caption (see the sketch after this list). Depending on how you have combined the data, this may or may not have an impact.
- The other difference is the number of image augmentations per sample (10 for DataCompDR-1B and 30 for DataCompDR-12M). But this one should account for less than a 1% difference, as Table 4.a suggests.
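To make the first point concrete, here is a purely illustrative sketch of the per-sample layout (random vectors stand in for the stored embeddings; each row is 1536-d, i.e. two 768-d teacher embeddings concatenated):

```python
import numpy as np

# Stand-in for one sample's stacked text embeddings in the original DataCompDR-12M release.
# Rows 0 and 1 are both the ground-truth caption embedding; the remaining rows are the
# synthetic-caption embeddings.
text_emb_all = np.random.randn(7, 1536).astype(np.float32)
text_emb_all[1] = text_emb_all[0]  # the duplicated ground-truth row

gt_emb = text_emb_all[0]     # ground-truth caption embedding
syn_embs = text_emb_all[2:]  # synthetic-caption embeddings (row 1 is not one of them)
```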
If investigating these issues doesn't turn up anything, I'd suggest repeating the ablations of Table 2 to see where the drop appears. Specifically, first try disabling all DR losses and synthetic captions and train with the CLIP loss alone, then add synthetic captions, and finally add the KD loss with lambda=1.
Hi, thanks for the quick reply.
About the first point: as you said, the 12M set includes a repeated row. Correct me if I'm wrong, but it looks like there's a bug in the dataloader when handling DataCompDR-12M.
Here, len(texts) == 1, and the lines below show a mismatch: the selected syn_text and its syn_text_emb don't align.
syn_text = sample["syn.json"]["syn_text"][scapi]
syn_text_emb = text_emb_all[len(texts) + scapi]
For example, when scapi == 0, the syn_text_emb that gets pulled is actually the GT caption embedding (rows 0 and 1 are both the GT caption embedding).
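Concretely, with toy indices (assuming the original 12M release, where rows 0 and 1 of text_emb_all are both the ground-truth embedding and the synthetic captions start at row 2):

```python
texts = ["ground-truth caption"]  # len(texts) == 1 for the 12M set
scapi = 0                         # index of the selected synthetic caption

wrong_idx = len(texts) + scapi        # = 1 -> the duplicated ground-truth row
correct_idx = len(texts) + 1 + scapi  # = 2 -> embedding of syn_text[0]
```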
To clarify, only one text embedding is repeated, so a hacky fix is to replace that line with the following:
# The original 12M shards have 7 rows of text embeddings; row 1 duplicates the GT row.
is_duplicate = (len(text_emb_all) == 7)
# Skip the duplicate so the selected embedding matches syn_text[scapi].
syn_text_emb = text_emb_all[len(texts) + is_duplicate + scapi]
This skips the duplicate text embedding. We will fix the dataset soon and remove the redundant text embedding, but you can use the hack above in the meantime.
This issue was also reported here: https://huggingface.co/datasets/apple/DataCompDR-12M/discussions/4
One more note: please make sure to download the original ground-truth captions for the 12M set, as discussed here:
https://huggingface.co/datasets/apple/DataCompDR-12M/discussions/5
https://github.com/apple/ml-mobileclip/tree/main/training
Hi @Amitshomer,
We have uploaded a revision of the dataset where the duplicate features are removed. The text embeddings per sample are now 6x1536. Please let us know if you test and observe any issues.
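If it's useful, a quick sanity check for telling the two revisions apart could look like the sketch below (my own snippet, assuming the per-sample embeddings are already stacked into a text_emb_all array as in the loader snippet above):

```python
import numpy as np

def check_revision(text_emb_all: np.ndarray) -> str:
    """Distinguish the original 12M shards (duplicated GT row) from the fixed revision."""
    assert text_emb_all.shape[1] == 1536, "each row should be two 768-d teacher embeddings concatenated"
    if text_emb_all.shape[0] == 7 and np.allclose(text_emb_all[0], text_emb_all[1]):
        return "original revision: rows 0 and 1 are both the ground-truth embedding"
    if text_emb_all.shape[0] == 6:
        return "fixed revision: row 0 is the ground truth, rows 1-5 are synthetic captions"
    return "unexpected layout"
```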
Thanks
Hi, thanks again for everything.
I haven’t switched to the new dataset revision yet. I’m still using the hacky fix in the code. While accuracy has improved, I still can’t fully reproduce the original results.
Right now, I get 59.5% accuracy using KD loss with lambda=1.
I repeated some intermediate experiments from Table 4:
- CLIP loss without synthetic captions: 44%
- CLIP loss with synthetic captions: 51.9%
The 59.5% result matches what’s shown in Table 4 (row 4): lambda=1, synthetic captions and augmentation are used, but the teacher is not an ensemble.
Could it be that the dataset contains embeddings from a single model, while the paper used an ensemble?
If not, do you have any other ideas?
Thanks
Hi @Amitshomer,
Thanks for reproducing the ablations of Table 4. The text and image embeddings are 1536-dimensional vectors, i.e., the concatenation of two 768-dimensional embeddings from two teachers. Can you make sure all these flags are set correctly (specifically, the last two)?
--dataset-reinforcement \
--dataset-reinforcement-config configs/datacompdr12m.json \
--distill-logit-scale 100 \
--distill-loss-weights 0.0 1.0 \
--distill-teacher-dimension 768 768 \
--distill-average-after-softmax
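For reference, here is a rough sketch (my own simplification, not the actual training code) of what the teacher-dimension and average-after-softmax flags imply, together with the logit scale: the 1536-d teacher embeddings are split into two 768-d parts, one per teacher, each teacher's image-text similarity logits are scaled by 100 and softmaxed separately, and the two distributions are averaged before computing the KD loss.

```python
import torch
import torch.nn.functional as F

def teacher_target_probs(img_emb, txt_emb, dims=(768, 768), logit_scale=100.0):
    """img_emb, txt_emb: (batch, 1536) tensors, two teachers' embeddings concatenated."""
    probs, start = [], 0
    for d in dims:  # split as in --distill-teacher-dimension 768 768
        ie = F.normalize(img_emb[:, start:start + d], dim=-1)
        te = F.normalize(txt_emb[:, start:start + d], dim=-1)
        logits = logit_scale * ie @ te.t()     # --distill-logit-scale 100
        probs.append(logits.softmax(dim=-1))   # per-teacher softmax
        start += d
    return torch.stack(probs).mean(dim=0)      # --distill-average-after-softmax
```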
All these flags are set the same.
The 2.2% gap is significant, but some of it could be attributed to the fact that DataCompDR-1B contains only 10 augmentations per sample. Furthermore, the variance across runs might be high depending on which subset was selected to replace the missing 30%. Could you try replacing the inaccessible portion of the 12M dataset with a different random subset of DataCompDR-1B and, if possible, compute the standard deviation over a few runs?
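For the subset replacement, something as simple as the following sketch (with placeholder UID lists and counts) would make the draw reproducible per seed:

```python
import random

# Placeholder list standing in for the DataCompDR-1B sample UIDs not already in the 12M subset.
candidate_uids = [f"uid_{i:09d}" for i in range(10_000)]
n_missing = 3_000  # in practice, roughly the 3.5M samples that could not be downloaded

random.seed(0)  # use a different seed per run to draw a different replacement subset
replacement_uids = random.sample(candidate_uids, k=n_missing)
```

Repeating the run with a couple of seeds would then give a rough standard deviation for the effect of the replaced subset.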
Hi,
I found a bug that I had accidentally introduced during other experiments.
Now I'm getting 61.1%, which looks very reasonable.
Thanks again for all the support.
Glad it worked out. Thanks for reporting.