Hi there 👋

Welcome to Nancy’s blog

Paper Note: BLIP

BLIP is a unified Vision-Language Pre-training framework that learns from noisy image-text pairs. BLIP pre-trains a multimodal mixture of encoder-decoder model on a dataset bootstrapped from large-scale noisy image-text pairs by injecting diverse synthetic captions and removing noisy ones. ...

February 16, 2024 · 2 min · 403 words · Me
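The bootstrapping step (which the paper calls CapFilt) can be sketched roughly as follows; `captioner` and `filter_model` here are hypothetical stand-ins for the fine-tuned caption generator and image-text filter, not the paper's actual code:

```python
def bootstrap_dataset(pairs, captioner, filter_model):
    """Sketch of BLIP-style dataset bootstrapping (CapFilt):
    inject synthetic captions, then drop the noisy ones."""
    clean = []
    for image, web_caption in pairs:
        synthetic = captioner(image)  # captioner injects a synthetic caption
        for caption in (web_caption, synthetic):
            if filter_model(image, caption):  # filter removes noisy captions
                clean.append((image, caption))
    return clean
```

Both the web caption and the synthetic caption pass through the same filter, so a good synthetic caption can replace a noisy web one for the same image.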

Paper Note: ALBEF

Contributions: To enable more grounded vision-and-language representation learning, ALBEF introduces a contrastive loss (as in CLIP) to ALign the image and text representations BEfore Fusing them (ALBEF) through cross-modal attention. To improve learning from noisy web data, it proposes momentum distillation, a self-training method that learns from pseudo-targets produced by a momentum model. ...

August 24, 2023 · 3 min · 638 words · Me
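The momentum model that produces those pseudo-targets is maintained as an exponential moving average of the student's parameters. A minimal sketch of that update (function name and momentum coefficient are illustrative, not from the paper's code):

```python
def momentum_update(student_params, teacher_params, m=0.995):
    # The momentum (teacher) model is an exponential moving average of the
    # student; its outputs serve as pseudo-targets for momentum distillation.
    return [m * t + (1.0 - m) * s
            for s, t in zip(student_params, teacher_params)]
```

With m close to 1, the teacher evolves slowly and smoothly, which is what makes its pseudo-targets more stable than the noisy web labels.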

Paper Note: CLIP

The simple pre-training task of predicting which caption goes with which image is an efficient and scalable way to learn SOTA image representations from scratch. After pre-training, natural language is used to reference learned visual concepts (or describe new ones), enabling zero-shot transfer of the model to downstream tasks. ...

August 15, 2023 · 3 min · 635 words · Me
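As a concrete illustration of that pre-training task, matching each image to its own caption within a batch can be written as a symmetric contrastive (InfoNCE) loss over cosine similarities. This is a NumPy sketch under that reading, not CLIP's actual implementation; the function name and default temperature are illustrative:

```python
import numpy as np

def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric contrastive loss over a batch of matched image-text pairs."""
    # L2-normalize so the dot product is cosine similarity
    image_emb = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    text_emb = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = image_emb @ text_emb.T / temperature  # [batch, batch] similarities

    def log_softmax(x, axis):
        m = x.max(axis=axis, keepdims=True)
        return x - m - np.log(np.exp(x - m).sum(axis=axis, keepdims=True))

    # The matching caption for image i is text i, so targets are the diagonal.
    loss_i2t = -np.mean(np.diag(log_softmax(logits, axis=1)))  # image -> text
    loss_t2i = -np.mean(np.diag(log_softmax(logits, axis=0)))  # text -> image
    return (loss_i2t + loss_t2i) / 2
```

The loss is lowest when each image embedding is closest to its own caption's embedding and far from all the others in the batch.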

Paper Note: Swin Transformer

A new ViT whose representation is computed with Shifted windows. ...

August 15, 2023 · 3 min · 438 words · Me

Paper Note: Masked Autoencoders (MAE) (very short)

Masked autoencoders (MAE) are scalable self-supervised learners for computer vision. ...

August 15, 2023 · 2 min · 353 words · Me