The LoRA Family

With the explosion of Large Language Models (LLMs), researchers increasingly want to adapt these models to downstream tasks. However, training LLMs often requires a substantial amount of computing resources, putting it out of reach for many individual researchers and organizations. In response, several Parameter-Efficient Fine-Tuning (PEFT) techniques have emerged. The idea behind PEFT is to fine-tune only a small fraction of the model parameters while maintaining model performance, allowing researchers to adapt large models more efficiently and cost-effectively. These methods have gained significant traction across applications, making broader experimentation and real-world deployment of LLMs possible. Among the many PEFT methods, Low-Rank Adaptation (LoRA) is one of the most common, efficiently adapting LLMs by leveraging low-rank factorization. In the following paragraphs, we will overview LoRA and some key LoRA variants. ...
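
To make the low-rank factorization idea concrete, here is a minimal sketch of a LoRA-style linear layer in PyTorch. It is an illustration under assumed shapes and hyperparameters (rank `r`, scaling `alpha`), not any library's reference implementation: the pre-trained weight stays frozen, and only the two small factors `A` and `B` are trained.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Illustrative LoRA-style layer: y = x W^T + (alpha/r) * x A^T B^T."""

    def __init__(self, in_features, out_features, r=8, alpha=16):
        super().__init__()
        # Frozen pre-trained weight (randomly initialized here for illustration).
        self.weight = nn.Parameter(
            torch.randn(out_features, in_features), requires_grad=False
        )
        # Trainable low-rank factors: only r * (in_features + out_features) parameters.
        self.lora_A = nn.Parameter(torch.randn(r, in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(out_features, r))  # starts as a no-op
        self.scaling = alpha / r

    def forward(self, x):
        # Frozen path plus the scaled low-rank update.
        return x @ self.weight.T + self.scaling * (x @ self.lora_A.T) @ self.lora_B.T

# Toy usage: only lora_A and lora_B receive gradients.
layer = LoRALinear(768, 768, r=8)
y = layer(torch.randn(4, 768))
```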

May 1, 2024 · 5 min · 898 words · Me

Paper Note: BLIP

BLIP is a unified Vision-Language Pre-training framework that learns from noisy image-text pairs. BLIP pre-trains a multimodal mixture of encoder-decoder model on a dataset bootstrapped from large-scale noisy image-text pairs by injecting diverse synthetic captions and removing noisy captions. ...
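
A purely illustrative sketch of the bootstrapping loop described above: a captioner proposes a synthetic caption for each web image, and a filter keeps only the image-text pairs it judges to match. `captioner` and `filter_fn` are hypothetical placeholders standing in for BLIP's actual modules.

```python
def bootstrap_dataset(web_pairs, captioner, filter_fn):
    """Return a cleaned dataset built from noisy (image, caption) web pairs."""
    clean_pairs = []
    for image, web_caption in web_pairs:
        synthetic_caption = captioner(image)            # inject a synthetic caption
        for caption in (web_caption, synthetic_caption):
            if filter_fn(image, caption):               # remove noisy captions
                clean_pairs.append((image, caption))
    return clean_pairs

# Toy usage with dummy stand-ins for the captioner and filter.
pairs = [("img_0", "a photo from the web"), ("img_1", "unrelated alt text")]
cleaned = bootstrap_dataset(
    pairs,
    captioner=lambda img: f"a synthetic caption for {img}",
    filter_fn=lambda img, cap: "unrelated" not in cap,
)
```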

February 16, 2024 · 2 min · 403 words · Me

Paper Note: ALBEF

Contributions: (1) To enable more grounded vision-and-language representation learning, introduce a contrastive loss (as in CLIP) to ALign the image and text representations BEfore Fusing (ALBEF) them through cross-modal attention. (2) To improve learning from noisy web data, propose momentum distillation, a self-training method that learns from pseudo-targets produced by a momentum model. ...
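
A minimal sketch of the momentum-model idea behind momentum distillation (assumed details, not the paper's code): the momentum encoder is an exponential moving average (EMA) of the online encoder, and its softened predictions serve as pseudo-targets for noisy web image-text pairs.

```python
import copy
import torch
import torch.nn as nn

online_encoder = nn.Linear(512, 256)              # placeholder for the real encoder
momentum_encoder = copy.deepcopy(online_encoder)  # same weights, updated only by EMA
for p in momentum_encoder.parameters():
    p.requires_grad_(False)

@torch.no_grad()
def ema_update(online, momentum, m=0.995):
    # Momentum parameters drift slowly toward the online parameters.
    for p, p_m in zip(online.parameters(), momentum.parameters()):
        p_m.data.mul_(m).add_(p.data, alpha=1 - m)

# Called once per training step; the momentum encoder's outputs are then used
# as soft pseudo-targets (mixed with the one-hot labels) in the training losses.
ema_update(online_encoder, momentum_encoder)
```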

August 24, 2023 · 3 min · 638 words · Me

Paper Note: CLIP

The simple pre-training task of predicting which caption goes with which image is an efficient and scalable way to learn SOTA image representations from scratch. After pre-training, natural language is used to reference learned visual concepts (or describe new ones), enabling zero-shot transfer of the model to downstream tasks. ...
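
A hedged sketch of that pre-training task as a symmetric contrastive loss: image and text embeddings are compared pairwise, and the model is trained to put the matching pairs on the diagonal of the similarity matrix. The encoders producing the features are omitted here; the temperature value is an assumption for illustration.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(image_features, text_features, temperature=0.07):
    """CLIP-style symmetric cross-entropy over an image-text similarity matrix."""
    # Normalize embeddings so the dot product is a cosine similarity.
    image_features = F.normalize(image_features, dim=-1)
    text_features = F.normalize(text_features, dim=-1)
    # Pairwise similarity logits: [batch, batch].
    logits = image_features @ text_features.T / temperature
    # The i-th image matches the i-th caption.
    targets = torch.arange(logits.size(0), device=logits.device)
    # Average the image-to-text and text-to-image directions.
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.T, targets)
    return (loss_i2t + loss_t2i) / 2

# Toy usage with random features for a batch of 8 pairs.
loss = contrastive_loss(torch.randn(8, 512), torch.randn(8, 512))
```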

August 15, 2023 · 3 min · 635 words · Me

Paper Note: Swin Transformer

A new ViT whose representation is computed with Shifted windows. ...

August 10, 2023 · 3 min · 438 words · Me