Vishaal Udandarao

I am a second-year ELLIS PhD student, working jointly with Matthias Bethge at The University of Tuebingen and Samuel Albanie at The University of Cambridge/Google DeepMind. I am also part of the International Max Planck Research School for Intelligent Systems. I am mainly interested in understanding the generalisation properties of foundation models, both vision-language models (VLMs) and large multi-modal models (LMMs), through the lens of their pre-training and test data distributions. My key research interests are Data-centric Machine Learning, Robustness/Generalisation to Distribution Shifts, and Foundation Models.

Currently, I am interning at Google Zürich, working with Yongqin Xian, Alessio Tonioni, Federico Tombari, and Olivier Henaff. I am also closely collaborating with Ferjad Naeem, Nikhil Parthasarathy, and Talfan Evans.

Previously, I was an MPhil Machine Learning and Machine Intelligence student at The University of Cambridge. My thesis was on Understanding and Fixing the Modality Gap in VLMs. I graduated from IIIT Delhi with a Bachelor's in Computer Science in July 2020.

I am also fortunate to have previously worked with several great mentors: Ankush Gupta (Google DeepMind), Sungjin Ahn (KAIST), Tanmoy Chakraborty (IIT Delhi), Rajiv Ratn Shah (IIIT Delhi), Saket Anand (IIIT Delhi), Rajesh Kumar (Bucknell University), Anubha Gupta (IIIT Delhi), and Jainendra Shukla (IIIT Delhi).

Email / CV / Google Scholar / Twitter / GitHub

Publications
No "Zero-Shot" Without Exponential Data: Pretraining Concept Frequency Determines Multimodal Model Performance
Vishaal Udandarao*, Ameya Prabhu*, Adhiraj Ghosh, Yash Sharma, Philip Torr, Adel Bibi, Samuel Albanie, Matthias Bethge
DPFM Workshop, ICLR, 2024
pdf / code

Our work shows that the impressive empirical performance of multimodal models like CLIP and Stable Diffusion can be largely attributed to the presence of test concepts within their vast pretraining datasets, so their reported empirical performance does not constitute "zero-shot" generalization. Quite the contrary: these models require exponentially more data on a concept to linearly improve their performance on tasks pertaining to that concept, highlighting extreme sample inefficiency.

Lifelong Benchmarks: Efficient Model Evaluation in an Era of Rapid Progress
Ameya Prabhu*, Vishaal Udandarao*, Philip Torr, Matthias Bethge, Adel Bibi, Samuel Albanie
DMLR Workshop, ICLR, 2024
pdf / code

Our work introduces the concept of lifelong benchmarks, enabling effective comparisons of models and reducing overfitting to the biases of a particular dataset. We constructed large-scale lifelong classification benchmarks totalling over 1.5M samples. To facilitate more efficient evaluation, we introduce the Sort&Search method that reduces inference compute costs by 1000x.

Visual Data-Type Understanding does not emerge from Scaling Vision-Language Models
Vishaal Udandarao*, Max F. Burg*, Samuel Albanie, Matthias Bethge
ICLR, 2024
pdf / code

We introduce "Visual Data-Type Identification": the task of distinguishing between visual image distortions and styles. On this simple task, we find surprising behaviour in VLMs and LMMs: model scaling does not significantly improve performance. We trace this behaviour back to the LAION-2B dataset and show that a simple fine-tuning method improves performance.

SuS-X: Training-Free Name-Only Transfer of Vision-Language Models
Vishaal Udandarao, Ankush Gupta, Samuel Albanie
ICCV, 2023
pdf / code

We enhance CLIP's downstream classification performance by (1) curating a support set either by generating synthetic (Stable Diffusion) or retrieving natural (LAION-5B) samples, and (2) observing and fixing a mis-calibration issue with intra-modal distances in CLIP’s embedding space.

It's LeVAsa not LevioSA! Latent Encodings for Valence-Arousal Structure Alignment
Vishaal Udandarao*, Surabhi Nath*, Jainendra Shukla
CODS-COMAD, 2021
pdf / code

A VAE model that learns implicit structure by aligning the latent space with the Valence-Arousal circumplex space. We also present a novel algorithm for mapping between categorical and dimensional model labels using annotation transfer across affective facial image datasets.

COBRA: Contrastive Bi-Modal Representation Algorithm
Vishaal Udandarao*, Abhishek Maiti*, Suryatej Reddy Vyalla*, Deepak Srivatsav*, Yifang Yin, Rajiv Ratn Shah
TUSION workshop, IJCAI, 2020
pdf / code

A novel bi-modal framework that trains two modalities (image and text) jointly. It is inspired by the Contrastive Predictive Coding (CPC) and Noise Contrastive Estimation (NCE) paradigms, and preserves both inter- and intra-class relationships in a modality-invariant fashion.

InPHYNet: Leveraging Attention-based Multitask Recurrent Networks for Multi-label Physics Text Classification
Vishaal Udandarao*, Abhishek Agarwal*, Anubha Gupta, Tanmoy Chakraborty
Knowledge-Based Systems, 2020
pdf / code

A multi-task learning model that incorporates auxiliary semantics by utilising a weight alignment layer and an information exchange layer.

DisCont: Self-Supervised Visual Attribute Disentanglement using Context Vectors
Vishaal Udandarao*, Sarthak Bhagat*, Shagun Uppal*, Saket Anand
PTSGM Workshop, ECCV, 2020; MLI4SD Workshop, ICML, 2020
pdf / project page / slides / video / code

A self-supervised framework to disentangle multiple attributes by exploiting structural inductive biases within images and leveraging contrastive learning paradigms.

On the Inference of Soft Biometrics from Typing Patterns Collected in a Multi-device Environment
Vishaal Udandarao*, Mohit Agrawal*, Rajesh Kumar, Rajiv Ratn Shah
BigMM, 2020
pdf / code

An empirical study on the inference of gender, major/minor (computer science, non-computer science), typing style, age, and height from the typing patterns collected from 117 individuals in a multi-device environment.

Memeify: A Large-Scale Meme Generation System
Vishaal Udandarao*, Suryatej Reddy Vyalla*, Tanmoy Chakraborty
CODS-COMAD, 2020
pdf / slides / video / code

A meme generation system that uses a trained state-of-the-art transformer-based (GPT-2) model for caption generation by employing an encoder-decoder architecture.

EDUQA: Educational Domain Question Answering System using Conceptual Network Mapping
Vishaal Udandarao*, Abhishek Agarwal*, Nikhil Sachdeva*, Raj Kamal Yadav*, Vrinda Mittal*, Anubha Gupta, Abhinav Mathur
ICASSP, 2019
pdf / poster

An on-the-fly conceptual network model that incorporates educational semantics and preserves correlations between conceptual entities by applying intelligent indexing algorithms on an inherent concept network so as to improve answer generation.

Teaching
Deep Learning (CSE641)
Worked as a Teaching Assistant for the Deep Learning course offered by Dr. Saket Anand in Spring 2020.
Machine Learning (CSE543)
Worked as a Teaching Assistant for the Machine Learning course offered by Dr. Jainendra Shukla in Fall 2019.
Introduction to Engineering Design (DES130)
Worked as a Teaching Assistant for the Introduction to Engineering Design course offered by Dr. Aman Parnami in Spring 2019.
Linear Algebra (MTH100)
Worked as a Teaching Assistant for the Linear Algebra course offered by Dr. Samaresh Chatterjee in Fall 2018.
Misc

Apart from my academic interests, I am a huge football fan and actively support FC Barcelona, Paris Saint-Germain, Inter Miami CF. You've probably guessed already: Lionel Messi is my favourite player to ever touch a football. I also love watching Formula 1 and look up to Lewis Hamilton. I used to write stuff, but that was a long, long time ago. I also dabble with the guitar and the keyboard at times. Check out my SoundCloud profile!

