I am a Post-Doctoral Researcher at the University of Bristol. My research focuses on action recognition and understanding its links with language.
The extension of EPIC-KITCHENS has now been released, taking the dataset to 100 hours of video and 90,000 action segments. More information can be found here.
The EPIC-KITCHENS Dataset: Collection, Challenges and Baselines has been accepted to IEEE Transactions on Pattern Analysis and Machine Intelligence.
I successfully defended my thesis "Verbs and Me: An Investigation into Verbs as Labels for Action Recognition in Video Understanding". Thanks to my examiners Frank Keller and Jan van Gemert for their time spent reading the thesis and travelling to Bristol.
Our paper titled "Fine-Grained Action Retrieval through Multiple Parts of Speech Embeddings" was accepted as a poster presentation at ICCV 2019 (27th Oct-2nd Nov 2019). More info here.
Our paper titled "Learning Visual Actions Using Multiple Verb-Only Labels" was accepted as a poster presentation at BMVC 2019 (9th-12th September 2019). More info here.
Along with two other authors, I demoed EPIC-KITCHENS at CVPR 2018.
I presented a poster of "Towards an Unequivocal Representation of Actions" at the Brave New Ideas for Video Understanding workshop at CVPR 2018.
We have just released the largest egocentric dataset for action and object recognition. More info can be found here.
We have released a short-form version of "Towards an Unequivocal Representation of Actions" on arXiv here.
We propose a three-dimensional discrete and incremental scale to encode a method's level of supervision, i.e. the data and labels used when training a model to achieve a given performance. We capture three aspects of supervision that are known to give methods an advantage while incurring additional cost: pre-training, training labels and training data. The proposed three-dimensional scale can be included in result tables or leaderboards to handily compare methods not only by their performance, but also by the level of data supervision utilised by each method.
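A minimal sketch of how such a scale could be recorded alongside results. The axis names, level values and the compact code format here are assumptions for illustration, not the paper's exact encoding:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SupervisionLevel:
    """A method's supervision as discrete levels on three axes."""
    pretraining: int  # e.g. 0 = none, 1 = self-supervised, 2 = supervised
    labels: int       # e.g. 0 = none, 1 = weak, 2 = full
    data: int         # e.g. 0 = target set only, 1 = additional data

    def as_code(self) -> str:
        # Compact code suitable for a results table or leaderboard column.
        return f"P{self.pretraining}-L{self.labels}-D{self.data}"

method_a = SupervisionLevel(pretraining=2, labels=2, data=1)
method_b = SupervisionLevel(pretraining=0, labels=1, data=0)
print(method_a.as_code())  # P2-L2-D1
print(method_b.as_code())  # P0-L1-D0
```

Two methods with the same accuracy but different codes can then be compared on supervision cost as well as performance.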
This paper introduces EPIC-KITCHENS-100, the largest annotated egocentric dataset - 100 hrs, 20M frames, 90K actions - of wearable videos capturing long-term unscripted activities in 45 environments. This extends our previous dataset (EPIC-KITCHENS-55), released in 2018, resulting in more action segments (+128%), environments (+41%) and hours (+84%), using a novel annotation pipeline that allows denser and more complete annotations of fine-grained actions (54% more actions per minute).
We address the problem of cross-modal fine-grained action retrieval between text and video. Cross-modal retrieval is commonly achieved through learning a shared embedding space that can indifferently embed modalities. In this paper, we propose to enrich the embedding by disentangling parts-of-speech (PoS) in the accompanying captions. We build a separate multi-modal embedding space for each PoS tag. The outputs of multiple PoS embeddings are then used as input to an integrated multi-modal space, where we perform action retrieval.
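The structure of the PoS-disentangled embedding can be sketched as follows. This is a toy illustration with random linear projections; the dimensions, the two PoS tags and the concatenation-based fusion are assumptions for the sketch, not the paper's architecture or training procedure:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions (assumptions, not the paper's settings).
d_feat, d_pos, d_joint = 8, 4, 6
pos_tags = ["verb", "noun"]
modalities = ("video", "text")

# One projection per (PoS tag, modality) pair: each PoS gets its own
# multi-modal embedding space.
proj = {(tag, mod): rng.standard_normal((d_pos, d_feat))
        for tag in pos_tags for mod in modalities}

# Final projection from the concatenated PoS embeddings into the
# integrated multi-modal space used for retrieval.
W_joint = {mod: rng.standard_normal((d_joint, d_pos * len(pos_tags)))
           for mod in modalities}

def embed(feat: np.ndarray, mod: str) -> np.ndarray:
    # Embed the feature into each PoS space, then fuse into the joint space.
    parts = [proj[(tag, mod)] @ feat for tag in pos_tags]
    return W_joint[mod] @ np.concatenate(parts)

video_feat = rng.standard_normal(d_feat)
text_feat = rng.standard_normal(d_feat)
v, t = embed(video_feat, "video"), embed(text_feat, "text")

# Retrieval would rank candidates by similarity in the joint space.
sim = float(v @ t / (np.linalg.norm(v) * np.linalg.norm(t)))
```

In the real system the projections would be learned with a retrieval loss so that matching video/caption pairs land close together in the joint space.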
This work introduces verb-only representations for both recognition and retrieval of visual actions, in video. Current methods neglect legitimate semantic ambiguities between verbs, instead choosing unambiguous subsets of verbs along with objects to disambiguate the actions. We instead propose multiple verb-only labels, which we learn through hard or soft assignment as a regression.
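The difference between hard and soft assignment of multiple verb labels can be illustrated with a toy example. The verbs and annotator counts below are hypothetical, purely to show the two target constructions:

```python
import numpy as np

# Hypothetical annotations: how many annotators chose each verb
# for one action segment.
verbs = ["open", "pull", "turn"]
counts = np.array([3.0, 2.0, 0.0])  # e.g. 3 said "open", 2 said "pull"

# Soft assignment: a distribution over verbs, used as a regression target,
# preserving legitimate ambiguity between "open" and "pull".
soft = counts / counts.sum()

# Hard assignment: only the majority verb(s) receive the label.
hard = (counts == counts.max()).astype(float)

print(soft)  # [0.6 0.4 0. ]
print(hard)  # [1. 0. 0.]
```

Soft targets let a model express that an interaction is plausibly described by several verbs, rather than forcing a single unambiguous class.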
This work introduces verb-only representations for actions and interactions; the problem of describing similar motions (e.g. 'open door', 'open cupboard'), and distinguishing differing ones (e.g. 'open door' vs 'open bottle') using verb-only labels. Current approaches for action recognition neglect legitimate semantic ambiguities and class overlaps between verbs, relying on the objects to disambiguate interactions.
First-person vision is gaining interest as it offers a unique viewpoint on people's interaction with objects, their attention, and even intention. However, progress in this challenging domain has been relatively slow due to the lack of sufficiently large datasets. In this paper, we introduce EPIC-KITCHENS, a large-scale egocentric video benchmark recorded by 32 participants in their native kitchen environments. Our videos depict non-scripted daily activities: we simply asked each participant to start recording every time they entered their kitchen.
Manual annotations of temporal bounds for object interactions (i.e. start and end times) are typical training input to recognition, localization and detection algorithms. For three publicly available egocentric datasets, we uncover inconsistencies in ground truth temporal bounds within and across annotators and datasets. We systematically assess the robustness of state-of-the-art approaches to changes in labeled temporal bounds, for object interaction recognition.
We present SEMBED, an approach for embedding an egocentric object interaction video in a semantic-visual graph to estimate the probability distribution over its potential semantic labels.
Unmanned Aerial Vehicles (UAVs) require complex control and significant experience for piloting. While these devices continue to improve, there is, as yet, no device that affords six degrees of freedom (6-DoF) control and directional haptic feedback. We present The Cage, a 6-DoF controller for piloting an unmanned aerial vehicle (UAV).
Supervised by Gabriela Csurka and Diane Larlus
3-Month Internship
Whilst not working on my PhD, I enjoy reading, primarily Science Fiction and Fantasy. Below are a few books/series I would recommend: