Michael Wray

Post-Doctoral Researcher . University of Bristol . michael (dot) wray (at) bristol (dot) ac (dot) uk

I am a Post-Doctoral Researcher at the University of Bristol. My research focuses on action recognition and understanding its links with language.


Release of EPIC-Kitchens-100

The extension of EPIC-Kitchens has now been released, taking the dataset up to 100 hours of video and 90,000 action segments. More information can be found here.

July 2020

Journal Version of EPIC-Kitchens Accepted

The EPIC-KITCHENS Dataset: Collection, Challenges and Baselines has been accepted to IEEE Transactions on Pattern Analysis and Machine Intelligence.

April 2020

Successful Thesis Defence - Minor Corrections

I successfully defended my thesis "Verbs and Me: An investigation into Verbs as Labels for Action Recognition in Video Understanding". Thanks to my reviewers Frank Keller and Jan van Gemert for their time spent reading the thesis and travelling to Bristol.

November 2019

Paper Accepted at ICCV

Fine-Grained Action Retrieval through Multiple Parts-of-Speech Embeddings

Our paper titled "Fine-Grained Action Retrieval through Multiple Parts-of-Speech Embeddings" was accepted as a poster presentation at ICCV 2019 (27th Oct-2nd Nov 2019). More info here.

July 2019

Paper Accepted at BMVC

Learning Visual Actions Using Multiple Verb-Only Labels

Our paper titled "Learning Visual Actions Using Multiple Verb-Only Labels" was accepted as a poster presentation at BMVC 2019 (9th-12th September 2019). More info here.

July 2019

Presentation at BMVA Symposium

Presentation of Towards an Unequivocal Representation of Actions

I presented a talk on Towards an Unequivocal Representation of Actions at BMVA Symposium: Robotics meets Semantics: Enabling Human-Level Understanding in Robots on 18th July. Slides. Video Recording.

June 2018

EPIC Kitchens Demo at CVPR 2018

Demo - Wednesday AM Booth 7

Along with two other authors, I demoed EPIC at CVPR 2018.

June 2018

Poster at BIVU2018

Towards an Unequivocal Representation of Actions

I presented a poster of Towards an Unequivocal Representation of Actions at the Brave New Ideas for Video Understanding workshop at CVPR2018.

May 2018

EPIC Kitchens

Largest Egocentric Dataset

We have just released the largest egocentric dataset for action and object recognition. More info can be found here.

April 2018

New Paper on ArXiv

Towards an Unequivocal Representation of Actions

We have released a short-form version of Towards an Unequivocal Representation of Actions on ArXiv here.

April 2018


Supervision Level Scales

ArXiv 2020.

We propose a three-dimensional discrete and incremental scale to encode a method's level of supervision - i.e. the data and labels used when training a model to achieve a given performance. We capture three aspects of supervision that are known to give methods an advantage while requiring additional costs: pre-training, training labels and training data. The proposed three-dimensional scale can be included in result tables or leaderboards to handily compare methods not only by their performance, but also by the level of data supervision utilised by each method.

August 2020
Project Page

Rescaling Egocentric Vision

ArXiv 2020.

This paper introduces EPIC-KITCHENS-100, the largest annotated egocentric dataset - 100 hrs, 20M frames, 90K actions - of wearable videos capturing long-term unscripted activities in 45 environments. This extends our previous dataset (EPIC-KITCHENS-55), released in 2018, resulting in more action segments (+128%), environments (+41%) and hours (+84%), using a novel annotation pipeline that allows denser and more complete annotations of fine-grained actions (54% more actions per minute).

July 2020
Project Page

Fine-Grained Action Retrieval through Multiple Parts-of-Speech Embeddings

ICCV 2019.
ArXiv / PDF

We address the problem of cross-modal fine-grained action retrieval between text and video. Cross-modal retrieval is commonly achieved through learning a shared embedding space that can indifferently embed modalities. In this paper, we propose to enrich the embedding by disentangling parts-of-speech (PoS) in the accompanying captions. We build a separate multi-modal embedding space for each PoS tag. The outputs of multiple PoS embeddings are then used as input to an integrated multi-modal space, where we perform action retrieval.

July 2019
Project Page

Learning Visual Actions Using Multiple Verb-Only Labels

BMVC 2019.
ArXiv / PDF

This work introduces verb-only representations for both recognition and retrieval of visual actions in video. Current methods neglect legitimate semantic ambiguities between verbs, instead choosing unambiguous subsets of verbs along with objects to disambiguate the actions. We instead propose multiple verb-only labels, which we learn through hard or soft assignment as a regression.

July 2019
Project Page

Towards an Unequivocal Representation of Actions

This work introduces verb-only representations for actions and interactions; the problem of describing similar motions (e.g. 'open door', 'open cupboard') and distinguishing differing ones (e.g. 'open door' vs 'open bottle') using verb-only labels. Current approaches for action recognition neglect legitimate semantic ambiguities and class overlaps between verbs, relying on the objects to disambiguate interactions.

May 2018

Scaling Egocentric Vision: The EPIC-KITCHENS Dataset

ECCV 2018.
ArXiv / PDF

First-person vision is gaining interest as it offers a unique viewpoint on people's interaction with objects, their attention, and even intention. However, progress in this challenging domain has been relatively slow due to the lack of sufficiently large datasets. In this paper, we introduce EPIC-KITCHENS, a large-scale egocentric video benchmark recorded by 32 participants in their native kitchen environments. Our videos depict nonscripted daily activities: we simply asked each participant to start recording every time they entered their kitchen.

April 2018

Trespassing the Boundaries: Labeling Temporal Bounds for Object Interactions in Egocentric Video

ArXiv / PDF

Manual annotations of temporal bounds for object interactions (i.e. start and end times) are typical training input to recognition, localization and detection algorithms. For three publicly available egocentric datasets, we uncover inconsistencies in ground truth temporal bounds within and across annotators and datasets. We systematically assess the robustness of state-of-the-art approaches to changes in labeled temporal bounds, for object interaction recognition.

October 2017

SEMBED: Semantic Embedding of Egocentric Action Videos

ECCVW 2016.
ArXiv / PDF

We present SEMBED, an approach for embedding an egocentric object interaction video in a semantic-visual graph to estimate the probability distribution over its potential semantic labels.

October 2016

The Cage: Towards a 6-DoF Remote Control with Force Feedback for UAV Interaction.

Extended Abstract CHI 2015

Unmanned Aerial Vehicles (UAVs) require complex control and significant experience for piloting. While these devices continue to improve, there is, as yet, no device that affords six degrees of freedom (6-DoF) control and directional haptic feedback. We present The Cage, a 6-DoF controller for piloting an unmanned aerial vehicle (UAV).

April 2015

Education and Experience

University of Bristol

PhD in Computer Vision
September 2015 - November 2019

Naver Labs Europe

Research Internship

Supervised by Gabriela Csurka and Diane Larlus

Autumn 2017


Router Testing/Development

3 Month Internship

June 2014 - August 2014

University of Bristol

Master of Engineering
Computer Science

First Class

September 2011 - May 2015


Whilst not working on completing my PhD, I enjoy reading - primarily Science Fiction and Fantasy. Below are a few books/series I would recommend:

  • Wheel of Time - Robert Jordan/Brandon Sanderson
  • Malazan Book of the Fallen - Steven Erikson
  • Terra Ignota - Ada Palmer
  • Rendezvous with Rama - Arthur C. Clarke