Pierre Marza | Research

2024

PhD thesis
Learning spatial representations for single-task navigation and multi-task policies

Pierre Marza

2024

Autonomously behaving in the 3D world requires a large set of skills, among which are perceiving the surrounding environment, representing it precisely and efficiently enough to keep track of the past, making decisions and acting to achieve specified goals. Animals, humans for instance, stand out by their robustness when it comes to acting in the world. In particular, they can efficiently generalize to new environments, but are also able to rapidly master many tasks of interest from a few examples. This manuscript will study how artificial neural networks can be trained to acquire a subset of these abilities. We will first focus on training neural agents to perform semantic mapping, both from an augmented supervision signal and with proposed neural-based scene representations. Neural agents are often trained with Reinforcement Learning (RL) from a sparse reward signal. Guiding the learning of scene mapping abilities by augmenting the vanilla RL supervision signal with auxiliary spatial reasoning tasks will help agents navigate efficiently. Instead of modifying the training signal of neural agents, we will also see how incorporating specific neural-based representations of semantics and geometry within the architecture of the agent can help improve performance in goal-driven navigation. Then, we will study how to best explore a 3D environment in order to build neural representations of space that are as satisfying as possible based on robotics-oriented metrics we will propose. Finally, we will move from navigation-only to multi-task agents, and see how important it is to tailor visual features from sensor observations to the task at hand to perform a wide variety of tasks, but also to adapt to new unknown tasks from a few demonstrations. This manuscript will thus address different important questions such as: How to represent a 3D scene and keep track of previous experience in an environment? How to robustly adapt to new environments, scenarios, and potentially new tasks? How to train agents on long-horizon sequential tasks? How to jointly master all required sub-skills? What is the importance of perception in robotics?
@phdthesis{marza2024learning,
  bibtex_show = {true},
  abbr = {PhD thesis},
  title = {Learning spatial representations for single-task navigation and multi-task policies},
  author = {Marza, Pierre},
  school = {INSA Lyon},
  year = {2024},
  month = nov,
  pdf = {https://theses.hal.science/tel-04846767/},
  selected = {true}
}
CVPR
Task-conditioned adaptation of visual features in multi-task policy learning

Pierre Marza, Laetitia Matignon, Olivier Simonin, Christian Wolf

Computer Vision and Pattern Recognition (CVPR) 2024

Successfully addressing a wide variety of tasks is a core ability of autonomous agents, which requires flexibly adapting the underlying decision-making strategies and, as we argue in this work, also adapting the underlying perception modules. An analogous case is the human visual system, which uses top-down signals to focus attention depending on the current task. Similarly, in this work, we adapt pre-trained large vision models conditioned on specific downstream tasks in the context of multi-task policy learning. We introduce task-conditioned adapters that do not require finetuning any pre-trained weights, combined with a single policy trained with behavior cloning and capable of addressing multiple tasks. We condition the policy and visual adapters on task embeddings, which can be selected at inference if the task is known, or alternatively inferred from a set of example demonstrations. To this end, we propose a new optimization-based estimator. We evaluate the method on a wide variety of tasks from the CortexBench benchmark and show that, in contrast to existing work, the benchmark can be addressed with a single policy. In particular, we demonstrate that adapting visual features is a key design choice and that the method generalizes to unseen tasks given visual demonstrations.
@article{marza2024task,
  bibtex_show = {true},
  abbr = {CVPR},
  title = {Task-conditioned adaptation of visual features in multi-task policy learning},
  author = {Marza, Pierre and Matignon, Laetitia and Simonin, Olivier and Wolf, Christian},
  journal = {Computer Vision and Pattern Recognition (CVPR)},
  year = {2024},
  pdf = {https://arxiv.org/abs/2402.07739},
  html = {https://pierremarza.github.io/projects/task_conditioned_adaptation/},
  code = {https://github.com/PierreMarza/task_conditioned_adaptation},
  selected = {true}
}
IROS
AutoNeRF: Training Implicit Scene Representations with Autonomous Agents

Pierre Marza, Laetitia Matignon, Olivier Simonin, Dhruv Batra, Christian Wolf, Devendra Singh Chaplot

International Conference on Intelligent Robots and Systems (IROS) 2024

Implicit representations such as Neural Radiance Fields (NeRF) have been shown to be very effective at novel view synthesis. However, these models typically require manual and careful human data collection for training. In this paper, we present AutoNeRF, a method to collect data required to train NeRFs using autonomous embodied agents. Our method allows an agent to explore an unseen environment efficiently and use the experience to build an implicit map representation autonomously. We compare the impact of different exploration strategies including handcrafted frontier-based exploration and modular approaches composed of trained high-level planners and classical low-level path followers. We train these models with different reward functions tailored to this problem and evaluate the quality of the learned representations on four different downstream tasks: classical viewpoint rendering, map reconstruction, planning, and pose refinement. Empirical results show that NeRFs can be trained on actively collected data using just a single episode of experience in an unseen environment, and can be used for several downstream robotic tasks, and that modular trained exploration models significantly outperform the classical baselines.
@article{marza2023autonerf,
  bibtex_show = {true},
  abbr = {IROS},
  title = {AutoNeRF: Training Implicit Scene Representations with Autonomous Agents},
  author = {Marza, Pierre and Matignon, Laetitia and Simonin, Olivier and Batra, Dhruv and Wolf, Christian and Chaplot, Devendra Singh},
  journal = {International Conference on Intelligent Robots and Systems (IROS)},
  year = {2024},
  pdf = {https://arxiv.org/abs/2304.11241},
  html = {https://pierremarza.github.io/projects/autonerf/},
  code = {https://github.com/PierreMarza/autonerf},
  selected = {true}
}

2023

ICCV
Multi-Object Navigation with dynamically learned neural implicit representations

Pierre Marza, Laetitia Matignon, Olivier Simonin, Christian Wolf

International Conference on Computer Vision (ICCV) 2023

Understanding and mapping a new environment are core abilities of any autonomously navigating agent. While classical robotics usually estimates maps in a stand-alone manner with SLAM variants, which maintain a topological or metric representation, end-to-end learning of navigation keeps some form of memory in a neural network. Networks are typically imbued with inductive biases, which can range from vectorial representations to bird's-eye metric tensors or topological structures. In this work, we propose to structure neural networks with two neural implicit representations, which are learned dynamically during each episode and map the content of the scene: (i) the Semantic Finder predicts the position of a previously seen queried object; (ii) the Occupancy and Exploration Implicit Representation encapsulates information about explored area and obstacles, and is queried with a novel global read mechanism which directly maps from function space to a usable embedding space. Both representations are leveraged by an agent trained with Reinforcement Learning (RL) and learned online during each episode. We evaluate the agent on Multi-Object Navigation and show the high impact of using neural implicit representations as a memory source.
@article{marza2023dynamic_impl_repr,
  bibtex_show = {true},
  abbr = {ICCV},
  title = {Multi-Object Navigation with dynamically learned neural implicit representations},
  author = {Marza, Pierre and Matignon, Laetitia and Simonin, Olivier and Wolf, Christian},
  journal = {International Conference on Computer Vision (ICCV)},
  year = {2023},
  html = {https://pierremarza.github.io/projects/dynamic_implicit_representations/},
  pdf = {https://arxiv.org/abs/2210.05129},
  code = {https://github.com/PierreMarza/dynamic_implicit_representations},
  selected = {true}
}

2022

IROS
Teaching Agents how to Map: Spatial Reasoning for Multi-Object Navigation (Winning entry of the MultiON Challenge at CVPR 2021)

Pierre Marza, Laëtitia Matignon, Olivier Simonin, Christian Wolf

International Conference on Intelligent Robots and Systems (IROS) 2022

In the context of visual navigation, the capacity to map a novel environment is necessary for an agent to exploit its observation history in the considered place and efficiently reach known goals. This ability can be associated with spatial reasoning, where an agent is able to perceive spatial relationships and regularities, and discover object affordances. In classical Reinforcement Learning (RL) setups, this capacity is learned from reward alone. We introduce supplementary supervision in the form of auxiliary tasks designed to favor the emergence of spatial perception capabilities in agents trained for a goal-reaching downstream objective. We show that learning to estimate metrics quantifying the spatial relationships between an agent at a given location and a goal to reach has a high positive impact in Multi-Object Navigation settings. Our method significantly improves the performance of different baseline agents, which build either an explicit or an implicit representation of the environment, even matching the performance of incomparable oracle agents taking ground-truth maps as input.
@article{marza2022teaching,
  bibtex_show = {true},
  abbr = {IROS},
  author = {Marza, Pierre and Matignon, La{\"{e}}titia and Simonin, Olivier and Wolf, Christian},
  title = {Teaching Agents how to Map: Spatial Reasoning for Multi-Object Navigation (Winning entry of the MultiON Challenge at CVPR 2021)},
  journal = {International Conference on Intelligent Robots and Systems (IROS)},
  year = {2022},
  html = {https://pierremarza.github.io/projects/teaching_agents_how_to_map/},
  pdf = {https://arxiv.org/abs/2107.06011},
  code = {https://github.com/PierreMarza/teaching_agents_how_to_map},
  selected = {true}
}
An experimental study of the vision-bottleneck in VQA

Pierre Marza, Corentin Kervadec, Grigory Antipov, Moez Baccouche, Christian Wolf

arXiv 2022

As in many tasks combining vision and language, both modalities play a crucial role in Visual Question Answering (VQA). To properly solve the task, a given model should understand both the content of the proposed image and the nature of the question. While the fusion between modalities, which is another obviously important part of the problem, has been highly studied, the vision part has received less attention in recent work. Current state-of-the-art methods for VQA mainly rely on off-the-shelf object detectors delivering a set of object bounding boxes and embeddings, which are then combined with question word embeddings through a reasoning module. In this paper, we propose an in-depth study of the vision-bottleneck in VQA, experimenting with both the quantity and quality of visual objects extracted from images. We also study the impact of two methods to incorporate the information about objects necessary for answering a question: directly in the reasoning module, and earlier in the object selection stage. This work highlights the importance of vision in the context of VQA, and the value of tailoring vision methods used in VQA to the task at hand.
@article{marza2022vision_vqa,
  bibtex_show = {true},
  author = {Marza, Pierre and Kervadec, Corentin and Antipov, Grigory and Baccouche, Moez and Wolf, Christian},
  title = {An experimental study of the vision-bottleneck in VQA},
  journal = {arXiv},
  year = {2022},
  pdf = {https://arxiv.org/abs/2202.06858}
}

2021

Patent

A device and method for image processing

Sean Moran, Pierre Marza, Steven McDonagh, Sarah Parisot, Gregory Slabaugh

2021

2020

CVPR
DeepLPF: Deep Local Parametric Filters for Image Enhancement

Sean Moran, Pierre Marza, Steven McDonagh, Sarah Parisot, Gregory Slabaugh

Computer Vision and Pattern Recognition (CVPR) 2020

Digital artists often improve the aesthetic quality of digital photographs through manual retouching. Beyond global adjustments, professional image editing programs provide local adjustment tools operating on specific parts of an image. Options include parametric (graduated, radial filters) and unconstrained brush tools. These highly expressive tools enable a diverse set of local image enhancements. However, their use can be time consuming, and requires artistic capability. State-of-the-art automated image enhancement approaches typically focus on learning pixel-level or global enhancements. The former can be noisy and lack interpretability, while the latter can fail to capture fine-grained adjustments. In this paper, we introduce a novel approach to automatically enhance images using learned spatially local filters of three different types (Elliptical Filter, Graduated Filter, Polynomial Filter). We introduce a deep neural network, dubbed Deep Local Parametric Filters (DeepLPF), which regresses the parameters of these spatially localized filters that are then automatically applied to enhance the image. DeepLPF provides a natural form of model regularization and enables interpretable, intuitive adjustments that lead to visually pleasing results. We report on multiple benchmarks and show that DeepLPF produces state-of-the-art performance on two variants of the MIT-Adobe 5k dataset, often using a fraction of the parameters required for competing methods.
@inproceedings{moran2020deeplpf,
  bibtex_show = {true},
  abbr = {CVPR},
  author = {Moran, Sean and Marza, Pierre and McDonagh, Steven and Parisot, Sarah and Slabaugh, Gregory},
  title = {DeepLPF: Deep Local Parametric Filters for Image Enhancement},
  booktitle = {Computer Vision and Pattern Recognition (CVPR)},
  year = {2020},
  pdf = {https://openaccess.thecvf.com/content_CVPR_2020/html/Moran_DeepLPF_Deep_Local_Parametric_Filters_for_Image_Enhancement_CVPR_2020_paper.html},
  code = {https://github.com/sjmoran/DeepLPF},
  selected = {true}
}