Key Spatial Principle: Adjustable Immersion
When considering the impact spatial computing will have on interactive experiences, one topic that’s been sparking intense discussion around the UNIT9 office is the untapped potential of headset camera feeds. Unlike smartphones, mixed reality headsets don’t typically give developers access to the live camera feed. If we were able to leverage that feed, it would completely elevate how we design and build mixed reality experiences. Add AI into the mix and we could go even further. Spatial computing is all about offering new ways to layer up the immersion - and providing access to headset camera feeds would do exactly that.
Of course, headsets do use cameras right now, but we’re specifically talking about accessing the live camera feed - and interestingly, no consumer HMDs allow this. Currently, headsets including Meta’s Quest 3 and Apple’s Vision Pro use cameras and sensors to scan the user’s environment and then overlay a 3D mesh. For developers, this mesh data is rather generalized - a basic outline of the space and the shapes within it, but without the context to fully understand it.
It isn’t until AI is involved that we can identify one of those shapes as, say, a black office chair. An AI image recognition layer can provide labels to help us fully understand what objects are in the user’s space. This presents a number of opportunities to more precisely and creatively manipulate or impact the environment around them. For instance, we could alter the properties of that chair to be smooth and metallic instead of black mesh. Or we could transform the chair into the cockpit of a fighter jet. There is a diverse range of use cases across sectors, from training and education to entertainment, productivity and many others. With live camera access and AI, pretty much anything is possible.
Image one shows the information we currently receive from consumer devices: mesh data only, without color. We can see the shape of a room and its contents (geometry and volumes), but we have no further context. For example, we can see an unidentifiable cube in the room.
Image two shows the information we could receive if we had access to the camera feed: mesh data with color. We see geometry and volumes with a texture, so we can better understand the patterns, colors, text or materials of an object - but we still don’t know what the object is. For example, we can see an unidentifiable cube that is black and made of mesh.
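To make those levels of information concrete, here’s a minimal Python sketch of the three tiers of scene understanding - mesh only, mesh with color, and mesh with an AI-provided label. The class names and fields are hypothetical stand-ins for illustration, not the data structures of any real headset SDK:

```python
from dataclasses import dataclass, field

# Hypothetical data structures illustrating the three levels of scene
# understanding discussed above. None of this maps to a real headset SDK.

@dataclass
class MeshObject:
    """Level 1 - what consumer headsets expose today: geometry only."""
    vertices: list          # 3D positions outlining the shape
    bounding_box: tuple     # rough volume of the object

@dataclass
class TexturedObject(MeshObject):
    """Level 2 - what live camera access would add: color and texture."""
    texture_rgb: bytes = b""    # surface color sampled from the camera feed

@dataclass
class LabeledObject(TexturedObject):
    """Level 3 - what an AI recognition layer on top could add: meaning."""
    label: str = "unknown"                              # e.g. "office chair"
    attributes: dict = field(default_factory=dict)      # e.g. {"color": "black", "material": "mesh"}

def describe(obj: MeshObject) -> str:
    """Show how much an experience can say about an object at each level."""
    if isinstance(obj, LabeledObject):
        return f"a {obj.attributes.get('color', '')} {obj.attributes.get('material', '')} {obj.label}".strip()
    if isinstance(obj, TexturedObject):
        return "an object with visible texture, but no identity"
    return "an unidentifiable shape"
```

It’s that third level - geometry plus color plus a label - that live camera access and AI would unlock together.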
Privacy concerns are at the top of the list when it comes to providing camera feed access, especially in terms of how data will be collected, stored and shared with third parties. What’s interesting, however, is that many people have been providing companies with access to their personal camera feeds for a while already through mobile AR - any face filter on Instagram or Snapchat will have required access to your camera feed.
There are three major factors that explain why headsets feel different: firstly, camera access on smartphones was a natural progression, a slow build of trust as the tech’s features developed. Secondly, spatial is a new medium - and people always need time to acclimate to the unknown. Just cast your minds back to the onset of the internet, when it was dismissed as a ‘wasteland of unfiltered data’ - trust takes time. And lastly, headset cameras are always-on by necessity and can capture new data points that some may feel uncomfortable with - such as eye tracking.
But just as we give our mobile devices camera access in order to enjoy richer, more personalized and immersive experiences, the same can ultimately be true of headsets.
Unlocking new features and capabilities through both live camera access and AI would provide us with a far deeper understanding of people, their movements and their environments. This opens up the opportunity for adjustable immersion - providing people with a range of interactions they can lean into to elevate an experience from passive to fully immersive, as well as catering for diverse accessibility needs by putting control in the hands of the user.
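As one way of picturing what adjustable immersion could look like in practice, here’s a small hypothetical sketch - the level names and feature flags are our own illustration, not any manufacturer’s API:

```python
from enum import Enum

# Hypothetical "adjustable immersion" setting: the wearer chooses how much
# of their data an experience may use, and features degrade gracefully.
# The levels and feature names here are illustrative, not a real API.

class ImmersionLevel(Enum):
    PASSIVE = 1        # mesh geometry only, no camera or eye data
    ENHANCED = 2       # adds live camera feed for texture and object labels
    FULL = 3           # adds eye tracking and expression data

FEATURES_BY_LEVEL = {
    ImmersionLevel.PASSIVE:  {"scene_mesh"},
    ImmersionLevel.ENHANCED: {"scene_mesh", "camera_feed", "object_labels"},
    ImmersionLevel.FULL:     {"scene_mesh", "camera_feed", "object_labels",
                              "eye_tracking", "expression_tracking"},
}

def feature_enabled(level: ImmersionLevel, feature: str) -> bool:
    """Experiences check the user's chosen level before touching a data source."""
    return feature in FEATURES_BY_LEVEL[level]

# Example: a concert experience only retextures the sofa if the wearer allows it.
if feature_enabled(ImmersionLevel.ENHANCED, "camera_feed"):
    pass  # retexture furniture using the live feed
```

The key design point is graceful degradation: an experience should still work at the lowest level, and simply get richer as the wearer opts in to more.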
Camera access would allow us to detect and analyze more complex human indicators in order to understand nuanced habits and emotions, helping us to create more relevant interactions within experiences. For example:
Imagine your partner playing a Harry Potter game in mixed reality and your whole living room has turned into Hogwarts. Currently, their world gets disrupted as you walk into the room carrying the dog. If we had live camera access we could instantly transform you into Hagrid and your dog into Hedwig the owl, seamlessly translating your facial expressions over to the new characters without breaking the illusion.
Or camera access could be leveraged to help teach autistic children about emotions. We could create role-playing environments where kids can practice their social skills by communicating with others. The emotions of those in the room could be tracked, labeled and described back to the wearer in an engaging and interactive way that they can understand, all in real time thanks to AI.
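Here’s a rough sketch of the kind of real-time emotion labeling loop that example implies - `capture_frame` and `emotion_model` are hypothetical placeholders for a live headset feed and a recognition model, not existing APIs:

```python
import time

# Sketch of the emotion-coaching loop described above. All hooks are
# hypothetical placeholders, not real headset or model APIs.

EMOTION_DESCRIPTIONS = {
    "happy":    "They are smiling - this usually means they are pleased.",
    "confused": "Their brow is furrowed - they may not have understood you.",
    "sad":      "Their mouth and eyes are turned down - they may be upset.",
}

def emotion_coach_loop(capture_frame, emotion_model, render_caption, fps=10):
    """Track the emotions of people in the room and describe them back
    to the wearer in plain language, in near real time."""
    interval = 1.0 / fps
    while True:
        frame = capture_frame()                    # live camera image (hypothetical hook)
        for face in emotion_model.detect_faces(frame):
            label = emotion_model.classify(face)   # e.g. "happy", "confused"
            text = EMOTION_DESCRIPTIONS.get(label, f"They look {label}.")
            render_caption(face.position, text)    # overlay the explanation in space
        time.sleep(interval)
```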
Accessing more data on body movement via the camera feed would also allow us to track behaviors to design immersive interactions relevant to the wearer and their body position. For example:
Imagine you’re learning how to play golf and trying to perfect your swing. If your trainer watched your performance through a headset, they could receive live data on your pose and movements to give you specific scientific feedback on how to nail your technique.
Camera access would also allow us to track progress and provide customizable treatment plans for people in physical therapy or rehabilitation. It could even help with prevention by detecting biomechanical inefficiencies and optimizing techniques for improved physical performance.
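As an illustration of that pose-based feedback, here’s a minimal sketch that measures a single joint angle from hypothetical keypoint data - a real system would pull these keypoints from a pose-estimation model running over the camera feed:

```python
import math

# Illustrative pose feedback: keypoint names, coordinates and the coaching
# threshold are assumptions, not output from a real pose model.

def joint_angle(a, b, c):
    """Angle at joint b (in degrees) formed by 3D points a-b-c."""
    ab = [a[i] - b[i] for i in range(3)]
    cb = [c[i] - b[i] for i in range(3)]
    dot = sum(x * y for x, y in zip(ab, cb))
    mag = math.dist(a, b) * math.dist(c, b)
    cos_angle = max(-1.0, min(1.0, dot / mag))     # clamp against rounding error
    return math.degrees(math.acos(cos_angle))

def swing_feedback(keypoints, min_elbow_angle=160.0):
    """Flag a bent lead arm during the swing (a straight arm is ~180 degrees)."""
    angle = joint_angle(keypoints["shoulder"], keypoints["elbow"], keypoints["wrist"])
    if angle >= min_elbow_angle:
        return f"Lead arm looks good ({angle:.0f} degrees)."
    return f"Lead arm is bending ({angle:.0f} degrees) - try to keep it straighter."

# Example frame of (x, y, z) keypoints from a hypothetical pose model:
frame = {"shoulder": (0.0, 1.5, 0.0), "elbow": (0.3, 1.1, 0.0), "wrist": (0.5, 0.7, 0.0)}
print(swing_feedback(frame))
```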
Lastly, camera access could help us elevate, manipulate and augment environments in ways we’ve never done before, transporting users into entirely new worlds that are contextually linked to their real one. For example:
Imagine you’re watching Dua Lipa in a surreal virtual concert set on a stage in the sky. We could directly manipulate your space to draw you further into her musical world, turning your light bulbs into stars, and texturing and animating the material on your sofa to match her set design.
Or you could become the proud owner of a virtual pet who can roam freely around the house, avoiding real obstacles, jumping over chairs, and even leaving virtual scratches on your furniture to mark its territory.
When it comes to environmental manipulation, the opportunities are near endless.
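To show the shape such label-driven manipulation could take, here’s a short hypothetical sketch that maps recognized real-world objects to virtual replacements or textures - the swap rules and rendering hooks are purely illustrative:

```python
# Sketch of label-driven environment manipulation. The scene labels, swap
# rules and rendering hooks are illustrative; in a real experience the labels
# would come from an AI layer over the live camera feed.

SWAP_RULES = {
    "light bulb": {"replace_with": "star_particle_effect"},
    "sofa":       {"retexture":    "stage_set_fabric"},
    "chair":      {"replace_with": "fighter_jet_cockpit"},
}

def augment_scene(labeled_objects, spawn, retexture):
    """Apply virtual replacements or textures to recognized real-world objects.
    `labeled_objects` is a list of (label, mesh) pairs from the recognition layer;
    `spawn` and `retexture` are hypothetical rendering hooks."""
    for label, mesh in labeled_objects:
        rule = SWAP_RULES.get(label)
        if rule is None:
            continue                                   # leave unrecognized objects untouched
        if "replace_with" in rule:
            spawn(rule["replace_with"], at=mesh)       # anchor the virtual asset to the real object
        elif "retexture" in rule:
            retexture(mesh, rule["retexture"])         # reskin the real object's mesh
```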
There’s a huge amount to be gained through live camera access and AI integration - in particular, understanding people and their emotions; how they engage and connect with each other; how movement can be better analyzed and integrated into new interaction paradigms; and how environments can be augmented and manipulated to create more compelling experiences. All this extra data would allow us to create tailored spatial experiences that can immerse audiences in a variety of ways. But as always when working with tech, there are a few things we must consider.
Internet speed presents a key hurdle. It will take a great deal of bandwidth and compute to send this visual data to the cloud for processing - and the low latency required to guarantee a seamless experience is a huge factor to consider. Eventually, we could have localized models that allow the processing to happen in-headset, but at the moment this is just a pipe dream.
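To put rough numbers on that hurdle, here’s a quick back-of-envelope calculation - every figure is an assumption for illustration, not a measurement of any device or network:

```python
# Back-of-envelope numbers for streaming a headset camera feed to the cloud.
# All figures below are assumptions for illustration, not measurements.

width, height, fps = 1920, 1080, 30          # assumed single passthrough stream
bits_per_pixel_compressed = 0.1              # rough modern video-codec compression
uplink_mbps = (width * height * fps * bits_per_pixel_compressed) / 1e6
print(f"Uplink needed: ~{uplink_mbps:.0f} Mbit/s per camera stream")   # ~6 Mbit/s

# Latency budget: comfortable interaction is often quoted at well under 100 ms
# end to end, and some of that is already spent on capture, encode and render.
network_round_trip_ms = 40                   # optimistic cloud round trip
inference_ms = 30                            # assumed server-side model time
capture_encode_render_ms = 30                # assumed on-device overhead
total_ms = network_round_trip_ms + inference_ms + capture_encode_render_ms
print(f"End-to-end: ~{total_ms} ms - already at the edge of a comfortable budget")
```

Even with optimistic assumptions, most of a comfortable interaction budget is spent before any creative work happens - which is why in-headset processing is the longer-term goal.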
It’s also vital we don’t sleepwalk into the spatial era without proper consideration of data and privacy. If we want full camera access on consumer-facing devices, we’ll also need clear privacy policies from the manufacturers to overcome users’ concerns and put the power to adjust the level of immersion in their hands.
The spatial experiences we’re already building are incredibly exciting, but they’d reach another level if we could leverage the live camera feed and AI together. We love that other members of the XR community are beginning to discuss this point too - we came across this awesome article from Antony Vitillo as we were wrapping ours up, and love the proposed solutions it offers to bridge the gap until full access to consumer-facing headsets becomes available. Meta’s recent research project, SceneScript, also hints that machine learning tools for understanding room geometry may be coming. It’s only a matter of time before AI-integrated full camera access opens up - and when it does, we’ll be ready.