Rethinking User Interactions in Augmented Reality

The future is immersive, says almost every Augmented and Virtual Reality think piece. The possibilities offered by new degrees of freedom, life-like interactions, and immersive play are genuinely hard to ignore.

Our AR Storytelling experiment, using Body Pose Tracking

While the potential holds true, reality is still far from it. For all the talk about immersion, there is no singular “life-like immersion” we can currently achieve with AR. The hardware and software platforms supporting AR and VR all come with their own constraints, so we experience immersion in varying degrees. Here we will focus on one such platform, Mobile AR, the challenges it faces with immersion, and some solutions that worked for us.

The truth about immersion in Mobile AR

Let’s start with the challenges first, in the context of immersion:

  • Hands get tired.
    Holding a mobile phone in your hands, you can move around virtual objects and explore them as part of your real world (+1 for immersion). But it also makes your hands tired, which constrains how long an augmented experience can last (-0.5 for immersion).
  • Compounding imprecisions.
    Input on touch screens is inherently less precise than a mouse and keyboard, owing to the shakiness of hands and the lack of feedback about where a tap will land. In Mobile AR, that imprecision is compounded by the fact that you’re trying to tap a precise point in 3D space while also pointing the screen at it.
  • No perception of depth at a glance.
    We humans (along with much of the animal kingdom) have two eyes because two eyes separated by a small distance give us depth perception; a single eye is far less accurate at judging depth at a glance. Mobile AR renders a single image of the 3D world around us on a 2D screen, completely undoing the advantage of looking at the same distant point with two eyes. This is what makes it so hard to judge the real distance of a virtual object in Mobile AR.
It’s hard to tell how far the red balls are from you at a glance
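To make the two-eyes advantage concrete: with two viewpoints a small baseline apart (like our eyes, or a rectified stereo camera pair), depth falls out of simple triangulation, which is exactly what a single camera image cannot do. Here is a minimal sketch in Python; the baseline, focal length, and disparity values are illustrative assumptions, not measurements:

```python
def depth_from_disparity(baseline_m, focal_px, disparity_px):
    """Triangulated depth for a rectified stereo pair (pinhole model).

    baseline_m:   distance between the two viewpoints, in meters
    focal_px:     focal length, in pixels
    disparity_px: horizontal shift of the same point between the
                  two views, in pixels
    """
    if disparity_px <= 0:
        raise ValueError("zero disparity means the point is at infinity")
    return baseline_m * focal_px / disparity_px

# With a ~6.5 cm interocular baseline and a 1000 px focal length,
# a 13 px disparity puts the point about 5 meters away.
print(depth_from_disparity(0.065, 1000, 13))
```

With one camera, there is no disparity to measure, so the distance term simply cannot be recovered from a single frame; that is the gap Mobile AR users feel when eyeballing those red balls.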

Immersive learning experiences are the core of what we do, and nothing breaks immersion like broken interactions, so we had to go back to the drawing board and look for ideas that blend in with the medium.

The interaction ideas that worked for us, or did not

Tapping and pinching virtual objects on a 2D screen

Tapping the phone screen to interact with on-screen objects, and swiping them around as if they were real things, was a brilliant feat of immersion in its own right for mobile devices. But these gestures don’t translate equally well to AR apps and games. Tapping is imprecise, hands are shaky and easily tired, and the imprecisions compound. Mobile AR, we realized, needs its own native interaction models that provide the extra precision and control needed to counter its inherent imprecisions.

So, we started rethinking, and ran a series of experiments to build on the core strengths of this medium. Here’s what we learned along the way:

a. Phone position and perspective as Control

Mobile AR earns its primary AR credentials from the fact that it lets you move around a space with virtual objects overlaid on the real world. That allows “perspective” to serve as an input for interactive experiences that respond to your location in 3D space, and it inherently encourages movement and exploring the game world from different angles.

Even though perspective is your control, the field of view is controlled by the phone’s camera, and is quite restricted. To change your perspective, it’s not just your eyes or face that move; the phone and its camera have to come along too. It’s almost like having your eyes 2 feet ahead of you and holding them to keep pointing in the right direction as you walk around.

A glimpse of our physics game that teaches Energy & Motion

Depth is another tricky part. The farther you are from a virtual object, the smaller it appears, making it harder to interact with. And the closer you are to a virtual object, the harder it is to look around it, which constrains your actions. It’s a constant game of pulling away to see the bigger picture, and physically zooming in to interact with a specific thing. You can’t simply look at something from a distance and stretch your arm to interact with it, like in the real world.
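That shrinking-with-distance effect is just pinhole projection: on-screen size scales inversely with distance, so a comfortable tap target up close becomes a fiddly one a few steps back. A quick illustration (the focal length and ball size below are made-up values):

```python
def projected_size_px(object_size_m, distance_m, focal_px):
    """Approximate on-screen size of an object under a pinhole
    camera model: size shrinks linearly as distance grows."""
    return focal_px * object_size_m / distance_m

# A 30 cm virtual ball, with an assumed 1500 px focal length:
print(projected_size_px(0.3, 1.0, 1500))  # roughly 450 px at 1 m
print(projected_size_px(0.3, 5.0, 1500))  # roughly 90 px at 5 m
```

Five times the distance means one fifth the tap target, on top of the shaky-hands imprecision described earlier.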

b. Real-world objects as Control

With on-device Image & Object Tracking, you can use any trackable object in the scene as a Control! Breaking the 4th wall this way, using a real-world object to affect the virtual world right around it, can feel truly magical. In fact, my husband and I use it to enact AR storytelling experiences for our 3yr old daughter, so I may be personally biased about this one ;)

Here’s a glimpse of that: a lion hand puppet roaring the morning sun to rise, on the palm of its majestic hand.

Our AR Hand Puppets are a big hit with our 3yr old :)

AR storytelling has been a lot of fun, but it’s not all rosy behind the scenes. For one, object tracking only stays alive and kicking while the object is constantly in view. And don’t forget you are physically holding the object throughout the experience, so there’s barely an arm’s length between View (screen) and Control (object) while you try to look at the screen and keep it pointed at the object. Regardless, with an extra pair of hands and a projected screen, we made some memorable experiences with our favorite hand puppets. Just ask our daughter!

c. Your face as Control

We’ve all known about Snapchat face filters for a while now, hate-loved vomiting rainbows while sticking our collective tongues out. While they are a great example of using facial features as Input Controls for an interactive AR experience in selfie mode, you can do one better now.

With ARKit 3 on iOS, you can actually use facial gestures to provide input while looking at the augmented world around you, instead of at yourself. This means you can shoot virtual zombies around you in AR, not by tapping the phone screen, but just by blinking at the screen you’re already looking at! Or imagine cleaning up a virtual after-party mess in AR by sucking it all up with your open mouth as the vacuum cleaner.
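As a sketch of how a blink becomes a discrete input: ARKit exposes per-frame blend-shape coefficients between 0.0 (neutral) and 1.0 (fully expressed) for expressions like eye blinks, and you can threshold them with a bit of hysteresis so one blink fires exactly one action. The thresholds below are illustrative choices, not ARKit defaults:

```python
class BlinkTrigger:
    """Turn a per-frame eye-closure coefficient (0.0 = open,
    1.0 = closed) into discrete fire events. Hysteresis between
    the two thresholds avoids re-firing while the eye stays shut."""

    def __init__(self, close_threshold=0.8, open_threshold=0.3):
        self.close_threshold = close_threshold
        self.open_threshold = open_threshold
        self._closed = False

    def update(self, coefficient):
        if not self._closed and coefficient >= self.close_threshold:
            self._closed = True
            return True   # fire once, on the closing edge
        if self._closed and coefficient <= self.open_threshold:
            self._closed = False
        return False

trigger = BlinkTrigger()
frames = [0.1, 0.5, 0.9, 0.95, 0.2, 0.85]
print([trigger.update(c) for c in frames])
# [False, False, True, False, False, True]  -> two blinks, two shots
```

The same pattern works for any other coefficient, like jaw-open for that vacuum-cleaner mouth.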

Facial gestures separate the View (screen) from the Control (face) in the cleanest way yet, and they remain surprisingly under-explored in AR games. Granted, facial expressions are hard to hold for long durations, but this added degree of freedom can still lead to amazing experiences.

d. Your full body as Control

Body pose tracking can be mind-blowing. It’s like stepping into the game, quite literally, and using your full body to express yourself.

With body pose available at 60 frames per second, the opportunities for mind-bending AR experiences are boundless. Imagine running away from a rainy cloud that seems adamant about following you, playing a real-life Gulliver fighting virtual Lilliputians, doing yoga with immediate body-pose feedback, or even a live dance duel with your friends. The experience can be extremely visceral, and heart-thumping.

Here’s a little thing we made that lets you play drums with your full body in AR without any instruments, called Tambour:

iOS ARKit can only do body pose tracking with the back camera at the moment, so you can’t see the screen while interacting with the experience, but this can be overcome with a projected screen or a purely auditory interface (à la Tambour). Also, BodyPix and PoseNet already let you track body pose with a regular webcam, so you can see the screen while interacting with objects in AR.

And that’s a wrap!

From indirect touch-screen controls to inhabiting an augmented world with your face and full body, the technical capabilities for user interactions are hitting critical mass, enabling the wildest Mobile AR experiences with sensible trade-offs. It’s not “life-like” yet, but it’s getting there fast.