VisionOS Dev: Taylor's Opinionated and Already-Outdated Guide (2024)
There is no substitute for project-based learning! Briefly skim this guide, and go try to build something with it. Come back whenever you find a lapse in understanding.
- Scaffolding Your Project: #Preview, AppState, MyApp, ContentView, ImmersiveView, Configuration
- Display Containers: Scenes, Windows, Volumes, 3D Immersive Space
- Displaying Things: SwiftUI Views, RealityView
- Entities: Hierarchies, Enabling/Disabling, Loading Entities, Transforms, Physics, Anchors
- Sensing Stuff
- Gestures: InputTargetComponent, Input Types, Gesture Types, Gesture Hand Poses, Gesture Ambiguity, Hand Shapes
- On Your Own: Starter Projects, Videos
Apple recently announced VisionOS 2.0. Some of the updates are not covered in this guide.
All writings on taylor.town (including this guide) are available in
plaintext, i.e. taylor.town/visionos.txt. To download for
offline access, try
curl https://taylor.town/visionos.txt > taylor-town-visionos.md
This guide is still under construction! Please email me at [email protected] if you have any questions or suggestions.
Scaffolding Your Project
Here is what a typical VisionOS app looks like:
- MyApp: main entry point to call content and immersive views
- AppState: stores/manages all the state for the app
- ContentView: 2D window(s) and 3D volumes
- ImmersiveView: 3D whatever
@Observable
@MainActor
public class AppState {...}
@main
@MainActor
struct MudPieApp: App {...}
struct ContentView: View {...}
#Preview(windowStyle: .automatic) {...}
struct ImmersiveView: View {...}
#Preview(immersionStyle: .mixed) {...}
#Preview
Use #Preview macros to create default simulator views in Xcode.
#Preview( "Content", traits: .fixedLayout(width: 100, height: 100)) {
ContentView()
}
#Preview("Volume", windowStyle: .volumetric) {
ContentView()
}
#Preview("Mixed immersive space", immersionStyle: .mixed) {
ContentView()
}
#Preview("Mixed immersive space", immersionStyle: .mixed) {
ContentView()
} cameras: {
PreviewCamera(from: .front)
PreviewCamera(from: .top, zoom: 2)
PreviewCamera(from: .leading, zoom: 0.5, name: "close up")
}
AppState
@Observable
@MainActor
public class AppState {
var session = ARKitSession()
var worldTracking = WorldTrackingProvider()
var handTracking = HandTrackingProvider()
var myEntity = Entity()
// ...more members
init() {
Task {/* request tracking privs */}
Task {/* load entities */}
Task {/* etc */}
}
}
Further reading: Task, TaskGroup
This is a singleton class. If you keep your views pure/memoryless, it becomes the sole place where all state is stored across your app. Updates to the instance propagate to all your views via Swift framework magic.
Most frontend development involves complex state management. I highly recommend adopting Elm's architecture paradigm into your mental toolkit.
But this mental model will only take you so far. As you'll find out later, entities force you to break the classic MVC paradigm by storing "view" details in your model. The shape/position/texture/etc. of your entities are mutable, and references to these entities must be persisted somewhere.
If you're a web-dev like me, working with entity hierarchies feels like direct DOM manipulation.
Don't be afraid! It's okay. Just exercise caution with your "sources of truth". For example, an entity's current color is a "true" value, not some copied value in a corresponding color picker.
Other bits of unsolicited advice:
- Avoid adding methods to your AppState. You'll find that the vast majority of entity logic is only ever used in one place.
- Favor "wide" over "deep", i.e. don't be afraid to pile countless members in your AppState; the only alternative is a deeply-nested class hellscape.
MyApp
@main
@MainActor
struct MyApp: App {
@State private var appState = AppState()
@State private var immersionStyle: ImmersionStyle = .automatic
var body: some SwiftUI.Scene {
WindowGroup { ContentView().environment(appState) }
ImmersiveSpace(id: "MY_SPACE") { ImmersiveView().environment(appState) }
.immersionStyle(selection: $immersionStyle, in: .automatic)
}
}
There's a lot to unpack here! For now, just note the general structure:
- this struct is your app's main entry point
- it stores appState, which syncs to ContentView and ImmersiveView
- these views can (and will) mutate the appState object, and that's okay
- the app will typically default to opening the WindowGroup and ignoring the ImmersiveSpace (more on this later)
ContentView
struct ContentView: View {
@Environment(AppState.self) private var appState
var body: some View {
// windows, volumes, etc.
}
}
#Preview(windowStyle: .automatic) {
ContentView().environment(AppState())
}
Put all your bounded, position-agnostic views here. The initial placement of windows/volumes can be suggested, but the size and position of the content is ultimately determined by VisionOS and its user.
ImmersiveView
struct ImmersiveView: View {
@Environment(AppState.self) var appState
var body: some View {
RealityView { content, attachments in
// add attachments and MyApp entities to `content`
} update: { content, attachments in
// runs when ImmersiveView properties change
} attachments: {
Attachment(id: "MY_ATTACH) {/* SwiftUI window/volume that will be converted into an entity */}
}
.gesture(/* listen for gesture targeted at entity */)
// ...more gestures
}
}
#Preview(immersionStyle: .mixed) {
ImmersiveView().environment(AppState())
}
Whoa! Lots of stuff going on here. Much of this guide will be exploring the details here, so don't overwhelm yourself.
Here's the gist:
- RealityView gives you a place to declare attachments, which are manipulatable SwiftUI windows/volumes.
- Add your attachments and AppState entities to content in the first closure.
- Use update to handle subsequent changes to the ImmersiveView. Keep in mind that you probably don't want to use this; you probably want to manage systems and other async stuff in your AppState.
- Prepare yourself for lots of gestures.
Configuration
Here's a bunch of Info.plist junk you may need:
- multiple windows
- initial scene style
- initial immersion style
- hand-tracking permission
- world-sensing permission
- motion-tracking permission
- game controller required
- supported game controllers
- background UI config
Display Containers
Scene
- ContentView: 2D/3D windows that are sized/positioned by VisionOS and the user
- ImmersiveView: arbitrarily placed 2D/3D content, with some restrictions
3D windows are called "volumes". They behave just like windows, but with a few quirks and restrictions.
2D windows can have limited 3D depth, e.g. hovering elements. You can also display 3D content/models floating inside/outside of 2D windows.
You can use SwiftUI views (e.g. buttons, sliders) inside of 3D immersive spaces as "attachments".
Scenes
@main
@MainActor
struct MyApp: App {
@State private var appState = AppState()
@State private var immersionStyle: ImmersionStyle = .automatic
var body: some SwiftUI.Scene {
WindowGroup { ContentView().environment(appState) }
ImmersiveSpace(id: "MY_SPACE") { ImmersiveView().environment(appState) }
.immersionStyle(selection: $immersionStyle, in: .automatic)
}
}
The default behavior of your scene is largely determined by Info.plist junk:
- To enable multiple scenes in visionOS, set UIApplicationSupportsMultipleScenes to true in the UIApplicationSceneManifest of the app’s property list.
- visionOS opens the first window group, window, or document scene found in the app’s body. To open an app directly into an immersive space in visionOS, set UIApplicationPreferredDefaultSceneSessionRole to UISceneSessionRoleImmersiveSpaceApplication in the app’s property list.
- To configure the default immersion type from launch, set UISceneInitialImmersionStyle to mixed, full, or progressive.
It is unclear if .immersionStyle(selection: ..., in: .full)
will override
UISceneInitialImmersionStyle. I'll report back here with some experimental
findings.
Immersive spaces are somewhat like fullscreen experiences for VisionOS. When opening an ImmersiveSpace, VisionOS hides the content (e.g. Windows) of other apps.
struct OpenImmersiveSpaceButton: View {
@Environment(\.openImmersiveSpace) private var openImmersiveSpace
var body: some View {
Button("Open Immersive Space") {
Task {
let result = await openImmersiveSpace(id: "MY_SPACE")
if case .error = result {
print("An error occurred")
}
}
}
}
}
struct DismissImmersiveSpaceButton: View {
@Environment(\.dismissImmersiveSpace) private var dismissImmersiveSpace
var body: some View {
Button("Close Solar System") {
Task { await dismissImmersiveSpace() }
}
}
}
Use the environment's openImmersiveSpace(id)
and dismissImmersiveSpace()
to
switch between the immersive spaces in your scene. Only one immersive space can
be open at a time; you must dismiss any open spaces before opening up new ones.
VisionOS will not automatically replace the space for you -- openImmersiveSpace will just return an error result.
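For example, here's a hedged sketch of a button that swaps spaces by dismissing the current one first (the "MY_OTHER_SPACE" id is hypothetical):
struct SwitchSpaceButton: View {
    @Environment(\.openImmersiveSpace) private var openImmersiveSpace
    @Environment(\.dismissImmersiveSpace) private var dismissImmersiveSpace
    var body: some View {
        Button("Switch Space") {
            Task {
                // Dismiss whatever space is currently open, then open the new one.
                await dismissImmersiveSpace()
                if case .error = await openImmersiveSpace(id: "MY_OTHER_SPACE") {
                    print("Could not open MY_OTHER_SPACE")
                }
            }
        }
    }
}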
In a fully immersive experience, visionOS sets a 1.5-meter boundary from the user’s head. Moving outside this zone stops the immersive experience and activates external video to prevent collisions.
The user can manually dismiss an immersive space by double-tapping the
digital crown on the AVP headset, so dismissImmersiveSpace
isn't
technically required. Double-tapping takes you "back" to all your
previously-opened windows.
Your scene is the primary place to hang your anchors:
To add content to the view’s scene, you first create and add one or more AnchorEntity instances to the scene’s anchors collection. Anchors tell RealityKit how to pin virtual content to real world objects, like flat surfaces or images. You then add a hierarchy of other Entity instances to each anchor to indicate the geometry and behaviors that RealityKit should render at a given anchor point.
You'll probably want to hang virtual content (entities) on these anchors. Although physics/collision primitives handle many interactive cases, you'll probably need more specific tools to determine whether your entities meet spatial conditions.
For example, to pop bubbles with a laser, you can raycast
a line away from a
hand anchor to find bubble entities anchored in the scene.
- CollisionCastHit
- raycast(origin:direction:length:query:mask:relativeTo:)
- raycast(from:to:query:mask:relativeTo:)
- convexCast(convexShape:fromPosition:fromOrientation:toPosition:toOrientation:query:mask:relativeTo:)
Lastly, your scene also provides message buses for coordinating pub/sub events:
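For example, here's a minimal sketch of subscribing to collision events via RealityView's content (assuming mySphere is an entity you've already added):
// Inside a RealityView closure, where `content` is the RealityViewContent:
let subscription = content.subscribe(to: CollisionEvents.Began.self, on: mySphere) { event in
    print("\(event.entityA.name) bumped into \(event.entityB.name)")
}
// Store `subscription` somewhere (e.g. in AppState); the handler stops firing once it's deallocated.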
Further reading: Scene, ScenePhase
Windows
@main
struct MyApp: App {
var body: some Scene {
WindowGroup {
ContentView()
.frame(
minWidth: 100, maxWidth: 400,
minHeight: 100, maxHeight: 400)
}
.windowStyle(.plain)
.windowResizability(.contentSize)
.defaultSize(CGSize(width: 600, height: 400))
}
}
- You can request window dimensions, but VisionOS might not respect your request. It clamps out-of-range dimensions to the window’s resizability limits.
- Default window resizability is automatic. Settings windows use contentSize, matching the window size to its content. Other scenes use contentMinSize, restricting only the minimum size.
- Specify resizability strategies with the windowResizability(_:) scene modifier. For example, use the .frame modifier to set bounds, like resizing windows to between 100 and 400 points.
- You can use .windowStyle(.plain) to remove the default glass background.
- You can’t directly change window position or size after it appears.
- During state restoration, the system restores windows to their previous position and size.
- When an app launches from Home View, the system places its window relative to the Home View.
- A new window from a running app appears in front of an existing window.
To programmatically open a window:
struct NewViewerButton: View {
@Environment(\.openWindow) private var openWindow
var body: some View {
Button("Open Window") {
openWindow(id: "MY_WINDOW")
}
}
}
You can also programmatically check whether multiple windows are allowed.
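That flag lives in the SwiftUI environment; here's a minimal sketch (reusing the "MY_WINDOW" group from above):
struct NewViewerWindowButton: View {
    @Environment(\.supportsMultipleWindows) private var supportsMultipleWindows
    @Environment(\.openWindow) private var openWindow
    var body: some View {
        Button("Open Window") {
            openWindow(id: "MY_WINDOW")
        }
        // Disable the button when the current platform/configuration only allows one window.
        .disabled(!supportsMultipleWindows)
    }
}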
Volumes
WindowGroup(id: "MY_BALL") { Ball() }
.windowStyle(.volumetric)
.defaultSize(width: 1, height: 1, depth: 1, in: .meters)
Although you can specify a volume’s size in points, it’s typically better to use physical units, like the above code which specifies a size in meters. This is because the system renders a volume with fixed scaling rather than dynamic scaling, unlike a regular window, which means the volume appears more like a physical object than a user interface. For information about the different kinds of scaling, see Spatial layout.
A volume never changes size. Developers have no API for resizing volumes in VisionOS, and users cannot resize a volume at runtime.
Actually, it looks like users will be able to resize volumes in VisionOS 2.0.
3D Immersive Space
struct ImmersiveView: View {
@Environment(AppState.self) var appState
var body: some View {
RealityView { content, attachments in
// add attachments and MyApp entities to `content`
} update: { content, attachments in
// runs when ImmersiveView properties change
} attachments: {
Attachment(id: "MY_ATTACH) {/* SwiftUI window/volume that will be converted into an entity */}
}
.gesture(/* listen for gesture targeted at entity */)
// ...more gestures
}
}
- Automatic: Default. Uses mixed style unless specified otherwise.
- Full: Unbounded content fully replaces passthrough video, except for the user's upper limbs (controllable by .upperLimbVisibility(_:)). Windows render in front of virtual content.
- Mixed: Unbounded content with passthrough video. Virtual objects and windows obscure each other.
- Progressive: Radial portal effect partially replaces passthrough. Adjustable by user. Matches full immersion when portal fully replaces passthrough.
You can change the immersion style at runtime, but styles are restricted via the in parameter of .immersionStyle(selection:in:):
You can change the immersion style after presenting the immersive space by changing the modifier’s selection input, although you can only use one of the values that you specify in the modifier’s second parameter.
- -x to +x: left to right
- -y to +y: down to up
- -z to +z: front to back
It's very easy to confuse the direction of the z-axis, because larger values go "behind" the user. Also note that RealityKit's y-axis is opposite to SwiftUI's y-axis.
RealityView { content in
let sphere = ModelEntity(mesh: .generateSphere(radius: 0.1))
sphere.transform.translation = [0, 1, -2] // "absolutely" position (relative to origin)
content.add(sphere)
}
RealityView.content is flexibly-sized based on your surroundings. You may need to convert between coordinate spaces.
The origin of an ImmersiveSpace defaults to the user's feet (unclear what happens when lying down), but the origin can change, e.g. when multiple Spatial Personas enter a SharePlay activity.
You can use scene.immersiveContentBrightness(_)
to suggest brightness
levels for immersive spaces. You can use bright/dim/dark presets, or choose
a linear brightness ratio between a standard dynamic range white video frame
and the content background.
SwiftUI coordinate spaces differ from RealityKit coordinate spaces. This is how the SwiftUI ones work:
All geometric properties of a view, including size, position, and transform, are defined within the local coordinate space of the view’s parent.
Use content.convert(...)
to switch between RealityCoordinateSpaces and SwiftUI
CoordinateSpaces.
- RealityCoordinateSpace
  - .camera
  - .scene
- CoordinateSpaceProtocol
  - .immersiveSpace: The named coordinate space that represents the currently opened ImmersiveSpace scene. If no immersive space is currently opened, this coordinate space behaves like .global.
  - .global: The global coordinate space at the root of the view hierarchy.
  - .local: The local coordinate space of the current view.
  - .named(some Hashable): Creates a named coordinate space using the given value.
  - .scrollView: The named coordinate space that is added by the system for the innermost containing scroll view.
  - .scrollView(axis: Axis): The named coordinate space that is added by the system for the innermost containing scroll view that allows scrolling along the provided axis.
You'll be painfully reminded of this whenever you implement gestures:
.gesture(
DragGesture()
.targetedToAnyEntity()
.onEnded { x in
appState.holding!.transform.translation = x.convert(
x.gestureValue.translation3D,
from: .local,
to: .scene)
}
)
Quaternion? SIMD3? Matrix4x4? The interactions between different API representations can become painful, and I don't yet have any tidy suggestions. Please email me at [email protected] if you have any good mental models or advice for juggling these values.
Displaying Things
- Scene
  - Window
    - ContentView
      - SwiftUI views
      - entities (via Model3D)
  - ImmersiveView
    - RealityView.Content
      - content.add(myEntity)
      - content.add(myAttachment)
SwiftUI Views
SwiftUI offers components like Button, DatePicker, Image, VStack, etc.
Apple's official tutorials are a great starting point.
To embed 3D content into a window, use Model3D:
struct ContentView: View {
var body: some View {
VStack {
// load from file
Model3D(named: "MY_MODEL")
.frame(width: 300, height: 600)
// load from url with placeholder
Model3D(url: URL(string: "https://example.com/elephant.usdz")!) { model in
model
.resizable()
.aspectRatio(contentMode: .fit)
} placeholder: {
Color.green
}
.frame(width: 300, height: 600)
// load from url with phases (good for progress bar, etc)
Model3D(url: URL(string: "https://example.com/elephant.usdz")!) { phase in
if let model = phase.model {
model
} else if phase.error != nil {
Color.red
} else {
Color.blue
}
}
}
}
}
RealityView
struct ImmersiveView: View {
@State private var appState = AppState()
var body: some View {
RealityView { content, attachments in
let mySphere = ModelEntity(mesh: .generateSphere(radius: 0.1))
content.add(mySphere)
if let myAttach = attachments.entity(for: "MY_ATTACH") {
content.add(myAttach)
}
} attachments: {
Attachment(id: "MY_ATTACH) {
VStack {
Button("RED") { appState.myColor = .red }
}
}
}
}
}
In this example, we added a SwiftUI attachment and basic sphere to our scene. I hope you are entertained.
Entities
struct ImmersiveView: View {
var body: some View {
RealityView { content in
if let elephant = try? await ModelEntity(named: "elephant") {
content.add(elephant) // entity appears at content origin by default
}
Task {/* do extra entity stuff in the background */}
}
}
}
And now it's finally time to address the elephant in the scene. Let's explore 𝔢𝔫𝔱𝔦𝔱𝔦𝔢𝔰.
To first give you an idea of what you're dealing with, here all the base entities at your disposal: AnchorEntity, BodyTrackedEntity, DirectionalLight, ModelEntity, PerspectiveCamera, PointLight, SpotLight, TriggerVolume, ViewAttachmentEntity
Hierarchies
Various AnchorEntities will be provided to you, e.g. hand anchors, anchors from tracked objects, anchors from people's faces, etc.
Use ModelEntity to display 3D models in your scene -- attach them to anchor
entities (or descendants thereof), e.g. myAnchor.addChild(myModel)
. These
entity hierarchies become useful when you want to search/manipulate the entities
as a group.
To organize your hierarchy, you can use the following methods on any Entity class: id, scene, findEntity, isActive, anchor, isAnchored, parent, setParent, removeFromParent, children, addChild, removeChild, isEnabledInHierarchy
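Here's a hedged sketch of why grouping helps (leftShoe and rightShoe are hypothetical entities you've already loaded, and content is your RealityViewContent):
// Group two models under one parent so they can be searched and moved together.
let pair = Entity()
pair.addChild(leftShoe)
pair.addChild(rightShoe)
content.add(pair)

// Search the hierarchy by name...
if let laces = pair.findEntity(named: "laces") {
    laces.isEnabled = false
}

// ...or manipulate the whole group at once.
pair.transform.translation += [0, 0.25, 0]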
Enabling/Disabling
To stop simulating/rendering an entity, set entity.isEnabled = false
. Reenable
it with entity.isEnabled = true
. You can also check whether any parent entity
is disabled via isActive
, which is only true when an entity is anchored and
all ancestors are enabled.
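A minimal sketch, assuming the elephant entity from earlier is anchored somewhere in your scene:
elephant.isEnabled = false // stops rendering/simulation for elephant and its descendants
print(elephant.isActive)   // false while disabled (or while unanchored)
elephant.isEnabled = true  // picks up right where it left off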
Loading Entities
There are myriad ways to load these entities into your scene. Here are some examples from the SwiftSplash demo project:
await withTaskGroup(of: LoadResult.self) { taskGroup in
    // Kick off one concurrent load per piece.
    for piece in pieces {
        taskGroup.addTask {
            do {
                guard let pieceEntity = try await self.loadFromRCPro(named: piece.key.rawValue,
                                                                     fromSceneNamed: piece.sceneName) else {
                    fatalError("Attempted to load piece entity \(piece.name) but failed.")
                }
                return LoadResult(entity: pieceEntity, key: piece.key.rawValue)
            } catch {
                fatalError("Attempted to load \(piece.name) but failed: \(error.localizedDescription)")
            }
        }
    }

    // Collect results as they finish. Note: this loop must live inside the
    // withTaskGroup closure, where `taskGroup` is still in scope.
    for await result in taskGroup {
        if let pieceKey = pieces.filter({ piece in
            piece.key.rawValue == result.key
        }).first {
            self.add(template: result.entity, for: pieceKey.key)
            setupConnectible(entity: result.entity)
            result.entity.generateCollisionShapes(recursive: true)
            result.entity.setUpAnimationVisibility()
        }
        // ...
    }
}
Here's another loading example from the Diorama demo project:
To load a Reality Composer Pro scene, use load(named:in:), passing the name of the scene you want to load and the project’s bundle. Reality Composer Pro Swift packages define a constant that provides ready access to its bundle. The constant is the name of the Reality Composer Pro project with “Bundle” appended to the end. In this case, the project is called RealityKitContent, so the constant is called RealityKitContentBundle. Here’s how Diorama loads the map table in the RealityView initializer:
let entity = try await Entity.load(named: "DioramaAssembled", in: RealityKitContent.RealityKitContentBundle)
Here's another example, using .generateSphere
to construct a ModelEntity from
basic geometry:
struct SphereView: View {
@State private var scale = false
var body: some View {
RealityView { content in
let model = ModelEntity(
mesh: .generateSphere(radius: 0.1),
materials: [SimpleMaterial(color: .white, isMetallic: true)])
model.components.set(InputTargetComponent())
model.components.set(CollisionComponent(shapes: [.generateSphere(radius: 0.1)]))
content.add(model)
} update: { content in
if let model = content.entities.first {
model.transform.scale = scale ? [1.2, 1.2, 1.2] : [1.0, 1.0, 1.0]
}
}
.gesture(TapGesture().targetedToAnyEntity().onEnded { _ in
scale.toggle()
})
}
}
Ignore the component and gesture stuff… for now.
Transforms
To manipulate the position/rotation/etc. of an Entity, you'll want to use its Transform component. Here are its initializers and properties:
- init(scale:rotation:translation:)
- init(pitch:yaw:roll:)
- init(matrix:)
- scale
- rotation
- translation
- matrix
- identity
let sphere = ModelEntity(mesh: .generateSphere(radius: 0.1))
sphere.transform.translation = [0, 1, -2] // relative to anchor
myAnchor.addChild(sphere)
Gestures will also give you translations to convert. Here's how you might want to move an entity based on the final drag position of a gesture:
.gesture(
DragGesture()
.targetedToAnyEntity()
.onEnded { x in
appState.myEntity!.transform.translation += x.convert(
x.gestureValue.translation3D,
from: .local,
to: .scene)
}
)
Physics
let sphere = ModelEntity(
mesh: MeshResource.generateSphere(radius: 0.1),
materials: [SimpleMaterial(color: .red, isMetallic: false)],
collisionShape: .generateSphere(radius: 0.1),
mass: 1.0)
Straight from the Apple docs:
Models respond to physics simulations because they conform to the HasPhysics protocol. You give them mass and other physical properties with a PhysicsBodyComponent instance, and then apply forces or impulses. The simulator uses a PhysicsMotionComponent to manage the linear and angular velocity of the object. Alternatively, you can selectively circumvent the simulation to control position and velocity yourself. Do this for a given model by setting its physics body mode to PhysicsBodyMode.kinematic.
Models can also collide with one another, and with other entities that conform to the HasCollision protocol. The CollisionComponent provides parameters that let you manage which models collide with each other. It also lets you control the collision shape, which for performance reasons, is typically simpler than the visual geometry.
In other words, give your entities some mass and then use methods to apply forces, torque, impulses, etc. You can use downward force to simulate gravity, and an upward impulse to simulate a throw.
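A hedged sketch, reusing the sphere from the snippet above:
// Let the physics simulation own the sphere's motion.
sphere.physicsBody?.mode = .dynamic

// A steady downward force (the "simulate gravity" example, on top of whatever the simulation already applies)...
sphere.addForce([0, -9.8, 0], relativeTo: nil)

// ...and a one-time upward/forward impulse to simulate a throw.
sphere.applyLinearImpulse([0, 2.0, -1.0], relativeTo: nil)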
For performance reasons, the shapes that collide with each other are not the rendered shapes. This is exactly why things in video games appear to phase through each other when they get close! As your shapes get more complex, this effect can become more apparent. You can imagine that simple cubes and spheres may not capture the nuance of a high-speed collision between two horses. So you'll want to create the simplest mesh that produces the illusion of good physics.
The following ShapeResource functions are useful for defining collision shapes: .offsetBy, .bounds, .generateBox, .generateSphere, .generateCapsule, .generateConvex, .generateStaticMesh.
You might find generateCollisionShapes useful when working with hierarchies.
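Here's a hedged sketch of that tradeoff (the "horse" asset name is hypothetical, and content is your RealityViewContent):
// Render a detailed model, but collide with a cheap capsule.
if let horse = try? await ModelEntity(named: "horse") {
    horse.components.set(CollisionComponent(shapes: [
        .generateCapsule(height: 1.8, radius: 0.4) // rough stand-in for the body
    ]))
    content.add(horse)

    // Or let RealityKit derive simple shapes for the whole hierarchy:
    // horse.generateCollisionShapes(recursive: true)
}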
For more complex stuff, you'll want MeshResource.
And once all this physical stuff is set up, VisionOS will automatically update the entities' Transform components over time.
You can create arbitrary surfaces to interact with, or find/use surfaces in your surroundings. More on that later.
Further reading: scene hierarchies for efficient physics simulations, handling different sized objects, controlling collisions
Anchors
RealityView { content in
    // Anchor a box to any detected horizontal plane (table, floor, etc.).
    let planeAnchor = AnchorEntity(.plane(.horizontal, classification: .any, minimumBounds: [0.2, 0.2]))
    planeAnchor.addChild(ModelEntity(mesh: .generateBox(size: 0.1)))
    content.add(planeAnchor)

    // Anchor text to a known reference image from an AR resource group.
    let imageAnchor = AnchorEntity(.image(group: "AR Resources", name: "EXAMPLE"))
    imageAnchor.addChild(ModelEntity(mesh: .generateText("Informative Text", extrusionDepth: 0.05, font: .systemFont(ofSize: 0.1))))
    content.add(imageAnchor)
}
RealityKit places anchors based on the anchoring component’s target property. For example, configure an anchor entity to find a detected horizontal surface like a table or floor. RealityKit will place that anchor once it detects an appropriate horizontal plane.
Some anchor entities may not show up in your scene if RealityKit fails to detect an appropriate place for them, e.g. an anchor entity with an image target won’t show up until RealityKit detects the specified image in the real world.
Add anchor entities directly to your scene’s anchors collection or anywhere else in the scene hierarchy by adding them to the children collection of another entity in your scene. AnchorEntity is a subclass of Entity, so you can make an anchor entity a child of any other entity.
RealityKit may move anchor entities as the scene updates, changing their location and rotation relative to their parent, even if your code never modifies their .transform properties.
Sensing Stuff
ImmersiveSpace(id: "MY_SPACE") {
ImmersiveView()
.task {
    // ARKit data flows through an ARKitSession plus one or more data providers.
    let session = ARKitSession()
    let planeData = PlaneDetectionProvider(alignments: [.horizontal])
if PlaneDetectionProvider.isSupported {
do {
try await session.run([planeData])
for await update in planeData.anchorUpdates {
// ...
}
} catch {
print("ARKit session error \(error)")
}
}
}
}
.immersionStyle(selection: $immersionState, in: .mixed)
The VisionOS ARKit guide is very good:
ARKit in visionOS offers a new set of sensing capabilities. You adopt these capabilities individually in your app, using data providers to deliver updates asynchronously. The available capabilities include:
- Plane detection. Detect surfaces in a person’s surroundings and use them to anchor content.
- World tracking. Determine the position and orientation of Apple Vision Pro relative to its surroundings and add world anchors to place content.
- Hand tracking. Use the person’s hand and finger positions as input for custom gestures and interactivity.
- Scene reconstruction. Build a mesh of the person’s physical surroundings and incorporate it into your immersive spaces to support interactions.
- Image tracking. Look for known images in the person’s surroundings and use them as anchor points for custom content.
For each of these APIs, ARKit exposes anchor entities corresponding to real-world phenomena. You will likely either (A) attach your entities to these ARKit anchors or (B) record changes in these anchors to apply elsewhere.
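Here's a minimal sketch of option (A), assuming the planeData provider from the snippet above and a content handle captured from your RealityView:
// Keep one entity per detected plane, pinned to the anchor's transform.
var planeEntities: [UUID: Entity] = [:]

for await update in planeData.anchorUpdates {
    switch update.event {
    case .added, .updated:
        let entity: Entity
        if let existing = planeEntities[update.anchor.id] {
            entity = existing
        } else {
            entity = ModelEntity(mesh: .generateBox(size: 0.05))
            planeEntities[update.anchor.id] = entity
            content.add(entity)
        }
        // Pin the entity to the plane's pose in world space.
        entity.transform = Transform(matrix: update.anchor.originFromAnchorTransform)
    case .removed:
        planeEntities[update.anchor.id]?.removeFromParent()
        planeEntities[update.anchor.id] = nil
    }
}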
- plane detection demo project
- world tracking demo project
- hand tracking demo project
- scene reconstruction demo project
- image tracking demo project
Bonus: body/face tracking
ARKit sessions will not provide anchors (nor updates) unless your app is presenting a Full Space (i.e. an open immersive space) and other apps are hidden. You have to run \.openImmersiveSpace before ARKitSession.run() will work.
View.task
is a great way to run an async task before the ImmersiveView appears.
It's important to note that ARKit hand-tracking is completely separate from RealityView .gesture(...) methods.
Gestures
In SwiftUI, you can attach gestures to specific views:
VStack {
Image(systemName: "star.fill")
.gesture(
TapGesture().onEnded {
print("TAPPED IMAGE")
}
)
}
.gesture(
TapGesture().onEnded {
print("TAPPED VSTACK")
}
)
In RealityKit, your gestures live at the RealityView level and will be targeted to particular entities within the scene:
RealityView {/* ... */}
.gesture(
TapGesture()
.targetedToAnyEntity()
.onEnded {
print("TAPPED SOMETHING")
}
)
.gesture(
TapGesture()
.targetedToEntity(myFavoriteDolphin)
.onEnded {
print("TAPPED HENRIETTE")
}
)
.gesture(
TapGesture()
.targetedToEntity(where: myDolphins)
.onEnded {
print("TAPPED DOLPHIN")
}
)
Here are your three entity filters: .targetedToAnyEntity(), .targetedToEntity(_:), and .targetedToEntity(where:).
Use QueryPredicate to target entity subtypes. You can also reuse these queries to find entities within a scene.
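A hedged sketch of that reuse, with a hypothetical DolphinComponent you'd attach to your dolphin entities yourself:
struct DolphinComponent: Component {}
// (Remember to call DolphinComponent.registerComponent() once at startup.)

let dolphinPredicate = QueryPredicate<Entity>.has(DolphinComponent.self)

// Reuse it for gesture targeting:
// .gesture(TapGesture().targetedToEntity(where: dolphinPredicate).onEnded { _ in ... })

// ...and for finding matching entities in a RealityKit scene:
// let dolphins = scene.performQuery(EntityQuery(where: dolphinPredicate))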
You can specifically filter "pinches" with .handActivationBehavior. For anything besides pinching, you'll need to use a hand skeleton to figure out what's going on. More on that later.
InputTargetComponent
myEntity.components.set(InputTargetComponent())
myEntity.components.set(CollisionComponent(shapes: [.generateSphere(radius: 0.1)]))
Gestures only work on entities with collision components.
Seriously, this is pretty important: gestures only work on entities with collision components. InputTargetComponent ain't enough.
Luckily, InputTargetComponent is hierarchical, so any descendants with collision components can be targeted for gestures. You can "hide" a descendant entity from gestures by adding a disabled input component:
var inputComponent = InputTargetComponent()
inputComponent.isEnabled = false
myDescendantEntity.components.set(inputComponent)
Input Types
There are two
types of input
in VisionOS to use in your components, e.g.
InputTargetComponent(allowedInputTypes: .indirect)
:
.indirect
: use gaze tracking and become a powerful Steve Jobs padawan.direct
: use your body to interact with entities like a caveman
And this naturally brings us to gesture types. Everything works as you'd expect, except the "direct" input makes things weird. SpatialTapGesture is .direct, while all others have .direct and .indirect handling. But almost everything will be .indirect, because we're civilized, dammit.
I don't actually have a strong opinion on this, I'm just performing a comedic bit.
Gesture Types
- TapGesture: look at something and tap index finger to thumb
- LongPressGesture: look at something and hold index finger against thumb
- DragGesture: look at something, hold index finger against thumb, while moving hand around
- RotateGesture: look at something and rotate it like a steering wheel
- RotateGesture3D: look at something and rotate it like a detached steering wheel
- MagnifyGesture: look at something and bring pinching hands away/together
- SpatialTapGesture: touch an entity directly with a finger (must be within reach)
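For instance, here's a hedged sketch of a SpatialTapGesture handler that converts the 3D tap location into scene coordinates:
RealityView { content in /* ... */ }
    .gesture(
        SpatialTapGesture()
            .targetedToAnyEntity()
            .onEnded { value in
                // Convert the tap's SwiftUI-space location into RealityKit scene space.
                let scenePoint = value.convert(value.location3D, from: .local, to: .scene)
                print("Tapped \(value.entity.name) at \(scenePoint)")
            }
    )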
Gesture Hand Poses
DragGesture()
    .targetedToAnyEntity()
    .onEnded { gesture in
        guard let pose = gesture.gestureValue.inputDevicePose3D else { return }
        appState.myEntity?.setPosition(
            SIMD3<Float>(Float(pose.position.x), Float(pose.position.y), Float(pose.position.z)),
            relativeTo: nil)
    }
For some gestures, you can observe hand orientation changes outside of ARKit hand-tracking sessions.
Gesture Ambiguity
Because multiple tap gestures on the same RealityView execute with a different number of taps, multiple gestures may be called at once. If a player double taps an entity, for example, both the single tap and the double tap gesture code get called, and the app has to determine which one to execute. Swift Splash makes this determination by using a Boolean state variable. If a player single taps, it sets that variable — called shouldSingleTap — to true. Then it waits for a period of time before executing the rest of its code. If shouldSingleTap gets set to false while it’s waiting, the code doesn’t execute. When SwiftSplash detects a double tap gesture, it sets shouldSingleTap to false, preventing the single-tap code from firing when it executes the double-tap code.
struct TapGestureView: View {
    @State private var shouldSingleTap = false
    var body: some View {
        RealityView {/* ... */}
            .gesture(
                TapGesture(count: 1)
                    .targetedToAnyEntity()
                    .onEnded { _ in
                        Task {
                            shouldSingleTap = true
                            // Wait briefly for a possible second tap before committing.
                            try? await Task.sleep(for: .milliseconds(300))
                            if shouldSingleTap {
                                print("SINGLE TAP")
                            }
                        }
                    }
            )
            .simultaneousGesture(
                TapGesture(count: 2)
                    .targetedToAnyEntity()
                    .onEnded { _ in
                        shouldSingleTap = false
                        print("DOUBLE TAP")
                    }
            )
    }
}
Hand Shapes
for await update in self.handTracking.anchorUpdates {
    switch update.event {
    case .added, .updated:
        let hand = update.anchor
        guard let pose = Pose3D(hand.originFromAnchorTransform) else { continue }
        let isPalmUp = pose.isApproximatelyEqual(
            to: Pose3D(
                position: pose.position,
                target: Point3D(x: 0, y: 0, z: 0),
                up: Vector3D(x: 0, y: 1, z: 0)
            )
        )
        // z-offset of a fingertip in the hand anchor's own space:
        // a rough "curled in" vs "extended out" heuristic.
        func tipZ(_ joint: HandSkeleton.JointName) -> Float {
            hand.handSkeleton?.joint(joint).anchorFromJointTransform.columns.3.z ?? 0
        }
        let isRingIn = tipZ(.ringFingerTip) < -0.1
        let isRingOut = tipZ(.ringFingerTip) > 0.1
        let isPointerOut = tipZ(.indexFingerTip) > 0.1
        if isPalmUp && isRingOut {
            print("LOOKING AT PALM")
        }
        if isPointerOut && isRingIn {
            print("SHOOTING")
        }
    case .removed:
        break
    }
}
I really hope that Apple produces a nicer API for hand shape recognition in the future. As of now, it's still pretty low-level, and hands are naturally complex.
To reduce the complexity of hand-shape recognition, I focus on four dimensions and ignore everything else:
- palm orientation
- index finger tip
- ring finger tip
- thumb tip
Because of the musculature of the human hand, the ring finger is a very good proxy for the middle finger and pinky, i.e. it's difficult to extend your ring finger by itself.
On Your Own
As you learn new APIs, two particular Xcode shortcuts are extremely helpful for accessing documentation:
- ⇧⌘0: open fast, searchable, offline documentation (much better interface than the web docs IMO)
- ⌥⌘3: pull up the "quick help" sidebar (I pretty much always leave it open)
Starter Projects
VisionOS:
RealityKit:
- Building an Immersive Experience with RealityKit
- Transforming RealityKit entities using gestures
- SwiftStrike: Creating a Game with RealityKit
- Creating a Game with Reality Composer
- Creating a game with scene understanding
- Controlling Entity Collisions in RealityKit
- Simulating physics with collisions in your visionOS app
- Simulating particles in your visionOS app
- Construct an immersive environment for visionOS
ARKit:
- Incorporating real-world surroundings in an immersive experience
- Placing content on detected planes
- Tracking specific points in world space
Videos
WWDC23:
- Explore immersive sound design
- Enhance your iPad and iPhone apps for the Shared Space
- Meet Safari for spatial computing
- Meet UIKit for spatial computing
- Run your iPad and iPhone apps in the Shared Space
- Build spatial experiences with RealityKit
- Develop your first immersive app
- Meet SwiftUI for spatial computing
- Bring your Unity VR app to a fully immersive space
- Build great games for spatial computing
- Build spatial SharePlay experiences
- Create 3D models for Quick Look spatial experiences
- Create a great spatial playback experience
- Create accessible spatial experiences
- Create immersive Unity apps
- Deliver video content for spatial experiences
- Design for spatial user interfaces
- Design spatial SharePlay experiences
- Discover Metal for immersive apps
- Discover Quick Look for spatial computing
- Elevate your windowed app for spatial computing
- Explore App Store Connect for spatial computing
- Explore rendering for spatial computing
- Explore the USD ecosystem
- Get started with building apps for spatial computing
- Go beyond the window with SwiftUI
- Meet ARKit for spatial computing
- Optimize app power and performance for spatial computing
- Take SwiftUI to the next dimension
- Evolve your ARKit app for spatial experiences
WWDC24:
- What’s new in SwiftUI
- Break into the RealityKit debugger
- Create enhanced spatial computing experiences with ARKit
- Customize spatial Persona templates in SharePlay
- What’s new in USD and MaterialX
- What’s new in Quick Look for visionOS
- Get started with HealthKit in visionOS
- Build immersive web experiences with WebXR
- Bring your iOS or iPadOS game to visionOS
- Discover area mode for Object Capture
- Optimize your 3D assets for spatial computing
- Create custom hover effects in visionOS
- Design interactive experiences for visionOS
- Enhance the immersion of media viewing in custom environments
- Create custom environments for your immersive apps in visionOS
- Explore game input in visionOS
- Optimize for the spatial web
- Build a spatial drawing app with RealityKit
- Dive deep into volumes and immersive spaces
- Introducing enterprise APIs for visionOS
- Explore multiview video playback in visionOS
- Render Metal with passthrough in visionOS
- Meet TabletopKit for visionOS
- Build compelling spatial photo and video experiences
- Enhance your spatial computing app with RealityKit audio
- Discover RealityKit APIs for iOS, macOS and visionOS
- Compose interactive 3D content in Reality Composer Pro
- Explore object tracking for visionOS
- Design great visionOS apps