On-Device Machine Learning in Spatial Computing



The landscape of computing is undergoing a profound transformation with the emergence of spatial computing platforms (VR and AR). As we step into this new era, the intersection of virtual reality, augmented reality, and on-device machine learning presents unprecedented opportunities for developers to create experiences that seamlessly blend digital content with the physical world.

The introduction of visionOS marks a major milestone in this evolution. Apple's spatial computing platform combines sophisticated hardware capabilities with powerful development frameworks, enabling developers to build applications that can understand and interact with the physical environment in real time. This convergence of spatial awareness and on-device machine learning capabilities opens up new possibilities for object recognition and tracking applications that were previously difficult to implement.


What We're Building

In this guide, we'll build an app that showcases the power of on-device machine learning in visionOS. We'll create an app that can recognize and track a diet soda can in real time, overlaying visual indicators and information directly within the user's field of view.

Our app will leverage several key technologies in the visionOS ecosystem. When a user runs the app, they're presented with a window containing a rotating 3D model of our target object along with usage instructions. As they look around their environment, the app continuously scans for diet soda cans. Upon detection, it displays dynamic bounding lines around the can and places a floating text label above it, all while maintaining precise tracking as the object or user moves through space.

Before we start development, let's make sure we have the necessary tools and understanding in place. This tutorial requires:

  • The latest version of Xcode 16 with the visionOS SDK installed
  • visionOS 2.0 or later running on an Apple Vision Pro device
  • Basic familiarity with SwiftUI and the Swift programming language

The development process will take us through several key stages, from capturing a 3D model of our target object to implementing real-time tracking and visualization. Each stage builds upon the previous one, giving you a thorough understanding of developing features powered by on-device machine learning for visionOS.

Building the Foundation: 3D Object Capture

The first step in creating our object recognition system involves capturing a detailed 3D model of our target object. Apple provides a powerful app for this purpose: Reality Composer, available for iOS through the App Store.

When capturing a 3D model, environmental conditions play a crucial role in the quality of our results. Setting up the capture environment properly ensures we get the best possible data for our machine learning model. A well-lit space with consistent lighting helps the capture system accurately detect the object's features and dimensions. The diet soda can should be placed on a surface with good contrast, making it easier for the system to distinguish the object's boundaries.

The capture process begins by launching the Reality Composer app and selecting "Object Capture" from the available options. The app guides us through positioning a bounding box around our target object. This bounding box is critical because it defines the spatial boundaries of our capture volume.

Reality Composer — Object Capture Flow — Image By Author

Once we've captured all the details of the soda can with the help of the in-app guide and processed the images, a .usdz file containing our 3D model will be created. This file format is specifically designed for AR/VR applications and contains not only the visual representation of our object but also important information that will be used in the training process.

Training the Reference Model

With our 3D model in hand, we move to the next crucial phase: training our recognition model using Create ML. Apple's Create ML application provides a straightforward interface for training machine learning models, including specialized templates for spatial computing applications.

To start the training process, we launch Create ML and select the "Object Tracking" template from the Spatial category. This template is specifically designed for training models that can recognize and track objects in three-dimensional space.

Create ML Project Setup — Image By Author

After creating a new project, we import our .usdz file into Create ML. The system automatically analyzes the 3D model and extracts key features that will be used for recognition. The interface provides options for configuring how our object should be recognized in space, including viewing angles and tracking preferences.

Once you've imported the 3D model and analyzed it from various angles, go ahead and click "Train". Create ML will process our model and begin the training phase. During this phase, the system learns to recognize our object from various angles and under different conditions. The training process can take several hours as the system builds a comprehensive understanding of our object's characteristics.

Create ML Training Process — Image By Author

The output of this training process is a .referenceobject file, which contains the trained model data optimized for real-time object detection in visionOS. This file encapsulates all of the learned features and recognition parameters that will enable our app to identify diet soda cans in the user's environment.

The successful creation of our reference object marks a crucial milestone in our development process. We now have a trained model capable of recognizing our target object in real time, setting the stage for implementing the actual detection and visualization functionality in our visionOS application.

Initial Project Setup

Now that we have our trained reference object, let's set up our visionOS project. Launch Xcode and select "Create a new Xcode project". In the template selector, choose visionOS under the platforms filter and select "App". This template provides the basic structure needed for a visionOS application.

Xcode visionOS Project Setup — Image By Author

In the project configuration dialog, configure your project with these primary settings:

  • Product Name: SodaTracker
  • Initial Scene: Window
  • Immersive Space Renderer: RealityKit
  • Immersive Space: Mixed

After project creation, we need to make a few essential modifications. First, delete the file named ToggleImmersiveSpaceButton.swift, as we won't be using it in our implementation.

Next, we'll add our previously created assets to the project. In Xcode's Project Navigator, locate the "RealityKitContent.rkassets" folder and add the 3D object file ("SodaModel.usdz"). This 3D model will be used in our informative view. Then create a new group named "ReferenceObjects" and add the "Diet Soda.referenceobject" file we generated using Create ML.

The final setup step is to configure the necessary permission for object tracking. Open your project's Info.plist file and add a new key: NSWorldSensingUsageDescription. Set its value to "Used to track diet sodas". This permission is required for the app to detect and track objects in the user's environment.

With these setup steps complete, we have a properly configured visionOS project ready for implementing our object tracking functionality.

Entry Point Implementation

Let's start with SodaTrackerApp.swift, which was automatically created when we set up our visionOS project. We need to modify this file to support our object tracking functionality. Replace the default implementation with the following code:

import SwiftUI

/**
 SodaTrackerApp is the main entry point for the application.
 It configures the app's window and immersive space, and manages
 the initialization of object detection capabilities.
 
 The app automatically launches into an immersive experience
 where users can see Diet Soda cans being detected and highlighted
 in their environment.
 */
@main
struct SodaTrackerApp: App {
    /// Shared model that manages object detection state
    @StateObject private var appModel = AppModel()
    
    /// System environment value for launching immersive experiences
    @Environment(\.openImmersiveSpace) var openImmersiveSpace
    
    var body: some Scene {
        WindowGroup {
            ContentView()
                .environmentObject(appModel)
                .task {
                    // Load and prepare object detection capabilities
                    await appModel.initializeDetector()
                }
                .onAppear {
                    Task {
                        // Launch directly into immersive experience
                        await openImmersiveSpace(id: appModel.immersiveSpaceID)
                    }
                }
        }
        .windowStyle(.plain)
        .windowResizability(.contentSize)
        
        // Configure the immersive space for object detection
        ImmersiveSpace(id: appModel.immersiveSpaceID) {
            ImmersiveView()
                .environment(appModel)
        }
        // Use mixed immersion to blend virtual content with reality
        .immersionStyle(selection: .constant(.mixed), in: .mixed)
        // Hide system UI for a more immersive experience
        .persistentSystemOverlays(.hidden)
    }
}

The key aspect of this implementation is the initialization and management of our object detection system. When the app launches, we initialize our AppModel, which handles the ARKit session and object tracking setup. The initialization sequence is crucial:

.task {
    await appModel.initializeDetector()
}

This asynchronous initialization loads our trained reference object and prepares the ARKit session for object tracking. We make sure this happens before opening the immersive space where the actual detection will occur.
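If you want to guard against environments where object tracking is unavailable (for example, the visionOS simulator), the tracking provider exposes a support flag that can be checked before any loading work. Below is a minimal sketch of such a guard; the wrapper method name is our own and not part of the project files above:

import ARKit

// Hypothetical convenience, assumed to live in AppModel.swift so it can call
// the initializeDetector() method shown later in this tutorial.
extension AppModel {
    func initializeDetectorIfSupported() async {
        guard ObjectTrackingProvider.isSupported else {
            print("Object tracking is not supported on this device.")
            return
        }
        await initializeDetector()
    }
}

Calling this wrapper from the .task modifier instead of initializeDetector() keeps the launch path unchanged on supported hardware.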

The immersive space configuration is particularly important for object tracking:

.immersionStyle(selection: .constant(.mixed), in: .mixed)

The mixed immersion style is essential for our object tracking implementation, as it allows RealityKit to blend our visual indicators (bounding boxes and labels) with the real-world environment where we're detecting objects. This creates a seamless experience where digital content accurately aligns with physical objects in the user's space.

With these modifications to SodaTrackerApp.swift, our app is ready to begin the object detection process, with ARKit, RealityKit, and our trained model working together in the mixed reality environment. In the next section, we'll examine the core object detection functionality in AppModel.swift, another file that was created during project setup.

Core Detection Model Implementation

AppModel.swift, created during project setup, serves as our core detection system. This file manages the ARKit session, loads our trained model, and coordinates the object tracking process. Let's examine its implementation:

import SwiftUI
import RealityKit
import ARKit

/**
 AppModel serves as the core model for the soda can detection application.
 It manages the ARKit session, handles object tracking initialization,
 and maintains the state of object detection throughout the app's lifecycle.
 
 This model is designed to work with visionOS's object tracking capabilities,
 specifically optimized for detecting Diet Soda cans in the user's environment.
 */
@MainActor
@Observable
class AppModel: ObservableObject {
    /// Unique identifier for the immersive space where object detection occurs
    let immersiveSpaceID = "SodaTracking"
    
    /// ARKit session instance that manages the core tracking functionality
    /// This session coordinates with visionOS to process spatial data
    private var arSession = ARKitSession()
    
    /// Dedicated provider that handles the real-time tracking of soda cans
    /// This maintains the state of currently tracked objects
    private var sodaTracker: ObjectTrackingProvider?
    
    /// Collection of reference objects used for detection
    /// These objects contain the trained model data for recognizing soda cans
    private var targetObjects: [ReferenceObject] = []
    
    /**
     Initializes the object detection system by loading and preparing
     the reference object (Diet Soda can) from the app bundle.
     
     This method loads a pre-trained model that contains spatial and
     visual information about the Diet Soda can we want to detect.
     */
    func initializeDetector() async {
        guard let objectURL = Bundle.main.url(forResource: "Diet Soda", withExtension: "referenceobject") else {
            print("Error: Failed to locate reference object in bundle - ensure Diet Soda.referenceobject exists")
            return
        }
        
        do {
            let referenceObject = try await ReferenceObject(from: objectURL)
            self.targetObjects = [referenceObject]
        } catch {
            print("Error: Didn't initialize reference object: (error)")
        }
    }
    
    /**
     Starts the active object detection process using ARKit.
     
     This method initializes the tracking provider with loaded reference objects
     and begins the real-time detection process in the user's environment.
     
     - Returns: An ObjectTrackingProvider if successfully initialized, nil otherwise
     */
    func beginDetection() async -> ObjectTrackingProvider? {
        guard !targetObjects.isEmpty else { return nil }
        
        let tracker = ObjectTrackingProvider(referenceObjects: targetObjects)
        do {
            try await arSession.run([tracker])
            self.sodaTracker = tracker
            return tracker
        } catch {
            print("Error: Didn't initialize tracking: (error)")
            return nil
        }
    }
    
    /**
     Terminates the object detection process.
     
     This method safely stops the ARKit session and cleans up
     tracking resources when object detection is no longer needed.
     */
    func endDetection() {
        arSession.stop()
    }
}

At the core of our implementation is ARKitSession, visionOS's gateway to spatial computing capabilities. The @MainActor attribute ensures our object detection operations run on the main thread, which is crucial for synchronizing with the rendering pipeline.

private var arSession = ARKitSession()
private var sodaTracker: ObjectTrackingProvider?
private var targetObjects: [ReferenceObject] = []

The ObjectTrackingProvider is a specialized component in visionOS that handles real-time object detection. It works in conjunction with ReferenceObject instances, which contain the spatial and visual information from our trained model. We maintain these as private properties to ensure proper lifecycle management.

The initialization process is particularly important:

let referenceObject = try await ReferenceObject(from: objectURL)
self.targetObjects = [referenceObject]

Here, we load our trained model (the .referenceobject file we created in Create ML) into a ReferenceObject instance. This process is asynchronous because the system must parse and prepare the model data for real-time detection.

The beginDetection method sets up the actual tracking process:

let tracker = ObjectTrackingProvider(referenceObjects: targetObjects)
try await arSession.run([tracker])

When we create the ObjectTrackingProvider, we pass in our reference objects. The provider uses these to establish the detection parameters: what to look for, what features to match, and how to track the object in 3D space. The ARKitSession.run call activates the tracking system, starting the real-time analysis of the user's environment.
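Because NSWorldSensingUsageDescription only declares the usage string, it can also be worth confirming at runtime that the user actually granted world-sensing access before starting the session. Here is a hedged sketch of that check, written as an extension assumed to live in AppModel.swift so it can reach the private arSession property; the method name is ours, not part of the original project:

import ARKit

extension AppModel {
    // Hypothetical helper: request world-sensing authorization and only
    // begin detection once it has been granted.
    func beginDetectionIfAuthorized() async -> ObjectTrackingProvider? {
        let results = await arSession.requestAuthorization(for: [.worldSensing])
        guard results[.worldSensing] == .allowed else {
            print("World sensing authorization was not granted.")
            return nil
        }
        return await beginDetection()
    }
}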

Immersive Experience Implementation

ImmersiveView.swift, provided in our initial project setup, manages the real-time object detection visualization in the user's space. This view processes the continuous stream of detection data and creates visual representations of detected objects. Here's the implementation:

import SwiftUI
import RealityKit
import ARKit

/**
 ImmersiveView is responsible for creating and managing the augmented reality
 experience where object detection occurs. This view handles the real-time
 visualization of detected soda cans in the user's environment.
 
 It maintains a collection of visual representations for each detected object
 and updates them in real-time as objects are detected, moved, or removed
 from view.
 */
struct ImmersiveView: View {
    /// Access to the app's shared model for object detection functionality
    @Environment(AppModel.self) private var appModel
    
    /// Root entity that serves as the parent for all AR content
    /// This entity provides a consistent coordinate space for all visualizations
    @State private var sceneRoot = Entity()
    
    /// Maps unique object identifiers to their visual representations
    /// Enables efficient updating of specific object visualizations
    @State private var activeVisualizations: [UUID: ObjectVisualization] = [:]
    
    var body: some View {
        RealityView { content in
            // Initialize the AR scene with our root entity
            content.add(sceneRoot)
            
            Task {
                // Begin object detection and track changes
                let detector = await appModel.beginDetection()
                guard let detector else { return }
                
                // Process real-time updates for object detection
                for await update in detector.anchorUpdates {
                    let anchor = update.anchor
                    let id = anchor.id
                    
                    switch update.event {
                    case .added:
                        // Object newly detected - create and add visualization
                        let visualization = ObjectVisualization(for: anchor)
                        activeVisualizations[id] = visualization
                        sceneRoot.addChild(visualization.entity)
                        
                    case .updated:
                        // Object moved - update its position and orientation
                        activeVisualizations[id]?.refreshTracking(with: anchor)
                        
                    case .removed:
                        // Object no longer visible - remove its visualization
                        activeVisualizations[id]?.entity.removeFromParent()
                        activeVisualizations.removeValue(forKey: id)
                    }
                }
            }
        }
        .onDisappear {
            // Clean up AR resources when view is dismissed
            cleanupVisualizations()
        }
    }
    
    /**
     Removes all active visualizations and stops object detection.
     This ensures proper cleanup of AR resources when the view is no longer active.
     */
    private func cleanupVisualizations() {
        for (_, visualization) in activeVisualizations {
            visualization.entity.removeFromParent()
        }
        activeVisualizations.removeAll()
        appModel.endDetection()
    }
}

The core of our object tracking visualization lies in the detector's anchorUpdates stream. This ARKit feature provides a continuous flow of object detection events:

for await update in detector.anchorUpdates {
    let anchor = update.anchor
    let id = anchor.id
    
    switch update.event {
    case .added:
        // Object first detected
    case .updated:
        // Object position changed
    case .removed:
        // Object no longer visible
    }
}

Each ObjectAnchor contains crucial spatial data about the detected soda can, including its position, orientation, and bounding box in 3D space. When a new object is detected (.added event), we create a visualization that RealityKit will render in the correct position relative to the physical object. As the object or user moves, the .updated events ensure our virtual content stays perfectly aligned with the real world.
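To make the anchor data concrete, here is a small debug helper of the kind you could call from the .added or .updated branches. It only reads properties that ObjectAnchor already exposes; the function itself is purely illustrative and not part of the project:

import ARKit
import simd

// Hypothetical debug helper: print the spatial data carried by a detected anchor.
func logAnchorDetails(_ anchor: ObjectAnchor) {
    // 4x4 matrix placing the anchor in the app's origin coordinate space
    let transform = anchor.originFromAnchorTransform
    let position = SIMD3<Float>(transform.columns.3.x,
                                transform.columns.3.y,
                                transform.columns.3.z)
    
    // Extent describes the detected object's width, height, and depth in meters
    let extent = anchor.boundingBox.extent
    
    print("Anchor \(anchor.id): tracked=\(anchor.isTracked), position=\(position), size=\(extent)")
}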

Visual Feedback System

Create a new file named ObjectVisualization.swift to handle the visual representation of detected objects. This component is responsible for creating and managing the bounding box and text overlay that appear around detected soda cans:

import RealityKit
import ARKit
import UIKit
import SwiftUI

/**
 ObjectVisualization manages the visual elements that appear when a soda can is detected.
 This class handles both the 3D text label that appears above the object and the
 bounding box that outlines the detected object in space.
 */
@MainActor
class ObjectVisualization {
    /// Root entity that contains all visual elements
    var entity: Entity
    
    /// Entity specifically for the bounding box visualization
    private var boundingBox: Entity
    
    /// Width of bounding box lines - 0.003 provides optimal visibility without being too intrusive
    private let outlineWidth: Float = 0.003
    
    init(for anchor: ObjectAnchor) {
        entity = Entity()
        boundingBox = Entity()
        
        // Set up the main entity's transform based on the detected object's position
        entity.transform = Transform(matrix: anchor.originFromAnchorTransform)
        entity.isEnabled = anchor.isTracked
        
        createFloatingLabel(for: anchor)
        setupBoundingBox(for: anchor)
        refreshBoundingBoxGeometry(with: anchor)
    }
    
    /**
     Creates a floating text label that hovers above the detected object.
     The text uses Avenir Next font for optimal readability in AR space and
     is positioned slightly above the object for clear visibility.
     */
    private func createFloatingLabel(for anchor: ObjectAnchor) {
        // 0.06 units provides optimal text size for viewing at typical distances
        let labelSize: Float = 0.06
        
        // Use Avenir Next for its clarity and modern appearance in AR
        let font = MeshResource.Font(name: "Avenir Next", size: CGFloat(labelSize))!
        let textMesh = MeshResource.generateText("Diet Soda",
                                               extrusionDepth: labelSize * 0.15,
                                               font: font)
        
        // Create a material that makes text clearly visible against any background
        var textMaterial = UnlitMaterial()
        textMaterial.color = .init(tint: .orange)
        
        let textEntity = ModelEntity(mesh: textMesh, materials: [textMaterial])
        
        // Position text above object with enough clearance to avoid intersection
        textEntity.transform.translation = SIMD3(
            anchor.boundingBox.center.x - textMesh.bounds.max.x / 2,
            anchor.boundingBox.extent.y + labelSize * 1.5,
            0
        )
        
        entity.addChild(textEntity)
    }
    
    /**
     Creates a bounding box visualization that outlines the detected object.
     Uses a magenta color with transparency to provide a clear
     but non-distracting visual boundary around the detected soda can.
     */
    private func setupBoundingBox(for anchor: ObjectAnchor) {
        let boxMesh = MeshResource.generateBox(size: [1.0, 1.0, 1.0])
        
        // Create a single material for all edges with magenta color
        let boundsMaterial = UnlitMaterial(color: .magenta.withAlphaComponent(0.4))
        
        // Create all edges with uniform appearance
        for _ in 0..<12 {
            let edge = ModelEntity(mesh: boxMesh, materials: [boundsMaterial])
            boundingBox.addChild(edge)
        }
        
        entity.addChild(boundingBox)
    }
    
    /**
     Updates the visualization when the tracked object moves.
     This ensures the bounding box and text maintain accurate positioning
     relative to the physical object being tracked.
     */
    func refreshTracking(with anchor: ObjectAnchor) {
        entity.isEnabled = anchor.isTracked
        guard anchor.isTracked else { return }
        
        entity.transform = Transform(matrix: anchor.originFromAnchorTransform)
        refreshBoundingBoxGeometry(with: anchor)
    }
    
    /**
     Updates the bounding box geometry to match the detected object's dimensions.
     Creates a precise outline that exactly matches the physical object's boundaries
     while maintaining a consistent visual appearance.
     */
    private func refreshBoundingBoxGeometry(with anchor: ObjectAnchor) {
        let extent = anchor.boundingBox.extent
        boundingBox.transform.translation = anchor.boundingBox.center
        
        for (index, edge) in boundingBox.children.enumerated() {
            guard let edge = edge as? ModelEntity else { continue }
            
            switch index {
            case 0...3:  // Horizontal edges along width
                edge.scale = SIMD3(extent.x, outlineWidth, outlineWidth)
                edge.position = [
                    0,
                    extent.y / 2 * (index % 2 == 0 ? -1 : 1),
                    extent.z / 2 * (index < 2 ? -1 : 1)
                ]
            case 4...7:  // Vertical edges along height
                edge.scale = SIMD3(outlineWidth, extent.y, outlineWidth)
                edge.position = [
                    extent.x / 2 * (index % 2 == 0 ? -1 : 1),
                    0,
                    extent.z / 2 * (index < 6 ? -1 : 1)
                ]
            case 8...11: // Depth edges
                edge.scale = SIMD3(outlineWidth, outlineWidth, extent.z)
                edge.position = [
                    extent.x / 2 * (index % 2 == 0 ? -1 : 1),
                    extent.y / 2 * (index < 10 ? -1 : 1),
                    0
                ]
            default:
                break
            }
        }
    }
}

The bounding box creation is a key aspect of our visualization. Rather than using a single box mesh, we construct 12 individual edges that form a wireframe outline. This approach provides better visual clarity and allows more precise control over the appearance. The edges are positioned using SIMD3 vectors for efficient spatial calculations:

edge.position = [
    extent.x / 2 * (index % 2 == 0 ? -1 : 1),
    extent.y / 2 * (index < 10 ? -1 : 1),
    0
]

This mathematical positioning ensures each edge aligns with the detected object's dimensions. The calculation uses the object's extent (width, height, depth) and creates a symmetrical arrangement around its center point.
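To make the arithmetic concrete, the short standalone sketch below evaluates the depth-edge positions (indices 8 through 11) for an assumed can-sized extent of 0.07 x 0.12 x 0.07 meters; the numbers are illustrative only:

import simd

// Assumed extent of a detected can in meters (width, height, depth) - illustrative values.
let extent = SIMD3<Float>(0.07, 0.12, 0.07)

// Depth edges run along the z-axis, so x and y select one of the four corners.
for index in 8...11 {
    let position = SIMD3<Float>(
        extent.x / 2 * (index % 2 == 0 ? -1 : 1),   // left or right face
        extent.y / 2 * (index < 10 ? -1 : 1),       // bottom or top face
        0                                           // centered along the depth axis
    )
    print("Edge \(index): \(position)")
}
// Prints positions near (-0.035, -0.06, 0.0), (0.035, -0.06, 0.0), (-0.035, 0.06, 0.0),
// and (0.035, 0.06, 0.0): the four edges running along the can's depth, offset by half
// the extent in x and y.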

This visualization system works in conjunction with our ImmersiveView to create real-time visual feedback. As the ImmersiveView receives position updates from ARKit, it calls refreshTracking on our visualization, which updates the transform matrices to maintain precise alignment between the virtual overlays and the physical object.

Informative View

ContentView With Instructions — Image By Author

ContentView.swift, provided in our project template, handles the informational interface for our app. Here's the implementation:

import SwiftUI
import RealityKit
import RealityKitContent

/**
 ContentView provides the main window interface for the application.
 Displays a rotating 3D model of the target object (Diet Soda can)
 along with clear instructions for users on how to use the detection feature.
 */
struct ContentView: View {
    // State to control the continuous rotation animation
    @State private var rotation: Double = 0
    
    var body: some View {
        VStack(spacing: 30) {
            // 3D model display with rotation animation
            Model3D(named: "SodaModel", bundle: realityKitContentBundle)
                .padding(.vertical, 20)
                .frame(width: 200, height: 200)
                .rotation3DEffect(
                    .degrees(rotation),
                    axis: (x: 0, y: 1, z: 0)
                )
                .onAppear {
                    // Create a continuous back-and-forth rotation animation
                    withAnimation(.linear(duration: 5.0).repeatForever(autoreverses: true)) {
                        rotation = 180
                    }
                }
            
            // Instructions for users
            VStack(spacing: 15) {
                Text("Food plan Soda Detection")
                    .font(.title)
                    .fontWeight(.bold)
                
                Text("Hold your weight loss program soda can in front of you to see it robotically detected and highlighted in your space.")
                    .font(.body)
                    .multilineTextAlignment(.center)
                    .foregroundColor(.secondary)
                    .padding(.horizontal)
            }
        }
        .padding()
        .frame(maxWidth: 400)
    }
}

This implementation displays our 3D-scanned soda model (SodaModel.usdz) with a rotating animation, giving users a clear reference for what the system is looking for. The rotation helps users understand how to present the object for optimal detection.

With these components in place, our application now provides a complete object detection experience. The system uses our trained model to recognize diet soda cans, creates precise visual indicators in real time, and provides clear user guidance through the informational interface.

Conclusion

Our Final App — Image By Author

In this tutorial, we've built a complete object detection system for visionOS that showcases the integration of several powerful technologies. Starting from 3D object capture, through ML model training in Create ML, to real-time detection using ARKit and RealityKit, we've created an app that seamlessly detects and tracks objects in the user's space.

This implementation represents just the beginning of what's possible with on-device machine learning in spatial computing. As hardware continues to evolve with more powerful Neural Engines and dedicated ML accelerators, and as frameworks like Core ML mature, we'll see increasingly sophisticated applications that can understand and interact with our physical world in real time. The combination of spatial computing and on-device ML opens up possibilities for applications ranging from advanced AR experiences to intelligent environmental understanding, all while maintaining user privacy and low latency.
