AI Computer Vision

AI Computer Vision is the foundational technology that allows PHX Terminal’s automation bots to see and interact with user-interface elements by recognizing them visually, much as a human user would. It is what enables the hover overlay application to identify fields and act inside any desktop application, modern or legacy, without depending on that application’s internal code, metadata, or APIs.

Why vision-based automation matters

Traditional, selector-based UI automation depends on stable internal identifiers (HTML attributes, view hierarchies, accessibility metadata). Such selectors break down in the heterogeneous reality of a lawyer’s desktop:

Virtual Desktop Infrastructure (VDI) environments such as Citrix or VMware, which are opaque to standard selectors.
Legacy or image-only content such as Flash, Silverlight, scanned PDFs, and images, where there is no addressable DOM at all.

Because Computer Vision works from what is rendered on screen rather than from underlying code, it is technology-agnostic and operates consistently across .NET, Java, web, and virtualized applications alike.

How it works

PHX Terminal’s approach mirrors the leading implementations in the field:

Neural-network visual understanding (UiPath model). A neural network combines a custom Screen OCR (Optical Character Recognition), text matching, and a multi-anchoring system to reliably locate and disambiguate on-screen elements.
Pure-vision screen parsing (Microsoft OmniParser model). A vision-only method extracts structured elements directly from UI screenshots to improve the action prediction of large multimodal models. Its methodology has three components:
1. Detecting interactable regions on the screen.
2. Extracting icon and text semantics.
3. Integrating structured bounding boxes with unique labels.

This lets the platform identify buttons, text fields, menus, and windows and perform UI tasks without relying on HTML or view hierarchies, making it highly adaptable across applications.
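The multi-anchoring idea can be illustrated with a minimal sketch. It is not UiPath's or PHX Terminal's implementation: the `Box` type, the function names, and the "smallest total distance to the anchors" heuristic are all assumptions chosen for illustration, and the OCR output is hard-coded rather than produced by a real engine. It shows how two identical "Save" buttons can be told apart by the labels that surround them.

```python
from dataclasses import dataclass
from math import hypot
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Box:
    """A screen region in pixels plus the text OCR read from it (hypothetical type)."""
    text: str
    x: int
    y: int
    w: int
    h: int

    @property
    def center(self) -> Tuple[float, float]:
        return (self.x + self.w / 2, self.y + self.h / 2)


def distance(a: Box, b: Box) -> float:
    """Euclidean distance between the centers of two boxes."""
    (ax, ay), (bx, by) = a.center, b.center
    return hypot(ax - bx, ay - by)


def find_by_text(boxes: List[Box], text: str) -> List[Box]:
    """Text matching: case-insensitive exact match on the OCR text."""
    wanted = text.strip().lower()
    return [b for b in boxes if b.text.strip().lower() == wanted]


def locate_with_anchors(
    boxes: List[Box], target_text: str, anchor_texts: List[str]
) -> Optional[Box]:
    """Pick the target candidate with the smallest total distance to all anchors.

    Assumed heuristic for illustration only. Returns None when the target is
    missing, or when it is ambiguous and no anchor is visible to resolve it.
    """
    candidates = find_by_text(boxes, target_text)
    anchors = [a for t in anchor_texts for a in find_by_text(boxes, t)]
    if not candidates:
        return None
    if not anchors:
        return candidates[0] if len(candidates) == 1 else None
    return min(candidates, key=lambda c: sum(distance(c, a) for a in anchors))


# Hard-coded stand-in for OCR output: two "Save" buttons on one screen.
screen = [
    Box("Client Details", 20, 10, 120, 20),
    Box("Matter No.", 20, 40, 80, 20),
    Box("Save", 20, 80, 60, 24),
    Box("Billing", 20, 300, 80, 20),
    Box("Save", 20, 340, 60, 24),
]

target = locate_with_anchors(screen, "Save", ["Client Details", "Matter No."])
print(target)  # the "Save" box at y=80, next to the client-details labels
```

Using more than one anchor is what makes the choice stable: a single nearby label can be shared by several candidates, while a combination of labels usually is not.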

flowchart TB
  SCREEN["Rendered screen — any app<br/>.NET · Java · web · Citrix / VMware"]
  SCREEN --> A1
  SCREEN --> B1
  subgraph UIPATH["Neural-network visual understanding (UiPath model)"]
    A1["Screen OCR"] --> A2["Text matching"] --> A3["Multi-anchoring system"]
  end
  subgraph OMNI["Pure-vision screen parsing (Microsoft OmniParser)"]
    B1["Detect interactable regions"] --> B2["Extract icon & text semantics"] --> B3["Integrate labeled bounding boxes"]
  end
  A3 --> RECOG["On-screen UI recognition<br/>buttons · fields · menus · windows"]
  B3 --> RECOG
  RECOG --> ACT["Technology-agnostic automation<br/>no HTML or view hierarchy required"]

Two complementary approaches — neural visual understanding and pure-vision parsing — read the rendered screen, so automation works from pixels rather than fragile selectors.
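The third pure-vision step, integrating structured bounding boxes with unique labels, can be sketched as follows. This is an illustrative assumption, not OmniParser's code: the detector and the icon/text models from steps 1 and 2 are replaced by hard-coded inputs, and `Region`, `iou`, `label_regions`, and the 0.5 overlap threshold are hypothetical names and values.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

BBox = Tuple[int, int, int, int]  # (x1, y1, x2, y2) in pixels


@dataclass
class Region:
    """Semantics extracted for one area of the screenshot (step 2)."""
    bbox: BBox
    kind: str      # "text" or "icon"
    content: str   # OCR text or an icon description


def iou(a: BBox, b: BBox) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ax1, ay1, ax2, ay2 = a
    bx1, by1, bx2, by2 = b
    iw = max(0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union else 0.0


def label_regions(
    interactable: List[BBox], semantics: List[Region], min_iou: float = 0.5
) -> List[Dict]:
    """Step 3: give each interactable box a unique id and its best-overlapping semantics."""
    labeled = []
    for element_id, bbox in enumerate(interactable):
        best = max(semantics, key=lambda s: iou(bbox, s.bbox), default=None)
        if best is not None and iou(bbox, best.bbox) >= min_iou:
            kind, content = best.kind, best.content
        else:
            kind, content = "unknown", ""  # detected, but nothing readable inside
        labeled.append(
            {"id": element_id, "bbox": bbox, "kind": kind, "content": content}
        )
    return labeled


# Hard-coded stand-ins for the outputs of steps 1 and 2.
interactable = [(10, 10, 110, 40), (200, 10, 230, 40), (300, 300, 340, 340)]
semantics = [
    Region((12, 12, 108, 38), "text", "Submit"),
    Region((200, 10, 230, 40), "icon", "settings gear"),
]

labeled = label_regions(interactable, semantics)
for element in labeled:
    print(element)
```

A structured list of this kind is what lets a downstream model refer to an element by its id instead of guessing pixel coordinates.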

Resilience to change

A key advantage is that vision-based automation can adjust its operations in real time as UI components evolve. When an application’s layout shifts, the system re-recognizes elements visually rather than failing on a stale selector — substantially reducing the ongoing maintenance burden that plagues conventional automation.
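A small sketch makes the contrast with stale selectors concrete. It assumes OCR results arrive as `(text, x, y, w, h)` tuples and uses Python's standard `difflib.SequenceMatcher` for tolerant text matching. The function names, the 0.8 similarity threshold, and the sample screens are hypothetical, and this is not a description of PHX Terminal's actual matching logic.

```python
from difflib import SequenceMatcher
from typing import List, Optional, Tuple

OcrBox = Tuple[str, int, int, int, int]  # (text, x, y, w, h), an assumed format


def best_match(
    ocr_boxes: List[OcrBox], label: str, min_ratio: float = 0.8
) -> Optional[OcrBox]:
    """Return the box whose text is most similar to `label`, or None if none is close."""
    best, best_ratio = None, min_ratio
    for box in ocr_boxes:
        ratio = SequenceMatcher(None, box[0].lower(), label.lower()).ratio()
        if ratio >= best_ratio:
            best, best_ratio = box, ratio
    return best


def click_point(ocr_boxes: List[OcrBox], label: str) -> Tuple[int, int]:
    """Re-recognize the element on the current frame and return its center."""
    box = best_match(ocr_boxes, label)
    if box is None:
        raise LookupError(f"no on-screen element resembling {label!r}")
    _, x, y, w, h = box
    return (x + w // 2, y + h // 2)


# The same application before and after a layout change. In the second frame
# the button has moved and the OCR stand-in has misread one character.
frame_v1 = [("File", 10, 5, 40, 20), ("Submit", 100, 200, 80, 30)]
frame_v2 = [("File", 10, 5, 40, 20), ("Submlt", 400, 50, 80, 30)]

print(click_point(frame_v1, "Submit"))  # (140, 215)
print(click_point(frame_v2, "Submit"))  # (440, 65)
```

Because the click target is recomputed from each frame, nothing stored from the old layout needs updating. The trade-off is that the similarity threshold has to be tuned: too low and unrelated labels match, too high and minor OCR noise causes a miss.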

Limitations to design around

Role in the platform

Computer Vision is the “eyes” of the system. It pairs with:

RPA & Intelligent Automation — to execute the actions it identifies.
Data Extraction (OCR/ICR) — to turn what it sees into structured data.
Desktop Implementation — to render the interactive overlay it draws on screen.

Together these make Computer Vision the core enabler of PHX Terminal’s universal compatibility — the single capability that lets one platform automate an entire firm’s fragmented software estate.