George
by logankeenan
Editor's pick
Control your computer effortlessly with natural language commands.
Pitch

George is an API that uses AI to enable natural-language control of your computer. By interpreting UI elements dynamically, it adapts to interface changes, overcoming the limitations of traditional automation tools. Experience a more intuitive way to automate tasks with George.

Description

George is an innovative API harnessing the power of artificial intelligence to simplify computer control with natural language. Unlike conventional frameworks that depend on predefined static selectors, George employs AI vision technology to interpret the screen dynamically. This approach not only enhances resilience to UI changes but also enables automation of interfaces that traditional tools struggle to manage.

Key Features

  • Dynamic UI Handling: AI-driven interpretation allows George to adapt to changing user interfaces seamlessly.
  • Natural Language Processing: Control your computer through simple and intuitive natural language commands, bridging the gap between human interaction and machine response.
  • Robust Automation: Effective in automating complex workflows that are often difficult for standard automation tools to execute.

Example Usage

Below is a simple example demonstrating how to use George in a Rust application:

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let mut george = George::new("https://your-molmo-llm.com");
    george.start().await?;
    george.open_chrome("https://some-website.com").await?;
    george.click("sign in link").await?;
    george.fill_in("input Email text field", "your@email.com").await?;
    george.fill_in("input Password text field", "super-secret").await?;
    george.click("sign in button").await?;
    george.close_chrome().await?;
    george.stop().await?;
    Ok(())
}

George is built on top of Molmo, a vision-based large language model (LLM) that translates natural language descriptions into screen coordinates. This unique mechanism allows for accurate identification and interaction with UI elements.
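As a concrete sketch of that mechanism: Molmo answers pointing prompts with markup along the lines of `<point x="23.5" y="61.0" alt="sign in link">sign in link</point>`, where x and y are percentages of the image dimensions. The parsing below is a minimal illustration under that assumption, not George's actual implementation:

```rust
// Extract a numeric attribute (e.g. x="23.5") from a Molmo point response.
fn attr(response: &str, name: &str) -> Option<f64> {
    let key = format!("{}=\"", name);
    let start = response.find(&key)? + key.len();
    let end = response[start..].find('"')? + start;
    response[start..end].parse().ok()
}

// Scale Molmo's percentage coordinates to pixels for a given screen size.
fn point_to_pixels(response: &str, width: f64, height: f64) -> Option<(f64, f64)> {
    let x = attr(response, "x")?;
    let y = attr(response, "y")?;
    Some((x / 100.0 * width, y / 100.0 * height))
}

fn main() {
    let response = r#"<point x="23.5" y="61.0" alt="sign in link">sign in link</point>"#;
    if let Some((px, py)) = point_to_pixels(response, 1920.0, 1080.0) {
        println!("click at ({px:.0}, {py:.0})");
    }
}
```

With the coordinates in hand, a driver only needs to move the cursor there and click, which is what makes selector-free automation possible.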

Explore Further

  • Try the live demo of Molmo at Molmo Demo.
  • For advanced setups, check out running Molmo on Docker or bare metal configurations to optimize GPU use.
  • Join us in shaping the future of AI automation: the roadmap includes a user-friendly UI for building selectors, enhanced debugging capabilities, and cross-language bindings for developers.
5 comments
seymurkafkas
Dec 6, 2024

Very cool project! Starred. How come you used Molmo?

seymurkafkas
Dec 6, 2024

Did you find that it works better than the existing multimodal models?

logankeenan
Dec 6, 2024

Thanks! Molmo has the unique ability to provide the x,y coordinate of an object. Other visual LLMs are aware of the objects in an image, but not the location.

seymurkafkas
Dec 6, 2024

Wow I wasn't aware of that. Is it always reliable? In a personal project I was trying to use https://github.com/microsoft/OmniParser for UI element detection and then feeding that into an LLM (for reasoning & planning).

logankeenan
Dec 6, 2024

Very reliable once you get the prompt/selector right. You need to turn the temperature down and the top_k up when sending the data to the LLM. Those params are used to add randomness to the LLM's output; without that randomness, the LLM acts like a static function, which is what we want in our case.

https://github.com/logankeenan/george/blob/2d21cd171baef8ffd9e844fbc941924eae667ba4/george-ai/src/daemon.rs#L303
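For illustration, the knobs involved might look like the sketch below; the field names and values here are assumptions for the example, not the actual schema in daemon.rs:

```rust
// Sketch only: names and values are assumptions, not George's real config.
#[derive(Debug, PartialEq)]
struct SamplingParams {
    temperature: f64, // lower => less randomness in the output
    top_k: u32,       // tuned alongside temperature; see daemon.rs for real values
}

fn selector_params() -> SamplingParams {
    // With randomness dialed down, the same selector prompt should
    // yield the same coordinates on every call.
    SamplingParams { temperature: 0.0, top_k: 50 }
}

fn main() {
    println!("{:?}", selector_params());
}
```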
