George is an API that uses AI to let you control your computer with natural language. Unlike conventional automation frameworks that depend on predefined static selectors, George uses AI vision to interpret the screen dynamically. This makes it resilient to UI changes and lets it automate interfaces that traditional tools struggle to handle.
## Key Features
- Dynamic UI Handling: AI-driven interpretation allows George to adapt to changing user interfaces seamlessly.
- Natural Language Processing: Control your computer through simple and intuitive natural language commands, bridging the gap between human interaction and machine response.
- Robust Automation: Effective in automating complex workflows that are often difficult for standard automation tools to execute.
## Example Usage
Below is a simple example demonstrating how to use George in a Rust application:
```rust
#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let mut george = George::new("https://your-molmo-llm.com");
    george.start().await?;
    george.open_chrome("https://some-website.com").await?;
    george.click("sign in link").await?;
    george.fill_in("input Email text field", "your@email.com").await?;
    george.fill_in("input Password text field", "super-secret").await?;
    george.click("sign in button").await?;
    george.close_chrome().await?;
    george.stop().await?;
    Ok(())
}
```
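Because selectors are natural-language descriptions resolved against the live screen rather than cached coordinates, a common pattern is to retry a lookup briefly while the UI settles (e.g. while a page is still rendering). The following is a minimal, standalone sketch of such a retry helper — the helper name and behavior are illustrative, not part of George's API:

```rust
use std::{thread, time::Duration};

/// Retry a lookup until it returns a value or `attempts` runs out,
/// sleeping between tries. Useful when the UI is still rendering
/// and the vision model cannot yet locate the element.
fn retry<T>(attempts: u32, delay: Duration, mut lookup: impl FnMut() -> Option<T>) -> Option<T> {
    for _ in 0..attempts {
        if let Some(found) = lookup() {
            return Some(found);
        }
        thread::sleep(delay);
    }
    None
}

fn main() {
    // Simulate an element that only becomes visible on the third query.
    let mut calls = 0;
    let coords = retry(5, Duration::from_millis(10), || {
        calls += 1;
        if calls >= 3 { Some((480, 540)) } else { None }
    });
    assert_eq!(coords, Some((480, 540)));
    println!("found element at {:?}", coords);
}
```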
George is built on top of Molmo, a vision-based large language model (LLM) that translates natural language descriptions into screen coordinates. This unique mechanism allows for accurate identification and interaction with UI elements.
## Explore Further
- Try the live demo of Molmo at Molmo Demo.
- For advanced setups, check out running Molmo on Docker or bare metal configurations to optimize GPU use.
- Follow the roadmap: a UI for constructing selectors, improved debugging, and cross-language bindings for developers.
Very cool project! Starred. How come you used Molmo?
Did you find that it works better than the existing multimodal models?
Thanks! Molmo has the unique ability to provide the x,y coordinate of an object. Other visual LLMs are aware of the objects in an image, but not the location.
Wow I wasn't aware of that. Is it always reliable? In a personal project I was trying to use https://github.com/microsoft/OmniParser for UI element detection and then feeding that into an LLM (for reasoning & planning).
Very reliable once you get the prompt/selector right. You need to turn the temperature down and top_k down when sending the request to the LLM. Those params are used to add randomness to the output; without that randomness, the LLM acts like a static function. In our case, we want that.
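To make the point concrete: with temperature near zero and top_k restricted to 1, sampling collapses to a plain argmax, so the same prompt always yields the same answer. A toy sketch of that effect (logit values are made up; real inference servers expose `temperature` and `top_k` as request parameters):

```rust
/// Pick a token index given logits, a temperature, and a top-k cutoff.
/// With top_k == 1 the choice is always the highest logit, so the
/// model behaves like a deterministic function of its input.
fn sample(logits: &[f64], temperature: f64, top_k: usize) -> usize {
    // Scale logits by temperature (lower temperature sharpens the distribution).
    let scaled: Vec<f64> = logits.iter().map(|l| l / temperature.max(1e-6)).collect();
    // Sort candidate indices by scaled logit, descending, and keep the top k.
    let mut idx: Vec<usize> = (0..scaled.len()).collect();
    idx.sort_by(|&a, &b| scaled[b].partial_cmp(&scaled[a]).unwrap());
    idx.truncate(top_k.max(1));
    // A real sampler would now draw randomly among the survivors;
    // with top_k == 1 only the argmax survives, so we just return it.
    idx[0]
}

fn main() {
    let logits = [1.2, 3.7, 0.4, 2.9];
    // top_k = 1 → always the argmax (index 1), run after run.
    assert_eq!(sample(&logits, 0.1, 1), 1);
    println!("deterministic pick: {}", sample(&logits, 0.1, 1));
}
```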