George is an API that uses AI to let you control your computer with natural language. Unlike conventional automation frameworks that depend on predefined static selectors, George uses AI vision to interpret the screen dynamically. This makes it resilient to UI changes and lets it automate interfaces that traditional tools struggle to handle.
## Key Features
- Dynamic UI Handling: AI-driven interpretation allows George to adapt to changing user interfaces seamlessly.
- Natural Language Processing: Control your computer through simple and intuitive natural language commands, bridging the gap between human interaction and machine response.
- Robust Automation: Effective in automating complex workflows that are often difficult for standard automation tools to execute.
## Example Usage
Below is a simple example demonstrating how to use George in a Rust application:
```rust
#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let mut george = George::new("https://your-molmo-llm.com");
    george.start().await?;
    george.open_chrome("https://some-website.com").await?;
    george.click("sign in link").await?;
    george.fill_in("input Email text field", "your@email.com").await?;
    george.fill_in("input Password text field", "super-secret").await?;
    george.click("sign in button").await?;
    george.close_chrome().await?;
    george.stop().await?;
    Ok(())
}
```
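Because selectors are natural-language descriptions resolved against the live screen rather than cached coordinates, a common pattern is to retry a lookup briefly while the UI settles (e.g. while a page is still rendering). The following is a minimal, standalone sketch of such a retry helper — the helper name and behavior are illustrative, not part of George's API:

```rust
use std::{thread, time::Duration};

/// Retry a lookup until it returns a value or `attempts` runs out,
/// sleeping between tries. Useful when the UI is still rendering
/// and the vision model cannot yet locate the element.
fn retry<T>(attempts: u32, delay: Duration, mut lookup: impl FnMut() -> Option<T>) -> Option<T> {
    for _ in 0..attempts {
        if let Some(found) = lookup() {
            return Some(found);
        }
        thread::sleep(delay);
    }
    None
}

fn main() {
    // Simulate an element that only becomes visible on the third query.
    let mut calls = 0;
    let coords = retry(5, Duration::from_millis(10), || {
        calls += 1;
        if calls >= 3 { Some((480, 540)) } else { None }
    });
    assert_eq!(coords, Some((480, 540)));
    println!("found element at {:?}", coords);
}
```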
George is built on top of Molmo, a vision-based large language model (LLM) that translates natural language descriptions into screen coordinates. This unique mechanism allows for accurate identification and interaction with UI elements.
## Explore Further
- Try the live demo of Molmo at Molmo Demo.
- For advanced setups, check out running Molmo on Docker or bare metal configurations to optimize GPU use.
- Follow the roadmap: a UI for constructing selectors, improved debugging, and cross-language bindings for developers.
Very cool project! Starred. How come you used Molmo?
Did you find that it works better than the existing multimodal models?
Thanks! Molmo has the unique ability to provide the x,y coordinate of an object. Other visual LLMs are aware of the objects in an image, but not the location.
Wow I wasn't aware of that. Is it always reliable? In a personal project I was trying to use https://github.com/microsoft/OmniParser for UI element detection and then feeding that into an LLM (for reasoning & planning).
Very reliable once you get the prompt/selector right. You need to turn the temperature down and top_k down when sending the request to the LLM. Those params are used to add randomness to the output; without that randomness, the LLM acts like a static function. In our case, we want that.
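To make the point concrete: with temperature near zero and top_k restricted to 1, sampling collapses to a plain argmax, so the same prompt always yields the same answer. A toy sketch of that effect (logit values are made up; real inference servers expose `temperature` and `top_k` as request parameters):

```rust
/// Pick a token index given logits, a temperature, and a top-k cutoff.
/// With top_k == 1 the choice is always the highest logit, so the
/// model behaves like a deterministic function of its input.
fn sample(logits: &[f64], temperature: f64, top_k: usize) -> usize {
    // Scale logits by temperature (lower temperature sharpens the distribution).
    let scaled: Vec<f64> = logits.iter().map(|l| l / temperature.max(1e-6)).collect();
    // Sort candidate indices by scaled logit, descending, and keep the top k.
    let mut idx: Vec<usize> = (0..scaled.len()).collect();
    idx.sort_by(|&a, &b| scaled[b].partial_cmp(&scaled[a]).unwrap());
    idx.truncate(top_k.max(1));
    // A real sampler would now draw randomly among the survivors;
    // with top_k == 1 only the argmax survives, so we just return it.
    idx[0]
}

fn main() {
    let logits = [1.2, 3.7, 0.4, 2.9];
    // top_k = 1 → always the argmax (index 1), run after run.
    assert_eq!(sample(&logits, 0.1, 1), 1);
    println!("deterministic pick: {}", sample(&logits, 0.1, 1));
}
```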