Tarsier offers a cutting-edge solution for enhancing web interaction agents through advanced visual tagging and perception utilities. By bridging the gap between visual data and language models, Tarsier enables agents to intelligently understand and interact with web elements. Discover how it transforms web automation and improves task execution for LLMs without vision.
Tarsier: Vision Utilities for Web Interaction Agents
Tarsier is a powerful tool designed to enhance the efficacy of web interaction agents through advanced vision utilities. Developed by Reworkd, Tarsier addresses critical challenges encountered by large language models (LLMs) when automating web interactions.
Key Features
- Interactive Mapping: Tarsier assigns unique IDs to the interactable elements on a webpage, giving LLMs a simple vocabulary for acting on them. For instance, an LLM can issue a command like CLICK [23], where [23] refers to a specific button or link on the page (a minimal parsing sketch follows this list).
- Improved Perception: Tarsier tags the visible elements on a page, including buttons, links, and text fields, and can also produce an OCR (Optical Character Recognition) representation of the webpage that a text-only LLM can interpret.
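As a quick illustration of the interactive-mapping idea, here is a minimal sketch of how an agent might parse an LLM reply into an action and a tag ID. The helper below is not part of the Tarsier API; its name and the general "ACTION [ID]" format are assumptions for the example:

import re

def parse_llm_command(reply: str) -> tuple[str, int]:
    # Hypothetical helper: extract an action name and a Tarsier tag ID
    # from a reply formatted like "CLICK [23]"
    match = re.match(r"(\w+)\s*\[(\d+)\]", reply.strip())
    if not match:
        raise ValueError(f"Unrecognized command: {reply!r}")
    return match.group(1), int(match.group(2))

parse_llm_command("CLICK [23]")  # -> ("CLICK", 23)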
Performance Insights
In our internal benchmarks, unimodal GPT-4 paired with Tarsier's text representation outperformed GPT-4V paired with Tarsier's tagged screenshots by 10-20%, demonstrating how effectively Tarsier supports LLMs that lack visual capabilities.
How It Works
Tarsier processes webpages by tagging interactable elements with a clear structure:
- [#ID]: for text inputs (like text boxes)
- [@ID]: for hyperlinks
- [$ID]: for other interactive elements (like buttons)
- [ID]: for plain text elements, when text tagging is enabled
This tagging scheme gives LLMs a consistent way to read and act on a page, making web automation more intuitive and efficient; an illustrative example of tagged page text follows below.
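To make the tag formats concrete, here is a hypothetical illustration (not actual Tarsier output; the page content and exact spacing are invented) of how a simple login page might read once tagged:

[0] Welcome back
[#1] Email address
[@2] Forgot your password?
[$3] Sign in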
Sample Usage
For practical examples of Tarsier in action, check out our cookbook, which includes an autonomous LangChain web agent and a LlamaIndex web agent.
Here’s a quick code snippet demonstrating basic usage:
import asyncio

from playwright.async_api import async_playwright
from tarsier import Tarsier, GoogleVisionOCRService

async def main():
    # Initialize the OCR service with a Google Cloud service account key,
    # then create a Tarsier instance backed by it
    ocr_service = GoogleVisionOCRService('./google_service_acc_key.json')
    tarsier = Tarsier(ocr_service)

    async with async_playwright() as p:
        browser = await p.chromium.launch()
        page = await browser.new_page()
        await page.goto("https://example.com")

        # Tag the page; returns its text representation and a mapping
        # from tag IDs to the XPaths of the tagged elements
        page_text, tag_to_xpath = await tarsier.page_to_text(page)
        print(page_text)

if __name__ == '__main__':
    asyncio.run(main())
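The tag_to_xpath mapping returned above is what lets an agent act on an LLM's commands. Here is a hedged sketch, assuming the LLM replied with something like CLICK [23] and that tag 23 exists in the mapping (the helper name is an assumption, not part of Tarsier's API), of how a tag could be resolved back to its element and clicked with Playwright:

async def click_tag(page, tag_to_xpath, tag_id: int) -> None:
    # Hypothetical helper: look up the XPath recorded for a Tarsier tag ID
    # and click the corresponding element through Playwright
    xpath = tag_to_xpath[tag_id]
    await page.locator(f"xpath={xpath}").click()

# Inside main(), after calling page_to_text(...), an agent loop might run:
#     await click_tag(page, tag_to_xpath, 23)  # acts on an LLM reply like "CLICK [23]"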
Supported OCR Services
- Google Cloud Vision
- Microsoft Azure Computer Vision (coming soon)
- Amazon Textract (coming soon)
Tarsier gives web interaction agents visual awareness of the pages they act on and fits alongside existing frameworks such as LangChain and LlamaIndex, helping developers automate tasks effectively. For more details, visit our official site.
Start enhancing your web interaction agents today!