AutoCrawler
Transforming web crawling with intelligent automation.
Pitch

AutoCrawler lets developers generate web crawlers automatically with a web agent. It builds on published research and is meant to be adaptable to real-world data extraction and analysis tasks. Dive into the code to see how it uses AI for web crawling.

Description

AutoCrawler is the official implementation of the research paper "AutoCrawler: A Progressive Understanding Web Agent for Web Crawler Generation". The project provides the code for the crawler generation approach described in the paper.

[Figure: the AutoCrawler framework for generating web crawlers]

Key Features:

  • A progressive understanding web agent designed specifically for automatic web crawling tasks.
  • Implementations that are straightforward to reproduce, allowing researchers and developers to validate and utilize the algorithms in real-world applications.

How It Works:

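The workflow has three steps: generate a crawler, extract information with it, and evaluate the extraction on the SWDE dataset. Run the following commands in order: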

# Generate crawler with AutoCrawler
python crawler_generation.py \
    --pattern reflexion \
    --dataset swde \
    --model ChatGPT \
    --seed_website 3 \
    --save_name ChatGPT \
    --overwrite False

# Extract information with crawler
python crawler_extraction.py \
    --pattern autocrawler \
    --dataset swde \
    --model GPT4

# Evaluate the extraction on SWDE dataset
python run_swde/evaluate.py \
    --model GPT4 \
    --pattern autocrawler
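
The three commands above can also be chained from a small driver script. The following is a minimal sketch, not part of the AutoCrawler repository: the helper names `build_pipeline` and `run_pipeline` are hypothetical, and the script assumes it is run from the repository root where `crawler_generation.py`, `crawler_extraction.py`, and `run_swde/evaluate.py` live. It passes only the flags and values shown in the commands above.

```python
import shlex
import subprocess


def build_pipeline(python="python"):
    """Return the three AutoCrawler commands as argv lists, in run order.

    Hypothetical helper: flags and values are copied verbatim from the
    commands shown on this page; nothing else about the CLI is assumed.
    """
    generate = [
        python, "crawler_generation.py",
        "--pattern", "reflexion",
        "--dataset", "swde",
        "--model", "ChatGPT",
        "--seed_website", "3",
        "--save_name", "ChatGPT",
        "--overwrite", "False",
    ]
    extract = [
        python, "crawler_extraction.py",
        "--pattern", "autocrawler",
        "--dataset", "swde",
        "--model", "GPT4",
    ]
    evaluate = [
        python, "run_swde/evaluate.py",
        "--model", "GPT4",
        "--pattern", "autocrawler",
    ]
    return [generate, extract, evaluate]


def run_pipeline(repo_dir=".", dry_run=True):
    """Run each step in order, stopping at the first failure.

    With dry_run=True (the default) the commands are only printed.
    repo_dir is assumed to be the root of a local AutoCrawler checkout.
    """
    for cmd in build_pipeline():
        print("$ " + shlex.join(cmd))
        if not dry_run:
            # check=True raises CalledProcessError if a step exits non-zero,
            # so extraction never runs on a failed generation step.
            subprocess.run(cmd, check=True, cwd=repo_dir)
```

Calling `run_pipeline()` prints the commands without executing them; `run_pipeline(dry_run=False)` executes them one after another. Keeping the commands as argv lists avoids shell quoting issues and makes it easy to swap a flag value before running.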

Future Directions:

  • Adaptation of AutoCrawler for various real-world websites.
  • Development of a demo site to showcase the functionality and performance of the crawler.

For further details, see the full paper (arXiv:2404.12753).

If you find this project beneficial, please consider citing the work:

@misc{huang2024autocrawler,
      title={AutoCrawler: A Progressive Understanding Web Agent for Web Crawler Generation}, 
      author={Wenhao Huang et al.},
      year={2024},
      eprint={2404.12753},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

Unlock the potential of automated web crawling with AutoCrawler!