firecrawl-simple - Effortlessly turn websites into LLM-ready markdown with Firecrawl Simple.

firecrawl-simple

Pitch

Firecrawl Simple is a streamlined, self-hostable tool for crawling websites and converting them into LLM-ready markdown. Features unrelated to crawling and scraping have been stripped away, which makes the codebase easier to contribute to while keeping the focus on stealth and performance. Join the community and help shape the future of web scraping.

Description

Firecrawl Simple is a streamlined version of the original Firecrawl project, designed for self-hosting and for ease of contribution. The billing logic and AI features have been removed to keep the codebase simple.

Key Features

Crawl Any Website: Easily crawl and convert websites into LLM-ready Markdown format.
Updated Technology Stack: Scraping runs on puppeteer-cluster and puppeteer-extra with stealth plugins, so guarded pages no longer require additional services such as fire-engine or scrapingbee.
Env Configurations for Security: For maximum stealth, set a 2captcha token and proxy credentials in your environment variables.

API Functionality

Firecrawl Simple supports three core routes: /scrape, /crawl, and /crawl/{id}. For full details, see the project's OpenAPI specification. Notably, the response for the /crawl/{id} route has been simplified by removing the creditsUsed field.
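As a rough illustration of how the three routes fit together, the sketch below builds (but does not send) one request per route using only the Python standard library. The base URL, the /v1 prefix (borrowed from the curl examples in this document), the use of GET for /crawl/{id}, and the helper name build_request are assumptions made for the example; the OpenAPI specification remains the authority on the actual contract.

```python
import json
import urllib.request

BASE_URL = "http://localhost:3002"  # assumption: a local self-hosted instance
API_PREFIX = "/v1"                  # assumption: prefix used by the curl examples


def build_request(method, route, payload=None, api_key=None):
    """Return an unsent urllib Request for one of the API routes (hypothetical helper)."""
    headers = {"Content-Type": "application/json"}
    if api_key:
        headers["Authorization"] = f"Bearer {api_key}"
    body = json.dumps(payload).encode("utf-8") if payload is not None else None
    return urllib.request.Request(
        BASE_URL + API_PREFIX + route, data=body, headers=headers, method=method
    )


# One request per route; nothing is sent until urllib.request.urlopen() is called.
scrape_req = build_request("POST", "/scrape", {"url": "https://docs.firecrawl.dev"})
crawl_req = build_request("POST", "/crawl", {"url": "https://docs.firecrawl.dev", "limit": 10})
status_req = build_request("GET", "/crawl/JOB_ID")  # JOB_ID is a placeholder

for req in (scrape_req, crawl_req, status_req):
    print(req.get_method(), req.full_url)
```

Keeping request construction separate from sending makes it easy to inspect the URL, method, and body before pointing the code at a real instance.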

Contribute and Collaborate

We are seeking enthusiastic contributors to help maintain and expand this project. There are even paid part-time positions available for those who wish to take a more active role in the ongoing development.

Fork for Better Functionality

This project is a fork of the original Firecrawl repository, created to provide a more stable foundation for self-hosting. Removing the SaaS and AI complexity makes Firecrawl Simple a better fit for our use case and easier to scale on Kubernetes.

Self-Hosting Quickly

Run the required services with a single docker-compose file:

name: firecrawl
services:
  # Browser service: the service key says "playwright", but the image is Puppeteer-based.
  playwright-service:
    image: trieve/puppeteer-service-ts:v0.0.6
    environment:
      - PORT=3000
      - PROXY_SERVER=${PROXY_SERVER}
      - BLOCK_MEDIA=${BLOCK_MEDIA}
      - MAX_CONCURRENCY=${MAX_CONCURRENCY}
      - TWOCAPTCHA_TOKEN=${TWOCAPTCHA_TOKEN}
    networks:
      - backend

  firecrawl-api:
    image: trieve/firecrawl:v0.0.46
    networks:
      - backend
    environment:
      - PORT=${PORT:-3002}
    depends_on:
      - playwright-service
    ports:
      - "3002:3002"

  redis:
    image: redis:alpine
    networks:
      - backend
    command: redis-server --bind 0.0.0.0

networks:
  backend:
    driver: bridge
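
Because the compose file above reads several values from the environment, it can help to check them before starting the stack. The following is a minimal sketch of such a preflight check; the variable names are taken from the compose file, while the grouping into "stealth" and "tuning" variables and the helper name missing_vars are assumptions made for illustration.

```python
import os

# Variable names as they appear in the compose file.
STEALTH_VARS = ("PROXY_SERVER", "TWOCAPTCHA_TOKEN")
TUNING_VARS = ("BLOCK_MEDIA", "MAX_CONCURRENCY")


def missing_vars(env, names):
    """Return the names that are unset or blank in the given mapping (hypothetical helper)."""
    return [name for name in names if not env.get(name, "").strip()]


if __name__ == "__main__":
    # Warn about every variable the compose file would substitute with an empty value.
    for name in missing_vars(os.environ, STEALTH_VARS + TUNING_VARS):
        print(f"warning: {name} is not set")
```

Running the script in the same shell used for docker compose shows which variables would be passed to the containers as empty strings.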

Advanced Crawling

Firecrawl Simple can crawl a URL together with all of its accessible subpages. To start a crawl, send a POST request to the crawl endpoint:

curl -X POST https://<your-url>/v1/crawl \
    -H 'Content-Type: application/json' \
    -H 'Authorization: Bearer fc-YOUR_API_KEY' \
    -d '{"url": "https://docs.firecrawl.dev", "limit": 100, "scrapeOptions": {"formats": ["markdown", "html"]}}'

The response contains a job ID, which you can pass to the /crawl/{id} route to check the status of the crawl.
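
Checking on a crawl job could look like the sketch below, which polls the /crawl/{id} route until the job finishes. It assumes the /v1 prefix from the curl example, a JSON status payload with a "status" field, and the terminal values "completed" and "failed"; none of these are confirmed by this document, so verify them against the OpenAPI specification. The helper names fetch_json and wait_for_crawl are hypothetical.

```python
import json
import time
import urllib.request


def fetch_json(url, api_key=None):
    """GET a URL and decode the JSON body (performs a real HTTP request)."""
    headers = {"Authorization": f"Bearer {api_key}"} if api_key else {}
    request = urllib.request.Request(url, headers=headers)
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)


def wait_for_crawl(base_url, job_id, fetch=fetch_json, interval=2.0, max_polls=150):
    """Poll the crawl status route until the job reports a terminal status."""
    url = f"{base_url}/v1/crawl/{job_id}"
    for _ in range(max_polls):
        result = fetch(url)
        # Assumption: "completed" and "failed" are the terminal status values.
        if result.get("status") in ("completed", "failed"):
            return result
        time.sleep(interval)
    raise TimeoutError(f"crawl {job_id} did not finish after {max_polls} polls")
```

A call such as wait_for_crawl("https://<your-url>", job_id) would block until the crawl ends; the fetch parameter lets you add an API key, for example with lambda u: fetch_json(u, api_key="YOUR_API_KEY"), or substitute a stub in tests.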

Scraping Made Easy

To scrape content directly from a URL, execute:

curl -X POST https://<your-url>/v1/scrape \
    -H 'Content-Type: application/json' \
    -H 'Authorization: Bearer fc-YOUR_API_KEY' \
    -d '{"url": "https://docs.firecrawl.dev", "formats": ["markdown", "html"]}'

The response contains the page content in each requested format (here, Markdown and HTML).
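
The same scrape request can be issued from Python with the standard library, as in the sketch below. The assumption that the content is nested under a top-level "data" object keyed by format name is not stated in this document and should be checked against the OpenAPI specification; scrape and extract_format are hypothetical helper names.

```python
import json
import urllib.request


def scrape(base_url, page_url, formats=("markdown", "html"), api_key=None):
    """POST to /v1/scrape and return the decoded JSON response (real HTTP call)."""
    payload = json.dumps({"url": page_url, "formats": list(formats)}).encode("utf-8")
    headers = {"Content-Type": "application/json"}
    if api_key:
        headers["Authorization"] = f"Bearer {api_key}"
    request = urllib.request.Request(
        f"{base_url}/v1/scrape", data=payload, headers=headers, method="POST"
    )
    with urllib.request.urlopen(request, timeout=60) as response:
        return json.load(response)


def extract_format(response, fmt="markdown"):
    """Return one format from a scrape response, or None if it is absent."""
    # Assumption: the content lives under "data", keyed by format name.
    data = response.get("data") or {}
    return data.get(fmt)
```

For example, extract_format(scrape("https://<your-url>", "https://docs.firecrawl.dev")) would yield the Markdown text if the response has the assumed shape, and None otherwise.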

Firecrawl Simple is a good fit for developers and companies that want to collect web data efficiently. Join us in developing it!