Firecrawl Simple is a stripped-down version of the original Firecrawl project, built for self-hosting and for ease of contribution. It crawls websites and converts them into LLM-ready Markdown. The complex billing logic and AI features of the original have been removed to keep it simple to run and to work on, and contributions from the community are welcome.
Key Features
- Crawl Any Website: Easily crawl and convert websites into LLM-ready Markdown format.
- Updated Technology Stack: Uses puppeteer-cluster and puppeteer-extra with stealth plugins, so additional services such as fire-engine or scrapingbee are not needed for guarded pages.
- Env Configurations for Security: For maximum stealth, set a 2captcha token and proxy credentials in your environment variables.
API Functionality
Firecrawl Simple supports three basic routes: /scrape, /crawl/{id}, and /crawl. For full details, see the project's OpenAPI Specification. Notably, the API response for the /crawl/{id} route has been simplified by removing the creditsUsed field.
Contribute and Collaborate
We are seeking enthusiastic contributors to help maintain and expand this project. There are even paid part-time positions available for those who wish to take a more active role in the ongoing development.
Fork for Better Functionality
This project is a fork of the original Firecrawl repository, created to provide a more stable foundation suitable for self-hosting. By removing the complexities related to SaaS and AI, Firecrawl Simple aligns better with our specific use-case while allowing for greater scalability on Kubernetes.
Self-Hosting Quickly
Bring up the necessary services with a simple docker-compose setup:
name: firecrawl
services:
  playwright-service:
    image: trieve/puppeteer-service-ts:v0.0.6
    environment:
      - PORT=3000
      - PROXY_SERVER=${PROXY_SERVER}
      - BLOCK_MEDIA=${BLOCK_MEDIA}
      - MAX_CONCURRENCY=${MAX_CONCURRENCY}
      - TWOCAPTCHA_TOKEN=${TWOCAPTCHA_TOKEN}
    networks:
      - backend
  firecrawl-api:
    image: trieve/firecrawl:v0.0.46
    networks:
      - backend
    environment:
      - PORT=${PORT:-3002}
    depends_on:
      - playwright-service
    ports:
      - "3002:3002"
  redis:
    image: redis:alpine
    networks:
      - backend
    command: redis-server --bind 0.0.0.0
networks:
  backend:
    driver: bridge
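The compose file above reads PROXY_SERVER, BLOCK_MEDIA, MAX_CONCURRENCY, and TWOCAPTCHA_TOKEN through ${...} substitution, and Docker Compose substitutes an empty string for any variable that is unset. A small pre-flight check can catch this before the stack starts. The following is a minimal sketch: the helper name unset_compose_vars is hypothetical and not part of the project, it only inspects the process environment (not an .env file), and whether each variable is strictly required by the services is an assumption you should verify.

```python
import os

# Variables referenced via ${...} in the compose file above.
COMPOSE_VARS = ("PROXY_SERVER", "BLOCK_MEDIA", "MAX_CONCURRENCY", "TWOCAPTCHA_TOKEN")


def unset_compose_vars(env=None):
    """Return the names in COMPOSE_VARS that are missing or empty in env.

    Hypothetical helper for illustration; env defaults to os.environ.
    """
    env = os.environ if env is None else env
    return [name for name in COMPOSE_VARS if not env.get(name)]


if __name__ == "__main__":
    missing = unset_compose_vars()
    if missing:
        print("Unset (Compose will substitute empty strings):", ", ".join(missing))
    else:
        print("All compose variables are set.")
```

Run it in the same shell from which you start docker-compose, so that it sees the same environment.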
Advanced Crawling
Firecrawl enables comprehensive crawling of a URL, including all accessible subpages. To initiate a crawl, use the following command:
curl -X POST https://<your-url>/v1/crawl \
-H 'Content-Type: application/json' \
-H 'Authorization: Bearer fc-YOUR_API_KEY' \
-d '{"url": "https://docs.firecrawl.dev", "limit": 100, "scrapeOptions": {"formats": ["markdown", "html"]}}'
The response contains a job ID, which you can use to check the status of the crawl through the /crawl/{id} route.
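The job ID returned by the crawl request can be polled until the crawl finishes. Below is a minimal sketch using only the Python standard library. Several details are assumptions rather than documented behavior: that the status route lives at GET /v1/crawl/{id} (mirroring the /v1 prefix of the curl command), that the JSON body carries a status field, and that "completed" and "failed" are its terminal values. Check the OpenAPI Specification before relying on any of them. The function names are hypothetical.

```python
import json
import time
import urllib.request

# Assumed terminal values of the "status" field; verify against the OpenAPI spec.
TERMINAL_STATUSES = ("completed", "failed")


def build_status_request(base_url, job_id, api_key):
    """Build a GET request for the (assumed) /v1/crawl/{id} status route."""
    return urllib.request.Request(
        f"{base_url.rstrip('/')}/v1/crawl/{job_id}",
        headers={"Authorization": f"Bearer {api_key}"},
        method="GET",
    )


def fetch_json(request):
    """Send the request and parse the response body as JSON."""
    with urllib.request.urlopen(request, timeout=30) as resp:
        return json.load(resp)


def wait_for_crawl(base_url, job_id, api_key, interval=5.0, max_polls=60, fetch=fetch_json):
    """Poll the status route until a terminal status appears or polls run out."""
    request = build_status_request(base_url, job_id, api_key)
    for _ in range(max_polls):
        body = fetch(request)
        if body.get("status") in TERMINAL_STATUSES:
            return body
        time.sleep(interval)
    raise TimeoutError(f"crawl {job_id} did not finish after {max_polls} polls")
```

For example, wait_for_crawl("http://localhost:3002", job_id, "fc-YOUR_API_KEY") would poll a local instance exposed on port 3002, as in the compose file. The fetch parameter exists only so the polling loop can be exercised without a running server.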
Scraping Made Easy
To scrape content directly from a URL, execute:
curl -X POST https://<your-url>/v1/scrape \
-H 'Content-Type: application/json' \
-H 'Authorization: Bearer fc-YOUR_API_KEY' \
-d '{"url": "https://docs.firecrawl.dev", "formats": ["markdown", "html"]}'
The response contains the page content in each requested format, in this case Markdown and HTML.
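The same scrape call can be made from a script. The sketch below mirrors the curl command using the Python standard library; the function names are hypothetical, the base URL is whatever your deployment exposes, and the shape of the returned JSON is not assumed here, so the parsed body is returned as-is.

```python
import json
import urllib.request


def build_scrape_request(base_url, api_key, url, formats=("markdown", "html")):
    """Build the POST request for /v1/scrape with the same body as the curl example."""
    payload = json.dumps({"url": url, "formats": list(formats)}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url.rstrip('/')}/v1/scrape",
        data=payload,
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )


def scrape(base_url, api_key, url, formats=("markdown", "html")):
    """Send the scrape request and return the parsed JSON response unchanged."""
    request = build_scrape_request(base_url, api_key, url, formats)
    with urllib.request.urlopen(request, timeout=60) as resp:
        return json.load(resp)
```

Calling scrape("http://localhost:3002", "fc-YOUR_API_KEY", "https://docs.firecrawl.dev") would target a local instance on port 3002; inspect the returned dictionary to see which keys your version of the API provides.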
Firecrawl Simple represents an ideal solution for developers and companies looking to harness web data efficiently and effectively. Join us in developing this exceptional tool!