Justin Garrison demonstrates how to use a Raspberry Pi or other single-board computer to run a local Personal Data Server (PDS) for the microblogging platform Bluesky, allowing users to store and manage their own data.
This project provides an LLM Websearch Agent that uses a local SearXNG server for search. It includes Python scripts and a Bash script for interacting with an LLM to summarize search results.
Scraperr is a self-hosted web application for scraping data from web pages using XPath. It supports queuing URLs and managing scrape elements, and provides features such as job management, user login, and integration with AI services.
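As a rough illustration of XPath-based extraction in general (this is not Scraperr's own code), the sketch below pulls link targets and link text out of a small page using Python's standard library. The sample page, the `item` class name, and the `extract_links` helper are all hypothetical. ElementTree supports only a subset of XPath and needs well-formed markup, so real-world HTML usually calls for a more lenient parser.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample page used only for illustration.
SAMPLE_PAGE = """
<html>
  <body>
    <div class="item"><a href="/a">First</a></div>
    <div class="item"><a href="/b">Second</a></div>
    <div class="other"><a href="/c">Ignored</a></div>
  </body>
</html>
"""


def extract_links(page: str) -> list[tuple[str, str]]:
    """Return (href, text) pairs for anchors inside <div class="item">."""
    root = ET.fromstring(page.strip())
    # ElementTree's limited XPath: descendant search plus an attribute predicate.
    anchors = root.findall(".//div[@class='item']/a")
    return [(a.get("href"), a.text) for a in anchors]


if __name__ == "__main__":
    for href, text in extract_links(SAMPLE_PAGE):
        print(href, text)
```

The XPath expression is the only part that changes from site to site, which is why tools of this kind typically let the user supply it per scrape element.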
FlowScraper is a powerful web scraper with an intuitive FlowBuilder, enabling effortless website automation and data extraction without coding. It features customizable AI actions and automatic anti-bot protection.
The crawl-delay directive is an unofficial robots.txt directive that asks crawlers to slow down so they do not overload the web server. However, support for this directive varies among search engines.
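For a crawler that wants to honor the directive, a minimal sketch using Python's standard `urllib.robotparser` might look like the following. The sample robots.txt content, the `mybot` user agent, the `crawl_delay_for` helper, and the one-second fallback are assumptions made for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real crawler would fetch it from the site.
ROBOTS_TXT = """\
User-agent: *
Crawl-delay: 10
Disallow: /private/
"""


def crawl_delay_for(robots_txt: str, user_agent: str, default: float = 1.0) -> float:
    """Return the Crawl-delay (in seconds) that applies to user_agent.

    Falls back to `default` (an assumed value) when no delay is declared.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    delay = parser.crawl_delay(user_agent)
    return float(delay) if delay is not None else default


if __name__ == "__main__":
    print(crawl_delay_for(ROBOTS_TXT, "mybot"))
```

A crawler would then sleep for the returned number of seconds between requests to the same host. Because the directive is unofficial, a sensible default is still needed for sites that do not declare it.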
Crawl4AI is an open-source web crawling tool designed to efficiently collect and curate high-quality, structured data from the web for large language model training. It handles multiple URLs simultaneously and supports various data formats, including JSON and Markdown.
Google's Martin Splitt shares how to defend against malicious bots and improve site performance. SEO expert Roger Montti explains why contacting resource providers won't work and offers alternative solutions.
Parsera is a new tool for web scraping that leverages large language models (LLMs) to make the process more straightforward and efficient. It focuses on minimizing token usage for faster processing and lower costs.
Parsera is a simple and fast Python library for scraping websites using Large Language Models (LLMs). It's designed to be lightweight and minimize token usage for speed and cost efficiency.
SerpApi provides a web scraping API to access Google Search and other search engine results. Get structured data for SEO, market research, and more.