The article discusses Sosse, a self-hosted web scraper that allows users to archive their favorite websites. It highlights the tool's simplicity, ease of installation via Docker, and its ability to create full HTML snapshots of web pages, including stylesheets and assets. The author integrates Sosse into their workflow for archiving articles and technical documentation, praising its minimal interface and reliability.
The author details their transition from Pocket to Karakeep, a self-hosted, open-source alternative for saving and reading articles later. They discuss the benefits of owning your data and the features of Karakeep, including RSS integration and AI-powered tagging.
Notte is an open-source browser-using agent framework designed to improve the speed, cost, and reliability of web-agent tasks through a perception layer that structures webpages for LLM consumption. It offers a full-stack framework with customizable browser infrastructure, web scripting, and scraping endpoints.
A discussion of the Clipboard API and the differences between the clipboard-read permission (required for reading the clipboard programmatically) and paste events.
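As a rough TypeScript sketch of that difference (not taken from the discussion itself): the asynchronous read path is gated by the clipboard-read permission and a user gesture, while a paste handler only sees data at the moment the user pastes.

```typescript
// Reading the clipboard programmatically: gated by the "clipboard-read"
// permission (permission-query support varies by browser) and typically
// only allowed in response to a user gesture.
async function readClipboardText(): Promise<string> {
  const status = await navigator.permissions.query({
    name: "clipboard-read" as PermissionName,
  });
  if (status.state === "denied") {
    throw new Error("clipboard-read permission denied");
  }
  return navigator.clipboard.readText();
}

// Handling a paste event: no permission prompt, but data is only
// available when the user actually pastes into the page.
document.addEventListener("paste", (event: ClipboardEvent) => {
  const text = event.clipboardData?.getData("text/plain") ?? "";
  console.log("pasted:", text);
});
```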
An exploration of different methods and designs for implementing a paste-file feature that lets users quickly copy and send files between devices.
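A common building block for such a feature, sketched here purely as an assumption-laden illustration (the `/api/paste` upload endpoint is hypothetical), is pulling File objects out of the paste event and uploading them so another device can fetch them:

```typescript
// Grab files from a paste event and upload them.
// The "/api/paste" endpoint is a hypothetical example.
document.addEventListener("paste", async (event: ClipboardEvent) => {
  const files = Array.from(event.clipboardData?.files ?? []);
  if (files.length === 0) return; // plain-text paste, ignore here

  for (const file of files) {
    const body = new FormData();
    body.append("file", file, file.name);
    await fetch("/api/paste", { method: "POST", body });
  }
});
```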
Justin Garrison demonstrates how to use a Raspberry Pi or other single-board computer to run a local Personal Data Server (PDS) for the microblogging platform Bluesky, allowing users to store and manage their own data.
This project provides an LLM Websearch Agent that uses a local SearXNG server for search and includes Python scripts and a bash script for interacting with an LLM to summarize the search results.
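The project itself ships Python and bash; purely as a TypeScript illustration of the flow (assuming SearXNG's JSON output format is enabled and a local OpenAI-compatible endpoint such as Ollama is running; URLs and model name are assumptions), the core loop amounts to a search call followed by a summarization prompt:

```typescript
// Minimal sketch: query a local SearXNG instance, then ask a local
// OpenAI-compatible LLM endpoint to summarize the top results.
async function searchAndSummarize(query: string): Promise<string> {
  const searchUrl = `http://localhost:8080/search?q=${encodeURIComponent(query)}&format=json`;
  const results = (await (await fetch(searchUrl)).json()).results ?? [];

  const context = results
    .slice(0, 5)
    .map((r: { title: string; url: string; content?: string }) =>
      `${r.title} (${r.url}): ${r.content ?? ""}`)
    .join("\n");

  const llm = await fetch("http://localhost:11434/v1/chat/completions", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({
      model: "llama3",
      messages: [
        { role: "system", content: "Summarize these search results." },
        { role: "user", content: `Query: ${query}\n\n${context}` },
      ],
    }),
  });
  return (await llm.json()).choices[0].message.content;
}
```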
Scraperr is a self-hosted web application for scraping data from web pages using XPath. It supports queuing URLs and managing scrape elements, and provides features such as job management, user login, and integration with AI services.
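For a sense of what XPath-based extraction looks like (a generic sketch using the browser's built-in document.evaluate, not Scraperr's own code; the example expression is made up):

```typescript
// Collect the text of every element matched by an XPath expression.
function selectText(xpath: string, root: Document = document): string[] {
  const snapshot = root.evaluate(
    xpath,
    root,
    null,
    XPathResult.ORDERED_NODE_SNAPSHOT_TYPE,
    null,
  );
  const values: string[] = [];
  for (let i = 0; i < snapshot.snapshotLength; i++) {
    values.push(snapshot.snapshotItem(i)?.textContent?.trim() ?? "");
  }
  return values;
}

// Example usage: selectText('//h2[@class="product-name"]');
```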
FlowScraper is a powerful web scraper with an intuitive FlowBuilder, enabling effortless website automation and data extraction without coding. It features customizable AI actions and automatic anti-bot protection.
The crawl-delay directive is an unofficial robots.txt directive that asks crawlers to slow down so they do not overload the web server. However, support for the directive varies among search engines.
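A robots.txt entry using it looks like `User-agent: *` followed by `Crawl-delay: 10`. A crawler that chooses to honor the directive could parse the value and pause between requests, roughly as in this simplified sketch (it ignores per-user-agent grouping):

```typescript
// Fetch robots.txt and return the Crawl-delay value in seconds, if any.
// Simplified: takes the first Crawl-delay line, regardless of which
// User-agent group it belongs to.
async function getCrawlDelay(origin: string): Promise<number | null> {
  const res = await fetch(new URL("/robots.txt", origin));
  if (!res.ok) return null;
  const text = await res.text();
  const match = text.match(/^\s*crawl-delay:\s*([\d.]+)/im);
  return match ? parseFloat(match[1]) : null;
}

// A polite crawler would then sleep that long between requests:
// const delay = (await getCrawlDelay("https://example.com")) ?? 1;
// await new Promise((r) => setTimeout(r, delay * 1000));
```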