Tags: crawler*

  1. Google has introduced Google-Agent, a new user agent appearing in server logs, to differentiate between traditional search crawling (like Googlebot) and AI-driven content fetching triggered by user interactions. Unlike Googlebot, which proactively crawls and indexes the web, Google-Agent operates reactively, fetching content only in direct response to user prompts within Google AI products. A key distinction is that Google-Agent ignores `robots.txt` directives and behaves more like a standard web browser, because its requests are user-initiated. Developers therefore need to adapt their infrastructure to identify and manage Google-Agent traffic correctly, focusing on real-time request management rather than traditional crawl budgets.
  2. NEXUS is a production-grade, full-text and semantic search engine built from scratch, implementing advanced data structures and distributed systems concepts. It focuses on probabilistic optimization, sub-millisecond latency, and hybrid AI-powered search. The project demonstrates core technologies like LSM Trees, Bloom Filters, HNSW Graphs, and W-TinyLFU caches, integrated into a high-performance pipeline. It also includes a LeetCode algorithm library with implementations of classic interview patterns and provides insights into distributed crawling and persistent storage.
  3. discrawl mirrors Discord guild data into a local SQLite database, allowing you to search, inspect, and query server history independently of Discord. It’s a bot-token crawler – no user-token hacks – and keeps your data local. It discovers accessible guilds, syncs channels, threads, members, and message history, maintains FTS5 search indexes for fast text search (including small attachments), records mentions, and tails Gateway events for live updates with repair syncs. It provides read-only SQL access for analysis and supports multi-guild schemas with a simple single-guild default. Search defaults to all guilds, while sync and tail default to a configured default guild or fan out to all discovered guilds if none is set.
    2026-03-08 by klotz
  4. Cloudflare converts HTML to Markdown on the fly when an AI agent requests it via the `Accept: text/markdown` header.
  5. This article provides a verified list of AI crawlers (GPTBot, ClaudeBot, Gemini, etc.) with user-agent strings, crawl rates, and IP verification information to help manage access and maintain inclusion in AI discovery.
  6. An open-source web crawler with a minimal, real-time web search CLI. Enter a query and get search results as JSON (title, url, published_date), sorted by recency.
  7. Perplexity defends its AI assistants against Cloudflare’s claims, arguing that they are not web crawlers but user-triggered agents.
  8. Website Crawler is a SaaS that crawls and analyzes websites, extracting data and identifying issues like broken links, slow page speed, duplicate tags, and more. It offers features like XML sitemap generation, data export in various formats (JSON, CSV, PDF), JavaScript crawling, and custom data extraction.
    2025-07-08 by klotz
  9. Extract data from websites in LLM-ready JSON or CSV format. Crawl or scrape an entire website with Website Crawler.
    2025-09-05 by klotz
  10. Browser Use is a library that enables AI agents to interact with web browsers, making websites accessible for automated tasks. It includes features for browser automation, agent memory, and various demos showcasing its capabilities.
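
The `Accept: text/markdown` negotiation mentioned in item 4 can be sketched from the client side. This is a minimal illustration using only the Python standard library, not Cloudflare's own code. The helper names `build_markdown_request` and `fetch` are hypothetical, and whether a given site actually answers with Markdown depends on that site's configuration.

```python
from urllib.request import Request, urlopen


def build_markdown_request(url: str) -> Request:
    """Build a GET request that asks for a Markdown representation."""
    # Prefer Markdown; accept HTML as a fallback if the server has none.
    return Request(url, headers={"Accept": "text/markdown, text/html;q=0.8"})


def fetch(url: str, timeout: float = 10.0) -> tuple[str, str]:
    """Send the request and return (Content-Type, decoded body)."""
    req = build_markdown_request(url)
    with urlopen(req, timeout=timeout) as resp:
        content_type = resp.headers.get("Content-Type", "")
        body = resp.read().decode("utf-8", errors="replace")
    return content_type, body
```

Checking the returned `Content-Type` shows whether the server honored the request or fell back to HTML.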
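
Items 1, 5, and 7 all turn on telling proactive crawlers apart from user-triggered agents in server logs. The sketch below shows one rough way to do that by matching product tokens in the `User-Agent` header. The token table is an assumption built from the names mentioned above, the function name `classify_user_agent` is hypothetical, and the exact strings each vendor sends should be checked against its own documentation. User-agent strings can be spoofed, so this is a first-pass filter and does not replace IP verification.

```python
# Product tokens mapped to a traffic kind (assumed labels, for illustration).
AGENT_KINDS = {
    "Google-Agent": "user-triggered",
    "Googlebot": "crawler",
    "GPTBot": "crawler",
    "ClaudeBot": "crawler",
}


def classify_user_agent(user_agent: str) -> str:
    """Return 'crawler', 'user-triggered', or 'unknown' for a User-Agent string."""
    ua = user_agent.lower()
    for token, kind in AGENT_KINDS.items():
        # Case-insensitive substring match on the product token.
        if token.lower() in ua:
            return kind
    return "unknown"
```

A request handler could use the result to apply crawl-rate limits to `crawler` traffic while treating `user-triggered` traffic more like ordinary browser requests.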

SemanticScuttle - klotz.me: tagged with "crawler"