klotz: scraping*

  1. An open-source, minimal, real-time web search CLI. Enter a query and get search results as JSON (title, url, published_date), sorted by recency.
  2. Extensions load unknown sites into invisible windows. What could go wrong?
  3. Extract data from websites in LLM-ready JSON or CSV format. Crawl or scrape an entire website with Website Crawler.
    2025-07-08, by klotz
  4. The article highlights eight Python libraries that can save time, reduce bugs, and simplify coding tasks.

    | Library | Purpose | Key Feature |
    |-----------|-----------------------------------------------------------------------|----------------------------------------------------------------------------|
    | Rich | Enhance CLI output | Styling, tables, syntax-highlighted tracebacks, progress bars |
    | Typer | Build CLIs quickly | Simple CLI creation using function signatures and type hints |
    | Pendulum | Handle datetime operations | Time zone handling, formatting, arithmetic, and human-readable time parsing |
    | Pydantic | Validate data with type hints | Automated validation, documentation, and parsing of input data |
    | Faker | Generate fake data | Create realistic dummy data for testing and development |
    | Tqdm | Add progress bars | Monitor loop progress and catch infinite loops |
    | Requests-HTML | Web scraping with JavaScript support | Parse modern web pages with JavaScript rendering |
    | Loguru | Simplify logging | Easy logging configuration with levels, file rotation, and colorful output |
    2025-07-03, by klotz
  5. The article discusses Sosse, a self-hosted web scraper that allows users to archive their favorite websites. It highlights the tool's simplicity, ease of installation via Docker, and its ability to create full HTML snapshots of web pages, including stylesheets and assets. The author integrates Sosse into their workflow for archiving articles and technical documentation, praising its minimal interface and reliability.
  6. A popular and actively maintained open-source web crawling library for LLMs and data extraction, offering advanced features like structured data extraction, browser control, and markdown generation.
  7. An article discussing the capabilities of Manus AI, a general AI agent that can think, plan, and execute tasks independently. Unlike other AI assistants, Manus can deliver results directly, making it highly efficient for various tasks.
    2025-03-23, by klotz
  8. Browser Use is a library that enables AI agents to interact with web browsers, making websites accessible for automated tasks. It includes features for browser automation, agent memory, and various demos showcasing its capabilities.
  9. LlamaExtract is a powerful, easy-to-use tool that allows users to extract structured data from unstructured documents with minimal effort, available through LlamaCloud’s web UI and Python SDK.
  10. ReaderLM-v2 is a 1.5B parameter language model developed by Jina AI, designed for converting raw HTML into clean markdown and JSON with high accuracy and improved handling of longer contexts. It supports multilingual text in 29 languages and offers advanced features such as direct HTML-to-JSON extraction. The model improves upon its predecessor by addressing issues like repetition in long sequences and enhancing markdown syntax generation.
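
The first bookmark describes search results returned as JSON with `title`, `url`, and `published_date` fields, sorted by recency. A minimal sketch of that output shape follows. It is not the tool's actual implementation: the function name `sort_by_recency`, the sample records, and the assumption that dates are ISO `YYYY-MM-DD` strings are all hypothetical.

```python
import json
from datetime import date


def sort_by_recency(results):
    """Return results ordered newest first by published_date.

    Assumes (hypothetically) that each published_date is an ISO
    YYYY-MM-DD string; the bookmarked tool may use another format.
    """
    return sorted(
        results,
        key=lambda r: date.fromisoformat(r["published_date"]),
        reverse=True,
    )


# Made-up sample records, for illustration only.
sample = [
    {"title": "Older post", "url": "https://example.com/a", "published_date": "2025-03-23"},
    {"title": "Newer post", "url": "https://example.com/b", "published_date": "2025-07-08"},
]

ordered = sort_by_recency(sample)
print(json.dumps(ordered, indent=2))
```

Parsing the dates before sorting, rather than comparing raw strings, keeps the ordering correct if the date format later changes to one that does not sort lexicographically.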
