# htmlsense

**Repository Path**: cg33/htmlsense

## Basic Information

- **Project Name**: htmlsense
- **Description**: No description available
- **Primary Language**: Unknown
- **License**: Not specified
- **Default Branch**: main
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 0
- **Forks**: 0
- **Created**: 2025-11-19
- **Last Updated**: 2025-11-19

## Categories & Tags

**Categories**: Uncategorized

**Tags**: None

## README

# htmlsense

htmlsense is a Go library and command-line tool for automatically extracting XPath selector mappings from HTML documents. It leverages Large Language Models (LLM) to intelligently analyze HTML structures, identify repeated patterns and valuable content elements, and generate corresponding XPath selectors, greatly simplifying the manual work of writing selectors in web scraping development.

## Background

When developing web scraping programs, we often need to manually identify and write XPath or CSS selectors to locate page elements. This process is both tedious and error-prone, especially when page structures are complex or frequently changing. htmlsense automates selector generation by combining HTML structure analysis with LLM's intelligent recognition capabilities, making web scraping development more efficient.

## Features

- ✨ **Intelligent Recognition**: Automatically identifies repeated structures and valuable content elements in HTML using LLM
- 🧹 **Auto Cleaning**: Automatically removes interfering tags (head, style, script, etc.) and attributes
- 🎯 **Scope Extraction**: Supports specifying extraction scope via scopeXPath to focus on specific areas
- ✅ **Auto Validation**: Automatically validates generated XPath selectors and filters out invalid ones
- 📊 **Data Extraction**: Optionally extract actual data (table format) in addition to selectors
- 🛠️ **CLI Tool**: Provides a complete command-line tool for easy integration into scripts and automation workflows
- 📦 **Easy Integration**: Can be used as a Go library, easily integrated into existing projects
- 🔧 **Flexible Configuration**: Supports multiple LLM providers and model configurations

## Installation

### Prerequisites

- Go 1.21 or higher
- LLM API key (OpenAI or other supported providers)

### Install from Source

```bash
git clone https://github.com/gotoailab/htmlsense.git
cd htmlsense
make install
```

After installation, the `htmlsense` command will be installed to the `$GOPATH/bin` directory.

### Build with Makefile

```bash
# Build binary to build/ directory
make build

# Install to GOPATH/bin
make install

# Run all tests
make test

# Generate test coverage report
make test-coverage

# Format code
make fmt

# Run code checks
make vet

# Run all checks (format, check, test)
make check

# Build multi-platform release binaries
make release

# View all available commands
make help
```

## Quick Start

### Command-Line Tool Usage

#### Basic Usage

```bash
# Extract selectors from file
htmlsense -html page.html -api-key YOUR_API_KEY

# Extract from stdin
cat page.html | htmlsense -html - -api-key YOUR_API_KEY

# Extract with scope (e.g., only within a container)
htmlsense -html page.html -scope "//div[@class='container']" -api-key YOUR_API_KEY

# Output as text format (default is JSON)
htmlsense -html page.html -output text -api-key YOUR_API_KEY

# Save intermediate results to directory
htmlsense -html page.html -save-intermediate ./output -api-key YOUR_API_KEY

# Extract data and save intermediate results
htmlsense -html page.html -extract-data -save-intermediate ./output -api-key YOUR_API_KEY

# Use environment variable for API key (recommended)
export HTMLSENSE_API_KEY=YOUR_API_KEY
htmlsense -html page.html

# Specify different LLM provider and model
htmlsense -html page.html -provider openai -model gpt-4 -api-key YOUR_API_KEY
```

#### Command-Line Arguments

| Argument | Description | Default | Required |
|----------|-------------|---------|----------|
| `-html` | HTML file path (use `-` to read from stdin) | - | Yes |
| `-output` | Output format: `json` or `text` | `json` | No |
| `-scope` | Optional XPath to limit extraction scope | - | No |
| `-extract-data` | Extract actual data instead of just selectors | `false` | No |
| `-save-intermediate` | Directory to save intermediate results (simplified.html, selectors.json, data.json) | - | No |
| `-api-key` | LLM API key | - | Yes* |
| `-provider` | LLM provider (e.g., `openai`) | `openai` | No |
| `-model` | LLM model name | `gpt-4` | No |
| `-version` | Show version information | - | No |
| `-help` | Show help information | - | No |

*Can be set via `HTMLSENSE_API_KEY` environment variable, in which case `-api-key` is not needed

#### Output Examples

**JSON Format Output:**

```json
{
  "title": "//h1[@class='article-title']",
  "content": "//div[@class='article-content']//p",
  "author": "//span[@class='author-name']",
  "date": "//time[@class='publish-date']"
}
```

**Text Format Output:**

```
Extracted selector mappings:
  title: //h1[@class='article-title']
  content: //div[@class='article-content']//p
  author: //span[@class='author-name']
  date: //time[@class='publish-date']
```

**Data Extraction Output (with `-extract-data` flag):**

JSON format:

```json
[
  {
    "title": "Article Title 1",
    "content": "Article content 1...",
    "author": "Author Name 1",
    "date": "2024-01-01"
  },
  {
    "title": "Article Title 2",
    "content": "Article content 2...",
    "author": "Author Name 2",
    "date": "2024-01-02"
  }
]
```

Text format:

```
Extracted data:
Row 1:
  title: Article Title 1
  content: Article content 1...
  author: Author Name 1
  date: 2024-01-01
Row 2:
  title: Article Title 2
  content: Article content 2...
  author: Author Name 2
  date: 2024-01-02
```

**Intermediate Results (with `-save-intermediate` flag):**

When using `-save-intermediate`, the following files are saved to the specified directory:

- `simplified.html`: The cleaned and simplified HTML used for LLM analysis
- `selectors.json`: Contains both raw selectors from LLM and validated selectors:

  ```json
  {
    "raw_selectors": {
      "title": "//h1[@class='article-title']",
      "content": "//div[@class='article-content']//p"
    },
    "valid_selectors": {
      "title": "//h1[@class='article-title']",
      "content": "//div[@class='article-content']//p"
    }
  }
  ```

- `data.json`: Extracted data in JSON format (only when `-extract-data` is used)

### Use as Go Library

```go
package main

import (
	"context"
	"fmt"
	"log"

	"github.com/gotoailab/htmlsense"
)

func main() {
	// Configure htmlsense
	config := htmlsense.Config{
		APIKey:   "your-api-key",
		Provider: "openai",
		Model:    "gpt-4",
	}

	// Create extractor
	extractor, err := htmlsense.NewExtractor(config)
	if err != nil {
		log.Fatal(err)
	}

	// HTML content
	htmlContent := `
Article content...