# easy-dataset
**Repository Path**: cpgithub/easy-dataset
## Basic Information
- **Project Name**: easy-dataset
- **Description**: No description available
- **Primary Language**: Unknown
- **License**: Not specified
- **Default Branch**: dev
- **Homepage**: None
- **GVP Project**: No
## Statistics
- **Stars**: 0
- **Forks**: 0
- **Created**: 2025-05-21
- **Last Updated**: 2025-05-21
## Categories & Tags
**Categories**: Uncategorized
**Tags**: None
## README
## Overview
Easy Dataset is an application specifically designed for creating fine-tuning datasets for Large Language Models (LLMs). It provides an intuitive interface for uploading domain-specific files, intelligently splitting content, generating questions, and producing high-quality training data for model fine-tuning.
With Easy Dataset, you can transform domain knowledge into structured datasets, compatible with all LLM APIs that follow the OpenAI format, making the fine-tuning process simple and efficient.

## Features
- **Intelligent Document Processing**: Supports intelligent recognition and processing of multiple formats including PDF, Markdown, DOCX, etc.
- **Intelligent Text Splitting**: Supports multiple intelligent text splitting algorithms and customizable visual segmentation
- **Intelligent Question Generation**: Extracts relevant questions from each text segment
- **Domain Labels**: Intelligently builds global domain labels for datasets, with global understanding capabilities
- **Answer Generation**: Uses LLM API to generate comprehensive answers and Chain of Thought (COT)
- **Flexible Editing**: Edit questions, answers, and datasets at any stage of the process
- **Multiple Export Formats**: Export datasets in various formats (Alpaca, ShareGPT) and file types (JSON, JSONL)
- **Wide Model Support**: Compatible with all LLM APIs that follow the OpenAI format
- **User-Friendly Interface**: Intuitive UI designed for both technical and non-technical users
- **Custom System Prompts**: Add custom system prompts to guide model responses
## Quick Demo
https://github.com/user-attachments/assets/6ddb1225-3d1b-4695-90cd-aa4cb01376a8
## Local Run
### Download Client
### Install with NPM
1. Clone the repository:
```bash
git clone https://github.com/ConardLi/easy-dataset.git
cd easy-dataset
```
2. Install dependencies:
```bash
npm install
```
3. Start the development server:
```bash
npm run build
npm run start
```
4. Open your browser and visit `http://localhost:1717`
### Build with Local Dockerfile
If you want to build the image yourself, you can use the Dockerfile in the project root:
1. Clone the repository:
```bash
git clone https://github.com/ConardLi/easy-dataset.git
cd easy-dataset
```
2. Build the Docker image:
```bash
docker build -t easy-dataset .
```
3. Run the container:
```bash
docker run -d -p 1717:1717 -v {YOUR_LOCAL_DB_PATH}:/app/local-db --name easy-dataset easy-dataset
```
**Note:** Please replace `{YOUR_LOCAL_DB_PATH}` with the actual path where you want to store the local database.
4. Open your browser and visit `http://localhost:1717`
## How to Use
### Create a Project
1. Click the "Create Project" button on the homepage;
2. Enter a project name and description;
3. Configure your preferred LLM API settings
### Process Documents
1. Upload your files in the "Text Split" section (supports PDF, Markdown, txt, DOCX);
2. View and adjust the automatically split text segments;
3. View and adjust the global domain tree
### Generate Questions
2. Batch construct questions based on text blocks;
3. View and edit the generated questions;
4. Organize questions using the label tree
### Create Datasets
1. Batch construct datasets based on questions;
2. Generate answers using the configured LLM;
3. View, edit, and optimize the generated answers
### Export Datasets
1. Click the "Export" button in the Datasets section;
2. Choose your preferred format (Alpaca or ShareGPT);
3. Select the file format (JSON or JSONL);
4. Add custom system prompts as needed;
5. Export your dataset
## Project Structure
```
easy-dataset/
├── app/ # Next.js application directory
│ ├── api/ # API routes
│ │ ├── llm/ # LLM API integration
│ │ │ ├── ollama/ # Ollama API integration
│ │ │ └── openai/ # OpenAI API integration
│ │ ├── projects/ # Project management API
│ │ │ ├── [projectId]/ # Project-specific operations
│ │ │ │ ├── chunks/ # Text chunk operations
│ │ │ │ ├── datasets/ # Dataset generation and management
│ │ │ │ ├── generate-questions/ # Batch question generation
│ │ │ │ ├── questions/ # Question management
│ │ │ │ └── split/ # Text splitting operations
│ │ │ └── user/ # User-specific project operations
│ ├── projects/ # Frontend project pages
│ │ └── [projectId]/ # Project-specific pages
│ │ ├── datasets/ # Dataset management UI
│ │ ├── questions/ # Question management UI
│ │ ├── settings/ # Project settings UI
│ │ └── text-split/ # Text processing UI
│ └── page.js # Homepage
├── components/ # React components
│ ├── datasets/ # Dataset-related components
│ ├── home/ # Homepage components
│ ├── projects/ # Project management components
│ ├── questions/ # Question management components
│ └── text-split/ # Text processing components
├── lib/ # Core libraries and tools
│ ├── db/ # Database operations
│ ├── i18n/ # Internationalization
│ ├── llm/ # LLM integration
│ │ ├── common/ # Common LLM tools
│ │ ├── core/ # Core LLM clients
│ │ └── prompts/ # Prompt templates
│ │ ├── answer.js # Answer generation prompts (Chinese)
│ │ ├── answerEn.js # Answer generation prompts (English)
│ │ ├── question.js # Question generation prompts (Chinese)
│ │ ├── questionEn.js # Question generation prompts (English)
│ │ └── ... other prompts
│ └── text-splitter/ # Text splitting tools
├── locales/ # Internationalization resources
│ ├── en/ # English translations
│ └── zh-CN/ # Chinese translations
├── public/ # Static resources
│ └── imgs/ # Image resources
└── local-db/ # Local file database
└── projects/ # Project data storage
```
## Documentation
- View the demo video of this project: [Easy Dataset Demo Video](https://www.bilibili.com/video/BV1y8QpYGE57/)
- For detailed documentation on all features and APIs, visit our [Documentation Site](https://docs.easy-dataset.com/ed/en)
## Community Practice
[Easy Dataset × LLaMA Factory: Enabling LLMs to Efficiently Learn Domain Knowledge](https://buaa-act.feishu.cn/wiki/GVzlwYcRFiR8OLkHbL6cQpYin7g)
## Contributing
We welcome contributions from the community! If you'd like to contribute to Easy Dataset, please follow these steps:
1. Fork the repository
2. Create a new branch (`git checkout -b feature/amazing-feature`)
3. Make your changes
4. Commit your changes (`git commit -m 'Add some amazing feature'`)
5. Push to the branch (`git push origin feature/amazing-feature`)
6. Open a Pull Request (submit to the DEV branch)
Please ensure that tests are appropriately updated and adhere to the existing coding style.
## Join Discussion Group & Contact the Author
https://docs.easy-dataset.com/geng-duo/lian-xi-wo-men
## License
This project is licensed under the AGPL 3.0 License - see the [LICENSE](LICENSE) file for details.
## Star History
[](https://www.star-history.com/#ConardLi/easy-dataset&Date)