diff options
Diffstat (limited to 'SQL_FEATURE_PLAN.md')
| -rw-r--r-- | SQL_FEATURE_PLAN.md | 145 |
1 files changed, 145 insertions, 0 deletions
diff --git a/SQL_FEATURE_PLAN.md b/SQL_FEATURE_PLAN.md new file mode 100644 index 000000000..28a6d2c24 --- /dev/null +++ b/SQL_FEATURE_PLAN.md @@ -0,0 +1,145 @@ +# SQL Query Engine Feature, Dev, and Test Plan + +This document outlines the plan for adding SQL querying support to SeaweedFS, focusing on reading and analyzing data from Message Queue (MQ) topics. + +## Feature Plan + +**1. Goal** + +To provide a SQL querying interface for SeaweedFS, enabling analytics on existing MQ topics. This enables: +- Basic querying with SELECT, WHERE, aggregations on MQ topics +- Schema discovery and metadata operations (SHOW DATABASES, SHOW TABLES, DESCRIBE) +- In-place analytics on Parquet-stored messages without data movement + +**2. Key Features** + +* **Schema Discovery and Metadata:** + * `SHOW DATABASES` - List all MQ namespaces + * `SHOW TABLES` - List all topics in a namespace + * `DESCRIBE table_name` - Show topic schema details + * Automatic schema detection from existing Parquet data +* **Basic Query Engine:** + * `SELECT` support with `WHERE`, `LIMIT`, `OFFSET` + * Aggregation functions: `COUNT()`, `SUM()`, `AVG()`, `MIN()`, `MAX()` + * Temporal queries with timestamp-based filtering +* **User Interfaces:** + * New CLI command `weed sql` with interactive shell mode + * Optional: Web UI for query execution and result visualization +* **Output Formats:** + * JSON (default), CSV, Parquet for result sets + * Streaming results for large queries + * Pagination support for result navigation + +## Development Plan + + + +**3. Data Source Integration** + +* **MQ Topic Connector (Primary):** + * Build on existing `weed/mq/logstore/read_parquet_to_log.go` + * Implement efficient Parquet scanning with predicate pushdown + * Support schema evolution and backward compatibility + * Handle partition-based parallelism for scalable queries +* **Schema Registry Integration:** + * Extend `weed/mq/schema/schema.go` for SQL metadata operations + * Read existing topic schemas for query planning + * Handle schema evolution during query execution + +**4. API & CLI Integration** + +* **CLI Command:** + * New `weed sql` command with interactive shell mode (similar to `weed shell`) + * Support for script execution and result formatting + * Connection management for remote SeaweedFS clusters +* **gRPC API:** + * Add SQL service to existing MQ broker gRPC interface + * Enable efficient query execution with streaming results + +## Example Usage Scenarios + +**Scenario 1: Schema Discovery and Metadata** +```sql +-- List all namespaces (databases) +SHOW DATABASES; + +-- List topics in a namespace +USE my_namespace; +SHOW TABLES; + +-- View topic structure and discovered schema +DESCRIBE user_events; +``` + +**Scenario 2: Data Querying** +```sql +-- Basic filtering and projection +SELECT user_id, event_type, timestamp +FROM user_events +WHERE timestamp > 1640995200000 +LIMIT 100; + +-- Aggregation queries +SELECT COUNT(*) as event_count +FROM user_events +WHERE timestamp >= 1640995200000; + +-- More aggregation examples +SELECT MAX(timestamp), MIN(timestamp) +FROM user_events; +``` + +**Scenario 3: Analytics & Monitoring** +```sql +-- Basic analytics +SELECT COUNT(*) as total_events +FROM user_events +WHERE timestamp >= 1640995200000; + +-- Simple monitoring +SELECT AVG(response_time) as avg_response +FROM api_logs +WHERE timestamp >= 1640995200000; + +## Architecture Overview + +``` +SQL Query Flow: + 1. Parse SQL 2. Plan & Optimize 3. Execute Query +┌─────────────┐ ┌──────────────┐ ┌─────────────────┐ ┌──────────────┐ +│ Client │ │ SQL Parser │ │ Query Planner │ │ Execution │ +│ (CLI) │──→ │ PostgreSQL │──→ │ & Optimizer │──→ │ Engine │ +│ │ │ (Custom) │ │ │ │ │ +└─────────────┘ └──────────────┘ └─────────────────┘ └──────────────┘ + │ │ + │ Schema Lookup │ Data Access + ▼ ▼ + ┌─────────────────────────────────────────────────────────────┐ + │ Schema Catalog │ + │ • Namespace → Database mapping │ + │ • Topic → Table mapping │ + │ • Schema version management │ + └─────────────────────────────────────────────────────────────┘ + ▲ + │ Metadata + │ +┌─────────────────────────────────────────────────────────────────────────────┐ +│ MQ Storage Layer │ +│ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ ▲ │ +│ │ Topic A │ │ Topic B │ │ Topic C │ │ ... │ │ │ +│ │ (Parquet) │ │ (Parquet) │ │ (Parquet) │ │ (Parquet) │ │ │ +│ └─────────────┘ └─────────────┘ └─────────────┘ └─────────────┘ │ │ +└──────────────────────────────────────────────────────────────────────────│──┘ + │ + Data Access +``` + + +## Success Metrics + +* **Feature Completeness:** Support for all specified SELECT operations and metadata commands +* **Performance:** + * **Simple SELECT queries**: < 100ms latency for single-table queries with up to 3 WHERE predicates on ≤ 100K records + * **Complex queries**: < 1s latency for queries involving aggregations (COUNT, SUM, MAX, MIN) on ≤ 1M records + * **Time-range queries**: < 500ms for timestamp-based filtering on ≤ 500K records within 24-hour windows +* **Scalability:** Handle topics with millions of messages efficiently |
