SQL_FEATURE_PLAN.md


1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145

# SQL Query Engine Feature, Dev, and Test Plan

This document outlines the plan for adding SQL querying support to SeaweedFS, focusing on reading and analyzing data from Message Queue (MQ) topics.

## Feature Plan

**1. Goal**

To provide a SQL querying interface for SeaweedFS, enabling analytics on existing MQ topics. This enables:
- Basic querying with SELECT, WHERE, aggregations on MQ topics
- Schema discovery and metadata operations (SHOW DATABASES, SHOW TABLES, DESCRIBE)
- In-place analytics on Parquet-stored messages without data movement

**2. Key Features**

*   **Schema Discovery and Metadata:**
    *   `SHOW DATABASES` - List all MQ namespaces
    *   `SHOW TABLES` - List all topics in a namespace  
    *   `DESCRIBE table_name` - Show topic schema details
    *   Automatic schema detection from existing Parquet data
*   **Basic Query Engine:**
    *   `SELECT` support with `WHERE`, `LIMIT`, `OFFSET`
    *   Aggregation functions: `COUNT()`, `SUM()`, `AVG()`, `MIN()`, `MAX()`
    *   Temporal queries with timestamp-based filtering
*   **User Interfaces:**
    *   New CLI command `weed sql` with interactive shell mode
    *   Optional: Web UI for query execution and result visualization
*   **Output Formats:**
    *   JSON (default), CSV, Parquet for result sets
    *   Streaming results for large queries
    *   Pagination support for result navigation

## Development Plan


**3. Data Source Integration**

*   **MQ Topic Connector (Primary):**
    *   Build on existing `weed/mq/logstore/read_parquet_to_log.go`
    *   Implement efficient Parquet scanning with predicate pushdown
    *   Support schema evolution and backward compatibility
    *   Handle partition-based parallelism for scalable queries
*   **Schema Registry Integration:**
    *   Extend `weed/mq/schema/schema.go` for SQL metadata operations
    *   Read existing topic schemas for query planning
    *   Handle schema evolution during query execution

**4. API & CLI Integration**

*   **CLI Command:**
    *   New `weed sql` command with interactive shell mode (similar to `weed shell`)
    *   Support for script execution and result formatting
    *   Connection management for remote SeaweedFS clusters
*   **gRPC API:**
    *   Add SQL service to existing MQ broker gRPC interface
    *   Enable efficient query execution with streaming results

## Example Usage Scenarios

**Scenario 1: Schema Discovery and Metadata**
```sql
-- List all namespaces (databases)
SHOW DATABASES;

-- List topics in a namespace
USE my_namespace;
SHOW TABLES;

-- View topic structure and discovered schema
DESCRIBE user_events;
```

**Scenario 2: Data Querying**
```sql
-- Basic filtering and projection
SELECT user_id, event_type, timestamp 
FROM user_events 
WHERE timestamp > 1640995200000 
LIMIT 100;

-- Aggregation queries  
SELECT COUNT(*) as event_count
FROM user_events 
WHERE timestamp >= 1640995200000;

-- More aggregation examples
SELECT MAX(timestamp), MIN(timestamp) 
FROM user_events;
```

**Scenario 3: Analytics & Monitoring**
```sql
-- Basic analytics
SELECT COUNT(*) as total_events
FROM user_events 
WHERE timestamp >= 1640995200000;

-- Simple monitoring
SELECT AVG(response_time) as avg_response
FROM api_logs
WHERE timestamp >= 1640995200000;

## Architecture Overview

```
SQL Query Flow:
                                  1. Parse SQL        2. Plan & Optimize      3. Execute Query
┌─────────────┐    ┌──────────────┐    ┌─────────────────┐    ┌──────────────┐
│   Client    │    │  SQL Parser  │    │  Query Planner  │    │   Execution  │
│    (CLI)    │──→ │ PostgreSQL   │──→ │   & Optimizer   │──→ │    Engine    │
│             │    │ (Custom)     │    │                 │    │              │
└─────────────┘    └──────────────┘    └─────────────────┘    └──────────────┘
                                               │                       │
                                               │ Schema Lookup         │ Data Access
                                               ▼                       ▼
                    ┌─────────────────────────────────────────────────────────────┐
                    │                    Schema Catalog                            │
                    │  • Namespace → Database mapping                            │
                    │  • Topic → Table mapping                                  │
                    │  • Schema version management                              │
                    └─────────────────────────────────────────────────────────────┘
                                                                        ▲
                                                                        │ Metadata
                                                                        │
┌─────────────────────────────────────────────────────────────────────────────┐
│                          MQ Storage Layer                                      │
│  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐    ▲    │
│  │   Topic A   │  │   Topic B   │  │   Topic C   │  │     ...     │    │    │
│  │ (Parquet)   │  │ (Parquet)   │  │ (Parquet)   │  │ (Parquet)   │    │    │
│  └─────────────┘  └─────────────┘  └─────────────┘  └─────────────┘    │    │
└──────────────────────────────────────────────────────────────────────────│──┘
                                                                          │
                                                                     Data Access
```


## Success Metrics

*   **Feature Completeness:** Support for all specified SELECT operations and metadata commands
*   **Performance:** 
    *   **Simple SELECT queries**: < 100ms latency for single-table queries with up to 3 WHERE predicates on ≤ 100K records
    *   **Complex queries**: < 1s latency for queries involving aggregations (COUNT, SUM, MAX, MIN) on ≤ 1M records
    *   **Time-range queries**: < 500ms for timestamp-based filtering on ≤ 500K records within 24-hour windows
*   **Scalability:** Handle topics with millions of messages efficiently