Diffstat (limited to 'test/s3/parquet/README.md')
-rw-r--r--  test/s3/parquet/README.md  87
1 file changed, 86 insertions(+), 1 deletion(-)
diff --git a/test/s3/parquet/README.md b/test/s3/parquet/README.md
index 48ce3e6fc..ed65e4cbb 100644
--- a/test/s3/parquet/README.md
+++ b/test/s3/parquet/README.md
@@ -10,6 +10,22 @@ SeaweedFS implements implicit directory detection to improve compatibility with
## Quick Start
+### Running the Example Script
+
+```bash
+# Start SeaweedFS server
+make start-seaweedfs-ci
+
+# Run the example script
+python3 example_pyarrow_native.py
+
+# Or with uv (if available)
+uv run example_pyarrow_native.py
+
+# Stop the server when done
+make stop-seaweedfs-safe
+```
+
### Running Tests
```bash
@@ -25,12 +41,20 @@ make test-quick
# Run implicit directory fix tests
make test-implicit-dir-with-server
+# Run PyArrow native S3 filesystem tests
+make test-native-s3-with-server
+
+# Run SSE-S3 encryption tests
+make test-sse-s3-compat
+
# Clean up
make clean
```
### Using PyArrow with SeaweedFS
+#### Option 1: Using s3fs (recommended for compatibility)
+
```python
import pyarrow as pa
import pyarrow.parquet as pq
@@ -55,13 +79,55 @@ table = pq.read_table('bucket/dataset', filesystem=fs) # ✅
dataset = pq.ParquetDataset('bucket/dataset', filesystem=fs) # ✅
```
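+
+The `fs` handle above is an s3fs filesystem; a minimal construction sketch,
+assuming placeholder credentials and the same local endpoint as the native
+example below:
+
+```python
+import s3fs
+
+# Placeholder credentials and endpoint; adjust for your deployment
+fs = s3fs.S3FileSystem(
+    key='your_access_key',
+    secret='your_secret_key',
+    client_kwargs={'endpoint_url': 'http://localhost:8333'},
+)
+```
+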
+#### Option 2: Using PyArrow's native S3 filesystem (pure PyArrow)
+
+```python
+import pyarrow as pa
+import pyarrow.parquet as pq
+import pyarrow.dataset as pads
+import pyarrow.fs as pafs
+
+# Configure PyArrow's native S3 filesystem
+s3 = pafs.S3FileSystem(
+ access_key='your_access_key',
+ secret_key='your_secret_key',
+ endpoint_override='localhost:8333',
+ scheme='http',
+ allow_bucket_creation=True,
+ allow_bucket_deletion=True
+)
+
+# Write dataset
+table = pa.table({'id': [1, 2, 3], 'value': ['a', 'b', 'c']})
+pads.write_dataset(table, 'bucket/dataset', filesystem=s3)
+
+# Read dataset (all methods work!)
+table = pq.read_table('bucket/dataset', filesystem=s3) # ✅
+dataset = pq.ParquetDataset('bucket/dataset', filesystem=s3) # ✅
+dataset = pads.dataset('bucket/dataset', filesystem=s3) # ✅
+```
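+
+`endpoint_override` plus `scheme='http'` point PyArrow's bundled AWS SDK at
+the local SeaweedFS gateway instead of AWS, and the bucket-creation/deletion
+flags let the example manage its own test bucket.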
+
## Test Files
### Main Test Suite
- **`s3_parquet_test.py`** - Comprehensive PyArrow test suite
 - Tests 2 write methods × 5 read methods × 2 dataset sizes = 20 combinations (write paths sketched below)
+ - Uses s3fs library for S3 operations
- All tests pass with the implicit directory fix ✅
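+
+The two write methods referenced above are plausibly along these lines (a
+hedged sketch; the suite's actual helpers may differ, and `fs` is an s3fs
+handle as in Option 1):
+
+```python
+import pyarrow as pa
+import pyarrow.dataset as pads
+import pyarrow.parquet as pq
+
+table = pa.table({'id': [1, 2, 3], 'value': ['a', 'b', 'c']})
+
+# Write method 1: a single Parquet object
+pq.write_table(table, 'bucket/single.parquet', filesystem=fs)
+
+# Write method 2: a directory of part files (exercises implicit directories)
+pads.write_dataset(table, 'bucket/dataset', filesystem=fs, format='parquet')
+```
+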
+### PyArrow Native S3 Tests
+- **`test_pyarrow_native_s3.py`** - PyArrow's native S3 filesystem tests
+ - Tests PyArrow's built-in S3FileSystem (pyarrow.fs.S3FileSystem)
+ - Pure PyArrow solution without s3fs dependency
+ - Tests 3 read methods × 2 dataset sizes = 6 scenarios
+ - All tests pass ✅
+
+- **`test_sse_s3_compatibility.py`** - SSE-S3 encryption compatibility tests
+ - Tests PyArrow native S3 with SSE-S3 server-side encryption
+ - Tests 5 different file sizes (10 to 500,000 rows)
+ - Verifies multipart upload encryption works correctly (see the boto3 sketch below)
+ - All tests pass ✅
+
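+One way SSE-S3 can be requested and verified through boto3 (bucket and key
+names are illustrative; the test script's own approach may differ):
+
+```python
+import boto3
+
+s3 = boto3.client(
+    's3',
+    endpoint_url='http://localhost:8333',
+    aws_access_key_id='your_access_key',
+    aws_secret_access_key='your_secret_key',
+)
+
+# Request SSE-S3 (AES256) at upload time
+s3.put_object(
+    Bucket='bucket', Key='dataset/part-0.parquet',
+    Body=b'dummy-parquet-bytes', ServerSideEncryption='AES256',
+)
+
+# Confirm the stored object reports SSE-S3
+resp = s3.head_object(Bucket='bucket', Key='dataset/part-0.parquet')
+assert resp.get('ServerSideEncryption') == 'AES256'
+```
+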
### Implicit Directory Tests
- **`test_implicit_directory_fix.py`** - Specific tests for the implicit directory fix
 - Tests HEAD request behavior (probed in the sketch below)
@@ -69,6 +135,12 @@ dataset = pq.ParquetDataset('bucket/dataset', filesystem=fs) # ✅
- Tests PyArrow dataset reading
- All 6 tests pass ✅
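+
+A minimal probe of the HEAD behavior the fix addresses (a boto3 sketch;
+expected status codes depend on whether the key exists only implicitly):
+
+```python
+import boto3
+from botocore.exceptions import ClientError
+
+s3 = boto3.client('s3', endpoint_url='http://localhost:8333',
+                  aws_access_key_id='your_access_key',
+                  aws_secret_access_key='your_secret_key')
+
+# 'dataset' has no object of its own; it exists only through keys such as
+# 'dataset/part-0.parquet'. HEAD on the bare prefix is what clients like
+# PyArrow issue when deciding whether a path is a directory.
+for key in ('dataset', 'dataset/'):
+    try:
+        resp = s3.head_object(Bucket='bucket', Key=key)
+        print(key, '->', resp['ResponseMetadata']['HTTPStatusCode'])
+    except ClientError as e:
+        print(key, '->', e.response['Error']['Code'])
+```
+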
+### Examples
+- **`example_pyarrow_native.py`** - Simple standalone example
+ - Demonstrates PyArrow's native S3 filesystem usage
+ - Can be run with `uv run` or regular Python (inline-metadata sketch below)
+ - Minimal dependencies (pyarrow, boto3)
+
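+`uv run` resolves a script's dependencies from PEP 723 inline metadata; a
+sketch of the header such a script would carry (assuming the script declares
+its dependencies this way):
+
+```python
+# /// script
+# dependencies = ["pyarrow", "boto3"]
+# ///
+```
+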
### Configuration
- **`Makefile`** - Build and test automation
- **`requirements.txt`** - Python dependencies (pyarrow, s3fs, boto3)
@@ -128,6 +200,9 @@ make test # Run full tests (assumes server is already running)
make test-with-server # Run full PyArrow test suite with server (small + large files)
make test-quick # Run quick tests with small files only (assumes server is running)
make test-implicit-dir-with-server # Run implicit directory tests with server
+make test-native-s3 # Run PyArrow native S3 tests (assumes server is running)
+make test-native-s3-with-server # Run PyArrow native S3 tests with server management
+make test-sse-s3-compat # Run comprehensive SSE-S3 encryption compatibility tests
# Server Management
make start-seaweedfs-ci # Start SeaweedFS in background (CI mode)
@@ -146,10 +221,20 @@ The tests are automatically run in GitHub Actions on every push/PR that affects
**Test Matrix**:
- Python versions: 3.9, 3.11, 3.12
-- PyArrow integration tests: 20 test combinations
+- PyArrow integration tests (s3fs): 20 test combinations
+- PyArrow native S3 tests: 6 test scenarios ✅ **NEW**
+- SSE-S3 encryption tests: 5 file sizes ✅ **NEW**
- Implicit directory fix tests: 6 test scenarios
- Go unit tests: 17 test cases
+**Test Steps** (run for each Python version):
+1. Build SeaweedFS
+2. Run PyArrow Parquet integration tests (`make test-with-server`)
+3. Run implicit directory fix tests (`make test-implicit-dir-with-server`)
+4. Run PyArrow native S3 filesystem tests (`make test-native-s3-with-server`) ✅ **NEW**
+5. Run SSE-S3 encryption compatibility tests (`make test-sse-s3-compat`) ✅ **NEW**
+6. Run Go unit tests for implicit directory handling
+
**Triggers**:
- Push/PR to master (when `weed/s3api/**` or `weed/filer/**` changes)
- Manual trigger via GitHub UI (workflow_dispatch)