Diffstat (limited to 'test/s3/parquet/README.md')
-rw-r--r--  test/s3/parquet/README.md | 87
1 file changed, 86 insertions(+), 1 deletion(-)
diff --git a/test/s3/parquet/README.md b/test/s3/parquet/README.md
index 48ce3e6fc..ed65e4cbb 100644
--- a/test/s3/parquet/README.md
+++ b/test/s3/parquet/README.md
@@ -10,6 +10,22 @@ SeaweedFS implements implicit directory detection to improve compatibility with
 
 ## Quick Start
 
+### Running the Example Script
+
+```bash
+# Start SeaweedFS server
+make start-seaweedfs-ci
+
+# Run the example script
+python3 example_pyarrow_native.py
+
+# Or with uv (if available)
+uv run example_pyarrow_native.py
+
+# Stop the server when done
+make stop-seaweedfs-safe
+```
+
 ### Running Tests
 
 ```bash
@@ -25,12 +41,20 @@ make test-quick
 # Run implicit directory fix tests
 make test-implicit-dir-with-server
 
+# Run PyArrow native S3 filesystem tests
+make test-native-s3-with-server
+
+# Run SSE-S3 encryption tests
+make test-sse-s3-compat
+
 # Clean up
 make clean
 ```
 
 ### Using PyArrow with SeaweedFS
 
+#### Option 1: Using s3fs (recommended for compatibility)
+
 ```python
 import pyarrow as pa
 import pyarrow.parquet as pq
@@ -55,13 +79,55 @@ table = pq.read_table('bucket/dataset', filesystem=fs) # ✅
 dataset = pq.ParquetDataset('bucket/dataset', filesystem=fs) # ✅
 ```
 
+#### Option 2: Using PyArrow's native S3 filesystem (pure PyArrow)
+
+```python
+import pyarrow as pa
+import pyarrow.parquet as pq
+import pyarrow.dataset as pads
+import pyarrow.fs as pafs
+
+# Configure PyArrow's native S3 filesystem
+s3 = pafs.S3FileSystem(
+    access_key='your_access_key',
+    secret_key='your_secret_key',
+    endpoint_override='localhost:8333',
+    scheme='http',
+    allow_bucket_creation=True,
+    allow_bucket_deletion=True
+)
+
+# Write dataset
+table = pa.table({'id': [1, 2, 3], 'value': ['a', 'b', 'c']})
+pads.write_dataset(table, 'bucket/dataset', filesystem=s3)
+
+# Read dataset (all methods work!)
+table = pq.read_table('bucket/dataset', filesystem=s3) # ✅
+dataset = pq.ParquetDataset('bucket/dataset', filesystem=s3) # ✅
+dataset = pads.dataset('bucket/dataset', filesystem=s3) # ✅
+```
+
 ## Test Files
 
 ### Main Test Suite
 - **`s3_parquet_test.py`** - Comprehensive PyArrow test suite
   - Tests 2 write methods × 5 read methods × 2 dataset sizes = 20 combinations
+  - Uses s3fs library for S3 operations
   - All tests pass with the implicit directory fix ✅
 
+### PyArrow Native S3 Tests
+- **`test_pyarrow_native_s3.py`** - PyArrow's native S3 filesystem tests
+  - Tests PyArrow's built-in S3FileSystem (pyarrow.fs.S3FileSystem)
+  - Pure PyArrow solution without s3fs dependency
+  - Tests 3 read methods × 2 dataset sizes = 6 scenarios
+  - All tests pass ✅
+
+- **`test_sse_s3_compatibility.py`** - SSE-S3 encryption compatibility tests
+  - Tests PyArrow native S3 with SSE-S3 server-side encryption
+  - Tests 5 different file sizes (10 to 500,000 rows)
+  - Verifies multipart upload encryption works correctly
+  - All tests pass ✅
+
 ### Implicit Directory Tests
 - **`test_implicit_directory_fix.py`** - Specific tests for the implicit directory fix
   - Tests HEAD request behavior
@@ -69,6 +135,12 @@ dataset = pq.ParquetDataset('bucket/dataset', filesystem=fs) # ✅
   - Tests PyArrow dataset reading
   - All 6 tests pass ✅
 
+### Examples
+- **`example_pyarrow_native.py`** - Simple standalone example
+  - Demonstrates PyArrow's native S3 filesystem usage
+  - Can be run with `uv run` or regular Python
+  - Minimal dependencies (pyarrow, boto3)
+
 ### Configuration
 - **`Makefile`** - Build and test automation
 - **`requirements.txt`** - Python dependencies (pyarrow, s3fs, boto3)
@@ -128,6 +200,9 @@ make test # Run full tests (assumes server is already running)
 make test-with-server # Run full PyArrow test suite with server (small + large files)
 make test-quick # Run quick tests with small files only (assumes server is running)
 make test-implicit-dir-with-server # Run implicit directory tests with server
+make test-native-s3 # Run PyArrow native S3 tests (assumes server is running)
+make test-native-s3-with-server # Run PyArrow native S3 tests with server management
+make test-sse-s3-compat # Run comprehensive SSE-S3 encryption compatibility tests
 
 # Server Management
 make start-seaweedfs-ci # Start SeaweedFS in background (CI mode)
@@ -146,10 +221,20 @@ The tests are automatically run in GitHub Actions on every push/PR that affects
 
 **Test Matrix**:
 - Python versions: 3.9, 3.11, 3.12
-- PyArrow integration tests: 20 test combinations
+- PyArrow integration tests (s3fs): 20 test combinations
+- PyArrow native S3 tests: 6 test scenarios ✅ **NEW**
+- SSE-S3 encryption tests: 5 file sizes ✅ **NEW**
 - Implicit directory fix tests: 6 test scenarios
 - Go unit tests: 17 test cases
 
+**Test Steps** (run for each Python version):
+1. Build SeaweedFS
+2. Run PyArrow Parquet integration tests (`make test-with-server`)
+3. Run implicit directory fix tests (`make test-implicit-dir-with-server`)
+4. Run PyArrow native S3 filesystem tests (`make test-native-s3-with-server`) ✅ **NEW**
+5. Run SSE-S3 encryption compatibility tests (`make test-sse-s3-compat`) ✅ **NEW**
+6. Run Go unit tests for implicit directory handling
+
 **Triggers**:
 - Push/PR to master (when `weed/s3api/**` or `weed/filer/**` changes)
 - Manual trigger via GitHub UI (workflow_dispatch)
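The Option 1 snippet in the diff reads and writes through an `fs` object whose construction falls outside the hunk context, so the README's actual setup is not shown here. The following is only a hypothetical sketch of a typical s3fs configuration against a local SeaweedFS S3 gateway; the credentials and endpoint are placeholders.

```python
# Hypothetical s3fs setup for the Option 1 snippet; the README's real
# configuration is elided by the diff context, so key, secret, and
# endpoint below are placeholders only.
import s3fs

fs = s3fs.S3FileSystem(
    key='your_access_key',
    secret='your_secret_key',
    client_kwargs={'endpoint_url': 'http://localhost:8333'},  # SeaweedFS S3 gateway
)
```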
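The SSE-S3 tests added by this diff (`test_sse_s3_compatibility.py`) verify that server-side encryption survives both single-shot and multipart uploads. A rough standalone probe along the same lines, with placeholder credentials, endpoint, and bucket name, could look like this; the real assertions live in the test file itself.

```python
# Rough probe of SSE-S3 behaviour against a local SeaweedFS S3 gateway.
# Endpoint, credentials, and bucket name are placeholder assumptions.
import io
import boto3
from boto3.s3.transfer import TransferConfig

s3 = boto3.client(
    's3',
    endpoint_url='http://localhost:8333',
    aws_access_key_id='your_access_key',
    aws_secret_access_key='your_secret_key',
)
s3.create_bucket(Bucket='sse-test')  # assumes the bucket does not exist yet

# Small object: request SSE-S3 explicitly, then check that the stored
# object reports AES256 encryption.
s3.put_object(Bucket='sse-test', Key='small.bin', Body=b'x' * 1024,
              ServerSideEncryption='AES256')
head = s3.head_object(Bucket='sse-test', Key='small.bin')
assert head.get('ServerSideEncryption') == 'AES256'

# Large object: force a multipart upload with a low threshold so the
# upload exercises the multipart encryption path.
config = TransferConfig(multipart_threshold=5 * 1024 * 1024,
                        multipart_chunksize=5 * 1024 * 1024)
s3.upload_fileobj(io.BytesIO(b'x' * (16 * 1024 * 1024)), 'sse-test', 'large.bin',
                  ExtraArgs={'ServerSideEncryption': 'AES256'},
                  Config=config)
head = s3.head_object(Bucket='sse-test', Key='large.bin')
assert head.get('ServerSideEncryption') == 'AES256'
```

Forcing the 5 MiB threshold keeps the probe fast while still driving the multipart code path that the larger file sizes in the test suite (up to 500,000 rows) exercise.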
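The implicit directory tests (`test_implicit_directory_fix.py`) center on how HEAD requests behave for directory-like keys such as `bucket/dataset`. A hypothetical probe of that behavior, again with placeholder connection details, might look like the following; the exact response SeaweedFS returns is what the test suite pins down.

```python
# Hypothetical probe of the HEAD behaviour the implicit-directory tests
# exercise: after a dataset is written under bucket/dataset/, a HEAD on
# the bare 'dataset' key should let clients like PyArrow treat it as a
# directory rather than a zero-byte file. Connection details are placeholders.
import boto3
from botocore.exceptions import ClientError

s3 = boto3.client('s3', endpoint_url='http://localhost:8333',
                  aws_access_key_id='your_access_key',
                  aws_secret_access_key='your_secret_key')

try:
    head = s3.head_object(Bucket='bucket', Key='dataset')
    print('HEAD dataset ->', head['ResponseMetadata']['HTTPStatusCode'],
          head.get('ContentType'))
except ClientError as e:
    # A 404 is also workable for clients: they then fall back to LIST.
    print('HEAD dataset ->', e.response['Error']['Code'])

# Listing under the prefix shows the objects that make the directory implicit.
resp = s3.list_objects_v2(Bucket='bucket', Prefix='dataset/')
print([obj['Key'] for obj in resp.get('Contents', [])])
```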
