# PyArrow Parquet S3 Compatibility Tests
This directory contains tests for PyArrow Parquet compatibility with the SeaweedFS S3 API, including the implicit directory detection fix.
## Overview
**Status**: ✅ **All PyArrow methods work correctly with SeaweedFS**
SeaweedFS implements implicit directory detection to improve compatibility with s3fs and PyArrow. When PyArrow writes datasets using `write_dataset()`, it may create directory markers that can confuse s3fs. SeaweedFS now handles these correctly by returning 404 for HEAD requests on implicit directories (directories with children), forcing s3fs to use LIST-based discovery.
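With the fix in place, s3fs resolves such paths as directories via LIST. A minimal sketch of what this looks like from the client side, assuming the illustrative endpoint and credentials used throughout this README and an existing `bucket/dataset` written by `write_dataset()`:
```python
import s3fs

# Illustrative endpoint and credentials; adjust for your deployment.
fs = s3fs.S3FileSystem(
    key='your_access_key',
    secret='your_secret_key',
    endpoint_url='http://localhost:8333',
    use_ssl=False,
)

# HEAD on 'bucket/dataset' returns 404, so s3fs falls back to LIST
# and correctly reports the path as a directory.
info = fs.info('bucket/dataset')
print(info['type'])  # expected: 'directory'
```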
## Quick Start
### Running the Example Script
```bash
# Start SeaweedFS server
make start-seaweedfs-ci
# Run the example script
python3 example_pyarrow_native.py
# Or with uv (if available)
uv run example_pyarrow_native.py
# Stop the server when done
make stop-seaweedfs-safe
```
### Running Tests
```bash
# Setup Python environment
make setup-python
# Run all tests with server (small and large files)
make test-with-server
# Run quick tests with small files only (faster for development)
make test-quick
# Run implicit directory fix tests
make test-implicit-dir-with-server
# Run PyArrow native S3 filesystem tests
make test-native-s3-with-server
# Run SSE-S3 encryption tests
make test-sse-s3-compat
# Clean up
make clean
```
### Using PyArrow with SeaweedFS
#### Option 1: Using s3fs (recommended for compatibility)
```python
import pyarrow as pa
import pyarrow.parquet as pq
import pyarrow.dataset as pads
import s3fs
# Configure s3fs
fs = s3fs.S3FileSystem(
    key='your_access_key',
    secret='your_secret_key',
    endpoint_url='http://localhost:8333',
    use_ssl=False
)
# Write dataset (creates directory structure)
table = pa.table({'id': [1, 2, 3], 'value': ['a', 'b', 'c']})
pads.write_dataset(table, 'bucket/dataset', filesystem=fs)
# Read dataset (all methods work!)
dataset = pads.dataset('bucket/dataset', filesystem=fs) # ✅
table = pq.read_table('bucket/dataset', filesystem=fs) # ✅
dataset = pq.ParquetDataset('bucket/dataset', filesystem=fs) # ✅
```
#### Option 2: Using PyArrow's native S3 filesystem (pure PyArrow)
```python
import pyarrow as pa
import pyarrow.parquet as pq
import pyarrow.dataset as pads
import pyarrow.fs as pafs
# Configure PyArrow's native S3 filesystem
s3 = pafs.S3FileSystem(
    access_key='your_access_key',
    secret_key='your_secret_key',
    endpoint_override='localhost:8333',
    scheme='http',
    allow_bucket_creation=True,
    allow_bucket_deletion=True
)
# Write dataset
table = pa.table({'id': [1, 2, 3], 'value': ['a', 'b', 'c']})
pads.write_dataset(table, 'bucket/dataset', filesystem=s3)
# Read dataset (all methods work!)
table = pq.read_table('bucket/dataset', filesystem=s3) # ✅
dataset = pq.ParquetDataset('bucket/dataset', filesystem=s3) # ✅
dataset = pads.dataset('bucket/dataset', filesystem=s3) # ✅
```
## Test Files
### Main Test Suite
- **`s3_parquet_test.py`** - Comprehensive PyArrow test suite
  - Tests 2 write methods × 5 read methods × 2 dataset sizes = 20 combinations
  - Uses s3fs library for S3 operations
  - All tests pass with the implicit directory fix ✅
### PyArrow Native S3 Tests
- **`test_pyarrow_native_s3.py`** - PyArrow's native S3 filesystem tests
  - Tests PyArrow's built-in S3FileSystem (pyarrow.fs.S3FileSystem)
  - Pure PyArrow solution without s3fs dependency
  - Tests 3 read methods × 2 dataset sizes = 6 scenarios
  - All tests pass ✅
- **`test_sse_s3_compatibility.py`** - SSE-S3 encryption compatibility tests
  - Tests PyArrow native S3 with SSE-S3 server-side encryption
  - Tests 5 file sizes (datasets from 10 to 500,000 rows)
  - Verifies multipart upload encryption works correctly
  - All tests pass ✅
### Implicit Directory Tests
- **`test_implicit_directory_fix.py`** - Specific tests for the implicit directory fix
  - Tests HEAD request behavior
  - Tests s3fs directory detection
  - Tests PyArrow dataset reading
  - All 6 tests pass ✅
### Examples
- **`example_pyarrow_native.py`** - Simple standalone example
  - Demonstrates PyArrow's native S3 filesystem usage
  - Can be run with `uv run` or regular Python
  - Minimal dependencies (pyarrow, boto3)
### Configuration
- **`Makefile`** - Build and test automation
- **`requirements.txt`** - Python dependencies (pyarrow, s3fs, boto3)
- **`.gitignore`** - Ignore patterns for test artifacts
## Documentation
### Technical Documentation
- **`TEST_COVERAGE.md`** - Comprehensive test coverage documentation
  - Unit tests (Go): 17 test cases
  - Integration tests (Python): 6 test cases
  - End-to-end tests (Python): 20 test cases
- **`FINAL_ROOT_CAUSE_ANALYSIS.md`** - Deep technical analysis
  - Root cause of the s3fs compatibility issue
  - How the implicit directory fix works
  - Performance considerations
- **`MINIO_DIRECTORY_HANDLING.md`** - Comparison with MinIO
  - How MinIO handles directory markers
  - Differences in implementation approaches
## The Implicit Directory Fix
### Problem
When PyArrow writes datasets with `write_dataset()`, it may create 0-byte directory markers. s3fs's `info()` method calls HEAD on these paths, and if HEAD returns 200 with size=0, s3fs incorrectly reports them as files instead of directories. This causes PyArrow to fail with "Parquet file size is 0 bytes".
### Solution
SeaweedFS now returns 404 for HEAD requests on implicit directories, i.e. 0-byte objects or directory entries that have children, when the path is requested without a trailing slash. This forces s3fs to fall back to LIST-based discovery, which correctly identifies directories by checking for children.
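The difference is observable with plain boto3. A minimal sketch, assuming SeaweedFS on `localhost:8333` and a bucket `test-bucket` containing only `dataset/part-0.parquet` (bucket, key, and credentials are illustrative):
```python
import boto3
from botocore.exceptions import ClientError

# Illustrative endpoint and credentials; adjust for your deployment.
s3 = boto3.client(
    's3',
    endpoint_url='http://localhost:8333',
    aws_access_key_id='your_access_key',
    aws_secret_access_key='your_secret_key',
)

# HEAD on the implicit directory "dataset" (no trailing slash):
# 404, because "dataset" has children.
try:
    s3.head_object(Bucket='test-bucket', Key='dataset')
except ClientError as e:
    print(e.response['Error']['Code'])  # expected: '404'

# LIST still reveals the children, so LIST-based discovery succeeds.
resp = s3.list_objects_v2(Bucket='test-bucket', Prefix='dataset/')
print([obj['Key'] for obj in resp.get('Contents', [])])
```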
### Implementation
The fix is implemented in `weed/s3api/s3api_object_handlers.go`:
- `HeadObjectHandler` - Returns 404 for implicit directories
- `hasChildren` - Helper function to check if a path has children
See the source code for detailed inline documentation.
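The decision flow, paraphrased in Python for readability (the actual implementation is Go; the boolean parameters are illustrative stand-ins for what the handler derives from the filer entry):
```python
def implicit_dir_head_returns_404(key, size, is_dir, bucket_versioned, has_children):
    """Illustrative paraphrase of the implicit-directory check in
    HeadObjectHandler; not the actual Go code. Returns True when HEAD
    should answer 404 to force s3fs's LIST-based fallback."""
    if bucket_versioned:
        return False         # versioned buckets: skip the check entirely
    if key.endswith('/'):
        return False         # explicit directory request: normal behavior
    if size == 0 or is_dir:
        return has_children  # answered by one LIST with Limit=1 (~1-5ms)
    return False             # regular non-empty file: normal behavior
```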
### Test Coverage
- **Unit tests** (Go): `weed/s3api/s3api_implicit_directory_test.go`
  - Run: `cd weed/s3api && go test -v -run TestImplicitDirectory`
- **Integration tests** (Python): `test_implicit_directory_fix.py`
  - Run: `cd test/s3/parquet && make test-implicit-dir-with-server`
- **End-to-end tests** (Python): `s3_parquet_test.py`
  - Run: `cd test/s3/parquet && make test-with-server`
## Makefile Targets
```bash
# Setup
make setup-python # Create Python virtual environment and install dependencies
make build-weed # Build SeaweedFS binary
# Testing
make test # Run full tests (assumes server is already running)
make test-with-server # Run full PyArrow test suite with server (small + large files)
make test-quick # Run quick tests with small files only (assumes server is running)
make test-implicit-dir-with-server # Run implicit directory tests with server
make test-native-s3 # Run PyArrow native S3 tests (assumes server is running)
make test-native-s3-with-server # Run PyArrow native S3 tests with server management
make test-sse-s3-compat # Run comprehensive SSE-S3 encryption compatibility tests
# Server Management
make start-seaweedfs-ci # Start SeaweedFS in background (CI mode)
make stop-seaweedfs-safe # Stop SeaweedFS gracefully
make clean # Clean up all test artifacts
# Development
make help # Show all available targets
```
## Continuous Integration
The tests are automatically run in GitHub Actions on every push/PR that affects S3 or filer code:
**Workflow**: `.github/workflows/s3-parquet-tests.yml`
**Test Matrix**:
- Python versions: 3.9, 3.11, 3.12
- PyArrow integration tests (s3fs): 20 test combinations
- PyArrow native S3 tests: 6 test scenarios ✅ **NEW**
- SSE-S3 encryption tests: 5 file sizes ✅ **NEW**
- Implicit directory fix tests: 6 test scenarios
- Go unit tests: 17 test cases
**Test Steps** (run for each Python version):
1. Build SeaweedFS
2. Run PyArrow Parquet integration tests (`make test-with-server`)
3. Run implicit directory fix tests (`make test-implicit-dir-with-server`)
4. Run PyArrow native S3 filesystem tests (`make test-native-s3-with-server`) ✅ **NEW**
5. Run SSE-S3 encryption compatibility tests (`make test-sse-s3-compat`) ✅ **NEW**
6. Run Go unit tests for implicit directory handling
**Triggers**:
- Push/PR to master (when `weed/s3api/**` or `weed/filer/**` changes)
- Manual trigger via GitHub UI (workflow_dispatch)
## Requirements
- Python 3.8+
- PyArrow 22.0.0+
- s3fs 2024.12.0+
- boto3 1.40.0+
- SeaweedFS (latest)
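If you are not using `make setup-python`, an equivalent environment can be created by hand (version pins mirror the list above):
```bash
python3 -m venv .venv && source .venv/bin/activate
pip install 'pyarrow>=22.0.0' 's3fs>=2024.12.0' 'boto3>=1.40.0'
```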
## AWS S3 Compatibility
The implicit directory fix makes SeaweedFS behavior more compatible with AWS S3:
- AWS S3 typically doesn't create directory markers for implicit directories
- HEAD on "dataset" (when only "dataset/file.txt" exists) returns 404 on AWS
- SeaweedFS now matches this behavior for implicit directories with children
## Edge Cases Handled
- ✅ **Implicit directories with children** → 404 (forces LIST-based discovery)
- ✅ **Empty files (0-byte, no children)** → 200 (legitimate empty file)
- ✅ **Empty directories (no children)** → 200 (legitimate empty directory)
- ✅ **Explicit directory requests (trailing slash)** → 200 (normal directory behavior)
- ✅ **Versioned buckets** → Skip implicit directory check (versioned semantics)
- ✅ **Regular files** → 200 (normal file behavior)
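A few of these cases can be spot-checked from the client side. A minimal sketch, reusing the illustrative boto3 client from the Solution section (keys are hypothetical; expected codes follow the list above):
```python
import boto3
from botocore.exceptions import ClientError

s3 = boto3.client(
    's3',
    endpoint_url='http://localhost:8333',
    aws_access_key_id='your_access_key',
    aws_secret_access_key='your_secret_key',
)

def head_status(bucket, key):
    """HTTP status code of a HEAD request for (bucket, key)."""
    try:
        s3.head_object(Bucket=bucket, Key=key)
        return 200
    except ClientError as e:
        return e.response['ResponseMetadata']['HTTPStatusCode']

# Hypothetical fixtures: "dataset/part-0.parquet" already exists;
# "empty.txt" is a genuine 0-byte file with no children.
s3.put_object(Bucket='test-bucket', Key='empty.txt', Body=b'')

print(head_status('test-bucket', 'dataset'))    # 404: implicit dir with children
print(head_status('test-bucket', 'dataset/'))   # 200: explicit directory request
print(head_status('test-bucket', 'empty.txt'))  # 200: legitimate empty file
```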
## Performance
The implicit directory check adds minimal overhead:
- Only triggered for 0-byte objects or directories without trailing slash
- Cost: One LIST operation with Limit=1 (~1-5ms)
- No impact on regular file operations
## Contributing
When adding new tests:
1. Add test cases to the appropriate test file
2. Update TEST_COVERAGE.md
3. Run the full test suite to ensure no regressions
4. Update this README if adding new functionality
## References
- [PyArrow Documentation](https://arrow.apache.org/docs/python/parquet.html)
- [s3fs Documentation](https://s3fs.readthedocs.io/)
- [SeaweedFS S3 API](https://github.com/seaweedfs/seaweedfs/wiki/Amazon-S3-API)
- [AWS S3 API Reference](https://docs.aws.amazon.com/AmazonS3/latest/API/)
---
**Last Updated**: November 19, 2025
**Status**: All tests passing ✅