Diffstat (limited to 'test/erasure_coding/README.md')
| -rw-r--r-- | test/erasure_coding/README.md | 86 |
1 files changed, 86 insertions, 0 deletions
diff --git a/test/erasure_coding/README.md b/test/erasure_coding/README.md
new file mode 100644
index 000000000..c04844982
--- /dev/null
+++ b/test/erasure_coding/README.md
@@ -0,0 +1,86 @@
+# Erasure Coding Integration Tests
+
+This directory contains integration tests for the EC (Erasure Coding) encoding volume location timing bug fix.
+
+## The Bug
+
+The bug caused **double storage usage** during EC encoding because:
+
+1. **Silent failure**: Functions returned `nil` instead of proper error messages
+2. **Timing race condition**: Volume locations were collected **AFTER** EC encoding when master metadata was already updated
+3. **Missing cleanup**: Original volumes weren't being deleted after EC encoding
+
+This resulted in both original `.dat` files AND EC `.ec00-.ec13` files coexisting, effectively **doubling storage usage**.
+
+## The Fix
+
+The fix addresses all three issues:
+
+1. **Fixed silent failures**: Updated `doDeleteVolumes()` and `doEcEncode()` to return proper errors
+2. **Fixed timing race condition**: Created `doDeleteVolumesWithLocations()` that uses pre-collected volume locations
+3. **Enhanced cleanup**: Volume locations are now collected **BEFORE** EC encoding, preventing the race condition
+
+## Integration Tests
+
+### TestECEncodingVolumeLocationTimingBug
+The main integration test that:
+- **Simulates master timing race condition**: Tests what happens when volume locations are read from master AFTER EC encoding has updated the metadata
+- **Verifies fix effectiveness**: Checks for the "Collecting volume locations...before EC encoding" message that proves the fix is working
+- **Tests multi-server distribution**: Runs EC encoding with 6 volume servers to test shard distribution
+- **Validates cleanup**: Ensures original volumes are properly cleaned up after EC encoding
+
+### TestECEncodingMasterTimingRaceCondition
+A focused test that specifically targets the **master metadata timing race condition**:
+- **Simulates the exact race condition**: Tests volume location collection timing relative to master metadata updates
+- **Detects timing fix**: Verifies that volume locations are collected BEFORE EC encoding starts
+- **Demonstrates bug impact**: Shows what happens when volume locations are unavailable after master metadata update
+
+### TestECEncodingRegressionPrevention
+Regression tests that ensure:
+- **Function signatures**: Fixed functions still exist and return proper errors
+- **Timing patterns**: Volume location collection happens in the correct order
+
+## Test Architecture
+
+The tests use:
+- **Real SeaweedFS cluster**: 1 master server + 6 volume servers
+- **Multi-server setup**: Tests realistic EC shard distribution across multiple servers
+- **Timing simulation**: Goroutines and delays to simulate race conditions
+- **Output validation**: Checks for specific log messages that prove the fix is working
+
+## Why Integration Tests Were Necessary
+
+Unit tests could not catch this bug because:
+1. **Race condition**: The bug only occurred in real-world timing scenarios
+2. **Master-volume server interaction**: Required actual master metadata updates
+3. **File system operations**: Needed real volume creation and EC shard generation
+4. **Cleanup timing**: Required testing the sequence of operations in correct order
+
+The integration tests successfully catch the timing bug by:
+- **Testing real command execution**: Uses actual `ec.encode` shell command
+- **Simulating race conditions**: Creates timing scenarios that expose the bug
+- **Validating output messages**: Checks for the key "Collecting volume locations...before EC encoding" message
+- **Monitoring cleanup behavior**: Ensures original volumes are properly deleted
+
+## Running the Tests
+
+```bash
+# Run all integration tests
+go test -v
+
+# Run only the main timing test
+go test -v -run TestECEncodingVolumeLocationTimingBug
+
+# Run only the race condition test
+go test -v -run TestECEncodingMasterTimingRaceCondition
+
+# Skip integration tests (short mode)
+go test -v -short
+```
+
+## Test Results
+
+**With the fix**: Shows "Collecting volume locations for N volumes before EC encoding..." message
+**Without the fix**: No collection message, potential timing race condition
+
+The tests demonstrate that the fix prevents the volume location timing bug that caused double storage usage in EC encoding operations.
\ No newline at end of file
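
The ordering problem this README describes can be illustrated with a small, self-contained Go sketch. It is not the SeaweedFS implementation: the `master` type and the functions `lookup`, `ecEncode`, `deleteOriginal`, `encodeBuggy`, `encodeFixed`, and `newMaster` are hypothetical names invented for this example, and the assumption that the master drops a volume's locations as soon as encoding finishes is a simplification of the "master metadata was already updated" condition. The sketch only shows why reading locations before encoding lets cleanup succeed, and why returning an error instead of `nil` makes the failure visible.

```go
package main

import "fmt"

// master is a hypothetical stand-in for the master's volume-location metadata.
type master struct {
	locations map[uint32][]string // volume ID -> addresses of servers holding it
}

// lookup returns a copy of the servers currently listed for the volume.
func (m *master) lookup(vid uint32) []string {
	return append([]string(nil), m.locations[vid]...)
}

// ecEncode simulates EC encoding. Assumption: once encoding completes, the
// master no longer lists the original volume's locations.
func (m *master) ecEncode(vid uint32) {
	delete(m.locations, vid)
}

// deleteOriginal pretends to delete the original volume from each server and
// reports how many servers were contacted. It returns an error, rather than
// silently succeeding, when no locations are known.
func deleteOriginal(vid uint32, servers []string) (int, error) {
	if len(servers) == 0 {
		return 0, fmt.Errorf("volume %d: no locations known, original not deleted", vid)
	}
	return len(servers), nil
}

// encodeBuggy reads locations AFTER encoding, so cleanup finds nothing to delete.
func encodeBuggy(m *master, vid uint32) (int, error) {
	m.ecEncode(vid)
	return deleteOriginal(vid, m.lookup(vid))
}

// encodeFixed collects locations BEFORE encoding and reuses them for cleanup.
func encodeFixed(m *master, vid uint32) (int, error) {
	saved := m.lookup(vid)
	m.ecEncode(vid)
	return deleteOriginal(vid, saved)
}

// newMaster builds a master that knows volume 7 lives on two servers.
func newMaster() *master {
	return &master{locations: map[uint32][]string{
		7: {"server-a:8080", "server-b:8080"},
	}}
}

func main() {
	_, err := encodeBuggy(newMaster(), 7)
	fmt.Println("lookup after encoding: ", err)

	n, err := encodeFixed(newMaster(), 7)
	fmt.Println("lookup before encoding:", n, "servers cleaned, err =", err)
}
```

In this toy model the two functions differ only in when `lookup` is called, which mirrors the README's point that the fix is a matter of ordering rather than of new cleanup logic.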
