Diffstat (limited to 'README.md')
| -rw-r--r-- | README.md | 244 |
1 files changed, 156 insertions, 88 deletions
@@ -1,14 +1,18 @@ # SeaweedFS +[](https://join.slack.com/t/seaweedfs/shared_invite/enQtMzI4MTMwMjU2MzA3LTEyYzZmZWYzOGQ3MDJlZWMzYmI0OTE4OTJiZjJjODBmMzUxNmYwODg0YjY3MTNlMjBmZDQ1NzQ5NDJhZWI2ZmY) +[](https://twitter.com/intent/follow?screen_name=seaweedfs) [](https://travis-ci.org/chrislusf/seaweedfs) [](https://godoc.org/github.com/chrislusf/seaweedfs/weed) [](https://github.com/chrislusf/seaweedfs/wiki) -[](https://hub.docker.com/r/chrislusf/seaweedfs/) +[](https://hub.docker.com/r/chrislusf/seaweedfs/) +[](https://search.maven.org/search?q=g:com.github.chrislusf) +  -<h2 align="center">Supporting SeaweedFS</h2> +<h2 align="center"><a href="https://www.patreon.com/seaweedfs">Sponsor SeaweedFS via Patreon</a></h2> SeaweedFS is an independent Apache-licensed open source project with its ongoing development made possible entirely thanks to the support of these awesome [backers](https://github.com/chrislusf/seaweedfs/blob/master/backers.md). @@ -17,8 +21,6 @@ If you'd like to grow SeaweedFS even stronger, please consider joining our Your support will be really appreciated by me and other supporters! -<h3 align="center"><a href="https://www.patreon.com/seaweedfs">Sponsor SeaweedFS via Patreon</a></h3> - <!-- <h4 align="center">Platinum</h4> @@ -27,41 +29,32 @@ Your support will be really appreciated by me and other supporters! 
Add your name or icon here </a> </p> +--> -<h4 align="center">Gold</h4> - -<table> - <tbody> - <tr> - <td align="center" valign="middle"> - <a href="" target="_blank"> - Add your name or icon here - </a> - </td> - </tr> - <tr></tr> - </tbody> -</table> ---> +### Gold Sponsors + --- - [Download Binaries for different platforms](https://github.com/chrislusf/seaweedfs/releases/latest) - [SeaweedFS on Slack](https://join.slack.com/t/seaweedfs/shared_invite/enQtMzI4MTMwMjU2MzA3LTEyYzZmZWYzOGQ3MDJlZWMzYmI0OTE4OTJiZjJjODBmMzUxNmYwODg0YjY3MTNlMjBmZDQ1NzQ5NDJhZWI2ZmY) +- [SeaweedFS on Twitter](https://twitter.com/SeaweedFS) - [SeaweedFS Mailing List](https://groups.google.com/d/forum/seaweedfs) - [Wiki Documentation](https://github.com/chrislusf/seaweedfs/wiki) +- [SeaweedFS White Paper](https://github.com/chrislusf/seaweedfs/wiki/SeaweedFS_Architecture.pdf) - [SeaweedFS Introduction Slides](https://www.slideshare.net/chrislusf/seaweedfs-introduction) Table of Contents ================= +* [Quick Start](#quick-start) * [Introduction](#introduction) * [Features](#features) * [Additional Features](#additional-features) * [Filer Features](#filer-features) -* [Example Usage](#example-usage) +* [Example: Using Seaweed Object Store](#example-using-seaweed-object-store) * [Architecture](#architecture) * [Compared to Other File Systems](#compared-to-other-file-systems) * [Compared to HDFS](#compared-to-hdfs) @@ -74,6 +67,13 @@ Table of Contents * [Benchmark](#Benchmark) * [License](#license) + +## Quick Start ## +* Download the latest binary from https://github.com/chrislusf/seaweedfs/releases and unzip a single binary file `weed` or `weed.exe` +* Run `weed server -dir=/some/data/dir -s3` to start one master, one volume server, one filer, and one S3 gateway. + +Also, to increase capacity, just add more volume servers by running `weed volume -dir="/some/data/dir2" -mserver="<master_host>:9333" -port=8081` locally, or on a different machine, or on thousands of machines. That is it!
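The quick-start steps above can also be scripted. A minimal sketch in Python, assuming the `weed` binary from the release zip is on the PATH and that `/data/seaweed1` and `/data/seaweed2` are writable example directories (both names are illustrative):

```python
# Sketch: launch the all-in-one server, then grow capacity with an
# extra volume server. Paths and ports match the quick-start text.
import shutil
import subprocess

# All-in-one node: master + volume server + filer + S3 gateway.
SERVER = ["weed", "server", "-dir=/data/seaweed1", "-s3"]
# Extra volume server; it registers itself with the master on port 9333.
EXTRA_VOLUME = ["weed", "volume", "-dir=/data/seaweed2",
                "-mserver=localhost:9333", "-port=8081"]


def start(cmd):
    """Launch a weed process in the background, or show what would run."""
    if shutil.which(cmd[0]) is None:
        print("weed not installed; would run:", " ".join(cmd))
        return None
    return subprocess.Popen(cmd)


if __name__ == "__main__":
    start(SERVER)
    start(EXTRA_VOLUME)
```

Running more `EXTRA_VOLUME`-style commands on other machines is all that is needed to scale out.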
+ ## Introduction ## SeaweedFS is a simple and highly scalable distributed file system. There are two objectives: @@ -81,17 +81,34 @@ SeaweedFS is a simple and highly scalable distributed file system. There are two 1. to store billions of files! 2. to serve the files fast! -SeaweedFS started as an Object Store to handle small files efficiently. Instead of managing all file metadata in a central master, the central master only manages file volumes, and it lets these volume servers manage files and their metadata. This relieves concurrency pressure from the central master and spreads file metadata into volume servers, allowing faster file access (just one disk read operation). +SeaweedFS started as an Object Store to handle small files efficiently. +Instead of managing all file metadata in a central master, +the central master only manages volumes on volume servers, +and these volume servers manage files and their metadata. +This relieves concurrency pressure from the central master and spreads file metadata into volume servers, +allowing faster file access (O(1), usually just one disk read operation). -There is only 40 bytes of disk storage overhead for each file's metadata. It is so simple with O(1) disk reads that you are welcome to challenge the performance with your actual use cases. +SeaweedFS can transparently integrate with the cloud. +With hot data on the local cluster, and warm data on the cloud with O(1) access time, +SeaweedFS can achieve both fast local access time and elastic cloud storage capacity. +What's more, the cloud storage access API cost is minimized. +Faster and cheaper than direct cloud storage! +Sign up for the future managed SeaweedFS cluster offering at "seaweedfilesystem at gmail dot com". -SeaweedFS started by implementing [Facebook's Haystack design paper](http://www.usenix.org/event/osdi10/tech/full_papers/Beaver.pdf).
Also, SeaweedFS implements erasure coding with ideas from [f4: Facebook’s Warm BLOB Storage System](https://www.usenix.org/system/files/conference/osdi14/osdi14-paper-muralidhar.pdf) +There is only 40 bytes of disk storage overhead for each file's metadata. +It is so simple with O(1) disk reads that you are welcome to challenge the performance with your actual use cases. -SeaweedFS can work very well with just the object store. [[Filer]] can then be added later to support directories and POSIX attributes. Filer is a separate linearly-scalable stateless server with customizable metadata stores, e.g., MySql/Postgres/Redis/Etcd/Cassandra/LevelDB/MemSql/TiDB/CockroachDB. +SeaweedFS started by implementing [Facebook's Haystack design paper](http://www.usenix.org/event/osdi10/tech/full_papers/Beaver.pdf). +Also, SeaweedFS implements erasure coding with ideas from +[f4: Facebook’s Warm BLOB Storage System](https://www.usenix.org/system/files/conference/osdi14/osdi14-paper-muralidhar.pdf), and has a lot of similarities with [Facebook’s Tectonic Filesystem](https://www.usenix.org/system/files/fast21-pan.pdf) -[Back to TOC](#table-of-contents) +On top of the object store, the optional [Filer] can support directories and POSIX attributes. +Filer is a separate linearly-scalable stateless server with customizable metadata stores, +e.g., MySql, Postgres, Redis, Cassandra, HBase, Mongodb, Elastic Search, LevelDB, RocksDB, MemSql, TiDB, Etcd, CockroachDB, etc. -## Features ## +For any distributed key-value store, large values can be offloaded to SeaweedFS. +With the fast access speed and linearly scalable capacity, +SeaweedFS can work as a distributed [Key-Large-Value store][KeyLargeValueStore]. [Back to TOC](#table-of-contents) @@ -100,35 +117,57 @@ SeaweedFS can work very well with just the object store. [[Filer]] can then be a * Automatic master server failover - no single point of failure (SPOF). * Automatic Gzip compression depending on file mime type.
* Automatic compaction to reclaim disk space after deletion or update. -* Servers in the same cluster can have different disk spaces, file systems, OS etc. -* Adding/Removing servers does **not** cause any data re-balancing. -* Optionally fix the orientation for jpeg pictures. +* [Automatic entry TTL expiration][VolumeServerTTL]. +* Any server with some disk space can add to the total storage space. +* Adding/Removing servers does **not** cause any data re-balancing unless triggered by admin commands. +* Optional picture resizing. * Support ETag, Accept-Range, Last-Modified, etc. -* Support in-memory/leveldb/boltdb/btree mode tuning for memory/performance balance. +* Support in-memory/leveldb/readonly mode tuning for memory/performance balance. * Support rebalancing the writable and readonly volumes. +* [Customizable Multiple Storage Tiers][TieredStorage]: Customizable storage disk types to balance performance and cost. +* [Transparent cloud integration][CloudTier]: unlimited capacity via tiered cloud storage for warm data. +* [Erasure Coding for warm storage][ErasureCoding]: Rack-Aware 10.4 erasure coding reduces storage cost and increases availability. [Back to TOC](#table-of-contents) ## Filer Features ## -* [filer server][Filer] provide "normal" directories and files via http. -* [mount filer][Mount] to read and write files directly as a local directory via FUSE. -* [Amazon S3 compatible API][AmazonS3API] to access files with S3 tooling. -* [Erasure Coding for warm storage][ErasureCoding] Rack-Aware 10.4 erasure coding reduces storage cost and increases availability. -* [Hadoop Compatible File System][Hadoop] to access files from Hadoop/Spark/Flink/etc jobs. -* [Async Backup To Cloud][BackupToCloud] has extremely fast local access and backups to Amazon S3, Google Cloud Storage, Azure, BackBlaze. -* [WebDAV] access as a mapped drive on Mac and Windows, or from mobile devices. +* [Filer server][Filer] provides "normal" directories and files via http.
+* [File TTL][FilerTTL] automatically expires file metadata and actual file data. +* [Mount filer][Mount] reads and writes files directly as a local directory via FUSE. +* [Filer Store Replication][FilerStoreReplication] enables HA for filer metadata stores. +* [Active-Active Replication][ActiveActiveAsyncReplication] enables asynchronous one-way or two-way cross-cluster continuous replication. +* [Amazon S3 compatible API][AmazonS3API] accesses files with S3 tooling. +* [Hadoop Compatible File System][Hadoop] accesses files from Hadoop/Spark/Flink/etc or even runs HBase. +* [Async Replication To Cloud][BackupToCloud] has extremely fast local access and backups to Amazon S3, Google Cloud Storage, Azure, BackBlaze. +* [WebDAV] provides access as a mapped drive on Mac and Windows, or from mobile devices. +* [AES256-GCM Encrypted Storage][FilerDataEncryption] safely stores the encrypted data. +* [Super Large Files][SuperLargeFiles] stores large or super large files in tens of TB. + +## Kubernetes ## +* [Kubernetes CSI Driver][SeaweedFsCsiDriver] A Container Storage Interface (CSI) Driver.
[](https://hub.docker.com/r/chrislusf/seaweedfs-csi-driver/) +* [SeaweedFS Operator](https://github.com/seaweedfs/seaweedfs-operator) [Filer]: https://github.com/chrislusf/seaweedfs/wiki/Directories-and-Files -[Mount]: https://github.com/chrislusf/seaweedfs/wiki/Mount +[SuperLargeFiles]: https://github.com/chrislusf/seaweedfs/wiki/Data-Structure-for-Large-Files +[Mount]: https://github.com/chrislusf/seaweedfs/wiki/FUSE-Mount [AmazonS3API]: https://github.com/chrislusf/seaweedfs/wiki/Amazon-S3-API -[BackupToCloud]: https://github.com/chrislusf/seaweedfs/wiki/Backup-to-Cloud +[BackupToCloud]: https://github.com/chrislusf/seaweedfs/wiki/Async-Replication-to-Cloud [Hadoop]: https://github.com/chrislusf/seaweedfs/wiki/Hadoop-Compatible-File-System [WebDAV]: https://github.com/chrislusf/seaweedfs/wiki/WebDAV [ErasureCoding]: https://github.com/chrislusf/seaweedfs/wiki/Erasure-coding-for-warm-storage +[TieredStorage]: https://github.com/chrislusf/seaweedfs/wiki/Tiered-Storage +[CloudTier]: https://github.com/chrislusf/seaweedfs/wiki/Cloud-Tier +[FilerDataEncryption]: https://github.com/chrislusf/seaweedfs/wiki/Filer-Data-Encryption +[FilerTTL]: https://github.com/chrislusf/seaweedfs/wiki/Filer-Stores +[VolumeServerTTL]: https://github.com/chrislusf/seaweedfs/wiki/Store-file-with-a-Time-To-Live +[SeaweedFsCsiDriver]: https://github.com/seaweedfs/seaweedfs-csi-driver +[ActiveActiveAsyncReplication]: https://github.com/chrislusf/seaweedfs/wiki/Filer-Active-Active-cross-cluster-continuous-synchronization +[FilerStoreReplication]: https://github.com/chrislusf/seaweedfs/wiki/Filer-Store-Replication +[KeyLargeValueStore]: https://github.com/chrislusf/seaweedfs/wiki/Filer-as-a-Key-Large-Value-Store [Back to TOC](#table-of-contents) -## Example Usage ## +## Example: Using Seaweed Object Store ## By default, the master node runs on port 9333, and the volume nodes run on port 8080. Let's start one master node, and two volume nodes on port 8080 and 8081. 
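Once the master (port 9333) and volume servers (ports 8080/8081) described above are running, the basic object-store write flow is: ask the master to assign a file id, then upload the bytes to the volume server it returns. A minimal Python sketch using only the standard library — the `/dir/assign` endpoint and the `fid` format follow the SeaweedFS HTTP API, while hosts and ports are the defaults used here:

```python
# Sketch of the SeaweedFS write flow: assign a file id from the master,
# then POST the file to the assigned volume server.
import json
import uuid
from urllib import request


def parse_fid(fid):
    """Split a file id like '3,01637037d6' into (volume id, key+cookie)."""
    volume, _, key_cookie = fid.partition(",")
    return int(volume), key_cookie


def assign(master="http://localhost:9333"):
    """Ask the master for a file id and a volume server to write to."""
    with request.urlopen(master + "/dir/assign") as resp:
        return json.load(resp)  # e.g. {"fid": "3,01637037d6", "url": "127.0.0.1:8080", ...}


def upload(volume_url, fid, name, data):
    """POST the bytes to the assigned volume server as multipart/form-data."""
    boundary = uuid.uuid4().hex
    body = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="file"; filename="{name}"\r\n'
        "Content-Type: application/octet-stream\r\n\r\n"
    ).encode() + data + f"\r\n--{boundary}--\r\n".encode()
    req = request.Request(
        f"http://{volume_url}/{fid}",
        data=body,
        headers={"Content-Type": f"multipart/form-data; boundary={boundary}"},
    )
    with request.urlopen(req) as resp:
        return json.load(resp)  # reports the stored size


# Against a running cluster:
#   a = assign()
#   upload(a["url"], a["fid"], "hello.txt", b"hello world")
```

Reading the file back is then a plain GET of `http://<volume server>/<fid>`.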
Ideally, they should be started from different machines. We'll use localhost as an example. @@ -318,6 +357,16 @@ Each individual file size is limited to the volume size. All file meta information stored on a volume server is readable from memory without disk access. Each file takes just a 16-byte map entry of <64bit key, 32bit offset, 32bit size>. Of course, each map entry has its own space cost for the map. But usually the disk space runs out before the memory does. +### Tiered Storage to the cloud ### + +The local volume servers are much faster, while cloud storage has elastic capacity and is actually more cost-efficient if not accessed often (usually free to upload, but relatively costly to access). With the append-only structure and O(1) access time, SeaweedFS can take advantage of both local and cloud storage by offloading the warm data to the cloud. + +Usually hot data are fresh and warm data are old. SeaweedFS puts the newly created volumes on local servers, and optionally uploads the older volumes to the cloud. If the older data are accessed less often, this literally gives you unlimited capacity with limited local servers, while new data stays fast. + +With the O(1) access time, the network latency cost is kept to a minimum. + +If the hot/warm data is split as 20/80, with 20 servers, you can achieve the storage capacity of 100 servers. That's a cost saving of 80%! Or you can repurpose the 80 servers to store new data also, and get 5X storage throughput. + [Back to TOC](#table-of-contents) ## Compared to Other File Systems ## Most other distributed file systems seem more complicated than necessary. SeaweedFS is meant to be fast and simple, in both setup and operation. If you do not understand how it works when you reach here, we've failed! Please raise an issue with any questions or update this file with clarifications. +SeaweedFS is constantly moving forward. So are other systems. These comparisons can be outdated quickly.
Please help to keep them updated. + [Back to TOC](#table-of-contents) ### Compared to HDFS ### @@ -344,15 +395,17 @@ The architectures are mostly the same. SeaweedFS aims to store and read files fa * SeaweedFS optimizes for small files, ensuring O(1) disk seek operation, and can also handle large files. * SeaweedFS statically assigns a volume id for a file. Locating file content becomes just a lookup of the volume id, which can be easily cached. -* SeaweedFS Filer metadata store can be any well-known and proven data stores, e.g., Cassandra, Redis, Etcd, MySql, Postgres, MemSql, TiDB, CockroachDB, etc, and is easy to customized. +* SeaweedFS Filer metadata store can be any well-known and proven data stores, e.g., Redis, Cassandra, HBase, Mongodb, Elastic Search, MySql, Postgres, MemSql, TiDB, CockroachDB, Etcd, etc., and is easy to customize. * SeaweedFS Volume server also communicates directly with clients via HTTP, supporting range queries, direct uploads, etc. -| System | File Meta | File Content Read| POSIX | REST API | Optimized for small files | +| System | File Metadata | File Content Read| POSIX | REST API | Optimized for large numbers of small files | | ------------- | ------------------------------- | ---------------- | ------ | -------- | ------------------------- | | SeaweedFS | lookup volume id, cacheable | O(1) disk seek | | Yes | Yes | | SeaweedFS Filer| Linearly Scalable, Customizable | O(1) disk seek | FUSE | Yes | Yes | | GlusterFS | hashing | | FUSE, NFS | | | | Ceph | hashing + rules | | FUSE | Yes | | +| MooseFS | in memory | | FUSE | | No | +| MinIO | separate meta file for each file | | | Yes | No | [Back to TOC](#table-of-contents) @@ -364,6 +417,14 @@ GlusterFS hashes the path and filename into ids, and assigned to virtual volumes [Back to TOC](#table-of-contents) +### Compared to MooseFS ### + +MooseFS chooses to neglect the small file issue.
From the MooseFS 3.0 manual, "even a small file will occupy 64KiB plus additionally 4KiB of checksums and 1KiB for the header", because it "was initially designed for keeping large amounts (like several thousands) of very big files" + +MooseFS Master Server keeps all metadata in memory. Same issue as HDFS namenode. + +[Back to TOC](#table-of-contents) + ### Compared to Ceph ### Ceph can be set up similarly to SeaweedFS as a key->blob store. It is much more complicated, with the need to support layers on top of it. [Here is a more detailed comparison](https://github.com/chrislusf/seaweedfs/issues/120) @@ -372,11 +433,11 @@ SeaweedFS has a centralized master group to look up free volumes, while Ceph use Same as SeaweedFS, Ceph is also based on the object store RADOS. Ceph is rather complicated with mixed reviews. -Ceph uses CRUSH hashing to automatically manage the data placement. SeaweedFS places data by assigned volumes. +Ceph uses CRUSH hashing to automatically manage the data placement, which makes locating data efficient. But the data has to be placed according to the CRUSH algorithm, and any wrong configuration can cause data loss. SeaweedFS places data by assigning it to any writable volume. If writes to one volume fail, just pick another volume to write to. Adding more volumes is also as simple as it can be. SeaweedFS is optimized for small files. Small files are stored as one continuous block of content, with at most 8 unused bytes between files. Small file access is O(1) disk read. -SeaweedFS Filer uses off-the-shelf stores, such as MySql, Postgres, Redis, Etcd, Cassandra, MemSql, TiDB, CockroachCB, to manage file directories. There are proven, scalable, and easier to manage. +SeaweedFS Filer uses off-the-shelf stores, such as MySql, Postgres, Mongodb, Redis, Elastic Search, Cassandra, HBase, MemSql, TiDB, CockroachDB, Etcd, to manage file directories. These stores are proven, scalable, and easier to manage.
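Two numbers above can be sanity-checked: the 16-byte in-memory map entry (`<64bit key, 32bit offset, 32bit size>`, from the Architecture section) and the at-most-8-byte gap between small files, which is consistent with padding each stored entry to an 8-byte boundary. A small Python sketch — the 8-byte alignment constant and the offset-unit interpretation are assumptions inferred from the text, not read from the SeaweedFS source:

```python
import struct

# In-memory index entry per file: <64-bit key, 32-bit offset, 32-bit size>.
ENTRY_FORMAT = "<QLL"  # standard sizes, no padding between fields
print(struct.calcsize(ENTRY_FORMAT))  # 16 bytes per file

PADDING = 8  # assumed alignment, matching "at most 8 unused bytes"


def padded_size(n):
    """Round a stored entry of n bytes up to the next 8-byte boundary."""
    return -(-n // PADDING) * PADDING


print(padded_size(13))  # 16, i.e. 3 unused bytes after this entry

# If 32-bit offsets are counted in 8-byte units, one volume can span
# 2**32 * 8 bytes:
print((2**32 * PADDING) // 2**30, "GiB")
```

The alignment is what lets a compact 32-bit offset field still address a multi-gigabyte volume file.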
| SeaweedFS | comparable to Ceph | advantage | | ------------- | ------------- | ---------------- | @@ -386,18 +447,30 @@ SeaweedFS Filer uses off-the-shelf stores, such as MySql, Postgres, Redis, Etcd, [Back to TOC](#table-of-contents) -## Dev Plan ## +### Compared to MinIO ### -More tools and documentation, on how to maintain and scale the system. For example, how to move volumes, automatically balancing data, how to grow volumes, how to check system status, etc. -Other key features include: Erasure Encoding, JWT security. +MinIO follows AWS S3 closely and is ideal for testing the S3 API. It has good UI, policies, versioning, etc. SeaweedFS is trying to catch up here. It is also possible to put MinIO as a gateway in front of SeaweedFS later. -This is a super exciting project! And we need helpers and [support](https://www.patreon.com/seaweedfs)! +MinIO metadata are in simple files. Each file write will incur extra writes to the corresponding meta file. -BTW, We suggest run the code style check script `util/gostd` before you push your branch to remote, it will make SeaweedFS easy to review, maintain and develop: +MinIO does not have optimization for lots of small files. The files are simply stored as-is on local disks. +Together with the extra meta file and erasure-coding shards, this only amplifies the LOSF problem. -``` -$ ./util/gostd -``` +MinIO needs multiple disk IOs to read one file. SeaweedFS has O(1) disk reads, even for erasure coded files. + +MinIO has full-time erasure coding. SeaweedFS uses replication on hot data for faster speed and optionally applies erasure coding on warm data. + +MinIO does not have POSIX-like API support. + +MinIO has specific requirements on storage layout. It is not flexible to adjust capacity. In SeaweedFS, just start one volume server pointing to the master. That's all. + +## Dev Plan ## + +* More tools and documentation, on how to manage and scale the system. +* Read and write stream data. +* Support structured data.
+ +This is a super exciting project! And we need helpers and [support](https://www.patreon.com/seaweedfs)! [Back to TOC](#table-of-contents) @@ -412,24 +485,18 @@ https://golang.org/doc/install make sure you set up your $GOPATH -Step 2: also you may need to install Mercurial by following the instructions at: - -http://mercurial.selenic.com/downloads - +Step 2: checkout this repo: +```bash +git clone https://github.com/chrislusf/seaweedfs.git +``` Step 3: download, compile, and install the project by executing the following command ```bash -go get github.com/chrislusf/seaweedfs/weed +make install ``` Once this is done, you will find the executable "weed" in your `$GOPATH/bin` directory -Step 4: after you modify your code locally, you could start a local build by calling `go install` under - -``` -$GOPATH/src/github.com/chrislusf/seaweedfs/weed -``` - [Back to TOC](#table-of-contents) ## Disk Related Topics ## @@ -451,50 +518,49 @@ My Own Unscientific Single Machine Results on Mac Book with Solid State Disk, CP Write 1 million 1KB file: ``` Concurrency Level: 16 -Time taken for tests: 88.796 seconds +Time taken for tests: 66.753 seconds Complete requests: 1048576 Failed requests: 0 -Total transferred: 1106764659 bytes -Requests per second: 11808.87 [#/sec] -Transfer rate: 12172.05 [Kbytes/sec] +Total transferred: 1106789009 bytes +Requests per second: 15708.23 [#/sec] +Transfer rate: 16191.69 [Kbytes/sec] Connection Times (ms) min avg max std -Total: 0.2 1.3 44.8 0.9 +Total: 0.3 1.0 84.3 0.9 Percentage of the requests served within a certain time (ms) - 50% 1.1 ms - 66% 1.3 ms - 75% 1.5 ms - 80% 1.7 ms - 90% 2.1 ms - 95% 2.6 ms - 98% 3.7 ms - 99% 4.6 ms - 100% 44.8 ms + 50% 0.8 ms + 66% 1.0 ms + 75% 1.1 ms + 80% 1.2 ms + 90% 1.4 ms + 95% 1.7 ms + 98% 2.1 ms + 99% 2.6 ms + 100% 84.3 ms ``` Randomly read 1 million files: ``` Concurrency Level: 16 -Time taken for tests: 34.263 seconds +Time taken for tests: 22.301 seconds Complete requests: 1048576 Failed requests: 0 
-Total transferred: 1106762945 bytes -Requests per second: 30603.34 [#/sec] -Transfer rate: 31544.49 [Kbytes/sec] +Total transferred: 1106812873 bytes +Requests per second: 47019.38 [#/sec] +Transfer rate: 48467.57 [Kbytes/sec] Connection Times (ms) min avg max std -Total: 0.0 0.5 20.7 0.7 +Total: 0.0 0.3 54.1 0.2 Percentage of the requests served within a certain time (ms) - 50% 0.4 ms - 75% 0.5 ms - 95% 0.6 ms - 98% 0.8 ms - 99% 1.2 ms - 100% 20.7 ms + 50% 0.3 ms + 90% 0.4 ms + 98% 0.6 ms + 99% 0.7 ms + 100% 54.1 ms ``` [Back to TOC](#table-of-contents) @@ -513,6 +579,8 @@ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. +The text of this page is available for modification and reuse under the terms of the Creative Commons Attribution-Sharealike 3.0 Unported License and the GNU Free Documentation License (unversioned, with no invariant sections, front-cover texts, or back-cover texts). + [Back to TOC](#table-of-contents) ## Stargazers over time ## |
