TiKV X JuiceFS On The Go

Hello everyone! This is Sandy from the JuiceFS team. We are developing a new metadata engine for JuiceFS based on TiKV. Here is some background information.

JuiceFS is a distributed POSIX file system specially optimized for cloud-native environments. In JuiceFS, data is persisted in object storage (e.g. Amazon S3), while metadata can be stored in various databases such as Redis, MySQL or TiDB. JuiceFS is used in many scenarios such as big data analytics, machine learning, and shared storage. Below is an overview of the system architecture:

The metadata engine is a critical component of a distributed file system. JuiceFS chose Redis as its first engine, considering its good performance and wide adoption in many organizations. However, Redis is not suitable for scenarios that require high reliability or store billions of files. Thus, a SQL interface (supporting SQLite, MySQL, TiDB, PostgreSQL, etc.) was developed as the second metadata engine. Although TiDB is a great choice for users who value reliability and scalability, we believe TiKV can offer the same advantages with a simpler architecture and higher performance.

Currently the new metadata engine is under development; more information can be found here. Contributions and discussion are welcome: you may leave comments under this thread or contact us directly.

5 Likes

The first PR has been merged! After some basic tests and benchmarks, we saw the POWER of TiKV. Here are the results:

  • The table shows the time cost in microseconds (µs) for each operation; smaller is better
  • The number in parentheses is the ratio to the Redis-always cost
  • Redis appendfsync configuration:
    • Always: fsync after each commit
    • Everysec: fsync every second

Note: Redis & MySQL have only 1 replica of data (local storage) while TiKV has 3 replicas (raft group)

Operation     Redis-always  Redis-everysec  MySQL        TiKV
mkdir         968           704 (0.7)       2368 (2.4)   2174 (2.2)
mvdir         1067          912 (0.9)       3708 (3.5)   2315 (2.2)
rmdir         976           783 (0.8)       2965 (3.0)   2469 (2.5)
readdir_10    370           353 (1.0)       1322 (3.6)   1087 (2.9)
readdir_1k    1832          1818 (1.0)      15295 (8.3)  6688 (3.7)
mknod         978           685 (0.7)       2307 (2.4)   2187 (2.2)
create        919           681 (0.7)       2333 (2.5)   2118 (2.3)
rename        1030          887 (0.9)       3722 (3.6)   2328 (2.3)
unlink        933           701 (0.8)       3370 (3.6)   2354 (2.5)
lookup        137           115 (0.8)       407 (3.0)    634 (4.6)
getattr       121           110 (0.9)       371 (3.1)    322 (2.7)
setattr       606           440 (0.7)       1282 (2.1)   1883 (3.1)
access        124           112 (0.9)       363 (2.9)    317 (2.6)
setxattr      238           113 (0.5)       1185 (5.0)   1659 (7.0)
getxattr      109           109 (1.0)       340 (3.1)    314 (2.9)
removexattr   250           118 (0.5)       868 (3.5)    2007 (8.0)
listxattr_1   116           105 (0.9)       349 (3.0)    316 (2.7)
listxattr_10  117           115 (1.0)       404 (3.5)    334 (2.9)
link          712           569 (0.8)       2713 (3.8)   2117 (3.0)
symlink       978           682 (0.7)       2646 (2.7)   2141 (2.2)
newchunk      238           107 (0.4)       1 (0.0)      1 (0.0)
write         822           568 (0.7)       3256 (4.0)   2335 (2.8)
read_1        0             0 (0.0)         0 (0.0)      0 (0.0)
read_10       0             0 (0.0)         0 (0.0)      0 (0.0)
3 Likes

Looks great! I'm curious about the latency of each operation. Do you have any benchmark results for that?

You mean the latency of TiKV operations? No, I don't have that. Is there any way to get TiKV's internal statistics?

Maybe this could be measured on the application side, e.g. how long each readdir op takes when multiple operations are running concurrently.

Well, we don't have those details for now. The results shown above were obtained with a Go benchmark test, and only average latencies are recorded.
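
For anyone who wants per-operation distributions rather than only averages, below is a minimal Go benchmark sketch of measuring concurrent readdir latency on the application side. The readdirOnce function is a hypothetical placeholder for a single readdir call against the metadata engine, not part of the real JuiceFS code, and the reported metrics are just an illustration.

```go
package metabench

import (
	"sort"
	"sync"
	"testing"
	"time"
)

// readdirOnce is a hypothetical placeholder for one readdir call against
// the metadata engine under test; plug in the real client call here.
func readdirOnce() error { return nil }

// BenchmarkReaddirConcurrent issues readdir from multiple goroutines and
// reports the average and p99 latency instead of only the overall average.
func BenchmarkReaddirConcurrent(b *testing.B) {
	var (
		mu        sync.Mutex
		latencies []time.Duration
	)
	b.RunParallel(func(pb *testing.PB) {
		local := make([]time.Duration, 0, 1024)
		for pb.Next() {
			start := time.Now()
			if err := readdirOnce(); err != nil {
				b.Error(err)
				continue
			}
			local = append(local, time.Since(start))
		}
		mu.Lock()
		latencies = append(latencies, local...)
		mu.Unlock()
	})
	if len(latencies) == 0 {
		return
	}
	sort.Slice(latencies, func(i, j int) bool { return latencies[i] < latencies[j] })
	var sum time.Duration
	for _, d := range latencies {
		sum += d
	}
	b.ReportMetric(float64(sum.Microseconds())/float64(len(latencies)), "avg-us/op")
	b.ReportMetric(float64(latencies[len(latencies)*99/100].Microseconds()), "p99-us")
}
```

Running it with `go test -bench . -cpu 4,8,16` varies the number of concurrent workers, since RunParallel starts GOMAXPROCS goroutines by default.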

1 Like

TiKV as a metadata engine for JuiceFS is now fully supported. It passes all tests in pjdfstest and provides slightly better performance than MySQL. The latest benchmark results will be recorded in this doc.

2 Likes

TiKV keeps improving its performance. 5.1.1 separates read/write ready, which brings some improvement. 5.3 (about 2 months later) will introduce async raft, which brings a more significant improvement (especially on ordinary disks).

2 Likes

Sounds great! We’ll keep an eye on new features.

Did you meet any problems when integrating TiKV into JuiceFS? For example, performance tuning, unexpected behavior, etc.

For now everything is smooth :+1:
We haven't done much performance tuning yet. Currently 1PC & AsyncCommit are enabled before committing (check here), while all other configurations are left at their defaults.
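
For readers unfamiliar with these options, here is a minimal sketch of how they can be turned on with the TiKV Go client (github.com/tikv/client-go/v2); the PD address is a placeholder and error handling is simplified, so treat it as an illustration rather than the exact JuiceFS code.

```go
package main

import (
	"context"

	"github.com/tikv/client-go/v2/txnkv"
)

func main() {
	// Placeholder PD endpoint; point this at the PD nodes of your cluster.
	client, err := txnkv.NewClient([]string{"127.0.0.1:2379"})
	if err != nil {
		panic(err)
	}
	defer client.Close()

	txn, err := client.Begin()
	if err != nil {
		panic(err)
	}
	// Opt in to the one-phase-commit and async-commit optimizations
	// before the transaction is committed.
	txn.SetEnable1PC(true)
	txn.SetEnableAsyncCommit(true)

	if err := txn.Set([]byte("demo-key"), []byte("demo-value")); err != nil {
		panic(err)
	}
	if err := txn.Commit(context.Background()); err != nil {
		panic(err)
	}
}
```

Both options only take effect on TiKV versions that support them (around 5.0 and later).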

1 Like

Great work! I'm curious why you chose a key-value database as the metadata store engine. In traditional file systems, a directory tree is more popular for metadata management. Have you compared these two approaches? Key-value semantics do not seem a natural fit for describing file system directories and files, although a key-value database may perform well in bandwidth and latency while causing some other problems.

Do you have any more details about the comparison of these two solutions?

The file system IS organized as a directory tree, in which every node is managed by several key-value entries. For example, a regular file may have (see the rough sketch below):

  • a dentry: {parent inode, file name} --> {file inode, file type}
  • an inode info: {file inode} --> {encoded file attributes}
  • several chunks: {file inode, chunk ID} --> {encoded chunk infos}

You can find more information here: https://github.com/juicedata/juicefs/blob/main/pkg/meta/tkv.go#L176-L199
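
To make the mapping above more concrete, here is a rough Go sketch of how such keys could be composed. The prefixes and field layout are simplified assumptions for illustration; the real encoding JuiceFS uses is in the tkv.go code linked above.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// Ino is a file system inode number.
type Ino uint64

// dentryKey maps {parent inode, file name} to the child entry.
// The 'D'/'I'/'C' prefixes below are illustrative, not JuiceFS's real encoding.
func dentryKey(parent Ino, name string) []byte {
	k := append([]byte{'D'}, u64(parent)...)
	return append(k, name...)
}

// inodeKey maps {file inode} to the encoded file attributes.
func inodeKey(ino Ino) []byte {
	return append([]byte{'I'}, u64(ino)...)
}

// chunkKey maps {file inode, chunk index} to the encoded chunk info.
// Big-endian encoding keeps a file's chunks sorted by index, so they can
// be read back with a single range scan.
func chunkKey(ino Ino, index uint32) []byte {
	k := append([]byte{'C'}, u64(ino)...)
	return binary.BigEndian.AppendUint32(k, index)
}

func u64(v Ino) []byte {
	b := make([]byte, 8)
	binary.BigEndian.PutUint64(b, uint64(v))
	return b
}

func main() {
	fmt.Printf("dentry key: %q\n", dentryKey(1, "hello.txt"))
	fmt.Printf("inode key:  %q\n", inodeKey(42))
	fmt.Printf("chunk key:  %q\n", chunkKey(42, 0))
}
```

In such a layout, all entries of one directory share a key prefix, so a readdir can be served by a single prefix scan.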