+++
title = "Known issues"
weight = 80
+++

Issues in each section are roughly sorted by order of decreasing impact, based on actual reports from users.

## Architectural limitations

Issues that are caused by design decisions of Garage internals, and that can't be fixed without major architectural changes in the codebase.

### Metadata performance issues with many objects

**Related issues:**

- [#851 - Performances collapse with 10 millions pictures in a bucket](https://git.deuxfleurs.fr/Deuxfleurs/garage/issues/851)
- [#1222 - Cluster Setup Write Performance Degraded After Writing 10 Million Object (200-300Kb per object)](https://git.deuxfleurs.fr/Deuxfleurs/garage/issues/1222)

### Very big objects cause performance degradation

For each object, there is a single metadata entry called a `Version` that contains a list of all of the data blocks in the object. For very big objects, this entry can contain thousands of block references. During an upload, this metadata entry needs to be read, deserialized, reserialized and written for each individual data block uploaded. This means that the complexity of an upload is `O(n²)` in the number of blocks needed. This manifests as excessive metadata I/O and CPU usage, with uploads eventually stalling.

**Mitigation:** Increase the `block_size` configuration parameter to reduce the number of blocks. Make sure multipart uploads use chunks that are at least `block_size` in size, and that are an exact multiple of `block_size` to avoid the creation of smaller blocks.

**Long-term solution:** An architectural change in the metadata system would be required to store block lists in many independent metadata entries instead of one single big entry per object.

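To make the quadratic growth concrete, here is a rough back-of-the-envelope sketch in Python. It assumes a simplified cost model in which writing block number `i` touches a `Version` entry holding `i` block references. The function names, the 100 GiB object size and the block sizes are illustrative assumptions, not values taken from Garage.

```python
MIB = 1024 ** 2
GIB = 1024 ** 3


def block_count(object_size: int, block_size: int) -> int:
    """Number of data blocks needed for an object (integer ceiling division)."""
    return -(-object_size // block_size)


def references_processed(n_blocks: int) -> int:
    """Total block references handled over a whole upload.

    Assumed model: when block i is written, the Version entry already
    lists i references and is rewritten in full, so the total is
    1 + 2 + ... + n = n * (n + 1) / 2.
    """
    return n_blocks * (n_blocks + 1) // 2


if __name__ == "__main__":
    object_size = 100 * GIB  # hypothetical object size
    for block_size in (1 * MIB, 10 * MIB, 100 * MIB):
        n = block_count(object_size, block_size)
        print(f"block_size={block_size // MIB} MiB "
              f"blocks={n} references_processed={references_processed(n)}")
```

Under this model, a tenfold increase of `block_size` reduces the total number of block references processed by roughly a factor of one hundred, which is why the mitigation above is effective.
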

**Related issues:**

- [#662 - Large Files fail to upload](https://git.deuxfleurs.fr/Deuxfleurs/garage/issues/662)
- [#1366 - High CPU usage and performance degradation during long multipart uploads](https://git.deuxfleurs.fr/Deuxfleurs/garage/issues/1366)

### No conditional writes / locking / WORM support (`if-none-match`, ...)

This is structurally impossible to implement in Garage due to the lack of a consensus algorithm, one of Garage's core design choices, which we cannot reconsider.

A semi-working, *unsafe* implementation of WORM and object locking could be implemented, with the following constraint: only after the completion of the first write (in the case of WORM) or the setting of a lock (for object lock) can we guarantee that the object cannot be overwritten. If an overwrite request arrives at the same time as the initial request to write or to lock the object, we cannot implement a safe and consistent way to reject it. This means that many practical use-cases for `if-none-match` cannot be supported (e.g. using it to implement mutual exclusion between concurrent writers).

**Related issues:**

- [#1052 - Support conditional writes](https://git.deuxfleurs.fr/Deuxfleurs/garage/issues/1052)
- [#1127 - Feature Request: WORM (Write Once Read Many) / Object Lock Support](https://git.deuxfleurs.fr/Deuxfleurs/garage/issues/1127)

### `CreateBucket` race condition

Also due to the lack of a consensus algorithm, there is no mutual exclusion between concurrent `CreateBucket` requests using the same bucket name.

**Related issues:**

- [#649 - Race condition in CreateBucket](https://git.deuxfleurs.fr/Deuxfleurs/garage/issues/649)

### Metadata and data have the same replication factor

There is a single `replication_factor` in the configuration file that applies both to data blocks and metadata entries.
This makes clusters with `replication_factor = 1` particularly vulnerable in cases of metadata corruption (see below), as there is a single copy of the metadata for each object even in multi-node clusters.

**Mitigation:** Do not use `replication_factor = 1`.

**Long-term solution:** We want to allow scenarios such as replicating the metadata on 2, 3 or more nodes and the data on only 1 or 2 nodes (for example), so that the metadata can benefit from better redundancy without increasing the storage costs for the entire dataset. This will require some important changes in the codebase.

**Related issues:**

- [#720 - Separate replication modes for metadata/data](https://git.deuxfleurs.fr/Deuxfleurs/garage/issues/720)

### Node count limitation

Garage has issues in clusters with too many nodes: it is not able to spread data uniformly among nodes, and some nodes will fill up faster than others. This starts to manifest when the number of nodes is greater than `10 × replication_factor`. This is due to the fact that Garage uses only 256 partitions internally.

**Mitigation:** Build clusters with fewer, bigger nodes.

**Potential solution:** This can be fixed by increasing the number of partitions in Garage. The code paths exist, and there is [a `const` somewhere](https://git.deuxfleurs.fr/Deuxfleurs/garage/src/commit/6fd9bba0cb55062cb1725ab961b7fa8acb9dcc61/src/rpc/layout/mod.rs#L35) that theoretically allows increasing the number of partitions up to `2^16`, but this has not been tested so there might be bugs.

### Buckets are not sharded

For each bucket, the first metadata layer that contains an index of all objects is not sharded. This index, which includes the names and all metadata (size, headers, ...) for each object, is stored on `$replication_factor` nodes. For instance with `replication_factor = 3`, a given bucket will use only 3 specific nodes (chosen at random when the bucket is created) to store this index.
In multi-zone deployments, these nodes will be spread across different zones. Each bucket uses a different set of 3 random nodes for its index.

As a consequence, very large buckets might cause uneven load distribution within a cluster. If all of the requests on a cluster are for objects in a single bucket, then the `$replication_factor` nodes that store the index will become a hotspot in the cluster, with more intensive metadata access patterns. There is no way of choosing which nodes will have this role. Currently, we have no report of this being an issue in practice.

**Mitigation:** This impacts in particular clusters that are used for a single purpose with a single bucket. This can be solved by dividing your dataset among many buckets, using a client-side sharding strategy that you will have to design. Use at least as many buckets as you have nodes in your cluster.

## Bugs

Known bugs that are complex to diagnose and fix, and therefore have not been fixed yet.

### LMDB metadata corruption

Many users have reported situations where the LMDB metadata db becomes corrupted, sometimes after a forced shutdown of Garage or in case of power loss. A corrupted database file is generally not recoverable.

**Mitigation:** Use a `replication_factor` of at least 2. Configure automatic snapshotting using `metadata_auto_snapshot_interval` so that in case of corruption you can roll back to a working database.

Note that taking filesystem-level snapshots of your `metadata_dir`, although it is much faster and less I/O intensive than Garage's built-in snapshotting, does not ensure that the snapshot will be consistent. If the snapshot is taken during a metadata write, the snapshot itself might be corrupted and thus not usable as a rollback point. Therefore, prefer using `metadata_auto_snapshot_interval` in all cases.

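As a minimal sketch, the mitigation above could look like the following fragment of the Garage configuration file. The values shown (`3` and `"6h"`) are illustrative assumptions rather than recommendations from this page; choose an interval that matches how much recent metadata you can afford to lose when rolling back.

```toml
# Keep more than one copy of metadata (and data), so that a single
# corrupted database does not mean losing the only copy.
replication_factor = 3

# Let Garage take consistent snapshots of the metadata database
# at a regular interval (example value, adjust to your needs).
metadata_auto_snapshot_interval = "6h"
```
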

### Layout updates might require manual intervention

In case of disconnected nodes, when changing the cluster layout to remove these nodes and add other nodes instead, Garage might not be able to properly evict the old nodes from the system. This is a built-in security measure to avoid any inconsistent cluster states.

This manifests as several cluster layout versions staying active even after a full resync. You can diagnose this situation with `garage layout history`, which will give you instructions to fix it.

### Tag assignment

In the `garage layout assign` command, the `-t` argument has to be repeated once per tag to set multiple tags on a node. Writing multiple tags separated by commas will result in a single tag containing the whole string.

## General footguns

Choices made by the developers that users must be aware of if they don't want to run into potential issues.

### Resync tranquility is conservative by default

By default, the worker parameters `resync-tranquility` and `resync-worker-count` are set to very conservative values, to avoid overloading nodes with I/O when data needs to be resynchronized between nodes. This can cause issues where the resync queue grows faster than it can be cleared, which in turn causes performance issues in the rest of Garage. This situation is indicated by a big resync queue with few resync errors (the queue is not caused by a disconnected/malfunctioning node).

To fix it, increase the number of resync workers and reduce the resync tranquility. For instance, if you want to resync as fast as possible:

```
garage worker set -a resync-worker-count 8
garage worker set -a resync-tranquility 0
```