+++
title = "Known issues"
weight = 80
+++

Issues in each section are roughly sorted in order of decreasing impact, based on actual reports from users.

## Architectural limitations

Issues that are caused by design decisions of Garage internals, and that can't
be fixed without major architectural changes in the codebase.

### Metadata performance issues with many objects

Metadata operations can become very slow in buckets that contain several
million objects.

**Related issues:**

- [#851 - Performances collapse with 10 millions pictures in a bucket](https://git.deuxfleurs.fr/Deuxfleurs/garage/issues/851)
- [#1222 - Cluster Setup Write Performance Degraded After Writing 10 Million Object (200-300Kb per object)](https://git.deuxfleurs.fr/Deuxfleurs/garage/issues/1222)

### Very big objects cause performance degradation

For each object, there is a single metadata entry called a `Version` that
contains a list of all of the data blocks in the object. For very big objects,
this entry can contain thousands of block references. During the upload of an
object, this metadata entry needs to be read, deserialized, reserialized and
written for each individual data block uploaded. This means that the
complexity of an upload is `O(n²)` in the number of blocks needed. For
instance, a 50 GiB object stored with the default 1 MiB block size consists of
51200 blocks, so its `Version` entry is rewritten 51200 times, writing about
1.3 billion block references in total.

This manifests as excessive metadata I/O and CPU usage, and as uploads that
eventually stall.

**Mitigation:** Increase the `block_size` configuration parameter to reduce the
number of blocks. Make sure multipart uploads use parts that are at least
`block_size` in size and an exact multiple of `block_size`, to avoid the
creation of smaller blocks.
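
For illustration, a sketch of this mitigation (the 10 MiB block size and
100 MiB part size are arbitrary examples, and the AWS CLI is just one client
in which the multipart part size can be configured):

```
# garage.toml: store data in 10 MiB blocks instead of the default 1 MiB
block_size = 10485760
```

```
# AWS CLI: upload 100 MiB parts, i.e. exactly 10 blocks of 10 MiB per part
aws configure set default.s3.multipart_chunksize 100MB
```
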

**Long-term solution:** An architectural change in the metadata system would be
required to store block lists in many independent metadata entries instead of
one single big entry per object.

**Related issues:**

- [#662 - Large Files fail to upload](https://git.deuxfleurs.fr/Deuxfleurs/garage/issues/662)
- [#1366 - High CPU usage and performance degradation during long multipart uploads](https://git.deuxfleurs.fr/Deuxfleurs/garage/issues/1366)

### No conditional writes / locking / WORM support (`if-none-match`, ...)

This is structurally impossible to implement in Garage due to the lack of a consensus algorithm,
which is one of Garage's core design choices, and one that we cannot reconsider.

A semi-working, *unsafe* implementation of WORM and object locking could be
built, with the following constraint: only after the completion of the first
write (in the case of WORM) or the setting of a lock (for object locking) can
we guarantee that the object cannot be overwritten. In cases where an
overwrite request arrives at the same time as the initial request to write or
to lock the object, we cannot implement a safe and consistent way to reject
it. This means that many practical use cases for `if-none-match` cannot be
supported (e.g. using it to implement mutual exclusion between concurrent
writers).
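
For illustration, this is the kind of conditional write that cannot be made
safe in Garage (the flag is from recent versions of the AWS CLI; bucket and
key names are made up):

```
# "Create this object only if it does not exist yet."
# Garage cannot guarantee that only one of two concurrent such requests
# succeeds, which is exactly what this header is meant to provide.
aws s3api put-object --bucket my-bucket --key lock/leader \
    --if-none-match '*' --body payload.bin
```
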

**Related issues:**

- [#1052 - Support conditional writes](https://git.deuxfleurs.fr/Deuxfleurs/garage/issues/1052)
- [#1127 - Feature Request: WORM (Write Once Read Many) / Object Lock Support](https://git.deuxfleurs.fr/Deuxfleurs/garage/issues/1127)

### `CreateBucket` race condition

Also due to the lack of a consensus algorithm, there is no mutual exclusion
between concurrent `CreateBucket` requests using the same bucket name.

**Related issues:**

- [#649 - Race condition in CreateBucket](https://git.deuxfleurs.fr/Deuxfleurs/garage/issues/649)

### Metadata and data have the same replication factor

There is a single `replication_factor` in the configuration file that applies both to data blocks and metadata entries.
This makes clusters with `replication_factor = 1` particularly vulnerable in cases of metadata corruption (see below), as there
is a single copy of the metadata for each object even in multi-node clusters.
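
A minimal sketch of the configuration in question (the value is illustrative):

```
# garage.toml: this single parameter sets the number of copies of BOTH
# data blocks and metadata entries
replication_factor = 3
```
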

**Mitigation:** Do not use `replication_factor = 1`.

**Long-term solution:** We want to allow scenarios such as replicating the
metadata on 2, 3 or more nodes and the data on only 1 or 2 nodes (for example),
so that the metadata can benefit from better redundancy without increasing the
storage costs for the entire dataset. This will require significant changes
in the codebase.

**Related issues:**

- [#720 - Separate replication modes for metadata/data](https://git.deuxfleurs.fr/Deuxfleurs/garage/issues/720)

### Node count limitation

Garage will have issues in clusters with too many nodes: it will not be able
to spread data uniformly among nodes, and some nodes will fill up faster than
others. This starts to manifest when the number of nodes is bigger than `10 ×
replication_factor` (for instance, around 30 nodes with `replication_factor =
3`). This is due to the fact that Garage uses only 256 partitions internally.

**Mitigation:** Build clusters with fewer, bigger nodes.

**Potential solution:** This can be fixed by increasing the number of
partitions in Garage. The code paths exist, and there is [a `const`
somewhere](https://git.deuxfleurs.fr/Deuxfleurs/garage/src/commit/6fd9bba0cb55062cb1725ab961b7fa8acb9dcc61/src/rpc/layout/mod.rs#L35)
that theoretically allows increasing the number of partitions up to `2^16`,
but this has not been tested, so there might be bugs.

### Buckets are not sharded

For each bucket, the first metadata layer, which contains an index of all
objects in the bucket, is not sharded. This index, which includes the names
and all metadata (size, headers, ...) for each object, is stored on
`$replication_factor` nodes.

For instance, with `replication_factor = 3`, a given bucket will use only 3
specific nodes (chosen at random when the bucket is created) to store this
index. In multi-zone deployments, these nodes will be spread across different
zones. Each bucket uses a different set of 3 random nodes for its index.

As a consequence, very large buckets might cause uneven load distribution
within a cluster. If all of the requests on a cluster are for objects in a
single bucket, then the `$replication_factor` nodes that store the index will
become a hotspot in the cluster, with more intensive metadata access patterns.
There is no way of choosing which nodes will have this role.

Currently, we have no report of this being an issue in practice.

**Mitigation:** This impacts in particular clusters that are used for a single
purpose with a single bucket. This can be solved by dividing your dataset among
many buckets, using a client-side sharding strategy that you will have to
design (a minimal sketch is given below). Use at least as many buckets as you
have nodes in your cluster.
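
As an example of such a strategy, here is a minimal sketch in Python of
deterministic key-based sharding across a fixed set of buckets (the bucket
naming scheme and shard count are hypothetical, and changing the shard count
later would relocate most keys):

```
import hashlib

N_SHARDS = 16  # hypothetical: use at least as many buckets as you have nodes

def bucket_for_key(key: str) -> str:
    """Deterministically map an object key to one of N_SHARDS buckets."""
    # Use a stable hash (not Python's randomized hash()) so that all
    # clients agree on the mapping across processes and restarts.
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    shard = int.from_bytes(digest[:4], "big") % N_SHARDS
    return f"mydata-shard-{shard:02d}"

# All readers and writers compute the bucket from the object key:
print(bucket_for_key("photos/2024/cat.jpg"))
```
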

## Bugs

Known bugs that are complex to diagnose and fix, and that have therefore not
been fixed yet.

### LMDB metadata corruption

Many users have reported situations where the LMDB metadata database becomes
corrupted, sometimes after a forced shutdown of Garage or in case of power
loss. A corrupted database file is generally not recoverable.

**Mitigation:** Use a `replication_factor` of at least 2. Configure automatic
snapshotting using `metadata_auto_snapshot_interval` so that in case of
corruption you can roll back to a working database.
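
A sketch of this mitigation in `garage.toml` (the snapshot interval is an
arbitrary example):

```
replication_factor = 2                  # at least 2 copies of all metadata

# Periodically snapshot the metadata database, to use as a rollback point
metadata_auto_snapshot_interval = "6h"
```
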

Note that taking filesystem-level snapshots of your `metadata_dir`, although it
is much faster and less I/O intensive than Garage's built-in snapshotting, does
not ensure that the snapshot will be consistent. If the snapshot is taken
during a metadata write, the snapshot itself might be corrupted and thus not
usable as a rollback point. Therefore, prefer using
`metadata_auto_snapshot_interval` in all cases.

### Layout updates might require manual intervention

In case of disconnected nodes, when changing the cluster layout to remove these
nodes and add other nodes instead, Garage might not be able to properly evict
the old nodes from the system. This is a built-in safety measure to avoid any
inconsistent cluster states.

This manifests as several cluster layout versions staying active even after a
full resync. You can diagnose this situation with `garage layout history`,
which will give you instructions to fix it.
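
For reference, a typical diagnosis and repair sequence might look as follows
(the `skip-dead-nodes` subcommand and its exact syntax may vary between Garage
versions, and the version number is illustrative; always follow the
instructions printed by `garage layout history`):

```
# Inspect the currently active layout versions and get instructions
garage layout history

# If instructed to do so, declare that dead nodes can be skipped for
# old layout versions
garage layout skip-dead-nodes --version 12
```
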

### Tag assignment

In the `garage layout assign` command, the `-t` argument has to be repeated
to set multiple tags on a node. Writing multiple tags separated by commas
will result in a single tag string that contains commas, as shown in the
example below.
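
For instance (the node ID, zone and capacity are placeholders):

```
# Sets two tags, "ssd" and "fast":
garage layout assign -z dc1 -c 1T -t ssd -t fast 563e
# Sets a single tag, the string "ssd,fast":
garage layout assign -z dc1 -c 1T -t ssd,fast 563e
```
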

## General footguns

Choices made by the developers that users must be aware of if they don't want
to run into potential issues.

### Resync tranquility is conservative by default

By default, the worker parameters `resync-tranquility` and `resync-worker-count` are set to very conservative values, to avoid overloading nodes with I/O when data needs to be resynchronized between nodes.
This can cause issues where the resync queue grows faster than it can be cleared, which in turn causes performance issues in the rest of Garage.

This situation is indicated by a big resync queue with few resync errors (i.e. the queue is not caused by a disconnected or malfunctioning node).
To fix it, increase the number of resync workers and reduce the resync tranquility. For instance, if you want to resync as fast as possible:

```
garage worker set -a resync-worker-count 8
garage worker set -a resync-tranquility 0
```
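
To watch the resync queue itself, one option is the `garage stats` command,
whose output includes block manager statistics such as the resync queue
length and the number of resync errors (the exact fields may vary between
versions):

```
garage stats
```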