## Summary
This PR ensures that the `LifecycleWorker` yields at least once to the Tokio scheduler in between each batch of 100 objects.
## Problem being solved
I'm administrating a Garage cluster which has been experiencing timeouts on all endpoints while the lifecycle worker is running at midnight UTC : `Ping timeout` error messages and even requests eventually failing due to `Could not reach quorum ...`.
I have found that this happens while the lifecycle worker is working on a big bucket (containing millions of objects) with a lifecycle rule that applies to very few objects.
The `process_object()` function does not hit any `await`:
- `last_bucket` is always the same, so the `bucket_table` is not read asynchronously
- no transaction is made on the `object_table` because my lifecycle rule (almost) never applies to any object
The first commit in this PR adds an executable which reproduces the problem that I've been experiencing in a self-contained way : the lifecycle worker starves the Tokio scheduler so much that no other task is able to run (or very rarely).
To run it : `cargo run -p garage_model --bin lifecycle-starvation-test`.
This commit can be dropped post-review, as it's only useful to demonstrate the starvation.
The error messages completely stopped after adding the extra yield to the nodes of my cluster.
The duration of the lifecycle worker task does not appear to have changed at all from what I can see (looking at the timestamps produced either by the self-contained binary or by each of my nodes with the `Lifecycle worker finished` message).
## Note
An other potential fix would have been to force the `WorkerProcessor` to yield before re-enqueuing a busy task, but this would have affected all Garage workers even though it's only the `LifecycleWorker` being uncooperative.
Reviewed-on: https://git.deuxfleurs.fr/Deuxfleurs/garage/pulls/1396
Reviewed-by: Alex <lx@deuxfleurs.fr>
Co-authored-by: Gauthier Zirnhelt <gauthier.zirnhelt@insimo.fr>
Co-committed-by: Gauthier Zirnhelt <gauthier.zirnhelt@insimo.fr>
fix#1355
some write errors are not reported when calling write_all. That's notably the case of ENOSPC on small buffers (1MiB).
on ext4, the error is catched when calling flush(). This is hopefully the case on most local filesystems, though afaik this assumption doesn't hold for NFS
Reviewed-on: https://git.deuxfleurs.fr/Deuxfleurs/garage/pulls/1358
Co-authored-by: trinity-1686a <trinity@deuxfleurs.fr>
Co-committed-by: trinity-1686a <trinity@deuxfleurs.fr>
When sending a website config with an empty list of CORS rules, garage currently incorrectly refuses it with error message "Invalid XML: missing field `CORSRule`".
This fix the issue by following the documentation of quick-xml related to serde field parameters for this specific scenario: https://docs.rs/quick-xml/latest/quick_xml/de/#sequences-xsall-and-xssequence-xml-schema-types .
(I've based this PR on main-v1 because we want it for deuxfleurs' deployment.)
Co-authored-by: Armaël Guéneau <armael.gueneau@ens-lyon.org>
Reviewed-on: https://git.deuxfleurs.fr/Deuxfleurs/garage/pulls/1320
Co-authored-by: Armael <armael@noreply.localhost>
Co-committed-by: Armael <armael@noreply.localhost>