Reduce storage footprint when using GPG encryption #95
Comments
Since this is not a bash script, it's not that easy, but it could still be done without resorting to the intermediate file by chaining the tarWriter and the OpenPgpWriter instead. Maybe it would even be possible to upload the file in a streaming manner for some backends and remove the need for an intermediate file entirely. This is a little complicated though (mostly because the code is written with that mechanism in mind), so it's much more than a one-line change. If anyone wants to pick this up, I am happy to review and merge PRs. Otherwise I might be able to look into this at some point, but I cannot really make any estimates. A side note question: the artifacts do get deleted properly after a backup run, right? So the storage footprint is only an issue while the backup is running, correct? |
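To illustrate the chaining idea from the comment above, here is a minimal sketch in Go that uses only the standard library. It is not the project's code: the function names (`writeArchive`, `readArchive`, `roundTrip`) are hypothetical, and AES-CTR via `cipher.StreamWriter` is only a stand-in for an OpenPGP writer so that the example has no external dependencies. The point it shows is that each writer wraps the next one, so bytes flow tar -> gzip -> encryption -> destination without an intermediate file.

```go
package main

import (
	"archive/tar"
	"bytes"
	"compress/gzip"
	"crypto/aes"
	"crypto/cipher"
	"crypto/rand"
	"fmt"
	"io"
)

// newStream builds the AES-CTR stream used as a stand-in for OpenPGP.
func newStream(key, iv []byte) (cipher.Stream, error) {
	block, err := aes.NewCipher(key)
	if err != nil {
		return nil, err
	}
	return cipher.NewCTR(block, iv), nil
}

// writeArchive chains tar -> gzip -> encryption -> dst (hypothetical helper).
func writeArchive(dst io.Writer, key, iv []byte, files map[string]string) error {
	s, err := newStream(key, iv)
	if err != nil {
		return err
	}
	enc := &cipher.StreamWriter{S: s, W: dst}
	gz := gzip.NewWriter(enc)
	tw := tar.NewWriter(gz)
	for name, body := range files {
		hdr := &tar.Header{Name: name, Mode: 0o600, Size: int64(len(body))}
		if err := tw.WriteHeader(hdr); err != nil {
			return err
		}
		if _, err := io.WriteString(tw, body); err != nil {
			return err
		}
	}
	// Close from the innermost layer outwards so every layer flushes.
	if err := tw.Close(); err != nil {
		return err
	}
	if err := gz.Close(); err != nil {
		return err
	}
	return enc.Close()
}

// readArchive reverses the chain: decryption -> gunzip -> untar.
func readArchive(src io.Reader, key, iv []byte) (map[string]string, error) {
	s, err := newStream(key, iv)
	if err != nil {
		return nil, err
	}
	gz, err := gzip.NewReader(&cipher.StreamReader{S: s, R: src})
	if err != nil {
		return nil, err
	}
	defer gz.Close()
	tr := tar.NewReader(gz)
	out := map[string]string{}
	for {
		hdr, err := tr.Next()
		if err == io.EOF {
			return out, nil
		}
		if err != nil {
			return nil, err
		}
		body, err := io.ReadAll(tr)
		if err != nil {
			return nil, err
		}
		out[hdr.Name] = string(body)
	}
}

// roundTrip archives into an in-memory sink (standing in for the upload
// target) and reads the result back.
func roundTrip(files map[string]string) (map[string]string, error) {
	key, iv := make([]byte, 32), make([]byte, aes.BlockSize)
	if _, err := rand.Read(key); err != nil {
		return nil, err
	}
	if _, err := rand.Read(iv); err != nil {
		return nil, err
	}
	var sink bytes.Buffer
	if err := writeArchive(&sink, key, iv, files); err != nil {
		return nil, err
	}
	return readArchive(&sink, key, iv)
}

func main() {
	got, err := roundTrip(map[string]string{"data/a.txt": "hello"})
	fmt.Println(got, err)
}
```

In a real implementation the `bytes.Buffer` would be replaced by the final `.gpg` file or by a storage backend's upload writer, and the cipher layer by an actual OpenPGP writer. The order of the `Close` calls matters: closing the tar writer first lets the gzip and encryption layers see all remaining bytes before they finalize.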
Thanks for the reply. Chaining the upload would've been my next question 👍 Maybe I can find some time to start developing this feature. |
On a very high level this could work like this (off the top of my head, so it might be wrong in certain details):
In case you decide to start working on this, feel free to ask questions any time, and also don't shy away from improving the pipeline if you find things that are rather odd right now. |
I looked a little into how this could be implemented and found a tricky detail hidden: if a streaming pipeline would be used that possibly does I'll try to think about how the entire script orchestration could be refactored so it can accommodate such requirements. |
Maybe different modes, which the user can select, would make sense here:
Adaptive mode would solve the backpressure issue. For example, with Backblaze B2 I usually have 250 Mbit/s upload (~30 MB/s), and I use zstd as the compression method (orders of magnitude faster on my machines) plus multicore encryption... I'd assume that I/O would usually be faster than the network... but that's just a sample size of one (me). It makes sense to keep the backpressure issue in mind, imo. As an alternative to the suggested modes, only adaptive and sequential could be fine too as |
One thing that just occurred to me is that implementing archive-encrypt-upload in a streaming fashion would be a breaking change, as it would mean the command lifecycle would change, i.e. when streaming, there is no more possibility of running |
True, there would only be a start and a finish hook - no matter if using The
I'd definitely keep the classic Having all "modes" in basically one logic (but with different buffer sizes) would make the code base cleaner and configuration easier, since we wouldn't need two entirely different approaches (sequential vs streaming). The default option should perhaps be the I'd need to do some testing whether |
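One possible reading of the "one logic, different buffer sizes" idea is a bounded queue between the producing stage and the consuming stage. The sketch below is an assumption about how that could look in Go, not the project's design: `pump` and `pumpString` are hypothetical names, and the buffer here lives in memory, whereas a real implementation might buffer on disk. When the consumer (for example the upload) is slower than the producer, the channel fills up and the producer blocks, which is the backpressure discussed above.

```go
package main

import (
	"bytes"
	"fmt"
	"io"
	"strings"
)

// pump copies src to dst through a queue of at most bufChunks chunks.
// A small bufChunks behaves like tight streaming; a very large one lets
// the producer run far ahead of the consumer.
func pump(dst io.Writer, src io.Reader, chunkSize, bufChunks int) (int64, error) {
	chunks := make(chan []byte, bufChunks)
	readErr := make(chan error, 1)

	go func() {
		defer close(chunks)
		for {
			buf := make([]byte, chunkSize)
			n, err := src.Read(buf)
			if n > 0 {
				chunks <- buf[:n] // blocks while the queue is full
			}
			if err == io.EOF {
				readErr <- nil
				return
			}
			if err != nil {
				readErr <- err
				return
			}
		}
	}()

	var total int64
	for c := range chunks {
		n, err := dst.Write(c)
		total += int64(n)
		if err != nil {
			// Drain the queue so the producer goroutine can exit.
			for range chunks {
			}
			<-readErr
			return total, err
		}
	}
	return total, <-readErr
}

// pumpString is a small in-memory wrapper used for demonstration.
func pumpString(s string, chunkSize, bufChunks int) (string, int64, error) {
	var sink bytes.Buffer
	n, err := pump(&sink, strings.NewReader(s), chunkSize, bufChunks)
	return sink.String(), n, err
}

func main() {
	out, n, err := pumpString("archive bytes flowing downstream", 8, 2)
	fmt.Println(out, n, err)
}
```

With this shape, the difference between modes would come down to the `bufChunks` value (and where the queue is stored) rather than to two separate code paths.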
I was thinking one could have some sort of "event" that signals the end of the archive stream which would then trigger restart while bytes are still munged further down the stream, see #95 (comment) Not sure how realistic that is though. In any case I would argue optimizing for as little downtime as possible is more important than optimizing for disk space. Disk space is cheap, service downtime is not. |
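The "event at the end of the archive stream" idea could be sketched in Go with a channel that is closed once the archive stage has handed over its last byte. Everything here is hypothetical (`streamWithSignal`, `demo`, and the `restarted` flag that stands in for restarting the stopped services); it only shows the signalling mechanism under the assumption that stages are connected with `io.Pipe`.

```go
package main

import (
	"bytes"
	"fmt"
	"io"
)

// streamWithSignal plays the role of the archive stage: it writes payload
// into a pipe and closes archived once the last byte has been handed over.
// The reading side (encrypt/upload) may still be busy at that point.
func streamWithSignal(payload []byte, archived chan<- struct{}) io.Reader {
	pr, pw := io.Pipe()
	go func() {
		_, err := pw.Write(payload)
		pw.CloseWithError(err) // a nil error makes the reader see io.EOF
		close(archived)        // event: archive stage is done
	}()
	return pr
}

// demo wires the stages together and waits for both to finish.
func demo(payload string) (out string, restarted bool, err error) {
	archived := make(chan struct{})
	restartDone := make(chan struct{})

	go func() {
		<-archived
		restarted = true // placeholder for restarting stopped services
		close(restartDone)
	}()

	var sink bytes.Buffer // stands in for the encrypt/upload stages
	_, err = io.Copy(&sink, streamWithSignal([]byte(payload), archived))
	<-restartDone
	return sink.String(), restarted, err
}

func main() {
	out, restarted, err := demo("backup bytes")
	fmt.Println(out, restarted, err)
}
```

Note that `io.Pipe` is unbuffered, so in this sketch the event can only fire once the downstream side has read everything. For the restart to happen noticeably earlier than the end of the upload, some buffer would be needed between the archive stage and the slower stages.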
An event of some sort after the archiving stage is done definitely makes sense. That's usually the case yes. Maybe we can omit the That being said, I don't know if there is any benefit of using |
I'm submitting a ...
What is the current behavior?
When using GPG_PASSPHRASE, this tool will create two files in the /tmp directory: a backup-*.tar.gz file and a backup-*.tar.gz.gpg file.
It's not a bug, but it leads to high storage usage. This is problematic on restricted servers, for example, where storage space is expensive.
Example: On one of my servers I've got 90 GiB of storage. With the current implementation the server application itself must not exceed 30 GiB (in the worst case the data, the .tar.gz file and the .gpg file each need roughly that much), or the server would run out of space when executing the backup.
Not expected but an idea: When using GPG_PASSPHRASE, the result of the tar command could (probably) be piped directly to gpg. This way only the *.gpg file would be saved to storage.
What is the motivation / use case for changing the behavior?
Please tell us about your environment:
Other information (e.g. detailed explanation, stacktraces, related issues, suggestions how to fix, links for us to have context, eg. stackoverflow, etc)