Bindings to the seekable variant of the ZSTD compression format

#10 async support in the library

Opened by Fuuzetsu on February 14, 2022
Fuuzetsu on February 14, 2022

I’m the author of the zstd-seekable-s3 crate, which provides an object that implements Read and Seek and allows seekable reads directly from AWS S3, including when the files are stored as seekable zstd archives. It uses your crate for the compression/decompression bits.

The way this was achieved was via tokio::runtime::block_on + rusoto. However, we’ve found that this has too many limitations when you want to actually use async in the rest of your library: you can end up with nested block_on calls and all kinds of other unfun things.
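To make the pattern concrete, here is a minimal sketch of a synchronous facade over an async fetch. Everything in it is an assumption made for illustration: `RemoteObject` and `fetch_range` stand in for an async S3 ranged GET, and the toy `block_on` stands in for a real runtime's `block_on`. It is not the zstd-seekable-s3 implementation.

```rust
use std::future::Future;
use std::io::{self, Read, Seek, SeekFrom};
use std::pin::Pin;
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};

/// Waker that does nothing; enough for futures that never really suspend.
struct NoopWake;
impl Wake for NoopWake {
    fn wake(self: Arc<Self>) {}
}

/// Toy stand-in for a runtime's `block_on`: polls the future until it is ready.
fn block_on<F: Future>(fut: F) -> F::Output {
    let waker = Waker::from(Arc::new(NoopWake));
    let mut cx = Context::from_waker(&waker);
    let mut fut: Pin<Box<F>> = Box::pin(fut);
    loop {
        if let Poll::Ready(out) = fut.as_mut().poll(&mut cx) {
            return out;
        }
        std::thread::yield_now();
    }
}

/// Hypothetical remote object; `fetch_range` imitates an async ranged GET.
struct RemoteObject {
    data: Vec<u8>,
}

impl RemoteObject {
    async fn fetch_range(&self, offset: u64, len: usize) -> io::Result<Vec<u8>> {
        let start = (offset as usize).min(self.data.len());
        let end = start.saturating_add(len).min(self.data.len());
        Ok(self.data[start..end].to_vec())
    }
}

/// Synchronous facade: every `read` blocks the calling thread on the async fetch.
struct BlockingReader {
    remote: RemoteObject,
    pos: u64,
}

impl Read for BlockingReader {
    fn read(&mut self, buf: &mut [u8]) -> io::Result<usize> {
        let chunk = block_on(self.remote.fetch_range(self.pos, buf.len()))?;
        buf[..chunk.len()].copy_from_slice(&chunk);
        self.pos += chunk.len() as u64;
        Ok(chunk.len())
    }
}

impl Seek for BlockingReader {
    fn seek(&mut self, target: SeekFrom) -> io::Result<u64> {
        let len = self.remote.data.len() as i128;
        let new_pos = match target {
            SeekFrom::Start(n) => n as i128,
            SeekFrom::End(d) => len + d as i128,
            SeekFrom::Current(d) => self.pos as i128 + d as i128,
        };
        if new_pos < 0 {
            return Err(io::Error::new(io::ErrorKind::InvalidInput, "seek before start"));
        }
        self.pos = new_pos as u64;
        Ok(self.pos)
    }
}
```

The blocking call sits inside `read`, so any async caller that ends up using this reader blocks on a future from inside another future. With a real runtime such as Tokio, that is where the nested block_on trouble comes from.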

With the released version of the crate (0.1.7), it’s not possible to get rid of synchronicity: Seekable::init requires Read and Seek so that it can invoke callbacks.

I thought everything was lost, but I noticed that you have gotten rid of the zstd-seekable C code and translated it to Rust. This means that if we allow Seekable::src to be async, everyone is happy.

So the questions are:

  • what’s the status of the code now? I see the newer releases have all been yanked.
  • can we get async support somehow? Basically, anywhere that calls Seekable::src callbacks should also allow async versions. I don’t know of a “pretty” way to support both at once: if worse comes to worst, I can hack something together myself based on this code here, though that depends on the answer to my first question.
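One possible shape for supporting both at once, as a rough sketch only: the library is written against an async source trait, and any synchronous Read + Seek value is wrapped in an adapter. The names `AsyncSource`, `SyncSource` and `read_exact_at` are hypothetical and are not part of the zstd-seekable API; the small `block_on` is a toy executor included only so the sketch can be driven without an external runtime.

```rust
use std::future::Future;
use std::io::{self, Read, Seek, SeekFrom};
use std::pin::Pin;
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};

/// Boxed future alias, so the trait does not need `async fn` in traits.
type BoxFuture<'a, T> = Pin<Box<dyn Future<Output = T> + 'a>>;

/// Hypothetical async counterpart of a `Read + Seek` source.
trait AsyncSource {
    /// Read up to `buf.len()` bytes starting at absolute `offset`.
    fn read_at<'a>(&'a mut self, offset: u64, buf: &'a mut [u8]) -> BoxFuture<'a, io::Result<usize>>;
}

/// Adapter: any synchronous `Read + Seek` value can act as an `AsyncSource`.
struct SyncSource<R>(R);

impl<R: Read + Seek> AsyncSource for SyncSource<R> {
    fn read_at<'a>(&'a mut self, offset: u64, buf: &'a mut [u8]) -> BoxFuture<'a, io::Result<usize>> {
        Box::pin(async move {
            self.0.seek(SeekFrom::Start(offset))?;
            self.0.read(buf)
        })
    }
}

/// Library-side helper written once, against the async trait only.
async fn read_exact_at<S: AsyncSource>(src: &mut S, offset: u64, buf: &mut [u8]) -> io::Result<()> {
    let mut filled = 0;
    while filled < buf.len() {
        let n = src.read_at(offset + filled as u64, &mut buf[filled..]).await?;
        if n == 0 {
            return Err(io::ErrorKind::UnexpectedEof.into());
        }
        filled += n;
    }
    Ok(())
}

/// Toy executor for synchronous callers; a real program would use its runtime.
struct NoopWake;
impl Wake for NoopWake {
    fn wake(self: Arc<Self>) {}
}

fn block_on<F: Future>(fut: F) -> F::Output {
    let waker = Waker::from(Arc::new(NoopWake));
    let mut cx = Context::from_waker(&waker);
    let mut fut: Pin<Box<F>> = Box::pin(fut);
    loop {
        if let Poll::Ready(out) = fut.as_mut().poll(&mut cx) {
            return out;
        }
        std::thread::yield_now();
    }
}
```

Synchronous users would wrap their reader in `SyncSource` and block once at the outermost call, while async users would implement `AsyncSource` directly and simply `.await`. The trade-off is one boxed future per read call.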

Thanks!

pmeunier on February 14, 2022

Sounds like you’re doing some really cool work. @darleybarreto did fantastic work translating the code to Rust, in an attempt to mitigate the performance regressions we were experiencing in ZStd 1.5. It didn’t mitigate those, but it does feel a bit more future-proof and maintainable.

However, this was done at a time of wrapping up the beta for Pijul, and there were still a number of unexplained bugs when testing on a larger scale. Since Pijul uses ZStd-seekable almost everywhere, a more reasonable way to test was to move back to the conservative, C version temporarily while debugging the rest of Pijul.

But now that we’re beta, the testing can finally resume! If you’re interested in helping, I’m sure @darleybarreto would enjoy the discussions I haven’t had much time to hold in the last few months :(

Fuuzetsu on February 14, 2022

> However, this was done at a time of wrapping up the beta for Pijul, and there were still a number of unexplained bugs when testing on a larger scale. Since Pijul uses ZStd-seekable almost everywhere, a more reasonable way to test was to move back to the conservative, C version temporarily while debugging the rest of Pijul.

Just to make sure I understand: were the unexplained bugs in Pijul or in zstd-seekable? If the former, did you stick with the C version of zstd-seekable just because it was more tested and one less thing to worry about while you were looking for Pijul bugs? Or were there bugs in the Rust zstd-seekable that you didn’t have time to deal with, so you reverted to the C version for now?

> If you’re interested in helping, I’m sure @darleybarreto would enjoy the discussions I haven’t had much time to hold in the last few months :(

If there’s something specific I can do to help, I can probably spend a little bit of time here and there, though I am not privy to how the zstd-seekable (or zstd) internals work, so I’m not sure how much help I can be.

To give some idea of our use of zstd-seekable: we usually generate a data set that, after compression with the current zstd-seekable, is about 500GiB. The data is all of a fairly similar kind, so it’s definitely not an exhaustive sample, but it’s probably better than nothing. If there’s some version of zstd-seekable that can be tried, I could re-compress the data with it and check that nothing crashes when decoding with the C-backed version and encoding with the new version.
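For a check at that scale, the two decoded outputs would have to be compared as streams rather than loaded into memory. A minimal sketch of such a comparison, assuming only that both decoders can be exposed as std::io::Read values (the helper names `read_full` and `first_mismatch` are made up for this example):

```rust
use std::io::{self, Read};

/// Fill `buf` as far as possible; returns fewer bytes than `buf.len()` only at EOF.
fn read_full<R: Read>(reader: &mut R, buf: &mut [u8]) -> io::Result<usize> {
    let mut filled = 0;
    while filled < buf.len() {
        match reader.read(&mut buf[filled..]) {
            Ok(0) => break,
            Ok(n) => filled += n,
            Err(e) if e.kind() == io::ErrorKind::Interrupted => continue,
            Err(e) => return Err(e),
        }
    }
    Ok(filled)
}

/// Returns the offset of the first differing byte, or `None` if the streams match.
fn first_mismatch<A: Read, B: Read>(mut a: A, mut b: B) -> io::Result<Option<u64>> {
    const CHUNK: usize = 64 * 1024;
    let mut buf_a = vec![0u8; CHUNK];
    let mut buf_b = vec![0u8; CHUNK];
    let mut offset = 0u64;
    loop {
        let n_a = read_full(&mut a, &mut buf_a)?;
        let n_b = read_full(&mut b, &mut buf_b)?;
        let common = n_a.min(n_b);
        // Compare the part both streams delivered in this round.
        let diff = buf_a[..common]
            .iter()
            .zip(&buf_b[..common])
            .position(|(x, y)| x != y);
        if let Some(i) = diff {
            return Ok(Some(offset + i as u64));
        }
        if n_a != n_b {
            // One stream ended before the other.
            return Ok(Some(offset + common as u64));
        }
        if n_a == 0 {
            return Ok(None);
        }
        offset += n_a as u64;
    }
}
```

Memory use stays at two 64 KiB buffers regardless of archive size, and the returned offset points at where to start looking if the two versions disagree.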

darleybarreto on February 22, 2022

Hi,

Sorry for not responding earlier; for some reason I don’t get notified when people mention me in discussions.

> If there’s something specific I can do to help […]

I’m not sure, actually. There were some people reporting bugs on Pijul’s main repo, but other than that I don’t know of any particular bug. What would be much appreciated is adding tests. I added one or two, but that is far from ideal. Perhaps fuzzing it too.
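As a starting point for that kind of test, here is a sketch of a randomized seek-and-read check. It is generic over any Read + Seek value, so it makes no assumption about the crate's API: in a real test, `reader` would be whatever seekable decompressing reader the crate exposes and `reference` the original uncompressed bytes. The `XorShift` generator and `check_random_reads` are hypothetical helpers written for this example.

```rust
use std::io::{self, Read, Seek, SeekFrom};

/// Small xorshift64 generator, so the sketch needs no external crates.
struct XorShift(u64);

impl XorShift {
    fn next_u64(&mut self) -> u64 {
        self.0 ^= self.0 << 13;
        self.0 ^= self.0 >> 7;
        self.0 ^= self.0 << 17;
        self.0
    }
}

/// Seek to random offsets and compare what the reader returns with `reference`.
/// Returns the number of reads checked, or an error describing the first mismatch.
fn check_random_reads<R: Read + Seek>(
    reader: &mut R,
    reference: &[u8],
    rounds: usize,
    seed: u64,
) -> io::Result<usize> {
    if reference.is_empty() {
        return Ok(0);
    }
    let mut rng = XorShift(seed | 1); // xorshift state must be non-zero
    let mut buf = Vec::new();
    for _ in 0..rounds {
        // Pick a random window of 1..=4096 bytes that fits inside `reference`.
        let start = (rng.next_u64() % reference.len() as u64) as usize;
        let max_len = (reference.len() - start).min(4096);
        let len = 1 + (rng.next_u64() % max_len as u64) as usize;
        buf.resize(len, 0);
        reader.seek(SeekFrom::Start(start as u64))?;
        reader.read_exact(&mut buf)?;
        if buf[..] != reference[start..start + len] {
            return Err(io::Error::new(
                io::ErrorKind::InvalidData,
                format!("mismatch at offset {}, length {}", start, len),
            ));
        }
    }
    Ok(rounds)
}
```

Because the seed is explicit, a failing run can be replayed exactly, which helps when a bug only appears at particular frame boundaries.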

Another thing that helps is discussion and code review. You could browse the source code (this and, optionally, the original C), understand it, and discuss with me anything you find interesting, different from the original, potentially wrong, or that could simply be coded better. Other than that, trying to use it on a daily basis would also be great for improving the implementation and the API, finding bugs, etc. I also keep an eye on the original C code base to see what has changed and port it to this implementation.

Please note that I’m not a fluent Rust programmer, nor someone with extensive systems programming skills (I am a machine learning researcher/dev who mainly codes in Python). I do things in Rust to learn and to help others, so any help is much appreciated!