Mastodon @Mastodon

**Adam H. Sparks** @adamhsparks@rstats.me · Feb 24, 2024

Feb 24, 2024

Adam H. Sparks @adamhsparks@rstats.me

Reproducibility for the long-term is hard. Even harder if you don't own the data. I was looking over some R scripts from ~2007/09 that I updated ~2016/17 and found broken links for downloading the MIRCA cropping data. The data sets moved from an FTP server to Zenodo (yay) but they just fail when requested in the script (boo). This isn't anything important or urgent, just my PhD dissertation and I'm tinkering, but it just highlights how fragile our systems are for creating #ReproducibleResearch

**Simon Tournier** @zimoun@sciences.re · Feb 24, 2024

Feb 24, 2024

Simon Tournier @zimoun@sciences.re

@adamhsparks Yeah long-term preservation of data set is an underestimated issue, IMHO. We had hard time when finding an example for this paper https://www.nature.com/articles/s41597-022-01720-9

Using #guix, it's possible to work around “code” for the computational part (R or Julia or else) relying on https://www.softwareheritage.org

However, data set…

NatureToward practical transparent verifiable and long-term reproducible research using Guix - Scientific DataReproducibility crisis urge scientists to promote transparency which allows peers to draw same conclusions after performing identical steps from hypothesis to results. Growing resources are developed to open the access to methods, data and source codes. Still, the computational environment, an interface between data and source code running analyses, is not addressed. Environments are usually described with software and library names associated with version labels or provided as an opaque container image. This is not enough to describe the complexity of the dependencies on which they rely to operate on. We describe this issue and illustrate how open tools like Guix can be used by any scientist to share their environment and allow peers to reproduce it. Some steps of research might not be fully reproducible, but at least, transparency for computation is technically addressable. These tools should be considered by scientists willing to promote transparency and open science.

**Alex L** @alxlg · Feb 24, 2024

Feb 24, 2024

Alex L @alxlg

@zimoun @adamhsparks

In my opinion, #Guix is an overkill for anything except computer science research. I think an approach that has better chances to succeed is reusing the OCI ecosystem and in particular using #ComposeFS for a per-file deduplication, that is important for large datasets. The missing piece for me it's a package manager for knowledge i.e. that links papers, datasets, source code for both calculations and documents (LaTeX, #Typst etc) in a graph of dependencies, citations etc.

**Ludovic Courtès** @civodul@toot.aquilenet.fr · Feb 24, 2024

Feb 24, 2024

Ludovic Courtès @civodul@toot.aquilenet.fr

@alxlg I challenge the assertion that “Guix is overkill” for your research. What would that even mean? That it makes your computational workflow “too” reproducible⁈

Not the first time I read that. I tried to answer this question, namely “Is reproducibility practical?”
https://hpc.guix.info/blog/2022/07/is-reproducibility-practical/

Also, there may be other ways to implement that, but opaque container images are not one of them.

@adamhsparks @zimoun

hpc.guix.infoGuix-HPC — Is reproducibility practical?

Alex L @alxlg@mastodon.social

@civodul @adamhsparks @zimoun

> To explore the behavior of the code, we need more. Guix eases exploration with “package transformation options”, which let users deploy variants of the software environment, for example by applying a patch somewhere in the software stack or swapping one dependency for another. A “frozen” application bundle such as a Docker image does not provide this lever.

Why? And what do you mean by "opaque" container image?

Feb 25, 2024, 12:58 PM··Megalodon

0boosts·0favorites

**Ludovic Courtès** @civodul@toot.aquilenet.fr · Feb 25, 2024

Feb 25, 2024

Ludovic Courtès @civodul@toot.aquilenet.fr

@alxlg A container image is a bunch of bytes, itself the result of a complex computational process: running ‘apt install’, building software, etc.

If all you have are those bytes, you cannot tell where they come from—just like when given a cake, you can at best guess what ingredients it contains, but you cannot tell whether it contains Novichok nor derive its recipe.

@zimoun @adamhsparks

**Ludovic Courtès** @civodul@toot.aquilenet.fr · Feb 25, 2024

Feb 25, 2024

Ludovic Courtès @civodul@toot.aquilenet.fr

@adamhsparks Likewise, with a Docker image, it’s hard or impossible to do more than just run the code that’s in there.

You’d like to run that code with a different version of one of its dependency? Or built with a different flag? Or with a different implementation of its algorithm? You’re on your own: the container image doesn’t support that kind of experimentation; it’s “frozen”.

@zimoun @alxlg

**Alex L** @alxlg · Feb 25, 2024 *

Feb 25, 2024 *

Alex L @alxlg

@civodul @zimoun @adamhsparks

And...? Why are you picking an intermediate artifact of the whole OCI ecosystem? Images are shipped for convenience but they are built using Dockerfiles. And if you don't like Dockerfiles there are other ways that could better fit your use case. Here there is also a demo taking advantage of Nix:

https://grahamc.com/blog/nix-and-layered-docker-images/

grahamc.comOptimising Docker Layers for Better Caching with Nix - Graham Christensen

**Ludovic Courtès** @civodul@toot.aquilenet.fr · Feb 25, 2024

Feb 25, 2024

Ludovic Courtès @civodul@toot.aquilenet.fr

@alxlg That’s the point: for an image built with Nix or with ‘guix pack’, you can have good provenance tracking (Guix has ‘--save-provenance’ to record that info in the image).

Conversely, an image built with a ‘Dockerfile’ is not reproducible, and thus not verifiable. It’s not reproducible because it depends on lots of external state such as Debian/Ubuntu/Alpine/PyPI servers.

@adamhsparks @zimoun

**Simon Tournier** @zimoun@sciences.re · Feb 26, 2024

Feb 26, 2024

Simon Tournier @zimoun@sciences.re

@alxlg If I might, maybe:

https://simon.tournier.info/posts/2021-09-17-guix-pack-docker.html

@adamhsparks @civodul

simon.tournier.infoReproduce Docker images produced by Guix

**Alex L** @alxlg · Feb 26, 2024 *

Feb 26, 2024 *

Alex L @alxlg

@zimoun @adamhsparks @civodul

Again, exactly the same question I made above. Why are you assuming the OCI image is the only thing that gets distributed? I never said that, images are meant for deployment, they are distributed for convenience just like you can distribute a single binary. Of course you need the source code and whatever is used to build it and eventually to build the image or other artifacts meant to make easier to *run* the code.

Drag & drop to upload

Recent searches

Search options

Administered by:

Server stats:

Recent searches

Search options

Administered by:

Server stats:

Back