Complex end-to-end tests using Guix G-expressions

Published by Arun Isaac

Tags: guix, software, scheme, lisp

Complex end-to-end tests in development repositories involving packages written in several languages are a chore to describe and maintain. Often, the only recourse is to pull in pre-built binaries or haul around heavy Docker images. Could there be a better way? Could it be Guix (spoiler alert: yes!)?

Often in development repositories, you wish to describe complex end-to-end test cases that test how your software integrates with other software out in the world. Perhaps you wish to ensure that common user workflows are not broken. Or, if you are an academic, perhaps you want to confirm that your software still produces the same results you got peer-reviewed and published last year. These simple sanity checks are surprisingly hard to describe and maintain.

Language-specific package managers are not up to the task

Most programmers prefer working with their favourite language’s package manager (think pip, cargo, etc.). While these package managers are excellent at fetching and installing packages for that language, they are flummoxed by packages written in different languages. What we really need is one universal package manager that can work with packages regardless of incidentalisms like the language those packages happen to be written in. Enter Guix!

A concrete example from bioinformatics

Let’s consider a concrete example. I’ve been hacking on pyhegp, a bioinformatics package written in Python. However, my supervisor prefers to code in R, and wishes to check whether pyhegp plays well with mixed-model-gwas, an R package they wrote earlier. The idea is to take the published HSmice dataset, put it through the following pipeline of tools written in different languages, and verify that everything works as expected.

A linear flowchart-like visualization of the test pipeline showing the steps: fetch data, wrangle data (R), pyhegp (Python), mixed model GWAS (R), and check results (Python)

Figure 1: Overall test pipeline

Let us describe this pipeline in Guix G-expressions, and get the Guix daemon to run it as a series of derivations [1].
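
For readers new to G-expressions, here is a minimal sketch that is not taken from the pyhegp repository. `#~` quotes code that runs later on the build side, and `#$` splices in values from the host side, such as the output path. The name greeting and the file name greeting.scm are made up for illustration.

```scheme
;; greeting.scm: a hypothetical one-step "pipeline".
(use-modules (guix gexp))

(define greeting
  (computed-file "greeting"
    ;; Everything under #~ is build-side code; the daemon runs it
    ;; in an isolated environment when the derivation is built.
    #~(call-with-output-file #$output
        (lambda (port)
          (display "hello from the build side\n" port)))))

;; guix build -f builds whatever the file evaluates to.
greeting
```

Running guix build -f greeting.scm should print the store path of a file containing that one line. The pipeline steps below follow the same shape, only with more going on inside the `#~`.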

Download dataset

First, we need to download the dataset. For reproducibility reasons, Guix derivations do not have network access. So, how do we fetch the dataset? Bummer?! But, not really! You can specify the URL to download along with its hash in an origin object [2]. Because the hash of the expected content is declared up front, the download fails rather than silently handing different data to the rest of the pipeline. Note how declarative this description is. We merely state the URL and the hash, and do not specify how to download the data (what libraries to use, etc.).

(define hsmice-data
  (origin
    (method url-fetch)
    (uri "https://ndownloader.figshare.com/files/42304248")
    (file-name "HSmice.tar.gz")
    (sha256
     (base32
      "1s6a83r0mll8z2lfv1b94zr2sjdrky5nyq1mpgl8fjjb5s8v2vyx"))))

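The base32 string in an origin is the SHA-256 hash of the downloaded file, written in Guix's own base32 encoding. On the command line, guix hash HSmice.tar.gz prints it. As a rough sketch, assuming the tarball has already been saved to the current directory and that the Guix and guile-gcrypt modules are on the load path (for example inside guix repl), the same value could be computed from Scheme. The variable name tarball is made up for illustration.

```scheme
(use-modules (gcrypt hash)   ; file-sha256
             (guix base32))  ; bytevector->nix-base32-string

;; Path to a local copy of the tarball; an assumption for illustration.
(define tarball "HSmice.tar.gz")

;; Hash the file contents, then render the digest the way the
;; sha256 field of an origin expects it.
(display (bytevector->nix-base32-string (file-sha256 tarball)))
(newline)
```
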
Obligatory data wrangling

No data analysis is complete without the obligatory data wrangling. Analysis tools need structured information, and it seems there’s always some shaping and tidying up to do. We do this using an R script wrangle.r and here’s the G-expression setting up the environment, invoking the script, and capturing the output.

(define hsmice-wrangled-gexp
  (let ((script-profile (profile
                         (content (packages->manifest
                                   (list gzip tar r r-dplyr r-genio
                                         r-purrr r-readr r-tibble r-tidyr))))))
    (with-imported-modules '((guix build utils))
      #~(begin
          (use-modules (guix build utils))

          (mkdir #$output)
          (set-path-environment-variable
           "PATH" '("bin") '(#$script-profile))
          (set-path-environment-variable
           "R_LIBS_SITE" '("site-library") '(#$script-profile))
          (invoke "tar" "-xvf" #$hsmice-data
                  "./HSmice/1_QTL_data/")
          (invoke "Rscript"
                  #$(local-file "wrangle.r")
                  "HSmice/1_QTL_data" #$output)))))

(define hsmice-wrangled
  (computed-file "hsmice-wrangled" hsmice-wrangled-gexp))

pyhegp

Then, we put the wrangled data through pyhegp.

(define hsmice-ciphertext-gexp
  (let ((script-profile (profile
                          (content (packages->manifest (list pyhegp))))))
    (with-imported-modules '((guix build utils))
      #~(begin
          (use-modules (guix build utils)
                       (srfi srfi-26))

          (mkdir #$output)
          (set-path-environment-variable
           "PATH" '("bin") '(#$script-profile))
          (for-each (cut install-file <> (getcwd))
                    (find-files #$hsmice-wrangled "\\.tsv$"))
          ;; Simple data sharing workflow
          (invoke "pyhegp" "encrypt" "genotype.tsv" "phenotype.tsv")
          (for-each (cut install-file <> #$output)
                    (find-files (getcwd) "\\.tsv\\.hegp$"))))))

(define hsmice-ciphertext
  (computed-file "hsmice-ciphertext" hsmice-ciphertext-gexp))

R mixed model GWAS script

We put the output of pyhegp through my supervisor’s mixed model GWAS R code.

(define hsmice-r-mixed-model-gwas-gexp
  (let ((gwas-script (local-file "gwas.r"))
        (script-profile (profile
                          (content (packages->manifest
                                    (list r r-dplyr r-mixed-model-gwas
                                          r-qqman r-readr r-stringr
                                          r-tibble r-tidyr))))))
    (with-imported-modules '((guix build utils))
      #~(begin
          (use-modules (guix build utils))

          (mkdir #$output)
          (set-path-environment-variable
           "PATH" '("bin") '(#$script-profile))
          (set-path-environment-variable
           "R_LIBS_SITE" '("site-library") '(#$script-profile))

          (invoke "Rscript" #$gwas-script
                  #$(file-append hsmice-ciphertext "/genotype.tsv.hegp")
                  #$(file-append hsmice-ciphertext "/phenotype.tsv.hegp")
                  (string-append #$output "/pvalues"))))))

(define hsmice-r-mixed-model-gwas
  (computed-file "hsmice-r-mixed-model-gwas" hsmice-r-mixed-model-gwas-gexp))

This plops out the Manhattan plot below along with the data that was used to produce it. The bioinformatics details are not important here, so I won’t go into them.

Manhattan plot of 20 chromosomes showing a significant QTL on chromosome 4

Figure 2: Manhattan plot produced by our pipeline

Finally, check results

Now, the computer can’t eyeball the plot and say whether it looks right. So, we have a Python script that reads the underlying data, checks it, and either succeeds or fails.

(define hsmice-qtl-checked-gexp
  (let ((script-profile (profile
                         (content (packages->manifest
                                   (list python python-pandas))))))
    (with-imported-modules '((guix build utils))
      #~(begin
          (use-modules (guix build utils))

          (mkdir #$output)
          (set-path-environment-variable
           "PATH" '("bin") '(#$script-profile))
          (set-path-environment-variable
           "GUIX_PYTHONPATH"
           '(#$(string-append "lib/python"
                              (version-major+minor (package-version python))
                              "/site-packages"))
           '(#$script-profile))

          (invoke "python3"
                  #$(local-file "check-qtl.py")
                  #$(file-append hsmice-r-mixed-model-gwas
                                 "/pvalues"))))))

(define hsmice-qtl-checked
  (computed-file "hsmice-qtl-checked" hsmice-qtl-checked-gexp))

Putting everything together

If we put everything together in a file and build hsmice-qtl-checked using guix build

guix build -f hsmice-test.scm

we get the build log below. hsmice-qtl-checked builds successfully. So, our test has passed.

substitute: looking for substitutes on 'https://bordeaux.guix.gnu.org'... 100.0%
substitute: looking for substitutes on 'https://ci.guix.gnu.org'... 100.0%
The following derivations will be built:
  /gnu/store/zdvi2dfn6g7h0ph71cqvshqc3lvbxxjh-HSmice.tar.gz.drv
  /gnu/store/1yl2xh5hap9cv09hp8hixb2z7wad389w-hsmice-wrangled.drv
  /gnu/store/m928lsmdbapvyw2vw04gphlw6byai95s-hsmice-ciphertext.drv
  /gnu/store/a4h2vf7b9ffrrxyhb7ns3qy17ajk4wlm-hsmice-r-mixed-model-gwas.drv
  /gnu/store/95i7qv0wpxm4fzw85szzx254k40babnd-hsmice-qtl-checked.drv
building /gnu/store/zdvi2dfn6g7h0ph71cqvshqc3lvbxxjh-HSmice.tar.gz.drv...

Starting download of /gnu/store/ma7kic5wd0cnry131ywd7icjhj31wqvx-HSmice.tar.gz
From https://ndownloader.figshare.com/files/42304248...
following redirection to `https://s3-eu-west-1.amazonaws.com/pstorage-ucl-2748466690/42304248/HSmice.tar.gz?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIAJEPILH3NWK4LP5XQ/20250909/eu-west-1/s3/aws4_request&X-Amz-Date=20250909T204609Z&X-Amz-Expires=10&X-Amz-SignedHeaders=host&X-Amz-Signature=938b9d442a345a8488be528477b9cbfb492c3de0abfd2929f9b9ddea8af98272'...
downloading from https://ndownloader.figshare.com/files/42304248 ...
 42304248  218.5MiB                                                                        8.2MiB/s 00:27 ▕██████████████████▏ 100.0%
successfully built /gnu/store/zdvi2dfn6g7h0ph71cqvshqc3lvbxxjh-HSmice.tar.gz.drv
building /gnu/store/1yl2xh5hap9cv09hp8hixb2z7wad389w-hsmice-wrangled.drv...
./HSmice/1_QTL_data/
./HSmice/1_QTL_data/HSmice.bed
./HSmice/1_QTL_data/HSmice.bim
./HSmice/1_QTL_data/HSmice.cols
./HSmice/1_QTL_data/HSmice.fam
./HSmice/1_QTL_data/HSmice.phe

Attaching package: ‘dplyr’

The following objects are masked from ‘package:stats’:

    filter, lag

The following objects are masked from ‘package:base’:

    intersect, setdiff, setequal, union

Reading: HSmice/1_QTL_data/HSmice.bim
Reading: HSmice/1_QTL_data/HSmice.fam
Reading: HSmice/1_QTL_data/HSmice.bed
Joining with `by = join_by(`sample-id`)`
Joining with `by = join_by(`sample-id`)`
sh: line 1: rm: command not found
environment variable `PATH' set to `/gnu/store/h46yw1vw5v4fynw3v71pjpz0kgh5kaqv-profile/bin'
environment variable `R_LIBS_SITE' set to `/gnu/store/h46yw1vw5v4fynw3v71pjpz0kgh5kaqv-profile/site-library'
successfully built /gnu/store/1yl2xh5hap9cv09hp8hixb2z7wad389w-hsmice-wrangled.drv
building /gnu/store/m928lsmdbapvyw2vw04gphlw6byai95s-hsmice-ciphertext.drv...
Dropped 1 SNP(s)
environment variable `PATH' set to `/gnu/store/ig9qqkp5rnvyr3g3dmfnsx00d9nx6l5l-profile/bin'
successfully built /gnu/store/m928lsmdbapvyw2vw04gphlw6byai95s-hsmice-ciphertext.drv
building /gnu/store/a4h2vf7b9ffrrxyhb7ns3qy17ajk4wlm-hsmice-r-mixed-model-gwas.drv...

Attaching package: ‘dplyr’

The following objects are masked from ‘package:stats’:

    filter, lag

The following objects are masked from ‘package:base’:

    intersect, setdiff, setequal, union


For example usage please run: vignette('qqman')

Citation appreciated but not required:
Turner, (2018). qqman: an R package for visualizing GWAS results using Q-Q and manhattan plots. Journal of Open Source Software, 3(25), 731, https://doi.org/10.21105/joss.00731.

Rows: 1527 Columns: 17
── Column specification ────────────────────────────────────────────────────────
Delimiter: "\t"
chr  (1): sample-id
dbl (16): sex, Anx.resid, BurrowedPelletWeight.resid, Context.resid, End.Wei...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Rows: 10167 Columns: 1529
── Column specification ────────────────────────────────────────────────────────
Delimiter: "\t"
dbl (1529): chromosome, position, A048005080, A048006063, A048006555, A04800...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Built kinship dim 1527 1527 
estimated heritability 0.545603 
sh: line 1: rm: command not found
environment variable `PATH' set to `/gnu/store/9cs1wah76bsijzkh2q17hhfkp0hwyjgn-profile/bin'
environment variable `R_LIBS_SITE' set to `/gnu/store/9cs1wah76bsijzkh2q17hhfkp0hwyjgn-profile/site-library'
successfully built /gnu/store/a4h2vf7b9ffrrxyhb7ns3qy17ajk4wlm-hsmice-r-mixed-model-gwas.drv
building /gnu/store/95i7qv0wpxm4fzw85szzx254k40babnd-hsmice-qtl-checked.drv...
environment variable `PATH' set to `/gnu/store/1xvafgjm30jhz1nnwcawz7wg6a4m8mwa-profile/bin'
environment variable `GUIX_PYTHONPATH' set to `/gnu/store/1xvafgjm30jhz1nnwcawz7wg6a4m8mwa-profile/lib/python3.11/site-packages'
successfully built /gnu/store/95i7qv0wpxm4fzw85szzx254k40babnd-hsmice-qtl-checked.drv
/gnu/store/frcqzx4y6sb98iavzla19m14i2931m6a-hsmice-qtl-checked

For the full real-world code from which the excerpts above were extracted, see hsmice-test.scm in the pyhegp repository. In the near future, one of our collaborators might contribute a Julia script to add to the test case and further complicate the programming language mix. I am certain Guix and G-expressions will handle it gracefully and robustly. Many thanks to the countless contributors who pour so much sweat into maintaining all these Guix packages for us! [3]
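
The excerpts above leave out two things that guix build -f needs: the module imports, and a final expression for the file to evaluate to. A rough skeleton might look as follows. The (guix ...) modules are where the core bindings live, but the list of (gnu packages ...) modules is an assumption about where each package is defined, and pyhegp and r-mixed-model-gwas are assumed to be defined or imported elsewhere. The real hsmice-test.scm may well differ.

```scheme
(use-modules (guix download)    ; url-fetch
             (guix gexp)        ; #~, #$, computed-file, local-file, file-append
             (guix packages)    ; origin, base32, package-version
             (guix profiles)    ; profile, packages->manifest
             (guix utils)       ; version-major+minor
             ;; Package modules below are assumptions.
             (gnu packages base)            ; tar
             (gnu packages compression)     ; gzip
             (gnu packages python)          ; python
             (gnu packages python-science)  ; python-pandas
             (gnu packages statistics)      ; r
             (gnu packages cran))           ; most of the r-* packages

;; ... the definitions from the sections above go here, from
;; hsmice-data through hsmice-qtl-checked ...

;; The file must evaluate to the object to build.
hsmice-qtl-checked
```

Because each stage refers to the previous one through `#$`, naming only the last stage is enough. Guix works out the rest of the chain, as the list of derivations at the top of the build log shows.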

Conclusion

Describing this test using Guix G-expressions gave us several advantages.

We integrated tools in two different languages

We effortlessly integrated software and scripts from two different languages, Python and R. Language-specific package managers cannot do this on their own.

No tedious documentation for contributors to read

We did not have to write tedious documentation explaining to contributors how to install dependencies, what commands to run, and so on. Everything is encoded precisely in the G-expressions. Contributors can therefore run and verify the tests much more easily, which makes them far more likely to actually do so.

Continuous integration

And, what’s more, if we turn our development repository into a Guix channel, we can even run these tests on CI with no modification necessary. Notice there’s no hauling around of heavy Docker images or downloading pre-built binaries from mysterious places. We work with plain text recipes and everything is fully reproducible all the way down to the bootstrap binaries—sheer magic!

In fact, I do precisely this for pyhegp. On every commit, this HSmice test is run and correct operation is verified.
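
How a repository becomes a channel is not shown in this post. As one possible sketch, the Guix manual describes a .guix-channel metadata file placed at the root of the repository. The directory name .guix below is an assumption, not necessarily what pyhegp uses.

```scheme
;; .guix-channel: tells Guix where this repository keeps its modules.
(channel
  (version 0)
  ;; Hypothetical subdirectory holding the channel's Scheme modules.
  (directory ".guix"))
```

With the package and test definitions living in modules under that directory, a CI system that understands Guix channels can pull the repository and build the test like any other file-like object.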

Use any version or any commit of any software

And finally, although I didn’t go into this in much detail here, Guix liberates packages from a special caste of package maintainers, and puts them in the hands of ordinary users. You are free to try out any version or any commit of any software, and have them treated as first-class packages equal in every way to those created by your distro’s maintainers. No need to spend a long time praying for upstream approval—you can hit the ground running from day zero.
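
As a hedged illustration of that last point, here is roughly what pinning pyhegp to an arbitrary commit could look like. The repository URL, commit and hash are placeholders, pyhegp is assumed to be an existing package already in scope, and pyhegp-pinned is a made-up name.

```scheme
(use-modules (guix git-download)  ; git-fetch, git-reference, git-version
             (guix packages))     ; package, origin, base32

(define pyhegp-pinned
  ;; Placeholder commit; substitute the one you want to test.
  (let ((commit "0123456789abcdef0123456789abcdef01234567"))
    (package
      (inherit pyhegp)  ; reuse everything except version and source
      (version (git-version (package-version pyhegp) "0" commit))
      (source
       (origin
         (method git-fetch)
         (uri (git-reference
               ;; Placeholder URL, not the real repository.
               (url "https://example.org/pyhegp.git")
               (commit commit)))
         (file-name (git-file-name "pyhegp" version))
         (sha256
          ;; Placeholder hash; Guix reports the actual one on mismatch.
          (base32
           "0000000000000000000000000000000000000000000000000000")))))))
```

pyhegp-pinned could then be listed in packages->manifest in place of pyhegp, and the rest of the pipeline would not need to change.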

Footnotes:

[1] Guix term for elementary build steps.

[2] origin objects are commonly used to specify sources for Guix packages. The Guix daemon fetches them before starting the actual build process, which runs without network access.

[3] Packaging is an extremely labour-intensive, largely invisible and usually underappreciated art.