git for RDM and reproducibility

checklist

  • software (a generic tool to do something)
    • use separate git repo for software
    • tag versions for reproducibility
    • keep software as generic as possible
  • scripts (how to use software)
    • use separate git repo for scripts
    • tag versions for reproducibility
    • software is configured here
    • reference used software tag
  • data management
    • publish dataset(s) to scientific data archive system
    • always attach proper metadata
    • get DOI for each version of the dataset(s) for reproducibility
    • reference used scripts tag
  • publishing
    • use separate git repo for paper/thesis/...
    • tag versions for draft/review/final
    • convert text/source to (binary) products
    • reference used scripts tag
    • reference used data DOI
  • platforms (GitLab, GitHub)
    • use platforms (GitLab, GitHub) for collaboration
    • review commits / merge requests
    • utilize project management tools
    • utilize automation for testing and publishing

intro

  • version control system (VCS) records changes (what, who, when, why)
  • use platforms (GitLab, GitHub) for collaboration

git use cases

software

  • keep software as generic as possible
  • turn configuration/parameters into arguments, e.g. myapp --seed=42
  • this avoids having to rewrite software for parameter changes
  • use software testing to verify software does what it's supposed to do
  • tag versions to enable reproducibility
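The idea of turning parameters into arguments can be sketched with Python's standard argparse module. This is only an illustration: myapp stands for any generic tool, and the --iterations parameter is a hypothetical example that is not part of the original notes.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Command-line interface for a hypothetical generic tool called myapp."""
    parser = argparse.ArgumentParser(prog="myapp")
    # Every value that may change between runs becomes an argument,
    # so changing a parameter never requires editing the source code.
    parser.add_argument("--seed", type=int, default=0, help="random seed")
    parser.add_argument("--iterations", type=int, default=100,
                        help="number of iterations (hypothetical parameter)")
    return parser


def main(argv=None) -> dict:
    """Parse the arguments and return the resulting configuration."""
    # With argv=None, argparse reads the real command line (sys.argv[1:]).
    args = build_parser().parse_args(argv)
    return {"seed": args.seed, "iterations": args.iterations}
```

Calling main(["--seed=42"]) corresponds to running myapp --seed=42 on the command line; a software test can call main() the same way to check that the defaults and the parsing behave as intended.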

scripting

  • separate scripting from software
    • software: generic
    • scripting: software called with specific configuration/arguments
  • scripting means how to run the software
    • i.e. here is where the parameters/arguments go
    • think of it as digital lab notes
    • this enables reproducibility
  • specialized script variants for different environments, e.g.
    • laptop
    • RStudio / terminal server
    • HPC cluster
  • think about execution scalability, i.e. not having to change software and scripting when you want to change parameters
  • keep failed attempts in branches to preserve the history of what you tried, and record why it didn't work in the commit message
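A minimal sketch of such a script is shown below. It assumes the software is the myapp tool from the example above, that the HPC cluster uses the Slurm scheduler (an assumption, the notes do not name one), and that SOFTWARE_TAG and the parameter values are made-up placeholders.

```python
import shlex

# Hypothetical values: the software tag this script was written against
# and the parameters of one particular experiment (the "lab notes").
SOFTWARE_TAG = "v1.2.0"
PARAMETERS = {"seed": 42, "iterations": 1000}


def build_command(parameters: dict, environment: str = "laptop") -> list[str]:
    """Return the command line that runs myapp in the given environment."""
    command = ["myapp"] + [f"--{name}={value}" for name, value in parameters.items()]
    if environment == "hpc":
        # Assumption: the cluster uses Slurm, where sbatch --wrap submits
        # a single command line as a batch job.
        return ["sbatch", f"--wrap={shlex.join(command)}"]
    # On a laptop or terminal server the tool is called directly.
    return command
```

Changing a parameter then means editing PARAMETERS and committing the script; neither the software nor the environment-specific logic has to change. The returned list can be passed to subprocess.run.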

publishing

  • for paper, thesis, book, presentation, documentation, blog posts
    • use code/scripts in a programming language for plots, flowcharts, etc.
    • write text/paragraphs in markup language (e.g. markdown)
  • use automation workflows to
    • render plot/flowchart code to image files
    • convert text with pandoc to PDF/PS/HTML/EPUB
  • use platforms for review process

integration of use cases for reproducibility

anti patterns

An anti-pattern is a common response to a recurring problem that is usually ineffective and risks being highly counterproductive.

  • most git anti-patterns are about how to use git
  • the focus here is on those relating to RDM and reproducibility

binary files

  • git, as a VCS, works well only for text files
    • markdown
    • source code, scripts
    • (small) CSV
  • binary files cannot be diffed meaningfully, e.g.
    • compiled programs
    • MS Word, Excel
    • PDF, PS
    • JPEG, PNG
  • use textual representation, e.g.
    • graphviz dot for flowcharts
    • R ggplot and CSV for plots
  • use automation to convert textual representation to e.g. images
  • use .gitignore so that binary products are never added to the repo
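As an illustration of a textual representation, the following sketch writes a flowchart as Graphviz DOT text. The function name, graph name, and node labels are hypothetical; only the DOT syntax itself is standard.

```python
def to_dot(edges: list[tuple[str, str]], name: str = "workflow") -> str:
    """Return a Graphviz DOT description of a directed flowchart."""
    lines = [f"digraph {name} {{"]
    for source, target in edges:
        # Quoting the node names allows labels that contain spaces.
        lines.append(f'    "{source}" -> "{target}";')
    lines.append("}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # The printed text is what gets committed; git can diff it line by line.
    print(to_dot([("software", "scripts"), ("scripts", "data"), ("data", "paper")]), end="")
```

Saved as workflow.dot, the text can be rendered by an automation step with dot -Tpng workflow.dot -o workflow.png, while a .gitignore pattern such as *.png keeps the generated image out of the repo.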

scientific data in git repos

  • data is often binary
  • a git repo should stay small; data bloats it, even if the data is text
  • data has different release cycles than code
  • even Git LFS (Large File Storage) is a poor fit because the data is still a big ball of mud
    • scientific datasets need metadata!
  • use proper archive system for data
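What "proper metadata" could contain is sketched below as a small record that links one dataset version to its DOI and to the scripts tag that produced it. The field names are hypothetical and do not follow any particular archive's schema, and the DOI is a placeholder, not a real identifier.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class DatasetRecord:
    """Minimal metadata for one published dataset version (hypothetical fields)."""
    title: str
    version: str
    doi: str          # assigned by the data archive, one per dataset version
    scripts_tag: str  # git tag of the scripts repo that produced the data


def to_json(record: DatasetRecord) -> str:
    """Serialize the record to text so it can be stored alongside the dataset."""
    return json.dumps(asdict(record), indent=2, sort_keys=True)


# Placeholder values for illustration only.
example = DatasetRecord(
    title="example measurements",
    version="1.0",
    doi="10.xxxx/placeholder",
    scripts_tag="v0.3.0",
)
```

The paper repo then only needs to cite the DOI, and the DOI's metadata leads back to the exact scripts tag, which in turn references the software tag.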

platforms

  • enable collaboration
    • bug tracker / feature requests
    • documentation / wiki
  • project management tools
    • issue boards, milestones, gantt
  • trigger automation
  • publish/download releases
  • go to https://git.idiv.de, log in, and create new projects!