git for research data management (RDM) and reproducibility
==========================================================

checklist
---------

- **software** (a generic tool to do *something*)
  - [ ] use separate git repo for software
  - [ ] tag versions for reproducibility
  - [ ] keep software as generic as possible
- **scripts** (*how* to use *software*)
  - [ ] use separate git repo for scripts
  - [ ] tag versions for reproducibility
  - [ ] software is configured here
  - [ ] reference used software tag
- **data management**
  - [ ] publish dataset(s) to scientific data archive system
  - [ ] always attach proper metadata
  - [ ] get DOI for each version of the dataset(s) for reproducibility
  - [ ] reference used scripts tag
- **publishing**
  - [ ] use separate git repo for paper/thesis/...
  - [ ] tag versions for draft/review/final
  - [ ] convert text/source to (binary) products
  - [ ] reference used scripts tag
  - [ ] reference used data DOI
- **platforms** (GitLab, GitHub)
  - [ ] use platforms (GitLab, GitHub) for collaboration
  - [ ] review commits / merge requests
  - [ ] utilize project management tools
  - [ ] utilize automation for testing and publishing


intro
-----

- version control system (VCS) records changes (what, who, when, why)
- use platforms (GitLab, GitHub) for collaboration


git use cases
-------------

### software

- keep software as generic as possible
- turn configuration/parameters into arguments, e.g. `myapp --seed=42`
- this avoids having to rewrite software for parameter changes
- use software testing to verify software does what it's supposed to do
- tag versions to enable **reproducibility**
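
A minimal sketch of the "parameters as arguments" idea, assuming the software is a Python tool. The name `myapp`, the default seed, and the `run` function are illustrative assumptions, not part of any real project:

```python
import argparse
import random


def parse_args(argv=None):
    """Parse command-line arguments; --seed is the only parameter here."""
    parser = argparse.ArgumentParser(prog="myapp")
    parser.add_argument("--seed", type=int, default=0,
                        help="seed for the random number generator")
    return parser.parse_args(argv)


def run(seed):
    """Return three pseudo-random numbers that depend only on the seed."""
    rng = random.Random(seed)
    return [rng.random() for _ in range(3)]


if __name__ == "__main__":
    args = parse_args()
    print(run(args.seed))
```

Changing the seed now means changing the call (`myapp --seed=43`), not the source code, and a small test can check that the same seed always gives the same result.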

### scripting

- separate scripting from software
  - software: generic
  - scripting: software called with specific configuration/arguments
- scripting means **how** to run the software
  - i.e. here is where the parameters/arguments go
  - think of it as digital lab notes
  - this enables **reproducibility**
- specialized script variants for different environments, e.g.
  - laptop
  - RStudio / terminal server
  - HPC cluster
- think about *execution scalability*, i.e. not having to change software and
  scripting when you want to change parameters
- keep failed attempts in branches to preserve the history of what you tried;
  record why it didn't work in the commit message
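
A script in this sense could look roughly like the sketch below. The tag `v1.0.0`, the program name `myapp`, and the helper names are hypothetical; the point is only that the parameters and the software version in use are written down in one versioned place:

```python
import shlex

# Hypothetical values: the tag of the software repo this script was written
# against, and the parameters of this particular run.
SOFTWARE_TAG = "v1.0.0"
PARAMETERS = {"seed": 42}


def build_command(program, parameters):
    """Turn a parameter dict into an argument list, e.g. --seed=42."""
    args = [f"--{name}={value}" for name, value in sorted(parameters.items())]
    return [program, *args]


def lab_note(software_tag, command):
    """One line recording which software version ran with which arguments."""
    return f"software={software_tag} command={shlex.join(command)}"


command = build_command("myapp", PARAMETERS)
print(lab_note(SOFTWARE_TAG, command))
```

To actually execute the run, the same `command` list can be passed to `subprocess.run(command, check=True)`.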

### publishing

- for paper, thesis, book, presentation, documentation, blog posts
  - use code/scripts in a *programming language* for plots, flowcharts, etc.
  - write text/paragraphs in markup language (e.g. markdown)
- use automation workflows to
  - render plot/flowchart code to image files
  - convert text with pandoc to PDF/PS/HTML/EPUB
- use platforms for review process

integration of use cases for reproducibility
--------------------------------------------

![](img/rdm-use-case-merged.svg)


anti patterns
-------------

> An anti-pattern is a common response to a recurring problem that is usually
> ineffective and risks being highly counterproductive.

- most git anti-patterns are about *how* to use git
- focus here is on those relating to RDM and reproducibility

### binary files

- git as a VCS works well only for text files
  - markdown
  - source code, scripts
  - (small) CSV
- binary files can't be diffed meaningfully, e.g.
  - compiled programs
  - MS Word, Excel
  - PDF, PS
  - JPEG, PNG
- use a textual representation instead, e.g.
  - Graphviz DOT for flowcharts
  - R ggplot2 and CSV for plots
- use automation to convert the textual representation to e.g. images
- use `.gitignore` to keep binary products out of the repo
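
A simple check for binary content, e.g. in a pre-commit hook, could be sketched as follows. The function name and the sample size are assumptions; looking for a NUL byte near the start of a file is a common heuristic, similar in spirit to what git itself does, but not guaranteed to match it exactly:

```python
def looks_binary(data: bytes, sample_size: int = 8000) -> bool:
    """Heuristic: treat content as binary if a NUL byte appears early on."""
    return b"\x00" in data[:sample_size]


# Text such as markdown or CSV normally contains no NUL bytes.
print(looks_binary("# title\n\nsome text\n".encode("utf-8")))

# The start of a PNG file: signature followed by a chunk length with NULs.
print(looks_binary(b"\x89PNG\r\n\x1a\n\x00\x00\x00\rIHDR"))
```

The heuristic can misclassify unusual files (for example UTF-16 text contains NUL bytes), so it is a guard rail rather than a replacement for `.gitignore`.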

### scientific data in git repos

- data is often binary
- git repos should stay small; data bloats them, even when it is text
- data has different release cycles than code
- even Git LFS (Large File Storage) is a poor fit: the result is still a big
  ball of mud
  - scientific datasets need metadata!
- use proper archive system for data
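
The link between a dataset, the scripts, and the software can be captured in a small provenance record. The sketch below is only an illustration: the field names, the tags, and the DOI `10.XXXX/placeholder` are made up, and a real archive system will prescribe its own metadata schema:

```python
import json


def provenance_record(dataset_doi, scripts_tag, software_tag):
    """Collect the identifiers needed to trace a dataset back to its code."""
    return {
        "dataset_doi": dataset_doi,
        "scripts_tag": scripts_tag,
        "software_tag": software_tag,
    }


# All three values are placeholders, not real identifiers.
record = provenance_record("10.XXXX/placeholder", "run-v1", "v1.0.0")
print(json.dumps(record, indent=2, sort_keys=True))
```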


platforms
---------

- enable collaboration
  - bug tracker / feature requests
  - documentation / wiki
- project management tools
  - issue boards, milestones, Gantt charts
- trigger automation
- publish/download releases
- go to https://git.idiv.de, log in, and create new projects!