git for RDM and reproducibility
===============================

checklist
---------

- **software** (a generic tool to do *something*)
  - [ ] use separate git repo for software
  - [ ] tag versions for reproducibility
  - [ ] keep software as generic as possible
- **scripts** (*how* to use *software*)
  - [ ] use separate git repo for scripts
  - [ ] tag versions for reproducibility
  - [ ] software is configured here
  - [ ] reference used software tag
- **data management**
  - [ ] publish dataset(s) to scientific data archive system
  - [ ] always attach proper metadata
  - [ ] get DOI for each version of the dataset(s) for reproducibility
  - [ ] reference used scripts tag
- **publishing**
  - [ ] use separate git repo for paper/thesis/...
  - [ ] tag versions for draft/review/final
  - [ ] convert text/source to (binary) products
  - [ ] reference used scripts tag
  - [ ] reference used data DOI
- **platforms** (GitLab, GitHub)
  - [ ] use platforms (GitLab, GitHub) for collaboration
  - [ ] review commits / merge requests
  - [ ] utilize project management tools
  - [ ] utilize automation for testing and publishing

intro
-----

- version control system (VCS) records changes (what, who, when, why)
- use platforms (GitLab, GitHub) for collaboration

git use cases
-------------

### software

- keep software as generic as possible
  - turn configuration/parameters into arguments, e.g. `myapp --seed=42`
  - this avoids having to rewrite software for parameter changes
- use software testing to verify software does what it's supposed to do
- tag versions to enable **reproducibility**

### scripting

- separate scripting from software
  - software: generic
  - scripting: software called with specific configuration/arguments
- scripting means **how** to run the software
  - i.e. here is where the parameters/arguments go
  - think of it as digital lab notes
  - this enables **reproducibility**
- specialized script variants for different environments, e.g.
  - laptop
  - RStudio / terminal server
  - HPC cluster
- think about *execution scalability*, i.e. not having to change software and scripting when you want to change parameters
- keep failed attempts in branches to preserve a history of what you tried, and explain in the commit message why it didn't work

### publishing

- for paper, thesis, book, presentation, documentation, blog posts
- use *programming languages* code/scripts for plots, flowcharts, etc.
- write text/paragraphs in markup language (e.g. markdown)
- use automation workflows to
  - render plot/flowchart code to image files
  - convert text with pandoc to PDF/PS/HTML/EPUB
- use platforms for review process

## integration of use cases for reproducibility

anti patterns
-------------

> An anti-pattern is a common response to a recurring problem that is usually
> ineffective and risks being highly counterproductive.

- most git anti-patterns are about *how* to use git
- focus here is on those relating to RDM and reproducibility

### binary files

- git as VCS works well only for text files
  - markdown
  - source code, scripts
  - (small) CSV
- binary files can't be diff'ed meaningfully, e.g.
  - compiled programs
  - MS Word, Excel
  - PDF, PS
  - JPEG, PNG
- use textual representation, e.g.
  - graphviz dot for flowcharts
  - R ggplot and CSV for plots
- use automation to convert textual representation to e.g. images
- use gitignore to never add binary products to the repo

### scientific data in git repos

- data is often binary
- git repo should be small, data blows it up, even if text
- data has different release cycles than code
- even git lfs (large file storage) is a poor fit because the repo is still a big ball of mud
- scientific datasets need metadata!
- use proper archive system for data

platforms
---------

- enable collaboration
- bug tracker / feature requests
- documentation / wiki
- project management tools
  - issue boards, milestones, gantt
- trigger automation
- publish/download releases
- go to https://git.idiv.de, log in and create new projects!
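
To make the earlier advice on generic software concrete (turn configuration/parameters into arguments, e.g. `myapp --seed=42`), here is a minimal Python sketch. It is only an illustration under assumptions: the program body, the `--samples` option, its default values and the function names `parse_args` and `run` are hypothetical and not part of any real `myapp`.

```python
import argparse
import random


def parse_args(argv=None):
    """Parse command-line arguments (hypothetical options for illustration)."""
    parser = argparse.ArgumentParser(prog="myapp")
    parser.add_argument("--seed", type=int, default=0, help="random seed")
    parser.add_argument("--samples", type=int, default=3,
                        help="number of values to draw")
    return parser.parse_args(argv)


def run(seed, samples):
    """Draw `samples` pseudo-random numbers from a generator seeded with `seed`."""
    rng = random.Random(seed)  # private generator, so results depend only on the seed
    return [rng.random() for _ in range(samples)]


if __name__ == "__main__":
    args = parse_args()
    for value in run(args.seed, args.samples):
        print(value)
```

A script in the separate scripts repo would then record the concrete call, for example `python myapp.py --seed=42 --samples=10`, together with the software tag it was run against. Changing a parameter only changes the script, not the software.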