List of git repositories which have an unreasonably large .git directory

Some GitHub projects have unreasonably large .git directories. Examples: conan/docs and Arduino. conan/docs seems to have solved this.

Arduino

The total size of the Arduino repository, including the .git directory, is approximately 1.4 GB:

$ du -hs gitprj/Arduino
1.4G	gitprj/Arduino

The checked-out data, however, is only about 65M:

$ du -hcs gitprj/Arduino/*
 12K	CONTRIBUTING.md
4.0K	ISSUE_TEMPLATE.md
4.0K	PULL_REQUEST_TEMPLATE.md
4.0K	README.md
 18M	app
 25M	arduino-core
 22M	build
 16K	hardware
4.0K	lib_sync
  0B	libraries
 40K	license.txt
 65M	total
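
The two du measurements above can be combined into one small helper. This is only a sketch: the function name repo_overhead is made up for illustration, and it assumes a normal (non-bare) clone whose metadata lives in a .git directory directly under the given path.

```shell
# Hypothetical helper: split the disk usage of a clone into .git and working tree.
# Sizes are in kilobytes as reported by `du -sk`.
repo_overhead() {
    repo="$1"
    total_kb=$(du -sk "$repo" | cut -f1)       # whole clone, .git included
    git_kb=$(du -sk "$repo/.git" | cut -f1)    # history and metadata only
    echo "total=${total_kb}K git=${git_kb}K worktree=$((total_kb - git_kb))K"
}
```

It would be called as `repo_overhead gitprj/Arduino`.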

The problem is that, several times, a bad commit containing lots of accidentally added binaries was pushed and then corrected only by pushing another commit that deleted the erroneous files, instead of rewriting history once and keeping the repository clean.

Examples:

Accidental commits that were not cleaned up properly afterwards:

  • 448222e4b6 adding about 192M of .class and object files
  • 920212ee05 deleting about 192M of the same.

This deletes the files from your checked-out working directory, but not from the git repository: the blobs stay reachable through the earlier commit, so every full clone still downloads them.
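
One way to see such leftovers is to list the largest blobs reachable from any ref, whether or not they still exist in the working tree. The following is a sketch, not necessarily how the numbers above were obtained; the function name largest_blobs is made up for illustration.

```shell
# Hypothetical helper: print the ten largest blobs in the whole history of a repo.
# Output columns: object type, uncompressed size in bytes, object id, path.
largest_blobs() {
    git -C "$1" rev-list --objects --all |
        git -C "$1" cat-file --batch-check='%(objecttype) %(objectsize) %(objectname) %(rest)' |
        awk '$1 == "blob"' |
        sort -k2,2 -n -r |
        head -n 10
}
```

Running `largest_blobs gitprj/Arduino` should then show files that were deleted long ago but are still stored in history.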

Another bad habit of the ‘early’ days (well, until 2014): keeping the tool binaries inside the repository.

  • starting with 21fe7f0a83
  • until finally, in 2013, with fabbe45c81, tools started to be removed from the main repo.

So now almost 95% of a clone of the Arduino repository is useless data.
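
The percentage follows from the two du figures: 65M of live data in a clone of about 1.4G (roughly 1434M). A small sketch of the arithmetic, with a made-up function name useless_pct that takes both sizes in megabytes:

```shell
# Hypothetical helper: share of a clone that is not checked-out data.
# Arguments: total size in MB, checked-out size in MB.
useless_pct() {
    awk -v total="$1" -v live="$2" 'BEGIN { printf "%.1f%%\n", 100 * (1 - live / total) }'
}

useless_pct 1434 65   # prints 95.5%
```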

conan

The conan docs repository used to be really large. This was fixed sometime in early 2020.

The problem there was an accumulation of gh-pages commits. gh-pages is an exception to the rule of not rewriting history. When I checked in 2019, the total repository size was 1.7 GB, for only 9.5M of real data.
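
One possible way to rewrite such a branch is to replace its whole history with a single parentless commit that has the same content. This is only a sketch and not necessarily what the conan maintainers did; the function name squash_branch and the commit message are made up for illustration.

```shell
# Hypothetical helper: collapse a branch into one root commit with the same tree.
# Arguments: path to the repository, branch name (for example gh-pages).
squash_branch() {
    repo="$1"
    branch="$2"
    # Tree (file contents) of the current branch tip.
    tree=$(git -C "$repo" rev-parse "$branch^{tree}") || return 1
    # New commit with that tree and no parents.
    new=$(git -C "$repo" commit-tree "$tree" -m "Publish $branch as a single commit") || return 1
    # Point the branch at the new commit; the old commits become unreachable.
    git -C "$repo" update-ref "refs/heads/$branch" "$new"
}
```

After checking the result, the rewritten branch would still have to be force-pushed (for example with `git push --force origin gh-pages`), and the old objects only disappear once they have been garbage-collected.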