Git Internals — Jagadeeswara Reddy P

The object model

Git has exactly four object types. Every piece of data git stores — file contents, directory listings, commits, tags — is one of these four things.

Blob — raw file contents. No filename, no permissions, no metadata of any kind. If two files have identical contents, they produce the same SHA-1 hash and git stores exactly one blob. Rename a file without changing its contents and no new blob is created.

Tree — a directory listing. Each entry maps a filename to a blob SHA (for files) or another tree SHA (for subdirectories), along with a Unix file mode. A tree is a snapshot of one directory level.

Commit — points to exactly one tree (the root directory snapshot), zero or more parent commits, an author, a committer, timestamps, and a message. The tree captures what the project looked like. The parents capture where this commit came from.

Tag — a named pointer to any object (usually a commit) with an optional annotation and GPG signature. Lightweight tags are just refs; annotated tags are actual objects in the store.

A loose object is zlib-compressed and stored at a path like .git/objects/ab/cdef1234..., where the first two hex characters of the SHA form the directory name and the remaining 38 form the filename. Git is a content-addressable filesystem: the address of every object is the SHA-1 hash of a short header (the object type and size) followed by its contents.
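To make the addressing scheme concrete, here is a minimal Python sketch. The function names object_id and loose_path are illustrative, not part of git, and the sketch assumes a SHA-1 repository. It relies on git hashing a short header (the object type, the byte length, and a NUL byte) together with the contents.

```python
import hashlib


def object_id(obj_type: str, data: bytes) -> str:
    """Return the SHA-1 object ID git would assign to this object."""
    header = f"{obj_type} {len(data)}".encode() + b"\x00"
    return hashlib.sha1(header + data).hexdigest()


def loose_path(oid: str) -> str:
    """Loose object path: first 2 hex chars are the directory, the other 38 the file."""
    return f".git/objects/{oid[:2]}/{oid[2:]}"


oid = object_id("blob", b"hello world\n")
print(oid)              # 3b18e512dba79e4c8300dd08aeb37f8e728b8dad
print(loose_path(oid))  # .git/objects/3b/18e512dba79e4c8300dd08aeb37f8e728b8dad
```

Because the filename is not part of the hashed bytes, two files with the same contents map to the same blob ID regardless of what they are called.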

You can inspect any object with git cat-file:

# Look at a commit
git cat-file -p HEAD
# tree 920512d2...
# parent e4f8a7c1...
# author __deesh__ <...> 1714200000 +0530
# committer __deesh__ <...> 1714200000 +0530
#
# Add user authentication

# Look at the tree
git cat-file -p 920512d2
# 100644 blob a1b2c3d4...    README.md
# 040000 tree e5f6a7b8...    src

# Look at a blob
git cat-file -p a1b2c3d4
# (raw file contents)

+--------+--------------------+-------------------+
| Object | Contains           | Addressed by      |
+--------+--------------------+-------------------+
| blob   | raw file bytes     | SHA of content    |
+--------+--------------------+-------------------+
| tree   | filename →         | SHA of listing    |
|        | blob/tree mappings |                   |
+--------+--------------------+-------------------+
| commit | tree + parents +   | SHA of all fields |
|        | metadata           |                   |
+--------+--------------------+-------------------+
| tag    | target object +    | SHA of annotation |
|        | name + message     |                   |
+--------+--------------------+-------------------+
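As a rough illustration of the "SHA of listing" row, the sketch below serializes a tree the way git stores it: each entry is the mode, a space, the name, a NUL byte, and the 20 raw bytes of the child's SHA-1. The helper names tree_payload and tree_id are made up for this example. Note that the stored mode for a directory is 40000, while git cat-file -p pads it to 040000 for display.

```python
import hashlib


def tree_payload(entries):
    """Serialize (mode, name, hex_oid) entries in the order a tree object uses."""
    def sort_key(entry):
        mode, name, _ = entry
        # Git sorts subdirectories as if their name ended with "/".
        return name.encode() + (b"/" if mode == "40000" else b"")

    out = b""
    for mode, name, hex_oid in sorted(entries, key=sort_key):
        # "<mode> <name>\0" followed by the 20 raw bytes of the SHA-1
        out += f"{mode} {name}".encode() + b"\x00" + bytes.fromhex(hex_oid)
    return out


def tree_id(entries):
    payload = tree_payload(entries)
    header = f"tree {len(payload)}".encode() + b"\x00"
    return hashlib.sha1(header + payload).hexdigest()


readme_blob = "3b18e512dba79e4c8300dd08aeb37f8e728b8dad"  # blob for b"hello world\n"
print(tree_id([("100644", "README.md", readme_blob)]))
print(tree_id([]))  # 4b825dc642cb6eb9a060e54bf8d69288fbee4904 (the empty tree)
```

Renaming README.md changes the tree's bytes and therefore its ID, while the blob ID stays the same.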

The .git directory

Every git repository is a .git directory. Everything else in the working tree is a projection of what is inside it. Here is what matters:

objects/ — the object database. Every blob, tree, commit, and tag lives here, either as a loose object (one file per object) or packed into a packfile.

refs/heads/ — branch tips. Each file contains a single 40-character SHA. refs/heads/main holds the SHA of the latest commit on main. Creating a branch is writing 40 bytes to a new file.

refs/tags/ — tag pointers. Lightweight tags are files containing a commit SHA. Annotated tags point to a tag object SHA, which in turn points to the commit.

HEAD — a symbolic reference. Usually contains ref: refs/heads/main, telling git which branch you are on. When you checkout a specific commit (detached HEAD), it contains a raw SHA instead.

index — the staging area. A binary file starting with the DIRC magic bytes. It holds a sorted list of file entries that will form the next commit’s tree. More on this below.

config — repository-level configuration. Overrides global ~/.gitconfig for settings like remote URLs, merge strategies, and user identity.

hooks/ — executable scripts triggered by git events. pre-commit, post-merge, pre-push, and others. Git ships sample hooks with a .sample extension; remove the extension to activate them.

logs/ — reflog history. Append-only text files recording every change to every ref and to HEAD. Your safety net when things go wrong.
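The relationship between HEAD and refs/heads/ can be sketched in a few lines of Python. This is a simplified model under stated assumptions: it builds a throwaway directory that mimics the layout rather than reading a real repository, the SHA is made up, and it ignores refs that git has moved into the packed-refs file. The name resolve_head is hypothetical.

```python
import os
import tempfile


def resolve_head(git_dir: str) -> str:
    """Follow HEAD to a commit SHA, handling both symbolic and detached HEAD."""
    with open(os.path.join(git_dir, "HEAD")) as f:
        head = f.read().strip()
    if not head.startswith("ref: "):
        return head  # detached HEAD: the file holds a raw SHA
    ref_path = os.path.join(git_dir, *head[len("ref: "):].split("/"))
    with open(ref_path) as f:
        return f.read().strip()


# Build a throwaway directory that mimics the layout described above.
demo = tempfile.mkdtemp()
os.makedirs(os.path.join(demo, "refs", "heads"))
sha = "e4f8a7c1" + "0" * 32  # made-up 40-character SHA
with open(os.path.join(demo, "refs", "heads", "main"), "w") as f:
    f.write(sha + "\n")
with open(os.path.join(demo, "HEAD"), "w") as f:
    f.write("ref: refs/heads/main\n")

print(resolve_head(demo) == sha)  # True
```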

The DAG

Each commit stores the SHA(s) of its parent commit(s). A regular commit has one parent. A merge commit has two or more. The initial commit has zero. This creates a directed acyclic graph — you can walk backwards from any commit to the root, but you can never create a cycle.

A linear history looks like this:

a1b2 ← c3d4 ← e5f6 ← main (HEAD)

Create a feature branch and the graph forks:

a1b2 ← c3d4 ← e5f6 ← main (HEAD)
              ↖ f7a8 ← feature

Merge the feature branch and the graph converges. The merge commit b9c0 has two parents:

a1b2 ← c3d4 ← e5f6 ← b9c0 ← main (HEAD)
              ↖ f7a8 ↗

Branches are not part of the graph. They are just files in refs/heads/ containing a single SHA. cat .git/refs/heads/main prints a 40-character hex string (if git has moved the ref into .git/packed-refs the file will not exist, and git rev-parse main is the reliable way to read it). Moving a branch forward after a commit means overwriting that SHA. Deleting a branch means deleting the file; the commits it pointed to still exist in the object store until garbage collection prunes them.
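The separation between the graph and the branch pointers can be modelled with two dictionaries. The sketch below is a toy model of the forked graph shown earlier, assuming f7a8 was created from c3d4; it is not how git stores anything, only a way to see that deleting a ref changes reachability and nothing else.

```python
# Toy model of the forked graph: commit -> list of parent commits.
parents = {
    "a1b2": [],
    "c3d4": ["a1b2"],
    "e5f6": ["c3d4"],
    "f7a8": ["c3d4"],  # assumption: feature forked from c3d4
}


def reachable(tips):
    """Return every commit reachable by following parent pointers from tips."""
    seen, stack = set(), list(tips)
    while stack:
        commit = stack.pop()
        if commit not in seen:
            seen.add(commit)
            stack.extend(parents[commit])
    return seen


refs = {"main": "e5f6", "feature": "f7a8"}
print(sorted(reachable(refs.values())))  # ['a1b2', 'c3d4', 'e5f6', 'f7a8']

del refs["feature"]  # deleting a branch removes only the pointer
print(sorted(set(parents) - reachable(refs.values())))  # ['f7a8']
```

The commit f7a8 is still present in the parents dictionary after the branch is deleted. It is merely unreachable, which is the state garbage collection later looks for.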

The index

The index (also called the staging area or cache) is a binary file at .git/index. It starts with the DIRC magic bytes, followed by a version number and a count of entries.

Each entry contains: the file path, the SHA-1 of the blob, the file mode, and a block of stat data — ctime, mtime, device number, inode number, uid, gid, and file size. This stat data is the performance trick that makes git fast.
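The fixed-size header is easy to read by hand. The following sketch parses the first 12 bytes (signature, version, entry count, all big-endian) from a header built in the example itself; parse_index_header is an illustrative name, and the sample values 2 and 3 are arbitrary.

```python
import struct

# 12-byte header: 4-byte signature, 4-byte version, 4-byte entry count,
# all big-endian. This sample header is built by hand for illustration.
sample_header = b"DIRC" + struct.pack(">II", 2, 3)


def parse_index_header(data: bytes):
    signature, version, count = struct.unpack(">4sII", data[:12])
    if signature != b"DIRC":
        raise ValueError("not a git index file")
    return version, count


print(parse_index_header(sample_header))  # (2, 3)
```

To try it on a real repository, pass the bytes of .git/index instead of sample_header.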

When you run git status, git does not hash every file in your working tree. Instead, it compares the stat data cached in the index against the stat data returned by the filesystem. If the timestamps, inode, and size all match, git assumes the file is unchanged and skips hashing entirely. Only when stat data mismatches does git actually read and hash the file to check for a real content change.

This is why git status returns nearly instantly even in repositories with tens of thousands of files. The index acts as a stat cache, turning what would be O(n) hash operations into O(n) stat comparisons — orders of magnitude faster because stat is a metadata lookup, not a file read.
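The stat shortcut can be sketched as follows. This is a simplified model, not git's implementation: it caches only mtime, inode, and size, keeps the cache in a dictionary instead of a binary index, and counts hash operations in a global so the saving is visible. All names are hypothetical.

```python
import hashlib
import os
import tempfile

hash_calls = 0  # counts how often a file is actually read and hashed


def blob_id(path):
    global hash_calls
    hash_calls += 1
    with open(path, "rb") as f:
        data = f.read()
    return hashlib.sha1(b"blob %d\x00" % len(data) + data).hexdigest()


def stat_key(path):
    st = os.stat(path)
    return (st.st_mtime_ns, st.st_ino, st.st_size)


def is_modified(path, cache):
    """cache maps path -> (stat_key, blob_id) recorded when the file was staged."""
    cached_stat, cached_id = cache[path]
    if stat_key(path) == cached_stat:
        return False  # stat data matches: assume unchanged, skip hashing
    return blob_id(path) != cached_id  # stat differs: hash to be sure


workdir = tempfile.mkdtemp()
path = os.path.join(workdir, "hello.txt")
with open(path, "w") as f:
    f.write("hello world\n")

cache = {path: (stat_key(path), blob_id(path))}  # "staging" the file: 1 hash
print(is_modified(path, cache), hash_calls)      # False 1

with open(path, "w") as f:
    f.write("hello world, again\n")              # size changes, so stat differs
print(is_modified(path, cache), hash_calls)      # True 2
```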

Packfiles and garbage collection

Loose objects — one zlib-compressed file per blob, tree, commit, or tag — are simple but wasteful. A 10 KB file modified across 100 commits produces 100 separate blob objects, most of which are nearly identical.

git gc solves this by packing loose objects into packfiles. A packfile (.git/objects/pack/*.pack) stores objects using delta compression: git groups similar objects, stores one of them whole as a base, and stores the others as binary deltas against it. The companion .idx file holds the object SHAs in sorted order behind a 256-entry fan-out table keyed on the first byte, so a lookup by SHA is a short binary search rather than a scan of the pack.

The delta compression strategy is deliberate about which version is stored whole. Git's heuristics tend to store newer versions of a file as complete objects and older versions as deltas pointing backwards to the newer base. This optimizes for checkout speed: the version you are most likely to need (the latest) requires no delta reconstruction.
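The idea of storing an older version as a delta against a newer base can be illustrated with a simplified copy/insert instruction model. This sketch uses Python's difflib to find matching regions and is not git's binary delta encoding; make_delta and apply_delta are names invented for the example.

```python
from difflib import SequenceMatcher


def make_delta(base: bytes, target: bytes):
    """Describe target as copy-from-base and insert-literal instructions."""
    ops = []
    matcher = SequenceMatcher(None, base, target, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            ops.append(("copy", i1, i2 - i1))      # offset and length in base
        elif tag in ("replace", "insert"):
            ops.append(("insert", target[j1:j2]))  # bytes not present in base
        # "delete" needs no instruction: those base bytes are simply not copied
    return ops


def apply_delta(base: bytes, ops) -> bytes:
    out = bytearray()
    for op in ops:
        if op[0] == "copy":
            _, offset, size = op
            out += base[offset:offset + size]
        else:
            out += op[1]
    return bytes(out)


newest = b"line one\nline two\nline three\nline four\n"  # stored whole
older = b"line one\nline 2\nline three\n"                 # stored as a delta

delta = make_delta(newest, older)
print(apply_delta(newest, delta) == older)  # True
```

Reading the newest version costs nothing extra, while reading the older one requires the base plus one pass over the instructions.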

GC runs automatically when the number of loose objects exceeds approximately 6,700 (the gc.auto threshold). It also prunes unreachable objects: commits, trees, and blobs that no ref or reflog entry points to. Unreachable objects are kept for two weeks by default (gc.pruneExpire), and anything a reflog entry still references counts as reachable until that entry expires, giving you time to recover from mistakes. Reflog entries for reachable commits survive for 90 days (gc.reflogExpire).

How merge works internally

Git’s default merge uses a three-way merge algorithm. The three “ways” are: the merge base (common ancestor), ours (current branch tip), and theirs (branch being merged).

The algorithm proceeds in four steps:

1. Find the merge base. Git walks the DAG backwards from both branch tips and finds the lowest common ancestor commit. This is the point where the two branches diverged.
2. Compute two diffs. Diff from the merge base to ours, and diff from the merge base to theirs. Each diff identifies which lines were added, removed, or changed relative to the common starting point.
3. Apply non-conflicting changes. If only one side modified a given region of a file, that change is accepted automatically. If neither side changed a region, it stays as-is.
4. Mark conflicts. If both sides modified the same region differently, git cannot resolve it automatically. It writes both versions into the file with conflict markers and leaves the merge in a conflicted state for you to resolve.
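The decision rule in the steps above can be sketched at the granularity of whole files instead of regions within a file. This is a deliberate simplification: snapshots are plain dictionaries mapping a path to its content, and three_way_merge is a hypothetical helper, not a git API.

```python
def three_way_merge(base, ours, theirs):
    """Merge three {path: content} snapshots; return (merged, conflicts)."""
    merged, conflicts = {}, []
    for path in sorted(set(base) | set(ours) | set(theirs)):
        b, o, t = base.get(path), ours.get(path), theirs.get(path)
        if o == t:
            result = o              # both sides agree (or neither changed it)
        elif o == b:
            result = t              # only theirs changed it
        elif t == b:
            result = o              # only ours changed it
        else:
            conflicts.append(path)  # both changed it differently
            continue
        if result is not None:      # None means the file was deleted
            merged[path] = result
    return merged, conflicts


base = {"a.txt": "1", "b.txt": "1", "c.txt": "1"}
ours = {"a.txt": "2", "b.txt": "1", "c.txt": "2"}
theirs = {"a.txt": "1", "b.txt": "3", "c.txt": "3", "d.txt": "new"}

print(three_way_merge(base, ours, theirs))
# ({'a.txt': '2', 'b.txt': '3', 'd.txt': 'new'}, ['c.txt'])
```

The merge base is what makes the rule work: without it, a difference between ours and theirs says nothing about which side made the change.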

The “recursive” merge strategy (the default before Git 2.34, which switched to the newer ort strategy) handles an edge case called criss-cross merges, where there are multiple possible merge bases. Rather than picking one arbitrarily, it first merges the merge bases into a virtual ancestor, then uses that virtual ancestor as the base for the three-way merge. The ort strategy handles criss-cross merges the same way.

# Find the merge base yourself
git merge-base main feature
# c3d4...  (in the forked graph shown earlier, the commit where feature left main)
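To see why a criss-cross history yields more than one merge base, here is a small sketch over a toy DAG. The commit names and the helper functions are invented for the example, and the quadratic search is chosen for clarity, not speed. On a real repository, git merge-base --all lists every merge base.

```python
def ancestors(parents, commit):
    """All commits reachable from commit, including commit itself."""
    seen, stack = set(), [commit]
    while stack:
        c = stack.pop()
        if c not in seen:
            seen.add(c)
            stack.extend(parents[c])
    return seen


def merge_bases(parents, a, b):
    """Common ancestors of a and b that are not ancestors of another common one."""
    common = ancestors(parents, a) & ancestors(parents, b)
    return sorted(
        c for c in common
        if not any(c != other and c in ancestors(parents, other) for other in common)
    )


# Simple fork: both branches diverge from "base".
simple = {"root": [], "base": ["root"], "ours": ["base"], "theirs": ["base"]}
print(merge_bases(simple, "ours", "theirs"))  # ['base']

# Criss-cross: each branch has already merged the other's earlier commit.
criss_cross = {
    "root": [],
    "x1": ["root"], "y1": ["root"],
    "x2": ["x1", "y1"], "y2": ["y1", "x1"],
}
print(merge_bases(criss_cross, "x2", "y2"))  # ['x1', 'y1']
```

Neither x1 nor y1 is an ancestor of the other, so both qualify, and that is the situation in which the bases are first merged into a virtual ancestor.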

How rebase works internally

Rebase replays commits onto a new base. For each commit in the range being rebased, git performs the equivalent of cherry-pick:

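1. Compute the diff that commit introduced (diff between the commit and its parent).
2. Apply that diff onto the new base, or onto the previously replayed commit for every commit after the first.
3. Create a new commit with the same message and author, a fresh committer timestamp, and the commit it was applied onto as its parent.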

Each replayed commit gets a new SHA because the parent pointer changed. The SHA covers the parent, the tree, the author, the committer, and the message, so any change to any of these fields produces a different hash.
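A toy model makes the hash change visible. In the sketch below a commit is reduced to a patch, an author, and a message; snapshots are dictionaries; and the toy hash covers only tree, parent, author, and message. Conflicts are ignored entirely, and every name here is hypothetical.

```python
import hashlib


def commit_id(tree, parent, author, message):
    """Toy commit hash: any change to any field yields a different ID."""
    fields = f"tree {tree}\nparent {parent}\nauthor {author}\n\n{message}\n"
    return hashlib.sha1(fields.encode()).hexdigest()[:8]


def apply_patch(snapshot, patch):
    """Apply a {path: new_content_or_None} patch to a {path: content} snapshot."""
    result = dict(snapshot)
    for path, content in patch.items():
        if content is None:
            result.pop(path, None)
        else:
            result[path] = content
    return result


def rebase(commits, new_base_id, new_base_snapshot):
    """Replay (patch, author, message) commits on top of a new base."""
    parent, snapshot, new_ids = new_base_id, new_base_snapshot, []
    for patch, author, message in commits:
        snapshot = apply_patch(snapshot, patch)
        tree = hashlib.sha1(repr(sorted(snapshot.items())).encode()).hexdigest()[:8]
        parent = commit_id(tree, parent, author, message)  # next commit's parent
        new_ids.append(parent)
    return new_ids


feature = [({"auth.py": "v1"}, "dev", "Add auth"), ({"auth.py": "v2"}, "dev", "Fix auth")]
old_ids = rebase(feature, "c3d4", {"README.md": "hi"})
new_ids = rebase(feature, "e5f6", {"README.md": "hi", "main.py": "x"})
print(old_ids != new_ids)  # True: same patches and messages, different parents and trees
```

The change to the first replayed commit's parent propagates down the chain, because each later commit hashes its predecessor's new ID.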

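The original commits still exist in the object store. They are not deleted or modified. They are simply unreachable from any branch ref. The reflog preserves the old branch tip, which means git reflog shows you exactly where the branch pointed before the rebase. This is your escape hatch: git reset --hard ORIG_HEAD undoes the rebase completely, as long as nothing else has overwritten ORIG_HEAD since. HEAD@{1} is usually not the right target, because a rebase writes several entries to HEAD's reflog (one per replayed commit); the branch's own reflog, for example feature@{1}, records the pre-rebase tip.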

Building a commit from plumbing commands

Every porcelain command — add, commit, merge, rebase — is orchestration on top of a handful of plumbing commands. Here is the full pipeline for creating a commit using only low-level primitives:

# 1. Create a blob from file content and capture its full SHA
blob=$(echo "hello world" | git hash-object -w --stdin)
# blob=a1b2c3d4e5f6...

# 2. Stage it in the index (--cacheinfo needs the full 40-character SHA)
git update-index --add --cacheinfo 100644 "$blob" hello.txt

# 3. Write the index as a tree object
tree=$(git write-tree)
# tree=f7a8b9c0d1e2...

# 4. Create a commit pointing to the tree
#    (omit -p HEAD when the branch has no commits yet)
commit=$(echo "first commit" | git commit-tree "$tree" -p HEAD)
# commit=1a2b3c4d5e6f...

# 5. Move the branch pointer
git update-ref refs/heads/main "$commit"

Five commands. hash-object writes content into the object store and returns its SHA. update-index adds an entry to the staging area. write-tree serializes the index into a tree object. commit-tree creates a commit object pointing to that tree with the given parent. update-ref moves the branch pointer to the new commit.

That is everything git add and git commit do. The porcelain adds convenience — reading .gitignore, formatting commit messages, running hooks — but the underlying data operations are exactly these five steps.
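The same data operations can be imitated without git at all, which shows how little is involved. The sketch below writes loose objects and a ref into a temporary directory standing in for .git. It is a model under stated assumptions: it skips the index, creates a root commit with no parent, uses a placeholder identity, and the helper names write_object and read_object are invented for the example.

```python
import hashlib
import os
import tempfile
import zlib


def write_object(git_dir, obj_type, data):
    """Rough equivalent of `git hash-object -w`: store an object, return its ID."""
    store = f"{obj_type} {len(data)}".encode() + b"\x00" + data
    oid = hashlib.sha1(store).hexdigest()
    path = os.path.join(git_dir, "objects", oid[:2], oid[2:])
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "wb") as f:
        f.write(zlib.compress(store))
    return oid


def read_object(git_dir, oid):
    """Rough equivalent of `git cat-file`: return (type, contents)."""
    with open(os.path.join(git_dir, "objects", oid[:2], oid[2:]), "rb") as f:
        store = zlib.decompress(f.read())
    header, _, data = store.partition(b"\x00")
    return header.split()[0].decode(), data


git_dir = tempfile.mkdtemp()  # stand-in for .git, not a real repository

# Blob, then a one-entry tree (skipping the index), then a commit, then the ref.
blob = write_object(git_dir, "blob", b"hello world\n")
tree = write_object(git_dir, "tree", b"100644 hello.txt\x00" + bytes.fromhex(blob))
commit_text = (
    f"tree {tree}\n"
    "author A U Thor <author@example.com> 1714200000 +0000\n"     # placeholder identity
    "committer A U Thor <author@example.com> 1714200000 +0000\n"
    "\n"
    "first commit\n"
)
commit = write_object(git_dir, "commit", commit_text.encode())
os.makedirs(os.path.join(git_dir, "refs", "heads"))
with open(os.path.join(git_dir, "refs", "heads", "main"), "w") as f:
    f.write(commit + "\n")

print(blob)                             # 3b18e512dba79e4c8300dd08aeb37f8e728b8dad
print(read_object(git_dir, commit)[0])  # commit
```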

Reflog

The reflog is stored in .git/logs/ as append-only text files. There is one file per ref (logs/refs/heads/main, logs/refs/heads/feature) plus one for HEAD (logs/HEAD). Each line records: old SHA, new SHA, who made the change, when, and what operation caused it.

git reflog
# abc1234 HEAD@{0}: commit: fix auth bug
# def5678 HEAD@{1}: rebase: fast-forward
# 9ab0cde HEAD@{2}: reset: moving to HEAD~1
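The files under .git/logs/ hold one line per change, in the form old SHA, new SHA, identity, timestamp, timezone, then a tab and the message. A minimal parser might look like the sketch below; the sample line is made up (placeholder SHAs and identity), and parse_reflog_line is an illustrative name.

```python
import re
from datetime import datetime, timezone

REFLOG_LINE = re.compile(
    r"^(?P<old>[0-9a-f]{40}) (?P<new>[0-9a-f]{40}) "
    r"(?P<name>.*) <(?P<email>[^>]*)> (?P<time>\d+) (?P<tz>[+-]\d{4})"
    r"\t(?P<message>.*)$"
)


def parse_reflog_line(line: str) -> dict:
    match = REFLOG_LINE.match(line.rstrip("\n"))
    if match is None:
        raise ValueError("unrecognised reflog line")
    entry = match.groupdict()
    # The stored timestamp is seconds since the Unix epoch.
    entry["when"] = datetime.fromtimestamp(int(entry["time"]), tz=timezone.utc)
    return entry


# Made-up line in the on-disk format (SHAs and identity are placeholders).
sample = (
    "a" * 40 + " " + "b" * 40
    + " Dev <dev@example.com> 1714200000 +0530\tcommit: fix auth bug\n"
)
entry = parse_reflog_line(sample)
print(entry["message"])  # commit: fix auth bug
print(entry["when"].isoformat())
```

The old SHA on each line is what makes recovery possible: it records where the ref pointed before the operation ran.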

The reflog is why “you can’t lose data in git” is nearly true. Even after reset --hard, the old commit SHAs live in the reflog. Even after a rebase rewrites history, the pre-rebase branch tip is recorded. You can always recover by finding the old SHA in the reflog and resetting to it.

Reflog entries expire based on reachability. Entries pointing to commits that are still reachable from some ref survive for 90 days (gc.reflogExpire). Entries pointing to unreachable commits expire after 30 days (gc.reflogExpireUnreachable). Both are configurable.

git reflog expire --expire=now --all discards reflog entries before their natural expiry (without --expire=now the command only applies the default expiry times). Combined with git gc --prune=now, it is the nuclear option for permanently removing commits from a repository.

Git is a content-addressable filesystem with a version control UI bolted on top. Once you see the objects and the DAG, the porcelain commands are just convenience wrappers.

For the practical mental model — three trees, merge vs rebase, undo patterns — see Git Basics.