---
title: "From graphs to Git"
date: 2021-03-08
tags: git, graphs
toc: true
---
* Introduction

This is an introduction to Git from a graph theory point of view. In
my view, most introductions to Git focus on the actual commands or on
Git internals. In my day-to-day work, I realized that I consistently
rely on an internal model of the repository as a directed acyclic
graph. I also tend to use this point of view when explaining my
workflow to other people, with some success. This is definitely not
original: many people have said the same thing, to the point that it
is a [[https://xkcd.com/1597/][running joke]]. However, I have not seen a comprehensive
introduction to Git from this point of view, without clutter from the
Git command line itself.

How to actually use the command line is not the topic of this article;
you can refer to the man pages or the excellent [[https://git-scm.com/book/en/v2][/Pro Git/]] book[fn::See
"Further reading" below.]. I will reference the relevant Git commands
as margin notes.
* Concepts: understanding the graph

** Repository

The basic object in Git is the /commit/. It consists of three
things: a set of parent commits (at least one, except for the initial
commit), a diff representing changes (some lines are removed, some are
added), and a commit message. It also has a name[fn:hash], so that we
can refer to it if needed.

[fn:hash] Actually, each commit gets a [[https://en.wikipedia.org/wiki/SHA-1][SHA-1]] hash that identifies it
uniquely. The hash is computed from the parents, the message, and the
diff. (Strictly speaking, Git stores a full snapshot of the files
rather than a diff, and the hash also covers the author and committer
metadata, but thinking in terms of diffs is enough for this article.)


A /repository/ is fundamentally just a directed acyclic graph
(DAG)[fn:graph], where nodes are commits and links are parent-child
relationships. Being a DAG means that the graph satisfies two
essential properties at all times:
- it is /directed/: edges always go from parent to child,
- it is /acyclic/: otherwise a commit could end up being an ancestor
  of itself.
As you can see, these make perfect sense in the context of a
version-tracking system.
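
The definitions above can be turned into a few lines of Python. This
is only a sketch of the mental model, not of Git's actual object
format: the =Commit= class, the =ancestors= helper, and the way the
hash is computed from the parents, message, and diff are all
simplifying assumptions made for illustration.

#+begin_src python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class Commit:
    parents: tuple  # names (hashes) of the parent commits
    message: str
    diff: str

    @property
    def name(self) -> str:
        # Simplified: real Git hashes a different, richer payload.
        payload = "\n".join([*self.parents, self.message, self.diff])
        return hashlib.sha1(payload.encode()).hexdigest()


def ancestors(repo, name):
    """Names of all commits reachable from `name` through parent links."""
    seen, stack = set(), list(repo[name].parents)
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(repo[current].parents)
    return seen


root = Commit(parents=(), message="Initial commit", diff="+hello")
child = Commit(parents=(root.name,), message="Second commit", diff="+world")
repo = {commit.name: commit for commit in (root, child)}
#+end_src

Because the name of a commit depends on the names of its parents, a
commit can only be created after its parents exist, which is one way
to see why the graph stays acyclic.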

[fn:graph] {-} You can visualize the graph of a repo, or just a subset
of it, using [[https://git-scm.com/docs/git-log][=git log=]].


Here is an example of a repo:

[[file:../images/git-graphs/repo.svg]]

In this representation, each commit points to its
children[fn:parent-child], and they are organized from left to right
as in a timeline. The /initial commit/ is the first one, the root of
the graph, on the far left.

[fn:parent-child] In the actual implementation, the edges are the
other way around: each commit points to its parents. But I feel like
it is clearer to visualize the graph ordered with time.


Note that a commit can have multiple children, and multiple parents
(we'll come back to these specific commits later).

The entirety of Git operations can be understood in terms of
manipulations of the graph. In the following sections, we'll list the
different actions we can take to modify the graph.
** Naming things: branches and tags

Some commits can be annotated: they can have a named label attached to
them, which references that specific commit.

For instance, =HEAD= references the current commit: your current
position in the graph[fn:checkout]. This is just a convenient name for
the current commit.[fn::Much like how =.= is a shorthand for the
current directory when you're navigating the filesystem.]

[fn:checkout] {-} Move around the graph (i.e. move the =HEAD=
pointer), using [[https://git-scm.com/docs/git-checkout][=git checkout=]]. You can give it commit hashes, branch
names, tag names, or relative positions like =HEAD~3= for the
great-grandparent of the current commit.


/Branches/ are other labels like this. Each of them has a
name and acts as a simple pointer to a commit. Once again, this is simply
an alias, in order to have meaningful names when navigating the graph.

[[file:../images/git-graphs/repo_labels.svg]]

In this example, we have three branches: =master=, =feature=, and
=bugfix=[fn::Do not name your real branches like this! Find a
meaningful name describing what changes you are making.]. Note that
there is nothing special about the names: we can use any name we want,
and the =master= branch is not special in any way.

/Tags/[fn:branch-tag] are another kind of label, once again pointing to a particular
commit. The main difference with branches is that branches may move
(you can change the commit they point to if you want), whereas tags
are fixed forever.
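
Since labels are nothing more than names attached to commits, they
can be sketched as plain dictionaries. The commit names (=c1= to
=c4=), the tag =v1.0=, and the helpers =resolve= and =nth_ancestor=
below are hypothetical, and the history is assumed to be linear to
keep the example short.

#+begin_src python
# Hypothetical linear history: each commit name maps to its parents.
parents = {"c1": [], "c2": ["c1"], "c3": ["c2"], "c4": ["c3"]}

# Labels are plain name -> commit mappings.
branches = {"master": "c4", "bugfix": "c2"}
tags = {"v1.0": "c3"}
head = "master"  # HEAD is attached to the master branch here


def resolve(ref):
    """Turn a label (or a raw commit name) into a commit name."""
    if ref == "HEAD":
        ref = head
    return branches.get(ref) or tags.get(ref) or ref


def nth_ancestor(ref, n):
    """Follow the first parent n times, in the spirit of HEAD~n."""
    commit = resolve(ref)
    for _ in range(n):
        commit = parents[commit][0]
    return commit


great_grandparent = nth_ancestor("HEAD", 3)
#+end_src

Here =great_grandparent= is =c1=: starting from =c4=, the commit
=master= points to, we follow three parent links.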

[fn:branch-tag] {-} Create branches and tags with the appropriately
named [[https://git-scm.com/docs/git-branch][=git branch=]] and [[https://git-scm.com/docs/git-tag][=git tag=]].

** Making changes: creating new commits

When you make some changes in your files, you then record them in
the repo by committing them[fn:commit]. This action creates a new
commit, whose parent is the current commit. For instance, in the
previous case where you were on =master=, the new repo after
committing will be (the new commit is in green):

[fn:commit] {-} To the surprise of absolutely no one, this is done
with [[https://git-scm.com/docs/git-commit][=git commit=]].


[[file:../images/git-graphs/repo_labels_commit.svg]]

Two significant things happened here:
- Your position on the graph changed: =HEAD= points to the new commit
  you just created.
- More importantly: =master= moved as well. This is the main property
  of branches: instead of being "dumb" labels pointing to commits,
  they will automatically move when you add new commits on top of
  them. (Note that this won't be the case with tags, which always
  point to the same commit no matter what.)

If you can add commits, you can also remove them (if they don't have
any children, obviously). However, it is often better to add a commit
that will /revert/[fn:revert] the changes of another commit
(i.e. apply the opposite changes). This way, you keep track of what's
been done to the repository structure, and you do not lose the
reverted changes (should you need to re-apply them in the future).
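
Committing and reverting can be sketched on the same kind of toy
model. Everything here is an assumption made for illustration: the
commit names, the representation of a diff as a list of =(sign, line)=
pairs, and the =commit= and =revert= helpers.

#+begin_src python
parents = {"c1": [], "c2": ["c1"]}
diffs = {"c1": [("+", "hello")], "c2": [("+", "world")]}
branches = {"master": "c2"}
tags = {"v1.0": "c2"}
head = "master"  # name of the branch we are currently on


def commit(new_name, diff):
    """Add a commit on top of the current one and advance the branch."""
    parents[new_name] = [branches[head]]
    diffs[new_name] = diff
    branches[head] = new_name  # the branch label follows the new commit


def revert(name, new_name):
    """Undo a commit by adding a new commit with the opposite changes."""
    inverse = [("-" if sign == "+" else "+", line) for sign, line in diffs[name]]
    commit(new_name, inverse)


commit("c3", [("+", "oops")])
revert("c3", "c4")
#+end_src

After these two calls, =master= points to =c4=, the tag =v1.0= still
points to =c2=, and =c3= is still in the graph: nothing was removed.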

[fn:revert] {-} Create a revert commit with [[https://git-scm.com/docs/git-revert][=git revert=]], and remove a
commit with [[https://git-scm.com/docs/git-reset][=git reset=]] *(destructive!)*.

** Merging

There is a special type of commit: /merge commits/, which have more
than one parent (for example, the fifth commit from the left in the
graph above).[fn:merge:{-} As can be expected, the command is [[https://git-scm.com/docs/git-merge][=git merge=]].]

At this point, we need to talk about /conflicts/.[fn:merge-conflicts]
Until now, every action was simple: we can move around, add names, and
add some changes. But now we are trying to reconcile two different
versions into a single one. These two versions can be incompatible,
and in this case the merge commit will have to choose which lines of
each version to keep. If, however, there is no conflict, the merge
commit will be empty: it will have two parents, but will not contain
any changes itself.
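
In graph terms, a merge is simply a node with two parents. The sketch
below only models that structure and says nothing about conflict
resolution; the commit names and the =ancestors= and =merge= helpers
are hypothetical.

#+begin_src python
parents = {
    "a": [], "b": ["a"],
    "m1": ["b"], "m2": ["m1"],  # commits made on master
    "f1": ["b"],                # commit made on feature
}
branches = {"master": "m2", "feature": "f1"}


def ancestors(commit):
    """The commit itself plus everything reachable through parent links."""
    seen, stack = set(), [commit]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents[current])
    return seen


def merge(into, other, name):
    """Create a merge commit with two parents and advance `into`."""
    parents[name] = [branches[into], branches[other]]
    branches[into] = name


# History shared by both branches before the merge.
common = ancestors(branches["master"]) & ancestors(branches["feature"])
merge("master", "feature", "merge")
#+end_src

The merge commit makes the whole history of both branches reachable
from =master=, while the =feature= label itself does not move.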

[fn:merge-conflicts] {-} See /Pro Git/'s [[https://git-scm.com/book/en/v2/Git-Branching-Basic-Branching-and-Merging][chapter on merging and basic
conflict resolution]] for the details on managing conflicts in practice.

** Moving commits: rebasing and squashing

Until now, all the actions we've seen were append-only. We were only
adding stuff, and it would be easy to just remove a node from the
graph, and to move the various labels accordingly, to return to the
previous state.

Sometimes we want to do more complex manipulations of the graph: moving
a commit and all its descendants to another location in the
graph. This is called a /rebase/.[fn:rebase:{-} That you can perform
with [[https://git-scm.com/docs/git-rebase][=git rebase=]] *(destructive!)*.]

[[file:../images/git-graphs/repo_labels_rebase.svg]]

In this case, we moved the branch =feature= from its old position (in
red) to a new one on top of =master= (in green).

When I say "move the branch =feature=", I actually mean something
slightly different than before. Here, we don't just move the label
=feature=, but also the entire chain of commits starting from the one
pointed to by =feature= back to the common ancestor of =feature= and its
base branch (here =master=).

In practice, what we have done is delete three commits and add
three brand new commits. Git actually helps us here by creating
commits with the same changes. Sometimes, it is not possible to apply
exactly the same changes because the version they are applied to is
not the same. For instance, if one of the commits changed a line that no
longer exists in the new base, there will be a conflict. When rebasing,
you may have to manually resolve these conflicts, similarly to a
merge.

It is often worthwhile to rebase before merging, because then we can
avoid merge commits entirely. Since =feature= has been rebased on top
of =master=, when merging =feature= into =master=, we can just
/fast-forward/ =master=, in effect just moving the =master= label
to where =feature= is.[fn:fastforward]
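
A rebase followed by a fast-forward can be sketched on a toy graph.
This is a simplified model under several assumptions: the commit
names are made up, the rebased chain is linear, a replayed commit is
simply named after the original with a trailing ='=, and conflicts are
ignored entirely.

#+begin_src python
parents = {"a": [], "b": ["a"], "m1": ["b"], "f1": ["b"], "f2": ["f1"]}
branches = {"master": "m1", "feature": "f2"}


def ancestors(commit):
    """The commit itself plus everything reachable through parent links."""
    seen, stack = set(), [commit]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents[current])
    return seen


def rebase(branch, onto):
    """Replay the commits unique to `branch` on top of `onto`."""
    base = ancestors(branches[onto])
    chain, current = [], branches[branch]
    while current not in base:  # walk back to the common ancestor
        chain.append(current)
        current = parents[current][0]
    tip = branches[onto]
    for old in reversed(chain):  # oldest commit first
        new = old + "'"          # a brand new commit with the same changes
        parents[new] = [tip]
        tip = new
    branches[branch] = tip       # the old chain is left without a label


def fast_forward(branch, target):
    """Move `branch` to `target`, but only if no merge commit is needed."""
    if branches[branch] not in ancestors(branches[target]):
        raise ValueError("not a fast-forward")
    branches[branch] = branches[target]


rebase("feature", "master")
fast_forward("master", "feature")
#+end_src

Before the rebase, =master= is not an ancestor of =feature=, so the
fast-forward would be refused. After it, both labels end up on the
last replayed commit, and no merge commit has been created.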

[fn:fastforward] {-} You can control whether or not =git merge= does a
fast-forward with the =--ff-only= and =--no-ff= flags.


[[file:../images/git-graphs/repo_labels_ff.svg]]

Another manipulation that we can do on the graph is /squashing/,
i.e. lumping several commits together in a single one.[fn:squash:{-}
Git has no dedicated =git squash= command: squash with an interactive
[[https://git-scm.com/docs/git-rebase][=git rebase -i=]] *(destructive!)*, or with =git merge --squash=.]

[[file:../images/git-graphs/repo_labels_squash.svg]]

Here, the three commits of the =feature= branch have been condensed
into a single one. No conflict can happen, but we lose the history of
the changes. Squashing may be useful to clean up a complex history.

Squashing and rebasing, taken together, can be extremely powerful
tools to entirely rewrite the history of a repo. With them, you can
reorder commits, squash them together, move them elsewhere, and so
on. However, these commands are also extremely dangerous: since you
overwrite the history, there is a lot of potential for conflicts and
general mistakes. By contrast, merges are completely safe: even if
there are conflicts and you have messed them up, you can always remove
the merge commit and go back to the previous state. But when you
rebase a set of commits and mess up the conflict resolution, there is
no easy way back: the original commits are no longer part of the
visible graph, and recovering the original state of the repository is
much harder (it typically means digging through =git reflog=).
* Remotes: sharing your work with others

You can use Git as a simple version tracking system for your own
projects, on your own computer. But most of the time, Git is used to
collaborate with other people. For this reason, Git has an elaborate
system for sharing changes with others. The good news is: everything
is still represented in the graph! There is nothing fundamentally
different to understand.

When two different people work on the same project, each will have a
version of the repository locally. Let's say that Alice and Bob are
both working on our project.

Alice has made a significant improvement to the project, and has
created several commits, which are tracked in the =feature= branch she
has created locally. The graph above (after rebasing) represents
Alice's repository. Bob, meanwhile, has the same repository but
without the =feature= branch. How can they share their work? Alice can
send Bob the commits that lie between the common ancestor of =master=
and =feature= and the tip of =feature=. Bob will see this branch as
part of a /remote/ graph, which will be superimposed on his graph:[fn:remote]

[fn:remote] {-} You can add, remove, rename, and generally manage
remotes with [[https://git-scm.com/docs/git-remote][=git remote=]]. To transfer data between you and a remote,
use [[https://git-scm.com/docs/git-fetch][=git fetch=]], [[https://git-scm.com/docs/git-pull][=git pull=]] (which fetches and merges in your local
branch automatically), and [[https://git-scm.com/docs/git-push][=git push=]].


[[file:../images/git-graphs/repo_labels_bob.svg]]

The branch name he just got from Alice is prefixed by the name of the
remote, in this case =alice=. These are just ordinary commits, and an
ordinary branch (i.e. just a label on a specific commit).

Now Bob can see Alice's work, and has an idea to improve on it. So
he wants to make a new commit on top of Alice's changes. But the
=alice/feature= branch is here to track the state of Alice's
repository, so he creates a new branch of his own named
=feature=, where he adds a commit:

[[file:../images/git-graphs/repo_labels_bob2.svg]]

Similarly, Alice can now retrieve Bob's work, and will have a new
branch =bob/feature= with the additional commit. If she wants, she can
now incorporate the new commit into her own branch =feature=, making her
branches =feature= and =bob/feature= identical:

[[file:../images/git-graphs/repo_labels_alice.svg]]

As you can see, sharing work in Git is just a matter of having
additional branches that represent the graph of other people. Some
branches are shared among different people, and in this case you will
have several branches, each prefixed with the name of the
remote. Everything is still represented simply in a single graph.
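
The exchange between Alice and Bob can be sketched as merging two
dictionaries. This is only a model of the idea: the =fetch= helper,
the repository layout, and the commit names are assumptions, and the
real transfer protocol is considerably more involved.

#+begin_src python
alice = {
    "parents": {"a": [], "b": ["a"], "f1": ["b"]},
    "branches": {"master": "b", "feature": "f1"},
}
bob = {
    "parents": {"a": [], "b": ["a"]},
    "branches": {"master": "b"},
}


def fetch(local, remote, remote_name):
    """Copy missing commits and record the remote's branches under a prefix."""
    for commit, commit_parents in remote["parents"].items():
        local["parents"].setdefault(commit, list(commit_parents))
    for branch, commit in remote["branches"].items():
        local["branches"][f"{remote_name}/{branch}"] = commit


fetch(bob, alice, "alice")

# Bob starts his own branch from Alice's and adds a commit on top of it.
bob["branches"]["feature"] = bob["branches"]["alice/feature"]
bob["parents"]["f2"] = [bob["branches"]["feature"]]
bob["branches"]["feature"] = "f2"
#+end_src

Bob's =alice/feature= label keeps pointing to =f1=, the state of
Alice's repository when he fetched it, while his own =feature= has
moved on to =f2=.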

* Additional concepts

Unfortunately, some things are not captured in the graph
directly. Most notably, the [[https://git-scm.com/book/en/v2/Git-Basics-Recording-Changes-to-the-Repository][staging area]] used for selecting changes
for committing, [[https://git-scm.com/book/en/v2/Git-Tools-Stashing-and-Cleaning][stashing]], and [[https://git-scm.com/book/en/v2/Git-Tools-Submodules][submodules]] greatly extend the
capabilities of Git beyond simple graph manipulations. You can read
about all of these in /Pro Git/.

* Internals

*Note:* This section is /not/ needed to use Git every day, or even to
understand the concepts behind it. However, it can quickly show you
that the explanations above are not pure abstractions, and are
actually represented directly this way.

Let's dive a little bit into Git's internal representations to better
understand the concepts. The entire Git repository is contained in a
=.git= folder.

Inside the =.git= folder, you will find a simple text file called
=HEAD=, which contains a reference to a location in the graph. For
instance, it could contain =ref: refs/heads/master=. As you can see,
=HEAD= really is just a pointer, to somewhere called
=refs/heads/master=. Let's look into the =refs= directory to
investigate:
#+begin_src sh
# Run from inside the .git folder: prints the commit master points to.
$ cat refs/heads/master
f19bdc9bf9668363a7be1bb63ff5b9d6bfa965dd
#+end_src

This is just a pointer to a specific commit! You can also see that all
the other branches are represented the same way.[fn:head:You must have
noticed that our graphs above were slightly misleading: =HEAD= does
not point directly to a commit, but to a branch, which itself points
to a commit. If you make =HEAD= point to a commit directly, this is
called a [[https://git-scm.com/docs/git-checkout#_detached_head]["detached HEAD"]] state.]

Remotes and tags are similar: they are in =refs/remotes= and
=refs/tags=.

Commits are stored in the =objects= directory, in subfolders named
after the first two characters of their hashes. So the commit above is
located at =objects/f1/9bdc9bf9668363a7be1bb63ff5b9d6bfa965dd=. They
are stored in a compressed binary format (for efficiency reasons), and
Git may also pack many objects together into [[https://git-scm.com/book/en/v2/Git-Internals-Packfiles][packfiles]]. But if you
inspect a commit (with [[https://git-scm.com/docs/git-show][=git show=]]), you will see the
entire contents (parents, message, diff).
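
The chain of pointers described in this section can be followed with
a few lines of Python. This is a minimal sketch under stated
assumptions: =resolve_head= and =loose_object_path= are hypothetical
helpers, only refs stored as individual files are handled, and the
example builds a throwaway directory that mimics the two files
discussed above instead of reading a real repository.

#+begin_src python
import tempfile
from pathlib import Path


def resolve_head(git_dir):
    """Return the commit hash HEAD ultimately points to."""
    head = (Path(git_dir) / "HEAD").read_text().strip()
    if head.startswith("ref: "):  # HEAD points to a branch
        ref_file = Path(git_dir) / head[len("ref: "):]
        return ref_file.read_text().strip()
    return head  # detached HEAD: the file contains a raw hash


def loose_object_path(sha):
    """Location of an unpacked object, relative to the .git folder."""
    return f"objects/{sha[:2]}/{sha[2:]}"


# Build a fake .git folder containing HEAD and refs/heads/master.
sha = "f19bdc9bf9668363a7be1bb63ff5b9d6bfa965dd"
git_dir = Path(tempfile.mkdtemp())
(git_dir / "refs" / "heads").mkdir(parents=True)
(git_dir / "HEAD").write_text("ref: refs/heads/master\n")
(git_dir / "refs" / "heads" / "master").write_text(sha + "\n")

resolved = resolve_head(git_dir)
#+end_src

Here =resolved= is the hash stored in =refs/heads/master=, and
=loose_object_path(resolved)= gives the =objects/f1/...= path
mentioned above.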

* Further reading

To know more about Git, specifically how to use it in practice, I
recommend going through the excellent [[https://git-scm.com/book/en/v2][/Pro Git/]] book, which covers
everything there is to know about the various Git commands and
workflows.

The [[https://git-scm.com/docs][Git man pages]] (also available via =man= on your system) have a
reputation for being hard to read, but once you have understood the
concepts behind repos, commits, branches, and remotes, they are an
invaluable resource to exploit all the power of the command line
interface and the various commands and options.[fn:magit:Of course,
you could also use the awesome [[https://magit.vc/][Magit]] in Emacs, which will greatly
facilitate your interactions with Git with the additional benefit of
helping you discover Git's capabilities.]

Finally, if you are interested in the implementation details of Git,
you can follow [[https://wyag.thb.lt/][Write yourself a Git]] and implement Git yourself! (This
is surprisingly straightforward, and you will end up with a much
better understanding of what's going on.) The [[https://www.aosabook.org/en/git.html][chapter on Git]] in
cite:brown2012_volum_ii is also excellent.

* References