Digging into Git objects
Keywords: Git
Git is a decentralized version control system (as opposed to SVN, which is centralized). It can be used purely locally or with a server through which different users work collaboratively.
In a local git project there are different areas:
- Working directory: the actual files and folders that you can edit. It has a hidden directory named .git that contains all registered changes.
- Staging area: before the changes are sent to the repository they must be prepared. The changes that you made in the working directory are compared with the content in the repository. To prepare what you have in the working directory you use git add.
- Repository: holds the history of the changes, the different parallel lines of development, etc. To record changes to a line of development (the one HEAD is pointing to) you use git commit.
You can switch HEAD from one line of development to another. The parallel lines of development are called branches, and the content of branches can be merged.
HEAD points to a commit, usually indirectly through a branch. When you stage changes, git compares the working directory with the state of the files referenced by HEAD. We can see the content of the file .git/HEAD:
cat .git/HEAD
# prints a reference to a branch, for instance: ref: refs/heads/feature1
.git/refs/heads/ contains one file per branch, and each file holds a single line with the hash of the commit that the branch points to. In this example we can see two branches:
cat .git/refs/heads/feature1
8aa172f44b0a3e23a37a5c3efbce272c122b0c20
cat .git/refs/heads/master
bcf5dc7fc2093f5aef5e6871dbb7fb637b636880
Each hash identifies an object stored under .git/objects: the first two characters of the hash name the directory and the remaining characters name the file.
.git
├── objects
│   ├── 8a
│   │   └── a172f44b0a3e23a37a5c3efbce272c122b0c20
│   ├── bc
│   │   └── f5dc7fc2093f5aef5e6871dbb7fb637b636880
│   ├── e3
│   │   └── 992efc658b834989584277749bc88001c522b6
...
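The lookup described above, from HEAD to a branch file to an object file, can be sketched in a few lines of shell. This is only an illustration under some assumptions: resolve_head and object_path are hypothetical helper names (not Git commands), the branch is assumed to be stored as a loose file under refs/heads (and not only in .git/packed-refs), and the object is assumed to be stored loose (and not inside a packfile under .git/objects/pack). In everyday use, git rev-parse HEAD is the reliable way to get the commit hash.

```shell
# Hypothetical helpers for illustration; they only read files under .git.

# Print the commit hash that HEAD refers to.
# Assumes loose refs: a branch that only exists in packed-refs is not found.
resolve_head() {
  local gitdir head ref
  gitdir="${1:-.git}"
  head=$(cat "$gitdir/HEAD") || return 1
  case $head in
    "ref: "*)
      # HEAD names a branch: follow it to the file holding the commit hash
      ref=${head#"ref: "}
      cat "$gitdir/$ref"
      ;;
    *)
      # Detached HEAD: the file already contains the commit hash
      printf '%s\n' "$head"
      ;;
  esac
}

# Print the path where a loose object with the given hash would be stored.
object_path() {
  local hash dir file
  hash=$1
  dir=$(printf '%s' "$hash" | cut -c1-2)   # first two characters: directory
  file=$(printf '%s' "$hash" | cut -c3-)   # remaining characters: file name
  printf '%s/objects/%s/%s\n' "${2:-.git}" "$dir" "$file"
}
```

Run from the root of a repository, object_path "$(resolve_head)" prints the path where the current commit object would be stored if it is a loose object.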
Every commit creates two special objects:
- A tree object whose lines have the form MODE {blob|tree} HASH name. A blob represents a file and a tree represents a directory: a blob entry refers to the object that holds the file's content, and a tree entry points to another tree object that lists the content of that directory.
- A pointer to commit object with the following content:
- tree (object)
- parent (object)
- author (String)
- committer (String)
- message (String)
The branch that HEAD currently points to is updated with the hash of this new pointer to commit object.
git log # Outputs a list of pointer to commit objects
commit ad7d95b28ffc6515cde8dc00542b4623225853a9 (HEAD -> branch)
Author: ME
Date: DATE
# git cat-file -p HASH-OF-POINTER-TO-COMMIT
git cat-file -p ad7d95b28ffc6515cde8dc00542b4623225853a9
tree dda1543ead42cd7e1f8d3a9a6012f991facb4c72
parent cff8ee38faf009a4f67521eabce4c9e58403acc3
author ME
committer ME

Message provided in the commit
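The output above has a simple shape: header lines of the form "key value", then a blank line, then the message. As a small sketch (commit_field is a hypothetical helper name, not a Git command), a single header can be pulled out of that output like this:

```shell
# Hypothetical helper: read the text of a commit object on stdin and
# print the value of every header line whose key is "$1".
commit_field() {
  local key line value
  key=$1
  while IFS= read -r line; do
    # A blank line separates the headers from the commit message
    if [ -z "$line" ]; then
      break
    fi
    case $line in
      "$key "*)
        value=${line#"$key "}
        printf '%s\n' "$value"
        ;;
    esac
  done
}
```

For instance, git cat-file -p ad7d95b28ffc6515cde8dc00542b4623225853a9 | commit_field tree prints the hash of the tree. A merge commit has several parent lines, so commit_field parent can print more than one hash.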
# git cat-file -p HASH-OF-TREE
git cat-file -p dda1543ead42cd7e1f8d3a9a6012f991facb4c72
100644 blob 86fd1fdb58c162c79346d84fad97aa71704a85e0 lorem.txt
040000 tree 2b297e643c551e76cfa1f93810c50811382f9117 prueba
We see that there is a file lorem.txt and a directory named prueba.
git cat-file -p 2b297e643c551e76cfa1f93810c50811382f9117
100644 blob 9daeafb9864cf43055ae93beb0afd6c7d144bfa4 test.txt
The directory contains the file test.txt.
With that knowledge, it is possible to write a toy script that extracts a commit to a directory. This version only extracts the files at the top level of the commit's tree, not the subdirectories and their contents. Handling those takes more thought (a recursive algorithm, or maybe an explicit stack):
# Use in a test repository
# This is the pointer to a commit (provide yours)
POINTER=ad7d95b28ffc6515cde8dc00542b4623225853a9
DIR=/tmp/test3
mkdir -p "$DIR"
# Obtain the tree (first line of the commit object is "tree HASH")
tree=$(git cat-file -p "$POINTER" | head -n1 | cut -d' ' -f2)
# Each line of the tree is "MODE TYPE HASH<TAB>NAME"
git cat-file -p "$tree" | while IFS= read -r line; do
  meta=${line%%$'\t'*}   # MODE TYPE HASH
  fname=${line#*$'\t'}   # NAME (may contain spaces)
  read -r _mode type obj <<< "$meta"
  # Only extract files in the parent dir
  if [ "$type" = "blob" ]; then
    echo "Object: $obj"
    echo "Saved to file: $DIR/$fname"
    git cat-file -p "$obj" > "$DIR/$fname"
  fi
done
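The recursive variant mentioned above could look roughly like the following sketch. It is not a replacement for Git's own tooling: extract_tree is a hypothetical function name, it relies on git ls-tree (which prints the same "MODE TYPE HASH<TAB>NAME" lines as git cat-file -p does for a tree), it is meant to be run from the root of the repository, and it ignores file modes, symbolic links and submodules.

```shell
# Hypothetical helper: write the files of a tree (or commit) into a
# directory, descending into subdirectories.
#   $1: hash of a commit or tree   $2: destination directory
extract_tree() {
  local treeish dest tab meta name rest type hash
  treeish=$1
  dest=$2
  tab=$(printf '\t')
  mkdir -p "$dest" || return 1
  # Each line is "MODE TYPE HASH<TAB>NAME"; split it at the tab first
  git ls-tree "$treeish" | while IFS=$tab read -r meta name; do
    rest=${meta#* }     # drop MODE, leaving "TYPE HASH"
    type=${rest%% *}
    hash=${rest#* }
    case $type in
      blob)
        # A file: write its content below the destination directory
        git cat-file blob "$hash" > "$dest/$name"
        ;;
      tree)
        # A directory: recurse with the subtree and a matching subdirectory
        extract_tree "$hash" "$dest/$name"
        ;;
    esac
  done
}
```

A call such as extract_tree ad7d95b28ffc6515cde8dc00542b4623225853a9 /tmp/test3 would then also create /tmp/test3/prueba and the files inside it.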
Fortunately, it is much simpler to use git archive:
git archive --format zip --output /tmp/out.zip POINTER-TO-COMMIT
Some considerations about objects and files:
- If a file does not change from a commit to the next, their associated trees point to the same blob object.
- When a file changes from a commit to the next, git creates a new object that contains the new file (not the difference).
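The id of an object is a hash computed from its content. It can be obtained with the following command (the -w option also writes the object into the repository; omit it to only compute the hash):
cat file.txt | git hash-object -w --stdin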
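To make the id less mysterious, here is a minimal sketch of how it can be reproduced without Git for a blob. It assumes a repository that uses the default SHA-1 object format and that the sha1sum tool (GNU coreutils) is installed; git_blob_hash is a hypothetical function name. Git hashes a short header, "blob SIZE" followed by a NUL byte, and then the raw content of the file.

```shell
# Hypothetical helper: compute the id Git would give to a file stored as a blob.
# Assumes the SHA-1 object format and the sha1sum command.
git_blob_hash() {
  local file size
  file=$1
  # Size of the content in bytes (the arithmetic strips any padding from wc)
  size=$(( $(wc -c < "$file") ))
  # Hash the header "blob SIZE\0" followed by the file content
  { printf 'blob %d\0' "$size"; cat "$file"; } | sha1sum | cut -d' ' -f1
}
```

For the same file, git_blob_hash file.txt and git hash-object file.txt should print the same id, as long as no content filters (such as line-ending conversion) are configured. Since the id depends only on the content, two files with identical content map to the same blob object, which is why an unchanged file is shared between commits.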
We can show this with the help of two diagrams. Let's assume that we have a commit with two files:
If we now change the file file2.txt in our working directory and perform a commit, we will have: