Recommended JavaScript reading
There is a lot of provided code this week in diff.js
.
You should read the notes on Node.js modules for information on how to import these functions.
I highly encourage using async
-await
this week.
Like most real asynchronous projects, this one mostly consists of tasks that need to happen sequentially.
async
-await
can greatly simplify this sort of asynchronous code.
In async
functions, try
-catch
statements are used to handle errors in await
ed Promise
s, so you may want to read about error handling in JavaScript.
Mini Version Control
Goals
- Learn the core ideas underlying version control software
- Write an HTTP server that implements a JSON API
- Use
async
-await
to reduce the amount of boilerplate code needed to write asynchronous programs
Why version control?
You've been using Git for all the projects in this class, and possibly for other classes and personal projects too. Git is one of many tools called "version control software", which track the history of changes to a set of files over time. VCS tools differ in their implementations, but they share several core concepts. In this project, you will implement a version control application with the core features of Git or CVS. (Git actually stores its history quite differently than MiniVC since it's optimized for different use cases. If you take CS 24 this fall, you will learn how Git really works by implement parts of it!)
The application has two parts which communicate over HTTP:
- A command line program allows the user to manage the repositories that have been downloaded locally and to communicate with the server
- A server stores the definitive copy of all repositories and mediates the changes uploaded by different clients
You are provided an implementation of the client, so you only need to write the server.
Version control concepts
There are many ways to represent the history of files; for example, we could store a copy of each file whenever it is updated. However, this would waste a lot of space since files are usually changed only slightly. Instead, MiniVC tracks the differences between each version of a file and the previous version. This is also convenient because it makes it easy to identify what changes were made to files between two versions.
Diffs
A "diff" between two versions of a file is the set of changes needed to turn the first version into the second. Again, there are different granularities of changes that you can track; each has tradeoffs. Git considers a change to be inserting or deleting a line of text. This works pretty well, so we'll use the same approach.
For example, consider the following two versions of a file (each line contains a single character for simplicity):
old | new
----+----
a | a
b | b
c | c
d | h
e | f
f | g
g |
"Diffing" these two versions (e.g. the command diff -u old new
) produces the following diff:
--- old
+++ new
@@ -1,7 +1,6 @@
a
b
c
-d
-e
+h
f
g
You can see that the lines d
and e
were deleted and the line h
was inserted.
We will represent a diff as an array of "same" and "change" elements. For example, the diff above would become:
[
// The first 3 lines stayed the same
{type: 'same', count: 3},
// Then 2 lines were deleted and replaced with the line `h`
{type: 'change', deletions: 2, insertions: ['h']},
// The last 2 lines stayed the same
{type: 'same', count: 2}
]
Commits
A commit is the smallest unit of changes that MiniVC tracks. They are the "versions" in "version control". Each commit specifies the previous commit and the diffs for all files that changed since that commit. Like Git, each commit also has a "commit message" that describes what was changed and a "commit ID" so that the next commit can refer to it.
For example, the following commit adds a comment to program.c
:
{
// The previous commit had ID 85c37d4521d4d51b6ee186856b1c14e6
parentID: '85c37d4521d4d51b6ee186856b1c14e6',
// Commit message
message: 'Added comment',
// Diffs for files that were changed
diffs: {
'program.c': [
{type: 'same', 'count': 2},
{type: 'change', deletions: 0, insertions: ['// TODO: fix this']},
{type: 'same', 'count': 3}
]
}
}
The first commit is special since it doesn't have a parent.
We will represent this by omitting the parentID
field for the initial commit (or setting it to undefined
).
Merges
Version control histories are mostly linear, i.e. there is one chain of commits from the initial commit to the current commit.
In this case, the commits' parentID
s form an implicit linked list of commits.
However, a good version control system allows multiple users to work on the codebase at the same time.
It is possible that two users are working off the same commit and both of them try to push new commits before they have seen the other's changes.
In this case, whichever commit is pushed first will be added to the history, but the second commit will need to be "merged" with the first commit.
In general, whenever a new commit is attempted, it must be merged with all of the commits that were missed. For example, suppose I'm working off commit A but before I push commit B, other users push commits C, D, and E:
A - C - D - E
\
B ------^
Then the diff in commit B needs to be merged with the diff between A and E. Once we figure out what the combined diff from A to B/E should be, we "subtract" the part that was already commited between A and E, and then add the rest of the diff as commit B. Changing B to have E as its parent instead of A is called "rebasing" in Git.
Repositories
A repository (what GitLab calls a "project") is a set of files under version control.
The history of the repository is represented by storing all its commits by ID, plus the commit ID of the current commit.
This pointer to the current commit is called HEAD
in Git.
Note that HEAD
won't point to a commit until the first commit is pushed to the repository.
You can choose how to store this information.
I recommend the way the client stores it: a commits
directory with a file for each commit (where the filename is the commit ID), and a head
file that stores the ID of the current commit.
Your server should save all files inside the current working directroy (.
): for example, if the server is started in the directory server-repos
, all files and directories created by the server should be inside server-repos
.
Server API
The command-line client communicates with the server via an HTTP API.
Requests and responses are sent as JSON.
You can parse the request JSON by concatenating all the chunks in the request stream into a string, and then calling JSON.parse()
on it.
To respond with JSON, you should set the Content-Type
header to application/json
.
(JSON is not the most space-efficient way to store commit diffs, but it is very convenient to read and write from JavaScript.)
All responses have the form {success: true, ...data}
if successful and {success: false, message: string}
if an error occurs.
The server should listen on port 8000 and have the following POST
endpoints:
/new_repository
Makes a new empty respository with the given name. It is an error if a repository with this name already exists.
The request JSON has the following form:
{
name: string
}
If successful, the response JSON has the following form:
{
success: true
}
/fetch
Retrieves all new commits to the given repository. The client indicates what its last known commit is (if any). It is an error if the repository doesn't exist or the commit does not exist in that repository.
The request JSON has the following form:
{
// Repository name
name: string
// Last known commit ID.
// Omitted if the client doesn't have any commits.
parentID: string | undefined
}
If successful, the response JSON has the following form:
{
success: true
// Commits after parentID up to HEAD.
// They must be ordered from oldest to newest.
commits: Commit[]
}
where
type Commit = {
id: string
message: string
diffs: FileDiffs
}
type FileDiffs = {
// Map each filename to a diff for that file
[file: string]: Diff
}
// A diff is an array of "same" and "change" elements
type Diff = DiffElement[]
type DiffElement
= {type: 'same', count: number}
| {type: 'change', deletions: number, insertions: string[]}
/commit
Pushes the given commit to the given repository.
It is an error if the repository does not exist or the parent commit does not exist in that repository.
You can use whatever ID for the new commit you want, but it should not be likely to conflict with an existing ID.
(I used require('crypto').randomBytes(16).toString('hex')
.)
The commit needs to be merged/rebased off the current HEAD
if HEAD
is different from parentID
.
If there is a merge conflict, mergeFileDiffs()
will throw an error, which you should report to the client.
The request JSON has the following form:
{
// Repository name
name: string
// Parent commit.
// Omitted if this is the initial commit.
parentID: string | undefined
// Commit message
message: string
// Diffs for each changed file
diffs: FileDiffs
}
The response JSON has the same format as for /fetch
.
The commits sent back should include the newly added commit.
Diff functions
diff.js
exports some utility functions for working with diffs.
The ones you will likely find useful when implementing the server are:
-
addFileDiffs()
: this function takes an array of diffs to apply in sequence and concatenates them into a single diff. The result of each diff should be the source of the next. For example, if we have a diff fromA
toB
,B
toC
, andC
toD
, adding them gives the diff fromA
toD
. -
mergeFileDiffs()
: this function takes two diffs to apply in parallel. Both diffs should have the same source. For example, if we have a diff fromA
toB
and fromA
toC
, merging them gives a diff fromA
that combines both sets of changes. -
subtractFileDiffs()
: this function takes two diffs and returns the diff that must be added to the second to get the first. Both diffs should have the same source. For example, if we have a diff fromA
toC
and fromA
toB
, subtracting them gives the diff fromB
toC
.
Note that these functions expect diffs in the form of FileDiffs
objects, which map filenames to diffs for individual files.
You don't need to understand the code behind these functions, but I think the diffing algorithm is really cool, so please ask me if you want to know more!
Going further
If you are interested and want to add more to the VC project, try adding support for "branches". Branches allow for there to be multiple "heads" with different names. Ideally, it should be possible to:
- Pull the history of all branches from the server
- "Check out" a branch locally, replacing all files with their current versions on that branch
- Push a commit to the branch that is currently checked out
- Merge one branch into another (if there are no conflicts)
- Delete a branch (in the local repository and on the server)
Another issue we haven't addressed is handling concurrent accesses to repositories. For example, the asynchronous server allows two commit requests to be processed at the same time, which could corrupt a repository. A more robust server would ensure that any commit to a repository causes all other commits and fetches to that repository to wait until the commit finishes. Implement this behavior by writing an asynchronous analog of mutexes or read-write locks using callback functions.