JavaScript's `Promise` abstraction was discussed in the [notes](../promises/promises.md) near the start of the course and you had a chance to use it on the Make project.
I'll give a quick refresher since `Promise`s are essential to the `async`-`await` syntax.
A `Promise` represents an asynchronous task or computation.
When the task or computation finishes, the `Promise` "resolves" to a value.
If an error occurs, the `Promise` "rejects" with the error.
The usefulness of `Promise`s comes from the ways they can be composed into more complex `Promise`s.
There are two main ways to combine `Promise`s:
- In sequence.
`.then()` chains two `Promise`s into a `Promise` that performs the asynchronous tasks or computations one after another.
You can build long chains by calling `.then()` more times:
```js
// Build a Promise that performs 4 asynchronous computations in sequence
promiseA
  .then(a => {/* build and return promiseB... */})
  .then(b => {/* build and return promiseC... */})
  .then(c => {/* build and return promiseD... */})
```
- In parallel.
`Promise.all()` combines an array of `Promise`s into a `Promise` that performs all the tasks or computations in parallel and resolves to an array containing all their results:
```js
// Build a promise that performs 3 asynchronous computations in parallel
// and resolves with their results once they have all finished
Promise.all([promiseA, promiseB, promiseC])
```
## `async`-`await` syntax
`Promise`s can greatly simplify asynchronous programs.
Frequently, several asynchronous tasks need to happen in sequence.
In this case, `.then()` provides an interface that looks closer to how the program would be written in a synchronous language.
Recently, new syntax was added to JavaScript that makes asynchronous code look even more like synchronous code.
It relies on two keywords:
- `async`: we can declare an "async function" by putting `async` in front of it:
```js
async function name(...arguments) {
  /* ... */
}
// or
async (...arguments) => {/* ... */}
```
Unlike normal functions, `async` functions return `Promise`s.
If the function `return`s a value, the `Promise` resolves with that value.
If the function `throw`s an error, the `Promise` rejects with that error.
- `await`: inside an `async` function, the keyword `await` can be used to "wait" for a `Promise` to resolve.
If the `Promise` rejects, `await` throws the error, which can be caught by a `try`-`catch` block.
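Here is a minimal sketch of how the two keywords fit together (the `delay()` helper is made up for this example, not part of any library):
```js
// Hypothetical helper: a Promise that resolves after ms milliseconds
function delay(ms) {
  return new Promise(resolve => setTimeout(resolve, ms))
}

async function greet() {
  await delay(100) // wait for the delay Promise to resolve
  return 'hello'   // the returned Promise resolves with 'hello'
}

async function main() {
  try {
    const greeting = await greet()
    console.log(greeting) // prints 'hello'
  }
  catch (err) {
    // If greet() had thrown, await would rethrow the error here
    console.error(err)
  }
}
main()
```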
It is easiest to explain `async`-`await` with some examples.
## Example: recursively exploring a directory
We can rewrite the code from the `Promise`s notes for [computing the modification time of a directory](../promises/promises.md#promise.all).
Recall that we want to find the latest modification time of any file inside the directory (or any subdirectories).
By making `modifiedTime()` an `async` function, we can use `await` instead of explicit `.then()` calls on `Promise`s.
You can compare this [`async`-`await` version](await-mtime.js) with the [`Promise`s version](../promises/recursive-mtime.js).
```js
// Returns a Promise<number> that computes the most recent
// modification time for any file inside a directory
async function modifiedTime(path) {
  // First, we wait for the fs.stat() Promise
  const stats = await fs.stat(path)
  // Use the modification time of the file or directory itself
  let latestTime = stats.mtimeMs
  // If this is a directory, check all files inside for later modified times
  if (stats.isDirectory()) {
    // Wait to get a list of all files/subdirectories in the directory
    const files = await fs.readdir(path)
    // Then get the modified time of each file/subdirectory and
    // update latestTime if the modification time is later
    for (const file of files) {
      const fileModifiedTime = await modifiedTime(path + '/' + file)
      latestTime = Math.max(latestTime, fileModifiedTime)
    }
  }
  // Return the latest modification time
  return latestTime
}
```
However, this code no longer explores subdirectories in parallel, which was the main advantage of asynchronicity.
Can you see why?
The issue is that we have replaced `Promise.all(files.map(file => /* ... */))` with a `for (const file of files)` loop.
Since `await modifiedTime(path + '/' + file)` *waits* for `modifiedTime()` to finish, the loop waits for `file` to be completely explored before continuing.
This means the subdirectories are explored sequentially.
If we want to perform asynchronous tasks in parallel inside an `async` function, **we still need to use `Promise.all()`**.
[Here](await-mtime-parallel.js) is a parallel version using `Promise.all()`:
```js
async function modifiedTime(path) {
  const stats = await fs.stat(path)
  const {mtimeMs} = stats
  // If this path is a file, return its modification time
  if (!stats.isDirectory()) return mtimeMs
  // If this is a directory, check all files inside for later modified times
  const files = await fs.readdir(path)
  // Get the modified time of each file/subdirectory in parallel
  const modifiedTimes = await Promise.all(
    files.map(file => modifiedTime(path + '/' + file))
  )
  // Return the latest modification time
  return Math.max(mtimeMs, ...modifiedTimes)
}
```
## Example: emoji search
The [Unicode Consortium](https://home.unicode.org) defines the mapping from numbers to characters (including emoji) that almost everyone uses.
For example, Unicode defines 65 to mean `A`, 233 to mean `é`, and 128027 to mean `🐛`.
We use this mapping to make an [emoji search](await-emoji.js) that finds a Unicode character by name (e.g. `pile of poo`) and prints out the character.
We first fetch the current Unicode version (updated each year) from Unicode's [`ReadMe.txt`](https://unicode.org/Public/zipped/latest/ReadMe.txt) file.
We then fetch the current [`UnicodeData.txt`](https://www.unicode.org/Public/13.0.0/ucd/UnicodeData.txt) file which lists the tens of thousands of Unicode characters.
Note how `await` allows us to execute 4 `Promise`s sequentially in `getUnicodeNumber()` without needing to combine them with `.then()`.
```js
async function getUnicodeNumber(searchName) {
  // Fetch the readme to determine the latest Unicode version (e.g. 13.0.0)
  const readmeResponse = await fetch(README_URL)
  const readme = await readmeResponse.text() // read readme as a string
  // The last line of the readme contains the current Unicode URL,
  // e.g. https://www.unicode.org/Public/13.0.0/ucd/
  /* ... fetch UnicodeData.txt from that URL, read it as a string,
     and search it for the character named searchName ... */
}
```
(`fetch()` turns an HTTP/HTTPS request into a `Promise`, as explained in the [HTTP notes](../http/http.md#aside-fetch).)
## Example: create files without overwriting existing files
When an `await`ed `Promise` rejects with an error, you can catch it using a `try`-`catch` block, just like a normal JavaScript error.
For [example](await-no-overwrite.js), suppose we want to create some files but not overwrite any existing files with the same name.
We can use the `wx` [flag](https://nodejs.org/api/fs.html#fs_file_system_flags) to make [`fs.writeFile()`](https://nodejs.org/api/fs.html#fs_fspromises_writefile_file_data_options) reject if a file already exists.
We want either all or none of the files to be created, so if one already exists, we stop and remove all the files that were already created.
```js
// Creates an array of files. If any file already exists,
// no files are created and an error is thrown.
// (Each element of files is assumed to be a {name, contents} object.)
async function writeFiles(files) {
  for (let i = 0; i < files.length; i++) {
    const file = files[i]
    try {
      // Write to the file, but throw an error if the file exists
      await fs.writeFile(file.name, file.contents, {flag: 'wx'})
    }
    catch (err) {
      // A file already exists: remove the files created so far and rethrow
      for (let j = 0; j < i; j++) await fs.unlink(files[j].name)
      throw err
    }
  }
}
```
You may notice that the files are not written in parallel.
It is possible to fix this by `Promise.all()`ing the `fs.writeFile()`s, but this is tricky since the `Promise` returned by `Promise.all()` rejects as soon as *any* of the `Promise`s reject.
If the `Promise.all()` rejected, we would need to wait for each file to finish being written before trying to remove it.
There is a lot of provided code this week in `diff.js`.
You should read the notes on [Node.js modules](../../notes/js/js.md#modules) for information on how to import these functions.
I highly encourage using `async`-`await` this week.
Like most real asynchronous projects, this one mostly consists of tasks that need to happen sequentially.
`async`-`await` can greatly simplify this sort of asynchronous code.
In `async` functions, `try`-`catch` statements are used to handle errors in `await`ed `Promise`s, so you may want to read about [error handling in JavaScript](../../notes/js/js.md#error-handling).
# Mini Version Control
## Goals
- Learn the core ideas underlying version control software
- Write an HTTP server that implements a JSON API
- Use `async`-`await` to reduce the amount of boilerplate code needed to write asynchronous programs
## Why version control?
You've been using Git for all the projects in this class, and possibly for other classes and personal projects too.
Git is one of many tools called "version control software", which track the history of changes to a set of files over time.
VCS tools differ in their implementations, but they share several core concepts.
In this project, you will implement a version control application with the core features of Git or CVS.
(Git actually stores its history quite differently than MiniVC since it's optimized for different use cases.
If you take CS 24 this fall, you will learn how Git really works by implementing parts of it!)
The application has two parts which communicate over HTTP:
- A command line program allows the user to manage the repositories that have been downloaded locally and to communicate with the server
- A server stores the definitive copy of all repositories and mediates the changes uploaded by different clients
You are provided an implementation of the client, so you only need to write the server.
## Version control concepts
There are many ways to represent the history of files; for example, we could store a copy of each file whenever it is updated.
However, this would waste a lot of space since files are usually changed only slightly.
Instead, MiniVC tracks the *differences* between each version of a file and the previous version.
This is also convenient because it makes it easy to identify what changes were made to files between two versions.
### Diffs
A "diff" between two versions of a file is the set of changes needed to turn the first version into the second.
Again, there are different granularities of changes that you can track; each has tradeoffs.
Git considers a change to be inserting or deleting a line of text.
This works pretty well, so we'll use the same approach.
For example, consider the following two versions of a file (each line contains a single character for simplicity):
```
old | new
----+----
a | a
b | b
c | c
d | h
e | f
f | g
g |
```
"Diffing" these two versions (e.g. the command `diff -u old new`) produces the following diff:
```diff
--- old
+++ new
@@ -1,7 +1,6 @@
a
b
c
-d
-e
+h
f
g
```
You can see that the lines `d` and `e` were deleted and the line `h` was inserted.
We will represent a diff as an array of "same" and "change" elements.
For example, the diff above would become:
```js
[
  // The first 3 lines stayed the same
  {type: 'same', count: 3},
  // Then 2 lines were deleted and replaced with the line `h`
  {type: 'change', deletions: 2, insertions: ['h']},
  // The last 2 lines stayed the same
  {type: 'same', count: 2}
]
```
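To make this representation concrete, here is a hypothetical helper (not part of the provided `diff.js`) that applies such a diff to an array of lines:
```js
// Applies a diff (an array of "same" and "change" elements)
// to the old lines, returning the new lines
function applyDiff(lines, diff) {
  const newLines = []
  let index = 0 // current position in the old lines
  for (const element of diff) {
    if (element.type === 'same') {
      // Copy over unchanged lines
      newLines.push(...lines.slice(index, index + element.count))
      index += element.count
    }
    else {
      // Skip deleted lines and add inserted ones
      index += element.deletions
      newLines.push(...element.insertions)
    }
  }
  return newLines
}

// Applying the diff above to the old file:
applyDiff(['a', 'b', 'c', 'd', 'e', 'f', 'g'], [
  {type: 'same', count: 3},
  {type: 'change', deletions: 2, insertions: ['h']},
  {type: 'same', count: 2}
])
// → ['a', 'b', 'c', 'h', 'f', 'g']
```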
### Commits
A commit is the smallest unit of changes that MiniVC tracks.
Commits are the "versions" in "version control".
Each commit specifies the previous commit and the diffs for all files that changed since that commit.
Like Git, each commit also has a "commit message" that describes what was changed and a "commit ID" so that the next commit can refer to it.
For example, the following commit adds a comment to `program.c`:
```js
{
  // The previous commit had ID 85c37d4521d4d51b6ee186856b1c14e6
  parentID: '85c37d4521d4d51b6ee186856b1c14e6',
  message: 'Add a comment to program.c',
  // Diffs for all files changed since the previous commit
  diffs: {
    'program.c': [/* a diff inserting the comment line */]
  }
}
```
The first commit is special since it doesn't have a parent.
We will represent this by omitting the `parentID` field for the initial commit (or setting it to `undefined`).
### Merges
Version control histories are mostly linear, i.e. there is one chain of commits from the initial commit to the current commit.
In this case, the commits' `parentID`s form an implicit linked list of commits.
However, a good version control system allows multiple users to work on the codebase at the same time.
It is possible that two users are working off the same commit and both of them try to push new commits before they have seen the other's changes.
In this case, whichever commit is pushed first will be added to the history, but the second commit will need to be "merged" with the first commit.
In general, whenever a new commit is attempted, it must be merged with all of the commits that were missed.
For example, suppose I'm working off commit A but before I push commit B, other users push commits C, D, and E:
```
A - C - D - E
\
B ------^
```
Then the diff in commit B needs to be merged with the diff between A and E.
Once we figure out what the combined diff from A to B/E should be, we "subtract" the part that was already committed between A and E, and then add the rest of the diff as commit B.
Changing B to have E as its parent instead of A is called "rebasing" in Git.
### Repositories
A repository (what GitLab calls a "project") is a set of files under version control.
The history of the repository is represented by storing all its commits by ID, plus the commit ID of the current commit.
This pointer to the current commit is called `HEAD` in Git.
Note that `HEAD` won't point to a commit until the first commit is pushed to the repository.
You can choose how to store this information.
I recommend the way the client stores it: a `commits` directory with a file for each commit (where the filename is the commit ID), and a `head` file that stores the ID of the current commit.
Your server should save all files inside the current working directory (`.`): for example, if the server is started in the directory `server-repos`, all files and directories created by the server should be inside `server-repos`.
## Server API
The command-line client communicates with the server via an HTTP API.
Requests and responses are sent as JSON.
You can parse the request JSON by concatenating all the chunks in the request stream into a string, and then calling `JSON.parse()` on it.
To respond with JSON, you should set the `Content-Type` header to `application/json`.
(JSON is not the most space-efficient way to store commit diffs, but it is very convenient to read and write from JavaScript.)
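The chunk-concatenation approach can be wrapped in a small `Promise`-returning helper; this is a sketch, and the name `parseBody` is made up:
```js
// Resolves with the parsed JSON body of an incoming HTTP request
function parseBody(request) {
  return new Promise((resolve, reject) => {
    const chunks = []
    request.on('data', chunk => chunks.push(chunk))
    request.on('end', () => {
      try {
        resolve(JSON.parse(Buffer.concat(chunks).toString()))
      }
      catch (err) {
        reject(err) // the body was not valid JSON
      }
    })
    request.on('error', reject)
  })
}
```
An endpoint handler can then simply `await parseBody(request)`.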
All responses have the form `{success: true, ...data}` if successful and `{success: false, message: string}` if an error occurs.
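A tiny helper (hypothetical name) can send any response in this format:
```js
// Send a response object as JSON with the appropriate Content-Type
function sendJSON(response, data) {
  response.setHeader('Content-Type', 'application/json')
  response.end(JSON.stringify(data))
}
// e.g. sendJSON(response, {success: false, message: 'No such repository'})
```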
The server should listen on port 8000 and have the following `POST` endpoints:
### `/new_repository`
Makes a new empty repository with the given name.
It is an error if a repository with this name already exists.
The request JSON has the following form:
```ts
interface NewRepositoryRequest {
  name: string
}
```
If successful, the response JSON has the following form:
```ts
interface NewRepositoryResponse {
  success: true
}
```
### `/fetch`
Retrieves all new commits to the given repository.
The client indicates what its last known commit is (if any).
It is an error if the repository doesn't exist or the commit does not exist in that repository.
The request JSON has the following form:
```ts
interface FetchRequest {
  // Repository name
  name: string
  // Last known commit ID.
  // Omitted if the client doesn't have any commits.
  parentID: string | undefined
}
```
If successful, the response JSON has the following form:
```ts
interface FetchResponse {
  success: true
  // Commits after parentID up to HEAD.
  // They must be ordered from oldest to newest.
  commits: Commit[]
}
interface Commit {
  id: string
  message: string
  diffs: FileDiffs
}
interface FileDiffs {
  // Map each filename to a diff for that file
  [file: string]: Diff
}
// A diff is an array of "same" and "change" elements
type Diff = (
  {type: 'same', count: number} |
  {type: 'change', deletions: number, insertions: string[]}
)[]
```
### `/commit`
Pushes a new commit to the given repository.
It is an error if the repository does not exist or the parent commit does not exist in that repository.
You can use any ID you want for the new commit, but it should be unlikely to conflict with an existing ID.
(I used [`require('crypto').randomBytes(16).toString('hex')`](https://nodejs.org/api/crypto.html#crypto_crypto_randombytes_size_callback).)
The commit needs to be merged/rebased off the current `HEAD` if `HEAD` is different from `parentID`.
If there is a merge conflict, `mergeFileDiffs()` will throw an error, which you should report to the client.
The request JSON has the following form:
```ts
interface CommitRequest {
  // Repository name
  name: string
  // Parent commit.
  // Omitted if this is the initial commit.
  parentID: string | undefined
  // Commit message
  message: string
  // Diffs for each changed file
  diffs: FileDiffs
}
```
The response JSON has the same format as for `/fetch`.
The commits sent back should include the newly added commit.
## Diff functions
`diff.js` exports some utility functions for working with diffs.
The ones you will likely find useful when implementing the server are:
- `addFileDiffs()`: this function takes an array of diffs to apply in sequence and concatenates them into a single diff.
The result of each diff should be the source of the next.
For example, if we have a diff from `A` to `B`, `B` to `C`, and `C` to `D`, adding them gives the diff from `A` to `D`.
- `mergeFileDiffs()`: this function takes two diffs to apply in parallel.
Both diffs should have the same source.
For example, if we have a diff from `A` to `B` and from `A` to `C`, merging them gives a diff from `A` that combines both sets of changes.
- `subtractFileDiffs()`: this function takes two diffs and returns the diff that must be added to the second to get the first.
Both diffs should have the same source.
For example, if we have a diff from `A` to `C` and from `A` to `B`, subtracting them gives the diff from `B` to `C`.
Note that these functions expect diffs in the form of `FileDiffs` objects, which map filenames to diffs for individual files.
You don't need to understand the code behind these functions, but I think the diffing algorithm is really cool, so please ask me if you want to know more!
## Going further
If you are interested and want to add more to the VC project, try adding support for "branches".
Branches allow for there to be multiple "heads" with different names.
Ideally, it should be possible to:
- Pull the history of all branches from the server
- "Check out" a branch locally, replacing all files with their current versions on that branch
- Push a commit to the branch that is currently checked out
- Merge one branch into another (if there are no conflicts)
- Delete a branch (in the local repository and on the server)
Another issue we haven't addressed is handling concurrent accesses to repositories.
For example, the asynchronous server allows two commit requests to be processed at the same time, which could corrupt a repository.
A more robust server would ensure that any commit to a repository causes all other commits and fetches to that repository to wait until the commit finishes.
Implement this behavior by writing an asynchronous analog of mutexes or read-write locks using callback functions.
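One possible shape for such a lock, sketched here with `Promise`s (the class and method names are made up):
```js
// A minimal asynchronous mutex: lock() returns a Promise that
// resolves once the caller holds the lock
class AsyncMutex {
  constructor() {
    this.locked = false
    this.waiting = [] // resolve callbacks of waiting lock() calls
  }
  lock() {
    if (!this.locked) {
      this.locked = true
      return Promise.resolve()
    }
    return new Promise(resolve => this.waiting.push(resolve))
  }
  unlock() {
    const next = this.waiting.shift()
    if (next) next() // hand the lock directly to the next waiter
    else this.locked = false
  }
}
```
A commit handler could then `await mutex.lock()` (using one mutex per repository) before touching the repository's files, and call `mutex.unlock()` when it finishes, even if it throws.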