Commit 0f1efd59 authored by Caleb C. Sander

Add wiki-game spec

parent 25831608
Showing with 103 additions and 3 deletions
@@ -33,7 +33,7 @@ You can contact me via:
 | 1-2 | [Callbacks](notes/callbacks/callbacks.md) | [Minesweeper](specs/minesweeper/minesweeper.md) | 2020-04-17 |
 | 3 | [Promises](notes/promises/promises.md) | [`make`](specs/make/make.md) | 2020-04-24 |
 | 4-5 | [Streams](notes/streams/streams.md) | [`grep`](specs/grep/grep.md) | 2020-05-08 |
-| 6 | [HTTP](notes/http/http.md) | Wiki Game | 2020-05-15 |
+| 6 | [HTTP](notes/http/http.md) | [Wiki Game](specs/wiki-game/wiki-game.md) | 2020-05-15 |
 | 7 | WebSockets | Chat server **OR** Chat client | 2020-05-22 |
 | 8-9 | `async`-`await` | MiniVC | 2020-06-05 |
-# `grep`
 # Recommended JavaScript reading
 This project makes heavy use of Node.js's standard library modules.
@@ -9,6 +7,8 @@ There are also lots of examples of how to use the standard library functions, wh
 You will also be implementing several `Transform` streams as part of this project.
 `Transform` streams are represented as subclasses of the `streams.Transform` class, so you may find it useful to read about [classes](../../notes/js/js.md#classes) in JS.
+# `grep`
 ## Goals
 - See how Node.js's stream abstraction makes it easy to consume different sources of data
# Recommended JavaScript reading
You will probably find the standard data structures in JavaScript useful for storing the state of your program this week.
In particular, [`Map`](../../notes/js/js.md#map)s and [`Set`](../../notes/js/js.md#set)s provide constant-time lookup by key, and [arrays](../../notes/js/js.md#types) are good as general resizable collections.
I recommend looking over the list of [array methods](https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/Array#Instance_methods) to see what they can do.
The time spent waiting for HTTPS responses will probably dwarf the time spent operating on these data structures.
However, keep asymptotic complexity in mind when choosing which to use since your program may store hundreds of thousands of values.
[`for`-`of` loops](../../notes/js/js.md#for-of-loops) are convenient for iterating over arrays (as well as `Map`s and `Set`s).
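For instance, here is a minimal sketch of how these structures might be used to keep track of articles (the variable names are purely illustrative, not part of the spec):
```js
const visited = new Set()         // articles already seen; Set gives constant-time membership checks
const previousArticle = new Map() // article -> the article that linked to it
const waitingArticles = []        // a resizable collection of articles still to be processed

visited.add('Wuhan')
previousArticle.set('Islam', 'Wuhan')
waitingArticles.push('Islam')

// for-of loops work on arrays, Maps, and Sets alike
for (const article of waitingArticles) {
  console.log(article, 'was linked from', previousArticle.get(article))
}
```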
# Wiki Game
## The game
The "[Wiki Game](https://en.wikipedia.org/wiki/Wikipedia:Wiki_Game)" is a challenge to find the shortest sequence of links that can be followed to get from one given Wikipedia article to another.
For example, you can get from [Wuhan](https://en.wikipedia.org/wiki/Wuhan) to [Saturn](https://en.wikipedia.org/wiki/Saturn) by following 3 links:
1. Click on "[Islam](https://en.wikipedia.org/wiki/Islam)"
1. Click on "[Divine Comedy](https://en.wikipedia.org/wiki/Divine_Comedy)"
1. Click on "[Saturn](https://en.wikipedia.org/wiki/Saturn)"
## Your task
You will write a program to find these shortest paths between two articles.
Your program will request Wikipedia articles as HTML files over HTTPS, just like a web browser would.
But instead of rendering an article, your program will search it for links to other Wikipedia articles and visit them in turn.
This is a simple example of a "[web crawler](https://en.wikipedia.org/wiki/Web_crawler)", which is how search engines explore the Internet to find webpages.
## Looking for links on a Wikipedia article
If you view the source of a Wikipedia article (e.g. `view-source:https://en.wikipedia.org/wiki/A` in Google Chrome), you will see that the links look like
```html
<a href="/wiki/Unicode" title="Unicode">Unicode</a>
```
(`href` is the URL that the link goes to, and the part between `<a>` and `</a>` is the blue underlined text.)
Therefore, your program can look for all occurrences of `<a href="/wiki/` on each article it visits.
The text between `/wiki/` and the next `"` character is the name of the article being linked to.
The full URL of an article is `https://en.wikipedia.org/wiki/ARTICLE`; for example, the URL for the Unicode article is https://en.wikipedia.org/wiki/Unicode.
You can use Node.js's [`https.get()`](https://nodejs.org/api/https.html#https_https_get_options_callback) function to make the requests.
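As a rough sketch of how both steps might fit together (fetching an article over HTTPS and scanning its HTML for that pattern), assuming a hypothetical helper called `getLinkedArticles`:
```js
const https = require('https')

// Hypothetical helper: fetch an article's HTML and pass the names of the
// articles it links to (possibly including duplicates) to the callback
function getLinkedArticles(article, callback) {
  https.get(`https://en.wikipedia.org/wiki/${article}`, response => {
    response.setEncoding('utf8')
    let html = ''
    response.on('data', chunk => html += chunk)
    response.on('end', () => {
      const LINK_START = '<a href="/wiki/'
      const links = []
      let index = html.indexOf(LINK_START)
      while (index >= 0) {
        const nameStart = index + LINK_START.length
        const nameEnd = html.indexOf('"', nameStart) // article name ends at the next " character
        links.push(html.slice(nameStart, nameEnd))
        index = html.indexOf(LINK_START, nameEnd)
      }
      callback(links)
    })
  })
}

getLinkedArticles('Unicode', links => console.log(links.length, 'links found'))
```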
## Implementing the Wikipedia search
The general idea is to start at the source article, follow each link to another Wikipedia article, and stop upon reaching the target article.
More specifically, you will implement an asynchronous "[breadth-first search](https://en.wikipedia.org/wiki/Breadth-first_search)" from the source article:
- Your program should visit articles in order of the number of links needed to reach them from the source (their "distances" from the source).
First, visit the source (which is 0 links away from the source), then visit all the articles 1 link away from the source, then 2 away, etc.
This way, whenever an article is visited, you know there is no shorter sequence of links to it.
- If an article has already been visited (or is going to be visited), it should not be visited again.
Assuming the content of each article doesn't change, loading an article again is unnecessary.
This requires your program to remember the set of articles that have already been visited.
- Your program should store, for each visited article, the article that linked to it.
Then, when the target article is found, you can follow this chain of links backwards to reconstruct the sequence of links taken from the source article.
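For the last point, one possible way to reconstruct the path once the target is found, assuming a `previousArticle` Map (an illustrative name) that maps each visited article to the article that linked to it:
```js
// Walk backwards from the target to the source, then reverse to get the forward path
function reconstructPath(previousArticle, source, target) {
  const path = [target]
  let article = target
  while (article !== source) {
    article = previousArticle.get(article)
    path.push(article)
  }
  return path.reverse()
}
```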
## Implementation advice
There are a few tricky pieces to the BFS.
I had several bugs in my initial implementation, so I recommend reading this section carefully!
### Loading articles in parallel
An HTTP(S) request can take hundreds of milliseconds to complete.
The speed is primarily limited by the time it takes to send information from the client to the server and back.
However, a modern internet connection has enough bandwidth to transmit many requests at the same time.
Therefore, you can explore Wikipedia significantly faster by fetching articles simultaneously.
Your program **must** make multiple Wikipedia requests at the same time.
However, if you make too many requests at once, Wikipedia will think you're trying to [flood it with traffic](https://en.wikipedia.org/wiki/Denial-of-service_attack) and will stop sending responses.
(Your HTTPS requests will likely return the [status code](../../notes/http/http.md#aside-status-codes) `429 Too Many Requests` in this case.)
Each Wikipedia article may link to hundreds of other articles, so even the number of articles 3 links away can be quite large.
From experimenting, it looks like Wikipedia's limit is based on the number of requests made per unit time.
I recommend requesting about 10 articles in parallel if your internet connection is very fast, and up to a few hundred if it's pretty slow.
Please define this number at the top of `find-path.js` so I can tune it when testing your code.
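One way this might look as a sketch: a constant defined at the top of `find-path.js` (the name `MAX_PARALLEL_REQUESTS` is just illustrative) and a hypothetical helper that works through a list of articles in chunks of that size, reusing the `getLinkedArticles` helper sketched earlier:
```js
// Number of articles to request at once; defined at the top of find-path.js so it can be tuned
const MAX_PARALLEL_REQUESTS = 10

// Hypothetical helper: request `articles` in chunks of MAX_PARALLEL_REQUESTS,
// calling processLinks(article, links) for each response and done() once all are handled
function requestInChunks(articles, processLinks, done) {
  let nextIndex = 0
  function requestChunk() {
    if (nextIndex >= articles.length) return done()

    const chunk = articles.slice(nextIndex, nextIndex + MAX_PARALLEL_REQUESTS)
    nextIndex += chunk.length
    let remaining = chunk.length
    for (const article of chunk) {
      getLinkedArticles(article, links => { // the hypothetical helper sketched earlier
        processLinks(article, links)
        if (--remaining === 0) requestChunk() // this chunk is done; start the next one
      })
    }
  }
  requestChunk()
}
```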
### Implementing the BFS
Since articles are requested in parallel, the responses may arrive in a different order from the order of the requests.
This complicates the BFS a bit.
I recommend storing two arrays: the "waiting articles" at the current distance from the source (e.g. 2 links away) that still need to be visited, and the "articles to visit next" at the next distance (e.g. 3 links away).
Once all articles at the current distance have been requested and processed, your program can start requesting articles at the next distance, so the "articles to visit next" become the "waiting articles".
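Putting these pieces together, a minimal callback-based sketch of this level-by-level search (building on the hypothetical `requestInChunks` and `reconstructPath` helpers sketched above):
```js
// Hypothetical BFS driver: visits articles level by level until the target is reached
function findPath(source, target) {
  if (source === target) {
    console.log(source)
    return
  }

  const visited = new Set([source])
  const previousArticle = new Map()

  function visitLevel(waitingArticles) {
    if (waitingArticles.length === 0) return // nothing left to explore; target wasn't reached

    const articlesToVisitNext = []
    requestInChunks(waitingArticles, (article, links) => {
      for (const link of links) {
        if (visited.has(link)) continue // already visited or queued to be visited

        visited.add(link)
        previousArticle.set(link, article)
        if (link === target) {
          // Print the shortest path, one article per line, and stop immediately
          console.log(reconstructPath(previousArticle, source, target).join('\n'))
          process.exit()
        }
        articlesToVisitNext.push(link)
      }
    }, () => visitLevel(articlesToVisitNext)) // current distance finished; move one link further out
  }

  visitLevel([source])
}
```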
### Stopping
Make sure that the program ends once the target article is found and prints out the shortest path to it.
(There may be multiple shortest paths; just print one of them.)
A Node.js program will not exit automatically while there are HTTPS requests that haven't returned, so your program can either stop making HTTPS requests or just call [`process.exit()`](https://nodejs.org/api/process.html#process_process_exit_code) when the target is found.
## Running the program
Your program should take 2 arguments, the source and target articles.
It can be run like `node find-path.js SOURCE TARGET`.
It should print each article name along a shortest path from `SOURCE` to `TARGET` (including `SOURCE` and `TARGET`), one article per line.
For example, running `node find-path.js Robert_Andrews_Millikan Michelangelo` would print:
```
Robert_Andrews_Millikan
Forest_Lawn_Memorial_Park_(Glendale)
Michelangelo
```
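A sketch of how the entry point might read those arguments (using the hypothetical `findPath` driver sketched above):
```js
// Invoked as: node find-path.js SOURCE TARGET
// process.argv is [path to node, path to find-path.js, SOURCE, TARGET]
const [source, target] = process.argv.slice(2)
findPath(source, target)
```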