
Recommended JavaScript reading

You will probably find the standard data structures in JavaScript useful for storing the state of your program this week. In particular, Maps and Sets provide constant-time lookup by key, and arrays are good as general resizable collections. I recommend looking over the list of array methods to see what they can do.

The time spent waiting for HTTPS responses will probably dwarf the time spent operating on these data structures. However, keep asymptotic complexity in mind when choosing which to use since your program may store hundreds of thousands of values.

for-of loops are convenient for iterating over arrays (as well as Maps and Sets).

Wiki Game

The game

The "Wiki Game" is a challenge to find the shortest sequence of links that can be followed to get from one given Wikipedia article to another. For example, you can get from Wuhan to Saturn by following 3 links:

  1. Click on "Islam"
  2. Click on "Divine Comedy"
  3. Click on "Saturn"

Your task

You will write a program to find these shortest paths between two articles. Your program will request Wikipedia articles as HTML files over HTTPS, just like a web browser would. But instead of rendering an article, your program will search it for links to other Wikipedia articles and visit them in turn. This is a simple example of a "web crawler", which is how search engines explore the Internet to find webpages.

Looking for links on a Wikipedia article

If you view the source of a Wikipedia article (e.g. view-source:https://en.wikipedia.org/wiki/A in Google Chrome), you will see that the links look like

<a href="/wiki/Unicode" title="Unicode">Unicode</a>

(href is the URL that the link goes to, and the part between <a> and </a> is the blue underlined text.) Therefore, your program can look for all occurrences of <a href="/wiki/ on each article it visits. The string that comes before the next " character is the article being linked to. The full URL of an article is https://en.wikipedia.org/wiki/ARTICLE; for example, the URL for the Unicode article is https://en.wikipedia.org/wiki/Unicode. You can use Node.js's https.get() function to make the requests.

There are a few types of links that need special handling:

  • Links to special Wikipedia pages, e.g. href="/wiki/Template:Latin_alphabet_sidebar". You can identify these links by the :. They aren't real pages, so you should skip them.
  • Links to sections on the same page, e.g. href="#History". You can identify these links by the # at the start of the link. Again, since the link points to the same page, you don't need to follow it.
  • Links to sections on other pages, e.g. href="/wiki/Article_(grammar)#Indefinite_article". You can identify these links by the # in the middle of the link. To follow this link, remove the part starting with #. In this example, that would be /wiki/Article_(grammar).
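The three cases above can be handled in one small helper. This is a sketch, applied to raw href values; the function name normalizeHref is illustrative. It returns the article name to visit, or null if the link should be skipped:

```javascript
// Sketch: decide whether a link should be followed, handling the three
// special cases. Returns the article name to visit, or null to skip.
function normalizeHref(href) {
  if (href.startsWith('#')) return null;       // section on the same page
  if (!href.startsWith('/wiki/')) return null; // not an article link at all
  let article = href.slice('/wiki/'.length);
  if (article.includes(':')) return null;      // special page, e.g. Template:...
  const hash = article.indexOf('#');
  if (hash !== -1) article = article.slice(0, hash); // drop the #Section part
  return article;
}
```

For example, normalizeHref('/wiki/Article_(grammar)#Indefinite_article') returns 'Article_(grammar)', while normalizeHref('/wiki/Template:Latin_alphabet_sidebar') returns null.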

Implementing the Wikipedia search

The general idea is to start at the source article, follow each link to another Wikipedia article, and stop upon reaching the target article. More specifically, you will implement an asynchronous "breadth-first search" from the source article:

  • Your program should visit articles in order of the number of links needed to reach them from the source (their "distances" from the source). First, visit the source (which is 0 links away from the source), then visit all the articles 1 link away from the source, then 2 away, etc. This way, whenever an article is visited, you know there is no shorter sequence of links to it.
  • If an article has already been visited (or is going to be visited), it should not be visited again. Assuming the content of each article doesn't change, loading an article again is unnecessary. This requires your program to remember the set of articles that have already been visited.
  • Your program should store, for each visited article, the article that linked to it. Then, when the target article is found, you can follow this chain of links backwards to reconstruct the sequence of links taken from the source article.
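The bookkeeping described above can be sketched with a synchronous getLinks(article) function standing in for the real HTTPS fetch (your actual program will be asynchronous). The `previous` Map records, for each discovered article, the article that linked to it, which is what lets you rebuild the path at the end:

```javascript
// Sketch of the BFS bookkeeping. getLinks(article) is a synchronous
// stand-in for fetching an article and extracting its links.
function findPath(source, target, getLinks) {
  // Maps each discovered article to the article that linked to it
  const previous = new Map([[source, null]]);
  let frontier = [source];
  while (frontier.length > 0) {
    const next = [];
    for (const article of frontier) {
      if (article === target) {
        // Follow the chain of predecessors back to the source
        const path = [];
        for (let a = target; a !== null; a = previous.get(a)) path.push(a);
        return path.reverse();
      }
      for (const link of getLinks(article)) {
        if (!previous.has(link)) { // skip already-discovered articles
          previous.set(link, article);
          next.push(link);
        }
      }
    }
    frontier = next; // move on to the next distance from the source
  }
  return null; // target unreachable
}
```

Because `previous` also serves as the visited set, each article is fetched at most once.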

Implementation advice

There are a few tricky pieces to the BFS. I had several bugs in my initial implementation, so I recommend reading this section carefully!

Loading articles in parallel

An HTTP(S) request can take hundreds of milliseconds to complete. The speed is primarily limited by the time it takes to send information from the client to the server and back. However, a modern internet connection has enough bandwidth to transmit many requests at the same time. Therefore, you can explore Wikipedia significantly faster by fetching articles simultaneously.

Your program must make multiple Wikipedia requests at the same time.

However, if you make too many requests at once, Wikipedia will think you're trying to flood it with traffic and will stop sending responses. (Your HTTPS requests will likely return the status code 429 Too Many Requests in this case.) Each Wikipedia article may link to hundreds of other articles, so even the number of articles 3 links away can be quite large.

From experimenting, it looks like Wikipedia's limit is based on the number of requests made per unit time. I recommend requesting about 10 articles in parallel if your internet connection is very fast, and up to a few hundred if it's pretty slow. Please define this number at the top of find-path.js so I can tune it when testing your code.
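One way to cap the number of in-flight requests is a small worker pool over Promises. This is just a sketch of the idea; MAX_IN_FLIGHT is the tuning knob mentioned above, and mapLimit is an illustrative name:

```javascript
// The concurrency limit, defined at the top of the file so it can be tuned
const MAX_IN_FLIGHT = 10;

// Sketch: apply an async worker to every item, with at most `limit`
// workers running at once.
async function mapLimit(items, limit, worker) {
  const results = new Array(items.length);
  let nextIndex = 0;
  async function run() {
    // Each worker pulls the next unclaimed item until none remain
    while (nextIndex < items.length) {
      const i = nextIndex++;
      results[i] = await worker(items[i]);
    }
  }
  await Promise.all(Array.from({ length: Math.min(limit, items.length) }, run));
  return results;
}
```

Your program could then call something like mapLimit(waitingArticles, MAX_IN_FLIGHT, fetchArticle) instead of firing off every request at once.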

Implementing the BFS

Since articles are requested in parallel, the responses may arrive in a different order from the order of the requests. This complicates the BFS a bit. I recommend storing an array of "waiting articles" at the current distance from the source (e.g. 2 links away) that still need to be visited and a separate array of "articles to visit next" at the next distance (e.g. 3 links from the source). Once all articles at the current distance have been requested and processed, your program can start requesting articles at the next distance, so the "articles to visit next" become the "waiting articles".
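The swap between the two arrays can be sketched as follows. For brevity this version uses Promise.all (a real implementation should cap in-flight requests as described above), assumes fetchLinks(article) returns a Promise of that article's outgoing links, and only reports the distance to the target rather than the full path:

```javascript
// Sketch of the per-distance loop: request everything at the current
// distance, collect new articles, then swap the arrays.
async function distanceTo(source, target, fetchLinks) {
  const seen = new Set([source]);
  let waiting = [source]; // articles at the current distance
  for (let distance = 0; waiting.length > 0; distance++) {
    if (waiting.includes(target)) return distance;
    // Request all articles at the current distance in parallel
    const linkLists = await Promise.all(waiting.map((a) => fetchLinks(a)));
    const toVisitNext = []; // articles at the next distance
    for (const links of linkLists) {
      for (const link of links) {
        if (!seen.has(link)) {
          seen.add(link);
          toVisitNext.push(link);
        }
      }
    }
    waiting = toVisitNext; // "articles to visit next" become "waiting articles"
  }
  return -1; // target unreachable
}
```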

Stopping

Make sure that the program ends once the target article is found and prints out the shortest path to it. (There may be multiple shortest paths; just print one of them.) A Node.js program will not exit automatically while there are HTTPS requests that haven't returned, so your program can either stop making HTTPS requests or just call process.exit() when the target is found.

Running the program

Your program should take 2 arguments, the source and target articles. It can be run like node find-path.js SOURCE TARGET. It should print each article name along a shortest path from SOURCE to TARGET (including SOURCE and TARGET), one article per line.
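Reading the two arguments might look like the sketch below (parseArgs is an illustrative helper, not required). Note that process.argv[0] is the node binary and process.argv[1] is the script path, so the arguments start at index 2:

```javascript
// Sketch: extract SOURCE and TARGET from the command line.
// Returns null if either argument is missing.
function parseArgs(argv) {
  const [source, target] = argv.slice(2); // skip node binary and script path
  if (source === undefined || target === undefined) return null;
  return { source, target };
}

// In find-path.js, this might be used as:
// const args = parseArgs(process.argv);
// if (args === null) {
//   console.error('Usage: node find-path.js SOURCE TARGET');
//   process.exit(1);
// }
```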

For example, running node find-path.js Robert_Andrews_Millikan Michelangelo would print:

Robert_Andrews_Millikan
Forest_Lawn_Memorial_Park_(Glendale)
Michelangelo