grep.md 7.25 KB

Recommended JavaScript reading

This project makes heavy use of Node.js's standard library modules. They are well documented at nodejs.org, with pages for each module and sections on each function provided by the module. There are also lots of examples of how to use the standard library functions, which will probably come in handy.

You will also be implementing several Transform streams as part of this project. Transform streams are represented as subclasses of the streams.Transform class, so you may find it useful to read about classes in JS.

grep

Goals

  • See how Node.js's stream abstraction simplifies working with different sources of data
  • Implement some Transform streams

The Unix grep utility

grep is a powerful command-line program for searching through large chunks of text. In its simplest form, grep takes a string to search for and a filename and prints out all the lines in the file that contain the string. For example, suppose independence.txt contains the following lines:

When in the Course of human events,
it becomes necessary for one people
to dissolve the political bands
which have connected them with another,
and to assume among the powers of the earth,
the separate and equal station to which
the Laws of Nature and of Nature's God entitle them,
a decent respect to the opinions of mankind requires
that they should declare the causes
which impel them to the separation.

Then grep ss independence.txt prints all the lines containing ss:

it becomes necessary for one people
to dissolve the political bands
and to assume among the powers of the earth,

It is also possible to search for more complex regular expressions in the text, e.g. grep 'the[a-z]' independence.txt prints:

which have connected them with another,
the Laws of Nature and of Nature's God entitle them,
that they should declare the causes
which impel them to the separation.

If no files are specified, then grep will instead search for the string in its standard input. This is often used to filter the outputs of other commands. For example, some-command | grep ERROR will print out only the lines of output from some-command that contain ERROR.

Your task

Implement grep (including 4 of its command-line flags) in Node.js.

Your grep will be invoked using npm run grep -- [-i] [-r] [-v] [-z] pattern [files ...]. (The -- just tells npm to pass the remaining arguments to grep.js.) The flags -i, -r, -v, and -z are described later.

A simple implementation of grep would read the entire input, loop over its lines, and print the lines that match the pattern. However, this has two big disadvantages on large inputs:

  • Storing the contents of an entire file can take up a lot of memory
  • It may take a while to read the whole file, and we would like to print the matching lines as soon as they are found

Since each line is searched independently, only one line needs to be stored at a time. Therefore, your implementation must search each line as soon as it is read from the input and discard it afterwards.

Command-line arguments

You have been provided with some code that parses the command-line arguments (which Node.js stores in process.argv). The values should be interpreted as follows:

  • pattern stores the pattern to use to search the lines of input. It should be treated as a regular expression, which you can construct using the builtin function RegExp. The method regexp.test(string) can be used to check whether a string matches a regular expression.
  • files stores the filenames (or directory names, if -r is used) to search. If no filenames are given and -r is not used, the standard input (process.stdin) should be searched instead.
  • recursive stores whether the flag -r was used. If so, files are expected to be directories and are searched recursively. For example, npm run grep -- pattern -r dir1 dir2 searches all files in dir1 and dir2 (and all their subdirectories) for pattern. If no directories are provided, the current directory (.) should be searched instead. You can use the provided function readdirRecursive() to search each directory.
  • gunzip stores whether the flag -z was used. If so, all files are treated as gzipped text files and are unzipped before being searched. For example, if independence.txt is compressed to independence.txt.gz, then npm run grep -- -z ss independence.txt.gz prints the same lines as npm run grep -- ss independence.txt. Note that -z and -r can be used together.
  • ignoreCase stores whether the flag -i was used. If so, the pattern is case-insensitive, e.g. npm run grep -- -i error would match a line containing error, ERROR, or ErRoR. You can use RegExp(pattern, 'i') to make a case-insensitive regular expression.
  • invert stores whether the flag -v was used. If so, the pattern specifies lines to exclude; only the lines that don't match the pattern are printed.

Finally, if multiple files are being searched or -r is used, grep helpfully shows which file a line is from. For example, if independence2.txt is identical to independence.txt, running grep ss independence.txt independence2.txt prints:

independence.txt:it becomes necessary for one people
independence.txt:to dissolve the political bands
independence.txt:and to assume among the powers of the earth,
independence2.txt:it becomes necessary for one people
independence2.txt:to dissolve the political bands
independence2.txt:and to assume among the powers of the earth,

Your implementation should add the filenames to the lines in this case as well.

Streams

Your grep implementation must use streams to process the input files. The pipeline I recommend is:

  1. Get a Readable stream that produces the text to be searched
  2. Pipe it into a Transform stream to split chunks of text into lines
  3. Use a Transform stream to filter the lines using the search pattern
  4. If necessary, use a Transform stream to add filenames to the matched lines
  5. Pipe the matched lines to process.stdout (a Writable stream) in order to print them

There are three types of Readable stream you will likely use:

Node.js streams can emit different sorts of data. By default, most streams emit chunks of bytes. This makes sense for a compressed file stream, but all other streams in the pipeline should emit chunks of text instead. You can call stream.setEncoding('utf8') on the input stream as well as each Transform stream to make them output text. Your Transform streams should also call super({decodeStrings: false}) in their constructors so they receive each chunk of data as a string.