Update README.md, main.py, nlp.py, model.py files

Commit 19b5656b authored 1 year ago by Adam Blank

Showing 146 additions and 0 deletions
--- a/README.md
+++ b/README.md
+# Instructions
+
+## Task 1: Least-Squares Regression
+In this assignment, we'll fit a simple linear model to data we've already observed (*training data*). Each datapoint is an `(x, y)` pair, where `x` represents an essay and `y` is the score a human grader assigned to it. Our goal is to predict the scores `y'` for new datapoints `x'` (*testing data*). A simple linear model predicts using the form `y = mx + b`, where `m` and `b` are parameters that are chosen to fit the training data well. The method we will use to choose these parameters is called *least-squares regression*.
+
+Begin by opening `stats.py`. Using the following equations, calculate `m` and `b` from the training data and return them from `make_model` as the tuple `(b, m)`, intercept first. You will likely want to define your own helper functions.
+
+Relevant equations:
+
+$$\begin{aligned}
+\text{data} &= [(x_1, y_1), (x_2, y_2), \dots, (x_N, y_N)]\\
+\overline{X} &= \frac{1}{N}\sum_{i=1}^N{x_i}\\
+\overline{Y} &= \frac{1}{N}\sum_{i=1}^N{y_i}\\
+\text{cov}(X, Y) &= \frac{1}{N-1}\sum_{i=1}^N{\left[(x_i - \overline{X})(y_i - \overline{Y})\right]}\\
+\text{var}(X) &= \frac{1}{N-1}\sum_{i=1}^N{(x_i - \overline{X})^2}\\
+m &= \frac{\text{cov}(X, Y)}{\text{var}(X)}\\
+b &= \overline{Y} - m\overline{X}
+\end{aligned}$$
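
The equations above can be sketched in plain Python as follows. This is a minimal illustration, not the reference solution: `make_model` and its `(b, m)` return order come from the instructions, while the helper names `mean`, `covariance`, and `variance` are assumptions chosen for this example.

```python
def mean(values):
    # Arithmetic mean: (1/N) * sum of the values.
    return sum(values) / len(values)


def covariance(xs, ys):
    # Sample covariance, using the N - 1 denominator from the equations.
    x_bar, y_bar = mean(xs), mean(ys)
    total = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    return total / (len(xs) - 1)


def variance(xs):
    # Sample variance is the covariance of a variable with itself.
    return covariance(xs, xs)


def make_model(data):
    # data is a list of (x, y) pairs; returns (b, m), intercept first.
    xs = [x for x, _ in data]
    ys = [y for _, y in data]
    m = covariance(xs, ys) / variance(xs)
    b = mean(ys) - m * mean(xs)
    return (b, m)


# Four points that lie exactly on y = 2x + 1, so the fit should
# recover m close to 2 and b close to 1 (up to floating-point error).
b, m = make_model([(0, 1), (1, 3), (2, 5), (3, 7)])
```

Note that the sketch needs at least two datapoints with distinct `x` values; otherwise the variance is zero and the slope is undefined.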
+
+
+## Task 2: Extracting Our Feature From an `essay`
+You likely noticed that the `x`s you were working with in Task 1 were not essays in the literal sense. In particular, it is not clear how you would take the mean of an essay. Instead, we want `x` to be represented by some numerical value, called a *feature*, that we think will correlate with the score. Here, we've chosen to use the feature **# of unique, non-common English words** to represent the entire essay, though you're welcome to create and try your own feature(s) afterward!
+
+We have provided you with a `tag_part_of_speech` function which takes in a list of words and outputs a list of pairs (tuples) of `(word, part_of_speech)`.
+
+Write the `extract_feature(essay)` function in the `model.py` file:
+  1. Split the essay into a list of words and tag their parts of speech using `tag_part_of_speech`.
+  2. For each word in the essay:
+      - Strip all punctuation from the word.
+      - Lemmatize the word using the `lemmatize` function (which takes a word as the first argument and the part of speech as the second).
+      - If the lemmatized word is not in `COMMON_WORDS` and is in `ENGLISH_WORDS`, count it if it hasn't already been counted.
+  3. Return the number of unique, lemmatized, non-common English words.
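
The steps above might look roughly like the sketch below. To keep it runnable without NLTK data, `COMMON_WORDS`, `ENGLISH_WORDS`, `tag_part_of_speech`, and `lemmatize` are replaced with tiny hand-written stand-ins; these are assumptions for illustration only and do not behave like the real versions in `nlp.py`. Splitting the essay on whitespace is also an assumption about how words are obtained.

```python
import string

# Toy stand-ins for the names provided by nlp.py (illustration only).
COMMON_WORDS = {"the", "a", "and", "on"}
ENGLISH_WORDS = {"the", "a", "and", "on", "cat", "sit", "mat", "dog"}


def tag_part_of_speech(words):
    # Stand-in tagger: labels every word as a noun.
    return [(word, "NN") for word in words]


def lemmatize(word, pos):
    # Stand-in lemmatizer: lowercases and maps a few forms by hand.
    lemmas = {"cats": "cat", "sat": "sit", "dogs": "dog"}
    word = word.lower()
    return lemmas.get(word, word)


# Translation table that deletes every ASCII punctuation character.
REMOVE_PUNCTUATION = str.maketrans("", "", string.punctuation)


def extract_feature(essay):
    seen = set()  # lemmas that have already been counted
    for word, pos in tag_part_of_speech(essay.split()):
        word = word.translate(REMOVE_PUNCTUATION)
        if not word:
            continue  # the token was punctuation only
        lemma = lemmatize(word, pos)
        if lemma not in COMMON_WORDS and lemma in ENGLISH_WORDS:
            seen.add(lemma)
    return len(seen)


# "cat"/"cats" and both occurrences of "sat" collapse to one lemma each,
# leaving the unique lemmas cat, sit, and mat.
count = extract_feature("The cat sat on the mat, and the cats sat.")
```

Collecting lemmas in a set handles the "count it if it hasn't already been counted" rule without a separate membership check.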
+
+## Task 3: Predict Scores Using Our Model
+Write the `predict(essay)` function in the `model.py` file:
+  1. Extract the feature from the essay.
+  2. Apply the model to the feature (i.e., plug the value into the linear equation, `y = mx + b`).
+  3. Clip predictions to the range of possible scores, `[2, 12]`, inclusive.
+      - If the result is > 12, set it to 12.
+      - If the result is < 2, set it to 2.
+  4. Return the score.
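
Steps 2 and 3 can be sketched as below. The model values are made up for illustration, and `clip` and `predict_from_feature` are hypothetical helper names; the real `predict(essay)` would first call `extract_feature` and read the `(b, m)` pair stored by `set_model`.

```python
# Hypothetical (b, m) pair, standing in for the value stored by set_model.
MODEL = (1.0, 0.5)


def clip(value, low=2, high=12):
    # Restrict value to the inclusive range [low, high].
    return max(low, min(high, value))


def predict_from_feature(feature, model=MODEL):
    # Plug the feature into y = mx + b, then clip to the valid score range.
    b, m = model
    return clip(m * feature + b)


in_range = predict_from_feature(10)    # 0.5 * 10 + 1 = 6.0, unchanged
too_low = predict_from_feature(0)      # 1.0 is below 2, raised to 2
too_high = predict_from_feature(100)   # 51.0 is above 12, lowered to 12
```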
\ No newline at end of file
--- a/main.py
+++ b/main.py
+import csv
+from stats import make_model
+from model import set_model, extract_feature, predict
+from tqdm import tqdm
+
+
+def train():
+  xs, ys = [], []
+  with open("train.txt", newline="") as trainf:
+    essays = list(csv.reader(trainf))[1:]  # drop the first row (assumed header)
+  print(f"Training on {len(essays)} essays.")
+  for entry in tqdm(essays):
+    essay = entry[1]
+    score1 = float(entry[2])
+    score2 = float(entry[3])
+  
+    feature = extract_feature(essay)
+  
+    xs.append(feature)
+    ys.append(score1 + score2)
+  
+  dataset = list(zip(xs, ys))
+  b0, b1 = make_model(dataset)
+  set_model((b0, b1))
+
+def is_within(prediction, answer, i):
+    return abs(prediction - answer) <= i
+
+def test():
+  correct_counts = [0, 0, 0, 0, 0, 0]
+
+  with open("test.txt", newline="") as testf:
+    essays = list(csv.reader(testf))[1:]  # drop the first row (assumed header)
+  print(f"Testing on {len(essays)} essays.")
+  for entry in tqdm(essays):
+    essay = entry[1]
+    score1 = float(entry[2])
+    score2 = float(entry[3])
+  
+    pred = predict(essay)
+    answer = score1 + score2
+
+    for i in range(len(correct_counts)):
+      if is_within(pred, answer, i):
+        correct_counts[i] += 1.0
+
+  print('Prediction accuracy:')
+  for i in range(len(correct_counts)):
+    print(f'Within ± {i} points => {100*correct_counts[i] / len(essays):.2f} % correct')
+
+train()  # stores the fitted model via set_model; train() itself returns None
+test()
\ No newline at end of file
--- a/model.py
+++ b/model.py
+from nlp import COMMON_WORDS, ENGLISH_WORDS, lemmatize, tag_part_of_speech
+import string
+
+def extract_feature(essay):
+  return 0
+
+def predict(essay):
+  return 0
+
+def set_model(model):
+    global MODEL
+    MODEL = model
--- a/nlp.py
+++ b/nlp.py
+import nltk
+from nltk.corpus import stopwords, wordnet, words
+from nltk.stem import WordNetLemmatizer
+
+# Download the NLTK resources used below.
+nltk.download('averaged_perceptron_tagger')
+nltk.download('wordnet')
+nltk.download('omw-1.4')
+nltk.download('stopwords')
+nltk.download('words')
+COMMON_WORDS = set(stopwords.words('english'))
+ENGLISH_WORDS = set(words.words())
+lemma = WordNetLemmatizer()
+
+def lemmatize(word, pos):
+  def nltk_pos_tagger(nltk_tag):
+      if nltk_tag.startswith('J'):
+          return wordnet.ADJ
+      elif nltk_tag.startswith('V'):
+          return wordnet.VERB
+      elif nltk_tag.startswith('N'):
+          return wordnet.NOUN
+      elif nltk_tag.startswith('R'):
+          return wordnet.ADV
+      else:          
+          return None
+  pos = nltk_pos_tagger(pos)
+  if pos:
+    return lemma.lemmatize(word.lower(), pos=pos)
+  else:
+    return lemma.lemmatize(word.lower())
+
+
+def tag_part_of_speech(word_list):
+  return nltk.pos_tag(word_list)  # list of (word, part_of_speech) tuples
\ No newline at end of file