December 6th, 2022    

CISC 7700X

Confusion Matrix
Quantization Vector Quantization
Decision Trees
n-Tuple (src)
Big Data
HBase Primer
Spark Primer
Neural Nets
End Notes

Past Tests
F2022 Midterm (key)
F2021 Midterm (key)
F2021 Final (key)
F2020 Midterm (key)
F2020 Final (key)
F2018 Midterm (key)
F2018 Final (key)
F2019 Midterm (key)



Global K-means

DL generalization


MNIST train image
MNIST train labels
MNIST test image
MNIST test labels


CISC 7700X Homeworks

You should EMAIL me homeworks, alex at theparticle dot com. Start email subject with "CISC 7700X HW#". Homeworks without the subject line risk being deleted and not counted.

CISC 7700X HW# 1 (due by 2nd class;): Email me your name, preferred email address, IM account (if any), major, and year.

CISC 7700X HW# 2 (due by 3rd class;):

  1. A furniture manufacturer makes two kinds of furniture: chairs and sofas. The production process has three operations: carpentry, finishing, and upholstery. The labor required for each operation varies. Manufacturing a chair requires 6 hours of carpentry, 1 hour of finishing, and 2 hours of upholstery. Manufacturing a sofa requires 3 hours of carpentry, 1 hour of finishing, and 6 hours of upholstery. Due to limited availability of skilled labor, each day we have available 96 hours of carpentry labor, 18 hours of finishing labor, and 72 hours of upholstery labor. We make $80 profit per chair and $70 profit per sofa. How many chairs and sofas should we manufacture per day to maximize profit? Show work (don't just email two numbers).
  2. A soup manufacturer sells 16oz cans of soup. They would like to minimize the amount of metal used in the construction of the can. What are the dimensions of a 16oz can that uses the least amount of metal? Show work. [hint]

CISC 7700X HW# 3 (due by 4th class;): We have a labeled training data set: hw3.data1.csv.gz.

Thinking of a linear model, we come up with:

y = 24*column1 + -15*column2 + -38*column3 + -7*column4 + -41*column5 + 35*column6 + 0*column7 + -2*column8 + 19*column9 + 33*column10 + -3*column11 + 7*column12 + 3*column13 + -47*column14 + 26*column15 + 10*column16 + 40*column17 + -1*column18 + 3*column19 + 0*column20 + -6

If y > 0, predict 1; otherwise predict -1.

What is the accuracy? Calculate the confusion matrix for this model. If cost of a false negative is $1000, and cost of a false positive is $100, (and $0 for an accurate answer), what is the expected economic gain?
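As a rough illustration of the quantities asked for above, here is a minimal sketch that counts a confusion matrix and applies the two error costs. The label lists are made-up toy values, not the homework data, and the function names are hypothetical. It reports total cost; treating "economic gain" as the negative of that cost is an assumption.

```python
def confusion_matrix(actual, predicted):
    """Count true/false positives/negatives for labels in {1, -1}."""
    counts = {"tp": 0, "fp": 0, "fn": 0, "tn": 0}
    for a, p in zip(actual, predicted):
        if p == 1:
            counts["tp" if a == 1 else "fp"] += 1
        else:
            counts["fn" if a == 1 else "tn"] += 1
    return counts

def accuracy(cm):
    """Fraction of instances that were classified correctly."""
    return (cm["tp"] + cm["tn"]) / sum(cm.values())

def total_cost(cm, fn_cost=1000, fp_cost=100):
    """Total cost of errors; correct answers cost nothing."""
    return cm["fn"] * fn_cost + cm["fp"] * fp_cost

# Toy labels for illustration only (not taken from hw3.data1.csv.gz).
actual    = [1,  1, 1, -1, -1, -1, -1, 1]
predicted = [1, -1, 1, -1,  1, -1, -1, 1]

cm = confusion_matrix(actual, predicted)
print(cm, accuracy(cm), total_cost(cm))
```

Because a false negative costs ten times as much as a false positive, shifting the decision threshold so that more instances are predicted as 1 can lower the total cost even if accuracy drops.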

How can we tweak the model to increase economic gain? Come up with a model that maximizes economic gain (approximations are OK; try guesstimating a few possibilities in a spreadsheet, etc.).

Email the numbers and the steps you used to calculate things (you can do most of this homework in a spreadsheet [Excel?], but feel free to write code).

CISC 7700X HW# 4 (due by Nth class;):

Using data from stockrow for the previous 2 years (excluding the latest quarter!), build linear [y = a+b*x], logarithmic [y = a+b*log(x)], exponential [y = b*exp(a*x)], and power curve [y = b*x^a] models on revenue, earnings, and dividends, for symbols IBM, MSFT, AAPL, GOOG, FB, PG, GE.

Which model works best for which metric/symbol? Show with numbers (e.g., r-squared score). Read through: Coefficient of determination.
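One possible way to compare the four model forms is sketched below: each is fit by ordinary least squares (the exponential and power forms after taking logs, which assumes x > 0 and y > 0), and r-squared is then computed on the original scale so the scores are comparable. The series is synthetic, not real financial data, and all function names are hypothetical.

```python
import math

def fit_line(xs, ys):
    """Least-squares fit of y = a + b*x; returns (a, b)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    return my - b * mx, b

def r_squared(ys, preds):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    my = sum(ys) / len(ys)
    ss_res = sum((y - p) ** 2 for y, p in zip(ys, preds))
    ss_tot = sum((y - my) ** 2 for y in ys)
    return 1 - ss_res / ss_tot

def fit_models(xs, ys):
    """Return {model name: prediction function} for the four forms."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    models = {}
    a, b = fit_line(xs, ys)
    models["linear"] = lambda x, a=a, b=b: a + b * x
    a, b = fit_line(lx, ys)
    models["logarithmic"] = lambda x, a=a, b=b: a + b * math.log(x)
    ln_b, a = fit_line(xs, ly)          # ln y = ln b + a*x
    models["exponential"] = lambda x, a=a, b=math.exp(ln_b): b * math.exp(a * x)
    ln_b, a = fit_line(lx, ly)          # ln y = ln b + a*ln x
    models["power"] = lambda x, a=a, b=math.exp(ln_b): b * x ** a
    return models

# Synthetic series for illustration only: y = 2 * exp(0.3 * x).
xs_all = [1, 2, 3, 4, 5, 6, 7]
ys_all = [2 * math.exp(0.3 * x) for x in xs_all]
xs, ys = xs_all[:-1], ys_all[:-1]       # hold out the latest point

models = fit_models(xs, ys)
scores = {name: r_squared(ys, [f(x) for x in xs]) for name, f in models.items()}
best = max(scores, key=scores.get)
error = models[best](xs_all[-1]) - ys_all[-1]   # prediction error on the held-out point
print(scores, best, error)
```

Fitting in log space minimizes error in log(y) rather than in y, so the exponential and power fits are approximations to a direct nonlinear fit; that trade-off is a design choice of this sketch.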

Using the best model for each metric, make a prediction for "next quarter" revenue, earnings, and dividends. Remember, you didn't use the last number to build your models. Compare your model's prediction to the last quarter's number. What's the error? [hint]

CISC 7700X HW# 5 (due by Nth class;):

Using data from spambase, build a Naive Bayes email classifier. Nothing too fancy, just a training module and a classifier module. Submit your code and the accuracy you get on the spambase dataset. [hint]
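The assignment does not say which Naive Bayes variant to use. The sketch below assumes a Gaussian variant (each feature modeled as a per-class normal distribution), split into a training function and a classifying function. The rows are made-up toy values, not spambase records, and the names are hypothetical.

```python
import math
from collections import defaultdict

def train(rows, labels):
    """Return {label: (prior, [(mean, variance) per feature])}."""
    grouped = defaultdict(list)
    for row, label in zip(rows, labels):
        grouped[label].append(row)
    model = {}
    for label, members in grouped.items():
        stats = []
        for column in zip(*members):
            mean = sum(column) / len(column)
            var = sum((v - mean) ** 2 for v in column) / len(column)
            stats.append((mean, var + 1e-9))   # small floor avoids division by zero
        model[label] = (len(members) / len(rows), stats)
    return model

def classify(model, row):
    """Pick the label with the highest log posterior (up to a constant)."""
    def log_posterior(label):
        prior, stats = model[label]
        score = math.log(prior)
        for value, (mean, var) in zip(row, stats):
            score += -0.5 * math.log(2 * math.pi * var) - (value - mean) ** 2 / (2 * var)
        return score
    return max(model, key=log_posterior)

# Toy data for illustration only: two features, labels 0 and 1.
rows = [[0.1, 5.0], [0.2, 4.5], [3.0, 0.5], [2.8, 0.7]]
labels = [0, 0, 1, 1]
model = train(rows, labels)
print(classify(model, [0.15, 4.8]), classify(model, [2.9, 0.6]))
```

Summing log probabilities instead of multiplying raw probabilities avoids numeric underflow when there are many features.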

CISC 7700X HW# 6 (due by Nth class;):

Use hw6 data to build a classification model. The last column in the dataset is the label. Randomly split the dataset into 70% training instances and 30% test instances. Construct a classifier on the training data, and report the accuracy results using the test dataset. Feel free to use any classifier model (kNN, linear, etc.). Submit the code, a short description of your model, accuracy, etc.
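The split-and-score procedure described above could look roughly like the following sketch. It assumes each row is a sequence whose last element is the label; the rows are toy values, the function names are hypothetical, and the majority-class "classifier" is only a placeholder baseline, not a suggested model.

```python
import random
from collections import Counter

def train_test_split(rows, train_fraction=0.7, seed=0):
    """Shuffle a copy of the rows and cut it into train and test parts."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)   # fixed seed keeps the split repeatable
    cut = int(round(len(shuffled) * train_fraction))
    return shuffled[:cut], shuffled[cut:]

def accuracy(classify, test_rows):
    """Fraction of test rows whose predicted label matches the last column."""
    correct = sum(1 for row in test_rows if classify(row[:-1]) == row[-1])
    return correct / len(test_rows)

# Toy rows for illustration only: (feature, label).
rows = [(i, i % 2) for i in range(10)]
train_rows, test_rows = train_test_split(rows)

# Placeholder baseline: always predict the most common training label.
majority = Counter(row[-1] for row in train_rows).most_common(1)[0][0]
baseline_accuracy = accuracy(lambda features: majority, test_rows)
print(len(train_rows), len(test_rows), baseline_accuracy)
```

A baseline like this is useful mainly as a sanity check: a real model should beat it on the test set.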

CISC 7700X HW# 7 (due by Nth class;):

Run your model from HW6 on the MNIST dataset (just use the "digits" datasets). What accuracy are you getting on MNIST (train using the "train" dataset, test using the "test" dataset)? Submit code/model and accuracy.

CISC 7700X HW# 8 (due by Nth class;): For each column in [hw6 data], find the decision-tree split value and information gain measure in bits. You can use Excel or your own program: email me the 20 numbers (10 columns, with 2 values each). This homework is essentially determining the root node for a decision tree: refer to decision tree class notes for the calculation details.
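For a single numeric column, the root-node calculation can be sketched as below. This version assumes binary splits of the form "value <= threshold", with candidate thresholds at midpoints between adjacent distinct values; the class notes may define the split value differently. The data is a toy example and the function names are hypothetical.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a label list, in bits."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def best_split(values, labels):
    """Return (threshold, information gain in bits) for the best binary split."""
    base = entropy(labels)
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    best = (None, 0.0)
    for i in range(1, n):
        if pairs[i - 1][0] == pairs[i][0]:
            continue                      # no threshold between equal values
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [label for _, label in pairs[:i]]
        right = [label for _, label in pairs[i:]]
        gain = base - (len(left) / n) * entropy(left) - (len(right) / n) * entropy(right)
        if gain > best[1]:
            best = (threshold, gain)
    return best

# Toy column for illustration only: a perfectly separable feature.
values = [1, 2, 3, 4, 5, 6]
labels = [0, 0, 0, 1, 1, 1]
threshold, gain = best_split(values, labels)
print(threshold, gain)
```

Running this once per column and taking the column with the largest gain gives the root node.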

CISC 7700X HW# 9 (due by Nth class;):

In this homework you'll build a document clustering mechanism. You can scrape a few news sites (e.g., wget -r -l 2). Convert each document into an array of numbers (read: TF-IDF).
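A minimal sketch of turning tokenized documents into TF-IDF vectors follows. It assumes one common weighting (term frequency as count divided by document length, inverse document frequency as log(N / document frequency)); other variants exist. The documents are toy strings, the function name is hypothetical, and empty documents are not handled.

```python
import math
from collections import Counter

def tf_idf(documents):
    """documents: list of token lists. Returns (vocabulary, vectors)."""
    n = len(documents)
    doc_freq = Counter()
    for tokens in documents:
        doc_freq.update(set(tokens))      # count each term once per document
    vocabulary = sorted(doc_freq)
    vectors = []
    for tokens in documents:
        counts = Counter(tokens)
        total = len(tokens)
        vectors.append([
            (counts[term] / total) * math.log(n / doc_freq[term])
            for term in vocabulary
        ])
    return vocabulary, vectors

# Toy documents for illustration only.
documents = [text.split() for text in (
    "the market fell",
    "the market rose",
    "the team won",
)]
vocabulary, vectors = tf_idf(documents)
print(vocabulary)
print(vectors)
```

A term that appears in every document (here "the") gets weight zero, which is the point of the IDF factor: it down-weights words that do not help distinguish documents.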

Build a clustering algorithm to cluster documents into 10 categories. I suggest you use the k-means algorithm (as it's much simpler), but if you're feeling adventurous, try using NMF or LDA.
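A bare-bones k-means loop (Lloyd's algorithm) might look like the sketch below. It assumes squared Euclidean distance and a naive initialization from the first k points; both are simplifications, and better initializations exist. The points are toy 2-D values rather than TF-IDF vectors, and the function names are hypothetical.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def nearest(point, centroids):
    """Index of the centroid closest to the point."""
    return min(range(len(centroids)),
               key=lambda i: squared_distance(point, centroids[i]))

def k_means(points, k, iterations=100):
    """Return k centroids after alternating assignment and update steps."""
    centroids = [list(p) for p in points[:k]]   # assumption: naive initialization
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            clusters[nearest(p, centroids)].append(p)
        updated = [
            [sum(col) / len(members) for col in zip(*members)] if members else centroids[i]
            for i, members in enumerate(clusters)
        ]
        if updated == centroids:
            break                               # assignments have stabilized
        centroids = updated
    return centroids

# Toy points for illustration only: two well-separated groups.
points = [[0, 0], [10, 10], [0, 1], [1, 0], [10, 11], [11, 10]]
centroids = k_means(points, k=2)
print(centroids, nearest([9, 9], centroids))
```

Assigning a category to a new document is then just a call to `nearest` with that document's vector and the trained centroids.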

Submit the code used to do the clustering, as well as the code for assigning a category to a new document.

CISC 7700X HW# 10 (due by Nth class;):

Build an auto-encoder for: 00000001, 00000010, 00000100, 00001000, 00010000, 00100000, 01000000, 10000000. The middle layer should have 3 neurons. So you'll have a neural network of 8 binary inputs, 3 inner neurons, and 8 binary outputs.

© 2006, Particle