CISC 7700X Homeworks
Email your homeworks to me: alex at theparticle dot com. Start the email subject with "CISC 7700X HW#". Homeworks without this subject line risk being deleted and not counted.
CISC 7700X HW# 1 (due by 2nd class;): Email me your name, preferred email address, IM account (if any), major, and year.
CISC 7700X HW# 2 (due by 2nd class;): Using the Iris dataset, build a kNN model to identify the species of a flower given sepal_length, sepal_width, petal_length, petal_width. Feel free to use whatever language/tool you are comfortable with. I encourage you to write C/C++/Java/C#/SQL/Python code. You may also use Excel, Weka, Colab, or whatever other library/tool you find. Submit the model code via email.
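As a starting point, here is a minimal kNN sketch in plain Python. The six rows are only a small sample in the Iris column layout, used for illustration; the function name `knn_predict` and the choice of Euclidean distance with k=3 are assumptions, not requirements of the assignment.

```python
import math
from collections import Counter

# A small sample in the Iris layout, for illustration only:
# (sepal_length, sepal_width, petal_length, petal_width), species
TRAIN = [
    ((5.1, 3.5, 1.4, 0.2), "setosa"),
    ((4.9, 3.0, 1.4, 0.2), "setosa"),
    ((7.0, 3.2, 4.7, 1.4), "versicolor"),
    ((6.4, 3.2, 4.5, 1.5), "versicolor"),
    ((6.3, 3.3, 6.0, 2.5), "virginica"),
    ((5.8, 2.7, 5.1, 1.9), "virginica"),
]

def knn_predict(train, query, k=3):
    """Return the majority label among the k training rows nearest to query."""
    # Sort training rows by Euclidean distance to the query point.
    nearest = sorted(train, key=lambda row: math.dist(row[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

prediction = knn_predict(TRAIN, (5.0, 3.4, 1.5, 0.2), k=3)
print(prediction)
```

For the real homework you would load all 150 rows of the dataset instead of the sample above, and hold some rows out to check how often the prediction is right.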
CISC 7700X HW# 3 (due by 4th class;): We have a labeled training data set: hw3.data1.csv.gz.
Thinking of a linear model, we come up with:
y = 24*column1 + -15*column2 + -38*column3 + -7*column4 + -41*column5 + 35*column6 + 0*column7 + -2*column8 + 19*column9 + 33*column10 + -3*column11 + 7*column12 + 3*column13 + -47*column14 + 26*column15 + 10*column16 + 40*column17 + -1*column18 + 3*column19 + 0*column20 + -6
If y > 0, predict 1; otherwise predict -1.
What is the accuracy? Calculate the confusion matrix for this model. If the cost of a false negative is $1000 and the cost of a false positive is $100 (and $0 for an accurate answer), what is the expected economic gain?
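The bookkeeping for these questions can be sketched as follows. The weights and intercept are copied from the model above; the `actual` and `predicted` lists are made-up values that only exercise the functions, and the layout of hw3.data1.csv.gz (which column holds the label) is not assumed here. The sketch reports total cost; an expected gain per instance would be the negative of that cost divided by the number of rows.

```python
# Weights for column1..column20 and the intercept, copied from the model above.
WEIGHTS = [24, -15, -38, -7, -41, 35, 0, -2, 19, 33,
           -3, 7, 3, -47, 26, 10, 40, -1, 3, 0]
INTERCEPT = -6

def predict(row, threshold=0.0):
    """Return 1 if the linear score exceeds threshold, otherwise -1."""
    y = sum(w * x for w, x in zip(WEIGHTS, row)) + INTERCEPT
    return 1 if y > threshold else -1

def confusion(actual, predicted):
    """Count true/false positives/negatives for labels in {1, -1}."""
    counts = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
    for a, p in zip(actual, predicted):
        if p == 1:
            counts["tp" if a == 1 else "fp"] += 1
        else:
            counts["tn" if a == -1 else "fn"] += 1
    return counts

def total_cost(counts, fn_cost=1000, fp_cost=100):
    """Dollar cost of the mistakes; correct answers cost nothing."""
    return counts["fn"] * fn_cost + counts["fp"] * fp_cost

# Hypothetical labels and predictions, only to exercise the functions.
actual    = [1, 1, 1, -1, -1, -1, -1, 1]
predicted = [1, -1, 1, -1, 1, -1, -1, 1]
counts = confusion(actual, predicted)
accuracy = (counts["tp"] + counts["tn"]) / len(actual)
print(counts, accuracy, total_cost(counts))
```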
How can we tweak the model to increase economic gain? Come up with a model that maximizes economic gain (approximations are OK; try guesstimating a few possibilities in a spreadsheet, etc.).
Email the numbers and the steps you used to calculate things (you can do most of this homework in a spreadsheet [Excel?], but I highly encourage you to write code---learn Python if not sure where to start).
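One simple tweak, among many possible ones, is to keep the weights and move the decision cutoff, which is the same as changing the intercept. The sketch below tries each candidate cutoff and keeps the cheapest. The `scores` and `labels` lists are made up for illustration, and the helper names are hypothetical.

```python
def cost_at(scores, labels, threshold, fn_cost=1000, fp_cost=100):
    """Total cost when predicting 1 for score > threshold, else -1."""
    cost = 0
    for s, label in zip(scores, labels):
        pred = 1 if s > threshold else -1
        if pred == -1 and label == 1:
            cost += fn_cost      # missed positive
        elif pred == 1 and label == -1:
            cost += fp_cost      # false alarm
    return cost

def best_threshold(scores, labels):
    """Try a cutoff below all scores plus each observed score; keep the cheapest."""
    candidates = [min(scores) - 1] + sorted(set(scores))
    return min(candidates, key=lambda t: cost_at(scores, labels, t))

# Hypothetical scores (values of y) and labels, for illustration only.
scores = [-30, -12, -4, 3, 9, 25]
labels = [-1, 1, -1, 1, -1, 1]
t = best_threshold(scores, labels)
print(t, cost_at(scores, labels, 0), cost_at(scores, labels, t))
```

Because a false negative costs ten times as much as a false positive, the cheapest cutoff tends to sit below 0, so that more rows are predicted as 1.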
CISC 7700X HW# 4 (due by Nth class;):
Using data from stockrow, take the previous 2 years of data (excluding the latest quarter!) and build linear [y = a+b*x], logarithmic [y = a+b*log(x)], exponential [y = b*exp(a*x)], and power curve [y = b*x^a] models on revenue, earnings, and dividends, for symbols IBM, MSFT, AAPL, GOOG, FB, PG, GE.
Which model works best for which metric/symbol? Show with numbers (e.g. r-squared score). Read through: Coefficient of determination.
Using the best model for each metric, make a prediction for next quarter's revenue, earnings, and dividends. Remember, you didn't use the last number to build your models. Compare your model's prediction to the actual number for the latest quarter. What's the error? [hint]
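One way to fit all four forms is to linearize them and reuse ordinary least squares, then score each fit with r-squared on the original scale. The sketch below does that in plain Python. The data is generated from a power curve on purpose (it is not real financial data), and the function names are hypothetical. Note that the log transforms need positive x and y, so a metric with negative values (e.g. a loss) needs different handling.

```python
import math

def linfit(xs, ys):
    """Ordinary least squares for y = a + b*x; returns (a, b)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    return my - b * mx, b

def r_squared(ys, preds):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    my = sum(ys) / len(ys)
    ss_res = sum((y - p) ** 2 for y, p in zip(ys, preds))
    ss_tot = sum((y - my) ** 2 for y in ys)
    return 1 - ss_res / ss_tot

def fit_all(xs, ys):
    """Fit the four model forms; x and y must be positive for the log transforms."""
    lx, ly = [math.log(x) for x in xs], [math.log(y) for y in ys]
    models = {}
    a, b = linfit(xs, ys)                      # y = a + b*x
    models["linear"] = lambda x, a=a, b=b: a + b * x
    a, b = linfit(lx, ys)                      # y = a + b*log(x)
    models["logarithmic"] = lambda x, a=a, b=b: a + b * math.log(x)
    c, a = linfit(xs, ly)                      # log(y) = log(b) + a*x
    models["exponential"] = lambda x, a=a, c=c: math.exp(c) * math.exp(a * x)
    c, a = linfit(lx, ly)                      # log(y) = log(b) + a*log(x)
    models["power"] = lambda x, a=a, c=c: math.exp(c) * x ** a
    return models

# Hypothetical series (x = quarter index 1..8), for illustration only.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [2 * x ** 1.5 for x in xs]   # generated from a power curve on purpose
models = fit_all(xs, ys)
scores = {name: r_squared(ys, [f(x) for x in xs]) for name, f in models.items()}
best = max(scores, key=scores.get)
print(best, scores)
```

Calling `models[best](9)` then gives the prediction for the held-out ninth point, which can be compared against the actual value.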
CISC 7700X HW# 5 (due by Nth class;):
CISC 7700X HW# 6 (due by Nth class;):
Use the hw6 data to build a classification model. The last column in the dataset is the label. Randomly split the dataset into 70% training instances and 30% test instances. Construct a classifier on the training data, and report the accuracy results using the test dataset. Feel free to use any classifier (kNN, linear, etc.). Submit the code, a short description of your model, accuracy, etc.
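The split-and-score workflow might look like the sketch below. The rows are synthetic stand-ins for the hw6 data (features first, label last), and the midpoint classifier is only a placeholder to be swapped for a real model. All names here are hypothetical.

```python
import random

def train_test_split(rows, train_fraction=0.7, seed=0):
    """Shuffle a copy of rows and cut it into (train, test)."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(round(len(shuffled) * train_fraction))
    return shuffled[:cut], shuffled[cut:]

def make_midpoint_classifier(train_rows):
    """Placeholder model: threshold the first feature halfway between class means."""
    mean = {}
    for label in (0, 1):
        vals = [r[0] for r in train_rows if r[-1] == label]
        mean[label] = sum(vals) / len(vals)
    cut = (mean[0] + mean[1]) / 2
    hi = 1 if mean[1] > mean[0] else 0
    return lambda features: hi if features[0] > cut else 1 - hi

def accuracy(classify, test_rows):
    """Fraction of test rows whose last column matches classify(features)."""
    hits = sum(1 for row in test_rows if classify(row[:-1]) == row[-1])
    return hits / len(test_rows)

# Synthetic rows, for illustration only: two features, label in the last column.
rows = [[i, i % 3, 1 if i >= 10 else 0] for i in range(20)]
train, test = train_test_split(rows)
classify = make_midpoint_classifier(train)   # fit on training rows only
acc = accuracy(classify, test)               # score on held-out rows only
print(len(train), len(test), acc)
```

The point of the design is that the test rows never influence the model: they are touched only once, to compute the reported accuracy.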
CISC 7700X HW# 7 (due by Nth class;):
Run your model from HW6 on the MNIST dataset (http://yann.lecun.com/exdb/mnist/). Just use the "digits" datasets. What accuracy are you getting on MNIST (train using the "train" dataset, test using the "test" dataset)? Submit code/model and accuracy.
CISC 7700X HW# 8 (due by Nth class;): For each column in [hw6 data], find the decision-tree split value and information gain measure in bits. You can use Excel or your own program: email me the 20 numbers (10 columns, with 2 values each). This homework is essentially determining the root node for a decision tree: refer to decision tree class notes for the calculation details.
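For a single numeric column, the calculation can be sketched as below: try a split between each pair of adjacent sorted values and keep the one with the highest information gain. Using midpoints as candidate split values is an assumption here; the class notes may define the candidates differently. The sample column and labels are made up.

```python
import math

def entropy(labels):
    """Shannon entropy of a label list, in bits."""
    n = len(labels)
    counts = {}
    for label in labels:
        counts[label] = counts.get(label, 0) + 1
    return -sum(c / n * math.log2(c / n) for c in counts.values())

def best_split(values, labels):
    """Return (split_value, information_gain_bits) for one numeric column."""
    base = entropy(labels)
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    best = (None, 0.0)
    for i in range(1, n):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no threshold fits between equal values
        split = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [label for _, label in pairs[:i]]
        right = [label for _, label in pairs[i:]]
        # Weighted entropy remaining after the split.
        remainder = len(left) / n * entropy(left) + len(right) / n * entropy(right)
        gain = base - remainder
        if gain > best[1]:
            best = (split, gain)
    return best

# Made-up column and labels, for illustration only.
values = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
labels = [0, 0, 0, 1, 1, 1]
split, gain = best_split(values, labels)
print(split, gain)
```

Running `best_split` once per column yields the two numbers per column that the homework asks for.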
CISC 7700X HW# 9 (due by Nth class;):
In this homework you'll build a document clustering mechanism. You can scrape a few news sites (e.g.: wget -r -l 2 http://www.cnn.com, etc.). Convert each document into an array of numbers (read: TF-IDF).
Submit the code used to do the clustering, as well as the code that assigns a category to a new document.
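The "array of numbers" step can be sketched as below, using one common TF-IDF variant (term frequency divided by document length, times log(N/df)); other weightings exist. The three tiny documents are made up, and the function names are hypothetical. Cosine similarity is shown as one possible way to compare vectors when clustering or when assigning a new document to the nearest cluster.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Turn tokenized documents into TF-IDF vectors over a shared vocabulary."""
    vocab = sorted({word for doc in docs for word in doc})
    n = len(docs)
    # Document frequency: number of documents containing each word.
    df = Counter(word for doc in docs for word in set(doc))
    idf = {w: math.log(n / df[w]) for w in vocab}
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append([tf[w] / len(doc) * idf[w] for w in vocab])
    return vocab, vectors

def cosine(u, v):
    """Cosine similarity; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

# Made-up documents, for illustration only.
docs = [
    "stocks fell as markets closed lower".split(),
    "markets rallied and stocks rose".split(),
    "the team won the final match".split(),
]
vocab, vecs = tfidf_vectors(docs)
print(cosine(vecs[0], vecs[1]), cosine(vecs[0], vecs[2]))
```

Here the two finance documents share terms and get a positive similarity, while the sports document shares none with the first and scores zero against it.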
CISC 7700X HW# 10 (due by Nth class;):
Build an auto-encoder for: 00000001, 00000010, 00000100, 00001000, 00010000, 00100000, 01000000, 10000000. The middle layer should have 3 neurons. So you'll have a neural network of 8 binary inputs, 3 inner neurons, and 8 binary outputs.
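A bare-bones version of such a network, written in plain Python with sigmoid units, squared error, and per-pattern gradient descent, could look like the sketch below. The learning rate, epoch count, seed, and weight initialization are all arbitrary assumptions; whether training reaches a clean 3-bit code depends on them, so treat this as a starting point and not a tuned solution.

```python
import math
import random

# The eight one-hot patterns from the assignment, as lists of 0/1.
PATTERNS = [[int(c) for c in format(1 << i, "08b")] for i in range(8)]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(x, w1, b1, w2, b2):
    """Return (hidden, output) activations for one 8-bit input."""
    h = [sigmoid(sum(w * xi for w, xi in zip(row, x)) + b) for row, b in zip(w1, b1)]
    o = [sigmoid(sum(w * hi for w, hi in zip(row, h)) + b) for row, b in zip(w2, b2)]
    return h, o

def train(patterns, hidden=3, epochs=2000, lr=0.5, seed=0):
    """Train an n-hidden-n autoencoder; returns (params, per-epoch losses)."""
    rng = random.Random(seed)
    n = len(patterns[0])
    w1 = [[rng.uniform(-1, 1) for _ in range(n)] for _ in range(hidden)]
    b1 = [0.0] * hidden
    w2 = [[rng.uniform(-1, 1) for _ in range(hidden)] for _ in range(n)]
    b2 = [0.0] * n
    losses = []
    for _ in range(epochs):
        loss = 0.0
        for x in patterns:
            h, o = forward(x, w1, b1, w2, b2)
            loss += sum((oi - xi) ** 2 for oi, xi in zip(o, x))
            # Backpropagate squared error through the sigmoid units.
            d_out = [(oi - xi) * oi * (1 - oi) for oi, xi in zip(o, x)]
            d_hid = [hj * (1 - hj) * sum(d_out[k] * w2[k][j] for k in range(n))
                     for j, hj in enumerate(h)]
            for k in range(n):
                for j in range(hidden):
                    w2[k][j] -= lr * d_out[k] * h[j]
                b2[k] -= lr * d_out[k]
            for j in range(hidden):
                for i in range(n):
                    w1[j][i] -= lr * d_hid[j] * x[i]
                b1[j] -= lr * d_hid[j]
        losses.append(loss)
    return (w1, b1, w2, b2), losses

params, losses = train(PATTERNS)
for x in PATTERNS:
    h, o = forward(x, *params)
    print(x, [round(v, 2) for v in h], [round(v) for v in o])
```

The printed middle column shows the 3-value code the network assigns to each input; comparing the last column with the first shows how many patterns are currently reconstructed.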