HW1: Email me your name, major (MS/MA?), and 1 sentence telling me how comfortable you are with SQL. Please include "CISC 7512X HW1" in email subject.
HW2: Install OracleXE. I suggest you set it up under its own "OS" (by first installing VirtualBox, or VMware, etc.).
As a `simple' review of SQL, do `Sample Questions' at the end of: sql2.pdf; For the same database, also answer the following questions:
1) Find the company with most employees.
2) Find employees who make more than the average salary within their company.
3) Find employees who make more than the median salary within their company.
4) Find employees whose salary is an outlier (above 2 standard deviations) within their comapny.
5) Find employees whose salary is an outlier (above 95th percentile) within their comapny.
6) Find the company with the highest number of outlying salaries (your choice which outlier to use).
7) Find the company with most non-managing employees.
8) Find the company with highest average difference between manager salary and non-manager employee salary.
9) Assume that each non-managing employee genererates around 2x their salary in revenue. Managing employees don't directly contribute to revenue. Estimate revenue and ``profit'' for each company (assume profit = revenue - all_salaries).
10) Calculate salary skew for profitable (profit > 0) companies from question 9.
Write answers in an email; put "CISC 7512X HW2" in email subject.
HW3: This homework/project has a few parts; Do not write procedural code (Java, C#, C/C++, etc.) for this homework (all code must be SQL, etc.). We'll only care about daily closing prices, dividends, and splits.
PART 1: load all files in ctsdata.20130208.tar (link on the left) into Oracle or Postgres. The format of these files is: cts(date,symbol,open,high,low,close,volume), splits(date,symbol,post,pre), dividend(date,symbol,dividend). Submit whatever commands/files you used to load the data into whatever database you're using.
PART 2: After loading the data, create another table with fields: TDATE,SYMBOL,PRCNT which will have the daily percentage gain/loss adjusted for dividends and splits.
That is, if stock XYZ today (20130211) issued dividend of $0.25, previous day's (20130208) close was $50 and today's (20130211) close is $51, then today's percentage gain is: 1.5%, since 50 + 0.25 + 50*0.015 = 51. So your table will have: 20130211,XYZ,1.5
For part2: submit query used to construct the table.
PART 3: Background: Pairs trading. Using the percentage returns table you built in part 2: Your task is to identify potential symbol pairs that have HIGH correlation, and are suitable for pairs trading. While everyone agrees that this strategy works, nobody agrees on the best way to identify correlation---especially when considered in relation to the rest of the market.
For this homework, feel free to use whatever you think is appropriate for correlation (if not sure, try Pearson; it's not really appropriate, but it's simple to calculate in SQL).
For part3: submit 10 "best" symbol pairs, each of which trades at least ~$10m a day, suitable for pairs trading in first week of February 2013. (your dataset ends 20130208, so use data prior to 20130204 to get pairs, and see if any of them prove lucrative during 20130204 - 20130208 time). Along with the pairs, submit their correlation coefficients for previous year, and the week of 20130204 - 20130208. (assume you were trading $1m worth, and you traded those exact 10 pairs, how much would you have gained/lost during that period?). Also submit the sql code to get those 10 symbols from the dataset.
PART 4. Background: portfolio theory. The gist is that you can lower risk by investing in things that have LOW correlation. While a single stock will go up and down, the *average* returns from say 20 will likely be a lot steadier---provided of course that they're not all correlated (don't move in the same direction at the same time). These are the kinds of funds your retirement account is invested in.
For part4, build a portfolio of 15 symbols, each of which has daily average volume over $10m (avg taken over last year), has paid dividends every year (no skipping) for the last 10 years, has returned at least a cumulative 3% a year for the last 15 years (including dividends/splits and stock price), and each of which have the lowest correlation with the rest of the symbols in your portfolio.
Calculate the return on those 15 symbols for 2012. Is that better or worse than S&P500 for same period? (You can use "SPY" symbol as stand in for S&P500) How about last 10 years?
Submit 15 symbols, along with their aggregate 2012 return, compared to S&P500 return for the same time period. ...and the SQL code.
Don't forget: write all code in SQL only (use analytical functions too)... let the database do the data crunching. Breakup large steps into smaller steps using temp tables. Test query on a subset of symbols, etc.
HW4: Write a stored procedure, in PL/SQL or T-SQL, that accepts an argument N, and populates table N with the first N prime numbers. Prime numbers are integers that are 2 or greater, divisible only by themselves and 1. The N table schema is: create table N (N int).
Test your algorithm on numbers going upto 10000000 (anyone can write a stupid prime number check---try to write an efficient one :-).
Submit code for the stored procedure in an email.
HW5: (this is more of a `project'; lets see whether anyone finishes this by next week; ultimate due date: last day of class): Load the ``Stock Ordrs'' data (ie: stockdata.zip) file into a database of your choice. Oracle recommended, but anything works if you can get it to work. There are two files in the data, "ordr", and "cxl", which represent stock orders and stock order cancelations. The column details for these are in the file. The file also has an `explanation.txt' file; which has a short example of how to pick out trades/quotes from the data.
Using select SQL query (no PL/SQL or T-SQL), create a "trades" table that represents trades between incoming orders. If the orders really came in at the times they came in, what would be the trading activity?
Obviously you'd do: create table trades as select ... from ...etc. and something similar for quotes.
Trade table columns: tdate symbol tim seq tid oid price qty, where tid is a unique trade id, oid is order id, seq is the sequence of the incoming order that's causing the trade, price is the price at which the trade took place (price of quote, usually), qty is the traded quantity for oid.
Similarly, create a quotes table, that represents a bid, bifqty, ofr, ofrqty (ie: bid and offer) at any given time (after a trade, or order arrival/cancelation).
Feel free to use analytical functions. Google for "oracle analytical functions".
If for some reason you are unable to do this homework, for 70% of the credit, try to do it using PL/SQL or T-SQL. Failing that, write a page or so explaining exactly what difficulty is preventing you from doing it. The explanation must clearly explain the problem in first paragraph (so we can maybe resolve it), and why none of the common SQL solutions resolve it. Email me if you're planning this route, maybe I can provide some hints.
Submit the SQL queries, and the first 1000 records for trades and quotes tables (note that you can compare the results, everyone in class should get identical output).
For the few adventurous students, you can modify the gendata.pl file (in stockdata.zip), to increase volume (set it to ~1000000; and set number of symbols to ~3000). That will generate relatively realistic data sizes---run your query on that data set :-)
HW6: Given a database table such as: create table file(id int, parentid int, name varchar(1024), size int, type char(1)); write a single (recursive) database query to list FULL PATH of all files. [assume type is either "F" or "D" for file or directory]. Your query should give you similar output to unix command: "find . -type f".
HW7: In any language of your choice (java? c++? anything you like), write a B-Tree implementation (or B+Tree). Your code will need to create and store data in a file. Your code should be command line based (command line arguments would have value, and filename), with a command to insert an entry into the tree, and another command to look it up. Also support range lookup.
So you'll have three command line programs.
The insert program will insert [key][value] pair into [file] in B-Tree format (You don't need to follow any database standard, just dump your data structure into a file; 3-way or 5-way nodes). The lookup program will display value associated with the [key]. And lookup range will return all the [key][value] pairs in the range between [lowkey] and [hikey].
Feel free to assume that [key] is a 32-bit unsigned integer value. Do not read the whole file into memory. If you need to read something from middle of file, seek to that location and then read.
How many disk seek operations does insert program do? How many disk seek operations does lookup program do?
HW8: Write a program to peform a database backup. Generate a public/private key pair (using GnuPG, or anything else). Your backup program/script must run daily, backup a table in a database into a .csv.gz file (comma delimited, gzip compressed [do not use database specific binary formats!]). And encrypt the backup using your PUBLIC key. Note that you can use any language, database, utility, configuration, etc. (cron script or windows scheduler is ok). The key is that the backup is comma delimited, gzip compressed, AND encrypted with public key---and is generated daily. Email me the program/script and instructions on how to set it up to run daily (for cron, I want the crontab line, etc., for windows scheduler, I want a batch file to setup to run and instruction on how to set it up in scheduler). Note, your backup procedure must be robust---if disk is full, it should fail gracefully (not create a partial backup), if database shuts down, disconnects, etc., anything bad, you should NEVER end up with an invalid or corrupt backup (you should have a good reliable backup, or none at all). It's also a good idea to verify record counts, and to check file integrity and perhaps generate an MD5 checksum.
HW# 9: In the not-so-distant future, flying cars are commonplace---everyone's on the planet got one. Yes, there are ~10 billion flying cars all over the globe. Each one logs its coordinates every 10 milliseconds, even when parked. Assume x,y,z coordinates, with z being altitude, and x,y, some cartesian equivalent of GPS. To avoid accidents, regulation states that no car can be next to any other car by more than 10 feet while in the air (z > 0). Cars can go really fast, ~500mph. YOUR TASK: write an algorithm & program to find all violators. Assume input is a HUGE file (10 billion cars logging "VIN,time,x,y,z" every 10 milliseconds all-the-time).
Install Apache Hadoop and Hive. [hadoop]. Write a Hive query (or a series of queries), or a MapReduce program to find all violators (cars that are next to other cars while in flight). Assume data is in "cars" table in Hive (or "cars.txt" file on HDFS). What is the running time of your algorithm? If it's O(N^2), can you make it run in O(N log N) time? (note that with this much data, N^2 is not practical, even N log N is a bit long).
HW#10: Given a customer database with their activities (purchases, page views, etc.) come up with a method to identify their likely future purchase activities [e.g. what would you advertise to each customer, based on their past behavior?]. Create your own representation of customers and other data; and use whatever method you like for the prediction. Detailed pseudo-code is fine.