Ivo Welch, October 2023
Programming and Data for ASAM (30m)
These tasks must be completed individually by each and every student. Newbies are not expected to be able to do it all without mentoring on their first encounters. They are expected to be able to do it all without help by the time the summer is over (and understand how this all works, so they can make changes!). Students who are unable to complete these tasks may fail this course!
Purpose
-
Learn Computer Programming
-
Learn CRSP and Compustat Data Organization and Handling
-
Create data sets that will be of important use to you throughout ASAM. If you understand how to create these data sets, you will have the basic capability to create and backtest any quant strategy you like.
Note: ASAM is not trying to make programmers out of MBA students. It does not matter at all whether this takes newbies a long time and/or whether the code is efficient. What matters is that newbies learn how to do this (albeit slowly), in order to appreciate what their future employee programmers are really doing—what they can and cannot reasonably create.
(There is no great practical book on quantitative finance programming, data, and analysis. Perhaps the closest is Andrew Ang’s “Asset Management.”)
Task 1: Introduction
-
Learn Programming (Python).
- write a program that reads a file with lines containing sequential rates of returns of a hypothetical asset, and that calculates and prints the compound rate of return.
-
Learn WRDS, specifically CRSP and Compustat (incl. their error codes).
-
Download large datasets to your computer from WRDS.
- I strongly suggest that you download the entire monthly CRSP and annual Compustat data to your computer one-time using the WRDS interactive website GUI (unless you are a masochist and want to learn their SQL way), and then work off these data sets locally.
-
Write some sample programs, e.g., write a program that requests a permno and date and then tells me the stock’s marketcap on this date.
Fun optional learning task: Programming capability is a superpower. It is not just the best tool for training logic. It also gives the ability to solve many problems otherwise impossible to attack. For example, assume there are 330 million people. Each day, every person meets 10 other randomly selected persons. An infected person infects another (s)he meets with a probability of 20% if this other person was never infected. After 2 weeks, with 1% probability, the infected person dies. With 99% probability, (s)he becomes immune, healthy, and non-contagious. Start with 10 infected persons. How does the mortality curve look like? What probability would match the US curve? How does an increase or a decrease in the number of encounters and/or the probability of infection change the curve?
Task 2 (CRSP): Create “Annual Return” Data Sets
Create two annual return data sets.
-
Download the relevant data from WRDS first. You can start with data from 1973.
-
The computer code to massage and analyze data should sit on your own computer. Do not try to do this on the WRDS servers. They are slow.
- after you have the full data set, also consider creating a shorter test data set with only ~10,000 input lines for development purposes. For example, grab the subset of firm-days with permnos 14593 (AAPL) and 93436 (TSLA) from 2016 to 2020. There is no reason to debug your programs with 97,269,745 lines of input. You would end up waiting for a long time, only to learn that you made a mistake somewhere in your code.
-
Write a flexible python function that can calculate holding rates of return, given a start data and an end date
-
a holding rate (i.e., compound rate of return) is not the percent change in price. You must use the “1 plus” formula.
-
for example, you could define
def rate_of_return( permno, startdate, enddate ) "calculate the rate of return for a stock given a starting and an end date" blah blah blah return blie
-
-
Use your function to create two similar data sets:
-
Create one data set with annual Jan-Dec buy-and-hold rates of return for all stock-years. Call it
crsp-jan1-dec31.csv
. It should have output lines likepermno,cusip,ticker,year,yyyymmdd.1,yyyymmdd.T,ret 14593,03783310.AAPL,2019,20190102,20191231,0.8897 ...
-
Create an almost identical data set, but start return calculations a few days after the year starts (say, around Jan 10) instead of Jan 1. The data set should still end each year on Dec 31. Call it
crsp-jan10-dec31.csv
.permno,cusip,ticker,year,yyyymmdd.1,yyyymmdd.T,ret 14593,03783310.AAPL,2019,20190110,20191231,0.9943 ...
-
You can also combine these two into one more convenient data set, call it
crsp-dec31.csv
:permno,cusip,ticker,year,yyyymmdd.1,yyyymmdd.10,yyyymmdd.T,ret.1,ret.10 14593,03783310.AAPL,2019,20190102,20190110,20191231,0.8897,0.9943 ...
-
-
Create some test cases for your code, for example
if ((permno == 14593) and (year == 2019)) check( ret.1 , 0.8897 ) if ((permno == 14593) and (year == 2019)) check( ret.10 , 0.9943 )
You can write the check function yourself. (Warning: computers are not perfect at arithmetic. Just check that the number is close.)
A good set of test cases helps you ensure that if you make some changes later, you can quickly check whether your programs still work. This is especially the case in multi-stage programs. Professional programmers often write many many test cases for code that is being reused before they write the computer program that creates the output.
-
Handcheck some outliers:
-
For example, which firms had the biggest returns in the year? Does this seem right?
-
For example, which firms had the biggest difference in crsp-jan1-dec31 vs crsp-jan10-dec31 .
-
-
Question: If your boss changes her mind and wants you to start 5 trading days after each new year, how difficult would it be for you to satisfy her?
Task 3 (CRSP): Create a “Firm-Size” Data Set
-
For each stock-year calculate the marketcap on the last trading day of each calendar year.
-
Also, calculate the rank of each stock on this last day of the year. (NA if not trading on last day of year)
-
Save your data as a
mcap.csv
data set. For example, it may contain lines such aspermno,cusip,ticker,year,yyyymmdd,mktcap,rankmktcap 14593,03783310,AAPL,2018,20181231,746079173,1 93436,88160R10,TSLA,2018,20181231,57441944,103 ...
-
Handcheck some outliers:
- For example, which firms had the biggest change in rank from year to year? Does this seem right? Can you handcheck the basics?
-
Create some test cases, e.g.,
if ( (permno == 93436) and (year == 2018) ) check( rankmktcap , 103 )
Task 4 (CRSP): Create a “Risk/Factor Exposure” Data Set
Within each calendar year, calculate a dataset of factor exposures for each firm.
-
First, download the daily factors from Ken French’s excellent data website. The specific file you want is Fama/French 5 Factors (2x3) [Daily]. This contains such lines as
Mkt-RF SMB HML RMW CMA RF 19630701 -0.67 0.00 -0.32 -0.01 0.15 0.012 19630702 0.79 -0.27 0.27 -0.07 -0.19 0.012 ... 20190102 0.23 0.73 1.15 -0.13 0.26 0.010 20190103 -2.45 0.52 1.23 -0.25 0.91 0.010
Each is the rate of return on a specific day, quoted in percent. So 0.79 is not 79%, but 0.79%.
-
Within each calendar year, run a (multivariate) time-series regression of returns net of RF for each stock on these 5 factors. Example Observation: AAPL is permno 14593. The regression’s second input observation for 2019 should be
y = -0.099607 - 0.00010 x1 = -0.0245 x2 = 0.0052 x3 = 0.0123 x4 = -0.0256 x5 = 0.0091
Your data output should contain, among others, one line that looks something like
permno,yyyy,a,bm,bsmb,bhml,brmw,bcma,N 14593,2019,-0.0075,0.0138,-0.0006,-0.0032,0.00942,-0.00515,252 ...
(FYI, it will become clear lateron that “a” is the alpha that financiers are often talking about.)
Task 5 (Compustat): Create a “Financial Statement Signal” Data Set
Many quant investment strategies depend on financial statement information. This data can be accessed via the Compustat (= Standard&Poors).
-
For each stock and year, pull off the annual earnings
-
Test a few cases. Find one stock each that ended its quarter in Mar, one in June, one in Sep, and one in Dec (most common).
-
Be careful about how Compustat quotes fiscal year.
datadate,year,gvkey,cusip,tic,netincome 2018,20180930,001690,037833100,aapl,59531 2019,20190930,001690,037833100,aapl,55256 2017,20171231,184996,88160R101,tsla,-1961.4 2018,20181231,184996,88160R101,tsla,-976.1 2019,20191231,184996,88160R101,tsla,-862.0
-
-
In real life, firms report about 3 months after the quarter end. Thus, we cannot build quant strategies that incept in January and use December fiscal-year end financial statement variables. When we incept our strategy at the end of the calendar year, we can only use data from Sep or earlier (presumably known publicly around 3 months later).
usedate,cusip,netincome 20181231,037833100,59531 20191231,037833100,55256 20181231,88160R101,-1961.4 20191231,88160R101,-976.1
-
The problem is that some earnings can be quite old. Tesla here is our best example. By the time we would use them, they would be 9 months old. For a fast-growing company, this is less than ideal.
-
When a stock ends its fiscal year in Dec, you may want to calculate trailing-twelve-month (TTM) earnings.
20180331 2018 1 TSLA -709.5510 20180630 2018 2 TSLA -717.5390 20180930 2018 3 TSLA 311.5160 20181231 2018 4 TSLA 139.4830 20190331 2019 1 TSLA -702.1350 20190630 2019 2 TSLA -408.3340 20190930 2019 3 TSLA 143.4690 20191231 2019 4 TSLA 105.0000 20200331 2020 1 TSLA 16.0000 20200630 2020 2 TSLA 104.0000
So, you could calculate that for investing on 20191231, you could legimately know
139.4830 + (-702.135) + (-408.334) + 143.469 = -827.5
. -
This can be quite painful and may not be worth it…but you have to decide whether you want to do this.
This is only for the programmer in each group.
-
-
As always, handcheck a few outliers against SEC or Bloomberg information.
- Test a few cases. Find one stock each that ends its quarter in Mar, one in June, one in Sep, and one in Dec (most common).
-
Create some test cases, e.g.,
if ( (cusip == "88160R101") and (usedate == 20191231) ) check( netincome , -976.1 )
Task 6 (Combine)
-
This is only for the programmers in each group.
-
Create a large data set that contains three data sets. That is, for backtesting, we want a data set that
permno,cusip,ticker,year,yyyymmdd.1,yyyymmdd.10,yyyymmdd.T,ret.1,ret.10,lagmcap,lagrankmcap,lagnetincome 14593,03783310.AAPL,2019,20190102,20190110,20191231,0.8897,0.9943,252746079173,1,59531
You can constrain this data set to contain only stocks in the top-2000 stocks at the previous year.
-
This data set makes it easy to test if the
lagnetincome
signal predicted return—just regresslagnetincome
onret.10
. -
This data set is also small. For 2,000 stocks and 50 years, with each line being about 100 bytes, the data set is about 10 MB. Even the cheapest Apple laptop has 1,000 times as much RAM.
Task 7 (Cross-Check)
-
Swap your data sets with those of other groups. What are the correlations of the variables? Are they 100% as they should be?
-
PS: For statistical analysis, you can use any package you like. We shall really need only the following capabilities: data input/output, univariate statistics (means, standard deviations, T-stats); plotting; correlations; multivariate regressions. These capabilities will be in any rudimentary statistics package. Most commonly, researchers use:
-
R
-
Python. Python can also plot, run correlations and multivariate regressions.
-
Stata
Excel will be able to do this, too, but the experience will be so miserable that you will wish you had spent the 60 minutes to learn the alternatives instead. It is also a big advantage to be able to automate such tasks—though for this particular cross-check exercise, you don’t need it.
-