Ivo Welch, October 2023

Programming and Data for ASAM (30m)

These tasks must be completed individually by each and every student. Newbies are not expected to be able to do it all without mentoring on their first encounters. They are expected to be able to do it all without help by the time the summer is over (and understand how this all works, so they can make changes!). Students who are unable to complete these tasks may fail this course!

Purpose

Learn Computer Programming
Learn CRSP and Compustat Data Organization and Handling
Create data sets that will be of important use to you throughout ASAM. If you understand how to create these data sets, you will have the basic capability to create and backtest any quant strategy you like.

Note: ASAM is not trying to make programmers out of MBA students. It does not matter at all whether this takes newbies a long time and/or whether the code is efficient. What matters is that newbies learn how to do this (albeit slowly), in order to appreciate what their future employee programmers are really doing—what they can and cannot reasonably create.

(There is no great practical book on quantitative finance programming, data, and analysis. Perhaps the closest is Andrew Ang’s “Asset Management.”)

Task 1: Introduction

Learn Programming (Python).
- write a program that reads a file with lines containing sequential rates of returns of a hypothetical asset, and that calculates and prints the compound rate of return.
Learn WRDS, specifically CRSP and Compustat (incl. their error codes).
Download large datasets to your computer from WRDS.
- I strongly suggest that you download the entire monthly CRSP and annual Compustat data to your computer one-time using the WRDS interactive website GUI (unless you are a masochist and want to learn their SQL way), and then work off these data sets locally.
Write some sample programs, e.g., write a program that requests a permno and date and then tells me the stock’s marketcap on this date.

Fun optional learning task: Programming capability is a superpower. It is not just the best tool for training logic. It also gives the ability to solve many problems otherwise impossible to attack. For example, assume there are 330 million people. Each day, every person meets 10 other randomly selected persons. An infected person infects another (s)he meets with a probability of 20% if this other person was never infected. After 2 weeks, with 1% probability, the infected person dies. With 99% probability, (s)he becomes immune, healthy, and non-contagious. Start with 10 infected persons. How does the mortality curve look like? What probability would match the US curve? How does an increase or a decrease in the number of encounters and/or the probability of infection change the curve?

Task 2 (CRSP): Create “Annual Return” Data Sets

Create two annual return data sets.

Download the relevant data from WRDS first. You can start with data from 1973.
The computer code to massage and analyze data should sit on your own computer. Do not try to do this on the WRDS servers. They are slow.
- after you have the full data set, also consider creating a shorter test data set with only ~10,000 input lines for development purposes. For example, grab the subset of firm-days with permnos 14593 (AAPL) and 93436 (TSLA) from 2016 to 2020. There is no reason to debug your programs with 97,269,745 lines of input. You would end up waiting for a long time, only to learn that you made a mistake somewhere in your code.
Write a flexible python function that can calculate holding rates of return, given a start data and an end date
- a holding rate (i.e., compound rate of return) is not the percent change in price. You must use the “1 plus” formula.
- for example, you could define
```
 def rate_of_return( permno, startdate, enddate )
    "calculate the rate of return for a stock given a starting and an end date"
    blah blah blah
return blie
```
Use your function to create two similar data sets:
- Create one data set with annual Jan-Dec buy-and-hold rates of return for all stock-years. Call it crsp-jan1-dec31.csv. It should have output lines like
```
permno,cusip,ticker,year,yyyymmdd.1,yyyymmdd.T,ret
14593,03783310.AAPL,2019,20190102,20191231,0.8897
...
```
- Create an almost identical data set, but start return calculations a few days after the year starts (say, around Jan 10) instead of Jan 1. The data set should still end each year on Dec 31. Call it crsp-jan10-dec31.csv.
```
permno,cusip,ticker,year,yyyymmdd.1,yyyymmdd.T,ret
14593,03783310.AAPL,2019,20190110,20191231,0.9943
...
```
- You can also combine these two into one more convenient data set, call it crsp-dec31.csv:
```
permno,cusip,ticker,year,yyyymmdd.1,yyyymmdd.10,yyyymmdd.T,ret.1,ret.10
14593,03783310.AAPL,2019,20190102,20190110,20191231,0.8897,0.9943
...
```
Create some test cases for your code, for example
```
  if ((permno == 14593) and (year == 2019)) check( ret.1 , 0.8897 )
  if ((permno == 14593) and (year == 2019)) check( ret.10 , 0.9943 )
```
You can write the check function yourself. (Warning: computers are not perfect at arithmetic. Just check that the number is close.)

A good set of test cases helps you ensure that if you make some changes later, you can quickly check whether your programs still work. This is especially the case in multi-stage programs. Professional programmers often write many many test cases for code that is being reused before they write the computer program that creates the output.
Handcheck some outliers:
- For example, which firms had the biggest returns in the year? Does this seem right?
- For example, which firms had the biggest difference in crsp-jan1-dec31 vs crsp-jan10-dec31 .
Question: If your boss changes her mind and wants you to start 5 trading days after each new year, how difficult would it be for you to satisfy her?

Task 3 (CRSP): Create a “Firm-Size” Data Set

For each stock-year calculate the marketcap on the last trading day of each calendar year.
Also, calculate the rank of each stock on this last day of the year. (NA if not trading on last day of year)

Save your data as a mcap.csv data set. For example, it may contain lines such as

permno,cusip,ticker,year,yyyymmdd,mktcap,rankmktcap
14593,03783310,AAPL,2018,20181231,746079173,1
93436,88160R10,TSLA,2018,20181231,57441944,103
...

Handcheck some outliers:
- For example, which firms had the biggest change in rank from year to year? Does this seem right? Can you handcheck the basics?

Create some test cases, e.g.,

   if ( (permno == 93436) and (year == 2018) ) check( rankmktcap , 103 )

Task 4 (CRSP): Create a “Risk/Factor Exposure” Data Set

Within each calendar year, calculate a dataset of factor exposures for each firm.

First, download the daily factors from Ken French’s excellent data website. The specific file you want is Fama/French 5 Factors (2x3) [Daily]. This contains such lines as

           Mkt-RF     SMB     HML     RMW     CMA      RF
 19630701   -0.67    0.00   -0.32   -0.01    0.15   0.012
 19630702    0.79   -0.27    0.27   -0.07   -0.19   0.012
 ...
 20190102    0.23    0.73    1.15   -0.13    0.26   0.010
 20190103   -2.45    0.52    1.23   -0.25    0.91   0.010

Each is the rate of return on a specific day, quoted in percent. So 0.79 is not 79%, but 0.79%.

Within each calendar year, run a (multivariate) time-series regression of returns net of RF for each stock on these 5 factors. Example Observation: AAPL is permno 14593. The regression’s second input observation for 2019 should be
```
 y = -0.099607 - 0.00010
 x1 = -0.0245
 x2 = 0.0052
 x3 = 0.0123
 x4 = -0.0256
 x5 = 0.0091
```
Your data output should contain, among others, one line that looks something like
```
 permno,yyyy,a,bm,bsmb,bhml,brmw,bcma,N
 14593,2019,-0.0075,0.0138,-0.0006,-0.0032,0.00942,-0.00515,252
 ...
```
(FYI, it will become clear lateron that “a” is the alpha that financiers are often talking about.)

Task 5 (Compustat): Create a “Financial Statement Signal” Data Set

Many quant investment strategies depend on financial statement information. This data can be accessed via the Compustat (= Standard&Poors).

For each stock and year, pull off the annual earnings

Test a few cases. Find one stock each that ended its quarter in Mar, one in June, one in Sep, and one in Dec (most common).
Be careful about how Compustat quotes fiscal year.

datadate,year,gvkey,cusip,tic,netincome
2018,20180930,001690,037833100,aapl,59531
2019,20190930,001690,037833100,aapl,55256
2017,20171231,184996,88160R101,tsla,-1961.4
2018,20181231,184996,88160R101,tsla,-976.1
2019,20191231,184996,88160R101,tsla,-862.0

In real life, firms report about 3 months after the quarter end. Thus, we cannot build quant strategies that incept in January and use December fiscal-year end financial statement variables. When we incept our strategy at the end of the calendar year, we can only use data from Sep or earlier (presumably known publicly around 3 months later).
```
usedate,cusip,netincome
20181231,037833100,59531
20191231,037833100,55256
20181231,88160R101,-1961.4
20191231,88160R101,-976.1
```
The problem is that some earnings can be quite old. Tesla here is our best example. By the time we would use them, they would be 9 months old. For a fast-growing company, this is less than ideal.
- When a stock ends its fiscal year in Dec, you may want to calculate trailing-twelve-month (TTM) earnings.
```
20180331 	2018 	1 	TSLA 	-709.5510
20180630 	2018 	2 	TSLA 	-717.5390
20180930 	2018 	3 	TSLA 	311.5160 
20181231 	2018 	4 	TSLA 	139.4830 
20190331 	2019 	1 	TSLA 	-702.1350
20190630 	2019 	2 	TSLA 	-408.3340
20190930 	2019 	3 	TSLA 	143.4690 
20191231 	2019 	4 	TSLA 	105.0000 
20200331 	2020 	1 	TSLA 	16.0000  
20200630 	2020 	2 	TSLA 	104.0000 
```
  So, you could calculate that for investing on 20191231, you could legimately know 139.4830 + (-702.135) + (-408.334) + 143.469 = -827.5.
- This can be quite painful and may not be worth it…but you have to decide whether you want to do this.
  
  This is only for the programmer in each group.
As always, handcheck a few outliers against SEC or Bloomberg information.
- Test a few cases. Find one stock each that ends its quarter in Mar, one in June, one in Sep, and one in Dec (most common).

Create some test cases, e.g.,

   if ( (cusip == "88160R101") and (usedate == 20191231) ) check( netincome , -976.1 )

Task 6 (Combine)

This is only for the programmers in each group.
Create a large data set that contains three data sets. That is, for backtesting, we want a data set that
```
permno,cusip,ticker,year,yyyymmdd.1,yyyymmdd.10,yyyymmdd.T,ret.1,ret.10,lagmcap,lagrankmcap,lagnetincome
14593,03783310.AAPL,2019,20190102,20190110,20191231,0.8897,0.9943,252746079173,1,59531
```
You can constrain this data set to contain only stocks in the top-2000 stocks at the previous year.
This data set makes it easy to test if the lagnetincome signal predicted return—just regress lagnetincome on ret.10.
This data set is also small. For 2,000 stocks and 50 years, with each line being about 100 bytes, the data set is about 10 MB. Even the cheapest Apple laptop has 1,000 times as much RAM.

Task 7 (Cross-Check)

Swap your data sets with those of other groups. What are the correlations of the variables? Are they 100% as they should be?
PS: For statistical analysis, you can use any package you like. We shall really need only the following capabilities: data input/output, univariate statistics (means, standard deviations, T-stats); plotting; correlations; multivariate regressions. These capabilities will be in any rudimentary statistics package. Most commonly, researchers use:
- R
- Python. Python can also plot, run correlations and multivariate regressions.
- Stata
Excel will be able to do this, too, but the experience will be so miserable that you will wish you had spent the 60 minutes to learn the alternatives instead. It is also a big advantage to be able to automate such tasks—though for this particular cross-check exercise, you don’t need it.