Fund Cash Flow Predictive Model

Ethan Witkowski

The purpose of this document is to describe a model that predicts T+1 cash flow for fixed income funds. This may be useful for portfolio managers, as they can place trades on T for cash they expect on T+1; this may help reduce index tracking error or provide greater flexibility with respect to trade timing for active funds.

Although there is some overlap, there are three core steps in my modeling workflow:

  1. Selecting the model's functional form and features
  2. Identifying parameter values that minimize the objective function
  3. Inputting T data to the function, which outputs the T+1 prediction

Basic Structure

The model predicts T+1 cash flow $c$ for fund $f$ for each client account, indexed $i = 1, \dots, n$. To make the computation feasible, I will initially restrict the model to include observations from only one fund. The model sums each $c$ prediction to output the total predicted cash flow for fund $f$ at T+1: $C = \sum_{i=1}^{n} c_{i} = c_1 + \dots + c_n$.

1. Selection

Functional Form

I hope to keep a level of interpretability in the model, so I can explain the impact each input variable has on T+1 cash flow predictions. Therefore, I'll use least squares regression, with the option to apply regularization. The functional form of the model is as follows:

$y = X\beta$

$y$ is an $m \cdot n \cdot j$-element cash flow vector, where $m$ is the number of funds, indexed $f = 1, \dots, m$, and $j$ is the number of T+1 observations. $\beta$ is a $(k+1)$-vector, where each element is an estimated parameter for an input variable. $X$ is an $m \cdot n \cdot t \times (k+1)$ matrix, where $t$ is the number of T observations. Each row in $X$ holds all the input variable data for one observation. Below, I show the expanded structure of the above equation (remember, I restrict $m=1$):


$ \begin{array}{ccccccccccccccc}
y_{1} & = & \beta_{0}\cdot 1 & + & \beta_{1}\cdot 0 & + & 0 & + & \dots & + & 0 & + & \dots & + & \beta_{k}\cdot x_{1,k}\\
\vdots & = & \vdots & + & \vdots & + & \vdots & + & \dots & + & \vdots & + & \dots & + & \vdots\\
y_{j+1} & = & \vdots & + & 0 & + & \beta_{2}\cdot 1 & + & \dots & + & \vdots & + & \dots & + & \beta_{k}\cdot x_{t+1,k}\\
\vdots & = & \vdots & + & \vdots & + & \vdots & + & \dots & + & \vdots & + & \dots & + & \vdots\\
y_{n \cdot j} & = & \beta_{0}\cdot 1 & + & 0 & + & 0 & + & \dots & + & \beta_{n}\cdot 1 & + & \dots & + & \beta_{k}\cdot x_{n \cdot t,k}
\end{array} $


I include a $\beta_{0}\cdot 1$ term in each row of the $X\beta$ matrix to accommodate an intercept in the regression model, so the function isn't bound to the origin. I also include $n$ fixed effect terms, which accommodate heterogeneity across client accounts. I use one arbitrary account (in this case, the account associated with the $\beta_1$ parameter) as the reference level. The reference level is necessary to avoid the dummy variable trap, and I implement it using dummy coding. With dummy coding, the intercept is interpreted as the estimated mean daily cash flow for the reference level account at $X=0$, and each account fixed effect parameter is the difference between that account's and the reference level account's estimated mean daily cash flow at $X=0$.
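As a minimal sketch of this dummy coding (assuming pandas; the DataFrame and column names are hypothetical placeholders):

```python
import pandas as pd

# Hypothetical panel data: one row per account-day observation.
df = pd.DataFrame({
    "account": ["0000001", "0000001", "0000002", "0000002"],
    "cash_flow_t": [1100.0, 0.0, 152000.0, -2700.0],
})

# drop_first=True drops one account's dummy column, making that account
# the reference level and avoiding the dummy variable trap.
X = pd.get_dummies(df, columns=["account"], drop_first=True)
print(X)
```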

This model would be considered a panel fixed effects multiple linear (affine) regression, and as a result, we can only incorporate non-linearities by transforming the input data (e.g. $\beta_1 x + \beta_2 x^2$). Functional forms that are nonlinear in the parameters (e.g. $e^{\beta x}$) are not available. I will only incorporate nonlinear functional forms if there is established theory supporting their use; otherwise, I don't believe the potential benefit justifies the increased computational complexity of nonlinear least squares (e.g. implementing the Levenberg-Marquardt algorithm).
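For instance, a squared term enters as an additional column of the design matrix, leaving the model linear in its parameters; a minimal numpy sketch (the variable x is hypothetical):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])               # hypothetical input variable
X = np.column_stack([np.ones_like(x), x, x**2])  # intercept, x, x^2
# The model stays linear in the parameters: y ~ X @ b = b0 + b1*x + b2*x^2
```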

Features

Features are the input variables in the model, and I need to gather historical data for these features to estimate their parameters. As shown in the functional form section, the set of features would be considered a panel dataset. Panel data is composed of observations across multiple time steps (days) for multiple individuals (accounts). Below is a tabular example of panel data (accounts not dummy coded, for readability):


Fund     Account    Time          T+1 Cash Flow
VEMBX    0000001    06/07/2021            1,100
VEMBX    0000001    06/08/2021                0
VEMBX    0000002    06/07/2021          152,000
VEMBX    0000002    06/08/2021           -2,700


My strategy for generating predictive features is to place myself, to the best of my ability, in the shoes of clients that may interact with fund $f$ on T+1. For example, if I am an institutional client, I may have one account for my 401k that places a portion of my salary into the fund every two weeks. In this case, a feature indicating that the account is institutional and a feature for the number of days since the account's last cash interaction with fund $f$ may have strong predictive ability (a sketch of the latter follows the feature list). Below is a non-exhaustive list of features that may have predictive power (roughly in order of importance):

Fund Database

  • Fund AUM T, T-1, etc.
  • Fund cash flows (necessary outcome variable)
    • Direct Purchase Journal
    • Direct Redemption Journal
    • Exchange In Journal
    • Exchange Out Journal

SEC

  • Monthly fund $f$ expense ratios by shareclass
  • Monthly competitor/substitute funds' expense ratios by shareclass

Client Databases

  • Client ID of account $i$ owner
  • Client owns account $i$ associated with fund $f$
  • Client Type
  • Days since fund $f$ account $i$ creation
  • Days since last cash interaction with fund $f$
  • Client zip code
  • Client's company-managed assets value
  • Client's outside holdings
  • Fund $f$ account $i$ cash flow T, T-1, etc.
  • Performance of fund $f$ and competitors to fund $f$ T, T-1, etc.

Market Data Provider

  • Interest rates T, T-1, etc.
  • Prices for T, T-1, etc.
  • Futures
  • Macroeconomic data
  • Market sentiment
  • Probability of recession model output
  • Default rates

Large Transactions Database

  • Large transaction approved amount for fund $f$ for account $i$

Calendar Database

  • T+1 US federal holiday
  • T+1 market holiday
  • T+1 market early close
  • Month of year
  • Number of days until next FOMC meeting

Fund News Aggregator

  • Fund manager mentioned in news article on T, T-1, etc.
  • Sentiment of above article
  • Number of times company mentioned in articles on T, T-1, etc.
  • Sentiment of above articles

Zip Code Statistics

  • Join with client zip
  • Percent of zip code population employed in finance, insurance, or real estate
  • Mean, median zip code household income and benefits
  • Mean zip code age
  • Mean zip code home value
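As one example of constructing these features, here is a minimal sketch of the "days since last cash interaction with fund $f$" feature (assuming pandas; the table and column names are hypothetical placeholders):

```python
import pandas as pd

# Hypothetical account-level cash flow history for fund f.
flows = pd.DataFrame({
    "account": ["0000001", "0000001", "0000002"],
    "date": pd.to_datetime(["2021-06-01", "2021-06-07", "2021-06-04"]),
    "cash_flow": [500.0, 1100.0, 152000.0],
})

as_of = pd.Timestamp("2021-06-08")  # day T

# Date of each account's last nonzero cash interaction with the fund.
last_flow = flows[flows["cash_flow"] != 0].groupby("account")["date"].max()

# Days elapsed between that interaction and day T.
days_since = (as_of - last_flow).dt.days.rename("days_since_last_flow")
print(days_since)
```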

2. Identification

My goal is to describe the relationship between each input variable and fund $f$'s T+1 cash flow; this is known as regression. For example, how will fund $f$'s T+1 cash flow change if the fund manager is praised in a Barron's article on T? These relationships are described by the $\mathbb{R}^{k+1} \rightarrow \mathbb{R}$ function that relates the T input variables $X$ to fund $f$'s T+1 cash flow $y$. However, because the system is overdetermined ($m \cdot n \cdot j > k+1$), there is generally no value of $\beta$ that solves it exactly; assuming the columns of $X$ are linearly independent (testable with a rank computation or a Gram-Schmidt implementation), the least squares solution below exists and is unique.
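Linear independence of the columns can be checked numerically; in practice a rank computation (numpy computes it via the SVD) is simpler than a hand-rolled Gram-Schmidt. A minimal sketch, with a hypothetical design matrix:

```python
import numpy as np

X = np.random.default_rng(0).normal(size=(100, 5))  # hypothetical design matrix

# Columns are linearly independent iff the rank equals the column count.
print(np.linalg.matrix_rank(X) == X.shape[1])
```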

Therefore, I will use the method of least squares to find the $\beta$ parameter values that provide an approximate solution. Least squares accomplishes this by minimizing the objective function $||r||^2 = ||X\beta - y||^2$, where $r$ is known as the residual. In doing so, I am essentially finding the best-fit hyperplane through the historical data in $k+1$ dimensions.

Least Squares

Least squares minimizes the objective function $f(\beta) = ||X\beta - y||^2$ by finding its gradient and setting it equal to zero:

$\nabla f(\hat{\beta}) = 2X^T (X\hat{\beta}-y) = 0$

$X^T X \hat{\beta} = X^T y$

Approximate Solution:

$\hat{\beta} = (X^{T} X)^{-1}X^T y$

$\hat{\beta}$ is the vector of parameters that minimizes the objective function above.
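A minimal numpy sketch of this solution (the data here are simulated placeholders; np.linalg.lstsq is the more numerically stable route in practice):

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.column_stack([np.ones(200), rng.normal(size=(200, 3))])  # intercept + 3 features
y = X @ np.array([1.0, 2.0, -0.5, 0.3]) + rng.normal(scale=0.1, size=200)

# Normal equations: beta_hat = (X^T X)^{-1} X^T y
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Equivalent, but better conditioned numerically:
beta_hat_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
```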

Regularization

Regularization is done to improve the performance of models on unseen data; it is particularly effective when the number of features is close to the number of observations. Shrinkage is a type of regularization, and I will focus on one shrinkage method in this section: ridge regression. Ridge regression is very similar to ordinary least squares, in that it finds the $\hat{\beta}$ that minimizes an objective function, here $||X\beta - y||^2 + \lambda||\beta_{1 \dots k}||^2$ (the intercept is not penalized). By including the second term, ridge regression "shrinks" the $\hat{\beta}$ parameters toward zero. The degree of shrinkage depends on the chosen hyperparameter $\lambda$. Shrinking is intended to make the model more generalizable to unseen data, and in turn ward off "overfitting" to the available historical data. If the model exhibits poor performance on test data, I will implement ridge regression.
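A minimal sketch of the closed-form ridge solution, leaving the intercept unpenalized as in the objective above (lam and the matrix layout are assumptions; column 0 is taken to be the intercept):

```python
import numpy as np

def ridge(X, y, lam):
    """Minimize ||X b - y||^2 + lam * ||b[1:]||^2 in closed form.

    Assumes column 0 of X is the intercept, which is left unpenalized.
    """
    penalty = np.eye(X.shape[1])
    penalty[0, 0] = 0.0  # do not shrink the intercept
    return np.linalg.solve(X.T @ X + lam * penalty, X.T @ y)
```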

Cross Validation

Cross validation is a method used to test a model's stability and tune hyperparameters, such as $\lambda$ in ridge regression. It works by splitting the historical data into $z$ folds, fitting $z$ models, each on a distinct set of $z-1$ folds, and testing each model's accuracy on its left-out fold. This allows one to find the hyperparameter value that performs best on unseen data, and to check whether the parameters are stable across the $z$ models. If the necessary compute is available and/or ridge regression is used, I will implement cross validation.
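A minimal sketch of $z$-fold cross validation for tuning $\lambda$, reusing the hypothetical ridge function from the previous sketch:

```python
import numpy as np

def cv_mse(X, y, lam, z=10, seed=0):
    """Mean squared held-out error of ridge(lam) across z folds."""
    idx = np.random.default_rng(seed).permutation(len(y))
    errs = []
    for held_out in np.array_split(idx, z):
        train = np.setdiff1d(idx, held_out)
        b = ridge(X[train], y[train], lam)     # fit on z-1 folds
        resid = X[held_out] @ b - y[held_out]  # test on the left-out fold
        errs.append(np.mean(resid ** 2))
    return np.mean(errs)

# Hypothetical grid search: pick the lambda with the lowest CV error.
# best_lam = min([0.01, 0.1, 1.0, 10.0], key=lambda lam: cv_mse(X, y, lam))
```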

3. Prediction

As mentioned in the Basic Structure section, the account-level T+1 cash flow predictions $c$ are summed to output the total T+1 cash flow $C$ predicted for fund $f$. To produce each $c_i$, I input the $k$ data points associated with account $i$ on day T into the function (with estimated parameters $\hat{\beta}$), which outputs the predicted T+1 cash flow for fund $f$ for account $i$; summing these outputs gives the total prediction.
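A minimal sketch of this step (beta_hat and the day-T feature rows are simulated placeholders):

```python
import numpy as np

rng = np.random.default_rng(0)
beta_hat = np.array([1.0, 2.0, -0.5])                         # hypothetical fitted parameters
X_T = np.column_stack([np.ones(4), rng.normal(size=(4, 2))])  # one day-T row per account

c_hat = X_T @ beta_hat  # per-account T+1 cash flow predictions
C_hat = c_hat.sum()     # total predicted T+1 cash flow for fund f
print(C_hat)
```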

Requirements

Computation

Given that certain asset managers have millions of clients, and many more accounts, there is a real concern that a basic desktop computer cannot perform this computation in a reasonable amount of time. Below, I estimate the time and space complexity of fitting the model:

  • Number of observations $= n \cdot j$
  • Number of features $= k + 1 \approx k = n + (k-n)$, where $n \gg (k-n)$: the $n$ account fixed effect dummies dominate the feature count
  • Ridge regression is approximately the same time complexity as ordinary least squares
  • If cross validation is implemented, the time complexity scales linearly with the number of folds

Time Complexity

The time complexity of least squares is approximately $O(k^3 + k^2(n \cdot j))$: forming $X^T X$ costs $O(k^2 (n \cdot j))$ and solving the normal equations costs $O(k^3)$. Therefore, if regularization and 10-fold cross validation were implemented, the total number of operations would be roughly 10 times the above value.
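For a rough sense of scale (the counts below are purely illustrative assumptions, not estimates from real data): with $k \approx 10^4$ features and $n \cdot j \approx 10^6$ observations, the dominant term is $k^2 (n \cdot j) \approx 10^{14}$ operations, with the $k^3 \approx 10^{12}$ solve adding comparatively little; 10-fold cross validation then multiplies the total by roughly 10.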

Space Complexity

The space complexity of the least squares problem is approximately $O(k(n \cdot j) + (n \cdot j))$: the design matrix $X$ dominates, plus the outcome vector $y$.

Solutions

Clearly, the number of features is the main driver of compute in this model, and the main driver of the number of features is the number of accounts. A reasonable first step is to include only accounts that meet one or more of these criteria:

  • Account invests in fund $f$
  • Account is owned by a client in the upper $x$ percentile of owned assets

To lessen compute requirements, I may also need to use a dimensionality reduction method, such as principal component analysis (PCA). However, this would eliminate any ability to interpret the relationship between the input variables and fund $f$'s T+1 cash flow. Another potential solution is to do away with the account fixed effects and only include client assets and client type; this may be the best option from a cost-benefit perspective. Even with these measures, compute through a cloud provider may still be necessary to fit this model.
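If dimensionality reduction does prove necessary, a minimal PCA sketch via the SVD (the component count is a hypothetical choice, and the interpretability caveat above applies):

```python
import numpy as np

def pca_reduce(X, n_components):
    """Project centered features onto their top principal components."""
    Xc = X - X.mean(axis=0)                            # center each column
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)  # rows of Vt = components
    return Xc @ Vt[:n_components].T                    # component scores

# e.g. X_reduced = pca_reduce(X, n_components=50)
```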