Create a pandas dataframe called client_assets using the client_assets list data from question 6 of the first problem set.
Display the head of client_assets.
client_assets = pd.DataFrame({"clients" : ["10453", "10983", "10376"], "assets" : [70000, 400000, 5500]})
client_assets.head()
Output:
clients assets
0 10453 70000
1 10983 400000
2 10376 5500
Select rows with assets over 50,000 dollars.
client_assets[client_assets['assets'] > 50000]
Output:
clients assets
0 10453 70000
1 10983 400000
Print the total and average assets.
print(client_assets['assets'].sum())
print(client_assets['assets'].mean())
Output:
475500
158500.0
Using the .iloc method, select all columns for the second observation.
client_assets.iloc[1,:]
Output:
clients 10983
assets 400000
Name: 1, dtype: object
Use the EDGAR datasets published by the SEC to analyze public firms' revenues. Specifically, we'd like to obtain summary statistics for firms' quarterly revenues across industries. Please reference the readme.htm file for information on the data; we will be using the num.txt and sub.txt files.
Open the num.txt file using the Notepad. If you Ctrl+A, you'll find that the file holds more than 2 million rows. We cannot use excel to process this file, because excel can only handle ~ 1 million rows.
Instead, load the num.txt file as num_data into an IPython cell.
Display the head of num_data.
num_data = pd.read_csv('data/num.txt', sep='\t')
num_data.head()
Unfortunately, num_data doesn't include a firm name for each entry.
Load the sub.txt file as sub_data into an IPython cell.
sub_data = pd.read_csv('data/sub.txt', sep='\t')
Use the merge method to inner join num_data and sub_data on the column 'adsh'.
data = pd.merge(num_data, sub_data, how='inner', on='adsh')
Confirm num_data and data have the same number of observations.
(len(num_data) == len(data))
Output:
True
There are many columns that we don't necessarily need to accomplish our goal.
Assign a new dataframe called 'fin_data', with columns: name, sic, fy, fp, form, adsh, tag, value, qtrs, ddate, coreg.
fin_data = data[['name','sic','fy', 'fp', 'form','adsh','tag','value','qtrs','ddate','coreg']]
Assign a dataframe "q_fin_data_2020" which only includes 10-Q reports. To accomplish this, you need to specify the 'form' column and the 'qtrs' column.
Assign another dataframe which holds the same quarterly data, but only includes "Revenues" in the 'tag' column.
q_fin_data = fin_data[(fin_data['form'] == '10-Q') & (fin_data['qtrs'] == 1)]
rev_q_fin_data = q_fin_data[q_fin_data['tag'] == "Revenues"]
Import matplotlib.pyplot as plt.
Plot a histogram of firms' quarterly revenue using the 'hist()' function. Note that you can declare aesthetic arguments in the function.
Label the plot's title and axes.
How would you characterize the distribution of firms' quarterly revenue?
Please explain the distribution in non-technical terms.
import matplotlib.pyplot as plt
rev_q_fin_data.hist(column='value', bins=70, grid=False, figsize=(12,8), color='#86bf91', zorder=2, rwidth=0.9)
plt.title('Quarterly Revenue Distribution')
plt.xlabel('Quarterly Revenue ($)')
plt.ylabel('Frequency (# observations)')
plt.show()
'''
Heavily right-skewed.
The majority of public firms have low relative quarterly revenues.
'''
Do you expect the mean or median to be larger for firms' quarterly revenue?
Compute the mean and median.
'''
We expect the mean to be larger than the median in right-skewed distributions.
This is because the large values at the right tail of the distribution affect the mean, but not the median.
'''
rev_median = rev_q_fin_data['value'].median()
rev_mean = rev_q_fin_data['value'].mean()
print('Median: ${:,.2f}'.format(rev_median))
print('Mean: ${:,.2f}'.format(rev_mean))
Output:
Median: $60,559,000.00
Mean: $1,048,628,355.56
Assign a dataframe called "grouped_rev_q_fin_data" that groups rows in the 'sic' (industry classification) column and computes summary statistics (min, max, mean, median, std dev) of the 'value' column. You can accomplish this through using the 'groupby' and 'agg' methods.
Display the head of grouped_rev_q_fin_data.
grouped_rev_q_fin_data = rev_q_fin_data.groupby(['sic'])[['value']].agg(['min','max','mean','median','std']).reset_index()
grouped_rev_q_fin_data.head()
You now have quarterly revenue summary statistics for each of these industries.