Python Lesson 2 Problem Set

Ethan Witkowski

Open a new Jupyter notebook, and complete the exercises. Please reference the Pandas cheatsheet found here.

1.

Implement basic pandas operations.

a.

Import the pandas package with the name "pd".
What is a Python package?

Answer
   
import pandas as pd
'''
Packages consist of objects, methods, and functions that we can use.  
We import the packages so we don't have to manually re-write all of their code.
'''   
   

b.

Create a pandas dataframe called client_assets using the client_assets list data from question 6 of the first problem set.
Display the head of client_assets.

Answer
   
client_assets = pd.DataFrame({"clients" : ["10453", "10983", "10376"], "assets" : [70000, 400000, 5500]})
client_assets.head()
Output:
  clients  assets
0   10453   70000
1   10983  400000
2   10376    5500
    

c.

Select rows with assets over 50,000 dollars.

Answer
   
client_assets[client_assets['assets'] > 50000]
Output:
  clients  assets
0   10453   70000
1   10983  400000
    

d.

Print the total and average assets.

Answer
   
print(client_assets['assets'].sum())
print(client_assets['assets'].mean())
Output:
475500
158500.0
    

e.

Using the .iloc method, select all columns for the second observation.

Answer
   
client_assets.iloc[1,:]
Output:
clients     10983
assets     400000
Name: 1, dtype: object
    

2.

Use the EDGAR datasets published by the SEC to analyze public firms' revenues. Specifically, we'd like to obtain summary statistics for firms' quarterly revenues across industries. Please reference the readme.htm file for information on the data; we will be using the num.txt and sub.txt files.

a.

Open the num.txt file using the Notepad. If you Ctrl+A, you'll find that the file holds more than 2 million rows. We cannot use excel to process this file, because excel can only handle ~ 1 million rows.

b.

Instead, load the num.txt file as num_data into an IPython cell.
Display the head of num_data.

Answer
   
num_data = pd.read_csv('data/num.txt', sep='\t')
num_data.head()
    

c.

Unfortunately, num_data doesn't include a firm name for each entry.
Load the sub.txt file as sub_data into an IPython cell.

Answer
   
sub_data = pd.read_csv('data/sub.txt', sep='\t') 
    

d.

Use the merge method to inner join num_data and sub_data on the column 'adsh'.

Answer
   
data = pd.merge(num_data, sub_data, how='inner', on='adsh')
    

e.

Confirm num_data and data have the same number of observations.

Answer
   
(len(num_data) == len(data))
Output:
True
    

f.

There are many columns that we don't necessarily need to accomplish our goal.
Assign a new dataframe called 'fin_data', with columns: name, sic, fy, fp, form, adsh, tag, value, qtrs, ddate, coreg.

Answer
   
fin_data = data[['name','sic','fy', 'fp', 'form','adsh','tag','value','qtrs','ddate','coreg']]
    

g.

Assign a dataframe "q_fin_data_2020" which only includes 10-Q reports. To accomplish this, you need to specify the 'form' column and the 'qtrs' column.
Assign another dataframe which holds the same quarterly data, but only includes "Revenues" in the 'tag' column.

Answer
   
q_fin_data = fin_data[(fin_data['form'] == '10-Q') & (fin_data['qtrs'] == 1)]
rev_q_fin_data = q_fin_data[q_fin_data['tag'] == "Revenues"]
    

h.

Import matplotlib.pyplot as plt.
Plot a histogram of firms' quarterly revenue using the 'hist()' function. Note that you can declare aesthetic arguments in the function.
Label the plot's title and axes.
How would you characterize the distribution of firms' quarterly revenue?
Please explain the distribution in non-technical terms.

Answer
   
import matplotlib.pyplot as plt
rev_q_fin_data.hist(column='value', bins=70, grid=False, figsize=(12,8), color='#86bf91', zorder=2, rwidth=0.9)
plt.title('Quarterly Revenue Distribution')
plt.xlabel('Quarterly Revenue ($)')
plt.ylabel('Frequency (# observations)')
plt.show()
'''
Heavily right-skewed.  
The majority of public firms have low relative quarterly revenues.  
'''
    

i.

Do you expect the mean or median to be larger for firms' quarterly revenue?
Compute the mean and median.

Answer
   
'''
We expect the mean to be larger than the median in right-skewed distributions. 
This is because the large values at the right tail of the distribution affect the mean, but not the median.  
'''
rev_median = rev_q_fin_data['value'].median()
rev_mean = rev_q_fin_data['value'].mean()
print('Median: ${:,.2f}'.format(rev_median))
print('Mean: ${:,.2f}'.format(rev_mean))
Output:
Median: $60,559,000.00
Mean: $1,048,628,355.56
    

j.

Assign a dataframe called "grouped_rev_q_fin_data" that groups rows in the 'sic' (industry classification) column and computes summary statistics (min, max, mean, median, std dev) of the 'value' column. You can accomplish this through using the 'groupby' and 'agg' methods.
Display the head of grouped_rev_q_fin_data.

Answer
   
grouped_rev_q_fin_data = rev_q_fin_data.groupby(['sic'])[['value']].agg(['min','max','mean','median','std']).reset_index()
grouped_rev_q_fin_data.head()
    

You now have quarterly revenue summary statistics for each of these industries.