---
output: github_document
---
<!-- README.md is generated from README.Rmd. Please edit that file -->
```{r, include = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>",
fig.path = "man/figures/README-",
out.width = "100%"
)
```
# SGDinference
<!-- badges: start -->
[![R-CMD-check](https://github.com/SGDinference-Lab/SGDinference/workflows/R-CMD-check/badge.svg)](https://github.com/SGDinference-Lab/SGDinference/actions)
[![codecov](https://codecov.io/gh/SGDinference-Lab/SGDinference/branch/master/graph/badge.svg?token=YTBY15IXEP)](https://app.codecov.io/gh/SGDinference-Lab/SGDinference)
<!-- badges: end -->
__SGDinference__ is an R package that provides estimation and inference methods for large-scale mean and quantile regression models via stochastic (sub-)gradient descent (S-subGD) algorithms. The inference procedure handles cross-sectional data sequentially:
(i) updating the parameter estimate with each incoming "new observation",
(ii) aggregating it as a Polyak-Ruppert average, and
(iii) computing an asymptotically pivotal statistic for inference through random scaling.
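
The three steps above can be sketched in base R for a linear mean regression. This is only an illustrative sketch, not the package's implementation: the function name `sgd_average_sketch`, the learning-rate schedule `gamma0 * i^(-a)` with its default values, and the zero starting value are assumptions. The random scaling matrix follows our reading of Lee et al. (2022) and is computed here in batch form from the stored path rather than updated online.

```r
# Illustrative sketch only; NOT the internal code of SGDinference.
# sgd_average_sketch, gamma0, and a are hypothetical names and values.
sgd_average_sketch <- function(X, y, gamma0 = 0.5, a = 0.501) {
  n <- nrow(X)
  p <- ncol(X)
  beta <- numeric(p)        # current SGD iterate, started at zero (assumption)
  beta_bar <- numeric(p)    # running Polyak-Ruppert average
  path <- matrix(0, nrow = n, ncol = p)
  for (i in seq_len(n)) {
    x_i <- X[i, ]
    # (i) gradient step for the squared loss 0.5 * (y_i - x_i'beta)^2
    grad <- -x_i * (y[i] - sum(x_i * beta))
    beta <- beta - gamma0 * i^(-a) * grad
    # (ii) recursive update of the average of beta_1, ..., beta_i
    beta_bar <- beta_bar + (beta - beta_bar) / i
    path[i, ] <- beta
  }
  # (iii) random scaling matrix built from scaled partial sums of
  #       the deviations beta_i - beta_bar
  dev <- sweep(path, 2, beta_bar)
  partial <- apply(dev, 2, cumsum) / sqrt(n)
  V <- crossprod(partial) / n
  list(coefficients = beta_bar, V = V)
}

# Simulated data with true coefficients (1, 2)
set.seed(1)
n <- 20000
x <- rnorm(n)
X <- cbind(1, x)
y_sim <- 1 + 2 * x + rnorm(n)
fit_sketch <- sgd_average_sketch(X, y_sim)
print(fit_sketch$coefficients)
```

Each observation is used once, and only the current iterate and the running average need to be stored to obtain the point estimate. The full path is kept here only to make step (iii) easy to read.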
The methodology used in the SGDinference package is described in detail in the following papers:
- Lee, S., Liao, Y., Seo, M.H. and Shin, Y., 2022. Fast and robust online inference with stochastic gradient descent via random scaling. In _Proceedings of the AAAI Conference on Artificial Intelligence_ (Vol. 36, No. 7, pp. 7381-7389).
<https://doi.org/10.1609/aaai.v36i7.20701>.
- Lee, S., Liao, Y., Seo, M.H. and Shin, Y., 2023. Fast Inference for Quantile Regression with Tens of Millions of Observations. arXiv:2209.14502 [econ.EM] <https://doi.org/10.48550/arXiv.2209.14502>.
## Installation
You can install the development version from [GitHub](https://github.com/) with:
``` r
# install.packages("devtools") # if you have not installed "devtools" package
devtools::install_github("SGDinference-Lab/SGDinference")
```
We begin by loading the SGDinference package.
```{r setup}
library(SGDinference)
set.seed(100723)
```
## Case Study: Estimating the Mincer Equation
To illustrate the usefulness of the package, we use a small dataset included in the package.
Specifically, the _Census2000_ dataset from Acemoglu and Autor (2011) consists of observations on 26,120 nonwhite, female workers. This small dataset is constructed from "microwage2000_ext.dta" at
<https://economics.mit.edu/people/faculty/david-h-autor/data-archive>.
Observations are dropped if hourly wages are missing or years of education are below 6.
Then, a 5 percent random sample is drawn to make the dataset small.
The following three variables are included:
- ln_hrwage: log hourly wages
- edyrs: years of education
- exp: years of potential experience
We now define the variables.
```{r}
y = Census2000$ln_hrwage
edu = Census2000$edyrs
exp = Census2000$exp
exp2 = exp^2/100
```
As a benchmark, we first estimate the Mincer equation by ordinary least squares and report the point estimates and their 95% heteroskedasticity-robust confidence intervals.
```{r}
mincer = lm(y ~ edu + exp + exp2)
inference = lmtest::coefci(mincer, df = Inf,
                           vcov = sandwich::vcovHC)
results = cbind(mincer$coefficients,inference)
colnames(results)[1] = "estimate"
print(results)
```
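
The benchmark interval above is an estimate plus or minus a normal critical value times a heteroskedasticity-robust standard error. As a rough illustration, the sketch below builds such an interval by hand in base R, using HC3 weights (the default type of `sandwich::vcovHC`). The simulated data and all variable names are made up for the example.

```r
# Hand-rolled HC3 robust interval on simulated data (illustration only).
set.seed(2)
n_sim <- 500
x_sim <- rnorm(n_sim)
y_sim <- 1 + 0.5 * x_sim + rnorm(n_sim) * (1 + abs(x_sim))  # heteroskedastic errors
fit_ols <- lm(y_sim ~ x_sim)

X <- model.matrix(fit_ols)
e <- residuals(fit_ols)
h <- hatvalues(fit_ols)

bread <- solve(crossprod(X))           # (X'X)^{-1}
meat <- crossprod(X * (e / (1 - h)))   # X' diag(e^2 / (1 - h)^2) X
vc_hc3 <- bread %*% meat %*% bread     # HC3 sandwich covariance
se <- sqrt(diag(vc_hc3))

z <- qnorm(0.975)                      # normal critical value, as with df = Inf
ci <- cbind(estimate = coef(fit_ols),
            lower = coef(fit_ols) - z * se,
            upper = coef(fit_ols) + z * se)
print(ci)
```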
### Estimating the Mean Regression Model Using SGD
We now estimate the same model using SGD.
```{r}
mincer_sgd = sgdi_lm(y ~ edu + exp + exp2)
print(mincer_sgd)
```
The estimation results from the two methods are similar.
A separate function, `sgd_lm`, computes only the point estimates, without confidence intervals.
```{r}
mincer_sgd = sgd_lm(y ~ edu + exp + exp2)
print(mincer_sgd)
```
We compare the execution times of the two functions and find little difference in this simple example. By construction, `sgdi_lm` takes longer because it also conducts inference.
```{r}
library(microbenchmark)
res <- microbenchmark(sgd_lm(y ~ edu + exp + exp2),
                      sgdi_lm(y ~ edu + exp + exp2),
                      times = 100L)
print(res)
```
To plot the SGD path, we first construct an SGD path for the return-to-education coefficient.
```{r}
mincer_sgd_path = sgdi_lm(y ~ edu + exp + exp2, path = TRUE, path_index = 2)
```
Then, we can plot the SGD path.
```{r}
plot(mincer_sgd_path$path_coefficients, ylab="Return to Education", xlab="Steps", type="l")
```
To examine the initial part of the path, we now truncate it to the first 2,000 steps.
```{r}
plot(mincer_sgd_path$path_coefficients[1:2000], ylab="Return to Education", xlab="Steps", type="l")
print(c("2000th step", mincer_sgd_path$path_coefficients[2000]))
print(c("Final Estimate", mincer_sgd_path$coefficients[2]))
```
The SGD path has almost converged after only 2,000 steps, which is less than 10% of the sample size.
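
One way to put a number on how close the path is to its final value is the relative gap between a given step and the final estimate. The helper below is a hypothetical convenience function, not part of the package, and it is shown on a made-up deterministic path so that it runs on its own.

```r
# Hypothetical helper: relative distance between the path at `step`
# and a final value (by default, the last element of the path).
relative_gap <- function(path, step, final = path[length(path)]) {
  abs(path[step] - final) / abs(final)
}

# Toy path that settles down towards 0.1 (illustration only)
toy_path <- 0.1 + 0.5 / (1:5000)
print(relative_gap(toy_path, 2000))
```

For the example above, a call such as `relative_gap(mincer_sgd_path$path_coefficients, 2000, final = mincer_sgd_path$coefficients[2])` would compare the two printed values.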
## What else the package can do
See the vignette for the quantile regression example.
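
For orientation before turning to the vignette, the sub-gradient update behind quantile regression can be sketched in base R. This is a simplified illustration under our own assumptions (hypothetical function name, learning-rate constants, zero starting value, simulated data). It does not show the package's quantile regression interface.

```r
# Illustrative sketch only; NOT the package's quantile regression routine.
# sgd_quantile_sketch, gamma0, and a are hypothetical names and values.
sgd_quantile_sketch <- function(X, y, tau = 0.5, gamma0 = 1, a = 0.501) {
  n <- nrow(X)
  beta <- numeric(ncol(X))       # current iterate, started at zero (assumption)
  beta_bar <- numeric(ncol(X))   # running Polyak-Ruppert average
  for (i in seq_len(n)) {
    x_i <- X[i, ]
    # sub-gradient of the check loss rho_tau(y_i - x_i'beta) with respect to beta
    subgrad <- x_i * (as.numeric(y[i] <= sum(x_i * beta)) - tau)
    beta <- beta - gamma0 * i^(-a) * subgrad
    beta_bar <- beta_bar + (beta - beta_bar) / i
  }
  beta_bar
}

# Simulated data: the conditional median of y_q given x_q is 1 + 2 * x_q
set.seed(4)
n_q <- 20000
x_q <- rnorm(n_q)
y_q <- 1 + 2 * x_q + rnorm(n_q)
median_fit <- sgd_quantile_sketch(cbind(1, x_q), y_q, tau = 0.5)
print(median_fit)
```

The only change from a mean regression update is the gradient: the check loss is not differentiable at zero, so a sub-gradient is used in its place.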
# References
Acemoglu, D. and Autor, D., 2011. Skills, tasks and technologies: Implications for employment and earnings. In _Handbook of labor economics_ (Vol. 4, pp. 1043-1171). Elsevier.

Lee, S., Liao, Y., Seo, M.H. and Shin, Y., 2022. Fast and robust online inference with stochastic gradient descent via random scaling. In _Proceedings of the AAAI Conference on Artificial Intelligence_ (Vol. 36, No. 7, pp. 7381-7389).
<https://doi.org/10.1609/aaai.v36i7.20701>.

Lee, S., Liao, Y., Seo, M.H. and Shin, Y., 2023. Fast Inference for Quantile Regression with Tens of Millions of Observations. arXiv:2209.14502 [econ.EM] <https://doi.org/10.48550/arXiv.2209.14502>.