README.Rmd

---
output: github_document
---

<!-- README.md is generated from README.Rmd. Please edit that file -->

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.path = "man/figures/README-",
  out.width = "100%"
)
```

# SGDinference

<!-- badges: start -->
[![R-CMD-check](https://github.com/SGDinference-Lab/SGDinference/workflows/R-CMD-check/badge.svg)](https://github.com/SGDinference-Lab/SGDinference/actions)
[![codecov](https://codecov.io/gh/SGDinference-Lab/SGDinference/branch/master/graph/badge.svg?token=YTBY15IXEP)](https://app.codecov.io/gh/SGDinference-Lab/SGDinference)
<!-- badges: end -->

__SGDinference__ is an R package that provides estimation and inference methods for large-scale mean and quantile regression models via stochastic (sub-)gradient descent (S-subGD) algorithms. The inference procedure handles cross-sectional data sequentially: 

  (i) updating the parameter estimate with each incoming "new observation", 
  (ii) aggregating it as a Polyak-Ruppert average, and 
  (iii) computing an asymptotically pivotal statistic for inference through random scaling.

The methodology used in the SGDinference package is described in detail in the following papers:

- Lee, S., Liao, Y., Seo, M.H. and Shin, Y., 2022. Fast and robust online inference with stochastic gradient descent via random scaling. In _Proceedings of the AAAI Conference on Artificial Intelligence_ (Vol. 36, No. 7, pp. 7381-7389).
<https://doi.org/10.1609/aaai.v36i7.20701>.

- Lee, S., Liao, Y., Seo, M.H. and Shin, Y., 2023. Fast Inference for Quantile Regression with Tens of Millions of Observations. 	arXiv:2209.14502 [econ.EM] <https://doi.org/10.48550/arXiv.2209.14502>.
  
## Installation

You can install the development version from  [GitHub](https://github.com/) with:

``` r
# install.packages("devtools") # if you have not installed "devtools" package
devtools::install_github("SGDinference-Lab/SGDinference")
```

We begin by calling the SGDinference package.

```{r setup}
library(SGDinference)
set.seed(100723)
```


## Case Study: Estimating the Mincer Equation

To illustrate the usefulness of the package, we use a small dataset included in the package.
Specifically, the _Census2000_ dataset from Acemoglu and Autor (2011) consists of observations on 26,120 nonwhite, female workers. This small dataset is constructed from "microwage2000_ext.dta" at 
<https://economics.mit.edu/people/faculty/david-h-autor/data-archive>.
Observations are dropped if hourly wages are missing or years of education are smaller than 6.
Then, a 5 percent random sample is drawn to make the dataset small.
The following three variables are included:

- ln_hrwage: log hourly wages
- edyrs: years of education
- exp: years of potential experience

We now define the variables.

```{r}
    y = Census2000$ln_hrwage 
  edu = Census2000$edyrs
  exp = Census2000$exp
 exp2 = exp^2/100
```

As a benchmark, we first estimate the Mincer equation and report the point estimates and their 95% heteroskedasticity-robust confidence intervals.

```{r}
mincer = lm(y ~ edu + exp + exp2)
inference = lmtest::coefci(mincer, df = Inf,
                             vcov = sandwich::vcovHC)
results = cbind(mincer$coefficients,inference)
colnames(results)[1] = "estimate"
print(results)
```


### Estimating the Mean Regression Model Using SGD

We now estimate the same model using SGD. 

```{r}
 mincer_sgd = sgdi_lm(y ~ edu + exp + exp2)
 print(mincer_sgd)
```
It can be seen that the estimation results are similar between two methods. 
There is a different command that only computes the estimates but not confidence intervals.

```{r}
 mincer_sgd = sgd_lm(y ~ edu + exp + exp2)
 print(mincer_sgd)
```

We compare the execution times between two versions and find that there is not much difference in this simple example. By construction, it takes more time to conduct inference via `sgdi_lm`.

```{r}
library(microbenchmark)
res <- microbenchmark(sgd_lm(y ~ edu + exp + exp2),
                      sgdi_lm(y ~ edu + exp + exp2),
                      times=100L)
print(res)
```
To plot the SGD path, we first construct a SGD path for the return to education coefficients.
```{r}
mincer_sgd_path = sgdi_lm(y ~ edu + exp + exp2, path = TRUE, path_index = 2)
```

Then, we can plot the SGD path.

```{r}
plot(mincer_sgd_path$path_coefficients, ylab="Return to Education", xlab="Steps", type="l") 
```

To observe the initial paths, we now truncate the paths up to 2,000.

```{r}
plot(mincer_sgd_path$path_coefficients[1:2000], ylab="Return to Education", xlab="Steps", type="l")
print(c("2000th step", mincer_sgd_path$path_coefficients[2000]))
print(c("Final Estimate", mincer_sgd_path$coefficients[2]))
```

It can be seen that the SGD path almost converged only after the 2,000 steps, less than 10% of the sample size. 

## What else the package can do

See the vignette for the quantile regression example. 

# References

Acemoglu, D. and Autor, D., 2011. Skills, tasks and technologies: Implications for employment and earnings. In _Handbook of labor economics_ (Vol. 4, pp. 1043-1171). Elsevier.

Lee, S., Liao, Y., Seo, M.H. and Shin, Y., 2022. Fast and robust online inference with stochastic gradient descent via random scaling. In _Proceedings of the AAAI Conference on Artificial Intelligence_ (Vol. 36, No. 7, pp. 7381-7389).
<https://doi.org/10.1609/aaai.v36i7.20701>.

Lee, S., Liao, Y., Seo, M.H. and Shin, Y., 2023. Fast Inference for Quantile Regression with Tens of Millions of Observations. 	arXiv:2209.14502 [econ.EM] <https://doi.org/10.48550/arXiv.2209.14502>.