**** Running the models ****
** Binary Choice and Other Limited Dependent Models **
* logit/logistic, probit, tobit, poisson
****************************
/* Compare Logit to Probit */
****************************
set obs 600
egen x=fill(-300 -299) /* same as fill(-300(1)299) */
replace x=x/100
gen probit=1/sqrt(2*3.1415)*exp(-((x^2)/2))
gen logit=(exp(x))/[[1+exp(x)]^2]
** density functions
twoway (connected probit x) (connected logit x)
* notice the fatter tails on the logit
** cumulative graphs
gen cumul_logit=sum(logit)
gen cumul_probit=sum(probit)
twoway (connected cumul_probit x) (connected cumul_logit x)
* which should we choose to use? depends. they are quite similar, but a lot of outliers might lead to using the logit. an easy answer: use the model that the literature relies upon.
***********************
/* Logit and Logistic */
***********************
* Use the cancer data set for prob of dying
sysuse cancer, clear
* Check the general fit and other problems first
quietly reg died studytime age drug
predict r, resid
gen residsq=r*r
scatter residsq studytime
scatter residsq age
scatter residsq drug
kdensity r, normal /* univariate kernel density estimation */
pnorm r /* std normal prob plot */
qnorm r /* compares quantiles of resid to quantiles of norm dist */
* How might you test for heteroskedasticity?
estat hettest
gen rlag=r[_n-1]
reg r rlag
* How might you test for multicollinearity?
quietly reg died studytime age drug
vif
* How can we check if omitted variables might be a problem?
ovtest /* not good but let's proceed */
** Compare linear prob to logit
reg died studytime age drug
logit died studytime age drug
logit died studytime age drug, nolog /* suppresses the log */
logit died studytime age drug, nolog or /* gives odds ratio */
* same as:
logistic died studytime age drug, nolog
* we could also go the other way from logistic to logit
logistic died studytime age drug, nolog coef
logit died studytime age drug
* how can we get back the estimates individually or the table?
lincom age
lincom drug
logit
logit, or
estat gof
estat classification /* to comparing the accuracies a little easier... */
ereturn list /* if you want to program, this lists some available vars */
* instead of coefficients, let's find probabilistic changes and marginal effects
logit died studytime age drug
prchange
gen old=age>60
tab old drug /* most of the older people do not get the placebo */
logit died studytime age drug old
prchange, x(old=1 drug=1) /* see the types of change for old that had placebo */
mfx /* gives marginal effects */
*
* A variant of the logit is the multinomial logit
* Group people by study time and use that on the LHS
gen time_1=(studytime>0 & studytime<10)
gen time_2=(studytime>=10 & studytime<20)
gen time_3=(studytime>=20 & studytime<30)
gen time_4=(studytime>=30)
* OR categorize the variable automatically. There are a number of ways.
gen time_cat1=recode(studytime,0, 10, 20, 30) /* specifies non-inclusive lower bounds */
tab time_cat1
gen time_cat2=autocode(studytime, 4, 0, 40)
tab time_cat2
sort studytime
gen time_cat3=autocode(studytime,4,studytime[1],studytime[_N])
tab time_cat3 /* it's similar to time_cat2 b/c autocode picks 4 equally spaced groups */
egen time_cat4=cut(studytime), at(0,10, 20, 30,40) label
* What's the difference between "cut" and "recode"? The lower bound for "cut" is inclusive.
tab time_cat4
egen time_cat5=cut(studytime), at(0,10, 20, 30,40) icodes
tab time_cat5
* How can you get 5 equally sized groups (we already did equally spaced)?
egen time_cat6=cut(studytime), group(5) /* specify label if what that */
tab time_cat6
*** Now use one of these groups for a multinomial logit
* Suppose the time you'll survive depends on the drug treatment and your age
* Let's use a multinomial logit. Normally, you don't do this b/c you don't want to reduce the data info you have.
mlogit time_cat5 drug age
mfx /* this won't work now because we have multiple categories on the LHS */
mfx, predict(p outcome(2))
drop time*
* For this problem, you might also consider an ordered logit or a conditional logit grouped by drug treatment.
* Other logit models include the nested, rank-ordered and grouped data logit.
***********************
*****************
************
/* Probit */
************
* We're going to look at the likelihood of being married. Check the # obs and missing obs.
sysuse nlsw88, clear
nmissing
* now find the exact observations with missing values for hours
nmissing hours, obs
* let's just ignore them for now because there aren't many that are missing
* if we were worried about "union" then it might be a different story
reg never_married age collgrad south c_city wage hours wage tenure
vif
ovtest /* much better than with the other data set! */
predict r, resid
gen rlag=r[_n-1]
reg r rlag
* things look to be ok, but look at the normal plots. not good, but move on to the probit
probit never_married age collgrad south c_city wage hours wage tenure
mfx /* in newer versions of Stata, dummy variables are already computed as discrete changes */
mfx, nodiscrete
mfx, at(median) /* default is at mean. could also do at zero or at(mean collgrad=1) for only college grads and mean for rest */
dprobit never_married age collgrad south c_city wage hours wage tenure
* suppose we think results are clustered by occup. Get their marginals.
probit never_married age collgrad south c_city wage hours wage tenure, vce(cluster occupation) nolog
mfx
estat clas
* How could we display the estimates of linear prob, probit, and logit estimation side-by-side?
reg never_married age collgrad south c_city hours wage tenure, vce(cluster occupation)
estimates store A
probit never_married age collgrad south c_city hours wage tenure, vce(cluster occupation) nolog
estimates store B
logit never_married age collgrad south c_city wage hours tenure, vce(cluster occupation) nolog
estimates store C
* cluster instead by current grade completed
logit never_married age collgrad south c_city wage hours tenure, vce(cluster grade) nolog
estimates store D
estimates dir
estimates table A B C
estimates stats A B C
estimates replay B
estimates title: The Linear Probability Model
estimates store E
estimates replay E
estimates table A B C D, title(Comparison of Regressions) stats(r2 r2_p F chi2 ll N_clust N) b(%9.2f) se(%9.2f) p(%9.3f)
* instead of what you just displayed in the table, drop the SE and put in stars for p-values so sig vars jump out at you (not necessarily what you want to do in a paper)
estimates table A B C D, title(Comparison of Regressions) stats(r2 r2_p F chi2 ll N_clust N) b(%9.2f) star
* finally, you might need to recall one of the earlier estimates so see which is active
est query
est restore B
est query
*
estimates drop A B C D E
* Besides the basic type of probit, you might also consider bivariate, multinomial, or ordered probits. Madalla's text goes through some examples.
* How can you see what is going on inside a program to compare to your Matlab files you've been programming this term?
* Type: "viewsource programname.ado" (you must include the extension .ado)
viewsource biprobit.ado
* If you want to start programming ADO files, consider Stata's Mata language. The "moremata" package has a bunch of functions for it. Here's where I'd start reading:
* Quick Mata overview: http://kurt.schmidheiny.name/teaching/statamata2up.pdf
* More details: http://ideas.repec.org/p/boc/usug08/11.html
********************************
****************************
************************
/* Tobit and Selection */
************************
* Let's consider stock market trades where the day closes ahead of the previous.
sysuse sp500, clear
count
gen yi=close if close>close[_n-1]
di (248-129)/248
* so 48% are observed in yi with close being our "latent" variable
* let's consider a nonsensical model:
tobit yi high volume change, ll(0)
mfx
**** Heckman Selection
* consider hourly wages if somebody is in a union
sysuse nlsw88, clear
tab union, gen(union_)
rename union_1 union_not
rename union_2 union_mem
* I'm just making this up to define wage rate and a selection eqn for union members:
heckman wage age grade south c_city hours tenure, select(union_mem= never_married collgrad south c_city) vce(cluster industry)
mfx
* Can you explain why never_married and collgrad have marg effects of zero?
predict probunionwage, psel /* gives us the prob of wage being observed, which requires union membership */
predict wagecond, ycond /* gives us the predicted wage conditional on union membership */
* To recap, there are a lot of different types of tobit and selection models based on censoring thresholds and truncation points. Madalla's text, again, gives great examples from the literature.
*************
/* Poisson */
*************
* The study time var in the cancer data set is a count var that could be used on the LHS
sysuse cancer, clear
poisson studytime drug age
mfx
* if you had a different count, you could actually use studytime with the exposure option
* make sure you run
estat gof
* if it's significant, you should not use a poisson. Instead, do a negative binomial