Adventures in R!
01). Download R

If you haven't downloaded R, get it here.

02). Download RStudio

If you haven't downloaded RStudio (a GUI front end to R), get it here. I highly recommend this integrated environment for launching R scripts.

03). Test Data

I have provided some fictitious credit-like test data for illustrative purposes (about 3,000 records and around 30 variables). The first variable, called "Target", should be treated as the dependent variable. This data is useful for running the logit and linear R scripts below. I recommend creating a directory on your hard drive called c:\MyRdata and placing this file there; the logit and linear scripts have this path hardcoded for convenience.

04). Logistic Regression

For all the downloadable scripts on this page, change the .txt extension to the correct R extension (.R) before using them in RStudio. This logistic regression script contains the largest collection of routines on my web page...

# - Writes all graphs to a single PDF file.

# - Calculates frequency counts of all character variables.

# - Calculates basic statistics for all numeric variables.

# - Calculates percentiles for all numeric variables.

# - Draws Boxplots for all numeric variables.

# - Draws Histograms for all numeric variables.

# - Extracts column pairwise and bivariate correlations from correlation matrix.

# - Can Winsorize data if needed to mitigate outliers.

# - Calculates weight of evidence (WOE) and Information value (IV) for all variables.

# - Can replace missing with means or medians.

# - Divides data into training and testing datasets.

# - Estimates Logit model - saves all related results in Excel tabs.

# - Calculates VIFs for all model variables.

# - Calculates fit statistics - leverage, Cooks D, etc.

# - Calculates Predicted probabilities for Logit model.

# - Calculates ROC and AUC for training and test data.

# - Calculates Lift chart and KS for training and test data.

# - Calculates Somers D, Misclassification table, Concordance metric - all for training and test data.

# - Calculates Marginal Effects of logistic regression.
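
A few of the steps listed above (train/test split, logit estimation, predicted probabilities, AUC, KS, and Somers D) can be sketched in base R. This is only an illustration on simulated data, not the downloadable script itself; the column names x1 and x2, the 70/30 split, and the helper names auc and ks are assumptions made for the example.

```r
set.seed(42)

# Simulated stand-in for the credit-like test data (column names are hypothetical)
n  <- 3000
df <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
df$Target <- rbinom(n, 1, plogis(-1 + 0.8 * df$x1 - 0.5 * df$x2))

# Divide data into training (70%) and testing (30%) datasets
train_idx <- sample(seq_len(n), size = round(0.7 * n))
train <- df[train_idx, ]
test  <- df[-train_idx, ]

# Estimate the logit model on the training data
fit <- glm(Target ~ x1 + x2, data = train, family = binomial(link = "logit"))

# Predicted probabilities for the test data
p_test <- predict(fit, newdata = test, type = "response")

# AUC via the rank (Mann-Whitney) formula
auc <- function(y, p) {
  r  <- rank(p)
  n1 <- sum(y == 1)
  n0 <- sum(y == 0)
  (sum(r[y == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

# KS: largest gap between the score distributions of events and non-events
ks <- function(y, p) {
  s <- sort(unique(p))
  max(abs(ecdf(p[y == 1])(s) - ecdf(p[y == 0])(s)))
}

auc_test <- auc(test$Target, p_test)
ks_test  <- ks(test$Target, p_test)
somers_d <- 2 * auc_test - 1   # Somers D follows directly from the AUC
```

Running the same auc and ks helpers on the training predictions gives the train-versus-test comparison the list describes.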

05). Linear Regression

This is similar in functionality to the logit script, but for linear regression models. Like the logit script, everything is written either to a PDF (graphs) or to an Excel spreadsheet (data and tables).
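
As a rough sketch of the kind of diagnostics involved (VIFs, leverage, Cooks D), here is a base R example for a linear model. The simulated data, the variable names, and the helper vif_by_hand are assumptions for illustration and are not taken from the downloadable script.

```r
set.seed(1)
n  <- 500
x1 <- rnorm(n)
x2 <- 0.9 * x1 + rnorm(n, sd = 0.3)   # deliberately collinear with x1
x3 <- rnorm(n)
y  <- 2 + 1.5 * x1 - 1.0 * x3 + rnorm(n)
df <- data.frame(y, x1, x2, x3)

# Estimate the linear model
fit <- lm(y ~ x1 + x2 + x3, data = df)

# VIF_j = 1 / (1 - R2_j), where R2_j comes from regressing x_j on the other predictors
vif_by_hand <- function(data, predictors) {
  sapply(predictors, function(v) {
    others <- setdiff(predictors, v)
    r2 <- summary(lm(reformulate(others, response = v), data = data))$r.squared
    1 / (1 - r2)
  })
}
vifs <- vif_by_hand(df, c("x1", "x2", "x3"))

# Fit statistics available in base R
lev   <- hatvalues(fit)        # leverage
cooks <- cooks.distance(fit)   # Cooks D
```

Here x1 and x2 should show large VIFs because x2 was built from x1, while x3 should stay close to 1.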

06). Observational Clustering

This is a script for observational clustering. Functionality includes:

# 1). Replace all missing values with medians - cluster analysis cannot handle missing data.

# 2). Winsorize all the variables at the 99th percentile - clustering is very sensitive to outliers.

# 3). Standardize all the variables - clustering works better this way.

# 4). Optional - pick relevant variables by performing variable clustering; this reduces noise and the number of variables.

# 5). Generally, use kmeans as your clustering routine.

# 6). Score new or test dataset with your clustering model.

Note - remember, after all the analytics, you want your clusters to be meaningful, so you might want to look at the average values of the variables by cluster, as shown here. Clustering is more of an art than a science compared to regression analysis, so vary the number of clusters and select the variables that best describe the business application.
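
The numbered steps above (except the optional variable selection in step 4) can be sketched in base R roughly as follows. The toy data, the variable names income and balance, the choice of two clusters, and the helpers impute, winsor, and score are all assumptions for illustration, not the contents of the downloadable script.

```r
set.seed(123)

# Toy data with two groups, a few missing values, and one outlier (all hypothetical)
train <- data.frame(
  income  = c(rnorm(100, 40, 5), rnorm(100, 80, 5)),
  balance = c(rnorm(100, 10, 2), rnorm(100, 30, 2))
)
train$income[c(3, 150)] <- NA
train$balance[7] <- 500

# 1) Replace missing values with the training medians
meds <- sapply(train, median, na.rm = TRUE)
impute <- function(d) {
  for (v in names(meds)) d[[v]][is.na(d[[v]])] <- meds[[v]]
  d
}
train <- impute(train)

# 2) Winsorize at the 99th percentile of the training data
caps <- sapply(train, quantile, probs = 0.99, names = FALSE)
winsor <- function(d) {
  for (v in names(caps)) d[[v]] <- pmin(d[[v]], caps[[v]])
  d
}
train <- winsor(train)

# 3) Standardize with the training means and standard deviations
mu  <- colMeans(train)
sdv <- sapply(train, sd)
z_train <- scale(train, center = mu, scale = sdv)

# 5) k-means with two clusters
km <- kmeans(z_train, centers = 2, nstart = 25)

# Average values of the original variables by cluster
profile <- aggregate(train, by = list(cluster = km$cluster), FUN = mean)

# 6) Score new data: same preprocessing, then assign each row to the nearest centroid
score <- function(new) {
  z <- scale(winsor(impute(new)), center = mu, scale = sdv)
  unname(apply(z, 1, function(r) which.min(colSums((t(km$centers) - r)^2))))
}
new_obs     <- data.frame(income = c(38, 85, NA), balance = c(9, 31, 12))
new_cluster <- score(new_obs)
```

Note that the medians, caps, means, and standard deviations all come from the training data and are reused unchanged when scoring, so new observations are placed on the same scale as the clusters.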

07). Regression Tree

You cannot do this in SAS unless you purchase the very expensive Enterprise Miner. If you are not familiar with regression trees such as CART, there is plenty of information on the web. In R, trees are free.
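
For readers who want to see what fitting a regression tree looks like, here is a minimal sketch using the rpart package, which ships with standard R distributions as a recommended package. Whether the downloadable script uses rpart is an assumption on my part; the simulated data and variable names are made up for the example.

```r
library(rpart)  # CART-style recursive partitioning

set.seed(7)
n  <- 1000
df <- data.frame(x1 = runif(n), x2 = runif(n))

# True structure: the response jumps when x1 crosses 0.5; x2 is pure noise
df$y <- ifelse(df$x1 > 0.5, 10, 2) + rnorm(n)

# method = "anova" requests a regression tree (continuous response)
tree <- rpart(y ~ x1 + x2, data = df, method = "anova")

# Predict for one point on each side of the jump
pred <- predict(tree, newdata = data.frame(x1 = c(0.2, 0.8), x2 = c(0.5, 0.5)))
```

Printing the tree object shows the split points; in this setup the first split should land near x1 = 0.5.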

08). Random Forests

This is a tree on steroids. There are plenty of discussions of this approach on the web; suffice it to say that this routine runs numerous trees in the background and aggregates the results. This has been shown to have some benefits, such as more stability than the traditional single-tree methodology.
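
The "numerous trees, aggregated" idea can be illustrated with plain bagging of rpart trees. To be clear, this is a simplified relative of a random forest, not the real thing: a true random forest also samples the candidate variables at each split, and the downloadable script presumably relies on a dedicated package. The data, the number of trees, and the helper bag_predict are assumptions for the sketch.

```r
library(rpart)

set.seed(11)
n  <- 600
df <- data.frame(x1 = runif(n), x2 = runif(n))
df$y <- sin(2 * pi * df$x1) + 0.5 * df$x2 + rnorm(n, sd = 0.3)

# Fit one regression tree per bootstrap resample of the rows
n_trees <- 50
trees <- lapply(seq_len(n_trees), function(i) {
  boot <- df[sample(n, replace = TRUE), ]
  rpart(y ~ x1 + x2, data = boot, method = "anova")
})

# Aggregate: average the predictions of all trees
bag_predict <- function(newdata) {
  preds <- do.call(cbind, lapply(trees, function(tr) predict(tr, newdata = newdata)))
  rowMeans(preds)
}

grid     <- data.frame(x1 = c(0.25, 0.75), x2 = c(0.5, 0.5))
bag_pred <- bag_predict(grid)
```

Averaging over many resampled trees smooths out the sharp steps of any single tree, which is where the extra stability comes from.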

09). Text Mining

This script is an example of how to run text mining against unstructured data in R. It assumes each observation is a text file from one of two populations - for example Obama vs. Romney speeches. These text files are placed under the following folders - c:\speeches2\obama and c:\speeches2\romney. The objective is to evaluate the word structure in each so you can predict the probability that a new speech is more likely to come from one source or the other. This can be applied to any binary decision where you want to look at how one population of words differs from another. For example, in banking you could easily apply this to complaints, where one population has merit and should be investigated, while the other not so much.
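
To make the idea of scoring a new document against two populations of words concrete, here is a small base R sketch using a naive Bayes classifier on word counts. The classifier choice is my assumption (the description above does not say which method the script uses), and the toy documents, the helper names, and the equal prior are made up. In practice the documents would be read from the two folders rather than typed inline.

```r
# Toy documents standing in for the two folders of text files (contents are made up)
docs_a <- c("jobs economy growth jobs", "economy taxes growth", "jobs taxes economy")
docs_b <- c("health care reform health", "care coverage reform", "health coverage care")

# Lowercase, split on anything that is not a letter, drop empty strings
tokenize <- function(x) {
  words <- unlist(strsplit(tolower(x), "[^a-z]+"))
  words[nzchar(words)]
}

vocab <- unique(c(tokenize(docs_a), tokenize(docs_b)))

# Laplace-smoothed log word probabilities for one population
word_logprob <- function(docs) {
  counts <- table(factor(tokenize(docs), levels = vocab))
  setNames(log((as.numeric(counts) + 1) / (sum(counts) + length(vocab))), vocab)
}
lp_a <- word_logprob(docs_a)
lp_b <- word_logprob(docs_b)

# Posterior probability that a new document comes from population A
prob_a <- function(text, prior_a = 0.5) {
  w <- tokenize(text)
  w <- w[w %in% vocab]   # ignore words never seen in either population
  score_a <- log(prior_a) + sum(lp_a[w])
  score_b <- log(1 - prior_a) + sum(lp_b[w])
  1 / (1 + exp(score_b - score_a))
}

p1 <- prob_a("jobs and the economy")
p2 <- prob_a("health care coverage")
```

A document made up only of unseen words falls back to the prior, which is a reasonable default when there is no evidence either way.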

10). Text Mining Data

This is the test dataset for the text mining script (#09).

11). Variable Clustering

This is a standalone script for reducing the number of variables for regression analysis. This is sometimes useful if you have many variables.
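
One common way to cluster variables is hierarchical clustering on a correlation-based distance, followed by keeping one representative per cluster. The sketch below takes that approach in base R; it is an assumption about the general technique, not a copy of the downloadable script, and the simulated variables and the helper pick_rep are hypothetical.

```r
set.seed(5)
n  <- 400
f1 <- rnorm(n)   # hidden factor behind a1..a3
f2 <- rnorm(n)   # hidden factor behind b1..b2
X <- data.frame(
  a1 = f1 + rnorm(n, sd = 0.3),
  a2 = f1 + rnorm(n, sd = 0.3),
  a3 = f1 + rnorm(n, sd = 0.3),
  b1 = f2 + rnorm(n, sd = 0.3),
  b2 = f2 + rnorm(n, sd = 0.3)
)

# Distance between variables: 1 - |correlation|
d      <- as.dist(1 - abs(cor(X)))
hc     <- hclust(d, method = "average")
groups <- cutree(hc, k = 2)

# Keep one representative per cluster: the variable most correlated with its cluster mates
pick_rep <- function(g) {
  vars <- names(groups)[groups == g]
  if (length(vars) == 1) return(vars)
  avg <- rowMeans(abs(cor(X[vars])))
  names(which.max(avg))
}
keep <- sapply(sort(unique(groups)), pick_rep)
```

The vector keep then holds the reduced variable list to carry into the regression.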

12). Simulate Regression Data

This is a great script for simulating test data for logit or tobit models. Here you can make up your own data where you know the true parameters and run a statistical regression technique against it.
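
As a minimal sketch of the idea for the logit case, the base R code below generates data from known coefficients and checks that glm recovers them. The coefficient values, sample size, and variable names are assumptions chosen for the example; the tobit case is not covered here.

```r
set.seed(2024)
n <- 5000

# True parameters, chosen by us
true_beta <- c(intercept = -0.5, x1 = 1.2, x2 = -0.7)

# Simulate predictors and a binary outcome from the logit model
x1  <- rnorm(n)
x2  <- rnorm(n)
eta <- true_beta[["intercept"]] + true_beta[["x1"]] * x1 + true_beta[["x2"]] * x2
y   <- rbinom(n, size = 1, prob = plogis(eta))

# Run the regression against the simulated data
fit <- glm(y ~ x1 + x2, family = binomial)
est <- unname(coef(fit))

# Side-by-side comparison of true and estimated parameters
comparison <- cbind(true = true_beta, estimate = est)
```

With a sample this large the estimates should sit close to the true values; shrinking n is a quick way to see how much noise a smaller sample adds.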

13). Create Zip code maps

Create a map based on ZIP codes. You can do this with SAS, but this is easier.

14). Create Basic Maps

Create basic maps in R. These would include thematic (heat) maps as well as overlay maps with specific data points.

15). Create Census Tract Maps

Create maps at the Census Tract level. This program requires you to go to a website and obtain a specific ID (activation code: http://api.census.gov/data/key_signup.html) to download map data. It's super easy; just look at the code to see what to do. Let me tell you, in SAS this would be way harder.

16). Stratified Sampling

This is a great script for stratified sampling. I took this straight from the web and you can google this topic and find the same script. Nice!
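
For reference, stratified sampling with an equal sampling fraction per stratum can be sketched in a few lines of base R. This is not the script mentioned above; the toy data, the region column, and the helper stratified_sample are hypothetical.

```r
set.seed(99)
df <- data.frame(
  id     = 1:1000,
  region = sample(c("East", "West", "South"), 1000, replace = TRUE,
                  prob = c(0.6, 0.3, 0.1))
)

# Draw the same fraction of rows from every stratum (at least one row each)
stratified_sample <- function(data, strata, frac) {
  by_stratum <- split(seq_len(nrow(data)), data[[strata]])
  idx <- unlist(lapply(by_stratum, function(rows) {
    k <- max(1, round(frac * length(rows)))
    rows[sample.int(length(rows), k)]   # sample.int avoids the sample() single-value pitfall
  }), use.names = FALSE)
  data[sort(idx), , drop = FALSE]
}

samp <- stratified_sample(df, "region", 0.10)
```

Comparing table(df$region) with table(samp$region) confirms that each region keeps roughly its original share.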