Saturday, October 22, 2011

Wednesday, October 19, 2011

Google Analytics Consulting Service

We are pleased to announce that SAS Data Guru has started to offer Google Analytics Services to the public.

Services include:

1. Set up Google Analytics
2. Tutorial on Report Utilization
3. Search Engine Optimization
4. Channel Marketing Campaign
5. HTML Customization for Advanced Features

Please contact us for more details.

Saturday, October 8, 2011

Key Steps for Doing Multiple Regression

1. Build the Model Using R-Square or Forward Selection
2. Check Residual Plot
3. Check Multicollinearity
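
The three steps above could be sketched in PROC REG roughly as follows. The dataset mydata, the response y, and the predictors x1-x5 are hypothetical names used only for illustration, and the second step assumes that forward selection happened to keep x1 and x3.

```sas
/* Step 1: forward selection on a hypothetical dataset (names are assumptions) */
proc reg data=mydata;
  model y = x1-x5 / selection=forward slentry=0.05;
run; quit;

/* Steps 2 and 3: refit the chosen model, plot residuals against predicted
   values, and request variance inflation factors to check multicollinearity */
proc reg data=mydata;
  model y = x1 x3 / vif;
  plot r.*p.;
run; quit;
```

A residual plot with no visible pattern and small VIF values are the usual signs that the model is behaving reasonably.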

Sunday, October 2, 2011

Busting Loose From The Money Game

Dear readers, if you learn SAS because you want to make more money in your life, please pause and take a look at this book:

Busting Loose From the Money Game: Mind-Blowing Strategies for Changing the Rules of a Game You Can't Win by Robert Scheinfeld

Saturday, October 1, 2011

What is Longitudinal Study?

"Longitudinal studies are defined as studies in which the outcome variable is repeatedly measured; i.e. the outcome variable is measured in the same individual on several different occasions. In longitudinal studies the observations of one individual over time are not independent of each other, and therefore it is necessary to apply special statistical techniques, which take into account the fact that the repeated observations of each individual are correlated. The definition of longitudinal studies(used in this book) implicates that statistical techniques like survival analyses are beyond the scope of this book. Those techniques basically are not longitudinal data analysis techniques because (in general) the outcome variable is an irreversible endpoint and therefore strictly speaking is only measured at one occasion. After the occurrence of an event no more observations are carried out on that particular subject."

Excerpt from Applied Longitudinal Data Analysis for Epidemiology, A Practical Guide by Jos W. R. Twisk

Friday, September 9, 2011

How to Memorize p-value

When it comes to testing hypotheses, nothing is more important than the p-value. If you have a hard time understanding or memorizing how to use it, here is a simple way to remember it. The p-value is the probability of observing data at least as extreme as what was actually observed, assuming the null hypothesis is true. When this value is small and falls below a pre-set threshold, we reject the null hypothesis. This threshold is called the significance level. It is set by the investigator and is normally 0.05 or 0.01.
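
As a small illustration, here is a sketch of a two-sample t-test. The dataset scores and its values are made up for this example. In the output, compare the value in the "Pr > |t|" column with the significance level of 0.05.

```sas
/* Hypothetical data: test scores for two groups (values are invented) */
data scores;
  input group $ score @@;
  datalines;
A 72 A 75 A 78 A 80 A 69
B 81 B 85 B 79 B 88 B 84
;
run;

/* H0: the two group means are equal */
proc ttest data=scores alpha=0.05;
  class group;
  var score;
run;
```

If the p-value is below 0.05, reject the null hypothesis of equal means; otherwise, do not reject it.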

Tuesday, August 16, 2011

PROC SQL Like Statement Case Problem

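/* LIKE is case-sensitive: this pattern matches 'Jane' but not 'JANE' or 'jane' */
proc sql;
  create table a1 as
  select *
  from a
  where customer like '%Jane%'
; quit;

/* Both datetime constants need the dt suffix */
proc sql;
  create table a2 as
  select *
  from a
  where customer like '%Jane%'
    and visit_dte between '01jan2001:00:00:00'dt and '31jan2001:00:00:00'dt
; quit;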
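
Because LIKE compares strings case-sensitively, one common workaround is to convert the column to a single case before matching. This is a minimal sketch using the same table a and column customer as above. The output table name a3 is an assumption.

```sas
/* Match 'Jane', 'JANE', 'jane', etc. by upcasing the column first */
proc sql;
  create table a3 as
  select *
  from a
  where upcase(customer) like '%JANE%'
; quit;
```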

A String with Quote - Like Statement

/* A single quote inside a quoted string is written as two single quotes */
proc sql;
  create table a1 as
  select *
  from retail
  where store_name like '%MAC''Y%'
; quit;

SQL Query Trick

Table A: customer_id
Table B: customer_id, visit_date, purchase_amt

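/* Pitfall: the WHERE clause on b.visit_date drops customers with no matching
   visit, so the left join behaves like an inner join */
proc sql;
  create table ab as
  select a.customer_id, b.visit_date, b.purchase_amt
  from a left join b on a.customer_id = b.customer_id
  where b.visit_date between '01jan2011'd and '31jan2011'd
; quit;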

Correct way:
/* Filter table B first, then left join the filtered table */
data b1;
  set b;
  if '01jan2011'd <= visit_date <= '31jan2011'd;
run;

proc sql;
  create table ab as
  select a.customer_id, b1.visit_date, b1.purchase_amt
  from a left join b1 on a.customer_id = b1.customer_id
; quit;
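
An alternative that avoids the extra Data Step is to move the date condition from the WHERE clause into the ON clause, so that every customer in Table A is kept. This is a sketch using the same table and column names as above.

```sas
/* The date filter is part of the join condition, so customers without a
   January visit still appear, with missing visit_date and purchase_amt */
proc sql;
  create table ab as
  select a.customer_id, b.visit_date, b.purchase_amt
  from a left join b
    on a.customer_id = b.customer_id
   and b.visit_date between '01jan2011'd and '31jan2011'd
; quit;
```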

Saturday, August 13, 2011

The Magic of Proc Transpose

The SAS Data Step offers the FIRST. and LAST. variables, the LAG function, and the RETAIN statement to let users process data vertically, that is, row by row. Many users get stuck in this paradigm and overlook that the processing can often be simplified by working horizontally. The key is to use PROC TRANSPOSE to reshape the data and then use a Data Step to process each row. Here is an example. We have a list of customers visiting the stores, and we want to know the highest purchase amount in each customer's last three visits.

/* Vertical approach: RETAIN plus FIRST./LAST. processing */
proc sort data=customers;
  by customer_id descending visit_date;
run;

data customers_v;
  set customers;
  by customer_id;
  retain visit_cnt max_amt;
  if first.customer_id then do;
    max_amt = 0;
    visit_cnt = 1;
  end;
  if visit_cnt <= 3 and amount > max_amt then max_amt = amount;
  visit_cnt = visit_cnt + 1;
  if last.customer_id;   /* keep one row per customer */
run;

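/* Horizontal approach: transpose first, then take the row-wise maximum */
proc sort data=customers;
  by customer_id descending visit_date;
run;

proc transpose data=customers out=customers_t prefix=amt;
  by customer_id;
  var amount;
run;

data customers_t;
  set customers_t;
  max_amt = max(of amt1-amt3);   /* amt1 is the most recent visit */
run;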

Here is another post which demonstrates the use of this programming paradigm.

How to Know a SAS Step Is Not Progressing

For each SAS dataset being created or modified by a SAS step, a .lck file is created in the work directory. This file has a name of the form dataset-name.sas7bdat.lck, and SAS writes the output for the dataset to it. Once the step completes, the .lck file is renamed to dataset-name.sas7bdat. If the step is progressing well, the .lck file should grow continuously. If it does not, something is wrong with the step. For example, if you run a PROC SQL step to retrieve data from a database and it never finishes, look at the .lck file. If its size stays at 1KB, something is very likely wrong with the PROC SQL step and it is worth investigating.
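
To find the work directory where the .lck file lives, one option is to print the physical path of the WORK library to the log. This is a small sketch. You then check the file size with your operating system's own tools.

```sas
/* Print the physical path of the WORK library to the SAS log */
%put WORK directory: %sysfunc(pathname(work));
```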

Thursday, August 11, 2011

The Magic of Colon (:) in SAS

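The colon in SAS is useful and handy in many ways. In this post, several examples are offered to illustrate this.

Variable name prefix list:
array avar(*) a1-a10; => array avar(*) a:;
Note that a: refers to every variable whose name begins with "a", not only a1-a10.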
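
Here is a short sketch of three colon uses in one Data Step: the colon format modifier on INPUT, a name prefix list, and the =: comparison operator. The dataset demo and its values are invented for illustration.

```sas
data demo;
  /* : modifier reads up to 20 characters but stops at the next delimiter */
  input name : $20. amt1 amt2 amt3;
  /* amt: refers to all variables defined so far whose names start with amt */
  total = sum(of amt:);
  /* =: compares only as many characters as the shorter operand has */
  starts_with_j = (name =: 'J');
  datalines;
Jane 10 20 30
Robert 5 15 25
;
run;
```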

Wednesday, August 10, 2011

Cohort Study, Case-Control Study, RCT

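These three terms are confusing to many people.

Observational Study:
Cohort: from exposure to outcome
Case-Control: from outcome to exposure

Experimental Study:
RCT (randomized controlled trial): the investigator randomly assigns the exposure (treatment), then follows subjects to the outcome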

Tuesday, August 9, 2011

Use Geodist Function to Compute the Distance between Two Latitude and Longitude Coordinates
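
A minimal sketch of a GEODIST call is shown below. The coordinates are approximate values for New York City and Los Angeles, chosen only for illustration. Latitudes and longitudes are in degrees by default, and the 'K' and 'M' options request kilometers and miles.

```sas
data _null_;
  /* Approximate coordinates (illustrative): New York City and Los Angeles */
  lat1 = 40.7128;  long1 = -74.0060;
  lat2 = 34.0522;  long2 = -118.2437;

  dist_km    = geodist(lat1, long1, lat2, long2, 'K');
  dist_miles = geodist(lat1, long1, lat2, long2, 'M');

  put dist_km= dist_miles=;
run;
```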

Medication Persistence Ratio Example

proc transpose;

data a1;
array ff{*} fill:;

Four Special Words Used in SAS Arrays

There are four special words used in SAS arrays: 


Definition of Baseline

1. Information gathered at the beginning of a study from which variations found in the study are measured.
2. A known value or quantity with which an unknown is compared when measured or assessed.
3. The initial time point in a clinical trial, just before a participant starts to receive the experimental treatment which is being tested. At this reference point, measurable values such as CD4 count are recorded. Safety and efficacy of a drug are often determined by monitoring changes from the baseline values.

Actuarial Pricing Example

Propensity Score Matching

SAS Programmer Career

Have you ever heard of data plumbing? That is right. That is the terrible term for describing a SAS programmer who only knows

What does cohort mean?

Cohort is a term frequently seen in statistical analysis. In general, it can be translated as "group": a set of subjects who share a common characteristic or experience.

Wednesday, June 8, 2011

Predict a Linear Trend

/* Simulate 100 days of sales with an upward linear trend plus normal noise */
data sales;
  retain mu 300 std 1000 seed 0;
  format sls_dt mmddyy10.;

  do i=1 to 100;
    sls_dt = '01jan2011'd + i - 1;
    sls_amt = 50*i + mu + std*rannor(seed);
    if sls_amt < 0 then sls_amt = 0;
    output;
  end;
run;

proc gplot data = sales;
  plot sls_amt * sls_dt;
run; quit;

proc reg data=sales outest=est;
  model sls_amt = sls_dt;
run; quit;

/* In the OUTEST= dataset, the variable sls_dt holds the estimated slope */
data _null_;
  set est;
  prediction = intercept + sls_dt * '31dec2011'd;
  put prediction;
run;

Saturday, May 21, 2011

SAS DOE Articles

An Introduction to Experimental Design Using SAS

Application of Experimental Design in Consumer Direct-Mail

Introduction to Design and Analysis of Experiments with the SAS System by Asheber Abebe

Simulate a Normal Distribution

SAS offers a function called rannor which allows you to generate a sample from a normal distribution easily.

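data temp(keep=x);
  retain mu 50 std 20 seed 0;   /* seed 0 uses the clock, so results vary per run */
  do i=1 to 1000;
    x = mu + std*rannor(seed);  /* rannor returns a standard normal draw */
    output;
  end;
run;

proc chart data=temp;
  vbar x;
run;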

Tuesday, May 17, 2011

Survival Analysis Example Using LIFETEST

Survival data consist of a response (event time, failure time, or survival time) variable that measures the duration of time until a specified event occurs and possibly a set of independent variables thought to be associated with the failure time variable. These independent variables (concomitant variables, covariates, or prognostic factors) can be either discrete, such as sex or race, or continuous, such as age or temperature. The system that gives rise to the event of interest can be biological, as for most medical data, or physical, as for engineering data. The purpose of survival analysis is to model the underlying distribution of the failure time variable and to assess the dependence of the failure time variable on the independent variables.

The following data is from Prentice, R.L. "Exponential survivals with censoring and explanatory variables.", Biometrika 60, 1973, 279-288.   

The LIFETEST procedure computes nonparametric estimates of the survival distribution function. You can request either the product-limit (Kaplan-Meier) or the life-table (actuarial) estimate of the distribution. PROC LIFETEST also computes nonparametric tests to compare the survival (Kaplan-Meier) curves of two or more groups; no covariates are involved. If covariates are involved, use the Cox proportional hazards model (PROC PHREG).

H0: S1(t) = S2(t)
HA: S1(t) ^= S2(t)


proc phreg data=hsv;
  model wks*cens(1) = trt / ties=exact;
run;
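
For the nonparametric comparison of the two survival curves stated in H0 above, a PROC LIFETEST call might look like the sketch below. It assumes the same dataset hsv and the same variables as the PROC PHREG step: wks is the time variable, cens=1 marks a censored observation, and trt is the treatment group.

```sas
/* Kaplan-Meier estimates by treatment group, with tests of equality
   of the survival curves across strata */
proc lifetest data=hsv method=km;
  time wks*cens(1);
  strata trt;
run;
```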

Variable Type Definition

UCLA WhatStat offers very good definitions of variable types used in statistical analysis. I expand on that and summarize as below:

Categorical variable (also called nominal variable): has two or more categories, but there is no intrinsic ordering to the categories.

Ordinal variable: is similar to a categorical variable. The difference between the two is that there is a clear ordering of the categories.

Interval variable: is similar to an ordinal variable, except that the intervals between the values of the interval variable are equally spaced.

Dummy variable (indicator variable): A categorical variable that has been dummy coded. Dummy coding (also called indicator coding) is usually used in regression models, but not ANOVA. A dummy variable can have only two values: 0 and 1. When a categorical variable has more than two values, it is recoded into multiple dummy variables.

Nominal variable: another name for a categorical variable (see above).
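
The dummy coding described above can be sketched in a Data Step as follows. The dataset people and the character variable region, with the values East, West, and South, are hypothetical. A variable with three categories needs two dummy variables, with the remaining category (South here) serving as the reference level.

```sas
data coded;
  set people;                          /* hypothetical input dataset */
  region_east = (region = 'East');     /* 1 if East, otherwise 0 */
  region_west = (region = 'West');     /* 1 if West, otherwise 0 */
  /* South is the reference level: both dummies are 0 */
run;
```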

Friday, February 11, 2011

Why Use SAS?

When it comes to statistical analysis, SAS is without doubt my tool of choice. I really can't think of any other tool that offers the same level of flexibility for analyzing data. Below is an excerpt from a book that echoes the same opinion:

While consulting for dozens of companies over 25 years of statistical application to clinical investigation, I have never seen a successful clinical program that did not use SAS.