So I’ve just released v0.01-alpha of my SAS transcompiler to Python (Stan).
Here is just a list of things I’ve learnt:
- SAS is very similar to a PL/0 language
- by statements are inferior to the split-apply-combine strategy
- pyparsing makes life very easy (compared with dealing with lots of regex)
- iPython magics are ridiculously easy to write
- writing Python packages isn’t that hard, but there are a lot of extraneous options
Some of the (many) things which are missing:
- Just about every proc you can think of … you can define your own as a “function”. I know strictly speaking they are not the same thing, but for now it will do. (proc sql coming next release)
- As stated above, no by statements, and hence none of the related statements as well (like retain).
- if-else-then-do not implemented correctly
But of course you would want to see it in action. So here it is!
from stan.transcompile import transcompile
import stan_magic
from pandas import DataFrame
import numpy as np
import pkgutil
import stan.proc_functions as proc_func
mod_name = ["from stan.proc_functions import %s" % name for _, name, _ in pkgutil.iter_modules(proc_func.__path__)]
exec("\n".join(mod_name))
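The exec call above imports every module found under stan.proc_functions. A minimal sketch of the same idea without exec, using importlib; it is demonstrated on the standard library's json package so it runs without Stan installed, and load_submodules is a hypothetical helper name, not part of Stan.

```python
import importlib
import pkgutil


def load_submodules(package):
    """Import every public direct submodule of `package`; return {name: module}."""
    modules = {}
    for _, name, _ in pkgutil.iter_modules(package.__path__):
        if name.startswith("_"):
            continue  # skip private modules such as __main__
        modules[name] = importlib.import_module(package.__name__ + "." + name)
    return modules


import json  # stand-in for stan.proc_functions in this sketch

procs = load_submodules(json)
print(sorted(procs))
# To mimic the exec version, the names could be injected with:
# globals().update(procs)
```

Returning a dictionary keeps the loaded procs in one place, which may be easier to inspect than names injected into the global namespace.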
# create an example data frame
df = DataFrame(np.random.randn(10, 5), columns = ['a','b','c','d','e'])
df
| | a | b | c | d | e |
---|---|---|---|---|---|
0 | -1.402090 | 1.007808 | -0.761436 | 1.520951 | -0.287097 |
1 | -1.522315 | -0.170775 | 0.832071 | -0.640475 | 0.434856 |
2 | 0.161613 | 1.753123 | -0.554494 | -0.102087 | -0.350737 |
3 | -0.797706 | -1.204808 | -0.405977 | 0.421891 | -0.347111 |
4 | 0.287852 | -0.647063 | 1.323138 | 0.347085 | 0.606421 |
5 | 1.711382 | 0.988707 | -0.287785 | 0.862959 | 0.981112 |
6 | -0.145970 | -0.030930 | 1.219454 | -0.544475 | 2.013656 |
7 | 0.203527 | -0.460113 | 0.683482 | -1.917130 | 0.683844 |
8 | -0.397550 | 1.471630 | 0.826813 | 0.107800 | 0.094163 |
9 | 0.012285 | -0.293033 | -0.133107 | 0.748343 | 0.290751 |
%%stan
data test;
set df (drop = a);
run;
u"test=df.drop(['a'],1)\n"
exec(_)
test
| | b | c | d | e | x |
---|---|---|---|---|---|
0 | 1.007808 | -0.761436 | 1.520951 | -0.287097 | 2 |
1 | -0.170775 | 0.832071 | -0.640475 | 0.434856 | 0 |
2 | 1.753123 | -0.554494 | -0.102087 | -0.350737 | 2 |
3 | -1.204808 | -0.405977 | 0.421891 | -0.347111 | 0 |
4 | -0.647063 | 1.323138 | 0.347085 | 0.606421 | 0 |
5 | 0.988707 | -0.287785 | 0.862959 | 0.981112 | 2 |
6 | -0.030930 | 1.219454 | -0.544475 | 2.013656 | 0 |
7 | -0.460113 | 0.683482 | -1.917130 | 0.683844 | 0 |
8 | 1.471630 | 0.826813 | 0.107800 | 0.094163 | 2 |
9 | -0.293033 | -0.133107 | 0.748343 | 0.290751 | 0 |
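To make the translation step above concrete, here is a toy sketch that handles only this one form of data step. It is not Stan's parser (which is built on pyparsing); it uses a single regex, and the names DATA_STEP and toy_transcompile are hypothetical. It also emits axis=1 as a keyword, since newer pandas releases (2.0 onwards) no longer accept the axis as a second positional argument to drop.

```python
import re

# Matches only: data <out>; set <src> (drop = <col> ...); run;
DATA_STEP = re.compile(
    r"data\s+(?P<out>\w+)\s*;\s*"
    r"set\s+(?P<src>\w+)\s*\(\s*drop\s*=\s*(?P<cols>[\w\s]+?)\s*\)\s*;\s*"
    r"run\s*;",
    re.IGNORECASE,
)


def toy_transcompile(sas_code):
    """Translate one `data ...; set ... (drop = ...); run;` step into pandas code."""
    match = DATA_STEP.search(sas_code)
    if match is None:
        raise ValueError("unsupported SAS snippet")
    cols = match.group("cols").split()  # "a b" -> ['a', 'b']
    return "%s=%s.drop(%r, axis=1)\n" % (match.group("out"), match.group("src"), cols)


sas = """
data test;
    set df (drop = a);
run;
"""
print(toy_transcompile(sas))
```

A regex like this breaks down quickly once statements nest or options multiply, which is where a grammar-based parser earns its keep.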
if statements combined with do end statements were difficult to implement.
Here is my current implementation of if-then-else control flow (I’ll have to revisit if and do end statements in the future…)
%%stan
data df_if;
set df;
x = if b < 0.3 then 0 else if b < 0.6 then 1 else 2;
run;
u"df_if=df\ndf_if['x']=df_if.apply(lambda x : 0 if x[u'b']<0.3 else 1 if x[u'b']<0.6 else 2 , axis=1)\n"
exec(_)
df_if
| | a | b | c | d | e | x |
---|---|---|---|---|---|---|
0 | -1.402090 | 1.007808 | -0.761436 | 1.520951 | -0.287097 | 2 |
1 | -1.522315 | -0.170775 | 0.832071 | -0.640475 | 0.434856 | 0 |
2 | 0.161613 | 1.753123 | -0.554494 | -0.102087 | -0.350737 | 2 |
3 | -0.797706 | -1.204808 | -0.405977 | 0.421891 | -0.347111 | 0 |
4 | 0.287852 | -0.647063 | 1.323138 | 0.347085 | 0.606421 | 0 |
5 | 1.711382 | 0.988707 | -0.287785 | 0.862959 | 0.981112 | 2 |
6 | -0.145970 | -0.030930 | 1.219454 | -0.544475 | 2.013656 | 0 |
7 | 0.203527 | -0.460113 | 0.683482 | -1.917130 | 0.683844 | 0 |
8 | -0.397550 | 1.471630 | 0.826813 | 0.107800 | 0.094163 | 2 |
9 | 0.012285 | -0.293033 | -0.133107 | 0.748343 | 0.290751 | 0 |
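Two details of the generated code are worth a sketch. First, df_if=df binds a second name to the same DataFrame, so adding x also adds it to df (possibly why an x column shows up in the test output earlier). Second, the row-wise apply can be written as a vectorised expression. The following is an alternative under those assumptions, not what Stan emits, and the sample values are made up for illustration.

```python
import numpy as np
from pandas import DataFrame

# made-up sample values, one per branch of the if / else-if chain
df = DataFrame({"b": [1.007808, -0.170775, 0.450000, 0.988707]})

# .copy() gives df_if its own data, so adding x leaves df untouched
df_if = df.copy()

# conditions are checked in order; the first one that holds wins
conditions = [df_if["b"] < 0.3, df_if["b"] < 0.6]
choices = [0, 1]
df_if["x"] = np.select(conditions, choices, default=2)

print(df_if)
print("x" in df.columns)
```

np.select evaluates whole columns at once, so it avoids calling a Python lambda for every row.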
# procs can be added manually; they can be thought of as Python functions
# you can define your own, though I need to work on the parser
# to get it "smooth"
df1 = DataFrame({'a' : [1, 0, 1], 'b' : [0, 1, 1] }, dtype=bool)
df1
| | a | b |
---|---|---|
0 | True | False |
1 | False | True |
2 | True | True |
%%stan
proc describe data = df1 out = df2;
by a;
run;
u"df2=describe.describe(data=df1,by='a')\n"
exec(_)
df2
| a | | a | b |
|---|---|---|---|
| False | count | 1 | 1 |
| | mean | 0 | 1 |
| | std | NaN | NaN |
| | min | False | True |
| | 25% | False | True |
| | 50% | 0 | 1 |
| | 75% | False | True |
| | max | False | True |
| True | count | 2 | 2 |
| | mean | 1 | 0.5 |
| | std | 0 | 0.7071068 |
| | min | True | False |
| | 25% | 1 | 0.25 |
| | 50% | 1 | 0.5 |
| | 75% | 1 | 0.75 |
| | max | True | True |
The proc actually isn’t difficult to write. So for the above code it is actually just this:
def describe(data, by):
    return data.groupby(by).describe()
This functionality allows you to handle most of the by and retain cases. For languages like Python and R, the normal way to handle data is through the split-apply-combine methodology.
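As a rough illustration of how split-apply-combine can stand in for by and retain, the sketch below computes a running total within each group (the kind of thing a retain statement is often used for) and a flag similar to SAS's first.variable. The sales frame and its column names are invented for the example.

```python
from pandas import DataFrame

# invented example data
sales = DataFrame({
    "region": ["east", "east", "west", "west", "west"],
    "amount": [10, 20, 5, 15, 25],
})

# split on region, apply a cumulative sum, combine back in the original row order
sales["running_total"] = sales.groupby("region")["amount"].cumsum()

# cumcount numbers the rows within each group, so 0 marks the first row of a group
sales["is_first"] = sales.groupby("region").cumcount() == 0

print(sales)
```

Because the grouping carries the "by" logic, no state has to be carried from one row to the next by hand.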
Merges can be achieved in a similar way, by creating a proc:
%%stan
proc merge out = df2;
dt_left left;
dt_right right;
on = 'key';
run;
u"df2=merge.merge(dt_left=left,dt_right=right,on='key')\n"
left = DataFrame({'key': ['foo', 'foo'], 'lval': [1, 2]})
right = DataFrame({'key': ['foo', 'foo'], 'rval': [4, 5]})
exec(_)
df2
| | key | lval | rval |
---|---|---|---|
0 | foo | 1 | 4 |
1 | foo | 1 | 5 |
2 | foo | 2 | 4 |
3 | foo | 2 | 5 |
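The source of the merge proc isn't shown here. A plausible sketch, assuming it simply wraps pandas.merge and takes the keyword names seen in the generated call (dt_left, dt_right, on); the actual function in the package may differ.

```python
import pandas as pd


def merge(dt_left, dt_right, on):
    """Inner-join two DataFrames on `on`; keyword names mirror the generated call."""
    return pd.merge(dt_left, dt_right, on=on)


left = pd.DataFrame({'key': ['foo', 'foo'], 'lval': [1, 2]})
right = pd.DataFrame({'key': ['foo', 'foo'], 'rval': [4, 5]})

df2 = merge(dt_left=left, dt_right=right, on='key')
print(df2)
```

Since every key matches every other key here, the inner join yields all four left/right combinations, as in the table above.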
Here’s an example showing how you can define your own function and run it (not a function that came with the package):
def sum_mean_by(data, by):
    return data.groupby(by).agg([np.sum, np.mean])
%%stan
proc sum_mean_by data = df_if out = df_sum;
by x;
run;
u"df_sum=sum_mean_by(data=df_if,by='x')\n"
exec(_)
df_sum
| | a | | b | | c | | d | | e | |
---|---|---|---|---|---|---|---|---|---|---|
x | sum | mean | sum | mean | sum | mean | sum | mean | sum | mean |
0 | -1.962327 | -0.327055 | -2.806722 | -0.467787 | 3.519061 | 0.586510 | -1.584762 | -0.264127 | 3.682416 | 0.613736 |
2 | 0.073355 | 0.018339 | 5.221268 | 1.305317 | -0.776902 | -0.194225 | 2.389623 | 0.597406 | 0.437441 | 0.109360 |