PROC-X.com
http://proc-x.com
An online (unofficial) SAS® journal - written by bloggers (previously known as sas-x.com)
Sat, 21 Jun 2014 10:31:53 +0000

Example 2014.6: Comparing medians and the Wilcoxon rank-sum test
http://feedproxy.google.com/~r/Sas-x/~3/u6nVKu1g20E/
http://proc-x.com/2014/06/example-2014-6-comparing-medians-and-the-wilcoxon-rank-sum-test/#comments
Thu, 12 Jun 2014 13:00:00 +0000
http://proc-x.com/?guid=a758b375ecbacf315d997f49934cf037
This post was kindly contributed by SAS and R - go there to comment and to read the full post.

A colleague recently contacted us with the following question: “My outcome is skewed – how can I compare medians across multiple categories?” What they were asking for was a generalization of the Wilcoxon rank-sum test (also known as the Mann-Whitney-Wilcoxon test, among other monikers) to more than two groups. For the record, the answer is the Kruskal-Wallis test: it generalizes the Wilcoxon test in the same way that one-way ANOVA generalizes the t-test. But this question is based on a false premise: that the Wilcoxon rank-sum test is used to compare medians. The premise rests on a misunderstanding of the null hypothesis of the test. The actual null hypothesis is that there is a 50% probability that a random value from one population exceeds a random value from the other population. The practical value of this is hard to see, and thus in many places, including textbooks, the null hypothesis is presented as “the two populations have equal medians”. The actual null hypothesis can be expressed as the latter median hypothesis, but only under the additional assumption that the shapes […]

Remove tabs from SAS code files
http://feedproxy.google.com/~r/Sas-x/~3/mQJ8iyAR3WI/
http://proc-x.com/2014/06/remove-tabs-from-sas-code-files/#comments
Tue, 10 Jun 2014 14:03:00 +0000
http://proc-x.com/?guid=d1d988bbb1f548e39b9fe9a3e841e9e3
This post was kindly contributed by DATA ANALYSIS - go there to comment and to read the full post.

By default, SAS records an indent made with the Tab key as a tab character, which causes many problems when the code files are used in a different environment. There are two ways to eliminate tab characters in SAS and replace them with spaces.

Regular expression
Press Ctrl + H → The Replace window pops up → Choose Regular expression search → In the Find text box, enter \t → In the Replace box, enter several \s, say four

Editor option
Click Tools → Options → Enhanced Editors… → Choose Insert spaces for tabs → Choose Replace tabs with spaces on file open

Use recursion and gradient ascent to solve logistic regression in Python
http://feedproxy.google.com/~r/Sas-x/~3/eNYfW8j1Z5g/
http://proc-x.com/2014/05/use-recursion-and-gradient-ascent-to-solve-logistic-regression-in-python/#comments
Wed, 21 May 2014 21:22:00 +0000
http://proc-x.com/?guid=214affb877fa4dd55c5773214f600949
This post was kindly contributed by DATA ANALYSIS - go there to comment and to read the full post.

In his book Machine Learning in Action, Peter Harrington provides a solution for parameter estimation of logistic regression. I use pandas and ggplot to implement a recursive alternative. Compared with the iterative method, recursion costs more space but may improve performance.

    # -*- coding: utf-8 -*-
    """
    Use recursion and gradient ascent to solve logistic regression in Python
    """
    import pandas as pd
    from ggplot import *
    # NumPy functions used below (assumed source of exp, shape, ones, arange, array)
    from numpy import arange, array, exp, ones, shape

    def sigmoid(inX):
        return 1.0/(1+exp(-inX))

    def grad_ascent(dataMatrix, labelMat, cycle):
        """
        A function to use gradient ascent to calculate the coefficients
        """
        if not isinstance(cycle, int) or cycle < 0:
            raise ValueError("Must be a valid value for the number of iterations")
        m, n = shape(dataMatrix)
        alpha = 0.001
        if cycle == 0:
            # Base case: start from a column vector of ones
            return ones((n, 1))
        else:
            # Recursive case: take one gradient step from the previous weights.
            # dataMatrix and labelMat are expected to be NumPy matrices,
            # so * is a matrix product.
            weights = grad_ascent(dataMatrix, labelMat, cycle-1)
            h = sigmoid(dataMatrix * weights)
            errors = (labelMat - h)
            return weights + alpha * dataMatrix.transpose() * errors

    def plot(vector):
        """
        A function to use ggplot to visualize the result
        """
        x = arange(-3, 3, 0.1)
        y = (-vector[0]-vector[1]*x) / vector[2]
        new = pd.DataFrame()
        new['x'] = x
        new['y'] = array(y).flatten()
        infile.classlab = infile.classlab.astype(str)
        p = ggplot(aes(x='x', y='y', […]

Spring Thaw in Alberta
http://feedproxy.google.com/~r/Sas-x/~3/UGgseTKfLgs/
http://proc-x.com/2014/05/spring-thaw-in-alberta/#comments
Fri, 02 May 2014 01:16:00 +0000
http://proc-x.com/?guid=385bb5ce78e69f2dd0ef5ac2011b0de5

Last week I had the great pleasure of flying to Alberta for the Edmonton and Calgary user group meetings. Well, let me say that it was mostly a great pleasure... my system was still in a bit of temperature shock having been sunning myself on a St. Lucian b...

Count large chunk of data in Python
http://feedproxy.google.com/~r/Sas-x/~3/nDBub7HigSQ/
http://proc-x.com/2014/04/count-large-chunk-of-data-in-python/#comments
Wed, 30 Apr 2014 21:09:00 +0000
http://proc-x.com/?guid=a162d63d8c6158e90df5585ea64785c3
This post was kindly contributed by DATA ANALYSIS - go there to comment and to read the full post.

Python's line-by-line processing allows it to count data that is bound only by the size of the hard disk. The most frequently used data structures in Python are the list and the dictionary. In many cases the dictionary has the advantage, since it is basically a hash table that performs many operations in O(1) time. However, for the task of counting values, the two options make little difference and we can choose either of them for convenience. I list two examples below.

Use a dictionary as a counter
There is a question about counting strings in Excel: count the unique values in one column in Excel 2010. The worksheet has 1 million rows and 10 columns containing strings or numbers. For example,

    A5389579_10
    A1543848_6
    A5389579_8

The part after (and including) the underscore needs to be cut off, for example from A5389579_10 to A5389579. Excel on a desktop commonly can’t handle data of this size, while Python handles the job easily.

    # Load the Excel file with the xlrd package
    import xlrd
    book = xlrd.open_workbook("test.xlsx")
    sh = book.sheet_by_index(0)
    print(sh.name, sh.nrows, sh.ncols)
    print("Cell D30 is", sh.cell_value(rowx=29, colx=3))
    # Count the unique values in a dictionary
    c […]

Example 2014.5: Simple mean imputation
http://feedproxy.google.com/~r/Sas-x/~3/tTTKMbTHvvs/
http://proc-x.com/2014/04/example-2014-5-simple-mean-imputation/#comments
Fri, 25 Apr 2014 14:22:00 +0000
http://proc-x.com/?guid=b1717e548d0d6f003001b3e14401fde1

We're both users of multiple imputation for missing data. We believe it is the most practical principled method for incorporating the most information into data analysis. In fact, one of our more successful collaborations is a review of software for ...

Example 2014.4: Hilbert Matrix
http://feedproxy.google.com/~r/Sas-x/~3/Hn7kPmE_gRs/
http://proc-x.com/2014/04/example-2014-4-hilbert-matrix/#comments
Mon, 14 Apr 2014 13:22:00 +0000
http://proc-x.com/?guid=89bb9fb4db6a54bb144fa274321079b3

Rick Wicklin showed how to make a Hilbert matrix in SAS/IML. Rick has a nice discussion of these matrices and why they might be interesting; the value of H_{r,c} is 1/(r+c-1). We show how to make this matrix in the data step and in R. We also show t...

R Continues Its Rapid Growth
http://feedproxy.google.com/~r/Sas-x/~3/BQ97J1N80vU/
http://proc-x.com/2014/04/r-continues-its-rapid-growth/#comments
Mon, 07 Apr 2014 14:17:21 +0000
http://r4stats.com/?p=1226
This post was kindly contributed by r4stats.com » SAS - go there to comment and to read the full post.

I’ve just updated the section below from The Popularity of Data Analysis Software. Note that the overall article is still under construction and all the figure numbers have changed from previous versions.

Growth in Capability
The capability of analytics software has grown significantly over the years. It would be helpful to be able to plot the growth of each software package’s capabilities, but such data is hard to obtain. John Fox (2009) acquired it for R’s main distribution site http://cran.r-project.org/. I collected the data for later versions following his method.

Figure 8 shows that the growth in R packages is following a rapid parabolic arc (quadratic fit with R-squared=.998). The right-most point is for version 3.0.2, the last version released in 2013.

[Figure 8]

As rapid as this growth has been, these data represent only the main CRAN repository. R does have eight other software repositories, such as the one at http://www.bioconductor.org/, that are not included in this graph. A program run on 4/7/2014 counted 7,364 R packages at all major […]

10 popular Linux commands for Hadoop
http://feedproxy.google.com/~r/Sas-x/~3/7VyuCuU4pSQ/
http://proc-x.com/2014/04/10-popular-linux-commands-for-hadoop/#comments
Sun, 06 Apr 2014 13:41:00 +0000
http://proc-x.com/?guid=2418e611690ff9ade55fbb370be88ec2
This post was kindly contributed by DATA ANALYSIS - go there to comment and to read the full post.

Hadoop has its own shell language, which is called FS. Compared with the common Bash shell in the Linux ecosystem, the FS shell has far fewer commands. To deal with the humongous volume of data distributed across the Hadoop nodes, I rely on 10 popular Linux commands in my daily work.

1. sort
A good practice when running Hadoop is to always test the map/reduce programs on the local machine before releasing the time-consuming map/reduce code to the cluster environment. The sort command simulates the sort and shuffle step of the map/reduce process. For example, I can run the piped commands below to verify whether the Python code has any bugs.

    ./mapper.py | sort | ./reducer.py

2. tail
Interestingly, the FS shell in Hadoop supports only the tail command, not the head command, so I can only grab the bottom lines of the data stored in Hadoop.

    hadoop fs -tail 5 data/web.log.9

3. sed
Since the FS shell doesn’t provide the head command, the alternative is to use the sed command, which actually has more flexible options.

    hadoop fs -cat data/web.log.9 | sed […]

SAS vs. Python for data analysis
http://feedproxy.google.com/~r/Sas-x/~3/XQ3kxCEmzuo/
http://proc-x.com/2014/03/sas-vs-python-for-data-analysis/#comments
Thu, 27 Mar 2014 21:01:00 +0000
http://proc-x.com/?guid=0f936c3813149cefddf9d91ac95d6b35
<div>To perform data analysis efficiently, I need a full-stack programming language rather than frequently switching from one language to another. That means the language can hold large quantities of data, manipulate data promptly and easily (e.g. if-then-else; iteration), connect to various data sources such as relational databases and Hadoop, apply some statistical models, and report results as graphs, tables, or web pages. SAS is famous for its capacity to realize such a data cycle, as long as you are willing to pay the annual license fee.</div><div>SAS’s long-standing competitor, R, keeps growing. However, in the past years, the Python community has launched a crazy movement to port R’s jewels and ideas to Python, which has resulted in a few solid applications such as <a href="http://pandas.pydata.org/">pandas</a> and <a href="https://github.com/yhat/ggplot/">ggplot</a>. With the rapid accumulation of data-related tools in Python, I feel more comfortable working with data in Python than in R, because I have a bias that Python’s interpreter is steadier than R’s when dealing with data, and sometimes I just want to escape from R’s idiosyncratic syntax such as <code>x<-4</code> or <code>foo.bar.2000=10</code>.<br /><br /><div>Actually there is no competition between SAS and R at all: these two dwell in parallel universes and rely on distinct ecosystems. SAS, Python, Bash and Perl process data row-wise, which means they input and output data line by line. R, Matlab, SAS/IML, Python/pandas and SQL manipulate data column-wise. For row-wise packages such as SAS, the size of the data is bound only by the hard disk, at the cost of the slower speed of disk access. In contrast, column-wise packages such as R are memory-bound, but gain the much faster speed of memory. 
</div><div><p></p></div></div><div>Let’s go back to the comparison between SAS and Python. For most of the parts of SAS I am familiar with, I can find equivalent modules in Python. The table below lists the similar components of SAS and Python.</div><table><thead><tr><th>SAS</th><th>Python</th></tr></thead><tbody><tr><td>DATA step</td><td>core Python</td></tr><tr><td>SAS/STAT</td><td><a href="http://statsmodels.sourceforge.net/stable/">StatsModels</a></td></tr><tr><td>SAS/Graph</td><td><a href="http://matplotlib.org/">matplotlib</a></td></tr><tr><td>SAS Statistical Graphics</td><td><a href="https://github.com/yhat/ggplot/">ggplot</a></td></tr><tr><td>PROC SQL</td><td><a href="http://docs.python.org/2/library/sqlite3.html">sqlite3</a></td></tr><tr><td>SAS/IML</td><td><a href="http://www.numpy.org/">NumPy</a></td></tr><tr><td>SAS Windowing Environment</td><td>Qt Console for IPython</td></tr><tr><td>SAS Studio</td><td>IPython notebook</td></tr><tr><td><a href="https://www.sas.com/en_us/software/sas-hadoop/in-memory-hadoop.html">SAS In-Memory Analytics for Hadoop</a></td><td>Spark with Python</td></tr></tbody></table><div>This week SAS announced some promising products. Interestingly, they can be traced to similar implementations in Python. For example, <a href="http://support.sas.com/software/products/sasstudio/index.html#s1=1">SAS Studio</a>, a fancy web-based IDE with code completion, starts a web server on the local machine and uses a browser for coding, which is amazingly similar to the <a href="http://ipython.org/ipython-doc/dev/notebook/index.html">IPython notebook</a>. Another example is <a href="https://www.sas.com/en_us/software/sas-hadoop/in-memory-hadoop.html">SAS In-Memory Analytics for Hadoop</a>. Given that the old MapReduce path for data analysis is painfully time-consuming and complicated, aggregating memory instead of hard disk across many nodes of a Hadoop cluster is certainly faster and more interactive. 
Based on the same idea, <a href="http://spark.apache.org/">Apache Spark</a>, which fully supports Python scripting, has just been <a href="http://blog.cloudera.com/blog/2014/02/spark-is-now-generally-available-for-cloudera-enterprise/">released to CDH 5.0</a>. It will be interesting to compare the in-memory capabilities of Python and SAS for data analysis at the level of Hadoop.</div><div>Until a new killer app appears for R, Python has, at least for now, stolen R’s thunder as an open source alternative to SAS.</div>
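
The row-wise versus column-wise distinction drawn in the SAS vs. Python post can be illustrated with a minimal sketch in core Python. Everything here is an assumption made for illustration: the sample data, the column name value, and the helper names rowwise_sum and columnwise_sum do not come from the original post. The first function streams one record at a time and keeps only a running total, so its input can be as large as the disk allows; the second loads the whole column into memory before reducing it, so its input is bound by memory.

```python
import csv
import io

# Hypothetical sample data standing in for a large file on disk.
SAMPLE = "id,group,value\n1,a,10\n2,b,20\n3,a,30\n"


def rowwise_sum(lines):
    """Row-wise: read one record at a time and keep only a running total."""
    total = 0.0
    for row in csv.DictReader(lines):
        total += float(row["value"])
    return total


def columnwise_sum(lines):
    """Column-wise: hold the entire column in memory, then reduce it in one step."""
    values = [float(row["value"]) for row in csv.DictReader(lines)]
    return sum(values)


if __name__ == "__main__":
    print(rowwise_sum(io.StringIO(SAMPLE)))
    print(columnwise_sum(io.StringIO(SAMPLE)))
```

Both functions return the same total. They differ only in how much of the data must be resident at once, which is the trade-off between size and speed described in the post.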