Statistical Programming With R
An anonymous reader writes "This series introduces you to R, a rich statistical environment, released as free software. It includes a programming language, an interactive shell, and extensive graphing capability. What's more, R comes with a spectacular collection of functions for mathematical and statistical manipulations -- with still more capabilities available in optional packages."
SPSS is garbage (Score:3, Informative)
And Of Course... (Score:4, Informative)
Re:Good-oh... (Score:3, Informative)
Re:What's a Robust Replacement for Excel??? . . . (Score:2, Informative)
http://www.wolfram.com/news/statistics.html [wolfram.com]
Re:Graphing, hah! (Score:3, Informative)
I don't know how R came into being, but IDL was originally designed as an ad hoc statistics and plotting tool. Because everyone was using it as an ad hoc tool, there was an assumption that everyone always had a display available. Unfortunately, that design flaw still exists in the language today. The implementation wasn't stupid back then, but fortunately IDL has new ways to handle graphics so that the display isn't involved. Maybe R had a similar history? Maybe not?
Re:What's a Robust Replacement for Excel??? . . . (Score:1, Informative)
No data entry facilities, but it handles multi-dimensional visualisation very well and has a handful of convenient methods (correlation analysis, PCA, histogram) built in.
I've come to realise that Excel's data visualisation is almost totally a joke, and that its value for data entry is almost as questionable. I'm a little bit irked by the Gnumeric team's decision to keep all of Excel's "features" (256 column max, &c.)...
Re:What's a Robust Replacement for Excel??? . . . (Score:2, Informative)
what R isn't (Score:5, Informative)
For people who have never taken real stat classes in college (or never learned it on their own) R will seem like a useless language. Most other languages can handle basic statistics computations.
Statistics is a whole lot more than means and averages. When I took my first real stat class, everything I knew about statistics was literally covered on the first half of the first page. I was totally blown away by what you could do with statistics.
R is for hardcore stat folk who know a bit about programming, not programmers who need to do a little basic computation.
R supports graphics output in many formats (Score:2, Informative)
Output to different graphics devices has been in S, Splus, and R for as long as I can remember (and that's a long time). Maybe you should try having a look at the copious documentation for R; the documentation, like the system itself, is free.
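To make the point about graphics devices concrete, here is a minimal sketch that writes one plot to two file-based devices without opening a window. The file names and the sine curve are made up for illustration; pdf() and postscript() are standard devices in R's grDevices package.

```r
# Draw the same plot on two file devices; no screen display is involved.
x <- seq(0, 2 * pi, length.out = 100)

out_pdf <- file.path(tempdir(), "sine.pdf")  # hypothetical output path
out_ps  <- file.path(tempdir(), "sine.ps")   # hypothetical output path

pdf(out_pdf, width = 6, height = 4)          # open a PDF device (size in inches)
plot(x, sin(x), type = "l", main = "sin(x)")
dev.off()                                    # close the device and flush the file

postscript(out_ps)                           # open a PostScript device
plot(x, sin(x), type = "l", main = "sin(x)")
dev.off()
```

Bitmap devices such as png() work the same way, though whether they are available can depend on how R was built.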
Re:So, what's the difference... (Score:4, Informative)
In contrast, R is very close to Splus and comes with an extensive array of statistical toolboxes. Many professional users use, and even prefer, R for their day-to-day work.
If you are doing anything with statistics, graphs of real-world data, or bioinformatics, R is the package to use.
If you are doing other kinds of numerical work, things are less clear. Matlab is widely used, but it is hugely expensive and the language is pretty limited. Octave is the obvious open source choice, but there aren't many packages for it, and Matlab software requires some amount of porting if you want to use it with Octave. Numerical Python is technically far better than either Matlab or Octave, and it has a lot of packages and features that neither offers, but it (obviously) isn't Matlab compatible, so you can't just load existing Matlab packages into it.
Re:Comparison to octave? (Score:2, Informative)
Octave is basically an open source version of Matlab. This R looks similar just with a different programming language and different libraries.
It looks like it's probably more powerful but I don't know since I haven't used R.
RTFM! (Score:5, Informative)
R is really a beautiful language, for its purpose. It has a very nice correspondence between math and code, and for most parts of "hard" science, that's really important.
Compared to MATLAB, you can easily write R code 5 times as compact as MATLAB code, and still get more understandable code.
Re:what R isn't (Score:3, Informative)
However, you do not necessarily need to be into statistics to find R appealing. I'm an astrophysicist, and I wrote my whole thesis based on R. I started out with a bit of C, and I used some small Perl hacks to do some naive parallelizing, but I eventually phased out the C code and relied on R. I'd write thousands of lines of R code rather than go back to something like IDL (*shrug*).
For programmers, it may be a bit hard to overcome that you do not need for loops in R. For most purposes, think matrix and vector arithmetic instead. If you look at it from a math perspective that makes a lot of sense. For the things you do where it doesn't make sense to think in terms of vector arithmetic, you think in terms of applying functions to array elements instead.
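A small sketch of what "no for loops" looks like in practice. The data and the classify() helper are hypothetical; the loop is only there for comparison with the vectorized form.

```r
x <- c(1.5, 2.0, 3.5, 4.0)
y <- c(2.0, 1.0, 0.5, 3.0)

# Loop version, as one might write it coming from C.
d_loop <- numeric(length(x))
for (i in seq_along(x)) {
  d_loop[i] <- sqrt(x[i]^2 + y[i]^2)
}

# Vectorized version: the arithmetic applies element by element.
d_vec <- sqrt(x^2 + y^2)

# Where no vector expression fits, apply a function to each element.
classify <- function(r) if (r < 3) "near" else "far"  # hypothetical helper
where <- sapply(d_vec, classify)

stopifnot(isTRUE(all.equal(d_loop, d_vec)))  # both versions agree
```

The vectorized line reads almost exactly like the formula for distance from the origin, which is the correspondence between math and code being described.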
Also, R has some simple OO concepts. They do not aim to do everything OO does, but the things they do, they do very well (as opposed to IDL (*shrug*, *shrug*), where they attempt to do everything, but do it badly). You need to exploit this to make it pretty.
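Assuming the "simple OO concepts" refers to R's S3 class system (the comment does not say which one), a minimal sketch looks like this. The "star" class, its fields, and the brighter() generic are made up for illustration.

```r
# Constructor for a hypothetical "star" class: a list with a class attribute.
star <- function(name, magnitude) {
  structure(list(name = name, magnitude = magnitude), class = "star")
}

# A method for the existing generic print(), dispatched on the object's class.
print.star <- function(x, ...) {
  cat("Star", x$name, "with magnitude", x$magnitude, "\n")
  invisible(x)
}

# A new generic plus a method for our class.
brighter <- function(a, b) UseMethod("brighter")
brighter.star <- function(a, b) a$magnitude < b$magnitude  # lower magnitude is brighter

vega    <- star("Vega", 0.03)
polaris <- star("Polaris", 1.98)

print(vega)                # dispatches to print.star
brighter(vega, polaris)    # dispatches to brighter.star
```

There are no class declarations or access modifiers here: a class is just an attribute, and a method is just a function named generic.class.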
I think these are the main two things that need to be overcome for most scientists to use R efficiently. I really fell in love with the language for these two reasons, and I'd recommend R for all scientists, also non-statisticians.
Re:What's a Robust Replacement for Excel??? . . . (Score:4, Informative)
1) Matrix languages (e.g. Matlab, Gauss): These have C-like syntax with the basic data object being an n x m matrix (so, internally, a scalar is a 1x1 matrix). These languages are the way to go if you want to write your own statistical/simulation algorithms. They do have extensive pre-written routines for many statistical tasks, but they're mainly for people who know that a regression coefficient vector is given by inv(X'X)X'y and aren't afraid to code that. The nice thing is that it would be a single line of code to do this computation. I believe GNU Octave belongs to this category.
2) Data languages (SAS, SPSS): The basic object here is a dataset with variables. Inverting a data matrix here is essentially a meaningless concept, and would be extremely difficult to do, but creating a new variable that sums sales for different people by division for certain months is straightforward (note that this is very difficult in a matrix language). Beyond trivial manipulations, you'd store code in procedures like any programming language.
3) Menu-driven languages (e.g. EViews): The basic object is still a dataset with variables, but your primary method of manipulation is menu-driven. Want to run a regression? Just select your dependent and independent variables from dropdown lists and click
There's some area of overlap between 2 and 3. 2-type programs provide a rudimentary menu-driven system for those who don't want to code everything, and 3-type languages will allow you to store some command line instructions for future use.
In terms of learning curves, they get progressively flatter (easier) from 1 to 2 to 3.
Pick your poison!
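For what it's worth, the one-line regression formula from category 1 can be written in R as well. A minimal sketch with simulated data (the variables and coefficients are made up), checked against R's built-in lm():

```r
set.seed(1)                        # reproducible made-up data
n  <- 50
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- 1 + 2 * x1 - 0.5 * x2 + rnorm(n, sd = 0.1)

X <- cbind(1, x1, x2)              # design matrix with an intercept column

# The textbook formula in one line: inv(X'X) X'y
beta <- solve(t(X) %*% X) %*% t(X) %*% y

# The same fit through the built-in linear model routine.
fit <- lm(y ~ x1 + x2)

stopifnot(isTRUE(all.equal(as.vector(beta), unname(coef(fit)))))
```

Forming inv(X'X) explicitly is fine for a demonstration, but it is numerically less stable than the decomposition lm() uses, so the formula is better treated as a teaching device than as production code.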
Re:So, what's the difference... (Score:4, Informative)
Anyway, what it doesn't do as well as IDL (*shrug*) is visualization. Its graphing is limited to, well, graphs. Interactive analysis with funny widgets and stuff isn't R's selling point. Nor is R very well developed for image analysis and stuff like that. I think they have multi-D Fourier transforms now, but they didn't two years ago.
IDL, OTOH, doesn't really do statistics at all. For example, it doesn't come with something as fundamental as QQ-plots. Believe it or not, but every paper that comes with an assumption of normality should come with a QQ-plot... Or at least the authors should have made one.
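For reference, a normal QQ-plot in R takes two calls. A minimal sketch, with random numbers standing in for the residuals one would actually check:

```r
set.seed(42)
res <- rnorm(200)      # stand-in for model residuals (made-up data)

qq <- qqnorm(res, main = "Normal Q-Q plot")  # sample vs. theoretical quantiles
qqline(res)                                  # reference line through the quartiles
```

If the points hug the line, the normality assumption is plausible; systematic curvature at the ends suggests heavy or light tails. qqnorm() also returns the plotted coordinates invisibly, which is why the result can be stored in qq.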
The syntax of IDL (*shrug*) is unbelievably nasty (*shrug*, aargh, sorry, couldn't resist). I heard they have done something about it now, but two years ago, IDL's concept of scoping was at best, uhm, well, unclear. You could easily modify variables in other people's badly designed code without being aware of it. Then, the COMMON blocks you often needed to pass parameters...? I have a hard time understanding why people would actually use anything like IDL (*shrug*). R has very clear-cut lexical scoping of objects. You've got to really design your code veeery badly to fall into the same traps IDL programmers fall into on a regular basis. I've seen IDL programmers who've been in it since the beginning go WTF over scoping... It was better being a lone R user than an IDL user with a lot of support...
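A short sketch of what lexical scoping means in R. All names here (gain, rescale, make_counter) are hypothetical, chosen only to show that a function cannot silently clobber a caller's variables.

```r
gain <- 10                      # a variable in the global environment

rescale <- function(v) {
  gain <- 2                     # local binding; the global gain is untouched
  v * gain
}

make_counter <- function() {
  count <- 0                    # lives in this call's own environment
  function() {
    count <<- count + 1         # <<- updates the enclosing count, not a global
    count
  }
}

scaled  <- rescale(c(1, 2, 3))  # uses the local gain
counter <- make_counter()
first   <- counter()
second  <- counter()

stopifnot(gain == 10)           # the global was never modified
```

Each function sees the variables of the environment where it was defined, and plain assignment inside a function always creates or changes a local. Shared state has to be set up deliberately, as make_counter() does.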
Also, IDL attempted to get OO in with version 5 (IIRC), but it is a mess. OO designers would be rolling in their graves over this. R, OTOH, has decided not to incorporate all OO concepts, but the stuff they have done is very clean, very easy to understand, and perfectly sound.
But the real point of R is to have very clear mapping between code and mathematics. You code your math, it is so easy to see what happens. No iterating over array indices, it simply never happens. That's extremely appealing once you've got the hang of it.
I once translated 70 lines of MATLAB code to 7 lines of R code, some interpolation stuff that didn't exist in R. Never finished it though, because I found I didn't need it, but as a proof of concept it was great. And while MATLAB code was pretty hard to grok, the R code was very straightforward, you could just show it to anyone with basic training in math, and they would immediately see what it did. Try that with code from any of the others!
I think that the basic thing is that most numerical math for physics and astronomy is right now more advanced in IDL or MATLAB. If you do any kind of statistics, you should be going over to R. If you are willing to code, I'd argue that R is a platform so much better than IDL and MATLAB, you should be migrating your code starting now. I know I'd be writing thousands of lines of R code rather than going back to IDL (*shrug*)... :-)
Then, you know, you can't inspect the code in the core of IDL or MATLAB. There are likely to be flaws in there, and they may not have mattered for any problems other than yours.... I got hit with three bad bugs in R when I worked with it, I managed to narrow them down, and they were all corrected within hours. To me, this is extremely important. The implementation of math should be available for review just like a derivation of equations is.
Re:Comparison of R, Mathematica, S-plus, Matlab, e (Score:3, Informative)
Other tools which I have come across, but haven't really worked with: Axiom [nongnu.org] (symbolic computations, CAS); Scigraphica [sourceforge.net] (graphing); opendx [opendx.org] (data explorer + visualization).
I've actually never really used R (by the time I came across it, I was done with my physics labs), so I can't really compare any of the others to it. But it definitely looks like one of the tools that I should add to my suite.
Re:What's a Robust Replacement for Excel??? . . . (Score:1, Informative)
v.Inverse[X].v
It's mostly shell-based, but the shell includes pretty formulas and graphics, histograms are not too difficult to do, and EVERYTHING in Mathematica is a data structure which can be read and manipulated by the shell, including bitmaps, sound, and even a Mathematica notebook, of which the manual is one! You can also get a package to link it to Excel if you really want to.
Go to wolfram.com and order a trial version. That's what I did and I quickly came to the conclusion it was worth paying the full price.
Others around me (for the purposes of this discussion we're quantitative analysts for a financial company) use S+. It is like R (like C) but it has an Excel-like front-end which might suit you better. I'm told it's similarly expensive though.
Another option I considered was Python, with the freeware statistical and graphics packages which are available for the language, but Mathematica beat the pants off this combination for usability.