Statistical Programming With R

An anonymous reader writes "This series introduces you to R, a rich statistical environment, released as free software. It includes a programming language, an interactive shell, and extensive graphing capability. What's more, R comes with a spectacular collection of functions for mathematical and statistical manipulations -- with still more capabilities available in optional packages."
  • SPSS is garbage (Score:3, Informative)

    by BoomerSooner ( 308737 ) on Wednesday September 22, 2004 @01:39PM (#10320350) Homepage Journal
    SAS, Minitab, hell, anything is better than SPSS. Now add R to that list.
  • And Of Course... (Score:4, Informative)

    by pnatural ( 59329 ) on Wednesday September 22, 2004 @02:17PM (#10320856)
    There are Python bindings. They are here [sourceforge.net]. Enjoy!
  • Re:Good-oh... (Score:3, Informative)

    by RealAlaskan ( 576404 ) on Wednesday September 22, 2004 @02:34PM (#10321088) Homepage Journal
    R is like SPLUS. If you _must_ have SPSS, look into PSPP. [gnu.org]
  • by Anonymous Coward on Wednesday September 22, 2004 @02:35PM (#10321098)
    My personal choice is Mathematica. A study done four years ago used a dataset with known results to check a number of popular statistics packages. Mathematica was the only package that got everything correct:

    http://www.wolfram.com/news/statistics.html [wolfram.com]
  • Re:Graphing, hah! (Score:3, Informative)

    by StarWynd ( 751816 ) on Wednesday September 22, 2004 @03:28PM (#10321771)
    This doesn't make it any better, but R isn't the only language to have such weirdness. Here at work, we use IDL (Interactive Data Language) for our statistics and plotting, and while there are ways to circumvent the DISPLAY problem, most of the time it's easiest to set up Xvfb and let IDL use that as the DISPLAY.

    I don't know how R came into being, but IDL was originally designed as an ad hoc statistics and plotting tool. Because everyone was using it that way, there was an assumption that a display would always be available. Unfortunately, that design flaw still exists in the language today. The implementation wasn't stupid back then, and fortunately IDL now has ways to handle graphics so that the display isn't involved. Maybe R had a similar history? Maybe not?
  • by Anonymous Coward on Wednesday September 22, 2004 @03:34PM (#10321839)
    It might not be exactly what you want (i.e., it's a bit unfriendly and rough around the edges), but "xgobi" (for X; I think they're working on "jgobi" for Java), from Bell Labs, is a very powerful and free-ish data visualization package.

    No data entry facilities, but it handles multi-dim. visualisation very well and has a handful of convenient methods (correlation analysis, PCA, histograms) built in.

    I've come to realise that Excel's data vis. is almost totally a joke, and that its value for data entry is almost as questionable. I'm a little bit irked by the Gnumeric team's decision to keep all of Excel's "features" (256 column max, &c.)...
  • by Anonymous Coward on Wednesday September 22, 2004 @03:38PM (#10321899)
    Mathematica is nice, but if you're cheap (raises hand), try Octave [octave.org]. Some plotting capability, including histograms.
  • what R isn't (Score:5, Informative)

    by bahamutirc ( 648840 ) on Wednesday September 22, 2004 @03:44PM (#10321972) Homepage

    For people who have never taken real stat classes in college (or never learned statistics on their own), R will seem like a useless language. Most other languages can handle basic statistical computations.

    Statistics is a whole lot more than computing averages. When I took my first real stat class, everything I knew about statistics was literally covered in the first half of the first page. I was totally blown away by what you could do with statistics.

    R is for hardcore stat folk who know a bit about programming, not programmers who need to do a little basic computation.

  • by jeif1k ( 809151 ) on Wednesday September 22, 2004 @03:53PM (#10322065)
    Try "help('postscript')", "help('png')", and "help('jpeg')".

    Output to different graphics devices has been in S, Splus, and R for as long as I can remember (and that's a long time). Maybe you should try having a look at the copious documentation for R; the documentation, like the system itself, is free.
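
    For instance, something like this writes plots straight to files (made-up file names and data; standard base-R device functions, though on some older Unix builds the bitmap devices still render through X -- which is where the Xvfb trick mentioned elsewhere in this thread comes in -- while postscript() never touches the display):

        x <- rnorm(1000)                    # made-up example data

        postscript("hist.ps")               # vector output straight to a file
        hist(x)
        dev.off()

        png("hist.png", width = 600, height = 400)   # bitmap output
        hist(x)
        dev.off()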
  • by jeif1k ( 809151 ) on Wednesday September 22, 2004 @04:06PM (#10322198)
    Octave differs substantially from Matlab and lacks a lot of functionality (in particular, a lot of the toolboxes). Octave is used for teaching, but most people who do serious work in Matlab use the real thing.

    In contrast, R is very close to Splus and comes with an extensive array of statistical toolboxes. Many professional users use, and even prefer, R for their day-to-day work.

    If you are doing anything with statistics, graphs of real-world data, or bioinformatics, R is the package to use.

    If you are doing other kinds of numerical work, things are less clear. Matlab is widely used, but it is hugely expensive and the language is pretty limited. Octave is the obvious open source choice, but there aren't many packages for it, and Matlab software requires some amount of porting if you want to use it with Octave. Numerical Python is technically far better than either Matlab or Octave, and it has a lot of packages and features that neither offers, but it (obviously) isn't Matlab compatible, so you can't just load existing Matlab packages into it.
  • by Anonymous Coward on Wednesday September 22, 2004 @07:55PM (#10324756)
    It looks like a functional language (e.g. Lisp or Haskell) with an interactive environment that has access to lots of math/graphing/etc libraries. It seems to be pretty extensible.

    Octave is basically an open source version of Matlab. This R looks similar, just with a different programming language and different libraries.

    It looks like it's probably more powerful, but I don't know, since I haven't used R.

  • RTFM! (Score:5, Informative)

    by KjetilK ( 186133 ) <kjetil AT kjernsmo DOT net> on Wednesday September 22, 2004 @08:05PM (#10324815) Homepage Journal
    As usual, I strongly recommend RTFM: you don't need an X device to create bitmaps, but it is probably the simplest way to do it. With the modularity freedesktop is aiming for, this will probably become less and less of an issue, and that is probably the main reason why they haven't bothered rewriting it.

    R is really a beautiful language, for its purpose. It has a very nice correspondence between math and code, and for most parts of "hard" science, that's really important.

    You can easily write R code five times as compact as the equivalent MATLAB code, and still get more understandable code.
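
    As a small, made-up illustration of what I mean (data and names invented here), a least-squares fit plus diagnostics is a couple of lines, and the model formula reads almost like the math:

        x <- seq(0, 10, by = 0.1)             # made-up data
        y <- 2.5 * x + rnorm(length(x))       # linear trend plus noise

        fit <- lm(y ~ x)                      # fit y = a + b*x
        summary(fit)                          # coefficients, std. errors, R^2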

  • Re:what R isn't (Score:3, Informative)

    by KjetilK ( 186133 ) <kjetil AT kjernsmo DOT net> on Wednesday September 22, 2004 @08:20PM (#10324919) Homepage Journal
    Actually, I can't really agree... For those who do not intend to grok anything in their life, R is probably a bad choice... :-) But I think it is a good choice for everyone who does things where math is important and who thinks it is important that those things are grokked.

    However, you do not necessarily need to be into statistics to find R appealing. I'm an astrophysicist, and I wrote my whole thesis based on R. I started out with a bit of C, and I used some small Perl hacks to do some naive parallelizing, but I eventually phased out the C code and relied on R. I'd write thousands of lines of R code rather than go back to something like IDL (*shrug*).

    For programmers, it may be a bit hard to accept that you do not need for loops in R. For most purposes, think matrix and vector arithmetic instead. If you look at it from a math perspective, that makes a lot of sense. For the things where it doesn't make sense to think in terms of vector arithmetic, think in terms of applying functions to array elements instead.
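
    Something like this toy example (invented here) shows both idioms:

        v <- 1:10

        # instead of a loop like: for (i in 1:10) out[i] <- v[i]^2 + 1
        out <- v^2 + 1                             # plain vector arithmetic

        m <- matrix(rnorm(20), nrow = 4)           # made-up 4x5 matrix
        row_means <- apply(m, 1, mean)             # apply a function over rows
        squares   <- sapply(v, function(i) i^2)    # apply over elements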

    Also, R has some simple OO concepts. They do not aim to do everything OO does, but the things they do, they do very well (as opposed to IDL (*shrug*, *shrug*), where they attempt to do everything but do it badly). You need to exploit this to make your code pretty.

    I think these are the two main things that need to be overcome for most scientists to use R efficiently. I really fell in love with the language for these two reasons, and I'd recommend R for all scientists, not just statisticians.

  • by tabdelgawad ( 590061 ) on Wednesday September 22, 2004 @08:29PM (#10324985)
    You have three categories of choices, depending on your background and the type of data and analysis you want to conduct. I made up the category names, but I think they're reasonable:

    1) Matrix languages (e.g. Matlab, Gauss): These have C-like syntax with the basic data object being a matrix (so, internally, a scalar is a 1x1 matrix). These languages are the way to go if you want to write your own statistical/simulation algorithms. They do have extensive pre-written routines for many statistical tasks, but they're mainly for people who know that a regression coefficient vector is given by inv(X'X)X'y and aren't afraid to code that (there's a short R sketch of that one-liner after this list). The nice thing is that it would be a single line of code to do this computation. I believe GNU/Octave belongs to this category.

    2) Data languages (SAS, SPSS): The basic object here is a dataset with variables. Inverting a data matrix here is essentially a meaningless concept, and would be extremely difficult to do, but creating a new variable that sums sales for different people by division for certain months is straightforward (note that this is very difficult in a matrix language). Beyond trivial manipulations, you'd store code in procedures like any programming language.

    3) Menu-driven languages (e.g. EViews): The basic object is still a dataset with variables, but your primary method of manipulation is menu-driven. Want to run a regression? Just select your dependent and independent variables from dropdown lists and click a button.

    There's some overlap between 2 and 3: type-2 programs provide a rudimentary menu-driven system for those who don't want to code everything, and type-3 programs let you store some command-line instructions for future use.

    In terms of learning curves, they get progressively flatter (easier) from 1 to 2 to 3.

    Pick your poison!
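
    As for the one-liner in category 1, here is roughly what it looks like in R (toy data, names invented here; in real code you'd normally use lm() or solve(crossprod(X), crossprod(X, y)) rather than an explicit inverse):

        X <- cbind(1, rnorm(100), rnorm(100))          # made-up design matrix with intercept
        y <- X %*% c(1, 2, 3) + rnorm(100)             # made-up response

        beta_hat <- solve(t(X) %*% X) %*% t(X) %*% y   # inv(X'X) X'y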
  • by KjetilK ( 186133 ) <kjetil AT kjernsmo DOT net> on Wednesday September 22, 2004 @10:22PM (#10325623) Homepage Journal
    Well, it has been two years since I really used R, but I'm really in love with the system... I know IDL (*shrug*) a bit too, since the rest of my institute uses it. I just got itches from it all over, and dumped it... :-)

    Anyway, what it doesn't do as well as IDL (*shrug*) is visualization. Its graphing is limited to, well, graphs. Interactive analysis with funny widgets and stuff isn't R's selling point. Nor is R very well developed for image analysis and the like. I think they have multi-dimensional Fourier transforms now, but they didn't two years ago.

    IDL, OTOH, doesn't really do statistics at all. For example, it doesn't come with something as fundamental as QQ-plots. Believe it or not, every paper that relies on an assumption of normality should come with a QQ-plot... or at least have one behind it.
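
    For reference, in base R that's just two calls (toy data here):

        x <- rnorm(200)      # pretend these are your residuals
        qqnorm(x)            # sample quantiles against normal quantiles
        qqline(x)            # reference line through the quartiles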

    The syntax of IDL (*shrug*) is unbelievably nasty (*shrug*, aargh, sorry, couldn't resist). I hear they have done something about it now, but two years ago, IDL's concept of scoping was, at best, unclear. You could easily modify variables in other people's badly designed code without being aware of it. And then there were the COMMON blocks you often needed just to pass parameters...? I have a hard time understanding why people would actually use anything like IDL (*shrug*). R has very clear-cut lexical scoping of objects. You'd have to design your code veeery badly to fall into the traps IDL programmers fall into on a regular basis. I've seen IDL programmers who've been at it since the beginning go WTF over scoping... It was better being a lone R user than an IDL user with a lot of support...

    Also, IDL attempted to bolt on OO in version 5 (IIRC), but it is a mess. OO designers would be rolling in their graves over this. R, OTOH, has decided not to incorporate every OO concept, but the stuff they have done is very clean, very easy to understand, and perfectly sound.

    But the real point of R is to have a very clear mapping between code and mathematics. You code your math, and it is easy to see what happens. No iterating over array indices; it simply never happens. That's extremely appealing once you've got the hang of it.

    I once translated 70 lines of MATLAB code into 7 lines of R code, some interpolation stuff that didn't exist in R. I never finished it, because I found I didn't need it, but as a proof of concept it was great. And while the MATLAB code was pretty hard to grok, the R code was very straightforward: you could show it to anyone with basic training in math, and they would immediately see what it did. Try that with code from any of the others!

    I think the basic thing is that most numerical math for physics and astronomy is, right now, more advanced in IDL or MATLAB. If you do any kind of statistics, you should be going over to R. If you are willing to code, I'd argue that R is a platform so much better than IDL and MATLAB that you should be migrating your code starting now. I know I'd rather write thousands of lines of R code than go back to IDL (*shrug*)... :-)

    Then, you know, you can't inspect the code in the core of IDL or MATLAB. There are bound to be flaws in there, and they may never have mattered for any problem but yours.... I got hit by three bad bugs in R when I worked with it; I managed to narrow them down, and they were all corrected within hours. To me, this is extremely important. The implementation of the math should be available for review, just like a derivation of equations is.

  • by dovf ( 811000 ) on Thursday September 23, 2004 @05:09AM (#10327199)
    I don't really do much statistical work, but I've been looking into the various Matlab clones for my physics lab reports, and have come up with a few different options --- all free/open source --- which together provide a very good, free alternative to Matlab:
    Octave [octave.org]
    Octave is closest to Matlab in terms of source compatibility: you can (almost) take the m-files you wrote for Matlab and run them through Octave, and vice-versa. Octave has no GUI (it uses gnuplot for plotting); the programming language is very similar to Matlab's.
    Scilab [inria.fr]
    For some reason, Scilab doesn't seem to be as well-known as many of the other projects, but in my opinion it is one of the best Matlab clones. The latest version provides tools for translating m-files to Scilab's native format. Scilab uses a syntax which is slightly different from Matlab's, but in the same kind of style, and pretty easy to learn. It also has many toolboxes for various uses (check the contributions section on the site). Scilab does have a GUI, and some of the toolboxes provide further GUI enhancements.
    Grace [weizmann.ac.il]
    Grace is a graphing tool for 2D graphs, so it's not a general-purpose Matlab clone --- but for graphing, it's the best (I prefer it to Matlab's graphing capabilities!). As an important bonus, it provides many data-set transformations, such as interactive curve-fitting capabilities. It has a full GUI, but also provides a scripting language for non-interactive use as a backend for producing graphs.
    Maxima [sourceforge.net]
    This is a great tool for symbolic computations. It has no GUI, and the syntax is a little strange (it may be similar to LISP, in which it is written; I don't know LISP ;) ).

    Other tools which I have come across, but haven't really worked with: Axiom [nongnu.org] (symbolic computations, CAS); Scigraphica [sourceforge.net] (graphing); opendx [opendx.org] (data explorer + visualization).

    I've actually never really used R (by the time I came across it, I was done with my physics labs), so I can't really compare any of the others to it. But it definitely looks like one of the tools that I should add to my suite.

  • by Anonymous Coward on Thursday September 23, 2004 @05:35AM (#10327275)
    I had a similar problem and settled on Mathematica: despite the high cost and restrictive licence policy, it has great data manipulation features and is extremely useful for the odd bit of functional maths you might wish to do on the side. Matrix manipulation is as easy as

    v.Inverse[X].v

    It's mostly shell-based, but the shell includes pretty formulas and graphics, histograms are not too difficult to do, and EVERYTHING in Mathematica is a data structure which can be read and manipulated by the shell, including bitmaps, sound, and even a Mathematica notebook, of which the manual is one! You can also get a package to link it to Excel if you really want to.

    Go to wolfram.com and order a trial version. That's what I did and I quickly came to the conclusion it was worth paying the full price.

    Others around me (for the purposes of this discussion, we're quantitative analysts at a financial company) use S+. It is like R (like C), but it has an Excel-like front-end which might suit you better. I'm told it's similarly expensive, though.

    Another option I considered was Python, with the freeware statistical and graphics packages available for the language, but Mathematica beat the pants off that combination for usability.

"An organization dries up if you don't challenge it with growth." -- Mark Shepherd, former President and CEO of Texas Instruments

Working...