
Statistical Programming With R

Posted by timothy
from the no-r-in-statistics dept.
An anonymous reader writes "This series introduces you to R, a rich statistical environment, released as free software. It includes a programming language, an interactive shell, and extensive graphing capability. What's more, R comes with a spectacular collection of functions for mathematical and statistical manipulations -- with still more capabilities available in optional packages."

  • Good-oh... (Score:3, Interesting)

    by BrokenHalo (565198) on Wednesday September 22, 2004 @12:37PM (#10319656)
    The *nix world could do with a statistics package as comprehensive and easy to use as SPSS or PASS, but that seems to be a holy grail of sorts.

    I've heard good things about R, but have never really got to grips with it (although I know it has been around for a while), so any kind of primer is more than welcome as far as I'm concerned.

    • SPSS is garbage (Score:3, Informative)

      by BoomerSooner (308737)
      SAS, Minitab, hell anything is better than SPSS. Now to include R.
    • Re:Good-oh... (Score:3, Informative)

      by RealAlaskan (576404)
      R is like SPLUS. If you _must_ have SPSS, look into PSPP. [gnu.org]
    • I use JMP-IN; it's a GUI stats package which does all sorts of stuff. It's made by the folks over at SAS, who are well known for their industrial-strength stats packages. JMP covers everything from Student's t-tests, ANOVA, and Kaplan-Meier survival curves through non-parametric analysis and more. Very easy to use, with a lot of functionality, all with an easy-to-use GUI and nice-looking graphical output. Accepts Excel spreadsheets and more. I got it with an academic discount for about $70 US. I use it for
  • by YetAnotherName (168064) on Wednesday September 22, 2004 @01:18PM (#10320136) Homepage
    From the "What is R?" page:

    R is a language and environment for statistical computing and graphics. It is a GNU project which is similar to the S language

    So, R came from S; that must mean that R++ is coming up next! :-)
    • I think it would have to be R--, since R comes before S, not after. -Dan
    • "R is a language and environment for statistical computing and graphics. It is a GNU project which is similar to the S language"

      So, R came from S; that must mean that R++ is coming up next! :-)


      No, that's just a bullshit answer so they don't have to admit "R" really stands for 'rithmetic
  • Graphing, hah! (Score:5, Interesting)

    by Teancom (13486) <david&gnuconsulting,com> on Wednesday September 22, 2004 @01:37PM (#10320333) Homepage
    We have a few people using R around here, mainly in the backend of CGIs to produce graphs of various things. The main problem? If you want to output to a JPEG or PNG (like, to display the result in a webpage), R has to create a window in X, draw onto the window, and then take a snapshot of the window. What this means on a headless Sun machine? We get to run a virtual X server solely for our R CGIs. Bloody hell, it's a stupid implementation of a crappy language.

    </cranky old man>
    • Java was the same for server-side image generation until JDK1.4 (adding -Djava.awt.headless=true iirc is the fix). Stupid? Totally. But sadly not rare...
    • Re:Graphing, hah! (Score:3, Informative)

      by StarWynd (751816)
      This doesn't make it any better, but R isn't the only language to have such weirdness. Here at work, we use IDL (Interactive Data Language) for our statistics and plotting, and while there are ways to circumvent the DISPLAY problem, most of the time it's easiest to set up Xvfb and let IDL use that as the DISPLAY.

      I don't know how R came into being, but IDL was originally designed as an ad hoc statistics and plotting tool. Because everyone was using it as an ad hoc tool, there was an assumption that everyo
    • Try "help('postscript')", "help('png')", and "help('jpeg')".

      Output to different graphics devices has been in S, Splus, and R for as long as I can remember (and that's a long time). Maybe you should try having a look at the copious documentation for R; the documentation, like the system itself, is free.
    • RTFM! (Score:5, Informative)

      by KjetilK (186133) <kjetil@kjer[ ]o.net ['nsm' in gap]> on Wednesday September 22, 2004 @08:05PM (#10324815) Homepage Journal
      As usual I strongly recommend RTFM: you don't need an X device to create bitmaps, but it is probably the simplest way to do it. With the modularity freedesktop is aiming for, this will probably become less and less of an argument, which is probably the main reason why they haven't bothered rewriting it.

      R is really a beautiful language, for its purpose. It has a very nice correspondence with math and code, and for most parts of "hard" science, that's really important.

      Compared to MATLAB, you can easily write R code 5 times as compact as MATLAB code, and still get more understandable code.

      • As usual I strongly recommend RTFM,

        I'm sorry, you don't get to use witticisms like "RTFM" while defending something as fundamentally idiotic as:
        you don't need an X device to create bitmaps, but it is
        probably the simplest way to do it.


        The mind shudders at what the less "simple" ways of doing it must look like.
        • Well, if you think that things like:
          png('test.png')
          plot(your, graphics)
          dev.off()
          is hard, there is very little anybody can do for you.
    • No it doesn't. The functions "png" and "jpeg" do indeed depend on X11 being started, but the function "bitmap", which can produce PNGs and JPEGs, doesn't. It relies on Ghostscript instead and works perfectly on headless servers.

      Do help("bitmap") for all the details.

      Do try to read the man page till the end next time, or ask a question to the dev team. Both the jpeg and png man page mention the function bitmap as a solution to the problem you are having.

      All the best.
      • Well, I don't do anything with R, I was just the sysad for the web server, doing what the devs told me I needed to do. I will, however, pass that info along to the current sysad for that machine, along with the programmers. Thanks for the heads up!
  • It's time I weaned myself off Excel, particularly since the other day I couldn't even create a histogram because my dataset has more than ~64,000 data points, which is apparently Excel's limit. Does anyone in the community know of a good replacement for Excel that scales well to many data points but also has some sort of user interface so that I can do some visual manipulations if I want to? I understand that most of these packages come with their own interactive shell and languages, but I would like to have
    • by Anonymous Coward
      My personal choice is Mathematica. A study done four years ago used a dataset with known results to check a number of popular statistics packages. Mathematica was the only package that got everything correct:

      http://www.wolfram.com/news/statistics.html [wolfram.com]
    • by Anonymous Coward
      It might not be exactly what you want (i.e., it's a bit unfriendly and rough around the edges) but "xgobi" (for X; I think they're working on "jgobi" for java) by Bell Labs is a very powerful and free-ish data visualization software.

      No data entry facilities, but it handles multi-dim. visualisation very well and has a handful of convenient methods (correlation analysis, PCA, histogram) built in.

      I've come to realise that Excel's data vis. is almost totally a joke, and that its value for data entry is almos
    • by tabdelgawad (590061) on Wednesday September 22, 2004 @08:29PM (#10324985) Homepage
      You have three categories of choices, depending on your background and the type of data and analysis you want to conduct. I made up the category names, but I think they're reasonable:

      1) Matrix languages (e.g. Matlab, Gauss): These have C-like syntax with the basic data object being an n×m matrix (so, internally, a scalar is a 1x1 matrix). These languages are the way to go if you want to write your own statistical/simulation algorithms. They do have extensive pre-written routines for many statistical tasks, but they're mainly for people who know that a regression coefficient vector is given by inv(X'X)X'y and aren't afraid to code that. Nice thing is that it would be a single line of code to do this computation. I believe GNU Octave belongs to this category.

      2) Data languages (SAS, SPSS): The basic object here is a dataset with variables. Inverting a data matrix here is essentially a meaningless concept, and would be extremely difficult to do, but creating a new variable that sums sales for different people by division for certain months is straightforward (note that this is very difficult in a matrix language). Beyond trivial manipulations, you'd store code in procedures like any programming language.

      3) Menu-Driven languages (e.g. EViews): The basic object is still a dataset with variables, but your primary method of manipulation is menu-driven. Want to run a regression? Just select your dependent and independent variables from dropdown lists and click .

      There's some area of overlap between 2 and 3. 2-type programs provide a rudimentary menu-driven system for those who don't want to code everything, and 3-type languages will allow you to store some command line instructions for future use.

      In terms of learning curves, they get progressively flatter (easier) from 1 to 2 to 3.

      Pick your poison!
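      For anyone who hasn't met it, the inv(X'X)X'y formula from category 1 can be spelled out in a few lines. What follows is only a rough sketch in plain Python rather than Matlab or Gauss, and the helper names (transpose, matmul, solve, ols) are made up for illustration. It solves (X'X)b = X'y by Gaussian elimination instead of forming the inverse explicitly, which gives the same coefficients for a well-conditioned X.

```python
def transpose(m):
    """Swap rows and columns of a list-of-lists matrix."""
    return [list(col) for col in zip(*m)]


def matmul(a, b):
    """Plain matrix product of two list-of-lists matrices."""
    bt = transpose(b)
    return [[sum(x * y for x, y in zip(row, col)) for col in bt] for row in a]


def solve(a, b):
    """Solve a @ x = b for square a, using Gaussian elimination with partial pivoting."""
    n = len(a)
    aug = [list(map(float, a[i])) + [float(b[i])] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("X'X is singular (collinear columns?)")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (aug[i][n] - s) / aug[i][i]
    return x


def ols(X, y):
    """Least-squares coefficients b satisfying (X'X) b = X'y."""
    Xt = transpose(X)
    XtX = matmul(Xt, X)
    Xty = [sum(xi * yi for xi, yi in zip(row, y)) for row in Xt]
    return solve(XtX, Xty)


# Toy data lying exactly on y = 1 + 2x; the first column of X is the intercept.
X = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
y = [1.0, 3.0, 5.0, 7.0]
beta = ols(X, y)  # [intercept, slope]
```

      In a matrix language all of this collapses into the single line described above, which is rather the point of category 1.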
      • "2) Data languages (SAS, SPSS): The basic object here is a dataset with variables. Inverting a data matrix here is essentially a meaningless concept, and would be extremely difficult to do"

        Not sure if this meets your definition, but I've been using SAS for boocoo years and can tell you that it has a "TRANSPOSE" facility explicitly for making columns into rows & vice versa.

    • by Anonymous Coward
      I had a similar problem and settled on Mathematica. Despite the high cost and restrictive licence policy, it has great data manipulation features and is extremely useful for the odd bit of functional maths you might wish to do on the side. Matrix manipulation is as easy as

      v.Inverse[X].v

      It's mostly shell-based, but the shell includes pretty formulas and graphics, histograms are not too difficult to do, and EVERYTHING in Mathematica is a data structure which can be read and manipulated by the shell, includi
  • And Of Course... (Score:4, Informative)

    by pnatural (59329) on Wednesday September 22, 2004 @02:17PM (#10320856)
    There are Python bindings. They are here [sourceforge.net]. Enjoy!
  • but not at all easy to learn. It's not that the programming is hard (although it is, it is a functional language which takes a while to get your head round) - but the documentation is aimed at fairly high level stats boffins.

    But... ANYTHING is better than SPSS.

    dave
  • by Brandybuck (704397) on Wednesday September 22, 2004 @03:21PM (#10321697) Homepage Journal
    What's more, R comes with a spectacular collection of functions for mathematical and statistical manipulations...

    I can see that this package will be quite popular with political campaign managers.
  • Does anyone know where R really fits into the grand scheme of things? The only other language mentioned in the R FAQ was S... there were no comparisons with Octave or any commercial products like Matlab or IDL. So, what does R really do for you besides being another analysis and visualization project?
    • by jeif1k (809151) on Wednesday September 22, 2004 @04:06PM (#10322198)
      Octave differs substantially from Matlab and lacks a lot of functionality (in particular, a lot of the toolboxes). Octave is used for teaching, but most people who do serious work in Matlab use the real thing.

      In contrast, R is very close to Splus and comes with an extensive array of statistical toolboxes. Many professional users use, and even prefer, R for their day-to-day work.

      If you are doing anything with statistics, graphs of real-world data, or bioinformatics, R is the package to use.

      If you are doing other kinds of numerical work, things are less clear. Matlab is widely used, but it is hugely expensive and the language is pretty limited. Octave is the obvious open source choice, but there aren't many packages for it, and Matlab software requires some amount of porting if you want to use it with Octave. Numerical Python is technically far better than either Matlab or Octave, and it has a lot of packages and features that neither offers, but it (obviously) isn't Matlab compatible, so you can't just load existing Matlab packages into it.
    • by KjetilK (186133) <kjetil@kjer[ ]o.net ['nsm' in gap]> on Wednesday September 22, 2004 @10:22PM (#10325623) Homepage Journal
      Well, it has been two years since I really used R, but I'm really in love with the system... I know IDL (*shrug*) a bit too, since the rest of my Institute uses it. I just got itches from it all over, and dumped it... :-)

      Anyway, what it doesn't do as well as IDL (*shrug*) is visualization. Its graphing is limited to, well, graphs. Interactive analysis with funny widgets and stuff isn't R's selling point. Nor is R very well developed for image analysis and stuff like that. I think they have multi-D Fourier transforms now, but they didn't two years ago.

      IDL, OTOH, doesn't really do statistics at all. For example, it doesn't come with something as fundamental as QQ-plots. Believe it or not, but every paper that comes with an assumption of normality should come with a QQ-plot... Or at least the authors should have done one.
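      For readers unsure what goes into a QQ-plot: in R it is a call to qqnorm(). The sketch below shows the underlying idea in plain Python using only the standard library (Python 3.8 or later for statistics.NormalDist). The function name normal_qq_points and the sample data are made up for illustration, and the plotting positions (i - 0.5)/n are just one common convention, not necessarily the one any given package uses.

```python
from statistics import NormalDist, mean, stdev


def normal_qq_points(sample):
    """Return (theoretical, observed) quantile pairs for a normal QQ-plot."""
    ordered = sorted(sample)
    n = len(ordered)
    # Reference normal distribution fitted to the sample mean and std. deviation.
    reference = NormalDist(mean(ordered), stdev(ordered))
    # Plotting positions (i - 0.5) / n: one common convention among several.
    return [
        (reference.inv_cdf((i - 0.5) / n), value)
        for i, value in enumerate(ordered, start=1)
    ]


# Hypothetical measurements, purely for illustration.
data = [4.9, 5.1, 5.0, 4.7, 5.3, 5.2, 4.8, 5.0, 5.4, 4.6]
points = normal_qq_points(data)
```

      Plot the pairs against each other: if the sample is roughly normal they fall close to the line y = x, and systematic curvature away from that line is the warning sign.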

      The syntax of IDL (*shrug*) is unbelievably nasty (*shrug*, aargh, sorry, couldn't resist). I heard they have done something about it now, but two years ago, IDL's concept of scoping was at best, uhm, well, unclear. You could easily modify variables in other people's badly designed code without being aware of it. Then, the COMMON blocks you often needed to pass parameters...? I have a hard time understanding why people would actually use anything like IDL (*shrug*). R has a very clearcut lexical scoping of objects. You've got to really design your code veeery badly to fall in the same traps IDL programmers fall in on a regular basis. I've seen IDL programmers who've been in it since the beginning go WTF over scoping... It was better being a lone R user than an IDL user with a lot of support...

      Also IDL attempted to get in OO in version 5 (IIRC), but it is a mess. OO designers would be rolling in their graves over this. R, OTOH, has decided not to incorporate all OO concepts, but the stuff they have done, is very clean, very easy to understand, and perfectly sound.

      But the real point of R is to have very clear mapping between code and mathematics. You code your math, it is so easy to see what happens. No iterating over array indices, it simply never happens. That's extremely appealing once you've got the hang of it.

      I once translated 70 lines of MATLAB code to 7 lines of R code, some interpolation stuff that didn't exist in R. Never finished it though, because I found I didn't need it, but as a proof of concept it was great. And while MATLAB code was pretty hard to grok, the R code was very straightforward, you could just show it to anyone with basic training in math, and they would immediately see what it did. Try that with code from any of the others!

      I think that the basic thing is that most numerical math for physics and astronomy is right now more advanced in IDL or MATLAB. If you do any kind of statistics, you should be going over to R. If you are willing to code, I'd argue that R is a platform so much better than IDL and MATLAB, you should be migrating your code starting now. I know I'd be writing thousands of lines of R code rather than going back to IDL (*shrug*)... :-)

      Then, you know, you can't inspect the code in the core of IDL or MATLAB. There are likely to be flaws in there, and they may not have mattered for any problems other than yours... I got hit with three bad bugs in R when I worked with it; I managed to narrow them down, and they were all corrected within hours. To me, this is extremely important. The implementation of math should be available for review just like a derivation of equations is.

  • what R isn't (Score:5, Informative)

    by bahamutirc (648840) on Wednesday September 22, 2004 @03:44PM (#10321972) Homepage

    For people who have never taken real stat classes in college (or never learned it on their own) R will seem like a useless language. Most other languages can handle basic statistics computations.

    Statistics is a whole lot more than means and averages. When I took my first real stat class, everything I knew about statistics was literally covered on the first half of the first page. I was totally blown away by what you could do with statistics.

    R is for hardcore stat folk who know a bit about programming, not programmers who need to do a little basic computation.

    • I think that for most folks what we need is better means of helping them get the data together that they want to look at. One guy did this at an interesting site [laboratory...states.com].
      The other thing is that folks need a better way of handling relations and statistical functions. Right now, you need to learn a _lot_ to do stuff that shouldn't be that hard. It's almost like folks wanted to make sure that any project of this nature would need a DBA _and_ a script developer (or team) _and_ a statistician to get work done. That really
    • Re:what R isn't (Score:3, Informative)

      by KjetilK (186133)
      Actually, I can't really agree... For those who do not intend to grok anything in their life, R is probably a bad choice... :-) But I think it is a good choice for everyone who does things where math is important and who sees it as important that things are to be grokked.

      However, you do not necessarily need to be into statistics to find R appealing. I'm an astrophysicist, and I wrote my whole thesis based on R. I started out with a bit of C, and I used some small Perl hacks to do some naive parallelizing,

  • It's difficult to evaluate all the various statistical packages. I would love to read some sort of comparison among the various packages, both commercial and free. For instance, which packages have some sort of GUI? What types of programming languages does each one use? How does each one scale? Is there a particular feature that separates a particular package? Anyone?
    • I don't really do much statistical work, but I've been looking into the various Matlab clones for my physics lab reports, and have come up with a few different options --- all free/opensource --- which as a suite provide a very good, free, alternative to Matlab:
      Octave [octave.org]
      Octave is closest to Matlab in terms of source compatibility: you can (almost) take the m-files you wrote for Matlab and run them through Octave, and vice-versa. Octave has no GUI (it uses gnuplot for plotting); the programming language
  • by j1m+5n0w (749199) on Wednesday September 22, 2004 @04:17PM (#10322317) Homepage Journal

    Does anyone have any insight on how this differs from octave [octave.org]?

    This is the first I've heard of R, but I've tried using octave a few times. It seems to be a sort of enhanced gnuplot. I was thinking about using it for a project I'm working on, though I may just stick with good ol' C for performance.

    Do any of these projects work well with sparse matrices? I'm interested in using them to run a pagerank [wikipedia.org]-like computation, but not if they use n^2 memory.

    -jim
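    On the n^2 memory worry: a pagerank-like power iteration never needs the full matrix, only the list of links, so memory grows with the number of edges. Here is a rough sketch of that idea in plain Python, not tied to R or octave. The function name pagerank, the damping value of 0.85, the fixed iteration count, and the toy graph are all assumptions chosen for illustration.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power iteration over an adjacency dict {node: [nodes it links to]}.

    Only the edges are stored, so memory grows with the number of links
    rather than with the square of the number of nodes.
    """
    nodes = set(links)
    for targets in links.values():
        nodes.update(targets)
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Rank held by nodes with no outgoing links is spread evenly.
        dangling = sum(rank[node] for node in nodes if not links.get(node))
        base = (1.0 - damping) / n + damping * dangling / n
        new_rank = {node: base for node in nodes}
        for source, targets in links.items():
            if targets:
                share = damping * rank[source] / len(targets)
                for target in targets:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Toy graph: "c" collects links from every other node, "d" has no inbound links.
links = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(links)
```

    The ranks sum to 1 after every pass, and on this toy graph "c" ends up highest and "d" lowest. For a large graph in C the same structure would map onto compressed row or column arrays.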

    • by Anonymous Coward
      It looks like a functional language (e.g. Lisp or Haskell) with an interactive environment that has access to lots of math/graphing/etc libraries. It seems to be pretty extensible.

      Octave is basically an open source version of Matlab. This R looks similar just with a different programming language and different libraries.

      It looks like it's probably more powerful but I don't know since I haven't used R.

    • See my other comment about differences with IDL and MATLAB. However, I think you should be going with C or FORTRAN if memory and speed are important to you. R has no advanced concept of typing, at least it didn't when I last looked carefully. Everything is double-precision floats, and while you have some control over it, if this is important, don't...

      However, the C and FORTRAN bindings in R are excellent. So, if you're doing statistics on the stuff you find, you might want to look at doing the high-performan

  • There used to be a language called Minitab that I used in my second statistics course years yonder. I wonder what happened to it. It sorely needed a GUI for table browsing.
