SAS vs. R vs. Python: Which Tool Do Analytics Pros Prefer?

Bev D

Heretical Statistician
Leader
Super Moderator
I'll wade in here a bit, as my organization is also struggling with this 'big data thing'. We use SAS and R for predictive modeling: SAS for the end product and R for some initial screening. Python is also being used, primarily by the non-statisticians, for data manipulation and preparation for SAS or R.

Here's my issue: too many people who know nothing about data are analyzing it. Using observational data is a huge risk if you don't understand the biasing and confounding that exist. This issue has been around since the beginning of the analysis of observational data in the social sciences and medical 'investigations'. It is why it took so long to 'prove' that smoking causes cancer, or that coffee is bad for you, or good for you, or whatever the latest 'study' says. We are now applying this same lack of knowledge to larger and larger data sets. Some analysts are very well skilled and knowledgeable and others are hacks. The consuming public has no way to know which is which...
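
Here's a quick R sketch of what I mean by confounding - entirely made-up data, just to show how an unmeasured variable can manufacture a 'finding' out of nothing:

```r
# Made-up simulation: a lurking variable creates a 'significant'
# association between an exposure and an outcome that are actually unrelated.
set.seed(42)
n <- 10000
z <- rnorm(n)               # unmeasured confounder (e.g., lifestyle)
x <- 0.8 * z + rnorm(n)     # 'exposure', driven partly by z
y <- 0.8 * z + rnorm(n)     # 'outcome', driven partly by z but NOT by x

# Naive analysis: x looks strongly 'significant'
summary(lm(y ~ x))$coefficients["x", ]

# Adjusting for the confounder: the 'effect' of x disappears
summary(lm(y ~ x + z))$coefficients["x", ]
```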

The blog itself is an example of poor analysis of data: it is simply pie chart after pie chart (some pies disguised as doughnuts or bars). The charts show only proportions within a single category, not actual quantities or what the software is actually being used for, and they ignore the fact that popularity is not the same thing as usefulness. :nope:
 

Miner

Forum Moderator
Leader
Admin
I will second Bev's comments. I read an article a few years ago that summed up the problem nicely. In short, it said that years ago, data were scarce, so there was only one way to analyze the data (e.g., 2 groups of unpaired sample data = 2 sample t-test). You could botch a few things, but the selection of the analysis was not typically one of those things.
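
In R terms, that classic case is a one-liner, and there is very little analysis selection to get wrong (made-up numbers, just for illustration):

```r
# Two groups of unpaired sample data -> two-sample t-test
set.seed(1)
group_a <- rnorm(15, mean = 10, sd = 2)   # made-up measurements
group_b <- rnorm(15, mean = 12, sd = 2)
t.test(group_a, group_b)    # Welch two-sample t-test (R's default)
```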

In the big data world, there are a multitude of ways to potentially analyze the data. While some are obviously wrong to the trained analyst, they are still used by the untrained analyst. Others may even seem correct to the trained analyst, but are incorrect - not because they are unsuited to the data, but because they answer a different question than the one asked (e.g., ANOVA vs. ANOM; see the sketch below).

In other words, there are a lot more ways to make a mistake with big data.
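
To make the ANOVA vs. ANOM distinction concrete, here is a rough R sketch with invented data. The ANOM part is shown informally as deviations from the grand mean; a real ANOM analysis adds decision limits around it:

```r
# Invented data: the same measurement from three machines
set.seed(7)
dat <- data.frame(
  machine = rep(c("A", "B", "C"), each = 20),
  output  = c(rnorm(20, 50), rnorm(20, 52), rnorm(20, 51))
)

# ANOVA asks: do ANY of the machine means differ from each other?
summary(aov(output ~ machine, data = dat))

# ANOM asks a different question: which machines differ from the GRAND mean?
tapply(dat$output, dat$machine, mean) - mean(dat$output)
```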
 

Mark Meer

Trusted Information Resource
Bev D said:
Here's my issue: too many people who know nothing about data are analyzing it. Using observational data is a huge risk if you don't understand the biasing and confounding that exist. This issue has been around since the beginning of the analysis of observational data in the social sciences and medical 'investigations'. It is why it took so long to 'prove' that smoking causes cancer, or that coffee is bad for you, or good for you, or whatever the latest 'study' says. We are now applying this same lack of knowledge to larger and larger data sets. Some analysts are very well skilled and knowledgeable and others are hacks. The consuming public has no way to know which is which...

Interesting discussion...

Is this becoming an increasing problem in the sciences? I have no doubt it is - as in your examples of consumption studies, as well as, I'm certain, a litany of studies in the social sciences.

So the question is: why is the peer-review process not sufficient? Too few knowledgeable statisticians involved? ...perhaps... but I suspect that a lot of it has to do with a lack of consensus as to the appropriate way to analyze big data, rather than a lack of knowledge...

MM
 

Bev D

Heretical Statistician
Leader
Super Moderator
What peer review process? Unlike many of the past observational studies I cited, there really is no peer review process for Big Data. Much of it is 'black box', where one can't actually see what the analysis is doing - or not doing. And in those older observational studies - where blind acceptance of p<.05 increasingly became the norm over the last two-plus decades, and where many of the studies were conducted and peer reviewed by researchers rather than statisticians - the statisticians were at least aware of the dangers of biased and confounded data sets. Today's crop of big data scientists are even more unaware of these dangers.

What statisticians? Data Scientists aren't statisticians. IT groups are selling black-box analytical engines with the seductive promise of perfect results regardless of the data - they believe that sheer size will overcome any biasing or confounding...
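
It won't. A quick made-up sketch, reusing the confounded setup from my earlier example: with biased data, more data only narrows the error bars around the wrong answer.

```r
# Made-up sketch: sample size shrinks the noise, not the bias
set.seed(99)
naive_effect <- function(n) {
  z <- rnorm(n)               # same confounded setup as my earlier example
  x <- 0.8 * z + rnorm(n)
  y <- 0.8 * z + rnorm(n)     # y does not depend on x at all
  coef(lm(y ~ x))["x"]        # naive estimate of the 'effect' of x
}
sapply(c(1e2, 1e4, 1e6), naive_effect)
# The spurious coefficient settles near 0.39 rather than 0:
# more data just buys a more precise wrong answer.
```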

Of course there is some good, statistically sound scientific research going on here and there, but in the main, today's big data work is even farther afield from solid statistics.
 

Mark Meer

Trusted Information Resource
Ah, I see. I apologize... following your mention of consumption studies, I was leading the thread astray from the original topic of software into the realm of scientific studies in general...
I'm admittedly no expert in the field...I just find the discussion interesting.

To continue: Let me know if I've got this correct...

There are a (small) number of tech companies cornering the market on analysis software, and these packages are being used in both industry and scientific research without the users knowing how the analysis is actually implemented?

If so, that is troubling indeed...

Knowing little about how these tools work (forgive me), can I ask: to deal with the "black-box" problem, can the results not be cross-referenced against outputs from other software? In other words, can one not use multiple software packages and verify that they all produce approximately the same results given the same inputs?
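
For instance, something like this - a minimal sketch of the idea, staying inside R for convenience, though a real cross-check would push the same data through separate packages (say, SAS and R) and compare:

```r
# Minimal sketch of the cross-check idea: same inputs, two independent
# implementations, compare outputs. (A real check would span separate
# software packages, e.g. SAS vs. R, not two routines in one language.)
set.seed(3)
x <- rnorm(50)
y <- 2 + 3 * x + rnorm(50)

fit1 <- coef(lm(y ~ x))             # built-in least squares
X <- cbind(1, x)                    # hand-rolled normal equations
fit2 <- solve(t(X) %*% X, t(X) %*% y)

all.equal(unname(fit1), as.numeric(fit2), tolerance = 1e-8)  # should be TRUE
```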
 