New Method of Tracking UIP Hits?
smurray writes "iMediaConnection has an interesting article on a new approach to web analysis. The author claims that he is describing 'new, cutting edge methodologies for identifying people, methodologies that -- at this point -- no web analytics product supports.' What's more interesting, the new technology doesn't seem to be privacy intrusive." Many companies seem unhappy with the accepted norms of tracking UIP results. Another approach to solving this problem was also previously covered on Slashdot.
Step 4... (Score:5, Insightful)
If my company had computers in New York and Tokyo, I could ssh between them in much less than 60 minutes...
What's the new definition of privacy these days? (Score:4, Insightful)
I'm not sure what the Flash part is about, but to me, scanning all the cookies your computer has ever had IS privacy intrusive.
Paradigm shift? (Score:3, Insightful)
No kidding. This guy probably needs a wake up call.
We know some IP addresses cannot be shared by one person. These are the ones that would require a person to move faster than possible. If we have one IP address in New York, then one in Tokyo 60 minutes later, we know it can't be the same person because you can't get from New York to Tokyo in one hour.
OK, so this is what's normally called a really stupid argument. I don't say it can't be accounted for, but stating it flat out like that is nothing more than plain stupidity. Has this guy ever heard of that Internet thing?
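For what it's worth, the "impossible travel" test is trivial to state in code, which is exactly why it's no breakthrough, and why ssh and proxies defeat it. A rough sketch (the haversine distance, the example coordinates, and the 1000 km/h airliner ceiling are my own assumptions, not anything from TFA):

```python
from math import radians, sin, cos, asin, sqrt

def km_between(lat1, lon1, lat2, lon2):
    """Great-circle distance in km (haversine formula)."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    h = sin((lat2 - lat1) / 2) ** 2 \
        + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371 * asin(sqrt(h))

def could_be_same_person(loc_a, loc_b, hours_apart, max_kmh=1000):
    """False only if covering the distance would outrun an airliner."""
    return km_between(*loc_a, *loc_b) <= max_kmh * hours_apart

new_york, tokyo = (40.7, -74.0), (35.7, 139.7)
print(could_be_same_person(new_york, tokyo, 1))  # False: ~10,800 km in one hour
```

Of course, the moment the "traveler" is an ssh session, both IPs geolocate to wherever the machines sit, not to the person, and the test proves nothing.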
Flash can report to the system all the cookies a machine has held.
Uhmm, not a great argument to make people use it.
No one wants to know.
I don't think they don't want to know. They just don't want to see a sudden drop of ~50% of their user count from one day to the next. And it really doesn't matter whether it's the truth or not. A drop is a drop.
crap again. (Score:4, Insightful)
" We know some IP addresses cannot be shared by one person. These are the ones that would require a person to move faster than possible. If we have one IP address in New York, then one in Tokyo 60 minutes later, we know it can't be the same person because you can't get from New York to Tokyo in one hour."
Ever heard of ssh and similar tools that make that "travel" trivial?
And they put this on slashdot. Ignorance, just pure ignorance...
Still doesn't help deleted cookies (Score:5, Insightful)
In the end there is no way they can even mostly recognize repeat web site visitors if the VISITOR DOESN'T WANT THEM TO.
The big problem is stated at the top of the article:
"We need to identify unique users on the web. It's fundamental. We need to know how many people visit, what they read, for how long, how often they return, and at what frequency. These are the 'atoms' of our metrics. Without this knowledge we really can't do much."
If knowing who unique users are is that important they need to create a reason for the user to correctly identify themselves. Some form of incentive that makes it worth giving up an identification for.
Tragically flawed (Score:5, Insightful)
That is of course complete nonsense. Let's say we accept the author's assertion that different studies have given cookie deletion rates across that range. I can accept that a significant number of users might delete cookies at some point, but what percentage of normal, non-geek, non-tinfoil-hat-wearing users are deleting cookies between page requests to a single site in a single session? If it is 30%, then I will eat my hat.
Most cookie deletion among the general populace is done automatically by anti-spyware software, and it is not done in real time.
The author clearly knows that even the most primitive of tools also use other metrics to group page requests into sessions, so even if 30% of users were deleting cookies, it would not result in a 30% inaccuracy.
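For reference, the "other metrics" even primitive tools use amount to something like this toy sessionizer -- same IP and user agent, with a gap cutoff between hits (the 30-minute timeout and the tuple format are my assumptions, not anything from TFA):

```python
from datetime import datetime, timedelta

# Group raw requests into sessions: same (IP, user agent) pair, and no
# gap longer than 30 minutes between consecutive hits.
TIMEOUT = timedelta(minutes=30)

def sessionize(requests):
    """requests: iterable of (timestamp, ip, user_agent) tuples, any order."""
    sessions = {}  # (ip, ua) -> list of sessions, each a list of timestamps
    for ts, ip, ua in sorted(requests):
        buckets = sessions.setdefault((ip, ua), [])
        if buckets and ts - buckets[-1][-1] <= TIMEOUT:
            buckets[-1].append(ts)   # continues the current session
        else:
            buckets.append([ts])     # starts a new session
    return sessions

hits = [
    (datetime(2006, 1, 1, 10, 0), "1.2.3.4", "Mozilla"),
    (datetime(2006, 1, 1, 10, 10), "1.2.3.4", "Mozilla"),
    (datetime(2006, 1, 1, 14, 0), "1.2.3.4", "Mozilla"),  # >30 min gap
]
print(sum(len(s) for s in sessionize(hits).values()))  # 2 sessions, one visitor
```

A mid-session cookie deletion changes nothing here, which is why "30% of users delete cookies" does not translate into "30% inaccuracy".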
Of course "researchers propose a more complex heuristic that looks to be slightly more accurate than current practice" does not make as good a story as "paradigm shift" blah blah "blows out of the water" blah blah "We've been off by at least 30 percent, maybe more." blah blah.
Re:I'm glad it isn't Rocket Science (Score:3, Insightful)
I don't mean to be a poo poo here, but this isn't as huge a deal as the author has made it sound (i.e. it certainly is not a "paradigm shift").
Instead, what we have here is an evolutionary suggestion in how we can track users more accurately. Kudos.
As with all solutions in CS, there are problems. As the parent correctly observed, this doesn't solve the "multiple browsers, same user" problem (which is common -- you probably use a different computer at work than at home). I'm not certain, but realistically this process may only solve the "this is the same browser" problem -- many users simply leave their credentials in place (i.e. logged in -- say to
Um, nope. Can't happen. (Score:5, Insightful)
There's only so much you can do to track users.
IP address, user agent, some JavaScript tricks for cookieless tracking... those are the only real "unique" identifiers for any one visitor. It stops there.
Of course, using exploits in Flash doesn't count, but supposedly this new method is "not intrusive."
I call BS because it simply can't happen.
If a user doesn't wanna be tracked, they won't be tracked. This story is just press, free advertisement, and hype for this particular company.
Paradigm shift?!? (Score:5, Insightful)
Re:CPUID (Score:4, Insightful)
Not really. I surf the internet at home and at school. I imagine I'm not alone. So I would be registered as two different people.
Indeed, but generally I would say that 1 person = 1 CPU, apart from shared CPUs such as in schools, web cafes and such.
You forgot "pretty much anyone who doesn't live alone and has a computer with internet access at home." Let's not forget that tiny percentage of people (I know, most slashdotters visit slashdot while avoiding work, but there are people out there with families where more than one person uses a single computer. It's crazy, I know).
Cutting edge? ha! (Score:3, Insightful)
"If the same cookie is present on multiple visits, it's the same person. We next sort our visits by cookie ID"
Only after that do they seem to continue the analysis ("We know some IP addresses cannot be shared by one person. These are the ones that would require a person to move faster than possible", etc.)
So turning off or regularly removing cookies will render their bleeding cutting-edge technology useless? And how are cookies a "breakthrough"? Their only alternative to this seems to be:
You can also throw Flash Shared Objects (FSO) into the mix. FSOs can't replace cookies, but if someone does support FSO you can use FSOs to record cookie IDs.
I don't know what the fuss is about
This is just basic logic, which any decent programmer should be able to come up with, even the M$ certified ones.
Typical web analysis junk (Score:5, Insightful)
The problem is with the word "accurate". To management, "accurate statistics" means knowing exactly how many conscious human beings looked at the site during a given period. However, the computer cannot measure this. What it can measure, accurately, is the number of HTML requests during a given period.
You can use the latter number to estimate the former. But because this estimate is affected by a multitude of factors like spiders, proxies, bugs, etc., management will say "these stats are clearly not accurate!". You can try to filter out the various "undesirable" requests, but the results you'll get will vary chaotically with the filters you use. The closer you get to "accurate" stats from the point of view of management, the further you'll be from "accurate" stats from a technical point of view.
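To make that concrete, here's a toy version of the filtering step. Note that the "accurate" number depends entirely on which user-agent patterns you happen to exclude -- the pattern list below is illustrative, not any product's actual list:

```python
import re

# Raw request count vs. "filtered" count: the second number varies
# wildly depending on which user agents you decide are not human.
BOT_PATTERNS = re.compile(r"bot|crawler|spider|slurp", re.IGNORECASE)

def count_requests(user_agents):
    """user_agents: one user-agent string per logged request."""
    raw = filtered = 0
    for ua in user_agents:
        raw += 1
        if not BOT_PATTERNS.search(ua):
            filtered += 1
    return raw, filtered

log = ["Mozilla/4.0", "Googlebot/2.1", "Mozilla/5.0", "Yahoo! Slurp"]
print(count_requests(log))  # (4, 2) -- change the regex, change the "truth"
```

Add "yahoo" or remove "spider" from the regex and you get a different, equally defensible number -- which is the whole problem.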
Makers of web analysis software and services address these problems by the simple technique of "lying". In fact, a whole industry has built up based on the shared delusion that we can accurately measure distinct users.
Which is where this article comes in. The author has discovered the shocking, shocking fact that the standard means of measuring distinct users are total bollocks. He's discovered that another technique produces dramatically different results. He's shocked, shocked, appalled in fact, that the makers of web analysis software are not interested in this new, highly computationally-intensive technique that spits out lower numbers.
My advice? Instead of doing costly probability analysis on your log files, just multiply your existing user counts by 0.7. The results will be just as meaningful and you can go home earlier.
Re:uhm, what? (Score:5, Insightful)
Why? (Score:3, Insightful)
I'm personally always more interested in how many pages get requested, and which ones. The first gives me an impression of how popular the site is*, the second tells me which pages people particularly like, so I can add more like that.
The only reason I see for really wanting to track people is if your site is actually an app that has state. In those cases, you have to use a more bullet-proof system than the one presented in TFA.
* Some people object that it counts many people who visit once, then never again; but I consider it a success that they got there in the first place - they were probably referred by someone, followed a link that someone made, or the page ranks high in a search engine.
Re:Tragically flawed (Score:3, Insightful)
Re:uhm, what? (Score:5, Insightful)
Their suggestion may be common-sense, but their approach borders on messianic:
"This article is going to ask you to make a paradigm shift... new, cutting edge methodologies... no web analytics product supports... a journey from first generation web analytics to second."
Followed by a lengthy paragraph on "paradigm shifts". In fact, the article takes three pages to basically say:
"In a nut-shell: To determine a web metric we should apply multiple tests, not just count one thing."
Here's a clue, Brandt Dainow - It's a common-sense way of counting visitors, not a new fucking religion.
The basic approach is to use a selection of criteria to assess visitor numbers - cookies first, then use different IPs/userAgents with close access-times to differentiate again, etc.
The good news is there are only three problems with this approach. The bad news is that they make it effectively useless, or certainly not much more useful than the normal method of user-counting:
Problem 1
There is no information returned to a web server that isn't trivially gameable, and absolutely no way to tie any kind of computer access to a particular human:
"1. If the same cookie is present on multiple visits, it's the same person."
Non-techie friends are always wanting to buy things from Amazon as a one-off, so I let them use my account. Boom - that's up to twenty people represented by one cookie, right there.
"2. We next sort our visits by cookie ID and look at the cookie life spans. Different cookies that overlap in time are different users. In other words, one person can't have two cookies at the same time."
Except that I habitually leave my GMail account (for example) logged in both at work and at home. Many people I know use two or more "personal" computers, and don't bother logging out of their webmail between uses. That's a minimum of two cookies with overlapping timestamps right there, and only one person.
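The overlap test itself is one line of code -- the problem is the premise, not the implementation. A sketch (the lifespan-tuple format is my assumption):

```python
# Step 2 as code: two cookies whose [first_seen, last_seen] lifespans
# overlap are declared different users -- which misfires for anyone
# logged in on two machines at once.
def overlaps(a, b):
    """a, b: (first_seen, last_seen) tuples of cookie lifespans."""
    return a[0] <= b[1] and b[0] <= a[1]

home = (1, 100)   # cookie active from t=1 to t=100
work = (50, 120)  # cookie active from t=50 to t=120
print(overlaps(home, work))  # True: the heuristic counts one person as two
```

So every work+home webmail user inflates the count by exactly the kind of error this method claims to eliminate.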
"3. This leaves us with sets of cookie IDs that could belong to the same person because they occur at different times, so we now look at IP addresses."
This isn't actually an operative step, or a test of any kind. It's just a numbered paragraph.
"4. We know some IP addresses cannot be shared by one person. These are the ones that would require a person to move faster than possible. If we have one IP address in New York, then one in Tokyo 60 minutes later, we know it can't be the same person because you can't get from New York to Tokyo in one hour."
FFS, has this guy ever touched a computer? For someone writing about technology he's pretty fucking out of touch. As an example, what about people who routinely telnet+lynx, VMWare or PCAnywhere right across the world, hundreds of times in their workday? Sure, maybe most normal users don't (yet), but for some sites (eg, nerd-heavy sites like
"5. This leaves us with those IP addresses that can't be eliminated on the basis of geography. We now switch emphasis. Instead of looking for proof of difference, we now look for combinations which indicate it's the same person. These are IP addresses we know to be owned by the same ISP or company."
Except that one ISP can serve as many as hundreds of thousands of users. And proxy gateways often report one IP for all the users connected to them. For example, NTL reports one "gateway" IP for all the people in my town on cable-modems - that's thousands, minimum. So, we're looking at a potential error magnitude of 100-100,000. That's no better than the existing system for assessing unique visitors.
"6. We can refine this test by going back over the IP address/Cookie combination. We can look at all the IP addresses that a cookie had. Do we see one of those addresses used on a new cookie? Do both cookies have the same User Agent? If we get the same pool