Wednesday, February 6, 2008

Understanding timing with your Account Summary

The Cubics Publisher Reports screen contains several pieces of information regarding your account and account status. One of the issues that new publishers seem to deal with is understanding how today's numbers are tracked.

The basic premise is that your numbers are updated throughout the day and then a "snapshot" of these numbers is taken at the end of the day. That means that your numbers for today are not final until the "end of the day". For our purposes, the "end of the day" is midnight CST, but there's a little more to it than that: the system's "midnight" CST usually happens later than actual midnight.

For all intents and purposes, all numbers during the current day should be treated as estimates.

To get an idea of what's going on, here's some insight into the workings of the system.

What's up with stats?

The Cubics system has multiple web servers processing 20-40k views/minute (we're doing well over 1B impressions/month). That's 20k+ DB inserts every minute just to track views! The web server data is regularly aggregated and pushed up to the main database, but it's all on a delay; we're not tracking data in real time here. We keep tracking clicks for several minutes after a view is served, so views stay on a web server for 30+ minutes before they're pushed to the primary database.

We have a regular process (call it Archiver) that runs through our web servers, compresses the data and posts that data to the primary database. This data then becomes the stats that you see in the reports screen.
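As a rough illustration of the "compress" step, here is a minimal Python sketch that rolls raw view rows up into per-publisher, per-day counts before they would be posted. The row layout (publisher ID plus timestamp) and the name compress_views are assumptions made for this example; this is not the actual Archiver code.

```python
from collections import Counter
from datetime import datetime


def compress_views(raw_views):
    """Roll raw (publisher_id, viewed_at) rows up into per-day counts.

    Hypothetical row layout for illustration only; timestamps are
    assumed to already be in CST.
    """
    counts = Counter()
    for publisher_id, viewed_at in raw_views:
        counts[(publisher_id, viewed_at.date())] += 1
    return counts


# Four raw rows collected on one web server around midnight.
raw = [
    ("pub-1", datetime(2008, 2, 5, 23, 58)),
    ("pub-1", datetime(2008, 2, 5, 23, 59)),
    ("pub-1", datetime(2008, 2, 6, 0, 1)),
    ("pub-2", datetime(2008, 2, 6, 0, 2)),
]
summary = compress_views(raw)
for (publisher_id, day), views in sorted(summary.items()):
    print(publisher_id, day, views)
```

Four raw rows become three summary rows here. At tens of thousands of views per minute, that kind of roll-up is what would keep the primary database from taking one insert per view.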

Can you see what's happening here?

The views are 30+ minutes behind and they're spread across multiple web servers. So if the Archiver runs at 10 minutes past midnight, it hasn't gathered all of the views for yesterday. If it runs at 40 minutes past midnight, it will grab views for yesterday and today. So your stats would have numbers for yesterday and today all at once.
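To make the timing concrete, here is a small Python sketch of those two Archiver runs. It assumes a flat 30-minute hold on every view (the text only says "30+ minutes"), and the function name split_by_cutoff is made up for this example.

```python
from datetime import datetime, timedelta

# Assumed hold time; the real delay is "30+ minutes" and may vary.
HOLD = timedelta(minutes=30)


def split_by_cutoff(run_time, view_times):
    """Split views into those old enough to archive and those still held."""
    cutoff = run_time - HOLD
    archived = [t for t in view_times if t <= cutoff]
    pending = [t for t in view_times if t > cutoff]
    return archived, pending


views = [
    datetime(2008, 2, 5, 23, 30),  # yesterday
    datetime(2008, 2, 5, 23, 55),  # yesterday, close to midnight
    datetime(2008, 2, 6, 0, 5),    # today
]

# Run at 00:10: the cutoff is 23:40, so the 23:55 view is still on the web server.
early_archived, early_pending = split_by_cutoff(datetime(2008, 2, 6, 0, 10), views)

# Run at 00:40: the cutoff is 00:10, so the run picks up yesterday and today together.
late_archived, late_pending = split_by_cutoff(datetime(2008, 2, 6, 0, 40), views)
late_days = {t.date() for t in late_archived}
```

After the 00:10 run, one of yesterday's views is still pending, so yesterday is incomplete. After the 00:40 run, late_days holds both February 5 and February 6.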

But we're still not done with yesterday's numbers, because we have multiple web servers.

So you could log into your account at 40 past midnight CST and see data for yesterday and data for today, but your numbers for yesterday may not be finalized. We have another process that runs and finalizes the day, but it has to wait until the Archiver has finished running against all of the web servers.
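The waiting rule can be sketched in a few lines of Python. This assumes the system records, per web server, the time up to which views have been archived; that bookkeeping, the server names, and the name can_finalize are all hypothetical.

```python
from datetime import date, datetime, time, timedelta


def can_finalize(day, archived_through):
    """Return True once every web server has been archived past the end of `day`.

    `archived_through` maps a server name to the latest view time the
    Archiver has already collected from it (a hypothetical bookkeeping table).
    """
    end_of_day = datetime.combine(day + timedelta(days=1), time.min)
    if not archived_through:
        return False  # nothing known yet, so do not finalize
    return all(ts >= end_of_day for ts in archived_through.values())


yesterday = date(2008, 2, 5)

# At 00:40, web2 has only been archived through 23:45 yesterday.
progress = {
    "web1": datetime(2008, 2, 6, 0, 10),
    "web2": datetime(2008, 2, 5, 23, 45),
}
ready_before = can_finalize(yesterday, progress)

# Once the Archiver reaches web2 again, the day can be closed.
progress["web2"] = datetime(2008, 2, 6, 0, 10)
ready_after = can_finalize(yesterday, progress)
```

In this sketch a single lagging web server is enough to hold up the whole day, which is why yesterday's numbers can stay open after today's numbers have started to appear.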

And then there's click fraud...

The other issue surrounding statistics is click fraud. The Archiver and the Finalizer run a whole battery of tests to detect click fraud. The system does its best to clean up click fraud as early as possible, but it's not a simple problem and sometimes it takes more than a day to catch. Google has been known to go back several months when correcting fraudulent clicks: crediting advertisers and debiting publishers in December for clicks registered in July.

We've never had to go back that far, but tracking click fraud is a big issue. We're parsing tens of thousands of clicks daily and hundreds of thousands of clicks weekly, looking for signatures of fraud. And whenever we find something, we have to run back through the affected data and update stats for everyone who's been affected.

So what does this all mean?

It means that your numbers for the day are not "set in stone" until sometime the next morning. If you log in at 1 AM CST and see stats for today, it doesn't mean that your numbers for yesterday are complete: we could still be waiting on a web server, the Finalizer could still be running, or we could be cleaning up click fraud. Normally, the timeline is quite tight and everything is done by 1 AM CST, but sometimes we're out a few hours.

So treat today's numbers as estimates until at least the next morning.

We're working towards more transparency about the status of these numbers. Step one is to separate those numbers that have been processed from those numbers that are still processing. But there's a lot more to it, and click fraud is a big reason. We'll be posting here as we make future changes.