Standard Mischief

*nix mischief: simple and crude web log analysis tools

At some point, I’m supposed to make the obligatory joke about my three readers. Instead, I’m going to share some command line pipes that any fair-to-middling *nix person should be able to figure out. Please note that even though I’m perusing my logs, I’m not doing any creepy crap with the data, and I’m not doing anything funky to readers with cookies or anything. Also, no IP addresses or other personal data were publicly exposed in the making of this blog post.

Although my stats program tells me that I get over 100 unique visitors a day, and it goes as far as breaking them down into spiders vs. “real people”, I’m not entirely sure I’m getting the whole picture. So grep to the rescue.

$ grep 'tailrank.png' access_log|grep '11/Feb'|egrep -o '^([0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3})'|sort|uniq|wc -l

Real people arriving at a web page almost always load the available images, while spiders, feed scrapers and hotlinkers usually don’t. So here I’m using one of those adorable social bookmark pictures below as a “web bug” (see Wikipedia’s page and the EFF’s page on them).

The first grep picks only lines that contain the image, and the second grep restricts that to entries for February 11th. The egrep selects only the part of the log entry that has the IP address. Then the list gets sorted, duplicate entries are removed, and at the end we count the total number of lines. I’m getting 23 people for this date, although in one case two addresses were identical except for the last number, so I assume that one visitor’s IP address probably changed between visits that day.
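As a rough sketch, the same count can be done in a single awk pass. It assumes the client address is the first whitespace-separated field of each line, as in Apache's common and combined log formats. The function name `count_image_visitors` is made up for illustration.

```shell
# Hypothetical helper: count distinct client addresses that fetched
# an image on a given day.
# Assumption: the first whitespace-separated field is the client IP.
count_image_visitors() {
    log=$1; image=$2; day=$3
    # index() does a plain substring test, so the dot in the file name
    # is not treated as a regex wildcard.
    awk -v img="$image" -v day="$day" \
        'index($0, img) && index($0, day) { print $1 }' "$log" |
        sort -u | wc -l
}

# Usage: count_image_visitors access_log tailrank.png 11/Feb
```

Taking the first field sidesteps the IP regular expression entirely, at the cost of trusting the log format.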

Now for the people who read my feed.

$ grep 'GET /feed/ HTT' access_log|grep '11/Feb'|grep 'Mozilla'|egrep -o '^([0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3})'|sort|uniq

The first grep restricts this to requests for my main feed, and the next restricts it to February 11th again. The third keeps only the entries that have “Mozilla” in the user agent string. Then the IP addresses are extracted, sorted, and de-duplicated again. Here I’ve got six, but one overlaps with the earlier query, so I’ve got five new ones for this day.
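Spotting the overlap between the two lists by eye works for a handful of addresses. A minimal sketch using `comm` can do it mechanically. The helper names here are hypothetical, and the code assumes the client IP is the first field of each log line.

```shell
# Hypothetical helpers; assumption: the client IP is field 1 of each line.
image_ips() {   # addresses that fetched the image on the given day
    grep "$2" "$1" | grep "$3" | awk '{ print $1 }' | sort -u
}
feed_ips() {    # addresses that fetched /feed/ with a Mozilla user agent
    grep 'GET /feed/ HTT' "$1" | grep "$2" | grep 'Mozilla' |
        awk '{ print $1 }' | sort -u
}
overlap_ips() { # addresses that appear in both lists
    a=$(mktemp); b=$(mktemp)
    image_ips "$1" "$2" "$3" > "$a"
    feed_ips "$1" "$3" > "$b"
    comm -12 "$a" "$b"   # -12 keeps only lines common to both sorted files
    rm -f "$a" "$b"
}

# Usage: overlap_ips access_log tailrank.png 11/Feb
```

`comm` needs both inputs sorted the same way, which is why each list goes through `sort -u` first.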

$ grep 'GET /feed/ HTT' access_log|grep '11/Feb'|grep -v 'Mozilla'|sort|uniq

This one restricts the log to feed requests from spiders, scrapers, and aggregators, and prints each whole log line, user agent string included. Bloglines and NewsGatorOnline have the decency to inform me in the user agent string that I have one subscriber with each. The rest are spiders and such, mostly for search engines, and I don’t count them as real.

It’s a snapshot of just one day, and it does not account for someone who might read my aggregated feed at, say, Bloglines and also visit with their browser. Nor does it detect someone who reads me both at work and at home, or at two different wi-fi hotspots. Still, I have a pretty good idea of my total readership.
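To see just the user agents rather than whole log lines, one option is to split each line on double quotes and tally the results. This is only a sketch. It assumes the combined log format, where the user agent is the third quoted string on the line, and `feed_agents` is a made-up name.

```shell
# Hypothetical helper: tally non-browser feed fetchers by user agent.
# Assumption: combined log format, so splitting on double quotes puts
# the request in field 2, the referer in field 4, and the user agent
# in field 6.
feed_agents() {
    grep 'GET /feed/ HTT' "$1" | grep "$2" | grep -v 'Mozilla' |
        awk -F'"' '{ print $6 }' |
        sort | uniq -c | sort -rn   # most frequent agent first
}

# Usage: feed_agents access_log 11/Feb
```

Because timestamps no longer make every line unique, `uniq -c` actually collapses repeat fetches from the same agent into one counted row.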

Credit where due: the crude regular expression I used above (crude because it would also match a dotted quad such as 999.999.999.999) came from here.
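For anyone bothered by the 999.999.999.999 problem, a stricter pattern is possible. The sketch below is not the credited expression. It is a commonly used octet pattern that only accepts 0 through 255, applied to the first field of each line. The name `strict_ips` is hypothetical.

```shell
# Hypothetical helper: print the first field of each log line only when
# it is a dotted quad whose octets are all in the range 0-255.
strict_ips() {
    # One octet: 250-255, 200-249, 100-199, or 0-99 without a leading zero.
    octet='(25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9]?[0-9])'
    # grep -x requires the whole line to match, so 192.168.1.256 cannot
    # sneak through as a truncated 192.168.1.25.
    awk '{ print $1 }' "$1" | grep -E -x "$octet(\.$octet){3}"
}

# Usage: strict_ips access_log | sort -u | wc -l
```

In a web server log the first field was written by the server itself, so the crude version is usually good enough. The strict one matters more when pulling addresses out of free-form text.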

2007-02-13 08:00 by Standard Mischief. Filed under: don't try this at home
