To make the most effective use of its Web site, a business needs detailed intelligence on how people use the site. This has given rise to a whole field of Web technology: audience analysis.
Just a few years ago, the only way to monitor Web site visitors was to analyze the cryptic log files generated by the host server. Now, sophisticated client-based systems provide easy-to-use reports based on a wide range of statistics gathered directly from the browsers of individual users, bringing big improvements in detail, accuracy, timeliness and efficiency.
These advancements have brought a variety of approaches to audience analysis, and businesses must choose the approach best suited to their requirements. To help businesses understand and evaluate these approaches, here are some principal concepts of Web audience analysis.
Application service provider. A company that provides software applications, data or data processing services over the Web, freeing businesses from the need to provide these things on their own. Thanks to specialized expertise and economies of scale, ASPs can offer businesses better performance, lower cost, faster implementation and simpler operation than an in-house solution.
Armed with a new generation of client-based audience-analysis systems, an innovative audience-analysis ASP can provide data gathering and analysis, on any scale, more cost-effectively than most businesses can achieve on their own. In addition, an audience-analysis ASP’s third-party objectivity and accountability can help a business attract advertisers and partners.
Caching proxy. This is a repository of Web pages used to speed the delivery of Web pages to users. Many ISPs maintain proxy servers that store millions of pages copied from the Web. When a user requests a page stored on a proxy, the ISP delivers the page quickly from the proxy, rather than using the Web server to retrieve the page from the Web, which can take much longer. Client-based analysis techniques can track pages retrieved from proxies. Log files and packet sniffers cannot track pages retrieved from proxies, since the Web server is typically not involved in the transaction.
Click stream. This is a sequence of mouse clicks made by a user. On a Web site, the click stream reveals the chain of pages that the user views and the actions the user performs on each page. Click streams are valuable in determining the effectiveness of site features and advertising campaigns. Client-based analysis can track these click streams in detail. Log files and packet sniffers need to identify a specific Internet protocol address for each user to track click streams. Since IP address sharing often makes this impossible, these techniques typically do not track click streams as accurately. Media panels cannot track click streams.
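The effect of IP address sharing can be illustrated with a minimal Python sketch. The event records, the field names ip and browser_id, and the helper group_clicks are hypothetical, invented for illustration only; they do not describe any particular product.

```python
from collections import defaultdict


def group_clicks(events, key):
    """Group page requests into click streams using the given field as the user key."""
    streams = defaultdict(list)
    for event in events:
        streams[event[key]].append(event["page"])
    return dict(streams)


# Two users behind one shared address (hypothetical data).
events = [
    {"ip": "203.0.113.5", "browser_id": "u1", "page": "/index.html"},
    {"ip": "203.0.113.5", "browser_id": "u2", "page": "/index.html"},
    {"ip": "203.0.113.5", "browser_id": "u1", "page": "/products.html"},
    {"ip": "203.0.113.5", "browser_id": "u2", "page": "/support.html"},
]

by_ip = group_clicks(events, "ip")               # one merged, interleaved stream
by_browser = group_clicks(events, "browser_id")  # one stream per browser

print(len(by_ip), len(by_browser))
```

Keyed on the IP address alone, the two users collapse into a single interleaved click stream. Keyed on an identifier gathered from each browser, they stay separate.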
Client-based analysis. Web audience analysis based on statistics gathered from the browsers (clients) of individual users. This sophisticated form of analysis generally provides better information than server-based analysis because:
• Client-based analysis obtains statistics by observing the actual activities of users, while server-based analysis obtains statistics from the clients’ requests to a Web server. Since some user activity cannot be observed at the server, client-based analysis provides more accurate and detailed information than server-based analysis.
• Client-based analysis obtains statistics from users in real time, while server-based analysis obtains statistics through after-the-fact batch analysis of data accumulated over time. As a result, client-based analysis typically provides more timely information than server-based analysis.
Because client-based analysis is technologically challenging, businesses typically obtain it through ASPs.
Crawler. Also called a spider or bot, a crawler performs automated tasks on the Web, such as automatically following hypertext links and indexing information based on user-specified search criteria. Since crawlers are not actual users, audience-analysis systems need to identify and exclude the activity of crawlers. This is a challenge for server-based analysis systems.
To identify the activity of a crawler, a server-based system needs to know about the crawler, in much the way anti-virus software needs to know about a virus in order to detect it. Since there are thousands of crawlers – and new ones appear every day – server-based systems cannot identify every crawler. As a result, their analysis is skewed by crawler activity. In contrast, client-based analysis systems typically monitor only the activity of actual users, excluding the activity of crawlers, so the information they provide is more accurate.
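As a rough illustration of why signature-based exclusion misses new crawlers, here is a minimal Python sketch. It assumes the server-based system matches each request's User-Agent string against a list of known signatures. The signature list, the sample requests and the "NewHarvester" agent are hypothetical.

```python
# Hypothetical signature list; a real system would need a far larger,
# constantly updated one.
KNOWN_CRAWLER_SIGNATURES = ("googlebot", "slurp", "spider", "crawler")


def is_known_crawler(user_agent):
    """Return True if the User-Agent contains a known crawler signature."""
    ua = user_agent.lower()
    return any(signature in ua for signature in KNOWN_CRAWLER_SIGNATURES)


def exclude_crawlers(requests):
    """Keep only requests whose User-Agent matches no known signature."""
    return [r for r in requests if not is_known_crawler(r["user_agent"])]


requests = [
    {"path": "/index.html", "user_agent": "Mozilla/4.0 (compatible; MSIE 5.5; Windows 98)"},
    {"path": "/index.html", "user_agent": "Googlebot/2.1 (+http://www.googlebot.com/bot.html)"},
    # A crawler the list does not know about slips through as if it were a user.
    {"path": "/products.html", "user_agent": "NewHarvester/0.1"},
]

filtered = exclude_crawlers(requests)
print(len(filtered))
```

The known crawler is removed, but the unknown one remains in the filtered data and inflates the traffic figures.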
E-business intelligence. This is information on the businesses and individuals that make up e-commerce – increasingly critical to an online enterprise. Distilling raw data into useful information is one of the major challenges in Web audience analysis. Through superior data gathering and analysis, client-based audience-analysis systems can provide better information than most businesses can achieve on their own, in real time and with third-party objectivity.
Hit. When a browser requests a single HTML page, image, file or other object from a Web server, a hit results. Since a single Web page may comprise many objects, a single page view can be recorded in a log file as many hits. As a result, in analyzing Web traffic, it is generally more useful to track page views instead of hits.
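The difference between the two counts can be shown with a short Python sketch. It assumes that a request counts as a page view when its path ends in ".html", ".htm" or "/", which is a simplification chosen for illustration. The function names and the sample request list are hypothetical.

```python
import posixpath

# Assumption: only these extensions (or a trailing slash) denote a page.
PAGE_EXTENSIONS = {".html", ".htm"}


def is_page(path):
    """Treat directory requests and HTML files as pages; everything else is an embedded object."""
    if path.endswith("/"):
        return True
    return posixpath.splitext(path)[1].lower() in PAGE_EXTENSIONS


def count_hits_and_page_views(paths):
    """Every request is a hit; only page requests are page views."""
    hits = len(paths)
    page_views = sum(1 for p in paths if is_page(p))
    return hits, page_views


# Two pages, each pulling in several embedded objects.
requested = [
    "/index.html", "/logo.gif", "/banner.jpg", "/style.css",
    "/products.html", "/photo.jpg",
]

hits, page_views = count_hits_and_page_views(requested)
print(hits, page_views)  # 6 2
```

Six hits here correspond to only two page views, which is why hit counts overstate how many pages visitors actually saw.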
Log file. This is a file generated by a Web server that records each server action in response to user requests. Log files contain a variety of information on the actions of site visitors. However, since raw log files are virtually impossible to interpret manually, special analysis software is used to extract useful information. In addition, log files can be huge – hundreds of gigabytes for popular sites – posing processing and storage challenges. Most importantly, the information produced by this server-based analysis technique is generally less accurate and less timely than information from client-based analysis, and it lacks many details offered by client-based analysis.
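To give a sense of what log analysis software does with a raw entry, here is a minimal Python sketch. It assumes the server writes the widely used Common Log Format; the article does not say which format is meant. The sample line and the names CLF_PATTERN and parse_log_line are made up for illustration.

```python
import re

# Assumption: entries follow the Common Log Format:
# host ident user [time] "method path protocol" status size
CLF_PATTERN = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)'
)


def parse_log_line(line):
    """Return a dict of fields for one log entry, or None if the line does not match."""
    match = CLF_PATTERN.match(line)
    if match is None:
        return None
    record = match.groupdict()
    record["status"] = int(record["status"])
    # A size of "-" means no bytes were returned.
    record["size"] = 0 if record["size"] == "-" else int(record["size"])
    return record


sample = '192.0.2.10 - - [10/Oct/2000:13:55:36 -0700] "GET /index.html HTTP/1.0" 200 2326'
record = parse_log_line(sample)
print(record["host"], record["path"], record["status"])
```

Each entry identifies only the requesting address and the object served, so anything that never reaches the server, such as a page delivered from a caching proxy, leaves no entry to parse.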
Media panel. Similar to television ratings polls, a media panel collects Web usage statistics through software installed on the workstations of selected users that tracks the Web activity of those users. Overall Web usage patterns are obtained by extrapolation from this sample. Despite efforts to obtain a representative sampling, media panels – mostly centered in North America – only approximate the global user community. In addition, statistics from media panels are skewed toward the type of people who participate in such efforts. Finally, since the data from a media panel must be compiled and processed to obtain useful statistics, media panels cannot provide real-time reporting.
Packet sniffer. A packet sniffer examines data packets passing through a network. Like log files, it is a source of statistics for server-based analysis.
Page view. Also called a page impression, this is an instance of a browser requesting a single, complete Web page, comprising one or more images or other objects, from a Web server. Since a single Web page may include many objects, a single page view can be recorded in a log file as many hits. As a result, in analyzing Web traffic, it is generally more useful to track page views instead of hits.
Path. This is a sequence of pages viewed by a user on a single site, starting with the page on which the user enters the site, continuing through each subsequent page view and concluding with the page from which the user leaves the site. A path is similar to a click stream, except that a click stream may span many sites. Identifying popular paths is critical in designing effective site features. In general, both server-based and client-based analysis can identify paths.
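Identifying popular paths can be sketched in a few lines of Python. The sketch assumes each page view is already recorded as a (visitor, timestamp, page) tuple and that each visitor makes a single visit. The data and the helper build_paths are hypothetical.

```python
from collections import Counter, defaultdict


def build_paths(page_views):
    """Order each visitor's page views by time to form that visitor's path."""
    by_visitor = defaultdict(list)
    for visitor_id, timestamp, page in page_views:
        by_visitor[visitor_id].append((timestamp, page))
    return {
        visitor: tuple(page for _, page in sorted(views))
        for visitor, views in by_visitor.items()
    }


# (visitor, timestamp, page) records for three visitors (hypothetical data).
page_views = [
    ("a", 1, "/index.html"),
    ("b", 2, "/index.html"),
    ("a", 3, "/products.html"),
    ("b", 4, "/products.html"),
    ("c", 5, "/index.html"),
    ("a", 6, "/order.html"),
    ("b", 7, "/order.html"),
    ("c", 8, "/contact.html"),
]

paths = build_paths(page_views)
most_popular_path, visitor_count = Counter(paths.values()).most_common(1)[0]
print(most_popular_path, visitor_count)
```

In this sample, two of the three visitors follow the same path from the entry page through the product page to the order page, making it the most popular path.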
Real-time reporting. Analysis in real time reports Web users’ behavior as it happens. Client-based analysis techniques can provide real-time reporting. Log files and packet sniffers can provide near real-time reporting only on smaller scales – the heavier the site traffic, the longer the delay in reporting. Media panels provide much slower reporting.
Server-based analysis. Web audience analysis based on statistics gathered from a Web server, typically from the log files generated by the server or from a packet sniffer on the network. This form of analysis, one of the first audience-analysis techniques to be developed, generally provides information that is less detailed, less accurate and less timely than that provided by client-based analysis.
• Jay McCarthy is vice president of special projects at WebSideStory, San Diego.