Querying Your Web Server Logs

So you’ve imported your web server logs into a database, and you’re ready to start extracting useful information from them.

If your logs are in W3C Extended Log Format, you may have some or all of the following columns available to you (depending on your logging configuration)…

  • [date] – The date of the request in yyyy-mm-dd format. Note that date/time is usually UTC/GMT time, rather than the local timezone.
  • [time] – The time of the request in 24 hour time hh:mm:ss. Due to the timezone issue, you may have to apply an offset to make it correct.
  • [s-sitename] – The internal site name that has been set up in IIS. This is not usually the same as the domain name, so expect something like “W3SVC1” instead of “www.jds.net.au”.
  • [s-computername] – This field should contain the hostname of the webserver as it appears in DNS. I have never seen this field used.
  • [s-ip] – The IP address of the server. Use this to distinguish between the different web servers in your server farm. Note that a single web server may have multiple IP addresses, so make sure you have a clear idea of the mapping between IP addresses and servers.
  • [s-port] – The port the server is running on (expect mostly 80 or 443). Most people don’t bother to log this as it is not very useful information.
  • [cs-method] – The HTTP method used for the request. This will mostly be either GET or POST, but occasionally you will see HEAD (to see if the page has been updated) and PUT and DELETE (which are usually from bots looking for poorly configured servers).
  • [cs-uri-stem] – The address for a web page is broken down into the following parts [protocol]://[hostname]:[port (optional)]/[URI stem]?[uri-query]. So the URI stem of https://www.jds.net.au/tech-tips/ will be “/tech-tips/”.
  • [cs-uri-query] – The query string being passed to a dynamic page in the URL. The query string follows a question mark and, if there are multiple name/value pairs, they are separated by an ampersand. The URI query of https://www.jds.net.au/?s=test is “s=test”.
  • [cs-username] – If you have authentication set up at the web server level (Eg. Basic, NTLM, etc), then this will contain the username that has been used to authenticate to the server. If no authentication has been used, then this will be a “-“.
  • [c-ip] – The client’s IP address. Note that many clients may share the same IP address if they are behind the same proxy, so combining the User-Agent column with the Client IP address gives a more accurate indication of the number of users of the website.
  • [cs(User-Agent)] – The User Agent string sent by the browser in the HTTP request header. E.g. “Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; .NET CLR 3.5.30729)“.
  • [cs(Cookie)] – According to the specification, the maximum size of a cookie is 4096 bytes, but multiple cookies can be set for each domain.
  • [cs(Referer)] – The complete URI of the referring page, as sent in the HTTP request header. E.g. “http://www.google.com.au/search?q=jds”.
  • [sc-status] – Three digit HTTP response code. E.g. 404 (Not Found), 500 (Internal Server Error), etc.
  • [sc-substatus] – IIS has an HTTP substatus code, which appears after the HTTP status code, and provides additional information. E.g. 403 is the status code which means “Forbidden”, but 403.1 is more specific, and means “Forbidden: Execute Access Forbidden”. Most of the time the substatus is 0.
  • [sc-win32-status] – The Windows network status code that was returned when the response was processed. The status codes can be looked up by running “net helpmsg {StatusCode}” from the command line. E.g. status code 1236 maps to “The network connection was aborted by the local system”, which may indicate that you are hitting connection limits on your web server. Most of the time, this will be “0”, indicating that “The operation completed successfully.”
  • [cs-version] – The protocol version used for the request/response e.g. HTTP/1.1
  • [sc-bytes] – The number of bytes in the response from the server back to the client. This includes bytes for the HTTP header.
  • [cs-bytes] – The number of bytes in the request from the client to the server. This includes the HTTP header.
  • [time-taken] – Number of miliseconds taken from the time the first byte of the HTTP request arrives, to the time the last byte of the HTTP response is acknowledged (at the TCP level) by the client.

In W3C Extended Log Format, if any field does not contain a value (like, say there was no URI query string sent), then “-” will be used instead.

Note that the examples in this article are based on IIS logs which have been imported to an MS SQL Server database. Queries also assume that you have a single day of logs in your table; if you have more, then restrict the query with a date range.

1
2
3
4
5
6
7
8
9
10
11
12
13
-- Total gigabytes sent and received for each hour
-- Pedantic people can divide by 1073741824 instead of 1000000000 
USE WebLogs;
SELECT 
  DATEPART(HOUR, [TIME]) AS HourOfDay, 
  CAST((SUM([sc-bytes])/1000000000.0) AS DECIMAL(18,2)) AS Sent, 
  CAST((SUM([cs-bytes])/1000000000.0) AS DECIMAL(18,2)) AS Received
FROM 
  [JDSWEB01]
GROUP BY 
  DATEPART(HOUR, [TIME])
ORDER BY 
  HourOfDay ASC
1
SELECT COUNT(*) FROM JDSWEB01

Notes for those who made it to the end:

  • Even though analysing website logs is much faster with SQL Server, I still like LogParser for its friendliness. For instance, I still haven’t found a good way to QUANTIZE() with SQL Server – I have to make do with DATEPART() to break results down by hour.
  • LogParser also has a neat EXTRACT_EXTENSION() function that can be emulated with SQL Server’s string functions.

1 comment

Leave a Reply