By Jüri Shamov-Liiver on 17 May 2018
Finding sample logs on the web is not an easy task. This is quite understandable as log content is nowadays considered sensitive (hint: GDPR). Therefore I was excited to come across these samples. Many thanks, @anton_chuvakin and @automine for bringing this up Twitter. Loads of samples of different logs to play around with. A bit outdated in terms of their collection dates but very much up-to-date in terms of their nightmarish structure. Some of the packages are the source data of the HoneyNet Project Challenges. These make great lessons for performing security log analysis.
I headed for BlueCoat ProxySG logs as something we've not yet played around with SpectX. As always, the first thing to look at was the structure of the log for parsing out the fields I need. So in I go with Pattern Developer, pressing Prepare Pattern after browsing to the downloaded log in the Data Browser.
By the way, if it weren't for the zip-compression, SpectX could also run queries on files hosted on the project page without importing them anywhere (download the free 30-day trial to point SpectX at your own data and get parsing-querying).
Back to the pattern. What a cool surprise: one of the apparent header rows (prefixed with a #) contains field names! Great, no need to figure them out myself. Records appear to be in one row with a field separated by spaces. But what's their format (timestamp, integer, quoted string,...)? The comment on log-sharing.dreamhosters.com site mentions this is a standard web proxy log in W3C format..., so perhaps a manual would give us even a quicker and more precise definition than figuring it out based on actual field data?
A quick Google and voilà: the BlueCoat Admin Manual for SGOS 5.1.4 listing 10 templates for W3C ELFF formats to choose from. None of them matches the set of fields in the sample though. The closest one seems to be a subset of the 10 first fields of the sample. Perhaps the "standard" is not that universal after all... Ok, moving on. A table in Appendix B has helpfully listed the meanings of the fields. Kewl, saving some time again. However, the field formats appear nowhere to be seen. Well, nobody or -thing is perfect. Luckily, finding correct data format types is pretty straightforward with SpectX, so it shouldn't take too much time.
Since my goal is to explore the log from all possible angles I'd like to parse out every one of the fields. SpectX Pattern Developer is invaluable here, lighting up the matched fields as I enter matcher tokens:
TIMESTAMP('yyyy-MM-dd HH:mm:ss', tz='GMT'):dateTime ' '
//this captures both date and time fields in TIMESTAMP type, followed by space.
//NB! The date nor time fields contain timezone info, I have to assign it myself.
//GMT is always a good candidate.
LONG:time_taken ' ' //Processing time could potentially be long. Let's use LONG type to capture it.
IPV4:c_ip ' ' //The logs are from 2005, so no IPV6 yet :-)
INT:sc_status ' ' //http status code is INTEGER, we know that by heart
('-' | [A-Z_]+:s_action) ' '
//the s-action field seems to contain uppercase letters and an underscore (Appendix B, Table 7-3)
LONG:sc_bytes ' '
LONG:cs_bytes ' '
LD:cs_method ' '
LD:cs_uri_scheme ' '
LD:cs_host ' '
LD:cs_uri_path ' '
('-' | LD:cs_uri_query) ' '
('-' | LD:cs_username) ' '
LD:s_hierarchy ' '
LD:s_supplier_name ' '
('-' | LD:rs_content_type) ' '
('-' | DQS:cs_useragent) ' '
UPPER:sc_filter_result ' '
LD:sc_filter_category ' '
('-' | DQS:x_virus_id) ' '
IPV4:s_ip ' '
LD:s_sitename
(' ' DQS:x_virus_details)?
(' ' LD:x_icap_error_code)?
(' ' LD:x_icap_error_details)?
[\n\r]+;
// The header and data records need to be defined separately:Right, no _unmatched data in the sample any more and let's verify how SpectX handles all the data we have. Pressing Prepare Query
$hdr = //header record:
'#' //header always begins with # symbol
LD:hdrLine //capture the rest of line with the LD wildcard
[\n\r]+;
$record =
TIMESTAMP('yyyy-MM-dd HH:mm:ss', tz='GMT'):dateTime ' '
//the data record:
//captures both date and time fields in TIMESTAMP type, followed by space.
//NB! as date and time fields don't contain timezone info, we have to assign it
//ourselves. GMT is always a good candidate.
LONG:time_taken ' ' //processing the time could take a while. Let's use the LONG type to capture it.
IPV4:c_ip ' ' //it's 2005, so no IPV6 yet :-)
INT:sc_status ' ' //http status code is INTEGER, we know that by heart
('-' | [A-Z_]+:s_action) ' '
//the s-action field seems to be uppercase letters and underscore (Appendix B, Table 7-3)
LONG:sc_bytes ' '
LONG:cs_bytes ' '
LD:cs_method ' '
LD:cs_uri_scheme ' '
LD:cs_host ' '
LD:cs_uri_path ' '
('-' | LD{1,20000}:cs_uri_query) ' '
('-' | LD:cs_username) ' '
LD:s_hierarchy ' '
LD:s_supplier_name ' '
('-' | LD:rs_content_type) ' '
('-' | DQS:cs_useragent) ' '
UPPER:sc_filter_result ' '
LD:sc_filter_category ' '
('-' | DQS:x_virus_id) ' '
IPV4:s_ip ' '
LD:s_sitename
(' ' DQS:x_virus_details)?
(' ' LD:x_icap_error_code)?
(' ' LD:x_icap_error_details)?
[\n\r]+;
//the pattern can consist of either header OR data records:
($hdr | $record)
$patt = $[/shared/patterns/bluecoat-w3c-elff.sxp];
//PARSE cmd applies our pattern to file(s) pointed to by src uri:
@stream = PARSE(src:'file://data/chuvakin/bluecoat_proxy_big/Demo_log_001.log', pattern:$patt);
//use cmdline syntax in writing query:
@stream
.select(_unmatched, *) //select the _unmatched system field and all pattern-defined fields
$patt = $[/shared/patterns/bluecoat-w3c-elff.sxp];
//PARSE cmd applies our pattern to the file(s) pointed to by the src uri
//by replacing 001 with the wildcard in the filename I have all 4 log files included in parsing
@stream = PARSE(src:'file://data/chuvakin/bluecoat_proxy_big/Demo_log_*.log', pattern:$patt);
//use cmdline syntax in writing query:
@stream
.select(_unmatched, *) //select the _unmatched system field and all pattern-defined fields
.filter(_unmatched is not null) //I'm interested only in parsing errors
('-' | LD{1,20000}:cs_uri_query) ' ' //added non-default max length of 20 Kb.Save and trying the verifying query again. The result: zero unmatched bytes and 0 rows in the result set (i.e. no unmatched bytes). Success!