Detecting Human Behaviour on BlueCoat Proxy Logs
By Jüri Shamov-Liiver on 24 May 2018
This is the sequel to exploring BlueCoat proxy logs. Having defined their structure I now proceed to analyse actual log content. As my knowledge of the origin of these sample logs is limited, I'll try to extract as much information about the nature of proxy deployment, user behaviour and their geographical location.
First, let's create a view of selected fields of interest (skipping the irrelevant ones).
$pattern = $[/shared/patterns/bluecoat-w3c-elff.sxp];
@src = PARSE(src:'file://data/chuvakin/bluecoat_proxy_big/Demo_log_*.log', pattern:$pattern);
@src
.filter(hdrLine is null)
.select( //select only the fields of interest
dateTime
,c_ip //client ip address
,sc_status //Protocol status code from appliance to client
,cs_method //Request method used from client to appliance
,cs_uri_scheme //Scheme from the 'log' URL
,cs_host //Hostname from the 'log' URL.
,cs_uri_path //Path from the 'log' URL. Does not include query.
,cs_uri_query //Query from the 'log' URL.
,cs_username //Relative username of a client authenticated to the proxy
,s_supplier_name //Hostname of the upstream host (unavailable for a cache hit)
,rs_content_type //Response header: Content-Type
,cs_useragent //Request header: User-Agent
,sc_filter_result //Content filtering result: Denied, Proxied or Observed
,sc_filter_category //Content filtering category of the request URL
,x_virus_id //Identifier of a virus if one was detected
,x_virus_details //Details of a virus if one was detected)
.save('/shared/demo/bluecoat/bcdata.sxt');
Reading raw data from the local hard disk, extracting and transforming (parsing) all 4 log files (int total 2,6Gb) takes about 10 seconds on my MacBook Pro with 8 CPU cores and 16 Gb RAM. Not too bad but I can get the speed even faster by saving parsed results and performing further queries on that. On saved results, even the most complex queries only take a few seconds. Note that if the dataset had included live updated data, I couldn't have taken this approach.
I have very little knowledge about the environment where the logs come from. I only know that they come from a lab (
the Honeynet project) and they contain production (i.e live) and simulated (artificially generated) traffic. However, I have no idea how this BlueCoat proxy was deployed, what sort of clients are there, etc. Hence let's see what I can deduct from the logs itself.
Sidenote: I'm mostly using the piped command style in the query examples. All of them could also be written in the SQL style.Traffic Dynamics, Origin and Destination Distributions
One of the first things I always check is how the traffic dynamics look like:
@src = @[/shared/demo/bluecoat/bcdata.sxt]; //declare variable @src referring to stored
//parsed data
@hourlyReq = @src //declare query using variable @hourlyReq
//data from @src gets piped
//to the next statement:
.select(
dateTime[1 hour] as hourTrunc, //truncate the timestamp to hour-precision,
//assign alias hourTrunc
count(*) as requests) //calculate counts, assign alias requests
.group(hourTrunc) //compute counts by hourTrunc groups
;
@hourlyReq //execute query
/*the query written in SQL style will produce the exact same result:
@hourlyRequests =
SELECT
dateTime[1 hour] as hourTrunc, //truncate timestamp to hour precision,
//assign alias hourTrunc
count(*) as requests) //calculate counts, assign alias requests
GROUP BY hourTrunc //compute counts by hourTrunc groups
;*/
There's approximately a month of records (from April 4'th to May 7'th 2005). I can observe three distinct periods: first ~10 days (5'th to 14'th April) with very little traffic (a few dozens of hourly requests), then a period of complete silence from 13'th to 27'th April and finally the last 10 days with rapidly increasing traffic.
Hmm ... It is very unlikely that such a traffic increase is caused by some malicious activity unless it's DDoS. The latter should show convergence to very few targets, so its easy to verify. More likely, my hypothesis is that the lab guys tested initial setup during the first 10 days, then went on to improve it (or party?) and after 2 weeks they launched the real thing. Ok, let's see what's the distribution of target sites:
@hourlyTargets = @src
.select(
hourTrunc:dateTime[1 hour], //truncate timestamp to hour precision,
//assign alias hourTrunc
cs_host,
count(*)) //compute unique hourTrunc and cs_host pairs
.group(@1, @2) //group by first and second position
//fields (hourTtrunc and cs_host)
.select(hourTrunc, targets:count(*)) /*compute the count of hosts
per hourTrunc*/
.group(@1)
.join(@hourlyReq on left.hourTrunc = right.hourTrunc)
.select(tHour:left.hourTrunc, targets, requests);
@hourlyTargets;
As we can see, the number of target hosts follows clearly the same pattern as the number of requests during "live" period. To be sure, we can also compute
the Pearson correlation coefficient between the two:
@hourlyTargets
.select(CORR(targets, requests)); //compute Pearson correlation coefficient
@hourlyTargets;
which gives us
0.98 - i.e we see a strong correlation between the number of requests and the number of different hosts. We can now safely rule out the DDoS hypothesis.
Next, let's have a look at clients. Their distribution should give us some clue of how the proxy is set up.
@hourlyClients = @src
.select(
hourTrunc:dateTime[1 hour], //truncate timestamp to hour-precision, assign alias hourTrunc
c_ip,
count(*)) //compute unique hourTrunc and c_ip pairs
.group(@1, @2)
.select(hourTrunc, clients:count(*)) //count unique clients per hourTrunc
.group(@1)
.join(@hourlyReq on left.hourTrunc = right.hourTrunc)
.select(tHour:left.hourTrunc, clients, requests);
@hourlyClients;
Here, the upper diagram displays the first period (area greyed on the lower diagram). The lower diagram shows the trend for the entire period:
We can see a constant chattering of a few dozens of requests per hour from a few clients. This is mixed with occasional bursts of requests from the same low number of clients. Considering that the number of targets remains low as well - perhaps this is simulated traffic. Since the overall number of requests and targets is low I'll look at the requests from this period:
@testPeriodTraffic = @src
.filter(dateTime > T(2005,04,08) and dateTime < T(2005,04,12))
.select(cs_host, cs_uri_path, count(*))
.group(@1,@2)
.sort(@3 desc)
.limit(10);
@testPeriodTraffic;
cs_host
| cs_uri_path
| count_star
|
www.ositis.com
| /UpdateLocation/ProxyAV/McafeeAntivirusEngineLocation
| 361 |
www.ositis.com | /UpdateLocation/ProxyAV/McafeeAntivirusPatternLocation
| 350 |
www.ositis.com
| /tests/testconnectivity.asp
| 171 |
10.2.13.200
| /index.html
| 117 |
10.2.13.200
| /favicon.ico
| 95 |
register.ositis.com
| /cgi/RegCGI.cgi
| 68 |
download.windowsupdate.com
| /v5a/windowsupdate/b/SelfUpdate/AU/x86/NetServer/en/wusetup.cab
| 37 |
10.2.13.200
| / | 33 |
download.windowsupdate.com
| /ws03sp1/windowsupdate/redir/wuredir.cab
| 23 |
download.bluecoat.com
| /release/ProxyAV/2.2.1/2000E/update-22.asp
| 22 |
This looks very much like some setup and testing related activities: updating McAfee AV engine, virus patterns, downloading and applying operating system updates, setting up BlueCoat, etc.
The number of clients during the "live" period follows the pattern of requests and targets distribution (requests are not shown for better visibility):
Let's compute the correlation to be sure:
@clientRequestCorrelation = @hourlyClients
.select(CORR(targets, requests)) //compute Pearson correlation coefficient
;
@clientRequestCorrelation;
We get
0.98 again. Considering the quite high number of both clients and target hosts this looks like real live traffic. This is also suggested by the period of the waves on the graph: in GMT timezone traffic starts increasing at 1pm and dies off at 3am.
IP Network Analysis
Next, let's look at the ip countries contribution to the hourly traffic:
@hourlyCC = @src
.select(ccIP:cc(c_ip),
dayTime:count(HOUR(dateTime)>8 and HOUR(dateTime)<16),
nightTime:count(HOUR(dateTime)<=8 or HOUR(dateTime)>=16))
.group(@1);
@hourlyCC;
The biggest contributor is an unknown country. Internal address space? Let's see the traffic from class A networks:
@top5Aclass = @src
.select(c_ip & 0xFF000000l as network, requests:count(*)) //compute count by /8 network
.group(@1)
.sort(@2 desc)
.select(*, percentage:(DOUBLE(requests)/$totalRequests)*100); //compute the percentage of total
.limit(5);
@top5Aclass;
network | requests | percentage |
45.0.0.0
| 7999532
| 98.38808745736779
|
192.0.0.0
| 27032
| 0.3324727971770806
|
81.0.0.0
| 19310
| 0.23749813974139636
|
10.0.0.0
| 19181
| 0.23591153901500383 |
213.0.0.0 | 15171 | 0.18659162496202614 |
It's weird that 98.4% of traffic originates from 45.0.0.0/8 range, which in 2005 belonged to the Interop Show Network - an annual trade fair for information technology organized by UBM (
https://en.wikipedia.org/wiki/Interop). Coincidentally, the NetWorld(SM)+Interop(R) 2005 conference was held May 1-6, 2005 at the Mandalay Bay Convention Center in Las Vegas (
https://www.businesswire.com/news/home/20041207005103/en/NetworldInterop-2005-Showcase-Largest-Assembly-Exhibitors-North). Did Honeynet guys set up their proxy to intercept conference traffic?
Requests graph @hourlyReq is computed with the timezone set to GMT, which gives us activity increase at nighttime. When changing timezone to PST (that's where Las Vegas happens to be), we observe activity starting to increase at 6-7 am and at 7 pm it has come down to a nightly level. There is also a distinct "saddle" around lunchtime. Looks like perfectly normal human behaviour to me.
Content Analysis
What does BlueCoat content filtering say? If it was a computer and information technology related event we should see Computers/Internet in the top5 of the content filtering category of the request URL:
@contentFilter = @src
.select(sc_filter_category, count(*)).group(@1).sort(@2 desc).limit(15);
@contentFilter;
sc_filter_category | count_star |
Computers/Internet
| 1445157
|
Web%20Advertisements
| 1429511
|
none
| 969769
|
Business/Economy
| 863626
|
News/Media
| 680855
|
Search%20Engines/Portals
| 381611
|
Email
| 259397
|
Travel
| 257823
|
Shopping
| 252155
|
Sports/Recreation/Hobbies
| 231239
|
Arts/Entertainment
| 141181
|
Brokerage/Trading
| 138826
|
Pornography
| 118691
|
Sure enough, we do. As well as Business, Economy, News and Email in prominent positions. Oh and porn on the 13'th place. In my opinion, porn is a good indicator of human behaviour - I have never seen a bot or trojan showing any interest in it. Let's have a look, what is the most popular time period for enjoying this sort of entertainment:
@hourlyPorn = @src
.select(
hourTrunc:dateTime[1 hour], //truncate timestamp to hour precision, assign alias hourTrunc
pornCnt:count(sc_filter_category = 'Pornography'),
total:count(*))
.group(@1);
@hourlyPorn;
Apparently, no porn during the test period (looks like the Honeynet guys were truly committed to their work in setting up lab environment). Porn traffic appears only during the "live" period and at PST daytime. Attendees bored of the conference? Let's see the related traffic from class A networks:
@AClassPornNetworks = @src
.filter(sc_filter_category = 'Pornography')
.select(c_ip & 0xFF000000l as network, requests:count(*)) //compute count by /8 network
.group(@1)
.sort(@2 desc);
@AClassPornNetworks;
network | requests |
45.0.0.0
| 117813
|
204.0.0.0
| 878
|
Yeah, 99% are from the conference. Still, we're only humans and this confirms the hypothesis of real live traffic.
Moving on, let's see the BlueCoat virus information:
@viruses = @src
.select(x_virus_id,
testPeriodCnt:count(dateTime < T(2005,04,27)),
livePeriodCnt:count(dateTime >= T(2005,04,27))
)
.group(@1).sort(@3 desc)
;
@viruses;
x_virus_id | testPeriodCnt | livePeriodCnt |
NULL | 11509
| 8118891
|
EICAR test file
| 11 | 81 |
W32/Sober.gen@MM
| 0 | 48 |
Exploit-MhtRedir.gen
| 0 | 8 |
BackDoor-CPT
| 0 | 5 |
Downloader-RK
| 0 | 5 |
Exploit-ByteVerify
| 0 | 4 |
JS/Exploit-MhtRedir.ldr
| 0 | 4 |
StartPage-DU.dll
| 0 | 4 |
Downloader-LG
| 0 | 2 |
Downloader-YH.dr
| 0 | 2 |
W32/Sober.p@MM!zip
| 0 | 2 |
Downloader-VG
| 0 | 1 |
Downloader-YK
| 0 | 1 |
Exploit-DcomRpc.gen
| 0 | 1 |
Exploit-HelpZonePass
| 0 | 1 |
Exploit-ObscuredHtml
| 0 | 1 |
FDoS-UDPFlood
| 0 | 1 |
Generic Downloader.g
| 0 | 1 |
JS/Noclose.gen
| 0 | 1 |
JV/g
| 0 | 1 |
Prutec
| 0 | 1 |
VBS/Inor
| 0 | 1 |
VBS/Psyme
| 0 | 1 |
Vundo
| 0 | 1 |
W32/Netsky.z@MM!zip
| 0 | 1 |
25 different viruses identified. They make 0,002% of all requests, and half of them are not real virus infections but testing (EICAR test file). Which sites are distributing this stuff?
@dropperSites = @src
.filter(x_virus_id is not null AND x_virus_id != 'EICAR test file')
.select(cs_host, count(*)).group(@1).sort(@2 desc);
@dropperSites;
37 sites. And techlabs.bluecoat.com among them. How come? They're supposed to be a decent vendor and not a dodgy place to catch viruses.
Let's have a closer look :
@bluecoatTechlabs = @src
.filter(x_virus_id is not null AND x_virus_id != 'EICAR test file')
.filter(cs_host = 'techlabs.bluecoat.com')
.select(dateTime, sc_status, cs_uri_path, x_virus_id, x_virus_details);
@bluecoatTechlabs;
dateTime | sc_status | cs_uri_path | x_virus_id | x_virus_details
|
2005-04-28 17:34:12,200 | 200 | /eicar/noisybear.cab | JV/g | Virus: JV/g; File: noisybear.cab; Sub File: \NoisyBear.class; Vendor: Network Associates, Inc (McAfee); Engine error code: None; Engine version: 10.21; Pattern version: 4.3.20.124865; Pattern date: 04/28/2005 04:40:00 |
noisybear.cab - creation of Mark LaDue proof-of-concept malicious Java program (http://www.cigital.com/hostile-applets). Displays a picture of a bear emitting annoying noise. Used for testing AV detection, similarly to eicar. We can safely exclude it from real dropper sites.
How many clients infected? With which viruses? When? The following query produces a list of client IPs with real infection attempts. In an enterprise environment setup, this would serve a good starting point for CSIRT team mitigation actions.
@infectedClients = @src
.filter(x_virus_id is not null AND sc_status = 200)
.filter(x_virus_id != 'EICAR test file')
.filter(cs_host != 'techlabs.bluecoat.com')
.select(
c_ip,
x_virus_id,
min_dateTime:min(dateTime)) //get the earliest client timestamp of infection
.group(@1, @2);
@infectedclients;
Conclusion
The logs consisted of web traffic records from April 4’th to May 7’th 2005. It included three distinct periods, that I identified as:
- the test period of initial setup (from 5’th to 14’th April)
- a period of complete silence (from 13’th to 27’th April) and
- the "live" period from April 27'th to May 7'th
The testing period is characterized by a relatively low number of requests (ten to a few hundred per hour) to a few selected sites, related to setting up the BlueCoat and antivirus detection engine, updating operating system (Win XP SP1).
The silence period could mean the lab folks spending their time on setting up a surrounding environment for the live period.
Finally, the "live" period is characterized with a rapidly increasing amount of traffic, peaking on 5 May at 260 000 requests per hour and then gradually decreasing. The number of requests, originating clients and target hosts is high enough to make the artificial generation very complicated and unlikely. The distribution of originating clients and target hosts very closely follows the traffic activity patterns.
The activity patterns seem to match normal daily human activity: rapid increase from early morning hours, decreasing in the afternoon with distinct lower "saddle" indicating a lunch break between them. The timezone matching this pattern will place the originating activity to the US west coast (PST). Another aspect indicating human nature in origin of traffic is the presence of adult sites visits.
Network analysis suggests strongly that the traffic is originating from participants of NetWorld(SM)+Interop(R) conference held May 1 - 6 at the Mandalay Bay Convention Center in Las Vegas.
Observed virus infections were relatively low: only 0,002% of traffic relates to known viruses. Half of that relates to testing antivirus engines. From 3499 clients 49 were performing real virus downloads (i.e infection attempts).
You can download the whole script from here:
https://docs.spectx.com/_downloads/bluecoat.zip Logs are available for download from here:
http://log-sharing.dreamhosters.com/bluecoat_proxy_big.zip
Back to articles