By Liisa Tallinn
Playing around with Git logs can give you fascinating insights into a project. Whether you're analysing your own work or checking how someone else's project is doing, Git logs have it all. Sometimes all you need is a quick general overview, e.g. counting contributions per author. The next minute, you really need to dig into a specific time period (Fridays!) or look at a keyword or an email domain. Zooming out and digging deep into Git repos is a walk in the park with the right tool - follow our lead.
As an example, we’re going to play around with the Linux kernel source tree, mostly because of its size (~700 MB) and the impressive number of commits. Parsing these logs creates a dataset of more than 800 000 rows, so there are plenty of records to analyse. The repo was downloaded and the logs generated on 1 March 2019. The data stretches back almost 14 years. When sorting the records by commit timestamp, the first commit by Torvalds dates back to Saturday, April 16, 2005. As the number of lines added in that commit is more than 6.7 million, it's "probably" not the birth date of Linux but rather its first date with Git.
git log --pretty=fuller --shortstat > logfile.log
With SpectX, you can quickly analyse any unstructured data in its raw form, and the glorious multiline gibberish produced by that command is no exception. For sure, there are plenty of options to generate a compact one-line Git log, but because parsing multiple lines with SpectX is relatively easy, we were OK with the fuller option. Grab the free edition of SpectX to copy-paste and run these queries against your own repo. If needed, SpectX allows you to stay offline: download and install SpectX on your machine and run queries on the raw log file, with no need to ingest or import anything into the cloud. See more on installing and getting started in the SpectX docs.
commit f6163d67cc31b8f2a946c4df82be3c6dd918412d
Merge: 2137397c92ae 0358affb5cd8
Author: Linus Torvalds <torvalds@linux-foundation.org>
AuthorDate: Wed Feb 20 14:14:31 2019 -0800
Commit: Linus Torvalds <torvalds@linux-foundation.org>
CommitDate: Wed Feb 20 14:14:31 2019 -0800
Merge tag 'docs-5.0-fix' of git://git.lwn.net/linux
Pull documentation fix from Jonathan Corbet:
"A single patch from Arnd bringing some top-level docs into the 5.0
era"
* tag 'docs-5.0-fix' of git://git.lwn.net/linux:
Documentation: change linux-4.x references to 5.x
It's possible you'll need to tune the pattern for your own Git logs, e.g. if your timestamps are formatted differently (see the SpectX docs on parsing timestamps). When done with the pattern, run the first query: parse the log file with the pattern and select the fields you're interested in. Copy the full pattern + query here.
//a fixed string 'commit ' followed by 40 characters in the range of a-z and 0-9. We'll name the field 'commit'. EOL is end of line.
'commit ' [a-z0-9]{40}:commit EOL
//similarly, fixed strings 'Merge: ' and 'Author: ', then the email between ' <' and '>'. 'LD' matches everything else on that line. Asterisk for matching empty lines.
('Merge: ' LD:merge EOL)?
'Author: ' LD*:authorName ' <' LD*:authorEmail '>' EOL
//the date. Parsing out timestamps as well as leaving timestamps as strings to play with author's local time later.
'AuthorDate: ' (TIMESTAMP('EEE MMM d H:mm:ss YYYY Z'):authorTime):auhtorTimeStr EOL
//a fixed string 'Commit ' everything between this and ' <' is matched by 'LD*'. We'll name this field commitName. The same with commitEmail and commitTime.
'Commit: ' LD*:commitName ' <' LD*:commitEmail '>' EOL
'CommitDate: ' (TIMESTAMP('EEE MMM d H:mm:ss YYYY Z'):commitTime):commitTimeStr EOL
EOL
//finally, up to 500k bytes of data - commitInfo
DATA{0,500000}:commitInfo ((EOL >>('commit ' [a-z0-9]{40}:commit EOL)) | EOF)
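If you'd like to sanity-check the structure of the pattern outside SpectX, the same idea can be sketched with a regular expression. This is plain Python rather than SpectX, and only a rough equivalent: the names `RECORD`, `TIME_FORMAT`, `parse_git_log` and `SAMPLE` are our own, the second sample commit (hash, author, message) is made up, and the assumption is that the log looks like the excerpt above, with optional padding after the field labels.

```python
import re
from datetime import datetime

# Matches the date layout shown above, e.g. "Wed Feb 20 14:14:31 2019 -0800".
TIME_FORMAT = "%a %b %d %H:%M:%S %Y %z"

# One record per "commit <40 chars>" header, loosely mirroring the SpectX pattern.
# [ ]+ after each label tolerates padding that git may add for alignment.
RECORD = re.compile(
    r"^commit (?P<commit>[a-z0-9]{40})\n"
    r"(?:Merge:[ ]+(?P<merge>[^\n]*)\n)?"
    r"Author:[ ]+(?P<authorName>[^\n<]*?) <(?P<authorEmail>[^\n>]*)>\n"
    r"AuthorDate:[ ]+(?P<authorTimeStr>[^\n]*)\n"
    r"Commit:[ ]+(?P<commitName>[^\n<]*?) <(?P<commitEmail>[^\n>]*)>\n"
    r"CommitDate:[ ]+(?P<commitTimeStr>[^\n]*)\n"
    # Everything up to the next commit header (or end of input) is commitInfo.
    r"(?P<commitInfo>.*?)(?=^commit [a-z0-9]{40}\n|\Z)",
    re.MULTILINE | re.DOTALL,
)


def parse_git_log(text):
    """Return one dict per commit, with timestamps parsed and kept as strings."""
    records = []
    for match in RECORD.finditer(text):
        rec = match.groupdict()
        rec["authorTime"] = datetime.strptime(rec["authorTimeStr"], TIME_FORMAT)
        rec["commitTime"] = datetime.strptime(rec["commitTimeStr"], TIME_FORMAT)
        rec["commitInfo"] = rec["commitInfo"].strip()
        records.append(rec)
    return records


# First record: taken from the excerpt above. Second record: hypothetical.
SAMPLE = (
    "commit f6163d67cc31b8f2a946c4df82be3c6dd918412d\n"
    "Merge: 2137397c92ae 0358affb5cd8\n"
    "Author: Linus Torvalds <torvalds@linux-foundation.org>\n"
    "AuthorDate: Wed Feb 20 14:14:31 2019 -0800\n"
    "Commit: Linus Torvalds <torvalds@linux-foundation.org>\n"
    "CommitDate: Wed Feb 20 14:14:31 2019 -0800\n"
    "\n"
    "    Merge tag 'docs-5.0-fix' of git://git.lwn.net/linux\n"
    "\n"
    "commit 0123456789abcdef0123456789abcdef01234567\n"
    "Author:     Jane Doe <jane@example.com>\n"
    "AuthorDate: Fri Mar 1 09:00:00 2019 +0200\n"
    "Commit:     Jane Doe <jane@example.com>\n"
    "CommitDate: Fri Mar 1 09:05:00 2019 +0200\n"
    "\n"
    "    Fix a typo\n"
    "\n"
    " 1 file changed, 2 insertions(+), 1 deletion(-)\n"
)

if __name__ == "__main__":
    for rec in parse_git_log(SAMPLE):
        print(rec["commit"][:12], rec["authorName"], rec["authorTime"].isoformat())
```

Unlike the SpectX pattern, this sketch reads the whole log into memory, so treat it as a way to understand the record layout rather than a tool for an 800 000-row log.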
Once you're happy with the parser and the initial query, it's time to run some more detailed queries and dig into the essence of the project.
@gitlog
.select(_unmatched, *) //add the _unmatched column to your results
.filter(_unmatched is not NULL) //keep only the records that contain unmatched bytes
Copy the full pattern + query here. The reason we're filtering out merges is to look at the "true" authors of the code. Merges overwrite the author field with the person performing the merge.
@gitlog
.filter(merge is NULL) //let's look at only non-merge commits
.select(authorName, count(*)) //select the author-field and count the results
.group(authorName) //aggregate authors
.sort(count desc) //sort the results based on count in a descending order
.limit(10) //limit the result to 10 rows
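For comparison, the same filter-count-sort-limit chain can be sketched in a few lines of plain Python. This is not SpectX; it assumes the commits are already available as dicts with `authorName` and `merge` keys (named after the fields in the pattern), and the function name `top_authors` and the sample names are made up for illustration.

```python
from collections import Counter


def top_authors(records, n=10):
    """Count non-merge commits per author and return the n busiest authors."""
    counts = Counter(
        rec["authorName"] for rec in records if rec.get("merge") is None
    )
    return counts.most_common(n)


# Hypothetical records, just enough to show the shape of the result.
records = [
    {"authorName": "Alice", "merge": None},
    {"authorName": "Bob", "merge": None},
    {"authorName": "Alice", "merge": None},
    {"authorName": "Bob", "merge": "2137397c92ae 0358affb5cd8"},  # merge, skipped
]

if __name__ == "__main__":
    for name, count in top_authors(records):
        print(name, count)
```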
The result - this is how they've been rolling. The number of commits per author per year. Quite a sprint there from H Hartley Sweeten back in 2012-2014.
@gitlog
.select(authorName, year(authorTime), *) //select authors and time in annual intervals
.select(authorTime //count the occurrence of top 5 authors (from the previous query)
,Viro:count(authorName = 'Al Viro')
,Sweeten:count(authorName = 'H Hartley Sweeten')
,Chehab:count(authorName = 'Mauro Carvalho Chehab')
,Iwai:count(authorName = 'Takashi Iwai')
,Hellwig:count(authorName = 'Christoph Hellwig')
)
.group(year) //aggregate time
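The per-year breakdown boils down to a small pivot: one row per year, one counter per author. Here is a minimal plain-Python sketch of that idea (again, not SpectX). It assumes each record carries an `authorName` and an `authorTime` that is already a `datetime`; `commits_per_year` and the sample data are hypothetical.

```python
from collections import Counter, defaultdict
from datetime import datetime


def commits_per_year(records, authors):
    """Return {year: {author: commit count}} for the selected authors only."""
    table = defaultdict(Counter)
    for rec in records:
        if rec["authorName"] in authors:
            table[rec["authorTime"].year][rec["authorName"]] += 1
    # Sort by year so the output reads like a timeline.
    return {year: dict(table[year]) for year in sorted(table)}


# Hypothetical records for illustration.
records = [
    {"authorName": "Alice", "authorTime": datetime(2012, 5, 1)},
    {"authorName": "Alice", "authorTime": datetime(2012, 6, 1)},
    {"authorName": "Bob", "authorTime": datetime(2013, 1, 15)},
    {"authorName": "Carol", "authorTime": datetime(2013, 2, 1)},  # not selected
]

if __name__ == "__main__":
    for year, counts in commits_per_year(records, {"Alice", "Bob"}).items():
        print(year, counts)
```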
These are the results - authors lined up by the number of lines they've inserted since 2005. Torvalds, unsurprisingly, has contributed the most lines - more than 6.7 million since 2005 (basically, he had already beaten everyone with the number of lines in the very first commit of this repo).
@gitlog
.select(authorName, insertSum:INT(sum(insertions)), deleteSum:INT(sum(deletions))) //select authors, lines added and deleted cast into integers
.group(authorName) //aggregate unique authors
.sort(insertSum desc) //sort based on number of inserts
// .sort(deleteSum desc) //sort based on number of deletions
.limit(10) //limit the result to 10 rows
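The `insertions` and `deletions` fields come from the summary line that `--shortstat` appends to each commit, e.g. ` 3 files changed, 10 insertions(+), 2 deletions(-)`. As a rough illustration in plain Python (not the SpectX pattern itself), the sketch below pulls the two numbers out of the last line of `commitInfo` and sums them per author. It assumes that git leaves out the insertions or deletions part when the count is zero; the helper names and sample records are made up.

```python
import re
from collections import defaultdict

INSERTIONS = re.compile(r"(\d+) insertions?\(\+\)")
DELETIONS = re.compile(r"(\d+) deletions?\(-\)")


def shortstat(commit_info):
    """Return (insertions, deletions) from the last line of a commit record."""
    lines = commit_info.strip().splitlines()
    last = lines[-1] if lines else ""
    ins = INSERTIONS.search(last)
    dels = DELETIONS.search(last)
    # A missing part (or no shortstat line at all, as with merges) counts as 0.
    return (int(ins.group(1)) if ins else 0, int(dels.group(1)) if dels else 0)


def lines_per_author(records):
    """Sum insertions and deletions per author, sorted by insertions descending."""
    totals = defaultdict(lambda: [0, 0])
    for rec in records:
        added, removed = shortstat(rec["commitInfo"])
        totals[rec["authorName"]][0] += added
        totals[rec["authorName"]][1] += removed
    rows = [(name, added, removed) for name, (added, removed) in totals.items()]
    return sorted(rows, key=lambda row: row[1], reverse=True)


# Hypothetical records for illustration.
records = [
    {"authorName": "Alice",
     "commitInfo": "    Add driver\n\n 3 files changed, 10 insertions(+), 2 deletions(-)"},
    {"authorName": "Bob",
     "commitInfo": "    Fix typo\n\n 1 file changed, 1 insertion(+)"},
    {"authorName": "Alice",
     "commitInfo": "    Remove dead code\n\n 2 files changed, 5 deletions(-)"},
]

if __name__ == "__main__":
    for name, added, removed in lines_per_author(records):
        print(name, added, removed)
```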
The result - Intel and Red Hat are working hard. Especially Intel, if you take a closer look and add up the two Intel domains that made it to this chart: intel.com and linux.intel.com.
@gitlog
//parse the authorEmail field. Skip everything until the '@' sign and name the rest 'domain'. EOS stands for end of string (EOF, by contrast, is end of file)
.select(PARSE("LD '@' LD:domain EOS", authorEmail),*)
.filter(domain not like '%gmail%') //skip gmail addresses
.select(domain, count:count(*)) //select domain and count everything
.group(@1) //group everything based on the first field (i.e. domain)
.sort(count DESC) //sort count in descending order
.limit(10) //limit the result to 10 rows
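The domain query is essentially "split on '@', drop Gmail, count". A minimal plain-Python sketch of that logic follows (not SpectX). The function name `top_domains`, its `skip` parameter and the sample addresses are our own; one small difference is that it splits on the last '@' in the address, which may not match the SpectX pattern for odd addresses.

```python
from collections import Counter


def top_domains(emails, n=10, skip=("gmail",)):
    """Count author email domains, ignoring domains containing a skip word."""
    counts = Counter()
    for email in emails:
        _, separator, domain = email.rpartition("@")
        if not separator:
            continue  # no '@' at all, nothing to count
        domain = domain.lower()
        if any(word in domain for word in skip):
            continue
        counts[domain] += 1
    return counts.most_common(n)


# Hypothetical addresses for illustration.
emails = [
    "a@intel.com",
    "b@linux.intel.com",
    "c@gmail.com",
    "d@Intel.com",
    "not-an-email",
]

if __name__ == "__main__":
    for domain, count in top_domains(emails):
        print(domain, count)
```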
The result arrives in 3 seconds. Top 10 of identical commit messages to the Linux kernel repo.
@gitlog
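.select(commitInfo, count(*) as cnt) //select the commit message and count the records
.group(commitInfo) //aggregate identical commit messages
.sort(cnt desc) //sort by count in descending order
.limit(10) //limit the result to 10 rows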
The result - Torvalds' commit count split by weekday. Conclusion: if it's Friday and you really need an excuse to commit, Linus has done most of his Linux kernel commits since 2005 on a Friday.
@gitlog
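//split the author's local time string into its parts and keep them in the author_time field
.select(author_time:parse("LD:day_of_week ' ' LD:month ' ' INT:day ' ' INT:hour ':' INT:minute ':' INT:second ' ' INT:year ' ' LD:timezone EOF", auhtorTimeStr), *)
.filter(authorName like 'Linus Torvalds') //only commits authored by Torvalds
.select(author_time[day_of_week], count(*) as count) //pick the weekday and count the commits
.group(day_of_week) //aggregate by weekday
.sort(count desc); //sort by count in descending order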
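Why bother with the timestamp string instead of the parsed timestamp? Because the string keeps the author's local time, and a late Friday evening in California is already Saturday in UTC. The plain-Python sketch below (not SpectX) illustrates the difference, assuming the date layout from the log excerpt; `commits_by_weekday` and the sample timestamps other than the first are made up.

```python
from collections import Counter
from datetime import datetime, timezone

# Same layout as in the log, e.g. "Wed Feb 20 14:14:31 2019 -0800".
TIME_FORMAT = "%a %b %d %H:%M:%S %Y %z"


def commits_by_weekday(time_strings):
    """Count commits per weekday in each author's own UTC offset."""
    counts = Counter()
    for text in time_strings:
        local = datetime.strptime(text, TIME_FORMAT)  # keeps the original offset
        counts[local.strftime("%a")] += 1
    return counts.most_common()


# The first timestamp is from the excerpt above, the other two are hypothetical.
times = [
    "Wed Feb 20 14:14:31 2019 -0800",
    "Fri Mar 1 09:00:00 2019 +0200",
    "Fri Feb 22 23:30:00 2019 -0800",
]

if __name__ == "__main__":
    print(commits_by_weekday(times))
    late = datetime.strptime(times[2], TIME_FORMAT)
    # Same instant, two different weekdays depending on the clock you read.
    print(late.strftime("%a"), late.astimezone(timezone.utc).strftime("%a"))
```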