OSX software discovery via Homebrew
Many operating systems have a package manager, which facilitates the installation of software and its dependencies. This post describes exploiting Homebrew, an OSX package manager, to discover interesting software by means of a Hacker News-inspired popularity ranking.
How Homebrew works
First, get Homebrew. I’ve already installed it, and this is my version number:
brew -v
Homebrew 0.9.5 (git revision ddda; last commit 2015-08-26)
For demonstration purposes, let’s install a clever little program called tree, which generates a graphical representation of a directory and subdirectories.
brew install tree
==> Downloading http://mama.indstate.edu/users/ice/tree/src/tree-1.7.0.tgz
######################################################################## 100.0%
==> make prefix=/usr/local/Cellar/tree/1.7.0 MANDIR=/usr/local/Cellar/tree/1.7.0/share/man/man1 CC
🍺 /usr/local/Cellar/tree/1.7.0: 7 files, 128K, built in 3 seconds
Try it out by creating a few test directories and running tree:
mkdir -p test/{a,b,c,d}/{d,e}
touch test/a/foo test/d/bar
tree test
This should give the result:
test
├── a
│   ├── d
│   ├── e
│   └── foo
├── b
│   ├── d
│   └── e
├── c
│   ├── d
│   └── e
└── d
    ├── bar
    ├── d
    └── e
To see how Homebrew knew how to install tree, call brew info:
brew info tree
tree: stable 1.7.0
Display directories as trees (with optional color/HTML output)
http://mama.indstate.edu/users/ice/tree/
/usr/local/Cellar/tree/1.7.0 (7 files, 128K) *
  Built from source
From: https://github.com/Homebrew/homebrew/blob/master/Library/Formula/tree.rb
Notice that it lists a GitHub URL. Visiting tree.rb reveals a recipe for downloading and building the software.
In fact, every Homebrew package has a corresponding formula in the Homebrew git repo. To explore further, let’s clone it:
git clone https://github.com/Homebrew/homebrew
Next, examine the change log for that particular formula:
cd homebrew
git log --pretty=format:'%cr - %s' Library/Formula/tree.rb
We can see that the formula was updated 25 times since it was added about 6 years ago.
- 3 weeks ago - Formula files style updates.
- 3 months ago - Add descriptions to all remaining homebrew packages
- 8 months ago - tree: switch mirror
- 11 months ago - Remove debugging code again
- 11 months ago - Delay requiring irb until runtime
- 12 months ago - Remove debugging code :shame:
- 12 months ago - Ensure log file is closed
- 12 months ago - tree: add mirror
- 1 year, 4 months ago - Fix tree
- 1 year, 4 months ago - tree: update 1.7.0 bottle.
- 1 year, 4 months ago - tree 1.7.0
- 1 year, 6 months ago - tree: add bottle.
- 1 year, 7 months ago - tree: a test
- 1 year, 9 months ago - Remove uses of -no-cpp-precomp
- 3 years ago - Batch convert MD5 formula to SHA1.
- 3 years, 7 months ago - tree: avoid inreplace
- 4 years, 1 month ago - tree 1.6.0
- 4 years, 4 months ago - tree: update download url
- 4 years, 6 months ago - Use ruby style for inheritance.
- 5 years ago - Update formulae to use ENV.cflags
- 5 years ago - Update formulae for version 0.7
- 5 years ago - Update tree to 1.5.3
- 6 years ago - s/require ‘brewkit’/require ‘formula’/g
- 6 years ago - ENV.cc; returns the compiler we use
- 6 years ago - Don’t hardcode ‘gcc’ in manual formulas.
- 6 years ago - Tree formula
Let’s compare this to the git formula. The following command just counts the commits rather than listing them all.
git rev-list HEAD --count Library/Formula/git.rb
292
As may be expected, the popular git formula has more than 10 times as many commits as tree (292 versus 26).
Collecting formula meta-data
Let’s see how many formulae there are right now:
ls -1 Library/Formula/*.rb | wc -l
3195
For each formula, the commit history contains two key data points:
- Number of commits: a possible indication of hotness
- Date added to Homebrew: a possible indication of newness
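Both data points can be read off the commit timestamps of a single formula file. The following is only a minimal sketch, not the script used in this post: it assumes the input is the text printed by git log --format=%at -- Library/Formula/NAME.rb (one UNIX timestamp per line), and the helper name summarize_history is made up for illustration.

```ruby
# Summarize one formula's history from its commit timestamps.
# `log_output` is assumed to be the text printed by
#   git log --format=%at -- Library/Formula/<name>.rb
# i.e. one UNIX timestamp per line. The helper name is hypothetical.
def summarize_history(log_output)
  timestamps = log_output.split.map { |line| Integer(line) }
  return nil if timestamps.empty?

  {
    count: timestamps.size, # number of commits ("hotness")
    added: timestamps.min   # earliest commit ("newness")
  }
end

# Made-up sample input: three commits, newest first
sample = "1440000000\n1430000000\n1244139679\n"
p summarize_history(sample)
```

Taking the minimum rather than the last line avoids relying on the order in which git prints the commits.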
After some fiddling with shell scripts, I settled on a Ruby script that collects the above data in a local MongoDB database. This requires the relevant brew package and a Ruby gem.
brew install mongodb
gem install mongo
Refer to the mongo gem documentation on how Ruby is used to control Mongo. The following script is called collect:
#!/usr/bin/env ruby
require 'logger'
require 'shellwords'
require 'mongo'

# Change modes (kinds of changes that can happen to a git-tracked file)
MODE_ADD = 'A'
MODE_MODIFY = 'M'
MODE_DELETE = 'D'

# This script requires an argument
if ARGV.empty?
  puts "Usage: #{File.basename($0)} /path/to/homebrew"
  exit 1
end

# Connect to mongo
mongo = Mongo::Client.new('mongodb://127.0.0.1:27017/ferment')
Mongo::Logger.logger = ::Logger.new('mongo.log')
Mongo::Logger.logger.level = ::Logger::INFO

# Enter Homebrew git directory
dir = ARGV[0]
Dir.chdir(dir)

# Pull latest changes from Git
# system "git pull"

# Create an index on revisions
mongo[:revisions].indexes.create_one({ :hash => 1 }, :unique => true)

# Get all revisions, oldest first
hashes = `git rev-list HEAD`.split(/\s+/).reverse

# Report number of hashes found
puts "Found #{hashes.size} hashes."

# Iterate through hashes of all commits
hashes.each do |hash|
  # Skip this hash if we've seen it before
  next if mongo[:revisions].find(hash: hash).count > 0

  # Extract detailed changes for this commit
  changes = `git diff-tree --no-commit-id --name-status -r #{hash.shellescape}`.split("\n")

  # Extract unix timestamp
  timestamp = `git show -s --format='%at' #{hash.shellescape}`.chomp.to_i

  changes.each do |change|
    # Extract mode and name from change string
    # e.g. "M Library/Formula/harfbuzz.rb"
    # Note: \w+ does not match '-' or '+', so formulae with such names are skipped
    next unless change =~ /\A([A-Z]+)\s+Library\/Formula\/(\w+)\.rb\z/

    # Mode is "M" in the example above
    mode = $1
    # Name is "harfbuzz" in the example above
    name = $2

    case mode
    when MODE_ADD
      # Insert a new formula
      mongo[:formulae].insert_one({
        name: name,
        count: 1,
        added: timestamp,
        score: 0
      })
    when MODE_DELETE
      # Delete a formula
      mongo[:formulae].find(:name => name).find_one_and_delete
    when MODE_MODIFY
      # Modify a formula
      mongo[:formulae].find(:name => name).update_one("$inc" => { count: 1 })
    else
      puts "Unsupported mode '#{mode}' for formula '#{name}'"
    end
  end

  # Store this revision
  revision = {
    hash: hash,
    changes: changes,
    timestamp: timestamp
  }

  # This revision has been digested
  mongo[:revisions].insert_one(revision)
end
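The heart of the loop is the pattern match on each line of git diff-tree --name-status output, and it can be tried on its own without git or MongoDB. The sketch below uses a hypothetical helper, parse_change. Its name pattern is an assumption and is deliberately broader than the \w+ in collect, which cannot match a hyphen or a plus sign and therefore skips names such as youtube-dl. It is one possible alternative, not what produced the numbers in this post.

```ruby
# Parse one line of `git diff-tree --name-status` output.
# Returns [mode, name] for formula files and nil for anything else.
# Assumption: the character class also accepts '-', '+', '.' and '@',
# which the \w+ in the collect script does not.
FORMULA_CHANGE = %r{\A([A-Z])\s+Library/Formula/([\w+.@-]+)\.rb\z}

# Hypothetical helper, not part of the collect script
def parse_change(line)
  match = FORMULA_CHANGE.match(line.chomp)
  match && [match[1], match[2]]
end

p parse_change("M\tLibrary/Formula/harfbuzz.rb")
p parse_change("A\tLibrary/Formula/youtube-dl.rb")
p parse_change("M\tREADME.md")
```

Anchoring the pattern at both ends keeps paths that merely contain Library/Formula/ somewhere in the middle from being counted.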
You can execute it by supplying, as a parameter, a path to the cloned Homebrew repo:
./collect ~/src/homebrew/
After a few minutes, this captures all desired meta-data on formulae. At any time, you can see what it’s doing by running:
mongoexport -h 127.0.0.1 -d ferment -c formulae --limit=10
You can see some stats like count, which represents the number of commits, and added, which represents the UNIX timestamp of when the formula was added to Homebrew.
{"_id":{"$oid":"55df987c89bbee0775000045"},"added":1244139679,"count":22,"name":"ack","score":0}
{"_id":{"$oid":"55df987c89bbee0775000046"},"added":1244139679,"count":14,"name":"asciidoc","score":0}
{"_id":{"$oid":"55df987c89bbee0775000047"},"added":1244139679,"count":36,"name":"boost","score":0}
{"_id":{"$oid":"55df987c89bbee0775000048"},"added":1244139679,"count":30,"name":"cmake","score":0}
{"_id":{"$oid":"55df987c89bbee0775000049"},"added":1244139679,"count":21,"name":"dmd","score":0}
{"_id":{"$oid":"55df987c89bbee077500004a"},"added":1244139679,"count":10,"name":"fftw","score":0}
{"_id":{"$oid":"55df987c89bbee077500004b"},"added":1244139679,"count":99,"name":"git","score":0}
{"_id":{"$oid":"55df987c89bbee077500004c"},"added":1244139679,"count":12,"name":"grc","score":0}
{"_id":{"$oid":"55df987c89bbee077500004d"},"added":1244139679,"count":11,"name":"lame","score":0}
{"_id":{"$oid":"55df987c89bbee077500004e"},"added":1244139679,"count":11,"name":"liblastfm","score":0}
2015-08-27T19:12:18.905-0400 exported 10 records
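Each exported line is a standalone JSON document, so it is easy to post-process. As a small sketch (the helper name describe is made up, and the sample line is copied from the output above), this turns the added timestamp into a readable UTC date:

```ruby
require 'json'

# One line of mongoexport output, copied from the listing above
line = '{"_id":{"$oid":"55df987c89bbee077500004b"},"added":1244139679,"count":99,"name":"git","score":0}'

# Hypothetical helper: summarize one exported formula document
def describe(line)
  doc = JSON.parse(line)
  added = Time.at(doc["added"]).utc # UNIX timestamp to UTC time
  "#{doc['name']}: #{doc['count']} commits, added #{added.strftime('%Y-%m-%d')}"
end

puts describe(line)
```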
Ranking formulae by newness and hotness
According to a thread on Hacker News, their home page ranking calculation is simply:
(p - 1) / (t + 2)^1.5
…where p is “points awarded” and t is “age of the post” in hours. In other words, the ranking increases linearly with points awarded and decays as a power law with age: the denominator grows as the 1.5 power of the age in hours.
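To get a feel for the formula, here is a minimal Ruby sketch of it exactly as quoted above. The function name hn_score and the sample values are made up for illustration.

```ruby
# score = (p - 1) / (t + 2)^1.5, with t in hours
# hn_score is a hypothetical name for this sketch.
def hn_score(points, age_hours)
  (points - 1) / ((age_hours + 2.0) ** 1.5)
end

# Made-up sample posts: [points, age in hours]
[[11, 2], [11, 14], [101, 14]].each do |points, hours|
  puts format("points=%-3d age=%2dh score=%.4f", points, hours, hn_score(points, hours))
end
```

With these inputs, a post with 11 points drops from 1.25 to about 0.156 between hour 2 and hour 14, and it takes 101 points at hour 14 to score higher than the 11-point post did at hour 2.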
The following script, called rank, does something similar with Homebrew formulae, but counts a commit as a point and decays in months (rather than hours).
#!/usr/bin/env ruby
require 'logger'
require 'mongo'

# Connect to mongo
mongo = Mongo::Client.new('mongodb://127.0.0.1:27017/ferment')
Mongo::Logger.logger = ::Logger.new('mongo.log')
Mongo::Logger.logger.level = ::Logger::INFO

# Create an index on the formula's score
mongo[:formulae].indexes.create_one({ :score => 1 })

# Examine all formulae
formulae = mongo[:formulae]
now = Time.now.to_i
day = 60 * 60 * 24
month = day * 30

formulae.find.each do |formula|
  # Age in months (a month is taken to be 30 days)
  age = (now - formula["added"]) / month.to_f
  # One commit counts as one point, minus the commit that added the formula
  score = (formula["count"] - 1) / ((age + 2) ** 1.5)
  formulae.find(:_id => formula["_id"]).update_one("$set" => { :score => score })
end

# Print the 100 highest-scoring formulae
top = formulae.find.sort(:score => -1).limit(100)
top.each do |formula|
  age = (now - formula["added"]) / day
  normalized_score = (formula['score'] * 1000).to_i
  puts "* #{formula['name']} (days_old=#{age}, commits=#{formula['count']}, score=#{normalized_score})"
end
Subjectively, using “months” as the measure of a formula’s age seemed right. I experimented to find a time interval that maximized decay speed without losing stalwarts like vim, node and git in the top 100.
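The effect of the time unit can be illustrated with a small sketch that is independent of MongoDB. It applies the same formula as rank but takes the length of one age unit as a parameter. The unit parameter and the score helper are assumptions made for this illustration, and the two sample formulae use the commit counts and ages reported in this post's results.

```ruby
DAY = 60 * 60 * 24

# Same formula as the rank script, with the length of one age unit
# (in seconds) passed in. `unit` is a hypothetical knob for this sketch.
def score(commits, age_days, unit)
  age = age_days * DAY / unit.to_f
  (commits - 1) / ((age + 2) ** 1.5)
end

# name => [commits, days_old], taken from the results in this post
formulae = { "planck" => [10, 24], "git" => [292, 2275] }
units = { "week" => 7 * DAY, "month" => 30 * DAY, "year" => 365 * DAY }

units.each do |label, unit|
  ranked = formulae.sort_by { |_name, (commits, days)| -score(commits, days, unit) }
  puts "#{label}: #{ranked.map(&:first).join(' > ')}"
end
```

Shorter units make age dominate, so the young planck formula outranks git when age is measured in weeks or months. Measured in years, the long commit history of git wins instead.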
./rank
And here’s the result:
- iojs (days_old=215, commits=58, score=2052)
- thefuck (days_old=127, commits=32, score=1980)
- planck (days_old=24, commits=10, score=1914)
- syncthing (days_old=443, commits=116, score=1674)
- embulk (days_old=117, commits=18, score=1179)
- tutum (days_old=311, commits=51, score=1147)
- pushpin (days_old=182, commits=27, score=1129)
- flow (days_old=282, commits=44, score=1116)
- carthage (days_old=251, commits=38, score=1107)
- influxdb (days_old=660, commits=116, score=977)
- mycli (days_old=32, commits=6, score=930)
- ford (days_old=56, commits=8, score=918)
- osquery (days_old=302, commits=39, score=905)
- awscli (days_old=538, commits=80, score=887)
- telegraf (days_old=70, commits=9, score=880)
- nghttp2 (days_old=198, commits=23, score=867)
- fpp (days_old=116, commits=13, score=843)
- packer (days_old=106, commits=12, score=838)
- h2o (days_old=226, commits=25, score=811)
- ansible (days_old=724, commits=102, score=755)
- docker (days_old=566, commits=72, score=744)
- commonmark (days_old=229, commits=23, score=735)
- ponyc (days_old=104, commits=10, score=700)
- creduce (days_old=130, commits=12, score=689)
- gcc (days_old=499, commits=56, score=682)
- sslmate (days_old=304, commits=29, score=662)
- bitrise (days_old=22, commits=4, score=659)
- passenger (days_old=798, commits=100, score=646)
- vegeta (days_old=199, commits=17, score=630)
- zurl (days_old=190, commits=16, score=620)
- hayai (days_old=44, commits=5, score=619)
- libressl (days_old=412, commits=39, score=608)
- emscripten (days_old=499, commits=50, score=608)
- terraform (days_old=353, commits=32, score=606)
- galen (days_old=217, commits=18, score=604)
- norm (days_old=64, commits=6, score=594)
- boot2docker (days_old=564, commits=57, score=589)
- gauge (days_old=317, commits=27, score=582)
- fleetctl (days_old=477, commits=45, score=580)
- softhsm (days_old=83, commits=7, score=571)
- vim (days_old=1041, commits=125, score=557)
- juju (days_old=756, commits=80, score=556)
- duck (days_old=233, commits=18, score=555)
- anjuta (days_old=32, commits=4, score=552)
- fig (days_old=452, commits=39, score=537)
- agda (days_old=218, commits=16, score=529)
- gdl (days_old=35, commits=4, score=527)
- dcd (days_old=153, commits=11, score=526)
- rethinkdb (days_old=941, commits=99, score=508)
- rem (days_old=14, commits=3, score=507)
- node (days_old=2147, commits=318, score=502)
- ipfs (days_old=62, commits=5, score=485)
- qt5 (days_old=979, commits=100, score=485)
- cryptol (days_old=150, commits=10, score=483)
- swiftlint (days_old=101, commits=7, score=479)
- algernon (days_old=83, commits=6, score=478)
- skinny (days_old=313, commits=22, score=478)
- ghq (days_old=85, commits=6, score=466)
- xplanetfx (days_old=486, commits=37, score=462)
- exercism (days_old=44, commits=4, score=461)
- jenkins (days_old=1678, commits=204, score=460)
- scriptcs (days_old=157, commits=10, score=459)
- minisign (days_old=44, commits=4, score=458)
- folly (days_old=67, commits=5, score=457)
- allegro (days_old=87, commits=6, score=456)
- mockserver (days_old=109, commits=7, score=448)
- sysdig (days_old=511, commits=38, score=445)
- xhyve (days_old=71, commits=5, score=435)
- gtkextra (days_old=229, commits=14, score=434)
- pcap_dnsproxy (days_old=72, commits=5, score=431)
- git (days_old=2275, commits=292, score=423)
- cig (days_old=116, commits=7, score=421)
- libbpg (days_old=264, commits=16, score=420)
- stdman (days_old=51, commits=4, score=419)
- pandoc (days_old=515, commits=36, score=416)
- fzf (days_old=537, commits=38, score=415)
- vault (days_old=52, commits=4, score=414)
- keybase (days_old=343, commits=21, score=405)
- nvm (days_old=633, commits=46, score=405)
- gexiv2 (days_old=159, commits=9, score=404)
- peco (days_old=211, commits=12, score=404)
- oauth2_proxy (days_old=27, commits=3, score=400)
- geographiclib (days_old=376, commits=23, score=396)
- cless (days_old=162, commits=9, score=394)
- arangodb (days_old=1202, commits=107, score=388)
- scw (days_old=57, commits=4, score=388)
- python (days_old=2219, commits=258, score=387)
- khal (days_old=58, commits=4, score=381)
- extract_url (days_old=58, commits=4, score=381)
- pypy3 (days_old=431, commits=26, score=377)
- pazpar2 (days_old=531, commits=34, score=377)
- deis (days_old=259, commits=14, score=374)
- graphite2 (days_old=61, commits=4, score=368)
- python3 (days_old=1958, commits=204, score=367)
- wellington (days_old=247, commits=13, score=365)
- saltstack (days_old=661, commits=44, score=364)
- gedit (days_old=64, commits=4, score=354)
- pla (days_old=91, commits=5, score=352)
- purescript (days_old=183, commits=9, score=345)
- mono (days_old=531, commits=31, score=342)
Maybe this needs a web page. For now, the code is on GitHub.