To find text in files with PowerShell:

Vagrant allows the creation and configuration of lightweight, reproducible, and portable development environments.

Apache Spark™ is a fast and general engine for large-scale data processing.

This is a tutorial on how to install Spark on Ubuntu using Vagrant. You’ll need Vagrant and VirtualBox installed.

Initialise Vagrant

Create a working directory and initialise a Vagrantfile in it.

vagrant init

Replace the contents of the Vagrantfile with the following. This creates a VM with 6 GB of RAM and 2 CPUs.

Vagrant.configure(2) do |config|
  config.vm.box = "ubuntu/trusty64"
  config.vm.provision :shell, path: "bootstrap.sh"
  config.vm.provider "virtualbox" do |vb|
    vb.gui = true
    vb.memory = 6144
    vb.cpus = 2
    vb.customize ["modifyvm", :id, "--vram", "128"]
    vb.customize ["modifyvm", :id, "--accelerate3d", "on"]
    vb.customize ["modifyvm", :id, "--graphicscontroller", "vboxvga"]
  end
end

Create bootstrap.sh, which installs the Ubuntu desktop, and save it in the working directory.

Build the Vagrant image and run the VM.

vagrant up --provision

You should have a working Ubuntu VM now. You can log in as the vagrant user with the password ‘vagrant’.

Install Java 7

Install Scala

Install Spark

VagrantSpark Repository

Alternatively, clone and install from my repo:

git clone https://github.com/bayesjumping/VagrantSpark.git
cd VagrantSpark
make up-provision

Stack Exchange Data Set

The Stack Exchange data set is available under a Creative Commons licence.
> License: http://creativecommons.org/licenses/by-sa/3.0/

It’s a great data set for getting started with the analysis of Q&A systems.

I keep forgetting how to find files from the terminal with bash.

To find a file:

find . -name 'makefil*'

This searches the current directory and everything below it for files (and directories) whose names begin with ‘makefil’. The match is case-sensitive.
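When the same search is needed from a script rather than the shell, it can be sketched in Python with pathlib. This is only an illustration: the helper name find_by_name is hypothetical, and the match follows pathlib's glob rules rather than those of find.

```python
from pathlib import Path


def find_by_name(root, pattern):
    """Return every path under root whose final component matches pattern.

    find_by_name is a hypothetical helper written for illustration.
    Like `find . -name`, it searches recursively and matches
    directories as well as regular files.
    """
    return sorted(Path(root).rglob(pattern))


# Rough equivalent of: find . -name 'makefil*'
for path in find_by_name(".", "makefil*"):
    print(path)
```

Path.rglob walks the tree itself, so no shell is involved and the pattern needs no quoting.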

In machine learning, arithmetic underflow can become a problem when multiplying together many small probabilities. In many models it can be useful to calculate the log sum of exponentials.
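Written out, and assuming the values being combined are $x_{1}, \dots, x_{n}$ (the index range $n$ is a symbol introduced here for illustration), the log sum of exponentials is:

$$\log \sum_{i=1}^{n} \exp(x_{i})$$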

If $x_{i}$ is sufficiently large or small, computing the exponentials directly will overflow or underflow. To avoid this we can use a common trick called the log-sum-exp trick.
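In its standard form, with a constant $b$ factored out of the sum and $n$ assumed to be the number of values, the trick is the identity:

$$\log \sum_{i=1}^{n} \exp(x_{i}) = b + \log \sum_{i=1}^{n} \exp(x_{i} - b)$$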

Where $b$ is $\max(x)$.

We can calculate this in Python with:
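A minimal sketch of the calculation using only the standard library. It is not necessarily the code from the original notebook: the function name log_sum_exp and the sample values in ns are assumptions made for illustration.

```python
import math


def log_sum_exp(xs):
    """Compute log(sum(exp(x) for x in xs)) with the max-shift trick.

    log_sum_exp is an illustrative name, not a library function.
    """
    b = max(xs)  # shift constant: the largest input
    if math.isinf(b):
        # Every input is -inf, or at least one is +inf; the result is b itself.
        return b
    # After the shift every exponent is <= 0, so exp() cannot overflow,
    # and the largest term is exp(0) = 1, so the sum cannot be zero.
    return b + math.log(sum(math.exp(x - b) for x in xs))


# Hypothetical sample input: log-probabilities too small to exponentiate
# directly, since math.exp(-1000.0) underflows to 0.0.
ns = [-1000.0, -1000.5, -1001.0]
print(log_sum_exp(ns))
```

A naive `math.log(sum(math.exp(x) for x in ns))` fails on this input because every term underflows to zero. With large positive inputs such as 1000.0, `math.exp` raises OverflowError instead.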

or using SciPy:

from scipy.special import logsumexp  # formerly scipy.misc.logsumexp, since removed
logsumexp(ns)

Jupyter notebook here.