Every record in the data set represents a passenger – providing information on her/his age, gender, class, number of siblings/spouses aboard (sibsp), number of parents/children aboard (parch) and, of course, whether s/he survived the accident.

```r
# https://github.com/joyofdata/joyofdata-articles/blob/master/roc-auc/read_and_prepare_titanic_dataset.R
> df <- read_and_prepare_titanic_dataset("~/Downloads/titanic3.csv")
> str(df)
'data.frame':  1046 obs. of  6 variables:
 $ survived: Factor w/ 2 levels "0","1": 2 2 1 1 1 2 2 1 2 1 ...
 $ pclass  : Factor w/ 3 levels "1","2","3": 1 1 1 1 1 1 1 1 1 1 ...
 $ sex     : Factor w/ 2 levels "female","male": 1 2 1 2 1 2 1 2 1 2 ...
 $ age     : num  29 0.92 2 30 25 48 63 39 53 71 ...
 $ sibsp   : int  0 1 1 1 1 0 1 0 2 0 ...
 $ parch   : int  0 2 2 2 2 0 0 0 0 0 ...
```

The logistic regression model is tested on batches of 10 cases with a model trained on the remaining N−10 cases – the test batches form a partition of the data. In short, leave-10-out CV has been applied to arrive at a more accurate estimation of the out-of-sample error rates.

```r
# https://github.com/joyofdata/joyofdata-articles/blob/master/roc-auc/log_reg.R
> predictions <- log_reg(df, size=10)
> str(predictions)
'data.frame':  1046 obs. of  2 variables:
 $ survived: Factor w/ 2 levels "0","1": 1 2 1 1 2 2 1 2 1 2 ...
 $ pred    : num  0.114 0.854 0.176 0.117 0.524 ...
```
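The `log_reg` function itself lives in the linked repository; the following is only a minimal sketch of how such a leave-k-out scheme might look (a hypothetical re-implementation, demonstrated on a built-in data set instead of the Titanic data):

```r
# leave-k-out cross validation: partition the rows into batches of `size`,
# fit a logistic regression on the complement and predict each batch
leave_k_out <- function(df, formula, size = 10) {
  n <- nrow(df)
  batches <- split(sample(n), ceiling(seq_len(n) / size))
  pred <- numeric(n)
  for (idx in batches) {
    fit <- suppressWarnings(
      glm(formula, data = df[-idx, ], family = binomial)
    )
    pred[idx] <- predict(fit, newdata = df[idx, ], type = "response")
  }
  pred
}

# example on mtcars: predict the transmission type from mpg and weight
p <- leave_k_out(mtcars, am ~ mpg + wt, size = 10)
```

Every case ends up with exactly one out-of-sample prediction, which is what makes the pooled predictions usable for the ROC analysis below.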

Now let’s first have a look at the distribution of survival and death cases on the predicted survival probabilities.

```r
# https://github.com/joyofdata/joyofdata-articles/blob/master/roc-auc/plot_pred_type_distribution.R
> plot_pred_type_distribution(predictions, 0.7)
```

If we consider survival a positive (1) and death due to the accident a negative (0) result, then the above plot illustrates the tradeoff we face upon choosing a reasonable threshold: if we increase the threshold, the number of false positive (FP) results is lowered, while the number of false negative (FN) results increases.

This question of how to balance false positives and false negatives (depending on the cost/consequences of either mistake) arose on a major scale during World War II in the context of interpreting radar signals for the identification of enemy air planes. For the purpose of visualizing and quantifying the impact of a threshold on the FP/FN-tradeoff, the ROC curve was introduced. The ROC curve is the interpolated curve made of points whose coordinates are functions of the threshold $t$:

$ROC(t) = \left( FPR(t),\, TPR(t) \right)$ with $FPR = \frac{FP}{FP + TN}$ (false positive rate) and $TPR = \frac{TP}{TP + FN}$ (true positive rate)

In terms of hypothesis tests – where rejecting the null hypothesis is considered a positive result – the FPR (false positive rate) corresponds to the Type I error, the FNR (false negative rate) to the Type II error and (1 – FNR) to the power. So the ROC curve for the above distribution of predictions would be:

```r
# https://github.com/joyofdata/joyofdata-articles/blob/master/roc-auc/calculate_roc.R
roc <- calculate_roc(predictions, 1, 2, n = 100)

# https://github.com/joyofdata/joyofdata-articles/blob/master/roc-auc/plot_roc.R
plot_roc(roc, 0.7, 1, 2)
```
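To make the construction concrete, here is a hedged sketch of what such a helper might compute – `roc_points` is a hypothetical name, not the repository's `calculate_roc`, but it follows the same idea of sweeping n equidistant thresholds and recording (FPR, TPR) plus a weighted misclassification cost:

```r
# FPR, TPR and cost over n equidistant thresholds
roc_points <- function(pred, labels, cost_fp = 1, cost_fn = 2, n = 100) {
  thresholds <- seq(0, 1, length.out = n)
  t(sapply(thresholds, function(thr) {
    fp <- sum(pred >= thr & labels == 0)  # predicted positive, actually negative
    tp <- sum(pred >= thr & labels == 1)
    fn <- sum(pred <  thr & labels == 1)  # predicted negative, actually positive
    tn <- sum(pred <  thr & labels == 0)
    c(threshold = thr,
      fpr  = fp / (fp + tn),
      tpr  = tp / (tp + fn),
      cost = fp * cost_fp + fn * cost_fn)
  }))
}

# toy example with four predictions
roc <- roc_points(c(.1, .4, .6, .9), c(0, 0, 1, 1))
```

At threshold 0 everything is predicted positive, so (FPR, TPR) = (1, 1); at threshold 1 everything is predicted negative, so (FPR, TPR) = (0, 0) – the two corners mentioned below.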

The dashed lines indicate the location of the (FPR, TPR) point corresponding to a threshold of 0.7. Note that the bottom-left corner (0,0) is associated with a threshold of 1 and the top-right corner (1,1) with a threshold of 0.

The cost function and the corresponding coloring of the ROC points illustrate that an optimal FPR and TPR combination is determined by the associated cost. Depending on the use case, false negatives might be more costly than false positives or vice versa. Here I assumed a cost of 1 for FP cases and a cost of 2 for FN cases.

The optimal point in ROC space is (FPR, TPR) = (0, 1): no false positives and all true positives. So the closer we get there the better. The second essential observation is that the curve is by definition monotonically increasing – lowering the threshold can only turn predictions from negative to positive, so neither TPR nor FPR can decrease:

$t_1 \geq t_2 \;\Rightarrow\; FPR(t_1) \leq FPR(t_2) \,\wedge\, TPR(t_1) \leq TPR(t_2)$

This inequality can easily be checked against the first plot by mentally pushing the threshold (red line) up and down; it implies the monotonicity. Furthermore any reasonable model’s ROC is located above the identity line, as a point below it would imply a prediction performance worse than random (in that case, simply inverting the predicted classes would bring us to the sunny side of the ROC space).

All those features combined make it seem reasonable to summarize the ROC into a single value by calculating the area below the ROC curve – this is the AUC. The closer the ROC gets to the optimal point of perfect prediction, the closer the AUC gets to 1.

```r
# AUC for the example
> library(pROC)
> auc(predictions$survived, predictions$pred)
Area under the curve: 0.8421
```

Two reasons in particular make the ROC curve a potentially powerful metric for the comparison of different classifiers. One is that the resulting ROC is invariant against class skew of the applied data set – a data set featuring 60% positive labels will yield the same (statistically expected) ROC as a data set featuring 45% positive labels (though the skew will affect the cost associated with a given point of the ROC). The other is that the ROC is invariant against the scale of the evaluated score – which means that we can compare a model giving non-calibrated scores, like a regular linear regression, with a logistic regression or a random forest model whose scores can be interpreted as class probabilities.

The AUC furthermore offers interesting interpretations:

The AUC has an important statistical property: the AUC of a classifier is equivalent to the probability that the classifier will rank a randomly chosen positive instance higher than a randomly chosen negative instance.

[Fawcett]

[The AUC] also has various natural intuitive interpretations, one of which is that it is the average sensitivity of a classifier under the assumption that one is equally likely to choose any value of the specificity — under the assumption of a uniform distribution over specificity.

[Hand]
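Fawcett’s rank interpretation can be checked directly on a toy example. The following sketch (the function name `auc_by_ranks` is mine) computes the AUC as the share of positive/negative pairs that are ranked correctly, counting ties as one half:

```r
# AUC as the probability that a randomly chosen positive case is scored
# higher than a randomly chosen negative case (ties counted as 1/2)
auc_by_ranks <- function(pred, labels) {
  pos <- pred[labels == 1]
  neg <- pred[labels == 0]
  # pairwise comparison of every positive score with every negative score
  mean(outer(pos, neg, ">") + 0.5 * outer(pos, neg, "=="))
}

auc_by_ranks(c(.1, .4, .6, .9), c(0, 1, 0, 1))  # 0.75
```

Three of the four positive/negative pairs are ranked correctly, hence 0.75 – the same value `pROC::auc` would report for these scores.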

As the ROC itself varies with the given data set, it is necessary to average multiple ROCs derived from different data sets to arrive at a good estimation of a classifier’s true ROC function.

It seems problematic in the first place to absolutely measure and compare the performance of classifiers with something as simple as a scalar between 0 and 1. The fundamental reason is that problem-specific cost functions break the assumption that points in ROC space are homogeneous in that regard and thereby comparable across classifiers. This non-uniformity of the cost function causes ambiguities if the ROC curves of different classifiers cross – and in itself when the ROC curve is compressed into the AUC by means of integration over the false positive rate.

However, the AUC also has a much more serious deficiency, and one which appears not to have been previously recognised. This is that it is fundamentally incoherent in terms of misclassification costs: the AUC uses different misclassification cost distributions for different classifiers. This means that using the AUC is equivalent to using different metrics to evaluate different classification rules. It is equivalent to saying that, using one classifier, misclassifying a class 1 point is p times as serious as misclassifying a class 0 point, but, using another classifier, misclassifying a class 1 point is P times as serious, where p ≠ P. This is nonsensical because the relative severities of different kinds of misclassifications of individual points is a property of the problem, not the classifiers which happen to have been chosen.

[Hand]

David J. Hand gives a statistically profound reasoning for the dubiousness of the AUC.

[Fawcett]: “An introduction to ROC analysis” by Tom Fawcett

[Hand]: “Measuring classifier performance: a coherent alternative to the area under the ROC curve” by David J. Hand

(original article published on www.joyofdata.de)


The subject of this article is the composition of a multi-layer feed-forward network. This model will be trained on data from the “Otto Group Product Classification Challenge” at Kaggle. We’ll also take a look at applying the model to new data, and eventually you’ll see how to visualize the network graph and the trained weights. I won’t explain all the details, as this would bloat the text beyond a bearable scale. Also, if you are like me – straightforward code says more than a thousand words. So check out this **IPython Notebook** for the programmatical details – here I will focus on describing the concepts and some of the stumbling blocks I encountered.

Most likely you don’t have caffe installed on your system yet – if you do, good for you – if not, I recommend working on an EC2 instance allowing GPU-processing, e.g. the g2.2xlarge instance. For instructions on how to work with EC2, have a look at Guide to EC2 from the Command Line, and for setting up caffe and its prerequisites work through GPU Powered DeepLearning with NVIDIA DIGITS on EC2. For playing around with Caffe I also recommend installing IPython Notebook on your instance – the instructions for which you’ll find here.

Training of a model and its application requires at least three configuration files. The format of those configuration files follows an interface description language called protocol buffers. It superficially resembles JSON but is significantly different and actually intended for use cases where the data document needs to be validatable (by means of a custom schema – like this one for Caffe) and serializable.

For training you need one prototxt-file keeping the meta-parameters of the training and the model (config.prototxt) and another for defining the graph of the network (model_train_test.prototxt) – connecting the layers in an acyclical and directed fashion. Note that the data flows from bottom to top with regards to how the order of layers is specified. The example network here is composed of five layers:

- data layer (one for TRAINing and one for TESTing)
- inner product layer (the weights I)
- rectified linear units (the hidden layer)
- inner product layer (the weights II)
- output layers (Soft Max for classification):
  - soft max layer giving the loss
  - accuracy layer – so we can see how the network improves while training

The following excerpt from model_train_test.prototxt shows layers (4) and (5A):

```
[...]
layer {
  name: "ip2"
  type: "InnerProduct"
  bottom: "ip1"
  top: "ip2"
  inner_product_param {
    num_output: 9
    weight_filler {
      type: "xavier"
    }
    bias_filler {
      type: "constant"
      value: 0
    }
  }
}
layer {
  name: "accuracy"
  type: "Accuracy"
  bottom: "ip2"
  bottom: "label"
  top: "accuracy"
  include {
    phase: TEST
  }
}
[...]
```

The third prototxt-file (model_prod.prototxt) specifies the network to be used in production. In this case it is mostly congruent with the specification for training – but it lacks the data layers (as we don’t read data from a data source at production time) and the Soft Max layer won’t yield a loss value but classification probabilities. Also the accuracy layer is gone now. Note also that – at the beginning – we now specify the input dimensions (as expected: 1, 93, 1, 1). It is certainly confusing that all four dimensions are referred to as input_dim, that only the order defines which is which, and that no explicit context is specified.

This is one of the first mental obstacles to overcome when trying to get started with Caffe. It is not as simple as providing the caffe executable with some CSV file and letting it have its way with it. Practically, for non-image data, you have three options.

HDF5 is probably the easiest to use b/c you simply have to store the data sets in files using the HDF5 format. LMDB and LevelDB are databases so you’ll have to go by their protocol. The size of a data set stored as HDF5 will be limited by your memory, which is why I discarded it. The choice between LMDB and LevelDB was rather arbitrary – LMDB seemed more powerful, faster and mature judging from the sources I skimmed over. Then again LevelDB seems more actively maintained, judging from its GitHub repo and also has a larger Google and stackoverflow footprint.

Caffe internally works with a data structure called blobs, which is used to pass data forward and gradients backward. A blob is a four dimensional array whose dimensions are referred to as:

- N or batch_size
- channels
- height
- width

This is relevant to us b/c we’ll have to shape our cases into this structure before we can store them in LMDB – from where they are fed directly to Caffe. The shape is straightforward for images, where a batch of 64 images each defined by 100×200 RGB-pixels would end up as an array shaped (64, 3, 200, 100). For a batch of 64 feature vectors each of length 93 the blob’s shape is (64, 93, 1, 1).
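In R terms the reshaping looks like this – a sketch that only illustrates the shape, not the byte layout Caffe expects:

```r
# a batch of 64 feature vectors of length 93, as a cases-by-features matrix
features <- matrix(rnorm(64 * 93), nrow = 64, ncol = 93)

# the same data in the 4-d (N, channels, height, width) blob layout
blob <- array(features, dim = c(64, 93, 1, 1))

dim(blob)  # 64 93 1 1
```

Because the first two dimensions match the matrix, `blob[i, j, 1, 1]` is the j-th feature of case i.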

Under Load Data into LMDB you can see that the individual cases or feature vectors are stored in Datum objects. Integer valued features are stored (as a byte string) in data, float valued features in float_data. In the beginning I made the mistake of assigning float valued features to data, which caused the model to not learn anything. Before storing the Datum in LMDB you have to serialize the object into a byte string representation.

Getting a grip on Caffe was a surprisingly non-linear experience for me. That means there is no single entry point or continuous learning path which will lead you to a good understanding of the system. The information required to do something useful with Caffe is distributed across many different tutorial sections, source code on GitHub, IPython notebooks and forum threads. This is why I took the time to compose this tutorial and its accompanying code, following my maxim to summarize what I learned into a text I would have liked to read myself in the beginning.

I think Caffe has a bright future ahead – provided it will not just grow horizontally by adding new features but also vertically by refactoring and improving the overall user experience. It’s definitely a great tool for high performance deep learning. In case you want to do image processing with convolutional neural networks, I recommend you take a look at NVIDIA DIGITS, which offers a comfortable GUI for that purpose.


For classification or regression on images you have two choices:

- Feature engineering and upon that translating an image into a vector
- Relying on a convolutional DNN to figure out the features

Deep Neural Networks are computationally quite demanding. This is the case for two reasons:

- The input data is much larger. Even a small image resolution of 256 × 256 RGB-pixels implies 196’608 input neurons (256 × 256 × 3). If you engineer your features intelligently, then a thousand neurons would already be a lot.
- Saddling the network with the burden of figuring out the relevant features also requires a more sophisticated network structure and more layers.

Luckily many of the involved floating point matrix operations can be offloaded to your graphics card’s GPU – hardware that was never designed for this purpose but happens to excel at it.

There are three major GPU-utilizing Deep Learning frameworks available – Theano, Torch and caffe. NVIDIA DIGITS is a web server providing a convenient web interface for training and testing Deep Neural Networks based on caffe. I intend to cover in a future article how to work with caffe directly. Here I will show you how to set up CUDA, caffe and DIGITS on an EC2 GPU instance.

First of all you need an AWS account and a g2.2xlarge instance up and running. That is mostly self-explanatory – for the command line parts (and some tips) you might want to have a look at my previous tutorial “Guide to EC2 from the Command Line“. Make sure to add an inbound rule for port 5000 for your IP – b/c this is where the DIGITS server will be made available.

```shell
# don't forget to get your system up to date
sudo apt-get update
sudo apt-get dist-upgrade
```

Main source for this step is Markus Beissinger’s blog post on setting up Theano.

```shell
# installation of required tools
sudo apt-get install -y gcc g++ gfortran build-essential \
  git wget linux-image-generic libopenblas-dev python-dev \
  python-pip python-nose python-numpy python-scipy

# downloading the (currently) most recent version of CUDA 7
sudo wget http://developer.download.nvidia.com/compute/cuda/repos/ubuntu1404/x86_64/cuda-repo-ubuntu1404_7.0-28_amd64.deb

# installing CUDA
sudo dpkg -i cuda-repo-ubuntu1404_7.0-28_amd64.deb
sudo apt-get update
sudo apt-get install cuda

# setting the environment variables so CUDA will be found
echo -e "\nexport PATH=/usr/local/cuda/bin:\$PATH" >> .bashrc
echo -e "\nexport LD_LIBRARY_PATH=/usr/local/cuda/lib64" >> .bashrc
sudo reboot

# installing the samples and checking the GPU
cuda-install-samples-7.0.sh ~/
cd NVIDIA_CUDA-7.0_Samples/1_Utilities/deviceQuery
make
./deviceQuery
```

To further speed up deep learning relevant calculations it is a good idea to set up the cuDNN library. For that purpose you will have to get an NVIDIA developer account and join the CUDA registered developer program. The last step requires NVIDIA to unlock your account and that might take one or two days. But you can get started also without cuDNN library. As soon as you have the okay from them – download cuDNN and upload it to your instance.

```shell
# unpack the library
gzip -d cudnn-6.5-linux-x64-v2.tar.gz
tar xf cudnn-6.5-linux-x64-v2.tar

# copy the library files into CUDA's include and lib folders
sudo cp cudnn-6.5-linux-x64-v2/cudnn.h /usr/local/cuda-7.0/include
sudo cp cudnn-6.5-linux-x64-v2/libcudnn* /usr/local/cuda-7.0/lib64
```

Main source for this and the following step is the readme of the DIGITS project.

```shell
sudo apt-get install libprotobuf-dev libleveldb-dev \
  libsnappy-dev libopencv-dev libboost-all-dev libhdf5-serial-dev \
  libgflags-dev libgoogle-glog-dev liblmdb-dev protobuf-compiler \
  libatlas-base-dev

# the version number of the required branch might change
# consult https://github.com/NVIDIA/DIGITS/blob/master/README.md
git clone --branch v0.11.0 https://github.com/NVIDIA/caffe.git

cd ~/caffe/python
for req in $(cat requirements.txt); do sudo pip install $req; done

cd ~/caffe
cp Makefile.config.example Makefile.config
# check that USE_CUDNN is set to 1 in case you would
# like to use it and to 0 if not
make all
make py
make test
make runtest

echo -e "\nexport CAFFE_HOME=/home/ubuntu/caffe" >> ~/.bashrc

# load the new environmental variables
bash
```

```shell
cd ~
git clone https://github.com/NVIDIA/DIGITS.git digits
cd digits
sudo apt-get install graphviz gunicorn
for req in $(cat requirements.txt); do sudo pip install $req; done
```

The first time you start DIGITS it will ask you a number of questions for the purpose of its configuration. Those settings are pretty much self-explanatory and you can change them afterwards in ~/.digits/digits.cfg. You might want to consider locating your job directory (jobs_dir) on an EBS volume – the data set of about 140’000 PNGs in the example I feature here consumes about 10 GB of space and the trained model (with all its snapshots) accounts for about 1 GB.

```shell
# change into your digits directory
cd digits
# start the server
./digits-devserver
```

When you start DIGITS for the first time you might run into a number of errors and warnings. Here’s my take on them.

```shell
"libdc1394 error: Failed to initialize libdc1394"
# no big deal - either ignore or treat symptomatically
sudo ln /dev/null /dev/raw1394
```

```shell
"Gtk-WARNING **: Locale not supported by C library."
# not sure how serious this is - but it is easy to resolve
sudo apt-get install language-pack-en-base
sudo dpkg-reconfigure locales
# check what locales are available and then ...
locale -a
# ... set LC_ALL to it
echo -e "\nexport LC_ALL=\"en_US.utf8\"" >> ~/.bashrc
```

```shell
"Gdk-CRITICAL **: gdk_cursor_new_for_display: assertion 'GDK_IS_DISPLAY (display)' failed"
# this is a big deal and will cause the server start up to fail:
# connect with ssh flags -Xi
ssh -Xi ...
```

```shell
"Couldn't import dot_parser, loading of dot files will not be possible."
# reinstall pyparsing:
sudo pip uninstall pyparsing
sudo pip install pyparsing==1.5.7
sudo pip install pydot
```

First you have to create the data set on which you want to train a model. You have to provide at least one large set of pictures for the training and optionally two smaller sets for validation and testing. You can either separate those sets (and their correct labels) by means of different folders or – what I’d recommend – by providing corresponding CSVs. Those CSVs are supposed to feature two unnamed tab separated columns. The first column keeps the full path of the image (don’t use ~ for home, but its expanded path equivalent) and the second column keeps a 0-based index referencing the correct class. You will also have to provide a text file holding the different classes – one per line. For example if you have two classes “pos” (1st line) and “neg” (2nd line), then an image belonging to class “pos” would have to have a class index of 0 associated with it. Loading might take a while – loading my 140’000 PNGs with 256×256 resolution took about one hour.
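A hedged sketch of how such a mapping file could be produced from R – the file names and paths are made up for illustration:

```r
# two-column, tab separated, no header: full image path and 0-based class index
images <- c("/home/ubuntu/data/img_0001.png",
            "/home/ubuntu/data/img_0002.png")
class_index <- c(0L, 1L)  # img_0001 -> "pos", img_0002 -> "neg"

write.table(data.frame(images, class_index),
            file = "train.txt", sep = "\t", quote = FALSE,
            row.names = FALSE, col.names = FALSE)

# the classes file: one label per line; line number minus 1 = class index
writeLines(c("pos", "neg"), "labels.txt")
```

Note the `quote = FALSE` – quoted paths would not be understood by DIGITS’ loader in my experience.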

Setting up the model you intend to train is even easier, provided you stick with the suggested defaults – just choose the data set you want to use and a network, and you’re ready to go! Training a GoogLeNet for 30 epochs on the described data set took about one day and six hours. This is why you should make sure that …

- … your bid for a Spot instance is not too low – or you risk it being terminated
- … you start the server in a tmux session. Otherwise, if you lose the connection – maybe b/c your IP changes over night – the server process will be killed

The provided training set consists of about 35 thousand images of high resolution – zipped and split across five files. The whole zip archive is about 33 GB in size. I downloaded the five parts directly onto an EBS volume using lynx – b/c that way you can just regularly log on and initiate the download. The download speed on the g2.2xlarge instance btw was incredible – you are granted up to 100 MB per second. I started all five downloads in parallel – each going at 6 MB per second. And yes, it’s megabytes – not megabits (the unit DSL providers use).

The visible indicators of diabetic retinopathy are, as I understand it, mostly leaking (aneurysms) and pathologically growing blood vessels. I figure those features are mirror and rotation invariant. So to increase the available training set I created four versions of every image:

- (A): As is but resized to 256×256 pixels and saved as PNG
- (R): 180 degree rotation of (A)
- Vertical mirroring of (A)
- Vertical mirroring of (R)

Because the task at hand is – given the ordered stages – really a regression and not a classification, I abstained from attempting to learn a classification into no DR and the four stages of DR. Instead I labelled all DR cases as “positive” and the no-DR cases respectively as “negative”. For a full solution this would have to be done for all four possible splits ({0} vs {1,…,4}, …, {0,…,3} vs {4}) and those predictions would finally be regressed against the actual stage.
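The four binary splits mentioned above could be generated like this – a sketch with made-up stage labels:

```r
# DR stages 0 (healthy) to 4; split k labels every stage >= k as positive
stages <- c(0, 1, 4, 2, 0, 3)

splits <- sapply(1:4, function(k) as.integer(stages >= k))
colnames(splits) <- paste0("stage_ge_", 1:4)
```

Each column is one binary labeling; a stage-4 case is positive in all four splits, a healthy case in none – which is what allows regressing the four predictions against the actual stage afterwards.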

The bash script for this transformation you may find under “bash commands for the processing”.

Well … on the one hand I would have liked to see a higher accuracy – on the other hand I can barely (if at all) make out the difference between some healthy cases and some extreme stage four cases. As 73.95% is the share of negative cases, this is also where the accuracy of the network started out. In the course of 30 epochs it improved by about 8 percentage points to 81.8%.

I highly recommend the DIGITS Google Group for your questions on features and issues. The developers of DIGITS are very helpful and open for suggestions.


Let’s assume the following application:

A conservation organisation starts a project to geographically catalogue the remaining representatives of an endangered plant species. For that purpose hikers are encouraged to communicate the location of the plant if they encounter it. Because those hikers use GPS technology ranging from cheap smartphones to high-end GPS devices – and because of weather and environmental circumstances – the measurements are of varying accuracy. The goal of the conservation organisation is to build up a map locating all found plants, with an ID assigned to each of them. Now every time a new location measurement is entered into the system, a clustering is applied to identify related measurements – i.e. measurements belonging to the same plant.

(“… And we are all together” – I am the Walrus / Beatles) So far so good – but it gets a bit tricky when it comes to deciding how to deal with the IDs of clusters / plants when a newly introduced location estimate does not just humbly join an established cluster but causes trouble by messing up previously identified clusters / plants. Take the picture to the right: so far we had two plants with separate IDs – in the good case they stay separate and the new measurement is assigned to the red cluster. In the bad case the new one causes red and blue to merge and poses the question whether the merged cluster is red or blue or something new itself. Here we are dealing with a clear draw and very few points and clusters – but it is easy to come up with more ambiguous cases, like the one described above. To make reasonable decisions for those cases, well-chosen – and if possible mathematically at least plausibilized – heuristics are needed.

A fair question, as one might argue that an ID only serves the purpose of differentiation and there is no need for maintaining a family tree of clusters. In the above use case this argument is indeed not easily denied. But a stable inheritance of IDs might simplify understanding the dynamics of how the clustering evolves – a large number of representatives might render a cluster and its represented entity “important” and it would be weird to have no stable way to refer to it. And other possible motivations come to mind: maybe the organisation will send researchers to selected plants to perform an examination on them and henceforth intends to refer to those specific ones.

```r
# calculates the contingency table described below
cross <- function(c0, cx) {
  uc0 <- unique(c0[c0 != "?"])
  ucx <- unique(cx)
  cross <- matrix(0, ncol=length(ucx), nrow=length(uc0),
                  dimnames=list(uc0, ucx))
  for(id_c0 in uc0) {
    for(id_cx in ucx) {
      cross[id_c0, id_cx] <- length(intersect(
        which(c0 == id_c0), which(cx == id_cx)
      ))
    }
  }
  return(cross)
}

# helper function: "A B" -> c("A","B")
sv <- function(str) {
  strsplit(str, " +")[[1]]
}
```

So how might we approach this almost philosophical problem? I guess what is needed first is a handy way to represent the relations, and for that purpose something one might be inclined to refer to as a “set theoretic contingency table” might make sense. Rows represent the clusters identified so far, columns represent the result of the newly performed clustering, and the values are the number of elements the respective clusters have in common. Take the illustration on the right hand side for an example – the new clustering leading to a temporary cluster with ID 2 has 1 element in common with cluster C. Now to choose A for clustering set 3 is an obvious choice, but choosing B for 2 and C for 1 is not so evident – though probably still an obvious choice for a human being.

```r
> c0 <- sv("A A B B C C C ?")
> cx <- sv("3 3 2 2 2 1 1 2")
>
> cross(c0,cx)
  3 2 1
A 2 0 0
B 0 2 0
C 0 1 2
```

Continuing with the above example: clustered set 2 contains elements of type B and C. In this case one might say: “The choice of B is most reasonable as there are two Bs, one C and one unsettled element”. Fair enough – but what if we face a draw? Or if we had two Bs and five more elements of different types, like C, D, E, F, G? It might seem odd, but in a space of high dimensionality this is, I guess, a possibility.

Or take the situation illustrated to the right. For set 1 the label is a clear choice. But with the above democratic labeling heuristic we would have to choose the same label for 2 – and this would lead to a conflict. :/

To make a long story short, a possible way to go might be to take a very conservative stance and expect a cluster to properly tend its flock if it would like to keep its label. Id est, if a cluster loses an element or gains one, then its new label is chosen randomly. Whether a cluster may keep its label can be told by checking the contingency table – the condition is met if one and only one field in a row is non-zero and the corresponding column is non-zero exclusively in that field as well.

```r
# determines unambiguous cluster labeling cases
labeling <- function(cross) {
  labels <- c()
  for(id_cx in colnames(cross)) {
    if(sum(cross[,id_cx]) == max(cross[,id_cx])) {
      id_c0 <- which.max(cross[,id_cx])
      if(sum(cross[id_c0,]) == max(cross[id_c0,])) {
        labels[id_cx] <- names(id_c0)
      } else {
        labels[id_cx] <- "+"
      }
    } else {
      labels[id_cx] <- "+"
    }
  }
  return(labels)
}
```

And now in action:

```r
> c0 <- sv("A A B B C C C D D ?")
> cx <- sv("3 3 2 2 1 1 1 1 4 2")
>
> x <- cross(c0,cx)
> x
  3 2 1 4
A 2 0 0 0
B 0 2 0 0
C 0 0 3 0
D 0 0 1 1
>
> labeling(x)
  3   2   1   4
"A" "B" "+" "+"
```

Congratulations for making it to this point – you are now part of a small distinguished circle! Write me a mail and I will organize for you a session so you will receive the fierce looking joyofdata-tattoo on your forehead which will grant you bargains in bio supermarkets all over the world and will facilitate meeting people at night clubs. Okay, seriously, I’d be interested in input!



The efficient way to get the job done is by applying linear programming (LP). That means representing the question “Is it possible to fit a hyper-plane between two sets of points?” with a number of inequalities (that make up a convex area). I’m going to give a quick walk-through of the math to make the idea plausible – but this text describes an introductory example rather than an introduction to LP itself. For solving the linear program I will use Rglpk, which provides a high level interface to the GNU Linear Programming Kit (GLPK) – and of course has been co-crafted by the man himself – Kurt Hornik – who is also involved with kernlab and party – thank you, Prof. Hornik, and keep up the good work!

Let’s say we have two sets $A$ and $B$ of points in $\mathbb{R}^d$.

And we want to know if there is a hyper-plane in $\mathbb{R}^d$ which separates $A$ and $B$. Then we can formulate the necessary condition with two symmetrical inequalities:

An $h \in \mathbb{R}^d$ and a $\beta \in \mathbb{R}$ exist, such that $\langle h, a \rangle > \beta$ for all $a \in A$ (1) and $\langle h, b \rangle < \beta$ for all $b \in B$ (2).

This is because a hyper-plane in $\mathbb{R}^d$ can be defined as a vector $h$ together with an offset $\beta$ – the plane being the set of all points $x$ with $\langle h, x \rangle = \beta$ – and the points on either side of it can be distinguished with the above stated inequalities.

N <- 100 g <- expand.grid(x=0:N/N,y=0:N/N) # definition of the hyper plane h1 <- 3 h2 <- -4 beta <- -1.3 # points on either side g$col <- ifelse(h1 * g$x + h2 * g$y > beta, "cornflowerblue","darkolivegreen3") # roughly on the hyper plane g$col <- ifelse(abs(h1 * g$x + h2 * g$y - beta) < 2/N, "red", g$col) plot(g$x, g$y, col=g$col, pch=16, cex=.5, xlab="x", ylab="y", main="h(x,y) = 3 * x + (-4) * y + 1.3 = 0")

The conditions of a linear program are usually stated as a number of “weakly smaller than” inequalities. So let’s transform (1) and (2) appropriately:

The conditions $\langle h, a \rangle > \beta$ and $\langle h, b \rangle < \beta$ can be written as $\langle h, a \rangle \geq \beta + 1$ and $\langle h, b \rangle \leq \beta - 1$. This is because we are dealing with finite sets $A$ and $B$, so if we have a separating plane, then we can always fit in an $\varepsilon > 0$ such that $\langle h, a \rangle \geq \beta + \varepsilon$ and $\langle h, b \rangle \leq \beta - \varepsilon$. If we now multiply both inequalities with $1/\varepsilon$, then we just end up with a different formulation $(h/\varepsilon,\, \beta/\varepsilon)$ for the same plane. The first inequality we additionally multiply with $-1$ to turn $\geq$ into $\leq$ – and now we have:

$-\langle h, a \rangle \leq -\beta - 1$ for all $a \in A$ (3)

$\langle h, b \rangle \leq \beta - 1$ for all $b \in B$ (4)

Okay great – we’re almost there – now let’s get all the variables on the left hand side:

$-\langle h, a \rangle + \beta \leq -1$ for all $a \in A$ (5)

$\langle h, b \rangle - \beta \leq -1$ for all $b \in B$ (6)

These hyper-plane conditions have to be true for all points: all points in $A$ have to fulfil (5) and all points in $B$ have to fulfil (6). Then this set of inequalities describes the convex set in $\mathbb{R}^{d+1}$ of all possible separating hyper-planes $(h, \beta)$. Usually the description and purpose of a linear program does not stop at this point and the set of feasible solutions is used to maximize an objective function. In our case such an objective function might be introduced to maximize the distance of the plane from the points. But the article is long enough already and our objective is to just find **a** plane and not the best plane. Which is why our objective function is going to be simply the constant zero function – every feasible solution is equally good.

So now we formulate (5) and (6) in matrix notation, because this is how LP solvers expect the program description to be fed to them – we get:

$M z \leq c$

with

$M = \begin{pmatrix} -a_1^T & 1 \\ \vdots & \vdots \\ -a_{N_1}^T & 1 \\ b_1^T & -1 \\ \vdots & \vdots \\ b_{N_2}^T & -1 \end{pmatrix}$, $\quad z = \begin{pmatrix} h \\ \beta \end{pmatrix}$, $\quad c = \begin{pmatrix} -1 \\ \vdots \\ -1 \end{pmatrix}$

(in the code below $M$ and $c$ go by the names A and b)

```r
library(Rglpk)

dim <- 2
N1 <- 3
N2 <- 3

# the points of sets A and B
P <- matrix(runif(dim*N1 + dim*N2, 0, 1), ncol=dim, byrow=TRUE)

# the matrix A defining the lhs of the conditions
A <- cbind(P * c(rep(-1,N1), rep(1,N2)),
           c(rep(1,N1), rep(-1,N2)))

# the objective function - no optimization necessary
obj <- rep(0, dim+1)

# the vector b defining the rhs of the conditions
b <- rep(-1, N1+N2)

# by default GLPK assumes positive boundaries for the
# variables. but we need the full set of real numbers.
bounds <- list(
  lower = list(ind = 1:(dim+1), val = rep(-Inf, dim+1)),
  upper = list(ind = 1:(dim+1), val = rep(Inf, dim+1))
)

# solving the linear program
s <- Rglpk_solve_LP(obj, A, rep("<=", N1+N2), b, bounds=bounds)

plot(P, col=c(rep("red",N1), rep("blue",N2)),
     xlab="x", ylab="y", cex=1, pch=16, xlim=c(0,1), ylim=c(0,1))

# status 0 means that a solution was found
if(s$status == 0) {
  h1 <- s$solution[1]
  h2 <- s$solution[2]
  beta <- s$solution[3]
  # drawing the separating line
  if(h2 != 0) {
    abline(beta/h2, -h1/h2)
  } else {
    abline(v=-beta/h1)
  }
} else {
  cat("Not linearly separable.")
}
```

In case you are wondering how I managed to include all those pretty pretty math formulas in this post – I am using the QuickLaTeX WordPress plug-in and I must say I really like the result. In previous posts I used a LaTeX web editor and then included the rendered formulas as an image.

