Statistical suicide

Data: Up to 8% of people who attempt suicide succeed (die). Up to 80% of women who carry a BRCA1 mutation die of breast cancer. Conclusion: attempting suicide is much safer than ignoring a BRCA1 mutation.

This is just the popular Jolie case, but, of course, it does not only concern women. For example, 25% of “heavy smokers” (5+ cigarettes per day) develop lung cancer. Still far more than the fraction of people who die in a suicide attempt. So:
Attempting suicide is safer than smoking.

On the other hand, the probability of winning the Powerball lottery is around 0.00000005%. Yet many smokers still play.

Don’t we understand basic probability? Apparently we don’t.

We can easily picture ourselves as millionaires winning the lottery (close to impossible). We can’t picture ourselves dying of cancer (quite likely if you smoke, have a family history of cancer, etc.). Since methods exist that reduce the risk, it is worth thinking about it for at least a second.

Can we prove the null hypothesis (H0)?

Statistics and logic

Suppose you want to compare two populations. Normally, you want to prove that they are significantly different. One of the most common ways to do so is to test whether the difference in means is a result of the condition under study or occurred by chance. Common tests for this are the t-test and ANOVA. You compute a test statistic and check how likely such an extreme value is under the assumption that there is no difference between the groups. If it is very unlikely (which is quantified by the significance level \alpha), you reject the assumption and conclude that the groups are different.
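
As a minimal illustration in R (the sample sizes, means and \alpha = 0.05 are arbitrary choices):

set.seed(1)
A = rnorm(30, mean = 0)   # group 1
B = rnorm(30, mean = 1)   # group 2, true mean shifted by 1

test = t.test(A, B)
print(test$p.value)          # the p-value
print(test$p.value < 0.05)   # typically TRUE here, i.e. reject H_0 at alpha = 0.05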

The idea is based on classical logic and the so-called modus tollens. This rule can be stated like this:

Assume some hypothesis H. You observe a consequence which contradicts H. Therefore you conclude that the assumption is incorrect and that the negation of H is true.

In statistics, however, we are dealing with a probabilistic version of this rule, which may lead to some misunderstandings. In similar language it can be stated as:

Assume H_0 (no difference between groups). You observe data which is “unlikely” under the assumption H_0. You conclude that H_0 is “probably” incorrect and that the negation of H_0 is “probably” true (meaning there is a difference between the groups).

Now, imagine that you expect no difference between the two groups. How can you prove it?

Note that, in the framework of the rule stated above, it is not possible to prove that H_0 is correct. We assume H_0, so under this assumption it is evidently true, but we don’t know whether there exist data that would contradict H_0. To illustrate the problem, consider a scenario from Nassim Taleb’s famous book “The Black Swan”.

Hypothesize that all swans are white. Investigate some (even very large) sample of swans. Observe that all of them are white. Conclude (INCORRECTLY) that all swans are white.

If we want to be sure that all swans are white, we need to check all swans. Only then can we be sure that there is no counterexample.

All means are different

Things get even harder in the context of the initial scenario with two populations. There is a finite number of subjects in each of them, so, almost surely, the true means of these populations are not equal. Let’s consider another example.

Randomly assign all people to two groups. Measure their heights and compute the mean in each group. These means will be very close to each other, but they will differ at some decimal place! Most tests will show no difference even with large samples, yet H_0 is false.
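
A quick simulation of this effect in R (heights drawn from a normal distribution; the numbers are purely illustrative):

set.seed(42)
heights = rnorm(100000, mean = 170, sd = 10)      # everyone's height, in cm
group   = sample(rep(1:2, length.out = 100000))   # random split into two groups

print(tapply(heights, group, mean))   # the means differ at some decimal place...

# ...while the test typically reports no "significant" difference
print(t.test(heights[group == 1], heights[group == 2])$p.value)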

By a similar argument, in almost any experiment H_0 is false!

Ok, but if essentially every H_0 is false in the first place, why do we care about this sort of test at all? The key here is “significance”. By construction, our tests will detect a difference only if it is relatively big. In practice, what we care about is not the existence of the difference but its magnitude.

So, when we want to “prove the null hypothesis”, we should focus on the magnitude of the difference between the means instead of considering their equality.

Heuristic solution

In practice you can follow this procedure (a rough sketch in R follows the list):

  1. Define a margin you are willing to tolerate (e.g. a difference in means not larger than 5%).
  2. Figure out the design and, in particular, a sample large enough to make you confident that an observed “no effect” did not occur merely by chance.
  3. If the confidence intervals of the mean estimators are narrow and the difference is below the margin, you conclude that H_0 is very plausible.
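
A minimal sketch of steps 1 and 3 in R, assuming a tolerated margin of 0.05 on the raw scale (the data and numbers are purely illustrative):

set.seed(7)
A = rnorm(20000, mean = 10.00, sd = 1)
B = rnorm(20000, mean = 10.02, sd = 1)

margin = 0.05            # step 1: the difference we are willing to tolerate
test   = t.test(A, B)    # gives a confidence interval for the difference in means
ci     = test$conf.int
print(ci)

# step 3: if the whole interval lies within (-margin, margin),
# any difference between the means is smaller than we care about
print(ci[1] > -margin && ci[2] < margin)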


Are one-way ANOVA and t-test equivalent?

Yes. Assume the two groups have equal variances (so we use the pooled-variance t-test). Let t be the t-test statistic and F be the one-way ANOVA statistic. One can fairly easily show that t^2 = F. Therefore, for a two-sided test, the corresponding quantiles are equivalent. However, there is no equivalent one-sided test in ANOVA, since we are looking at a ‘squared’ value.

# generate some data
A = rnorm(10, 0)
B = rnorm(10, 0, 2)

# reshape it into a format we can use with lm()
C = rbind(cbind(A, 1), cbind(B, 2))
colnames(C) = c("value", "group")
C = data.frame(C)

# run tests
M1 = t.test(A, B, var.equal = TRUE)
M2 = anova(lm(value ~ group, C))

# compare p-values
print(M1$p.value)
print(M2[1, "Pr(>F)"])

As we can see, the p-values are equal. Note, however, that ANOVA is run on a linear model, whereas the t-test compares the two groups directly. Although they are equivalent in this context, they are used differently and are based on different principles.
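
We can also check the relation t^2 = F directly on the fitted objects:

print(M1$statistic^2)     # squared t statistic
print(M2[1, "F value"])   # F statistic from the ANOVA table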

For ANOVA we brought the data into another format:

print(C)
         value group
1  -1.22252768     1
[...]
10 -1.27401195     1
11  1.64572910     2
[...]
20 -0.79178332     2

so that we can use the lm() function. The model in this example is
\text{value} = \beta_0 + \beta_1 \, \text{group} + \varepsilon.

We estimate \hat{\beta}_1 with lm() and, using anova(), we test whether \beta_1 = 0, which intuitively corresponds to the group variable having no effect. This is a very different approach from the t-test, but, as mentioned at the beginning of this post, it leads to exactly the same result.

Finally, note that in R we also have the function oneway.test, which covers all of these functionalities:

  • If the number of groups is 2, it is equivalent to the t-test,
  • If the number of groups is larger than 2, it runs ANOVA,
  • In both cases it can drop the equal-variance assumption (var.equal = FALSE) if needed.

So, in this example, we can have

M3 = oneway.test(value ~ group, data=C, var.equal = TRUE)
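
The resulting p-value should again match the two previous ones:

print(M3$p.value)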

Introduction to SVM

A simple classical problem

SVM finds an optimal “hyperplane” separating two groups of data points. Although the separation is always linear in the space where the optimization is carried out, we can incorporate other geometric structures by transforming the data into another space.

In the simplest case, consider this two-dimensional example.

Figure: linear split of uniformly distributed points into two groups.

Our objective is to find the optimal separating line between these two groups of points. We assume that the best one is the one with the largest distance from both the red and the black dots. Let’s assume we have n=50 points, as in the example above. We denote the points by \mathbf{x}_i and the class of a point by y_i, setting y_i = 1 for black dots and y_i = -1 for red dots, for each point i \in \{1,2,...,n\}. We can now solve a minimization problem, i.e. find a hyperplane (which in the \mathbb{R}^2 case is just a line):

Minimize
\| \mathbf{w} \|
subject to
y_i(\mathbf{x}_i'\mathbf{w} + b) \geq 1,

where \mathbf{w} is a vector normal to the separating hyperplane we are looking for and b is an affine shift, corresponding to the fact that our hyperplane may not pass through the origin. We use the fact that any hyperplane in \mathbb{R}^d can be defined by its normal vector and its shift from the origin. Note, however, that here the length of the vector matters: the width of the margin is 2/\|\mathbf{w}\|, so minimizing \|\mathbf{w}\| maximizes the margin.

Note that this minimization problem does not have a closed-form solution, so we need to solve it numerically. Luckily, it is a convex problem, so it can be solved reasonably easily with standard numerical methods.
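
In practice we rarely code this optimization ourselves. As a minimal sketch (one option among many), the svm() function from the e1071 package fits such a linear classifier; the data below are generated as in the source code at the end of this post:

# install.packages('e1071') # uncomment if you don't have the e1071 package
library(e1071)

n  = 1000
X1 = 2 * runif(n) - 1
X2 = 2 * runif(n) - 1
Y  = factor(X1 + 2 * X2 < -1)
d  = data.frame(X1, X2, Y)

# linear kernel; a large cost approximates a hard margin
fit = svm(Y ~ X1 + X2, data = d, kernel = "linear", cost = 100, scale = FALSE)
plot(X1, X2, col = as.integer(predict(fit, d)))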

Soft margin, ‘intersecting’ classes

In practice the groups are often not entirely separable and a separating hyperplane simply does not exist. Cortes and Vapnik [2] introduced the so-called soft margin, which allows misclassification of points close to the boundary.

Minimize over \mathbf{w}, b and \xi_i for i \in \{1,2,...,n\}
\| \mathbf{w} \|^2 / 2 + C \sum_i \xi_i
subject to
y_i(\mathbf{x}_i'\mathbf{w} + b) \geq 1 - \xi_i,
\xi_i \geq 0.

The intuition behind this formula is that \xi_i measures how far point i ends up on the wrong side of its margin, and C is the cost of this violation: the further the point, the larger the price. We can also see this in the constraint, which has been relaxed from \geq 1 to \geq 1 - \xi_i, giving in some sense more freedom (and this freedom costs C\xi_i).

For technical reasons it is useful to work with the dual problem (equivalent in the sense that it leads to the same solution \mathbf{w}, b):

(1) Maximize over \alpha_i
\sum_i \alpha_i - \frac{1}{2}\sum_{i,j}\alpha_i\alpha_j y_iy_j\langle \mathbf{x}_i,\mathbf{x}_j \rangle,
subject to
0 \leq \alpha_i \leq C,
\sum_i \alpha_i y_i = 0,

where the \alpha_i are the dual variables (one per data point). For the details of how to transform the initial problem into this last, most convenient form, please refer to [1].

Note that if we set C very large, the problem becomes (essentially) equivalent to an SVM with a hard margin, whereas very small values of C give a robust but potentially inaccurate result.
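
Continuing the sketch from the previous section (with d, X1, X2 and Y defined there), we can get a feeling for the effect of C by refitting with different cost values (the values are arbitrary):

fit_hard = svm(Y ~ X1 + X2, data = d, kernel = "linear", cost = 1e3,  scale = FALSE)
fit_soft = svm(Y ~ X1 + X2, data = d, kernel = "linear", cost = 1e-3, scale = FALSE)

# the very soft margin typically keeps many more support vectors
print(c(nrow(fit_hard$SV), nrow(fit_soft$SV)))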

Non-linearity and kernel methods

The power of SVM comes from the fact that we may incorporate non-linear separations. The basic idea is to map the points into another space and look for a separating hyperplane in that new space. For example, suppose we have two-dimensional data of the following kind:

Figure: a circular pattern in a cloud of uniformly distributed points.

We want to map \mathbf{x} = (x_1,x_2) into a three-dimensional space via \phi(\mathbf{x}) = (x_1, x_2, x_1^2 + x_2^2), as shown below.

Figure: the transformed data, (X, Y) \mapsto (X, Y, X^2 + Y^2).

Note, however, that in (1) we only need the inner products between points, so we never need the mapped points themselves. What we need is the inner product \langle \phi(\mathbf{x}),\phi(\mathbf{z}) \rangle, which is simply

    \[K(\mathbf{x},\mathbf{z}) = x_1z_1 + x_2 z_2 + (x_1^2 + x_2^2)(z_1^2 + z_2^2).\]

This approach, often referred to as the kernel method, already works for this simple case, as can be seen in the figure above. Moreover, from the computational perspective it is important that we do not need to define \phi at all. We just need to define K and plug it into (1), whose objective now becomes

    \[\sum_i \alpha_i - \frac{1}{2}\sum_{i,j} \alpha_i\alpha_j y_iy_j K(\mathbf{x}_i,\mathbf{x}_j).\]

In the next section we will investigate parameter tuning. Apart from the cost parameter C, which penalizes misclassification, we also need to choose the kernel.
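
With e1071::svm this boils down to choosing a non-linear kernel. A minimal sketch on the circular example (a radial kernel is used here for illustration, instead of the explicit map \phi):

library(e1071)

n  = 1000
X1 = 2 * runif(n) - 1
X2 = 2 * runif(n) - 1
Y  = factor((X1 + 0.2)^2 + (X2 + 0.5)^2 < 0.4^2)
d  = data.frame(X1, X2, Y)

fit = svm(Y ~ X1 + X2, data = d, kernel = "radial", cost = 10)

# the predicted classes roughly recover the circular pattern
plot(X1, X2, col = as.integer(predict(fit, d)))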

Source code

Examples from this post can be generated using the following R code

# Example 1: linear split
n = 1000
X1 = 2 * runif(n)-1
X2 = 2 * runif(n)-1

p = c(0.3, 0.6)
Y = X1 + 2 * X2 < -1
plot(X1, X2, col = Y + 1)

# Example 2: circular pattern
p = c(-0.2, -0.5)
r = 0.4
Y = (X1 - p[1])^2 + (X2 - p[2])^2 < r^2
plot(X1, X2, col = Y + 1)

# the 3d plot of transformed space 
# install.packages('rgl') # uncomment if you don't have rgl package
library(rgl)

X3 = X1^2 + X2^2
plot3d(X1, X2, X3, col=Y + 1)

# add a lower layer: the original points projected onto the plane X3 = 0
col = ifelse(Y, rgb(1, 0.5, 0.5), rgb(0.5, 0.5, 0.5))
points3d(X1, X2, rep(0, n), col = col)

References

[1] Burges C., A Tutorial on Support Vector Machines for Pattern Recognition, Data mining and knowledge discovery 2(2), Springer 1998
[2] Cortes C. and Vapnik V., Support-vector networks, Machine learning 20, Springer 1995

Codeforces 427E – Police Patrol

Imagine that your city is an infinite 2D plane with Cartesian coordinate system. The only crime-affected road of your city is the x-axis. Currently, there are n criminals along the road. No police station has been built on this road yet, so the mayor wants to build one. [read on codeforces]

Let’s define a function cost(x) which gives the total distance the patrol car travels if the station is built at x. One can make several observations.

First, we can limit ourselves to putting the station at positions where criminals stand. One can show that, between two consecutive criminals x and y, the values cost(z) for x < z < y are greater than or equal to min(cost(x), cost(y)).

Second, the cost decreases up to a certain point and then increases. In fact, cost can be shown to be convex on the set of criminal positions.

Third, the strategy for computing cost(x) can go like this:

  • take the leftmost criminal, compute his distance from x and multiply it by two,
  • remove the m leftmost criminals (we pick them up in the same trip as the furthest one),
  • repeat as long as there are still criminals to the left of x.

Then do the same for the right side of x.

This already gives a neat O(n log n) solution if we implement cost as above and apply ternary search to find the optimal position. For C++ that is enough; for my Python implementation it was not. So let’s tune it further.

The fourth and most important observation is that we always have to reach both the leftmost and the rightmost criminal, and, as long as the station lies between them (which the optimal one does), this combined round trip costs the same no matter where exactly the station stands. Therefore we can add this cost, remove the m leftmost and the m rightmost criminals (they ride along with the two extremes), and we are left with the same problem without 2*m criminals.

We keep reducing the problem linearly and get an insanely simple solution for an E problem. Voilà:

n, m = map(int,raw_input().split())
a = map(int,raw_input().split())

res, le = 0, 0

while le < n:
    res = res + a[n - 1] - a[le]
    le, n = le + m, n - m

print 2 * res

TopCoder Open (TCO) 2014 – 500 problem

Elly is playing Scrabble with her family. The exact rules of the game are not important for this problem. You only need to know that Elly has a holder that contains a row of N tiles, and that there is a single letter on each of those tiles. (Tiles are small square pieces of wood. A holder is a tiny wooden shelf with room for precisely N tiles placed in a row.)… [More...]

Like many others, I came up with a simple greedy approach. We find the characters one by one, starting from the left. Suppose we want to find the i-th character. We look at the maxDistance characters to the left and maxDistance to the right (among those not used yet) and take the smallest letter. We loop through all positions and obtain the whole sequence.

There are, however, two caveats. First, we need to make sure that we use all letters. Imagine maxDistance = 2 and the letters ZAAA. With the algorithm described above we run into a problem while choosing the last character. Let’s look at what happens:

  • Iteration 1: Candidates ZAA, we choose A, the string so far is just A
  • Iteration 2: Candidates ZAA (one A is gone), we choose A, the string so far is just AA
  • Iteration 3: Candidates ZA (two As are gone), we choose A, the string so far is just AAA
  • Iteration 4: ERROR! We ran out of candidates!

To avoid this situation we simply need to use the leftmost available character whenever this is the last chance to use it (i.e. when its position is exactly i - maxDistance). Then the last two iterations are:

  • Iteration 3: Candidates ZA (two As are gone), we must choose Z, the string so far is just AAZ
  • Iteration 4: Candidates A, we choose A, the string so far is just AAZA

The second caveat is that, whenever there is more than one occurrence of the smallest letter, we take the leftmost one. This gives us more possibilities later on.

It’s quite easy to prove that this strategy gives the optimal solution.

My sub-optimal and sub-pretty Java

import java.util.Arrays;

public class EllysScrabble {
	public String getMin(String letters, int maxDistance) {
		StringBuilder newstring = new StringBuilder(letters);
		StringBuilder oldstring = new StringBuilder(letters);

		for (int i = 0; i < letters.length(); i++){
			int start = Math.max(i - maxDistance, 0);
			int end = Math.min(i + maxDistance + 1, letters.length());
			char[] candidates = new char[end - start];
			oldstring.getChars(start, end, candidates, 0);

			if ((start == i - maxDistance) && candidates[0] != '['){
				newstring.setCharAt(i, candidates[0]);
				continue;
			}

			Arrays.sort(candidates);
			newstring.setCharAt(i, candidates[0]);
			
			for (int j = start; j < end; j++){
				if (oldstring.charAt(j) == candidates[0]){
					oldstring.setCharAt(j, '[');
					break;
				}
			}
		}
		return newstring.toString();
	}
}