Bayesian Drunk Driving
Driving drunk is illegal for a good reason: it’s way riskier than driving sober. This article isn’t about driving drunk, though. It’s about the sloppy thought processes that can too easily muddle something as obvious as that first sentence. Here’s an example of a bogus argument that appears to support the idea that drunk driving is actually safer:
From a recent talk: 1/3 of accidents involve drunk drivers, so 2/3 don’t => sober drivers 2× as bad.
— Colin Beveridge (@icecolbeveridge)
April 12, 2015
So the argument is as follows: In 2012, 10,322 people were killed in alcohol-impaired driving crashes, accounting for nearly one-third (31%) of all traffic-related deaths in the United States [1]. That means that approximately one third of traffic-related deaths involve drunk driving, meaning that two thirds of traffic-related deaths don’t involve drunk driving. Therefore, sober drivers are twice as likely to die in a traffic accident.
If you think something is wrong with that argument, you are right, and not just because the conclusion intuitively seems wrong: the argument makes a mistake in conditional probability. To see the mistake, it helps to introduce a little notation. We will define:
 P(D) to be the probability that a person is drunk
 P(A) to be the probability that a person will die in a traffic-related accident
 P(D | A) (pronounced “probability of D given A”) to be the probability that a person is drunk, given that there was a death in a traffic-related accident they were in
So using the 2012 CDC data, we can assign P(D | A) = 0.31. This is the probability that a drunk driver was involved, given that there was a deadly driving accident.
The first thing to point out is that the statement ‘sober drivers are twice as likely as drunk drivers to die in an accident’ is really a statement about P(A | D), that is, the probability of a deadly driving accident given that the person is drunk. We don’t know this yet, but we can figure it out using Bayes’ theorem.
Bayes’ Theorem
Bayes’ Theorem is unusual in that it is extremely useful and easy to prove, but hard to really understand. I learned it several times in college, but never really understood its importance until much later. To see how easy it is to prove, we go back to the definition of conditional probability:
$$P(X \mid Y) = \frac{P(X \cap Y)}{P(Y)}$$
Here P(X ∩ Y) is the probability of X and Y both occurring. Since this is true for any pair of events X and Y, we can reverse them and get
$$P(Y \mid X) = \frac{P(Y \cap X)}{P(X)}$$
Also, remember that AND is commutative, so that P(X ∩ Y) = P(Y ∩ X). We can therefore multiply the above two equations by P(Y) and P(X), respectively, to get:
$$P(X \mid Y)\,P(Y) = P(X \cap Y) = P(Y \mid X)\,P(X)$$
This relates P(X | Y) to P(Y | X), P(X) and P(Y). We can solve the above equation to get:
$$P(X \mid Y) = \frac{P(Y \mid X)\,P(X)}{P(Y)}$$
And that’s it. We took the definition of conditional probability, did a little algebra, and out popped Bayes’ theorem. We can now apply this to the above drunk driving fallacy and calculate the probability that we are interested in, that is, P(A | D).
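As a quick sanity check, the identity can be verified numerically on a small joint distribution. The probabilities below are made up purely for illustration and have nothing to do with the CDC data; exact fractions are used so the comparison is not blurred by rounding.

```python
from fractions import Fraction

# ASSUMPTION: hypothetical probabilities for two events X and Y (illustrative only).
p_x_and_y = Fraction(1, 10)      # P(X ∩ Y)
p_x = Fraction(1, 4)             # P(X)
p_y = Fraction(1, 5)             # P(Y)

# Definition of conditional probability, applied in both directions.
p_x_given_y = p_x_and_y / p_y    # P(X | Y)
p_y_given_x = p_x_and_y / p_x    # P(Y | X)

# Bayes' theorem recovers P(X | Y) from P(Y | X), P(X) and P(Y).
bayes = p_y_given_x * p_x / p_y

print(p_x_given_y, bayes)
```

Both printed values agree, which is all the theorem claims: the two routes to P(X | Y) give the same number.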
Since we know P(D | A), we just need to find P(A) and P(D). Since the CDC data we are using is annual data, we take the number of casualties from deadly accidents in the United States for the year 2012 (33,561) and divide by the number of licensed drivers (211,814,830) [2]. That gives an estimate of P(A) = 33,561/211,814,830 = 0.0001584, which is about 1 in 6,311.
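The arithmetic for P(A) can be reproduced in a few lines of Python. The sketch below also shows how Bayes’ theorem would turn an assumed P(D) into a drunk-versus-sober risk ratio. The function name risk_ratio and the P(D) values passed to it are placeholders chosen for illustration, not measured data.

```python
# Figures quoted in the text (2012).
deaths = 33_561            # traffic-related deaths
drivers = 211_814_830      # licensed drivers
p_d_given_a = 0.31         # P(D | A)

p_a = deaths / drivers     # P(A)
print(f"P(A) = {p_a:.7f}, about 1 in {round(1 / p_a):,}")


def risk_ratio(p_d: float) -> float:
    """Return P(A | D) / P(A | not D) for an assumed P(D), via Bayes' theorem."""
    p_a_given_d = p_d_given_a * p_a / p_d
    p_a_given_sober = (1 - p_d_given_a) * p_a / (1 - p_d)
    return p_a_given_d / p_a_given_sober


# ASSUMPTION: these P(D) values are placeholders, not data from the article.
for p_d in (0.01, 0.05, 0.31):
    print(f"P(D) = {p_d:.2f} -> drunk/sober risk ratio = {risk_ratio(p_d):.1f}")
```

Under these assumptions the ratio exceeds 1 whenever P(D) is below 0.31, so the bogus argument only works if nearly a third of all drivers are drunk at any given time.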
Then, the probability that a driver is drunk P(D) is
[1] Impaired Driving: Get the Facts. Centers for Disease Control. http://www.cdc.gov/Motorvehiclesafety/impaired_driving/impaireddrv_factsheet.html
[2] Total Licensed Drivers. U.S. Department of Transportation, Federal Highway Administration. http://www.fhwa.dot.gov/policyinformation/statistics/2012/dl22.cfm
Word Frequencies After Removing Common Words
While I was taking the Coursera class on Mining Massive Datasets, the problem of computing word frequency for very large documents came up. I wanted some convenient tools for breaking documents into streams of words, and also a tool to remove common words like ‘the’, so I wrote two scripts, words and decommonize. The decommonize script is just a big grep -v '(foo|bar|baz)', where the words foo, bar and baz come from the words in a file. I made a script, generate_decommonize, that reads in a list of common words and builds the regex for grep -v.
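The idea of building one big alternation from a word list can be sketched in Python. This is not the generate_decommonize script itself; the function names are hypothetical. One deliberate difference is that the pattern here is anchored to whole words, whereas an unanchored grep pattern would also drop lines that merely contain a common word.

```python
import re


def build_common_word_pattern(common_words):
    """Build one alternation regex, like (foo|bar|baz), from a word list."""
    escaped = [re.escape(w.strip()) for w in common_words if w.strip()]
    return re.compile(r"^(" + "|".join(escaped) + r")$")


def decommonize(words, pattern):
    """Keep only the words that do not match the common-word pattern."""
    return [w for w in words if not pattern.match(w)]


# ASSUMPTION: a three-word list stands in for the real common-word file.
pattern = build_common_word_pattern(["the", "of", "and"])
print(decommonize(["the", "course", "of", "human", "events"], pattern))
```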
Example usage of words and decommonize
The full source code is available here on GitHub.
After running make install, you should have words and decommonize in your PATH. You can use them to find key words that are characteristic of a document. I chose:
 the U.S. Declaration of Independence
 Sherlock Holmes
 Working with Unix Processes (by @jstorimer)
So words breaks up the document into lowercase alphabetic words, then decommonize greps out the common words, and sort and uniq -c are used to count instances of each decommonized word, and then the results are sorted.
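The same split, filter, count and sort steps can be sketched in Python. This is an illustration of the pipeline rather than the shell scripts themselves; the tiny COMMON set and the sample sentence are assumptions standing in for the real common-word file and input documents.

```python
import re
from collections import Counter

# ASSUMPTION: a tiny stand-in for the real common-word file.
COMMON = {"the", "of", "and", "to", "a", "in"}


def words(text):
    """Split text into lowercase alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


def word_frequencies(text, common=COMMON):
    """Count each non-common word, most frequent first."""
    counts = Counter(w for w in words(text) if w not in common)
    return counts.most_common()


sample = "The quick brown fox and the lazy dog chased the other fox"
print(word_frequencies(sample)[:3])
```

Counter.most_common plays the role of sort, uniq -c and the final sort in a single step.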
White House Releases First Ever Open Source Budget Proposal
The White House just released the first ever open source budget proposal. It is released on GitHub, and it’s a bunch of CSV files. This is not very difficult; it requires only a few extra clicks when exporting an Excel spreadsheet. But hosting it on GitHub also opens it up to Pull Requests, which I’ve talked about before as being a much better tool for 21st century democracy. Instead of paper and a bunch of politicians in a room following procedure, we should instead have a digital system where all citizens can contribute as easily as they can update a Facebook status or apply an Instagram filter.
One huge caveat is in order though: there is no reason to assume that the White House and Congress will even consider pull requests, let alone apply them. That aside, I will experiment with this. I’ve already modified textql so that I can easily query these CSV files from a SQLite database. If I have an idea about how I’d like to change the budget, I’ll submit the pull request and then follow its response, if any.
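For readers without textql, querying a CSV file through SQLite can be sketched with Python’s standard library alone. This is not textql and not my modified version of it; the helper load_csv, the table name, the column names and the rows are all invented for illustration, and the real budget files have their own layout.

```python
import csv
import io
import sqlite3

# ASSUMPTION: column names and rows are invented; the real budget CSVs differ.
SAMPLE_CSV = """agency,year,amount
Agency A,2016,100
Agency B,2016,250
Agency A,2017,120
"""


def load_csv(conn, table, csv_text):
    """Create `table` from the CSV header and insert every row."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    header, data = rows[0], rows[1:]
    columns = ", ".join(f'"{name}"' for name in header)
    placeholders = ", ".join("?" for _ in header)
    conn.execute(f'CREATE TABLE "{table}" ({columns})')
    conn.executemany(f'INSERT INTO "{table}" VALUES ({placeholders})', data)


conn = sqlite3.connect(":memory:")
load_csv(conn, "budget", SAMPLE_CSV)
totals = conn.execute(
    "SELECT agency, SUM(CAST(amount AS INTEGER)) "
    "FROM budget GROUP BY agency ORDER BY agency"
).fetchall()
print(totals)
```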
Caveats aside, I am impressed with the choice of technologies for making these public issues more accessible.
Parsing Nested Expressions Using Bison
I modified my tipcalc program to handle expressions of arbitrary depth, so now it can handle input like ((($100 + 2%) + 2%) - 3%) + 3.5%.
The trick was to change the start symbol to match binary_expression, and then define binary_expression recursively.
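The recursion can be sketched outside bison as a small recursive-descent evaluator in Python. This is not the tipcalc grammar; the function names and the exact rule layout are assumptions made for illustration. The point is only that a parenthesised expression is parsed by calling the expression rule again.

```python
import re

TOKEN = re.compile(r"\s*(\$|%|\(|\)|\+|-|\d+(?:\.\d+)?)")


def tokenize(text):
    """Split input such as '($100 + 2%) - 3%' into a list of tokens."""
    tokens, pos = [], 0
    text = text.rstrip()
    while pos < len(text):
        match = TOKEN.match(text, pos)
        if not match:
            raise ValueError(f"unexpected input at position {pos}")
        tokens.append(match.group(1))
        pos = match.end()
    return tokens


def parse_expression(tokens):
    """expression := primary (('+' | '-') number '%')*"""
    value = parse_primary(tokens)
    while tokens and tokens[0] in ("+", "-"):
        sign = 1 if tokens.pop(0) == "+" else -1
        if not tokens:
            raise ValueError("expected a percentage")
        percent = float(tokens.pop(0))
        if not tokens or tokens.pop(0) != "%":
            raise ValueError("expected '%'")
        value *= 1 + sign * percent / 100
    return value


def parse_primary(tokens):
    """primary := '$' number | '(' expression ')'"""
    if not tokens:
        raise ValueError("unexpected end of input")
    token = tokens.pop(0)
    if token == "$" and tokens:
        return float(tokens.pop(0))
    if token == "(":
        value = parse_expression(tokens)  # the recursive step
        if not tokens or tokens.pop(0) != ")":
            raise ValueError("expected ')'")
        return value
    raise ValueError(f"unexpected token {token!r}")


def evaluate(text):
    """Evaluate a nested tip expression and round to cents."""
    tokens = tokenize(text)
    value = parse_expression(tokens)
    if tokens:
        raise ValueError(f"unexpected token {tokens[0]!r}")
    return round(value, 2)


print(evaluate("((($100 + 2%) + 2%) - 3%) + 3.5%"))
```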

This is what makes this new version a context-free grammar and not a regular grammar. Now, if you think that you could still handle this input with a regular expression, notice that adding percentages is not associative. For example, you might think we could drop the parens and just parse $100 + 2% + 2% + 2% using /\$\d+ (\+ \d\%)+/.
However, if instead we wrote $100 + 2% - 2% + 2%, associativity would say that the + 2% - 2% cancels and we can reduce it to $100 + 2%. When associated to the left, as (($100 + 2%) - 2%) + 2%, it is clear that the result is different from $100 + 2%, because each percentage applies to the running total rather than to the original $100.
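The difference is easy to check numerically. The helper apply_percent below is a hypothetical name used only for this illustration.

```python
def apply_percent(amount, percent):
    """Add (or, if negative, subtract) a percentage of the running amount."""
    return amount * (1 + percent / 100)


# Left-associated: (($100 + 2%) - 2%) + 2%
left = apply_percent(apply_percent(apply_percent(100, 2), -2), 2)

# Naive cancellation of "+ 2% - 2%", leaving $100 + 2%
cancelled = apply_percent(100, 2)

print(round(left, 4), round(cancelled, 4))
```

The first value is slightly below the second, because subtracting 2% of $102 removes more than the 2% of $100 that was added.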
Tip Calculation Using Bison Grammar
As long as I’ve been able to do arithmetic, I’ve been able to calculate taxes and tips; it’s easy. Given a dollar value of $17.91, we can figure out the total with a tip of 18% as $17.91*(1.18) = $21.13.
However, it would be nice just to enter $17.91 + 18% and have the computer figure it out. So one time at lunch, after calculating the tip for a burrito, I decided to learn lex and bison, which can be used together to create a mini language.
In the grammar I used, OP_PLUS and OP_MINUS come from + and -, and TOKDOLLAR and TOKPERCENT are $ and %.
Then, below each grammar rule, I added some C code that runs when the input matches that rule.

The full source code is available here.
Now, it is true that this is no more powerful than a regular expression. However, I intend to modify it to allow nested expressions like (($2 + 4%) + 4%), which would be useful for compound interest calculations. That would be more powerful than regular expressions, meaning it would be at least a context-free grammar.
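To make the comparison concrete, the flat, non-nested form can be handled with a regular expression alone. The sketch below is in Python rather than lex and bison, and the names FLAT, STEP and tip_total are hypothetical. It accepts one dollar amount followed by any number of percentage steps, and it rejects parenthesised input, which is where a context-free grammar becomes necessary.

```python
import re

# One dollar amount followed by any number of "+ N%" or "- N%" steps.
FLAT = re.compile(r"^\$(\d+(?:\.\d+)?)((?:\s*[+-]\s*\d+(?:\.\d+)?%)*)$")
STEP = re.compile(r"([+-])\s*(\d+(?:\.\d+)?)%")


def tip_total(text):
    """Evaluate a flat expression such as '$17.91 + 18%' and round to cents."""
    match = FLAT.match(text.strip())
    if not match:
        raise ValueError(f"cannot parse {text!r}")
    total = float(match.group(1))
    # Apply each percentage to the running total, left to right.
    for sign, percent in STEP.findall(match.group(2)):
        factor = float(percent) / 100
        total *= (1 + factor) if sign == "+" else (1 - factor)
    return round(total, 2)


print(tip_total("$17.91 + 18%"))
```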