In her article, Sarah attacks the claim that "without a mathematical foundation, you’ll have only a surface understanding of programming" by first reducing it to an alleged "common variation" of the same sentiment: "without a CS degree, you can’t build anything substantial." But wait, those aren't the same thing at all.
I can argue that a mathematical foundation can help give you a deep, valuable understanding of certain programming problems. No one can argue that you need math to build anything of substance. Especially if "substance" is correlated with VC funding. Yes, you can build "Yo!" without a deep understanding of anything. Worthwhile endeavours and opportunities for gainful employment abound, which don't require understanding much math, or any deep understanding of the programming problems at hand.
She has a point when she says "computer science is not programming". Yes. But understanding some math, and mathy computer science concepts, can certainly be valuable when it comes to reasoning about a programming problem, and communicating that reasoning with your collaborators.
Here's an example from something I was working on recently. We were building an appliance for our enterprise clients, to streamline the process of deploying Cloud Foundry to their on-premise datacenters. Users would enter configuration, including an IP subnet in CIDR notation and a set of blacklisted IP ranges in a format like 10.10.0.0-10.10.0.15, 10.10.0.129-10.10.0.255. They would then select and configure any number of distributed services to be deployed, such as the Cloud Foundry PaaS, an add-on MySQL DBaaS cluster, and let's say a Hadoop cluster. Our appliance would spin up VMs on which to run these services, assigning IPs to these VMs within the given subnet, but outside the blacklist.
Everything was fine and dandy, until a field rep told us that one of his customers was experiencing some really slow performance. We noticed that they were using a /16 subnet. First off, tell me that solid confidence and facility with numbers doesn't make understanding CIDR blocks a walk in the park. Anyway, after some debugging we traced the problem to some code that looked like this:
Actually, it was a fair bit more complicated than that, because this code was being called by something that was first looping over all already-installed products and hydrating the IP pool, then looping over all the yet-to-be-installed products to determine which IPs to assign for each of their jobs. For each product, it had to loop over each job that comprises that product and find-or-take IPs from the pool depending on how many VMs it would be running.
My pair and I were able to reason through this problem and communicate with each other by breaking down the runtime complexity of the problem in terms of big O. Having the training of thinking about problems in that way helped. Having a common language within which to frame the problem, a language that's precise and unambiguous, really helped. I'm not saying the problem in any way requires familiarity with runtime analysis and big O notation. But just imagine what two people with no exposure to these things would go through when reasoning about this problem and communicating with each other. Then appreciate how much friction just disappears if those two collaborators knew this stuff.
The fix?
When n = 2^16, an O(n^2) → O(n) improvement is significant. Calculations that were taking many hours (or rather, hitting the nginx timeout and never finishing) in the worst case were now being done in a couple of seconds at most!
As evidence for the claim that understanding math doesn't help much with understanding programming, Sarah cites her experience: "I have found little connection between a person’s formal qualifications and the depth of their understanding."
Formal qualifications and understanding math are not the same thing! She talks specifically about big O in this context. So here's a challenge: amongst all your colleagues, think about the ones who have limited formal education (e.g. early college dropout) but whom you admire for their skills and passion for solving hard and interesting problems. Now amongst those folks, find one who doesn't get big O notation or who doesn't care to think about runtime analysis. Can you even find one?
I think the example above and the two examples in Part 1 all show how a mathematical approach can certainly help you better understand and tackle certain problems that come up, even in your day-to-day, non-mathy work. Here's yet another example: Using the framework of classical optimization theory to understand the problem of designing a PaaS to schedule resources for running user applications in a highly available manner.
You'll often hear that even if you don't use math as a programmer, math teaches you abstract thinking and problem solving, and you need that for programming. Sarah retorts: "learning to program is more like learning a new language than it is like doing math problems."
Doing math problems is not the same as learning math. And there's so much more to learning math than learning calculus, matrix arithmetic, a bit of graph theory and some combinatorics. Especially in America, the state of modern math education in college is quite poor for the vast majority of students. What they're exposed to is excessively computational (doing problems vs. learning), centuries old, and severely limited in both breadth and depth.
So, what about learning math? Is there something special about it that helps someone learn programming later on? Absolutely. First off:
Doing math is about taking abstract and unfamiliar concepts and bodies of knowledge, and internalizing them by building mental models so that they feel concrete and familiar, often by building upon existing mental models, so that you can pose conjectures and prove assertions about these concepts.
You can try to say that this description could apply to other fields, but there's no field for which this is as fitting a description as it is for math. This habit of building mental models to understand the unfamiliar and make it concrete is something I use every time I'm introduced to a new problem domain. And I use the word "habit" deliberately, because it's not that you need math to be able to do this, it's that learning math necessitates making this a habitual part of your thinking process.
Furthermore, because so much of math is so abstract:
Doing math develops tenacity -- the confidence that things which seem entirely opaque to you now can eventually, through patience and effort, be learned and understood in a way that seems clear.
I came to programming knowing nothing. I didn't know the difference between Java and JavaScript. I didn't know about GET, POST, PUT, DELETE. I had hardly used Unix. I'd never heard of "client-server". Didn't know a thing about networks, IaaS, virtualization, databases, etc. My head would be throbbing after work every day for my first few weeks as I was being flooded with knowledge. And I would come in to work the next day and notice that a lot of the stuff from the previous day had leaked out overnight.
But that was okay. I could tell myself, there's no way this stuff is more abstract than combinatorial characterizations of compactness and incompactness of the second uncountable cardinal, the stuff I never wrote my thesis about. It took several months, but I was eventually able to penetrate those concepts in set theory enough to start (but not finish, obviously) being productive. These programming concepts? Just give me a few weeks and I can be productive there too.
The ability to feel comfortable while having absolutely no clue what you're doing is something I took away from my time devoted to learning math. Again, math is not the only path to this goal. I feel that any artistic practice where the creations take a long time to take shape, be it drawing, sculpture, literature, etc., develops the same sort of confidence. But math is at least one way to achieve it. And when it comes to programming, it certainly helps that math, much more so than art, requires a fairly similar "type" of thinking to programming.
That all said, I do see some merit in Sarah's analogy with learning a new language, but it depends on where the challenges come from in your programming career. If your challenges come from tackling harder and more ambitious problems in new and unfamiliar domains, learning math is more helpful. But as a Rails consultant, that's not where the challenge comes from. The challenge comes from finding clearer, more elegant, and more robust ways to express solutions to familiar problems. On one end of the spectrum, there's the person who can hack together a functioning Rails app, and on the other end there are people like Sarah. You get to be like Sarah by getting good at modelling things clearly, writing clean, maintainable code, and communicating effectively through your code and in real life with your peers. In that case, the analogy to learning and getting better at a new language is quite illustrative.
No matter where you are in your career as a programmer or what problem domain or business vertical you work on, there is value to knowing math, thinking mathematically, and experiencing the process of learning math. Knowing math adds a powerful set of tools to your toolbelt. It won't be the right tool for every task, but that's true of any tool. You know the old chestnut about hammers and nails. That said, there will undoubtedly be times that an application of math will be the best tool for the job, where that "job" could mean optimizing, designing, debugging, experimenting, or anything else.
Thinking in terms of mathematics, and in terms of those mathy concepts from computer science, can sometimes provide an incredibly productive way to reason about a particular programming problem. Furthermore, the precise nature of mathematics often makes it an ideal way to communicate about a particular programming problem, especially when all your collaborators are familiar with the language.
Finally, the process of learning math (real math, not multiplying matrices) is a great way to build habits and attitudes towards problem solving that transfer readily to programming.
So the question that's truly of interest here is not whether programming is math, but
In what ways, and to what extent, does learning and knowing math and computer science help programmers solve problems?
In fairness to Sarah, her post is indeed about this more interesting question; the title of her post is just clickbait. In fairness, so is the title of this post.
While I agree with some of Sarah's points, there are flaws in her arguments. I plan to critique some of those arguments, and then shed some light on the question of interest from my perspective as an almost-mathematician-turned-software-engineer. In this post, I'll address claim 1, and in Part 2 I'll address claims 2 and 3.
Sarah asserts that "the vast majority of developer jobs only required middle-school math at the most." There are a couple of problems I see with this:
Sarah does acknowledge the possibility she simply wasn't seeing the jobs that required more math than that, but doesn't take that possibility to its logical conclusion. And it's perfectly understandable. Sarah spent many years as a consultant, working predominantly with Rails and JavaScript, and teaching clients the Agile Way. Now she teaches Rails and JavaScript to newcomers.
If you have Rails, JavaScript, and Agile on your resume, and you're a relative novice to these things, you will be inundated with messages from recruiters+ at various companies who all need you to solve roughly the same class of problems, a class of problems that usually requires very little math. If you're as talented and experienced as Sarah, I can only imagine. What you will undoubtedly see is that there are a ton of job openings like this that don't require much math. What that evidence does not bear out is the conclusion that the vast majority of jobs are like this.
What about finance, supply chain management, graphics, game engine programming, machine learning, and signal processing, just to name a few? In his response to Sarah's post, Jeremy Kun expounds on several of the "mind-bogglingly widespread applications of mathematics to industry." I recommend checking it out.
+Pro Tip: If you need a job, and don't mind working on the next Facebook/Pinterest/Instagram for dogs/seniors/snowboarders, then 1. Learn Rails and JS, 2. Profit.
Sarah talks about the math that is required for those jobs. Math may not be required for many jobs, but are there applications of math that allow you to solve problems in a better way than simply what's required? That's the real question. I'm going to get to a couple examples in just a second, but the gist is:
I would liken it to knowing the power of raw SQL and how to roll your own queries vis-à-vis relying on an ORM for everything. Replace "math skills" with "roll your own SQL" in the statements above. The analogy is clear, with math simply being a bigger and more powerful set of skills than SQL.
WARNING: There is actual math below, with equations and this thing: $\sum_{k=1}^{N-1}$, and bipartite graphs and Pascal's Triangle. If you wish to continue reading about the merits of math but don't want to look at any math just now, do not pass this link, go directly to Part 2: ☞
Otherwise, here are a couple of recent examples where I was working on implementing or testing a feature that had nothing to do with math, but was able to apply math to great effect.
This first example is about using pure graph theory to improve a BDD testing framework for Golang. Gomega is a matcher library which is often used along with the Ginkgo testing framework. Gomega allows you to make assertions like Expect(foo).To(Equal("bar")).
The ConsistOf matcher was recently added, allowing one to make assertions such as Expect([]int{1, 3, 2}).To(ConsistOf(1, 2, 3)).
It's useful when you want to say that some actual slice or array should look like some expected sequence of values, but the order doesn't matter. One of the features of this matcher is that it allows composition with other matchers. For instance, there's a ContainElement matcher which you can use like this: Expect([]int{1, 2, 3}).To(ContainElement(2)).
Now you can compose this with ConsistOf to make assertions like Expect([][]int{{1, 2}, {3}}).To(ConsistOf(ContainElement(2), ContainElement(3))).
The order of the sub-matchers shouldn't matter, so the following assertion should also pass: Expect([][]int{{1, 2}, {3}}).To(ConsistOf(ContainElement(3), ContainElement(2))).
and indeed it did. The problem arose when you had an assertion like this (say x = []int{1, 2} and y = []int{2}): Expect([][]int{x, y}).To(ConsistOf(ContainElement(2), ContainElement(1))).
It should pass: x satisfies both sub-matchers, and y satisfies the first one. But it didn't. The problem is that the implementation of this matcher would look at the first sub-matcher, ContainElement(2), and find the first element in the given slice that satisfied it, x in this case. At that point, x is no longer available. Then it tries to find a match for the next sub-matcher, ContainElement(1), and it only has one element to choose from, namely y. But y doesn't satisfy this, so it falsely reports a failure for the assertion.
What's the fix? Well, naively, you might consider going through every permutation of the input slice, and seeing if the ith element of the (permuted) slice satisfies the ith matcher. If, for some permutation, every element satisfies its corresponding matcher, then the assertion is marked as passing; else, it's a failure.
How bad is this approach? O(n!) -- really bad. Graph theory to the rescue! Thanks to this answer on cs.stackexchange.com, I was able to find the Hopcroft-Karp algorithm and apply it to this problem. Its runtime? O(n^2.5) -- not bad at all compared to the original O(n^2) implementation, and, more importantly, no false negatives!
Source: https://www.mathworks.com/matlabcentral/fileexchange/screenshots/1307/original.jpg
Here's the idea. Model the problem as a bipartite graph, with the n elements of the actual array or slice on the left, and the n sub-matchers on the right. Do a pre-processing step of going through each pair of an element and a matcher (there are n^2 such pairs), and connect the two vertices with an edge if that element satisfies that matcher. Then, use Hopcroft-Karp to do the hard work of determining if there is a way to choose exactly n edges so that each element on the left is paired with a unique matcher on the right, i.e. no two edges in this selection share a vertex. In the picture above, the bold edges represent an attempt to do this, except it was only able to find 3 edges. You can convince yourself that in that picture, there is a different way to choose 4 edges so that no two share a vertex, but not one that includes the 3 edges chosen so far.
I work on Cloud Foundry. It's a Platform-as-a-Service, so it lets SaaS developers push their source code, ask for n instances of the application to be run (in parallel, on separate servers), and then expose their app to their users on the web via a URL like my-app.my-domain.com, with the expectation that all traffic to that URL will be load-balanced across the n servers. I'm working on a team that's rewriting much of Cloud Foundry in Go. To test that our new code is working, we wanted to write a high-level system test which pushed an app, asked for 3 instances, and then made some requests to the app's URL, somehow asserting that it eventually hit all 3 instances.
Now if the load balancer is doing its job and randomly but uniformly distributing load to all three servers, then there's some chance, albeit small, that even if you curl the endpoint 100 times, you'll never hit one of the app instances. In other words, even if the parts responsible for starting up 3 instances are working, and even if the parts responsible for keeping instances up and running (or restarting them quickly if they crash) are working, and even if the load balancer is being fair and balanced, there's some chance that you'll just happen to never hit one (or two) of those instances. A case like that would be a false negative.
So the question is, if I'm going to write a test that hits the app's endpoint in a for loop, how many times do I have to iterate to be 99.9% sure that I won't encounter a false negative? We want to be pretty sure that if this test ever fails in the future, it's catching a real failure within the system.
The solution: let's solve for N, where N is the smallest integer such that the probability of a false negative when hitting the endpoint N times is at most 0.1%, or 0.001. Before reading further, take a guess as to what N might be. 5, 10, 100, 1000?
The probability of a false negative is equal to the number of ways a false negative can occur, divided by the total number of possible outcomes. Here, an "outcome" is the sequence of the N instance numbers hit when repeating the curl, e.g. if N = 14, one possible outcome is [1, 2, 2, 1, 1, 3, 3, 2, 1, 3, 2, 2, 1, 3]. Clearly, there are 3^N total possible outcomes.
How many outcomes are false negatives? There are two kinds. The kind where you only ever hit one of the instances, so [1, 1, ...], [2, 2, ...], and [3, 3, ...]. There are just 3 of those. The other kind is where you only hit two of the three instances. So you either only hit 1 and 2, or only 2 and 3, or only 1 and 3. By symmetry, you can see that the number of outcomes for each of those three cases is the same, so let's just count one case and multiply by 3. How many ways to only hit 1 and 2? This means that you hit instance 1 somewhere between 1 and N − 1 times, and instance 2 the rest. Breaking it down further, for some k between 1 and N − 1, how many outcomes involve hitting instance 1 exactly k times, and instance 2 the remaining N − k times? It's easy to see that it's just ${N \choose k}$. So the inequality we want to solve is:
$\frac{3 + 3\sum_{k=1}^{N-1}{N \choose k}}{3^N} \leq 0.001$
Now here's something neat. We're almost looking at $\sum_{k=0}^{N}{N \choose k}$, which you might recognize as the sum of the Nth row in Pascal's Triangle. And that sum reduces to 2^N because given a set of size N, the number of ways to choose a subset of size 0, plus the number of ways to choose a subset of size 1, ..., plus the number of ways to choose a subset of size N, is simply the total number of ways to choose a subset. And an equivalent way to choose a subset is to look at each element and make the binary choice "yes, you're in the subset" or "no, you're out", and there's 2^N ways to do that. This kind of argument is called a combinatorial argument, where you prove two things are equal by showing that they represent two ways to count the same thing.
Source: https://www.mathsisfun.com/images/pascals-triangle-4.gif
An alternative argument uses the Binomial Theorem and the observation that (1 + 1)^N = 2^N. At any rate, we get:
$\frac{3 + 3\cdot(2^N - {N \choose 0} - {N \choose N})}{3^N} \leq 0.001$
$\frac{2^N - 1}{3^{N-1}} \leq 0.001$
The smallest N satisfying this is N = 20.
And that's indeed what we do. We poll 20 times, and then assert that we see all 3 instances. By the way, did you guess 20?
Before we move on, here's a question: what if we have more than 3 instances? Let's just say 4. The problem already gets way harder. The Pascal's Triangle trick no longer applies. How do you model the problem now? Well, here's one approach, and the pretty results:
Now let's move on to Part 2.
We're gonna take a look at a concrete application of the k-NN algorithm, compare the performance of the implementations from those aforementioned blog posts with new implementations in Golang and Haskell, and take a look at an optimized version which takes a logical shortcut and also leverages Golang's built-in support for concurrency.
All the code and datasets can be found on Github. The Golang and Haskell code is also at the bottom of this post.
TL;DR: Golang wins, or, in honor of the World Cup: GOOOOOOOOOOLLLLLLLang!!!
In this particular example, we've got 5000 pixelated (28x28) greyscale (0-255) "drawings" of the digits 0 through 9. Some of them might look like this:
Source: https://onlinecourses.science.psu.edu/stat857/node/186
These 5000 digit drawings constitute our training set. We're then given a bunch of new drawings where (let's pretend for a moment) we don't know what digits they're supposed to represent, but we know the greyscale values at each pixel. Given any such unclassified drawing, our goal is to make a reasonable guess as to what digit it's supposed to represent. The way this algorithm works is to find the drawing in the training set which most nearly resembles our unclassified drawing, then our reasonable guess is that the unclassified drawing in question represents the same digit as the nearest drawing in the training set. At this point, we can say that we've classified our previously unclassified drawing.
But what does "nearly resemble" mean in this case? Roughly, we want to look at how different a pair of drawings is, pixel by pixel, and aggregate those differences for all the pixels. The smaller the aggregate pixel difference, the nearer the resemblance. The standard measure of distance here is the Euclidean metric: Given two vectors x⃗, y⃗ of length 28 × 28 = 784 consisting of 8-bit unsigned integers 0…255, we define their distance to be:
$d(\vec{x}, \vec{y}) = \sqrt{\sum_{i=0}^{783} (x_i - y_i)^2}$
In this problem we're given 500 drawings to classify, and they form our validation set. After running the algorithm against all 500, we can see what percentage of them we classified correctly (because we actually are given their labels, we just pretend not to know them when doing the classification), and how long it took to do them all.
The data is given to us as a couple of CSV files, one for the training set, one for the validation set. Each row corresponds to a drawing. The first column is the label (i.e. what digit the drawing represents), and the next 784 columns are the greyscale values of each pixel in the drawing.
Note that the above describes the k-Nearest Neighbour classification in the case k = 1. If we wanted to do it for k > 1, we would take an unclassified drawing and find the k nearest drawings in the training set, and then classify the drawing according to whichever digit is represented most amongst those k nearest drawings.
This post was inspired by a chain of blog posts, each of which contains implementations of the algorithm in a different language (or two). All the implementations are naive, in that they pretty much do the simplest thing possible, and take hardly any shortcuts to speed up or skip calculations:
I work for Pivotal on the Cloud Foundry project and recently joined the Diego team where I was introduced to Golang. I thought it'd be fun to add naive and optimized implementations in Golang to the comparison. Then I came across an awesome primer on Haskell (http://learnyouahaskell.com/) so the incomparable @alexsuraci and I paired on adding Haskell to the mix.
Performance comparisons between the naive implementations in each language were performed on a freshly spun up c3.xlarge EC2 instance as follows: run each compiled executable with time ./<executable-name>, run the Golang code with time go run golang-k-nn.go, and run the Factor code in the scratchpad REPL with [k-nn] time.
The Golang implementation gets a major performance boost from two optimizations: a logical shortcut, and use of Golang's built-in support for concurrency.
[Var]
$\underline{x:\sigma \in \Gamma}$
Γ ⊢ x: σ
This translates to: If "x has type σ" is a statement in our collection of statements Γ , then from Γ you can infer that x has type σ. Here x is a variable (hence the name of this rule of inference). Yes, it should sound that painfully obvious. The terse, cryptic way that [Var] is expressed isn't that way because it contains some deep, difficult fact. It's terse and succinct so that a machine can understand it and type inference can be automated.
[App]
$\underline{\Gamma\vdash e_0:\tau\rightarrow\tau '\ \ \ \Gamma\vdash e_1:\tau}$
Γ ⊢ e0(e1): τʹ
This translates to: If we can infer that e0 is an expression whose type is τ → τʹ (e.g. e0 might be an anonymous function which, according to Γ , takes input of type τ and returns output of type τʹ), and we can infer that e1 has type τ, then we may deduce that we can infer that e0(e1), the expression obtained by applying e0 to e1, has type τʹ. The intuitive gist is if we can infer the types of the input and output of a function, and we can infer some expression has the same type as the input of the function, then when we apply the function to that expression, we can infer the result expression has the type of the output of the function. Nothing bewildering here.
[Abs]
$\underline{\ \ \Gamma, x:\tau \vdash e:\tau '\ \ }$
Γ ⊢ λx. e: τ → τʹ
This translates to: If by assuming that x has type τ we were able to infer that e has type τʹ, then we may deduce that we can infer that the abstraction/anonymization of e with respect to the variable x, λx. e, has type τ → τʹ. So, for example, we know that if x has type String, then the expression x[0] has type Char. Now [Abs] allows us to deduce that the anonymous function λx. x[0] has type String → Char.
Aside. I mentioned polytypes earlier. Let's revisit them in this example, just to help hammer it home. Note that this function above also has type Array[Int] → Int. In fact, for any type t, the function has type Array[t] → t. So it has many different types, String → Char being just one of them. Each of its types of the form Array[t] → t is a monotype. We can express that this function has all of these monotypes by saying that it has the polytype ∀ t(Array[t] → t). We read that as "for all t, the type Array[t] → t" and we treat that whole thing as a single, yet more abstract, type. So note that when we infer the type of some expression, that doesn't mean that said type is the only type of that expression. An expression can have many types, and some of these types can be specializations of more abstract types. The simplest kinds of types are monotypes: Int, String, String → Char, etc., but we can have more abstract/general types called polytypes.
[Let]
$\underline{\Gamma \vdash e_0:\sigma\ \ \ \ \Gamma , x:\sigma \vdash e_1 : \tau}$
Γ ⊢ let x = e0 in e1 : τ
Easy:
If we can infer that e0 has type σ, and
If we were to assume x had type σ we could infer that e1 has type τ,
Then we may deduce that we can infer that the result of letting x = e0, and substituting it into e1, has type τ.
These last four rules do nothing more than formally capture our intuition about what type inferences we can make when we have variables and we do things like create anonymous functions, apply functions, and substitute expressions into other expressions. It's something we as programmers can do intuitively, and here we're just saying that this is something we can formally describe, what's happening in our brains isn't necessarily magical. It's also worth noting that these last four rules correspond precisely with the four rules for defining what a valid expression is in the Lambda Calculus. This is not a coincidence.
[Inst]
$\underline{\Gamma \vdash e:\sigma '\ \ \ \ \sigma '\sqsubseteq \sigma}$
Γ ⊢ e: σ
This is about instantiation. You can think of the monotype Array[Int] → Int as an instantiation of the polytype ∀ t. Array[t] → t. Another word for this is "specialization": Array[Int] → Int is more specialized/specific than ∀ t. Array[t] → t. Flipping it around, we denote the "less specialized/specific than" relation between types with ⊑ . So
∀ t. Array[t] → t ⊑ Array[Int] → Int
So the direct translation of [Inst] is: If we can infer e has type σʹ, and σ is a specialization/instantiation of σʹ, then we can deduce that we can infer that e has type σ. And you can think of σ and σʹ as being types like Array[t] → t and ∀ t. Array[t] → t respectively.
[Gen]
$\underline{\Gamma \vdash e:\sigma\ \ \ \ \alpha \notin \mathrm{free}(\Gamma)}$
Γ ⊢ e: ∀ α. σ
This is the hardest one to understand. It really only makes sense in the context of doing a type inference using this restricted set of rules we're outlining. It doesn't have a very concrete analogue, since it heavily depends on the concept of a variable type, something that never occurs in any real programming language, but which is an indispensable concept when we're trying to work in a meta-language that talks about types in any arbitrary real programming language. The idea can sort of be captured in this "example":
Suppose you have some variables x and y, and for the time being you're assuming they have type α, where α is a variable standing for a type. You later come across an expression that you somehow manage to infer has type α → α in this context (the context where you're assuming x and y have type α). The question is, will this function have the polytype ∀ α. α → α? I.e. does this function generally map objects to things of the same type, or does that only appear to be the case because you assumed x and y had the same type α?
Since α is a variable type, i.e. it could stand for any type, we might like to think that, since we've inferred that e has type α → α that it has the polytype ∀ α. α → α. But we can't necessarily make this generalization without more insight into how e is related to x and y; In particular, if our inference that it has type α → α is tightly coupled to our prior assumptions involving α, then we shouldn't conclude that it generally has the polytype ∀ α. α → α.
Here's the translation:
If some variable type α hasn't "freely" been mentioned in our current context/set of knowledge/assumptions, and we can infer that some expression e has some type σ, then we can infer that e has type σ independent of what α turns out to be. Slightly more technically, e has the polytype ∀ α. σ.
Okay, but what does "freely mentioned" mean? In a polytype like ∀ α. α → α, α isn't "really" being mentioned. That type is the exact same as this one: ∀ β. β → β. An expression with either type is just that of a function that sends any type to itself. On the other hand, x: α "really" does mention α.
x: α
y: β
and
x: α
y: α
mean different things. The latter means x and y definitely have the same type (even though what that type is may not have been pinned down). The former tells you nothing about how the types of x and y are related. The difference is, when α is mentioned inside the scope of a ∀, as is the case in ∀ α. α → α, that α is just a dummy, and can be swapped out for any other type variable regardless of the rest of the context. So we can interpret the statement "α isn't freely mentioned in the context Γ" to say: "α is either never mentioned at all, or, if it is, it's only ever mentioned as a dummy and could in principle be swapped out for something entirely different without changing the semantics of the assumptions/knowledge in our context."
And that's it. Questions? Comments? Let me know.
We'll give a recursive definition of what an expression is; in other words, we'll state what the most basic kind of expression is, we'll say how to create new, more complex expressions out of existing expressions, and we'll say that only things made in this way are valid expressions.
```
e ::= x          (a variable is an expression)
    | e₁ e₂      (application: one expression applied to another)
    | λx. e      (abstraction: a function with parameter x and body e)
```
And nothing else is a valid expression.
Aside: anyone paying close attention will wonder, wait, how can I make any useful expressions out of this? How do I even get x2 + 2, or in fact 2 for that matter, out of the above? Heck, what about 0? There is nothing in the rules above which obviously yield the expression 0. The solution is to create expressions in the Lambda Calculus which behave like 0, 1, …, + , × , − , / when interpreted correctly. In other words, we have to encode numbers, numerical operations, strings, etc. into patterns we can create with the Lambda syntax. This blog post has a nice little section on numbers and numerical operations. This is a great feature of the Lambda Calculus: we have a simple syntax which we can define recursively in 4 simple clauses, and this therefore allows us to prove many things about it inductively in 4 main steps, yet the language itself has the expressive power to capture numbers, strings, and all the types and operations we could ever care about.
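To make the encoding idea concrete, here's a sketch of the Church encoding of numbers, using Ruby lambdas as a stand-in for λ-abstractions. The helper names are mine, and the real Lambda Calculus has no `+ 1`; that part exists only so we can inspect the results as ordinary integers:

```ruby
zero = ->(f) { ->(x) { x } }                          # "apply f zero times"
succ = ->(n) { ->(f) { ->(x) { f.(n.(f).(x)) } } }    # one more application of f
add  = ->(m) { ->(n) { ->(f) { ->(x) { m.(f).(n.(f).(x)) } } } }

# Convert a Church numeral back to an ordinary integer, for inspection only:
to_i = ->(n) { n.(->(i) { i + 1 }).(0) }

two = succ.(succ.(zero))
to_i.(two)              # => 2
to_i.(add.(two).(two))  # => 4
```

The number n is just "the function that applies f to x, n times" — which is exactly the kind of pattern the four grammar clauses above can express.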
Let e be any expression, that is, "e" is a variable in our meta-language which stands for any expression in our base language, like any of the following:
```
x
λx. x
(λx. x) y
```
Then if t is any type, we can express "e is of type t" by
e: t
Just like e, t is a variable in our meta-language, and it can stand for any type in the base language, like Int, String, etc. And just like e, t doesn't necessarily need to stand for any one type in particular.
One can give a formal definition for what counts as a type, just as we did for expressions above. However the abstraction gets fairly twisted, so we'll leave it at that. I should just point out a couple of key things to keep in mind. Consider the identity function:

```
λx. x
```
This function is of type String → String. But it's also of type Int → Int. In fact, for any type t, it has type t → t. We're gonna say that it has type ∀ t. t → t. Each of the types String → String and t → t is a "monotype"; ∀ t. t → t is a "polytype". The identity function above has the abstract polytype ∀ t. t → t, which, in practice, means that for every real type t, it has type t → t. If all of this has been sinking in, then we can compactly express this as:
λx. x: ∀ α. α → α
Now we're going to want to formalize a bunch of rules for how we can go from some knowledge of expressions and their types to inferring types of more expressions. Remember how propositional calculus formalized Modus Ponens? We're going to do something similar. For instance, say we want to formalize the following piece of reasoning:
Suppose I've already been able to infer that a variable `duck` has type `Animal`.
Suppose furthermore that I've inferred that `speak` is a method of type `Animal -> String`.
Then I can infer that `speak(duck)` has type `String`.

And any reasoning of this form is valid type inference.
We'll formalize that as follows:
$\underline{\Gamma\vdash e_0:\tau\rightarrow\tau '\ \ \ \Gamma\vdash e_1:\tau}$
Γ ⊢ e0(e1): τʹ
That rule has the name [App] (for application), and it's one of the ones pictured in that StackOverflow question. We'll talk about it and the rest of the rules in the next post. For now, let's first get a handle on all the symbols you see above:
$\underline{\Gamma \vdash y:\sigma}$
Γ ⊢ x: τ
If we can infer that y has type σ from Γ , then we can infer x has type τ from Γ .
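To see the schema in action, plug the duck example into [App]: take e₀ = speak, e₁ = duck, τ = Animal, and τʹ = String, and the rule instance reads:

$\underline{\Gamma\vdash \mathtt{speak}:\mathtt{Animal}\rightarrow\mathtt{String}\ \ \ \Gamma\vdash \mathtt{duck}:\mathtt{Animal}}$

Γ ⊢ speak(duck): String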
Next up:
Let's step back for a sec and fill in some context. What are we trying to do? We'd like some way to measure how hard it is to guess our passwords, a number that serves as a heuristic standard of password strength. But there are two fundamentally different things we might want to measure:
How hard would it be for someone to guess your password with essentially no knowledge of how you created your password?
How hard would it be for someone to guess your password if they knew the process used to generate it? This is of course assuming that there is a process, for example some script that does some `Math.rand`-ing and produces a password string.
The term "entropy" has been used to refer to both kinds of calculations, but they're clearly entirely different things: the former essentially takes a string as input, the latter takes a random process as input. Hence, "entropy is not entropy."
Alright, well if entropy isn't entropy, let's see what entropies are. We'll look at the standard mathematical formulation of the random-process-entropy which comes from information theory. And we'll look at the function used to calculate particular-string-entropy in some password strength tester (e.g. http://rumkin.com/tools/password/passchk.php). And that's all we're going to do, we'll look at how the calculations are done, without dwelling too much on the differences between the two approaches or what their use cases are.
For our purposes, a random process will be determined by the set of all the possible outputs it can produce, and the probability associated with each output. If you roll a fair die, the possible outputs are 1, 2, 3, 4, 5, and 6, and each output happens to have the same probability, 1/6. If you have a process that rolls a die and then yells "fizz" if the output is divisible by 3, "buzz" if it's divisible by 5, and just repeats the number otherwise, then the possible outputs are:
1, 2, "fizz", 4, and "buzz"
with corresponding probabilities:
1/6, 1/6, 1/3, 1/6, and 1/6
We would like to favor processes that can potentially produce a lot of different outputs, and that give all the different outputs a fairly similar chance of occurring. If we had a process that picked a number between 1 and 999999999, but 95% of the time it picked 1, it wouldn't be that great since if an attacker knew this about our process, he or she could just make one guess, namely 1, and have a 95% chance of accessing whatever was supposed to be secured by a password generated by this process.
So here's the formula: given a process with possible outcomes o1, …, on, and respective probabilities p1, …, pn, the entropy of this process is given by
$-\left[p_1\log_2(p_1) + \dots + p_n\log_2(p_n)\right]$
This formula satisfies the two criteria above, and additionally has the following aesthetically pleasing feature: the random process which generates n independent random bits has entropy n. Let's prove it:
That process has $2^n$ possible outcomes, each with equal probability, namely $1/2^n$ (it's like rolling a $2^n$-sided fair die).
$- \left [ \frac{1}{2^n}\log _2\left(\frac{1}{2^n}\right) +\dots + \frac{1}{2^n}\log _2\left(\frac{1}{2^n}\right)\right]\ \ \ (2^n\mbox{ times})$
$= - \left[ 2^n \times \frac{1}{2^n}\log _2\left(\frac{1}{2^n}\right) \right]$
$= - \log_2\left(\frac{1}{2^n}\right)$
$= -(-n)$
$= n$
What about the entropy of our fizz-buzz process?
$- \left[4 \times \frac{1}{6}\log_2\left(\frac{1}{6}\right) + \frac{1}{3}\log_2\left(\frac{1}{3}\right)\right] \approx 2.252$
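Both calculations are easy to check in code. A quick Ruby sketch (the method name is mine):

```ruby
def entropy(probabilities)
  # -[p1*log2(p1) + ... + pn*log2(pn)]
  -probabilities.sum { |p| p * Math.log2(p) }
end

entropy(Array.new(8) { 1.0 / 8 })             # => 3.0    (3 independent random bits)
entropy([1.0/6, 1.0/6, 1.0/3, 1.0/6, 1.0/6])  # => ~2.252 (the fizz-buzz process)
```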
Here's the heart of the code used on the Rumkin Strength Test webpage (http://rumkin.com/tools/password/passchk.php) to estimate password strength:
(The original JavaScript snippet didn't survive in this copy; see the Rumkin page linked above for the source.)
And here it is, refactored, Ruby-fied, and decorated with a couple of additional comments:

(The Ruby version likewise didn't survive in this copy.)
There's a couple of questions this should evoke. How is it determining how unlikely it is for two characters to show up consecutively? If you look at the original code you can see it's looking things up in a table. It gets the index of each of the two characters (what it calls `aidx` and `bidx`), and then finds the entry in the table corresponding to that pair: the `aidx * 27 + bidx`-th entry in an array representing the frequency table. This suggests that the table is 27 × 27. It treats upper-case letters the same as lower-case, and it treats all numbers and special characters exactly the same! Indeed, it'll tell you that the following two passwords have the same entropy:
^341^)8@#05&*6%%#$7(9!24%
and
!111111111111111111111111
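To see why, here's a rough Ruby reconstruction of just the indexing scheme described above (my reconstruction, not the site's actual code): a 27-slot alphabet where index 0 stands for "anything that isn't a letter".

```ruby
# Index 0 = "not a letter"; 1..26 = a..z, case-insensitive, as described above.
def char_index(ch)
  ch =~ /[a-z]/i ? ch.downcase.ord - 'a'.ord + 1 : 0
end

# The frequency-table slot consulted for each consecutive pair of characters.
def pair_table_indices(password)
  password.chars.map { |c| char_index(c) }
          .each_cons(2)
          .map { |a, b| a * 27 + b }
end

# Every character in both passwords maps to index 0, so both reduce to the
# exact same sequence of table lookups:
pair_table_indices('^341^)8@#05&*6%%#$7(9!24%') ==
  pair_table_indices('!111111111111111111111111')   # => true
```

Since the frequency-table lookups are identical, the pairwise part of the two scores must come out identical too.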
Also, notice how it only considers two consecutive characters at a time, never "learning" from patterns it might have observed earlier in the string. So for instance, given a password like `aaaaaaaaaaaaa`, it doesn't "catch on" by the 6th or 7th `aa` pair that there seem to be a lot of `aa` pairs. It's just as "surprised" to see the last `aa` pair as it was to see the first.
And what about this character pool size estimation? Some experimenting will show you that if it detects a single numerical digit, it assumes all ten digits were available in the selection pool for every single letter. If it sees a single character from `!@#$%^&*()`, it'll assume they were all available, and again, for every letter. So in particular, if your password is mainly a bunch of English letters and then you just throw in a single number and one of `!@#$%^&*()` at the end, you'll get a bump to your password strength.
Going back to the consecutive pair thing, why just consecutive pairs? What about triples?
Oh, and what about that business of squaring the "unlikeliness" of the consecutive characters? The unlikeliness is a probability, hence a number between 0 and 1, so squaring it results in a smaller number. Since these numbers get added into the entropy calculation, the result is a smaller final entropy. Presumably this is an attempt to capture the assumption that the attacker is a good guesser. But why squaring specifically?
To be clear, these are simply questions worth asking, not criticisms per se. This way of calculating entropy is supposed to be naive, and is supposed to make as few assumptions as possible about how the given password is generated, and thus what patterns to expect. If it made stronger assumptions, then it would be very bad at estimating the password strength of even a very weak process that simply made sure to contradict those assumptions.
Well, I guess some of them are criticisms. And there are criticisms to be leveled against the ubiquity of the usual "your password must contain 8-20 characters, at least one number and one special character," etc. But that's all for this post.
Oh, and not only is entropy not entropy, entropy is also this thing:
$\Delta S = \int \frac{dQ_{rev}}{T}$
`.html.erb` file into a `.html` file), where the resulting LaTeX file has the output of the R code. It makes it super easy to embed statistical calculations, graphs, and all the good stuff R gives you right into your TeX files. It lets you put math in your math, so you can math while you math.
I've got a little project which:
Here's what the pre-"knitted" LaTeX looks like with the embedded R:
(The embedded LaTeX/R source didn't survive in this copy.)
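If you haven't seen knitr source before, a minimal document looks something like this (a generic sketch, not this project's actual file):

```latex
\documentclass{article}
\begin{document}

% Inline R: \Sexpr{} splices an R result straight into the prose.
The sample mean is \Sexpr{mean(c(1, 2, 3, 4))}.

% A chunk: knitr runs the R code, typesets it, and inserts its output
% (here, a rendered histogram) in its place.
<<histogram, fig.width=4, fig.height=3>>=
x <- rnorm(100)
hist(x)
@

\end{document}
```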
You can comment out the line in the `factory` script that deletes the `tds2012-out.tex` file if you want to see what it looks like post-knit. The resulting TeX file basically contains a ton of new command definitions, but the meat of it is what it does with your R code. It formats and displays the R code itself, and then it displays the output of the R code. Wherever the output is a graph, you'll see `\includegraphics[...]{...}`. knitr will do the R computation, render the graphics, create a `figures` subdirectory and store them there for the `\includegraphics` to reference. Whenever the output is simply text or mathematical expressions, you'll see the R output translated to pure LaTeX markup.
Pretty cool stuff!
Let's quickly see why the real life answer is "no." But first we should lay out the assumptions implicit in the problem. We're going to assume that at some point in time, everyone was either entirely Swedish or entirely non-Swedish. There's a chicken-and-egg problem that we're sweeping under the rug here, but that's what rugs are for. Next, we're assuming that every person after that point in time has their Swedishness wholly and equally determined by their parents' Swedishness. So if mom is 17% Swedish and dad is 66% Swedish, then baby is ½ × 17% + ½ × 66% = 41.5% Swedish.
So why is 15% impossible? Or, for that matter, all the numbers in the previous example: 17%, 66%, 41.5%? The reason is that any person's Swedishness must be a fraction which, in lowest terms, has a denominator that is a power of 2. There's an easy proof by induction. Initially everyone is either entirely Swedish or entirely non-Swedish. In lowest terms these fractions can be expressed as 1/1 and 0/1, respectively. The denominator, 1, is a power of 2 (fyi $2^0 = 1$). Now, assuming a mom and dad are Swedish in proportions $m/2^M$ and $d/2^D$ respectively, their offspring will be this Swedish:
$\frac{2^Dm + 2^Md}{2^{D+M+1}}$
The denominator is a power of 2, and reducing this fraction to lower terms will not change that fact. Numbers like 15%, a.k.a. 15/100, or 3/20 in lowest terms, have denominators which aren't powers of 2, and that's why no one can ever really be 15% Swedish.
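The argument is easy to play with in code, using Ruby's built-in Rational (the helper names are mine):

```ruby
# A fraction is "dyadic" when, in lowest terms, its denominator is a power of 2.
def dyadic?(q)
  d = q.denominator        # Rational reduces to lowest terms automatically
  (d & (d - 1)).zero?      # power-of-two bit trick
end

# Baby's Swedishness is the average of mom's and dad's.
def child(mom, dad)
  (mom + dad) / 2
end

dyadic?(Rational(3, 16))                        # => true:  3/16 is achievable
dyadic?(Rational(15, 100))                      # => false: 15% reduces to 3/20
dyadic?(child(Rational(1, 2), Rational(3, 8)))  # => true:  averaging preserves it
```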
What if we keep the assumption that Swedishness is determined equally by the parents' Swedishnesses, but without assuming there was some point in time where everyone was either entirely Swedish or entirely non-Swedish? Let's weaken that assumption to simply state that every person has an ancestor that's either entirely Swedish or entirely non-Swedish. And let's do one more crazy thing: let's allow human history to go back infinitely through the generations, with no beginning. So there could be an infinitely long lineage of Swedes without there being a first Swede. If it helps, imagine a family tree that's infinitely tall, with no original top level. In this universe of bastardized metaphysics, can you be 15% Swedish? Why yes!
We know how decimal numbers work. 12.34 as a decimal number means
$1 \times 10^1 + 2 \times 10^0 + 3 \times 10^{-1} + 4 \times 10^{-2}$
Binary numbers work the same way, with 2's instead of 10's. So the number four is:
$1 \times 2^2 + 0 \times 2^1 + 0 \times 2^0$
So it's represented as 100 in binary. Similarly one-half is:
$0 \times 2^0 + 1 \times 2^{-1}$
So it's represented as 0.1 in binary. What about a number like one-third? It's equal to the value of the following infinite sum:
$0 \times 10^0 + 3 \times 10^{-1} + 3 \times 10^{-2} + \dots$
So it's represented as 0.33… in decimal. It's also equal to the following infinite sum:
$0 \times 2^0 + 0 \times 2^{-1} + 1 \times 2^{-2} + 0 \times 2^{-3} + 1 \times 2^{-4} + 0 \times 2^{-5} + 1 \times 2^{-6} + \dots$
So it's represented as 0.010101… in binary.
Now why do we care? Well, if you've read this far, then you care because you can use the binary representation of a number to figure out what a person's family tree could look like if their Swedishness is equal to that number. For example, how can you be 1/2 Swedish? Well 0.1 is the binary representation of 1/2, and this tells us that if we have one parent who is entirely Swedish (and hence all the ancestors on that side are entirely Swedish), and one parent who is entirely non-Swedish (along with all their ancestors), then you can be 1/2 Swedish.
How about a more involved example? To be 3/16 Swedish, which is 0.0011 in binary, you can first have one great-grandparent who is entirely Swedish; let's call her Agnetha. Agnetha's parents will of course have to be entirely Swedish too. In addition to them you'll need one more great-great-grandparent who's entirely Swedish; let's call him Bjorn. If you have great-grandma Agnetha and great-great-grandpa Bjorn who are entirely Swedish (as must be all their ancestors), and if all of your ancestors who aren't descendents of Agnetha and Bjorn are entirely non-Swedish, then you'll be exactly 3/16 Swedish. How does this look in terms of your family tree? If we say you are at level 0, your parents at level 1, etc., then what we get is the following:
The first full Swede is Agnetha, on level 3.
The next full Swede who isn't logically "forced" to be Swedish on account of being Agnetha's ancestor is Bjorn, on level 4.
There are no other full Swedes except Agnetha's and Bjorn's ancestors.
Everyone else is entirely non-Swedish unless they're "forced" to be a bit Swedish on account of being descendents of Agnetha and/or Bjorn.
Notice how the "level 3" and "level 4" correspond to the locations of the 1's in the binary expansion of 3/16? If you go back to the simpler example of 1/2, which was 0.1 in binary, you'll see that we have one full Swede on level 1 and everyone except that person's ancestors is fully non-Swedish.
So here's how you can be 15% Swedish:
$15\% = 2^{-3} + 2^{-6} + 2^{-7} + 2^{-10} + \dots$
So if you have great-grandma Agnetha (level 3), great-great-great-great-grandpa Bjorn (level 6), great-great-great-great-great-grandpa Benny (level 7), great-great-great-great-great-great-great-great-grandma Anni-Frid (level 10), and so on down the infinite expansion. And all of them are entirely Swedish, and none of them are ancestors/descendents of one another. And everyone else on your family tree who isn't a blood relative of theirs is entirely non-Swedish. Then you will be 15% Swedish!
Okay, so all that was incredibly silly. Can we say anything that's merely very silly? Say you want to know if you can be 15% Swedish in real life, but within some error bounds. Maybe you want to know if you can be 15% Swedish, give or take 1%. Easy: find a finitely-long binary number that's between 0.14 and 0.16, and repeat the above steps with that number. One simple way to do that is to start finding the binary expansion of 0.15, and stopping once you're within the desired range:
$2^{-3}$?
Nope, that's 0.125, too small.
$2^{-3} + 2^{-6}$?
Yup, that's 0.140625.
Could we have known ahead of time how much of the binary expansion of 0.15 we'd have to calculate before reaching the desired range? Yup, we can do that too. Once you've written a binary number out to n digits, no matter what digits you add on next, the most you can add to your current number is $2^{-n}$. For example, all the binary numbers that start with 0.11010… must be within $2^{-5} = 0.03125$ of one another. So if I know I want to be within 0.01 (in decimal, i.e. 1%) of 15%, I just have to apply the above reasoning backwards: $\log_2(0.01) \approx -6.64$, so going back 7 generations is always guaranteed to be enough. In this case we get lucky a bit sooner. 5 generations back I have 32 ancestors, and the closest fraction of the form x/32 to 0.15 is 5/32. So if I have exactly 5 totally Swedish level-5 ancestors and the remaining 27 level-5 ancestors are totally non-Swedish, I will be within 1% of being 15% Swedish (I'll be 15.625% Swedish to be exact).
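That greedy procedure can be sketched in a few lines of Ruby (the method name is mine): keep taking powers of ½ that fit under the target until the running sum is within tolerance.

```ruby
# Returns the generation levels n whose 2**-n terms we kept, stopping as soon
# as the running sum is within `tolerance` of `target` (approaching from below).
def binary_terms(target, tolerance)
  levels = []
  sum = 0.0
  n = 0
  while target - sum > tolerance
    n += 1
    term = 2.0**-n
    if sum + term <= target
      sum += term
      levels << n
    end
  end
  levels
end

binary_terms(0.15, 0.01)  # => [3, 6]  (1/8 + 1/64 = 0.140625, within 1% of 15%)
```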
Before we figure out what it means, let's get an idea for why we care in the first place. Daniel Spiewak's blog post (link broken) gives a really nice explanation of the purpose of the HM algorithm, in addition to an in-depth example of its application:
Functionally speaking, Hindley-Milner (or “Damas-Milner”) is an algorithm for inferring value types based on use. It literally formalizes the intuition that a type can be deduced by the functionality it supports.
Okay, so we want to formalize an algorithm for inferring types of any given expression. In this post, I'm going to touch on what it means to formalize something, then describe the building blocks of the HM formalization. In Part 2, I'll flesh out the building blocks of the formalization. Finally in Part 3, I'll translate that StackOverflow question.
Okay, so we want to talk about expressions. Arbitrary expressions. In an arbitrary language. And we want to talk about inferring types of these expressions. And we want to figure out rules for how we can infer types. And then we're going to want to make an algorithm that uses these rules to infer types. So we're going to need a meta-language: a language to talk about expressions in an arbitrary programming language. This meta-language should be precise and unambiguous, and abstract enough that its statements and rules apply no matter which programming language the expressions come from.
To make all that a little more concrete, let's look at a really quick example of a formalization. Instead of formalizing a language for talking about inferring types of expressions in an arbitrary programming language, what if we wanted to formalize a language for talking about truths of sentences in arbitrary natural languages? Without formalization, we might say something like
Suppose I know that if it's raining, Bob will carry an umbrella.
And suppose I also know that it's raining.
Then, I can conclude that Bob will carry an umbrella.

And any argument that takes this form is a valid way to reason.
Propositional Calculus formalizes that whole thing as a rule known as Modus Ponens:
$\underline{A,\ \ A \rightarrow B}$
B
where A and B are variables representing propositions (a.k.a. sentences or clauses) in an arbitrary natural language.
Okay, so let's enumerate the building blocks of the HM formalization:
We will need:
a way of writing down arbitrary expressions;
a way of writing down types, both monotypes and polytypes;
a way of stating that an expression has a given type;
a context of assumptions, Γ, and the "we can infer" symbol, ⊢;
and a small set of rules for deriving new typing statements from ones we already have.
Onward, ho!
In order to use a PostgreSQL database for development, you'll need, in addition to the PostgreSQL package itself, a PostgreSQL server for your application to talk to. The PostgreSQL server package has the basic PostgreSQL package as a dependency, so we'll just run the command to install the server and we'll get both. The server package will allow you to run a process that serves your database, and the basic package provides a client that your Rails app will use to connect to and interact with (read, write, etc.) the database being served.
Pick the version of PostgreSQL you want to install. At the time I wrote this, the latest was 9.2.x so we'll go with that:
```
# Assuming MacPorts (the original command wasn't preserved in this copy);
# the server package pulls in the base postgresql92 package as a dependency.
sudo port install postgresql92-server
```
You'll likely see the following instructions in the installation output
```
# (Paths assumed; the original output wasn't preserved in this copy.
#  With MacPorts the post-install notes look roughly like this.)
sudo mkdir -p /opt/local/var/db/postgresql92/defaultdb
sudo chown postgres:postgres /opt/local/var/db/postgresql92/defaultdb
sudo su postgres -c 'initdb -D /opt/local/var/db/postgresql92/defaultdb'
```
Now, in order to start a PostgreSQL server process, it needs an initial database cluster within which you will create your databases. To do this you create a directory for the cluster and tell PostgreSQL to initialize it. PostgreSQL doesn't allow the superuser to initialize a database cluster. The user used to initialize the cluster should be one that will exist on any machine that has PostgreSQL, allowing you to collaborate on your Rails app with people using different machines, so it makes sense to use the 'postgres' user. The above commands will create the cluster directory, hand ownership of it to the 'postgres' user, and initialize it as a database cluster (as the 'postgres' user).
After you do this, you'll see the following instructions in the output:
```
# (Paths assumed; the original output wasn't preserved in this copy.)
postgres -D /opt/local/var/db/postgresql92/defaultdb
# or
pg_ctl -D /opt/local/var/db/postgresql92/defaultdb -l logfile start
```
The first will start the server in the foreground, which you probably don't want. The second will start it in the background, but dump a log file in whatever directory you execute the command, which you don't want either. You'll also need to start the server as the 'postgres' user, which the above command doesn't do as is, so the solution is to:
```
# Start the server as the 'postgres' user, with an explicit log location
# (paths assumed; the original command wasn't preserved in this copy):
sudo su postgres -c 'pg_ctl -D /opt/local/var/db/postgresql92/defaultdb \
  -l /opt/local/var/log/postgresql92/postgres.log start'
```
Now when you go to create your Rails project, it will install the `pg` gem for working with PostgreSQL, and it'll configure itself to use the first `psql` (PostgreSQL client) it finds in your `$PATH` environment variable. Your system comes with one, but you'll want to use the one you just installed. Assuming you're using a `.bashrc` (or `.bash_profile`) file for initial setup of your bash environment for shell sessions, add
```
# Path assumed; put the newly installed PostgreSQL binaries first:
export PATH=/opt/local/lib/postgresql92/bin:$PATH
```
to the bottom of the `.bashrc` (or `.bash_profile`) file. Don't forget to
```
source ~/.bashrc   # or: source ~/.bash_profile
```
for the change to take effect.
Now that you're done setting up your system for PostgreSQL, you are ready to create and setup a Rails app that uses PostgreSQL. Start with:
```
rails new my_app --database=postgresql
```
The standard `rails new my_app` does a whole bunch of initial setup and file creation for your Rails app. Adding the `--database=postgresql` flag ensures that your Rails setup includes some PostgreSQL-specific things, such as adding the `pg` gem to your Gemfile and pre-populating some of the database configuration properties in the `my_app/config/database.yml` file. We'll need to edit that file a little. Go to `my_app/config/database.yml` and change the username for the development and test databases to 'postgres'. This ensures that when your Rails app uses the PostgreSQL client to access the database cluster served by your PostgreSQL server, it does so with the credentials of the user who owns that cluster, namely the 'postgres' user.
While you're in that file, you can get rid of the section for the production database altogether if you're deploying to Cloud Foundry or Heroku, since they will overwrite whatever you have there anyways.
Finally, create the development and test databases that your Rails app will use. (These databases will be created within your default cluster).
```
# From the my_app directory (task name assumed); creates every database
# named in database.yml:
rake db:create:all
```
Now you're totally ready to go!