2012-05-24

My Very Own AJAX Tutorial

So, I finally accomplished my dream of understanding how applications where the client constantly exchanges data with the server, rather than just downloading a static webpage (chat rooms, MMORPGs and the like), actually work, and of implementing one of my own. Since this is at least the fourth time I tried to figure this out and the first time I succeeded, I thought I would write my own, less complicated, tutorial on how to do so.

The example I'll be running through is a fairly simple chat room. To start off, you need to download the Python module web.py. Then, create a new Python file, and we can get to work on the actual code.
import web, os
form = """
<script>
  setInterval(update,100);
  function update() {
    sendreq("ask");
  }
  function sendreq(type) {
    var xmlhttp = new XMLHttpRequest();
    xmlhttp.open("POST","/");
    xmlhttp.onreadystatechange=handler;
    var name = document.getElementById('name').value.replace(/:/g,''); //Strip all colons; a plain string argument would only replace the first
    if (name == '') name = 'Guest';
    if (type == "send") {
      var message = document.getElementById('text').value;
      xmlhttp.send(name+': '+message); //Requests containing a message post a message
    }
    else xmlhttp.send(name); //Requests with just a name ask for content
    function handler() {
      if(xmlhttp.readyState==4) {
        document.getElementById("messages").innerHTML = xmlhttp.responseText;
      }
    }
  }
</script>
<div id="messages"></div><hr />
Name:<br><textarea id="name" cols="20" rows="1"></textarea>
<br>Message:<br><textarea id="text" cols="50" rows="5"></textarea>
<br><input type="button" onclick="sendreq('send')" value="Send">
"""

This big long string is the client-side, HTML/Javascript part of the code. A function lower down in our python script will send this off to the client's browser to render. The main function here is sendreq. It creates an XMLHttpRequest object, xmlhttp, gives it the name of the function to call once a response arrives, handler, and sends the request off with a string containing the actual data. If sendreq is called with the parameter "ask", which we do automatically every 100 milliseconds, it sends just the name that you give (stripped of colons for processing purposes), but if it's called with "send" (which we do when the button is pressed) it sends the name and the message, with a colon separating the two. Once we receive a response, which will be the last 12 lines of messages sent so far, the handler function runs, setting the HTML of the div with id messages to the response text. Now, on to the server side:
class mainpage:
  def GET(self):
    return form
  def POST(self):
    data = web.data()
    name = data[:data.find(':')] if ':' in data else data
    if os.path.exists('messages'): messages = open('messages','r').read().split('\n') #Read the existing messages from the 'messages' file
    else: messages = []
    if ':' in data: #Add our new message to the list if we have one
      messages.append(data)
      open('messages','w').write('\n'.join(messages))
    return '<br>'.join(messages[-12:]).replace(name+':','<b>'+name+':</b>') #Return the last 12 lines with the requester's name bolded

urls = ('/', 'mainpage')

app = web.application(urls, globals())

if __name__ == '__main__':
    app.run()
The GET function handles GET requests, which are just a new client asking for the webpage. We respond to them by sending off the HTML/Javascript that we had earlier. As for the POST function, it starts off by setting the data variable to the text sent by the request, and tries to extract the user's name. If there's a colon, the name is everything up to the colon, and if there isn't, the whole request is just the name. Then, it reads the file 'messages', where we're caching everything, and converts it into an array of lines (so that we can return just the last 12 lines later on), setting the array to be empty if the file does not yet exist. If the request text contains a colon (i.e. it isn't just a name, so it carries a new message), it adds the request text to our cache. Finally, it returns the last 12 lines, joined by <br> tags instead of literal newlines, and with the given name bolded to make it easier for users to see which of the past 12 lines are theirs and which are not.
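The read-append-return cycle in POST is easy to try out in isolation. Here's a minimal sketch of the same logic as a pure function (my own illustration - the in-memory list stands in for the 'messages' file, and post_message is a name I made up, not part of web.py):

```python
def post_message(messages, data):
    """Mimic the POST handler: append if the request carries a message,
    then return the last 12 lines with the sender's name bolded."""
    name = data[:data.find(':')] if ':' in data else data
    if ':' in data:
        messages.append(data)  # data looks like "name: message"
    return '<br>'.join(messages[-12:]).replace(name + ':', '<b>' + name + ':</b>')

messages = []
for i in range(15):
    post_message(messages, 'Alice: message %d' % i)
html = post_message(messages, 'Alice')  # a plain "ask" request: name only
# html now holds messages 3 through 14, each prefixed with <b>Alice:</b>
```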

Put these two chunks of code into your python file, run it, and that's it! You can go to localhost:8080 on your browser (or multiple browsers, so you can be sure that there's actual client-server communication going on) and try out your new chat room!

2012-01-09

Some advanced tips for programming competitions

Since the second Codesprint programming competition just closed, I thought I would leave some tips that helped me answer some of the more challenging questions (and the easier questions) there. This guide assumes that you already know the basics (i.e. you can write a program that implements the substitution cipher*) and some simple optimization tips (move stuff outside loops, don't redo computations if you can help it, etc), but want to bring your skills up to the next level, and I hope it will provide some insight into the kinds of strategies that will give you efficient solutions for programming challenges. I will include a hodgepodge of C++, Javascript and Python, since I'll focus not only on Codesprint but also on some other programming competitions and personal projects that I consider relevant, and I'm including the code I used as-is (with perhaps one or two comments added), so if you have trouble understanding any of it you can always visit the relevant documentation.

Try some testcases and look for patterns

To illustrate this principle, consider two problems from the Codesprint competition: Picking Cards and Coin Tosses**. In Picking Cards, there is a deck of cards, each with a certain number on it, and you have to figure out the number of ways to arrange them in such a way that each card has before it at least as many cards as the number on the card. So for example, a deck {A:0,B:0,C:1,D:3} could be arranged 4 ways: ABCD, BACD, ACBD, or BCAD, but not CADB, since C has a value of 1 but 0 cards before it and D has a value of 3 but only 2 cards before it. I tried some decks manually, and settled on a hypothesis: progressing along a sorted list of the cards from left to right, each card multiplies the number of ways by p+1-v, where p is the card's position in the sorted list and v is its value. So, for example, in our list of {A:0,B:0,C:1,D:3} (or {A:3,B:0,C:1,D:0}; the list is sorted first so it doesn't matter), we start off with 1 since there's only one way to arrange an empty deck; then the first card, with a position of 0 and a value of 0, gives us a p+1-v of 1, keeping the total at 1; the second card (p=1,v=0) brings us to 2; the third card (p=2,v=1) doubles us again to 4; and the fourth card (p=3,v=3) keeps us at 4. The hypothesis checked out for a few other examples, so I implemented it in code. Hit submit, 20/20 score, done.
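The hypothesis takes only a few lines to express. Here's a sketch (my own illustration, not the contest submission):

```python
def arrangements(values):
    """Count valid orderings by the hypothesis: sort the card values,
    then multiply (p + 1 - v) over positions p."""
    total = 1
    for p, v in enumerate(sorted(values)):
        total *= p + 1 - v
    return total

# The {A:0, B:0, C:1, D:3} deck from the text:
print(arrangements([0, 0, 1, 3]))  # 4 ways: ABCD, BACD, ACBD, BCAD
```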

In Coin Tosses, there is a coin that has already been flipped and come out heads m times, and your job is to find the expected value of the number of tosses until the coin comes out heads n times in a row, potentially including those m. For example, for m = 2, n = 4, you might have a value of 4 (1111), 7 (1101111) or 14 (11011000101111), and your job is to find what the value will be on average. Since you can't test expected value in your head (well, you could flip a coin some number of thousands of times until you're 95% confident that you're right, but here that would have been even harder than the mathematical approach), I immediately sat down with pen and paper and devised an unwieldy expression to find the expected value, by calculating the expected value in terms of itself (essentially, EV = 0.5(EV + 1) + 0.25(EV + 2) + 0.125(EV + 3) + ... + (1/2^n)(EV + n)), and then following the grade 9 algebra trick of moving all the variables to the left. Then I brought m into the picture, and I implemented the result in code:
cases = int(raw_input())
for i in range(cases):
    line = raw_input()
    n = int(line[:line.find(' ')])
    m = int(line[line.find(' ') + 1:])
    evfromscratch = 0
    for i in range(1,n+1): evfromscratch += i * 2.0 ** (n-i)
    evfromscratch += n
    ev = 0
    for i in range(1,n-m+1): ev += (i + evfromscratch) / 2.0 ** i
    ev += (n-m) / 2.0 ** (n-m)
    print ("%.2f" % round(ev,2))
The variable evfromscratch gives you the expected value given 0 prior heads, and ev gives you the value for m prior heads. The algorithm was correct, and fairly quick, but there was a practical difficulty: I was getting errors because Python's numbers start to round off once they get really big. Looking back now, the reason it failed is that the ev variable brings in decimals, which causes Python to switch from integers, which are stored as exact values, to floats, which round off after about 16 decimal places (53 significant bits, to be exact). I didn't know this at the time, so I simply decided to look for a better solution. I ran the code a few dozen times with different test cases, saw a pattern, and promptly implemented it in code:
cases = int(raw_input())
for i in range(cases):
    line = raw_input()
    n = int(line[:line.find(' ')])
    m = int(line[line.find(' ') + 1:])
    print str(2 * (2 ** n - 2 ** m)) + ".00"
Full marks. Done. I don't know why it works, but it does, no rigorous proof required. I don't know why the rule for Picking Cards works either - I could go and spend hours trying to prove both, but I won't. And that is the key: unlike in the world of mathematics, where something without a formal proof is merely a "conjecture", in computer programming you don't need to know exactly why something works - if it works, you can use it. You do need a theoretical background to be able to make educated guesses in the first place, but often a vague intuitive idea of why something might work will suffice - the insight I had that got me to the solution for the picking cards problem so quickly was "deck of cards problems tend to involve factorials". A few empirical tests from there, and I had my algorithm.
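That said, empirical confidence is cheap to buy. Here's a numerical cross-check of the Coin Tosses closed form (my own sketch, not a proof, and assuming the answer counts the tosses after the m prior heads): define f(k) as the expected remaining tosses given a current streak of k heads, so f(n) = 0 and f(k) = 1 + 0.5·f(k+1) + 0.5·f(0), then solve and compare against 2·(2^n - 2^m):

```python
def expected_tosses(n, m):
    """Expected additional tosses to reach n heads in a row from a
    streak of m, solving f(k) = 1 + 0.5*f(k+1) + 0.5*f(0), f(n) = 0."""
    # Write f(k) = a[k] + b[k]*f(0) and sweep backward from f(n) = 0
    a = [0.0] * (n + 1)
    b = [0.0] * (n + 1)
    for k in range(n - 1, -1, -1):
        a[k] = 1 + 0.5 * a[k + 1]
        b[k] = 0.5 * b[k + 1] + 0.5
    f0 = a[0] / (1 - b[0])     # from f(0) = a[0] + b[0]*f(0)
    return a[m] + b[m] * f0

# The recurrence agrees with the empirically discovered closed form:
for n in range(1, 10):
    for m in range(n):
        assert abs(expected_tosses(n, m) - 2 * (2 ** n - 2 ** m)) < 1e-6
```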

Learn to love binary search

def binsearch(val,element,li):
  #Return the index at which val would be inserted into li, where li is
  #sorted in increasing order of each item's 'element' field
  if len(li) == 0: return 0
  left = 0
  right = len(li)-1
  while right-left > 1:
    ave = int((left+right)/2) #Midpoint of the remaining search range
    if val <= li[ave][element]: right = ave
    else: left = ave
  if val <= li[left][element]: return left
  elif val <= li[right][element]: return right
  else: return right + 1
Binary search is an algorithm that lets you find a value (or the nearest value to a given input) in a sorted list. The naive approach to the search problem is to go through all the elements in the list one by one, but that takes O(n) time (if you don't already know this, O() means "at high values becomes proportional to"), which is unacceptably slow if you need to search inside one of your innermost loops. Binary search, on the other hand, takes a mere O(log n) time, which is a massive speedup for large lists: a search through a list of 1000000 items, instead of taking an average of 500000 steps, will take a mere 20 steps, at least a 1000x speedup even if each step becomes much more complex. This particular implementation takes an argument element, which allows it to search a sorted list of objects (arrays or dicts) keyed by one of their elements, making the algorithm much more versatile. Taking a key function as an argument provides even more versatility, but at the cost of speed.
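As an aside, if you're working in Python, the standard library's bisect module provides this primitive ready-made (note that it searches on the raw values; the element-style lookup above needs your own function). Maintaining a sorted list dynamically then looks like this:

```python
import bisect

scores = []
for s in [34, 56, 12, 23, 40]:
    bisect.insort(scores, s)           # binary search + insert in one call
print(scores)                          # [12, 23, 34, 40, 56]
print(bisect.bisect_left(scores, 41))  # insertion point for a new value: 4
```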

Binary search is often overlooked because it requires a sorted list as a precondition, which seems like it would make it useless if you have a big unsorted list that you need to work with, but you can get around that by first sorting the list by every key you need it sorted by. And how do you sort a list efficiently? Binary search! Just run through your unsorted list and add elements one by one to a new, initially empty, list at the positions that binary search gives you. That's a sort with only O(n*log(n)) comparisons right there (the insertions still shift elements over, but each shift is a simple, fast operation), and one that you can implement in C without worrying about data structures like lists of lists or heaps that other efficient sorts require. Here's an example from a problem that had you keep track of the ranks of specific participants in a yodelling competition as their scores change (hence the use of a "byrank" array to store positions rather than actually sorting the list).
    for (int j = 1; j < yodellers; j++) { //eg. scores 34 56 12 23 40 -> byrank 2 0 4 3 1
      //Binary search
      int left = 0;
      int right = j - 1;
      while (right - left > 1) {
        int ave = (left+right)/2;
        if (scores[j] > scores[byrank[ave]]) right = ave;
        else left = ave;
      } 
      int place = 0;
      if (scores[j] > scores[byrank[left]]) place = left;
      else if (scores[j] > scores[byrank[right]]) place = right;
      else place = right+1;
      //Insertion
      for (int k = j-1; k >= place;k--) byrank[k+1] = byrank[k];
      byrank[place] = j;
    }
int scores[yodellers] and int byrank[yodellers] were both initialized earlier; yodellers is the number of participants. The problem was from a 2004 high school computing competition that I found and did for practice. Binary search is the one algorithm that now appears in almost every one of my more advanced programming solutions - it simply is that powerful. If you have a problem that looks like it will take O(n^2) time to compute, chances are there's a solution involving pre-sorting and binary search that will take O(n*log(n)) time. One example of such a binary search-based solution is my approach to the Subsequence Weighting problem in the Codesprint competition; solution here. I won't go into this example in detail; those curious can go and try to understand the problem and code themselves, but it's an O(n*log(n)) algorithm for solving it. A second, and much more important, application is search trees.

Search Trees

Search trees are an incredibly powerful tool in solving optimization problems, where you need to find an optimal solution, such as the minimum cost of performing a certain task (eg. pathfinding, the traveling salesman problem, planning). Here's a naive implementation of the concept that I applied in a medieval/fantasy tactics game that I made a while back***:
def generate_distance_map(startx, starty):
  distmap = initialize_array(40,30,0) #Game helper: build a 40x30 array filled with 0 (0 = not yet visited)
  pts = []
  pts.append((startx, starty,1)) #Each entry is (x, y, distance); distances start at 1 so that 0 can mean unvisited
  index = 0
  while index < len(pts):
    p = pts[index]
    if (distmap[p[0]][p[1]] == 0 or distmap[p[0]][p[1]] > p[2]):
        distmap[p[0]][p[1]] = p[2]
        if p[0] < 39 and terrain[p[0]+1][p[1]] > 0: pts.append((p[0]+1,p[1],p[2]+1))
        if p[0] >= 1 and terrain[p[0]-1][p[1]] > 0: pts.append((p[0]-1,p[1],p[2]+1))
        if p[1] < 29 and terrain[p[0]][p[1]+1] > 0: pts.append((p[0],p[1]+1,p[2]+1))
        if p[1] >= 1 and terrain[p[0]][p[1]-1] > 0: pts.append((p[0],p[1]-1,p[2]+1))
    index += 1
  return distmap
The idea is to find the shortest path from A to B on a map (terrain) with obstacles in it. The solution I used is to first find the distance from all coordinates to B (returned in distmap), and then move toward the lowest nearby distance for each step starting from A (the reason things were done that way is because there may be 50 units seeking one unit and this allows the heavy lifting to be done once, rather than 50 times). For another example of this kind of searching, here's a solution in C for finding how long it takes a knight to get from point A to point B on a chessboard (contest here). Of course, these are not "trees", and do not have to be since the search space is only 1200 (or 64) states, but more complex applications do need trees.
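The knight version follows the same expanding-frontier pattern. A sketch (my own, not the contest code) for the 8x8 board:

```python
def knight_distance(start, goal):
    """Minimum knight moves between two squares of an 8x8 board,
    using the same expanding-frontier search as the distance map."""
    moves = [(1, 2), (2, 1), (2, -1), (1, -2),
             (-1, -2), (-2, -1), (-2, 1), (-1, 2)]
    dist = {start: 0}
    frontier = [start]
    i = 0
    while i < len(frontier):
        x, y = frontier[i]
        if (x, y) == goal:
            return dist[(x, y)]
        for dx, dy in moves:
            nx, ny = x + dx, y + dy
            if 0 <= nx < 8 and 0 <= ny < 8 and (nx, ny) not in dist:
                dist[(nx, ny)] = dist[(x, y)] + 1
                frontier.append((nx, ny))
        i += 1

print(knight_distance((0, 0), (1, 2)))  # 1
print(knight_distance((0, 0), (7, 7)))  # 6
```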

Meet the fifteen puzzle. The idea is to move the fifteen tiles from a randomly shuffled configuration into the solved configuration of increasing numbers, left to right then top to bottom, using just the one empty space; paste my solution code into the textbox here and run it in 3x3 or 4x4 mode (be prepared to wait up to 1 min for 4x4 mode; the code solves the easier 3x3 mode nearly instantly) to get a better feel for the puzzle. The challenge is that there are not just 1200 states to look through, but up to (in the 5x5 case) 25! states - so many that as of the time of this writing there is not enough hard disk space in the world to store them all, and won't be for at least another 2 decades. So what do you do? Store just some of the states in a list, continue to make moves from the best states (defining "best" for the specific problem is part of the challenge) and put the resulting states into the list as well, until you eventually arrive at the goal. Here's where binary search comes in - you want the list of solutions sorted in order of how good they are, so you can easily pick out the best one to work with every round. And maintaining a sorted list dynamically is exactly what binary search is meant for!

The way these types of route finding algorithms (whether finding solution steps for the fifteen puzzle or a literal route if you're asking your computer for driving directions) work in detail is like this (if you want, try to follow along in my code):
  1. A state is stored as an object containing the position (in the pathfinding problem, the x,y coordinate, in the fifteen puzzle, the entire map - my code represents it as a string, eg. "0123456789abcdef" for the solved configuration, "048c159d26ae37bf" for the solved configuration flipped along the diagonal, etc) and the moves needed to get to that position, and optionally the score of the position/moves combination (see the discussion on heuristic functions later).
  2. There are two lists maintained: a frontier, containing positions you're interested in evaluating, and a list of positions that have already been explored. The frontier starts off with just one element containing the initial position and an empty set of moves, and the explored list starts off empty.
  3. For each step, take the state in the frontier with the highest score, and take all possible moves from that state that lead to a legal state not yet in the explored list and put them into the frontier. For example, in the 2D pathfinding problem from (0,4) you might add (1,4) and (0,5), but not (0,3) because you had arrived at (0,4) from (0,3) so (0,3) has already been explored, and not (-1,4) because that's out of bounds. This is where binary search comes in handy - you can use it to dynamically maintain a sorted list, so the state with the highest score will always be simply the last element in the frontier.
  4. Take that state out of the frontier and put it into the explored list.
  5. If you find a state that is the desired destination, take it and end the algorithm, and you got yourself a path. Whether or not you will get the best path depends on your heuristic (or scoring) function.
There are three types of scoring functions:
  • Breadth first - the score is the opposite of the length of the path to that state (or the score is the length of the path, and you're looking for the lowest score rather than the highest), with no regard to the position itself. This essentially searches through states expanding out from the start in a sphere. It's guaranteed to find the best path since moves are evaluated in order of how long the path to them is.
  • Closest first - the score is how close the position is to the goal according to some distance metric, with no regard for the length of the path. This is the fastest algorithm, and its speed is necessary in some cases (like the fifteen puzzle), but it will NOT necessarily give you a very efficient solution. Here's my heuristic function for the fifteen puzzle (essentially, distance is defined as the sum of the distances of all the numbers from their "proper" locations):
var heuristic = function(state) {
  var h = 0;
  var size = Math.sqrt(state.str.length);
  for (var x = 0; x < size; x++) {
    for (var y = 0; y < size; y++) {
      var val = sym.indexOf(state.str.charAt(x + y * size)); //sym is the tile alphabet, eg. "0123456789abcdef", defined elsewhere
      var fx = val % size;
      var fy = Math.floor(val / size);
      if (val > 0) h += Math.abs(fx-x) + Math.abs(fy-y);
    }
  }
  return h;
}
  • Combined score (A* search) - the score is a combination of the length of the path and the distance to the goal. If the distance metric is always less than or equal to the length of the path to the goal, then the solution you get will also be optimal, since it won't cause efficient paths to be left behind due to an erroneously high distance estimation while less efficient paths get explored. This is generally better than breadth first, the exception being situations where you have no idea how to even estimate the distance to the goal.
In the particular case of the fifteen puzzle, the search tree actually is a "tree" - the solutions are sequences of moves, and places where multiple moves are possible are where the tree branches out. Search trees have their applications everywhere - solving pathfinding and optimization problems, solving games, planning, etc - they are like binary search's older brother in terms of their versatility.
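Steps 1 through 5 compress into a fairly small amount of code. Here's a sketch of the combined-score (A*) variant on the 2D grid problem, using bisect to keep the frontier sorted so that the best state is always the last element, as described in step 3 (all names here are my own illustration):

```python
import bisect

def astar(grid, start, goal):
    """Best-first (A*) search following steps 1-5: a sorted frontier of
    (score, position, path) tuples, with the best state always last."""
    def h(p):  # Manhattan distance: never overestimates on a grid
        return abs(p[0] - goal[0]) + abs(p[1] - goal[1])
    frontier = [(-(1 + h(start)), start, [start])]
    explored = set()
    while frontier:
        score, pos, path = frontier.pop()  # highest score = last element
        if pos == goal:
            return path
        if pos in explored:
            continue
        explored.add(pos)
        x, y = pos
        for nxt in [(x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)]:
            if (0 <= nxt[0] < len(grid) and 0 <= nxt[1] < len(grid[0])
                    and grid[nxt[0]][nxt[1]] == 0 and nxt not in explored):
                # score = -(path length + estimated distance to goal)
                bisect.insort(frontier, (-(len(path) + 1 + h(nxt)), nxt, path + [nxt]))
    return None  # goal unreachable

grid = [[0, 0, 0],
        [1, 1, 0],  # 1 = obstacle
        [0, 0, 0]]
path = astar(grid, (0, 0), (2, 0))  # forced around the wall, via (1,2)
```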

Look for crazy optimizations

The magic of binary search is one way to turn many O(n^2) solutions into O(n*log(n)) solutions, but it is not a panacea. Here, I'll dissect the "direct connections" problem from the 2012 January Codesprint competition. The premise is this (direct link here): you have a list of cities c0,c1,c2,...,cn with populations p0,p1,p2,...,pn, all arranged on a line at coordinates x0,x1,x2,...,xn. You are trying to build a network to connect every city to every other city, and the cost of connecting cities ci and cj is equal to max(pi,pj) multiplied by the distance from ci to cj (i.e. abs(xi - xj)). Here's the naive O(n^2) solution I had at first (the answer had to be provided mod 1,000,000,007 as part of the problem specification):
cases = int(raw_input())
for case in range(cases):
  citiesnum = int(raw_input())
  coords = [int(x) for x in raw_input().split(' ')[:citiesnum]]
  pops = [int(x) for x in raw_input().split(' ')[:citiesnum]]
  cities = []
  for x in range(citiesnum): cities.append([pops[x],coords[x]])
  cities.sort(key=lambda x:x[0],reverse=True)
  total = 0
  for i in range(citiesnum):
    nt = 0
    for j in range(i+1,citiesnum):
      nt += abs(cities[i][1] - cities[j][1])
    total = (total + nt * cities[i][0]) % 1000000007 #Keep the running total reduced mod 1,000,000,007
  print total
Note that python has a built-in efficient sorting function, so you don't need to write your own (although you still do if you want to maintain a sorted list dynamically). However, that did not save me from the horrors of O(n^2), and I got only 15/35 points as a result of the time limit. The algorithm is simple: take the most populated city, and since you know that the cost to connect it to each of the others will be based on its own population, just go through all the other cities and add up the distances to them. Then take it out, go through the next most populated city, and so on right down to the last two. Reflecting on this later, sorting the cities actually didn't really speed up my code much at all; I could have just gone through all combinations of i and j and added up the costs directly, and this would have been less efficient but still O(n^2).

To take my algorithm to the next level, I had to go beyond this naive approach and build a considerably more advanced algorithm; here is my code. Everything is somewhat explained in the comments, but I'll go over it here too. The idea is to keep a list of cities in order of population (we'll call it P), giving you the order to process the cities in, a list of cities in order of position (we'll call it X), and a running total of the positions of all the cities (which we'll call S). For example, if the positions are [1,2,3,5,7,9,12,15], S will be [1,3,6,11,18,27,39,54]. To find the sum of the distances from all cities to, say, X5 (the one at 9), the formula is S7-S5-S4+9(5*2-7), where 7 is the number of cities minus one - try it for yourself, it works! Essentially, it adds all the numbers above 9, subtracts all the numbers below 9 (by adding all the numbers and subtracting these twice), subtracts 9 for every number above 9, and adds 9 for every number below 9, getting you the sum of the distances (xi - 9 for cities to the right of 9 and 9 - xi for cities to the left). Now, we can just find the total distance for each city by going through this list, multiply the result by the city's population, and do that for every city - problem solved!
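Here is the running-total trick on its own as a sketch (the helper name is mine): with S precomputed, the summed distance from every city to city i takes O(1):

```python
def distance_sum(xs, S, i):
    """Sum of |xs[j] - xs[i]| over all j, where S[k] = xs[0]+...+xs[k]."""
    n = len(xs)
    below = S[i - 1] if i > 0 else 0   # total of positions left of city i
    return S[n - 1] - S[i] - below + xs[i] * (2 * i - (n - 1))

xs = [1, 2, 3, 5, 7, 9, 12, 15]
S = []
total = 0
for x in xs:                  # build the running total of positions
    total += x
    S.append(total)
print(distance_sum(xs, S, 5))  # the S7-S5-S4+9(5*2-7) example: prints 36
```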

Well, not quite. There's the tiny little snag that you need to take cities out after you evaluate them, so you don't evaluate every pair an extra time with the wrong population, and to do that, every time you take out a city you have to go through the entire list of sums to the right of it and reduce them. And back to O(n^2) land we go. Fortunately, though, I discovered a faster solution that works by splitting X and S up into buckets, and maintaining a running "microsum" of numbers within each bucket and a running "macrosum" of the buckets themselves. Thus, for our example, we would have (in my solution I maintained both X the list and X the buckets, but from now on X is just the buckets for the sake of simplification):

X = [1 2 3] [5 7 9] [12 15]
Smic = [1 3 6] [5 12 21] [12 27]
Smac = 6 27 54

Now, look at how little you need to change to take out the 9:

Smic = [1 3 6] [5 12] [12 27]
Smac = 6 18 45

You only need to update one microbucket (O(n^0.5)) and the macrobuckets (O(n^0.5)). Hooray, an O(n^1.5) solution, and I got 35/35 points. It is actually possible to go even further, making three layers of buckets to get to O(n^1.33), four layers to O(n^1.25), and a dynamic number of layers to reach the holy grail of O(n*log(n)), but I didn't realize that at the time, and this was good enough anyway. As a general rule of thumb, if there's an obvious solution at O(n^2), you can get down to O(n*log(n)) if you try hard enough.
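The bucket bookkeeping can be sketched like this (function names are mine; the bucket size is fixed at 3 to match the example above, rather than sqrt(n)):

```python
def build(xs, size):
    """Split xs into buckets, keeping running sums inside each bucket
    (smic) and running totals of the whole buckets (smac)."""
    buckets = [xs[i:i + size] for i in range(0, len(xs), size)]
    smic = []
    for b in buckets:
        run, t = [], 0
        for x in b:
            t += x
            run.append(t)
        smic.append(run)
    smac, t = [], 0
    for run in smic:
        t += run[-1]
        smac.append(t)
    return buckets, smic, smac

def remove(buckets, smic, smac, val):
    """Take val out: redo one bucket's microsums and shift the later
    macrosums - O(sqrt(n)) work each when size is about sqrt(n)."""
    for bi, b in enumerate(buckets):
        if val in b:
            b.remove(val)
            run, t = [], 0
            for x in b:
                t += x
                run.append(t)
            smic[bi] = run
            for j in range(bi, len(smac)):
                smac[j] -= val
            return

buckets, smic, smac = build([1, 2, 3, 5, 7, 9, 12, 15], 3)
# smic = [[1, 3, 6], [5, 12, 21], [12, 27]], smac = [6, 27, 54]
remove(buckets, smic, smac, 9)
# smic = [[1, 3, 6], [5, 12], [12, 27]], smac = [6, 18, 45]
```

Three layers of buckets just nest this same structure one level deeper.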

So there you go, I hope you'll start using these ideas and others to help you solve more and more challenging programming problems, and possibly even apply these skills when programming in real life (notice how I included the game I made as an example; there are always interesting applications for these around the corner). Some other resources that might help you are:
  • Stanford's online ML and AI classes - now done, but the videos will be around forever. As a high-level interactive companion to the AI videos, I'll shamelessly plug my AI Challenges website, home of my above-linked fifteen puzzle challenge app. The AI class in particular helped a lot in expanding my repertoire of algorithms and strategies.
  • Past contests, from here and here and here. Remember, don't just skim through the problems, sit down, pick a hard one and try your hardest to solve it!
  • Wikipedia and Google - in my early high school days, I spent days thinking about how to implement a flood filling algorithm (yes, as in flood fill in basic drawing software), and finally found a slow, but functional, solution, and I was prouder of my coding abilities than I had ever been in my life. However, the next day I researched flood filling on Wikipedia to see what other people had to say on the issue and found a simple algorithm 20 times faster than what I had made. Also, it's from there that I made the leap to solving the pathfinding problem myself. Good programmers invent, great programmers steal.
Good luck and happy coding!

Vitalik Buterin

* print (lambda a,b: ''.join([(b+" "*6)[(ord(x)%32-1)] for x in a]))(raw_input(),raw_input())
** Link to full text for all 3 problems that I talk about for those without Google/FB accounts: https://docs.google.com/document/d/1vUYNXE_TGaffG6JKN6E-EbJONHwRDaxExmVJwwKFJVc/edit?pli=1
*** This is not the code that runs in the linked game; it is from an earlier downloadable version that I made in python, while the linked game is a webapp with the frontend written in javascript. In the linked game the algorithm is, while essentially the same, more complex, treating a diagonal step as 1.5 rather than 2 to more closely mimic the Pythagorean distance and making some fairly complicated necessary adjustments to preserve efficiency; if you want you can go to the game, hit Ctrl+U in the game mode, and hunt around the source code for the newer algorithm. And you can play the game and send feedback if you really want.

2011-08-14

Password Strength: Rebuttal to http://xkcd.com/936/

The XKCD comic here made an interesting argument about passwords: that rather than creating short passwords with random characters and 1337 substitutions, we should create long passwords out of easy-to-remember words. We are much better at remembering words than symbols per bit of entropy, the logic goes, so our passwords should be made out of words. The problem is that the analysis misses out on two very important variables:

1. Ease of typing in password
2. Difficulty of getting password from shoulder surfing

Let's consider 'Tr0ub4dor&3' vs. 'correct horse battery staple'. How much longer would the latter take to type? One is 11 characters long, the other 28, but there are a few adjustments to make. First, in favor of the latter, typing takes longer when the shift key is involved, as it is here twice, so we might adjust this to 13 vs 28 "effective characters". Second, though, is the issue of typing errors. The probability of typing a password in successfully decreases exponentially with its length, so the average number of tries increases exponentially. If the success rate for the shorter password is 80%, the longer password will see a rate of about 62%, so the longer password would, in effect, take roughly 1.3x as many tries. Since there is a period of waiting between inputting an incorrect try, the computer rejecting you, you realizing what just happened, and you typing the password in again, a 2-try success might actually take 3-5 times as long as a 1-try success, making the effect even worse. Thus, in practice, the suggestion of switching to the long password, if it gets adopted at all (most likely due to an enterprising IT geek drafting a corporate policy), will run into the fact of user laziness and will encourage users to stay logged in to everything forever on any computer they ever touch, clearly reducing security.
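The arithmetic is worth making explicit; a quick sketch using the same 80% base rate assumed above:

```python
# Per-character success rate implied by an 80% success rate over
# 13 effective characters, extrapolated to 28 effective characters
p_char = 0.8 ** (1.0 / 13)        # success rate per effective character
p_long = p_char ** 28             # about 0.62 for the 28-character password
tries_short = 1 / 0.8             # expected attempts, short password
tries_long = 1 / p_long           # expected attempts, long password
ratio = tries_long / tries_short  # about 1.3x as many tries
```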

The second issue works like this: 'Tr0ub4dor&3' has 28 bits of entropy over 13 "effective characters", or 2.15 bits/ch. 'correct horse battery staple' has 44 bits over 28 characters, or 1.57 bits/ch. Thus, someone shoulder surfing the shorter password would have far more information to take in all at once. Due to linguistic redundancy, with the longer password he can even miss half the characters and still get the password right. And god forbid the user fumbles and has to input the password twice...

I'll finish up with my password policy:

1. The three really important things - my desktop password, my email password and my Bitcoin wallet encryption password, are all their own 18-character strings made in a similar style to 'Tr0ub4dor&3'. The desktop password is considerably more incomprehensible - this is okay because you will remember something you have to type in 5-20 times a day no matter what it is, and if you're typing it that often the battery staple approach is obviously unreasonable.
2. Website passwords are formed from an algorithm based partially on the website's name and partially on a short 6-character string, forming an 11-character password unique to each website that I do not have to remember, but can figure out from the website name. Currently, the algorithm isn't very good (and I do plan on changing it soon) - comparing 2 website passwords can reveal the pattern - but it defeats attackers who get username/password combinations on one website and try them everywhere else automatically. I prefer this over password managers partially due to paranoia and partially because I want to be able to access all my sites on other people's computers.

2011-08-05

Public Service Announcement: Back up your data!

About a week ago, I ran into what would be for most people the second most serious computer problem imaginable (the most serious being complete failure of the hardware - that way you have to buy a new machine as well): the operating system would no longer boot. Because I was using encryption, and had never written down the encryption key since all I normally needed to get in was my login password, all the data on my computer was now permanently irretrievable. The problem was, of course, due to excessive hacking on my part (whoops, erased the partition tables on the wrong volume) a few weeks earlier, but I didn't know exactly what I had done, since my computer kept on running in a sort of walking ghost phase, acting as if nothing was wrong but ultimately doomed to fail on the next reboot. Two weeks later, I turned off my machine for the first time in over a month and discovered what had happened.

The data that was on the hard drive included the following:
  • All my old school projects, essays, etc
  • Everything I have programmed in the past few years
  • All my photos and pictures
  • My book, music and movie collection
  • About $350 worth of Bitcoin
Essentially, my entire life's work gone. Well, not quite. I reinstalled the OS, ran a script to reinstall all my software (the script would have also set all my system preferences, but I decided to take the opportunity to switch from Ubuntu to Xubuntu, so I turned off that bit) and copied over the data from my four-week-old USB backup. I then proceeded to install Dropbox and let the daemon copy over a much newer backup of most of my files.

I back up an encrypted copy of my Bitcoin wallet manually, since Dropbox is not to be trusted (or, even worse, this) in situations where you actually have something to hide, so my wallet backup was a month old. Fortunately, the coins themselves are stored on the public network - the wallet only stores the keys - so even the new coins sent to me after the last backup appeared once I ran the client.

In the end, all I lost was my music (too much for Dropbox, and I had added a whole bunch of new stuff since last month) and the scripts that I use to show the weather, my email and my calendar/todo list on my desktop, which I had to rewrite - well, once again, not quite. I realized after the fact that the scripts were still intact in my other backup (my second laptop). And as for my music, I didn't lose that either - it was all on my cellphone/media player. The one file that I didn't back up anywhere was a travel checklist I had written a few days before - well, actually, I had printed that on paper. Total losses: absolutely nothing.

Why was this possible? Because of a combination of multiply redundant automatic and manual backups that covered everything I had. If any one of these services goes down, I've still got my stuff in three other places. Most other people facing my problem would have suffered much more severely, and it is for their benefit that I am writing this today. In an age where more and more of our work, our possessions and our memories are nothing more than bytes on a computer (and, with Bitcoin, even our money; millions of dollars have been lost or stolen due to a lack of backups or security), it is imperative to keep your data safely backed up.

Some basic recommendations: use an automatic internet backup/sync service, like Dropbox or Wuala, for your everyday work (you could pay for a big account and store your media there too, but I don't like spending money if I don't have to), and an external hard drive, USB drive, smartphone and/or second computer for the big stuff and the private stuff. There are also ways to encrypt and decrypt private files automatically for syncing, so you don't have to trust whatever internet storage you use, if that's what you need. Keeping everything on a USB drive can actually be quite smart - if you're going somewhere where there will be a computer but you can't bring yours, you can just plug in your USB drive and do your normal computer work from there.

Don't wait until disaster ultimately strikes and you lose everything; back up your data now, before you get hit with something unexpected and realize that you should have. Even a simple minimal USB backup, or uploading a zipped version of your files to Gmail, could be something you will be immensely grateful for a year later.
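Even the minimal "zip your files and upload them" option takes only a few lines. A sketch (the paths here are placeholders, not anyone's real layout):

```python
import zipfile
from pathlib import Path

# Minimal sketch of the zip-it-and-email-it-to-yourself backup.
def zip_directory(src: str, dest: str) -> None:
    src_path = Path(src)
    with zipfile.ZipFile(dest, "w", zipfile.ZIP_DEFLATED) as archive:
        for path in src_path.rglob("*"):
            if path.is_file():
                # Store paths relative to the source directory.
                archive.write(path, arcname=str(path.relative_to(src_path)))

zip_directory("Documents", "backup.zip")  # then attach backup.zip to an email
```

Run on a schedule (cron, Task Scheduler), this is a crude but real last line of defense.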

Links:
Dropbox
Wuala
Ccrypt - a simple, easy-to-use file-based encryption program; scroll down to Qccrypt if you prefer a graphical interface to the command line