Destination: West Lafayette, Indiana

Well, to those of you following the job saga, it is now over (pending approval by the Provost). After 2.5 months of interviewing all over the country, we have decided to accept a position in the Computer Science department at Purdue University. None of the details are set yet, so don’t start asking questions about when we’re moving or anything. But sometime prior to August 18th (my start date at Purdue), we’ll be moving to the West Lafayette, Indiana area. So when you find yourselves visiting Purdue, Indiana, Indianapolis, or Chicago, let us know!

Validating email addresses

As an early user of Gmail, I was able to pick precisely the username I wanted: ckillian, a natural choice for anyone whose first name starts with ‘C’ and whose last name is Killian. Unfortunately, as often happens with popular shared services, there are many Gmail users who fit those parameters. By itself, that wouldn’t be a problem, except that on occasion these other users seem to forget that ckillian is not their email address. They have used it to buy tickets on Ticketmaster, place beach house reservations, set up iPod accounts, request proprietary recipes from companies, purchase items from websites, and, most recently, even to purchase online postage from the USPS.

The “best” part is when users are so convinced it is their address that they go through Google’s password recovery system to try to get my password. This has happened three times so far. (I know, because Google sends a link to my email address to follow if I want to proceed with resetting my password.) One truly intelligent user, after going through the password-change system and failing, actually sent me an email asking if I would forward the information to her, which I was happy to do (on a temporary basis).

I recognize that for the users this is generally an honest mistake (I get these receipts for a CXXXX Killian, and it’s obvious to me that ckillian is their username for some other service, and they just got mixed up while entering the email address). When I get such receipts, responses, etc., if there is a phone number listed for the user, I often attempt to phone them to let them know of their mistake. But more often than not, there is no way to contact them at all. In these cases, I have two options: (1) ignore it, and hope I don’t get stuck on some mailing list, or (2) contact the seller/sender and let them know they are sending information to the wrong party. Some vendors handle this well — the Apple store took care of it without hassle. Others, like the USPS, take some convincing (at first they thought I was trying to commit fraud). And then there are those like Ticketmaster, whom I have simply given up on, because I can’t seem to get them to stop sending me junk even on accounts I did set up (where I diligently unchecked the boxes so I would not receive the junk).

This is a fairly significant security issue, though: for many of these services and sites, by going through a forgotten-password dialog, I could have the password reset and emailed to my account, giving me access to the other person’s account and information, possibly including credit card details, or at least the ability to purchase things using their credit cards.

And what frustrates me the most is that most of these sites have set up some kind of account based on this email address without verifying that the new user actually has access to it. It’s one thing if you mistype an email address and a single receipt goes astray. It’s quite another if you are saving state for the user under that address without validating it. Most websites’ only form of validation is to have the user type the address twice. But we should know by now that user data cannot be trusted, and if we are going to store that kind of information, we really should validate the email address.

And it’s not even that hard to do so—mailing lists do this all the time. When you subscribe, they send you an email that you must respond to (proving you received it) before the subscription proceeds. All sites creating accounts should do the same. I would much rather have gotten an email from the USPS asking me to validate someone else’s account (which I would not have done) than a receipt for delivery-confirmation postage for a particular person.
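A minimal sketch of that confirm-before-activate flow, in C++ for consistency with the rest of this blog. All the names here (PendingAccounts, beginSignup, confirm) are made up for illustration, and a real implementation would use a cryptographically strong token and actually send mail:

```cpp
#include <cstdlib>
#include <map>
#include <string>

// email -> outstanding confirmation token (hypothetical storage)
typedef std::map<std::string, std::string> PendingAccounts;

// NOTE: std::rand() is only for illustration; real confirmation tokens
// need a cryptographically strong random source.
std::string makeToken() {
    std::string t;
    for (int i = 0; i < 16; ++i)
        t += "0123456789abcdef"[std::rand() % 16];
    return t;
}

// Step 1: on signup, record the address as *pending* and mail the token.
void beginSignup(PendingAccounts& pending, const std::string& email) {
    pending[email] = makeToken();
    // sendMail(email, "Confirm here: http://example.com/confirm?t=" + pending[email]);
}

// Step 2: only when the token comes back do we treat the address as
// valid and create the real account.
bool confirm(PendingAccounts& pending, const std::string& email,
             const std::string& token) {
    PendingAccounts::iterator i = pending.find(email);
    if (i == pending.end() || i->second != token)
        return false;   // wrong address or wrong token: no account
    pending.erase(i);   // promote to a real, validated account here
    return true;
}
```

Until confirm() succeeds, the site stores nothing but a pending entry, so a mistyped address results in one stray confirmation email rather than a stray account.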

So to those of you developing websites which create accounts for email addresses — please, please, please validate these email addresses before storing them!

New Category: Programming (this week: swap and concept_check)

I’m starting a new category for programming tips. I jokingly referred to something as the C++ feature of the week (for Mace development) with one of our developers, and he responded that he needed to subscribe. And it seemed like a good idea, so now I think I’ll start trying to blog about these new features I learn about.

So to start it off, there are two C++ features of the week for this past week:

  1. STL collection swap. The C++ STL collections provide a swap() method, which takes another collection (of the same type) as a parameter. The method does what you’d expect: it swaps one collection’s elements with the other’s. What makes it interesting is that it does so in constant time. It doesn’t require constructing, copying, or otherwise wasting time with the two collections; it just swaps pointers in the internal collection state. To see how this is useful, consider two cases I’ve applied it to:
    • Maps within maps. In one case, a Mace programmer had a map from an int to a vector, where the int represented the number of things in the vector (admittedly, a bit of a simplification). So when removing something from the vector, you would remove the vector from the map, then re-add it under the new key. (This is necessary because, for good reasons, the key of a map entry cannot be changed.) Because of the cost of this removal and re-addition, the programmer had originally implemented this using pointers, which, while correct and efficient, caused problems for some of our other tools, so we wanted to rewrite it without pointers. swap() made this possible. To do the update, use this code:

      // Remove one element from the vector keyed by `size`, and re-key
      // the entry to size-1 without copying the vector's elements.
      void removeElement(IntVectorMap& ivmap, int size) {
        IntVectorMap::iterator i = ivmap.find(size);
        assert(i != ivmap.end());
        i->second.pop_front();          // drop one element
        ivmap[size-1].swap(i->second);  // O(1): the new entry takes the elements
        ivmap.erase(i);                 // the old entry now holds an empty vector
      }

      This does involve a construction of a new collection, but moves the elements of the collection quite efficiently. Since we will erase the original map entry, causing the old vector to cease to exist, the fact that it now holds no elements does not matter.
    • The second case was one where we wanted to iterate through a set, but other code might be adding things to the set at the same time. For correctness, we must not lose newly added elements; they must be processed later. The original design made a copy of the set, cleared the original, and iterated over the copy. Once again, swap() is the right tool here too.

      // Process every element of s, including any added while we iterate:
      // swap s's contents into a local set, iterate over the local copy,
      // then recurse if anything arrived in the meantime.
      void processSet(IntSet& s) {
        IntSet t;
        t.swap(s);  // O(1): s is now empty, t holds the elements
        for (IntSet::iterator i = t.begin(); i != t.end(); ++i) {
          // ... process *i ...
        }
        if (!s.empty()) { processSet(s); }
      }

      As an added bonus, if you need to hold a lock to touch s, you can simply acquire the lock, do the swap (a very fast operation), then release the lock.
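      A sketch of that lock-then-swap pattern. It uses C++11’s std::mutex for brevity (which postdates the original Mace code), and the “processing” step is just a sum, for illustration:

```cpp
#include <mutex>
#include <set>

typedef std::set<int> IntSet;

IntSet sharedSet;        // producers insert here while holding sharedLock
std::mutex sharedLock;

// Drain and process sharedSet. The lock is held only for the O(1) swap,
// never for the (potentially long) iteration.
int drainAndSum() {
    IntSet local;
    {
        std::lock_guard<std::mutex> guard(sharedLock);
        local.swap(sharedSet);   // constant time: sharedSet is now empty
    }
    int sum = 0;
    for (IntSet::iterator i = local.begin(); i != local.end(); ++i)
        sum += *i;               // "processing" happens outside the lock
    return sum;
}
```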
  2. boost::concept_check. We were updating our serialization code, and found that compiler error messages for template errors are (to be generous) hard to decipher. In our case, these errors came from one of two problems. First, we had added a new template parameter and inserted it before some existing ones. In code that hadn’t been updated, if the older optional template parameters were provided, the compiler would get very confused and report indecipherable error messages pointing to lines of code that didn’t make any sense. In the other case, the default template parameter might not work with the other types passed. (Specifically, it was a parameter telling how to serialize a collection, and the collection elements might not have been serializable.) This message was a little easier to decipher, complaining about types that could not be serialized, but it still didn’t point to the right lines of code.

    Using boost’s concept check, we were able to address both of these problems. In the first case, we wrote a base class for all valid parameters of the template, then used a concept check to make sure the template parameter was convertible to the base class. Passing in an older parameter now generates a shorter, easier-to-understand message, and the error points at a line of code that actually makes sense. In the second case, we wrote our own concept checker, which essentially just contains code that must compile (in this case, instantiating the type, serializing it, and deserializing it). Again, the concept_check library makes sure the error message points to the right place.
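    A self-contained sketch of the second technique, the hand-written concept checker. The real code uses boost::function_requires from boost/concept_check.hpp; here the one-line equivalent is inlined so the example stands alone, and the serialize/deserialize names are stand-ins for Mace’s actual serialization API:

```cpp
#include <cstring>
#include <string>

// The core trick behind boost::function_requires: taking the address of
// Concept::constraints forces the compiler to instantiate its body, so a
// type failing the concept produces an error pointing at this function.
template <class Concept>
void function_requires() {
    void (Concept::*fp)() = &Concept::constraints;
    (void)fp;
}

// Hypothetical "serializable" concept: constraints() contains exactly
// the expressions a valid type must support.
template <class T>
struct SerializableConcept {
    void constraints() {
        T t;                    // must be default-constructible
        std::string buf;
        t.serialize(buf);       // must be serializable...
        t.deserialize(buf);     // ...and deserializable
    }
};

// A toy type that models the concept.
struct Packet {
    int id;
    void serialize(std::string& buf) const {
        buf.append(reinterpret_cast<const char*>(&id), sizeof(id));
    }
    void deserialize(const std::string& buf) {
        std::memcpy(&id, buf.data(), sizeof(id));
    }
};

// Templated code guarded by the check: passing a type without
// serialize/deserialize now fails here, with a readable message.
template <class T>
std::string store(const T& item) {
    function_requires<SerializableConcept<T> >();
    std::string buf;
    item.serialize(buf);
    return buf;
}
```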

That’s all for this edition. Watch the programming category if you want to see other programming tips.

Server Cookies, and I don't think they quite understand advertising…

I should start by explaining that I regularly run my web browser with cookies disabled. The reason is that I decided websites are tracking you too closely, especially websites you didn’t even know you were visiting. For example, open up your cookie list. (In Firefox, this is Tools->Options under Windows, or Edit->Preferences under Linux, then Privacy->“Show Cookies”.) The questions to ask yourself are:

  1. How many of the sites listed do I even recognize?
  2. Of the sites I do recognize, what do I want that site to remember about me the next time I visit?

Cookies, you see, are small pieces of data that a server gives to a web browser, asking it to present them whenever it visits a certain set of pages on a set of sites. Cookies have a number of legitimate uses, most notably to give the browser a “session” id. The session id is used so the browser user can, e.g., log in, and have the server keep track of information related to the login. (The other option, not using cookies, is to make the session id part of the URLs, which is both ugly and more likely to be logged by third parties such as proxies and caches run by ISPs.)
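For the curious, the mechanism is just an HTTP response header. A trivial sketch of the server side (the SESSIONID name and the attributes are illustrative; a real server also generates a hard-to-guess id):

```cpp
#include <string>

// Build the HTTP response header that hands a session id to the browser.
// On later requests to matching paths, the browser sends back
// "Cookie: SESSIONID=...", letting the server recognize the session.
std::string makeSetCookieHeader(const std::string& sessionId) {
    return "Set-Cookie: SESSIONID=" + sessionId + "; Path=/; HttpOnly";
}
```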

Then there are some arguably useful features of cookies. For example, many online retailers will set a cookie identifying you to your browser, and recognize you immediately when you visit again (not for purchasing, but for welcoming you, tracking the products you look at, reminding you of products you’ve viewed before, and suggesting new products based on your viewing history). I personally find that a little creepy, though I admit in some cases it can be valuable. A few years ago, there were even reports of sites using cookies to do Dynamic Pricing (story by CNN), a practice where sites change prices based on information they keep about the customer. There were reports of users visiting Amazon from a new computer, finding an item they liked, then logging in, and seeing it at a new price. In my opinion, these kinds of things outweigh whatever positive benefits come from having a site remember me.

Next, there are what I consider some outright despicable practices. Advertisements placed on sites will set cookies which get reported back to tracking sites any time you visit any site carrying an advertisement from the same company. As a result, there are companies which simply compile vast amounts of information about where you go and what you do online, to use in any way they see fit. These are commonly called “Tracking Cookies” by products such as Ad-Aware and Spybot, which will remove the ones they recognize for you.

I have simply taken the approach (mostly as an experiment) that sites shall not store cookies without my express consent. To that end, I have installed CookieSafe, which makes it easier to manage cookie settings: I either allow or reject cookies from specific sites. This is a per-site preference, meaning that if a site uses both kinds of cookies and I want to use the site, I accept them both. Importantly, third-party cookies are still rejected — I would have to authorize them separately.

So my browsing works like this: I browse normally, then if a site isn’t working (and particularly if submitting a login doesn’t work), I realize it needed cookies to work. I then decide if I really want to use the site, and if I do, I enable cookies for that site only.

Now, when I view my list of cookies, I can identify most of the sites. (Some I must have authorized but don’t quite recognize by site name, like the third party my bank uses to process online billpay.) I find this much more acceptable, and my browsing hasn’t been any worse for wear.

A few days ago, however, I saw something that really brought a smile to my face. On a site I visited while trying to figure out what it meant to buy fertile eggs, I saw this image, where an ad belongs:
“No Cookie” Advertisement

I just had to laugh. If a site wants to withhold ads because I reject cookies — great! I didn’t want them anyway. But somehow I think they’ve missed the point of advertising. If I were them, I would send SOMETHING back. All the same, I hope other sites take this approach. It could be the end of all the annoying Flash ads I get, if instead I got these images everywhere!

Maps in Disasters, Revisited

Last month I posted about the evolving maps during the San Diego Firestorm 2007. Yesterday, as I was sitting in a waiting room browsing the Union-Tribune, I found this article going into a bit of detail on how those maps were created. It still doesn’t say much about what advances were made, but it does describe the players: San Diego State University, a team from Google, a professor from UCSD, and a collection of researchers worldwide who focus on imaging, all working together. From the U-T article, I mainly glean that the map images were the result of taking imagery from a wide variety of sources (satellite, aerial footage, thermal imaging) and using “geo-referencing” to align them all onto the same map.

Evolving Technology in Crisis

Flash crowds are something that I think about a lot. This is mainly because it’s one of the prime challenges of building distributed systems.

Consider what happened in 1999 when Victoria’s Secret ran a Super Bowl ad announcing an online webcast of its Spring Fashion Show. The result was a sudden large volume of traffic to their site to view the webcast, so much that many customers were unable to view the webcast because the server could not handle the flash crowd.

A similar problem occurred after 9/11/2001, when everyone went to their favorite online news outlets for the emerging story.

What separates the two, of course, is that Victoria’s Secret planned their webcast (but failed to foresee the limits of their servers), whereas crisis situations are unpredicted, and generally not provisioned for.

This was clear in the handling of the San Diego firestorm last week in several ways, two of which I’ll mention here. What I find fascinating is how the people involved had to adapt their technologies to handle the crisis. Unexpected situations may always lead to this, and the people involved should largely be applauded. But at the same time, this presents an opportunity to look at what happened and try to prepare automated systems for next time. Specifically, we need to improve our GIS/mapping techniques, and our transparent web-content scalability techniques.

Continue reading “Evolving Technology in Crisis”

Trying to understand SiteMap(s)

So for some time I have been using Gallery as my picture site, and I’ve been quite happy with it overall (my prior post about it notwithstanding).

In recent versions, I have noted a reference to a “Google SiteMap” in the administration pages. Being ignorant of them, I ignored it. Yesterday, I decided to look a bit further into it to understand them. This was partly because lately I’ve felt like a large amount of my server bandwidth has been taken by search robots, and I wondered/hoped that the sitemap would make the crawler use less bandwidth.

Continue reading “Trying to understand SiteMap(s)”

"Hidden" Pages in WordPress

So I’ve been working with Kristina to set up her website/blog. Using WordPress to write a whole website is quite an interesting concept. By contrast, this site and Tom’s blog embed a WordPress blog in an otherwise conventional site. But to build a site entirely in WordPress, you write all your web pages using its web interface, and tell it how to structure them.

But one of the issues is that a common practice on websites is to create a page which you don’t include in any menus, but which you might link to from a few special places, such as an email, making it a sort of “hidden” page. You don’t want people to have to log in to see it; you just want them to know the URL.

This idea has been suggested to WordPress, but the last comment was six months ago, and I’m afraid no good solution has been proposed yet. There is a workaround for the tech-savvy, namely modifying your theme to specifically exclude the id of the page you want hidden. This has two problems: (1) when I tried it, the page was excluded, but some HTML was still generated, which changed the formatting of the link list; (2) at kcubes.com, themes are shared across blogs, which means that if someone else chose the specially modified theme, they would get confusing results.

Does anyone know of other solutions? Is there a plug-in which provides this feature? For a while I thought “Private” would do what we wanted, but apparently that requires you to be logged in to see the page. And while I’m griping, why does typing “wordpress private pages” into Google not give you a clear description of what exactly private pages are?

Mutt error message: "Message XXX UID YYY less than ZZZ"

If you are constantly getting this error message in Mutt (seemingly every time Mutt checks for new messages, or when you change folders), then I may have a solution for you. When I tried to look up this message, I found two kinds of pages:

  • Pages saying the message was harmless and to be ignored
  • Pages giving a technical explanation of the error message (which also tend to fall in the first category too)

Two pages in particular are this one from the Pine users group, and this one from the IMAP FAQs, which give technical explanations. Basically, the UIDs in a mailbox must be increasing, and something detected them out of order. Typically, clients take the opportunity to re-order the UIDs to correct the problem.

But in my Mutt instance, the message persisted across restarting the client, moving messages, and more. I could not get it to go away. After understanding the technical side, I went looking for the problem, but I couldn’t find it; it did not occur in the mailboxes themselves.

Finally, I realized it had to do with the header cache. I deleted the header cache, and now I don’t have the error message anymore.

Good Luck!

3-D IMax Movies

So I went to see “Harry Potter and the Order of the Phoenix: An IMAX 3D Experience” Saturday. I was a little disappointed that only the last 20 minutes were in 3-d. Perhaps if I’d read up before, I would instead have been excited that a whole 20 minutes were in 3-d, but such is life. Those 20 minutes were quite impressive, and I did appreciate seeing them in 3-d.

So since then, I’ve been wondering why the rest of the movie wasn’t in 3-d. I tried to search Google for the answer, but only came up with stories about how there would be 20 minutes, and nothing about why not the other minutes.

What’s the reason? Possibilities I’ve come up with are:

  • Perhaps it’s too expensive
  • Maybe the technology doesn’t work as well on non-action scenes
  • Possibly people become disoriented with a full-length 3-d feature

I also found a nice article describing the different IMAX varieties, and in particular how the 3-d technology works [Wikipedia’s IMAX page]. First, the scenes are filmed simultaneously by two different cameras about 2.5 inches apart (mimicking the spacing of our eyes), and both images are then projected simultaneously. To keep this from confusing your eyes, the two projections are polarized at perpendicular angles, and the glasses you wear cancel out one of the two images for each eye, reproducing the sense of depth. Of course, for a movie like HP, they are mostly dealing with animations, and use patented computer-graphics technology to artificially create the two projections. This gives them the further advantage of being able to correct imperfections in the dual recording to give a more natural sense of depth. To read more about it, I recommend the linked Wikipedia page. I think it would be fascinating to find people who work on this kind of graphics and hear more about the technology.

Whatever the reason they only ran 20 minutes 3-d, I for one am looking forward to the day when watching movies is a full 3-d experience, whether through glasses, holograms, or otherwise.