Wednesday, July 4, 2007

My Code Search Project

This is probably the worst secret I have had to keep. It's been seven months. Seven anxious months! But the time has come. I can finally speak! The NDA holds me back no longer!

A small flashback is required. From December to February I was a software engineering intern working in Google's Zurich offices. The project I spent most of my time on was Google Code Search. You can read up about my experiences here.

The problem is that what I was working on was confidential, as with most work at Google. Since Code Search is a relatively small project, I couldn't go any further than telling people I had worked on Code Search. This all changes, however, when my work goes live. Guess what? You made the connection yet? Yes!! After seven long months, it has...(drum rolls)...gone live!!! :D

Another small flashback will do here. Before I even discovered I was to work on Code Search, I had this lingering question in my head about what is crawled. It was lingering enough to make me wonder, but not enough to make me find an answer. I did wonder, however, whether they crawled html pages. Well, it's out now. That was my main task during my internship: to crawl html pages for code embedded within web pages.
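To give a flavour of what "code embedded within web pages" means, here is a toy sketch in Python. It is purely a generic illustration and says nothing about how Code Search actually works: it assumes the code sits inside pre elements, and the names PreBlockExtractor and extract_pre_blocks are made up for this example.

```python
from html.parser import HTMLParser


class PreBlockExtractor(HTMLParser):
    """Collects the text content of every <pre> element in a page."""

    def __init__(self):
        # convert_charrefs=True turns entities such as &lt; back into "<"
        super().__init__(convert_charrefs=True)
        self.blocks = []   # finished <pre> bodies, in document order
        self._depth = 0    # how many <pre> tags are currently open
        self._parts = []   # text pieces of the block being collected

    def handle_starttag(self, tag, attrs):
        if tag == "pre":
            self._depth += 1

    def handle_endtag(self, tag):
        if tag == "pre" and self._depth > 0:
            self._depth -= 1
            if self._depth == 0:
                self.blocks.append("".join(self._parts))
                self._parts = []

    def handle_data(self, data):
        # Only keep text that appears inside a <pre> element
        if self._depth > 0:
            self._parts.append(data)


def extract_pre_blocks(html):
    """Return the text of each top-level <pre> element in the given HTML."""
    parser = PreBlockExtractor()
    parser.feed(html)
    parser.close()
    return parser.blocks
```

Feeding it a page with two pre elements gives back two strings, one per snippet, with html entities already decoded. Anything harder than that (code scattered between other markup, deciding whether a snippet is code at all) is well beyond this sketch.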

I am afraid of leaking information I shouldn't be discussing, so most of this big secret will have to remain that way, possibly forever. You might look at it and think "OMG, how did he do that?", or on the other end you might think "Gosh! I could do that in my sleep!" I'm afraid I can't back either end up. But I will tell you that this task taught me some interesting things about programming languages that I probably never would have come across otherwise. This happened while having to identify the language of the code and tag it. Tricky bastard that was!

The live feature is still very much in its early stages, so the results are far from perfect. And there are some other things which I cannot comment on. But you can see some of the results for yourself. You can tell that a result is from an html snippet if it isn't from an archive or a cvs/svn repository. There is no way to completely distinguish the types of results in a search (mine from the original), but to get an idea you can search for html pages not detected as html or php, which will yield some results.

There is a posting on the Russian Google Blog, which you can read using Google Translate here. It's a nice read from the perspective of the Russian members of Code Search. They point out a nice example: wordexp_t example, which yields a result from html snippets at the top.

UPDATE: The ranking has changed slightly, so although the example above still returns html results, they fall lower down. There is another example, however, which still yields a top html result: nph-refresh lang:perl

I'll leave you to explore the results of about 6-7 weeks of my time spent at Google. It was a really enjoyable experience, and I'd like to thank my manager Miguel Garcia and co-worker Pawel Aleksander Fedorynski, as well as the other team members in other locations.

UPDATE: An English post has been made on the Google Code Blog.


  1. Hmmm
    When I first read your post, I thought along the lines of identifying code from pages where code is mixed in between html elements, e.g. like on pastebins, maybe through the use of html syntax highlighters like GeSHi.

    But then I looked at the examples, and it seems all of them contain pure code enclosed in pre elements, with the exception of the occasional html entity inside the code.

    Am I wrong? I was wondering since I suppose identifying code with html in between would be much harder than identifying uninterrupted code inside html pre tags ;) (although I can merely imagine that that should already be quite a challenge :P)

    I guess this is also classified information? :P

  2. As you say, it does indeed fall under classified. There are problems that I had to overcome that I only wish I could discuss. Peek into the results and try to figure it out - no-one can prevent you from doing that.

    Ag, I'd love to tell you more. But short it will have to remain. :(