Google App Engine | ECM Architect

I’m back from San Jose. My colleage, Dave Gynn, and I had fun at the O’Reilly Open Source Conference (OSCON) and learned a lot. Dave’s ability to pick out open source rockstars from a crowd is uncanny. It was pretty sweet seeing Larry Wall (and his family) hanging out and then hearing him speak. Although there are all kinds of topics on all things Open Source, the conference does have a heavy Perl bias.

Dave and I decided we were glad we went but we don’t feel like we have to be there every year going forward. This was my first time, but Dave said the general excitement level seemed low for some reason. Maybe it was Allison Randal’s seriously downbeat welcome address. Not sure. Anyway, here are my rough notes from some of the sessions I attended…

“Open Source in Government” was a big theme at OSCON this year. Speakers tried to instill a sense of urgency in the audience by saying that the window of opportunity for getting the government behind open source in a big way will only be open for a few more months. If you want to get involved, check out some of these links:

Data.gov mash-up contest
http://sunlightlabs.com/contests/appsforamerica2/

Machine readable datasets from the US Govt
http://www.data.gov/

Help the government make better use of open source
http://www.opensourceforamerica.org/

Some folks from Liferay presented on a new UI framework they’ve created called Alloy. Alloy is aimed at providing a single framework that addresses HTML, CSS, and JavaScript in a way that is abstracted from the underlying libraries. Alloy basically extends/subclasses JQuery and YUI. Liferay is migrating a lot of their OOTB portlets now to the new framework. It is expected to ship as part of 5.3. This talk was more about the “why” and less about the “what”. I would have liked to see more examples/demos.

Went to a talk on “using Django for election audits” that turned out to be more about how screwed up our elections process is and the minutiae of performing an audit on election results with not so much on how Django was used to solve the problem. The speaker did give a shout out to the Django Debug Toolbar that might prove to be useful. The presenter is looking for help with the project. He needs everything from UI help to people who can send him election results from their local election boards.

Saw a decent talk on Apache CouchDB. Couch is a schema-less database that is built for massive distributed scalability. Instead of SQL you use map-reduce functions to query. Key to Couch is the concept of “eventual consistency”–in a Couch app, data can be consistent over time instead of right now. Couch always knows either the correct old value or the correct current value, but it may take time to propogate the current value to every node in the system.

Noteworthy bullet points:

Couch can idle in 4MB of RAM. With a couple of production databases Couch will use about 20MB.
Canonical is including Couch in the Karmic Koala release. This will give apps running on Karmic the ability to easily sync data between nodes. Couch will also be running as part of Ubuntu One which means Karmic desktops can sync data with the Ubuntu cloud (See the Ubuntu wiki).
Someone is currently working on a JavaScript implementation of Couch. Among other things, this would give you the ability to replicate your CouchDB to a local version of Couch running in someone’s browser.
Current ACL is limited to “you are either an admin or you aren’t”. ACL for writers *might* make it into 1.0. ACL for readers won’t.

I went to the “JRuby on AppEngine” talk not for the JRuby, but because it was the only Google AppEngine session I could find. I was looking for some factoids on who’s using AppEngine. Here’s what they said:

200,000 registered developers
85,000 applications
Household names such as: eBay, Best Buy, Forbes, Whitehouse.gov.

Whitehouse.gov was a cool scalability story for AppEngine. They used AppEngine to moderate questions submitted during Obama’s first online town hall. According to the Google Code blog,

“During the 48-hour open voting period, the site peaked at 700 hits per second, and 92,934 people submitted 104,073 questions and cast 3,605,984 votes. In total, over one million unique visitors visited the site before the town hall. Even while the site was featured on major news outlets and even the Google homepage the other 50,000 apps built on App Engine were fully supported and experienced no adverse effects.”

The Erlang talk provided a good history of the language. I would have liked more on the language itself and less of the detailed history behind Ericsson’s telecom switches (even though Erlang played a critical role in those products). I was aware that CouchDB is built with Erlang but the speaker mentioned a couple of other open source projects that leverage Erlang that I hadn’t heard of: ejabberd is an Erlang-based chat server and RabbitMQ is an Erlang-based messaging server.

The “building a business on an open source distributed cloud” talk by Bradford Stephens was good. The speaker’s company, Visible Technologies, mines social networks and the internet in general for consumer sentiment on its customer’s brands. Their system ingests vast subsets of the Internet, parses the results, processes it, and indexes it so that they can run analytics against it for their clients. They moved from an all-Microsoft stack to an open source stack and have been very happy with it.

This was the third “noSQL”-themed talk I saw. He made a good point that when we design apps, we should be saying, “I need persistence” and then figure out what is the best provider of that given scalability and other constraints rather than starting out with “I need a relational database”.

The open source stack used by Visible Technologies includes the usual search players (Lucene, Nutch, Solr) as well as one I haven’t heard of: Katta is used to shard large Lucene indexes across multiple servers. They also use a couple of Hadoop sub-projects, HBase and ZooKeeper, and several others.

The New York Times API and NPR API talks were very good. I didn’t realize how many different API’s NYT has exposed. You can check out their API’s around people, news, search, movies, and books at http://developer.nytimes.com. Their blog is also worth checking out.

Lots of apps have been built using the NYT API. A personal favorite is InstantWatcher. It is a mash-up of NYT’s movies API with Netflix that helps you find good movies available to watch instantly.

NPR’s talk focused less on their specific API and more on how it is being used. Noteworthy bullets:

You can build API calls with their query generator (requires a free API key) or by hand (doc).
NPR offers tiered key levels. If you create something cool and drive a little traffic their way, you can get your key upgraded to a higher tier.
There are no rate limits. NPR believes they have built an infrastructure that can take “anything we can throw at it”.
The API has 2,000 users and serves 24 million requests (per ?) averaging 2 million requests per month.
50% of the API requests are for NPRML with less than 0.1% requesting ATOM. NPR API results are also available as JSON, RSS, and several other formats.
The NPR Digital Media team blogs at http://www.npr.org/blogs/inside/
Interesting side-note: NPR is currently migrating off of Oracle 10g to MySQL

After the NYT and NPR talks, they held a developer meet-up of sorts. Unfortunately I had to head to the airport so I missed out on that.

I’ve been playing with the newly-released Java support in Google App Engine and it is pretty cool. You can do more than I expected you could:

The Google App Engine Eclipse plug-in gives you a template project and associated config files, Ant build scripts, a deployment tool, and a local run-time environment that acts like GAE (user service, data store, limitations imposed by the platform).
You’ve got full persistence and query capability via JDO. You pretty much just model your entities as POJO’s, then you annotate the fields in those classes as “persistent” and you’re good to go. You do JDOQL to query your objects. Queries will only return the first 1000 results.
You can run cron jobs. A cron job wakes up on a schedule and invokes a URL you specify.
Servlets and JSPs are supported but you can also use things like Struts and Spring (See Will it Work in Google App Engine?).
You can take advantage Google’s User service, which means anyone with a Google account can sign-in to your app without creating a new account.
You can take advantage of Memcache if you need it (JCache).
You can fetch URLs via the URL Fetch service or java.net.URLConnection.
You can send mail via JavaMail.
You can use their Image service to resize, rotate, flip, and crop images.
Both JDK 5 and JDK 6 are supported.

There are some limits:

Execution of requests is limited to 30 seconds and that includes URLs invoked by cron jobs.
You can’t write to the file system. If you need to write out files, I assume you’d use S3 or something.
You can’t open sockets.
Each developer can create up to 10 applications and apps can’t be deleted so don’t fill up on Hello Worlds.
You can run an app that has up to 500 MB of storage and serves 5 million page views per month at no cost.

The beauty, obviously, is that as a developer, you get to focus on the code and let Google worry about scaling. For many applications, this Platform-as-a-Service (PaaS) will be preferred over Infrastructure-as-a-Service (IaaS). In an IaaS setup, you can use solutions like RightScale to automatically provision new nodes to handle spikes in demand, but you still have to set that up. Plus, you’ve got the additional cost and headache of installing, configuring, and maintaining the application server and database software (and making sure it is set up to work when new nodes are auto-provisioned). With the app engine, scaling globally is pretty simple: Step 1 – Write (Good) Code; Step 2 – Deploy Code to GAE.

ECM Architect

Tag: Google App Engine

Notes from OSCON 2009 in San Jose