Reflections on the Technology Stack for Connected

Given the recent acquisition of Connected by LinkedIn, I thought it would be a great time to reflect on what worked well with our technology stack as well as what I would like to improve going forward.

If you haven’t had a chance to check them out, feel free to review my previous posts where I detail our technology stack and the open source libraries we leverage. In this post I’ll jump right into our learnings.

What Worked Well

Python \ Django facilitated fast iteration. Our primary purpose for picking Python \ Django as the core of our stack was to enable fast iteration with both new features as well as feature enhancements. Overall the stack held true to this core value, enabling us to release and iterate on features faster than many of our competitors. We would incubate concepts as weekend projects and have them running in production before the weekend was over.

Amazon Web Services enabled on-demand and cost-effective scaling. AWS was the key to keeping our costs down while still allowing us to scale with spikes in traffic. We could easily spin up additional servers to handle the additional load. Given that our workload required significant resources when a member joined but fewer resources afterwards, we were often able to spin down instances to save on expenses after big spikes in traffic.

Python was versatile enough to serve as both our front-end and back-end language of choice. Python worked equally well to build our front-end website as it did to implement our sophisticated back-end infrastructure that is continuously syncing contacts and conversations from various social networks, address books, and email providers. While we have seen some developers build their front-end in say Ruby on Rails but their back-end in Java \ C++ \ etc, we were able to efficiently leverage the same language and code base for both sets of tasks.

Python \ Django libraries embodied the “batteries included” philosophy. The included standard libraries in both Python and Django as well as the broad community of available libraries ensured we rarely re-invented the wheel to support core capabilities. We were able to find reusable libraries for everything from HTML parsing, to vCard reading\writing, to Amazon S3 object manipulation.

Memcache provided a quick win for performance. Memcache as well as the native support for it within Django made it very easy to employ this distributed cache from the get-go to improve performance via caching heavily used queries as well as pre-caching expected future queries. It allowed us to build an experience that felt more responsive than many others in our space.
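
To make the caching pattern above concrete, here is a minimal sketch of caching a heavily used query. It uses a small in-memory stand-in instead of a real memcached client so that it runs anywhere; FakeCache, cached_query, the key format, and the query function are all hypothetical names for illustration, not code from Connected. Django's cache object exposes similar get and set calls.

```python
import time


class FakeCache:
    """In-memory stand-in for a memcached client (hypothetical, for illustration)."""

    def __init__(self):
        self._store = {}

    def get(self, key):
        entry = self._store.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if time.time() >= expires_at:
            # Entry has expired; behave as a cache miss.
            del self._store[key]
            return None
        return value

    def set(self, key, value, timeout=300):
        self._store[key] = (value, time.time() + timeout)


def cached_query(cache, key, run_query, timeout=300):
    """Return the cached result for key, running the query only on a miss."""
    result = cache.get(key)
    if result is None:
        result = run_query()
        cache.set(key, result, timeout)
    return result


calls = []


def expensive_contact_count():
    # Pretend this hits the database (hypothetical query).
    calls.append(1)
    return 42


cache = FakeCache()
first = cached_query(cache, "contact_count:user:1", expensive_contact_count)
second = cached_query(cache, "contact_count:user:1", expensive_contact_count)
```

The second call is served from the cache, so the query function runs only once. Pre-caching an expected future query amounts to calling set ahead of time, before any request asks for that key.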

jQuery minimized our cross-browser JavaScript headaches. By using jQuery for all of our DOM manipulation and AJAX requests, we were able to spend minimal time on JavaScript-related browser incompatibilities.

What Needs Improvement

Database migrations are painful. While Django models make it easy to iterate on data models prior to pushing to production, once they are in production, they do little to support the sometimes complicated and painful data migration process. We ended up maintaining a migrations file with each release that detailed the SQL needed to appropriately migrate the database. To keep things simple, we required that the database and app code were migrated at the same time without requiring backwards compatibility. This was possible in the early days, but started to get more painful as we grew. Many folks have pointed to Django South as a solution to these migration issues. I haven’t yet had a chance to explore it, but certainly plan on doing so going forward.
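
As an illustration of the "migrations file with each release" approach, here is a minimal sketch of a runner that applies ordered SQL statements and records which ones have already run. It uses SQLite from the standard library purely so the example is self-contained (the production database was MySQL), and the table names, the schema_version table, and the migrate function are assumptions made for this sketch rather than the actual scripts.

```python
import sqlite3

# Hypothetical per-release migrations: (version, SQL) pairs applied in order.
MIGRATIONS = [
    (1, "CREATE TABLE contact (id INTEGER PRIMARY KEY, name TEXT)"),
    (2, "ALTER TABLE contact ADD COLUMN email TEXT"),
]


def current_version(conn):
    """Return the highest applied migration number, or 0 for a fresh database."""
    conn.execute("CREATE TABLE IF NOT EXISTS schema_version (version INTEGER)")
    row = conn.execute("SELECT MAX(version) FROM schema_version").fetchone()
    return row[0] or 0


def migrate(conn):
    """Apply every migration newer than the recorded schema version."""
    applied = []
    version = current_version(conn)
    for number, sql in MIGRATIONS:
        if number > version:
            conn.execute(sql)
            conn.execute(
                "INSERT INTO schema_version (version) VALUES (?)", (number,)
            )
            applied.append(number)
    conn.commit()
    return applied


conn = sqlite3.connect(":memory:")
first_run = migrate(conn)   # applies migrations 1 and 2
second_run = migrate(conn)  # nothing left to apply
```

Recording the applied version makes the script safe to re-run, which helps when the database and app code have to be upgraded together.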

CSS is not easy to keep cross-browser compatible. While jQuery certainly helped with JavaScript-related browser incompatibilities, we had no such help for our CSS-related issues. It’s difficult when leveraging modern UI design and relatively new CSS properties to ensure your page looks good in all supported browsers. While tools like MogoTest help to show what your page looks like in each browser and reduce testing costs, they don’t solve the core issue: being able to author a page’s CSS once and not having to worry about browser incompatibilities.

Application monitoring requires custom scripts and alerts. We leverage a variety of monitoring and alerting tools to manage our overall infrastructure. These include Pingdom for monitoring up-time and overall site availability, Munin for instrumenting our common application services like Apache, Nginx, MySQL, and memcache, Amazon CloudWatch for Amazon instance monitoring, Cloudkick for additional virtual instance monitoring, and Django Sentry for app-level exceptions. Yet our most frequently occurring issues, which involved our queue server and third-party APIs, were reported via custom scripts and alerts that were brittle, unsophisticated, and costly to maintain. It would have been helpful to have one overall platform for monitoring and alerting that easily supported our custom events. We’ll certainly be evaluating such solutions going forward.

Nginx, Apache, and mod_wsgi are heavy-weight for Django application serving. Our web application stack consisted of nginx as a front-end web server in front of several Apache instances. Apache interfaced with our Django app via mod_wsgi, which ran a separate pool of processes to handle incoming requests. Originally nginx was responsible for serving static files, but that was eventually off-loaded to Amazon S3, so it currently isn’t doing much for us. The biggest issue though is Apache, with its high memory usage for each process spawned by the mod_wsgi daemon. It all feels too heavy-weight for our serving needs. Some folks have recommended simply using gunicorn with nginx. I’ll definitely be looking into alternatives like this and would love to see more large scale deployments talking about their experience with them.
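
For readers curious what the gunicorn alternative might look like, here is a small configuration sketch. gunicorn reads its settings from a Python file, and bind, workers, and timeout are real gunicorn settings, but the specific values below are assumptions for illustration, not a tested production setup.

```python
# gunicorn.conf.py (hypothetical values; tune for your own workload)
import multiprocessing

# Listen on a local port that nginx would proxy requests to.
bind = "127.0.0.1:8000"

# Starting point suggested in the gunicorn docs: (2 x cores) + 1.
workers = multiprocessing.cpu_count() * 2 + 1

# Restart workers that stay silent for longer than this many seconds.
timeout = 30
```

With a file like this, nginx stays in front as the public-facing server and simply proxies to the local port, with no Apache layer in between.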

There is no built-in framework for versioning static files. Django 1.3’s built-in support for handling static files makes it a lot simpler to collect static files from various Django apps and push them to various serving locations, including Amazon S3. However, if you are following web performance best practices by setting long expire headers for static media, you need a way to version your static files to ensure new media are served when deployed. One popular mechanism for doing this is putting a version number in the path to the static file, so that each new version gets a new URL and is never masked by a cached copy. Developers are left to invent their own mechanism for doing so, and it would be great if it were either natively supported by the framework or best practices were encouraged.
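
One way to get a version into the path is to derive it from the file's content, so the URL changes only when the file does. The sketch below shows that variation; the versioned_path helper and the file names are hypothetical, and a content hash is swapped in here for the plain version number mentioned above.

```python
import hashlib
import posixpath


def versioned_path(path, content):
    """Insert a short content hash before the file extension."""
    digest = hashlib.sha256(content).hexdigest()[:12]
    root, ext = posixpath.splitext(path)
    return "%s.%s%s" % (root, digest, ext)


# Two releases of the same stylesheet produce two different URLs.
css_v1 = versioned_path("css/site.css", b"body { color: black; }")
css_v2 = versioned_path("css/site.css", b"body { color: navy; }")
```

Because an unchanged file keeps the same name across deploys, browsers can keep their cached copy under long expire headers, while any edited file is fetched fresh under its new name.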

Sharding your database is difficult. Django 1.2 brought multiple database support to a single Django instance, which allowed you to easily set up rules for where database traffic was routed. The DatabaseRouters mechanism made it easy to set up routing algorithms that enabled schemes like writing to a master and reading from various slaves, or partitioning database tables along app boundaries. However, actually sharding user data across various database shards remains a challenge. The hints offered in the DatabaseRouter mechanism are too limited to enable easy sharding along, say, a user or profile id. Better native support for this would enable developers to plan and build for this early on, as opposed to the expensive task of adding this in later.
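
To show where the hints fall short, here is a sketch of a router that tries to shard on a user id. The db_for_read and db_for_write method names and the instance hint follow Django's router interface, but the shard aliases, the user_id attribute, and the modulo scheme are assumptions for illustration. The router is a plain Python class, so the sketch runs without Django installed.

```python
class UserShardRouter:
    """Sketch of a Django-style router that shards on a user id (hypothetical)."""

    # Assumed aliases that would be defined in the DATABASES setting.
    SHARDS = ["shard_0", "shard_1", "shard_2", "shard_3"]

    def _shard_for(self, hints):
        instance = hints.get("instance")
        user_id = getattr(instance, "user_id", None)
        if user_id is None:
            # No usable hint: returning None lets Django fall through
            # to the next router or the default database.
            return None
        return self.SHARDS[user_id % len(self.SHARDS)]

    def db_for_read(self, model, **hints):
        return self._shard_for(hints)

    def db_for_write(self, model, **hints):
        return self._shard_for(hints)


class FakeProfile:
    """Stand-in for a model instance that carries a user_id."""

    def __init__(self, user_id):
        self.user_id = user_id


router = UserShardRouter()
with_hint = router.db_for_write(FakeProfile, instance=FakeProfile(user_id=10))
without_hint = router.db_for_read(FakeProfile)
```

When an instance is available the router can pick a shard, but a read that arrives with no instance hint gives it nothing to shard on, so the application has to choose the database explicitly for those queries.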

Enabling rich JavaScript interactions quickly gets cumbersome. jQuery makes it easy to add simple AJAX interactions to your app as well as demand load various content onto your page. However, as you try to get more sophisticated with your JavaScript interactions, it quickly becomes cumbersome to do so. The first issue you run into is the need for client-side templates. We leveraged John Resig’s micro_templating.js to enable this, though the placement of these templates and the syntax were fairly messy. The second issue we ran into was managing state within the client. We began experimenting with backbone.js to implement JavaScript models, but it really only felt appropriate for JavaScript-centric pages. I think there is plenty of room for improvement and we certainly plan on exploring alternatives for building rich client-side JavaScript.

Amazon Web Services scaling is not for free. While the promise of on-demand resources from AWS is enticing, the reality is that it takes a decent amount of work on the developer’s side to enable it, and it comes with its own operational issues. While on-demand virtual instances are certainly broadly available across a number of providers, the mechanisms for automatically scaling your resources up and down require significant input from the developer to manage. In addition, simply adding redundancy across your infrastructure in an environment like AWS isn’t as easy as you think, as you have to think through your deployment configuration to decide whether you wish to spread your infrastructure across multiple availability zones and/or multiple regions. These have latency and cost implications depending on your choices. You also need to deal with the fact you don’t have dedicated hardware, so your instances could be restarted or terminated at any point. We had this happen to us numerous times. Building fault tolerance into your system from the beginning is certainly a best practice, but is a fairly costly hit that you need to take upfront if you are going to deploy to such an environment. It shouldn’t have to be this kind of trade-off.

Round robin DNS is a poor solution for automatic failover. We leverage round robin DNS as a way to load balance across our front end web servers. While it has issues with lookups being cached for extended periods at different points, it does a decent job of balancing traffic across our servers. We also rely on round robin DNS as a simple form of automatic failover. In theory, when a browser does a DNS lookup and gets multiple IPs, it first attempts to connect on the first IP. If it cannot, then it connects on the second IP and so forth. This should provide failover in the case one of your servers is down. In practice, this turns out to be far from reality. The browser only attempts the second IP if it gets a connection refused back from your server. If your server is experiencing high latencies, the browser will continue to wait on it. The browser also doesn’t cache the fact that it has failed over to an alternate IP. So every request checks the primary IP and then the alternatives, which results in significant latency before the timeouts expire. This makes it an unrealistic solution for automatic failover. Actual DNS failover solutions seem like the right answer.
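
The client behavior one would want for failover can be sketched as follows: try each address in order, give up on a slow server after a short timeout rather than waiting indefinitely, and move on to the next. This is an illustration of the idea, not how any browser is implemented; connect_with_failover, fake_connect, and the addresses are hypothetical, and the connection function is passed in so the sketch runs without a network.

```python
def connect_with_failover(addresses, connect, timeout=2.0):
    """Try each address in order; return (address, connection) on first success."""
    last_error = None
    for address in addresses:
        try:
            return address, connect(address, timeout)
        except OSError as exc:
            # Covers refused connections and socket timeouts alike.
            last_error = exc
    raise OSError("all addresses failed: %r" % (last_error,))


def fake_connect(address, timeout):
    # Hypothetical stand-in for socket.create_connection((host, port), timeout).
    if address == "10.0.0.1":
        raise TimeoutError("simulated slow server")
    return "connection to %s" % address


used, conn = connect_with_failover(["10.0.0.1", "10.0.0.2"], fake_connect)
```

Even with this logic, every new request still pays the timeout on the dead primary unless the client remembers which address last worked, which is the latency problem described above.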

I hope this provides an interesting perspective on what worked well and what needs improvement in the Connected tech stack. I’d love to hear from any of you if you have potential solutions to some of the issues we ran into!

A Look at Open Source Inside Connected

Open Source

The cost of building software products has dramatically fallen compared to a decade ago. Products that used to take millions of dollars are now being built for hundreds of thousands if not tens of thousands of dollars. Two of the most important drivers of falling costs have been open source software and cloud computing.

Yesterday I had the delightful task of rebuilding one of our production cloud images for Connected. What I realized during that process was the full extent to which we rely on open source software to build Connected. Connected wouldn't be what it is today and couldn't have been built nearly as quickly or cheaply without the incredible amount of open source used throughout the stack. I thought I'd take a moment to catalog all the open source software we use to give you a sense of just how much it has truly changed the cost of software development.

Production Operating Systems
Fedora - OS used on our web servers
CentOS - OS used on our queue servers

Data Tier
MySQL - data storage
mysql-proxy - used for automatic db failover
memcached - hot cache

Web Servers
Apache - application web server
mod_wsgi - interface to Python application code
Nginx - static files and load balancing web server

Application Code
Wordpress - hosts our blog
Python - application programming language

Python Libraries
django - Python web framework
setuptools - easy package installation
pip - even easier package installation
virtualenv - isolated package installations
mysql-python - Python MySQL driver
BeautifulSoup - HTML parser
lxml - HTML parser
django_compressor - JS and CSS static file compression
django-indexer - simple key/value store
django-paging - simple paging
django-sentry - detailed web request error logging
greenlet - concurrent programming library
eventlet - concurrent programming library
pyopenssl - SSL support
gdata - Google Data API library
httplib2 - advanced http support
pycrypto - cryptography
python-openid - OpenID
pytz - timezone support
tlslite - SSL support
feedparser - broad feed parsing support
iso8601 - ISO 8601 date conversion
thrift - cross-language development
evernote - Evernote API library
python-dateutil - automatic date conversion
vobject - vCard support
suds - SOAP API library
python-ntlm - NTLM authentication
dnspython - DNS querying
django-storages - common Django storage back-ends
boto - S3 library
python-memcached - memcached library
aweber_api - AWeber API
django-templatetag-sugar - simplified django templates
oauth2 - OAuth library
pyssh - SSH client
django-logging - django debugging
debug_toolbar - django debugging
daemon - daemonize your background processes

Front-end Code
jQuery - light-weight javascript library
jQuery UI - UI widgets for javascript
jquery-autocomplete - autocomplete text field
jquery-fancybox - pop-up dialog
sencha - mobile javascript framework
micro_templating.js - John Resig's simple javascript templating
underscore.js - powerful data manipulation javascript library
Backbone.js - light-weight javascript MVC framework

Developer Tools
svn - version control
svnX - Mac svn client
Eclipse - developer tools
iTerm - alternative terminal client
PyDev - Python support in Eclipse
Pylint - Python static analysis
pyflakes - Python static analysis
Munin - graphing and monitoring
yui compressor - javascript compressor

Understanding the Players in the Social Data Layer

Social is clearly one of the biggest trends on the web right now, with the majority of new apps and services taking advantage of your friends to provide a more participatory experience. This extends across desktop and mobile applications as well as across most verticals, including media, e-commerce, travel, and more.

But what’s most exciting to me is what is happening a layer below these applications - the rise of the social data layer. The social data layer provides a set of compelling APIs that any application can take advantage of to quickly immerse its experience in social. Just as cloud computing significantly reduced the cost of building web applications, these social data platforms are significantly reducing the friction in creating compelling social experiences.

While Facebook is clearly leading the efforts in providing the social data layer, there is a growing set of startups and other providers of social data that new applications can take advantage of. I thought I’d take a moment to describe the current landscape from my perspective.

Social Network APIs
Without a doubt, at the core of the social data layer are the social networks that enable access to both their rich social profile data as well as robust social graph APIs.

Facebook
Facebook, now with over 750 million active members, is not only the largest of the social network providers, but also kicked off the social data revolution by opening up their APIs in 2007. Any app developer building a social application should strongly consider making Facebook their base, with the largest & truest social graph across all the networks.

Twitter
Twitter, with 300 million registered accounts, provides a unique set of opportunities, as their one-way follow mechanic has led to Twitter’s social graph being described as the interest graph as opposed to a pure-play social graph. Since you can follow people you may not know, but whose area of expertise interests you or whom you simply want to keep an eye on, it creates a unique set of graph nodes that are compelling for a variety of applications. And of all the social data providers, Twitter has been the closest in keeping up with Facebook in terms of API robustness and has even gone on to create an entire layer of streaming APIs that are specific to Twitter and their data set.

LinkedIn
At 120 million members, LinkedIn is by far the largest professional graph with the richest searchable resume data for each of its members. It’s a clear choice for any professional application. LinkedIn has had renewed focus of late on their API offering and has expanded beyond basic profile APIs to also allow you to query their company data, group data, and jobs database.

Email Providers
Another rich source of implicit social data that I believe is still significantly under-utilized is the email inbox. Locked inside one’s inbox is almost a truer representation of one’s social graph compared to that which is mapped on explicit social networks. And we are only now starting to see applications leverage this data in interesting ways.

Gmail
Gmail is leading the pack in opening up their platform to third party developers. For one, they launched an OAuth extension to their IMAP APIs, which now allow you to have delegated access to a user’s inbox without the user having to share their credentials. Given how sensitive the inbox is, this one addition goes a long way to ensure user trust. In addition, they launched the Gmail Contextual Gadget extension point that allows apps to be embedded right within Gmail. Unfortunately this extension is currently limited to Google Apps, but will hopefully be ported to consumer Gmail as well.

Yahoo Mail
Yahoo Mail also provides a set of APIs to query their inbox directly. It’s nicer than Gmail’s interface since you can bypass IMAP altogether and use their web standard APIs. Yahoo has also invested in mail as a platform by enabling applications to be installed right within the Yahoo Mail interface. Unlike Gmail, these apps are targeted at both consumers and professionals leveraging Yahoo Mail.

Windows Live Hotmail and AOL Mail are other notable mentions here due to their large user bases. However, neither has devoted serious resources to opening up a platform to access their inbox data, though straight POP and IMAP access is available.

Inbox APIs
While you could develop an application that directly speaks to the various email APIs out there, there is a set of inbox API startups looking to simplify the entire effort of accessing inboxes on your users’ behalf.

Context.IO
Context.IO provides a robust inbox API that will automatically index the inbox of your end-user and provides an easy-to-use REST API for accessing those messages. If you have ever dealt with IMAP, you’ll appreciate all the work that Context.IO does for you so you don’t have to deal with the complexities associated with it. They currently enable indexing an IMAP account with speed and at scale.

Jexy
Jexy is still in its infancy, but one to watch for normalized API access to email, calendar, notes, and more. Their goal is to provide a single interface across any inbox, whether it’s IMAP, POP, Exchange, etc. Looking forward to trying their beta when it becomes available.

Social Data Providers
The social data providers supply connection data to map an email address to social profiles across the web. This becomes very useful when trying to acquire social data from email addresses or trying to fill out a full social profile of a given user.

Rapleaf
Rapleaf was the first compelling offering in this space with one of the largest databases of social data. Unfortunately they have discontinued their social profile lookup API that returned various social profiles for a given e-mail address due to negative press around how they acquired their data. They do still offer a useful personalization API that will give you data on the user behind an e-mail address, including age, gender, household income, and more.

Qwerly
Qwerly allows you to search both by an e-mail address to find associated social profiles or by a single social profile and find other associated social profiles. The data has been leveraged by many email marketing providers, CRM tools, and more to help provide more data about your customers. They have an interesting approach to acquiring their data through a sophisticated social profile crawler.

FullContact
FullContact is a more recent entrant to the social data space, with a compelling contact API that allows you to send it a partial contact record (name, email, phone number, etc) and have them fill it out with more complete data, including social profiles. Again a very useful data source for apps looking to build a more complete profile of users and connections.

Fliptop
Fliptop also enables looking up an e-mail address and returning both social profiles for that person as well as demographic data like name, gender, location, and more. Worth checking them out as well.

Google Social Graph API
Google also provides an answer to this problem via their public crawler. The Social Graph API enables accessing social profile data via their search engine and allows you to search across a variety of attributes. It’s definitely worth looking at for your needs as I hear their data set has gotten much better over the years.

Social Influence APIs
With the rise of the power of the collective community across social networks, key influencers become more and more important. And now there is a set of APIs for you to understand just how influential a person is across their areas of expertise. This data can be used for scenarios ranging from CRM, to customer support, to social marketing campaigns.

Klout
Klout is the most well known social influence provider available. While they got their initial start analyzing Twitter data, they have since expanded to analyze 10 different social media properties, including Facebook, LinkedIn, YouTube, Blogger, and more. For any user, Klout provides an overall influence score as well as details on a given person's areas of expertise.

PeerIndex
PeerIndex is another provider of social influence data. They similarly provide detailed scores on each user to help you better understand their topic expertise as well as overall audience reach.

Personal Data Stores
These projects attempt to bring all your personal data together and then make them available to services via a unified API. They provide value both to the end user in the aggregation but also to developers via their API.

Singly
Singly is the company behind the Locker Project, an open-source effort to create a personal data store of all your personal data from across the web. While a useful end-user service in itself, they also plan on offering a rich set of APIs for developers to take advantage of to get access to this personal data store for their applications. While still in the early phases, certainly a worthwhile effort and one to watch.

Greplin
Greplin is the ultimate search tool across your personal data. They index all your social accounts, inbox, and more and provide a simple Google-style interface to search across the data. They currently have an API in closed beta, but will hopefully open it up shortly to allow other developers to take advantage of their rich index.

API Aggregators
When you are looking to integrate with a variety of APIs, it’s often useful to consider leveraging an API aggregator that normalizes social data for you into a single API interface.

Gnip
Gnip is the most well known API aggregator, providing comprehensive access across a variety of social data providers. They are also the official Twitter partner for getting access to the Twitter firehose of data. So if you have extreme Twitter needs, these are also your guys.

Apigee
Apigee is useful in the development stage, as they provide developer tools for exploring APIs, making it much easier to get started with a variety of APIs. I expect over time these guys may even help normalize these different APIs for developers, though they are currently focused on working mainly with API publishers.

Standards
There have been several attempts as well to develop standards for sharing social profile data that will hopefully continue to get traction amongst publishers.

PortableContacts
PortableContacts was designed as a standard for publishers to share their contact data uniformly across a variety of publishers. Plaxo was one of the first publishers to support it, as Joseph Smarr was a strong advocate for it while he was there. Google also has an implementation of PortableContacts. However, we haven’t yet seen many of the other providers of contact data take up PortableContacts, so its usefulness is currently limited.

Webfinger
Webfinger attempts to bring back the old finger protocol that allowed you to get identity information. The new webfinger API enables modern access to identity information. Google again currently implements Webfinger, but it continues to have low adoption amongst other data providers.

If you know of other startups or technologies that help to access the social data layer, please leave a comment!