Reflections on the Technology Stack for Connected

Given the recent acquisition of Connected by LinkedIn, I thought it would be a great time to reflect on what worked well with our technology stack as well as what I would like to improve going forward.

If you haven’t had a chance to check them out, feel free to review my previous posts where I detail our technology stack and the open source libraries we leverage. In this post I’ll jump right into our learnings.

What Worked Well

Python/Django facilitated fast iteration. Our primary reason for picking Python/Django as the core of our stack was to enable fast iteration on both new features and feature enhancements. Overall the stack held true to this core value, enabling us to release and iterate on features faster than many of our competitors. We would incubate concepts as weekend projects and have them running in production before the weekend was over.

Amazon Web Services enabled on-demand and cost-effective scaling. AWS was the key to keeping our costs down while allowing us to scale with spikes in traffic. We could easily spin up additional servers to handle the additional load. Given that our workload required significant resources when a member joined but fewer resources afterwards, we were often able to spin down instances after big spikes in traffic to save on expenses.

Python was versatile enough to serve as both our front-end and back-end language of choice. Python worked equally well to build our front-end website as it did to implement our sophisticated back-end infrastructure that continuously syncs contacts and conversations from various social networks, address books, and email providers. While we have seen some developers build their front-end in, say, Ruby on Rails and their back-end in Java or C++, we were able to efficiently leverage the same language and code base for both sets of tasks.

Python/Django libraries embodied the “batteries included” philosophy. The standard libraries included with both Python and Django, as well as the broad community of available libraries, ensured we rarely re-invented the wheel to support core capabilities. We were able to find reusable libraries for everything from HTML parsing, to vCard reading/writing, to Amazon S3 object manipulation.
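As one small illustration of the batteries-included point, even link extraction needs no third-party dependency — the standard library ships an HTML parser. This is a sketch in modern Python; the class and sample markup are mine, not from our code base.

```python
from html.parser import HTMLParser  # standard library, no extra install

class LinkExtractor(HTMLParser):
    """Collect the href of every anchor tag encountered."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href":
                    self.links.append(value)

parser = LinkExtractor()
parser.feed('<p>See <a href="https://example.com">this</a> page.</p>')
print(parser.links)  # ['https://example.com']
```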

Memcache provided a quick win for performance. Memcache, along with the native support for it within Django, made it very easy to employ this distributed cache from the get-go to improve performance by caching heavily used queries as well as pre-caching expected future queries. It allowed us to build an experience that felt more responsive than many others in our space.
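The pattern here is classic cache-aside. Below is a runnable sketch in which a plain dict stands in for memcached; with Django you would use `cache.get`/`cache.set` from `django.core.cache` instead. The function and key names are illustrative, not from our code.

```python
# Cache-aside pattern as used via Django's cache framework; an in-memory
# dict stands in for memcached so the sketch runs standalone.
cache = {}

def expensive_query(user_id):
    # Placeholder for a heavy database query.
    return {"user_id": user_id, "contacts": 42}

def get_profile(user_id):
    key = f"profile:{user_id}"
    if key in cache:            # with Django: cache.get(key)
        return cache[key]
    result = expensive_query(user_id)
    cache[key] = result         # with Django: cache.set(key, result, timeout=300)
    return result

print(get_profile(7))  # first call hits the "database"
print(get_profile(7))  # second call is served from the cache
```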

jQuery minimized our cross-browser JavaScript headaches. By using jQuery for all of our DOM manipulation and AJAX requests, we were able to spend minimal time on JavaScript-related browser incompatibilities.

What Needs Improvement

Database migrations are painful. While Django models make it easy to iterate on data models prior to pushing to production, once they are in production, they do little to support the sometimes complicated and painful data migration process. We ended up maintaining a migrations file with each release that detailed the SQL needed to appropriately migrate the database. To keep things simple, we required that the database and app code were migrated at the same time without requiring backwards compatibility. This was possible in the early days, but started to get more painful as we grew. Many folks have pointed to Django South as a solution to these migration issues. I haven’t yet had a chance to explore it, but certainly plan on doing so going forward.
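The hand-rolled approach amounts to a versioned list of SQL statements applied exactly once per database. A minimal runnable sketch of that idea follows; our production setup used MySQL and per-release SQL files, but sqlite3 and these illustrative statements keep the sketch self-contained.

```python
import sqlite3

# Each release appends ordered SQL; a version table records what has run.
MIGRATIONS = [
    "CREATE TABLE profile (id INTEGER PRIMARY KEY, name TEXT)",
    "ALTER TABLE profile ADD COLUMN email TEXT",
]

def migrate(conn):
    conn.execute("CREATE TABLE IF NOT EXISTS schema_version (version INTEGER)")
    current = conn.execute("SELECT MAX(version) FROM schema_version").fetchone()[0] or 0
    for version, sql in enumerate(MIGRATIONS, start=1):
        if version > current:       # skip migrations already applied
            conn.execute(sql)
            conn.execute("INSERT INTO schema_version VALUES (?)", (version,))
    conn.commit()

conn = sqlite3.connect(":memory:")
migrate(conn)
migrate(conn)  # safe to re-run: already-applied versions are skipped
```

The painful part is everything this sketch leaves out: rollbacks, data backfills, and coordinating schema changes with app deploys.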

CSS is not easy to keep cross-browser compatible. While jQuery certainly helped with JavaScript-related browser incompatibilities, we had no such help for our CSS-related issues. It’s difficult when leveraging modern UI design and relatively new CSS properties to ensure your page looks good in all supported browsers. While tools like MogoTest show what your page looks like in each browser and reduce testing costs, they don’t solve the core issue: letting you author a page’s CSS once without having to worry about browser incompatibilities.

Application monitoring requires custom scripts and alerts. We leveraged a variety of monitoring and alerting tools to manage our overall infrastructure: Pingdom for monitoring up-time and overall site availability, Munin for instrumenting common application services like Apache, Nginx, MySQL, and memcache, Amazon CloudWatch for Amazon instance monitoring, Cloudkick for additional virtual instance monitoring, and Django Sentry for app-level exceptions. Yet our most frequently occurring issues, around our queue server and third-party APIs, were reported via custom scripts and alerts that were brittle, unsophisticated, and costly to maintain. It would have been helpful to have one overall platform for monitoring and alerting that easily supported our custom events. We’ll certainly be evaluating such solutions going forward.
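To give a flavor of what those custom checks looked like, here is a hedged sketch of the poll-compare-alert shape they all shared. The function, threshold, and stubbed metric source are illustrative; a real check would query the queue server or a third-party API and page someone.

```python
# The recurring shape of our custom checks: poll a metric, compare it to a
# hard-coded threshold, emit an alert. Brittle because every check re-invents
# polling, thresholds, and alert delivery.
def check_queue_depth(get_depth, threshold=1000):
    """Return an alert string when the queue backs up, else None."""
    depth = get_depth()
    if depth > threshold:
        return f"ALERT: queue depth {depth} exceeds threshold {threshold}"
    return None

# In production get_depth would query the queue server; a stub stands in here.
print(check_queue_depth(lambda: 1500))  # backed up: prints an ALERT line
print(check_queue_depth(lambda: 10))    # healthy: prints None
```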

Nginx, Apache, and mod_wsgi is heavy-weight for Django application serving. Our web application stack consisted of nginx as a front-end web server in front of several Apache instances. Apache interfaced with our Django app via mod_wsgi, which ran a separate pool of processes to handle incoming requests. Originally nginx was responsible for serving static files, but that was eventually off-loaded to Amazon S3, so it currently isn’t doing much for us. The biggest issue, though, is Apache, with its high memory usage for each process spawned by the mod_wsgi daemon. It all feels too heavy-weight for our serving needs. Some folks have recommended simply using gunicorn with nginx. I’ll definitely be looking into alternatives like this and would love to see more large-scale deployments talking about their experience with them.
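For reference, the nginx + gunicorn setup folks recommend replaces the Apache/mod_wsgi layer with a small Python config file. This is a hypothetical `gunicorn.conf.py` sketch; the bind address, worker count rule of thumb, and project name are illustrative, not tuned for any real load.

```python
# Hypothetical gunicorn.conf.py: nginx proxies requests to this address,
# gunicorn manages the worker pool that Apache/mod_wsgi used to.
import multiprocessing

bind = "127.0.0.1:8000"                        # nginx upstream target
workers = multiprocessing.cpu_count() * 2 + 1  # common rule of thumb
worker_class = "sync"                          # plain WSGI workers for Django
timeout = 30                                   # recycle hung requests after 30s
```

You would then start the server with something like `gunicorn -c gunicorn.conf.py myproject.wsgi:application` (project name hypothetical), keeping nginx solely as a reverse proxy.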

There is no built-in framework for versioning static files. Django 1.3’s built-in support for handling static files makes it a lot simpler to collect static files from various Django apps and push them to various serving locations, including Amazon S3. However, if you follow web performance best practices by setting long expire headers on static media, you need a way to version your static files to ensure new media are served when deployed. One popular mechanism is embedding a version number in the path to the static file so the new version is always served. Developers are left to invent their own mechanism for doing so; it would be great if the framework either supported this natively or encouraged best practices.
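One common variant of the version-in-the-path idea is to derive the version from the file’s contents, so a deploy automatically produces new URLs only for files that changed. A minimal sketch, with a hypothetical CDN base URL:

```python
import hashlib

# Content-hash versioning: embed a digest of the file's bytes in its URL so a
# long-expiry cached copy can never be stale.
def versioned_url(path, content, base="https://static.example.com"):
    digest = hashlib.md5(content).hexdigest()[:8]  # short content fingerprint
    return f"{base}/{digest}/{path}"

url = versioned_url("css/site.css", b"body { color: #333; }")
print(url)
# Any change to the file's bytes yields a different digest, hence a new URL,
# so browsers holding the old copy fetch the new version on the next deploy.
```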

Sharding your database is difficult. Django 1.2 brought multiple database support to a single Django instance, which allowed you to easily set up rules for where database traffic was routed. The DatabaseRouters mechanism made it easy to set up routing algorithms that enabled schemes like writing to a master and reading from various slaves, or partitioning database tables along app boundaries. However, actually sharding user data across various database shards remains a challenge. The hints offered in the DatabaseRouter mechanism are too limited to enable easy sharding along, say, a user or profile id. Better native support for this would enable developers to plan and build for this early on, as opposed to the expensive task of adding it in later.
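The routing rule itself is simple; what is hard is getting the right hints to it. A sketch of the id-based scheme we would have wanted, in plain Python (in Django this logic would live in a router’s `db_for_read`/`db_for_write` methods, and the shard aliases are hypothetical `DATABASES` entries):

```python
# Id-based sharding sketch: route each user to one of N database aliases.
# Modulo keeps the mapping stable as long as the shard count is fixed;
# resharding (changing N) is its own expensive migration problem.
SHARDS = ["users_shard_0", "users_shard_1", "users_shard_2"]

def shard_for_user(user_id, shards=SHARDS):
    return shards[user_id % len(shards)]

print(shard_for_user(7))   # users_shard_1
print(shard_for_user(9))   # users_shard_0
```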

Enabling rich JavaScript interactions quickly gets cumbersome. jQuery makes it easy to add simple AJAX interactions to your app as well as demand-load various content onto your page. However, as you try to get more sophisticated with your JavaScript interactions, it quickly becomes cumbersome to do so. The first issue you run into is the need for client-side templates. We leverage John Resig’s micro_templating.js to enable this, though the placement of these templates and their syntax were fairly messy. The second issue we ran into was managing state within the client. We began experimenting with backbone.js to implement JavaScript models, but it really only felt appropriate for JavaScript-centric pages. I think there is plenty of room for improvement and we certainly plan on exploring alternatives for rich client-side JavaScript.

Amazon Web Services scaling is not free. While the promise of on-demand resources from AWS is enticing, the reality is that it takes a decent amount of work on the developer’s side to enable, and it comes with its own operational issues. While on-demand virtual instances are certainly broadly available across a number of providers, the mechanisms for automatically scaling your resources up and down require significant input from the developer to manage. In addition, simply adding redundancy across your infrastructure in an environment like AWS isn’t as easy as you might think: you have to decide whether to spread your infrastructure across multiple availability zones and/or multiple regions, and these choices have latency and cost implications. You also need to deal with the fact that you don’t have dedicated hardware, so your instances could be restarted or terminated at any point. We had this happen to us numerous times. Building fault tolerance into your system from the beginning is certainly a best practice, but it is a fairly costly hit that you need to take upfront if you are going to deploy to such an environment. It shouldn’t have to be this kind of trade-off.

Round robin DNS is a poor solution for automatic failover. We leverage round robin DNS as a way to load balance across our front-end web servers. While it has issues with lookups being cached for extended periods at different points, it does a decent job of balancing traffic across our servers. We also relied on round robin DNS as a simple form of automatic failover. When a browser does a DNS lookup and gets multiple IPs, it first attempts to connect to the first IP. If it cannot, it connects to the second IP, and so forth. In theory this provides failover when one of your servers is down. In practice, it falls far short. The browser only attempts the second IP if it gets a connection refused from your server. If your server is experiencing high latencies, the browser will continue to wait on it. The browser also doesn’t cache the fact that it has failed over to an alternate IP, so every request checks the primary IP and then the alternatives, which results in significant latency before the timeouts expire. This makes it an unrealistic solution for automatic failover. Actual DNS failover solutions seem like the right answer.
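The failure mode is easy to see with a rough model of browser connect behavior. This sketch is mine, and the 21-second timeout is an illustrative stand-in for a typical browser connect timeout, not a measured value:

```python
# Rough model of round robin DNS "failover": the browser tries each IP in DNS
# order and only moves on after a failure. A refused connection fails fast;
# a hung server burns the full connect timeout before the next IP is tried.
CONNECT_TIMEOUT = 21.0  # seconds; illustrative browser connect timeout

def time_to_connect(server_states, timeout=CONNECT_TIMEOUT):
    """server_states: per-IP status ('ok', 'hung', 'refused') in DNS order.
    Returns seconds until a successful connection, or None if all fail."""
    elapsed = 0.0
    for state in server_states:
        if state == "ok":
            return elapsed
        if state == "hung":        # server up but unresponsive
            elapsed += timeout     # browser waits out the full timeout
        # 'refused' fails almost instantly; no time added
    return None

print(time_to_connect(["refused", "ok"]))  # fast failover: 0.0
print(time_to_connect(["hung", "ok"]))     # slow: 21.0 seconds wasted
```

And since the browser re-tries the primary IP on every request, that worst-case wait is paid repeatedly, not once.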

I hope this provides an interesting perspective on what worked well and what needs improvement in the Connected tech stack. I’d love to hear from any of you if you have potential solutions to some of the issues we ran into!