Thursday, November 30, 2017

Landlord can reduce the cost of running microservices

When running Spring's PetClinic Spring Boot reference application, my experimental Landlord project appears to save 5 times the memory for the second instance of it. My terse observations show that PetClinic requires about 250MiB of resident memory when running as a standalone Java 8 application. When running two instances of it hosted by Landlord, approximately 342.2 MiB of resident memory is consumed in total i.e. Landlord + PetClinic + PetClinic.

These observations illustrate that sharing JVM machinery has positive impacts on hosting services. While running multiple PetClinic applications on the same machine isn't a particularly realistic scenario, running multiple Spring Boot microservices on the same machine is. Landlord could, therefore, provide a significant cost reduction in the hosting of any JVM based microservices.

Introduction

I've been experimenting with a project I created named "Landlord". The idea of the project is to promote a multi-tenant JVM scenario with the goal of saving memory. Our standard usage of the JVM isn't particularly kind regarding memory usage with a simple Hello World application consuming about 35MiB of memory out of the box. This figure is about 10 times what you get with a native target. For example, the same program built via Scala Native will consume about 4.5MiB of memory. Note that we're talking about resident memory - not the JVM heap (which will be much less than that).

I thought that it'd be fun to run the standard Spring Boot PetClinic application within Landlord in order to get a feel for Landlord's cost savings.

Steps

  1. Clone PetClinic and then perform a ./mvnw package in order to get a standalone jar.
  2. Let's determine the smallest amount of RAM that PetClinic can get away with comfortably. For this, I kept running the following java command until I stopped seeing Out Of Memory (OOM) exceptions from Java.

    java \
      -XX:+UseSerialGC \
      -Xss512k \
      -XX:-TieredCompilation \
      -XX:CICompilerCount=1 \
      -XX:MaxRAM=80m \
      -cp ${pwd}/target/spring-petclinic-1.5.1.jar \
      org.springframework.boot.loader.JarLauncher

    Note that the serial GC and other options are the results of my previous investigations in order to keep JVM resident memory usage down. There are pros/cons, but the above configuration is useful when deploying to small devices such as network gateways. That said, if you can get your application running well with the above options, you're likely to run even better at a larger scale, and potentially save money given a lesser number of machines to host your required load. One more option I'd normally use is -Xss256k, but I observed a stack overflow so it seems that Spring likes lots of stack.
  3. When I got to that point, I then profiled the process in order to observe the JVM heap used vs allocated vs the limit. With the above configuration, PetClinic appeared to function and I didn't observe OOM, but observing the JVM heap revealed:

    Used: 37MB Alloc.: 38MB Limit: 38MB

    That feels a bit too close to comfort and seemed to be causing the GC to kick in each time that I refreshed the page.

    So, I ended up specifying -XX:MaxRAM=100m. This then yielded:

    Used: 46MB Alloc.: 48MB Limit: 48MB
  4. Now, even though I've specified max ram as 100MiB, this turns only to be a starting point for the JVM on how it should size its memory regions. On OS X, if I use the Activity Monitor's inspection for a process (double-click on a process in its memory tab) then the following is reported: Real Memory: 259.3 MB (that's its Resident Set Size - RSS). So, even though we stated that we didn't want more than 100MiB, this is not a limit and the JVM can take more. Apparently, this is a JVM implementation consideration. I assume (hope) that the -XX:+UnlockExperimentalVMOptions -XX:+UseCGroupMemoryLimitForHeap are much stronger than -XX:MaxRAM, or perhaps this is just some OS X JVM implementation thing... I do wish that the JVM was more predictable in this regard.
  5. Time for Landlord. Let's start it up with the same options:

    JAVA_OPTS="-XX:+UseSerialGC -Xss512k -XX:-TieredCompilation -XX:CICompilerCount=1 -XX:MaxRAM=100m" \
    daemon/target/universal/stage/bin/landlordd \
      --process-dir-path=/tmp/a
  6. Let's observe its memory via the Activity Monitor: Real Memory: 133.8 MB.
  7. Now to load PetClinic. For this, we need a script to load it into Landlord:

    #!/bin/sh
    ( \
      printf "l" && \
      echo "-cp spring-petclinic-1.5.1.jar org.springframework.boot.loader.JarLauncher" && \
      tar -c -C $(pwd)/target spring-petclinic-1.5.1.jar && \
      cat <&0 \
      ) | nc -U /var/run/landlord/landlordd.sock
  8. Upon executing the above script: Real Memory: 278.3 MB. That's just 19MiB more than when it was run as a standalone application. Connecting to Landlord via YourKit shows the heap as:

    Used: 47MB Alloc.: 48MB Limit: 48MB

    ...which is quite similar to before. There doesn't appear to be any GC thrashing either. This shouldn't be any great surprise. Its thread and heap usage is quite minimal.

    Now, using Landlord for hosting just one application is not really going to give you any great benefit. Landlord's benefit's kick in when multiple applications are run. Let's run another PetClinic within Landlord.
  9. First, so that PetClinic's ports don't clash, declare a random port to bind to within src/main/resources/application.properties:
    server.port=0
  10. Package the app via ./mvnw package and then invoke the Landlord script for PetClinic as before.

    Unfortunately, this doesn't work... the embedded Tomcat of Spring Boot throws an exception:

    Caused by: java.lang.Error: factory already defined
    at java.net.URL.setURLStreamHandlerFactory(URL.java:1112) ~[na:1.8.0_131]

    While we're at it, there are a couple of places where Spring declares shutdown hooks. Landlord warns of this with the following output:

    Warning: Shutdown hooks are not applicable within landlord as many applications reside in the same JVM. Declare a `public static void trap(int signal)` for trapping signals and catch `SecurityException` around your shutdown hook code.

    Clearly, there are some changes in order for an application to run within a multi-tenant environment. The PetClinic/Spring Boot environment is built to assume that it is running within its own process. Going forward, I believe it would be easy for the Spring Boot project to cater for these multi-tenancy concerns. For now, we change the PetClinic application to use Jetty instead of Tomcat. To do this, we follow the recipe of Spring Boot's documentation.
  11. Once the Jetty version is running, I observed the native Java process with a: Real Memory: 245.9 MB. Under Landlord, the same package + Landlord: Real Memory: 290.9 MB. A bit more of a difference to the 278.3 MiB for the Tomcat based package + Landlord, but who knows what the JVM is doing... perhaps we can assume this as being some JVM anomaly.

    Now, if we try to run another PetClinic within Landlord then we get an OOM memory error. Clearly, we need more JVM not having very much at hand before. Let's re-run Landlord with -XX:MaxRAM=120m (20MiB more overall).

    We now get a problem given a clash of JMX endpoints, so we turn them off (src/main/resources/application.properties: spring.jmx.enabled=false) and try again.

    Real Memory: 342.2 MB

    That's just 51.3MiB additional RSS to run what would be 245.9 MiB to run an additional PetClinic outside of Landlord. Landlord, at least in this simple observation, is reducing the memory cost by about a factor of 5.
This has been a simple test and I welcome feedback on improving it.

Sunday, October 16, 2016

When closed source should become open

I've been thinking for some time about when closed source should become open, particularly in the context of when your core business is about producing software. If your core business is to provide a service such as movies, as in the case of Netflix, then the dynamics are different. Because the core business is to produce movies then simply go OSS and reap the benefits from having done so (as Netflix indeed have).

Before I start I should state that my views here don't describe the only reason why to go with open software; there can be other reasons of course. Indeed there are many valid reasons to start with open as well. This post just investigates the closed to open transition, and when to make it.

When your business is about producing software, you're producing software assets that contain costly intellectual property. I'm a massive fan of open software and I've made many contributions in that space. However a software business also needs to make money of course.

I assert that there is a very limited window of opportunity for a software business to retain a software asset as closed; and that window is governed by the open competition that it faces. The job of the software business then, is to stay ahead of the open curve, yet yield to open software when it starts to become a threat. This happened with Java when it was threatened by Apache Harmony. I believe that Harmony subsequently died precisely because of Sun OSS-ing Java.

I should state right now that my thoughts have been influenced by Danese Cooper who gave a great talk on this very subject during Scala Days 2015. Denese discussed why open languages win, and I think her talk has a wider application.

When discussing the subject of open vs closed software with colleagues at Lightbend over the past year or two, I've described closed software as resting on a tectonic plates. As these plates move around then the closed software at the edge falls off into the abyss of open software! I think that the analogy is mostly useful though in order to illustrate that the world changes. Because of this if you must regularly re-evaluate the competition that is open. If you have closed software solving a particularly useful/important problem then you can be fairly certain that open software will rise around it (again thinking of what Denese said here).

Open your commercial software and neutralise its open competition, also reaping the benefits of having gone open. Focus on adding higher level value building out from your core. Stay ahead of the game.

You certainly can't sit still.

Thursday, July 14, 2016

Microservices: from development to production


Let’s face it, microservices sound great, but they’re sure hard to set up and get going. There are service gateways to consider, setting up service discovery, consolidated logging, rolling updates, resiliency concerns… the list is almost endless. Distributed systems benefit the business, not so much the developer.

Until now.

Whatever you think of sbt, the primary build tool of Lagom, it is a powerful beast. As such we’ve made it do the heavy lifting of packaging, loading and running your entire Lagom system, including Cassandra, with just one simple command:


sbt> install


This “install” command will introspect your project and its sub-projects, generate configuration, package everything up, load it into a local ConductR cluster and then run it all. Just. One. Command. Try doing that with your >insert favourite build tool here<!

Lower level commands also remain available so that you can package, load and run individual services on a local ConductR cluster in support of getting everything right before pushing to production.

Lagom is aimed at making the developer productive when developing microservices. The ConductR integration now carries that same goal through to production.

Please watch the 8 minute video for a comprehensive demonstration, and be sure to visit the “Lagom for production” documentation in order to keep up to date with your production options. While we aim for Lagom to run with your favourite orchestration tool, we think you’ll find the build integration for ConductR hard to beat. Finally, you can focus on your business problem, and not the infrastructure to support it in production.

 Enjoy!

Tuesday, July 5, 2016

Developers need to care about resiliency

Disclaimer: I'm the technical lead for Lightbend ConductR - a tool that focuses on managing distributed systems with a key goal of resiliency.

I've been doing a reasonable amount of travelling over the past few years. Overall I enjoy it; I don't think that I'm away that much that it has become painful - may be 3-6 international flights per year.

One of the things you hear about when travelling is missing an international connection. I've been fortunate in that this has happened just once; a couple of weeks ago in fact.

The airline was British Airways (BA), and they did a really good job of trying to make up time given that flights out of Heathrow were causing delays across Europe. Thus my flight from Berlin TXL to London LHR was about two hours late. I missed my Sydney SYD flight from LHR. The BA staff did a great job of putting me up in a hotel overnight and getting me on to the next available flight. Honestly, from a staff perspective, BA were fantastic in fact.

What was frustrating though was that it took about two hours to arrange the accomodation and flight booking. This was also considering that I didn't have to queue for long and that a staff member attended to me in a reasonable time frame. The problem was the BA computer system.

Apparently BA have some new system. There were IT staff walking around helping the front-of-house staff get into the system and deal with its incapacity to handle any load. The IT staff were frustrated. The front-of-house staff were frustrated. I was frustrated (although not as frustrated as a Business Class passenger in front of me who felt that his ticket meant that BA should treat him like royalty!).

BA's computer system had an amazing effect on all concerned, except most likely the people that wrote it. In my opinion the original developers should be there supporting the front-of-house staff. They should feel the pain that they have inflicted.

I'm sure that you have similar stories to share, where computer systems have failed you miserably. Computer systems will of course fail, that's natural, but it is the fact that their degree of failure is considered acceptable that is the problem. Computer systems should not fail to the extent that they do. Your airline reservation system, online banking site, or whatever it is, it should be more reliable that it probably has been.

The problem is that developers do not understand that building-in resilience to their software is more important than most other things. As my colleague, Jonas Bonér has stated many times, "without resiliency nothing else matters". He's so right. Why is it then that developers just don't get this?

My answer to that is that many developers just see what they do as a job, and they don't really care about what they do. Putting that aside though, creating and then indeed managing distributed systems, a key requirement for resiliency, is harder than not; not hard, but harder and developers are lazy (btw: in case you don't realise this, I'm a developer!).

We need systems that are resilient. We therefore need developers to care about resiliency. The more that developers care about resiliency, the more tools and technologies we'll see appearing in support of it. I strongly feel that it all starts with the developer though.

I imagine a world where, given the inevitability of missing flight connections, I can wait in a queue for no longer than 10 minutes, be handled within another 10 minutes and then sleep off the tiredness and inconvenience of waiting another 24 hours for my next flight. The developer just needs to start caring in order for this to happen. Make he or she responsible for managing their system in production and they'll start caring, I guarantee it.

Here's a language/tool agnostic starting point for you if you are a developer that cares enough to have read this far: The Reactive Manifesto.

Thanks!

Sunday, May 22, 2016

Why we created an orchestration tool for you

One question I have had to answer a few times as the tech lead of ConductR, and I think it is a healthy question, is on why did Lightbend create ConductR? This post is my personal attempt to describe the rationale for it two years ago, and why I think it is more relevant than ever.

Back then we wanted a tool that was focused on making the deployment and management of applications and services built on Java, Scala, Akka and Play as easy as it could be. We wanted ConductR to be to operations what Play is to web application developers; a "batteries included" approach to deploying and managing reactive applications and services.

Two years ago, there really wasn't anything else out there that we felt offered such a packaged approach to solving these new use-cases for operations people. The sentiment was that we had done a reasonable job with the Reactive Manifesto at that point, and that we'd definitely engaged developers, but we were quickly going to arrive at a situation where operational people were going to find it a challenge to manage these new distributed applications and services. We also wanted something that had the reactive DNA.

That's how it all started. So, what's changed, and why is ConductR relevant now?

There are a number of players emerging in the orchestration space presently. This certainly validates our being a player in this space from a needs perspective. If you're happy to roll your own orchestration (which actually remains what we're up against in terms of competition, and this hasn't changed much in two years), then be prepared to have two people spend at least year tackling a problem that is harder than you think, and then realise that you have an operational cost in maintaining it. Atop of this, there's the risk to your company regarding those individuals leaving... is it sufficiently documented in terms of others taking over? Nobody has won in the orchestration space, but there's enough to choose from that will trump the business risk to your company of rolling your own. My advice here having been involved in designing and writing an orchestration tool (twice) is to not roll your own and focus on your core-business.

While I personally think that the operational productivity culture that permeates through our design is still the single most important reason to consider ConductR, here are some other reasons:

  • a means to manage configuration distinctly from your packaged artifact;
  • consolidated logging across many nodes;
  • a supervisory system whereby if your service(s) terminate unexpectedly then they are automatically restarted;
  • the ability to scale up and down with ease and with speed;
  • handling of network failures, in particular those that can lead to a split brain scenario;
  • automated seed node discovery when requiring more than one instance of your service so that they may share a cluster;
  • the ability to perform rolling updates of your services;
  • support for your services being monitored across a cluster; and
  • the ability to test your services locally prior to them being deployed.

Furthermore ConductR is the complete manifestation of the entire stack of technologies that we at Lightbend both contribute to and support. It is a great example of an Akka based distributed application that uses in particular, akka-cluster, akka-distributed-data and akka-streams/http. It is also tightly integrated with our Akka monitoring based instrumentation, and the monitoring story around events, tracing and metrics is going to get stronger. If you like our stack, you should feel good about the way ConductR has been put together.

We have programmed ConductR in the spirit of the Reactive Manifesto, with resiliency and elasticity being a particular focus. There is no single point of failure and our ability to scale out is holding up.

One last point: we use ConductR for our own production environment at Lightbend hosting our websites and sales/marketing services. With any product out there, you should always look for this trait. If a supplier is not dependent on their own technologies in terms of running their core business then beware; they can lose enthusiasm very quickly.

ConductR is as relevant as it ever was, and with its batteries-included approach for operations, I'm sure it'll become even more relevant as the industry moves toward deploying and managing microservices.

One last tidbit: ConductR is becoming a framework for Mesos/DCOS. Exciting times!

Thanks for reading this far!

Monday, March 7, 2016

What the name "lightbend" means to me

I thought that it'd be useful to share my personal perspective on the meaning of our company name change. Here are the contents of an email that I sent out to everyone within Lightbend, and which was warmly received.

---

Hi fellow lightbenders,

I’m very excited about the Lightbend name, and want to provide my view on what it means to me.

About two years ago, I presented at YOW. YOW is a great conference with the characteristic that speakers get to talk to a cross section of our industry on three occasions: Melbourne, Brisbane and Sydney. One is therefore not preaching to the converted, but rather talking to what can be quite a hostile crowd!

My first talk was to a few hundred people in Melbourne - apparently the most hostile of the three cities. About ten minutes into the talk I had that sinking feeling that I’d lost everyone. My talk was about Akka streams and the importance of back pressure. Lots of blank looks all around. An interesting aspect of YOW is that you are scored by the audience. You guessed it, my scores were low.

Travelling up to Brisbane I felt that it was important to bring the talk back a bit. Instead of delving right into Akka streams, I felt that I should at least have a preamble around reactive streams and why we did that. The Brisbane talk went much better.

However given the nature of the questions asked after my talk I felt that I could do even better. So, for Sydney, my preamble included a discussion on “why reactive”. This set the scene for the remainder of the talk and my Sydney scores reflected that.

Coming away from YOW I realised how fringe Typesafe were - again this is two years ago. I certainly appreciated that we were not anything near mainstream, but really, we were on another planet compared to where the IT industry was at.

Roll forward to today and you can see that we’ve come a long way. We have done so without deviating on our mission from an technical perspective. I would tell people that if you want to understand anything about our technical direction then simply read the Reactive Manifesto. You'll then see our DNA blueprint; the very fabric of what we are. Taking that further and quoting Jonas Boner, “without resiliency, nothing else matters”. We have upheld the manifesto and, in particular resiliency, like nothing else matters.

And now we are seeing the industry finally come our way. To highlight a few points, the industry received our new name well, it is excited about Lagom as a microservices framework for Java and the enterprise leading Spring framework is effectively adopting the reactive manifesto.

This is where the lightbend name kicks in for me.

I see lightbend as the gravitational force that is bending the light beam representing the direction of the industry at large. Gravity bends light.

To use my earlier analogy of Typesafe being on another planet, two years ago, we were light years away from where the industry was thinking. We are no longer. We have pulled the industry to the way we think software systems should be put together and managed.

We are now at an interesting juncture. As the company expands as it needs to, it would be easy to compromise our technical mission in order to gain further traction. However it is now more important than ever to stay on mission.

We need to be brave and continue to be bold. The industry doesn’t need more of the same; it needs more companies like us.

Thanks for reading!

Kind regards,
Christopher

Monday, December 29, 2014

Where FP meets OO

A strong feature of Scala is the embracing of both Functional Programming (FP) and Object Orientation (OO). This was a very deliberate and early design decision of Scala in recognition of the strengths of both approaches over the decades. I hope to show you that utilising both approaches to developing software can work together.

Over the past two years of using Scala on a daily basis I’ve found myself adopting a predominantly FP approach to development, while embracing true OO in the form of Akka’s actors. What this boils down to is side-effect-free programming with functions focused on one purpose. Where state is required to be maintained, typically for IO style scenarios, then actors come to the rescue. Within those actors if there are more than 2 state transitions then I find myself using Akka’s FSM where upon it maintains the state through transitions. As a consequence, there are very few “var" declarations in my code, but I don’t get hung up on their usage where the code becomes clearer or there is a performance advantage in some critical section of code.

I can’t even remember the last time I used a lock of some type...

Thus my actor code looks something like this:

object SomeActor {
  def props(): Props =
    ...

  // My actor’s other pure functions,
// perhaps forming the bulk of code.
// The functions are typically private
// to the package and therefore
// available for tests within the
// same package. 
}
 
class SomeActor extends Actor {
  override def receive: Receive =
    ...

  // Functions that break down the
// receive handling calling upon the
  // pure functions of the companion and
// possibly other traits

What I find is that as I expand the companion object with pure functions, patterns of behaviour emerge and its functions are factored out into other traits which of course become re-usable and remain highly testable. Sometimes I form these behavioural abstractions ahead of creating the companion object, but more often than not it is the other way round. I’m big on continuous re-factoring and spend a lot of time attempting to get the functional abstractions right. This can mean that the functions are often re-written once the behavioural patterns emerge.

So why then is the above representative of OO? Actors permit only message passing and are completely opaque to the outside of their instance. This is one of Alan Kay’s very early requirements of OO. Actors combined with Scala also mostly fit his following requirements (taken from http://c2.com/cgi/wiki?AlanKaysDefinitionOfObjectOriented):

  1. Everything is an object.
  2. Objects communicate by sending and receiving messages (in terms of objects).
  3. Objects have their own memory (in terms of objects).
  4. Every object is an instance of a class (which must be an object).
  5. The class holds the shared behavior for its instances (in the form of objects in a program list).
  6. To eval a program list, control is passed to the first object and the remainder is treated as its message.

Point 3 is a weak point of the JVM and one should be careful about message payloads remaining immutable, but at least with Scala, immutability is the default way of coding. It’d be great if actors themselves had their own address space. However this has never raised itself as a problem in my world.

Scala is one of the few languages that marries the world of FP and OO and thus does not need to “throw the baby out with the bathwater”. Many other languages force you to make a choice. That said, just like most marriages, there’s always the dominant party making the most sense, and that’d be FP!