Saturday, 19 April 2014

CacheCow 0.5 Released

[Level T1]

OK, finally CacheCow 0.5 (actually 0.5.1, as 0.5 was released by mistake and pulled quickly) is out. So here is the list of changes, some of which are breaking. Breaking changes, however, are very easy to fix and if you do not need the new features, you can happily stay on the stable 0.4 versions. If you have moved to 0.5 and you see issues, please let me know ASAP. For more information, see my other post on the alpha release.

So here are the features:

Azure Caching for CacheCow Server and Client

Thanks to Ugo Lattanzi, we now have CacheCow storage in Azure Caching, for both client and server. Considering that calls to Azure Cache take around 1ms (based on some ad hoc tests I have done - do not quote this as a proper benchmark), this makes it a really good option if you have deployed your application in Azure. You just need to specify the cacheName, otherwise the "default" cache will be used.
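For illustration only - the exact store type name below is an assumption, so check the CacheCow Azure packages for the actual class - the server-side wiring would look roughly like this:

// type name assumed for illustration; see the CacheCow Azure server package for the actual class
var entityTagStore = new AzureCachingEntityTagStore("default"); // pass your cacheName; "default" otherwise
var handler = new CachingHandler(config, entityTagStore);
config.MessageHandlers.Add(handler);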

Complete re-design of cache invalidation

I have now put some sense into cache invalidation. The point is that a strong ETag is generated for a particular representation of a resource, while cache invalidation happens on the resource, covering all of its representations. For example, if you send the application/xml representation of a resource, an ETag is generated for this particular representation; the application/json representation will get its own ETag. However, on a PUT both need to be invalidated. In addition, on a PUT or DELETE of an instance resource (let's say /api/customer/123), the collection resource (/api/customer) needs to be invalidated too, since the collection will be different.

But how would we find out whether a resource is a collection or an instance? I have implemented some logic that infers whether the item is a collection or not - and this is based on common and conventional route design in ASP.NET Web API. If your routing is very custom this will not work.

Another aspect is hierarchical routes. When a resource is invalidated, its parent resource will be invalidated as well. For example, in the case of PUT /api/customer/123/order/456, the customer resource will change too - if orders are returned as part of the customer. So in this case, not only the order collection resource but also the customer needs to be invalidated. This is currently done in CacheCow.Server and will work for conventional routes.

Using MemoryCache for InMemory stores (both server and client)

Previously I had been using dictionaries for the InMemory stores. The problem with Dictionary is that it just grows until the system runs out of memory, while MemoryCache limits its memory use when the system comes under memory pressure.

Dependency of CachingHandler on HttpConfiguration

As I announced before, you need to pass the configuration to the CachingHandler on the server. This should be an easy change but a breaking one.
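The change itself is a one-liner (the same snippet appears in the 0.5.0-alpha post below):

// before (0.4): var handler = new CachingHandler(entityTagStore);
var handler = new CachingHandler(config, entityTagStore);
config.MessageHandlers.Add(handler);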


Future roadmap and 0.6

I think we are now at a point where CacheCow requires a re-write for the features I want to introduce. One such feature is enabling content-based ETag generation and cache invalidation. Content-based ETag generation is there and can be used now, but without content-based invalidation it is not much use - especially since it currently generates a weak ETag. Considering the changes required to achieve this, almost a complete re-write is needed.

Please keep the feedback coming. Thanks for using and supporting CacheCow!


Sunday, 13 April 2014

Reactive Cloud Actors: no-nonsense MicroServices

[Level C3]

This post is not directly about MicroServices. If that is why you are reading it, you might as well stop now. Apparently, we are still waiting for the definition to be finally ratified; the definition, as it stands now, is blurry - as Martin Fowler admits. This post is about Actors - the cloud ones, you know. By the time you finish reading it, I hope I have made it clear how Reactive Cloud Actors are the real MicroServices, rather than the admittedly lightweight RESTful Imperative MicroServices.

Watching Fred George deliver an excellent talk on the High-Performance Bus inspired me to start working on the actors. I am still working on a final article on the subject but this post is basically a primer on that - as well as an announcement of the BeeHive mini-framework. The next section on actors is taken from that article, which covers the essential theoretical background. Before we start, let's make it clear that the term Reactive is not used in the context of Reactive Extensions (Rx) or frameworks, only in contrast to imperative (RPC-based) actors. Also, RPC-based is not used in contrast to RESTful; it simply means a system which relies on command and query messages rather than events.

Actors

Carl Hewitt, along with Peter Bishop and Richard Steiger, published an article back in 1973 that proposed a formalism that identified a single class of objects, i.e. Actors, as the building blocks of systems designed to implement Artificial Intelligence algorithms.

According to Hewitt an actor, in response to a message, can:
  1. Send a finite number of messages to other actors
  2. Create a finite number of other actors
  3. Decide on the behaviour to be used for the next message it receives
Any combination of these actions can occur concurrently and in response to messages arriving in any order - as such, there is no constraint with regard to ordering and an actor implementation must be able to handle messages arriving out of band. However, I believe it is best to separate these responsibilities as below.

Processor Actor

In a later description of the Actor Model, the first constraint is re-defined as "send a finite number of messages to the address of other actors". Addressing is an integral part of the model that decouples actors and limits their knowledge of each other to a mere token (i.e. an address). Familiar implementations of addressing include Web Service endpoints, publish/subscribe queue endpoints and email addresses. Actors that respond to a message using the first constraint can be called Processor Actors.

Factory Actor

The second constraint makes actors capable of creating other actors, which we conveniently call Factory Actors. Factory actors are important elements of a message-driven system where an actor consumes from a message queue and creates handlers based on the message type. Factory actors control the lifetime of the actors they create and have a deeper knowledge of them - compared to processor actors knowing a mere address. It is useful to separate factory actors from processing ones - in line with the single responsibility principle.

Stateful Actor

The third constraint defines the Stateful Actor. Actors capable of the third constraint have a memory that allows them to react differently to subsequent messages. Such actors are subject to a myriad of side-effects. Firstly, when we talk about "subsequent messages" we inherently assume an ordering, while as we said there is no constraint with regard to ordering: an out-of-band message arrival can lead to complications. Secondly, all aspects of CAP apply to this memory, making a consistent yet highly available and partition-tolerant implementation impossible to achieve. In short, it is best to avoid stateful actors.

Modelling a Processor Actor

"Please open your eyes, Try to realise, I found out today we're going wrong, We're going wrong" - Cream
[Mind you there is only a glimpse of Ginger Baker visible while the song is heavily reliant on Ginger's drumming. And yeah, this goes back to a time when Eric played Gibson and not his signature Strat]


This is where most of us can go wrong. We do that, sometimes for 4 years - without realising it. This is by no means a reference to a certain project [... cough ... Orleans ... cough] that has been brewing (Strange Brew pun intended) for 4 years and came up with imperative, RPC-based, RESTfully coupled Micro-APIs. We know it, doing simple is hard - and we go wrong, i.e. we do the opposite: build really complex frameworks.

I was chatting away on twitter with a few friends and I was saying "if you need a full-blown and complex framework to do actors, you are probably doing it wrong". All you need is a few interfaces, and some helpers doing the boilerplate stuff. This stuff ain't no rocket science, let's not turn it into one.

The essence of the Reactive Cloud Actor is the interface below (part of BeeHive mini-framework introduced below):
    /// <summary>
    /// Processes an event.
    /// 
    /// Queue name containing the messages that the actor can process messages of.
    /// It can be in the format of [queueName] or [topicName]-[subscriptionName]
    /// </summary>
    public interface IProcessorActor : IDisposable
    {
        /// <summary>
        /// Asynchronous processing of the message
        /// </summary>
        /// <param name="evnt">Event to process</param>
        /// <returns>Typically contains 0-1 messages. Exceptionally more than 1</returns>
        Task<IEnumerable<Event>> ProcessAsync(Event evnt);

    }

Yes, that is all. All of your business can be captured by the universal method above. Don't you believe it? Just have a look at a non-trivial eCommerce example implemented using this single method.

So why Reactive (event-based) and not Imperative (RPC-based)? Because in a reactive actor system, each actor only knows about its own step and what it does itself, and has no clue about the next steps or the rest of the system - i.e. actors are decoupled, leading to an independence which facilitates actor Application Lifecycle Management and DevOps deployment.
As can be seen above, Imperative actors know about their actor dependencies while Reactive actors have no dependency other than the queues, basic data structure stores and external systems. Imperative actors communicate with other actors via a message store/bus and invoke method calls. We have been doing this for years, in different Enterprise Service Bus integrations; this one only brings it down to a micro level, which makes the pains even worse.

So let's bring an example: fraud check of an order.

Imperative
PaymentActor, after a successful payment for an order, calls the FraudCheckActor. FraudCheckActor calls external fraud check systems. After identifying a fraud, it calls CancelOrderActor to cancel the order. So as you can see, PaymentActor knows about and depends on FraudCheckActor. In the same way, FraudCheckActor depends on CancelOrderActor. They are coupled.

Reactive
PaymentActor, upon successful payment, raises the PaymentAuthorised event. FraudCheckActor is one of its subscribers; after receiving this event it checks for fraud and, if one is detected, raises the FraudDetected event. CancelOrderActor subscribes to some events, including FraudDetected, upon receiving which it cancels the order. None of these actors knows about the others. They are decoupled.
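To make that concrete, here is a minimal sketch of what the reactive FraudCheckActor could look like against the IProcessorActor interface above. The ActorDescription attribute name, IFraudCheckService and the PaymentAuthorised/FraudDetected POCOs are illustrative assumptions, not BeeHive API:

[ActorDescription("PaymentAuthorised-FraudCheck")] // attribute name assumed; ties the actor to its topic/subscription
public class FraudCheckActor : IProcessorActor
{
    private readonly IFraudCheckService _fraudCheck; // hypothetical gateway to the external fraud check systems

    public FraudCheckActor(IFraudCheckService fraudCheck)
    {
        _fraudCheck = fraudCheck;
    }

    public async Task<IEnumerable<Event>> ProcessAsync(Event evnt)
    {
        var payment = evnt.GetBody<PaymentAuthorised>(); // PaymentAuthorised/FraudDetected are illustrative POCOs
        var isFraud = await _fraudCheck.CheckAsync(payment.OrderId);

        if (!isFraud)
            return Enumerable.Empty<Event>(); // most of the time zero or one event is returned

        return new[]
        {
            new Event(new FraudDetected { OrderId = payment.OrderId }) { EventType = "FraudDetected" }
        };
    }

    public void Dispose()
    {
    }
}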

So which one is better? By the way, none of this is new - we have been doing it for years. But it is important to identify why we should avoid the first and not to "go wrong".

Reactive Cloud Actors proposal


After categorising the actors, here I propose the following constraints for Reactive Cloud Actors:
  • A reactive system that communicates by sending events
  • Events are defined as a time-stamped, immutable, unique and eternally-true piece of information
  • Events have types
  • Events are stored in a Highly-Available cloud storage queues allowing topics
  • Queues must support delaying
  • Processor Actors react to receiving a single event, do some processing and then send back usually one (sometimes zero and rarely more than one) event
  • Processor Actors have type - implemented as a class
  • Processing should involve minimal number of steps, almost always a single step
  • Processing of the events are designed to be Idempotent
  • Each Processor Actor can receive one or more event types - all of which are defined by its Actor Description
  • Factory Actors are responsible for managing the lifetime of processor actors
  • Actors are deployed to cloud nodes. Each node contains one Factory Actor and can create one or more Processor Actors depending on its configuration. Grouping of actors depends on cost vs. ease of deployment.
  • In addition to events, there are other Basic Data Structures that contain state and are stored in Highly-Available cloud storage (See below on Basic Data Structures)
  • There are no Stateful Actors. All state is managed by the Basic Data Structures and events.
  • This forms an evolvable web of events which can define flexible workflows
Breaking down all the processes into single steps is very important. A Highly-Available yet Eventually-Consistent system can handle delays but cannot easily bind multiple steps into a single transaction.

So how can we implement this? Is this gonna work?

Introducing BeeHive


 BeeHive is a vendor-agnostic Reactive Actor mini-framework I have been working on over the last three months. It is implemented in C# but frankly could be done in any language supporting asynchronous programming (promises), such as Java or node.

The cloud parts have only been implemented for Azure, but implementing them for another cloud vendor is basically a matter of implementing 4-5 interfaces. It also comes with an In-Memory implementation which is only targeted at demos. This framework is not meant to be used as an in-process actor framework.

It implements the Prismo eCommerce example, an imaginary eCommerce system, for both In-Memory and Azure. This example is non-trivial and has some tricky scenarios that have to implement Scatter-Gather sagas. There is also a Boomerang pattern event that turns a multi-step process into regurgitating an event a few times until all steps are done (this deserves another post).

An event is modelled as:

[Serializable]
public sealed class Event : ICloneable
{

    public static readonly Event Empty = new Event();

    public Event(object body)
        : this()
    {        
       ...
    }

    public Event()
    {
        ...
    }

    /// <summary>
    /// Marks when the event happened. Normally a UTC datetime.
    /// </summary>
    public DateTimeOffset Timestamp { get; set; }

    /// <summary>
    /// Normally a GUID
    /// </summary>
    public string Id { get; private set; }

    /// <summary>
    /// Optional URL to the body of the message if Body can be retrieved 
    /// from this URL
    /// </summary>
    public string Url { get; set; }

    /// <summary>
    /// Content-Type of the Body. Usually a Mime-Type
    /// Typically body is a serialised JSON and content type is application/[.NET Type]+json
    /// </summary>
    public string ContentType { get; set; }
        
    /// <summary>
    /// String content of the body.
    /// Typically a serialised JSON
    /// </summary>
    public string Body { get; set; }

    /// <summary>
    /// Type of the event. This must be set at the time of creation of event before PushAsync
    /// </summary>
    public string EventType { get; set; }

    /// <summary>
    /// Underlying queue message (e.g. BrokeredMessage in case of Azure)
    /// </summary>
    public object UnderlyingMessage { get; set; }

    /// <summary>
    /// This MUST be set by the Queue Operator upon Creation of message usually in NextAsync!!
    /// </summary>
    public string QueueName { get; set; }


    public T GetBody<T>()
    {
       ...
    }

    public object Clone()
    {
        ...
    }
}

As can be seen, Body has been defined as a string since BeeHive uses JSON serialisation. This could be made more flexible, but in reality events should normally contain a small amount of data, mainly basic data types such as GUID, integer, string, boolean and DateTime. Any binary data should be stored in Azure Blob Storage or S3 and the path referenced here.
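For illustration, creating and consuming an event could look roughly like this (the PaymentAuthorised POCO is a made-up example; the sketch assumes, as the comments above suggest, that the constructor serialises the body to JSON and GetBody<T> deserialises it):

public class PaymentAuthorised
{
    public Guid OrderId { get; set; }
    public decimal Amount { get; set; }
}

// raising the event (EventType must be set before pushing it to the queue)
var evnt = new Event(new PaymentAuthorised { OrderId = Guid.NewGuid(), Amount = 42.50m })
{
    EventType = "PaymentAuthorised"
};

// consuming it inside an actor
var payment = evnt.GetBody<PaymentAuthorised>();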

BeeHive Basic Data Structures

This is a work-in-progress (but nearly done) part of BeeHive that defines a minimal set of Basic Data Structures (and their stores) to cover all the data needs of Reactive Actors. These structures are defined as interfaces that can be implemented for different cloud platforms. The list as it stands now:
  • Topic-based and simple Queues
  • Key-Values
  • Collections
  • Keyed Lists
  • Counters

Some of these data structures hold entities within the system which should implement a simple interface:

public interface IHaveIdentity
{
    Guid Id { get; }
}

Optionally, entities could be Concurrency-Aware for updates and deletes in which case they will implement an additional interface:

public interface IConcurrencyAware
{
    DateTimeOffset? LastModified { get; }

    string ETag { get; }
}
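For example, a hypothetical Order entity that is both identifiable and concurrency-aware would simply implement both interfaces:

public class Order : IHaveIdentity, IConcurrencyAware
{
    public Guid Id { get; set; }

    // concurrency-awareness: used for optimistic updates and deletes
    public DateTimeOffset? LastModified { get; set; }
    public string ETag { get; set; }

    public Guid CustomerId { get; set; }
    public decimal Total { get; set; }
}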


Prismo eCommerce sample

Prismo eCommerce is an imaginary eCommerce company that receives orders and processes them. The order taking relies on an eventually consistent (and delayed) read model of the inventory, hence orders can be accepted for items that are out of stock. The process waits until all items are in stock (or, if out of stock, until they arrive back in stock) before sending the order to fulfilment.

Prismo eCommerce states (solid), transitions (arrows) and processes (dashed)

This sample has been implemented both In-Memory and for Windows Azure. In both cases, tracing can be used to see what is happening from the outside. In this sample all actors are configured to run in a single worker role, although they could each run in their own role. I might provide a UI to show the status of each order as it goes through the statuses.

Conclusion

Reactive Cloud Actors are the way to implement decoupled MicroServices. By individualising actor definitions (Processor Actor and Factory Actor) and avoiding Stateful Actors, we can build resilient and Highly Available cloud-based systems. Such systems will comprise evolvable webs of events, where each web defines a business capability. I don't know about you, but this is how I am going to build my cloud systems.

Watch this space, the article is on its way.

Saturday, 5 April 2014

Web APIs and n+1 problem

[Level C3]
You are probably familiar with the n+1 problem: a request to load 1 item turns into n+1 requests since the item has n other associated items. The term has mainly been described in the context of Object Relational Mappers (ORMs), where lazy loading of associated records causes additional calls to the RDBMS, resulting in poor performance, locks, deadlocks and timeouts.

TLDR; n+1 problem is a reality of modern APIs. While it can be averted in some cases, you have to be prepared for it. We propose these steps to deal with n+1 problem: 1) Redesign your data  2) Parallelise calls 3) Use async pattern 4) Optimise threading model 5) Implement timeout-retry 6) Design a circuit breaker 7) Build a layered cache

However, this problem is not confined to ORMs. If we stick to the original definition above, we see that it can apply elsewhere - and in fact we can see a rising trend in its occurrence elsewhere. Nowadays, our data is usually served by Web APIs, by composing data from multiple APIs, by using a distributed cache or by reaching out to a NoSQL data store. As such, Web APIs are naturally prone to this problem.

I have recently been battling with the n+1 problem on the server - a Web API serving composed data. Slow responses, thread starvation and inconsistent performance were some of the symptoms, and caused me enough pain to completely re-design the serving layer of the API. In fact I was not the only one experiencing this problem: by looking around a bit I realised there is a trend emerging (for example here).
In this post I cover the background to the problem and propose several patterns and solutions - some of which I successfully integrated into my API layer to alleviate the problem.

n+1 and NoSQL

Every day we are moving further away from RDBMS. The NoSQL movement is well established now, with many companies already using or busy building applications that sit on top of one or more NoSQL stores. Whether it is MongoDB, Cassandra, RavenDB or soft stores such as Redis, or even the cloud variants such as Azure Table Storage or Amazon's Dynamo, it is all moving very fast.

By moving away from RDBMS we have successfully freed ourselves from its restricting schema-bound limitations and are churning out code very quickly. This has allowed for higher agility in our teams. I leave the discussion of the importance of understanding the schema of your data when it truly exists for another post, but it is generally seen as a positive move.

Now what if we need to load a parent with its children? In a traditional RDBMS this is resolved by joining the parent and child tables. In the NoSQL world, you can store the parent along with its children. While this is fine for many scenarios, it does not cover all of them. When the relationship is one of aggregation, an entity has an identity independent of the other entities it is associated with. For example, a school, classroom, subject, teacher and student each have an identity of their own and it is not possible to define a single aggregate root.
One mistake is to turn aggregation into composition. While this sometimes works, it can lead either to very coarse-grained entities (defining the customer with all its orders as the aggregate root) or to unnecessary duplication of data.

Figure 1 - Embedding vs. referencing another document


So some NoSQL document databases provide the means to store a reference to one document in another - for example, MongoDB provides the concept of ObjectID, which is a reference to another document. This is exactly where the n+1 problem arises in NoSQL databases.

n+1 and API Composition

By adopting SOA (and if you add MicroServices to the mix, even more so), our data is well confined within the realm of its own bounded context, and other systems use APIs to access data from other boundaries. We have defined resources at the API layer of our services, and data is comfortably served by RESTful APIs and can be consumed by other bounded contexts.

So in order to provide data useful to the clients, APIs need to aggregate data from multiple services and compose it before handing it to the clients. This is accentuated in the presentation or composition layers (where a service's sole responsibility is to gather data from several APIs, compose it and expose it as a consolidated API), where calls can be to heterogeneous sources. This, by definition, is an n+1 problem.


n+1 and Batch/Bulk APIs

Many APIs support batching requests into a single request. For example, ElasticSearch provides a Bulk API that accepts a batch of operations. While this helps the clients avoid the n+1 problem, the server implementing the batching usually has to solve the n+1 problem itself. And this is usually not solved magically in the back-end data store; it is pure engineering.

If you are providing an API, you are likely to provide a batch API too - for example, allowing a client to provide multiple customer IDs to get a list of customers and display them on the screen. This is where you will hit an n+1.

n+1 and Distributed caching

This is not a very typical case, however it happens more often than you think. If you have a couple of items to pick from the cache it is not such a big problem; however, if you are reading 100s of items, you might want to parallelise the calls. The more you design the site for better performance, the more you push data to the cache, increasing your dependency on it. Using Redis you get a quick response, but with hundreds of items you are looking at a few hundred milliseconds extra just for the calls. And parallelising the calls can increase the risk of thread starvation depending on your implementation. Again, if you need to make many calls to an external cache in order to serve one request, you are experiencing an n+1 problem which sooner or later will manifest itself in your performance metrics.

New patterns needed

n+1 is not always avoidable - building linearly scalable systems can lead to n+1 cropping up here and there, and we need to be prepared. Making multiple requests is usually inevitable and this warrants new patterns to make calls to the servers resilient and optimally performant.

In this post we will look at some of these patterns. Before doing so, we present a model scenario based on familiar eCommerce use cases. For the purpose of this post, let's assume we have a Web API that needs to compose its data by calling other APIs, and that for every request we have to make 10 such calls.

Example scenario

In our example scenario we are consuming services that in most cases behave well but whose calls occasionally take longer. So in 99% of cases the response time is 10-100ms but in 1% it will be 5-25 seconds. We assume the distribution of likelihood within each of these two ranges is even (e.g. the likelihood of 10ms is equal to that of 50ms or 100ms).
If we calculate the average time it takes to make a call, we get a value of roughly 200ms. If we call these services sequentially, on average we are looking at 2 seconds to complete all calls, which is not really acceptable.

Now let's look at calling these services in parallel. The overall call will take as long as the slowest call. We know that a call may take 5-25 seconds in 1% of cases; however, when 10 calls are involved this likelihood increases to roughly 9% (1 - 0.99^10). So we need to do something about this.


Denormalise and build read models


While data modelling is not a Web API problem, when it results in an n+1 case it becomes one - hence we have a close look at this case. If you have a parent-child or other multiple relationship and in order to load an item you have to make more than one query, you have an n+1 problem.

A common pitfall in approaching the data model design is to use a traditional RDBMS mindset and treat every conceptual model as an entity. In many cases, the presentation layer only requires value objects to populate the screens - read models that change with the screen itself.

First of all, many items can simply become value objects or composed entities within the aggregate root. For example, if a blog comment is only a user name and a text and can only be accessed through the blog post, define it as a value object and store it with the blog post. If it requires an entity of its own (say we need to supply a link so that the comment can be shared, and clicking the link shows the blog post and scrolls the page down to the comment position), we define it as an entity but still embed it along with the blog post.
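To make this concrete, a minimal document sketch (illustrative types only, not a prescribed schema) of a blog post with its comments embedded as value objects:

public class BlogPost
{
    public Guid Id { get; set; }
    public string Title { get; set; }
    public string Body { get; set; }

    // comments are embedded with the post: value objects with no identity of their own
    public List<Comment> Comments { get; set; }
}

public class Comment
{
    public string UserName { get; set; }
    public string Text { get; set; }
    public DateTimeOffset PostedAt { get; set; }
}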

What if we need a signed-in user to be able to see all the comments he has left? One option is to keep the embedded comment entity in the Blog Post domain as above, and to have another embedded comment entity in the User domain storing the first few characters of the comment and its link.

There are drawbacks to bloating an aggregate root with embedded child objects. Size can become a constraining factor (e.g. a popular blog post with 1000s of long comments), or the item may have a fluid state and get updated many times (risk of concurrency). These drawbacks need to be weighed against the benefits.


Parallelising calls


This is, in essence, the most important solution to the n+1 problem - although it requires other steps to be in place to work properly. Instead of making n calls sequentially, we parallelise the calls and wait for the successful completion of all of them before returning the data. This approach will make your code notably more complex and fragile unless you abstract out the parallelisation. In the .NET framework you can use Parallel.For to make n parallel calls; however, Parallel.For is not designed for asynchronous operations, so you have to resort to Task.WhenAll and pass it n async operations to run in parallel.
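As a minimal sketch (assuming an HttpClient and the list of URLs we need to compose from), the fan-out looks something like this:

public async Task<string[]> GetAllAsync(HttpClient client, IEnumerable<string> urls)
{
    // start all calls without awaiting them one by one ...
    var tasks = urls.Select(url => client.GetStringAsync(url));

    // ... then await the completion of all of them in parallel
    return await Task.WhenAll(tasks);
}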


Using Async call patterns


It is often said that, in terms of security, your site/service/application is only as secure as its weakest point. This rule also applies to performance: your API/site is at least as slow as the slowest path through your system.

I cannot count the number of times I have seen web sites/APIs grind to a halt due to a transient slow-response glitch in one of the 3rd-party or in-house services. What generally happens is that handling the first few calls is OK, but then you soon run out of threads and requests start queueing in IIS. Responses take longer and longer until the IIS queue gets full (the default limit is 5000), after which you get 503 responses.

Async calls (now supported by many languages, tools and platforms) take advantage of IO Completion Ports (IOCP) and free the worker thread to serve other requests. After the completion of the IO operation, IOCP notifies back so the method resumes its operation where it left off. Any IO-bound operation, including network access (such as reading/writing files, calling other APIs and services, or accessing databases or out-of-process caches), can potentially use IOCP. In the .NET framework, method pairs prefixed with Begin- and End- have historically supported IOCP (for example BeginWrite and EndWrite). Most of these now have a new counterpart which is a single method with the -Async suffix (for example WriteAsync).

Nowadays, with the async/await syntax in C#, writing async code is very easy. To take full advantage of async programming, you need to go async all the way, so that your server code from its entry point to its exit is chained by async calls; otherwise the thread is not freed and, even worse, you could be subjecting yourself to deadlocks.
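In an ASP.NET Web API controller, "async all the way" simply means the action itself is async and awaits the downstream calls rather than blocking on .Result or .Wait() - a sketch, with a hypothetical composition service:

public class CustomerController : ApiController
{
    private readonly ICustomerComposer _composer; // hypothetical downstream composition service

    public CustomerController(ICustomerComposer composer)
    {
        _composer = composer;
    }

    // async action: the worker thread is released while the IO is in flight
    public async Task<IHttpActionResult> Get(int id)
    {
        var customer = await _composer.ComposeAsync(id); // never .Result or .Wait()
        return Ok(customer);
    }
}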


Optimising threading model and network throttles


Using asynchronous patterns and making calls in parallel puts a strain on the threading backbone, which is usually not configured for such heavy use of threads. While asynchronous calls release your threads back to the thread pool, you still need enough threads to initiate the process. Also, in some cases you might have to use non-async libraries in parts of your code.

.NET's ThreadPool is designed to be self-regulating. ASP.NET by default uses a feature called autoConfig, which is set in the processModel element of machine.config. By default it is on and it takes care of network throttling and the minimum and maximum numbers of worker threads and IOCPs. Based on experience, I have found these not to be sufficient for the heavy use of threading proposed in this article. So I would normally turn off autoConfig in the machine.config located at

%windir%\Microsoft.NET\Framework64\v4.0.30319\Config (for a 64-bit machine):
<processModel autoConfig="false"/>

With this change you need to make sure you add these lines at application startup:
ThreadPool.SetMinThreads(250, 250); // a reasonable value but use your own
ServicePointManager.DefaultConnectionLimit = Int16.MaxValue;

The first value in the SetMinThreads call is the minimum number of worker threads and the second parameter is the minimum number of IOCPs. Having said that, the ThreadPool has a habit of regulating itself, so it is normal to see these numbers decrease after a period of inactivity. In fact, setting the values above is not the main reason to turn autoConfig off. The most important reason is to allow for a sharp increase in the number of threads in case of burst activity. With autoConfig on, only a couple of threads are added every second, which is not quick enough to cater for bursts and leads to thread starvation.


Timeout-Retry pattern


While timeout and retry are not a novelty by any means, they are essential for building systems that can deal with transient failures and environment instability, where calls fail or take a long time while a retry succeeds very quickly. It is important to build retry into your commands, but even more important to implement timeout and retry for your queries. Netflix have successfully used aggressively low timeouts in their APIs. One of the reasons behind such an aggressively low timeout is the principle of Fail Fast, and the fact that next time you are very likely to hit a performing server and get the response very quickly. An interesting choice is to include a longer timeout for the last retry.

So let us see how this could improve on the performance we calculated above. With no timeout/retry, the likelihood of a call taking 5-25 seconds was 9%. Let us use 2 retries (3 calls in total) with an initial timeout of 100ms, coupled with a final timeout of 30 seconds for the last retry. With this approach the chance that any of the 10 calls exhausts its quick attempts drops to roughly 0.1% [1 - (1 - 0.01 x 0.01)^10], so we have reduced the likelihood by roughly a factor of 100. Similar to parallelisation, the retry-timeout code needs to be streamlined into a coding pattern so that callers do not have to implement it everywhere.
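A sketch of the pattern under the assumptions of this scenario - short timeouts on the first attempts, a long timeout on the last one, wrapped in a helper so callers do not repeat it (illustrative, not production code):

public static async Task<T> WithTimeoutRetryAsync<T>(
    Func<CancellationToken, Task<T>> operation,
    TimeSpan shortTimeout,
    TimeSpan finalTimeout,
    int retries = 2)
{
    for (int attempt = 0; attempt <= retries; attempt++)
    {
        // the last attempt gets the long timeout; earlier ones fail fast
        var timeout = attempt == retries ? finalTimeout : shortTimeout;
        using (var cts = new CancellationTokenSource(timeout))
        {
            try
            {
                return await operation(cts.Token);
            }
            catch (OperationCanceledException)
            {
                if (attempt == retries)
                    throw; // out of retries
                // timed out: retry, hopefully hitting a faster server this time
            }
        }
    }

    throw new InvalidOperationException("Unreachable");
}

Usage could then be something like await WithTimeoutRetryAsync(ct => client.GetAsync(url, ct), TimeSpan.FromMilliseconds(100), TimeSpan.FromSeconds(30)).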

Table 1 - Likelihood of operation times


While the timeout of 100ms is selected conveniently based on our arbitrary scenario setup, in the real world you would choose a percentile of the response time - here we have chosen the 99th percentile. Choosing a value depends on your own service's defined SLA and response time distribution.


Circuit Breaker


Short timeouts and retries are useful when long responses and errors are due to glitches and transient failures. When a service is generally having a bad time, hammering it with even more requests causes further disruption. A common pattern is to implement a circuit breaker which stops sending requests to that particular service and starts using an alternative means of fulfilling the request. Obviously a circuit breaker is only useful when we have such an alternative - Netflix successfully use a cache local to the service when the circuit breaker is activated. Sometimes it is acceptable to send back data which has some missing parts.

While timeout-retry is a request-level implementation, the circuit breaker is a global switch: once activated for one request, it applies to all requests. The conditions on which a circuit breaker is activated vary, but include receiving a particular error code or timing out on subsequent requests. Circuit breakers can be reverted manually by an admin, but most commonly they have a timeout (for example 15 minutes) and are lifted automatically; if the service is still down, they will be activated again.
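A deliberately simplified sketch of the idea (ignoring thread-safety, and nothing like a full implementation such as Netflix's Hystrix): trip after N consecutive failures, stay open for a fixed period while serving the fallback, then let requests through again:

public class CircuitBreaker
{
    private readonly int _failureThreshold;
    private readonly TimeSpan _openDuration;
    private int _consecutiveFailures;
    private DateTimeOffset _openUntil = DateTimeOffset.MinValue;

    public CircuitBreaker(int failureThreshold, TimeSpan openDuration)
    {
        _failureThreshold = failureThreshold;
        _openDuration = openDuration;
    }

    public async Task<T> ExecuteAsync<T>(Func<Task<T>> operation, Func<Task<T>> fallback)
    {
        if (DateTimeOffset.UtcNow < _openUntil)
            return await fallback(); // circuit open: use the alternative (e.g. a local cache)

        try
        {
            var result = await operation();
            _consecutiveFailures = 0; // a success closes the circuit again
            return result;
        }
        catch (Exception)
        {
            if (++_consecutiveFailures >= _failureThreshold)
                _openUntil = DateTimeOffset.UtcNow.Add(_openDuration); // trip the breaker
            return await fallback();
        }
    }
}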


Layered Caching


As we explained, accessing external caches to load multiple values can cause an n+1 effect. While the response time for accessing a cache is usually trivial, the calls can still impose a strain on the system's threading. Since most servers have an abundant amount of memory, it is possible to use that memory as the first level of the cache. In other words, build a layered cache where a local in-memory cache with a shorter expiry sits in front of a distributed cache with the standard expiry. The cache miss ratio is proportional to the number of servers you have in the same zone: if the number is large, the miss ratio will be high and the local cache is ineffective, but if you have fewer than 10 servers (and certainly fewer than 5) this approach can lead to a measurable improvement in service quality.

Figure 2 - Layered Caching


This pattern can be implemented invisibly to the consumers of the cache so that the distributed cache client is in fact an in-memory cache wrapping a distributed cache. A common expiry for the memory cache is half of the distributed cache's expiry value.
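A rough sketch of the wrapper, assuming a simple cache abstraction of your own (ISimpleCache here is made up, standing in for whatever distributed cache client you use) and System.Runtime.Caching.MemoryCache as the first layer:

public interface ISimpleCache
{
    Task<string> GetAsync(string key);
    Task SetAsync(string key, string value, TimeSpan expiry);
}

public class LayeredCache : ISimpleCache
{
    private readonly ISimpleCache _distributedCache;
    private readonly MemoryCache _localCache = MemoryCache.Default;

    public LayeredCache(ISimpleCache distributedCache)
    {
        _distributedCache = distributedCache;
    }

    public async Task<string> GetAsync(string key)
    {
        var local = _localCache.Get(key) as string;
        if (local != null)
            return local; // first-level hit: no network call at all

        var remote = await _distributedCache.GetAsync(key);
        if (remote != null)
            _localCache.Set(key, remote, DateTimeOffset.UtcNow.AddSeconds(30)); // short local expiry
        return remote;
    }

    public async Task SetAsync(string key, string value, TimeSpan expiry)
    {
        // local expiry is half the distributed expiry, as suggested above
        _localCache.Set(key, value, DateTimeOffset.UtcNow.Add(TimeSpan.FromTicks(expiry.Ticks / 2)));
        await _distributedCache.SetAsync(key, value, expiry);
    }
}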


Conclusion


n+1 is not merely a data problem; it impacts the resilience of the Web APIs serving the data. API composition produces a similar effect, with similarly adverse consequences for the resilience and health of an API. While n+1 can sometimes be averted at the source by more denormalised data modelling, Web APIs need to be equipped with tools and patterns to combat its negative impact and successfully maintain their Quality of Service.

Parallelising calls to APIs, caches and data stores is very important; however, it can put a strain on the server's thread pool, so using async calls and tuning the threading knobs and network throttles is mandatory. On the other hand, using a short timeout plus retry helps when consuming an API with varying response times and transient glitches. Use circuit breakers to handle more permanent failures, but you need to design and build an alternative to the failed service.

This article is humbly dedicated to Adrian Cockcroft and Daniel Jacobson for their inspiring and ground-breaking work.

Saturday, 18 January 2014

Am I going to keep supporting my open source projects?

I have recently been asked by a few whether I will be maintaining my open source projects, mainly CacheCow and PerfIt.

I think since my previous blog post, some have been thinking that I am going to completely drop C#, Microsoft technologies and whatever I have been building so far. There are also some concerns over the future of the Open Source projects and whether they will be pro-actively maintained.

If you read my post again, you will find that I twice stressed that I will keep doing what I am doing. In fact I have lately been doing a lot of C# and Azure - like never before. I am really liking Windows Azure and I believe it is a great platform for building cloud applications.

So in short, the answer is a BIG YES. I just pushed some code to CacheCow and released CacheCow.Client 0.5.0-alpha, which has a few improvements. I have plans for finally releasing 0.5 but it has been delayed due to my engagement with this other project which I am working on - as I said, C# and Azure.

So please keep PRs, Issues, Bugs and wish lists coming in.

With regard to my decision and recent blog post, this is a strategy change, hoping to position myself within 2 years to be able to do non-Microsoft work equally well if I have to. I hope it is clear now!


Saturday, 21 December 2013

Thank you Microsoft, and so long ...

[Level C1]

Ideas and views expressed here are purely my own and not those of my employer or any related affiliation.

Yeah, I know. It has been more than 13 years. I still remember how it started. It was New Year's Eve of the new millennium (yeah, year 2000) and I was standing on the shore of the Persian Gulf in Dubai, with my stomach full of delicious Lebanese grilled food and a local newspaper in hand, looking for jobs. I was a frustrated doctor, hurt by patient abuse and discouraged by the lack of appreciation and creativity in my job. And there was the very small section of medical jobs compared to pages, after pages, after pages of computer jobs. Well, it was the dot-com boom days, you know, it was the bubble - so we know now. And that was it: a life-changing decision to teach myself network administration and get an MCSE certificate.

The funny thing is I never did. I got distracted, side-tracked by something much more interesting. Yeah, programming - you know, when it gets its hooks in you, it doesn't leave you, ever. So it started with Visual Basic, and then HTML, and obviously databases, which started with Microsoft Access. Then you had to learn SQL Server, and not to forget web programming: ASP 3.0. I remember I printed some docs and it was 14 pages of A4. It had everything I needed to write ASP pages, connect to a database, store, search - everything. With those 14 pages I built a website for some foreign journalists working in Iran: IranReporter - it was up until a few years ago. I know it looks so out-dated and primitive, but for a 14-page documentation, not so bad.

Then it was .NET, then the move to C# and the rest of the story - which I won't bore you with. But what I knew was that the days of 14-page documentation were over. You had to work really hard to keep up. ASP.NET WebForms, which I never liked and never learnt. With .NET 3.0 and 3.5 came WCF, WPF and LINQ, and there was Silverlight. And there were ORMs, DI frameworks, unit testing and mocking frameworks. And then there was the Javascript explosion.

And for me, ASP.NET Web API was the big thing. I have spent the best part of the last 2 years learning, blogging, helping out, writing OSS libraries for it and then co-authoring a book on this beautiful technology. And I love it. True, I have not been recognised for it and am not an MVP - and honestly I am not sure what else I could do to deserve to become one - but that is OK; I did what I did because I loved and enjoyed it.

So why Goodbye?

Well, I am not going anywhere! I love C# - I don't like Java. I love Visual Studio - and Eclipse is a pain. I adore ASP.NET Web API. I love NuGet. I actually like Windows and dislike Mac. So I will carry on writing C# and using ASP.NET Web API and Visual Studio. But I have to shift focus. I have no other option, and in the next few lines I will try to explain what I mean.

The reality is that our industry is on the cusp of a very big change. Yes, a whirlwind of change has already happened over the last few years but, well, to put it bluntly, I was blind.

So based on what I have seen so far I believe IN THE NEXT 5 YEARS:

1. All data problems become Big Data problems

If you don't like generalisations you won't like this one either - when I say all I mean most. But the point is industries and companies will either shrink or grow based on their IT agility. The ones shrinking are not worth talking about. The ones growing will have big data problems if they do not have them yet. This is related to the growth of internet users as well as the realisation of the Internet of Things. The other reason is the adoption of Event-Based Architecture, which will massively increase the amount of data movement and data crunching.

Data will be made of what currently is not data. Your mouse moves and clicks are the kind of data you will have to deal with. Or user eye movement. Or input from 1000s of sensors inside your house. And this is the stuff you will be working on day-in, day-out.

Dear Microsoft, Big Data solutions did not come from you. And despite my wishes, you did not provide a solution of your own. I would love to see an alternative attempt and a rival to Hadoop to choose from. But I am stuck with it, so better to embrace it. There are some alternatives emerging (real-time, in-memory) but again they come from the Java communities.

2. Any server-side code that cannot be horizontally scaled is gonna die

Is that even a prediction? We have had web farms for years and have had to think about robust session management all these years. But something has changed. First of all, it is not just server-stickiness: you have to recover from site-stickiness and site failures too. On the other hand, this is again more about data.

So the closer you are to the web and the further from data, the more likely your technology will survive. For example, API-layer technologies are here to stay, so ASP.NET Web API has a future. I cannot say the same for middleware technologies such as BizTalk or NServiceBus. Databases? Out of the question.

3. Data locality will still be an issue so technologies closer to data will prevail

Data will still be difficult to move in the years to come. While our bandwidth is growing fast, our data is growing faster, so this will probably become an even bigger problem. So where your data is will be key in choosing your technologies. If you use Hadoop, using Java technologies will make more and more sense. If you are on Amazon S3, you are more likely to use Kafka than Azure Service Bus.

4. We need 10x or 100x more data scientists and AI specialists

Again not a prediction. We are already seeing this. So perhaps I can sharpen some of my old skills in computer vision.

So what does it mean for me?

As I said, I will carry on writing C# and ASP.NET Web API, read and write from SQL Server, and do my best to write the best code I can. But I will start working on my Unix skills (by using it at home), pick up one or two JVM languages (Clojure, Scala) and work on Hadoop. I will take Big Data more seriously. I mean really seriously... I need to stay close to where innovation is happening, which sadly is not Microsoft right now. From Big Data to Google Glass and cars, it is all happening in the Java communities - and when I say Java communities I do not mean the language but the ecosystem around it. And I will be ready to jump ship if I have to.

And I still wish Microsoft would wake up to the sound of its shrinking, not only in the PC market but also in its bread and butter, the enterprise.

PS: This is not just Microsoft's problem. It is also a problem with our .NET community. The fight between Silverlight/XAML and Javascript took so many years. In the community that I am in, we waste so much time fighting over OWIN. I have had enough; I know how best to use my time.

Sunday, 1 December 2013

CacheCow 0.5.0-alpha released for community review: new features and breaking changes

[Level T3]

CacheCow 0.5 was a long time coming but with everything that was happening, I am already 1 month behind. So I have now released it as an alpha so I can get some feedback from the community.

As some of you know, a few API improvements had been postponed due to breaking changes. On the other hand, there were a few areas I really needed to nail down - the most important one being resource organisation. While resource organisation can be anything, the most common approach is to stick with the standard Web API routing, i.e. "/api/{controller}/{id}" where api/{controller} is a collection resource and api/{controller}/{id} is an instance resource.

I will go through some of the new features and changes and would really appreciate it if you spend a bit of time reviewing them and sending me feedback.

No more dependency on WebHost package

OK, this had been asked for by Darrel and Tugberk and it only made sense. In version 0.4 I added a dependency on HttpConfiguration but, in order not to break compatibility, it would use GlobalConfiguration.Configuration from WebHost by default. This has been removed now, so an instance of HttpConfiguration needs to be passed in. This is a breaking change, but fixing it should be a minimal change in your code - from:
var handler = new CachingHandler(entityTagStore);
config.MessageHandlers.Add(handler);
to:
var handler = new CachingHandler(config, entityTagStore);
config.MessageHandlers.Add(handler);

CacheKeyGenerator is gone

CacheKeyGenerator provided some flexibility for generating the CacheKey from the resource and Vary headers. The reality is, there is no need for an extensibility point here: this is part of the framework that needs to stay internal. At the end of the day, the Vary headers and the URL of the resource are what matter for generating the CacheKey, and that is implemented inside CacheKey. Exposing CacheKeyGenerator had led to some misuse and confusion surrounding its use cases, and it had to go.

RoutePatternProvider is reborn

A lot of work has been focused on optimising RoutePatternProvider and its possible use cases. First of all, the signature has been changed to receive the whole request rather than bits and pieces (the URL and the values of the Vary headers as a SelectMany result).

So the work basically looked at the meaning of the Cache Key. We have two connected concepts: representation and resource. A resource can have many representations (XML, image, JSON, different encodings or languages). In terms of caching, representations are identified by a strong ETag while a resource will have a weak ETag, since it cannot be guaranteed that all its representations are byte-by-byte equal - and in fact they are not.

A resource and its representations

It is important to note that each representation needs to be cached separately; however, once the resource is changed, the cache for all its representations is invalidated. This was the fundamental piece missing in the CacheCow implementation and has been added now. Basically, IEntityTagStore has a new method: RemoveResource. This method is pivotal and gets called when a resource is changed.

RoutePatternProvider is no longer just a delegate; it is now an interface: IRoutePatternProvider. The reason for moving away from a delegate is that there is a new method in there too: GetLinkedRoutePatterns. So what is a RoutePattern?

A RoutePattern is basically similar to a route and its definition conventions in ASP.NET; however, it is meant for resources and their relationships. I will write a separate post on this, but basically RoutePattern relates collection resources and instance resources. Let's imagine a car API hosted at http://server/api. In this case, http://server/api/car refers to all cars while http://server/api/car/123 refers to the car with Id of 123. In order to represent these, we use the * sign for collection resources and the + sign for instances:

http://server/api/car         collection         http://server/api/car/*
http://server/api/car/123     instance           http://server/api/car/+  

Now normally:

  • POST to collection invalidates collection
  • POST to instance invalidates both instance and collection
  • PUT or PATCH to instance invalidates both instance and collection
  • DELETE to instance invalidates both instance and collection

So in brief, change to collection invalidates collection and change to instance invalidates both.

In a hierarchical resource structure this can get even more complex. If /api/parent/123 gets deleted, /api/parent/123/* (the collection) and /api/parent/123/+ (the instances) will most likely need to be invalidated too, since they no longer exist. However, implementing this without making some major assumptions is not possible, hence its implementation is left to CacheCow users to tailor for their own API.

However, since a flat resource organisation (defining all resources using "/api/{controller}/{id}") is quite common, I have implemented IRoutePatternProvider for flat structures in ConventionalRoutePatternProvider, which is the default.

Prefixing SQL Server Script names with Server_ and Client_

The SQL Server EntityTagStore (in the server package) and CacheStore (in the client package) had some stored procedure name clashes, preventing the use of the same database for both the client and server components. This was requested by Ben Foster and has been done now.

Please provide feedback

I hope this makes sense and I am waiting for your feedback. Please ping me on twitter or ask your questions on GitHub by raising an issue.

Happy coding ...

Thursday, 21 November 2013

HTTP Content Negotiation on higher levels of media type

[Level T3]

TLDR; [This is a short post, but anyhow] You can now achieve content negotiation on the Domain Model or even the Version (levels 4 and 5) in the same ASP.NET Web API controller.

As I explained in the previous post, there are real opportunities in adopting 5LMT (the Five Levels of Media Type). On one hand, the semantic separation of all these levels of media type helps with the organic development of a host of clients with different capabilities and needs. On the other hand, it solves some existing challenges of REST-based APIs.



In the last post we explored how we can have a Roy-correct resource organisation (URLs I mean, of course) and take advantage of content-based action selection which is not natively possible in ASP.NET Web API.

Content negotiation is an important aspect of HTTP: the dance between client and server to negotiate the optimum media type in which to get a resource. The most common implementation of content negotiation is the server-side one, in which the server decides the format based on the client's preferences expressed by the content of the Accept, Accept-Language, Accept-Encoding, etc. headers.

For example, the page you are viewing is probably the result of this content negotiation:

Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8

This basically means: give me HTML or XHTML if you can (q=1.0 is implied here); if not, XML should be OK; and if not, I can live with the image/webp format - which is Google's experimental media type, as I was using Chrome.

So, let's have a look at media types expressed above

text/html  -> Format
application/xhtml+xml -> Format/Schema
image/webp  -> Format

So as we can see, content negotiation for this request mainly deals with format and schema (levels 2 and 3). This level of content negotiation comes out of the box in ASP.NET Web API: the responsibility for this server-side content negotiation falls on MediaTypeFormatters and, more importantly, IContentNegotiator implementations. The latter is a simple interface taking the type to be serialised, the request and a list of formatters, and returning a ContentNegotiationResult:

IContentNegotiator

public interface IContentNegotiator
{
    ContentNegotiationResult Negotiate(Type type, HttpRequestMessage request, 
        IEnumerable<MediaTypeFormatter> formatters);
}

This is all well and good. But what if we want to ask for SomeDomainModel or AnotherDomainModel (level 4)? Or, even more complex, ask for a particular version of the media type (level 5)? Unfortunately ASP.NET Web API does not provide this feature out of the box; however, using 5LMT this is very easy.

Let's imagine you have two different representations of the same resource - perhaps one was added recently and can be used by some clients. This new representation can be a class with added properties or, even more likely, with some existing properties removed, which results in a breaking change. How would you approach this? Using 5LMT, we add a new action to the same controller returning the new representation:

public class MyController
{
    // old action
    public SomeDomainModel Get(int id)
    {
        ...
    }

    // new action added
    public AnotherDomainModel GetTheNewOne(int id)
    {
        ...
    }
}
And the request that returns each of them will be based on the value of the Accept header:
GET /api/My
Accept: application/SomeDomainModel+json
or to get the new one:
GET /api/My
Accept: application/AnotherDomainModel+json
In order to achieve this, all you have to do is to replace your IHttpActionSelector in the Services of your configuration with MediaTypeBasedActionSelector.
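Assuming MediaTypeBasedActionSelector has a parameterless constructor, that replacement is a one-liner in your WebApiConfig:

config.Services.Replace(typeof(IHttpActionSelector), new MediaTypeBasedActionSelector());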

This is the non-canonical representation of the media type. You may also use the 5LMT representation, which I believe is superior:
GET /api/My
Accept: application/json;domain-model=SomeDomainModel
or to get the new one:
GET /api/My
Accept: application/json;domain-model=AnotherDomainModel

In order to use this feature, get the NuGet package:
PM> Install-Package AspNetWebApi.FiveLevelsOfMediaType
and then add this line to your WebApiConfig.cs:
config.AddFiveLevelsOfMediaType();
MediaTypeBasedActionSelector class provides two extensibility points:

  • domainNameToTypeNameMapper, which is a delegate that receives the name passed in the domain-model parameter and returns the class name of the type to be returned (if they are not the same, although it is best to keep them the same)
  • Another one which deserves its own blog post, allows for custom versioning.

Happy coding...