Friday, October 17, 2014

Using CC.NET & Gallio for priority based smoke testing.

Pitfalls in Production

Being able to monitor production services for potential errors is critical, especially if the services depend on external resources that may become unavailable or exhibit unexpected behavior. Even if you follow good software development discipline, this is always a source of concern. Think network outages, unannounced third-party API changes, hosted services becoming unavailable, etc.

For large software projects, a testing strategy that involves unit and integration tests helps manage the complexity of the commit-to-deployment workflow. Functional/smoke tests are also good for ensuring critical functionality works as expected. However, in an environment where running your software depends on external resources, you also need a system of continuous monitoring that runs these smoke tests.

Monitoring Confusion

At Xignite, we use Gallio for our testing framework and CC.NET for our continuous integration. I used these tools for my production smoke tests but soon realized that not all tests were equal. Getting paged at 2am for a test failure that is not mission critical sucks. Even worse, these lower priority failures can mask very high priority ones, since any failure causes the entire test fixture to fail, and the dev-ops person then has to scrutinize the results to ensure that a high priority failure doesn't fall through the cracks.

Consider the following test fixture and how it lumps all the different tests together.

namespace MyServices.Tests.Smoke
{
   using Gallio.Framework;
   using Gallio.Model;
   using MbUnit.Framework;

   [TestFixture]
   public class OneOfMyServicesTests
   {
      [SetUp]
      public void Setup()
      {
         // Setup an individual test to run.
      }

      [FixtureSetUp]
      public void TestFixtureSetUp()
      {
         // Configure fixture for testing.
      }

      [FixtureTearDown]
      public void TestFixtureTearDown()
      {
         if (TestContext.CurrentContext.Outcome == TestOutcome.Failed)
         {
            // Send signal to monitoring system that test fixture has failed.
         }
         else if (TestContext.CurrentContext.Outcome == TestOutcome.Passed)
         {
            // Send signal to monitoring system that test fixture has succeeded.
         }
         else
         {
            // Handle some other outcome.
         }
      }     

      [Test]
      public void MissionCritical()
      {
         // ...
      }

      [Test]
      public void Important()
      {
         // ...
      }

      [Test]
      public void MinorFunctionality()
      {
         // ...
      }     
   }
}

Any test failure will cause the entire context outcome to be a failure, whether it was the mission critical test or the test that affects minor functionality. I tried looking through the Gallio/MbUnit API, even its source, but couldn't find a way to determine which tests failed within a fixture. If anyone knows how to determine this, please let me know.

Prioritized Testing

What I do know though is that you can inherit the TestAttribute class and override its Execute method. I created a required parameter to specify the priority of the test and then used a LoggedTestingContext class to store all the results.

namespace MyServices.Tests
{
   using System;
   using Gallio.Framework.Pattern;
   using MbUnit.Framework;

   public class LoggedTestAttribute
      : TestAttribute
   {
      public const int MinPriority = 1;
      public const int MaxPriority = 3;

      private readonly int priority;
      public int Priority { get { return this.priority; } }

      public LoggedTestAttribute(int priority)
      {
         if (priority < MinPriority || priority > MaxPriority)
         {
            throw new ArgumentException("Priority must be 1, 2, or 3.", "priority");
         }
         this.priority = priority;
      }

      protected override void Execute(PatternTestInstanceState state)
      {
         try
         {
            base.Execute(state);
            LoggedTestingContext.AddTest(this, state.Test, true);
         }
         catch (Exception)
         {            
            LoggedTestingContext.AddTest(this, state.Test, false);
            throw;
         }
      }
   }
}
namespace MyServices.Tests
{
   using System;
   using System.Collections.Generic;
   using System.Linq;
   using Gallio.Framework;
   using Gallio.Model.Tree;

   public static class LoggedTestingContext
   {
      private class TestFailure
      {
         public string FullName { get; private set; }

         public LoggedTestAttribute TestAttribute { get; private set; }

         public TestFailure(LoggedTestAttribute testAttribute, Test test)
         {
            this.FullName = test.FullName;
            this.TestAttribute = testAttribute;
         }
      }

      private const int PriorityCount = LoggedTestAttribute.MaxPriority - LoggedTestAttribute.MinPriority + 1;
      
      private static readonly Dictionary<string, TestFailure> nameToFailure = new Dictionary<string, TestFailure>();

      internal static void AddTest(LoggedTestAttribute testAttribute, Test test, bool passed)
      {         
         if (passed)
         {
            return;
         }
         var failure = new TestFailure(testAttribute, test);
         if (!nameToFailure.ContainsKey(failure.FullName))
         {
            nameToFailure.Add(failure.FullName, failure);
         }
      }

      private static bool HasFailed(Test fixtureTest, int priority)
      {
         return fixtureTest.Children
            .Any(c =>
               nameToFailure.ContainsKey(c.FullName) &&
               nameToFailure[c.FullName].TestAttribute.Priority == priority);
      }

      public static void LogSmokeTests(Test fixtureTest, string serviceName)
      {     
         foreach (var priority in Enumerable.Range(LoggedTestAttribute.MinPriority, PriorityCount))
         {
            if (HasFailed(fixtureTest, priority))
            {
               // Send signal to monitoring system that test fixture has failed for priority # tests.
            }
            else
            {
               // Send signal to monitoring system that test fixture has succeeded for priority # tests.
            }
         }
      }
   }
}
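
To make the per-priority bookkeeping concrete without depending on Gallio, here is a minimal, self-contained sketch. TestResult and PrioritySummary are hypothetical names used only for illustration; they are not part of Gallio or MbUnit, and a real run would collect the results from the test attribute rather than build them by hand.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in for one recorded test outcome; not a Gallio type.
public sealed class TestResult
{
   public string Name { get; private set; }
   public int Priority { get; private set; }
   public bool Passed { get; private set; }

   public TestResult(string name, int priority, bool passed)
   {
      this.Name = name;
      this.Priority = priority;
      this.Passed = passed;
   }
}

public static class PrioritySummary
{
   // Returns, for each priority in [minPriority, maxPriority], whether any
   // test of that priority failed. Priorities with no tests report false.
   public static IDictionary<int, bool> FailedByPriority(
      IEnumerable<TestResult> results, int minPriority, int maxPriority)
   {
      var list = results.ToList();
      return Enumerable
         .Range(minPriority, maxPriority - minPriority + 1)
         .ToDictionary(
            p => p,
            p => list.Any(r => r.Priority == p && !r.Passed));
   }
}
```

Given one passing priority 1 test, one passing priority 2 test and one failing priority 3 test, only priority 3 is reported as failed, so a monitoring system could page someone for priority 1 and merely log priority 3.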

Finally, putting it all together, I replaced the TestAttribute with the new LoggedTestAttribute and processed the results in the test fixture teardown.

namespace MyServices.Tests.Smoke
{
   using Gallio.Framework;
   using Gallio.Model;
   using MbUnit.Framework;

   [TestFixture]
   public class OneOfMyServicesTests
   {
      [SetUp]
      public void Setup()
      {
         // Setup an individual test to run.
      }

      [FixtureSetUp]
      public void TestFixtureSetUp()
      {
         // Configure fixture for testing.
      }

      [FixtureTearDown]
      public void TestFixtureTearDown()
      {
         LoggedTestingContext.LogSmokeTests(TestContext.CurrentContext.Test, "OneOfMyServices");
      }     

      [LoggedTest(1)]
      public void MissionCritical()
      {
         // ...
      }

      [LoggedTest(2)]
      public void Important()
      {
         // ...
      }

      [LoggedTest(3)]
      public void AffectsFunctionalityButDoesntRequireImmediateAttention()
      {
         // ...
      }     
   }
}


Monday, March 17, 2014

State behavioral pattern to the rescue.

The Creeping Problem

I recently found myself developing a request-response style system where the lifetime of a request could be interrupted at any moment. For most single-process execution, like your average desktop application, this is rarely a concern; it arises more often when dealing with multiple coordinated processes. My case ended up being the latter.

One way to make such a system fault tolerant is to split the steps of the request-response workflow into isolated, atomic units, or states. This way, if a step fails, it can always be re-executed without repeating the work that came before it. This is especially helpful when the total resources required are large and the probability of failure is higher. We can just divvy up the work into states that act like idempotent functions. What follows is a great simplification of the actual project I worked on; I wanted to boil it down to its simplest form, so I collapsed the excess states into the CreateResponse state.

In my original implementation, I modeled the requests as queued items (UserRequest) that I would dequeue and start work on.

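while (requests.Count > 0) // requests is a Queue<UserRequest>
{
   // Dequeue explicitly; a Queue cannot be modified while enumerating it with foreach.
   var request = requests.Dequeue();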
   switch (request.State)
   {
      case State.ReceiveRequest:
         if (TryReceiveRequest(request)) request.State = State.CreateResponse;
         break;
         
      case State.CreateResponse:
         if (TryCreateResponse(request)) request.State = State.SendResponse;
         break;

      case State.SendResponse:
         if (TrySendResponse(request)) request.State = State.ResponseSent;
         break;

      case State.ResponseSent:
         break;

      case State.Faulted:

      default:
         throw new ArgumentOutOfRangeException("request.State");
   }
   if (request.State != State.ResponseSent && request.State != State.Faulted)
      requests.Enqueue(request);
}

Seems simple enough, but in my case the CreateResponse state ended up being fairly computationally intensive and could take anywhere from a few seconds to several minutes. These long delays could be due to the workload of remote processes it was waiting on, or to transient failure points like the network or even the system the process was running on. Another added complexity was that these requests were being serviced in parallel by multiple processes that might or might not be on the same system. Lastly, actual production level code never ends up being this simple; you quickly find yourself adding a lot of instrumentation and covering edge cases.

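while (requests.Count > 0) // requests is a Queue<UserRequest>
{
   // Dequeue explicitly; a Queue cannot be modified while enumerating it with foreach.
   var request = requests.Dequeue();
   logger.LogDebug("request.Id = {0}: Request dequeued in state {1}.", request.Id, request.State);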
   switch (request.State)
   {
      case State.ReceiveRequest:
         logger.LogDebug("request.Id = {0}: Trying to receive request.", request.Id);
         if (TryReceiveRequest(request)) request.State = State.CreateResponsePart1;
         break;
         
      case State.CreateResponsePart1:
         logger.LogDebug("request.Id = {0}: Trying to create response for part 1.", request.Id);
         if (TryCreateResponsePart1(request)) request.State = State.CreateResponsePart2;
         break;

      case State.CreateResponsePart2:
         logger.LogDebug("request.Id = {0}: Trying to create response for part 2.", request.Id);
         if (TryCreateResponsePart2(request))
         {
            request.State = State.CreateResponsePart3;
         }
         else
         {
            request.State = State.CreateResponsePart1;
            ExecuteCreateResponsePart2Cleanup();
            logger.LogError("request.Id = {0}: Unexpected failure while evaluating create response part 2.", request.Id);
         }
         break;

      case State.CreateResponsePart3:
         logger.LogDebug("request.Id = {0}: Trying to create response for part 3.", request.Id);
         bool unrecoverable;
         if (TryCreateResponsePart3(request, out unrecoverable))
         {
            request.State = State.SendResponse;
         }
         else
         {
           if (unrecoverable)
           {
               logger.LogError("request.Id = {0}: Failure is unrecoverable, faulting request.", request.Id);
               request.State = State.Faulted;
           }
           else
           {
               request.State = State.CreateResponsePart2;
           }
         }
         break;

      case State.SendResponse:
         logger.LogDebug("request.Id = {0}: Trying to send response.", request.Id);
         if (TrySendResponse(request)) request.State = State.ResponseSent;
         break;

      case State.ResponseSent:
         break;

      case State.Faulted:
         logger.LogCritical("request.Id = {0}: Request faulted.", request.Id);
         break;

      default:
         throw new ArgumentOutOfRangeException("request.State");
   }
   logger.LogDebug("request.Id = {0}: Request transitioned to state {1}.", request.Id, request.State);
   if (request.State != State.ResponseSent && request.State != State.Faulted)
   {
      logger.LogDebug("request.Id = {0}: Re-enqueuing request for further evaluation.", request.Id);
      requests.Enqueue(request);
   }
   else
   {
      logger.LogDebug("request.Id = {0}: Request evaluation is complete, not re-enqueuing.", request.Id);
   }
}

What a mess! This code quickly gets bloated. In addition, not every unsuccessful state evaluation should be considered exceptional; maybe the state is polling another process and can't transition until that process is ready. As the result of each state evaluation grows beyond a simple yes/no (true/false), we end up with a state machine that can have multiple transitions out of each state. This makes for ugly code and too much coupling: all the state evaluation logic is in the same class and you have this huge switch statement. We could get around the extensive logging by using dependency injection, but what do we inject? There is no consistent call site signature to inject into. The ever-growing case statements could each be extracted into their own method, but then readability suffers. This sucks.

You may be saying, "Well you obviously could solve it by ..." and I would agree with you. This code ugliness could be solved in many different ways, and is intentionally crap for the purpose of this post. The major problems I faced were:

  • A large state machine object and large code blocks.
  • Lack of symmetry in state handling.
  • Multiple method postconditions that couldn't be expressed by the boolean return result alone.
  • Coupling of state transition logic, business logic and diagnostics.

I knew something was wrong but I wasn't quite sure how to solve it without adding more complexity to the system and allowing readability to suffer. As someone who has had to spend hours reading other people's unreadable code, I didn't want to commit the same sin.

Looking For A Solution

In university, they teach you how to be a good computer scientist; you learn complexity analysis, synthetic languages and the theoretical underpinnings of computation. However, none of this really prepares you to be a software engineer. I could concoct my own system, but why do this when I can stand on the shoulders of giants?

I had often read or heard references to the Gang of Four book, had even listened to talks by the original authors, and had become familiar with some of the more famous patterns (Factory and Singleton come to mind). Maybe there was a solution in there. I can't be the first one to come across this simple design problem. And there I found it: the State design pattern.

The design is pretty simple. You have a context that is used by the end user, and the states themselves, which are wrapped by the context. The context can have a range of methods that behave differently based on the concrete state type in use at that moment (e.g. the behavior of a cursor click in a graphics editor). I modified this design, using a single method to abstract the workflow and act as a procedural agent for processing multiple state machines.
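
For reference, the unmodified pattern can be sketched with the graphics editor example. This is a minimal illustration, not code from the project; ITool, SelectionTool, BrushTool and Editor are hypothetical names.

```csharp
using System;

// Each tool is a state: it decides how a click behaves.
public interface ITool
{
   string OnClick(int x, int y);
}

public sealed class SelectionTool : ITool
{
   public string OnClick(int x, int y)
   {
      return string.Format("Select object at ({0}, {1})", x, y);
   }
}

public sealed class BrushTool : ITool
{
   public string OnClick(int x, int y)
   {
      return string.Format("Paint at ({0}, {1})", x, y);
   }
}

// The context exposes a single Click method whose behavior
// depends on whichever state (tool) is currently active.
public sealed class Editor
{
   private ITool tool = new SelectionTool();

   public void Use(ITool newTool)
   {
      if (newTool == null) throw new ArgumentNullException("newTool");
      this.tool = newTool;
   }

   public string Click(int x, int y)
   {
      return this.tool.OnClick(x, y);
   }
}
```

The caller only ever talks to Editor; swapping the tool changes what Click does without a switch statement anywhere.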

The Code

The first step was to construct a state object that will be the super type to all of my concrete states.

public abstract class StateBase   
{
   // Let the concrete type decide what the next transition state will be.
   protected abstract StateBase OnExecute();  

   public StateBase Execute()
   {
      // Can add diagnostic information here.
      return this.OnExecute();   
   }   
}

Next I need a context class that can create and run the state machine.

public abstract class StateContextBase
{
   private StateBase state;   

   protected abstract StateBase OnCreate();
   protected abstract StateBase OnExecuted(StateBase nextState);
   protected abstract bool OnIsRunning(StateBase state);

   public StateContextBase(StateBase state)
   {
      this.state = state;
   }

   public StateContextBase Execute()
   {
      // Need to create the state machine from something.
      if (this.state == null)
      {
         // We will get to this later.
         this.state = this.OnCreate();
      }
      // Let the concrete context decide what to do after a state transition.
      this.state = this.OnExecuted(state.Execute());
      return this;
   } 
 
   public bool IsRunning()
   {
      // Have the concrete type tell us when it is in the final state.
      return this.OnIsRunning(this.state);
   }
}

Glossing over the details for now, what will this look like at the application's entry point?

class Program
{
   static void Main(string[] args)
   {
      // Will need to get it from somewhere but won't worry about this for now.
      var requests = Enumerable.Empty<StateContextBase>();

      // Can be changed to false on an exit call.
      var running = true;
      while (running)
      {    
         requests = requests
            .Where(r => r.IsRunning())
            .Select(r => r.Execute())
            .ToList(); // Force evaluation; LINQ queries are lazy and would otherwise never run.
      }
   }
}

That is beautiful! All I see is the state machine decider logic and I don't even need to be concerned with what type of state machines are running.
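
One caveat with a loop built from Where and Select is that LINQ queries are deferred: nothing executes until the sequence is enumerated, so the state machines only advance if something (such as ToList) forces evaluation on each pass. The sketch below uses a plain counter in place of a state machine to show the difference; DeferredExecutionDemo is a hypothetical name.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class DeferredExecutionDemo
{
   // Returns how many times the Select lambda ran without
   // enumeration (Item1) and with ToList (Item2).
   public static Tuple<int, int> Run()
   {
      var items = new[] { 1, 2, 3 };

      var lazyCalls = 0;
      // The query is only described here; the lambda has not run yet.
      var lazy = items.Select(i => { lazyCalls++; return i; });
      var callsWithoutEnumeration = lazyCalls;

      var eagerCalls = 0;
      // ToList enumerates the query immediately, so the lambda runs once per item.
      var eager = items.Select(i => { eagerCalls++; return i; }).ToList();

      return Tuple.Create(callsWithoutEnumeration, eagerCalls);
   }
}
```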

So let's dive into the details. First, there is the loading of the state into memory. We have to get this from somewhere, so let's add another abstraction on top of our StateBase super type: something that can be persisted in case the process crashes and can be accessed across many systems (e.g. a database).

In my case, I used the Entity Framework ORM, which is based off of the unit of work and repository design patterns. There is a context (DataContext) that I will get my model object (UserRequest) from to figure out the current state. A unique key (UserRequest.Id : Guid) will be used to identify the persisted object. We won't concern ourselves as to why this is just unique and not an identity key (that could be in another post) but it basically comes down to the object's initial creation at runtime not relying on any persistence store for uniqueness.

public class DataContext
   : System.Data.Entity.DbContext
{
   public DbSet<UserRequest> UserRequests { get; set; }

   public DataContext()
      : base("name=DataContext")
   {
   }
}
public abstract class PersistedStateBase<TEntity>
   : StateBase
   where TEntity : class
{
   private Guid id;

   protected abstract StateBase OnExecuteCommit(DataContext context, Guid id, TEntity entity);
   protected abstract TEntity OnExecuteCreate(DataContext context, Guid id);
   protected abstract StateBase OnExecuteRollback(DataContext context, Guid id, TEntity entity);

   public PersistedStateBase(Guid id)
   {
      this.id = id;
   }

   protected override StateBase OnExecute()
   {
      // Also consider exceptions thrown by DataContext.
      StateBase nextState = this;
      using (var context = new DataContext())
      {
         TEntity entity = null;
         try
         {
            entity = this.OnExecuteCreate(context, this.id);
            nextState = this.OnExecuteCommit(context, this.id, entity);
            context.SaveChanges();
         }
         catch (Exception ex)
         {
            // Handle exception.
            nextState = this.OnExecuteRollback(context, this.id, entity);
         }
      }
      return nextState;
   }
}

The model object (UserRequest, our entity type) will hold the state as an enumeration (UserRequest.State) and contain all the data needed for processing through the state machine.

public enum UserRequestState
{
   None = 0,  
   Receive = 1,  
   CreateResponse = 3,
   SendResponse = 4,
   ResponseSent = 5,
   Faulted = -1,  
}
[DataContract]
public class UserRequest
{
   [DataMember]
   public Guid Id { get; private set; }
   [DataMember]
   public UserRequestState State { get; private  set; }

   // Other properties here like the location of the user request and other metadata.

   private UserRequest()  // Required by EF to create the POCO proxy.
   {}

   public UserRequest(Guid id, UserRequestState state)
   {
      this.Id = id;
      this.State = state;
   }
}

Now let's implement our first state using the types we have created.

public class ReceiveState
   : PersistedStateBase<UserRequest>
{
   public ReceiveState(Guid id)
      : base(id)
   {}
  
   protected override StateBase OnExecuteCommit(DataContext context, Guid id, UserRequest entity)
   {      
      var successful = false;
      var faulted = false;
      // Receive user request and decide whether successful, unsuccessful with retry or
      // unrecoverable/faulted. 
      if (successful)
      {
         return new CreateResponseState(id);
      }
      else
      {
         return faulted ? new FaultedState(id) : this;
      }
   }

   protected override UserRequest OnExecuteCreate(DataContext context, Guid id)
   {
      // Get model object
      return context.UserRequests.Find(id);
   }

   protected override StateBase OnExecuteRollback(DataContext context, Guid id, UserRequest entity)
   {
      // Rollback any changes possibly made in the OnExecuteCommit method and attempt recovery,
      // if possible, in this method. For this example, we will just return the current state.
      return this;
   }
}

We also need to make our state context concrete with the type below. This tends to have more wiring, since type-per-state doesn't really translate well to an ORM. This class could be greatly simplified with attributes on the state types, designating the enumeration value they map to.

public class UserRequestContext
   : StateContextBase
{
   private static Dictionary<Type, UserRequestState> typeToDbState;
   private static bool databaseRead = false;

   public Guid Id { get; private set; }

   static UserRequestContext()
   {
      databaseRead = false;
      typeToDbState = new Dictionary<Type, UserRequestState>()
      {
         { typeof(ReceiveState), UserRequestState.Receive },
         { typeof(CreateResponseState), UserRequestState.CreateResponse},
         { typeof(SendResponseState), UserRequestState.SendResponse},
         { typeof(ResponseSentState), UserRequestState.ResponseSent},
         { typeof(FaultedState), UserRequestState.Faulted },
      };
   }

   public UserRequestContext(Guid id)
      : base(null)
   {
      this.Id = id;
   }

   public static IEnumerable<Guid> GetRunningIds()
   {
      if (UserRequestContext.databaseRead)
      {
         var ids = Enumerable.Empty<Guid>(); // Get from message queue.    
         return ids;
      }
      else
      {
         using (var dataContext = new DataContext())
         {
            var ids = dataContext.UserRequests
               .Where(u => 
                  u.State != UserRequestState.ResponseSent &&
                  u.State != UserRequestState.Faulted)
               .Select(u => u.Id)
               .ToArray(); // Force evaluation.

            UserRequestContext.databaseRead = true;

            return ids;
         }
      }
   }

   protected override bool OnIsRunning(StateBase state)
   {
      // ResponseSentState and FaultedState are the two final states.
      return !(state is ResponseSentState || state is FaultedState);
   }

   protected override StateBase OnCreate()
   {
      using (var dataContext = new DataContext())
      {
         // Maps persisted state enumeration to runtime types.
         var entity = dataContext.UserRequests.Find(this.Id);
         switch (entity.State)
         {
            case UserRequestState.Receive:
               return new ReceiveState(this.Id);

            case UserRequestState.CreateResponse:
               return new CreateResponseState(this.Id);

            case UserRequestState.SendResponse:
               return new SendResponseState(this.Id);

            case UserRequestState.ResponseSent:
               return new ResponseSentState(this.Id);

            case UserRequestState.Faulted:
               return new FaultedState(this.Id);

            default:
               throw new ArgumentOutOfRangeException();
         }
      }
   }

   protected override StateBase OnExecuted(StateBase nextState)
   {
      // Run any other deciding logic in here that is independent 
      // of the states themselves (eg. logging, perf counters).

      return nextState;      
   }
}
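
The attribute idea mentioned above could look roughly like the following. This is only a sketch under assumptions: PersistedAsAttribute, StateMap and the Demo types are hypothetical stand-ins, kept separate from the real state types so the example compiles on its own.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Reflection;

public enum DemoState { Receive = 1, Faulted = -1 }

// Hypothetical attribute that tags a state type with its persisted enum value.
[AttributeUsage(AttributeTargets.Class, Inherited = false)]
public sealed class PersistedAsAttribute : Attribute
{
   public DemoState State { get; private set; }

   public PersistedAsAttribute(DemoState state)
   {
      this.State = state;
   }
}

[PersistedAs(DemoState.Receive)]
public sealed class DemoReceiveState { }

[PersistedAs(DemoState.Faulted)]
public sealed class DemoFaultedState { }

public static class StateMap
{
   // Scans the given assembly and builds the enum-to-type lookup
   // from every class carrying the attribute.
   public static IDictionary<DemoState, Type> Build(Assembly assembly)
   {
      return assembly.GetTypes()
         .Select(t => new
         {
            Type = t,
            Tag = (PersistedAsAttribute)Attribute.GetCustomAttribute(
               t, typeof(PersistedAsAttribute))
         })
         .Where(x => x.Tag != null)
         .ToDictionary(x => x.Tag.State, x => x.Type);
   }
}
```

With a map like this, both the hand-written dictionary and the switch in OnCreate could in principle be replaced by a lookup followed by Activator.CreateInstance, at the cost of some reflection and less compile-time checking.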

Finally, let's come full circle and show what the application entry point will look like once all is said and done.

class Program
{
   static void Main(string[] args)
   {
      var requests = Enumerable.Empty<StateContextBase>();

      var running = true;
      while (running)
      {         
         requests = requests
            .Where(r => r.IsRunning())
            // Append new user requests found.
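            .Concat((IEnumerable<StateContextBase>)UserRequestContext
               .GetRunningIds()
               .Select(i => new UserRequestContext(i)))
            .Select(r => r.Execute())
            .ToList(); // Force evaluation; LINQ queries are lazy and would otherwise never run.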
      }
   }
}

Ahhhh Yeah...

After fleshing it all out, I really got that satisfying feeling you get as a software engineer when you know that you made the right design decisions. I isolated my business logic into its own types (e.g. ReceiveState), separated it from the state machine transition logic, added symmetrical handling of persistence logic by layering it on top of the state type (PersistedStateBase) and contained the persistence-runtime bridge (from UserRequest to PersistedStateBase subtypes) in its own type (UserRequestContext). If I want to add more states, I can simply add to the model's state enumeration (UserRequestState) and update the state context (UserRequestContext). If I want to change the transition logic, all I need to do is go to the concrete state type itself (e.g. ReceiveState) and feel confident that my variables are all scoped correctly. No coupling, no excessive mutations and no excessive side effects.
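
To see the moving parts run end to end without a database, here is a stripped-down, in-memory sketch of the same idea. MiniState, ReceiveMini, DoneMini and MiniDriver are hypothetical names, and the simulated failure count stands in for real transient failures.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public abstract class MiniState
{
   // Returns the next state; returning this means "not ready, retry later".
   public abstract MiniState Execute();

   public virtual bool IsFinal { get { return false; } }
}

public sealed class ReceiveMini : MiniState
{
   private int failuresLeft;

   public ReceiveMini(int failuresBeforeSuccess)
   {
      this.failuresLeft = failuresBeforeSuccess;
   }

   public override MiniState Execute()
   {
      if (this.failuresLeft > 0)
      {
         this.failuresLeft--;
         return this; // Simulated transient failure: stay in this state.
      }
      return new DoneMini();
   }
}

public sealed class DoneMini : MiniState
{
   public override MiniState Execute() { return this; }

   public override bool IsFinal { get { return true; } }
}

public static class MiniDriver
{
   // Executes every machine once per pass until all reach a final state,
   // and returns the number of passes that took.
   public static int RunToCompletion(IEnumerable<MiniState> initial)
   {
      var running = initial.ToList();
      var passes = 0;
      while (running.Count > 0)
      {
         running = running
            .Select(s => s.Execute())
            .Where(s => !s.IsFinal)
            .ToList();
         passes++;
      }
      return passes;
   }
}
```

The driver never inspects which concrete state it is holding; a machine that fails twice simply stays in the list for two extra passes.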

Using The Right Tool

This design pattern isn't for every state machine problem. In simple cases, it can definitely be overkill; you can see that a fair bit of starter code is needed. However, if you find yourself designing a state machine with multiple outbound transitions and final states, this could be the right modified pattern for you.
