Roger Ngo's Website

Personal thoughts about life and tech written down.

Building a Video Streaming Service

Note: This is a very long article and has been written over the course of a month. I try to come back regularly to fix any grammar and spelling issues. I also try to rephrase some of the things I write. Explaining is hard, and I hope things can be understood.

Table of Contents

Motivation

A really popular question that seems to go around is "How would you design Netflix?" Normally, in a 45-minute discussion, there really isn't enough time to go over specific implementation details. What usually results from such a discussion is a very high-level design of the overall system architecture.

I have personally always wondered: what would happen if we actually attempted to implement a "Netflix", or more generically, an on-demand video streaming service from scratch?

I have always had this temptation, but never the "fire" to actually get started until I began reading a few books about cloud computing. Another crazy thought that had always been on my mind was to experiment with a cloud service without much regret about the financial resources that would be spent learning.

So, I decided to commit to two things:

  1. Commit to implementing a basic on-demand video streaming service with adaptive streaming capabilities.
  2. Leverage the offerings from a particular cloud service to make it happen! (And keep it reasonably affordable.)

Some of the questions I had in mind before undertaking this project were:

  • What type of infrastructure would it take to implement such a service?
  • What are some of the unique design decisions that might occur while trying to deliver video content efficiently?
  • Would the cloud provider which I choose ultimately affect any of my design and implementation decisions?

Additionally, I was quite inspired by Dianne Marsh's talk on Continuous Delivery at Netflix, shown at InfoQ. One of the points she discussed which hit home for me was that organizations tend to use cloud computing as a remote data center, where virtual machines are just spun up to host an application. A lot of organizations tend to stray away from what the cloud really provides: a set of services to continuously improve your application -- the ability to develop and deploy to production at a higher velocity and create resilient infrastructure on the cheap, which then redirects energy and focus toward actually developing the product that delivers business value.

Because of the inspiration above, you will find that I have biased myself with a specific cloud service provider. I will be transparent and state that this provider will be Microsoft Azure.

Also, this article was written in bits and pieces. As you read along you may wonder: "What the hell is this article about?" To tell you the truth, I originally started this as a hobby project, but it has changed over time. The aspiration for philo is to become a teaching tool and template project for those who want to begin learning about microservices, and how to leverage cloud computing to create interesting services and applications. If you disagree or find the tone of the writing off-putting, please provide me with some feedback and suggest how I can make it all better!

So, where to find the code? Well, the code is all here! https://github.com/urbanspr1nter/philo.

I am currently trying my best to keep it up to date. You can follow along as our journey progresses. Check-ins happen frequently -- almost every day. Because of the velocity of the development and the nature of it being a side project while I have a day job, you will be unsurprised to see that there are plenty of areas within the codebase that scream "technical debt". You will see plenty of things such as: lazy coding, lots of technical debt, then commits that kill technical debt, followed by more technical debt incurred through laziness and hardcoding things... then undoing hard-coded things and finally actually having some presentable code. It is the nature of software development and I am not shy to let it all out!

Requirements

In order to keep the scope of the project reasonable, I will define, very early, a small set of requirements that the video streaming service will have. I have divided the implementation into two domains: the viewer-focused domain, and the administrator-focused domain.

  • Viewer Focused
    • Allow on-demand video streaming of the content that is readily available.
    • Adaptive streaming of content. The quality of the video should adjust to the current available bandwidth of the viewer.
    • Hands-off discovery (dynamic listing) of new content made available by the backend services.
  • Administrator Focused
    • Ability to push new content to the video streaming platform.
    • Content should be made to support adaptive streaming.
    • Metadata regarding the video should be kept up to date.

philo, a Video Streaming Service

Sometimes the hardest thing about starting a new project is giving it a name. When deciding on a name, I wanted the project name to be memorable and simple -- yet still define what the project is trying to achieve.

In keeping with my growing interest in history, I decided to name the project philo. The origin of the name comes from one of the key people involved with the invention of the television: Philo Farnsworth.

The name is very appropriate due to television being the ultimate bootstrap of all the multimedia content we now consume day-to-day online.

Choosing a Cloud Provider

One of the earliest decisions I made regarding the high level architecture was that I wanted the system to be composed of services that would interact with each other in a decoupled manner. Naturally, composing the system as a set of microservices was appropriate.

I felt the scope of the project was small enough that I would be able to create a few microservices and learn about some of the benefits and drawbacks of choosing to build a system composed of microservices right from the get-go.

I also wanted to make sure that I would invest my personal budget into a single cloud provider that could provide me the following:

  • Content Delivery Network (CDN) of multimedia assets.
  • Cheap storage for assets generated by both the user-facing and administrative domains.
  • Easy management of web application services. (Polyglot services, containerized deployments, etc)
  • Client secrets and key management.
  • Video encoding services.
  • Asynchronous messaging capability.
  • Affordability!

I ultimately chose Microsoft Azure due to the offerings available and, admittedly, the bias of being a current Microsoft employee. The options provided by Azure were affordable, and the libraries for each of the services I wanted to develop with were easy enough to get started with.

Having worked with cloud services from various providers in the past, I knew that Azure offered the basics: a CDN (Azure CDN), storage (Azure Storage Account), key management (Azure Key Vault), and web applications (Azure Web App Service).

What was really worrying me was whether or not Microsoft Azure actually had a readily available solution with regards to video encoding services. Thankfully, after some quick research, I found out that Azure Media Services provided the services I needed.

Without going too much into detail (you can read the offering information on the website), the service provides:

  • Encoding and processing video on the cloud.
  • Providing a media player integration to the media services account.
  • Delivering video from a streaming endpoint through a CDN optimized for multimedia content.
  • Invoking service operations through API calls with .NET/Java libraries.

Reading through some of the media services documentation, I realized that I had two versions of the service to choose from: a v2 API, and a newer v3 API.

Of course, the v3 API seemed more interesting to work with, and since the documentation seemed to be written around the v3 API entirely, I decided to go with that version.

Languages and Tooling

As for the choice of languages, I was energized early on to learn Spring and take my Java skills to the next level. However, experience told me that not all libraries are created equal. Although I saw that Azure had provided both .NET and Java libraries for the Media Services offering, I was a bit skeptical in seeing what the Java support was really like. (It turns out my skepticism proved to be correct.)

Looking at the MVN repository information, the latest v3 API implementation was 0.9.8, dating back to October 2017. This led me down an even more skeptical path after visiting the GitHub repository and finding out that the version corresponded to the 2018-03-30-PREVIEW API, with the 2018-06-01-PREVIEW implementation still in development.

This was also the newest available library for Java implementing the v3 API. With not much sample code floating around to reference for Java, I quickly had to decide whether to develop the specific service that would interact with Azure Media Services in Java, or in .NET.

The method to decide was to simply create a test service that would satisfy the core requirements:

  • Encode a video
  • Get the streaming URLs for the output

Anyway, to paint a picture of what happened, I basically spent a day getting to know the API concepts and libraries, and implementing the above requirements through the Java library. I found it to be insufficient and hard to use. I eventually chucked the whole idea of writing the service in Java due to the primitive API implementation available, paired with the lack of real documentation.

I then found a basic code example that would actually perform the operations I wanted, and it was implemented in C#. At that point I had made the first implementation decision: go with .NET for the encoding service simply because of:

  1. Better documentation around the .NET libraries.
  2. Code examples and ease of implementation.

I was also readily willing to go with .NET for implementing the encoding service because I had reasoned with myself that I could simply write the entry point that kicks off the service and sends the jobs to Azure in any other language or technology. As it turns out, and as will be discussed later, this was necessary anyway.

With that hard decision finally made, I compromised with myself that I would get to play with whatever exciting technology becomes appropriate down the line for this project. After all, with microservices, we can definitely do that.

Now, let's introduce our first service...

Service: philo-media-encoder

I always like to work on solving the most mysterious parts of a project first. Naturally, this became the media encoding service that serves as the main central work-horse of the system.

I really wanted philo-media-encoder to take a video file and encode it to various bitrates to create an adaptive streaming experience similar to the process used by Netflix and YouTube. A viewer never has to adjust the quality of the video based on the currently available bandwidth; instead, the service itself will deliver the video content at just the right bitrate for the available bandwidth to create a seamless viewing experience.

I decided that the philo-media-encoder architecture will consist of four main components to encode input media for adaptive streaming.

  1. philo-media-encoder
  2. philo-media-encoder-producer
  3. philo-media-encoder-consumer
  4. philo-media-service

First, there are pre-conditions that must hold before any of the steps within the encoding workflow can execute. The precondition is that the input asset media files must be located either in Azure storage or behind some publicly accessible URL. In the latter case, this could be a CDN.

To start, the name of the service will be philo-media-encoder-producer. The basic functionality of this service is that it will receive a message with a payload that will contain two pieces of information: the name of the input asset, and the location on where to find the asset.

The input asset name comes from an asset that is already available in Azure through the CDN or a storage account. The input asset name is the name of the asset among the available assets in Azure Media Services.

This producer service receives the name and location and produces a message that will be placed into the message queue backed by Azure Service Bus. The message will be placed asynchronously, and the producer service will immediately return a response back to the requester.

Each message takes on the following format:

{
"name": "My Video.mp4",
"location": "https://path.to.my.cdn.azureedge.net/asset-container/my_video.mp4"
}         
    

On the other side of the service bus, a consumer service, philo-media-encoder-consumer is running continuously waiting to consume messages that are placed into the service bus queue. Upon receipt of the message, an invocation to the philo-media-encoder module is made. Note that the philo-media-encoder-consumer, and the philo-media-encoder module sit on the same instance.

When the consumer service receives the message from the service bus, the following occurs:

  1. A philo-media-encoder object is created and the parameters are passed to start the media encoding job on Azure Media Services.
  2. After each step, which is performed synchronously, metadata about the job is returned. The metadata can include the input ID that correlates to the input asset, the output asset, and streaming URL information, which are placed into the persistent store for cataloging and querying by any client-side application.

The consumer service is decoupled from the encoder class simply because we want the consumer service to not be blocked by an encoding job. By instantiating the object that will synchronously perform the encoding job, we can process many messages without waiting for the previous job to finish.
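
To make the consumer side concrete, below is a minimal sketch of what the message handler registration might look like with the Microsoft.Azure.ServiceBus library. The connection string, queue name, class names, and the commented-out hand-off to the encoder module are placeholders of my own, not the exact philo implementation.

using System;
using System.Text;
using System.Threading;
using System.Threading.Tasks;
using Microsoft.Azure.ServiceBus;

namespace PhiloMediaEncoderConsumer {
    public class ConsumerHost {
        // Placeholder values -- in philo these would come from configuration or Key Vault.
        private static IQueueClient queueClient =
            new QueueClient("<service-bus-connection-string>", "<queue-name>");

        public static void Start() {
            // Register a handler that fires whenever a message arrives on the queue.
            var options = new MessageHandlerOptions(OnException) {
                AutoComplete = false,
                MaxConcurrentCalls = 1
            };
            queueClient.RegisterMessageHandler(OnMessage, options);
        }

        private static async Task OnMessage(Message message, CancellationToken token) {
            // The payload is the JSON produced by philo-media-encoder-producer.
            string body = Encoding.UTF8.GetString(message.Body);

            // Hypothetical hand-off: pass the name/location along to the encoder module.
            // new MediaEncoder(body).Start();

            // Mark the message as processed so it is removed from the queue.
            await queueClient.CompleteAsync(message.SystemProperties.LockToken);
        }

        private static Task OnException(ExceptionReceivedEventArgs args) {
            Console.WriteLine($"Message handler error: {args.Exception.Message}");
            return Task.CompletedTask;
        }
    }
}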

Once the information needed to pass to Azure is known, the flow the encoder is responsible for will then be:

  1. Encode the media file given by the location.
  2. Make the assets public through Azure storage.
  3. Create the streaming URLs which reference the assets within Azure storage.

At each step along the way, the consumer class is responsible for using the metadata returned by philo-media-encoder to invoke an external service call to philo-media-service, which I will introduce later, to store the information to be queried later by client-side applications. But for now, think of philo-media-service as the main orchestrator for manipulating/managing metadata about a particular output from Azure Media Services in the overall philo ecosystem so that it can be used for video stream delivery.

In order to keep track of the status of the media service job, philo-media-encoder will also periodically poll for the job state.

States which define that the job is not complete are:

  • Scheduled
  • Queued
  • Processing
  • Canceling

States which define that job is complete are:

  • Finished
  • Error
  • Canceled
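
To make the polling concrete, here is a minimal sketch of what the wait loop might look like against the v3 .NET SDK (Microsoft.Azure.Management.Media). The client, resource group, account, transform, and job names are assumed to be supplied by the caller; this is an illustration rather than the exact philo implementation.

using System;
using System.Threading.Tasks;
using Microsoft.Azure.Management.Media;
using Microsoft.Azure.Management.Media.Models;

public static class JobPoller {
    // Polls Azure Media Services until the job leaves the "not complete" states above.
    public static async Task<Job> WaitForJobAsync(
        IAzureMediaServicesClient client,
        string resourceGroup, string accountName,
        string transformName, string jobName) {

        Job job;
        do {
            // Wait a little between polls to avoid hammering the management API.
            await Task.Delay(TimeSpan.FromSeconds(20));

            job = await client.Jobs.GetAsync(resourceGroup, accountName, transformName, jobName);
        }
        while (job.State == JobState.Scheduled
            || job.State == JobState.Queued
            || job.State == JobState.Processing
            || job.State == JobState.Canceling);

        // At this point the state is Finished, Error, or Canceled.
        return job;
    }
}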

The expectation is that after running philo-media-encoder on a given asset, we should have multiple versions of the same asset available to be delivered for streaming situated in our storage account. For example, if we were to input a 1920x1080 MP4 video file, we should get video files along the lines of: 1280x720, 640x360, 320x180, etc.
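
For reference, in Azure Media Services v3 these multi-bitrate outputs come from encoding against a transform. Below is a sketch of creating one with the built-in AdaptiveStreaming preset via the v3 .NET SDK; I am assuming that preset is what a setup like philo's would use, and the names passed in are placeholders.

using System.Threading.Tasks;
using Microsoft.Azure.Management.Media;
using Microsoft.Azure.Management.Media.Models;

public static class TransformSetup {
    // Ensures a transform exists that uses the built-in AdaptiveStreaming preset,
    // which produces the ladder of bitrates/resolutions described above.
    public static async Task<Transform> EnsureTransformAsync(
        IAzureMediaServicesClient client,
        string resourceGroup, string accountName, string transformName) {

        var outputs = new TransformOutput[] {
            new TransformOutput(
                new BuiltInStandardEncoderPreset(EncoderNamedPreset.AdaptiveStreaming))
        };

        // CreateOrUpdate is idempotent, so calling it before every job submission is safe.
        return await client.Transforms.CreateOrUpdateAsync(
            resourceGroup, accountName, transformName, outputs);
    }
}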

With that said, here is the overall high-level system architecture for the encoding workflow.

Figure 1. The philo-media-encoder service.

Let us start with a basic flow of the above.

First, there is some arbitrary invocation call that passes an HTTP POST message to the philo-media-encoder-producer application. This contains a JSON payload defining the name and location of the asset. The payload received during this request is similar to the message that is actually produced by the producer service itself.

After the producer constructs the message to be passed to the service bus, the producer also immediately returns an HTTP OK response back to the requester. At the same time, the consumer is listening for any messages in the service bus.

When the consumer has received a message from the service bus, it then uses that information to start the philo-media-encoder job through object initialization.

The encoder job then interacts with both Azure Media Services and philo-media-service to orchestrate the job state, output asset metadata and streaming URL construction and persistence.

Why a Service Bus?

A question that may come up now is why I made the decision to use a service bus rather than just having an entry point directly to the philo-media-encoder service with a point-to-point invocation. The decision comes from the fact that direct entry points through some sort of call are synchronous. With calls that result in processes that can be long running, synchronous invocation can lead to timeouts. With video encoding, almost all jobs are long-running processes and need a significant amount of time for completion.

For example, if we were to provide an entry point directly to the philo-media-encoder service through HTTP REST, upon invocation of the POST call, the job will be kicked off. An indeterminate amount of time will pass before receiving a response back from the job. By then, depending on the scale of the job, the calls will time out before the client has any chance to know whether or not something happened.

Therefore, we need to take client timeouts into consideration and introduce asynchronous requests. By providing a service bus between the client requests and responses, we can essentially deliver a response back to the client much faster by notifying them that the job is being processed, and will take some amount of time before the process is done. The request handler is simply responsible for delivering the information needed to start the encoder job to the service bus and just needs to return the response.
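
As a rough sketch of what that request handler's responsibility boils down to (using the Microsoft.Azure.ServiceBus library; the connection string, queue name, and class name are placeholders of my own):

using System.Text;
using System.Threading.Tasks;
using Microsoft.Azure.ServiceBus;
using Newtonsoft.Json;

public class EncodeRequestProducer {
    // Placeholder values -- philo would read these from configuration or Key Vault.
    private readonly IQueueClient queueClient =
        new QueueClient("<service-bus-connection-string>", "<queue-name>");

    // Drops the encode request onto the queue and returns immediately;
    // the long-running encoding job is picked up later by the consumer.
    public async Task EnqueueAsync(string assetName, string assetLocation) {
        var payload = JsonConvert.SerializeObject(new {
            name = assetName,
            location = assetLocation
        });

        await queueClient.SendAsync(new Message(Encoding.UTF8.GetBytes(payload)));
    }
}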

A response can be anything in that case. It can be a notification, or something that will invoke an external service call, etc., to check the status of the job. This alleviates the need for the client to "busy-wait" in a sense.

By avoiding point-to-point communication, we not only avoid network request timeouts, but we also gain loose coupling between services: they do not all need to be online at once in order to function.

Logging

Before we dive deeper into our implementation, I would like to pause for a bit to talk about logging! With the anticipation of having many decoupled services all functioning asynchronously and in a distributed fashion, it is easy to lose track of what is going on as a result of the temporal distortion in execution.

In order to maintain some sanity in debugging services which are live, one should invest in basic logging within the infrastructure being built. In the case of philo, let us use our producer, consumer, and encoder services as examples of how we should go about building a logging infrastructure. If you are not really familiar with basic logging practices, and before we start going down the technical debt minefield, now is a good chance to read a thing or two about high performance logging :).

The goal is not to create anything spectacular, but merely something that will work within our application architecture, and give ourselves something to extend and build upon should our service scale even further.

Since we are using dotnet core as our platform for philo-media-encoder-producer, philo-media-encoder-consumer, and philo-media-encoder, let us use NLog as our primary logging framework. The basic fixture will not be anything super complex. In fact, let us aim for something simple for now -- we want to just log to a file, and then have it delivered to an indexing service. (We will use the ELK stack for that.)

Each service will have a Logger class which essentially configures NLog to set up a very basic logging component for the service. For example, here is the philo-media-encoder-producer Logger class:

using System;
using System.Collections.Generic;
using System.Linq;

namespace PhiloMediaEncoderProducer {
public class Logger {
    private static NLog.Logger thisLogger;

    private static String GetDateString() {
        DateTime currentDateTime = DateTime.Now;
        String month = currentDateTime.Month < 10 ? "0" + currentDateTime.Month.ToString() : currentDateTime.Month.ToString();
        String day = currentDateTime.Day < 10 ? "0" + currentDateTime.Day.ToString() : currentDateTime.Day.ToString();
        String year = currentDateTime.Year.ToString();

        return $"{year}{month}{day}";
    }

    private static NLog.Logger ConfigureNLog() {
        var config = new NLog.Config.LoggingConfiguration();
        var logfile = new NLog.Targets.FileTarget("logfile") {
            FileName = $"logs/{Logger.GetDateString()}.producer.log",
            ArchiveAboveSize = 4000000,
            MaxArchiveFiles = 10
        };

        var logconsole = new NLog.Targets.ConsoleTarget("logconsole");
        config.AddRule(NLog.LogLevel.Debug, NLog.LogLevel.Fatal, logfile);
        config.AddRule(NLog.LogLevel.Debug, NLog.LogLevel.Fatal, logconsole);

        NLog.LogManager.Configuration = config;

        return NLog.LogManager.GetCurrentClassLogger();
    }

    private static bool NeedsDateRotation() {
        String currentDateString = Logger.GetDateString();

        String[] files = System.IO.Directory.GetFiles($"logs");
        foreach(String file in files) {
            System.IO.FileInfo currentLogFile = new System.IO.FileInfo(file);
            if(currentLogFile.Name.StartsWith(currentDateString, StringComparison.CurrentCulture)) {
                return false;
            }
        }

        return true;
    }

    public static NLog.Logger GetLogger() {
        if(Logger.thisLogger == null || Logger.NeedsDateRotation()) {
            Logger.thisLogger = Logger.ConfigureNLog();
        }

        return Logger.thisLogger;
    }
}
}

Basically, each service will write a log that rotates based on file size to the filesystem. The format of the log name is YYYYMMdd.service.log. For example, in the context of the philo-media-encoder-consumer service, a sample log file would be named 20180802.consumer.log. By invoking Logger.GetLogger(), we acquire an NLog logging object which we can use throughout the program to log into our flat file.

The following shows sample usage:

static NLog.Logger logger = Logger.GetLogger();

void AnActionMethod() {
    logger.Debug("This will be a debug output!");

    try {
        // ...
    } catch(Exception ex) {
        // Pass the exception object so NLog records the stack trace too.
        logger.Error(ex, "This is an error!");
    }
}

If we inspect the log file, we will get output similar to the following in our YYYYMMdd.service.log file:

2018-08-03 00:03:22.4114|DEBUG|PhiloMediaEncoderProducer.Logger|This will be a debug output!

In summary, we will mainly need to keep the following in mind for all our services when it comes to logging:

  • .log files being written to the file system and ensuring that new log files are created when a certain file size is hit. For example, 4 MB.
  • Initializing the logger only once per run and reusing the logger object to write to our log.
  • Appropriate log levels set as to not create too much noise during program execution.
  • Logs should be created with the expectation that the log files will eventually be uploaded to an external store. In our case, a blob container in Azure Storage.

With the last point above, we need to think about bringing logs together so that we do not have so many lying around in the filesystem of each web service. Additionally, over time, retaining the logs at those locations will become very expensive as they begin to eat away at disk space.

In order to solve this issue, we will now need to write a little tool that will actually upload the logs periodically to a persistent storage account. We can simply use Azure Storage for this and upload our logs to a blob storage container. We will call this the LogIngester, and each service will have a sidecar process that will run periodically and upload the remaining logs within the filesystem up to cloud storage.
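
A minimal sketch of what such an uploader might look like with the .NET Azure Storage SDK (WindowsAzure.Storage) is below. The connection string, container name, and the assumption that rotated logs live in a local logs directory are placeholders.

using System.IO;
using System.Threading.Tasks;
using Microsoft.WindowsAzure.Storage;
using Microsoft.WindowsAzure.Storage.Blob;

public class LogIngester {
    // Uploads every rotated .log file in the local logs directory to blob storage.
    public static async Task UploadLogsAsync(string connectionString, string containerName) {
        CloudStorageAccount account = CloudStorageAccount.Parse(connectionString);
        CloudBlobClient blobClient = account.CreateCloudBlobClient();
        CloudBlobContainer container = blobClient.GetContainerReference(containerName);
        await container.CreateIfNotExistsAsync();

        foreach (string path in Directory.GetFiles("logs", "*.log")) {
            CloudBlockBlob blob = container.GetBlockBlobReference(Path.GetFileName(path));
            await blob.UploadFromFileAsync(path);
        }
    }
}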

For folks who are familiar with syslog, you can do that too.

Now, what should we do with all the logs? We should aggregate them all into a searchable distributed logging system, of course! Using the ELK stack, we will be able to create a decent log management platform for our services to make maintenance easier. We can set up a Bitnami ELK Stack VM on Azure and then configure it to ingest the logs from Azure Storage using the logstash-input-azureblob plugin found here.

Figure 2. Logging infrastructure.

Given the log line format above, we can configure our Logstash pipeline to be:

input {
    azureblob {
            storage_account_name => "[STORAGE_ACCOUNT_NAME]"
            storage_access_key => "[STORAGE_ACCESS_KEY]"
            container => "[STORAGE_CONTAINER_NAME]"
            codec => line
    }
}
filter {
    mutate {
            gsub => ["message", "\|", " "]
    }
    grok {
            patterns_dir => ["./patterns"]
            match => ["message", "%{LOGSTRING_DATE:timestamp} %{LOGLEVEL:priority} %{LOGSTRING_SOURCE:source} %{DATA:data}"]
    }
    date {
            match => ["timestamp", "YYYY-MM-dd HH:mm:ss.SSSS"]
    }
}
output {
    elasticsearch {
            hosts => ["127.0.0.1:9200"]
    }
}
    

We are also using custom patterns with the above grok filter, so let us define those within our custom patterns directory:

LOGSTRING_DATA [.+?(?=\n)]
LOGSTRING_SOURCE [\w]+[\.][\w]+
LOGSTRING_DATE [\d]{4}[-][\d]{2}[-][\d]{2}[\s][\d]{2}[:][\d]{2}[:][\d]{2}[\.][\d]{4}
    

Now, basically the flow for log management is:

  1. Logs are transported to Azure Storage from all services.
  2. A VM on Azure running the ELK stack will then import the logs.
  3. Elasticsearch will index the logs, making them available on the Kibana dashboard.
  4. Now, logs are available on Kibana!

Later on, when the services are ready to be tied together, a good thing to do is to leverage a gateway service such as Netflix Zuul to attach a global correlation context ID to the headers of each HTTP request as it passes through each service. Each service can then read the correlation context ID from the header and use it as the ID for the log entry.

The advantage gained here is that now through our distributed log aggregation service that we have set up, we can tie all logs together and create the coherent flow needed to trace actions relating to a specific request in its context.
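
As a sketch of the service side of this idea -- assuming an X-Correlation-ID header name and NLog's mapped diagnostics context, both of which are my own assumptions rather than something Zuul or philo mandates -- a small piece of ASP.NET Core middleware could look like this:

using System;
using System.Threading.Tasks;
using Microsoft.AspNetCore.Builder;
using Microsoft.AspNetCore.Http;

public static class CorrelationMiddleware {
    // Header name is an assumption; use whatever the gateway attaches.
    private const string HeaderName = "X-Correlation-ID";

    public static IApplicationBuilder UseCorrelationId(this IApplicationBuilder app) {
        return app.Use(async (context, next) => {
            string correlationId = context.Request.Headers[HeaderName];
            if (string.IsNullOrEmpty(correlationId)) {
                correlationId = Guid.NewGuid().ToString();
            }

            // Make the ID available to every log entry written while handling this request.
            NLog.MappedDiagnosticsLogicalContext.Set("CorrelationId", correlationId);

            await next();
        });
    }
}

In Startup.Configure, app.UseCorrelationId() would be registered before app.UseMvc(), and the NLog file target layout could then include ${mdlc:item=CorrelationId} so that the ID shows up on every log line.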

Key Management

As the number of individual, distinct services increases within our application ecosystem, credential management becomes very important to the overall security of the microservices architecture we have decided to put in place.

To illustrate my point, there are already several services in philo-media-encoder which require references to the various cloud services. This results in a lot of sensitive information being stored within the application settings and/or configuration files floating around. In small projects, having credentials located within the individual services and maintaining the entries may be doable, but it is not scalable. As the number of services grows, it becomes cumbersome to manage the sensitive settings bound to each application.

A good illustration of what happens without credential management, and of the effects of having to manually update credentials, is when something like a database connection string changes. An administrator would ultimately have to ensure that all services which reference a particular database have the most up-to-date credentials sitting within the application. Not only is this a poor way to manage security settings, it is also not very secure in practice. Things can leak and things will leak.

That is why key management is so important. We can have a master key or certificate which can retrieve a specific credential which is then stored in a secure manner remotely. All common credentials to our services are consolidated. This not only gives us a "source" of truth when it comes to secrets retrieval, but also allows greater ease of updating credentials should we invalidate one.

For our video streaming service, I will use Azure Key Vault to store not only the credentials, but also some of the other settings that may change over time, such as the database names themselves. Now, to be explicitly clear, having a key management store does not fortify an application architecture against all security threats. A key management system's greatest advantage is that it allows credentials to be invalidated as needed, and the master certificate or key used to reach into the values stored in the key management store can itself be invalidated, without having to update all the application settings themselves.

Figure 3. Key Vault flow.

The overall flow is detailed below:

  1. Each application service makes a request with the master credentials, in this case the Azure Active Directory credentials, to Azure Key Vault.
  2. Azure will authenticate and grant access to the vault.
  3. From here, the application service can retrieve secrets from Key Vault given the secret identifier.

For our .NET service, we will use the Azure Key Vault libraries available on NuGet. For our Java service, which I will go over soon, we can use the Java library hosted on MVN: https://mvnrepository.com/artifact/com.microsoft.azure/azure-keyvault/1.1
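
To illustrate the three steps above, here is a minimal sketch of retrieving a secret with the .NET Key Vault library and ADAL. The Azure AD client ID/secret, the vault URL, and the SecretProvider class itself are illustrative placeholders supplied from configuration.

using System.Threading.Tasks;
using Microsoft.Azure.KeyVault;
using Microsoft.Azure.KeyVault.Models;
using Microsoft.IdentityModel.Clients.ActiveDirectory;

public class SecretProvider {
    private readonly string clientId;
    private readonly string clientSecret;
    private readonly string vaultBaseUrl; // e.g. https://<vault-name>.vault.azure.net/

    public SecretProvider(string clientId, string clientSecret, string vaultBaseUrl) {
        this.clientId = clientId;
        this.clientSecret = clientSecret;
        this.vaultBaseUrl = vaultBaseUrl;
    }

    // Callback invoked by the Key Vault client to obtain an Azure AD access token.
    private async Task<string> GetAccessToken(string authority, string resource, string scope) {
        var authContext = new AuthenticationContext(authority);
        var credential = new ClientCredential(clientId, clientSecret);
        AuthenticationResult result = await authContext.AcquireTokenAsync(resource, credential);
        return result.AccessToken;
    }

    // Retrieves a single secret value, e.g. a database connection string, by name.
    public async Task<string> GetSecretAsync(string secretName) {
        var keyVaultClient = new KeyVaultClient(
            new KeyVaultClient.AuthenticationCallback(GetAccessToken));
        SecretBundle secret = await keyVaultClient.GetSecretAsync(vaultBaseUrl, secretName);
        return secret.Value;
    }
}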

Monitoring: Availability and Health Checks

There is no doubt that microservices have made monitoring and telemetry an integral part of having "production-grade" software. When your architecture consists of many moving parts interacting with each other, it is vital that one broken service does not cause a catastrophic failure of the infrastructure. I am sure many of us have been in the situation where one failed external component referenced by many other services within the system has caused a cascading failure within the overall infrastructure. Additionally, those moments might have led to long hours of debugging and traffic management to stop the bleeding, figure out the problem, and bring the service back up. Almost always in hindsight, the root cause is "obvious". Well, sure, it is obvious after it has already happened. But was it really that obvious when we were designing and developing? Most likely not.

Of course, the best advice is to always follow X-practice and do Y-thing to get your service as stable as possible before going live into production. If you cover all your bases during development, then you are certainly going to be industrial strength, right? How do you know, though? There is never a real way to tell until code hits production and has seen some action.

The reality is that there is always going to be something out in nature that causes a failure which was never considered. That is why we should at least plan to be notified, or alerted, about issues that happen in production. At least then a team can catch problems quickly and fix the issue, resulting in a more resilient system. With even some basic monitoring placed in the deployed infrastructure, we can get a lot of useful information that allows us to form an overall picture of the health of the system.

An example I can give is a feature that might have been very difficult to implement and is a key operational feature of your service. If a change is made that could affect the service's reliability should it not behave as intended, it is best to roll out in phases to catch any corner cases that might have been missed. Of course, the only way to catch corner cases, or even usage patterns that give us insight into whether a change has affected the service negatively or positively, is to send telemetry and monitor.

The pattern I prefer is to implement my feature and then provide a feature gate, or flag, that leaves it off. However, telemetry calls are sent at various points of execution within the code to simulate the feature being "on". This gives a good sense of the behavior of the system as if the feature were functioning. It gives engineers some data to look at so they can improve or fix the feature as needed before it is turned on.

What about KPIs? Business metrics? What about other metrics like user behavior? Yes, I do know those are just as important to actually growing a service. This is a technical discussion, so let's just say we accounted for all the above there, and that if philo ever takes off, it is something we can definitely implement.

Although I am tempted to actually create a super-resilient infrastructure, philo ultimately is still just a demonstration project. I can easily spend many months developing a monitoring infrastructure for philo, but that would be impractical. I want to finish the project eventually, and with that I have decided to just implement some basic monitoring that will just allow me to determine whether or not the system is behaving correctly at a high level.

In order to make the requirements more concrete, listing them out is perhaps better.

The main things philo must have in monitoring will be:

  1. Availability tests through health check endpoint calls. This will allow me to know if all the services in philo are actually online.
  2. Event/Exception telemetry at the service level. I will need to know right away when exceptions happen and be alerted. Other data being recorded, such as latency and time to process a job, is useful too.
  3. Alerting through SMS. Email alerting was thought about here, but SMS text-messaging is more likely to be answered. For the sake of my significant other's sleep, I won't be destroying our life with phone calls for a hobby project.

With all that said, we must choose a tool for monitoring our service. The best tool for the job is basically the tool that fulfills our requirements now, and is easy to use so that we don't have to suffer the overhead in learning complex tooling which could result in lack of motivation to implement good monitoring into the system. I am strongly adhering to keeping things simple here, and for that I have decided that the best tool for the job is basically what I have available in front of me now: Azure Application Insights.

Application Insights has a lot of things already built in such as telemetry processing, and availability tests within Azure. Also there are libraries available to use within our service for .NET, Java and even NodeJS. One can do a lot here.

The easiest thing to do for monitoring is to implement an availability check for all your services. A recommended approach is to define some sort of common endpoint in which your monitoring tool can invoke and inspect the response from your service. A method I like to use is the health endpoint check pattern. It is just the following:

  • Define a known endpoint exposed to the monitoring service that will be used to check the availability of the service. For example, /health
  • Define criteria of the response of the /health endpoint that will determine whether the service is healthy, or not.
Figure 4. Example of a series of continuous availability checks on the philo-media-encoder-producer service.

Taking the producer service, philo-media-encoder-producer, as an example: our MediaEncoderController is the main controller of our REST-based microservice that produces the messages to be sent to the service bus so that the consumer can read that data and invoke the encoding job. In order to determine whether or not the producer is online and ready to accept requests, let's define a simple default HTTP GET handler on this controller to serve as a health check.

/*
* MediaEncoderController.cs
* Author: Roger Ngo
* (C) 2018
* 
* The controller receives a JSON payload that will construct a message 
* that will be routed to the Azure Service Bus.
* 
* The message is assumed to be picked up by the consumer application 
* which will then kick off the encoding job.
*/

using System;
using System.Collections.Generic;
using System.Threading.Tasks;
using Microsoft.AspNetCore.Mvc;
using Microsoft.Azure.ServiceBus;
using Microsoft.IdentityModel.Clients.ActiveDirectory;
using Microsoft.Azure.KeyVault;

namespace PhiloMediaEncoderProducer.Controllers {
[Route("api/mediaencoder")]
[ApiController]
public class MediaEncoderController : ControllerBase {
    private NLog.Logger logger = Logger.GetLogger();

    private static async Task<string> GetAccessToken(string authority, string resource, string scope) {
        // ... Logic initializing the appropriate credentials to access Azure resources.
    }

    [HttpGet]
    public string Get() {
        logger.Debug("HEALTH OK.");

        return "HEALTH OK";
    }



    [HttpPost]
    public async Task Post([FromBody] Object value) {
        // ... Logic dealing with invoking the sending of the message to the service bus to invoke the consumer encoding job.
    }
}
}

Now, with the call to the GET /api/mediaencoder endpoint, we will expect the following in order to determine that the philo-media-encoder-producer service is online and healthy:

  1. HTTP status code 200 is returned.
  2. The page returns the content: HEALTH OK

Let's define that test in Application Insights:

Figure 5. Availability test.

Once that test has been created, the test will be invoked regularly to determine the health of the service. This isn't a tutorial on how to use Application Insights, but explore around your tooling to see what alerts you can send when the health of your application degrades, such as downtime for more than 10 minutes, etc.

Since that was easy, we can implement the basic availability health checks on all our current services. :)

Figure 6. Flow of health check.

If an application begins to experience downtime, the availability health check will then trigger an alert. For example, the philo-media-encoder-producer availability test will send an email notifying me that there has been a number of failed requests. Notice that the reason for the failures is also attached within the email.

Figure 6. Failing the availability tests triggers an alert.

Once your service is stable again, the health check will automatically detect the changes after a number of tests have passed:

Figure 7. Passing after failures will notify that everything is well.

Monitoring: Events and Telemetry

Assuming we have set up Application Insights to be connected to our service, we can use the Microsoft.ApplicationInsights library to begin sending telemetry events from our application to the telemetry service.

Using the API is pretty simple. There is an initial setup involved where we must first configure our .NET Core application to use Application Insights. This is just an invocation of services.AddApplicationInsightsTelemetry to register the configuration into the application. Your telemetry key can be stored either in a configuration file, which you can parse later, or in an environment variable.

Keep in mind though that your telemetry key must be valid for the library to work. I have seen instances where an invalid key, or no key at all prevents the application from starting properly.

Here is how the producer service registers the telemetry service to begin sending telemetry events. I have omitted a lot of other code in order to keep the focus on the registration of the telemetry service:

using System.Collections.Generic;
using Microsoft.AspNetCore.Builder;
using Microsoft.AspNetCore.Hosting;
using Microsoft.AspNetCore.Mvc;
using Microsoft.Extensions.Configuration;
using Microsoft.Extensions.DependencyInjection;
using Microsoft.ApplicationInsights;
using PhiloMediaEncoderKeyVault;

namespace PhiloMediaEncoderProducer {
public class Startup {
    private TelemetryClient telemetry = new TelemetryClient();
    public static string ApplicationInsightsInstrumentationKey;

    public Startup(IConfiguration configuration) {
        Configuration = configuration;
    }

    public IConfiguration Configuration { get; }

    // This method gets called by the runtime. Use this method to add services to the container.
    public void ConfigureServices(IServiceCollection services) {
        // ...
        // Initialize all configuration here.
        // ...
        
        System.Environment.SetEnvironmentVariable("ApplicationInsightsInstrumentationKey", config["ApplicationInsightsInstrumentationKey"]);
        
        services.AddMvc().SetCompatibilityVersion(CompatibilityVersion.Version_2_1);
        services.AddApplicationInsightsTelemetry(System.Environment.GetEnvironmentVariable("ApplicationInsightsInstrumentationKey"));
    }

    // This method gets called by the runtime. Use this method to configure the HTTP request pipeline.
    public void Configure(IApplicationBuilder app, IHostingEnvironment env) {
        if(env.IsDevelopment()) {
            app.UseDeveloperExceptionPage();
        }
        else {
            app.UseHsts();
        }

        app.UseHttpsRedirection();
        app.UseMvc();
        
        telemetry.TrackEvent("PhiloMediaEncoderProducer.Startup", new Dictionary() {
            { "Message", "Successfully started the application." }
        });

    }
}
}                
    

After our producer service knows where to send telemetry, it can then send events and exceptions with a telemetry client. By creating a TelemetryClient object, we can invoke methods such as TrackEvent to send the events up to Azure.

/*
* MediaEncoderController.cs
* Author: Roger Ngo
* (C) 2018
* 
* The controller receives a JSON payload that will construct a message 
* that will be routed to the Azure Service Bus.
* 
* The message is assumed to be picked up by the consumer application 
* which will then kick off the encoding job.
*/

using System;
using System.Collections.Generic;
using System.Threading.Tasks;
using Microsoft.AspNetCore.Mvc;
using Microsoft.Azure.ServiceBus;
using Microsoft.ApplicationInsights;
using PhiloMediaEncoderKeyVault;

namespace PhiloMediaEncoderProducer.Controllers {
[Route("api/mediaencoder")]
[ApiController]
public class MediaEncoderController : ControllerBase {
    private NLog.Logger logger = Logger.GetLogger();
    private TelemetryClient telemetry = new TelemetryClient();

    [HttpGet]
    public string Get() {
        logger.Debug("HEALTH OK.");

        return "HEALTH OK";
    }

    [HttpPost]
    public async Task Post([FromBody] Object value) {
        String requestBody = Newtonsoft.Json.JsonConvert.SerializeObject(value);
        
        // Send telemetry event noting that an API call to invoke a message to encode media
        telemetry.TrackEvent("PhiloMediaEncoderProducer.MediaEncoder", new Dictionary() {
            { "HttpMethod", "POST" },
            { "RequestBody", requestBody },
            { "DateTime", DateTime.Now.ToLongDateString() }
        });
    
        // ...
        // Logic to send the message to the service bus here.
        // ...
    }
}
}               
    

The examples below depict what our custom events and exceptions look like on Azure. We can see that we are able to send a good amount of information to our telemetry service so that we can track errors, behavior, and other interesting events that give us a better idea of how users use the application.

Figure 8. Custom events sent from the application to Application Insights.
Figure 9. Additionally, exceptions can also be sent.

Speaking of which... how do we know what to track? In all the literature I have read, the best advice I have been able to comprehend is to begin tracking telemetry from the user perspective. In fact, when it comes to microservices, tracking hardware resource usage such as CPU and memory consumption is not as useful as tracking the invocations which produce and consume messages, along with their flow and error rates.

Think about tracking things such as:

  • Application status: starting, started, caches warmed up, or ready to receive requests.
  • Authentication events: who, when, and from where has the key management vault been accessed?
  • Latency of important methods which produce or consume messages. For example, for our producer service, it is good to track the entry point of the request that leads to the sending of the message to the service bus, along with the latency before sending the OK response to the user.
  • Did the method invocation fail, or was it successful? What was the total time to completion? What was the time taken for any remote API calls?
  • Any exceptions resulting from operations that can be invoked by the user, with any detail that can help with debugging. Just remember not to store any personally identifiable information.

To add to the last point above regarding exception tracking, it can be helpful to include a correlation ID, or some unique GUID for each service request, that can be tagged to each event to allow faster tracing and matching of an exception to a particular past action. In fact, Application Insights already provides features like that out of the box. But if your telemetry service does not, you can implement it by just adding a property to your event and including the value within your exception should one occur.
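
As a sketch of that idea (the class, the GUID-per-request approach, and the property names are illustrative assumptions):

using System;
using System.Collections.Generic;
using Microsoft.ApplicationInsights;

public class EncodeRequestHandler {
    private readonly TelemetryClient telemetry = new TelemetryClient();

    public void Handle(string requestBody) {
        // A per-request ID that gets attached to every event and exception.
        string correlationId = Guid.NewGuid().ToString();

        try {
            // ... handle the request ...
        } catch (Exception ex) {
            // The same CorrelationId property lets us match this exception to the
            // events that were tracked earlier for the same request.
            telemetry.TrackException(ex, new Dictionary<string, string>() {
                { "CorrelationId", correlationId },
                { "RequestBody", requestBody }
            });
            throw;
        }
    }
}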

If tracking from the front-end application, the most important thing is to always track how fast your page loads. I am not going to try to persuade you on why having a faster page load time brings better user experience or business, as there are already many case studies detailing that, but just trust me.

Use as many of the APIs made available by the browser as you can to send as much useful data as possible. The Navigation Timing API implemented by most browsers can be handy.

Sending telemetry from the back-end tends to be the most "wild-west"-like part of the overall system. Especially in the context of a distributed system, distributed tracing must be implemented if your tooling doesn't provide it. Additionally, monitoring from the back-end presents a few more interesting metrics which can be generated from the data that is sent. With more data, we need to be smart about how much to send, how to measure it, and how to make sense of it. Brushing up on some basic statistics here, some of the questions to ask can be "should I apply an average here?", "how useful would means and medians be here? would percentiles be a better metric?", etc.

Monitoring is hard. Although philo has some monitoring built-in, the philosophy of monitoring should be to:

  • Never leave monitoring as an after-thought. Start thinking about your metrics (health, performance, business, etc) early and as you are developing, bake them in.
  • Continually improve your monitoring code. It really does take a while to get comfortable with the tooling and figuring out what really to measure. Just keep a mindset of changing what to send for better data and be open to throwing data away which is not useful.

One final note before I leave this section. I am lucky that philo does not really care about user-identifiable information right now. Should your service start to use personally identifiable information (PII), that is, any information belonging to a user which can lead back to the user if leaked, be very careful about sending it. In fact, don't send it at all. The best practice is to anonymize any PII by scrubbing, or hashing, the values so that they stay consistent across telemetry events but cannot be traced back to the user. I cannot stress how important it is to perform this as a practice. It isn't "best practice" in my opinion, it is the practice.

Service: philo-media-service

At some point, we need to have a way for any clients to actually retrieve all the video content information that gets generated by the philo-media-encoder backend. Conveniently, we have designed the consumer service so that it not only manages the instantiation of the encoder service, but also makes use of any meta-data returned during the encoding process.

The data that is returned by the encoder service is pretty handy. It includes:

  • The output asset information such as the output asset name generated by the encoder service, the job name used by Azure Media Service, the streaming locator name and the asset ID.
  • The appropriate streaming URLs which our client media player can reference in order to begin delivering the content to the end-user.

As you might have guessed, we will need a service to handle this information and store it within a database. This service must also be capable of delivering this data via an API call through an HTTP REST request which can be leveraged by the client side. In this context, the client side will be some sort of web application in the form of a very humble Netflix.

This service is what I will call philo-media-service. It is the service that sits between the front-end code and the back-end encoding service and makes the overall user experience coherent. The service is dual-purpose in that it writes metadata information and also sends it out per request as a content service.

The technology stack I have chosen to implement this is relatively lean:

  • Java 8 with Spring Boot - I wanted to learn more about the Java ecosystem. I figured Spring Boot is a good way to start in learning how to code modern Java services.
  • Containerized deployments with Docker - Containers are hot right now and I do not have much experience with creating images and deploying them to the cloud.
  • Hosting open source technology on Azure. Linux? Java? Docker? Is this all possible on Azure? Why, YES!

Not everything we have chosen will be "latest and greatest" in practice. Since philo-media-service needs a persistent data store to store the content metadata, I have decided to stick with a low-maintenance database technology: SQL. Using Azure SQL backed by Microsoft SQL Server serves our needs perfectly. My reasoning for actually choosing SQL is simply due to the low cost in getting everything up and running. JDBC at this point, is very mature and the MSSQL drivers provided by Microsoft are very good. Additionally, we can always solve any performance issues in the future should they arise as there are many, many resources out there that deal with web application performance in the context of having a SQL data store being the persistent data store.

My strongest defense in using a SQL store is ultimately this: Our data must be stable and is essentially the catalog of all the videos that will exist in our video streaming service. It is much easier to handle this structured data relationally rather than using something along the lines of a NoSQL store.

Now, finally, when it comes to data transport between requests and responses, supporting multiple content types normally allows for a robust interface. I have decided to forfeit that here for the sake of simplicity. After all, you do want me to actually finish this project, right? :) For that, I have just decided to use JSON as the content type for both request and response payloads. That is application/json for all you HTTP nuts out there.

All this, and I must remember one thing: philo-media-service cannot do too much. The requirements I listed above will be all the functionality that exists within the service. The temptation is to build more and more onto the service, and that is bad. Keeping our applications as small as possible within a bounded context to solve a business requirement is the goal for making maintainable services that can be pushed to production quickly.

So here is how philo-media-service fits in with the overall system architecture. There is a lot going on here, but it will just basically interact with a few components.

  • Azure SQL DB - For persistent data storage.
  • Azure Key Vault - For credential management, of course.
  • philo-media-encoder-consumer - To receive data to record into the database.
  • Azure Application Insights - Monitoring and telemetry.
  • ELK stack VM on Azure - Logging and parsing.
Figure 10. The current architecture so far.

Alright, let's implement!

Usually, the first thing I like to do when designing a service that relies on any sort of relational database is to design the database schema. I will not break from my personal pattern here and will quickly discuss what our philo-media-service database can look like.

Figure 11. The database schema for philo-media-service

The database is small. It just has 3 tables: MediaEncoderInput, MediaEncoderOutput, and MediaEncoderOutputStreamUrl. A brief overview of the purpose of each is detailed below:

  • MediaEncoderInput
    • This is the table that will store the registered input assets available within our system. The structure just holds the title of the video and the CDN/storage location endpoint at which to find the media asset file. This is basically an entry within the catalog which lets the encoder service know whether or not a new collection of adaptive streaming assets should be created.
    • The reasoning behind this is that if a user inputs an AssetName and AssetLocation which already exist within MediaEncoderInput, we can skip the encoding process. Otherwise, send the job to the service bus.
    • The philo-media-encoder-consumer service is in charge of writing data to this table when it consumes a message from the Azure Service Bus. It will make an HTTP REST invocation with a JSON payload to philo-media-service to write to the database.
  • MediaEncoderOutput
    • Media that is sent to the philo-media-encoder service to be encoded will produce metadata relating to the resources that are created by Azure Media Services. In order to organize and structure this data, we need a table that will house this information and relate it back to the original input.
    • This table is a one-to-many relationship with the MediaEncoderInput entry. Although in our implementation, the service is only creating one output asset (the logical grouping of videos of various bitrates) associated with the input, I have left it extensible to create many output assets attached to the input.
    • When the Azure Media Services metadata is returned by the philo-media-encoder service, philo-media-encoder-consumer will invoke an HTTP REST call to philo-media-service to record it.
  • MediaEncoderOutputStreamUrl
    • This table is likely to be the most relatable, and interesting one for many users. The content within this table consists of all the streaming URLs which are generated by Azure Media Service. A reference will be needed by the web user interface to be able to load the video content. The URL is the source in which to obtain the video.
    • Within the implementation of our service, there will be 3 output streaming URLs generated for every output asset. We are interested in using the most general one in the format of: https://myStreamingEndpoint.streaming.media.azure.net/assetContainerName/assetTitle.ism/manifest

Knowing all this, we can then make a simple query such as the following to obtain the output stream URLs for every input:

select i.id InputId, o.id OutputId, o.ContainerName, i.AssetName Title, u.Url 
from MediaEncoderInput i 
inner join MediaEncoderOutput o on o.InputId = i.Id 
left outer join MediaEncoderOutputStreamUrl u on u.OutputId = o.Id
    

A sample output from the above query will return the following table:

InputId OutputId    ContainerName                               Title                               Url
46	    37	    asset-3323d45a-2789-4ff4-ab5e-df961427735a	Caminandes 1 - Llama Drama	    https://philo-uswe.streaming.media.azure.net/...
46	    37	    asset-3323d45a-2789-4ff4-ab5e-df961427735a	Caminandes 1 - Llama Drama	    https://philo-uswe.streaming.media.azure.net/...
46	    37	    asset-3323d45a-2789-4ff4-ab5e-df961427735a	Caminandes 1 - Llama Drama	    https://philo-uswe.streaming.media.azure.net/...
    

We can then have philo-media-service shape the above information into a JSON payload as a response to any content queries.

Implementing the service itself was quite interesting. It was an experience because I had never created a Spring Boot application before, nor packaged a Docker image from scratch for deployment into a private container registry. In addition to that lack of experience, my Java skills had become quite outdated over the past couple of years. I had last worked closely with Java 6 and, to a certain degree, Java 7. So going to Java 8, which in retrospect is no longer considered "modern" Java, brought some learning curve.

I followed a simple tutorial to bootstrap myself into a Spring Boot project. The tutorial was very straightforward as it was in the context of building a REST API from scratch. This was perfectly suited for my needs.

In addition to the above tutorial, the documentation on building a Docker image out of a Java service and deploying it to a private Azure container registry was also very handy.

Let me first present the part of philo-media-service that writes the metadata returned by the encoding service as the output asset is processed and converted into streaming URLs.

We will have one simple REST controller, called MediaController, to handle all of this within our project. To keep things very small, we map a specific POST endpoint to each table. The endpoints are:

  • POST /input
    • This will take in a request body of application/json in the following format:
      {
      "name": "AssetName",
      "location": "AssetLocation"
      }
                          
    • The data is simply written to the MediaEncoderInput table. The response is the ID which is associated with the newly created entry in the MediaEncoderInput table.
      {
      "status": "200", 
      "message": "OK",
      "result": "45"
      }
                          
  • POST /output
    • This will take in a request body of application/json in the following format:
      {
      "assetName": "AssetName",
      "assetId": "Azure Media Services asset ID",
      "jobName": "Azure Media Services job name",
      "containerName": "Azure storage blob container name which the files associated with the output asset are stored",
      "locatorName": "The associated streaming locator name",
      "inputId": "The MediaEncoderInput ID"
      }
                          
    • This data is written to the MediaEncoderOutput table after the encoder service has finished processing the output assets. The response is the output ID which is associated with the newly created entry in the MediaEncoderOutput table.
      {
      "status": "200",
      "message": "OK",
      "result": "32"
      }
                          
  • POST /url
    • This will take in a request body of application/json in the following format:
      {
      "outputId": "The MediaEncoderOutput ID",
      "url": "The streaming endpoint referencing the output asset to be used by the Azure Media Player"
      }
                          
    • This data is written to the MediaEncoderOutputStreamUrl table after the encoder service has generated the streaming URLs from Azure Media Services. These URLs will be the src URLs used by the Azure Media Player on the client side. The response is the ID associated with the newly created entry in the MediaEncoderOutputStreamUrl table.
      {
      "status": "200",
      "message": "OK",
      "result": "5"
      }
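
In practice, it is the philo-media-encoder-consumer (a .NET service) that calls these endpoints, but to make the request/response contract concrete, here is a rough sketch of what an invocation of POST /input could look like from any HTTP client. It is shown in JavaScript purely for brevity; the host name, title, and asset location are made-up placeholder values.

fetch('https://philo-media-service-instance.azurewebsites.net/input', {
method: 'POST',
headers: {
    'Content-Type': 'application/json'
},
body: JSON.stringify({
    name: 'Caminandes 1 - Llama Drama',
    location: 'https://mystorageaccount.blob.core.windows.net/uploads/caminandes-1.mp4'
})
}).then((response) => {
    return response.json();
}).then((json) => {
    // json.result holds the ID of the newly created MediaEncoderInput row.
    console.log('Created MediaEncoderInput entry with ID:', json.result);
});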
                          

All of this is implemented in one small controller similar to the following:

package philomediaservice;

import java.sql.*;
import java.util.HashMap;
import java.util.Map;

import com.microsoft.applicationinsights.TelemetryClient;
import org.springframework.web.bind.annotation.*;
import com.fasterxml.jackson.core.JsonProcessingException;
import com.fasterxml.jackson.databind.ObjectMapper;

@RestController
public class MediaController {
private String connectionString;
private TelemetryClient telemetry;

public MediaController() {
    this.connectionString = (new philomediaservice.ConnectionStringManager()).getConnectionString();
    this.telemetry = philomediaservice.TelemetryClientHelper.getClient();
}

@RequestMapping(value="/input",
        method = RequestMethod.POST,
        consumes = "application/json",
        produces = "application/json")
public String input(@RequestBody Map<String, String> payload) {
    int resultId = 0;

    try(Connection connection = DriverManager.getConnection(this.connectionString)) {
        String assetName = payload.get("name");
        String assetLocation = payload.get("location");

        String insertStatement = "INSERT INTO MediaEncoderInput (AssetName, AssetLocation) VALUES (?, ?);";

        try(PreparedStatement statement = connection.prepareStatement(insertStatement,
                Statement.RETURN_GENERATED_KEYS)
        ) {
            statement.setString(1, assetName);
            statement.setString(2, assetLocation);

            statement.executeUpdate();

            ResultSet r = statement.getGeneratedKeys();
            if(r.next()) {
                resultId = r.getInt(1);
            }
        } catch(Exception e) {
            this.telemetry.trackException(e);
            throw e;
        }
    }
    catch(Exception e) {
        this.telemetry.trackException(e);
    }

    HashMap<String, String> response = new HashMap<>();
    response.put("status", "200");
    response.put("message", "OK");
    response.put("result", Integer.toString(resultId));

    ObjectMapper mapper = new ObjectMapper();

    String responseData;
    try {
        responseData = mapper.writeValueAsString(response);
    } catch(JsonProcessingException ex) {
        responseData = "{ \"status\": \"500\", \"message\": \"Error\"}";
    }

    return responseData;
}

@RequestMapping(value="/output",
    method = RequestMethod.POST,
    consumes = "application/json",
    produces = "application/json")
public String output(@RequestBody Map<String, String> payload) {
    // ... Implementation similar to /input
}

@RequestMapping(value = "/url",
    method = RequestMethod.POST,
    consumes = "application/json",
    produces = "application/json")
public String urls(@RequestBody Map<String, String> payload) {
    // ... Implementation similar to /input
}
}  
    

The ContentController implementation is a basic REST API that exposes a single endpoint, GET /content, which outputs the metadata needed to properly render a user interface so that users can begin their video streaming experience.

A sample output is shown below.

{
"asset-3323d45a-2789-4ff4-ab5e-df961427735a": {
    "title": "Caminandes 1 - Llama Drama",
    "containerName": "asset-3323d45a-2789-4ff4-ab5e-df961427735a",
    "thumbnail": "https://philomedia.azureedge.net/asset-3323d45a-2789-4ff4-ab5e-df961427735a/Thumbnail000001.jpg",
    "urls": [
        "https://philo-uswe.streaming.media.azure.net/d00de513-d124-401c-a012-aecdb5e72b86/Caminandes%201%20-%20llama%20Drama.ism/manifest(format=m3u8-aapl)",
        "https://philo-uswe.streaming.media.azure.net/d00de513-d124-401c-a012-aecdb5e72b86/Caminandes%201%20-%20llama%20Drama.ism/manifest(format=mpd-time-csf)",
        "https://philo-uswe.streaming.media.azure.net/d00de513-d124-401c-a012-aecdb5e72b86/Caminandes%201%20-%20llama%20Drama.ism/manifest"
    ]
}
}
    

I am not going to show the implementation of the ContentService here, but it is fairly straightforward in that it is a transformation of the results returned by the SQL query I had previously shown. The JSON response returned by the content service call outputs pretty much exactly what the SQL statement returns, with the exception of one attribute: the thumbnail attribute. For now, our encoder service produces a JPG file called Thumbnail000001.jpg by default, which, as I found out through the Azure documentation, is the first interesting frame of the video. The thumbnail automatically gets created during the encoding process and is included with all the other files within the output asset.

I will go ahead and say that the portion that generates the thumbnail is hardcoded for now. It suits our needs, and we will actually be enhancing it later on in the A/B testing section.

response.get(containerName).thumbnail = this.cdnEndpoint + "/" + containerName + "/Thumbnail000001.jpg";
    

philo-config-service

There is one more service which needs to be implemented... It is a small one -- a configuration service from which philo-media-service will pull its configuration data. To create such a service, I followed this tutorial, and autowired a ConfigStore object into the philo-media-service code.

I highly suggest researching the centralized configuration patterns practiced by many microservice implementers. :) At this time, I am not going to go too in-depth into this, but the motivation is basically the same as the reason I took the time to implement key management. Centralized configuration is needed to ensure that all instances of our applications stay ephemeral, and that configuration changes affect all instances -- reducing the temptation to make "one-off" configuration changes to a specific instance and then forget to set them back after, for example, an incident mitigation session.

User interface

Now that the services are up and running, there must be some way for users to use the philo service in an end-to-end manner. The first challenge is to create an experience where users can invoke the encoder service to produce streamable videos. The second task is to create another experience where users can consume these encoded videos through streaming. Both are wildly different experiences, but they basically make a call to philo-media-encoder-producer and philo-media-service respectively.

Since our web applications are quite simple, we can get away with just a static webpage. create-react-app allows easy bootstrapping of React projects with literally a single command.

philo-admin

To begin, I will discuss the user experience for submitting videos to be encoded. Going back to the original specification, philo-media-encoder-producer really only takes 2 parameters: the asset name and the asset location. Therefore, a simple user interface can be created using HTML5 and JavaScript; specifically, the UI will be composed of React JSX elements.

A picture is worth a thousand words, so here is what it looks like:

Figure 12. philo admin.

Anyway, all this is quite simple. The Submit button will make an HTTP POST fetch() call to the philo-media-encoder-producer service, sending the JSON data to place into the service bus. From here, the familiar flow follows. The asynchronous HTTP POST method is shown below. Note that this code is bare-bones and has very basic error handling. Also notice that we have taken advantage of the Promise object returned by fetch to handle the HTTP response accordingly.

fetch('https://producer-instance.azurewebsites.net/api/mediaencoder', {
method: 'POST',
headers: {
    'Content-Type': 'application/json'
},
body: JSON.stringify({
    assetName: mediaTitle,
    assetLocation: mediaLocation
})
}).then((response) => {
    if(response.status === 200) {
        alert('Your job has been submitted!');
    } else {
        alert('Something went wrong!');
    }

    button.removeAttribute('disabled');
}).catch((e) => {
    alert('Something went wrong!');

    button.removeAttribute('disabled');
});

philo-theater

Next, let us discuss the user interface on the other side, where the client is consuming the video by watching a stream! This component is called philo theater, and aims to be a "Netflix-lite". It makes use of the philo-media-service by calling the GET /content endpoint to retrieve the list of available video titles processed through the encoding service along with the default thumbnail and streaming URLs. As you may recall, this data is stored in a SQL database that is queried by the media service.

Since this is a major client application, it would be best to make sure we have the scope of work defined clearly. Here are some basic requirements to create a decent streaming experience:

  1. When there are no movies to be streamed, the application should display a splash screen notifying the user that there are not yet any movies to be streamed.
  2. When there are movies to be streamed, the following should show:
    1. A list of movies should be shown in an aesthetically pleasing manner. Thumbnails and titles are clearly marked, allowing the user to click on the title which they want to view.
    2. A nice and large media player that contains user interface controls to: play, pause, adjust volume, choose the appropriate bitrate, and move back and forth across the timeline.
    3. The ability to resume watching a video from wherever the user previously left off.
    4. Consideration of sending telemetry events to get a better understanding of user behavior, performance, and exceptions which may occur during the viewing experience.

With all these requirements defined, the hard part is likely to be finding the appropriate media player. Generally, there are two approaches to this. First, we can roll our own media player to suit our needs. This is likely to be impractical, as our media player must understand the specific streaming URL formats which Azure Media Services generates. We could spend some time to understand and implement this, but it is much more resourceful to find something which already exists. Fortunately, Azure Media Services actually offers a built-in video player which natively supports HTML5 and JavaScript. It is available here: Azure Media Player

With the given examples, implementing the video player is trivial for plain JavaScript. However, our front-end application is React-based, and thus there is a bit of work involved in porting the video player over to the client application.

First to note, the Azure Media Player, or AMP, loads asynchronously. To make the video player available, we must first include its stylesheet and script within our HTML document.

<link href="//amp.azure.net/libs/amp/2.1.9/skins/amp-default/azuremediaplayer.min.css" rel="stylesheet">
<script src= "//amp.azure.net/libs/amp/2.1.9/azuremediaplayer.min.js"></script>
    

After some time, the window.amp object will contain the AMP object which can then create the video player to be loaded onto the HTML document. Since this loading is done asynchronously, we must take care: the AMP object may not be initialized yet by the time our React component that houses it would like to use it. Therefore, we must implement a loading routine that safely ensures the AMP component is properly initialized before the rest of the page is rendered.

Probably the best way to solve this problem is to wrap the initialization of the AMP component within a Promise object. Within the promise, we define a timeout that is recursively executed a set number of times. The timeout should be a small number of milliseconds, as the AMP component is not expected to need much time to initialize.

Writing a waitDelay timeout function wrapped by a Promise is the implementation I have decided to commit to. I have chosen a timeout of 100 ms running for 10 iterations total. The maximum amount of time this asynchronous code can run would then be about 1 second. From here, we inspect the window.amp property: if the player is initialized, we resolve the Promise object with the AMP component passed in; otherwise, we reject the promise to avoid any unnecessary waiting should the AMP component not be initialized in time. It is best to fail fast in this scenario.

waitForAmpInitialization() {
return new Promise((resolve, reject) => {
    const waitDelay = (currInterval) => {
        setTimeout(() => {
            currInterval++;
            
            const amp = window.amp;
            if(amp !== undefined) {
                return resolve(amp);
            } else if(currInterval > 10) {
                return reject();
            }
            waitDelay(currInterval);

            return null;
        }, 100);
    };

    waitDelay(0);
});
}
    

From here, we just make an instance variable videoPlayer available, and resolve the promise in our React rendering lifecycle. Conveniently, using componentDidMount is exactly what we need!

componentDidMount() {
this.waitForAmpInitialization().then((amp) => {
    // Per AMP documentation, initialize amp to contain all settings you want. 
    this.videoPlayer = this.createVideoPlayer(amp);

    // ...
    // All other interesting operations can go here.
    // ...
}).catch(() => {
    // log error here
});
}
    

The createVideoPlayer method essentially configures the AMP component, and returns the initialized AMP. Follow the Azure Media Player documentation for some of the settings that can be used to tailor AMP to your needs.

createVideoPlayer = (amp) => {
const video = amp(this.videoRef, {
    autoplay: false,
    controls: true,
    width: "98%",
    height: "720px",
    logo: { enabled: false },
    techOrder: [
        'azureHtml5JS',
        'html5FairPlayHLS',
        'html5',
        'flashSS'
    ],
    poster: this.state.currMovie.thumbnail
});
return video;
}
    

Now that the video player initialization issue is solved, our theater application must be able to fetch the content from the philo-media-service API.

The splash screen should display when we fetch the content from philo-media-service and there is no data. This splash screen should also reappear when all videos are deleted, since the view dynamically refreshes through a polling mechanism.

Figure 13. philo theater with no movies.

Creating the JSX element is trivial:

renderNoMoviesView() {
return (
    <div className='philo-theater_no-movies-view'>
        <h1>philo theater</h1>
        <h3>No movies yet.</h3>
        <p>
            Come back later to see what's showing! :)
        </p>
        <div className='philo-theater_attribution'>
            (C) 2018 Roger Ngo, <a href="http://rogerngo.com">rogerngo.com</a>
        </div>
    </div>
);
}
    

When there are videos available to be streamed, the reel (the top component) will show the available videos along with a default thumbnail for each. Just like how the splash screen disappears through polling of content, the same mechanism is used to update the list of videos.

Figure 14. philo theater with some movies to watch.

There are a few variables we must keep track of here to maintain the state.

  • The list of movies
  • The current movie that should be displayed
  • Whether or not there is a movie that is currently playing

The current movie in itself has a few properties that our application needs to be aware of: the title, thumbnail and streaming endpoint. All this is required to properly initialize the media player and render the content.

The constructor initializes them to be default values of:

this.state = {
movies: [],
currMovie: {
    title: null,
    thumbnail: null,
    endpoint: null
},
isPlaying: false
};
    

The list of movies can be fetched with an API call to philo-media-service. With GET /content, the JSON payload that is received is parsed and then loaded into the movies property in the state like so:

getContent() {
const jsxElements = [];
return fetch(`${PhiloMediaServiceEndpoint}/content`, {
    method: 'GET',
}).then((response) => {
    return response.json();
}).then((json) => {
    let firstMovie = null;
    for(let currContainerKey in json) {
        const currElement = json[currContainerKey];
        if(firstMovie === null) {
            firstMovie = currElement;
        }
        jsxElements.push(
            <div className='movie-item'
                key={currContainerKey}
                onClick={this.playMovie.bind(this)}
                data-video-location={currElement.urls[2]}
                data-video-title={currElement.title}
                data-video-thumbnail={currElement.thumbnail}
            >
            <div className='movie-thumbnail'>
                <img src={currElement.thumbnail} />
                <h4>{currElement.title}</h4>
            </div>
            </div>
        );
    }

    if(jsxElements.length === 0) {
        // reset the state to empty values
    } else {
        // update the state with the new movies
    }
});
}
    

The code is truncated due to its verbosity -- JavaScript can be really hard to understand in large chunks -- so I have only included the relevant pieces. Basically, we make a GET request and expect the response to be JSON. From here, we parse the JSON into a JavaScript object. This object is keyed by asset container name, with each entry holding the metadata for one movie.

As the routine iterates through the parsed object, a JSX element is created for each entry. Basically, each JSX element includes:

  • The click handler which contains logic to update the media player with the new video of interest (a sketch of this handler follows the list).
  • The streaming URL of the movie as a data attribute value.
  • The title of the movie as a data attribute value.
  • The thumbnail file of the movie as a data attribute value.
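
Here is a rough sketch of what that click handler might look like. The state update mirrors the state shape described earlier, and the src() call with the smooth streaming MIME type follows the Azure Media Player samples; the method body shown here is illustrative rather than the exact implementation in the repository.

playMovie(event) {
const element = event.currentTarget;

// The data attributes set in getContent() carry everything we need.
const endpoint = element.getAttribute('data-video-location');
const title = element.getAttribute('data-video-title');
const thumbnail = element.getAttribute('data-video-thumbnail');

this.setState({
    currMovie: { title: title, thumbnail: thumbnail, endpoint: endpoint },
    isPlaying: true
});

// Point the Azure Media Player at the newly selected streaming URL.
this.videoPlayer.src([{
    src: endpoint,
    type: 'application/vnd.ms-sstr+xml'
}]);
}

The default view that houses the reel, the title header, and the media player is then rendered like so: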
renderDefaultView() {
return (
    <div>
        <div className='reel'>
            {this.state.movies}
        </div>
        <div className='theater-header'>
            <h1>philo theater</h1>
            <h3>{this.state.currMovie.title}</h3>
        </div>
        <div className={`video-container`}>
            <video
                className="azuremediaplayer amp-default-skin amp-big-play-centered"
                id='philo-movie'
                ref={(input) => { this.videoRef = input; }}
            />
        </div>
        <div className='philo-theater_attribution'>
            (C) 2018 Roger Ngo, <a href="http://rogerngo.com">rogerngo.com</a>
        </div>
    </div>
);
}
    

That is pretty much it for the basics of what our theater should do. The rest of the functions, such as updating the video list and updating the player to contain the current movie, can all be referenced in the source code itself.
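
For illustration, the content poll that keeps the reel fresh could look something like the sketch below. The 30 second interval and the pollHandle field name are arbitrary choices for this example, and the setup could just as well live in the componentDidMount shown earlier, where the "other interesting operations" comment is.

componentDidMount() {
// Fetch the content once immediately, then poll philo-media-service so that
// newly encoded or deleted titles show up without a page refresh.
this.getContent();
this.pollHandle = setInterval(() => this.getContent(), 30000);
}

componentWillUnmount() {
// Stop polling when the component is torn down.
clearInterval(this.pollHandle);
}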

Telemetry

The player should also have basic telemetry set up to gather useful data. A few suggestions of when we should send telemetry:

  • A user clicks on a new video to watch. Which are the most popular videos?
  • If a user is currently playing a video and has clicked on a new video, where did the previous video leave off?
  • A user pauses a video. Where in the timeline did this occur?
  • The video bitrate changes. This can gather pretty good data on the most frequently requested bitrates.

I am purposely keeping the list small for the sake of reducing the scope of this project, but it is definitely an exercise worth exploring to think of other creative telemetry one can send from the front-end. One should also explore sending telemetry for performance and exception tracking.

The approach to tracking telemetry is to initialize the Application Insights SDK for JavaScript, and attach an event handler for specific Azure Media Player events such as the play event and send the telemetry. One can find the list of interesting events which can be handled here: Azure Media Player events. For example, to send telemetry of the current movie title along with the streaming URL, we can attach the event handler to the play event.

this.videoPlayer.on('play', () => {
appInsights.trackEvent("PhiloTheater.StartPlay", {
    title: this.state.currMovie.title,
    streamingUrl: this.state.currMovie.endpoint
});
});
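
Similarly, to capture where in the timeline a user paused (one of the suggestions from the list above), we can hook the pause event and include the player position. The event name PhiloTheater.Pause is just an example, and currentTime() is the getter mentioned a little further below.

this.videoPlayer.on('pause', () => {
appInsights.trackEvent("PhiloTheater.Pause", {
    title: this.state.currMovie.title,
    currentTime: this.videoPlayer.currentTime()
});
});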
    

In Application Insights, these events can then be seen:

Figure 15. Telemetry from the front-end.

From here, it really depends on letting your creative juices flow to make sense of the telemetry data. :) Application Insights provides a set of good reporting options to aggregate telemetry data and shape it into useful information that can aid in progressing design, development, and the overall business direction.

There is a reason why I discussed event handling and telemetry before this last piece... allowing videos to be resumed at a later time! Knowing that we can handle various events such as play, pause, and start, we can call the currentTime method made available by the player and retrieve the current position within the video. Combined with the information we already store about the currently playing video, we can cache this either at the service level, or at the client, with the following example message:

{
currTitle: 'The Current Title',
streamUrl: 'https://streamurl/',
time: 'videoPlayer.currentTime() call',
... other useful data.
}
    

After sending this information, it all becomes a basic exercise of mapping the video back to the client and programmatically seeking the video to the appropriate timespan upon load.
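
As a concrete illustration of the client-side variant, the sketch below saves the position on pause and seeks back to it the next time the same movie is loaded. localStorage stands in for the client-side cache mentioned above, the helper names and key format are made up for this example, and currentTime(seconds) as a setter assumes AMP follows the video.js-style API.

savePlaybackPosition() {
const key = 'philo-resume:' + this.state.currMovie.endpoint;

localStorage.setItem(key, JSON.stringify({
    currTitle: this.state.currMovie.title,
    streamUrl: this.state.currMovie.endpoint,
    time: this.videoPlayer.currentTime()
}));
}

restorePlaybackPosition() {
const key = 'philo-resume:' + this.state.currMovie.endpoint;
const saved = localStorage.getItem(key);

if(saved !== null) {
    // Seek the player back to where the viewer left off.
    this.videoPlayer.currentTime(JSON.parse(saved).time);
}
}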

Architecture Map

Finally, we have implemented everything we need to create an end-to-end experience. Here is the architecture map for review:

Figure 16. Architecture Map

The main entry points for the actors, in this case the users, are colored in green. These are philo-admin and philo-theater.

All Azure services are colored blue, while the services implemented to leverage Azure services are colored gray for the philo-media-encoder-related services and yellow for the philo-media-service-related services.

Demo and Performance

I believe the best way to demo is to actually perform some show-and-tell. :) Here is a video of the end-to-end flow showing what philo can do.

Figure 17. Demo Video

A few important aspects related to performance and throughput of the encoding process:

  • It is important to remember that the philo-media-encoder service does not do the encoding job itself; it is actually the Azure Media Services encoder which performs all the hard work.
    • To leverage this, it was intentional to keep philo-media-encoder and philo-media-encoder-consumer separate, as the decoupled consumer service can spawn as many encoder tasks as needed.
    • philo-media-encoder can then make as many calls as needed to Azure Media Services to encode many videos at once. It is not limited to 1 job, but can run multiple, as the Azure Media Services offering permits.
  • There was a deliberate attempt to ensure that all potentially long-running operations are kept as asynchronous as possible. The .NET async/await keywords helped a lot, along with the service bus used to have the producer and consumer communicate with each other.
  • I leveraged philo-media-service a lot. Keeping each API method call as small as possible was the best thing to do. The fact that I have left this implementation as point-to-point calls also makes it a potential performance bottleneck in the overall system.
  • The above might not be as bad as it seems, though. Should performance under load become an issue, we can create more instances of the database and the service itself. We can also practice database optimization techniques such as adding an in-memory cache like memcached, and adding database indexes.

A/B Testing

This section is intentionally short due to my wanting to keep this article at a reasonable length. If you want to read more about experimentation and how to achieve it at a technical level, check out my other post on A/B Client Testing with Chatbots.

The advantage of having an easy deployment strategy for your services is that it makes A/B testing quite pleasant. Armed with small services and a good strategy for recording telemetry events, we can front our application with a gateway service that routes requests to experimental instances.

Remember the hard-coded thumbnail image we had? Well, an easy way to demonstrate A/B testing is to deploy another instance of philo-media-service which displays a different thumbnail for a particular video. We could then choose a client and have our gateway service inject a special header value into the request to signify that the request is experimental, and thus display a different thumbnail back to the user. A toy sketch of this routing idea is shown below.
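
To make the routing idea concrete, here is a toy gateway sketch in Node.js with Express, chosen purely for brevity; a real gateway such as Zuul handles far more. The header name, instance URLs, and experiment value are all made up for illustration, and the global fetch assumes Node 18 or newer.

const express = require('express');

const app = express();

// Hypothetical stable and experimental instances of philo-media-service.
const STABLE = 'https://philo-media-service.azurewebsites.net';
const EXPERIMENT = 'https://philo-media-service-exp.azurewebsites.net';

app.get('/content', async (req, res) => {
    // Route to the experimental instance only when the request has been
    // tagged as part of the experiment.
    const base = req.get('X-Philo-Experiment') === 'thumbnail-b'
        ? EXPERIMENT
        : STABLE;

    const upstream = await fetch(base + '/content');
    res.status(upstream.status).json(await upstream.json());
});

app.listen(8080);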

Having a good framework for sending telemetry helps here as data can be passed back and forth to gather events and metrics to determine how your experiment is running.

When the experiment is over, the service which held that experimental code can easily be killed, and all requests can be routed back to the stable service.

Check out Netflix Zuul for a good gateway service to study!

Cost Discussion

I can imagine what some of you might be asking after reading and going through all of this: "Dude, how much did this all cost? Your Microsoft Azure bill must have been insane!"

I can't lie and tell you it wasn't that bad. I'll tell you the truth. It was bad.

It was quite expensive to play around with all these services when all was said and done. During my first month, I was mostly trying to learn how the services worked by making API calls and testing out the encoding services locally on my machine. I also did not care too much about cost, as I just wanted to get the project implemented as quickly as possible during its early stages. Thus, when it came to making sure my instances were shut down and my streaming endpoints were off when they needed to be, I didn't care.

Unfortunately, after seeing the bill for the first month, it came out to be a little over $619.

Figure 19. Cost of Month 1. Most of the early stages of development happened here.

At this point, for the sake of science, I wondered: what would happen if I just left things the way they were and kept developing philo without being too concerned about cost? How much more would the bill be, and would it be practical for any sane person to build a streaming service on their own as a hobby project just for fun?

I let it be.

So for the second month, which has not ended yet, the estimate is going to be a little over $1,000.

YIKES!

Figure 20. Cost of Month 2. Most of the system had already been developed and testing had occurred.

The most expensive part of the bill, through cost analysis, has simply been running Azure Media Services for encoding and streaming the media itself. I was doing a lot of dev and test with these services, and honestly, I did not need to run them as often as I did. In total, I kicked off 82 encoding jobs. Each of these jobs used varying service levels of S1 and S3 encoding resources. It entirely depended on how fast I wanted my media encoded at the time while testing and proving a lot of concepts.

A fun thing I found out, which I really appreciate... although the S3 tier was costly, it was extremely fast at encoding a Full-HD 1080p video. A 23 minute video only took about 4 minutes to encode!

The service overall is AWESOME, and very intuitive to use... however, the practicality only exists if you are serious about starting some sort of video delivery service yourself... whether it be a live-streaming service, or on-demand. It really is nice to know that one can literally create the next YouTube or Netflix with just an ancient 2014 MacBook Pro and the power of the cloud. (Yes, I know I just said that.)

This project overall was very expensive to develop, but the amount of knowledge I gained from learning offsets any of the financial costs. All I gave up were a few video games, some nice meals, and timely haircuts, while wearing the same t-shirts and jeans once in a while. My significant other also freaked out about my constant spending on these services, but she was quite understanding in that it was all for education. ^_^

Would I do something like this again? Probably not for a while. It did release a lot of curiosity out of my system, and it was very relaxing and pleasurable to solve some of the problems I encountered while developing this application. I also became more "zen" with the cloud. All of it was worth it.

So, how would I make this cheaper to run? Surely, there must be a way to do just that, right? Well, yes!

A few things I thought about while figuring out what I should do to shave down the costs:

  • Allow the service instances to shut down after a period of being idle. In Azure, you can do this for Web Apps with the "Always On" toggle in the configuration settings.
    • I had intentionally left some of my instances "always-on" for experimental purposes.
  • Choose a lower offering for application instances, and configure autoscaling to scale horizontally.
  • Hit the datastores less and use in-memory cache to reduce the number of query operations.
  • The philo-media-encoder code will blindly encode all jobs. This means that if you submit duplicate videos to process, each video will be considered a unique job.
    • Save some money! Have the consumer code detect whether or not a re-encode is actually needed!
  • Detect whether or not your streaming endpoint is idle. Shut it off when it is, and turn it back on in an on-demand fashion!
  • Cache things as much as possible!

Conclusion

Phew, well that was quite the long post! I know there are a lot of unanswered questions regarding the functionality and the work needed to create a stable and resilient architecture. I could do everything and follow all the "best practices" in creating an application based on microservices; however, that would take much more time and resources, which quite frankly, most of us do not have. So, I encourage YOU, yes, YOU! If you have the time and are interested in building off of this project even further, here are a few things that would be nice to implement as practice, as a learning experience, or to get rid of something in the code that I have done wrong:

  1. Implement circuit-breaker technology along with fallback mechanisms for any remote API calls. This is pretty much any call to any other service -- whether it be philo, or Azure related.
  2. Implement best practices for security in each service. For example, no other instance should make contact to philo-media-service aside from the philo-media-encoder-consumer and philo-theater services themselves.
  3. Implement a gateway service, or a reverse proxy to handle policies, security, canary testing, etc. Have it be a single point of entry for most API calls.
  4. Encrypt all configuration data that is returned by philo-config-service and ingested through properties settings.
  5. Implement a better logging system. Configure your ELK stack, or any other log-indexing service to handle the logs for all these services.
  6. Actually use all the telemetry data that is sent. How can we use it to not only detect service health and readiness, but also log metrics related to business-performance?
  7. Implement a CI/CD pipeline for all this!
    • Disclaimer: I have never personally done this myself, but I have used Jenkins, and Travis CI is pretty good. If you prefer something from Microsoft, try Visual Studio Online. It has gotten pretty good over the months.
  8. Save some money! Make the consumer code smarter by detecting whether or not the video has already been processed once before! Exit early and don't kick off a job when there is already a streaming asset.