Big Data Integrator (BDI) Integrated Development Environment (IDE)

In the Big Data Europe framework, the Big Data Integrator is an application that can be thought of as a "starter kit" for bringing big data pipelines into your process. It is a minimal standalone system that lets you create a project with multiple Docker containers, upload it and run it through a nice GUI.

Architecture

You can think of the Big Data Integrator as a placeholder. It acts as a "skeleton" application where you can plug & play different big data services from the Big Data Europe platform, as well as add and develop your own.

At its core it is a simple web application that renders each service's frontend inside it, making it easy to navigate between systems and providing a sense of continuity in your workflow.

The basic application to start from consists of several components:

  • Stack Builder: this application allows users to create a personalized docker-compose.yml file describing the services to be used in the working environment. It is equipped with hinting & search features to ease discovery and selection of components.
  • Swarm UI: after the docker-compose.yml has been created in the Stack Builder, it can be uploaded to a GitHub repository. From the Swarm UI, users can clone the repository and launch the containers with Docker Swarm through a nice graphical user interface, where they can start, stop, restart, scale them, etc.
  • HTTP Logger: provides logging of all the HTTP traffic generated by the containers and pushes it into an Elasticsearch instance, to be visualized with Kibana. It is important to note that containers to be observed must always run with the logging=true label set (see the sketch below).
  • Workflow Builder: helps define a specific set of steps that have to be executed in sequence, as a "workflow". This adds functionality similar to Docker healthchecks, but more fine-grained. To allow the Workflow Builder to enforce a workflow for a given stack (docker-compose.yml), the mu-init-daemon-service needs to be added as part of the stack.

That service acts as the "referee" that enforces the steps defined in the Workflow Builder. For more information, check its repository.
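For illustration, the relevant fragment of a stack's docker-compose.yml might look like this (a minimal sketch; the image names are assumptions, so check each component's repository for the exact ones):

my-service:
  image: bde2020/my-service:latest # hypothetical service image
  labels:
    - "logging=true" # required so the HTTP Logger observes this container

init-daemon:
  image: bde2020/mu-init-daemon-service # assumed image name for the workflow "referee"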

Systems are organized following a microservices architecture and run together using a docker-compose script, some of them sharing microservices common to all architectures, like the identifier, dispatcher, or resource. This is a more visual representation of the basic architecture:

(Figure: basic BDI architecture diagram)

Installation & Usage

  • Clone the repository
  • For each of the subsystems used (Stack Builder, HTTP Logger, etc.), check the repository's README, as there may be some small quirks to take into account before running each piece.
  • Run the edit-hosts.sh script. This assigns URLs to the different services in the integrator.
  • Run docker-compose up to start the services together.
  • Visit integrator-ui.big-data-europe.aksw.org to access the application's entry point.
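Putting those steps together, a typical session could look like this (the clone URL is a placeholder for the actual integrator repository):

λ git clone <integrator-repository-url>
λ cd <repository-folder>
λ ./edit-hosts.sh
λ docker-compose up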

How to add new services

  • Add the new service(s) to docker-compose.yml. It is important to set the VIRTUAL_HOST & VIRTUAL_PORT environment variables for the frontend application of those services so they are accessible from the integrator, e.g.:
  new-service-frontend:
    image: bde2020/new-service-frontend:latest
    links:
      - csswrapper
      - identifier:backend
    expose:
      - "80"
    environment:
      VIRTUAL_HOST: "new-service.big-data-europe.aksw.org"
      VIRTUAL_PORT: "80"
  • Add an entry in /etc/hosts to point the URL to localhost (or wherever your service is running), e.g.:
127.0.0.1 workflow-builder.big-data-europe.aksw.org
127.0.0.1 swarm-ui.big-data-europe.aksw.org
127.0.0.1 kibana.big-data-europe.aksw.org
(..)
127.0.0.1 new-service.big-data-europe.aksw.org
  • Modify the file integrator-ui/user-interfaces to add a link to the new service in the integrator UI.
{
  "data": [
    ...etc .. ,
    {
      "id": 1,
      "type": "user-interfaces",
      "attributes": {
        "label": "My new Service",
        "base-url": "http://new-service.big-data-europe.aksw.org/",
        "append-path": ""
      }
    }
  ]
}

Have fun with it!

Healthchecks for nginx in docker

Introduction

Recently I faced a problem where I had two nginx servers connected sequentially: the first one acted as a proxy (let's call it nginx-proxy) between the frontend and another nginx server (let's call it nginx-server). All services were running in Docker containers using a docker-compose script.

The nginx-proxy service would route (or dispatch, however you want to call it) requests to the nginx-server service, as well as to other microservices. Nginx performs passive healthchecks on upstream servers by default. This has several uses, for example deciding whether it's worth proxying a request to a server that might be down instead of serving a local copy of a cached asset.

But I thought, what if I wanted to have explicit healthchecks for the nginx servers, either by using specific tools or commands in this kind of docker-compose architecture?

Docker-only techniques


Healthcheck Dockerfile

This is the easiest type of healthcheck: each container takes care of checking its own service's health status, but it can also be bent into both checking the local container's status and performing additional tests against upstream servers.

It is implemented using the HEALTHCHECK [options] CMD instruction when building the docker image, or with specific command line flags when running the container.

In order to check if the nginx-proxy server is healthy by curl'ing the server, you could specify in its Dockerfile:

FROM nginx:1.13

# The stock nginx image does not ship curl, so install it first
RUN apt-get update && apt-get install -y curl

HEALTHCHECK --interval=5m --timeout=3s CMD curl --fail http://nginx.host.com/ || exit 1
EXPOSE 80

Or when running it using docker run:

λ docker run --name=nginx-proxy -d \
        --health-cmd='curl --fail http://nginx.host.com || exit 1' \
        --health-interval=5m \
        --health-timeout=3s \
        nginx:1.13

It is important to note that Docker healthchecks are not only performed when the containers start; they are executed periodically to check the container's status, as seen in this example:

# Check that the nginx config file exists
λ docker run --name=nginx-proxy -d \
        --health-cmd='stat /etc/nginx/nginx.conf || exit 1' \
        nginx:1.13

λ docker inspect --format='{{.State.Health.Status}}' nginx-proxy
healthy

λ docker exec nginx-proxy rm /etc/nginx/nginx.conf
λ sleep 5; docker inspect --format='{{.State.Health.Status}}' nginx-proxy
unhealthy

# But creating the nginx.conf file again will make the container healthy again
λ docker exec nginx-proxy touch /etc/nginx/nginx.conf
λ sleep 5; docker inspect --format='{{.State.Health.Status}}' nginx-proxy
healthy
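
Docker also keeps a log of the most recent health probes, including each command's output, which is handy when figuring out why a container went unhealthy:

λ docker inspect --format='{{json .State.Health.Log}}' nginx-proxy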

Healthcheck on docker-compose.yml

This is completely equivalent to running the docker container with the healthcheck-related command line flags.

healthcheck:
  test: ["CMD", "curl", "--fail", "http://nginx.host.com"]
  interval: 1m30s
  timeout: 10s
  retries: 3

Using depends_on in docker-compose.yml

This kind of healthcheck offers no guarantees. I would not even call it one, because all it does is express a dependency between two or more services running inside a docker-compose script. This means it will only force the startup order of the containers, without caring about the state of the services executing inside them, for example:

nginx-proxy:
  image: nginx:1.13
  depends_on:
    - resource
    - push-service
  links:
    - resource:resource
    - push-service:push-service
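
That said, if the dependencies do define healthchecks, version 2.1 of the docker-compose file format lets depends_on wait until a service is actually healthy, not just started. A minimal sketch combining both features (service names taken from the examples above):

version: "2.1"
services:
  nginx-proxy:
    image: nginx:1.13
    depends_on:
      nginx-server:
        condition: service_healthy
  nginx-server:
    image: nginx:1.13
    healthcheck:
      # Assumes curl is available inside the image
      test: ["CMD", "curl", "--fail", "http://localhost/"]
      interval: 30s
      timeout: 3s
      retries: 3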

Application logic techniques

These kinds of checks involve adding application logic directly, via configuration files or scripting.

Upstream passive nginx server checks

Nginx monitors your connections for upstream server failures and tries to resume sending traffic to failed servers after some time. In my example, the checks are done from the nginx-proxy service against the upstream nginx-server service.

nginx.conf:

# Upstream server for nginx-server
upstream nginx_server {
  server nginx-server max_fails=3 fail_timeout=30s;
}

fail_timeout indicates both the window of time in which max_fails failures have to happen to mark the server unavailable, and how long the server is considered unavailable afterwards.
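
Note that these checks are passive: nginx only counts failures on real traffic, so there has to be something proxied through. A minimal configuration exercising the upstream could look like this:

upstream nginx_server {
  server nginx-server max_fails=3 fail_timeout=30s;
}

server {
  listen 80;

  location / {
    # Every failed proxied request counts towards max_fails
    proxy_pass http://nginx_server;
  }
}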

Upstream server active nginx server checks

In order to actively check the health of upstream servers, nginx can send special requests to verify them. It is as easy as adding the health_check directive in the location context, as long as there is a proxy_pass directive that points at an upstream group. Note that the health_check directive is only available in the commercial NGINX Plus release:

upstream nginx_server {
  server nginx-server;
}

server {
    location / {
        proxy_pass http://nginx_server;
        health_check;
    }
}

Lua scripting on nginx

Nginx can be extended with additional logic in the nginx.conf file using the Lua scripting language.

This feature comes with the OpenResty package, but you can also simply extend the danday74/nginx-lua docker image. An example healthcheck scripted in Lua can be found in this post.
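
To give an idea of what this looks like, a trivial health endpoint scripted in Lua could be something like the following sketch (assuming an OpenResty-compatible nginx build):

location /health {
  content_by_lua_block {
    -- Any custom logic can go here: pinging an upstream,
    -- checking a file on disk, querying a database, etc.
    ngx.status = 200
    ngx.say("OK")
  }
}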

Have fun!

Ember Websockets & nginx integration

In a previous article I explained our approach at work for deploying an Ember.js application on an nginx server running in Docker. Today I had to make an instance of that application communicate with another microservice using WebSockets.

A simplified diagram of the architecture would be:

(Figure: frontend → dispatcher → push-service → db)

All these services run using a docker-compose script like this one (a very simplified version):

docker-compose.yml:

  frontend:
    image: bde2020/ember-swarm-ui-frontend:0.6.0
    ports:
      - "88:80"
    links:
      - dispatcher:backend
    volumes:
      - ./config/frontend:/etc/nginx/conf.d

  dispatcher:
    image: semtech/mu-dispatcher:1.0.1
    links:
      - push-service:push-service
    volumes:
      - ./config/dispatcher:/config

  db:
    image: tenforce/virtuoso:1.2.0-virtuoso7.2.2

  push-service:
    image: tenforce/mu-push-service
    environment:
      - MU_SPARQL_ENDPOINT=http://database:8890/sparql
    links:
      - db:database
    ports:
      - "83:80"

The Dispatcher will proxy calls to other microservices based on the request path. This is very useful to avoid the frontend having to know anything about the other microservices' host names. More information can be found in the mu-dispatcher repository.

The push-service will listen for GET requests on the root path (/), open a websocket and start sending some JSON. The frontend code to test it is taken straight from ember-websockets:

export default Ember.Component.extend({
  websockets: Ember.inject.service(),

  didInsertElement() {
    this._super(...arguments);
    const socket = this.get('websockets').socketFor('ws://localhost/push-service/');
    // ...
  }
});

But this is not enough: nginx needs additional configuration to open and maintain a WebSocket connection. Luckily, nginx has supported WebSockets since version 1.3, and support can be activated by setting the headers that start the handshake for the WebSocket protocol.

# Set the server to proxy requests to when used in configuration
upstream backend_app {
    server backend;
}

# Server specifies the domain, and location the relative url
server {
    ...

    # WebSockets support
    location /push-service {
      proxy_pass http://backend_app;
      proxy_http_version 1.1;
      proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
      proxy_set_header Host $http_host;
      proxy_set_header Upgrade $http_upgrade;
      proxy_set_header Connection "upgrade";
    }
}

The call from the frontend goes to http://localhost/push-service, but the Push-Service only understands calls to /, so why specify the location /push-service in nginx's configuration file? This works thanks to the Dispatcher, which detects the call to http://localhost/push-service and rewrites it to http://push-service/, using the link defined in the docker-compose.yml file.
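
For illustration, the dispatcher rule could look roughly like this (a sketch based on mu-dispatcher's matching DSL; check its repository for the exact syntax):

# Rewrite http://localhost/push-service/* to http://push-service/*
match "/push-service/*path" do |path, req|
  forward req, path, "http://push-service/"
end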

Have fun!

Ember & nginx docker deployment with multi-stage builds

Introduction

At work we use Docker as the virtualization technology of choice to perform our projects' deployments. We follow a microservices architecture that allows us to do rapid development & testing, quickly trying new ideas and iterating on new functionality with a modular approach, choosing the language/framework that best adapts to our needs for each particular use case.

The Problem

Our frontend stack consists of Ember.js happily running on an nginx server inside a Docker container. The initial build & deployment process we had was effective but a little cumbersome. I will use TenForce's webcat repository as an example.

Initially, the Ember application is built via the command line (ember build --prod), generating a dist.zip file. The file is then uploaded to the repository's releases with a new tag assigned.

Afterwards, when building the nginx Docker image, the Dockerfile detects the current version of the frontend by reading it from the package.json file, fetches the corresponding release from GitHub, and unpacks the zip file contents into the nginx serving directory.

The Dockerfile is self-explanatory:

FROM semtech/mu-nginx-spa-proxy

MAINTAINER Aad Versteden <madnificent@gmail.com>

RUN apt-get update; apt-get upgrade -y; apt-get install -y unzip wget;
COPY package.json /package.json
RUN mkdir /app; cd /app; wget https://github.com/tenforce/webcat/releases/download/v$(cat /package.json | grep version | head -n 1 | awk -F: '{ print $2 }' | sed 's/[ ",]//g')/dist.zip
RUN cd /app; unzip dist.zip; mv dist/* .
RUN rm /app/dist.zip package.json

Now this has two problems:

  • We have to manually build the Ember application and upload it to the GitHub releases URL.
  • Builds are not deterministic, since each person has their own node, npm, bower & ember-cli combination. This has already accounted for some time lost investigating why seemingly identical builds would sometimes fail and sometimes succeed.

The Solution

The solution came by using a combination of two new approaches:

  1. Using a docker image with node, npm, bower & ember-cli installed, thereby guaranteeing that every build uses the same tool versions.
  2. Using Docker's multi-stage builds. Simply put, they allow using the output of one image as the input of the next, avoiding fat images and simplifying the building process.

The first part is achieved by using the docker-ember image, ensuring fixed versions for the build tools:

FROM ubuntu:16.04
MAINTAINER Aad Versteden <madnificent@gmail.com>

# Install nodejs as per http://askubuntu.com/questions/672994/how-to-install-nodejs-4-on-ubuntu-15-04-64-bit-edition
RUN apt-get -y update; apt-get -y install wget python build-essential git libfontconfig
RUN wget -qO- https://deb.nodesource.com/setup_7.x > node_setup.sh
RUN bash node_setup.sh
RUN apt-get -y install nodejs
RUN npm install -g bower@1.7.9
RUN echo '{ "allow_root": true }' > /root/.bowerrc
RUN npm install -g ember-cli@2.14.0

WORKDIR /app

The second part is achieved by using a multi-stage build, building the Ember app in a first stage and copying the resulting dist output folder into nginx's serving directory in the second:

FROM madnificent/ember:2.14.0 as ember
MAINTAINER Esteban Sastre <esteban.sastre@tenforce.com>

COPY . /app
RUN npm install && bower install
RUN ember build

FROM semtech/mu-nginx-spa-proxy
COPY --from=ember /app/dist /app

This way, the whole build process is reduced to a simple docker build .
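
For example (the image tag is just an illustration):

λ docker build -t webcat-frontend .
λ docker run -p 80:80 webcat-frontend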

Have fun!

Passing arguments to Dockerfiles

Introduction

When using Docker as our virtualization software of choice to deploy our applications, we sometimes want to build an image that depends on a variable parameter, for example when building images from a script with a deployment folder that changes from one build to another.

Using the ENV keyword

The easiest way is to specify an environment variable inside the Dockerfile with the ENV keyword and then reference it from within the file. For instance, when you just need to update the version of a package and do some operations depending on that version:

FROM image
ENV PKG_VERSION 1.0.0
RUN curl http://my.cdn.com/package-$PKG_VERSION.zip

This PKG_VERSION was used as a build-time variable, but it is important to know that containers will also be able to access it at runtime, which may lead to problems.
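
You can verify this by printing the variable from within a container (assuming the base image ships a printenv binary):

λ docker build -t my-image-name .
λ docker run --rm my-image-name printenv PKG_VERSION
1.0.0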

Using the ARG keyword

The ARG keyword defines a variable that users can set at build time with the --build-arg <variable>=<value> option when constructing the image, and then reference inside the Dockerfile. In the previous example, the same result could be achieved by executing:

λ docker build -t my-image-name --build-arg PKG_VERSION=1.0.0 $PWD

Dockerfile:

FROM image
ARG PKG_VERSION
RUN curl http://my.cdn.com/package-$PKG_VERSION.zip

In this case, the PKG_VERSION variable only lives during the build process and is unreachable from within the containers. Additionally, it is also possible to give ARG a default value, used when no --build-arg is specified:

Dockerfile:

FROM image
ARG PKG_VERSION=1.0.0
RUN curl http://my.cdn.com/package-$PKG_VERSION.zip

Of course, you could also define an environment variable that would depend on a value passed as an argument:

λ docker build -t my-image-name --build-arg PKG_VERSION=1.0.0 $PWD

Dockerfile:

FROM image
ARG VERSION_ARG=1.0.0
ENV PKG_VERSION=$VERSION_ARG
RUN curl http://my.cdn.com/package-$PKG_VERSION.zip

Now the passed argument VERSION_ARG will be available as the PKG_VERSION environment variable from within the container.

Moreover, if you prefer to declare the container's environment variables at runtime, that can easily be done when running it:

λ docker run -e ENV=development -e TIMEOUT=300 -e EXPORT_PATH=/exports ruby

Have fun!