Big Data Integrator (BDI) Integrated Development Environment (IDE)

In the Big Data Europe framework, the Big Data Integrator is an application that can be thought of as a "starter kit" for building and running big data pipelines in your own process. It is the minimal standalone system that lets you create a project with multiple docker containers, upload it and run it through a nice GUI.

Architecture

You can think of the Big Data Integrator as a placeholder. It acts as a "skeleton" application where you can plug & play the different big data services from the Big Data Europe platform, or develop and add your own.

At its core it is a simple web application that renders each service's frontend inside it, so it is easy to navigate between the systems while keeping a sense of continuity in your workflow.

The basic application to start from consists of several components:

  • Stack Builder: this application allows users to create a personalized docker-compose.yml file describing the services to be used in the working environment. It is equipped with hinting & search features to ease the discovery and selection of components.
  • Swarm UI: once the docker-compose.yml has been created in the Stack Builder, it can be uploaded to a GitHub repository. From the Swarm UI users can clone that repository and launch the containers with Docker Swarm from a nice graphical user interface, where they can be started, stopped, restarted, scaled, and so on.
  • HTTP Logger: provides logging of all the HTTP traffic generated by the containers and pushes it into an Elasticsearch instance, to be visualized with Kibana. It is important to note that containers to be observed must always run with the logging=true label set (see the example right after this list).
  • Workflow Builder: helps define a specific set of steps that have to be executed in sequence, as a "workflow". This adds functionality similar to Docker healthchecks, but more fine-grained. To allow the Workflow Builder to enforce a workflow for a given stack (docker-compose.yml), the mu-init-daemon-service needs to be added as part of the stack.

The mu-init-daemon-service acts as the "referee" that enforces the steps defined in the Workflow Builder. For more information, check its repository.
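For example, a minimal sketch of marking a container so the HTTP Logger observes it (the image name is just a placeholder; in a docker-compose.yml stack the same label would go under the service's labels: key):

# Hypothetical example: only containers carrying the logging=true label
# are picked up by the HTTP Logger.
λ docker run -d --label logging=true bde2020/new-service-frontend:latest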

Systems are organized following a microservices architecture and run together using a docker-compose script, some of them sharing microservices common to all architectures, like the identifier, dispatcher, or resource. This is a more visual representation of the basic architecture:

[Figure: basic architecture of the Big Data Integrator]

Installation & Usage

  • Clone the repository.
  • For each of the subsystems used (Stack Builder, HTTP Logger, etc.), check its repository's README, since there may be small quirks to take into account before running that piece.
  • Run the edit-hosts.sh script. This assigns URLs to the different services in the integrator.
  • Run docker-compose up to start the services together (see the condensed sketch below).
  • Visit integrator-ui.big-data-europe.aksw.org to access the application's entry point.
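Condensed into a shell sketch (the repository URL and folder name are placeholders; adjust them to wherever the integrator lives):

λ git clone <big-data-integrator-repository-url> bdi
λ cd bdi
# Check each subsystem's README first, then map the service URLs to localhost
λ ./edit-hosts.sh
# Start all the services together
λ docker-compose up

After that, integrator-ui.big-data-europe.aksw.org should resolve to your machine and serve the entry point.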

How to add new services

  • Add the new service(s) to docker-compose.yml. It is important to set the VIRTUAL_HOST & VIRTUAL_PORT environment variables for the frontend application of those services, so they are accessible from the integrator, e.g.:
  new-service-frontend:
    image: bde2020/new-service-frontend:latest
    links:
      - csswrapper
      - identifier:backend
    expose:
      - "80"
    environment:
      VIRTUAL_HOST: "new-service.big-data-europe.aksw.org"
      VIRTUAL_PORT: "80"
  • Add an entry in /etc/hosts to point the URL to localhost (or wherever your service is running), e.g.:
127.0.0.1 workflow-builder.big-data-europe.aksw.org
127.0.0.1 swarm-ui.big-data-europe.aksw.org
127.0.0.1 kibana.big-data-europe.aksw.org
(..)
127.0.0.1 new-service.big-data-europe.aksw.org
  • Modify the file integrator-ui/user-interfaces to add a link to the new service in the integrator UI.
{
  "data": [
    ...etc .. ,
    {
      "id": 1,
      "type": "user-interfaces",
      "attributes": {
        "label": "My new Service",
        "base-url": "http://new-service.big-data-europe.aksw.org/",
        "append-path": ""
      }
    }
  ]
}
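After these three changes, the new service can be (re)created so the integrator picks it up, for example:

λ docker-compose up -d new-service-frontend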

Have fun with it!

Healthchecks for nginx in docker

Introduction

Recently I faced a problem where I had two nginx servers connected sequentially: the first one acted as a proxy (let's call it nginx-proxy) between the frontend and another nginx server (let's call it nginx-server). All services ran in docker containers orchestrated with a docker-compose script.

The nginx-proxy service would route (or dispatch, whatever you want to call it) requests to the nginx-server service, as well as to other microservices. Nginx does passive healthchecks on upstream servers by default. This has several uses, for example deciding whether it is worth proxying a request to a server that might be down, instead of serving a locally cached copy of the asset.

But I thought, what if I wanted to have explicit healthchecks for the nginx servers, either by using specific tools or commands in this kind of docker-compose architecture?

Docker-only techniques


Healthcheck Dockerfile

This is the easiest type of healthcheck: each container takes care of checking its own service's health status, but it can also be used to check both the local container's status and perform additional tests against upstream servers.

It is implemented using the HEALTHCHECK [options] CMD instruction when building the docker image, or when running the container using specific command line flags.

To check whether the nginx-proxy server is healthy by curl'ing it, you could specify in its Dockerfile:

FROM nginx:1.13

HEALTHCHECK --interval=5m --timeout=3s CMD curl --fail http://nginx.host.com/ || exit 1
EXPOSE 80

Or when running it using docker run:

λ docker run --name=nginx-proxy -d \
        --health-cmd='curl --fail http://nginx.host.com || exit 1' \
        --health-interval=5m \
        --health-timeout=3s \
        nginx:1.13

It is important to note that docker healthchecks are not only performed when the containers are starting, but are also executed periodically to check the container's status, as seen in this example:

# Check that the nginx config file exists
λ docker run --name=nginx-proxy -d \
        --health-cmd='stat /etc/nginx/nginx.conf || exit 1' \
        nginx:1.13

λ docker inspect --format='{{.State.Health.Status}}' nginx-proxy
healthy

λ docker exec nginx-proxy rm /etc/nginx/nginx.conf
λ sleep 5; docker inspect --format='{{.State.Health.Status}}' nginx-proxy
unhealthy

# But creating the nginx.conf file again will make the container be healthy again
λ docker exec nginx-proxy touch /etc/nginx/nginx.conf
λ sleep 5; docker inspect --format='{{.State.Health.Status}}' nginx-proxy
healthy

Healthcheck on docker-compose.yml

This is equivalent to running the docker container with the healthcheck-related command line flags.

healthcheck:
  test: ["CMD", "curl", "--fail", "http://nginx.host.com"]
  interval: 1m30s
  timeout: 10s
  retries: 3
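Once the stack is up, the same docker inspect check as before can be used to verify the result; the container name below is a placeholder that depends on your compose project:

λ docker-compose up -d
λ docker inspect --format='{{.State.Health.Status}}' <compose-project>_nginx-proxy_1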

Using depends_on in docker-compose.yml

This kind of healthcheck offers no guarantees. I would not even call it one, because what it does is express a dependency between two or more services running inside a docker-compose script. This means it will only force the startup order of the containers, without caring about the services running inside them, for example:

nginx-proxy:
  image: nginx:1.13
  depends_on:
    - resource
    - push-service
  links:
    - resource:resource
    - push-service:push-service

Application logic techniques

These kinds of checks involve adding application logic directly, via configuration files or scripting.

Upstream passive nginx server checks

Nginx monitors the connections to upstream servers for failures and retries a failed server after some time. In my example, the checks are done from the nginx-proxy service against the upstream nginx-server service.

nginx.conf:

# Upstream server for nginx-server
upstream nginx_server {
  server nginx-server max_fails=3 fail_timeout=30s;
}

fail_timeout indicates the time window within which max_fails failures have to happen for the server to be marked unavailable (and also how long it stays marked as such).

Upstream active nginx server checks

In order to actively check the health of upstream servers, nginx can send special requests to verify them. It is as easy as adding the health_check directive in the location context, as long as there is a proxy_pass directive that points to an upstream group (note that the health_check directive is only available in the commercial NGINX Plus):

upstream nginx_server {
  server nginx-server;
}

server {
    location / {
        proxy_pass http://nginx_server;
        health_check;
    }
}

Lua scripting on nginx

Nginx offers the possibility of adding extra logic to the nginx.conf file using the Lua scripting language.

This feature comes with the OpenResty bundle, but you can also simply extend the danday74/nginx-lua docker image. An example healthcheck scripted in Lua can be found in this post.

Have fun!

Git subtree introduction

[NOTE: I recovered this post from my old wordpress blog]

We often find ourselves developing projects that depend on other vendors' libraries or even on external software components we produced ourselves.

Git subtree provides a way to incorporate that external project into another one (normally bigger) by copying it inside the parent one and making it share the parent’s commit history from that moment on.

This is known as a system-based approach to development, where you architect your design by treating the different interconnected projects as a whole. That strategy involves tagging, merging and pushing the whole repository constantly. One commit history to rule them all.

How does it work? Imagine we have two projects: the big one (the-backend) and the small one (the-frontend). The former is the main project, constantly changing and under a heavy commit routine. The latter is the mobile web application that consumes the backend's API: stable, and only changed on every major release of the backend.

It is therefore interesting to manage both projects independently, with separate commit histories, while maintaining the cohesion of the project as a whole. One way to do it is, well, with git subtrees.

NOTE: this is an extremely simple use case of git subtree just to get the sense of it, any corrections are more than welcome.

the-backend/ (parent project)
   file1
   file2
 
the-frontend/ (child project)
   item1
   item2

Both the-backend/ and the-frontend/ are independent projects. They can live on the same server or on remote servers; only the references will change.

λ mkdir the-backend/
λ touch the-backend/file1 the-backend/file2
 
λ mkdir the-frontend/
λ touch the-frontend/item1 the-frontend/item2

We want to add the-frontend/ as a dependency of the the-backend/ project while letting it stand as an independent repository.

λ cd the-backend/
λ git init
λ git add .
λ git commit -a -m "Two first files added to the-backend parent project "
 
λ cd ../the-frontend/
λ git init
λ git add .
λ git commit -a -m "Two first files added to the-frontend child project"

Now, with both repositories initialized, we add the the-frontend/ repository as a remote of the the-backend/ project.

λ (In the-backend/ folder)
λ git remote add frontend-subtree ../the-frontend/
λ git subtree add --prefix=frontend frontend-subtree master

This creates a subtree of the the-frontend/ project inside the-backend/ under the specified prefix (the --prefix option is mandatory).

Now if we run git log inside the-backend/ we see a commit message like: "Add 'the-frontend/' from commit '<commit-hash>'".
Say that now we want to make some changes in the-frontend/ project, outside of the-backend/.

λ (In the-frontend/ folder, not the-backend/frontend/)
λ touch item3
λ git add item3 && git commit item3 -m "Item3 added inside the-frontend/ outside project."
λ git log
* c557ce6 - (HEAD, master) Item3 added inside the-frontend/ outside project. (4 seconds ago) <Esteban>
* 45074d3 - Two first files added to the-frontend child project (4 days ago) <Esteban>

However, these changes are only visible in the-frontend/ project, while in the-backend/frontend/...

λ ls the-backend/frontend
item1 item2
 
λ git log
* e009438 - (HEAD, master) Add 'the-frontend/' from commit '45074d397c99079acd20cb24e9d8b8830afcf802' (4 days ago) <Esteban>
|\
| * 45074d3 - (frontend-subtree/master) Two first files added to the-frontend child project (4 days ago) <Esteban>
* 3ea1a67 - Two first files added to the-backend parent project (4 days ago) <Esteban>

If we want those changes visible in the-backend/frontend/ subtree folder, we have to do a git subtree pull:

(In the-backend/ root folder, above frontend/; otherwise we get a message like 'You need to run this command from the toplevel of the working tree.')

λ git subtree pull --prefix=frontend/ frontend-subtree master
λ git log
 
* 7dd7677 - (HEAD, master) Merge commit from the parent the-backend/ pulling changes from the the-frontend/ subtree (28 seconds ago) <Esteban>
|\
| * c557ce6 - (frontend-subtree/master) Item3 added inside the-frontend/ outside project. (6 minutes ago) <Esteban>
* | e009438 - Add 'the-frontend/' from commit '45074d397c99079acd20cb24e9d8b8830afcf802' (4 days ago) <Esteban>
|\ \
| |/
| * 45074d3 - Two first files added to the-frontend/ child project (4 days ago) <Esteban>
* 3ea1a67 - Two first files added to the-backend/ parent project (4 days ago) <Esteban>

Note also how the local commit from the-frontend/ repository ended up in the-backend/ parent history. This can be avoided by adding the --squash option to the git subtree pull command, which compresses all of the subtree's commits into one single commit for the pull.
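For example, the same pull squashed into a single commit:

λ git subtree pull --prefix=frontend/ frontend-subtree master --squash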

Nonetheless, we might as well do it the other way around. If someone working from the parent the-backend/ folder makes a change to the added frontend/ subtree, and we want those changes reflected in the external the-frontend/ repository:

(In the-backend/frontend/)

λ touch item4_from_parent
λ git add item4_from_parent && git commit -m "Item4 added from the parent to frontend/ subtree folder"
λ git log
* 4a50c88 - (HEAD, master) Item4 added from the parent to frontend/ subtree folder (9 seconds ago)
( . . . )
λ git subtree push --prefix=frontend/ frontend-subtree new-branch-from-master

This will create a new branch in the-frontend/ project with those changes. It is rather cumbersome to create a new branch in the external frontend project every time a change is made in the subtree from the parent, but by default git refuses a push that overwrites the checked-out master branch, to avoid leaving the working tree in an inconsistent state.
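To actually get those changes into the-frontend/'s master, a quick sketch would be to merge the pushed branch from within the-frontend/ itself:

λ cd ../the-frontend/
λ git merge new-branch-from-master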

One alternative to git subtree is git submodules, but that is a topic for another article.

Links:

https://stackoverflow.com/questions/31769820/differences-between-git-submodule-and-subtree
https://stackoverflow.com/questions/769786/vendor-branches-in-git/769941#769941
https://git-scm.com/book/en/v1/Git-Tools-Subtree-Merging
https://developer.atlassian.com/blog/2015/05/the-power-of-git-subtree/

Routing in Javascript

Introduction

Originally, web applications consisted of interconnected HTML documents that one could navigate through links. Every time a user clicked a link on a website, a new document would be generated on the server and sent back to the browser to be rendered on screen.

Around the year 2005 the term Single-Page Application (SPA) became popular. The term encompassed a new way of architecting websites to make them behave more like desktop applications: snappy, with graphical animations and smooth transitions between views.
This was achieved by taking advantage of javascript, html & css, as new APIs became available to give the browser more native-like capabilities.

SPAs are based on a single-document model: the web application's whole lifespan happens on a single HTML page, along with the transitions between the different views. But since links no longer imply fetching and generating a new document, how are those transitions modelled? With a router.

What is a Javascript Router?

A Javascript router is a key component in most frontend frameworks. It is the piece of software in charge of organizing the states of the application and switching between the different views. For example, the router will render the login screen initially, and when the login is successful it will perform the transition to the user's welcome screen.

How it works

The router will be in charge of simulating transitions between documents by watching changes on the URL. When the document is reloaded or the URL is modified somehow, it will detect that change and render the view that is associated with the new URL.

I wrote a small router in javascript to illustrate the idea. At the beginning we need two objects, one to store the routes and another to store the templates, along with two simple functions to register them.

Templates are just one way of describing the DOM that will be generated when the transition from one route to the other is completed. The whole javascript application will live in a div element.

// Application div
const appDiv = "app";

// Both set of different routes and template generation functions
let routes = {};
let templates = {};

// Register a template (this is to mimic a template engine)
let template = (name, templateFunction) => {
  return templates[name] = templateFunction;
};

// Define the routes. Each route is described with a route path & a template to render
// when entering that path. A template can be a string (file name), or a function that
// will directly create the DOM objects.
let route = (path, template) => {
    if (typeof template == "function") {
      return routes[path] = template;
    }
    else if (typeof template == "string") {
      return routes[path] = templates[template];
    }
    else {
      return;
    }
};

Now we will be able to register templates and routes, creating the mapping between them:

// Register the templates.
template('template1', () => {
    let myDiv = document.getElementById(appDiv);
    myDiv.innerHTML = "";
    const link1 = createLink('view1', 'Go to view1', '#/view1');
    const link2 = createLink('view2', 'Go to view2', '#/view2');

    myDiv.appendChild(link1);
    return myDiv.appendChild(link2);
});

template('template-view1', () => {
    let myDiv = document.getElementById(appDiv);
    myDiv.innerHTML = "";
    const link1 = createDiv('view1', "<div><h1>This is View 1 </h1><a href='#/'>Go Back to Index</a></div>");
    return myDiv.appendChild(link1);
});

template('template-view2', () => {
    let myDiv = document.getElementById(appDiv);
    myDiv.innerHTML = "";
    const link2 = createDiv('view2', "<div><h1>This is View 2 </h1><a href='#/'>Go Back to Index</a></div>");
    return myDiv.appendChild(link2);
});


// Define the mappings route->template.
route('/', 'template1');
route('/view1', 'template-view1');
route('/view2', 'template-view2');

For the templates, we match a template name with a function that generates DOM elements and appends them to the div where the application lives. In a real router this functionality would be handled by the templating engine. For the routes, we just map a route path to the corresponding template.

The createLink & createDiv are auxiliary functions to generate DOM:

// Generate DOM tree from a string
let createDiv = (id, xmlString) => {
    let d = document.createElement('div');
    d.id = id;
    d.innerHTML = xmlString;
    return d.firstChild;
};

// Helper function to create a link.
let createLink = (title, text, href) => {
    let a = document.createElement('a');
    let linkText = document.createTextNode(text);
    a.appendChild(linkText);
    a.title = title;
    a.href = href;
    return a;
};

What is left is the logic to detect changes in the URL and resolve them to render the right template. To do so, we listen for the load & hashchange events. The former fires when a document is loaded, and the latter when the URL hash changes.

// Give the corresponding route's template or fail
let resolveRoute = (route) => {
    const template = routes[route];
    // Indexing an object never throws, so check explicitly for unknown routes
    if (!template) {
        throw new Error("The route is not defined");
    }
    return template;
};

// The actual router, get the current URL and generate the corresponding template
let router = (evt) => {
    const url = window.location.hash.slice(1) || "/";
    const routeResolved = resolveRoute(url);
    routeResolved();
};

// For first load or when routes are changed in browser url box.
window.addEventListener('load', router);
window.addEventListener('hashchange', router);

That's it! Of course a lot of functionality is missing: controllers to transform data before passing it to the views, nested routes, the History API, etc., but the idea of javascript routing is quite easy to grasp. The complete code can be found in this gist.

Have fun!

Ember Websockets & nginx integration

In a previous article I explained our approach at work for deploying an Ember.js application on an nginx server running in docker. Today I had to make an instance of that application communicate with another microservice using WebSockets.

A simplified view of the architecture: the Ember frontend (served by nginx) talks to the dispatcher, which proxies requests to the push-service and the other microservices.

All these services run using a docker-compose script like this one (a very simplified version):

docker-compose.yml:

  frontend:
    image: bde2020/ember-swarm-ui-frontend:0.6.0
    ports:
      - "88:80"
    links:
      - dispatcher:backend
    volumes:
      - ./config/frontend:/etc/nginx/conf.d

  dispatcher:
    image: semtech/mu-dispatcher:1.0.1
    links:
      - push-service:push-service
    volumes:
      - ./config/dispatcher:/config

  db:
    image: tenforce/virtuoso:1.2.0-virtuoso7.2.2

  push-service:
    image: tenforce/mu-push-service
    environment:
      - MU_SPARQL_ENDPOINT=http://database:8890/sparql
    links:
      - db:database
    ports:
      - "83:80"

The Dispatcher will proxy calls to the other microservices based on the request path. This is very useful because the frontend does not need to know anything about the other microservices' host names. More information here.

The push-service listens for GET requests on the root path (/), opens a websocket and starts sending some JSON. The code in the frontend to test it is taken straight from ember-websockets:

export default Ember.Component.extend({
  websockets: Ember.inject.service(),

  didInsertElement() {
    this._super(...arguments);
    const socket = this.get('websockets').socketFor('ws://localhost/push-service/');
    ...
  }
});

But this is not enough: nginx needs additional configuration to open and maintain a connection using WebSockets. Luckily, nginx has supported them since version 1.3, and the support can be activated by specifying the set of headers that start the handshake for the WebSocket protocol.

# Set the server to proxy requests to when used in configuration
upstream backend_app {
    server backend;
}

# Server specifies the domain, and location the relative url
server {
    ...

    # WebSockets support
    location /push-service {
      proxy_pass http://backend_app;
      proxy_http_version 1.1;
      proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
      proxy_set_header Host $http_host;
      proxy_set_header Upgrade $http_upgrade;
      proxy_set_header Connection "upgrade";
    }
  }

The call from the frontend goes to http://localhost/push-service, but the push-service only understands calls to /, so why specify the location /push-service in nginx's configuration file? This works thanks to the Dispatcher, which detects the call to http://localhost/push-service and rewrites it to http://push-service/, the push-service container being reachable through the link declared in the docker-compose.yml file.
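As a quick sanity check (a sketch: port 88 is the frontend's published port in the docker-compose.yml above, and the Sec-WebSocket-Key is just the sample value from the WebSocket RFC), the handshake can be exercised from the host with curl; a successful upgrade answers with HTTP/1.1 101 Switching Protocols:

λ curl -i http://localhost:88/push-service/ \
       -H "Connection: Upgrade" \
       -H "Upgrade: websocket" \
       -H "Sec-WebSocket-Version: 13" \
       -H "Sec-WebSocket-Key: dGhlIHNhbXBsZSBub25jZQ=="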

Have fun!
