Automatically loading JSON files into ElasticSearch

Introduction

Right now at work I am working on the Big Data Europe project, a joint effort of several organizations and enterprises across Europe to create a platform that provides a set of software services for building big data pipelines with minimal effort compared to other stacks, and in an extremely cost-effective way.

The idea is to give any company or organization that wants to make sense of its data an easy way to start playing around with the Big Data technologies that make that possible.

One of the pieces I developed is mu-bde-logging, a standalone system that logs HTTP traffic from running docker containers and posts it into an EL(K) stack in real time for further visualization.

The last requirement was to add the possibility of replaying old traffic backups and posting them into the ElasticSearch instance, to visualize them offline in Kibana.

So I wrote a small script to scan for the transformed .har files (JSON format) in a given folder and replay them into the ElasticSearch container.
Since the ElasticSearch & Kibana containers are part of a docker-compose.yml project, I didn't care much about being generic and used the names that docker-compose gives the containers, but it is easy to change & extend.

The Code

This is what I came up with:

#!/usr/bin/env bash

#/ Usage: ./backup_replay.sh [-f|--folder <backups_folder>]
#/ Description: Run ElasticSearch and Kibana standalone and post every enriched .har file in the backups folder to ElasticSearch.
#/ Examples: ./backup_replay.sh -f backups/
#/ Options:
#/     --help: Display this help message
usage() { grep '^#/' "$0" | cut -c4- ; exit 0 ; }
expr "$*" : ".*--help" > /dev/null && usage

BACKUP_DIR="../backups/"

# Convenience logging function.
info()    { echo "[INFO]    $*" ; }

cleanup() {
  true;
}

# Poll the ElasticSearch container
poll() {
  local elasticsearch_ip="$1"
  local result=$(curl -XGET http://${elasticsearch_ip}:9200 -I 2>/dev/null | head -n 1 | awk '{ print $2 }')

  if [[ $result == "200" ]]; then
    return 1 # ElasticSearch is up: a non-zero return stops the polling loop below.
  else
    return 0 # Not up yet: a zero return keeps the polling loop running.
  fi
}

# Parse Parameters
while [ "$#" -gt 1 ];
  do
  key="$1"

  case $key in
      -f|--folder)
      BACKUP_DIR="$2" # EXAMPLE
      shift
      ;;
      --default)
      default=YES
      ;;
    *)
    ;;
  esac
  shift
done


if [[ "${BASH_SOURCE[0]}" = "$0" ]]; then
    trap cleanup EXIT

    # Start ElasticSearch & Kibana with docker-compose.
    if which docker-compose >/dev/null; then
      docker-compose up -d elasticsearch kibana
    else
      info "Install docker-compose!"
      exit 1
    fi

    # Poll ElasticSearch until it is up and we can post hars to it.
    elasticsearch_ip=$(docker inspect --format='{{range .NetworkSettings.Networks}}{{.IPAddress}}{{end}}' mubdelogging_elasticsearch_1)

    info "ElasticSearch container ip: ${elasticsearch_ip}"

    while poll ${elasticsearch_ip}
    do
      info "ElasticSearch is not up yet."
      sleep 2
    done

    # Find all .trans.har files in the specified backups folder.
    # For each one, POST it to ElasticSearch.
    info "Ready to work!"
    info "POSTing all enriched hars."
    find ${BACKUP_DIR} -name "*.trans.har" | sed 's/^/@/g' | xargs -i /bin/bash -c "sleep 0.5; curl -XPOST 'http://$elasticsearch_ip:9200/hars/har?pretty' --data-binary {}"
fi
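
In case the find | sed | xargs one-liner looks cryptic: per file it boils down to a single curl call that sends the file content as the request body. Here is a more explicit sketch of an equivalent loop, assuming BACKUP_DIR and elasticsearch_ip are set as in the script above and the same hars/har index is used:

# Equivalent, more explicit version of the final pipeline (sketch).
find "${BACKUP_DIR}" -name "*.trans.har" -print0 | while IFS= read -r -d '' har; do
  sleep 0.5  # Same small delay between requests as in the xargs version.
  # The leading @ tells curl to read the request body from the given file.
  curl -XPOST "http://${elasticsearch_ip}:9200/hars/har?pretty" --data-binary "@${har}"
done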

Have fun!

Sed tricked me!

Introduction

Today I had some free time at work, since I am between projects and waiting for some additional information, and I took advantage of it to help a coworker who was new to Ember.js. For some reason all the calls to the backend (a Virtuoso database) were failing.

Taking a look together, we discovered that the middleware that translates JSON-API calls into SPARQL queries and vice versa (the piece that talks directly to the frontend) was consistently returning a 500 error. This happened because the triples that had been loaded into the database with a script were generated wrong: every model ended up with the same id.

Let's say that you have a file with this structure:

<url1:concept1> <predicate1> <foo> ;
	<predicate2> <bar> ;
	<predicate3> <baz> .

<url1:concept2> <predicate1> <foo> ;
	<predicate2> <bar> ;
	<predicate3> <baz> .

And you want to detect each "foo" occurrence and add a unique identifier, generated for example with the uuidgen utility. At the beginning this was the code:

blog λ cat example.txt | sed "s/foo/$(uuidgen)/g"
<url1:concept1> <predicate1> <66fa7661-889f-4ed5-b74d-540e18b9a83d> ;
        <predicate2> <bar> ;
        <predicate3> <baz> .

<url1:concept2> <predicate1> <66fa7661-889f-4ed5-b74d-540e18b9a83d> ;
        <predicate2> <bar> ;
        <predicate3> <baz> .

But the uuid was generated only once and substituted in all occurrences: the shell expands $(uuidgen) before sed ever runs, so sed just sees a fixed string. We need to substitute each new "foo" appearance with a different uuid each time! Ah, but sed's e flag lets you execute a command when the substitution matches, so in theory you could do this:

blog λ cat a.txt
foo
foo
b
c
d

blog λ cat a.txt | sed "s/foo/echo $(uuidgen)/ge"
8b8cc7ac-b089-4339-875c-76a5278b594a
8b8cc7ac-b089-4339-875c-76a5278b594a
b
c
d

Damn! The uuid is still generated only once, since the shell expands $(uuidgen) before sed even starts, and the same value is substituted for every occurrence! But what if we run a command that pulls from an external random source each time the substitution is executed? I found a little snippet here.
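
For reference, the snippet on its own just produces a random alphanumeric string; the ${1:-32} part means "use the first script argument as the length, or default to 32", so outside a script you can simply write:

# Random 32-character alphanumeric string (standalone version of the snippet).
cat /dev/urandom | tr -dc 'a-zA-Z0-9' | fold -w 32 | head -n 1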

So we try adapting it to work with sed:

blog λ cat a.txt | sed "s^foo^cat /dev/urandom | tr -dc 'a-zA-Z0-9' | fold -w ${1:-32} | head -n 1^ge"
13YFcSxzshlFocig6AdA7yEbHeSKYq4r
6jhDEL3x3yDUsOf6mqScrea29YNDDURy
b
c
d

Nice, now each occurrence of "foo" is replaced by a random string. So let's try with the original file example.txt:

blog λ cat example.txt | sed "s^foo^cat /dev/urandom | tr -dc 'a-zA-Z0-9' | fold -w ${1:-32} | head -n 1^ge"
sh: 1: Syntax error: redirection unexpected

        <predicate2> <bar> ;
        <predicate3> <baz> .

sh: 1: Syntax error: redirection unexpected

        <predicate2> <bar> ;
        <predicate3> <baz> .

Argh, why the hell does this happen? "<" is the input redirection character in the Unix shell. Oh wait, could it be that sed is executing not only the exact match but the whole line, or at least the characters next to it? Let's verify it. Let's say that a.txt is now this:

blog λ cat a.txt
foo < /etc/passwd
foo
b
c
d

blog λ cat a.txt | sed "s/foo/cat/ge"
root:x:0:0:root:/root:/bin/bash
daemon:x:1:1:daemon:/usr/sbin:/usr/sbin/nologin
... etc ...
b
c
d

Yes, it substitutes "foo" with cat, but the e flag then executes the whole resulting line, so "< /etc/passwd" is interpreted not as text but as part of the command that sed runs.

The Solution

The solution came with awk. This is the line that did the trick; it adds a specific triple with a new uuid for each url:concept.

blog λ cat example.txt | awk '1;/foo/{command="uuidgen";command | getline uuidgen;close(command); print "\t<http://our.namespace.url> \"" uuidgen "\" ;"}'
<url1:concept1> <predicate1> <foo> ;
        <http://our.namespace.url> "802a44bd-c28f-4856-b275-e24c666308c8" ;
        <predicate2> <bar> ;
        <predicate3> <baz> .

<url1:concept2> <predicate1> <foo> ;
        <http://our.namespace.url> "6b8cbac2-70c4-4769-b065-5aa36af797a4" ;
        <predicate2> <bar> ;
        <predicate3> <baz> .
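
In case that one-liner is hard to read, here is the same awk program spread over several lines with comments (just a reformatted sketch of the line above, with the getline result stored in a variable called uuid for readability):

blog λ cat example.txt | awk '
  1                          # Print every input line unchanged (1 is always true).
  /foo/ {                    # For lines containing "foo"...
    command = "uuidgen"
    command | getline uuid   # ...run uuidgen and read its output into uuid.
    close(command)           # Close the pipe so the next match runs a fresh uuidgen.
    # Emit the extra triple right after the matching line.
    print "\t<http://our.namespace.url> \"" uuid "\" ;"
  }
'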

Special thanks to my colleague @wdullaer for asking me to help him out with Ember; we ended up having fun with sed & awk.

Have fun!