Data Collection, Monitoring and Visualization

For the past two years, I’ve been focused on the problem of moving data from sensors on networked equipment to data sinks for aggregation, monitoring and visualization purposes. The process is what I call a pipeline that achieves the goals of monitoring and visualization.

Some terminology: I use measurement, which can be substituted with metric, and observation, which is a collection of measurements.

Data Collection

Collecting observations is done by agents. Agents can set up a receiving server (stream), poll data sources or subscribe to a data source (stream). Observations may be processed/transformed a bit here to standardize/normalize measurements, since other agents may be collecting the same measurements from different sources.

Once you have the data, you have a good amount of choices on what to do with the data, so here are two: you can transfer immediately to the data sinks or store into a cache where a curator service will push to data sinks.

Before transferring, some more data processing in the transmission process may be needed for formatting purposes.

If data is transferred immediately, then you have to worry about questions such as backfilling data if the sink goes down!

If data is in a cache, how much data do you want to store there before things get dropped? How do you make sure the same data for a time period isn’t retransmitted once it has already been transmitted successfully? Remember that not all data sources are as quick as others (aperiodicity).
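
To make the cache questions concrete, here is a minimal sketch of a bounded cache that drops the oldest observation when full and only forgets data after the sink acknowledges it. The names `Observation`, `ObservationCache`, `peekBatch` and `acknowledge` are hypothetical and not from any real library; a real curator service would also need persistence and retry timing.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.Iterator;
import java.util.List;

// Hypothetical observation type: one named measurement at a point in time.
final class Observation {
    final String name;
    final long timestampMillis;
    final double value;

    Observation(String name, long timestampMillis, double value) {
        this.name = name;
        this.timestampMillis = timestampMillis;
        this.value = value;
    }
}

// Hypothetical bounded cache sitting between an agent and a curator service.
final class ObservationCache {
    private final int maxSize;
    private final Deque<Observation> pending = new ArrayDeque<>();
    private long dropped = 0;

    ObservationCache(int maxSize) {
        if (maxSize < 1) {
            throw new IllegalArgumentException("maxSize must be >= 1");
        }
        this.maxSize = maxSize;
    }

    // Store an observation; when the cache is full, the oldest one is dropped.
    synchronized void add(Observation observation) {
        if (pending.size() >= maxSize) {
            pending.removeFirst();
            dropped++;
        }
        pending.addLast(observation);
    }

    // Copy up to batchSize of the oldest observations without removing them,
    // so a failed transmission loses nothing.
    synchronized List<Observation> peekBatch(int batchSize) {
        List<Observation> batch = new ArrayList<>();
        Iterator<Observation> it = pending.iterator();
        while (it.hasNext() && batch.size() < batchSize) {
            batch.add(it.next());
        }
        return batch;
    }

    // Called after the sink confirms the batch: acknowledged observations are
    // removed and therefore never retransmitted.
    synchronized void acknowledge(List<Observation> sent) {
        pending.removeAll(sent);
    }

    synchronized long droppedCount() {
        return dropped;
    }
}
```

The curator would call `peekBatch`, push to the sink, and call `acknowledge` only on success. Tracking `droppedCount` gives you a number to audit when the cache is too small for an outage.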

Aggregation

Aggregation is a step done after the initial data collection, but it is still under the data collection umbrella. The primary types of aggregation I researched and know of are batch and streaming. Batch aggregations happen once all the data is available; streaming aggregations happen as data comes through, or once enough data has come in to do the aggregation (delayed streaming? mini-batch?). I call this the hydration process: mini-blocks filter the stream for the data they need and capture it until they can perform the aggregation. A good example of hydration is a division of two measurements that come in at different times (usually within 5 seconds of each other).
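
The division example could be sketched roughly like this. `RatioHydrator` and its parameters are hypothetical names; the sketch assumes each measurement carries a name and a timestamp, and that two operands only belong together if they arrive within a configurable skew (5 seconds in the example above).

```java
import java.util.OptionalDouble;

// Hypothetical mini-block: filters a stream for two measurements and emits
// their ratio once both have arrived close enough together in time.
final class RatioHydrator {
    private final String numeratorName;
    private final String denominatorName;
    private final long maxSkewMillis;

    private Double numerator;      // null until captured
    private long numeratorTs;
    private Double denominator;    // null until captured
    private long denominatorTs;

    RatioHydrator(String numeratorName, String denominatorName, long maxSkewMillis) {
        this.numeratorName = numeratorName;
        this.denominatorName = denominatorName;
        this.maxSkewMillis = maxSkewMillis;
    }

    // Offer one measurement from the stream. Returns the ratio when the
    // block is fully hydrated, otherwise an empty result.
    OptionalDouble offer(String name, long timestampMillis, double value) {
        if (name.equals(numeratorName)) {
            numerator = value;
            numeratorTs = timestampMillis;
        } else if (name.equals(denominatorName)) {
            denominator = value;
            denominatorTs = timestampMillis;
        } else {
            return OptionalDouble.empty(); // not needed by this aggregation
        }

        if (numerator == null || denominator == null) {
            return OptionalDouble.empty(); // still waiting for the partner
        }

        if (Math.abs(numeratorTs - denominatorTs) > maxSkewMillis) {
            // Too far apart: keep only the newer operand and keep waiting.
            if (numeratorTs < denominatorTs) {
                numerator = null;
            } else {
                denominator = null;
            }
            return OptionalDouble.empty();
        }

        double n = numerator;
        double d = denominator;
        numerator = null;
        denominator = null;
        // Skip the result rather than emit infinity or NaN.
        return d == 0.0 ? OptionalDouble.empty() : OptionalDouble.of(n / d);
    }
}
```

Because the output is just another measurement, the result of `offer` can be fed into the next aggregator in the chain.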

Aggregations can feed other aggregations! A batch aggregator could feed a streaming aggregator, which could feed another streaming aggregator. This is a powerful feature.

Monitoring

Monitoring is an umbrella that contains measurement threshold checks and alerting. Systems monitor values and then immediately trigger alerts based on defined rules. These rules are typically simple order relation checks. I think we can start to see that simple rules cannot scale beyond small deployments. At bigger companies, redundancies are in place, which means simple alerts are not as important any more. How does one de-prioritize an alert when each rule maps 1 -> 1 to an alert? Alerts should notify if and only if some higher-level measurement goes down. This higher-level measurement would incorporate the idea of primary and redundant resources.
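
A minimal sketch of that idea, assuming a redundancy group is just a list of the same measurement taken on each member (primary plus redundant resources). `RedundancyRule`, `healthyCount` and `shouldNotify` are hypothetical names.

```java
import java.util.List;

// Hypothetical rule that notifies on a higher-level measurement (how many
// members of a redundancy group are still healthy) instead of on each
// member's individual threshold breach.
final class RedundancyRule {

    // The usual simple order relation check for a single resource.
    static boolean breaches(double value, double threshold) {
        return value > threshold;
    }

    // Higher-level measurement: number of group members not breaching.
    static int healthyCount(List<Double> memberValues, double threshold) {
        int healthy = 0;
        for (double value : memberValues) {
            if (!breaches(value, threshold)) {
                healthy++;
            }
        }
        return healthy;
    }

    // Notify only when fewer than minHealthy members remain healthy.
    static boolean shouldNotify(List<Double> memberValues, double threshold, int minHealthy) {
        return healthyCount(memberValues, threshold) < minHealthy;
    }
}
```

With a threshold of 90 and `minHealthy` of 1, a primary at 95 with a healthy standby at 40 stays quiet, while both members at 95 and 97 would notify. The per-member breach can still be recorded as a low-priority alert.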

Again, typically, any alert would notify some people immediately. Due to the lack of heuristics/complex rules around alerts, incident management and response end up being human-driven. Can we extend this rule -> alerting relation to incorporate complexity? Of course.

My peeve here is that if we do not want to be notified on every alert, then an alert is not necessarily an incident. An incident, to me, is more involved and would contain a notification/escalation policy. It can occur because something alerted a bunch of times. Incidents define their own aggregation level. An incident could have a whole history log, such as escalations and remediations. One might say a root incident would absorb its child incidents in order to have one flat incident tracking everything.
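
One way to sketch "something alerted a bunch of times" is an incident that opens only after a number of alerts for the same resource inside a time window, and then absorbs later alerts into its history log. `IncidentTracker` and its methods are hypothetical, and the sketch assumes alerts arrive in timestamp order.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical tracker: alerts are the lower-level structure, incidents are
// opened by a rule over alerts and keep their own history.
final class IncidentTracker {
    private final int alertsToOpen;
    private final long windowMillis;
    private final Map<String, Deque<Long>> recentAlerts = new HashMap<>();
    private final Map<String, List<String>> openIncidents = new HashMap<>();

    IncidentTracker(int alertsToOpen, long windowMillis) {
        this.alertsToOpen = alertsToOpen;
        this.windowMillis = windowMillis;
    }

    // Returns true only when this alert opens a new incident, which is the
    // point where a notification/escalation policy would run.
    boolean onAlert(String resource, long timestampMillis) {
        List<String> history = openIncidents.get(resource);
        if (history != null) {
            history.add("alert at " + timestampMillis); // absorbed, no new notification
            return false;
        }

        Deque<Long> times = recentAlerts.computeIfAbsent(resource, k -> new ArrayDeque<>());
        times.addLast(timestampMillis);
        // Forget alerts that fell out of the window.
        while (timestampMillis - times.peekFirst() > windowMillis) {
            times.removeFirst();
        }
        if (times.size() < alertsToOpen) {
            return false;
        }

        List<String> log = new ArrayList<>();
        log.add("opened at " + timestampMillis + " after " + times.size() + " alerts");
        openIncidents.put(resource, log);
        recentAlerts.remove(resource);
        return true;
    }

    // Read-only view of an open incident's history (empty if none is open).
    List<String> history(String resource) {
        List<String> log = openIncidents.get(resource);
        return log == null ? Collections.<String>emptyList() : Collections.unmodifiableList(log);
    }

    // Close the incident so future alerts start counting again.
    void resolve(String resource) {
        openIncidents.remove(resource);
    }
}
```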

In the end: monitor values, trigger some lower-level structure (alerts?), and define incidents that carry the notification policy and are triggered by rules over those lower-level structures. Naming is hard… Rules, Alerts, Incidents…

Adding pub/sub on top of all of this, so other systems can use the data outside the scope of the architecture, is always a benefit. Examples are auditing, history and/or tracking SLAs.

Visualization

Visualization requires a historical data source, so one can create panels of graphs to see the history of somewhat related values. As for third party open source software, Grafana does a good job generally.

Graph creation needs automation, and it can be automated if tags are standardized and there are template panels that only need values filled in. These template panels should be exportable to other areas. A good example is system metrics, which on Linux servers are pretty much the same throughout (cpu util %, cpu load, etc.). There isn’t much of a need to create these sets of graphs five hundred times a day across different teams and companies if metric paths/tags are standardized.
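
As a rough sketch of what "template panels that need values filled in" could mean, assume a panel is described by strings with `${tag}` placeholders and that every host exposes the same standardized tags. `PanelTemplate`, the placeholder syntax and the query string are made up for illustration and are not Grafana's templating API.

```java
import java.util.Map;

// Hypothetical template panel: the same title/query template is reused for
// every host by filling in standardized tags.
final class PanelTemplate {
    private final String titleTemplate;
    private final String queryTemplate;

    PanelTemplate(String titleTemplate, String queryTemplate) {
        this.titleTemplate = titleTemplate;
        this.queryTemplate = queryTemplate;
    }

    // Replace every ${tag} placeholder with the value of that tag.
    static String fill(String template, Map<String, String> tags) {
        String result = template;
        for (Map.Entry<String, String> tag : tags.entrySet()) {
            result = result.replace("${" + tag.getKey() + "}", tag.getValue());
        }
        return result;
    }

    String title(Map<String, String> tags) {
        return fill(titleTemplate, tags);
    }

    String query(Map<String, String> tags) {
        return fill(queryTemplate, tags);
    }
}
```

One template such as `new PanelTemplate("cpu util % on ${host}", "system.${host}.cpu.util")` then covers every server that follows the naming standard.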

Another idea is dynamic graphs/dashboards, where certain measurement names/units automatically generate the graphs that suit them. If I want overall temperature, then I probably want to see a heatmap rather than a bunch of overlapping lines.

Conclusion

Some of these things I did, and some I did not have the resources or level of support to do. In an ideal world, I would have reached my goals of pushing things further in a historically underinvested area. This is something I’ll likely do on my own time for my own projects.

Auditing measurement collection is something enterprises need to invest in to make sure data collection is actually occurring. Relying on a host or the like being up is not enough.


Performant and Lock-free do not mean what you think

I treat loosely defined words like “performant” as an indicator of whether an author knows what they are talking about. “Performant” ain’t a word, and even if it was, what does it mean? High performance? By what measure?

Avoid using “Performant”: https://english.stackexchange.com/questions/38945/what-is-wrong-with-the-word-performant

Along those lines, lock-free != wait-free, and I would/will forgive people for this. Lock-free sounds nice, but the definition does not imply wait-free: lock-free only guarantees that some thread always makes progress, while wait-free guarantees that every thread finishes in a bounded number of steps. Wait-free is rarely used to describe algorithms, because guaranteeing wait-free anything is hard.

The differences are noted here: http://concurrencyfreaks.blogspot.com/2013/05/lock-free-and-wait-free-definition-and.html
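
A small sketch of the distinction, using a compare-and-set retry loop on `java.util.concurrent.atomic.AtomicLong`. The class name `LockFreeMax` is made up; the comments describe the usual progress argument, not a formal proof.

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical running-maximum tracker built on a CAS retry loop.
final class LockFreeMax {
    private final AtomicLong max = new AtomicLong(Long.MIN_VALUE);

    // Lock-free: a failed compareAndSet means another thread's update
    // succeeded, so the system as a whole always makes progress.
    // Not wait-free: under contention this particular thread has no bound
    // on how many times it may have to retry.
    void record(long sample) {
        long current = max.get();
        while (sample > current && !max.compareAndSet(current, sample)) {
            current = max.get();
        }
    }

    long get() {
        return max.get();
    }
}
```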


Efficient LRU Cache in Java

 

Fun stuff to do on LintCode. I could always accept a linear traversal over a LinkedList if I wanted a simple way to do LRU, but why not challenge myself? The version below pairs a HashMap with a doubly linked list, so both get and set run in constant time.


 


import java.util.HashMap;
import java.util.Map;

public class LRUCache {
    
    public class LRUNode {
        
        Integer key;
        Integer data;
        LRUNode prev;
        LRUNode next;
        
        public LRUNode(int key, int data) {
            
            this.key = key;
            this.data = data;
        }
    }
    
    LRUNode head;
    LRUNode tail;
    Map<Integer, LRUNode> nodeMap = new HashMap<>();
    int maxCap;
    
    
    /*
     * @param capacity: An integer
     */
    public LRUCache(int capacity) {
        // Maximum number of entries before the least recently used one is evicted
        
        this.maxCap = capacity;
    }

    /*
     * @param key: An integer
     * @return: An integer
     */
    public int get(int key) {
        // On a hit, move the node to the head so it becomes the most recently used
    
        LRUNode node = nodeMap.get(key);
        
        if(node != null) {
            
            this.remove(node);
            this.push(node);
            
            return node.data;
        }
        
        return -1;
    }

    /*
     * @param key: An integer
     * @param value: An integer
     * @return: nothing
     */
    public void set(int key, int value) {
        // Replace any existing entry, push the new node to the head, then evict from the tail if over capacity
        
        if(nodeMap.containsKey(key)) {
            
            LRUNode oldNode = this.nodeMap.get(key);
            this.remove(oldNode);
        }
        
        LRUNode node = new LRUNode(key, value);
        
        this.push(node);
        
        if(nodeMap.size() > this.maxCap) {
            
            this.removeOld();
        }
    }
    
    private void push(LRUNode newHead) {
        
        if(this.head != null) {
            
            newHead.prev = null;
            newHead.next = this.head;
            this.head.prev = newHead;
        }
        
        if(this.tail == null) {
            
            this.tail = newHead;
        }
        
        this.head = newHead;
        this.nodeMap.put(newHead.key, newHead);
    }
    
    private void removeOld() {
        
        if(this.tail == null) {
            
            return;
        }
        
        this.remove(this.tail);
        
    }
    
    private void remove(LRUNode node) {
        
        if(node.prev != null) {
            
            node.prev.next = node.next;
        }
        
        if(node.next != null) {
            
            node.next.prev = node.prev;
        }
        
        if(node.equals(this.head)) {
            
            this.head = node.next;
        }
        
        if(node.equals(this.tail)) {
            
            this.tail = node.prev;
        }
        
        node.next = null;
        node.prev = null;
        this.nodeMap.remove(node.key);
    }
}


Care about Performance over Readability? Get rid of your Getters and Setters!

Cliff Click, during the Q/A, talks about not using getters and setters if you really, really care about performance. One reason is that there is a chance the getter/setter may not be inlined. Another may be that the extra function calls push a hot method past one of the JIT’s inlining thresholds, so the hot code around them fails to inline.
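
For illustration only, here is what the two access styles look like side by side. `Sample` and `HotLoop` are made-up names, and the sketch makes no claim about which one is faster on a given JVM; the JIT usually inlines trivial getters, so measure with a benchmark harness before giving up readability.

```java
// Hypothetical value holder with both a getter and a directly readable field.
final class Sample {
    int value; // package-private so hot code can read it without a call

    Sample(int value) {
        this.value = value;
    }

    int getValue() {
        return value;
    }
}

final class HotLoop {
    // Each iteration is a method call that relies on the JIT inlining getValue().
    static long sumViaGetter(Sample[] samples) {
        long sum = 0;
        for (Sample s : samples) {
            sum += s.getValue();
        }
        return sum;
    }

    // Each iteration is a plain field read; there is nothing to inline.
    static long sumViaField(Sample[] samples) {
        long sum = 0;
        for (Sample s : samples) {
            sum += s.value;
        }
        return sum;
    }
}
```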


Dabbling in C++, Modern Way of Pointers (Raw, Unique, Shared, Weak)

I am going to assume you know what pointers are for and why you use them. If not, please go learn about stack and heap memory.

If you are like me, you know raw pointers look like this: `int * someNumber`. You manage your own memory with `new` and `delete`. Manual memory management has its benefits, but clearly it does not scale well, given the amount of overhead people are willing to accept to use garbage-collected languages.

If you know about auto_ptr, forget about it.  It is gone in C++17.

Unique, Shared and Weak Pointers are the new memory management for C++, so semi-automatic garbage collection! Basically ARC from Objective-C though Boost may have been the inspiration for both. All of these pointers require the <memory> header. There is a link below for the C++ memory reference.

A unique pointer works by giving the pointer a single owner (a scope). When we exit that scope, the pointer and its contents are freed.

How do I use it? There are two ways to instantiate this guy:

  1. std::unique_ptr<Object> uptr(new Object());
  2. std::make_unique<Object>(args…);

A shared pointer is reference counting for pointers. I expect this to be heavily used. The reference count is incremented for each copy of the shared pointer and decremented when a copy is destroyed or reassigned. When the count reaches zero, the pointer and its contents are freed.

How do I use it? Same way as unique.

  1. std::shared_ptr<Object> sptr(new Object());
  2. std::make_shared<Object>(args…);

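A weak pointer is exactly what you think it is. It can point to an object owned by a shared pointer, but it has no effect on whether the object is freed or not. To use the object, call lock() on the weak pointer to get a shared pointer, which will be empty if the object has already been freed.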

How do I use it?

  1. std::weak_ptr<Object> wptr = sptr;

 

This is your friendly Intro to Modern C++ pointers. Stop using raw pointers. 🙂 Use the link below as your bible.

Links

  1. CPP Memory Ref: http://en.cppreference.com/w/cpp/memory


Dabbling in C++, Using Boost

To start off, I’m using a Mac, and you know we have Homebrew to install things, so let’s use it to get Boost into our C++ program.

To install Homebrew:
Go to http://brew.sh/ and follow the instructions to get it

Install boost:
`brew install boost`

Verify boost is installed:
`brew list | grep 'boost'`

Great, once that is verified, we need to include Boost in our CMake project.

Open up CMakeLists.txt and make sure you have the below (USE C++ 14 PLS):

cmake_minimum_required(VERSION 3.6)

project("PROJECT_NAME_HERE")

set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -std=c++14")

FIND_PACKAGE(Boost COMPONENTS thread system)

INCLUDE_DIRECTORIES(${Boost_INCLUDE_DIRS})

add_executable(TARGET_NAME_HERE main.cc)

target_link_libraries(TARGET_NAME_HERE ${Boost_LIBRARIES})

Personally, I think FIND_PACKAGE is nice since you do not need to pick a specific directory where Boost is installed. It will just search and expose some variables, such as Boost_LIBRARIES and Boost_INCLUDE_DIRS, that you can use elsewhere. Also, you’ll likely want to modify the components listed in the find (thread and system here) to match the Boost libraries you actually use.

Alright, now regenerate the CMake build with the new settings and boom, Boost is now included.

Try to do this:
http://www.boost.org/doc/libs/1_61_0/more/getting_started/unix-variants.html#link-your-program-to-a-boost-library

to verify that boost is successfully linked and your program can run.

I like to think small steps in learning things take you a long way.


Started Dabbling in C++ again, so Let’s Get Started.

I have not officially programmed in C++ for a long time, though I did program in C when I successfully fixed and refactored a Python C module. I’ve been programming Java at work for a good amount of time now.

IDEs are the staple of any low level language for me, though I can survive without them. I am just a little slower because of the lack of autocomplete. Anyway, there are many build systems you can use for C++, but I think CMake is a good starting point if you want cross platform builds. The link below is a good way to start off a project and get it up for both Visual Studio and Xcode.

CMake Tutorial – Chapter 1: Getting Started

Follow that tutorial loosely to get a project off the ground unless you are a complete C++/build system n00b.

It is a small blog post because I am taking small steps. Where do I intend to go with this? I have an idea of what to implement, but will likely blog about my misgivings about using C++, Boost, how much I miss having a package manager (or maybe not! less bloat!) and static vs dynamic linking (o god, just static link pls… so what if your project executable is bigger… just save yourself the headache).


How to Install Maven on Windows

Many of us who use Maven are used to using apt-get or yum on Linux-based systems. Well, WINDOWS has a third-party package management system called Chocolatey.

On their website, you run this command in an ADMINISTRATIVE command prompt (you know.. right click cmd and run as administrator):

@powershell -NoProfile -ExecutionPolicy unrestricted -Command "iex ((new-object net.webclient).DownloadString('https://chocolatey.org/install.ps1'))" && SET PATH=%PATH%;%ALLUSERSPROFILE%\chocolatey\bin

and then run `choco install maven`. You may have to specify a version (`choco install maven -version 3.2.5`) if the mirror Chocolatey uses does not have the version it is trying to download.

Use with your favorite IDE!

DONE!!!!


Google Chrome Adventures

Looks like it is one of those times of the year where I am deeply enveloped in a project. This is one of those projects where you build on top of what someone else created rather than start completely from scratch. Yeah, I know blank slate is always better, but there is nooooo way I will create my own browser.

Let’s go through what I’ve been doing. The idea is to control a browser to do what I want it to do, such that it can be used in a work project. The browser I chose is Google Chrome and that’s mainly because the most updated version of flash is included with it on Linux.

The first solution was to use Selenium, which starts up a Chrome instance using the Chrome driver and then uses the WebDriver interface to tell the browser what to do. Problem solved!

Nope, I need to be able to do some specific things, such as capturing the request before it is fired off.

Selenium is great, but with an out of the box Chrome instance? Meh, can not do much request/response capturing. Okay…

Let’s add a proxy server and have Chrome connect to it! Yup, now we can inspect requests before they go out to the network and responses as they come in. Woohoo!

Nope, I specifically need to capture the request before it leaves the browser, because of some unsupported protocols.

Okay, so let’s google for something that will let us control the browser internally where hopefully we can eliminate the proxy even.

JCEF (Java Chromium Embedded Framework) is a piece of work. This was love at first sight. Just imagine being able to put a Chrome instance wherever you wanted it! It basically lets you do whatever you want with it. When I first installed it, I was brought back to reality. It is plagued with issues on Ubuntu 12 and 14. Biting the bullet as usual for Linux builds, I stuck with it and kept going, solving or ignoring the issues wherever possible. I got to the point where I could finally start the browser, and one issue still plagued it: sometimes it would not start. Unacceptable! On top of that, it doesn’t support Flash no matter how I tried to include it, but I will cut it some slack on that part: I posted on the forums about it, and it turns out Flash support needs a paid license from ADOBE. WTF?! It’s dumb, no one is going to implement the PPAPI for Flash if it isn’t open source/free license. Good job Adobe! Flash is a hard requirement unfortunately, so good bye CEF!

Okay, soooo what to do? BUILD MY OWN CHROMIUM BROWSER! Well, that’s what I’m doing now. It was the path I did not want to go down, but I have to! It only means I have to read into the browser source code, modify the C++ code and build it. Oh wait, I’ve already done that. The browser source is pretty tough to read… maybe there are some docs out there. I’ve been grepping and manually reading, trying to find where things are, and it’s interesting how they’ve done some things. I need to figure out how to strip down the browser, such as the applications and search ability and all that crap. Basically, creating my own CEF, but not really. There should be some configuration options to disable that stuff, so I can cut the build time down (takes about an hour! on my tablet/laptop).

I will combine it with selenium and there we go. A browser that does everything that I want it to do with everything I need. 🙂
