Wednesday, October 24, 2012

BAM Toolboxes

Why BAM Toolboxes?

As mentioned in the introduction to BAM, WSO2 BAM consists of three main components: the Data Receiver, the Analyzer and the Visualizer. These three components are kept as independent of one another as possible in order to:

  1. Reduce the complexity of BAM,
  2. Allow each component to be scaled up independently,
  3. And allow each component to be plugged in or replaced when the architecture is enhanced.

From the end user's point of view, these three components are a complex set of APIs. Much of their functionality is better kept transparent to the end user. Manually configuring each component to match a business requirement is quite difficult and involves a steep learning curve. However, because BAM has to be flexible and extensible, this complexity is unavoidable. The solution is BAM Toolboxes.
The BAM developers ship a set of pre-configured examples, each packaged as a single zip archive with the extension "tbox", which can be deployed in BAM easily. Although the rationale behind the "toolbox" concept is much broader, current versions of BAM ship only a set of examples for some useful and popular BAM use cases. Users can study the toolbox for the relevant use case and adapt it to their own requirements. In the future we expect to deliver more generic toolboxes that fulfil the broader goals of the toolbox concept.

Toolbox Content

At the moment, a toolbox is basically a zip archive of:

  1. Stream Definitions
  2. Analytics
  3. And Visualization Components.
The user should make sure that these contents are compatible with each other. Let's look at what each of them is and how it relates to the three main components of BAM. This post discusses only the conceptual aspects of BAM toolboxes. Implementation details can be found here.
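
Because a toolbox is an ordinary zip archive with a different extension, its contents can be inspected with any zip tool before it is deployed. Here is a minimal Python sketch of that idea. The function name is only illustrative and is not part of BAM.

```python
import zipfile


def list_toolbox_entries(tbox_path):
    """Return the sorted entry names stored in a .tbox archive."""
    # A .tbox file is a plain zip archive, so zipfile can open it directly.
    with zipfile.ZipFile(tbox_path) as archive:
        return sorted(archive.namelist())
```

Calling this function with the path of a shipped toolbox lists its stream definitions, analytics scripts and visualization files, which is a quick way to start studying an existing toolbox.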

Stream Definitions

A detailed description of the Stream Definition concept is given in an earlier post. A toolbox should include the stream definitions it uses. Strictly speaking, including stream definitions is optional, since the toolbox works without them. Stream definitions were introduced to toolboxes to avoid an exception being printed to the console when the Hive script is executed before any data has been stored in Cassandra. (A column family is created only when data is sent to the Cassandra database for the very first time, so a Hive query that tries to access a column family that does not exist yet throws an exception.) The syntax is similar to the syntax given in the post about Stream Definitions. Here is the actual stream definition used in the Activity Monitoring toolbox 2.0.1.

{
  'name': 'org.wso2.bam.activity.monitoring',
  'version': '1.0.0',
  'nickName': 'Activity_Monitoring',
  'description': 'A sample for Activity Monitoring',
  'metaData':[
          {'name':'character_set_encoding','type':'STRING'},
          {'name':'host','type':'STRING'},
          {'name':'http_method','type':'STRING'},
          {'name':'message_type','type':'STRING'},
          {'name':'remote_address','type':'STRING'},
          {'name':'remote_host','type':'STRING'},
          {'name':'service_prefix','type':'STRING'},
          {'name':'tenant_id','type':'INT'},
          {'name':'transport_in_url','type':'STRING'}
  ],
  'correlationData':[
          {'name':'bam_activity_id','type':'STRING'}
  ],
  'payloadData':[
          {'name':'SOAPBody','type':'STRING'},
          {'name':'SOAPHeader','type':'STRING'},
          {'name':'message_direction','type':'STRING'},
          {'name':'message_id','type':'STRING'},
          {'name':'operation_name','type':'STRING'},
          {'name':'service_name','type':'STRING'},
          {'name':'timestamp','type':'LONG'}
  ]
}
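
The definition groups its attributes into three categories: metaData, correlationData and payloadData. Here is a small Python sketch that reads such a definition and lists the attributes in each category. It is only an illustration and not part of BAM. The function name is made up, and `ast.literal_eval` is used on the assumption that the single-quoted text is read as a Python literal, since Python's strict `json` parser rejects single-quoted strings.

```python
import ast


def summarize_stream_definition(text):
    """Return (name, version, attributes per category) for a stream definition."""
    # The definition uses single quotes, so it is parsed as a Python literal
    # instead of strict JSON.
    definition = ast.literal_eval(text)
    summary = {}
    for category in ("metaData", "correlationData", "payloadData"):
        # A category that is absent is treated as an empty attribute list.
        attributes = definition.get(category, [])
        summary[category] = [(a["name"], a["type"]) for a in attributes]
    return definition["name"], definition["version"], summary
```

For the definition above, the summary would hold nine metaData attributes, one correlationData attribute and seven payloadData attributes.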

Analytics

The middle component of BAM is the Analyzer. The Analyzer is basically a Hadoop analytics engine. Because writing Hadoop jobs directly is considered very low-level programming, Hive scripts are run on top of Hadoop. The programming part of the Analyzer is therefore a set of Hive scripts. These scripts can be scheduled so that they are executed periodically at a given interval, or they can be left unscheduled and executed manually when required.
Roughly, what should happen in a Hive script can be described as follows.
  1. Create Hive tables for the Cassandra column families that contain the data received from the Data Receiver. - This creates the Hive table metadata that maps to the real Cassandra column families.
  2. Create Hive tables for the RDBMS tables. - This creates the Hive table metadata that maps to the real RDBMS tables that should contain the processed result data.
  3. Create Hive tables, for both Cassandra and RDBMS tables, that should keep intermediate data. (This is optional)
  4. Process the data from the source Hive tables and overwrite the result Hive tables with the result data. If required, intermediate data can be stored in the intermediate Hive tables.
The Hive language is similar to SQL and can be learnt from the Hive tutorial. The usage of JDBC handlers in Hive scripts can be found here. Here is an example Hive script written for the Activity Monitoring toolbox 2.0.1.

CREATE EXTERNAL TABLE IF NOT EXISTS ActivityDataTable
 (messageID STRING, sentTimestamp BIGINT, activityID STRING, version STRING, soapHeader STRING, soapBody STRING)
 STORED BY 'org.apache.hadoop.hive.cassandra.CassandraStorageHandler'
 WITH SERDEPROPERTIES (
 "cassandra.host" = "127.0.0.1" ,
 "cassandra.port" = "9160" ,
 "cassandra.ks.name" = "EVENT_KS" ,
 "cassandra.ks.username" = "admin" ,
 "cassandra.ks.password" = "admin" ,
 "cassandra.cf.name" = "org_wso2_bam_activity_monitoring" ,
 "cassandra.columns.mapping" =
 ":key, payload_timestamp, correlation_bam_activity_id, Version, payload_SOAPHeader, payload_SOAPBody" );

CREATE EXTERNAL TABLE IF NOT EXISTS ActivitySummaryTable(
 messageRowID STRING, sentTimestamp BIGINT, bamActivityID STRING, soapHeader STRING, soapBody STRING)
 STORED BY 'org.wso2.carbon.hadoop.hive.jdbc.storage.JDBCStorageHandler'
 TBLPROPERTIES (
 'mapred.jdbc.driver.class' = 'org.h2.Driver' ,
 'mapred.jdbc.url' = 'jdbc:h2:repository/database/samples/BAM_STATS_DB;AUTO_SERVER=TRUE' ,
 'mapred.jdbc.username' = 'wso2carbon' ,
 'mapred.jdbc.password' = 'wso2carbon' ,
 'hive.jdbc.update.on.duplicate' = 'true' ,
 'hive.jdbc.primary.key.fields' = 'messageRowID' ,
 'hive.jdbc.table.create.query' =
 'CREATE TABLE ActivitySummary (messageRowID VARCHAR(100) NOT NULL PRIMARY KEY,
  sentTimestamp BIGINT, bamActivityID VARCHAR(40), soapHeader TEXT, soapBody TEXT)' );

insert overwrite table ActivitySummaryTable
 select messageID, sentTimestamp, activityID, soapHeader, soapBody
 from ActivityDataTable
 where version= "1.0.0";
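
Comparing this script with the stream definition given earlier suggests a naming convention: the column family name is the stream name with dots replaced by underscores, and each attribute is stored in a column prefixed with its category (payload_timestamp, correlation_bam_activity_id). Here is a minimal Python sketch of that convention. It is inferred from this one example and is not a documented BAM API. The function names are illustrative, and the "meta_" prefix in particular is an assumption made by analogy, since no metadata column appears in the script.

```python
# Prefixes inferred from the column mapping in the Hive script above.
# "meta_" is an assumption: no metaData column is mapped in that script.
PREFIXES = {
    "metaData": "meta_",
    "correlationData": "correlation_",
    "payloadData": "payload_",
}


def column_family_name(stream_name):
    """Derive the Cassandra column family name from a stream name."""
    return stream_name.replace(".", "_")


def cassandra_column(category, attribute_name):
    """Derive the Cassandra column name for a stream attribute."""
    return PREFIXES[category] + attribute_name
```

For example, `column_family_name('org.wso2.bam.activity.monitoring')` gives the "cassandra.cf.name" value used in the script, and `cassandra_column('payloadData', 'SOAPBody')` gives `payload_SOAPBody`.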

Visualization Components

Visualizing the processed data from the Analyzer is the job of the Visualizer. The main requirement of the BAM Visualizer is a generic way of visualizing different types of data with different types of user interaction. Like the other two main components of BAM, the Visualizer has to be configured by the user according to their requirements. Configuring it should therefore be easy, while still fulfilling the main requirement above.
At this stage of BAM (version 2.0.1), the Visualizer supports two ways of visualizing data, each suited to a different requirement (excluding the report generation mechanism).

  1. Generate a gadget and deploy it in a dashboard (this is the same dashboard used in WSO2 Gadget Server) - This approach is most suitable for non-technical (ordinary) users. The Gadget Wizard generates a gadget in a straightforward way: the user just specifies the type of visualization component (e.g. a bar chart) and the data set for each axis (e.g. x-axis, y-axis). However, this approach offers little flexibility and limited interactivity in the UI.
  2. The other way of specifying a visualization component is to write a custom dashboard with Jaggery. Jaggery is a server-side JavaScript framework that can interact with web services and Carbon data sources. I recommend looking into an existing custom dashboard if you are interested in creating your own.
Generating reports is another major visualization feature available in BAM, but I am not going to discuss it in this post.

The overall content of a toolbox archive is shown below.

[Figure: overall content of a toolbox archive]


