-
With Akka cluster we have two options when creating routers: Group and Pool. The Pool option is better suited when we are trying to scale a computing operation across nodes. The activator template has an example of this kind of router. The conf file for a Pool router:

    akka.actor.deployment {
      /statsService/singleton/workerRouter {
        router = consistent-hashing-pool
        cluster {
          enabled = on
          max-nr-of-instances-per-node = 16
          allow-local-routees = on
          use-role = compute
        }
      }
    }

Assuming that each machine...
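As a minimal sketch, the router can then be created from config inside the parent actor; the Worker routee below is a hypothetical stand-in, and the actor name has to line up with the configured deployment path:

    import akka.actor.{Actor, Props}
    import akka.routing.FromConfig

    class Worker extends Actor {  // hypothetical routee doing the compute work
      def receive = { case job => /* process job, reply to sender() */ }
    }

    class StatsService extends Actor {
      // FromConfig tells Akka to read the router type and cluster settings
      // from the akka.actor.deployment block shown above.
      val workerRouter = context.actorOf(FromConfig.props(Props[Worker]), name = "workerRouter")
      def receive = { case job => workerRouter forward job }
    }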
-
Spark is a big data processing framework which enables fast and advanced analytics computation over Hadoop clusters.

Spark Architecture

A Spark application consists of a driver program and a list of executors. Spark applications run as independent sets of processes on a cluster, coordinated by the SparkContext object in the driver program. The SparkContext can connect to several types of cluster managers:
- Standalone cluster manager
- Mesos
- YARN

These cluster managers allocate...
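A minimal sketch of the driver side (assuming the Spark 1.x API; local[2] is a placeholder master, on a real cluster this would point at the standalone master, Mesos or YARN):

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf().setAppName("WordCount").setMaster("local[2]")
    val sc = new SparkContext(conf)

    val counts = sc.textFile("input.txt")   // distributed dataset (RDD)
      .flatMap(_.split("\\s+"))
      .map(word => (word, 1))
      .reduceByKey(_ + _)                   // executed by the executors
    counts.take(10).foreach(println)
    sc.stop()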
-
Hadoop Installation on Mac

Installation

Download the latest version of the Hadoop binaries and extract it in a local folder. On Mac you can also install with the brew command:

    brew install hadoop

(The current version at the time of writing was 2.7.3.) This installs Hadoop at /usr/local/Cellar/hadoop/2.7.3. The current JDK version was 1.8 and the Java home was set up as /Library/Java/JavaVirtualMachines/jdk1.8.0_73.jdk/Contents/Home. It is a good practice to set this up in .bashrc so that it could be...
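For example, the corresponding .bashrc entries might look like this (paths taken from the install mentioned above; the HADOOP_HOME name is just a common convention):

    export JAVA_HOME=/Library/Java/JavaVirtualMachines/jdk1.8.0_73.jdk/Contents/Home
    export HADOOP_HOME=/usr/local/Cellar/hadoop/2.7.3
    export PATH=$PATH:$HADOOP_HOME/bin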
-
There are three main stages in a predictive analytics project:
- Requirement gathering
- Data modeling
- Delivery and deployment

Requirement Gathering

While working out the requirements for a predictive analytics project it is important to define the objectives clearly. Care should be taken to avoid making the objectives overly broad or overly specific. For example, if we are measuring the customer churn at an online portal, we can define the task as the percentage of customers who...
-
Given a corpus we can use topic modelling to get insights into the structure of information embedded in the docs. LDA is a topic modelling algorithm that can be used for this purpose. LDA is a generative algorithm that treats documents as bags of words, where each document has a mixture of topics and each topic has a discrete probability distribution over words. In LDA the topic distribution is assumed to have a Dirichlet prior, which...
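As an illustration, a minimal sketch of fitting LDA with Spark MLlib (assuming an existing SparkContext sc and the Spark 1.x mllib API; the tiny hand-built term-count vectors are placeholders for a real document-term matrix):

    import org.apache.spark.mllib.clustering.LDA
    import org.apache.spark.mllib.linalg.Vectors

    // Each document is (id, vector of term counts over the vocabulary)
    val corpus = sc.parallelize(Seq(
      Vectors.dense(1.0, 0.0, 3.0),
      Vectors.dense(2.0, 1.0, 0.0)
    )).zipWithIndex.map { case (v, id) => (id, v) }.cache()

    val ldaModel = new LDA().setK(2).run(corpus)  // fit a 2-topic model
    val topics = ldaModel.topicsMatrix            // vocabSize x k matrix of term weights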
-
Google Protocol Buffers provides a useful method to send data between services in binary format. Once the schema is defined, there are parsers in various languages that can consume the data. This also handles the case of future updates: when there are modifications to the schema, they can be handled transparently across services without breaking them.
https://developers.google.com/protocol-buffers/
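For a flavor of the schema language, a hypothetical message definition (the field names are made up); new optional fields get fresh tag numbers, so older parsers simply ignore them:

    // person.proto -- hypothetical example schema
    syntax = "proto2";

    message Person {
      required string name  = 1;
      optional int32  age   = 2;
      // added in a later schema revision; old services keep working
      optional string email = 3;
    }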
-
Logistic regression estimates the probability of the output variable from a linear combination of one or more predictor variables by using the logit function. The nonlinear transformation of the logit function makes it useful for complex classification models.

Assumptions of logistic regression

Logistic regression makes fewer assumptions about the input than linear regression:
- It does not need a linear relationship between the dependent and independent variables
- The features are no longer assumed to be...
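Concretely, with p the probability of the positive class and x_1, ..., x_k the predictors, the model is linear on the log-odds scale:

    \mathrm{logit}(p) = \ln\frac{p}{1-p} = \beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k,
    \qquad
    p = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k)}}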
-
There are many JSON libraries for Scala but here we are using play-json, which is part of the Play framework but can also be used independently. I am using the Ammonite REPL to try out the JSON parsing on the console.

    // load the library
    load.ivy("com.typesafe.play" %% "play-json" % "2.4.0")
    import play.api.libs.json._

    var rawJson = """ {"name": "John", "age": 20, "address": "#42 milky way", "tags" : [ "freshman", "scholar" ] } """

The first step is going from...
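Picking up from there, a minimal sketch of parsing and field extraction with this API:

    val json: JsValue = Json.parse(rawJson)
    val name = (json \ "name").as[String]        // "John"
    val tags = (json \ "tags").as[Seq[String]]   // Seq("freshman", "scholar")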
-
AWS OpsWorks is based on Chef, a tool for configuring and automating infrastructure deployments. Since it is based on Chef, it comes with all the benefits of cookbooks, recipes and the community resources. In addition, OpsWorks provides AWS-specific features like auto scaling, monitoring, access to other AWS resources, security and easy management of server nodes. OpsWorks has the concepts of stacks and layers. A stack describes all the resources of your entire...
-
Apache Sqoop is an open source tool that allows you to extract data from a structured data store into Hadoop for further processing. In addition to writing the contents of the database table to HDFS, Sqoop also provides you with a generated Java source file (widgets.java) written to the current local directory. Sqoop is a client command and there is no daemon process for it. It depends on HDFS and YARN and the database drivers to which...
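A typical import invocation looks roughly like this (the JDBC URL is a placeholder; the widgets table matches the generated widgets.java mentioned above):

    sqoop import \
      --connect jdbc:mysql://localhost/mydb \
      --table widgets \
      --target-dir /user/hadoop/widgets \
      -m 1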
-
These are the libraries that over the years I have found useful repeatedly at work.

Text processing
- tm : for text analysis
- quanteda : text processing

Data modeling
- topicmodels : for LDA topic models
- caret : for various data preprocessing functions, regression and classification
- ROCR : model tuning and analysis
- e1071 : building models

Time series
- zoo : time series objects
- xts : time series data manipulation
- quantmod : financial charting and analysis

Workflow
- devtools : for...
-
Hadoop and big data are synonymous. This is because when you are processing data at a scale at which it is inefficient to process it on a single machine, you have to consider running it over multiple machines or compute clusters. The Hadoop framework provides tools to enable this by promising to do two things: manage the infrastructure and the running of jobs split across machines, and deliver the results. Hadoop architecture consists of distributed storage...
-
Model Datasets for Learning

Select some datasets which are well understood. We should know the problem context for which the data was collected and the outcomes. The dataset should also be well published so that the nature of the underlying data is well known. While working with these model datasets our goal is to reduce the unknowns so that we can focus on evaluating the algorithms and tools. These datasets should be of smaller size so...
-
Use giter8 to build a basic project structure: https://github.com/n8han/giter8

    # Install giter8 on OSX
    brew install giter8

Giter8 provides templates to kickstart projects. The list of templates can be found on their GitHub page: https://github.com/n8han/giter8/wiki/giter8-templates

    # Apply a simple template to kickstart
    g8 chrislewis/basic-project

Provide the basic information that it asks for in the prompt. Go to the project directory and check out build.sbt to verify the content and modify it if required.

    # compile
    sbt compile

...
-
Hive provides a SQL-like query interface to Apache Hadoop. Hive in turn does the computation over a large number of nodes on the Hadoop cluster to provide the results. This enables easy ad-hoc data analysis and summarization queries. Hive does not support indexed queries like an RDBMS, but unlike other relational databases it scales very well. Hive is not designed for real-time queries and row-level updates. It is best suited to batch jobs over large...
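For a flavor of the query language, a hypothetical summarization query (the table and columns are made up):

    -- top pages by hits for a given day, over a hypothetical logs table
    SELECT page, COUNT(*) AS hits
    FROM access_logs
    WHERE dt = '2016-01-01'
    GROUP BY page
    ORDER BY hits DESC
    LIMIT 10;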
-
I wanted to set up a Hadoop development cluster on my MacBook Pro which I can tinker with. It was around the same time that I upgraded my MacBook Pro memory to 16GB, so that helps. The project is at GitHub: https://github.com/eellpp/spark-yarn-hadoop-cluster-vagrant. It is a Vagrant project to spin up a cluster of 4 nodes of 64-bit CentOS 6.5 Linux virtual machines with Hadoop v2.6.0, Spark v1.6.1 and Hive v1.2.1. This is suitable as a quick hands-on and development...
-
Chef can model your infrastructure as source code. This makes the process of infrastructure deployment repeatable and maintainable. It has three essential components: Chef Server, Workstation and Nodes. The Workstation is the development machine. Changes are pushed from the Workstation to the Chef Server, which in turn manages the nodes. The source code that models the infrastructure consists of scripts called recipes in Chef terminology. A collection of recipes which manages an application could be a cookbook....
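As a small sketch, a hypothetical recipe that installs and runs a web server might look like this (Chef recipes are a Ruby DSL; the httpd package name assumes a Red Hat style node):

    # recipes/default.rb -- hypothetical example recipe
    package 'httpd'            # install the Apache web server

    service 'httpd' do         # ensure it starts on boot and is running
      action [:enable, :start]
    end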
-
Instead of the default non-distributed or standalone mode, where Hadoop runs as a single Java process in a single JVM instance with no daemons and without HDFS, we can run it in pseudo-distributed mode to make it closer to the production environment. In this mode Hadoop simulates a cluster on a small scale: different Hadoop daemons run in different JVM instances, but on a single machine, and HDFS is used instead of...
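The switch is driven by configuration; for example, pointing Hadoop at a local HDFS instance in core-site.xml and keeping a single replica in hdfs-site.xml (port 9000 is the conventional default):

    <!-- core-site.xml -->
    <configuration>
      <property>
        <name>fs.defaultFS</name>
        <value>hdfs://localhost:9000</value>
      </property>
    </configuration>

    <!-- hdfs-site.xml: one machine, so a single replica -->
    <configuration>
      <property>
        <name>dfs.replication</name>
        <value>1</value>
      </property>
    </configuration>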
-
When working with AWS the first thing to look into is how to secure your access to AWS resources: basically authentication, access policies and restrictions. Since this is a cloud environment, you have to be circumspect about everything from a security perspective and double check each of them. Anything that is not secured will be broken into. The good thing is that AWS provides an easy and comprehensive security service with IAM. It comes at no additional...
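For a flavor of how access policies look, here is a hypothetical IAM policy granting read-only access to a single S3 bucket (the bucket name is a placeholder):

    {
      "Version": "2012-10-17",
      "Statement": [
        {
          "Effect": "Allow",
          "Action": ["s3:GetObject", "s3:ListBucket"],
          "Resource": [
            "arn:aws:s3:::example-bucket",
            "arn:aws:s3:::example-bucket/*"
          ]
        }
      ]
    }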
-
Two essential components are required for continuous integration: a central place to hold the repository and an automated build tool. With these in place, at any time, anyone can check out a working version of the code which can be deployed or further enhanced with additional features. We are using BitBucket to hold our repo and Jenkins as our continuous integration build tool. This article details the steps for setting this up. The integration server was running on...
-
The methods used to evaluate the prediction accuracy of models depend on the type of models involved.

Regression models

In regression models we are dealing with numeric values. We are interested in knowing how far away our predicted values are from the expected values. For this purpose we calculate the Root Mean Square Error (RMSE). This is the square root of the average of the squared errors. We evaluate the model based on the...
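With y_i the observed values, ŷ_i the predictions and n the number of points:

    \mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(\hat{y}_i - y_i\right)^2}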
-
Node.js by default runs in a single process and at most utilizes one CPU. To take full advantage of a multi-core system, multiple Node processes can be run with a frontend proxy interfacing with the client. This scenario can be supported in Node.js using the cluster module. However, with WebSockets it is additionally required to provide a mapping between the client session id and the server handling the client's requests. This is achieved by...
-
Iris DataSet

The data set contains 3 classes of 50 instances each, where each class refers to a type of iris plant. One class is linearly separable from the other 2; the latter are NOT linearly separable from each other. Predicted attribute: class of iris plant. http://archive.ics.uci.edu/ml/datasets/Iris

    data(iris)
    head(iris)
    ##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
    ## 1          5.1         3.5          1.4         0.2  setosa
    ## 2          4.9         3.0          1.4         0.2  setosa
    ## 3          4.7         3.2...
-
The tm package provides functionality for exploratory analysis on a text corpus. The nature of the text corpus has a big effect on the data modeling. Some of the text corpora that I have worked on in the past include:
- Structured news feeds
- Unstructured HTML news webpages
- Web page comments
- Tweets
- A corpus of posts on news sharing forums
- Emails

After the text preprocessing stage all of these would be a list of documents. But the nature of data in...
-
These are various notes and snippets from various sources that I found useful while learning Scala.

Returns are discouraged

In fact, odd problems can occur if you use return statements in Scala because it changes the meaning of your program. For this reason, return statement usage is discouraged. "The return keyword is not 'optional' or 'inferred'; it changes the meaning of your program, and you should never use it." http://tpolecat.github.io/2014/05/09/return.html (see the sketch below)

Important thing in Scala collection...
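Going back to the return note above, a small sketch of why it bites: inside a lambda, return exits the enclosing method, not just the lambda:

    def sum(ns: List[Int]): Int =
      ns.foldLeft(0)((a, b) => return a + b)  // `return` exits sum itself

    sum(List(1, 2, 3))  // evaluates to 1, not 6: the first fold step returns from sum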
-
Apply functions

mapply

Use mapply to operate on each element of multiple vectors. Say we have monthly sales for two years and we want to find the month-wise delta:

    # hypothetical sample data for illustration
    year1 <- c(10, 20, 30, 40, 50)
    year2 <- c(12, 24, 36, 48, 60)
    mapply(function(x, y) x - y, year2, year1)
    ## [1]  2  4  6  8 10

lapply

Use lapply to operate on each element of a list and return a list:

    lapply(mtcars$mpg, function(x) sqrt(x))

sapply

Use sapply to operate on each element of a list and return a vector:

    sapply(mtcars$mpg, function(x) sqrt(x))...
-
The common data structures in R are Vectors, Lists, Arrays, Matrices and Dataframes.

### Vector Operations

    # values from 1:5
    vec <- c(1:5)
    vec <- c(sample(5))
    vec <- rnorm(5, mean = 2.5, sd = 2)
    vec <- c(letters[1:5])
    vec_name <- c("id" = 1, "age" = 21)
    vec_name <- c("id" = 1:5, "age" = 21:25)
    ##  id1  id2  id3  id4  id5 age1 age2 age3 age4 age5
    ##    1    2    3    4    5   21   22   23   24   25

    # check whether value exists in vector...
-
A generative algorithm models how the data was generated by estimating the assumptions and distribution of the original model that generated the data. If x is the input and y the output, then it tries to learn the joint probability distribution p(x,y). For example, in natural language processing, probabilistic Latent Semantic Analysis (pLSA) and Latent Dirichlet Allocation (LDA) are generative models. They make the assumption that each document contains a mixture of topics and each word...
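In symbols, the generative model factorizes the joint distribution and can recover the predictive distribution via Bayes' rule:

    p(x, y) = p(x \mid y)\, p(y), \qquad
    p(y \mid x) = \frac{p(x \mid y)\, p(y)}{p(x)}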
-
Pass by Promise

R does not copy variables when passing them around unless there is a change involved. From the R language manual (http://cran.at.r-project.org/doc/manuals/R-lang.html#Evaluation): "R has a form of lazy evaluation of function arguments. Arguments are not evaluated until needed. It is important to realize that in some cases the argument will never be evaluated. Thus, it is bad style to use arguments to functions to cause side-effects." While in C it is common to...
-
Boost your Vim Productivity

Leader is an awesome idea. It allows for executing actions by key sequences instead of key combinations. Because I'm using it, I rarely need to press a Ctrl-something combo to make things work. For a long time I used , as my Leader key. Then I realized I can map it to the most prominent key on my keyboard: Space.

    let mapleader = "\<Space>"

This turned my Vim life upside down. Now I...
-
Using Spell check in vim

    # start the spell check
    :set spell
    # move forward to the next misspelled word
    ]s
    # move backward to the previous misspelled word
    [s
    # suggest alternatives
    z=
    # add word to dictionary
    zg
    # mark word as incorrect
    zw

Commenting/Uncommenting multiple lines or vertical column text selection

To comment out blocks in vim:
- hit ctrl+v (visual block mode)
- use the down arrow keys to select the lines you want (it...