So you are interested in Big Data? Ready to explore its huge potential? Well, to get started, you have to know the basics first: how the whole process works. Recently, I learned some of these basics through my Master's coursework, and I would like to share what I learned in this post about the big data analytics pipeline. This is a basic guideline, so it won't include deep details on each step. But I will refer to some code examples I have already done, so that you have a starting point for practicing.
What is an Analytics Pipeline?
In the domain of big data, an analytics pipeline refers to the workflow/process of turning data into interesting insights that are very useful for a business. Let's say you sell products online and you have all orders and other related information (users' previous and next ordering actions, page visit durations, etc.) in your database. Using this information, it is often possible to make statistically grounded decisions that improve your sales conversion rate, user engagement, and so on, and thus help the business grow in a much better way.
Well, data itself isn’t going to give us something right away. We have to come up with a goal, plan accordingly, and use the data in such a way that it gives us statistical results from which we can draw decisions. Let's go over the basic steps one by one:
1. Defining Goal:
This is the most important brainstorming phase, where we try to define a specific goal to achieve. To achieve that goal, we have to come up with a hypothesis, which, in the end, may be proved either true or false. Either way, it helps the decision-making process. However, there might be situations where the data does not show enough statistical significance for our hypothesis. Then it's better to abandon it completely, or try again with some more noticeable changes in the data set. Based on the hypothesis, we also need to specify some measurable attributes.
In my case, I was doing my course assignment on the Google Chrome project and tried to find some good suggestions that might help increase developer productivity. So I came up with a simple hypothesis: if a lot of people (developers) work on a source code file, that file tends to generate a lot of bugs too!
Sure, it may sound obvious, but it's just a course assignment! You might come up with a much more complex hypothesis for your business.
You can check my example goal definition and attribute specification in the first part of this readme.md file:
2. Collecting Raw Data:
This step is simply about gathering raw data (in many cases, raw HTML files or similar) from the source (websites, a production database, etc.) and saving it locally on the computing machine where we are about to run the big data analytics.
In my case, the necessary raw data was the source control history of the Google Chrome project and the bug information for the project (tracked by downloading the raw HTML from its project management website).
You can see an easy example of how to collect/scrape raw data from a source control repository with Python here:
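To give a flavor of this step, here is a minimal sketch of collecting raw pages and saving them locally. The function names and the injectable `fetch` callback are my own illustration (not from the assignment code); in a real run you would pass a small wrapper around `urllib.request.urlopen` as `fetch`.

```python
import os

def collect_pages(urls, out_dir, fetch):
    """Download each URL with fetch(url) -> str and save the raw
    HTML locally, one file per page; return the saved paths."""
    os.makedirs(out_dir, exist_ok=True)
    paths = []
    for i, url in enumerate(urls):
        path = os.path.join(out_dir, "page_%04d.html" % i)
        with open(path, "w", encoding="utf-8") as f:
            f.write(fetch(url))
        paths.append(path)
    return paths

# In practice, fetch could simply be:
#   from urllib.request import urlopen
#   fetch = lambda url: urlopen(url).read().decode("utf-8")
```

Taking `fetch` as a parameter keeps the download logic testable without hitting the network.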
3. Extract Attributes And Link To Database:
This is where we apply regular expressions and extract the attributes we are interested in.
Based on my hypothesis, the necessary information was the number of contributors per file and the number of bugs reported per file. Some additional data was also collected to add more value to the hypothesis and decision-making, such as the average relative experience of the developers working on the Google Chrome project.
As an example of extracting the necessary attributes and saving them to a database, you can check the code part of my second assignment: https://github.com/ranacseruet/sm-6611/tree/master/A2
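The idea behind this step can be sketched in a few lines: apply a regular expression to the raw text, then push the extracted attributes into a database. The bug-report line format below is purely hypothetical (the real pattern depends on the tracker's actual HTML), and SQLite stands in for whatever database you use.

```python
import re
import sqlite3

# Hypothetical line format for a scraped bug report; the real
# pattern depends on the tracker's actual HTML.
BUG_RE = re.compile(r"Issue\s+(\d+):.*?\bfile=([\w./]+)")

def extract_bugs_per_file(raw_text):
    """Count reported bugs per source file via a regular expression."""
    counts = {}
    for _issue_id, path in BUG_RE.findall(raw_text):
        counts[path] = counts.get(path, 0) + 1
    return counts

def save_counts(counts, db_path):
    """Link the extracted attribute to a database table (SQLite here)."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS file_bugs (path TEXT PRIMARY KEY, bugs INTEGER)"
    )
    conn.executemany(
        "INSERT OR REPLACE INTO file_bugs VALUES (?, ?)", counts.items()
    )
    conn.commit()
    conn.close()
```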
4. Analyzing different metrics:
This is a step that you may or may not need, based on your requirements. If you have a complex hypothesis whose required data can't be found directly in the information extracted above, but instead needs some additional calculations to derive the attribute values, we should consider this step.
In my case, I was calculating the LCOM (Lack of Cohesion in Methods) and CBO (Coupling Between Objects) metrics. I needed additional support from the Understand code metrics tool's API. Here is the example that you might follow:
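Just to illustrate what a metric calculation looks like, here is a deliberately simplified, hand-rolled version of CBO. This is not the Understand API (which does this analysis properly on real source code); it merely assumes you already have, per class, the set of class names it references.

```python
def coupling_between_objects(references):
    """Toy CBO: given a mapping class -> set of class names it
    references, count the distinct *other* classes each is coupled to."""
    return {cls: len(refs - {cls}) for cls, refs in references.items()}

# Example input (made up): Browser touches Tab and Cache, Tab touches
# Cache, and self-references are ignored.
refs = {
    "Browser": {"Tab", "Cache", "Browser"},
    "Tab": {"Cache"},
    "Cache": set(),
}
```

A real tool also resolves inheritance, method calls, and field accesses, which is exactly why an API like Understand's is worth using here.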
5. Develop Statistical Model:
As soon as we have the necessary data, we can proceed with analyzing whether it really means something to us.
In my case, here are some outputs I found from the statistics:
- Increasing the number of collaborators (programmers) really does tend to increase bugs.
- Programmers with more experience tend to generate fewer bugs.
The above might seem obvious, but we needed statistical proof too!
Here is the statistical model generated by R Studio:
Just as a basic guide: a ‘*’ means statistical significance, and more ‘*’s mean more significance. Also, a negative t value indicates a negative correlation, and a positive value a positive one. For example, experience and bugs are negatively correlated, while number of developers and bugs are positively correlated.
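The sign intuition above can be reproduced without R Studio. Here is a tiny ordinary least squares fit in plain Python, run on made-up toy numbers (not the actual Chrome data): the slope comes out positive for developers vs. bugs and negative for experience vs. bugs.

```python
def least_squares(xs, ys):
    """Ordinary least squares fit of y = a + b*x.
    The sign of the slope b gives the direction of the relationship."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    return my - b * mx, b

# Made-up toy numbers for illustration only:
devs = [1, 2, 3, 4, 5]           # developers touching a file
bugs = [2, 3, 5, 6, 9]           # bugs reported against it
_, slope = least_squares(devs, bugs)     # slope > 0: positively related

exp = [1, 2, 3, 4, 5]            # average developer experience
bugs2 = [9, 7, 6, 4, 2]
_, slope2 = least_squares(exp, bugs2)    # slope2 < 0: negatively related
```

R's `lm()` computes the same slope, plus the standard errors behind the t values and ‘*’ markers in its summary output.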
If I get enough motivation, maybe I will write more on the major steps in detail. However, if you have difficulty proceeding with any specific section described above, please let me know via the comments so that I can understand and clarify the necessary details.