So you are interested in big data and ready to explore its huge potential? To get started, you first have to know the basics and how the process works overall. I recently learned some of these basics through my Master’s coursework, and in this post I would like to share what I learned about the big data analytics pipeline. This is a basic guideline, so it won’t go into deep detail on each step. I will, however, refer to some code examples I have already written so that you have a starting point for practicing.
What Is an Analytics Pipeline?
In the big data domain, an analytics pipeline is the workflow of processing data to produce insights that are useful to a business. Let’s say you sell products online and your database holds all orders and related information (the user’s actions before and after ordering items, time spent on a page, etc.). Using this information, it is often possible to make statistics-based decisions that improve your sales conversion rate, user engagement, and so on, and thus help the business grow.
Well, data by itself isn’t going to give us anything right away. We have to set some goals, plan accordingly, and use the data in such a way that it yields statistical results from which we can draw conclusions. Let’s go over the basic steps one by one:
1. Defining Goal:
This is the most important brainstorming phase, where we try to define a specific goal to achieve. To achieve that goal, we have to come up with a hypothesis, which in the end may be proved either true or false. Either way, it helps in the decision-making process. However, there might be situations where the data shows no statistically significant support for our hypothesis. In that case, it is better to either abandon the hypothesis completely or try again after making more substantial changes to the data set. Based on the hypothesis, we also need to specify some measurable attributes.
In my case, I was doing my course assignment on the Google Chrome project and looking for suggestions that might help increase developer productivity. So I came up with a simple hypothesis: if many people (developers) work on a source code file, that file tends to generate many bugs too!
It may sound obvious, but it’s just a course assignment! You will come up with a lot more complex hypotheses for your business.
You can check my example goal definition and attributes specification in the first part of this readme.md file:
2. Collecting Raw Data:
This step simply gathers raw data (in many cases, raw HTML files or similar) from the source (websites, production DB, etc.) and saves it locally on the machine where we are about to run the big data analytics.
In my case, the necessary raw data was the source control history of the Google Chrome project and its bug information (obtained by downloading the raw HTML from their project management website).
You can see a simple example of how we collect/scrape raw data or a source control repository with Python here:
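Separately from that example, here is a minimal sketch of the idea of this step: download a page and store it untouched, leaving all parsing for later. The function names, the `fetch` parameter, and the hash-based file naming are my own assumptions for illustration, not the code from the assignment.

```python
import hashlib
import urllib.request
from pathlib import Path


def fetch_url(url):
    """Download the raw bytes of a page (no parsing at this stage)."""
    with urllib.request.urlopen(url, timeout=30) as response:
        return response.read()


def save_raw_page(url, dest_dir, fetch=fetch_url):
    """Fetch one page and store it unchanged under dest_dir.

    The file name is a hash of the URL (an illustrative choice), so
    re-running the collector overwrites the same file instead of
    creating duplicates.
    """
    dest_dir = Path(dest_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)
    name = hashlib.sha1(url.encode("utf-8")).hexdigest() + ".html"
    target = dest_dir / name
    target.write_bytes(fetch(url))
    return target
```

Keeping the raw files on disk means the extraction step can be re-run as often as needed without hitting the source website again.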
3. Extract Attributes And Link To Database:
This is where we apply regular expressions to extract the attributes of interest. It is a very important step in the big data analytics pipeline.
Based on my hypothesis, the necessary information was the number of contributors per file and the number of bugs reported per file. Some additional data was also collected to add more value to our hypothesis and decision-making, such as the average relative experience of developers working on the Google Chrome project.
As an example of extracting the necessary attributes and saving them to the database, you can check the code part of my second assignment: https://github.com/ranacseruet/sm-6611/tree/master/A2
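To give a rough feel for this step without opening the repository, here is a small sketch that extracts attributes with regular expressions, stores them in SQLite, and then counts contributors and bugs per file. The two input line formats, the table names, and the function names are hypothetical and chosen only for illustration; the real Chrome data looks different.

```python
import re
import sqlite3

# Hypothetical change record: "<author>|<file path>"
CHANGE_RE = re.compile(r"^(?P<author>[^|]+)\|(?P<path>\S+)$")
# Hypothetical bug record: "BUG <id> <file path>"
BUG_RE = re.compile(r"^BUG\s+(?P<bug_id>\d+)\s+(?P<path>\S+)$")


def load(conn, change_lines, bug_lines):
    """Extract attributes from raw lines and store them in two tables."""
    conn.execute("CREATE TABLE IF NOT EXISTS changes (author TEXT, path TEXT)")
    conn.execute("CREATE TABLE IF NOT EXISTS bugs (bug_id INTEGER, path TEXT)")
    for line in change_lines:
        m = CHANGE_RE.match(line.strip())
        if m:  # lines that do not match the pattern are skipped
            conn.execute(
                "INSERT INTO changes VALUES (?, ?)",
                (m["author"].strip(), m["path"]),
            )
    for line in bug_lines:
        m = BUG_RE.match(line.strip())
        if m:
            conn.execute(
                "INSERT INTO bugs VALUES (?, ?)",
                (int(m["bug_id"]), m["path"]),
            )
    conn.commit()


def per_file_counts(conn):
    """Return {path: (distinct contributors, distinct bugs)}."""
    rows = conn.execute(
        """
        SELECT c.path,
               COUNT(DISTINCT c.author),
               (SELECT COUNT(DISTINCT b.bug_id)
                  FROM bugs b WHERE b.path = c.path)
          FROM changes c
         GROUP BY c.path
        """
    ).fetchall()
    return {path: (devs, bugs) for path, devs, bugs in rows}
```

Once the attributes are in a database, the per-file numbers needed for the hypothesis become a single query.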
4. Analyzing Different Metrics:
You may or may not need this step, depending on your requirements. If you have a complex hypothesis whose required data can’t be read directly from the information extracted above, and additional calculations are needed to obtain an attribute’s value, you should consider this step.
In my case, I was calculating the LCOM (Lack of Cohesion in Methods) and CBO (Coupling Between Objects) metrics. I needed additional support to understand the code metrics tool API. Here is an example that you might follow:
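Independently of any particular metrics tool, the two metrics themselves can be sketched by hand. The code below is not the tool API used in the assignment. It assumes the classic Chidamber and Kemerer definitions and takes hypothetical, already-extracted inputs: a map from method name to the fields it uses, and a map from class name to the classes it references.

```python
from itertools import combinations


def lcom(method_fields):
    """LCOM as P - Q, floored at zero.

    P = method pairs sharing no field, Q = method pairs sharing
    at least one field. A higher value suggests lower cohesion.
    """
    p = q = 0
    for fields_a, fields_b in combinations(method_fields.values(), 2):
        if fields_a & fields_b:
            q += 1
        else:
            p += 1
    return max(p - q, 0)


def cbo(class_name, references):
    """Count distinct other classes coupled to class_name.

    A class counts as coupled if class_name references it or it
    references class_name.
    """
    coupled = set(references.get(class_name, set()))
    for other, used in references.items():
        if class_name in used:
            coupled.add(other)
    coupled.discard(class_name)  # a class is not coupled to itself
    return len(coupled)
```

For example, a class whose methods `open` and `close` share a `handle` field while `size` only touches a `cache` field has one sharing pair and two non-sharing pairs, so `lcom` returns 1.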
5. Develop Statistical Model:
As soon as we have the necessary data, we can proceed to analyze whether it really means something to us.
In my case, here are some outputs I found from the statistics:
- An increasing number of collaborators (programmers) tends to increase bugs too.
- Programmers with more experience tend to generate fewer bugs.
The above might seem obvious, but we needed statistical proof too!
Here is the statistical model generated by RStudio:
As a basic guide, a ‘*’ marks statistical significance, and more ‘*’ means stronger significance (a smaller p-value). A negative t value indicates a negative relationship, and a positive t value a positive one. For example, experience and bugs are negatively correlated, while the number of developers and bugs are positively correlated.
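To make the sign of the t value more concrete, here is a small sketch for the simplest case, a regression with a single predictor. In that case the t statistic of the slope can be computed from the Pearson correlation r as t = r * sqrt((n - 2) / (1 - r^2)). This is a simplification of the R model above, which may use several predictors at once, and the function names are my own.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def slope_t_value(xs, ys):
    """t statistic for the slope of a one-predictor linear regression.

    Undefined when the points lie exactly on a line (r = 1 or r = -1).
    """
    n = len(xs)
    r = pearson_r(xs, ys)
    return r * math.sqrt((n - 2) / (1 - r * r))
```

With made-up numbers where bugs rise as developers are added, `slope_t_value` comes out positive. With numbers where bugs fall as experience grows, it comes out negative. The larger the absolute value, the smaller the p-value and the more ‘*’ R prints.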
If I get enough motivation, maybe I will write more about the major steps in detail. In the meantime, if you have difficulty with any of the sections described above, please let me know in the comments so that I can clarify the necessary details.