So you are interested in Big Data and ready to explore its huge potential? To get started, you need to know the basics and how the overall process works. I recently learned some of those basics through my Master's coursework, and in this post I would like to share what I learned about the big data analytics pipeline. This is a basic guideline, so it won't go deep into each step, but I will point to some code examples I have already written so that you have a starting point for practicing.
What Is an Analytics Pipeline?
In the domain of big data, an analytics pipeline refers to the workflow of turning data into insights that are useful for a business. Let's say you sell products online and have all orders and related information (users' previous ordering actions, visit durations per page, etc.) in your database. Using this information, it is often possible to make statistically grounded decisions that improve your sales conversion rate, user engagement, and so on, and thus help the business grow.
Well, data by itself isn't going to give us anything right away. We have to set some goals, plan accordingly, and use the data in a way that produces statistical results we can base decisions on. Let's go over the basic steps one by one:
1. Defining the Goal:
This is the most important brainstorming phase, where we define a specific goal to achieve. To reach that goal, we formulate a hypothesis, which in the end may be proved either true or false; either way, it helps the decision-making process. However, there may be situations where the data shows no statistical significance for our hypothesis. In that case, it's better to abandon it completely or try again with more noticeable changes to the data set. Based on the hypothesis, we also need to specify some measurable attributes.
In my case, I was doing a course assignment on the Google Chrome project and tried to find suggestions that might help increase developers' productivity. So I came up with a simple hypothesis: if many people (developers) work on a source code file, that file tends to generate a lot of bugs, too!
It may sound obvious, but it's just a course assignment! In practice, you will develop much more complex hypotheses about your business.
You can check my example goal definition and attribute specification in the first part of this readme.md file:
2. Collecting Raw Data:
This step simply means gathering raw data (in many cases, raw HTML files) from the source (websites, a production DB, etc.) and saving it locally on the machine where we are about to run the big data analytics.
In my case, the necessary raw data was the Google Chrome project's source control history and its bug information (collected by downloading the raw HTML from the project management website).
You can see an easy example of how to collect/scrape raw data from a source control repository with Python here:
https://github.com/ranacseruet/sm-6611/tree/master/A1
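If you want a feel for what this step looks like before diving into the repository, here is a minimal sketch in Python using the requests library. The URL and file name are hypothetical placeholders, not the actual sources used in the assignment:

```python
import requests

# Hypothetical source page -- replace with the page you actually want to archive.
BUG_LIST_URL = "https://example.com/project/bugs?page=1"

def download_raw_page(url, out_path):
    """Fetch a page and save the raw HTML locally for later processing."""
    response = requests.get(url, timeout=30)
    response.raise_for_status()  # fail loudly on HTTP errors
    with open(out_path, "w", encoding="utf-8") as f:
        f.write(response.text)

if __name__ == "__main__":
    download_raw_page(BUG_LIST_URL, "bugs_page_1.html")
```

The point is just to archive the raw pages as-is; parsing and extraction come later, so you can re-run them without hitting the source again.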
3. Extract Attributes And Link To Database:
This is where we apply regular expressions to extract the attributes of interest, and it is a very important step in the big data analytics pipeline.
Based on my hypothesis, the necessary information was the number of contributors per file and the number of bugs reported per file. Some additional data, such as the average relative experience of the developers working on the Google Chrome project, was also collected to add more value to the hypothesis and decision-making.
As an example of extracting the necessary attributes and saving them to the database, you can check the code part of my second assignment: https://github.com/ranacseruet/sm-6611/tree/master/A2
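To make the idea concrete, here is a small hedged sketch of this step: it assumes a made-up commit-log line format (not Chrome's actual format) and stores the extracted attributes in SQLite.

```python
import re
import sqlite3

# Hypothetical commit-log line format: "path/to/file.cc | author@example.com"
LINE_PATTERN = re.compile(r"^(?P<path>\S+)\s*\|\s*(?P<author>\S+@\S+)$")

conn = sqlite3.connect("analytics.db")
conn.execute("CREATE TABLE IF NOT EXISTS file_authors (path TEXT, author TEXT)")

with open("commit_log.txt", encoding="utf-8") as log:
    for line in log:
        match = LINE_PATTERN.match(line.strip())
        if match:
            conn.execute(
                "INSERT INTO file_authors (path, author) VALUES (?, ?)",
                (match.group("path"), match.group("author")),
            )

conn.commit()
conn.close()
```

Once the attributes live in a database like this, questions such as "how many distinct contributors does each file have?" become a single query.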
4. Analyzing Different Metrics:
Depending on your requirements, you may or may not need this step. If you have a complex hypothesis whose required data can't be found directly in the extracted information above, but instead needs additional calculations to derive an attribute's value, this is where those calculations happen.
In my case, I was calculating the LCOM (Lack of Cohesion of Methods) and CBO (Coupling Between Objects) metrics, and I needed additional support to understand the code metrics tool's API. Here is an example that you might follow:
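Computing LCOM and CBO requires a dedicated code-metrics tool, so I won't reproduce that here. But as a simpler illustration of deriving a new attribute from already-extracted data, here is a hypothetical sketch that counts distinct contributors per file from the file_authors table used in the earlier sketch:

```python
import sqlite3

conn = sqlite3.connect("analytics.db")

# Derived attribute: number of distinct contributors per file,
# computed from the raw rows extracted in the previous step.
rows = conn.execute(
    """
    SELECT path, COUNT(DISTINCT author) AS contributors
    FROM file_authors
    GROUP BY path
    ORDER BY contributors DESC
    """
).fetchall()

for path, contributors in rows[:10]:
    print(f"{path}: {contributors} contributors")

conn.close()
```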
5. Develop Statistical Model:
Once you have the necessary data, you can proceed to analyze it and see whether it really matters to you.
In my case, here are some outputs I found from the statistics:
- An increasing number of collaborators (programmers) tends to increase bugs, too.
- Programmers with more experience tend to generate fewer bugs.
The above might seem obvious, but we needed statistical proof, too!
Here is the statistical model generated by R Studio:
As a basic reading of the output: the '*' marks statistical significance, and more '*'s mean stronger significance. Also, a negative t value indicates a negative correlation, and a positive value a positive correlation. For example, experience and bugs are negatively correlated, while the number of developers and bugs are positively correlated.
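My model was fit in R Studio, but if you prefer to stay in Python, here is a rough analogue (a sketch, not my original analysis) using pandas and statsmodels. The CSV file and column names are hypothetical; the summary reports the same kind of coefficients, t values, and '*' significance codes discussed above.

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical table: one row per file, with the attributes extracted earlier
# (columns assumed: bugs, developers, experience).
df = pd.read_csv("file_metrics.csv")

# Linear model: do developer count and experience explain bug counts?
model = smf.ols("bugs ~ developers + experience", data=df).fit()
print(model.summary())  # coefficients, t values, and significance stars
```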
Final Words:
If I am motivated enough, I may write more about the major steps. In the meantime, if you have difficulty with any specific section described above, please let me know in the comments so that I can clarify the necessary details.