CodeSamplez.com

Programming, Web development, Cloud Technologies


Getting Started With Big Data Analytics Pipeline

Rana Ahsan, May 9, 2015



So you are interested in Big Data and ready to explore its huge potential? To get ready, you first have to know the basics of how it all works. I recently learned some of these basics through my Masters course work, and in this post I would like to share what I learned about the big data analytics pipeline. This is a basic guideline, so it won’t go into deep detail on each step. But I will refer to some code examples I have already written, so that you have a starting point for practicing.

What Is an Analytics Pipeline?

In the domain of big data, an analytics pipeline is the workflow of processing data to build insights that are useful to a business. Let's say you sell products online and your database holds all the orders and related information (the actions a user takes before and after ordering an item, time spent on a page, etc.). Using this information, it is often possible to make statistics-based decisions that improve your sales conversion rate, user engagement and so on, and thus help the business grow.

Data by itself isn’t going to tell us anything right away. We have to set a goal, plan accordingly, and use the data in a way that gives us statistical results from which we can draw conclusions. Let's go over the basic steps one by one:

1. Defining Goal:

This is the most important brainstorming phase, where we define a specific goal to achieve. To reach that goal, we have to come up with a hypothesis, which may turn out to be either true or false in the end. Either way, it helps the decision-making process. However, there may be situations where the data shows no statistically significant support for our hypothesis. In that case it's better to abandon the hypothesis completely or try again with a noticeably different data set. Based on the hypothesis, we also need to specify some measurable attributes.

In my case, I did my course assignment on the Google Chrome project and tried to find suggestions that might help increase developer productivity. So I came up with a simple hypothesis: if a lot of people (developers) work on a source code file, that file tends to generate a lot of bugs too!

Sure, it may sound obvious, but it’s just a course assignment! You will probably come up with a much more complex hypothesis for your business.

You can check my example goal definition and attribute specification in the first part of this readme.md file:

https://github.com/ranacseruet/sm-6611/tree/master/A2

2. Collecting Raw Data:

This step is simply about gathering raw data (in many cases, raw HTML files or similar) from the source (websites, production database, etc.) and saving it locally on the machine where we are going to run the big data analytics.

In my case, the necessary raw data was the source control history of the Google Chrome project and the bug information for the project (obtained by downloading the raw HTML from their project management website).

You can see a simple example of how to collect/scrape raw data or a source control repository's history with Python here:
https://github.com/ranacseruet/sm-6611/tree/master/A1
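
As a rough illustration of this step (a minimal sketch, not the code from the repository above), the snippet below dumps the history of a local git clone into a raw text file. The `@@hash|email|date` header layout, the function names and the `git_log.txt` file name are assumptions made up for this example; only the `git log` options themselves are standard.

```python
import subprocess
from pathlib import Path

# Hypothetical header layout: "@@<hash>|<author email>|<date>".
# The "@@" prefix is only there so commits are easy to spot later.
LOG_FORMAT = "@@%H|%ae|%ad"


def git_log_command():
    """Build the `git log` call: one header line per commit, then its files."""
    return ["git", "log", "--name-only", "--date=short",
            f"--pretty=format:{LOG_FORMAT}"]


def save_raw(out_dir, name, text):
    """Write raw text to out_dir/name unchanged and return the path."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    path = out_dir / name
    path.write_text(text, encoding="utf-8")
    return path


def collect_git_history(repo_path, out_dir):
    """Dump the full history of a local clone into a raw text file."""
    result = subprocess.run(git_log_command(), cwd=repo_path,
                            capture_output=True, text=True, check=True)
    return save_raw(out_dir, "git_log.txt", result.stdout)
```

Calling `collect_git_history("path/to/clone", "raw")` would leave the untouched log in `raw/git_log.txt`. Keeping the raw dump separate from any parsing means the extraction step can be rerun without hitting the source again.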

3. Extract Attributes And Link To Database:

This is where we apply regular expressions to the raw data and extract the attributes we are interested in.

Based on my hypothesis, the necessary information was the number of contributors per file and the number of bugs reported per file. Some additional data was also collected to add more value to the hypothesis and decision-making, such as the average relative experience of developers working on the Google Chrome project.

As an example of extracting the necessary attributes and saving them to a database, you can check the code part of my second assignment: https://github.com/ranacseruet/sm-6611/tree/master/A2
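
To give a feel for what this step can look like, here is a minimal sketch (not the assignment code linked above) that pulls (file, author) pairs out of a raw log with a regular expression and counts contributors per file in a database. The raw log layout, the sample data, the table name and the function names are all assumptions for illustration, and an in-memory SQLite database stands in for whatever database you actually use.

```python
import re
import sqlite3

# Assumed raw layout: a header line "@@<hash>|<author email>|<date>",
# followed by one changed file path per line.
HEADER = re.compile(
    r"^@@(?P<sha>[0-9a-f]+)\|(?P<author>[^|]+)\|(?P<date>\d{4}-\d{2}-\d{2})$")

# Made-up sample data, only for demonstration.
SAMPLE = """@@a1b2c3|alice@example.com|2015-01-10
src/main.cc
src/util.cc

@@d4e5f6|bob@example.com|2015-01-12
src/main.cc
"""


def extract_touches(raw_log):
    """Yield (file_path, author) pairs from the raw log text."""
    author = None
    for line in raw_log.splitlines():
        line = line.strip()
        match = HEADER.match(line)
        if match:
            author = match.group("author")
        elif line and author is not None:
            yield line, author


def load(conn, touches):
    """Store the extracted pairs in a hypothetical `touches` table."""
    conn.execute("CREATE TABLE IF NOT EXISTS touches (file TEXT, author TEXT)")
    conn.executemany("INSERT INTO touches VALUES (?, ?)", touches)
    conn.commit()


def contributors_per_file(conn):
    """Return {file: number of distinct authors who changed it}."""
    rows = conn.execute(
        "SELECT file, COUNT(DISTINCT author) FROM touches "
        "GROUP BY file ORDER BY file")
    return dict(rows.fetchall())


conn = sqlite3.connect(":memory:")
load(conn, extract_touches(SAMPLE))
print(contributors_per_file(conn))  # {'src/main.cc': 2, 'src/util.cc': 1}
```

Once the attributes live in a table, measures such as contributors per file become a single `GROUP BY` query instead of custom counting code.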

4. Analyzing Different Metrics:

You may or may not need this step, depending on your requirements. If you have a complex hypothesis for which the required data can’t be found directly in the information extracted above, and some additional calculations are needed to get the attribute values, consider this step.

In my case, I calculated the LCOM (Lack of Cohesion in Methods) and CBO (Coupling Between Objects) metrics. For that I needed additional support from the API of the Understand code metrics tool. Here is an example that you might follow:

https://github.com/ranacseruet/sm-6611/tree/master/A3
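
To make the two metrics a bit more concrete, here is a toy sketch of their textbook (Chidamber and Kemerer style) definitions. It is not what the Understand tool computes internally and it is not the assignment code; the input dictionaries, class names and function names are made up for illustration, and a real tool would derive that information by parsing the source code.

```python
from itertools import combinations


def lcom(method_fields):
    """Textbook LCOM: method pairs sharing no field minus pairs sharing one.

    method_fields maps a method name to the set of instance fields it uses.
    The result is floored at 0, as in the original definition.
    """
    disjoint = sharing = 0
    for fields_a, fields_b in combinations(method_fields.values(), 2):
        if fields_a & fields_b:
            sharing += 1
        else:
            disjoint += 1
    return max(disjoint - sharing, 0)


def cbo(class_name, references):
    """Count the classes coupled to class_name, in either direction.

    references maps a class name to the set of classes it uses.
    """
    coupled = set(references.get(class_name, set()))
    coupled |= {other for other, used in references.items()
                if class_name in used}
    coupled.discard(class_name)
    return len(coupled)


# Made-up example class: "open" and "close" share a field, "log" shares none.
methods = {
    "open": {"path", "handle"},
    "close": {"handle"},
    "log": {"logger"},
}
print(lcom(methods))  # 1  (2 disjoint pairs - 1 sharing pair)

# Made-up dependency map: A uses B and C, while B and D use A.
uses = {"A": {"B", "C"}, "B": {"A"}, "D": {"A"}}
print(cbo("A", uses))  # 3  (B, C and D)
```

A higher LCOM suggests a class is doing several unrelated jobs, and a higher CBO suggests a class is harder to change in isolation.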

5. Develop Statistical Model:

As soon as we have the necessary data, we are ready to analyze it and see whether it really means something to us.

In my case, here are some results I found from the statistics:

  • An increasing number of collaborators (programmers) really does tend to increase bugs too.
  • Programmers with more experience tend to generate fewer bugs.

The above might seem obvious, but we needed statistical proof too!

Here is the statistical model generated by RStudio:
[Figure: Statistical Model]

As a basic reading guide, a ‘*’ marks statistical significance, and more ‘*’ means more significance (a smaller p-value). Also, a negative t value means the variables are inversely correlated, and a positive t value means they are positively correlated. For example, experience and bugs are negatively correlated, while the number of developers and bugs are positively correlated.
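
To show where the sign of a t value comes from, here is a small sketch that fits a one-variable linear regression by hand and returns the slope with its t value. This is a simplification: the model in the article was built in R and has several variables, and every number below is made up purely for illustration.

```python
import math


def slope_t_value(x, y):
    """Fit y = intercept + slope * x by least squares.

    Returns (slope, t value of the slope). Needs at least 3 points
    that do not lie on a perfectly straight line.
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    residual_ss = sum((yi - (intercept + slope * xi)) ** 2
                      for xi, yi in zip(x, y))
    std_err = math.sqrt(residual_ss / (n - 2) / sxx)
    return slope, slope / std_err


# Made-up data: more developers on a file, more bugs.
developers = [1, 2, 3, 4, 5, 6]
bugs_by_team_size = [2, 3, 5, 4, 7, 8]

# Made-up data: more experience, fewer bugs.
experience_years = [1, 2, 3, 4, 5, 6]
bugs_by_experience = [9, 7, 8, 5, 4, 2]

print(slope_t_value(developers, bugs_by_team_size))         # positive t value
print(slope_t_value(experience_years, bugs_by_experience))  # negative t value
```

The t value is just the estimated slope divided by its standard error, so it always has the same sign as the slope, and the further it is from zero, the smaller the p-value.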

Final Words:

If I get enough motivation, maybe I will write more about the major steps in detail. In the meantime, if you have difficulty proceeding with any specific section described above, please let me know via comments so that I can clarify the necessary details.

Filed Under: Development Tagged With: postgres, python, r, understand

About Rana Ahsan

Rana is a passionate software engineer/Technology Enthusiast.
Github: ranacseruet
