Introduction to Big Data & Hadoop


To understand what Big Data and Hadoop are, you first need to understand what data is.

What is data?

In general, data is any set of characters that is gathered and translated for some purpose, usually analysis.

What is Big Data?

Big Data is also data, but of a huge size. Big Data is a term used to describe a collection of data that is huge in volume and growing over time. In short, such data is so large that no traditional data management technique can store and process it efficiently.


Big Data Examples

Big Data is getting bigger every minute in almost every sector. The volume of data processing we are talking about is mind-boggling. Here is some information to give you an idea:

  • The Weather Channel receives 18,055,555 forecast requests every minute.
  • Netflix users stream 97,222 hours of video every minute.
  • Twitter users post 473,400 tweets every minute.
  • Facebook generates 4 new petabytes of data per day.

A single jet engine can generate 10+ terabytes of data in 30 minutes of flight time. With many thousands of flights per day, data generation reaches many petabytes.

You can check the stats here:

 https://www.internetlivestats.com/

Types Of Big Data

Big Data comes in three forms:

  1. Structured
  2. Unstructured
  3. Semi-structured

Structured

Any data that can be stored, accessed, and processed in a fixed format is termed ‘structured’ data. Over time, computer science has achieved great success in developing techniques for working with this kind of data (where the format is well known in advance) and deriving value from it.

Unstructured

Any data whose form or structure is unknown is classified as unstructured data. In addition to its huge size, unstructured data poses multiple challenges when it comes to processing it and deriving value from it. A typical example of unstructured data is a mix of plain text files, images, videos, etc.

Semi-structured

Semi-structured data can contain both forms of data, that is, structured and unstructured. It has tags or markers that separate elements, but no rigid schema.
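JSON is a common example of a semi-structured format: every field is tagged, but records need not share the same schema. A minimal Python sketch (the records below are made up for illustration):

```python
import json

# Two made-up semi-structured records: fields are tagged, but the
# two records do not share the same set of fields (no fixed schema).
records = [
    '{"name": "Alice", "age": 30, "phone": "555-0101"}',
    '{"name": "Bob", "skills": ["Hadoop", "Spark"]}',
]

for raw in records:
    rec = json.loads(raw)  # parse the tagged structure
    print(rec["name"], "->", sorted(rec.keys()))
```

Notice that a relational table could not hold both records without null columns; the self-describing tags are what make the data semi-structured rather than structured.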

Characteristics of Big Data


Credits: Simplilearn

Volume

Volume refers to the unimaginable amounts of information generated every second from sources such as social media, cell phones, cars, and credit cards. To cope, we currently use distributed systems to store data in several locations and bring it together with a software framework like Hadoop.

Variety

Big Data is generated in multiple varieties. Compared to traditional data like phone numbers and addresses, the latest data arrives as photos, videos, audio, and much more, making roughly 80% of it completely unstructured.

Veracity

Veracity refers to the degree of reliability the data has to offer. Since a major part of the data is unstructured and irrelevant, Big Data systems need ways to filter it or translate it, because the data is crucial for business decisions.

Value

Value is the major issue we need to concentrate on. It is not just the amount of data that we store or process that matters; it is the amount of valuable, reliable, and trustworthy data that needs to be stored, processed, and analyzed to find insights.

Velocity

The term ‘velocity’ refers to the speed at which data is generated. How fast the data is generated and processed to meet demands determines the real potential of the data.

The most common solution nowadays, used by almost all companies, is distributed storage.

Distributed storage is a concept in which data is split across multiple physical servers, often in more than one data center. It typically takes the form of a cluster of storage units. The arrangement of nodes in such an implementation is called a topology, and this particular one is known as a master–slave topology.

Consider that we have 1000 GB of data but limited resources to store it. One might think we should simply buy 1000 GB of storage and put the data there. We could store the data that way, but processing it would take a long time because a single device becomes an I/O bottleneck.

So the best solution for this kind of use case is to divide the 1000 GB into 10 parts of 100 GB each and store them in 10 different storage centers. This way the data is stored efficiently, which removes the volume problem, and it is also stored in less time, which removes the velocity problem.

The storage centers where we distribute our data are known as Slave Nodes, and the node from which we distribute the data to them is known as the Master Node. Together, all these nodes form an infrastructure called a cluster; in the Big Data world, this is known as a Distributed Storage Cluster.
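The splitting idea above can be sketched in a few lines of Python: a hypothetical master that cuts a dataset into fixed-size blocks and assigns them round-robin to slave nodes. The function names and sizes are illustrative only; real HDFS does this internally (with a default block size of 128 MB, not 100 GB).

```python
# Illustrative sketch of a master node splitting data into blocks
# and assigning them round-robin to slave nodes. Not Hadoop's API.

def split_into_blocks(total_gb: int, block_gb: int) -> list[int]:
    """Return a list of block sizes (in GB) covering total_gb."""
    full, rem = divmod(total_gb, block_gb)
    return [block_gb] * full + ([rem] if rem else [])

def assign_blocks(blocks: list[int], slaves: list[str]) -> dict[str, int]:
    """Round-robin placement: slave name -> total GB stored on it."""
    placement = {s: 0 for s in slaves}
    for i, size in enumerate(blocks):
        placement[slaves[i % len(slaves)]] += size
    return placement

blocks = split_into_blocks(1000, 100)           # ten 100 GB blocks
slaves = [f"slave-{n}" for n in range(1, 11)]   # ten slave nodes
print(assign_blocks(blocks, slaves))            # each slave holds 100 GB
```

Because each slave holds only a tenth of the data, all ten can read and write their blocks in parallel, which is exactly how the volume and velocity problems described above are addressed.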

This is exactly the problem that Hadoop solves.

What is Apache Hadoop?




Apache Hadoop is open-source software for storing data and running applications on clusters of commodity hardware. It provides massive storage for any kind of data, enormous processing power, and the ability to handle virtually limitless concurrent tasks or jobs.
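Hadoop's processing layer, MapReduce, expresses a computation as a map step that emits key–value pairs from each record, followed by a reduce step that aggregates values by key across the cluster. A tiny pure-Python sketch of that idea, using the classic word-count example (this is not Hadoop's actual Java API; the function names are mine):

```python
from collections import defaultdict

def map_phase(lines):
    """Map: emit a (word, 1) pair for every word in every line."""
    for line in lines:
        for word in line.split():
            yield word.lower(), 1

def reduce_phase(pairs):
    """Reduce: sum the counts for each word.

    In real Hadoop, a 'shuffle' groups pairs by key across nodes;
    here a single dict plays that role.
    """
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

data = ["Big Data Hadoop", "Hadoop stores big data"]
print(reduce_phase(map_phase(data)))
```

The point of the split is that the map step can run independently on each slave node's local blocks, so the work scales out with the cluster just like the storage does.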

Conclusion

So, we learned how MNCs like Google, Facebook, Google Pay, etc. solve the challenges of Big Data using this concept.

This is what I learned in just 2 days of the ARTH Journey.

Thanks to Mr. Vimal Daga sir for sharing this great information on Big Data!
