Big Data

June 1, 2016 | Author: Bijoy Chhetri | Category: N/A


Short Description

Exploring the future of Data Science...

Description

BIG DATA: An approach towards the future of Information Science

Bijoy Chhetri
Sr. Lecturer, Department of Computer Science and Engineering
Centre for Computers and Communication Technology
bijoychhetri@gmail.com

ABSTRACT

Today the term BIG DATA draws a lot of attention, but behind it there is a simple story. For decades, companies have been making business decisions based on transactional data stored in relational databases (RDBMS). Beyond that critical data, however, is a potential treasure of non-traditional, less structured data: weblogs, social media, email, sensors, and photographs that can be mined for useful information and knowledge delivery (KDD). Decreases in the cost of both storage and compute power have made it feasible to collect this data, which would have been thrown away only a few years ago. As a result, more and more companies are looking to include non-traditional (UNSTRUCTURED DATA) yet potentially very valuable data with their traditional enterprise data in their business intelligence analysis. To derive real business value from BIG DATA, this paper highlights the background of data science and the need for the right tools to capture and organize a wide variety of data types from different sources, and to analyze them easily within the context of all enterprise data, so that instantaneous, spontaneous and constant knowledge delivery can help the decision maker decide in real time. In this paper an attempt has been made to give a brief discussion of BIG DATA and its impact on data science.

Keywords: BIG DATA, Hadoop, MapReduce, Cloud, NoSQL.

1. INTRODUCTION

BIG DATA is very similar to small data, but bigger in scale, complexity and variable generation modes. Having BIG DATA means having to adopt newer technologies and a different approach to handling bigger datasets, with the aim of solving new problems and even solving old problems in a better way. Data have become a torrent flowing into every area of the global economy, and the science behind it is Data Science, which deals with the collection, preparation, analysis, visualization, management and preservation of large collections of information. In other words, data science [2] is the integration of methods from statistics, computer science, and other fields for gaining insights from data. In practice, data science encompasses an iterative process of data harvesting, cleaning, analysis and visualization, and implementation. Ultimately, this interdisciplinary and cross-functional field leads to decisions that move an organization forward, whether the business object of interest is product design, a proposed investment, or business strategy.

In the previous decade of data repositories, the traditional methodologies incorporated ad hoc querying and reporting techniques, with data mining techniques applied mostly to structured data. The data poured in from typical sources generating small to mid-size datasets, giving birth to different data warehousing (DW) techniques with Data Marts and Data Centers, where data-driven knowledge discovery is done with the help of the Business Intelligence fed to them. The method works fine for a limited range of data analysis but will certainly face constraints in processing large datasets in terms of time, cost, reliability and scalability.

With the rapid increase in the production of data from varied sources, and the increase in storage capacities along with the increase in processing power (Moore's Law), the term BIG DATA came into existence and gained importance. Varied ranges of data are generated from different sources in a continuous manner, demanding technologies that can handle newer types of data and integrate them to a decisive level. Apart from catering to the increase in the volume of data, what matters most are the newer analytical requirements: optimization, predictive analytics and complex statistical analysis, in more of a real-time manner. To quote from an IBM website, we create 2.5 quintillion bytes of data every day, so much that 90% of the data in the world today has been created in the last two years alone. This data comes from everywhere: sensors used to gather climate information, posts to social media sites, digital pictures and videos, purchase transaction records, and cell phone GPS signals, to name a few. This data is big data.

To provide efficient and reliable services, many factors have to be considered, such as the requirements of the application environment, the reliability of the protocol used in the network, and the consistency of the network. The MAC layer is a major factor in providing reliability and efficiency for a WSN. MAC is responsible for channel access policies, scheduling, buffer management and error control. In a WSN we need a MAC protocol that treats energy efficiency, reliability, low access delay and high throughput as major priorities, in order to accommodate the sensor's limited resources and to avoid redundant power consumption.

1.1 Cloud and BIG DATA

BIG DATA is fuelled by the properties of the cloud in terms of its creation and utilization, rather than being subject only to enterprise data, because of the on-demand computing services of a cloud architecture. The two are a permanent driving force for each other in a never-ending cycle. The data being produced by the Cloud is captured by the BIG DATA architecture and utilized in real time by the services that the cloud provides over the Internet. The cloud's advantages of affordability, economy of scale, agility and extensibility in turn enable us to access BIG DATA in a more efficient way. The newer technology used to analyze data on the Cloud will certainly drive the future design, enhancement and innovation of the cloud, which further caters to the needs of the BIG DATA technologies in place. The avalanche of BIG DATA is challenging, and it asks us to put data to use for the next generation of breakthroughs.

1.2 Defining BIG DATA [5]

Everyone in an enterprise or business wonders how big BIG DATA is for a particular requirement. Is it only a matter of size? Not necessarily: size is not the only element in describing BIG. One must also consider the speed and the various sources that pour in data simultaneously. As an example, 40 MB of .ppt slides, 1 TB of MRI scans or 1 PB of movie is BIG DATA only when the underlying infrastructure and technologies cannot support it. Data is BIG when it challenges the constraints of the existing system capabilities and the business need. A 40 MB presentation is certainly a big data set if it cannot be sent as an attachment, as is a 1 PB movie when it cannot be rendered for editing. Therefore an organization must weigh its capabilities in terms of technologies and architecture before calling any data big. Whenever fast-growing data pushes the limit of what common technology can support, it is definitely BIG DATA.

BIG DATA as defined by Wikipedia is a term for a collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications. Gartner defines it as high-volume, high-velocity and high-variety information assets that demand cost-effective, innovative forms of information processing for enhanced insight and decision making. It has also been described as a huge set of data whose size is beyond the ability of traditional database tools to capture, store, manage and analyze. As technology advances over time, the size of datasets that qualify as BIG DATA will also increase. Also note that the definition can vary by sector, depending on what kinds of software tools are commonly available and what sizes of datasets are common in a particular industry. With those qualifications, BIG DATA in many sectors today will range from a few dozen terabytes to multiple petabytes (thousands of terabytes).

Here are a few examples that cover the three V's of BIG DATA, namely Volume, Velocity and Variety, each of which has further dimensions as subsets, as shown in Fig-1. There is a 44 times increase of data size every year. In the Large Synoptic Survey Telescope (LSST) alone, over 30 thousand gigabytes (30 TB) of images will be generated every night during the decade-long LSST sky survey. Imagine 72 hours of video uploaded to YouTube every minute and 140 million tweets per day on average: these amount to the volume of data generated in the pool. The various formats, types and structures (text, numerical, images, audio, video, sequences, time series, social media data, multi-dimensional arrays, etc.) arriving from the different sources amount to the diverse variety of the data being generated. Google using smartphones as sensors to determine traffic conditions, retailers sending promotions right now for the store next to you based on your current location and purchase history, and sensors monitoring your activities and body for any abnormal measurements that require immediate reaction all amount to the velocity of the data.
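To put the volume figures into perspective, the per-minute and per-day rates quoted in the text (72 hours of video per minute and 140 million tweets per day, as read from the paper) can be converted to other time scales. The following is only a back-of-the-envelope sketch; the helper names are hypothetical and the rates are assumed to be constant over the day.

```python
# Unit conversions for the rates quoted in the text.
# Assumption: the rates are constant over a 24-hour day.
MINUTES_PER_DAY = 24 * 60
SECONDS_PER_DAY = MINUTES_PER_DAY * 60


def per_day(rate_per_minute):
    """Scale a per-minute rate up to a per-day total."""
    return rate_per_minute * MINUTES_PER_DAY


def per_second(rate_per_day):
    """Scale a per-day total down to a per-second rate."""
    return rate_per_day / SECONDS_PER_DAY


video_hours_per_day = per_day(72)            # 103680 hours of video per day
tweets_per_second = per_second(140_000_000)  # roughly 1620 tweets per second

print(f"Video uploaded per day: {video_hours_per_day} hours")
print(f"Tweets per second: {tweets_per_second:.0f}")
```

Seen this way, a single day of uploads corresponds to more than a decade of continuous viewing, which illustrates why volume alone strains conventional storage and processing.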

FIG(1 Three Dimensions of BIG DATA 1. -hallen"es of BIG DATA The pro"ress and innoation is no lon"er hindered by the ability to collect data but the ability to mana"e, analy/e, summari/e, isuali/e, and discoer !nowled"e from the collected data in a timely manner and in a scalable fashion would certainly be the challen"es of IT fraternity.

2. METHODOLOGIES USED TO ANALYSE BIG DATA

Fig-2 represents the different phases of harnessing, cleaning, analyzing and interpreting a huge set of data in an efficient manner.

2.1 Data Acquiring

The acquisition phase is one of the major changes in infrastructure from the days before big data. Because big data refers to data streams of higher velocity and higher variety, the infrastructure required to support the acquisition of big data must deliver low latency in both capturing data and in executing short, simple queries; be able to handle very high transaction volumes, often in a distributed environment; and support flexible, dynamic data structures. For this purpose, NoSQL databases are frequently used to acquire and store big data. They are well suited for dynamic data structures, are highly scalable, and do not use a fixed schema for storage of the data.
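The idea of storage without a fixed schema can be illustrated with a toy in-memory document store. This is a minimal sketch for illustration only: the class `DocumentStore` and its methods are hypothetical and do not correspond to the API of any particular NoSQL product.

```python
import copy


class DocumentStore:
    """Toy in-memory stand-in for a schema-less (NoSQL-style) store."""

    def __init__(self):
        self._docs = {}
        self._next_id = 1

    def insert(self, doc):
        """Store any dictionary as-is; no column list is declared up front."""
        doc_id = self._next_id
        self._docs[doc_id] = copy.deepcopy(doc)
        self._next_id += 1
        return doc_id

    def find(self, **criteria):
        """Return documents whose fields match all given key/value pairs."""
        return [
            doc for doc in self._docs.values()
            if all(doc.get(key) == value for key, value in criteria.items())
        ]


store = DocumentStore()
# Records from different sources carry different fields, yet share one store.
store.insert({"source": "weblog", "url": "/home", "status": 200})
store.insert({"source": "sensor", "device": "t-17", "celsius": 21.5})
store.insert({"source": "tweet", "text": "hello", "tags": ["greeting"]})

sensor_docs = store.find(source="sensor")
print(sensor_docs)
```

Each record keeps whatever fields its source produces, which is the flexibility the acquisition phase asks for. The trade-off is that the store itself no longer guarantees that a given field exists.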

2.2 Organizing Data

This phase includes cleaning and data integration, performed at the data's initial destination rather than by moving large volumes of data. The infrastructure required for organizing big data must be able to process and manipulate data in the original storage location, and must support high throughput to deal with large sets of data. For this purpose, Hadoop is a new technology that allows large data volumes to be organized and processed while keeping the data on the original data storage cluster. The Hadoop Distributed File System (HDFS) is the long-term storage system for web logs, for example. These web logs are turned into browsing sessions by running MapReduce programs on the cluster and generating aggregated results on the same cluster.
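The step of turning web logs into browsing sessions can be sketched in the map, shuffle and reduce style, here in plain Python on a single machine rather than on a Hadoop cluster. The log format (`timestamp user url`), the 30-minute inactivity threshold and all function names are assumptions made for this illustration.

```python
from collections import defaultdict

# Assumption: a gap of more than 30 minutes starts a new browsing session.
SESSION_GAP_SECONDS = 30 * 60


def map_phase(log_lines):
    """Map: emit a (user, timestamp) pair for each 'timestamp user url' line."""
    for line in log_lines:
        timestamp, user, _url = line.split()
        yield user, int(timestamp)


def shuffle(pairs):
    """Shuffle: group all emitted values by key, as the framework would."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(user, timestamps):
    """Reduce: count the sessions of one user from that user's timestamps."""
    ordered = sorted(timestamps)
    sessions = 1
    for previous, current in zip(ordered, ordered[1:]):
        if current - previous > SESSION_GAP_SECONDS:
            sessions += 1
    return user, sessions


logs = [
    "0 alice /home",
    "600 alice /cart",
    "4000 alice /home",
    "100 bob /home",
]

grouped = shuffle(map_phase(logs))
session_counts = dict(reduce_phase(user, ts) for user, ts in grouped.items())
print(session_counts)  # {'alice': 2, 'bob': 1}
```

On a real cluster the map and reduce functions would run in parallel on the nodes that already hold the log blocks, which is what keeps the data in its original storage location.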


2.3 Analyze Big Data

The analysis of the acquired, cleaned and transformed data can be done in the distributed environment. The infrastructure required for analyzing big data must be able to support deeper analytics, such as statistical analysis and data mining, on a wider variety of data types stored in diverse systems that tend to scale to extreme data volumes. It must also deliver faster response, withstand changes in behavior, and automate decisions based on analytical models. For example, analyzing the sales of a pizza house in combination with the events calendar for the venue in which the pizza house is located will dictate the optimal product demand and the business logic to drive the sales percentage.
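The pizza example amounts to joining a sales series with an events calendar and comparing demand on event days against ordinary days. The sketch below shows one simple way to do this; the dates, sales figures and event names are invented purely for illustration, and a real analysis would use far more data and a proper statistical model.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical daily pizza sales and a hypothetical venue events calendar.
sales = [
    ("2024-03-01", 120),
    ("2024-03-02", 210),
    ("2024-03-03", 115),
    ("2024-03-04", 230),
]
events = {"2024-03-02": "concert", "2024-03-04": "football match"}


def average_sales_by_event_flag(sales, events):
    """Average daily sales, split by whether an event took place that day."""
    buckets = defaultdict(list)
    for day, pizzas_sold in sales:
        buckets[day in events].append(pizzas_sold)
    return {has_event: mean(values) for has_event, values in buckets.items()}


averages = average_sales_by_event_flag(sales, events)
uplift = averages[True] / averages[False]

print(f"Event days: {averages[True]}, ordinary days: {averages[False]}")
print(f"Demand uplift on event days: {uplift:.2f}x")
```

An uplift factor of this kind is the sort of signal that could feed the stock and staffing decisions mentioned above.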

2.4 Interpretation of the Solution Spectrum

These new systems have created a divided solutions spectrum comprised of SQL solutions (the manageability, security and trusted nature of relational database management systems, RDBMS), Not Only SQL (NoSQL) solutions (developer-centric specialized systems) and NewSQL strategies (a perfect match of NoSQL and SQL). In the interpretation of the analysis results, systems with a rich palette of visualization become important in conveying the results of the queries to the user in the way they are best understood.
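The SQL end of this spectrum can be illustrated with Python's built-in `sqlite3` module, used here only as a small stand-in for a full RDBMS. The table and column names are hypothetical. The sketch shows two properties associated with relational systems: a declared schema that rejects malformed rows, and declarative aggregate queries.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# The schema is fixed up front: every order needs a customer and an amount.
conn.execute(
    "CREATE TABLE orders ("
    "id INTEGER PRIMARY KEY, customer TEXT NOT NULL, amount REAL NOT NULL)"
)
conn.executemany(
    "INSERT INTO orders (customer, amount) VALUES (?, ?)",
    [("alice", 12.5), ("bob", 8.0), ("alice", 7.5)],
)

# A declarative query produces the aggregate without any custom code.
totals = conn.execute(
    "SELECT customer, SUM(amount) FROM orders "
    "GROUP BY customer ORDER BY customer"
).fetchall()

# A row that violates the schema is rejected by the database itself.
try:
    conn.execute(
        "INSERT INTO orders (customer, amount) VALUES (?, ?)", ("carol", None)
    )
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

conn.close()
print(totals, rejected)
```

A schema-less store would have accepted the incomplete row, which is the kind of trade-off that separates the SQL and NoSQL ends of the spectrum.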

Fig-2 (Processing of BIG DATA)

3. TECHNIQUES USED FOR BIG DATA ANALYSIS

A wide variety of techniques and technologies has been developed and adapted to aggregate, manipulate, analyze, and visualize big data. These techniques and technologies draw from several fields including statistics, computer science, applied mathematics, and economics.

• A/B Testing
• Web Mining
• Machine Learning Techniques
• Crowd Sourcing
• Genetic Algorithm
• NLP
• Sentiment Analysis
• Visualization
• Time Series Analysis

Different architecture technologies include:

• Hadoop
• MapReduce
• Bigtable
• Cassandra
• Distributed system with Dynamo
• Mash-up
• Google File System
• BigQuery

4. Experimental Results

5. CONCLUSION

There is no question that enough data is available to overwhelm and overload traditional database management systems. New systems using big data will extend, and possibly replace, traditional DBMSs. The rate of data acquisition is also accelerating quickly enough that perhaps we will eventually coin a new term based on BIG DATA.

6. REFERENCES

[1] The McKinsey Global Institute (MGI), Report on Big data: The next frontier for innovation, competition, and productivity (2012).
[2] Introduction to Data Science, Jeffrey Stanton, Syracuse University.
[3] White Paper: "For Big Data Analytics There's No Such Thing as Too Big: The Compelling Economics and Technology of Big Data Computing", March 2012, by 4syth.com.
[4] Paper published on SSRN: "Big Data for Development: From Information- to Knowledge Societies" by Martin Hilbert (Dr. PhD.), 2012.
[5] Webinar on BIG data: Pragmatic approach.