Big Data
June 1, 2016 | Author: Bijoy Chhetri | Category: N/A
Short Description
Exploring the future of Data Science...
Description
BIG DATA: An approach towards the future of Information Science
Bijoy Chhetri
Sr. Lecturer, Department of Computer Science and Engineering
Centre for Computers and Communication Technology
bijoychhetri@gmail.com
ABSTRACT [1]
Today the term BIG DATA draws a lot of attention, but behind it there is a simple story. For decades, companies have been making business decisions based on transactional data stored in relational databases (RDBMS). Beyond that critical data, however, is a potential treasure of non-traditional, less structured data: weblogs, social media, email, sensors, and photographs that can be mined for useful information and knowledge delivery (KDD). Decreases in the cost of both storage and compute power have made it feasible to collect this data, which would have been thrown away only a few years ago. As a result, more and more companies are looking to include non-traditional (UNSTRUCTURED DATA) yet potentially very valuable data with their traditional enterprise data in their business intelligence analysis. Deriving real business value from BIG DATA requires the right tools to capture and organize a wide variety of data types from different sources, and to analyze them easily within the context of all enterprise data, so that instantaneous, spontaneous and constant knowledge delivery can help decision makers decide in real time. This paper briefly discusses the background of data science, BIG DATA, and its impact on data science.
Keywords
BIG DATA, Hadoop, MapReduce, Cloud, NoSQL.
1. INTRODUCTION
BIG DATA is very much like small data, but bigger in scale, complexity and variety of generation modes. Having BIG DATA means having to adopt newer technologies and different approaches to handle bigger datasets, with the aim of solving new problems and even solving old problems in a better way. Data have become a torrent flowing into every area of the global economy, and the science behind it is Data Science, which deals with the collection, preparation, analysis, visualization, management and preservation of large collections of information. In other words, data science [2] is the integration of methods from statistics, computer science, and other fields for gaining insights from data. In practice, data science encompasses an iterative process of data harvesting, cleaning, analysis and visualization, and implementation. Ultimately, this interdisciplinary and cross-functional field leads to decisions that move an organization forward, whether the object of business interest is product design, a proposed investment, or business strategy.
In the previous decade of data repositories, the traditional methodologies combined ad hoc querying and reporting with data mining techniques, applied mostly to structured data pouring in from typical sources that generate small to mid-size datasets. This gave birth to different data warehousing (DW) techniques with data marts and data centers, where data-driven knowledge discovery is done with the help of business intelligence fed into them. The method works fine for a limited range of data analysis, but it will certainly face constraints in processing large datasets in terms of time, cost, reliability and scalability.
With the rapid increase in the production of data from varied sources, the increase in storage capacities and the increase in processing power (Moore's Law), the term BIG DATA came into existence and gained importance. Varied ranges of data are generated continuously from different sources, demanding technologies that can handle newer types of data and integrate them to a decisive level. Apart from catering to the increase in data volume, the most pressing need is for newer analytical requirements: optimization, predictive analytics, and complex statistical analysis in a more real-time manner. To quote from an IBM website: "Every day, we create 2.5 quintillion bytes of data, so much that 90% of the data in the world today has been created in the last two years alone."
This data comes from everywhere: sensors used to gather climate information, posts to social media sites, digital pictures and videos, purchase transaction records, and cell phone GPS signals, to name a few. This data is big data.
To provide efficient and reliable services, many factors have to be considered, such as the requirements of the application environment, the reliability of the protocol used in the network, and network consistency. The MAC layer is a major factor in providing reliability and efficiency for a WSN. MAC is responsible for channel access policies, scheduling, buffer management and error control. In a WSN we need a MAC protocol that treats energy efficiency, reliability, low access delay and high throughput as major priorities, to accommodate the sensor's limited resources and to avoid redundant power consumption.
1.1 Cloud and BIG DATA
BIG DATA is fuelled by the properties of the cloud in terms of its creation and utilization, rather than being subject only to enterprise data, because of the on-demand computing services of a cloud architecture; the two are a permanent driving force for each other in a never-ending cycle. The data produced by the cloud is captured by the BIG DATA architecture and utilized in real time by the services that the cloud provides over the Internet. The cloud's advantages of affordability, economy of scale, agility and extensibility in turn enable us to access BIG DATA more efficiently. The newer technology used to analyze data on the cloud will certainly drive the future design, enhancement and innovative expansion of the cloud, which further caters to the needs of the BIG DATA technologies in place. The avalanche of BIG DATA is challenging and calls on us to use data for the next-generation breakthrough.
1.2 Defining BIG DATA [5]
Everyone in an enterprise or business is wondering how big BIG DATA is for a particular requirement. Is it only a matter of size? Not necessarily: size is not the only element that describes BIG; the speed and the various sources that pour in data simultaneously must also be considered. For example, 40 MB of .ppt slides, 1 TB of MRI scans or 1 PB of movie is BIG DATA only when the underlying infrastructure and technologies cannot support it. Data is BIG when it challenges the constraints of the existing system capabilities and business needs: a 40 MB presentation is certainly a big data set if it cannot be sent as an attachment, and so is a 1 PB movie if it cannot be rendered for editing. An organization must therefore weigh its capabilities in terms of technologies and architecture before calling any data big; whenever fast-growing data pushes the limits of what common technology can support, it is definitely BIG DATA.
BIG DATA, as defined by Wikipedia, is a term for a collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications. Gartner defines it as high-volume, high-velocity and high-variety information assets that demand cost-effective, innovative forms of information processing for enhanced insight and decision making. It can also be described as a huge set of data whose size is beyond the ability of traditional database tools to capture, store, manage and analyze. As technology advances over time, the size of datasets that qualify as BIG DATA will also increase. Also note that the definition can vary by sector, depending on what kinds of software tools are commonly available and what sizes of datasets are common in a particular industry. With those qualifications, BIG DATA in many sectors today will range from a few dozen terabytes to multiple petabytes (thousands of terabytes).
Here are a few examples that cover the three V's of BIG DATA, Volume, Velocity and Variety, each of which has further dimensions as subsets, as shown in Fig-1. There is a 44 times increase in data size every year. In the Large Synoptic Survey Telescope (LSST) alone, over a thousand gigabytes (terabytes) of images will be generated every night throughout the decade-long LSST sky survey. Imagine 72 hours of video uploaded to YouTube every minute and 140 million tweets per day on average: these amount to the Volume of data generated in the pool. Various formats, types and structures (text, numerical data, images, audio, video, sequences, time series, social media data, multi-dimensional arrays, etc.) from different sources amount to the diverse Variety of the data generated. Google using smartphones as sensors to determine traffic conditions, promotions sent right now for the store next to you based on your current location and purchase history, and sensors monitoring activities and the body for abnormal measurements that require immediate reaction all amount to the Velocity of the data.
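The volume figures quoted above can be turned into rates with a little arithmetic, which makes the link between Volume and Velocity concrete. The sketch below is only illustrative: it assumes the YouTube figure reads as 72 hours of video per minute and the Twitter figure as 140 million tweets per day, and the variable names are chosen here for the example.

```python
# Convert the quoted volume figures into rates (assumed readings of the figures).
MINUTES_PER_DAY = 24 * 60
SECONDS_PER_DAY = MINUTES_PER_DAY * 60

video_hours_per_minute = 72      # assumption: hours of video uploaded per minute
tweets_per_day = 140_000_000     # assumption: average tweets per day

# Hours of new video that arrive in a single day.
video_hours_per_day = video_hours_per_minute * MINUTES_PER_DAY

# Average number of tweets the system must absorb every second.
tweets_per_second = tweets_per_day / SECONDS_PER_DAY

print(f"Video uploaded per day: {video_hours_per_day} hours")
print(f"Average tweet rate: {tweets_per_second:.0f} tweets per second")
```

Under these assumptions a single day brings more than a hundred thousand hours of video, which is why size alone does not describe the problem: the arrival rate has to be handled as well.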
FIG-1 Three Dimensions of BIG DATA
1.3 Challenges of BIG DATA
Progress and innovation are no longer hindered by the ability to collect data. The challenge for the IT fraternity is now the ability to manage, analyze, summarize, visualize, and discover knowledge from the collected data in a timely manner and in a scalable fashion.
2. METHODOLOGIES USED TO ANALYSE BIG DATA
Fig-2 represents the different phases of harnessing, cleaning, analyzing and interpreting huge sets of data in an efficient manner.
2.1 Data Acquiring
The acquisition phase is one of the major changes in infrastructure from the days before big data. Because big data refers to data streams of higher velocity and higher variety, the infrastructure required to support the acquisition of big data must deliver low latency in both capturing data and executing short, simple queries; be able to handle very high transaction volumes, often in a distributed environment; and support flexible, dynamic data structures. For this purpose, NoSQL databases are frequently used to acquire and store big data. They are well suited for dynamic data structures and are highly scalable, and they do not use a fixed schema for storing the data.
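The idea of storing records without a fixed schema can be sketched in a few lines. The example below is a toy in-memory stand-in, not the API of any real NoSQL product; the class name DocumentStore, its methods and the sample records are all hypothetical and exist only to show that records of different shapes can share one collection.

```python
import json
from collections import defaultdict


class DocumentStore:
    """Toy in-memory stand-in for a schema-less document store (hypothetical)."""

    def __init__(self):
        # Collection name -> list of documents; no schema is declared anywhere.
        self._collections = defaultdict(list)

    def insert(self, collection, document):
        # Round-trip through JSON to copy the record and reject
        # values that could not be serialised.
        self._collections[collection].append(json.loads(json.dumps(document)))

    def find(self, collection, **criteria):
        # Return the documents that contain every requested field
        # with the requested value.
        return [
            doc
            for doc in self._collections[collection]
            if all(key in doc and doc[key] == value for key, value in criteria.items())
        ]


store = DocumentStore()
# Three records with completely different fields go into the same collection.
store.insert("events", {"source": "weblog", "url": "/home", "status": 200})
store.insert("events", {"source": "sensor", "temperature_c": 21.5})
store.insert("events", {"source": "tweet", "text": "hello", "tags": ["bigdata"]})

sensor_events = store.find("events", source="sensor")
print(sensor_events)
```

A relational table would need a schema change before the sensor or tweet records could be stored; here each record simply carries its own fields, which is the flexibility the acquisition phase asks for.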
2.2 Organizing Data
This phase includes cleaning and data integration, carried out at the data's initial destination rather than by moving large volumes of data. The infrastructure required for organizing big data must be able to process and manipulate data in the original storage location, and must support high throughput to deal with large sets of data. For this purpose, Hadoop is a new technology that allows large data volumes to be organized and processed while keeping the data on the original data storage cluster. The Hadoop Distributed File System (HDFS) is the long-term storage system for web logs, for example. These web logs are turned into browsing sessions by running MapReduce programs on the cluster and generating aggregated results on the same cluster.
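The step of turning web logs into browsing sessions can be sketched in the map and reduce style. The code below runs on a single machine in plain Python and does not use the Hadoop API; the log format (timestamp, user and URL separated by spaces), the 30-minute inactivity timeout and the function names are assumptions made for this illustration.

```python
from collections import defaultdict

SESSION_GAP_SECONDS = 30 * 60  # assumed inactivity timeout between sessions


def map_phase(log_lines):
    """Map: emit (user_id, (timestamp, url)) for each 'timestamp user url' line."""
    for line in log_lines:
        timestamp, user_id, url = line.split()
        yield user_id, (int(timestamp), url)


def shuffle(pairs):
    """Group mapped values by key, as a framework would between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(user_id, hits):
    """Reduce: split one user's hits into sessions at gaps longer than the timeout."""
    sessions = []
    last_time = None
    for timestamp, url in sorted(hits):
        if last_time is None or timestamp - last_time > SESSION_GAP_SECONDS:
            sessions.append([])  # start a new browsing session
        sessions[-1].append(url)
        last_time = timestamp
    return user_id, sessions


# Hypothetical log lines: epoch seconds, user id, requested URL.
logs = [
    "1000 alice /home",
    "1060 alice /menu",
    "1100 bob /home",
    "9000 alice /order",
]

sessions = dict(
    reduce_phase(user, hits) for user, hits in shuffle(map_phase(logs)).items()
)
print(sessions)
```

On a cluster the map and reduce functions would run in parallel on the nodes that already hold the log blocks, so only the small aggregated result moves across the network.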
2.3 Analyze Big Data
The analysis of the acquired, cleaned and transformed data can be done in the distributed environment. The infrastructure required for analyzing big data must be able to support deeper analytics, such as statistical analysis and data mining, on a wider variety of data types stored in diverse systems, and must scale to extreme data volumes. It must also deliver faster response times, withstand changes in behavior, and automate decisions based on analytical models. For example, analyzing the sales of a pizza house in combination with the events calendar for the venue in which the pizza house is located will indicate the optimal product demand and the business logic needed to drive the sales percentage.
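The pizza example amounts to joining two data sets, daily sales and the events calendar, and comparing the groups. A minimal sketch follows; the sales figures, the dates and the function name average_sales_by_event are invented purely for illustration and are not results from this paper.

```python
from collections import defaultdict


def average_sales_by_event(daily_sales, event_days):
    """Average daily sales, split by whether the venue hosted an event that day."""
    totals = defaultdict(lambda: [0, 0])  # has_event -> [sum of sales, number of days]
    for day, pizzas_sold in daily_sales.items():
        has_event = day in event_days
        totals[has_event][0] += pizzas_sold
        totals[has_event][1] += 1
    return {flag: total / count for flag, (total, count) in totals.items()}


# Invented sample data: pizzas sold per day and the days with an event at the venue.
daily_sales = {
    "2016-05-01": 120,
    "2016-05-02": 80,
    "2016-05-03": 200,
    "2016-05-04": 100,
}
event_days = {"2016-05-01", "2016-05-03"}

averages = average_sales_by_event(daily_sales, event_days)
uplift = averages[True] / averages[False]
print(f"Event days: {averages[True]:.0f}, other days: {averages[False]:.0f}")
print(f"Uplift on event days: {uplift:.2f}x")
```

An uplift factor of this kind is the sort of analytical model that could feed an automated decision, such as how much stock to prepare before the next scheduled event.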
2.4 Interpretation of the Solution Spectrum
These new systems have created a divided solution spectrum comprised of SQL solutions, with the manageability, security and trusted nature of relational database management systems (RDBMS); Not Only SQL (NoSQL) solutions, which are developer-centric specialized systems; and NewSQL strategies, which aim for a perfect match of NoSQL and SQL. In interpreting the results of the analysis, a system with a rich palette of visualizations becomes important in conveying the results of the queries to the user in the way they are best understood.
Fig-2 (Processing of BIG DATA)
3. TECHNIQUES USED FOR BIG DATA ANALYSIS
A wide variety of techniques and technologies has been developed and adapted to aggregate, manipulate, analyze, and visualize big data. These techniques and technologies draw from several fields including statistics, computer science, applied mathematics, and economics.
• A/B Testing
• Web Mining
• Machine Learning Techniques
• Crowd Sourcing
• Genetic Algorithm
• NLP
• Sentiment Analysis
• Visualization
• Time Series Analysis
Different architectures and technologies include:
• Hadoop
• MapReduce
• BigTable
• Cassandra
• Distributed system with Dynamo
• Mash-up
• Google File System
• BigQuery
4. Experimental Results
5. CONCLUSION
There is no question that enough data is available to overwhelm and overload traditional database management systems; new systems using big data will extend, and possibly replace, traditional DBMSs. The rate of data acquisition is accelerating quickly enough that perhaps we will eventually coin a new term based on BIG DATA.
6. REFERENCES
[1]. The McKinsey Global Institute (MGI), Report on Big data: The next frontier for innovation, competition, and productivity (2012).
[2]. Introduction to Data Science, Jeffrey Stanton, Syracuse University.
[3]. White Paper on "For Big Data Analytics There's No Such Thing as Too Big: The Compelling Economics and Technology of Big Data Computing", March 2012, By: 4syth.com.
[4]. Paper published on SSRN, "Big Data for Development: From Information- to Knowledge Societies" by Martin Hilbert (Dr. PhD.), 2012.
[5]. Webinar on BIG data: Pragmatic approach.