Big Data

June 1, 2016 | Author: vinod | Category: N/A



Paper presentation on

BIG DATA

 

M VENKAT VENKATA A RAMANA C PRIYANKA Mobile No: 9493556589 E-mail:venkaa!amaname"#ava#$%"mail&'om

Department of Computer Science & Engineering

YSR Engineering College of Yogi Vemana University, Proddatur, Kadapa

Abstract: This tutorial serves as a foundation for the field of Big Data. It gives an overview of Big Data and of data platforms, Big Data in action, an overview of Hadoop, storage techniques and file systems for Big Data, technologies, basic concepts of Big Data in the cloud, a fast and big data infrastructure based on Amazon, critiques of the Big Data paradigm, and research activities.

1. INTRODUCTION

In information technology, big data is a collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications. The challenges include capture, storage, search, sharing, analysis, and visualization. The trend to larger data sets is due to the additional information derivable from analysis of a single large set of related data, as compared to separate smaller sets with the same total amount of data, allowing correlations to be found to spot business trends, determine quality of research, prevent diseases, link legal citations, combat crime, and determine real-time roadway traffic conditions.

As of 2012, limits on the size of data sets that are feasible to process in a reasonable amount of time were on the order of exabytes of data. Scientists regularly encounter limitations due to large data sets in many areas, including meteorology, genomics, connectomics, complex physics simulations,[11] and biological and environmental research. The limitations also affect Internet search, finance and business informatics. Data sets grow in size in part because they are increasingly being gathered by ubiquitous information-sensing mobile devices, aerial sensory technologies (remote sensing), software logs, cameras, microphones, radio-frequency identification readers, and wireless sensor networks. The world's technological per-capita capacity to store information has roughly doubled every 40 months since the 1980s; as of 2012, every day 2.5 quintillion (2.5×10^18) bytes of data were created.

What is considered big data varies depending on the capabilities of the organization managing the set, and on the capabilities of the applications that are traditionally used to process and analyze the data set in its domain. For some organizations, facing hundreds of gigabytes of data for the first time may trigger a need to reconsider data management options. For others, it may take tens or ...
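The doubling claim above can be turned into a quick calculation. The following is a minimal sketch, assuming idealized exponential growth with a fixed 40-month doubling period; the function name `growth_factor` is a hypothetical helper introduced here, not something from the source.

```python
def growth_factor(years: float, doubling_months: float = 40.0) -> float:
    """Return how many times a quantity multiplies over `years`
    if it doubles once every `doubling_months` months (assumed constant)."""
    doublings = years * 12.0 / doubling_months
    return 2.0 ** doublings


if __name__ == "__main__":
    # Ten years contain exactly three 40-month doubling periods,
    # so storage capacity would grow 2**3 = 8-fold.
    print(growth_factor(10))  # 8.0
```

Under the same assumption, thirty years of doubling every 40 months gives nine doublings, a factor of 512.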

A. What is big data?

Every day, we create 2.5 quintillion bytes of data, so much that 90% of the data in the world today has been created in the last two years alone. This data comes from everywhere: sensors used to gather climate information, posts to social media sites, digital pictures and videos, purchase transaction records, and cell phone GPS signals, to name a few. This data is big data.

Big data spans three dimensions: Volume, Velocity and Variety.

Volume: Enterprises are awash with ever-growing data of all types, easily amassing terabytes, even petabytes, of information. Examples: turn 12 terabytes of Tweets created each day into improved product sentiment analysis; convert 350 billion annual meter readings to better predict power consumption.

Velocity: Sometimes 2 minutes is too late. For time-sensitive processes such as catching fraud, big data must be used as it streams into your enterprise in order to maximize its value. Examples: scrutinize 5 million trade events created each day to identify potential fraud; analyze 500 million daily call detail records in real time to predict customer churn faster.

Variety: Big data is any type of data, structured and unstructured, such as text, sensor data, audio, video, click streams, log files and more. New insights are found when analyzing these data types together. Examples: monitor 100's of live video feeds from surveillance cameras to target points of interest; exploit the 80% data growth in images, video and documents to improve customer satisfaction.







/& IM bi" )aa la1o!m Do you have a #ig data strategy "B@ does% >e2d li-e to share our  -now+how with you to help your  enterprise solve its #ig data challenges% "B@ is unique in having developed an enterprise class #ig data platform that allows you to address the full spectrum of #ig data #usiness challenges% The platform #lends traditional technologies that are well suited for structured, repeata#le tas-s together with complementary new technologies that address speed and fle*i#ility and are ideal for  adhoc data e*ploration, discovery and unstructured analysis%

"B@2s integrated #ig data  platform has four core capa#ilitiesC Hadoop+#ased analytics, stream computing, data warehousing, and information integration and governance% A& T#e 'o!e 'aabiliie( a!e: a)oo-ba(e) and anal,i'(: Processes analy$es any data type across commodity server clusters% 7!eam Com$in": Drives continuous analysis of massive volumes of streaming data with su#+millisecond response times% aa *a!e#o$(in": Delivers deep operational insight with advanced in+data#ase analytics% In1o!maion Ine"!aion an) ove!nan'e: Allows you to understand, cleanse, transform, govern and deliver  trusted information to your  critical #usiness initiatives% & 7$o!in" 7e!vi'e(:

Pla1o!m

Vi($ali;aion < i('ove!,: Helps end users e*plore large, comple* data sets% Ali'aion evelomen:  Streamlines the  process of developing #ig data applications% 7,(em( Mana"emen: @onitors and manages #ig data systems for  secure and optimi$ed  performance% A''ele!ao!(: Speeds time to value with analytical and industry+specific modules%

3. Big Data paradigm

4. Big data in action

Examples include Big Science, web logs, RFID, sensor networks, social networks, social data (due to the social data revolution), Internet text and documents, Internet search indexing, call detail records, astronomy, atmospheric science, genomics, biogeochemical, biological, and other complex and often interdisciplinary scientific research, military surveillance, medical records, photography archives, video archives, and large-scale e-commerce.

A. Big Science

The Large Hadron Collider experiments represent about 150 million sensors delivering data 40 million times per second. There are nearly 600 million collisions per second. After filtering and not recording more than 99.999% of these streams, there are 100 collisions of interest per second. As a result, only working with less than 0.001% of the sensor stream data, the data flow from all four LHC experiments represents 25 petabytes annual rate before replication (as of 2012). This becomes nearly 200 petabytes after replication.

If all sensor data were to be recorded in LHC, the data flow would be extremely hard to work with. The data flow would exceed 150 million petabytes annual rate, or nearly 500 exabytes per day, before replication. To put the number in perspective, this is equivalent to 500 quintillion (5×10^20) bytes per day.

Walmart handles more than 1 million customer transactions every hour, which is imported into databases estimated to contain more than 2.5 petabytes (2560 terabytes) of data, the equivalent of 167 times the information contained in all the books in the US Library of Congress.

Facebook handles 50 billion photos from its user base.

FICO Falcon Credit Card Fraud Detection System protects 2.1 billion active accounts world-wide.

DARPA's Topological Data Analysis program seeks the fundamental structure of massive data sets.
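The Large Hadron Collider figures quoted earlier in this section can be sanity-checked with a short unit conversion. This is a rough sketch, assuming decimal units (1 exabyte = 1000 petabytes = 10^18 bytes) and a 365-day year; the function name is hypothetical and chosen only for illustration.

```python
PB_PER_EB = 1000          # assumed decimal units: 1 exabyte = 1000 petabytes
BYTES_PER_EB = 10 ** 18   # 1 exabyte = 10^18 bytes


def petabytes_per_year_to_exabytes_per_day(pb_per_year: float) -> float:
    """Convert an annual rate in petabytes to a daily rate in exabytes."""
    return pb_per_year / PB_PER_EB / 365


# Unfiltered LHC sensor stream: 150 million petabytes per year.
unfiltered_eb_per_day = petabytes_per_year_to_exabytes_per_day(150e6)
print(round(unfiltered_eb_per_day))          # 411

# 500 exabytes expressed in bytes is 500 quintillion, i.e. 5 x 10^20.
print(500 * BYTES_PER_EB == 5 * 10 ** 20)    # True
```

Under these assumptions the annual figure works out to roughly 411 exabytes per day, the same order of magnitude as the "nearly 500 exabytes per day" quoted in the text.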

The volume of business data worldwide, across all companies, doubles every 1.2 years, according to estimates.

For Development

Following decades of work in the area of the effective usage of information and communication technologies for development (or ICT4D), it has been suggested that Big Data can make important contributions to international development. On the one hand, the advent of Big Data delivers the cost-effective prospect to improve decision-making in critical development areas such as health care, employment, economic productivity, crime and security, and natural disaster and resource management.[39] On the other hand, all the well-known concerns of the Big Data debate, such as privacy, interoperability challenges, and the almighty power of imperfect algorithms, are aggravated in developing countries by long-standing development challenges like lacking technological infrastructure and economic and human resource scarcity. This has the potential to result in a new kind of digital divide: a divide in data-based intelligence to inform decision-making.

5. Technology

Big data requires exceptional technologies to efficiently process large quantities of data within tolerable elapsed times. A 2011 McKinsey report suggests suitable technologies include A/B testing, association rule learning, classification, cluster analysis, crowdsourcing, data fusion and integration, ensemble learning, genetic algorithms, machine learning, natural language processing, neural networks, pattern recognition, anomaly detection, predictive modelling, regression, sentiment analysis, signal processing, supervised and unsupervised learning, simulation, time series analysis and visualization. Multidimensional big data can also be represented as tensors, which can be more efficiently handled by tensor-based computation, such as multilinear subspace learning.[42] Additional technologies being applied to big data include massively parallel-processing (MPP) databases, search-based applications, data-mining grids, distributed file systems, distributed databases, cloud based infrastructure (applications, storage and computing resources) and the Internet.

Some but not all MPP relational databases have the ability to store and manage petabytes of data. Implicit is the ability to load, monitor, back up, and optimize the use of the large data tables in the RDBMS.

The practitioners of big data analytics processes are generally hostile to slower shared storage, preferring direct-attached storage (DAS) in its various forms from solid state disk (SSD) to high capacity SATA disk buried inside parallel processing nodes. The perception of shared storage architectures, SAN and NAS, is that they are relatively slow, complex, and expensive. These qualities are not consistent with big data analytics systems that thrive on system performance, commodity infrastructure, and low cost.

Real or near-real time information delivery is one of the defining characteristics of big data analytics. Latency is therefore avoided whenever and wherever possible. Data in memory is good; data on spinning disk at the other end of a FC SAN connection is not. The cost of a SAN at the scale needed for analytics applications is very much higher than other storage techniques.

There are advantages as well as disadvantages to shared storage in big data analytics, but big data analytics practitioners as of 2011 did not favour it.

6. Apache Hadoop

Apache Hadoop is an open-source software framework that supports data-intensive distributed applications, licensed under the Apache v2 license. It supports the running of applications on large clusters of commodity hardware. The Hadoop framework transparently provides both reliability and data motion to applications. Hadoop implements a computational paradigm named MapReduce, where the application is divided into many small fragments of work, each of which may be executed or re-executed on any node in the cluster. In addition, it provides a distributed file system that stores data on the compute nodes, providing very high aggregate bandwidth across the cluster. Both map/reduce and the distributed file system are designed so that node failures are automatically handled by the framework.[2] It enables applications to work with thousands of computation-independent computers and petabytes of data. Hadoop was derived from Google's MapReduce and Google File System (GFS) papers. The entire Apache Hadoop "platform" is now commonly considered to consist of the Hadoop kernel, MapReduce and Hadoop Distributed File System (HDFS), as well as a number of related projects including Apache Hive, Apache HBase, and others.
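To make the MapReduce paradigm described above concrete, here is a minimal single-process sketch of a word count written in plain Python. It does not use the Hadoop API; the function names (`map_phase`, `shuffle`, `reduce_phase`, `run_job`) are hypothetical and only illustrate how a job splits into independent fragments of work whose intermediate results are grouped by key and then reduced.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(fragment: str) -> Iterator[Tuple[str, int]]:
    """Map step: emit a (word, 1) pair for every word in one input fragment."""
    for word in fragment.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle step: group all emitted values by their key."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Reduce step: combine all values for one key into a single count."""
    return key, sum(values)


def run_job(fragments: List[str]) -> Dict[str, int]:
    """Run map, shuffle and reduce in sequence inside one process."""
    mapped: List[Tuple[str, int]] = []
    # In a real cluster each fragment could be mapped on a different node,
    # and re-run elsewhere if that node failed.
    for fragment in fragments:
        mapped.extend(map_phase(fragment))
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    counts = run_job(["big data is big", "data about data"])
    print(counts)  # {'big': 2, 'data': 3, 'is': 1, 'about': 1}
```

Because each call to `map_phase` depends only on its own fragment, and each call to `reduce_phase` only on one key, a framework is free to schedule or repeat them on any node, which is the property the text attributes to Hadoop.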

7. A Fast and Big Data Infrastructure based on Amazon EC2

When I was working in the data mining group within Yahoo! SDS, I used to think only big Internet companies or giant enterprises need to care about big data and large scale data mining. This is no longer the case. Big data and data driven apps will become the driving force of the next generation Internet and will also revolutionize many other traditional industries. As Twitter and Square co-founder Jack Dorsey put it: "Data will be a product by itself!" He was pushing big data initiatives really hard even during the very early stage of Square so that "We are no longer flying in the dark".

So, what does this mean to other startups with an equal or even more ambitious goal to achieve? Why do startups need to care about big data? How to collect and make sense of it? How to build the infrastructure to deal with big data? How can all the big data stuff be pragmatic rather than a fancy showcase? This blog is about to address all the above questions in a very practical way.

To start with, a typical Internet startup is usually built on some sort of cloud hosting environment. Among all major cloud vendors (Amazon AWS, Rackspace, Google App Engine, Microsoft Azure etc.), AWS seems to be the most natural choice for startups because of its deep tech stack and battle hardened infrastructure (major outages did happen sometimes :) ).

Now, back to the topic: if you run everything on the AWS cloud and want your big data gig, how would you actually build it? Let's answer the question by first splitting the design requirements into 2 major categories:

Big Data: Able to store and access large scale data sets and support deep mining.
Fast Data: Support production dashboards and near real-time BI needs.

Some companies might focus on one more than the other, but eventually both will become critical as your business advances. Here at Tapjoy, we are experiencing high data demands from both categories. We are seeing more than 200 million mobile user sessions on a daily basis. Both our internal business folks and external partners want to access granular stats in near-realtime fashion. Meanwhile, our engineering and science group needs to build models to facilitate various data driven applications such as ad ranking optimization, app recommendation, behavior targeting etc. So we came up with the architecture below and built it on Amazon AWS:

Syslog-ng Log Collector Cluster: responsible for collecting logs from production web servers; produces batch logs every minute.
Hadoop ETL Cluster: picks up the logs collected by Syslog-ng and runs E (extraction) and T (transformation) on the raw logs.
Hadoop Data Warehouse: pulls post-ETL data from the ETL cluster and loads it into Hadoop HDFS, ready to be used by various data mining/modeling processes (e.g. Mahout ML algorithms).
Vertica DB Cluster: column oriented fast analytical DB for near real-time reporting and dashboarding. It also pulls data from the Hadoop ETL cluster.
BI Suite: Tableau Desktop is a neat solution for internal analysts for daily drag & drop type of pivotal analysis.

Everything in this system runs on the AWS cloud; it uses both EBS and S3 for data redundancy and backup. There are some nice features about this system:

SA=AB=1 C can handle 0 #illion raw eventsJday with modest num#er of 1. nodes% Beyond that, the system scales linearly,simply #y adding more nodes% '(B)ST C the infrastructure is designed assuming any node can  #e down at any moment% 1very component in the system has its  #ac- up plan% !A)=T+T(=1'1&T C Syscascading down flow, which helps isolate pro#lems and solve them% Data at difference stages all get #ac- up actively, which can contri#ute to quic- recovery from system failure%

8. Research activities

In March 2012, The White House announced a national "Big Data Initiative" that consisted of six Federal departments and agencies committing more than $200 million to big data research projects.[45] The initiative included a National Science Foundation "Expeditions in Computing" grant of $10 million over 5 years to the AMPLab at the University of California, Berkeley. The AMPLab also received funds from DARPA and over a dozen industrial sponsors, and uses big data to attack a wide range of problems from predicting traffic congestion to fighting cancer. The White House Big Data Initiative also included a commitment by the Department of Energy to provide $25 million in funding over 5 years to establish the Scalable Data Management, Analysis and Visualization (SDAV) Institute, led by the Energy Department's Lawrence Berkeley National Laboratory. The SDAV Institute aims to bring together the expertise of six national laboratories and seven universities to develop new tools to help scientists manage and visualize data on the Department's supercomputers. The U.S. state of Massachusetts announced the Massachusetts Big Data Initiative in May 2012, which provides funding from the state government and private companies to a variety of research institutions. The Massachusetts Institute of Technology hosts the Intel Science and Technology Center for Big Data in the MIT Computer Science and Artificial Intelligence Laboratory, combining government, corporate, and institutional funding and research efforts.

REFERENCES

1. White, Tom (10 May 2012). Hadoop: The Definitive Guide. O'Reilly Media. p. 3. ISBN 978-1-4493-3877-0.
2. MIKE2.0, Big Data Definition.
3. Kusnetzky, Dan. What is Big Data? ZDNet.
4. Vance, Ashley (22 April 2010). Start-Up Goes After Big Data With Hadoop Helper. New York Times Blog.
5. Data, data everywhere. The Economist. 25 February 2010. Retrieved 9 December 2012.
6. Casey, Cat; Perez, Alejandra. E-Discovery Special Report: The Rising Tide of Nonlinear Review. Hudson Global. Retrieved 1 July 2012.
7. What Technology-Assisted Electronic Discovery Teaches Us About The Role Of Humans In Technology - Re-Humanizing Technology-Assisted Review. Forbes. Retrieved 1 July 2012.
8. Francis, Matthew (2012-04-02). Future telescope array drives development of exabyte processing. Retrieved 2012-10-24.
9. Watters, Audrey (2010). The Age of Exabytes: Tools and Approaches for Managing Big Data (Website/Slideshare). Hewlett-Packard Development Company. Retrieved 2012-10-24.
10. Community cleverness required. Nature 455 (7209): 1. 4 September 2008. doi:10.1038/455001a.
11. Sandia sees data management challenges spiral. HPC Projects. 4 August 2009.
12. Reichman, O.J.; Jones, M.B.; Schildhauer, M.P. (2011). Challenges and Opportunities of Open Data in Ecology. Science 331 (6018): 703-5. doi:10.1126/science.1197962.
13. Hellerstein, Joe (9 November 2008). Parallel Programming in the Age of Big Data. Gigaom Blog.
14. Segaran, Toby; Hammerbacher, Jeff (2009). Beautiful Data: The Stories Behind Elegant Data Solutions. O'Reilly Media. p. 257. ISBN 978-0-596-15711-1.
