Big Data
June 1, 2016 | Author: vinod | Category: N/A
Paper presentation on
BIG DATA
M VENKAT VENKATA A RAMANA C PRIYANKA Mobile No: 9493556589 E-mail:venkaa!amaname"#ava#$%"mail&'om
Department of Computer Science & Engineering
YSR Engineering College of Yogi Vemana University, Proddatur, Kadapa
Abstract: This tutorial serves as a foundation for the field of Big Data. It covers an overview of Big Data and of data platforms, Big Data in action, an overview of Hadoop, storage techniques and file systems for Big Data, technologies, basic concepts of Big Data, a fast and Big Data infrastructure in the cloud based on Amazon, critiques of the Big Data paradigm, and research activities.
I. INTRODUCTION
In information technology, big data is a collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications. The challenges include capture, storage, search, sharing, analysis, and visualization. The trend to larger data sets is due to the additional information derivable from analysis of a single large set of related data, as compared to separate smaller sets with the same total amount of data, allowing correlations to be found to spot business trends, determine quality of research, prevent diseases, link legal citations, combat crime, and determine real-time roadway traffic conditions.
As of 2012, limits on the size of data sets that are feasible to process in a reasonable amount of time were on the order of exabytes of data. Scientists regularly encounter limitations due to large data sets in many areas, including meteorology, genomics, connectomics, complex physics simulations,[11] and biological and environmental research. The limitations also affect Internet search, finance and business informatics. Data sets grow in size in part because they are increasingly being gathered by ubiquitous information-sensing mobile devices, aerial sensory technologies (remote sensing),
software logs, cameras, microphones, radio-frequency identification readers, and wireless sensor networks. The world's technological per-capita capacity to store information has roughly doubled every 40 months since the 1980s; as of 2012, every day 2.5 quintillion (2.5×10^18) bytes of data were created. What is considered big data varies depending on the capabilities of the organization managing the set, and on the capabilities of the applications that are traditionally used to process and analyze the data set in its domain. For some organizations, facing hundreds of gigabytes of data for the first time may trigger a need to reconsider data management options. For others, it may take tens or hundreds of terabytes before data size becomes a significant consideration.
A. What is big data?
Every day, we create 2.5 quintillion bytes of data, so much that 90% of the data in the world today has been created in the last two years alone. This data comes from everywhere: sensors used to gather climate information, posts to social media sites, digital pictures and videos, purchase transaction records, and cell phone GPS signals, to name a few. This data is big data.
Big data spans three dimensions: Volume, Velocity and Variety.
Volume: Enterprises are awash with ever-growing data of all types, easily amassing terabytes, even petabytes, of information. Turn 12 terabytes of Tweets created each day into improved product sentiment analysis. Convert 350 billion annual meter readings to better predict power consumption.
Velocity: Sometimes 2 minutes is too late. For time-sensitive processes such as catching fraud, big data must be used as it streams into your enterprise in order to maximize its value. Scrutinize 5 million trade events created each day to identify potential fraud. Analyze 500 million daily call detail records in real time to predict customer churn faster.
Variety: Big data is any type of data: structured and unstructured data such as text, sensor data, audio, video, click streams, log files and more. New insights are found when analyzing these data types together. Monitor hundreds of live video feeds from surveillance cameras to target points of interest. Exploit the 80% data growth in images, video and documents to improve customer satisfaction.
2. IBM big data platform
Do you have a big data strategy? IBM does. We'd like to share our know-how with you to help your enterprise solve its big data challenges. IBM is unique in having developed an enterprise-class big data platform that allows you to address the full spectrum of big data business challenges. The platform blends traditional technologies that are well suited for structured, repeatable tasks together with complementary new technologies that address speed and flexibility and are ideal for ad hoc data exploration, discovery and unstructured analysis.
IBM's integrated big data platform has four core capabilities: Hadoop-based analytics, stream computing, data warehousing, and information integration and governance.
A. The core capabilities are:
Hadoop-based analytics: Processes and analyzes any data type across commodity server clusters.
Stream Computing: Drives continuous analysis of massive volumes of streaming data with sub-millisecond response times.
Data Warehousing: Delivers deep operational insight with advanced in-database analytics.
Information Integration and Governance: Allows you to understand, cleanse, transform, govern and deliver trusted information to your critical business initiatives.
B. Supporting Platform Services:
Visualization & Discovery: Helps end users explore large, complex data sets.
Application Development: Streamlines the process of developing big data applications.
Systems Management: Monitors and manages big data systems for secure and optimized performance.
Accelerators: Speeds time to value with analytical and industry-specific modules.
3. Big Data paradigm
4. Big data in action
Examples include Big Science, web logs, RFID, sensor networks, social networks, social data (due to the social data revolution), Internet text and documents, Internet search indexing, call detail records, astronomy, atmospheric science, genomics, biogeochemical, biological, and other complex and often interdisciplinary scientific research, military surveillance, medical records, photography archives, video archives, and large-scale e-commerce.
A. Big Science
The Large Hadron Collider experiments represent about 150 million sensors delivering data 40 million times per second. There are nearly 600 million collisions per second. After filtering and not recording more than 99.999% of these streams, there are 100 collisions of interest per second. As a result, only working with less than 0.001% of the sensor stream data, the data flow from all four LHC experiments represents 25 petabytes annual rate before replication (as of 2012). This becomes nearly 200 petabytes after replication.
If all sensor data were to be recorded in LHC, the data flow would be extremely hard to work with. The data flow would exceed 150 million petabytes annual rate, or nearly 500 exabytes per day, before replication. To put the number in perspective, this is equivalent to 500 quintillion (5×10^20) bytes per day.
Walmart handles more than 1 million customer transactions every hour, which is imported into databases estimated to contain more than 2.5 petabytes (2560 terabytes) of data, the equivalent of 167 times the information contained in all the books in the US Library of Congress.
Facebook handles 50 billion photos from its user base.
FICO Falcon Credit Card Fraud Detection System protects 2.1 billion active accounts world-wide.
DARPA's Topological Data Analysis program seeks the fundamental structure of massive data sets.
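The byte figures quoted in the examples above mix unit names (quintillion, exabyte, petabyte, terabyte), so a small conversion check may help. The following is a minimal sketch, assuming decimal (SI) units for the "500 exabytes per day" figure and the binary multiple 1024 for the petabyte-to-terabyte conversion; the constant and function names are illustrative only.

```python
# Decimal (SI) units, assumed for the "500 exabytes per day" figure.
EXABYTE = 10**18
QUINTILLION = 10**18  # short-scale quintillion

# Binary multiple, assumed for the petabyte-to-terabyte conversion.
TERABYTES_PER_PETABYTE = 1024


def exabytes_to_bytes(exabytes):
    """Convert exabytes to bytes using decimal units."""
    return exabytes * EXABYTE


def petabytes_to_terabytes(petabytes):
    """Convert petabytes to terabytes using the binary multiple 1024."""
    return petabytes * TERABYTES_PER_PETABYTE


lhc_bytes_per_day = exabytes_to_bytes(500)

# 500 exabytes is the same quantity as 500 quintillion bytes.
print(lhc_bytes_per_day == 500 * QUINTILLION)

# 2.5 petabytes expressed in terabytes.
print(petabytes_to_terabytes(2.5))
```

Under these assumptions, 500 exabytes per day equals 5×10^20 bytes per day, and 2.5 petabytes corresponds to 2560 terabytes.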
The volume of business data worldwide, across all companies, doubles every 1.2 years, according to estimates.
For Development
Following decades of work in the area of the effective usage of information and communication technologies for development (or ICT4D), it has been suggested that Big Data can make important contributions to international development. On the one hand, the advent of Big Data delivers the cost-effective prospect to improve decision-making in critical development areas such as health care, employment, economic productivity, crime and security, and natural disaster and resource management.[39] On the other hand, all the well-known concerns of the Big Data debate, such as privacy, interoperability challenges, and the almighty power of imperfect algorithms, are aggravated in developing countries by long-standing development challenges like lacking technological infrastructure and economic and human resource scarcity. This has the potential to result in a new kind of digital divide: a divide in data-based intelligence to inform decision-making.
5. Technology
Big data requires exceptional technologies to efficiently process large quantities of data within tolerable elapsed times. A 2011 McKinsey report suggests suitable technologies include A/B testing, association rule learning, classification, cluster analysis, crowdsourcing, data fusion and integration, ensemble learning, genetic algorithms, machine learning, natural language processing, neural networks, pattern recognition, anomaly detection, predictive modelling, regression, sentiment analysis, signal processing, supervised and unsupervised learning, simulation, time series analysis and visualization. Multidimensional big data can also be represented as tensors, which can be more efficiently handled by tensor-based computation, such as multilinear subspace learning.[42] Additional technologies being applied to big data include massively parallel-processing (MPP) databases, search-based applications, data-mining grids, distributed file systems, distributed databases, cloud-based infrastructure (applications, storage and computing resources) and the Internet.
Some but not all MPP relational databases have the ability to store and manage petabytes of data. Implicit is the ability to load, monitor, back up, and optimize the use of the large data tables in the RDBMS.
The practitioners of big data analytics processes are generally hostile to slower shared storage, preferring direct-attached storage (DAS) in its various forms, from solid state disk (SSD) to high-capacity SATA disk buried inside parallel processing nodes. The perception of shared storage architectures, SAN and NAS, is that they are relatively slow, complex, and expensive. These qualities are not consistent with big data analytics systems that thrive on system performance, commodity infrastructure, and low cost.
Real or near-real time information delivery is one of the defining characteristics of big data analytics. Latency is therefore avoided whenever and wherever possible. Data in memory is good; data on spinning disk at the other end of an FC SAN connection is not. The cost of a SAN at the scale needed for analytics applications is very much higher than other storage techniques.
There are advantages as well as disadvantages to shared storage in big data analytics, but big data analytics practitioners as of 2011 did not favour it.
6. Apache Hadoop
It is an open-source software framework that supports data-intensive distributed applications, licensed under the Apache v2 license. It supports the running of applications on large clusters of commodity hardware. The Hadoop framework transparently provides both reliability and data motion to applications. Hadoop implements a computational paradigm named MapReduce, where the application is divided into many small fragments of work, each of which may be executed or re-executed on any node in the cluster. In addition, it provides a distributed file system that stores data on the compute nodes, providing very high aggregate bandwidth across the cluster. Both map/reduce and the distributed file system are designed so that node failures are automatically handled by the framework.[2] It enables applications to work with thousands of computation-independent computers and petabytes of data. Hadoop was derived from Google's MapReduce and Google File System (GFS) papers. The entire Apache Hadoop "platform" is now commonly considered to consist of the Hadoop kernel, MapReduce and Hadoop Distributed File System (HDFS), as well as a number of related projects including Apache Hive, Apache HBase, and others.
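The MapReduce paradigm described above can be illustrated without a cluster. The following is a minimal single-process sketch in plain Python, not Hadoop's actual API: the function names (map_phase, shuffle, reduce_phase, word_count) are hypothetical, and word counting is used only as a conventional example. Each input string stands in for one small fragment of work that a real framework could schedule, or re-run after a failure, on any node.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one input fragment."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: combine the values for one key into a single result."""
    return key, sum(values)


def word_count(documents):
    """Run map, shuffle and reduce in sequence over all fragments."""
    pairs = []
    for document in documents:
        # Map calls are independent, which is what lets a real
        # framework distribute them across nodes.
        pairs.extend(map_phase(document))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


counts = word_count(["big data is big", "data moves fast"])
print(counts["big"], counts["data"], counts["fast"])  # prints: 2 2 1
```

Because each map call depends only on its own fragment and each reduce call only on one key's values, a failed fragment can simply be executed again, which is the property the framework relies on for fault handling.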
7. A Fast and Big Data Infrastructure based on Amazon EC2
When I was working in the data mining group within Yahoo! SDS, I used to think only big Internet companies or giant enterprises needed to care about big data and large-scale data mining. This is no longer the case. Big data and data-driven apps will become the driving force of the next-gen Internet and also revolutionize many other traditional industries. As Twitter/Square co-founder Jack Dorsey put it, "Data will be a product by itself!" He was pushing big data initiatives really hard even during the very early stage of Square so that "We are no longer flying in the dark."
So, what does this mean to other startups with an equal or even more ambitious goal to achieve? Why do startups need to care about big data? How to collect and make sense of it? How to build the infrastructure to deal with big data? How can all the big data stuff be pragmatic rather than a fancy showcase? This blog is about to address all the above questions in a very practical way.
To start with, a typical Internet startup is usually built on some sort of cloud hosting environment. Among all major cloud vendors (Amazon AWS, Rackspace, Google App Engine, Microsoft Azure etc.), AWS seems to be the most natural choice for startups because of its deep tech stack and battle-hardened infrastructure (major outages did happen sometimes :)).
Now, back to the topic: if you run everything on the AWS cloud and want your big data gig, how would you actually build it? Let's answer the question by first splitting the design requirements into 2 major categories:
Big Data: able to store and access large-scale data sets and support deep mining.
Fast Data: support production dashboards and near real-time BI needs.
Some companies might focus on one more than the other, but eventually both will become critical as your business advances. Here at Tapjoy, we are experiencing high data demands from both categories. We are seeing more than 200 million mobile user sessions on a daily basis. Both our internal business folks and external partners want to access granular stats in near-realtime fashion. Meanwhile, our engineering and science groups need to build models to facilitate various data-driven applications such as ad ranking optimization, app recommendation, behavior targeting etc. So we came up with the architecture below and built it on Amazon AWS:
Syslog-ng Log Collector Cluster: responsible for collecting logs from production web servers; produces batch logs every minute.
Hadoop ETL Cluster: picks up the logs collected by Syslog-ng and runs E (extraction) and T (transformation) on the raw logs.
Hadoop Data Warehouse: pulls post-ETL data from the ETL cluster and loads it into Hadoop HDFS, ready to be used by various data mining/modeling processes (e.g. Mahout ML algorithms).
Vertica DB Cluster: column-oriented fast analytical DB for near real-time reporting and dashboarding. It also pulls data from the Hadoop ETL cluster.
BI Suite: Tableau Desktop is a neat solution for internal analysts for daily drag & drop type of pivotal analysis.
Everything in this system runs on the AWS cloud, and it uses both EBS and S3 for data redundancy and backup. There are some nice features about this system:
SCALABLE: can handle 1 billion raw events/day with a modest number of EC2 nodes. Beyond that, the system scales linearly, simply by adding more nodes.
ROBUST: the infrastructure is designed assuming any node can be down at any moment. Every component in the system has its backup plan.
FAULT-TOLERANT: the system has a cascading downward flow, which helps isolate problems and solve them. Data at different stages is actively backed up, which can contribute to quick recovery from system failure.
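The staged flow described above (per-minute log batches, then extraction and transformation, then aggregates for reporting) can be sketched in a few lines. This is a toy single-process illustration, not the authors' pipeline: the log line format, the sample data and all function names are assumptions made for the example, and it uses only the Python standard library rather than Syslog-ng, Hadoop or Vertica.

```python
from collections import Counter, defaultdict
from datetime import datetime

# Hypothetical log format: "<ISO timestamp> <user_id> <event>"
RAW_LOGS = [
    "2012-06-01T10:00:05 u1 session_start",
    "2012-06-01T10:00:40 u2 session_start",
    "malformed line",
    "2012-06-01T10:01:10 u1 ad_click",
]


def batch_by_minute(lines):
    """Collector step: group raw lines into one-minute batches."""
    batches = defaultdict(list)
    for line in lines:
        try:
            stamp = datetime.fromisoformat(line.split(" ", 1)[0])
        except ValueError:
            continue  # drop lines without a parseable timestamp
        batches[stamp.replace(second=0)].append(line)
    return batches


def extract_transform(line):
    """ETL step: parse one raw line into a structured record."""
    stamp, user_id, event = line.split()
    return {"time": stamp, "user": user_id, "event": event}


def events_per_minute(batches):
    """Reporting step: count events by type within each batch."""
    return {
        minute: Counter(extract_transform(line)["event"] for line in batch)
        for minute, batch in batches.items()
    }


report = events_per_minute(batch_by_minute(RAW_LOGS))
for minute, counts in sorted(report.items()):
    print(minute.isoformat(), dict(counts))
```

Keeping each stage as a separate function mirrors the cascading design: a bad input is dropped at the collector stage without affecting later stages, and each stage's output could be persisted independently for recovery.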
8. Research activities
In March 2012, the White House announced a national "Big Data Initiative" that consisted of six Federal departments and agencies committing more than $200 million to big data research projects.[45] The initiative included a National Science Foundation "Expeditions in Computing" grant of $10 million over 5 years to the AMPLab at the University of California, Berkeley. The AMPLab also received funds from DARPA and over a dozen industrial sponsors, and uses big data to attack a wide range of problems from predicting traffic congestion to fighting cancer. The White House Big Data Initiative also included a commitment by the Department of Energy to provide $25 million in funding over 5 years to establish the Scalable Data Management, Analysis and Visualization (SDAV) Institute, led by the Energy Department's Lawrence Berkeley National Laboratory. The SDAV Institute aims to bring together the expertise of six national laboratories and seven universities to develop new tools to help scientists manage and visualize data on the Department's supercomputers.
The U.S. state of Massachusetts announced the Massachusetts Big Data Initiative in May 2012, which provides funding from the state government and private companies to a variety of research institutions. The Massachusetts Institute of Technology hosts the Intel Science and Technology Center for Big Data in the MIT Computer Science and Artificial Intelligence Laboratory, combining government, corporate, and institutional funding and research efforts.
REFERENCES
1. White, Tom (10 May 2012). Hadoop: The Definitive Guide. O'Reilly Media. p. 3. ISBN 978-1-4493-3877-0.
2. MIKE2.0, Big Data Definition.
3. Kusnetzky, Dan. What is Big Data?. ZDNet.
4. Vance, Ashley (22 April 2010). Start-Up Goes After Big Data With Hadoop Helper. New York Times Blog.
5. Data, data everywhere. The Economist. 25 February 2010. Retrieved 9 December 2012.
6. Casey, Cat; Perez, Alejandra. E-Discovery Special Report: The Rising Tide of Nonlinear Review. Hudson Global. Retrieved 1 July 2012.
7. What Technology-Assisted Electronic Discovery Teaches Us About The Role Of Humans In Technology: Re-Humanizing Technology-Assisted Review. Forbes. Retrieved 1 July 2012.
8. Francis, Matthew (2012-04-02). Future telescope array drives development of exabyte processing. Retrieved 2012-10-24.
9. Watters, Audrey (2010). The Age of Exabytes: Tools and Approaches for Managing Big Data (Website/Slideshare). Hewlett-Packard Development Company. Retrieved 2012-10-24.
10. Community cleverness required. Nature 455 (7209): 1. 4 September 2008. doi:10.1038/455001a.
11. Sandia sees data management challenges spiral. HPC Projects. 4 August 2009.
12. Reichman, O.J.; Jones, M.B.; Schildhauer, M.P. (2011). Challenges and Opportunities of Open Data in Ecology. Science 331 (6018): 703-5. doi:10.1126/science.1197962.
13. Hellerstein, Joe (9 November 2008). Parallel Programming in the Age of Big Data. Gigaom Blog.
14. Segaran, Toby; Hammerbacher, Jeff (2009). Beautiful Data: The Stories Behind Elegant Data Solutions. O'Reilly Media. p. 257. ISBN 978-0-596-15711-1.