Ccd-410 _ Cloudera Exams

December 20, 2016 | Author: saravanase | Category: N/A

12/23/13

Ccd-410 | Cloudera Exams

Cloudera Certification Exams


This category contains 60 posts.

Exam CCD-410: Cloudera Certified Developer for Apache Hadoop

When is the earliest point at which the reduce method of a given Reducer can be called? Posted by seenagape on October 26, 2013

No comments

When is the earliest point at which the reduce method of a given Reducer can be called?
A. As soon as at least one mapper has finished processing its input split.
B. As soon as a mapper has emitted at least one record.
C. Not until all mappers have finished processing all records.
D. It depends on the InputFormat used for the job.

Explanation: In a MapReduce job, reducers do not start executing the reduce method until all map tasks have completed. Reducers start copying intermediate key-value pairs from the mappers as soon as they are available, but the programmer-defined reduce method is called only after all the mappers have finished.

Note: The reduce phase has three steps: shuffle, sort, and reduce. Shuffle is where the reducer collects data from each mapper. This can happen while mappers are still generating data, since it is only a data transfer. Sort and reduce, on the other hand, can only start once all the mappers are done.

Why is starting the reducers early a good thing? Because it spreads the data transfer from the mappers to the reducers over time, which helps if the network is the bottleneck.

Why is starting the reducers early a bad thing? Because they occupy reduce slots while only copying data. Another job that starts later and could actually use those reduce slots cannot get them.

You can customize when the reducers start by changing the default value of mapred.reduce.slowstart.completed.maps in mapred-site.xml. A value of 1.00 waits for all the mappers to finish before starting the reducers. A value of 0.0 starts the reducers right away. A value of 0.5 starts the reducers when half of the mappers are complete. You can also change mapred.reduce.slowstart.completed.maps on a job-by-job basis.

Typically, keep mapred.reduce.slowstart.completed.maps above 0.9 if the system ever has multiple jobs running at once. This way the job doesn’t hog reducers when they aren’t doing anything but copying data. If you only ever have one job running at a time, 0.1 would probably be appropriate.

Reference: 24 Interview Questions & Answers for Hadoop MapReduce developers, When is the reducers are started in a MapReduce job?

www.aiotestking.com/cloudera/category/single/exam-ccd-410-cloudera-certified-developer-for-apache-hadoop/

Which describes how a client reads a file from HDFS? Posted by seenagape on October 26, 2013

No comments

Which describes how a client reads a file from HDFS?
A. The client queries the NameNode for the block location(s). The NameNode returns the block location(s) to the client. The client reads the data directly off the DataNode(s).
B. The client queries all DataNodes in parallel. The DataNode that contains the requested data responds directly to the client. The client reads the data directly off the DataNode.
C. The client contacts the NameNode for the block location(s). The NameNode then queries the DataNodes for block locations. The DataNodes respond to the NameNode, and the NameNode redirects the client to the DataNode that holds the requested data block(s). The client then reads the data directly off the DataNode.
D. The client contacts the NameNode for the block location(s). The NameNode contacts the DataNode that holds the requested data block. Data is transferred from the DataNode to the NameNode, and then from the NameNode to the client.

Explanation: The client communicates with HDFS using the Hadoop HDFS API. Client applications talk to the NameNode whenever they wish to locate a file, or when they want to add/copy/move/delete a file on HDFS. The NameNode responds to successful requests by returning a list of relevant DataNode servers where the data lives. Client applications can talk directly to a DataNode once the NameNode has provided the location of the data.

Reference: 24 Interview Questions & Answers for Hadoop MapReduce developers, How the Client communicates with HDFS?

Which interface should your class implement? Posted by seenagape on October 26, 2013

No comments

You are developing a combiner that takes as input Text keys, IntWritable values, and emits Text keys, IntWritable values. Which interface should your class implement?
A. Combiner
B. Mapper
C. Reducer
D. Reducer
E. Combiner

Identify the utility that allows you to create and run MapReduce jobs with any executable or script as the mapper and/or the reducer? Posted by seenagape on October 26, 2013

No comments

Identify the utility that allows you to create and run MapReduce jobs with any executable or script as the mapper and/or the reducer?
A. Oozie
B. Sqoop
C. Flume
D. Hadoop Streaming
E. mapred

Explanation: Hadoop Streaming is a utility that comes with the Hadoop distribution. The utility allows you to create and run Map/Reduce jobs with any executable or script as the mapper and/or the reducer.

Reference: http://hadoop.apache.org/common/docs/r0.20.1/streaming.html (Hadoop Streaming, second sentence)
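To make "any executable" concrete, here is a minimal sketch of a program that could serve as a streaming mapper. It relies on the streaming convention that a mapper receives input lines on standard input and writes tab-separated key/value lines to standard output. The class name StreamingWordMapper and the word-count logic are hypothetical, chosen only for illustration.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.ArrayList;
import java.util.List;

// Hypothetical word-count mapper intended to be launched by Hadoop Streaming.
final class StreamingWordMapper {

    // Turns one input line into "word<TAB>1" records, one per whitespace-separated token.
    static List<String> mapLine(String line) {
        List<String> records = new ArrayList<>();
        for (String token : line.trim().split("\\s+")) {
            if (!token.isEmpty()) {
                records.add(token + "\t1");
            }
        }
        return records;
    }

    public static void main(String[] args) throws IOException {
        // Streaming feeds input records on stdin and collects output records from stdout.
        BufferedReader in = new BufferedReader(new InputStreamReader(System.in));
        String line;
        while ((line = in.readLine()) != null) {
            for (String record : mapLine(line)) {
                System.out.println(record);
            }
        }
    }
}
```

The streaming utility would be pointed at such a program through its -mapper option. The jar location and the way the program is shipped to the cluster nodes depend on the installation, so they are left out here.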

How are keys and values presented and passed to the reducers during a standard sort and shuffle phase of MapReduce? Posted by seenagape on October 26, 2013

No comments

How are keys and values presented and passed to the reducers during a standard sort and shuffle phase of MapReduce?
A. Keys are presented to a reducer in sorted order; values for a given key are not sorted.
B. Keys are presented to a reducer in sorted order; values for a given key are sorted in ascending order.
C. Keys are presented to a reducer in random order; values for a given key are not sorted.
D. Keys are presented to a reducer in random order; values for a given key are sorted in ascending order.

Explanation: Reducer has 3 primary phases:

1. Shuffle
The Reducer copies the sorted output from each Mapper using HTTP across the network.

2. Sort
The framework merge sorts Reducer inputs by keys (since different Mappers may have output the same key). The shuffle and sort phases occur simultaneously, i.e. while outputs are being fetched they are merged.

SecondarySort
To achieve a secondary sort on the values returned by the value iterator, the application should extend the key with the secondary key and define a grouping comparator. The keys will be sorted using the entire key, but will be grouped using the grouping comparator to decide which keys and values are sent in the same call to reduce.

3. Reduce
In this phase the reduce(Object, Iterable, Context) method is called for each <key, (collection of values)> in the sorted inputs. The output of the reduce task is typically written to a RecordWriter via TaskInputOutputContext.write(Object, Object). The output of the Reducer is not re-sorted.

Reference: org.apache.hadoop.mapreduce, Class Reducer

Assuming default settings, which best describes the order of data provided to a reducer’s reduce method: Posted by seenagape on October 26, 2013

No comments

Assuming default settings, which best describes the order of data provided to a reducer’s reduce method:
A. The keys given to a reducer aren’t in a predictable order, but the values associated with those keys always are.
B. Both the keys and values passed to a reducer always appear in sorted order.
C. Neither keys nor values are in any predictable order.
D. The keys given to a reducer are in sorted order but the values associated with each key are in no predictable order.

Explanation: Reducer has 3 primary phases:

1. Shuffle
The Reducer copies the sorted output from each Mapper using HTTP across the network.

2. Sort
The framework merge sorts Reducer inputs by keys (since different Mappers may have output the same key). The shuffle and sort phases occur simultaneously, i.e. while outputs are being fetched they are merged.

SecondarySort
To achieve a secondary sort on the values returned by the value iterator, the application should extend the key with the secondary key and define a grouping comparator. The keys will be sorted using the entire key, but will be grouped using the grouping comparator to decide which keys and values are sent in the same call to reduce.

3. Reduce
In this phase the reduce(Object, Iterable, Context) method is called for each <key, (collection of values)> in the sorted inputs. The output of the reduce task is typically written to a RecordWriter via TaskInputOutputContext.write(Object, Object). The output of the Reducer is not re-sorted.

Reference: org.apache.hadoop.mapreduce, Class Reducer

Identify the number of failed task attempts you can expect when you run the job with mapred.max.map.attempts set to 4: Posted by seenagape on October 26, 2013

No comments

You wrote a map function that throws a runtime exception when it encounters a control character in input data. The input supplied to your mapper contains twelve such characters in total, spread across five file splits. The first four file splits each have two control characters and the last split has four control characters. Identify the number of failed task attempts you can expect when you run the job with mapred.max.map.attempts set to 4:
A. You will have forty-eight failed task attempts
B. You will have seventeen failed task attempts
C. You will have five failed task attempts
D. You will have twelve failed task attempts
E. You will have twenty failed task attempts

Explanation: There will be four failed task attempts for each of the five file splits, for a total of twenty.
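The arithmetic behind that explanation can be sketched as follows. The helper below is hypothetical and rests on one simplifying assumption: every map task is allowed to exhaust all of its attempts, and each attempt fails deterministically because it reads the same control character again. The number of control characters per split does not matter beyond being non-zero, since the first one already kills the attempt.

```java
// Hypothetical helper: counts failed attempts when a task fails on every retry.
final class FailedAttempts {

    // badCharsPerSplit[i] is the number of control characters in split i.
    static int expectedFailures(int[] badCharsPerSplit, int maxAttempts) {
        int failures = 0;
        for (int badChars : badCharsPerSplit) {
            if (badChars > 0) {
                // Assumption: each retry hits the same character, so all attempts fail.
                failures += maxAttempts;
            }
        }
        return failures;
    }

    public static void main(String[] args) {
        int[] splits = {2, 2, 2, 2, 4};                 // the five splits from the question
        System.out.println(expectedFailures(splits, 4)); // prints 20
    }
}
```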

Which method in the Mapper should you use to implement code for reading the file and populating the associative array? Posted by seenagape on October 26, 2013

No comments

You want to populate an associative array in order to perform a map-side join. You’ve decided to put this information in a text file, place that file into the DistributedCache and read it in your Mapper before any records are processed. Identify which method in the Mapper you should use to implement code for reading the file and populating the associative array?
A. combine
B. map
C. init
D. configure

Explanation: See 3) below. Here is an illustrative example of how to use the DistributedCache:

// Setting up the cache for the application

1. Copy the requisite files to the FileSystem:

    $ bin/hadoop fs -copyFromLocal lookup.dat /myapp/lookup.dat
    $ bin/hadoop fs -copyFromLocal map.zip /myapp/map.zip
    $ bin/hadoop fs -copyFromLocal mylib.jar /myapp/mylib.jar
    $ bin/hadoop fs -copyFromLocal mytar.tar /myapp/mytar.tar
    $ bin/hadoop fs -copyFromLocal mytgz.tgz /myapp/mytgz.tgz
    $ bin/hadoop fs -copyFromLocal mytargz.tar.gz /myapp/mytargz.tar.gz

2. Set up the application’s JobConf:

    JobConf job = new JobConf();
    DistributedCache.addCacheFile(new URI("/myapp/lookup.dat#lookup.dat"), job);
    DistributedCache.addCacheArchive(new URI("/myapp/map.zip"), job);
    DistributedCache.addFileToClassPath(new Path("/myapp/mylib.jar"), job);
    DistributedCache.addCacheArchive(new URI("/myapp/mytar.tar"), job);
    DistributedCache.addCacheArchive(new URI("/myapp/mytgz.tgz"), job);
    DistributedCache.addCacheArchive(new URI("/myapp/mytargz.tar.gz"), job);

3. Use the cached files in the Mapper or Reducer:

    public static class MapClass extends MapReduceBase implements Mapper<K, V, K, V> {

      private Path[] localArchives;
      private Path[] localFiles;

      // Called once per task, before any call to map()
      public void configure(JobConf job) {
        // Get the cached archives/files
        localArchives = DistributedCache.getLocalCacheArchives(job);
        localFiles = DistributedCache.getLocalCacheFiles(job);
      }

      public void map(K key, V value, OutputCollector<K, V> output, Reporter reporter) throws IOException {
        // Use data from the cached archives/files here
        // …
        // …
        output.collect(k, v);
      }
    }

Reference: org.apache.hadoop.filecache, Class DistributedCache

Which interface is most likely to reduce the amount of intermediate data transferred across the network? Posted by seenagape on October 26, 2013

No comments

You’ve written a MapReduce job that will process 500 million input records and generate 500 million key-value pairs. The data is not uniformly distributed. Your MapReduce job will create a significant amount of intermediate data that it needs to transfer between mappers and reducers, which is a potential bottleneck. A custom implementation of which interface is most likely to reduce the amount of intermediate data transferred across the network?
A. Partitioner
B. OutputFormat
C. WritableComparable
D. Writable
E. InputFormat
F. Combiner

Explanation: Combiners are used to increase the efficiency of a MapReduce program. They aggregate intermediate map output locally, on the output of each individual mapper. Combiners can help you reduce the amount of data that needs to be transferred across to the reducers. You can use your reducer code as a combiner if the operation performed is commutative and associative.

Reference: 24 Interview Questions & Answers for Hadoop MapReduce developers, What are combiners? When should I use a combiner in my MapReduce Job?

Can you use MapReduce to perform a relational join on two large tables sharing a key? Posted by seenagape on October 26, 2013

No comments

Can you use MapReduce to perform a relational join on two large tables sharing a key? Assume that the two tables are formatted as comma-separated files in HDFS.
A. Yes.
B. Yes, but only if one of the tables fits into memory.
C. Yes, so long as both tables fit into memory.
D. No, MapReduce cannot perform relational operations.
E. No, but it can be done with either Pig or Hive.

Explanation: Note:
* Join algorithms in MapReduce
A) Reduce-side join
B) Map-side join
C) In-memory join / Striped variant / Memcached variant
* Which join to use?
/ In-memory join > map-side join > reduce-side join
/ Limitations of each?
In-memory join: memory
Map-side join: sort order and partitioning
Reduce-side join: general purpose

Where is intermediate data written to after being emitted from the Mapper’s map method? Posted by seenagape on October 26, 2013

No comments

You have just executed a MapReduce job. Where is intermediate data written to after being emitted from the Mapper’s map method?
A. Intermediate data is streamed across the network from the Mapper to the Reducer and is never written to disk.
B. Into in-memory buffers on the TaskTracker node running the Mapper that spill over and are written into HDFS.
C. Into in-memory buffers that spill over to the local file system of the TaskTracker node running the Mapper.
D. Into in-memory buffers that spill over to the local file system (outside HDFS) of the TaskTracker node running the Reducer.
E. Into in-memory buffers on the TaskTracker node running the Reducer that spill over and are written into HDFS.

Explanation: The mapper output (intermediate data) is stored on the local file system (NOT HDFS) of each individual mapper node. This is typically a temporary directory location which can be set up in the configuration by the Hadoop administrator. The intermediate data is cleaned up after the Hadoop job completes.

Reference: 24 Interview Questions & Answers for Hadoop MapReduce developers, Where is the Mapper Output (intermediate key-value data) stored?

How will you gather this data for your analysis? Posted by seenagape on October 26, 2013

No comments

You want to understand more about how users browse your public website, such as which pages they visit prior to placing an order. You have a farm of 200 web servers hosting your website. How will you gather this data for your analysis?
A. Ingest the server web logs into HDFS using Flume.
B. Write a MapReduce job, with the web servers for mappers, and the Hadoop cluster nodes for reducers.
C. Import all users’ clicks from your OLTP databases into Hadoop, using Sqoop.
D. Channel these clickstreams into Hadoop using Hadoop Streaming.
E. Sample the weblogs from the web servers, copying them into Hadoop using curl.

Explanation: Hadoop MapReduce for Parsing Weblogs

Here are the steps for parsing a log file using Hadoop MapReduce:

Load log files into the HDFS location using this Hadoop command: hadoop fs -put

The Opencsv2.3.jar framework is used for parsing log records.

Below is the Mapper program for parsing the log file from the HDFS location.

    public static class ParseMapper extends Mapper {
      private Text word = new Text();

      public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
        CSVParser parse = new CSVParser(' ', '\"');
        String sp[] = parse.parseLine(value.toString());
        int spSize = sp.length;
        StringBuffer rec = new StringBuffer();
        for(int i=0;i

[…]

<World, 1> <Bye, 1> <World, 1>

The second map emits: <Hello, 1> <Hadoop, 1> <Goodbye, 1> <Hadoop, 1>

Which two resources should you expect to be bottlenecks? Posted by seenagape on October 26, 2013

No comments

You need to create a job that does frequency analysis on input data. You will do this by writing a Mapper that uses TextInputFormat and splits each value (a line of text from an input file) into individual characters. For each one of these characters, you will emit the character as a key and an IntWritable as the value. As this will produce proportionally more intermediate data than input data, which two resources should you expect to be bottlenecks?
A. Processor and network I/O
B. Disk I/O and network I/O
C. Processor and RAM
D. Processor and disk I/O
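To see why such a mapper produces proportionally more intermediate data than input data, the rough estimate below counts the records and bytes emitted for one line. It is only a sketch under stated assumptions: the input is ASCII, each key is a one-character Text (assumed to serialize to about 2 bytes), each value is an IntWritable (a 4-byte int), and any per-record framing added by the framework is ignored. The class and method names are hypothetical.

```java
// Hypothetical estimate of map output size for a character-frequency mapper.
final class CharEmitEstimate {

    // Assumed serialized sizes; real sizes depend on the Writable types and the framework.
    static final int KEY_BYTES = 2;   // assumption: one-character Text key (length prefix + 1 byte)
    static final int VALUE_BYTES = 4; // an IntWritable holds a 4-byte int

    // One (character, 1) record is emitted per character of the line.
    static long recordsEmitted(String line) {
        return line.length();
    }

    // Approximate bytes of intermediate data produced for the line.
    static long intermediateBytes(String line) {
        return recordsEmitted(line) * (KEY_BYTES + VALUE_BYTES);
    }

    public static void main(String[] args) {
        String line = "hello";
        System.out.println("input bytes: " + line.length());
        System.out.println("records emitted: " + recordsEmitted(line));
        System.out.println("intermediate bytes (approx.): " + intermediateBytes(line));
    }
}
```

Under these assumptions every input byte turns into one record of several bytes, and all of those records have to be buffered, spilled, and shipped to the reducers unless something aggregates them first.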


Will you be able to reuse your existing Reducer as your combiner in this case and why or why not? Posted by seenagape on October 26, 2013

No comments

You want to count the number of occurrences for each unique word in the supplied input data. You’ve decided to implement this by having your mapper tokenize each word and emit a literal value 1, and then have your reducer increment a counter for each literal 1 it receives. After successfully implementing this, it occurs to you that you could optimize this by specifying a combiner. Will you be able to reuse your existing Reducer as your combiner in this case and why or why not?
A. Yes, because the sum operation is both associative and commutative and the input and output types to the reduce method match.
B. No, because the sum operation in the reducer is incompatible with the operation of a Combiner.
C. No, because the Reducer and Combiner are separate interfaces.
D. No, because the Combiner is incompatible with a mapper which doesn’t use the same data type for both the key and value.
E. Yes, because Java is a polymorphic object-oriented language and thus reducer code can be reused as a combiner.

Explanation: Combiners are used to increase the efficiency of a MapReduce program. They aggregate intermediate map output locally, on the output of each individual mapper. Combiners can help you reduce the amount of data that needs to be transferred across to the reducers. You can use your reducer code as a combiner if the operation performed is commutative and associative. The execution of a combiner is not guaranteed: Hadoop may or may not execute it, and if required it may execute it more than once. Therefore your MapReduce jobs should not depend on the combiner’s execution.

Reference: 24 Interview Questions & Answers for Hadoop MapReduce developers, What are combiners? When should I use a combiner in my MapReduce Job?

Identify the Hadoop daemon on which the Hadoop framework will look for an available slot to schedule a MapReduce operation. Posted by seenagape on October 26, 2013

No comments

Your client application submits a MapReduce job to your Hadoop cluster. Identify the Hadoop daemon on which the Hadoop framework will look for an available slot to schedule a MapReduce operation.
A. TaskTracker
B. NameNode
C. DataNode
D. JobTracker
E. Secondary NameNode

Explanation: JobTracker is the daemon service for submitting and tracking MapReduce jobs in Hadoop. Only one JobTracker process runs on any Hadoop cluster. The JobTracker runs in its own JVM process; in a typical production cluster it runs on a separate machine. Each slave node is configured with the JobTracker node location. The JobTracker is a single point of failure for the Hadoop MapReduce service: if it goes down, all running jobs are halted.

The JobTracker performs the following actions (from the Hadoop Wiki):
Client applications submit jobs to the JobTracker.
The JobTracker talks to the NameNode to determine the location of the data.
The JobTracker locates TaskTracker nodes with available slots at or near the data.
The JobTracker submits the work to the chosen TaskTracker nodes.
The TaskTracker nodes are monitored. If they do not submit heartbeat signals often enough, they are deemed to have failed and the work is scheduled on a different TaskTracker.
A TaskTracker will notify the JobTracker when a task fails. The JobTracker decides what to do then: it may resubmit the job elsewhere, it may mark that specific record as something to avoid, and it may even blacklist the TaskTracker as unreliable.
When the work is completed, the JobTracker updates its status.
Client applications can poll the JobTracker for information.

Reference: 24 Interview Questions & Answers for Hadoop MapReduce developers, What is a JobTracker in Hadoop? How many instances of JobTracker run on a Hadoop Cluster?

Which project gives you a distributed, scalable data store that allows you random, realtime read/write access to hundreds of terabytes of data? Posted by seenagape on October 26, 2013

No comments

Which project gives you a distributed, scalable data store that allows you random, realtime read/write access to hundreds of terabytes of data?
A. HBase
B. Hue
C. Pig
D. Hive
E. Oozie
F. Flume
G. Sqoop

Explanation: Use Apache HBase when you need random, realtime read/write access to your Big Data.

Note: This project’s goal is the hosting of very large tables (billions of rows X millions of columns) atop clusters of commodity hardware. Apache HBase is an open-source, distributed, versioned, column-oriented store modeled after Google’s Bigtable: A Distributed Storage System for Structured Data by Chang et al. Just as Bigtable leverages the distributed data storage provided by the Google File System, Apache HBase provides Bigtable-like capabilities on top of Hadoop and HDFS.

Features
Linear and modular scalability.
Strictly consistent reads and writes.
Automatic and configurable sharding of tables.
Automatic failover support between RegionServers.
Convenient base classes for backing Hadoop MapReduce jobs with Apache HBase tables.
Easy to use Java API for client access.
Block cache and Bloom Filters for real-time queries.
Query predicate push down via server side Filters.
Thrift gateway and a REST-ful Web service that supports XML, Protobuf, and binary data encoding options.
Extensible jruby-based (JIRB) shell.
Support for exporting metrics via the Hadoop metrics subsystem to files or Ganglia; or via JMX.

Reference: http://hbase.apache.org/ (when would I use HBase? First sentence)

What would another user see when trying to access this file? Posted by seenagape on October 26, 2013

No comments

You use the hadoop fs -put command to write a 300 MB file using an HDFS block size of 64 MB. Just after this command has finished writing 200 MB of this file, what would another user see when trying to access this file?
A. They would see Hadoop throw a ConcurrentFileAccessException when they try to access this file.
B. They would see the current state of the file, up to the last bit written by the command.
C. They would see the current state of the file through the last completed block.
D. They would see no content until the whole file is written and closed.

Explanation: Note:
* put
Usage: hadoop fs -put …
Copy single src, or multiple srcs from local file system to the destination filesystem. Also reads input from stdin and writes to destination filesystem.

Identify the tool best suited to import a portion of a relational database every day as files into HDFS, and generate Java classes to interact with that imported data? Posted by seenagape on October 26, 2013

No comments

Identify the tool best suited to import a portion of a relational database every day as files into HDFS, and generate Java classes to interact with that imported data?
A. Oozie
B. Flume
C. Pig
D. Hue
E. Hive
F. Sqoop
G. fuse-dfs

Explanation: Sqoop (“SQL-to-Hadoop”) is a straightforward command-line tool with the following capabilities:
Imports individual tables or entire databases to files in HDFS
Generates Java classes to allow you to interact with your imported data
Provides the ability to import from SQL databases straight into your Hive data warehouse

Note: Data Movement Between Hadoop and Relational Databases
Data can be moved between Hadoop and a relational database as a bulk data transfer, or relational tables can be accessed from within a MapReduce map function.

Note:
* Cloudera’s Distribution for Hadoop provides a bulk data transfer tool (i.e., Sqoop) that imports individual tables or entire databases into HDFS files. The tool also generates Java classes that support interaction with the imported data. Sqoop supports all relational databases over JDBC, and Quest Software provides a connector (i.e., OraOop) that has been optimized for access to data residing in Oracle databases.

Reference: http://log.medcl.net/item/2011/08/hadoop-and-mapreduce-big-data-analytics-gartner/ (Data Movement between hadoop and relational databases, second paragraph)

How many files will be processed by the FileInputFormat.setInputPaths() command when it’s given a path object representing this directory? Posted by seenagape on October 26, 2013

No comments

You have a directory named jobdata in HDFS that contains four files: _first.txt, second.txt, .third.txt and #data.txt. How many files will be processed by the FileInputFormat.setInputPaths() command when it’s given a path object representing this directory?
A. Four, all files will be processed
B. Three, the pound sign is an invalid character for HDFS file names
C. Two, file names with a leading period or underscore are ignored
D. None, the directory cannot be named jobdata
E. One, no special characters can prefix the name of an input file

Explanation: Files starting with ‘_’ are considered hidden, like Unix files starting with ‘.’. The # character is allowed in HDFS file names.

Determine the difference between setting the number of reducers to one and setting the number of reducers to zero. Posted by seenagape on October 26, 2013

No comments

You write a MapReduce job to process 100 files in HDFS. Your MapReduce algorithm uses TextInputFormat: the mapper applies a regular expression over input values and emits key-value pairs with the key consisting of the matching text, and the value containing the filename and byte offset. Determine the difference between setting the number of reducers to one and setting the number of reducers to zero.
A. There is no difference in output between the two settings.
B. With zero reducers, no reducer runs and the job throws an exception. With one reducer, instances of matching patterns are stored in a single file on HDFS.
C. With zero reducers, all instances of matching patterns are gathered together in one file on HDFS. With one reducer, instances of matching patterns are stored in multiple files on HDFS.
D. With zero reducers, instances of matching patterns are stored in multiple files on HDFS. With one reducer, all instances of matching patterns are gathered together in one file on HDFS.

Explanation:
* It is legal to set the number of reduce-tasks to zero if no reduction is desired. In this case the outputs of the map-tasks go directly to the FileSystem, into the output path set by setOutputPath(Path). The framework does not sort the map-outputs before writing them out to the FileSystem.
* Often, you may want to process input data using a map function only. To do this, simply set mapreduce.job.reduces to zero. The MapReduce framework will not create any reducer tasks. Rather, the outputs of the mapper tasks will be the final output of the job.

Note: Reduce
In this phase the reduce(WritableComparable, Iterator, OutputCollector, Reporter) method is called for each <key, (list of values)> pair in the grouped inputs. The output of the reduce task is typically written to the FileSystem via OutputCollector.collect(WritableComparable, Writable). Applications can use the Reporter to report progress, set application-level status messages and update Counters, or just indicate that they are alive. The output of the Reducer is not sorted.

A combiner reduces: Posted by seenagape on October 26, 2013

No comments

A combiner reduces:
A. The number of values across different keys in the iterator supplied to a single reduce method call.
B. The amount of intermediate data that must be transferred between the mapper and reducer.
C. The number of input files a mapper must process.
D. The number of output files a reducer must produce.

Explanation: Combiners are used to increase the efficiency of a MapReduce program. They aggregate intermediate map output locally, on the output of each individual mapper. Combiners can help you reduce the amount of data that needs to be transferred across to the reducers. You can use your reducer code as a combiner if the operation performed is commutative and associative. The execution of a combiner is not guaranteed: Hadoop may or may not execute it, and if required it may execute it more than once. Therefore your MapReduce jobs should not depend on the combiner’s execution.

Reference: 24 Interview Questions & Answers for Hadoop MapReduce developers, What are combiners? When should I use a combiner in my MapReduce Job?

How many map task attempts will there be? Posted by seenagape on October 26, 2013

No comments

In a MapReduce job with 500 map tasks, how many map task attempts will there be?
A. It depends on the number of reducers in the job.
B. Between 500 and 1000.
C. At most 500.
D. At least 500.
E. Exactly 500.

Explanation: From the Cloudera Training Course:
A task attempt is a particular instance of an attempt to execute a task
– There will be at least as many task attempts as there are tasks
– If a task attempt fails, another will be started by the JobTracker
– Speculative execution can also result in more task attempts than completed tasks

MapReduce v2 (MRv2/YARN) splits which major functions of the JobTracker into separate daemons? Posted by seenagape on October 26, 2013

MapReduce v2 (MRv2/YARN) splits which major functions of the JobTracker into separate daemons? Select two.
A. Health status checks (heartbeats)
B. Resource management
C. Job scheduling/monitoring
D. Job coordination between the ResourceManager and NodeManager
E. Launching tasks
F. Managing file system metadata
G. MapReduce metric reporting
H. Managing tasks

Explanation: The fundamental idea of MRv2 is to split up the two major functionalities of the JobTracker, resource management and job scheduling/monitoring, into separate daemons. The idea is to have a global ResourceManager (RM) and per-application ApplicationMaster (AM). An application is either a single job in the classical sense of Map-Reduce jobs or a DAG of jobs.

Note: The central goal of YARN is to clearly separate two things that are unfortunately smushed together in current Hadoop, specifically in (mainly) JobTracker:
* Monitoring the status of the cluster with respect to which nodes have which resources available. Under YARN, this will be global.
* Managing the parallelized execution of any specific job. Under YARN, this will be done separately for each job.

Reference: Apache Hadoop YARN – Concepts & Applications

What types of algorithms are difficult to express in MapReduce v1 (MRv1)?
A. Algorithms that require applying the same mathematical function to large numbers of individual binary records.
B. Relational operations on large amounts of structured and semi-structured data.
C. Algorithms that require global, shared state.
D. Large-scale graph algorithms that require one-step link traversal.
E. Text analysis algorithms on large collections of unstructured text (e.g., Web crawls).

Explanation: See 3) below.

Limitations of MapReduce: where not to use MapReduce

While very powerful and applicable to a wide variety of problems, MapReduce is not the answer to every problem. Here are some problems I found where MapReduce is not suited and some papers that address the limitations of MapReduce.

1. Computation depends on previously computed values
If the computation of a value depends on previously computed values, then MapReduce cannot be used. One good example is the Fibonacci series, where each value is the sum of the previous two values, i.e., f(k+2) = f(k+1) + f(k). Also, if the data set is small enough to be computed on a single machine, then it is better to do it as a single reduce(map(data)) operation rather than going through the entire MapReduce process.

2. Full-text indexing or ad hoc searching
The index generated in the Map step is one dimensional, and the Reduce step must not generate a large amount of data or there will be a serious performance degradation. For example, CouchDB’s MapReduce may not be a good fit for full-text indexing or ad hoc searching. This is a problem better suited for a tool such as Lucene.

3. Algorithms depend on shared global state
Solutions to many interesting problems in text processing do not require global synchronization. As a result, they can be expressed naturally in MapReduce, since map and reduce tasks run independently and in isolation. However, there are many examples of algorithms that depend crucially on the existence of shared global state during processing, making them difficult to implement in MapReduce (since the single opportunity for global synchronization in MapReduce is the barrier between the map and reduce phases of processing).

Reference: Limitations of Mapreduce – where not to use Mapreduce

In the reducer, the MapReduce API provides you with an iterator over Writable values. What does calling the next() method return?
A. It returns a reference to a different Writable object each time.
B. It returns a reference to a Writable object from an object pool.
C. It returns a reference to the same Writable object each time, but populated with different data.
D. It returns a reference to a Writable object. The API leaves unspecified whether this is a reused object or a new object.
E. It returns a reference to the same Writable object if the next value is the same as the previous value, or a new Writable object otherwise.

Explanation: Calling Iterator.next() will always return the SAME EXACT instance of IntWritable, with the contents of that instance replaced with the next value.

Reference: manupulating iterator in mapreduce
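The practical consequence of the object reuse described above is that storing the returned references without copying them loses data. The following is a small Python simulation of that behaviour, not Hadoop code: IntBox and reusing_iterator are hypothetical stand-ins for IntWritable and the reducer's value iterator.

```python
import copy

class IntBox:
    """Hypothetical stand-in for a mutable Writable such as IntWritable."""
    def __init__(self):
        self.value = 0

def reusing_iterator(raw_values):
    """Yield the same IntBox every time, refilled with the next value."""
    box = IntBox()
    for v in raw_values:
        box.value = v
        yield box

# Keeping the references: every entry is the same object, holding the last value.
kept_refs = [box for box in reusing_iterator([1, 2, 3])]
print([b.value for b in kept_refs])

# Copying before storing preserves each value.
kept_copies = [copy.copy(box) for box in reusing_iterator([1, 2, 3])]
print([b.value for b in kept_copies])
```

In a real reducer the equivalent precaution would be to copy the value (or extract its primitive content) before adding it to a collection.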

Table metadata in Hive is:
A. Stored as metadata on the NameNode.
B. Stored along with the data in HDFS.
C. Stored in the Metastore.
D. Stored in ZooKeeper.

Explanation: By default, Hive uses an embedded Derby database to store metadata information. The metastore is the “glue” between Hive and HDFS. It tells Hive where your data files live in HDFS, what type of data they contain, what tables they belong to, etc.

The Metastore is an application that runs on an RDBMS and uses an open source ORM layer called DataNucleus to convert object representations into a relational schema and vice versa. This approach was chosen over storing the information in HDFS because the Metastore needs to be very low latency. The DataNucleus layer allows many different RDBMS technologies to be plugged in.

Note:
* By default, Hive stores metadata in an embedded Apache Derby database, and other client/server databases like MySQL can optionally be used.
* Features of Hive include: metadata storage in an RDBMS, significantly reducing the time to perform semantic checks during query execution.

Reference: Store Hive Metadata into RDBMS

Analyze each scenario below and identify which best describes the behavior of the default partitioner.
A. The default partitioner assigns key-value pairs to reducers based on an internal random number generator.
B. The default partitioner implements a round-robin strategy, shuffling the key-value pairs to each reducer in turn. This ensures an even partition of the key space.
C. The default partitioner computes the hash of the key. Hash values between specific ranges are associated with different buckets, and each bucket is assigned to a specific reducer.
D. The default partitioner computes the hash of the key and divides that value modulo the number of reducers. The result determines the reducer assigned to process the key-value pair.
E. The default partitioner computes the hash of the value and takes the mod of that value with the number of reducers. The result determines the reducer assigned to process the key-value pair.

Explanation: The default partitioner computes a hash value for the key and assigns the partition based on this result. The default Partitioner implementation is called HashPartitioner. It uses the hashCode() method of the key objects modulo the total number of partitions to determine which partition to send a given (key, value) pair to. In Hadoop, the default partitioner is HashPartitioner, which hashes a record’s key to determine which partition (and thus which reducer) the record belongs in. The number of partitions is then equal to the number of reduce tasks for the job.

Reference: Getting Started With (Customized) Partitioning
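The hash-then-modulo rule described for the default partitioner can be sketched in a few lines of Python. This is an illustration only: java_string_hash imitates Java's String.hashCode() for simple text as an assumed stand-in for the key's hashCode() (Hadoop key types such as Text compute their hash differently), and the function names are made up here.

```python
def java_string_hash(s):
    """32-bit hash in the style of Java's String.hashCode() (BMP characters only)."""
    h = 0
    for ch in s:
        h = (31 * h + ord(ch)) & 0xFFFFFFFF
    return h

def partition(key, num_reducers):
    """Clear the sign bit, then take the hash modulo the number of reducers."""
    return (java_string_hash(key) & 0x7FFFFFFF) % num_reducers

for key in ["a", "b", "ab", "a"]:
    print(key, "->", partition(key, 4))
```

Since the partition depends only on the key and the reducer count, every record with the same key lands on the same reducer, which is what allows the reduce method to see all values for a key together.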

You need to move a file titled “weblogs” into HDFS. When you try to copy the file, you can’t. You know you have ample space on your DataNodes. Which action should you take to relieve this situation and store more files in HDFS?
A. Increase the block size on all current files in HDFS.
B. Increase the block size on your remaining files.
C. Decrease the block size on your remaining files.
D. Increase the amount of memory for the NameNode.
E. Increase the number of disks (or size) for the NameNode.
F. Decrease the block size on all current files in HDFS.

Explanation:
Note:
* -put localSrc dest: Copies the file or directory from the local file system identified by localSrc to dest within the DFS.
* What is HDFS block size? How is it different from traditional file system block size? In HDFS, data is split into blocks and distributed across multiple nodes in the cluster. Each block is typically 64 MB or 128 MB in size. Each block is replicated multiple times. The default is to replicate each block three times. Replicas are stored on different nodes. HDFS utilizes the local file system to store each HDFS block as a separate file. HDFS block size cannot be compared with the traditional file system block size.

In a large MapReduce job with m mappers and n reducers, how many distinct copy operations will there be in the sort/shuffle phase?
A. m × n (i.e., m multiplied by n)
B. n
C. m
D. m + n (i.e., m plus n)
E. m^n (i.e., m to the power of n)

Explanation: A MapReduce job with m mappers and n reducers involves up to m × n distinct copy operations, since each mapper may have intermediate output going to every reducer.
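The upper bound on shuffle copies can be checked by listing every (mapper, reducer) pair. The sketch below is a plain Python illustration with a made-up function name, not a model of how Hadoop schedules the transfers.

```python
from itertools import product

def copy_operations(num_mappers, num_reducers):
    """List every (mapper, reducer) pair that may need one copy."""
    return list(product(range(num_mappers), range(num_reducers)))

ops = copy_operations(3, 2)
print(len(ops), "possible copies for 3 mappers and 2 reducers")
```

The count is an upper bound because a mapper that produces no output for a given partition has nothing to send to that reducer.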

Workflows expressed in Oozie can contain:
A. Sequences of MapReduce and Pig. These sequences can be combined with other actions including forks, decision points, and path joins.
B. Sequences of MapReduce jobs only; no Pig or Hive tasks or jobs. These MapReduce sequences can be combined with forks and path joins.
C. Sequences of MapReduce and Pig jobs. These are limited to linear sequences of actions with exception handlers but no forks.
D. Iterative repetition of MapReduce jobs until a desired answer or state is reached.

Explanation: An Oozie workflow is a collection of actions (i.e. Hadoop Map/Reduce jobs, Pig jobs) arranged in a control dependency DAG (Directed Acyclic Graph), specifying a sequence of action executions. This graph is specified in hPDL (an XML Process Definition Language). hPDL is a fairly compact language, using a limited amount of flow control and action nodes. Control nodes define the flow of execution and include the beginning and end of a workflow (start, end and fail nodes) and mechanisms to control the workflow execution path (decision, fork and join nodes).

Note: Oozie is a Java web application that runs in a Java servlet container (Tomcat) and uses a database to store:
* Workflow definitions
* Currently running workflow instances, including instance states and variables

Reference: Introduction to Oozie

Which best describes what the map method accepts and emits?
A. It accepts a single key-value pair as input and emits a single key and list of corresponding values as output.
B. It accepts a single key-value pair as input and can emit only one key-value pair as output.
C. It accepts a list of key-value pairs as input and can emit only one key-value pair as output.
D. It accepts a single key-value pair as input and can emit any number of key-value pairs as output, including zero.

Explanation: public class Mapper<KEYIN,VALUEIN,KEYOUT,VALUEOUT> extends Object
Maps input key/value pairs to a set of intermediate key/value pairs. Maps are the individual tasks which transform input records into intermediate records. The transformed intermediate records need not be of the same type as the input records. A given input pair may map to zero or many output pairs.

Reference: org.apache.hadoop.mapreduce Class Mapper
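The "zero or many output pairs" behaviour of a map call is easy to see with a generator. This is a Python sketch of the idea rather than the Hadoop Mapper API; map_record is a hypothetical function, and the byte-offset key mirrors what TextInputFormat would supply.

```python
def map_record(offset, line):
    """Take one (offset, line) pair and emit a (word, 1) pair per word."""
    for word in line.split():
        yield (word, 1)

# One input pair producing several output pairs.
print(list(map_record(0, "hello hadoop hello")))

# One input pair producing no output at all (blank line).
print(list(map_record(19, "")))
```

Note also that the output key type (a word) differs from the input key type (an offset), in line with the statement that intermediate records need not match the input types.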

When can a reduce class also serve as a combiner without affecting the output of a MapReduce program?
A. When the types of the reduce operation’s input key and input value match the types of the reducer’s output key and output value and when the reduce operation is both commutative and associative.
B. When the signature of the reduce method matches the signature of the combine method.
C. Always. Code can be reused in Java since it is a polymorphic object-oriented programming language.
D. Always. The point of a combiner is to serve as a mini-reducer directly after the map phase to increase performance.
E. Never. Combiners and reducers must be implemented separately because they serve different purposes.

Explanation: You can use your reducer code as a combiner if the operation performed is commutative and associative.

Reference: 24 Interview Questions & Answers for Hadoop MapReduce developers, What are combiners? When should I use a combiner in my MapReduce Job?
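A quick way to see why the commutative-and-associative condition matters is to compare a sum with a mean. The sketch below is plain Python with invented sample values; the two lists stand for the values that two different mappers emit for the same key.

```python
def mean(values):
    """Arithmetic mean of a non-empty list."""
    return sum(values) / len(values)

mapper_a = [1, 2, 3]
mapper_b = [10]

# Sum: combining per mapper first gives the same total as reducing everything at once.
direct_sum = sum(mapper_a + mapper_b)
combined_sum = sum([sum(mapper_a), sum(mapper_b)])

# Mean: averaging the per-mapper averages changes the answer.
direct_mean = mean(mapper_a + mapper_b)
combined_mean = mean([mean(mapper_a), mean(mapper_b)])

print(direct_sum, combined_sum)
print(direct_mean, combined_mean)
```

A summing reducer can therefore double as a combiner, while an averaging reducer cannot be reused as-is; one common workaround is to have the combiner emit partial (sum, count) pairs instead.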

You want to perform analysis on a large collection of images. You want to store this data in HDFS and process it with MapReduce but you also want to give your data analysts and data scientists the ability to process the data directly from HDFS with an interpreted high-level programming language like Python. Which format should you use to store this data in HDFS?
A. SequenceFiles
B. Avro
C. JSON
D. HTML
E. XML
F. CSV

Explanation: Using Hadoop Sequence Files
So what should we do in order to deal with a huge number of images? Use Hadoop sequence files! Those are map files that can inherently be read by MapReduce applications (there is an input format especially for sequence files) and are splittable by MapReduce, so we can have one huge file that will be the input of many map tasks. By using sequence files we let Hadoop play to its strengths: it can split the work into chunks so the processing is parallel, but the chunks are big enough that the process stays efficient. Since sequence files are map files, the desired format is for the key to be Text holding the HDFS filename and the value to be BytesWritable containing the image content of the file.

Reference: Hadoop binary files processing introduced by image duplicates finder

You want to run Hadoop jobs on your development workstation for testing before you submit them to your production cluster. Which mode of operation in Hadoop allows you to most closely simulate a production cluster while using a single machine?
A. Run all the nodes in your production cluster as virtual machines on your development workstation.
B. Run the hadoop command with the -jt local and the -fs file:/// options.
C. Run the DataNode, TaskTracker, NameNode and JobTracker daemons on a single machine.
D. Run simldooop, the Apache open-source software for simulating Hadoop clusters.

Explanation: Hosting on local VMs
As well as large-scale cloud infrastructures, there is another deployment pattern: local VMs on desktop systems or other development machines. This is a good tactic if your physical machines run Windows and you need to bring up a Linux system running Hadoop, and/or you want to simulate the complexity of a small Hadoop cluster.
* Have enough RAM for the VM to not swap.
* Don’t try to run more than one VM per physical host; it will only make things slower.
* Use file: URLs to access persistent input and output data.
* Consider making the default filesystem a file: URL so that all storage is really on the physical host. It’s often faster and preserves data better.

Your cluster’s HDFS block size is 64 MB. You have a directory containing 100 plain text files, each of which is 100 MB in size. The InputFormat for your job is TextInputFormat. Determine how many Mappers will run?
A. 64
B. 100
C. 200
D. 640

Explanation: Each file would be split into two as the block size (64 MB) is less than the file size (100 MB), so 200 mappers would be running.

Note: If you’re not compressing the files then Hadoop will process your large files (say 10 GB) with a number of mappers related to the block size of the file. Say your block size is 64 MB, then you will have ~160 mappers processing this 10 GB file (160 × 64 MB ≈ 10 GB). Depending on how CPU intensive your mapper logic is, this might be an acceptable block size, but if you find that your mappers are executing in sub-minute times, then you might want to increase the work done by each mapper (by increasing the block size to 128, 256 or 512 MB; the actual size depends on how you intend to process the data).

Reference: http://stackoverflow.com/questions/11014493/hadoop-mapreduce-appropriate-inputfiles-size (first answer, second paragraph)
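The mapper count in this question is a per-file ceiling division, which the following Python sketch reproduces. It is a simplification under stated assumptions: uncompressed, splittable files, one split per block, and a split size equal to the block size; estimate_mappers is a made-up helper, and the real input format applies a small tolerance to the final split that this sketch ignores.

```python
import math

def estimate_mappers(file_sizes_mb, block_size_mb):
    """Sum, over all files, the number of block-sized splits each file needs."""
    return sum(math.ceil(size / block_size_mb) for size in file_sizes_mb)

# 100 files of 100 MB each with a 64 MB block size.
print(estimate_mappers([100] * 100, 64))

# The 10 GB single-file case from the note, taking 10 GB as 10240 MB.
print(estimate_mappers([10 * 1024], 64))
```

Splits never span files, which is why the total is computed file by file rather than by dividing the combined 10,000 MB by the block size.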

What is a SequenceFile?
A. A SequenceFile contains a binary encoding of an arbitrary number of homogeneous writable objects.
B. A SequenceFile contains a binary encoding of an arbitrary number of heterogeneous writable objects.
C. A SequenceFile contains a binary encoding of an arbitrary number of WritableComparable objects, in sorted order.
D. A SequenceFile contains a binary encoding of an arbitrary number of key-value pairs. Each key must be the same type. Each value must be the same type.

Explanation: SequenceFile is a flat file consisting of binary key/value pairs. There are 3 different SequenceFile formats:
* Uncompressed key/value records.
* Record compressed key/value records: only ‘values’ are compressed here.
* Block compressed key/value records: both keys and values are collected in ‘blocks’ separately and compressed. The size of the ‘block’ is configurable.

Reference: http://wiki.apache.org/hadoop/SequenceFile

© 2013 Cloudera Exams. All Rights Reserved.
