Assignment Solutions
Dear Learning Community, welcome to the two exam weeks of our In-Memory Data Management online course. During the next two weeks you will be able to take the final exam. The exact timespan is from 07.10.13, 10 am (CEST, Germany time zone) to 21.10.13, 5 pm. There are no prerequisites (such as a point restriction) to take the exam. You have 100 minutes to solve the 50 questions of the exam. Please be advised that the exam can only be started ONCE (closing the browser does not stop the timer). The exam starts when you click the start button on the exam page, so only click the button if you are ready to take the assignment. Additionally, we advise you to re-check the answers you picked before you submit them. Most errors that occurred during the assignments were caused by accidental inputs through a touchpad or unwanted keyboard strokes.

The results of the exam will be published after week 8. Your final score is built up from the assignments as well as the final exam. You have to reach at least 50% of the total reachable 198 points to get the certification (99 points and up). The total reachable score is composed of 99 points from the assignments and 99 points from the exam (which means the ratio between assignments and exam is exactly 50:50).

Feedback Request

We enjoyed doing the second iteration of this course together with you. We had many deep and interesting discussions with you that also gave us new perspectives on the content. On behalf of all future learners, we thank you for the many suggestions as well as the hints to rework certain examples, which improved the slides and the reading material. To sum up, it was again a great learning opportunity not just for you but also for us. To further improve in the future, we rely on your insights, thoughts, and feedback. For this, we kindly ask you to fill out our feedback survey after taking the exam. The survey will take about 5 to 10 minutes to complete.

Assignment Solutions

As a help to prepare for the exam, we added the assignment solutions for all closed assignments to the respective weeks. For your convenience, please see the collection of links below:

Assignment Solutions Week 1
Assignment Solutions Week 2
Assignment Solutions Week 3
Assignment Solutions Week 4
Assignment Solutions Week 5
Dear learning community, as many of you asked for more details on the correct answers for the assignment of week 1, we wrote the following explanations, which we gladly share with you. If there are any questions left, please do not hesitate to ask in the forum. Best regards, the IMDB Teaching Team
______________________________________________________ Question Name: "Transactional Data for OLAP" Text: "Using in-memory databases and transactional data for the Available to Promise (ATP) check has many advantages. Which of the following is NOT an advantage, when we do the ATP check on transactional data?" Correct answer:
"Aggregating on the fly is faster than reading the aggregated sum"
Incorrect answers:
"Programming is simplified, because data changes do not have to be reflected in aggregates anymore" "Calculations can be done more fine grained and based on real-time facts because data is not aggregated to monthly, daily, etc. sums" "Time consuming ETL processes can be abolished"
Explanation: "Reading pre-aggregated values is always the fastest way to retrieve information, because the actual work was done beforehand. Therefore, aggregating on the fly is not faster than reading a preaggregated value and this is NOT an advantage. The other points clearly are." ______________________________________________________ Question Name: "OLTP OLAP Separation Drawbacks" Text: "Why was the separation of OLAP and OLTP a bad idea? The separation should be reverted, because ..." Correct answer:
"latency between the separated storages prohibits up-to-data analytics"
Incorrect answers:
"the creation of materialized views and pre-computed aggregates negatively impacts OLTP performance" "query execution time is increased if we use two systems" "analytical queries are not used any longer"
Explanation: "The separation causes a latency between the transactional and the analytical data, because the analytical data has to be transformed and pre-optimized via an ETL (Extract-Transform-Load) process, which is time intensive and is therefore often only run once a night. The all in all query execution time is not increased on the OLTP system at all, in best case it decreases on the OLTP system because some workload is shifted to the OLAP system. Of course, during the update of the OLAP data some load is caused on the OLTP system, but since only new data is retrieved and this is done in periods of low usage, the overall performance was increased in general, at the cost of increased complexity and the inability to do analytical queries on up to date data."
______________________________________________________ Question Name: "Enterprise Data Characteristics" Text: "Which characteristic does enterprise data NOT have?" Correct answer:
"High entropy in many columns"
Incorrect answers:
"NULL and default values are dominant in many columns" "Large number of columns (attributes)" "Very low entropy in many columns"
Explanation: "Various analyses of different enterprise systems from actual customers showed that most tables are \"sparse and wide\". Many columns are not even used. Furthermore, the columns are often dominated by default or NULL values. So there is a large number of columns, with very low entropy. The correct answer is therefore, that high entropy in columns is a characteristic that enterprise systems usually do NOT have." ______________________________________________________ Question Name: "NUMA" Text: "What is NUMA (non uniform memory access)?" Correct answer:
"In multi-core setups, it means that each processor can access the local memory of all other processors"
Incorrect answers:
"It means that data types can have variable size now and do not need to be uniform any longer" "It is a standard that describes the physical structure of memory chips to fit in server-blades"
Explanation: "In NUMA systems, each processor has its own part of main memory that can be accessed very fast. Data, which is not in that local storage of a processor, has to be requested from non-local storage, i.e. another processor's local memory. In that implementation, all processors share the same adress space which simplifies memory management." ______________________________________________________ Question Name: "Moore's Law" Text: "What does Moore’s law state?" Correct answer:
"The number of transistors on integrated circuits doubles every 18 to 24 months"
Incorrect answers:
"In enterprise computing it is forbidden to run software on CPUs that are older than 2 years" "Enterprise data is not needed any longer after 2 years and is erased due to data privacy protection" "Computers tend to break by a mean time of 36 months" "Clock speed doubles every 18 to 24 months"
Explanation: "In 1965, Gordon E. Moore stated that the number of components per integrated circuit is doubling approximately every two years. The other answers are just made up." ______________________________________________________ Question Name: "Architecture Name" Text: "What is the name of the architecture in which a processor can access the cache of other processors?" Correct answer:
"Non-Uniform Memory Access (NUMA)"
Incorrect answers:
"Symmetric Multiprocessing (SMP)" "Shared Nothing Architecture (SNA)"
Explanation: "In Shared Nothing architectures, every node is independent and does not share any memory with other processors. SMP systems have a shared main memory, but their chaches remain nonaccessible to other processors. NUMA allows to access other nodes' local memory, therefore it is the correct answer." ______________________________________________________ Question Name: "NUMA and Cache Coherency" Text: "Which statement concerning NUMA and cache coherency is correct?" Correct answer:
"Most currently sold NUMA realizations come with special-purpose hardware to maintain cache coherency"
Incorrect answers:
"Most NUMA realizations currently in the market use software layers to maintain cache coherency" "Every program gains a huge performance boost from NUMA; no adaption of the software is needed to fully exploit the potential" "Cache coherency is no longer a concern when using NUMA architectures, since NUMA does not use caches at all"
Explanation: "As stated in the reading material, non ccNUMA hardware is practically non existent, because it is harder to program. Therefore, the terms NUMA and ccNUMA are usually used identically." ______________________________________________________ Question Name: "Disk Still Required" Text: "Why is disk usage (for example SSDs) still required for SanssouciDB?" Correct answer:
"Because logging and snapshotting need non-volatile storage"
Incorrect answers:
"Because non-volatile storage is needed for the hashing phase in the join operation" "Because main memory is still slower than SSDs in some use cases" "Because some database operations can only be processed on disk"
Explanation: "DRAM is a volatile storage medium, meaning that it looses it's state and therefore it's information when the power supply is interrupted. For that reason, it is necessary to persist the information on another storage medium that is non-volatile, like SSDs or HDDs. Hardware producers are currently researching on non-volatile RAM like MRAM or PCRAM, that might make this obsolote in the future, but can not compete in means of capacity and price with proven solutions today." ______________________________________________________ Question Name: "Scan Duration" Text: "How long does it take a quad core CPU to scan 2 columns with 10 million entries each and 50 byte per attribute in both columns? The assumed scan speed is 2 MB/ms per core." Correct answer:
"125 ms"
Incorrect answers:
"50 ms" "250 ms" "1 s"
Explanation: "2 columns with 10,000,000 entries with 50 bytes each result in a overall size of 1,000,000,000 bytes. 4 cores with a scan speed of 2 MB/ms each run through this amount of data in 1,000,000,000 B / (4 * 2,000,000 B/ms) = 125 ms." Dear Learning Community, just as for week 1, we wrote some explanations for the assignment questions of week 2 that we want to share with you.
Dear Learning Community, just as for week 1, we wrote some explanations for the assignment questions of week 2 that we want to share with you. Best regards, your IMDB Teaching Team ______________________________________________________ Question Name: "Technical Building Blocks of Dictionary Encoding" Text: "What data structures are required for dictionary encoding?" Correct answer:
"attribute vector and dictionary"
Incorrect answers:
"attribute vector and index" "attribute vector and foreign key" "index and dictionary"
Explanation: "The attribute vector holds references to the dictionary, where the actual values are stored. So, these two structures are sufficient for dictionary encoding." ______________________________________________________ Question Name: "Bit Representation of Countries" Text: "What is the minimum number of bits required to represent all countries (about 200) in the world?" Correct answer:
"8"
Incorrect answers:
"6" "7" "9"
Explanation: "The logarithm of the number of values to encode to the base to is log_2(200) = 7,64, so we have to choose the next integer above 7, which is 8, as the number of bits needed." ______________________________________________________ Question Name: "Compression Example Run Length Encoding Assignment" Text: "Suppose there is a table where all 14 million inhabitants of Blueland are assigned to their cities. The table is sorted by city. Blueland consists of about 3,000 cities (represented by 12 bit). Further assume that inhabitants are uniformly distributed across cities. Using Run Length Encoding with a start position vector, what is the size of the compressed city vector? Always use the minimal number of bits required for any of the values you have to choose and include all needed auxiliary structures. Further assume the following conversions: 1MB = 1,000 KB, 1KB = 1,000B"
______________________________________________________ Question Name: "Compression Example Run Length Encoding Assignment" Text: "Suppose there is a table where all 14 million inhabitants of Blueland are assigned to their cities. The table is sorted by city. Blueland consists of about 3,000 cities (represented by 12 bit). Further assume that inhabitants are uniformly distributed across cities. Using Run Length Encoding with a start position vector, what is the size of the compressed city vector? Always use the minimal number of bits required for any of the values you have to choose and include all needed auxiliary structures. Further assume the following conversions: 1MB = 1,000 KB, 1KB = 1,000B"
Correct answer:
"13.5 kB"
Incorrect answers:
"6 kB" "20.5 kB" "2 MB"
Explanation: "We have to compute the size of a) the value array and b) the size of the start position array. The size of a) is the distinct number of cities (3,000) times the size of each field of the value array (log_2(3,000)). The size of b) is the number of entries in the dictionary (3,000) times the number of bits required to encode the highest possible number of inhabitants (log_2(14,000,000)). The total result is thus 12 bit times 3,000 (36,000) plus 24 bit times 3,000 (72,000), thus 108,000 bits (or 13.5 kB) in total." ______________________________________________________ Question Name: "Suitable Use Cases for Column Layout" Text: "A columnar layout is well suited to ..." Correct answer:
"... process sets and do full column scans"
Incorrect answers:
"... transform data" "... handle insert operations" "... materialize full tuples"
Explanation: "A columnar layout is especially suited to do set operations and full column scans. The complexity of data transformation is not influenced by the chosen layout. Insert operations are more cumbersome in a columnar layout than in a row layout, since we have to distribute the values of a tuple over different columns and therefore place them in different memory regions. On top of that come auxiliary structures like dictionaries, which have to be kept up to date. These migth also be used on row layouts, so the layout is not the influencing part concerning to that. But in general, row layouts are better suited for inserts than columnar layouts. For the same reasons is tuple reconstruction easier on row layouts than on columnar layouts." ______________________________________________________ Question Name: "Suitable partitoning strategy" Text: "Assume a table with customer data. Three different units in a company do support for the customers but also have to provide analytics about the customers they handle. The table is distributed over several servers, each unit has a server. What partitioning type is suited best if the company decided that the first organizational unit should handle customers with last names A-G, the second unit customers with last names H - R, and the last unit customers with last names S- Z?" Correct answer:
"range partitioning"
Incorrect answers:
"hash partitioning" "round robin partitioning"
Explanation: "Based on the decision to assign specifc ranges of names to specific organizational units, like the word says, range partitioning is most suited. Every organisational unit can have their primary data close to them, if we have a distributed server landscape. Hash partitioning could not guarantee the best locality. Of course, one could choose the first letter as the partitioning key, but at a closer look, this is range partitioning in disguise again. Round robin partitioning would clearly prohibit the locality advantage, since it would distribute the entries fairly over all servers, regardless of the actual information in the entries." ______________________________________________________ Question Name: "Delete implementation for hospital" Text: "Assume you have to setup a new database for a hospital which allows the hospital staff to keep track of all their patient records. Which delete implementation should be prefered for that use case?" Correct answer:
"Logical delete"
Incorrect answers:
"Doesn’t matter" "Physical delete" "Depends on the number of patients"
Explanation: "Hospitals typically have very strict regulations for keeping patient data, regardless of their number of patients. They need to be able to look at patient histories for multiple years. When using only logical delete, the data is still available for queries concerning the past. A physical delete would not allow this." ______________________________________________________ Question Name: "New value in dictionary" Text: "Given the following entities: Old dictionary: chair, drawer, monitor, table Old attribute vector: 0, 3, 0, 1, 2, 3, 3 Value to be inserted: lamp What valueID is the monitor mapped to in the new attribute vector?" Correct answer:
"3"
Incorrect answers:
"1" "2" "4"
Explanation: "The word monitor used to have the valueID 2. After inserting the value lamp and resorting the dictionary, the monitor will have the valueID 3, since the correct alphabethical oder is: chair (0), drawer (1), lamp (2), monitor (3), table (4). The updated attribute vector is 0, 4, 0, 1, 3, 4, 4, 2" ______________________________________________________ Question Name: "Dictionary resorting" Text: "Assume the following dictionaries: Brand dictionary: Audi, BMW, Mercedes-Benz, Opel, Porsche, Volkswagen Year dictionary: 2007, 2008, 2009, 2010, 2011 Model dictionary: 3 Series, 5 Series, 7 Series, 911, A3, A4, A8, A class, Boxter, C class, E class, Golf, Jetta, Passat And the following insert: INSERT INTO cars VALUES(‘Audi’, ‘2012’, ‘A7’); How many dictionaries need to be resorted?" Correct answer:
"1"
Incorrect answers:
"None" "2" "All"
Explanation: "‘Audi’ is already present in the brands dictionary. The year ‘2012’ simply gets appended to the year dictionary, since it is the last value with respect to existing the sorting order. When appending ‘A7’ to the model dictionary, it needs to be resorted." ______________________________________________________ Question Name: "Dictionary reordering after updates" Text: "Consider the world population table (first name, last name) that includes all people in the world: Angela Mueller marries Friedrich Schulze and becomes Angela Schulze. Should the dictionary for the last name column be reordered?" Correct answer:
"No, because the value ‘Schulze’ is already in dictionary"
Incorrect answers:
"No, because ‘Schulze’ > ‘Mueller’ when compared lexicographically" "Yes, because ‘Schulze’ is a new last name of Angela"
Explanation: "Mr. Friedrich Schulze is already in this table. Therefore ‘Schulze’ is in the last-name dictionary and its key can be taken to update the last name of Ms. Mueller (now Mrs. Schulze)" ______________________________________________________ Question Name: "Tuple reconstruction performance factors" Text: "The number of attributes of the reconstructed tuple is an important factor that can influence the performance of the operation on the column layout. Which of the following is the right explanation for this behavior?" Correct answer:
"A new cache line must be read for every attribute of the tuple and thus the number of the processed bytes will increase"
Incorrect answers:
"The size of the reconstructed tuple increases" "There is a risk that the size of a whole tuple exceeds the size of a cache line and therefore it cannot be read in one cache access" "The number of attributes is not an important factor for the tuple reconstruction in a column layout, but the size of the tuple is the key factor, because the data is stored tuple-wise"
Explanation: "In a columnar layout, each attribute of a tuple has to be retrieved via an own cache access, because the attributes of one tuple are located faw away from each other in memory. So the number of bytes to be read increases with every attribute of the reconstructed tuple and this answer is correct. The size of the reconstructed tuple is not increased, this size is fixed for a given tuple. Of course, with every additional attribute, the whole tuple gets bigger, but the important factor that we want to describe within this question is that we need an additional cache access for every attribute, regardless how small it is. The risk, that a whole tuple could not fit into one cache line is also not of interest, since this, as described above, clearly does not reflect the retrieval behavior. Last but not least, the option that the number of attributes is not an important factor but the size of the tuple is the key factor is wrong, because it has the false addition \"because the data is stored tuple-wise\", which is not the case in a columnar layout." ______________________________________________________ Question Name: "Scan performance on column and row layout: table scan" Text: "Given is a table with the following characteristics containing information about all customers in Germany: - columns: CustomerId, Customer Name, City, Street, Status, Sector, Category; - size per field (uncompressed): 28 byte; - number of rows: 500,000; - cardinality of the city column: 12,200. A user wants to know \"How many customers do we have in Berlin?\" How long will this query take?" Correct answers:
"row store: 49 milliseconds" "row store with stride access: 16 milliseconds"
Incorrect answers:
"row store: 24.5 milliseconds" "row store with stride access: 8 milliseconds" "column store with dictionary compression: 5.5 milliseconds" "column store with dictionary compression: 3.5 milliseconds"
We had a little glitch here: this question had 2 correct answers in it. Everybody who chose one of the two correct answers got full points; these changes have already been reflected in your results. (Update was done on 18.09.2013) Explanation: "The time needed in a row store without stride access is (7 attributes * 28 byte per field per row * 500,000 rows) / (2 MB/ms per core) = 7 * 28 * 500,000 / 2,000,000 ms = 49 ms.
The time needed in a row store with stride access is (size of a cache line * number of accesses) / scan speed = 64 byte * 500,000 / 2,000,000 byte/ms = 16 ms.
The time needed in a column store with dictionary compression is (number of bits needed to encode the attribute / 8) * number of values / scan speed = (14 bit / 8 bit/byte) * 500,000 / 2,000,000 byte/ms = 0.4375 ms. So the answers for the row store with and without stride access are correct here."
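As a cross-check, the three figures as a small Python sketch (all inputs are taken from the question; the variable names are ours):

    # Scan times for the "customers in Berlin" query
    rows = 500_000
    attributes = 7
    bytes_per_field = 28
    cache_line = 64                      # byte, relevant for stride access
    bits_city = 14                       # ceil(log2(12,200)) for the city dictionary
    scan_speed = 2_000_000               # byte per ms per core

    row_store = attributes * bytes_per_field * rows / scan_speed   # 49.0 ms
    row_store_stride = cache_line * rows / scan_speed              # 16.0 ms
    column_store = (bits_city / 8) * rows / scan_speed             # 0.4375 ms

    print(row_store, row_store_stride, column_store)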
Question Name: "Execution Plans" Text: "For any SELECT statement..." Correct answer:
"several execution plans with the same result set, but differing performance may exist"
Incorrect answers:
"exactly one execution plan exists" "several executions plans may exist that deliver differing result sets" "there always exist exactly two execution plans, which perform identically in each aspect"
Explanation: "For any SELECT statement, several execution plans with the same result set, but different runtimes may exist. As an example, we want to query all men living in Italy from world population table;
the database offers three different execution plans. We could query for the gender 'male' first and then for the country 'Italy' in the result set or we start with the selection for 'Italy' and then we narrow the result to only males, or we might perform the two selections on 'male' and 'Italy' in parallel queries, both running on the full dataset and then create the intersection. All three execution plans create the same result set, but require different runtimes. For example its faster to query first for 'Italy'and then for 'male', because in this case first 8 billion entries (all entries) and then further select on the resulting 60 million entries (all Italiens), if you start with 'male' and then query for 'Italy' you have to scan through 8 billion (all Italiens) and then through 4 billion entries (all males)."
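A quick back-of-the-envelope comparison of the two orderings, using the figures from the explanation (a sketch, not measured numbers):

    # Entries touched by the two plan orderings discussed above
    world = 8_000_000_000      # all entries
    italians = 60_000_000      # result of the 'Italy' selection
    males = 4_000_000_000      # result of the 'male' selection

    italy_first = world + italians    # ~8.06 billion entries scanned
    male_first = world + males        # 12 billion entries scanned
    print(italy_first < male_first)   # True: selecting on 'Italy' first is cheaper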
Question Name: "Scan Speed" Text: "What is the assumed scan speed per core?" Correct answer:
"2 MB per ms"
Incorrect answers:
"1 MB per ms" "2 PB per ms" "100 kB per ms"
Explanation: "Nothing to explain here, it is the assumed speed under normal conditions. For given SQL queries, the measured speeds will vary, since not all operations are purely scans and the scan has often to be interrupted."
Question Name: "Early Materialization" Text: "What is early materialization?" Correct answer:
"A processing strategy, where valueIDs are decoded into actual values at the earliest time during processing"
Incorrect answers:
"An advanced aggregation strategy compared to basic ones like SUM" "A strategy, which works on compressed data as long as possible"
Explanation: "Early materialization, in contrast to late materialization, decodes valueIDs into the actual values at the earliest time during processing. The other answers are wrong, for an analog description please have a look on the explanation of the question \"Late Materialization\"."
Question Name: "Faster Materialization Strategy" Text: "Assuming the execution of the question which queries the complete data of all people in the world whose first name is NOT Jean-Pierre. Which of the following statements is true?"
Correct answer:
"Both strategies will perform poorly, as the predicate selects a lot of values, i.e., the costs are bound by the dictionary lookups to materialize the result"
Incorrect answers:
"Late materialization should be faster, since the predicate can be evaluated solely by using the dictionary of the column fname." "Early materialization should be faster, since the result will contain less returned rows."
Explanation: "The reason for this question is, that I think it's important for people to be able to estimate result sizes, since especially column store can suffer (see SELECT *). Here they should be able to see, that this query is painfull and any strategy will perform poorly. "
Question Name: "Late Materialization" Text: "What is late materialization?" Correct answer:
"A processing strategy, which restores the requested tuple at the latest possible point during processing"
Incorrect answers:
"An advanced aggregation strategy compared to basic ones like SUM" "A strategy that works on uncompressed data as long as possible" "Long batch operations that are run over night in big enterprises"
Explanation: "Late materialization, in contrast to early materialization, is a processing strategy, which aims at reconstructing the actual attribute values from the valueIDs at the latest possible point during processing. Working with the compressed integer values leads to speed advantages in most cases. Depending on the circumstances, there are also constellations where early materialization is favorable, however these situations are seldom. In general, late materialization is tried to keep up as long as possible. The other answers are wrong. There is no way to determine or even a special name for complex or advanced aggregation strategies. To work on uncompressed data as long as possible is like the correct answer, just twisted. Because data is saved in a compressed format, it is not the case that we work on uncompressed data first and than compress it afterwards. The last wrong answer, that late materialization describes long batch operations run over night is just messing with the word \"late\". These scheduled batch jobs that are run over night have no distinguished term we know of, sometimes they are just called \"over-night batch jobs\"."
Question Name: "Querying the Differential Buffer" Text: "Correctly complete the following sentence: Whereas write accesses are going against the differential buffer, ..." Correct answer:
"read accesses are going against the main store and the differential buffer"
Incorrect answers:
"read accesses are going against the main store" "read accesses against the differential buffer are denied" "read access are cached in a row-oriented format"
Explanation:"Read accesses have to go against the main store and the differential buffer, in order retrieve the newest valid entries. Just querying the main store would return potentially outdated information or lack completely new entries, just querying the differential buffer would not suffice since it holds only a fraction of the total data. Read accesses are not cached for any reason, regardless of the format."
Question Name: "Statements Concerning the Differential Buffer" Text: "Which statement related to the differential buffer is true?" Correct answer:
"Since the dictionary of the differential buffer is unsorted, range selects on the differential buffer are less efficient than on the main store"
Incorrect answers:
"Tuples in the differential buffer require less memory, because the advanced compression techniques used there result in better compression than in the main store" "The differential buffer is read-optimized" "The differential buffer should not exceed the size of a cache line (64 byte) for performance reasons"
Explanation:"The differential buffer has an unsorted dictionary to improve the write performance. Therefore it is not read-optimized. Tuples in the differential buffer require at least as much memory than the associated tuples in the main store, since no additional compression is employed. The size of a cache line is certainly not enough for the differential buffer. The correct answer is therefore, that range selects are less efficient in the differential buffer, because the dictionary is not sorted. Even with a CSB+ tree, which allows fast access on the unsorted values, we can not determine whether a value is in the desired range by just checking for example whether it is >=10 and