Introduction to Netezza
Introduction to Netezza Bank of America
Topics
• Netezza Architecture
• SQL Differences
• Stored Procedure Differences
Know Your Customers. Grow Your Business.
Introduction to Netezza –| Prepared for Bank of America Ipsos Loyalty
© 2011 Aginity LLC
Ipsos, January 2007
Netezza Architecture – TwinFin 12
Topics
• System Architecture
• Snippet Processing Unit (SPU)
• System Capacity
• Distribution in a shared-nothing architecture
• Zone maps & data organization
Database Hardware Architecture Terminology
• Symmetric Multiprocessing Architecture (SMP)
  • Multiple processors sharing access to disk and memory
  • Processors operate asynchronously
  • Examples: Sybase, Sybase IQ, Oracle, SQL Server, DB2
• Massively Parallel Processing Architecture (MPP)
  • Multiple processors, each with separate, dedicated memory and disk
  • No hardware is shared between processors
  • Processors are slaved to a controller
  • Processors operate synchronously
  • Implemented many different ways
  • Netezza approach is unique
  • Examples: Netezza, Teradata, Greenplum, Aster, Vertica
Netezza Architecture
• SQL command is sent from the user to the host
• Host compiles SQL, develops execution plan and sends code snippets to SPUs based on plan
• SPU executes code snippet. All SPUs execute the same code synchronously.
• SPUs pass data between each other as needed
• Host collects and returns result sets
• Result is sent from the host to the user
Snippet Processing Unit (SPU)
• Each SPU has a dedicated 1TB disk
• Disk controlled by FPGA (Field Programmable Gate Array)
• CPU loads query code into FPGA
• FPGA executes code using memory and cache
• CPU performs additional processing on the result set
• CPU communicates with the controller and other SPUs via NIC
System Capacity – TwinFin 12
• 12 S-Blades, each with 8 SPUs
• 4 SPUs are spares for failover – 92 usable SPUs
• Each SPU has a 1TB disk drive
• Disk divided into three 330GB partitions:
  • Active data area
  • Workspace
  • Redundant copy of another SPU’s active data area
• Nominal 330GB capacity per SPU (uncompressed)
• Nominal 30TB capacity per system (uncompressed)
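The TwinFin 12 capacity figures follow from simple arithmetic. A minimal Python sketch, using only the numbers on the capacity slide; the variable names are illustrative, and 1 TB is assumed to equal 1,000 GB:

```python
# Figures taken from the TwinFin 12 capacity slide; names are illustrative.
S_BLADES = 12
SPUS_PER_BLADE = 8
SPARE_SPUS = 4
ACTIVE_PARTITION_GB = 330  # one of the three partitions on each 1TB disk

total_spus = S_BLADES * SPUS_PER_BLADE   # all SPUs in the system
usable_spus = total_spus - SPARE_SPUS    # spares are held back for failover

# Assumption: 1 TB = 1,000 GB.
nominal_capacity_tb = usable_spus * ACTIVE_PARTITION_GB / 1000

print(f"{usable_spus} usable SPUs, about {nominal_capacity_tb:.1f} TB uncompressed")
```

The result lands slightly above 30 TB, which the slide rounds to a nominal 30TB.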
MPP Shared Nothing Architecture
• A SPU only has visibility to the data on its own disk.
• A SPU will broadcast data to other SPUs as needed (based on instructions from the controller) via the NIC.
• Broadcast data is received and held in memory or work area by the SPU for use in resolving the query.
• Each SPU operates synchronously with all other SPUs. Each executes the same snippet at the same time.
• The controller’s execution plan ensures minimal data transfer between SPUs.
Data Distribution
• Data distribution to a SPU is defined by the DISTRIBUTE ON clause of the CREATE TABLE command
• Each table is distributed across all 92 SPUs
• You can distribute on RANDOM or specify one or more columns in the table as a distribution key
  • Controller uses distribution key information when deciding the execution plan
  • Joins between tables with the same distribution key that are joined using that key will execute entirely on each SPU without the need to broadcast data (table co-location)
  • Joins to tables with RANDOM distribution will require data broadcast. The smaller of the two sets is broadcast
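A minimal Python sketch of why two tables distributed on the same key can be joined without broadcasting data. The hash (zlib.crc32) and the sample rows are stand-ins chosen for illustration; the slides do not describe Netezza's actual hashing scheme, and the names below are hypothetical.

```python
import zlib
from collections import defaultdict

NUM_SPUS = 92  # usable SPUs on the TwinFin 12

def spu_for_key(key: int) -> int:
    """Map a distribution-key value to an SPU (stand-in hash, not Netezza's)."""
    return zlib.crc32(str(key).encode("ascii")) % NUM_SPUS

def distribute(rows, key_index):
    """Group rows by the SPU their distribution key hashes to."""
    slices = defaultdict(list)
    for row in rows:
        slices[spu_for_key(row[key_index])].append(row)
    return slices

# Hypothetical sample data: (id_ts, ticker) and (id_ts, value).
names = [(i, f"TICK{i}") for i in range(1, 501)]
values = [(i % 500 + 1, float(i)) for i in range(5000)]

names_by_spu = distribute(names, 0)
values_by_spu = distribute(values, 0)

# Each SPU joins only the rows it already holds; nothing is broadcast.
local_join = []
for spu, local_values in values_by_spu.items():
    lookup = dict(names_by_spu.get(spu, []))
    for id_ts, value in local_values:
        local_join.append((id_ts, lookup[id_ts], value))

# Reference result computed with every row visible in one place.
global_lookup = dict(names)
global_join = [(id_ts, global_lookup[id_ts], value) for id_ts, value in values]

print(sorted(local_join) == sorted(global_join))
```

Because both tables hash the same key with the same function, every matching pair of rows lands on the same SPU, so the per-SPU joins together produce the full result.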
Effects of Poor Data Distribution
• System Capacity
  • The system is ‘full’ when any one SPU is full
  • A badly skewed distribution of a large table may fill up a SPU prematurely
  • Try to keep skew under 10% (5% off average) for very large tables
• System Performance
  • SQL code snippets are executed synchronously among all SPUs
  • The total execution time for a snippet is the longest time among the SPUs
  • The total execution time for a query is the sum of the longest times for each snippet
  • Snippet execution time is proportional to the amount of data a SPU needs to process
  • A query will only run as fast as the slowest SPU
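The capacity and performance rules above can be sketched in a few lines of Python. This is an illustration under stated assumptions: skew is taken here as the spread between the fullest and emptiest SPU relative to the average (one reading of "10% (5% off average)"), and the row counts and timings are made-up sample numbers.

```python
def skew(rows_per_spu):
    """Spread between fullest and emptiest SPU, relative to the average (assumed definition)."""
    average = sum(rows_per_spu) / len(rows_per_spu)
    return (max(rows_per_spu) - min(rows_per_spu)) / average

def query_time(snippet_times):
    """snippet_times[i][j] is the time SPU j spends on snippet i.

    Each snippet finishes when its slowest SPU finishes, and the
    query time is the sum of those per-snippet maxima.
    """
    return sum(max(per_spu) for per_spu in snippet_times)

# Hypothetical row counts on four SPUs.
balanced = [100, 105, 95, 100]
skewed = [100, 100, 100, 300]

# Hypothetical per-SPU timings (milliseconds) for a two-snippet query.
timings_ms = [
    [10, 12, 9],
    [20, 20, 35],
]

print(f"balanced skew: {skew(balanced):.0%}")
print(f"skewed skew:   {skew(skewed):.0%}")
print(f"query time:    {query_time(timings_ms)} ms")
```

In the timing example the third SPU alone determines the length of the second snippet, which is the sense in which a query runs only as fast as the slowest SPU.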
Data Distribution – Co-Location
• Co-Location is the physical placement of related data across multiple tables onto the same SPU
  • Allows a SPU to join co-located data without interaction with other SPUs
  • Speeds queries by eliminating data broadcast steps
• Tables must have identical distribution key columns
  • RANDOM distribution will not co-locate
  • Any difference in columns will result in a different, unrelated SPU assignment
• Join expression must include all distribution key columns
  • Query compiler cannot assume co-location unless the join between two co-located tables includes equi-joins between all columns specified in the distribution key
  • Ensure use by defining as few columns as possible as a distribution key
• Don’t bother with small tables
  • No real advantage co-locating small tables
Data Distribution - Example
CREATE TABLE TSVALUES (
    ID_TSDATE INTEGER NOT NULL,
    ID_TS INTEGER NOT NULL,
    VALUE DOUBLE PRECISION NOT NULL,
    ID_SRC INTEGER NOT NULL,
    ID_STATUS INTEGER NOT NULL,
    TS_BEGIN TIMESTAMP NOT NULL,
    TS_END TIMESTAMP NOT NULL,
    SIGMA DOUBLE PRECISION NOT NULL
)
DISTRIBUTE ON (ID_TS)
ORGANIZE ON (TS_END, TS_BEGIN);

• All 3 tables share ID_TS as a common key
• ID_TS is always used when joining these tables
• ID_TS is the appropriate column to use as a distribution key
Data Distribution – Example
• Aginity Workbench
  • Provides ability to redistribute existing tables
  • Allows you to view a table’s distribution
• [Figure: distribution display for TSValues]
Co-Located Queries
SELECT N.ID_TS, N.TICKER, N.NAME, V.VALUE
FROM TSNAMES N, TSVALUES V
WHERE N.ID_TS = V.ID_TS
  AND V.TS_BEGIN BETWEEN N.TS_BEGIN AND N.TS_END
  AND V.TS_END = '1/1/3000';

• The query will execute in parallel across all SPUs without data sharing because:
  • Both tables are distributed on the same key (ID_TS)
  • ID_TS is used to join the tables
• Identical distribution key definitions across tables ensure rows with the same distribution key values in those tables reside on the same disk and SPU.
• Run times (493M row result set):
  • Co-located: 22.1 seconds
  • TSNames distributed on RANDOM: 74.3 seconds
Distribution Key Advice
• Keep distribution key as small as possible
  • One column is best
  • Ensure full use in joins
  • Tables with some keys in common will not result in co-location
• Use a key that provides even distribution
  • Avoid skews > 10% in very large tables (> 5% nominal system capacity)
  • Avoid skews > 5% in multiple same key tables with total size > 5% nominal system capacity
  • Avoid skews > 20% in other tables
  • May be necessary to add a column to the distribution key to reduce skew
  • Don’t worry about skew in small tables (< 0.1% nominal system capacity)
• Poor Distribution Keys
  • Dates – cause query ‘hotspots’ when queries are based on date range
  • Low cardinality columns
    • Never use a column with cardinality less than the number of SPUs
    • Columns with cardinality < 20 x number of SPUs may produce very bad skews
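The warning about low cardinality keys can be illustrated with a small Python simulation. It is a sketch only: zlib.crc32 stands in for Netezza's hash, which the slides do not describe, and the key values are invented sample data.

```python
import zlib
from collections import Counter

NUM_SPUS = 92  # usable SPUs on the TwinFin 12

def spu_for_key(key) -> int:
    """Stand-in hash; Netezza's real distribution function is not shown in the slides."""
    return zlib.crc32(str(key).encode("utf-8")) % NUM_SPUS

def rows_per_spu(keys):
    """Count how many rows each SPU receives for a list of key values."""
    counts = Counter(spu_for_key(k) for k in keys)
    return [counts.get(spu, 0) for spu in range(NUM_SPUS)]

# Hypothetical tables of 92,000 rows each.
low_cardinality = [i % 10 for i in range(92_000)]   # only 10 distinct key values
high_cardinality = list(range(92_000))              # every key value distinct

low = rows_per_spu(low_cardinality)
high = rows_per_spu(high_cardinality)
idle_spus_low = low.count(0)

print(f"low cardinality:  {idle_spus_low} idle SPUs, busiest holds {max(low)} rows")
print(f"high cardinality: {high.count(0)} idle SPUs, busiest holds {max(high)} rows")
```

With 10 distinct key values at most 10 SPUs can ever receive rows, so at least 82 of the 92 sit idle no matter which hash is used.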
Zone Maps
• A zone map records the high and low values of columns for rows in a data block on disk
• Netezza uses zone maps to skip data blocks that do not satisfy query predicates, speeding query execution
• Zone maps are automatically created for Integer, Date and Timestamp columns
• You can specify additional zone map columns in the ORGANIZE ON clause of the CREATE TABLE statement
ORGANIZE ON Clause
• Defines a Clustered Base Table
• Data is arranged (sorted) by the ORGANIZE ON columns during GROOM
• Improves effectiveness of zone maps
• Specify up to 4 columns
• Allowable data types:
  • INTEGER, DATE, TIMESTAMP
  • CHAR, VARCHAR, NCHAR, NVARCHAR – Only first 8 bytes used
  • NUMERIC – up to NUMERIC(18)
  • FLOAT, DOUBLE
  • BOOL
  • TIME, TIME with Time zone
  • INTERVAL
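A minimal Python sketch of how per-block high/low values let a scan skip blocks, and why arranging rows by the filtered column makes that skipping more effective. The block size, data, and function names are hypothetical; this models the idea only, not Netezza's on-disk format.

```python
from dataclasses import dataclass

@dataclass
class Block:
    rows: list
    low: int    # lowest value stored in the block
    high: int   # highest value stored in the block

def build_blocks(values, rows_per_block):
    """Split values into fixed-size blocks and record each block's low/high."""
    blocks = []
    for start in range(0, len(values), rows_per_block):
        chunk = values[start:start + rows_per_block]
        blocks.append(Block(chunk, min(chunk), max(chunk)))
    return blocks

def scan_equal(blocks, target):
    """Return matching rows and the number of blocks actually read."""
    hits, blocks_read = [], 0
    for block in blocks:
        if target < block.low or target > block.high:
            continue  # zone map proves the block cannot contain the target
        blocks_read += 1
        hits.extend(v for v in block.rows if v == target)
    return hits, blocks_read

# Hypothetical column: 1,000 rows with values 0..9 interleaved.
unorganized = [i % 10 for i in range(1000)]
organized = sorted(unorganized)  # rows arranged by the column

hits_u, read_u = scan_equal(build_blocks(unorganized, 100), 3)
hits_o, read_o = scan_equal(build_blocks(organized, 100), 3)

print(f"unorganized: {len(hits_u)} hits, {read_u} blocks read")
print(f"organized:   {len(hits_o)} hits, {read_o} blocks read")
```

Both scans return the same rows, but the organized layout reads far fewer blocks because each block covers a narrow range of values.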
Using ORGANIZE ON
• Select columns most likely used in WHERE clause
• Zone map efficiency diminishes for 3rd and 4th columns
  • Greater chance of a wide range of values appearing in a data block
• Arrange lower cardinality or “clumpier” columns first
  • Groups larger numbers of rows first, spread over more data blocks
  • Increases likelihood subsequent columns spread over multiple blocks
  • “Clumpy” – large portion of rows have few unique values
• GROOM table periodically
  • Reorganizes data according to ORGANIZE ON specification
  • Removes deleted rows
Using ORGANIZE ON - TSValues
• All current rows have ts_end = 1/1/3000 (very clumpy)
• Current rows are of greatest interest (most often selected)
• ts_begin commonly used in most queries
• Specify ts_end first. Clumps all current rows into contiguous blocks.

CREATE TABLE TSVALUES (
    ID_TSDATE INTEGER NOT NULL,
    ID_TS INTEGER NOT NULL,
    VALUE DOUBLE PRECISION NOT NULL,
    ID_SRC INTEGER NOT NULL,
    ID_STATUS INTEGER NOT NULL,
    TS_BEGIN TIMESTAMP NOT NULL,
    TS_END TIMESTAMP NOT NULL,
    SIGMA DOUBLE PRECISION NOT NULL
)
DISTRIBUTE ON (ID_TS)
ORGANIZE ON (TS_END, TS_BEGIN);
GROOM Command
GROOM TABLE TSVALUES RECORDS ALL RECLAIM BACKUPSET DEFAULT;

• Organizes rows based on ORGANIZE ON clause
• Reclaims space from deleted rows
  • Netezza performs updates by deleting old row and inserting new row
  • Deleted rows consume space until table is GROOMed
• Fast – 15 minutes for 4.6 billion rows in TSValues
• Table remains available for query/update
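The update-as-delete-plus-insert behaviour described above can be modelled with a toy Python class. This is a sketch under loud assumptions: the class, its transaction counter, and the dictionary layout are invented for illustration and do not reflect Netezza's storage format.

```python
class Table:
    """Toy model of logical deletes; not Netezza's storage format."""

    def __init__(self):
        self.rows = []      # physical rows, including logically deleted ones
        self.next_xid = 1   # hypothetical transaction counter

    def _xid(self):
        xid = self.next_xid
        self.next_xid += 1
        return xid

    def insert(self, data):
        self.rows.append({"data": data, "createxid": self._xid(), "deletexid": 0})

    def update(self, old, new):
        """Mark the old version deleted and append a new version."""
        xid = self._xid()
        for row in list(self.rows):  # iterate over a copy while appending
            if row["deletexid"] == 0 and row["data"] == old:
                row["deletexid"] = xid
                self.rows.append({"data": new, "createxid": xid, "deletexid": 0})

    def visible(self):
        """Rows a query would see: those not marked deleted."""
        return [r["data"] for r in self.rows if r["deletexid"] == 0]

    def groom(self):
        """Physically drop rows that were marked deleted."""
        self.rows = [r for r in self.rows if r["deletexid"] == 0]

table = Table()
for value in ("a", "b", "c"):
    table.insert(value)
table.update("b", "B")

physical_before = len(table.rows)  # old and new versions both occupy space
table.groom()
physical_after = len(table.rows)   # only live rows remain

print(physical_before, physical_after, table.visible())
```

Until groom() runs, the superseded row still occupies space even though queries no longer see it, which mirrors why heavily updated tables benefit from periodic GROOM.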
SQL Differences – Netezza/Sybase
Topics
• General Language Characteristics
• Identifiers
• Data Types
• Functions
• Command differences
General Language Characteristics
• Similar to Oracle syntax and interpretation
• Commands terminated with semi-colon (;)
  • Terminator is required at all times
• Uses Oracle’s interpretation of NULL
  • A zero length string ('') is considered NULL
  • Assigning '' to a character column defined as NOT NULL will generate an error
Identifiers
• An identifier names a database object
• There are 2 types of identifiers: regular and delimited
• Regular Identifier
  • Is case insensitive
  • Must begin with a letter
  • May contain letters, digits, underscores, dollar sign ($)
• Delimited Identifier
  • Enclosed in double quotes
  • Is case sensitive
  • May include spaces, other special symbols and reserved words
  • May begin with any allowable character
Data Types
• Does not support LOB types (CLOB, BLOB)
• Character types
  • CHAR, VARCHAR: ASCII data, 64,000 maximum length
  • NCHAR, NVARCHAR: Unicode data, 16,000 maximum length
• Exact Numeric Types
  • BYTEINT: 8 bit signed integer
  • SMALLINT: 16 bit signed integer
  • INTEGER: 32 bit signed integer
  • BIGINT: 64 bit signed integer
  • NUMERIC(p,s), NUMERIC(p): Up to 38 digits precision (p), scale (s) from 0 to p
  • NUMERIC: Same as NUMERIC(18, 0)
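The bit widths listed for the integer types imply their value ranges. A short Python sketch, assuming the usual two's-complement representation for signed integers (the slide states only the widths):

```python
# Bit widths as listed on the Data Types slide.
INTEGER_TYPES = {"BYTEINT": 8, "SMALLINT": 16, "INTEGER": 32, "BIGINT": 64}

def signed_range(bits):
    """Lowest and highest value of a two's-complement signed integer (assumed encoding)."""
    return -(2 ** (bits - 1)), 2 ** (bits - 1) - 1

ranges = {name: signed_range(bits) for name, bits in INTEGER_TYPES.items()}

for name, (low, high) in ranges.items():
    print(f"{name:<9} {low:>26,} .. {high:>26,}")
```

This is mainly useful when mapping Sybase columns to the narrowest Netezza type that still holds the data.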
Data Types (continued)
• Approximate Numeric Types
  • FLOAT(p): precision can range from 1 to 15
  • REAL: same as FLOAT(6) – 4 bytes
  • DOUBLE PRECISION: same as FLOAT(15) – 8 bytes
• Logical Types
  • BOOLEAN (or BOOL): True/false value, 1 byte
• Temporal Types
  • Time maintained to microsecond (6 decimal places)
  • DATE: Date with no time – 4 bytes
  • TIME: Time with no date – 8 bytes
  • TIME WITH TIME ZONE: Time with time zone information – 12 bytes
  • TIMESTAMP: Date and time – 8 bytes
  • INTERVAL: Time interval, non-standard implementation
    • Maintained in seconds
    • Ignores unit declarations, literals require explicit units
    • Month assumed to be 30 days
Internal Data Types (pseudo-columns)
• ROWID
  • Unique row identifier, assigned when row is inserted
  • Not sequential in table
  • Value range 100,000 - 9,223,372,036,854,775,807
  • Will numbers ever repeat?
    • If a table contains 2 billion rows and takes 3 minutes to copy
    • And if the table is copied repeatedly and continuously (add 2 billion rows every 3 minutes)
    • It will take over 24,000 years to run out of numbers
    • Netezza hopes you would have upgraded before then
• CREATEXID, DELETEXID
  • Transaction ID that created and deleted the row
  • If DELETEXID > 0, row has been deleted
  • You cannot see deleted rows using SQL
• DATASLICEID
  • Identifies the SPU holding the row
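The ROWID exhaustion estimate can be checked with a few lines of Python. The inputs come from the slide; the year length of 365.25 days is an assumption, so the figure is approximate.

```python
# Inputs from the ROWID bullet points.
ROWID_MIN = 100_000
ROWID_MAX = 9_223_372_036_854_775_807
ROWS_PER_COPY = 2_000_000_000
MINUTES_PER_COPY = 3

# Assumption: an average year of 365.25 days.
MINUTES_PER_YEAR = 60 * 24 * 365.25

available_ids = ROWID_MAX - ROWID_MIN + 1
copies = available_ids // ROWS_PER_COPY          # full table copies before exhaustion
years = copies * MINUTES_PER_COPY / MINUTES_PER_YEAR

print(f"about {years:,.0f} years of continuous copying")
```

Under these assumptions the estimate comes out a little higher than the slide's figure, and it remains consistent with "over 24,000 years".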
Operators
• Differences from Sybase
  • Concatenation: ||
  • Not Equal: <> or !=
  • Null test: ISNULL or IS NULL; NOTNULL or IS NOT NULL
Conditional Column Expression Functions
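• CASE
  • Searched form: CASE WHEN <search-condition-1> THEN <result-1> WHEN <search-condition-2> THEN <result-2> ... WHEN <search-condition-n> THEN <result-n> ELSE <default-result> END
    Search conditions can be arbitrarily complex and results can be expressions.
  • Value form: CASE <test-value> WHEN <comparand-value-1> THEN <result-1> WHEN <comparand-value-2> THEN <result-2> ... WHEN <comparand-value-n> THEN <result-n> ELSE <default-result> END
    Test values, comparand values, and results can be expressions.
• DECODE
  • Same as Oracle DECODE
  • decode(<expression>, <search-1>, <result-1>, …, <search-n>, <result-n>)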
Commonly Used Functions
• Current time
  • Use NOW() or CURRENT_TIMESTAMP
  • TIMEOFDAY() returns a verbose date string. Example: 'Mon Jan 24 16:12:05 2011 EST'
• Date Conversion
  • Date literal is in MM/DD/YYYY format (default). Example: '1/1/3000'
  • TO_DATE(<string>, <template>) Converts string to DATE data type
    • Template describes format of date string. Example: TO_DATE('24 Jan 2011', 'DD Mon YYYY')
  • TO_TIMESTAMP(<string>, <template>) Converts string to TIMESTAMP data type
  • TO_CHAR(date or timestamp, <template>) Converts date or timestamp to string
  • See User Guide pp 3-26 – 3-28 for template information
SQL Differences - Update
Sybase
Netezza
update #tempRaw
Set a.Sigma = b.Sigma
from #tempRaw a, TSValues b
where a.ID_TS = b.ID_TS
  and a.ID_TSDate = b.ID_TSDate
  and getdate() < b.ts_end
  and a.ID_Status >= 1
  and a.ID_StatusSigma = 1 and a.ID_StatusSigma = 1 and A.ID_Status