DAT manual.pdf

January 8, 2018 | Author: Angela Moore | Category: F Test, Probability Distribution, Errors And Residuals, Statistical Theory, Software


M.Sc. I.T. Part I Semester I

Data Analysis Tools MANUAL FOR PRACTICAL

2013 – 2014


M.Sc in Information Technology Part I Course III : Data Analysis Tools

Practical based on the Book “Modelling with Data”

Practical Problems Prepared and Implemented by Mr. Mahesh Naik, Valia College, Andheri

& Mr. Jayesh Shinde, UDIT, Santacruz

Compiled By R. Srivaramangai, UDIT, Santacruz

INDEX

S.NO  DESCRIPTION
1     List of Practical
2     Installation procedure for cygwin
3     Installation procedure for ubuntu
4     Practical 1
5     Practical 2
6     Practical 3
7     Practical 4
8     Practical 5
9     Practical 6
10    Practical 7
11    Practical 8
12    Practical 9
13    Practical 10
14    References

List of Practical

1. SQL queries based on Unit I
   a. DDL commands of SQL
   b. Select clause
      i. Simple select
      ii. Select queries with where clause
      iii. Select queries with arithmetic, relational and logical operators
      iv. Select queries with order by, group by, having, limit and offset
      v. Select queries with aggregation functions and distinct
      vi. Select queries with sub queries and joins
2. Implementing gsl matrices and vectors
   a. Illustration of gsl matrix multiplication
   b. Illustration of gsl vector with database query embedded
3. Graph Plotting
   a. Gnuplot for plotting vectors 1
   b. Gnuplot for plotting vectors 2
   c. Gnuplot for plotting vectors 3
4. Implementing Statistical Distributions
   Discrete distributions
   a) Bernoulli distribution
   b) Binomial distribution
   c) Poisson distribution
   d) Multinomial distribution
   e) Hypergeometric distribution
   Continuous distributions
   a) Normal distribution
   b) Lognormal distribution
   c) Gamma distribution
   d) Exponential distribution
   e) Beta distribution
5. Implementing Regression and goodness of fit
   a. Implementing OLS regression
   b. Implementing goodness of fit - chi square
6. Illustrating the maximum likelihood
7. Generating random numbers with Monte Carlo method using
   a. Exponential distribution
   b. Uniform distribution
   c. Binomial distribution
8. Implementing Parametric testing
   a. Using t-test
   b. Using f-test
9. Illustrating the method of Inference
10. Implementing non-parametric testing - ANOVA

Installation of cygwin

1) Download the Cygwin software from http://www.cygwin.com/. At the time of writing, the most recent version of the Cygwin DLL is 1.7.20-1.
2) Download the apophenia function library from http://apophenia.info/.
3) Install Cygwin by running its setup.exe.
4) Cygwin offers numerous packages, so select those required for the practicals, namely the gcc compiler, make, gsl, gnuplot and sqlite.
5) The apophenia library now has to be added to Cygwin. Installing Cygwin creates a cygwin folder on the C: drive. Inside it, go to the home directory and then to your own sub-directory, C:\cygwin\home\yourname (for example C:\cygwin\home\Jayesh).
6) Copy the apophenia archive into that directory (here, Jayesh).
7) Double-click the Cygwin terminal icon. The terminal window opens and displays the present working directory.
8) Unpack the apophenia library and configure it by typing:

       tar xvzf apophenia-0.99-09_Jul_13.tgz
       cd apophenia-0.99
       ./configure

To test:

1. Once the Cygwin installation is complete, check it by compiling and running a test program.
2. For a test program named abc.c, run the following commands in bash:

       gcc -std=gnu99 abc.c -o abc.out -lapophenia -lgsl -lsqlite3
       ./abc.out

Ubuntu Installation

Install Ubuntu as per the free download.

How to install SQLite on Ubuntu 13.04

1) Download the SQLite archive package sqlite-autoconf-3071700.tar.gz from http://www.sqlite.org.
2) Copy the downloaded package to the Home folder of Ubuntu 13.04.
3) Open the Terminal; it opens in the current directory. Extract the package:

       tar xvfz sqlite-autoconf-3071700.tar.gz

4) Extraction creates a folder named sqlite-autoconf-3071700 in the current directory.
5) Move into the new folder:

       jayesh@jayesh-G31M-S2L:~$ cd sqlite-autoconf-3071700

6) Configure the files in the sqlite-autoconf-3071700 folder:

       jayesh@jayesh-G31M-S2L:~/sqlite-autoconf-3071700$ ./configure

7) After configuration, build the package. The command asks for the password; type it and press Enter:

       jayesh@jayesh-G31M-S2L:~/sqlite-autoconf-3071700$ sudo make

8) Install the compiled SQLite:

       jayesh@jayesh-G31M-S2L:~/sqlite-autoconf-3071700$ sudo make install

9) Refresh the shared-library cache:

       jayesh@jayesh-G31M-S2L:~/sqlite-autoconf-3071700$ sudo ldconfig

How to install gsl on Ubuntu 13.04

1) Download the gsl archive package gsl-1.16.tar.gz from http://www.gnu.org/s/gsl/.
2) Copy the downloaded package to the Home folder of Ubuntu 13.04.
3) Open the Terminal; it opens in the current directory. Extract the package:

       tar xvfz gsl-1.16.tar.gz

4) Extraction creates a folder named gsl-1.16 in the current directory.
5) Move into the new folder:

       jayesh@jayesh-G31M-S2L:~$ cd gsl-1.16

6) Configure the files in the gsl-1.16 folder:

       jayesh@jayesh-G31M-S2L:~/gsl-1.16$ ./configure

7) After configuration, build the package. The command asks for the password; type it and press Enter:

       jayesh@jayesh-G31M-S2L:~/gsl-1.16$ sudo make

8) Install gsl:

       jayesh@jayesh-G31M-S2L:~/gsl-1.16$ sudo make install

9) Refresh the shared-library cache:

       jayesh@jayesh-G31M-S2L:~/gsl-1.16$ sudo ldconfig

How to install apophenia on Ubuntu 13.04

1) Download the apophenia archive package apophenia-0.99.tar.gz from http://apophenia.info/.
2) Copy the downloaded package to the Home folder of Ubuntu 13.04.
3) Open the Terminal; it opens in the current directory. Extract the package:

       tar xvfz apophenia-0.99.tar.gz

4) Extraction creates a folder named apophenia-0.99 in the current directory.
5) Move into the new folder:

       jayesh@jayesh-G31M-S2L:~$ cd apophenia-0.99

6) Configure the files in the apophenia-0.99 folder:

       jayesh@jayesh-G31M-S2L:~/apophenia-0.99$ ./configure

7) After configuration, build and install the library. The command asks for the password; type it and press Enter:

       jayesh@jayesh-G31M-S2L:~/apophenia-0.99$ sudo make install

8) Refresh the shared-library cache:

       jayesh@jayesh-G31M-S2L:~/apophenia-0.99$ sudo ldconfig

Installation of GNUPLOT on Ubuntu 13.04

       sudo apt-get install gnuplot-x11

Practical No. 1 - SQL queries based on Unit I

For all database-related practicals, create a database in SQLite3:

    jayesh@jayesh-G31M-S2L:~$ sqlite3 testDB.db
    SQLite version 3.7.17 2013-05-20 00:56:22
    Enter ".help" for instructions
    Enter SQL statements terminated with a ";"

To check whether the database has been created:

    sqlite> .databases
    seq  name             file
    ---  ---------------  ------------------------------
    0    main             /home/jayesh/testDB.db
    sqlite>

Problem statement: To execute SQL queries in order to store and retrieve the data under study in a database. SQLite is used for executing the queries.

i) Queries for performing DDL commands

DDL commands are used to create, modify and delete database objects. The data is stored in an RDBMS in the form of tables. The following DDL queries are to be run in SQLite:

    sqlite> CREATE TABLE COMPANY(
        ID INT PRIMARY KEY NOT NULL,
        NAME TEXT NOT NULL,
        AGE INT NOT NULL,
        ADDRESS CHAR(50),
        SALARY REAL
    );

    sqlite> CREATE TABLE DEPARTMENT(
        ID INT PRIMARY KEY NOT NULL,
        DEPT CHAR(50) NOT NULL,
        EMP_ID INT NOT NULL
    );

You can verify that the tables have been created successfully with the SQLite .tables command:

    sqlite> .tables
    COMPANY     DEPARTMENT

ii) Inserting values into the COMPANY and DEPARTMENT tables

    INSERT INTO COMPANY (ID,NAME,AGE,ADDRESS,SALARY) VALUES (1, 'Paul', 32, 'California', 20000.00 );
    INSERT INTO COMPANY (ID,NAME,AGE,ADDRESS,SALARY) VALUES (2, 'Allen', 25, 'Texas', 15000.00 );
    INSERT INTO COMPANY (ID,NAME,AGE,ADDRESS,SALARY) VALUES (3, 'Teddy', 23, 'Norway', 20000.00 );
    INSERT INTO COMPANY (ID,NAME,AGE,ADDRESS,SALARY) VALUES (4, 'Mark', 25, 'Rich-Mond ', 65000.00 );
    INSERT INTO COMPANY (ID,NAME,AGE,ADDRESS,SALARY) VALUES (5, 'David', 27, 'Texas', 85000.00 );
    INSERT INTO COMPANY (ID,NAME,AGE,ADDRESS,SALARY) VALUES (6, 'Kim', 22, 'South-Hall', 45000.00 );
    INSERT INTO COMPANY VALUES (7, 'James', 24, 'Houston', 10000.00 );

    INSERT INTO DEPARTMENT (ID, DEPT, EMP_ID) VALUES (1, 'IT Billing', 1 );
    INSERT INTO DEPARTMENT (ID, DEPT, EMP_ID) VALUES (2, 'Engineering', 2 );
    INSERT INTO DEPARTMENT (ID, DEPT, EMP_ID) VALUES (3, 'Finance', 7 );

iii) Select clause

The select clause is a data manipulation command used for retrieving data in the desired format from the database objects. The syntax of the various select clauses and their purpose is given below. The simplest form returns every column of every record:

    sqlite> SELECT * FROM COMPANY;

a) List all the records where AGE is greater than or equal to 25 AND SALARY is greater than or equal to 65000.00:

    sqlite> SELECT * FROM COMPANY WHERE AGE >= 25 AND SALARY >= 65000;

List all the records where AGE is greater than or equal to 25 OR SALARY is greater than or equal to 65000.00:

    sqlite> SELECT * FROM COMPANY WHERE AGE >= 25 OR SALARY >= 65000;

List all the records where AGE is not NULL. This returns every record, because no record has a NULL AGE:

    sqlite> SELECT * FROM COMPANY WHERE AGE IS NOT NULL;

List all the records where NAME starts with 'Ki', whatever follows 'Ki':

    sqlite> SELECT * FROM COMPANY WHERE NAME LIKE 'Ki%';

List all the records where AGE is either 25 or 27:

    sqlite> SELECT * FROM COMPANY WHERE AGE IN ( 25, 27 );

List all the records where AGE is neither 25 nor 27:

    sqlite> SELECT * FROM COMPANY WHERE AGE NOT IN ( 25, 27 );

List all the records where AGE is between 25 and 27 (both inclusive):

    sqlite> SELECT * FROM COMPANY WHERE AGE BETWEEN 25 AND 27;

List the AGE of every record, provided that at least one record has SALARY > 65000. The EXISTS subquery is not correlated with the outer query, so it acts as an all-or-nothing condition:

    sqlite> SELECT AGE FROM COMPANY WHERE EXISTS (SELECT AGE FROM COMPANY WHERE SALARY > 65000);

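Find the total salary for each name:

    sqlite> SELECT NAME, SUM(SALARY) FROM COMPANY GROUP BY NAME;

Add records with repeated names to the COMPANY table, so that grouping combines several rows:

    INSERT INTO COMPANY VALUES (8, 'Paul', 24, 'Houston', 20000.00 );
    INSERT INTO COMPANY VALUES (9, 'James', 44, 'Norway', 5000.00 );
    INSERT INTO COMPANY VALUES (10, 'James', 45, 'Texas', 5000.00 );

b) Order by clause

    sqlite> SELECT NAME, SUM(SALARY) FROM COMPANY GROUP BY NAME ORDER BY NAME;

Consider that the COMPANY table now holds the following records:

    ID  NAME   AGE  ADDRESS     SALARY
    1   Paul   32   California  20000.0
    2   Allen  25   Texas       15000.0
    3   Teddy  23   Norway      20000.0
    4   Mark   25   Rich-Mond   65000.0
    5   David  27   Texas       85000.0
    6   Kim    22   South-Hall  45000.0
    7   James  24   Houston     10000.0
    8   Paul   24   Houston     20000.0
    9   James  44   Norway      5000.0
    10  James  45   Texas       5000.0

c) The following example displays the records for which the name count is less than 2:

    sqlite> SELECT * FROM COMPANY GROUP BY name HAVING count(name) < 2;

The following example displays the records for which the name count is greater than 2:

    sqlite> SELECT * FROM COMPANY GROUP BY name HAVING count(name) > 2;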

d) Sort the result in ascending order by SALARY:

    sqlite> SELECT * FROM COMPANY ORDER BY SALARY ASC;

e) Sort the result in descending order by NAME:

    sqlite> SELECT * FROM COMPANY ORDER BY NAME DESC;

f) Limit the result to the number of rows you want to fetch from the table:

    sqlite> SELECT * FROM COMPANY LIMIT 6;

With OFFSET, the first rows are skipped. The following query skips 2 rows and returns the next 3:

    sqlite> SELECT * FROM COMPANY LIMIT 3 OFFSET 2;

g) Joins

A cross join pairs every row of COMPANY with every row of DEPARTMENT:

    sqlite> SELECT EMP_ID, NAME, DEPT FROM COMPANY CROSS JOIN DEPARTMENT;

An inner join returns only the rows that satisfy the join condition:

    sqlite> SELECT EMP_ID, NAME, DEPT FROM COMPANY INNER JOIN DEPARTMENT ON COMPANY.ID = DEPARTMENT.EMP_ID;

A left outer join returns every row of COMPANY, with NULL in the DEPARTMENT columns where no match exists:

    sqlite> SELECT EMP_ID, NAME, DEPT FROM COMPANY LEFT OUTER JOIN DEPARTMENT ON COMPANY.ID = DEPARTMENT.EMP_ID;

Practical 2

i) Multiplication Table

    #include <apop.h>

    int main(){
        gsl_matrix *m = gsl_matrix_alloc(20,15);
        gsl_matrix_set_all(m, 1);              // start with a matrix of ones
        for (int i=0; i< m->size1; i++){       // scale row i by i+1
            Apop_matrix_row(m, i, one_row);
            gsl_vector_scale(one_row, i+1);
        }
        for (int i=0; i< m->size2; i++){       // scale column i by i+1
            Apop_matrix_col(m, i, one_col);
            gsl_vector_scale(one_col, i+1);
        }
        apop_matrix_show(m);
        gsl_matrix_free(m);
    }

    jayesh@jayesh-G31M-S2L:~$ gcc -std=gnu99 multiplicationtable.c -o multiplicationtable.out -lapophenia -lgsl -lsqlite3
    jayesh@jayesh-G31M-S2L:~$ ./multiplicationtable.out
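The effect of the two scaling passes in the multiplication table listing can be checked without GSL. The following is a minimal sketch in plain C, assuming a fixed 20 x 15 array in place of gsl_matrix; the names fill_times_table and print_times_table are illustrative and are not part of GSL or apophenia. After both passes, entry (i, j) holds (i+1)*(j+1).

```c
#include <stdio.h>

#define ROWS 20
#define COLS 15

/* Fill m so that m[i][j] == (i+1)*(j+1), using the same three steps
   as the listing: set all to 1, scale each row, scale each column. */
void fill_times_table(double m[ROWS][COLS]){
    for (int i = 0; i < ROWS; i++)
        for (int j = 0; j < COLS; j++)
            m[i][j] = 1;                  /* like gsl_matrix_set_all(m, 1) */
    for (int i = 0; i < ROWS; i++)        /* scale row i by i+1 */
        for (int j = 0; j < COLS; j++)
            m[i][j] *= i + 1;
    for (int j = 0; j < COLS; j++)        /* scale column j by j+1 */
        for (int i = 0; i < ROWS; i++)
            m[i][j] *= j + 1;
}

/* Print the table, one matrix row per output line. */
void print_times_table(double m[ROWS][COLS]){
    for (int i = 0; i < ROWS; i++){
        for (int j = 0; j < COLS; j++)
            printf("%4g", m[i][j]);
        printf("\n");
    }
}
```

Calling fill_times_table on a double[20][15] array and then print_times_table from main prints the same table of products that the GSL version builds with row and column views.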

ii) Vector with an embedded database query

The function calc_taxes takes a double indicating taxable income and returns the US income taxes owed, assuming a head of household with two dependents taking the standard deduction.

    #include <apop.h>

    double calc_taxes(double income){
        double cutoffs[] = {0, 11200, 42650, 110100, 178350, 349700, INFINITY};
        double rates[] = {0, 0.10, .15, .25, .28, .33, .35};
        double tax = 0;
        int bracket = 1;
        income -= 7850;     //Head of household standard deduction
        income -= 3400*3;   //exemption: self plus two dependents.
        while (income > 0){
            //tax the part of the remaining income that falls in this bracket
            tax += rates[bracket] * GSL_MIN(income, cutoffs[bracket]-cutoffs[bracket-1]);
            //remove only the width of this bracket from the remaining income
            income -= cutoffs[bracket]-cutoffs[bracket-1];
            bracket ++;
        }
        return tax;
    }

    int main(){
        apop_db_open("data-census.db");
        strncpy(apop_opts.db_name_column, "geo_name", 100);
        apop_data *d = apop_query_to_data("select geo_name, Household_median_in as income\
                from income where sumlevel = '040'\
                order by household_median_in desc");
        Apop_col_t(d, "income", income_vector);
        d->vector = apop_vector_map(income_vector, calc_taxes);
        apop_name_add(d->names, "tax owed", 'v');
        apop_data_show(d);
    }

    jayesh@jayesh-G31M-S2L:~$ gcc -std=gnu99 taxes.c -o taxes.out -lapophenia -lgsl -lsqlite3
    jayesh@jayesh-G31M-S2L:~$ ./taxes.out
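The bracket arithmetic in the tax listing is easy to get wrong, so it may help to see it in isolation. The sketch below is plain C with no GSL or apophenia dependency; the function name bracket_tax is an assumption made for illustration, and it expects income that already has the deduction and exemptions subtracted. It reuses the cutoffs and rates from the listing and taxes each slice of income at the rate of the bracket it falls in.

```c
#include <math.h>   /* INFINITY */

/* Progressive tax on already-deducted taxable income.
   Each bracket b covers the interval (cutoffs[b-1], cutoffs[b]]
   and is taxed at rates[b]. Hypothetical helper, not library code. */
double bracket_tax(double taxable){
    const double cutoffs[] = {0, 11200, 42650, 110100, 178350, 349700, INFINITY};
    const double rates[]   = {0, 0.10, 0.15, 0.25, 0.28, 0.33, 0.35};
    const int n = sizeof cutoffs / sizeof cutoffs[0];
    double tax = 0;
    /* Stop as soon as the income no longer reaches into bracket b. */
    for (int b = 1; b < n && taxable > cutoffs[b-1]; b++){
        double top = taxable < cutoffs[b] ? taxable : cutoffs[b];
        tax += rates[b] * (top - cutoffs[b-1]);
    }
    return tax;
}
```

As a worked case, a taxable income of 50000 gives 0.10 x 11200 + 0.15 x 31450 + 0.25 x 7350 = 7675. To apply the sketch to a household median income as in the listing, subtract 7850 and 3 x 3400 first.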

Practical 3

Plotting a vector

    #include <apop.h>

    void plot_matrix_now(gsl_matrix *data){
        static FILE *gp = NULL;
        if (!gp)
            gp = popen("gnuplot -persist", "w");   // open the pipe only once
        if (!gp){
            printf("Couldn't open Gnuplot.\n");
            return;
        }
        fprintf(gp,"reset; plot '-' \n");
        apop_matrix_print(data, .output_pipe=gp);
        fflush(gp);
    }

    int main(){
        apop_db_open("data-climate.db");
        plot_matrix_now(apop_query_to_matrix("select (year*12+month)/12., temp from temp"));
    }

    jayesh@jayesh-G31M-S2L:~$ gcc -std=gnu99 pipeplot.c -o pipeplot.out -lapophenia -lgsl -lsqlite3
    jayesh@jayesh-G31M-S2L:~$ ./pipeplot.out

Eigen vector

    #include "eigenbox.h"

    apop_data *query_data(){
        apop_db_open("data-census.db");
        return apop_query_to_data(" select postcode as row_names, "
                " m_per_100_f, population/1e6 as population, median_age "
                " from geography, income,demos,postcodes "
                " where income.sumlevel= '040' "
                " and geography.geo_id = demos.geo_id "
                " and income.geo_name = postcodes.state "
                " and geography.geo_id = income.geo_id ");
    }

    void show_projection(gsl_matrix *pc_space, apop_data *data){
        fprintf(stderr,"The eigenvectors:\n");
        apop_matrix_print(pc_space, .output_pipe=stderr);
        apop_data *projected = apop_dot(data, apop_matrix_to_data(pc_space));
        printf("plot '-' using 2:3:1 with labels\n");
        apop_data_show(projected);
    }

    int main(){
        apop_plot_lattice(query_data(), "out");
    }

    jayesh@jayesh-G31M-S2L:~$ gcc -std=gnu99 eigenbox.c -o eigenbox.out -lapophenia -lgsl -lsqlite3
    jayesh@jayesh-G31M-S2L:~$ ./eigenbox.out
    jayesh@jayesh-G31M-S2L:~$ gnuplot -persist < out

Query out the month, average, and standard deviation, and plot the data using errorbars. The program prints to stdout, so pipe the output through Gnuplot.

    #include <apop.h>

    int main(){
        apop_db_open("data-climate.db");
        apop_data *d = apop_query_to_data("select \
                (yearmonth/100. - round(yearmonth/100.))*100 as month, \
                avg(tmp), stddev(tmp) \
                from precip group by month");
        printf("set xrange [0:13]; plot '-' with errorbars\n");
        apop_matrix_show(d->matrix);
    }

    jayesh@jayesh-G31M-S2L:~$ gcc -std=gnu99 errorbars.c -o errorbars.out -lapophenia -lgsl -lsqlite3
    jayesh@jayesh-G31M-S2L:~$ ./errorbars.out | gnuplot -persist

Practical 4

Implement the statistical distributions

Discrete distributions
1. Bernoulli distribution
2. Binomial distribution
3. Poisson distribution
4. Multinomial distribution
5. Hypergeometric distribution

Continuous distributions
1. Normal distribution
2. Lognormal distribution
3. Gamma distribution
4. Exponential distribution
5. Beta distribution

Bernoulli distribution (bernoulli.c)

    #include
    #include
    int main (void)
    {
        int i;
        double p = 0.6;
        float sum=0;
        /* prints probability distribution table */
        printf("random variable|||probability |||cumulative prob.\n");
        printf("-------------------------------------------------------\n");
        for (i = 0; i [...]

    [...] data->weights)/apop_sum(modelhist->data->weights);
        gsl_vector_scale(modelhist->data->weights, scaling);
        apop_data_show(apop_histograms_test_goodness_of_fit(datahist, modelhist));
    }
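The closing lines of the goodness-of-fit fragment rescale the model histogram so that its total weight matches the data histogram before the two are compared. A minimal sketch of that idea in plain C follows, assuming both histograms are given as arrays of bin counts over the same bins; the names total and chi_square_stat are illustrative, and the function returns only the Pearson chi-square statistic, not the full table that apop_histograms_test_goodness_of_fit prints.

```c
#include <stddef.h>

/* Sum of n bin weights. */
double total(const double *v, size_t n){
    double s = 0;
    for (size_t i = 0; i < n; i++)
        s += v[i];
    return s;
}

/* Pearson chi-square statistic between observed bin counts and a model
   histogram. The model is first rescaled to the observed total, as in
   the fragment above. Returns -1 if the model has no weight at all.
   Hypothetical helper, not part of apophenia. */
double chi_square_stat(const double *observed, const double *model, size_t n){
    double model_total = total(model, n);
    if (model_total <= 0)
        return -1;
    double scaling = total(observed, n) / model_total;
    double stat = 0;
    for (size_t i = 0; i < n; i++){
        double expected = model[i] * scaling;
        if (expected > 0){                 /* skip empty model bins */
            double diff = observed[i] - expected;
            stat += diff * diff / expected;
        }
    }
    return stat;
}
```

For observed counts {10, 20, 30} against a flat model {1, 1, 1}, the scaling factor is 20, every expected count is 20, and the statistic is 100/20 + 0 + 100/20 = 10. The statistic is then compared with a chi-square distribution with (number of bins - 1) degrees of freedom.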


Practical 6. Implement testing with likelihood

1. Building an optimized model and then solving it for the maximum (a function can be provided in this case)

The optimization methods used below are:

    APOP_SIMPLEX_NM   Nelder-Mead simplex (gradient handling rule is irrelevant)
    APOP_CG_FR        Conjugate gradient (Fletcher-Reeves) (default)
    APOP_SIMAN        Simulated annealing
    APOP_RF_NEWTON    Find a root of the derivative via Newton's method

The model (sinsq.c):

    #include <apop.h>

    double sin_square(apop_data *data, apop_model *m){
        double x = apop_data_get(m->parameters, 0, -1);
        return -sin(x)*gsl_pow_2(x);
    }

    apop_model sin_sq_model ={"-sin(x) times x^2",1, .p = sin_square};

The search, run once per method:

    #include "sinsq.c"

    void do_search(int number, char *name, char *trace){
        apop_model *out;
        double p[] = {0};
        double result;
        char *outf;
        asprintf(&outf, "localmax_out/%s.gplot", trace);
        Apop_model_add_group(&sin_sq_model, apop_mle, .starting_pt= p,
                .method= number, .tolerance= 1e-4, .mu_t= 1.25, .trace_path= outf);
        out = apop_estimate(NULL, sin_sq_model);
        result = gsl_vector_get(out->parameters->vector, 0);
        printf("The %s algorithm found %g.\n", name, result);
        Apop_settings_rm_group(&sin_sq_model, apop_mle);
    }

    int main(){
        system ("mkdir -p localmax_out; rm -f localmax_out/*.gplot");
        apop_opts.verbose ++;
        do_search(APOP_SIMPLEX_NM, "N-M Simplex", "simplex");
        do_search(APOP_CG_FR, "F-R Conjugate gradient", "fr");
        do_search(APOP_SIMAN, "Simulated annealing", "siman");
        do_search(APOP_RF_NEWTON, "Root-finding", "root");
        fflush(NULL);
        system("sed -i \"1iplot '-'\" localmax_out/*.gplot");
    }

2. Comparing 2 models using likelihood ratio

The models (dummies.c):

    #include <apop.h>

    apop_model * dummies(int slope_dummies){
        apop_data *d = apop_query_to_mixed_data("mmt", "select riders, year-1977, line \
                from riders, lines \
                where riders.station=lines.station");
        apop_data *dummified = apop_data_to_dummies(d, 0, 't', .append='y', .remove='y');
        if (slope_dummies){
            Apop_col(d, 1, yeardata);
            for(int i=0; i < dummified->matrix->size2; i ++){
                Apop_col(dummified, i, c);
                gsl_vector_mul(c, yeardata);
            }
        }
        apop_model *out = apop_estimate(dummified, apop_ols);
        apop_model_show(out);
        return out;
    }

    #ifndef TESTING
    int main(){
        apop_db_open("data-metro.db");
        printf("With constant dummies:\n");
        dummies(0);
        printf("With slope dummies:\n");
        dummies(1);
    }
    #endif

The test, which includes dummies.c with its main disabled:

    #define TESTING
    #include "dummies.c"

    void show_normal_test(apop_model *unconstrained, apop_model *constrained, int n){
        double statistic = (apop_data_get(unconstrained->info, .rowname="log likelihood")
                - apop_data_get(constrained->info, .rowname="log likelihood"))/sqrt(n);
        double confidence = gsl_cdf_gaussian_P(fabs(statistic), 1); //one-tailed.
        printf("The Normal statistic is: %g, so reject the null of no difference between models "
               "with %g%% confidence.\n", statistic, confidence*100);
    }

    int main(){
        apop_db_open("data-metro.db");
        apop_model *m0 = dummies(0);
        apop_model *m1 = dummies(1);
        show_normal_test(m0, m1, m0->data->matrix->size1);
    }

Practical 7. Generate random numbers using the Monte Carlo method with
1. Exponential distribution
2. Uniform distribution
3. Binomial distribution

Some functions used for random number generation

The functions used for random number generation are declared in the header file gsl_rng.h.
- const gsl_rng_type * T : holds static information about each type of generator.
- gsl_rng_env_setup() : reads the environment variables GSL_RNG_TYPE and GSL_RNG_SEED and uses their values to set the corresponding library variables gsl_rng_default and gsl_rng_default_seed.

1. Program to create a global generator using the environment variables GSL_RNG_TYPE and GSL_RNG_SEED

    #include <stdio.h>
    #include <gsl/gsl_rng.h>

    gsl_rng * r;  /* global generator */

    int main (void)
    {
        const gsl_rng_type * T;
        gsl_rng_env_setup();
        T = gsl_rng_default;
        r = gsl_rng_alloc (T);
        printf ("generator type: %s\n", gsl_rng_name (r));
        printf ("seed = %lu\n", gsl_rng_default_seed);
        printf ("first value = %lu\n", gsl_rng_get (r));
        gsl_rng_free (r);
        return 0;
    }

Running the program without any environment variables uses the initial defaults, an mt19937 generator with a seed of 0.

Setting the two variables on the command line changes the default generator and the seed, for example by prefixing the command with GSL_RNG_TYPE=taus GSL_RNG_SEED=123.
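Why the seed fixes the whole sequence can be seen with a much simpler generator than mt19937. The sketch below is a toy linear congruential generator in plain C; it is an illustration only and is not what GSL uses. The type toy_rng and its functions are hypothetical names, and the multiplier and increment are the widely quoted Numerical Recipes constants.

```c
#include <stdint.h>

/* Toy generator state: one 32-bit word. Illustrative, not GSL. */
typedef struct { uint32_t state; } toy_rng;

/* Seeding simply sets the state, so equal seeds give equal sequences. */
void toy_rng_seed(toy_rng *r, uint32_t seed){
    r->state = seed;
}

/* Next raw value: state = a*state + c, wrapping modulo 2^32. */
uint32_t toy_rng_get(toy_rng *r){
    r->state = 1664525u * r->state + 1013904223u;
    return r->state;
}

/* Uniform draw in [0, 1): divide the raw value by 2^32. */
double toy_rng_uniform(toy_rng *r){
    return toy_rng_get(r) / 4294967296.0;
}
```

Two generators seeded with the same value return identical draws, which is the property that makes a Monte Carlo run reproducible; changing the seed changes every draw that follows. For real work the GSL generators are the better choice, since a 32-bit LCG has a short period and weak low-order bits.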

Using exponential distribution

    #include
    #include
    #include
    #include
    #include
    int main(int argc, char *argv[])
    {
        int i,n;
        float x,alpha;
        gsl_rng *r=gsl_rng_alloc(gsl_rng_mt19937);  /* initialises GSL RNG */
        n=atoi(argv[1]);
        alpha=atof(argv[2]);
        x=0;
        for (i=0;i [...]

    [...]
    out->error='a'   Allocation error.
    out->error='d'   Dimension-matching error.
    out->error='i'   Matrix inversion error.
    out->error='m'   GSL math error.

    #include "eigenbox.h"

    int main(){
        double line[] = {0, 0, 0, 1};
        apop_data *constr = apop_line_to_data(line, 1, 1, 3);
        apop_data *d = query_data();
        apop_model *est = apop_estimate(d, apop_ols);
        apop_model_show(est);
        apop_data_show(apop_f_test(est, constr));
    }

Practical No. 9 Drawing an Inference

Obtaining the mean, standard error and p value for the given data.

    #include <apop.h>

    void one_boot(gsl_vector *base_data, gsl_rng *r, gsl_vector* boot_sample);

    // Fill boot_sample by drawing from base_data with replacement.
    void one_boot(gsl_vector * base_data, gsl_rng *r, gsl_vector* boot_sample){
        for (int i =0; i< boot_sample->size; i++)
            gsl_vector_set(boot_sample, i,
                gsl_vector_get(base_data, gsl_rng_uniform_int(r, base_data->size)));
    }

    int main(){
        int rep_ct = 10000;
        gsl_rng *r = apop_rng_alloc(0);
        apop_db_open("data-census.db");
        gsl_vector *base_data = apop_query_to_vector("select in_per_capita from income where sumlevel+0.0 =40");
        double RI = apop_query_to_float("select in_per_capita from income where sumlevel+0.0 =40 and geo_id2+0.0=44");
        gsl_vector *boot_sample = gsl_vector_alloc(base_data->size);
        gsl_vector *replications = gsl_vector_alloc(rep_ct);
        for (int i=0; i< rep_ct; i++){
            one_boot(base_data, r, boot_sample);
            gsl_vector_set(replications, i, apop_mean(boot_sample));
        }
        double stderror = sqrt(apop_var(replications));
        double mean = apop_mean(replications);
        printf("mean: %g; standard error: %g; (RI-mean)/stderr: %g; p value: %g\n",
               mean, stderror, (RI-mean)/stderror,
               2*gsl_cdf_gaussian_Q(fabs(RI-mean), stderror));
    }
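The resampling loop in the bootstrap listing does not depend on GSL or on the census database. A minimal sketch in plain C is given below, assuming the data are already in a double array and using the standard library rand() in place of gsl_rng_uniform_int; the names mean_of, one_boot_plain and bootstrap_mean are hypothetical. It reports the mean of the replicated means and their sample variance; the standard error is the square root of that variance.

```c
#include <stdlib.h>

/* Arithmetic mean of n values. */
double mean_of(const double *v, int n){
    double s = 0;
    for (int i = 0; i < n; i++)
        s += v[i];
    return s / n;
}

/* Draw n values from base with replacement into boot.
   rand() % n has a slight modulo bias, acceptable for a sketch. */
void one_boot_plain(const double *base, int n, double *boot){
    for (int i = 0; i < n; i++)
        boot[i] = base[rand() % n];
}

/* Bootstrap the mean: reps resamples, each of size n.
   Writes the mean of the replicated means and their sample variance.
   Returns 0 on success, -1 on bad arguments or allocation failure. */
int bootstrap_mean(const double *base, int n, int reps, unsigned seed,
                   double *boot_mean, double *boot_var){
    if (n <= 0 || reps <= 0)
        return -1;
    double *boot = malloc(n * sizeof *boot);
    double *rep  = malloc(reps * sizeof *rep);
    if (!boot || !rep){
        free(boot);
        free(rep);
        return -1;
    }
    srand(seed);                          /* fixed seed: repeatable runs */
    for (int r = 0; r < reps; r++){
        one_boot_plain(base, n, boot);
        rep[r] = mean_of(boot, n);
    }
    *boot_mean = mean_of(rep, reps);
    double ss = 0;
    for (int r = 0; r < reps; r++){
        double d = rep[r] - *boot_mean;
        ss += d * d;
    }
    *boot_var = reps > 1 ? ss / (reps - 1) : 0;
    free(boot);
    free(rep);
    return 0;
}
```

With 10000 replications, as in the listing, the variance of the replicated means estimates the squared standard error of the sample mean. A test value such as RI can then be compared with the bootstrap mean in units of that standard error.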


Practical No. 10 Implement Non-parametric Testing

1. Anova

    apop_data* apop_anova(char *table, char *data, char *grouping1, char *grouping2)

This function produces a traditional one- or two-way ANOVA table. It works from data in an SQL table, using queries of the form select data from table group by grouping1, grouping2.

Parameters:

    table      The table to be queried. Anything that can go in an SQL from clause is OK, so this can be a plain table name or a temp table specification like (select ... ), with parens.
    data       The name of the column holding the count or other such data.
    grouping1  The name of the first column by which to group data.
    grouping2  If this is NULL, then the function will return a one-way ANOVA. Otherwise, the name of the second column by which to group data in a two-way ANOVA.

    #include <apop.h>

    int main(){
        apop_db_open("data-metro.db");
        char joinedtab[] = "(select year, riders, line \
                from riders, lines \
                where riders.station = lines.station)";
        apop_data_show(apop_anova(joinedtab, "riders", "line", "year"));
    }
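The central number in a one-way ANOVA table is the F statistic, the ratio of the between-group mean square to the within-group mean square. The sketch below computes it in plain C for data held in arrays, assuming each observation carries a group index from 0 to k-1; the name anova_f is hypothetical and the function is not a substitute for apop_anova, which works from an SQL table and returns the full table.

```c
#include <stdlib.h>

/* One-way ANOVA F statistic.
   values[i] belongs to group group[i], with 0 <= group[i] < k.
   Returns -1 if the inputs do not allow an F statistic
   (fewer than 2 groups, an empty group, n <= k, or zero within-group
   variation). Hypothetical helper, not part of apophenia. */
double anova_f(const double *values, const int *group, int n, int k){
    if (k < 2 || n <= k)
        return -1;
    double *sum = calloc(k, sizeof *sum);
    int *count  = calloc(k, sizeof *count);
    if (!sum || !count){
        free(sum);
        free(count);
        return -1;
    }
    double grand = 0, result = -1;
    for (int i = 0; i < n; i++){
        sum[group[i]] += values[i];
        count[group[i]]++;
        grand += values[i];
    }
    grand /= n;
    int ok = 1;
    double ssb = 0;                        /* between-group sum of squares */
    for (int g = 0; g < k; g++){
        if (count[g] == 0){ ok = 0; break; }
        double d = sum[g] / count[g] - grand;
        ssb += count[g] * d * d;
    }
    if (ok){
        double ssw = 0;                    /* within-group sum of squares */
        for (int i = 0; i < n; i++){
            double d = values[i] - sum[group[i]] / count[group[i]];
            ssw += d * d;
        }
        if (ssw > 0)
            result = (ssb / (k - 1)) / (ssw / (n - k));
    }
    free(sum);
    free(count);
    return result;
}
```

For three groups {1, 2, 3}, {2, 3, 4} and {3, 4, 5}, the group means are 2, 3 and 4 and the grand mean is 3, so the between-group sum of squares is 6 on 2 degrees of freedom and the within-group sum of squares is 6 on 6 degrees of freedom, giving F = 3 / 1 = 3.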


References

1. Modelling with Data, Ben Klemens, Princeton University Press
2. Computational Statistics, James E. Gentle, Springer
3. Computational Statistics, Second Edition, Geof H. Givens and Jennifer A. Hoeting, Wiley Publications
4. www.cygwin.com
5. http://apophenia.info/
