1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
CS122 Assignment 3 - Table Statistics and Plan Costing - Design Document
========================================================================
Fill in answers for all questions based on your team's work on Assignment 3.
A: Statistics Collection
-------------------------
A1. Using pseudocode, summarize the implementation of your HeapTupleFile
analyze() function. Please summarize your actual code; do not simply
paste in the assignment description or the actual code you wrote.
B: Plan Costing Estimates
----------------------------------------
B1. Briefly describe how you estimate the number of tuples and the cost
of a file-scan plan node. What factors does your cost include?
B2. Briefly describe how you estimate the number of tuples and the cost
of a simple-filter plan node. What factors does your cost include?
B3. Briefly describe how you estimate the number of tuples and the cost
of a nested-loop join plan node. What factors does your cost include?
B4. For each kind of comparison (==, !=, >, <, >=, <=), how do you update the
estimated number of distinct values for each kind of comparison that your
StatisticsUpdater implementation supports? Are there cases where you make
no changes to the statistics?
B5. For each kind of comparison (==, !=, >, <, >=, <=), how do you update the
estimated min and max values for each kind of comparison that your
StatisticsUpdater implementation supports? Are there cases where you make
no changes to the statistics?
C: Costing SQL Queries
-----------------------
Answer these questions after you have loaded the stores-28K.sql data, and
have analyzed all of the tables in that schema.
C1. Paste the output of running: EXPLAIN SELECT * FROM cities;
Do not include debug lines, just the output of the command itself.
C2. What is the estimated number of tuples that will be produced by each
of these queries:
SELECT * FROM cities WHERE population > 1000000;
<paste output here>
SELECT * FROM cities WHERE population > 5000000;
<paste output here>
SELECT * FROM cities WHERE population > 8000000;
<paste output here>
How many tuples does each query produce?
Briefly explain the difference between the estimated number of tuples
and the actual number of tuples for these queries.
C3. Paste the output of running these commands:
EXPLAIN SELECT store_id FROM stores, cities
WHERE stores.city_id = cities.city_id AND
cities.population > 1000000;
<paste output here>
EXPLAIN SELECT store_id FROM stores JOIN
(SELECT city_id FROM cities
WHERE population > 1000000) AS big_cities
ON stores.city_id = big_cities.city_id;
<paste output here>
The estimated number of tuples produced should be the same, but the
costs should be different. Explain why.
C4. The assignment gives this example "slow" query:
SELECT store_id, property_costs
FROM stores, cities, states
WHERE stores.city_id = cities.city_id AND
cities.state_id = states.state_id AND
state_name = 'Oregon' AND property_costs > 500000;
How long does this query take to run, in seconds?
Include the EXPLAIN output for the above query here.
<paste output here>
How would you rewrite this query (e.g. using ON clauses, subqueries
in the FROM clause, etc.) to be as optimal as possible? Also include
the result of EXPLAINing your query.
D: Extra Credit [OPTIONAL]
---------------------------
If you implemented any extra-credit tasks for this assignment, describe
them here. The description should be like this, with stuff in "<>" replaced.
(The value i starts at 1 and increments...)
D<i>: <one-line description>
<brief summary of what you did, including the specific classes that
we should look at for your implementation>
<brief summary of test-cases that demonstrate/exercise your extra work>
E: Feedback [OPTIONAL]
-----------------------
WE NEED YOUR FEEDBACK! Thoughtful and constructive input will help us to
improve future versions of the course. These questions are OPTIONAL, and
your answers will not affect your grade in any way (including if you hate
everything about the assignment and databases in general, or Donnie and/or
the TAs in particular). Feel free to answer as many or as few of them as
you wish.
E1. What parts of the assignment were most time-consuming? Why?
E2. Did you find any parts of the assignment particularly instructive?
Correspondingly, did any parts feel like unnecessary busy-work?
E3. Did you particularly enjoy any parts of the assignment? Were there
any parts that you particularly disliked?
E4. Were there any critical details that you wish had been provided with the
assignment, that we should consider including in subsequent versions of
the assignment?
E5. Do you have any other suggestions for how future versions of the
assignment can be improved?