• Donald H. (Donnie) Pinkston, III's avatar
    Documents and code changes for Assignment 3 · c7b4e8f9
    Donald H. (Donnie) Pinkston, III authored
    The design document and information document are included under the doc
    directory now.
    
    Added a lot of documentation, particularly to the Schema class.
    
    Made a few changes to the ArithmeticOperator and BooleanOperator
    classes' simplify() methods so that we can do some basic
    simplifications.
    
    Added the StatisticsUpdater so that it's easier for students to modify
    statistics based on selection predicates.
    
    Fixed a bug in SelectClause where ORDER BY clauses were using the
    FROM-clause schema rather than the SELECT-clause schema.
    c7b4e8f9
lab3design.txt 4.98 KB
CS122 Assignment 3 - Table Statistics and Plan Costing - Design Document
========================================================================

Fill in answers for all questions based on your team's work on Assignment 3.

A:  Statistics Collection
-------------------------

A1.  Using pseudocode, summarize the implementation of your HeapTupleFile
     analyze() function.  Please summarize your actual code; do not simply
     paste in the assignment description or the actual code you wrote.

B:  Plan Costing Estimates
----------------------------------------

B1.  Briefly describe how you estimate the number of tuples and the cost
     of a file-scan plan node.  What factors does your cost include?

B2.  Briefly describe how you estimate the number of tuples and the cost
     of a simple-filter plan node.  What factors does your cost include?

B3.  Briefly describe how you estimate the number of tuples and the cost
     of a nested-loop join plan node.  What factors does your cost include?

B4.  For each kind of comparison (==, !=, >, <, >=, <=), how do you update the
     estimated number of distinct values for each kind of comparison that your
     StatisticsUpdater implementation supports?  Are there cases where you make
     no changes to the statistics?

B5.  For each kind of comparison (==, !=, >, <, >=, <=), how do you update the
     estimated min and max values for each kind of comparison that your
     StatisticsUpdater implementation supports?  Are there cases where you make
     no changes to the statistics?

C:  Costing SQL Queries
-----------------------

Answer these questions after you have loaded the stores-28K.sql data, and
have analyzed all of the tables in that schema.

C1.  Paste the output of running:  EXPLAIN SELECT * FROM cities;
     Do not include debug lines, just the output of the command itself.

C2.  What is the estimated number of tuples that will be produced by each
     of these queries:

     SELECT * FROM cities WHERE population > 1000000;

     <paste output here>

     SELECT * FROM cities WHERE population > 5000000;

     <paste output here>

     SELECT * FROM cities WHERE population > 8000000;

     <paste output here>

     How many tuples does each query produce?

     Briefly explain the difference between the estimated number of tuples
     and the actual number of tuples for these queries.

C3.  Paste the output of running these commands:

     EXPLAIN SELECT store_id FROM stores, cities
     WHERE stores.city_id = cities.city_id AND
           cities.population > 1000000;

     <paste output here>

     EXPLAIN SELECT store_id FROM stores JOIN
                    (SELECT city_id FROM cities
                     WHERE population > 1000000) AS big_cities
                    ON stores.city_id = big_cities.city_id;

     <paste output here>

     The estimated number of tuples produced should be the same, but the
     costs should be different.  Explain why.

C4.  The assignment gives this example "slow" query:

     SELECT store_id, property_costs
     FROM stores, cities, states
     WHERE stores.city_id = cities.city_id AND
           cities.state_id = states.state_id AND
           state_name = 'Oregon' AND property_costs > 500000;

     How long does this query take to run, in seconds?

     Include the EXPLAIN output for the above query here.

     <paste output here>

     How would you rewrite this query (e.g. using ON clauses, subqueries
     in the FROM clause, etc.) to be as optimal as possible?  Also include
     the result of EXPLAINing your query.

D:  Extra Credit [OPTIONAL]
---------------------------

If you implemented any extra-credit tasks for this assignment, describe
them here.  The description should be like this, with stuff in "<>" replaced.
(The value i starts at 1 and increments...)

D<i>:  <one-line description>

     <brief summary of what you did, including the specific classes that
     we should look at for your implementation>

     <brief summary of test-cases that demonstrate/exercise your extra work>

E:  Feedback [OPTIONAL]
-----------------------

WE NEED YOUR FEEDBACK!  Thoughtful and constructive input will help us to
improve future versions of the course.  These questions are OPTIONAL, and
your answers will not affect your grade in any way (including if you hate
everything about the assignment and databases in general, or Donnie and/or
the TAs in particular).  Feel free to answer as many or as few of them as
you wish.

E1.  What parts of the assignment were most time-consuming?  Why?

E2.  Did you find any parts of the assignment particularly instructive?
     Correspondingly, did any parts feel like unnecessary busy-work?

E3.  Did you particularly enjoy any parts of the assignment?  Were there
     any parts that you particularly disliked?

E4.  Were there any critical details that you wish had been provided with the
     assignment, that we should consider including in subsequent versions of
     the assignment?

E5.  Do you have any other suggestions for how future versions of the
     assignment can be improved?