In this assignment you will use SQLite3 to execute a few SQL queries. You can find documentation on SQLite3 at here. SQLite3 is not as functional as PostgreSQL or MySQL or the commercial relational DBMSes, but it is much easier to use and query. That is why there are rumoured to be many more installations of SQLite than PostgreSQL or any other traditional DBMS. Also, you can easily install SQLite3 on your own machine, but make sure your final SQL queries run on SQLite3 installed on the CS mumble machines.
For this assignment, we will use the US Census Population Estimate dataset, which can be found here. We have downloaded and cleaned this dataset and inserted it into a sqlite database called census.db that can be found in the directory called P op Estimate Dataset in this assignment directory.
There is a layout file for each table in the database. The table layout is helpful in understanding what each attribute means. Section 3 of this document gives you details of the schema of the database.
2 Query the Database
Your task is to write the following queries in separate files, as described below:
File
Query
Output Columns
Query1.sql
List all states and their values for 2011 housing estimate
(HUEST 2011)
(State Code, Housing
Estimate)
Query2.sql
Find the states which have the number of births in 2012
(BIRTS2012) to be greater than 80,000
(State Code, Births)
Query3.sql
For each state, print its name and the net migration into
the state for the year 2012 (NETMIG2012)
(State Name, Net Migration)
Query4.sql
Find the number of counties in each state and print it with
the name of the state
(State Name, Count of
Counties)
Query5.sql
Display names of all “metropolitan statistical
areas”(LSAD) and their corresponding migration for the period between June 2010 and June 2011 (NETMIG2011)
(Area Name, Net Migration)
Query6.sql
Find each state, print the population estimate in the year
2011 (POPESTIMATE2011) of all women aged between
21 and 45 (both inclusive)
(Age, Population)
Query7.sql
For each division, print the name of the division 1∼9
(DIVISION DESC) and a list of all states that belong to that region concatenated into one string that is separated by commas (take a look at the sqlite documentation for the appropriate aggregate function)
(Division Name, State List)
Query8.sql
For each division, find the name of the state that has the
greatest increase in housing estimates for the year 2011 (HUEST 2011) as compared to housing estimates for the year 2010 (HUEST 2010)
(Division Name, State Name, Increase)
Query9.sql
For each states, display the percentage of the states
estimated 2011 population (POPESTIMATE2011) that is above the age of 21 (≥ 21)
(State Name, Percentage)
Query10.sql
Find the states for which the ratio of the “2011 resident
total population estimates”(POPESTIMATE2011) and “2011 housing estimates base”(HUEST 2011) is below the nationwide ratio. The ratios should be compared in floating point number, not in integer
(State Name)
Query11.sql
For each state, find the name of the county that has the
smallest increase in the population between 2011 and 2012 (NPOPCHG 2012). Look at counties individually and not the state as a whole. Display name of state and county and population change in lexicographic order of
“Division Desc”. For states with same division, sort in decreasing order of the 2012 population estimate (POPESTIMATE2012) of the county
(State Name, County, Population Change)
Query12.sql
For all age groups, calculate the absolute difference of 2011
population estimate (POPESTIMATE2011) and 2010 population estimate (POPESTIMATE2010). Also, display the indicator that indicates if the population has increased or decreased, or remained the same. The value of indicator should be one of {increased, same, decreased}
(Age, Difference, Indicator)
—
The sqlite file itself is in binary and does not convey any information about the tables present in it. You may want to use sqlitestudio. This allows you to visualize the tables and run queries on the data which might be useful in the above task.
If you cant write any one of the queries above, provide an explanation of the features that are lacking in SQLite3 that prevents you from being able to write that query. Write these explanations in a file called Queries-readme.txt. To test whether or not your queries work as expected, you can use sqlitestudio or the command below.
Note: To check the query output, we will run the following command on our copy of census.db :
$ sqlite3 census.db < Query3.sql
Thus, your text file can have multiple queries, but it must only output your final desired output, and clean up any temporary tables that it creates. You should try to minimize the number of distinct SQL query blocks in each query file – try to write a single query for each of the queries above, whenever possible.
Copy each of your query’s result into a separate file called result[1-12].txt, i.e., the word result followed by the query number and .txt. For example, if we want to get the results of Query 5, we would execute the following:
$ sqlite3 census.db < Query5.sql result5.txt
The above command can be repeated for each query with the proper query number substituted in place to get all
12 result files.
3 Schema for the Population Database
Here is a list of tables that are present in the database and the corresponding files in which you can find functional details such as the semantics of each column of the table.
Table Name
Where to Find the Functional Details About the Table
POP_ESTIMATE_NATION_STATE_PR
POP_ESTIMATE_NATION_STATE_PR.pdf
POP_ESTIMATE_STATE_COUNTY
POP_ESTIMATE_STATE_COUNTY.pdf
POP_ESTIMATE_CITIES_TOWNS
POP_ESTIMATE_CITIES_TOWNS.pdf
POP_ESTIMATE_STATE_AGE_SEX_RACE_ORIGIN
POP_ESTIMATE_STATE_AGE_SEX_RACE_ORIGIN.pdf
POP_ESTIMATE_PR_SEX_AGE
POP_ESTIMATE_PR_SEX_AGE.pdf
POP_ESTIMATE_PR_MUNICIPIOS
POP_ESTIMATE_PR_MUNICIPIOS.pdf
POP_ESTIMATE_PR_MUN_SEX_AGE
POP_ESTIMATE_PR_MUN_SEX_AGE.pdf
POP_ESTIMATE_METRO_MICRO
POP_ESTIMATE_METRO_MICRO.htm
HU_UNIT_STATE_LEVEL
HU_UNIT_STATE_LEVEL.pdf
The above mentioned pdf and htm files are present along with the data files in the P op Estimate Dataset directory.
We have created six more tables listed below. The above population tables have few columns such as Division, Region, Origin, Race, Sex, Sumlev that have numeric values in it. The meaning for those numeric values can be obtained from these separate tables below. The schemas for these tables are provided below along with the schemas for other tables.
Table Name
REGION
DIVISION
RACE
SEX
ORIGIN
SUMLEV
The file layout for each Population table does not give any information about the Data Type and any Constraints present on the table. It can be referred to know the functional details as to what information is present in each table and what each column in the table means.
We have listed the data type and constraints (Primary and Foreign Key) for each table below. The attribute with value F K in the C ons. field is a foreign key to the primary key of the table with the same name. As you can see below, there are cases where the same attribute is both a primary and foreign key. This might be useful when it comes to writing the actual queries.
Column
Type
Cons.
SEX_CD
Number
PK
SEX_DESC
String
REGION SEX
Column
Type
Cons.
REGION_CD
Number
PK
REGION_DESC
String
Column
Type
Cons.
RACE_CD
Number
PK
RACE_DESC
String
DIVISION RACE
Column
Type
Cons.
DIVISION_CD
Number
PK
DIVISION_DESC
String
Column
Type
Cons.
SUMLEV_CD
Number
PK
SUMLEV_DESC
String
ORIGIN SUMLEV
Column
Type
Cons.
ORIGIN_CD
Number
PK
ORIGIN_DESC
String
Column
Type
Cons.
SUMLEV
Number
FK
REGION
Number
FK
DIVISION
Number
FK
STATE
Number
NAME
String
PK
CENSUS2010POP
Number
ESTIMATESBASE2010
Number
POPESTIMATE2010
Number
POPESTIMATE2011
Number
POPESTIMATE2012
Number
NPOPCHG_2010
Number
NPOPCHG_2011
Number
NPOPCHG_2012
Number
BIRTHS2010
Number
BIRTHS2011
Number
BIRTHS2012
Number
DEATHS2010
Number
DEATHS2011
Number
DEATHS2012
Number
NATURALINC2010
Number
NATURALINC2011
Number
NATURALINC2012
Number
INTERNATIONALMIG2010
Number
INTERNATIONALMIG2011
Number
INTERNATIONALMIG2012
Number
DOMESTICMIG2010
Number
DOMESTICMIG2011
Number
DOMESTICMIG2012
Number
NETMIG2010
Number
NETMIG2011
Number
NETMIG2012
Number
RESIDUAL2010
Number
RESIDUAL2011
Number
RESIDUAL2012
Number
RBIRTH2011
Real
RBIRTH2012
Real
RDEATH2011
Real
RDEATH2012
Real
RNATURALINC2011
Real
RNATURALINC2012
Real
RINTERNATIONALMIG2011
Real
RINTERNATIONALMIG2012
Real
RDOMESTICMIG2011
Real
RDOMESTICMIG2012
Real
RNETMIG2011
Real
RNETMIG2012
Real
Column
Type
Cons.
SUMLEV
Number
FK
REGION
Number
FK
DIVISION
Number
FK
STATE
Number
PK
COUNTY
Number
PK
STNAME
String
CTYNAME
String
CENSUS2010POP
Number
ESTIMATESBASE2010
Number
POPESTIMATE2010
Number
POPESTIMATE2011
Number
POPESTIMATE2012
Number
NPOPCHG_2010
Number
NPOPCHG_2011
Number
NPOPCHG_2012
Number
BIRTHS2010
Number
BIRTHS2011
Number
BIRTHS2012
Number
DEATHS2010
Number
DEATHS2011
Number
DEATHS2012
Number
NATURALINC2010
Number
NATURALINC2011
Number
NATURALINC2012
Number
INTERNATIONALMIG2010
Number
INTERNATIONALMIG2011
Number
INTERNATIONALMIG2012
Number
DOMESTICMIG2010
Number
DOMESTICMIG2011
Number
DOMESTICMIG2012
Number
NETMIG2010
Number
NETMIG2011
Number
NETMIG2012
Number
RESIDUAL2010
Number
RESIDUAL2011
Number
RESIDUAL2012
Number
GQESTIMATESBASE2010
Number
GQESTIMATES2010
Number
GQESTIMATES2011
Number
GQESTIMATES2012
Number
RBIRTH2011
Real
RBIRTH2012
Real
RDEATH2011
Real
RDEATH2012
Real
RNATURALINC2011
Real
RNATURALINC2012
Real
RINTERNATIONALMIG2011
Real
RINTERNATIONALMIG2012
Real
RDOMESTICMIG2011
Real
RDOMESTICMIG2012
Real
RNETMIG2011
Real
RNETMIG2012
Real
4
Column
Type
Cons.
CBSA
Number
MDIV
Number
STCOU
Number
NAME
String
PK
LSAD
String
PK
CENSUS2010POP
Number
ESTIMATESBASE2010
Number
POPESTIMATE2010
Number
POPESTIMATE2011
Number
NPOPCHG_2010
Number
NPOPCHG_2011
Number
NATURALINC2010
Number
NATURALINC2011
Number
BIRTHS2010
Number
BIRTHS2011
Number
DEATHS2010
Number
DEATHS2011
Number
NETMIG2010
Number
NETMIG2011
Number
INTERNATIONALMIG2010
Number
INTERNATIONALMIG2011
Number
DOMESTICMIG2010
Number
DOMESTICMIG2011
Number
RESIDUAL2010
Number
RESIDUAL2011
Number
Column
Type
Cons.
SUMLEV
Number
FK
MUNICIPIO
Number
PK
NAME
String
YEAR
Number
PK
POPESTIMATE
Number
POPEST_MALE
Number
POPEST_FEM
Number
UNDER5_TOT
Number
UNDER5_MALE
Number
UNDER5_FEM
Number
AGE513_TOT
Number
AGE513_MALE
Number
AGE513_FEM
Number
AGE1417_TOT
Number
AGE1417_MALE
Number
AGE1417_FEM
Number
AGE1824_TOT
Number
AGE1824_MALE
Number
AGE1824_FEM
Number
AGE16PLUS_TOT
Number
AGE16PLUS_MALE
Number
AGE16PLUS_FEM
Number
AGE18PLUS_TOT
Number
AGE18PLUS_MALE
Number
AGE18PLUS_FEM
Number
AGE1544_TOT
Number
AGE1544_MALE
Number
AGE1544_FEM
Number
AGE2544_TOT
Number
AGE2544_MALE
Number
AGE2544_FEM
Number
AGE4564_TOT
Number
AGE4564_MALE
Number
AGE4564_FEM
Number
AGE65PLUS_TOT
Number
AGE65PLUS_MALE
Number
AGE65PLUS_FEM
Number
AGE85PLUS_TOT
Number
AGE85PLUS_MALE
Number
AGE85PLUS_FEM
Number
MEDIAN_AGE_TOT
Number
MEDIAN_AGE_MALE
Number
MEDIAN_AGE_FEM
Number
HU_UNIT_STATE_LEVEL
Column
Type
Cons.
SUMLEV
Number
FK
STATE
Number
FK
REGION
Number
PK
DIVISION
Number
FK
STNAME
String
HUCENSUS2010
Number
HUESTBASE2010
Number
HUEST_2010
Number
HUEST_2011
Number
Column
Type
Cons.
SUMLEV
Number
FK
STATE
Number
PK
COUNTY
Number
PK
PLACE
Number
PK
COUSUB
Number
PK
CONCIT
Number
PK
NAME
String
STNAME
String
CENSUS2010POP
Number
ESTIMATESBASE2010
Number
POPESTIMATE2010
Number
POPESTIMATE2011
Number
Column
Type
Cons.
SUMLEV
Number
FK
MUNICIPIO
Number
PK
NAME
String
ESTIMATESBASE2010
Number
POPESTIMATE2010
Number
POPESTIMATE2011
Number
NPOPCHG_2010
Real
NPOPCHG_2011
Real
PPOPCHG_2010
Real
PPOPCHG_2011
Real
SRANK_ESTBASE2010
Number
SRANK_POPEST2010
Number
SRANK_POPEST2011
Number
SRANK_NPCHG2010
Number
SRANK_NPCHG2011
Number
SRANK_PPCHG2010
Number
SRANK_PPCHG2011
Number
Column
Type
Cons.
SUMLEV
Number
FK
STATE
Number
NAME
String
SEX
Number
PK, FK
AGE
Number
PK
CENSUS2010POP
Number
ESTIMATESBASE2010
Number
POPESTIMATE2010
Number
POPESTIMATE2011
Number
POP_ESTIMATE_STATE_AGE_SEX_RACE_ORIGIN POP_ESTIMATE_PR_SEX_AGE
Column
Type
Cons.
SUMLEV
Number
FK
REGION
Number
FK
DIVISION
Number
FK
STATE
Number
PK
SEX
Number
PK, FK
ORIGIN
Number
PK, FK
RACE
Number
PK, FK
AGE
Number
PK
CENSUS2010POP
Number
ESTIMATESBASE2010
Number
POPESTIMATE2010
Number
POPESTIMATE2011
Number
4 Submission
Create only the following files for this assignment: Query[1-12].sql, result[1-12].txt, Queries-readme.txt. Compress these files into a tar.gz package. The man pages for tar should be able to helf you with this. The name of your compressed package should be <your netid A2.tar.gz. For example, if your UW-netid is johndoe@wisc.edu, your compressed package should be called johndoe A2.tar.gz. Please use comments in all your SQL files to better explain what you are trying to accomplish in each query. Please upload this tar file to the Assignment 2
Submission Dropbox folder in learn@uw.