$24
• Objectives
In this project you are asked to write several SQL queries on a relational ight database. The data in this database is from U.S. Department of Transformation, which is o cially released public subset of ight data that is available for access to everyone.
Your SQL server will be PostgreSQL server and you will work on your local environment. You can see tutorials about how to setup working environment for PostgreSQL.
• Database Schema
2.1 Tables
-> airline_codes ( [airline_code], airline_name )
-> airport_codes ( [airport_code], airport_desc )
-> airport_ids ( [airport_id], airport_desc )
-> cancellation_reasons ( [reason_code], reason_desc )
-> weekdays ( [weekday_id], weekday_name )
-> world_area_codes ( [wac_id], wac_name )
-> flight_reports ( [report_id], year, month, day, weekday_id, airline_code, plane_tail_number, origin_airport_id, origin_airport_code, origin_city_name, origin_wac_id, dest_airport_id, dest_airport_code, dest_city_name, dest_wac_id, departure_time, departure_delay, taxi_out_time, wheels_off_time, wheels_on_time, taxi_in_time, arrival_time, arrival_delay, is_cancelled, cancellation_reason, is_diverted, flight_time, flight_distance )
2.2 Explanations of Columns
weekday id: A number between 1-7, which represents days from Monday to Sunday.
1
airline code: A short string that represents the airline in ight reports entries. Length is mostly between 2-6.
plane tail number: A string that identi es a plane. It is unique to the plane itself, but not the airline. You may see reports with same plane tail number but di erent airline code. This is because the plane is sold to another airline at some time.
origin airport id & dest airport id: Numbers that represent airports in ight reports entries. Used for identifying the airport in origin city and in destination city.
origin airport code & dest airport code: Strings that represent airports in ight reports entries. Used for identifying the airport in origin city and in destination city.
origin city name & dest city name: Origin and destination city names.
origin wac id & dest wac id: Numbers that represent world area codes in ight reports entries. Used for identifying the state in countries. In data, you can see cities like San Francisco or Denver. However, San Francisco belongs to California world area (state) and Denver belongs to Colorado world area.
departure time: Time that the plane starts to leave the gate in the airport. It is in HH:MM format.
departure delay: Di erence in minutes between scheduled and actual departure time. Early departures show negative numbers.
taxi out time: Time spent by a plane between the gate and the takeo . Planes leave the gate, pass the taxiway, take their position on the runway and take o in taxi out time minutes.
wheels o time: Time that the plane takes o and becomes airborne. It is in HH:MM format. You can consider:
departure_time + taxi_out_time = wheels_off_time
wheels on time: Time that the plane touches the ground in the destination airport. It is in HH:MM format.
taxi in time: Time spent by a plane between the touchdown and the parking to the gate. Planes touch down to runway, pass the taxiway, take their parking position in the gate in taxi in time minutes.
arrival time: Time that the plane stops at the gate in the destination airport. It is in HH:MM format. You can consider:
wheels_on_time + taxi_in_time = arrival_time
arrival delay: Di erence in minutes between scheduled and actual arrival time. Early ar-rivals show negative numbers.
is cancelled: Indicator for cancelled ight. This eld is either 1 or 0. When a ight is cancelled, there is a cancellation reason for that.
2
cancellation reason: A character that speci es the reason for cancellation. If ight is not cancelled (is cancelled = 0), this eld is NULL. Otherwise, the reason can be ’A’ (Carrier), ’B’ (Weather), ’C’ (National Air System), ’D’ (Security).
is diverted: A diverted ight is one that has been routed from its original arrival destination to a new, typically temporary, arrival destination. This eld is either 1 or 0.
ight time: Time spent by a plane in the air in minutes.
ight distance: Distance between origin and destination airports in miles.
2.3 Foreign Keys
Here, in table ight reports,
weekday id is a foreign key that references to weekday id in weekdays table
airline code is a foreign key that references to airline code in airline codes table
origin airport id, dest airport id are foreign keys that reference to airline id in air-line ids table
origin airport code, dest airport code are foreign keys that reference to airport code in airline codes table
origin wac id, dest wac id are foreign keys that reference to wac id in world area codes table
cancellation reason is a foreign key that references to reason code in cancellation reasons table
You can check this link for more understanding of the data.
• SQL Queries
1. Find airlines and their average departure delay in 2018. Show airline name in ascending order, airline code and average departure delay in ascending order. Remember to disregard cancelled ights. (18 rows)
Columns: [airline_name *] [airline_code] [avg_delay *]
... ... ...
... ... ...
◦ avg_delay -> ascending order
◦ airline_name -> ascending order (for equal avg_delay values)
2. List airports and the number of cancelled ights with ’Security’ reason. Show airport code, airport description and cancelled ight count in descending order. (53 rows)
3
Columns: [airport_code *] [airport_desc] [cancel_count *]
... ... ...
... ... ...
• cancel_count -> descending order
• airport_code -> ascending order (for equal cancel_count values)
***CLARIFICATION: You should only consider origin airport code for this query.
3. List planes that received maintenance at the end of the year. When a plane has more than 5 ights per day in a yearly period, it receives maintenance. Show plane tail number, year, daily ight average. Remember to disregard cancelled ights. (3821 rows)
Columns: [plane_tail_number *] [year *] [daily_avg]
... ... ...
... ... ...
• plane_tail_number -> ascending order
• year -> ascending order (for equal plane_tail_number values)
***CLARIFICATION: You need to nd the number of non-cancelled ights of planes for each day. Then you need to check yearly ight count averages considering daily ight counts that you previously found. If the yearly average is more than 5, then the plane went under maintenance.
4. Find airlines that goes to ALL of ’Boston, MA’, ’New York, NY’, ’Portland, ME’, ’Wash-ington, DC’, ’Philadelphia, PA’ cities, in 2018 or 2019. You should disregard the ones that already ying these cities in 2016 or 2017. You should also ignore cancelled ights. Show airline names only. (5 rows)
Columns: [airline_name *]
...
...
◦ airline_name -> ascending order
5. Find all non-cancelled travels from Seattle to Boston with one stop. Flights should happen in the same day ( ight1: Seattle => Destination, ight2: Destination => Boston). Order ights by their total time, in ascending order.
Total time = flight1(flight_time) + flight1(taxi_out_time)
▪ flight2(taxi_in_time) + flight2(flight_time)
Total distance = flight1(flight_distance) + flight2(flight_distance)
Remember to check that ight1(arrival time) < ight2(departure time). Show ight date as "DD/MM/YYYY" string, plane tail number, ight1 arrival time, ight2 depar-ture time, ight1 origin city name, stop city name, ight2 dest city name, total time, to-tal distance. (253 rows)
4
Columns: flight_date*, plane_tail_number*, flight1_arrival_time,
flight2_departure_time, origin_city_name, stop_city_name*, dest_city_name, total_time*, total_distance*
◦ total_time -> ascending order
◦ total_distance -> ascending order
◦ plane_tail_number -> ascending order
◦ stop_city_name -> ascending order
6. Find best weekday for ights from San Francisco to Boston. Best weekday is the day that has least "departure delay + arrival delay" daily average. Show weekday id, weekday name, average delay. (1 row only, the best one)
Columns: [weekday_id] [weekday_name] [avg_delay]
... ... ... (only 1 row)
7. Find all airlines that had more than 10% of their ights out of Boston are cancelled. Return the airline name and the percentage of canceled ights out of Boston. Order the results by the percentage of canceled ights in descending order. (2 rows)
Columns: [airline_name] [percentage*]
... ...
... ...
◦ percentage -> descending order
8. Sometimes an airline can buy planes from another airline and re-brand it. Find sold and re-branded planes. Display plane tail number, rst owner airline code and second owner airline code. You need to check same plane tail number for di erent airlines.
Airline A can use plane X in 2016 and airline B can use same plane in 2018, 2019. Therefore (’X’, ’A’, ’B’) should be in the query result. (189 rows)
Columns: [plane_tail_number *] [first_owner] [second_owner]
... ... ...
... ... ...
• plane_tail_number -> ascending order
• first_owner -> ascending order
• second_owner -> ascending order
***CLARIFICATION: If same planes are used by di erent airlines for di erent dates, then it is re-branded. For di erent dates, date1 < date2:
if year1 < year2
or if year1 = year2 and month1 < month2
5
or if year1 < year2 and month1 = month2 and day1 < day2
9. Find average speed of planes that ew ONLY weekends of January 2016. Show plane tail number and average speed. Results should have an descending order on average speed. (15 rows)
Columns: [plane_tail_number] [avg_speed*]
... ...
... ...
◦ avg_speed -> descending order
10. Find airlines that have planes that ONLY goes to Texas area. Display airline name and the number of ights by those planes. (2 rows)
Columns: [airline_name *] [flight_count]
... ...
... ...
◦ airline_name -> ascending order
11. List the popular airline names, total number of ights, total number of cancelled ights for each year. Popular airlines are airlines that have more than 2000 ights in a day on the average, and have this popularity in all years. (12 rows)
Columns: [airline_name *] [year *] [total_num_flights] [cancelled_flights]
... ... ... ...
... ... ... ...
* airline_name -> ascending order
***CLARIFICATION: To nd popular airlines, you need to look at daily ight counts, and take the average of those counts. If the average is above 2000 ights per day, for all years, then the airline is a popular airline.
12. For each year and airline codes, nd number of ights to Boston and the percentage of Boston ights to the overall number of ights of that airline in that year. Consider results with the percentage > 1%. Disregard cancelled ights. (28 rows)
[year *] [airline_code *] [boston_flight_count] [boston_flight_percentage]
... ... ... ...
... ... ... ...
• year -> ascending order
• airline_code -> ascending order
6
13. For each airline, list the total number of ights on Monday and the total number of ights on Sunday in separate columns. Disregard cancelled ights.
[airline_name *] [monday_flights] [sunday_flights]
... ... ...
... ... ...
* airline_name -> ascending order
14. For each year and weekday, nd the frequent cancellation reason and number of such cancel-lations.
[year *] [weekday_name] [reason] [number_of_cancellations]
... ... ... ...
... ... ... ...
• year -> ascending order
• weekday_name -> ordered as Monday, Tuesday, Wednesday, Thursday, Friday, Saturday, Sunday
***CLARIFICATION: You need to order by year and order by weekdays like Monday, Tuesday, ... Sunday (you can use weekday id for that).
15. Find top 5 airports with highest ight tra c. Disregard cancelled ights. (5 rows) Total tra c is de ned as:
Traffic = Outgoing flight count + Incoming flight count or
Traffic = Flight count where the airport is the origin airport
◦ Flight count where the airport is the destination airport
Columns: [airport_desc *]
...
...
* airport_desc -> ascending order
• Regulations
1. Submission: Submission will be done via ODTUClass. Please remember that late sub-mission is allowed 5 days for all programming proejcts. You can distribute these 5 days to any mini-project your want. You should put the answer query to each question in a separate .sql le and zip those .sql les with following name:
7
e1234567_project1.zip
-> q1.sql
-> q2.sql
...
Where you should replace "1234567" with your own student number.
2. SQL Style: You should write your queries in readable manner. Write each clause on a separate line, add indentation for clear look. For example:
This is easier to read
select
*
from
flight_reports fr
where
fr.origin_city_name like ’%Denver%’
and fr.is_cancelled = 0
than this
select * from flight_reports fr where fr.origin_city_name like ’%Denver%’ and fr.is_cancelled = 0
You can use online-formatters/beauti ers for better indentation if you want.
3. Newsgroup: You must follow the newsgroup (news.ceng.metu.edu.tr) for discussions and possible updates on a daily basis.
• Tutorial & Guides
You can download PostgreSQL server from here.
For visualization, DBeaver is a nice tool which works for both Ubuntu and Windows. You can also use it for many other database servers.
Once you start database server, open DBeaver and click ’New Database Connection’ on top left corner. (Figure 1)
Choose PostgreSQL. (Figure 2)
Enter credentials and connect to default ’postgres’ database. You will create your own database for this project later. (Figure 3)
Open a new script and execute "create database ceng352 ight data" by selecting script and hitting (CTRL + ENTER). (Figure 4)
8
Now that you have created another database, connect to newly created database just like before. (Figure 5)
Open ’Tables’ to see the tables. (Figure 6) Since there are no tables you need to create tables using the prepared script. Download it from ODTUClass and open it with DBeaver software. Run queries by selecting parts of script and hitting CTRL + ENTER. (Figure 7)
You are almost there. Now, you need to import the data to tables. Import data other than " ight reports YYYY.sql"s, leave them to the end for foreign keys.
To import data, right click on the table name and click ’Import data’ (Figure 8)
Choose CSV type and .csv le that you want to import for selected table. (Figure 9, 10)
***IMPORTANT NOTE*** Choose semicolon(;) as delimiter for ’ ight reports YYYY.csv’ les. Choose comma (,) as delimiter for other .csv les.
That’s it. Now you can write queries on tables.
Figure 1: New database connection
Figure 2: Database options
9
Figure 3: Enter credentials and connect
Figure 4: Open new script
10
Figure 5: Connect to new database
Figure 6: Open tables view
11
Figure 7: Run create scripts
Figure 8: Import data
12
Figure 9: Choose CSV
Figure 10: Set delimiter
13