Project 4: Graph Algorithms


Introduction

In this project we will explore graph theory theorems and algorithms by applying them to real data. In the first part of the project, we consider a particular graph modeling correlations between stock price time series. In the second part, we analyse traffic data from a dataset provided by Uber.


1. Stock Market

In this part of the project, we study data from the stock market. The data is available at this Dropbox link. The goal of this part is to study correlation structures among the fluctuation patterns of stock prices using tools from graph theory. The intuition is that investors will follow similar investment strategies for stocks that are affected by the same economic factors. For example, stocks in the transportation sector may have different absolute prices, but if fuel prices change or are expected to change significantly in the near future, you would expect investors to buy or sell all of these stocks similarly to maximize their returns. Towards that goal, we construct different graphs based on similarities among the time series of returns of different stocks at different time scales (a day vs. a week). Then, we study the properties of such graphs. The data is obtained from the Yahoo Finance website and spans 3 years. You are provided with a number of CSV tables, each containing the fields Date, Open, High, Low, Close, Volume, and Adj Close. The files are named according to the ticker symbol of each stock. You may find the market sector for each company in Name sector.csv.



1. Return correlation

In this part of the project, we will compute the correlation among log-normalized stock-return time series. Before giving the expression for the correlation, we introduce the following notation:

$p_i(t)$ is the closing price of stock $i$ on day $t$.

$q_i(t)$ is the return of stock $i$ over the period $[t-1, t]$:

$$q_i(t) = \frac{p_i(t) - p_i(t-1)}{p_i(t-1)}$$

$r_i(t)$ is the log-normalized return of stock $i$ over the period $[t-1, t]$:

$$r_i(t) = \log(1 + q_i(t))$$

Then, with the above notation, we define the correlation between the log-normalized stock-return time series of stocks $i$ and $j$ as

$$\rho_{ij} = \frac{\langle r_i(t)\, r_j(t)\rangle - \langle r_i(t)\rangle \langle r_j(t)\rangle}{\sqrt{\left(\langle r_i(t)^2\rangle - \langle r_i(t)\rangle^2\right)\left(\langle r_j(t)^2\rangle - \langle r_j(t)\rangle^2\right)}}$$

where $\langle \cdot \rangle$ is a temporal average over the investigated time regime (for our data set, over the 3 years).
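For illustration, one possible way to compute these quantities is sketched below. It assumes the CSV files sit in a local data/ folder and that the Close column holds $p_i(t)$; the folder layout and column names are assumptions to check against your download. Note that for time averages of this form, $\rho_{ij}$ is exactly the Pearson correlation of the $r_i(t)$ series.

```python
import glob
import os
import numpy as np
import pandas as pd

# Load closing prices: one column per ticker, indexed by date.
prices = {}
for path in glob.glob("data/*.csv"):
    ticker = os.path.splitext(os.path.basename(path))[0]
    df = pd.read_csv(path, parse_dates=["Date"]).set_index("Date")
    prices[ticker] = df["Close"]
close = pd.DataFrame(prices).sort_index()

# q_i(t) = (p_i(t) - p_i(t-1)) / p_i(t-1)  and  r_i(t) = log(1 + q_i(t))
q = close.pct_change().dropna()
r = np.log1p(q)

# rho_ij: Pearson correlation matrix of the log-normalized return series.
rho = r.corr()
```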

QUESTION 1: What are the upper and lower bounds on $\rho_{ij}$? Provide a justification for using the log-normalized return ($r_i(t)$) instead of the regular return ($q_i(t)$).


2. Constructing correlation graphs

In this part, we construct a correlation graph using the correlation coefficients computed in the previous section. The correlation graph has the stocks as nodes, and the edge weights are given by the following expression:
$$w_{ij} = \sqrt{2(1 - \rho_{ij})}$$

Compute the edge weights using the above expression and construct the correlation graph.

QUESTION 2: Plot a histogram showing the un-normalized distribution of edge weights.
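A minimal sketch of the graph construction and the histogram is given below, reusing the correlation matrix rho from the previous snippet and the python-igraph library suggested later in this project; the plotting details are illustrative.

```python
import itertools
import igraph as ig
import matplotlib.pyplot as plt
import numpy as np

tickers = list(rho.columns)
n = len(tickers)

# One edge per pair of stocks, weighted by w_ij = sqrt(2 * (1 - rho_ij)).
edges, weights = [], []
for i, j in itertools.combinations(range(n), 2):
    edges.append((i, j))
    weights.append(float(np.sqrt(2.0 * (1.0 - rho.iloc[i, j]))))

g = ig.Graph(n=n, edges=edges)
g.vs["name"] = tickers
g.es["weight"] = weights

# Un-normalized histogram of the edge weights (Question 2).
plt.hist(weights, bins=50)
plt.xlabel("edge weight w_ij")
plt.ylabel("number of edges")
plt.show()
```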


3. Minimum spanning tree (MST)

In this part of the project, we will extract the MST of the correlation graph and interpret it.

QUESTION 3: Extract the MST of the correlation graph. Each stock can be categorized into a sector, which can be found in the Name sector.csv file. Plot the MST and color-code the nodes based on sectors. Do you see any pattern in the MST? The structures that you find in the MST are called vine clusters. Provide a detailed explanation of the pattern you observe.
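One possible way to do this with python-igraph is sketched below; the sector file name as written here (Name_sector.csv) and its column names ("Symbol", "Sector") are assumptions that should be checked against the provided file.

```python
import igraph as ig
import matplotlib.pyplot as plt
import pandas as pd

# Assumed file and column names -- adjust to match the provided sector file.
sectors = pd.read_csv("Name_sector.csv").set_index("Symbol")["Sector"]

# MST of the correlation graph g built in the previous snippet.
mst = g.spanning_tree(weights=g.es["weight"])

# One color per sector, taken from a qualitative matplotlib colormap.
unique_sectors = sorted(sectors.unique())
cmap = plt.get_cmap("tab20")
color_of = {s: cmap(k) for k, s in enumerate(unique_sectors)}
mst.vs["color"] = [color_of[sectors[name]] for name in mst.vs["name"]]

ig.plot(mst, "mst_daily.png", vertex_size=6)
```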



4. Sector clustering in MSTs

In this part, we want to predict the market sector of an unknown stock. We will explore two methods for performing this task. In order to evaluate the performance of the methods, we define the following metric:
$$\alpha = \frac{1}{|V|} \sum_{v_i \in V} P(v_i \in S_i)$$

where $S_i$ is the sector of node $i$. Define

$$P(v_i \in S_i) = \frac{|Q_i|}{|N_i|}$$

where $Q_i$ is the set of neighbors of node $i$ that belong to the same sector as node $i$, and $N_i$ is the set of neighbors of node $i$. Compare with the case where

$$P(v_i \in S_i) = \frac{|S_i|}{|V|}$$


QUESTION 4: Report the value of $\alpha$ for the above two cases and provide an interpretation for the difference.
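A sketch of the two evaluations is given below, reusing mst and sectors from the previous snippet; the symbol $\alpha$ follows the metric defined above.

```python
import numpy as np

def alpha_neighbor(tree, sectors):
    # P(v_i in S_i) = |Q_i| / |N_i|: fraction of v_i's neighbors in v_i's sector.
    probs = []
    for v in tree.vs:
        neighbors = tree.neighbors(v.index)
        same = sum(sectors[tree.vs[u]["name"]] == sectors[v["name"]] for u in neighbors)
        probs.append(same / len(neighbors))
    return float(np.mean(probs))

def alpha_baseline(tree, sectors):
    # P(v_i in S_i) = |S_i| / |V|: relative size of v_i's own sector.
    names = tree.vs["name"]
    sector_sizes = sectors.loc[names].value_counts()
    return float(np.mean([sector_sizes[sectors[name]] / len(names) for name in names]))

print("alpha (neighbor-based):", alpha_neighbor(mst, sectors))
print("alpha (sector-size baseline):", alpha_baseline(mst, sectors))
```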


5. Correlation graphs for weekly data

In the previous parts, we constructed the correlation graph based on daily data. In this part of the project, we will construct a correlation graph based on weekly data. To create the graph, sample the stock data weekly on Mondays and then calculate $\rho_{ij}$ using the sampled data. If there is a holiday on a Monday, ignore that week. Create the correlation graph based on the weekly data.

QUESTION 5: Extract the MST from the correlation graph based on weekly data. Compare the pattern of this MST with the pattern of the MST found in Question 3.
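A short sketch of the weekly sampling, reusing the daily close DataFrame from the first snippet: holiday Mondays simply do not appear in the trading data, so filtering on the weekday skips those weeks.

```python
import numpy as np

# Keep only Mondays present in the data; a holiday Monday is absent, so that
# week is skipped and the next return spans to the following Monday.
weekly_close = close[close.index.weekday == 0]

q_weekly = weekly_close.pct_change().dropna()
r_weekly = np.log1p(q_weekly)
rho_weekly = r_weekly.corr()
# Rebuild the correlation graph and its MST exactly as before, using rho_weekly.
```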



2. Let’s Help Santa!

Companies like Google and Uber have a vast amount of statistics about transportation dynamics. Santa has decided to use network theory to facilitate his gift delivery for next Christmas. When we learned about his decision, we designed this part of the project to help him. We will send him your results for this part!


1. Download the Data

Go to the "Uber Movement" website and download the Travel Times by Month (All Days) data, 2019 Quarter 4, for the Los Angeles area [1]. The dataset contains pairwise travel-time statistics between most pairs of points in the Los Angeles area. Points on the map are represented by unique IDs. To understand the correspondence between map IDs and areas, download the Geo Boundaries file from the same website [2]. This file contains the latitudes and longitudes of the corners of the polygon circumscribing each area. To be specific, if an area is represented by a polygon with 5 corners, then you have a 5 × 2 matrix of latitudes and longitudes, each row of which represents the latitude and longitude of one corner.

We strongly suggest using the igraph library in Python for this section. Most data processing steps will be far more convenient in Python than in R.


2. Build Your Graph

Read the dataset at hand and build a graph in which nodes correspond to locations and undirected weighted edges correspond to the mean travel times between each pair of locations (December only). Add the mean of the coordinates of each region's polygon corners (a 2-D vector) as an attribute to the corresponding vertex.

The graph will contain some isolated nodes (extra nodes existing in the Geo Boundaries JSON file) and a few small connected components. Remove such nodes and keep only the largest connected component of the graph. In addition, merge duplicate edges by averaging their weights [3]. We will refer to this cleaned graph as G afterwards.

QUESTION 6: Report the number of nodes and edges in G.
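A sketch of the graph construction is given below. The column names ("sourceid", "dstid", "month", "mean_travel_time"), the GeoJSON field "MOVEMENT_ID", and the MultiPolygon geometry layout are assumptions based on the usual Uber Movement format and should be verified against your download.

```python
import json
import numpy as np
import pandas as pd
import igraph as ig

df = pd.read_csv("los_angeles-censustracts-2019-4-All-MonthlyAggregate.csv")
df = df[df["month"] == 12]                      # keep December only

with open("los_angeles_censustracts.json") as f:
    geo = json.load(f)

# Mean of each region's polygon corners -> vertex coordinate (longitude, latitude).
coords = {}
for feature in geo["features"]:
    movement_id = int(feature["properties"]["MOVEMENT_ID"])
    ring = np.array(feature["geometry"]["coordinates"][0][0])  # outer ring, MultiPolygon assumed
    coords[movement_id] = ring.mean(axis=0)

# Average duplicate (i, j) / (j, i) edges into a single undirected weight.
lo = df[["sourceid", "dstid"]].min(axis=1).astype(int)
hi = df[["sourceid", "dstid"]].max(axis=1).astype(int)
mean_time = df.assign(lo=lo, hi=hi).groupby(["lo", "hi"])["mean_travel_time"].mean()

ids = sorted(coords)
index = {movement_id: k for k, movement_id in enumerate(ids)}
edges, weights = [], []
for (a, b), w in mean_time.items():
    if a != b and a in index and b in index:
        edges.append((index[a], index[b]))
        weights.append(w)

g = ig.Graph(n=len(ids), edges=edges)
g.vs["movement_id"] = ids
g.vs["coord"] = [coords[m] for m in ids]
g.es["weight"] = weights

# Drop isolated nodes and small components: keep the largest connected component.
G = g.components().giant()
print("nodes:", G.vcount(), "edges:", G.ecount())
```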



[1] If you download the dataset correctly, it should be named los_angeles-censustracts-2019-4-All-MonthlyAggregate.csv
[2] The file should be named los_angeles_censustracts.json
[3] Duplicate edges may exist when the dataset provides you with the statistics of a road in both directions. We remove duplicate edges for the sake of simplicity.

3. Traveling Salesman Problem

QUESTION 7: Build a minimum spanning tree (MST) of graph G. Report the street addresses of the two endpoints of a few edges. Are the results intuitive?

QUESTION 8: Determine what percentage of triangles in the graph (sets of 3 points on the map) satisfy the triangle inequality. You do not need to inspect all triangles; you can estimate the percentage by randomly sampling 1,000 triangles.
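A sketch of the sampling estimate is given below, assuming the cleaned graph G with mean travel times in the "weight" edge attribute; only vertex triples whose three connecting edges all exist in G are counted as triangles here.

```python
import random

def travel_time(graph, u, v):
    eid = graph.get_eid(u, v, error=False)
    return graph.es[eid]["weight"] if eid != -1 else None

random.seed(0)
satisfied, checked = 0, 0
while checked < 1000:
    a, b, c = random.sample(range(G.vcount()), 3)
    ab, bc, ca = travel_time(G, a, b), travel_time(G, b, c), travel_time(G, c, a)
    if None in (ab, bc, ca):
        continue  # the three points do not form a triangle in G; resample
    checked += 1
    if ab + bc >= ca and bc + ca >= ab and ca + ab >= bc:
        satisfied += 1

print(100.0 * satisfied / checked, "% of sampled triangles satisfy the triangle inequality")
```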

Now, we want to find an approximate solution to the traveling salesman problem (TSP) on G. Apply the approximation algorithm described in class [4]. Inspect the sequence of street addresses visited on the map and see whether the results are intuitive.

QUESTION 9: Find an upper bound on the empirical performance of the approximate algorithm:

$$\rho = \frac{\text{Approximate TSP Cost}}{\text{Optimal TSP Cost}}$$
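One approximation that fits this setting is the MST-based tour (a preorder walk of the MST with shortcuts); whether this is exactly the algorithm covered in class should be verified against the reference in footnote [4]. Since the optimal tour cost is at least the MST cost, the ratio computed below is an empirical upper bound on $\rho$.

```python
# Tour from a preorder walk of the MST (vertices in DFS discovery order),
# returning to the start; shortcut legs fall back to shortest-path travel time.
mst = G.spanning_tree(weights=G.es["weight"])
order = [v.index for v in mst.dfsiter(0)]
tour = order + [order[0]]

def tour_leg_cost(u, v):
    eid = G.get_eid(u, v, error=False)
    if eid != -1:
        return G.es[eid]["weight"]
    return G.distances(u, v, weights="weight")[0][0]

approx_cost = sum(tour_leg_cost(u, v) for u, v in zip(tour, tour[1:]))
mst_cost = sum(mst.es["weight"])

# Optimal tour cost >= MST cost, so this ratio upper-bounds rho empirically.
print("empirical upper bound on rho:", approx_cost / mst_cost)
```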


QUESTION 10: Plot the trajectory that Santa has to travel!


4. Analysing Traffic Flow

Next December, there is going to be a large group of visitors travelling from a location near Malibu to a location near Long Beach. We would like to analyse the maximum traffic that can flow between the two locations.


5. Estimate the Roads

We want to estimate the map of roads without using actual road datasets. Educate yourself about the Delaunay triangulation algorithm and then apply it to the node coordinates [5].

QUESTION 11: Plot the road mesh that you obtain and explain the result. Create a graph $G_\Delta$ whose nodes are the locations and whose edges are produced by the triangulation.
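A sketch using scipy.spatial.Delaunay (footnote [5]) is given below, assuming the vertex coordinates of G are stored in a "coord" attribute as (longitude, latitude). The way travel times are attached to triangulated edges here (the direct edge weight in G where available, otherwise the shortest-path travel time) is one reasonable choice, not the only one.

```python
import numpy as np
import matplotlib.pyplot as plt
import igraph as ig
from scipy.spatial import Delaunay

points = np.array(G.vs["coord"])          # (longitude, latitude) per vertex
tri = Delaunay(points)

# Each simplex (i, j, k) contributes its three sides as road edges.
edge_set = set()
for i, j, k in tri.simplices:
    for u, v in ((i, j), (j, k), (k, i)):
        edge_set.add((int(min(u, v)), int(max(u, v))))
edge_list = sorted(edge_set)

G_delta = ig.Graph(n=G.vcount(), edges=edge_list)
G_delta.vs["coord"] = G.vs["coord"]

# Attach a travel time to each triangulated road.
weights = []
for u, v in edge_list:
    eid = G.get_eid(u, v, error=False)
    if eid != -1:
        weights.append(G.es[eid]["weight"])
    else:
        weights.append(G.distances(u, v, weights="weight")[0][0])
G_delta.es["weight"] = weights

# Road mesh on actual coordinates (Question 11).
plt.triplot(points[:, 0], points[:, 1], tri.simplices, linewidth=0.3)
plt.xlabel("longitude")
plt.ylabel("latitude")
plt.show()
```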



6. Calculate Road Traffic Flows

QUESTION 12: Using simple math, calculate the traffic flow for each road in terms of cars/hour.

Report your derivation.

Hint: Consider the following assumptions:

- Each degree of latitude and longitude ≈ 69 miles
- Car length ≈ 5 m = 0.003 mile
- Cars maintain a safety distance of 2 seconds to the next car
- Each road has 2 lanes in each direction

Assuming no traffic jam, consider the calculated traffic flow as the maximum capacity of each road.
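One possible derivation under these assumptions is sketched below: the speed on a road is its straight-line length divided by its mean travel time, consecutive cars in a lane are separated (front to front) by one car length plus the distance covered during the 2-second gap, and each direction carries 2 lanes. The assumption that the Uber travel times are given in seconds should be checked against the dataset.

```python
import numpy as np

MILES_PER_DEGREE = 69.0      # given: one degree of latitude/longitude ~ 69 miles
CAR_LENGTH_MILES = 0.003     # given: ~5 m
SAFETY_GAP_SECONDS = 2.0     # given: 2-second following distance
LANES_PER_DIRECTION = 2      # given

capacities = []
for e in G_delta.es:
    p = np.array(G_delta.vs[e.source]["coord"])
    q = np.array(G_delta.vs[e.target]["coord"])
    length_miles = MILES_PER_DEGREE * float(np.linalg.norm(p - q))
    speed_mph = length_miles / (e["weight"] / 3600.0)     # travel time assumed in seconds

    # Front-to-front spacing between consecutive cars in one lane.
    spacing_miles = CAR_LENGTH_MILES + speed_mph * SAFETY_GAP_SECONDS / 3600.0
    cars_per_hour_per_lane = speed_mph / spacing_miles
    capacities.append(LANES_PER_DIRECTION * cars_per_hour_per_lane)

G_delta.es["capacity"] = capacities
```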


[4] You can find the algorithm in Papadimitriou and Steiglitz, "Combinatorial Optimization: Algorithms and Complexity", Chapter 17, page 414.
[5] You can use scipy.spatial.Delaunay in Python.

7. Calculate Max Flow

Consider the following locations in terms of latitude and longitude:

Source coordinates: [34.04, -118.56]

Destination coordinates: [33.77, -118.18]


QUESTION 13: Calculate the maximum number of cars that can commute per hour from Malibu to Long Beach. Also calculate the number of edge-disjoint paths between the two spots. Does the number of edge-disjoint paths match what you see on your road map?
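A sketch for this question is given below, assuming $G_\Delta$ with the per-road capacities from Question 12; the source and destination are mapped to the nearest graph vertices, and the number of edge-disjoint paths is obtained as a max flow with unit capacities.

```python
import numpy as np

def nearest_vertex(graph, lon, lat):
    pts = np.array(graph.vs["coord"])                      # stored as (longitude, latitude)
    return int(np.argmin(np.linalg.norm(pts - np.array([lon, lat]), axis=1)))

source = nearest_vertex(G_delta, -118.56, 34.04)           # near Malibu
target = nearest_vertex(G_delta, -118.18, 33.77)           # near Long Beach

# Maximum traffic (cars/hour) using the road capacities from Question 12.
flow = G_delta.maxflow(source, target, capacity=G_delta.es["capacity"])
print("max flow:", flow.value, "cars/hour")

# Edge-disjoint paths: max flow with unit capacities (Menger's theorem).
print("edge-disjoint paths:", int(G_delta.maxflow_value(source, target)))
```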



8. Prune Your Graph

In $G_\Delta$, there are a number of unreal roads that could be removed. For instance, you might notice some unreal links along the concavities of the beach, as well as in the hills of Topanga. Apply a threshold on the travel time of the roads in $G_\Delta$ to remove the fake edges. Call the resulting graph $\tilde{G}_\Delta$.

QUESTION 14: Plot $\tilde{G}_\Delta$ on actual coordinates. Do you think the thresholding method worked?

QUESTION 15: Now, repeat Question 13 for $\tilde{G}_\Delta$ and report the results. Do you see any changes? Why?
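A minimal sketch of the pruning step follows; the 20-minute cutoff is only a placeholder value, and the threshold should be tuned by inspecting the plotted mesh.

```python
# Placeholder threshold of 20 minutes; tune it by inspecting the plotted mesh.
THRESHOLD_SECONDS = 20 * 60
G_tilde = G_delta.copy()
G_tilde.delete_edges([e.index for e in G_tilde.es if e["weight"] > THRESHOLD_SECONDS])
# Then repeat the Question 13 computation on G_tilde.
```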



9. Define Your Own Task

Brainstorm and define a new interesting task based on the dataset at hand. This part is open-ended, and you'll be graded based on your creativity. You may also wish to go beyond this and use any other navigation- or traffic-related dataset to define your favorite task. This part is worth 20% of the project credit. You may discuss your ideas with the TA in designated office hours.


Submission

Please submit a zip file containing your code and report to CCLE. The zip file should be named "Project4 UID1 ... UIDn.zip", where UIDx are the student ID numbers of the team members. If you have any questions, you can post them on Piazza.




















