Proposal

For most people, I believe there is a strong relationship between median income and median age. How much money you earn depends on how much experience you have, and experience comes with time. In other words, there is an association between how much money you earn and how old you are.

I collected data from the Data Commons Python API. I retrieved income, age, population, and crime data for all 50 U.S. states for every year from 2011 to 2019. Examples include the median income of people living in Wyoming in 2013 and the total count of crimes in Alabama in 2018. Each row represents one year for one state, so there are 9 rows of data for each state.

Note: After exploring the data (EDA), I realized that I did not have enough relevant information for a model. Specifically, I had crime data for different parts of the world, but most of it lacked a timestamp, such as the year in which those crimes occurred. Without several timestamps for each location, the model would struggle to yield meaningful results, since a count could not be compared against a different time for the same location.

With the semester coming to an end, Professor Holowczak and I ultimately decided to collect only the data I needed, without worrying about gathering more than 10 GB of data. A dataset that size was originally intended for this project because it gives a reason to use an engine for large-scale data processing like Spark. A large dataset would also make a cloud computing service like AWS far more worthwhile, since it can reduce costs: you pay only for the services you use, not for the hardware that processes the data and deploys applications from it. This led to the decision to use the current dataset from the Data Commons Python API. Previously, I was using data from Data Commons CSV files, Data Commons Data Download Tool files, and a Kaggle dataset.

Dataset

Sample of dataset from the API
Sample of transformed dataset

I wrote code to collect data from the Data Commons Python API. The resulting dataset needed to be transformed so that the columns could be used for a model, with every record being a yearly observation for a given state. A new column for state abbreviations was added, which is later used to create choropleth map visualizations of U.S. states.
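Below is a minimal sketch of what this collection step can look like with the datacommons package. The statistical variable names and the output file name are illustrative assumptions, not necessarily the exact ones used in the project.

import datacommons as dc
import pandas as pd

# Illustrative statistical variables (assumed names) mapped to output column names
STAT_VARS = {
    "Count_Person": "population",
    "Median_Income_Person": "income",
    "Median_Age_Person": "age",
    "Count_CriminalActivities_CombinedCrime": "crime_count",
}

# All states contained in the United States node
states = dc.get_places_in(["country/USA"], "State")["country/USA"]
state_names = dc.get_property_values(states, "name")  # dcid -> list of names

rows = []
for state in states:
    for stat_var, column in STAT_VARS.items():
        # get_stat_series returns a {date: value} mapping for one place and one variable
        series = dc.get_stat_series(state, stat_var)
        for date, value in series.items():
            rows.append({"state_dcid": state, "year": int(date[:4]),
                         "variable": column, "value": value})

# Pivot from long to wide so each record is one state-year observation
df = (pd.DataFrame(rows)
        .pivot_table(index=["state_dcid", "year"], columns="variable", values="value")
        .reset_index())
df = df[df["year"].between(2011, 2019)]
df["state"] = df["state_dcid"].map(lambda d: state_names[d][0])
df.to_csv("state_income_age_crime.csv", index=False)  # placeholder file name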

Next, I connected to an Amazon EC2 instance and configured the AWS CLI. I created an Amazon S3 bucket, ran the code that collects and transforms the data from the API, and placed the resulting dataset in the bucket.
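A minimal sketch of the upload step with boto3, assuming the AWS CLI credentials are already configured; the bucket and file names are placeholders.

import boto3

s3 = boto3.client("s3")  # uses the credentials set up with `aws configure`

bucket = "my-crime-project-bucket"  # placeholder bucket name
s3.create_bucket(Bucket=bucket)     # regions outside us-east-1 also need a CreateBucketConfiguration

# Upload the transformed dataset produced by the collection script
s3.upload_file("state_income_age_crime.csv", bucket, "state_income_age_crime.csv")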

EDA

Exploring the data

The dataset in the Amazon S3 bucket was loaded for exploratory data analysis using Python. I checked the number of rows and columns, the column names, the data type of each column, the number of missing values in each column, and descriptive statistics like the mean and maximum for all the columns.
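A minimal sketch of these checks with pandas; reading directly from S3 assumes the s3fs package is installed, and the bucket and file names are placeholders.

import pandas as pd

df = pd.read_csv("s3://my-crime-project-bucket/state_income_age_crime.csv")

print(df.shape)             # number of rows and columns
print(df.columns.tolist())  # column names
print(df.dtypes)            # data type of each column
print(df.isnull().sum())    # missing values per column
print(df.describe())        # mean, max, and other descriptive statistics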

In the dataset, each row covers one U.S. state for a specific year and gives the total population, median income per person, median age, and total count of crimes. The data spans 9 years, from 2011 to 2019, and covers all 50 states. There are no missing values.

From this analysis, I concluded that the dataset should be sufficient for predicting crime rates based on the relationship between median income and median age.

Model

Now that all the data is in the Amazon S3 bucket and has been checked for issues, the next step is to build the model to predict crime rates. I decided to use a logistic regression model, which predicts the probability (between 0 and 1) of an event. This fits the dataset well because predicting crime requires a way to say whether it is high or low: assigning 1 to crime that is higher than the national average and 0 to crime that is lower works perfectly here. The features are state, population, income, and age. The label to predict is whether each state's total crime per 100,000 people in a given year is above or below the national average crime per 100,000 people for that year.

Sample of some data for logistic regression
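As a rough sketch of how this label can be constructed in PySpark; column names such as crime_count and population are assumptions, not necessarily the exact names in the dataset, and the S3 path is a placeholder.

from pyspark.sql import SparkSession, Window
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("crime-label").getOrCreate()
df = spark.read.csv("s3://my-crime-project-bucket/state_income_age_crime.csv",
                    header=True, inferSchema=True)

# Crime per 100,000 people for each state-year
df = df.withColumn("crime_per_100k", F.col("crime_count") / F.col("population") * 100000)

# National average crime per 100,000 people for each year
by_year = Window.partitionBy("year")
df = df.withColumn("national_avg_per_100k", F.avg("crime_per_100k").over(by_year))

# Label: 1 if the state is above the national average for that year, otherwise 0
df = df.withColumn("label",
                   (F.col("crime_per_100k") > F.col("national_avg_per_100k")).cast("integer"))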

In order to read and process this data, an Amazon EMR cluster was created with Spark selected as the application, and I connected to it with 'hadoop' as the user name. I wrote the model code in PySpark, the Python API for Spark. These are the main steps the code takes (a sketch of steps 3 through 5 follows the list):

  1. PySpark reads the CSV dataset in the Amazon S3 bucket and creates a Spark DataFrame from it.
  2. Columns are created to measure crime: the crime per 100,000 people for each state and year, the national average crime per 100,000 people for each year, and a binary label for the logistic regression model indicating whether a state's crime per 100,000 people in that year is greater than the national average for that year.
  3. Feature engineering was done for the state column: it was encoded and combined with the age, population, and income columns in a VectorAssembler.
  4. A Pipeline was created so that the same sequence of transformations, including standardizing the features, is applied to the data at each step.
  5. The dataset was split into a 70% training set and a 30% test set for the logistic regression model.
  6. The Area Under the ROC Curve (AUC) was used to evaluate the models. To make sure the best model was picked, 3-fold cross-validation was combined with a grid search over the hyperparameters to see which combination had the highest AUC.
  7. The best model was tested on the testing set.
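Below is a minimal sketch of steps 3 through 5, assuming Spark 3.x and lowercase column names (state, age, population, income); the DataFrame df with the label column is reused from the previous sketch.

from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, OneHotEncoder, VectorAssembler, StandardScaler
from pyspark.ml.classification import LogisticRegression

# Encode the state column, then combine it with the numeric features
state_indexer = StringIndexer(inputCol="state", outputCol="state_index")
state_encoder = OneHotEncoder(inputCols=["state_index"], outputCols=["state_vec"])
assembler = VectorAssembler(inputCols=["state_vec", "age", "population", "income"],
                            outputCol="raw_features")
scaler = StandardScaler(inputCol="raw_features", outputCol="features")
lr = LogisticRegression(featuresCol="features", labelCol="label")

pipeline = Pipeline(stages=[state_indexer, state_encoder, assembler, scaler, lr])

# 70/30 train/test split (the seed is arbitrary, for reproducibility)
train, test = df.randomSplit([0.7, 0.3], seed=42)
baseline_model = pipeline.fit(train)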

Visuals

Visualizations of the data and prediction results were created with Spark tools and Python libraries (Matplotlib and Plotly Express). Below is a sample of the visualizations created.

Confusion Matrix

For the test set, the model predicted that the crime per 100,000 people (these counts can be tallied as sketched after this list):

  • would be more than the national average when it was actually more (True Positive) 66 times.
  • would be less than the national average when it was actually less (True Negative) 59 times.
  • would be more than the national average when it was actually less (False Positive) 4 times.
  • would be less than the national average when it was actually more (False Negative) 2 times.
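A rough sketch of how these confusion matrix cells can be counted in PySpark, assuming a fitted pipeline model (such as the tuned best model) and the 30% test split from the earlier sketch.

# `fitted_model` is any fitted pipeline model (for example, the tuned best model);
# `test` is the 30% hold-out set from the earlier split
predictions = fitted_model.transform(test)

# Each (label, prediction) pair corresponds to one cell of the confusion matrix
predictions.groupBy("label", "prediction").count().orderBy("label", "prediction").show()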

Average National Crime Over Time

A line graph shows the average national crime per 100,000 people in the United States from 2011 to 2019. Crime decreased with each passing year, sometimes only slightly (2,874 crimes per 100,000 in 2015 versus 2,873 in 2016). The highest rate was 3,249 in 2011 and the lowest was 2,487 in 2019.

Percent Change in Average National Crime

Note: The percent change value for 2012 is the percent difference between the crime rates in 2011 and 2012.

A bar graph with a line shows how much the average national crime per 100,000 people in the United States changed, as a percent, from 2011 to 2019. As the previous graph showed, the crime rate declined every year, so all the percent change values are negative. For example, in 2013 the crime rate decreased by almost 4% from 2012.
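A rough sketch of how both charts can be produced with pandas and Matplotlib, reusing the Spark DataFrame with the crime_per_100k column from the earlier sketch.

import matplotlib.pyplot as plt

pdf = df.select("year", "crime_per_100k").toPandas()

# Average national crime per 100,000 people per year
yearly = pdf.groupby("year")["crime_per_100k"].mean().sort_index()

# Percent change relative to the previous year (2012 is the change from 2011)
pct_change = yearly.pct_change().dropna() * 100

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 4))
ax1.plot(yearly.index, yearly.values, marker="o")
ax1.set_title("Average national crime per 100,000 people")
ax2.bar(pct_change.index, pct_change.values)
ax2.set_title("Year-over-year percent change")
plt.show()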

Year Over Year Change in Crime

Note: A darker shade of blue (positive numbers) indicates that crime in a state is higher than the national average, with the shade showing how much higher it is, whereas a lighter shade of blue (negative numbers) indicates that crime in a state is lower than the national average, with the shade showing how much lower it is.

These maps plot the difference between each state's crime per 100,000 people and the national average, for 2011 and 2019 respectively. States with values near 1,000, like Louisiana and Arkansas in 2011, have higher crime rates, while states with values near -1,000, like South Dakota in 2011, have lower crime rates. Comparing the 2011 and 2019 maps, states like Ohio and Florida saw a decline in crime, whereas states like Alaska and New York saw a rise, so whether crime rose or fell depends on the state.
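A sketch of one of these maps with Plotly Express, assuming the state-abbreviation column added during the transformation step is named state_abbrev and reusing the crime columns from the earlier sketches.

import plotly.express as px
from pyspark.sql import functions as F

# Difference from the national average for one year (2011 here)
pdf = (df.withColumn("diff_from_national_avg",
                     F.col("crime_per_100k") - F.col("national_avg_per_100k"))
         .filter(F.col("year") == 2011)
         .select("state_abbrev", "diff_from_national_avg")
         .toPandas())

fig = px.choropleth(pdf,
                    locations="state_abbrev", locationmode="USA-states",
                    color="diff_from_national_avg", scope="usa",
                    color_continuous_scale="Blues",
                    title="Difference from national average crime per 100,000 people, 2011")
fig.show()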

Results

Interpreting the model results:

  • Model performance was evaluated with the Area Under the ROC Curve (AUC), which ranges from 0 to 1. A score of 1 means the model's predictions are perfect, whereas a score of 0 means all the predictions were wrong.
  • To validate that the model's performance was not just the result of a lucky random split, 3-fold cross-validation was used on the training data. This fits the model 3 times: the data is split into 3 parts, and for each fold the model is trained on ⅔ of the data while the remaining ⅓ is held out. The average AUC over these folds was 0.9619, and the model then scored an AUC of 0.9872 on the testing data.
  • To optimize the model, a range of hyperparameters (parameters that are set before training and affect how well a model trains) was explored by fitting the model with each combination of values and seeing which led to the best performance, a process known as grid search (sketched after this list).
    • The grid included regParam, a regularization hyperparameter that helps prevent a model from overfitting (performing well on the training data but not on new, unseen data). Six values were used to specify the search range: 0.0, 0.2, 0.4, 0.6, 0.8, and 1.0.
    • Additionally, two elasticNetParam values of 0 (ridge regression) and 1 (lasso regression) were included; these also discourage the model from learning overly complex, overfit solutions, resulting in 12 different hyperparameter combinations to test.
    • Those 12 combinations, with 3 folds each, resulted in 36 total model fits. The performance (AUC) was measured for each combination, and the combination with the best performance was selected.
  • The best model had an AUC of 0.9921, which is almost perfect.
  • This best model was tested on the testing set, resulting in an AUC of 0.9820.
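A sketch of this tuning setup with PySpark's ParamGridBuilder and CrossValidator, reusing the lr stage, pipeline, and train/test split from the earlier sketch; the parameter values mirror those described above.

from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

evaluator = BinaryClassificationEvaluator(metricName="areaUnderROC")

# 6 regParam values x 2 elasticNetParam values = 12 combinations
grid = (ParamGridBuilder()
        .addGrid(lr.regParam, [0.0, 0.2, 0.4, 0.6, 0.8, 1.0])
        .addGrid(lr.elasticNetParam, [0.0, 1.0])
        .build())

# 12 combinations x 3 folds = 36 model fits on the training data
cv = CrossValidator(estimator=pipeline, estimatorParamMaps=grid,
                    evaluator=evaluator, numFolds=3)
cv_model = cv.fit(train)

# Evaluate the best model on the 30% hold-out set
best_model = cv_model.bestModel
test_auc = evaluator.evaluate(best_model.transform(test))
print("Test AUC:", test_auc)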

Next Steps

In the future, I would like to dig deeper into the counties, cities, and ZIP codes within these states to learn more about crime rates and which features help predict them. Since there would be far more data to collect and process, an engine for large-scale data processing like Spark and a cloud computing service like AWS could be used as intended. In addition, I would like to use different software to create data visualizations. A business intelligence tool such as Tableau can connect to multiple data sources, like a database and a CSV file, at the same time and blend them to build visualizations. I could build the same visualizations, and more, with its simple drag-and-drop functionality. Assembling all the visualizations into a dashboard and presenting these insights would be an effective way to tell the story to the viewer.

Updates - 4/22/2023

I collected more data through the Data Commons Python API on the counts of different types of crimes (aggravated assault, robbery, larceny-theft, etc.). I then created a Tableau dashboard to further analyze these crimes and the factors behind why they occurred, drilling down on specific crimes, calculating new KPIs, grouping states by region, and more, all in interactive visualizations.

Check out the Tableau dashboard here!
