Philadelphia Urban Crime Analytics

*Created using Python and the Bokeh data visualization library. Each dot represents one crime incident.


There has been an ongoing discussion in academia regarding the association between school location, time, and crime frequency.

A common theory is that certain places, such as schools, are so-called “hot spots” that are more prone to crime.

Research shows that youth, specifically older youth, are more likely to commit crimes compared to other demographics and that schools provide an environment where these youth can aggregate with limited guardianship. Furthermore, past studies have shown that city blocks in San Diego, California with public high schools are more susceptible to higher crime rates than other blocks that are not near high schools. Even though there has been progress in the study of this area, there is still not substantial evidence showing that school schedules have caused the association between school locations and crime.

My research aims to further the analysis of this subject through the lens of Philadelphia schools. This project uses datasets from OpenDataPhilly, a publicly available catalog of data in the Philadelphia region. The first dataset contains all crime incidents from 2006 to 2017 and various attributes such as their locations, crime types, and time of occurrence. The second dataset contains all of the schools in the Philadelphia region. And the third dataset contains information on each school’s internal crime incidents.

By using this data, I was able to

  1. Depict general trends of crimes

  2. Analyze how changes in crime frequency relate to the hour of the day

  3. See how school types affect these changes

  4. Visualize these associations

This research report details the methodology for data cleaning, data analysis, and visualization, and presents the results of the analyses.


As mentioned before, this report uses three datasets from OpenDataPhilly. The initial dataset on crime incidents already had fields for crime type, longitude and latitude, and time of occurrence, all of which we would use later on. However, we still needed to add several attributes describing time and nearby schools. In order to analyze changes over the time of day, we created an “hour” field by extracting the hour from the dispatch date and time. Since this analysis only applies to crimes that occurred during school days, we also added the “weekday” and “month” fields so that we could filter out all of the crimes that occurred during the weekend and the summer.
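The time fields described above can be sketched with the Python standard library. This is a minimal illustration, not the project's actual code: the column name `dispatch_date_time`, the timestamp format, and the choice of July as the only excluded summer month are all assumptions.

```python
from datetime import datetime

# Assumed school-year months: August through June, i.e. every month except July.
SCHOOL_MONTHS = {8, 9, 10, 11, 12, 1, 2, 3, 4, 5, 6}


def add_time_fields(record, time_key="dispatch_date_time"):
    """Return a copy of the record with hour, weekday, and month fields added."""
    # The timestamp format is an assumption about how the data is stored.
    stamp = datetime.strptime(record[time_key], "%Y-%m-%d %H:%M:%S")
    enriched = dict(record)
    enriched["hour"] = stamp.hour
    enriched["weekday"] = stamp.weekday()  # Monday is 0, Sunday is 6
    enriched["month"] = stamp.month
    return enriched


def is_school_day(record):
    """True for Monday-to-Friday incidents in the assumed school-year months."""
    return record["weekday"] < 5 and record["month"] in SCHOOL_MONTHS


# Made-up example rows, for illustration only.
sample = [
    {"dispatch_date_time": "2017-03-15 14:30:00"},  # a Wednesday in March
    {"dispatch_date_time": "2017-03-18 14:30:00"},  # a Saturday in March
    {"dispatch_date_time": "2017-07-12 09:00:00"},  # a Wednesday in July
]
enriched = [add_time_fields(r) for r in sample]
school_day_crimes = [r for r in enriched if is_school_day(r)]
```

Only the first example row survives the filter, since the other two fall on a weekend and in the summer.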

In order to distinguish between crimes that were close to and far away from schools, we iterated through all of the schools for each crime incident and calculated the distance between their longitude and latitude coordinates using the haversine function, which gives the great-circle distance between two points on a sphere. The school closest to the crime incident and its distance from the crime were recorded on the aggregate dataset. Based on its distance from the closest school, each crime was classified into one of five categories: less than 250 feet, between 250 and 500 feet, between 500 and 1000 feet, between 1000 and 2000 feet, and other. From these attributes, the fields “distance_from_school” and “closest_school” were created. In addition, we also wanted to classify the schools into various categories. Using the third dataset, we aggregated each school’s internal crime incidents and categorized the schools as high-risk, medium-risk, or low-risk. This field was then added to the final dataset so that we could tell what type of school each crime incident happened near.
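The nearest-school step can be sketched as follows. This is an illustration under stated assumptions rather than the code used for the project: the function names, the dictionary keys `name`, `lat`, and `lon`, the handling of bucket boundaries, and the example schools are all hypothetical.

```python
import math

# Mean Earth radius (about 6,371 km) converted from meters to feet.
EARTH_RADIUS_FEET = 6_371_000 / 0.3048


def haversine_feet(lat1, lon1, lat2, lon2):
    """Great-circle distance in feet between two latitude/longitude points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_FEET * math.asin(math.sqrt(a))


def closest_school(crime, schools):
    """Return (school name, distance in feet) for the school nearest the crime."""
    best = min(
        schools,
        key=lambda s: haversine_feet(crime["lat"], crime["lon"], s["lat"], s["lon"]),
    )
    distance = haversine_feet(crime["lat"], crime["lon"], best["lat"], best["lon"])
    return best["name"], distance


def distance_bucket(feet):
    """Map a distance to one of the five categories (boundary handling assumed)."""
    if feet < 250:
        return "less than 250 ft"
    if feet < 500:
        return "250 to 500 ft"
    if feet < 1000:
        return "500 to 1000 ft"
    if feet < 2000:
        return "1000 to 2000 ft"
    return "other"


# Hypothetical schools and crime location, for illustration only.
schools = [
    {"name": "School A", "lat": 39.9536, "lon": -75.1652},
    {"name": "School B", "lat": 39.9626, "lon": -75.1652},
]
crime = {"lat": 39.9526, "lon": -75.1652}
name, feet = closest_school(crime, schools)
bucket = distance_bucket(feet)
```

Checking every school for every crime is simple but slow on the full dataset; a spatial index would be a natural optimization, though the report does not say whether one was used.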


We first looked at the general trends in the crime incidents dataset. We created a time series plot of all of the crimes from 2006 to 2008 and saw a substantial decrease in crimes during this time span. Based on the time series, there also appeared to be a cyclical trend in crimes: crime frequencies would peak during the summer and dip during the winter. After plotting the coordinates of all theft crimes in 2006 and 2017, this general decrease was even more apparent. In the graphs below the time series, each yellow dot represents one theft crime incident. There were noticeably fewer crime incidents in the second graph, especially near the borders of Philadelphia.

*Created using Python and the Bokeh Data Visualization Library

*Created using Python and the Bokeh Data Visualization Library
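The counts behind a time series like the one described above can be tallied per month before plotting. The sketch below assumes timestamps stored as `YYYY-MM-DD HH:MM:SS` strings, and the function name and example values are hypothetical.

```python
from collections import Counter
from datetime import datetime


def monthly_counts(timestamps):
    """Count incidents per (year, month), sorted chronologically."""
    counts = Counter()
    for text in timestamps:
        stamp = datetime.strptime(text, "%Y-%m-%d %H:%M:%S")
        counts[(stamp.year, stamp.month)] += 1
    return dict(sorted(counts.items()))


# Made-up timestamps, for illustration only.
example = [
    "2006-07-04 12:00:00",
    "2006-07-19 23:15:00",
    "2006-12-25 08:30:00",
]
series = monthly_counts(example)
```

Plotting these monthly totals in order is what would make a seasonal pattern, such as summer peaks and winter dips, visible.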


We took a random sample of 100,000 crime incidents from the dataset. Using this sample, we visualized the frequency of all violent and nonviolent crimes over the course of a day. The y-axis represents the number of crimes that occurred during a given hour. Since we were looking at the association between schools and crime frequencies, we limited the crimes to ones that occurred during school days (weekdays between August and June).

Nonviolent crimes include general vandalism/criminal mischief, burglaries, and thefts. Violent crimes include homicide, aggravated assault, sex offenses, and robberies. These are all categorizations under the “text_general_code” field in the dataset.
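A rough sketch of the sampling, categorization, and hourly counting described above follows. The keyword lists are an assumption about how the “text_general_code” labels are worded, not the dataset's exact values, and the function names and example records are hypothetical.

```python
import random
from collections import Counter

# Assumed keywords; the real "text_general_code" labels may be worded differently.
VIOLENT_KEYWORDS = ("homicide", "aggravated assault", "sex offense", "robbery")
NONVIOLENT_KEYWORDS = ("vandalism", "criminal mischief", "burglary", "theft")


def draw_sample(records, n, seed=0):
    """Take a reproducible random sample of at most n records."""
    return random.Random(seed).sample(records, k=min(n, len(records)))


def crime_category(text_general_code):
    """Label a crime as violent, nonviolent, or other by keyword match."""
    label = text_general_code.lower()
    if any(keyword in label for keyword in VIOLENT_KEYWORDS):
        return "violent"
    if any(keyword in label for keyword in NONVIOLENT_KEYWORDS):
        return "nonviolent"
    return "other"


def hourly_counts(records):
    """Count violent and nonviolent crimes for each hour of the day."""
    counts = {"violent": Counter(), "nonviolent": Counter()}
    for record in records:
        category = crime_category(record["text_general_code"])
        if category in counts:  # crimes labeled "other" are skipped
            counts[category][record["hour"]] += 1
    return counts


# Made-up records, for illustration only.
records = [
    {"text_general_code": "Thefts", "hour": 15},
    {"text_general_code": "Robbery", "hour": 15},
    {"text_general_code": "Burglary", "hour": 15},
    {"text_general_code": "Fraud", "hour": 9},
]
by_hour = hourly_counts(draw_sample(records, 100_000))
```

Each counter maps an hour (0 to 23) to a crime count, which corresponds to the y-axis values in the hourly graphs.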


Based on these graphs, it can be seen that there are significantly more nonviolent crimes than violent crimes. Overall, the patterns appear relatively similar. Both crime frequencies dip at 2pm and rise significantly from 2pm to 4pm. However, for nonviolent crimes, there was a noticeable increase in crime frequencies when going from less than 250 feet to 500 feet, from 500 to 1000 feet, and slightly from 1000 to 2000 feet. At distances beyond that, the increase from 2pm to 4pm remains relatively similar. For violent crimes, there is a decrease in crime frequency at 250 feet and an increase at 500 feet. This increase grows even larger from 500 to 1000 feet. There is also more variation at the shorter distances for violent crimes compared to nonviolent crimes. Lastly, the crime frequencies reach their lowest point at roughly 5am and 6am.


There are much greater relative increases in crime frequencies from 2pm to 4pm, especially for high-risk schools. Across all risk types, there is a general decrease in crimes at 2pm and increase from 3pm to 4pm. For nonviolent crimes, there is an even distribution across all risk types. Nonviolent crimes near low-risk and high-risk schools both peak at roughly 900 crimes. However, there is a noticeable increase as we go from low-risk to high-risk schools for violent crimes. Furthermore, violent crimes near high-risk schools and nonviolent crimes near low-risk schools saw the sharpest spike in crimes from 2pm to 4pm. And all graphs show a sharp upward trend in crimes during the morning from 6am to 9am.

Nonviolent and Violent Crimes Across Risk Types

After seeing the general trend of crime frequencies peaking from 2pm to 4pm, we targeted specific variables and created visualizations to see if there were any noticeable differences between low-risk, medium-risk, and high-risk schools, and between nonviolent and violent crimes. From the previous visualizations, it seemed that the most activity occurred for crimes between 1000 and 2000 feet from schools. Thus, we wanted not only to compare the risk types with one another but also to look at the percentage changes in crimes in this 2pm to 4pm time span.
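The hour-over-hour percentage change mentioned above can be computed as in this sketch. The function name and the example counts are hypothetical and are not results from the study.

```python
def percent_changes(hourly_counts, hours):
    """Percentage change in crime count between each pair of consecutive hours."""
    changes = {}
    for previous, current in zip(hours, hours[1:]):
        base = hourly_counts[previous]
        if base == 0:
            changes[(previous, current)] = None  # undefined when the base is zero
        else:
            changes[(previous, current)] = (hourly_counts[current] - base) / base * 100
    return changes


# Made-up counts for 2pm, 3pm, and 4pm (24-hour clock), for illustration only.
example_counts = {14: 40, 15: 50, 16: 75}
changes = percent_changes(example_counts, [14, 15, 16])
# changes == {(14, 15): 25.0, (15, 16): 50.0}
```

Because the change is relative to the earlier hour's count, a small base count can produce a large percentage swing from only a few additional incidents.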


Once again, this confirms that there was not a substantial difference in crime frequencies across the risk types for nonviolent crimes. In addition, the % Changes in Nonviolent Crimes graph showed that crimes near low-risk schools grew the most compared to those near medium-risk and high-risk schools.


Compared to the nonviolent crime frequencies, the violent crime frequencies were substantially higher near high risk schools than low risk schools; at every hour from 12pm to 6pm, there was a higher number of crimes near high-risk schools than those near low-risk schools. They all also generally increased from 2pm to 4pm, which can also be seen in the % Changes in Violent Crimes graph. Through this graph, we can also see that the largest increase in crimes was from 3pm to 4pm for violent crimes near high-risk schools.



After conducting these analyses, we arrived at a few key takeaways. First, there was a general decrease in crimes from 2006 to 2008. For both violent and nonviolent crimes, crime frequencies peaked from 2pm to 4pm and were lowest at 5am and 6am. A large number of the crimes that contributed to these peaks were located near schools (under 2000 feet). After segmenting by school type, we saw little to no variation in the number of nonviolent crimes between low-risk and high-risk schools. However, we did see a substantial increase in violent crimes when going from low-risk to high-risk schools. For violent crimes near high-risk schools (under 2000 feet), there was a sharp percentage increase in crimes from 2pm to 4pm.

Moving Forward

Moving forward, it seems worthwhile to further investigate the 2pm to 4pm increase for both nonviolent and violent crimes, paying special attention to violent crimes near high-risk schools. Furthermore, it would also be valuable to compare the crime densities of the areas directly surrounding schools with the average crime density of the zip codes the schools are in. This would show whether or not the spike in crime is above average relative to the crime density of the surrounding area. While this was attempted, it was difficult to determine the specific square footage needed to calculate the densities, given that many schools are within 2000 feet of each other. However, there is a lot of potential in analyzing crime densities, especially since high percentage growth rates can be driven by small underlying counts.