Predicting New York City 311 Service Requests

IBM DS0720EN DATA SCIENCE AND MACHINE LEARNING CAPSTONE PROJECT

Mar 7, 2021

ABSTRACT

This report analyzes the publicly available New York City Department of Housing Preservation and Development datasets of non-emergency complaints to help the agency manage the high volume of complaints with higher operational efficiency. We conclude that Heating, together with the combined Heat/Hot Water category, is the complaint type the city needs to focus on first, with the Bronx being the most affected borough. We recommend the KNN classification algorithm, which predicts future complaints from building characteristics with a 98% f-1 score.

INTRODUCTION

New York City residents use the NYC311 portal to report problems and raise non-emergency service requests with local authorities. Different agencies handle the requests, depending on the issue. This case study focuses on the Department of Housing Preservation and Development of New York City, which deals with complaints about buildings and houses. Following a substantial increase in the number of complaints, the agency needs answers to the following questions to help manage the volume and improve operational efficiency:

1. Which type of complaint should the Department of Housing Preservation and Development of New York City focus on first?

2. Should the Department of Housing Preservation and Development of New York City focus on any particular set of boroughs, ZIP codes, or streets (where the complaints are severe) for the specific type of complaints you identified in response to Question 1?

3. Does the Complaint Type you identified in response to question 1 have an apparent relationship with any particular characteristic or characteristics of the houses or buildings?

4. Can a predictive model be built for a future prediction of the possibility of complaints of the type you have identified in response to question 1?

Note: the course instructors defined the background and the problem statement [1].

ANALYTIC APPROACH AND DATA REQUIREMENTS

We will require a range of descriptive and predictive analytical approaches to meet the business requirement. Some of the questions necessitate understanding the current situation in terms of complaint types, geographical areas affected, and patterns between housing characteristics and the possibility of a defect. Predicting future complaints, in contrast, requires a predictive analytics approach, including machine learning techniques. This section outlines the approach as well as the data requirements for each of the chosen strategies.

Identify top complaint type:

A descriptive analytics approach will uncover the top complaint type. The method can take the form of data frame sorting and an aggregation report, possibly supported by visualization. We will need, as a minimum, data on complaints grouped by complaint type. Data on the date the agency received the complaint, the status of the complaint (open or closed), and the severity of the case would help determine the urgency of the remedy and help set priorities.
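The sorting and aggregation step can be sketched with pandas as follows. This is a minimal illustration on a hand-made table; the column names `complaint_type` and `status` are assumptions and may not match the names in the actual 311 extract.

```python
import pandas as pd

# Hypothetical miniature of the 311 extract; real column names may differ.
complaints = pd.DataFrame({
    "complaint_type": ["HEAT/HOT WATER", "HEATING", "PLUMBING",
                       "HEAT/HOT WATER", "HEATING", "HEAT/HOT WATER"],
    "status": ["Closed", "Closed", "Open", "Closed", "Closed", "Open"],
})

# Count requests per complaint type, largest first.
type_counts = complaints["complaint_type"].value_counts()

# Cross-tabulate type against status to see how many cases of each type are open.
status_by_type = pd.crosstab(complaints["complaint_type"], complaints["status"])

top_type = type_counts.index[0]
print(type_counts)
print(status_by_type)
```

The cross-tabulation is included because the status of a complaint is one of the fields that can help set priorities.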

Identify Areas Most Affected by the Top Complaint Type

This question also requires an understanding of the current state. Thus, we will use a descriptive analytics approach. To determine which boroughs and ZIP codes to focus on for the complaint type, we need geographical data on the ZIP codes.

Relationship between Housing Characteristics and Complaints

To answer the question, we will use a descriptive analytics approach including visualization and basic statistical metrics such as the correlation between numerical variables, or ANOVA for categorical variables (if available in the dataset). The data required includes building characteristics with an address that we can match against the complaints raised. Additionally, records in the complaints database need a way to link each complaint to an address.

Predictive Model for the Top Complaint Type

We will use supervised classification algorithms such as KNN, SVM, Logistic Regression, and Decision Trees to predict whether a building would have a defect or not. We will compare performance across the algorithms and recommend the best approach using consistent metrics. The algorithms require numerical inputs and a label. For the best comparison across the models, the target will be set to a binary True/False.

DATA

Datasets

We will use two datasets from the New York City Department of Housing Preservation and Development for our analysis and predictions.

311 complaint dataset

The dataset contains records of 311 requests in New York City from January 1, 2010, onward. It has 41 columns describing the complaint type, location, and status, amongst many others.

The full dataset contains 25 million rows, with each row being one 311 service request [2].

The name, description, and data type of each column are available on the Department of Housing Preservation and Development of New York City website [2].

The database is live and is updated daily. Therefore, to compare results consistently, IBM provided a subset of the dataset covering the period between January 1, 2010, until February 2020, with data only related to the Department of Housing Preservation and Development.

We will use this data to identify the type of service request that the New York City agencies need to prioritize and the areas affected by the complaint type.

PLUTO dataset for housing

The other dataset provided by the course instructors was the Primary Land Use Tax Lot Output (PLUTO™) data file the New York City Department of City Planning has developed.

We will use the PLUTO™ DataFrame to discover any relationship between complaints and building characteristics. The full dataset, as well as descriptions of the data, can be found on the New York City government website [3].

Data Ingestion

First, we imported the “311” dataset into a pandas DataFrame. Next, to continue exploring the area most affected by the selected complaint type and to add building-specific features to our DataFrame, we imported the Bronx PLUTO dataset into a separate DataFrame. The dataset included information on various area and volume measurements, the number of floors, build and repair date history, ZIP code, and coordinates.

Merge Datasets

Now that we have both the complaints dataset for the top complaint type and selected geography and the Bronx PLUTO dataset that gives the required features, we can combine the two into one table with 86,324 records.

We will use a “Right join” to merge the “311” and the Bronx PLUTO datasets. A “Right join” allows us to preserve the records of houses that are not in the complaint dataset.
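A right join of this kind can be sketched with `DataFrame.merge`. The join key `address` and the toy rows below are assumptions for illustration; the article does not state which key was used.

```python
import pandas as pd

# Hypothetical join key "address"; the key used in the project is not stated.
complaints = pd.DataFrame({
    "address": ["1 MAIN ST", "1 MAIN ST", "2 OAK AVE"],
    "unique_key": [101, 102, 103],
})
pluto = pd.DataFrame({
    "address": ["1 MAIN ST", "2 OAK AVE", "3 ELM RD"],
    "NumFloors": [6, 5, 2],
})

# how="right" keeps every PLUTO building, including those with no complaint;
# indicator=True adds a "_merge" column showing where each row came from.
merged = complaints.merge(pluto, on="address", how="right", indicator=True)

# Buildings that appear only in PLUTO, i.e. with no complaint on record.
no_complaint = merged[merged["_merge"] == "right_only"]
print(merged)
```

In this toy example the building at 3 ELM RD survives the merge with an empty `unique_key`, which is the behavior the right join is chosen for.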

Data Wrangling

We first wrangled each DataFrame individually. As part of the process, we dealt with missing data by imputing, dropping rows where it was critical to have accurate data to connect the datasets, or dropping the full columns where the data was not useful or had only one option.

We brought column values to the correct format and dropped duplicate entries.

New Features

“days_open” to evaluate the number of days the complaint was outstanding.
Bins followed by indicator variables for the tax lot column, after going back to the PLUTO dataset source description and learning that the integer numbers represent different categories.
Features that convert year built and year altered into a DateTime format. We then use these newly created features to derive the number of years since the building was built and altered as possible predictors of failure.
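The derived features listed above could look roughly like this. The column names and the fixed reference date are assumptions, and the sketch subtracts integer years directly rather than going through a DateTime conversion.

```python
import pandas as pd

# Hypothetical column names; a fixed reference date keeps the sketch reproducible.
df = pd.DataFrame({
    "created_date": ["2019-01-05", "2019-12-20"],
    "closed_date": ["2019-01-08", "2020-01-02"],
    "YearBuilt": [1925, 1960],
    "YearAlter1": [1990, 2005],
})
reference = pd.Timestamp("2020-02-01")

created = pd.to_datetime(df["created_date"])
closed = pd.to_datetime(df["closed_date"])

# Number of days the complaint stayed open.
df["days_open"] = (closed - created).dt.days

# Years elapsed since construction and since the last alteration.
df["Years_Since_Built"] = reference.year - df["YearBuilt"]
df["Years_Since_Alt1"] = reference.year - df["YearAlter1"]
print(df[["days_open", "Years_Since_Built", "Years_Since_Alt1"]])
```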

Wrangle Joined DataFrame

As before, we replace any missing values. This time, the records with missing complaint fields (5,356) correspond to buildings where New Yorkers did not complain, so we leave them in. The other 80,968 records correspond to complaint logs. A total of 75,665 rows had missing features related to complaints. We also replace the missing “Status” or “Unique Key” values with “no_complaints” to reflect houses where the residents did not raise heating and hot water complaints.

The original columns “city” and “complaint type” were dropped as they had only one unique value. “days_open” was converted to np.timedelta64 for further processing.

The subsequent analysis will focus on complaints in the Bronx and the top complaint type. As a result, we have narrowed down the records from 2,130,400 to 408,970.

For records where a “Unique Key” associated with a complaint was found, we set the supervised label to 1, otherwise 0. This binary logic corresponds to the building having a defect or not. In total, 80,757 rows had the status listed as closed, 5,356 had no complaints, and 211 were open. Next, we created indicator variables for the “Status” column.
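A minimal sketch of the labelling and indicator-variable steps, on a three-row stand-in for the merged table. The column names follow the ones quoted in the text; the rows are made up.

```python
import pandas as pd

# Toy stand-in for the merged table; the last row is a building with no complaint.
merged = pd.DataFrame({
    "Unique Key": [101.0, 102.0, None],
    "Status": ["Closed", "Open", None],
})

# Buildings without a matching complaint get an explicit placeholder status.
merged["Status"] = merged["Status"].fillna("no_complaints")

# The label is 1 when a complaint key exists, otherwise 0.
merged["supervised_label"] = merged["Unique Key"].notna().astype(int)

# One indicator column per status value.
status_dummies = pd.get_dummies(merged["Status"], prefix="Status")
merged = pd.concat([merged, status_dummies], axis=1)
print(merged)
```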

Train / Test split

The final data frame is then normalized to have zero mean and unit variance. Finally, we split the data into training (80%) and validation (20%) sets.
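The normalization and the 80/20 split can be sketched with scikit-learn. The data here is synthetic, and the `random_state` value is an arbitrary assumption.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=2.0, size=(100, 4))  # synthetic stand-in features
y = rng.integers(0, 2, size=100)                   # synthetic binary labels

# Scale every feature to zero mean and unit variance, as the text describes.
X_scaled = StandardScaler().fit_transform(X)

# Hold out 20% of the rows for validation.
X_train, X_val, y_train, y_val = train_test_split(
    X_scaled, y, test_size=0.2, random_state=4
)
print(X_train.shape, X_val.shape)
```

A common variation is to fit the scaler on the training rows only and then apply it to the validation rows, so that no information from the validation set influences the scaling.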

RESULTS

PRELIMINARY INSIGHTS

Top Complaint Type

As shown in the figure Top Complaints by Type, using 800,000 as a threshold, the complaint types recommended for the Department of Housing Preservation and Development of New York City to address first are “Heating” (887,850) and “Heat/Hot Water” (1,261,574).

AREAS MOST AFFECTED BY HEAT/HOT WATER COMPLAINT

Borough

Brooklyn had the overall highest number of complaints of all the Boroughs, so if the total number of complaints was to drive the decision, the municipality could prioritize Brooklyn.

BROOKLYN 1,731,202
BRONX 1,609,837
MANHATTAN 1,049,360
QUEENS 641,741
STATEN ISLAND 87,187

Another way could be to adopt a score based on open cases and adjust it per household or building count to see whether a borough is experiencing a higher-than-average number of pending complaints.

However, the business requirement set by the Department of Housing Preservation and Development was to focus on the top complaint type, i.e., Heat and Hot Water.

Heat and Hot Water Complaints by Borough

Complaints by ZipCode

If we look at the ZipCodes, the most heat and hot water problems occurred in ZIP code 11226 (Brooklyn), with 41,786 complaints registered. If we want to target service level improvement more precisely, the top 10 ZIP codes to focus on are shown in Fig. 7 Most affected ZipCodes.

Most Affected Address

89–21 Elmhurst Avenue has the highest number of complaints submitted. Please note, you will get a different address list if you wrangle or impute missing values differently.

Fig. 5.4 Heat/Hot Water Defects by Address

Open Cases

A total of 1,249,817 complaints were closed, while 4,640 were open at the time of dataset compilation.

Mapping Open Top Complaints

EXPLORATORY DATA ANALYSIS

Buildings Age

Most buildings are over 80 years old, as can be seen from the age distribution histogram.

Correlation Heat Map

Few features showed a strong correlation. Consequently, we will use a threshold of 0.3 to retain sufficient information to make predictions.

Using this threshold, we narrowed down the dataset to the following features: ‘BuiltFAR’, ‘FacilFAR’, ‘NumFloors’, ‘ResidFAR’, ‘supervised_label’, ‘Status_Closed’, ‘XCoord’. The figure Correlation Shortlisted presents the correlation between the shortlisted features.

Fig. 5.8 Correlation Shortlisted Heat Map
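Shortlisting features by a correlation threshold can be sketched as below. The data is synthetic: `NumFloors` is constructed to correlate with the label and `noise` is constructed not to, so the numbers do not reflect the real dataset.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 500
label = rng.integers(0, 2, size=n)
df = pd.DataFrame({
    "supervised_label": label,
    "NumFloors": label * 3 + rng.normal(size=n),  # built to correlate with the label
    "noise": rng.normal(size=n),                  # built to be unrelated
})

# Absolute Pearson correlation of every column with the label.
corr_with_label = df.corr()["supervised_label"].abs()

# Keep columns at or above the 0.3 threshold (the label itself correlates at 1.0).
shortlisted = corr_with_label[corr_with_label >= 0.3].index.tolist()
print(shortlisted)
```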

Continuous Numerical Variables

The shortlisted variables all visually showed signs of a direct positive relationship with the predicted value apart from XCoord.

Built Floor Area Ratio

Buildings with reported complaints tended to have a higher built floor area ratio.

Maximum Allowable Community Facility FAR

The histogram helps us observe that buildings with no heating and hot water complaints tend to have a lower and more evenly distributed maximum allowable community facility floor area ratio. The ratio was higher among buildings that had the selected defect.

Number of Floors

The number of floors was lower for buildings with no defects.

Maximum Allowable Residential Floor Area Ratio (ResidFAR)

Buildings with complaints tended to have a higher Maximum Allowable Residential Floor Area Ratio.

Fig. 5.12 ResidFAR for buildings with or without complaints

Years since built

In general, buildings tended to be old. As one would intuitively expect, buildings with defects tend to be older (94 years vs. 84) and have had more years since repair (29 vs. 22).

Years since altered

Buildings with defects tend to have a higher proportion of units with alterations done 25 to 50 years ago.

Continuous variables summary:

On average, buildings that have experienced the defect had more floors (5.8 vs. 2.5), a higher facility to floor ratio (4.9 vs. 3.3), and a higher Maximum Allowable Community Facility FAR (3.53 vs. 1.91).

ZipCodes

We can see that the average building age and the number of complaints vary between ZipCodes.

Fig. 5.15 Relationship between zip code, property age, and defects

PEARSON CORRELATION AND CAUSATION

The P-value for ‘BuiltFAR,’ ‘FacilFAR,’ ‘NumFloors,’ ‘ResidFAR’ was <0.001. We say there is strong evidence that the correlation is significant.

What about the other, weakly correlated variables? ‘Years since built,’ ‘Years since altered,’ ‘X-Coord,’ ‘Y-Coord’ are statistically significant as well.
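The Pearson coefficient and its p-value can be obtained with `scipy.stats.pearsonr`. The sketch below uses synthetic floor counts, loosely centered on the averages reported above, so the resulting numbers are illustrative only.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
label = rng.integers(0, 2, size=1000)

# Synthetic floor counts: complaint buildings are shifted upward on average.
num_floors = 2.5 + 3.3 * label + rng.normal(scale=2.0, size=1000)

# r is the correlation coefficient; p_value tests the null hypothesis r = 0.
r, p_value = stats.pearsonr(num_floors, label)
print(f"r = {r:.3f}, p = {p_value:.3g}")
```

A small p-value only says the correlation is unlikely to be zero; it says nothing about the direction of causation.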

ANOVA: ANALYSIS OF VARIANCE

Build date vs. Heat and Hot Water

The difference between buildings with defects and buildings without defects has a large ANOVA F-value and a low p-value, meaning the difference is statistically significant.

Alteration date vs. Heat and Hot Water

An infinitely large F-value and a zero p-value indicate that the relationship between years altered and defects is significant.

Lot Binned vs. Service Request

The difference between lots is statistically significant.
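A one-way ANOVA of this kind can be run with `scipy.stats.f_oneway`. The two groups below are synthetic building ages, loosely echoing the 94 vs. 84 year averages quoted earlier; the spread and sample sizes are assumptions.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)

# Synthetic ages for buildings with and without a complaint.
age_with_complaint = rng.normal(loc=94, scale=15, size=400)
age_without = rng.normal(loc=84, scale=15, size=400)

# A large F-value with a small p-value indicates the group means differ.
f_value, p_value = stats.f_oneway(age_with_complaint, age_without)
print(f"F = {f_value:.1f}, p = {p_value:.3g}")
```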

TIMELINE

The complaints raised (in blue) seem to follow a seasonal pattern peaking in the winter season with a relatively quick fix. Predicting buildings likely to experience heat and hot water failure could help level the resources by doing preventive maintenance in the off-peak summer season. Doing so would allow the agency to deploy the same resources more rapidly in the winter months.

Important Variables and Final Dataframe

We now better understand our data and which features we will use to predict the probability of the heat and hot water defect occurring in a building. We have narrowed the list to the following variables:

Continuous numerical variables:

BuiltFAR
FacilFAR
NumFloors
ResidFAR

Numerical variables with a correlation coefficient below 0.2:

Years_Since_Built
Years_Since_Alt1

Final DataFrame

Final DataFrame Normalization

The final DataFrame is then normalized to have zero mean and unit variance.

Training and Validation Set Split

Finally, we split the data into training (80%) and validation (20%) sets.

METHODOLOGY

Now that our dataset is prepared, we will use the following classification models to predict heating and heat/hot water complaints:

K Nearest Neighbor (KNN)
Decision Tree
Support Vector Machine
Logistic Regression

First, we select the best parameters for each individual model, and then we compare the performance of the models on a validation set.

K NEAREST NEIGHBOR (KNN)

Having tried different values of K, we found that K=2 results in the highest prediction accuracy, at 98%.
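Trying different values of K can be sketched as a simple loop over `KNeighborsClassifier`. The data is generated with `make_classification`, so the best K and the accuracy will differ from the figures reported here; the range of K values is an assumption.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the prepared building features.
X, y = make_classification(n_samples=600, n_features=6, n_informative=4, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=4)

# Validation accuracy for each candidate K.
accuracy_by_k = {}
for k in range(1, 11):
    model = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    accuracy_by_k[k] = model.score(X_val, y_val)

best_k = max(accuracy_by_k, key=accuracy_by_k.get)
print(best_k, accuracy_by_k[best_k])
```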

DECISION TREE

For the decision tree, we tried to determine the minimum value of the parameter max_depth beyond which results stop improving. At max_depth = 22, with a mean accuracy of 0.9833, the accuracy improvement starts to plateau.
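One way to locate such a plateau is to take the smallest depth whose validation accuracy comes within a small tolerance of the best accuracy seen. The tolerance value and the synthetic data below are assumptions, not the procedure used to arrive at max_depth = 22.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the prepared building features.
X, y = make_classification(n_samples=800, n_features=6, n_informative=4, random_state=1)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=4)

# Validation accuracy for each candidate depth.
depths = list(range(1, 26))
scores = [
    DecisionTreeClassifier(max_depth=d, random_state=0)
    .fit(X_train, y_train)
    .score(X_val, y_val)
    for d in depths
]

# Smallest depth whose accuracy is within `tolerance` of the best one seen.
tolerance = 0.005  # hypothetical cut-off for "plateau"
best = max(scores)
plateau_depth = next(d for d, s in zip(depths, scores) if s >= best - tolerance)
print(plateau_depth, best)
```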

SUPPORT VECTOR MACHINE

For SVM, there is a choice of kernels to apply: linear, poly, RBF, or sigmoid. We tested the metrics using each of the options to select the optimal one, with the following outcomes.

The sigmoid kernel does not perform well at predicting the 0 label (i.e., no defects), as evidenced by its low precision/recall of ~30%. Its Jaccard score is also lower than for the other three kernels.

For further comparison, we will choose the linear kernel, as it has the highest minimum precision/recall, especially for the negative label.

Linear

Poly

RBF

Sigmoid
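The "highest minimum precision/recall" criterion for comparing kernels can be sketched as follows: for each kernel, take the worst precision or recall across both classes, then pick the kernel whose worst value is highest. The imbalanced synthetic data is an assumption, so the chosen kernel here need not be linear.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import precision_recall_fscore_support
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic data with a rare class 0, standing in for "no defect".
X, y = make_classification(n_samples=600, n_features=6, n_informative=4,
                           weights=[0.1, 0.9], random_state=2)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=4, stratify=y
)

worst_score = {}
for kernel in ["linear", "poly", "rbf", "sigmoid"]:
    y_pred = SVC(kernel=kernel).fit(X_train, y_train).predict(X_val)
    precision, recall, _, _ = precision_recall_fscore_support(
        y_val, y_pred, labels=[0, 1], zero_division=0
    )
    # Lowest precision or recall across both classes for this kernel.
    worst_score[kernel] = min(precision.min(), recall.min())

chosen_kernel = max(worst_score, key=worst_score.get)
print(worst_score, chosen_kernel)
```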

LOGISTIC REGRESSION

There was no significant difference in any of the metrics across the four solvers (Newton-CG, LBFGS, Liblinear, SAG). We will choose Newton-CG for subsequent model comparison.

solver NEWTON-CG

solver LBFGS

solver LIBLINEAR

solver SAGA

Model Selection

The dataset is relatively skewed, with only 6.3% of buildings expected not to have a failure. To prioritize any preventive maintenance during the summer period, the model should perform well on the class representing no failure so that the right buildings get the attention.

The model performing best at both “no defect” class prediction (87%) and the overall f-1 score (98%) is KNN. The f-1 score of KNN for the rarest class was 87% vs. 71% achieved by the other algorithms.
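Scoring the rare class separately from the overall result can be done with `f1_score`. The sketch below uses synthetic data with roughly 6% of samples in class 0 to mimic the skew described above; the scores it prints are not the ones reported in this article.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Roughly 6% of samples fall in class 0 ("no defect").
X, y = make_classification(n_samples=2000, n_features=6, n_informative=4,
                           weights=[0.06, 0.94], random_state=3)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=4, stratify=y
)

y_pred = KNeighborsClassifier(n_neighbors=2).fit(X_train, y_train).predict(X_val)

# f-1 score for the rare class only, then the support-weighted score over both classes.
f1_no_defect = f1_score(y_val, y_pred, pos_label=0, zero_division=0)
f1_weighted = f1_score(y_val, y_pred, average="weighted")
print(f1_no_defect, f1_weighted)
```

On skewed data the weighted score is dominated by the majority class, which is why the rare-class score is reported separately.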

Having ruled out the buildings that are not likely to have a heat and hot water complaint using KNN, logistic regression can inform prioritization of the cases based on probabilities.
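Ranking buildings by predicted probability could be sketched with `LogisticRegression.predict_proba`. The data is synthetic and the row indices stand in for buildings; how the ranking would feed into scheduling is left open.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for the buildings that remain after the KNN screen.
X, y = make_classification(n_samples=500, n_features=6, n_informative=4, random_state=5)
model = LogisticRegression(solver="newton-cg").fit(X, y)

# Probability of class 1 (complaint) for each building, sorted highest first.
complaint_probability = model.predict_proba(X)[:, 1]
priority_order = np.argsort(-complaint_probability)
top_five = priority_order[:5]
print(top_five, complaint_probability[top_five])
```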

Discussion

We have found that the complaint type the Department of Housing Preservation and Development should focus on first (using the department criterion of 800,000 cases) is the combination of heating and heat/hot water defects. The Bronx has been the most affected area; using this as the preset criterion, it is the borough the department should focus on first. We have also identified specific ZipCodes, streets, and addresses that have suffered from the most complaints. Understanding the areas, combined with the predictive model, can help guide the efforts to schedule preventive maintenance and look for any root causes not captured in the dataset.

Building features linked to the building’s size and age or last alterations were correlated with the defect and had predictive value. We visualized the marked difference between buildings with and without complaints, and we saw that this difference was statistically significant. The building size-related properties that could help forecast failure were the built floor area ratio, the community/facility floor area ratio, the number of floors, and the maximum allowable residential floor area ratio.

Having compared different models across a range of parameters, we assessed the models’ performance on the rarest class (no defect) and the overall f-1 score. The model performing best at both “no defect” class prediction (87%) and the overall f-1 score (98%) is KNN.

CONCLUSION

Having gone through the descriptive analytics, we have been able to identify the top complaint and the areas most affected by heating and hot water complaints. By applying descriptive analytics and statistical analysis, we found patterns in building characteristics that we can use to predict failures. We used those features to create a range of machine learning models, with the best model achieving an accuracy of 98%. We also found that the defects follow a seasonal pattern with spikes in the winter season. Understanding the seasonal patterns helps the service focus on the right complaint in the right area to understand what building characteristics are associated with the defect.

Using the understanding of the current state and forecasting buildings with likely defects, the Department of Housing Preservation and Development can now level the capacity by pre-emptively maintaining facilities during the summer to reduce the peak months’ workload. Doing so will also improve the time required to resolve the complaints as a result.

FUTURE DIRECTION

The course determined the goals and objectives, so the business understanding was mostly out of scope. One could select a different analytical approach if the business understanding were part of the project. For instance, if the focus were on immediate priorities, selecting open complaints or applying a weighting based on the number of days a complaint has been outstanding would lead one to focus on a different complaint type, i.e., general construction or plumbing. A literature review or interviews with domain experts can determine the cost, weighting, or urgency factors that differentiate between the categories, rather than just looking at the total number. Likewise, when selecting the geographical area, it would also be possible to check the frequency of complaints against the number of households to identify whether specific neighborhoods are affected disproportionately more than others. Absolute complaint volumes don’t tell the full story.

Instead of using a classification approach at the analytical approach stage, we could have used a numerical forecast predicting the number of failures as an alternative. The numerical prediction would help to concentrate on buildings that would likely have the most defects.

At the data collection stage, a data scientist could also explore earlier versions of the PLUTO datasets to estimate the years since the building had any alterations at the time of the defect rather than at present. Since the dataset containing the complaints covers ten years, while the dataset with building characteristics only shows the last alteration, the model does not “know” the number of years since alteration at the time of the complaint if the building had an alteration after the previous heat and hot water complaint. After identifying the most affected geography, one could use the datasets for all boroughs and complaints in future dataset preparation to build machine learning models. Using data from all boroughs to predict service requests should improve prediction accuracy, especially for more complex models.

At the data preparation stage, interviews with domain experts can help discover what this defect entails and the operations behind fixing it. We may find that there is already research on building characteristics available. We can also perform the data understanding stage and analysis before wrangling, as we can lose some data or “impute” data in the process, which can lead to a distorted picture.

A cross-validation set was used in the earlier machine learning course labs (except for the Machine Learning capstone project), and we took a similar approach in this project. In practice, at the modeling stage, the estimate of model generalization would be more accurate if we used a separate test set in addition to the cross-validation set. We could also look at learning curves to determine whether the model was overfitting or underfitting and test different regularization parameters to optimize it. The deployment and feedback stages of the data cycle were out of scope.

REFERENCES

[1] N.A. Tayoun, IBM DS0720EN Data Science and Machine Learning Capstone Project (2021), edx.org

[2] NYC DoITT, 311 Service Requests from 2010 to Present (2021), (nyc.gov)

[3] The Primary Land Use Tax Lot Output (PLUTO™) (2021), The Department of City Planning (nyc.gov)

APPENDIXES

APPENDIX 1: PROBLEM STATEMENT

The people of New York use the 311 system to report complaints about non-emergency problems to local authorities. Various agencies in New York are assigned these problems. The Department of Housing Preservation and Development of New York City is the agency that processes 311 complaints that are related to housing and buildings.

In the last few years, the number of 311 complaints coming to the Department of Housing Preservation and Development has increased significantly. Although these complaints are not necessarily urgent, the large volume of complaints and the sudden increase impact the agency’s overall efficiency.

Therefore, the Department of Housing Preservation and Development has approached your organization to help them manage the large volume of 311 complaints they are receiving every year.

The agency needs answers to several questions. Data and analytics must support the answers to those questions. These are their questions:

Which type of complaint should the Department of Housing Preservation and Development of New York City focus on first?
Should the Department of Housing Preservation and Development of New York City focus on any particular set of boroughs, ZIP codes, or streets (where the complaints are severe) for the specific type of complaints you identified in response to Question 1?
Does the Complaint Type you identified in response to question 1 have an apparent relationship with any particular characteristic or characteristics of the houses or buildings?
Can a predictive model be built for a future prediction of the possibility of complaints of the type you have identified in response to question 1?

Your organization has assigned you as the lead data scientist to provide the answers to these questions. You need to get answers to them in this Capstone Project by following the standard data science and machine learning approach.[1]

APPENDIX 2: OPEN COMPLAINTS

What immediately stands out is an alternative way to set immediate priorities: prioritize themes within open complaints instead of looking at the last ten years.

Fig. A.1 Open Complaint Duration Histogram

A separate study can determine whether this is a function of complaints being closed quickly or if it’s an indication of a backlog. A possible area to look into would be the average time to close out.

As can be seen from the figure Open Complaint Duration Distribution, a number of complaints have been open for several years.

If we look at the open complaints (see Figure Open Complaints by Type), General Construction, Plumbing, and Paint Plaster would be the top categories.

When we assign more weight by aggregating the days open for each category, General Construction, Plumbing, and Paint Plaster are still the top categories for active complaints.
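Weighting open complaints by the days they have been outstanding amounts to summing `days_open` per category. The rows and column names below are made up for illustration.

```python
import pandas as pd

# Hypothetical open complaints with the number of days each has been outstanding.
open_complaints = pd.DataFrame({
    "complaint_type": ["GENERAL CONSTRUCTION", "PLUMBING", "PAINT/PLASTER",
                       "GENERAL CONSTRUCTION", "PLUMBING"],
    "days_open": [900, 400, 120, 650, 30],
})

# Count of open cases and total days outstanding per category, heaviest first.
summary = (
    open_complaints.groupby("complaint_type")["days_open"]
    .agg(open_count="count", total_days_open="sum")
    .sort_values("total_days_open", ascending=False)
)
print(summary)
```

Keeping the plain count next to the weighted total makes it easy to see whether the ranking changes once duration is taken into account.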

The department can launch a project to investigate these overdue open complaints to understand if the problem still exists and whether to leave the complaint active or close it. Future researchers can use a literature review or interviews with domain experts to assign a cost or weight to certain categories. Domain experts may determine that specific open complaint types are not critical, so equal weighting would be misleading.