
Understanding and Predicting the Feature "Reviews per Month" of Airbnbs in New York with Machine Learning

By Ryan Xue

One of the reasons why Machine Learning is so exciting is that we can apply it to almost any corner of our world and use it to predict and understand the things we want to investigate. It opens up a whole new world for us, doesn't it?

I am a second-year student studying Computer Science at the University of British Columbia. I was interested in Machine Learning but had no prior knowledge of the field at the beginning of this term. I have taken a Machine Learning course this term, and this is my first homework project in Data Science. In this project, I set out to predict the Reviews per Month of Airbnb listings in New York City.

Description of the Dataset
The project is based on a dataset of New York City Airbnb listings from 2019, which includes various features such as price, room type, and neighborhood. The target, reviews_per_month, is included in the dataset and takes continuous values; the problem is therefore a supervised regression problem.

Description of the Problem/Decision
The main challenge of this project was to predict the feature reviews_per_month. Different features appear to have different relevance to this target: price, for example, seems closely related to the prediction, while latitude seems less relevant. So I needed to be careful when handling the data. This problem is relevant for guests because it provides insight into how popular a listing is.

EDA
Data quality is extremely important. I carried out some feature engineering on the dataset to achieve better results. First, I built a summary table to check whether any features contain NaN values and to inspect the ranges and distributions of the numerical features. Features with NaN values need extra preprocessing, and numerical features with very different scales can greatly hurt the model's performance, so these characteristics are indispensable to examine. Here are the summary tables:

(Image: summary table of missing values per feature)

(Image: summary table of numerical feature statistics)
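
A summary like this can be produced with a short pandas sketch. The file name AB_NYC_2019.csv is an assumption based on how the Kaggle "New York City Airbnb Open Data" dataset is commonly distributed:

```python
import pandas as pd

# Assumed file name for the 2019 NYC Airbnb listings dataset.
df = pd.read_csv("AB_NYC_2019.csv")

print(df.isnull().sum())  # NaN count per feature
print(df.describe())      # ranges and distributions of the numerical features
```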

Based on the summary tables, I discovered that the distributions and ranges of the data differ widely and that some features have missing values. For the categorical features, based on my understanding of customers' preferences on Airbnb, room type and neighbourhood group are the relatively more important ones, so I wanted to see the categories and the number of listings in each category so that I could decide how to preprocess the data. Here are the distributions of neighbourhood group and room type:

(Image: distribution of neighbourhood groups)

(Image: distribution of room types)

These two plots show the distributions of the neighbourhood groups and room types. The categories have no strict ranking and the number of categories is small, so an encoding such as one-hot encoding (rather than an ordinal encoding) is a reasonable choice for preprocessing these features.
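
These counts can be reproduced with a minimal sketch, reusing df from the EDA sketch above (the column names follow the Kaggle dataset's British spellings):

```python
import matplotlib.pyplot as plt

# Print and plot the category counts for the two categorical features.
for col in ["neighbourhood_group", "room_type"]:
    print(df[col].value_counts())
    df[col].value_counts().plot(kind="bar", title=col)
    plt.show()
```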

Feature Engineering
Since the feature total_reviews is highly correlated with reviews_per_month, I dropped it to keep the project challenging. I added a new feature named difference_from_mean_price to improve the model's performance; I consider it more meaningful than the raw price, since whether a listing counts as expensive or cheap depends heavily on the market. I also constructed a pipeline to analyze the data more cleanly. The pipeline consists of two parts: a preprocessor and a model. The preprocessor makes the raw data suitable for the machine learning model. I divided the features into three groups: numerical features, categorical features, and dropped features, as in the sketch below.
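
Here is a minimal sketch of such a preprocessor, assuming the column names of the Kaggle dataset (where the review-count column is number_of_reviews) and a standard train/test split; the exact feature lists are my assumptions, not the post's exact code:

```python
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Rows with a missing target cannot be used for supervised training
# (an assumption about how missing reviews_per_month values were handled).
df = df.dropna(subset=["reviews_per_month"])

train_df, test_df = train_test_split(df, test_size=0.2, random_state=42)
X_train, y_train = train_df.drop(columns=["reviews_per_month"]), train_df["reviews_per_month"]
X_test, y_test = test_df.drop(columns=["reviews_per_month"]), test_df["reviews_per_month"]

# Engineered feature: distance of each listing's price from the mean price,
# with the mean computed on the training split only, to avoid leakage.
mean_price = X_train["price"].mean()
X_train["difference_from_mean_price"] = X_train["price"] - mean_price
X_test["difference_from_mean_price"] = X_test["price"] - mean_price

numerical_features = ["difference_from_mean_price", "minimum_nights",
                      "availability_365", "latitude", "longitude",
                      "calculated_host_listings_count"]
categorical_features = ["neighbourhood_group", "room_type"]
drop_features = ["host_name", "id", "host_id", "last_review", "name",
                 "number_of_reviews"]  # the review-count column mentioned above

preprocessor = ColumnTransformer(
    transformers=[
        ("num", make_pipeline(SimpleImputer(strategy="median"), StandardScaler()),
         numerical_features),
        ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_features),
        ("drop", "drop", drop_features),
    ]
)
```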

Description of the Model
I built four models to compare and contrast: a baseline model, an SVM model, a DecisionTreeRegressor model, and a GradientBoostingRegressor model. I tuned the hyperparameters of the SVM and DecisionTreeRegressor manually and used RandomizedSearchCV (a tool for hyperparameter optimization) for the GradientBoostingRegressor. As expected, GradientBoostingRegressor had the best performance: it is an ensemble model (it combines several simple models to improve learning), which usually takes longer to run but performs better.
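
A sketch of the RandomizedSearchCV step, reusing preprocessor and the train split from above; the search space is an assumption, not the values used in the project:

```python
from scipy.stats import randint, uniform
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import make_pipeline

pipe = make_pipeline(preprocessor, GradientBoostingRegressor(random_state=42))

# make_pipeline names steps after their class, hence the prefix below.
param_distributions = {
    "gradientboostingregressor__n_estimators": randint(50, 500),
    "gradientboostingregressor__max_depth": randint(2, 8),
    "gradientboostingregressor__learning_rate": uniform(0.01, 0.3),
}

search = RandomizedSearchCV(pipe, param_distributions, n_iter=20, cv=5,
                            n_jobs=-1, random_state=42)
search.fit(X_train, y_train)
print(search.best_params_)
```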

Feature Selection
I used SequentialFeatureSelector to select the most important features. The results did not improve after feature selection, so I did not include it in the following steps.
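
For reference, a sketch of what this step might look like; the estimator and the number of features to select are assumptions:

```python
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeRegressor

# Forward selection: greedily add the feature that improves the CV score most.
selector = SequentialFeatureSelector(DecisionTreeRegressor(random_state=42),
                                     n_features_to_select=5,
                                     direction="forward")
select_pipe = make_pipeline(preprocessor, selector,
                            DecisionTreeRegressor(random_state=42))
select_pipe.fit(X_train, y_train)
print(select_pipe.score(X_test, y_test))  # compare against the pipeline without selection
```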

Feature Importance
I used SHAP (a common method for explaining feature importance) to show the feature importances. Here is the summary plot:

(Image: SHAP summary plot for the GradientBoostingRegressor model)

Based on the summary plot, availability_365 (how many days of the year a listing is available) is the most important feature for the GradientBoostingRegressor model. The features that most influence the prediction relate to living requirements; price and location matter less. Neighbourhood and room type are not that important for the prediction, which contradicts my original assumption.
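
A summary plot like this can be produced roughly as follows, reusing the fitted search from above (and assuming a recent scikit-learn for get_feature_names_out):

```python
import shap

# Pull the fitted transformer and model out of the tuned pipeline;
# the step names are those auto-generated by make_pipeline.
best = search.best_estimator_
ct = best.named_steps["columntransformer"]
model = best.named_steps["gradientboostingregressor"]

X_enc = ct.transform(X_train)
if hasattr(X_enc, "toarray"):  # densify OneHotEncoder's sparse output for shap
    X_enc = X_enc.toarray()

explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X_enc)
shap.summary_plot(shap_values, X_enc, feature_names=ct.get_feature_names_out())
```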

Result
I used MAPE (mean_absolute_percentage_error) to evaluate my results. My model's performance is not satisfying: the MAPE is 0.67, which is quite high and means the predicted reviews_per_month of Airbnb listings in New York City is off by roughly 67% of the true value on average.
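
scikit-learn provides this metric directly; reusing the fitted search and the test split from above:

```python
from sklearn.metrics import mean_absolute_percentage_error

# MAPE = mean(|y_true - y_pred| / |y_true|): a value of 0.67 means the
# predictions deviate from the true values by about 67% on average.
y_pred = search.predict(X_test)
print(mean_absolute_percentage_error(y_test, y_pred))
```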

Caveats
I violated the golden rule by passing already-preprocessed data to cross-validation: my train and cross-validation scores were all the same while I was tuning hyperparameters, which suggests that the validation data influenced the training data through the preprocessing. The fix, sketched below, is to keep the preprocessor inside the pipeline.
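
A leakage-free version keeps the preprocessing inside the pipeline, so cross-validation refits it on each training fold; a minimal sketch reusing the objects defined above:

```python
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import cross_validate
from sklearn.pipeline import make_pipeline

# Because the preprocessor lives inside the pipeline, cross_validate refits
# it on each training fold, so imputation and scaling statistics never see
# the validation fold.
pipe = make_pipeline(preprocessor, GradientBoostingRegressor(random_state=42))
scores = cross_validate(pipe, X_train, y_train, cv=5, return_train_score=True)
print(scores["train_score"].mean(), scores["test_score"].mean())
```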
The features I dropped might not have been reasonable to drop. I dropped host_name, id, host_id, last_review, and name when preprocessing the data, but these features could be useful for prediction. I did not know much about how they might influence people's decisions; I dropped them only because I guessed they were unimportant, which is not a well-justified reason.
The quality of my features is probably not high: judging from the SHAP summary plot, I did not do a good job of feature engineering, and the new feature I added is not useful enough.
