AirBnB: Here to Stay!

I recently came across the AirBnB dataset (link). I found it very interesting and wanted to look at the growth of AirBnB and study its trends. I limited the data analysis to 14 cities in the US.

Question 1: Are AirBnB locations clustered around tourist spots or points of interest?


This dashboard shows a snapshot of the listings in two cities – I’ve considered New York and San Francisco as examples. (Any two cities can be compared at a time using the filter.)

If it were true that AirBnB listings were clustered around tourist spots, then the map should reflect that. In this case, however, the map shows that AirBnB listings are not just clustered around the tourist spots, but are spread out across the city.

This raises questions about the type of travelers who use AirBnB. It would be interesting to study this with respect to the type of traveler or their purpose of travel.

Question 2: Does staying at an AirBnB listing mean sharing a house with the host or a room with the host or someone else?


The above dashboard shows that far more entire homes/apartments are listed on the website than shared rooms or even private rooms. This strongly counters the misconception that staying at an AirBnB involves sharing a room with the host or someone else; a majority of the listings are for independent properties. The pricing adds further support: while a private room costs about $73.30, an entire apartment is available for about $168.40, making it a very good deal for groups or families needing at least two rooms. The dashboard also shows a steep rate of growth in independent listings and their reviews. And in showcasing that entire apartments are available all across the city, the map displays yet another reason to choose an entire apartment/house – its ubiquitous presence.

Question 3: Do the size of a city and its cost of living impact the number of listings and the price at which a listing is offered?


The top part of the above dashboard compares the different cities on the basis of households and listings, which leads to the density of listings. For places with more households, is a higher share of households listed? The data does not show a clear correlation. For instance, Santa Cruz has a low number of households but a high density of households listed, whereas Chicago has a lower density. This leads to the question of cost of living, which the graph on the bottom left seems to answer: Chicago has a low cost of living and a low average room rate, while Santa Cruz has a high cost of living and a high average room rate. However, this graph also sports anomalies such as Austin – which has a high average room rate but a low cost of living.

Further data on occupancy rates, the type of property (beyond room type), the neighborhood, etc. would help analyze this further, as well as study other trends with respect to population or cost-of-living differences across listings.

Question 4: Are there regional differences across AirBnB listings?


There seem to be some differences in these parameters across regions. The Northeast not only has more listings, but also sports higher prices. This is true across all types of properties, as seen in the box plot on top. However, when it comes to the number of reviews, the South’s average is higher than the others’, with the West matching (and beating) it for private rooms and entire apartments. It’s also interesting to note that shared rooms receive the fewest reviews while private rooms receive the most, with entire homes and apartments in between the two – across all regions. An analysis of the sentiment of the reviews could surface differences across the regions and whether one region is culturally more attuned to giving reviews.

Another trend that is somewhat uniform across the regions is the minimum stay required. Of interest here is the minimum number of nights required to stay in a shared room in the West. This number seems high and needs further investigation in terms of differences across cities. A deep dive within the West region shows that this is driven by the requirement in San Francisco. This requirement needs to be looked into – is it a policy, or are some of these properties being operated as hostels?

Question 5: Are some listings in New York illegal?


News reports claimed that AirBnB had purged its New York listings data around November 20, 2015, since these listings were illegal. A deep dive into the data, seen in the top graph, supports this. The difference is about 1,500 listings. However, between then and January 2016, the listings were revived.

In New York, it is illegal to let out entire properties for fewer than 30 days. As can be seen from the bar graphs, about 54% of all listings are for entire homes or apartments, and these are available to be booked for about 220 days a year. The scatterplot on the right shows the listings of entire homes and the duration for which they are available. The ones coded in orange are available for fewer than 30 days, and hence it can be inferred that these are illegal listings.
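The inference behind the orange-coded points can be reproduced with a simple filter. A minimal sketch – the field names (`room_type`, `availability_365`) are assumptions in the style of Inside Airbnb exports, since the exact schema isn’t shown here:

```python
# Flag entire-home listings available for fewer than 30 days a year,
# mirroring the inference drawn from the scatterplot above.
# Field names (room_type, availability_365) are assumed, not taken
# from the original dataset schema.

def flag_suspect_listings(listings, threshold_days=30):
    """Return entire-home listings available fewer than threshold_days days."""
    return [
        l for l in listings
        if l["room_type"] == "Entire home/apt"
        and l["availability_365"] < threshold_days
    ]

sample = [
    {"id": 1, "room_type": "Entire home/apt", "availability_365": 12},
    {"id": 2, "room_type": "Entire home/apt", "availability_365": 220},
    {"id": 3, "room_type": "Private room", "availability_365": 5},
]
print([l["id"] for l in flag_suspect_listings(sample)])  # → [1]
```

With the real data, the same filter applied to the New York listings would recover the orange subset of the scatterplot.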


‘Insult’ Mining in Social Media Posts

Collaborators: Cristian Garay, Divya Garg – School of Information, UC Berkeley


The anonymity of social media provides a convenient façade for online harassment. Using a Kaggle dataset of comments from online discussions, we train machine-learning models on a binary classification task: labeling comments as insults. Our objective is to use algorithms such as Naïve Bayes and Logistic Regression, as well as more complex ones such as Random Forests and Support Vector Machines, and compare each model’s performance at predicting whether a comment is an insult or not.

Initial Results:

We featurized the training data as unigrams and trained the Naïve Bayes model on it. We initially tested this model on individual comments that were insults, and we also extracted the most informative features to understand the model’s behavior. These features are given below:

Screen Shot 2016-03-18 at 10.03.44 PM

Work in Progress: 

We’re now improving the featurization by adding n-grams. 
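In scikit-learn terms, that change is a one-parameter tweak to the vectorizer (shown on a toy sentence, assuming the same `CountVectorizer` pipeline as above):

```python
from sklearn.feature_extraction.text import CountVectorizer

# ngram_range=(1, 2) keeps unigrams and adds bigrams, so phrases like
# "you are" become features alongside the individual words.
vectorizer = CountVectorizer(ngram_range=(1, 2))
vectorizer.fit(["you are a complete idiot"])
print(sorted(vectorizer.vocabulary_))
```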

Model Performance: 

Currently, the Naive Bayes classifier shows an accuracy of 77.9% using just unigrams. The other algorithms are under development. 


Pepsi vs. CocaCola Tweets

In my previous blog post, I introduced Project Scanz, which fetches a Twitter user’s profile and performs a frequency count of the hashtags and words they use. I’ve been playing around with this, and in typical classical-marketing fashion, some of the first few Twitter handles I experimented with were @CocaCola, @Pepsi, @Nestle and of course, in keeping with my current avatar, @UCBerkeley!

@Pepsi & @CocaCola were the most interesting results. On counting the hashtags of the last few tweets, I got the following results.

Pepsi’s top hashtags:

Screen Shot 2015-06-30 at 10.37.09 PM

CocaCola’s top few hashtags:

Screen Shot 2015-06-30 at 10.36.11 PM

Their hashtags are telling of the approaches these two brands have taken. While Pepsi has chosen to participate in current conversations with the relevant hashtags, Coke participates in the same discussions but with focused hashtags of its own – ‘cokemyname’ and ‘ShareaCoke.’ I plan to undertake a more detailed analysis of their social media statistics in the near future.

It’s interesting that traditional marketing bigwigs such as Pepsi & Coke – though they have a strong presence on social media – do not seem to top the engagement charts. Marketers will have to unlearn and relearn means of connecting with consumers. These are interesting times to be in!


Project Scanz

I started experimenting with the Twitter API about 2 weeks back and I am now hooked! I embarked on a small project to get a feel for the data and now that I am a little bit more comfortable, I will be undertaking more ambitious projects!

My objective with Scanz is to fetch a Twitter user’s profile using their Twitter handle, and then, based on their tweets, to gain an understanding of their most frequently used hashtags and words.

For this, after a whole host of issues with Tweepy (I found it incompatible with Python 3), I used Twython and it worked.

The Steps I followed were:

1) Access Twitter API through Twython:

Screen Shot 2015-06-30 at 2.04.04 PM

2) Get a Twitter user’s basic data using the function get_info. I am displaying the 25 latest tweets of this user. If required, we can dump the tweets into a JSON file.

Screen Shot 2015-06-30 at 2.04.55 PM

3) I then get the tweets required for analysis. For the purpose of this exercise, I am using 200 tweets. We can access up to 3,200 tweets, but only 200 at a time. By updating the max tweet ID between requests and adding a sleep interval, so as not to inundate the API with our requests, we can write all 3,200 tweets to a JSON file.

Screen Shot 2015-06-30 at 2.11.00 PM
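The pagination logic in this step can be sketched independently of Twython. In this sketch, `get_page` stands in for an API call such as Twython’s `get_user_timeline`, and the `id` field follows Twitter’s tweet JSON; the fake timeline at the bottom is purely for demonstration:

```python
import time

def fetch_all_tweets(get_page, max_tweets=3200, page_size=200, delay=0.0):
    """Page backwards through a timeline 200 tweets at a time.

    get_page(count=..., max_id=...) stands in for an API call such as
    Twython's get_user_timeline; delay spaces out requests so the API
    isn't inundated.
    """
    tweets = []
    max_id = None
    while len(tweets) < max_tweets:
        page = get_page(count=page_size, max_id=max_id)
        if not page:
            break
        tweets.extend(page)
        # Next page: everything older than the oldest tweet seen so far.
        max_id = page[-1]["id"] - 1
        time.sleep(delay)
    return tweets[:max_tweets]

# Usage with a fake timeline of 450 tweets (ids 450..1, newest first):
timeline = [{"id": i} for i in range(450, 0, -1)]

def fake_page(count=200, max_id=None):
    older = [t for t in timeline if max_id is None or t["id"] <= max_id]
    return older[:count]

print(len(fetch_all_tweets(fake_page)))  # → 450
```

With the real API, `get_page` would wrap the Twython call (e.g. a lambda over `get_user_timeline`) and `delay` would be set to a second or two.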

4) Now the tweets are in the file jsontweetsfile. We could just open this file and create a list of all the hashtags in it. The following snippet does that. Within the function, I call another function to create a flat list out of all the nested list elements. I’ll cover word_count() in the next step.

Screen Shot 2015-06-30 at 2.13.08 PM
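The extraction step can be sketched as a pure function over the tweet dicts. The nesting (`entities` → `hashtags` → `text`) follows Twitter’s tweet JSON; the sample tweets are made up:

```python
def extract_hashtags(tweets):
    """Flatten every hashtag in a list of tweet dicts into one list."""
    tags = []
    for tweet in tweets:
        for tag in tweet.get("entities", {}).get("hashtags", []):
            tags.append(tag["text"].lower())
    return tags

sample_tweets = [
    {"entities": {"hashtags": [{"text": "ShareaCoke"}, {"text": "summer"}]}},
    {"entities": {"hashtags": [{"text": "shareacoke"}]}},
    {"entities": {"hashtags": []}},
]
print(extract_hashtags(sample_tweets))  # → ['shareacoke', 'summer', 'shareacoke']
```

Lowercasing here means differently-cased uses of the same hashtag get counted together in the next step.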

5) word_count() is a frequency counter. It creates a dictionary with each word as the ‘key’ and its frequency count as the ‘value.’ I am using it here to count the hashtags and, later, will call the same function for a word frequency of the tweets. It relies on sort_dict() – a function that sorts a dictionary by its values and prints the output as key, value in descending order.

Screen Shot 2015-06-30 at 2.18.01 PM
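A minimal sketch of those two helpers – the original bodies are only shown in the screenshot, so these are reconstructions of the described behavior:

```python
def word_count(items):
    """Frequency counter: item -> number of occurrences."""
    counts = {}
    for item in items:
        counts[item] = counts.get(item, 0) + 1
    return counts

def sort_dict(counts):
    """Sort a dict by value, descending, as (key, value) pairs."""
    return sorted(counts.items(), key=lambda kv: kv[1], reverse=True)

tags = ["pepsi", "summer", "pepsi", "music", "pepsi", "summer"]
for tag, n in sort_dict(word_count(tags)):
    print(tag, n)  # pepsi 3, then summer 2, then music 1
```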

6) If we just want to analyze the hashtags, the above would be sufficient, but if we want to analyze the frequency of words too, then we could add another function – process_line(). This function takes one line at a time, strips it of spaces and punctuation, and adds the words to a list. We can then create a single list of all the words by calling make_one_list() and then call word_count() to get the frequency of words used. We could also use a counter to display just the top 100 words or so, since that would be more relevant and the word list is likely to have a very long tail.

Another option I might explore at a later date is to classify words as pronouns, conjunctions, etc and then run the frequency analysis on the classifications. It’ll probably show us how few words we use!

Screen Shot 2015-06-30 at 2.21.17 PM
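process_line() as described – splitting a line and stripping surrounding punctuation from each token – could look like this sketch:

```python
import string

def process_line(line):
    """Split a line into lowercase words, stripped of surrounding punctuation."""
    words = []
    for token in line.split():
        token = token.strip(string.punctuation).lower()
        if token:  # drop tokens that were pure punctuation
            words.append(token)
    return words

print(process_line("Live for NOW! #pepsi, right?"))
# → ['live', 'for', 'now', 'pepsi', 'right']
```

Note that `strip()` only removes punctuation at the edges of a token, which keeps contractions and hyphenated words intact.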

I got some very interesting results with this – I analyzed Pepsi & Coke’s twitter handles and found a significant difference! More in my next blog post.


Comment Spam – Too flattering to be true

I am a complete newbie to creating and maintaining a website. I’ve maintained blogs in the past, but that was when spam didn’t look so real!

I am using ‘WordPress’ to build my site and as I was browsing through the plugins and reading about them online, I came across the Akismet plugin, which would ‘catch spam.’ I was a bit apprehensive initially since I didn’t think I would get all that much spam and would need a spam filter! However, I decided to err on the side of caution and I did install the plugin.

One day, on a whim, I clicked on the Comments section in the admin panel and was astonished to see 21 comments marked as spam. I’d published just two blog posts, so naturally I wasn’t expecting anything in the comments section, apart from the ones I’d received in my mail!

I can very easily classify the comment spam into three categories:

1) This is nonsense – This is the easiest to spot, for it’s just nonsense! One such comment I got was an excerpt from Wikipedia on Hillary Clinton. It wasn’t even a complete paragraph, just text with no context! The other type of comment in this category is coherent but unrelated to you or your blog post, and blatantly promotes a product or service.


2) Sounds genuine – The comments in this category usually start with a positive note about the blog and then sign off with “Why don’t you visit my link?” Now I realize this is a way of boosting a website’s rank on Google, which is why such tactics are resorted to, but these do sound nice and it is possible to get taken in by the kind words! Many comments here also ask you for helpful tips – these play on the emotional high you get when someone asks for your opinion or help. It’s a temporary high, even if it is just spam!

comment spam


3) How is this spam? Does this spam filter really work? – These are the comments that I kept for three days in the hope that they weren’t spam! They stroked my ego, had no link to click, and sounded like something a normal person would say. Yes, by normal person I mean someone who randomly came across your blog and was soooooo impressed with your writing, said your blog would become famous, promised to visit again and even asked you for tips! While my ever-optimistic self basked in the afterglow of these comments, my rational self decided things were getting out of hand and quickly took over. A quick Google search of the big paragraph with the flowery compliments turned up the exact same comment on multiple websites, classified as spam.

spam04 spam05

spam06 spam07

It is so easy to be flattered by phrases such as these, and these comments have in fact turned me into a staunch believer in ‘anti-spam’ plugins!


Alice in Hackland

Collaborators: Shirish Dhar (School of Information), Aaron Hobson (School of Education) – UC Berkeley

The Idea

A troubleshooting interface for experiential gaming – to inspire computer programming interest among students in middle school.

Our design methodology evolved from a generic UI for any game to a troubleshooting UI specifically.


Contextual Inquiry

We started with a contextual inquiry with middle school students. Our objective was to learn how middle school children approach and interact with games; specifically, we were interested in puzzle-based adventure computer games. We decided to look for three games that were differentiated enough among themselves – aesthetically, functionally, and otherwise – to allow for rich interviews. Ultimately, we decided on the following three games – Abandoned (Link), Puzzle Legends (Link) and You’re Grounded.

The contextual inquiries involved getting the students to play each game while we recorded both them and the screen (we used Camtasia).


Affinity Diagramming

Based on each of our interviews, we noted all the observations and clubbed them into themes.


The above notes were grouped into themes: Red – what they hated about the games, green – what they loved, grey – their familiarity with the games, blue – visual appeal elements, purple – types of games, yellow – specific preferences.

Work Models

We also explored the gaming experience during the contextual inquiries in terms of three different work models – Artifact Model, Cultural Model and Sequence Model.


Artifact Model: Exploring interactions with physical and virtual artifacts


Cultural Model: Exploring the reasoning behind two different students picking a favorite game


Sequence Model: Exploring the sequence of actions of a user

Persona Creation

Based on the findings of the contextual inquiry, we created 7 personas. Of these, we narrowed down to one ‘Primary persona’ and also considered two secondary personas.

Our Primary persona was Selah – A ‘Social’ Gamer

Selah, aged 12 and a Taylor Swift fan, typically only plays games with her family. She prefers social settings to playing games indoors and spends time with girlfriends outdoors. Her primary interaction is face to face and she plays games only to be a part of what everyone else is doing.


Image source: http://icdn5.digitaltrends.com/image/nintendo-wii-casual-gamers-625×350.jpg

Our secondary personas were Adrian – a hardcore gamer who prefers games above all other activities, and Annie, a Scrabble addict who is ranked #21 in her online championship league.

Scenario Creation

We explored various scenarios, such as the one below, with respect to our primary persona, Selah.


Lo-Fi Prototype

We created different options for the lo-fi prototype, using paper.


Lo-Fi Prototype Option 1


Lo-Fi Prototype Option 2

Our lo-fi prototype contained different iterations of these help options, and the biggest takeaway from testing our lo-fi prototype was the controversial nature of the ‘Advance’ option. While it empowered some gamers who would want a way out from extremely tricky situations, it would also ‘over empower’ certain users who would exploit it to no end and remark that the game was very easy at the end. This is when we decided to have a fixed count of three for the ‘Advance’ option, striking the right balance between utility and exploitation.


The think-alouds proved to be another cornerstone of our research, with a large number of users asking us what our game was actually about. They wanted a gaming context that they could relate to when employing these help features; they wanted to actually be ‘in a game’ and feel ‘stuck’ in order to gain full value from the troubleshooting interface. This is when we decided to base the troubleshooting interface on a computer-programming-enabling game, and we started to build a basic gameplay that would spark interest in computer programming among middle school kids.

Interactive Prototype


For the interactive prototype, we developed a gaming prototype and used the finalized troubleshooting options from the feedback on the lo-fi prototype.

Heuristic Evaluation

The subsequent heuristic evaluations were a great way to find the best way to integrate the troubleshooting interface with the computer-programming gameplay in our final prototype. The focus of the subsequent research was on tradeoffs such as having a permanent command terminal vs. a click-enabled terminal next to each character, and having a plain ‘Help’ or ‘?’ icon vs. personifying the help icon to make it more relatable to the user, much like Microsoft Office’s ‘Clippy.’

Final Prototype

Screen Shot 2016-03-29 at 9.12.28 AM

We developed the final working prototype based on feedback such as having persistent terminals and markers for help areas vs. user’s programming areas.

final_prototype_game2

Once the user clicks ‘Help’, the player is greeted with the various ‘Help’ options – Game Guide, Tutorial, Get Clue and Advance.

Usability Testing

We tested the final prototype among users against a control game. While 80% of the users completed a level in the control game, 60% completed a level in the experiment game. A theme emerged about the kind of games the users liked – ‘it should be challenging but fun.’ This was supported by an average difficulty rating of 3.2 (1 – too easy, 5 – too hard).