Rosy Ratings

Posted by Dan Temkin on February 10, 2017

A comprehensive visual and statistical analysis of the relationship between the number of stars a reviewer assigns to a movie and the degree of positive/negative sentiment they convey in the review itself. Code used to develop this analysis can be found on my GitHub.

xkcd star ratings


What is the relationship between star ratings that a reviewer gives to a movie and the sentiment of the review itself?

A note about nomenclature:

I figured it might be important to highlight some of the naming conventions I will use below because some of the variables and concepts can be difficult to distinguish due to their implicit similarity.

Observed Stars or Observed Ratings :

The star rating or number of stars that the reviewer/critic themselves provided.

Review Text :

The written review that the reviewer/critic had published on

Sentiment Score :

The text sentiment score that was provided by AlchemyAPI.

Measured Stars or Measured Ratings :

The star rating or number of stars that were calculated by applying the 0-4 rating scale to the Sentiment Scores provided by AlchemyAPI.
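
To make the Measured Stars idea concrete, here is a minimal sketch of one way a sentiment score could be placed on the 0-4 scale. It assumes the Sentiment Score lies between -1 and 1 and is mapped linearly in half-star steps. The function name `sentiment_to_stars`, the linear mapping and the rounding rule are hypothetical choices for illustration, and the normalization actually used in the analysis may differ.

```python
import math


def sentiment_to_stars(score, lo=-1.0, hi=1.0, max_stars=4.0, step=0.5):
    """Map a sentiment score in [lo, hi] onto 0..max_stars in `step` increments.

    Hypothetical helper: the linear scaling and half-up rounding are assumptions.
    """
    if not lo <= score <= hi:
        raise ValueError("score must lie between lo and hi")
    # Rescale the score to the 0..max_stars range.
    scaled = (score - lo) / (hi - lo) * max_stars
    # Round half up to the nearest step (avoids round()'s banker's rounding).
    return math.floor(scaled / step + 0.5) * step


# A neutral score lands in the middle of the scale.
print(sentiment_to_stars(0.0))
```

Under this mapping a perfectly neutral review would sit at 2 stars, which is worth keeping in mind when reading how the measured ratings cluster.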


There are three conclusions that can be drawn from the analysis:

  1. There is a definite relationship between the number of stars that a critic assigns to a movie and the sentiment conveyed in the review. It is most significant when the review has an observed rating of 1.5 or 4 stars.
  2. The average difference between the number of stars observed and the number of stars implied by the review sentiment was ~1 +/- 0.79. In other words, on average the star rating assigned by the reviewer was 1 star higher than the sentiment of the review implied.
  3. Critics are generally more optimistic when giving a movie a star rating than when writing up the review.
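
The size of the gap in the second conclusion can be roughly cross-checked against the contingency table shown further down. The sketch below is not the original calculation: it only uses the counts in that table, which leaves out several rating groups, so its result should land near the reported figures rather than match them exactly. The variable names are my own.

```python
import math

# Star values for the table's columns (observed) and rows (measured).
observed_stars = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0]
measured_stars = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0]

# Counts copied from the contingency table (rows m0.5..m3.0, columns o0.5..o4.0).
counts = [
    [4, 6, 6, 13, 1, 0, 0, 0],
    [15, 47, 81, 95, 36, 44, 13, 8],
    [4, 37, 60, 145, 112, 199, 79, 41],
    [1, 3, 19, 59, 74, 220, 95, 66],
    [0, 1, 1, 13, 15, 85, 62, 49],
    [0, 0, 0, 1, 5, 20, 10, 11],
]

n = sum(sum(row) for row in counts)

# Every review in a cell has the same difference: observed minus measured.
cells = [
    (o - m, c)
    for m, row in zip(measured_stars, counts)
    for o, c in zip(observed_stars, row)
]

mean_diff = sum(d * c for d, c in cells) / n
std_diff = math.sqrt(sum(c * (d - mean_diff) ** 2 for d, c in cells) / n)

print(n, round(mean_diff, 2), round(std_diff, 2))
```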

So, what should you do with this information? I will admit the study is a bit trivial, but I would keep it in the back of your mind when looking for movies to go see. When you land on something that could be good, subtract one star from the reviewer’s rating and judge again whether it is worth the time and money. That is not to say you should not go see a particular movie; I am just suggesting you run this one last litmus test before making a final decision.


  ChiSquared Analysis with Cramers V 

  Contingency Table

        o0.5  o1.0  o1.5   o2.0   o2.5   o3.0  o3.5  o4.0
  m0.5   4.0   6.0   6.0   13.0    1.0    0.0   0.0   0.0
  m1.0  15.0  47.0  81.0   95.0   36.0   44.0  13.0   8.0
  m1.5   4.0  37.0  60.0  145.0  112.0  199.0  79.0  41.0
  m2.0   1.0   3.0  19.0   59.0   74.0  220.0  95.0  66.0
  m2.5   0.0   1.0   1.0   13.0   15.0   85.0  62.0  49.0
  m3.0   0.0   0.0   0.0    1.0    5.0   20.0  10.0  11.0

  Expected Values
   [[   0.38793103    1.51939655    2.69935345    5.26939655    3.92780172
       9.18103448    4.18642241    2.82866379]
   [   4.38362069   17.16918103   30.50269397   59.54418103   44.38415948
     103.74568966   47.30657328   31.96390086]
   [   8.75431034   34.28771552   60.91540948  118.91271552   88.63739224
     207.18534483   94.47359914   63.83351293]
   [   6.94396552   27.19719828   48.31842672   94.32219828   70.30765086
     164.34051724   74.93696121   50.6330819 ]
   [   2.92241379   11.44612069   20.33512931   39.69612069   29.58943966
      69.1637931    31.53771552   21.30926724]
   [   0.60775862    2.38038793    4.22898707    8.25538793    6.15355603
      14.38362069    6.55872845    4.43157328]] 

  Chi-Squared Statistics by Group
   [  70.5435591    98.69950243  128.05107495   75.73436276   17.52762712
     68.58108555   68.20045436   79.34515385]

  P-Values by Group
   [  7.89769853e-14   9.93342422e-20   6.16723392e-26   6.53638100e-15
      3.60057787e-03   2.02202893e-13   2.42615848e-13   1.15035731e-15]
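
For anyone who wants to retrace the numbers, the expected values and the per-group chi-squared statistics printed above can be reproduced from the contingency table alone. This is a plain-Python sketch rather than the original code, and the variable names are my own. Each expected count is the row total times the column total divided by the grand total, and each group's statistic sums (observed - expected)^2 / expected down one observed-rating column.

```python
# Observed counts: rows are measured ratings (m0.5..m3.0),
# columns are observed ratings (o0.5..o4.0), copied from the table above.
counts = [
    [4, 6, 6, 13, 1, 0, 0, 0],
    [15, 47, 81, 95, 36, 44, 13, 8],
    [4, 37, 60, 145, 112, 199, 79, 41],
    [1, 3, 19, 59, 74, 220, 95, 66],
    [0, 1, 1, 13, 15, 85, 62, 49],
    [0, 0, 0, 1, 5, 20, 10, 11],
]

row_totals = [sum(row) for row in counts]
col_totals = [sum(col) for col in zip(*counts)]
n = sum(row_totals)

# Expected count under independence: row total * column total / grand total.
expected = [[r * c / n for c in col_totals] for r in row_totals]

# Chi-squared contribution of each observed-rating column.
chi_by_group = [
    sum(
        (counts[i][j] - expected[i][j]) ** 2 / expected[i][j]
        for i in range(len(counts))
    )
    for j in range(len(col_totals))
]
chi_overall = sum(chi_by_group)

print([round(x, 4) for x in chi_by_group])
print(round(chi_overall, 4))
```

The printed p-values are consistent with comparing each group's statistic against a chi-squared distribution with 5 degrees of freedom (six measured rows minus one).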

The Chi-Squared Test indicates that the dependence of the measured ratings on the observed ratings is significant for every group in the contingency table. The 2.5 star rating category was by far the weakest, with a chi-stat of about 17.5 (p ≈ 0.0036), though even that clears the 99 percent confidence level. I then calculated Cramers V to measure the effect size of the chi-squared statistic for each group.

  Cramers V
   Overall:  0.25568594778537235 
  By Group:  [0.7667222394847636, 0.45825641440953424, 0.3916051337551449, 0.21555231207823727, 0.12010841166499396, 0.15539713498346144, 0.22948734937591297, 0.3011314925921122]
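
The Cramers V figures above can be recovered from the chi-squared statistics with the usual formula, V = sqrt(chi2 / (n * min(rows - 1, columns - 1))). The sketch below is a reconstruction that happens to match the printed values, not necessarily the original code. It assumes the per-group values divide each group's statistic by that group's own column total, and the helper name `cramers_v` is my own.

```python
import math

# Per-group chi-squared statistics, copied from the output above.
chi_by_group = [70.5435591, 98.69950243, 128.05107495, 75.73436276,
                17.52762712, 68.58108555, 68.20045436, 79.34515385]

# Number of reviews in each observed-rating column of the contingency table.
col_totals = [24, 94, 167, 326, 243, 568, 259, 175]

n_rows, n_cols = 6, 8  # shape of the contingency table


def cramers_v(chi2, n, n_rows, n_cols):
    """Cramers V for a chi-squared statistic computed from n observations."""
    return math.sqrt(chi2 / (n * min(n_rows - 1, n_cols - 1)))


v_overall = cramers_v(sum(chi_by_group), sum(col_totals), n_rows, n_cols)
v_by_group = [
    cramers_v(chi, total, n_rows, n_cols)
    for chi, total in zip(chi_by_group, col_totals)
]

print(round(v_overall, 4))
print([round(v, 4) for v in v_by_group])
```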


The results are in line with the chi-squared for the most part. The largest chi-stat belonged to the 1.5 star group, while the largest effect size (about 0.77) belonged to the 0.5 star group. The effect for the 2.5 star group was the lowest, which is no real surprise since it also had by far the weakest chi-squared result. The last test I ran was a Spearman’s Rho Correlation between the measured and observed ratings; because the data are rank ordered, it was the prudent choice.

  Correlation:  SpearmanrResult(correlation=0.51184497688915387, pvalue=2.4843853040154645e-125)

The correlation statistic came out at ~0.51 and is statistically significant.
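
For reference, Spearman's Rho is simply the Pearson correlation of the ranks, with tied values sharing the average of the ranks they cover. That matters here because star ratings produce a lot of ties. The sketch below is a small plain-Python version of that definition; the function names are my own, and the six rating pairs are made up purely for illustration and are not taken from the dataset.

```python
import math


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1  # positions are 0-based, ranks 1-based
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def spearman_rho(x, y):
    """Spearman's Rho: Pearson correlation applied to average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Made-up ratings for illustration only.
observed = [3.0, 2.0, 4.0, 1.5, 3.0, 2.5]
measured = [2.0, 1.5, 2.5, 1.0, 1.5, 2.0]
print(spearman_rho(observed, measured))
```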

One of the most interesting artifacts that can be seen above is the bimodal frequency distribution of observed stars, with both the two and three star rating groups having a discernibly greater occurrence than the surrounding groups. This is interesting because I remember watching a video on YouTube of a professor showing the class how statistics can be used to determine who cheated (here’s a link). In the video he mentions that whenever an unaccounted-for external influence is applied to a normal distribution, the distribution becomes bimodal. For example, if you have a computer generate 10 random values, then a human 5, then the computer 10 more, and so on, the result according to that professor would be a skewed, bimodal distribution.

Other than that the distributions were as expected apart from two things.

  1. I was surprised to see that there were no cases in which the sentiment was strong enough, in either direction, to warrant zero or 4 measured stars. This could, however, be an artifact of the data collection method, which is a concern I will discuss in the next section.
  2. I was similarly surprised that there were, in fact, any 0 star ratings assigned by reviewers. I don’t think I can remember a movie that ever got 0 stars, except maybe Battlefield Earth, but I guess there must have been more. It is just hard to imagine.

As the Violin Plot indicates, for every rating assigned by the reviewer the sentiment ratings were consistently distributed with a definite center. Moreover, the center of the sentiment rating distributions increased marginally, largely in tandem with the observed ratings, reaffirming that there is a consistent and arguable associative pattern between the observed number of stars and the sentiment score from AlchemyAPI. This is the same pattern confirmed by the statistical tests above. However, the violin plot does alleviate a concern I had when performing the analysis, one that the statistics themselves were exposed to: the possibility of parametric assignment error when creating the normalized groups that “converted” the sentiment score into the infinitely more relatable “Measured Star” groups. Since the violin plot uses the raw sentiment score, before the normalization process, it serves as appropriate affirmation that there is in fact a meaningful relationship between the datasets. Interestingly, it also confirms the effect size result of the Cramers V, which was below a meaningful level for the 2.0 and 2.5 observed star groupings. This leads me to conclude that the wide range of sentiment conveyed at these levels reduced the potential effect that the sentiment scores could have on a subsequent observed star rating.


As mentioned before, there were a few things that potentially reduced the validity of the study and its underlying conclusion.

  1. The review text was collected from as such, it is only really valid in the context of reviews provided on that site.
  2. When I collected the review text by page scraping, I noticed it was also grabbing the excerpts of other reviews promoted at the bottom of the page, and given the structure of the HTML it was not possible to get cleaner data. This could explain the significant neutrality of the measured ratings, since the extra text would dilute the negativity or positivity of each review.
  3. I used AlchemyAPI, which runs on a black box proprietary algorithm maintained by IBM, meaning that I am beholden to their criteria for determining what indicates positive and negative sentiment. Similarly, there was no parameter to discern between subjects. Therefore, a movie could have been great, but the reviewer’s focus on the bad directing or writing relative to the good components could cause the algorithm to return a largely negative score. On a side note, it should also be considered that there are more words in the English language to convey negative sentiment than positive, so right off the bat a review is more likely to be scored as negative.
  4. When forming the contingency table for the chi-squared I omitted the 0, 3.5, and 4.0 measured ratings groups because those groups were almost entirely null values, which would have resulted in an invalid chi-squared. I just didn’t see any way around it.

I do not own nor do I have any claim to the rights over the content from “”, AlchemyAPI or any of their affiliates. If you have claim to either of these sources or their underlying content and believe that my project in any way infringes upon your intellectual property please contact me at: temkin.d01[at]gmail[dot]com and I will do my utmost to remedy the situation. Thanks.