Though I seek to be accurate with the margins of victory and loss in the projections I post here, the prediction of whether a candidate will win or lose a contest matters even more. As many of you already know, I got two consequential calls wrong last Tuesday, and missed two more by significant amounts. Hillary Clinton won Missouri by 0.2% and Illinois by 1.6%, both very small margins. Though numerically I missed the win/loss in these states by only 0.2% and 1.6%, I fully recognize that the difference between calling a win and calling a loss is night and day. This is why I started over, from scratch, and have spent the last two days building a more robust and comprehensive model that can account for factors that I had previously thought were indirectly contained within the variables I was using.

  • Why did Bernie under-perform my estimates in almost every state Tuesday? Was it coincidence or a systemic mathematical bias of my model?

I believe it was more coincidence than mathematical bias, though I will concede both to some degree. I do want to make it clear that there was no intentional bias (I have been accused numerous times of inflating Bernie’s numbers for some imaginary reason), but rather the structure of the model itself created a mathematical bias in four of these last five elections. I say it was coincidental because the factors that allow this bias to show appeared disproportionately in most of Tuesday’s states, particularly states with an open primary.

Illinois, Missouri, and Ohio all have open primaries. Up until this point, the open primary was not a statistically significant driver of results for either candidate, and therefore was not included in my model. However, over this past month, more and more Democrats (apparently a disproportionate number of Sanders rather than Clinton supporters) have been requesting Republican ballots in open primaries to cast anti-Trump votes. They seem to harbor more disdain for Donald Trump than support for Bernie Sanders. I was able to isolate this effect and subsequently include it in the new model by interacting the amount of Trump support on social media in a state with a binary variable that defines whether the state has an open primary or not. This is a powerful variable, because it accounts for the scale of anti-Trump sentiment. In states that have more Trump support, more Democrats will cast anti-Trump votes, disproportionately helping Hillary Clinton. This happened to a substantial extent in Illinois, Missouri, and Ohio.
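For readers unfamiliar with interaction terms, here is a minimal Python sketch of the general idea. It is not the actual model or data behind these projections: the function names, the synthetic numbers, and the "true" coefficients below are all invented for illustration, and plain least squares stands in for whatever estimation the real model uses.

```python
import numpy as np

def build_design_matrix(trump_share, open_primary):
    """Columns: intercept, Trump share, open-primary flag, and their product.

    The last column is the interaction term: it equals the Trump share in
    open-primary states and is exactly zero everywhere else.
    """
    trump_share = np.asarray(trump_share, dtype=float)
    open_primary = np.asarray(open_primary, dtype=float)
    interaction = trump_share * open_primary
    return np.column_stack(
        [np.ones_like(trump_share), trump_share, open_primary, interaction]
    )

def fit_ols(X, y):
    """Ordinary least squares via numpy; returns one coefficient per column."""
    coef, *_ = np.linalg.lstsq(X, np.asarray(y, dtype=float), rcond=None)
    return coef

# Synthetic, made-up data: 40 hypothetical "states".
rng = np.random.default_rng(0)
n_states = 40
trump_share = rng.uniform(0.2, 0.6, n_states)   # hypothetical share of GOP social media support
open_primary = rng.integers(0, 2, n_states)     # 1 = open primary, 0 = closed

# Invented coefficients, chosen only so the example has something to recover.
assumed_coef = np.array([5.0, 2.0, 1.0, -7.0])
X = build_design_matrix(trump_share, open_primary)
sanders_margin = X @ assumed_coef + rng.normal(0.0, 0.1, n_states)

coef = fit_ols(X, sanders_margin)
print("estimated interaction coefficient:", coef[3])
```

A negative coefficient on the last column would mean that, among open-primary states only, more Trump support goes with a worse Sanders result, which is the pattern described above. In closed-primary states the interaction column is zero, so the term has no effect there.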

I am also now factoring in the median age of the state in question. Though Sanders has won some “older” states like Maine, New Hampshire, and Vermont, he does better overall in “younger” states, statistically speaking. Florida and Ohio are both older states, with a median age of 41.6 and 39.4, respectively. This is now being accounted for and will help produce more accurate results.

I have heard the claim many times that northerners and southerners, and particularly minorities, just vote differently from an ideological perspective. I don’t disagree, but I had previously believed that this bias was contained in the social media data that I was using. I have been experimenting with including a variable to track whether a state is in the “Deep South,” and as it turns out, this variable is statistically significant. In my opinion, this is the primary reason that Hillary Clinton performed so much better than my expectations in Florida. Even accounting for so many different things, people that reside in an area that possesses a southern culture will simply vote for a more conservative candidate.

I am happy for the opportunity to refine the model in so many different ways. This is, at its very core, an experiment to determine whether it is possible to model primary elections without the aid of public polling. I have a renewed confidence in the projections for the next few weeks, and look forward to determining once and for all which candidate Hispanics prefer with the Arizona contest next week.




  1. I think it’s a very worthy endeavor to create an alternative model to the current public polling method.

    I really hope your new model is more accurate than the older one. It was deflating to think Bernie was assured two wins and then end up with zero.


  2. It’s hard to take any of this seriously when you are so clearly biased towards Senator Sanders. You’re just not an impartial observer. I mean seriously. You suggested that Sanders would have become the frontrunner if he had won three states on Tuesday. Your elation regarding that possibility was palpable.

    People are not accusing you of inflating Sanders’ numbers for an “imaginary” reason. They are accusing you of this because you openly advocate for Sanders.

    For example, you suggest that Democratic voters who crossed over to vote against Trump would have overwhelmingly supported Sanders had there not been an opportunity to oppose Trump. Of course, you didn’t cite any data to support this assertion. Rather, it seems you just pulled this argument out of your ass in order to help explain why your model was incorrect.

    You need to take a step back and examine your bias. There is no actual hard data to suggest that “in states that have more Trump support, more Democrats will cast anti-Trump votes, disproportionately helping Hillary Clinton.” Furthermore, the fact that you were practically the only one who was surprised by Hillary Clinton’s margin of victory in Florida should set off some alarm bells.

    In actuality, Tuesday’s results closely matched the predictions of many other pollsters. You shouldn’t have created a model that is explicitly designed to support the results you WISH would happen.


    • Actually there is data to support that at least in Ohio. I haven’t looked at the other states yet. If you look at the exit polls from 2008 in the primary in Ohio you’ll see that 18% of the votes cast in the Republican primary were from Independents. 3% were from Democrats.

      This time through, 28% were Independent voting in the Ohio Republican primary and 7% were Democrat. Both Independents and Democrats voting in the Republican primary voted mostly for Kasich.

      I think there was a definite effort to “stop trump” at least in Ohio. That state (with Kasich) and Florida (with Rubio) were kind of the last hopes for a normal Republican candidate to become a serious competitor and challenge Trump.

      Now, were there enough who crossed over for that purpose that it swung the election on the Democrat side? I don’t think so but we’ll never really know for sure.


    • “There is no actual hard data to suggest that “in states that have more Trump support, more Democrats will cast anti-Trump votes, disproportionately helping Hillary Clinton.” ”

      Yes there is, my data. Trump’s share of Republican FB support in a state, multiplied by the open primary binary variable, has a coefficient of -6.82 (though this needs to be scaled a bit because my B0 is -29) and a p-value of 0.045 in my model. t is -2.17 for this variable just in case you prefer those measures.

      Also, if you think what I’m doing here is nonsense, you’re free to not come back.

      • Yes, please post preliminary predictions today, if you can, that would be great. (for some reason I could not reply to the correct post below)


      • I agreed 100% with the original poster. And your response makes no sense. You still have not provided a single data point suggesting that the Democrats who cast anti-Trump votes would have been more likely to vote for Bernie than for Hillary had they not switched. That is what he was asking for, and I was also curious how you came to that conclusion. I personally concluded they were roughly 60/40 Clinton voters, based on some admittedly weak models.

        But when that guy mentions other pollsters predicting everything on Tuesday on point, he might be referring to this…

        Both Nate Silver of 538 and Nate Cohn of The Upshot had models that did not rely on polling data at all and were purely demographic. Both predicted Hillary would win all 5 races. (Nate’s model here is different from the polling-based one everyone talks about: http://fivethirtyeight.com/features/can-bernie-sanders-pull-off-an-upset-in-ohio/)

        Nate Cohn posted something on Twitter but I didn’t bookmark it. But I do know that he had Sanders favored in 5 of the next 6 states (Hillary with AZ). Nate Silver has the same predictions in the link above.

        So for the next 2 voting dates, 6 Dem races… you are competing against an industry that has called 5 of the 6 races for Bernie based on demographic models alone, and by wide margins.

        I say this because you can easily fall into a trap and think you found a new model that works, when really you are just facing a map that favors your view of Sanders’ support.


  3. I appreciate what you’re doing, and think it has a ton of value. It really seems there’s something to the Google Trends data you’ve been analyzing; perhaps it just needs to be massaged differently.

    Some thoughts of mine:

    Looking at the exit polls, the African American (AA) vote in the Deep South (where he didn’t campaign as actively) was split wider in Clinton’s favor than it was in the Midwest (where he *did* campaign aggressively). Here’s a two-part writeup on AA voters and closing the gap based on name recognition:

    1) http://www.carlbeijer.com/2016/02/black-voters-and-2016-primaries-part-1.html
    2) http://www.carlbeijer.com/2016/02/black-voters-and-2016-primaries-part-2.html

    I reject the claim that the AA portion of the Deep South is inherently a low-information bloc as has been suggested on Reddit. Rather, I believe a big part of Sanders’ momentum has been in his ability to get his name and message out there.

    Weaver and Devine said in their ‘half-time’ conference call (https://berniesanders.com/half-time/) that they intentionally focussed less on the Deep South, because it was essential to put some wins on the board with Super Tuesday. The strategy this past Tuesday on the other hand, was to make up some delegate math. Sanders has actually played down his Michigan win since the upset, noting that it was a virtual tie (as Illinois and Missouri were as well).

    THEORY: There seem to be three ways in which Sanders has gotten his name out there in a meaningful way:

    1) Number of rallies/appearances in a given state.
    2) Campaign spending (I’m not sure if the absolute cost of advertisements, or the amount spent relative to Clinton is more predictive).
    3) National Debates/Town Halls/Interviews.

    When Sanders’ name and message are saturated, voting seems to mirror demographics. I don’t think your model needs to be as exhaustive as benchmarkpolitics.com, since they look at individual districts and have a different purpose: you’re seeking to predict in advance, while they’re trying to make an early call on election day based on reports and exit polls. Their factors appear to be %women, %black, %hispanic, median age, median income, %bachelors, and an indicator variable (similar to the one you use for the Deep South) for whether there is a major university (not sure how you’d incorporate this on a statewide level; maybe give more credit for rallies at colleges). Something on a statewide basis might fit better, kind of like what FiveThirtyEight has done in some pieces (and after Michigan, Silver noted they shifted to more of a demographic model in their predictions):

    1) http://fivethirtyeight.com/datalab/bernie-sanders-could-win-iowa-and-new-hampshire-then-lose-everywhere-else/
    2) http://fivethirtyeight.com/features/bernie-sanders-doesnt-need-momentum-he-needs-to-win-these-states/
    3) http://fivethirtyeight.com/features/florida-ohio-democratic-primary-preview/

    His primary variables seem to be how white and how liberal each state was in the 08 exit polls.

    Again though, when Sanders has to split his time (as he did, on Super Tuesday, and in the five states from this past Tuesday), saturation becomes more difficult. I’m not sure what exactly it is, but *something* in the Google Trends data you’ve used seems to have a legitimate correlation with the results at the least, and perhaps predictive value. I guess what you do next comes down to which path you decide:

    1) Would you rather put together a blended model, that takes into account both the Trends/saturation phenomenon (along with the frequency of the three factors I mentioned above under the ‘THEORY’ heading) and demographics?


    2) Would you prefer to refine a purely Google Trends model using saturation information (and some means of incorporating the three types of events under ‘THEORY’)?

    As a huge fan of your work, I’d selfishly prefer both (though your time is finite, so I don’t know how possible this is), and would love to see you release a pure saturation/Trends model in addition to a blended predictive model. That way, we can analyze the demographics and saturation models separately, as well as alongside your hybrid.

    Either way, good luck, and thanks again for sharing with us.



  4. When will you be posting your predictions for next Tuesday? I am really curious because there is a lack of polls and you have been pretty accurate (and I understand why you were off in Missouri and Illinois). Keep doing what you’re doing.
