How do I calculate if a test like this is statistically significant?

I let people rate how much they like different things on a scale of 1-10. How do I actually tell if people like one thing more than another thing if the sample sizes are different? This is not about any real scientific study, more like a personal test :)

For example, if one thing got voted on 10 times and has an average value of 6.5, and another thing got voted on 6 times and has a 6.1, is the 6.5 thing actually more liked? Or are the samples so small that the result is still mostly random and could easily go either way?

I’ve never done anything like this, so I’d appreciate it if someone could explain it or direct me to the correct keywords/links.

I’ve read up a bit on p-value determination, but I’m not sure what my “null hypothesis” is here actually, numerically. If I’d put it in words I guess my hypothesis would be “this thing is more liked than the other thing”, but honestly, it seems like my specific case would be much simpler than all the stuff I’m reading here :D

JWBananas , 9 months ago

People are inherently bad at rating things. Why not run a “This or that?” style study instead?

Given a list of items to rate, pair them up randomly. Ask a person which item they like better out of each pair. Run through Final Four type eliminations until you get down to their number one preference.

Run through this process for each person, beginning with different random pairings every time.

Record data on all the choices - not just the final ones. You should be able to get good data like that.

For example, there will probably be a thing that is so disliked that it gets eliminated in the first round more frequently than anything else. The inverse will likely be true of a highly-preferred item. And I am sure you can identify other insights as well.
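
The elimination idea above can be sketched roughly as follows. This is only a minimal illustration: `run_bracket`, `first_round_losses`, and the `prefers` callback are hypothetical names, and the "participants" are simulated with made-up hidden scores instead of real answers.

```python
import random
from collections import Counter


def run_bracket(items, prefers, rng):
    """Run one single-elimination bracket and return (winner, log).

    `prefers(a, b)` is a hypothetical callback that returns whichever of
    the two items the participant likes better. `log` records every choice
    as (round_number, winner, loser), not just the final result.
    """
    remaining = list(items)
    rng.shuffle(remaining)  # fresh random pairings for every participant
    log = []
    round_number = 1
    while len(remaining) > 1:
        next_round = []
        # Pair up neighbours; with an odd count the last item gets a bye.
        for i in range(0, len(remaining) - 1, 2):
            a, b = remaining[i], remaining[i + 1]
            winner = prefers(a, b)
            loser = b if winner == a else a
            log.append((round_number, winner, loser))
            next_round.append(winner)
        if len(remaining) % 2 == 1:
            next_round.append(remaining[-1])
        remaining = next_round
        round_number += 1
    return remaining[0], log


def first_round_losses(logs):
    """Count how often each item was eliminated in round 1."""
    return Counter(loser for log in logs for rnd, _, loser in log if rnd == 1)


# Made-up example: every simulated participant picks the item with the
# higher hidden score, so "D" loses in round 1 whatever pairing it lands in.
scores = {"A": 4, "B": 3, "C": 2, "D": 1}
rng = random.Random(0)
results = [run_bracket(scores, lambda a, b: max(a, b, key=scores.get), rng)
           for _ in range(20)]
winners = Counter(winner for winner, _ in results)
losses = first_round_losses(log for _, log in results)
print(winners.most_common(), losses.most_common())
```

With real participants, `prefers` would simply ask the person and return their answer; the tallies are what you would then compare across items.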

Azzu OP , 9 months ago

Sounds like a good idea, but my participants don’t have the attention span, and I don’t have the resources, for anything more elaborate :) After all, like I said, it’s just a small personal thing :)

TauZero , 9 months ago

Your situation reminded me of the way IMDb sorts movies by rating, even though different movies may receive vastly different numbers of votes. They use something called a credibility formula, which is apparently a Bayesian way of doing it, unlike the frequentist statistics with p-values and null hypotheses that you are looking at right now.
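
A minimal sketch of that kind of formula, in the form it is usually quoted (a weighted average of the item's own mean and an overall prior mean). The function name and both prior values below are assumptions chosen for illustration; only the 6.5/10 and 6.1/6 figures come from the question.

```python
def weighted_rating(mean, votes, prior_mean, prior_weight):
    """Shrink an item's mean rating toward a prior mean.

    With few votes the result stays near `prior_mean`; with many votes it
    approaches the item's own `mean`.
    """
    return (votes * mean + prior_weight * prior_mean) / (votes + prior_weight)


# The prior values are assumptions, not something the question provides.
PRIOR_MEAN = 6.0    # assumed average rating across all items
PRIOR_WEIGHT = 5    # assumed: the prior counts as much as 5 real votes

a = weighted_rating(6.5, 10, PRIOR_MEAN, PRIOR_WEIGHT)
b = weighted_rating(6.1, 6, PRIOR_MEAN, PRIOR_WEIGHT)
print(round(a, 2), round(b, 2))  # 6.33 6.05
```

This gives a ranking that penalises items with few votes, but it does not by itself say whether the gap between two items is statistically significant.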

altairabove , 9 months ago (edited 9 months ago)

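You could use a few different null hypotheses here. One with minimal assumptions is that a rating picked at random from one group is just as likely to be higher as it is to be lower than a rating picked from the other group. This is often loosely stated as “the medians are equal”, which strictly holds only if both rating distributions have the same shape. It can be tested using the Mann-Whitney U test. en.m.wikipedia.org/wiki/Mann–Whitney_U_test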
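
For samples this small the test can be done exactly with nothing but the standard library. The sketch below is one way to do it, not a reference implementation: the function names and the individual ratings are made up (chosen so the means are 6.5 and about 6.2, close to the question), and the two-sided p-value is taken from the full permutation distribution of U.

```python
from itertools import combinations


def u_statistic(xs, ys):
    """Mann-Whitney U for xs: pairs where x beats y, ties counting one half."""
    return sum(1.0 if x > y else 0.5 if x == y else 0.0 for x in xs for y in ys)


def exact_mann_whitney(xs, ys):
    """Return (U, two-sided p) from the exact permutation distribution of U.

    Every way of splitting the pooled ratings into groups of the original
    sizes is enumerated, which stays valid with the many ties a 1-10 scale
    produces. Only feasible for small samples.
    """
    pooled = list(xs) + list(ys)
    n_x, n_y = len(xs), len(ys)
    centre = n_x * n_y / 2          # expected U if the groups do not differ
    observed = u_statistic(xs, ys)
    extreme = total = 0
    for chosen in combinations(range(len(pooled)), n_x):
        chosen_set = set(chosen)
        group_x = [pooled[i] for i in chosen]
        group_y = [pooled[i] for i in range(len(pooled)) if i not in chosen_set]
        u = u_statistic(group_x, group_y)
        total += 1
        if abs(u - centre) >= abs(observed - centre):
            extreme += 1
    return observed, extreme / total


# Made-up ratings: 10 votes for one item, 6 for the other.
thing_a = [7, 8, 6, 5, 9, 6, 7, 4, 8, 5]   # mean 6.5
thing_b = [6, 7, 5, 8, 4, 7]               # mean about 6.2
u, p = exact_mann_whitney(thing_a, thing_b)
print(f"U = {u}, p = {p:.3f}")
```

If SciPy is available, `scipy.stats.mannwhitneyu` does the same job without the hand-written enumeration.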

Jlafs , 9 months ago

Your null hypothesis is the thing you’re trying to disprove. For example, if I wanted to run a study to assess the effect of adding a certain growth hormone to a cell culture, my null hypothesis would be “there is no effect”. In your case, it would be “there is no difference in how much the different things are liked”. From there, you’d run your study and do your statistical analysis, for which there are different methods based on the type of data, the number of groups you’re comparing, sample size, etc. I’m not a statistician, so I can’t say which methods are best for what you’re planning.
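
Written out numerically for the example in the question, and assuming you compare mean ratings (an assumption; ranks or medians are just as defensible), the hypotheses could look like this. Welch's t statistic is shown only as one common way to measure the difference, not as a recommendation, and it needs the sample variances $s_A^2$ and $s_B^2$, which the question does not give:

```latex
H_0:\ \mu_A = \mu_B \qquad H_1:\ \mu_A \neq \mu_B

t = \frac{\bar{x}_A - \bar{x}_B}{\sqrt{\dfrac{s_A^2}{n_A} + \dfrac{s_B^2}{n_B}}},
\qquad \bar{x}_A = 6.5,\ n_A = 10,\quad \bar{x}_B = 6.1,\ n_B = 6
```

Here $\mu_A$ and $\mu_B$ are the true (unknown) average ratings, while $\bar{x}_A$ and $\bar{x}_B$ are the averages you actually measured.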

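When it comes to the p-value, to really simplify it: it is the probability of seeing a difference at least as large as the one you observed if your null hypothesis were true. A small p-value means your data would be surprising under “no difference”. It is often remembered as “the likelihood that the null hypothesis is true”, but that is not what it means.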
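
One way to make the p-value concrete is a label-shuffling (permutation) test: if the null hypothesis of "no difference" were true, it would not matter which item each rating belongs to, so you can shuffle the labels and see how often chance alone produces a gap as large as yours. The sketch below uses made-up ratings and a hypothetical function name, and it estimates the p-value from random shuffles.

```python
import random


def permutation_p_value(xs, ys, n_shuffles=10_000, seed=0):
    """Estimate a two-sided p-value for the difference in means.

    Null hypothesis: the labels are interchangeable, i.e. both items are
    liked equally. The labels are shuffled many times, counting how often
    the shuffled difference is at least as large as the observed one.
    """
    rng = random.Random(seed)
    n_x = len(xs)
    observed = abs(sum(xs) / n_x - sum(ys) / len(ys))
    pooled = list(xs) + list(ys)
    hits = 0
    for _ in range(n_shuffles):
        rng.shuffle(pooled)
        left, right = pooled[:n_x], pooled[n_x:]
        diff = abs(sum(left) / len(left) - sum(right) / len(right))
        if diff >= observed - 1e-12:   # small tolerance for float rounding
            hits += 1
    return hits / n_shuffles


# Made-up ratings: 10 votes for one item, 6 for the other.
thing_a = [7, 8, 6, 5, 9, 6, 7, 4, 8, 5]   # mean 6.5
thing_b = [6, 7, 5, 8, 4, 7]               # mean about 6.2
p = permutation_p_value(thing_a, thing_b)
print(f"estimated p = {p:.3f}")
```

A large estimate here means a gap like 6.5 versus 6.2 turns up easily by chance with so few votes, so it would not count as evidence that one thing is liked more.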
