A commentary on “Problems in using text-mining and p-curve analysis to detect rate of p-hacking”
J. C. F. de Winter
31 July 2015
Bishop and Thompson (2015) concluded that “For uncorrelated variables, simulated p-hacked data do not give the signature left-skewed p-curve that Head et al. took as evidence of p-hacking.” The simulations by Bishop and Thompson were conducted with one particular form of p-hacking, the “use of ghost variables”. Thus, their simulations show only that if there is p-hacking by means of ghost variables, it will not be detected in the p-curve. It is worth emphasizing that other forms of p-hacking, such as ‘optional stopping’, do yield a distinct peak just below 0.05.

I performed simulations in which a one-sample t-test was repeated after each added participant until statistical significance was reached (up to a maximum of 50 participants); 10,000 studies were simulated. The MATLAB code is shown below. The results in Figure 1 show that a left-skewed distribution arises, even when the true effect size is large (d = 1.0). Figure 2 shows the consequences of another form of p-hacking: selecting a nonparametric test if it gives a statistically significant result while the parametric test does not. Again, a left skew can be seen. This time the simulations were conducted with 1,000,000 two-sample t-tests, with a sample size of 25 per group, and a true effect size of d = 0.

Bishop and Thompson (2015) cite a relatively limited number of papers. It is important to note that a substantial number of papers have previously discussed or analysed p-values extracted from the literature. Examples are: Brodeur et al. (2013), De Winter and Dodou (2015), Ioannidis (2014), Jager and Leek (2014), Krawczyk (2015), Lakens (2015), Leggett et al. (2013), Masicampo and Lalande (2012), Mathôt (2013), Nuijten et al. (2015), and Vermeulen et al. (in press). Some of these papers (e.g., Ioannidis, 2014) are negative towards the idea of automatic p-value mining and raise arguments similar to those of Bishop and Thompson. Others, such as De Winter and Dodou (2015), discuss both strengths and weaknesses of automatically extracted p-values compared to manually harvested p-values.
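The decision rule behind the optional-stopping simulation can be condensed to a few lines of MATLAB for a single simulated study (an illustrative sketch only; the full code, covering all four effect sizes and 10,000 repetitions, is listed at the end of this commentary):

```matlab
% Illustrative sketch of 'optional stopping' for one simulated study.
% d is the assumed true effect size (here d = 0, i.e., the null is true).
d = 0;
S = [];                        % accumulated sample
p = NaN;
for n = 1:50                   % add one participant at a time, up to 50
    S(n) = randn + d;          % draw a new observation
    [~, p] = ttest(S, 0);      % re-run the one-sample t-test after each addition
    if p < .05                 % stop as soon as significance is reached
        break
    end
end
```

Because testing is repeated after every participant, p-values that happen to dip just below .05 are "locked in" by stopping, which is what produces the peak just below .05 in Figure 1.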
[Figure 1: histogram of p-value counts (x-axis: p-value, 0 to 0.4; y-axis: Count), with four curves for d = 0, d = 0.3, d = 1.0, and d = 10.]
Figure 1. Distribution of p-values for ‘optional stopping’, for four effect sizes.
[Figure 2: histogram of p-value counts (x-axis: p-value, 0 to 0.4; y-axis: Count).]
Figure 2. Distribution of p-values when ‘strategically’ selecting a non-parametric test when the parametric test yields a result that is not statistically significant.

References

Bishop, D., & Thompson, P. A. (2015). Problems in using text-mining and p-curve analysis to detect rate of p-hacking. Retrieved from https://peerj.com/preprints/1266/
Brodeur, A., Lé, M., Sangnier, M., & Zylberberg, Y. (2013). Star wars: The empirics strike back (Discussion Paper No. 7268). Discussion Paper Series. Bonn: Forschungsinstitut zur Zukunft der Arbeit.
De Winter, J. C. F., & Dodou, D. (2015). A surge of p-values between 0.041 and 0.049 in recent decades (but negative results are increasing rapidly too). PeerJ, 3, e733.
Head, M. L., Holman, L., Lanfear, R., Kahn, A. T., & Jennions, M. D. (2015). The extent and consequences of p-hacking in science. PLOS Biology, 13, e1002106.
Ioannidis, J. P. (2014). Discussion: Why “An estimate of the science-wise false discovery rate and application to the top medical literature” is false. Biostatistics, 15, 28–36.
Jager, L. R., & Leek, J. T. (2014). An estimate of the science-wise false discovery rate and application to the top medical literature. Biostatistics, 15, 1–12.
Krawczyk, M. (2015). The search for significance: a few peculiarities in the distribution of p values in experimental psychology literature. PLOS ONE, 10, e0127872.
Lakens, D. (2015). On the challenges of drawing conclusions from p-values just below 0.05. PeerJ, 3, e1142.
Leggett, N. C., Thomas, N. A., Loetscher, T., & Nicholls, M. E. (2013). The life of p: “Just significant” results are on the rise. The Quarterly Journal of Experimental Psychology, 66, 2303–2309.
Masicampo, E. J., & Lalande, D. R. (2012). A peculiar prevalence of p values just below .05. The Quarterly Journal of Experimental Psychology, 65, 2271–2279.
Mathôt, S. (2013). No particular prevalence of p values just below .05 [blog post]. Retrieved from http://www.cogsci.nl/blog/miscellaneous/221-no-particular-prevalence-of-p-values-just-below-05
Nuijten, M. B., Hartgerink, C. H. J., Van Assen, M. A. L. M., Epskamp, S., & Wicherts, J. M. (2015). The prevalence of statistical reporting errors in psychology (1985–2013). Retrieved from https://mbnuijten.files.wordpress.com/2013/01/nuijtenetal_2015_reportingerrorspsychology.pdf
Vermeulen, I. E., Beukeboom, C. J., Batenburg, A. E., Stoyanov, D., Avramiea, A., Van de Velde, R. N., & Oegema, D. (in press). Blinded by the light: How a focus on statistical “significance” may cause p-value misreporting and an excess of p-values just below .05 in communication science. Communication Methods and Measures.
MATLAB code

%% Tested in MATLAB R2015a. Takes a few minutes to run on a standard laptop.
clear variables; close all; clc

%% Simulation 1: 'optional stopping' (Figure 1)
d = [0 0.3 1 10];              % true effect sizes
reps = 10000;                  % number of simulated studies per effect size
pp = NaN(length(d), reps);     % final p-value of each simulated study
for k = 1:length(d)
    for j = 1:reps
        S = [];
        for i = 1:50           % add participants one at a time, up to 50
            S(i) = randn + d(k);
            [~, pp(k, j)] = ttest(S, 0);      % re-test after each participant
            if pp(k, j) < .05; break; end     % stop as soon as p < .05
        end
    end
end
V = 0:0.005:1;                 % histogram bin edges
figure; plot(V + mean(V(1:2)), histc(pp', V), '-o', 'Linewidth', 2);
xlabel('\itp\rm-value')
ylabel('Count')
legend('\itd\rm = 0', '\itd\rm = 0.3', '\itd\rm = 1.0', '\itd\rm = 10')
set(gca, 'xlim', [0 .4])
h = findobj('FontName', 'Helvetica');
set(h, 'FontSize', 20, 'Fontname', 'Arial')

%% Simulation 2: 'strategic' selection of a nonparametric test (Figure 2)
pp2 = NaN(1, reps*100);        % 1,000,000 simulated studies, d = 0
for i = 1:reps*100
    S = randn(25, 1); S2 = randn(25, 1);
    SR = tiedrank([S; S2]);    % rank-transform for the nonparametric test
    [~, p1] = ttest2(S, S2);                    % parametric two-sample t-test
    [~, p2] = ttest2(SR(1:25), SR(26:end));     % t-test on the ranks
    if (p1 > .05 && p2 < .05)  % report the nonparametric p only if it 'rescues' significance
        pp2(i) = p2;
    else
        pp2(i) = p1;
    end
end
figure; plot(V + mean(V(1:2)), histc(pp2', V), 'k-o', 'Linewidth', 2);
xlabel('\itp\rm-value')
ylabel('Count')
set(gca, 'xlim', [0 .4])
h = findobj('FontName', 'Helvetica');
set(h, 'FontSize', 20, 'Fontname', 'Arial')
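As an illustrative follow-up (not part of the analysis above), the left skew in the significant range can be quantified by comparing the counts in the two halves of the (0.03, 0.05) interval; a larger count just below .05 indicates left skew. The snippet assumes the vectors pp and pp2 produced by the code above are in the workspace:

```matlab
% Illustrative check of left skew among near-significant p-values (an
% assumption of this commentary's figures, not an analysis from the paper):
% compare counts in (0.04, 0.05) versus (0.03, 0.04).
p_opt = pp(1, :);                          % d = 0 curve from Simulation 1
n_hi = sum(p_opt > 0.04 & p_opt < 0.05);   % just below .05
n_lo = sum(p_opt > 0.03 & p_opt < 0.04);
fprintf('Optional stopping: %d in (0.04, 0.05) vs %d in (0.03, 0.04)\n', n_hi, n_lo);

n_hi2 = sum(pp2 > 0.04 & pp2 < 0.05);
n_lo2 = sum(pp2 > 0.03 & pp2 < 0.04);
fprintf('Test switching:    %d in (0.04, 0.05) vs %d in (0.03, 0.04)\n', n_hi2, n_lo2);
```

Under no p-hacking and d = 0, p-values are uniformly distributed, so the two counts would be roughly equal; both forms of p-hacking simulated here inflate the count just below .05.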