Hubert Plisiecki, Paweł Lenartowicz

**Integrity of Polish Psychology**

**What can we say about the replicability of Polish academic psychology?**

**How does the ministerial evaluation impact the reliability of research?**

## About the project:

The project *Estimating Replicability of Polish Psychology*, initiated in 2022, aims to analyze the reliability of psychological research and the prevalence of questionable research practices in Poland. The initiative was created within the Student Society for Open Science at the SWPS University in Warsaw, and since 2023 it has been part of the Society for Open Science.

The aim of the project is to conduct metascientific research based on a database of psychological research articles that is as complete as possible. The database consists of articles published between 2017 and 2021, the same period that was considered in the last ministerial evaluation.

We have preregistered selected research hypotheses, available at https://osf.io/jgrbf

So far, 15 people have been involved in the project. Its coordinators are Hubert Plisiecki and Paweł Lenartowicz. All participants work on a volunteer basis, and any remaining costs are covered from Paweł's and Hubert's private funds.

## Database:

The database covers **1,650 researchers** who declared psychology as their discipline and are present in the RAD-on database^{1}, including **1,201 people** whose main place of work was an institute subject to evaluation in psychology. For these 1,201 people, we manually searched for and assigned their ORCID profiles^{2} as well as their IDs from the Polish Scientific Bibliography^{3} (PBN).

From the collected ORCID and PBN profiles, we downloaded and cleaned 6,488 and 8,347 scientific article records, respectively (with some overlap).
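Handling the overlap between two record sources can be sketched roughly as follows. This is only an illustration, not the project's actual pipeline: the record fields (`title`, `doi`), the helper names, and the sample records are all assumptions made for the example.

```python
def normalize_doi(doi):
    """Lower-case a DOI and strip a resolver prefix, so equal DOIs compare equal."""
    doi = doi.strip().lower()
    for prefix in ("https://doi.org/", "http://doi.org/", "doi:"):
        if doi.startswith(prefix):
            doi = doi[len(prefix):]
    return doi


def normalize_title(title):
    """Keep only lower-cased letters and digits, ignoring punctuation and spacing."""
    return "".join(ch for ch in title.lower() if ch.isalnum())


def merge_records(*sources):
    """Merge record lists, dropping records whose DOI or title was already seen."""
    seen = set()
    merged = []
    for records in sources:
        for record in records:
            keys = {("title", normalize_title(record["title"]))}
            if record.get("doi"):
                keys.add(("doi", normalize_doi(record["doi"])))
            if seen.isdisjoint(keys):
                merged.append(record)
            seen.update(keys)
    return merged


# Made-up records, for illustration only.
orcid_records = [
    {"title": "Power Poses: A Replication", "doi": "https://doi.org/10.1000/ABC"},
    {"title": "Only in ORCID", "doi": None},
]
pbn_records = [
    {"title": "Power poses - a replication", "doi": "10.1000/abc"},
    {"title": "Only in PBN"},
]

merged = merge_records(orcid_records, pbn_records)
print(len(merged), "unique records")
```

Matching on titles alone can wrongly merge distinct papers that share a generic title, so in practice such matches would likely need a manual check.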

The next step was to try to download all available scientific works, encode the journals in which they were published, correct typos in the titles and journal names, etc. The articles were downloaded in two ways: records with DOIs were downloaded automatically, and the rest were downloaded manually to eliminate errors (the manual download is about 40% complete).

Currently the database consists of **over 5 thousand unique scientific articles** in PDF format. To extract the statistical results we used the Statcheck software, which we adapted to the Python programming language for the purposes of the project (https://github.com/hplisiecki/statcheck_python).
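To give a sense of what this kind of extraction involves, here is a minimal sketch that pulls APA-style t-test reports out of plain text with a regular expression. It is not the code from statcheck_python: the pattern `T_TEST`, the function `extract_t_tests`, and the sample sentence are illustrative assumptions, and Statcheck itself covers more test types and also rechecks the reported p-values.

```python
import re

# Hypothetical pattern for reports such as "t(28) = 2.20, p = .03".
T_TEST = re.compile(
    r"\bt\s*\(\s*(?P<df>\d+(?:\.\d+)?)\s*\)\s*=\s*(?P<value>-?\d*\.?\d+)\s*,\s*"
    r"p\s*(?P<comparison>[<>=])\s*(?P<p>\d?\.\d+)",
    re.IGNORECASE,
)


def extract_t_tests(text):
    """Return one dict per t-test report found in the text."""
    results = []
    for match in T_TEST.finditer(text):
        results.append({
            "df": float(match.group("df")),
            "value": float(match.group("value")),
            "comparison": match.group("comparison"),
            "p": float(match.group("p")),
        })
    return results


sample = (
    "The effect was significant, t(28) = 2.20, p = .03, "
    "but the interaction was not, t(30) = -0.51, p > .05."
)
for row in extract_t_tests(sample):
    print(row)
```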

## What is Z-Curve

Z-curve^{4} is a statistical method that estimates the replication potential of a set of studies from the distribution of their p-values and assesses the degree of questionable research practices. The method predicts what share of the studies in a given sample could be replicated with a similar result, and estimates how many results were not published because their findings were not statistically significant.

The first step of the Z-Curve method is to transform p-values into z-values (the number of standard deviations separating a result from the value expected under the null hypothesis). In the ideal scenario, tests of the same statistical power yield z-values drawn from a normal distribution with a standard deviation of 1. For a sample of many studies, the overall distribution is a mixture of such normal distributions. On this basis, assuming a selection criterion of p = 0.05, the second step estimates the theoretical distribution of test results that would be observed if non-significant results were fully reported.
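The first step can be illustrated with a short sketch using only the Python standard library. It assumes two-sided p-values; the function names `p_to_z` and `power_from_mean_z` are made up for this example, and the sketch does not perform the model fitting that Z-Curve itself does.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()  # mean 0, standard deviation 1


def p_to_z(p):
    """Convert a two-sided p-value (0 < p <= 1) into an absolute z-value."""
    return abs(STD_NORMAL.inv_cdf(p / 2))


def power_from_mean_z(mean_z, alpha=0.05):
    """Probability that z ~ Normal(mean_z, 1) falls beyond the two-sided threshold."""
    critical = p_to_z(alpha)
    upper = 1 - STD_NORMAL.cdf(critical - mean_z)
    lower = STD_NORMAL.cdf(-critical - mean_z)
    return upper + lower


for p in (0.5, 0.05, 0.001):
    print(f"p = {p:<6} ->  z = {p_to_z(p):.2f}")

for mean_z in (0.0, 2.0, 2.8):
    print(f"mean z = {mean_z} ->  power = {power_from_mean_z(mean_z):.2f}")
```

The second function shows why tests of equal power share one distribution: the power depends only on where the unit-variance normal distribution of z-values is centered.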

The real distribution of p-values is affected by many variables, such as:

- **The file drawer effect** – selective publication of only those results that have reached the significance threshold.
- **Questionable Research Practices** – various questionable methods of modifying the results of statistical tests with the aim of reaching the significance threshold.

Examples:

- **Selective reporting** – conducting multiple analyses, manipulating variables and models, and then reporting only selected significant results.
- **HARKing** – formulating the hypothesis after the results have been analyzed.
- **p-hacking** – manipulating the sample size and deleting selected observations with the aim of achieving statistical significance.

This method computes additional metrics, such as:

- **Observed Discovery Rate** (ODR) – the percentage of statistically significant results observed in the dataset.
- **Expected Discovery Rate** (EDR) – the percentage of statistically significant results expected among all tests that were conducted, including unpublished ones, as estimated from the fitted distribution.
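A rough sketch of how these metrics relate is given below. The ODR can be counted directly from the z-values, while the EDR has to come from fitting the Z-Curve model, which this sketch does not do. Following the Z-Curve 2.0 description^{4}, the example treats the EDR as the share of significant results among all tests conducted, so that (1 − EDR) / EDR gives the number of non-significant results expected per significant one. The z-values and the EDR value are made up for illustration.

```python
def observed_discovery_rate(z_values, critical=1.96):
    """Share of reported results whose absolute z-value exceeds the threshold."""
    significant = sum(1 for z in z_values if abs(z) > critical)
    return significant / len(z_values)


def file_drawer_ratio(edr):
    """Non-significant results expected per significant one, for an EDR in (0, 1]."""
    return (1 - edr) / edr


# Made-up z-values, for illustration only.
z_values = [0.4, 1.2, 2.0, 2.1, 2.3, 2.5, 3.1, 4.0]
odr = observed_discovery_rate(z_values)

# An assumed value, NOT an estimate produced by the project.
hypothetical_edr = 0.25

print(f"ODR = {odr:.2f}")
print(f"Hypothetical EDR = {hypothetical_edr:.2f}")
print(f"Non-significant results per significant one: {file_drawer_ratio(hypothetical_edr):.1f}")
```

A large gap between the ODR and the EDR, as in this toy example, is the pattern that points to unpublished non-significant results.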

## How to interpret Z-Curve plots

A major advantage of the Z-Curve method is the ability to visualize the effects of p-hacking and selective publication. Transforming p-values into z-values exposes an important property: the values should be approximately normally distributed for a single experiment, and should follow a mixture of normal distributions for multiple studies.

This means that when visualizing these values using a histogram, we should obtain a distribution without "steep drops" or "sharp cuts." An example of such a distribution is the Z-Curve for David Matsumoto^{5}, a psychologist known for his research on microexpressions, conducted jointly with Paul Ekman. The shape of this graph and the lack of a substantial difference between the Observed Discovery Rate (ODR) and the Expected Discovery Rate (EDR) give no indication of p-hacking or publication bias in the analyzed studies.

In the above graph, the x-axis represents p-values transformed into z-values. A z-value of 0 corresponds to a p-value of 1, while a z-value of 1.96 corresponds to the significance threshold of p = 0.05 (for a two-sided test). A z-value above 5 corresponds to a very low p-value, below 0.0001. The red vertical lines indicate the significance threshold of p = 0.05 for both one-sided and two-sided tests.
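The correspondence between z-values and p-values along the x-axis can be checked numerically. The sketch below uses only the Python standard library; the function name `z_to_p` is an assumption made for this example.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()  # mean 0, standard deviation 1


def z_to_p(z, two_sided=True):
    """Convert a z-value into a p-value under the standard normal distribution."""
    tail = 1 - STD_NORMAL.cdf(abs(z))
    return 2 * tail if two_sided else tail


# Positions of the significance threshold p = 0.05 on the z-axis.
two_sided_threshold = STD_NORMAL.inv_cdf(1 - 0.05 / 2)
one_sided_threshold = STD_NORMAL.inv_cdf(1 - 0.05)

print(f"two-sided threshold: z = {two_sided_threshold:.3f}")
print(f"one-sided threshold: z = {one_sided_threshold:.3f}")
for z in (0.0, 2.0, 5.0):
    print(f"z = {z} ->  two-sided p = {z_to_p(z):.2g}")
```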

The Z-curve calculated for Shelly Chaiken^{5} is the opposite of the previously presented graph and is characteristic of fields where replication issues have commonly been observed (in this case, 'dual process theory'). In this histogram, we can clearly see a "cutoff" at the threshold of statistical significance, with differences between the number of effects just below and just above this threshold that are difficult to justify statistically. We observe a large difference between the ODR (Observed Discovery Rate) and the EDR (Expected Discovery Rate). Distribution fit analysis suggests that, in addition to the 379 reported results, around 1,000 statistically non-significant results are missing because they were never published.

## The state of Polish research

We present the results for the 7 universities with the largest number of published statistical tests, as well as a similar graph for Harvard University^{6}, which, in terms of detected bias and p-hacking, ranks roughly in the middle among American universities. It is worth noting that Polish universities perform quite well in this analysis; however, there are clear differences between them, which could be an interesting topic for further analysis.

To avoid controversies related to presenting results for individual researchers and to shift the focus from personal actions to institutional aspects, we decided not to publish results for specific individuals.

## Z-Curve across the years

One of the goals of the study is to analyze the impact of evaluation on publication practices. To this end, we are analyzing, as per the preregistration, the differences between the years. What happened in 2021 is particularly significant: it was the last year in which published articles could count toward the then-upcoming evaluation. For that year, we observe a sharp increase in signs of p-hacking in publications.

## Discussion and Conclusions

It is important to emphasize that the issues related to research quality and publication practices are multifaceted and difficult to explain clearly. They are influenced by both researchers' shortcomings in understanding statistical methods and the pressure to increase the number of publications, as well as a misinterpretation of publication quality, which is often defined primarily by being featured in highly rated journals. These journals are also not free from harmful editorial practices, such as making it difficult to publish replication studies or results that do not meet the threshold of statistical significance.

However, these problems are not unique to Poland — they have been the subject of international debate for years^{7}. There is no single solution that could instantly fix the system. Moreover, poorly thought-out reforms could worsen the situation, leading to unpredictable negative consequences.

The results of this project are not surprising — compared to the USA, where academic careers are more focused on competition in publications, we observe less manipulation of results in Poland. The only surprising aspect may be the scale of p-hacking observed in the last year of evaluation. Beyond the carefully planned and preregistered analysis on the impact of evaluation, we will certainly want to verify alternative explanations, such as the influence of "COVID-related" publications.

As our project progresses, we will publish increasingly precise estimates of replication potential, broken down by specific institutes and years. We are also working on improving analytical methods, which could be a significant contribution of the project to the field of statistics. We hope that the results obtained from the project will strengthen the weight of arguments based on empirical data in the ongoing discussion about reforms. We also hope that the discussion generated around these results will contribute to improving the quality of research conducted in our country.

## Bibliography

1. RAD-on database https://radon.nauka.gov.pl/
2. ORCID database https://orcid.org/
3. PBN database https://pbn.nauka.gov.pl/core/#/home
4. Schimmack and Bartoš, Z-Curve 2.0: Estimating Replication and Discovery Rates https://replicationindex.com/2020/01/10/z-curve-2-0/
5. Example z-curve plots https://replicationindex.com/2021/01/19/personalized-p-values/
6. The z-curve of Harvard University https://replicationindex.com/2022/02/23/rr22-harvard/
7. Reading list related to problems with replicability (borrowed from the Reproducibilitea project) https://osf.io/qxbcs