POPL 2025 - Artifact Evaluation

Paper artifacts are the software, mechanized proofs, test suites, and benchmarks that support a research paper and evaluate its claims. POPL has run an artifact evaluation process since 2015, and invites artifact submissions from all authors of accepted papers.

Submit your Artifact

Artifact evaluation is optional, but highly encouraged. We solicit artifacts from authors of all accepted papers. Artifacts can be software, mechanical proofs, test suites, benchmarks, or anything else that bolsters the claims of the paper, except paper proofs, which the AEC lacks the time and expertise to carefully review.

Log in to the Artifact Evaluation HotCRP to register and submit your artifact here. Please register and submit your artifact by the following dates:

Artifact registration deadline: Friday, 11 October 2024 (AoE)
Artifact submission deadline: Tuesday, 15 October 2024 (AoE)
Artifact evaluation phase 1: Friday, 25 October 2024 (AoE)
Artifact acceptance decisions: ~~Thursday, 7 November 2024~~ Monday, 11 November (AoE) – concurrent with final acceptance notifications for POPL papers

Acceptance Criteria

Our goal is to work with the authors to identify issues early on, improve the artifact, and get it accepted wherever possible. Typically, almost all artifacts are accepted for at least one badge. Artifacts are evaluated against the criteria of:

Consistency with the claims of the paper and the results it presents
Completeness insofar as possible, supporting all evaluated claims of the paper
Documentation to allow easy reproduction by other researchers
Reusability, facilitating further research

Installing, configuring, and testing unknown software of research quality is difficult. Please carefully review our “Author Recommendations” tab on packaging and documenting your artifact in a way that is easier for the AEC to evaluate.

Artifact evaluation begins with authors of conditionally accepted POPL papers submitting artifacts on the Artifact Evaluation HotCRP. Artifact evaluation is optional. Authors are strongly encouraged to submit an artifact to the AEC, but not doing so will not impact their paper acceptance. Authors may, but are not required to, provide their artifacts to paper reviewers as supplemental materials.

Artifacts are submitted as a stable URL or, if that is not possible, as an uploaded archive. We recommend using a URL that you can update in response to reviewer comments, to fix issues that come up. Additionally, authors are asked to enter topics, conflicts, and “bidding instructions” to be used for assigning reviewers. You must check the “ready for review” box before the deadline for your artifact to be considered.

Artifact evaluation is single blind. Artifacts should not collect any telemetry or logging data; if that is impossible to ensure, logging data should not be accessed by authors. Any data files included with the artifact should be anonymized.

Reviewers will be instructed that they may not publicize any part of an artifact during or after completing evaluation, nor retain any part of one after evaluation. Thus, authors are free to include models, data files, proprietary binaries, etc. in your artifact.

AEC Membership

The AEC will consist of roughly 30 members, mostly senior graduate students, postdocs, and researchers. As the future of our community, graduate students will be the ones reproducing, reusing, and building upon the results published at POPL. They are also better positioned to handle the diversity of systems that artifacts span.

Participation in the AEC demonstrates the value of artifacts, provides early experience with the peer review process, and establishes community norms. We therefore seek to include a broad cross-section of the POPL community on the AEC.

Two-Phase Evaluation

The artifact evaluation process will proceed in two phases.

In the first “kick the tires” phase reviewers download and install the artifact (if relevant) and exercise the basic functionality of the artifact to ensure that it works. We recommend authors include explicit instructions for this step. Failing the first phase—so that reviewers are unable to download and install the artifact—will prevent the artifact from being accepted.

In the second “evaluation” phase reviewers systematically evaluate all claims in the paper via procedures included in the artifact to ensure consistency, completeness, documentation, and reusability. We recommend authors list all claims in the paper and indicate how to evaluate each claim using the artifact.

Reviewers and authors will communicate back and forth during the review process over HotCRP. We have set up HotCRP to allow reviewers to ask questions or raise issues: those questions and issues will immediately be forwarded to authors, who will be able to answer questions or implement fixes.

After the two-phase evaluation process, the AEC will discuss each artifact and notify authors of the final decision.

Two days separate the AEC notification from the camera ready deadline for accepted papers. This gap allows authors time to update their papers to indicate artifact acceptance.

Badges

The AEC will award three ACM standard badges. Badges are added to papers by the publisher, not by the authors.

ACM’s Artifacts Evaluated – Functional Badge
ACM’s Artifacts Evaluated – Reusable Badge (subsumes “Functional”)
ACM’s Artifacts Available Badge (awarded in addition to “Functional” or “Reusable”)

Functional and Reusable: Artifacts that can be shown to support the claims made in the paper, with sufficient documentation towards running the artifact and validating those claims, are awarded the “Artifacts Evaluated - Functional” badge. Those which additionally are packaged in a way that enables ease of reuse – including but not limited to good documentation, good installation instructions, platform compatibility, ease of running the tool on other examples not in the paper, and making the code available via open source licensing and/or an open issue tracker (e.g. GitHub, GitLab, or BitBucket) will be additionally awarded the “Artifacts Evaluated - Reusable” badge. Per ACM policy, papers will receive at most one of the Reusable and Functional badges.

Available: Artifacts which the authors make available eternally on a publicly accessible archival repository such as Zenodo or ACM DL will also receive ACM’s “Artifacts Available” badge. (Note that this is not the same as putting the code on GitHub, GitLab, BitBucket, or your personal website! However, an immutable snapshot does not prevent authors from also distributing their code in another way.) We recommend following this process for all accepted artifacts.

The following recommendations will help you package and document your artifact in a way that makes a successful evaluation most likely. None of these guidelines are mandatory – diverging from them will not disqualify your artifact.

Timeline and general advice

As you prepare to package your artifact, keep ifn mind that good artifacts take time to create! In our experience, you should expect to spend at least 2-3 workdays packaging your artifact, a large portion of which should be spent writing and testing your detailed instructions, provided in a README file.

In our experience, the key to a successful artifact evaluation is a good README file! Reviewers (and future researchers) will appreciate long, detailed, and clearly organized instructions which describe every aspect of your artifact in detail – including, e.g., shell commands to run, files to open, how long these will take, and what output is expected. This is not only to help artifact evaluation go smoothly – it provides confidence that members of the community will be able to replicate your results and use your tool for their own work in the future.

Packaging

We recommend creating a single web page at a stable URL from which reviewers can download the artifact and which also hosts a README file with instructions for installing and using the artifact. Having a stable URL, instead of uploading an archive, allows you to update the artifact in response to issues that come up.

We recommend using Zenodo to create the stable URL mentioned above for your artifact at submission time. Not only can you upload multiple versions in response to reviewer comments, you can use the same stable URL when publishing your paper to avoid uploading your artifact twice.

We recommend (but do not require) packaging your artifact as a virtual machine image. Virtual machine images avoid some issues with differing operating systems, versions, or dependencies. Other options for artifacts (such as source code, binary installers, web versions, or screencasts) are acceptable but generally cause more issues for reviewers and thus more issues for you. Virtual machines also protect reviewers from malfunctioning artifacts damaging their computer. We recommend VirtualBox 7.1, a free and actively maintained virtual machine host software; see more details below.

Plain software (recommend in some cases): It is appropriate not to use a virtual machine in cases where the software has very few dependencies and requires only a working installation of a single programming language or package manager – e.g., Cargo or OPAM. In these cases, please document the installation of your artifact and required version(s) of everything via clear, step-by-step instructions. Additionally, make sure to test the instructions yourself on a fresh machine without any software installed. If the reviewers are unable to replicate your setup during the kick-the-tires phase, they may ask you to provide a virtual machine.
VirtualBox instructions (recommended): As of fall 2023, VirtualBox still had some issues with running on newer ARM-based Macs (M1/M2/M3). For this reason, please ensure you are using at least version 7.1 (released in September 2024), which now supports macOS/Arm virtualization. As a safety precaution, the submission form will also ask authors to clarify whether their artifact was built/tested on Apple Silicon. Recent graphical Linux releases, such as Ubuntu 20.04 LTS, are good choices for the guest OS: the reviewer can easily navigate the image or install additional tools, and the resulting virtual machines are not too large to download.
Docker instructions (not recommended): If you use Docker, be warned that Docker images are not fully cross-platform out-of-the-box! Due to cross-platform compatibility issues on M1/M2/M3 Macs, Docker builds a separate image depending on the target platform, and may fail to run on a different machine. For these reasons, we do not recommend using Docker. If you choose to use Docker, see this (more complicated) process to bundle multiple-platform images.

The virtual machine should contain everything necessary for artifact evaluation: your source code, any compiler and build system your software needs, any platforms your artifact runs on (a proof assistant, or Cabal, or the Android emulator), any data or benchmarks, and all the tools you use to summarize the results of experiments such as plotting software. Execute all the steps you expect the reviewer to do in the virtual machine: the virtual machine should already have dependencies pre-downloaded, the software pre-compiled, and output files pre-generated.

Do not include anything in the artifact that would compromise reviewer anonymity, such as telemetry or analytics. If there is anything the reviewers should be worried about, please provide a clear disclaimer at the top of your artifact.

Documentation

You should provide a top-level documentation in a clearly marked location; typically, as a single README file. The HotCRP submission will also provide a field where you can enter any additional instructions to access the README.

Besides the artifact itself, we recommend your documentation contain four sections:

A complete list of claims made by your paper
Download, installation, and sanity-testing instructions
Evaluation instructions
Additional artifact description (file structure, extending the tool or adding your own examples, etc.)

Artifact submissions are single-blind: reviewers will know the authors for each artifact, so there is no need to expend effort to anonymize the artifact. If you have any questions about how best to package your artifact, please don’t hesitate to contact the AEC chairs.

List of claims

The list of claims should list all claims made in the paper. For each claim, provide a reference to the claim in the paper, the portion of the artifact evaluating that claim, and the evaluation instructions for evaluating that claim. The artifact need not support every claim in the paper; when evaluating the completeness of an artifact, reviewers will weigh the centrality and importance of the supported claims. Listing each claim individually provides the reviewer with a checklist to follow during the second, evaluation phase of the process. Organize the list of claims by section and subsection of the paper. A claim might read,

Theorem 12 from Section 5.2 of the paper corresponds to the theorem “foo” in the Coq file “src/Blah.v” and is evaluated in Step 7 of the evaluation instructions.

Some artifacts may attempt to perform malicious operations by design. Boldly and explicitly flag this in the instructions so AEC members can take appropriate precautions before installing and running these artifacts.

Reviewers expect artifacts to be buggy, immature, and have obscure error messages. Explicitly listing all claims allows the author to delineate which bugs invalidate the paper’s results and which are simply a normal part of the software engineering process.

Download, installation, and sanity-testing

The download, installation, and sanity-testing instructions should contain complete instructions for obtaining a copy of the artifact and ensuring that it works. List any software the reviewer will need (such as virtual machine host software) along with version numbers and platforms that are known to work. Then list all files the reviewer will need to download (such as the virtual machine image) before beginning. Downloads take time, and reviewers prefer to complete all downloads before beginning evaluation.

Note the guest OS used in the virtual machine, and any unusual modifications made to it. Explain its directory layout. It’s a good idea to put your artifact on the desktop of a graphical guest OS or in the home directory of a terminal-only guest OS.

Installation and sanity-testing instructions should list all steps necessary to set up the artifact and ensure that it works. This includes explaining how to invoke the build system; how to run the artifact on small test cases, benchmarks, or proofs; and the expected output. Your instructions should make clear which directory to run each command from, what output files it generates, and how to compare those output files to the paper. If your artifact generates plots, the sanity testing instructions should check that the plotting software works and the plots can be viewed.

Helper scripts that automate building the artifact, running it, and viewing the results can help reviewers out. Test those scripts carefully—what do they do if run twice?

Aim for the download, installation, and sanity-testing instructions to be completable in about a half hour. Remember that reviewers will not know what error messages mean or how to circumvent errors. The more foolproof the artifact, the easier evaluation will be for them and for you.

Evaluation instructions

The evaluation instructions should describe how to run the complete artifact, end to end, and then evaluate each claim in the paper that the artifact supports. This section often takes the form of a series of commands that generate evaluation data, and then a claim-by-claim list of how to check that the evaluation data is similar to the claims in the paper.

For each command, note the output files it writes to, so the reviewer knows where to find the results. If possible, generate data in the same format and organization as in the paper: for a table, include a script that generates a similar table, and for a plot, generate a similar plot.

Indicate how similar you expect the artifact results to be. Program speed usually differs in a virtual machine, and this may lead to, for example, more timeouts. Indicate how many you expect. You might write, for example:

The paper claims 970/1031 benchmarks pass (claim 5). Because the program runs slower in a virtual machine, more benchmarks time out, so as few as 930 may pass.

Reviewers must use their judgement to check if the suggested comparison is reasonable, but the author can provide expert guidance to set expectations.

Explicitly include commands that check soundness, such as counting admits in a Coq code base. Explain any checks that fail.

Aim for the evaluation instructions to take no more than a few hours. Clearly note steps that take more than a few minutes to complete. If the artifact cannot be evaluated in a few hours (experiments that require days to run, for example) consider an alternative artifact format, like a screencast.

Additional artifact description

The additional description should explain how the artifact is organized, which scripts and source files correspond to which experiments and components in the paper, and how reviewers can try their own inputs to the artifact. For a mechanical proof, this section can point the reviewer to key definitions and theorems.

Expect reviewers to examine this section if something goes wrong (an unexpected error, for example) or if they are satisfied with the artifact and want to explore it further.

Reviewers expect that new inputs can trigger bugs, flag warnings, or behave oddly. However, describing the artifact’s organization lends credence to claims of reusability. Reviewers may also want to examine components of the artifact that interest them.

Remember that the AEC is attempting to determine whether the artifact meets the expectations set by the paper. (The instructions to the committee are available at the “Reviewer Guidelines” tab above.) Package your artifact to help the committee easily evaluate this.

Reusability Guidelines

Reusable artifacts should be released under an open-source license (e.g., OSI list). Additionally, see the following additional instructions for specific artifact types.

Pre-existing software

If your artifact packages a well-known or existing piece of software that already existed prior to POPL, please note that the goal of the evaluation process is not necessarily to make existing well-known software more reusable, but to ensure that any new contribution(s) are reusable and integrate well. In your documentation, please clarify which parts of the software are new for this artifact and should be considered for the “Reusable” badge.

Proof artifacts

When packaging proof artifacts, all of the other advice on this page applies. In addition, let us clarify the definition of “reusability” to set expectations better for authors and reviewers. A proof artifact should be considered reusable if it contains definitions and proofs that can be used in other projects. Examples of such artifacts include Coq or Isabelle proof libraries and Coq plugins. To be considered reusable, they must:

Be made publicly available (after paper acceptance) via website download or public repository (e.g. GitHub).
Clearly state all environment dependencies, including supported versions of the proof assistant and required third-party packages.
Have clear installation instructions.
If the artifact contains proofs, those claimed as reusable must be complete (no “admit” in Coq or “sorry” in Lean/Isabelle).
(Optionally) include documentation for reusable components.
(Optionally) provide usage examples.

Resources

Further advice for packaging particular types of artifacts can be found below:

Introduction

Thank you for volunteering to serve on the POPL Artifact Evaluation Committee. Artifacts are an important product of the scientific process, and it is your goal, as members of the AEC, to study these artifacts and determine whether they meet the expectations laid out in the paper and adequately support its central claims.

General Guidelines and Advice

Since reviewers sometimes differ on what artifact evaluation is (or should be), please keep in mind the following general guidelines and advice:

Artifact evaluation is a collaborative process with the authors! Our goal is not necessarily to find flaws with the artifact, but to try to the best of our ability to validate the artifact’s claims. This means that back-and-forth discussion is always a good idea if there are issues, particularly with installation.
Artifact evaluation is not an examination of the scientific merits of the paper. In other words, it is within scope to evaluate the artifact (towards the functional and reusable badges), but not to evaluate the paper itself.

Basically, in cooperation with the authors, and without sacrificing scientific rigor, we aim to accept all artifacts which support the claims laid out in the paper.

Badges and Acceptance Criteria

We will be awarding one of two badges, Functional or Reusable, based on the following considerations:

Functional artifacts: Does the artifact completely support all evaluated claims in the paper, and are the results from running the artifact consistent with the claims made in the paper?
Reusable artifacts: In addition to all the criteria for “Functional”, are you able to modify the artifact to solve problems and benchmarks different from those in the paper? Is the artifact sufficiently well-documented to support reuse and future research?

In general, we aim for most or all artifacts to receive the functional badge, as long as the claims can be evaluated (even with some difficulty); but the reusable badge is strictly superior to Functional, and considered more at the reviewers’ discretion.

Overview of Process

Bidding

During the bidding process, you will examine the list of papers, and place bids for the artifacts that most closely match your research background, interests, and experience. Based on your bids, and depending on which paper authors actually choose to submit artifacts, we will assign you with two or three artifacts to review.

After artifacts are assigned, we organize the evaluation process as the following sequence of three milestones:

Milestone 1: Kick the Tires (By Thu Oct 24)

Research software is delicate and needs careful setup. In order to ease this process, in the first phase of artifact evaluation, you will be expected to at least install the artifact and run a minimum set of commands (usually provided in the README by the authors) to sanity check that the artifact is correctly installed.

Here is a suggested process with some questions you can try to answer.

After reading the paper:

Q1: What is the central contribution of the paper?
Q2: What claims do the authors make of the artifact, and how does it connect to Q1 above?
Q3: Can you locate the specific, significant experimental claims made in the paper (such as figures, tables, etc.)?
Q4: What do you expect as a reasonable range of deviations for the experimental results?

After installing the artifact:

Q5: Are you able to install and test the artifact as indicated by the authors in their “kick the tires” instructions?
Q6: Are there any significant modifications you needed to make to the artifact while answering Q5?
Q7: For each claim highlighted in Q3 above, do you know how to reproduce the result, using the artifact?
Q8: Is there anything else that the authors or other reviewers should be aware of?

During the process, you can leave a comment on HotCRP indicating success, or ask the authors questions. These questions can concern unclear commands, or error messages that you encounter. The authors will have a chance to respond, fix bugs in their artifact or distribution, or make additional clarifications. Errors at this stage will not be counted against the artifact. Remember, the evaluation process is cooperative!

Milestone 2: Evaluating Functionality (By Mon Nov 4 Wed Nov 6)

After the kick-the-tires phase, you will perform an actual review of the artifact.

During this phase, here is a suggested list of questions to answer:

Q9: Does the artifact provide evidence for all the claims you noted in Q3? This corresponds to the completeness criterion of your evaluation.
Q10: Do the results of running / examining the artifact meet your expectations after having read the paper? This corresponds to the criterion of consistency between the paper and the artifact
Q11: Is the artifact well-documented, to the extent that answering questions Q5–Q10 is straightforward? Are the steps to reproduce results clear? (Note: by well-documented, for this stage, we are considering generally only the README and instructions – we don’t mean that the code itself needs to be documented. That would matter only for reusability if the intention would be to modify the code in some way.)

In unusual cases, depending on the type of artifact and its specific claims, questions Q1–Q11 may be inappropriate or irrelevant. In these cases, we encourage you to disregard our suggestions and review the artifact as you think is most appropriate.

Milestone 3: Evaluating Reusability (By Mon Nov 4 Wed Nov 6)

Finally, you will evaluate artifacts for reusability in new settings. To evaluate reusability, the following three initial questions are suggested for all artifacts:

Q12: If you were doing follow-up research in this area, do you think you would be able to reuse the paper as a baseline in your own work?
Q13: Is the code released via an open source license (e.g., released with an OSI approved license)? Is it made publicly available on a platform such as GitHub, GitLab, or BitBucket?
Q14: Does the artifact have clear installation instructions?

New this year, to help you evaluate proof artifacts, the remaining questions are different for traditional (software) artifacts and for proof artifacts. For traditional software artifacts:

Q15a: Are you able to modify the benchmarks / artifact to run simple additional experiments, similar to, but beyond those discussed in the paper?

For proof artifacts, instead of Q15a, we suggest answering:

Q15b: Does the proof artifact contain definitions and proofs that can be used in other projects? (Examples of such artifacts include Coq or Isabelle proof libraries and Coq plugins.)
Q16: Does the artifact clearly state all environment dependencies, including supported versions of the proof assistant and required third-party packages?
Q17: Are all proofs claimed as reusable complete? (no “admit” in Coq or “sorry” in Lean/Isabelle)

Writing Reviews (By Mon Nov 4 Wed Nov 6) and Discussion (By Nov 6 Friday, Nov 8)

Write a review by drawing on your answers to Q1–Q13. Make sure to include a specific recommendation of whether to award a badge, and of which badge(s) to award. Once you submit your draft review, we will make all reviews public, so you have a chance to discuss the artifact with other reviewers and the co-chairs, and reach a consensus. You should feel free to change your mind or revise your review during these discussions.

That’s it! You can always reach out to the POPL AEC chairs to discuss any specific questions or concerns.

The purpose of artifact evaluation is two-fold: First, to reward authors who take the trouble to create useful artifacts beyond the paper. Sometimes the software tools that accompany the paper take years to build; in many such cases, authors who go to this trouble should be rewarded for setting high standards and creating systems that others in the community can build on. Second, to probe artifacts for their legitimacy. Authors sometimes take liberties in describing the status of their artifacts—claims they would temper if they knew the artifacts are going to be scrutinized.

Our hope is that eventually, the assessment of a paper’s accompanying artifacts will guide the decision-making about papers: that is, the Artifact Evaluation Committee (AEC) would inform and advise the Program Committee (PC). This would, however, represent a radical shift in our conference evaluation processes; we would rather proceed gradually. Thus, artifact evaluation is optional, and authors choose to undergo evaluation only after their paper has been conditionally accepted. Nonetheless, feedback from the Artifact Evaluation Committee can help improve the both the final version of the paper and any publicly released artifacts. The authors are free to take or ignore the AEC feedback at their discretion.

Beyond helping the community, creating a bundle for artifact evaluation can be directly beneficial for the authors:

The same bundle can be publicly released or distributed to third-parties.
A bundle can be used subsequently for later experiments (e.g., on new parameters).
The bundle simplifies having to re-run the system subsequently when, say, having to respond to a journal reviewer’s questions.
The bundle is more likely to survive being put in storage between the departure of one student and the arrival of the next.

Artifact EvaluationPOPL 2025

Process

Author Recommendations

Reviewer Guidelines

Purpose

Caleb StanfordCo-chair

University of California, Davis

United States

Vadim ZalivaCo-chair

University of Cambridge, UK

United States

Ariel E. Kellison

Cornell University

United States

Rui Dong

University of Michigan

United States

Mohit Tekriwal

Lawrence Livermore National Laboratory

United States

Deivid Vale

Radboud University

Netherlands

Pinhan Zhao

University of Michigan

United States

Stefania Damato

University of Nottingham

United Kingdom

Cheng Zhang

University College London (UCL)

United Kingdom

Songlin Jia

Purdue University, USA

United States

Cyrus Liu

Samsung Semiconductor

United States

Mallku Soldevila

FAMAF, UNC (Argentina) / CONICET (Argentina)

Argentina

Shaopeng Zhu

Currently in industry

Weihao Qu

Monmouth University

United States

Stefan Zetzsche

Amazon Web Services

United Kingdom

Abhishek Kr Singh

National University of Singapore

Singapore

Zhengxiong Luo

National University of Singapore

China

Andong Fan

University of Toronto

Neea Rusch

Augusta University

United States

Dong Chen

Huawei

Yudai Tanabe

Institute of Science Tokyo

Japan

Runming Li

Carnegie Mellon University

Peixin Wang

University of Oxford

Litao Zhou

University of Hong Kong

China

Joshua M. Cohen

Princeton University

Vincenzo Arceri

University of Parma, Italy

Italy

Mário Pereira

NOVA LINCS & DI -- Nova School of Science and Technology

Portugal