Convergent Representations of Computer Programs in Human and Artificial Neural Networks

Resources for the paper Convergent Representations of Computer Programs in Human and Artificial Neural Networks by Shashank Srikant*, Benjamin Lipkin*, Anna A. Ivanova, Evelina Fedorenko, Una-May O'Reilly.

Published in NeurIPS 2022: https://openreview.net/forum?id=AqexjBWRQFx

Citation:

@inproceedings{SrikantLipkin2022,
	title={Convergent Representations of Computer Programs in Human and Artificial Neural Networks},
	author={Shashank Srikant* and Ben Lipkin* and Anna A Ivanova and Evelina Fedorenko and {Una-May} {O'R}eilly},
	booktitle={Advances in Neural Information Processing Systems},
	editor={Alice H. Oh and Alekh Agarwal and Danielle Belgrave and Kyunghyun Cho},
	year={2022},
	url={https://openreview.net/forum?id=AqexjBWRQFx}
}

The labs involved:

https://evlab.mit.edu/

https://alfagroup.csail.mit.edu/

For additional information, contact [email protected], [email protected], [email protected], or [email protected].

Related material, such as slides, a talk, and a summary of our work, is available here.
The datasets and model checkpoints that this codebase downloads and analyzes are available at: https://huggingface.co/datasets/benlipkin/braincode-neurips2022

Overview

The goal of this work is to relate brain representations of code to (1) specific code properties and (2) representations of code produced by language models trained on code.
In Experiment 1, we predict the different static and dynamic analysis metrics from the brain MRI recordings (each of dimension D_B) of 24 human subjects reading 72 unique Python programs (N) by training separate linear models for each subject and metric.
In Experiment 2, we learn affine maps from brain representations to the corresponding representations generated by code language models (each of dimension D_M) on these 72 programs.
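
The mapping in Experiment 2 can be pictured with a small sketch on synthetic data. Everything here is an assumption made for illustration: the ridge penalty, the 6-fold split, the Pearson scoring, the array sizes, and the function names are not taken from this codebase. The sketch fits an affine map from an N x D_B "brain" matrix to an N x D_M "model" matrix and scores held-out predictions.

```python
import numpy as np


def fit_affine_map(X, Y, alpha=1.0):
    """Fit Y ~ X @ W + b with a ridge penalty alpha (closed-form dual solution)."""
    x_mean, y_mean = X.mean(axis=0), Y.mean(axis=0)
    Xc, Yc = X - x_mean, Y - y_mean
    # Dual ridge: W = Xc^T (Xc Xc^T + alpha I)^-1 Yc. This is an N x N solve,
    # which is convenient when the number of programs N is small.
    gram = Xc @ Xc.T
    W = Xc.T @ np.linalg.solve(gram + alpha * np.eye(len(X)), Yc)
    b = y_mean - x_mean @ W
    return W, b


def columnwise_pearson(A, B):
    """Pearson correlation between matching columns of A and B."""
    A = A - A.mean(axis=0)
    B = B - B.mean(axis=0)
    return (A * B).sum(axis=0) / (np.linalg.norm(A, axis=0) * np.linalg.norm(B, axis=0))


def cross_validated_score(X, Y, n_folds=6, alpha=1.0, seed=0):
    """Mean column-wise correlation between held-out predictions and targets."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(X)), n_folds)
    predictions = np.zeros(Y.shape, dtype=float)
    for test_idx in folds:
        train_idx = np.setdiff1d(np.arange(len(X)), test_idx)
        W, b = fit_affine_map(X[train_idx], Y[train_idx], alpha)
        predictions[test_idx] = X[test_idx] @ W + b
    return float(columnwise_pearson(predictions, Y).mean())


# Synthetic stand-ins (hypothetical sizes): 72 programs, D_B = 40, D_M = 16.
rng = np.random.default_rng(0)
N, D_B, D_M = 72, 40, 16
brain = rng.standard_normal((N, D_B))
true_map = rng.standard_normal((D_B, D_M))
model = brain @ true_map + 0.1 * rng.standard_normal((N, D_M))
score = cross_validated_score(brain, model)
```

The same functions cover the Experiment 1 setting if the target is an N x 1 column holding one code metric, fit separately per subject.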

Details

This pipeline supports several major functions.

MVPA (multivariate pattern analysis) evaluates decoding of code properties or code model representations from their respective brain representations within a collection of canonical brain regions.
PRDA (program representation decoding analysis) evaluates decoding of code properties from code model representations.
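
As a rough picture of what "decoding a code property from a representation" means, the sketch below runs leave-one-out classification on synthetic vectors. The nearest-centroid classifier, the 8-dimensional vectors, and the function names are stand-ins chosen for brevity; they are assumptions, not the classifier or data used by this pipeline. Only the three labels (seq, for, if) come from the task-structure property listed later in this README.

```python
import random


def centroid(rows):
    """Element-wise mean of a list of equal-length vectors."""
    return [sum(column) / len(rows) for column in zip(*rows)]


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def leave_one_out_accuracy(features, labels):
    """Hold out each item in turn and classify it by its nearest class centroid."""
    correct = 0
    for held_out in range(len(features)):
        rows_by_label = {}
        for index, (row, label) in enumerate(zip(features, labels)):
            if index != held_out:
                rows_by_label.setdefault(label, []).append(row)
        centroids = {label: centroid(rows) for label, rows in rows_by_label.items()}
        predicted = min(
            centroids,
            key=lambda label: squared_distance(features[held_out], centroids[label]),
        )
        correct += predicted == labels[held_out]
    return correct / len(features)


# Synthetic representations: each class is shifted along its own dimension.
generator = random.Random(0)
class_names = ["seq", "for", "if"]
features, labels = [], []
for class_index, name in enumerate(class_names):
    for _ in range(24):
        vector = [generator.gauss(0.0, 0.5) for _ in range(8)]
        vector[class_index] += 3.0
        features.append(vector)
        labels.append(name)

accuracy = leave_one_out_accuracy(features, labels)
```

Accuracy well above the chance level of 1/3 indicates that the property is linearly recoverable from the representation, which is the question both MVPA and PRDA ask of their respective inputs.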

Reproducing paper results

This package provides an automated build using GNU Make. A single pipeline starts from an empty environment and produces ready-to-use software.

make setup # see 'make help' for more info

Pipelines also exist to run core analyses and generate figures and tables.

To run all core experiments from the paper, the following command will suffice after setup:

make analysis

To regenerate tables and figures from the paper, run the following after completing the analyses:

make paper

Note: these commands take roughly 8 hours to complete on a machine without a GPU.

Custom Analyses

The pipeline can also be used for custom analyses, via the following command line interface.

# basic examples
python braincode mvpa -f brain-MD -t task-structure # brain -> {task, model}
python braincode prda -f code-bert -t task-tokens # model -> task

# more complex example
python braincode mvpa -f brain-lang+brain-MD -t code-projection -d 64 -m SpearmanRho -p $BASE_PATH --score_only
# note how `+` operator can be used to join multiple representations via concatenation
# additional metrics are available in the `metrics.py` module
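
To run several such analyses in a row, the command lines above can be assembled programmatically. The helper below is a hypothetical convenience wrapper, not part of the braincode package. It uses only the flags that appear in the examples above (`-f`, `-t`, `-d`, `-m`, `-p`, `--score_only`), and it assumes `-d` takes an integer and `-p` takes a path, as the example suggests.

```python
import itertools


def build_command(analysis, features, target, dim=None, metric=None,
                  base_path=None, score_only=False):
    """Assemble one CLI invocation as an argument list (hypothetical helper)."""
    # Multiple feature names are joined with `+`, as in the example above.
    command = ["python", "braincode", analysis, "-f", "+".join(features), "-t", target]
    if dim is not None:
        command += ["-d", str(dim)]
    if metric is not None:
        command += ["-m", metric]
    if base_path is not None:
        command += ["-p", base_path]
    if score_only:
        command.append("--score_only")
    return command


# Sweep every listed brain region against two code properties.
regions = ["brain-MD", "brain-lang", "brain-vis", "brain-aud"]
targets = ["task-structure", "task-tokens"]
sweep = [
    build_command("mvpa", [region], target)
    for region, target in itertools.product(regions, targets)
]
```

Each entry of `sweep` can be passed to `subprocess.run(command, check=True)` from the repository root once setup has completed.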

Supported Brain Regions

brain-MD (Multiple Demand)
brain-lang (Language)
brain-vis (Visual)
brain-aud (Auditory)

Supported Code Features

Code Properties

test-code (code vs. sentences)
test-lang (english vs. japanese)
task-content (math vs. str) *datatype
task-structure (seq vs. for vs. if) *control flow
task-tokens (# of tokens in program) *static analysis
task-lines (# of runtime steps during execution) *dynamic analysis
task-bytes (# of bytecode ops executed)
task-nodes (# of nodes in AST)
task-halstead (function of tokens, operations, vocabulary)
task-cyclomatic (function of program control flow graph)
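
To make the static and dynamic properties concrete, here is a minimal sketch of how three of them could be measured with the Python standard library. The function names and the counting conventions (which token types count, tracing `line` events as runtime steps) are assumptions for illustration; the values used in the paper may be computed differently.

```python
import ast
import io
import sys
import tokenize

# Token types treated as "real" tokens; layout tokens (NEWLINE, INDENT, ...) are skipped.
COUNTED_TYPES = {tokenize.NAME, tokenize.NUMBER, tokenize.STRING, tokenize.OP}


def count_tokens(source):
    """Static: number of lexical tokens in the program text."""
    tokens = tokenize.generate_tokens(io.StringIO(source).readline)
    return sum(1 for token in tokens if token.type in COUNTED_TYPES)


def count_ast_nodes(source):
    """Static: number of nodes in the abstract syntax tree."""
    return sum(1 for _ in ast.walk(ast.parse(source)))


def count_runtime_steps(source, filename="<program>"):
    """Dynamic: number of line events observed while the program runs."""
    code = compile(source, filename, "exec")
    steps = 0

    def tracer(frame, event, arg):
        nonlocal steps
        if frame.f_code.co_filename != filename:
            return None  # ignore frames that do not belong to the program
        if event == "line":
            steps += 1
        return tracer

    previous = sys.gettrace()
    sys.settrace(tracer)
    try:
        exec(code, {})
    finally:
        sys.settrace(previous)
    return steps


straight_line = "x = 1\ny = 2\n"
loop = "x = 0\nfor i in range(3):\n    x += i\n"
```

The two sample programs are close in token count but differ in runtime steps, because the loop body executes several times. This is why the static and dynamic metrics are listed as separate properties.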

Code Models

code-projection (presence of tokens)
code-bow (token frequency)
code-tfidf (token and document frequency)
code-seq2seq¹ (sequence modeling)
code-xlnet² (autoregressive LM)
code-gpt2⁴ (autoregressive LM)
code-bert⁵ (masked LM)
code-roberta⁶ (masked LM)
code-transformer³ (LM + structure learning)
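
The three token-count baselines at the top of this list can be illustrated with a short sketch. It follows the one-line descriptions above (presence, frequency, frequency weighted by document frequency). The regex tokenizer, the unsmoothed `log(N / df)` weighting, and the function names are assumptions; the package's own implementations may differ.

```python
import math
import re
from collections import Counter


def tokenize_program(source):
    """Split source text into identifiers, integers, and single punctuation marks."""
    return re.findall(r"[A-Za-z_]\w*|\d+|[^\s\w]", source)


def build_vocabulary(programs):
    return sorted({token for program in programs for token in tokenize_program(program)})


def bow_vector(source, vocabulary):
    """Token frequency: how often each vocabulary token occurs."""
    counts = Counter(tokenize_program(source))
    return [counts[token] for token in vocabulary]


def presence_vector(source, vocabulary):
    """Token presence: 1 if the token occurs at all, else 0."""
    return [int(count > 0) for count in bow_vector(source, vocabulary)]


def tfidf_vectors(programs, vocabulary):
    """Token frequency scaled by log(N / document frequency)."""
    bows = [bow_vector(program, vocabulary) for program in programs]
    doc_freq = [sum(1 for bow in bows if bow[i] > 0) for i in range(len(vocabulary))]
    idf = [math.log(len(programs) / df) for df in doc_freq]
    return [[count * weight for count, weight in zip(bow, idf)] for bow in bows]


programs = ["x = 1", "x = x + 2"]
vocabulary = build_vocabulary(programs)
vectors = tfidf_vectors(programs, vocabulary)
```

Tokens that occur in every program, such as `x` and `=` here, receive zero weight under this TF-IDF variant, so the representation emphasizes what distinguishes one program from another.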
