[NeurIPS 2022] "Convergent Representations of Computer Programs in Human and Artificial Neural Networks" by Shashank Srikant*, Benjamin Lipkin*, Anna A. Ivanova, Evelina Fedorenko, Una-May O'Reilly.

ALFA-group/code-representations-ml-brain

Convergent Representations of Computer Programs in Human and Artificial Neural Networks

Resources for the paper Convergent Representations of Computer Programs in Human and Artificial Neural Networks by Shashank Srikant*, Benjamin Lipkin*, Anna A. Ivanova, Evelina Fedorenko, Una-May O'Reilly.

Published in NeurIPS 2022: https://openreview.net/forum?id=AqexjBWRQFx

Citation:

@inproceedings{SrikantLipkin2022,
	title={Convergent Representations of Computer Programs in Human and Artificial Neural Networks},
	author={Shashank Srikant* and Ben Lipkin* and Anna A Ivanova and Evelina Fedorenko and {Una-May} {O'R}eilly},
	booktitle={Advances in Neural Information Processing Systems},
	editor={Alice H. Oh and Alekh Agarwal and Danielle Belgrave and Kyunghyun Cho},
	year={2022},
	url={https://openreview.net/forum?id=AqexjBWRQFx}
}

The labs involved:

https://evlab.mit.edu/

https://alfagroup.csail.mit.edu/

For additional information, contact [email protected], [email protected], or [email protected], [email protected].

Related material, including slides, a talk, and a summary of our work, is available here.
The datasets and model checkpoints that this codebase downloads and analyzes are hosted at https://huggingface.co/datasets/benlipkin/braincode-neurips2022

Overview

The goal of this work is to relate brain representations of code to (1) specific code properties and (2) representations of code produced by language models trained on code.
In Experiment 1, we predict static and dynamic analysis metrics of code from the brain MRI recordings (each of dimension D_B) of 24 human subjects reading N = 72 unique Python programs, training a separate linear model for each subject and metric.
In Experiment 2, we learn affine maps from brain representations to the corresponding representations generated by code language models (each of dimension D_M) on these 72 programs.
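
To make the shapes in the two experiments concrete, here is a minimal sketch, assuming synthetic data and a closed-form ridge fit. It is not the repository's implementation: the function name `fit_affine_map`, the regularization strength `alpha`, and the sizes `D_B = 200` and `D_M = 16` are illustrative assumptions, and only N = 72 comes from the text above.

```python
import numpy as np


def fit_affine_map(X, Y, alpha=1.0):
    """Fit Y ~ X @ W + b with ridge regression (hypothetical helper).

    X: (N, D_B) brain representations, one row per program.
    Y: (N, D_M) targets, e.g. code model representations.
    alpha: assumed regularization strength, not taken from the paper.
    """
    x_mean = X.mean(axis=0)
    y_mean = Y.mean(axis=0)
    Xc = X - x_mean
    Yc = Y - y_mean
    # Dual form: the linear solve is N x N, which is cheap when N << D_B.
    K = Xc @ Xc.T
    dual = np.linalg.solve(K + alpha * np.eye(K.shape[0]), Yc)
    W = Xc.T @ dual
    # The intercept is what makes the map affine rather than purely linear.
    b = y_mean - x_mean @ W
    return W, b


rng = np.random.default_rng(0)
N, D_B, D_M = 72, 200, 16            # D_B and D_M are made-up sizes
X = rng.normal(size=(N, D_B))        # stand-in for brain representations
Y = rng.normal(size=(N, D_M))        # stand-in for code model representations
W, b = fit_affine_map(X, Y)
print(W.shape, b.shape)
```

Experiment 1 fits the same pattern with a single output column, for example `fit_affine_map(X, metric.reshape(-1, 1))` for one metric and one subject.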

Details

This pipeline supports several major functions.

  • MVPA (multivariate pattern analysis) evaluates decoding of code properties or code model representations from their respective brain representations within a collection of canonical brain regions.
  • PRDA (program representation decoding analysis) evaluates decoding of code properties from code model representations.
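
Both analyses amount to scoring how well a target can be decoded from held-out programs. The sketch below shows one way such a score could be computed, assuming k-fold cross-validation, ridge regression, and a Pearson correlation score. The fold count, `alpha`, and the function names are assumptions for illustration and may not match what the pipeline does.

```python
import numpy as np


def pearson(a, b):
    """Pearson correlation between two 1-D arrays (0.0 if either is constant)."""
    a = a - a.mean()
    b = b - b.mean()
    denom = np.sqrt((a @ a) * (b @ b))
    return float(a @ b / denom) if denom > 0 else 0.0


def cross_validated_score(X, y, n_folds=6, alpha=1.0, seed=0):
    """Decode y from X on held-out folds and score the pooled predictions."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(y))
    preds = np.empty(len(y), dtype=float)
    for test_idx in np.array_split(order, n_folds):
        train_idx = np.setdiff1d(order, test_idx)
        # Centering statistics come from the training fold only.
        mu_x, mu_y = X[train_idx].mean(axis=0), y[train_idx].mean()
        Xc = X[train_idx] - mu_x
        K = Xc @ Xc.T
        dual = np.linalg.solve(K + alpha * np.eye(len(train_idx)), y[train_idx] - mu_y)
        w = Xc.T @ dual
        preds[test_idx] = (X[test_idx] - mu_x) @ w + mu_y
    return pearson(preds, y)


rng = np.random.default_rng(0)
X = rng.normal(size=(72, 20))                      # synthetic features
y = X @ rng.normal(size=20) + 0.1 * rng.normal(size=72)
print(cross_validated_score(X, y))
```

Pooling predictions across folds before scoring is one design choice; averaging per-fold scores is an equally common alternative.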

Reproducing paper results

This package provides an automated build using GNU Make. A single pipeline starts from an empty environment and produces ready-to-use software.

make setup # see 'make help' for more info

Pipelines also exist to run core analyses and generate figures and tables.

To run all core experiments from the paper, the following command will suffice after setup:

make analysis

To regenerate tables and figures from the paper, run the following after completing the analyses:

make paper

Note: these commands take about 8 hours to complete on a machine without a GPU.

Custom Analyses

The pipeline can also be used for custom analyses, via the following command line interface.

# basic examples
python braincode mvpa -f brain-MD -t task-structure # brain -> {task, model}
python braincode prda -f code-bert -t task-tokens # model -> task

# more complex example
python braincode mvpa -f brain-lang+brain-MD -t code-projection -d 64 -m SpearmanRho -p $BASE_PATH --score_only
# note how `+` operator can be used to join multiple representations via concatenation
# additional metrics are available in the `metrics.py` module
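
The two ideas in the complex example, joining representations with `+` and scoring with a rank correlation, can be illustrated in plain Python. This is a sketch under assumptions: `join_representations`, `average_ranks`, and `spearman_rho` are hypothetical names, and the tie handling shown here (average ranks) may differ from the `SpearmanRho` metric in `metrics.py`.

```python
def join_representations(*blocks):
    """Concatenate feature rows from several representations, program by program."""
    n_rows = len(blocks[0])
    if any(len(block) != n_rows for block in blocks):
        raise ValueError("all representations must cover the same programs")
    return [sum((list(block[i]) for block in blocks), []) for i in range(n_rows)]


def average_ranks(values):
    """Return 1-based ranks, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = shared
        start = end + 1
    return ranks


def spearman_rho(a, b):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    ra, rb = average_ranks(a), average_ranks(b)
    n = len(ra)
    mean_a, mean_b = sum(ra) / n, sum(rb) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(ra, rb))
    var_a = sum((x - mean_a) ** 2 for x in ra)
    var_b = sum((y - mean_b) ** 2 for y in rb)
    if var_a == 0 or var_b == 0:
        return 0.0
    return cov / (var_a * var_b) ** 0.5


# Toy stand-ins for two regions' features over the same two programs.
brain_lang = [[0.1, 0.2], [0.3, 0.4]]
brain_md = [[0.5], [0.6]]
print(join_representations(brain_lang, brain_md))
print(spearman_rho([1, 2, 3, 4], [1, 8, 27, 64]))
```

Concatenation keeps the number of programs fixed and grows only the feature dimension, which is why both inputs must list the same programs in the same order.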

Supported Brain Regions

  • brain-MD (Multiple Demand)
  • brain-lang (Language)
  • brain-vis (Visual)
  • brain-aud (Auditory)

Supported Code Features

Code Properties

  • test-code (code vs. sentences)
  • test-lang (English vs. Japanese)
  • task-content (math vs. str) *datatype
  • task-structure (seq vs. for vs. if) *control flow
  • task-tokens (# of tokens in program) *static analysis
  • task-lines (# of runtime steps during execution) *dynamic analysis
  • task-bytes (# of bytecode ops executed)
  • task-nodes (# of nodes in AST)
  • task-halstead (function of tokens, operations, vocabulary)
  • task-cyclomatic (function of program control flow graph)
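
For intuition about the static and dynamic properties listed above, the following sketch computes three of them with the Python standard library. It is an approximation under assumptions: the function names are hypothetical, the choice of which token types to skip is a guess, and the repository may define `task-tokens`, `task-nodes`, and `task-lines` differently.

```python
import ast
import io
import sys
import tokenize

# Token types treated as layout rather than content (an assumption).
_LAYOUT = {
    tokenize.NEWLINE, tokenize.NL, tokenize.INDENT, tokenize.DEDENT,
    tokenize.COMMENT, tokenize.ENDMARKER,
}


def count_tokens(source):
    """Static: number of non-layout tokens in the program text."""
    tokens = tokenize.generate_tokens(io.StringIO(source).readline)
    return sum(1 for tok in tokens if tok.type not in _LAYOUT)


def count_ast_nodes(source):
    """Static: number of nodes in the program's abstract syntax tree."""
    return sum(1 for _ in ast.walk(ast.parse(source)))


def count_runtime_lines(source):
    """Dynamic: number of line events fired while the program runs.

    This executes the program, so only use it on trusted code.
    """
    steps = 0

    def tracer(frame, event, arg):
        nonlocal steps
        if frame.f_code.co_filename != "<program>":
            return None          # ignore frames from other files
        if event == "line":
            steps += 1
        return tracer

    code = compile(source, "<program>", "exec")
    previous = sys.gettrace()
    sys.settrace(tracer)
    try:
        exec(code, {})
    finally:
        sys.settrace(previous)   # always restore the earlier tracer
    return steps


program = "total = 0\nfor i in range(3):\n    total += i\n"
print(count_tokens(program), count_ast_nodes(program), count_runtime_lines(program))
```

The static counts depend only on the source text, while the runtime count grows with loop iterations, which is the distinction the list draws between static and dynamic analysis.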

Code Models

  • code-projection (presence of tokens)
  • code-bow (token frequency)
  • code-tfidf (token and document frequency)
  • code-seq2seq [1] (sequence modeling)
  • code-xlnet [2] (autoregressive LM)
  • code-gpt2 [4] (autoregressive LM)
  • code-bert [5] (masked LM)
  • code-roberta [6] (masked LM)
  • code-transformer [3] (LM + structure learning)
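
The three token-based baselines at the top of this list can be sketched without any model weights. The code below reads the parenthetical descriptions literally (binary presence, raw counts, and a smoothed tf-idf weighting). All function names, the whitespace tokenization, and the particular idf formula are assumptions, and the repository's `code-projection`, `code-bow`, and `code-tfidf` may be implemented differently.

```python
import math
from collections import Counter


def build_vocabulary(programs):
    """Sorted list of every distinct token across all tokenized programs."""
    return sorted({tok for prog in programs for tok in prog})


def presence_vector(tokens, vocab):
    """1 if the token occurs in the program at all, else 0."""
    present = set(tokens)
    return [1 if tok in present else 0 for tok in vocab]


def bow_vector(tokens, vocab):
    """Raw count of each vocabulary token in the program."""
    counts = Counter(tokens)
    return [counts[tok] for tok in vocab]


def tfidf_vectors(programs, vocab):
    """Term frequency scaled by a smoothed inverse document frequency."""
    n_docs = len(programs)
    doc_freq = Counter(tok for prog in programs for tok in set(prog))
    idf = {t: math.log((1 + n_docs) / (1 + doc_freq[t])) + 1 for t in vocab}
    rows = []
    for prog in programs:
        counts = Counter(prog)
        rows.append([counts[t] * idf[t] for t in vocab])
    return rows


# Whitespace splitting is a stand-in for a real tokenizer.
programs = [src.split() for src in ["x = 1", "x = x + 1"]]
vocab = build_vocabulary(programs)
print(vocab)
print(presence_vector(programs[1], vocab))
print(bow_vector(programs[1], vocab))
```

Tokens that appear in every program receive the smallest idf weight, so tf-idf emphasizes tokens that distinguish one program from the rest.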

License

License: MIT
