
Learning from Rules Generalizing Labeled Exemplars (ICLR 2020)

This repository provides an implementation of the experiments in our ICLR 2020 paper:

@inproceedings{
Awasthi2020Learning,
title={Learning from Rules Generalizing Labeled Exemplars},
author={Abhijeet Awasthi and Sabyasachi Ghosh and Rasna Goyal and Sunita Sarawagi},
booktitle={International Conference on Learning Representations},
year={2020},
url={https://openreview.net/forum?id=SkeuexBtDr}
}

Requirements

This code has been developed with:

  • python 3.6
  • tensorflow 1.12.0
  • numpy 1.17.2
  • snorkel 0.9.1
  • tensorflow_hub 0.7.0
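
A quick way to confirm your environment matches these versions (a minimal check; the exact pins may not be strictly necessary):

import tensorflow as tf
import numpy as np
import snorkel
import tensorflow_hub as hub

# Expected: 1.12.0, 1.17.2, 0.9.1, 0.7.0
print(tf.__version__, np.__version__, snorkel.__version__, hub.__version__)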

Data Description

We have currently released processed versions of 4 datasets used in our paper. These datasets can be found in the data/ directory.

data/TREC (or any other data dir) contains the following four pickle files:

  • d_processed.p (d set: labeled data -- referred to as the "L" set in the paper)
  • U_processed.p (U set: unlabeled data -- referred to as the "U" set in the paper as well)
  • validation_processed.p (validation data)
  • test_processed.p (test data)
  • NOTE: U_processed.p for YOUTUBE and MITR is unavailable on GitHub due to its larger size. You can download the entire data directory from this link.

The following objects are dumped inside each pickle file (a loading sketch follows the list):

  • x : feature representation of instances
    • shape : [num_instances, num_features]
  • l : class labels assigned by rules
    • shape : [num_instances, num_rules]
    • class labels belong to {0, 1, 2, ..., num_classes-1}
    • l[i][j] is the class label assigned by the jth rule to the ith instance
    • if the jth rule doesn't cover the ith instance, then l[i][j] = num_classes (our convention)
    • in snorkel, the convention is to set l[i][j] = -1 if the jth rule doesn't cover the ith instance
  • m : rule coverage mask
    • a binary matrix of shape [num_instances, num_rules]
    • m[i][j] = 1 if the jth rule covers the ith instance
    • m[i][j] = 0 otherwise
  • L : instance labels
    • shape : [num_instances, 1]
    • L[i] = label of the ith instance, if a label is available, i.e. if the instance is from the labeled set d
    • else, L[i] = num_classes if the instance comes from the unlabeled set U
    • class labels belong to {0, 1, 2, ..., num_classes-1}
  • d : a binary matrix of shape [num_instances, 1]
    • d[i] = 1 if the instance belongs to the labeled set (d), d[i] = 0 otherwise
    • d[i] = 1 for all instances in d_processed.p
    • d[i] = 0 for all instances in the other 3 pickles: {U,validation,test}_processed.p
  • r : a binary matrix of shape [num_instances, num_rules]
    • r[i][j] = 1 if the jth rule was associated with the ith instance
    • a highly sparse matrix
    • r is a zero matrix in all the pickles except d_processed.p
    • note that this is different from the rule coverage mask "m"
    • this matrix defines the coupled (rule, example) pairs
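
For concreteness, here is a minimal sketch for loading and sanity-checking one of these pickles. It assumes the six objects were dumped sequentially in the order listed above (x, l, m, L, d, r); see data_feeder_utils.py in src/hls for the loader the code actually uses.

import pickle

def load_processed(path, num_objects=6):
    # Assumption: x, l, m, L, d, r were pickled one after another;
    # check data_feeder_utils.py for the repository's actual load order.
    objs = []
    with open(path, "rb") as f:
        for _ in range(num_objects):
            objs.append(pickle.load(f))
    return objs

x, l, m, L, d, r = load_processed("data/TREC/d_processed.p")
num_classes = 6  # TREC has 6 classes
print(x.shape)  # [num_instances, num_features]
print(l.shape)  # [num_instances, num_rules]
# The conventions above imply a rule abstains (l == num_classes)
# exactly where the coverage mask is 0:
assert ((l == num_classes) == (m == 0)).all()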

Usage

Run the following from src/hls:

  • For reproducing the numbers in Table 1, Row 1
    • python3 get_rule_related_statistics.py ../../data/TREC 6 None
    • This also provides the Majority Vote accuracy in Table 2, Column 2 (Question dataset)
  • For training, saving, and testing a Snorkel model
    • python3 run_snorkel.py ../../data/TREC 6 None
    • Run this before any experiment that depends on Snorkel labels, if a Snorkel model is not already saved in the dataset directory.
    • We have released pre-trained Snorkel models in each dataset directory under the name "saved_label_model"
  • For reproducing (approximately) the numbers in Table 2, Column 2 (Question dataset)
    • use train_TREC.sh to train models with the different loss functions
    • use test_TREC.sh to test models with the different loss functions
    • the best hyperparameters are already set in these scripts
    • both of the above scripts use TREC.sh
  • For reproducing the numbers (approximately) on the other datasets, follow the same steps as above with TREC replaced by the dataset name.

Note:

  • f network refers to the classification network
  • w network refers to the rule network

File Descriptions in src/hls

  • analyze_w_predictions.py - Used for diagnostics (Old Precision vs. Denoised Precision in Figure 3)
  • checkpoint.py - Load/Save checkpoints (Uses code from checkmate)
  • config.py - All configuration options go here
  • data_feeders.py - all kinds of data handling for training and testing
  • data_feeder_utils.py - Load train/test data from processed pickles
  • data_utils.py - Other utilities related to data processing
  • generalized_cross_entropy_utils.py - Implementation of a noise-tolerant loss function
  • get_rule_related_statistics.py - For reproducing numbers in Table 1
  • hls_data_types.py - some basic data types used in data_feeders.py
  • hls_model.py - Creates train ops; all the loss functions are defined here
  • hls_test.py - Runs inference using f or w.
    • Inference on f tests the classification network (valid for all the loss functions)
    • Inference on w is used to analyze the denoised rule-precision obtained by w network
    • Inference on w is only meaningful for the ImplyLoss and Posterior Reg. methods, since only these involve a rule (w) network.
  • hls_train.py - Two modes:
    • f_d (simply trains f network on labeled data)
    • f_d_U : used for all other modes which utilize unlabeled data
  • learn2reweight_utils.py - utilities for implementing L2R method
  • main.py - entry point
  • metrics_utils.py - utilities for computing metrics
  • networks.py - implementation of f network (classification network) and w network (rule network)
  • pr_utils.py - utilities for implementing Posterior Reg. method
  • run_snorkel.py - training, saving and testing a snorkel model
  • snorkel_utils.py - utility to convert l in our format to l in snorkel's format (a minimal sketch of this conversion appears after this list)
  • test_"DATASET_NAME".sh - model testing (inference) script
    • e.g. test_TREC.sh runs inference for models trained on TREC dataset
  • "train_"DATASET_NAME".sh - model training script
    • e.g. train_TREC.sh trains models on TREC dataset
  • "DATASET_NAME".sh - test_"DATASET_NAME".sh and train_"DATASET_NAME".sh use "DATASET_NAME".sh
  • utils.py - misc. utilities
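
As a concrete illustration of the conversion snorkel_utils.py performs, here is a minimal sketch of mapping l from this repository's convention (l[i][j] = num_classes when the jth rule doesn't cover the ith instance) to snorkel's convention of -1. The helper name to_snorkel_format is hypothetical; see snorkel_utils.py for the actual implementation.

import numpy as np

def to_snorkel_format(l, num_classes):
    # Hypothetical helper, not the repository's own function:
    # replace the "rule does not cover" marker (num_classes) with snorkel's -1.
    l_snorkel = np.array(l, copy=True)
    l_snorkel[l_snorkel == num_classes] = -1
    return l_snorkel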