Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add sequence_tagging example #302

Open
wants to merge 2 commits into
base: master
Choose a base branch
from
Open

Add sequence_tagging example #302

wants to merge 2 commits into from

Conversation

gpengzhi
Copy link
Collaborator

@gpengzhi gpengzhi commented Mar 3, 2020

Adapted from sequence_tagging in texar-tf.

@gpengzhi gpengzhi requested a review from huzecong March 3, 2020 04:37
@codecov
Copy link

codecov bot commented Mar 3, 2020

Codecov Report

Merging #302 into master will not change coverage by %.
The diff coverage is n/a.

Impacted file tree graph

@@           Coverage Diff           @@
##           master     #302   +/-   ##
=======================================
  Coverage   79.76%   79.76%           
=======================================
  Files         133      133           
  Lines       11122    11122           
=======================================
  Hits         8872     8872           
  Misses       2250     2250           

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 520018c...2c6dde0. Read the comment docs.

examples/sequence_tagging/README.md Show resolved Hide resolved

import texar.torch as tx

# pylint: disable=redefined-outer-name, unused-variable
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why are these necessary? I can imagine unused-variable being required for the constants, but why redefined-outer-name? It's also better practice to add a corresponding pylint: enable comment after where it's no longer required.

DIGIT_RE = re.compile(r"\d")


def create_vocabs(train_path, dev_path, test_path, normalize_digits=True,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add type annotations for functions.

Comment on lines +70 to +92
# Prepares/loads data
if config.load_glove:
print('loading GloVe embedding...')
glove_dict = load_glove(embedding_path, EMBEDD_DIM)
else:
glove_dict = None

(word_vocab, char_vocab, ner_vocab), (i2w, i2n) = create_vocabs(
train_path, dev_path, test_path, glove_dict=glove_dict)

data_train = read_data(train_path, word_vocab, char_vocab, ner_vocab)
data_dev = read_data(dev_path, word_vocab, char_vocab, ner_vocab)
data_test = read_data(test_path, word_vocab, char_vocab, ner_vocab)

scale = np.sqrt(3.0 / EMBEDD_DIM)
word_vecs = np.random.uniform(
-scale, scale, [len(word_vocab), EMBEDD_DIM]).astype(np.float32)
if config.load_glove:
word_vecs = construct_init_word_vecs(word_vocab, word_vecs, glove_dict)

scale = np.sqrt(3.0 / CHAR_DIM)
char_vecs = np.random.uniform(
-scale, scale, [len(char_vocab), CHAR_DIM]).astype(np.float32)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Consider moving these into main as well.

examples/sequence_tagging/ner.py Show resolved Hide resolved
return word_vecs


class CoNLLReader:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we modify this to use tx.data.Dataset? This would eliminate the need for separate read_data and iterate_batch methods.

ner_tags, ner_ids)


class NERInstance:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This and Sentence below can be changed to NamedTuples, which also supports custom methods.

yield wid_inputs, cid_inputs, nid_inputs, masks, lengths


def load_glove(filename, emb_dim, normalize_digits=True):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought we had a load_glove method inside tx.data? What are the differences here?

self.dense_2 = nn.Linear(in_features=config.tag_space,
out_features=len(ner_vocab))

def forward(self, inputs, chars, targets, masks, seq_lengths, mode):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Type annotations.

Comment on lines +26 to +30
def start(self, file_path):
self.__source_file = open(file_path, 'w', encoding='utf-8')

def close(self):
self.__source_file.close()
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could change these to __enter__ and __exit__ and use the with writer.open(path) as f context manager pattern. It's also fine to keep it as is if you're more comfortable with this.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants