5.1.1.4.1.6. FedEval.dataset.Sentiment140

5.1.1.4.1.6.1. Module Contents

5.1.1.4.1.6.1.1. Classes

sentiment140

By default, FedData produces datasets for horizontal federated learning

5.1.1.4.1.6.1.2. Functions

normalize_text(text)

Final cleanup of text by removing non-alpha characters (such as newlines) and non-Latin characters, then stripping.

hashtags_preprocess(x)

Creates a hashtag token and processes the formatting of hashtags, i.e. separates uppercase words

allcaps_preprocess(x)

If a word is written in uppercase, change it to lowercase and tag it with <allcaps>.

glove_preprocess(text)

To be consistent with use of GloVe vectors, we replicate most of their preprocessing.

tweet2Vec(tweet, word2vectors)

Takes in a processed tweet, tokenizes it, converts to GloVe embeddings

FedEval.dataset.Sentiment140.normalize_text(text)

Final cleanup of text: removes non-alpha characters (such as newlines and other whitespace characters) and non-Latin characters, then strips the result.

inputs:
  • text (str): tweet to be processed

return:
  • text (str): preprocessed tweet
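The cleanup described above can be sketched roughly as follows. This is not the FedEval implementation: the function name `normalize_text_sketch`, the regular expressions, and the choice to treat "non-Latin" as "non-ASCII" are all assumptions made for illustration.

```python
import re


def normalize_text_sketch(text):
    """Hypothetical stand-in for normalize_text (not the FedEval code)."""
    # Turn newlines, tabs and other whitespace runs into a single space.
    text = re.sub(r"\s+", " ", text)
    # Drop characters outside ASCII (assumed approximation of "non-Latin").
    text = text.encode("ascii", errors="ignore").decode("ascii")
    # Removing characters can leave double spaces behind; collapse them.
    text = re.sub(r" {2,}", " ", text)
    # Strip leading and trailing whitespace.
    return text.strip()


cleaned = normalize_text_sketch("  hello\n\tworld 你好 ")
```

With this sketch, `cleaned` is the string `hello world`.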

FedEval.dataset.Sentiment140.hashtags_preprocess(x)

Creates a hashtag token and processes the formatting of hashtags, i.e. separates uppercase words if possible and converts all letters to lowercase.

inputs:
  • x (regex group): x.group(1) contains the text associated with a hashtag

Returns:

preprocessed text

Return type:

  • text (str)
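Because the argument is a regex match object, a function like this is typically passed as the replacement callback to `re.sub`. The sketch below shows one way that could look. The name `hashtags_sketch`, the `#(\S+)` pattern, the exact `<hashtag>` token spelling, and the handling of all-uppercase hashtags are assumptions, not taken from FedEval.

```python
import re


def hashtags_sketch(match):
    """Hypothetical callback: match.group(1) holds the hashtag body."""
    body = match.group(1)
    if body.isupper():
        # Entirely uppercase hashtag: nothing to split, just tag it.
        return "<hashtag> " + body.lower() + " <allcaps>"
    # Split before every uppercase letter, e.g. "MachineLearning".
    words = [w for w in re.split(r"(?=[A-Z])", body) if w]
    return "<hashtag> " + " ".join(w.lower() for w in words)


# Assumed pattern: '#' followed by non-whitespace characters.
result = re.sub(r"#(\S+)", hashtags_sketch, "I love #MachineLearning")
```

Here `result` is `I love <hashtag> machine learning`.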

FedEval.dataset.Sentiment140.allcaps_preprocess(x)

If a word is written in uppercase, change it to lowercase and tag it with <allcaps>.

inputs:
  • x (regex group): x.group() contains the text

Returns:

preprocessed text

Return type:

  • text (str)
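A minimal sketch of such a callback, again meant for use with `re.sub`. The name `allcaps_sketch` and the pattern that decides what counts as an uppercase word (two or more consecutive capital letters) are assumptions for illustration only.

```python
import re


def allcaps_sketch(match):
    """Hypothetical callback: match.group() holds the uppercase word."""
    # Lowercase the word and append the <allcaps> marker token.
    return match.group().lower() + " <allcaps>"


# Assumed pattern: whole words made of at least two capital letters.
result = re.sub(r"\b[A-Z]{2,}\b", allcaps_sketch, "this is GREAT news")
```

Here `result` is `this is great <allcaps> news`. Single capital letters such as "I" are left alone by the assumed pattern.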

FedEval.dataset.Sentiment140.glove_preprocess(text)

To be consistent with use of GloVe vectors, we replicate most of their preprocessing. Therefore the word distribution should be close to the one used to train the embeddings. Adapted from https://nlp.stanford.edu/projects/glove/preprocess-twitter.rb

inputs:
  • text (str): tweet to be processed

Returns:

preprocessed tweet

Return type:

  • text (str)
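To give a feel for this style of preprocessing, the sketch below replaces URLs, user mentions and numbers with placeholder tokens before lowercasing. It covers only a small subset of what the referenced script does. The function name `glove_preprocess_sketch`, the regular expressions, and the lowercase token spellings are assumptions, not the FedEval code.

```python
import re


def glove_preprocess_sketch(text):
    """Hypothetical, reduced version of GloVe-style tweet preprocessing."""
    # Replace URLs first so digits inside them are not tagged as numbers.
    text = re.sub(r"https?://\S+|www\.\S+", "<url>", text)
    # Replace @mentions with a generic user token.
    text = re.sub(r"@\w+", "<user>", text)
    # Replace integers and simple decimals with a number token.
    text = re.sub(r"\d+(?:[.,]\d+)*", "<number>", text)
    # Lowercase everything else (the tokens are already lowercase).
    return text.lower()


out = glove_preprocess_sketch("@Bob paid 20 dollars, see https://example.com/a1")
```

Here `out` is `<user> paid <number> dollars, see <url>`. The order of the substitutions matters: swapping the URL and number steps would corrupt URLs that contain digits.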

FedEval.dataset.Sentiment140.tweet2Vec(tweet, word2vectors)

Takes in a processed tweet, tokenizes it, converts the tokens to GloVe embeddings (or zero vectors if words are unknown), and applies average pooling to obtain one vector for that tweet.

inputs:
  • tweet (str): one raw tweet from the dataset

  • word2vectors (dict): GloVe words mapped to GloVe vectors

Returns:

resulting sentence vector (shape: (200,))

Return type:

  • embeddings (np.array)
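The average-pooling step can be sketched as follows. Whitespace tokenization, the `dim` parameter, the all-zero result for an empty tweet, and the name `tweet_to_vec_sketch` are assumptions made for this example and may differ from what `tweet2Vec` actually does.

```python
import numpy as np


def tweet_to_vec_sketch(tweet, word2vectors, dim=200):
    """Hypothetical average pooling of word vectors for one tweet."""
    tokens = tweet.split()  # assumed tokenizer: split on whitespace
    if not tokens:
        # Assumed behaviour for empty input: return a zero vector.
        return np.zeros(dim)
    # Unknown words contribute a zero vector, as described above.
    vectors = [word2vectors.get(tok, np.zeros(dim)) for tok in tokens]
    # Average over the token axis to get a single vector of shape (dim,).
    return np.mean(vectors, axis=0)


# Toy 3-dimensional "embeddings" instead of real 200-dimensional GloVe vectors.
toy = {"good": np.array([1.0, 2.0, 3.0]), "day": np.array([3.0, 2.0, 1.0])}
vec = tweet_to_vec_sketch("good unknown day", toy, dim=3)
```

In this toy case the unknown word pulls the average toward zero, so every component of `vec` equals 4/3. With real GloVe Twitter vectors of dimension 200, the result would have shape `(200,)`.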

class FedEval.dataset.Sentiment140.sentiment140

Bases: FedEval.dataset.FedDataBase.FedData

By default, FedData produces datasets for horizontal federated learning

load_data()