5.1.1.4.1.6. `FedEval.dataset.Sentiment140`

5.1.1.4.1.6.1. Module Contents

5.1.1.4.1.6.1.1. Classes

`sentiment140`	By default, FedData produces datasets for horizontal federated learning

5.1.1.4.1.6.1.2. Functions

`normalize_text`(text)	Final cleanup of text by removing non-alpha characters (such as newlines) and non-Latin characters, then stripping.
`hashtags_preprocess`(x)	Creates a hashtag token and normalizes the formatting of hashtags, i.e. separates uppercase words if possible and converts all letters to lowercase.
`allcaps_preprocess`(x)	If the text/word is written in uppercase, change it to lowercase and tag it with <allcaps>.
`glove_preprocess`(text)	To be consistent with use of GloVe vectors, we replicate most of their preprocessing.
`tweet2Vec`(tweet, word2vectors)	Takes in a processed tweet, tokenizes it, converts to GloVe embeddings

FedEval.dataset.Sentiment140.normalize_text(text)

Final cleanup of text by removing non-alpha characters such as newlines ('\n') and similar whitespace, as well as non-Latin characters, then stripping leading and trailing whitespace.

inputs:

text (str): tweet to be processed

return:

text (str): preprocessed tweet
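
The cleanup described above can be illustrated with a minimal sketch. This is not the FedEval implementation: the name `normalize_text_sketch` is hypothetical, and the exact set of characters kept (ASCII letters, digits, and the angle brackets used by tokens such as <allcaps>) is an assumption made for illustration.

```python
import re


def normalize_text_sketch(text: str) -> str:
    """Hypothetical stand-in for a final text cleanup step."""
    # Assumption: keep ASCII letters, digits, whitespace, and the angle
    # brackets used by special tokens; replace everything else with a space.
    text = re.sub(r"[^A-Za-z0-9<>\s]", " ", text)
    # Collapse newlines, tabs, and repeated spaces into a single space,
    # then strip leading and trailing whitespace.
    return re.sub(r"\s+", " ", text).strip()


cleaned = normalize_text_sketch("  hello\nworld \u4f60\u597d!  ")
```

Replacing disallowed characters with a space, rather than deleting them, avoids accidentally joining two neighbouring words.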

FedEval.dataset.Sentiment140.hashtags_preprocess(x)

Creates a hashtag token and normalizes the formatting of hashtags, i.e. separates uppercase words if possible and converts all letters to lowercase.

inputs:

x (regex group): x.group(1) contains the text associated with a hashtag

Returns:

preprocessed text

Return type:

text (str)
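
Because `x` is a regex match object, a function of this kind is typically passed as the replacement callback to `re.sub`. The sketch below shows that pattern under stated assumptions: the name `hashtag_callback_sketch`, the `#(\S+)` pattern, and the choice to tag all-uppercase hashtags with <allcaps> are illustrative and may differ from what FedEval does.

```python
import re


def hashtag_callback_sketch(match: re.Match) -> str:
    """Hypothetical callback: match.group(1) is the hashtag body."""
    body = match.group(1)
    if body.isupper():
        # Assumption: an all-uppercase hashtag is kept as one word.
        return f"<hashtag> {body.lower()} <allcaps>"
    # Insert a space before every uppercase letter except the first
    # character, which splits CamelCase hashtags into separate words.
    words = re.sub(r"(?<!^)(?=[A-Z])", " ", body)
    return f"<hashtag> {words.lower()}"


# Assumed pattern: '#' followed by a run of non-whitespace characters.
camel = re.sub(r"#(\S+)", hashtag_callback_sketch, "love #GoodMorning")
upper = re.sub(r"#(\S+)", hashtag_callback_sketch, "#NLP rocks")
```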

FedEval.dataset.Sentiment140.allcaps_preprocess(x)

If the text/word is written in uppercase, change it to lowercase and tag it with <allcaps>.

inputs:

x (regex group): x.group() contains the text

Returns:

preprocessed text

Return type:

text (str)
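
As with the hashtag handler, the match-object argument suggests use as a `re.sub` callback. The following is a minimal sketch, not the library's code: the name `allcaps_callback_sketch` and the pattern (whole words of two or more uppercase ASCII letters) are assumptions.

```python
import re


def allcaps_callback_sketch(match: re.Match) -> str:
    """Hypothetical callback: match.group() is the uppercase word."""
    # Lowercase the word and append the tag so the emphasis is preserved
    # as a separate token.
    return match.group().lower() + " <allcaps>"


# Assumed pattern: whole words made of at least two uppercase letters.
ALLCAPS_PATTERN = re.compile(r"\b[A-Z]{2,}\b")

tagged = ALLCAPS_PATTERN.sub(allcaps_callback_sketch, "this is GREAT news")
```

Requiring at least two letters keeps single capitals such as "I" from being tagged.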

FedEval.dataset.Sentiment140.glove_preprocess(text)

To be consistent with use of GloVe vectors, we replicate most of their preprocessing. Therefore the word distribution should be close to the one used to train the embeddings. Adapted from https://nlp.stanford.edu/projects/glove/preprocess-twitter.rb

inputs:

text (str): tweet to be processed

Returns:

preprocessed tweet

Return type:

text (str)
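
To give a feel for this kind of preprocessing, here is a reduced sketch covering only three substitutions (URLs, user mentions, numbers). It is an assumption-laden illustration rather than a port of the referenced Ruby script or of FedEval's function: the name `glove_preprocess_sketch` and the simplified patterns are hypothetical, and the real pipeline handles more cases.

```python
import re


def glove_preprocess_sketch(text: str) -> str:
    """Hypothetical, reduced version of GloVe-style tweet preprocessing."""
    # Replace links with a single placeholder token.
    text = re.sub(r"https?://\S+|www\.\S+", "<url>", text)
    # Replace user mentions such as @bob.
    text = re.sub(r"@\w+", "<user>", text)
    # Replace integers and simple decimals, with an optional sign.
    text = re.sub(r"[-+]?\d+(?:[.,]\d+)*", "<number>", text)
    # Lowercase the result so it matches a lowercase embedding vocabulary.
    return text.lower()


processed = glove_preprocess_sketch("@bob check https://t.co/abc 42 Times")
```

URLs are replaced first so that digits inside a link are not turned into separate <number> tokens.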

FedEval.dataset.Sentiment140.tweet2Vec(tweet, word2vectors)

Takes in a processed tweet, tokenizes it, converts to GloVe embeddings (or zeroes if words are unknown) and applies average pool to obtain one vector for that tweet.

inputs:

tweet (str): one raw tweet from the dataset
word2vectors (dict): GloVe words mapped to GloVe vectors

Returns:

resulting sentence vector (shape: (200,))

Return type:

embeddings (np.array)
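
The average-pooling step can be sketched as follows. This is an illustration under assumptions, not the FedEval code: the name `tweet_to_vector_sketch`, whitespace tokenization, and the `dim` parameter are hypothetical, and unknown words are assumed to contribute a zero vector that still counts toward the average.

```python
import numpy as np


def tweet_to_vector_sketch(tweet, word2vectors, dim=200):
    """Hypothetical average pooling of word vectors for one tweet."""
    # Assumption: simple whitespace tokenization.
    tokens = tweet.split()
    if not tokens:
        # An empty tweet maps to the zero vector.
        return np.zeros(dim)
    # Look up each token; unknown words fall back to zeros.
    vectors = [
        np.asarray(word2vectors.get(tok, np.zeros(dim)), dtype=float)
        for tok in tokens
    ]
    # Average over tokens to obtain one fixed-size vector.
    return np.mean(vectors, axis=0)


# Toy 3-dimensional vocabulary, used only for illustration.
toy_vectors = {
    "good": np.array([1.0, 0.0, 2.0]),
    "day": np.array([3.0, 2.0, 0.0]),
}
vec = tweet_to_vector_sketch("good day zzz", toy_vectors, dim=3)
```

With 200-dimensional GloVe vectors and the default `dim`, the result has shape (200,), matching the return description above.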

class FedEval.dataset.Sentiment140.sentiment140

Bases: FedEval.dataset.FedDataBase.FedData

By default, FedData produces datasets for horizontal federated learning

load_data()
