5.1.1.4.1.6. FedEval.dataset.Sentiment140
5.1.1.4.1.6.1. Module Contents
5.1.1.4.1.6.1.1. Classes
sentiment140 | By default, FedData produces datasets for horizontal federated learning
5.1.1.4.1.6.1.2. Functions
|
normalize_text(text) | Final cleanup of text by removing non-alpha characters (such as line breaks and other whitespace) and non-Latin characters, then stripping.
hashtags_preprocess(x) | Creates a hashtag token and normalizes hashtag formatting: separates uppercase-delimited words where possible and converts all letters to lowercase.
|
allcaps_preprocess(x) | If a word is written in uppercase, change it to lowercase and tag it with <allcaps>.
|
|
glove_preprocess(text) | To be consistent with use of GloVe vectors, we replicate most of their preprocessing.
|
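tweet2Vec(tweet, word2vectors) | Takes in a processed tweet, tokenizes it, converts it to GloVe embeddings (or zeroes if words are unknown) and applies average pooling to obtain one vector for that tweet.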
- FedEval.dataset.Sentiment140.normalize_text(text)
Final cleanup of text by removing non-alpha characters (such as line breaks and other whitespace) and non-Latin characters, then stripping.
- inputs:
text (str): tweet to be processed
- return:
text (str): preprocessed tweet
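The cleanup described above could look roughly like the following. This is a minimal sketch, not FedEval's implementation: the name `normalize_text_sketch` is hypothetical, and the choice to keep only ASCII letters, spaces and the angle brackets used by tags such as `<allcaps>` is an assumption.

```python
import re


def normalize_text_sketch(text: str) -> str:
    """Hypothetical stand-in for normalize_text; not FedEval's code."""
    # Turn line breaks, tabs and other whitespace runs into single spaces.
    text = re.sub(r"\s+", " ", text)
    # Assumption: keep ASCII letters, spaces and the <> of special tags;
    # everything else (digits, punctuation, non-Latin characters) is dropped.
    text = re.sub(r"[^A-Za-z<> ]", "", text)
    # Removing characters can leave double spaces behind; collapse them.
    text = re.sub(r" {2,}", " ", text)
    return text.strip()
```

For example, `normalize_text_sketch("  hello\n\tworld \u4f60\u597d!  ")` returns `"hello world"`.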
- FedEval.dataset.Sentiment140.hashtags_preprocess(x)
Creates a hashtag token and normalizes hashtag formatting: separates uppercase-delimited words where possible and converts all letters to lowercase.
- inputs:
x (regex group): x.group(1) contains the text associated with a hashtag
- Returns:
preprocessed text
- Return type:
text (str)
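Because `x` is a regex match object, this function is presumably meant to be passed as the replacement callback to `re.sub`. The sketch below illustrates that pattern under stated assumptions: the name `hashtag_sketch`, the pattern `#(\S+)` and the exact tag strings are hypothetical, loosely following the GloVe Twitter preprocessing script rather than FedEval's source.

```python
import re


def hashtag_sketch(match: re.Match) -> str:
    """Hypothetical re.sub callback; match.group(1) is the hashtag body."""
    body = match.group(1)
    if body.isupper():
        # Entirely uppercase hashtag: keep it as one word and tag it.
        return f"<hashtag> {body.lower()} <allcaps>"
    # Split before every uppercase letter: "MachineLearning" -> Machine, Learning.
    words = [w for w in re.split(r"(?=[A-Z])", body) if w]
    return "<hashtag> " + " ".join(w.lower() for w in words)


# Assumed pattern: '#' followed by non-whitespace characters.
HASHTAG_PATTERN = re.compile(r"#(\S+)")
processed = HASHTAG_PATTERN.sub(hashtag_sketch, "loving #MachineLearning and #NLP")
```

Here `processed` is `"loving <hashtag> machine learning and <hashtag> nlp <allcaps>"`.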
- FedEval.dataset.Sentiment140.allcaps_preprocess(x)
If a word is written in uppercase, change it to lowercase and tag it with <allcaps>.
- inputs:
x (regex group): x.group() contains the text
- Returns:
preprocessed text
- Return type:
text (str)
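As with the hashtag helper, a function taking a regex group is typically used as a `re.sub` callback. The following is a minimal sketch of that usage; `allcaps_sketch` and the pattern for "uppercase word" (two or more capital letters between word boundaries) are assumptions, not taken from FedEval.

```python
import re


def allcaps_sketch(match: re.Match) -> str:
    """Hypothetical re.sub callback; match.group() is the uppercase word."""
    return match.group().lower() + " <allcaps>"


# Assumed definition of an all-caps word: at least two capital letters.
ALLCAPS_PATTERN = re.compile(r"\b[A-Z]{2,}\b")
tagged = ALLCAPS_PATTERN.sub(allcaps_sketch, "I LOVE it")
```

Here `tagged` is `"I love <allcaps> it"`; the single-letter word `I` is left alone because the assumed pattern requires at least two capitals.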
- FedEval.dataset.Sentiment140.glove_preprocess(text)
To be consistent with use of GloVe vectors, we replicate most of their preprocessing. Therefore the word distribution should be close to the one used to train the embeddings. Adapted from https://nlp.stanford.edu/projects/glove/preprocess-twitter.rb
- inputs:
text (str): tweet to be processed
- Returns:
preprocessed tweet
- Return type:
text (str)
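To give a feel for what GloVe-style tweet preprocessing involves, here is a small sketch covering a handful of the substitutions found in the referenced Ruby script (URLs, user mentions, numbers, repeated punctuation, elongated words). It is an illustration under assumptions, not FedEval's `glove_preprocess`: the function name, the exact regular expressions and the subset of rules are all hypothetical.

```python
import re


def glove_preprocess_sketch(text: str) -> str:
    """Hypothetical subset of GloVe-style tweet normalization."""
    # URLs become a single placeholder token.
    text = re.sub(r"https?://\S+|www\.\S+", "<url>", text)
    # User mentions such as @Bob become <user>.
    text = re.sub(r"@\w+", "<user>", text)
    # Numbers (optionally signed, with separators) become <number>.
    text = re.sub(r"[-+]?[.\d]*\d+[:,.\d]*", "<number>", text)
    # Repeated punctuation: "!!!" -> "! <repeat>".
    text = re.sub(r"([!?.]){2,}", r"\1 <repeat>", text)
    # Elongated words: "soooo" -> "so <elong>".
    text = re.sub(r"\b(\S*?)(.)\2{2,}\b", r"\1\2 <elong>", text)
    return text.lower()
```

For example, `glove_preprocess_sketch("@Bob check https://t.co/abc at 10 soooo good!!!")` returns `"<user> check <url> at <number> so <elong> good! <repeat>"`. The order matters: URLs are replaced first so that later rules do not rewrite fragments of a link.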
- FedEval.dataset.Sentiment140.tweet2Vec(tweet, word2vectors)
Takes in a processed tweet, tokenizes it, converts the tokens to GloVe embeddings (or zeroes if words are unknown) and applies average pooling to obtain one vector for that tweet.
- inputs:
tweet (str): one raw tweet from the dataset
word2vectors (dict): GloVe words mapped to GloVe vectors
- Returns:
resulting sentence vector (shape: (200,))
- Return type:
embeddings (np.array)
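The tokenize, look up, and average steps described above can be sketched as follows. This is an illustration only: `tweet_to_vec_sketch` is a hypothetical name, whitespace splitting is an assumed tokenizer, and the dimension of 200 is taken from the documented output shape.

```python
import numpy as np

EMBEDDING_DIM = 200  # matches the documented output shape (200,)


def tweet_to_vec_sketch(tweet: str, word2vectors: dict) -> np.ndarray:
    """Hypothetical average-pooling sketch; not FedEval's tweet2Vec."""
    zero = np.zeros(EMBEDDING_DIM)
    tokens = tweet.split()  # assumption: simple whitespace tokenization
    if not tokens:
        # An empty tweet has nothing to average; return the zero vector.
        return zero
    # Unknown words contribute a zero vector, as the description states.
    vectors = [word2vectors.get(token, zero) for token in tokens]
    # Average pooling: element-wise mean over all token vectors.
    return np.mean(vectors, axis=0)


# Toy embedding table standing in for real GloVe vectors.
toy_vectors = {"good": np.ones(EMBEDDING_DIM), "day": np.full(EMBEDDING_DIM, 3.0)}
sentence_vector = tweet_to_vec_sketch("good day unknownword", toy_vectors)
```

Note that in this sketch unknown words still count toward the average, so a tweet with many out-of-vocabulary tokens is pulled toward the zero vector.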
- class FedEval.dataset.Sentiment140.sentiment140
Bases:
FedEval.dataset.FedDataBase.FedData
By default, FedData produces datasets for horizontal federated learning
- load_data()