Khoiriya Latifah
Text Analytics (Sentiment Analysis Using Twitter Data Crawling)
This sentiment analysis utilizes Twitter data crawling; the data is obtained from the Twitter API. This case discusses the processing of unstructured text data in simple steps: preprocessing (which involves several stages), text vectorization, and modeling using an LSTM.
We used an LSTM model because LSTMs are particularly effective for sequential and time-series data, where the order and context of data points matter. This makes them well suited for tasks such as language modeling and sentiment analysis.
LSTMs have shown strong performance across a range of natural language processing tasks, achieving state-of-the-art results on tasks such as sentiment analysis.
The steps to preprocess unstructured text data:
• Case Folding
The process of converting all letters in the text to a uniform case, typically lowercase.
• Stopword Removal
Removing words that are considered to carry little meaning in the text data.
• Stemming / Lemmatization
The process of transforming words into their root forms.
Example: membuat -> buat, menulis -> tulis, etc.
• Slangword Handling
Addressing non-formal words, everyday words, abbreviations, or colloquialisms present in the text by transforming them into formal word forms.
Example: Yg -> yang, krn -> karena, etc.
def formaldanstop(t):
    # Tokenize the sentence into words
    t = word_tokenize(t)
    # Replace slang words with their formal forms
    for i, x in enumerate(t):
        if x in SlangS.keys():
            t[i] = SlangS[x]
    # Remove stopwords and rejoin into a single string
    return ' '.join(x for x in t if x not in stops)
import pandas as pd

# Load the Indonesian stopword list and add a few extra informal stopwords
id_stopword_dict = pd.read_csv('/content/drive/My Drive/dataset/stopwordbahasa.csv', header=None)
id_stopword_dict = id_stopword_dict.rename(columns={0: 'stopword'})
stopwords_new = pd.DataFrame(['sih', 'nya', 'iya', 'nih', 'biar', 'tau', 'kayak', 'banget'], columns=['stopword'])
id_stopword_dict = pd.concat([id_stopword_dict, stopwords_new]).reset_index()
id_stopword_dict = pd.DataFrame(id_stopword_dict['stopword'])
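The helper above relies on two objects that are not defined in this excerpt: stops (the stopword set) and SlangS (a slang-to-formal-word dictionary). A minimal sketch of how they might be built from the stopword dataframe loaded above and a hypothetical two-column slang CSV (the slangword.csv path is an assumption):

import pandas as pd
from nltk import word_tokenize  # used by formaldanstop above

# Stopword set built from the loaded Indonesian stopword dataframe
stops = set(id_stopword_dict['stopword'].astype(str))

# Hypothetical slang dictionary CSV: column 0 = slang form, column 1 = formal form
slang_df = pd.read_csv('/content/drive/My Drive/dataset/slangword.csv', header=None)
SlangS = dict(zip(slang_df[0].astype(str), slang_df[1].astype(str)))

# Example: "yg" -> "yang", "krn" -> "karena", then stopwords are dropped
print(formaldanstop('yg penting krn bagus banget'))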
import nltk
nltk.download('punkt')
from nltk import word_tokenize
from Sastrawi.Stemmer.StemmerFactory import StemmerFactory

# Create an Indonesian stemmer
factory = StemmerFactory()
stemmer = factory.create_stemmer()
import re

text_preproc1 = []
for x in text:
    # Remove URLs (http/https and www) if any
    x = re.sub(r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*(),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+', ' ', x)
    x = re.sub(r'((www\.[^\s]+)|(https?://[^\s]+))', ' ', x)
    # Remove links to attached pictures (pic.twitter.com/...)
    x = re.sub(r'pic\.twitter\.com/\w+', '', x)
    # Case folding: convert to lower case
    x = x.lower()
    # Remove symbols and punctuation, keeping only alphanumeric characters
    x = re.sub(r'[^0-9a-zA-Z]+', ' ', x)
    # Remove additional white spaces
    x = re.sub(r'\s+', ' ', x).strip()
    text_preproc1.append(x)
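The word cloud and stemming steps that follow operate on text_preproc2, which is not built in this excerpt. Presumably it is the result of applying the slang-and-stopword helper to the cleaned tweets, for example:

# Assumption: normalize slang and remove stopwords from the cleaned tweets
text_preproc2 = [formaldanstop(x) for x in text_preproc1]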
from wordcloud import WordCloud, STOPWORDS
import matplotlib.pyplot as plt

comment_words = ''
stopwords = set(stops)
for val in text_preproc2:
    # Typecast each value to string
    val = str(val)
    # Split the value into tokens
    tokens = val.split()
    # Convert each token into lowercase
    for i in range(len(tokens)):
        tokens[i] = tokens[i].lower()
    comment_words += " ".join(tokens) + " "

wordcloud = WordCloud(width=800, height=800,
                      background_color='white', stopwords=stopwords,
                      min_font_size=10).generate(comment_words)

# Plot the WordCloud image
plt.figure(figsize=(8, 8), facecolor=None)
plt.imshow(wordcloud)
plt.axis("off")
plt.tight_layout(pad=0)
plt.show()
from Sastrawi.StopWordRemover.StopWordRemoverFactory import StopWordRemoverFactory
from Sastrawi.Stemmer.StemmerFactory import StemmerFactory

stopword = StopWordRemoverFactory().create_stop_word_remover()
stemmer = StemmerFactory().create_stemmer()

clean_text = []
for i, kalimat in enumerate(text_preproc2):
    # Remove stopwords, then stem each sentence to its root form
    stop = stopword.remove(kalimat)
    stem = stemmer.stem(stop)
    if i % 100 == 0:
        print('processing sentence:', i, 'of', len(text_preproc2))
    clean_text.append(stem)
Text Vectorization
The process of converting words and sentences into representative vector forms. Commonly used methods include TF-IDF, word embeddings, skip-gram, etc. In this case, the TF-IDF method is used.
Count Vectorisation
This is a way of vectorizing text that considers the count of each word in a document or collection of documents. In count vectorization, a word with a higher count is considered more significant. However, counts alone do not guarantee that the features of a document are well expressed: certain words may appear repeatedly without carrying much meaning.
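As an illustration (not part of the original pipeline), count vectorization can be done with scikit-learn's CountVectorizer; the sample sentences below are made up:

from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical sample documents for illustration only
docs = ['saya suka produk ini', 'saya tidak suka layanan ini']

vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(docs)

print(vectorizer.get_feature_names_out())
print(counts.toarray())  # raw word counts per document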
TF-IDF (Term Frequency — Inverse Document Frequency)
TF-IDF is a measure that quantifies how important or relevant a word is to a document within a collection of documents. It can be expressed as:

TF-IDF(i, d) = TF(i, d) × IDF(i), where IDF(i) = log(N / df(i))

Here TF(i, d) is the frequency of word i in document d, df(i) is the number of documents that contain word i, and N is the total number of documents; all three quantities are taken into account by TF-IDF.
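A minimal sketch of TF-IDF vectorization with scikit-learn's TfidfVectorizer, using the same made-up sentences as above (for illustration; the tokenization code below uses the Keras Tokenizer):

from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical sample documents for illustration only
docs = ['saya suka produk ini', 'saya tidak suka layanan ini']

tfidf = TfidfVectorizer()
weights = tfidf.fit_transform(docs)

print(tfidf.get_feature_names_out())
print(weights.toarray())  # TF-IDF weight of each word in each document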
# TOKENIZATION
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.utils import pad_sequences

max_features = 7600
tokenizer = Tokenizer(num_words=max_features, split=' ')
tokenizer.fit_on_texts(data['Tweets'].values)
x = tokenizer.texts_to_sequences(data['Tweets'].values)
X = pad_sequences(x)
X  # inspect the padded sequences
import tensorflow as tf
from tensorflow.keras.utils import pad_sequences
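The padding and model code below rely on X_train, X_test, dummy_y_train, dummy_y_test, and top_words, none of which are defined in this excerpt. A minimal sketch of how they might be produced, assuming a sentiment label column data['Sentiment'] (the column name is an assumption):

import pandas as pd
from sklearn.model_selection import train_test_split
from tensorflow.keras.utils import to_categorical

# Assumption: data['Sentiment'] holds the sentiment labels
y = pd.factorize(data['Sentiment'])[0]

# Split the padded sequences and labels into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# One-hot encode the two sentiment classes
dummy_y_train = to_categorical(y_train, num_classes=2)
dummy_y_test = to_categorical(y_test, num_classes=2)

# Vocabulary size used by the Embedding layer
top_words = max_features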
# PAD THE SEQUENCES TO A FIXED LENGTH
max_review_length = 1000
X_train = tf.keras.utils.pad_sequences(X_train, maxlen=max_review_length)
X_test = tf.keras.utils.pad_sequences(X_test, maxlen=max_review_length)
CREATE THE MACHINE LEARNING MODEL USING LSTM
# CREATE THE MODEL
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense

embedding_vector_length = 500
model = Sequential()
model.add(Embedding(top_words, embedding_vector_length, input_length=max_review_length))
model.add(LSTM(64))
model.add(Dense(2, activation='sigmoid'))
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['categorical_accuracy'])
print(model.summary())

model.fit(X_train, dummy_y_train, validation_data=(X_test, dummy_y_test), epochs=3, batch_size=25, verbose=1)
The model uses an LSTM layer to process the textual data. We can adjust the size of the LSTM layer, the activation functions, and other parameters according to our needs. We also need to make sure the preprocessed data is fed to the model in the format it expects.
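As a sketch of how the trained model could be applied to new text (the sample tweet and the mapping of class indices to sentiments are assumptions, since they depend on how the labels were encoded):

# Preprocess a new tweet the same way as the training data
new_text = ['pelayanannya bagus banget']  # hypothetical example tweet
seq = tokenizer.texts_to_sequences(new_text)
padded = tf.keras.utils.pad_sequences(seq, maxlen=max_review_length)

# Predict class probabilities and take the most likely class index
pred = model.predict(padded)
print(pred.argmax(axis=1))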
# MODEL EVALUATION
scores = model.evaluate(X_test, dummy_y_test, verbose=1)
print("Accuracy: %.2f%%" % (scores[1]*100))
2/2 [==============================] - 0s 34ms/step - loss: 0.6165 - categorical_accuracy: 0.7377
Accuracy: 73.77%
This model reaches an accuracy of only about 0.74, because the dataset is unstructured and a significant amount of data had to be removed as duplicates during the data cleaning phase.
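Accuracy alone gives only a partial picture; as a sketch (assuming the test sets defined earlier), per-class metrics could be inspected as follows:

import numpy as np
from sklearn.metrics import classification_report, confusion_matrix

# Convert predicted probabilities and one-hot labels back to class indices
y_pred = model.predict(X_test).argmax(axis=1)
y_true = np.argmax(dummy_y_test, axis=1)

print(confusion_matrix(y_true, y_pred))
print(classification_report(y_true, y_pred))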