Eka Wulan Yunita
Text preprocessing is the stage that turns raw text from a dataset into clean, ready-to-process data. The steps of text preprocessing differ depending on the dataset you have. The dataset processed here is the Sentiment Labeled Sentences Data Set, which will be cleaned so that it is ready for the next stage.
WHAT IS TEXT PREPROCESSING?
Text preprocessing is the stage that turns raw text from a dataset into clean, ready-to-process data. It applies when the dataset consists of text or documents. This step is necessary so that the model built on the data produces good, accurate results; without it, the model risks being inaccurate and ineffective. The dataset processed here is the Sentiment Labeled Sentences Data Set, which will be cleaned so that it is ready for the next stage. It contains roughly 1,000 sentences, each labeled with a sentiment score from 1 to 5.
STEPS
Preprocessing steps differ from person to person, depending on the data they have, because each dataset has different elements.
In this case, preprocessing consists of several steps, described below.
Before starting, we must install and import the required libraries.
Don't forget to install the NLTK library and the Sastrawi library (for stopwords). NLTK is a platform used to build text analysis programs.
Then, load the data.
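As a rough sketch of the loading step, the snippet below reads labeled sentences with Python's built-in csv module. It assumes each line holds a sentence, a tab, and an integer label; the function names and the sample rows are made up for illustration, so adjust them to the layout of your own file.

```python
import csv

def read_labeled_sentences(lines):
    """Parse tab-separated 'sentence<TAB>label' lines into (text, label) pairs."""
    # QUOTE_NONE keeps quotation marks inside sentences as ordinary characters.
    reader = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    rows = []
    for row in reader:
        if len(row) != 2:
            continue  # skip blank or malformed lines
        text, label = row
        rows.append((text.strip(), int(label)))
    return rows

def load_dataset(path):
    """Read a whole file; the path is whatever file you downloaded (assumption)."""
    with open(path, encoding="utf-8", newline="") as handle:
        return read_labeled_sentences(handle)

# Illustrative sample lines, not taken from the real dataset.
sample = [
    "Great phone, works well!\t5\n",
    "\n",
    "The battery died in 2 days.\t1\n",
]
data = read_labeled_sentences(sample)
print(data)
```

With a real file, `load_dataset("your_file.txt")` would return the same kind of list of (text, label) pairs.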
1. CASE FOLDING
Case folding converts all letters to lowercase, using the str.lower() command.
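A minimal sketch of case folding on a plain Python list of sentences is shown here. The helper name `case_fold` and the sample sentences are illustrative only; if the data lives in a pandas column, the equivalent call would be the column's `.str.lower()`.

```python
def case_fold(sentences):
    """Return a new list with every sentence converted to lowercase."""
    return [sentence.lower() for sentence in sentences]

# Illustrative input, not taken from the real dataset.
sentences = ["Great Phone, Works WELL!", "The Battery Died in 2 Days."]
folded = case_fold(sentences)
print(folded)
```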
2. FILTERING
Filtering is the stage of deciding which things are considered important and which are not, such as punctuation marks, emoticons, and so on.
Here, the things removed are tags, punctuation, and numbers, so that the resulting data consists only of words.
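The filtering described above could look roughly like the sketch below, which uses only the standard library. It assumes that "tags" means HTML-style markup such as `<br>`; the function name `filter_text` and the sample string are hypothetical.

```python
import re
import string

TAG_RE = re.compile(r"<[^>]+>")   # HTML-style tags (assumption about what "tags" means)
DIGIT_RE = re.compile(r"\d+")     # runs of digits
PUNCT_TABLE = str.maketrans("", "", string.punctuation)

def filter_text(text):
    """Remove tags, numbers, and ASCII punctuation, then tidy the whitespace."""
    text = TAG_RE.sub(" ", text)         # drop tags first, while < and > still exist
    text = DIGIT_RE.sub(" ", text)       # drop numbers
    text = text.translate(PUNCT_TABLE)   # drop punctuation characters
    return " ".join(text.split())        # collapse repeated spaces

print(filter_text("<br>great phone, 10/10!!"))
```

Note that this also strips apostrophes, so "don't" becomes "dont". If contractions matter for your data, handle them before this step.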
3. STOPWORD
Stopword removal is similar to filtering. The difference is that stopword removal only selects words to be removed or kept, while filtering selects everything other than words.
Here, the stopword list comes from the nltk library. The 'english' argument should match the language of the data that we have.
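Here is a hedged sketch of stopword removal with NLTK's English list. It assumes NLTK is installed and the stopword corpus has been fetched once with `nltk.download("stopwords")`; if either is missing, the code falls back to a tiny hand-written set that exists only so the example still runs. The helper names are illustrative.

```python
# Tiny stand-in list, used only when NLTK or its corpus is unavailable.
FALLBACK_STOPWORDS = {"a", "an", "the", "this", "is", "and", "of", "to", "in", "it"}

def load_stopwords(language="english"):
    """Return NLTK's stopword set for the language, or the small fallback set."""
    try:
        from nltk.corpus import stopwords
        return set(stopwords.words(language))
    except (ImportError, LookupError):
        # ImportError: NLTK not installed. LookupError: corpus not downloaded.
        return set(FALLBACK_STOPWORDS)

STOPWORDS = load_stopwords("english")

def remove_stopwords(text):
    """Drop every whitespace-separated word that appears in the stopword set."""
    return " ".join(word for word in text.split() if word not in STOPWORDS)

print(remove_stopwords("this is a good phone"))
```

Because the list contains lowercase words, this step works best after case folding.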
4. TOKENIZATION
Tokenizing, also called the lexical analysis stage, is the process of cutting text into smaller parts called tokens.
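To make the idea concrete, here is a small stand-in tokenizer built on a regular expression. It is not NLTK's tokenizer (NLTK's `word_tokenize` is a common choice but needs an extra resource download); the pattern and the function name are assumptions chosen for this sketch.

```python
import re

# A token is a run of letters, optionally followed by an apostrophe part ("don't").
TOKEN_RE = re.compile(r"[A-Za-z]+(?:'[A-Za-z]+)?")

def tokenize(text):
    """Cut a sentence into a list of word tokens."""
    return TOKEN_RE.findall(text)

tokens = tokenize("good phone works well")
print(tokens)
```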
5. STEMMING
Stemming is the process of reducing word forms to their base form, that is, finding the root of each word.
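One possible way to stem tokens is NLTK's `PorterStemmer`, sketched below. The article does not say which stemmer was used, so treat this choice as an assumption. If NLTK is not installed, the code falls back to a crude plural-stripping function that is not a real stemmer and is there only to keep the example runnable.

```python
try:
    from nltk.stem import PorterStemmer
    _stem = PorterStemmer().stem
except ImportError:
    # Crude stand-in (not a real stemmer): strips one trailing "s" from longer words.
    def _stem(word):
        return word[:-1] if word.endswith("s") and len(word) > 3 else word

def stem_tokens(tokens):
    """Apply the stemmer to every token in the list."""
    return [_stem(token) for token in tokens]

stems = stem_tokens(["phones", "phone"])
print(stems)
```

Porter stems are not always dictionary words, so expect some outputs that look truncated.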
RESULT
Finally, the clean data is saved so that it can proceed to the next step.
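Saving could be as simple as the sketch below, which writes the cleaned rows to a CSV file with a header. The output file name, the column names, and the sample rows are all assumptions for illustration; the temporary directory is used here only so the example can run anywhere.

```python
import csv
import os
import tempfile

def save_clean_data(rows, path):
    """Write (text, label) pairs to a CSV file with a header row."""
    with open(path, "w", encoding="utf-8", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(["text", "label"])
        writer.writerows(rows)

# Illustrative rows, not real results.
clean_rows = [("great phone", 5), ("bad battery", 1)]
out_path = os.path.join(tempfile.gettempdir(), "clean_sentences.csv")
save_clean_data(clean_rows, out_path)
print("saved to", out_path)
```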