The corpus contains about 14,219 sentences.
First I remove invalid characters, then I remove redundancy (duplicate lines).
Then I start cleaning the files through the following steps:
1- Put spaces before and after words that contain the special character ‘&’, whether or not the character is already followed by a space.
2- Replace the full stop at the end of each line with a space followed by a full stop, remove unused words (i.e. identical lines), and remove the space created in step 1 before words containing the ‘&’ character (trimming both sides).
3- Replace numbers with the tag ##NUM##.
4- Put spaces around numbers and remove redundant spaces.
5- Put spaces around foreign sequences (“non-target” characters) found in the target file.
6- Remove unwanted spaces before/after each sentence.
7- Insert spaces before and after certain predefined special characters.
8- Remove unwanted spaces before/after each sentence again.
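The cleaning steps above can be sketched roughly as one line-level function. This is only an illustrative sketch, not the actual implementation: the set of predefined special characters and the exact regexes are assumptions.

```python
import re

# Illustrative set of predefined special characters (step 7); the real set
# used in the pipeline is not specified here.
SPECIALS = "!?,;:"

def clean_line(line: str) -> str:
    """A minimal sketch of the cleaning steps; names and regexes are assumptions."""
    # Step 1: ensure a space before and after every '&'
    line = re.sub(r"\s*&\s*", " & ", line)
    # Step 3: replace digit runs with the ##NUM## tag (spaced, per step 4)
    line = re.sub(r"\d+", " ##NUM## ", line)
    # Step 7: insert spaces around the predefined special characters
    line = re.sub(r"([%s])" % re.escape(SPECIALS), r" \1 ", line)
    # Steps 4/6/8: collapse redundant spaces and trim the line
    return re.sub(r"\s+", " ", line).strip()
```

For example, `clean_line("Price:1500&tax!")` yields `"Price : ##NUM## & tax !"`.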
After cleaning, I generate suggested abbreviation lists for both the source and target files in the corpus, according to one criterion. The two output files are then revised by linguists.
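The actual criterion is not stated above, so the sketch below substitutes an illustrative one (short tokens immediately followed by a period), ranked by frequency so linguists can review the most common candidates first:

```python
import re
from collections import Counter

def suggest_abbreviations(lines, max_len=4):
    """Collect candidate abbreviations: tokens of up to max_len word characters
    followed by a period. This criterion is an assumption, not the real one."""
    counts = Counter()
    for line in lines:
        for tok in re.findall(r"\b(\w{1,%d})\." % max_len, line):
            counts[tok + "."] += 1
    # Most frequent candidates first, for linguist review
    return [tok for tok, _ in counts.most_common()]
```

Running it over both the source and the target file would produce the two suggested lists mentioned above.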
Then I start segmentation (converting lines into segments) by splitting paragraphs into segments through:
1- Remove the dot at the end of each paragraph/line.
2- Split the text at sentence boundaries, which generates many empty lines.
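A minimal sketch of this segmentation step, assuming the split happens after each remaining full stop (the real boundary rule may differ):

```python
import re

def segment_paragraph(paragraph: str) -> str:
    """Turn one paragraph into one segment per line (illustrative sketch)."""
    # Step 1: drop the full stop at the end of the paragraph
    paragraph = re.sub(r"\.\s*$", "", paragraph)
    # Step 2: break after every remaining full stop; naive splitting like
    # this can produce empty lines, which the next stage filters out
    return re.sub(r"\.\s*", ".\n", paragraph)
```

For example, `segment_paragraph("One. Two. Three.")` returns `"One.\nTwo.\nThree"`.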
Then I filter the segments (not sentences) by:
1- Remove empty lines.
2- Put spaces around a special character if it appears in the middle of a sentence, or a single space after it if it starts the sentence.
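The filtering stage can be sketched as follows; the set of special characters is again an assumption:

```python
import re

SPECIALS = r"!?,;:"  # illustrative special characters

def filter_segments(lines):
    """Drop empty lines and space out special characters (illustrative sketch)."""
    out = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # step 1: remove empty lines
        # Step 2: a special character preceded by a non-space gets a space
        # before it; one followed by a non-space gets a space after it.
        # A sentence-initial special therefore only gets a space after it.
        line = re.sub(r"(?<=\S)([%s])" % SPECIALS, r" \1", line)
        line = re.sub(r"([%s])(?=\S)" % SPECIALS, r"\1 ", line)
        out.append(line)
    return out
```

For example, `filter_segments(["", "!Hello,world"])` returns `["! Hello , world"]`.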
Finally, I clean the test set: put a white space before each full stop, remove unwanted spaces within each sentence and across the whole file, and put white spaces before and after special characters such as ‘!’. I also replace each number with the tag ##NUM##, preserving the position of each number in the file so it can be restored later.
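The number-masking step with position tracking could look like this; the index format (line number plus original number string) is an assumption:

```python
import re

def mask_numbers(lines):
    """Replace numbers with ##NUM## while recording where each one occurred,
    so they can be restored after processing (illustrative sketch)."""
    index = []   # (line_no, original_number) in order of appearance
    masked = []
    for i, line in enumerate(lines):
        for num in re.findall(r"\d+", line):
            index.append((i, num))
        masked.append(re.sub(r"\d+", "##NUM##", line))
    return masked, index
```

For example, `mask_numbers(["buy 3 for 10"])` returns `(["buy ##NUM## for ##NUM##"], [(0, "3"), (0, "10")])`, and the index list is what lets each tag be mapped back to its original number.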