The aim of this project is to analyse the review files for each hotel: count the number of reviews per hotel and sort the hotels in descending order of review count. The data for each hotel is stored in the directory reviews_folder.
The dataset to be used for this project is found at: https://secure.ecs.soton.ac.uk/notes/comp1204/coursework/dataset/reviews_dataset.tar.gz
From the terminal, extract the archive with the following UNIX commands:
gunzip reviews_dataset.tar.gz
tar xvf reviews_dataset.tar
The script for this project is in countreviews.sh. Its first parameter is the path of the directory
to extract the data from. If you run it from the directory where you extracted the dataset, you can simply
pass the directory name.
The script loops through the files in the given directory. For each file, it records the file name (without
extension or path) and counts the lines that contain "Author". Every line containing "Author" marks the
start of a new review, so counting these lines gives the number of reviews.
After looping through all the files and extracting each file name and its number of reviews, the script sorts
the output numerically (-n flag) in reverse order (-r flag) by the second field (-k2).
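The steps above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the actual contents of countreviews.sh: the function name count_reviews is hypothetical, and the sketch assumes each hotel is a regular file sitting directly inside the directory and that file names contain no spaces.

```shell
#!/bin/bash
# Hypothetical sketch of the approach described above; the real
# countreviews.sh may be structured differently.
count_reviews() {
    local dir="$1"
    local file name count
    for file in "$dir"/*; do
        # Skip anything that is not a regular file.
        [ -f "$file" ] || continue
        # Drop the directory path, then the extension.
        name=$(basename "$file")
        name="${name%.*}"
        # grep -c prints the number of lines that contain "Author".
        count=$(grep -c "Author" "$file")
        echo "$name $count"
    done | sort -k2 -n -r   # numeric, reversed, keyed on the second field
}

# Example call (assumes reviews_folder is in the current directory):
#   count_reviews reviews_folder
```

Piping the whole loop into sort means the output is ordered once, after every file has been counted, rather than file by file.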
To run the script, first make it executable with the command:
chmod a+x countreviews.sh
Then run it (if you are working in the directory that contains your data directory):