diff --git a/README.md b/README.md
index ea6e8efacc8c97e99ada8548d115ac5a2f2eb73b..148441f943e8f21d1d33424876de8a5fae5bb28c 100644
--- a/README.md
+++ b/README.md
@@ -1,10 +1,29 @@
-The aim of this project is to analyse all the files containing reviews for each hotel in different situations.
+The aim of this project is to count the reviews of each hotel and sort the hotels in descending order by number of reviews.
 The data for each hotel is stored in the directory reviews_folder.
 
 The dataset to be used for this project is found at:
 https://secure.ecs.soton.ac.uk/notes/comp1204/coursework/dataset/reviews_dataset.tar.gz
 
-The file should be extracted with the following UNIX commands:
+From the terminal, you can extract the folder with the following commands:
 
 gunzip reviews_dataset.tar.gz
 tar xvf reviews_dataset.tar
+The script for this project is in countreviews.sh. Its first parameter is the path of the directory
+to extract the data from. If you work in the directory where you downloaded the dataset, you can simply pass
+the directory name.
+
+The script loops through the files in the given directory. For each file, it records the file name (without any
+extension or path) and counts every line in that file that contains "Author"
+(each line containing "Author" marks the start of a new review, so counting these lines
+gives the number of reviews).
+
+After looping through all the files and extracting each file name and its number of reviews, the output is
+sorted by numerical value (-n flag), in reverse order (-r flag), on the second key (-k2).
+
+To run the programme, first make it executable with the command:
+
+chmod a+x countreviews.sh
+
+Then execute it (assuming you work in the same directory as your data directory):
+
+./countreviews.sh reviews_folder
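The README text in this diff describes countreviews.sh only in prose; the script itself is not part of the change. The steps it describes can be sketched roughly as follows. This is a minimal illustration, not the actual script: the function name `count_reviews`, the output format (`name count`, separated by a space), and the assumption that every review contributes exactly one line containing "Author" are all hypothetical.

```shell
#!/bin/sh
# Hypothetical sketch of the behaviour described in the README;
# the real countreviews.sh may be structured differently.

# count_reviews DIR
# Prints "<file name without path or extension> <review count>" for each
# regular file in DIR, sorted by review count, highest first.
count_reviews() {
    dir=$1
    for file in "$dir"/*; do
        # Skip subdirectories and the unexpanded pattern of an empty directory.
        [ -f "$file" ] || continue

        # Drop the directory path, then the last extension (if any).
        name=$(basename "$file")
        name=${name%.*}

        # grep -c prints the number of matching lines. It exits non-zero
        # when the count is 0, so "|| true" keeps the sketch safe under set -e.
        count=$(grep -c "Author" "$file" || true)

        echo "$name $count"
    done | sort -k2 -n -r   # numeric (-n), reversed (-r), on the second field (-k2)
}

# Only run when a directory argument is supplied, e.g.
#   ./countreviews.sh reviews_folder
if [ "$#" -ge 1 ]; then
    count_reviews "$1"
fi
```

Piping the whole loop into a single `sort` means the sort sees all `name count` pairs at once, which matches the README's description of sorting after every file has been processed.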