
Home exam - Web Fundamentals - Autumn 2021

Describe the project

The project involves fetching news articles from online newspapers (tv2.no/nyheter, tv2.no/sport and nrk.no/innlandet) and republishing them on a web server that runs on a Raspberry Pi. To do this I have written seven bash scripts and made or changed four configuration files. It is possible to download and use the project on a clean Raspberry Pi by running the deployment.sh script. This script clones the project from GitLab and installs and enables everything you need to run the project: nginx, fcgiwrap and systemd. A systemd timer runs the main script every six hours, and the main script in turn runs all the other scripts: scrape.sh, page.sh, overview.sh and gitrepo.sh.

scrape.sh

I started by writing scrape.sh. This script fetches news from online newspapers. Since I have done all the optional tasks in this script, it fetches from tv2.no/nyheter, tv2.no/sport and nrk.no/innlandet. It fetches the three newest articles from each of the three websites and stores the URLs in three different variables. To be clear, it only fetches articles hosted on the same website as the front page they were found on, not articles that redirect you to, for instance, nrk.no or tv2.no. Firstly, this is because the script curls every news article to fetch the title, image URL and summary, instead of fetching them from the front page. Secondly, I ran into several problems curling the redirected articles and could not fetch any information from them. Therefore I chose not to scrape news articles that redirect you to the main website.
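A minimal sketch of the URL step for one of the sites, assuming the article links can be picked out of the front-page HTML with a domain-anchored pattern (the real patterns are tailored to each site's actual markup):

    # Fetch the three newest article URLs from nrk.no/innlandet; the grep
    # pattern is an assumption and only keeps links hosted on the same site.
    nrk_urls=$(curl -s https://www.nrk.no/innlandet/ \
        | grep -oE 'https://www\.nrk\.no/innlandet/[^"]+' \
        | head -n 3)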

The script starts by making a date file, so that I can use the same date in every script that runs after the scrape script. The file is overwritten on each run, so it contains only one line: today's date. That way I avoid fetching the date individually at the start of every script. The problem with fetching the date in every script is that it includes the minutes, and the minute could change before every script has finished running; a later script would then not get the same date as the first one. I also use the date to name the folder where I store all the txt files with the information about the individual news articles.
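A minimal sketch of this step; the file name date.txt matches the rest of this document, while the exact format string is an assumption:

    # Overwrite date.txt so it always holds exactly one line: the current date.
    date +"%d-%m-%Y-%H:%M" > date.txt
    # The stored date also names the folder for the scraped article files.
    mkdir -p "newsscraping/$(cat date.txt)"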

I scrape the URLs to the news articles separately for each site, because the different websites have different HTML structures. The scrape script only curls tv2.no/sport on even days. I used an if statement that checks whether the day of the month is divisible by 2. I used the 10# prefix to ensure that bash interprets the number in base 10 instead of base 8; a leading zero, as in 08 or 09, would otherwise make bash read it as an invalid octal number.
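A minimal sketch of the even-day check, assuming the day is taken from the date command:

    day=$(date +%d)
    # 10# forces base 10; without it bash reads "08" and "09" as invalid octal.
    if (( 10#$day % 2 == 0 )); then
        echo "Even day of the month: scraping tv2.no/sport"
    fi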

The script loops over every URL in the three variables and curls it. It then fetches the title, the image URL and the summary (the summary is an optional task) and stores them in different variables. Since I fetch the information from inside the html head tag, all three websites have nearly the same HTML structure; where they differ I have handled it with regex. Therefore I can use the same variables to fetch from all three websites. To keep some control when scraping from different websites, I chose to name the files after the website and to prefix every title with it. To do this I made an if statement that checks whether the article is from tv2.no/nyheter, nrk.no/innlandet or tv2.no/sport. The loop then makes a txt file with all the information from the different variables: the URL of the news article, the title, the URL of the image, the date it was scraped and a summary of the news article.
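A minimal sketch of the extraction loop, reusing $nrk_urls and date.txt from the sketches above; the og: meta-tag patterns are an assumption about the sites' head markup:

    today=$(cat date.txt)
    for url in $nrk_urls; do
        html=$(curl -s "$url")
        # The title, image and summary sit in meta tags inside <head>.
        title=$(grep -oP '<meta property="og:title" content="\K[^"]*' <<< "$html" | head -n 1)
        image=$(grep -oP '<meta property="og:image" content="\K[^"]*' <<< "$html" | head -n 1)
        summary=$(grep -oP '<meta property="og:description" content="\K[^"]*' <<< "$html" | head -n 1)
        # One txt file per article: url, title, image url, date, summary.
        printf '%s\n' "$url" "nrk.no: $title" "$image" "$today" "$summary" \
            > "newsscraping/$today/nrk-$(basename "$url").txt"
    done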

Finally, the script removes all the temporary files.

page.sh

Then I made page.sh. The page.sh script takes the txt files that were made by scrape.sh and makes an individual HTML page for each of them. It loops over every file in the directory under newsscraping whose name matches the date in date.txt. It fetches the information from the text files and stores it in variables; to get a specific line from a txt file I print it with sed -n. The script then makes an HTML file for each txt file and stores it in a directory named after the date. These directories are inside htmlfiles.
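A minimal sketch of page.sh, assuming the line order written by scrape.sh above (url, title, image, date, summary); the exact markup of the page is an assumption:

    today=$(cat date.txt)
    mkdir -p "htmlfiles/$today"
    for file in "newsscraping/$today"/*.txt; do
        # sed -n 'Np' prints only line N of the txt file.
        url=$(sed -n '1p' "$file")
        title=$(sed -n '2p' "$file")
        image=$(sed -n '3p' "$file")
        scraped=$(sed -n '4p' "$file")
        summary=$(sed -n '5p' "$file")
        # One standalone HTML page per article.
        {
            echo "<!DOCTYPE html>"
            echo "<html><head><title>$title</title></head><body>"
            echo "<h1>$title</h1>"
            echo "<img src=\"$image\" alt=\"\">"
            echo "<p>$summary</p>"
            echo "<p>Scraped: $scraped. <a href=\"$url\">Read the original article</a></p>"
            echo "</body></html>"
        } > "htmlfiles/$today/$(basename "$file" .txt).html"
    done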

overview.sh

Then it was time to make the overview script. The overview.sh script makes an HTML file, index.html, that links to every news article that has been fetched. I made a function that loops over every HTML file in every directory inside htmlfiles. It writes one line per file to a temporary file; each line is an html li element containing the path to that HTML file. To do the optional sorting task I reverse the temporary file, so the newest article ends up on top. Then the script echoes the necessary HTML code and calls the function.
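A minimal sketch of overview.sh, assuming the temporary list file is called htmllist.txt and tac is used for the reversal (the real reversal method may differ):

    # Build one li line per generated article page.
    list_articles () {
        for file in htmlfiles/*/*.html; do
            echo "  <li><a href=\"$file\">$(basename "$file" .html)</a></li>"
        done > htmllist.txt
        # Reverse the list so the newest article ends up on top.
        tac htmllist.txt
    }

    {
        echo "<!DOCTYPE html>"
        echo "<html><head><title>News overview</title></head><body><ul>"
        list_articles
        echo "</ul></body></html>"
    } > index.html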

main.sh

This is the main script, which runs scrape.sh, page.sh, overview.sh and gitrepo.sh. It is also the script that systemd runs every six hours. To let nginx generate the overview page on the fly using CGI via fcgiwrap, I copy htmllist.txt into the same directory as overview.cgi. Finally, the script removes the temporary files and directories, including the txt files with the information about the news articles.
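A minimal sketch of main.sh, assuming the script order from the description above and the cgi-bin path from the fcgiwrap section below:

    #!/bin/bash
    # Run the pipeline in order.
    ./scrape.sh
    ./page.sh
    ./overview.sh
    ./gitrepo.sh
    # Make the fresh list available to the CGI script (may need write
    # access to the cgi-bin directory).
    cp htmllist.txt /usr/lib/cgi-bin/
    # Remove the temporary files, including the per-article txt files.
    rm -rf "newsscraping/$(cat date.txt)" date.txt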

nginx

After installing nginx I got a default configuration file inside /etc/nginx/sites-available.

This configuration file makes it possible to serve the generated HTML files via nginx. I have made a copy of that file, also called default, and stored it inside the config folder in this project directory. Inside the default file I changed the root to my project directory:

root /home/pi/git/exam;

This configuration file made it possible to see the overview page via the Raspberry Pi's IP address.

I have also done the optional feature of this task and set up nginx to generate the overview page on the fly via fcgiwrap. First I created a new configuration file for fcgi with the command:

sudo nano /etc/nginx/fcgiwrap.conf

I have made a copy of this file too and stored it inside the config folder. The file includes, among other things, the location and the root. After this I made overview.cgi inside /usr/lib/cgi-bin, because in fcgiwrap.conf I specified the root to be /usr/lib. The overview.cgi script uses bash to build an overview page based on the htmllist.txt file that overview.sh updates. overview.cgi reads htmllist.txt with cat, sorts it and stores it in a variable. When you click on a link to a news article you are redirected to the HTML files stored inside this project directory. In other words, the HTML files of the news articles are served by nginx directly, while the CGI script is run by nginx via fcgiwrap. Therefore, in the script, I used sed to prefix the href attributes with the localhost address. The script then echoes all the HTML code as well as the variable. This produces an overview page that updates every time htmllist.txt updates.

Nginx finds the individual HTML pages inside /home/pi/git/exam/htmlfiles because, as already mentioned, I changed the root to /home/pi/git/exam inside /etc/nginx/sites-available/default. I also changed the default file so that it includes fcgiwrap.conf inside the server section:

include /etc/nginx/fcgiwrap.conf;
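A minimal sketch of what overview.cgi can look like, assuming main.sh has copied htmllist.txt next to the script and that the sed expression and sort order match the list format (both are assumptions):

    #!/bin/bash
    # A CGI script must print the header and a blank line before the body.
    echo "Content-type: text/html"
    echo ""
    # Read the list written by overview.sh and sort it newest first.
    links=$(sort -r /usr/lib/cgi-bin/htmllist.txt)
    # Point the links at the article pages that nginx serves directly.
    links=$(sed 's|href="|href="http://localhost/|' <<< "$links")
    echo "<!DOCTYPE html>"
    echo "<html><head><title>News overview</title></head><body><ul>"
    echo "$links"
    echo "</ul></body></html>"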

After changing nginx I wrote sudo nginx -t in the terminal and got this output:

sudo nginx -t

    nginx: the configuration file /etc/nginx/nginx.conf syntax is ok
    nginx: configuration file /etc/nginx/nginx.conf test is successful

Then I reloaded the nginx server to make everything take effect:

sudo nginx -s reload

The overview page can also be reached with the Raspberry Pi's IP address in place of localhost, but you have to include the path to the CGI script:

http://localhost/cgi-bin/overview.cgi 

Sometimes I had problems reaching the page when the Raspberry Pi was connected to the eduroam wifi; I did not have the same problem on my wifi at home.

For this part I got a lot of help from these two websites:

https://garywoodfine.com/how-to-install-nginx-on-raspberry-pi/
https://www.cyberciti.biz/faq/how-to-install-fcgiwrap-for-nginx-on-ubuntu-20-04/

Crontab entries (systemd)

I did the optional task, which means I created a systemd timer unit instead of a crontab entry. Systemd runs the main.sh script automatically every six hours. A systemd timer unit consists of two files, schedule-exam.service and schedule-exam.timer, which have to be inside /etc/systemd/user. In the repository they are stored inside the config folder, and the optional deployment script copies (cp) them from there into the right directory. The schedule-exam.service file describes the job that schedule-exam.timer should trigger. In other words, schedule-exam.service contains which script to run and schedule-exam.timer contains when to run it.

First I tried to schedule the timer with this line inside schedule-exam.timer:

OnUnitActiveSec=6h

It worked well until I disconnected from my Raspberry Pi and reconnected to it. The timer would not run until I started it again with the command:

systemctl --user start schedule-exam.timer

I discovered this when I ran this command:

systemctl --user status schedule-exam.timer 

and got output telling me that the trigger was not applicable:

Trigger: n/a

One solution could be to run the start command before every systemd operation. Instead, I switched to making the timer fire at fixed times six hours apart: when the clock shows 00, 06, 12 and 18, as well as 120 seconds after a reboot. This is a more static way of running the script every six hours, but it gives predictability. Therefore I changed the timer line to:

OnCalendar=*-*-* 00,06,12,18:00:00

Via the schedule-exam.timer unit it is also possible to stop and start the service manually. To manually stop the timer you write this in the terminal:

systemctl --user stop schedule-exam.timer

In addition to running the job every six hours, the timer also runs it 120 seconds after a reboot.
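A minimal sketch of the two unit files, assuming the project path from the nginx section and the OnBootSec directive for the reboot delay (the exact contents of my files may differ slightly):

    # /etc/systemd/user/schedule-exam.service
    [Unit]
    Description=Run the news scraping pipeline

    [Service]
    Type=oneshot
    ExecStart=/home/pi/git/exam/main.sh

    # /etc/systemd/user/schedule-exam.timer
    [Unit]
    Description=Run main.sh every six hours and 120 seconds after boot

    [Timer]
    OnCalendar=*-*-* 00,06,12,18:00:00
    OnBootSec=120

    [Install]
    WantedBy=timers.target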

For this part I got a lot of help from this website:

https://fedoramagazine.org/systemd-timers-for-scheduling-tasks/ 

gitrepo.sh

The gitrepo.sh script is an optional task and updates the local and the centralized git repository. Inside this script I run the git commands that push all the changes made after the other scripts have run: git add, git commit and git push. To stage all the changes I pass a dot to git add, which picks up all changed and new files. In the commit message I include the date, so it is possible to see when the push happened, and I state that it is an automatic push, so you can tell whether a commit was pushed manually or automatically.

Since this script also runs automatically via systemd every six hours, it requires an SSH key or a token; otherwise it would ask for the GitLab username and password. I chose to use a token for security reasons. An SSH key can be protected with a passphrase, but then the user would have to type it every time the script runs, which defeats the purpose of automation, and leaving the SSH key without a passphrase is not a secure way of doing it either. Therefore I made a token in GitLab and clone the project with https. I made a project access token so the token can be used with this project only. Since the token is exposed inside the scripts, a project token is more appropriate than a personal access token, because a personal access token gives more power: it applies to your whole account. I gave the token the role maintainer and the scope api, which means the token has complete read and write access to the scoped project API. To use the token you have to clone the project inside ~/git:

git clone https://exam:ZqPZxzPx4fCo3GXtTpHJ@gitlab.stud.idi.ntnu.no/ingring/exam.git

When the project is cloned with a token, git will not ask for a username or password when you pull and push.
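A minimal sketch of gitrepo.sh, assuming the repository was cloned to ~/git/exam as described above:

    #!/bin/bash
    cd ~/git/exam || exit 1
    # Stage every changed and new file.
    git add .
    # The message marks the commit as automatic and records the date.
    git commit -m "Automatic push $(date)"
    # The token embedded in the clone URL lets this run without a prompt.
    git push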

deployment.sh

Finally I made the deployment script. The deployment.sh script is optional and configures a blank installation of Raspberry Pi OS with everything that is necessary for the project to work. First the script installs all the programs I have used to solve this assignment. It then makes a git directory inside /home/pi and clones the git repository with https into it. This creates an exam directory containing all the scripts as well as a config folder with all the configuration files. The script then copies the configuration files and overview.cgi into the right directories, and enables and starts the services. Finally it makes all the scripts executable with chmod +x. However, you have to manually make deployment.sh itself executable before you run it:

chmod +x deployment.sh
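A minimal sketch of what deployment.sh does, assuming apt for the installs and the paths used elsewhere in this document:

    #!/bin/bash
    # Install the programs the project depends on.
    sudo apt update
    sudo apt install -y nginx fcgiwrap git curl
    # Clone the project into /home/pi/git using the project token.
    mkdir -p /home/pi/git && cd /home/pi/git
    git clone https://exam:ZqPZxzPx4fCo3GXtTpHJ@gitlab.stud.idi.ntnu.no/ingring/exam.git
    cd exam
    # Put the configuration files and the CGI script in place.
    sudo cp config/default /etc/nginx/sites-available/default
    sudo cp config/fcgiwrap.conf /etc/nginx/fcgiwrap.conf
    sudo mkdir -p /usr/lib/cgi-bin
    sudo cp overview.cgi /usr/lib/cgi-bin/
    sudo mkdir -p /etc/systemd/user
    sudo cp config/schedule-exam.service config/schedule-exam.timer /etc/systemd/user/
    # Enable and start the services.
    sudo systemctl restart nginx fcgiwrap
    systemctl --user enable --now schedule-exam.timer
    # Make all the scripts executable.
    chmod +x ./*.sh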

I tried to run deployment.sh on a virtual machine, but the schedule-exam.timer refused to start; it said that the trigger was not loaded. I do not know why this happened, since I do enable and start both schedule-exam.service and schedule-exam.timer in deployment.sh.