MGTA 495: Analytics Assignment 1

Setup of PySpark

Tasks:

Due date: Refer to Gradescope


Remember: when in doubt, read the documentation first. It's always helpful to search for the class that you're trying to work with, e.g. pyspark.sql.DataFrame.

PySpark API Documentation: https://spark.apache.org/docs/latest/api/python/index.html

Spark DataFrame Guide: https://spark.apache.org/docs/latest/sql-programming-guide.html


1. Copy data file(s) to HDFS

Spark works best with the Hadoop Distributed File System (HDFS). Here are the steps to start HDFS in this container and copy files between the local file system and HDFS:

  1. Make sure you have the three scripts needed to start/stop the HDFS service: init-dfs.sh, start-dfs.sh, and stop-dfs.sh. Copy these scripts from Canvas to your local folder.
  2. Open a terminal tab by going to File->New->Terminal. You can use either a terminal from JupyterLab or the ZSH terminal on the container. Use cd to navigate to the directory where you copied the scripts.
  3. Run bash ./init-dfs.sh to initialize the HDFS service. If prompted, answer y to erase HDFS.
  4. Run bash ./start-dfs.sh to start the HDFS service. When this completes, run jps to confirm that NameNode, DataNode, and SecondaryNameNode are running.
  5. If running a local container, use cd to navigate to the directory containing the downloaded data file. If running on the RSM server, copy the file to the container from the distribution folder.
  6. Run hadoop fs -mkdir /W6 to create a directory named W6 in the root of your HDFS file system.
  7. Run hadoop fs -copyFromLocal <your-data-file> /W6 to copy the data file to the W6 directory you just created. Run hadoop fs -help to learn more about navigating and using HDFS. After you're done, run hadoop fs -ls /W6 to list the directory's contents and confirm that the file was copied.

Expected output: None

2. Start Spark Session
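
A SparkSession is the entry point for working with DataFrames. A minimal sketch (the app name here is an arbitrary label, not something the assignment specifies):

    from pyspark.sql import SparkSession

    # Create a SparkSession, or reuse the one already attached to this
    # notebook's kernel if it exists.
    spark = SparkSession.builder \
        .appName("MGTA495-A1") \
        .getOrCreate()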

Expected output: None

3. Load Data

Read the data from the BookReviews_1M.txt file and print the number of rows in the resulting dataframe. The file is on Canvas; copy it to your HDFS file system using the commands from section 1.
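
A minimal sketch, assuming HDFS is the container's default filesystem so that /W6/... resolves to the directory created in section 1 (otherwise, prefix the path with your NameNode's URI):

    # Each line of the text file becomes one row in a single-column
    # DataFrame; the column is named "value" by default.
    df = spark.read.text("/W6/BookReviews_1M.txt")
    print(f"Number of lines = {df.count()}")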

Expected output:

Number of lines = 1000000

4. Examine the data

Your task:

  1. Examine the contents of the dataframe that you've just read from file.
  2. Print the schema of the raw dataframe, as well as its first 25 rows.
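
A minimal sketch, assuming the dataframe from section 3 is named df:

    # Print the schema, then display the first 25 rows. show() truncates
    # long strings to 20 characters by default, which matches the
    # expected output below.
    df.printSchema()
    df.show(25)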

Expected output:

root
 |-- value: string (nullable = true)

+--------------------+
| value|
+--------------------+
|This was the firs...|
|Also after going ...|
|As with all of Ms...|
|I've not read any...|
|This romance nove...|
|Carolina Garcia A...|
|Not only can she ...|
|Once again Garcia...|
|The timing is jus...|
|Engaging. Dark. R...|
|Set amid the back...|
|This novel is a d...|
|If readers are ad...|
| Reviewed by Phyllis|
| APOOO BookClub|
|A guilty pleasure...|
|In the tradition ...|
|Beryl Unger, top ...|
|What follows is a...|
|The book flap say...|
|I'd never before ...|
|The novel's narra...|
|It is centered on...|
|If you like moder...|
|Beryl Unger is a ...|
+--------------------+
only showing top 25 rows