Remember: when in doubt, read the documentation first. It's always helpful to search for the class that you're trying to work with, e.g. pyspark.sql.DataFrame.
PySpark API Documentation: https://spark.apache.org/docs/latest/api/python/index.html
Spark DataFrame Guide: https://spark.apache.org/docs/latest/sql-programming-guide.html
Spark works best with the Hadoop Distributed File System (HDFS). Here are the steps to start up HDFS in this container and to copy files between the local file system and HDFS:
1. Copy the scripts init-dfs.sh, start-dfs.sh, and stop-dfs.sh from Canvas to your local folder.
2. cd to the directory where you copied the scripts. You can do this either from a terminal in JupyterLab or from the ZSH terminal on the container.
3. Run bash ./init-dfs.sh to initialize the HDFS service. If prompted, answer y to erase HDFS.
4. Run bash ./start-dfs.sh to start the HDFS service. When this completes, run jps to confirm that NameNode, DataNode, and SecondaryNameNode are running.
5. cd to the directory that contains the downloaded data file. If you are running on the RSM server, copy the file to the container from the distribution folder.
6. Run hadoop fs -mkdir /W6 to create a directory named W6 in the root of your HDFS file system.
7. Run hadoop fs -copyFromLocal <your-data-file> /W6 to copy the data file into the W6 directory you just created.
8. Run hadoop fs -ls / to list the files and confirm that the copy succeeded (a notebook-cell version of this check is sketched below). You may run hadoop fs -help to learn more about navigating and using HDFS manually.
Expected output: None
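If you'd rather verify the copy from a notebook cell than from the terminal, here is a minimal sketch; it assumes the hadoop CLI is on the container's PATH and that you used the /W6 path from the steps above.
import subprocess
# List the contents of /W6 on HDFS; the data file should appear here
# after the -copyFromLocal step above.
result = subprocess.run(['hadoop', 'fs', '-ls', '/W6'],
                        capture_output=True, text=True)
print(result.stdout or result.stderr)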
# Initialize Spark
import pyspark
from pyspark.sql import SparkSession
conf = pyspark.SparkConf().setAll([('spark.master', 'local[1]'),
                                   ('spark.app.name', 'Basic Setup')])
spark = SparkSession.builder.config(conf=conf).getOrCreate()
# Verify that the correct versions of spark and pyspark are installed.
# print (spark.version, pyspark.version.__version__)
assert(spark.version == '3.0.1')
assert(pyspark.version.__version__ == '3.0.1')
# Don't remove this print statement. It's going to be used for Autograder.
print('All set.')
All set.
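If the version asserts fail, it helps to check exactly which session and configuration you are connected to. A minimal sketch using the standard SparkContext API:
# Optional sanity check: print the effective configuration of the session.
for key, value in spark.sparkContext.getConf().getAll():
    print(key, '=', value)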
Read the BookReviews_1M.txt file into a dataframe, and print the number of rows in the dataframe. You can find this file on Canvas; you will need to copy it to your HDFS file system using the commands described in section 1.
Expected output:
Number of lines = 1000000
# Read data from HDFS
# You'll need to run the hadoop fs -copyFromLocal command before you can use this path.
dataFileName = "hdfs:///W6/BookReviews_1M.txt"
# Read data from the file above, convert it to a dataframe, and print the number of rows in that dataframe.
df = spark.read.text(dataFileName).cache()
print('Number of lines =', df.count())
Number of lines = 1000000
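The .cache() call matters here: the first action (count()) materializes the dataframe and keeps it in memory, so later actions such as show() reuse it instead of re-reading the file from HDFS. A quick way to confirm this, using standard DataFrame attributes:
# Optional: confirm the dataframe is marked as cached.
print(df.is_cached)      # True once .cache() has been called
print(df.storageLevel)   # the storage level Spark uses for df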
Your task: Examine the dataframe by printing its schema and by showing its first 25 rows.
Expected output: The schema and the 25 rows, as shown below the corresponding cells.
# Use the printSchema function to print the dataframe's schema
df.printSchema()
root
 |-- value: string (nullable = true)
# Use the show() function to show the first 25 rows of the dataframe.
df.show(25)
+--------------------+
|               value|
+--------------------+
|This was the firs...|
|Also after going ...|
|As with all of Ms...|
|I've not read any...|
|This romance nove...|
|Carolina Garcia A...|
|Not only can she ...|
|Once again Garcia...|
|The timing is jus...|
|Engaging. Dark. R...|
|Set amid the back...|
|This novel is a d...|
|If readers are ad...|
| Reviewed by Phyllis|
|      APOOO BookClub|
|A guilty pleasure...|
|In the tradition ...|
|Beryl Unger, top ...|
|What follows is a...|
|The book flap say...|
|I'd never before ...|
|The novel's narra...|
|It is centered on...|
|If you like moder...|
|Beryl Unger is a ...|
+--------------------+
only showing top 25 rows
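Note that show() truncates each string to 20 characters by default, which is why the reviews above are cut off. To inspect the full review text, you can disable truncation:
# Show the full (untruncated) review text for the first 5 rows.
df.show(5, truncate=False)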
# Stop Spark session
spark.stop()
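Once the Spark session is stopped, remember to also shut down HDFS by running bash ./stop-dfs.sh from the directory where you copied the scripts.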