Installing PySpark on Mac
PySpark is the Python API for Spark. I understand it as a Python library providing entry points to Spark's functionality.
Installing PySpark can be as simple as the command below, assuming pip is installed.
!pip install pyspark
Since PySpark follows the ideas of functional programming, most of its operations fall into two categories:
- transformation
- action
Transformations are lazy: they only get executed when an action is called (a short example follows the test program below).
To check whether PySpark is installed correctly, we can run a simple test program like the one below.
from pyspark.sql import SparkSession
from pyspark.sql.types import *
spark = SparkSession\
    .builder\
    .config("spark.executor.instances", "2")\
    .config("spark.executor.cores", "2")\
    .config("spark.executor.memory", "2g")\
    .config("spark.driver.memory", "2g")\
    .getOrCreate()

schema = StructType([StructField("user_id", StringType()),
                     StructField("occured_at", StringType()),
                     StructField("event_type", StringType())])

test_list = (
    [['16','2014-06-04T09:33:02','engagement'],
     ['16','2014-08-18T09:32:27','engagement'],
     ['16','2014-05-27T09:27:01','engagement'],
     ['16','2014-05-13T19:58:46','engagement'],
     ['16','2014-07-31T15:19:02','engagement'],
     ['16','2014-06-28T15:03:59','signup_flow'],
     ['1547','2014-06-16T17:25:51','engagement'],
     ['1547','2014-07-24T02:58:10','engagement'],
     ['1547','2014-07-07T09:31:51','engagement'],
     ['1547','2014-07-09T01:42:40','engagement']]
)

df = spark.createDataFrame(test_list, schema)
df.show()
The code above does several things.
- First, it imports the necessary libraries.
- Second, it sets up a SparkSession, which is the entry point to all Spark functionality. The configuration can get complicated, but here it is simply copied from Stack Overflow.
- Third, it defines the schema and the content list for the DataFrame, a tabular data format that is widely used in Spark.
- The last operation, df.show(), is an action, so it actually triggers execution of all the operations above.
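To see the lazy behaviour mentioned earlier, here is a small sketch run against the df just created (the variable name engagements is mine): the filter transformation returns immediately without touching the data, and only the count action kicks off an actual job.
# filter() is a transformation: it only records the operation, nothing runs yet
engagements = df.filter(df.event_type == "engagement")

# count() is an action: calling it triggers the actual computation
print(engagements.count())  # 9 for the test data above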
I hit an error here, complaining about a Python version incompatibility:
Exception: Python in worker has different version 2.7 than that in driver 3.7, PySpark cannot run with different minor versions.Please check environment variables PYSPARK_PYTHON and PYSPARK_DRIVER_PYTHON are correctly set.
PySpark was confused because it was installed against the Mac's system Python 2.7, while the driver, Jupyter Notebook, was running Python 3.7.
To check the Python version used by the notebook:
import sys
print(sys.executable)
To fix this, we need to explicitly set the two environment variables PYSPARK_PYTHON and PYSPARK_DRIVER_PYTHON. I did this by pointing both of them to the Python framework installed for Jupyter Notebook. The changes go in .bash_profile:
export PYSPARK_PYTHON="/Library/Frameworks/Python.framework/Versions/3.7/bin/python3"
export PYSPARK_DRIVER_PYTHON="/Library/Frameworks/Python.framework/Versions/3.7/bin/python3"
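Alternatively, a minimal sketch of another approach (not the one I took here) is to set the two variables from inside the notebook, before the SparkSession is created, pointing both at the interpreter the notebook itself runs on:
import os
import sys

# point both the worker and the driver Python at the notebook's own interpreter
os.environ["PYSPARK_PYTHON"] = sys.executable
os.environ["PYSPARK_DRIVER_PYTHON"] = sys.executable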
Then restart Jupyter Notebook; the DataFrame should now be displayed as below.
+-------+-------------------+-----------+
|user_id| occured_at| event_type|
+-------+-------------------+-----------+
| 16|2014-06-04T09:33:02| engagement|
| 16|2014-08-18T09:32:27| engagement|
| 16|2014-05-27T09:27:01| engagement|
| 16|2014-05-13T19:58:46| engagement|
| 16|2014-07-31T15:19:02| engagement|
| 16|2014-06-28T15:03:59|signup_flow|
| 1547|2014-06-16T17:25:51| engagement|
| 1547|2014-07-24T02:58:10| engagement|
| 1547|2014-07-07T09:31:51| engagement|
| 1547|2014-07-09T01:42:40| engagement|
+-------+-------------------+-----------+