Introduction to PySpark MLlib
Introduction
Apache Spark is an open-source distributed computing system for big data processing and analytics. PySpark, the Python API for Apache Spark, provides an interface for programming Spark with Python. PySpark MLlib, Spark's machine learning library, provides a range of algorithms for scalable, distributed machine learning.
Syntax
from pyspark.ml.classification import LogisticRegression
model = LogisticRegression(maxIter=100, regParam=0.1).fit(training_data)
Note: the older RDD-based API (pyspark.mllib, e.g. LogisticRegressionWithSGD) has been in maintenance mode since Spark 2.0 and was removed in Spark 3.0; the DataFrame-based pyspark.ml API shown here and in the example below is the recommended one.
Example
from pyspark.sql import SparkSession
from pyspark.ml.feature import StringIndexer, VectorAssembler
from pyspark.ml.classification import LogisticRegression
spark = SparkSession.builder.master("local").appName("iris-lr").getOrCreate()
# Load the dataset
data = spark.read.csv("iris.csv", header=True, inferSchema=True)
# Prepare the data for machine learning: assemble the four measurement
# columns into a single feature vector
assembler = VectorAssembler(
    inputCols=["SepalLengthCm", "SepalWidthCm", "PetalLengthCm", "PetalWidthCm"],
    outputCol="features")
data = assembler.transform(data)
# LogisticRegression expects a numeric label column, so index the string labels
indexer = StringIndexer(inputCol="Species", outputCol="label")
data = indexer.fit(data).transform(data)
data = data.select("features", "Species", "label")
# Split the data into training and testing sets
train_data, test_data = data.randomSplit([0.7, 0.3])
# Train a logistic regression model
lr = LogisticRegression(maxIter=10, regParam=0.3, elasticNetParam=0.8)
model = lr.fit(train_data)
# Evaluate the model on test data
result = model.transform(test_data)
result.show()
Output
+-----------------+----------+--------------------+--------------------+----------+
| features| Species| rawPrediction| probability|prediction|
+-----------------+----------+--------------------+--------------------+----------+
|[4.4,3.0,1.3,0.2]| setosa|[-2.5986822773672...|[0.06940842848906...| 1.0|
|[4.5,2.3,1.3,0.3]| setosa|[-2.5858611269813...|[0.07044494791203...| 1.0|
|[4.6,3.1,1.5,0.2]| setosa|[-2.7005485323125...|[0.06439869749576...| 1.0|
|[4.9,3.1,1.5,0.1]| setosa|[-2.8348128961914...|[0.05621767835447...| 1.0|
|[5.0,3.0,1.6,0.2]| setosa|[-2.7113005461473...|[0.06416365091247...| 1.0|
|[5.0,3.4,1.5,0.2]| setosa|[-2.9124087463647...|[0.05271940280812...| 1.0|
|[5.0,3.6,1.4,0.2]| setosa|[-2.9648568747981...|[0.04985776276308...| 1.0|
|[5.1,3.5,1.4,0.2]| setosa|[-2.9771940613839...|[0.04909836309138...| 1.0|
|[5.2,2.7,3.9,1.4]|versicolor|[0.22383875758864...|[0.55579968525996...| 0.0|
|[5.5,2.3,4.0,1.3]|versicolor|[0.69359207745415...|[0.66757751713760...| 0.0|
|[5.5,2.4,3.7,1.0]|versicolor|[0.65370253734421...|[0.65667778854618...| 0.0|
|[5.6,2.5,3.9,1.1]|versicolor|[0.54196392670231...|[0.63100189035781...| 0.0|
|[5.6,3.0,4.5,1.5]|versicolor|[0.37674577838001...|[0.59323890250409...| 0.0|
|[5.7,2.5,5.0,2.0]| virginica|[0.16169907708951...|[0.54530505006200...| 0.0|
|[6.0,2.2,5.0,1.5]| virginica|[0.36487467988599...|[0.58976593283596...| 0.0|
|[6.0,2.9,4.5,1.5]|versicolor|[0.32722492096549...|[0.58168351252832...| 0.0|
|[6.2,2.2,4.5,1.5]|versicolor|[0.26897783666230...|[0.56776895921205...| 0.0|
|[6.4,2.9,4.3,1.3]|versicolor|[0.30704316562263...|[0.57629919892497...| 0.0|
|[6.7,2.5,5.8,1.8]| virginica|[0.03612827702630...|[0.50903235977506...| 0.0|
|[6.8,2.8,4.8,1.4]|versicolor|[0.20531707910453...|[0.55117647590115...| 0.0|
+-----------------+----------+--------------------+--------------------+----------+
only showing top 20 rows
Explanation
We first create a SparkSession object to establish a connection with the Spark cluster. Next, we load the dataset from a CSV file using the read.csv method, letting Spark infer the column types. We then prepare the data for machine learning: the VectorAssembler class combines the four measurement columns into a single features vector column. Note that LogisticRegression expects a numeric label column (named "label" by default), so the string Species column must be indexed to numbers, for example with StringIndexer, before fitting. We split the data into training and testing sets with the DataFrame's randomSplit method. Finally, we train a logistic regression model using the LogisticRegression class of PySpark MLlib and evaluate it on the test set by calling the model's transform method, which appends rawPrediction, probability, and prediction columns.
Use
PySpark MLlib can be used to perform machine learning tasks on big datasets that may not fit into the memory of a single machine. It provides a range of algorithms for classification, regression, clustering, and collaborative filtering.
Important Points
- PySpark provides an interface for programming Spark using Python.
- PySpark MLlib is a machine learning library in PySpark.
- PySpark MLlib provides a range of algorithms for scalable and distributed machine learning.
- PySpark MLlib can be used to perform machine learning tasks on big datasets that may not fit into the memory of a single machine.
Summary
In this tutorial, we learned about PySpark MLlib, a machine learning library in PySpark that provides a range of algorithms for scalable and distributed machine learning. We saw how to train a logistic regression model using PySpark MLlib, and evaluated it on a test dataset. PySpark MLlib is ideal for performing machine learning tasks on big datasets that may not fit into the memory of a single machine.