
Python Introduction to PySpark MLlib

Introduction

Apache Spark is an open-source distributed computing system for big data processing and analytics. PySpark, the Python API for Apache Spark, lets you program Spark using Python. PySpark MLlib, Spark's machine learning library, provides a range of algorithms for scalable, distributed machine learning.
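
As a quick orientation, here is a minimal sketch of starting a local Spark session from Python (the application name "demo" is an arbitrary placeholder):

from pyspark.sql import SparkSession

# Start (or reuse) a local Spark session
spark = SparkSession.builder.master("local[*]").appName("demo").getOrCreate()
print(spark.version)  # confirm the session is running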

Syntax

from pyspark.ml.classification import LogisticRegression

lr = LogisticRegression(maxIter=100, regParam=0.1)
model = lr.fit(training_data)
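
The older RDD-based pyspark.mllib API exposed the same task through LogisticRegressionWithSGD; it was deprecated in Spark 2.0 and removed in Spark 3.0, so the following sketch applies only to older Spark versions (training_rdd is assumed to be an RDD of LabeledPoint):

from pyspark.mllib.classification import LogisticRegressionWithSGD

# Legacy RDD-based API (Spark < 3.0 only); training_rdd is an RDD of LabeledPoint
model = LogisticRegressionWithSGD.train(training_rdd, iterations=100, step=0.1)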

Example

from pyspark.sql import SparkSession
from pyspark.ml.feature import StringIndexer, VectorAssembler
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.master("local[*]").appName("iris-mllib").getOrCreate()

# Load the dataset
data = spark.read.csv("iris.csv", header=True, inferSchema=True)

# Assemble the four measurement columns into a single feature vector
assembler = VectorAssembler(inputCols=["SepalLengthCm", "SepalWidthCm", "PetalLengthCm", "PetalWidthCm"],
                            outputCol="features")
data = assembler.transform(data)

# Encode the string Species column as the numeric "label" column
# that LogisticRegression expects by default
indexer = StringIndexer(inputCol="Species", outputCol="label")
data = indexer.fit(data).transform(data)
data = data.select("features", "Species", "label")

# Split the data into training and testing sets
train_data, test_data = data.randomSplit([0.7, 0.3], seed=42)

# Train a logistic regression model
lr = LogisticRegression(maxIter=10, regParam=0.3, elasticNetParam=0.8)
model = lr.fit(train_data)

# Apply the model to the test data
result = model.transform(test_data)
result.select("features", "Species", "rawPrediction", "probability", "prediction").show()

Output

+-----------------+----------+--------------------+--------------------+----------+
|         features|   Species|       rawPrediction|         probability|prediction|
+-----------------+----------+--------------------+--------------------+----------+
|[4.4,3.0,1.3,0.2]|    setosa|[-2.5986822773672...|[0.06940842848906...|       1.0|
|[4.5,2.3,1.3,0.3]|    setosa|[-2.5858611269813...|[0.07044494791203...|       1.0|
|[4.6,3.1,1.5,0.2]|    setosa|[-2.7005485323125...|[0.06439869749576...|       1.0|
|[4.9,3.1,1.5,0.1]|    setosa|[-2.8348128961914...|[0.05621767835447...|       1.0|
|[5.0,3.0,1.6,0.2]|    setosa|[-2.7113005461473...|[0.06416365091247...|       1.0|
|[5.0,3.4,1.5,0.2]|    setosa|[-2.9124087463647...|[0.05271940280812...|       1.0|
|[5.0,3.6,1.4,0.2]|    setosa|[-2.9648568747981...|[0.04985776276308...|       1.0|
|[5.1,3.5,1.4,0.2]|    setosa|[-2.9771940613839...|[0.04909836309138...|       1.0|
|[5.2,2.7,3.9,1.4]|versicolor|[0.22383875758864...|[0.55579968525996...|       0.0|
|[5.5,2.3,4.0,1.3]|versicolor|[0.69359207745415...|[0.66757751713760...|       0.0|
|[5.5,2.4,3.7,1.0]|versicolor|[0.65370253734421...|[0.65667778854618...|       0.0|
|[5.6,2.5,3.9,1.1]|versicolor|[0.54196392670231...|[0.63100189035781...|       0.0|
|[5.6,3.0,4.5,1.5]|versicolor|[0.37674577838001...|[0.59323890250409...|       0.0|
|[5.7,2.5,5.0,2.0]| virginica|[0.16169907708951...|[0.54530505006200...|       0.0|
|[6.0,2.2,5.0,1.5]| virginica|[0.36487467988599...|[0.58976593283596...|       0.0|
|[6.0,2.9,4.5,1.5]|versicolor|[0.32722492096549...|[0.58168351252832...|       0.0|
|[6.2,2.2,4.5,1.5]|versicolor|[0.26897783666230...|[0.56776895921205...|       0.0|
|[6.4,2.9,4.3,1.3]|versicolor|[0.30704316562263...|[0.57629919892497...|       0.0|
|[6.7,2.5,5.8,1.8]| virginica|[0.03612827702630...|[0.50903235977506...|       0.0|
|[6.8,2.8,4.8,1.4]|versicolor|[0.20531707910453...|[0.55117647590115...|       0.0|
+-----------------+----------+--------------------+--------------------+----------+
only showing top 20 rows

Explanation

We first create a SparkSession with the builder API to establish a connection to a local Spark cluster. Next, we load the dataset from a CSV file using the read.csv method of the SparkSession. We then prepare the data for machine learning in two steps: VectorAssembler combines the four measurement columns into a single features vector, and StringIndexer encodes the string Species column into the numeric label column that LogisticRegression expects by default. We split the data into training and testing sets using the DataFrame's randomSplit method. Finally, we train a logistic regression model with the LogisticRegression class of PySpark MLlib and apply it to the test set with model.transform to evaluate it.
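
To turn the predictions above into a single accuracy score, Spark provides MulticlassClassificationEvaluator. A minimal sketch, assuming result is the test-set DataFrame produced by model.transform in the example above:

from pyspark.ml.evaluation import MulticlassClassificationEvaluator

# Compare the "prediction" column against the indexed "label" column
evaluator = MulticlassClassificationEvaluator(labelCol="label",
                                              predictionCol="prediction",
                                              metricName="accuracy")
accuracy = evaluator.evaluate(result)
print(f"Test accuracy: {accuracy:.3f}")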

Use

PySpark MLlib can be used to perform machine learning tasks on datasets too large to fit into the memory of a single machine. It provides a range of algorithms for classification, regression, clustering, and collaborative filtering, as the clustering sketch below illustrates.
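
For example, the same assembled features column can feed a clustering algorithm. A minimal sketch using KMeans, assuming data is the DataFrame with the features column built in the example above (k=3 is an assumption matching the three iris species):

from pyspark.ml.clustering import KMeans

# Cluster the iris measurements into three groups (k=3 is an assumption)
kmeans = KMeans(featuresCol="features", k=3, seed=42)
kmeans_model = kmeans.fit(data)
clustered = kmeans_model.transform(data)  # adds a "prediction" cluster column
clustered.show(5)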

Important Points

  • PySpark provides an interface for programming Spark using Python.
  • PySpark MLlib is a machine learning library in PySpark.
  • PySpark MLlib provides a range of algorithms for scalable and distributed machine learning.
  • PySpark MLlib can be used to perform machine learning tasks on big datasets that may not fit into the memory of a single machine.

Summary

In this tutorial, we learned about PySpark MLlib, a machine learning library in PySpark that provides a range of algorithms for scalable and distributed machine learning. We saw how to train a logistic regression model using PySpark MLlib, and evaluated it on a test dataset. PySpark MLlib is ideal for performing machine learning tasks on big datasets that may not fit into the memory of a single machine.
