Introduction to PySpark MLlib
Introduction
Apache Spark is an open-source distributed computing system for big data processing and analytics. PySpark, the Python API for Apache Spark, provides an interface for programming Spark with Python. PySpark MLlib, Spark's machine learning library, provides a range of algorithms for scalable, distributed machine learning.
Syntax
from pyspark.ml.classification import LogisticRegression
model = LogisticRegression(maxIter=100, regParam=0.1).fit(training_data)
Note: the older RDD-based API (pyspark.mllib, e.g. LogisticRegressionWithSGD) has been in maintenance mode since Spark 2.0 and was removed in Spark 3.0; the DataFrame-based pyspark.ml API shown here and in the example below is the recommended one.
Example
from pyspark.sql import SparkSession
from pyspark.ml.feature import StringIndexer, VectorAssembler
from pyspark.ml.classification import LogisticRegression
spark = SparkSession.builder.master("local").appName("iris-lr").getOrCreate()
# Load the dataset
data = spark.read.csv("iris.csv", header=True, inferSchema=True)
# Prepare the data for machine learning: assemble the four measurement
# columns into a single feature vector
assembler = VectorAssembler(
    inputCols=["SepalLengthCm", "SepalWidthCm", "PetalLengthCm", "PetalWidthCm"],
    outputCol="features")
data = assembler.transform(data)
# LogisticRegression expects a numeric label column, so index the string labels
indexer = StringIndexer(inputCol="Species", outputCol="label")
data = indexer.fit(data).transform(data)
data = data.select("features", "Species", "label")
# Split the data into training and testing sets
train_data, test_data = data.randomSplit([0.7, 0.3])
# Train a logistic regression model
lr = LogisticRegression(maxIter=10, regParam=0.3, elasticNetParam=0.8)
model = lr.fit(train_data)
# Evaluate the model on test data
result = model.transform(test_data)
result.show()
Output
+-----------------+----------+--------------------+--------------------+----------+
| features| Species| rawPrediction| probability|prediction|
+-----------------+----------+--------------------+--------------------+----------+
|[4.4,3.0,1.3,0.2]| setosa|[-2.5986822773672...|[0.06940842848906...| 1.0|
|[4.5,2.3,1.3,0.3]| setosa|[-2.5858611269813...|[0.07044494791203...| 1.0|
|[4.6,3.1,1.5,0.2]| setosa|[-2.7005485323125...|[0.06439869749576...| 1.0|
|[4.9,3.1,1.5,0.1]| setosa|[-2.8348128961914...|[0.05621767835447...| 1.0|
|[5.0,3.0,1.6,0.2]| setosa|[-2.7113005461473...|[0.06416365091247...| 1.0|
|[5.0,3.4,1.5,0.2]| setosa|[-2.9124087463647...|[0.05271940280812...| 1.0|
|[5.0,3.6,1.4,0.2]| setosa|[-2.9648568747981...|[0.04985776276308...| 1.0|
|[5.1,3.5,1.4,0.2]| setosa|[-2.9771940613839...|[0.04909836309138...| 1.0|
|[5.2,2.7,3.9,1.4]|versicolor|[0.22383875758864...|[0.55579968525996...| 0.0|
|[5.5,2.3,4.0,1.3]|versicolor|[0.69359207745415...|[0.66757751713760...| 0.0|
|[5.5,2.4,3.7,1.0]|versicolor|[0.65370253734421...|[0.65667778854618...| 0.0|
|[5.6,2.5,3.9,1.1]|versicolor|[0.54196392670231...|[0.63100189035781...| 0.0|
|[5.6,3.0,4.5,1.5]|versicolor|[0.37674577838001...|[0.59323890250409...| 0.0|
|[5.7,2.5,5.0,2.0]| virginica|[0.16169907708951...|[0.54530505006200...| 0.0|
|[6.0,2.2,5.0,1.5]| virginica|[0.36487467988599...|[0.58976593283596...| 0.0|
|[6.0,2.9,4.5,1.5]|versicolor|[0.32722492096549...|[0.58168351252832...| 0.0|
|[6.2,2.2,4.5,1.5]|versicolor|[0.26897783666230...|[0.56776895921205...| 0.0|
|[6.4,2.9,4.3,1.3]|versicolor|[0.30704316562263...|[0.57629919892497...| 0.0|
|[6.7,2.5,5.8,1.8]| virginica|[0.03612827702630...|[0.50903235977506...| 0.0|
|[6.8,2.8,4.8,1.4]|versicolor|[0.20531707910453...|[0.55117647590115...| 0.0|
+-----------------+----------+--------------------+--------------------+----------+
only showing top 20 rows
Explanation
We first create a SparkSession object to establish a connection with the Spark cluster. Next, we load the dataset from a CSV file using the read.csv method, letting Spark infer the column types. We then prepare the data for machine learning: the VectorAssembler class combines the four measurement columns into a single features vector column. Note that LogisticRegression expects a numeric label column (named "label" by default), so the string Species column must be indexed to numbers, for example with StringIndexer, before fitting. We split the data into training and testing sets with the DataFrame's randomSplit method. Finally, we train a logistic regression model using the LogisticRegression class of PySpark MLlib and evaluate it on the test set by calling the model's transform method, which appends rawPrediction, probability, and prediction columns.
Use
PySpark MLlib can be used to perform machine learning tasks on big datasets that may not fit into the memory of a single machine. It provides a range of algorithms for classification, regression, clustering, and collaborative filtering.
Important Points
- PySpark provides an interface for programming Spark using Python.
- PySpark MLlib is a machine learning library in PySpark.
- PySpark MLlib provides a range of algorithms for scalable and distributed machine learning.
- PySpark MLlib can be used to perform machine learning tasks on big datasets that may not fit into the memory of a single machine.
Summary
In this tutorial, we learned about PySpark MLlib, a machine learning library in PySpark that provides a range of algorithms for scalable and distributed machine learning. We saw how to train a logistic regression model using PySpark MLlib, and evaluated it on a test dataset. PySpark MLlib is ideal for performing machine learning tasks on big datasets that may not fit into the memory of a single machine.