Data Science with Spark

Spark Summit Europe

Jon Bates | @joncbates

Spark Training: Data Science

  • DataFrames
  • MLlib and ML overview
  • Labs:
    • MLlib data types
    • ETL and k-means
    • Feature transformations, pipelines, and logistic regression
    • Decision trees, evaluators, and cross-validation
    • Bootstrap and regression
    • Model parallel vs. data parallel
    • Exploring Wikipedia data
  • We'll also see: UDFs with MLlib data types, plotting with matplotlib, train and test splits, sampling, and more

DataFrames

What are DataFrames?

+-----+---+--------------+
| name|age|       address|
+-----+---+--------------+
|  Bob| 35|   [London,UK]|
|Susan| 42|[Amsterdam,NL]|
| Sara| 29| [Boulder,USA]|
+-----+---+--------------+
	    
root
 |-- name: string (nullable = true)
 |-- age: long (nullable = true)
 |-- address: struct (nullable = true)
 |    |-- city: string (nullable = true)
 |    |-- country: string (nullable = true)
	    
from collections import namedtuple

# namedtuple field names become the DataFrame column (and struct field) names
Person = namedtuple('Person', ['name', 'age', 'address'])
Address = namedtuple('Address', ['city', 'country'])

row1 = Person('Bob', 35, Address('London', 'UK'))
row2 = Person('Susan', 42, Address('Amsterdam', 'NL'))
row3 = Person('Sara', 29, Address('Boulder', 'USA'))

# sqlContext and display() are assumed to be provided by the notebook environment
people = sqlContext.createDataFrame([row1, row2, row3])
display(people)

DataFrame Usage

Select the people older than 30

people.where(people['age'] > 30)

Select the name of people from the USA

people.where(people['address.country'] == 'USA').select('name')

Calculate the average age

import pyspark.sql.functions as func
people.agg(func.avg('age'))


Why DataFrames?

  • Greater accessibility
  • Declarative rather than imperative
  • Catalyst Optimizer

DataFrame Actions and Transformations

Transformations   Actions
filter            count
select            collect
drop              show
join              take

Transformations contribute to the query plan, but nothing is executed until an action is called.
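The laziness described above can be illustrated without a cluster. The sketch below is a toy model in plain Python, not Spark's implementation: the hypothetical `LazyFrame` class records `filter` and `select` calls in a plan and only runs them when the `collect` or `count` action is called.

```python
class LazyFrame:
    """Toy stand-in for a DataFrame: transformations are recorded, actions execute."""

    def __init__(self, rows, plan=()):
        self._rows = rows          # source data: a list of dicts
        self._plan = tuple(plan)   # recorded transformations, in order

    # Transformations: return a new LazyFrame and touch no data.
    def filter(self, predicate):
        return LazyFrame(self._rows, self._plan + (("filter", predicate),))

    def select(self, *columns):
        return LazyFrame(self._rows, self._plan + (("select", columns),))

    # Actions: run the recorded plan and return a result.
    def collect(self):
        rows = self._rows
        for op, arg in self._plan:
            if op == "filter":
                rows = [r for r in rows if arg(r)]
            elif op == "select":
                rows = [{c: r[c] for c in arg} for r in rows]
        return list(rows)

    def count(self):
        return len(self.collect())


people = LazyFrame([
    {"name": "Bob", "age": 35},
    {"name": "Susan", "age": 42},
    {"name": "Sara", "age": 29},
])

evaluated = []  # records which rows the predicate actually examined


def older_than_30(row):
    evaluated.append(row["name"])
    return row["age"] > 30


query = people.filter(older_than_30).select("name")  # nothing runs yet
rows_seen_before_action = len(evaluated)              # still 0
result = query.collect()                              # the action triggers execution
```

Unlike this toy, Spark optimizes the recorded plan before running it, but the split between recording and executing is the same idea.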

Query Execution

What does "query execution" mean for Spark DataFrames?

  • Distributed read
  • RDD transformations
  • Result delivery

DataFrames vs RDDs

DataFrames can be significantly faster than RDDs, and their performance is essentially the same regardless of language, because every language API builds the same optimized query plan.

[Figure: Time to aggregate 10 million integer pairs (in seconds), comparing DataFrame SQL, DataFrame R, DataFrame Python, DataFrame Scala, RDD Python, and RDD Scala]

Labs

MLlib and ML

ML and DataFrames

  • Input to many operations
  • Versatile storage
  • Natural display
  • Output with additional columns

ML: Transformer

  • A Transformer is a class which can transform one DataFrame into another DataFrame
  • Examples
    • HashingTF
    • Bucketizer
    • LogisticRegressionModel
    • PipelineModel
  • A Transformer implements transform()
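To make the `transform()` contract concrete, here is a minimal sketch in plain Python. It is a toy, not Spark's `Bucketizer`: the hypothetical `ToyBucketizer` class works on a list of dicts instead of a DataFrame, and its parameter names are chosen for illustration only. It shows the key property of a Transformer: the input is left unchanged and the output carries an additional column.

```python
from bisect import bisect_right


class ToyBucketizer:
    """Toy Transformer: maps a numeric column to a bucket index column."""

    def __init__(self, splits, input_col, output_col):
        self.splits = list(splits)      # sorted bucket boundaries
        self.input_col = input_col
        self.output_col = output_col

    def transform(self, rows):
        out = []
        for row in rows:
            # Bucket i covers the half-open range [splits[i], splits[i + 1]).
            bucket = bisect_right(self.splits, row[self.input_col]) - 1
            new_row = dict(row)         # copy, so the input rows stay unchanged
            new_row[self.output_col] = bucket
            out.append(new_row)
        return out


people = [
    {"name": "Bob", "age": 35},
    {"name": "Susan", "age": 42},
    {"name": "Sara", "age": 29},
]

bucketizer = ToyBucketizer(splits=[0, 30, 40, 120],
                           input_col="age", output_col="age_bucket")
bucketed = bucketizer.transform(people)
```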

ML: Estimator

  • An Estimator is a class which can take a DataFrame and produce a Transformer
  • Examples
    • RandomForestClassifier
    • CrossValidator
    • IDF
    • StandardScaler
    • Pipeline
  • An Estimator implements fit()
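The relationship between `fit()` and the Transformer it returns can be sketched in plain Python. This is a toy model, not Spark's `StandardScaler`: the hypothetical `ToyStandardScaler` operates on a list of dicts, always centers the data, and uses the sample standard deviation, all of which are assumptions made for this example.

```python
import math


class ToyStandardScalerModel:
    """Toy Transformer returned by fit(): holds the learned statistics."""

    def __init__(self, mean, std, input_col, output_col):
        self.mean = mean
        self.std = std
        self.input_col = input_col
        self.output_col = output_col

    def transform(self, rows):
        scaled = []
        for row in rows:
            new_row = dict(row)
            new_row[self.output_col] = (row[self.input_col] - self.mean) / self.std
            scaled.append(new_row)
        return scaled


class ToyStandardScaler:
    """Toy Estimator: learns mean and standard deviation from the data."""

    def __init__(self, input_col, output_col):
        self.input_col = input_col
        self.output_col = output_col

    def fit(self, rows):
        values = [row[self.input_col] for row in rows]
        mean = sum(values) / len(values)
        # Sample variance (divide by n - 1); an assumption of this sketch.
        variance = sum((v - mean) ** 2 for v in values) / (len(values) - 1)
        return ToyStandardScalerModel(mean, math.sqrt(variance),
                                      self.input_col, self.output_col)


training = [{"x": 1.0}, {"x": 2.0}, {"x": 3.0}]
scaler = ToyStandardScaler(input_col="x", output_col="x_scaled")
model = scaler.fit(training)          # Estimator -> Transformer
scaled = model.transform(training)    # Transformer -> new rows
```

The Estimator itself has no `transform()`; only the fitted model does, which keeps learned state separate from configuration.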

ML: Pipelines

A Pipeline is an Estimator that contains stages representing a reusable workflow.
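One way to picture how a Pipeline chains its stages is the plain-Python sketch below. All class names here (`ToyTokenizer`, `ToyVocab`, `ToyPipeline`, and so on) are hypothetical and only mimic the fit/transform pattern on lists of dicts; this is not Spark's implementation. Fitting the pipeline fits each Estimator stage in turn and passes the transformed data to the next stage.

```python
class ToyTokenizer:
    """A Transformer stage: has transform() only."""

    def transform(self, rows):
        return [dict(r, words=r["text"].lower().split()) for r in rows]


class ToyVocabModel:
    """The Transformer produced by fitting ToyVocab."""

    def __init__(self, vocab):
        self.vocab = vocab

    def transform(self, rows):
        # Words not seen during fit() are dropped.
        return [dict(r, ids=[self.vocab[w] for w in r["words"] if w in self.vocab])
                for r in rows]


class ToyVocab:
    """An Estimator stage: learns a word-to-index mapping in fit()."""

    def fit(self, rows):
        words = sorted({w for r in rows for w in r["words"]})
        return ToyVocabModel({w: i for i, w in enumerate(words)})


class ToyPipelineModel:
    """Fitted pipeline: every stage is now a Transformer."""

    def __init__(self, stages):
        self.stages = stages

    def transform(self, rows):
        for stage in self.stages:
            rows = stage.transform(rows)
        return rows


class ToyPipeline:
    """An Estimator whose stages are Transformers or Estimators."""

    def __init__(self, stages):
        self.stages = stages

    def fit(self, rows):
        fitted = []
        for stage in self.stages:
            if hasattr(stage, "fit"):      # Estimator: fit it on the current data
                stage = stage.fit(rows)
            rows = stage.transform(rows)   # feed the result to the next stage
            fitted.append(stage)
        return ToyPipelineModel(fitted)


train = [{"text": "Spark is fast"}, {"text": "spark is fun"}]
model = ToyPipeline([ToyTokenizer(), ToyVocab()]).fit(train)
scored = model.transform([{"text": "Fun fast Spark rocks"}])
```

The same fitted model can then be reused on new data, which is what makes the workflow reusable.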

Feature Extraction and Transformation

Estimators

MLlib             ML
IDF               IDF
Word2Vec          Word2Vec
CountVectorizer   CountVectorizer
StandardScaler    StandardScaler and MinMaxScaler
ChiSqSelector     StringIndexer and VectorIndexer
-                 OneHotEncoder
-                 RFormula

Feature Extraction and Transformation

Transformers

MLlib                ML
PCA and SVD          PCA
HashingTF            HashingTF
Normalizer           Normalizer
ElementwiseProduct   ElementwiseProduct
-                    Tokenizer, n-gram, and StopWordsRemover
-                    Binarizer and Bucketizer

Feature Extraction and Transformation

Transformers cont.

MLlib   ML
-       PolynomialExpansion
-       Discrete Cosine Transformation (DCT)
-       VectorAssembler and VectorSlicer

Algorithms, Estimators, and Metrics

Clustering

MLlib                               ML
k-means                             k-means
Gaussian mixture                    -
power iteration clustering (PIC)    -
Latent Dirichlet allocation (LDA)   -
streaming k-means                   -

Algorithms, Estimators, and Metrics

Classification and Regression

MLlib                                                               ML
linear, logistic, and isotonic regression                           linear, logistic, and isotonic regression
Naive Bayes                                                         Naive Bayes
decision trees, random forests, and gradient-boosted trees (GBTs)   decision trees, random forests, and gradient-boosted trees (GBTs)
support vector machines (SVMs)                                      multilayer perceptron

Algorithms, Estimators, and Metrics

Collaborative filtering and frequent pattern mining

MLlib                                          ML
alternating least squares (ALS)                alternating least squares (ALS)
FP-growth, association rules, and PrefixSpan   -

Algorithms, Estimators, and Metrics

Evaluation

MLlib                       ML
binary classification       binary classification
multiclass classification   multiclass classification
regression metrics          regression metrics
multilabel classification   -
ranking metrics             -
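As a rough illustration of what the binary classification and regression evaluators compute, the sketch below works out precision, recall, F1, and root mean squared error in plain Python. The helper names `binary_metrics` and `rmse` are hypothetical and are not part of MLlib or ML; the labels and predictions are made-up sample values.

```python
import math


def binary_metrics(labels, predictions):
    """Precision, recall, and F1 for 0/1 labels and 0/1 predictions."""
    pairs = list(zip(labels, predictions))
    tp = sum(1 for y, p in pairs if y == 1 and p == 1)  # true positives
    fp = sum(1 for y, p in pairs if y == 0 and p == 1)  # false positives
    fn = sum(1 for y, p in pairs if y == 1 and p == 0)  # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


def rmse(labels, predictions):
    """Root mean squared error for a regression model."""
    squared = [(y - p) ** 2 for y, p in zip(labels, predictions)]
    return math.sqrt(sum(squared) / len(squared))


class_metrics = binary_metrics(labels=[1, 0, 1, 1, 0],
                               predictions=[1, 1, 1, 0, 0])
regression_error = rmse(labels=[3.0, 5.0], predictions=[1.0, 5.0])
```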

Labs

Questions?

Creative Commons License
This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License