Pipeline Outline

The technical architecture of the project.

graph LR

    A[Ingestion] --> B[Data Cleaner]
    B --> C[Feature Extraction]
    C --> D[Model Training]
    D --> E[API serving model]

Data Ingestion

We expose a gRPC/REST API to allow us to control what data we're processing. The main idea here is to expose a very lightweight service that allows us to hit the NHL API, collect data, and upload that data to pachyderm.

We then pass that data into a cleaning/transformation pipeline that takes the raw data and cleans it up, and transforms it into a format that is more useful for our model.

Feature Extraction

Secondly we have the extraction pipeline. This pipeline takes stats and transforms them into something usable for the training pipeline. A lot of data like goals/assists translate very nicely into training models, however some data like where the game is/time of day, etc is a bit more difficult to use. We'll need to do some feature engineering to make this data usable.

Note: We may in the future be able to feed data back into the model. For example, it's more impressive if a player scores against a hot goalie/shutdown defencemen rather than a cold goalie/rookie defencemen. We can use this data to help train our model.

Model Training

Thirdly we do the exciting thing. Training a model. We'll use the data we've collected and the features we've extracted to train a model. We'll use this model to predict how good a player will be.

Model Serving

Lastly we'll serve our model. We'll expose a gRPC/REST API that allows us to query our model and get predictions.