A Complete Guide to Best Practices and Common Pitfalls in Modern Data Science Workflows
In the rapidly evolving fields of MLOps, machine learning, and data engineering, understanding best practices and common pitfalls is essential for building robust and efficient systems. This article provides a comprehensive guide to the do's and don'ts that can help you streamline your workflows, improve productivity, and avoid costly mistakes. Whether you're a seasoned professional or just starting out, these practical tips and insights will equip you with the knowledge needed to navigate the complexities of modern data science and engineering projects.
1. Do Use Version Control for Code and Data
2. Do Implement Unit Tests for Your Code
3. Do Use Virtual Environments
4. Do Document Your Code
5. Do Use Logging for Debugging and Monitoring
6. Do Normalize Your Data
7. Do Use Pipelines for Data Processing
8. Do Split Data into Training and Testing Sets
9. Do Use Cross-Validation
10. Do Monitor Model Performance Over Time
11. Do Handle Missing Data Appropriately
12. Do Use Feature Engineering
13. Do Perform Hyperparameter Tuning
14. Do Use Ensemble Methods
15. Do Track Data Lineage
16. Do Use Data Versioning
17. Do Use Automated Deployment Pipelines
18. Do Encrypt Sensitive Data
19. Do Use Scalable Data Storage Solutions
20. Do Implement Data Governance
21. Do Automate Data Cleaning
22. Do Use Parallel Processing
23. Do Keep Your Models Updated
24. Do Ensure Data Privacy Compliance
25. Do Use Visualization for Data Understanding
# 1. Do Use Version Control for Code and Data
# Always track changes in your code and data using version control systems like Git.
# This helps with collaboration and maintaining history.
!git init
!git add .
!git commit -m "Initial commit"
# Don't keep your code without version control
# This makes it hard to track changes and collaborate
# code = "your_code_here"
# 2. Do Implement Unit Tests for Your Code
# Unit tests ensure that your functions perform as expected and help catch bugs early.
def add(a, b):
    return a + b
def test_add():
    assert add(2, 3) == 5
    assert add(-1, 1) == 0
test_add()
# Don't assume your code works without testing
# This can lead to undetected bugs
# result = add(2, 3)
# 3. Do Use Virtual Environments
# Virtual environments help manage dependencies and avoid conflicts between packages.
!python3 -m venv myenv
!source myenv/bin/activate
!pip install numpy pandas
# Don't install packages globally
# This can lead to conflicts with other projects
# !pip install numpy pandas
# 4. Do Document Your Code
# Proper documentation makes your code understandable and maintainable.
def add(a, b):
    """
    Adds two numbers.
    Parameters:
        a (int): The first number
        b (int): The second number
    Returns:
        int: The sum of a and b
    """
    return a + b
# Don't write code without comments or documentation
# def add(a, b):
#     return a + b
# 5. Do Use Logging for Debugging and Monitoring
# Logging is crucial for tracking the behavior of and issues in your code.
import logging
logging.basicConfig(level=logging.INFO)
logging.info("This is an info message")
logging.error("This is an error message")
# Don't use print statements for debugging
# print("This is an info message")
# print("This is an error message")
# 6. Do Normalize Your Data
# Normalizing data helps improve the performance of many machine learning algorithms.
from sklearn.preprocessing import StandardScaler
import numpy as np
data = np.array([[1, 2], [3, 4], [5, 6]])
scaler = StandardScaler()
normalized_data = scaler.fit_transform(data)
# Don't use raw data without normalization
# data = np.array([[1, 2], [3, 4], [5, 6]])
# 7. Do Use Pipelines for Data Processing
# Pipelines streamline the process of transforming data and training models.
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('classifier', LogisticRegression())
])
X = [[1, 2], [3, 4], [5, 6]]
y = [0, 1, 0]
pipeline.fit(X, y)
# Don't separate preprocessing and model training
# scaler = StandardScaler()
# X = scaler.fit_transform([[1, 2], [3, 4], [5, 6]])
# y = [0, 1, 0]
# model = LogisticRegression()
# model.fit(X, y)
# 8. Do Split Data into Training and Testing Sets
# Splitting data helps evaluate model performance on unseen data.
from sklearn.model_selection import train_test_split
X = [[1, 2], [3, 4], [5, 6], [7, 8]]
y = [0, 1, 0, 1]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Don't train and test on the same data
# model.fit(X, y)
# 9. Do Use Cross-Validation
# Cross-validation provides a more reliable estimate of model performance.
from sklearn.model_selection import cross_val_score
model = LogisticRegression()
X = [[1, 2], [3, 4], [5, 6], [7, 8]]
y = [0, 1, 0, 1]
scores = cross_val_score(model, X, y, cv=2)  # cv=2 because this toy dataset has only 4 samples
print("Cross-validation scores:", scores)
# Don't rely on a single train-test split for evaluation
# model.fit(X, y)
# 10. Do Monitor Model Performance Over Time
# Tracking tools like MLflow help track the performance and parameters of models over time.
import mlflow
mlflow.start_run()
mlflow.log_param("param1", 5)
mlflow.log_metric("accuracy", 0.85)
mlflow.end_run()
# Don't deploy models without monitoring
# This makes it difficult to detect performance degradation
# 11. Do Handle Missing Data Appropriately
# Proper handling of missing data prevents bias and inaccuracies in your models.
import pandas as pd
df = pd.DataFrame({'A': [1, 2, None, 4], 'B': [None, 2, 3, 4]})
df.fillna(df.mean(), inplace=True)
# Don't ignore missing data
# df = pd.DataFrame({'A': [1, 2, None, 4], 'B': [None, 2, 3, 4]})
# 12. Do Use Feature Engineering
# Creating new features can help improve model performance by providing more information.
df['C'] = df['A'] * df['B']
# Don't use raw features without transformation
# This may limit model performance
# 13. Do Perform Hyperparameter Tuning
# Hyperparameter tuning helps find the best parameters for your models.
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier
param_grid = {'n_estimators': [10, 50, 100]}
model = RandomForestClassifier()
grid_search = GridSearchCV(model, param_grid, cv=2)  # cv=2 to match the small toy dataset above
grid_search.fit(X, y)
print("Best parameters:", grid_search.best_params_)
# Don't use default parameters without tuning
# model.fit(X, y)
# 14. Do Use Ensemble Methods
# Ensemble methods often provide better performance by combining multiple models.
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
model1 = LogisticRegression()
model2 = DecisionTreeClassifier()
model3 = SVC()
ensemble = VotingClassifier(estimators=[
    ('lr', model1), ('dt', model2), ('svc', model3)], voting='hard')
ensemble.fit(X_train, y_train)
# Don't rely on a single model when ensemble methods can improve performance
# model = LogisticRegression()
# model.fit(X_train, y_train)
# 15. Do Track Data Lineage
# Tracking data lineage ensures data quality and traceability from source to destination.
# Example: Using Great Expectations for data validation
import great_expectations as ge
df_ge = ge.from_pandas(df)
df_ge.expect_column_values_to_not_be_null('A')
df_ge.expect_column_values_to_be_between('B', 2, 4)
df_ge.validate()
# Don't process data without validating its quality
# This can lead to inaccurate results
# 16. Do Use Data Versioning
# Versioning data helps manage changes and track the history of datasets.
# Example using DVC (Data Version Control)
!dvc init
!dvc add data/data.csv
!git add data/data.csv.dvc .gitignore
!git commit -m "Add raw data versioning"
# Don't modify datasets without version control
# This can lead to confusion and loss of data integrity
# 17. Do Use Automated Deployment Pipelines
# Automated deployment pipelines ensure continuous integration and deployment of your models.
# Example using GitHub Actions
# .github/workflows/ci_cd.yml
"""
name: CI/CD
on:
  push:
    branches:
      - main
jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
      - name: Set up Python
        uses: actions/setup-python@v2
        with:
          python-version: '3.8'
      - name: Install dependencies
        run: |
          pip install -r requirements.txt
      - name: Run tests
        run: |
          pytest
"""
# Don't manually deploy code
# This is error-prone and not scalable
# 18. Do Encrypt Sensitive Data
# Encrypting sensitive data protects it from unauthorized access.
from cryptography.fernet import Fernet
key = Fernet.generate_key()
cipher_suite = Fernet(key)
cipher_text = cipher_suite.encrypt(b"Sensitive Data")
plain_text = cipher_suite.decrypt(cipher_text)
# Don't store sensitive data in plaintext
# sensitive_data = "Sensitive Data"
# 19. Do Use Scalable Data Storage Solutions
# Scalable storage solutions like AWS S3 allow handling large volumes of data efficiently.
import boto3
s3 = boto3.client('s3')
s3.upload_file('data.csv', 'mybucket', 'data.csv')
# Don't store large volumes of data locally
# with open('data.csv', 'w') as file:
#     file.write("data")
# 20. Do Implement Data Governance
# Data governance ensures proper management, access, and usage of data.
# Example using Apache Atlas for data governance
from atlasclient.client import Atlas
client = Atlas('http://atlas-server:21000', username='admin', password='admin')
client.entity_post.create(data={'typeName': 'hive_table', 'attributes': {'name': 'my_table'}})
# Don't manage data without governance policies
# This can lead to data misuse and security issues
# 21. Do Automate Data Cleaning
# Automating data cleaning reduces manual errors and ensures consistent data quality.
df.dropna(inplace=True)
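# A minimal sketch of bundling cleaning steps into one reusable function; the column names and the
# imputation rule below are illustrative assumptions, not part of the original example.
def clean_dataframe(raw_df):
    cleaned = raw_df.drop_duplicates()
    cleaned = cleaned.dropna(subset=['A'])                      # drop rows missing a required field
    cleaned['B'] = cleaned['B'].fillna(cleaned['B'].mean())     # impute an optional numeric field
    return cleaned
df = clean_dataframe(df)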
# Don't clean data manually
# df = df[df['A'].notna()]
# 22. Do Use Parallel Processing
# Parallel processing speeds up data processing tasks.
from joblib import Parallel, delayed
def process_data(i):
    return i * i
results = Parallel(n_jobs=4)(delayed(process_data)(i) for i in range(10))
print(results)
# Don't process data sequentially when parallel processing is feasible
# results = [i * i for i in range(10)]
# print(results)
# 23. Do Keep Your Models Updated
# Regular updates ensure models remain accurate and relevant.
# Example using a cron job for periodic model updates (see the sketch below)
# 0 0 * * SUN /usr/bin/python3 /path/to/update_model.py
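# A minimal sketch of what an update_model.py script might contain; load_new_data() is a placeholder
# for your own data-loading logic, and the artifact path is an illustrative assumption.
import joblib

def load_new_data():
    # Placeholder: pull the latest labeled data from your database or feature store.
    raise NotImplementedError

def update_model(model_path='model.joblib'):
    model = joblib.load(model_path)   # load the currently deployed model
    X_new, y_new = load_new_data()    # fetch fresh training data
    model.fit(X_new, y_new)           # retrain on the new data
    joblib.dump(model, model_path)    # overwrite the artifact the serving layer reads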
# Don't deploy models and forget about them
# This can lead to outdated and inaccurate models
# 24. Do Ensure Data Privacy Compliance
# Ensuring compliance with data privacy laws protects user data and avoids legal issues.
# Pseudocode example for data anonymization
def anonymize_data(df):
    df['name'] = df['name'].apply(lambda x: '****' + x[-2:])
    return df
# Don't ignore data privacy regulations
# df['name'] = df['name']
# 25. Do Use Visualization for Data Understanding
# Data visualization helps in understanding the distribution and patterns in data.
import matplotlib.pyplot as plt
plt.hist(df['A'], bins=10)
plt.xlabel('Value')
plt.ylabel('Frequency')
plt.title('Histogram of Column A')
plt.show()
# Don't analyze data without visualization
# This makes it hard to understand data distribution and patterns
# df.describe()
26. Do Use Descriptive Variable Names
27. Do Use Version Control for Jupyter Notebooks
28. Do Use Comments to Explain Code
29. Do Handle Exceptions Properly
30. Do Use List Comprehensions for Simple Transformations
31. Do Use Built-In Functions
32. Do Use Context Managers for File Operations
33. Do Use f-Strings for String Formatting
34. Do Use Enumerate for Index Tracking in Loops
35. Do Use Default Dictionary for Grouping Data
36. Do Use Generator Expressions for Large Data
37. Do Use Named Tuples for Readable Tuple Data
38. Do Use Type Hints for Function Signatures
39. Do Use DataFrames for Structured Data
40. Do Use SQLAlchemy for Database Interactions
41. Do Use Matplotlib for Basic Plotting
42. Do Use Seaborn for Statistical Plots
43. Do Use Scikit-learn for Machine Learning Models
44. Do Use TensorFlow/Keras for Deep Learning
45. Do Use `timeit` for Measuring Execution Time
46. Do Use Decorators for Reusable Code
47. Do Use `argparse` for Command-Line Arguments
48. Do Use Logging for Debugging and Monitoring
49. Do Use `pandas_profiling` for Quick Data Profiling
50. Do Use `matplotlib` for Plotting
#26. Do Use Descriptive Variable Names
# Descriptive variable names make your code more readable and understandable.
# They help you and others quickly grasp the purpose of each variable.
num_students = 50
average_score = 85.6
# Don't use vague variable names
# Using x and y doesn't tell the reader what these variables represent.
# x = 50
# y = 85.6
#27. Do Use Version Control for Jupyter Notebooks
# Converting Jupyter notebooks to Python scripts makes changes easier to track.
# This ensures better version control for notebook-based work.
!jupyter nbconvert --to script my_notebook.ipynb
# Don't ignore version control for notebooks
# Notebook changes are difficult to track without conversion, which can lead to lost work.
# Without version control you lose the ability to roll back changes or collaborate efficiently.
#28. Do Use Comments to Explain Code
# Comments explain the purpose and functionality of your code.
# They help others (and yourself) understand what the code does, especially when revisiting it later.
def calculate_area(radius):
    # Calculate the area of a circle
    pi = 3.14159
    return pi * radius ** 2
# Don't write uncommented code
# Code without comments can be difficult to understand and maintain.
# def calculate_area(radius):
#     pi = 3.14159
#     return pi * radius ** 2
#29. Do Handle Exceptions Properly
# Proper exception handling prevents your program from crashing and provides useful error messages.
# Handling exceptions makes your code more robust and user-friendly.
try:
    result = 10 / 0
except ZeroDivisionError:
    print("Division by zero is not allowed.")
# Don't ignore potential errors
# Ignoring exceptions can lead to unhandled errors and crashes.
# result = 10 / 0
#30. Do Use List Comprehensions for Simple Transformations
# List comprehensions are a concise and efficient way to create lists.
# They make your code cleaner and often faster.
numbers = [1, 2, 3, 4, 5]
squared = [x ** 2 for x in numbers]
# Don't use traditional loops for simple list operations
# Traditional loops for simple transformations are less efficient and more verbose.
# squared = []
# for x in numbers:
#     squared.append(x ** 2)
#31. Do Use Built-In Functions
# Built-in functions are optimized and make your code more concise.
# They are usually implemented in C and are faster than hand-written loops.
numbers = [1, 2, 3, 4, 5]
total = sum(numbers)
# Don't manually implement common operations
# Manually implementing common operations is error-prone and less efficient.
# total = 0
# for number in numbers:
#     total += number
#32. Do Use Context Managers for File Operations
# Context managers ensure files are properly closed after operations.
# This prevents resource leaks and ensures data integrity.
with open('file.txt', 'r') as file:
    data = file.read()
# Don't manually manage file closing
# Forgetting to close files can lead to resource leaks.
# file = open('file.txt', 'r')
# data = file.read()
# file.close()
#33. Do Use f-Strings for String Formatting
# f-Strings are more readable and concise for string formatting.
# They are also faster and more powerful than older formatting methods.
name = "Alice"
greeting = f"Hello, {name}!"
# Don't use older formatting methods
# Older formatting methods are less readable and more error-prone.
# greeting = "Hello, {}!".format(name)
#34. Do Use Enumerate for Index Tracking in Loops
# enumerate provides a clean way to get both the index and the value in loops.
# This makes the code more readable and avoids range(len()).
fruits = ["apple", "banana", "cherry"]
for index, fruit in enumerate(fruits):
    print(f"{index}: {fruit}")
# Don't use range(len()) for index tracking
# Using range(len()) is less readable and more error-prone.
# for i in range(len(fruits)):
#     print(f"{i}: {fruits[i]}")
#35. Do Use Default Dictionary for Grouping Data
# defaultdict simplifies grouping data by automatically initializing lists.
# This reduces boilerplate code and avoids key errors.
from collections import defaultdict
data = [("fruit", "apple"), ("fruit", "banana"), ("vegetable", "carrot")]
grouped = defaultdict(list)
for category, item in data:
    grouped[category].append(item)
# Don't manually initialize dictionary keys
# Manually initializing keys is more verbose and error-prone.
# grouped = {}
# for category, item in data:
#     if category not in grouped:
#         grouped[category] = []
#     grouped[category].append(item)
#36. Do Use Generator Expressions for Large Data
# Generator expressions are memory efficient for handling large data.
# They generate items on the fly and do not store the entire list in memory.
numbers = (x ** 2 for x in range(1000000))
for num in numbers:
    if num > 100:
        break
# Don't use list comprehensions for large data
# List comprehensions store all items in memory, which can be inefficient for large data sets.
# numbers = [x ** 2 for x in range(1000000)]
# for num in numbers:
#     if num > 100:
#         break
#37. Do Use Named Tuples for Readable Tuple Data
# Named tuples provide named fields for better readability.
# They combine the simplicity of tuples with named fields, making code more self-documenting.
from collections import namedtuple
Point = namedtuple('Point', ['x', 'y'])
p = Point(1, 2)
print(p.x, p.y)
# Don't use regular tuples for structured data
# Regular tuples rely on positional access, which is less readable.
# p = (1, 2)
# print(p[0], p[1])
#38. Do Use Type Hints for Function Signatures
# Type hints improve code readability and help with type checking.
# They document the expected types of function parameters and return values.
def add(a: int, b: int) -> int:
    return a + b
# Don't omit type hints
# Omitting type hints makes it harder to understand the expected input and output types.
# def add(a, b):
#     return a + b
#39. Do Use DataFrames for Structured Data
# DataFrames provide powerful data manipulation capabilities.
# They are efficient, flexible, and integrate well with other data science tools.
import pandas as pd
data = {'name': ['Alice', 'Bob'], 'age': [25, 30]}
df = pd.DataFrame(data)
# Don't use dictionaries or lists for structured data operations
# DataFrames offer more functionality and ease of use than raw lists or dictionaries.
# data = {'name': ['Alice', 'Bob'], 'age': [25, 30]}
# names = data['name']
# ages = data['age']
#40. Do Use SQLAlchemy for Database Interactions
# SQLAlchemy provides a high-level API for database interactions.
# It abstracts database operations and helps avoid SQL injection attacks.
from sqlalchemy import create_engine, text
engine = create_engine('sqlite:///mydatabase.db')
with engine.connect() as connection:
    result = connection.execute(text("SELECT * FROM users"))
    for row in result:
        print(row)
# Don't use raw SQL queries with direct database connections
# Raw SQL queries can be vulnerable to SQL injection and are less maintainable.
# import sqlite3
# conn = sqlite3.connect('mydatabase.db')
# cursor = conn.cursor()
# cursor.execute("SELECT * FROM users")
# for row in cursor.fetchall():
#     print(row)
# conn.close()
#41. Do Use Matplotlib for Basic Plotting
# Matplotlib is a powerful and flexible plotting library.
# It lets you create a wide range of static, animated, and interactive plots.
import matplotlib.pyplot as plt
plt.plot([1, 2, 3], [4, 5, 6])
plt.xlabel('x-axis')
plt.ylabel('y-axis')
plt.title('Simple Plot')
plt.show()
# Don't skip visualization in data analysis
# Visualizing data helps in understanding trends and patterns.
# It is crucial for exploratory data analysis.
#42. Do Use Seaborn for Statistical Plots
# Seaborn provides high-level interfaces for drawing attractive statistical graphics.
# It simplifies complex visualization tasks and integrates well with pandas.
import seaborn as sns
data = sns.load_dataset("iris")
sns.pairplot(data, hue="species")
plt.show()
# Don't use only basic plots for statistical data
# Statistical plots provide more insight and reveal relationships between variables.
# Basic plots may not be sufficient for in-depth analysis.
#43. Do Use Scikit-learn for Machine Learning Models
# Scikit-learn offers a wide range of tools for machine learning.
# It provides simple and efficient tools for data mining and data analysis.
from sklearn.linear_model import LinearRegression
import numpy as np
X = np.array([[1, 1], [1, 2], [2, 2], [2, 3]])
y = np.dot(X, np.array([1, 2])) + 3
model = LinearRegression().fit(X, y)
print(model.coef_)
# Don't manually implement machine learning algorithms
# Manually implementing algorithms is error-prone and less efficient.
# Use well-tested libraries like Scikit-learn.
#44. Do Use TensorFlow/Keras for Deep Learning
# TensorFlow/Keras simplifies the implementation of deep learning models.
# They offer high-level APIs and are widely used in industry.
import tensorflow as tf
model = tf.keras.models.Sequential([
    tf.keras.layers.Dense(128, activation='relu'),
    tf.keras.layers.Dense(10, activation='softmax')
])
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])
# Don't use custom implementations for deep learning
# Custom implementations are harder to maintain and less efficient.
# Use TensorFlow/Keras for better performance and scalability.
#45. Do Use `timeit` for Measuring Execution Time
# `timeit` is useful for timing small code snippets.
# It provides a reliable way to measure the execution time of your code.
import timeit
print(timeit.timeit("x = sum(range(1000))", number=1000))
# Don't use the time module for quick timing
# The time module is less accurate and can be affected by other system processes.
# import time
# start = time.time()
# x = sum(range(1000))
# end = time.time()
# print(end - start)
#46. Do Use Decorators for Reusable Code
# Decorators add functionality to existing functions in a clean way.
# They are a powerful tool for code reuse and separation of concerns.
def my_decorator(func):
    def wrapper():
        print("Something is happening before the function is called.")
        func()
        print("Something is happening after the function is called.")
    return wrapper
@my_decorator
def say_hello():
    print("Hello!")
say_hello()
# Don't manually add functionality to functions
# Manually adding functionality is less clean and less reusable.
# def say_hello():
#     print("Hello!")
# def say_hello_decorated():
#     print("Something is happening before the function is called.")
#     say_hello()
#     print("Something is happening after the function is called.")
# say_hello_decorated()
#47. Do Use `argparse` for Command-Line Arguments
# `argparse` provides a way to handle command-line arguments.
# It makes your scripts more flexible and user-friendly.
import argparse
parser = argparse.ArgumentParser(description="Process some integers.")
parser.add_argument('integers', metavar='N', type=int, nargs='+', help='an integer for the accumulator')
parser.add_argument('--sum', dest='accumulate', action='store_const', const=sum, default=max, help='sum the integers (default: find the max)')
args = parser.parse_args()
print(args.accumulate(args.integers))
# Don't use sys.argv for command-line arguments
# sys.argv is less flexible and requires more manual handling.
# import sys
# if len(sys.argv) > 1:
#     print(sys.argv[1:])
#48. Do Use Logging for Debugging and Monitoring
# Logging is crucial for tracking the behavior of and issues in your code.
# It provides a standardized way to output status and error messages.
import logging
logging.basicConfig(level=logging.INFO)
logging.info("This is an info message")
logging.error("This is an error message")
# Don't use print statements for debugging
# Print statements are less flexible and harder to manage in production.
# print("This is an info message")
# print("This is an error message")
#49. Do Use `pandas_profiling` for Quick Data Profiling
# `pandas_profiling` generates a detailed profiling report for a DataFrame.
# It helps you understand your data quickly and comprehensively.
import pandas as pd
from pandas_profiling import ProfileReport
df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie'],
    'age': [24, 27, 22]
})
profile = ProfileReport(df)
profile.to_file("report.html")
# Don't rely solely on basic descriptive statistics for data profiling
# Basic descriptive statistics provide limited insight compared to a full profiling report.
# df.describe()
#50. Do Use `matplotlib` for Plotting
# `matplotlib` is a powerful and flexible plotting library.
# It lets you create a wide range of static, animated, and interactive plots.
import matplotlib.pyplot as plt
plt.plot([1, 2, 3], [4, 5, 6])
plt.xlabel('x-axis')
plt.ylabel('y-axis')
plt.title('Simple Plot')
plt.show()
# Don't omit axis labels and the title
# Axis labels and titles are important for understanding the context of the plot.
# plt.plot([1, 2, 3], [4, 5, 6])
# plt.show()
51. Do Perform Feature Selection
52. Do Use Regular Expressions for Text Cleaning
53. Do Use Data Augmentation for Image Data
54. Do Use Confusion Matrix for Classification Evaluation
55. Do Use Learning Rate Schedulers
56. Do Use Early Stopping to Prevent Overfitting
57. Do Use Grid Search for Hyperparameter Tuning
58. Do Use Random Search for Hyperparameter Tuning
59. Do Use Model Ensembles to Improve Performance
60. Do Use Feature Scaling
61. Do Use Batch Normalization in Deep Learning Models
62. Do Use Data Imputation for Missing Values
63. Do Use Dropout for Regularization in Neural Networks
64. Do Use `joblib` for Saving and Loading Models
65. Do Use Precision and Recall for Imbalanced Classes
66. Do Use DataFrame Operations Instead of Iterating Over Rows
67. Do Use `pyyaml` for Configuration Files
68. Do Use SQLAlchemy ORM for Database Operations
69. Do Use `.gitignore` to Exclude Unnecessary Files
70. Do Use `tqdm` for Progress Bars in Loops
71. Do Use `configparser` for Configuration Management
72. Do Use `seaborn` for Pair Plots
73. Do Use `pandas` for GroupBy Operations
74. Do Use L2 Regularization in Linear Models
75. Do Use the `time` Library for Simple Time Measurement
#51. Do Perform Feature Selection
# Feature selection improves model performance by reducing overfitting and focusing on the most important features.
from sklearn.feature_selection import SelectKBest, f_classif
X_new = SelectKBest(f_classif, k=10).fit_transform(X, y)
# Don't use all features without evaluating their importance
# This can lead to overfitting and degraded model performance.
# model.fit(X, y)
#52. Do Use Regular Expressions for Text Cleaning
# Regular expressions efficiently clean and preprocess text data by matching patterns.
import re
text = "This is a sample text with numbers 12345 and symbols $%&."
cleaned_text = re.sub(r'\W+', ' ', text)
# Don't manually replace each unwanted character
# This approach is less efficient and prone to errors.
# cleaned_text = text.replace('$', '').replace('%', '').replace('&', '')
#53. Do Use Data Augmentation for Image Data
# Data augmentation increases the diversity of your training data without collecting new data, improving model robustness.
from tensorflow.keras.preprocessing.image import ImageDataGenerator
datagen = ImageDataGenerator(rotation_range=20, width_shift_range=0.2, height_shift_range=0.2, horizontal_flip=True)
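# A minimal sketch of feeding augmented batches to a model (kept as a comment because it assumes
# X_train is an array of images with shape (num_samples, height, width, channels) and y_train holds the labels):
# model.fit(datagen.flow(X_train, y_train, batch_size=32), epochs=10)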
# Don't train on limited data without augmentation
# This can lead to overfitting and poor generalization.
# model.fit(X_train, y_train)
#54. Do Use Confusion Matrix for Classification Evaluation
# Confusion matrices provide a detailed breakdown of classification performance, showing true positives, false positives, and so on.
from sklearn.metrics import confusion_matrix
y_true = [0, 1, 0, 1]
y_pred = [0, 0, 0, 1]
cm = confusion_matrix(y_true, y_pred)
print(cm)
# Don't rely solely on accuracy for classification evaluation
# Accuracy can be misleading, especially with imbalanced classes.
# from sklearn.metrics import accuracy_score
# accuracy = accuracy_score(y_true, y_pred)
# print(accuracy)
#55. Do Use Learning Rate Schedulers
# Learning rate schedulers adjust the learning rate during training, improving model performance and convergence.
import tensorflow as tf
model = tf.keras.models.Sequential([...])
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy')
lr_scheduler = tf.keras.callbacks.LearningRateScheduler(lambda epoch: 1e-3 * 10 ** (epoch / 20))
history = model.fit(X_train, y_train, epochs=30, callbacks=[lr_scheduler])
# Don't use a constant learning rate for all epochs
# This can lead to suboptimal training and convergence.
# model.compile(optimizer='adam', loss='sparse_categorical_crossentropy')
# history = model.fit(X_train, y_train, epochs=30)
#56. Do Use Early Stopping to Prevent Overfitting
# Early stopping terminates training when validation performance stops improving, preventing overfitting.
from tensorflow.keras.callbacks import EarlyStopping
early_stopping = EarlyStopping(monitor='val_loss', patience=3)
history = model.fit(X_train, y_train, epochs=30, validation_split=0.2, callbacks=[early_stopping])
# Don't train for a fixed number of epochs without monitoring validation performance
# This can lead to overfitting if training is stopped too late.
# history = model.fit(X_train, y_train, epochs=30, validation_split=0.2)
#57. Do Use Grid Search for Hyperparameter Tuning
# Grid search systematically finds the best hyperparameters for your model by evaluating all possible combinations.
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier
param_grid = {'n_estimators': [10, 50, 100]}
grid_search = GridSearchCV(RandomForestClassifier(), param_grid, cv=5)
grid_search.fit(X, y)
print(grid_search.best_params_)
# Don't use default hyperparameters without tuning
# Default settings are unlikely to be optimal for your specific dataset.
# model = RandomForestClassifier()
# model.fit(X, y)
#58. Do Use Random Search for Hyperparameter Tuning
# Random search explores a wide range of hyperparameters more efficiently than grid search by randomly sampling combinations.
from sklearn.model_selection import RandomizedSearchCV
from sklearn.ensemble import RandomForestClassifier
param_distributions = {'n_estimators': [10, 50, 100], 'max_depth': [None, 10, 20]}
random_search = RandomizedSearchCV(RandomForestClassifier(), param_distributions, n_iter=10, cv=5)
random_search.fit(X, y)
print(random_search.best_params_)
# Don't use default hyperparameters without tuning
# Doing so may miss combinations that could lead to better model performance.
# model = RandomForestClassifier()
# model.fit(X, y)
#59. Do Use Model Ensembles to Improve Performance
# Model ensembles combine multiple models to improve overall performance, leveraging the strengths of each individual model.
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
model1 = LogisticRegression()
model2 = DecisionTreeClassifier()
model3 = SVC()
ensemble = VotingClassifier(estimators=[
    ('lr', model1), ('dt', model2), ('svc', model3)], voting='hard')
ensemble.fit(X_train, y_train)
# Don't rely on a single model when an ensemble can improve performance
# Single models might not capture all patterns in the data.
# model = LogisticRegression()
# model.fit(X_train, y_train)
#60. Do Use Feature Scaling
# Feature scaling ensures all features contribute equally to model performance, improving convergence and accuracy.
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# Don't use unscaled data for algorithms sensitive to feature scaling
# This can lead to suboptimal performance and convergence issues.
# model.fit(X, y)
#61. Do Use Batch Normalization in Deep Learning Models
# Batch normalization helps stabilize and accelerate the training of deep learning models by normalizing layer inputs.
from tensorflow.keras.layers import BatchNormalization
model = tf.keras.models.Sequential([
    tf.keras.layers.Dense(128, activation='relu'),
    BatchNormalization(),
    tf.keras.layers.Dense(10, activation='softmax')
])
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])
# Don't ignore batch normalization for deep networks
# This can lead to unstable and slow training.
# model = tf.keras.models.Sequential([
#     tf.keras.layers.Dense(128, activation='relu'),
#     tf.keras.layers.Dense(10, activation='softmax')
# ])
# model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])
#62. Do Use Data Imputation for Missing Values
# Data imputation fills in missing values, making the dataset more complete and usable for analysis.
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(strategy='mean')
X_imputed = imputer.fit_transform(X)
# Don't drop rows with missing values unless necessary
# Dropping rows can result in a significant loss of data.
# X = X.dropna()
#63. Do Use Dropout for Regularization in Neural Networks
# Dropout regularizes neural networks and prevents overfitting by randomly dropping neurons during training.
from tensorflow.keras.layers import Dropout
model = tf.keras.models.Sequential([
    tf.keras.layers.Dense(128, activation='relu'),
    Dropout(0.5),
    tf.keras.layers.Dense(10, activation='softmax')
])
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])
# Don't ignore dropout for regularization in deep networks
# This can lead to overfitting and poor generalization.
# model = tf.keras.models.Sequential([
#     tf.keras.layers.Dense(128, activation='relu'),
#     tf.keras.layers.Dense(10, activation='softmax')
# ])
# model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])
#64. Do Use `joblib` for Saving and Loading Models
# `joblib` efficiently saves and loads large models, making it easy to reuse trained models.
import joblib
model = RandomForestClassifier()
model.fit(X, y)
joblib.dump(model, 'model.joblib')
loaded_model = joblib.load('model.joblib')
# Don't use pickle for large models due to performance issues
# Pickle is less efficient for large objects.
# import pickle
# model = RandomForestClassifier()
# model.fit(X, y)
# with open('model.pkl', 'wb') as file:
#     pickle.dump(model, file)
# with open('model.pkl', 'rb') as file:
#     loaded_model = pickle.load(file)
#65. Do Use Precision and Recall for Imbalanced Classes
# Precision and recall provide better insight for imbalanced classes than accuracy, focusing on false positives and false negatives.
from sklearn.metrics import precision_score, recall_score
precision = precision_score(y_true, y_pred)
recall = recall_score(y_true, y_pred)
print(f"Precision: {precision}, Recall: {recall}")
# Don't rely solely on accuracy for imbalanced classes
# Accuracy can be misleading if the class distribution is skewed.
# from sklearn.metrics import accuracy_score
# accuracy = accuracy_score(y_true, y_pred)
# print(f"Accuracy: {accuracy}")
#66. Do Use DataFrame Operations Instead of Iterating Over Rows
# Vectorized operations are more efficient than iterating over DataFrame rows, providing better performance and readability.
import pandas as pd
df['new_column'] = df['existing_column'] * 2
# Don't iterate over DataFrame rows for simple operations
# This approach is less efficient and more verbose.
# for index, row in df.iterrows():
#     df.at[index, 'new_column'] = row['existing_column'] * 2
#67. Do Use `pyyaml` for Configuration Files
# YAML files are easy to read and write for configuration settings, providing a clean way to manage configurations.
import yaml
with open('config.yaml', 'r') as file:
    config = yaml.safe_load(file)
print(config['parameter'])
# Don't hardcode configuration settings
# Hardcoding makes it difficult to manage and update configurations.
# config = {
#     "parameter": "value"
# }
# print(config['parameter'])
#68. Do Use SQLAlchemy ORM for Database Operations
# The SQLAlchemy ORM provides an easy-to-use abstraction for database operations, making database interactions more Pythonic.
from sqlalchemy import create_engine, Column, Integer, String
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy.orm import sessionmaker
engine = create_engine('sqlite:///mydatabase.db')
Base = declarative_base()
class User(Base):
    __tablename__ = 'users'
    id = Column(Integer, primary_key=True)
    name = Column(String)
Base.metadata.create_all(engine)
Session = sessionmaker(bind=engine)
session = Session()
new_user = User(name='Alice')
session.add(new_user)
session.commit()
# Don't use raw SQL queries with direct database connections
# Raw SQL queries can be vulnerable to SQL injection and are less maintainable.
# import sqlite3
# conn = sqlite3.connect('mydatabase.db')
# cursor = conn.cursor()
# cursor.execute("INSERT INTO users (name) VALUES ('Alice')")
# conn.commit()
# conn.close()
#69. Do Use `.gitignore` to Exclude Unnecessary Files
# Excluding unnecessary files keeps your repository clean and efficient, avoiding clutter and reducing repository size.
# .gitignore file
"""
__pycache__/
*.pyc
*.pyo
.DS_Store
"""
# Don't track unnecessary files in version control
# This clutters the repository and increases its size.
#70. Do Use `tqdm` for Progress Bars in Loops
# `tqdm` provides a progress bar for loops, improving user experience by showing the progress of operations.
from tqdm import tqdm
for i in tqdm(range(100)):
    pass
# Don't manually track progress in loops
# Manual progress tracking is less efficient and less user-friendly.
# for i in range(100):
#     print(f"Progress: {i}")
#71. Do Use `configparser` for Configuration Management
# `configparser` helps manage configuration settings in a structured way, making it easier to handle multiple configurations.
import configparser
config = configparser.ConfigParser()
config.read('config.ini')
db_host = config['database']['host']
# Don't hardcode configuration settings
# Hardcoding makes it difficult to manage and update configurations.
# db_host = 'localhost'
# db_user = 'user'
# db_pass = 'pass'
#72. Do Use `seaborn` for Pair Plots
# Pair plots visualize relationships between multiple variables, providing insight into data distributions and correlations.
import seaborn as sns
data = sns.load_dataset("iris")
sns.pairplot(data, hue="species")
plt.show()
# Don't analyze data without visualizing relationships
# Visualizations provide insight that raw data or statistics might not reveal.
#73. Do Use `pandas` for GroupBy Operations
# GroupBy operations in `pandas` are efficient for aggregating data, simplifying complex data manipulation tasks.
import pandas as pd
df = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar'], 'B': [1, 2, 3, 4]})
grouped = df.groupby('A').sum()
print(grouped)
# Don't manually aggregate data with loops
# Manual aggregation is less efficient and more error-prone.
# grouped = {}
# for index, row in df.iterrows():
#     key = row['A']
#     if key not in grouped:
#         grouped[key] = 0
#     grouped[key] += row['B']
# print(grouped)
#74. Do Use L2 Regularization in Linear Models
# L2 regularization prevents overfitting in linear models by adding a penalty for large coefficients.
from sklearn.linear_model import Ridge
model = Ridge(alpha=1.0)
model.fit(X_train, y_train)
# Don't use linear models without regularization
# This can lead to overfitting, especially with many features.
# from sklearn.linear_model import LinearRegression
# model = LinearRegression()
# model.fit(X_train, y_train)
#75. Do Use the `time` Library for Simple Time Measurement
# The `time` library is useful for simple time measurements, providing a straightforward way to measure code execution time.
import time
start_time = time.time()
# Code block
end_time = time.time()
print(f"Execution time: {end_time - start_time} seconds")
# Don't use `datetime` for simple time measurement
# The `time` library is more appropriate for simple timing tasks.
# from datetime import datetime
# start_time = datetime.now()
# # Code block
# end_time = datetime.now()
# print(f"Execution time: {(end_time - start_time).total_seconds()} seconds")
76. Use Version Control for Your Code and Data
77. Use Relative File Paths Instead of Absolute Paths
78. Handle Missing or Invalid Data
79. Use Descriptive Variable and Function Names
80. Write Modular and Reusable Code
81. Use Logging to Track Progress and Errors
82. Use Configuration Files for Storing Settings
83. Document Your Functions and Classes
84. Use Pipelines for Data Preprocessing and Model Training
85. Use Virtual Environments for Isolating Dependencies
86. Use a Requirements File to Specify Dependencies
87. Use Caching to Avoid Redundant Computations
88. Use Parallel Processing for Intensive Tasks
89. Use GPU Acceleration for Deep Learning Tasks
90. Monitor Model Performance and Data Drift
91. Use Cross-Validation for Model Evaluation
92. Use Feature Selection for High-Dimensional Data
93. Handle Imbalanced Datasets Appropriately
94. Use Appropriate Evaluation Metrics for Your Problem
95. Perform Hyperparameter Tuning
96. Save and Load Trained Models
97. Monitor Resource Usage and Optimize Accordingly
98. Continuously Update and Retrain Models on New Data
99. Collaborate with Domain Experts and Stakeholders
100. Use Data Version Control
#76. Use Version Control for Your Code and Data
# Using Git to track changes in your code ensures that you can manage versions, collaborate effectively, and revert to earlier states if necessary.
# Initialize a git repository and make the initial commit
git init
git add .
git commit -m "Initial commit"
# Don't hardcode sensitive information such as passwords in your code
# This can lead to security vulnerabilities.
db_password = "my_secret_password"  # Don't do this!
#77. Use Relative File Paths Instead of Absolute Paths
# Using relative paths makes your code more portable and less dependent on specific directory structures.
data_path = "../data/dataset.csv"  # Relative path
# Don't hardcode absolute paths
# This can lead to issues when sharing or deploying code across different environments.
data_path = "/user/home/data/dataset.csv"  # Absolute path
#78. Handle Missing or Invalid Data
# Always check for and handle missing or invalid data to prevent errors during data processing and model training.
import pandas as pd
df = pd.read_csv("data.csv")
df.dropna(inplace=True)  # Handle missing values
# Don't ignore missing data
# Ignoring missing data can lead to incorrect results and errors in your analysis or model.
df = pd.read_csv("data.csv")
# Not checking for missing values before processing
#79. Use Descriptive Variable and Function Names
# Descriptive names make your code easier to understand and maintain.
def calculate_accuracy(y_true, y_pred):
    """Calculate the accuracy of predictions."""
    return accuracy_score(y_true, y_pred)
# Don't use cryptic or abbreviated names
# Unclear names make your code difficult to read and maintain.
x = "Hello"  # Unclear what 'x' represents
#80. Write Modular and Reusable Code
# Breaking your code into smaller, reusable functions improves readability and maintainability.
def preprocess_data(data):
    """Preprocess the input data."""
    # Preprocessing steps here
    return preprocessed_data
# Don't write long and complex functions
# Long functions are hard to debug and maintain. Always aim for modularity.
def process_data(data):
    # A very long and complex function
    # ...
    pass
#81. Use Logging to Track Progress and Errors
# Logging helps in tracking the flow of execution and debugging issues.
import logging
logging.basicConfig(level=logging.INFO)
logging.info("Starting data processing...")
# Don't use print statements for debugging
# Use logging instead, as it is more manageable for production code.
print("Debug: Value of x is", x)  # Use logging instead
#82. Use Configuration Files for Storing Settings
# Configuration files separate code from settings, making your code more flexible and easier to manage.
config = {
    "learning_rate": 0.01,
    "batch_size": 32,
    "epochs": 10
}
# Don't hardcode hyperparameters
# Hardcoding makes it difficult to adjust settings without changing the code.
learning_rate = 0.01  # Don't hardcode hyperparameters
#83. Document Your Functions and Classes
# Documenting your code helps others understand its purpose and usage.
def train_model(X, y):
    """
    Train a machine learning model.
    Args:
        X (numpy.ndarray): Input features.
        y (numpy.ndarray): Target labels.
    Returns:
        model: Trained model object.
    """
    # Training code here
    return model
# Don't neglect documentation
# Lack of documentation makes it hard for others (and your future self) to understand and use your code.
def some_function(x, y):
    # Function with no documentation
    return x + y
#84. Use Pipelines for Data Preprocessing and Model Training
# Pipelines streamline the process of transforming data and training models.
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("classifier", LogisticRegression())
])
# Don't manually apply preprocessing steps
# This can lead to inconsistent and less maintainable code.
X_scaled = StandardScaler().fit_transform(X)  # Manual preprocessing
#85. Use Virtual Environments for Isolating Dependencies
# Virtual environments help manage dependencies and avoid conflicts between packages.
# Create a virtual environment and activate it
python -m venv myenv
source myenv/bin/activate
# Don't install packages globally
# Installing packages globally can lead to conflicts with other projects.
pip install numpy  # Installing packages globally
#86. Use a Requirements File to Specify Dependencies
# A requirements file ensures consistent environments across different setups.
# requirements.txt
"""
numpy==1.21.0
pandas==1.3.0
scikit-learn==0.24.2
"""
# Don't manually manage dependencies
# This can lead to inconsistencies and missed packages.
pip install numpy pandas scikit-learn  # Manual dependency management
#87. Use Caching to Avoid Redundant Computations
# Caching avoids recomputing expensive operations, improving performance.
import functools
@functools.lru_cache(maxsize=None)
def expensive_function(x):
    # Expensive computation here (placeholder)
    return x ** 2
# Don't perform redundant computations
# Redundant computations waste time and resources.
result = expensive_function(x)  # Recomputing the same result
#88. Use Parallel Processing for Intensive Tasks
# Parallel processing speeds up computationally intensive tasks by utilizing multiple cores.
from joblib import Parallel, delayed
results = Parallel(n_jobs=-1)(delayed(process_data)(data) for data in dataset)
# Don't use sequential processing for large datasets
# Sequential processing is slower and inefficient for large datasets.
results = [process_data(data) for data in dataset]  # Sequential processing
#89. Use GPU Acceleration for Deep Learning Tasks
# Using GPU acceleration significantly speeds up training for deep learning models.
import tensorflow as tf
with tf.device("/GPU:0"):
    model.fit(X_train, y_train)  # Perform GPU-accelerated computations
# Don't train deep learning models on a CPU
# Training on a CPU is much slower and less efficient.
model.fit(X_train, y_train)  # Training on CPU
#90. Monitor Model Performance and Data Drift
# Monitoring helps ensure your model stays accurate and detects when retraining is needed.
from sklearn.metrics import accuracy_score
def monitor_performance(model, X_test, y_test):
    y_pred = model.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    logging.info(f"Model accuracy: {accuracy}")
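# A minimal sketch of a simple data drift check: compare feature means of incoming data against the
# training data, in units of the training standard deviation. The 0.5 threshold is an illustrative assumption.
import numpy as np

def check_data_drift(X_reference, X_new, threshold=0.5):
    ref_mean = np.mean(X_reference, axis=0)
    ref_std = np.std(X_reference, axis=0) + 1e-8
    shift = np.abs(np.mean(X_new, axis=0) - ref_mean) / ref_std
    drifted = shift > threshold
    if drifted.any():
        logging.warning(f"Possible data drift in feature indices: {np.where(drifted)[0].tolist()}")
    return drifted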
# Don't assume model performance stays constant
# Failing to monitor performance can lead to degraded results over time.
# Not monitoring model performance over time
#91. Use Cross-Validation for Model Evaluation
# Cross-validation provides a more reliable estimate of model performance by evaluating it on multiple subsets of the data.
from sklearn.model_selection import cross_val_score
scores = cross_val_score(model, X, y, cv=5)
print(f"Cross-validation scores: {scores}")
# Don't rely on a single train-test split
# This can lead to overfitting or underfitting.
model.fit(X_train, y_train)
accuracy = model.score(X_test, y_test)  # Single train-test split
#92. Use Feature Selection for High-Dimensional Data
# Feature selection reduces dimensionality, improving model performance and interpretability.
from sklearn.feature_selection import SelectKBest, f_classif
selector = SelectKBest(score_func=f_classif, k=10)
X_selected = selector.fit_transform(X, y)
# Don't use all features without selection
# Using all features can lead to overfitting and longer training times.
model.fit(X, y)  # Using all features without selection
#93. Handle Imbalanced Datasets Appropriately
# Handling class imbalance improves model performance on minority classes.
from imblearn.over_sampling import SMOTE
smote = SMOTE()
X_resampled, y_resampled = smote.fit_resample(X, y)
# Don't train on imbalanced datasets
# Ignoring class imbalance can lead to biased models.
model.fit(X, y)  # Training on an imbalanced dataset
#94. Use Appropriate Evaluation Metrics for Your Problem
# Appropriate metrics provide a better understanding of model performance, especially for imbalanced datasets.
from sklearn.metrics import precision_recall_fscore_support
precision, recall, f1, _ = precision_recall_fscore_support(y_test, y_pred, average="weighted")
print(f"Precision: {precision}, Recall: {recall}, F1-score: {f1}")
# Don't rely solely on accuracy
# Accuracy can be misleading, especially for imbalanced datasets.
accuracy = accuracy_score(y_test, y_pred)  # Using accuracy alone
#95. Perform Hyperparameter Tuning
# Hyperparameter tuning improves model performance by finding the best parameters.
from sklearn.model_selection import GridSearchCV
param_grid = {
    "C": [0.1, 1, 10],
    "kernel": ["linear", "rbf"]
}
grid_search = GridSearchCV(model, param_grid, cv=5)
grid_search.fit(X_train, y_train)
best_model = grid_search.best_estimator_
# Don't use default hyperparameters
# Default hyperparameters can lead to suboptimal model performance.
model = LogisticRegression()  # Using default hyperparameters
#96. Save and Load Trained Models
# Saving and loading models lets you reuse trained models without retraining.
import joblib
joblib.dump(model, "trained_model.pkl")
loaded_model = joblib.load("trained_model.pkl")
# Don't retrain models without saving
# Retraining from scratch every time is inefficient and time-consuming.
model.fit(X_train, y_train)  # Retraining the model without saving
#97. Monitor Resource Usage and Optimize Accordingly
# Monitoring system resources ensures that your application runs efficiently and prevents resource exhaustion.
import psutil
cpu_usage = psutil.cpu_percent()
memory_usage = psutil.virtual_memory().percent
logging.info(f"CPU usage: {cpu_usage}%, Memory usage: {memory_usage}%")
# Don't ignore resource constraints
# Ignoring resources can lead to system crashes and degraded performance.
# Running computationally expensive tasks without considering resources
#98. Continuously Update and Retrain Models on New Data
# Continuously updating models ensures they remain accurate and relevant as new data arrives.
def retrain_model(model, X_new, y_new):
    model.fit(X_new, y_new)
    joblib.dump(model, "updated_model.pkl")
# Don't use outdated models
# Models trained on old data can perform poorly on new data.
# Using a model trained on old data without updates
#99. Collaborate with Domain Experts and Stakeholders
# Collaboration ensures that your models are aligned with domain knowledge and business goals.
# Example: Discussing feature importance with domain experts
feature_importances = model.feature_importances_
# Discuss and interpret feature importances with domain experts, e.g. as in the sketch below
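# A minimal sketch of pairing importances with feature names so they can be reviewed together;
# the feature names here are illustrative assumptions.
feature_names = ['age', 'income', 'tenure']
for name, importance in sorted(zip(feature_names, feature_importances), key=lambda pair: -pair[1]):
    print(f"{name}: {importance:.3f}")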
# Don't work in isolation
# Working alone can lead to models that don't meet business needs or overlook important insights.
# Building models without consulting domain experts
#100. Use Data Version Control
# Data version control tools like DVC track changes in datasets, ensuring reproducibility and collaboration.
dvc init
dvc add data/dataset.csv
dvc push
# Don't manage data manually
# Manual data management is error-prone and lacks the benefits of version control.
# Manually copying and moving datasets
101. Use Container Orchestration for Model Deployment
102. Implement CI/CD for ML Pipelines
103. Use Feature Stores
104. Implement Model Versioning
105. Use Model Registries
106. Implement A/B Testing for Model Deployment
107. Use Distributed Training for Large Datasets
108. Implement Model Monitoring
109. Use Configuration Management
110. Implement Data Validation
111. Use Feature Flags for Gradual Rollouts
112. Implement Automated Model Retraining
113. Use Experiment Tracking
114. Implement Model Interpretability
115. Use Dependency Management
116. Implement Data Lineage Tracking
117. Implement Model Serving with API Endpoints
118. Use Infrastructure as Code
119. Implement Automated Model Documentation
120. Use Feature Importance Analysis
121. Implement Model Caching
122. Use Model Compression Techniques
123. Implement Model Fairness Checks
124. Use Multi-Model Serving
125. Implement Model Rollback Mechanisms
# 101. Use Container Orchestration for Model Deployment
# Container orchestration ensures scalable and manageable model deployments.
# Kubernetes deployment example
apiVersion: apps/v1
kind: Deployment
metadata:
  name: model-service
spec:
  replicas: 3
  selector:
    matchLabels:
      app: model-service
  template:
    metadata:
      labels:
        app: model-service
    spec:
      containers:
      - name: model-service
        image: your-registry/model-service:v1
        ports:
        - containerPort: 8080
# Don't deploy models directly on servers without containerization
# This approach is neither scalable nor manageable.
python run_model_server.py
# 102. Implement CI/CD for ML Pipelines
# CI/CD pipelines automate training, testing, and deploying models.
# GitLab CI/CD example
stages:
  - train
  - test
  - deploy
train_model:
  stage: train
  script:
    - python train_model.py
  artifacts:
    paths:
      - model.pkl
test_model:
  stage: test
  script:
    - python test_model.py
deploy_model:
  stage: deploy
  script:
    - kubectl apply -f deployment.yaml
# Don't manually train and deploy models
# Manual processes are error-prone and not scalable.
python train_model.py
scp model.pkl user@server:/path/to/deployment/
# 103. Use Feature Stores
# Feature stores ensure consistent feature engineering across training and serving.
from feast import FeatureStore
store = FeatureStore(repo_path=".")
feature_vector = store.get_online_features(
    features=[
        "customer_features:age",
        "customer_features:total_purchases",
    ],
    entity_rows=[{"customer_id": 1001}]
)
# Don't recompute features separately for each model or service
# That approach is inefficient and prone to errors.
def get_customer_features(customer_id):
    # Recompute features from raw data
    pass
# 104. Implement Model Versioning
# Model versioning lets you track different model iterations and their performance.
import mlflow
mlflow.set_experiment("my_experiment")
with mlflow.start_run():
    mlflow.log_param("learning_rate", 0.01)
    mlflow.log_metric("accuracy", 0.95)
    mlflow.sklearn.log_model(model, "model")
# Don't overwrite models without versioning
# That makes it difficult to track changes and revert to earlier versions.
model.save("model.pkl")
# 105. Use Model Registries
# Model registries provide a centralized place to manage the model lifecycle.
from mlflow.tracking import MlflowClient
client = MlflowClient()
model_version = client.create_model_version(
    name="my_model",
    source="mlflow-artifacts:/1/model",
    run_id="run_id"
)
# Don't manage models manually without a registry
# That approach lacks tracking and management capabilities.
model.save("/path/to/models/model_v1.pkl")
# 106. Implement A/B Testing for Model Deployment
# A/B testing helps compare model performance in production.
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
model_A.fit(X_train, y_train)
model_B.fit(X_train, y_train)
# Deploy both models and route traffic between them (a routing sketch follows this item).
# Don't deploy new models without comparison
# That can lead to unintended consequences if the new model performs worse.
model.fit(X, y)
deploy_model(model)
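# A minimal sketch of the traffic-routing step mentioned above; model_A, model_B, and the
# request-handling shape are assumptions carried over from the example, not a fixed API.
import random
import logging
def predict_ab(features, split=0.5):
    # Route a single request to variant A or B and log which variant served it
    variant = "A" if random.random() < split else "B"
    chosen_model = model_A if variant == "A" else model_B
    prediction = chosen_model.predict([features])[0]
    logging.info(f"variant={variant}, prediction={prediction}")
    return prediction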
# 107. Use Distributed Training for Large Datasets
# Distributed training enables efficient processing of large datasets.
import horovod.tensorflow as hvd
hvd.init()
model = build_model()
optimizer = hvd.DistributedOptimizer(optimizer)
model.fit(x_train, y_train, steps_per_epoch=500 // hvd.size())
# Don't train large models on a single machine
# That is inefficient and can be infeasible for very large datasets.
model.fit(X, y, epochs=100)
# 108. Implement Model Monitoring
# Model monitoring helps detect performance degradation and data drift.
from prometheus_client import start_http_server, Summary
REQUEST_TIME = Summary('request_processing_seconds', 'Time spent processing a request')
@REQUEST_TIME.time()
def process_request(t):
    # Process the request here
    pass
if __name__ == '__main__':
    start_http_server(8000)
# Don't deploy models without monitoring
# Without monitoring, degradation in model performance goes unnoticed.
model.predict(X)
# 109. Use Configuration Management
# Configuration management enables easy updates and reproducibility.
import configparser
config = configparser.ConfigParser()
config.read('config.ini')
learning_rate = config['model']['learning_rate']
batch_size = config['training']['batch_size']
# Don't hardcode configuration parameters
# Hardcoding makes it difficult to update settings and can lead to errors.
learning_rate = 0.01
batch_size = 32
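# For reference, a config.ini matching the lookups in the example above might look like this;
# the section and key names are assumptions inferred from the code, not a required layout.
[model]
learning_rate = 0.01
[training]
batch_size = 32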
# 110. Implement Data Validation
# Data validation ensures data quality and consistency.
import great_expectations as ge
context = ge.data_context.DataContext()
batch = context.get_batch({"path": "data.csv"}, "my_datasource")
results = context.run_validation_operator("action_list_operator", assets_to_validate=[batch])
# Don't assume data quality without validation
# That can lead to errors in downstream processes.
df = pd.read_csv("data.csv")
model.fit(df)
# 111. Use Feature Flags for Gradual Rollouts
# Feature flags allow a controlled rollout of new models or features.
# Pseudo-code for a generic feature-flag client; substitute your own SDK (e.g., LaunchDarkly or Unleash).
import flag
flag.init('my_app_key')
if flag.is_enabled('new_model_feature'):
    prediction = new_model.predict(X)
else:
    prediction = old_model.predict(X)
# Don't deploy new models to all users at once
# Gradual rollouts help mitigate the risks of new deployments.
prediction = new_model.predict(X)
# 112. Implement Automated Model Retraining
# Automated retraining keeps models up to date with new data.
from airflow import DAG
from airflow.operators.python_operator import PythonOperator
from datetime import datetime, timedelta
def retrain_model():
    # Retraining logic goes here (see the sketch after this item)
    pass
dag = DAG('model_retraining', start_date=datetime.now(), schedule_interval=timedelta(days=1))
retrain_task = PythonOperator(task_id='retrain_model', python_callable=retrain_model, dag=dag)
# Don't retrain models manually
# Manual retraining is error-prone and not scalable.
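# A minimal sketch of what the retraining callable might do; the "data/latest.csv" path,
# "target" column, and joblib file names are illustrative assumptions.
import pandas as pd
import joblib
def retrain_model_example():
    df = pd.read_csv("data/latest.csv")        # load the freshest training data
    X_new, y_new = df.drop(columns=["target"]), df["target"]
    model = joblib.load("trained_model.pkl")   # start from the current production model
    model.fit(X_new, y_new)                    # refit on the new data
    joblib.dump(model, "updated_model.pkl")    # persist the refreshed model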
# 113. Use Experiment Tracking
# Experiment tracking helps organize and compare different model iterations.
import mlflow
mlflow.set_experiment("hyperparameter_tuning")
for learning_rate in [0.01, 0.1, 1.0]:
    with mlflow.start_run():
        mlflow.log_param("learning_rate", learning_rate)
        model = train_model(learning_rate)
        accuracy = evaluate_model(model)
        mlflow.log_metric("accuracy", accuracy)
# Don't track experiments manually
# Manual tracking is inefficient and prone to errors.
results = []
for learning_rate in [0.01, 0.1, 1.0]:
    model = train_model(learning_rate)
    accuracy = evaluate_model(model)
    results.append((learning_rate, accuracy))
# 114. Implement Model Interpretability
# Model interpretability is crucial for understanding and trusting model decisions.
import shap
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)
shap.summary_plot(shap_values, X)
# Don't deploy black-box models without interpretation
# Lack of interpretability can lead to distrust and regulatory issues.
predictions = model.predict(X)
# 115. Use Dependency Management
# Proper dependency management ensures reproducibility across different environments.
# Use virtual environments and requirements files
python -m venv myenv
source myenv/bin/activate
pip install -r requirements.txt
# Don't install packages globally or manage dependencies by hand
# That can lead to conflicts and inconsistencies.
pip install package1 package2 package3
# 116. Implement Data Lineage Tracking
# Data lineage tracking helps in understanding data flow and impact analysis.
# Simplified OpenLineage sketch; a real RunEvent also needs eventTime, a Run object, and a producer.
from openlineage.client import OpenLineageClient
from openlineage.client.run import RunEvent, RunState
client = OpenLineageClient()
client.emit(RunEvent(
    eventType=RunState.START,
    job={
        "namespace": "my_namespace",
        "name": "data_transformation_job"
    },
    inputs=[{
        "namespace": "my_namespace",
        "name": "input_dataset"
    }],
    outputs=[{
        "namespace": "my_namespace",
        "name": "output_dataset"
    }]
))
# Don't process data without tracking lineage
# Without lineage tracking, data transformations and dependencies become hard to understand.
output_data = process_data(input_data)
# 117. Implement Model Serving with API Endpoints
# API endpoints allow easy integration of models into applications.
from flask import Flask, request, jsonify
app = Flask(__name__)
@app.route('/predict', methods=['POST'])
def predict():
    data = request.json
    prediction = model.predict(data['features'])
    return jsonify({'prediction': prediction.tolist()})
if __name__ == '__main__':
    app.run(host='0.0.0.0', port=8080)
# Don't embed models directly in application code
# API endpoints provide a cleaner and more scalable solution.
prediction = model.predict(features)
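# A sketch of calling the endpoint above from a client; the URL, port, and payload shape are
# assumptions based on the Flask route defined in the example.
import requests
response = requests.post("http://localhost:8080/predict", json={"features": [[5.1, 3.5, 1.4, 0.2]]})
print(response.json())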
# 118. Use Infrastructure as Code
# Infrastructure as Code ensures consistent and reproducible infrastructure setup.
# Terraform example
resource "aws_sagemaker_model" "example" {
  execution_role_arn = aws_iam_role.example.arn
  primary_container {
    image = "${aws_ecr_repository.example.repository_url}:latest"
  }
  name = "my-model"
}
# Don't set up infrastructure manually
# Manual setup is error-prone and not scalable.
# 119. Implement Automated Model Documentation
# Automated documentation keeps docs up to date and consistent.
from pathlib import Path
from pdoc import pdoc
modules = ['my_model']
pdoc(*modules, output_directory=Path('docs'))
# Don't rely on manual documentation
# Manual documentation quickly becomes outdated.
# 120. Use Feature Importance Analysis
# Feature importance analysis helps in understanding model behavior and selecting features.
from sklearn.inspection import permutation_importance
result = permutation_importance(model, X_test, y_test, n_repeats=10, random_state=42)
for i in result.importances_mean.argsort()[::-1]:
    print(f"{feature_names[i]}: {result.importances_mean[i]:.3f}")
# Don't ignore feature importance
# Understanding feature importance aids model interpretation and feature selection.
model.fit(X, y)
# 121. Implement Model Caching
# Model caching can significantly reduce prediction latency for frequently requested predictions.
from flask_caching import Cache
app = Flask(__name__)
cache = Cache(app, config={'CACHE_TYPE': 'simple'})
@app.route('/predict')
@cache.cached(timeout=60)
def predict():
    # Prediction logic here
    pass
# Don't recompute predictions for every request
# That approach is inefficient and increases latency.
@app.route('/predict')
def predict():
    return model.predict(X)
# 122. Use Model Compression Techniques
# Model compression techniques can reduce model size and improve inference speed.
import tensorflow_model_optimization as tfmot
pruning_schedule = tfmot.sparsity.keras.PolynomialDecay(
    initial_sparsity=0.0, final_sparsity=0.5, begin_step=0, end_step=1000
)
model = tfmot.sparsity.keras.prune_low_magnitude(model, pruning_schedule=pruning_schedule)
# Don't deploy large models without considering compression
# Large models can be inefficient and slow.
model.save('large_model.h5')
# 123. Implement Model Fairness Checks
# Fairness checks help ensure models do not discriminate against protected groups.
from fairlearn.metrics import demographic_parity_difference
y_pred = model.predict(X_test)
dpd = demographic_parity_difference(y_test, y_pred, sensitive_features=A_test)
print(f"Demographic Parity Difference: {dpd}")
# Don't deploy models without checking for bias
# Unchecked bias can lead to discriminatory practices and legal issues.
model.fit(X, y)
# 124. Use Multi-Model Serving
# Multi-model serving enables efficient deployment and management of multiple models.
import grpc
from tensorflow_serving.apis import predict_pb2
from tensorflow_serving.apis import prediction_service_pb2_grpc
channel = grpc.insecure_channel('localhost:8500')  # default TensorFlow Serving gRPC port
stub = prediction_service_pb2_grpc.PredictionServiceStub(channel)
request = predict_pb2.PredictRequest()
request.model_spec.name = 'my_model'
request.model_spec.signature_name = 'serving_default'
# Don't deploy every model as a separate service
# That approach can lead to management and resource issues.
model1 = load_model('model1.h5')
model2 = load_model('model2.h5')
# 125. Implement Model Rollback Mechanisms
# Model rollback mechanisms allow quick recovery if a new model deployment fails.
import mlflow
def rollback_model(model_name, version):
    client = mlflow.tracking.MlflowClient()
    model_version = client.get_model_version(name=model_name, version=version)
    client.transition_model_version_stage(
        name=model_name,
        version=model_version.version,
        stage="Production"
    )
# Don't deploy models without a rollback plan
# Without a rollback mechanism, a bad deployment can mean prolonged downtime and errors.
model.deploy(new_version)
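# A short usage sketch for the rollback helper above: promote a previously validated version
# back to Production if the new one misbehaves ("my_model" and version 3 are illustrative).
rollback_model("my_model", version=3)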
126. Implement Data Versioning
127. Use Data Catalogs
128. Implement Data Quality Checks
129. Implement Data Lineage Tracking
130. Implement Data Masking for Sensitive Information
131. Use Efficient Data Formats
132. Implement Data Partitioning
133. Use Data Streaming for Real-time Processing
134. Implement Data Anonymization
135. Use Data Augmentation for ML
136. Implement Data Caching
137. Use Distributed Data Processing
138. Implement Data Validation for LLM Inputs
139. Use Efficient Text Encoding for LLMs
140. Implement Data Sharding for Large Datasets
141. Use Efficient Data Loading for ML Training
142. Implement Data Compression
143. Use Efficient Data Structures for LLM Token Management
144. Implement Data Preprocessing Pipelines
145. Use Efficient Data Serialization
146. Implement Data Augmentation for LLM Fine-tuning
147. Use Efficient Data Indexing
148. Implement Data Access Controls
149. Use Schema Validation
150. Use Data Encryption for Sensitive Data
151. Use a Data Lake for Centralized Storage
152. Implement Data Governance
153. Use Data Replication for Redundancy
154. Use ETL for Data Transformation
155. Use Data Orchestration Tools
156. Use Data Quality Tools
157. Use Change Data Capture (CDC) for Real-time Updates
158. Use Data Lakes for Large-scale Storage
159. Use Data Encryption for Security
160. Use Schema Evolution for Changing Data Structures
# 126. Implement Data Versioning
# Data versioning ensures reproducibility and traceability of experiments.
import dvc.api
with dvc.api.open('data/dataset.csv', rev='v1.0') as f:
    # Use the specific version of the data
    data = pd.read_csv(f)
# Don't use data without version control
data = pd.read_csv('data/dataset.csv')
# 127. Use Data Catalogs
# Data catalogs help in discovering and understanding available datasets.
# Simplified Amundsen databuilder sketch; a real job also wires a task, publisher, and connection config.
from databuilder.extractor.csv_extractor import CsvExtractor
from databuilder.loader.file_system_neo4j_csv_loader import FsNeo4jCSVLoader
from databuilder.publisher.neo4j_csv_publisher import Neo4jCsvPublisher
from databuilder.job.job import DefaultJob
extractor = CsvExtractor()
loader = FsNeo4jCSVLoader()
publisher = Neo4jCsvPublisher()
job = DefaultJob(extractor=extractor, loader=loader, publisher=publisher)
job.launch()
# Don't rely on manual tracking of datasets
datasets = {'user_data': 'path/to/user_data.csv', 'product_data': 'path/to/product_data.csv'}
# 128. Implement Data Quality Checks
# Data quality checks ensure the reliability of data used in ML models.
import great_expectations as ge
context = ge.get_context()
suite = context.create_expectation_suite("my_suite")
validator = context.get_validator(
    batch_request={"data_asset_name": "my_table"},
    expectation_suite_name="my_suite"
)
validator.expect_column_values_to_not_be_null("important_column")
validator.save_expectation_suite()
# Don't assume data quality without checks
df = pd.read_csv("data.csv")
model.fit(df)
# 129. Implement Data Lineage Tracking
# Data lineage tracking helps in understanding data flow and impact analysis.
# Simplified OpenLineage sketch; a real RunEvent also needs eventTime, a Run object, and a producer.
from openlineage.client import OpenLineageClient
from openlineage.client.run import RunEvent, RunState
client = OpenLineageClient()
client.emit(RunEvent(
    eventType=RunState.START,
    job={
        "namespace": "my_namespace",
        "name": "data_transformation_job"
    },
    inputs=[{
        "namespace": "my_namespace",
        "name": "input_dataset"
    }],
    outputs=[{
        "namespace": "my_namespace",
        "name": "output_dataset"
    }]
))
# Don't process data without tracking lineage
output_data = process_data(input_data)
# 130. Implement Data Masking for Sensitive Information
# Data masking protects sensitive information in datasets used for ML.
from presidio_analyzer import AnalyzerEngine
from presidio_anonymizer import AnonymizerEngine
analyzer = AnalyzerEngine()
anonymizer = AnonymizerEngine()
text = "My name is John Doe and my phone number is 212-555-5555"
analyzer_results = analyzer.analyze(text=text, language='en')
anonymized_text = anonymizer.anonymize(text=text, analyzer_results=analyzer_results)
# Don't use raw data containing sensitive information
df = pd.read_csv("raw_user_data.csv")
# 131. Use Efficient Data Formats
# Efficient data formats like Parquet can significantly improve data processing speed.
import pyarrow as pa
import pyarrow.parquet as pq
table = pa.Table.from_pandas(df)
pq.write_table(table, 'data.parquet')
# Don't use inefficient formats for large datasets
df.to_csv('large_data.csv')
# 132. Implement Data Partitioning
# Data partitioning can improve query performance and enable parallel processing.
def partition_data(df, partition_column):
    for value in df[partition_column].unique():
        partition = df[df[partition_column] == value]
        partition.to_parquet(f"data/partition={value}.parquet")
# Don't store all data in a single file
df.to_parquet('all_data.parquet')
# 133. Use Data Streaming for Real-time Processing
# Data streaming enables real-time data processing for ML applications.
from pyspark.sql import SparkSession
from pyspark.sql.functions import from_json, col
spark = SparkSession.builder.appName("StreamProcessor").getOrCreate()
df = (spark
      .readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092")
      .option("subscribe", "input_topic")
      .load())
# `schema` is assumed to describe the JSON payload of each Kafka message
processed = df.select(from_json(col("value").cast("string"), schema).alias("data")).select("data.*")
query = (processed
         .writeStream
         .outputMode("append")
         .format("parquet")
         .option("path", "output_path")
         .option("checkpointLocation", "checkpoint_path")
         .start())
# Don't rely on batch processing for real-time data
import time
while True:
    df = pd.read_csv('new_data.csv')
    process_data(df)
    time.sleep(60)
# 134. Implement Data Anonymization
# Data anonymization helps protect privacy while maintaining data utility for ML.
from faker import Faker
fake = Faker()
def anonymize_data(df):
    df['name'] = df['name'].apply(lambda x: fake.name())
    df['email'] = df['email'].apply(lambda x: fake.email())
    return df
# Don't use real personal data for ML experiments
df = pd.read_csv('personal_data.csv')
# 135. Use Data Augmentation for ML
# Data augmentation can help increase dataset size and diversity for ML models.
import random
from nltk.corpus import wordnet
def synonym_replacement(sentence, n):
    words = sentence.split()
    new_words = words.copy()
    random_word_list = list(set([word for word in words if word.isalnum()]))
    random.shuffle(random_word_list)
    num_replaced = 0
    for random_word in random_word_list:
        synonyms = []
        for syn in wordnet.synsets(random_word):
            for lemma in syn.lemmas():
                synonyms.append(lemma.name())
        if len(synonyms) >= 1:
            synonym = random.choice(list(set(synonyms)))
            new_words = [synonym if word == random_word else word for word in new_words]
            num_replaced += 1
        if num_replaced >= n:
            break
    return ' '.join(new_words)
# Don't rely solely on the original data for training
model.fit(X, y)
# 136. Implement Data Caching
# Data caching can significantly improve data retrieval speed for ML pipelines.
import redis
import pickle
r = redis.Redis(host='localhost', port=6379, db=0)
def get_data(key):
    cached_data = r.get(key)
    if cached_data:
        return pickle.loads(cached_data)
    else:
        data = expensive_data_operation()
        r.set(key, pickle.dumps(data))
        return data
# Don't repeatedly recompute expensive data operations
def get_data():
    return expensive_data_operation()
# 137. Use Distributed Data Processing
# Distributed processing enables handling of large-scale datasets for ML.
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("DataProcessing").getOrCreate()
df = spark.read.parquet("hdfs://data.parquet")
processed_df = df.groupBy("category").agg({"value": "mean"})
processed_df.write.parquet("hdfs://processed_data.parquet")
# Don't process large datasets on a single machine
df = pd.read_parquet("large_data.parquet")
processed_df = df.groupby("category")["value"].mean()
# 138. Implement Data Validation for LLM Inputs
# Input validation helps prevent potential issues with LLM processing.
import re
def validate_llm_input(text):
    if len(text) > 1000:
        raise ValueError("Input text is too long")
    if re.search(r'[^\w\s.,!?]', text):
        raise ValueError("Input contains invalid characters")
    return text
# Don't pass raw user input to LLMs
response = llm.generate(user_input)
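# A short usage sketch for the validator above; llm.generate is the assumed interface carried
# over from the example, not a specific library call.
safe_input = validate_llm_input(user_input)
response = llm.generate(safe_input)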
# 139. Use Efficient Text Encoding for LLMs
# Efficient encoding can improve processing speed and reduce memory usage for LLMs.
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("gpt2")
def encode_text(text):
    return tokenizer.encode(text, add_special_tokens=True, max_length=512, truncation=True)
# Don't use character-level encoding for large texts
encoded_text = [ord(c) for c in text]
# 140. Implement Data Sharding for Large Datasets
# Data sharding enables parallel processing and can improve training speed for large datasets.
def shard_dataset(data, num_shards):
    shards = [[] for _ in range(num_shards)]
    for i, item in enumerate(data):
        shards[i % num_shards].append(item)
    return shards
sharded_data = shard_dataset(large_dataset, 10)
# Don't process an entire large dataset at once
model.fit(large_dataset)
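# A minimal sketch of processing the shards above in parallel with multiprocessing; process_shard
# is an assumed, illustrative per-shard function (on spawn-based platforms, guard this with
# `if __name__ == "__main__":`).
from multiprocessing import Pool
def process_shard(shard):
    # Placeholder for per-shard work (e.g., feature extraction or partial aggregation)
    return len(shard)
with Pool(processes=4) as pool:
    shard_results = pool.map(process_shard, sharded_data)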
# 141. Use Efficient Data Loading for ML Training
# Efficient data loading can significantly speed up ML training.
import pandas as pd
from torch.utils.data import Dataset, DataLoader
class MyDataset(Dataset):
    def __init__(self, data_file):
        self.data = pd.read_parquet(data_file)
    def __len__(self):
        return len(self.data)
    def __getitem__(self, idx):
        return self.data.iloc[idx].values
dataset = MyDataset('data.parquet')
dataloader = DataLoader(dataset, batch_size=32, shuffle=True, num_workers=4)
# Don't load the entire dataset into memory
data = pd.read_csv('large_data.csv')
model.fit(data)
# 142. Implement Data Compression
# Data compression can reduce storage requirements and improve I/O performance.
import gzip
import pickle
def save_compressed_data(data, filename):
    with gzip.open(filename, 'wb') as f:
        pickle.dump(data, f)
def load_compressed_data(filename):
    with gzip.open(filename, 'rb') as f:
        return pickle.load(f)
# Don't store large uncompressed datasets
with open('large_data.pkl', 'wb') as f:
    pickle.dump(large_data, f)
# 143. Use Efficient Data Structures for LLM Token Management
# Efficient data structures can improve token management for LLMs.
from collections import deque
class TokenBuffer:
    def __init__(self, max_tokens):
        self.buffer = deque(maxlen=max_tokens)
    def add_tokens(self, tokens):
        self.buffer.extend(tokens)
    def get_tokens(self):
        return list(self.buffer)
# Don't use plain lists for token management
tokens = []
tokens.extend(new_tokens)
if len(tokens) > max_tokens:
    tokens = tokens[-max_tokens:]
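# A short usage sketch for the buffer above (the token IDs and the 4096-token limit are illustrative):
buffer = TokenBuffer(max_tokens=4096)
buffer.add_tokens([101, 2023, 2003, 102])
context_tokens = buffer.get_tokens()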
# 144. Implement Data Preprocessing Pipelines
# Preprocessing pipelines ensure consistent data transformation across training and inference.
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer
preprocessing_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaler', StandardScaler())
])
X_processed = preprocessing_pipeline.fit_transform(X)
# Don't preprocess data by hand
X_imputed = X.fillna(X.mean())
X_scaled = (X_imputed - X_imputed.mean()) / X_imputed.std()
# 145. Use Efficient Data Serialization
# Efficient serialization can improve data transfer and storage for ML pipelines.
import msgpack
def serialize_data(data):
    return msgpack.packb(data)
def deserialize_data(serialized_data):
    return msgpack.unpackb(serialized_data)
# Don't default to inefficient serialization methods
import pickle
serialized = pickle.dumps(data)
# 146. Implement Data Augmentation for LLM Fine-tuning
# Data augmentation can help improve LLM performance on specific tasks.
import nlpaug.augmenter.word as naw
aug = naw.SynonymAug(aug_src='wordnet')
def augment_text(text, num_augmentations=5):
    return [aug.augment(text) for _ in range(num_augmentations)]
# Don't rely solely on the original data for fine-tuning
model.fit(original_texts, labels)
# 147. Use Efficient Data Indexing
# Efficient indexing improves search performance for large datasets.
import faiss
index = faiss.IndexFlatL2(vector_dimension)  # vector_dimension: dimensionality of your embeddings
index.add(vectors)                            # vectors: float32 array of shape (n, vector_dimension)
def search_similar_vectors(query_vector, k=5):
    return index.search(query_vector, k)
# Don't use linear search for large datasets
similar_vectors = [vector for vector in vectors if cosine_similarity(query_vector, vector) > threshold]
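# A self-contained usage sketch with random vectors; the dimensionality, dataset size, and k are
# illustrative assumptions, not recommended values.
import numpy as np
import faiss
d = 128
xb = np.random.random((1000, d)).astype('float32')   # database vectors
xq = np.random.random((1, d)).astype('float32')      # a single query vector
demo_index = faiss.IndexFlatL2(d)
demo_index.add(xb)
distances, indices = demo_index.search(xq, 5)         # 5 nearest neighbours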
# 148. Implement Data Access Controls
# Data access controls protect sensitive information and ensure compliance.
import pandas as pd
def restrict_access(data, user_role):
    if user_role != 'admin':
        return data.drop(columns=['sensitive_column'])
    return data
df = pd.read_csv("data.csv")
restricted_data = restrict_access(df, user_role)
# Don't allow unrestricted access to sensitive data
df = pd.read_csv("data.csv")
# 149. Use Schema Validation
# Schema validation ensures data integrity and consistency.
import pandas as pd
import pandera as pa
from pandera import Column, DataFrameSchema
schema = DataFrameSchema({
    "name": Column(pa.String),
    "age": Column(pa.Int, checks=pa.Check.greater_than_or_equal_to(0)),
    "email": Column(pa.String, checks=pa.Check.str_matches(r"[^@]+@[^@]+\.[^@]+"))
})
df = pd.read_csv("data.csv")
validated_df = schema.validate(df)
# Don't assume the data schema without validation
df = pd.read_csv("data.csv")
# 150. Use Data Encryption for Sensitive Data
# Data encryption protects sensitive data during storage and transmission.
from cryptography.fernet import Fernet
key = Fernet.generate_key()
cipher_suite = Fernet(key)
def encrypt_data(data):
    return cipher_suite.encrypt(data.encode())
def decrypt_data(encrypted_data):
    return cipher_suite.decrypt(encrypted_data).decode()
encrypted_data = encrypt_data("sensitive information")
decrypted_data = decrypt_data(encrypted_data)
# Don't store sensitive data without encryption
with open('sensitive_data.txt', 'w') as f:
    f.write("sensitive information")
# 151. Use a Data Lake for Centralized Storage
# Data lakes provide a centralized storage solution for raw and processed data.
import boto3
s3 = boto3.client('s3')
def upload_to_data_lake(file_path, bucket_name, object_name):
    s3.upload_file(file_path, bucket_name, object_name)
upload_to_data_lake('data.parquet', 'my-data-lake', 'raw/data.parquet')
# Don't keep all data in local storage
df.to_parquet('local_data.parquet')
# 152. Implement Data Governance
# Data governance ensures proper management and quality of data assets.
# Simplified sketch of an OpenMetadata-style client; the real client requires server connection config.
from openmetadata.client import OpenMetadata
metadata = OpenMetadata()
metadata.create_database(name="my_database")
def apply_data_governance(data):
    # Apply governance policies here
    pass
df = pd.read_csv("data.csv")
apply_data_governance(df)
# Don't manage data without governance policies
df = pd.read_csv("data.csv")
# 153. Use Data Replication for Redundancy
# Data replication ensures data availability and fault tolerance.
import boto3
s3 = boto3.client('s3')
def replicate_data(source_bucket, target_bucket, object_name):
    copy_source = {'Bucket': source_bucket, 'Key': object_name}
    s3.copy_object(CopySource=copy_source, Bucket=target_bucket, Key=object_name)
replicate_data('source-bucket', 'target-bucket', 'data.parquet')
# Don't rely on a single copy of your data
df.to_parquet('single_copy.parquet')
# 154. Use ETL for Data Transformation
# ETL processes ensure data is transformed and loaded efficiently.
import pandas as pd
def extract_data(file_path):
    return pd.read_csv(file_path)
def transform_data(df):
    df['new_column'] = df['existing_column'] * 2
    return df
def load_data(df, output_path):
    df.to_parquet(output_path)
data = extract_data('data.csv')
transformed_data = transform_data(data)
load_data(transformed_data, 'transformed_data.parquet')
# Don't transform data ad hoc, directly in the database
df['new_column'] = df['existing_column'] * 2
df.to_parquet('data.parquet')
# 155. Use Data Orchestration Tools
# Data orchestration tools manage data workflows and their dependencies.
from airflow import DAG
from airflow.operators.python_operator import PythonOperator
from datetime import datetime, timedelta
def extract():
    # Extract data
    pass
def transform():
    # Transform data
    pass
def load():
    # Load data
    pass
dag = DAG('etl_workflow', start_date=datetime.now(), schedule_interval=timedelta(days=1))
extract_task = PythonOperator(task_id='extract', python_callable=extract, dag=dag)
transform_task = PythonOperator(task_id='transform', python_callable=transform, dag=dag)
load_task = PythonOperator(task_id='load', python_callable=load, dag=dag)
extract_task >> transform_task >> load_task
# Don't orchestrate workflows by hand
extract()
transform()
load()
# 156. Use Data Quality Tools
# Data quality tools help maintain data integrity and cleanliness.
import great_expectations as ge
context = ge.data_context.DataContext()
batch = context.get_batch({"path": "data.csv"}, "my_datasource")
results = context.run_validation_operator("action_list_operator", assets_to_validate=[batch])
# Don't ignore data quality checks
df = pd.read_csv("data.csv")
# 157. Use Change Data Capture (CDC) for Real-time Updates
# CDC captures changes in the source database and applies them in near real time.
import psycopg2
def capture_changes():
    conn = psycopg2.connect(dbname="mydb", user="user", password="pass", host="localhost")
    cur = conn.cursor()
    # last_checkpoint is a placeholder; in practice it would be bound as a query parameter
    cur.execute("SELECT * FROM changes WHERE timestamp > last_checkpoint")
    changes = cur.fetchall()
    apply_changes(changes)
    conn.commit()
    cur.close()
    conn.close()
def apply_changes(changes):
    # Apply changes to the target database
    pass
# Don't rely on batch updates for real-time data
df = pd.read_csv("data.csv")
apply_changes(df)
# 158. Use Data Lakes for Large-scale Storage
# Data lakes enable efficient storage and retrieval of large-scale datasets.
import boto3
s3 = boto3.client('s3')
def upload_to_data_lake(file_path, bucket_name, object_name):
    s3.upload_file(file_path, bucket_name, object_name)
upload_to_data_lake('data.parquet', 'my-data-lake', 'raw/data.parquet')
# Don't store large datasets in local storage
df.to_parquet('local_data.parquet')
# 159. Use Data Encryption for Security
# Data encryption ensures data security during storage and transmission.
from cryptography.fernet import Fernet
key = Fernet.generate_key()
cipher_suite = Fernet(key)
def encrypt_data(data):
    return cipher_suite.encrypt(data.encode())
def decrypt_data(encrypted_data):
    return cipher_suite.decrypt(encrypted_data).decode()
encrypted_data = encrypt_data("sensitive information")
decrypted_data = decrypt_data(encrypted_data)
# Don't store sensitive data without encryption
with open('sensitive_data.txt', 'w') as f:
    f.write("sensitive information")
# 160. Use Schema Evolution for Changing Data Structures
# Schema evolution allows data structures to change over time.
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq
schema_v1 = pa.schema([('name', pa.string()), ('age', pa.int32())])
schema_v2 = pa.schema([('name', pa.string()), ('age', pa.int32()), ('email', pa.string())])
def save_data_with_schema(data, schema, filename):
    table = pa.Table.from_pandas(data, schema=schema)
    pq.write_table(table, filename)
def read_data_with_schema(filename, schema):
    return pq.read_table(filename, schema=schema).to_pandas()
data_v1 = pd.DataFrame({'name': ['Alice', 'Bob'], 'age': [25, 30]})
save_data_with_schema(data_v1, schema_v1, 'data_v1.parquet')
data_v2 = read_data_with_schema('data_v1.parquet', schema_v2)
# Don't ignore schema changes
df.to_parquet('data.parquet')
The field of machine learning, together with MLOps and data engineering, is evolving rapidly and growing more complex. As these examples show, adhering to best practices is essential for building robust, efficient, and scalable ML systems, and following them helps data scientists and engineers keep those systems reliable and maintainable over time.