ABB Review | 04/2024 | 2024-12-02
Understanding the factors that affect the carbon footprint of computation could help decision makers in the process industries to reduce their CO₂ emissions. Taking a theoretical and experimental approach, ABB explores this topic and provides advice on how to make AI models greener.
Ralf Gitzel, ralf.gitzel@de.abb.com
Marie Platenius-Mohr, marie.platenius-mohr@de.abb.com
ABB Corporate Research
Mannheim, Germany
Andreas Burger
Former ABB Employee
Artificial intelligence (AI), specifically machine learning (ML), has infiltrated everyday life. Neural networks (NNs), ie, multi-layered deep learning models, provide facial recognition capabilities for added mobile phone security or convert human speech into commands for smart home applications. This rapid progress is becoming increasingly relevant for the process industries, with applications ranging from the interpretation of infrared images of machinery [1] to the analysis of production-related data and more [2]. Clearly, this potential ignites competition to improve performance, leading to ever larger AI models that are trained for longer, thereby generating worrisome secondary effects: more energy is consumed and more CO₂ is emitted [3,4,5] – less than laudable ramifications considering the current climate crisis.
It would appear that carbon footprint goals must be sacrificed for the performance enhancement generated by AI models. But is this necessarily so? So far, studies that have evaluated this tenet have focused on high-performance language or image processing models such as GPT-3, a deep neural network model with 175B parameters that provides human-like texts. This large language model (LLM) required 1'287 MWh for training, which corresponds to 552 t of CO₂ – the annual emission of 276 medium-sized cars [4]. Though not yet disclosed, the footprint for GPT-4 will probably be much larger. Still, other high-performance AI models are associated with a smaller carbon footprint; some investigators have even suggested that the scope of the problem has been exaggerated [6]. Such discrepancies leave AI providers unsure about the carbon footprint of their specific model or how to reduce it. Decision makers in the process industries face an additional challenge: their models are typically much smaller in scale than the high-performance models discussed in the literature. Are emissions in these cases even relevant? In this paper, ABB aims to give engineers, managers, and others guidance so they understand the impact that individual AI models have on the environment. Specifically, ABB examines the literature to create a comprehensive framework that explains the various carbon drivers over the entire AI model life cycle and offers advice about reducing those drivers [7]. Based on experimental data, ABB also tests the validity of literature recommendations, eg, the use of transfer learning models, to provide guidance for reducing the carbon footprint and energy consumption of AI models. Moreover, the carbon footprint of process industry-relevant AI models is computed and discussed.
While theoretical models from the literature estimate the carbon impact of a new AI model based on architecture (layer types and size), training, and usage [4,5,8,9], it is difficult to determine which metric drives the footprint [5,10]. In contrast, the carbon footprint of ML models can easily be measured with software tools that record the impact of development or use via carbon accounting. Some tools use metrics, eg, training time, energy mix, and hardware information [8,11], while others, eg, energyusage or CodeCarbon, integrate directly with the ML code [6,12,13,14]. Other tools compute central processing unit (CPU) power usage, estimate graphics processing unit (GPU) runs, compare hardware types [12,15,16], and determine the carbon impact of image recognition models [17].
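As an illustration, the listing below sketches how such a tool can be wired into training code, using CodeCarbon's EmissionsTracker; the tiny model and random data are placeholders for a real workload, not the models examined in this article.

# Minimal sketch: wrapping a Keras training run with CodeCarbon
# (pip install codecarbon tensorflow). The model and data stand in
# for a real workload.
import numpy as np
import tensorflow as tf
from codecarbon import EmissionsTracker

x = np.random.rand(1000, 20).astype("float32")
y = np.random.randint(0, 2, size=(1000,))

model = tf.keras.Sequential([
    tf.keras.Input(shape=(20,)),
    tf.keras.layers.Dense(32, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy")

tracker = EmissionsTracker()   # estimates energy use and local carbon intensity
tracker.start()
model.fit(x, y, epochs=5, verbose=0)
emissions_kg = tracker.stop()  # returns the estimated emissions in kg CO2eq
print(f"Estimated training emissions: {emissions_kg:.6f} kg CO2eq")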
Despite the significance of this research, two gaps stand out: First, studies are either too generic or focus on specific unrelated domains, eg, images [16,17] or LLMs [3,4,5], with unknown relevance to process industry data, since industry models are specific and use small data sets. Second, carbon calculation models are not standardized, and specific life cycle steps are often omitted [18]. To close these gaps, ABB empirically evaluated the carbon footprint and created a model covering all AI life cycle phases.
The carbon footprint (CO₂eq) of NNs depends on how much energy is used (in kWh) and the carbon intensity (in lbs/kWh) of the energy source. The carbon footprints of GPT-3 (1'214'400 lbs CO₂eq) [4], Gopher (851'200 lbs CO₂eq) [18], and NAS (626'155 lbs CO₂eq) [9] are unsurprisingly high. In contrast, the carbon footprint of other high-performing models is much lower, such as BERTbase (1'438 lbs CO₂eq) [9]. Such disparities suggest that the factors that impact a high-performance model's carbon footprint require more scrutiny.
To explain the impact of AI models on carbon footprint, ABB holistically modeled all life cycle phases [4,5]: inference, training, model architecture search (MAS) with hyperparameter tuning, and long-term use.
Because inference operations are executed during all phases, they are described first. Essentially, inference (which is estimated to cause between 80 and 90 percent of a model's total energy use [4]) can be defined as the computation of a mathematical formula, expressed through a series of learned parameters, that transforms an input vector into the correct output, eg, an image, time series, predicted value, etc. The mathematical operations for a standard NN dense layer consist of a matrix multiplication and the application of a simple activation function to the result →01. Layer output acts as input for the next layer, leading to a series of matrix multiplications, thereby consuming energy. The amount of inference energy used depends on the model architecture (M), ie, layer types, order, and size, and on the type and quantity of processing units (PT), eg, CPUs, GPUs, and tensor processing units (TPUs). The overhead imposed by the power usage effectiveness (PUE) of the data center also has an impact [19]. Thus, the energy cost I of an inference can be described as:
I = f(M, PT) ⋅ PUE   (1)
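A minimal NumPy sketch of the building block behind equation (1) shows how one dense layer is a matrix multiplication plus an element-wise activation, and how layer outputs chain into a series of such multiplications; shapes and weights here are arbitrary illustrations, not learned values.

# One dense-layer inference: y = activation(W @ x + b).
import numpy as np

def dense_layer(x, W, b):
    # matrix multiplication followed by a simple (ReLU) activation
    return np.maximum(0.0, W @ x + b)

x = np.random.rand(64)                            # input vector
W1, b1 = np.random.rand(128, 64), np.random.rand(128)
W2, b2 = np.random.rand(10, 128), np.random.rand(10)

h = dense_layer(x, W1, b1)   # each layer's output feeds the next,
y = dense_layer(h, W2, b2)   # chaining matrix multiplications
print(y.shape)               # (10,) - the output vector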
Approximating f is challenging, mainly due to different hardware implementations, memory access [9,13,21], and the use of specialized layers. Thus, simple substitutes for M, such as the number of trainable parameters [20], are problematic [10]. Nonetheless, measurement-based estimates of I can be used to calculate the total life cycle carbon footprint of a model. Both PT and PUE can be optimized by choosing efficient data centers and/or hardware. For example, a GPU is 10 times more efficient than a CPU; a TPU is 4 to 8 times more efficient than a GPU [8]. Although the PUE of a data center might be unavailable, centers located in colder regions generally consume less energy than those situated in warmer regions [22]. Selecting a low-carbon M can also reduce energy use without sacrificing performance [12,14,20]. Suggested techniques to reduce model size are pruning, adding sparsity, quantization, and knowledge distillation [4,23,24]. For deep neural networks (DNNs) [6], the computation effort can be reduced by a factor of 5 to 10 [4]; for convolutional neural networks (CNNs) – NNs that perform feature engineering – by a factor of 40 [20].
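As an illustration of one of these techniques, the sketch below applies post-training (dynamic-range) quantization with the TensorFlow Lite converter; the stand-in model and the choice of TFLite are illustrative assumptions, not the setup used in this study.

# Post-training quantization sketch: weights are stored in 8 bits,
# shrinking the model and typically cutting inference cost.
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.Input(shape=(64,)),
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dense(10),
])  # stand-in for a trained model

converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]  # enable quantization
tflite_model = converter.convert()

with open("model_quantized.tflite", "wb") as f:
    f.write(tflite_model)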
Energy use during a model's training phase depends on the training duration and the number of processors used [4]. Three factors act as drivers: the energy cost of a single inference (I), the size of the training data set (D), and the number of epochs (E) used to optimize the model weights. Overhead1 is expressed as a constant θ.
T ∝ E ⋅ D ⋅ I ⋅ θ   (2)
Here, PUE and the number and type of processors are already accounted for within I. It follows that training energy can, in theory, be reduced through transfer learning – the reuse of a pre-trained model on a new problem – as it reduces both E and D [4,5,8].
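A back-of-envelope application of equation (2), with purely illustrative values for I and θ, shows how strongly E and D leverage the total and why transfer learning pays off.

# Illustrative numbers, not measurements from this article.
I_kwh = 2e-7   # assumed energy per inference (kWh)
D = 60_000     # training samples
E = 30         # epochs
theta = 3.0    # assumed overhead factor for loss and backpropagation

T_kwh = E * D * I_kwh * theta
print(f"Training from scratch: {T_kwh:.2f} kWh")   # 1.08 kWh

# Transfer learning reduces E and D, eg, fine-tuning for 5 epochs
# on 10,000 samples cuts the estimate proportionally:
T_finetune = 5 * 10_000 * I_kwh * theta
print(f"Fine-tuning estimate:  {T_finetune:.3f} kWh")  # 0.030 kWh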
Notably, different model architectures used for the same task can vary in accuracy. For this reason, many candidate architectures are trained during MAS and the best one is chosen for the final training phase. While performance is the optimization criterion of choice, energy consumption could be used as an additional criterion.
The cost at this stage (CT) is proportional to two factors [5]: the cost of training, T, and the number of hyperparameter tuning runs (H). Some of T's components, ie, I, E, and D, might vary between tuning runs, resulting in a different value Th for each run h:
CT ∝ ∑h=1…H Th   (3)
The choice of MAS strategy is critical because the more hyperparameter configurations are evaluated, the more energy is used. Interestingly, in terms of energy used, a random search is better than a systematic grid search, which compares many similar architectures [8]. It also follows from equation (3) that transfer learning could reduce MAS or even eliminate it [4].
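The sketch below illustrates equation (3) under assumed costs: an exhaustive grid over three hyperparameters triggers 36 candidate trainings, whereas a random search with a fixed budget caps the sum at 8 terms; the grid and the per-training cost are hypothetical.

import itertools
import random

def training_cost_kwh(epochs, samples, i_kwh=2e-7, theta=3.0):
    # eq. (2): T ∝ E ⋅ D ⋅ I ⋅ θ, with assumed I and θ
    return epochs * samples * i_kwh * theta

grid = {"width": [32, 64, 128, 256], "depth": [2, 4, 8], "lr": [1e-2, 1e-3, 1e-4]}

# Exhaustive grid search: 4 * 3 * 3 = 36 candidate trainings
grid_cost = sum(training_cost_kwh(30, 60_000)
                for _ in itertools.product(*grid.values()))

# Random search with a fixed budget of 8 candidates
random.seed(0)
random_cost = 0.0
for _ in range(8):
    candidate = {k: random.choice(v) for k, v in grid.items()}
    # (the candidate would be trained and evaluated here)
    random_cost += training_cost_kwh(30, 60_000)

print(f"grid search:   {grid_cost:.1f} kWh")    # 36 summands in eq. (3)
print(f"random search: {random_cost:.1f} kWh")  # 8 summands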
The total life cycle energy use depends on the energy costs of all life cycle phases (CT, T, I) and the expected number of inference calls (e):
Elife = CT + T + I ⋅ e   (4)
The CO₂eq is determined by multiplying (4) by the carbon emission factor (EF):
CO₂eq = Elife ⋅ EF   (5)
EF varies greatly depending on the energy source used. For example, EF ranged from 20 g CO₂eq/kWh in Quebec to 736.6 g CO₂eq/kWh in Iowa in 2019 [8]. Evidently, the easiest way to reduce CO₂eq is to choose the right location [4].
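Combining equations (4) and (5) in a short worked example makes the leverage of location explicit; the energy figures below are assumptions, while the emission factors are the cited 2019 values [8].

# Illustrative energy figures (assumptions, not measurements):
CT_kwh = 38.9        # architecture search and tuning
T_kwh = 1.1          # final training
I_kwh = 2e-7         # single inference
e = 100_000_000      # expected inference calls over the model's life

E_life = CT_kwh + T_kwh + I_kwh * e     # eq. (4): Elife = CT + T + I ⋅ e

for location, ef in [("Quebec", 20.0), ("Iowa", 736.6)]:  # g CO2eq/kWh [8]
    co2eq_kg = E_life * ef / 1000.0     # eq. (5): CO2eq = Elife ⋅ EF
    print(f"{location}: {co2eq_kg:.1f} kg CO2eq")  # 1.2 kg vs 44.2 kg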
By taking into account the carbon footprint at each ML life cycle stage as determined in the previous sections, the resulting consolidated framework provides a reasonable estimate of a DNN model’s carbon footprint2.
To empirically test the framework assumptions about the carbon footprint of models with different properties, ABB conducted a series of experiments3. The code (Keras/Python) was tested on a PC with a GeForce RTX 2080 Ti GPU and 32 GB RAM. ABB assumed the energy mix of Germany and used the CodeCarbon tool, which uses a carbon intensity of 365.5 g/kWh for its calculations.
Testing set size, epochs, and pretrained model use
To test whether training set size and epoch number increase energy use, ABB conducted two experiments: one varying the number of training samples and one varying the number of epochs.
The results confirm that energy use increases linearly with the number of training samples. Similarly, increasing the number of epochs results in linear growth of emissions, demonstrating the vital importance of both factors. While these results seem to confirm the value of pretrained models for reducing energy consumption [4,5,8], ABB's experiments indicate that pretrained models carry the risk of relying on oversized and therefore inefficient models. A pretrained model can be fine-tuned for a fraction of the cost of training the same architecture from scratch [26]. However, using a dedicated (smaller) architecture for a problem can be even more energy-efficient. For example, in an experiment with MNIST classification, a dedicated model needed only a tiny fraction of the energy used by a fine-tuned Xception model of comparable performance →02.
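A hedged Keras sketch of the two options, a frozen pretrained Xception base with a small new head versus a dedicated small CNN, is shown below; the head design and input sizes are illustrative, not the exact experimental setup.

import tensorflow as tf

# Option 1: transfer learning - reuse a pretrained base (reduces E and D)
base = tf.keras.applications.Xception(
    weights="imagenet", include_top=False, input_shape=(71, 71, 3))
base.trainable = False  # only the new head is trained

finetuned = tf.keras.Sequential([
    base,
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(10, activation="softmax"),  # new task head
])
finetuned.compile(optimizer="adam", loss="sparse_categorical_crossentropy")

# Option 2: a dedicated small model for the same task, often cheaper still
small = tf.keras.Sequential([
    tf.keras.Input(shape=(28, 28, 1)),
    tf.keras.layers.Conv2D(16, 3, activation="relu"),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(10, activation="softmax"),
])
small.compile(optimizer="adam", loss="sparse_categorical_crossentropy")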
Generally, larger models require more energy than smaller ones, especially if the model properties are similar. ABB’s experimental results confirm this statement →03. They also support the literature in rejecting the number of trainable parameters as a carbon driver.
In one test, ABB compared wide and deep (narrow) models with the same number of trainable parameters and observed great divergence in the energy used →04. Notably, the energy consumption of nets with many small layers, "deep" nets, is much higher than for "wide" nets of the same size (fewer layers yet more nodes per layer). Nevertheless, depth is far better than width at increasing the expressive power of an NN [27].
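The following sketch shows how two such models can be constructed, a single wide layer versus a stack of narrow ones, with roughly the same number of trainable parameters; the layer sizes are illustrative, not those used in the tests.

import tensorflow as tf

wide = tf.keras.Sequential([
    tf.keras.Input(shape=(100,)),
    tf.keras.layers.Dense(2000, activation="relu"),  # one wide layer
    tf.keras.layers.Dense(1),
])

deep = tf.keras.Sequential(
    [tf.keras.Input(shape=(100,))]
    + [tf.keras.layers.Dense(160, activation="relu") for _ in range(8)]
    + [tf.keras.layers.Dense(1)]
)

print("wide:", wide.count_params())  # ~204k parameters
print("deep:", deep.count_params())  # ~197k - roughly comparable count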
To assess the impact of layer type [20], ABB compared two groups of models: a series of wide models with dense layers, and a series of similarly shaped convolutional models (where the "width" is represented by the number of filters). The results show that purely convolutional models consume significantly more energy than dense models do →05. Thus, trainable parameters are a basic indicator of energy use, but only if the compared models share many properties, ie, shape and type of layers.
While the consolidated framework presented here and the experimental results can help users and decision makers to reduce model carbon footprints, the question arises: Are these findings even applicable to process industry-relevant models? Certainly, vast quantities of industrial data are produced, thanks to distributed control systems (DCS), indicating that NN models could be useful. Unfortunately, significantly less data is available for training because most of this data is unlabeled. Less available data means less training time and lower energy costs. But is this positive or not? Crucially, such a scenario implies lower performance; yet smaller AI models with a specific use case and good feature engineering do perform well, indicating that performance might not need to be sacrificed.
To evaluate the carbon footprint of small models, ABB chose to evaluate two literature examples →06: a non-deep anomaly detection algorithm (ECOD) [28] and a deep anomaly detection model, Deep Support Vector Data Description (DeepSVDD) [29]. Both models were trained on data from an angular sensor used for condition monitoring. Not only do both models perform well in the test, but their carbon footprints are also negligible, even when compared to an efficient LLM such as BLOOM →06.
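For readers who wish to reproduce a comparable baseline, the sketch below runs ECOD from the PyOD library on a stand-in signal; the random data merely substitutes for the angular-sensor data used in the study.

# Non-deep anomaly detection with ECOD (pip install pyod).
import numpy as np
from pyod.models.ecod import ECOD

# Stand-in feature matrix: windows of a 1-D sensor signal.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 16))

clf = ECOD()
clf.fit(X)
scores = clf.decision_scores_   # outlier scores for the training data
labels = clf.labels_            # 0 = inlier, 1 = outlier
print(f"flagged {labels.sum()} of {len(labels)} windows as anomalous")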
These results suggest that further action to reduce the carbon footprint of such process automation-relevant models is currently unnecessary. Nonetheless, the explosive growth of LLMs, eg, GPT-4, strongly indicates that large deep models will enter the industrial domain soon. When this happens, the consolidated framework and experimental findings discussed in this paper will help engineers and managers make better decisions about the design, deployment, and use of their models in terms of carbon footprint.
Footnotes
1 There is significant overhead for the loss function and backpropagation step, which is included in the calculation as its use has been validated in experiments.
2 The framework ignores static energy consumption and original hardware production [13] and in contrast to some studies sacrifices accuracy to focus on ease-of-use.
3 Model performance was not considered in the experiments. Carbon optimization and performance optimization interfere with each other but are not a direct trade-off.
References
[1] R. Gitzel, et al., "Maps of Infrared Images to Detect Equipment Faults", IEEE Eighth International Conference on Big Data Computing Service and Applications (BigDataService), 2022, pp. 167 – 172.
[2] M. Gaertler, et al., "The machine learning life cycle in chemical operations – status and open challenges", Chemie Ingenieur Technik, vol. 93, no. 12, 2021, pp. 2,063 – 2,080.
[3] O.Y. Al-Jarrah, et al., "Efficient machine learning for big data: A review", Big Data Research, vol. 2, no. 3, 2015, pp. 87 – 93.
[4] D. Patterson, et al., "Carbon emissions and large neural network training", arXiv preprint arXiv:2104.10350, 2021.
[5] R. Schwartz, et al., "Green AI", Communications of the ACM, vol. 63, no. 12, 2020, pp. 54 – 63.
[6] D. Patterson, et al., "The carbon footprint of machine learning training will plateau, then shrink", Computer, vol. 55, no. 7, 2022, pp. 18 – 28.
[7] D.H. Fisher, "Recent advances in AI for computational sustainability", IEEE Intelligent Systems, vol. 31, no. 4, 2016, pp. 56 – 61.
[8] A. Lacoste, et al., "Quantifying the carbon emissions of machine learning", arXiv preprint arXiv:1910.09700, doi:10.48550/ARXIV.1910.09700, 2019.
[9] E. Strubell, et al., "Energy and policy considerations for deep learning in NLP", arXiv preprint arXiv:1906.02243, doi:10.48550/ARXIV.1906.02243, 2019.
[10] L. Lai, et al., "Not all ops are created equal!", arXiv preprint arXiv:1801.04326, 2018.
[11] L. Lannelongue, et al., "Green Algorithms: Quantifying the Carbon Footprint of Computation", Advanced Science, vol. 8, no. 12, 2021, pp. 1 – 10.
[12] K. Lottick, et al., "Energy Usage Reports: Environmental awareness as part of algorithmic accountability", arXiv preprint arXiv:1911.08354, 2019.
[13] P. Henderson, et al., "Towards the systematic reporting of the energy and carbon footprints of machine learning", The Journal of Machine Learning Research, vol. 21, no. 1, 2020, pp. 10,039 – 10,081.
[14] M. Kumar, et al., "Energy-efficient machine learning on the edges", in IEEE International Parallel and Distributed Processing Symposium Workshops, 2020, pp. 912 – 921.
[15] Y. Wang, et al., "Benchmarking the performance and energy efficiency of AI accelerators for AI training", in 20th IEEE/ACM International Symposium on Cluster, Cloud and Internet Computing, 2020, pp. 744 – 751.
[16] R. Selvan, et al., "Carbon footprint of selecting and training deep learning models for medical image analysis", in International Conference on Medical Image Computing and Computer-Assisted Intervention, Cham, CH, Springer Nature, 2022, pp. 506 – 516.
[17] L. Heguerte, et al., "How to estimate carbon footprint when training deep learning models? A guide and review", arXiv preprint arXiv:2306.08323, 2023.
[18] A.S. Luccioni, et al., "Estimating the carbon footprint of BLOOM, a 176B parameter language model", Journal of Machine Learning Research, vol. 24, no. 253, 2023, pp. 1 – 15.
[19] E. Jaureguialzo, "PUE: The Green Grid metric for evaluating the energy efficiency in DC (Data Center). Measurement method using the power demand", IEEE 33rd International Telecommunications Energy Conference, 2011, pp. 1 – 8.
[20] E. Cai, et al., "Neuralpower: Predict and deploy energy-efficient convolutional neural networks", in Asian Conference on Machine Learning, 2017, pp. 622 – 637.
[21] M. Horowitz, "1.1 Computing's energy problem (and what we can do about it)", in IEEE International Solid-State Circuits Conference Digest of Technical Papers, 2014, pp. 10 – 14.
[22] M. Sharma, et al., "Analyzing the data center efficiency by using PUE to make data centers more energy efficient by reducing the electrical consumption and exploring new strategies", Procedia Computer Science, vol. 48, 2015, pp. 142 – 148.
[23] D. Blalock, et al., "What is the state of neural network pruning?", Proceedings of Machine Learning and Systems, vol. 2, 2020, pp. 129 – 146.
[24] G. Hinton, et al., "Distilling the knowledge in a neural network", arXiv preprint arXiv:1503.02531, 2015.
[25] L. Heim, et al., "Measuring what really matters: Optimizing neural networks for TinyML", arXiv preprint arXiv:2104.10645, 2021.
[26] P. Walsh, et al., "Sustainable AI in the Cloud: Exploring machine learning energy use in the cloud", in 36th IEEE/ACM International Conference on Automated Software Engineering Workshops, 2021, pp. 265 – 266, doi:10.1109/ASEW52652.2021.00058.
[27] Z. Lu, et al., "The expressive power of neural networks: A view from the width", Advances in Neural Information Processing Systems, vol. 30, 2017.
[28] Z. Li, et al., "ECOD: Unsupervised outlier detection using empirical cumulative distribution functions", IEEE Transactions on Knowledge and Data Engineering, 2022.
[29] L. Ruff, et al., "Deep One-Class Classification", Proceedings of the 35th International Conference on Machine Learning, vol. 80, 2018, pp. 4,393 – 4,402.