Toward Reducing Crop Spoilage and Increasing Small Farmer Profits in India: a Simultaneous Hardware and Software Solution
India’s agricultural system faces a severe crop wastage problem. A key contributing factor is that many small farmers lack access to reliable cold storage that extends crop shelf life. To avoid being left with crops that spoil, these farmers often sell at unfavorably low prices, and inevitably not all crops are sold before they spoil. Even farmers who do have access to cold storage may not know how long to hold different crops there, a decision that hinges on strategizing over when and where to sell their harvest. In this note, we present progress toward a simultaneous hardware and software solution that aims to help farmers reduce crop spoilage and increase their profits. The hardware is a cost-effective solar-powered refrigerator and control unit. The software is a produce price forecasting system, for which we have tested a number of machine learning methods. Note that unlike standard price forecasting tasks such as those for stock market data, the produce price data from predominantly rural Indian markets have a large number of missing values. In developing our two-pronged solution, we are actively working with farmers at two pilot sites in Karnataka and Odisha.
Crop wastage in India results in an annual loss valued at 92,651 crore INR (15B USD) as of 2014. Among these crops, fruits and vegetables have the highest wastage percentage, upwards of 15.88% depending on the crop. A major part of the problem is that small and marginal farmers, who as of 2002-2003 account for roughly 81% of agriculture holdings in India and who typically have field sizes under 1 hectare (roughly 2.47 acres), lack access to the required cold storage and marketing infrastructure. These farmers often rely on the cultivation of perishables. Due to lack of access to cold storage, they are forced to “crash sell” their harvest at market prices dictated by the middlemen or wholesalers to avoid wastage and financial loss. As an example, a farmer who harvests 100 kg of tomatoes will try to sell them at the nearest market, as soon as possible, and has to take whatever price is available; otherwise, the produce spoils and is worth nothing. With access to reliable cold storage, the farmer could keep the tomatoes fresh for longer before selling them at a more favorable price. Reducing food wastage could thus also increase farmers’ profits. With only 10-11% of fruits and vegetables produced with access to cold storage, the director of India’s National Horticulture Board stated that a 40% increase in cold storage capacity would be needed to avoid wastage.
We are working with small farmers to help them store their produce and plan when and where to sell it. To do so, we provide a solution that has both hardware and software components. On the hardware side, we are developing cost-effective, efficient solar-powered cold storage units, each of which is essentially a walk-in closet-like refrigerator that can service 40-50 small farmers. On the software side, we are developing produce price forecasting models with the goal of helping farmers better plan when and where to sell their produce, and eventually what and when to grow. The hardware and software solutions complement each other: without cold storage, the software solution would not be of much use since a farmer cannot delay selling produce without risking spoilage. With only the cold storage hardware but not the software solution, it is not straightforward to decide when and where to sell, and at what price.
We remark that on the software side, the problem we are addressing differs substantially from, say, forecasting stock market prices or developing a high-frequency trading strategy. As we discuss in more detail in Section 3, produce pricing data available for the Indian markets have a large number of missing values. To handle these missing values, we use ideas from a forecasting method for clinical time series data that also exhibit a large amount of missingness. Separately, unlike in normal stock trading or high-frequency trading, executing a “trade” (i.e., selling some amount of produce at a specific market) is not remotely instantaneous. A farmer would have to task someone to drive to a specific market and stay there for some time to sell produce, easily taking on the order of hours. Commonly, farmers choose to sell at multiple markets, which could take more than a whole day. Because selling is so labor- and time-intensive, the problem we are tackling could perhaps be more aptly described as “very low-frequency trading”. While we focus the discussion of the software component in this note only on forecasting and not on the actual execution of “trades” (e.g., planning the driving route to different markets, how long to stay at each, etc.), the latter clearly suggests that price forecasts should be made as far in advance as possible.
Importantly, we are developing our hardware and software components with input from local farmers at two pilot sites, one in Dandeli, Karnataka and another in Cuttack, Odisha. Involving local farmers in agricultural development rather than only providing them with a technological solution is important to creating a solution that lasts. We want to ensure that the farmers find our solution to be useful, and we want them to let us know what we could do better.
In this note, we report our progress in developing our simultaneous hardware and software solution. In Section 2, we describe our solar-powered cold storage unit. In Section 3, we describe how we cast produce price forecasting as a classification problem, and benchmark a number of standard classifiers. We conclude in Section 4 with a discussion of end-user financing and future work.
2 Cold Storage Hardware
Currently, commercial off-the-shelf (COTS) cold storage units generally consist of an insulated cold room, a cooling unit, and a basic controller. However, for such units, the cooling system generally is incompatible with solar panel systems and consumes a large amount of energy (and is thus not cost-effective). To address these two shortcomings and specifically to tailor the cooling system to the farmers’ cultivation behavior and local climate, we developed a new controller, called the CoolCrop controller, that readily handles different sensors, cooling units, and power sources including solar. This controller replaces the existing controller of a COTS cold storage unit. Figure 1 shows the CAD and the system deployed in the pilot site of Dandeli, for which we installed our controller in a 5-metric-ton COTS cold storage unit that services about 40-50 small farmers.
The CoolCrop controller records and regulates temperature and humidity within the cold storage unit. We intentionally designed the controller to be flexible in what hardware it can control, what sensors it can pull data from, and what power source it uses. Moreover, we wanted the control logic itself to be easily programmable. With these design goals in mind, our controller consists of a single-board Raspberry Pi 3 computer connected to a control and data acquisition board. A diagram and photo of this board are shown in Figure 2. Low-cost sensors for monitoring (i.e., temperature and humidity sensors) can be easily connected using a standard Ethernet cable. The CoolCrop controller has a small form factor of approximately 3 inches by 5 inches. Descriptions of the controller functional blocks in Figure 2 are given below:
Control Relays: These relays manipulate (energize or de-energize) external control circuits and send alarm flags.
User Defined Connection Point: This interface point on the control board can be configured to multiple instrumentation suites.
Connection Point for Prototype Sensors: An I2C and SPI interface to connect to and communicate with COTS and integrated sensors.
Analog-Digital Conversion (ADC) Channels: These channels can be used to monitor additional aspects of interest and interface with other sensor(s).
Digital-Analog Conversion (DAC) Channels: The DACs are used to generate control signals that can be used to achieve more complete control or generate reference or other signals.
Real-Time Clock (RTC): The real-time clock is used to generate accurate timestamps for collected data from local, connected, and remote sensors.
COTS and custom sensors can easily be integrated with the controller. We specifically use a new sensor that we have developed that measures temperature within C and relative humidity within 2%, has user-selectable 12- or 16-bit resolution, and connects by a standard Ethernet cable. This sensor (including its circuitry and board) is approximately the size of two Ethernet jacks; a photo of it next to an Ethernet cable is in the bottom right of Figure 2.
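To illustrate what "easily programmable" control logic on such a controller could look like, the following is a minimal sketch of an on/off (hysteresis) temperature rule. It is not the CoolCrop control logic, which this note does not specify. The setpoint, dead band, and minimum compressor rest time are hypothetical values chosen only for the example, and the class and method names are illustrative.

```python
from dataclasses import dataclass


@dataclass
class HysteresisController:
    # All three parameters below are hypothetical example values.
    setpoint_c: float = 4.0    # target cold-room temperature
    band_c: float = 1.0        # half-width of the dead band around the setpoint
    min_off_s: float = 180.0   # minimum rest time before restarting cooling
    cooling_on: bool = False
    last_off_time_s: float = float("-inf")

    def step(self, temp_c: float, now_s: float) -> bool:
        """Return True if the cooling relay should be energized."""
        if self.cooling_on:
            # Switch off once the room is cold enough.
            if temp_c <= self.setpoint_c - self.band_c:
                self.cooling_on = False
                self.last_off_time_s = now_s
        else:
            # Switch on only if the room is too warm and the unit has rested.
            rested = now_s - self.last_off_time_s >= self.min_off_s
            if temp_c >= self.setpoint_c + self.band_c and rested:
                self.cooling_on = True
        return self.cooling_on
```

A rule of this kind limits how often the cooling unit starts and stops. In a real deployment the returned value would drive one of the control relays, and the temperature reading would come from the sensor described above.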
3 Forecasting Produce Prices
We now discuss produce prices at different markets, and how we forecast these prices. We collect pricing data from a website called Agmarknet that is run by the Indian government’s Ministry of Agriculture and Farmers Welfare.
An example of onion prices over time is shown in Figure 3 for three markets near our second pilot site in Cuttack, Odisha. For these markets, we collected all available data from 2012 to 2016 from Agmarknet. As shown, onion pricing for the first market only becomes available in August 2014, whereas for the third market, the pricing information stops being available after July 2014. Within the range of dates for which pricing data are available, the three markets exhibit drastically different fractions of missing data. For example, between the years 2012 and 2016, the first market only has data from August 7, 2014 to April 22, 2016, and 13.6% of the prices between these dates are missing. For the second market, despite pricing data being available from April 1, 2012 up through July 2, 2016, 81.0% of the days in between do not have pricing data. Thus, even though a market may have pricing data for more years, the data could be less regularly collected. Lastly, note that the data exhibit seasonality: each year between August and March, the price reaches a local peak.
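The missing-data fractions quoted above can be computed per market along the following lines. This is a sketch under one assumption that the note does not state: every calendar day between a market's first and last reported price counts as a day on which a price could have been reported. The function name is illustrative.

```python
from datetime import date


def missing_fraction(observed_days):
    """Fraction of calendar days without a price, between the first and
    last day on which the market reported one (both ends inclusive)."""
    days = set(observed_days)
    first, last = min(days), max(days)
    span = (last - first).days + 1
    return 1.0 - len(days) / span


# Toy example: prices reported on 4 of the 5 days from Aug 7 to Aug 11.
observed = [date(2014, 8, 7), date(2014, 8, 8),
            date(2014, 8, 10), date(2014, 8, 11)]
print(missing_fraction(observed))
```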
Forecasting exact prices for the next few days per market turns out to be challenging. Standard approaches like exponential smoothing methods and ARIMA models do not handle missing data well. Moreover, the price of a specific type of produce at a specific market typically does not change over short periods of time: predicting the next day’s price to be the current day’s price is correct over 60% of the time in the dataset we collected. Rather than forecasting exact prices, we instead forecast just the short-term direction of price changes, i.e., whether the price will go up, go down, or stay the same for each of the next few days. Importantly, always predicting that the price stays the same provides no actionable insight to farmers.
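As a concrete illustration of the price movement directions just described, the sketch below turns a daily price series with gaps into up/down/stay labels and measures how often the price stays the same from one day to the next. Representing a missing price as `None` and the function names are assumptions made for the example.

```python
def direction(prev_price, next_price):
    """Label a one-day price movement; None if either price is missing."""
    if prev_price is None or next_price is None:
        return None
    if next_price > prev_price:
        return "up"
    if next_price < prev_price:
        return "down"
    return "stay"


def directions(prices):
    """Day-over-day movement labels for a list of daily prices."""
    return [direction(a, b) for a, b in zip(prices, prices[1:])]


def stay_fraction(prices):
    """Share of observed movements that are 'stay', i.e. how often
    'tomorrow's price equals today's' would have been correct."""
    labels = [d for d in directions(prices) if d is not None]
    return labels.count("stay") / len(labels) if labels else float("nan")


prices = [10.0, 10.0, None, 12.0, 11.0, 11.0]  # toy series with one gap
print(directions(prices))
print(stay_fraction(prices))
```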
We now describe how to set up a classification problem to account for missing data and seasonality in forecasting price movement directions for different markets for a specific type of produce (onion). The way we do this is similar to how missing data are modeled in an existing recurrent neural network (RNN) forecasting approach that has been successfully applied to clinical time series that contain a large number of missing values. Note that the approach of Lipton et al. works for any classifier, not just RNNs. We similarly will not limit ourselves to only using RNNs. However, our work differs from that of Lipton et al. in two important ways. First, we are forecasting price movement directions of the next few days per time series (each time series is associated with a market). These price movement directions in general vary with time. In contrast, Lipton et al. forecast a single non-time-varying outcome per time series. A second difference is that we explicitly account for seasonality.
As shown in Figure 4, for a specific type of produce, we track percentage price changes of the produce at different markets over time. Suppose that we want to use pricing information from the previous days to forecast price movement directions of the next days. Then to assemble training data, we take a sliding window approach, each time looking at percentage price changes of the days leading up to the current day to treat as input, and the following days’ price movement directions to use as test data, with a notable exception: we reveal which entries in the test data are missing as part of training data. The reason we do this is that we actually only care about predicting entries that are not missing, which vary over time. Missing data for percentage price changes in the most recent days are filled in with 0, while missing data in price movement directions of the next days are filled in with “stay”. Finally, to account for seasonality, we also provide, as additional inputs to the classifier, the days in the year that a time window corresponds to (January 1 regardless of year would be encoded as day 1, January 2 as day 2, etc.). See Figure 4 for an example where and . Specifically when the classifier used is an RNN, we treat the “masks” that specify which entries are missing as features that we replicate for each of the time steps (in the example of Figure 4, this replication procedure would introduce features for each of the time steps). After training the classifier, to actually produce forecasts, we provide the classifier with two main inputs: the percentage price changes of the most recent days, and a mask of all 1’s specifying that none of the price movement directions that we want to forecast are missing.
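The sliding-window construction can be sketched as follows for a single market. This is an illustrative reading of the procedure, not the authors' code: the names `n_past` and `n_future` stand in for the window lengths, missing entries are represented as `None`, and the feature layout (past changes, then mask, then day-of-year values for every day in the window) is an assumption.

```python
from datetime import date, timedelta


def day_of_year(d):
    # January 1 is encoded as 1 regardless of the year.
    return d.timetuple().tm_yday


def build_examples(start_day, pct_changes, directions, n_past, n_future):
    """pct_changes[i] and directions[i] describe day start_day + i;
    None marks a missing entry."""
    examples = []
    for t in range(n_past, len(pct_changes) - n_future + 1):
        # Inputs: recent percentage changes, missing values filled with 0.
        past = [0.0 if c is None else c for c in pct_changes[t - n_past:t]]
        future = directions[t:t + n_future]
        # Mask reveals which future entries are observed (1) or missing (0).
        mask = [0 if d is None else 1 for d in future]
        # Targets: missing directions are filled in with "stay".
        labels = ["stay" if d is None else d for d in future]
        days = [day_of_year(start_day + timedelta(days=i))
                for i in range(t - n_past, t + n_future)]
        examples.append({"features": past + mask + days, "labels": labels})
    return examples


def forecast_input(first_day, recent_changes, n_future):
    """At forecast time the mask is all 1's: nothing to predict is missing."""
    past = [0.0 if c is None else c for c in recent_changes]
    days = [day_of_year(first_day + timedelta(days=i))
            for i in range(len(past) + n_future)]
    return past + [1] * n_future + days


pct = [1.0, None, -2.0, 0.0, None]
dirs = ["up", None, "down", "stay", None]
examples = build_examples(date(2015, 12, 30), pct, dirs, n_past=2, n_future=2)
```

In this toy run the window crosses a year boundary, so the day-of-year features wrap from 365 back to 1.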
We apply our approach to forecasting onion price movement directions in a dataset of 14 markets around Cuttack, Odisha. We compare seven different classifiers for making predictions: a baseline classifier that always predicts that the price will stay the same (this method is denoted as “Stay” in Figure 5), support vector machines (SVM), logistic regression (“LogReg”), random forests (“RForest”), AdaBoost (“ABoost”) with decision trees as base predictors, gradient tree boosting (“GBoost”), and long short-term memory (LSTM) RNNs. We fix the number of days we forecast ahead to . We report two different accuracy measures: (1) a raw accuracy, the fraction of price movement directions that are correctly predicted, and (2) the average of three accuracy fractions corresponding to correctly predicting price movement directions going up, going down, and staying the same (we call this the “balanced” accuracy). The balanced accuracy measure helps deal with class imbalance between the three outcomes, especially as over 60% of the time the price stays the same. Note that asking for high raw accuracy differs from asking for high balanced accuracy. We introduce a parameter for the user to choose that specifies how important balanced accuracy is when training the classifier ( means we only care about raw accuracy, and means we only care about balanced accuracy). We train on data from 2012 through 2015 and forecast price movement directions in 2016. We tune classifier parameters (but not , which the user specifies) during training by treating 2015 pricing information as validation data.
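The note does not spell out how the user-chosen parameter enters training. One plausible reading, sketched here purely as an assumption, is to interpolate per-class weights between uniform weights (which target raw accuracy) and inverse-frequency weights (which target balanced accuracy). The name `alpha` and both functions are hypothetical.

```python
from collections import Counter


def class_weights(labels, alpha):
    """alpha = 0 gives every class weight 1 (raw accuracy);
    alpha = 1 gives inverse-frequency weights (balanced accuracy)."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {c: (1.0 - alpha) + alpha * n / (k * m) for c, m in counts.items()}


def weighted_accuracy(y_true, y_pred, weights):
    """Accuracy in which each example counts with its true class's weight."""
    total = sum(weights[t] for t in y_true)
    hit = sum(weights[t] for t, p in zip(y_true, y_pred) if t == p)
    return hit / total


y_true = ["stay"] * 6 + ["up"] * 2 + ["down"] * 2
always_stay = ["stay"] * len(y_true)
for alpha in (0.0, 1.0):
    w = class_weights(y_true, alpha)
    print(alpha, weighted_accuracy(y_true, always_stay, w))
```

With these weights, the weighted accuracy equals the raw accuracy at `alpha = 0` and the balanced accuracy at `alpha = 1`, so intermediate values trade one off against the other. Weights of this kind could be passed as per-class or per-sample weights to classifiers that accept them.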
Raw and balanced forecasting accuracies on the 2016 test data for varying are shown in Figure 5; note that for each method, we average over all days and all 14 markets in computing the two accuracy measures. For all methods, we forecast based on pricing information for the most recent days ( is chosen during training using validation data). We see that the trivial Stay classifier has the best raw accuracy but the worst balanced accuracy. The random forest classifier effectively learns to almost always predict “stay” and thus performs similarly to the Stay classifier. Meanwhile, gradient tree boosting and AdaBoost achieve the best tradeoff between raw and balanced accuracies. For example, gradient tree boosting with correctly predicts 20.4% of up movements, 15.7% of down movements, and 73.6% of stay movements to achieve a raw accuracy of 63.7% and balanced accuracy of 36.6%. We remark that since gradient tree boosting, AdaBoost with decision tree base predictors, and random forests are adaptive nearest neighbor methods, they can provide evidence for their forecasts in the form of past training data most similar to test data. This evidence may be helpful to farmers.
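The two accuracy measures can be written down directly from their definitions. The sketch below is illustrative (the function names are not from the note) and also checks that the per-class figures quoted above for gradient tree boosting average to the reported balanced accuracy.

```python
def raw_accuracy(y_true, y_pred):
    """Fraction of price movement directions predicted correctly."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def balanced_accuracy(y_true, y_pred, classes=("up", "down", "stay")):
    """Average of the per-class fractions predicted correctly,
    taken over the classes that actually occur in y_true."""
    per_class = []
    for c in classes:
        idx = [i for i, t in enumerate(y_true) if t == c]
        if idx:
            per_class.append(sum(y_pred[i] == c for i in idx) / len(idx))
    return sum(per_class) / len(per_class)


# The trivial "Stay" baseline on a toy label set with 60% "stay".
y_true = ["stay"] * 6 + ["up"] * 2 + ["down"] * 2
y_pred = ["stay"] * 10
print(raw_accuracy(y_true, y_pred), balanced_accuracy(y_true, y_pred))

# Consistency check on the figures reported for gradient tree boosting.
reported_balanced = (20.4 + 15.7 + 73.6) / 3
print(round(reported_balanced, 1))
```

On the toy labels, the baseline scores well on raw accuracy but poorly on balanced accuracy, which mirrors the behavior described for the Stay classifier.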
End-user financing. One of the challenges in providing new technological solutions such as the one we have developed is end-user financing, especially to cover the capital cost of hardware. The small farmers we have talked to make on average around 40,000 INR (600 USD) per year from agriculture, and are hesitant to invest or take loans to pay for the system. Many of them not only lack access to financial products and services but also have no credit history. This problem can be addressed by creating low-risk financial models and streamlining the process of financing through formal routes such as banks, micro-financing institutes, and government subsidies. To successfully implement these financial models, we create partnerships with local organizations, NGOs, and farmer cooperatives who can ensure that the farmers use the storage judiciously and stick to payment schedules, which can either be based on renting the hardware for as long as the farmers find it useful, or paying for it over a duration of 3-5 years.
In our pilot site of Dandeli, we have observed that farmers increase their vegetable production by at least 100% after they have cold storage. Considering the amount of wastage curbed and the overall increase in vegetable production, we suspect that the annual income of farmers will increase dramatically. In devising postpaid models for small farmers to cover the initial capital cost, we account for both their increase in revenue and their existing financial situation.
Future work. On the hardware side, we are working on making the cold storage unit cheaper and more energy efficient. The controller needs to minimize both the surge power consumption during the start-stop mechanism of the cooling equipment, and the energy consumption while maintaining the desired temperature and humidity. We are also looking into cheap, locally-sourced building materials such as thermal storage and additional insulation to both decrease the system cost and improve its energy efficiency.
On the software side, we suspect that forecasts could be improved by accounting for a variety of real-time parameters such as rainfall, market demand, and available supply of produce. However, having good forecasts is not enough. Figuring out how to communicate forecasts to farmers is extremely important. For example, sending SMS messages about prices at different markets could be insufficient. In a 2007 report, Jensen claimed that such an SMS strategy significantly helps fishermen identify which market to sell at. However, a recent study by Steyn refutes Jensen’s claims and gives evidence that fishermen do not take advantage of such information despite its availability. We are working with farmers to understand what forecast information they will find most useful. Moreover, we suspect that since growing and selling horticulture crops typically requires more extensive planning than fishing, horticulture farmers may be more inclined to take advantage of pricing information than fishermen.
Acknowledgments. This work was supported in part by the MIT Legatum Center, MIT Sandbox, and the CMU Berkman Faculty Development Fund.
- D. Asteriou and S. G. Hall. ARIMA models and the Box–Jenkins methodology. Applied Econometrics (2nd ed.).
- L. Breiman. Random forests. Machine Learning, 2001.
- R. G. Brown. Exponential smoothing for predicting demand, 1956.
- C. Cortes and V. Vapnik. Support-vector networks. Machine Learning, 1995.
- D. R. Cox. The regression analysis of binary sequences (with discussion). Journal of the Royal Statistical Society. Series B.
- S. M. Dev. Small farmers in India: Challenges and opportunities. Indira Gandhi Institute of Development Research.
- B. Douthwaite and E. Hoffecker. Towards a complexity-aware theory of change for participatory research programs working within agricultural innovation systems. Agricultural Systems.
- Y. Freund and R. E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences.
- J. H. Friedman. Greedy function approximation: A gradient boosting machine, 1999.
- S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Computation.
- R. Jensen. The digital provide: Information (technology), market performance, and welfare in the south Indian fisheries sector. The Quarterly Journal of Economics.
- S. N. Jha, R. K. Vishwakarma, T. Ahmad, A. Rai, and A. K. Dixit. Report on assessment of quantitative harvest and post-harvest losses of major crops and commodities in India. All India Coordinated Research Project on Post-Harvest Technology, ICAR-CIPHET.
- Z. C. Lipton, D. C. Kale, and R. Wetzel. Modeling missing data in clinical time series with RNNs. In Machine Learning for Healthcare, 2016.
- J. Steyn. A critique of the claims about mobile phones and Kerala fishermen: the importance of the context of complex social systems. The Electronic Journal of Information Systems in Developing Countries.
- Emerson Climate Technologies. The food wastage & cold storage infrastructure relationship in India: Developing realistic solutions, 2013.