Hackjob temperature logging shell scripts

After rejuvenating an old laptop with electrical tape, I found it had some intermittent temperature issues. Although I have not put out all the fires yet, I did cook up some quick scripts to help me diagnose the problem.

My first indication, other than the computer shutting down unexpectedly, was finding this error message in /var/log/syslog:


Jul 25 07:41:08 caitlerstein kernel: [84971.259286] thermal_sys: Critical temperature reached(97 C),shutting down

Not a good sign! So I threw together some scripts to gather data while I was away at work. As I was in my usual rush to leave, they are very quick and simple.

 

temptest.sh

This script simply takes the output of the sensors command and formats it for use in .csv logging:


sensors | awk -v d="$(date +%s)" '/Core 0/{c0=$3} /Core 1/{c1=$3} /temp1/{t1=$2} /temp2/{t2=$2} END{print d "," t1 "," t2 "," c0 "," c1}' | sed 's/[^0-9,.]//g'

Not the most elegant script… but it takes output that looks like this:

trdenton@caitlerstein:~$ sensors
acpitz-virtual-0
Adapter: Virtual device
temp1: +60.0°C (crit = +95.0°C)
temp2: +55.0°C (crit = +105.0°C)
coretemp-isa-0000
Adapter: ISA adapter
Core 0: +59.0°C (high = +100.0°C, crit = +100.0°C)
Core 1: +59.0°C (high = +100.0°C, crit = +100.0°C)

And comma-delimits it to this:


trdenton@caitlerstein:~$ ./temptest.sh
1375930761,60.0,55.0,59.0,59.0
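For readability, the same parsing can be sketched as a shell function that reads sensors-style text on standard input, which makes it easy to try against a saved sample. The function name format_sensors and the explicit timestamp argument are assumptions for illustration, not part of the original script.

```shell
#!/bin/sh
# Hypothetical helper: turn `sensors` output on stdin into one CSV row.
# $1 is the timestamp to record (assumed to come from `date +%s`).
format_sensors() {
    awk -v d="$1" '
        /temp1/  { t1 = $2 }   # ACPI zone readings: value is field 2
        /temp2/  { t2 = $2 }
        /Core 0/ { c0 = $3 }   # core readings: value is field 3
        /Core 1/ { c1 = $3 }
        END { print d "," t1 "," t2 "," c0 "," c1 }
    ' | sed 's/[^0-9,.]//g'    # keep only digits, commas and dots
}
```

It would be called as: sensors | format_sensors "$(date +%s)"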

Which is then useful for logging to a .csv log file, via a crontab entry:

* * * * * /home/trdenton/temptest.sh >> /home/trdenton/templog.csv
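Once the log has been collecting for a while, a quick way to see how bad things got is to pull out the single hottest reading. Here is a minimal sketch, assuming the CSV layout shown above (epoch seconds first, temperatures after); the function name peak_reading is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: print "timestamp,value" for the highest reading
# in a temptest.sh-style CSV (field 1 = epoch seconds, fields 2+ = deg C).
# $1 = CSV file
peak_reading() {
    awk -F, '
        {
            for (i = 2; i <= NF; i++)
                if ((NR == 1 && i == 2) || $i + 0 > max) {
                    max = $i + 0    # remember the hottest value so far
                    ts = $1         # and the row it came from
                }
        }
        END { if (NR > 0) print ts "," max }
    ' "$1"
}
```

For example: peak_reading /home/trdenton/templog.csv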

 

cputest.sh

Another hackjob. This script takes the output of ps and lists the processes using the most CPU; the cron entry below writes each run to a timestamped file.


ps axopid,pcpu,pmem,comm k -pcpu | awk '$2 != "0.0"{print $0}'

used in conjunction with a cronjob like so:


* * * * * /home/trdenton/cputest.sh >> /home/trdenton/cpulog/$(date +\%s)

produces timestamped files detailing which processes were using more than 0.0% CPU:


trdenton@caitlerstein:~/cpulog$ cat 1375015261
PID %CPU %MEM COMMAND
28821 19.1 2.5 oneconf-service
28798 0.2 0.3 xfce4-notifyd
28 0.1 0.0 kworker/0:1

 

Staying on top of file counts

For the cputest cronjob, I don’t want to generate _too_ many files, so I added a cronjob to remove any that are more than a day old:


* * * * * /usr/bin/find /home/trdenton/cpulog -ctime +1 -exec rm '{}' \;
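One caveat worth knowing: find counts file ages in whole 24-hour periods and drops the fraction, so -ctime +1 only matches files that are at least two days old. A minimal sketch of a minute-based alternative follows, assuming a find that supports -mmin (GNU and BSD find do; it is not in POSIX). The function name prune_older_than is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: delete regular files in a directory whose contents
# were last modified more than N minutes ago.
# $1 = directory, $2 = age in minutes (1440 = one day)
# Note: this checks modification time (-mmin), not the status-change
# time (-ctime) used in the cron entry above.
prune_older_than() {
    find "$1" -type f -mmin +"$2" -exec rm -f {} +
}
```

For example: prune_older_than /home/trdenton/cpulog 1440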

 

Checking disk I/O statistics

For this, I just used a straight cron entry:


* * * * * /usr/bin/iostat -d | /bin/grep sda | /bin/sed 's/\s\+/,/g' | /bin/sed "s/^/`date +\%s`,/" >> /home/trdenton/iolog.csv

which produces output like this:

trdenton@caitlerstein:~$ head iolog.csv
Date,Device,tps,kB_read/s,kB_wrtn/s,kB_read,kB_wrtn
1374761702,sda,3.77,146.58,30.27,506891,104684
1374761761,sda,3.72,144.14,29.88,506963,105084
1374761821,sda,3.81,141.85,43.35,507439,155064
1374761881,sda,3.76,139.54,42.69,507567,155296
1374761941,sda,3.70,137.28,42.05,507567,155472
1374762001,sda,3.65,135.08,41.47,507571,155836
1374762061,sda,3.60,132.96,40.88,507571,156060
1374762121,sda,3.56,130.90,40.33,507571,156384
1374762181,sda,3.51,128.90,39.77,507571,156600
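The grep and sed chain in that cron entry can also be sketched as a single awk step, which matches the device name exactly and avoids the backtick and \% escaping inside the crontab. This is an illustrative alternative rather than what was actually run; the function name format_iostat and its arguments are assumptions.

```shell
#!/bin/sh
# Hypothetical helper: read `iostat -d`-style text on stdin and emit one CSV
# row per line for the given device, prefixed with a timestamp.
# $1 = device name (e.g. sda), $2 = timestamp (e.g. from `date +%s`)
format_iostat() {
    awk -v dev="$1" -v ts="$2" '
        $1 == dev {
            row = ts
            for (i = 1; i <= NF; i++) row = row "," $i   # join fields with commas
            print row
        }
    '
}
```

Saved in a script, it would be called as: iostat -d | format_iostat sda "$(date +%s)"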

 

Putting it all together

With all these cronjobs in place, I diligently waited… and waited… and waited for the computer to shut down. After 3 days, I ran out of patience and started poring over the logs. What I found is that periods of high temperature seemed to correlate most with high CPU load.

My fix for now was to install cpufreqd and set it to a low, powersaving mode, since this computer is really only used for very light duty anyway. But realistically, there is likely something awry with a video driver somewhere. Or perhaps there is enough heat being generated by the back of the LCD screen to affect CPU cooling (recall from my earlier post that I flipped the screen around to the rear of the laptop to change its form factor). In any case, a series of hacks has, at least temporarily, saved the day.
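One way to sketch that comparison in shell (not necessarily how the logs were actually inspected) is to take every hot row from the temperature log and print the busiest process that cputest.sh recorded during the same minute. The function name hot_minutes and the threshold argument are assumptions for illustration.

```shell
#!/bin/sh
# Hypothetical helper: for each templog row at or above a threshold, print the
# busiest process recorded by cputest.sh during the same minute.
# $1 = threshold (deg C), $2 = templog CSV, $3 = cpulog directory
# Assumes cpulog file names are epoch seconds and that line 2 of each file is
# the top CPU consumer (the ps output is sorted by descending %CPU).
hot_minutes() {
    awk -F, -v limit="$1" '
        { for (i = 2; i <= NF; i++) if ($i + 0 >= limit + 0) { print $1; break } }
    ' "$2" |
    while read -r ts; do
        minute=$((ts / 60))
        for f in "$3"/*; do
            [ -f "$f" ] || continue
            name=${f##*/}
            # skip anything whose name is not purely numeric
            case $name in *[!0-9]*) continue ;; esac
            if [ $((name / 60)) -eq "$minute" ]; then
                printf '%s %s\n' "$ts" "$(sed -n 2p "$f")"
            fi
        done
    done
}
```

For example: hot_minutes 90 /home/trdenton/templog.csv /home/trdenton/cpulog. The loop rescans the directory for every hot row, which is slow but good enough for a few days of logs.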