At a friend's house, the Victron Cerbo GX stopped working.
They have a classic renewable energy setup consisting of a
Fronius solar inverter, a Victron Cerbo GX, 3x Victron MultiPlus-II and a 48V LiFePO4 battery.
Suddenly the Cerbo GX stopped working and the batteries were no longer being charged.
A simple overview of the setup:
The Cerbo GX is only a controller. It has a lot of interfaces, including CAN and a couple of
digital I/Os, so it is quite safe to work on: no high currents or high voltages are involved.
The Cerbo GX still showed a blinking LED and answered pings, but nothing else:
no web interface, no MQTT.
Later we found out the Cerbo GX sent a dying message to the cloud about a broken data partition.
So the first step to investigate it further is to find a UART on the board.
The Cerbo GX contains 2 boards:
- an interface PCB (ioboard), which holds all the black Ethernet-style (RJ45) sockets carrying the CAN bus and other bus systems
- the mainboard, containing an ARM SoC and all the top-level connectors including HDMI and USB
Opening the Victron Cerbo GX
Opening up the Cerbo GX is quite easy. There are only 4 screws to open the black back cover.
Now you will see the back of the ioboard:
The MCU of the ioboard is an ATSAMC21. The whole PCB is covered in an additional
layer of protective coating.
Now pull off the ioboard; it is only held by 2x 2-row pin headers.
You now see the mainboard.
You can now either connect to the test points DBG RX TX on the back-cover side using needles or by soldering,
or remove some of the glue holding the mainboard in the cover (already removed in the picture).
The glue isn't a huge issue, because the whole stack is also kept together by the outside screws.
If the Cerbo GX is used in a car, it might be a good idea to put glue back between the mainboard and the plastic cover.
After removing the glue on the sides of the PCB, which additionally holds it
in place in the plastic cover, you can take the mainboard out.
The only parts left in the plastic case are the antennas.
Connecting to the debug connectors
On the mainboard there are also test points for Rx and Tx. They are also connected to a 6-pin header.
Gnd, ?, ? , Tx, Rx, ?
Connector JP201
It carries a 3.3V TTL UART running at 115200 baud.
It shows the bootloader and the Linux boot, and in the end you get a direct root shell, which is great
for such an open system.
There is no need to hack your way into the system if you have such deep hardware access.
The Cerbo GX still boots fine into Linux, but the eMMC is no longer writable.
The Cerbo GX is based on the Allwinner A20, and the board may be similar to the Cubieboard 2.
While watching the boot log, I noticed the kernel has problems with the data partition.
[ 3.210477] EXT4-fs (mmcblk1p3): orphan cleanup on readonly fs
[ 3.216833] EXT4-fs (mmcblk1p3): mounted filesystem 9df04e3d-d617-4e06-925c-49c0099a6a4d ro with ordered data mode. Quota mode: disabled.
[ 5.240233] I/O error, dev mmcblk1, sector 5302492 op 0x1:(WRITE) flags 0x800 phys_seg 1 prio class 0
[ 5.249533] Buffer I/O error on dev mmcblk1p5, logical block 4871, lost async page write
[ 5.371919] I/O error, dev mmcblk1, sector 5264244 op 0x1:(WRITE) flags 0x800 phys_seg 1 prio class 0
[ 5.381186] Buffer I/O error on dev mmcblk1p5, logical block 184, lost async page write
[ 5.521028] I/O error, dev mmcblk1, sector 5264121 op 0x1:(WRITE) flags 0x800 phys_seg 1 prio class 0
[ 5.530280] Buffer I/O error on dev mmcblk1p5, logical block 167, lost async page write
[ 5.642565] I/O error, dev mmcblk1, sector 5264598 op 0x1:(WRITE) flags 0x800 phys_seg 1 prio class 0
[ 5.651820] Buffer I/O error on dev mmcblk1p5, logical block 156, lost async page write
[ 5.791627] I/O error, dev mmcblk1, sector 5264446 op 0x1:(WRITE) flags 0x800 phys_seg 1 prio class 0
[ 5.800879] Buffer I/O error on dev mmcblk1p5, logical block 152, lost async page write
[ 5.923283] I/O error, dev mmcblk1, sector 5264644 op 0x1:(WRITE) flags 0x800 phys_seg 1 prio class 0
[ 5.932536] Buffer I/O error on dev mmcblk1p5, logical block 148, lost async page write
[ 5.946909] EXT4-fs (mmcblk1p5): error loading journal
[ 17.724539] EXT4-fs (mmcblk1p3): re-mounted 9df04e3d-d617-4e06-925c-49c0099a6a4d ro. Quota mode: disabled.
This already looks bad, because these failures are unrecoverable. The good thing is that they seem to be write errors only;
reading is still possible.
Try to repair the Victron Cerbo GX
First I wanted to create a backup of the currently running system. Sadly ssh doesn't work either, so I was lazy
and just used netcat for it.
The Cerbo GX has a micro sdcard slot, but nothing was in there. So the current boot medium must be an embedded sdcard,
or in this case an eMMC.
An eMMC is basically the same, except that it also has special boot0 and boot1 storage areas which are not partitions.
# on the Cerbo GX (start the listeners on the host first)
dd if=/dev/mmcblk1 | nc 169.254.23.42 9999
dd if=/dev/mmcblk1boot0 | nc 169.254.23.42 9998
dd if=/dev/mmcblk1boot1 | nc 169.254.23.42 9997
# and on the host side (169.254.23.42)
nc -l -p 9999 > mmcblk1.raw
nc -l -p 9998 > mmcblk1boot0.raw
nc -l -p 9997 > mmcblk1boot1.raw
# I'll play with the mmcblk1.raw, so better to have a backup file
cp mmcblk1.raw mmcblk1.backup.raw
Now I have a backup.
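Since the experiments that follow all work on the image, it is worth confirming that the working copy and the backup copy are really identical before touching either. A minimal sketch, assuming `sha256sum` is available on the host; the helper name `same_image` is made up for illustration:

```shell
#!/bin/sh
# Hypothetical helper: succeed only if two files have the same SHA-256 digest.
same_image() {
    a=$(sha256sum < "$1" | cut -d' ' -f1)
    b=$(sha256sum < "$2" | cut -d' ' -f1)
    # An empty digest means the file could not be read, so treat it as a mismatch.
    [ -n "$a" ] && [ "$a" = "$b" ]
}

# Usage on the host, after the copy step above:
#   same_image mmcblk1.raw mmcblk1.backup.raw && echo "backup copy matches"
```

To compare against the device itself, one could run `sha256sum /dev/mmcblk1` on the Cerbo GX and compare the printed digest with the one of `mmcblk1.raw` on the host. That only gives a stable result as long as nothing on the eMMC changes between the two reads.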
I played with the Cubieboard 2 and the A20 when it was released in 2013. The A20 is an SoC from Allwinner which,
in addition to U-Boot, also has a USB boot mode (FEL mode) that works even when U-Boot is broken.
With the backup I could look further into it. I attached it on a different machine as a loop device.
# create a loop device (--show prints the name that was picked, /dev/loop0 here)
losetup -f --show mmcblk1.raw
# create partitions in /dev/mapper/loop0p1 ...
kpartx -a /dev/loop0
Now I can mount the loop device and use fsck and other tools.
First, a look at the partition table:
> parted mmcblk1.raw print
Model: (file)
Disk mmcblk1.raw: 3909MB
Sector size (logical/physical): 512B/512B
Partition Table: msdos
Disk Flags:
Number  Start   End     Size    Type      File system  Flags
 1      1049kB  9437kB  8389kB  primary                boot, lba
 2      9437kB  1352MB  1342MB  primary   ext4
 3      1352MB  2694MB  1342MB  primary   ext4
 4      2694MB  3909MB  1215MB  extended
 5      2695MB  3909MB  1214MB  logical   ext4
Partition 1 is empty.
Partitions 2 and 3 each contain a root filesystem based on OpenEmbedded.
Partition 5 is the data partition, which can't be mounted read-write.
In addition, the A20 boots a little differently: there is still some data before the first partition,
which contains the bootloader. The A20 expects to find a bootloader at a specific offset.
Partition 2 contains an older version of Venus OS, partition 3 a newer one.
Looking at the U-Boot bootloader shows that it supports an A/B scheme: you boot from partition 2
and update partition 3, and afterwards you boot from partition 3.
So there should always be a working system installed, even if the power goes out while updating.
Back to partition 5. First of all, it still contains data and is fully recoverable thanks to its journal.
Next I looked at the data partition with tune2fs:
tune2fs -l /dev/mapper/loop0p5
Filesystem volume name: <none>
Last mounted on: /data
[ lots of information ]
Lifetime writes: 1386 GB
[...]
1386 GB of writes on a 4GB eMMC. No wonder it doesn't want to work any longer.
As a disclaimer: the system has a custom driver for the power meter. I didn't look into whether this driver was responsible or
whether the stock system already writes that much data to the partition.
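To put that number in proportion, here is a rough back-of-the-envelope calculation using the sizes from the parted output above. It mixes the units of tune2fs and parted loosely and ignores write amplification and whatever wear leveling the eMMC controller does, so treat it as an order-of-magnitude estimate, not a measurement:

```shell
#!/bin/sh
# Values taken from the tune2fs and parted output above, in MB.
lifetime_writes_mb=$((1386 * 1000))   # "Lifetime writes: 1386 GB"
emmc_size_mb=3909                     # whole eMMC
data_part_mb=1214                     # partition 5 (/data)

# Integer division: how often each area was written in full, on average.
echo "full eMMC overwrites:      $((lifetime_writes_mb / emmc_size_mb))"
echo "data partition overwrites: $((lifetime_writes_mb / data_part_mb))"
```

This prints 354 and 1141: the data written to /data alone would fill the whole eMMC roughly 350 times, or the data partition more than a thousand times.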
All the errors in dmesg are at the start of the filesystem.
Maybe only some sectors are broken and the eMMC is still functional for a while?
So I'll move the partition a bit further towards the end of the eMMC, by maybe 200 MB.
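The observation that the errors sit at the start of the data partition can be checked against the kernel log: the sector numbers there are 512-byte sectors counted from the start of the eMMC. A small sketch converting them to the decimal MB that parted prints (the function name is only for illustration):

```shell
#!/bin/sh
# Convert a 512-byte sector number to a byte offset in MB (10^6 bytes),
# truncated to an integer. Needs 64-bit shell arithmetic.
sector_to_mb() {
    echo $(( $1 * 512 / 1000000 ))
}

# Lowest and highest failing sectors from the dmesg output above.
sector_to_mb 5264121   # prints 2695
sector_to_mb 5302492   # prints 2714
```

Partition 5 starts at 2695MB according to parted, so all reported write errors fall within roughly the first 20 MB of the data partition. A 200 MB shift moves the filesystem well past that range.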
The data partition itself doesn't hold much data:
df -h
Filesystem Size Used Avail Use% Mounted on
/dev/mapper/loop0p5 812M 4.5M 749M 1% /data
So the plan is: shrink the ext4 with resize2fs, then move the extended partition (partition 4) and the logical partition (partition 5) a bit further towards the end.
I made the changes to the partition table on my laptop and moved the first 512 bytes over via netcat.
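dd if=/dev/mapper/loop0p5 of=/tmp/datapart
# resize2fs asks for a forced check before it shrinks a filesystem
e2fsck -f /tmp/datapart
resize2fs /tmp/datapart 800M
# modify the partition table: remove the logical and extended partitions, re-create them 200 MB further towards the end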
parted /dev/loop0
# remove outdated partition device nodes /dev/mapper/loop0pX
kpartx -d /dev/loop0
# re-create partition device nodes /dev/mapper/loop0pX
kpartx -a /dev/loop0
dd if=/tmp/datapart of=/dev/mapper/loop0p5
So now everything looks right. Let's do this on the live system.
# now my local image is in sync.
# now update the MBR on the Cerbo GX
nc 169.254.23.42 9995 > /tmp/mbr
# let's check that the MBR contains the real data:
# compare the checksum with the one of the file on the host.
sha256sum /tmp/mbr
# take a short peek into the MBR. It must end with 55 aa.
hexdump -C /tmp/mbr
# all good, let's copy it.
dd if=/tmp/mbr of=/dev/mmcblk1 bs=512 count=1
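The manual hexdump check can also be scripted, which is handy before overwriting sector 0. A minimal sketch using `od`, assuming it is available on the system where the check runs (hexdump could be used the same way). The function name is hypothetical and the guard is an extra precaution, not something the steps above require:

```shell
#!/bin/sh
# Succeed only if the file is exactly 512 bytes long and ends with the
# MBR boot signature 55 aa at offsets 510 and 511.
is_valid_mbr() {
    [ "$(wc -c < "$1" | tr -d ' ')" = "512" ] || return 1
    sig=$(od -An -tx1 -j510 -N2 "$1" | tr -d ' \n')
    [ "$sig" = "55aa" ]
}

# Usage on the Cerbo GX, guarding the write from the step above:
#   is_valid_mbr /tmp/mbr && dd if=/tmp/mbr of=/dev/mmcblk1 bs=512 count=1
```

The size check catches a truncated netcat transfer, which the signature check alone would not necessarily notice.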
Reboot to refresh the partition table. A little bit risky, but we still have U-Boot or maybe the FEL mode for the rescue.
The board rebooted just fine.
There is now a /dev/mmcblk1p4 but no /dev/mmcblk1p5. The reason is that partition 5 is a logical partition,
meaning its partition table entry lives inside partition 4 and not, like partitions 1-4, in the MBR at the start of the eMMC.
So I'll copy partition 4 over using dd.
# On the laptop
sudo dd if=/dev/mapper/loop0p4 | nc -l -p 9994
# On the Cerbo GX
nc 169.254.23.42 9994 | dd of=/dev/mmcblk1p4
Everything looks right, but only for the first second.
Then dd aborts because of write errors. dmesg shows write errors on the mmcblk device, triggered
by the dd.
This eMMC doesn't like heavy write operations anymore,
even though I moved everything further towards the end.
The eMMC might just be at the end of its lifetime.
But maybe it can still do some operations.
Repair attempt #2: move /data to external sdcard
Now I'm looking into an alternative setup. Can I move the data partition to the external micro sdcard?
I'll simply copy the current backup to a micro sdcard and insert it into the Cerbo GX.
To make this work, I need to change /etc/fstab on the root partition.
#> cat /etc/fstab
/dev/root / auto ro 0 0
proc /proc proc defaults 0 0
devpts /dev/pts devpts mode=0620,gid=5 0 0
tmpfs /run tmpfs mode=0755,nodev,nosuid,strictatime 0 0
tmpfs /var/volatile tmpfs defaults 0 0
/dev/mmcblk1p5 /data ext4 noatime 0 0
But the root system is read-only.
mount -o remount,rw /
vi /etc/fstab
A good idea in theory, but sadly the eMMC still doesn't like write operations much.
So I'll go back to the bootloader and read what it supports.
Repair attempt #3: move rootfs and data to external sdcard
You need to "break" into the bootloader by pressing a key over UART shortly after boot.
U-Boot SPL 2018.05 (Jan 27 2022 - 21:53:22 +0000)
DRAM: 1024 MiB
CPU: 912000000Hz, AXI/AHB/APB: 3/2/2
Trying to boot from MMC1
U-Boot 2018.05 (Jan 27 2022 - 21:53:22 +0000) Allwinner Technology
CPU: Allwinner A20 (SUN7I)
Model: Cubietech Cubieboard2
I2C: ready
DRAM: 1 GiB
MMC: SUNXI SD/MMC: 0, SUNXI SD/MMC: 1
Loading Environment from MMC... OK
In: serial@01c28000
Out: serial@01c28000
Err: serial@01c28000
Net: No ethernet found.
Setting bus to 1
** Unrecognized filesystem type **
Hit any key to stop autoboot: 2
Now you need to press a key to get into the U-Boot shell.
help shows you all available commands, and the second most important command to run
is printenv, which shows all variables and short scripts.
I redacted the board_id, MAC address and serial.
printenv
baudrate=115200
board_12345678=setenv fdtfile sun7i-a20-einstein-ccgx2.dtb
board_id=12345678
boot_mmc0=setenv bootpart 0:1; load mmc ${bootpart} ${scriptaddr} boot.scr && source ${scriptaddr}; run boot_mmc1
boot_mmc1=setenv bootdir /boot; run setroot loadfdt loadimage mmcargs testmode && bootz ${kernel_addr_r} - ${fdt_addr_r}
boot_mmc_auto=run boot_mmc${mmc_bootdev}
bootcmd=run setfdt boot_mmc_auto
bootdelay=2
bootfile=zImage
bootm_size=0xa000000
console=ttyS0,115200
eeprom_addr=50
ethaddr=00:02:03:04:05:06
fdt_addr_r=0x43000000
fdtfile=sun7i-a20-cubieboard2.dtb
hw_rev=4
kernel_addr_r=0x42000000
loadfdt=load mmc ${bootpart} ${fdt_addr_r} ${bootdir}/${fdtfile}; fdt addr $fdt_addr_r; run setmodel
loadimage=load mmc ${bootpart} ${kernel_addr_r} ${bootdir}/${bootfile}
loadramdisk=load mmc ${bootpart} ${ramdisk_addr_r} ${bootdir}/${ramdiskfile}
mmc_bootdev=0
mmcargs=setenv bootargs console=${console} root=${mmcroot} rootwait ro rootfstype=ext4 ${runlevel}
preboot=i2c dev 1; i2c read ${eeprom_addr} 0 4 ${kernel_addr_r}; if test ${mmc_bootdev} = 0; then load mmc 0:1 ${kernel_addr_r} board_id; fi; setexpr.l board_id *${kernel_addr_r}
product_id=c00a
pxefile_addr_r=0x43200000
ramdisk_addr_r=0x43300000
ramdiskfile=initramfs
scriptaddr=0x43100000
serial#=1650000000000000
setfdt=run board_${board_id} || echo unknown board_id ${board_id}
setmodel=if env exists model; then fdt set / model "$model"; else true; fi
setroot=if test "${version}" = 1; then setenv bootpart 1:2; setenv mmcroot /dev/mmcblk1p2; else setenv bootpart 1:3; setenv mmcroot /dev/mmcblk1p3; fi
testmode=if test "${runlevel}" = 4; then fdt rm i2c1/eeprom@50 read-only; else true; fi
So it uses the version variable to select partition 2 or 3. It also has some support for booting from the external sdcard.
How to read this?
U-Boot will run the scripts in preboot and bootcmd.
For U-Boot, both the eMMC and the external sdcard are just MMC devices.
# u-boot
=> mmc list
SUNXI SD/MMC: 0 (SD)
SUNXI SD/MMC: 1 (eMMC)
bootcmd=run setfdt boot_mmc_auto means: run the scripts in the variables setfdt and boot_mmc_auto.
If mmc_bootdev=0 it will try to use the external sdcard, if mmc_bootdev=1 the internal eMMC.
Also, if an external sdcard is present, U-Boot will set mmc_bootdev to 0 (external sdcard), which is good.
But the boot script doesn't really boot from the external sdcard.
boot_mmc_auto will run boot_mmc0 if mmc_bootdev=0.
boot_mmc0 loads a U-Boot shell script from the sdcard, if present, and executes it.
Afterwards it will run boot_mmc1,
which in turn calls the scripts setroot, loadfdt, loadimage, mmcargs and testmode.
setroot is the interesting one: it sets the bootpart and mmcroot variables, which contain the device and partition.
All the following scripts depend on bootpart and mmcroot, so modifying setroot is enough.
We can modify setroot from the script which is loaded in boot_mmc0.
setenv bootpart 0:1; load mmc ${bootpart} ${scriptaddr} boot.scr && source ${scriptaddr}:
This means: load the file boot.scr from partition 1 of the external sdcard into memory at ${scriptaddr} and execute it.
U-Boot has a tool, mkimage, which can create such a file. It can be installed from the package repository (Arch: uboot-tools, Debian: u-boot-tools).
#> cat boot.scr.src
setenv setroot 'if test "${version}" = 1; then setenv bootpart ${mmc_bootdev}:2; setenv mmcroot /dev/mmcblk${mmc_bootdev}p2; else setenv bootpart ${mmc_bootdev}:3; setenv mmcroot /dev/mmcblk${mmc_bootdev}p3; fi'
#> mkimage -A arm -O linux -T script -C none -a 0 -e 0 -d boot.scr.src boot.scr
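To see what the modified setroot resolves to for each combination of mmc_bootdev and version, the same logic can be mirrored as a plain shell function. This is only an emulation for illustration (on the device, U-Boot's own shell evaluates the real script), and it assumes Linux numbers the MMC devices the same way U-Boot does, as the rest of this post implies:

```shell
#!/bin/sh
# Shell emulation of the modified setroot, for illustration only.
# $1 = mmc_bootdev (0 = sdcard, 1 = eMMC), $2 = version (may be empty)
resolve_root() {
    if [ "$2" = 1 ]; then part=2; else part=3; fi
    echo "bootpart=${1}:${part} mmcroot=/dev/mmcblk${1}p${part}"
}

resolve_root 0 ""   # prints: bootpart=0:3 mmcroot=/dev/mmcblk0p3
resolve_root 1 1    # prints: bootpart=1:2 mmcroot=/dev/mmcblk1p2
```

With an sdcard inserted (mmc_bootdev=0) and version unset, kernel, device tree and root filesystem all come from partition 3 of the sdcard instead of the eMMC.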
The first partition is currently empty. I'll use FAT, but ext4 should be fine too.
# Using a sdcard reader
mkfs.vfat /dev/sdX1
mount /dev/sdX1 /mnt
cp boot.scr /mnt
umount /mnt
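Before putting the card into the Cerbo GX, it is cheap to confirm that boot.scr really is a wrapped U-Boot image and not the plain-text source, since the source command expects the image format. `mkimage -l boot.scr` prints the header. The sketch below only checks the 4-byte legacy image magic 27 05 19 56, so it also works where mkimage is not installed (the function name is illustrative):

```shell
#!/bin/sh
# Succeed if the file starts with the U-Boot legacy image magic 0x27051956.
is_uboot_legacy_image() {
    magic=$(od -An -tx1 -N4 "$1" | tr -d ' \n')
    [ "$magic" = "27051956" ]
}

# Usage:
#   is_uboot_legacy_image boot.scr && echo "boot.scr looks like a U-Boot image"
```

This does not validate the header checksums, so it only catches the common mistake of copying boot.scr.src instead of boot.scr.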
Now it should boot from the external sdcard. But the system still tries to mount the hardcoded /data from the eMMC.
Since the root filesystem is now on the sdcard, you can modify it,
either using the laptop and a sdcard reader
mount /dev/sdX3 /mnt/
vim /mnt/etc/fstab
# replace /dev/mmcblk1p5 with /dev/mmcblk0p5
umount /mnt
Or from within the Cerbo GX using
mount -o remount,rw /
sed -i 's/mmcblk1p5/mmcblk0p5/g' /etc/fstab
mount -o remount,ro /
# reboot
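To preview the fstab change before editing the file in place, the substitution can be run without -i so that the result only goes to stdout. A small sketch; the function name is illustrative, and the pattern is narrowed to a line starting with the device name as an extra precaution:

```shell
#!/bin/sh
# Print fstab with the /data mount moved from the eMMC (mmcblk1) to the
# sdcard (mmcblk0). The input file is not modified.
preview_fstab() {
    sed 's|^/dev/mmcblk1p5\([[:space:]]\)|/dev/mmcblk0p5\1|' "$1"
}

# Usage:
#   preview_fstab /etc/fstab | grep /data
```

If the printed /data line points at /dev/mmcblk0p5 and every other line is unchanged, the in-place sed from the step above can be applied.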
It's working
Now everything should be working again.
But firmware updates won't work, because the system still tries to use mmcblk1 instead of mmcblk0.
You could try to fix this (on the rootfs), but firmware updates are a topic for another day.
I don't know how long the micro sdcard will survive such heavy writes.
Usually micro sdcards don't sustain as many write cycles as eMMC.
The eMMC might still be used as a read-only boot medium to load U-Boot and the U-Boot environment (this depends on the
boot priority of the ROM loader).
However, the sdcard should already be ready as a replacement, because the bootloader was copied along by dd.
The bootloader sits at a specific offset on the sdcard, within the first megabyte.